An efficient mechanism for processing similarity search queries in sensor networks


Information Sciences 181 (2011) 284–307 Contents lists available at ScienceDirect Information Sciences journal homepage: www.elsevier.com/locate/ins...



Yu-Chi Chung a, I-Fang Su b, Chiang Lee c,⇑

a Department of Computer Science and Information Engineering, Chang Jung Christian University, Taiwan, ROC
b Department of Information Management, Fortune Institute of Technology, Taiwan, ROC
c Department of Computer Science and Information Engineering, National Cheng-Kung University, Taiwan, ROC

Article info

Article history: Received 8 January 2008; received in revised form 21 April 2009; accepted 23 August 2010

Keywords: Sensor networks; Query processing; Similarity search; Hilbert curve; Data-centric storage systems

Abstract

The similarity search problem has received considerable attention in the database research community. In sensor network applications, this problem is even more important because of the imprecision of sensor hardware and the variation of environmental parameters. Traditional similarity search mechanisms are both improper and inefficient for these highly energy-constrained sensors. One difficulty is that it is hard to predict which sensor holds the most similar (or closest) data item, so many or even all sensors must send their data to the query node for comparison. In this paper, we propose a similarity search algorithm (SSA), a novel framework based on the concept of the Hilbert curve over a data-centric storage structure, for efficiently processing similarity search queries in sensor networks. SSA avoids the need to collect data from all sensors in the network when searching for the most similar data item. The performance study reveals that this mechanism is highly efficient and significantly outperforms previous approaches in processing similarity search queries. © 2010 Elsevier Inc. All rights reserved.

⇑ Corresponding author. Tel.: +886 6 2757575x62528; fax: +886 6 2747076. E-mail addresses: [email protected] (Y.-C. Chung), [email protected] (I.-F. Su), [email protected] (C. Lee).
0020-0255/$ - see front matter © 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.ins.2010.08.031

1. Introduction

Past research on query processing in sensor networks has mainly focused on retrieving exact answers from the network. However, the data detected by sensors may be imprecise, either because of the lag in database updates [4,13,17,25,28] or because of noisy readings [2,5,8,30]. In the former case, the sheer volume of readings and the limited energy and wireless bandwidth may not allow continuous and instantaneous updates. Therefore, the database state may not reflect the true state of the real world. It is often infeasible for the database to contain the exact status of a monitored entity at every moment in time. Typically, the data of an entity is known with certainty only at the time of the update. The latter case, however, is due to inaccuracies of measurement. The sources of inaccuracy include, but are not limited to: (a) noise from external sources, (b) inaccuracies in the measurement technique, and (c) imprecision in computing a derived value from the underlying measurements.

The reasons for imprecision are not all hardware-related. An application itself may also require similar (or close) data in addition to the exact answer. For example, sensors are used for detecting mudflows and landslides in mountain villages that are in danger. Researchers need to gather the observed data from the sensors and warn the villagers before heavy rains or typhoons to be alert for possible disasters. Assume that the geographic condition for triggering mudflows and landslides, such as soil water content (swc), is equal to x. A query of "swc being equal to x" is usually meaningless in monitoring mudflow and landslide environments, because once the swc is close to x the hazard can happen at any moment. Hence, what researchers actually need is to find the locations where the monitored swc is close to x. This allows them to warn and evacuate villagers to avoid deaths and injuries.

Another application of similarity search in a sensor network is an outdoor biologist who analyzes bird habitats by collecting bird calls with acoustic sensors. The biologist needs to recognize bird species from the collected data. As a bird may produce sound over a wide frequency band and the bands of different species may overlap, it is very hard to retrieve the birds of one species without also accessing other species with similar sounds. Hence, similarity search is an almost unavoidable type of query in these applications. Under such circumstances, exact retrieval is no longer a mandatory requirement. Similar or nearest data is equally important in answering a user query.

Traditional similarity search algorithms [6,7,9,12,15,18,29] are either centralized or assume that each node of the system is very powerful (as in the World Wide Web or a peer-to-peer environment), which makes them unsuitable for sensor networks given the limited bandwidth and power of sensors. Current query processing algorithms for sensor networks, however, are inefficient in processing similarity search queries. These algorithms mainly focus on two types of queries, point queries and range queries [10,21,26]. A point query finds results from sensors that own a value exactly matching the given value of the query. A range query retrieves results from sensors whose values fall in the given range of the query. When a point query is executed in the sensor network, the sensors return only those data that exactly match the given query.
Using a point query processing technique to process a similarity search requires the user to issue multiple point queries with similar conditions so as to retrieve similar data. However, processing multiple queries in this way rapidly drains the sensors' energy. Using a range query processing technique to process a similarity search, on the other hand, faces two major problems. First, redundant results might be transmitted to the query node. For example, relative humidity is an important factor influencing the growth of orchids. An orchid grower is looking for a place where the relative humidity is closest to 75% if no place has a humidity of exactly 75%. The grower issues a range query such as finding the humidity between 65% and 85%. If five sensors detect humidity values of 65%, 67%, 74%, 80%, and 85%, all falling in the range of the given query, these five sensors will all reply to the query node. However, only the humidity of 74% is the closest result to the given query, and it is the only one that should be sent back. With this range query processing method, four extra tuples are transmitted to the query node, which wastes the sensors' energy. The second problem is that it might not be easy for a user to specify an appropriate range in a similarity search query. If the given range is too narrow, there may not be any result satisfying the condition. If the given range is too wide, too many qualifying results may be returned, which again wastes sensor energy. Therefore, the existing point query and range query processing techniques are improper for processing similarity search queries in sensor networks.

A major challenge in processing a similarity search query in a sensor network is that each sensor is only a mini-repository of an entire distributed sensor database. Each sensor has knowledge only of its local data, and has no global knowledge of the entire sensor database.
Hence, while processing a similarity search query, a sensor does not know whether its local similar data is the globally most similar data, and has to transmit its local similar data somewhere (e.g., to the query node) for further verification. This causes a serious waste of sensor energy on data transmission and data forwarding.

In this paper, we propose a similarity search algorithm (SSA) to overcome the above problems in processing similarity search queries. We choose a group of sensors, named the indexing nodes, to store data based on the data-centric storage (DCS) concept [26]. DCS uses in-network placement of data to increase the efficiency of data retrieval in certain circumstances. The placement of a detected data item is determined by the event type of the data. The event type refers to certain pre-defined constellations of event values such as temperature and pressure. A detected data item with a particular event is stored at an indexing node. The indexing node is determined by looking up a geographic hash table (GHT) [26] using the event of the data. These indexing nodes are chosen from the entire sensor network so that they form a Hilbert curve [11,31] in the network. Adjacent indexing nodes along the Hilbert curve hold data of similar values. Hence, searching for similar data in this arrangement becomes very easy. In this paper, we discuss how this scheme is realized in a sensor network environment and how deep (i.e., how many levels) the Hilbert curve should be. Our performance study indicates that the proposed method provides a significantly lower query processing cost than a previous method when processing a similarity search query. Another elegant feature is that the method is scalable with respect to the number of queries and the number of detected data items.

The main contributions of this paper are as follows.

1. This work is the first to provide an algorithm for searching similar data in wireless sensor network environments.
2. The data mapping is based on the concept of the Hilbert curve, which is simple and easy to implement. The indexing node to which a detected data item should be mapped can be determined in a distributed manner by each sensor, which avoids centralized dispatching of data to indexing nodes.
3. The whole processing is in-network. Only a few indexing nodes are involved in processing a similarity search query, which avoids the transmission of local similar data from all sensors and therefore dramatically simplifies the task and reduces the energy consumption of sensors.

As a preliminary study of the problem, this paper mainly focuses on processing similarity search queries for one-dimensional data, which means that a query is specified for only one type of event. We leave the multi-dimensional case as future work.

The rest of this paper is organized as follows. Section 2 surveys and discusses a representative previous work on data-centric storage that could be applied to processing similarity search queries. Section 3 presents the proposed algorithm for similarity search, and Section 4 gives two extensions of it. Section 5 presents the simulation results. Finally, Section 6 gives our conclusions and future work.

2. Related work

To the best of our knowledge, data-centric storage in sensornets with a geographic hash table (GHT) [26] may be the most representative approach among past DCS-based research that is applicable to processing similarity search in a distributed sensor network. However, such schemes face some obstacles in processing a similarity search query, which we illustrate in the following.

Essentially, GHT hashes the type of an event into geographic coordinates and stores the detected data of this event type at the sensor node geographically nearest to the hashed coordinates. A sensor responsible for storing the mapped data is called a storage node, which plays the same role as an indexing node in our design. When a query is issued to a sensor, the sensor also hashes the event type of the query to geographic coordinates and forwards the query to the storage node nearest to those coordinates. The data of this storage node that match the query are retrieved and sent to the query node. For instance, suppose the sensors are responsible for detecting two events, temperature and humidity. Then two sensors are used for storing the observed temperature and humidity, respectively, in GHT.

As one type of event is mapped to one storage node, the workload of this node can be very heavy, since all queries asking for this type of event are processed at this storage node. This easily makes the storage node a hot spot and depletes its energy much sooner than that of the other nodes. To alleviate the problem, the GHT team proposed structured replication GHT (SR-GHT), which dispatches detected data to multiple storage nodes. The replicated storage nodes are called mirror nodes. A node that detects a data item stores it at either the hashed node or a mirror node, depending on which one is closer to the location where the data is detected. For example, Fig. 1 shows a hierarchy of structured replication up to level = 2.
The black dot, called the root node, is the original storage node of GHT. The gray and white dots represent the mirror nodes of level = 1 and level = 2, respectively. The number of replication levels depends on the data generation frequency: it increases when data is generated frequently. If an event is detected at (100, 100) in Fig. 1, for example, it will be stored at the mirror node in the upper-right cell, as this is the closest mirror node. Thus, SR-GHT reduces the cost of transmitting the detected data and improves the availability of sensors, as it spreads the workload over multiple nodes.

Fig. 1. Example of structured replication in a network spanning (0,0) to (100,100).
root node: (3,3)
level 1 mirror nodes: (53,3) (3,53) (53,53)
level 2 mirror nodes: (28,3) (3,28) (28,28) (78,3) (53,28) (78,28) (3,78) (28,53) (28,78) (78,53) (53,78) (78,78)

This storage mechanism is, however, very inefficient in processing similarity search queries. To process a similarity search query, SR-GHT has to forward the query to all mirror nodes, as each of them holds some portion of the entire data and it is not known which one has the most similar data. All these mirror nodes also have to participate in processing the query and send their results back to the query node, which then finds the most similar data. As a result, both the processing cost and the communication cost are very high. A new method is required for processing similarity search queries efficiently.

Xia et al. proposed an algorithm (named SAQP) [32] that exploits the similarities among different queries issued to DCS sensor networks. Their scheme enables the sharing of query processing among similar queries. The results of previous queries are reused if they are suitable for answering a new query. Energy is therefore saved when such similar queries are processed. However, the main goal of our work is to find data similar to a given query (rather than finding the same data that have been obtained by a previous query, as in their work). Hence, these two problems are fundamentally different.

The work presented in this paper is an enhanced version of our earlier work [27]. In [27], we studied only the point query issue, with a limited performance study of the design. In this paper, we add new material including the processing of range queries and multiple queries, an in-depth analysis of these issues, and a comprehensive performance study on all these types of queries, making this paper a much more complete work on similarity search in sensor networks.

3. Design of a data-centric storage system supporting similarity search

In this section, we first illustrate how to map a Hilbert curve to a sensor network in Section 3.1 and analyze the complexity of building a Hilbert curve in Section 3.2. In Section 3.3, we explain how an indexing node is selected and then propose a data insertion mechanism for storing data in an indexing node. Section 3.4 proposes a search mechanism that efficiently finds the answer to a given query. In Section 3.5, we analyze the complexity of the proposed mechanism. Section 3.6 designs the workload sharing mechanism, which allows the workload to be shared among sensors when a sensor fails because it runs out of power or has a hardware defect. We propose two different workload sharing mechanisms for different node failure scenarios.

3.1. Mapping Hilbert curve to sensor network

A space-filling curve is a thread that goes through all the points in a space while visiting each point exactly once, thereby imposing a linear order on the points of the multi-dimensional space.
The Hilbert curve manifests superior data clustering properties compared with other space-filling curves [3,11,14,16,23]. Thus, we adopt the Hilbert curve to design the structure of indexing nodes in a sensor network. The Hilbert curve is mathematically defined by a mapping of the unit interval [0, 1] in one dimension to a bounded region of a higher-dimensional space, which is called a Hilbert space. Given the number of levels $\ell$ and the boundary of the sensor network (i.e., its height and width), the network is divided recursively into $4^{\ell}$ square quadrants. The choice of $\ell$ depends on the data detection rate: the more data is detected, the higher $\ell$ must be. (The determination of $\ell$ is discussed shortly.) Let $P$ be the center of a quadrant. The Hilbert curve passes through the center $P$ of every quadrant [11,31]. For example, in Fig. 2(a) the sensor network is divided into four quadrants with $\ell = 1$, and $P_i$ is the center point of quadrant $i$. The four points $P_0$, $P_1$, $P_2$, $P_3$ of the four quadrants are strung together in the linear order $P_0$, $P_1$, $P_2$, $P_3$. When $\ell$ increases to 2, the network is divided into 16 quadrants as shown in Fig. 2(b), and the center points of the quadrants are linked in the same manner as in the level-1 Hilbert space.

We assume that sensors are randomly deployed in the network. All the sensors are homogeneous; each has a unique sensor ID (SID) and is aware of the entire network boundary as well as its own geographic location. In the sensor network, the region corresponding to a quadrant of the Hilbert space is called a cell. Hence, the number of cells is also equal to $4^{\ell}$. We choose the sensor that is closest to the center of a cell as the indexing node (corresponding to $P$ in the Hilbert space), which is responsible for storing detected data. The number of indexing nodes is equal to the number of cells, i.e., $4^{\ell}$. The detailed mechanism of selecting indexing nodes is given in Section 3.3.
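The cell construction can be illustrated with a minimal Python sketch that computes which cell a position falls in, the center of a cell, and the sensor closest to that center. The function names (`cell_of`, `cell_center`, `closest_sensor`), the dictionary of sensor positions, the network origin at (0, 0), and the tie-breaking by smaller SID are illustrative assumptions and are not taken from the paper.

```python
import math

def cell_of(x, y, width, height, level):
    """Return (col, row) of the cell containing (x, y).

    The network is assumed to span (0, 0) to (width, height) and is cut
    into 2**level cells per side, i.e. 4**level cells in total.
    """
    side = 2 ** level
    col = min(int(x * side / width), side - 1)   # clamp points on the far boundary
    row = min(int(y * side / height), side - 1)
    return col, row

def cell_center(col, row, width, height, level):
    """Return the geographic center P of cell (col, row)."""
    side = 2 ** level
    return ((col + 0.5) * width / side, (row + 0.5) * height / side)

def closest_sensor(sensors, point):
    """Return the SID of the sensor nearest to `point`.

    `sensors` maps SID -> (x, y). Ties are broken by the smaller SID,
    which is an assumption made only for this sketch.
    """
    return min(sensors, key=lambda sid: (math.dist(sensors[sid], point), sid))

if __name__ == "__main__":
    sensors = {1: (20.0, 30.0), 2: (26.0, 24.0), 3: (80.0, 80.0)}  # hypothetical layout
    center = cell_center(0, 0, 100.0, 100.0, level=1)
    print(center, closest_sensor(sensors, center))
```

In the hypothetical layout above, sensor 2 lies nearest to the center (25, 25) of the lower-left cell and would therefore act as that cell's indexing node.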

Fig. 2. Hilbert curve for level 1 and level 2: (a) Level = 1; (b) Level = 2.


Fig. 3 shows an example of a Hilbert space of $\ell = 1$ mapped onto a sensor network. Fig. 3(b) is the sensor network corresponding to the Hilbert space in Fig. 3(a). Each black dot in Fig. 3(b) represents a sensor in the network, and each white dot ($I_0$, $I_1$, $I_2$, $I_3$) is the sensor that is closest to the center of a cell and is regarded as the indexing node of that cell. These four nodes compose a Hilbert curve of level = 1 in the sensor network. Notice that the node closest to the center of a cell is not designated as the indexing node at this moment. It is determined when the first data item is sent to the center point of the cell. We give the details in Section 3.3.

The proper number of levels of the Hilbert curve is determined as follows. Too many indexing nodes (i.e., a Hilbert curve of too many levels) degrade the efficiency of query processing. Too few indexing nodes, on the other hand, may not provide enough storage space for the detected data items. Assume that detected data expire after a period of time, so that they do not need to be kept forever, and that the total memory space needed for storing data is $A$. If the memory size of each sensor is $z$, the number of indexing nodes $n$ must satisfy $n \ge \lceil A/z \rceil$. Since the number of indexing nodes is four to the power of the number of levels ($n = 4^{\ell}$), the number of levels satisfies $\ell = \log_4 n \ge \log_4 \lceil A/z \rceil$.

The range of data managed by an indexing node is determined as follows. Let the entire data range $R$ of an event be $[R_L, R_U]$, where $R_L$ is the lower bound and $R_U$ is the upper bound. For instance, the temperature detected by a sensor is usually in the range of 40 to 60 °C in a wild area, and humidity is within the range 0–100%. Let the $n$ indexing nodes be $I_0, I_1, \ldots, I_{n-1}$. We divide $R$ equally into $n$ sub-ranges, each of width $r$. That is, $n \times r = R$, where $R$ also denotes the width $R_U - R_L$ of the range. The sub-range of data for which the indexing node $I_{ID}$ is responsible is defined as $[R_L^{I_{ID}}, R_U^{I_{ID}})$, which is equal to $[R_L + ID \times r,\ R_L + (ID + 1) \times r)$, where $ID \in \{0, \ldots, n-1\}$ is the sequence number of the node; the last sub-range is closed at $R_U$. For example, Fig. 4(a) shows a sensor network of partition level $\ell = 1$. There are four indexing nodes in the sensor network, namely $I_0$, $I_1$, $I_2$, and $I_3$. Assume that the value range of an event is [0, 1]. We split the range [0, 1] equally into four sub-ranges [0, 0.25), [0.25, 0.5), [0.5, 0.75), and [0.75, 1], corresponding to the indexing nodes $I_0$, $I_1$, $I_2$, and $I_3$, respectively. If $\ell = 2$ as shown in Fig. 4(b), the number of indexing nodes increases to 16, i.e., $I_0, I_1, \ldots, I_{15}$. Hence, the sub-range of the first indexing node $I_0$ is [0, 0.0625), that of the second indexing node $I_1$ is [0.0625, 0.125), and so on.
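The sub-range assignment can be sketched in a few lines of Python. The sketch numbers the indexing nodes from 0 so that $I_0$ receives [0, 0.25) as in the worked example; the function names `subrange` and `index_for_value` are hypothetical and chosen only for illustration.

```python
def subrange(node_id, n, lower, upper):
    """Return (low, high) of the sub-range managed by indexing node I_node_id.

    The range [lower, upper] is cut into n equal pieces of width r.
    Node numbers start at 0, matching the example where I_0 -> [0, 0.25).
    """
    r = (upper - lower) / n
    return (lower + node_id * r, lower + (node_id + 1) * r)

def index_for_value(value, n, lower, upper):
    """Return the number of the indexing node whose sub-range covers `value`."""
    if not lower <= value <= upper:
        raise ValueError("value lies outside the event range")
    r = (upper - lower) / n
    # The last sub-range is closed, so value == upper maps to node n - 1.
    return min(int((value - lower) / r), n - 1)

if __name__ == "__main__":
    print(subrange(1, 4, 0.0, 1.0))           # sub-range of I_1 at level 1
    print(index_for_value(0.3, 4, 0.0, 1.0))  # node covering 0.3 at level 1
    print(index_for_value(0.3, 16, 0.0, 1.0)) # node covering 0.3 at level 2
```

With the event range [0, 1], the value 0.3 maps to node 1 when there are four indexing nodes and to node 4 (sub-range [0.25, 0.3125)) when there are sixteen.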

Fig. 3. Mapping Hilbert curve onto a sensor network: (a) Hilbert curve; (b) Sensor network.

Fig. 4. Hilbert curve for level 1 and level 2: (a) Level 1; (b) Level 2. Each indexing node is labeled with its data sub-range, and the example value t = 0.3 is marked in both panels.


The detailed steps of mapping the Hilbert curve to the sensor network are given in Algorithm 1.

Algorithm 1. Mapping Hilbert curve to sensor network.
1. GIVEN: The number of levels $\ell$, the boundary of the sensor network, the data range $R$, and its lower and upper bounds $[R_L, R_U]$.
2. FIND: The geographic location of the center point of each quadrant, the order of each quadrant, and the sub-range of each quadrant.
3. The number of quadrants in the sensor network is $4^{\ell}$;
4. Determine the location of the center point and the order of each quadrant by using the Hilbert curve algorithm;
5. Let $i$ be the order of a quadrant, with $0 \le i \le 4^{\ell} - 1$;
6. The sub-range of quadrant $i$ is $[R_L + i \times R/4^{\ell},\ R_L + (i + 1) \times R/4^{\ell})$;
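A minimal Python sketch of Algorithm 1 follows. It uses the widely known iterative conversion from a Hilbert order to grid coordinates; the curve orientation this conversion produces, the network origin at (0, 0), and the names `hilbert_d2xy` and `map_hilbert_curve` are assumptions of the sketch and may differ from the orientation drawn in the paper's figures.

```python
def hilbert_d2xy(side, d):
    """Convert Hilbert order d to (col, row) on a side x side grid.

    `side` must be a power of two. This is the standard iterative
    conversion; the resulting orientation is an assumption of the sketch.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                       # rotate or flip the sub-square
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def map_hilbert_curve(level, width, height, lower, upper):
    """Return a list of (order, center point, sub-range), one per quadrant."""
    side = 2 ** level
    n = side * side                       # 4**level quadrants
    r = (upper - lower) / n               # width of each sub-range
    plan = []
    for i in range(n):
        col, row = hilbert_d2xy(side, i)
        center = ((col + 0.5) * width / side, (row + 0.5) * height / side)
        plan.append((i, center, (lower + i * r, lower + (i + 1) * r)))
    return plan

if __name__ == "__main__":
    for order, center, sub in map_hilbert_curve(1, 100.0, 100.0, 0.0, 1.0):
        print(order, center, sub)
```

For level 1 on a 100 × 100 network, this sketch visits the centers in the order (25, 25), (25, 75), (75, 75), (75, 25), so consecutive quadrants are always geographic neighbors and hold adjacent sub-ranges.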

3.2. Complexity of building a Hilbert curve

The Hilbert space is actually an abstract network structure. The goal of mapping the Hilbert curve to a sensor network is to determine the number of indexing nodes, the sub-range of data for which each indexing node is responsible, and the order of the indexing nodes. Locating the indexing node of a cell, however, is accomplished when the first data item is sent to that cell. The details are given in Section 3.3. Therefore, no communication cost is required for building a Hilbert curve in our design.

To determine the number of indexing nodes and the sub-range of each indexing node, each sensor is preloaded before deployment with the network size (e.g., the coordinates of two opposite corners (0, 0) and (100, 100)), the maximum value range of a detected event (e.g., a humidity between 0% and 100%), and the number of levels $\ell$ of the Hilbert curve to be created. The first two parameters are obviously known before deployment. The third parameter, $\ell$, can be determined from the storage space of each sensor and the total amount of data that will be detected. As the detection rate and the period for which each detected data item is kept are normally known before the sensors are deployed, the total data size in the sensor network can be estimated in advance. Since both factors that determine $\ell$ (each sensor's storage space and the total data size) are available, $\ell$ can be fixed before the sensors are deployed. Note that we may assign one or two more levels than necessary to ensure that there are enough indexing nodes to store and manage detected events. Therefore, when a sensor is deployed, it knows the number of levels the sensor network is going to have.
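As a worked illustration of how the number of levels follows from the storage budget, the sketch below finds the smallest level whose $4^{\ell}$ indexing nodes can hold the estimated data. The function name `min_levels` and the `extra_levels` parameter (modeling the optional one or two spare levels) are hypothetical; the total data size and the per-sensor capacity are assumed to be given in the same unit.

```python
def min_levels(total_data, per_sensor_capacity, extra_levels=0):
    """Smallest number of levels l with 4**l >= ceil(total_data / per_sensor_capacity).

    `extra_levels` adds the optional safety margin of spare levels.
    """
    if total_data <= 0 or per_sensor_capacity <= 0:
        raise ValueError("sizes must be positive")
    needed = -(-total_data // per_sensor_capacity)   # integer ceiling of A / z
    level = 0
    while 4 ** level < needed:
        level += 1
    return level + extra_levels

if __name__ == "__main__":
    # Hypothetical numbers: 1000 units of data, 100 units of memory per sensor.
    print(min_levels(1000, 100))
```

With these hypothetical numbers, ten indexing nodes are needed, four nodes (level 1) are too few, and sixteen nodes (level 2) suffice, so the sketch returns 2.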
In addition, based on its deployed location (i.e., its coordinates), a sensor can figure out its relative position in the whole network, because it knows the boundary of the whole network. Since the number of levels of the Hilbert curve is also known, the deployed sensor can work out which cell (i.e., quadrant) of the Hilbert space it belongs to. As this computation takes only a few arithmetic steps, its complexity is O(1) and its cost is almost negligible.

Each indexing node must also determine its sequence number along the Hilbert curve. That is, each indexing node determines for itself a number between 0 and $4^{\ell} - 1$, where $\ell$ is the number of levels. This number is the IID of the indexing node. Only the indexing nodes need to determine an IID; the other nodes do not. According to the analysis of Meng et al. [22], the computational complexity of determining this number is $O(\ell^2)$. This is the complexity of building an $\ell$-level Hilbert curve in a sensor network. Notice that all of this is done independently at every indexing node. No message exchanges are required at all; that is, the number of message exchanges is 0. Hence, it is a very cheap construction mechanism. Besides, it is performed only once, at the beginning when the sensors are deployed. As the construction cost is extremely low and incurred only once, this mechanism is quite acceptable in the sensor network environment.

3.3. Data insertion mechanism

When a data item is detected by a sensor, the sensor sends it to the indexing node whose value range covers the detected data. More precisely, the sensor does not send the data to a known indexing node, but to a geographic location $P$, the center of the cell whose value range covers the detected data.
This can be done by using the greedy perimeter stateless routing (GPSR) mechanism proposed by Karp and Kung [19]. Fig. 5 shows an example of this process. Sensor A detects a data item and determines that it should be sent to the center $P$ of a certain cell. The GPSR mechanism is then initiated to direct the data toward $P$. If no sensor is located exactly at $P$, the data is forwarded among the one-hop neighbors around $P$ as shown in the figure. After the data has circled the neighbors around $P$ once, the sensor closest to $P$ can be determined, and this sensor is designated as the indexing node of the cell. The indexing node then broadcasts a message to its neighbors to let them know that it is the indexing node of the cell. Later on, when another data item is forwarded to $P$, the sensors around $P$ direct it to the indexing node.

Fig. 5. The greedy perimeter stateless routing mechanism.

Two parameters $(V_{min}^{I_{ID}}, V_{max}^{I_{ID}})$ are used to record the minimum and maximum values currently stored in an indexing node. These two parameters are initially set to 0. When a data item is stored in an indexing node, the two values are updated accordingly. For example, in Fig. 4(a), if a sensor detects a data item whose value is 0.3, the data is sent to $I_1$ because it belongs to the sub-range of $I_1$. If 0.3 is the only data item in $I_1$, the two parameters are $(V_{min}^{I_1}, V_{max}^{I_1}) = (0.3, 0.3)$, as the minimum and the maximum stored values are both 0.3. The pseudo-code of the data insertion mechanism is listed in Algorithm 2.

Algorithm 2. Data insertion mechanism.
1. GIVEN: A detected data item $d$, the number of levels $\ell$, the location of the center point of each quadrant, the data range $R$, and its lower and upper bounds $[R_L, R_U]$.
2. FIND: A geographic location $(a, b)$ in the sensor network, and the indexing node $I_i$ that is closest to $(a, b)$.
3. Find $i$ such that $R_L + i \times R/4^{\ell} \le d < R_L + (i + 1) \times R/4^{\ell}$;
4. Let $P_i$ be the center point of quadrant $i$ and $(a, b)$ be the location of $P_i$;
5. Route $d$ to $I_i$ using GPSR, where $I_i$ is the sensor whose location is closest to $(a, b)$;
6. If $d$ is the first data item in $I_i$, set $(V_{min}^{I_i}, V_{max}^{I_i})$ to $(d, d)$; otherwise update $(V_{min}^{I_i}, V_{max}^{I_i})$ with $d$;

3.4. Similarity search mechanism

The data insertion mechanism presented in the previous subsection allows a query to be processed easily once the indexing node that needs to be searched has been located. This, in effect, embeds a filtering mechanism that elegantly sifts out unqualified data. The similarity search mechanism consists of two phases, the query resolving phase and the query probing phase. The query resolving phase determines the indexing node that is most likely to provide an answer for the query. If the answer does not exactly match the given query, the query probing phase is initiated to find possible answers from other indexing nodes.

3.4.1. Query resolving phase for similarity search queries

When a similarity search query is issued, the sensor that receives the query locates the indexing node whose data sub-range covers the given value of the query by executing a locate( ) function, which is defined as follows:

locate($V_q$) = the indexing node $I_{ID}$ such that $R_L + ID \times r \le V_q < R_L + (ID + 1) \times r$, where $V_q$ is the search value given by the query and $ID$ is the sequence number of the node.

The query is then forwarded to the located indexing node $I_{ID}$ to retrieve data. We call this indexing node the target node $I_T$. $I_T$ compares $V_q$ with its local data. If a result that exactly matches $V_q$ is found, the query execution is finished and the data is forwarded to the query node. The probing phase is unnecessary in this case. Otherwise, the probing phase is initiated to determine efficiently which adjacent indexing nodes should be accessed to find the most similar data.

3.4.2. Query probing phase for similarity search queries

Let the local similar data of indexing node $I_{ID}$ be $V_s^{I_{ID}}$, so the local similar data of the target node $I_T$ is $V_s^{I_T}$. We have the following possible cases.

• Case 1: $I_T$ is nonempty (i.e., has local data) and $V_s^{I_T}$ is the most similar local data in $I_T$.
  – Subcase 1: If $V_s^{I_T}$ is larger than $V_q$, then all data in $I_{T+1}$ must be even greater than $V_q$ (because they are greater than $V_s^{I_T}$). But in $I_{T-1}$ there may be a $V_s^{I_{T-1}}$ that is closer to $V_q$.
  – Subcase 2: If $V_s^{I_T}$ is smaller than $V_q$, then no data in $I_{T-1}$ can be more similar to $V_q$ than $V_s^{I_T}$ is. But in $I_{T+1}$ there may be a $V_s^{I_{T+1}}$ that is closer to $V_q$.
• Case 2: $I_T$ is empty (i.e., no data is stored in $I_T$). $V_q$ has to be sent to both neighbors of $I_T$ (i.e., $I_{T-1}$ and $I_{T+1}$) to find the most similar data.

We propose three operations, backward probing, forward probing, and bi-directional probing, to deal with the above cases in the probing phase. Backward probing and forward probing are designed for Subcase 1 and Subcase 2, respectively, and bi-directional probing is for Case 2.

3.4.2.1. Backward probing. This type of probing is needed when $V_s^{I_T}$ is greater than $V_q$. Since the greatest data in $I_{T-1}$ is smaller than the lower bound $R_L^{I_T}$ of $I_T$, the values $V_q$, $V_s^{I_T}$, and $R_L^{I_T}$ can be used to determine whether $V_s^{I_T}$ is the answer to $V_q$. If $V_s^{I_T}$ is closer to $V_q$ than $R_L^{I_T}$ is, $V_s^{I_T}$ is definitely the answer. Otherwise, there may be a better answer in $I_{T-1}$. More precisely, we denote the midpoint of two values $x$ and $y$ by $M_{x,y}$. Let $x$ be $V_s^{I_T}$ and $y$ be the lower bound $R_L^{I_T}$ of $I_T$. Hence, $M_{x,y} = (V_s^{I_T} + R_L^{I_T})/2$. If $V_q$ is greater than $M_{x,y}$, then none of the data items in $I_{T-1}$ is closer to $V_q$ than $V_s^{I_T}$ is, so $V_s^{I_T}$ is the answer. On the other hand, if $V_q$ is smaller than $M_{x,y}$, then there may be more similar data in $I_{T-1}$. Hence, $I_{T-1}$ has to be visited. In this case, the greatest value $V_{max}^{I_{T-1}}$ of $I_{T-1}$ is sent to $I_T$. Whichever of $V_{max}^{I_{T-1}}$ and $V_s^{I_T}$ is closer to $V_q$ is returned as the answer. If there is no data in $I_{T-1}$, then $V_s^{I_T}$ is the answer.

3.4.2.2. Forward probing. This case is the opposite of the previous one. It is needed when $V_s^{I_T}$ is smaller than $V_q$. In this case, only data greater than $V_s^{I_T}$ can be more similar. Therefore, $V_q$, $V_s^{I_T}$, and $R_U^{I_T}$ are used to determine whether $V_s^{I_T}$ is the answer to $V_q$.
If V IsT is closer to Vq than RIUT is, V IsT is the answer of the query. Otherwise, IT+1 might have a data item that is a more proper answer. We also use Mx,y to illustrate whether IT+1 should be visited. Let x be V IsT , and y be the upper bound RIUT of IT. Mx,y is equal to ðV IsT þ RIUT Þ=2. If Vq is smaller than Mx,y, then none of the data items in IT+1 will be closer to Vq than V IsT is. So, V IsT is the answer. On the other hand, if Vq is greater than Mx,y, then there may be more similar ITþ1 ITþ1 data in IT+1. Hence, IT+1 has to be visited. The largest data V min should be sent to IT. The value of V min or V IsT that is closer to Vq is returned as the answer. If there is no data in IT+1, then V IsT is the answer. 3.4.2.3. Bi-directional probing. If IT has no data, then IT has to probe both IT1 and IT+1 to find the most similar data. IT initiates T1 Tþ1 both the backward probing and the forward probing processes to find V max and V min , respectively, and return the one that is closer to Vq as the result. If IT1 or IT+1 again contains no data, the process will continue until the termination condition in the backward or the forward probing process is satisfied. 3.4.2.4. Example. We use Fig. 6 to illustrate how the probing phase works. Assume that sensors are used for detecting humidity of the environment, and the indexing nodes are I0, I1, and I2. The assigned data sub-ranges of I0, I1, and I2 are [0%, 25%), [25%, 50%), and [50%, 75%), respectively. In Fig. 6(a), if a query is issued to find the humidity that is either 28% or the one that is closest to 28%, the query will be forwarded to I1 as the sub-range of I1 covers Vq = 28%. That is, I1 is IT in this case. Assume that the local data of I1 that is closest to Vq is 48%. That is, V Is1 is 48%. However, as V Is1 , which is 48%, is greater than Vq, which is 28%, I0 might have data closer to Vq than V Is1 is. Since V Is1 is 48% and the lower bound of I1 is 25%, Mx,y is (48% + 25%)/2 = 36.5%.

Fig. 6. Three operations of the probing phase: (a) backward probing, (b) forward probing, (c) bi-directional probing. Legend: query value $V_q$; similar data $V_s^{I_{ID}}$; midpoint $M_{x,y}$; maximum and minimum observed data of $I_{ID}$, $V_{max}^{I_{ID}}$ and $V_{min}^{I_{ID}}$; upper and lower bounds of $I_{ID}$, $R_U^{I_{ID}}$ and $R_L^{I_{ID}}$.


As $V_q$ is smaller than $M_{x,y}$, $I_1$ should forward the query and $V_s^{I_1}$ to $I_0$ to compare $V_{max}^{I_0}$ with $V_s^{I_1}$. If $V_{max}^{I_0}$ is closer to $V_q$ than $V_s^{I_1}$ is, then $V_{max}^{I_0}$ is the answer and is returned to the query node. Otherwise $V_s^{I_1}$ is the most similar data of all. This implements a backward probing process in locating the answer.

Fig. 6(b) shows another case, which requires a forward probing. Let $V_q$ = 47% and let $V_s^{I_1}$ be 30%, as shown in Fig. 6(b). As $V_q > V_s^{I_1}$ (i.e., 47% > 30%) and $V_q$ exceeds the midpoint (30% + 50%)/2 = 40%, $I_2$ might contain data more similar than $V_s^{I_1}$. Therefore, $I_1$ forwards the query and $V_s^{I_1}$ to $I_2$ to compare $V_{min}^{I_2}$ with $V_s^{I_1}$. If $V_{min}^{I_2}$ is closer to $V_q$ than $V_s^{I_1}$ is, then $V_{min}^{I_2}$ is the answer. Otherwise $V_s^{I_1}$ is the most similar data.

If the target node $I_1$ has no data in its memory, as shown in Fig. 6(c), $I_1$ has to perform both a backward probing and a forward probing process to retrieve $V_{max}^{I_0}$ and $V_{min}^{I_2}$ from $I_0$ and $I_2$, respectively. Finally, $I_1$ returns the more similar of the two to the query node. The detailed steps of the similarity search mechanism are given in Algorithm 3.
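The three probing operations can be illustrated with a minimal Python sketch. It is not the paper's Algorithm 3: the names `closest`, `nearer`, and `probe` are hypothetical, indexing nodes are modelled as plain lists ordered by sub-range, and no message passing is simulated. For bi-directional probing, the sketch assumes that probing continues to the nearest non-empty neighbor on each side, which is one reading of the termination condition described above.

```python
def closest(values, vq):
    """Return the stored value nearest to vq, or None if the node is empty."""
    return min(values, key=lambda v: abs(v - vq)) if values else None


def nearer(a, b, vq):
    """Return whichever of a and b is nearer to vq (None means no candidate)."""
    if a is None:
        return b
    if b is None:
        return a
    return a if abs(a - vq) <= abs(b - vq) else b


def probe(nodes, bounds, t, vq):
    """Probing phase at target node t for query value vq.

    nodes[i]  : list of values stored at indexing node I_i
    bounds[i] : (lower, upper) sub-range assigned to I_i
    """
    lower, upper = bounds[t]
    vs = closest(nodes[t], vq)

    if vs is None:
        # Bi-directional probing (assumption: scan to the nearest
        # non-empty neighbor in each direction).
        back = next((max(nodes[i]) for i in range(t - 1, -1, -1) if nodes[i]), None)
        fwd = next((min(nodes[i]) for i in range(t + 1, len(nodes)) if nodes[i]), None)
        return nearer(back, fwd, vq)

    if vs > vq:
        # Backward probing: visit I_{t-1} only if vq is below the midpoint
        # of vs and the lower bound of I_t.
        if vq >= (vs + lower) / 2 or t == 0 or not nodes[t - 1]:
            return vs
        return nearer(vs, max(nodes[t - 1]), vq)

    if vs < vq:
        # Forward probing: visit I_{t+1} only if vq is above the midpoint
        # of vs and the upper bound of I_t.
        if vq <= (vs + upper) / 2 or t == len(nodes) - 1 or not nodes[t + 1]:
            return vs
        return nearer(vs, min(nodes[t + 1]), vq)

    return vs  # exact match


bounds = [(0, 25), (25, 50), (50, 75)]
print(probe([[10, 20], [48], [55]], bounds, 1, 28))  # setting of Fig. 6(a)
print(probe([[10], [30], [52]], bounds, 1, 47))      # setting of Fig. 6(b)
print(probe([[20], [], [60]], bounds, 1, 28))        # setting of Fig. 6(c)
```

The stored values in $I_0$ and $I_2$ used in the calls are made up for illustration; only the values 48%, 30%, 28%, and 47% come from the example.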


3.5. Complexity of similarity search algorithm
In this subsection, we discuss the computation cost of running the SSA, measured as the number of steps required to execute the algorithm. The SSA comprises the query resolving phase and the query probing phase. The query resolving phase contains the locate( ) function, which finds the indexing node $I_T$ whose sub-range covers the given query, and the search for the similar data $V_s^{I_T}$ in $I_T$. The query probing phase involves the execution of one of the backward probing, forward probing, and bi-directional probing operations.

Since executing the locate( ) function requires only a few arithmetic steps, its computation cost is O(1). As for the query probing phase, there are two possible cases. The first case is that at least one data item is found in $I_T$. In this case, one of the following operations is initiated: searching for the similar data, executing the backward probing, or executing the forward probing. The second case is that $I_T$ contains no data, in which case a bi-directional probing is executed.

The first case can again be divided into two cases. One is that the most similar data found in $I_T$ (that is, $V_s^{I_T}$) is indeed the globally most similar data to the query. The other is that another indexing node might have a data item that is a better answer for the query. If $V_s^{I_T}$ is the globally most similar data to the query, then the computation cost is that of searching all stored data in $I_T$, which is $O(X_T)$, where $X_T$ is the amount of data stored in $I_T$. Otherwise, either the backward probing or the forward probing is executed. In that case, SSA has to compare $V_s^{I_T}$ with the maximum existing value $V_{max}^{I_{T-1}}$ of the indexing node $I_{T-1}$ in a backward probing, or compare $V_s^{I_T}$ with the minimum existing value $V_{min}^{I_{T+1}}$ of the indexing node $I_{T+1}$ in a forward probing. As this comparison of two values is very simple (a cost of O(1)), the computation cost of this sub-case is also $O(X_T)$.

In the second case (no data stored in $I_T$), the bi-directional probing is initiated, so both backward probing and forward probing are executed. Hence, the computation cost is $O(X_{T+1}) + O(X_{T-1})$, where $X_{T+1}$ and $X_{T-1}$ represent the amounts of data in $I_{T+1}$ and $I_{T-1}$, respectively.

Hence, we conclude that the complexity of SSA is either $O(1) + O(X_T)$ (the first case) or $O(1) + O(X_{T+1}) + O(X_{T-1})$ (the second case). The O(1) term is negligible as it is only a constant, and in normal cases $X_T \approx X_{T+1} \approx X_{T-1}$. For this reason, the complexity of SSA is either $O(X_T)$ or $O(2X_T)$, which means that the complexity of SSA is $O(X_T)$.

3.6. Workload sharing mechanism
In the above discussion we assumed that all nodes are stable and able to continuously route, monitor, and store data. However, a sensor node may fail for reasons such as running out of power or hardware defects. Two workload sharing mechanisms are proposed to remedy this problem. One is designed for a sensor that fails because it runs out of power, and the other is for sudden damage to the hardware of a sensor caused by events such as fire or being soaked in water.

When an indexing node is about to fail due to lack of power, a workload sharing mechanism should be initiated to let other sensors take over its responsibility. The basic idea is to transfer the current jobs and the data to a nearby sensor when the energy of the indexing node falls below a certain threshold. The indexing node then switches to a low-power state to conserve energy. The threshold is system-defined and pre-installed in the sensors before deployment. The chosen nearby sensor, informed by the indexing node, becomes the new indexing node. However, the other sensors in the network do not know that the indexing node has changed, and sensors that detect a data item would still send it to the old indexing node. Intuitively, this problem can be solved by broadcasting the change to the entire sensor network, but this is very costly and should be avoided. Our solution is that when an old indexing node transfers its workload to a new indexing node, it also broadcasts the location of the new indexing node to the sensors located in the same cell. Hence, all sensors in this cell become aware that the indexing node has been replaced, and future data can be delivered to the new indexing node.

In the other scenario, where a node fails due to sudden, unexpected damage, data must be saved in two different places to avoid permanent loss. Our idea is that a detected data item is stored in both its corresponding indexing node $I_{ID}$ and a mirror node of $I_{ID}$. This technique essentially creates a mirror Hilbert curve and a mirror mapping function in the sensor network. When an indexing node fails, the neighbors of this indexing node become aware of the failure. When a detected data item is sent toward the failed indexing node through these neighbors, they locate the mirrored indexing node by using the mirror mapping function and forward the data to that node. Note that additional storage and communication costs are necessary in order to overcome the node failure problem. Since the above two workload sharing mechanisms are straightforward, we omit their pseudo-code.
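As a rough illustration of the first (energy-threshold) mechanism, the following Python sketch models the hand-over step only. The `Sensor` class, its fields, and the `hand_over` function are hypothetical names introduced here; how the nearby sensor is chosen and how the cell-local broadcast is delivered are not specified in the text and are not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Sensor:
    name: str
    energy: float
    data: list = field(default_factory=list)  # data items held as indexing node
    index_node: str = ""                       # indexing node this sensor reports to
    low_power: bool = False


def hand_over(old, new, cell_members, threshold):
    """Move the indexing role from `old` to `new` if old's energy is below threshold.

    Only the sensors of the same cell are told about the change, which
    stands in for the cell-local broadcast described above.
    Returns True if a hand-over took place.
    """
    if old.energy >= threshold:
        return False
    new.data.extend(old.data)   # transfer stored data to the new indexing node
    old.data.clear()
    old.low_power = True        # old node switches to a low-power state
    for sensor in cell_members:
        sensor.index_node = new.name
    return True


old = Sensor("A", energy=0.05, data=[0.3, 0.4], index_node="A")
new = Sensor("B", energy=0.9, index_node="A")
cell = [old, new, Sensor("C", energy=0.8, index_node="A")]
hand_over(old, new, cell, threshold=0.1)
print([s.index_node for s in cell])
```

The energy values and the threshold of 0.1 are arbitrary illustrative numbers.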

4. Extensions of our similarity search algorithm
In this section, we propose two extensions of our similarity search algorithm. One supports the processing of range queries, and the other supports the processing of multiple queries.


4.1. Range query processing
In addition to the point query discussed in Section 3.4, range queries are also common in a sensor network, for example, finding the humidity values that lie between 60% and 80%. The proposed SSA can also be applied efficiently to processing a range query. When a range query is issued, the query is partitioned into sub-queries, which are then sent to their corresponding indexing nodes for further processing. The partitioning of the query is performed according to the pre-defined value range of each indexing node.

Assume that the value range of the humidity is between 0% and 100% and there are four indexing nodes in the sensor network, $I_0$, $I_1$, $I_2$, and $I_3$. We equally split the range [0%, 100%] into four sub-ranges [0%, 25%), [25%, 50%), [50%, 75%), and [75%, 100%], corresponding to the indexing nodes $I_0$, $I_1$, $I_2$, and $I_3$, respectively. If a range query is to find data within the range [60%, 80%], then the sub-ranges of $I_2$ and $I_3$ are located, as together they cover the given query. Hence, the query is forwarded to $I_2$ and $I_3$ for further processing. If similar data is also required, the task is simply to enlarge the search range so that a wider range of data is returned to the query node. The basic idea remains the same. Since range query processing closely follows the similarity search mechanism, we omit its pseudo-code.
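The partitioning step can be sketched in a few lines of Python. This is an illustrative sketch, not the omitted pseudo-code: the function name `partition_range_query` is hypothetical, and the sub-ranges are assumed to be half-open except for the last one, as in the humidity example.

```python
def partition_range_query(lo, hi, bounds):
    """Split the range query [lo, hi] into per-node sub-queries.

    bounds[i] is the (lower, upper) sub-range of indexing node I_i;
    every sub-range is half-open [lower, upper) except the last,
    which is closed. Returns a list of (node_index, sub_lo, sub_hi);
    sub_hi is exclusive when it equals the upper bound of a half-open
    sub-range.
    """
    last = len(bounds) - 1
    sub_queries = []
    for i, (b_lo, b_hi) in enumerate(bounds):
        starts_before_end = lo < b_hi or (i == last and lo <= b_hi)
        if hi >= b_lo and starts_before_end:
            sub_queries.append((i, max(lo, b_lo), min(hi, b_hi)))
    return sub_queries


bounds = [(0, 25), (25, 50), (50, 75), (75, 100)]
print(partition_range_query(60, 80, bounds))
```

For the query [60%, 80%] of the example, the sketch yields one sub-query for $I_2$ and one for $I_3$.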

4.2. Multiple queries processing
It is energy-inefficient to process queries one by one, especially when their query conditions are similar to each other. In the following, we discuss how to process multiple point queries and multiple range queries separately.

4.2.1. Multiple point queries processing
We can utilize previous queries and their results to answer a similar query. The query conditions ($V_q$) and the corresponding similar data ($V_s^{V_q}$) are kept in the indexing node to which the query is forwarded. If $V'_q$ is the first query issued to indexing node $I_T$, the similarity search mechanism proceeds as presented in Section 3. Otherwise (i.e., $I_T$ has processed a similar query before), the query probing phase is initiated according to the relationship among the new query $V'_q$, the previous query $V_q$ (where $V_q$ is the previous query closest to $V'_q$) and its corresponding answer $V_s^{V_q}$, as well as the minimum and maximum existing values ($V_{min}^{I_T}$, $V_{max}^{I_T}$) in $I_T$. They form the following possible cases.

• Case 1: If $V'_q$ is between $V_q$ and $V_s^{V_q}$ (i.e., $V_q \le V'_q \le V_s^{V_q}$ or $V_s^{V_q} \le V'_q \le V_q$), then $V_s^{V_q}$ is the answer for $V'_q$. The reason is that $V_s^{V_q}$ is even closer to $V'_q$ than it is to $V_q$. If $V_s^{V_q}$ is already the most similar result for $V_q$, it is definitely the result for $V'_q$. No processing cost is incurred in this case.
• Case 2: If $V'_q$ is smaller than both $V_q$ and $V_s^{V_q}$ (i.e., $V'_q < V_q < V_s^{V_q}$ or $V'_q < V_s^{V_q} < V_q$), we can further divide it into two subcases.
– Subcase 1: If $V'_q < V_{min}^{I_T}$, the backward probing phase is initiated.
– Subcase 2: If $V'_q \ge V_{min}^{I_T}$, the answer of $V'_q$ is definitely located in $I_T$. Hence, $I_T$ finds the similar data for $V'_q$.
• Case 3: If $V'_q$ is greater than both $V_q$ and $V_s^{V_q}$ (i.e., $V_s^{V_q} < V_q < V'_q$ or $V_q < V_s^{V_q} < V'_q$), then there are also two possibilities.
– Subcase 1: If $V'_q \le V_{max}^{I_T}$, the answer of $V'_q$ is located in $I_T$. So, $I_T$ finds the similar data for $V'_q$.
– Subcase 2: If $V'_q > V_{max}^{I_T}$, the forward probing phase is initiated.

We use the following example to illustrate the above cases. In Fig. 7, $V_q$ is a previous query in $I_T$ and $V_s^{V_q}$ is the corresponding answer for $V_q$. Let $X$ be the mirror point of $V_s^{V_q}$ with respect to $V_q$, that is, the distance between $V_q$ and $V_s^{V_q}$ is equal to the distance between $V_q$ and $X$. Then, the three points $X$, $V_q$, and $V_s^{V_q}$ divide the data range of $I_T$ into four regions, $g_1$, $g_2$, $g_3$, and $g_4$.

Fig. 7. Three cases of the multiple point queries processing.


Fig. 7(a) corresponds to Case 1 above, i.e., $V_q \le V'_q \le V_s^{V_q}$. As $V_s^{V_q}$ is the answer for $V_q$, we know that there is no data in $g_2$ (otherwise, $V_s^{V_q}$ could not be the result of $V_q$). Hence, if $V_q \le V'_q \le V_s^{V_q}$, then $V_s^{V_q}$ is definitely the most similar data to $V'_q$ in $I_T$.

Fig. 7(b) corresponds to Subcase 1 of Case 2 (i.e., $V'_q < V_q < V_s^{V_q}$ and $V'_q < V_{min}^{I_T}$). As there is no data in $g_2$, $V_{min}^{I_T}$ is the similar data for $V'_q$ in $I_T$. However, there might be data in $I_{T-1}$ that is closer to $V'_q$ than $V_{min}^{I_T}$ is. Therefore, the backward probing phase is initiated. On the other hand, if $V'_q < V_q < V_s^{V_q}$ and $V'_q \ge V_{min}^{I_T}$, as in Fig. 7(c), which is Subcase 2 of Case 2, there may be data in $g_1$ that is closer to $V'_q$ than $V_{min}^{I_T}$ is. Therefore, $I_T$ has to search its stored data in $g_1$ to find the similar data for $V'_q$.

As the subcases of Case 3 mirror those of Case 2, Case 3 follows directly from the explanation of Case 2. The pseudo-code of multiple point query processing is listed in Algorithm 4.
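The case analysis of this subsection reduces to a small decision function. The Python sketch below is illustrative only and is not Algorithm 4: the function name `classify_point_query` and the returned labels are hypothetical, and the function merely decides which action the indexing node takes without performing the search or the probing.

```python
def classify_point_query(vq_new, vq_prev, vs_prev, v_min, v_max):
    """Decide how to serve a new point query from a cached (query, answer) pair.

    vq_new         : new query value
    vq_prev        : closest previously answered query value
    vs_prev        : cached answer of vq_prev
    v_min, v_max   : minimum and maximum values currently stored in I_T
    """
    lo, hi = min(vq_prev, vs_prev), max(vq_prev, vs_prev)
    if lo <= vq_new <= hi:
        return "reuse_cached_answer"   # Case 1: no processing cost
    if vq_new < lo:                    # Case 2
        return "backward_probing" if vq_new < v_min else "local_search"
    # Case 3: vq_new > hi
    return "local_search" if vq_new <= v_max else "forward_probing"


# Previous query 0.30 was answered with 0.35; I_T stores values in [0.27, 0.48].
for v in (0.32, 0.26, 0.28, 0.40, 0.49):
    print(v, classify_point_query(v, 0.30, 0.35, 0.27, 0.48))
```

The numeric values in the usage lines are made up for illustration.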

4.2.2. Multiple range queries processing
When a range query is issued, the query is partitioned into sub-queries according to the value ranges of the indexing nodes. Each of these sub-queries can then be considered an independent range query and processed in the corresponding indexing node. The results are sent back to the query node, which assembles them into the final result. Assume that a sub-query $V'_q$ of a new range query is sent to an indexing node $I_T$, and $V_q$ is a previous range query that is similar to $V'_q$, with answer set $V_s^{V_q}$. According to the relationship between $V_q$ and $V'_q$, the process has two possible cases. The reader may refer to Fig. 8 for this relationship.

• Case 1: The answers of the new range query are located in one indexing node.
– Subcase 1: If $V'_q$ exactly matches $V_q$, then the answer of $V_q$ is the answer of $V'_q$.
– Subcase 2: If $V_q$ covers $V'_q$, then $I_T$ searches $V_s^{V_q}$ for the data that lies in the range of $V'_q$.
Note that there are no other subcases of Case 1, because in any other situation the most similar data might appear in another indexing node.
• Case 2: The answers of the new range query might be located in several indexing nodes.
– Subcase 1: If $V'_q$ covers $V_q$, then the range query processing is initiated to process $V'_q$ in $I_T$.
– Subcase 2: If $V'_q$ overlaps $V_q$, then the range query processing is initiated to process $V'_q$ in $I_T$.

For example, in Fig. 8, suppose sensors are used for detecting the humidity of the environment. Assume that the value range of the humidity is between 0% and 100% and there are four indexing nodes in the sensor network, i.e., $I_0$, $I_1$, $I_2$, and $I_3$. These four indexing nodes correspond to the four sub-ranges [0%, 25%), [25%, 50%), [50%, 75%), and [75%, 100%], respectively. Assume that a range query is issued to find data within the range [60%, 70%]. If a previous query also asked for data within the range [60%, 70%] (the first subcase of Case 1), as shown in Fig. 8(a), the answer of the previous query is also the answer for the new range query. If a previous query asked for data within the range [55%, 74%] (the second subcase of Case 1), as shown in Fig. 8(b), the answer of the new range query is contained in the answer of this previous query. Hence, $I_T$ searches $V_s^{V_q}$ for the data that lies in the range [60%, 70%].


Fig. 8. Four cases of the multiple range queries processing.

If the previous query asked for data within the range [62%, 68%] (the first subcase of Case 2), as shown in Fig. 8(c), then $V_s^{V_q}$ is only part of the answer of $V'_q$, and data that is stored in $I_T$ but does not belong to $V_s^{V_q}$ might also be part of the answer for $V'_q$. Even worse, $I_1$ and $I_3$ may also contain some similar data. For instance, if in Fig. 8(c) the range query is on [60%, 73%] (rather than [60%, 70%]) and no data in $I_2$ falls in the range [73%, 75%], then $I_3$ has to be searched because it may have data closer to 73% than 68% is. Therefore, the range query processing mechanism is initiated to process $V'_q$ in $I_T$. If the previous query asked for data within the range [55%, 65%] (the second subcase of Case 2), as shown in Fig. 8(d), other indexing nodes may also be involved, and therefore the range query processing mechanism is initiated. The detailed steps of multiple range query processing are listed in Algorithm 5.
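The four relationships of Fig. 8 can be told apart with simple interval comparisons. The Python sketch below is illustrative and is not Algorithm 5: the names `classify_range_query` and `filter_cached` are hypothetical, ranges are modelled as closed `(low, high)` tuples, and the extra `"disjoint"` label (no usable previous query) is an assumption added so that the function is total.

```python
def classify_range_query(new, prev):
    """Relate a new range sub-query to a previous one; both are (low, high)."""
    n_lo, n_hi = new
    p_lo, p_hi = prev
    if (n_lo, n_hi) == (p_lo, p_hi):
        return "exact_match"           # Case 1, Subcase 1: reuse cached answer
    if p_lo <= n_lo and n_hi <= p_hi:
        return "covered_by_previous"   # Case 1, Subcase 2: filter cached answer
    if n_lo <= p_lo and p_hi <= n_hi:
        return "covers_previous"       # Case 2, Subcase 1: run range processing
    if n_lo <= p_hi and p_lo <= n_hi:
        return "overlaps"              # Case 2, Subcase 2: run range processing
    return "disjoint"                  # assumption: treated as a fresh query


def filter_cached(cached_answer, new):
    """Case 1, Subcase 2: keep only cached values inside the new range."""
    return [v for v in cached_answer if new[0] <= v <= new[1]]


new_query = (60, 70)
for prev in [(60, 70), (55, 74), (62, 68), (55, 65)]:
    print(prev, classify_range_query(new_query, prev))
print(filter_cached([56, 61, 69, 73], new_query))
```

The four previous ranges in the usage loop are those of Fig. 8(a) to 8(d); the cached values passed to `filter_cached` are made up.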

5. Simulation results
In this section we verify the effectiveness of the proposed similarity search algorithm (SSA) by comparing it against SR-GHT in processing similarity search queries. Since communication is the main part of the energy consumption of sensors, we use the number of exchanged messages as the comparison metric. The variables include network size, node density, node distribution, and the number of levels of the Hilbert curve. We also compare the performance of SSA with SR-GHT and the Naive algorithm, which is introduced in Section 1, under different query rates, data detection rates, and node failure rates.

5.1. Performance model

5.1.1. Sensor network settings
As the comparison system is SR-GHT, we use SR-GHT's settings in this performance study. Sensors are randomly deployed in the network, the radio range of each sensor is 40 m, and the node density of the sensor network is 1 node per 256 m². In the small-scale simulations we vary the number of sensors from 50 to 200. In the large-scale simulations the number of nodes varies from $10^3$ up to $10^5$ to investigate the feasibility of the work. We also consider a skewed distribution of sensors. To simulate a skewed distribution, a large portion of the sensors is distributed in one half of the region and the rest in the other half [1,20,24,33]. Hence, in our simulation, we assume that 80% of the nodes are randomly distributed in one half of the network area and the other 20% of the nodes are randomly distributed in the other half. Meanwhile, we vary the node density of the sensor network from 1 node per 256 m² to 1 node per 64 m².

5.1.2. Data settings
In the experiments, each sensor on average generates ten data items, and the value of each data item is uniformly distributed in the range [0, 1]. According to the analysis of SR-GHT, a DCS system performs well when the frequency of data detection is higher than the frequency of query issuing. For fairness in comparison, the ratio of query issuing frequency to data detection frequency in this simulation is 1/10. We also compare the communication cost of SSA with SR-GHT and the Naive algorithm under different data detection frequencies.

5.1.3. Query settings
Each sensor on average generates one query, and the query value within each range is also uniformly distributed in [0, 1]. The default ratio of query rate to data detection rate is 1/10. We fix the data detection rate and vary the query rate so that the ratio of query rate to data detection rate varies from 10% to 100%.

5.1.4. Node failure
We also compare the insertion cost and query cost of SSA, SR-GHT, and the Naive algorithm under different node failure rates. We vary the node failure probability from 0 to 0.5 to see how the performance is affected.

5.1.5. Performance metrics
The performance metric employed in the simulations is the number of exchanged messages, which includes the data insertion cost and the query processing cost. For the data insertion cost, we record the number of exchanged messages required for each data item that is detected in one sensor and forwarded to the corresponding storage node. For the query processing cost, we record the number of exchanged messages required for processing a similarity search query, which includes forwarding the query to the corresponding storage node, executing the similarity search mechanism, and forwarding the results to the query node.

5.1.6. Similarity search algorithm settings
This research focuses on one-dimensional data. That is, sensors in this sensor network only detect data of one event type. The number of levels of the Hilbert curve varies from one to two when the number of nodes is small, and from one to four when the number of nodes varies from $10^3$ up to $10^5$. For convenience, we list all the parameters of the simulation and their values in Table 1.

Table 1
Parameters and values.

Parameter                          Values
Node density                       1 node/256 m², 1 node/128 m², 1 node/64 m²
Radio range                        40 m
Number of sensors (small scale)    50, 100, 150, 200
Number of sensors (large scale)    10^3, 10^4, 10^5
Level of SR-GHT                    1, 2, 3, 4
Level of Hilbert curve in SSA      1, 2, 3, 4
Data items per node                1, 2, 3, 4, ..., 10 data items/node
Queries per node                   1, 2, 3, 4, ..., 10 queries/node
Node failure probability           0.1, 0.2, 0.3, 0.4, 0.5


5.2. Network size
We first compare the performance of SSA and SR-GHT in processing similarity search queries under different network sizes. SSA and SR-GHT are run in two scales of sensor network, a small one and a large one. The small network size varies from 50 to 200 sensors, and the large network size varies from $10^3$ to $10^5$ sensors. The density of the network remains at 1 node per 256 m². Each sensor on average generates 10 detected data items for each similarity search query it issues. Fig. 9 gives the cost per sensor for level = 1 and level = 2 in a small sensor network, and Fig. 10 gives the result for level = 1 up to level = 4 in a large sensor network. In the following, we list the interesting observations found in Figs. 9 and 10.

1. The query cost of SSA is lower than that of SR-GHT at all network sizes. The reason is that drastically fewer storage nodes are visited while processing a similarity search query in SSA, whereas all the storage nodes have to be visited in SR-GHT. As a result, the query cost of SR-GHT increases significantly with the expansion of the network size and the increase of the Hilbert curve level.
2. The data insertion cost of SSA is higher than that of SR-GHT. This is because the detected data is stored locally in SR-GHT, but is forwarded to an assigned indexing node in SSA, and the assigned indexing node may be far away from the detecting sensor. Although SR-GHT has the lower insertion cost, SSA outperforms SR-GHT in total cost.
3. As the number of levels increases, the above two observations remain the same. The query processing cost of SR-GHT increases exponentially with the number of levels, whereas the query processing cost of SSA remains quite stable.

5.3. Node density
The default node density is 1 node per 256 m². In this experiment, the size of the network is kept constant at 227 m × 227 m. The node densities in the experiments are 1 node per 256 m², 1 node per 128 m², and 1 node per 64 m², which correspond to a total of approximately 200, 400, and 800 nodes, respectively. We compare the average insertion cost and the average query cost of SSA and SR-GHT under different node densities.

The performance result, which is shown in Fig. 11, reveals that the average insertion cost and the average query cost of SSA and SR-GHT are insensitive to the variation of node density. The reason is that the message exchanges counted here are those needed to forward a data item or query to the corresponding storage node, and an increase of node density does not change the geographical distance between a sensor and the target indexing node. The GPSR routing protocol chooses the node within radio range that is closest to the target node to relay the data, no matter how densely the sensors are deployed. For example, Fig. 12(a) shows a sparsely deployed sensor network, in which sensor $S_1$ is going to send the detected data to indexing node $I_1$. $S_1$ chooses $S_2$, which is the node closest to $I_1$ within $S_1$'s radio range, to relay the data. If the sensor network is densely deployed, as shown in Fig. 12(b), $S_1$ will still choose the node closest to $I_1$ within its radio range to relay the data, which is $S_3$ in this example. Hence, if the sensors are deployed densely, the insertion cost might be slightly lower than that in a sparsely deployed environment. That is why, in Fig. 11, the insertion cost of SR-GHT and SSA decreases slightly as the density increases.
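The relay choice described above corresponds to the greedy forwarding mode of GPSR, which can be sketched as follows. This is a simplified illustration: the function name `greedy_next_hop` and the coordinates are hypothetical, positions are 2-D points in metres, and GPSR's perimeter mode (used when no neighbor is closer to the target than the current node) is not modelled.

```python
import math


def greedy_next_hop(current, neighbors, target, radio_range=40.0):
    """Pick the neighbor within radio range that is closest to the target.

    Returns None when no in-range neighbor is closer to the target than
    the current node (where GPSR would switch to perimeter mode).
    """
    in_range = [n for n in neighbors if math.dist(current, n) <= radio_range]
    closer = [n for n in in_range
              if math.dist(n, target) < math.dist(current, target)]
    if not closer:
        return None
    return min(closer, key=lambda n: math.dist(n, target))


s1, i1 = (0.0, 0.0), (100.0, 0.0)
sparse = [(30.0, 0.0)]
dense = [(30.0, 0.0), (38.0, 5.0), (10.0, 10.0), (50.0, 0.0)]
print(greedy_next_hop(s1, sparse, i1))
print(greedy_next_hop(s1, dense, i1))
```

In the dense case the chosen relay lies closer to the edge of the radio range, so each hop covers slightly more distance, which is consistent with the slightly lower insertion cost at higher densities.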

Fig. 9. Total cost of SSA and SR-GHT in a small network. Panels: Level = 1 and Level = 2. Horizontal axis: number of sensors (50 to 200); vertical axis: cost / number of sensors. Curves: query cost and insertion cost of SSA and of SR-GHT.

Fig. 10. Total cost of SSA and SR-GHT in a large network. Panels: Level = 1 to Level = 4. Horizontal axis: number of sensors ($10^3$ to $10^5$); vertical axis: cost / number of sensors. Curves: query cost and insertion cost of SSA and of SR-GHT.

Fig. 11. The average cost of SSA and SR-GHT under different node density. Horizontal axis: node density (node/m²) at 1/256, 1/128, and 1/64.

Fig. 12. Number of message exchanges under different node density.

5.4. Node distribution
We compare the performance of SSA and SR-GHT under different node distributions of a sensor network. SSA and SR-GHT are run under two types of node distribution, random and skewed. In a random distribution, sensors are randomly deployed in the network. In a skewed distribution, 80% of the nodes are randomly deployed in one half of the network area, and the other 20% of the nodes are randomly deployed in the other half. Fig. 13 shows an example of such a skewed distribution of sensors in a sensor network.

Fig. 13. A skewed distribution of sensors (number of sensors = 200).

We compare the insertion cost and query cost of SSA and SR-GHT under the two distributions of sensors. The performance result, which is shown in Fig. 14, shows that the curves for the random distribution and the skewed distribution are so close to each other that they almost overlap in the figure, for both SSA and SR-GHT. The reason, as explained in Section 5.3, is that a change of node density does not change the geographical distance between a sensor and the target indexing node. Therefore, for both the insertion cost and the query cost, the results of the different sensor distributions are about the same. Hence, we conclude that both the SSA and the SR-GHT algorithms are insensitive to node distribution.

5.5. Partition level
We also observe the performance of SSA and SR-GHT under different partition levels of the sensor network. In Fig. 15, we compare their average query processing cost, average data insertion cost, and average total cost when the number of sensor nodes is 1000.

1. Insertion cost and query cost: The insertion cost of SR-GHT drops as the number of levels increases. The reason is that the observed data is stored at the nearest mirror node in SR-GHT. When the number of levels increases, the number of mirror nodes greatly increases, so that on average a detected data item is closer to the mirror node where it is stored. The query cost of SR-GHT, however, increases much more significantly as the number of levels increases, because all data in the mirror nodes need to be sent to the query node.

Fig. 14. The cost of SSA and SR-GHT in a skewed distribution of sensors. Curves: SSA in a random deployment of sensors, SSA in a skewed deployment of sensors, SR-GHT in a random deployment of sensors, SR-GHT in a skewed deployment of sensors.

Fig. 15. Comparison on the number of levels in a large sensor network.

2. Total cost: The total cost of the SR-GHT method is higher than the cost of SSA for all levels of the Hilbert curve. The higher the number of levels, the greater the difference between the two algorithms. Notice that the vertical scale in Fig. 15 is measured in the unit of algorithm. Hence, when level = 4, the SSA method is more than one order of magnitude better than SR-GHT. 3. Sensitivity to the number of levels: Each cost component of SSA remains about the same for all levels of Hilbert curve. Hence, the SSA method is insensitive to the number of levels. The number of levels actually represents the amount of generated data. When the amount of generated data increases, a higher level of Hilbert curve has to be used to contain so much data. This means the SSA method has a very nice feature that it is insensitive to the amount of data generated. Hence, the SSA method is perfectly scalable in terms of data size. 5.6. Query rate and data detection rate Here, we examine the effect of query rate and data detection rate on SSA, SR-GHT and the Naive algorithm. The Naive algorithm (or simply Naive hereafter) essentially takes a local storage approach. It stores data locally at the detecting node and simply floods queries to all sensors when a query is issued. Therefore, there is no insertion cost in the Naive algorithm and the total cost of this algorithm is only the query execution cost. Let us refer to the ratio as the Q-to-D ratio and the ratio of data detection rate to query rate as the D-to-Q ratio. In Fig. 16(a), we fix the data detection rate and vary the query rate so that the Q-to-D ratio varies from 10% to 100%. In Fig. 16(b), however, we fix the query rate and vary the data detection rate so


Fig. 16. Performance comparison under different query and data detection ratio.

that the D-to-Q ratio varies from 10% to 100%. We compare the insertion cost and the query cost of SSA, SR-GHT, and Naive.

The results in Fig. 16(a) reveal that SSA is less sensitive to an increase in the query arrival rate than SR-GHT and Naive are. The query cost of SSA increases by approximately 80 message exchanges per node as the Q-to-D ratio increases from 10% to 100%. In contrast, the query cost of SR-GHT increases by approximately 5000 message exchanges per node, and the query cost of Naive becomes approximately 10 times higher.

In Fig. 16(b), the query rate is fixed and the data detection rate is varied so that the D-to-Q ratio varies from 10% to 100%. The query costs of SSA, SR-GHT, and Naive all remain fixed. Naive requires the highest query (i.e., total) cost because every sensor may hold qualifying data, so all sensors have to participate in processing each query. The insertion cost of SR-GHT is lower than that of SSA because a detected data item is always sent to the nearest storage node/mirror node in SR-GHT. However, the total cost of SR-GHT is much higher than that of SSA, especially when the D-to-Q ratio is low. We can therefore conclude from this figure that SSA is more suitable for environments with a higher query rate (i.e., more users), whereas the performance of SR-GHT can come close to that of SSA when the data detection rate is very high.

5.7. Node failure

In a real sensor network, nodes may fail for various reasons. In this subsection, we compare the insertion cost and the query cost of SSA, SR-GHT, and Naive in a small as well as a large sensor network. Since the second type of workload sharing mechanism of SSA (which, as discussed in Section 3.5, handles sudden damage to the hardware of a sensor) requires much more


storage and communication cost than the first type, we analyze here the insertion cost and query cost of the second type of workload sharing mechanism as a worst-case analysis of SSA. Likewise, the data of SR-GHT and Naive also have to be replicated, with a second copy stored in a mirror node, to ensure that queried data can still be found if a node suddenly fails.

We assume that sensors are randomly deployed in a network with a partition level of three. Each sensor generates 10 data items on average, which means that 2000 and 10,000 data items are generated in total in a small and a large sensor network, respectively. Each sensor issues one query on average, so that 200 and 1000 queries are issued in a small and a large sensor network, respectively. Let the node failure probability be 0.1. Fig. 17(a) and (b) show the average insertion cost and the average query cost, respectively, of SSA, SR-GHT, and Naive under different failure probabilities in a small sensor network, and Fig. 18(a) and (b) show the same costs in a large sensor network. The costs of each algorithm with no failed nodes are shown in the figures as dotted lines, and we use them as the baseline for comparison. Notice that the insertion cost of Naive with no failed nodes coincides with the x-axis in Figs. 17(a) and 18(a), because the Naive algorithm stores the detected data locally and requires no insertion cost.

When nodes fail, Fig. 17(a) shows that SSA requires a higher insertion cost than SR-GHT, but Fig. 17(b) shows that the query cost of SSA is much lower than that of SR-GHT. Notice that the difference in query cost between these two algorithms is on average around 800, but the difference in insertion cost between these two algorithms is only

Fig. 17. Performance comparison considering node failure in a small sensor network.


Fig. 18. Performance comparison considering node failure in a large sensor network.

about 200. Hence, the overall performance of SSA is much better than that of SR-GHT. Similar conclusions also hold for a large sensor network.

5.8. Workload of indexing nodes and storage nodes

In this section, we first illustrate how the workload of the indexing nodes in SSA and of the storage nodes in SR-GHT changes with the number of levels. Then, we examine this workload under different data detection rates and query rates. In Fig. 19, we vary the number of levels from 1 to 3 and the D-to-Q ratio from 10% to 100%. In Fig. 20, we vary the Q-to-D ratio from 10% to 100%, again with the number of levels varied from 1 to 3. We use the hot spot usage to evaluate the performance. The hot spot usage is the maximum total cost (i.e., insertion cost plus query cost) of an indexing node in SSA or of a storage node in SR-GHT. For ease of presentation, we call the indexing node with the maximum total cost the hot spot in SSA, and the storage node with the maximum total cost the hot spot in SR-GHT.

In Figs. 19 and 20, the insertion cost and query cost of the hot spot in SSA decrease as the number of levels increases. The reason is that the number of indexing nodes increases with the number of levels, so the workload of the hot spot in SSA is shared by more indexing nodes. Although the same observation holds for the insertion cost in SR-GHT, the query cost of the hot spot in SR-GHT increases exponentially with the number of levels. The reason is that


Fig. 19. Performance comparison considering hot spot by varying data detection rate.

in SR-GHT, all the storage nodes have to be visited. Therefore, increasing the number of levels in SSA effectively alleviates the hot spot problem, whereas in SR-GHT it incurs a high query cost. These two observations remain the same as the D-to-Q or Q-to-D ratio increases.
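
The hot spot usage metric defined in this section can be illustrated with a minimal sketch. It assumes that each node keeps two counters, an insertion cost and a query cost, both measured in message exchanges. The function name `hot_spot_usage`, the node identifiers, and the numbers are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
from typing import Dict, Tuple


def hot_spot_usage(costs: Dict[str, Tuple[int, int]]) -> Tuple[str, int]:
    """Return (node_id, total_cost) for the node with the maximum total cost.

    `costs` maps a node id to (insertion_cost, query_cost).
    The total cost of a node is insertion cost plus query cost.
    """
    if not costs:
        raise ValueError("costs must contain at least one node")
    # The hot spot is the node whose total cost is largest.
    node = max(costs, key=lambda n: sum(costs[n]))
    return node, sum(costs[node])


# Hypothetical per-node counters (illustrative values only).
example = {"n1": (40, 15), "n2": (10, 70), "n3": (25, 25)}
print(hot_spot_usage(example))  # ('n2', 80)
```

In this toy input, node `n2` has the lowest insertion cost but is still the hot spot because its query cost dominates, which mirrors the way query cost drives the hot spot in the SR-GHT results discussed above.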


Fig. 20. Performance comparison considering hot spot by varying query rate.
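
The experiments in Section 5.8 vary the number of levels of the Hilbert curve. As a reminder of what a level means geometrically, a level-k Hilbert curve visits every cell of a 2^k × 2^k grid exactly once, so it has 4^k positions, and consecutive positions always fall in neighboring cells. The sketch below uses the well-known iterative index-to-cell conversion. It is a generic illustration and not necessarily the mapping used by SSA; the function name `hilbert_d2xy` is our own.

```python
from typing import Tuple


def hilbert_d2xy(level: int, d: int) -> Tuple[int, int]:
    """Map position d on a level-`level` Hilbert curve to a grid cell (x, y).

    The grid has side 2**level, so valid positions are 0 .. 4**level - 1.
    """
    side = 1 << level
    if not 0 <= d < side * side:
        raise ValueError("d is outside the curve")
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the current sub-square so sub-curves connect.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


# The four cells of a level-1 curve, in visiting order.
print([hilbert_d2xy(1, d) for d in range(4)])  # [(0, 0), (0, 1), (1, 1), (1, 0)]
```

Because neighboring positions on the curve land in neighboring cells, values that are close in the one-dimensional ordering stay close in the two-dimensional layout, which is the adjacency property the design relies on.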


6. Conclusion

In this paper, we proposed the design and implementation of an algorithm for processing similarity search queries in sensor networks. Our design applies the concept of the Hilbert curve to sensor networks such that semantically related data are mapped to adjacent indexing nodes. A similarity search algorithm was proposed for efficiently processing similarity search queries. Such a query can be routed directly to an indexing node to find the matching result or the one that is closest to the given query. The major advantage of this design is that it drastically reduces the communication cost of processing similarity search queries. Our performance study showed that this design exhibits superior performance in terms of energy consumption in both small and large sensor networks, and under both low and high query arrival rates.

Currently, we are extending the capability of this design to deal with multi-dimensional similarity search queries. Indexing multi-dimensional data is difficult as it requires an intelligent mapping to a two-dimensional sensor network so as to maintain the adjacency in the two-dimensional space. An efficient query processing technique that works for multi-dimensional data is also being designed.