
Future Generation Computer Systems 36 (2014) 102–119


Scalable and elastic event matching for attribute-based publish/subscribe systems

Xingkong Ma, Yijie Wang ∗, Qing Qiu, Weidong Sun, Xiaoqiang Pei

National Key Laboratory for Parallel and Distributed Processing, School of Computer, National University of Defense Technology, Changsha, 410073, PR China

Highlights

• We propose a scalable event matching service for the attribute-based pub/sub model.
• HPartition adapts to skewed subscriptions and achieves high matching throughput.
• PDetection adapts to the sudden change of workloads with low latency.
• We implement a thorough and systematic evaluation of our approach.

Article info

Article history: Received 19 December 2012; Received in revised form 28 July 2013; Accepted 9 September 2013; Available online 25 September 2013.

Keywords: Publish/subscribe; Attribute-based; Event matching; Content space partitioning; Cloud computing

Abstract

Due to the sudden change of the arrival rate of live content and the skewness of the large-scale subscriptions, the rapid growth of emergency applications presents a new challenge to current publish/subscribe systems: providing a scalable and elastic event matching service. However, most existing event matching services cannot adapt to the sudden change of the arrival rate of live content, and they generate a non-uniform distribution of load on the servers because of the skewness of the large-scale subscriptions. To this end, we propose SEMAS, a scalable and elastic event matching service for attribute-based pub/sub systems in the cloud computing environment. SEMAS uses a one-hop lookup overlay to reduce the routing latency. Through a hierarchical multi-attribute space partition technique, SEMAS adaptively partitions the skewed subscriptions and maps them into balanced clusters to achieve high matching throughput. The performance-aware detection scheme in SEMAS adaptively adjusts the scale of servers according to the churn of workloads, leading to a high performance–price ratio. A prototype system on an OpenStack-based platform demonstrates that SEMAS has a linearly increasing matching capacity as the number of servers and the partitioning granularity increase. It is able to elastically adjust the scale of servers and tolerate a large number of server failures with low latency and traffic overhead. Compared with existing cloud based pub/sub systems, SEMAS achieves higher throughput under various workloads.

1. Introduction

As the real-time requirement of data dissemination becomes increasingly significant in many fields, emergency applications have received increasing attention, for instance, stock quote distribution, earthquake monitoring [1], emergency weather alerts [2], smart transportation systems [3], and social networks. Recently, the development of emergency applications has demonstrated two trends. One is the sudden change of the arrival rate of live content. Take ANSS [1] as an example: its mission is to provide real-time and accurate seismic information for emergency response personnel.



∗ Corresponding author. Tel.: +86 13308491230.
E-mail addresses: [email protected] (X. Ma), [email protected] (Y. Wang), [email protected] (Q. Qiu), [email protected] (W. Sun), [email protected] (X. Pei).
http://dx.doi.org/10.1016/j.future.2013.09.019

Millions of messages are generated by sensors in a short time when an earthquake happens, while few events are generated if there is no earthquake. The other is the skewness of the large-scale subscriptions. That is, a large number of subscribers demonstrate similar interests. For instance, the dataset [4] of 297 K Facebook users shows that the hottest 100 topics together have more than 1.1 million subscribers, while 71% of topics have no more than 16 subscribers.

The publish/subscribe (pub/sub) paradigm is a key technology for asynchronous data dissemination and is widely used in emergency applications. It decouples the senders and receivers of the emergency applications in space, time, and synchronization [5], which enables a pub/sub system to seamlessly expand to a massive size. However, traditional pub/sub systems face a number of challenges. Firstly, the system must guarantee real-time event matching capacity when it expands to a very large scale. For instance, Facebook contains billions of users, and 684,478 pieces of content are


published on average every minute [6]. Secondly, the system needs to be elastic to the sudden change of the incoming event rate to achieve a high performance–price ratio. This is because if a fixed number of servers is deployed in response to the sudden change of the incoming event rate, numerous servers sit idle and deliver few messages most of the time. Thirdly, the service must be tolerant to server failures. In emergency applications, a large number of machines and links may become unavailable instantaneously due to hardware errors or operator mistakes, which leads to the loss of events and subscriptions.

Existing pub/sub systems are not adequate to efficiently address all the above challenges. In the broker based pub/sub systems [7–16], all publishers and subscribers are directly connected to a group of servers, known as brokers. Subscriptions are commonly replicated to all brokers or a part of the brokers, so that each broker can match events and forward them to the interested subscribers. However, replicating subscriptions means that each event is matched against the same subscriptions many times, which leads to high matching latency and low scalability when a large number of events and subscriptions arrive. Moreover, it is difficult to provide an elastic service to deal with changing workloads. This is because these systems often over-provision brokers to reduce their loads, and there are no financial incentives to reduce the scale of brokers during off-peak hours. In contrast, a large number of P2P based systems [17–23] do not provide dedicated brokers. All nodes are organized into a P2P based overlay [24–26], and they act both as publishers and subscribers. All events and subscriptions are forwarded through multi-hop routing. The subscriptions falling into the same subspace of the entire content space are organized into a multicast group or stored in a rendezvous node. With the growth of the arrival event rate, the multi-hop routing may lead to high latency and traffic overhead. A large body of skewed subscriptions may incur unbalanced load on the multicast groups or rendezvous nodes, which imposes a limit on scalability. Besides that, it is hard for P2P based systems to provide an elastic service due to the unpredictability of node behavior.

Recently, cloud computing has become a new infrastructure for developing large-scale distributed applications over the Internet because of its low capital cost and powerful storage and computing capacities [27,28], which provides great opportunities to meet the requirements of complex computing [29] and high-speed communication. Cloud service providers deploy many geographically distributed datacenters to support a large number of users around the world. Because of the high bandwidth and reliable links in the cloud environment, Cassandra [30] uses consistent hashing [31] to provide one-hop lookup and organize servers into a scalable overlay. Thus, the main challenges of designing attribute-based pub/sub systems in the cloud computing environment lie in the partitioning technique and the elastic strategy. Based on the above-mentioned one-hop lookup technique, a number of studies [32–34] focus on the cloud computing environment to provide pub/sub services, where the most relevant one to our work is BlueDove [33]. There are two main components in BlueDove. One is the multi-dimensional subscription space partitioning technique, which provides multiple candidate servers for each message. The other is the performance-aware forwarding technique, which ensures that each message is sent to the least loaded candidate server for matching. However, the partitioning technique in BlueDove may render the system poorly scalable. Firstly, BlueDove divides every single dimension of the content space into multiple clusters, each of which is uniformly mapped to a server. Since each cluster only concerns a specific range of a single dimension, a hot cluster can easily form if a large number of users subscribe to the same range of one dimension, which severely hurts the matching throughput. Secondly, the performance-aware forwarding scheme of BlueDove


enables the system to keep the workload of the servers balanced by dispatching each message to the candidate server with the lowest processing time. However, it intrinsically cannot alleviate the hot spots: when the distribution of subscriptions is flat, each server may be responsible for a number of hot spots, which severely decreases the performance of each server. Besides, although BlueDove provides elastic servers to adapt to a sudden increase of the workload by detecting system saturation, it does not point out how to adaptively decrease the scale of servers to achieve a high performance–price ratio when the event arrival rate decreases.

Motivated by these factors, we present SEMAS, a scalable and elastic event matching service for an attribute-based pub/sub system in the cloud computing environment. Generally speaking, the main novelty of SEMAS lies in a hierarchical multi-attribute space partition technique (called HPartition) and a performance-aware detection technique (called PDetection). The first aims to improve the matching throughput in a scalable manner. HPartition divides the entire content space into multiple hypercubes, each of which concerns a specific range of each dimension. That is, a hot cluster is formed only if a large number of users are interested in the same ranges of all dimensions, which greatly reduces the number of hot clusters. Moreover, HPartition hierarchically divides a hot cluster into multiple small ‘‘cold’’ clusters, which are dispatched again to uniformly selected servers using consistent hashing. This iterative partition scheme ensures that each cluster has a moderate size, which brings low matching latency and high throughput. The second aims to achieve elastic and reliable event matching. Specifically, PDetection elastically adjusts the scale of servers according to the churn of workloads. To detect the change of workloads, PDetection adopts a light-weight gossip based aggregation technique [35] to periodically check the performance of the servers. When the waiting time on the worst server rises above the maximum threshold or falls below the minimum threshold, a number of servers are added or removed, respectively, to keep a high performance–price ratio. To reduce the reconfiguration delay and traffic overhead, PDetection redistributes each cluster by migrating only the identifications of subscriptions. By synchronizing the status of all servers, PDetection ensures a reliable and continuous event matching service.

To evaluate the performance of SEMAS, we design and implement a prototype on our OpenStack-based testbed. The experimental results show that SEMAS presents linearly increasing event matching throughput as the number of servers and the partitioning granularity increase, rapid response to the sudden change of workload, strong reliability against large-scale server failures with low latency and traffic overhead, and higher throughput under various workloads compared with the existing cloud based approaches. As a summary, the primary contributions are given as follows.

• We propose a hierarchical multi-attribute space partition scheme to achieve high matching throughput with numerous skewed subscriptions.
• We propose a performance-aware detection scheme to ensure elastic and reliable matching service under the sudden change of workloads.
• We implement a thorough and systematic evaluation of SEMAS as well as various other matching approaches to evaluate its performance.

The rest of this paper is organized as follows. In Section 2, we outline the related works. In Section 3, we introduce the data model and system architecture of SEMAS. In Section 4, we describe the full SEMAS protocol. We implement SEMAS and analyze its performance in Section 5. Finally, this paper is concluded in Section 6. Table 1 shows the key parameters used in this paper.


Table 1
Notations and their meanings.

Ω       The entire content space
k       The number of attributes of Ω
Ai      The attribute, i ∈ [1, k]
Ri      The range of Ai
Pi      The predicate of Ai
Nseg    The number of segments on Ri
Nm      The number of matchers
Nd      The number of dispatchers
Nsub    The number of subscriptions
Gi      The cluster with the identification i
N′seg   The number of segments on each range of hot clusters
α       The minimum size of the hot cluster
Ds      The synchronizing server

2. Related works

According to the overlay infrastructure, existing works on attribute-based pub/sub systems can be categorized into broker based systems, P2P based systems and cloud based systems.

2.1. Broker based systems

In broker based systems, publishers and subscribers are directly connected to a group of servers, known as brokers. Brokers route all events among themselves and disseminate these events to their corresponding subscribers. According to the event matching algorithms, existing methods can be grouped into centralization based methods, multicast based methods, and pruning tree based methods.

Firstly, the centralization based methods match each event against all subscriptions on the brokers before dissemination. All brokers have full knowledge of the subscriptions. They use a matching algorithm against all subscriptions to compute the interested subscribers for each event, and provide a corresponding data replication scheme [36] to ensure reliable matching. For instance, Gryphon [9] replicates all the subscriptions to each broker. Each broker then uses a parallel search tree to match events, and computes a shortest paths tree for the matching results. In MEDYM [10], all publishers have a global view of all subscriptions, and utilize the existing testing network algorithm [37] or extended counting algorithm [38] as an independent plug-in module to match events. More details of centralized matching algorithms can be found in [39]. However, the main scalability issue of the centralization based methods is how to efficiently match events against a large number of subscriptions. Recall that the emergency scenarios are characterized by the sudden change of the arrival event rate and a large number of skewed subscriptions, which brings high matching latency to the centralization based methods. To this end, SEMAS adopts multiple parallel servers in the cloud environment to provide a scalable event matching service.

Secondly, the multicast based methods construct a bounded number of clusters before event dissemination begins. Each cluster is in charge of a subspace of the entire content space. All the subscriptions that fall into the same clusters are organized into a multicast group. When an event is published, it is mapped to the corresponding cluster to avoid uninterested nodes. For instance, Kyra [8] is designed with a two-level topology. At the bottom level, brokers are grouped into cliques based on their network proximity. The brokers in the same clique know each other, and partition the entire content space into non-overlapping zones. At the top level, multiple multicast trees are built among the brokers that are in charge of similar zones. Opyrchal et al. [11] exploit a limited number of IP multicast groups to provide an optimal path to deliver events to a group of subscribers. Riabov et al. [12] use a grid-based clustering framework to divide the content space into cells,

and then each event is matched in its corresponding cell. Unfortunately, due to the dynamics of the subscriptions and the participating nodes, maintaining clusters in the multicast based methods incurs high latency and bandwidth cost. Furthermore, the frequent clustering results in more uninterested nodes in each cluster and severely reduces the matching throughput. Compared with the multicast based methods, SEMAS groups similar subscriptions onto the same server via one hop, which greatly reduces the clustering latency.

Thirdly, the pruning tree based methods route and match each event along a pruning tree overlay to avoid uninterested brokers. The basic idea of the pruning tree based methods is to broadcast subscription announcements to update the routing table of each broker, and then each event is routed to the brokers in the routing table that have matching subscriptions [13]. Based on this idea, a lot of systems have been proposed to optimize the performance. To improve the matching throughput, a number of methods [14,15] have been proposed to compress the size of routing tables by means of summarization. To further reduce the size of the routing table, a lot of methods use rooted trees to organize nodes. Each subscription is propagated to the root of the tree, where the propagation stops. In contrast, each event is propagated starting from the root to the subtrees with matching subscriptions. For instance, Hermes [7] uses Pastry [25] as the routing overlay to build diffusion trees. The root of each tree is in charge of the unique key associated with the language type of the content space. Terpstra et al. [16] build a different routed spanning tree for each publisher based on Chord [40]. More details of pruning tree based methods can be found in [41]. Compared with the centralization based methods, these pruning tree based methods do not need to store all subscriptions in each routing table. However, each event may be matched against the same subscriptions on different brokers many times. Besides that, an event possibly traverses every broker to match subscriptions, which brings high matching latency. In SEMAS, when an event arrives at the cloud, it is assigned to the corresponding server via one hop and matched against a small set of subscriptions on this server.

2.2. P2P based systems

In P2P based pub/sub systems, an application-layer overlay is maintained to connect all participants. There are no dedicated brokers. Each node can act as both a subscriber and a publisher. The incoming events are routed to the corresponding receivers through the overlay. Existing methods can be grouped into semantic neighborhood based methods and rendezvous based methods.

Firstly, the semantic neighborhood based methods group similar nodes according to the semantic similarity between nodes. They are commonly suited to P2P unstructured overlays. For instance, DPS [18] maintains all subscriptions in a semantic tree. Most paths in the tree cross nodes satisfying an inclusion relation, which reduces the matching latency significantly. To extend the content space to multiple attributes, each subscription arbitrarily selects a tree containing one of its predicates, and each event has to be assigned to all trees associated with all its values. In contrast, Sub-2-Sub [19] dynamically aggregates similar subscriptions based on the neighborhood exchanging protocol [24]. All nodes periodically exchange and update their views to obtain the similar subscriptions, which are grouped into optimal delivery clusters [41]. These clusters are then connected into multiple bidirectional rings, each of which is built by periodically exchanging views with each other. Besides that, Brushwood [20] builds a parallel search tree to match events according to the semantic similarity between nodes. These semantic neighborhood based methods provide a flexible communication model, where each node can


connect to any other node regardless of network proximity and communication constraints. However, due to the dynamic join/leave of nodes, maintaining the overlay leads to high latency and traffic cost, which may severely limit the scalability of event matching. In contrast, SEMAS adopts a one-hop lookup technique to organize servers into a scalable overlay, which leads to less maintenance overhead and routing latency.

Secondly, the rendezvous based methods map all events and subscriptions to key spaces, and assemble them at rendezvous nodes to match each other. In order to reduce the routing latency, the rendezvous based methods commonly adopt a P2P structured overlay [25,26,40] to manage all keys and nodes. For instance, Tam et al. [21] map each event and subscription to an index digest based on Pastry [25]. Each index digest is generated by concatenating the attribute type, name, and value of each attribute in the index. To support range queries, the range of each attribute is divided into intervals. Then the interval indicators are used to build the index digest. Meghdoot [22] maps subscriptions and events to a 2n-dimensional space through the CAN [26] network. To improve load balance, a newly arriving node actively contacts a heavily loaded node. Then the overloaded node migrates half of its load to the new node. PastryString [17] constructs a distributed index tree for each attribute on the Pastry network to support rich queries on both numerical and string attributes. Li [23] proposes to map each attribute to a 2d-tree over a DHT. More details of rendezvous based methods can be found in [41,42]. These rendezvous based methods avoid uninterested nodes being involved in the matching process. However, with numerous skewed subscriptions, the heterogeneity of nodes and clusters results in load imbalance on the nodes, which hurts the matching throughput. In SEMAS, the servers are virtual machines with the same computing and storage capacities in the cloud. Moreover, SEMAS uses a hierarchical multi-attribute space partition technique to alleviate the skewness of subscriptions and guarantee better workload balance among servers.

2.3. Cloud based systems

The methods in the cloud based systems match all arriving events on the servers of the cloud platform in parallel. Then the events are disseminated to the corresponding subscribers. Compared with the broker based and P2P based environments, the cloud computing environment allows the tenants to provide elastic service capacity based on the change of workloads, which brings a high performance–price ratio. Besides that, it provides reliable links, scalable servers, and high bandwidth for the service. Recently, many studies have turned to the cloud computing environment to provide a scalable pub/sub service. Amazon Simple Notification Service provides a topic based pub/sub service [43]. Ekanayake et al. [44] provide a pub/sub based communication library by utilizing the storage services of Windows Azure [27] to transfer data, and develop a distributed image retrieval application using the pub/sub library. However, it does not involve the event matching of the attribute-based pub/sub model. Dynamo [34] provides a highly available key–value storage system with one-hop routing. To reduce the matching latency, Move [32] proposes an adaptive event matching approach that considers a trade-off between replication and separation, but the data models of these keyword-based systems are different from our attribute-based model. As we mentioned earlier, the most relevant work to SEMAS is BlueDove [33], which groups subscriptions through a multi-dimensional subscription space partitioning technique and exploits data skewness through a performance-aware message forwarding technique. However, its subscription partitioning technique results in a large number of hot clusters since each cluster only concerns a specific range of a single dimension, and the performance of its performance-aware message forwarding technique depends on the data skewness.


Fig. 1. Overlay infrastructure.

3. SEMAS system architecture

3.1. Attribute-based pub/sub model

According to the expressive power of subscription models, pub/sub systems are classified into topic-based, attribute-based and content-based systems. In a topic-based model [45–48], events are usually matched against subscriptions by a unique identification denoted by a string, called a topic. In contrast, a more expressive form is the content-based model [49,50], where each subscription can be an arbitrary Boolean function, including disjunctions of predicates, conjunctions of predicates, and nested predicates (XML documents [51]), etc. As a trade-off between the simplicity of the topic-based model and the high expressiveness of the content-based model, the attribute-based model [7,8,17] expresses a subscription as a conjunction of predicates on the attributes of the content space, and an event as a series of values, one for each attribute. Many Internet-scale applications use the attribute-based model to disseminate information, including stock-market monitoring engines, earthquake monitoring [52], and emergency weather alerts [2]. In order to notify subscribers of the latest information in these applications, the key is to efficiently match incoming events against a set of subscriptions. To this end, we focus on the event matching of the attribute-based model.

The matching problem of the attribute-based pub/sub model can be viewed as testing whether a point in a space is contained within a series of given subspaces. Consider k attributes A1, A2, . . . , Ak, and let R1, R2, . . . , Rk be the sets of all possible values of each attribute. The whole attribute space is Ω = R1 × R2 × · · · × Rk. An event is defined as a point within Ω. It is a k-sized vector specifying exactly one value for each attribute, i.e., E = {r1, r2, . . . , rk}, where ri ∈ Ri. A subscription is a subspace within Ω. It is a conjunction of predicates S = P1 ∧ P2 ∧ · · · ∧ Pk. Each Pi, i ∈ [1, k], is defined as a continuous range of values on Ai, i.e., Pi = (li ≤ Ai ≤ ui). By this definition, an event E matches a subscription S if and only if for each predicate Pi in S the corresponding value ri of E falls into Pi. A lot of applications use this form of multi-dimensional conjunction of attribute predicates to express subscriptions. Taking Shakecast [52] as an example, a user may be interested in the seismic events in his/her proximate area, and the corresponding subscription can be mapped into: (30 ≤ Latitude ≤ 60) ∧ (120 ≤ Longitude ≤ 150), which indicates that the user wants to receive seismic events when an earthquake happens in the area with latitude range [30, 60] and longitude range [120, 150].
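To make this matching semantics concrete, the following minimal Java sketch (hypothetical types; not taken from the SEMAS prototype) checks whether an event, given as one value per attribute, satisfies a subscription expressed as a conjunction of range predicates:

import java.util.List;

// A predicate li <= Ai <= ui on one attribute.
record Predicate(double low, double high) {
    boolean contains(double v) { return low <= v && v <= high; }
}

final class AttributeMatch {
    // An event E = {r1, ..., rk} matches S = P1 ∧ ... ∧ Pk iff ri falls into Pi for every i.
    static boolean matches(double[] event, List<Predicate> subscription) {
        if (event.length != subscription.size()) return false;
        for (int i = 0; i < event.length; i++) {
            if (!subscription.get(i).contains(event[i])) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // The Shakecast-style example: (30 <= Latitude <= 60) ∧ (120 <= Longitude <= 150).
        List<Predicate> s = List.of(new Predicate(30, 60), new Predicate(120, 150));
        System.out.println(matches(new double[]{45, 130}, s));  // true
        System.out.println(matches(new double[]{45, 160}, s));  // false
    }
}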


3.2. Overlay infrastructure

At a high level, the servers participating in SEMAS are organized into a two-layer overlay in the cloud computing environment, inspired by the idea from the attribute-based pub/sub service BlueDove [33]. As shown in Fig. 1, the servers in the top layer and the bottom layer are called dispatchers and matchers, respectively.

At the top layer, the dispatchers are exposed to the Internet as the front-end servers. They are responsible for dispatching the received subscriptions and events. Any subscriber or publisher selects one of the dispatchers to connect to directly through the DNS [53]. Due to network anomalies or node crashes, a part of the subscriptions and events may be lost before they arrive at the dispatchers, which requires reliable routing between subscribers/publishers and dispatchers. A large number of reliable routing strategies in the P2P environment have been proposed, such as retransmission [54], path redundancy via multiple hops [55], Forward Error Correction [56], and epidemic algorithms [57]. Since the distance between subscribers/publishers and dispatchers is one hop in our overlay, the subscribers/publishers simply retransmit their lost content. When subscriptions/events arrive at a dispatcher, they are assigned to the appropriate matchers through a light-weight dispatching scheme. To achieve low routing latency, the dispatching strategy ensures that all subscriptions and events traverse only one hop from a dispatcher to a matcher using the consistent hashing approach [31].

At the bottom layer, subscriptions are divided into multiple clusters, each of which is managed by one matcher. When a matcher receives an event, it matches the event against the subscriptions in the cluster whose space contains the position of the event. Note that the matcher either broadcasts events to the corresponding subscribers directly or delivers them using a multicast technique. We utilize the well-studied dynamic multicast tree [10] to disseminate events to matched subscribers, and mainly focus on the scalable event matching problem. Compared with the broker based systems and the P2P based systems, such an overlay infrastructure shows a number of advantages:

• There is no need to maintain state among the clients, due to the decoupling of clients and servers, which enables the pub/sub system to adapt to frequently changing network environments, such as mobile phones or sensor networks. Each subscriber only sends its subscriptions to a dispatcher, and then periodically reports a heartbeat message to the dispatcher to update its state. In contrast, a node in the broker based overlay or the P2P based overlay commonly maintains a broker list or a neighbor list to keep the network connected, which leads to costly maintenance overhead.
• When subscriptions/events arrive at the dispatchers, similar subscriptions can be grouped and events can be matched via one hop among the servers of the cloud, thanks to the reliable links and high bandwidth of the cloud computing environment. This is a critical factor for low latency and high matching throughput. In contrast, when subscriptions/events arrive at the overlay network in the broker based or P2P based systems, they still need to traverse a large number of nodes via multiple hops to complete the matching jobs.
• The scale of servers can be changed with low latency and traffic overhead, such that the event matching service keeps a high performance–price ratio in case of a sudden change of workloads. In contrast, there are no incentives to reduce the scale of brokers in the broker based overlay, and the scale of nodes is unpredictable in the P2P based overlay.

4. SEMAS components

Our objective is to provide a scalable and elastic event matching service for attribute-based pub/sub systems. Although the two-layer overlay (Section 3.2) presents a lot of features that make it suitable for scalable event matching, we still need to provide more details on how SEMAS meets the challenges of a sudden change of the incoming event rate and the skewness of subscriptions. First, how should we group the skewed subscriptions so that each

message only needs to be matched on one matcher, while still keeping the load of the matchers balanced? A simple solution is to replicate each subscription to all matchers, and then match each event on a random matcher. Nevertheless, each event must then be matched against all subscriptions on that matcher, which leads to high latency and low throughput. Second, how should the system adjust the matching capacity elastically with the churn of the arrival event rate, while still providing a continuous event matching service? This is a key factor in a high performance–price ratio. In this section, we propose two main components to answer the above questions:

• Hierarchical multi-attribute space partition (HPartition). It decides how the subscriptions are stored on the matchers and which matcher is selected to match each event. Besides that, it adaptively eliminates the impact of the skewness of subscriptions to ensure the load balance of the matchers.
• Performance-aware detection (PDetection). It detects the sudden change of workloads according to the worst performance of the matchers, elastically adjusts the scale of servers, and redistributes each cluster among all matchers to adapt to the churn of workloads.

4.1. Hierarchical multi-attribute space partition

In order to utilize multiple matchers, we propose HPartition, a hierarchical multi-attribute space partition technique, to divide the content space among matchers, such that each matcher is in charge of a small subset of subscriptions. Generally speaking, HPartition divides the entire content space into multiple disjoint hypercubes, each of which is managed by one matcher (Section 4.1.1). The subscriptions and events that fall into the same hypercube are then matched on the same matcher (Sections 4.1.2 and 4.1.3). To ensure the load balance of the matchers, HPartition divides the skewed subscriptions among multiple matchers through a hierarchical partition scheme (Section 4.1.4). We provide the details of HPartition as follows.

4.1.1. Multi-attribute space partition

The multi-attribute space partition divides the content space into multiple disjoint clusters, each of which is mapped to a matcher. More specifically, for each attribute Ai and its range Ri, Ri is divided into Nseg continuous and disjoint segments {Ri^j, j = 0, 1, 2, . . . , Nseg − 1}, each of which is marked by a unique identification j, called the segmentID. Suppose k is the number of attributes; then the content space is divided into (Nseg)^k clusters {Gx1x2···xk = R1^x1 × R2^x2 × · · · × Rk^xk, xj ∈ [0, Nseg − 1]}, each of which is marked by a unique identification x1x2 · · · xk, called the clusterID.

To keep the load of the matchers balanced, each cluster is uniformly mapped to one matcher using consistent hashing [31]. The basic idea of consistent hashing is to associate each node with one or more hash value intervals, where the interval boundaries are determined by calculating the hash value of each node identifier. It guarantees balanced workloads among nodes and generates minimal reconfiguration cost when the number of nodes changes. Specifically, for each cluster Gx1x2···xk, its clusterID x1x2 · · · xk is mapped into a random integer Hash(x1x2 · · · xk) using the Murmurhash [58] approach. Similarly, for each matcher Mi, its matcherID i is mapped to a random integer Hash(i) using the same hashing approach, where i ∈ [1, Nm] and Nm is the number of matchers. Suppose all these hash values are distributed on a circle. Gx1x2···xk then searches for its target matcher along the circle in the clockwise direction from its position Hash(x1x2 · · · xk). It walks along the circle until it encounters the first matcher Mi with hash value Hash(i). Note that each cluster is managed separately on each matcher, which is a key factor for high matching throughput.


Fig. 2. An example of the multi-attribute space partition where each range of the two attributes is cut into 4 segments, and the entire content space is divided into 16 clusters. The clusterID of each cluster is the concatenation of the indices of all attributes. Each cluster is mapped to one matcher using the consistent hashing.

As shown in Fig. 2, the content space contains two attributes, whose ranges are both from 0 to 180, and each attribute is divided into 4 segments. The numbers in the boxes on the left side of Fig. 2 represent the corresponding clusterIDs. As shown on the right side of Fig. 2, each cluster is uniformly mapped to one matcher through consistent hashing.

4.1.2. Subscription assignment

In order to match each event on only one cluster, SEMAS assigns a subscription S to the clusters whose spaces overlap with that of S. That is, for a subscription S = P1 ∧ P2 ∧ · · · ∧ Pk and a cluster Gx1x2···xk = R1^x1 × R2^x2 × · · · × Rk^xk, the dispatcher finds all clusters that overlap with S, i.e., C(S) = {Gx1x2···xk | Ri^xi ∩ Pi ≠ Ø, ∀i ∈ [1, k]}. According to the above-mentioned multi-attribute space partition, S is forwarded to the matchers that manage the clusters in C(S). Before assigning S, the dispatcher first broadcasts S to all matchers. When a matcher receives S, it stores S into a global subscription list in the form ⟨S.id, S⟩, where S.id represents the unique identification of S. Then, for each cluster Gj in C(S), a two-tuple ⟨S.id, j⟩ is sent to its corresponding matcher based on the consistent hashing. After receiving ⟨S.id, j⟩, the matcher stores S.id into the cluster Gj.

One might wonder why we broadcast S to all matchers rather than only dispatching S to the matchers in C(S). This is because the same subscription may fall into different clusters due to a change of Nm or Nseg. Storing all the subscriptions on every matcher makes it possible to transfer only the identifications of the subscriptions between servers, which brings much less traffic and memory overhead. We will utilize this feature in Sections 4.1.4 and 4.2.2. Besides that, each dispatcher also broadcasts each new subscription to the other Nd − 1 dispatchers to avoid assigning the same subscription many times, where Nd is the total number of dispatchers. Each subscriber periodically sends its subscription as a heartbeat message to a random dispatcher. If the subscription has already been assigned to the corresponding matchers, there is no need to assign it again. Therefore, each dispatcher needs to store all subscriptions to check whether an incoming subscription has already been assigned.

As shown in Fig. 3, S falls into {G02, G03, G12, G13}. Algorithm 1 shows how to get the clusters that overlap with S. Firstly, we get the set of segmentIDs T[i] that each predicate Pi falls into (Algorithm 1, lines 2–5). Then the Cartesian product of all T[i] is the set of clusterIDs that S falls into (Algorithm 1, line 6). Finally, we get the corresponding matcherIDs through the consistent hashing.
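As an illustration of this cluster-to-matcher mapping, the following Java sketch keeps the ring in a sorted map and assigns each clusterID to the first matcher found clockwise from its hash position. It is our own simplification, not the SEMAS code, and it substitutes SHA-1 for the Murmurhash function cited above.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

final class ClusterRing {
    // Ring positions of matchers: hash(matcherID) -> matcherID.
    private final TreeMap<Long, Integer> ring = new TreeMap<>();

    void addMatcher(int matcherId) { ring.put(hash(Integer.toString(matcherId)), matcherId); }

    // Walk clockwise from hash(clusterID) to the first matcher on the ring.
    int matcherFor(String clusterId) {
        long h = hash(clusterId);
        SortedMap<Long, Integer> tail = ring.tailMap(h);
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    // Stand-in uniform hash (SHA-1 truncated to 8 bytes); SEMAS uses Murmurhash instead.
    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("SHA-1").digest(key.getBytes(StandardCharsets.UTF_8));
            long v = 0;
            for (int i = 0; i < 8; i++) v = (v << 8) | (d[i] & 0xff);
            return v;
        } catch (Exception e) { throw new IllegalStateException(e); }
    }

    public static void main(String[] args) {
        ClusterRing r = new ClusterRing();
        for (int m = 1; m <= 16; m++) r.addMatcher(m);   // Nm = 16 matchers
        System.out.println(r.matcherFor("02"));           // target matcher of cluster G02
    }
}

Because only the clusters between a node and its predecessor move when a matcher joins or leaves, this mapping keeps the reconfiguration cost low, a property that Section 4.2.2 relies on.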

Algorithm 1: getClusterIDs(S, Space, Nseg)
Input: S: a conjunction of predicates. Space: a specific content space. Nseg: the number of segments of each attribute.
Output: T(S): the clusterIDs whose spaces overlap S.
1   foreach predicate Pi of S do
2       Ri = getValueRange(Space.Ai);   // Ri is the range of attribute Ai
3       minSegID[i] = getSegmentID(Pi.li, Ri, Nseg);
4       maxSegID[i] = getSegmentID(Pi.ui, Ri, Nseg);
5       T[i] = {j | minSegID[i] ≤ j ≤ maxSegID[i]};
6   T(S) = T[1] × T[2] × · · · × T[k];
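For concreteness, a direct Java transcription of Algorithm 1 follows (a sketch with hypothetical helper names, not the SEMAS implementation): each predicate is mapped to the range of segment indices it overlaps, and the Cartesian product of these index sets gives the clusterIDs.

import java.util.ArrayList;
import java.util.List;

final class GetClusterIDs {
    // Segment index of a value within [rangeLow, rangeHigh) split into nSeg equal segments.
    static int segmentId(double value, double rangeLow, double rangeHigh, int nSeg) {
        int id = (int) ((value - rangeLow) / (rangeHigh - rangeLow) * nSeg);
        return Math.min(Math.max(id, 0), nSeg - 1);   // clamp boundary values
    }

    // Algorithm 1: clusterIDs whose spaces overlap subscription S (one [low, high] pair per attribute).
    static List<String> getClusterIds(double[][] predicates, double[][] ranges, int nSeg) {
        List<String> ids = new ArrayList<>();
        ids.add("");
        for (int i = 0; i < predicates.length; i++) {
            int min = segmentId(predicates[i][0], ranges[i][0], ranges[i][1], nSeg);
            int max = segmentId(predicates[i][1], ranges[i][0], ranges[i][1], nSeg);
            List<String> next = new ArrayList<>();
            for (String prefix : ids)
                for (int j = min; j <= max; j++) next.add(prefix + j);   // Cartesian product
            ids = next;
        }
        return ids;
    }

    public static void main(String[] args) {
        // S = (30 <= Latitude <= 60) ∧ (120 <= Longitude <= 150), both ranges [0, 180], Nseg = 4.
        double[][] s = {{30, 60}, {120, 150}};
        double[][] ranges = {{0, 180}, {0, 180}};
        System.out.println(getClusterIds(s, ranges, 4));   // [02, 03, 12, 13]
    }
}

Running it on the subscription of Fig. 3 with both ranges [0, 180] and Nseg = 4 yields the clusterIDs 02, 03, 12 and 13, matching the example above.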

4.1.3. Event matching

When an event arrives at a dispatcher, it is assigned to the corresponding cluster according to its values of all attributes. Because each event represents a single point of the entire content space, it falls into only one cluster. Formally, the event E = {r1, r2, . . . , rk} falls into the cluster Gx1x2···xk = R1^x1 × R2^x2 × · · · × Rk^xk iff ri ∈ Ri^xi for all i ∈ [1, k]. According to the consistent hashing, the dispatcher sends a two-tuple ⟨E, x1x2 · · · xk⟩ to the matcher whose hash value is closest to the value of Hash(x1x2 · · · xk) in the clockwise direction. After that, E is matched against the subscriptions in the cluster Gx1x2···xk. To reduce the event matching latency, SEMAS uses the extended counting algorithm [38] to match events in each cluster.

4.1.4. Hot cluster elimination

As discussed in Section 1, the distribution of subscriptions is often skewed. This means that a large number of subscriptions may fall into a few hot clusters, which leads to large latency and low matching throughput. To keep the matching throughput high, the basic idea of HPartition is to divide the hot clusters into multiple cold clusters. Next, we give a detailed description of our method.

Step one, each matcher periodically checks whether a hot cluster exists. Suppose a cluster Gx1x2···xk is managed by the matcher M. Gx1x2···xk is regarded as a hot cluster by M if its size is larger than α, where α is a constant threshold value.

Step two, the matcher M divides the space of Gx1x2···xk through the multi-attribute space partition of Section 4.1.1 and then reassigns the subscriptions of Gx1x2···xk into the corresponding clusters. Formally, each range Ri^xi of attribute Ai on Gx1x2···xk is cut into N′seg segments, each of which is represented by Ri^{xi yi}, xi ∈ [1, Nseg], yi ∈ [1, N′seg]. Gx1x2···xk is divided into (N′seg)^k clusters {Gx1x2···xk y1y2···yk = R1^{x1 y1} × R2^{x2 y2} × · · · × Rk^{xk yk} | xi ∈ [1, Nseg], yi ∈ [1, N′seg]}. Then the matcher M reassigns each subscription of Gx1x2···xk into these new clusters. To identify these new clusters, the clusterID of Gx1x2···xk y1y2···yk is marked by x1x2 · · · xk y1y2 · · · yk. After that, each cluster Gx1x2···xk y1y2···yk and its subscriptions are dispatched to its target matcher according to the consistent hashing.

Step three, the matcher M notifies each dispatcher about the change of Gx1x2···xk. Specifically, the matcher M sends a two-tuple ⟨x1x2 · · · xk y1y2 · · · yk, N′seg⟩ to each dispatcher.


Fig. 3. An example of assigning a subscription where S = (30 ≤ Latitude ≤ 60) ∧ (120 ≤ Longitude ≤ 150). S is assigned to the clusters whose spaces overlap with that of S.

Algorithm 2: eliminateHotClusters(α, N′seg)
Input: α: the minimum size of the hot cluster. N′seg: the number of segments on each attribute of the hot cluster.
1   while true do
2       foreach cluster Gi do
3           if (Gi.size > α) then
4               foreach subscription S in Gi do
5                   T(S) = getClusterIDs(S, Gi.space, N′seg);   // Algorithm 1
6                   foreach clusterID j in T(S) do
7                       Gij.add(S.id);
8               foreach Gij in Gi do
9                   send Gij to its corresponding matcher according to the consistent hashing;
10                  send ⟨ij, N′seg⟩ to each dispatcher;
11              delete Gi;
12      sleep(∆);   // Each matcher checks hot spots every ∆ seconds.

Algorithm 3: getClusterIDsExtended(S, HCL)
Input: S: a conjunction of predicates. HCL: the hot cluster list of the dispatcher.
Output: T(S): the clusterIDs whose spaces overlap S.
1   T(S) = getClusterIDs(S, Ω, Nseg);   // Algorithm 1
2   foreach clusterID i of T(S) do
3       if (i is contained in any item of HCL) then
4           T(S)i = getClusterIDs(S, Gi.space, N′seg);   // Algorithm 1
5           foreach clusterID j of T(S)i do
6               add ij into T(S);

1 2 3 4

T (S ) = getClusterIDs(S , Ω , Nseg ); //Algorithm foreach clusterID i of T (S ) do if (i is contained in any item of HCL ) then ′ T (S )i = getClusterIDs(S , Gi .space, Nseg );

1

//Algorithm 1 5 6

foreach clusterID j of T (S )i do add ij into T (S );

is λk Nsub if Nseg → ∞ and α → ∞, where k is the number of attributes, Nsub is the total size of subscriptions, and Nseg is the number of segments of each attribute, α is the minimum size of the hot cluster. Proof. For each attribute Ai , i ∈ [1, k], the range of Ai is Ri , the length of corresponding predicate Pi is λ∥Ri ∥. Based on multiattribute space partition (Section 4.1.1), each attribute is divided into Nseg segments, therefore, the length of each segment is ∥Ri ∥/Nseg . Then the expected number of segments that Pi falls into is ⌈λNseg ⌉. Because each subscription contains k predicates, the expected number of clusters that each subscription falls into is ⌈λNseg ⌉k . On the other hand, the number of clusters in the content space is (Nseg )k . Then we have the average cluster size Nc = ⌈λNseg ⌉k Nsub . (Nseg )k

When α → ∞, we do not redivide the hot clusters.

When Nseg → ∞, we have

⌈λNseg ⌉ Nseg

= λ, and Nc = λk Nsub .

From the result of Theorem 1, the average cluster size Nc decreases with the reduction of λ. That is, smaller λ brings less matching time in SEMAS. Fortunately, the subscriptions distribution in real world applications is often skewed, and most predicates subscribe small ranges, which guarantees small average cluster size of SEMAS. Note that for small λ, ⌈λNseg ⌉ grows much slowly as Nseg increases. If ⌈λNseg1 ⌉ = ⌈λNseg2 ⌉ and Nseg1 < Nseg2 , then the average cluster size with Nseg2 segments is smaller than that with Nseg1 segments, and the gain of Nseg2 segments over Nseg1 segments is N

seg2 k ( Nseg1 ) ×.

4.2. Performance-aware detection As stated earlier, the event arrival rate of the emergency applications may churn rapidly in a short time. We propose PDetection, a performance-aware detection technique, to adapt to the changing of workloads. Generally speaking, PDetection perceives the changing of workloads through periodically detecting the worst performance of matchers (Section 4.2.1), then adjusts the scale of matchers and redistributes each cluster among all the matchers (Section 4.2.2).

X. Ma et al. / Future Generation Computer Systems 36 (2014) 102–119

109

Fig. 4. An example of eliminating hot clusters where G12 is supposed to be a hot cluster. G12 contains 9 subscriptions. After it is divided into 4 clusters through Algorithm 2, the maximum size of all clusters is reduced to 5.

4.2.1. Detecting the maximum waiting time In order to perceive the sudden change of workloads, we periodically detect the worst performance of each matcher. We describe the details as follows. To judge the performance of each matcher, we estimate the waiting time of next message as a criteria. Specifically, the estimation is based on linear extrapolation, which assumes that the message arrival rate and the event matching rate remain the same between two continuous updates. The linear extrapolation method creates a tangent line at the end of the known data, and provides good results when the estimated data is not too far beyond the known data. Formally, each matcher buffers all arriving events to an event queue q. At the moment t, the length of q is Ltq , the event arrival rate is Rta , and the event matching rate is Rtm . Then based on ′

the linear extrapolation method, the length of q at time t ′ is Ltq = Ltq + (Rta − Rtm )(t ′ − t ). Thus, the waiting time of next arriving event ′

Tw is (Ltq + 1)/Rtm . One might argue that a direct method is that each matcher periodically reports the ratio of its event arrival rate Ra to the matching throughput Rm to a appointed server. However, the ratio may lead to inaccurate report. This is because a matcher with a large number of events in its event queue may be supposed to be light loaded because of the sudden decreasing of arrival event rate. To obtain the maximum waiting time, we adopt a light-weight gossip based aggregation technique. Specifically, all matchers are organized into a gossip based overlay [59]. Then each matcher i periodically exchanges its waiting time Twi with a random matcher j. Both matchers update their Tw to be the bigger one. According to the gossip based aggregation [35] technique, the value of Tw on each matcher will converge to the actual global maximum in a short time. Each matcher periodically synchronizes its clock with a synchronizing dispatcher Ds and estimates its own waiting time. In this way, Ds easily obtains the maximum Tw by accessing arbitrary matcher. One might argue that a simple method is to report each Tw of the matcher to Ds . However, this leads to high synchronizing latency as the size of matchers increases. 4.2.2. Adjusting the scale of matchers In this section, we introduce how to adjust the scale of matchers and redistribute each cluster among all the matchers. First of all, the synchronizing dispatcher Ds adjusts the scale of matchers according to the maximum waiting time. Suppose that min the range of total matchers in the cloud platform is [Nm , Nmmax ], 0 1 and the expected churn range of waiting time is [Tw , Tw ] when the size of matchers is stable. As mentioned above, the synchronizing

dispatcher Ds gets the maximum waiting time Twmax from anyone matcher periodically. If Twmax > Tw1 , Ds adds new matchers until max Twmax falls into [Tw0 , Tw1 ] or the size of live matchers is equal to Nm . max 0 max If Tw < Tw , Ds will suspend a number of matchers until Tw falls min into [Tw0 , Tw1 ] or the size of live matchers is equal to Nm . After that, we need to redistribute each cluster among all the matchers. There are two main problems we need to solve. One is to reduce the migrating latency and traffic overhead. According to the balance and monotonicity of the consistent hashing, the addition and removal of nodes requires O(Nsub /Nm ) subscriptions to be reshuffled, which leads to large traffic overhead as Nsub increases. Recall that each matcher stores all subscriptions as described in Section 4.1.2. Therefore, only the identifications of subscriptions are transferred between servers, which brings much lower latency and traffic overhead. Another problem is how to keep the 24 × 7 event matching service when migrating clusters. When matchers join or leave the system, a part of the clusters have to be migrated to their new positions to satisfy the mapping rules of the consistent hashing. However, the events falling into these clusters may not be matched during their migration. According to the consistent hashing, the matcher list stored in each dispatcher decides each event to be dispatched which matcher. To keep a continuous matching service, our basic idea is to match events using the outdated matcher list during migrating clusters, and using the latest matcher list after migration. Thus, the key is to synchronize the matcher list on all dispatchers. To this end, PDetection utilizes a synchronizing dispatcher Ds and a mutex V to coordinate all the servers. Ds is used to synchronize the states of all servers. The mutex V is used to protect the scale of matchers Nm from simultaneous update. We provide details as follows. (1) Adding matchers: Step one, Ds needs to obtain the mutex V . When Ds receives a new matcher for the first time, it checks the state of V firstly. If V is occupied, it means the system is migrating clusters and the adding matchers operation has to wait until V is released. Step two, Ds records the addresses of new matchers after it obtains V . Because each new matcher may join the system at any moment, Ds sets a timer Tt and a timeout interval Tout for next new matcher. When Ds receives a new matcher for the first time, it launches timer Tt . We set the initial value of Tt to zero, and increase it with the system time. When the next new matcher arrives, if Tt < Tout , Ds approves this matcher and resets Tt to zero, otherwise Ds rejects the matcher. Note that Ds rejects all the incoming matchers when Tt ≥ Tout . It guarantees the atomicity of migrating clusters and the synchronization of dispatchers. Through sending heartbeat


Fig. 5. The sequence diagram of all servers when PDetection adds matchers.
Fig. 6. The sequence diagram of all servers when PDetection removes matchers.

messages periodically, the rejected matchers will be approved after all clusters are migrated to the right matchers. Step three, redistribute clusters to the new matchers. According to the monotonicity of the consistent hashing, a cluster may be moved from a old matcher to a new matcher, but not from one old matcher to another. Specifically, Ds sends the new matcher list to each matcher. The cluster that is mapped to a new matcher through the consistent hashing is called an emigrator. Each matcher copies all emigrators to their right matchers through the consistent hashing. Ds copies all subscriptions to the new matchers. After that, each matcher reports a ‘‘COMPLETE’’ message to Ds . Note that all emigrators are still in their old matchers, which ensures correctness of the event matching service when an event arrives at these matchers. Step four, Ds synchronizes the states of all servers. If Ds receives ‘‘COMPLETE’’ messages from each matcher, it notifies all the dispatchers including itself to update their matcher lists. Then each dispatcher returns a ‘‘ACK’’ message to Ds . After receiving all ‘‘ACK’’ messages, Ds notifies each matcher to remove their emigrators. Finally, Ds resets Tt to zero and releases the mutex V . Fig. 5 shows an example of adding matchers with two dispatchers and two matchers, where Dispatcher 1 is Ds and Matcher 2 is the new matcher. Note that there are two synchronizing points in Fig. 5. One is when Ds must wait for all ‘‘COMPLETE’’ messages from each matcher. The other is when Ds must wait for all ‘‘ACK’’ messages from each dispatcher. Both of them ensure that each event is assigned to the right matcher after the matcher list of each dispatcher is updated. One might argue that each matcher can update its clusters according to the new matcher list because it stores all subscriptions, and then it does not need to migrate emigrators to other matchers. However, this method leads to high processing latency for each matcher with the growth of subscriptions. Besides that, one might wonder what we will do if a number of matchers fail when adding matchers. In this case, the failed matchers cannot send their emigrators to the new matchers. To solve this problem, Ds takes over the jobs of these failed matchers. That is, Ds sends the emigrators of the failed matchers to their corresponding new matchers according to the consistent hashing. Note that the clusters that fall into these failed clusters are also lost. Then the problem becomes how to recover the system from the removing matcher operation. We give the solution as follows. (2) Removing matchers: In order to ensure the reliability of the service, we handle the cases of matcher removing and matcher failure in the same way. Similar to the process of adding matchers

operation, when Ds detects a failed matcher for the first time, it needs to obtain the mutex V at step one.

Step two, Ds detects the number of lost matchers. Each matcher periodically sends a heartbeat message to Ds. If the interval between two consecutive heartbeat messages of the same matcher exceeds T′out, Ds supposes the matcher to be lost. Similar to the process of adding matchers, Ds uses a timer T′t and the timeout interval T′out for the next lost matcher. When Ds detects a lost matcher for the first time, it launches the timer T′t, which is set to zero initially and increases with the system time. When Ds detects the next lost matcher, if T′t < T′out, Ds approves this matcher and resets T′t to zero; otherwise Ds rejects the matcher.

Step three, redistribute the lost clusters among the live matchers. Ds first sends the latest matcher list to each alive matcher. Then it sends the lost clusters, with the identifications of their subscriptions, to the alive matchers according to the consistent hashing. Compared with the operation of adding matchers, it is unnecessary to migrate clusters among matchers due to the monotonicity of the consistent hashing. After receiving all emigrators, each matcher reports a ‘‘COMPLETE’’ message to Ds.

Step four, after Ds receives all ‘‘COMPLETE’’ messages from all matchers and itself, Ds notifies all the dispatchers including itself to update their matcher lists. Then each dispatcher returns an ‘‘ACK’’ message to Ds. After Ds receives all ‘‘ACK’’ messages, it notifies each matcher to remove their emigrators. Finally, Ds resets T′t to zero and releases the mutex V. Fig. 6 shows an example of removing matchers with two dispatchers and two matchers, where Dispatcher 1 is Ds and Matcher 2 is the failed matcher. Compared with the operation of adding matchers, Ds needs to copy each emigrator of the failed matchers to the right matcher and send a ‘‘COMPLETE’’ message to itself. Note that when the events falling into the failed matchers arrive at a dispatcher, they are stored on the dispatcher temporarily. Then, after the operation of removing matchers completes, the dispatcher reassigns these events, which ensures the reliability of the event matching service.

5. Experiment

5.1. Implementation

We design and implement the prototype of SEMAS on our OpenStack-based [60] testbed. In order to build the prototype in a modular and portable way, we use ICE [61], an object-based middleware, as our fundamental framework. ICE allows users to



Table 2
Default parameters in SEMAS.

Parameter    Value
k            4
Ri           [0, 500]
Nm           16
Nd           2
Nseg         10
Nseg′        2
Nsub         40,000
α            8000
σ            50

Besides the code generated by ICE, we add about 14,000 lines of Java code to implement the prototype.
To evaluate the performance of SEMAS, we implement three methods on our testbed.
SEMAS: It contains both HPartition and PDetection. Through HPartition, each dispatcher assigns the incoming subscriptions and events to their corresponding matchers, and each matcher divides its hot clusters into multiple cold clusters with low traffic overhead. Through PDetection, SEMAS adaptively adjusts the scale of matchers under frequent churn of the event arrival rate.
SEMAS-B: It is implemented to evaluate the performance of HPartition in SEMAS. Compared with SEMAS, SEMAS-B only divides the subscriptions into (Nseg)^k clusters using Nseg (Section 4.1.2), rather than eliminating the hot clusters (Section 4.1.4).
BlueDove [33]: To the best of our knowledge, BlueDove is the first effort to provide an event matching service for attribute-based pub/sub systems in a cloud computing environment. It utilizes a multi-attribute subscription space partitioning technique to divide the whole content space into kNm clusters and provides multiple candidate matchers for each event. All the matchers of BlueDove are organized into a gossip-based overlay to maintain the global state. Using this overlay, BlueDove adopts a performance-aware forwarding technique to select the least loaded candidate matcher for each event.
For a fair comparison, our implementation of all these methods uses the same gossip-based overlay to organize the matchers, as described in Section 4.2.1. Each matcher periodically exchanges information with another randomly sampled matcher. The gossip-based overlay enables the event matching service to support large-scale deployments of matchers with low traffic overhead.
5.2. Parameters and metrics
The testbed consists of 20 CentOS servers, each of which is a virtual machine (VM) with a quad-core 2.0 GHz Intel processor, 2.0 GB of memory, and 40 GB of storage. All the VMs are interconnected by Gigabit Ethernet switches. In SEMAS, we use 2 VMs as dispatchers, 2 VMs as event generators, and 16 VMs as matchers.
In the experiment, the content space is composed of 4 attributes, whose ranges are between 0 and 500. Before matching events, 40,000 subscriptions are dispatched to each matcher. The values in each attribute follow a normal distribution with a standard deviation σ = 50, and the hot spots of the attributes are diffused evenly along the full range of each attribute. To evaluate the matching throughput of each approach, the event generators generate 2 million events, whose attribute values are distributed uniformly over the full range of each dimension. In SEMAS and SEMAS-B, the range of each attribute is divided into 10 segments. Besides that, a cluster of SEMAS is defined as a hot cluster if its size is larger than 8000; for each hot cluster, its ranges on all attributes are cut into 2 segments. The default parameters are shown in Table 2. We measure the performance of all three methods with the same parameters for a fair comparison.
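To illustrate this subscription workload, the sketch below draws each predicate endpoint from N(Ei, σ) and clamps it to [0, 500], as described for the experiments. How the hot spots Ei are "diffused evenly" is our assumption (here, evenly spaced positions picked at random per subscription); the class is illustrative and not part of the prototype.

```java
import java.util.Random;

// Sketch of the skewed-subscription workload: each predicate Pi = (rl <= Ai <= ru)
// has endpoints drawn from N(Ei, sigma), clamped to the attribute range [0, 500].
// The placement of the hot spots Ei is an assumption: evenly spaced positions
// along the attribute range, one picked at random per subscription.
class SubscriptionGenerator {
    static final int K = 4;            // number of attributes
    static final double RANGE = 500.0; // attribute range [0, 500]
    static final int HOT_SPOTS = 10;   // assumed number of evenly spaced hot spots
    private final Random rnd = new Random();

    // Returns a k-attribute subscription as [attribute][0 = rl, 1 = ru].
    double[][] nextSubscription(double sigma) {
        double[][] predicates = new double[K][2];
        for (int i = 0; i < K; i++) {
            double hotSpot = (rnd.nextInt(HOT_SPOTS) + 0.5) * (RANGE / HOT_SPOTS);
            double a = clamp(hotSpot + sigma * rnd.nextGaussian());
            double b = clamp(hotSpot + sigma * rnd.nextGaussian());
            predicates[i][0] = Math.min(a, b); // rl
            predicates[i][1] = Math.max(a, b); // ru
        }
        return predicates;
    }

    private static double clamp(double v) {
        return Math.max(0.0, Math.min(RANGE, v));
    }
}
```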

Following are the metrics we use in the experiments:
• Matching throughput: It denotes the average number of matched events per second. Suppose that the number of events is Nev, the first incoming event arrives at the dispatcher at the moment T1, and the matching of the last event is completed at the moment T2. Then the matching throughput is Nev/(T2 − T1).

Fig. 7. The distribution of cluster sizes.

• Average cluster size: It denotes the average number of subscriptions over all clusters. It also reflects the average matching time of each event.
5.3. Load balance
The evaluation of load balance mainly covers two aspects: the balance of cluster sizes, which reflects the average matching time of each event, and the workload balance of matchers, which reflects the utilization of the system. Both factors greatly affect the matching throughput of the system. In this section, we use the default parameters of Table 2 to evaluate the load balance of all the above-mentioned approaches.
We first evaluate the balance of cluster sizes of all approaches. Fig. 7 shows the cumulative distribution functions (CDF) of cluster sizes in each method. The result demonstrates that BlueDove generates more hot clusters than SEMAS and SEMAS-B. The percentages of hot clusters in SEMAS, SEMAS-B and BlueDove are 0%, 0.5% and 35%, respectively, and their average cluster sizes are 570, 299, and 8117, respectively. According to the partitioning strategy in BlueDove, each cluster only concerns a specific range of a single dimension, so a hot cluster is formed whenever a large number of subscribers are interested in the same range of one dimension, which leads to a large number of hot clusters. Moreover, BlueDove does not provide a partitioning strategy to alleviate hot spots as the number of subscriptions increases. In contrast, each cluster in SEMAS and SEMAS-B concerns a specific range of every dimension. Meanwhile, SEMAS adaptively divides hot clusters into multiple cold clusters in a hierarchical manner, which greatly reduces the number of hot clusters.
Next, to evaluate the workload balance of matchers, we adopt the load balance index β(Nm) [62] as the criterion, where β(Nm) = (Σi Loadi)^2 / (Nm · Σi Loadi^2), the sums run over i = 1, ..., Nm, Nm is the number of matchers, and Loadi is the workload of matcher Mi. The value of β(Nm) is bounded between 0 and 1, and a higher value represents better fairness. With 40,000 skewed subscriptions, Fig. 8(a) shows the distribution of the average cluster size on each matcher. Due to the multi-attribute space partition in Section 4.1.1, both SEMAS and SEMAS-B have smaller average cluster sizes than BlueDove. Taking Loadi to be the average cluster size stored on matcher Mi, the corresponding β(Nm) values of SEMAS, SEMAS-B and BlueDove are 0.960, 0.912, and 0.757, respectively. The consistent hashing of SEMAS and SEMAS-B ensures a better balance of average cluster sizes among matchers than BlueDove. In contrast, each matcher in BlueDove is responsible for a fixed range of each dimension, so the workload balance of its matchers depends heavily on the distribution of subscriptions. Note that SEMAS has the best balance of average cluster sizes, although its average cluster size is larger than that of SEMAS-B.
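For reference, the index can be computed directly from the per-matcher loads; a minimal sketch (the class name is ours):

```java
class LoadBalance {
    // Load balance index beta(Nm) from [62]:
    // beta = (sum of Load_i)^2 / (Nm * sum of Load_i^2); a higher value means a fairer load.
    static double index(double[] loads) {
        double sum = 0.0, sumOfSquares = 0.0;
        for (double load : loads) {
            sum += load;
            sumOfSquares += load * load;
        }
        return (sum * sum) / (loads.length * sumOfSquares);
    }
}
```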



Fig. 8. Load balance. (a) The average cluster size of each matcher. (b) The average matching throughput of each matcher.

The reason SEMAS's average cluster size is nevertheless larger than SEMAS-B's is that, when a hot cluster in SEMAS is divided into multiple new clusters, these new clusters may still be large due to the skewness of subscriptions, which leads to larger but better-balanced cluster sizes.
Next, Fig. 8(b) shows the distribution of matching throughput on each matcher. Taking Loadi to be the matching throughput of matcher Mi, the values of β(Nm) for SEMAS, SEMAS-B, and BlueDove are 0.969, 0.966, and 0.284, respectively. Because of the better balance of average cluster sizes in SEMAS and the uniform distribution of events, SEMAS presents a better balance of matching throughput than the other approaches. In BlueDove, the performance-aware forwarding scheme dispatches a message to the candidate matcher with the lowest processing delay, which introduces an imbalance of matching throughput among matchers. In addition, Fig. 8 also shows that, for each approach, the matching throughput is inversely proportional to the average cluster size on each matcher. In conclusion, SEMAS presents a better balance of cluster sizes and matching rates among matchers than SEMAS-B and BlueDove, at the cost of more memory overhead than SEMAS-B.
5.4. Scalability
In this section we compare the scalability of SEMAS with that of SEMAS-B and BlueDove by evaluating how the matching throughput changes as Nm, Nsub, Nseg, and α vary.
In the first experiment, Nm increases from 8 to 20. Fig. 9(a) shows that the matching rate of each approach increases linearly with the growth of Nm; their slopes are 6.16, 4.86, and 1.81, respectively. With 8 matchers, the throughput of SEMAS is 1.15 and 7.64 times that of SEMAS-B and BlueDove, respectively; with 20 matchers, these gains become 1.22× and 4.32×. In SEMAS, the entire content space is divided into more than 10,000 clusters by HPartition. This fine-grained clustering ensures a small average cluster size according to Theorem 1.
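The base grid partition underlying this clustering can be sketched as follows; the encoding of the cell index is our own illustration rather than the exact scheme of Section 4.1.

```java
// Sketch of the base (Nseg)^k grid partition: an event with one value per
// attribute falls into exactly one cell, identified by its segment index on
// each attribute.
class GridPartition {
    static final double RANGE = 500.0; // attribute range [0, 500]

    // Returns a single integer cell id in [0, nseg^k).
    static int cellOf(double[] eventValues, int nseg) {
        int cell = 0;
        for (double v : eventValues) {
            int seg = (int) Math.min(nseg - 1, Math.floor(v / (RANGE / nseg)));
            cell = cell * nseg + seg;   // mixed-radix encoding of the k segment indices
        }
        return cell;
    }
}
```

With Nseg = 10 and k = 4, cellOf maps every event to one of 10^4 = 10,000 cells, which is the fine-grained clustering referred to above.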

Compared with SEMAS, SEMAS-B does not divide the hot clusters whose sizes exceed α, which leads to lower matching throughput than SEMAS. Recall that BlueDove divides the entire content space into kNm clusters, so the number of clusters only increases from 32 to 80 as Nm increases from 8 to 20. This coarse-grained clustering yields a large average cluster size and low matching throughput.
In the second experiment, Nsub increases from 20,000 to 80,000. Fig. 9(b) shows that the throughput of SEMAS decreases more slowly than that of the other approaches as Nsub increases; the matching throughput of SEMAS, SEMAS-B, and BlueDove drops by 48.2%, 62.2%, and 89.8%, respectively. As Nsub increases, more clusters become hot clusters. Due to the fine-grained clustering, the average cluster sizes of both SEMAS and SEMAS-B increase more slowly than that of BlueDove. Compared with SEMAS-B, SEMAS further divides these hot clusters into multiple cold clusters, which leads to a better load balance on the matchers and higher throughput.
Next, we test the scalability of both SEMAS and SEMAS-B by increasing Nseg from 6 to 12. Fig. 9(c) shows that the matching throughput of SEMAS is higher than that of SEMAS-B for all values of Nseg. Note that the gap between SEMAS and SEMAS-B widens as Nseg decreases. This is because the number of hot clusters increases as Nseg decreases, so eliminating them brings a better load balance on the matchers of SEMAS. Then, we test the impact of α on the matching throughput of SEMAS. As shown in Fig. 9(d), the throughput of SEMAS increases as α decreases, because a smaller α leads to more hot clusters being split and better-balanced cluster sizes.
In conclusion, the matching throughput of SEMAS increases linearly with the number of matchers. Compared with SEMAS-B and BlueDove, the matching throughput of SEMAS is higher for all tested values of Nm, Nsub, Nseg, and α.
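The elimination of hot clusters referred to throughout this section can be sketched as below. The data layout and the non-recursive formulation are simplifications of the hierarchical scheme of Section 4.1.4, and all names are ours.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of eliminating a hot cluster: the cluster's hyper-rectangle
// is cut into 2^k children by halving every attribute range (N'seg = 2), and
// each subscription is copied to every child whose space overlaps its
// predicates. In SEMAS this is applied hierarchically: children still larger
// than alpha would be split again.
class HotClusterSplitSketch {

    static class Cluster {
        final double[][] bounds;                          // bounds[i] = {lo, hi} of attribute i
        final List<double[][]> subs = new ArrayList<>();  // each sub: sub[i] = {rl, ru}
        Cluster(double[][] bounds) { this.bounds = bounds; }
    }

    static List<Cluster> split(Cluster hot) {
        int k = hot.bounds.length;
        List<Cluster> children = new ArrayList<>();
        for (int mask = 0; mask < (1 << k); mask++) {     // one child per half combination
            double[][] childBounds = new double[k][2];
            for (int i = 0; i < k; i++) {
                double mid = (hot.bounds[i][0] + hot.bounds[i][1]) / 2.0;
                boolean lowerHalf = ((mask >> i) & 1) == 0;
                childBounds[i][0] = lowerHalf ? hot.bounds[i][0] : mid;
                childBounds[i][1] = lowerHalf ? mid : hot.bounds[i][1];
            }
            children.add(new Cluster(childBounds));
        }
        for (double[][] sub : hot.subs) {                 // reassign subscriptions
            for (Cluster child : children) {
                if (overlaps(child.bounds, sub)) child.subs.add(sub);
            }
        }
        return children;
    }

    static boolean overlaps(double[][] cell, double[][] sub) {
        for (int i = 0; i < cell.length; i++) {
            if (sub[i][1] < cell[i][0] || sub[i][0] > cell[i][1]) return false;
        }
        return true;
    }
}
```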

Fig. 9. Scalability. (a) Impact of the increasing number of matchers. (b) Impact of the increasing number of subscriptions. (c) Impact of the increasing number of segments on each attribute. (d) Impact of the minimum cluster size of hot clusters.

5.5. Elasticity and reliability
We evaluate the elasticity and reliability of SEMAS in this section. In the experiment, each matcher estimates its waiting time every 200 ms. We set both timeout intervals, Tout and Tout′, to 10 s for adding and removing matchers, respectively. That is, SEMAS collects new matchers or failed matchers until the interval between two consecutive arrivals or failures exceeds 10 s. Besides that, Ds regards a matcher as failed if the interval between two consecutive heartbeat messages from it exceeds 10 s.
As stated earlier, elasticity represents the adaptability of the service to sudden changes of workloads. SEMAS implements an elastic service by adaptively adjusting the scale of matchers. In the experiments, we set the expected churn range of the waiting time [Tw0, Tw1] to [0.1 s, 1 s]. The event generators generate events in two different manners: a linearly changing event arrival rate over time, and an instantaneously changing event arrival rate at one moment.
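The scaling decision driven by these waiting-time thresholds can be summarized by the following sketch. How many matchers PDetection adds or removes per decision is part of Section 4.2.2 and is not reproduced here; the class and its constants are illustrative only.

```java
// Threshold rule behind the elasticity experiments: matchers gossip their
// estimated waiting time, and the maximum observed waiting time is compared
// against the expected churn range [Tw0, Tw1] = [0.1 s, 1 s].
class ScaleDecider {
    static final double TW0 = 0.1; // seconds: consider removing matchers below this
    static final double TW1 = 1.0; // seconds: consider adding matchers above this

    enum Action { ADD_MATCHERS, REMOVE_MATCHERS, KEEP }

    static Action decide(double maxWaitingTimeSeconds) {
        if (maxWaitingTimeSeconds > TW1) return Action.ADD_MATCHERS;
        if (maxWaitingTimeSeconds < TW0) return Action.REMOVE_MATCHERS;
        return Action.KEEP;
    }
}
```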

We first evaluate how SEMAS adjusts the number of matchers to adapt to both linearly and instantaneously increasing event arrival rates. In the experiment, 8 matchers are running in SEMAS at the beginning, and the initial event rate is set to 35,000 events per second, which is close to the saturation throughput of 8 matchers. In the first case, the event generators increase the event arrival rate by 1000 events per second. Fig. 10(a) shows how the matching throughput of SEMAS changes as the event arrival rate increases, where the red lines mark the moments when new matchers are added into SEMAS and the blue lines mark the moments when the new matchers are ready to match events. The range between a red line and its adjacent blue line is the interval of reconfiguring clusters among matchers. This reconfiguring delay mainly consists of two factors: the delay of migrating a number of clusters, and the delay of reconstructing these clusters.
As the event arrival rate increases, four new matchers are added into the system at 32.2 s and they are ready to match events at 39.1 s. The figure shows that the matching throughput decreases while these new matchers initialize their states. This is because the synchronizing server Ds needs to copy all subscriptions to the new matchers, which reduces the rate of dispatching events to the matchers. When all new matchers are ready to match events, the matching throughput increases to a higher value. A similar phenomenon occurs when another four new matchers join SEMAS between 78.0 s and 85.9 s. Because of the synchronizing operations described in Section 4.2.2, SEMAS is able to provide continuous service even while the scale of matchers is changing. In addition, Fig. 10(a) also shows that the latency of migrating clusters is less than ten seconds. This is because all subscriptions are stored at each matcher (Section 4.1.2), so a matcher only needs to transmit the identifications of the subscriptions in its emigrators.
In the second case, the event generators keep the event arrival rate constant for 15 s and then instantaneously increase it to 90,000 events per second. As shown in Fig. 10(b), the number of matchers increases from 8 to 16. Compared with the first case, more matchers join the system at a time, so the reconfiguring delay between a red line and the corresponding blue line is a little longer than in the first case.
Next, we evaluate how the elasticity of SEMAS adapts to both linearly and instantaneously decreasing event arrival rates. In the experiment, 16 matchers are running in SEMAS at the beginning, and the initial event rate is set to 90,000 events per second. In the first case, the event generators decrease the event arrival rate by 1000 events per second. Fig. 11(a) shows that the number of matchers shrinks from 16 to 12 and then to 8 as the event arrival rate decreases.


Fig. 10. Elasticity of adding matchers. (a) Adding matchers with a linearly increasing event arrival rate. (b) Adding matchers with an instantaneously increasing event arrival rate. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 11. Elasticity of removing matchers. (a) Removing matchers with a linearly decreasing event arrival rate. (b) Removing matchers with an instantaneously decreasing event arrival rate.

Compared with the scenario of adding new matchers in Fig. 10(a), the reconfiguring delay is lower when the same number of matchers are removed. Because of the monotonicity of the consistent hashing, clusters do not emigrate from one old matcher to another. According to the operations of adjusting the scale of matchers (Section 4.2.2), the reconfiguring delay mainly lies in redistributing the lost clusters from the synchronizing server Ds to their corresponding matchers. Since the skewed subscriptions lead to a small average cluster size, the lost clusters are quickly transferred to their corresponding matchers, which greatly reduces the traffic overhead and the total reconfiguring delay. Moreover, Fig. 11(a) shows that the matching throughput while reconfiguring matchers is not lower than the subsequent values after the matchers stabilize. This is because the server Ds only transfers the lost clusters, as the identifications of their subscriptions, to the corresponding matchers, which does not significantly reduce the rate of dispatching events.
In the second case, the event generators keep the event arrival rate constant for 15 s and then decrease it to 35,000 events per second. As shown in Fig. 11(b), the number of matchers decreases from 16 to 8. Similar to the first case, compared with the scenario of adding new matchers in Fig. 10(b), the reconfiguring delay is lower when the same number of matchers are removed.
To evaluate the reliability of SEMAS, we test how the event loss rate changes when a number of matchers fail simultaneously. Suppose that a dispatcher sends Nev1 events in total during the latest Tl seconds, of which Nev2 events cannot be delivered to their corresponding matchers. Then the event loss rate is Nev2/Nev1. In the experiment, we set Tl and Nm to 5 and 16, respectively, and let T denote the system time. As shown in Fig. 12, we test three cases: 4, 8, and 12 matchers fail at T = 0 s.

Fig. 12. Reliability. The change of the event loss rate when a number of matchers fail.

Within ten seconds, the event loss rates of these cases stabilize at around 25%, 50%, and 75%, respectively. In this stage, all live matchers can still provide the matching service. Meanwhile, the synchronizing dispatcher Ds assigns the lost clusters, with the identifications of their subscriptions, to the right matchers. Note that the event loss rate of each case is proportional to the fraction of failed matchers, because all the clusters are uniformly mapped to the matchers through the hash operation. The events falling into the failed matchers are stored in the dispatchers temporarily, as described in Section 4.2.2. In addition, Fig. 12 shows that the event loss rate of each case drops to zero after T > 12 s. In this stage, each cluster has been migrated and reconstructed on its corresponding matcher. Therefore, SEMAS can match all incoming events correctly.



Fig. 13. Workload characteristics. (a) Skewed subscriptions in normal distributions. (b) Skewed subscriptions in Zipf distributions. (c) Skewed events in normal distributions.

The events falling into the failed matchers will be assigned to their corresponding matchers again to ensure a reliable event matching service.
In conclusion, SEMAS can reconfigure clusters among matchers in a few seconds when the scale of matchers changes, because the clusters are migrated by transferring only the identifications of their subscriptions. The synchronizing operation of Ds ensures continuous matching service even while the scale of matchers is changing. Besides, SEMAS also ensures reliable event matching by buffering the events falling into the failed matchers at the dispatchers.
5.6. Influence of workload characteristics
5.6.1. Skewness of subscription distribution
In this section, we evaluate how the degree of skewness of the subscriptions affects the performance. We adopt two different distributions to generate skewed subscriptions: the normal distribution and the Zipf distribution.
In the first case, we set various standard deviations of the normal distribution of subscriptions to reflect different levels of skewness. Suppose that the standard deviation is σ and the expected value of attribute Ai is Ei. For each predicate Pi = (rl ≤ Ai ≤ ru), i ∈ [1, k], rl and ru are generated by N(Ei, σ), where N(Ei, σ) is a randomizer of the normal distribution with expected value Ei and standard deviation σ. Consequently, the range of each predicate increases with the growth of σ. Fig. 13(a) shows the matching throughput for different σ from 50 to 200. With the growth of σ, each subscription falls into more clusters, which leads to a larger average cluster size and a lower matching throughput. As σ increases from 50 to 200, the matching throughput of SEMAS, SEMAS-B and BlueDove drops by 89.1%, 87.3% and 64.4%, respectively. At σ = 200, the distribution of subscriptions is quite ‘‘flat’’: for SEMAS, the highest number of subscriptions stored on a matcher is only 1.11 times the lowest, and the standard deviation of the average cluster size over the matchers is 49. Even in this situation, the matching throughput of SEMAS is still much higher than that of BlueDove (1.4×). This shows that as long as some skewness of subscriptions exists, SEMAS can exploit it to improve the performance greatly.
In the second case, we set differently skewed Zipf distributions to evaluate the performance. For each predicate Pi = (rl ≤ Ai ≤ ru), i ∈ [1, k], the values of rl and ru are chosen with probability p(x) = x^(−β) / (Σ i^(−β)), where the sum runs over i = 1, ..., 500 and x ∈ [1, 500], and β is called the skewness coefficient of the Zipf distribution. With the growth of β, the range of the predicates becomes smaller, which yields a smaller average cluster size in SEMAS according to Theorem 1. Fig. 13(b) shows the matching throughput for different β from 1.0 to 1.3. When β = 1.3, the distribution of subscriptions is quite ‘‘skewed’’: 93% of the predicate values fall into 20% of the range of their corresponding attributes. In this situation, the matching throughput of SEMAS is 10.6 times that of BlueDove. This shows that the skewness of subscriptions improves the matching throughput of SEMAS significantly.
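A straightforward way to realize this workload is inverse-CDF sampling over the 500 discrete attribute values; the generator below is our own sketch, not code from the prototype.

```java
import java.util.Random;

// Sampling predicate endpoints from the Zipf distribution used above:
// p(x) = x^(-beta) / sum_{i=1..500} i^(-beta), x in [1, 500].
// rl and ru are two independent draws, ordered so that rl <= ru.
class ZipfPredicateSampler {
    private final double[] cdf;        // cumulative distribution over 1..range
    private final Random rnd = new Random();

    ZipfPredicateSampler(double beta, int range) {
        cdf = new double[range];
        double norm = 0.0;
        for (int x = 1; x <= range; x++) norm += Math.pow(x, -beta);
        double cum = 0.0;
        for (int x = 1; x <= range; x++) {
            cum += Math.pow(x, -beta) / norm;
            cdf[x - 1] = cum;
        }
    }

    int sample() {                     // inverse-CDF sampling via binary search
        double u = rnd.nextDouble();
        int lo = 0, hi = cdf.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        return lo + 1;                 // value in [1, range]
    }

    int[] samplePredicate() {          // one (rl, ru) pair with rl <= ru
        int a = sample(), b = sample();
        return new int[] { Math.min(a, b), Math.max(a, b) };
    }
}
```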

5.6.2. Skewness of event distribution
In all the above experiments, the events are distributed uniformly in the content space. In this section, we evaluate how the skewness of the event distribution impacts the performance. Suppose that both events and subscriptions follow a normal distribution. The skewness of events then leads to two different situations. Firstly, events and subscriptions follow the same normal distribution, so the hot clusters match against most events, which decreases the performance significantly. Secondly, the hot clusters of subscriptions do not overlap with the ‘‘hot spots’’ of the events at all, so the hot clusters match against few events and the cold clusters match against most events, which increases the performance.
In the experiment, the predicates Pi of the subscriptions follow a normal distribution N(Ei, 50). Each value ri of the events follows one of four normal distributions, N(Ei, 50), N(Ei, 100), N(Ei, 150), and N(Ei, 200). Fig. 13(c) shows that the matching throughput of all approaches decreases as the σ of the events decreases. As σ decreases from 200 to 50, the matching throughput of SEMAS, SEMAS-B and BlueDove drops by 87.8%, 91.2% and 60.9%, respectively. When the σ of the events is 50, all approaches have their poorest performance; however, the throughput of SEMAS is still higher than that of BlueDove (8.6×). In conclusion, even with an adversely skewed event distribution, SEMAS still outperforms BlueDove by eliminating hot clusters in HPartition.
5.7. Overhead
5.7.1. Memory overhead
The average number of clusters that each subscription falls into is called the partitioning overhead. In this section, we utilize the partitioning overhead to evaluate the memory overhead of each approach.
Firstly, we test the partitioning overhead with different subscription distributions. Fig. 14(a) and (b) show that the partitioning overhead of each approach increases with the growth of σ in the normal distribution and with the decrease of β in the Zipf distribution, respectively. The partitioning overhead of SEMAS increases faster than that of SEMAS-B and BlueDove in various subscription distributions. This is because the fine-grained clustering of SEMAS makes each subscription fall into more clusters. Recall that each cluster of SEMAS only stores the identifications of the subscriptions. Suppose that each identification uses 8 bytes, which allows the system to accommodate 2^64 subscriptions.




Fig. 14. Memory overhead. (a) Impact of skewed subscriptions in a normal distribution. (b) Impact of skewed subscriptions in a Zipf distribution. (c) Impact of the increasing number of segments. (d) Impact of the minimum cluster size of hot clusters.

Each value of a predicate uses 4 bytes, so the memory overhead of each subscription is 2 × 4 × 4 = 32 bytes according to the form of the subscription in Section 3.1. At σ = 200, the distribution is quite ‘‘flat’’, and the total memory overhead of one subscription in SEMAS is 419 × 8 + 32 = 3384 bytes. With 1 million subscriptions in SEMAS, 3.384 GB of memory on average are therefore used on each matcher. At σ = 50, the average memory overhead of each matcher is reduced to 176 MB for 1 million subscriptions. In conclusion, the memory overhead of SEMAS decreases considerably as the skewness of subscriptions grows.
Secondly, we test the partitioning overhead with different Nseg and α. Fig. 14(c) and (d) show that the partitioning overhead of SEMAS increases with the growth of Nseg and with the decrease of α. This is because both a growing Nseg and a shrinking α divide the entire content space into more clusters, which leads to more memory overhead. In conclusion, the memory overhead of SEMAS increases with the growing granularity of clustering in HPartition.
5.7.2. Traffic overhead
In this section, we discuss the traffic overhead between the servers of SEMAS in different scenarios.
Firstly, we evaluate the maintenance overhead of SEMAS. Each matcher periodically sends an ‘‘Exchange’’ message to a random matcher to collect the maximum waiting time, and it also periodically sends a heartbeat message to the synchronizing dispatcher Ds. When a new subscription arrives at a dispatcher, it is stored on the Nm matchers and forwarded to the other Nd − 1 dispatchers, according to the description in Section 4.1.2; therefore, the traffic overhead of a new subscription is Nm + Nd − 1 packets. When an event arrives at a dispatcher, it is assigned only to the matcher holding the cluster whose space contains it; that is, the traffic overhead of an event is one packet.
Secondly, we evaluate the traffic overhead with a dynamic Nm. As mentioned in Section 4.2.2, SEMAS adaptively migrates the clusters to ensure the elastic service capacity when Nm changes.

In the experiments, Nm is 16 at first, and l matchers are added or removed each time to evaluate the total traffic overhead between servers, where l varies from 1 to 8. The event generators generate subscriptions using two types of distributions: normal distributions with different standard deviations σ and Zipf distributions with different skewness coefficients β. The skewness of subscriptions decreases with σ and increases with β. We repeat the experiments 10 times for each l and report the average results.
Fig. 15 shows that the total traffic overhead increases with the growth of l for all subscription distributions, because a larger l leads to more clusters being transferred to other matchers. Note that, as the skewness of subscriptions decreases, the traffic overhead of removing l matchers grows faster than that of adding l matchers. This is because less skewed subscriptions result in larger lost clusters when a number of matchers are removed. Specifically, adding matchers leads to two types of traffic overhead. One is the traffic overhead of transferring subscriptions from Ds to the new matchers, denoted by TF0; clearly, TF0 = Nsub · Θ(S) · l, where Θ(S) is the average memory overhead per subscription. The other is the traffic overhead of transferring a number of clusters from old matchers to new matchers, denoted by TF1; according to the consistent hashing, the expected value of TF1 is Θ(M) · l/2, where Θ(M) is the average memory overhead of the clusters on each matcher. Removing matchers leads to the traffic overhead of transferring the lost clusters from Ds to the alive matchers, denoted by TF2; according to the consistent hashing, the expected value of TF2 is Θ(M) · l. Since the average cluster size increases as the skewness of subscriptions decreases, TF2 becomes the main traffic overhead as the number of removed matchers l increases. In contrast, as the skewness of subscriptions increases, the average cluster size is smaller and TF0 becomes the main traffic overhead. From Fig. 15(b), removing 8 matchers under the normal distribution with σ = 200 leads to the most traffic overhead, because the ‘‘flat’’ subscriptions incur a larger average cluster size.
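The expected reconfiguration traffic can be estimated directly from these expressions; the sketch below merely transcribes them, taking Θ(S) and Θ(M) as measured inputs (parameter names are ours).

```java
// Expected reconfiguration traffic from the analysis above, where thetaS is the
// average memory overhead per subscription and thetaM is the average memory
// overhead of the clusters held by one matcher.
class ReconfigTraffic {
    // Adding l matchers: subscriptions copied from Ds plus emigrated clusters.
    static double addMatchers(long nSub, double thetaS, double thetaM, int l) {
        double tf0 = nSub * thetaS * l;   // TF0: subscriptions copied to each new matcher
        double tf1 = thetaM * l / 2.0;    // TF1: clusters emigrated to the new matchers
        return tf0 + tf1;
    }

    // Removing l matchers: lost clusters re-sent from Ds to the alive matchers.
    static double removeMatchers(double thetaM, int l) {
        return thetaM * l;                // TF2
    }
}
```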



Fig. 15. Traffic overhead. (a) The traffic overhead when adding different numbers of matchers with various subscription distributions. (b) The traffic overhead when removing different numbers of matchers with various subscription distributions.

Even in this worst case (removing 8 matchers with σ = 200), the total traffic overhead is less than a hundred MB, and the lost clusters can be migrated to their corresponding matchers in a few seconds over a Gigabit Ethernet.
In conclusion, the servers of SEMAS generate only a small traffic overhead to keep the event matching service available. When the scale of matchers changes, SEMAS generates a moderate traffic overhead, which keeps the transmission latency low and preserves the elastic service capacity.
6. Conclusions and future work
This paper introduces SEMAS, a novel scalable and elastic event matching approach for attribute-based pub/sub systems. SEMAS utilizes a one-hop lookup overlay in the cloud computing environment to reduce the routing latency. Through a hierarchical multi-attribute space partition technique, SEMAS achieves scalable clustering of subscriptions and matches each event against a single cluster. The performance-aware detection technique enables the system to adaptively adjust the scale of matchers according to the changes of the workloads. Our analytical and experimental results demonstrate that, compared with existing cloud based pub/sub systems, SEMAS achieves a much higher matching rate and a better load balance under different workload characteristics. Moreover, SEMAS adapts to sudden workload changes and server failures with low latency and small traffic overhead.
Building on the encouraging results shown in this paper, we plan to improve several aspects of the SEMAS architecture. First, as the number of attributes and segments increases, HPartition incurs significant memory overhead. Since some attributes of real applications are rarely used in subscriptions, we want to investigate how to identify these attributes to reduce the overhead. Second, we will study how to extend SEMAS to support more complicated content-based pub/sub systems. In some emerging sense-and-respond applications, each subscription and event may contain only a subset of the attributes, and users are allowed to express their interests in disjunctive normal form, such as S = (P1 ∧ P2) ∨ (P1 ∧ P3). We will investigate how to extend SEMAS to make the attribute-based and content-based pub/sub models compatible.
Acknowledgments
This work was supported by the National Grand Fundamental Research 973 Program of China (Grant No. 2011CB302601), the National Natural Science Foundation of China (Grant No. 61379052),

the National High Technology Research and Development 863 Program of China (Grant No. 2013AA01A213), the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (Grant No. S2010J5050), Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20124307110015). References [1] Anss. URL: http://earthquake.usgs.gov/monitoring/anss/. [2] Readywarn. URL: http://www.readywarn.com/. [3] R.K. Ganti, N. Pham, H. Ahmadi, S. Nangia, T.F. Abdelzaher, Greengps: a participatory sensing fuel-efficient maps application, in: Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, ACM, 2010, pp. 151–164. [4] M. Gjoka, M. Kurant, C.T. Butts, A. Markopoulou, Walking in Facebook: a case study of unbiased sampling of OSNs, in: International Conference on Computer Communications, INFOCOM, 2010, pp. 1–9. [5] P.T. Eugster, P. Felber, R. Guerraoui, A.-M. Kermarrec, The many faces of publish/subscribe, ACM Computing Surveys (CSUR) 35 (2) (2003) 114–131. [6] Datacreatedperminite. URL: http://www.domo.com/blog/2012/06/ how-much-data-is-created-every-minute/?dkw=socf3/. [7] P. Pietzuch, J. Bacon, Hermes: a distributed event-based middleware architecture, in: 22nd International Conference on Distributed Computing Systems Workshops, 2002. [8] F. Cao, J.P. Singh, Efficient event routing in content-based publish/subscribe service network, in: International Conference on Computer Communications, INFOCOM, 2004. [9] G. Banavar, T. Chandra, B. Mukherjee, J. Nagarajarao, R.E. Strom, D.C. Sturman, An efficient multicast protocol for content-based publish–subscribe systems, in: IEEE International Conference on Distributed Computing Systems, ICDCS, 1999, pp. 262–272. [10] F. Cao, J.P. Singh, Medym: match-early with dynamic multicast for contentbased publish–subscribe networks, 2005, pp. 292–313. [11] L. Opyrchal, M. Astley, J. Auerbach, G. Banavar, R. Strom, D. Sturman, Exploiting IP multicast in content-based publish–subscribe systems, in: IFIP/ACM International Conference on Distributed Systems Platforms, 2000, pp. 185–207. [12] A. Riabov, Z. Liu, J.L. Wolf, P.S. Yu, L. Zhang, Clustering algorithms for content-based publication–subscription systems, in: IEEE 22nd International Conference on Distributed Computing Systems, ICDCS, 2002, pp. 133–142. [13] A. Carzaniga, Architectures for an event notification service scalable to widearea networks, Ph.D. Thesis, POLITECNICO DI MILANO, 1998. [14] Y.-M. Wang, L. Qiu, C. Verbowski, D. Achlioptas, G. Das, P.-Å Larson, Summarybased routing for content-based event distribution networks, Computer Communication Review 34 (5) (2004) 59–74. [15] A. Carzaniga, M.J. Rutherford, A.L. Wolf, A routing scheme for content-based networking, in: IEEE International Conference on Computer Communications, INFOCOM, 2004. [16] W.W. Terpstra, S. Behnel, L. Fiege, A. Zeidler, A.P. Buchmann, A peer-topeer approach to content-based publish/subscribe, in: Proceedings of the 2nd International Workshop on Distributed Event-Based Systems, 2003, pp. 1–8. [17] I. Aekaterinidis, P. Triantafillou, Pastrystrings: a comprehensive content-based publish/subscribe DHT network, in: IEEE 26nd International Conference on Distributed Computing Systems, ICDCS, 2006, p. 23. [18] E. Anceaume, M. Gradinariu, A.K. Datta, G. Simon, A. Virgillito, A semantic overlay for self-* peer-to-peer publish/subscribe, in: IEEE 26nd International Conference on Distributed Computing Systems, ICDCS, 2006, pp. 22–30.



[19] S. Voulgaris, E. Riviere, A. Kermarrec, M. Van Steen, et al., Sub-2-sub: selforganizing content-based publish and subscribe for dynamic and large scale collaborative networks, Research Report RR5772, INRIA, Rennes, France, 2005. [20] C. Zhang, A. Krishnamurthy, R.Y. Wang, J.P. Singh, Combining flexibility and scalability in a peer-to-peer publish/subscribe system, in: Proceedings of the ACM/IFIP/USENIX International Conference on Middleware, Middleware, 2005, pp. 102–123. [21] D.K. Tam, R. Azimi, H.-A. Jacobsen, Building content-based publish/subscribe systems with distributed hash tables, in: Databases, Information Systems, and Peer-to-Peer Computing, 2003, pp. 138–152. [22] A. Gupta, O.D. Sahin, D. Agrawal, A. El Abbadi, Meghdoot: contentbased publish/subscribe over P2P networks, in: Proceedings of the 5th ACM/IFIP/USENIX International Conference on Middleware, Middleware, 2004, pp. 254–273. [23] W. Li, S. Vuong, Towards a scalable content-based publish/subscribe service over DHT, in: IEEE Global Telecommunications Conference, GLOBECOM, 2010, pp. 1–6. [24] S. Voulgaris, M. van Steen, Epidemic-style management of semantic overlays for content-based searching, in: International European Conference on Parallel and Distributed Computing, Euro-Par, 2005, pp. 1143–1152. [25] A.I.T. Rowstron, P. Druschel, Pastry: scalable, decentralized object location, and routing for large-scale peer-to-peer systems, in: Proceedings of the ACM/IFIP/USENIX International Conference on Middleware, Middleware, 2001, pp. 329–350. [26] S. Ratnasamy, M. Handley, R.M. Karp, S. Shenker, Application-level multicast using content-addressable networks, 2001, pp. 14–29. [27] B. Calder, J. Wang, A. Ogus, N. Nilakantan, A. Skjolsvold, S. McKelvie, Y. Xu, S. Srivastav, J. Wu, H. Simitci, J. Haridas, C. Uddaraju, H. Khatri, A. Edwards, V. Bedekar, S. Mainali, R. Abbasi, A. Agarwal, M.F. ul Haq, M.I. ul Haq, D. Bhardwaj, S. Dayanand, A. Adusumilli, M. McNett, S. Sankaran, K. Manivannan, L. Rigas, Windows azure storage: a highly available cloud storage service with strong consistency, in: Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP, 2011, pp. 143–157. [28] X. Lu, H. Wang, J. Wang, J. Xu, D. Li, Internet-based virtual computing environment: beyond the data center as a computer, Future Generation Computer Systems 29 (2011) 309–322. [29] Y. Wang, X. Li, X. Li, Y. Wang, A survey of queries over uncertain data, Knowledge and Information Systems (2013) http://dx.doi.org/10.1007/s10115-0130638-6. [30] A. Lakshman, P. Malik, Cassandra: a decentralized structured storage system, Operating Systems Review 44 (2) (2010) 35–40. [31] D. Karger, E. Lehman, T. Leighton, R. Panigrahy, M. Levine, D. Lewin, Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web, in: Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, ACM, 1997, pp. 654–663. [32] W. Rao, L. Chen, P. Hui, S. Tarkoma, Move: a large scale keyword-based content filtering and dissemination system, in: IEEE 32nd International Conference on Distributed Computing Systems, ICDCS, 2012, pp. 445–454. [33] M. Li, F. Ye, M. Kim, H. Chen, H. Lei, A scalable and elastic publish/subscribe service, in: IEEE International Parallel & Distributed Processing Symposium, IPDPS, 2011, pp. 1254–1265. [34] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. 
Vogels, Dynamo: Amazon’s highly available key-value store, in: Proceedings of the ACM Symposium on Operating Systems Principles, SOSP, 2007, pp. 205–220. [35] M. Jelasity, A. Montresor, O. Babaoglu, Gossip-based aggregation in large dynamic networks, ACM Transactions on Computer Systems (TOCS) 23 (3) (2005) 219–252. [36] Y. Wang, S. Li, Research and performance evaluation of data replication technology in distributed storage systems, Computers and Mathematics with Applications 51 (11) (2006) 1625–1632. [37] M. Aguilera, R. Strom, D. Sturman, M. Astley, T. Chandra, Matching events in a content-based subscription system, in: Proceedings of the Eighteenth Annual ACM Symposium on Principles of Distributed Computing, 1999. [38] A. Carzaniga, A.L. Wolf, Forwarding in a content-based network, in: Proceedings of the 2003 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2003, pp. 163–174. [39] G. Mühl, L. Fiege, P.R. Pietzuch, Distributed Event-Based Systems, Springer, 2006. [40] I. Stoica, R. Morris, D. Karger, M.F. Kaashoek, H. Balakrishnan, Chord: a scalable peer-to-peer lookup service for internet applications, in: ACM SIGCOMM Computer Communication Review, Vol. 31, ACM, 2001, pp. 149–160. [41] J.L. Martins, S. Duarte, Routing algorithms for content-based publish/subscribe systems, IEEE Communications Surveys and Tutorials 12 (1) (2010) 39–58. [42] R. Baldoni, C. Marchetti, A. Virgillito, R. Vitenberg, Content-based publish–subscribe over structured overlay networks, in: IEEE 25nd International Conference on Distributed Computing Systems, ICDCS, 2005, pp. 437–446. [43] Amazon simple notification service. URL: http://aws.amazon.com/sns/. [44] J. Ekanayake, J. Jackson, W. Lu, R. Barga, A.S. Balkir, A scalable communication runtime for clouds, in: IEEE International Conference on Cloud Computing, CLOUD, 2011, pp. 211–218. [45] G. Chockler, R. Melamed, Y. Tock, R. Vitenberg, Spidercast: a scalable interestaware overlay for topic-based pub/sub communication, in: Proceedings of the ACM International Conference on Distributed Event-Based Systems, 2007, pp. 14–25.

[46] R. Baldoni, R. Beraldi, V. Quéma, L. Querzoni, S.T. Piergiovanni, Tera: topicbased event routing for peer-to-peer architectures, in: Proceedings of the ACM International Conference on Distributed Event-Based Systems, 2007. [47] F. Rahimian, S. Girdzijauskas, A.H. Payberah, S. Haridi, Vitis: a gossipbased hybrid overlay for internet-scale publish/subscribe enabling rendezvous routing in unstructured overlay networks, in: IEEE International Parallel & Distributed Processing Symposium, IPDPS, 2011, pp. 746–757. [48] Z. Zheng, Y. Wang, X. Ma, Peerchatter: a peer-to-peer architecture for data distribution over social networks, Information. An International Interdisciplinary Journal 15 (1) (2011) 259–266. [49] JBoss content based routing. URL: http://docs.jboss.org/jbossesb/docs/4.3.GA/ manuals/html/services/ContentBasedRouting.html. [50] B. Segall, D. Arnold, Elvin has left the building: a publish/subscribe notification service with quenching, in: AUUG, 1997. [51] E. Grummt, Fine-grained parallel XML filtering for content-based publish/subscribe systems, in: Proceedings of the ACM International Conference on Distributed Event-Based Systems, 2011, pp. 219–228. [52] Shakecast. URL: http://earthquake.usgs.gov/research/software/shakecast/. [53] P. Mockapetris, K.J. Dunlap, Development of the Domain Name System, Vol. 18, ACM, 1988. [54] A. Malekpour, A. Carzaniga, F. Pedone, G. Toffetti Carughi, End-to-end reliability for best-effort content-based publish/subscribe networks, in: Proceedings of the 5th ACM International Conference on Distributed Event-Based System, ACM, 2011, pp. 207–218. [55] S.Q. Zhuang, B.Y. Zhao, A.D. Joseph, R.H. Katz, J.D. Kubiatowicz, Bayeux: an architecture for scalable and fault-tolerant wide-area data dissemination, in: Proceedings of the 11th International Workshop on Network and Operating Systems Support for Digital Audio and Video, ACM, 2001, pp. 11–20. [56] R.S. Kazemzadeh, H.-A. Jacobsen, Publiy+: a peer-assisted publish/subscribe service for timely dissemination of bulk content, in: IEEE International Conference on Distributed Computing Systems, ICDCS, 2012, pp. 345–354. [57] P. Costa, M. Migliavacca, G.P. Picco, G. Cugola, Introducing reliability in content-based publish–subscribe through epidemic algorithms, in: Proceedings of the 2nd International Workshop on Distributed Event-Based Systems, ACM, 2003, pp. 1–8. [58] Murmurhash. URL: http://burtleburtle.net/bob/hash/doobs.html. [59] S. Voulgaris, D. Gavidia, M. van Steen, Cyclon: inexpensive membership management for unstructured P2P overlays, Journal of Network and Systems Management 13 (2) (2005) 197–217. [60] Openstack. URL: http://openstack.org/. [61] Zeroc. URL: http://www.zeroc.com/. [62] B. He, D. Sun, D.P. Agrawal, Diffusion based distributed Internet gateway load balancing in a wireless mesh network, in: Global Telecommunications Conference, GLOBECOM, IEEE, 2009, pp. 1–6.

Xingkong Ma received the B.S. degree in Computer Science and Technology from the School of Computer of Shandong University, China, in 2007, and received the M.S. in Computer Science and Technology from the School of Computer of National University of Defense Technology, China, in 2009. He is currently a Ph.D. candidate in the School of Computer of National University of Defense Technology. He is a student member of CCF and ACM. His current research interests lie in the areas of data dissemination, publish/subscribe systems, and network computing.

Yijie Wang received the Ph.D. degree from the National University of Defense Technology, China in 1998. She was a recipient of the National Excellent Doctoral Dissertation (2001), a recipient of Fok Ying Tong Education Foundation Award for Young Teachers (2006) and a recipient of the Natural Science Foundation for Distinguished Young Scholars of Hunan Province (2010). Now she is a Professor in the National Key Laboratory for Parallel and Distributed Processing, National University of Defense Technology. Her research interests include network computing, massive data processing, parallel and distributed processing.

Qing Qiu received the B.S. degree in Computer Science and Technology from the School of Computer of Shandong University, China, in 2010, and received the M.S. in Computer Science and Technology from the School of Computer of the National University of Defense Technology, China, in 2012. He is a student member of CCF and ACM. His current research interests lie in the areas of data dissemination, publish/subscribe systems, and network computing.

Weidong Sun received the B.S. degree in computer science and technology from the School of Computer of National University of Defense Technology, China, in 2005, and received the M.S. in computer science and technology from the School of Computer of National University of Defense Technology, China, in 2009. He is currently a Ph.D. candidate in the School of Computer of National University of Defense Technology. He is a student member of CCF and ACM. His current research interests lie in the areas of network computing, massive data processing, and cloud computing.


Xiaoqiang Pei received the B.S. degree in computer science and technology from the School of Computer of National University of Defense Technology, China, in 2009, and received the M.S. in computer science and technology from the School of Computer of National University of Defense Technology, China, in 2011. He is currently a Ph.D. candidate in the School of Computer of National University of Defense Technology. He is a student member of CCF and ACM. His current research interests lie in the areas of network computing, massive data processing, and cloud computing.