Mesh-Mon: A multi-radio mesh monitoring and management system

Soumendra Nanda, David Kotz

Department of Computer Science, and Institute for Security Technology Studies, Hinman 6211, Dartmouth College, Hanover, NH 03755, USA

Computer Communications 31 (2008) 1588–1601. Available online 2 February 2008.

Abstract

Mesh networks are a potential solution for providing communication infrastructure in an emergency. They can be rapidly deployed by first responders in the wake of a major disaster to augment an existing wireless or wired network. We imagine a mesh node with multiple radios embedded in each emergency vehicle arriving at the site to form the backbone of a mobile wireless mesh. The ability of such a mesh network to monitor itself, diagnose faults and anticipate problems is essential for its sustainable operation. Typical SNMP-based centralized solutions introduce a single point of failure and are unsuitable for managing such a network. Mesh-Mon is a decentralized monitoring and management system designed for such a mobile, rapidly deployed, unplanned mesh network. Mesh-Mon nodes actively cooperate and use localized algorithms to predict, detect, diagnose and resolve network problems in a scalable manner. Mesh-Mon is independent of the underlying routing protocol and can operate even if the mesh routing protocol fails completely. One novel aspect of our approach is that we employ mobile users of the mesh, running software called Mesh-Mon-Ami, to ferry management packets between physically disconnected partitions in a delay-tolerant-network manner. The main contributions of this paper are the design, implementation and evaluation of a comprehensive monitoring and management architecture that helps a network administrator proactively identify, diagnose and resolve a range of issues that can occur in a dynamic mesh network. In experiments on Dart-Mesh, our 16-node indoor mesh testbed, we found Mesh-Mon to be effective in quickly diagnosing and resolving a variety of problems with high accuracy, without adding significant management overhead.

© 2008 Elsevier B.V. All rights reserved.

Keywords: Wireless networks; Wireless mesh networks; Network management; Network monitoring; Network diagnosis; Network analysis

1. Introduction

Public safety agencies often need to rapidly deploy special communications capabilities to support large-scale events, such as annual parades, festivals, rallies and sporting events. Ad hoc and mesh networks are wireless networks that can be deployed quickly. They present a potential solution for deploying a data and voice network for "first responders" (FRs) in scenarios where there is limited or no infrastructure available, especially in natural-disaster and emergency-response situations. There are several challenges to managing such a mesh infrastructure that have not been adequately addressed. Several commercial mesh vendors offer proprietary management solutions [1–4]. These solutions are vendor-specific, and thus little is known about the techniques they employ. Furthermore, they are designed for a well-planned mesh network with stationary nodes.

We assume that our FR mesh is formed by rapid deployment of portable multi-radio Mesh Nodes (MNs). Several MNs form the mesh backbone of the communication infrastructure in an ad hoc manner. Mobile Clients (MCs), such as PDAs, laptops or sensors, can associate with the nearest deployed MN to send information in a multi-hop manner. We also assume that there are one or more mobile human system administrators (sysadmins) present at the deployment site, who are responsible for managing the network.

There are many differences between monitoring a wired network, or even an infrastructure-based Wireless Local Area Network (WLAN), and monitoring an emergency-response mesh network. In a wired network or a WLAN, wired backbone links are more reliable and have an order of magnitude higher capacity than the links in a wireless mesh backbone.


In our FR mesh scenario, we assume that links may be unstable and have limited bandwidth. Node mobility, node failures or link failures can create partitions. Some regions may have "holes" where no wireless coverage is available for clients. Nodes may malfunction or be incorrectly configured. The MCs may have difficulty connecting to MNs for several reasons. In a mobile mesh network, a single fault may lead to a disconnected network, which would make gathering accurate network information challenging or even impossible after the occurrence of certain faults. We claim that managing such a network requires a management system that is capable of adapting to these dynamic conditions and diverse failures.

We propose "Mesh-Mon," a proactive mesh management system designed for a dynamic environment. Mesh-Mon is designed to work in an unplanned mesh where we do not have complete information about how or where the network is deployed, or how it may be modified during its operation. Mesh-Mon is a system with many interrelated components for local information collection, storage of local and neighborhood information, partition-aware communication of network information, and distributed network analysis. Since we have observed instances where the implementation of a mesh routing protocol did not perform correctly, we know that any mesh monitoring system that relies on a specific underlying routing protocol may fail to operate if the routing protocol fails. We designed Mesh-Mon to work independently of the routing protocol, and consider this feature to be one of its strengths. We are not aware of any other network management system designed for mobile multi-radio mesh networks that can handle disconnections, partitions and multiple routing protocols, and function in the face of routing failures.

In Section 2 we introduce our design principles and objectives. Section 3 explains the components of Mesh-Mon in depth. Sections 4 and 5 illustrate our analysis engine design and fault detection engine, respectively. Section 6 describes our deployed mesh testbed and summarizes our experimental results. Section 7 reviews related work and highlights our contributions, followed by our concluding remarks.

2. Design goals and principles

"Mesh-Mon" is the software that runs on all mesh nodes and manages the collection, communication and analysis of information gathered in the network. "Mesh-Mon-Ami" (MMA) is a software component that runs on the client nodes and, on behalf of Mesh-Mon nodes, assists in communicating management information in disconnected areas of the network. Both Mesh-Mon and the MMA depend on the operating system (OS); thus, if the OS suffers a fatal crash, neither Mesh-Mon nor MMA can function. As shown in Fig. 1, Mesh-Mon (and Mesh-Mon-Ami) have three basic components: (1) a local information collection system, (2) an information communication and storage engine, and (3) a fault detection and analysis engine.

The primary aim of Mesh-Mon is to assist one or more mobile human sysadmins in fault detection and performance management. The sysadmin connects his laptop to any node in a connected partition. A software process "Mesh-Admin" running on the sysadmin's laptop then receives reports from individual mesh nodes and clients. These reports may contain alerts describing a detected fault or suspected anomaly. The human sysadmin's job is to study the reports, judge the severity of a problem and take corrective actions at his or her own discretion. We assume that a team of sysadmins can be addressed through an IP multicast address, so that any number of sysadmins may be present and may connect to any mesh node.

2.1. Design principles

Mesh-Mon has three main principles of operation: (1) each mesh node and mesh client must monitor itself, (2) each mesh node must monitor its k-hop neighbors and maintain a detailed representation of the local network and a sparse representation of the global network, and (3) each node must help in forming a hierarchical overlay network for propagation of monitoring information. The first principle is critical, since each mesh node and mesh client must be healthy for the network as a whole to be healthy. The second principle aims for distributed analysis by allowing local nodes to cooperate to detect and analyze local problems; Mesh-Mon nodes are designed to communicate in a peer-to-peer manner to monitor each other. The third principle is a common aggregation-based approach for scalability in large distributed systems.

Mesh-Mon uses a combination of active and passive monitoring techniques together with a rule-based diagnostics engine. Mesh-Mon code runs on each mesh node in a distributed manner, with each node capable of analyzing local and collected information. Mesh-Mon performs local repairs and generates alerts for the sysadmin as needed. Our overall system is designed to facilitate self-management when a node or partition is disconnected from the rest of the network.

The performance of any mesh depends heavily on the nature of the mesh routing protocol. Our current mesh testbed uses Optimized Link State Routing (OLSR) [5], a proactive mesh routing protocol, or Ad hoc On-demand Distance Vector (AODV) [20], a reactive protocol. We designed Mesh-Mon to function effectively, without requiring any significant modifications, even if we replace our mesh routing protocol with any other layer-three routing protocol (proactive, reactive or hybrid).

3. Mesh-Mon design details


Fig. 1. Mesh-Mon and Mesh-Mon-Ami common design components: an analysis engine and a communication engine, backed by local information collection and local information storage, exchanging information to and from other mesh nodes and clients.

3.1. Local information collection system

The goal of this component, which runs on every MN and MMA, is to periodically sample local state and configuration information. We need to measure, collect and analyze information about the state of the network at the three lower layers (physical, link and network) on both the clients and mesh nodes. We categorize the information collected as configuration-related, measurement-related or statistical. We record the strength of received signals, the Signal-to-Noise Ratio (SNR), the battery level of each node, the number of MCs at each MN, and network statistics for each interface (such as throughput, control overhead, total data traffic, errors, and current outgoing packet queue length). Periodically, nodes actively probe each other to measure bandwidth and latency among clients, mesh nodes and external hosts. The summary of the collected information, analysis results and alerts are stored and distributed to other nodes by the storage and communication engine. We chose to save information from other nodes, since doing so allows multiple sysadmins to quickly view information about the entire network from the node nearest to them.

3.2. Information storage and communication engine

While information storage and information communication are distinct tasks, we present them as one functional component since the two tasks are tightly coupled in our design and implementation. Each node stores detailed information about itself and its neighbors that it collects locally, information about its k-hop neighbors, and some information about the structure of the entire mesh network. For scalability, we designed Mesh-Mon to store more information about the local neighborhood and only a sparse representation of the global network and distant nodes.

Each Mesh-Mon node uses a flooding broadcast to periodically spread its local information and locally generated alerts in UDP packets. Thus alerts are cached by nearby nodes in case any of the alerts fail to reach a sysadmin. Flooding exploits broadcast communication between a node and all its one-hop neighbors, repeated by them to their neighbors. This simple flooding protocol has the effect of building a "holographic database" in which any single mesh node has a complete view of the whole network. In the event of a network partition, every node knows the full topology of the network and the status of nodes prior to the partitioning. Each node periodically records a distributed snapshot of the network locally, using the most recently received information about the network.
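To make the mechanism concrete, here is a minimal sketch of such a flooding agent, assuming JSON-encoded state carried in UDP broadcast packets; the class name, port number and field names are our own illustration, not the Mesh-Mon wire format. Each node originates a sequence-numbered packet per reporting interval, rebroadcasts each distinct packet from other origins exactly once, and folds every packet it sees into its local snapshot of the network.

```python
import json
import socket
import time

BCAST = ("255.255.255.255", 5151)  # hypothetical Mesh-Mon UDP port

class FloodAgent:
    """Periodic flooding with duplicate suppression (a sketch)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.seq = 0
        self.seen = {}      # (origin, seq) -> arrival time
        self.snapshot = {}  # origin -> latest state: the "holographic database"
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)

    def broadcast_local_state(self, state):
        """Originate one packet per reporting interval."""
        self.seq += 1
        pkt = {"origin": self.node_id, "seq": self.seq, "state": state}
        self.sock.sendto(json.dumps(pkt).encode(), BCAST)

    def on_receive(self, raw):
        """Record unseen packets and repeat them to our one-hop neighbors."""
        pkt = json.loads(raw.decode())
        if pkt["origin"] == self.node_id:
            return                      # our own packet echoed back
        key = (pkt["origin"], pkt["seq"])
        if key in self.seen:
            return                      # duplicate: already rebroadcast once
        self.seen[key] = time.time()
        self.snapshot[pkt["origin"]] = pkt["state"]
        self.sock.sendto(raw, BCAST)    # the flooding rebroadcast
```

Because every node rebroadcasts each distinct packet at most once, each node's `snapshot` converges toward the complete network view described above.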

By design, Mesh-Mon nodes can communicate through flooding even if the routing protocol fails or is disabled. This feature is important: in the event that the routing daemon crashes or misbehaves, Mesh-Mon is still able to inform its neighbors and the sysadmin of this fact and of other monitored information.

Our broadcast-based flooding approach to disseminating management information is suitable for small mesh networks, such as our 16-node testbed. This approach cannot scale to large deployments, since there are many redundant rebroadcasts and a risk of packet storms [6]. For large networks, our proposed solution is to limit the flood of information to k hops and to build a hierarchical overlay network of MeshLeaders, which communicate information between k-hop neighborhoods efficiently. A MeshLeader is equivalent to an elected cluster-head [7] or a Multi-Point Relay (MPR) in OLSR [5]. As shown in Fig. 2, a MeshLeader is selected by nodes in its k-hop neighborhood using a suitable leader-election protocol. We let nodes appoint themselves as the MeshLeader if none exists, and use the lowest numeric identifier as a tiebreaker if there are too many MeshLeaders in a single neighborhood; a sketch of this rule appears below. If a node is a MeshLeader, it discovers other MeshLeaders through a beaconing process. MeshLeaders then exchange with each other aggregated management information related to their own k-hop neighborhoods (collected through flooding). The aggregated information consists of topology, statistics and the latest alerts generated by all covered nodes. Each MeshLeader in turn shares topology information received from other MeshLeaders with its k-hop neighborhood constituents. The same hierarchy and structure is used during analysis of problems that span more than the k-hop neighborhood of a node (such as global partition detection).
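The election rule just described is simple enough to state in a few lines. Below is a sketch under the assumption that each node can see which MeshLeader (if any) each node in its k-hop view currently claims; the function name and argument shapes are ours.

```python
def elect_meshleader(my_id, claimed_leaders):
    """Self-appointment with lowest-id tie-breaking (a sketch).

    claimed_leaders: the set of MeshLeader ids currently claimed by
    nodes in our k-hop neighborhood (empty if no leader is visible).
    """
    if not claimed_leaders:
        return my_id              # no leader in sight: appoint ourselves
    return min(claimed_leaders)   # several leaders: lowest numeric id wins
```

A node that loses the tie simply demotes itself and adopts the winner, so a neighborhood with temporarily duplicated leaders converges after one exchange.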


Fig. 2. MeshLeader election and discovery. Phase 1: local k-hop elections to create MeshLeaders. Phase 2: MeshLeader-to-MeshLeader information exchange.

3.3. Fault detection and analysis engine

We detect several kinds of faults through a deterministic rule-based system. In addition, we run anomaly-detection algorithms on measured and received information in an attempt to detect unusual network behavior. Our analysis engine runs on all nodes and is detailed in Sections 4 and 5.

3.4. Mesh-Mon-Ami

A mesh network is designed to provide network access to several clients, many of which we assume are mobile. Mesh-Mon-Ami (MMA) is the software monitoring component running on one or more mobile clients. In addition to checking for MC configuration errors and monitoring the performance of the MC itself, the MMA's primary task is to ferry information between mesh nodes in disconnected partitions of the network. We use a store-and-forward technique similar to epidemic routing [8] for delay-tolerant networks (DTNs). The MMA client associates with the nearest MN and receives management information from it. The MMA then buffers the management packet and shares it with other MNs that the MMA associates with while moving. This MMA relay may be the only means of communication between two disconnected partitions of the physical network, short of investing in an additional long-range network link. In effect, the MMA client acts as a DTN "mule" for management packets. The effectiveness of using MMAs depends entirely on the mobility pattern of individual MMAs.

When an MMA receives a management packet from its associated MN, it compares the encoded topology with recent topology information it has gathered from other MNs previously visited. If, based on its analysis and results from ping tests, the MMA suspects that MNs it has previously visited are in different partitions, it sends an alert to the sysadmin and forwards the last message it received from previous MNs in each disjoint partition to the current MN (and eventually its k-hop neighbors and MeshLeader). Thus, unlike epidemic routing (where every message is replicated at every encountered node), an MMA selectively sends a small number of messages to the MNs it encounters. Periodically, the MMA expires older messages from its cache, since older information is more likely to be inaccurate in a dynamic network.
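The following sketch captures the MMA's buffering and selective forwarding. The class, the `send_to_mesh_node` transport stub and the expiry constant are hypothetical stand-ins; we assume each MN's report carries the set of nodes it can currently reach, which the MMA uses as a partition signature.

```python
import time

MAX_AGE = 600  # seconds before a buffered report is presumed stale (our choice)

def send_to_mesh_node(mn_id, report):
    """Hypothetical transport stub: hand the report to the associated MN."""
    ...

class MeshMonAmi:
    """DTN-style ferrying of management reports between partitions (a sketch)."""

    def __init__(self):
        self.buffer = {}   # partition signature (frozenset of MN ids) -> (time, report)
        self.visited = {}  # MN id -> set of MNs reachable from it when visited

    def on_associate(self, mn_id, report):
        reachable = set(report["topology"])
        # A previously visited MN that the current MN cannot reach suggests
        # the two sit in different partitions: forward the foreign reports.
        if any(old_mn not in reachable for old_mn in self.visited):
            self.deliver_foreign_reports(mn_id, exclude=reachable)
        self.visited[mn_id] = reachable
        self.buffer[frozenset(reachable)] = (time.time(), report)
        self.expire_stale()

    def deliver_foreign_reports(self, mn_id, exclude):
        for signature, (ts, report) in self.buffer.items():
            if not (signature & exclude):         # disjoint partition
                send_to_mesh_node(mn_id, report)

    def expire_stale(self):
        now = time.time()
        self.buffer = {s: (ts, r) for s, (ts, r) in self.buffer.items()
                       if now - ts < MAX_AGE}
```

Unlike epidemic routing, only the latest report per suspected partition is carried and forwarded, and `expire_stale` drops reports old enough to be misleading.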

4. Network analysis

Our network analysis unit is responsible for network health monitoring, local configuration management, and topological analysis. In our topology analysis, we attempt to predict partitions before they occur.

4.1. Network health monitoring

Since each node in Mesh-Mon has an approximate view of the global network (through MeshLeaders), each individual node selects target nodes and clients, probes them by sending pings, measures the round-trip time, calculates packet-loss statistics, and logs the traversed path to each destination. Depending on the routing algorithm and the stability of links, the route traversed can change several times, even for consecutive ping packets. We calculate network-level resource utilization locally on nodes and clients for each network interface. During periods of low utilization, nodes run pairwise throughput-measurement tests. In our prototype we use an automated version of Iperf [9] to run UDP and TCP throughput tests between nodes, clients and external hosts. Nodes that are heavily loaded may temporarily prevent other nodes from probing them, since these tests involve transferring 600 KB of data and may quickly saturate the wireless medium.

We calculate local network-health indicators [10] from sampled data, such as the network utilization, receive discard rate, transmit discard rate, receive error rate, transmit error rate, transmit broadcast rate and receive broadcast rate. The discard rate indicates the number of packets that were discarded by an interface due to resource limitations; a high discard rate usually indicates that more buffer space is needed for an interface. A high error rate may indicate a hardware problem. A sudden and sustained jump in the broadcast rate (packets broadcast per second) can indicate a packet storm. A high percentage of failed reassemblies at the IP layer can indicate that fragments are being corrupted or discarded at intermediate nodes, while a high percentage of fragmentation could be caused by MTU mismatches. We record statistics for UDP, TCP and ICMP traffic.

Sudden changes in the statistical values of our network-health indicators are counted as anomalies, triggering alerts to neighbors and the sysadmins.

4.2. Local configuration management

Mesh-Mon aids in local configuration management of each node. Each node has a local set of reference values and acceptable ranges for each configurable component. Mesh-Mon periodically checks the local configuration and compares its configuration settings with those of its neighbors. If any of the locally configured parameters are outside acceptable ranges, Mesh-Mon generates an alert for the sysadmin and locally attempts to restore the configuration to a default state. We periodically check the status of essential services running on each mesh node (such as olsrd, iptables and dhcpd). If any of these services have crashed, then Mesh-Mon will attempt to restart them with default settings and alert the sysadmins. At present, our configuration tests utilize only static local information. In future work, we would like the analysis engine to be capable of autonomously determining the correct configuration parameters (possibly through neighborhood consensus).

4.3. Partition prediction and partition detection

Assuming that the mesh is one large connected component at initialization, partitions occur when specific nodes fail or nodes move away. Nodes in different partitions cannot communicate with each other. Critical nodes and critical links (equivalent to articulation points and bridges in undirected graphs) are those nodes or links whose removal will partition the induced network subgraph into disconnected components. By identifying critical nodes and links, Mesh-Mon can anticipate where partitions are likely to occur.

Tarjan [11] presents a centralized algorithm based on Depth-First Search (DFS) to detect critical nodes and links. If the root of the DFS tree has two or more children, then it must be a critical node, since the nodes in the two subtrees share only the root in common; removing the root node would thus leave two isolated partitions. Jorgic et al. [12] propose a localized approach to predict critical nodes by running DFS on just the k-hop neighborhood graph of each node. Unfortunately, graph connectivity is a global property; thus there is always a possibility that any localized algorithm will produce false positives. False positives are nodes that are marked globally critical but in reality are only critical to the local neighborhood. In their simulations [12], the authors observed that about 80% of locally predicted nodes were globally critical in random connected graphs for k = 3. In our approach, each node first runs the localized detection algorithm on its k-hop neighborhood. If a node suspects another node of being critical, its MeshLeader is requested to verify the suspicion by running Tarjan's DFS-based approach on the global topology.

If Mesh-Mon confirms the presence of a critical node and issues an alert, the sysadmin may choose to adjust the topology to reduce the risk of partitioning, by moving some nodes or by adding an additional node in a strategic location. While critical nodes help the sysadmin predict where the next partition is likely to occur, Mesh-Mon nodes detect an actual partition by noting changes in the set of mutually reachable nodes in the reported topology over time. Each node in Mesh-Mon assumes unreachable nodes are in their own partitions, unless a sysadmin confirms that the node is dead, or additional information arrives through an MMA. In a highly mobile environment, partitions can grow and shrink frequently.

4.3.1. An optimization for OLSR

While we strive to keep Mesh-Mon independent of the underlying routing protocol, we can benefit from information already available without modifications to our software or hardware architecture. Since OLSR is a proactive routing protocol, every node in the network already has some information about the current global topology of the entire network. We do not assume that all routing-protocol implementations expose internal data structures, so the optimizations we describe here are specific to our OLSR implementation. In OLSR, the "Topology Change Redundancy" (TCR) setting determines the volume of redundant topology information propagated in the network. When we set TCR to its maximum value, each node receives the global topology periodically; the number of control packets sent does not change, but the size of each packet increases. Thus, in our implementation each connected node periodically acquires a view of the global topology (from OLSR) and a strong view of its k-hop neighborhood (from Mesh-Mon). The node can then locally determine all the critical links and nodes on this graph in Θ(V + E) time, for a graph with V nodes and E links, using DFS; a sketch of the standard algorithm follows. We also output all nodes in each bi-connected component in the same pass.
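For reference, a compact version of the standard linear-time computation is sketched below; this is the textbook DFS algorithm [11], not Mesh-Mon's source code. The graph is a dictionary mapping each node to the set of its neighbors.

```python
def critical_nodes(adj):
    """Articulation points of an undirected graph (Tarjan's DFS, a sketch).

    A non-root node u is critical if some DFS child v has low[v] >= disc[u];
    the DFS root is critical if it has two or more children.
    """
    disc, low, critical = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v not in disc:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    critical.add(u)
            elif v != parent:
                low[u] = min(low[u], disc[v])  # back edge
        if parent is None and children >= 2:
            critical.add(u)

    for node in adj:  # also handles graphs that are already partitioned
        if node not in disc:
            dfs(node, None)
    return critical
```

Running this on the k-hop view implements the localized prediction of Jorgic et al. [12]; running it on the global topology implements the MeshLeader's verification step.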


4.4. Eigenvector centrality

The degree of a node in a mesh network is the number of links the node shares with its neighbors that are available for routing purposes. The current degree of an individual node, and the minimum, maximum and average degree of all nodes in the entire network, are useful characterization metrics. From the global topology, all the nodes can be quickly ranked by degree to identify the "most connected" and "least connected" nodes. However, this degree-based ranking does not convey the true nature of connectivity in the network, since all links are not identical (link characteristics fluctuate over time) and all nodes are not equally important. Two nodes with the same degree may not have similar characteristics (one node may be in the core and another on the periphery of the mesh). In any ad hoc or mesh network, where nodes must cooperate with each other to route packets, the connectivity of a node depends on the connectivity of its neighbors.

Bonacich suggested that a better "connectivity" ranking can be developed using eigenvector centrality (EVC) [13]. EVC is a metric often used in social-network analysis, in which an important node (or person) is characterized by its connectivity to other important nodes (or people). Our approach is inspired by the work of Bytyci [14], who studied the stability of wired networks using EVC calculated on offline traces to rank nodes in the network. We use similar techniques (detailed below), but in a fully online manner. In particular, we focus on changes in the EVC over time as an indicator of changes in connectivity and potential anomalies. Begnum and Burgess [15] analytically studied centrality-based ranking with offline traffic traces on a static wired network, but reported limited success in detecting network anomalies from NFS and email traffic patterns.

Eigenvector centrality is calculated using the network's topology, represented as an adjacency matrix. Let $v_i$ be the $i$th element of the vector $\vec{v}$ of centrality measures for $n$ nodes, and let $A$ be the $n \times n$ binary adjacency matrix of the network, with zero diagonal entries. The centrality of a node is proportional to the sum of the centrality values of all its neighboring nodes. Centrality is thus defined by the following formulas:

$$v_i \propto \sum_{j \in \text{neighbors of } i} v_j \qquad (1)$$

$$\Rightarrow \quad v_i \propto \sum_{j=1}^{n} A_{ij} v_j \qquad (2)$$

$$\Rightarrow \quad v_i = \frac{1}{\lambda} \sum_{j=1}^{n} A_{ij} v_j \qquad (3)$$

Eq. (3) can be rewritten in vector form as an eigenvalue equation:

$$\lambda \vec{v} = A \vec{v}$$

Since $A$ is an $n \times n$ matrix, it has $n$ eigenvectors $\vec{v}$ and $n$ corresponding eigenvalues $\lambda$. The principal eigenvector is the eigenvector with the highest eigenvalue. After the principal eigenvector is found, its elements are sorted from highest to lowest value to determine the EVC ranking of nodes that we seek. In the mesh context, a node with a high EVC is a strongly connected node and would be a good candidate for a MeshLeader. On the other hand, a low-ranked node could warrant further investigation and scrutiny, since it may have poor connectivity (few neighbors, or a position at the edge of the network).

We calculate three variants of EVC for mesh networks. In the first variant we use the binary adjacency matrix representing the global topology. In the second variant, we use Expected Transmission Count (ETX) values to weight the adjacency-matrix entries (with values that lie between 0 and 1) to obtain the "effective expected adjacency matrix" [14], and calculate the link-quality (LQ) EVC on the new matrix. Other metrics worth considering as link weights are link capacity and latency, but we do not have them accurately available at all times in our online monitoring system.
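A short power-iteration sketch shows how the variants reduce to the same computation on different matrices; `numpy`, the iteration count and the small identity shift (to guarantee convergence on connected topologies) are our implementation choices, and the gateway variant implemented by `gw_evc` is the one described next in the text.

```python
import numpy as np

def evc(A, iters=100):
    """Principal-eigenvector centrality by power iteration (a sketch).

    A: the n x n adjacency matrix -- binary for plain EVC, or weighted
    with ETX-derived link qualities in [0, 1] for the LQ variant.
    """
    n = A.shape[0]
    M = A + 1e-6 * np.eye(n)  # tiny shift keeps the iteration from oscillating
    v = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)  # renormalize each step
    return v

def gw_evc(A_lq, gateways, gw_weight=10.0):
    """GW variant: add the Internet as virtual node n with weight-10 links
    to every gateway, then return centralities of the real nodes only."""
    n = A_lq.shape[0]
    B = np.zeros((n + 1, n + 1))
    B[:n, :n] = A_lq
    for g in gateways:
        B[g, n] = B[n, g] = gw_weight
    return evc(B)[:n]
```

Sorting the returned vector from highest to lowest value gives the rankings used in Section 6.4.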

In the third variant, the gateway (GW) EVC, we consider the Internet as a virtual node in the ETX adjacency matrix. Because we want to emphasize the importance of gateways over other MNs, we give a high weight of 10 to links between any gateway node and the Internet. In this manner gateway nodes receive the highest centrality, and we get a numerical ranking for each MN that reflects its connectivity to the Internet. Our hypothesis (validated by our results in Section 6.4) is that this metric is a useful tool for understanding gateway load balancing and for anomaly detection. The GW EVC helps evaluate how Internet connectivity is affected when a gateway node fails and the clients and other mesh nodes must adjust routes to connect through an alternate gateway.

5. Fault detection and rule-based diagnosis

Our goal is not just to detect a problem, but whenever possible to determine the root cause, resolve it automatically if possible, or provide timely information and guidance to the administrator. The analysis engine within Mesh-Mon controls and runs the tests described earlier (topology analysis, configuration, and health monitoring). Upon detection of an anomaly or a negative test result, or after parsing received alerts (from neighbors or MMAs), Mesh-Mon nodes run secondary diagnostic tests and generate corresponding alerts for neighbors and the sysadmin.

To design this engine, we created a state-transition diagram consisting of node states and state transitions based on alerts and results from diagnostic tests. State transitions can include actions such as issuing an alert to a sysadmin or executing a secondary test. The goal is to reach a terminal state that identifies the problem faced by the mesh node, client or network. Our engine generates a hypothesis about the cause of the fault based on the outcome of all the individual tests in a short time window. We present a simplified flowchart of the analysis engine (without secondary tests) in Fig. 3. All diagnostic information, the rules applied and the corresponding results are logged locally.

All generated alerts are multicast to the sysadmin as well as piggybacked on packets that Mesh-Mon nodes send to their k-hop neighbors. The alert structure includes fields that encode the type and subtype of the problem, the local action taken, and the result of that action; a sketch appears below. Neighbors parse received alerts and run automated secondary diagnostic tests. For example, if a neighbor reports congestion, all nodes receiving that alert will quickly check whether they are experiencing congestion as well. In Table 1 we describe a few scenarios, the corresponding local actions taken, the type of alert that will be generated, and the actions that neighbors will execute upon receipt of such alerts. For example, if a mesh node finds it has no neighbors, it will try an increased transmission power-level setting. Some actions may have adverse consequences: increasing the power level may lead to increased interference, while reducing power levels may lead to a loss of some neighbors.
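The alert record and the neighbor-side dispatch can be pictured as follows; the field names and the mapping are illustrative, mirroring Table 1 rather than reproducing Mesh-Mon's actual packet layout.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Alert:
    """Illustrative alert record with the fields named in the text."""
    origin: int        # id of the node that raised the alert
    alert_type: str    # e.g. "Anomaly", "Services", "Config."
    subtype: str       # e.g. "Isolation", "DHCP", "Channel"
    local_action: str  # default local action that was attempted
    result: str        # outcome of that action
    timestamp: float = field(default_factory=time.time)

# Hypothetical dispatch table mirroring the "action on alert receipt"
# column of Table 1: the secondary check a neighbor runs for each alert.
ON_RECEIPT = {
    ("Services", "DHCP"):      "check_local_services",
    ("Anomaly", "Broadcast"):  "check_broadcast_rate",
    ("Anomaly", "Congestion"): "check_local_health_metrics",
    ("Fault", "External"):     "check_test_site_access",
}
```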

Fig. 3. Analysis engine flowchart: a local configuration check and an essential-services check (each with repair attempts), followed by MMA-specific tests if clients are connected, probe tests and topology analysis if mesh nodes are in range, a MeshLeader liveness check (initiating leader election if needed), and a fault-or-anomaly detector leading to diagnosis, repair attempts, and alerts to sysadmins and neighbors.

Table 1
Mesh-Mon alert generation and processing

Scenario/symptoms               Alert type   Subtype      Default local action          Action on alert receipt
Radio is on incorrect channel   Config.      Channel      Reset to default channel      Check local channel
dhcpd crashed                   Services     DHCP         Restart service               Check local services
iptables service crashed        Services     iptables     Restart service               Check local services
Broadcast storm suspected       Anomaly      Broadcast    No action                     Check broadcast rate
No MCs or MNs are in range      Anomaly      Isolation    Reset radio and boost power   Ping sender
Change in EVC                   Anomaly      Centrality   No action                     Recalculate EVCs
Drop in avg. throughput         Anomaly      Throughput   No action                     Probe links between pair
Jump in dropped frames          Anomaly      Congestion   No action                     Check local health metrics
MeshLeader is unreachable       MeshLeader   Re-election  Announce new MeshLeader       Check validity of new leader
New critical node detected      Topology     Criticality  Inform MeshLeader             Run topology analysis
Test site is inaccessible       Fault        External     Test if gateway is reachable  Check test site access

The sysadmin is responsible for verifying alerts and for determining the course of action when Mesh-Mon is unable to resolve the problem automatically. In our present implementation we only analyze locally generated information and information received from k-hop neighbors. For detecting problems in very large networks, we propose that MeshLeaders communicate with each other and cooperatively diagnose problems affecting multiple neighborhoods.

We present a short example that illustrates our analysis approach. Assume that a Mesh-Mon node is unable to access a default external test site, say, www.meshmon.com. The node will run through its local tests indicated in Fig. 3, and will detect the problem at the "Fault or Anomaly Detector" stage.

Assume that all other local checks on the MN have passed correctly. After running through the tests, the engine will reach the "Diagnose and Attempt Repairs" step. At this stage, Mesh-Mon will check recent alerts from neighbors (if any), run additional secondary tests or attempt local repairs. For this specific fault, the MN will verify that the Internet Gateway is reachable (through ping) and that the address is resolvable through the Domain Name System (DNS) nameserver. If the test results recommend a local repair, such as resetting the DNS settings in /etc/resolv.conf to a safe reference value, then Mesh-Mon will attempt the repair and check its outcome. Since our node has not received any alerts from its neighbors so far and the local repair has failed, the MN sends an appropriate alert message to the sysadmin and its neighbors.


The neighbors parse the received alert and run their own corresponding local diagnosis rules (see Table 1). In this case the neighbors would also try to access the test site. Since the neighbors have no difficulty reaching the test site, the problem is probably at the original node. The original node will periodically loop through its analysis routine, and if the same test fails again, it will send a higher-priority alert to the sysadmin. The sysadmin can then remotely log in to the MN and manually attempt to repair the problem. Perhaps DNS name resolution was disabled after a remote software update and the sysadmin must roll back the update. Complex problems such as this one require human intervention, while others can be fixed automatically by Mesh-Mon. Problems such as congestion may have no direct solution, but the sysadmin will be made aware of the extent of a problem through multiple alerts from multiple Mesh-Mon nodes.

6. Evaluation

We present results on how quickly Mesh-Mon detected a problem, the accuracy of the diagnosis, and the CPU utilization and bandwidth overhead that are key to scalability and efficiency. We deployed 16 mesh nodes inside our department. Overall, the mesh was extremely stable when configured properly. Although we run our experiments on a real deployment, we introduce failures in controlled experiments to test Mesh-Mon and evaluate our design.

6.1. Our experimental mesh testbed

We built Dart-Mesh, a two-tier mesh testbed using 16 dual-radio Linux boxes. Dart-Mesh is a live system deployed on all three floors of our department building (see Fig. 4). The same hardware was used earlier in a static residential setup [19]. Each MN has one 802.11b interface in Master mode, creating the access tier: each MN acts as an Access Point (AP) and uses a common ESSID "Dart-Mesh". MCs associate with the nearest AP and acquire an IP address through the Dynamic Host Configuration Protocol (DHCP). MNs connect to each other via the second radio interface, which is set to ad hoc mode and uses OLSR or AODV, thus creating the mesh tier. Each ad hoc interface shares a single subnet and a common 802.11b channel. Five mesh boxes are connected to the wired Internet and act as gateways; the gateway nodes act as Network Address Translators to manage connections to and from the Internet. Prior to deployment, all mesh nodes and mesh clients are pre-configured to a fixed channel on each tier. During the course of our development, we tested several publicly available routing protocols [16–18]. To demonstrate Mesh-Mon's ability to support multiple routing protocols, we run all tests with both AODV and OLSR on the same topology. We present some general performance results for both protocols in Table 2.

Fig. 4. Typical Dart-Mesh deployment with 16 nodes (including five gateways) across three floors of the CS department.


Table 2
Dart-Mesh performance

Protocol  Category                One-hop    Two-hops   Three-hops
OLSR      Average TCP throughput  3.36 Mbps  784 Kbps   224 Kbps
          Average ping RTT        8.08 ms    13.74 ms   34.29 ms
AODV      Average TCP throughput  2.87 Mbps  654 Kbps   108 Kbps
          Average ping RTT        6.89 ms    51.54 ms   134.28 ms

6.2. AODV testing and routing overhead

For normal mesh usage, we run OLSR [16] because it has built-in support for multiple gateways to the Internet, while AODV does not. To demonstrate Mesh-Mon's versatility, we used AODV-UU [18], since it supports Linux 2.6 kernels, while kernel AODV [17] (used earlier with our mesh nodes [19]) supports only the older 2.2 and 2.4 kernels and is no longer maintained. Routing-protocol control overhead with AODV is in general much lower than with OLSR, because AODV is a reactive protocol. However, AODV routes solely on hop count and thus often uses routes with fewer hops but lower link quality. The average routing control overhead of OLSR is 13.5 Kbps per node; the average routing overhead of AODV is only 2.5 Kbps per node. The routing overhead for AODV depends mostly on client usage patterns, whereas for OLSR the overhead depends on the size and stability of the entire network, as well as configuration parameters. AODV periodically forgets unused routes and stores only current one-hop neighbors; the resulting route rediscovery makes AODV's latency higher.

6.3. Mesh-Mon overhead

Each Mesh-Mon and MMA packet is 1408 bytes long and includes multiple alerts. Each node generates one Mesh-Mon packet per minute by default. MNs communicate through flooding with their k-hop neighborhood in our 16-node network. Shorter alert-only packets (100 bytes) are sent to the sysadmin as needed. The average management overhead purely due to Mesh-Mon was at most 9.8 Kbps per node with all 16 nodes. This overhead could be reduced by letting each node rebroadcast packets probabilistically, by compressing data, and by other optimized broadcast algorithms. In addition, each MN runs Iperf once every 2 min (to estimate bandwidth), which adds 600 KB of bursty traffic. This overhead can be reduced by decreasing the frequency of this test; using a smaller payload leads to inaccurate bandwidth estimates. Multiple failures lead to short bursts of high Mesh-Mon-to-sysadmin alert traffic, but this traffic was never high enough to cause congestion or affect user traffic in our tests with static networks.

The combined traffic overhead of Mesh-Mon with variable network sizes and a single MeshLeader is shown in Fig. 5 for both AODV and OLSR. The overhead is independent of the routing protocol and depends on the size, topology, network density, and rate of failure. Mesh-Mon flooded traffic grows rapidly (worst case n²) and dominates as the network size increases, justifying the need for multiple MeshLeaders in larger networks, while the average Iperf-induced traffic increases only slightly as the network grows and longer multi-hop routes are created. Mesh-Mon provides identical management functionality with similar overheads under either AODV or OLSR, and generates a similar volume and accuracy of alerts for identical tests on both networks. The only difference between the two protocols in our implementation is that with OLSR each MN has local access to the global topology (bypassing MeshLeaders), whereas with AODV global topology information is gathered through neighbors and MeshLeaders. The CPU overhead of running Mesh-Mon was under 1.5%, as reported by the Unix time command on a 1.3 GHz system. Thus, it may be feasible to extend Mesh-Mon's abilities to perform traffic analysis by using a packet-capture program to compute statistics on the composition of individual packets. We reserve tests with multiple MeshLeaders and more mobile scenarios for future work.
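A back-of-envelope estimate, under the assumption that every node rebroadcasts every distinct packet exactly once, reproduces the order of magnitude of these numbers and the n² scaling:

```python
def flood_tx_kbps(n, pkt_bytes=1408, pkts_per_min=1):
    """Per-node transmission load of naive flooding (a rough sketch).

    Each node transmits n packets per interval (its own plus n-1
    rebroadcasts), so per-node airtime grows linearly in n and total
    network transmissions grow as n^2.
    """
    return n * pkts_per_min * pkt_bytes * 8 / 60 / 1000  # Kbps

print(flood_tx_kbps(16))  # about 3.0 Kbps of transmissions per node
```

Counting receptions and forwarding as well (Tx + Rx + Fwd, as plotted in Fig. 5) multiplies this figure by roughly the average neighbor count, which is consistent in magnitude with the measured 9.8 Kbps.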

Fig. 5. Average combined monitoring overhead per node (Tx + Rx + Fwd), in bytes per second, versus number of nodes, for Iperf and Mesh-Mon traffic under both OLSR and AODV.


6.4. EVC calculation results

We present results from the calculation of EVC for a static network with 13 nodes in a single bi-connected component over 2 h (the other nodes were turned off). Over the duration of the measurements, the EVC ranks were stable, variance was low, and the median EVC value was close to the mean. By looking at the average of the three EVC metric calculations (see Table 3) and considering the average degree of each node, we gain some interesting insights. The gateway-weighted (GW) EVC gives a ranking of the quality of Internet connectivity a client should expect from a mesh node. We can immediately see that nodes 2, 5 and 6 are gateways from the magnitude of their GW EVC values. We may also infer that nodes 8, 9, 12 and 13 are not physically close to gateways (or, if they are close to any of the gateways, they have poor-quality links to those gateways). In this example, node 8 was actually just one hop away from gateway node 5. From the link-quality (LQ) EVC, we can see that node 8 has the lowest rank but a high degree, and is centrally ranked in the original EVC.


Thus the conclusion is that node 8 is surrounded by many neighbors, has weak connections with them, and has poor Internet connectivity. Node 13 has few neighbors, but has high-quality links with them, and has even worse Internet connectivity. Node 1 is the most well-connected non-gateway node. We compared our conclusions with the actual topology and performance results and found that they were correct.

We now present results on how the EVC value changes after a significant network event. We tested what happens when a gateway node is switched off, when a gateway node loses its Internet connection, and when the highest-ranked, moderately ranked and lowest-ranked nodes are turned off. Despite normal fluctuations, we detected noticeable changes in the GW EVC in all the above cases (see Fig. 6). For instance, if the GW EVC of a gateway node went to zero, then the node had crashed (or was turned off). Subsequently, the GW EVCs of the other gateway nodes (nodes GW2 and GW6 in Fig. 6) went up.

Table 3
Ranked average EVC calculations and average degrees

Rank      1      2      3      4      5      6      7      8      9      10     11     12     13

NodeID    1      3      7      2      15     5      8      11     6      13     12     16     9
EVC       0.41   0.38   0.30   0.30   0.29   0.29   0.26   0.24   0.21   0.18   0.17   0.17   0.12

NodeID    1      7      3      15     5      2      13     11     6      16     12     9      8
LQ EVC    0.43   0.40   0.40   0.38   0.37   0.25   0.17   0.15   0.10   0.082  0.081  0.065  0.061

NodeID    5      2      6      7      1      3      15     11     16     8      9      12     13
GW EVC    0.29   0.28   0.27   0.03   0.02   0.02   0.02   0.01   0.006  0.006  0.004  0.003  0.002

NodeID    1      3      2      7      11     15     5      8      6      12     16     13     9
Degree    10.6   9.6    6.6    6.5    6.2    6.01   6.0    5.55   5.4    4.3    4.03   4.0    2.8

Fig. 6. GW EVC changes over time for three gateway nodes (GW2, GW5 and GW6).


If a gateway node lost its Internet connection, but did not crash (GW6 at time t = 50 min), then its EVC value dropped. When Internet connectivity was restored, the GW EVC of the node went up again. We also noticed that the GW EVC of node 9 (a low-ranked node) was often zero (see Fig. 7). For node 5, the GW EVC falling to zero was due to a crash. For node 9, however, we discovered by parsing several logs manually that it had never crashed, but was unable to create symmetric links with its neighbors (another anomaly type). Changes in GW EVC for gateway nodes triggered minor changes in EVC values for other nodes, based on their dependence on the gateway nodes. EVC values fluctuate over time (see Fig. 7), but their mean and moving averages (over a window of five samples) are relatively steady. Thus, the EVC rankings and changes in GW EVC can be used for anomaly detection, with the caveat that identifying the exact cause of an anomaly (as for node 9) may require more information. The GW EVC is a versatile metric, since it combines the adjacency matrix, link quality and the status of gateway nodes into a one-dimensional value for each node. Similarly, the LQ EVC is useful for analyzing connectivity in self-contained mesh networks with no gateways.
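A per-node watcher for the GW EVC time series might look like the sketch below; the window of five samples follows the text, while the relative-change threshold is our own placeholder.

```python
from collections import deque

class GwEvcWatch:
    """Flag GW EVC anomalies for one node (a sketch)."""

    def __init__(self, window=5, rel_threshold=0.5):
        self.history = deque(maxlen=window)  # moving-average window
        self.rel_threshold = rel_threshold   # assumed sensitivity

    def update(self, value):
        alert = None
        if value == 0 and self.history and max(self.history) > 0:
            alert = "gw-evc-zero"    # crash, shutdown, or no symmetric links
        elif self.history:
            avg = sum(self.history) / len(self.history)
            if avg > 0 and abs(value - avg) / avg > self.rel_threshold:
                alert = "gw-evc-shift"  # sudden change vs. the moving average
        self.history.append(value)
        return alert
```

As the node 9 case shows, such an alert only localizes an anomaly; distinguishing a crash from, say, a symmetric-link failure still requires the secondary diagnostics of Section 5.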

6.5. Other test cases and qualitative results

In our tests we consider two types of sysadmins: a stationary sysadmin and a mobile sysadmin (running MMA). We collect all alerts destined for the sysadmin's Mesh-Admin process in a time-stamped log on his client laptop. We manually scrutinize the data collected in the logs and on each individual node, both during and after a controlled experiment. At present, the output of our system is primarily text information in the form of logs on each node, alerts and reports on the sysadmin's console, and a few automatically generated graphs.

6.5.1. Node failures and routing failures

We first simulated a single-node failure by powering down a single node in a static configuration. We also observed what alerts were generated when a node was rebooted. We repeated this test on different nodes, including nodes that Mesh-Mon had suspected of being critical. We then proceeded to simulate failures in multiple nodes simultaneously. In later tests we ran the same scenarios in the presence of Mesh-Mon-Ami clients and, depending on how the client was moving during the test, observed whether the sysadmin got additional information about the state of the network. In all these tests, we evaluated the accuracy of the alerts, the time between the event occurring and the node locally logging the hypothesis, and the time delay until the sysadmin received the alert. In post-test analysis we verified that the logs were correct and consistent with the received alerts.

In addition to the above tests, we turned off the routing daemons on selected running nodes and observed that Mesh-Mon was able to continue delivering monitoring information both to and from such nodes through its flooded packets. This demonstrated Mesh-Mon's unique ability to function when the routing daemon on one or more nodes has crashed or is not functioning correctly (possibly due to a poor implementation or a bad software update). Such a situation would potentially render any unicast-communication-based management system ineffective, but did not affect Mesh-Mon.

Fig. 7. GW EVC changes over time for node 9.

6.5.2. Hardware failures

We disabled one or both 802.11b radios on an MN and checked whether Mesh-Mon or the client was able to detect an error. For instance, if a mesh node had its ad hoc interface radio fail and detected it, it may still be able to alert an MMA, since its AP interface may still work. Alternately, if the node could not detect the hardware failure, the MMA may notice that it can communicate with the mesh node it is associated with, but not with other mesh nodes. When any radio was turned off, Mesh-Mon was able to detect it and reset it correctly. One issue we discovered was that Atheros-based wireless cards would lock up after a random period (between 1 and 2 h) of usage in ad hoc mode. This effect is due to an unresolved "ath_mgtstart: discard, no xmit buf" bug (reported by dmesg) in the current MadWiFi drivers. Mesh-Mon was able to resolve this issue automatically, because it reset the wireless interface whenever a Mesh-Mon node locally reported a sudden loss of all its neighbors or was isolated for a long period of time.

6.5.3. Configuration

We selectively modified certain configuration parameters on both clients and nodes. In these tests we measured the time delay until the problem was resolved by a node taking corrective action or by the sysadmin manually resetting the value. We also recorded the effect a misconfigured node has on its neighbors and the network; such information can be useful for future versions of the diagnosis engine. Mesh-Mon was able to resolve all configuration issues within a few milliseconds once they were detected.

6.5.4. Congestion

We forced various traffic patterns between pairs of clients and managed to saturate specific links. This condition forced the local anomaly detector (monitoring local health metrics) to trigger alerts to the sysadmin and neighbors to help identify regions of congestion.

6.5.5. Multiple simultaneous events

After testing the individual cases, we tested the detection of several of the above faults occurring simultaneously, or within a short delay of other events, on multiple devices. Mesh-Mon nodes report all problems encountered, ranked by severity, not just the first one. If the faults are unresolved, the alerts are resent in the next reporting cycle. Mesh-Mon was again able to detect and send accurate alerts to the sysadmin, as long as both were in the same partition.

6.5.6. Mobility testing and partitions

In our initial setup with 16 nodes, we had a well-connected network. As we introduced several individual node failures and moved individual nodes around, the topology graph became sparse and Mesh-Mon reported the presence of critical nodes. We tested each alert by turning off the reported critical node and checking whether the network was still connected or partitioned. We had a few false alarms with critical-node detection, since a few nodes would run computations with slightly outdated topology data due to mobile nodes and transient links. We found that a simple way to filter these false positives was to wait for at least three identical alerts in a row; such post-processing could be implemented in the Mesh-Admin and its interface with the sysadmin, as in the sketch below.
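Such a filter is only a few lines of code; the sketch below assumes alert records keyed by (type, subtype, origin), as in the illustrative Alert structure of Section 5, and the three-in-a-row rule from our tests.

```python
class AlertDebouncer:
    """Suppress transient alerts: pass one through only after the same
    (type, subtype, origin) key recurs in three consecutive cycles."""

    def __init__(self, required=3):
        self.required = required
        self.streaks = {}  # alert key -> consecutive-cycle count

    def filter(self, alerts_this_cycle):
        latest = {(a.alert_type, a.subtype, a.origin): a
                  for a in alerts_this_cycle}
        # drop streaks for alerts that did not recur this cycle
        self.streaks = {k: v for k, v in self.streaks.items() if k in latest}
        confirmed = []
        for key, alert in latest.items():
            self.streaks[key] = self.streaks.get(key, 0) + 1
            if self.streaks[key] >= self.required:
                confirmed.append(alert)
        return confirmed
```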


After we reduced the network to two distinct partitions, we walked around with a laptop as an MMA, visiting every node on every floor. The MMA detected the two partitions and sent alerts to all other nodes it connected to (and thus their neighbors, and so forth). We repeated this test with three partitions. Whether the stationary sysadmin received the alerts depended on which partition he was located in and the path followed by the MMA. In a realistic scenario, a robotic or intelligent MMA could remember where the sysadmin was last contacted and move in an optimized manner to ensure that its alerts reach the sysadmin with high probability.

7. Related work

We present a summary of the advantages of Mesh-Mon over comparable systems in Table 4. The only drawback of our approach is the relatively higher communication overhead due to flooding, but its impact on user traffic was negligible in our evaluations, and flooding enables Mesh-Mon to be routing-protocol independent and capable of monitoring a mesh network even during routing failures.

Simple Network Management Protocol (SNMP) is the de facto protocol for management of most networks due to its widespread acceptance. Commercial mesh systems [1–4] are often bundled with centralized SNMP-based management solutions. SNMP serves as a monitoring tool and leaves the task of analysis and management to humans. While SNMP is based on a pull-based philosophy, Mesh-Mon uses a proactive push-based approach. SNMP was originally designed for static wired networks and uses a centralized design. Mesh-Mon has no single point of failure, since our architecture is distributed and nodes monitor each other in a peer-to-peer, adaptive manner.

The Ad hoc Network Management Protocol (ANMP) [7] is designed to be an extension of SNMP for data collection from mobile devices. Management in ANMP is based on a hierarchical approach where elected cluster-heads poll management information from their cluster members in a centralized manner. The Guerilla Management Architecture (GMA) [21] extends the ANMP model by allowing nodes to participate in management functions based on their capability to perform certain tasks or measurements. The authors advocate the use of techniques from the mobile-agent community (such as mobile code) to allow nodes to manage themselves in an adaptive and distributed manner, but do not provide details. Neither ANMP nor GMA has ever been implemented.

Distributed Ad hoc Monitoring Network (DAMON) [22], from UCSB, is a distributed system for monitoring ad hoc and sensor networks.

Table 4
Summary comparison of Mesh-Mon vs. other systems

System     Structure            Scalable  Partition aware  Routing protocol independent     Implemented
Mesh-Mon   Hierarchical         Yes       Yes              Yes (tested with AODV and OLSR)  Yes
ANMP       SNMP-based           Yes       No               No                               Simulated
GMA        Mobile agent based   No        Yes              No                               No
DRAMA      Hierarchical         Yes       No               No (OLSR)                        Simulated
DAMON      Hierarchical         Yes       No               No (AODV)                        Yes
MeshMan    Wired control plane  No        No               No (AODV)                        Yes
JANUS      DHT-based            No        No               No (LQSR)                        Yes

DAMON uses agents within the network to monitor network behavior and send collected measurements to central data repositories or sinks. DAMON was implemented in Perl and was designed specifically for AODV. UCSB also has a stationary mesh testbed that is managed by a set of interconnected components called MeshMan, Mesh-Mon and MeshViz [23]. In their testbed, most of the nodes use a wired backhaul for management information and easier control of experiments. Our approaches to monitoring share some similarities, but we use in-band communication of all monitored information and place more emphasis on mobility and fault diagnosis.

JANUS [24] is another framework for distributed monitoring of wireless mesh networks; it uses Pastry (a distributed-hash-table-based peer-to-peer overlay network) to make network information, collected at different layers of the stack, available at all connected nodes in the mesh. An initial prototype of JANUS was tested on six Windows nodes, but it does not perform any fault detection or support multiple protocols.

Qiu et al. [27] advocate diagnosing a mesh network by means of simulation. A wireless mesh is monitored, modeled and then simulated. The input from the monitoring system is fed into a network simulator that is used to study the observed network behavior and predict expected behavior. The approach is novel, but it clearly depends on the quality, fidelity, speed and accuracy of the simulation methods.

Chandra et al. [28] present WiFiProfiler, a cooperative management system for an infrastructure-based WLAN. Their system focuses on clients assisting other clients connected to a stationary access point in a Wi-Fi network. In contrast, we monitor and manage both mobile clients and mobile mesh nodes, and diagnose a diverse set of problems (a subset of which overlap with theirs).

Kant et al. [25] propose DRAMA, an adaptive hierarchical policy-based management architecture for future military networks. Chiang et al. [26] study the scalability of DRAMA in simulations of up to 504 nodes running OLSR. The simulations show that DRAMA generates less network traffic than SNMP (with a single central collector), since each level of the hierarchy filters information before propagating it. Mesh-Mon uses a similar hierarchical structure, but we present a real implementation running on a real mesh, address both monitoring and management challenges, and test our system on multiple routing protocols. None of the other mesh management systems consider the use of centrality metrics or the possibility that routing can fail.

8. Conclusion and next steps

We have demonstrated Mesh-Mon's ability to tackle monitoring and management issues in mobile 802.11 mesh networks. There are many avenues to optimize Mesh-Mon to make it more efficient, scalable, adaptive and effective. We plan to refine Mesh-Mon's ability to store and communicate information, and to design a plug-in interface to make Mesh-Mon extensible. Several issues and challenges remain. We do not examine security aspects of Mesh-Mon or Byzantine failures, as they are beyond the scope of this paper. The local actions taken to resolve one problem could create another problem; in our current prototype actions are decided by static rules, and we would like to allow Mesh-Mon nodes to make local decisions dynamically and autonomously. There are many tradeoffs we need to examine further with respect to energy consumption, bandwidth usage, network coverage, choice of algorithms (centralized vs. distributed, probabilistic vs. deterministic, localized) and fault-detection accuracy. We aim to use data gathered from Mesh-Mon to design better routing protocols and management systems for future mesh networks. In future work we hope to experiment with a larger campus-wide deployment, to evaluate the scalability and performance of our system using multiple MeshLeaders.

Acknowledgments

This research program is a part of the Institute for Security Technology Studies, supported by a gift from Intel Corporation, by Award No. 2000-DT-CX-K001 from the US Department of Homeland Security (Science and Technology Directorate), and by Grant No. 2005-DD-BX-1091 awarded by the Bureau of Justice Assistance. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of any of the sponsors. The authors wish to thank Wayne Allen, Tim Tregubov, Chris McDonald, Bennet Vance, David Bourque, Tristan Henderson, Andrew Campbell, Cindy Torres and members of the CMC lab at Dartmouth.

systems consider the use of centrality metrics or the possibility that routing can fail. 8. Conclusion and next steps We have demonstrated Mesh-Mon’s abilities to tackle monitoring and management issues of mobile 802.11 mesh networks. There are many avenues to optimize Mesh-Mon to make it more efficient, scalable, adaptive and effective. We plan to refine Mesh-Mon’s ability to store and communicate information and design a plug-in interface to make Mesh-Mon extensible. There are several issues and challenges that remain. We do not look into security aspects of Mesh-Mon or Byzantine failures as they are beyond the scope of this paper. The local actions taken to resolve one problem could lead to creation of another problem. In our current prototype actions are decided by static rules; we would like to allow Mesh-Mon nodes to make local decisions dynamically and autonomously. There are many tradeoffs we need to further examine with respect to energy consumption, bandwidth usage, network coverage, choice of algorithms (centralized vs. distributed, probabilistic vs. deterministic, localization) and fault detection accuracy. We aim to use data gathered from Mesh-Mon to design better routing protocols and management systems for future mesh networks. In future work we hope to experiment with a larger campus-wide deployment, to evaluate the scalability and performance of our system using multiple MeshLeaders. Acknowledgments This research program is a part of the Institute for Security Technology Studies, supported by a gift from Intel Corporation, by Award No. 2000-DT-CX-K001 from the US Department of Homeland Security (Science and Technology Directorate) and by Grant No. 2005-DD-BX-1091 awarded by the Bureau of Justice Assistance. Points of view in this document are those of the authors, and do not necessarily represent the official position or policies of any of the sponsors. The authors wish to thank Wayne Allen, Tim Tregubov, Chris McDonald, Bennet Vance, David Bourque, Tristan Henderson, Andrew Campbell, Cindy Torres and members of the CMC lab at Dartmouth.

References

[1] Cisco mesh products. Available from: .
[2] Motorola Mesh Networks. Available from: .
[3] Order One Networks. Available from: .
[4] Tropos Networks Public Safety Solutions. Available from: .
[5] T. Clausen, P. Jacquet, Optimized Link State Routing Protocol (OLSR), RFC 3626 (Experimental). Available from: (October 2003).
[6] S. Ni, Y. Tseng, Y. Chen, J. Sheu, The broadcast storm problem in a mobile ad hoc network, in: Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), 1999, pp. 151–162.
[7] W. Chen, N. Jain, S. Singh, ANMP: ad hoc network management protocol, IEEE Journal on Selected Areas in Communications 17 (8) (1999) 1506–1531.
[8] A. Vahdat, D. Becker, Epidemic routing for partially connected ad hoc networks, Technical Report CS-2000-06, Duke University, April 2000.
[9] A. Tirumala, F. Qin, J. Dugan, J. Ferguson, K. Gibbs, Iperf TCP/IP network performance measurement tool. Available from: http://dast.nlanr.net/Projects/Iperf/.
[10] D. Zeltserman, A Practical Guide to SNMPv3 and Network Management, Prentice Hall PTR, Upper Saddle River, NJ, USA, 1999.
[11] R.E. Tarjan, Depth-first search and linear graph algorithms, SIAM Journal on Computing 1 (2) (1972) 146–160.
[12] M. Jorgic, I. Stojmenovic, M. Hauspie, D. Simplot-Ryl, Localized algorithms for detection of critical nodes and links for connectivity in ad hoc networks, in: Proceedings of the 3rd IFIP Mediterranean Ad Hoc Networking Workshop (MED-HOC-NET), 2004, pp. 360–371.
[13] P. Bonacich, Power and centrality: a family of measures, The American Journal of Sociology 92 (5) (1987) 1170–1182.
[14] I. Bytyci, Monitoring Changes in the Stability of Networks Using Eigenvector Centrality, Master's thesis, Oslo University College, May 2006.
[15] K. Begnum, M. Burgess, Principle components and importance ranking of distributed anomalies, Machine Learning 58 (2) (2005) 217–230.
[16] A. Tønnesen, OLSR version 0.4.10. Available from: .
[17] L. Klein-Berndt, NIST Kernel AODV. Available from: .
[18] AODV-UU Implementation v0.9.5. Available from: (2007).
[19] W. Allen, A. Martin, A. Rangarajan, Designing and deploying a rural ad-hoc community mesh network testbed, in: Proceedings of the 30th Anniversary IEEE Conference on Local Computer Networks, 2005, pp. 740–743.
[20] C. Perkins, E. Belding-Royer, S. Das, Ad hoc On-Demand Distance Vector (AODV) Routing, RFC 3561 (Experimental). Available from: http://www.ietf.org/rfc/rfc3561.txt (July 2003).
[21] C.-C. Shen, C. Srisathapornphat, C. Jaikaeo, An adaptive management architecture for ad hoc networks, IEEE Communications Magazine 41 (2) (2003) 108–115.
[22] K. Ramachandran, E. Belding-Royer, K. Almeroth, DAMON: a distributed architecture for monitoring multi-hop mobile networks, in: Proceedings of the First Annual IEEE Communications Society Conference on Sensor and Ad Hoc Communications and Networks (SECON), 2004, pp. 601–609.
[23] H. Lundgren, K. Ramachandran, E. Belding-Royer, K. Almeroth, M. Benny, A. Hewatt, A. Touma, A. Jardosh, Experiences from the design, deployment, and usage of the UCSB MeshNet testbed, IEEE Wireless Communications 13 (2) (2006) 18–29.
[24] N. Scalabrino, R. Riggio, D. Miorandi, I. Chlamtac, JANUS: a framework for distributed management of wireless mesh networks, in: Proceedings of the 3rd International Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities, 2007.
[25] L. Kant, S. Demers, P. Gopalakrishnan, R. Chadha, L. LaVergne, S. Newman, Performance modeling and analysis of a mobile ad hoc network management system, in: Proceedings of the IEEE Military Communications Conference (MILCOM), 2005, pp. 2816–2822.
[26] C.Y.J. Chiang, S. Demers, P. Gopalakrishnan, L. Kant, A. Poylisher, Y.H. Cheng, R. Chadha, G. Levin, S. Li, Y. Ling, S. Newman, L. LaVergne, R. Lo, Performance analysis of DRAMA: a distributed policy-based system for MANET management, in: Proceedings of the IEEE Military Communications Conference (MILCOM), 2006, pp. 1–8.
[27] L. Qiu, P. Bahl, A. Rao, L. Zhou, Troubleshooting wireless mesh networks, SIGCOMM Computer Communication Review 36 (5) (2006) 17–28.
[28] R. Chandra, V.N. Padmanabhan, M. Zhang, WiFiProfiler: cooperative diagnosis in wireless LANs, in: Proceedings of the 4th International Conference on Mobile Systems, Applications and Services (MobiSys), 2006, pp. 205–219.