Virtual active networks: towards multi-edged network computing


Computer Networks 36 (2001) 153-168

www.elsevier.com/locate/comnet

Gong Su *, Yechiam Yemini
Department of Computer Science, Columbia University, New York, NY 10027, USA

Abstract

Virtual active networks (VANs) are dynamically constructed virtual networks of packet processing nodes and QoS-enabled tunnels that support application-specific services such as Web caching, multi-casting, transcoding, and filtering. The goals of a VAN are to enable large-scale multi-edged network applications, i.e., applications with components distributed at network edge nodes, to control and configure network topology and resources to best support their needs; and to enable these applications to monitor and adapt to network changes. In this paper, we describe the VAN architecture, a middleware that provides services and mechanisms to achieve these goals. In particular, the VAN architecture provides: (1) abstractions for applications to specify a VAN, an algorithm to map the VAN specification to physical network topology and resources, and protocols to acquire the topology and resources; (2) an algorithm and protocol to resolve deadlock among competing VANs for shared node and link resources; (3) mechanisms to recover from physical network failure in order to preserve VAN service properties. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Virtual; Active; Multi-edged; Topology; Abstraction

1. Introduction

Recent progress in computer network research has witnessed the emergence of a new breed of what we call multi-edged network applications, which present several distinguishing characteristics when compared to traditional client-server ones. First, multi-edged network applications typically have peer-peer service components deployed at network edge nodes (nodes residing at the boundary of a network), and these components collaborate to achieve the intended functionality of the applications. Web caching, multi-casting,

*

Corresponding author. E-mail address: [email protected] (G. Su).

transcoding, and filtering are good examples of such applications. In contrast, client-server applications typically operate at end nodes. Second, while traditional network applications treat the network largely as a best-effort packet transport wire, multi-edged network applications must interact with the network more closely; they must be able to control and configure network topology and resources to best support their needs. For example, a Web caching application may need to provide coverage for certain network areas and may configure its service components with a certain topology (e.g., a ring) for increased reliability; it may also need guaranteed node and link resources in order to process and transport cached contents. Third, traditional edge nodes are vendor-supplied black boxes that perform predefined network layer

1389-1286/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S1389-1286(01)00174-8


G. Su, Y. Yemini / Computer Networks 36 (2001) 153-168

functions such as routing. In contrast, emerging multi-edged network application components may perform dynamic application layer functions at edge nodes such as caching, transcoding, and filtering. Multi-edged network applications thus represent a new class of applications requiring new network services that are difficult to accomplish with current network application program interfaces (APIs). These services include network area coverage, network location awareness, specialized QoS support, high availability and reliability, and security.

The virtual active network (VAN) architecture is an application middleware architecture (henceforth, we use the word "application" to refer to the multi-edged network application unless otherwise noted) that provides distributed resource management mechanisms through the notion of virtual network abstractions. These mechanisms extend beyond the scope of single-node operating system (OS) resource management mechanisms and are essential in supporting the needs of multi-edged network computing. The VAN architecture specifically focuses on three important aspects of virtual networking, namely abstractions for specification, deadlock-free deployment, and failure recovery. The VAN system defines abstractions through which applications can specify a VAN with a desired service type, topology features, resource constraint, and reliability constraint. The VAN system provides algorithms and protocols to dynamically construct such a VAN according to the specification. When multiple VANs are being constructed simultaneously, potential deadlocks resulting from different VANs competing for shared node and link resources are resolved by the VAN system's conflict resolution algorithm and protocol. At runtime, the VAN system also monitors underlying physical network conditions and adapts accordingly to best preserve the VAN properties when failures occur (e.g., a physical link fails).
The rest of the paper is organized as follows: Section 2 gives an overview of the VAN service architecture and functional components; Section 3 presents the algorithm and protocol mechanisms that support the application services provided by VAN; Section 4 briefly discusses some of the low-level OS support VAN relies on; Section 5 explains the "activeness" of VAN; Section 6 describes our prototype implementation of the VAN system and its current status; Section 7 relates our work to other work in the area; Section 8 summarizes the paper.

2. The VAN architecture overview

The VAN architecture is built upon a set of core components with well-defined functions. These components interact with each other through specialized protocols, and they collectively orchestrate the services and supporting mechanisms of the VAN architecture.

2.1. VAN service architecture

Fig. 1 depicts a high-level view of the VAN application service architecture comprising the following main functional components: virtual node and virtual link (VN and VL), VAN local manager (VLM), and VAN domain server (VDS). VNs and VLs are the abstractions serving as the basic building blocks for a virtual network. VNs are packet processing software running on physical nodes and resemble physical packet processing devices such as switches and routers. VLs are logical datalink layer communication channels between VNs and resemble physical wires. VLMs augment the node OS with certain virtual-network-specific capabilities such as creating and deleting VNs and VLs. VAN domain servers (VDSes) further extend the VAN system to provide distributed domain-wide resource management by commanding a collection of VLMs within their administrative domains. Section 2.2 will provide a more detailed description of these components.

Fig. 1. VAN service architecture.

Creating a VAN thus involves hierarchical interactions among the application, the VDSes, and the VLMs. An application interacts with a VDS (called the root VDS) to request the creation and deletion of a VAN. The root VDS and the VDSes in other domains interact with each other through the VAN exterior setup protocol (VESP) to coordinate inter-domain VAN resource negotiation. Each VDS interrogates the VLMs belonging to its domain through the VAN interior setup protocol (VISP) to acquire intra-domain VAN resources. During the construction of a VAN, competition for shared network resources among different VANs is resolved by the VAN system's priority-based preemption algorithm and protocol. At runtime, VLMs monitor the status of VNs and VLs and report to their respective VDSes when failures occur. The VAN system employs "local repair" and "global repair" mechanisms to recover from the failures and to preserve the properties of the affected VANs as best as it can.

2.2. VAN functional components

We now elaborate in more detail on each of the main functional components of the VAN architecture mentioned in Section 2.1.

2.2.1. Virtual node and virtual link

VNs are components of multi-edged applications (e.g., Web caching proxies, transcoding gateways, etc.) that run on network edge nodes and provide application-specific packet processing functionality there. Functionally speaking, VNs are equivalent to packet processing engines running inside a router or switch, except that VNs are instantiated by a VLM on demand at runtime according to application requests. In multi-edged applications such as Web caching and transcoding, many VNs cooperate with each other and collectively achieve their intended goals.
VLs are datalink layer logical communication channels that provide the means for VNs to


exchange packets. Because multi-edged application components can perform functions anywhere from the network layer through the application layer, a common network layer, such as IP, no longer exists. Therefore, it is necessary for components to conduct peer-to-peer interaction at the datalink layer, in contrast to traditional peer-to-peer and/or client-server interaction, which normally occurs at the network layer and above. As an example, Web caching applications usually employ their own routing scheme to route URL requests among the mesh of caching components. In other words, peer caching components interact with each other as direct datalink-layer-connected peers, rather than network-layer-connected peers.

VNs access the functions of VLs through the virtual interface (VIF) abstraction. VIFs resemble physical network interface card (NIC) devices and provide a uniform interface to VNs regardless of how the VLs are instrumented. For example, VLs can be built on top of a physical datalink layer connecting two VNs residing on the same datalink layer network, such as an Ethernet LAN. VLs can also be tunneled over a network layer, such as IP, connecting two VNs across a wide-area network.

In essence, VNs and VLs are the abstractions that represent node processing resources and link bandwidth resources, respectively, and through which applications acquire and release network resources. This is similar to the virtual memory abstraction, which represents the physical memory resource and through which applications acquire and release memory.

2.2.2. VAN local manager

VLMs are daemon processes running on network edge nodes. VLMs extend the node OS for managing local node resources, such as coordinated scheduling of processor cycles and link bandwidth. VLMs export the following services to applications:
· Create/delete/configure VNs with guaranteed processor resource.
· Create/delete/configure VIFs.
· Create/delete/configure VLs with guaranteed link resource.
· Send/receive messages over VLs via VIFs.
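
To make the service list concrete, the following is a minimal in-memory sketch of the first two services. It is not the paper's interface: the class name `LocalManager`, its method names, and the use of a fractional processor share as the guaranteed resource are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LocalManager:
    """Hypothetical stand-in for a VLM; names and units are assumptions."""
    cpu_capacity: float                       # processor share available on this node
    vns: dict = field(default_factory=dict)   # VN name -> reserved processor share
    vifs: dict = field(default_factory=dict)  # (VN name, VIF name) -> attached VL or None

    def create_vn(self, name: str, cpu: float) -> bool:
        """Admit a VN only if its processor guarantee still fits on the node."""
        if name in self.vns or sum(self.vns.values()) + cpu > self.cpu_capacity:
            return False
        self.vns[name] = cpu
        return True

    def delete_vn(self, name: str) -> None:
        """Release the VN's processor share and drop its VIFs."""
        self.vns.pop(name, None)
        for key in [k for k in self.vifs if k[0] == name]:
            del self.vifs[key]

    def create_vif(self, vn: str, vif: str) -> bool:
        """Create an unattached VIF on an existing VN."""
        if vn not in self.vns or (vn, vif) in self.vifs:
            return False
        self.vifs[(vn, vif)] = None
        return True
```

The admission test in `create_vn` is what turns "guaranteed processor resource" into something checkable: a request is refused rather than allowed to dilute guarantees already given.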


Fig. 2. VLM functional components.

Fig. 2 depicts the main functional components of a VLM. Note that although the figure separates user space and kernel space, it by no means indicates that this is the only way to implement a VLM. In fact, an implementation may choose to put all the components in kernel space to achieve higher performance.
· VISP module implements the VAN interior setup protocol, which is the protocol used by VDSes for interacting with VLMs to acquire and release resources.
· VL module implements the VL tunneling setup and QoS signaling protocol between peer VLMs.
· Scheduler extension cooperates with the OS scheduler to guarantee and regulate VN processor resource and VL link resource.
· Classifier extension extends the OS packet classifier to support VL multiplexing and demultiplexing.

2.2.3. VAN domain server

A VAN is essentially a collection of node and link resources, represented with the notion of VNs and VLs, distributed across the network. We have seen that VLMs manage resources within the scope of a single node; what we need are mechanisms that extend the resource management capability of VLMs to the scope of a network. VDSes are server processes that instrument these mechanisms by performing several essential functions.

First, VDSes logically organize network resources into administrative domains and perform

intra- and inter-domain resource acquisition for constructing VANs through specialized protocols. VDSes interact with VLMs within their administrative domain via VISP to coordinate intra-domain resource negotiation; VDSes cooperate with other VDSes in neighboring domains via VESP to coordinate inter-domain resource negotiation. In addition, VDSes maintain the namespace of VNs and VLs representing the resources of a VAN and perform directory services, such as VN name lookup, so that applications can refer to the VAN resources.

Second, VDSes implement algorithms and protocols for mapping the application specification of a VAN, which is expressed in terms of virtual entities such as VNs, VIFs, and VLs, to the underlying physical network topology and resources. Virtualized abstractions enable applications to refer to network resources through a uniform interface regardless of the underlying physical network topology and technology. Section 3.1 presents the VAN specification and mapping algorithm.

Third, much as a VLM manages access to the shared resources of a single node, VDSes arbitrate access to shared domain-wide network resources and, in collaboration with VLMs, resolve conflicts among different VANs competing for these resources in order to avoid possible deadlock situations. The algorithm and protocol used for resolving the conflicts are described in greater detail in Section 3.2.

Fourth, the mapping of a VAN to the collection of network node and link resources becomes invalid when the underlying physical resources fail, for example through a node crash or a link break. VDSes incorporate mechanisms to recover from such failures and keep the properties of the mapping consistent so that the failures are transparent to the applications. Details of the mechanisms are covered in Section 3.3.

Fig. 3 illustrates the main functional components of a VDS. Again, the picture does not imply any restriction on the actual implementation of a VDS.
· Namespace module maintains the namespace of VNs and VLs that represent the resources belonging to a VAN.


Fig. 3. VDS functional components.

· Mapping module implements the distributed algorithm and protocol for mapping a VAN specification to physical network topology and resources.
· VESP module implements the VAN exterior setup protocol for inter-domain resource negotiation.
· VISP module implements the VAN interior setup protocol for intra-domain resource negotiation.

3. VAN service mechanisms

This section describes in more detail the supporting mechanisms of the VAN architecture for provisioning its services.

3.1. Map VAN specification to physical network topology and resources

One of the first questions the VAN architecture needs to answer is "How can applications ask for a network?" Similar to the abstractions used by an OS for computing resources (e.g., process for CPU, address space for memory, directory/file for disk, etc.), there must be abstractions for applications to specify network resources and topologies.

3.1.1. VAN specification

The VAN architecture defines a set of abstractions in terms of which applications can specify a VAN of VNs (corresponding to the edge nodes of Autonomous Systems, aka ASes) interconnected through VLs. The applications can also specify the


desired service type, topology features, resource constraint, and reliability constraint of the VAN. The current abstractions are formulated based on graph theory and are listed below.

First, the service type indicates the application-specific type of processing at the edge nodes. For example, an application may specify a VAN for Web caching, or a VAN for multi-casting.

Second, the topology features specify desired topology properties of the VAN in terms of coverage and connectivity. Coverage specifies all the VNs in the VAN; it is assumed that all VNs must be connected. Connectivity is specified in terms of the following abstractions:
· Cyclic or acyclic.
· Degree of each VN, i.e., number of VLs connected to each VN.
· Diameter of the VAN, i.e., maximum distance between any two VNs.

Examples of common topologies that can be easily specified with these abstractions:
i. Chain: acyclic and degree(VN[i]) ≤ 2 for all i.
ii. Ring: degree(VN[i]) = 2 for all i.
iii. Star: degree(VN[i]) = 1 for all i except one of them.
iv. Tree: acyclic.
v. Clique: degree(VN[i]) = n − 1 for all i, where n is the total number of VNs.

With more "irregular" topologies, this abstraction becomes cumbersome. Another way of specifying the desired topology is with "anonymous" graphs such as those in Fig. 4. Note that applications are not required to specify explicitly which VN is located in which anonymous node in the graph, i.e., A, B, C, and D can be any permutation of the VNs. If needed, applications can also (partially) specify the topology, as in the graph to the right in the figure, in which the application has specifically chosen to put VN2 at the "C" position.

Fig. 4. Anonymous graph.
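
The degree and acyclicity rules for the named topologies can be checked mechanically. The sketch below is an illustration, not part of the VAN system: it takes a candidate set of VLs as an undirected simple graph and reports which of the five named topologies it satisfies. The function names are hypothetical, and the star test additionally assumes at least three VNs so that the "all except one" wording is unambiguous.

```python
from collections import defaultdict


def degrees(nodes, links):
    """Number of links incident to each node."""
    deg = {n: 0 for n in nodes}
    for a, b in links:
        deg[a] += 1
        deg[b] += 1
    return deg


def is_connected(nodes, links):
    """Depth-first search from an arbitrary node reaches every node."""
    nodes = list(nodes)
    if not nodes:
        return True
    adj = defaultdict(set)
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(nodes)


def classify(nodes, links):
    """Return the names of the topologies that the graph satisfies."""
    n = len(nodes)
    if not is_connected(nodes, links):
        return set()                      # coverage requires all VNs connected
    deg = degrees(nodes, links)
    kinds = set()
    if len(links) == n - 1:               # connected and n-1 links: acyclic
        kinds.add("tree")
        if all(d <= 2 for d in deg.values()):
            kinds.add("chain")
        if n >= 3 and sorted(deg.values()) == [1] * (n - 1) + [n - 1]:
            kinds.add("star")
    if n >= 3 and all(d == 2 for d in deg.values()):
        kinds.add("ring")
    if all(d == n - 1 for d in deg.values()):
        kinds.add("clique")
    return kinds
```

The categories overlap for small graphs: a three-node chain is also a star, and a triangle is both a ring and a clique, which is consistent with the degree rules as stated.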


Third, the resource constraint specifies, for all the VNs, the processor cycles needed to support application-specific processing at the edge nodes, and for all VLs, the link bandwidth desired.

Fourth, the reliability constraint is expressed in terms of the number of different VLs a physical link carries. Intuitively, the more VLs a physical link carries, the less reliable the VAN is, since the failure of the physical link will bring down all of those VLs. So applications may specify an upper limit on the number of VLs in a VAN that traverse any particular physical link.

The sheer size and complexity of today's network makes it simply impossible for any application to have enough knowledge about the network down to the level of individual devices and links; it is therefore impractical to require applications to have such knowledge in order to specify a virtual network. The VAN abstractions are thus designed to allow applications to specify a VAN at the AS level (recall that a VN corresponds to an edge node of an AS) and to concentrate on the features of the VAN rather than the details of how exactly each VN should be connected. For example, applications only need to specify that they need a star comprising VN1, VN2, VN3, and VN4 but not which VN should be the center; or a ring connecting the four VNs but not the exact neighboring order. It is the VAN system's responsibility to best map this virtual (star or ring) topology to the physical topology according to the constraints specified by the applications. The philosophy is that the VAN system, performing the role of an OS, knows more about the underlying physical network resource usage and availability and can therefore produce a mapping that best utilizes the resources.
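
As a small illustration of the reliability constraint, the sketch below counts how many VLs traverse each physical link in a given mapping and reports the links that exceed an application-supplied limit. The data layout (a VL name mapped to its physical path as a list of nodes) and the function names are assumptions made for this example.

```python
from collections import Counter


def link_load(vl_paths):
    """Count how many VLs traverse each physical link.

    vl_paths maps a VL name to its physical path, given as a list of nodes.
    Links are undirected, so each hop is stored as a sorted pair.
    """
    load = Counter()
    for path in vl_paths.values():
        for a, b in zip(path, path[1:]):
            load[tuple(sorted((a, b)))] += 1
    return load


def violates_reliability(vl_paths, max_vls_per_link):
    """Return the physical links carrying more VLs than the limit allows."""
    return {link: n for link, n in link_load(vl_paths).items()
            if n > max_vls_per_link}
```

For instance, with paths a-b-c, d-b-c, and a-d, the link b-c carries two VLs, so a limit of one flags it while a limit of two does not.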
Furthermore, applications are more likely to be interested in the features of the topology rather than its details even if they have enough knowledge about the network to specify the topology in detail, in which case they can still do so with the "anonymous graph".

3.1.2. Mapping algorithm

The mapping from a VAN specification to physical network topology and resources can be mathematically formulated as an optimization problem. We do not include the formulation here

due to space constraints; it can be found in [21]. We describe a simple heuristic algorithm for mapping a virtual topology to a physical topology while satisfying the resource and reliability constraints. Note that it may not be possible to generate a mapping that satisfies all the constraints, in which case the algorithm will present a "best you can get" mapping. The algorithm maintains two graph data structures, one for the virtual topology and one for the physical topology. It first maps the VNs to the physical nodes (remember that this is a one-to-one mapping). It then maps the VLs to physical paths. The following are the main steps taken by the algorithm:
1. Sort all VNs and physical nodes by their degree in descending order. Nodes with the same degree are ordered arbitrarily.
2. Map VNs to physical nodes one to one according to the sorted order, i.e., the highest-degree virtual node is mapped to the highest-degree physical node, etc.
3. Scan the physical topology and mark the links that do not have enough available link capacity for the VLs as infeasible. No VL will be mapped to a physical path traversing an infeasible physical link. If the physical topology consisting of only feasible links is partitioned, stop and declare that no mapping can be produced.
4. Associated with each feasible physical link is a counter, initially 0, which counts the number of times the link has been mapped onto, i.e., the number of VLs it is carrying. Pick a VL and map it to a physical path between the two physical nodes (onto which the two VNs of the VL are mapped) such that the highest counter among the feasible physical links traversed by the path is minimized.
5. Increment the counter of every physical link on the path chosen in step 4 and reduce its capacity by the capacity of the VL. If any of these physical links is left with capacity lower than the capacity of a VL, mark it as infeasible.
6. Repeat steps 4 and 5 until all VLs are mapped, or stop when a VL cannot be mapped.
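
The six steps can be sketched in a few dozen lines. The code below is one possible reading of the heuristic, not the authors' implementation: it assumes every VL needs the same bandwidth `vl_bw`, processes VLs in the order given, and realizes step 4 with a Dijkstra-style search whose primary key is the highest counter on the path. Hop count is used only as a tie-breaker and is not guaranteed to be minimal. All function and parameter names are hypothetical.

```python
import heapq
from collections import defaultdict


def degree_order(nodes, links):
    """Nodes sorted by degree, highest first (steps 1-2); ties keep input order."""
    deg = {n: 0 for n in nodes}
    for a, b in links:
        deg[a] += 1
        deg[b] += 1
    return sorted(nodes, key=lambda n: -deg[n])


def minimax_path(adj, counter, src, dst):
    """Path whose most heavily used link has the smallest counter (step 4)."""
    best, prev, heap = {src: (0, 0)}, {}, [(0, 0, src)]
    while heap:
        worst, hops, node = heapq.heappop(heap)
        if node == dst:
            break
        if (worst, hops) > best[node]:
            continue                          # stale heap entry
        for nxt in adj[node]:
            label = (max(worst, counter[frozenset((node, nxt))]), hops + 1)
            if nxt not in best or label < best[nxt]:
                best[nxt], prev[nxt] = label, node
                heapq.heappush(heap, (*label, nxt))
    if dst not in best:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def map_van(vns, vls, pnodes, plinks, vl_bw):
    """Return (node_map, vl_paths), or None if the feasible topology is partitioned.

    vls: list of (vn_a, vn_b); plinks: {(node_a, node_b): available capacity}.
    """
    node_map = dict(zip(degree_order(vns, vls), degree_order(pnodes, plinks)))
    # Step 3: keep only links able to carry at least one VL.
    capacity = {frozenset(l): c for l, c in plinks.items() if c >= vl_bw}
    counter = {link: 0 for link in capacity}

    def adjacency():
        adj = defaultdict(set)
        for link in capacity:
            a, b = tuple(link)
            adj[a].add(b)
            adj[b].add(a)
        return adj

    adj = adjacency()
    if any(minimax_path(adj, counter, pnodes[0], n) is None for n in pnodes):
        return None
    vl_paths = {}
    for a, b in vls:                          # step 6: repeat steps 4 and 5
        path = minimax_path(adjacency(), counter, node_map[a], node_map[b])
        if path is None:
            break                             # remaining VLs stay unmapped
        vl_paths[(a, b)] = path
        for u, v in zip(path, path[1:]):      # step 5: update counter, capacity
            link = frozenset((u, v))
            counter[link] += 1
            capacity[link] -= vl_bw
            if capacity[link] < vl_bw:
                del capacity[link]            # link is now infeasible
    return node_map, vl_paths
```

Mapping a four-VN star onto a four-node physical ring with one chord places the star's center on a degree-3 physical node, so each VL can use a direct link and no physical link carries more than one VL.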
The bulk of the work of this algorithm lies in step 4. The exact complexity of this step depends on


how the physical path for a VL is chosen. Ideally, we want to choose a path that is short and also satisfies the bandwidth and reliability constraints. However, there may be situations where relaxing the distance of some paths results in a better overall mapping in terms of satisfying the constraints. We are currently investigating the behavior of this heuristic under such variations, along with other optimization techniques, such as different ways of mapping VNs to physical nodes in steps 1 and 2, precomputing all physical edge-disjoint paths between any two physical nodes, changing the order of picking the VL in step 4, etc. We are also exploring the possibility of other heuristics.

3.1.3. Topology and resource acquisition protocols

Carrying out the mapping described in Section 3.1.2 implies that the VAN system needs to have up-to-date global network node and link resource usage and availability information. Maintaining such information is indeed essential for the operations of the VAN system. Although the VAN architecture per se does not mandate a specific mechanism for maintaining such information, the two-tiered VDS-VLM design reflects our philosophy as to how this (admittedly very difficult) problem may be tackled.

The sheer size and complexity of today's network precludes any hope of having one central entity maintain the global resource information database (GRID) of the network. The VAN system, assisted by the VDSes, partitions the GRID into manageable domains. This is similar to the border gateway protocol's (BGP) partition of the network into ASes for scalable routing. Note that resource domains are logical and may or may not map one-to-one to ASes. The VDSes, further assisted by the VLMs, maintain the partial GRID within their respective domains. We further observe that building a VAN covering certain ASes does not require knowledge of the full GRID.
Only the partial GRID of domains containing the ASes in question needs to be pieced together for the mapping purpose. The VESP is designed specifically for the VDSes to communicate with each other and to piece together, on demand, a fragment of the GRID that is enough


Fig. 5. VESP and VISP.

for mapping the requested VAN. The VDSes utilize VESP, on a per-VAN basis, to query and negotiate resource allocations that must cross domain boundaries, e.g., a VL spanning multiple domains. Within each domain, a VDS has full knowledge (through its partial GRID) to compute the mapping for the part of the VAN that lies entirely within its domain. This "interior mapping" is then executed by the VDS through interrogating the VLMs in the domain via the VISP. We believe this two-tiered resource query and negotiation scheme suits the needs of building VANs on a global scale; it also fits well with today's network infrastructure. Due to space constraints, the protocol details of the VESP and the VISP are omitted and we again refer to [21]. In Fig. 5 we only present a picture of the parties interacting via these protocols.

3.2. Acquiring VAN topology and resources without deadlock

Building a VAN is a distributed resource acquisition process. Competition for shared resources between VANs being built simultaneously can lead to deadlock if care is not taken. This is somewhat similar to distributed database transactions [11], where node and link resources are analogous to databases, VLMs to database managers, and VDSes to transaction managers. The key differences between the two are (1) constructing a VAN usually involves a large number of sites, which makes traditional distributed deadlock


detection algorithms infeasible, since most of these algorithms have message complexity of O(n^2) and detection delay of O(n) [20] with n sites; (2) the "transactions" of building a VAN, i.e., acquiring node and link resources, are semantically simple and uniform operations compared to typical database transactions; therefore certain deadlock prevention techniques, such as preemption using priority, which are infeasible in traditional database transactions due to the high rollback cost of the preempted "victim", become viable in the "transactions" of building a VAN.

In this section, we first discuss a possible approach for preventing deadlock that is based on the global ordering of resources [13]. We then discuss its drawbacks and introduce another deadlock prevention algorithm, based on preemption [18] using priority, that capitalizes on the distinctions mentioned above.

3.2.1. Global resource ordering

One way to prevent deadlock is to impose a global total ordering on the ASes and, within each AS, a total ordering of the node and link resources. VANs are required to be built sequentially following the order of AS numbers, and resources are allocated on a first come, first served basis. A VAN that cannot acquire a needed resource may search for other (higher numbered) resources that satisfy its needs, or wait for a finite amount of time before giving up and releasing all the resources it has acquired so far to let other VANs have a chance to proceed.

The advantage of this approach is its simplicity. However, it does have a couple of drawbacks. First, it limits the concurrency of building a VAN, as a VAN must be built sequentially from one AS to another. Second, a starvation situation can occur since different VANs start from different ASes. If enough VANs keep starting from a higher numbered AS, a VAN starting from a lower numbered AS may never get a chance to acquire the resources it needs from the higher numbered AS.
This problem can potentially be fixed by associating a timestamp with each VAN and not resetting the timestamp when a VAN is restarted. This effectively assigns each VAN a static priority (the timestamp) for arbitration when conflict occurs.
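
A minimal sketch of ordered acquisition follows. It is illustrative only: resources are keyed by a hypothetical `(as_number, resource_id)` pair, a shared dictionary stands in for the distributed allocation state, and the "wait for a finite amount of time" option is collapsed to a zero wait, so a VAN gives up and releases what it acquired as soon as it meets a conflict.

```python
def acquire_in_order(van, wanted, owner):
    """Acquire resources in ascending global order, first come first served.

    wanted: iterable of (as_number, resource_id) keys.
    owner:  shared dict mapping a key to the VAN currently holding it.
    Returns True on success; on the first conflict, releases everything
    acquired during this call and returns False.
    """
    held = []
    for key in sorted(wanted):                # global total order
        if owner.get(key, van) != van:        # held by a different VAN
            for k in held:
                del owner[k]
            return False
        if key not in owner:
            owner[key] = van
            held.append(key)
    return True
```

Because every VAN requests resources in the same ascending order, a VAN only ever contends for a resource numbered higher than all those it already holds, which rules out a circular wait.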

Due to these inadequacies of global ordering, we explore another approach, which is based on preemption using priority. However, instead of using a fixed priority for each VAN being built, our algorithm dynamically computes what we call the "Progress Index" (PI) for each VAN and uses it as the priority for preemption.

3.2.2. Progress index algorithm

The PI is a number that is computed to indicate how far along a VAN is in its building process. It is used to dynamically assign priority to VANs in order to resolve conflicts through preemption. The idea is that a VAN that has finished most of its parts (also taking into consideration the total size of the VAN) should be given higher priority. As parts of a VAN are being built simultaneously by many VDSes, each VDS will compute a "local total weight" and a "local current weight" for its part of the VAN as follows:
1. A "weight" is associated with each VN and VL. This weight assignment should be consistent across all VDSes. Otherwise, VANs built by VDSes that assign "better" weight values to VNs and VLs will have an unfair advantage over VANs built by other VDSes.
2. The "local total weight" of the part of the VAN that lies within the VDS administrative domain is simply the sum of the weights of all the VNs and VLs to be built by the VDS.
3. For each VN or VL successfully built, its weight is added to the "local current weight" (initially 0) of the VAN.

The "global PI" (GPI) of a VAN is computed based on the "global total weight" of the VAN, which is the sum of the "local total weight" of all VDSes, and the "global current weight" of the VAN, which is the sum of the "local current weight" of all VDSes. We are examining different algorithms for computing the GPI and their effect on the conflict resolution process.

3.2.3. Conflict resolution protocol

The rule used by VLMs for arbitrating conflicts is fairly simple.
When the resource for a VN or VL is requested by a VDS, the protocol message carries the GPI of the VAN to which the VN or VL


belongs. When a conflict occurs, whichever VN or VL has the higher GPI wins; ties are broken randomly. The preempted VAN has the choice of either waiting (optionally with a timeout) for the resource to become available or attempting to find an alternative resource.

We now show that it is not enough to use just a "local PI" (LPI, which is computed from the "local total weight" and "local current weight") within each VDS domain to resolve the conflict. It would work if a VAN spanned only one VDS administrative domain. But when a VAN is built simultaneously in multiple domains and conflicts between different VANs occur in different domains, we have a problem, as illustrated in Fig. 6. Shown in Fig. 6 are two VDS administrative domains in which two instances of VDSes are building two different VANs concurrently. VDS1 in domain A and VDS3 in domain B are building VAN1; VDS2 in domain A and VDS4 in domain B are building VAN2. Assume that the VLMs will

use LPI to resolve the conflict within their respective domains. In domain A, VDS1 computes its LPI for VAN1 as 3 and VDS2 computes its LPI for VAN2 as 6. In domain B, the numbers are 7 for VAN1 and 2 for VAN2, respectively. It is clear that VAN1 will win the battle in domain B but will lose in domain A, and vice versa for VAN2. This results in a deadlock or livelock (depending on the behavior of the loser) across domains A and B. In essence, the problem is that there is no total ordering of LPI.

We present a solution to remedy this situation. The basic idea of the solution is to synchronize the LPIs of all the VDSes building the same VAN in different domains at the time of conflict. That is, when a VLM detects a conflict, instead of using the LPI for the arbitration, the VLM informs the conflicting VDSes to start a GPI synchronization protocol. Fig. 7 describes the protocol interaction. Only the conflict resolution in domain A is shown. Domain B will carry out the same procedure.

Fig. 6. Deadlock with global conflict.
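
The two-domain example can be replayed in a few lines. In the sketch below, the LPI values are the ones given in the text; the GPI formula is an assumption, since the paper leaves the exact computation open. A plain sum of per-domain values is used here because it reproduces the synchronized values of 10 for VAN1 and 8 for VAN2 that the example arrives at. Function names are hypothetical, and random tie-breaking is omitted.

```python
# Per-domain progress values from the example: domain -> {VAN: LPI}.
LPI = {
    "A": {"VAN1": 3, "VAN2": 6},
    "B": {"VAN1": 7, "VAN2": 2},
}


def winner(priorities):
    """VAN with the highest priority (tie-breaking is not modeled here)."""
    return max(priorities, key=priorities.get)


def local_decisions(lpi):
    """Each domain arbitrates using only its own LPI values."""
    return {domain: winner(values) for domain, values in lpi.items()}


def global_index(lpi):
    """Assumed GPI: the sum of a VAN's per-domain values."""
    gpi = {}
    for values in lpi.values():
        for van, value in values.items():
            gpi[van] = gpi.get(van, 0) + value
    return gpi


def synchronized_decisions(lpi):
    """Every domain arbitrates using the same synchronized GPI."""
    gpi = global_index(lpi)
    return {domain: winner(gpi) for domain in lpi}
```

With local arbitration the two domains pick different winners (VAN2 in A, VAN1 in B), which is the cross-domain standoff described above; with a synchronized index both domains pick VAN1.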

1. VDSes send requests to VLM for VN or VL resources.
2. VLM detects conflict and notifies VDSes that an arbitration decision needs to be made.
3. VDSes broadcast a message to all other VDSes in their respective VAN requesting synchronization of GPI.
4. VDSes reply to sender with their "local total weight" and "local current weight".

Fig. 7. Conflict resolution protocol.


5. VDSes compute the GPI after getting all the replies and send the GPI to VLM.
6. Based on the now synchronized GPI, VLM decides on the winner and the loser and replies to the VDSes accordingly.

Since the arbitration decision is based on the GPI, which is the same in domains A and B after the conflict resolution protocol has run (10 for VAN1 and 8 for VAN2), both domains make the same decision: one of VAN1 and VAN2 is the winner in both domains and the other is the loser. Situations like the one described in Fig. 6 cannot occur.

3.3. Recovery from physical network failure

A VL may traverse multiple physical links, and many VLs may share the same physical link. It is thus imperative that when a physical link fails, the VAN system adapts to preserve the original properties (i.e., service type, topology feature, resource constraint, and reliability constraint) of the affected VANs, and that it does so as effectively as possible. This section presents the VAN system's monitoring and adaptation mechanisms for dealing with physical link failures. Our future work will address physical node failure scenarios in which one or more VNs are brought down.

3.3.1. VL monitoring

The VLMs at the two ends of a VL use a KEEP-ALIVE message to refresh the resource reserved for the VL along its path and to detect failure of the VL (e.g., when one of the physical links in the path of the VL goes down). Upon detecting a VL failure, the VLMs notify their respective VDSes. If the VL is an inter-domain one, additional messages are exchanged between the involved VDSes. Fig. 8 depicts the interactions.
(1) VLMs notify VDSes about VL failure.
(2) VDSes notify each other about an inter-domain VL failure.

3.3.2. Local repair

In order to recover VLs lost to a failed physical link as quickly as possible, the first logical thing to do is to locally find an alternative

Fig. 8. VL monitoring.

path between the two disconnected VNs. For an intra-domain VL, the VDS for the domain in question recomputes an alternative path between the two VNs based on its local topology and resource information. For an inter-domain VL, the two VDSes involved use the topology and resource information between them to recompute a new path for the lost VL.

The advantage of local repair is that it preserves the original VAN topology and recovers relatively quickly, since lost VLs are repaired locally and only the lost VLs are affected. The problem with local repair is that, since it uses only local topology and resource information, the recomputed path may traverse physical links that already serve other VLs of the VAN. This may violate the reliability constraint of the VAN specified by the application. In addition, when no alternative path can be found between the disconnected VNs, there is nothing local repair can do. But as Fig. 9 shows, it is certainly possible to reconstruct the VAN topology even when local repair fails. In the figure, a tree is disconnected by the failure of the VL between VN2 and VN5. Assuming no other physical path exists between VN2 and VN5, local repair would leave the tree disconnected. However, there may well be another physical path between VN4 and VN6, so that a VL can be established between them to restore the tree topology. Note that the new tree is not the same as the old one. But recall from Section 3.1.1 that topology is specified in terms of coverage and connectivity


Fig. 9. Local repair inadequacy.

features rather than details of which VN should be connected to which other VN(s). It is assumed that applications requiring such a topology are capable of adapting from one particular instance of the topology to another, as long as the constraint describing the topology feature (i.e., a tree) is not violated.

3.3.3. Global repair

When local repair violates the reliability constraint of the VAN, or when no alternative path can be found between the two VNs of the failed VL, global knowledge is needed to best preserve the reliability and topology constraints of the VAN. This is done by reporting the failure to the root VDS of the VAN, which has the necessary global information to recompute the mapping. The sequence of protocol messages exchanged among the local VDSes (where the VL failure happens) and the root VDS is shown in Fig. 10.

(1) When an inter-domain VL goes down, one of the VDSes (e.g., the one with the lower IP address) is responsible for recomputing the

Fig. 10. Global repair protocol messages.


new path. If a new path cannot be found, this message is sent to the other VDS. Note that the failed physical link may also cut off the communication between the two VDSes. In this case, both VDSes will eventually time out (VDS2 cannot receive the message and VDS1 cannot receive the acknowledgement). Both will then declare the VL fully down and not reconstructible locally.
(2) Acknowledgement of the local repair failure message.
(3) Whether or not a local repair is successful, a message is sent to the root VDS to signal a VL mapping change. If the local repair failed, or if the successful local repair violates the reliability constraint of the VAN, the root VDS recomputes a new VL mapping using the global topology and resource information. The new VL mapping is sent to the appropriate VDSes, while the VDSes of the locally repaired VL are notified to tear that VL down.

Global repair uses global topology and resource information to recompute new paths for failed VLs. It can therefore reconstruct the topology features of the VAN while satisfying the resource and reliability constraints when local repair fails. However, it does so at the cost of storage and computation overhead; the communication between the root VDS and the local VDSes also incurs delay. The computational overhead may be reduced by designing algorithms that map the virtual topology to the physical topology incrementally while satisfying the constraints, i.e., algorithms that do not have to recompute the whole mapping. We are exploring the possibility of extending the mapping algorithm described in Section 3.1.2 with this capability.

3.3.4. Combining both approaches

Given the pros and cons of local repair and global repair, it is clear that neither of them alone can resolve the failure problem satisfactorily. Therefore, the VAN system combines both approaches and tries to make the best of them.
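The interplay between local and global repair can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than the VAN system's algorithm: the physical network is an undirected graph, path recomputation is a plain breadth-first search that ignores resource constraints, and the reliability constraint is modelled as "VLs of the same VAN must not share a physical link". All names are hypothetical.

```python
from collections import deque


def link(u, v):
    """Canonical (order-independent) name for an undirected physical link."""
    return tuple(sorted((u, v)))


def shortest_path(adjacency, src, dst, failed):
    """Breadth-first search from src to dst that avoids failed links.

    Returns the path as a list of nodes, or None if dst is unreachable.
    """
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in adjacency[node]:
            if neighbour not in parents and link(node, neighbour) not in failed:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def links_of(path):
    """Set of physical links traversed by a node path."""
    return {link(u, v) for u, v in zip(path, path[1:])}


def repair(adjacency, src, dst, failed, links_in_use):
    """Attempt local repair and classify the outcome.

    links_in_use: physical links already serving other VLs of the same VAN.
    ASSUMPTION: the reliability constraint is "no shared physical link".
    """
    path = shortest_path(adjacency, src, dst, failed)
    if path is None:
        return ("global_only", None)      # no local alternative exists
    if links_of(path) & links_in_use:
        return ("local_violates", path)   # usable now, root VDS must recompute
    return ("local_ok", path)             # local repair satisfies the constraint


# Toy physical topology: a triangle in which the direct link a-b has failed.
adjacency = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
failed = {link("a", "b")}
print(repair(adjacency, "a", "b", failed, set()))
print(repair(adjacency, "a", "b", failed, {link("b", "c")}))
print(repair(adjacency, "a", "b", failed | {link("a", "c")}, set()))
```

The three calls cover the three cases discussed above: a local detour that satisfies the constraint, a local detour that reuses a link of another VL and therefore has to be replaced by the root VDS, and a failure for which no local detour exists.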
The strategy is that when a VL fails due to a physical link failure, a local repair is attempted first to temporarily restore the VL as quickly as possible. Meanwhile, the root VDS is notified to validate the temporary VL against the reliability constraint. If no constraint is violated, the temporary VL becomes permanent. Otherwise, a new permanent VL that satisfies the constraint is computed and replaces the temporary VL. In essence, local repair and global repair work in concert to compensate for each other's inadequacies, and together they make the failure adaptation as smooth as possible.

4. Low-level OS support

The VAN service mechanisms described in Section 3 provide the means for supporting the VAN services exported to multi-edged network applications. These mechanisms in turn rely on certain low-level OS mechanisms, which we identify in this section. Most of these mechanisms either are already offered by the OS or can be adapted with minor changes. It is our intention to fully leverage such existing mechanisms in the VAN system to avoid unnecessary re-engineering. We briefly discuss some of the low-level mechanisms used by the VAN system.

4.1. Tunnel setup and QoS signaling

VLs in the VAN architecture are datalink layer tunnels with QoS guarantees. Tunnel setup [12,15,25] (note that [15] does not include an automated method for bringing up tunnels in-band; it only gives the encapsulation and decapsulation rules) and QoS signaling [5,7,17] are well-studied subjects in their own right. Conceivably, VLs can be created by carrying out tunnel setup and QoS signaling in two separate steps, but for efficiency we would like to combine the two. Emerging technology such as multi-protocol label switching (MPLS) [27] naturally provides the capability for such integration. In fact, a label distribution protocol using an RSVP extension has already been proposed [1]. Note that the QoS guarantee for VLs is provisioned on the base physical network. Currently, the VAN architecture does not address the issue of providing finer-grained QoS within a VL tunnel.

4.2. Scheduler extension

VNs in the VAN architecture are processes performing application-specific packet processing, which requires processor cycles commensurate with the link bandwidth allocated to the VNs. Like tunnel setup and QoS signaling, processor scheduling [23] and link scheduling [28] are well-studied topics, although per-process CPU guarantees have only recently been studied extensively in the context of general purpose OSes [2]. The Eclipse OS [4] is the first publicly available general purpose OS that we are aware of to incorporate QoS.

5. VAN and active networks

So far we have described the VAN architecture in its most general form, i.e., an application middleware architecture that supports multi-edged network computing by allowing applications to dynamically construct a virtual network with the desired topology and resources according to their needs. The middle name of VAN, i.e., Active, suggests that the VAN architecture bears certain relations to recent active networks (AN) [26] research. Indeed, some of the initial thoughts on the VAN architecture originated from AN research. There have been two major research models of AN. One is the active packet (aka capsule) model, which advocates in-band programmability; the other is the active node model, which advocates out-of-band programmability. The VAN architecture started out as a particular active node approach to AN.

The active aspect of the VAN architecture lies in the fact that VNs in a VAN are application-specific packet processing engines, and the VAN architecture per se does not preclude any dynamic programmability of these VNs. As a matter of fact, the VL abstraction is designed precisely to allow peer-to-peer communication between VNs, and such communication is commensurate with full programmability of the VNs, since the interpretation of the messages exchanged among VNs is purely an application matter.
Also, since VLs are datalink layer tunnels that appear to VNs as "wires", VNs can perform functions at layers as low as the network layer and as high as the application layer. Such freedom in a packet processing engine is exactly one of the goals of AN. It is therefore clear that while the current VAN architecture has evolved into a general distributed resource management system for supporting multi-edged network applications, the architecture is also valuable, as an active node model, in addressing many of the issues the AN community is trying to resolve.

6. Implementation and current status

To evaluate the feasibility of the ideas in VAN, we have implemented and demonstrated a prototype VAN that illustrates some of the basic constructs of VAN, including:
• the VN, VIF, and VL objects;
• APIs for dynamically creating, deleting, and configuring these objects;
• APIs for dynamically loadable VN Engines;
• APIs for dynamically loadable VL Engines;
• protocol and APIs for managing (creating, deleting, configuring, and monitoring) a VAN;
• experimental VN Engines that provide traffic redirection and IP routing (RIP capable);
• experimental VL Engines that provide UDP-tunnelled VLs and VLs with QoS guarantees (via RSVP);
• a front-end command line tool with scripting capability for managing a VAN;
• an interface with an SNMP agent to allow monitoring of a VAN with a GUI (Smarts InCharge);
• simple plaintext password based authentication and host based authorization.

The prototype is built on the Linux platform, with Linux machines acting as both end hosts and intermediate routers. Currently, the prototype can construct a VAN with an arbitrary topology on top of the IP network. Fig. 11 shows an example. The topology of the VAN can be changed dynamically.

The front-end command line tool allows a management station to create a VAN remotely based on a script that describes the desired configuration of the VAN. The script specifies the


Fig. 11. Sample virtual topology.

VNs and VLs to be created and how they are combined to give the desired VAN topology. It also specifies which VN Engine and VL Engine to load into each VN and VL object. When executed, the script is translated by the command line tool into the appropriate VAN API calls to construct the whole VAN. Our current prototype does not yet implement the virtual to physical topology mapping capability, so the script needs to specify exactly where to put the VNs. But the system is designed in a modular way so that a mapping algorithm can be "dropped in" when it is available.

At the boundary of the VAN and the IP network (such as R1 and R2 in Fig. 11), a VN with a special VN Engine is used to trap IP network traffic and redirect it into the VAN, and vice versa. In the VAN, the traffic is processed by the customized VN Engine in each VN. These VN Engines can be dynamically programmed by the management station to change their behavior. We have implemented VN Engines in the NetScript language that can perform IP routing. When these are deployed into the VNs, we effectively construct a virtual IP network with arbitrary topology on top of the real IP network. The VLs can have simple QoS through RSVP. The prototype does not yet have the failure adaptation capability described in Section 3.3 and relies on the underlying network layer routing mechanism for failure recovery.

We are currently redesigning and implementing certain parts of the core VAN system to extend the first prototype with more of the functions described in this paper. In particular, we are extending the Linux OS kernel to provide better support for VLM functions such as maintaining and monitoring local host resource usage information for VNs and VLs, and redirecting traffic between the VAN and the underlying physical network. We are also designing and implementing some of the core VDS functions such as namespace maintenance, virtual to physical mapping, conflict resolution, and failure recovery. We expect to release, at the end of this year, a new version of the VAN system implementation that provides these important functions.

7. Related work

The virtual network concept has been utilized in many emerging networking technologies such as virtual local area networks (VLANs) [16], virtual private networks (VPNs) [22], and ATM virtual path connections (VPCs) [10]. A VLAN is a logical subset of a physical LAN that defines a broadcast domain intended to confine LAN broadcast traffic within the boundary of the logical subnet, resulting in better LAN bandwidth utilization and broadcast security. A VPN is a scalable, cost-effective way to create a logical, secure, and private network on top of a physical, insecure, and public network through the combination of several existing technologies such as encryption, tunneling, and firewalls. ATM VPCs are labeled logical paths between two ATM switches, called VPC terminators, that bundle multiple virtual circuit connections (VCCs) in order to facilitate the setting up, switching, segregating, and managing of VCCs for optimal routing and partitioning of capacity based on traffic patterns. There is also a substantial amount of work on providing QoS in VPNs [3]. These virtual networks are usually manually configured logical networks intended to better manage network resources and to provide certain value-added services to end users. The virtual network concept in the VAN architecture, in comparison, is an abstraction designed for multi-edged applications, allowing them to dynamically control and configure network resources and topology according to their needs.
This key point of the VAN architecture is also the main difference between VAN and most of the other projects we discuss below.

The Detour project [19] is a framework for alternate routing and experimental deployment of new wide-area routing algorithms. A virtual IP network of Detour nodes connected by IP-in-IP tunnels is manually configured on top of the real IP network. Unlike VANs, the Detour virtual networks are connectivity-only networks without explicit mechanisms for managing network resources within the overlay. Although the framework does incorporate basic mechanisms required for managing the virtual network and coordinating the activities of the various routing and congestion control algorithms running at each Detour node, most of these mechanisms require manual configuration.

The X-Bone project at ISI [24] is an architecture that facilitates automated deployment of "overlay" networks, such as MBone [9] and 6Bone [8]. An "overlay" network is essentially the same as a virtual network in that it is a virtual topology layered on top of the physical topology and represents a partitioning of the underlying physical network resources with new services deployed inside. X-Bone uses overlay managers, resource daemons, and a multi-cast control protocol to automate the process of configuring, controlling, and discovering resources for overlay networks. It is a valuable management tool for network operators managing large-scale overlay networks. X-Bone is also particularly useful for testing new protocols in a controlled environment. While X-Bone focuses on the configuration and manageability aspects of deploying overlay networks for network administrators and operators, VAN focuses on the resource query and allocation aspects of building virtual networks for emerging multi-edged network applications.

The Darwin project [6] conducted at CMU is another example of providing customized control in a network via the control plane.
Darwin uses so-called "delegates", which are code segments that are dynamically dispatched to switches and routers in order to influence the traffic and resource management of these switches and routers according to specific application QoS requirements. Darwin integrates delegates with its own signaling protocol (Beagle), resource broker (Xena), and hierarchical scheduling (H-FSC) to present a complete architecture for a resource management system in service-oriented networks. The more recent virtual network service (VNS) [14] builds on Darwin and extends the virtualization of routers into the data plane. VNS provides value-added network services for deploying customizable VPNs in an IP network. While Darwin (and VNS) focuses on providing customizable QoS to multi-party end-to-end network applications (such as VPNs), VAN pursues issues concerning resource and topology acquisition in general (datalink layer) virtual networks for emerging multi-edged network applications. Thus, we consider VAN and Darwin (and VNS) to be complementary efforts addressing the needs of different types of applications.

8. Summary

We have described in this paper a middleware architecture, the VAN architecture, that provides services to enable multi-edged network applications to dynamically control and configure network resources and topology. We believe that such an architecture, which manages network resources and topology on behalf of applications, is vital to future networking infrastructure as networks become ubiquitous and multi-edged network applications become pervasive, in much the same way that desktop OSes are vital to desktop applications. In this paper, we identify and focus our attention on several important issues in the design of the VAN architecture. These issues, which we reiterate, are:
(1) How can applications specify a network with the desired resources and topology features?
(2) How can the VAN system efficiently prevent deadlock among the distributed topology and resource acquisition processes of building VANs?
(3) How can the VAN system recover from underlying physical network failure to preserve the VAN service properties?
We provide our initial solutions to these challenges in this paper. We also identify several important efforts by other researchers that address issues of virtual networks not covered by the VAN architecture.
We hope that the importance of having the architecture and the feasibility of building the system, as demonstrated by this paper, will bring the attention of the network research community to this subject. A collective effort by the community can accelerate the development of the architecture to support future multi-edged applications in both active and non-active networks.

References

[1] D.O. Awduche, L. Berger, D.H. Gan, T. Li, G. Swallow, V. Srinivasan, RSVP-TE: Extensions to RSVP for LSP tunnels, Work in Progress, IETF, February 2000.
[2] G. Banga, P. Druschel, J.C. Mogul, Resource containers: a new facility for resource management in server systems, in: Third Symposium on Operating Systems Design and Implementation (OSDI '99), 1999.
[3] T. Braun, M. Gunter, I. Khalil, An architecture for managing QoS-enabled VPNs over the Internet, in: Proceedings of the 24th IEEE Annual Conference on Local Computer Networks (LCN'99), Boston, MA, October 1999.
[4] J. Bruno, E. Gabber, B. Ozden, A. Silberschatz, The eclipse operating system: providing quality of service via reservation domains, in: Proceedings of the 1998 USENIX Technical Conference, June 1998.
[5] R. Braden, L. Zhang, S. Berson, S. Herzog, S. Jamin, Resource reservation protocol (RSVP) – Version 1 functional specification, RFC2205, IETF, September 1997.
[6] P. Chandra, A. Fisher, C. Kosak, T.S.E. Ng, P. Steenkiste, E. Takahashi, H. Zhang, Darwin: resource management for value-added customizable network services, in: Sixth International Conference on Network Protocols, Austin, TX, October 1998.
[7] L. Delgrossi, L. Berger, Internet stream protocol version 2 (ST2) protocol specification – version ST2+, RFC1819, ST2 Working Group, August 1995.
[8] A. Durand, B. Buclin, 6Bone routing practice, RFC2546, IETF, March 1999.
[9] H. Eriksson, MBone: the multicast backbone, Commun. ACM 37 (8) (1994) 54–60.
[10] V.J. Friesen, J.J. Harms, J.W. Wong, Resource management with virtual paths in ATM networks, IEEE Network 10 (5) (1996) 10–20.
[11] J. Gray, A. Reuter, in: Transaction Processing: Concepts and Techniques, Morgan Kaufmann, Los Altos, 1992.
[12] K. Hamzeh, Ascend tunnel management protocol – ATMP, RFC2107, Ascend Communications, February 1997.
[13] J.W. Havender, Avoiding deadlocks in multitasking systems, IBM Syst. J. 2 (2) (1968) 74–84.
[14] L.K. Lim, Design and implementation of a virtual network services, Masters thesis, School of Computer Science, Carnegie Mellon University, 1999.
[15] C. Perkins, IP encapsulation within IP, RFC2003, IETF, October 1996.
[16] D. Passmore, J. Freeman, The virtual LAN technology report, 200374-001, 3Com, 1997.
[17] Private network–network interface specification version 1.0, af-pnni-0055.000, ATM Forum, March 1996.
[18] D.J. Rosenkrantz, R.E. Stearns, P.M. Lewis, System level concurrency control for distributed database systems, ACM Trans. Database Syst. (1978) 178–198.
[19] S. Savage et al., Detour: a case for informed internet routing and transport, IEEE Micro 19 (1) (1999) 50–59.
[20] M. Singhal, N.G. Shivaratri, in: Advanced Concepts in Operating Systems, McGraw-Hill, New York, 1994.
[21] G. Su, Virtual active networks, Technical report, CUCS017-00, Computer Science Department, Columbia University, March 2000.
[22] C. Scott, P. Wolfe, M. Erwin, Virtual Private Networks, O'Reilly, Sebastopol, CA, 1998.
[23] A.S. Tanenbaum, in: Modern Operating Systems, Prentice-Hall, Englewood Cliffs, NJ, 1992.
[24] J. Touch, S. Hotz, The X-Bone, in: Third Global Internet Mini-Conference in Conjunction with Globecom '98, Sydney, Australia, November 1998.
[25] W. Townsley, A. Valencia, A. Rubens, G. Pall, G. Zorn, B. Palter, Layer two tunneling protocol L2TP, RFC2661, IETF, August 1999.
[26] D.L. Tennenhouse, D.J. Wetherall, Towards active networks, MIT Laboratory for Computer Science, 1996.
[27] A. Viswanathan, N. Feldman, Z. Wang, R. Callon, Evolution of multiprotocol label switching, IEEE Commun. Mag. 36 (5) (1998) 165–173.
[28] H. Zhang, S. Keshav, Comparison of rate-based service disciplines, in: Proceedings of ACM SIGCOMM'91, Zurich, Switzerland, September 1991.

Gong Su was born in Nanjing, China. He holds a Bachelor of Science degree in Physics from the University of Science and Technology of China. He joined the Physics Department at Columbia University in 1991 and received his Master of Philosophy degree in Physics in 1993. He joined the Computer Science Department at Columbia in 1994 and received his Master of Philosophy degree in Computer Science in 1997. He is currently a Ph.D. candidate. His research interests are networking and operating systems.

Yechiam Yemini is a Professor of Computer Science at Columbia University where he founded and directs the Distributed Computing and Communications (DCC) lab. His research interests include computer networks, network management, economics of information systems, distributed systems and protocols. He authored over 150 publications and 5 patents, and lectured widely in these areas. Technologies created at his DCC lab have been widely exported to thousands of sites and commercialized by several companies. Professor Yemini is a co-founder of Comverse Technology Inc., a $3.5B lead vendor of multi-media message computers for telecom networks. He is also a co-founder and Director of System Management Arts, Inc., a lead technology vendor of software that automates root-cause diagnosis of network problems. Professor Yemini serves as a director of several high-tech companies, advises a major venture fund on high-tech investments, and serves on the US-Israel Science & Technology Commission. In his spare time he practices gourmet cooking and jogging to control its caloric impact.