Redirection based recovery for MPLS network systems


The Journal of Systems and Software 83 (2010) 609–620

Contents lists available at ScienceDirect

The Journal of Systems and Software journal homepage: www.elsevier.com/locate/jss

Jenn-Wei Lin *, Huang-Yu Liu
Dept. of Computer Science and Information Engineering, Fu Jen Catholic University, Taiwan, ROC

Article info

Article history: Received 2 January 2009; received in revised form 29 October 2009; accepted 30 October 2009; available online 5 November 2009.

Keywords: MPLS; fault tolerance; label switched path; affected traffic; minimum cost flow.

Abstract

To provide a reliable backbone network, fault tolerance should be considered in the network design. For a multiprotocol label switching (MPLS) based backbone network, the fault-tolerance issue focuses on how to protect the traffic of a label switched path (LSP) against node and link failures. The IETF has proposed two well-known recovery mechanisms: protection switching and rerouting. To further enhance the fault-tolerant performance of these two mechanisms, the proposed approach uses the failure-free LSPs to transmit the traffic of the failed LSP (the affected traffic). To avoid affecting the original traffic of each failure-free LSP, the proposed approach applies the solution of the minimum cost flow problem to determine the amount of affected traffic to be transmitted by each failure-free LSP. To transmit the affected traffic along a failure-free working LSP, the IP tunneling technique is used. We also propose a permission token scheme to solve the packet disorder problem. Finally, simulation experiments are performed to show the effectiveness of the proposed approach.

© 2009 Elsevier Inc. All rights reserved.

1. Introduction

With the rapid growth of the Internet and the increase in real-time and multimedia applications, hop-by-hop packet forwarding is insufficient to support data transmission. The IETF has proposed multiprotocol label switching (MPLS) as a new forwarding technology to meet the requirements of explosive traffic growth. In addition to fast forwarding, fault tolerance is an important issue in network design. If an Internet service provider (ISP) adopts MPLS technology for its backbone network, a fault-tolerant mechanism is also necessary to protect the traffic of a label switched path (LSP) against node and link failures. An LSP is a transmission path in the MPLS network.

Much research (Huang et al., 2002; Haskin and Krishnan, 2000; Hundessa and Pascual, 2001; Ho and Mouftah, 2004; Yoon et al., 2001; Ahn et al., 2002; Agarwal and Deshmukh, 2002) has studied the fault-tolerance issue of the MPLS network. The main ideas of these works are derived from the two IETF recovery mechanisms: protection switching and rerouting (Sharma et al., 2003). The protection switching mechanism pre-establishes a backup path for each working LSP. When an LSP fails, the traffic carried in this LSP is switched to the pre-established backup path of the LSP. However, if there is also a node (link) failure in the pre-established backup path, this recovery mechanism cannot work successfully. In the rerouting mechanism, the backup path is found dynamically, and finding the backup path incurs non-trivial overhead. In addition, the rerouting mechanism may fail if it cannot find a suitable backup path.

In this paper, we propose an efficient approach for enhancing the fault-tolerant performance of the protection switching and rerouting recovery mechanisms. If a failed LSP cannot be recovered successfully using either of the two recovery mechanisms, the proposed approach is initiated to attempt the recovery of the failed LSP again. The proposed approach uses failure-free working LSPs (the working LSPs that do not suffer from failures) to carry the traffic of the failed LSP (the affected traffic). To transmit the affected traffic along a failure-free working LSP, the IP tunneling technique is used to encapsulate each packet of the affected traffic so that it has the forwarding equivalence class (FEC) type of that LSP. With the IP tunneling technique, no additional label assignment is required. In contrast, the protection switching and rerouting recovery mechanisms assign extra labels to each backup path to transmit the affected traffic, so the proposed approach avoids the complicated label assignment task (Applegate and Thorup, 2003). To minimize the influence of the affected traffic on failure-free working LSPs, the proposed approach transforms the problem of affected traffic distribution into the minimum cost flow problem. We also propose a permission token scheme to solve the packet disorder problem. Finally, we perform simulation experiments to show the performance and overhead of the proposed approach.

The rest of this paper is organized as follows. Section 2 gives background knowledge. Section 3 proposes our fault-tolerant approach. Section 4 compares the proposed approach with previous approaches. Finally, concluding remarks are made in Section 5.

* Corresponding author. E-mail address: [email protected] (J.-W. Lin).
0164-1212/$ - see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2009.10.043

2. Background

2.1. Network model

The network model used in this paper is shown in Fig. 1. It consists of an MPLS backbone network, two IP based access networks, and an OAM (operations, administration, and management) center. In the MPLS backbone network, a number of label switched paths (LSPs) are established in advance. Each LSP consists of an ingress label switching router (ingress LSR), one or more intermediate label switching routers (intermediate LSRs), and an egress label switching router (egress LSR). An LSP can be established using the label distribution protocol (LDP) (Andersson, 2007), a dedicated protocol developed by the IETF for assigning labels to an LSP. With the label assignment, an LSP is responsible for carrying the packets with a particular forwarding equivalence class (FEC) type (header). The FEC represents an aggregation of packets that are treated in the same transmission manner. The FEC of a packet is determined by some fields of its header, such as the source and/or destination addresses.

As shown in Fig. 1, there are many components in the MPLS network system. Generally, an OAM center is equipped within a network system for managing the operations, verifying the performance, and monitoring the statuses of all the components. Cavendish et al. (2004) describe the OAM of the MPLS network system in more detail.

Fig. 1. MPLS network model. (Figure labels: OAM center; source hosts; IP source access network; MPLS backbone network with ingress LSRs, intermediate LSRs, and egress LSRs, LSR1 to LSR10; IP destination access network; destination hosts. LSR: label switching router.)

2.2. Failure assumption and detection

Although a whole MPLS network system includes an MPLS backbone network, two IP based access networks, and an OAM center (see Fig. 1), we mainly study the fault-tolerance issue of the MPLS backbone network. Failures are assumed to occur in the MPLS backbone network only. Failure detection is based on the well-known Hello mechanism. In this mechanism, a Hello message is periodically exchanged between every two neighboring LSRs. If an LSR on an LSP does not receive a Hello message from one of its neighbor LSRs within a period of time, it sends a failure indication signal (FIS) message to report the failure. Next, the proposed approach is initiated to perform the recovery of the failed LSP.

2.3. Related work

All existing MPLS fault-tolerant approaches are based on the two IETF recovery models: protection switching and rerouting (Sharma et al., 2003). The backup path in the two recovery models is either pre-established or dynamically found.

The approaches of Huang et al. (2002), Haskin and Krishnan (2000), Hundessa and Pascual (2001) and Ho and Mouftah (2004) are based on the protection switching mechanism. In the approach of Huang et al. (2002), each working LSP has a disjoint backup path between the ingress LSR and egress LSR. The backup path is pre-established, and it does not share any intermediate LSRs with the corresponding primary LSP. When one or more failures are detected in a working LSP, the FIS message is sent back to the ingress LSR of the failed LSP. Upon receiving the FIS message, the ingress LSR reroutes the incoming packets through the disjoint backup path. However, the approach of Huang et al. (2002) has a packet loss problem, since it does not reroute the packets currently carried in the failed LSP (the in-transit packets).

To solve the packet loss problem, the approach of Haskin and Krishnan (2000) additionally pre-establishes a backward backup path for each working LSP, so there are two backup paths for a working LSP. The route of the backward backup path is the reverse of the route of the corresponding primary LSP. When a failure is detected in a working LSP, new incoming packets are carried by the disjoint backup path. The in-transit packets are sent back to the ingress LSR using the backward backup path. When the ingress LSR receives the in-transit packets, it redirects them to the disjoint backup path. Although the approach of Haskin and Krishnan (2000) solves the packet loss problem, it may introduce a packet disorder problem: new incoming packets may be carried by the disjoint backup path earlier than the in-transit packets.

To overcome the packet disorder problem, the approach of Hundessa and Pascual (2001) uses tagging and buffering techniques to improve the approach of Haskin and Krishnan (2000). The tagging technique lets each path switch LSR (PSL) on the failed LSP know its last received packet before the failure. The buffering technique makes each PSL actively store the incoming packets after the failure. With the assistance of these two techniques, the in-transit packets and new incoming packets can be carried by the disjoint and backward backup paths in order.

Unlike the above protection switching based approaches, the approach of Ho and Mouftah (2004) pre-establishes several backup paths for each working LSP. In this approach, a working LSP is first subdivided into several protected segments. Each protected segment forms a protection domain, which has a PSL and a PML (path merge LSR). In a protection domain, each backup path is pre-established and disjoint from its protected segment. Once a failure is detected in a protected segment, the corresponding PSL switches the affected traffic to the corresponding pre-established backup path. After the affected traffic has bypassed the faulty segment of the failed LSP, the PML redirects the affected traffic back to the failure-free segment of the failed LSP. The fault-tolerance idea of Ho and Mouftah (2004) is not novel; it is fully based on local recovery (Sharma et al., 2003) to protect the affected traffic. The contribution of this approach lies in the capacity allocation for the backup paths.

The approaches of Yoon et al. (2001), Ahn et al. (2002) and Agarwal and Deshmukh (2002) belong to the rerouting model. The approach of Yoon et al. (2001) proposes a pre-qualified recovery mechanism, as follows. Whenever a working LSP is established, each LSR on the LSP also determines the route of its corresponding recovery path. The recovery path of an LSR is used to protect the LSP segment from it to its next LSR. When an LSR detects a failure between it and its next LSR, the corresponding pre-qualified recovery path is actually established, and the bandwidth resource is reserved for it. Since the route of the recovery path is pre-determined, the approach of Yoon et al. (2001) does not need to find the recovery route after the failure. However, during normal operation, this approach incurs the route calculation overhead. In addition, if a failure also occurs in the pre-qualified recovery path, the approach of Yoon et al. (2001) cannot work.

In the approach of Ahn et al. (2002), each LSR has one or more corresponding candidate protection merging LSRs (candidate PMLs). The candidate PMLs of an LSR are the LSRs located downstream of that LSR. While an LSP is established, each LSR on the LSP also finds the information about its candidate PMLs.
Once a failure is detected in an LSP, the LSR that detects the failure first calculates the costs of all possible recovery paths from it to each of its candidate PMLs. Then, the LSR selects the recovery path with the least cost. Next, the constraint-based label distribution protocol (CR-LDP) is used to explicitly establish the route of the best recovery path. Compared to the approach of Yoon et al. (2001), the approach of Ahn et al. (2002) can easily handle multiple failures. However, the approach of Ahn et al. (2002) incurs a non-trivial recovery time for finding the least-cost recovery path.

The above-mentioned approaches do not allow failures to occur at the edge LSRs (the ingress or egress LSRs). Only the approach of Agarwal and Deshmukh (2002) proposes a rerouting solution to tolerate an ingress LSR failure, but the solution is not applicable to an intermediate or egress LSR failure. This approach works as follows. Initially, each ingress LSR specifies one of the other ingress LSRs as its backup. If a failure occurs in an ingress LSR, the failed ingress LSR cannot perform label switching to forward the packets of the source host. In this case, the source host sends the undeliverable packets to the specified backup ingress LSR. The backup ingress LSR forwards the packets along its own corresponding LSP using label stacking. When the packets arrive at the egress LSR of the corresponding LSP, the egress LSR strips off the stacked labels of the packets and forwards them to the destination host using normal IP routing.

3. Proposed approach

This section presents a new fault-tolerant approach for the MPLS backbone network. Upon detecting one or more failures in a working LSP, the traffic carried in the failed LSP is redirected to the failure-free working LSPs without any additional label assignment.

3.1. Basic idea

Telecom operators often consider traffic growth in the network design. Vasseur et al. (2004) indicate that the bandwidth allocated for a transmission path should be over-provisioned to account for traffic growth and to avoid congestion. For an MPLS backbone network, the amount of bandwidth allocated to an LSP is larger than the maximum bandwidth requirement of the traffic carried in this LSP. When one or more node failures occur in a working LSP, the failure-free working LSPs may therefore have residual bandwidth. This observation inspires the idea that one or more failure-free working LSPs may substitute for the failed LSP to carry the affected traffic. To realize this idea, the following problems must be handled first.

• How to distribute the affected traffic to the failure-free working LSPs.
• How to redirect the affected traffic to the failure-free working LSPs.
• How to forward the affected traffic along the route of a failure-free working LSP.
• How to solve the packet loss and disorder.

The first three problems concern making multiple failure-free working LSPs carry the affected traffic in an optimal way. The previous approaches often assume that a backup path can be used to carry the affected traffic. If the backup path already carries some traffic, it must perform the label merging and splitting procedures to additionally carry the affected traffic. However, the label merging and splitting procedures are not given in the previous approaches, and they also introduce complicated computation. In the proposed approach, the affected traffic is redirected to multiple failure-free working LSPs, so there may be more than one new transmission path for carrying the affected traffic. Besides, the failure-free working LSPs already have their respective traffic. It is therefore necessary to describe how to redirect and regulate the affected traffic along the route of a failure-free working LSP. As for the last problem, packet loss and disorder are not considered simultaneously in any of the previous approaches except Hundessa and Pascual (2001).
In the proposed approach, we also present a piggyback method and a permission token scheme to solve the packet loss and disorder, respectively.

3.2. Solutions

In the MPLS backbone network, each LSP has a different bandwidth and transmission delay. The residual bandwidth and transmission delay of each LSP are the two main factors for handling the first problem above (how to distribute the affected traffic to the failure-free working LSPs). To avoid affecting the existing traffic of a failure-free working LSP, the residual bandwidth of a failure-free working LSP is calculated conservatively, by subtracting the maximum bandwidth requirement of the carried traffic flow from the bandwidth originally allocated to this LSP. Note that a traffic flow to be carried in an LSP is required to provide its maximum bandwidth requirement. This information can be obtained from the service level agreement (SLA). If the total residual bandwidth of all failure-free LSPs is less than the bandwidth requirement of the affected traffic, the bandwidth requirement of the affected traffic must be degraded so that it can be carried by the failure-free LSPs; otherwise, the affected traffic cannot continue to be transmitted. To obtain the optimal distribution of the affected traffic, we transform the problem of the affected traffic distribution into the minimum cost flow problem (Sokkalingam et al., 2000). The transformation process is as follows.

First, a simple graph is virtually established based on all failure-free working LSPs. The ingress and egress LSRs of all failure-free working LSPs are placed in the simple graph as the ingress and egress nodes. Then, each corresponding ingress-egress node pair is connected by an edge. For each edge, the cost and capacity are set to the transmission delay (number of intermediate LSRs) and the residual bandwidth of the corresponding failure-free working LSP, respectively. An example is given in Fig. 2. There are three working LSPs whose bandwidths are all 20 Mbps. For these three LSPs, the bandwidth requirements of their traffic flows are 10, 12, and 11 Mbps, respectively. A failure is detected in LSP1. To redirect the affected traffic of LSP1, the failure-free LSP2 and LSP3 are first modeled as two edges in a simple graph, as shown in Fig. 2a. The costs of the two edges are 2 and 1, since the failure-free LSP2 and LSP3 have 2 and 1 intermediate LSRs, respectively. The capacities of the two edges are 8 and 9, since the residual bandwidths of the failure-free LSP2 and LSP3 are 8 (20 − 12 = 8) Mbps and 9 (20 − 11 = 9) Mbps, respectively.

In Fig. 2a, the simple graph only models the possible transmission costs of the affected traffic in the MPLS backbone network. As shown in Fig. 1, the packets from the source host to the destination host also pass through two IP access networks in addition to the MPLS backbone network. The distribution of the affected traffic should also consider the transmission costs of the two IP access networks. To take this into account, a source node and a destination node are additionally placed in the leftmost and rightmost positions of the simple graph, respectively. Then, a number of edges are set from the source node to the ingress nodes of the simple graph. The source node corresponds to the access router of the source host in the IP source access network, and each of its edges corresponds to one transmission path in the IP source access network.
The cost and capacity of each such edge are set to the number of transit hops and the bandwidth of the corresponding transmission path, respectively. The destination node of the simple graph corresponds to the access router of the destination host in the IP destination access network. Similarly, a number of edges are set in the simple graph to connect the destination node with all egress nodes. Each such edge corresponds to one transmission path in the IP destination access network. The cost and capacity of each such edge are set in the same way as for the source node. Based on the above operations, all possible distribution costs of the affected traffic can be represented in the simple graph, as shown in Fig. 2b. In Fig. 2b, the access routers AR1 and AR2 are modeled as the source and destination nodes in the simple graph, respectively. The costs of the outgoing edges of the source node are set to 2 and 3, since there are 2 and 3 transit hops in the corresponding transmission paths of the IP source access network, respectively. The capacities of the outgoing edges are 8, since the bandwidths of the two corresponding transmission paths are assumed to be 8 Mbps. For the destination node, the costs and capacities of its incoming edges are set in the same way. Note that the numbers of nodes and edges in the simple graph depend on the number of failure-free LSPs, not on the number of LSRs, so the topology of the simple graph is not complicated. After establishing the simple graph, a traffic flow with a data rate of x units is supplied to the source node, where x represents the bandwidth requirement of the affected traffic. Then, the affected traffic distribution becomes the problem of how to transmit the affected traffic from the source node to the destination node with a minimum cost (the minimum cost flow problem).
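The construction above can be sketched in a few lines of Python using the numbers of the Fig. 2 example. This is only an illustrative sketch, not the authors' implementation: the class `CandidatePath`, the function `distribute`, and all field names are hypothetical, and the costs of the destination-side edges (2 and 3) are assumed to mirror the source side. Instead of a general minimum cost flow algorithm, the sketch uses a greedy fill in order of increasing unit cost, which is sufficient here only because every candidate path is a separate source-to-destination path bounded by the residual bandwidth of its LSP.

```python
from dataclasses import dataclass


@dataclass
class CandidatePath:
    """One source-to-destination path through a failure-free working LSP."""
    name: str
    allocated_mbps: float     # bandwidth originally allocated to the LSP
    max_required_mbps: float  # maximum requirement of the LSP's own traffic (from the SLA)
    src_cost: int             # transit hops in the IP source access network
    lsp_cost: int             # number of intermediate LSRs in the LSP
    dst_cost: int             # transit hops in the IP destination access network (assumed)

    @property
    def residual_mbps(self) -> float:
        # Conservative residual bandwidth: allocation minus maximum requirement.
        return self.allocated_mbps - self.max_required_mbps

    @property
    def unit_cost(self) -> int:
        # Cost of carrying one unit of affected traffic end to end.
        return self.src_cost + self.lsp_cost + self.dst_cost


def distribute(paths, demand_mbps):
    """Greedy distribution: fill the cheapest paths first, up to their residual bandwidth."""
    total_residual = sum(p.residual_mbps for p in paths)
    # If the residual bandwidth is insufficient, the demand is degraded to what fits.
    remaining = min(demand_mbps, total_residual)
    shares = {}
    for path in sorted(paths, key=lambda p: p.unit_cost):
        share = min(path.residual_mbps, remaining)
        shares[path.name] = share
        remaining -= share
    return shares


paths = [
    CandidatePath("LSP2", 20, 12, src_cost=2, lsp_cost=2, dst_cost=2),
    CandidatePath("LSP3", 20, 11, src_cost=3, lsp_cost=1, dst_cost=3),
]
shares = distribute(paths, demand_mbps=10)  # affected traffic of the failed LSP1
print(shares)  # {'LSP2': 8, 'LSP3': 2}
```

With the example values, LSP2 (unit cost 6) is filled to its residual 8 Mbps and the remaining 2 Mbps goes to LSP3 (unit cost 7). A topology in which candidate paths share edges would need a real minimum cost flow solver rather than this greedy rule.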
A polynomial-time algorithm (Sokkalingam et al., 2000) has been proposed for solving the minimum cost flow problem, which formulates the problem as a linear program, as follows:

\[
\text{Minimize} \quad \sum_{k=1}^{n} c_k x_k \tag{1}
\]

\[
\text{Subject to} \quad \sum_{k=1}^{n} x_k = b_{LSP_f}, \tag{2}
\]

\[
0 \le x_k \le r_k \quad \text{for all } k = 1, \ldots, n,
\]

where $x_k$ represents the amount of the affected traffic to be carried by failure-free LSP$_k$, $c_k$ is the unit cost of an affected traffic packet carried by failure-free LSP$_k$, $r_k$ is the residual bandwidth of failure-free LSP$_k$, $b_{LSP_f}$ is the bandwidth requirement of the affected traffic, and $n$ is the number of failure-free working LSPs in the MPLS backbone network. Based on (1) and (2), the minimum cost flow problem of the simple graph in Fig. 2b can be formulated as the following linear program:

\[
\begin{aligned}
\text{Minimize} \quad & 6x_1 + 7x_2 \\
\text{Subject to} \quad & x_1 + x_2 = 10, \\
& 0 \le x_1 \le 8, \\
& 0 \le x_2 \le 9,
\end{aligned}
\]

where the unit costs of an affected traffic packet carried by failure-free LSP2 and LSP3 are 6 and 7, respectively, since the costs of the two transmission paths corresponding to failure-free LSP2 and LSP3 in Fig. 2b are 2 + 2 + 2 = 6 and 3 + 1 + 3 = 7, respectively. As mentioned above, the bandwidth requirement of the affected traffic (the traffic flow of the failed LSP1) is 10 Mbps. The residual bandwidths of the failure-free LSP2 and LSP3 are 8 Mbps and 9 Mbps, respectively. Under these constraints, the solutions for $x_1$ and $x_2$ in the above linear program are 8 and 2, respectively. This means that the affected traffic should be divided into 8 Mbps and 2 Mbps sub-traffic flows to be transmitted by failure-free LSP2 and LSP3, respectively.

For the second problem, a software switchover mechanism is used to redirect the affected traffic. Note that the switchover mechanism can be implemented by modifying the routing table of an access router. The implementation details will be given in Section 3.3. As shown in Fig. 1, a packet from a source host first passes through a corresponding access router in the IP source access network. Then, the packet is transmitted by an LSP in the MPLS backbone network. In theory, the access router can send incoming packets to any ingress LSR in the MPLS backbone network, since the IP source access network and the MPLS backbone network are usually deployed by the same Telecom operator (administrator). If the same-administrator assumption is not made, the access router can still communicate with any ingress LSR as long as the routers of the IP source access network impose no routing restrictions.

The third problem, how to forward the affected traffic along the route of a failure-free LSP, is handled using IP header encapsulation and decapsulation. IP header encapsulation and decapsulation can be supported by the tunneling protocol (Simpson et al., 1995).
As mentioned in Section 2.1, packets carried by the same LSP have the same FEC. The FEC of a packet is determined by some of its header fields (e.g. source address and/or destination address). Therefore, if two packets are carried by the same LSP, the headers of these two packets must have the same values in some fields. In other words, an LSP is associated with a particular header type (a FEC type). If a packet is encapsulated with a header type, it can be carried by the LSP that corresponds to that header type. After the packet leaves the MPLS backbone network, its encapsulated header (outer header) is removed. An example is shown in Fig. 2. From Fig. 2, we know that there is a failure in LSP1. Therefore, the affected traffic is redirected to failure-free LSP2 and LSP3 with the assistance of access router AC1. As shown in Fig. 2c, access router AC1 encapsulates the outer header FEC2 (FEC3) on each affected traffic packet. Then, the ingress LSR of LSP2 (LSP3) receives the affected traffic packet and transmits the packet along the route of LSP2 (LSP3) using normal label switching. When the packet arrives at the egress LSR of LSP2 (LSP3), the egress LSR strips off the outer header of the affected traffic packet and uses normal IP routing to forward the packet to the destination host (dest1).

Fig. 2. An example of the recovery of a failed LSP. (a) Forming an initial simple graph. (b) Forming the whole simple graph. (c) Redirecting the affected traffic.
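The encapsulation and decapsulation steps described above can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not the tunneling protocol itself: the `Packet` class, the `FEC_TABLE` mapping, the helper names, and all addresses (taken from documentation address ranges) are hypothetical, and the FEC is assumed to be determined by the destination address alone.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    payload: Union[bytes, "Packet"]  # a tunneled packet carries the original packet


# Hypothetical FEC rule: the destination address alone selects the LSP.
FEC_TABLE = {
    "198.51.100.2": "LSP2",  # assumed address of the egress LSR of LSP2
    "198.51.100.3": "LSP3",  # assumed address of the egress LSR of LSP3
}


def classify(packet: Packet) -> Optional[str]:
    """Return the LSP whose FEC matches the packet's (outer) header, if any."""
    return FEC_TABLE.get(packet.dst)


def encapsulate(packet: Packet, tunnel_src: str, tunnel_dst: str) -> Packet:
    """Access router side: add an outer header with the FEC of a failure-free LSP."""
    return Packet(src=tunnel_src, dst=tunnel_dst, payload=packet)


def decapsulate(packet: Packet) -> Packet:
    """Egress LSR side: strip the outer header and recover the original packet."""
    if not isinstance(packet.payload, Packet):
        raise ValueError("packet carries no encapsulated packet")
    return packet.payload


# An affected packet of the failed LSP1, addressed to the destination host.
original = Packet(src="192.0.2.10", dst="203.0.113.1", payload=b"data")
# The access router wraps it so that it matches the FEC of LSP2.
tunneled = encapsulate(original, tunnel_src="192.0.2.1", tunnel_dst="198.51.100.2")

print(classify(tunneled))                 # LSP2
print(decapsulate(tunneled) == original)  # True
```

The point of the sketch is that no label is assigned for the affected traffic: the ingress LSR only sees the outer header, which already belongs to an existing FEC, and the inner packet is untouched until the egress LSR removes the outer header.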

For the last problem (packet loss and disorder), the solutions are elaborated as follows. To solve the packet loss, the proposed approach piggybacks an unsent sequence number on the FIS message. When a failure is detected in an LSP, the detecting node (the upstream LSR of the failure point) sends an FIS message to the corresponding source host (see Section 2.2). Before sending the FIS message, the detecting node piggybacks the sequence number of the packet that has not been sent to its next downstream LSR (see Fig. 3). When the source host receives the FIS message, it re-sends all the packets whose sequence numbers are larger than or equal to the sequence number attached to the FIS message. In this way, all lost packets can be recovered according to the unsent sequence number.

Before presenting the solution for packet disorder, we explain how packet disorder occurs. In the proposed approach, the affected traffic is distributed to multiple failure-free working LSPs. Due to the different transmission delays of the failure-free working LSPs, some affected traffic packets may arrive at the destination host out of order. To solve this packet disorder problem, a permission token is used to make all the affected traffic packets arrive at the destination host in order. The permission token is passed among the egress LSRs of all the failure-free working LSPs, and it controls the sending order of the affected traffic packets between the egress LSRs and the destination host. The details are as follows. Initially, the permission token is owned by the egress LSR of the failure-free working LSP that carries the first affected traffic packet. Then, the permission token is passed to the egress LSR of the failure-free working LSP that carries the second affected traffic packet, and so on. When an egress LSR wants to send an affected traffic packet but does not hold the permission token, the affected traffic packet is stored in a buffer to wait for the permission token. In the proposed approach, the passing of the permission token can be integrated with the IP header encapsulation and decapsulation of the affected traffic.
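The loss-recovery rule described earlier in this passage (re-send every packet whose sequence number is at least the unsent sequence number carried by the FIS message) is small enough to state as code. The sketch below is illustrative only; it assumes that the source host keeps its recently sent packets in a buffer of `(sequence number, data)` pairs, and the function name `packets_to_resend` is hypothetical.

```python
def packets_to_resend(sent_packets, unsent_seq):
    """Select buffered packets with sequence number >= the USN piggybacked on the FIS."""
    return [(seq, data) for seq, data in sent_packets if seq >= unsent_seq]


# Assumed send buffer at the source host: packets 1..6 were sent before the FIS arrived.
sent_packets = [(seq, f"payload-{seq}") for seq in range(1, 7)]

# The detecting LSR reports that packet 4 was the first one it could not forward.
resend = packets_to_resend(sent_packets, unsent_seq=4)
print([seq for seq, _ in resend])  # [4, 5, 6]
```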
As mentioned earlier, the load redirector (the corresponding access router of the source host) is responsible for redirecting the affected traffic to one or more failure-free working LSPs. An affected traffic packet and its next packet may or may not be transmitted by the same failure-free working LSP. The load redirector knows which LSP is used to transmit the next affected traffic packet, since it defines the redirection rule of the affected traffic. When the load redirector encapsulates the outer header on an affected IP packet, the information about the permission token passing is also put in the outer header (see step 1 of Fig. 4). The information records the initial owner and the next owner of the permission token. Note that the network model of this paper is based on two IP based access networks and one MPLS backbone network, so the outer header uses the IP header format. The IP header format has an option field that can be used to store extension information, and the permission token passing information can be stored in the option field of the outer header. Later, when an egress LSR decapsulates the outer header of the affected IP packet, it learns the initial owner and the next owner of the permission token from the outer header (see step 2 of Fig. 4). Then, the egress LSR encapsulates a new outer header on the affected packet, using the initial owner as the destination address. Knowing the next owner, the egress LSR can also pass the permission token to the next owner using the IP routing protocol (see step 3 of Fig. 4). Note that the egress LSR supports both the MPLS and IP routing protocols. Since the initial owner is encapsulated on each affected packet as the destination address, the initial owner can receive all affected traffic packets. Next, the initial owner can send each affected traffic packet to the desired destination host in order (see step 4 of Fig. 4).

The proposed permission token scheme does not incur high resource-consuming overhead. This can be seen from the following operations executed in this scheme.

• Putting the initial owner and next owner of the permission token in the outer header of each affected packet.
• Decapsulating the outer header to retrieve the initial owner and next owner.
• Passing the permission token to the next owner.
• Rerouting each affected packet to the initial owner by encapsulating a new outer header.
• Decapsulating the new outer header to send each affected traffic packet to the desired destination host in order.

Each of the above operations takes less than 10^-9 s. There is an additional rerouting delay in the permission token scheme. However, the MPLS based backbone network is equipped with high speed links. In the simulation experiments, the average additional rerouting delay is about 0.015 s.
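The ordering effect of the permission token can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the scheme as implemented by the authors: it models only the rule that an egress LSR buffers a packet until it holds the token and then passes the token to the egress LSR that carries the next packet. It omits the re-encapsulation toward the initial owner and the IP option field, and the function `deliver_in_order` and the egress names `E2` and `E3` are hypothetical.

```python
def deliver_in_order(assignments, arrivals):
    """Simulate token-controlled sending at the egress LSRs.

    assignments: assignments[i] is the egress LSR that carries packet i
                 (the redirection rule known to the load redirector).
    arrivals:    (egress, packet index) pairs in the order packets reach the egress LSRs.
    Returns the packet indices in the order they are sent to the destination host.
    """
    buffered = {egress: set() for egress in set(assignments)}
    delivered = []
    token_seq = 0  # the token is with the egress LSR that carries packet token_seq
    for egress, seq in arrivals:
        buffered[egress].add(seq)  # hold the packet until the token allows sending
        while token_seq < len(assignments):
            holder = assignments[token_seq]
            if token_seq not in buffered[holder]:
                break  # the token holder is still waiting for its packet
            buffered[holder].remove(token_seq)
            delivered.append(token_seq)
            token_seq += 1  # pass the token to the next owner
    return delivered


# Packets 0..4 are split over two failure-free LSPs with egress LSRs E2 and E3.
assignments = ["E2", "E2", "E3", "E2", "E3"]
# The LSP ending at E3 is faster, so its packets reach the egress early.
arrivals = [("E3", 2), ("E2", 0), ("E3", 4), ("E2", 1), ("E2", 3)]

print(deliver_in_order(assignments, arrivals))  # [0, 1, 2, 3, 4]
```

Packets 2 and 4 reach their egress LSR early but stay buffered until the token arrives, so the destination host still receives the sequence in order. The price is the buffering delay at the egress LSRs, which corresponds to the additional delay discussed above.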

Fig. 3. The piggyback of the unsent sequence number. (FIS: failure indication signal; USN: unsent sequence number.)

3.3. Implementation

All of the above operations of the proposed approach can be incorporated into the MPLS OAM functionality, as shown in Fig. 5. From Section 2.1, we know that the OAM center is an existing component in the MPLS network system; it is not additionally introduced by the proposed approach. The main functions of the OAM center are configuration management, performance management, and fault management. Once a failure is detected in an LSP, the failure event is sent to the fault management of the OAM center (see step 1 of Fig. 5). The fault management invokes the proposed approach to redirect the affected traffic to failure-free LSPs (see step 2 of Fig. 5). The proposed approach first queries the performance management of the OAM center to acquire two kinds of transmission cost information: the transmission costs of the failure-free working LSPs in the MPLS backbone network and the transmission costs of the used paths in the IP source and destination access networks (see steps 3 and 4 of Fig. 5). The transmission costs were already measured and stored in the performance management when the MPLS network system was established. For the residual bandwidth of an LSP and the bandwidth requirement of the affected traffic, the proposed approach has already described how to calculate these two kinds of bandwidth information (see Section 3.2).

Fig. 4. The passing of the permission token. (Step 1: packet encapsulation; Step 2: packet decapsulation; Step 3: packet rerouting and token passing; Step 4: packet sending. Oi: outer header of packet i; Ii: initial token owner in LSRi; Ni: next token owner to LSRi.)

Fig. 5. Integrating the proposed approach into the OAM. (Labeled steps: 1. Report the LSR failure event; 2. Inquire some performance information; 3. Respond to the performance inquiry; 4. Perform the switchover of the affected traffic; 5. Redirect the affected traffic; 6. Initiate the software switchover mechanism: (1) removing the routing entry of LSR 1 from the routing table, (2) adding the routing entries of LSRs 4 and 8 to the routing table.)

Based on the information collected above, the proposed approach transforms the problem of the affected traffic distribution

to the problem of the minimum cost flow and then solves the corresponding linear equation. Next, the proposed approach notifies the configuration management to send a configuration command to the access router of the source host (see steps 5 and 6 of Fig. 5). Upon receiving the configuration command, the access router initiates the software switchover mechanism to distribute the affected traffic packets to the failure-free working LSPs (see step 7 of Fig. 5) based on the solution of the corresponding linear equation. The switchover mechanism in the access router can be implemented by deleting the routing entry corresponding to the failed LSP and adding the routing entries corresponding to the failure-free working LSPs. The maximum bandwidth field in each added routing entry is set to the amount of the affected traffic redirected to the corresponding failure-free working LSP. The access router uses the added routing entries to forward the affected packets in a round robin manner. The round robin policy is frequently adopted in multipath routing, and it is also supported in a general router (Cetinkaya and Knightly, 2004).

4. Comparisons

This section compares the proposed approach with the previous approaches. We first summarize the characteristics of the proposed approach and the previous approaches based on the following metrics: recovery method, failure-free overhead, fault-tolerant overhead (recovery time), and fault-tolerant capability. These metrics are also used in Ahn et al. (2002) to characterize an MPLS fault-tolerant approach. In addition, we perform simulation experiments to obtain quantitative comparisons.

4.1. Characteristic summary

The characteristics of the proposed approach and the previous approaches (Huang et al., 2002; Haskin and Krishnan, 2000; Hundessa and Pascual, 2001; Ho and Mouftah, 2004; Yoon et al., 2001; Ahn et al., 2002) are listed in Table 1.

• Recovery method: The approaches of Huang et al. (2002), Haskin and Krishnan (2000), Hundessa and Pascual (2001) and Ho and Mouftah (2004) belong to the protection switching mechanism. These approaches need to pre-establish one or more backup paths for each working LSP. The recovery methods of Huang et al. (2002), Haskin and Krishnan (2000) and Hundessa and Pascual (2001) are based on global recovery, which protects the whole LSP. The switchover is always activated at the ingress LSR of a failed LSP. The recovery method of Ho and Mouftah (2004) is based on local recovery. Similar to 1:N protection, there are many backup paths for a working LSP. The switchover starts from the upstream LSR of the failure point. The upstream LSR uses its pre-established backup path to bypass the failed segment of the failed LSP. The rerouting mechanism dynamically finds a recovery path when a failure is detected. To reduce the recovery time, the approach of Yoon et al. (2001) pre-calculates the routes of recovery paths during the normal time. Although this approach is not required to find the route of the recovery path after a failure, it needs to take time to establish the pre-determined recovery route in advance. The approach of Ahn et al. (2002) dynamically finds the least cost recovery path. As for the proposed approach, the existing failure-free working LSPs are utilized to organize a backup path set for a failed LSP.
• Failure-free overhead: In the protection switching mechanism, when an LSP is established, its corresponding backup path is simultaneously pre-established. The proposed approach and most of the rerouting based approaches do

not take any recovery operations during the normal time (the failure-free period). Although the approach of Yoon et al. (2001) belongs to the rerouting mechanism, it needs to pre-calculate the optimal recovery route before a failure. Whenever the network topology changes, the optimal recovery route is also updated. Therefore, the approach of Yoon et al. (2001) incurs the route calculation overhead during the failure-free period.
• Fault-tolerant overhead: This metric is divided into three sub-metrics: packet losses, packet disorder, and recovery time. The approaches of Huang et al. (2002), Ho and Mouftah (2004), Yoon et al. (2001) and Ahn et al. (2002) have the packet loss problem. To solve the packet loss problem, the approach of Haskin and Krishnan (2000) incurs the packet disorder problem. The proposed approach and the approach of Hundessa and Pascual (2001) do not incur the above two problems. The proposed approach uses the unsent sequence number and the permission token (see Section 3.2) to avoid the packet loss and disorder problems. The recovery time of the protection switching mechanism is mainly spent on activating the pre-established backup path. The activation involves a number of LDP signaling messages. In addition, to solve the packet losses, the approaches of Haskin and Krishnan (2000) and Hundessa and Pascual (2001) also need additional time to send some affected traffic packets back to the ingress LSR. For the rerouting based approaches of Yoon et al. (2001) and Ahn et al. (2002), the recovery time is mainly spent on finding the recovery path and activating the path. To reduce the recovery time, the approach of Yoon et al. (2001) pre-calculates the route of the recovery path before a failure. When a failure is detected, the approach of Yoon et al. (2001) only needs to set the required bandwidth of the pre-determined recovery path. As for the proposed approach, finding and activating the recovery path are not required since the recovery path is constituted by the existing failure-free working LSPs in the MPLS backbone network. The recovery time of the proposed approach is mainly determined by the calculation of the affected traffic distribution. In Section 3.2, we have formulated the affected traffic distribution as a linear equation. The solution of the linear equation has also been implemented in Mathematica (Mathematica5, 2009). The time for calculating the affected traffic distribution to 50 (100) failure-free working LSPs is 0.02 (0.031) s.
• Fault-tolerant capability: For the protection switching based approaches, the fault-tolerant capability depends on the status of the used backup path. If there is also a failure in the used backup path, the protection switching based approaches cannot work. For the rerouting based approaches, the fault-tolerant capability depends on the existence of a suitable redundant path to serve as the backup path. If the backup path cannot be found, the rerouting based approaches will also fail. For the proposed approach, all failure-free working LSPs can be utilized to organize a backup path, which allows multiple LSPs to fail simultaneously.
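To give a feel for the affected traffic distribution whose computation time is quoted above, the following is a minimal sketch, not the authors' Mathematica formulation. It assumes a simplified special case in which each failure-free working LSP is a single parallel path with one per-unit transmission cost (access network and backbone costs already summed) and a residual bandwidth. In that special case a minimum cost flow is obtained by filling the cheapest paths first. The function name `distribute_affected_traffic` and the example numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def distribute_affected_traffic(
    demand: float,
    lsps: List[Tuple[str, float, float]],
) -> Dict[str, float]:
    """Split `demand` (e.g. in Mbps) over failure-free working LSPs.

    lsps holds (name, unit_cost, residual_bandwidth) tuples. For parallel
    paths with linear costs, sending traffic on the cheapest path until its
    residual bandwidth is exhausted yields a minimum cost flow, and it never
    exceeds any residual bandwidth, so the original traffic is unaffected.
    """
    remaining = demand
    allocation: Dict[str, float] = {}
    for name, _cost, residual in sorted(lsps, key=lambda lsp: lsp[1]):
        if remaining <= 0:
            break
        share = min(residual, remaining)
        if share > 0:
            allocation[name] = share
            remaining -= share
    if remaining > 0:
        raise ValueError("residual bandwidth cannot carry the affected traffic")
    return allocation


# Hypothetical example: 2 Mbps of affected traffic and two failure-free LSPs.
example = distribute_affected_traffic(
    2.0, [("LSP3", 5.0, 4.0), ("LSP2", 3.0, 1.5)]
)
```

In the example, the cheaper LSP2 is filled up to its 1.5 Mbps of residual bandwidth and the remaining 0.5 Mbps goes to LSP3. Each share would then correspond to the maximum bandwidth field of the routing entry added for that LSP.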

4.2. Simulation

We also extend the MPLS simulation module of the Network Simulator version 2 (ns2) (Network Simulator, 2009) to perform simulation experiments for the proposed approach and the previous approaches. In the simulation experiments, the network model follows Kar et al. (2003), as shown in Fig. 6. In Fig. 6, there are three working LSPs established as follows. LSP1 consists of LSR1, LSR8, and LSR17. LSP2 consists of LSR1, LSR4, LSR6, and LSR17. LSP3 consists of LSR1, LSR2, LSR5, LSR12, LSR16, and LSR17. In addition, three corresponding pre-established backup paths are also set up as follows. The backup path of LSP1 consists of LSR1, LSR4, LSR5, LSR7, LSR12, LSR16, and LSR17. The backup path of LSP2 consists of LSR1, LSR2, LSR5, LSR7, LSR9, LSR13, LSR15, LSR16, and LSR17. The backup path of LSP3 consists of LSR1, LSR8, LSR11, LSR14, and LSR17. The link capacity and delay in the network model are set to 20 Mbps and 1 ms, respectively. The traffic flow carried by each LSP is assumed to be 2 Mbps. The proposed approach and each of the previous approaches individually perform 4 different simulation runs. The execution time of each simulation run is 1000 s. In the 4 simulation runs, each failure occurrence is set to have 1, 2, 4, and 8 LSR failures, respectively. Note that there are 18 LSRs in the simulation network model. In a simulation run, the number of failure occurrences is randomly generated. In each failure occurrence, the locations of the failed LSRs are also randomly specified. With this random property, the simulation experiments consider various cases of LSR failures. Furthermore, in each failure occurrence, if the proposed approach or a previous approach achieves fault tolerance successfully, 256 packets are fed into the recovery path to observe the transmission delay of these 256 packets. Here, the transmission delay is used to represent the recovery quality. In the proposed approach, the measurement of the transmission delay also includes the communication and computing overhead due to performing the IP tunneling, buffering, unsent sequence number piggyback, and permission token passing. If the transmission delay of one approach is smaller than that of another, the former approach has better recovery quality than the latter. The simulation experiments measure the concerned metrics of Table 1 (the fault-tolerant capability, packet losses, packet disorder, and recovery time). In addition to the concerned metrics, the recovery quality is also measured. Figs. 7–11 illustrate the average results of the five metrics for the proposed approach and the previous approaches. From Figs. 7–11, we can clearly see that the proposed approach has the best performance in the fault-tolerant capability, packet losses, packet disorder, and recovery quality by comparison with all the previous approaches.

Table 1
The characteristics of the proposed approach and previous approaches (protection switching and rerouting based approaches).

Protection switching

| Metrics | Approach of Huang et al. (2002) | Approach of Haskin and Krishnan (2000) | Approach of Hundessa and Pascual (2001) | Approach of Ho and Mouftah (2004) | Proposed approach |
|---|---|---|---|---|---|
| Recovery method | Pre-establish backup paths; global recovery | Pre-establish backup paths; global recovery | Pre-establish backup paths; global recovery | Pre-establish backup paths; local recovery | Use failure-free working LSPs |
| Failure-free overhead | Pre-establish backup paths | Pre-establish backup paths | Pre-establish backup paths | Pre-establish backup paths | No |
| Fault-tolerant overhead: packet losses | Yes | No | No | Yes | No |
| Fault-tolerant overhead: packet disorder | No | Yes | No | No | No |
| Fault-tolerant overhead: recovery time | Activate the backup path | Activate the backup path | Activate the backup path | Activate the backup path | Calculate the distribution of the affected traffic |
| Fault-tolerant capability | Dependent on the status of the pre-established backup path | Dependent on the status of the pre-established backup path | Dependent on the status of the pre-established backup path | Dependent on the status of the pre-established backup path | Dependent on the number of failure-free LSPs |

Rerouting

| Metrics | Approach of Yoon et al. (2001) | Approach of Ahn et al. (2002) | Proposed approach |
|---|---|---|---|
| Recovery method | Establish backup paths on demand; local recovery | Establish backup paths on demand; local recovery | Use failure-free working LSPs |
| Failure-free overhead | Pre-calculate the route of a recovery path | No | No |
| Fault-tolerant overhead: packet losses | Yes | Yes | No |
| Fault-tolerant overhead: packet disorder | No | No | No |
| Fault-tolerant overhead: recovery time | Activate the backup path | Find and activate backup path | Calculate the distribution of the affected traffic |
| Fault-tolerant capability | Dependent on available redundant paths | Dependent on available redundant paths | Dependent on the number of failure-free LSPs |

Fig. 6. The simulation network model. (Figure legend: LSP 1: LSR 1-8-16; LSP 2: LSR 1-4-6-16; LSP 3: LSR 1-2-5-12-17-16.)

Fig. 7. Comparison of fault-tolerant capability. (a) Protection switching. (b) Rerouting.

Fig. 8. Comparison of packet losses. (a) Protection switching. (b) Rerouting.

In Fig. 7, the fault-tolerant capability is represented as the restoration ratio: (number of the failures recovered successfully) / (total number of the failure occurrences). From Fig. 7, we can see that the protection switching based approaches have a lower restoration ratio. Fig. 7 also shows that the fault-tolerant capability of the proposed approach is superior to that of the rerouting based approaches. In the proposed approach, all failure-free working LSPs can be utilized to organize a backup path set, so the proposed approach can tolerate many LSR nodes failing simultaneously. In the rerouting based approaches, the backup path of a failed LSP is found from the redundant paths in the MPLS backbone network. According to the simulations, when there are many LSR node failures in the network, the probability that a suitable redundant path exists is small. Fig. 8 shows that the rerouting based approaches have worse performance in packet losses, since they may take a long time to dynamically establish a backup path. Before the backup path establishment is completed, packets are still sent along the route of the failed LSP, and these packets are lost. Compared to the rerouting based approaches and some of the protection switching based approaches, the proposed approach and the approaches of Haskin and Krishnan (2000) and Hundessa and Pascual (2001) do not incur the packet loss problem, since they include packet recovery methods. Packet disorder can arise in the proposed approach and in the approaches of Haskin and Krishnan (2000) and Hundessa and Pascual (2001). However, the proposed approach and the approach of Hundessa and Pascual (2001) present schemes to solve the packet disorder problem, whereas the approach of Haskin and Krishnan (2000) does not. From Fig. 9, we see that packet disorder only appears in the approach of Haskin and Krishnan (2000). Fig. 10 shows that most of the protection switching based approaches (Huang et al., 2002; Haskin and Krishnan, 2000; Ho and Mouftah, 2004) have less recovery time, since they utilize pre-established backup paths to tolerate failures. If a working LSP fails, its carried traffic can be quickly switched to the corresponding pre-established backup path. However, if the processing time for handling the packet loss and disorder problems is taken into account, the protection switching based approaches are not always better than the proposed approach in recovery time. For example, in Fig. 10, the recovery time of the approach of Hundessa and Pascual (2001) is larger than that of the proposed approach.
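The difference in restoration ratio can be made concrete with a toy calculation. The sketch below is not the ns2 experiment and does not reproduce the reported figures. It uses the working and backup paths listed in the text of Section 4.2, ignores bandwidth, and assumes (hypothetically) that every failed working LSP counts as one failure, that protection switching recovers an LSP only if its own backup path is intact, and that redirection recovers it if at least one other working LSP is failure-free. All function names are illustrative.

```python
from typing import Callable, Iterable, List, Set

# Paths as listed in the text of Section 4.2 (LSR numbers).
WORKING = {
    "LSP1": [1, 8, 17],
    "LSP2": [1, 4, 6, 17],
    "LSP3": [1, 2, 5, 12, 16, 17],
}
BACKUP = {
    "LSP1": [1, 4, 5, 7, 12, 16, 17],
    "LSP2": [1, 2, 5, 7, 9, 13, 15, 16, 17],
    "LSP3": [1, 8, 11, 14, 17],
}


def is_up(path: List[int], failed: Set[int]) -> bool:
    """A path is usable only if none of its LSRs has failed."""
    return not any(lsr in failed for lsr in path)


def recovered_by_protection(lsp: str, failed: Set[int]) -> bool:
    # Assumption: only the pre-established backup path of this LSP is tried.
    return is_up(BACKUP[lsp], failed)


def recovered_by_redirection(lsp: str, failed: Set[int]) -> bool:
    # Assumption: any other failure-free working LSP can carry the traffic.
    return any(is_up(p, failed) for name, p in WORKING.items() if name != lsp)


def restoration_ratio(failure_sets: Iterable[Set[int]],
                      recovers: Callable[[str, Set[int]], bool]) -> float:
    """Recovered failures divided by total failures over all occurrences."""
    total = 0
    recovered = 0
    for failed in failure_sets:
        for lsp, path in WORKING.items():
            if not is_up(path, failed):
                total += 1
                recovered += recovers(lsp, failed)
    return recovered / total if total else 1.0


# Three hand-picked failure occurrences (sets of failed LSRs).
occurrences = [{8}, {4, 8}, {2, 4}]
protection_ratio = restoration_ratio(occurrences, recovered_by_protection)
redirection_ratio = restoration_ratio(occurrences, recovered_by_redirection)
```

For these three hand-picked occurrences, five working LSPs fail in total. Protection switching recovers three of them (ratio 0.6), because two backup paths share a failed LSR with the failure, while redirection recovers all five (ratio 1.0).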

Fig. 9. Comparison of packet disorder. (a) Protection switching. (b) Rerouting.

Fig. 10. Comparison of recovery time. (a) Protection switching. (b) Rerouting.

Fig. 11. Comparison of recovery quality. (a) Protection switching. (b) Rerouting.

The recovery quality is shown in terms of the transmission delay of 256 packets in the recovery path. From Fig. 11, we can see that the proposed approach has the best recovery quality. When pre-establishing a backup path, the protection switching based approaches only consider whether the backup path has enough bandwidth to carry the affected traffic. The transmission delay is ignored in the backup path pre-establishment. For the rerouting based approaches, if a working LSP fails, they dynamically find a shortest backup path for the failed LSP without considering whether the bandwidth of the backup path is enough to transmit the affected traffic. In contrast, the proposed approach considers both the bandwidth requirement and the transmission delay when redirecting the affected traffic to multiple failure-free working LSPs. Because multiple LSPs are used to carry the affected traffic, the recovery quality of the proposed approach is better than that of all the previous approaches. However, when there are many LSR node failures in a failure occurrence, few failure-free working LSPs can be used to carry the affected traffic. In such a case, the recovery quality of the proposed approach is nearly the same as that of the previous approaches.

5. Conclusions

This paper has presented an efficient approach for protecting the carried traffic of LSPs. When a node or link failure is detected in a working LSP, the proposed approach redirects the carried traffic of the failed LSP to other working LSPs. To minimize the influence on the existing failure-free LSPs, the problem of the affected traffic distribution is transformed to the problem of the minimum cost flow. Moreover, the proposed approach also addresses the packet loss and disorder problems using the FIS piggyback and permission token schemes, respectively. The simulation results show that the proposed approach can assist the protection switching and rerouting based approaches to enhance the fault-tolerant capability and recovery quality. In addition, the proposed approach can also solve the packet loss and disorder problems.

Acknowledgment

This research was supported by the National Science Council, Taiwan, ROC, under Grant NSC 98-2221-E-030-014.

References

Agarwal, A., Deshmukh, R., 2002. Ingress failure recovery mechanisms in MPLS networks. MILCOM 2002 Proceedings 2, 1150–1153.
Ahn, G., Jang, J., Chun, W., 2002. An efficient rerouting scheme for MPLS-based recovery and its performance evaluation. Telecommun. Syst. 19 (3), 481–495.
Andersson, L., Minei, I., Thomas, B. (Eds.), 2007. LDP specification. In: IETF RFC 5036, October 2007.
Applegate, David, Thorup, Mikkel, 2003. Load optimal MPLS routing with N + M labels. IEEE INFOCOM, 555–565.
Cavendish, D., Ohta, H., Rakotoranto, H., 2004. Operation, administration, and maintenance in MPLS networks. IEEE Commun. Mag. 42 (10), 91–99.
Cetinkaya, C., Knightly, E., 2004. Opportunistic traffic scheduling over multiple network paths. Proc. of IEEE INFOCOM 3 (7–11), 1928–1937.
Haskin, D., Krishnan, R., 2000. A method for setting an alternative label switched paths to handle fast reroute. IETF Internet Draft draft-haskin-mpls-fast-reroute-05.txt.
Ho, Pin-Han, Mouftah, H.T., 2004. Reconfiguration of spare capacity for MPLS-based recovery in the internet backbone networks. IEEE/ACM Trans. Networking 12 (1), 73–84.
Huang, C., Sharma, V., Owens, K., Makam, S., 2002. Building reliable MPLS networks using a path protection mechanism. IEEE Commun. Mag. 40 (3), 156–162.
Hundessa, L., Pascual, J.D., 2001. Fast rerouting mechanism for a protected label switched path. In: Proceedings of the 10th International Conference on Computer Communications and Networks, pp. 527–530.
Kar, K., Kodialam, M., Lakshman, T.V., 2003. Routing restorable bandwidth guaranteed connections using maximum 2-route flows. IEEE/ACM Trans. Networking 11 (5), 772–781.
Mathematica5, URL: (accessed May 2009).
UCB/LBNL/VINT Network Simulator Version 2, ns-2, URL: (accessed May 2009).
Sharma, V., Metanoia, Hellstrand, F., 2003. Framework for multi-protocol label switching (MPLS)-based recovery. IETF RFC 3469, February 2003.
Simpson, W., 1995. IP in IP tunneling. In: IETF RFC 1853, October 1995.
Sokkalingam, P.T., Ahuja, R.K., Orlin, J.B., 2000. New polynomial-time cycle-canceling algorithms for minimum cost flows. Networks 36, 53–63.
Vasseur, J.P., Pickavet, M., Demeester, P., 2004. Network Recovery: Protection and Restoration of Optical, SONET-SDH, IP, and MPLS. Morgan Kaufmann Publishers (Elsevier).
Yoon, S., Lee, H., Choi, D., Kim, Y., Lee, G., Lee, M., 2001. An efficient recovery mechanism for MPLS-based protection LSP. In: Joint 4th IEEE International Conference on ATM (ICATM 2001) and High Speed Intelligent Internet Symposium, April 2001, pp. 75–79.

Jenn-Wei Lin received the M.S. degree in computer and information science from National Chiao Tung University, Hsinchu, Taiwan, in 1993, and the Ph.D. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 1999. He is currently an Associate Professor in the Department of Computer Science and Information Engineering, Fu Jen Catholic University, Taiwan. He was a researcher at Chunghwa Telecom Co., Ltd., Taoyuan, Taiwan from 1993 to 2001. His current research interests are fault-tolerant computing, mobile computing and networks, distributed systems, and broadband networks.

Huang-Yu Liu received the M.S. degree in computer and information science from Fu Jen Catholic University, Taiwan, in 2003. His research interests include broadband networks and fault-tolerant computing.