The Journal of China Universities of Posts and Telecommunications August 2009, 16(4): 23–28 www.sciencedirect.com/science/journal/10058885
www.buptjournal.cn/xben
Fault management: analysis of fault location algorithm in optical network
ZHENG Yan-lei, HUANG Shan-guo, ZHANG Xian, GU Wan-yi
School of Telecommunication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract
This article proposes a new fault location mechanism for optical networks. In this mechanism, a network alarm packet format with a time-stamp is introduced to enable fast restoration. Existing fault location schemes are usually complex and impractical when applied to the multifailure location problem. For multifailures, the proposed mechanism uses time-stamps to locate faults more efficiently and to decrease computational complexity.
Keywords fault location, alarming packet, time-stamp, optical network
1 Introduction
By 2000, wavelength division multiplexing (WDM) technology, which greatly increases the transmission capacity of a single fiber, had been widely deployed in backbone optical networks. One optical cable contains 8–64 fibers, and the number of fibers in transcontinental cables that use a ribbon structure can reach several hundred. If four or more optical carrier signals are used in one fiber, the fiber can carry data at rates on the order of Tbit/s. Moreover, inexpensive devices such as semiconductor optical amplifiers and wavelength converters have facilitated the rapid development of WDM technology. Because WDM networks carry so much data, a single failure may cause a large amount of data loss, and how to guarantee network survivability has received much attention [1]. Survivability is the ability to protect and restore data in an optical network after node(s) and/or link(s) fail. Many protection/restoration mechanisms have been proposed, but their realization depends in part on the triggering of fault alarms and the corresponding detection of the fault location [2–3]. If a failure occurs in the network without being detected, or cannot be localized exactly from the alarm packets, the network cannot start its normal protection/restoration mechanism. Hence, fault detection and location has a direct impact on network survivability [4].
Fault location in optical networks is difficult because, when a network failure (e.g. an optical fiber break) happens, all related nodes and detection points report alarms; a single failure can therefore trigger a large number of alarms [5]. When many alarms reach the manager through the management communication network, it is difficult to locate multiple failures exactly without troublesome manual checking and measuring [6].
In this article, the authors propose a fault localization model to solve the above problems. In this model, the central management system judges the fault location, whereas the distributed network elements report the alarm information. The manager can detect the fault location(s) using the network topology and alarm packet analysis.
In the following sections, the authors first explain the basic concepts and properties used in the proposed model (Sect. 2), and then introduce the fault management system and analyze the alarm packet format used between the manager and the elements in a distributed alarm system (Sect. 3). By analyzing the fault alarm phenomena, Sects. 4 and 5 formalize the fault location algorithm mathematically. In Sect. 6, a concrete network scenario illustrates the whole fault location process. Finally, conclusions are presented in Sect. 7.
Received date: 02-07-2008
Corresponding author: ZHENG Yan-lei, E-mail: [email protected]
DOI: 10.1016/S1005-8885(08)60243-5
2 The model of fault location algorithm
A network is usually depicted as a set of nodes connected to each other by links. By analyzing the alarm packets reported by the network elements, the network manager can locate multifailures [5–7]. This section defines the terms used in the fault location management model.
2.1 Network components
In a WDM network, typical network components include the transmitter, receiver, optical add-drop multiplexer, optical cross connect (OXC), etc. According to ITU-T Rec. G.7710, network components can be classified into two categories for fault management: active components and passive components. An example of a passive component is optical fiber [8]. Active components need electrical power to work. This category includes many components, such as the transceiver, OXC, (de)multiplexer, etc. As managed objects, only active components can send alarm information to the manager directly.
2.2 Alarm type and failure source points
Alarms are messages sent to the manager by network elements to report an abnormal condition (e.g. a component value that is out of range or missing). Passive components cannot give any information to the manager. For active components, there are two kinds of alarms: self-alarm and out-alarm.
1) Optical fiber break is one of the most frequently occurring failures in optical networks [3]. Although a fiber does not send alarms when broken, the active components after it are triggered to report alarms up to the end of the channel. According to the classification of alarm types [8], these alarms are defined as out-alarms, because the components send them after a failure occurs elsewhere.
2) Node failure is viewed as the second most frequent failure. For example, physical distortion caused by hardware aging of the input and output ports of an OXC results in alarms for degradation in the signal quality of all the channels. In addition, failures caused by abnormal equipment values, such as temperature and optical signal power, also trigger alarm reports. These kinds of alarms are classified as self-alarms.
In terms of severity, alarms can be divided into normal alarms and severe alarms. M1 and M2 are used below to identify them. 'M1' is the normal working interval: the parameter reports a normal alarm when its value deviates from this interval, but the component still works. 'M2' is the working interval: the parameter reports a severe alarm when its value deviates from this interval, and the component then stops working and tries to start backup equipment.
2.3 Downstream and upstream
For a given node $S_n$ in one channel, following the data transmission direction, the nodes before $S_n$ are called upstream nodes, whereas the nodes after it are called downstream nodes.
2.4 Grouping and unidirectional channel
The failure of one network component may affect its downstream nodes up to the end of the channel, or up to a node that terminates its effect [9]. A group is defined as the set of active components that send alarms because of a failure in one channel. It follows that the components in a group are out-alarm components. The WDM channels defined in this article are unidirectional; a channel is denoted by $C_i$ and expressed as

$$C_i = \{ S_l \mid l \in \text{number} \} \tag{1}$$

The group is an ordered set determined by the alarm packets [9]:

$$G = \{ P_1, P_2, \ldots, P_n \} \tag{2}$$

If one chooses an active component and denotes its position as $P$, then the first one in the group is called $P_1$, and the rest follow by analogy. $P$ differs from the alarm origin $S_l$: the former is determined by the concrete alarms and groups, whereas the latter is a unique identifier of the network component. The position $P$ satisfies

$$P_i < P_j,\quad P_i \cap P_j = \varnothing;\qquad 1 \le i < j \in N,\quad N = (1, 2, \ldots, n) \tag{3}$$
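The distinction between the identifier $S_l$ and the position $P$ can be illustrated with a minimal Python sketch. It assumes that positions are numbered from 1 along the transmission direction and that only alarming nodes receive a position; the function name `positions` is hypothetical, and the node sequences are borrowed from the connections view of Table 2.

```python
def positions(channel, alarmed):
    """Map alarming node identifiers S_l to positions P_1..P_n.

    channel: node identifiers in transmission order (upstream first).
    alarmed: set of node identifiers that reported an out-alarm.
    Assumption: only alarming nodes belong to the group, numbered from 1.
    """
    ordered = [node for node in channel if node in alarmed]
    return {node: index for index, node in enumerate(ordered, start=1)}


# Node sequences of two unidirectional channels (taken from Table 2).
c1 = ["S1", "S3", "S6", "S8"]
c2 = ["S1", "S2", "S4", "S6", "S8"]

alarmed = {"S1", "S2", "S3", "S4", "S6", "S8"}
group_c1 = positions(c1, alarmed)
group_c2 = positions(c2, alarmed)

# The same component S6 keeps its identifier but has a different
# position in each group: P3 in the group of C1, P4 in the group of C2.
print(group_c1["S6"], group_c2["S6"])
```

The identifier stays fixed for the lifetime of the component, whereas the position is recomputed for every group of alarms.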
2.5 Time-stamp
The time-stamp adopted in this article is an important parameter for localizing failures. Alarm packets are sent to the network management center, but their order of arrival is not a reliable guide, because an alarm reported by a downstream node may reach the management system earlier than one reported by an upstream node. Therefore, the time when the failure happens is considered rather than the time when the alarm packet reaches the network management system. When the network management system receives the alarm packets, it chooses one benchmark time and extracts the failure time from every alarm packet. A smaller increment $\Delta t$ indicates that the failure reported by that alarm packet occurred earlier.
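The role of the benchmark time can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each packet is reduced to a hypothetical `(source, failure_time)` pair, times are plain numbers in arbitrary units, and the increment is computed as $\Delta t_i = t_i - t_0$, the definition used later in Step 4.

```python
def failure_order(packets, t0):
    """Order alarm sources by failure time instead of arrival order.

    packets: list of (source, failure_time) pairs in ARRIVAL order.
    t0: benchmark time chosen by the management system.
    Returns (source, delta_t) pairs sorted by increasing delta_t.
    """
    deltas = [(source, t - t0) for source, t in packets]
    return sorted(deltas, key=lambda item: item[1])


# Hypothetical arrival order: the downstream alarm of S6 arrives first,
# although its failure time is the latest of the three.
arrived = [("S6", 105), ("S8", 101), ("S1", 102)]
ordered = failure_order(arrived, t0=100)
print(ordered)  # [('S8', 1), ('S1', 2), ('S6', 5)]
```

Sorting on the increment rather than on the arrival order means that delays in the management communication network do not change the inferred sequence of events.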
3 Analysis of alarm packet in management
ITU-T M.3010 specifies the direction of network management toward standardization, simplification, and intelligence. The telecommunication management network (TMN) is one of the three supporting networks. Its operation, management, and maintenance are critical in providing QoS-guaranteed service to users. To access and control remote managed objects, the management system needs communication protocols, such as the common management information protocol and the simple network management protocol (SNMP) [6]. The system manager and the managed equipment communicate with each other using a management information communication protocol to fulfill fault management. As the SNMP specifications require, the managed objects spontaneously send 'traps' to the manager system center. The alarm packets carry the properties used for fault location. The large number of alarm packets arriving at the manager system center are queued to be classified and identified with manual intervention. This makes fault location more complex and more time-consuming, and it cannot avoid mistakes in judging the location of the faults. Hence, the authors design a packet format used by the components in the communication protocols, as shown in Table 1. The alarm packet contains the failure-related messages that the manager uses to localize the failures.

Table 1 Alarm packet format
Alarming section | Alarming abbreviator | Variable ID
Alarm ID | Alm_id | D
Alarm type | Alm_Type | Y
Alarm source | Alm_S | S
Alarm service | Alm_Ser | Ser
Alarm code | Alm_C | Y-id
Alarm time | Alm_T | T
Alarm notion | Alm_Nos | C
Alarm packets are triggered by components. The packet data unit (PDU) is composed of the parameters shown in Table 1. Each parameter is explained as follows:
1) Alm_id: represents the local packet counting function. The number of alarm packets is stored in a register to evaluate the network performance.
2) Alm_Type: represents the alarm type. G.7710 classifies alarm severity into critical alarm, major alarm, minor alarm, and warning. However, this classification is not applicable to fault localization. In this article, the authors classify alarms into severe alarms and normal alarms with regard to the signal level. There is an alarm list database for each active component. The Alm_Type is derived from this list before the alarm is reported to the manager.
3) Alm_S: points out the origin of the alarm, which is unique in the whole network. This reference is usually denoted by an IP address.
4) Alm_Ser: denotes the different channels in WDM equipment. The optical nodes do not know the wavelength of the specific service. Thus, the Alm_Ser section is marked with the wavelength before the packet is sent to the manager.
5) Alm_C: distinguishes different types of alarms on the same wavelength. If the alarm is a self-alarm, it represents the alarm-causing item, such as temperature or power.
6) Alm_T: provides the time when the monitoring element reports alarms to the manager. The value of this section is derived from the system time.
7) Alm_Nos: explanation section.
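As an illustration, the packet of Table 1 and the M1/M2 severity rule of Sect. 2.2 might be represented as follows. This is a minimal sketch, not the authors' implementation: the Python field types, the `Severity` enum, the `classify` helper, the assumption that M1 lies inside M2, and the numeric power thresholds are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Severity(Enum):
    NORMAL = "normal"
    SEVERE = "severe"


@dataclass(frozen=True)
class AlarmPacket:
    alm_id: int          # local packet counter
    alm_type: Severity   # severe or normal alarm
    alm_s: str           # alarm source, unique in the network (e.g. an IP address)
    alm_ser: str         # service / wavelength channel
    alm_c: str           # alarm code, e.g. the alarm-causing item
    alm_t: float         # time-stamp derived from system time
    alm_nos: str = ""    # free-text explanation


def classify(value: float,
             m1: Tuple[float, float],
             m2: Tuple[float, float]) -> Optional[Severity]:
    """Return the alarm severity for a monitored value, or None if no alarm.

    m1: normal working interval (low, high); leaving it raises a normal alarm.
    m2: working interval (low, high); leaving it raises a severe alarm.
    Assumption: m1 lies inside m2.
    """
    if not (m2[0] <= value <= m2[1]):
        return Severity.SEVERE
    if not (m1[0] <= value <= m1[1]):
        return Severity.NORMAL
    return None


# Hypothetical received optical power thresholds in dBm.
M1 = (-28.0, -8.0)
M2 = (-30.0, 0.0)
severity = classify(-29.0, M1, M2)
packet = AlarmPacket(alm_id=1, alm_type=severity, alm_s="S8",
                     alm_ser="C2", alm_c="power", alm_t=101.0)
print(packet.alm_type)  # Severity.NORMAL
```

Keeping the severity decision in the component, before the packet is sent, mirrors the description of Alm_Type being derived from the component's own alarm list.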
4 Alarm phenomenon analysis
In this section, several failure phenomena are analyzed, with an emphasis on double failures for different sorts of faults. This lays the foundation for the later algorithm.
4.1 Single failure
When a single failure happens, whether it raises a self-alarm or an out-alarm, and whether the alarm is severe or normal, the fault can be located by analyzing Alm_Type, Alm_S, and Alm_T in the alarm packet.
4.2 Double failures
4.2.1 Severe alarm
1) Scenario 1: see Fig. 1(a)
Precondition: failure occurs at point A earlier than at point B.
Conclusion: if the time-stamp is adopted, components E1, E2, and E3 report alarms simultaneously; after point B fails, components E2 and E3 do not report alarms anymore.
2) Scenario 2: see Fig. 1(b)
Precondition: failure occurs at point A earlier than at point B.
Conclusion: when point A fails, components E2 and E3 report severe alarms; then, if point B fails, E1 reports severe alarms, but E2 and E3 do not report any alarms because they have already been shut down.
4.2.2 Normal alarm
1) Scenario 1: see Fig. 1(c)
Precondition: points A and B fail consecutively.
Conclusion: components E1, E2, and E3 report normal alarms. When point B fails:
a) Within the M1 interval, E2 and E3 send the same alarms with no change in the packet.
b) On deviation from M2, E2 and E3 report severe alarms.
In either case, the failure at point B can be located only after point A has been repaired.
2) Scenario 2: see Fig. 1(d)
Precondition: when point A fails, E2 and E3 send alarms; then point B fails and E1 gives a normal alarm:
a) Within the M1 interval, E2 and E3 do not send the same alarms with no change in the packet.
b) On deviation from M2, E2 and E3 send severe alarms.
Conclusion: the two failures can be located at the same time.
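The two severe-alarm scenarios of Sect. 4.2.1 can be replayed with a small Python sketch. It assumes a linear chain E1..En, that a break directly upstream of a component makes that component and everything downstream of it report, and that a component that has reported a severe alarm shuts down and stays silent. The positions of points A and B relative to E1–E3 are inferred from the scenario descriptions, since they are only given in Fig. 1; the function name and the cut-position encoding are hypothetical.

```python
def simulate_severe(n, cuts):
    """Replay severe-alarm reporting on a chain E1..En.

    n: number of active components in the channel.
    cuts: cut positions in the order the failures occur; a cut at k means
          the break lies directly upstream of component Ek (1 <= k <= n).
    Returns, for each failure, the list of components that report.
    """
    shut_down = set()
    reports = []
    for cut in cuts:
        # Everything downstream of the break reports, unless already silent.
        affected = [e for e in range(cut, n + 1) if e not in shut_down]
        reports.append([f"E{e}" for e in affected])
        shut_down.update(affected)
    return reports


# Scenario 1 (assumed layout): A upstream of E1 fails first, then B between E1 and E2.
scenario_1 = simulate_severe(3, cuts=[1, 2])
# Scenario 2 (assumed layout): A between E1 and E2 fails first, then B upstream of E1.
scenario_2 = simulate_severe(3, cuts=[2, 1])

print(scenario_1)  # [['E1', 'E2', 'E3'], []]
print(scenario_2)  # [['E2', 'E3'], ['E1']]
```

In the first layout the second failure is masked entirely, whereas in the second layout it still produces a new alarm from E1, which is why the order and position of the failures matter for localization.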
Fig. 1 Fault phenomena in (a), (b), (c) and (d)
Note: Point A raises an alarm first, and then point B raises an alarm. Severe alarms occur in both (a) and (b); normal alarms occur in both (c) and (d).

4.3 Multiple failures
The analysis of this kind of failure is similar to that of double failures. It is possible to determine the fault location according to the kinds of alarms from the components.

5 Algorithm descriptions
In this section, a fault localization algorithm is proposed to locate network component failure(s) according to the received alarm packets. The proposed algorithm is based on two assumptions. One is that the probability of a single failure is larger than that of a multifailure; the other is that the reliability of the management communication network (MCN) is guaranteed. That is, alarm packets that are lost or received out of sequence on their way to the manager are not considered [6,10].
It is unrealistic for the manager to deal with each alarm packet at the moment it arrives, especially when a large number of alarms arrive simultaneously. How to respond to the service promptly is a problem in queuing theory, which is not considered here for simplicity. Moreover, multifailure localization is an NP-complete problem in fault localization analysis. The proposed algorithm uses time-stamps, and the manager center collects all the alarms during a reasonable period $T_i$. The input, output, and steps of the algorithm are given as follows:
Input: network topology and the list of alarm packets.
Output: a set of diagnosed failed network components.
Steps of the fault location algorithm are described as follows:
Step 1 Self-alarm components can be easily picked out using the Alm_S section of the received alarm packets. Thus, the proposed algorithm emphasizes locating the failed component(s) behind the out-alarm packets reported to the manager, for which the relations between the packets are uncertain.
Step 2 The network manager has many views, and the network connection view is one of them. Every connection, denoted by $C_i$, is a collection of network components.

$$\left.\begin{aligned} &C_1 = \{S_i \mid i \in N\},\ C_2 = \{S_j \mid j \in N\},\ \ldots,\ C_n = \{S_k \mid k \in N\} \\ &C_1 \cap C_2 \cap \cdots \cap C_n \neq \varnothing \end{aligned}\right\} \tag{4}$$

Step 3 When the alarms $A_i$ are received during the period $T_i$, $G_i$ is calculated by examining their Alm_S and Alm_Ser sections. Node position $S_l$ is mapped into $P_j$ in the group.
In our algorithm, the concept of $G_i$ is summarized as follows:

$$G_i = \{ P_1, P_2, \ldots, P_n \mid C_i = \text{const} \} \tag{5}$$
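Steps 1 to 3 can be sketched as a small grouping routine. This is an illustrative sketch only: packets are reduced to plain dictionaries, the `self_alarm` flag is a hypothetical stand-in for however self-alarms are recognized in practice, and the connections are the node sequences of Table 2.

```python
from collections import defaultdict

# Network connection view (node sequences from Table 2), upstream first.
CONNECTIONS = {
    "C1": ["S1", "S3", "S6", "S8"],
    "C2": ["S1", "S2", "S4", "S6", "S8"],
    "C3": ["S2", "S5", "S7"],
}


def build_groups(packets, connections):
    """Group out-alarm packets per service and map sources to positions.

    packets: iterable of dicts with keys 'alm_s', 'alm_ser', 'alm_t'
             and a hypothetical boolean 'self_alarm'.
    Returns {service: [(position, source, time), ...]} ordered upstream first.
    """
    times_by_service = defaultdict(dict)
    for packet in packets:
        if packet["self_alarm"]:
            continue  # Step 1: self-alarms are handled separately
        times_by_service[packet["alm_ser"]][packet["alm_s"]] = packet["alm_t"]

    groups = {}
    for service, times in times_by_service.items():
        # Steps 2-3: order the alarming sources along the connection C_i
        ordered = [node for node in connections[service] if node in times]
        groups[service] = [(pos, node, times[node])
                           for pos, node in enumerate(ordered, start=1)]
    return groups


packets = [
    {"alm_s": "S6", "alm_ser": "C2", "alm_t": 15, "self_alarm": False},
    {"alm_s": "S8", "alm_ser": "C2", "alm_t": 10, "self_alarm": False},
    {"alm_s": "S1", "alm_ser": "C2", "alm_t": 12, "self_alarm": False},
    {"alm_s": "S5", "alm_ser": "C3", "alm_t": 11, "self_alarm": True},
]
groups = build_groups(packets, CONNECTIONS)
print(groups["C2"])  # [(1, 'S1', 12), (2, 'S6', 15), (3, 'S8', 10)]
```

Because the group is built from the topology view rather than from the arrival order, the positions stay meaningful even when packets reach the manager out of sequence.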
Step 4 Based on the assumptions, first check whether all the alarms are caused by one failure, and then classify the alarms into severe or normal alarms.
1) If the packet is a severe alarm, its time-stamp is checked. Let $\Delta t_i = t_i - t_0$. If

$$\Delta t_1 < \Delta t_2 < \cdots < \Delta t_n \tag{6}$$

one can conclude that the severe alarms are caused by the upstream node beside $P_1$. By examining Alm_S in alarm packet $P_1$ and then mapping it into $S_l$, the packet analyzer can locate the failed component directly. Otherwise, the process goes to the multifailure localization step F1 (F1 denotes the process that deals with multifailures).
2) If it is a normal alarm, the same judgment process is applied. If the time-stamps satisfy Eq. (6), the exact fault location is found. Otherwise, the process goes to F1.
3) The possibility of F1 is larger than others.
Step 5 The process of F1 is as follows.
1) If it is a severe alarm, the out-alarm nodes should be categorized. The alarms within the group can be classified according to the channels as $Sg_1, Sg_2, \ldots$. If

$$P(Sg_1) \neq P(Sg_2) \neq \cdots \neq P(Sg_n) \tag{7}$$

there must be two or more failures in the channel. For example, two failures occur in Fig. 2(a). If the time-stamps meet the following requirements:

$$\left.\begin{aligned} &\Delta t_3 < \Delta t_4 < \Delta t_5 \\ &\Delta t_1 > \Delta t_3 \\ &\Delta t_1 < \Delta t_2 \end{aligned}\right\} \tag{8}$$

it can be concluded that there is one failed component between $P_3$ and $P_2$, and the second one is before $P_1$. More general requirements (applicable to double failures) are listed as follows:

$$\left.\begin{aligned} &\Delta t_i < \Delta t_{i+1} < \cdots < \Delta t_n \\ &\Delta t_1 > \Delta t_i \\ &\Delta t_1 < \cdots < \Delta t_{i-1} \end{aligned}\right\} \tag{9}$$

If these requirements are not met, one should find the node whose Alm_T gives $\min\{\Delta t_1, \Delta t_2, \ldots, \Delta t_n\}$; there must be a failure before that node. Then the process goes to Step 6 and finishes the discrimination.
2) If it is a normal alarm, the process is similar to that for severe alarms. Here, an example of double failures is given (see Fig. 2(b)). If the time-stamps meet:

$$\left.\begin{aligned} &\Delta t_3 < \Delta t_4 < \Delta t_5 \\ &\Delta t_1 > \Delta t_3 \\ &\Delta t_1 < \Delta t_2 \end{aligned}\right\} \tag{10}$$

then $P_3$, $P_4$, and $P_5$ will probably send normal alarms. It can thus be concluded that there is one failed component between $P_3$ and $P_2$, and the other failure is before $P_1$. The more general circumstance is the same as that for severe multifailures.

Fig. 2 Double faults detection during the process of the algorithm: (a) severe alarm scene; (b) normal alarm scene

Step 6 After the packet analyzer locates the failed components, the protection/restoration process is triggered, automatically or by network operators. If alarms still exist after that, the algorithm is executed again. The algorithm structure is shown in Figs. 3 and 4. Fig. 4 does not show the differences between severe and normal alarms, because they do not affect the failure localization.

Fig. 3 The procedure of algorithm abstraction
Fig. 4 The analysis process of alarm packets
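One possible reading of Steps 4 and 5 is sketched below in Python. It rests on an assumption taken from the notes to Table 4, namely that the time-stamps of alarms caused by the same failure increase strictly in the downstream direction, so that a drop in $\Delta t$ between neighbouring positions marks the start of a new subgroup. The function names and the numeric increments are hypothetical, and the sketch is not claimed to be the authors' exact procedure.

```python
def split_subgroups(deltas):
    """Split positions P1..Pn into subgroups of strictly increasing delta_t.

    deltas: list where deltas[k] is the increment of position P(k+1).
    Returns a list of subgroups, each a list of 1-based positions.
    """
    subgroups = []
    for pos, dt in enumerate(deltas, start=1):
        if subgroups and dt > deltas[pos - 2]:
            subgroups[-1].append(pos)   # same failure as the upstream neighbour
        else:
            subgroups.append([pos])     # a new failure lies just upstream
    return subgroups


def suspected_failures(deltas):
    """Return the positions that have a failure directly upstream of them,
    ordered by failure time (smallest delta_t first)."""
    heads = [subgroup[0] for subgroup in split_subgroups(deltas)]
    return sorted(heads, key=lambda pos: deltas[pos - 1])


# Single failure: increments rise monotonically, so the fault is before P1.
print(suspected_failures([1, 2, 3]))            # [1]

# Hypothetical increments shaped like the example of Sect. 6:
# P1..P4 rise together, P5 alarmed earlier than all of them.
deltas = [12, 13, 14, 15, 10]
print(split_subgroups(deltas))                  # [[1, 2, 3, 4], [5]]
print(suspected_failures(deltas))               # [5, 1]
```

The first entry of the result corresponds to the node with the minimum increment, which matches the fallback rule of Step 5: there must be a failure directly before that node, and it is the one to repair first.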
6 Design and example
As shown in Fig. 5, there are eight units of network equipment in the management domain. After the connections have been set up, three unidirectional service connections are added to the network management topology view (see Table 2).

Fig. 5 An example of multifailures in the networks

Table 2 Connections view
Service No. | Type | Node sequence
C1 | Point line | $S_1 \to S_3 \to S_6 \to S_8$
C2 | Dotted line | $S_1 \to S_2 \to S_4 \to S_6 \to S_8$
C3 | Line | $S_2 \to S_5 \to S_7$

Suppose that the optical fibers break at three places (A, B, and C). The corresponding network equipment will send severe alarms, as discussed in the above analysis (see Table 3).

Table 3 Sequence of the alarm
Failure point | Sequences of nodes reporting alarms
A | $S_8$
B | $S_1 \to S_2 \to S_4 \to S_6$
C | None

Now, the packet analyzer checks the PDU of each alarm reported by the nodes (see Table 4).

Table 4 Alarming packets
Node | Type | Nodes | Service | Time-stamp | Index
S8 | Outside alarm | S8 | C1 | $t_1$ | 1
S8 | Outside alarm | S8 | C2 | $t_1$ | 2
S1 | Outside alarm | S1 | C2 | $t_1 + \Delta t$ | 1
S2 | Outside alarm | S2 | C2 | $t_1 + \Delta t'$ | 1
S4 | Outside alarm | S4 | C2 | $t_1 + \Delta t''$ | 1
S6 | Outside alarm | S6 | C2 | $t_1 + \Delta t'''$ | 1
Notes: The beginning of the time is $t_0$, and $\Delta t''' > \Delta t'' > \Delta t' > 0$; $t_1 > t_0$

According to the topology view and the channels where the alarms are reported, the network manager categorizes $G_1 = \{S_1 \to S_2 \to S_4 \to S_6 \to S_8\}$ as a group $G_1$, and maps all the nodes into the group:

$$P(S_1) \to P_1;\quad P(S_2) \to P_2;\quad P(S_4) \to P_3;\quad P(S_6) \to P_4;\quad P(S_8) \to P_5$$

According to the time-stamps, the single failure diagnosis begins; the elements in $G_1$ are divided into two subgroups:

$$Sg_1 = \{P_1, P_2, P_3, P_4\};\quad Sg_2 = \{P_5\}$$

If $\Delta t_5 < \Delta t_1$, it can be concluded that there must be one fault between $S_8$ and $S_6$, and the protection/restoration scheme is triggered to repair point A. Only after the fault at point A has been repaired successfully can the faults at points B and C be located and repaired.

7 Conclusions
The mechanism proposed in this article can locate network faults precisely. Its advantage comes from the structure of the alarm packet, which uses a time-stamp. With this mechanism, the manager needs less time to reach a judgment, which speeds up recovery when multiple failures must be located. Hence, the proposed scheme can be a good solution to fault location in optical networks.

Acknowledgements
This work was supported by the National Natural Science Foundation of China (60877052, 60702005), PCSIRT (IRT0609), and the Program of Introducing Talents of Discipline to Universities (b07005).

References
1. Mas C, Thiran P, Le Boudec J Y. Fault location at the WDM layer. Photonic Network Communications, 1999, 1(3): 235–255
2. Gardner R D, Harle D A. Alarm correlation and network fault resolution using Kohonen self-organising map. Proceedings of IEEE Global Telecommunications Conference (GLOBECOM'97), Vol 3, Nov 3–8, 1997, Phoenix, AZ, USA. Piscataway, NJ, USA: IEEE, 1997: 1398–1402
3. Stern T E, Bala K. Multiwavelength optical networks: a layered approach. Reading, MA, USA: Addison Wesley, 1999
4. Habel R, Roberts K, Solheim A, et al. Optical domain performance monitoring. Proceedings of 2000 Optical Fiber Communications Conference (OFC'00): Vol 2, Mar 7–10, 2000, Baltimore, MD, USA. Piscataway, NJ, USA: IEEE, 2000: 174–175
5. Bouloutas A T, Calo S, Finkel A. Alarm correlation and fault identification in communication networks. IEEE Transactions on Communications, 1994, 42(2/3/4): 523–533
6. ITU-T Rec M.3010. Principles for a telecommunications management network. 2000
7. Mas C, Thiran P. An efficient algorithm for locating soft and hard failures in WDM networks. IEEE Journal on Selected Areas in Communications, 2000, 18(10): 1900–1911
8. ITU-T Rec G.7710. Common equipment management function requirements. 2001
9. Deng R H, Lazar A A, Wang W. A probabilistic approach to fault diagnosis in linear lightwave networks. IEEE Journal on Selected Areas in Communications, 1993, 11(9): 1438–1448
10. ITU-T Rec G.784. Synchronous digital hierarchy (SDH) management. 1999

(Editor: ZHANG Ying)