Reliability Engineering and System Safety 80 (2003) 33–40 www.elsevier.com/locate/ress
A reliability centered approach to remote condition monitoring. A railway points case study
Fausto Pedro García Márquez a,*, Felix Schmid b, Javier Conde Collado a
a ETSICCP, Universidad de Castilla-La Mancha, Ciudad Real, Spain
b Department of Mechanical Engineering, University of Sheffield, Sheffield, UK
Received 1 September 2001; accepted 15 August 2002
Abstract
Railway turnouts, consisting of switches and a crossing, are complex electro-mechanical devices which are exposed to severe environmental influences and which are essential for the operation of any railway bar horizontal lifts. Their safe and reliable operation must be assured if the rail mode of transport is to flourish. Conventionally, the continuous availability of turnout mechanisms has been assured by high levels of routine maintenance, to some extent tailored to the criticality of a particular point location. However, traffic increases and shortened maintenance windows require better approaches to turnout maintenance. The authors of the present paper undertook the development of algorithms to detect gradual failure in railway turnouts, which should allow a move to an RCM2 approach to the management of switch and crossing maintenance. They demonstrate the approach using data from tests on a commonly found point mechanism and include a discussion of the benefits of adopting a Kalman Filter for pre-processing the data collected during tests. © 2003 Elsevier Science Ltd. All rights reserved.
Keywords: Remote condition monitoring; Reliability centred maintenance; Kalman Filter; Point mechanism; Failure mode and effect analysis
1. Introduction
The safety of staff, customers and the general public is generally viewed as one of the most important requirements in industry and is of particular importance in the railway industry, where passengers rightly expect very high standards of care. In Britain the Railway Regulations were introduced in 1994 and a new safety culture was established. This was a necessary part of the privatisation process for British Railways (BR). It was also a consequence of the realisation that, from 1989 to 1994 alone, 825 members of the general public had been killed on BR. Most of these died as a result of trespass, but substantial numbers died by falling from trains [7], due to design faults and poor maintenance. Any accidents on the railways, though, are of serious concern to society, particularly so since the accidents at Southall, Ladbroke Grove, Hatfield and, most recently, Potters Bar in Britain, at Eschede and Brühl in Germany, and in Norway.
* Corresponding author. E-mail address: [email protected] (F.P. García Márquez).
Ever since its inception, though, the railway industry has searched for ways to improve the safety and the reliability of
the railway subsystems. Provision of a reliable infrastructure plays a very important role in achieving a safe system. The authors of the present paper discuss the development of an approach to the application of remote condition monitoring to the reliability centred maintenance of railway turnouts. Section 2 provides the background to the issues, Sections 3 and 4 describe the test system used, and Section 5 outlines the criteria. A Kalman Filter approach to the preprocessing of data is given in Section 6, while the results and conclusions form Sections 7 and 8, respectively.
0951-8320/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved. PII: S0951-8320(02)00166-7
2. Background
2.1. Railway turnouts
As part of a guided transportation system, a train can move from one track to another only in certain places, that is, where an appropriate mechanical device has been installed. This device is known as a turnout.1
1 Some authors refer to turnouts as points, that is, the parts of the turnout which move to allow trains to change track or to enter a siding.
The turnout has moving parts which are called switches (US: blades) and which steer the trains in one of two directions, normal
(straight through) or reverse. The switches move from the normal to the reverse position or from reverse to normal. Turnouts are perhaps the most important infrastructure elements of the railway system and affect its safety greatly. The Potters Bar accident of 10 May 2002 in England was caused by a faulty turnout, while the consequences of the Eschede accident in Germany were aggravated by a point moving underneath the train. Some 55% of railway infrastructure component failures on high speed lines are due to signalling equipment and turnouts [10], where ‘signalling equipment’ covers signals, track circuits, interlockings, automatic train protection (ATP) or LZB (track loop-based ATP), and the traffic control centre.
The annual cost of maintaining points is also high compared to other infrastructure elements: about UKP (United Kingdom Pound) 3.4 million per year for about 1000 km of railway. TC-TCR track circuits, for example, cost UKP 2.1 million per year for the same area. Of the points expenditure, UKP 1.2 million is for clamp lock (hydraulic) turnouts and UKP 1.4 million for electrically operated turnouts (data provided by a British asset manager).
Turnouts can also be used to implement flank protection for a train route allocated to another train. This is achieved by positioning the blades of the turnout in such a way that a train driving through the turnout is not directed into a track segment belonging to the route of the other train. The two safe positions of the moving parts of a turnout, normal and reverse, are generally detected using switches operated by the blades or their operating mechanisms.
In order to ensure high availability and reliable and safe operation, points require regular inspection and maintenance. Currently, such maintenance is carried out on a time basis, with allowance being made for the operational criticality of a particular point.
A better and more cost-effective approach, though, is reliability centred maintenance, which is being adopted by a number of railway undertakings for point mechanism maintenance.
2.2. Turnout failures and maintenance
Reliability centred maintenance (RCM1) is a process used to decide what must be done to ensure that any physical asset, system or process continues to do whatever its users want it to do [4,8,12]. RCM1 therefore provides powerful rules for deciding whether a failure management policy is technically appropriate, and precise criteria for deciding how often routine tasks should be carried out. RCM1 identifies ways in which the system can fail to live up to these expectations. This must generally be followed by a failure mode and effects analysis (FMEA), which allows an assessment of the consequences of failure. A substantial number of research projects concerning railway infrastructure have been carried out or are still in progress, for example, REMAIN (reliability and maintainability in European Rail Transport), ROMAIN (Railway Open Maintenance Tool), INFRACOST (The Cost of Railway Infrastructure) and RAIL (reliability centred
maintenance approach to infrastructure and logistics of railway operation), all of which justify the contribution of RCM1 for point mechanisms in improving the safety and reliability of railways.
FMEA is a systematic analysis of the potential failure modes of a component of a system [9]. It includes the identification of possible failure modes, determination of the potential causes and consequences and an analysis of the associated risk. It also includes a record of corrective actions or controls implemented, resulting in a detailed control plan. FMEAs can be performed on both the product and the process. Typically, an FMEA is performed at the component level, starting with potential failures and then tracing their effects up to the ultimate consequences. The FMEA allows the identification of the most critical components and the likely failure mechanisms, thus leading to the specification of system parameters to be monitored. Primary performance parameters of complex mechanisms, such as railway points, are speed of movement, vibration, supply voltage, power, throwing time, temperature, current, force, etc. Based on these performance parameters, RCM1 can be used to define terms such as risk, quality, control, comfort, economy, containment, ergonomics, etc.
In practice, condition-based maintenance decisions are based substantially upon assessments of the condition of the system obtained at discrete monitoring time intervals [15]. This type of condition monitoring is called indirect condition monitoring, in contrast to direct monitoring, which measures the actual condition. The latter employs advanced electronics, sensors and transducers, computing and communications technology. The resulting measurements (vibration, supply voltage, power) can be embodied in remote condition monitoring systems (RCM2) [2–4,11]. RCM2 leads to improved reliability and can pay for itself in terms of cost-effectiveness since staff do not have to visit installations as frequently.
The integration of the two types of RCMi (i = 1, 2) is called RCM², with the overall aim of using advanced electronics, control, computing and communication technologies to address the multiple objectives of cost effectiveness, improved reliability and services.
In addition to the data collection required for RCM2, it is also necessary to process the large amount of information to provide a warning when the device moves out of tolerance or adjustment. Algorithm design for the detection of trends and failure patterns has been undertaken by many researchers, but only a few papers dealing with the dynamics of railway turnouts have been found in the literature [1,4,13,14]. The present paper describes a simple approach to RCM2 as applied to railway turnout mechanisms in a case study.
3. Description of turnout used in experiment
Turnouts are assembled from switches and a crossing, where the moving parts are often described as the ‘points’. Most standard point machines contain a switch and lock
mechanism which includes a hand-throw lever and a selector lever to allow operation by power or hand. The mechanism is normally divided into three major subsystems: (i) the motor unit, which includes a contactor control arrangement and a termination area; (ii) a gearbox comprising spur-gears and a worm reduction unit with overload clutch and the dual control mechanism; and (iii) the controller subsystem with motor cut-off and detection contacts. Generally, there are also mechanical linkages for detection and locking of the point. The standard railway point is therefore a complex electro-mechanical device with many potential failure modes.
In the following description reference is made to a particular type of turnout which is in widespread use in the UK and elsewhere but which cannot be named for reasons of confidentiality. The point machine is normally non-handed with respect to the locking bars and detection rods, but it would be necessary to place the hand-throw and selector levers on the side furthest from the rails. Machines can be ordered and supplied with these levers on the appropriate side, but conversion can be made during installation.
The circuit controller includes detection switches and a pair of snap-action switches to stop the machine at the end of its stroke and to regeneratively brake the motor so that the mechanism is not subjected to impacts. The detection switches have high pressure wiping contacts made of silver/cadmium oxide or gold and they are operated by both the lockbox and the detection rod. The detection switches have additional contacts to allow mid-stroke short circuiting of the detection relays to avoid wrong indications in the signal box (Figs. 1 and 2). The control contactor is fixed in the motor compartment with a plug connection to the wiring. It has heavy duty silver/cadmium oxide wiping contacts which reverse the motor, and additional contacts for integration with detection.
The contactor is a two position, polarised, magnetic stiction type requiring 2 W to operate. The DC-powered motor is a special heavy-duty design developed specifically for point machine use. It is plugged into the wiring and is supported by a machined spigot entering the gearbox and two mounting lugs fixed to the base. The motor can take two basic forms: series-wound split-field, and permanent magnet field for AC immune machines. Where the machine will be operated on AC supplies, a silicon bridge rectifier is built into the machine.
Points failures can occur for a number of reasons, ranging from foreign objects blocking the switch rails to loss of position detection due to microswitches being dislodged because of vibration. Failures can also be caused by excessive friction on movement of the format in the ballast.
Fig. 2. Railway turnout.
4. The prototype remote condition monitoring system As part of a research project, the development of a simple RCM2 system for turnout monitoring was undertaken. The motor current was measured using non-intrusive current transformers mounted within the point machine housing. The force in the drive bar was measured by replacing the bolted connection between the drive bar and the drive rod with a load-cell. Trigger signals were taken from adjacent signalling locations utilising spare contacts and wired as part of an instrumentation engineer’s installation. The power supply for the installation was taken from the 110 V supply in an adjacent relay room. Sensors were connected locally to a connection box mounted in close proximity to the points. This box contained no active components but was equipped with lightning protection for the positional detector. For the initial tests, the data were collected using a conventional data logger and then processed on a personal computer.
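The measurements described above can be organised, for illustration, as one record per point movement. The following is a minimal sketch only: the class name `ThrowRecord`, its field names, the sample period and all numerical values are hypothetical and are not taken from the prototype system.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ThrowRecord:
    """One recorded movement (throw) of the points. All field names are hypothetical."""
    direction: str                 # "normal_to_reverse" or "reverse_to_normal"
    sample_period_s: float         # time between consecutive samples, in seconds
    force_n: List[float]           # drive-bar force from the load cell, in newtons
    motor_current_a: List[float]   # motor current from the current transformer, in amperes

    def duration_s(self) -> float:
        """Elapsed time between the first and the last force sample."""
        return max(len(self.force_n) - 1, 0) * self.sample_period_s


def group_by_direction(records: Iterable[ThrowRecord]) -> Dict[str, List[ThrowRecord]]:
    """Separate throws by direction so that each direction can be analysed on its own."""
    grouped: Dict[str, List[ThrowRecord]] = {}
    for record in records:
        grouped.setdefault(record.direction, []).append(record)
    return grouped


# Made-up sample values, for illustration only.
throws = [
    ThrowRecord("normal_to_reverse", 0.5, [0.0, 150.0, 300.0, 120.0, 0.0], [0.0, 2.0, 3.0, 2.0, 0.0]),
    ThrowRecord("reverse_to_normal", 0.5, [0.0, 140.0, 280.0, 0.0], [0.0, 2.1, 2.9, 0.0]),
]
by_direction = group_by_direction(throws)
```

Grouping by direction reflects the fact that the two directions of movement are treated separately in the analysis of Section 5.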
5. Model criteria
Fig. 1. Point mechanism.
Faults in point mechanisms must be detected quickly and reliably if the information is to be useful. The point mechanism is a discrete dynamic system, where data must be processed in real time. Any detection algorithm must therefore use a simple model for detecting faults rapidly by analysing data in real time. It is thus better to apply statistical approaches rather than simply comparing new data with ‘good’ and ‘bad’ signals stored earlier. Also, the model for detecting faults must adapt to external conditions, i.e. changes in the environment (humidity, temperature, climate, etc.), friction forces, etc.,
and the model must detect faults in both directions of the turnout mechanism movement. This was the reason for the authors to choose a reference dynamic system for their analysis. This is based on calculating, individually for normal to reverse and reverse to normal operations, the arithmetic mean between the previous reference data set and the latest new data set for operation in that direction, once it has been proved to be fault free (Eq. (1)). FMEA was employed for the identification of possible failure modes, determination of the potential causes and consequences and an analysis of the associated risk. The FMEA also includes a record of corrective actions or controls implemented, resulting in a detailed control plan. A sample list of faults is shown in Appendix A. The reference data set is updated as

$$x_i^{s} = \frac{x_i^{s-1} + x_i^{j}}{2}, \tag{1}$$

where $s-1$ is the previous reference data set, $j$ is the new (fault free) data set, $s$ is the new reference data set and $x_i$ is the data value at time-step $i$. The data collected refer to force (N) versus time (s). The first conclusion after studying these curves was that we could not detect the faults directly by analysing the curves. However, if we analyse the difference between the actual data $x^{j}$ and the reference data $x^{s}$ in the form of absolute values ($d^{j}$), we can detect the majority of faults as they develop:

$$d_i^{j} = \left| x_i^{j} - x_i^{s} \right|. \tag{2}$$
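Eqs. (1) and (2) can be illustrated with a short Python sketch. It assumes that all curves are sampled at the same time steps; the function names and the force values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence


def update_reference(previous_ref: Sequence[float], new_data: Sequence[float]) -> List[float]:
    """Eq. (1): point-by-point mean of the previous reference and a fault-free data set."""
    if len(previous_ref) != len(new_data):
        raise ValueError("curves must be sampled at the same time steps")
    return [(r + x) / 2.0 for r, x in zip(previous_ref, new_data)]


def absolute_difference(actual: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Eq. (2): d_i = |x_i^j - x_i^s| at each time step i."""
    if len(actual) != len(reference):
        raise ValueError("curves must be sampled at the same time steps")
    return [abs(x - r) for x, r in zip(actual, reference)]


# Made-up force samples in newtons (not measured data).
reference = [0.0, 100.0, 200.0, 100.0, 0.0]
fault_free_throw = [0.0, 110.0, 210.0, 90.0, 0.0]
new_reference = update_reference(reference, fault_free_throw)

# A later throw compared against the updated reference.
later_throw = [0.0, 105.0, 260.0, 95.0, 0.0]
d = absolute_difference(later_throw, new_reference)
```

In this made-up example the later throw deviates from the reference only at the third sample, so the difference curve is zero everywhere except there. One reference would be kept per direction of movement, as described in the text.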
The tests carried out where no faults are present, that is, the ‘as commissioned’ tests, result in curves $d$ which are very similar. They are symmetric with respect to the maximum position and do not display major irregularities, although their amplitudes are different, and the maximum position is similar but not equal in all cases. These are the signal characteristics considered to detect faults, described as follows.
A first criterion is based on whether the shape of the test data curve is irregular. If this is the case it is assumed to be the consequence of a fault. The detection of irregularity in the curves requires the application of a sensitivity value for perturbations in the signal (Fig. 3(a)).
If the data are not sufficiently distinctive to detect a fault with the first criterion, we can check the data gathered in test $j$ to find the maximum position $t_{\max}^{j}$ of the curve $d^{j}$. We may find that it is not the same as the maximum position $t_{\max}^{s}$ of the current reference curve $x^{s}$. Allowing for the margin $t_{mg}$, a signal is considered to be indicative of a fault if the position of the maximum is outside the band (Fig. 3(b)).
Finally, if the first two criteria have not resulted in the detection of a fault, then a third criterion is applied: whether the curve is symmetric with respect to the maximum position, again with a margin of a given width. If it is not symmetric, a fault is assumed. This criterion cannot be used for detecting a fault in real time. This supposition has been demonstrated with numerous simulations and experiments. The mathematical analysis is
as follows:

$$\left( \frac{t_{\max}^{j}}{T^{j} - t_{\max}^{j}} \right) \sum_{n=0}^{t_{\max}^{j}} d_n^{j} \approx \sum_{m=t_{\max}^{j}}^{T^{j}} d_m^{j}, \tag{3}$$

where $t_{\max}^{j}$ and $T^{j}$ are the maximum position and total time, $t_{\max}^{j}/(T^{j} - t_{\max}^{j})$ is the area relationship, and $\sum_{n=0}^{t_{\max}^{j}} d_n^{j}$ and $\sum_{m=t_{\max}^{j}}^{T^{j}} d_m^{j}$ are the areas under and above $t_{\max}^{j}$ in the actual curve $d^{j}$. Any set of test data that satisfies the three criteria will be considered to be the result of a test without fault and is defined ‘as commissioned’. This is shown in Fig. 3(c).
Although using these criteria has allowed a significant improvement in detecting faults by analysing changes from data set to data set, sensitivity values can be applied for further improvement. To increase the reliability of the last criterion, we have employed a Kalman Filter that is described in Section 6.
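The second and third criteria can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names, the relative tolerance `rel_tol` used to decide whether the two sides of Eq. (3) are approximately equal, and the sample values are all hypothetical. The first criterion (irregularity of the curve) is not sketched because the text does not specify how the sensitivity value is applied.

```python
from typing import Sequence


def max_position(d: Sequence[float], dt: float) -> float:
    """Time (s) at which the difference curve d reaches its maximum."""
    peak_index = max(range(len(d)), key=lambda n: d[n])
    return peak_index * dt


def peak_within_band(t_max_test: float, t_max_ref: float, margin: float) -> bool:
    """Second criterion: the peak must lie within +/- margin of the reference peak."""
    return abs(t_max_test - t_max_ref) <= margin


def is_symmetric(d: Sequence[float], rel_tol: float) -> bool:
    """Third criterion, Eq. (3): weighted area before the peak vs. area after it."""
    if max(d) == 0.0:
        return True  # curve identical to the reference: nothing to compare
    peak = max(range(len(d)), key=lambda n: d[n])
    last = len(d) - 1
    if peak == 0 or peak == last:
        return False  # peak at an end point: the curve is one-sided
    ratio = peak / (last - peak)         # t_max / (T - t_max); the sample period cancels
    lhs = ratio * sum(d[: peak + 1])     # sum over n = 0 .. t_max
    rhs = sum(d[peak:])                  # sum over m = t_max .. T
    return abs(lhs - rhs) <= rel_tol * max(abs(lhs), abs(rhs))


# Made-up difference curves, for illustration only.
symmetric_curve = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
skewed_curve = [0.0, 3.0, 2.0, 1.5, 1.0, 0.5, 0.0]
```

Because Eq. (3) needs the complete curve up to $T^{j}$, a check of this kind can only run once a movement has finished, which is consistent with the remark that the third criterion cannot be used in real time.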
6. Kalman Filter approach to data in RCM2
The Kalman Filter [6] approach is based on a set of mathematical equations that provide an efficient computational solution using the least-squares method. The Kalman Filter addresses the general problem of trying to estimate the state of a discrete-time controlled process that is governed by a linear stochastic difference equation. The objective in using Kalman filtering in this study was to increase the reliability of the model presented to the rule-based decision mechanism. The Kalman Filter estimates a process by using a form of feedback control: the filter estimates the process state at some time and then obtains feedback in the form of measurements.
The equations for the Kalman Filter fall into two groups: time update equations and measurement update equations. The time update equations are responsible for projecting forward the current state and error covariance estimates to obtain the a priori estimates for the next time step. The measurement update equations are responsible for the feedback. The time update equations can also be thought of as predictor equations, while the measurement update equations can be thought of as corrector equations. Indeed, the final estimation algorithm resembles that of a predictor-corrector algorithm for solving numerical problems [2,3]. The state $d$ and measurement $m$ of the process are governed by the linear stochastic difference equations:

$$d_{i+1} = A_i d_i + B_i u_i + w_i, \tag{4}$$

$$m_i = H_i d_i + v_i, \tag{5}$$
where $w_i$ and $v_i$ are the process and measurement noises, respectively. They are both assumed to have the characteristics of Gaussian white noise. Their means and covariances are

$$E[w_i] = 0, \qquad E[w_i^2] = Q_i, \tag{6}$$

$$E[v_i] = 0, \qquad E[v_i^2] = R_i, \tag{7}$$

$$E[w_i v_i] = 0, \tag{8}$$

where $E$ is a statistical averaging operator. The values of the a posteriori state estimate $\hat{d}_i$ and error covariance $P_i$ are as follows:

$$\hat{d}_i = \hat{d}_i^{-} + K\left(m_i - H_i \hat{d}_i^{-}\right), \tag{9}$$

$$P_i = E\left[(d_i - \hat{d}_i)^2\right], \tag{10}$$

where $K$ is the blending factor that minimises $P_i$ (the value of $K$ in Eq. (9) is calculated in Ref. [5]), and $\hat{d}_i^{-}$ is the a priori state estimate. The a priori estimate error covariance is as follows:

$$P_i^{-} = E\left[(d_i - \hat{d}_i^{-})^2\right]. \tag{11}$$

The output from a Kalman Filter application to the problem in hand is shown in Fig. 4. Fig. 4(a) and (b) shows the forces measured for transitions from normal to reverse and reverse to normal, respectively. Each diagram shows both the actual measurement and a filtered measurement.

Fig. 3. Criteria employed for detecting faults in a point mechanism. (a) First criterion, (b) second criterion and (c) third criterion.
Fig. 4. Difference between the actual reference curve and the new curve in absolute values with and without Kalman Filter. (a) Normal to reverse direction and (b) reverse to normal direction.
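The predictor-corrector structure described in this section can be sketched as a scalar Kalman filter in Python. This is an illustration only. It assumes constant coefficients $A$, $B$, $H$, $Q$ and $R$, and it uses the standard scalar gain $K = P^{-}H/(H^{2}P^{-} + R)$ with the covariance update $P = (1 - KH)P^{-}$; the paper defers the calculation of $K$ to Ref. [5], so the gain shown here is an assumption rather than the authors' stated formula. The function name, default parameter values and test signal are hypothetical.

```python
from typing import List, Sequence, Tuple


def kalman_filter_1d(
    measurements: Sequence[float],
    a: float = 1.0,     # state transition coefficient A (assumed constant)
    h: float = 1.0,     # measurement coefficient H (assumed constant)
    q: float = 1e-3,    # process noise variance Q (assumed value)
    r: float = 1.0,     # measurement noise variance R (assumed value)
    d0: float = 0.0,    # initial state estimate
    p0: float = 1.0,    # initial error covariance
    b: float = 0.0,     # control coefficient B
    u: float = 0.0,     # control input
) -> Tuple[List[float], float]:
    """Return the a posteriori estimates and the final error covariance."""
    estimates: List[float] = []
    d_hat, p = d0, p0
    for m in measurements:
        # Time update (predictor): project state and covariance forward, cf. Eq. (4).
        d_prior = a * d_hat + b * u
        p_prior = a * a * p + q
        # Measurement update (corrector): standard scalar gain, then Eq. (9).
        k = p_prior * h / (h * h * p_prior + r)
        d_hat = d_prior + k * (m - h * d_prior)
        p = (1.0 - k * h) * p_prior
        estimates.append(d_hat)
    return estimates, p


# Made-up constant signal: the estimate should move from d0 towards the measured value.
filtered, final_p = kalman_filter_1d([5.0] * 50)
```

The ratio of `q` to `r` controls how strongly the filter smooths: a small `q` relative to `r` trusts the model more and follows the measurements more slowly.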
7. Results
The reported RCM2 system was developed as part of a research project between the Department of Mechanical Engineering at the University of Sheffield, England, and an industrial partner. The data collection and algorithm developments were carried out during the year 2000 with a number of experiments at a test site. The partner company arranged for a suitable turnout assembly, fitted with a point machine of a type common in the UK and elsewhere, and made it available in a fully instrumented form for the collection of test data. Test data were collected with the mechanism in perfect working order, and also with a number of previously encountered real life fault conditions created artificially in turn. Data were recorded for each fault condition at a variety of levels of severity (Appendix A). A total of 151 experiments were carried out, 79 in the reverse to normal direction and 72 in the normal to reverse direction.
The most important results are as follows. With a Kalman Filter, we could detect 100% of faults in the reverse to normal direction in the 79 experiments. Without the Kalman Filter this drops to 97.33%. In this direction, the margin employed for detecting the maximum position is 0.3 s less when using the Kalman Filter, and the margin considered for detecting irregularities in curves is reduced to 91.3%. In the other direction, we can currently detect only 97.1% of faults when using the Kalman Filter and without it only 94.2%. The margin for detecting irregularities is 85.71% better, and the margin in maximum position is 0.3 s worse.
8. Conclusions
Turnouts are probably the most important infrastructure elements of the railway system and affect its safety greatly. The standard railway turnout is a complex electro-mechanical device with many potential failure modes. In order to ensure high availability and reliable and safe operation, points require regular inspection and maintenance. Reliability-centred maintenance (RCM1) provides powerful rules for deciding whether a failure management policy is technically appropriate, and precise criteria for deciding how often routine tasks should be carried out. The technique employs advanced electronics, sensors and transducers, computing and communications technology embodied in remote condition monitoring systems (RCM2). RCM2 leads to improved reliability and can pay for itself in terms of cost-effectiveness since staff do not have to visit installations as frequently. The integration of the two types of RCMi (i = 1, 2) is called RCM². The authors of the present paper have described a simple approach to RCM2 as applied to railway turnout mechanisms, based on a case study.
Faults in turnout mechanisms must be detected quickly and reliably if the information is to be useful. The turnout mechanism is a discrete dynamic system, where data must be processed on line. Any detection algorithm to be adopted must use a simple model for detecting faults quickly by analysing data in real time. The model for detecting faults must adapt to external conditions and must detect faults in both directions of the turnout mechanism movement. This was the reason for the authors to choose a reference dynamic system for their analysis.
The RCM2 implementation was developed as part of a research project. The data collection and algorithm developments were carried out during the year 2000 with a total of 151 experiments at a test site. The data collected refer to force (N) versus time (s). If we analyse the difference between the actual data and the reference data in the form of absolute values, we can detect the majority of faults as they develop.
The objective of using Kalman filtering in this study was to increase the reliability of the model presented to the rule-based decision mechanism. The Kalman Filter estimates a process by using a form of feedback control: the filter estimates the process state at some time and then obtains feedback in the form of measurements. With a Kalman Filter, the authors can currently detect 100% of faults in the reverse to normal direction, and without the Kalman Filter this drops to 97.33%. In the other direction, they can currently detect only 97.1% of faults with the Kalman Filter and without it only 94.2%. In general, employing the Kalman Filter has improved the margins of the criteria in both directions. The authors cannot explain why detection in the normal to reverse direction is not as successful as for reverse to normal, but in their opinion they have achieved a minimum rejection rate, and they continue to improve their methods.

Appendix A. Sample list of faults
• 15 mm obstruction at second bearer on normal side of points;
• 13 mm obstruction at eighth bearer on reverse side of points;
• 12 mm obstruction at toe on normal side of points;
• Back drive overdriving at heel on normal side with dry slide chairs;
• Back drive slack end off at toe end;
• Back drive slack end off at toe end (LHS side drive basket slack end off);
• Diode snubbing block disconnected;
• Dry slide chairs;
• Low tension on motor brush;
• Operational contact in original position;
• Tight lock on reverse side;
• Tight lock on reverse side (sand on all bearers on both sides).
References
[1] Andersson C, Dahlberg T. Wheel/rail impacts at a railway turnout crossing. J Rail Rapid Transit 1998;212(2):123–34.
[2] Christer AH, Wang W. A simple condition monitoring model for a direct monitoring process. Eur J Oper Res 1995;82:258–69.
[3] Christer AH, Wang W, Sharp JM. A state space condition monitoring model for furnace erosion prediction and replacement. Eur J Oper Res 1997;101:1–14.
[4] García Márquez FP, Schmid F, Conde J. Mantenimiento Centrado en la Fiabilidad y Monitorización Remota Basada en la Condición, RCM2: Un caso de Estudio. Gestión de Activos Industriales 2002; in press.
[5] Jacobs OLR. Introduction to control theory, 2nd ed. New York: Oxford University Press; 1993.
[6] Kalman RE. A new approach to linear filtering and prediction problems. Trans ASME J Basic Engng 1960;35–45.
[7] Kennedy A. Risk management and assessment for rolling stock safety cases. Proc Inst Mech Engrs: Part F 1997;211:67–72.
[8] Moubray J. Reliability-centred maintenance. Oxford, UK: Butterworth/Heinemann; 1997.
[9] Price CJ, Pragh DR, Wilson MS, Snooke N. The flame system: automating electrical failure mode and effects analysis (FMEA). Proc Reliab Maintain Symp 1995;90–5.
[10] REMAIN Consortium. Modular system for reliability and maintainability management in European rail transport. Final Report, IITB; 1998.
[11] Roberts C, Fararooy S. Remote condition monitoring into the next millennium. Proc Comput Aid Des Manuf Oper Railway 1998.
[12] SAE Standard JA1011, Evaluation criteria for reliability-centered maintenance (RCM) process. Commonwealth Drive Warrendale, USA: International Society of Automotive Engineers Department; 1999.
[13] Shimonae T, Kawakami T, Miki H, Matsuda O, Tekeuchi H. Development of a monitoring system for electric point machines. IRSE Aspect Int Conf 1991;395–401.
[14] Stott PF. Automatic open level crossing a review of safety. London, UK: Her Majesty’s Stationery Office; 1987.
[15] Wang W. A model to determine the optimal critical level and the monitoring in condition based maintenance. Technical Report CMS99-07; 1999.