Proceedings of the 1st IFAC Conference on Embedded Systems, Computational Intelligence and Telematics in Control - CESCIT 2012 3-5 April 2012. Würzburg, Germany

Fault tolerant component management platform over Data Distribution Service

Aitor Agirre*, Elisabet Estévez**, Marga Marcos***

*Ikerlan-IK4 Research Alliance, Arizmendiarrieta 2, 20500 Arrasate, Spain (e-mail: [email protected])
**Dept. of Electronics and Automatic Engineering, EPS Jaén, Universidad de Jaén, Jaén, Spain (e-mail: [email protected])
***Dept. of Automatic Control and Systems Engineering, ETSI Bilbao, UPV/EHU, Bilbao, Spain (e-mail: [email protected])

Abstract: Distributed Real Time Systems (DRTS) are evolving to a higher level of complexity in terms of heterogeneity, dynamism and QoS constraints, as they must interact with an increasingly demanding environment. In a scenario with distributed applications running over different hardware platforms and networks, middleware technologies that provide real-time or QoS support are gaining importance, as they ease the development and deployment of complex distributed applications in heterogeneous networks while ensuring certain time and QoS constraints. In this sense, distribution middleware technologies like the Data Distribution Service (DDS) have emerged, adding extensive QoS support to traditional middleware advantages. This paper proposes a component based application management platform built on top of DDS that provides an easy way to deploy, configure and manage component based distributed applications, while ensuring the fulfillment of certain QoS aspects.

Keywords: Middleware, QoS, DDS, Fault-tolerance, Distributed systems

1. INTRODUCTION

Middleware technologies offer well known advantages (e.g. abstraction of low level communication details, less testing effort, flexibility, portability or interoperability) that help in the design and development of distributed applications. Nevertheless, the use of middleware technologies has not been so widespread in the distributed real-time systems domain, for two main reasons: the limited real-time and QoS support of traditional middleware technologies (Java RMI, Web Services, CORBA, REST…), and efficiency issues (an additional abstraction layer always involves an overhead). To overcome this lack of real-time middleware platforms, some of them have been adapted to the real-time and embedded domain (CORBA-RT, CORBA/e), whilst new ones have been developed, e.g. DDS (Data Distribution Service), a data centric distribution middleware that provides extensive QoS support (OMG, 2007). Unlike other middleware technologies, DDS can explicitly control the latency and the efficient use of network resources, incorporating several QoS related parameters that allow fine tuning of key aspects that are critical in soft real-time application integration (e.g. deadline, latency budget or transport priority) (Wang and Grigg, 2009). Another important point is its transport layer independence, as DDS runs over any OSI layer 2 capable communication standard, e.g. CAN (Rekik and Hasnaoui, 2009). On the other hand, current complex distributed applications demand middleware services that facilitate their management and maintenance.

In this sense, platforms like OSGi (Alliance, 2011) provide support for the management of component oriented applications. However, OSGi lacks two aspects that are central in distributed real-time systems (Liu et al., 2004): (1) its QoS support is almost non-existent, and (2) it is tied to Java. Other platforms (Seinturier et al., 2011, Flissi and Merle, 2006) cover similar management functionalities, but none of them is QoS oriented. Therefore, a QoS enabled application management platform for component based distributed real-time applications, built over a standardized distribution middleware, would be of interest. Such a platform should facilitate the deployment, maintenance, execution control and final un-installation of distributed component based applications. Additionally, it should allow application and middleware QoS parameters (e.g. fault tolerance or component execution deadlines) to be tuned at runtime, and support multi-language and multi-platform implementations. This paper describes the basic services and functionalities that should be provided by such a distributed component management platform (DCMP), and proposes a concrete implementation based on DDS. DDS has been specifically adopted because it natively provides support for services like dynamic discovery or fault tolerance, which would otherwise have to be implemented by the management platform developer (Roman et al., 2002). Different levels of fault tolerance have been considered, regarding both the application components and the platform components that support the middleware services. On the other hand, DDS handles scalability better than other connection oriented RT component based middleware (Deng et al., 2007), as it can use multicast communication from one node to many.



The paper is organized as follows: Section 2 describes the proposed platform architecture, Section 3 explains the platform fault tolerance support, Section 4 presents a case study and, finally, Section 5 draws some conclusions and points to future work.

2. MODULAR ARCHITECTURE BASED ON DDS

A modular architecture based on DDS is proposed, as DDS supports an extensive set of QoS parameters that abstract the communication issues in an easy and flexible way. A similar approach to implementing a service oriented architecture over DDS has been pointed out in (Wang and Grigg, 2009), focused on remote integration and testing over the Internet. From the system management and maintenance point of view, in a dynamically reconfigurable scenario such as the one proposed in this paper, DDS also seems to be a good choice, since no communication endpoint reconfiguration is needed: the reallocation of a component to another physical node is transparent to the other "client" components. This is especially useful when a system reconfiguration is triggered by a node failure and the component is reloaded in another node, because no reconfiguration is needed in the rest of the nodes. A schematic view of the architecture design of the DCMP is shown in Fig.1. The DCMP is composed of two main components (MW_Manager and MW_Daemon) plus the wrapper code attached to each application component.

As commented above, the application components embed wrapper code that links the functional code (business logic) of the application components to DDS specific data structures. This means that the wrapper code of the application components is DDS specific, whilst their functional code is DDS agnostic, and thus platform independent. In any case, this wrapper code can be generated automatically, and could therefore be generated for any other distribution middleware. In fact, compared to other similar approaches (Roman et al., 2002), this is one of the key advantages of the DCMP: it can be linked to MDE tools (Marcos et al., 2011) to automatically generate all the "glue" code needed to interconnect the application components. The functional code and the DDS vendor specific libraries are linked through the DDS specific wrapper code, which can be derived from the functional specification of the application and acts as the "glue software" between the functional code and the DCMP, embedding the interconnection logic between the application components. The goal of the wrapper is basically to link the DDS input topics to the input parameters of the functional code, and the output parameters of the internal functions to the DDS output topics. It may also include some additional logic that depends on combinations of the current parameter values; this processing logic can be used to achieve runtime functional reconfiguration. While the wrapper code can embed interconnection logic and is thus application specific, the functional code of the components is reusable and can be employed in different applications. To summarize, the application components as well as the DCMP components are DDS enabled. In a distributed system formed by n physical nodes there is only one MW_Manager (although it could be mirrored for fault tolerance purposes), n MW_Daemon components (one per node), and m application components. The communication between the different components of the DCMP is performed through DDS topics (see Fig.2, where stereotypes from the DDS UML profile (OMG, 2008) have been used).
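For illustration, the following sketch shows the kind of glue code such a wrapper could contain for a component with a single input topic and a single output topic. The topic types, the write callback and the compute() entry point are hypothetical names introduced here; the generated wrapper would call the vendor specific DDS DataReader/DataWriter API instead.

    #include <functional>
    #include <iostream>

    // Hypothetical DDS topic data types (in the real platform these would be
    // generated from IDL definitions).
    struct InputSample  { double value; };
    struct OutputSample { double value; };

    // DDS-agnostic functional code (business logic): pure computation,
    // no knowledge of topics or middleware.
    double compute(double input) { return input * 2.0; }

    // DDS-specific wrapper: maps an incoming topic sample to the functional
    // code parameters and forwards the result to the output topic through a
    // write callback (in generated code, a DDS DataWriter).
    class Wrapper {
    public:
        explicit Wrapper(std::function<void(const OutputSample&)> write)
            : write_(std::move(write)) {}

        // Called by the DDS DataReader listener when a new input sample arrives.
        void on_input_sample(const InputSample& in) {
            OutputSample out{compute(in.value)};  // glue: topic field -> parameter
            write_(out);                          // glue: return value -> output topic
        }
    private:
        std::function<void(const OutputSample&)> write_;
    };

    int main() {
        // Stand-in for a DDS DataWriter::write() call.
        Wrapper w([](const OutputSample& s) { std::cout << s.value << "\n"; });
        w.on_input_sample(InputSample{21.0});     // stand-in for a received sample
        return 0;
    }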

Fig.1. Platform architecture

Table 1 summarizes the services offered by the DCMP. Other services could be added to support functionalities like admission control, health monitoring or dynamic application reconfiguration.

Table 1. DCMP basic services

Service              Notes
Deployment           Application and component registration
Execution control    Start and stop of applications
QoS configuration    Application components dynamic QoS configuration
Fault tolerance      Achieved through DCMP and application component redundancy


Fig.2. DCMP main components

The key components of the DCMP platform are the MW_Manager and the MW_Daemon stand-alone executables. The communication between these components and the application specific components is performed through the ComponentControl DDS topic and the MWDaemon topic. An IComponentControl DDS DataReader is attached to every application component that runs in the platform and provides access to the ComponentControl topic, which is used for component control and re-configuration purposes. On the other hand, the communication between application components is performed through application specific topics. Next, the three types of components that make up the DCMP (i.e. MW_Manager, MW_Daemon and Application Component) are explained in detail, together with some considerations about the DCMP fault tolerance support.

2.1 Middleware Manager Component

The MW_Manager provides two DDS interfaces (the ControlManager and InstallManager topics) to support the following functionalities:

─ Installation and un-installation of components and applications.
─ Application and component start and stop operations.

A DDS enabled library (Fig.3) is provided to communicate with the MW_Manager from other applications; for instance, a graphical application configuration tool could be developed to manage the complete DCMP through this "proxy" library.

Fig. 3. MW_Manager access library

Regarding the application deployment requirement, the DCMP supports the installation (registration) of components and applications through the IInstallManager interface of the MW_Manager. When the InstallApp function is called by an application configuration tool, it returns an identifier (App_ID) for the newly installed application. Using this App_ID, the application configuration tool can invoke the InstallLogicalComponent function to register each component of the previously defined application. Finally, since it may be needed for fault tolerance or load balancing reasons, a component can be deployed (through the InstallPhysicalComponent function) to more than one physical node, so the logical concept of a component maps onto one or more physical components, i.e. the actual executables deployed to hard disk. Hence, it is possible to integrate new functionalities in the MW_Manager to manage, for instance, load balancing aspects: in case of node CPU saturation, the MW_Manager could decide to stop a (physical) component and start it again in another node. The same strategy can be applied in case of node failure.

Related to the execution control requirement, the second interface that the MW_Manager implements is called IControlManager. Through this interface the start-up or the termination of an application can be triggered (see Fig.3). The start-up of an application is easy to achieve, as the MW_Manager simply launches the physical components corresponding to the application (guided by some predefined logic). The MW_Manager, which is initially allocated to one node (although it could be mirrored for fault tolerance purposes), relies on the MW_Daemon DCMP component to actually activate the distributed components. On the other hand, to stop an application, the MW_Manager communicates directly with the components of that application through the ComponentControl topic illustrated in Fig.2, and asks them for a clean shutdown.
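For illustration, the following sketch shows how an application configuration tool might drive this deployment and execution control sequence through the MW_Manager_Proxy library. The function names (InstallApp, InstallLogicalComponent, InstallPhysicalComponent, Launch_App, StopApp) are the ones mentioned in the text, but the argument lists and the facade class are illustrative assumptions.

    #include <iostream>
    #include <string>

    // Hypothetical facade mirroring the IInstallManager / IControlManager
    // interfaces of the MW_Manager_Proxy library. The stub bodies only print
    // what a real call would trigger over DDS.
    class MW_Manager_Proxy {
    public:
        long InstallApp(const std::string& name) {                    // returns App_ID
            std::cout << "InstallApp " << name << "\n"; return ++ids_;
        }
        long InstallLogicalComponent(long appId, const std::string& name) {
            std::cout << "InstallLogicalComponent app=" << appId << " " << name << "\n";
            return ++ids_;
        }
        void InstallPhysicalComponent(long compId, const std::string& nodeId,
                                      const std::string& path) {
            std::cout << "InstallPhysicalComponent comp=" << compId
                      << " node=" << nodeId << " path=" << path << "\n";
        }
        void Launch_App(long appId) { std::cout << "Launch_App " << appId << "\n"; }
        void StopApp(long appId)   { std::cout << "StopApp "   << appId << "\n"; }
    private:
        long ids_ = 0;
    };

    int main() {
        MW_Manager_Proxy mw;
        long app  = mw.InstallApp("control_app");                   // register application
        long ctrl = mw.InstallLogicalComponent(app, "controller");  // register component
        // The same logical component may be installed on several nodes
        // for fault tolerance or load balancing purposes.
        mw.InstallPhysicalComponent(ctrl, "nodeA", "/opt/dcmp/controller");
        mw.InstallPhysicalComponent(ctrl, "nodeB", "/opt/dcmp/controller");
        mw.Launch_App(app);  // MW_Manager asks the MW_Daemons to start the components
        mw.StopApp(app);     // clean shutdown requested via the ComponentControl topic
        return 0;
    }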

2.2 Middleware Daemon Component

This daemon, deployed in every node, supports the activation of the components that reside in its node. To launch an application, the MW_Manager communicates with the daemons of the involved nodes and asks them to start up the corresponding application components.

The MW_Daemon provides a DDS interface (see Fig.2, MWDaemon topic) to support the start-up of the components inside its node. In this topic, which is an input to the MW_Daemon, three parameters are considered: the unique IDs of the physical node and of the component, together with the path of the component inside the node. With the ID of the physical node, the MW_Daemon can filter out the messages sent to other daemons, and with the path information it can start up the specific application component. The sequence diagram for the start-up and stop operations is described in Fig.4. In the start-up sequence, the application configuration tool uses the IControlManager interface exposed by the MW_Manager_Proxy library to invoke the Launch_App function with the ID of the application as a parameter. This function injects a DDS message (ControlManager topic) for the MW_Manager; then, the MW_Manager injects a message (MWDaemon topic) for each component that composes the application. This topic indicates the ID of the destination MW_Daemon as well as the component path. With this information, the MW_Daemon can start a component process. To stop an application, the application configuration tool invokes the StopApp function of the IControlManager interface, which in turn sends a DDS message (ControlManager topic) to the MW_Manager. Then, the MW_Manager communicates directly (through the ComponentControl topic) with each application component that composes the application, to ask for its clean shutdown. The MW_Daemon is configured with an ID (e.g. the node network name) so that it can filter out the DDS messages sent to other MW_Daemon components.
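A minimal sketch of this launch logic is given below. The layout of the MWDaemon topic follows the three parameters described above, and POSIX fork/exec is used to spawn the component process; both are assumptions made for illustration, as the paper does not fix these implementation details.

    #include <iostream>
    #include <string>
    #include <sys/types.h>
    #include <unistd.h>    // fork, execl (POSIX)

    // Assumed layout of the MWDaemon topic: destination node ID, component ID
    // and path of the executable inside that node.
    struct MWDaemonSample {
        std::string nodeId;
        long        componentId;
        std::string componentPath;
    };

    class MW_Daemon {
    public:
        explicit MW_Daemon(std::string myNodeId) : myNodeId_(std::move(myNodeId)) {}

        // Invoked by the DDS DataReader listener for each MWDaemon sample.
        void on_sample(const MWDaemonSample& s) {
            if (s.nodeId != myNodeId_)   // filter messages addressed to other daemons
                return;
            pid_t pid = fork();          // spawn the component as a separate process
            if (pid == 0) {
                execl(s.componentPath.c_str(), s.componentPath.c_str(), (char*)nullptr);
                _exit(1);                // exec failed
            } else if (pid > 0) {
                std::cout << "started component " << s.componentId
                          << " (" << s.componentPath << ")\n";
            }
        }
    private:
        std::string myNodeId_;
    };

    int main() {
        MW_Daemon daemon("nodeA");       // ID taken e.g. from the node network name
        daemon.on_sample({"nodeA", 7, "/opt/dcmp/controller"});  // stand-in for a DDS sample
        return 0;
    }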


2.3 Application Component

As stated before, the distributed applications considered here are composed of a set of interconnected application components. These DDS components can be data sources, data sinks, or, in the more general case, components with input and output data (see Fig.4). Whilst the MW_Daemon launches the start-up of the application components (ordered by the MW_Manager), the shutdown of an application is not so trivial, because a forced shutdown triggered by the MW_Manager (and eventually executed by the MW_Daemon) might not be acceptable.

Fig.4. Application component

To solve this problem, some mechanism is needed to achieve a clean component shutdown. This is done through the previously cited ComponentControl topic, which is received through the IComponentControl DDS DataReader attached to each application component. In this way, the MW_Manager communicates directly with the components through this control topic and asks for a "clean" shutdown. Besides that, other functionalities like QoS parameter reconfiguration can be supported through this component control topic. Thus, the IComponentControl DataReader provides access to the following IDL topic type definition:

    struct ComponentControl {
        long IdComponent;
        Orders order;
        QoSData qos;
    };

where

    enum Orders {
        Stop,
        ConfigureQoS
    };

    struct QoSData {
        float Deadline;
        float T;
        boolean ReliableComms;
    };
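To make the control flow concrete, the following sketch shows how the wrapper of an application component could react to ComponentControl samples, mapping the Stop order to a clean shutdown and the ConfigureQoS order to an update of its local QoS settings. The C++ mirror of the IDL types and the handler structure are illustrative; the real wrapper would rely on the type support code generated by the DDS vendor.

    #include <atomic>
    #include <iostream>

    // Hand-written C++ mirror of the ComponentControl IDL type (illustrative;
    // normally produced by the DDS IDL compiler).
    enum class Orders { Stop, ConfigureQoS };

    struct QoSData {
        float Deadline;
        float T;              // period, for periodic components
        bool  ReliableComms;
    };

    struct ComponentControl {
        long    IdComponent;
        Orders  order;
        QoSData qos;
    };

    class ComponentWrapper {
    public:
        explicit ComponentWrapper(long id) : id_(id) {}

        // Invoked by the IComponentControl DataReader listener.
        void on_control(const ComponentControl& c) {
            if (c.IdComponent != id_) return;    // order addressed to another component
            switch (c.order) {
            case Orders::Stop:
                running_ = false;                // let the main loop finish cleanly
                break;
            case Orders::ConfigureQoS:
                qos_ = c.qos;                    // apply new deadline/period/reliability
                std::cout << "new period " << qos_.T << " s\n";
                break;
            }
        }
        bool running() const { return running_; }
    private:
        long id_;
        QoSData qos_{0.0f, 1.0f, false};
        std::atomic<bool> running_{true};
    };

    int main() {
        ComponentWrapper w(7);
        w.on_control({7, Orders::ConfigureQoS, {0.5f, 0.1f, true}});
        w.on_control({7, Orders::Stop, {}});
        std::cout << (w.running() ? "running" : "stopped") << "\n";
        return 0;
    }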

The QoSData structure above is just an example of the kind of parameters that could be handled by the ConfigureQoS order of the ComponentControl topic interface. For a periodic component such as a data source (e.g. a sensor), it could be interesting to specify the period T and the deadline of the data; in other cases, the communication mechanism between the components (best effort vs. reliable) could be configured, etc. In this sense, the DCMP leverages most of the QoS parameters of the middleware underneath (DDS). This kind of application component QoS configuration is offered to other entities (e.g. the MW_Manager) through the ComponentControl topic interface.

Another key aspect pointed out in the requirements is the application deployment. In this sense, a process based deployment model has been chosen, which means that the deployed physical components are executable files. There are other options, for instance considering a component as a task allocated inside a single node process. Heavy processes have been chosen for two main reasons:

─ Safety: from the safety point of view, higher safety levels can be achieved if the components use independent memory address spaces (spatially isolated components).
─ Flexibility: components implemented in different languages can easily be merged in an application.

On the other hand, not all RTOSs support multiple processes, so in those cases the scheduling analysis would not be possible; the context switch is also heavier. However, taking into account that the presented platform is QoS oriented, not strictly hard real-time oriented, it is not necessary to address these problems. In any case, as the component management (execution control plus QoS configuration) is performed through a DDS topic (the ComponentControl topic), the task model can easily be changed to a mono-process, multi-threaded model. In such a case, each component would be allocated to an active object inside a "main" process, but the communication mechanism would not change in either case.

3. FAULT TOLERANCE CONSIDERATIONS

As stated in the previous section, the component specific QoS parameters can be configured through the so called ComponentControl topic. But there are other platform specific QoS aspects to consider; for instance, what happens if the node where the MW_Manager resides breaks down?

In such a case, DDS provides a powerful mechanism to mirror or replicate components, by means of the so called Ownership QoS policy (easily configurable through an XML file). This policy, along with the Ownership Strength, specifies whether the DataReaders of a topic can receive data from multiple DataWriters. But only one DataWriter is considered at a time: the one with the highest Ownership Strength.

This means that it is possible to have two instances of the MW_Manager running at the same time in different nodes, one with a higher strength than the other. If a node failure occurs and one of the MW_Manager instances goes down, the second MW_Manager takes control immediately, increasing the availability of the component platform. This is done in a transparent way for the application components, which do not realize (and do not care) where the control data comes from.
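As a sketch of how this mirroring could be configured on the writer side (e.g. for the topics written by the MW_Manager), the snippet below assigns exclusive ownership and different strengths to the primary and backup instances. The QoS structures are simplified stand-ins for the DDS OWNERSHIP and OWNERSHIP_STRENGTH policies; in practice the same values can usually be given through the vendor QoS XML profile.

    #include <iostream>

    // Simplified stand-ins for the DDS DataWriter QoS policies involved in
    // component mirroring (real policy names: OWNERSHIP and OWNERSHIP_STRENGTH).
    enum class OwnershipKind { Shared, Exclusive };

    struct DataWriterQos {
        OwnershipKind ownership_kind     = OwnershipKind::Shared;
        int           ownership_strength = 0;  // only meaningful with Exclusive ownership
    };

    // QoS used by the primary and the backup MW_Manager for the topics they
    // write: exclusive ownership, different strengths.
    DataWriterQos manager_qos(bool primary) {
        DataWriterQos qos;
        qos.ownership_kind     = OwnershipKind::Exclusive; // readers accept one writer at a time
        qos.ownership_strength = primary ? 100 : 50;       // highest strength wins
        return qos;
    }

    int main() {
        DataWriterQos primary = manager_qos(true);
        DataWriterQos backup  = manager_qos(false);
        // While the primary writer is alive its samples are delivered; if it
        // disappears, the backup (lower strength) silently takes over for the readers.
        std::cout << primary.ownership_strength << " vs "
                  << backup.ownership_strength << "\n";
        return 0;
    }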


The same idea is valid for the application components: in critical applications where availability is a key issue, they could be replicated as well, thus increasing the safety level. In fact, in this way the fault tolerance support can be scaled to the required level (enabling two, three or more "clones" of the same component). Obviously, such a mechanism involves an overhead in terms of communication bandwidth, memory and CPU usage, but this drawback is the price to pay in high availability systems.

The second fault tolerance mechanism is provided by the DCMP through the Liveliness topic, which is periodically updated by the application components to notify the MW_Manager that they are still "alive". If the MW_Manager detects the absence of such a "heartbeat" for a particular application component that should be alive, it can decide to re-start the component in the same node or, in case of node failure, in another node. This fault tolerance mechanism provided by the DCMP can be combined with the DDS specific "component mirroring" mechanism commented on before. Also, application specific fault tolerance mechanisms can be implemented easily thanks to the extensive QoS support provided by DDS, as explained in the next section.
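A minimal sketch of this heartbeat based supervision is shown below, under the assumption that the Liveliness topic simply carries the component ID and that the MW_Manager records the time of the last sample received from each component; the field names and the timeout handling are illustrative.

    #include <chrono>
    #include <iostream>
    #include <map>

    using Clock = std::chrono::steady_clock;

    // Assumed payload of the Liveliness topic: just the component identifier.
    struct LivelinessSample { long componentId; };

    // MW_Manager side bookkeeping: last heartbeat time per supervised component.
    class LivelinessMonitor {
    public:
        explicit LivelinessMonitor(std::chrono::milliseconds timeout) : timeout_(timeout) {}

        // Called for every Liveliness sample received from the components.
        void on_heartbeat(const LivelinessSample& s) { last_[s.componentId] = Clock::now(); }

        // Periodically called by the MW_Manager; flags the components whose
        // heartbeat is missing, so they can be restarted (possibly on another node).
        std::map<long, bool> check(Clock::time_point now) const {
            std::map<long, bool> missing;
            for (const auto& [id, t] : last_)
                missing[id] = (now - t) > timeout_;
            return missing;
        }
    private:
        std::chrono::milliseconds timeout_;
        std::map<long, Clock::time_point> last_;
    };

    int main() {
        LivelinessMonitor monitor(std::chrono::milliseconds(500));
        monitor.on_heartbeat({7});
        for (const auto& [id, dead] : monitor.check(Clock::now()))
            std::cout << "component " << id << (dead ? " missing\n" : " alive\n");
        return 0;
    }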

4. CASE STUDY

To illustrate the kind of distributed applications that the DCMP supports, an application composed of four components is shown in Fig.5. It is a typical control application with two sensors (a temperature sensor and a pressure sensor), a controller and an actuator, deployed over two Intel Atom based conga-CA nodes.

Fig.5. Test application scenario

All the application components integrate a DDS DataReader for the ComponentControl topic (in green), which supports their management (shutdown and QoS configuration). They also integrate a ComponentControl DataWriter, which allows them to control other application components (e.g. an application component could decide to launch another component in case of an internal application reconfiguration event); in such a case the MW_Manager acts as a broker, i.e. the first component asks the MW_Manager to launch a specific component and then the MW_Manager performs the actual launching through the MW_Daemon of the selected node. Regarding the application specific topics (Temperature, Pressure and data), the "data source" components (e.g. the sensors) only need DDS DataWriter ports, whilst the "data sink" components (e.g. the actuator) only include DDS DataReader ports. These ports manage the application specific DDS topics. Finally, the controller component has both DataReader and DataWriter ports for application topics, as it reads inputs (data from the sensors) and writes outputs (data for the actuator).

As commented in Section 2, different "wrapper" templates can be generated for the functional code, depending on the desired component behaviours specified in the configuration phase. In this scenario, several test cases have been performed, corresponding to different functional modes. For instance, regarding the intrinsic component behaviour, the controller can expose periodic or sporadic behaviour: if periodic, it writes its outputs periodically, regardless of whether or not it has fresh input data (it computes the outputs from the last input data received); if sporadic, it computes its outputs only when it receives an input update (event driven). On the other hand, two different behaviours have been tested for the input ports, which can be combined with either of the previous intrinsic component behaviours: (1) the NO_WAIT input logic establishes that, in case of any input change, the outputs are updated, regardless of whether "fresh" data is available for the rest of the inputs; on the contrary, (2) the WAIT input logic implies that the component updates its outputs only if all the inputs have been updated within a specified time interval Ti (see Fig.6). If the interval Ti is infinite and all the inputs have been received at least once since the system start-up, then the outputs are updated each time a "write outputs" trigger is raised (by a periodic timer in the case of a periodic component, or by a sporadic input update in the case of a sporadic component).
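The WAIT input logic can be summarized by the small sketch below: before writing its outputs, the component checks that the last arrival time stamp of every input falls within the interval Ti preceding the trigger. Names and types are illustrative; in the generated wrapper this check would sit between the DDS DataReaders and the functional code.

    #include <chrono>
    #include <iostream>
    #include <vector>

    using Clock = std::chrono::steady_clock;

    // WAIT input logic: outputs may be written only if every input has been
    // refreshed inside the window [trigger - Ti, trigger].
    bool inputs_fresh(const std::vector<Clock::time_point>& lastArrival,
                      Clock::time_point trigger,
                      Clock::duration Ti) {
        for (const auto& t : lastArrival)
            if (t < trigger - Ti || t > trigger)   // stale (or not yet received) input
                return false;
        return true;
    }

    int main() {
        using namespace std::chrono_literals;
        auto trigger = Clock::now();                   // the "write outputs" trigger (t0)
        std::vector<Clock::time_point> arrivals = {    // last arrivals of inputs A, B, C
            trigger - 30ms, trigger - 20ms, trigger - 10ms
        };
        std::cout << (inputs_fresh(arrivals, trigger, 100ms)
                          ? "update outputs\n" : "skip output update\n");
        return 0;
    }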

Fig.6. WAIT logic with output update conditions fulfilled

In the example shown in Fig.6, there are three input variables (A, B, C) that arrive at times t1, t2 and t3, respectively. When the "write outputs" trigger is raised at t0, the application component checks the "freshness" of its current input data set: if the last arrival time stamps (t1, t2 and t3) of all the input data (A, B, C) lie within the interval Ti, it proceeds with the output update; if the freshness of the input data is not as desired, the outputs are not written. In the case study of Fig.5, the four possible combinations have been successfully tested, i.e. periodic and sporadic behaviour combined with the two input port logic types (WAIT and NO_WAIT). The fault tolerance mechanisms of the platform have also been tested successfully. First, some components have been mirrored as backup components in different nodes (these component redundancies are not shown in Fig.5).


If one of the nodes is shut down, the DCMP restarts the affected components in other available nodes. Second, the DCMP fault tolerance mechanism based on the Liveliness topic has also been tested: if one of the components is manually terminated, the MW_Manager detects it and restarts the component. Finally, the application is fault tolerant in a third way: the controller can be designed to perform a safe stop if both the temperature sensor and the pressure sensor fail to provide their source data within the expected time (deadline). This deadline could be reconfigured at runtime (following some predefined logic) by the MW_Manager through the ComponentControl topic of the involved application components.

5. CONCLUSIONS AND FUTURE WORK

The proposed approach for a distributed component management platform seems to be feasible and flexible enough to support component based distributed applications with QoS considerations. DDS itself appears to be an efficient middleware to support component based distributed applications, providing platform and language independence while ensuring that certain QoS restrictions are met. The rich set of QoS parameters that DDS provides to fine tune the non functional parameters of an application is a key advantage of DDS that must be considered. The intercommunication mechanisms between different components in different nodes have been tested. The data centric approach is valid in a complex scenario with multiple nodes, multiple components and multiple applications with strong interaction between them, whilst ensuring the specified QoS. In any case, several stress tests must still be done to validate the platform in high load scenarios. The functionality of the proposed platform can be extended in a relatively easy way thanks to its modular design. Several extra features could be considered: application admission control, automatic load balancing based on platform health monitoring, automatic reconfiguration mechanisms for fault recovery (García-Valls et al., 2010a), or even deterministic approaches (García-Valls et al., 2010b). Currently the approach is QoS oriented, but as soon as the middleware underneath (DDS) becomes deterministic (Pérez Tijero and Gutiérrez), the DCMP would be valid not only for QoS enabled applications, but also for real-time applications.

6. ACKNOWLEDGMENTS

This work has been partially funded by the Spanish Ministry of Industry, Trade and Tourism and the ARTEMIS Joint Undertaking under Project contract 100026.

REFERENCES

Alliance, T. O. 2011. OSGi Service Platform Core Specification, Version 4.3.
Deng, G., Gill, C., Schmidt, D. C. & Wang, N. 2007. QoS-enabled Component Middleware for Distributed Real-Time and Embedded Systems. In: Lee, I., Leung, J. & Son, S. (eds.) Handbook of Real-Time and Embedded Systems. CRC Press.
Flissi, A. & Merle, P. 2006. A generic deployment framework for grid computing and distributed applications. Montpellier.
García-Valls, M., Basanta-Val, P. & Estévez-Ayres, I. 2010a. A component model for homogeneous implementation of reconfigurable service-based distributed real-time applications. 10th Annual International Conference on New Technologies of Distributed Systems, NOTERE'10, Tozeur. 267-272.
García-Valls, M., Rodríguez-López, I., Fernández-Villar, L., Estévez-Ayres, I. & Basanta-Val, P. 2010b. Towards a middleware architecture for deterministic reconfiguration of service-based networked applications. 15th IEEE International Conference on Emerging Technologies and Factory Automation, ETFA 2010, Bilbao.
Liu, W., Chen, Z. L., Tu, S. L. & Du, W. 2004. Adaptable QoS Management in OSGi-based cooperative gateway middleware. Lecture Notes in Computer Science, 3033, 604-607.
Marcos, M., Estévez, E., Jouvray, C. & Kung, A. 2011. An Approach to use MDE in Dynamically Reconfigurable Networked Embedded SOAs. 18th IFAC World Congress. Milano, Italy.
OMG 2007. Data Distribution Service for Real-time Systems v1.2.
OMG 2008. UML Profile for DDS (Beta 1).
Pérez Tijero, H. & Gutiérrez, J. J. On the schedulability of a data-centric real-time distribution middleware. Computer Standards and Interfaces.
Rekik, R. & Hasnaoui, S. 2009. Application of a CAN BUS transport for DDS middleware. 2nd International Conference on the Applications of Digital Information and Web Technologies, ICADIWT 2009, London. 766-771.
Roman, M., Hess, C., Cerqueira, R., Ranganathan, A., Campbell, R. H. & Nahrstedt, K. 2002. A middleware infrastructure for active spaces. IEEE Pervasive Computing, 1, 74-83.
Seinturier, L., Merle, P., Rouvoy, R., Romero, D., Schiavoni, V. & Stefani, J.-B. 2011. A component-based middleware platform for reconfigurable service-oriented architectures. Software: Practice and Experience.
Wang, Y.-H. & Grigg, A. 2009. A DDS based framework for remote integration over the internet. 7th Conference on Systems Engineering Research (CSER).
