Adding distribution and fault tolerance to Jason

Á. Fernández-Díaz*, C. Benac-Earle, L. Fredlund
Babel group, DLSIIS, Facultad de Informática, Universidad Politécnica de Madrid, Spain
Highlights

• We present a distribution schema for the Jason programming language.
• The monitoring technique introduced can be used to avoid fault propagation.
• The supervision mechanism presented enables fault recovery in multi-agent systems.
• The runtime system supports the extensions and can be used in a declarative way.
• Various use cases expose the potential of the extensions.
Article history: Received 3 March 2013; Received in revised form 26 September 2013; Accepted 10 January 2014; Available online xxxx

Keywords: Multi-agent systems; Fault tolerance; Jason programming language; Erlang programming language; eJason
Abstract

In this article we describe an extension of the multi-agent system programming language Jason with constructs for distribution and fault tolerance. This extension is completely integrated into Jason, in the sense that distributing a Jason multi-agent system does not require the use of another programming language. This contrasts with the standard Java-based Jason implementation, which often requires writing Java code in order to distribute Jason-based agent systems. These extensions to Jason are being implemented in eJason, an Erlang-based implementation of Jason. We introduce two different fault tolerance mechanisms that allow fault detection and recovery. A low-level agent monitoring mechanism allows a monitoring agent to detect, and possibly recover, when another agent experiences difficulties due to, e.g., hardware failures or network partitioning. More novel is the second fault tolerance mechanism, supervision, whereby one agent acts as a supervisor to a second agent. The supervision mechanism is, in addition to handling low-level faults such as the above, also capable of detecting higher-level failures such as “event overload” (an agent is incapable of timely handling all its associated events and plans) and “divergence” (an agent is not completing any iteration of its reasoning cycle). Moreover, mechanisms exist for another agent to inform a supervisor that one of its supervised agents is misbehaving. Although these extensions are inspired by the distribution and fault tolerance mechanisms of Erlang, due to the agent perspective the details are quite different. For instance, the supervisor mechanism of eJason is much more capable than the supervisor behaviour of Erlang, corresponding to the more abstract/higher-level perspective offered by agent-oriented programming (Jason) compared with process-oriented programming (Erlang). As another example, from the perspective of agent programming we consider it natural to support the flexibility of supervision trees, i.e. to allow the evolution of supervision relations over time. For instance, the supervisor of an agent, as well as the supervision policy maintained for that same agent, may vary as the system evolves.

© 2014 Elsevier B.V. All rights reserved.
* Corresponding author.
E-mail addresses: [email protected].fi.upm.es (Á. Fernández-Díaz), [email protected].fi.upm.es (C. Benac-Earle), [email protected].fi.upm.es (L. Fredlund).
1. Introduction

The increasing interest in multi-agent systems (MAS) is resulting in the development of new programming languages and tools capable of supporting complex MAS development. One such language is Jason [8]. Some of the more difficult challenges faced by the multi-agent systems community, namely how to develop scalable and fault tolerant systems, are the same fundamental challenges that any concurrent and distributed system faces. Consequently, agent-oriented programming languages provide mechanisms to address these issues, typically borrowing from mainstream frameworks for developing distributed systems. For instance, Jason allows the development of distributed multi-agent systems by interfacing with JADE [7,6]. However, Jason does not provide specific mechanisms to implement fault tolerant systems.

MAS and the actor model [3] have many characteristics in common. The key difference is that agents normally impose extra requirements upon the actors, typically a rational and motivational component such as the Belief-Desire-Intention architecture [22,24]. Some programming languages based on the actor model are very effective in addressing the aforesaid challenges of distributed systems. Erlang [5,9], in particular, provides excellent support for concurrency, distribution and fault tolerance. However, Erlang lacks some of the concepts, like the Belief-Desire-Intention architecture, which are relevant to the development of MAS.

In recent work [12], we presented eJason, an open source implementation of a significant subset of Jason in Erlang, with very encouraging results in terms of efficiency and scalability. Moreover, some characteristics common to Jason and Erlang (e.g. both having their syntactical roots in Prolog) made some parts of the implementation quite straightforward. However, the first eJason prototype did not permit the programming of distributed or fault tolerant multi-agent systems. A preliminary version of the distribution model and a fault tolerance mechanism for Jason were presented in [13]. In this article we build on these extensions and motivate the changes. The new extensions are being implemented in eJason, thus making it possible to develop fault tolerant distributed systems fully in Jason itself. Our up-to-date implementation of eJason and the sample multi-agent systems described in this and previous documents can be downloaded at: https://github.com/avalor/eJason.git.

The rest of the article is organised as follows: Section 2 provides background material about Erlang, Jason and eJason. Section 3 includes a brief description of platforms and techniques that address the fault tolerance of multi-agent systems. Sections 4 and 5 describe the proposed distribution model and fault tolerance mechanisms for Jason programs, respectively. Some details on the implementation in eJason of these extensions can be found in Section 6. Several examples that illustrate in detail the use of the proposed extensions are included in Section 7. Finally, Section 8 presents the conclusions and future lines of work.

2. Background

In this section we briefly introduce the background for the contents presented in this article. Some previous knowledge of both Jason and Erlang is assumed.

2.1. Jason

Jason is an agent-oriented programming language which is an extension of AgentSpeak [21]. The standard implementation of Jason is an interpreter written in Java.

2.1.1. The Jason programming language

The Jason programming language is based on the Belief-Desire-Intention (BDI) architecture [22,24], which has been highly influential in the development of multi-agent systems. The first-class constructs of the language are: beliefs, goals (desires) and plans (intentions). This approach allows the implementation of the rational part of agents through the definition of their “know-how”, i.e., their knowledge about how to act in order to achieve their goals. The Jason language also follows an environment-oriented philosophy, i.e., an agent exists in an environment which it can perceive and with which it can interact using so-called external actions. In addition, Jason allows the execution of internal actions. These actions allow interaction with other agents (communication) or carry out useful tasks such as, e.g., string concatenation and printing on the standard output.

2.1.2. The Java implementation of Jason

A complete description of the Java implementation of Jason can be found in [8]. This implementation of Jason allows the programming of distributed multi-agent systems by interfacing with the well-known third-party software JADE [7,6], which is compliant with the FIPA recommendations [2]. JADE implements a distribution model where the agents are grouped in agent containers, which are, in turn, grouped to compose agent platforms. These agent containers are organised hierarchically. A so-called Main Container exists in every platform. The Main Container provides some services to the rest of the containers (e.g. it maintains a global registry of all the agents in the platform) and hosts the agents that implement the Agent Management System (a white pages service) and the Directory Facilitator (a yellow pages service).
Fig. 1. An Erlang multi-node system.
The distribution of a Jason system using JADE is not transparent from the programmer’s perspective, as the programmer must declare the architecture of the system (centralised and JADE being the ones provided by default) and, most likely, execute some actions that rely on the third-party software used to distribute the system.

The Java interpreter of Jason provides mechanisms to detect and react to the failure of a plan of an agent. However, there are no mechanisms to detect and react to the failure of an entire agent. That is, there are no constructs that permit detecting whether some agent has stopped working (has died) or has become isolated. One can of course program a “monitor agent” that continuously interacts with the agent that should stay alive, according to some pre-established communication protocol, in order to detect if the monitored agent fails. An example of such a monitor agent can be found e.g. in [4]. However, having to program such monitors by hand is an error-prone and tedious task.

2.2. Erlang

Erlang [5,9] is a functional concurrent programming language created by Ericsson in the 1980s which follows the actor model. The chief strength of the language is that it provides excellent support for concurrency, distribution and fault tolerance on top of a dynamically typed and strictly evaluated functional programming language. It enables programmers to write robust and clean code for modern multiprocessor and distributed systems.

An Erlang system (see Fig. 1) is a collection of Erlang nodes. An Erlang node (or Erlang Runtime System) is a collection of processes (actors) with a unique node name. These processes run independently from each other and do not share memory. They interact via communication. Communication is asynchronous and point-to-point, with one process sending a message to a second process identified by its process identifier (pid). Messages sent to a process are put in its message queue, also referred to as a mailbox. As an alternative to addressing a process using its pid, there is a facility for associating a symbolic name with a pid. The name, which must be an atom (i.e. a literal, a constant starting with a lowercase letter), is automatically unregistered when the associated process terminates. Message passing between processes in different nodes is transparent when pids are used, i.e., there is no syntactic difference between sending a message to a process in the same node or to a remote node. However, the node must be specified when sending messages using registered names, as the pid registry is local to a node. For instance, in Fig. 1 let the process Proc C be registered with the symbolic name procC. Then, if the process Proc B wants to send a message to the process procC, the message will be addressed to procC@Node2.

A unique feature of Erlang that greatly facilitates building fault tolerant systems is that one process can monitor another process in order to detect and recover from abnormal process termination. If a process P1 monitors another process P2, and P2 terminates with a fault, process P1 is automatically informed by the Erlang runtime of the failure of P2. It is possible to monitor processes at remote nodes. This functionality is provided by an Erlang function named erlang:monitor.

Another interesting feature of Erlang is the Open Telecom Platform (OTP). The OTP is a set of libraries and design principles for the development of distributed systems in Erlang. It identifies and supports a series of abstractions, named behaviours, for common capabilities of processes. One of these is the supervisor behaviour. It enables the implementation of the supervisor process in a supervision tree. A supervision tree defines a hierarchical structure of processes for fault tolerance purposes. Each process (node) in a supervision tree is either a supervisor (branch node) or a worker (leaf node). The basic idea is that a supervisor keeps all its child processes alive. The worker processes carry out different computations, i.e. they do the actual work.
The supervisor processes monitor the behaviour of their child processes (either workers or other supervisors) and restart them when necessary. A supervisor process spawns and monitors all its child processes. If a supervisor process terminates, all its child processes terminate as well.

2.3. eJason

eJason is our Erlang implementation of the Jason programming language, which exploits the support for efficient distribution and concurrency provided by the Erlang runtime system to make a MAS more robust and performant. It is interesting to note the similarities between Jason and Erlang; both are inspired by Prolog, and both support asynchronous communication among computationally independent entities (agents/processes), which makes the implementation of (some parts of) Jason in Erlang rather straightforward.

The first prototype of eJason was described in [12]. This prototype supported a significant subset of Jason. This subset included the main functionality: the reasoning cycle, the inference engine used by test goals and plan contexts, and the knowledge base. We continue developing eJason by increasing the Jason subset supported and by improving the design and implementation of eJason. The current implementation supports plan annotations, strong negation, all Jason terms and an improved inference engine, among others. For example, the new inference engine generates matching values for the variables in a query on demand, instead of unnecessarily computing all matching values (e.g. in order to calculate the set of applicable plans), as our implementation did in its early stages. The main features of Jason not yet supported by our implementation are the interaction with the agent’s environment via external actions and the different customisations of the interpreter, belief base, pre-processing directives, etc., described in [8].

Probably the most relevant feature of the eJason implementation is the one-to-one correspondence between an agent and an Erlang process (a lightweight entity), enabling the eJason implementation to execute multi-agent systems composed of up to hundreds of thousands of concurrently executing agents with few performance problems. This compares very favorably with the Java-based standard Jason implementation, which has problems executing systems with more than a few thousand concurrent agents (even if executed using a pool of threads) on comparable hardware (see [12] for benchmarks).

3. Related work: fault tolerance in multi-agent systems

Several existing techniques try to facilitate the development of robust multi-agent systems. These techniques use approaches that follow either the survivalist or the citizen concept [16].

A survivalist approach implies that the agents are designed to preserve their own existence. Survivalist agents inspect their own state and handle the possible exceptions that could arise, applying the corresponding corrective measures. The plan failure mechanism of Jason [8] is an example of a survivalist approach. The replication of agents [17,10], i.e. maintaining different copies of the same agent in the system, is a technique that also follows the survivalist approach. In active replication techniques the different replicas of an agent carry out the same computations. Then a single output is determined using e.g. a proxy [11] that implements some aggregation function or voting system. Passive replication techniques keep all but one replica dormant.
A dormant replica takes over the responsibilities of the active replica if the latter fails. The DARX framework [18] allows the use of both active and passive replication techniques. The use of survivalist techniques requires the programmer to implement complex coordination and exception handling behaviours. Moreover, the replication of agents can consume an excessive amount of resources if not used carefully.

In contrast, the citizen approach implies that the agents rely on different entities (e.g. system services or other agents) in order to handle different exceptions and agent deaths. Citizen approaches often require the existence of sentinels [15,14]. Sentinels are agents or services that monitor the behaviour of the different agents in order to take the appropriate corrective measures in the presence of faults. Sentinel agents require a description of the expected behaviour of the agents that they monitor. The sentinels in [14] infer the behaviour of the agents by listening to their broadcast messages. A sentinel can even alter (repair) a message sent between two other agents if it is considered wrong. Conversely, each agent in [15] possesses a normative description of its own behaviour. The agent uses this description to create a sentinel that monitors its behaviour.

The HADES system [19] follows a hybrid (survivalist and citizen) approach inspired by biology. The agents in HADES actively replicate themselves in a way that seeks to preserve the equilibrium of the system. These agents monitor their own behaviour and repair, or even kill, themselves if necessary. The agents acknowledge their “citizenship” to the system, hence monitoring the agents in their environment. If an agent identifies some persistent anomaly, it entices the malfunctioning agents to kill themselves. New replicas of these self-killed agents are then created to replace them.

The platform SyMPA [23] uses replication to maintain the functionality of its central system. Besides, the state of individual agents is preserved by periodically writing it to disk. Similarly, the platform JADE [7] introduces a Main Container Replication Service (MCRS) to replicate the main containers that compose an agent platform. The MCRS appoints a master main container, while the other main containers are replicas for backup purposes. The master container stores the information of the Directory Facilitator in a database. In the case of a failure of the master main container, a backup replica takes over that information from the database and replaces the failed one.
Fig. 2. Sample eJason distributed system.
4. Distribution

In this section we describe the proposed agent distribution model extension to Jason, which has been implemented in eJason. It is inspired by the distribution model of Erlang, which provides a number of useful features. The distribution model has been designed with three goals in mind: distribution transparency (i.e., ensuring that distributed agents can communicate even if oblivious to each other's physical location), efficiency (via decentralised control), and minimising the impact on the syntax of the Jason programming language.

4.1. Distribution schema

Below we introduce the terminology used to describe the distribution model.
• A multi-agent system is composed of one or more agent containers. Unlike in other agent development environments, such as JADE, there is no hierarchy between agent containers.

• Each (agent) container (we use the name container to emphasise the similarities with a JADE container) belongs to one and only one multi-agent system, is comprised of a set of agents, and is located on a computing host. It is given a name that is unique in the whole system. Concretely, the name has the form NickName@HostName, where NickName is an arbitrary short name given to the container when it is started and HostName is the full name of the host on which the container runs. For instance, the name [email protected] corresponds to a container that runs on a host whose name in the system is avalor-laptop.fi.upm.es.

• Each agent is present in a single container. At the moment of agent creation, it is given a symbolic name that is unique within the system. This unique name can be provided statically (specified by its source code) or dynamically (given by the runtime system upon its creation). Using the unique name of an agent, agents can communicate with each other irrespective of the containers they reside in.

As an example, consider the distributed Jason multi-agent system in Fig. 2. This system is composed of four agents distributed over two different hosts. The agents with the names owner, robot and fridge reside in a container named [email protected]. The agent supermarket runs in the container [email protected], located on a different host.

4.1.1. Connectivity

Given the distribution schema presented above, the following restrictions apply:

(r1) Any two different agents can interact directly (communicate, monitor, supervise, . . .) if and only if their respective containers are connected.
(r2) All the agent containers belonging to the same multi-agent system are interconnected.

(r3) If two agent containers are interconnected, they belong to the same multi-agent system.

In a nutshell, these restrictions imply that an agent can interact with (and only with) all other agents that belong to the same multi-agent system.
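As an illustration of this transparency, consider the agents of Fig. 2 and the plan sketch below (the stock/2 belief and the order/2 message content are invented for this example; only .send and .print are standard Jason internal actions). The fridge agent addresses supermarket by its unique name alone; nothing in the code reveals that the two agents reside in different containers on different hosts.

   /* Sketch: distribution-transparent communication between the agents of Fig. 2. */
   +stock(milk, 0)                                    // the fridge notices it has run out of milk
      <- .print("Ordering more milk");
         .send(supermarket, achieve, order(milk, 2)). // same code whether supermarket is local or remote

The same plan would work unchanged if both agents resided in the same container, which is precisely the distribution transparency goal stated at the beginning of this section.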
4.1.2. Multi-agent system dynamics

A multi-agent system is not a static entity; its composition varies over time. New agent containers can be created and added to the system or, conversely, the containers belonging to a certain system can become detached from it. The following cases illustrate the possible dynamics of a multi-agent system (the agent actions that enable these dynamics will be described later in the document):
• Merge: Consider two different multi-agent systems, m1 and m2, comprised of the sets of agent containers AC1 and AC2, respectively. Recall that |AC1| ≥ 1, |AC2| ≥ 1, and AC1 ∩ AC2 = ∅, as m1 ≠ m2. By restrictions (r1)-(r3) above, the agents belonging to different systems cannot interact (e.g. send a message) with each other. Nevertheless, any two different multi-agent systems, m1 and m2, can be merged to compose a new multi-agent system m3 if all the following conditions hold:
  – AC3 = AC1 ∪ AC2.
  – Let Namei be the set containing the names of all agents in the system mi (recall that an agent name is unique within its system); then Name3 = Name1 ∪ Name2 and Name1 ∩ Name2 = ∅, i.e. there are no two agents with the same name belonging to the different systems m1 and m2. Therefore, the merging of m1 and m2 will not provoke a clash between agent names.

• Split: Consider a multi-agent system m1 with a set AC1 of agent containers. At any moment in time, this system can evolve into a different system m1′ such that AC1′ ⊂ AC1 and 1 ≤ |AC1′|. There exist two possibilities for a system split:
  – A non-empty subset, AC2 = AC1 \ AC1′, of agent containers detaches from the system m1 to form a new system m2 (e.g. due to the execution of the disconnect action below or due to the isolation of this subset caused by a network problem).
  – A non-empty subset, AC2 = AC1 \ AC1′, of agent containers is stopped (e.g. due to problems in the host(s) in which they run).

4.2. Design decisions

The distribution schema described above differs from the one presented in [13]. The main difference is that all agent names are unique within the system, while in [13] an agent name is unique only within its container. The new distribution schema brings some improvements:
• The main advantage is that it enables the transparent distribution of the system, i.e. the agents can interact with each other completely oblivious to their respective containers. Agent names are unique within the whole system, hence any given agent only needs to know the name of the agent with which it wants to interact in order to address it. From a programmer’s perspective, the implementation of distributed systems is facilitated. The logic governing the behaviour of an agent can be made independent of the actual system orchestration at runtime. Furthermore, it facilitates a future implementation of the transparent migration of agents between containers.

• Another improvement introduced is the possibility to delegate the naming of an agent to the runtime system of eJason. This delegation avoids name clashes between agents (ensured by the runtime system). The drawback is that an agent created in this way needs to make itself visible to other agents (e.g. by starting a communication) in order to allow them to interact with it. We recommend using this feature for agents that will themselves start the interaction with other agents, instead of just waiting for others to start it. Conversely, agents that will be addressed by others without prior interaction should be given a name statically. For instance, consider a classical system composed of a server (an agent that provides some kind of service) and an arbitrary number of clients (other agents). The recommendation for this kind of system is to provide a name, e.g. server, for the server agent statically and to delegate the naming of the client agents to the runtime system. The logic for the clients then just needs to use the atom server to interact with the server agent.

• The guarantees of the internal action whose execution creates a new agent (see [8] or the distribution API in Appendix A) have changed. In the case that the name of a new agent is provided by the source code, it is either created with that name or its creation is aborted (due to a name clash). Moreover, the plan executing an aborted agent creation fails. Hence, it is now possible to implement agent behaviours that react to name clashes. As an example, consider a system where some agent alice executes a plan that requires the creation of an agent named bob and later sends a message to it:
   +!plan <- .create_agent(bob, "./bob.asl");
             .send(bob, tell, "hello!").

If some agent with name bob exists in the system prior to the creation of the new agent, a plan failure is generated. A failure plan can be provided to deal with it, e.g. by delegating the naming to the runtime system:
   -!plan <- .create_agent(Bob, "./bob.asl");
             .send(Bob, tell, "hello!").

In contrast, following the strategy of the Java implementation of Jason, the agent would be created with another name, e.g. bob_a. Neither the agent invoking the creation nor the new agent would be aware of this change of name. Moreover, the plan would be executed successfully and the message “hello!” would be sent to the agent with name bob that existed before the plan execution, which may have unpredictable consequences.

• The lack of a hierarchy between agent containers facilitates the execution of split and merge operations, hence enhancing the flexibility of the agent systems. There is no need to select a new main container after a split operation (as would happen in other platforms like JADE). Conversely, there is no need to demote a main container after a merge of two systems. Furthermore, after a split operation, all the resulting subsystems are fully operational agent systems themselves.

The new distribution schema requires the implementation of (non-trivial) mechanisms that maintain the consistency of the system and allow its proper functioning:
• Programmers are relieved of the need to keep track of the containers in which the agents run. Hence, a white pages service becomes necessary in order to resolve the addressing of agents. Given some atom AgentName, this service determines whether it is a valid reference to an agent in the system or not. This service must also be aware of the connections between containers, i.e. the set of all containers belonging to the system.

• Every time the creation of a new agent is attempted, it must be checked whether a name clash would occur before the creation is allowed. Besides, if the naming is delegated to the runtime system, a means to generate a unique name at any point in time is needed.

• During a merge operation, all the agent names in the two systems that are about to be merged must be compared, whereas no such comparison is necessary for any kind of split operation.

The different internal actions that enable the operations above can be found in Appendix A.

5. Fault tolerance

Multi-agent systems are susceptible to faults similar to those of other kinds of distributed systems. We define a fault as an event that arises at some component of the system and threatens the proper functioning of that system. Faults can propagate among the components of a system and, in the worst case, provoke a system breakdown. Some instances of faults are the unexpected death of an agent, the isolation (preventing interaction) of a set of agents, and the lack of response to incoming stimuli by an agent. These errors may originate from different causes, e.g. a broken physical link in the network connection, the shutdown of a host where some agent containers run, or bugs in the source code that defines the behaviour of an agent. A classification of the different faults, according to the causes that provoke them, can be found in [20]. The fault detection and fault recovery techniques that we present in this work disregard the underlying causes of failure. Instead, these techniques focus on the observable effects of faults:
• The death of an agent.
• The attempt of interaction with an agent that does not belong to the system.
• The (undesirable and unexpected) unresponsiveness of an agent.
• The termination of the connection between two agents.
It can be argued whether some of the observable events above are actually faults. For instance, the death of an agent a may be programmed to occur at some point as part of its behaviour, which is not a fault in itself. However, it can generate a deadlock in some other agent b that has started some interaction protocol and is expecting a response from a. This deadlock can be avoided, e.g., by allowing b to become aware of the death of a. The fault tolerance techniques discussed next not only address the recovery from faults by the components in the system (e.g. restarting a), but also prevent the propagation of these faults between different components (e.g. b becoming deadlocked).
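Anticipating the monitoring mechanism of Section 5.2, the sketch below shows how agent b could protect itself in the scenario just described. The internal action .monitor and the agent_down notification are hypothetical placeholder names standing in for the actual eJason monitoring API; only .send and .print are standard Jason internal actions.

   /* Sketch: b watches a before starting the interaction protocol. */
   +!get_answer_from(a)
      <- .monitor(a);                  // hypothetical: ask the runtime to watch agent a
         .send(a, askOne, status(X)).  // start the interaction protocol

   /* If a dies before replying, the runtime notifies b instead of leaving it blocked. */
   +agent_down(a)
      <- .print("Agent a is gone; giving up on the protocol").

The actual set of notifications delivered by the runtime system is described in Section 5.2; the plan structure, however, stays the same.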
5.1. Overview of the fault tolerance techniques

In order to cope with the different observable faults, we present here two different mechanisms, namely monitoring and supervision. They are inspired by the homonymous fault tolerance mechanisms provided by Erlang. There exist some similarities between the monitoring and supervision techniques proposed:
• The detection of faults is carried out by the runtime system of eJason. Whenever a fault is detected, a notification (a message) is sent by the runtime system to the proper entity (an agent).
• The fault detection is started by the agent that will receive the fault notification messages. The runtime system does not look for faults if there is no agent to be notified. This approach increases the efficiency of the system, as there is no need to check for faults if no recovery action is going to be taken.

• They require the establishment of a relation between at least two agents. In each relation, one of the agents is notified about faults in the other. The notified agent can then take some (predefined or ad-hoc) corrective action or simply ignore the notification.

• The monitoring and supervision relations established by an agent persist after its death if it is restarted or revived. The concepts of restart and revival of an agent a require the existence of a supervisor agent. This supervisor maintains a supervision policy over a such that the agent is created again immediately after its death (revival) or is stopped and started anew (restart).

In spite of their similarities, the monitoring and supervision techniques are designed to address different facets of fault tolerance. Therefore, they impose different responsibilities on the agents involved and their use is recommended in different situations:
• The monitoring mechanism can be used when some agent a (the monitoring agent) is interested in the execution state of another agent b (the monitored agent) and its availability for interaction. If a starts a monitoring relation with b, a will know whether b dies, leaves the system, is restarted, does not belong to the system at all, or is added to the system (in case it did not belong to it). This information is valuable if the functionality of a depends somehow on that of b, e.g. if b provides some service to a or if a and b interact frequently. Therefore, the monitoring mechanism can be used to avoid fault propagation between agents. A monitoring relation does not impose any additional responsibility on either the monitoring or the monitored agent. Hence, it can be used at the programmer's discretion.

• The supervision mechanism allows the implementation of complex fault recovery behaviours. An agent a (supervisor agent) can start a supervision relation with some set of agents B. Then, a acquires the responsibility of carrying out certain fault recovery actions, according to a supervision policy, in order to maintain (to some extent) the availability of the agents in B. The supervisor agent handles requests from other agents that suspect the failure of some of its supervised agents. Finally, a supervisor agent receives the ability to execute some fault detection (ping) and fault recovery actions (restart, unblock, revive) over its supervised agents that other agents cannot execute. We recommend the use of the supervision mechanism as a means to guarantee the availability of an agent or set of agents that carry out some relevant task. Moreover, we also recommend that both the supervisor and its supervised agents run in the same container, so that the exchange and processing of messages is executed locally by the same part of the runtime system. Finally, we suggest that a supervisor agent should devote itself to supervision tasks. This way, no other task would interfere with the supervision-related ones.

An analysis of the potential security risks introduced by the fault tolerance techniques presented next, as well as their possible solutions, falls beyond the scope of this work.

5.2. Monitoring agents

The agents in a multi-agent system, despite being autonomous, are not isolated from each other. The interaction between agents is a recurrent phenomenon. Agents interact (e.g. cooperate, negotiate, provide a service) as part of the execution of the plans (intentions) to achieve their goals. Therefore, a fault in one of the agents involved in some interaction can spread to all other agents interacting with it. In order to avoid that, we introduce a monitoring mechanism that allows an agent a to be aware of the execution state of any other agent b in the system. Typically, a interacts or expects some future interaction with b.

The monitoring mechanism proposed does not, by default, inform all the agents about faults in each other. Instead, a (monitoring) agent interested in the state of another (monitored) agent, with name AgentName, must explicitly request to be notified when the status of the monitored agent changes. We refer to this as a monitoring relation. Note that a monitoring relation just requires the monitoring agent to know the unique identifier of the monitored agent, i.e. its name. Once a monitoring relation is established, the monitoring agent will be informed about the following changes in the execution state of the monitored agent:
(1) The monitored agent with the symbolic name AgentName has been restarted, i.e. its supervisor agent has stopped it and started it anew (see Section 5.3 for details on agent supervision). This information can be useful as the restarted agent may have lost part of its belief base and/or goals.

(2) The monitored agent with the symbolic name AgentName has been revived, i.e. this agent has died but its supervisor agent has started it anew (see Section 5.3 for details on agent supervision). As with restarts, this information can be useful as the revived agent may have lost part of its belief base and/or goals.

(3) The monitored agent with the symbolic name AgentName has stopped running, i.e. it has died and is not revived (i.e. it has no supervisor agent or the supervision policy maintained for it does not allow its revival).

(4) The containers of the monitored and monitoring agents are no longer connected, e.g. due to a network problem, a problem with the container host, a failure in the container itself, or some split operation. Therefore, the direct interaction between these agents is not possible: they are “unreachable” to each other.

(5) The monitored agent does not belong to the system, i.e. there is no agent whose symbolic name is AgentName in the system at the moment of the request.

(6) An agent with name AgentName has been added to the system (either newly created or as a result of a merge operation). This information can be useful, e.g. to allow an agent in need of a service from AgentName to be aware of the moment when it becomes available, or even to resume an interaction interrupted by a split operation after a network failure. Note that the only guarantee is that an agent with name AgentName has been added to the system. It may be the case that it is not the agent expected. Therefore, the programmer is responsible for ensuring that agent names are not reused.

The characteristics that define the monitoring mechanism implemented in eJason are:
• A monitoring agent can monitor several agents, while a monitored agent can be monitored by several monitors. Besides, there can only be one instance of each monitoring relation.
• A monitoring relation exists until the monitoring agent disables it, executing the proper action, or until the monitoring agent dies (restarts and revivals excluded).
• It is possible for an agent to receive more than one notification for the same monitoring relation, but never two consecutive ones of the same type (except for restarts and revivals).

Clearly the conditions above are not mutually exclusive. For instance, an unreachable (disconnected) agent may be dead as well. However, after a disconnection the only observable change of state is a reconnection.

Note how these features of the monitors in Jason contrast with the monitors of Erlang. Erlang allows many instances of the same monitor, with a maximum of one notification per instance, and hence does not allow their persistence after the death or disconnection of the processes involved. We believe that our approach to monitoring provides a series of improvements over that of Erlang:
• As there can only be one instance of each Jason monitor, the agents do not need to keep track of the identification (reference) of each monitor in case they need to disable them.
• The monitors persist over agent deaths and disconnections and, moreover, notify the (re)appearance of the monitored agent. They can then be used to implement complex interaction protocols that survive agent deaths and temporary disconnections. Recall that the main limitation is that the monitoring relation depends on the name of the monitored agent. Given the death or disconnection of a monitored agent with name AgentName and the later appearance of an agent using that same name, all the monitoring relations depending on AgentName that have not been disabled will deliver a notification to the monitoring agents. It may be the case that this new AgentName agent is a completely different agent (i.e. it has a different behaviour or is a different instance than the one existing when the monitoring relation was established). In a system where the reuse of names is possible, we recommend the automatic disabling of monitoring relations after the death or the disconnection of the monitored agent. Section 7.4 shows how this limitation can be overcome.

• The observational power is augmented, as not only agent deaths and disconnections are observable. A monitoring agent can detect whether a monitored agent has been restarted or revived. This implies that it can rest assured that the restarted/revived agent maintains, at least, its plan base (i.e. its control logic).

Note that the monitoring mechanism can also be used to implement fault recovery. For instance, if an agent receives the notification about the death of one of its monitored agents, it can try to respawn it. However, we consider that this approach can be error-prone. Instead, we recommend using the supervision mechanism described in the next section, as it provides high-level constructs for fault recovery.

5.3. Supervising agents

The implementation of strategies that enable fault recovery is not a trivial task. There exists a wide variety of faults that must be identified and different corrective actions that can be taken to tackle them.
Fig. 3. Sample eJason supervision tree.
Fig. 4. Default scenario for a ping request.
The supervision mechanism that we describe in this section facilitates the implementation of such strategies for fault recovery. It provides runtime system support to (easily) execute operations like reviving a dead agent, breaking a possibly infinite loop, or implementing complex supervision behaviours à la Erlang [1].

The use of the supervision mechanism enables the construction of supervision trees. Fig. 3 depicts a sample supervision tree. The branch nodes of a supervision tree correspond to supervisor agents. An edge of a supervision tree is the counterpart of a supervision relation between a supervisor agent and a non-empty set of agents (its supervised set). On the one hand, this relation enables the supervisor to execute actions aimed at recovering the proper execution state of its supervised agents (according to some supervision policy). On the other hand, the relation imposes on the supervisor the responsibility to handle requests from other agents that suspect the misbehaviour of some of its supervised agents.

As an example, consider the case in Fig. 4. The agent a is the supervisor of the agent b. Then, a can e.g. restart b if it suspects that b is not working properly. Besides, if some other agent c wants to know whether b is overloaded, or even blocked (e.g. because b is not answering requests in a timely manner), it will request (message 1) that a check the execution state of b (messages 2 and 3). Then, a must handle this request and provide c with an answer (message 4).

The restrictions imposed on a supervision relation are the following:

(a) Each agent can have at most one supervisor agent at any given moment in time.

(b) A supervisor agent can supervise several sets of agents. These supervised sets must be disjoint. Together, restrictions (a) and (b) imply that any supervised agent belongs to at most one supervised set, which removes the possibility of maintaining conflicting supervision policies for the same agent.

(c) The supervision relation is automatically disabled when the supervisor agent dies (revivals and restarts excluded) and when it is disconnected from all the agents in the supervised set. The agents that are disconnected from the supervisor are removed from the supervised set.

(d) An agent cannot be supervised by itself. This would be pointless, as the death of a supervisor agent terminates a supervision relation.

Apart from the faults identified by the monitoring mechanism, the supervision mechanism enables the identification of the divergence of an agent. We borrow the concept of divergence from the classical literature on process algebras, e.g. CSP [25], where a divergent process executes an infinite sequence of hidden actions and, consequently, does not terminate.
In eJason, an agent is considered to diverge if it is unable to complete any further iteration of its reasoning cycle. This situation can be caused, e.g., by the execution of an infinite loop during the resolution of a rule in the context of a plan. The observable property of a diverging agent is that it is alive but totally unresponsive, i.e. it no longer processes new incoming messages and does not answer any queries. However, the divergence of an agent must not be confused with either an overload (i.e. when an agent must process many incoming requests and cannot answer them as quickly as expected) or a situation where the agent intentionally avoids communication with some other agent. Strictly, the divergence of an agent occurs when it is no longer completing any iteration of its reasoning cycle.

5.3.1. Actions

The actions that a supervisor agent can execute over its supervised agents are the following:
• Ping: this action (similar to the traditional ping protocol used in computer networks to test the reachability of hosts) is used to identify whether a supervised agent diverges or is overloaded.
• Unblock: this action makes a supervised agent cancel the resolution of a test goal or plan context. It enables the supervised agent to avoid divergence.
• Restart: this action stops a running (alive) supervised agent and starts it anew, i.e. the execution of this action causes the death of an agent and its subsequent creation. The restarted agent maintains its name, the initial beliefs that it was given at its creation, and its monitoring/supervision relations.

• Revive: this action starts a dead agent anew. A revived agent maintains its name, the initial beliefs that it was given at its creation, and its monitoring/supervision relations. In contrast to a restart, the execution of this action only causes the creation of the supervised agent as a reaction to its death, which has necessarily occurred before such execution. Moreover, the revival of an agent can only occur as part of the application of a supervision policy. The existence of such a policy prevents possible interference from other agents with regard to a name clash as long as the corresponding supervision relation exists.

• Supervision policy: the supervisor agent can adopt a complex supervision policy for its supervised agents. This supervision policy defines the course of action of the supervisor with regards to both fault identification and recovery. An example of a valid supervision policy is one that requires pinging the supervised agents every 15 seconds and restarting any of them that fails to answer a ping two consecutive times (as it is probably diverging).

An agent a can interact with the supervisor s of another agent b, even if it does not know who that supervisor is (i.e. a does not know that agent s is the supervisor of agent b, but a interacts with s all the same). The interactions supported by the supervision mechanism are the following:
• Request: the agent a sends a request to the supervisor agent s. This request corresponds to the application of one of the actions above (ping, unblock, restart). The agent s processes the request and, if applicable, executes the corresponding action. This interaction enables agents not involved in a supervision relation to ask for the execution of supervision-related actions (e.g. in case the supervisor has not noticed by its own means that some action should be taken) if they suspect that some agent is misbehaving.

• Tell: the agent a sends a regular message to s. This action is equivalent to sending a message to s using the performative “tell” [8]. Recall that a does not necessarily know the agent with which it is communicating. Hence this action enables anonymous communication (from the perspective of the supervisor) between agents for supervision purposes. It can be used, e.g., to define a supervision protocol specific to some system, without needing to know the actual supervision tree in advance.

5.3.2. Signals

The execution of the different actions related to supervision requires the emission and processing of signals. A signal is a message sent by the runtime system of eJason, though its emission can be requested by an agent. The recipient of a signal is an agent. Signals are selected with preference over regular messages by the default message selection function [8] of the reasoning cycle implemented in eJason.

The reception of a signal is processed by the runtime system of eJason using a predefined behaviour. However, in some cases, this behaviour can be overridden by the agent that receives the signal. When a signal whose processing can be customised is received by an agent, a related event is generated. The runtime system then checks the plan base of that agent. If an applicable plan can be found, that plan is executed instead of the predefined one. Hence it is possible to provide customised supervision behaviours. Otherwise, the runtime system executes the standard behaviour corresponding to that signal.

For instance, when a supervised agent receives a restart_signal, a new restart event is generated for that agent (+supervision_restart_signal(RestartTime)[source(signal)]). If an applicable plan for that event can be found in the plan base (e.g. storing the agent state in a database and then killing itself), this plan is executed. Otherwise, the runtime system provides a default plan (just killing itself), which is then executed. This approach enables the implementation of fault recovery strategies that in some cases rely on a predefined behaviour and, in others, define a specific one.
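As a sketch of such a customisation, the plan below overrides the default reaction to a restart signal. The triggering event is taken from Fig. 5; the state-saving goal !save_state and the use of the standard Jason internal actions .my_name and .kill_agent for self-termination are illustrative assumptions, not part of the documented eJason API.

   /* Sketch: customised handling of a restart signal. The default plan would only
      kill the agent; this one saves (part of) its state first. */
   +supervision_restart_signal(RestartTime)[source(signal)]
      <- .print("Restart requested; saving state first");
         !save_state;        // user-defined goal, e.g. dump relevant beliefs to a database
         .my_name(Me);
         .kill_agent(Me).    // terminate so that the supervisor can start the agent anew

Any cleanup performed by such a plan must fit within the RestartTime window granted by the supervisor; otherwise the runtime system terminates the agent anyway (see Section 5.3.3).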
Signal            Related action(s)          Receiver            Related event
ping_signal       ping, ping request         supervised agent    –
pong_signal       ping, ping request         supervisor agent    –
pang_signal       ping, ping request         supervisor agent    –
unblock_signal    unblock, unblock request   supervised agent    –
restart_signal    restart, restart request   supervised agent    +supervision_restart_signal(RestartTime)[source(signal)]
death_signal      –                          supervisor agent    –
ping_request      ping request               supervisor agent    +supervision_ping_request(AgentName, Result)[source(SomeAgent)]
ping_request      ping request               supervisor agent    +supervision_ping_request(AgentName, PingTime, Result)[source(SomeAgent)]
ping_request      ping request               supervisor agent    +supervision_ping_request(AgentName, PingTime, low_priority, Result)[source(SomeAgent)]
unblock_request   unblock request            supervisor agent    +supervision_unblock_request(AgentName, Result)[source(SomeAgent)]
restart_request   restart request            supervisor agent    +supervision_restart_request(AgentName, Result)[source(SomeAgent)]
restart_request   restart request            supervisor agent    +supervision_restart_request(AgentName, RestartTime, Result)[source(SomeAgent)]
Fig. 5. Signals in eJason.
The complete set of signals in eJason and their features can be found in Fig. 5. The relation between the execution of actions and the emission of signals is described in the next section. The example in Section 7.3 shows how a predefined behaviour (default plan) can be overridden.

5.3.3. Actions and signals

In this section we describe the execution of the actions in Section 5.3.1, including the signals that are emitted and processed. Considering a supervisor agent that executes the aforementioned actions over a supervised agent with name AgentName, their execution works as follows:
• Ping: when the supervisor executes this action, a ping_signal is sent to the supervised agent AgentName. Ping signals are answered at the beginning of each iteration of the reasoning cycle by the supervised agent, using the default plan provided by the runtime of eJason (hence it cannot be overridden). This default plan consists of the emission of a pong_signal. The belief base of the supervised agent is not modified. Each ping signal has an associated timeout of PingTime milliseconds. If a ping_signal is not answered by a pong_signal within PingTime milliseconds, a pang_signal is sent to the supervisor agent. The reception of pang and pong signals by a supervisor agent is processed by the runtime system. It amounts to matching a variable with the atom pang or pong, respectively (see Appendix C for details).

  The emission of a pong signal can be used to test the divergence of an agent. Recall that ping signals are answered at the beginning of the iterations of the reasoning cycle. If no pong signal is emitted (for a large enough PingTime), the supervised agent has not completed an iteration in at least PingTime milliseconds, i.e. it is suspected of divergence. Note that a diverging agent will never respond to a ping signal. Besides, an overloaded agent (e.g. an agent with many incoming messages) answers ping signals, as they are processed with preference over the incoming messages. Furthermore, it is possible to send low priority ping signals, i.e. ping signals whose processing is not given preference over that of regular messages. Their processing is exactly the same as that of the regular ping signals above. Therefore, they can be used to test whether the supervised agent is overloaded, i.e. whether its response time is larger than PingTime.

• Unblock: when a supervisor agent invokes this action, the agent with name AgentName receives an unblock_signal. This signal is processed by the runtime system and aborts the current resolution of a plan context or test goal being executed by the agent AgentName. Unblocking the resolution of a plan context amounts to stopping the identification of applicable plans for the event processed in the current reasoning cycle. The standard execution of the reasoning cycle is then resumed, i.e. the option selection function is applied over the applicable plans already identified, if any. Unblocking a test goal amounts to making it fail as if no matching could be found. If the supervised agent is not resolving a test goal or plan context, the unblock signal is ignored. There is no default plan to be overridden for unblock signals.
  Note that this internal action modifies the default behaviour for goal resolution (i.e. it can cause the failure of a resolution that, given enough time, could have been successfully completed) and should therefore be used carefully (Section 7.2 shows a use case example). The purpose of this action is to enable the system to overcome the divergence of an agent without restarting it (hence avoiding some possible loss of the agent state, e.g. its belief base).

• Restart: when a supervisor agent invokes this action, the supervised agent receives a restart_signal. The default plan for these signals makes the agent kill itself. A restart signal possesses a restart time of RestartTime milliseconds, which is the maximum time given to the agent to execute any possible cleanup and kill itself before it is terminated by the runtime system. Once the supervised agent has been killed, it is started again using the same Jason source code (hence maintaining its plan base) and the initial beliefs provided at its last creation, i.e. maintaining the parameters used to create it. Note that the agent does not preserve the information about its state (i.e. its belief, goal and intention bases) by default. Notice also that the supervision and monitoring relations are maintained. The use of this action enables the system to recover from a situation where some agent stubbornly falls into divergence, even after using unblock signals. Such an agent is then freshly started.

• Revive: when a supervised agent dies for a reason other than a restart, a death_signal is sent to the supervisor agent. The default plan for this signal is the one corresponding to the supervision policy specified (see below). If such a policy applies, the supervised agent is started again under the same conditions as for a restart above. Note that the only observable difference with a restart is that the supervised agent has not died as a consequence of the reception of a restart signal. This action enables the system to keep its agents alive in spite of their deaths. As shown in the example in Section 7, it also avoids possible race conditions derived from a name clash.

• Supervision policy: the use of supervision policies enables the definition of complex supervision behaviours in a declarative way. A supervision policy defines the behaviour of a supervisor agent with respect to a certain supervised set. A supervision policy is completely defined by four subpolicies and a restart strategy:

  – Ping policy: it defines the frequency, Frequency, with which the possible divergence of the supervised agents must be checked using ping signals, and a maximum threshold, MaxPings, of unanswered pings. The supervisor sends a ping signal every Frequency milliseconds to its supervised agents. If one of them fails to respond to more than MaxPings consecutive pings (i.e. the ping threshold is exceeded), the periodic ping stops and the supervisor agent executes its unblock policy. The periodic ping is resumed afterwards.

  – Unblock policy: it specifies the maximum number, MaxUnblocks, of unblock operations that can be executed over an agent in the supervised set within a time span UMaxTime (the unblock threshold). If a ping policy is specified, an unblock operation is executed when the ping threshold is exceeded (over the agent that exceeded it), unless the unblock threshold has been exceeded as well. If the unblock threshold is exceeded, the restart policy is applied.
– Restart policy: it specifies the maximum number of restarts MaxRestarts that can be executed on an agent in the set of supervised agents in a time span RestartMaxTime. When an agent exceeds the unblock threshold, this agent is restarted, unless the restart threshold is exceeded as well. If this threshold is exceeded, all the supervised agents in the set are killed and the supervision relation is terminated, hence disabling any revival policy. – Revival policy: it specifies the maximum number of revivals MaxRevive that can occur for an agent in the supervised set in a time span ReviveMaxTime. This policy can be used to maintain the agents alive when they are killed or die unexpectedly. If the revival threshold is exceeded, all the supervised agents in the set are killed and the supervision relation is disabled. – Restart strategy: it defines a dependency, RestartStrategy, between the agents belonging to the same supervised set. It is applied anytime that an agent Agent in the supervised set is restarted or revived. Its possible values are similar to those of the Erlang supervisor [1]. The strategy one_for_one states that only the agent Agent is started/revived. The strategy rest_for_one implies that the agents that appear after Agent in the list of supervised agents (the supervised set is represented by a list of agents) are restarted as well. Finally, the strategy one_for_all states that all the agents in the supervised set are restarted. Note that these restarts may exceed the corresponding restart threshold. As a matter of example, consider a system where a supervision relation exists such that the supervisor agent s and the supervised set is composed by the agents ag1, ag2 and ag3 (in this order). The supervisor agent must check every fifteen seconds whether any of the supervised agents diverges or not. Anytime that an agent is suspected of divergence an unblock operation must be sent to it, with a maximum of 3 such operations in ten minutes. An agent can only be restarted once per minute when it exceeds this unblock threshold. Besides, anytime that an agent dies, it must be revived up to a maximum of 10 revivals every minute. In this example the functioning of the supervised agents depends on each other, whenever one is restarted or revived, all others must be restarted as well. In order to implement this kind of policy it suffices to specify a supervision policy such that: – The ping policy has values: Frequency = 15000 (15 seconds) and MaxPings = 0 (any failed ping requires an unblock). – The unblock policy has values: MaxUnblocks = 3 and UMaxTime = 600000 (10 minutes). – The restart policy has values: MaxRestarts = 1 and RMaxTime = 60000 (1 minute).
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.14 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
14
– The revival policy has values: MaxRevive = 10 and ReviveMaxTime = 60000 (1 minute). – The restart strategy is one_for_all. • Request: when an agent Agent executes a supervision request for the supervised agent SAgent, a signal request_signal is sent to the supervisor agent of SAgent. Every request signal possesses a Reason that can be either ping, unblock or restart. The default plan provided by the runtime system corresponds to the execution of a ping, unblock or restart (respectively). An example of how to overwrite this plan is shown in Section 7. • Tell: the execution of this action does not generate any signal. It just allows the sending of a regular message to an agent, whose identification is unknown, on the basis of its role (i.e. the supervisor of other agent). The API that enables the use of the different features of the supervision mechanism can be found in Appendix C. Some practical uses of this supervision mechanism are given in Section 7. 5.3.4. Comparison to Erlang mechanisms Although the supervision mechanism proposed is inspired by that of Erlang [1], its capabilities are superior in several ways. These capabilities correspond to an agent-oriented way of programming in contrast to the process-oriented perspective of Erlang. The main improvements introduced by the supervision mechanism of eJason are:
• A supervisor not only detects the death of its supervised agents. If an agent is alive but underperforming (overloaded) or blocked (diverging) the supervision detects (using pings) and corrects this situation.
• Agent termination and restart are not the only possible corrective actions. The supervisor unblocks diverging agents. Unblocking an agent is far less drastic than restarting it (the agent maintains its state, e.g. its beliefs and goals).
• The concept of revival of an agent. This operation is supported by the runtime system and provides the strong guarantee that, if the agent is revived, it maintains its name. It avoids race conditions derived from name clashes.
• The agents can communicate with the supervisor of another agent, even if they do not know who this supervisor is. The control logic of the agents can be made independent to the actual supervision tree.
• The eJason supervision trees are more dynamic than the Erlang ones: – A supervisor agent can maintain several supervision policies for different sets of supervised agents. Besides, the supervisor of an agent and/or its supervision policy may vary over time. – The supervisor can maintain a supervision policy chosen from a variety of predefined supervision behaviours or provide its own behaviour. – The supervisor can supervise agents not created by itself and does not need to possess a description of their behaviour (a so-called child specification). – The supervised agents do not die if the supervisor dies. This strong dependence between the supervisor and its supervised agents is not introduced. 6. Implementation In this section, we provide some details about the implementation in eJason of the proposed extensions to Jason. We do not intend to give a low-level description of how every single element has been implemented. Instead, we describe a high-level implementation design and explain the correspondence between its elements and their Erlang counterparts (recall that the extensions are inspired by the distribution and fault tolerance mechanisms of Erlang). 6.1. High-level design Next, we briefly describe the design decisions for the implementation of the distribution and fault tolerance mechanisms introduced in eJason. 6.1.1. Distribution The distribution model proposed guarantees that all the agent names in the system are unique at any moment in time. Therefore, the existence of a global registry is necessary. In order to allow eJason systems to scale, the maintenance of and the access to this global registry shall be as efficient as possible. The information of the global registry is partitioned and distributed among the different agent containers. Each container hosts a Distribution Manager (DM). The DM is a service in charge of:
• Maintaining an agent name registry local to the container. The name of every agent running in the container belongs to this registry. Anytime an agent is created, it must ensure that it will not cause a name clash.
• Issuing non-conflicting names when the naming of a new agent is delegated to the runtime system. • Resolving distributed agent name queries, i.e. given an agent name, determine whether this agent belongs to the system or not. If it does, it can also provide the name of the container it is running into. This information can be obtained, e.g. polling all the containers in the system and/or maintaining a consistent cache.
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.15 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
15
• Handling connections between containers, i.e. knowing which agent containers belong to the system, disconnecting from other containers during a split operation and looking for possible name clashes during a merge operation. It is also involved in the execution of the internal actions related to connected containers like .container_of_agent and .containers_list. Note that the global registry of a system does not exist as a single structure. It can be constructed unifying the information maintained by the distributed managers of each container in the system. 6.1.2. Fault tolerance Apart from the DM, each container hosts a Fault Tolerance Manager (FTM). This component provides the following services:
• Setting up and disabling monitor relations for the monitoring agents running in its container, i.e. it is involved in the execution of every monitor and demonitor action. • Maintaining a record of the supervision relations that involve at least one of the agents running in its container. Consider an agent a running in the container C1 that supervises another agent b running in the container C2. Then, the information about the supervision relation between a and b is replicated in the FTMs of C1 and C2. • Prevent the name clash with any agent whose supervision policy allows their revival. If a supervised agent with name supA can be revived, its name is not removed from the local name registry by its death alone. Instead, the local FTM obtains the responsibility of notifying the local DM about the death of supA when it will be no longer revived, hence removing its name from the name registry. This way, during the time span between the agent death and its revival a name clash cannot occur, as no other agent can be created with name supA. • Resolving supervision related queries, i.e. when some agent requests communication with the supervisor of some other agent, this component finds this supervisor agent, either using its own records (including a cache) or polling another fault tolerance manager, and sends the proper signals or messages. Notice that the information about the different supervision and monitoring relations is replicated and distributed among several FTMs. 6.2. Erlang implementation This section provides some insights about the Erlang data structures and mechanisms used to implement the functionality described all along this document. The functionality provided by eJason is constantly increased, hence some parts of the actual implementation in subsequent releases of eJason may vary from what is described next. A very basic knowledge about Erlang constructs and their semantics suffices to understand the contents of this section. 6.2.1. Distribution A new concept introduced by our distribution model is the agent container. It is implemented using an Erlang node. Therefore, each container must be given a symbolic name unique in its host, as it is not possible to have two Erlang nodes with the same name running in the same host (Erlang nodes can be given short- or full-names. For convenience, eJason only considers the latter possibility). For each agent container in a system, a new Erlang node is started. Besides, in eJason each agent is represented by a different Erlang process. Therefore, given an agent with symbolic name AgentName running in a container with symbolic name NickName in host Host, there exists an Erlang process running in the Erlang node NickName@Host. This process is locally registered as AgentName using the Erlang registry. Note that, conversely, not every locally registered process is an agent. Otherwise, any Erlang process, e.g. runtime system processes, would be perceived as an agent. The guarantees for the message exchange achieved by executing the internal action .send, described in Section 4, derive from the Erlang semantics of their implementation:
• When the action .send(AgentName,Performative, Message,[Reply,Timeout]) is executed, it is first checked whether some process is locally registered as AgentName. If there is none, a request (message) is sent to the distribution manager to determine the container (node) where the agent (process) runs. If the process can be identified, the message is sent. The execution of the action fails otherwise. • When the internal action executed is .send(AgentName[container(Container)],Performative, Message,[Reply,Timeout]), it is first determined whether a connection to the node Container already exists (checking whether Container belongs to a list of connected containers). If the container does not belong to the system a merge operation is attempted (invoking the Erlang function net_adm:ping(Container)). If the merge operation fails, the execution of the action fails as well. In any other case (the connection existed already or it is newly established), the procedure is the same as the one for the previous case. The implementation of the distribution manager relies upon two elements:
JID:SCICO AID:1691 /FLA
16
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.16 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
• The distribution mechanisms provided by Erlang, e.g. the concept of connection among containers is similar to the connection among Erlang nodes. Besides, the merge and split operations carried out by the internal actions .connect_to and .disconnect_from require the execution of the Erlang functions net_adm:ping/1 and erlang:disconnect_node/1, respectively. • An Erlang process locally registered as ejason_distribution_manager. This process provides most of the functionality of the distribution manager component above: – Maintains the local agent registry. Whenever a process for some agent is spawned, the new agent is not created until it is registered by the distribution manager process. This manager first checks whether the name assigned to the new agent will cause a name clash in the system. If it does not, the new process is locally registered using the Erlang function erlang:register/2. This process is much simpler, as it does not require a poll, if the name of the new agent is given by the distribution manager (recall that it occurs when AgentName is a free variable in the call to .create_agent). – Communicates with the distribution managers running in the different nodes of the system in order to determine whether an agent belongs to the system (e.g. to detect a possible name clash or to allow message passing between distributed processes). – Maintains a cache of the latest name resolutions. The entries in this cache (an ordered list of key/value pairs) relate agent names and the containers in which they run (to avoid repeating a poll when the interaction between two agents running in different containers is relatively frequent). In order to ensure the consistency of the caches, the process monitors (using the Erlang function erlang:monitor/2) all the processes corresponding to the agents in its cache entries. Whenever one of these processes dies, its entry is removed from the cache. – Maintains a list with the information necessary to restart all the agents running in the container. This information resumes to the agent name, the path to its source code and the initial beliefs provided during its creation. This information corresponds to the parameters provided to the action .create_agent used to spawn the agent. It will be used during the execution of restarts and revivals of an agent. 6.2.2. Fault tolerance The fault tolerance manager of a container is implemented as a process locally registered with the name ejason_fault_tolerance_manager. This process makes use of the fault tolerance mechanisms of Erlang to provide its functionality:
• The monitoring relation established by the internal action .monitor_agent is implemented by the Erlang function erlang:monitor. Consider the case where an agent a monitors another agent b. When a executes the .monitor_agent(b) internal action, the Erlang process corresponding to a requests the establishment of a monitoring relation to the fault tolerance manager in its container. This manager checks whether the relation already existed. If it did, it does nothing. Otherwise, it invokes the function erlang:monitor/2 and adds a new entry to its records (a list of Erlang tuples) to keep track of the new relation. Upon reception of a so-called message ‘DOWN’ message, from the Erlang runtime system, the manager sends the proper notification to the process of the monitoring agent: – If the information received is the atom noproc, the Erlang process corresponding to the agent b cannot be found in the system. It generates a notification of type unknown_agent. – If the information received is the atom noconnection, the Erlang node corresponding to the container of agent b no longer belongs to the system (it has been disconnected due to any reason). A notification of type unreachable_agent is then sent. – If the information received is the atom restart, the process corresponding to agent b has been restarted by its supervisor (using the function erlang:exit(PidB,restart)). A notification of type restarted_agent is issued. – The reception of any other kind of information (e.g. the atom killed meaning that the Erlang process of agent b has been killed or a tuple providing information about the reason why that process has crashed) means that the monitored agent has died. If it occurs, the fault tolerance manager for the dead agent is contacted in order to discover if the agent is revived, generating a notification of type revived_agent then. Otherwise, a dead_agent notification is issued. The monitoring relation does not always end after the reception of a notification. However, the Erlang monitor is turned off. In order to allow the monitoring relation to persist, the following actions are taken depending on the notification issued: – After the reception of a notification of type restarted_agent or revived_agent, the Erlang function erlang:monitor is executed again to establish a new Erlang monitor. – After the issue of a notification of type unreachable_agent, dead_agent or unknown_agent, a request is sent to the distribution manager of agent a. If some agent b is added to the system, that distribution manager notifies it to the fault tolerance manager. Finally, this manager establishes again an Erlang monitor agent for b and a receives a notification (the belief +agent_up(b)). Note that, using a fault tolerance manager, the monitor relation can be maintained after the Erlang monitors have been turned off. Besides, it will also ease the implementation of agent migration in a future functionality extension.
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.17 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
17
• The establishment of a supervision relation is carried out in a different way to the monitoring relation above. Consider the case where a supervisor agent a belonging to container c1 wants to supervise another agent b that runs in container c2, executing the action .supervise_agents(b). This action will send a supervision request to the fault tolerance manager running in the container c1. This fault tolerance manager will check if the agent b exists and, if it does, will check if it is already supervised by other agent (asking its corresponding fault tolerance manager). If agent
b exists and does not have any supervisor, then a new supervision relation is established. The creation of a supervision relation implies the generation of an Erlang tuple that is stored by the fault tolerance managers of the containers c1 and c2 (i.e. this information is replicated). Hence, if some agent c wants to communicate with the supervisor of agent b (using the proper internal actions), the fault tolerance manager in c2 will resolve who is the supervisor of b and will send the proper messages and signals to this agent. Conversely, anytime that the supervisor agent a wants to send a signal or message to the supervised agent b, the fault tolerance manager in c1 will resolve if a supervision relation exists and will handle the request accordingly. • The signalling mechanism is implemented using the message passing ability of Erlang. Every signal is an Erlang message represented by a tuple whose first element is the atom signal, i.e. it has a special tag. This tag can be used to extract signal messages from a process mailbox before other kind of messages, using Erlang’s pattern matching. 7. Examples In this section we provide some examples showing how the fault tolerance mechanisms introduced in Jason can be used to implement complex behaviours. The semantics of the internal actions that appear in the pieces of code provided are described in Appendices A–C. 7.1. Avoiding race conditions This example shows how some race conditions are removed by using the supervision mechanism implemented in eJason. System: Consider a system composed by the agents alice and bob. The characteristics of this system are:
• alice creates agent bob. • alice must maintain bob alive and create it again anytime it dies. This behaviour is implemented including these plans in the plan base of alice:
!create_bob <.create_agent(bob,"bob.asl"); .monitor_agent(bob). +agent_down(bob)[reason(dead_agent)]
!create_bob <.create_agent(bob,"bob.asl"); .supervise_agents([bob], supervision_policy([revive(always)])). Now, alice is the supervisor of bob and will revive it automatically every time that it dies (as specified by the supervision policy revive(always)). This revival is supported by the runtime system and does not allow name clashes with the revived agent. Therefore, the race condition cannot occur and bob is revived.
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.18 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
18
7.2. Overcoming divergence This example shows how the supervision mechanism can be used to overcome the divergence of an agent without killing or restarting it. System: Consider an agent bob that possesses the following rule in its belief base:
even(2):true. even(X) :X mod 2 == 0. even(X) :even(Y) & X = Y +2. This rule enables the generation of all positive even numbers (an infinite set). Consider then a plan that uses this rule in its context:
+!plan : even(X) &
X < 0 <- .print(X).
This plan is not applicable under any circumstances (its context cannot be matched because there is no positive even number lower than zero). Whenever the resolution of the plan context is attempted, an infinite loop is executed. Hence bob will be alive but totally unresponsive to stimuli (i.e. it diverges). Problem: The resolution of the conditions in a plan context, as well as test goals, often requires the resolution of some rules in the belief base of the agent. These rules can be used to derive new knowledge [8]. However, their resolution can provoke infinite loops that, in turn, cause the divergence of the agent. The identification of a possible source of an infinite loop is not always as trivial as in this case. Solution: The supervision mechanism can be used to overcome this divergence. For instance, consider an agent alice that supervises the agent bob. alice tests the divergence of bob every 10 seconds (Frequency), granting bob 25 seconds (PingTime) to respond to ping signals. If bob fails to answer to two consecutive (MaxPings) ping signals, alice sends an unblock signal to bob. The reception of this unblock signal stops the resolution of the plan context, hence breaking the infinite loop. If no other applicable plans for the event being processed were found, the corresponding plan failure event is generated. This supervision policy is implemented using the following internal action (which alice eventually executes):
.supervise_agents(bob, supervision_policy([ping(10000,25000,1), unblock(always)])). 7.3. Overwriting the default plans for signals This example shows how the predefined supervision behaviours (the default plans for signals) can be overwritten as desired. System: Consider a system that contains the agents alice, bob and supervisor. The characteristics of this system are:
• The agent supervisor supervises bob. The supervision policy consists in the revival of bob every time it dies (i.e. supervisor has executed the following internal action, related to the supervision mechanism: .supervise_agents (bob,supervision_policy([revive(always)]))). • The agent alice is not aware of the supervision relation between bob and supervisor. • The agents alice and bob interact frequently. When alice suspects that bob diverges, alice sends ping and unblock requests to the supervisor of bob. • The agent bob sometimes executes high-priority actions whose computation requires a considerable amount of time. During these executions, bob seems to diverge, but it is not. These computations are of considerable relevance and, thus, they should not be interrupted.
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.19 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
19
Problem: The different unblock requests sent by alice shall not interfere with the computations of bob. By default, if alice executes an unblock request for bob, the supervisor of bob executes an unblock operation. This operation stops the execution of a high-priority action. Besides, alice shall not consider that bob diverges (after executing a ping request) when it is executing a high-priority action. Solution: The solution then requires the overwriting of the default plans for ping and unblock requests. This overwriting must be included in the code of the supervisor agent:
+supervision_ping_request(bob,PingTime,Result)[source(alice)]: bob_state(high_priority)
.send(supervisor,tell,bob_state(high_priority); !execute_high_priority_action; .send(supervisor,untell,bob_state(high_priority)); 7.4. Fault tolerant client–server system This example shows how the monitoring mechanism of eJason can be used to implement a fault tolerant instance of a classical client–server system. It addresses issues such as the identification of different degradations of the execution state of agents (deaths, restarts) as well as the recognition of agents after network disconnections. System: Consider a classic system composed by a server agent, also named server, that processes requests from an arbitrary number of (client) agents. These agents are distributed in a number of agent containers that varies over time (i.e. there can be split and merge operations). The characteristics of this system are:
• Each client must start a (stateful) connection session with the server before it can start sending requests. If a session for that client already exists, the state of the session is not restarted.
• A state is maintained by the server for each session. The state of a session can change after the processing of each of the requests corresponding to that session.
• A session terminates automatically if the client or server agent die, even if they are restarted or revived. A session must persist after disconnections of clients. However, a session can be temporally suspended (e.g. for efficiency issues) and get reactivated once the connection is restored. • The service provided by server can be unavailable at most 10 seconds before it is restarted. Besides, if the response time of server grows over 3 seconds, it must also be restarted. Problem: There are several problems to be solved: (a) The server and client agents must be aware of the deaths of each other in order to terminate the sessions. (b) The disconnection from a client does not terminate a session. (c) After a reconnection, the server must recognise the client (i.e. check whether the agent is the same agent that started the session or if it is a new agent that uses the same name) as it is possible for an agent to be restarted or revived after a disconnection. (d) The clients need to be aware of a reconnection to the server. It is possible for the server to have been restarted or revived during the time of disconnection. Therefore, they must request the start of a session when the reconnection happens. (e) Some requirements have been imposed on the server regarding to its minimum availability and its maximum response time.
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.20 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
20
!start. //#supervisor_plan#1 +!start<.supervise_agents(server, supervision_policy([ping(2000,3000,1,low_priority), unblock(never), restart(always), revive(always)])).
Fig. 6. Code for the agent “supervisor”.
//server_plan#1 +!start_session(ClientID)[source(Client)]
//#server_plan#4 +!process_start(Client,ClientID, same_consumer): get_session_state(Client,State) <.send(Client,tell,state(State)). //#server_plan#5 +!request(ClientID,Info)[source(Client)] get_session_state(Client,State)
Fig. 7. Code for the agent “server”.
Solution: The system above can be implemented using the code provided in Figs. 6, 7 and 8. For simplicity, we omit the code that spawns the clients and the possible supervision mechanisms used on them. The different problems are solved in the following way: (a) This problem is solved using the monitoring mechanism of eJason. The server agent monitors all clients that have started a session (server_plan#1)) and automatically removes the sessions of the clients that die, are restarted or are revived (server_plan#7)). It avoids faults like the association of a session to the wrong agent (e.g. if an agent name is reused) or wasting resources to maintain the state for the session of a dead agent. Besides, the client agents monitor the server (client_plan#1). Therefore, they can detect the death (client_plan#4), restart or revival (client_plan#5) of the server and react accordingly. (b) This problem is solved using the monitoring mechanism again. If a client is disconnected from the server, the server suspends the session (server_plan#6) of that client. (c) This problem is more complex. The server must recognise a client after they have been disconnected for certain amount of time. The name of the client is not enough to recognise it, as a restart or revival could have occurred while they were disconnected (hence the session for this client shall be terminated, not resumed). The recognition is implemented using a random identifier (client_plan#1) for each client that is generated when it is created (it changes after a restart or revival). This random identifier is sent to the server along with the request of the start of a session (client_plan#1). The server checks whether the random identifier corresponds to a client that already started a session (server_plan#1). If it does, that session is resumed (server_plan#4). Otherwise, a new session is created (server_plan#3). (d) This problem is solved using the monitoring mechanism. The clients monitor the server. Hence, receive a notification when the connection to the server is restored (client_plan#5). (e) This problem is solved using the supervision mechanism. An agent supervisor supervises the server (supervisor_plan#1). The requirements are implemented using a supervision policy that fulfils them. The server is pinged every 2 seconds (Frequency=2000) with a ping time of 3 seconds (PingTime=3000). Two consecutive ping failures (MaxPings=1), i.e. 10 seconds of unavailability, trigger a restart of the server (unblock(never) and
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.21 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
21
!start.
//#client_plan#1 +!start <.random(MyID); +my_id(MyID); .monitor_agent(server, demonitor(dead_agent)); .send(server,achieve, start_session(MyID)).
//#client_plan#2 +state(NewState)[source(server)]: my_id(MyID) <-+state(NewState); !do_some_task(NewState, Info); -+last_info(Info); .send(server,achieve, request(MyID,Info).
//#client_plan#3 +agent_down(server)[reason(Reason)]: .member(Reason,[restarted_agent, revived_agent]) <.send(server,achieve, start_session(MyID)).
//#client_plan#4 +agent_down(server)[reason(dead_agent)]: my_name(Name) <.kill_agent(Name).
//#client_plan#5 +agent_up(server) <.send(server,achieve, start_session(MyID)).
Fig. 8. Code for the client agents.
restart(always)). Besides, as the response time of the server must also be tested (to detect an overload), the option low_priority is used. Finally, the server is revived every time it dies (revive(always)). 8. Conclusion and future work In this paper we have described an extension to the Jason multi-agent systems programming language, which provides a new distribution model and constructs for fault detection and fault recovery. A preliminary version of these extensions was presented in [13]. Our implementation of Jason in Erlang – eJason – contains a prototype implementation of the extension. The addition of a proper, runtime supported, distribution model to Jason systems, removes the need of interfacing to third-party software such as e.g. JADE. The addition of fault detection and fault tolerance mechanisms to Jason addresses one of the key issues in the development of robust distributed systems. These mechanisms allow the detection of failures in a multi-agent system caused by malfunctioning hardware (computers, network links) or software (agents). Moreover, they enable the implementation of complex behaviours for fault recovery in a declarative, high-level manner. The distribution model and fault tolerance mechanisms of eJason are inspired by Erlang, an actor based language which is increasingly used in industry to develop robust distributed systems, in part precisely because of the elegant approach to fault detection and fault tolerance. Even though our extensions are inspired by Erlang, they are more capable. The agents can be restarted or revived while the supervision and monitoring relations are maintained. Moreover, these restart and revival operations provide some strong guarantees, like the avoidance of race conditions generated by name clashes. The observational power of the supervision and monitoring mechanisms has been increased. They allow the detection of restarts, revivals and (re)appearances of agents. Besides, eJason supports the detection of the divergence and the overload of agents, i.e. agents that are still alive but are no longer contributing useful results. Another useful capability for agents is the exchange of messages addressed to the supervisor of an agent (i.e. where the recipient of a message is specified as a role, not an agent name). This allows the supervision trees to be transparent to the programmer. Finally, these supervision trees are more dynamic than Erlang supervision trees. Supervision relations can be started and terminated while the agents involved do not necessarily die as a consequence. Furthermore, the predefined supervision behaviours can be customised when necessary. The lines of future work are many. First, we need to complete the implementation of eJason, with regards to environment (perception of external events, execution of external actions, etc.) handling, with all its related functionality. It will enable, among others, new operations related to fault tolerance. For instance, the agents will be able to store part of their state in a database to preserve it across failures. Besides, we plan to add support for broader principles of multi-agent systems like, e.g., agent mobility to allow the migration of agents between different agent containers and the introduction of organisational schemas that enable governance in normative multi-agent systems. Then, we should develop at least a “semi-formal” semantics for these extensions of Jason, describing exactly the behaviour of the internal actions (sending, monitoring, etc). 
Apart from that, we need to evaluate this extension on further examples, to ensure that the new internal actions have sufficient expressive power, and are integrated well enough in Jason, to permit us to design distributed multi-agent systems cleanly and succinctly, and in the general spirit of BDI reasoning systems.
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.22 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
22
We must address security within Jason systems. The mechanisms introduced in this work do not take this into consideration. Issues like agent privacy, by which an agent could prevent being itself monitored or supervised by other agents (e.g. adversaries or those perceived as ill-intentioned) will be considered. Otherwise, a third-party agent may be reluctant to enter a system with distribution and supervision schemas like the ones described before. Finally, we plan to permit the interoperability between the agents running in eJason and agents belonging to other (e.g., JADE based) agent platforms. We will study how to best interface eJason with other existing agent platforms. Appendix A. Distribution API Jason agents communicate with each other using internal actions. Thus, to support distribution, we add support for (distribution transparent) communication in existing internal actions such as, e.g., .send, as well as adding a few new internal actions that allow the discovery and manipulation (split/merge) of the structure of the system. A.1. Container connection To add new agent containers to certain system, the following internal action can be executed:
• .connect_to(Container) When an agent in some system m1 executes this action, a merge operation is attempted. This operation will try to merge the system m1 and the system m2 , to which the container Container belongs. Therefore, a new system m3 can arise. In the case that m1 = m2 , i.e., when a connection to an already connected container is attempted, the execution of the internal action succeeds. The execution of the internal action will fail if the merge operation cannot be completed (see the conditions for a merge operation in Section 4.1.2) or if Container cannot be found (which includes the case when Container is an unbound variable). Conversely, an agent container AC in a system m1 can get detached from it, or ensure that it does not belong to the same system m2 that some agent container belongs to by the execution of the internal action:
• .disconnect_from(Container) This execution will fail if Container is a free variable. Otherwise, it will succeed and the possibilities are the following:
• If Container is the name of the container in which the agent executing the action runs (i.e. AC = Container) or if Container and AC are interconnected (i.e. m1 = m2 ), AC will be disconnected from every other container and, therefore, will be part of a new system with only one container (itself).
• If Container is not connected to AC , AC will not get disconnected from any other container and the execution of the action will succeed, even if no agent container with name Container can be found. An agent can discover which containers (apart from its own) belong to the system using the internal action:
• .containers_list(ContainerList) The variable ContainerList will be bound to a list with the names of the containers belonging to the system, excluding the one corresponding to the agent that executes the action. Note that, if the system is composed by only one container, the list will be empty. A.2. Agent communication The communication between agents in Jason requires the use of the internal action (a) .send(AgentName,Performative,Message [Reply,Timeout]), where the parameters in brackets are optional and AgentName is the symbolic name of the agent receiving the message. The parameters Reply and Timeout can only be used when the performative is of type ask. We omit the exact description of their meaning from this article as they are present with the same semantics in “non-distributed Jason”; see [8] for details. The symbolic name must uniquely identify one agent within the system. Thus the above internal action can be used only to communicate agents whose containers are already interconnected. Therefore, to permit communication between agents belonging to different systems (i.e. whose agent containers are not interconnected) the distributed extension of Jason provides a new internal action:
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.23 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
23
(b) .send(Address,Performative,Message, [Reply,Timeout]) where the parameter Address is an annotated atom with the structure AgentName[container(Container)] and Container is the name of the container in which the agent with symbolic name AgentName may run. The guarantees3 provided by the aforementioned internal actions differ:
• If the execution of (a) succeeds, the membership of the receiver to the system is guaranteed, i.e. the address of the receiver can be resolved. However, it does not ensure that the receiving agent actually receives the message or that it considers the message socially acceptable (i.e. considered suitable for processing, cf. not socially acceptable messages are automatically discarded and do not generate any event, see [8]) nor that it has a plan which can be triggered by the reception of the message. The execution will fail if there is no agent in the system whose symbolic name is AgentName. • The execution of (b) depends on the value of the variable Container of the structure Address. The possible cases are: – If Container is an agent container that already belongs to the system, the guarantees are the same of (a). Note that, in this case, the execution is independent to whether the agent AgentName runs in the container with name Container or not. – If Container is an agent container that does not belong to the system, a merge operation is attempted (as if .connect_to(Container) had been invoked). If the merge operation is not successfully completed, the execution will fail. Otherwise, the guarantees are those of (a). Note that even if the execution of the internal action fails (if AgentName does not correspond to any agent in the system resulting from the merge), a merge operation successfully completed is not rolled back. A.2.1. Agent creation and destruction Agents can create agents in a named container using the internal action
• .create_agent(AgentName,Source,[InitialBeliefs]) where the parameter in brackets is optional. If Container is not provided, using an Address structure, the new agent is created in the same container as the agent executing the internal action. If AgentName is a free variable, the runtime system will provide a (unique) name for the agent and will bind this variable to it. As in [8], the parameter Source indicates the implementation of the agent (its plans, initial goals and initial beliefs). Finally, InitialBeliefs is a list of beliefs that should be added to the set of initial beliefs of the new agent upon creation. This internal action will succeed if (1) the named container exists in the system, and (2) there is no other agent with symbolic name AgentName already in the system, and (3) Source correctly identifies an agent implementation. An agent can also explicitly kill other agents. The following internal action allows this:
• .kill_agent(AgentName). The parameters and the meaning of their absence are analogous to those above. The execution of this internal action does not fail, even if the agent to terminate does not exist. A.3. Agent discovery Any agent can check whether some other agent with name AgentName belongs to its system executing the internal action:
• .find_agent(AgentName). The execution of this action will fail if AgentName does not identify an agent in the system or if AgentName is unbound. A.4. Container name discovery An agent can discover the name of its own container by executing the internal action
• .my_container(Var) 3 We encourage the reader to read the content at http://www.erlang.org/faq/academic.html#id54296 to get a feel of how useless and inefficient the imposition of strong guarantees on message delivery can be for distributed systems.
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.24 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
24
If the variable Var is unbound, the execution of the internal action does not fail and, as a result, Var will be bound to the name of the container of the agent. If Var is bound to any value different from the name of the container of the agent executing it, the internal action will fail, otherwise it succeeds. An agent can discover the container in which some other agent is running using the internal action:
• .container_of_agent(AgentName, Container) This internal action will try to bind the variable Container to the name of the container of the agent AgentName. This action will fail if AgentName does not belong to the system or if Container is already bound to some value different from the name of the container. Appendix B. Monitoring API An agent can request the set up of a monitoring relation executing the internal action:
• .monitor_agent(AgentName [, Configuration]) where the parameter AgentName is the symbolic name of the monitored agent. The parameter in brackets is optional. The variable Configuration is either the structure demonitor(Options) or persist(Options). Options must be a list whose elements are a subset of {unknown_agent, dead_agent, restarted_agent, revived_agent and unreachable_agent} or the atom any, equivalent to a list of all these five values. If the structure demonitor(Options) is used, the monitoring relation is automatically disabled after the notification of any of the values in Options (see below). If the structure persist(Options) is used, the monitoring relation persist only after the notification of any of the values in Options, being disabled otherwise. If Configuration is not given, its default value is persist(any). This action is executed by the monitoring agent. The execution of this internal action never fails. Similarly, a monitoring relation can be disabled by the monitoring agent executing the action:
• .demonitor_agent(AgentName). The execution of this action never fails. The fault notifications related to a monitoring relation are sent by the runtime system to the monitoring agent. They are represented by a new belief of the form:
• +agent_down(AgentName) [reason(RType)] along with the corresponding belief addition event. The possible values of RType are unknown_agent, dead_agent, restarted_agent, revived_agent and unreachable_agent. Their meaning corresponds to the notifications (1) to (5) in Section 5.2, respectively. Besides, the notification of the (re)apparition (notification (6) in Section 5.2) of a monitored agent is represented by the new belief:
• +agent_up(AgentName) Appendix C. Supervision API A supervision relation is always set up by the supervisor agent. This agent executes the internal action:
.supervise_agents(SupervisedSet[,supervision_policy(Options)]). The possible values for SupervisedSet are:
• If it is an atom AgentName, a supervision relation between the agent executing the action and the agent with name AgentName is established. The execution of this action will fail if AgentName is a free variable, if this name does not correspond to any agent in the system or if the agent AgentName is already supervised by any agent. • If it is a structure agent(AgentName,Code,InitialBeliefs), the creation of the supervised agent is attempted first. It is equivalent to the atomic, consecutive execution of the internal actions .create_agent(AgentName,Code, InitialBeliefs) and .supervise_agents(AgentName). If the execution succeeds, the supervised agent is created and the new supervision relation is created. Otherwise, neither of those happened.
• If it is a list List (whose elements are of the form of either of the two possibilities above), the same supervision policy is applied to all the agents in the list as they belong to the same supervised set. The execution of this internal action never fails. Note that this does not imply that any supervision relation is actually created.
JID:SCICO AID:1691 /FLA
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.25 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
25
The second parameter is optional and must be a structure that contains a list of options. These options specify the supervision policy that will be executed. This policy is defined by the elements of the list:
• ping(Frequency,PingTime,MaxPings) and ping(Frequency,PingTime,MaxPings,low_priority)
•
•
•
• •
•
•
• •
define the ping policy, i.e. the frequency with which the possible divergence or overload of the supervised agents will be checked using ping signals. The possible values of the structure are: a non-negative integer Frequency, a non-negative integer PingTime a non-negative integer MaxPings and an optional atom low_priority (to use low priority ping signals). unblock(MaxUnblocks,UMaxTime) specifies the unblock policy. The arguments in the structure are: a non-negative integer MaxUnblocks and a non-negative integer UMaxTime. UMaxTime can also be the atom infinity. This policy allows the execution of a maximum of MaxUnblocks unblock operations in UMaxTime milliseconds (for the same agent). restart(MaxRestarts,RMaxTime) specifies the restart policy. The elements in the structure are: a non-negative integer MaxRestarts and a non-negative integer RMaxTime. RMaxTime can also be the atom infinity. This policy allows the execution of a maximum of MaxRestarts restart operations (due to a possible divergence) in RMaxTime milliseconds. If this threshold is exceeded, the agent is killed and the supervision relation is terminated, hence disabling any revival policy. revive(MaxRevive,ReviveMaxTime) specifies the revival policy. The elements in the structure are: a non-negative integer MaxRevive and a non-negative integer ReviveMaxTime. ReviveMaxTime can also be the atom infinity. This policy allows a maximum of MaxRevive restarts (due to the death of the agent) in ReviveMaxTime milliseconds. It this limit is exceeded, the supervision relation is terminated and the agent is not revived. no_ping specifies the absence of all three ping, unblock and restart policies. Therefore, the divergence of AgentName will not be periodically checked. Notice that, without a ping policy, its threshold is never exceeded, hence the other two can never be executed. unblock(never) is equivalent to unblock(0,infinity). It specifies that no unblock signal shall be sent automatically. Conversely, unblock(always), equivalent to unblock(1,0), specifies that unblock signals shall always be sent when the threshold of the ping policy has been exceeded. Note that, this latter one disables the restart policy, as the unblock threshold cannot be exceeded. restart(never) is equivalent to restart(0,infinity). It specifies that no restart signal shall be sent, i.e. the supervised agent is never restarted. Conversely, restart(always), equivalent to restart(1,0), specifies that restart signals shall always be sent when the threshold of the unblock policy has been exceeded, i.e. the supervised agent is always restarted when the unblock policy does not resolve its divergence. revive(never) is equivalent to revive(0,infinity). It specifies that the supervised agent shall never be revived when it dies. Conversely, revive(always), equivalent to revive(1,0), specifies that the supervised agent is always restarted when it dies. Note that this death does not include the case where the supervisor terminates an agent because it has exceeded the limit imposed by its restart policy. strategy(Strategy) specifies the restart strategy. The possible values for the variable Strategy are the atoms one_for_one, one_for_all and rest_for_one. For convenience, we also provide default values for the different policies as well as for the restart strategy: – If no ping, unblock or restart policy is specified, the default option is no_ping. 
– If at least one ping, unblock or restart policy is specified, the default values for the policies not specified are: ping(15000,2000,1), unblock(3,infinity) and restart(always). – The default revival policy is revive(never). – The default restart strategy is strategy(one_for_one). Finally, note that if more than one policy of the same type is included, it is only considered the one closer to the head of the list, being the rest ignored. A supervision relation can be terminated by the supervisor agent by executing the internal action:
• .unsupervise_agents(Agents) The parameter Agents corresponds to either the name of an agent or a list of agents. If these agents belong to some supervised set, they are removed from the set. The supervision relation is terminated if and only if there are no more agents left in the supervised set. The execution of this action always succeeds. C.1. Supervisor agent abilities The following actions can only be executed by a supervisor agent and will immediately fail if the parameter AgentName is unbound or bound to the name of some agent that is not supervised by the agent executing it:
• .ping_agent(AgentName, PingTime, Res)
JID:SCICO AID:1691 /FLA
26
[m3G; v 1.123; Prn:29/01/2014; 14:56] P.26 (1-28)
Á. Fernández-Díaz et al. / Science of Computer Programming ••• (••••) •••–•••
Invoking this action, a ping signal is sent to the agent AgentName. The parameter PingTime must be either a positive integer or the atom infinity (otherwise the execution fails). The parameter Res will be bound to either pong, if the supervised agent responds to the ping signal within the specified time interval or pang otherwise. To send low priority ping signals the internal action to be executed is:
• .ping_agent(AgentName, PingTime, low_priority, Res) Given an appropriate value for PingTime (the use of the value infinity is discouraged, as it can provoke the divergence of AgentName to carry over into the supervisor agent), if Res is bound to pang, it can be considered that the supervised agent with name AgentName diverges.
• .unblock_agent(AgentName) If a supervisor agent invokes this action, the agent AgentName receives an unblock signal.
• .restart_agent(AgentName, RestartTime) When a supervisor agent invokes this action, the supervised agent receives a restart signal. The default plan for this kind of signals is the following:
+supervision_restart_signal(RestartTime)[source(signal)]: true <.my_name(MyName); .kill_agent(MyName). This plan can be easily overwritten. It suffices to include a plan with the same trigger in the plan base. The parameter RestartTime must be either a positive integer or the atom infinity. This parameter specifies the time (in milliseconds) that the supervisor agent will grant AgentName to process the restart signal (hence terminating itself). If AgentName does not terminate itself within the time interval given, it is (brutally) terminated (as if the supervisor agent called .kill_agent(AgentName)). Again, the use of an infinite restart time is discouraged.
• .start_info(AgentName, Container, Code, InitialBeliefs) This action enables a supervisor agent to obtain the information that was last used to create the supervised agent with name AgentName. A successful execution will match the variable Container to the name of the container where the agent was created, the variable Code to the path of its Jason source code and the variable InitialBeliefs to the set of initial beliefs added to the belief base of the agent. These values correspond to those provided to start the agent, i.e. the parameters of the internal action .create_agent(AgentName[container(Container)],Code,InitialBeliefs). C.2. Discovering supervision relations Any agent can discover the name of its supervisor agent executing the internal action:
• .my_supervisor(AgentName) The execution of this action will try to bind the variable AgentName to the name of the supervisor of the agent executing the action. If it has no supervisor agent, AgentName will be bound to the atom no_supervisor. As usual, if AgentName is not a free variable and its value is different from the one that the action tries to assign to it, the execution fails. A supervisor agent can obtain a list of its supervised agents (if any) executing the action:
• .my_supervised(AgentList) If AgentList is a free variable, it will be bound to a (possibly empty) list of lists, where each is of the form [SupervisedSet,SupervisionPolicy]. Each list corresponds to a different supervision relation. SupervisedSet is the list of the names of the agents in the supervised set. SupervisionPolicy is the supervision structure applying for the supervised set. The list AgentList is not ordered. The execution will only fail if AgentList is not free and its value differs from the list of supervised agents. Notice that, as there is no guarantee in the name ordering returned in the list, calling this action providing a bound variable should be avoided.
C.2.1. Communicating with the supervisor agent

The following internal actions are available:
• .request_to_supervisor(AgentName, Request, Result) When an agent with name SomeAgent executes this action, a signal is sent to the supervisor of the agent AgentName. If this agent has no supervisor, the value of the variable Request is ignored and Result is bound to the atom no_supervisor. Otherwise, the result of the execution varies depending on the value assigned to Request (as usual, if Request is a free variable or has some value other than the ones below, the execution fails):
– If Request is either a structure ping(PingTime) or ping(PingTime, low_priority), or the atom ping, a ping_request signal is sent to the supervisor agent. The default plans for ping_request signals are:
+supervision_ping_request(AgentName, Result)[source(SomeAgent)]
  <- .ping_agent(AgentName, 5000, Result).

+supervision_ping_request(AgentName, PingTime, Result)[source(SomeAgent)]
  <- .ping_agent(AgentName, PingTime, Result).

+supervision_ping_request(AgentName, PingTime, low_priority, Result)[source(SomeAgent)]
  <- .ping_agent(AgentName, PingTime, low_priority, Result).

Note that, if PingTime is not specified in the call, its default value is 5 seconds. Note also that the variable Result is only bound after the execution of the .ping_agent internal action.
– If Request is the atom unblock, an unblock_request signal is sent to the supervisor. The default plans to handle unblock_request signals are:
+supervision_unblock_request(AgentName, Result)[source(SomeAgent)]
  : .ping_agent(AgentName, 5000, pong)
– If Request is either the atom restart or a structure restart(RestartTime), a restart_request signal is sent to the supervisor. The default plans to handle restart_request signals are:

+supervision_restart_request(AgentName, Result)[source(SomeAgent)]
  <- .restart_agent(AgentName, 10000);
     Result = done.

+supervision_restart_request(AgentName, RestartTime, Result)[source(SomeAgent)]
  <- .restart_agent(AgentName, RestartTime);
     Result = done.

Note that, by default, RestartTime is 10 seconds. Note also that this internal action makes use of the signalling mechanism of eJason; therefore, its use should be restricted to dealing with the divergence of agents. An abusive use of these fault tolerance mechanisms may interfere with the traditional semantics of Jason.
• .tell_supervisor(AgentName, Message) When the agent named SomeAgent executes this action, a normal message is sent to the supervisor of the agent AgentName, i.e. the belief base of that supervisor is updated with a new belief Message[source(SomeAgent)]. This internal action allows any agent to communicate with the supervisor of another agent (or even with its own supervisor, when AgentName equals SomeAgent) without knowing that supervisor's name, using standard message passing. A combined usage sketch of the actions in this section is given below.
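To illustrate these actions, the following plan sketches how an agent that suspects a peer has stopped making progress could first notify that peer's supervisor and then ask it to run a low-priority ping on the peer. The plan is only an illustration: the belief suspect_divergence and the message content possible_divergence are hypothetical names, not part of eJason.

// Reacts to the (illustrative) belief that agent Ag seems to diverge.
+suspect_divergence(Ag)
  : true
  <- // Inform Ag's supervisor, whoever it is, via a normal message.
     .tell_supervisor(Ag, possible_divergence(Ag));
     // Ask the supervisor to test Ag with a low-priority ping of 5 seconds.
     .request_to_supervisor(Ag, ping(5000, low_priority), Result);
     .print("Ping request for agent ", Ag, " returned: ", Result).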