Coordinated behavior of cooperative agents using deep reinforcement learning


Elhadji Amadou Oury Diallo*, Ayumi Sugiyama, Toshiharu Sugawara
Department of Computer Science and Communications Engineering, Waseda University, 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan
*Corresponding author. E-mail address: [email protected] (E.A.O. Diallo).
Neurocomputing (in press), https://doi.org/10.1016/j.neucom.2018.08.094

Article history: Received 23 March 2018; Revised 16 July 2018; Accepted 17 August 2018; Available online xxx
Keywords: Deep reinforcement learning; Multi-agent systems; Cooperation; Coordination

Abstract

In this work, we focus on an environment where multiple agents with complementary capabilities cooperate to generate non-conflicting joint actions that achieve a specific target. The central problem is how several agents can collectively learn to coordinate their actions so that they complete a given task together without conflicts. Sequential decision-making under uncertainty is one of the most challenging issues for intelligent cooperative systems. To address this, we propose a multi-agent concurrent learning framework in which agents learn coordinated behaviors in order to divide their areas of responsibility. The proposed framework extends recent deep reinforcement learning algorithms such as DQN, double DQN, and dueling network architectures. We then investigate how the learned behaviors change according to the dynamics of the environment, the reward scheme, and the network structure. Next, we show how agents behave and choose their actions such that the resulting joint actions are optimal. Finally, we show that our method leads to stable solutions in our specific environment.

1. Introduction

Multi-agent systems [1-4] are important because many systems can be modeled as collections of agents to cope with the limitations of the processing power of a single agent. Such systems profit from the advantages of distributed systems, such as robustness, parallelism, and scalability [5]. Many real-world systems are achieved by collective and cooperative effort. The need for collaboration between agents, which are intelligent computational entities with specialized functionality, becomes even more evident in examples such as traffic control [6-8], task allocation [9-12], ant colonies [13,14], time-varying formation control [15,16], and biological cells [17]. Because of their wide applicability, multi-agent systems (MASs) arise in a variety of domains including robotics [18], distributed control [19], telecommunications [20-22], and economics [23,24].




The common pattern among all of the aforementioned examples is that the system consists of many agents that wish to collectively reach a certain global goal or individual goals. While these agents can often communicate with each other by various means, such as observing one another and exchanging messages, decision making in an intelligent MAS is challenging because the appropriate behavior of one agent is inevitably influenced by the behaviors of others, which are often uncertain and not observable. In short, the goal can only be reached if most of the agents work together, while self-interested agents should be prevented from ruining the global task for the rest. The main questions related to multi-agent learning are "How much cooperation is required and can be achieved by the agents?" and "How can agents learn which action they should perform under given circumstances?" On one end, we have independent learners that try to optimize their own behavior without any form of communication with the other protagonists, using only the feedback received from the environment. On the other end, we have joint learners, where every agent reports every step it takes to every other agent before proceeding to the next step.

Multi-agent learning [25-27] is a key technique in distributed artificial intelligence. As expected, computer scientists have been working on extending reinforcement learning (RL) [28] to multi-agent systems to identify appropriate behavior in complex systems. Markov games [29] have been widely recognized as the prevalent model of multi-agent reinforcement learning (MARL). In general, there are two types of learning in multi-agent systems: centralized learning (done by an independent agent on its own) and distributed collective learning (done by the agents as a group). Modeling multi-agent systems is a complex task due to the environmental dynamics, the action and state spaces, and the types of agents.


In fact, many real-world domains have very large state spaces and complex dynamics, requiring agents to reason over extremely high-dimensional observations, which makes optimal decision making in a MAS intractable. Not long ago, the so-called deep Q-network (DQN) achieved unexpected results by playing a set of Atari games while receiving only visual states [30,31]. The same neural network architecture was used to learn different Atari games, and in some of them the algorithm performed even better than human players.

Multi-agent deep reinforcement learning (MADRL) is the learning technique of multiple agents trying to maximize their expected total discounted reward while coexisting within a Markov game environment whose underlying transition and reward models are usually unknown or noisy. Formally, agents use neural networks with a large number of layers as function approximators to estimate their action values. In MADRL, the optimal policy of an agent depends not only on the environment but also on the policies of the other agents. Undeniably, cooperativeness in multi-agent learning is a problem of great interest within game theory and artificial intelligence. Notwithstanding the success of deep RL, there has not been sufficient research to extend these techniques to a multi-agent context, except for some specific studies [32-35].

2. Related work

Research in the (multi-agent) RL field has been growing rapidly over the last decade. We focus only on the most related and recent work, specifically on MADRL and multi-agent coordination. It has been demonstrated that using neural networks can lead to good results when playing complex games from raw images [36]. However, the proof of convergence no longer holds: using neural networks to represent the Q-values is known to be unstable. Several ideas to make DQN more stable and robust have recently been introduced. In DQN, experience replay [31,37] lets online RL agents remember and reuse experiences from the past. In general, experience replay can reduce the number of transitions required to learn by replacing them with more computation and memory, which are often cheaper than the interactions between RL agents. Prioritized experience replay [37,38] investigates how prioritizing which transitions are replayed can make experience replay more efficient and effective than uniformly sampling all transitions.

Leibo et al. [39] analyzed the dynamics of policies learned by multiple independent learning agents, each of which used its own DQN. They introduced the concept of sequential social dilemmas in fruit-gathering and wolfpack-hunting games and showed how behavior can emerge from competition over shared resources in conflicting situations. He et al. [40] presented a neural-based model that jointly learns a policy and the behavior of its opponent and found that opponent modeling is often necessary for competitive multi-agent systems. They then proposed a method that automatically discovers different strategy patterns of the opponents without any extra supervision from the system. Tampuu et al. [41] used independent DQNs to investigate cooperation and competition in a two-player pong game. They demonstrated that agents controlled by autonomous DQNs are able to learn a two-player pong game from raw sensory data. Their agents independently and simultaneously learn their own Q-functions. The competitive agents learn to play and score efficiently, while the collaborative agents find optimal strategies to keep the ball in the game as long as possible.
Weiss [5] proposed two algorithms, ACtion Estimation (ACE) and Action Group Estimation (AGE), in which reactive agents learn to coordinate their actions: the agents collectively learn to estimate the goal relevance of their actions and, on the basis of these estimates, generate appropriate joint action sequences.

Dimopoulos and Moraitis [42] studied the multi-agent coordination problem in two different settings: in the first, the agents partially achieved their subgoals by themselves without any communication, while in the second, the agents had to communicate with each other in order to achieve their subgoals.

Contribution. Work on applying deep RL in cooperative multi-agent contexts has been limited to independent, homogeneous, or heterogeneous team learning with a centralized learning algorithm. In this paper, we use concurrent learning, an alternative to team learning in cooperative multi-agent systems in which each agent attempts to improve its own part of the team. We show how agents can dynamically learn cooperative and coordinated behavior using deep RL in a cooperative MAS [43]. Our agents have individual goals that they should achieve themselves by jointly taking their actions. We also investigate how the learned behavior changes according to the environment dynamics, including reward schemes and learning techniques. This paper draws partially from content originally published in Diallo et al. [35], where experimental results using naive DQN [30,31] were reported. Here, we additionally implement different deep reinforcement learning techniques, including double DQN [44,45], dueling network architectures [46], and dueling architectures combined with double DQN, within the same proposed framework. We then ran a few more experiments and compared the new results with those in Diallo et al. [35]. The results provide new insights into how agents behave and interact with each other in complex environments and how they coherently choose their actions such that the resulting joint actions are optimal.

3. Problem and background

3.1. Problem

The pong game is an arcade-style table tennis video game where two paddles move to keep a ball in play. It has been studied in a variety of contexts as an interesting RL domain. In a pong game, agents can easily last 10,000 timesteps, compared to other games such as gridworld or predator-prey, where a game lasts at most 200-1000 timesteps, and its observations are also complex, containing the players' scores and side walls. It has been shown that fully predicting a player's paddle movement requires knowledge of the last 18 actions [47]. Each agent corresponds to one of the paddles situated on the left and right sides of the board. Each agent can take three (3) actions: move up, move down, and stay at the same place. The game ends when one side reaches the maximum score of 21.

We developed an environment where we can easily increase the number of players and modify the dynamics of the game to explore agents' strategies (Fig. 1). This environment is a reactive multi-agent system in which there is continuous interaction between the agents and their environment. Our agents have limited information about the dynamics of the game. Two agents (the players on the right in Fig. 1) form a team and try to defeat the left player, a sophisticated hard-coded agent, by learning cooperative and coordinated behavior. The hard-coded agent can move faster than our agents, but it may occasionally lose the ball. The state transition model and reward model are deterministic functions, so this environment can be thought of as deterministic: if an agent in a given state repeats a given action, it will always reach the same next state and receive the same reward.
Our purpose is to investigate whether agents can learn to dynamically and appropriately divide their areas of responsibility and close the gap between them, as well as to avoid collisions, by using MADRL techniques. The implicit boundary between the agents is shown by the dotted horizontal line in Fig. 1. This environment is well suited to evaluating the effectiveness of our concurrent multi-agent framework.
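As a rough illustration of how the raw screen observations described here (and preprocessed as in Section 3.4) can be turned into an agent's state, the sketch below converts frames to 84 x 84 grayscale and stacks the last four of them. The nearest-neighbour resizing and the class names are illustrative assumptions, not the authors' implementation.

import numpy as np
from collections import deque

def to_grayscale_84(frame):
    """Convert an RGB frame (H, W, 3) to a coarse 84x84 grayscale image.
    Nearest-neighbour resizing keeps this sketch dependency-free."""
    gray = frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    rows = np.linspace(0, gray.shape[0] - 1, 84).astype(int)
    cols = np.linspace(0, gray.shape[1] - 1, 84).astype(int)
    return gray[np.ix_(rows, cols)] / 255.0

class FrameStack:
    """Keep the last four preprocessed frames; the stack is the agent's state."""
    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame):
        self.frames.append(to_grayscale_84(frame))
        while len(self.frames) < self.frames.maxlen:   # pad at episode start
            self.frames.append(self.frames[-1])
        return np.stack(self.frames, axis=0)           # shape (4, 84, 84)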


Fig. 1. Doubles pong game.

3.2. Multi-agent reinforcement learning

The purpose of RL is to learn what to do next, i.e., how to map situations to appropriate actions, so as to maximize a cumulative numerical reward signal through continuous interaction between the agent and the environment. Essentially, the main goal of RL is to find the optimal action-selection policy. The generalization of the Markov decision process to the multi-agent case is the Markov game.

Definition 1. A multi-agent Markov game is defined as ⟨N, S, A, R, T⟩, where N = {1, ..., n} is a set of n agents, S is a finite state space, A = A_1 × ... × A_n is the joint action space of the agents (where A_i is the action space of agent i), R : S × A → ℝ is the common expected reward function, and T : S × A × S → [0, 1] is a probabilistic transition function.

The objective of the n agents is to find a deterministic joint policy (or joint strategy) π = (π_1, ..., π_n), where π : S → A and π_i : S → A_i, so as to maximize the expected sum of their discounted common reward. A non-trivial result, proven by Shapley [10] for zero-sum games and by Filar and Vrieze [48] for general-sum games, is that equilibrium solutions exist for stochastic games just as they do for matrix games.
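As a small illustration of Definition 1, the following Python sketch encodes the tuple ⟨N, S, A, R, T⟩ and a deterministic joint policy; the class and function names are illustrative assumptions and not part of the paper.

from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

State = int                      # index into the finite state space S
JointAction = Tuple[int, ...]    # one action index per agent, an element of A_1 x ... x A_n

@dataclass
class MarkovGame:
    """Finite Markov game <N, S, A, R, T> of Definition 1."""
    num_agents: int
    num_states: int
    action_sizes: Sequence[int]                                # |A_i| for each agent i
    reward: Callable[[State, JointAction], float]              # common reward R(s, a)
    transition: Callable[[State, JointAction, State], float]   # T(s, a, s') in [0, 1]

def joint_policy(policies: Sequence[Callable[[State], int]], s: State) -> JointAction:
    # a deterministic joint policy pi = (pi_1, ..., pi_n) maps a state to a joint action
    return tuple(pi(s) for pi in policies)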

3.3. Deep reinforcement learning in a single-agent case

In RL, agent i learns an optimal control policy π_i. At each step, i observes the current state s^i, chooses an action a^i using π_i, and receives a reward r^i. Then, the environment transitions to a new state s^i_{t+1}. The goal of the agent is to maximize the cumulative discounted future reward at time t,

R_t = Σ_{t'=t_0}^{T} γ^{t'−t_0} r^i_{t'},    (1)

where T is the terminal timestep and γ ∈ [0, 1] is the discount factor that weights the importance of rewards. The action-value of a given policy π represents the utility of action a^i at state s^i, where the utility is defined as the expected discounted future reward:

Q^π(s^i, a^i) = E[ R_{t+1} | s^i_t = s, a^i_t = a ].    (2)

The optimal Q^* is defined as

Q^*(s^i, a^i) = max_π E[ R_{t+1} | s^i_t = s, a^i_t = a ].    (3)

Q-learning [49] is an approach that iteratively estimates the Q-function using the Bellman equation:

Q^*(s^i, a^i) = E[ R_{t+1} + γ max_{a^i} Q^*(s^i_{t+1}, a^i) | s, a ].    (4)

3.4. Deep Q network in a single-agent case

Originally, Q-learning uses a table containing the Q-values of state-action pairs, where we assume that a state is the screen image of a pong game. If we apply the same preprocessing as in Silver et al. [36], which takes the last four screen images, resizes them to 84 × 84, and converts them to grayscale with 256 gray levels, we would have 256^{84×84×4} ≈ 10^{67970} possible game states. In the problem we try to solve, the state also usually consists of several numbers (position, velocity, RGB values), so our state space would be almost infinite. Because we cannot use a table to store such a huge number of values, we extend tabular Q-learning [49] to the deep reinforcement learning framework by using a neural network function approximator with a collection of weights θ. The weights θ can be trained by minimizing a sequence of loss functions L_t(θ_t) that changes at each timestep t:

L_t(θ^i_t) = E_{s^i, a^i, r^i, s^i_{t+1}}[ (y^i_t − Q(s^i, a^i; θ^i_t))^2 ],    (5)

where y^i_t is the target Q-value, defined as

y^i_t = E[ R_{t+1} + γ max_{a^i} Q^*(s^i_{t+1}, a^i; θ^i_t) | s, a ]    (6)
      = R_{t+1} + γ max_{a^i} Q(s^i_{t+1}, a^i; θ^i_t)    (7)
      = R_{t+1} + γ Q(s^i_{t+1}, arg max_{a^i} Q(s^i_{t+1}, a^i; θ^i_t); θ^i_t).    (8)

A target network [31] with parameters θ^{i,−}, updated every K steps from the online network, improves the stability of DQN. Hence, y^i_t becomes

y^i_t = R_{t+1} + γ Q(s^i_{t+1}, arg max_{a^i} Q(s^i_{t+1}, a^i; θ^{i,−}_t); θ^{i,−}_t).    (9)

The derivative of the loss function with respect to the weights is given by the following gradient:

∇_{θ^i_t} L(θ^i_t) = E_{s^i, a^i, r^i, s^i_{t+1}}[ (y^i_t − Q(s^i, a^i; θ^i_t)) ∇_{θ^i_t} Q(s^i, a^i; θ^i_t) ].    (10)

Instead of using an accurate estimate of the above gradient, it has been shown in practice that the following approximation stabilizes the weights θ^i_t:

∇_{θ_t} L(θ^i_t) = (y^i_t − Q(s^i, a^i; θ^i_t)) ∇_{θ_t} Q(s^i, a^i; θ^i_t).    (11)

In mini-batch training, we usually use the mean squared error (MSE) loss function defined as

MSE = (1/m) Σ_{t=1}^{m} (y^i_t − Q(s^i_{t+1}, a^i; θ^i_t))^2,    (12)

where y^i_t and Q(s^i_{t+1}, a^i; θ^i_t) are the target and predicted Q-values, respectively. The error (or difference) term y^i_t − Q^i(s^i_{t+1}, a^i; θ^i_t) can be huge for a sample that disagrees with the current network prediction. This may cause very large changes to the network, because back-propagation of this error strongly impacts the weights during the backward pass. To limit these changes, we can clip the values of the MSE to the [−1, 1] region and use the mean absolute error (MAE) outside it.


Fig. 2. Multi-agent concurrent deep reinforcement learning framework. We have two homogeneous agents, each with its own set of actions, and a given goal to be achieved. This environment is modeled as a fully cooperative multi-agent MDP [51,52] since all agents take a joint action and share the same utility function at every timestep.

The MAE is defined as

MAE = (1/m) Σ_{t=1}^{m} | y^i_t − Q(s^i_{t+1}, a^i; θ^i_t) |.    (13)

To achieve such smoothness, we combine the benefits of both MSE and MAE. This is called the Huber loss function [50] and can be defined as

L(θ^i_t) = (1/2) (y^i_t − Q(s^i, a^i; θ^i_t))^2,          for |y^i_t − Q(s^i, a^i; θ^i_t)| < δ,
           δ |y^i_t − Q(s^i, a^i; θ^i_t)| − (1/2) δ^2,    otherwise.    (14)

For all experiments below, we set δ = 1. This treats small errors quadratically for efficiency while counting large ones by their absolute value for robustness.
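Putting Eqs. (5), (9), and (14) together, the following is a minimal PyTorch-style sketch of the per-agent loss computation. It is an illustration under assumptions: the paper does not prescribe a framework, the tensor layout of batch is assumed, and F.smooth_l1_loss is PyTorch's built-in Huber loss with δ = 1.

import torch
import torch.nn.functional as F

def dqn_loss(batch, online_net, target_net, gamma=0.99):
    # batch: tensors for states, actions, rewards, next states, and terminal flags
    states, actions, rewards, next_states, dones = batch

    # Q(s, a; theta): the prediction term of Eq. (5) for the actions actually taken
    q_pred = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # Eq. (9): bootstrap from the target network theta^-, i.e. the max of Eq. (7)
        # evaluated with the periodically copied parameters
        q_next = target_net(next_states).max(dim=1).values
        y = rewards + gamma * (1.0 - dones) * q_next

    # Eq. (14) with delta = 1: quadratic near zero, linear for large errors
    return F.smooth_l1_loss(q_pred, y)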

4. Proposed method

This section describes the approaches taken in this paper. We extended a few variants of DQN to MASs and empirically evaluated the effectiveness of applying deep reinforcement learning to concurrent MASs.

4.1. Deep multi-agent concurrent learning

In a multi-agent coordination/cooperation task, a number of agents need to generate non-conflicting joint actions that achieve their respective goals. In our study, agents are able to achieve their sub-goals independently without any supervision from the system. Moreover, agents should discover a coordinated strategy that allows them to maximize their rewards in the long run.

Definition 2. Coordinated behaviors: Given two or more agents with different subgoals and sets of actions, find the optimal joint actions or policies that maximize the total discounted reward without hindering the other agents' actions.

However, when multiple agents apply RL in a shared environment, the optimal policy of an agent depends not only on the environment but also on the policies of the other agents. In previous work [41], agents were not able to observe the actions or rewards of other agents even though these actions interfered with each other (i.e., had a direct impact on their own rewards and environment). Our challenge is to have each agent adapt its behavior in the context of another co-adapting agent without explicit control.

We propose a framework (Fig. 2) in which two agents form a team to win against a hard-coded agent. We consider these agents to be autonomous entities that have global goals and independent decision-making capabilities, but that are also influenced by each other's decisions. We assume that agents do not have a predefined area of responsibility but should learn to divide the area for effective collaborative play. This is modeled as concurrent learning, where both agents attempt to improve the team score. Each agent has its own DQN to learn its behavior, with the ε-greedy strategy balancing exploration and exploitation. It becomes difficult to predict the teammate's behavior, since the two agents may not share the same reward structure (see Algorithm 1).

Algorithm 1: Multi-agent concurrent DQN
1: Initialize the online network and the target network with random weights θ^i = θ^{i,−}
2: Initialize the experience replay memory D^i
3: for episode = 1 to number_episodes do
4:     for timestep = 1 to max_timesteps do
5:         for agent i = 1 to 2 do
6:             Compute the current state s^i_t of agent i
7:             a^i_t ← a random action with probability ε, or arg max_{a^i} Q(s^i_t, a^i; θ^i_t) with probability 1 − ε
8:             Apply action a^i_t and observe the reward r^i_t and the next state s^i_{t+1}
9:             Store the transition ⟨s^i_t, a^i_t, r^i_t, s^i_{t+1}⟩ in the experience buffer D^i
10:            Sample a batch of transitions ⟨s^i_j, a^i_j, r^i_j, s^i_{j+1}⟩ from D^i using the prioritized experience replay sampling described in Section 4.7
11:            y^i_j ← r^i_j if s^i_{j+1} is terminal; otherwise y^i_j ← r^i_j + γ Q(s^i_{t+1}, arg max_{a^i} Q(s^i_{t+1}, a^i; θ^{i,−}_t); θ^{i,−}_t)
12:            Calculate the loss L(·) using Eq. (14)
13:            Compute the gradient ∇_{θ_t} L(θ_t) using Eq. (11)
14:            Update the weights θ^i ← θ^i − α ∇_{θ_t} L(θ_t)
15:            Every K steps, set θ^{i,−} ← θ^i
16:         end
17:     end
18: end
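The sketch below illustrates the overall structure of Algorithm 1 as a Python function. It is an outline under stated assumptions rather than the authors' implementation: env is assumed to expose Gym-like reset()/step() methods that accept a list of actions and return per-agent observations and rewards, and each element of agents is assumed to wrap its own online/target DQN and replay buffer.

def run_concurrent_training(env, agents, num_episodes, max_timesteps):
    """Outline of Algorithm 1 for two (or more) concurrently learning agents."""
    for episode in range(num_episodes):
        states = env.reset()
        for t in range(max_timesteps):
            # line 7: each agent picks its action epsilon-greedily from its own DQN
            actions = [agent.act(s) for agent, s in zip(agents, states)]
            # line 8: the joint action is applied and per-agent rewards are observed
            next_states, rewards, done = env.step(actions)
            for i, agent in enumerate(agents):
                # lines 9-15: store the transition, sample a prioritized batch,
                # build targets, and update the online network of agent i
                agent.store(states[i], actions[i], rewards[i], next_states[i], done)
                agent.learn()
            states = next_states
            if done:
                break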

Independent or joint-learning agents can in principle run into a convergence problem, since one agent's learning process makes the environment appear non-stationary to the other agents. Agents constantly modify their behaviors, which in turn can interfere with other agents' learned behaviors [53] by making obsolete the assumptions on which they are based. Thus, agents might gradually forget everything they have learned.

4.2. Structure of deep Q networks

The DQN architecture used in our experiments is shown in Fig. 3 and Table 1. We used convolutional layers to consider regions of an image and to maintain the spatial relationships between objects. The network takes screen pixels as input and estimates the action values. The images are rescaled to 84 × 84 pixels. The last four screens implicitly contain all of the relevant information about the game situation, including the speed and direction of the ball. By using a frameskip of 4, we address the problem of environment observability. The output of our network is three (3) actions: up, down, and stay at the same place.

Table 1. DQN architecture.

Layer   Input          Filter size   Stride   No. filters   Activation   Output
Conv1   84 × 84 × 4    8 × 8         4        32            ReLU         20 × 20 × 32
Conv2   20 × 20 × 32   4 × 4         2        64            ReLU         9 × 9 × 64
Conv3   9 × 9 × 64     3 × 3         1        64            ReLU         7 × 7 × 64
FC4     7 × 7 × 64     -             -        512           ReLU         512
FC5     512            -             -        3             Linear       3
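A sketch of the Table 1 architecture as a PyTorch module follows; the paper does not state which framework was used, so this is an illustrative rendering rather than the authors' code.

import torch
import torch.nn as nn

class DQNNetwork(nn.Module):
    """Convolutional Q-network of Table 1: three conv layers followed by two
    fully connected layers mapping four stacked 84x84 frames to 3 action values."""

    def __init__(self, num_actions=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # Conv1
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # Conv2
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # Conv3
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7 * 7 * 64, 512), nn.ReLU(),                  # FC4
            nn.Linear(512, num_actions),                            # FC5 (linear output)
        )

    def forward(self, x):
        # x: batch of stacked frames, shape (batch, 4, 84, 84), scaled to [0, 1]
        return self.head(self.features(x))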

Fig. 3. DQN architecture.

4.3. Reward structure

To achieve the desired behavior, we use reward shaping, which attempts to mold the conduct of the learning agent by adding localized rewards that encourage behavior consistent with some prior knowledge. It is known that, with a good reward scheme, the learning process easily converges in the single-agent case [54]. In contrast, we are not yet fully aware of all the advantages and disadvantages of reward shaping in a multi-agent context. Table 2 details the rewards used in our experiments. Every time the right players (our agents) win against the hard-coded agent, each of them receives a positive reward (+1). We punish both agents with a negative reward (−1) when one of them loses the ball. They also receive a negative reward (−1) if they collide with each other. The minimum possible reward is −2, which happens when both agents simultaneously lose the ball and collide. With this reward scheme, our agents jointly learn to cooperate with their teammate to win against the AI player and implicitly divide their areas of responsibility.

Table 2. Reward scheme.

Event               Agent 1 reward   Agent 2 reward
Left player loses   +1               +1
One agent loses     −1               −1
Collision           −1               −1

4.4. Exploration and exploitation

Exploration vs. exploitation is one of the main dilemmas in RL. An agent exploits when it always selects the action found to be most effective based on its history (exploitation). On the other hand, to improve or learn a policy, the agent must explore states it has never observed before (exploration) by taking non-optimal actions. We use ε-greedy to balance exploration and exploitation, with 0 ≤ ε ≤ 1. At every timestep, the agent acts randomly with probability ε and acts according to the current policy with probability 1 − ε. In practice, it is common to start with ε = 1 and to progressively decay ε as training progresses until we reach the desired value.
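A minimal ε-greedy selector with linear decay, illustrating the schedule described above; the decay endpoints and step budget are illustrative values, not the paper's exact settings.

import random

def make_epsilon_greedy(q_values_fn, num_actions,
                        eps_start=1.0, eps_end=0.1, decay_steps=1_000_000):
    """Return an action-selection function whose exploration rate decays
    linearly from eps_start to eps_end over decay_steps calls."""
    step = 0

    def act(state):
        nonlocal step
        epsilon = max(eps_end, eps_start - (eps_start - eps_end) * step / decay_steps)
        step += 1
        if random.random() < epsilon:
            return random.randrange(num_actions)          # explore
        q_values = q_values_fn(state)                     # exploit the current policy
        return max(range(num_actions), key=lambda a: q_values[a])

    return act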

4.5. DQN with double Q-learning

The conventional Q-learning [49] algorithm overestimates the action-state values [44]. This is due to the use of the max function when updating the Q-values. One solution is double DQN [45], which consists of using two independent, randomly updated value functions, resulting in two sets of weights, θ and θ'. One set of weights is then used to determine the greedy policy and the other to determine its value. Given that DQN already maintains two different value functions (the online and target networks), y_t^{DDQN} becomes

y^{(DDQN)}_t = R_{t+1} + γ Q(s^i_{t+1}, arg max_{a^i} Q(s^i_{t+1}, a^i; θ^i_t); θ^{i,−}_t).    (15)
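The only difference between the target of Eq. (9) and the double DQN target of Eq. (15) is which network selects the greedy action. A PyTorch-style sketch of Eq. (15), illustrative rather than the authors' code:

import torch

def double_dqn_target(rewards, next_states, dones, online_net, target_net, gamma=0.99):
    """Double DQN target of Eq. (15): the online network selects the action,
    the target network evaluates it."""
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # arg max_a Q(s', a; theta)
        q_eval = target_net(next_states).gather(1, best_actions).squeeze(1)  # Q(s', a*; theta^-)
        return rewards + gamma * (1.0 - dones) * q_eval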

Double DQN is shown to sometimes underestimate rather than overestimate the expected Q-values. However, the chance of both estimators underestimating or overestimating at the same action is low.

4.6. Dueling network architectures

In many reinforcement learning problems, it is unnecessary to estimate the action value for every action, as some actions have little to no value in a given state. Dueling networks [46] decompose the Q-network into a value stream V(s; θ, β) and an advantage stream A(s, a; θ, ζ). The value stream expresses how good it is for an agent to be in a given state, while the advantage stream expresses how much better taking a certain action would be compared to the others. The relationship between the value and advantage streams is expressed in the following formula:

A^π(s^i, a^i) = Q^π(s^i, a^i) − V^π(s^i).    (16)

This decomposition helps to generalize the learning of the state values. The output layer of this architecture is a combination of the value stream output and the advantage stream output. The aggregator is given by

Q(s^i, a^i; θ, ζ, β) = V(s^i; θ, β) + ( A(s^i, a^i; θ, ζ) − (1/|A^i|) Σ_{a'^i} A(s^i, a'^i; θ, ζ) ),    (17)

where β is a parameter of V, ζ is a parameter of A, and |A^i| is the number of actions (the length of the advantage stream). Algorithm 2 describes the process in detail; the network architecture is shown in Fig. 4, and its parameters are the same as those in Table 1.


Algorithm 2: Multi-agent concurrent DQN with dueling network architecture
1: Initialize the online network and the target network with random weights θ^i = θ^{i,−}
2: Initialize the experience replay memory D^i
3: for episode = 1 to number_episodes do
4:     for timestep = 1 to max_timesteps do
5:         for agent i = 1 to 2 do
6:             Compute the current state s^i_t of agent i
7:             a^i_t ← a random action with probability ε, or arg max_{a^i} Q(s^i_t, a^i; θ^i_t) with probability 1 − ε
8:             Apply action a^i_t and observe the reward r^i_t and the next state s^i_{t+1}
9:             Store the transition ⟨s^i_t, a^i_t, r^i_t, s^i_{t+1}⟩ in the experience buffer D^i
10:            Sample a batch of transitions ⟨s^i_j, a^i_j, r^i_j, s^i_{j+1}⟩ from D^i using the prioritized experience replay sampling described in Section 4.7
11:            y^i_j ← r^i_j if s^i_{j+1} is terminal; otherwise y^i_j ← r^i_j + γ Q(s^i_{t+1}, arg max_{a^i} Q(s^i_{t+1}, a^i; θ^{i,−}_t); θ^{i,−}_t)
12:            Calculate V^π(s^i) from the value stream (Fig. 4)
13:            Calculate A^π(s^i, a^i) from the advantage stream (Fig. 4)
14:            Compute Q(s^i, a^i; θ^i, ζ, β) using Eq. (17)
15:            Calculate the loss L(·) using Eq. (14)
16:            Compute the gradient ∇_{θ_t} L(θ_t) using Eq. (11)
17:            Update the weights θ^i ← θ^i − α ∇_{θ_t} L(θ_t)
18:            Every K steps, set θ^{i,−} ← θ^i
19:         end
20:     end
21: end
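As a concrete illustration of steps 12-14 of Algorithm 2, the following PyTorch sketch implements a dueling head and the aggregation of Eq. (17). It is an assumption-laden sketch: the stream width of 512 follows the fully connected layer of Table 1, and the module names are illustrative.

import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling head: separate value and advantage streams combined as in Eq. (17)."""

    def __init__(self, feature_dim=7 * 7 * 64, num_actions=3, hidden=512):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))                 # V(s; theta, beta)
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, num_actions))   # A(s, a; theta, zeta)

    def forward(self, features):
        v = self.value(features)                  # shape (batch, 1)
        a = self.advantage(features)              # shape (batch, num_actions)
        # Eq. (17): subtract the mean advantage so the two streams are identifiable
        return v + a - a.mean(dim=1, keepdim=True)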

4.7. Experience replay

The learning process becomes smooth by storing an agent's experiences and uniformly sampling batches of them to train the network. This helps the network learn about immediate actions while considering past experiences. We store each experience as a tuple ⟨s, a, r, s_{t+1}⟩. Instead of randomly sampling, prioritized experience replay [38] takes transitions that are not in alignment with the network's current estimate. To do so, we store the residual error between the actual Q-value and the target Q-value of a transition. The TD error δ = |y^i_t − Q(s^i, a^i; θ^i_t)| is converted to a priority τ_j of the jth experience using

τ_j = (|δ| + μ)^λ,    (18)

where 0 ≤ λ ≤ 1 determines how much prioritization is used and μ is a very small positive value. We then use the priorities as a probability distribution for sampling an experience j [38], in which j is selected with probability

φ(j) = τ_j / Σ_{k=1}^{K} τ_k,    (19)

where Σ_k τ_k is the total of all priorities and K is the size of the mini-batch. This is called proportional prioritization. Each agent stores the most recent experiences in its own experience replay memory. This may not be optimal for a huge number of agents, because the system would require a large amount of memory and computation, but it is efficient for learning cooperation with a few or several agents, as in our case.
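A small sketch of proportional prioritization (Eqs. (18) and (19)) in plain Python. It is illustrative only: the buffer is a simple list rather than the segment-tree structure usually used in practice, and sampling probabilities are normalized over all stored priorities as in Schaul et al. [38].

import random

class PrioritizedReplay:
    """Minimal proportional prioritized replay: priorities follow Eq. (18),
    sampling probabilities follow Eq. (19)."""

    def __init__(self, capacity=100_000, lam=0.6, mu=1e-6):
        self.capacity, self.lam, self.mu = capacity, lam, mu
        self.buffer, self.priorities = [], []

    def store(self, transition, td_error):
        priority = (abs(td_error) + self.mu) ** self.lam        # Eq. (18)
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size=32):
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]            # Eq. (19)
        return random.choices(self.buffer, weights=probs, k=batch_size)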

Fig. 4. Dueling network architectures for DQN.

5. Experimental results and discussion

We performed experiments in which we train the agents' networks to ensure smooth convergence of the learning process by using stochastic gradient descent. We then initialize the networks with the learned weights without using ε-greedy or learning-rate decay; with this initialization, the agents no longer need to extensively explore their environment. We use the following parameters for all experiments: learning rate α = 0.00025, discount factor γ = 0.99, a screen size of 84 × 84 pixels, and the last four images concatenated to constitute one state. The loss function is optimized using the RMSprop optimizer [55] with momentum = 0.95 and ε = 0.01. We update the target network every 10,000 timesteps. To stabilize learning, we feed the network small mini-batches of 32 samples. We terminate a game after 50,000 timesteps (called an episode) even if none of the players has achieved a score of 21. The experimental results in the following sections present the average values of ten (10) experimental runs.
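These hyperparameters map directly onto an optimizer configuration; a sketch under the assumption that the networks are PyTorch modules and that the reported ε corresponds to the optimizer's epsilon term (the paper does not state the framework):

import torch

GAMMA = 0.99             # discount factor
BATCH_SIZE = 32          # mini-batch size
TARGET_UPDATE = 10_000   # timesteps between target-network synchronizations
MAX_EPISODE_LEN = 50_000

def make_optimizer(network):
    # RMSprop with the learning rate, momentum, and epsilon reported in Section 5
    return torch.optim.RMSprop(network.parameters(), lr=2.5e-4,
                               momentum=0.95, eps=0.01)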

5.1. Learning convergence and Q-values

Fig. 5(a) compares the log loss of different deep reinforcement learning techniques: DQN, double DQN (DDQN), dueling network architectures (duelDQN), and double DQN combined with dueling networks (duelDDQN). As shown, the learning processes of all techniques converge more or less after 2 million timesteps. Our agents achieve approximately the same loss values for all learning techniques. The loss values are smaller for the simple DQN technique because its network always selects the maximum Q-values, which makes the discrepancy between the prediction and the target smaller. At the same time, we can see that our agents' learning processes converged smoothly and that they learned the desired behavior in a very short time, even in an apparently non-stationary environment. Note that the Huber loss function was quite efficient in our experiments, but this is not always the case in RL applications that require an accurate estimate of the action-value function.

The Q-value metric is useful for clarifying the internal details of the learning agents and for evaluating how agents decide to take a specific action in a given state.


Fig. 5. Log loss function and average Q-value per step.

The Q-values of agents 1 and 2 are plotted in Fig. 5(b) for the different deep reinforcement learning techniques. We can see that the Q-values of agents 1 and 2 are almost identical and that both have learned acceptable policies that help them win against the hard-coded agent. This can be explained by the fact that each agent is trying to adapt its behavior on the basis of the other agent. They first learned to cooperatively divide their areas of responsibility (avoiding collisions), and only after that did they start to learn how to play well and win games. Their target networks predicted the right actions to take by following a near-optimal strategy. We could change this behavior by making them share the same experience replay memory and target network. The results for the dueling networks are similar and accurate for both agents, owing to the way the network is decomposed to calculate the state value and the advantage of each action in a given state. In an environment where safety is required, it would be better to use the dueling architectures or double DQN; if we do not care about safety, any of the other techniques may be used so long as the network converges. If one of the agents received fewer balls in its area, its target network more often predicted staying at the same place, thereby taking the best action in those situations by following a near-optimal policy. The other agent received many balls in its area and took many different actions, and thus its learning process became noisy.

5.2. Reward

The reward is an important metric that shows the impact of an agent on the environment. Fig. 6(a) shows the average reward per step for each agent and each learning technique. All techniques followed more or less the same trend.

Fig. 6. Rewards (moving average mean of 10 windows).


Fig. 7. Number of paddle bounces per episode (moving average mean of 10 windows).

Fig. 8. Number of games played per episode (moving average mean of 10 windows).

In the beginning, all techniques achieved approximately the same reward per step, but differences appeared just after the learning process became stable. Double DQN then outperformed all the remaining techniques after roughly 2 million timesteps. The results for double DQN, dueling network architectures, and double dueling DQN progressively and slowly increased as more games were played. However, the results for DQN did not vary much, as its agents tended to sometimes forget the learned behaviors. In our experiments, the discount factor γ is close to 1 (γ = 0.99), which means we give great weight to future rewards and decrease the importance of previous feedback. The results also indicate that our agents received the expected rewards per step. In these experiments, we gave the same reward to both agents regardless of whether they won or lost. It might be desirable to assign credit differently to overcome the credit assignment problem: if one agent did most of the work, it might be helpful to specially reward that agent for its actions and punish the other for laziness.

Fig. 6(b) shows the average game reward per episode for the various learning techniques. For simplicity, very small rewards (≤ −10) are not shown.

Our agents could easily achieve an average game reward of 15. Generally, this means that our agents take the right decisions by following an optimal policy. Fig. 6(c) shows the maximum reward obtained by each agent over a period of a few games. DQN could rarely achieve the maximum reward of 21; in fact, the DQN agents could play the game well and perhaps win while still sometimes colliding. DQN has the highest Q-values, but its reward is low because the estimated Q-values differ from the actual ones. In contrast, the other agents played well and could easily avoid colliding with each other; they regularly achieved the maximum reward. Fig. 6(d) shows the minimum game reward per episode. We can see that the minimum reward for DDQN is higher than that of the other learning techniques because the network converges to near-optimal Q-values. DQN followed a similar trend even though its Q-values are overestimated by the max function in Eq. (8). As expected, all techniques using dueling network architectures have a lower minimum reward because the action space of our environment is not large. Finally, we can see that all agents completely avoid negative rewards after the learning process becomes stable.

5.3. Number of paddle bounces and games per episode

Fig. 7 shows how many times an agent hits the ball per episode. In the beginning, the randomly playing agents hit the ball only a few times, but after training for a few million timesteps, both agents learn good policies that help them keep the ball in play as long as possible. These values are correlated with how long a game lasts: obviously, the more often the agents hit the ball, the longer the game lasts. The high values for double DQN may be explained by the fact that the network converged to the right Q-values; the double DQN agents therefore take the best actions according to their near-optimal policies.

We recorded the number of games in every episode of 50,000 timesteps in order to determine how long one game can last. Fig. 8 shows the number of games played by our agents. It also shows that the agents played many games per episode and often missed the ball during the first stage of training. After the learning process converges, they play around seven (7) games per episode for all techniques except DQN, whose agents play around eight (8) games per episode. As a result, a game lasts a relatively long time.


Fig. 9. DQN results. This is an average of five (5) experimental runs where the area of responsibility is explicitly given to each agent. Every time the agents win, each of them receives a positive reward (+1). An agent receives a negative reward (−1) whenever it misses the ball in its area of responsibility. An agent is not punished if its teammate misses the ball. We can see the average episode reward converges to 15.

Fig. 10. Density estimation of different metrics by team.

A game can sometimes last a very short time if the agents win perfectly by 21-0. See Figs. 10 and 11 and Table 3 for more details about the distributions of these metrics for the different learning techniques in a cooperative multi-agent context.

5.4. Remarks

DQN with double Q-learning (DDQN) achieved globally the best results in our environment. From these experiments, we see that all agents could learn coordinated behavior by implicitly dividing their areas of responsibility.

Moreover, the agents could achieve approximately the same per-step reward (Fig. 6(a)) and game reward (Fig. 6(b)) as in the case where the responsible areas are explicitly given such that the agents can never collide (Fig. 9). Note that if the reward for collision is less than −1, the agents never learn to play the game; they only learn to avoid collisions. If the collision reward lies between −1 and 0, the agents could easily achieve the maximum reward per game, but they collided a lot, and thus the learning process took a very long time before they could adaptively and appropriately divide their areas of responsibility. Reward engineering [56] is an important and difficult open question in reinforcement learning; we need more theoretical and applied methods to appropriately choose a flexible and effective reward scheme.


Fig. 11. Density estimation of different metrics by agent.

Table 3. Training performance: average values over 10 experiments.

Technique   Loss (Agent 1)   Loss (Agent 2)   Q-value (Agent 1)   Q-value (Agent 2)   Bounces (Agent 1)   Bounces (Agent 2)   Avg reward          Games
DQN         0.012 ± 0.033    0.012 ± 0.033    0.107 ± 0.509       0.101 ± 0.509       100.977 ± 9.377     100.580 ± 9.438     −10.231 ± 118.376   8.324 ± 1.8
DDQN        0.011 ± 0.027    0.012 ± 0.031    −0.088 ± 0.552      −0.094 ± 0.670      122.493 ± 9.624     138.292 ± 19.929    −3.367 ± 88.696     7.393 ± 2.414
DuelDQN     0.011 ± 0.030    0.011 ± 0.031    0.005 ± 0.475       0.024 ± 0.472       101.790 ± 13.572    114.338 ± 14.648    −5.199 ± 86.677     7.365 ± 2.267
DuelDDQN    0.011 ± 0.030    0.011 ± 0.031    −0.081 ± 0.569      −0.121 ± 0.590      104.826 ± 13.040    115.603 ± 14.865    −4.352 ± 85.409     7.562 ± 2.135

6. Conclusion

We demonstrated that multi-agent concurrent deep reinforcement learning techniques can smoothly converge even in a non-stationary environment. We also showed in detail how each agent chooses its strategies and adapts its behavior, and we clarified some possible limitations of this method. One drawback is that both agents hesitate to take an action when the ball is on the frontier of their areas of responsibility, which is certainly not optimal behavior. It might be helpful to make the agents communicate or negotiate in order to save time and energy for one of the protagonists. In real-world applications, agents do not have the same resources and abilities assumed in our experiments, and they do not have a full view of the environment. Our near-future work is to improve the proposed method in partially observable Markov decision processes where each agent has a limited view of the environment. We also plan to evaluate it in an environment where the agents are heterogeneous in that they do not have the same speed, abilities, and paddle sizes.

Conflict of interest

None.

Acknowledgments

This work is partly supported by JSPS KAKENHI grant number 17KT0044.

References

[1] M. Wooldridge, An Introduction to Multiagent Systems, John Wiley & Sons, 2009.
[2] G. Weiss, Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, MIT Press, 1999.
[3] J. Ferber, Multi-Agent Systems: An Introduction to Distributed Artificial Intelligence, 1, Addison-Wesley Reading, 1999.

Please cite this article as: E.A.O. Diallo, A. Sugiyama and T. Sugawara, Coordinated behavior of cooperative agents using deep reinforcement learning, Neurocomputing, https://doi.org/10.1016/j.neucom.2018.08.094


[20] A.L. Hayzelden, J. Bigham, Heterogeneous multi-agent architecture for ATM virtual path network resource configuration, in: Proceedings of the International Workshop on Intelligent Agents for Telecommunication Applications, Springer, 1998, pp. 45-59.
[21] H. Velthuijsen, Distributed artificial intelligence for runtime feature-interaction resolution, Computer 26 (8) (1993) 48-55.
[22] T. Sugawara, A cooperative LAN diagnostic and observation expert system, in: Proceedings of the Ninth Annual International Phoenix Conference on Computers and Communications, 1990, pp. 667-674.
[23] T. Kaihara, Multi-agent based supply chain modelling with dynamic environment, Int. J. Prod. Econ. 85 (2) (2003) 263-269.
[24] T. Lux, M. Marchesi, Scaling and criticality in a stochastic multi-agent model of a financial market, Nature 397 (6719) (1999) 498.
[25] P. Stone, M. Veloso, Multiagent systems: a survey from a machine learning perspective, Auton. Robots 8 (3) (2000) 345-383.
[26] L. Buşoniu, R. Babuška, B. De Schutter, Multi-agent reinforcement learning: an overview, in: Innovations in Multi-Agent Systems and Applications-1, Springer, 2010, pp. 183-221.
[27] K. Tuyls, G. Weiss, Multiagent learning: basics, challenges, and prospects, AI Mag. 33 (3) (2012) 41.
[28] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, 1, MIT Press, Cambridge, 1998.
[29] M.L. Littman, Markov games as a framework for multi-agent reinforcement learning, in: Proceedings of Machine Learning, Elsevier, 1994, pp. 157-163.
[30] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep reinforcement learning, arXiv:1312.5602, 2013.
[31] V. Mnih, K. Kavukcuoglu, D. Silver, A.A. Rusu, J. Veness, M.G. Bellemare, A. Graves, M. Riedmiller, A.K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529.
[32] P. Peng, Q. Yuan, Y. Wen, Y. Yang, Z. Tang, H. Long, J. Wang, Multiagent bidirectionally-coordinated nets for learning to play StarCraft combat games, arXiv:1703.10069, 2017.
[33] J. Foerster, N. Nardelli, G. Farquhar, P. Torr, P. Kohli, S. Whiteson, et al., Stabilising experience replay for deep multi-agent reinforcement learning, arXiv:1702.08887, 2017.
[34] S. Omidshafiei, J. Pazis, C. Amato, J.P. How, J. Vian, Deep decentralized multi-task multi-agent RL under partial observability, arXiv:1703.06182, 2017.
[35] E.A.O. Diallo, A. Sugiyama, T. Sugawara, Learning to coordinate with deep reinforcement learning in doubles pong game, in: Proceedings of the Sixteenth IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 14-19.
[36] D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of Go with deep neural networks and tree search, Nature 529 (7587) (2016) 484-489.
[37] L.-J. Lin, Self-improving reactive agents based on reinforcement learning, planning and teaching, Mach. Learn. 8 (3-4) (1992) 293-321.
[38] T. Schaul, J. Quan, I. Antonoglou, D. Silver, Prioritized experience replay, arXiv:1511.05952, 2015.
[39] J.Z. Leibo, V. Zambaldi, M. Lanctot, J. Marecki, T. Graepel, Multi-agent reinforcement learning in sequential social dilemmas, in: Proceedings of the Sixteenth Conference on Autonomous Agents and MultiAgent Systems, International Foundation for Autonomous Agents and Multiagent Systems, 2017, pp. 464-473.
[40] H. He, J. Boyd-Graber, K. Kwok, H. Daumé III, Opponent modeling in deep reinforcement learning, in: Proceedings of the International Conference on Machine Learning, 2016, pp. 1804-1813.
[41] A. Tampuu, T. Matiisen, D. Kodelja, I. Kuzovkin, K. Korjus, J. Aru, J. Aru, R. Vicente, Multiagent cooperation and competition with deep reinforcement learning, PLoS One 12 (4) (2017) e0172395.
[42] Y. Dimopoulos, P. Moraitis, Multi-agent coordination and cooperation through classical planning, in: Proceedings of the IEEE/WIC/ACM International Conference on Intelligent Agent Technology, IEEE Computer Society, 2006, pp. 398-402.
[43] L. Panait, S. Luke, Cooperative multi-agent learning: the state of the art, Auton. Agents Multi-Agent Syst. 11 (3) (2005) 387-434.
[44] H.V. Hasselt, Double Q-learning, in: Proceedings of Advances in Neural Information Processing Systems, 2010, pp. 2613-2621.
[45] H. Van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double Q-learning, in: Proceedings of the AAAI, 16, 2016, pp. 2094-2100.


[46] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, N. De Freitas, Dueling network architectures for deep reinforcement learning, arXiv:1511.06581, 2015.
[47] M.G. Bellemare, Y. Naddaf, J. Veness, M. Bowling, The arcade learning environment: an evaluation platform for general agents, J. Artif. Intell. Res. (JAIR) 47 (2013) 253-279.
[48] J. Filar, K. Vrieze, Competitive Markov Decision Processes, Springer Science & Business Media, 2012.
[49] C.J.C.H. Watkins, Learning From Delayed Rewards, King's College, Cambridge, 1989, Ph.D. thesis.
[50] P.J. Huber, Robust Statistics, Wiley, New York, 1981.
[51] C. Boutilier, Planning, learning and coordination in multiagent decision processes, in: Proceedings of the Sixth Conference on Theoretical Aspects of Rationality and Knowledge, Morgan Kaufmann Publishers Inc., 1996, pp. 195-210.
[52] C.E. Guestrin, D. Koller, Planning Under Uncertainty in Complex Structured Environments, Stanford University, 2003, Ph.D. thesis.
[53] C. Claus, C. Boutilier, The dynamics of reinforcement learning in cooperative multiagent systems, in: Proceedings of the AAAI/IAAI, 1998, pp. 746-752.
[54] A.Y. Ng, Shaping and Policy Search in Reinforcement Learning, University of California, Berkeley, 2003, Ph.D. thesis.
[55] T. Tieleman, G. Hinton, Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude, Neural Networks for Machine Learning, COURSERA, 2012, pp. 26-31.
[56] D. Dewey, Reinforcement learning and the reward engineering principle, in: Proceedings of the AAAI Spring Symposium Series, 2014.

Elhadji Amadou Oury Diallo received his B.S. degree in software engineering in 2015 from Supdeco (Senegal) and his Master's degree in computer science in 2018 from Waseda University (Japan). He is currently a Ph.D. student in computer science at Waseda University. His research interests include multi-agent systems, artificial intelligence, and machine learning. He is a recipient of the Japanese Government (MEXT) scholarship.

Ayumi Sugiyama received his B.S. and M.S. degrees in computer engineering from Waseda University, Japan, in 2014 and 2016. He is currently a doctoral student in computer science at Waseda University. His research interests include multi-agent systems, artificial intelligence, behavioral science, and ethology. He is a Research Fellow (DC1) of the Japan Society for the Promotion of Science.

Toshiharu Sugawara has been a professor in the Department of Computer Science and Engineering, Waseda University, Japan, since April 2007. He received his B.S. and M.S. degrees in Mathematics in 1980 and 1982, respectively, and a Ph.D. in 1992, all from Waseda University. In 1982, he joined Basic Research Laboratories, Nippon Telegraph and Telephone Corporation. From 1992 to 1993, he was a visiting researcher in the Department of Computer Science, University of Massachusetts at Amherst, USA. His research interests include multi-agent systems, distributed artificial intelligence, machine learning, internetworking, and network management. He is a member of IEEE, ACM, AAAI, the Internet Society, The Institute of Electronics, Information and Communication Engineers (IEICE), the Information Processing Society of Japan (IPSJ), the Japan Society for Software Science and Technology (JSSST), and the Japanese Society of Artificial Intelligence (JSAI).
