CHAPTER FIFTEEN
Cooperative Learning for Robust Connectivity in Multirobot Heterogeneous Networks
P.J. Cruz, R. Fierro, C.T. Abdallah
The University of New Mexico, Albuquerque, NM, United States
1 INTRODUCTION

In the last decade we have witnessed numerous improvements in the field of cooperative robotics. For example, technology has pushed the limits further, providing microscale high-performance processors and advanced high-efficiency sensors and actuators. At the same time, novel methods and strategies to control and coordinate teams of robots have been developed, including reinforcement learning (RL) techniques. As a result, commercially available robotic platforms are now more capable of providing essential support to human teams in a variety of civilian and military missions, such as urban search and rescue, environmental surveillance, and target localization. The accomplishment of these missions places special requirements on the robotic agents that cannot easily be fulfilled by a team of homogeneous robots. The use of robots with different dynamics and/or capabilities, by contrast, has the potential to meet complex task requirements. For example, unmanned aerial vehicles (UAVs) can monitor an area faster but with low resolution, while unmanned ground vehicles (UGVs) can sense a limited area but do so with much more accuracy. Thus these platforms are complementary, and by coordinating them, one can mitigate their weaknesses.

Coordinated control refers to the ability of a team of robots to work together to accomplish a task [1]. Typical coordination tasks for a multirobot network include sensor coverage [2], area exploration [3], and flocking control [4], to mention a few. In these applications a multirobot team is generally deployed in an environment populated with obstacles. Thus the
robotic agents experience uncertainty in communication, navigation, and sensing. For example, the robots have to move toward a region of interest while avoiding collisions with obstacles and other robots. At the same time, objects in the environment can create phenomena such as shadowing effects and secondary reflections, which degrade the performance of the interagent wireless communication. Consequently, the coordinated control of the multirobot network should incorporate wireless communication constraints.

The objective of maintaining connectivity, together with additional requirements such as collision avoidance, in a multiagent system has been studied extensively in the literature. Fink et al. [1] demonstrated that a team of networked robots can maintain end-to-end connectivity in complex environments while moving to accomplish a preassigned task. Decentralized controllers based on navigation functions were developed in [5] for a group of robots to satisfy individual sensing goals while maintaining certain neighborhood connectivity relationships. In [6], communication range and line of sight are used as motion constraints for a swarm of point robots moving from an initial to a final configuration in a cluttered environment. Rooker and Birk [7] introduced a multirobot exploration algorithm that uses a utility function built with the constraints of wireless networking taken into account. Monitoring the communication link quality and constructing a signal strength map are strategies described in [8] for maintaining good link quality during the deployment of a mobile robot network.

The use of heterogeneous robots to enhance the communication capabilities of the whole multirobot network has attracted a lot of attention because it brings new research challenges compared with its homogeneous counterpart. Indeed, introducing robots with different kinematics, sensing, and communication resources increases the team's robustness to operate in highly dynamic environments. Cortez et al. [2] described a mobile communication relay for a network of sensors and derived connectivity constraints among the network members; these constraints are used to maximize the feasible motion sets of the sensing agents. In this work, the network is assumed to move in an obstacle-free environment and its agents are modeled as point robots. In [9], a team of UGVs performs a collaborative task while a team of UAVs is positioned in a configuration that optimizes the communication link quality to support the UGVs; however, Gil et al. [9] assume that the UGVs are static to guarantee the connectivity of the UAV-UGV network. By use of a heterogeneous robotic
system, a search/pursuit scenario is implemented in [10], where a control algorithm guarantees a certain level of signal-to-interference-plus-noise ratio among the members of the system. Even though the field of view of the sensors in the network is considered, the geometry of the agents is ignored. Communication maintenance for a group of heterogeneous robots is enforced in [11] by a passivity-based decentralized strategy. This approach allows the creation/deletion of communication links at any time as long as global connectivity is preserved, but the strategy is not tested in the case where the network has a primary goal, such as sensing, on top of maintaining connectivity.

Motivated by these applications, we are interested in exploiting the heterogeneity of a multiagent network made of aerial and ground robots. We envision a scenario where a group of UAVs (eg, quadrotors [12]) interacts with a group of ground robots (eg, OctoRoACHes [13]) such that they extend the situational awareness of human task forces in a disaster area with collapsed structures. A sketch of this scenario is shown in Fig. 1A. This multirobot heterogeneous network can provide operational capabilities to first responders that would otherwise be costly, impossible, or dangerous to achieve. Therefore we seek to develop strategies to maintain network connectivity at all times while different regions of interest are sensed. We assume that the ground agents are better equipped for sensing purposes, while the aerial vehicles act as mobile wireless communication routers. We use potential field methods to coordinate the ground mobile sensors, so they move toward the regions of interest, avoiding obstacle and interagent collisions. For the case of the aerial routers, we employ a cooperative RL
Fig. 1 (A) A group of aerial relays cooperating with a group of mobile ground sensors to explore a disaster area with collapsed structures. (B) A heterogeneous robotic network made of a quadrotor equipped with four directional antennas, one on each arm, and three OctoRoACHes moving around a cluttered environment.
technique to dynamically relocate them in such a way that the whole multirobot network stays connected. In particular, our algorithm seeks to increase the received signal strength (RSS) among the relays and the sensors. RSS is a reasonable proxy for link quality, so by improving the RSS, one can enhance connectivity in networked robotic teams [14–16].

The rest of the chapter is organized as follows. Section 2 describes the heterogeneous robotic team considered in this work, together with the motion dynamics assumed for these agents. The concepts related to the connectivity of the heterogeneous robotic network, including the definitions of RSS, antenna diversity, and the connectivity matrix, are explained in Section 3. In Section 4 we state the problem addressed herein. The control methods for the mobile sensors and for the flying relays are detailed in Sections 5 and 6, respectively. The proposed control algorithms are tested in simulation and their results are discussed in Section 7. Section 8 gives our concluding remarks.
2 HETEROGENEOUS ROBOTIC TEAM

The term heterogeneous is used in robotics to describe a group of agents with differences in their hardware structure (hardware-based heterogeneity) and/or their mission objective (objective-based heterogeneity) [2]. Hardware heterogeneity includes, for example, different agent dynamics and distinct sensing capabilities and communication ranges. With respect to objective-based heterogeneity, the overall mission goal can be divided into multiple subobjectives which are assigned to each agent. Thus the particular abilities of each member of the network are employed to accomplish the subgoals effectively, which in turn results in successful completion of the global objective. Fig. 1B shows an example of a robotic network that has hardware- and objective-based heterogeneity: a robotic team formed by three crawler robots (OctoRoACHes) and one quadrotor deployed in a cluttered environment. The aerial vehicle has four antennas, one on each arm, to improve the network communication range while the OctoRoACHes explore the environment.

In this work we focus on a set of robots $\mathcal{A}$ made of two types of agents: aerial vehicles acting mainly as communication relays and ground robots better equipped for sensing. Thus $\mathcal{A}$ is a network of robots that shows hardware- and objective-based heterogeneity. We define $\mathcal{S} = \{s_1, s_2, \ldots, s_N\}$ as the set of $N$ mobile sensors and $\mathcal{R} = \{r_1, r_2, \ldots, r_M\}$ as the set of $M$ aerial relays. Notice that $\mathcal{A} = \mathcal{S} \cup \mathcal{R}$ and $\mathcal{S} \cap \mathcal{R} = \emptyset$.
Fig. 2 Workspace where the heterogeneous robotic network operates. (A) Three-dimensional view and (B) two-dimensional view. The key in (A) facilitates the reference of the elements that are part of the workspace $\mathcal{W}$: $\mathcal{W}$, 3D workspace; $\mathcal{F}_W$, fixed Cartesian frame; $b$, base station; $s_i$, mobile sensor; $r_i$, aerial router; $o_i$, static obstacle; $\tau_i$, fixed target region.
We denote $I_\mathcal{S}$ and $I_\mathcal{R}$ as the index sets of $\mathcal{S}$ and $\mathcal{R}$, respectively. We assume that the set of robots $\mathcal{A}$ operates in $\mathcal{W} \subset \mathbb{R}^3$, a compact subset of the three-dimensional Euclidean space. Inside $\mathcal{W}$ there is a fixed base station $b$, whose position is denoted as $q_b$. Also in $\mathcal{W}$ there are $L$ convex obstacles grouped in the set $\mathcal{O} = \{o_1, o_2, \ldots, o_L\}$ and $K$ target regions that form the set $\mathcal{T} = \{\tau_1, \tau_2, \ldots, \tau_K\}$ such that $\mathcal{O} \cap \mathcal{T} = \emptyset$. We denote $I_\mathcal{O}$ and $I_\mathcal{T}$ as the index sets of $\mathcal{O}$ and $\mathcal{T}$, respectively. Embedded in $\mathcal{W}$ there is a Cartesian frame $\mathcal{F}_W$ that helps to describe the position and orientation of the robots, objects, and targets in the workspace $\mathcal{W}$. Fig. 2 depicts the environment created for simulation purposes; the key in Fig. 2A facilitates the reference of the elements inside $\mathcal{W}$. In the next section we introduce the models for the mobile sensors and the aerial routers.
2.1 Motion Dynamics

For every $i$ in the index set $I_\mathcal{S}$, we define the state vector of the $i$th mobile sensor as $q_{s_i} = [x_{s_i} \; y_{s_i} \; v_{s_i} \; \theta_{s_i}]^T$. Here, $(x_{s_i}, y_{s_i})$ is the $xy$ position of the $i$th mobile sensor, $v_{s_i}$ is its linear speed, and $\theta_{s_i}$ is its heading angle. We assume that the equations of motion of the $i$th mobile sensor are given by the unicycle model

$$\dot{x}_{s_i} = v_{s_i} \cos\theta_{s_i}, \quad \dot{y}_{s_i} = v_{s_i} \sin\theta_{s_i}, \quad \dot{v}_{s_i} = a_{s_i}, \quad \dot{\theta}_{s_i} = \omega_{s_i}, \tag{1}$$
where $a_{s_i}$ and $\omega_{s_i}$ are the linear acceleration and the angular velocity, respectively. We define $u_{s_i} = [a_{s_i} \; \omega_{s_i}]^T$ as the control input of the $i$th mobile sensor.

For every $j$ in $I_\mathcal{R}$, the state vector of the $j$th aerial router is given by $q_{r_j} = [x_{r_j} \; y_{r_j} \; z_{r_j}]^T$. This state specifies the position of the router with respect to $\mathcal{F}_W$. The motion dynamics for this case are given by the holonomic model

$$\ddot{q}_{r_j} = u_{r_j}, \tag{2}$$
where $u_{r_j} \in \mathbb{R}^3$ is the acceleration control input.
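As a minimal sketch (our own illustration, not code from the chapter), both motion models can be integrated with forward Euler; we use the 0.005 s time step reported later in Section 7.

```python
import numpy as np

DT = 0.005  # integration time step (s), as in Section 7

def step_sensor(q, a, w):
    """Unicycle model, Eq. (1): q = [x, y, v, theta]; inputs a (linear
    acceleration) and w (angular velocity)."""
    x, y, v, th = q
    return np.array([x + DT * v * np.cos(th),
                     y + DT * v * np.sin(th),
                     v + DT * a,
                     th + DT * w])

def step_relay(q, q_dot, u):
    """Double integrator, Eq. (2): position q and velocity q_dot in R^3,
    acceleration input u in R^3."""
    return q + DT * q_dot, q_dot + DT * u
```

In the next section we detail our assumptions with respect to the communication capabilities of the heterogeneous robotic team.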
3 NETWORK CONNECTIVITY

We consider as communication nodes the base station $b$, the set of sensors $\mathcal{S}$, and the set of relays $\mathcal{R}$. Thus we define $\mathcal{M} = \{b\} \cup \mathcal{S} \cup \mathcal{R}$ as the set of communication nodes. The cardinality of this set is $|\mathcal{M}| = N + M + 1$. Let $I_\mathcal{M}$ be the index set of $\mathcal{M}$. Then for all $i \in I_\mathcal{M}$, the set of neighbors in the vicinity of the communication node $n_i \in \mathcal{M}$ is given by

$$\mathcal{N}_{n_i} = \left\{ n_j \in \mathcal{M} \mid d(n_i, n_j) \le \Delta_{n_i} \right\}. \tag{3}$$

Here $d(n_i, n_j)$ is the distance between the nodes $n_i$ and $n_j$, and $\Delta_{n_i}$ is the maximum communication range of node $n_i$.
3.1 Received Signal Strength

Assuming the $i$th node acts as a receiver and the $j$th node acts as a transmitter, we adopt a log-distance path-loss model [17] to estimate the power loss between the transmitter and the receiver. For this model, the noise-free received power at the $i$th node from the $j$th node is given by

$$p_{ij} = p_{TX_j} \, \kappa \left( \frac{d_o}{d(n_i, n_j)} \right)^{\alpha}, \tag{4}$$

where $p_{TX_j}$ is the power transmitted by the $j$th node, $d_o$ is a reference close-in distance for the antenna far field, $\alpha$ is the path-loss exponent, and $\kappa$ is a unitless constant that depends on the antenna characteristics and the frequency of operation. We assume that $p_{TX_j}$ is the same for all nodes, so $p_{TX_j} = p_{TX}$. The log-distance model is a simplified piecewise model with two segments: free-space propagation is assumed up to distance $d_o$ and non-free-space propagation is assumed for distances greater than $d_o$. Typical values for $d_o$ are 1–10 m indoors and 10–100 m outdoors. The value of the path-loss exponent $\alpha$ depends on the propagation environment, ranging from 2 for free-space propagation to 6 for heavily cluttered environments [17]. The constant $\kappa$ can be set to the free-space path gain at distance $d_o$, given by

$$\kappa = G_{n_i} G_{n_j} \left( \frac{\lambda}{4\pi d_o} \right)^2. \tag{5}$$

Here $G_{n_i}$ and $G_{n_j}$ are the antenna gains for the nodes $n_i$ and $n_j$, respectively, and $\lambda = c/f$ is the wavelength of the transmitted signal ($c$ is the speed of light and $f$ is the communication frequency). We assume that the base station and the mobile sensors are equipped with omnidirectional antennas, so the antenna gain in these cases is 1. On the other hand, we consider that the aerial relays have directional antennas, whose gain takes on a value of 1 or less.

Every receiver-transmitter link is affected by obstructions and clutter, so the effective received power at the $i$th node from the $j$th node can be expressed as

$$P_{ij} = p_{ij} \, \zeta_{ij}, \tag{6}$$

with $\zeta_{ij} = 10^{\chi_{ij}/10}$ being the model of log-normal shadowing noise [17]. Each $\chi_{ij}$ is assumed to have a Gaussian distribution with zero mean and variance $\sigma_{ij}^2$. Furthermore, we assume independent and homogeneous log-normal shadowing across all nodes, so $\sigma_{ij}^2 = \sigma^2$.

The total RSS at the $i$th node, $RSS_{n_i}$, represents an aggregate of the power emanating from all the transmitters in its vicinity. This parameter, measured in decibels, can be expressed as [18]

$$RSS_{n_i} = 10 \log_{10} \sum_{j \in I_{\mathcal{N}_{n_i}}} P_{ij}, \tag{7}$$

where $I_{\mathcal{N}_{n_i}}$ is the index set of $\mathcal{N}_{n_i}$. From Eq. (6), we also define the RSS at the $i$th node with respect to the $j$th node as

$$RSS_{n_i, n_j} = 10 \log_{10} \left( p_{ij} \, \zeta_{ij} \right). \tag{8}$$
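To make the model concrete, here is a small Python sketch (our own, with hypothetical helper names) of Eqs. (3)–(8), using the parameter values later listed in Table 1 ($p_{TX}$ = 60 mW, $\alpha$ = 2.5, $d_o$ = 0.5 m, $f$ = 2.4 GHz, $\sigma$ = 0.25).

```python
import numpy as np

C_LIGHT = 3.0e8  # speed of light (m/s)

def neighbors(positions, i, delta):
    """Index set of the nodes within range delta of node i, Eq. (3)."""
    return [j for j, p in enumerate(positions)
            if j != i and np.linalg.norm(p - positions[i]) <= delta]

def received_power(d, p_tx=0.060, alpha=2.5, d_o=0.5, f=2.4e9,
                   sigma=0.25, g_rx=1.0, g_tx=1.0, rng=np.random):
    """Effective received power P_ij (W) over a link of length d, Eqs. (4)-(6)."""
    lam = C_LIGHT / f                                    # wavelength (m)
    kappa = g_rx * g_tx * (lam / (4 * np.pi * d_o))**2   # Eq. (5)
    p = p_tx * kappa * (d_o / d)**alpha                  # Eq. (4), noise free
    zeta = 10 ** (rng.normal(0.0, sigma) / 10)           # log-normal shadowing
    return p * zeta                                      # Eq. (6)

def rss_dB(powers):
    """Total RSS in dB from the powers of all transmitters in range, Eq. (7)."""
    return 10 * np.log10(sum(powers))
```

For a single receiver-transmitter pair, `rss_dB([received_power(d)])` gives the pairwise quantity of Eq. (8).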
3.2 Antenna Diversity

Fig. 3 (A) An aerial relay equipped with four directional antennas. (B) The numbers assigned to each antenna.

For the case of the aerial relays, we consider that each of them has four directional antennas. Each antenna is mounted at the end of an arm (see Fig. 3). The use of multiple antennas, known as antenna diversity,
improves wireless links and is especially effective at mitigating multipath. Antenna diversity also has the advantage that while one antenna is in a deep fade, it is probable that another antenna has a strong enough signal [17, 19]. We assume directional antennas since they radiate greater power in a specific direction, as opposed to omnidirectional antennas, which radiate power uniformly. This characteristic allows the quadrotor to sense the unknown RF environment and find a direction to move that improves the communication performance. We assume the incoming signal is received by all four antennas but only the signal taken from the antenna with the highest RSS is used. Therefore four RSS values have to be calculated for every aerial relay $r_j \in \mathcal{R}$. These values can be found by application of Eqs. (4)–(7), considering the distance with respect to each antenna as well as the individual antenna gain. Let $RSS^{(k)}_{r_j}$ be the RSS of the $k$th antenna at the $j$th aerial relay, with $k = 1, \ldots, 4$. Then the RSS for the $j$th relay is

$$RSS_{r_j} = \max_{k} RSS^{(k)}_{r_j}. \tag{9}$$
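As a concrete illustration (our own sketch), the diversity rule of Eq. (9) is simply a maximum over the four per-antenna RSS values; keeping the index of the maximizing antenna as well anticipates Eq. (20) in Section 6.

```python
import numpy as np

def select_antenna(rss_k):
    """Eq. (9): pick the strongest of the four per-antenna RSS values (dB).

    Returns the maximum RSS and the number (1-4) of the antenna that
    achieves it, as needed later by Eq. (20).
    """
    k = int(np.argmax(rss_k))  # rss_k is a length-4 array of RSS values
    return rss_k[k], k + 1
```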
3.3 Connectivity Matrix

To describe the connectivity of the network formed by the nodes in $\mathcal{M}$, we construct a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. This graph consists of the set of vertices $\mathcal{V} = \mathcal{M}$ and the set of edges $\mathcal{E}$ such that for all $n_i, n_j \in \mathcal{M}$,

$$(n_i, n_j) \in \mathcal{E} \quad \text{iff} \quad i \neq j \ \text{and} \ RSS_{n_i,n_j} \ge \upsilon_{ij}. \tag{10}$$

Here $\upsilon_{ij}$ is the RSS threshold needed to maintain a minimal link quality between the nodes $n_i$ and $n_j$. We consider that $\upsilon_{ij} = \upsilon$ for all nodes. We also assume that the links between all nodes are symmetric (ie, $RSS_{n_i,n_j} = RSS_{n_j,n_i}$). From this assumption and by condition (10), the graph $\mathcal{G}$ is undirected and has no self-loops. For this type of graph, $\mathcal{G}$ is connected if and only if there exists a path between any two vertices [20, 21]. Hence we define an $|\mathcal{M}| \times |\mathcal{M}|$ matrix $C = [C_{ij}]$ such that

$$C_{ij} = \begin{cases} 1 & \text{if } (n_i, n_j) \in \mathcal{E}, \\ 0 & \text{otherwise}. \end{cases}$$

Thus $C_{ij}$ is nonzero just when $RSS_{n_i,n_j}$ is sufficiently high to establish a link between $n_i$ and $n_j$. For this reason we call $C$ the connectivity matrix. Let $L$ be the Laplacian matrix associated with the graph $\mathcal{G}$, defined by

$$L = \mathrm{diag}(\delta_i) - C, \tag{11}$$

with $\delta_i = \sum_{j=1}^{|\mathcal{M}|} C_{ij}$. From graph theory [20, 21] we know that $\mathcal{G}$ is connected iff $\mathrm{rank}(L) = |\mathcal{M}| - 1 = N + M$.
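This rank test is straightforward to implement. Below is a minimal sketch (our own, not the authors' code) that builds $C$ from a matrix of pairwise RSS values and checks connectivity via the Laplacian of Eq. (11).

```python
import numpy as np

def is_connected(rss, upsilon=-70.0):
    """Check network connectivity from pairwise RSS values.

    rss is a symmetric |M| x |M| array with RSS_{ni,nj} in dB; upsilon is
    the link-quality threshold (-70 dB in Table 1).
    """
    m = rss.shape[0]
    C = (rss >= upsilon).astype(float)  # adjacency from condition (10)
    np.fill_diagonal(C, 0.0)            # no self-loops
    L = np.diag(C.sum(axis=1)) - C      # Laplacian, Eq. (11)
    return np.linalg.matrix_rank(L) == m - 1
```

Now that we have introduced our assumptions, we state the specific problem addressed in this chapter next.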
4 DEPLOYMENT MAINTAINING NETWORK CONNECTIVITY

Given a heterogeneous robotic team $\mathcal{A}$ formed by a set of mobile sensors $\mathcal{S}$ with dynamics given by Eq. (1) and a set of aerial relays $\mathcal{R}$ with dynamics given by Eq. (2), our goal is to design controllers for these two sets such that the sensors get measurements of the target regions specified in the set $\mathcal{T}$, while the aerial relays maintain network connectivity between the mobile sensors and the base station $b$. We consider that every mobile sensor in $\mathcal{S}$ has been assigned a target region in $\mathcal{T}$ whose location is known a priori. We know from Section 2 that the workspace $\mathcal{W}$ is populated with $L$ static obstacles. Therefore the controller for each mobile sensor is required to guide it to get measurements of its preassigned target region while avoiding collisions with obstacles and other mobile sensors. Section 5 provides the details of this controller, designed on the basis of potential functions [22–24].
For the case of the aerial relays, we assume they fly at a safe height above the obstacles at all times. Thus the aerial relays do not require obstacle avoidance, but they have to keep a safe distance among themselves. While the goal of the mobile sensors is to get measurements of the target regions, the flying relays have to keep the connectivity of the whole network, particularly between the base station and the mobile sensors. The proposed controller for the relays (see Section 6) uses a cooperative learning technique [4] to maintain the RSS between the relays and the sensors above a threshold so as to have a good link quality.
5 MOBILE SENSOR CONTROLLER

The control objectives for the mobile sensors are (i) getting measurements of the assigned target region, (ii) avoiding obstacles, and (iii) avoiding collisions with other sensors. Thus we use potential field methods with an attractive potential function to move toward the assigned target region and a repulsive potential function to avoid collisions.
5.1 Potential Field Method

This motion planning method controls the movement of a robot on the basis of the gradient of a potential function [22]. Potential field methods were originally developed as an online collision avoidance approach, applicable when a robot does not have an a priori model of the obstacles but senses them during motion execution. Roughly speaking, a potential field method acts as a fastest-descent optimization procedure, so it may get stuck at a local minimum other than the goal configuration. However, various techniques [23, 24] have been demonstrated to be valid for escaping from local minima. In our simulations we use a random walk [22], the most commonly used remedy, whenever an agent gets stuck at a local minimum.

Let $U_{s_i}$ be the total potential function for the mobile sensor $s_i \in \mathcal{S}$. This function is given by

$$U_{s_i} = U_{att}(q_{s_i}) + U_{rep}(q_{s_i}), \tag{12}$$

where $U_{att}(q_{s_i})$ is the attractive potential defined on the basis of the position of $s_i$ and the position of its assigned target region, and $U_{rep}(q_{s_i})$ is the
repulsive potential generated considering the obstacles and the locations of the other mobile sensors. The attractive potential is defined as

$$U_{att}(q_{s_i}) = \begin{cases} \frac{1}{2} \beta_{att} \, d^2(q_{s_i}, \tau_{s_i}) & \text{if } d(q_{s_i}, \tau_{s_i}) \ge \epsilon_{sen}, \\ 0 & \text{otherwise}. \end{cases} \tag{13}$$

Here $d(q_{s_i}, \tau_{s_i})$ is the distance between $s_i$ and its assigned target region $\tau_{s_i}$, $\epsilon_{sen}$ is the minimum range to get measurements of a target region, and $\beta_{att}$ is a scaling factor. The repulsive potential is given by

$$U_{rep}(q_{s_i}) = U_s(q_{s_i}) + U_o(q_{s_i}), \tag{14}$$

where $U_s(q_{s_i})$ is used to avoid intersensor collisions and $U_o(q_{s_i})$ is used to prevent obstacle collisions. These functions are defined as

$$U_s(q_{s_i}) = \sum_{j=1,\, j \neq i}^{N} U(q_{s_i}, q_{s_j}), \quad U(q_{s_i}, q_{s_j}) = \begin{cases} \frac{1}{2} \beta_s \left( \frac{1}{d(q_{s_i}, q_{s_j})} - \frac{1}{\epsilon_s} \right)^2 & \text{if } d(q_{s_i}, q_{s_j}) \le \epsilon_s, \\ 0 & \text{otherwise} \end{cases} \tag{15}$$

and

$$U_o(q_{s_i}) = \sum_{j=1}^{L} U(q_{s_i}, o_j), \quad U(q_{s_i}, o_j) = \begin{cases} \frac{1}{2} \beta_o \left( \frac{1}{d(q_{s_i}, o_j)} - \frac{1}{\epsilon_o} \right)^2 & \text{if } d(q_{s_i}, o_j) \le \epsilon_o, \\ 0 & \text{otherwise}, \end{cases} \tag{16}$$

respectively. In Eqs. (15) and (16), $d(q_{s_i}, q_{s_j})$ is the distance between the $i$th and the $j$th mobile sensors, $d(q_{s_i}, o_j)$ is the distance between the $i$th mobile sensor and the $j$th obstacle, $\epsilon_s$ is the clearance distance to avoid intersensor collisions, $\epsilon_o$ is the influence distance of the obstacles, and $\beta_s$ and $\beta_o$ are scaling factors. We assume that $0 < \epsilon_s < \epsilon_o$.

From Section 2 we have that the control vector for the mobile sensor $s_i$ is $u_{s_i} = [a_{s_i} \; \omega_{s_i}]^T$. Using the gradient of the total potential function $U_{s_i}$, given by

$$\nabla U_{s_i} = \left[ \frac{\partial U_{s_i}}{\partial x_{s_i}} \;\; \frac{\partial U_{s_i}}{\partial y_{s_i}} \;\; \frac{\partial U_{s_i}}{\partial \theta_{s_i}} \right]^T,$$

we define the control inputs for the $i$th mobile sensor as

$$a_{s_i} = -\left[ \cos\theta_{s_i} \;\; \sin\theta_{s_i} \;\; 0 \right] \nabla U_{s_i} - k_0 v_{s_i} \quad \text{and} \quad \omega_{s_i} = k_1 \left( \operatorname{arctan2}\!\left( -\frac{\partial U_{s_i}}{\partial y_{s_i}}, -\frac{\partial U_{s_i}}{\partial x_{s_i}} \right) - \theta_{s_i} \right), \tag{17}$$

where $k_0$ and $k_1$ are control constants. The negative gradient steers the sensor downhill on $U_{s_i}$, consistent with the fastest-descent interpretation above.
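As an illustration only (our own sketch, with a finite-difference gradient standing in for the analytic one), the controller of Eq. (17) can be prototyped as follows. Obstacles are represented by their closest points to the sensor, the potentials depend only on the $xy$ position (so $\partial U_{s_i}/\partial \theta_{s_i} = 0$), and the constants follow Table 1.

```python
import numpy as np

EPS_SEN, EPS_S, EPS_O = 0.1, 0.2, 0.3   # ranges (m), Table 1
B_ATT, B_S, B_O = 0.6, 0.4, 0.4         # scaling factors, Table 1
K0, K1 = 0.75, 1.25                     # control constants, Table 1

def total_potential(p, target, others, obstacles):
    """U_si at xy position p, Eqs. (13)-(16)."""
    d_t = np.linalg.norm(p - target)
    u = 0.5 * B_ATT * d_t**2 if d_t >= EPS_SEN else 0.0    # Eq. (13)
    for q, eps, beta in ([(q, EPS_S, B_S) for q in others] +
                         [(o, EPS_O, B_O) for o in obstacles]):
        d = np.linalg.norm(p - q)
        if d <= eps:                                       # Eqs. (15)-(16)
            u += 0.5 * beta * (1.0 / d - 1.0 / eps)**2
    return u

def sensor_control(p, theta, v, target, others, obstacles, h=1e-4):
    """Control inputs (a, w) of Eq. (17) from a central-difference gradient."""
    grad = np.array([(total_potential(p + h * e, target, others, obstacles) -
                      total_potential(p - h * e, target, others, obstacles))
                     / (2 * h) for e in np.eye(2)])
    a = -np.array([np.cos(theta), np.sin(theta)]) @ grad - K0 * v
    w = K1 * (np.arctan2(-grad[1], -grad[0]) - theta)
    return a, w
```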
6 AERIAL RELAY CONTROLLER

The main goal of the flying relays is to maintain the connectivity of the network, especially between the base station and the mobile sensors. We define $\mathcal{D}_{r_j} \subset \{b\} \cup \mathcal{R}$ as the set of neighbors with respect to which the $j$th relay has to keep a relative distance $\mu_r$. This set is defined a priori for each relay. We assume that one of the relays has the base station as a neighbor and that all the relays have at least one neighbor. Also, we consider that if $r_j$ is a neighbor of $r_k$, then $r_k$ is also a neighbor of $r_j$. From our definition of $\mathcal{D}_{r_j}$, $\mathcal{D}_{r_j} \cap \mathcal{S} = \emptyset$ for every $j$. We do not require a relay to maintain a sensor in its vicinity, since the sensors explore $\mathcal{W}$, moving from the vicinity of one relay to another. However, we use the RSS between the relays and the sensors in their vicinity to maintain network connectivity.

We define the acceleration control input for the $j$th aerial router as

$$u_{r_j} = k_2 \sum_{i \in I_{\mathcal{D}_{r_j}}} \left( \ell_{ij} - \mu_r \right) e_{ij} - k_3 \dot{q}_{r_j} - k_4 \left( q_{r_j} - q_{ref} \right), \tag{18}$$

where $I_{\mathcal{D}_{r_j}}$ is the index set of $\mathcal{D}_{r_j}$. The first two terms in Eq. (18) create a virtual spring [10] between $r_j$ and its neighbors, so the relative distance $\mu_r$ can be maintained. For these terms, $k_2$ and $k_3$ are the spring constant and the damping coefficient, respectively. Also, for every $i$th neighbor of $r_j$ in $\mathcal{D}_{r_j}$, we define the length and the direction of the force of the virtual spring as

$$\ell_{ij} = \left\| q_i - q_{r_j} \right\| \quad \text{and} \quad e_{ij} = \frac{q_i - q_{r_j}}{\left\| q_i - q_{r_j} \right\|},$$

respectively. Here $q_i$ is the position with respect to $\mathcal{F}_W$ of the $i$th neighbor of $r_j$. The last term in Eq. (18) helps to move the aerial relay $r_j$ such that its RSS is improved, taking into account the RSS with respect to the mobile sensors in its vicinity. From Section 3.2 and using Eq. (7), we redefine the RSS for the $j$th aerial relay considering just the mobile sensors in its vicinity. Thus we get

$$RSS^{(k)}_{r_j} = 10 \log_{10} \sum_{i:\, n_i \in \mathcal{N}_{r_j} \cap\, \mathcal{S}} P_{ij}, \tag{19}$$

where $k = 1, 2, 3, 4$ is the number of the antenna. Then, redefining Eq. (9), we obtain

$$[RSS_{r_j}, A_{r_j}] = \max_k RSS^{(k)}_{r_j}, \quad \text{with } k = 1, 2, 3, 4. \tag{20}$$
Here the max function returns not only $RSS_{r_j}$ (the maximum RSS over all four antennas) but also the number $A_{r_j}$ of the antenna achieving it. Using $RSS_{r_j}$ and $A_{r_j}$, we select, on the basis of a cooperative RL technique, the reference position $q_{ref}$ required by the last term in Eq. (18).
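Before turning to the learning rule, here is a small sketch (our own, not the authors' implementation) of the spring and reference terms of Eq. (18); the `select_antenna` helper from Section 3.2 already provides the pair $[RSS_{r_j}, A_{r_j}]$ of Eq. (20).

```python
import numpy as np

K2, K3, K4, MU_R = 5.75, 3.75, 2.5, 1.5   # gains and rest length, Table 1

def relay_control(q, q_dot, neighbor_positions, q_ref):
    """Acceleration input u_rj of Eq. (18) for a relay at position q."""
    u = -K3 * q_dot - K4 * (q - q_ref)     # damping and RSS-reference terms
    for qi in neighbor_positions:          # virtual-spring terms
        ell = np.linalg.norm(qi - q)       # spring length l_ij
        e_ij = (qi - q) / ell              # spring direction e_ij
        u += K2 * (ell - MU_R) * e_ij
    return u
```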
6.1 Cooperative Q-Learning

Recently, RL has grown into an effective approach for control applications [25] and, in particular, for allowing multirobot systems to learn cooperation [4, 26, 27]. RL refers to an actor (eg, a robotic agent) which interacts with its environment and modifies its control policy on the basis of a reward received in response to its actions. Thus RL algorithms seek to find the optimal control policy that maximizes the future reward. In RL the actor knows only its previous and current states and the reward indicating how good the previous action was. Let $x^t$ and $u^t$ be the agent state and the action at time $t$. Then the action $u^t$ produces a new state $x^{t+1}$ by interacting with the environment, and a reward $\rho^{t+1}$ is observed. Applying RL, the agent then selects a new action based on $x^t$, $x^{t+1}$, and $\rho^{t+1}$.

Typically, RL algorithms must balance the need to collect informative data (by exploring novel action choices) with the need to control the process (by exploiting the currently available knowledge) [28–30]. Exploration is when the agent selects random actions while learning, regardless of the current state. In this way the agent can explore the entire state-action space and avoid converging to local maxima. In contrast, exploitation is the use of the agent's knowledge in selecting actions: the agent chooses the action that will produce the maximum reward given the current state. The trade-off between exploration and exploitation is challenging. For example, relying mostly on exploration can cause the agent to take a long time to learn. Usually exploration dominates at the beginning of learning; it then diminishes over time as exploitation takes over. One approach to transition between exploration and exploitation is the $\varepsilon$-greedy technique [28], which we use in our algorithm to implement the exploration-exploitation trade-off.

Q-learning is a model-free value iteration RL algorithm which uses a state-action value function known as a Q-function [28–30]. This function represents the expected utility of taking a given action in a given state and following the optimal policy thereafter. In fact, Q-learning starts from an arbitrary Q-function and updates it without requiring a model, using
instead observed state transitions and rewards. Let $x^t_j$, $u^t_j$, and $\rho^t_j$ be the state, action, and reward of the $j$th aerial relay $r_j$ at time $t$, respectively. Then we have the following models for the state, action, and reward.

The state. For every $r_j$, the state is defined by the number of the antenna whose RSS is the highest. From Eq. (20) we have that $x_j = A_{r_j}$, and so $x_j \in \{1, 2, 3, 4\}$. Thus the total number of states per aerial relay is four.

The action. We assume five possible actions: north, south, east, west, and stay, which are associated with the unit vectors

$$e_1 = [1 \; 0 \; 0]^T, \quad e_2 = -e_1, \quad e_3 = [0 \; 1 \; 0]^T, \quad e_4 = -e_3, \quad \text{and} \quad e_5 = \mathbf{0},$$

respectively. Here $\mathbf{0} \in \mathbb{R}^3$ is the zero vector. Once an action has been chosen, we use its associated vector to find the reference position $q_{ref}$ required by Eq. (18). For example, let $u_j = \text{east}$. Then $q_{ref} = q_{r_j} + L e_3$, where $L$ is a reference distance. Since the total number of states is four and the number of actions is five, we have 20 state-action pairs per aerial relay.

The reward. This is calculated on the basis of $RSS_{r_j}$ obtained by application of Eq. (20). Since our purpose is to maintain $RSS_{r_j}$ above the threshold $\upsilon$, we define the reward function as follows:

$$\rho_j = \begin{cases} 100 & \text{if } RSS_{r_j} > \upsilon, \\ 0 & \text{otherwise}. \end{cases} \tag{21}$$

Each aerial relay $r_j$ updates an individual Q-function $\bar{Q}_j(x_j, u_j)$ as follows:

$$\bar{Q}^{t+1}_j(x_j, u_j) = Q^t_j(x_j, u_j) + \eta \left[ \rho^{t+1}_j + \gamma \max_{\hat{u}_j} Q^t_j(x^{t+1}_j, \hat{u}_j) - Q^t_j(x_j, u_j) \right], \tag{22}$$

where $\eta \in (0, 1]$ is the learning rate and $\gamma \in (0, 1]$ is the discount factor. However, our intention is to allow each aerial relay to aggregate information from its relay neighbors via their individual Q-functions. Let $\mathcal{B}_{r_j} = \mathcal{D}_{r_j} \setminus \{b\}$ be the set of relay neighbors of relay $r_j$. Then each $r_j$ updates the total Q-function $Q_j(x_j, u_j)$ following the cooperative Q-learning rule described in [4]:

$$Q^{t+1}_j(x_j, u_j) = w \bar{Q}^t_j(x_j, u_j) + (1 - w) \sum_{i \in I_{\mathcal{B}_{r_j}}} \bar{Q}^t_i(x_i, u_i), \tag{23}$$
where $I_{\mathcal{B}_{r_j}}$ is the index set of $\mathcal{B}_{r_j}$ and $w \in [0, 1]$ is the trust weight. If $w = 1$, the Q-function is updated trusting only the knowledge of the relay itself (independent learning). On the other hand, if $w = 0$, the aerial relay updates its Q-function only on the basis of the Q-functions of its relay neighbors. When $0 < w < 1$, the learning is performed in a cooperative fashion. The convergence of the cooperative Q-learning algorithm is analyzed in [4]: $Q^t$ will converge if $\eta \in (0, 1)$ and $\gamma \in (0, 1/n)$, with $n$ being the number of agents performing the cooperative learning. For a complete discussion, the reader is referred to [4].

Once the Q-function has been updated, we use the $\varepsilon$-greedy technique to balance exploration with exploitation [28, 30]. This method selects actions according to

$$u^t_j = \begin{cases} \arg\max_{\hat{u}_j} Q^t_j(x_j, \hat{u}_j) & \text{with probability } 1 - \varepsilon^t \text{ (exploit)}, \\ \text{uniformly random action} & \text{with probability } \varepsilon^t \text{ (explore)}. \end{cases} \tag{24}$$

Here $\varepsilon^t \in (0, 1]$ is the exploration probability at time step $t$. A typical value for the exploration probability is $\varepsilon^t = 1/t$. Algorithm 15.1 summarizes the cooperative Q-learning method for every aerial relay.

Algorithm 15.1 Cooperative Q-Learning for All $r_j \in \mathcal{R}$
Require: learning rate $\eta$, discount factor $\gamma$, initial Q-function $Q^0_j$ (eg, $Q^0_j \leftarrow 0$), initial state $x^0_j$, trust weight $w$
1: for every time step $t = 0, 1, 2, \ldots$ do
2:   $u^t_j \leftarrow \arg\max_{\hat{u}_j} Q^t_j(x_j, \hat{u}_j)$ with probability $1 - \frac{1}{t+1}$, or a uniformly random action in {north, south, east, west, stay} with probability $\frac{1}{t+1}$
3:   Apply $u^t_j$
4:   Find the next state $x^{t+1}_j$ and the reward $\rho^{t+1}_j$
5:   $\bar{Q}^{t+1}_j(x_j, u_j) \leftarrow Q^t_j(x_j, u_j) + \eta \left[ \rho^{t+1}_j + \gamma \max_{\hat{u}_j} Q^t_j(x^{t+1}_j, \hat{u}_j) - Q^t_j(x_j, u_j) \right]$
6:   $Q^{t+1}_j(x_j, u_j) \leftarrow w \bar{Q}^t_j(x_j, u_j) + (1 - w) \sum_{i \in I_{\mathcal{B}_{r_j}}} \bar{Q}^t_i(x_i, u_i)$
7: end for
7 SIMULATION RESULTS

In this section we present simulation results of the proposed method for a network of mobile sensors (Section 5) and aerial relays (Section 6). For the case of the cooperative Q-learning method (Algorithm 15.1), we compare independent learning ($w = 1$) with cooperative learning (assuming $w = 0.5$) in terms of the network connectivity. From our discussion in Section 3, we know that the network formed by the base station $b$, the set of $N$ mobile sensors $\mathcal{S}$, and the set of $M$ aerial relays $\mathcal{R}$ is connected if the rank of the Laplacian matrix $L$ given by Eq. (11) is equal to $N + M$. Therefore the relative connectivity index [4] can be defined as

$$c = \frac{\mathrm{rank}(L)}{N + M}. \tag{25}$$

The network is connected if and only if $c = 1$. Otherwise it is broken, and as $c$ tends to zero, the network has more and more disconnected components [21]. We compute this index during our simulation tests.

We consider a heterogeneous team of robots made of four mobile sensors (ie, $N = 4$) and three aerial relays (ie, $M = 3$). The workspace $\mathcal{W}$ is of size $4 \times 4 \times 1.5$ m$^3$, populated with four convex obstacles and two target regions. Fig. 4 illustrates this workspace, where the obstacles are
in gray and the target regions are in green. The base station is represented in blue and its position is $q_b = [-1.75 \;\; 1.75 \;\; 0]^T$ m. Also, the initial positions of the ground sensors and the aerial relays are shown. The blue dashed circles denote the range of each flying relay. We assume that all communication nodes have the same range of 1 m (ie, for all $n_i \in \mathcal{M}$, $\Delta_{n_i} = 1$ m).

Fig. 4 (A) Three-dimensional view and (B) two-dimensional view of the workspace employed for the simulation tests. The workspace $\mathcal{W}$ has a base station $b$, four obstacles $\mathcal{O} = \{o_1, o_2, o_3, o_4\}$, and two target regions $\mathcal{T} = \{\tau_1, \tau_2\}$. The heterogeneous robotic network is formed by four mobile sensors, $\mathcal{S} = \{s_1, s_2, s_3, s_4\}$, and three aerial relays, $\mathcal{R} = \{r_1, r_2, r_3\}$. The blue squares inside $\tau_1$ and $\tau_2$ denote the reference points used for the initial deployment of the aerial relays.

For the set of relays $\mathcal{R} = \{r_1, r_2, r_3\}$, we have $\mathcal{D}_{r_1} = \{b, r_2, r_3\}$, $\mathcal{D}_{r_2} = \{r_1, r_3\}$, and $\mathcal{D}_{r_3} = \{r_1, r_2\}$, where the last two follow from the neighbor symmetry assumed in Section 6. Thus $r_1$ has to keep a relative distance with the base station and with the other relays, while $r_2$ and $r_3$ have to do the same with $r_1$. The time step is 0.005 s and we run the simulation for 30 s. The rest of the simulation parameters are summarized in Table 1.

At the beginning of the simulation, we deploy just the aerial routers such that $r_2$ gets closer to the first target $\tau_1$ and $r_3$ gets closer to the second target $\tau_2$. To achieve this, we use the control law (18), but $q_{ref}$ is fixed to a position inside the respective target region; the blue squares within $\tau_1$ and $\tau_2$ represent these reference positions. For the case of $r_1$ we also use Eq. (18) but without the third term (ie, $k_4 = 0$). Therefore we do not use the learning algorithm during this initial deployment. Fig. 5 shows a series of snapshots of the deployment of the aerial relays. Since we take into consideration damping effects in the controller (18), the mesh of virtual springs between the flying relays behaves similarly to a real spring system, in which dissipative forces act against the movement of the springs, ensuring that the velocity eventually reaches zero. This behavior guarantees that the network of aerial relays eventually reaches a rest state [10], which indicates the end of the initial deployment. Indeed, the initial deployment maneuver finishes (ie, the flying relays reach the rest state) around 4 s after it started.
Table 1 Simulation parameters

Network connectivity: $p_{TX} = 60$ mW, $\alpha = 2.5$, $d_o = 0.5$ m, $f = 2.4$ GHz, $\sigma = 0.25$, $\upsilon = -70$ dB
Mobile sensor controller: $\epsilon_{sen} = 0.1$ m, $\epsilon_s = 0.2$ m, $\epsilon_o = 0.3$ m, $\beta_{att} = 0.6$, $\beta_s = \beta_o = 0.4$, $k_0 = 0.75$, $k_1 = 1.25$
Aerial relay controller: $\mu_r = 1.5$ m, $\eta = 0.2$, $\gamma = 0.25$, $L = 0.75$ m, $k_2 = 5.75$, $k_3 = 3.75$, $k_4 = 2.5$
Fig. 5 Snapshots of the initial deployment of the aerial routers. Time elapsed since the start of the simulation: (A) 1 s, (B) 2 s, and (C) 4 s. The three-dimensional views are at the top and the corresponding two-dimensional views are at the bottom. The maneuver finishes around 4 s after the start of the simulation.
As can be seen in Fig. 5, the relays move such that they keep the relative distance with respect to their predefined neighbors close to $\mu_r$ while they travel toward their preassigned reference positions in the target regions, especially in the case of $r_2$ and $r_3$. Fig. 5C shows the rest position that the aerial relays achieve.

Once the deployment of the aerial relays is complete, the mobile sensors are commanded to start moving around $\mathcal{W}$ to get measurements of the target regions $\tau_1$ and $\tau_2$ by applying the control law (17). At the same time, the aerial relays start using the learning algorithm together with their controller (18) to enhance the network connectivity. Two of the mobile sensors are assigned to $\tau_1$ and the others are assigned to $\tau_2$. Once a sensor gets to its predefined target region, it keeps moving around within the region. The flying relays do not use the learning algorithm (Algorithm 15.1) at every time step. Instead, Algorithm 15.1 is called every 0.25 s. We do this to allow the aerial relays enough time to move to the reference position $q_{ref}$ computed on the basis of this algorithm.

Fig. 6 shows different time instants of the simulation for the case of independent learning (ie, $w = 1$). For each relay we indicate which antenna has the maximum RSS by changing to red the circle at the end of its corresponding arm. From Fig. 6, we see the mobile sensors move toward their preassigned targets, evading obstacles and collisions with other sensors. At the same time, the aerial relays select the antenna with the maximum RSS and change their position according to the reference position computed on the basis of Algorithm 15.1. In Fig. 6B one of the mobile sensors assigned to $\tau_1$ is out of the range of the relays: it is not inside any of the blue dashed circles associated with the flying relays. This causes the network to be disconnected, making the relative connectivity index, defined by Eq. (25), less than 1.

The evaluation of this index during the simulation experiments is shown in Fig. 7A for the case of independent learning and in Fig. 7B for cooperative learning. As one can see in Fig. 7A and B, the network connectivity is lost more often when the independent learning method is applied. For example, two mobile sensors are out of the communication range of the relays in Fig. 6C, decreasing the RSS and thus the relative connectivity index $c$. Even with cooperative learning, $c$ is not always 1 (meaning that the network is not always connected), but we see from our tests that in the worst case two of the mobile sensors are disconnected for a short time. This is not the case for independent learning, where at some time instant three of the four sensors were disconnected.
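For concreteness, the timing of this experiment can be sketched as follows (our own outline of the loop described above, not the authors' code): dynamics are integrated at the 0.005 s step, while the learner runs every 0.25 s.

```python
DT, T_END, T_LEARN = 0.005, 30.0, 0.25
STEPS_PER_LEARN = int(T_LEARN / DT)   # 50 integration steps per learning call

for t in range(int(T_END / DT)):      # 30 s of simulated time
    # integrate sensor dynamics (1) under the potential-field law (17)
    # integrate relay dynamics (2) under the spring law (18)
    if t % STEPS_PER_LEARN == 0:
        pass  # one cooperative Q-learning step per relay (Algorithm 15.1)
```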
Fig. 6 Snapshots of the simulation experiment under the assumption of independent learning (ie, $w = 1$). Time elapsed since the start of the simulation: (A) 6 s, (B) 10 s, and (C) 20 s. The three-dimensional views are at the top and the corresponding two-dimensional views are at the bottom. In (B) and (C) the network is disconnected since one or more mobile sensors are out of the range of the mobile routers.
Fig. 7 Relative connectivity index $c$ given by Eq. (25): (A) for the case of independent learning (ie, $w = 1$) and (B) for the case of cooperative learning (ie, $0 < w < 1$). We set $w = 0.5$ to simulate the latter case.
8 CONCLUSIONS

We took advantage of the different dynamics, resources, and capabilities of a heterogeneous robotic network made of ground and aerial agents to explore a cluttered environment. Exploiting heterogeneity allows us to use the agents' specialized abilities effectively and therefore accomplish subgoals more efficiently, which in turn results in successful completion of the overall mission. By use of potential field methods, the ground mobile sensors were coordinated to sense regions of interest in the environment while avoiding collisions. At the same time, we combined antenna diversity and cooperative Q-learning to maintain a certain RSS level between the sensors and the aerial communication relays. We analyzed the effectiveness of our proposed method in a simulation scenario where we considered, for the relays, the cases of independent and cooperative learning. Our results showed that the network achieved a better relative connectivity index when the learning was performed in a cooperative fashion.

Since the connectivity of the network is not maintained at all times even when the cooperative method is employed, our future work will consist of extending the learning method to the mobile sensors, so that they will be able to react when their RSS drops below a minimum communication level. In our proposed method, just the aerial relays sense and maintain the RSS between themselves and the sensors. We believe that adding the consideration of a minimum RSS communication level to the mobile sensor controller will help improve the connectivity of the whole network.
ACKNOWLEDGMENTS

This work was supported in part by the Army Research Lab Micro Autonomous Systems and Technology Collaborative Alliance (ARL MAST-CTA #W911NF-08-2-0004). We would like to thank the Ecuadorian scholarship program administered by the Secretaría de Educación Superior, Ciencia, Tecnología e Innovación (SENESCYT) for providing part of the financial support for P.J. Cruz.
REFERENCES

[1] J. Fink, A. Ribeiro, V. Kumar, Robust control of mobility and communications in autonomous robot teams, IEEE Access 1 (2013) 290–309.
[2] A. Cortez, R. Fierro, J. Wood, Connectivity maintenance of a heterogeneous sensor network, in: A. Martinoli, F. Mondada, N. Correll, G. Mermoud, M. Egerstedt, M.A. Hsieh, L.E. Parker, K. Støy (Eds.), Distributed Autonomous Robotic Systems, Springer Tracts in Advanced Robotics, vol. 83, Springer, Berlin, Heidelberg, 2013, pp. 33–46.
[3] P. Cruz, R. Fierro, W. Lu, S. Ferrari, Maintaining robust connectivity in heterogeneous robotic networks, in: Proc. SPIE, vol. 8741, Unmanned Systems Technology XV, 2013, pp. 87410N–87410N-15.
[4] H.M. La, R. Lim, W. Sheng, Multirobot cooperative learning for predator avoidance, IEEE Trans. Control Syst. Technol. 23 (1) (2015) 52–63.
[5] G.A.S. Pereira, A.K. Das, V. Kumar, M.F.M. Campos, Decentralized motion planning for multiple robots subject to sensing and communication constraints, in: Proceedings of the Second MultiRobot Systems Workshop, Kluwer Academic Press, Dordrecht, 2003, pp. 267–278.
[6] J. Esposito, T. Dunbar, Maintaining wireless connectivity constraints for swarms in the presence of obstacles, in: Proceedings IEEE International Conference on Robotics and Automation, ICRA, 2006, pp. 946–951.
[7] M.N. Rooker, A. Birk, Multi-robot exploration under the constraints of wireless networking, Control Eng. Pract. 15 (4) (2007) 435–445.
[8] M.A. Hsieh, A. Cowley, V. Kumar, C.J. Taylor, Maintaining network connectivity and performance in robot teams, J. Field Rob. 25 (1–2) (2008) 111–131.
[9] S. Gil, M. Schwager, B.J. Julian, D. Rus, Optimizing communication in air-ground robot networks using decentralized control, in: Proceedings IEEE International Conference on Robotics and Automation, ICRA, 2010, pp. 1964–1971.
[10] N. Bezzo, M.S. Anderson, R. Fierro, J. Wood, A real world coordination framework for connected heterogeneous robotic systems, in: M. Ani Hsieh, G. Chirikjian (Eds.), Distributed Autonomous Robotic Systems, Springer Tracts in Advanced Robotics, vol. 104, Springer, Berlin, Heidelberg, 2014, pp. 75–89.
[11] P. Robuffo Giordano, A. Franchi, C. Secchi, H.H. Bülthoff, A passivity-based decentralized strategy for generalized connectivity maintenance, Int. J. Robot. Res. 32 (3) (2013) 299–323.
[12] R. Mahony, V. Kumar, P. Corke, Multirotor aerial vehicles: modeling, estimation, and control of quadrotor, IEEE Robot. Autom. Mag. 19 (3) (2012) 20–32.
[13] A. Pullin, N. Kohut, D. Zarrouk, R. Fearing, Dynamic turning of 13 cm robot comparing tail and differential drive, in: Proceedings IEEE International Conference on Robotics and Automation, ICRA, 2012, pp. 5086–5093.
[14] Y. Mostofi, M. Malmirchegini, A. Ghaffarkhah, Estimation of communication signal strength in robotic networks, in: Proceedings IEEE International Conference on Robotics and Automation, ICRA, 2010, pp. 1946–1951.
[15] J. Twigg, J. Fink, P. Yu, B. Sadler, RSS gradient-assisted frontier exploration and radio source localization, in: Proceedings IEEE International Conference on Robotics and Automation, ICRA, 2012, pp. 889–895.
[16] G. Tuna, V.C. Gungor, K. Gulez, An autonomous wireless sensor network deployment system using mobile robots for human existence detection in case of disasters, Ad Hoc Netw. 13 (Part A) (2014) 54–68.
[17] A. Goldsmith, Wireless Communications, Cambridge University Press, Cambridge, 2005.
[18] J. Nelson, J. Almodovar, M. Gupta, W. Mortensen, Estimating multiple transmitter locations from power measurements at multiple receivers, in: IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2009, pp. 2761–2764.
[19] B. Griffin, R. Fierro, I. Palunko, An autonomous communications relay in GPS-denied environments via antenna diversity, J. Def. Model. Simul. Appl. Methodol. Technol. 9 (1) (2012) 33–44.
[20] G. Agnarsson, R. Greenlaw, Graph Theory: Modeling, Applications, and Algorithms, Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2006.
[21] C. Godsil, G. Royle, Algebraic Graph Theory, Graduate Texts in Mathematics, Springer, New York, 2001.
[22] J. Latombe, Robot Motion Planning, Kluwer Academic Publishers, Dordrecht, 1991.
[23] L. Chengqing, V. Ang, H. Krishnan, S.Y. Lim, Virtual obstacle concept for local-minimum-recovery in potential-field based navigation, in: IEEE International Conference on Robotics and Automation, ICRA, vol. 2, 2000, pp. 983–988.
[24] S. Ge, Y. Cui, New potential functions for mobile robot path planning, IEEE Trans. Robot. Autom. 16 (5) (2000) 615–620.
[25] F. Lewis, D. Liu (Eds.), Reinforcement Learning and Approximate Dynamic Programming for Feedback Control, IEEE Press Series on Computational Intelligence, Wiley, New York, 2012.
[26] L. Buşoniu, R. Babuška, B. De Schutter, A comprehensive survey of multiagent reinforcement learning, IEEE Trans. Syst. Man Cybern. C: Appl. Rev. 38 (2) (2008) 156–172.
[27] H. Guo, Y. Meng, Distributed reinforcement learning for coordinate multi-robot foraging, J. Intell. Robot. Syst. 60 (3–4) (2010) 531–551.
[28] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, first ed., MIT Press, Cambridge, MA, USA, 1998.
[29] D.P. Bertsekas, Dynamic Programming and Optimal Control, vols. I and II, third ed., Athena Scientific, Belmont, MA, 2005 (vol. I) and 2012 (vol. II).
[30] L. Buşoniu, R. Babuška, B. De Schutter, D. Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators, first ed., CRC Press, Inc., Boca Raton, FL, USA, 2010.