Applied Soft Computing 25 (2014) 322–335
Bio-insect and artificial robot interaction using cooperative reinforcement learning

Ji-Hwan Son, Young-Cheol Choi, Hyo-Sung Ahn ∗

School of Mechatronics, Gwangju Institute of Science and Technology (GIST), 123 Cheomdangwagi-ro, Buk-gu, Gwangju 500-712, Republic of Korea
Article history: Received 24 September 2013; Received in revised form 3 September 2014; Accepted 3 September 2014; Available online 16 September 2014.

Keywords: Reinforcement learning; Fuzzy control; Intelligent interaction

Abstract

In this paper, we propose fuzzy logic-based cooperative reinforcement learning for sharing knowledge among autonomous robots. The ultimate goal of this paper is to entice bio-insects towards desired goal areas using artificial robots without any human aid. To achieve this goal, we found an interaction mechanism using a specific odor source and performed simulations and experiments [1]. For efficient learning without human aid, we employ cooperative reinforcement learning in a multi-agent domain. Additionally, we design a fuzzy logic-based expertise measurement system to enhance the learning ability. This structure enables the artificial robots to share knowledge while evaluating and measuring the performance of each robot. Through numerous experiments, the performance of the proposed learning algorithms is evaluated. © 2014 Elsevier B.V. All rights reserved.
1. Introduction

In the field of robotics, numerous efforts to establish artificial intelligence have been made by researchers. However, there is still no dominant result due to the difficulty of creating artificial intelligence for robots [2,3]. This is especially true in our environmental context, which involves complex and unpredictable elements and makes it difficult to apply artificial intelligence in robot applications. The project called BRIDS (Bio-insect and artificial Robot Interaction based on Distributed Systems) [1,4] seeks to study interactions between bio-insects and artificial robots to establish a new architectural framework for improving the intelligence of robots. In this project, we use living bio-insects, which have their own intelligence to survive in nature. Because of this intelligence, the behavior of the bio-insect also involves complex and unpredictable elements. Therefore, studying the interaction between a living insect from nature and an artificial robot will provide an idea of how to enhance the intelligence of robots. In this paper, as a specific task for the interaction between bio-insects and artificial robots, we would like to entice bio-insects towards desired goal areas using artificial robots without any human aid. Thus, the potential contribution of this research lies in the field of robot intelligence; it establishes a new learning framework for an intelligent robot based on cooperative reinforcement
∗ Corresponding author. Tel.: +82 62 715 2398. E-mail addresses: [email protected] (J.-H. Son), [email protected] (Y.-C. Choi), [email protected] (H.-S. Ahn).
http://dx.doi.org/10.1016/j.asoc.2014.09.002
learning, which constitutes a type of coordination for a community composed of bio-insects and artificial robots. The research on bio-insect and artificial robot interaction will also provide a fundamental theoretical framework for human and robot interactions. The main focus of our early results [1] was on how to address the uncertain and complex behavior of a bio-insect under a constructed framework for robot intelligence. The first goal was to find available interaction mechanisms between a bio-insect and an artificial robot. Contrary to our expectation, the bio-insect did not react to light, vibration, or movement of the robot. Through various trials and errors, we eventually found an interaction mechanism using a specific odor source from the bio-insect's habitat. Additionally, to develop a framework, we built an artificial robot that can spread the specific odor source towards a bio-insect. Then, to evaluate the interaction ability of the mechanism, we conducted experiments using the artificial robot, which was manually controlled by a human operator. In these experiments, the artificial robot operated by the human was assumed to have enough knowledge to entice the bio-insect towards the desired point, and the experimental results showed that the artificial robot can indeed entice the bio-insect. The second goal was to entice a bio-insect towards the desired goal area using an artificial robot without human aid. To achieve this goal, we conducted two types of experiments: the first type used fuzzy logic-based reinforcement learning, and the second type used simple regular reward-based reinforcement learning. From the experimental results, we found that fuzzy logic-based reinforcement learning showed more efficient results.
However, it took a huge amount of learning time for the robot to acquire the necessary knowledge, even though it eventually obtained the knowledge through the learning process. Furthermore, due to the complex and unpredictable behaviors of a bio-insect, a single reinforcement learning process was insufficient to enable reliable and efficient learning. Thus, for efficient learning (i.e., to improve the success rate with a shorter learning time), we decided to use a cooperative learning mechanism in a multi-agent domain. In Ref. [4], we conducted experiments using two artificial robots. In those experiments, we used a fuzzy logic-based expertise measurement system for sharing knowledge and obtained successful results. This paper is an extended version of Ref. [4]. In this extended version, we have generalized the fuzzy logic-based expertise measurement system and presented a more detailed explanation of the system. In the previous version, the effect of sharing knowledge might not have been clear, and the experimental result appeared quite optimal from the beginning. Therefore, we have newly conducted two types of experiments: Experiment A – the case without sharing knowledge, and Experiment B – the case with sharing knowledge. In Experiment A, individual agents 1 and 2 try to entice a bio-insect together without sharing knowledge. Experiment B focuses on sharing knowledge between the two artificial robots to entice a bio-insect together. Also, the number of action points has been increased, so the artificial robots need more trial and error to find a suitable angle direction and an optimal distance range to entice the bio-insect. After conducting the experiments, we have obtained new experimental results. Note that we use the term "cooperative learning" to represent learning by sharing data among multiple autonomous robots. When a robot is faced with given commands for which it lacks a sufficient knowledge base and is required to act alone, the robot may not succeed in implementing the commands, or it may take a long time to complete the task. However, if there are several other robots and each of them possesses its own specialized knowledge about the task, then the given commands can be completed more readily by mutual cooperation. Moreover, when the robots learn knowledge from trials and errors, some of the robots may acquire more specialized knowledge than the others, as seen in human society. If the robots have the ability to share knowledge, then the performance of the robots will be enhanced. For these reasons, cooperative learning has recently received a lot of attention due to the various benefits it provides. In the relationship between a predator and prey, a predator needs to learn hunting skills to survive in nature. By trial and error, the predator obtains useful knowledge to capture prey, and its success rate of hunting increases, as shown in Refs. [5,6]. However, due to complex and unpredictable elements in nature, the predator cannot always be successful in hunting prey even if it has enough knowledge of hunting. For example, weather conditions, species of prey, and physical conditions of the predator and prey are different whenever the predator hunts. These elements affect the success rate of hunting [7,8] and make it difficult to capture prey. At the least, the predator can learn available hunting skills through its own trial and error process, and it can survive in nature.
From a behavioral point of view, the basic concept of reinforcement learning is similar to the learning mechanism of animals, which uses positive and negative rewards through a trial and error process. This process is similar to the relation between an artificial robot and a bio-insect. The artificial robot needs to find out how to entice the bio-insect through a trial and error process. Just as a predator cannot always capture prey due to the complex and unpredictable environment, the artificial robot cannot entice the bio-insect at all times because of the complex and unpredictable elements of the bio-insect. At the least, the artificial robot learns useful knowledge to entice the bio-insect. Because of these similarities, we therefore
consider that this approach is a fully adaptable and useful solution. In addition, due to these merits, reinforcement learning has received a lot of attention and has been applied to various fields. Using reinforcement learning, researchers have controlled helicopter flight [9], elevator movement [10], humanoid robots [11], soccer robots [12], and traffic signals [13]. It has also been applied to spoken dialogue systems [14], packet routing [15], production scheduling [16], the traveling salesman problem [17], and resource allocation [18]. As a basic step, we have focused on using a living bio-insect, called a stag beetle, as a target in the interaction with robots. Based on the interaction mechanism we found, we attempt to control the movement of the bio-insect without any human aid. To achieve this goal, the robots need to learn how to control the behavior of the bio-insect. The problem is that a bio-insect possesses its own low-level intelligence, composed of ganglia, to survive in nature. Therefore, the behavior of the bio-insect also contains complex and unpredictable elements, and these elements make it hard to control the bio-insect. In our previous experimental results using a robot controlled manually by a human operator [1], we achieved only an 80% success rate. This result implies that the reactions of the bio-insect are not always what we expect, and the amount of reaction is different at every trial. Under these conditions, the robots need to learn precise knowledge to entice the bio-insect towards the desired goal area. If we knew what the bio-insect intends to do next in the current situation, then the robots could entice the bio-insect in an efficient way. However, the robots only have some clues, which are acquired from the behavioral reactions of the bio-insect. To overcome these difficulties, we apply reinforcement learning and fuzzy logic in this paper. To apply reinforcement learning to a real robot, the generation of a precise reward is a crucial issue for accurate learning. To deal with the complexity of the environment, we found that fuzzy logic could be one of the profitable approaches for generating a reward. We expect that this process will make the robots learn more actively than with sole reinforcement learning by adequately generating a reward from the behavioral reactions after interacting with the bio-insect. Therefore, we adopt fuzzy logic into our learning structure. Also, when sharing knowledge for cooperative learning, classifying and finding experts in each specific field among agents also involves complex elements. Therefore, we also use fuzzy logic to measure the performance of each robot. Then, based on the developed fuzzy rules, the system calculates the performance of each robot in each specific field. This paper is organized as follows. In Section 2, we briefly introduce the project entitled BRIDS (Bio-insect and artificial Robot Interaction based on Distributed Systems). In this section, we also present the main purpose and goal of the research. In Section 3, we present the fuzzy logic-based cooperative reinforcement learning using an expertise measurement system. Using the aforementioned structure, we present the experimental setup and results in Section 4. In Section 5, we present a discussion of our experimental results. Finally, Section 6 provides the conclusion of this paper.
2. Bio-insect and artificial Robot Interaction based on Distributed Systems

2.1. Motivation and goal of the BRIDS

The BRIDS seeks to study bio-insect and artificial robot interaction to establish a new architectural framework for improving the intelligence of mobile robots. One of the main research goals is to drive or entice a bio-insect towards a desired point through the coordination of a group of mobile robots. The research includes the establishment of hardware/software for the bio-insect and artificial robot interaction and the synthesis of distributed sensing,
Fig. 1. Flowchart of BRIDS composed of the distributed decision, distributed control, and distributed sensing. Subsystems are connected in a feedback loop manner.
distributed decision-making, and distributed control systems for building a network composed of a bio-insect and artificial robots. Fig. 1 explains how the subsystems are composed and connected. Distributed sensing is used in the recognition and detection of the bio-insect, as well as in the construction of a wireless sensor network to locate the artificial robots and the bio-insect. The distributed decision contains the learning of the repetitive reactions of the bio-insect to a certain form of input. It aims at finding which commands and actuations drive the bio-insect towards a desired point or drive the bio-insect away from the target position. The reinforcement learning algorithm is designed to generate either a penalty or a reward based on a set of actions. The distributed decision stores into memory the states of current actions and their outputs, which are closely associated with future events. Then, it selects commands and outcomes of past actions for the current closed-loop learning. Thus, the synthesis of the recursive learning algorithm based on the storage and selection procedure, together with the learning domain, is the main point of interest in the distributed decision. The distributed control includes the control and deployment of the multiple mobile robots via coordination, as well as the design of the optimally distributed control algorithm based on the coordination. It learns how the bio-insect reacts based on the relative speed, position, and orientation between the multiple mobile robots and the bio-insect. Thus, the ultimate goal of this research is to establish a new theoretical framework for robot learning via a recursive sequential procedure of the distributed sensing, decision, and control systems. Fig. 2 illustrates the structure of the BRIDS.

2.2. Platform of the BRIDS [1]

Fig. 3. The stag beetles (female (left side) and male (right side)) [1].

As a candidate bio-insect, we selected the stag beetle shown in Fig. 4(c). As shown in Fig. 3, the stag beetle, whose scientific name is Serrognathus platymelus castanicolor Motschulsky, has strong physical strength, good movement over flat surfaces, and a long life in extreme environments. When we place the bio-insect on the experimental platform, it normally stays on the platform without moving while hiding its antennae. In that case, the robot cannot entice the bio-insect, because the bio-insect does not react to any stimulation. After a few minutes, the bio-insect tries to sense the odor sources in the air and starts moving. When the bio-insect opens its antennae, it begins to follow the specific odor source. Therefore, before starting an experiment, we have to check the condition of the bio-insect to see whether it will follow the robot or not. Also, when a bio-insect is suddenly shocked, for example by a collision with a robot, it tries to run away strongly or does not move at all. From these observations, we have understood that other unknown characteristics may affect the behavior of the bio-insect. For the artificial robot, we redesigned e-puck robots to produce the desired actuation by adding one more microcontroller unit. One of the crucial points in the hardware development is how the bio-insect and the artificial robot can interact. Eventually, as a solution, we selected a specific odor source from the habitat of the bio-insect and used it as an actuation mechanism, as illustrated in Fig. 4(a). The actuation mechanism contains two air-pump motors to produce airflow and one plastic bottle containing the specific odor source. When a bio-insect smells the specific odor source generated by the actuation mechanism, the bio-insect follows the odor source. The actuation mechanism was installed on the redesigned e-puck robots, as shown in Fig. 4(b). To control the redesigned e-puck robots, we used a Bluetooth access point as a communication link. A machine vision camera continuously captures images of the experimental platform; then, using the captured images, a host computer finds the current positions and heading angles of the bio-insects and artificial robots. Then, using the designed intelligence structure, the robots decide on actions to entice the bio-insect without human aid. The whole structure of the hardware platform is illustrated in Fig. 4(c). It is notable that the BRIDS project initially targeted the development of a distributed or decentralized decision and control platform; however, it was not easy to design a fully distributed or decentralized hardware platform at the present time. Thus, at this moment, our main goal is to develop an intelligent learning algorithm, and we seek to verify the developed learning algorithms in the centralized hardware setup shown in Fig. 4. The development of BRIDS with a fully distributed hardware platform remains as future work. This paper only focuses on the development, testing, and verification of intelligent learning algorithms under rather ideal setups.

Fig. 2. Structure of BRIDS: it shows how the individual subsystems are related. The first step is to construct the distributed sensing, distributed decision, and distributed control systems. Then, we construct a closed system based on a feedback loop for learning and the exchange of knowledge for sharing information.

3. Cooperative reinforcement learning based on a fuzzy logic-based expertise measurement system
In the field of cooperative reinforcement learning, the area of expertise (AOE) concept was recently proposed in [19], where the framework evaluates the performance of each robot from several points of view and obtains generalized knowledge from the expert robots. In [20], they report a similar concept and introduce an advice-exchange structure focused on sharing knowledge based
on previous experience. In a different way, the AOE is focused on which robot has more expertise in a defined area, and then the robots share the knowledge. In [20], there are two different aspects of expertise: a behavioral and knowledge-based approach focuses on better and more rational behavior, while a structural approach examines better and more reliable knowledge for evaluating expertise. For evaluating the expertise of each robot, [19,21,22] present various methods that can be used to measure and calculate expertise. These measurements help the AOE evaluate the expertise of all the robots in each specific area. After evaluating the knowledge of each robot, the robots then share knowledge with each other using a weighted strategy-sharing concept. Based on the AOE structure, [23] presents a simple experiment using two robots that use adaptive weighted strategy sharing and a regret measure. In this paper, we adopt the AOE method proposed in [19] into our framework, because it is suitable for evaluating knowledge and is an efficient way of sharing knowledge among multiple robots.

Fig. 4. Platform of the BRIDS: (a) the proposed structure for spreading an odor source with a robot, using two air-pump motors to produce airflow and one plastic bottle containing an odor source from the habitat of the bio-insect; (b) redesigned robot; (c) hardware structure.

3.1. Fuzzy logic-based cooperative reinforcement learning

In this subsection, we design a cooperative reinforcement learning structure using a fuzzy logic-based expertise measurement system. The structure of the new learning logic is composed of two
parts: an expertise measurement part and a knowledge-sharing part. The expertise measurement part helps to evaluate the performance of each robot using various measurements in a specific field. From the outcomes of each robot in enticing the bio-insect towards specific directions, the learning logic can evaluate which robot possesses higher expertise in which specific field. Here, the specific fields mean the specific expert domains of each robot. If robots are required to complete complex tasks without any given knowledge, the robots try to learn how to fulfill the given tasks. Among the robots, some robots may acquire more knowledge in domain A and other robots may acquire more knowledge in domain B, because the robots rely on randomly chosen actions. During the tasks, some robots may thus acquire outstanding knowledge in different domains. If the robots can determine which robot is an expert in a specific domain and share knowledge, then the performance of the robots will be increased compared with the non-sharing case. Based upon the evaluated performance, the robots then share knowledge with each other. Fig. 5 depicts the whole structure of the system. Fig. 5(a) represents the fuzzy logic-based reinforcement learning structure for a robot, which is composed of reinforcement learning and fuzzy logic. Fig. 5(b) represents the core of the cooperative reinforcement learning part using fuzzy logic. During each episode,
the expertise measurement part stores every robot's specific criteria, defined as the expertise measurements introduced in Section 3.3. Using these criteria, the expertise measurement system evaluates each robot's performance with a score based on fuzzy logic and fuzzy rules. Then, using the evaluated score of each robot, the robots share knowledge. The following subsections introduce the specific processes of the expertise measurement system.

Fig. 5. Structure of cooperative reinforcement learning based on a fuzzy logic-based expertise measurement system: (a) fuzzy logic-based reinforcement learning structure for a robot i; (b) expertise measurement part for sharing knowledge of robots i, j, ..., k.

Fig. 6. Structure of reinforcement learning: the structure is composed of two parts; one is the robot, and the other is the environment. Based on the recognized state $s_t$, the robot actuates an action $a_t$ towards the environment, following which an output is given to the robot as a reward $\rho_{t+1}$. This circulation process makes the robot acquire knowledge under a trial-and-error iteration process. This learning mechanism is similar to the learning behavior of animals that possess intelligence.

3.2. A robot

Reinforcement learning [24–26] is a reward signal-based trial-and-error iteration process (see Fig. 6). Based on a discrete set of states $S$, a set of robot actions $A$, a transition probability $T(s, a, \hat{s})$, a policy $p: s \rightarrow a$, and an immediate reward signal $\rho$, an optimal policy is searched for using a Q-learning structure. The Q-learning structure helps the robots learn how to entice a bio-insect towards the desired direction under the defined specific fields, as seen in the following equation:

$$Q_{t+1}^{k,l}(s,a) \leftarrow (1-\alpha)\,Q_t^{k,l}(s,a) + \alpha\left(\rho_{t+1}^{k,l} + \gamma \max_{\hat{a}} Q_t^{k,l}(\hat{s},\hat{a})\right) \qquad (1)$$

where $\alpha$ is the learning rate ($0 \le \alpha \le 1$), $\gamma$ is the discount factor ($0 \le \gamma \le 1$), $t$ is the iteration step, $k$ denotes a robot, and $l$ is a specific field. One of the merits of the Q-learning structure is that it adopts a learning rate $\alpha$. The learning rate $\alpha$ is a weighting parameter between previously acquired knowledge and knowledge newly acquired through a reward. When $\alpha$ is near 1, the Q-learning structure fully updates newly acquired rewards as part of the exploration process. Conversely, when $\alpha$ approaches 0, the structure passes over the newly acquired rewards; in this case, the structure depends on the previous knowledge that a robot has learned, as part of the exploitation process. The value $\alpha$ can be useful for our experiment because the robots require precise learning knowledge of the complex behavior of bio-insects. If the robots can control $\alpha$ during the experiment, then the performance of the experiment will be enhanced. In these experiments, we choose the approach where $\alpha$ decreases with an increase in the number of episodes. Additionally, we adaptively update the specific fields where a robot is an expert. Using the evaluated performance of each robot, we know which robot is an expert in each field. Using an initialized $Q^{k,l}(s, a)$ table and (1), the $k$th robot updates its own table using the immediate reward calculated at the current state $s$ for the selected action $\hat{a}$ within a specific field $l$. To understand the behavior of a bio-insect as a result of a given action, we apply fuzzy logic to generate rewards for the behavior of the bio-insect, because fuzzy logic is a good approach for understanding an imprecise environment, such as understanding the emotion of human behavior [27] and the human mind [28]. When the $k$th robot recognizes the current state, then, with a possible set of actions $A$, it chooses an action $\hat{a}$ in the current state $s$. After the action is executed, the reaction information is collected, including the variation in distance $d_t^b$ between the sub-goal point for the $b$th bio-insect and the $b$th bio-insect, and the variation in distance $e_t^k$ between the $b$th bio-insect and the $k$th robot. Here, $d_t^b$ and $e_t^k$ are calculated using the following equations, respectively:

$$d_t^b = \left\| q_{t_s}^b - q_{t_s}^{Goal,b} \right\| - \left\| q_{t_e}^b - q_{t_e}^{Goal,b} \right\| \qquad (2)$$

$$e_t^k = \left\| q_{t_s}^b - q_{t_s}^k \right\| - \left\| q_{t_e}^b - q_{t_e}^k \right\| \qquad (3)$$

where $q_t^b$, $q_t^k$, and $q_t^{Goal,b}$ indicate the positions of the $b$th bio-insect, the $k$th artificial robot, and the sub-goal for the $b$th bio-insect, respectively, with $q_t \in \mathbb{R}^2$ and $\{t_s, t_e\} \in t$ ($t_s$ and $t_e$ indicate the start time and end time of the selected action $\hat{a}$ at the iteration step $t$), and $\|\cdot\|$ is the Euclidean norm.
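As an illustration of (1)–(3), the following Python sketch shows how one robot's Q-table update and the distance-variation signals could be computed. The class and variable names (QLearner, goal, insect_start, etc.) are our own illustrative assumptions and do not reproduce the authors' implementation.

```python
import numpy as np

def distance_variations(insect_start, insect_end, robot_start, robot_end, goal):
    """Compute d (Eq. 2) and e (Eq. 3): positive values mean the bio-insect
    moved closer to the sub-goal / closer to the robot during the action."""
    d = np.linalg.norm(insect_start - goal) - np.linalg.norm(insect_end - goal)
    e = np.linalg.norm(insect_start - robot_start) - np.linalg.norm(insect_end - robot_end)
    return d, e

class QLearner:
    """Tabular Q-learning for one robot k in one specific field l (Eq. 1)."""
    def __init__(self, n_states, n_actions, alpha=0.85, gamma=0.95):
        self.Q = np.zeros((n_states, n_actions))
        self.alpha = alpha      # learning rate, decayed per episode
        self.gamma = gamma      # discount factor

    def update(self, s, a, reward, s_next):
        # Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(rho + gamma*max_a' Q(s',a'))
        target = reward + self.gamma * np.max(self.Q[s_next])
        self.Q[s, a] = (1 - self.alpha) * self.Q[s, a] + self.alpha * target
```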
Fig. 7. Input fuzzy sets: (a) distance variation ($d_t^b$) as an input and (b) distance variation ($e_t^k$) as an input; output fuzzy sets: (c) output.
To generate suitable rewards for the robots, using only the parameter $d_t^b$, the variation in distance between the sub-goal point and the bio-insect, was insufficient. Due to the complex and unpredictable elements of a bio-insect, the bio-insect may move towards the desired goal point under a wrongly chosen action when the bio-insect did not actually react to the robot. In this case, if we use only this value, wrongly generated rewards may be accumulated by each robot. To avoid such cases, we additionally use the parameter $e_t^k$, the variation in distance between the $b$th bio-insect and the $k$th robot. We consider the parameter $e_t^k$ to be another crucial clue, since the specific odor source makes the bio-insect follow the spreading direction. Therefore, using this approach, the system can generate a more specific reward signal. To generate a reward signal, we distinguish two types of situations: the positive case, in which the artificial robot entices the bio-insect towards the right place, and the negative case, in which the artificial robot entices the bio-insect towards a wrong place. Because of the complex and unpredictable elements of the bio-insect, the bio-insect occasionally moves towards a place without any clues. Therefore, to generate a precise reward signal, we focus on specific behaviors as follows. Positive case: if the bio-insect followed the artificial robot ($e_t^k$ is VG) and the artificial robot enticed the bio-insect towards the right place ($d_t^b$ is VG), then we consider this a very good case A. Negative cases: if the bio-insect did not follow the artificial robot ($e_t^k$ is VB or BD) and the artificial robot moved the bio-insect towards the right place ($d_t^b$ is VG), then we consider this a very bad case E. If the bio-insect followed the artificial robot ($e_t^k$ is VG or GD) and the artificial robot enticed the bio-insect towards a wrong place, then we also consider this a very bad case E. The other rules are considered meaningless cases C, and the rules that are only slightly related to the above positive and negative cases are classified as B or D. Based on the above regulation for generating rewards, detailed fuzzy rules are developed as shown in Table 3 in the appendix. Based on the fuzzy rules described in Table 3 in the appendix, the input variables $e_t^k$ and $d_t^b$ are converted by the following membership functions (4) and (5), as depicted in Fig. 7(a) and (b).
$$\mu_d = \{VG_d, GD_d, NM_d, PR_d, VP_d\} \qquad (4)$$

$$\mu_e = \{VG_e, GD_e, NM_e, PR_e, VP_e\} \qquad (5)$$

$$\mu_{output} = \{A, B, C, D, E\} \qquad (6)$$

where VG, GD, NM, PR, and VP indicate very good, good, normal, poor, and very poor, respectively. In the fuzzy sets, VG, GD, NM, BD, VB, A, B, C, D, and E represent each fuzzy membership function, and the input variables are converted by a linguistic process. Next, the calculated values $d_t^b$ and $e_t^k$ are converted by a fuzzification process using the defined fuzzy sets depicted in Fig. 7(a) and (b). After the fuzzification process, the converted values are calculated using (7) and (8) with a max–min composition process. Then, using the fuzzy rules shown in Table 3 in the appendix, all of the values are expressed in the output fuzzy sets depicted in Fig. 7(c) using (8). All the outputs are combined into the aggregation of the output fuzzy sets as a union process in set theory:

$$\lambda_i = \min\left[\min\left[\mu_i^d(d_t^b),\ \mu_i^e(e_t^k)\right],\ \mu_i^{output}\right] \qquad (7)$$

$$\mu_o(u) = \bigcup_{i=1}^{25} \lambda_i \qquad (8)$$

where the index $i$ represents the number of the fuzzy rule and $k$ denotes a robot. An immediate reward is calculated using the center of mass method as follows:

$$\rho_{t+1}^{k,l} = \frac{\int u\,\mu_o(u)\,du}{\int \mu_o(u)\,du} \qquad (9)$$

Based on the reinforcement learning structure, the fuzzy logic generates a reward signal $\rho_{t+1}^{k,l}$ for the $k$th robot from the collected reaction of the bio-insect in the specific field $l$. Then, using the reward, a robot updates the $Q^{k,l}(s, a)$ table and tries to optimize the Q-table as its knowledge.

3.3. Expertise measurement

When we examine the performance of each robot, various indexes can be used as measurements. In our structure, we choose the following three measurements: average reward, positive average reward, and percentage of positive rewards. The average reward is calculated as follows:

$$\rho_{avg}^{k,l} = \frac{\sum_{t=1}^{M^{k,l}} \rho_{t+1}^{k,l}}{M^{k,l}} \qquad (10)$$

where $M^{k,l}$ is the number of iterations for the $k$th robot in the specific field $l$. We define a positive reward as shown below:

$$pst_{t+1}^{k,l} = \begin{cases} \rho_{t+1}^{k,l}, & \text{if } \rho_{t+1}^{k,l} > \delta \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

Here, the range of the reward is $-1 \le \rho \le 1$.
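To make the reward generation concrete, the sketch below implements a Mamdani-type version of (7)–(9) in Python. The triangular membership functions, their breakpoints, and the normalization of $d$ and $e$ to [-1, 1] are illustrative assumptions (the actual sets are defined graphically in Fig. 7), and only a subset of the 25 rules of Table 3 is shown.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with breakpoints a <= b <= c."""
    return max(0.0, min((x - a) / (b - a), (c - x) / (c - b)))

# Five evenly spaced linguistic terms over the normalized range [-1, 1].
IN_SETS  = {'VB': (-1.5, -1.0, -0.5), 'BD': (-1.0, -0.5, 0.0),
            'NM': (-0.5, 0.0, 0.5),   'GD': (0.0, 0.5, 1.0), 'VG': (0.5, 1.0, 1.5)}
OUT_SETS = {'E': (-1.5, -1.0, -0.5),  'D': (-1.0, -0.5, 0.0),
            'C': (-0.5, 0.0, 0.5),    'B': (0.0, 0.5, 1.0),  'A': (0.5, 1.0, 1.5)}

# A few of the 25 rules of Table 3: (label for d, label for e) -> output label
RULES = {('VG', 'VG'): 'A', ('VG', 'GD'): 'B', ('GD', 'VG'): 'B',
         ('VG', 'VB'): 'E', ('BD', 'VG'): 'E', ('NM', 'NM'): 'C'}

def fuzzy_reward(d, e):
    """Mamdani-style reward: min for rule firing (Eq. 7), max for aggregation
    (Eq. 8), and center of mass for defuzzification (Eq. 9)."""
    u = np.linspace(-1.0, 1.0, 201)
    aggregated = np.zeros_like(u)
    for (ld, le), lout in RULES.items():
        strength = min(tri(d, *IN_SETS[ld]), tri(e, *IN_SETS[le]))      # Eq. (7)
        clipped = np.minimum(strength, [tri(x, *OUT_SETS[lout]) for x in u])
        aggregated = np.maximum(aggregated, clipped)                    # Eq. (8)
    if aggregated.sum() == 0:
        return 0.0
    return float(np.sum(u * aggregated) / np.sum(aggregated))           # Eq. (9)
```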
Fig. 8. Input fuzzy sets: (a) average reward as an input, (b) percentage of the positive rewards as an input, (c) positive average reward as an input, and (d) output fuzzy sets.
Using the defined positive reward, the average positive reward is calculated as

$$pst_{avg}^{k,l} = \frac{\sum_{t=1}^{M^{k,l}} pst_t^{k,l}}{M^{k,l}} \qquad (12)$$

Similarly, the percentage of positive rewards is calculated by the following equations. For counting the number of positive rewards, the equation below checks whether the current reward $\rho_{t+1}^{k,l}$ is a positive reward or not:

$$cnt_{t+1}^{k,l} = \begin{cases} 1, & \text{if } \rho_{t+1}^{k,l} > \delta \\ 0, & \text{otherwise} \end{cases} \qquad (13)$$

Then, the percentage of positive rewards is calculated as

$$cnt_{avg}^{k,l} = \frac{\sum_{t=1}^{M^{k,l}} cnt_t^{k,l}}{M^{k,l}} \qquad (14)$$

3.4. Expertise measurement system

Using the expertise measurement values, the expertise measurement system evaluates the performance of all robots using the following fuzzy sets and the fuzzy rules described in Table 4 in the appendix:

$$\mu_{avg} = \{GD_a, NM_a, PR_a\} \qquad (15)$$

$$\mu_{pst} = \{GD_p, NM_p, PR_p\} \qquad (16)$$

$$\mu_{cnt} = \{GD_c, NM_c, PR_c\} \qquad (17)$$

$$\mu_{exp} = \{A, B, C, D, E\} \qquad (18)$$

For determining an expert among the agents in each specific field, we use the three types of measurements. Each measurement contributes equally to judging all agents. Therefore, if one of the measurements is NM or BD, then the output is decreased proportionally. For example, if all measurements are GD, or one measurement is NM and the others are GD, then the output is A. As the measurements contain more NM or BD, the output is decreased proportionally to B, C, and D. Eventually, when one measurement is NM and the others are BD, or all measurements are BD, the output is E. Based on the above regulation for the expertise measurement system, detailed fuzzy rules are described in Table 4. After the fuzzification process, the converted values are calculated using (19) and (20) with a max–min composition process. Then, using the fuzzy rules shown in Table 4 in the appendix, all of the values are expressed in the output fuzzy sets depicted in Fig. 8(d) using (20). All the outputs are combined into the aggregation of the output fuzzy sets as a union process in set theory:

$$\lambda_i = \min\left[\min\left[\mu_i^{avg}(\rho_{avg}^{k,l}),\ \mu_i^{pst}(pst_{avg}^{k,l}),\ \mu_i^{cnt}(cnt_{avg}^{k,l})\right],\ \mu_i^{exp}\right] \qquad (19)$$

$$\mu_{exp}(u) = \bigcup_{i=1}^{27} \lambda_i \qquad (20)$$

where the index $i$ represents the number of the fuzzy rule, $k$ denotes a robot, and $l$ denotes a specific field. The final output $S^{k,l}$ is the score of each robot and is calculated using the center of mass method:

$$S^{k,l} = \frac{\int u\,\mu_{exp}(u)\,du}{\int \mu_{exp}(u)\,du} \qquad (21)$$

Then, the knowledge of each robot is merged as

$$S^{l} \leftarrow \sum_{k=1}^{N} S^{k,l} \qquad (22)$$

where $N$ is the number of robots and $k$ denotes a robot, $k \in \{1, \ldots, N\}$. Finally, all robots have the shared knowledge as follows:

$$Q^{l} \leftarrow \sum_{k=1}^{N} \frac{S^{k,l}}{S^{l}} \cdot Q^{k,l} \qquad (23)$$

The whole procedure of the fuzzy logic-based expertise measurement system for cooperative reinforcement learning is described in Algorithms 1 and 2. In the algorithms, $B$ denotes the number of bio-insects, $b \in \{1, \ldots, B\}$; $L$ denotes the number of specific fields, $l \in \{1, \ldots, L\}$; $N$ denotes the number of robots, $k \in \{1, \ldots, N\}$; and $M^{k,l}$ denotes the number of iterations of the $k$th robot in the specific field $l$.
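A compact sketch of how the expertise measurements (10)–(14) and the score-weighted knowledge merge (22)–(23) could be realized is given below. The fuzzy scoring step (19)–(21) that turns the three measurements into $S^{k,l}$ is analogous to the reward fuzzification shown earlier and is omitted here; the threshold value for delta and the assumption that all scores are positive are ours, and the function and argument names are illustrative.

```python
import numpy as np

def expertise_measurements(rewards, delta=0.0):
    """Per-robot, per-field measurements of Eqs. (10)-(14) from the immediate
    rewards collected in one episode. delta is the positive-reward threshold."""
    rewards = np.asarray(rewards, dtype=float)
    if rewards.size == 0:
        return 0.0, 0.0, 0.0
    positive = np.where(rewards > delta, rewards, 0.0)
    avg = rewards.mean()                   # Eq. (10): average reward
    pst_avg = positive.mean()              # Eq. (12): positive average reward
    cnt_avg = (rewards > delta).mean()     # Eq. (14): percentage of positive rewards
    return avg, pst_avg, cnt_avg

def share_knowledge(Q_tables, scores):
    """Weighted merge of Eqs. (22)-(23): each field l gets a shared Q-table,
    weighted by the fuzzy expertise scores S^{k,l} of the robots.
    Q_tables[k][l] is the Q-array of robot k in field l; scores[k][l] its score."""
    n_robots = len(Q_tables)
    n_fields = len(Q_tables[0])
    shared = []
    for l in range(n_fields):
        S_l = sum(scores[k][l] for k in range(n_robots))              # Eq. (22)
        Q_l = sum((scores[k][l] / S_l) * Q_tables[k][l]
                  for k in range(n_robots))                            # Eq. (23)
        shared.append(Q_l)
    return shared
```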
3.5. Comments on reinforcement learning approaches

From a literature search, we found several different reinforcement learning approaches. In [29], it was explained why cooperative learning is more attractive than the independent learning case, and in [30], two types of agents are considered that try to complete opposing goals: one agent tries to maximize its own reward while the other tries to minimize its own reward. [31] presents a learning structure for competitive and cooperative behavior in an artificial soccer game based on temporal difference learning algorithms in the reinforcement learning field. In [32], an average reward-based learning mechanism is adopted under different action and task levels; each level is composed of a hierarchical structure, and the action level performs a given task under the overall current task. In [33], a team Q-learning algorithm composed of parallel Q-tables for maximizing the common goal is presented, and in [34] an integrated sequential Q-learning using a genetic algorithm was presented: using a fitness function, they evaluate the current performance and try to find a better performance under selection, crossover, and mutation processes. In this paper, in order to move the bio-insect towards a given goal point, the robots need to achieve a common goal together because they are supposed to entice the bio-insect cooperatively. Therefore, [31–34] could be utilized in our task. On the other hand, [29] and [30] cannot be used because they consider agents with opposite goals. From our experiments for finding an interaction mechanism between a bio-insect and a robot, we found that one of the crucial criteria was the actuation direction towards the bio-insect when using the specific odor source. Because the bio-insect relies on information collected by its antennae, which are located on its head to detect smells in the air, the probability of enticing the bio-insect differs according to the actuation direction. When a robot spreads the specific odor source towards the heading direction of the bio-insect, the bio-insect follows the robot with high probability. In contrast, when the robot spreads the specific odor source at the rear of the bio-insect, the bio-insect follows the robot with low probability. For this reason, we have used an enticing mechanism to interact with the bio-insect. However, it is important to check which actuations of the robots affect the movement of the bio-insect more. For example, if two robots are located at the heading side and the rear side of the bio-insect, and the heading direction of the bio-insect happens to be the direction in which the robots need to entice it, then the bio-insect only follows the robot located on the heading side with high probability. However, in that case, even though the bio-insect follows the robot located on the heading side, both robots (the robots located at the heading side and the rear side) will receive the same positive reward, since the bio-insect was actuated towards the desired direction. Due to this problem, in achieving a common goal, the multiple robots may have difficulties in finding the right actions. To handle this problem, in the fuzzy logic-based expertise measurement system introduced in the previous subsections, each robot only tries to entice a bio-insect at its chosen action point, and each agent receives a reward and records the achieved performance through the expertise measurements.
After an episode has been completed, the robots share knowledge based on their recorded performance using the expertise measurement system. Then, in the next episode, the robots entice the bio-insect based on the shared knowledge.

4. Experiment

4.1. Experimental setup

As an interaction mechanism between a bio-insect and an artificial robot, we found a specific odor source that makes the bio-insect follow the artificial robot [1]. Using this interaction mechanism, each robot learns how to entice a bio-insect towards the desired
goal point in a cooperative manner. To realize this concept, we conduct the following two experiments using a bio-insect and two artificial robots: Experiment A – without sharing knowledge, as a control group, and Experiment B – with sharing knowledge, as an experimental group using the fuzzy logic-based expertise measurement system described in the previous section, in order to measure the effect of sharing knowledge. In examining the performance of the cooperative reinforcement learning, we consider that it is more favorable to increase the number of artificial robots, because when the number of artificial robots is increased, the number of clues for obtaining knowledge is also increased by sharing the obtained knowledge. This means that the total learning time can be reduced if the robots share knowledge efficiently. However, in our experiments, only two robots were utilized for a single bio-insect due to the limited space around the bio-insect. To examine the performance of the cooperative reinforcement learning, we built the experimental platform illustrated in Fig. 9. As shown in Fig. 9(a) and (c), robot 1 and robot 2 work as a group for bio-insect 1. In Experiment A, individual agents 1 and 2 try to entice the bio-insect together without sharing knowledge. Experiment B focuses on sharing knowledge between the two artificial robots. In the experiments, robot 1 and robot 2 try to entice bio-insect 1 towards a given sub-goal point while avoiding artificial walls and common restricted areas. Each sub-goal point is given by Algorithm 3. All sub-goal points and areas are illustrated in Fig. 9(b). In particular, after the robots have conducted the experiment in each episode, they share their knowledge using the fuzzy logic-based expertise measurement system only in Experiment B. Then, in the next episode, the robots try to entice the bio-insect using the shared knowledge. To recognize the current state among the bio-insects and robots, we define states that consist of a heading angle and a goal direction for the bio-insect, as illustrated in Fig. 10(a). The heading angle and the goal direction are each divided into eight equal parts, separated by 45°, drawn with dotted lines in Fig. 10(a). The actuation points used to entice the bio-insect are illustrated in Fig. 10(b). The action points consist of three different distance ranges, d1, d2, and d3, and eight different directions separated by 45°. At the chosen action points, the robots spread the specific odor source towards the bio-insect. To avoid collisions, the robots move around the related bio-insect while keeping a restricted distance range among them.

4.2. Experimental results

In this experiment, we use the following parameters: $\alpha = 0.85$, $\gamma = 0.95$, $\varepsilon = 0.3$, $\alpha_e = 0.6$, $\varepsilon_e = 0.03$, $d_1 = 23$ cm, $d_2 = 26$ cm, and $d_3 = 29$ cm. The parameters $\alpha$ and $\varepsilon$ are decreased by 0.008 and 0.02 per episode step $e$, respectively:
$$\alpha(e+1) = \begin{cases} \alpha(e) - \Delta\alpha, & \text{if } \alpha(e) > \alpha_e \\ \alpha_e, & \text{otherwise} \end{cases} \qquad (24)$$

$$\varepsilon(e+1) = \begin{cases} \varepsilon(e) - \Delta\varepsilon, & \text{if } \varepsilon(e) > \varepsilon_e \\ \varepsilon_e, & \text{otherwise} \end{cases} \qquad (25)$$

where $\Delta\alpha = 0.008$ and $\Delta\varepsilon = 0.02$. If either $\alpha$ or $\varepsilon$ reaches its defined minimum value, then the value remains constant over the following episodes. After executing the experiments, we obtain the following results.¹
¹ Readers can view all the experimental movie clips by visiting the web site: http://dcas.gist.ac.kr/bridscrl.
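As a brief illustration, the sketch below shows how the discretized states of Fig. 10(a) (eight 45° sectors for heading angle and goal direction), the 3 × 8 action grid of Fig. 10(b), and the per-episode decay of α and ε in (24)–(25) might be coded; all names and the exact state encoding are our own assumptions.

```python
import math

SECTOR = 45.0                      # eight 45-degree sectors (Fig. 10)
DISTANCES = [23.0, 26.0, 29.0]     # d1, d2, d3 in cm

def sector_index(angle_deg):
    """Map an angle in degrees to one of the eight 45-degree sectors."""
    return int((angle_deg % 360.0) // SECTOR)

def state_index(heading_deg, goal_dir_deg):
    """State = (heading-angle sector, goal-direction sector) -> 0..63."""
    return sector_index(heading_deg) * 8 + sector_index(goal_dir_deg)

def action_point(action_index, insect_xy):
    """Action index 0..23 -> (x, y) actuation point around the bio-insect:
    three distance ranges times eight directions separated by 45 degrees."""
    direction = (action_index % 8) * SECTOR
    distance = DISTANCES[action_index // 8]
    rad = math.radians(direction)
    return (insect_xy[0] + distance * math.cos(rad),
            insect_xy[1] + distance * math.sin(rad))

def decay(value, minimum, step):
    """Per-episode decay of alpha and epsilon, Eqs. (24)-(25)."""
    return value - step if value > minimum else minimum

# Example: alpha = decay(alpha, 0.6, 0.008); eps = decay(eps, 0.03, 0.02)
```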
Fig. 9. Experimental platform for the experiments: (a) designed state for recognizing the current state of location, (b) defined areas and sub-goal points, and (c) photograph of the experimental platform.

Fig. 10. Designed states: (a) designed states for recognizing the current state and (b) related actuation points for the robots.

After executing a number of experiments, we obtained the experimental results shown in Table 1. Both experiments were performed 30 times with 4 bio-insects. The bio-insects were chosen in a given numerical order and were swapped out when they became exhausted or demonstrated incompliant reactions to the given actions of a robot. In Experiment A, the robots achieved a 30.0% success rate using the 4 bio-insects, as shown in Fig. 11 and described in Table 5 in the appendix. From the first episode, the performance of the enticing ability increased with the learning process, as shown in Fig. 11. Episode No. 27 was recognized as having the shortest number of iterations (12) and the shortest lab time (160 s) among all episodes in Experiment A. As a control group, the robots did not share knowledge after finishing each episode; each individual robot learned only from its own experience. In Experiment B, the robots achieved a 53.3% success rate using the 4 bio-insects, as shown in Fig. 12 and described in Table 6. As an experimental group, the robots shared knowledge after finishing every episode. As explained in the previous section, the performance of the robots was evaluated using three measurements: average reward, positive average reward, and percentage of positive rewards. Then, the robots shared knowledge using the fuzzy logic-based expertise measurement system. In this case, episode No. 19 was recognized as having the shortest number of iterations (13) and the shortest lab time (140 s).

Table 1
Summary of experimental results.

                                Experiment A    Experiment B
The number of episodes          30              30
Success episodes (rate)         9 (30.0%)       16 (53.3%)
The number of iterations        690             795
Total lab time (s)              7759            7665
Success rate of bio-insect 1    60.0%           70.0%
Success rate of bio-insect 2    22.2%           33.3%
Success rate of bio-insect 3    20.0%           20.0%
Success rate of bio-insect 4    0.0%            83.3%

Table 2
Summary of whole experimental results (due to the limited space of the table, fuzzy logic, reinforcement learning, and expertise measurement system are indicated as FL, RL, and EMS, respectively).

                                              Exp. I (Ref. [1])   Exp. II (Ref. [1])   Exp. III (Ref. [1])   Exp. IV (Ref. [4])   Exp. A (current study)   Exp. B (current study)
Type of intelligence for artificial robot(s)  Human operator      FL based RL          Simple RL             FL based RL          FL based RL              FL based RL
The number of agents                          1                   1                    1                     2                    2                        2
Sharing knowledge between agents              –                   –                    –                     FL based EMS         Non-sharing              FL based EMS
The number of episodes                        10                  32                   32                    30                   30                       30
The number of actions                         –                   8                    8                     8                    24                       24
Success rate of experiments                   80%                 50%                  18.75%                80%                  30%                      53.3%

Fig. 11. Results of Experiment A – In this figure, four types of results are indicated: successful cases of iterations and lab time (drawn with lines), and failure cases of iterations and lab time.

Fig. 12. Results of Experiment B – In this figure, four types of results are indicated: successful cases of average iterations and lab time (drawn with lines), and failure cases of average iterations and lab time.
5. Discussions on experimental results

In Section 4, we presented two types of experimental results. In the comparison between Experiments A and B, Experiment B achieved a better success rate (53.3%) than Experiment A (30%) within a limited number of episodes. Also, in Experiment B, the record episode No. 19 shows the shortest number of iterations and the shortest duration, similar to episode No. 27 in Experiment A. Here, the success rate does not mean that the robot can entice the bio-insect towards the desired goal area with full reliability, because these experiments did not use any fixed training set. From the experimental results, we could confirm that learning indeed takes place and that sharing knowledge increases performance. We also found that both the learning process and the knowledge-sharing mechanism can be valuable solutions for cooperative behavior.

A few common problems were observed throughout the experiments. Some bio-insects occasionally did not follow the odor source during the experiments; when that happened, the robots lost the ability to apply their collectively acquired knowledge. For example, in Experiment A, bio-insect 4 never produced a successful episode, whereas in Experiment B it achieved about an 83.3% success rate. When the bio-insects failed to follow the robots, no pattern or evidence could be observed in the results. Additionally, in our previous experiments using a human operator [1], we obtained only an 80% success rate, which means that even a human could not fully entice the bio-insect. This effect might come from the condition (physical strength) or other unknown characteristics of the individual bio-insect. As seen in the previous experimental results, the bio-insects frequently showed complex and unpredictable behavior. These problems disturbed the robots' learning ability, as seen in the non-convergence of the number of iterations and the time duration with an increasing number of episodes. Also, the robots sometimes approached the bio-insect from a wrong direction or place due to a randomly selected action, and consequently the bio-insect occasionally moved in a wrong direction. Therefore, the number of iterations did not decrease with the increasing number of episodes, and if we conduct more experiments with an increasing number of episodes, variations in both the number of iterations and the lab time will frequently happen again due to the complex and unpredictable elements of the bio-insect. However, taking all the results into account, we still confirm that sharing knowledge in Experiment B showed better performance than the non-sharing case in Experiment A (Figs. 13 and 14).

Finally, we would like to compare the results of this paper with the results of [1,4]. Table 2 summarizes the comparisons. As shown in Table 2, the experiments of [1,4] and the current study were not conducted under the same environmental conditions; for example, the numbers of episodes, actions, and agents are slightly different. Therefore, the experimental results cannot be compared directly. In addition, the experimental results (Experiment II) in [1] using fuzzy logic-based reinforcement learning showed a 50% success rate, whereas the experiment without sharing knowledge (Experiment A) in the current study showed only a 30% success rate even though Experiment A also used fuzzy logic-based reinforcement learning. As mentioned above, the main purpose of this study is to show the effect of sharing knowledge in the multi-agent domain. Therefore, Experiment A used two artificial robots, and the robots tried to entice the bio-insect at the same time without sharing knowledge. In our experiments, no training set was defined, which means that the robots learned how to entice the bio-insects only from randomly chosen actions or from actions selected by the learned knowledge, as described in the learning algorithms. The problem is that a bio-insect may follow an artificial robot that chooses a wrong action place even though another artificial agent chooses the right one; such a case is then considered a failure case. Additionally, when the robots learn from trial and error, some of the robots may acquire more knowledge than the others, while some robots may not acquire enough knowledge. Due to this imbalance of knowledge among the agents, experimental results in the multi-agent domain might be poorer than in the single-agent case even though the number of agents has increased. Therefore, Experiment II showed better performance than Experiment A within a limited number of episodes. Also, the increased number of action positions may affect the experimental results. The number of action points in both Experiments A and B was increased. To entice the bio-insect, the distance range between a bio-insect and an artificial robot is one of the crucial factors, because the bio-insect mainly reacts to the specific odor source at certain optimal distances. In the previous experiments, the optimal distance range was already given and the artificial robot did not need to find it; the robot only tried to find an available angle direction. However, in both Experiments A and B, the artificial robots needed to learn not only the angle direction but also the distance range. Therefore, the level of difficulty of Experiments A and B is higher than that of the previous experiments, and the artificial robots needed more trial and error to find the suitable angle direction and the optimal distance range to entice the bio-insect. Eventually, the success rates of both experiments are lower than those of the other experiments in a limited number of episodes. However, we consider that these experiments are a more realistic and more general approach than the previous experiments.

Table 3
25 fuzzy rules.

F01: IF ($d_t^b$ is $VG_d$) and ($e_t^k$ is $VG_e$), THEN Output is A
F02: IF ($d_t^b$ is $VG_d$) and ($e_t^k$ is $GD_e$), THEN Output is B
F03: IF ($d_t^b$ is $VG_d$) and ($e_t^k$ is $NM_e$), THEN Output is C
F04: IF ($d_t^b$ is $VG_d$) and ($e_t^k$ is $BD_e$), THEN Output is D
F05: IF ($d_t^b$ is $VG_d$) and ($e_t^k$ is $VB_e$), THEN Output is E
F06: IF ($d_t^b$ is $GD_d$) and ($e_t^k$ is $VG_e$), THEN Output is B
F07: IF ($d_t^b$ is $GD_d$) and ($e_t^k$ is $GD_e$), THEN Output is C
F08: IF ($d_t^b$ is $GD_d$) and ($e_t^k$ is $NM_e$), THEN Output is C
F09: IF ($d_t^b$ is $GD_d$) and ($e_t^k$ is $BD_e$), THEN Output is D
F10: IF ($d_t^b$ is $GD_d$) and ($e_t^k$ is $VB_e$), THEN Output is E
F11: IF ($d_t^b$ is $NM_d$) and ($e_t^k$ is $VG_e$), THEN Output is C
F12: IF ($d_t^b$ is $NM_d$) and ($e_t^k$ is $GD_e$), THEN Output is C
F13: IF ($d_t^b$ is $NM_d$) and ($e_t^k$ is $NM_e$), THEN Output is C
F14: IF ($d_t^b$ is $NM_d$) and ($e_t^k$ is $BD_e$), THEN Output is C
F15: IF ($d_t^b$ is $NM_d$) and ($e_t^k$ is $VB_e$), THEN Output is C
F16: IF ($d_t^b$ is $BD_d$) and ($e_t^k$ is $VG_e$), THEN Output is E
F17: IF ($d_t^b$ is $BD_d$) and ($e_t^k$ is $GD_e$), THEN Output is D
F18: IF ($d_t^b$ is $BD_d$) and ($e_t^k$ is $NM_e$), THEN Output is C
F19: IF ($d_t^b$ is $BD_d$) and ($e_t^k$ is $BD_e$), THEN Output is C
F20: IF ($d_t^b$ is $BD_d$) and ($e_t^k$ is $VB_e$), THEN Output is C
F21: IF ($d_t^b$ is $VB_d$) and ($e_t^k$ is $VG_e$), THEN Output is E
F22: IF ($d_t^b$ is $VB_d$) and ($e_t^k$ is $GD_e$), THEN Output is D
F23: IF ($d_t^b$ is $VB_d$) and ($e_t^k$ is $NM_e$), THEN Output is C
F24: IF ($d_t^b$ is $VB_d$) and ($e_t^k$ is $BD_e$), THEN Output is C
F25: IF ($d_t^b$ is $VB_d$) and ($e_t^k$ is $VB_e$), THEN Output is C
Table 4
27 fuzzy rules for the expertise measurement system.

F01: IF ($\rho_{avg}^{k,l}$ is $GD_a$) and ($cnt_{avg}^{k,l}$ is $GD_c$) and ($pst_{avg}^{k,l}$ is $GD_p$), THEN Output is A
F02: IF ($\rho_{avg}^{k,l}$ is $GD_a$) and ($cnt_{avg}^{k,l}$ is $GD_c$) and ($pst_{avg}^{k,l}$ is $NM_p$), THEN Output is A
F03: IF ($\rho_{avg}^{k,l}$ is $GD_a$) and ($cnt_{avg}^{k,l}$ is $GD_c$) and ($pst_{avg}^{k,l}$ is $BD_p$), THEN Output is B
F04: IF ($\rho_{avg}^{k,l}$ is $GD_a$) and ($cnt_{avg}^{k,l}$ is $NM_c$) and ($pst_{avg}^{k,l}$ is $GD_p$), THEN Output is A
F05: IF ($\rho_{avg}^{k,l}$ is $GD_a$) and ($cnt_{avg}^{k,l}$ is $NM_c$) and ($pst_{avg}^{k,l}$ is $NM_p$), THEN Output is B
F06: IF ($\rho_{avg}^{k,l}$ is $GD_a$) and ($cnt_{avg}^{k,l}$ is $NM_c$) and ($pst_{avg}^{k,l}$ is $BD_p$), THEN Output is C
F07: IF ($\rho_{avg}^{k,l}$ is $GD_a$) and ($cnt_{avg}^{k,l}$ is $BD_c$) and ($pst_{avg}^{k,l}$ is $GD_p$), THEN Output is B
F08: IF ($\rho_{avg}^{k,l}$ is $GD_a$) and ($cnt_{avg}^{k,l}$ is $BD_c$) and ($pst_{avg}^{k,l}$ is $NM_p$), THEN Output is C
F09: IF ($\rho_{avg}^{k,l}$ is $GD_a$) and ($cnt_{avg}^{k,l}$ is $BD_c$) and ($pst_{avg}^{k,l}$ is $BD_p$), THEN Output is D
F10: IF ($\rho_{avg}^{k,l}$ is $NM_a$) and ($cnt_{avg}^{k,l}$ is $GD_c$) and ($pst_{avg}^{k,l}$ is $GD_p$), THEN Output is A
F11: IF ($\rho_{avg}^{k,l}$ is $NM_a$) and ($cnt_{avg}^{k,l}$ is $GD_c$) and ($pst_{avg}^{k,l}$ is $NM_p$), THEN Output is B
F12: IF ($\rho_{avg}^{k,l}$ is $NM_a$) and ($cnt_{avg}^{k,l}$ is $GD_c$) and ($pst_{avg}^{k,l}$ is $BD_p$), THEN Output is C
F13: IF ($\rho_{avg}^{k,l}$ is $NM_a$) and ($cnt_{avg}^{k,l}$ is $NM_c$) and ($pst_{avg}^{k,l}$ is $GD_p$), THEN Output is B
F14: IF ($\rho_{avg}^{k,l}$ is $NM_a$) and ($cnt_{avg}^{k,l}$ is $NM_c$) and ($pst_{avg}^{k,l}$ is $NM_p$), THEN Output is C
F15: IF ($\rho_{avg}^{k,l}$ is $NM_a$) and ($cnt_{avg}^{k,l}$ is $NM_c$) and ($pst_{avg}^{k,l}$ is $BD_p$), THEN Output is D
F16: IF ($\rho_{avg}^{k,l}$ is $NM_a$) and ($cnt_{avg}^{k,l}$ is $BD_c$) and ($pst_{avg}^{k,l}$ is $GD_p$), THEN Output is C
F17: IF ($\rho_{avg}^{k,l}$ is $NM_a$) and ($cnt_{avg}^{k,l}$ is $BD_c$) and ($pst_{avg}^{k,l}$ is $NM_p$), THEN Output is D
F18: IF ($\rho_{avg}^{k,l}$ is $NM_a$) and ($cnt_{avg}^{k,l}$ is $BD_c$) and ($pst_{avg}^{k,l}$ is $BD_p$), THEN Output is E
F19: IF ($\rho_{avg}^{k,l}$ is $BD_a$) and ($cnt_{avg}^{k,l}$ is $GD_c$) and ($pst_{avg}^{k,l}$ is $GD_p$), THEN Output is C
F20: IF ($\rho_{avg}^{k,l}$ is $BD_a$) and ($cnt_{avg}^{k,l}$ is $GD_c$) and ($pst_{avg}^{k,l}$ is $NM_p$), THEN Output is C
F21: IF ($\rho_{avg}^{k,l}$ is $BD_a$) and ($cnt_{avg}^{k,l}$ is $GD_c$) and ($pst_{avg}^{k,l}$ is $BD_p$), THEN Output is D
F22: IF ($\rho_{avg}^{k,l}$ is $BD_a$) and ($cnt_{avg}^{k,l}$ is $NM_c$) and ($pst_{avg}^{k,l}$ is $GD_p$), THEN Output is C
F23: IF ($\rho_{avg}^{k,l}$ is $BD_a$) and ($cnt_{avg}^{k,l}$ is $NM_c$) and ($pst_{avg}^{k,l}$ is $NM_p$), THEN Output is D
F24: IF ($\rho_{avg}^{k,l}$ is $BD_a$) and ($cnt_{avg}^{k,l}$ is $NM_c$) and ($pst_{avg}^{k,l}$ is $BD_p$), THEN Output is E
F25: IF ($\rho_{avg}^{k,l}$ is $BD_a$) and ($cnt_{avg}^{k,l}$ is $BD_c$) and ($pst_{avg}^{k,l}$ is $GD_p$), THEN Output is D
F26: IF ($\rho_{avg}^{k,l}$ is $BD_a$) and ($cnt_{avg}^{k,l}$ is $BD_c$) and ($pst_{avg}^{k,l}$ is $NM_p$), THEN Output is E
F27: IF ($\rho_{avg}^{k,l}$ is $BD_a$) and ($cnt_{avg}^{k,l}$ is $BD_c$) and ($pst_{avg}^{k,l}$ is $BD_p$), THEN Output is E
Table 5
Detailed experimental results for Experiment A.

Episode        1     2     3     4     5
Iterations     66    11    39    15    44
Lab time (s)   710   132   401   237   560
Insect No.     1     1     1     2     2
Result         × × ×

Episode        6     7     8     9     10
Iterations     16    28    18    40    20
Lab time (s)   190   441   176   353   205
Insect No.     2     3     4     4     4
Result         × × × ×

Episode        11    12    13    14    15
Iterations     28    37    26    49    34
Lab time (s)   303   402   263   609   343
Insect No.     1     1     1     1     2
Result         ×

Episode        16    17    18    19    20
Iterations     28    6     15    20    9
Lab time (s)   342   59    134   215   77
Insect No.     2     2     2     3     3
Result         × × × × ×

Episode        21    22    23    24    25
Iterations     13    10    11    14    10
Lab time (s)   137   103   101   149   101
Insect No.     3     3     4     4     4
Result         × × × × ×

Episode        26    27    28    29    30
Iterations     16    12    11    37    7
Lab time (s)   185   160   119   462   90
Insect No.     1     1     1     2     2
Result         × × ×

Table 6
Detailed experimental results for Experiment B.

Episode        1     2     3     4     5
Iterations     61    44    27    26    15
Lab time (s)   665   462   244   250   251
Insect No.     1     1     1     1     2
Result         × × ×

Episode        6     7     8     9     10
Iterations     11    20    30    32    13
Lab time (s)   125   209   249   301   146
Insect No.     2     2     2     3     3
Result         × × × ×

Episode        11    12    13    14    15
Iterations     31    7     39    24    35
Lab time (s)   288   89    373   218   388
Insect No.     3     3     4     4     4
Result         × ×

Episode        16    17    18    19    20
Iterations     39    23    21    13    38
Lab time (s)   401   215   196   140   343
Insect No.     4     1     1     1     1
Result         ×

Episode        21    22    23    24    25
Iterations     12    24    19    15    22
Lab time (s)   111   171   167   140   160
Insect No.     2     2     2     2     3
Result         × × × ×

Episode        26    27    28    29    30
Iterations     31    34    36    28    25
Lab time (s)   297   249   356   257   204
Insect No.     4     4     1     1     2
Result

Fig. 13. Experimental result: Experiment A (without sharing knowledge) – Ep. 27 (the sequence of the movie clips follows the time flow).
Fig. 14. Experimental result: Experiment B (with sharing knowledge) – Ep. 19 (the movie clips are arranged in chronological order).
6. Conclusion

In this paper, we presented a cooperative reinforcement learning technique using a fuzzy logic-based expertise measurement system to entice bio-insects towards desired goal areas. Based upon the results obtained in [1], we modified the fuzzy rules and input values to obtain more precise knowledge for controlling the movement of the bio-insects. We also introduced the fuzzy logic-based expertise measurement system for sharing knowledge among the robots. We then obtained meaningful experimental results from two types of experiments: as a control group, the robots enticed the bio-insect without sharing knowledge in Experiment A, and as the second experimental group, the robots enticed the bio-insect while sharing knowledge in Experiment B. In the comparison between Experiments A and B, Experiment B showed better results, which means that sharing knowledge using the fuzzy logic-based expertise measurement system is a more efficient way to accomplish our task. The experiments of this paper focused only on how to entice a bio-insect towards a desired sub-goal direction using basic knowledge learned by the designed structure. Therefore, in our future experiments, we will address a new framework for the interaction among the robots and the bio-insects using a reasoning process for more specific and complex tasks.

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) (No. NRF-2013R1A2A2A01067449).

Appendix

In this appendix, we present the fuzzy rules for generating rewards and calculating expertise, as well as the detailed experimental results of Experiments A and B.

Algorithm 1. Cooperative reinforcement learning based on fuzzy logic-based expertise measurement system
Initialize the Q-tables and variables.
if current number of episodes > 1 then
   Load the previous Q-tables for all robots, α, and ε.
end if
M^{k,l} ← M^{k,l} + 1
repeat
   for b ← 1 : B do
      Recognize the current area, the current state, and the current sub-goal.
      if rand() < ε then
         Select the best action â_k for the kth robot among the possible actions of the kth robot.
         if the learned knowledge is empty at the current state then
            Select an action â_k for the kth robot randomly.
         end if
      else
         Select an action randomly.
      end if
      Move towards the selected action points.
      Recognize the current state.
      Do an action towards the bth bio-insect.
      Calculate the values d and e.
      Calculate ρ_{t+1} using the fuzzy logic-based reward process.
      if ρ_{t+1} > δ then
         pst^{k,l}_{t+1} ← ρ_{t+1}, cnt^{k,l}_{t+1} ← 1
      else
         pst^{k,l}_{t+1} ← 0, cnt^{k,l}_{t+1} ← 0
      end if
      Q^{k,l}_{t+1}(s, a) ← (1 − α) Q^{k,l}_t(s, a) + α (ρ_{t+1} + γ max_{â} Q^{k,l}_t(ŝ, â))
   end for
until the bio-insect reaches the goal area or any failure case happens
Run Algorithm 2 for sharing knowledge.
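As a reading aid, the following minimal Python sketch illustrates the per-step update of Algorithm 1: the ε-greedy action choice and the Q-learning update driven by the fuzzy reward ρ. The data structures, parameter values, and helper names are illustrative assumptions, and the fuzzy reward process itself is abstracted away as a given value.

# Minimal sketch of the Algorithm 1 update step (illustrative; the fuzzy
# reward process is not reproduced and is passed in as a given value rho).
import random
from collections import defaultdict

def choose_action(Q, state, actions, epsilon):
    # Exploit with probability epsilon, as written in Algorithm 1; fall back
    # to a random action when no knowledge has been learned for this state.
    if random.random() < epsilon and Q[state]:
        return max(Q[state], key=Q[state].get)
    return random.choice(actions)

def q_update(Q, state, action, rho, next_state, alpha=0.1, gamma=0.9):
    # Q(s,a) <- (1 - alpha) Q(s,a) + alpha (rho + gamma max_a' Q(s',a'))
    best_next = max(Q[next_state].values(), default=0.0)
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * (rho + gamma * best_next)

def record_expertise_terms(rho, delta=0.5):
    # Positive-reward bookkeeping used later by the expertise measurement:
    # pst stores the reward when it exceeds the threshold delta, cnt counts it.
    return (rho, 1) if rho > delta else (0.0, 0)

# Example usage with hypothetical states and actions:
Q = defaultdict(lambda: defaultdict(float))
a = choose_action(Q, "s0", ["a0", "a1", "a2"], epsilon=0.8)
q_update(Q, "s0", a, rho=0.7, next_state="s1")
pst, cnt = record_expertise_terms(0.7)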
Algorithm 2. Fuzzy logic-based expertise measurement system for sharing knowledge

Input: Q-tables and parameters of all robots
Output: Q-tables including the shared knowledge
for k ← 1 : N do
   for l ← 1 : L do
      if M^{k,l} > 0 then
         ρ^{k,l}_avg ← (Σ_{t=1}^{M^{k,l}} ρ^{k,l}_{t+1}) / M^{k,l}
         pst^{k,l}_avg ← (Σ_{t=1}^{M^{k,l}} pst^{k,l}_t) / M^{k,l}
         cnt^{k,l}_avg ← (Σ_{t=1}^{M^{k,l}} cnt^{k,l}_t) / M^{k,l}
      else
         ρ^{k,l}_avg ← 0, pst^{k,l}_avg ← 0, cnt^{k,l}_avg ← 0
      end if
   end for
end for
Calculate S^{k,l} using the fuzzy logic-based expertise measurement system.
Share the knowledge with each other.
S^l ← Σ_{k=1}^{N} S^{k,l}
for k ← 1 : N do
   for l ← 1 : L do
      Q^l ← Σ_{k=1}^{N} (S^{k,l} / S^l) · Q^{k,l}
   end for
end for

Algorithm 3. Recognizing the current area and selecting a sub-goal for a bio-insect

Sub-goals for the bio-insect: #2 → #3 → #4 (final goal)
Input: Current area and current sub-goal of the bio-insect
Output: Sub-goal of the bio-insect
if current area ≠ final goal area #4 then
   Choose the next sub-goal in the current area.
else
   Choose the final sub-goal #4.
end if
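To illustrate the weighted knowledge-sharing step of Algorithm 2 above, the following Python sketch combines the robots' Q-tables for each area using given expertise scores S^{k,l}. The expertise scores themselves would come from the fuzzy logic-based measurement system, which is not reproduced here, so the values, array shapes, and the choice to let every robot adopt the shared table are assumptions for illustration only.

# Sketch of the knowledge-sharing step of Algorithm 2 (illustrative only).
# Q[k][l] is the Q-table of robot k for area l; S[k][l] is its expertise score
# as produced by the fuzzy logic-based expertise measurement system.
import numpy as np

def share_knowledge(Q, S):
    N = len(Q)        # number of robots
    L = len(Q[0])     # number of areas
    shared = []
    for l in range(L):
        S_l = sum(S[k][l] for k in range(N))                     # S^l = sum_k S^{k,l}
        Q_l = sum((S[k][l] / S_l) * Q[k][l] for k in range(N))   # expertise-weighted Q-table
        shared.append(Q_l)
    # Assumption: every robot adopts the shared table for each area.
    return [[shared[l].copy() for l in range(L)] for _ in range(N)]

# Hypothetical example: 2 robots, 1 area, Q-tables of 3 states x 4 actions.
Q = [[np.zeros((3, 4))], [np.ones((3, 4))]]
S = [[0.2], [0.8]]
Q_shared = share_knowledge(Q, S)   # each entry equals 0.8 * ones + 0.2 * zeros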
References

[1] J.-H. Son, H.-S. Ahn, Bio-insect and artificial robot interaction: learning mechanism and experiment, Soft Comput. 18 (6) (2014) 1127–1141.
[2] A. Hopgood, Artificial intelligence: hype or reality? IEEE Comput. Mag. 36 (5) (2003) 24–28.
[3] K.E. Merrick, A comparative study of value systems for self-motivated exploration and learning by robots, IEEE Trans. Auton. Ment. Dev. 2 (2) (2010) 119–131.
[4] J.-H. Son, H.-S. Ahn, Bio-insect and Artificial Robot Interaction Using Cooperative Reinforcement Learning, 2012, pp. 1190–1194.
[5] J.R. Krebs, J.T. Erichsen, M.I. Webber, E.L. Charnov, Optimal prey selection in the great tit, Anim. Behav. 25 (1977) 30–38.
[6] I. Kitowski, Social learning of hunting skills in juvenile marsh harriers Circus aeruginosus, J. Ethol. 27 (3) (2009) 327–332.
[7] C.C. Wilmers, E. Post, A. Hastings, The anatomy of predator–prey dynamics in a changing climate, J. Anim. Ecol. 76 (6) (2007) 1037–1044.
[8] H. Sand, C. Wikenros, P. Wabakken, O. Liberg, Effects of hunting group size, snow depth and age on the success of wolves hunting moose, Anim. Behav. 72 (4) (2006) 781–789.
[9] P. Abbeel, A. Coates, M. Quigley, A.Y. Ng, An application of reinforcement learning to aerobatic helicopter flight, Adv. Neural Inf. Process. Syst. 19 (2007) 1.
[10] A. Barto, R. Crites, Improving elevator performance using reinforcement learning, Adv. Neural Inf. Process. Syst. 8 (1996) 1017–1023.
[11] J. Peters, S. Vijayakumar, S. Schaal, Reinforcement learning for humanoid robotics, in: IEEE-RAS International Conference on Humanoid Robots, 2003, pp. 1–20.
[12] Y. Duan, Q. Liu, X. Xu, Application of reinforcement learning in robot soccer, Eng. Appl. Artif. Intell. 20 (7) (2007) 936–950.
[13] B. Abdulhai, R. Pringle, G.J. Karakoulas, Reinforcement learning for true adaptive traffic signal control, J. Transport. Eng. 129 (3) (2003) 278–285.
[14] M.A. Walker, An application of reinforcement learning to dialogue strategy selection in a spoken dialogue system, J. Artif. Intell. Res. 12 (2000) 387–416.
[15] J.A. Boyan, M.L. Littman, Packet routing in dynamically changing networks: a reinforcement learning approach, Adv. Neural Inf. Process. Syst. (1994) 671–671.
[16] Y.-C. Wang, J.M. Usher, Application of reinforcement learning for agent-based production scheduling, Eng. Appl. Artif. Intell. 18 (1) (2005) 73–82.
[17] L.M. Gambardella, M. Dorigo, et al., Ant-Q: A Reinforcement Learning Approach to the Traveling Salesman Problem, 1995, pp. 252–260.
[18] G. Tesauro, N.K. Jong, R. Das, M.N. Bennani, A Hybrid Reinforcement Learning Approach to Autonomic Resource Allocation, 2006, pp. 65–73.
[19] B. Araabi, S. Mastoureshgh, M. Ahmadabadi, A study on expertise of agents and its effects on cooperative Q-learning, IEEE Trans. Syst. Man Cybern. B: Cybern. 37 (2) (2007) 398–409.
[20] L. Nunes, E. Oliveira, Advice-exchange amongst heterogeneous learning agents: experiments in the pursuit domain, in: Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2003.
[21] M. Ahmadabadi, M. Asadpour, Expertness based cooperative Q-learning, IEEE Trans. Syst. Man Cybern. B: Cybern. 32 (1) (2002) 66–76.
[22] M. Ahmadabadi, A. Imanipour, B. Araabi, M. Asadpour, R. Siegwart, Knowledge-based extraction of area of expertise for cooperation in learning, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006, pp. 3700–3705.
[23] P. Ritthipravat, T. Maneewarn, J. Wyatt, D. Laowattana, Comparison and analysis of expertness measure in knowledge sharing among robots, Lecture Notes Comput. Sci. 4031 (2006) 60.
[24] L.P. Kaelbling, M.L. Littman, A.W. Moore, Reinforcement learning: a survey, J. Artif. Intell. Res. 4 (1996) 237–285.
[25] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, 1998.
[26] R. Sharma, M. Gopal, Synergizing reinforcement learning and game theory – a new direction for control, Appl. Soft Comput. 10 (3) (2010) 675–688.
[27] J.L. Salmeron, Fuzzy cognitive maps for artificial emotions forecasting, Appl. Soft Comput. 12 (12) (2012) 3704–3710.
[28] M. Nikravesh, Evolution of fuzzy logic: from intelligent systems and computation to human mind, Soft Comput. 12 (2) (2008) 207–214.
[29] M. Tan, Multi-agent reinforcement learning: independent vs. cooperative agents, in: Proceedings of the Tenth International Conference on Machine Learning, 1993, pp. 330–337.
[30] M. Littman, Markov games as a framework for multi-agent reinforcement learning, in: Proceedings of the Eleventh International Conference on Machine Learning, 1994, pp. 157–163.
[31] J. Leng, C.P. Lim, Reinforcement learning of competitive and cooperative skills in soccer agents, Appl. Soft Comput. 11 (1) (2011) 1353–1362.
[32] P. Tangamchit, J. Dolan, P. Khosla, The necessity of average rewards in cooperative multirobot learning, in: IEEE International Conference on Robotics and Automation, vol. 2, 2002, pp. 1296–1301.
[33] Y. Wang, C. de Silva, Multi-robot box-pushing: single-agent Q-learning vs. team Q-learning, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, 2006, pp. 3694–3699.
[34] Y. Wang, C. de Silva, A machine-learning approach to multi-robot coordination, Eng. Appl. Artif. Intell. 21 (3) (2008) 470–484.