Expert Systems with Applications 42 (2015) 6457–6471
Correcting flawed expert knowledge through reinforcement learning

David O. Aihe, Avelino J. Gonzalez*

Intelligent Systems Laboratory, Department of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816-2362, USA

* Corresponding author: A.J. Gonzalez.
Article history: Available online 15 April 2015

Abstract
Subject matter experts can sometimes provide incorrect and/or incomplete knowledge in the process of building intelligent systems. Other times, the expert articulates correct knowledge only to have it misinterpreted by the knowledge engineer. In yet other cases, changes in the domain can lead to outdated knowledge in the system. This paper describes a technique that improves a flawed tactical agent by revising its knowledge through practice in a simulated version of its operational environment. This form of theory revision repairs agents originally built through interaction with subject matter experts. It is advantageous because such systems can now cease to be completely dependent on human expertise to provide correct and complete domain knowledge. After an agent has been built in consultation with experts, and before it is allowed to become operational, our method permits its improvement by subjecting it to several practice sessions in a simulation of its mission environment. Our method uses reinforcement learning to correct such errors and fill in gaps in the knowledge of a context-based tactical agent. The method was implemented and evaluated by comparing the performance of an agent improved by our method to the original hand-built agent, whose knowledge was purposely seeded with known errors and/or gaps. The results show that the improved agent did in fact correct the seeded errors and did gain the missing knowledge, permitting it to perform better than the original, flawed agent.

© 2015 Elsevier Ltd. All rights reserved.
Keywords: Knowledge acquisition; Context-based reasoning; Reinforcement learning; Theory revision
1. Introduction and background

The development of intelligent systems has traditionally relied heavily on human expertise as a source of its domain knowledge. Thus, we often rely on a subject matter expert's (SME) interpretation of the domain. The advantage to this is that there is a human ''teacher'' who can help the knowledge engineer (KE) understand the complexities of the domain. One disadvantage, of course, is that SMEs are indeed human and thus not always correct. That is, they may have an erroneous, incomplete or outdated interpretation of the domain. Thus, errors and/or gaps in knowledge could creep past the validation process and into the intelligent system. SMEs have traditionally been ''interviewed'', observed and interrogated by knowledge engineers in order to elicit their knowledge for incorporation into the intelligent system. This formed the basis of the knowledge acquisition process for the expert systems that emerged in the late 1970s. Such acquisition of knowledge from human experts has generally been successful, but also lengthy, difficult and costly, not to mention fraught with potential for errors. Nevertheless, these manual methods are still
commonly used today. Feigenbaum (1979) foresaw the problem of knowledge acquisition long ago when he coined the term ‘‘knowledge acquisition bottleneck’’, and saw it as a potential obstacle to the widespread use of expert systems. Significant progress has been made over the years in improving the means of eliciting and acquiring this knowledge. Two particular approaches have addressed this problem: (1) automated knowledge acquisition (AKA) systems, and (2) machine learning (ML). AKA has focused on building tools with which an SME can interact without the need for a knowledge engineer. There was significant interest in AKA systems in the 1980s and 1990s. Several systems described in the literature have had some measure of success (Boose, 1984; Delugach & Skipper, 2000; Ford et al., 1991; Gonzalez, Castro, & Gerber, 2006; Kahn, Nolan, & McDermott, 1985; Marcus, McDermott, & Wang, 1985; Parsaye, 1988; Shaw, 1982). AKA tools can (ideally) convert the knowledge elicited directly into the syntax of the intelligent system and avoid the intercession of a knowledge engineer – or at least minimize it. This approach has made the process more efficient, but complete ‘‘turnkey’’ knowledge base creation has remained an elusive goal. With very few notable exceptions, research on AKA systems seems to have practically ended by the early to mid-2000s. See (Gonzalez et al., 2006) for a discussion of AKA systems up to 2006 that specifically addresses tactical agents.
Machine learning approaches, on the other hand, typically use examples to learn how to do, classify, or decide something. There are many supervised machine learning techniques – too many to describe here (see Guerin, 2011). One recent trend involves learning directly from an SME's description (Pearson & Laird, 2004) as well as learning from observation (LfO) of a human performer in a simulation or in the real world. See (Argall, Chernova, Veloso, & Browning, 2009; Chernova & Veloso, 2007; Fernlund, Gonzalez, Georgiopoulos, & DeMara, 2006; Floyd, Esfandiari, & Lam, 2008; Friedrich, Kaiser, & Dillman, 1996; Gonzalez, Georgiopoulos, DeMara, Henninger, & Gerber, 1998; Isaac & Sammut, 2003; Johnson & Gonzalez, 2014; Konik & Laird, 2006; Moriarty & Gonzalez, 2009; Ontañón, Montaña, & Gonzalez, 2014; Ontañón et al., 2009; Sammut, Hurst, Kedzier, & Michie, 1992; Sidani & Gonzalez, 2000; Stein & Gonzalez, 2011, 2014; Van Lent & Laird, 1998) for a partial list of works in LfO. Observation of an SME in the traditional knowledge acquisition sense involved a knowledge engineer personally watching an expert perform a task and then converting the observed concepts into code. The more recent machine learning interpretation of learning from observation, on the other hand, involves a computational system that learns through direct observation of the human and directly converts what it observes (a time-stamped trace of data) into usable knowledge in whatever syntax is required by the system. The body of literature on AKA and ML is very extensive, and a thorough discussion of it here is beyond the scope of this article. Nevertheless, the common thread in AKA and LfO is that they still require the direct involvement of human experts. This is not a bad thing, actually, because SMEs remain the best source of expertise available. Nevertheless, as discussed earlier, flaws could still creep into the system undetected, either because of SME errors, miscommunication between the SME and the knowledge engineer, or because of subsequent changes in the domain. In this paper, we describe an investigation into a means to improve already competent intelligent agents by subjecting them to experiential learning. That is, they initially learn about the domain from human experts (directly or indirectly), and are subsequently able to correct and complete their knowledge base directly from their own experience in the domain.

1.1. Learning tactical behaviors

We are particularly interested in creating agents that can act in a tactically correct manner when executing a mission or task in a possibly hostile environment. Examples of such tasks or missions can be found in military operations as well as in more mundane tasks such as driving an automobile or playing a game. Note that in our case we use a simulation of the environment, rather than the physical environment directly. The Oxford Dictionary (2008) defines ''behavior'' as ''the actions or reactions of a person or animal in response to external or internal stimuli''. Henninger (2000) defines behavior as ''any observable action or reaction of a living organism.'' The behavior of a person can be said to be divided into tactical and strategic behaviors. Schutte (2004) states that ''tactical behaviors are generally considered to be near-term, dynamic activities'', while strategic behaviors usually involve the decision-making process based on the overall mission of the agent in the longer term.
Latorella and Chamberlain (2002) explain that time pressure in any given situation differentiates whether an individual will operate in a tactical mode or in a strategic mode. We address here the improvement in the proficiency of software agents performing a tactical task. These agents can display human-like behavior in computer simulations, and can improve and/or correct their own (and possibly already competent) performance built through other means, which typically include SME involvement. More specifically, we seek to provide the means to
improve competent behavior for agents that find themselves in situations for which they have not been trained or have been trained improperly. This can be a critical issue in military training. We selected automobile driving as the tactical domain of choice. It is sufficiently rich for our purposes and its characteristics are in some ways similar to those of a game where there are objectives and tactical situations as well as rules that define what legal actions can be taken; yet, this domain does not require scarce expertise, as nearly every adult in the US knows how to drive a car. Moreover, our research group has done much work in modeling automobile driving (Brown, 1994; Fernlund et al., 2006; Gonzalez, Grejs, & Gonzalez, 2000; Stein & Gonzalez, 2011), which provides a base for our work. Finally, the specter of fully autonomous driverless cars in the not-too-distant future, such as the well-publicized Google cars (The Economist, 2013), makes this application a compelling one, as society prepares for this eventuality. We next review the relevant literature in more detail to better place our work among that of others.
1.2. Review of the state of the art in building and correcting tactical agents Several research topics have touched upon the subject of correcting knowledge automatically to one degree or another. These include knowledge refinement, theory revision and belief revision. However, while building tactical agents has drawn much recent attention in the literature, an automated means of correcting incomplete or erroneous behavior in tactical agents has not been the subject of extensive research. In this section, we address these issues but focus on the last item – correcting tactical behaviors. Knowledge refinement was mostly associated with AKA systems. A strong model AKA system called MORE (Kahn et al., 1985) refined domain models through a dialog with the SME. Craw’s KRUST system (Craw, 1991) refines and corrects knowledge (although not necessarily tactical knowledge) with the help of an expert. When presented with a case whose solution conflicts with its SME-supplied knowledge, KRUST makes use of other sources to ‘‘... remove unlikely refinements and badly behaved knowledge bases.’’ (Craw, 1991). It then presents the refinements judged best to the SME for his/her selection. Theory revision was a popular topic in the late 1980s and 1990s. Laird and his group studied the ‘‘recovery from incorrect knowledge’’ in the context of the SOAR system (Huffman, Pearson, & Laird, 1993; Laird, 1988; Laird, Hucka, Yager, & Tuck, 1990; Laird, Pearson, & Huffman, 1996; Pearson, 1995; Pearson & Laird, 1995). Their approach was based on a variation of explanationbased learning (Dejong & Mooney, 1986), and in some cases, required the assistance of an instructor. In some instances, they did not correct the knowledge but rather, the decision made by the system. Tecuci (1992) described a system that also corrects SME errors, but likewise uses an SME along with a machine learning system to make the refinements/corrections. Ourston and Mooney (1990) used induction to revise incorrect/ incomplete domain theory that was already a good approximation of the true domain. They claimed it required fewer learning examples. Pazzani and his group investigated theory revision during the same period (Murphy & Pazzani, 1994; Pazzani, 1988, 1991; Wogulis & Pazzani, 1993). They developed a system to correct incomplete and incorrect domain theory as related to classification rules. One of their systems employed similarity-based learning, explanation-based learning and theory-driven learning to discover and implement the corrections. However, the objective was to learn domain theory to improve learning of classification rules, and therefore was not directed at tactical agents.
More recently, Rozich, Ioerger, and Yager (2002) developed the FURL system that is able to implement theory revision in the process of learning fuzzy rules. Cain (2014) presented the DUCTOR system that uses induction to generalize or specialize rules responsible for incorrect classification. Brunk and Pazzani (2014) present the CLARUS system that employs ''... linguistic information to guide the repair process''. These systems involve errors and gaps in classification rules, however, and not in tactical agents. Belief revision is somewhat related to theory revision in that both seek to change something in an intelligent system in the face of evidence that there are errors. Truth maintenance systems have been used to maintain truth over time in the face of changes in the situation. See Peppas (2008) for a description of this subject. It too has a large body of literature. However, it is largely peripheral to our work because of its focus on logic systems and applications to classification or to databases. Therefore, we will forego a discussion of this topic here.

More to the point of our work, Stein and Gonzalez (2011) built a system that uses reinforcement learning to correct and enhance a tactical agent built through a neural network. The same authors built a second system (2015) that uses haptics to let an SME/instructor, who monitors the agent's performance in real time, make corrections to the agent by providing corrective counterforce through the haptic device when the agent is performing incorrectly. Their work as it compares to that presented here is discussed in Section 5.1 below, after our results are revealed.

Putting our work in perspective relative to the state of the art, the concept of learning from experience through reinforcement learning is not novel. Much research has been done on agents that learn solely through interaction with their environment (see Brooks, 1990, among many others). However, our approach is different in that we support the use of human expertise in building the intelligent system, whether through traditional manual means (interviews, etc.), AKA systems or LfO methods. We also seek to support intelligent tactical agents that are task-specific. This is in contrast to the emerging field of general artificial intelligence, where agents are taught to solve any kind of problem. Our approach serves to enhance human expertise, rather than replace it. Theory revision research has focused on correcting classification rules using inductive learning. The work of Laird's group is applicable to tactical agents, but they approach it from the standpoint of explanation-based learning and periodic interaction with SMEs, rather than by experiencing the environment itself. The closest work to ours is Stein's work enhancing the knowledge of tactical agents via reinforcement learning when directly experiencing the environment or when guided by an SME through a haptic interface. However, Stein's use of a neural network architecture (through NeuroEvolution) as the platform for the agents makes the knowledge opaque. We discuss this further in Section 5.1. We based our agents on a specific context-driven representation paradigm called Context-based reasoning, which we discuss next.

1.3. Context-based reasoning

There exist many paradigms for representing tactical human behavior. SOAR (Laird, Newell, & Rosenbloom, 1987), ACT-R (Anderson, Matessa, & Lebiere, 1997) and iGEN (Zachary, Ryder, & Hicinbothom, 1998) are three popular ones among many others.
However, no universally accepted best modeling paradigm exists for tactical human behavior. Each has its advantages and disadvantages. One alternative approach is modeling tactical behaviors through contextual reasoning. Based on the intuitiveness and flexibility of a context, several researchers have proposed methods that rely on contexts to build agents that exhibit tactical behaviors (Brezillon, 2003; Gonzalez & Ahlers, 1998; Turner, 2014). Context-based reasoning (CxBR) is one of these. We base our agents on CxBR and briefly describe it next.

Context-based reasoning is a modeling paradigm designed to intelligently control an agent's tactical actions by activating and deactivating predefined contexts when a specific situation calls for them. These contexts contain the knowledge for how the agent should act when it finds itself in that context. CxBR is loosely based on some guiding principles that pertain to the way humans think in their everyday activities (Parker, Gonzalez, & Hollister, 2013). The basic idea behind CxBR is that contexts represent situations, and contain the possibly complex actions normally required to manage these situations successfully. Contexts also define what the agent can expect when in a particular situation. Major Contexts are the main elements of control for an agent under the CxBR representation. Major Contexts contain two main elements: (1) the functionality to allow the agent to act properly when in that situation (action functions and action rules), and (2) the knowledge of when the situation evolves into a different situation and therefore requires a different Major Context to be in control of the agent (transition rules). These transition rules allow control to pass to another Major Context whose actions better suit the newly emerging situation. One Major Context is said to be active when it is controlling the actions of the agent. Rather than being an architecture with an accompanying language, syntax and built-in functionality, CxBR is instead a flexible organization of knowledge that can be implemented in several different ways. See Gonzalez and Ahlers (1994, 1998) and Gonzalez, Stensrud, and Barrett (2008) for a full description of CxBR.

Our research objective was to apply reinforcement learning to an already capable CxBR-based tactical agent, to expand its scope of competence beyond that of the SME as well as to correct any errors possibly introduced by the SME. We purposely seeded errors (incorrect and incomplete knowledge) within the agent to determine whether our approach could overcome these flaws. By subjecting a manually built CxBR agent to reinforcement learning in a simulation, our method was able to correct such flawed knowledge. We describe this method next.

2. Conceptual approach

Our approach is based on the popular saying ''hindsight is clearer than foresight''. In general, our process consists of subjecting an agent with an assigned mission to several different realistic scenarios in a simulation, some of which were not specifically considered by the SME. This agent was developed manually a priori with the help of an SME. If the agent successfully completes the mission, its decisions are reinforced. The agent is subsequently subjected to a new scenario that is a modified version of the previous one. If the agent fails to complete the mission, changes are made to the agent and the same scenario is executed again. This continues until the agent successfully accomplishes multiple variations of the assigned mission in the environment. We use reinforcement learning (RL) (Barto, 2003) to make the changes to the agent. Our approach is designed to correct three general types of errors:

(a) Errors in the knowledge provided by the SME (e.g., an incorrect speed limit);
(b) Missing actions within a context (e.g., not knowing to stop at a stop sign); and
(c) Missing complete contexts (e.g., not knowing how to drive on a dirt road).
We refer to an agent whose errors and incomplete contexts are repaired (items a and b above) as an improved agent. An agent for which a new context is created (item c above) is called an enhanced agent. We should note here that our approach is intended for off-line use and not for online use during the in-service operation of the agent. Therefore, the computation time for learning is irrelevant in the execution of the agent while in its intended application.
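The paper does not tie CxBR to any particular syntax, so, purely as an illustration of the kind of structure the improvement process operates on, the following Python sketch shows one possible organization of a Major Context with action rules, transition rules, a list of compatible contexts, and learnable attributes (such as a speed limit stored as a variable rather than a constant). All names and values are hypothetical; the actual prototype described in Section 3 was built with Oracle PL/SQL and a database.

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Optional

    @dataclass
    class MajorContext:
        """Illustrative CxBR Major Context (not the paper's implementation)."""
        name: str
        # Learnable attributes replace hard-coded constants (e.g., a speed limit),
        # so the improvement process can adjust them during practice sessions.
        attributes: Dict[str, float] = field(default_factory=dict)
        # Action functions/rules executed while this context is active.
        action_rules: List[Callable[[dict], str]] = field(default_factory=list)
        # Transition rules: situation -> name of the next Major Context, or None.
        transition_rules: List[Callable[[dict], Optional[str]]] = field(default_factory=list)
        # Contexts to which control may legally pass from this one.
        compatible_contexts: List[str] = field(default_factory=list)

        def act(self, situation: dict) -> List[str]:
            """Fire every action rule for the current situation."""
            return [rule(situation) for rule in self.action_rules]

        def next_context(self, situation: dict) -> Optional[str]:
            """Return the first transition rule that fires, if any."""
            for rule in self.transition_rules:
                target = rule(situation)
                if target is not None:
                    return target
            return None

    # Hypothetical urban-driving context seeded with a (possibly wrong) speed limit.
    urban = MajorContext(
        name="URBAN_DRIVING",
        attributes={"max_speed_mph": 30.0},
        action_rules=[lambda s: "accelerate" if s["speed"] < 30.0 else "hold_speed"],
        transition_rules=[lambda s: "FREEWAY_DRIVING" if s.get("on_ramp") else None],
        compatible_contexts=["FREEWAY_DRIVING", "INTERSECTION"],
    )

In this sketch, the three error types above correspond to a wrong attribute value, a missing action rule, and a missing MajorContext instance altogether.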
2.1. RL-CxBR integration

Creating a model of an automobile driver from scratch through reinforcement learning was not our objective here, because, in our opinion, doing so would encourage non-human-like actions to be generated. While these non-human actions may allow the agent to perform well, the actions will not appear to be made by a human, which is a key element of many applications of intelligent simulation-based agents. We instead improve an already competent agent built through other means but with an SME involved, either directly or indirectly. Stein and Gonzalez (2011) used a combination of NEAT (Stanley & Miikkulainen, 2002) and PSO (Kennedy & Eberhart, 1995) to build agents through observation and then experience (through a form of RL). While successful, their approach was computationally intense, requiring weeks of computation on a high-performance cluster computer. Our approach applies RL directly to the contexts used in the CxBR agent to improve/correct its performance, making it significantly more computationally efficient (see Section 5.1). Formally, an improved context C'_i is represented as:

C'_i = R(C_i)

where R is the reinforcement-learning algorithm applied to improve the original context C_i. The design of the reward function is important, as it directs the learning agent on what behavior to reinforce and which to discard. It does not, however, tell the agent what is right or wrong. In our agent improvement process, the reward function directs the agent on what contexts to improve and what are the better values and attributes within these contexts. In a typical mission, the reward function is defined by the achievement of the goal and/or sub-goals of the mission. Rewards are given for the successful completion of the mission. For example, if an agent completes a task successfully and on time, a positive reward is given to the agent. If the agent completes the task successfully but late, no reward is given. If the agent does not complete the task, a negative reward is given. More on this later in this paper. Formally, the reward function R is a function of the goal (G), the sub-goals (sG) and the constraints of the mission (Co), i.e.:

R = f(G, sG, Co)

Placing a reward on a mission without considering the constraints can be counterproductive, as the rewards received by the agent will not be a true representation of the overall mission. For example, if a constraint exists that an agent driver must arrive at its destination at a given time WITH its own car, but the agent arrives at the appropriate time in a taxi, the agent would not receive a reward.

2.2. Adjusting a context to enable learning

An adjustment to the CxBR modeling paradigm becomes necessary to allow the modification of contexts and the creation of new contexts as a result of the agent's interaction with its environment. This is achieved by replacing hard-coded constants in the contexts with variables. This allows for the seamless modification of the variables during the course of the agent's interaction with its environment in a simulator. These variables are stored in tables in a database. As the agent undergoes training, these variables are modified until a value equal to or close to the true value in the environment is achieved. We refer to this value as the optimal value. An optimal value is determined when the value of a variable converges to a single value and/or the change in the value of that variable becomes negligibly small after multiple cycles. The question arises as to what part of a context one should replace with variables. Does one replace the action rules, transition rules and/or contextual values (e.g., speed limit) with variables? If these are all replaced with variables, this would amount to the agent learning everything from scratch, with the attendant disadvantages mentioned above. While learning from scratch is of course acceptable in the general sense, our preferred solution is to learn the values of those attributes that, if modified, will result in correct human-like behavior as defined by meeting the objectives of the mission or task. One example would be to provide a maximum speed limit as a variable in a database table, and to define actions that will enable the agent to learn the true speed limit of a road segment.

2.3. Formal representation

Formalizations are presented in this section for the improvement process as well as for the enhancement process. States are elements of the environment, and an environment consists of many states. A context contains a group of states where related actions are performed. Typically, when an agent performs an action when in a state in the environment, the action leads the agent to a new state. This new state can be in the same context as the previous state or it can be in a different context. This process continues until the mission goal is accomplished. The process enables the selection of the active context (that controls the agent). For any given context C_i, there exists a set of predefined actions for the states in the context. In the formula below, s_i and s_j represent the states while a_i, a_j and a_n represent the corresponding actions.

C_i = {(s_i, a_i), (s_i, a_j), (s_i, a_n), (s_j, a_i), (s_j, a_j), (s_j, a_n), ...}

There also exists a set of predefined universal actions in the action-base that are available to all states in all contexts:

a_b = {a_1, a_2, a_3, a_4, ..., a_n}

For the given context C_i and state s_j, a comparison of Rx values is carried out, where Rx contains the values for each context, state and action triple that were modified through the RL process (called reinforced values):

Rx(context, state, action_1) = Reward(context, state, action_1) + γ · Max[Rx(same context, next state, all actions)]

Rx(C_i, S_j, a_1) = R(C_i, S_j, a_1) + γ · Max[Rx(C_i, S_{j+1}, a)]
Rx(C_i, S_j, a_2) = R(C_i, S_j, a_2) + γ · Max[Rx(C_i, S_{j+1}, a)]
Rx(C_i, S_j, a_3) = R(C_i, S_j, a_3) + γ · Max[Rx(C_i, S_{j+1}, a)]
...
Rx(C_i, S_j, a_n) = R(C_i, S_j, a_n) + γ · Max[Rx(C_i, S_{j+1}, a)]

where γ is the discount factor. From the above equations, the context is constant, except in cases where the next state falls within a new context, which we will discuss later as a special case. Taking C_i out of these equations:

Rx(S_j, a_n) = R(S_j, a_n) + γ · Max[Rx(S_{j+1}, a)]

The appropriate action for each state in the context is thus the one with the maximum reinforced value:

max{Rx}

The formalization of the enhancement process (i.e., creating a new context) follows the same principle as that of the improvement process (i.e., correcting existing knowledge) with only one exception: the only actions attempted are those actions defined in the action-base.
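As an illustration only (the actual prototype described in Section 3 was implemented in Oracle PL/SQL and a database, not Python), the following minimal sketch shows how reinforced values over {context, state, action} triples of the kind defined above could be maintained; gamma is the discount factor, and all function and variable names are hypothetical.

    from collections import defaultdict

    GAMMA = 0.9  # discount factor (gamma in the equations above)

    # Rx[(context, state, action)] -> reinforced value, initialized to zero
    Rx = defaultdict(float)

    def update_rx(context, state, action, reward, next_state, actions):
        """One-step update: Rx(c, s, a) = R(c, s, a) + gamma * max_a' Rx(c, s', a').

        The context is held constant here; a transition into a different
        context is treated separately, as noted in the text above.
        """
        best_next = max((Rx[(context, next_state, a)] for a in actions), default=0.0)
        Rx[(context, state, action)] = reward + GAMMA * best_next
        return Rx[(context, state, action)]

    def best_action(context, state, actions):
        """The appropriate action for a state is the one that maximizes Rx."""
        return max(actions, key=lambda a: Rx[(context, state, a)])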
2.4. Improving an active context

The real-time improvement of an active context involves making a copy of the active context as originally defined, and placing it in a repository known as the Improved Context Repository. This repository holds copies of contexts that have undergone some form of modification. As soon as a context becomes active, a copy of it is created in this repository. As the agent experiments with different values and settings within a context, these are reflected in the copy
alongside the rewards received for each setting. Formally, for an active Context ACi that exists in the CxBR system context library (i.e., holds all possible contexts relevant for missions/tasks in that domain – formally defined in Section 2.7.3 below), a copy CCi exists in the Improved Context Repository. The copy CCi contains the various values and settings explored and the corresponding rewards. It becomes a working copy of the context in question, while the original one is kept intact in the system context library.
CC_i ⇐ {(s_i, a_i), (s_i, a_j), ... (s_n, a_n) | (r_i, r_j, ..., r_n)}

where s_i is a state, a_i are the actions and r_i are the rewards obtained for performing a_i in s_i. The high-level algorithm for the context improvement process is:

Initialize the reinforced values (Rx) for each {context, state, action} tuple to zero
10:: Activate the default context
     Simulate and control agent
  1: Initiate random action
     Agent arrives in new state
     Calculate value of action in new state in existing context
     Update Rx values
     Evaluate transition rules
     CASE: Evaluated rules transition to existing context
         GOTO 1:
     CASE: No context defined for evaluated rules
         CALL: Context Selector
             Search all predefined contexts
             CASE: Match found
                 Activate 'matched' context
                 Make copy of previous active context in context repository
                 CALL: Context Modifier
                     Add currently active context to list of 'compatible contexts' in previous context
                 GOTO 1:
             CASE: No match found
                 CALL: Context Creator
                     Create context from context template
                     Add parameters of current situation to context from global fact base
                 GOTO 1:
     CASE: Mission goal achieved
         Mark end of episode (Note 1)
         Calculate Rx value
         CASE: Change in Rx ≤ 0
             End; GOTO 20::
         CASE: Change in Rx > 0
             GOTO 10::
     CASE: Mission goal not achieved
         Identify cause of failure
         Address appropriately by calling either the context modifier or the context creator module
         GOTO 10::
20:: Based on the Rx values, choose the action that produces the most reward for each state in a given context by choosing the max Rx value for each context-state-action combination.
     Compare the original predefined actions and attributes in a state in a context with the newly learned (calculated) actions and attributes for the same state in the same context
     CASE: Newly learned (calculated) action and attributes are different from the original action
         Create a copy of the context in the context repository
         CALL: Context Modifier
             Refine/improve the context with the newly learned action and attributes for that state in the context
     CASE: Original predefined action and attributes are the same as the calculated action
         Do nothing

Note 1: An episode is the beginning-to-end of an agent's interaction with its environment; it is essentially a run of the agent's activities from start to finish in the simulator. Many episodes need to be run in the simulator when training the agent.
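The following compressed sketch (illustrative only; the original prototype was written in Oracle PL/SQL) shows how the outer episode loop of this algorithm might be driven in Python, reusing the hypothetical update_rx and best_action helpers sketched in Section 2.3; 'agent' and 'simulator' stand for interfaces that are not defined in the paper.

    import random

    def run_improvement(agent, simulator, max_episodes=500, epsilon=0.2, tol=1e-3):
        """Run practice episodes, explore/exploit actions, update Rx values,
        and stop once the change in total reward becomes negligible."""
        prev_total = None
        for _ in range(max_episodes):
            state = simulator.reset()
            context = agent.default_context()
            total_reward = 0.0
            while not simulator.mission_over():
                actions = agent.available_actions(context, state)
                if random.random() < epsilon:
                    action = random.choice(actions)                # explore
                else:
                    action = best_action(context, state, actions)  # exploit
                next_state, reward = simulator.step(action)
                update_rx(context, state, action, reward, next_state, actions)
                # Evaluate transition rules; stay in the same context if none fire.
                context = agent.transition(context, next_state) or context
                state = next_state
                total_reward += reward
            if prev_total is not None and abs(total_reward - prev_total) <= tol:
                break  # change in reward is (near) zero: stop improving
            prev_total = total_reward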
The high-level algorithm described above denotes our overall approach to improving actions and attributes within a context and improving context transitions. It also serves to create new contexts based on a predefined context template (enhancement).

2.5. The reward function

The reward function essentially directs the learning agent on what behavior to reinforce and which to discard. However, it does not tell the agent what is right or wrong. Without a well-designed reward function, the learning process might not be optimal and convergence might not occur. There is no universal definition of a reward function; reward functions are typically designed specifically for each problem being addressed. The reward function is generally defined by the SME, the knowledge engineer or the application system developer. Allowing the SME to set the reward might appear counterintuitive at first, because correcting the SME's knowledge is what our work is all about. However, on a closer look, having the SME define the reward function may serve to highlight problems with the model to him or her. Furthermore, the reward function contains neither the details of the actions nor the transition rules within the model. For example, assume an SME defines a maximum speed limit of 30 mph in a context and wants an agent to drive to a meeting 50 miles away. If this SME defines a reward function that rewards the agent for arriving at the meeting within an hour, the improvement/enhancement process identifies this as an error because the agent would never achieve its goal. Another attribute of the reward function that becomes evident in the improvement/enhancement process is the identification of the expert's implicit knowledge; likewise, any lack of SME knowledge in a given mission will be exposed. There are two rules that govern the design of reward functions:

(a) The reward function should not contain a reward or punishment for an action. In other words, the system should not reward or punish the agent for performing a particular action or group of actions. This is so because the agent cannot know the best action in any given state. If it did, the problem would be reduced to a supervised learning problem in which the agent would learn the right actions, which would have to be known a priori.

(b) The reward function should contain only definitions of states, i.e., the agent is rewarded for being in a given state. This state could be the goal state, or else a state leading to the goal state. The actions that lead to these states are unknown to the agent, and the agent is expected to learn them.

An example of a reward function in the model enhancement prototype is to reward the agent for arriving at a state where the maximum speed for the context being trained is
equal to the maximum speed of the road segment that the context represents in the world. The reward an agent receives is constrained by the mission goal, the immediate sub-goals of the agent, as well as the overall constraints placed on the mission. Placing a reward on a mission without taking the constraints of the mission into account will be counterproductive, as the rewards received by the agent will not be a true representation of the overall mission. For example, if a constraint exists that an agent driver must arrive at its destination with its car by a given time, and the agent arrives at the destination at the given time but without its car (e.g., by taxi), the agent should not receive a reward. In our research, the reward function is composed of sub-processes used for enhancing an active context, for the enhancement-process stopping criteria, and for generalization of actions, amongst others. We provide more details about the specific reward function used later in this article. Please refer to Aihe (2008, pp. 187–189) for detailed explanations of these processes.
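To make the two rules above concrete, here is a minimal, purely illustrative sketch of a state-based reward of the form R = f(G, sG, Co); the predicate names are hypothetical and the numeric reward values are placeholders, not the values used in the prototype.

    def mission_reward(state, goal, sub_goals, constraints):
        """Reward the agent only for the states it reaches, never for actions."""
        if any(not satisfied(state) for satisfied in constraints):
            return -1.0   # a mission constraint is violated (e.g., arrived by taxi)
        if goal(state):
            return 1.0    # the mission goal state has been reached
        if any(sub_goal(state) for sub_goal in sub_goals):
            return 0.1    # an intermediate sub-goal state has been reached
        return 0.0        # ordinary state: neither reward nor punishment

    # Hypothetical usage for the driving mission discussed in the text:
    goal = lambda s: s["at_destination"] and s["on_time"]
    sub_goals = [lambda s: s["reached_freeway"]]
    constraints = [lambda s: s["in_own_car"]]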
2.6. Stopping criteria for the improvement process

The criteria for ending the improvement process involve the agent receiving the maximum reward available and achieving the mission goal. The actual value of the maximum reward is not known to the agent; it has to determine this through its interaction with the environment. The improvement process is two-fold: the improvement of a given active context when that context is isolated, and the improvement of the entire agent. The former can lead to the agent performing better when it is run in the simulator with that specific context being active, whereas the latter would have the agent perform better overall. One criterion for stopping is that the change in reward received, ε, is zero or negligible after a given number of time steps in the simulator. Another stopping criterion is that the total reward received begins to decrease. In some cases, these two criteria can lead to a premature end of the improvement process. This problem is alleviated by introducing a function that compares the current calculated reward R with the generalized reward gR. Mathematically, the criteria for stopping the improvement process are:

(1) The change in reward ε is zero:

R_i - R_j = ε_i → ε_j
ε_i - ε_j → 0, for gR < R

(2) The total reward received, R_T, begins to decrease:

R_T → 0, for gR < R

2.7. Components of the improvement process

The process for improving the agents is a complex one. This section describes the various components of the process. These are shown in Fig. 1 below.

Fig. 1. Improvement process architecture.

2.7.1. Action-base (A)

As mentioned earlier, the action-base is a component of the environment that contains all available actions that can be taken by the agent in that environment. The actions in the action-base
are universal and not restricted to any specific contexts. An agent can explore the environment through these actions in any context. However, a particular action may not necessarily be appropriate for a particular context. Through experience, the agent would learn what actions are appropriate and which to avoid when in certain contexts, thus enabling the improvement process. Therefore, the availability of an action-base relaxes the CxBR principle that only a few things can be done while in any one context (but still only one at a time). This exclusion principle still holds true during normal operation, but not during the improvement phase, as the learning agent has to explore actions not considered by the original agent (nor presumably by the SME) when in a given context. Exploration only occurs when the agent is learning or when it perceives a change in its environment. The action-base receives an input of the state of the environment and outputs a specific action. As the agent learns the best action to take when in a state, this information is sent to the modifier module through path I, and to the global fact-base through path H. See Fig. 1. 2.7.2. Environmental states/contexts (B) The environmental states and contexts are found in area B of Fig. 1. The current state of the environment is processed and analyzed by the inference engine through path F. 2.7.3. Context library (C) The context library is a collection of all predefined contexts for the agent to use during the mission. This collection is also called the constellation of contexts. Only one context in the context library can be active (i.e., in control of the agent) at any given time. The contexts in this library take the states of the environment as inputs, thus the only allowed input to the library is a ‘state’ signal. The outputs of a context are the actions that the agent can execute from a given state, thus an output of the context library is the prescribed action for the agent. Another output of the context library is the active context with all its attributes. An active context is automatically copied to the context repository in preparation for its improvement, as mentioned above. The path L in Fig. 1 shows the active context being copied and sent over to the context repository (D). 2.7.4. Improved Context Repository (D) Refer to area D in Fig. 1. Copies of all modified contexts are stored in this repository. The function of this repository is to provide an efficient backup mechanism for the learning process. The repository enables the addition and retraction of new knowledge by keeping track of all changes made to a context and at what ‘‘state’’ these changes occurred. Fig. 2 specifically depicts the
context repository process where an active context is copied to the repository for possible modification and ultimate improvement.

Fig. 2. Context repository (copy of the active context from the context library).

This repository takes contexts as inputs. The outputs are the actions in the improved contexts. A call to the repository module immediately creates a copy of the active context. Control is automatically transferred from the active context to its copy created in the repository. Actions are explored in this context and the values of these actions in the states of the context are stored in the Rx table. The algorithm for the context repository module is as follows:

CASE: Activate new context
    Create copy (CC_x) of active context in the context repository
    Transfer control of agent to the newly created copy of context CC_x
    While in the current state s_i, do until context C_i becomes inactive:
    10:: Attempt action a_i
        Store the value of the resulting reward
        Update the Rx values according to the equation
            Rx(c_i, s_i, a_i) = R(c_i, s_i, a_i) + γ · max[Rx(c_i, s_j, a)]
        Arrive at new state
        GOTO 10:: (explore or exploit an action in the new resulting state)
    Stop
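As an illustration of the working-copy mechanism (again a hypothetical Python sketch rather than the PL/SQL prototype), the repository can be thought of as a table of deep copies of activated contexts plus the rewards observed for the state/action settings explored in each copy:

    import copy

    class ImprovedContextRepository:
        """Keeps a working copy of each context that becomes active, while the
        original remains untouched in the context library (illustrative only)."""

        def __init__(self):
            self.copies = {}     # context name -> working copy of the context
            self.observed = {}   # context name -> {(state, action): reward}

        def checkout(self, context):
            """Create the working copy the first time the context activates."""
            if context.name not in self.copies:
                self.copies[context.name] = copy.deepcopy(context)
                self.observed[context.name] = {}
            return self.copies[context.name]

        def record(self, context_name, state, action, reward):
            """Log the reward obtained for exploring (state, action) in the copy."""
            self.observed[context_name][(state, action)] = reward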
2.7.5. Context selection

When called, the context selector module chooses the appropriate predefined context for any given situation. Typically, in a CxBR-based agent, the contexts to which control can be passed (context activated) from any other given context are predefined within the context itself. This list is typically based on expert knowledge about the given situation and the characteristics of the situation that would necessitate a transition from the currently active context. Most times, this list is correct, but it can occasionally be incomplete or incorrect. For example, in the automobile driving domain, an UrbanDriving context used by an agent to drive an automobile in an urban environment would be considered incomplete if its list of compatible contexts were to exclude a Freeway context or a Ramp context. Therefore, a CxBR agent would act improperly when not having an appropriate context to which to transition, based on its list of compatible contexts. Our improvement process attempts to eliminate this by first searching through all predefined contexts to see whether the attributes of any context match the
current situation. If so, the context is selected as the new active context and the previously active context is sent to the modifier, as represented by path J in Fig. 1. The modifier then modifies and improves the context by adding the newly active context among the list of compatible contexts. Fig. 3 specifically depicts the context selection and modification process. The context selection module takes all predefined as well as improved contexts as input, and the output is one or more contexts that appropriately address the current situation. The way the context selector module works is almost akin to the competing context concept conceived by Gonzalez and Saeki (2001). In this approach, the contexts independently assess the current situation and provide an indication of their own applicability in successfully addressing the current situation. The context selector module makes its choice based on the calculated Rx value. 2.7.6. Context modification The context modifier module improves predefined contexts by modifying the attributes available to the agent (e.g., its action functions) in a given state in a context. After a context or group of contexts have been selected by the context selector module to match the current situation, the contexts are passed through the modifier module, where various actions defined within them are attempted, as are the actions defined in the action-base. After these actions are executed, the action that produces the maximum Rx value for that state is chosen as the most appropriate, and the context is modified to reflect this. In cases where only one context is selected by the context selector to appropriately address the current situation, the previously active context is sent to the modifier and its list of compatible contexts is modified to reflect the newly active context. Modification of the previously active context also occurs in cases where many contexts are chosen by the context selector and an appropriate context is chosen based on the highest Rx value. For a context to be modified, its predefined actions as well as actions from the action-base are performed at random, and the rewards from these actions are tabulated in a local memory base that is available only to the context modifier module. These rewards are back propagated from the goal of the agent. After these actions are executed and their Rx values known, a choice is made on the action and context that best address the current situation relative to the goal of the agent. From Fig. 1, the modifier takes as inputs the contexts through path J and the actions from the action-base through path I. The output is a list of modified contexts with their Rx values, and is depicted by K. Formally, the copy of a selected context going through the modifier is as follows: For a given state in context Ci, different actions are attempted.
CC_i ⇐ {(s_i, a_i), (s_i, a_j), ... (s_i, a_n) | (r_i, r_j, ..., r_n)}

The state/action combination that produces the maximum reward is chosen:

(s_i, a_i) ⟺ max(r_i)

The previously active context is then modified to include the new context as a compatible context.
Fig. 3. Context selection process (arrows represent contexts; the diagram shows the context selector, the modifier, and the resulting choice).

The algorithm for the modification of a context is as follows:
CASE: Change in situation
    Search through list of compatible contexts for a context that best addresses the current situation.
    CASE: No context addresses the current situation
        Search through the list of all contexts (predefined, improved and newly created).
        Identify the context that most nearly matches the current situation.**
    CASE: More than one context addresses the current situation
        Rank the contexts by their scores
        Modify each ranked context by performing the predefined actions and actions defined in the action-base randomly
        10:: Note the Rx value as calculated from
            Rx(c_i, s_i, a_i) = R(c_i, s_i, a_i) + γ · max[Rx(c_i, s_j, a)]
        The action that produces the highest Rx value for the given state-action combination is chosen.
        The context is modified to reflect this
        The previously active context is also modified to reflect the addition of the newly modified context among its list of compatible next contexts
        Stop
    CASE: Only one context addresses the current situation
        GOTO 10::
    CASE: No context addresses the current situation
        CALL context creator.

** Determination of a near-match is done by directly comparing the attributes of the current situation, as retrieved from the global fact base, to those of all contexts as defined by the action-rules, transition-rules, etc. that are needed for the context to be activated. A score is provided to each context quantifying its relationship to the current situation, and the context with the highest score is selected for modification. The score is obtained by calculating the total number of attributes that match the current situation as a function of the total number of attributes of the current situation (e.g., if 10 attributes are listed by the global fact base for the current situation, a context that satisfies eight of those attributes is said to have a score of 80%).
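A minimal sketch of that scoring rule, assuming (hypothetically) that both the current situation and each context expose their attributes as simple key/value pairs and that an attribute matches when the values are equal:

    def context_match_score(context_attributes, situation_attributes):
        """Matched attributes divided by the attributes of the current situation."""
        if not situation_attributes:
            return 0.0
        matched = sum(1 for key, value in situation_attributes.items()
                      if context_attributes.get(key) == value)
        return matched / len(situation_attributes)

    def select_context_for_modification(contexts, situation_attributes, threshold=0.40):
        """Return the best-scoring context, or None if even the best one falls
        below the 40% threshold, in which case the context creator is called."""
        best = None
        best_score = 0.0
        for context in contexts:
            score = context_match_score(context.attributes, situation_attributes)
            if score > best_score:
                best, best_score = context, score
        return best if best_score >= threshold else None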
2.7.7. Context creator

The context creator module creates a new context when requested. This is the enhancement process. When the improvement of a context yields suboptimal values, that is, Rx values that are too low for the given state-action such that the Rx values of previously calculated context-state-action triples are reduced (for example, a context-state-action that previously produced the maximum Rx value, and thus maximum rewards, produces a lower Rx value after being refined), the insufficiently improved context is reinstated to its original form and a new context is created from a predefined context template. This template contains sections for action-functions, transition-rules, compatible contexts, and other elements. As soon as a decision is made to abandon all pending changes to improve an existing context, a call is made to the context creator module. This decision is based on the Rx values being received for the improved context, or is made when the system has determined that the number of attributes of existing contexts that match the current situation is low (less than 40%). Upon calling the context creator module, a copy of the context template is made. The attributes of the current situation are taken from the global fact base and inserted into the context template copy. From the
attributes, the title (name) of the newly created context is generated. The newly created context is then sent to the context repository, where different actions from the action-base are attempted and their Rx values noted. The actions and transition rules that produce the maximum Rx values for each state in the context are noted. The constraints of the context are part of the attributes. The algorithm for the creation of a new context is as follows:

Upon a change in situation
    Upon searching through all existing contexts and finding no matching context, or upon finding a context with a low score (less than 40%, where a score is defined as the total number of attributes of a context that match the current situation divided by the total number of attributes of the current situation)
        Call the context creator module.
        While the simulation is paused temporarily
            Make a copy of the context template and place it in the context repository
            Get all the attributes of the current situation from the global fact base
            Filter these attributes according to various parameters, e.g., location, constraints, type of road, etc. (based on the system being modeled)
        Continue simulation by making calls to the actions in the action-base
        As each action is executed, note the Rx value obtained for that state-action combination
        The action that produces the highest Rx value for each state-action combination is chosen as the appropriate action
        Update the context with the newly gathered information about its action-rules and transition-rules
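Purely as an illustration of that template-instantiation step (using the hypothetical MajorContext structure sketched earlier in Section 2, not the paper's actual implementation):

    import copy

    def create_context_from_template(template, situation_attributes, relevant_keys):
        """Copy the template, insert the filtered attributes of the current
        situation, and derive a name from them; action and transition rules
        are left empty, to be learned by trying actions from the action-base."""
        new_context = copy.deepcopy(template)
        filtered = {key: value for key, value in situation_attributes.items()
                    if key in relevant_keys}
        new_context.attributes.update(filtered)
        new_context.name = "_".join(str(value).upper() for value in filtered.values()) or "NEW_CONTEXT"
        new_context.action_rules = []       # filled in through reinforcement learning
        new_context.transition_rules = []
        return new_context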
Fig. 4 depicts the complete agent improvement and enhancement flowchart.

3. Prototype implementation of the agent improvement/enhancement process

A prototype of the agent improvement and enhancement processes described above was built to evaluate the effectiveness of these processes. The prototype was developed using Oracle's PL/SQL programming language and an Oracle database. The prototype operates in three phases. Phase I involved manually building a CxBR agent that drives a simulated automobile to perform the tasks intended. This agent was built by a knowledge engineer (the first author) in the traditional manner. Because the knowledge engineer knows how to drive an automobile (as do most adults in the US), the agent is deemed to be derived from human expertise. After an informal and somewhat superficial validation of the agent by the knowledge engineer, it was deemed competent by definition. We refer to this agent as the base agent. Phase II applied our RL-based learning approach to the base agent to improve it and enhance it in the manner discussed above. This phase included seeding the base agent with errors and incomplete knowledge to determine whether our approach would correct the errors/gaps. The expected result is an improved/enhanced agent that overcomes the errors and the gaps. Phase III executes the resulting improved agent and compares its performance to that of the base agent. The base agent of Phase I was built expressly for this research. The context topology of the base agent is shown in Fig. 5 below. Although it is included in Fig. 5, the DIRT-DRIVING context was omitted from the base agent to determine whether the improvement process can create entirely new contexts after learning from the interactions of the base agent with the environment. Thus, the
DIRT-DRIVING context is labeled in underlined italic font to indicate its initial absence.

3.1. Generic reward function for the prototype

The reward presented below is generic and applies to mission goals that involve identifying the shortest time. At the end of each simulation, note the total time traveled from the beginning of the simulation to the end of the simulation. If the time is greater than the time for the previous run, and the routes for the previous run and the current run are different, punish the agent (a negative reward). If the routes for the previous run and current run are the same, do not punish or reward the agent (a reward of 0). If the time is less than that of the previous run, and the routes for the two simulation runs are different, reward the agent. If the routes are the same, give the agent a reward of 0.

3.2. Specific reward function for incorrect speed limit

The pseudo code for the reward function used in training the agent to learn the speed limit is as follows. Compare the speed limit of the context being improved with the speed limit defined in the environment for the road segment in question. If the context maximum speed is greater than the speed limit defined for the road segment in the environment, then assign a reward of -20 (punishment). If the speed limit in the context is equal to the speed limit defined for the road segment in the environment, then assign a reward of +50. If the speed limit in the context is less than the speed limit defined for the road segment minus 5, then assign a reward of -10. If the context maximum speed is less than or equal to the previous maximum speed learned for that training context, then assign a reward of -1. If the context maximum speed is greater than the previously learned maximum speed for the context, then assign a reward of +1. Insert what has just been learned into the reward table. The reward table stores all information about the rewards received by the agent and the context that caused the reward, along with the maximum speed of the context and the run time of the simulation cycle. Details about the prototype that would permit a reader to reproduce it are too extensive to be described in this article. Such a full description of the prototype can be found in Aihe (2008).

4. Experiments and results

As mentioned earlier in this article, the kinds of errors an agent can contain in a tactical mission are grouped into three classes:

(1) Incorrect values – e.g., an incorrect speed limit for a stretch of road.
(2) Incorrect processes or procedures in a tactical situation – e.g., not knowing how to stop at stop signs or red lights.
(3) Incomplete procedures necessary to achieve the mission goal – e.g., not knowing how to handle a dirt road, resulting in not knowing what to do upon encountering one.

Three experiments were performed, each addressing one of these classes of errors. The goal of the first experiment was to evaluate
how well the RL improvement method resolves situations when incorrect information exists in the agent's knowledge. In this experiment, the goal was to arrive at a destination as quickly as possible, in the presence of an incorrect speed limit for a road segment, erroneously provided by an SME. The reader might rightfully note that this is a trivial problem easily solved through an optimization search. However, finding the tactically optimal route while working with incomplete or incorrect knowledge – and not knowing it – can lead to mission failure. Our work recognizes when there is incorrect knowledge, corrects it and/or discovers the missing knowledge. The goal of the second experiment was to evaluate how well an improved agent corrects situations when incorrect procedures are provided. The base agent does not know to stop at stop signs or red
traffic lights. Such behavior could result in penalties such as traffic tickets, or in accidents. The goal of the third experiment was to evaluate how well an (enhanced) agent can learn otherwise absent procedures that are necessary to achieve a mission goal. In our experiment, the base agent completely lacks knowledge about how to handle a dirt road. Before these experiments were performed, however, the base agent was subjected to the process of improvement through RL. We call this Experiment 0.
Fig. 5. Context topology for hand-built agent.
4.1. Evaluation infrastructure and criteria

Two criteria were used in evaluating the improved agent: (1) the performance of the agent in known and unknown situations, and (2) the quality and reliability of the agent's behavior. These are discussed below.

4.1.1. Performance of the agent in known and unknown situations

The experiments measure whether the agent achieves its goal, as well as the time it took the agent to achieve it in the environments provided. Performance was measured as the elapsed time from start to finish of a mission. The actions taken by the agent determine whether the mission goals were achieved or not. A comparison was made between the two agents' performances, i.e., the elapsed time for the base agent vs. the elapsed time for the improved agent. We also recorded whether the mission goal was achieved.
Table 1
Training Route I-A details.

Road segment ID              1         2       3         4         5
Road segment label           FREEWAY   CITY    FREEWAY   FREEWAY   CITY
Road segment length (miles)  1.5       0.6     1.2       1.5       1.0
ANGLE                        5         85      45        26        2
TRAFFIC                      0         0       0         0         0
INTERSECTION                 0         0       0         0         0
MAXSPEED (mph)               75        50      75        75        50
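For illustration, one way to encode Training Route I-A from Table 1 is as a list of simple records; the field names below are ours and do not necessarily match the prototype's internal representation.

    from dataclasses import dataclass

    @dataclass
    class RoadSegment:
        segment_id: int
        label: str            # e.g., FREEWAY, CITY, DIRT
        length_miles: float
        angle: int
        traffic: int
        intersection: int
        max_speed_mph: int

    # Training Route I-A, as listed in Table 1.
    route_I_A = [
        RoadSegment(1, "FREEWAY", 1.5, 5, 0, 0, 75),
        RoadSegment(2, "CITY", 0.6, 85, 0, 0, 50),
        RoadSegment(3, "FREEWAY", 1.2, 45, 0, 0, 75),
        RoadSegment(4, "FREEWAY", 1.5, 26, 0, 0, 75),
        RoadSegment(5, "CITY", 1.0, 2, 0, 0, 50),
    ]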
4.1.2. Quality & reliability of agents’ behavior We also evaluated the quality and reliability of the agents’ behavior. The quality was measured by noting whether the correct behavior was exhibited at every state of the agents’ environment. For example, did the agent come to a complete stop at a stop sign? Reliability of the agent’s behavior is defined in terms of the change in the agent’s exhibited behavior at a given state during the execution of the agent in the simulation. For example, did the agent change its behavior at an intersection on a different simulation run for the same mission after learning (training) was complete? 4.1.3. Test environments Two test environments were used in our experiments. Environment I was used strictly for improving the base agent and contained three short routes (I-A, I-B and I-C). Each route was composed of several road segments labeled under the ‘‘Road Segment label’’ row heading of Table 1. The reasons for using a different set of routes to improve the agent are twofold: (1) to evaluate whether the improved agent could generalize its actions and behavior in similar but different situations and (2) the shorter routes of Environment I enabled faster training. Training Route I-A is described in Table 1 and displayed pictorially in Fig. 6. A similar description of Training Routes I-B and I-C can be found in Aihe (2008). Each straight segment in Fig. 6 represents one of the five road segments: FREEWAY, CITY, FREEWAY, FREEWAY, and CITY. Routes I-A, I-B and I-C were used to improve the agent to correct incorrect information (max speed). Route I-C was also used to teach the agent missing procedures (traffic light and stop sign) as well as to teach the agent how to build the new context for driving on a dirt road (DIRT). Environment II was used to evaluate the improvement technique by comparing the performance of the improved agent to that of the base agent. It also contains three routes. Route II-1 was used for evaluating how the agent learned correct information; II-2 for
Fig. 6. Pictorial description of training Route I-A.
Route II-1 was used for evaluating how the agent learned correct information, II-2 for learned procedures, and II-3 for new contexts. In all environments, the same general environmental conditions existed on each of the routes; the differences lay in the defined maximum speed limits on the road segments (the defined maximum speed limit for a given road type is the same in all environments; the differences are between road types) and in the arrangement of the road segments in each route. Fig. 7 shows the road segments for Environment II.

4.2. Experiment 0 – improving the base agent

Route I-A was used in training the agent to learn the appropriate maximum speed attributes for FREEWAY and CITY driving. Route I-B was used in training the agent to learn the appropriate maximum speed attributes for PARKING_LOT and RAMP
driving. Route I-C was used to train the agent to learn the appropriate maximum speed attributes for INTERSECTION and TRAFFIC_LIGHT driving. Route I-C was also used for learning the unknown procedures (traffic lights and stop signs) and for creating a new context, with appropriate actions and attributes, for DIRT driving.

Training the agent to learn the maximum speed attribute in the various contexts commences as the simulation begins, using the training algorithm described in Section 2. Fig. 8 shows the different maximum speed values used by the agent across the training simulation cycles and how the learning of the maximum speed attribute progressed through those cycles. The agent begins the simulation with the maximum speed attribute for the CITY driving context set to 35 mph, as (presumably) defined by the SME. After 91 simulation runs (labeled as cycles in Fig. 8), the agent learns the appropriate maximum speed attribute for city driving to be 50 mph. The fluctuation in maximum speed values across simulation runs results from the maximum speed value being chosen at random under the learning strategy described in Section 2. The maximum speed attribute eventually converges to the correct maximum speed value set in the environment; convergence occurs when there is no change in the maximum speed limit between simulation runs. Similar results were obtained for learning the correct speed limit on the other road segments, as well as for learning how to handle stop signs, traffic lights and a dirt road. The full set of results can be seen in Aihe (2008). The result of this experiment was an improved and enhanced agent, which we then compare to the original base agent in the experiments that follow.

Fig. 7. Pictorial representations of Routes II-1 through II-4.
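The convergence behavior shown in Fig. 8 can be illustrated with a toy loop of the following kind. This is not the prototype's training algorithm (that is described in Section 2 and in Aihe, 2008); the candidate speeds, the number of cycles and the keep-the-best update rule are our own simplifications, and the reward is a condensed form of the Section 3.2 sketch.

    import random

    def speed_reward(speed, env_limit, previous):
        # Condensed form of the Section 3.2 speed-limit reward.
        if speed > env_limit:
            r = -20
        elif speed == env_limit:
            r = 50
        elif speed < env_limit - 5:
            r = -10
        else:
            r = 0
        return r + (1 if speed > previous else -1)

    ENV_SPEED_LIMIT = 50                          # mph, value set in the environment
    max_speed, best_r = 35, float("-inf")         # 35 mph: flawed value from the SME
    for cycle in range(200):                      # simulation runs ("cycles")
        trial = random.choice(range(25, 80, 5))   # random exploration of speeds
        r = speed_reward(trial, ENV_SPEED_LIMIT, max_speed)
        if r > best_r:                            # keep the best-rewarded value so far
            best_r, max_speed = r, trial
    print(max_speed)                              # almost always converges to 50 mph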
4.3. Experiment #1 – evaluating learning of correct information

The objective of this experiment is to evaluate the ability of the agent to learn the correct value of the speed limit. The performance of the improved agent is compared to that of the base agent in terms of the elapsed time to arrive at their destinations while traversing the same routes; the shorter the elapsed time, the better the agent's performance. Of course, if driving too fast, the agent will be stopped by a police agent (part of the environment), and the fine, plus the delay incurred by the traffic stop, will result in a failed mission and the attendant punishment. Each agent traversed the three routes separately five times, encountering different road segments in each route. The time it took each agent to arrive at the destination in each simulation run, before and after the agent's improvement, is presented in Table 2. The fluctuations in elapsed time on each route result from variations in the events in the environment, e.g., the traffic light state (red, green, yellow) being different on each run. The results support the hypothesis that the improved agent outperforms the base agent because it found the correct speed limit, which was higher. We show this quantitatively as follows:
d = x − y

where x is the elapsed time for the improved agent and y is the elapsed time for the original (base) agent. The null hypothesis is that, at a minimum, the base agent will perform as well as the improved/enhanced agent.
H0: μd ≥ 0
Ha: μd < 0

where μd is the mean of the differences in time to complete the drive between the improved agent and the base agent, H0 is the null hypothesis, and Ha is the alternative hypothesis. The t-test on these differences (details in Aihe, 2008, pp. 224–225) returns a p-value of essentially 0, indicating a negligible probability that the null hypothesis is rejected in error.
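For reference, a test of this kind can be reproduced from the elapsed times in Table 2, for example with SciPy (version 1.6 or later for the alternative argument). Pairing the runs as below is our assumption; the exact test setup used in Aihe (2008) may differ.

    from scipy.stats import ttest_rel

    base     = [1197.4, 1197.9, 1198.4, 1198.1, 1197.3,
                1040.2, 1036.3, 1033.3, 1033.3, 1034.4,
                1270.1, 1273.1, 1006.9, 1011.7, 1006.5]
    improved = [940.2, 938.8, 938.8, 939.1, 938.3,
                790.5, 788.8, 792.2, 792.9, 788.6,
                762.6, 766.5, 767.1, 765.4, 763.4]

    # H0: mean(improved - base) >= 0   vs.   Ha: mean(improved - base) < 0
    t_stat, p_value = ttest_rel(improved, base, alternative="less")
    print(t_stat, p_value)    # p-value is effectively 0, so H0 is rejected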
Fig. 8. Training process for maximum speed attribute for city driving.
Table 2
Elapsed time of the base CxBR agent and the improved CxBR agent.

Route ID    ELAPSED_TIME – base agent    ELAPSED_TIME – improved agent
II-1        1197.4                       940.2
II-1        1197.9                       938.8
II-1        1198.4                       938.8
II-1        1198.1                       939.1
II-1        1197.3                       938.3
II-2        1040.2                       790.5
II-2        1036.3                       788.8
II-2        1033.3                       792.2
II-2        1033.3                       792.9
II-2        1034.4                       788.6
II-3        1270.1                       762.6
II-3        1273.1                       766.5
II-3        1006.9                       767.1
II-3        1011.7                       765.4
II-3        1006.5                       763.4
This indicates that the improved agent was successful in detecting an incorrect speed limit and learned the correct speed limit to be 50 mph, rather than the original 35 mph.

An argument can be made that the results and conclusion of this experiment are valid only because the improved agent learned a maximum speed for each route that was higher than what the SME provided. As such, the improved agent was able to move faster and thus achieve a shorter elapsed time on each route. The question is what the impact on the agent's performance would be if the SME had instead provided an erroneous speed limit that was too high rather than too low. For example, what if the SME had provided 50 mph and the agent learned that the actual speed limit was 35 mph? We counter-argue that the base agent driving at 50 mph would have been stopped by the "virtual police", creating delays that would have led to a longer elapsed time for mission completion. In real life, the police do not always catch a speeder, but given repeated violations of the speed limit, one can expect that the corrective action would come via a traffic ticket at some point.

4.4. Experiment #2 – evaluating learning of a correct procedure

This experiment involves an incorrect process for a given task. The process left out was that of stopping at a red traffic light; likewise missing was the process of decelerating when the traffic light is yellow and the agent is approaching it. The base and improved agents traversed three routes separately, five times each, as in Experiment #1. The agents encountered different road segments in each route, as described earlier. The driving speeds of both agents at intersections and at traffic lights are presented in Table 3. This table shows a time-based pattern of the improved and base agents' speeds when approaching one specific traffic light. It can be seen that the improved agent starts to reduce its speed when the traffic light changes from green to yellow and subsequently to red. On the other hand, the base agent's speed remained constant as it ran the red light; it was utilizing the speed defined for the major context, CITY driving, even though it was being controlled by the TRAFFIC_LIGHT context. Fig. 9 depicts the speeds of both agents at a traffic light when the light turns yellow at t = 2 s. Notice that the base agent still considers the speed limit to be 35 mph, while the improved agent has already learned it to be 50 mph. Table 3 depicts the agents' behaviors at all traffic lights for each of their five runs. Note that there are no traffic lights on Route II-1 and that there are two traffic lights on Route II-2 (II-2A and II-2B). The behavior of each agent at a traffic light in each simulation run can be seen in Table 3 from the speed at which it passes the light, given the light's color.
Table 3
Agents' speed at traffic lights.

Route ID    Traffic light color    Speed (mph) at light – base agent    Speed (mph) at light – improved agent
II-2A       YELLOW                 34.00                                29.00
II-2A       GREEN                  34.00                                49.00
II-2A       RED                    34.00                                0
II-2A       RED                    34.00                                0
II-2A       RED                    34.00                                0
II-2B       GREEN                  34.00                                49.00
II-2B       GREEN                  34.00                                49.00
II-2B       RED                    34.00                                0
II-2B       RED                    34.00                                0
II-2B       RED                    34.00                                0
II-3        YELLOW                 34.00                                4.00
II-3        RED                    34.00                                0
II-3        RED                    34.00                                0
II-3        RED                    34.00                                0
II-3        GREEN                  34.00                                49.00
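The improved agent's behavior in Table 3 can be summarized by a simple decision rule of the following form. This is an illustration of the learned behavior, not the prototype's TRAFFIC_LIGHT context; the cruise and approach speeds are taken from the table.

    def traffic_light_speed(light_color, cruise_speed_mph=49, approach_speed_mph=29):
        # Speeds roughly matching the improved agent's rows in Table 3.
        if light_color == "RED":
            return 0                      # come to a complete stop
        if light_color == "YELLOW":
            return approach_speed_mph     # decelerate while approaching the light
        return cruise_speed_mph           # GREEN: proceed at the learned limit

    for color in ("GREEN", "YELLOW", "RED"):
        print(color, traffic_light_speed(color), "mph")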
Fig. 9. Improved CxBR agent vs base CxBR agent speed at a traffic light.
Recalling the hypothesis and steps utilized in analyzing the performance from Experiment #1, it is hypothesized here that the improved agent will behave better than the base agent. The measurement of behavior is restricted to the speed both agents exhibit at traffic lights.
d = x − y

where x is the speed of the improved agent and y is the speed of the original (base) agent. The null hypothesis is that the base agent will, at worst, behave as well as the improved agent.
H0: μd ≥ 0
Ha: μd < 0

where μd is the mean of the speed differences between the improved CxBR agent and the base agent, H0 is the null hypothesis, and Ha is the alternative hypothesis. The t-test (details in Aihe, 2008, pp. 229–231) returns a p-value of essentially 0; thus, there is a negligible probability that the null hypothesis is rejected in error. This indicates that the agent correctly learned to stop at a traffic light.

4.5. Experiment #3 – evaluating the process of learning a new context

The objective of this experiment was to test the agent's performance when an entire context is missing. We assume that the SME omitted describing how to navigate a specific type of road (e.g., a dirt road). Thus, the agent encounters an unfamiliar type of road and is forced to figure out how to travel through it. We assess how well the enhanced agent learned this new context. The performance of the enhanced agent is measured in terms of achieving the mission goal and the elapsed time to arrive at its destination. The base agent and the enhanced agent moved from the starting point of each of three routes to their respective ends; the objective is for the agent to merely arrive at the final destination when a road segment in one of the routes is not defined a priori. Table 4 shows the elapsed time to arrive at the destination when an unknown road segment is introduced. As can be seen from Table 4, the base agent was unsuccessful in its mission: it encountered an unknown and undefined situation for which no context existed, an exception was raised, and the base agent remained in the same position (road segment) indefinitely. The enhanced agent, on the other hand, figured out how to proceed and accomplished the mission goal.
Table 4
Elapsed time to destination with the introduction of an unknown road segment.

Route ID    Elapsed time – base agent    Elapsed time (s) – improved agent
II-4        Unmatched context            644.88
II-4        Unmatched context            645.46
II-4        Unmatched context            644.75
II-4        Unmatched context            644.29
II-4        Unmatched context            644.77
In the training phase, the agent learned how to handle the dirt road segment, and the resulting enhanced agent was successful in its mission.
5. Summary, conclusions and future research

Our approach succeeded in improving already competent tactical agents built by hand with SME expertise, compensating for errors of commission or omission. This was done by allowing the agent to practice missions in a simulated environment and using reinforcement learning to improve its performance. Our work developed a means to break the complete dependence on an SME to correct a tactical knowledge base. In this manner, the significant advantages of human-derived knowledge are maintained while its disadvantages are negated.

5.1. Conclusions

We have shown the technical feasibility of improving tactical agents through reinforcement learning. Our research focused on tactical agents, rather than on classification systems as did the early work on theory revision and belief revision. Tactical knowledge can be more complex than classification knowledge, as the agent needs to "know" how to behave properly in several different situations, and to generalize that behavior to related but dissimilar situations. Because of this need to be knowledgeable in an arbitrary number of situations, we based our agents on CxBR, a technique specifically designed to represent tactical knowledge. Our results show that the contexts that form the basis of CxBR can be modified to make corrections, and that new contexts can be created automatically when the existing contexts do not apply to a current situation. CxBR was thus found to provide a foundation and structure that facilitated modifications and extensions. Nevertheless, no claims are made here concerning the feasibility of doing the same thing in other modeling paradigms.

The impact of our work is on the development of future intelligent systems, especially tactical agents used in training or analysis simulations, or as the controlling element of a real-world platform on a mission. An automated means to correct, complete and maintain agents can be created and used, on a continual basis if desired. This would ensure that errors that unavoidably creep in during the development phase can be extracted and incomplete behaviors can be rounded out. Most importantly, an agent's knowledge base can be kept up to date over time, as long as the simulation used remains faithful to the real-world domain it represents, or the physical world itself is used as the "practice field".

In comparison with Stein and Gonzalez's (2011) approach using NEAT and PSO, the approach presented here openly and explicitly reflects expert knowledge, in contrast to the opaque knowledge contained in their populations of neural networks. The agents of interest here were built manually, with direct human interaction, through the knowledge engineer's own expertise in driving automobiles. Stein and Gonzalez's work built the original agents through observation. While learning from observation also derives knowledge from human experts, it does so indirectly, as the system must interpret the actions and behavior of the human actor in a simulation. The results also indicated that our approach expended considerably less computing time than that of Stein and Gonzalez. Unfortunately, we failed to collect exact data on the computing time spent in the agent improvement/enhancement process. Nevertheless, improving and enhancing our agents via the approach presented here took only a few seconds on a standard desktop computer. When that is compared to the computation time of several weeks reported by Stein and Gonzalez (2011) on a high-performance cluster computer, a formal quantitative comparison seems superfluous. However, we should note that this comparison does not include the extensive development time required to manually build our base agent, which was done automatically (from observation) by Stein and Gonzalez.
5.2. Future research

As with most research projects, ours is not a complete and exhaustive treatment of the problem introduced. Several areas of future research emerged during the work. First, the obvious: while our tests were extensive and rigorous, the infrastructure described here should be tested even more extensively than we have done, seeding several other types of errors and omitting other chunks of knowledge to further test its robustness. In fact, it should be tested to failure to discern its breaking point. This is something we hope to do in the future. Secondly, the applicability of this approach to agents other than tactical agents should be explored. These types of agents (e.g., classification, decision support) may not have the luxury of accurate simulations on which to practice, and may need direct SME involvement to provide a solution that can serve as the basis for reward or punishment in the reinforcement learning approach. Thirdly, its applicability to agents built with other representational paradigms should be explored. Fourth, the approach should also be implemented in multi-agent systems, where interaction with heterogeneous agents having different goals may hide errors; alternatively, affixing the blame for flaws on the appropriate agent may also be a challenge that needs to be addressed. Fifth, and most interestingly, the concept of combining this improvement process with a formal validation process occurred to us during the execution of this research. It seems a very intriguing proposition to be able to repair an agent's flaws while in the process of formally validating its performance.
References Aihe, D. O. I. (2008). A reinforcement learning technique for enhancing human behavior models in a context-based architecture (Doctoral dissertation). School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL, December 2008. Anderson, J. R., Matessa, M., & Lebiere, C. (1997). ACT-R: A theory of higher-level cognition and its relation to visual attention. Human Computer Interaction, 12(4), 439–462. Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robot Autonomous Systems, 57, 469–483. Barto, A. G. (2003). Reinforcement learning. In M. A. Arbib (Ed.), Handbook of brain theory and neural networks (2nd ed., pp. 963–968). Cambridge, MA: MIT Press. Boose, J. H. (1984). Personal construct theory and the transfer of human expertise. In Proceedings of the national conference on artificial intelligence (AAAI-84) (pp. 27–33). Brezillon, P. (2003) Context-based modeling of operators’ practices by contextual graphs. In Proceedings of the 14th mini Euro conference, Luxembourg. Brooks, R. A. (1990). Elephants don’t play chess. Robotics and Autonomous Systems, 6(1), 3–15. Brown, J. B. (1994). Application and evaluation of the context-based reasoning paradigm (Master’s thesis). Dept. of Electrical and Computer Engineering, University of Central Florida, Orlando, FL, July 1994. Brunk, C., & Pazzani, M. (2014). A lexically based semantic bias for theory revision. In Proc. 12th international conference on machine learning (pp. 81–89).
Cain, T. (2014). The DUCTOR: A theory revision system for propositional domains. In Proc of the eighth international workshop on machine learning (pp. 485–489). Chernova, S. & Veloso, M. (2007). Confidence-based policy learning from demonstration using Gaussian mixture models. In Proceedings of the sixth international joint conference on autonomous agents and multi-agent systems (AAMAS’07). Craw, S. (1991). Automating the refinement of knowledge based systems (Doctoral dissertation, University of Aberdeen (United Kingdom)). Dejong, G., & Mooney, R. (1986). Explanation-based learning: An alternative view. Machine Learning, 1(2), 145–176. Delugach, H. S. & Skipper, D. J. (2000). Knowledge techniques for advanced conceptual modeling. In Proceedings of the ninth conference on computer generated forces and behavior representation, Orlando, FL. Feigenbaum, E. A. (1979). Themes and case studies of knowledge engineering. In D. Michie (Ed.), Expert systems in the micro-electronic age (pp. 3–25). Edinburgh, Scotland: Edinburgh University Press. Fernlund, H., Gonzalez, A. J., Georgiopoulos, M., & DeMara, R. F. (2006). Learning tactical human behavior through observation of human performance. IEEE Transactions on Systems, Man and Cybernetics – Part B, 36(1), 128–140. Floyd, M. W., Esfandiari, B. & Lam, K. (2008). A case-based reasoning approach to imitating robocup players. In Proceedings of the 21st international Florida artificial intelligence research society (FLAIRS) (pp. 251–256). Ford, K., Canas, A., Jones, J., Stahl, H., Novak, J., & Adams-Webber, J. (1991). ICONKAT: An integrated constructivist knowledge acquisition tool. Knowledge Acquisition, 3(2), 215–236. Friedrich, H., Kaiser, M., & Dillman, R. (1996). What can robots learn from humans? Annual Reviews in Control, 20, 167–172. Gonzalez, A. J., & Ahlers, R. (1998). Context-based representation of intelligent behaviour in training simulations. Transactions of the Society for Computer Simulation International, 15(4), 153–166. Gonzalez, A. J. & Ahlers, R. (1994). A novel paradigm for representing tactical knowledge in intelligent simulated opponents. In Proceedings of the seventh international conference on industrial and engineering applications of artificial intelligence and expert systems, Austin, TX (pp. 515–523). Gonzalez, A. J. & Saeki, S. (2001). Using context competition to model tactical human behavior in a simulation. In Proceedings of the CONTEXT-2001 conference (pp. 453–456). Gonzalez, A. J., Castro, J., & Gerber, W. E. (2006). Automating the acquisition of tactical knowledge for military missions. Journal of Defense Modeling and Simulation, 3(1), 145–160. Gonzalez, A. J., Georgiopoulos, M., DeMara, R. F., Henninger, A. E., & Gerber, W. (1998). Automating the CGF model development and refinement process by observing expert behavior in a simulation. In Proceedings of the seventh conference on computer generated forces and behavior representation, Orlando, FL. Gonzalez, F.G., Grejs, P. & Gonzalez, A.J. (2000). Autonomous automobile behavior through context-based reasoning. In Proceedings of the 12th international Florida artificial intelligence research society conference, Orlando, FL (pp. 2–6). Gonzalez, A. J., Stensrud, B. S., & Barrett, G. (2008). Formalizing context-based reasoning – a modeling paradigm for representing tactical human behavior. International Journal of Intelligent Systems, 23(7), 822–847. Guerin, F. (2011).
Learning like a baby: A survey of artificial intelligence approaches. The Knowledge Engineering Review, 26(2), 209–236. Henninger, A. E. (2000) Neural network based movement models to improve the predictive utility of entity state synchronization methods for distributed simulations (Doctoral dissertation). University of Central Florida, Orlando, FL, 2000. Huffman, S. B., Pearson, D. J., & Laird, J. E. (1993). Correcting imperfect domain theories: A knowledge-level analysis. US: Springer, pp. 209–244. Isaac, A. & Sammut, C. (2003). Goal-directed learning to fly. In Proceedings of the twentieth international conference on machine learning (ICML-2003), Washington DC. Johnson, C. L., & Gonzalez, A. J. (2014). Learning collaborative behavior by observation. Expert Systems with Applications., 41, 2316–2328. Kahn, G., Nolan, S. & McDermott, J. (1985). MORE: An intelligent knowledge acquisition tool. In Proceedings of the 1985 international joint conference on artificial intelligence (IJCAI-85), Los Angeles, CA. Kennedy, J. & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the IEEE international conference on neural networks (Vol. 4). Konik, T., & Laird, J. E. (2006). Learning goal hierarchies from structured observations and expert annotations. Machine Learning, 64(1–3), 263–287. Laird, J. (1988). Recovery from incorrect knowledge in soar. In Proceedings of the national conference on artificial intelligence (AAAI-88). Laird, J., Hucka, M., Yager, E., & Tuck, C. (1990). Correcting and extending domain knowledge using outside guidance. In Proceedings of the Seventh International Conference on Machine Learning (pp. 235–243). Laird, J. E., Pearson, D. J. & Huffman, S. B. (1996). Knowledge-directed adaptation in multi-level agents. AAAI Technical Report WS-96-04. Laird, J. E., Newell, A., & Rosenbloom, P. S. (1987). Soar: An architecture for general intelligence. Artificial Intelligence, 33(1), 1–64. Latorella, K. & Chamberlain, J. (2002). Tactical vs. strategic behavior: General aviation piloting in convective weather scenarios. In Proceedings of the human factors & ergonomics annual meeting, Baltimore, MD.
Marcus, S., McDermott, J. & Wang, T. (1985). Knowledge acquisition for constructive systems. In Proceedings of the 1985 international joint conference on artificial intelligence (IJCAI-85), Los Angeles, CA. Moriarty, L. & Gonzalez, A. J. (2009). Learning human behavior from observation for gaming applications. In Proceedings of the 2009 FLAIRS conference. Murphy, P. M., & Pazzani, M. J. (1994). Revision of production system rule-bases. Proceedings of the International Conference on Machine Learning, 199–207. Ontañón, S., Bonnette, K., Mahindrakar, P., Gómez-Martín, M., Long, K., Radhakrishnan, J., Shah, R., & Ram, A. (2009). Learning from human demonstrations for real-time case-based planning. In The IJCAI-09 workshop on learning structural knowledge from observations. Ontañón, S., Montaña, J. L., & Gonzalez, A. J. (2014). A dynamic-Bayesian network framework for modeling and evaluating learning from observation. Expert Systems with Applications, 41(11), 5212–5226. Ourston, D., & Mooney, R. J. (1990). Changing the rules: A comprehensive approach to theory refinement. Proceedings of the National Conference on Artificial Intelligence, 815–820. Oxford Dictionary, www.dictionary.com, 2008. Parker, J., Gonzalez, A. J., & Hollister, D. L. (2013). Contextual reasoning in human cognition and the implications for artificial intelligence systems. In CONTEXT 2013 conference, Annecy, France. Parsaye, K. (1988). Acquiring and verifying knowledge automatically. AI Expert, 48–63. Pazzani, M. J. (1988). Integrated learning with incorrect and incomplete theories. Proceedings of the International Machine Learning Conference, 291–297. Pazzani, M. (1991). Learning to predict and explain: An integration of similaritybased, theory driven, and explanation-based learning. Journal of the Learning Sciences, 1(2), 153–199. Pearson, D. J. (1995). Active learning in correcting domain theories: Help or hindrance? Ann Arbor, 1001, 48109. Pearson, D. & Laird, J. E. (2004). Redux: Example-driven diagrammatic tools for rapid knowledge acquisition. In Proceedings of behavior representation in modeling and simulation conference, Washington, DC. Pearson, D. J., & Laird, J. E. (1995). Toward incremental knowledge correction for agents in complex environments. Machine Intelligence, 15, 185–204. Peppas, P. (2008). Belief revision. Foundations of Artificial Intelligence, 3, 317–359. Rozich, R., Ioerger, T., & Yager, R. (2002). FURL-a theory revision approach to learning fuzzy rules. In Proceedings of the 2002 IEEE international conference on fuzzy systems (Vol. 1, pp. 791–796). Sammut, C., Hurst, S., Kedzier, D. & Michie, D. (1992). Learning to fly. In Proceedings of the ninth international machine learning conference (ML’92), Aberdeen, Scotland. Schutte, P. C. (2004). Definitions of tactical and strategic: An informal study. NASA/ TM-2004-213024, November 2004. Shaw, M. L. G. (1982). PLANET: Some experience in creating an integrated system for repertory grid application in a microcomputer. International Journal of Man– Machine Studies, 17, 345–360. Sidani, T. A., & Gonzalez, A. J. (2000). A framework for learning implicit expert knowledge through observation. Transactions of the Society for Computer Simulation, 17(2), 54–72. Stanley, K., & Miikkulainen, R. (2002). Evolving neural networks through augmenting topologies. Evolutionary Computation, 10(2), 99–127. Stein, G., & Gonzalez, A. J. (2015). Building and improving tactical agents in realtime through a haptic-based interface. Journal of Intelligent Systems (on-line version). 
Stein, G., & Gonzalez, A. J. (2011). Building high-performing human-like tactical agents through observation and experience. IEEE Transactions on Systems, Man and Cybernetics – Part B, 41(3), 792–804. Stein, G., & Gonzalez, A. J. (2014). Learning in context: Enhancing machine learning with context-based reasoning. Applied Intelligence, 41, 709–724. Tecuci, G. D. (1992). Automating knowledge acquisition as extending, updating, and improving a knowledge base. IEEE Transactions on Systems, Man and Cybernetics, 22(6), 1444–1460. The Economist. (2013). Look, no hands – one day every car may come with an invisible chauffeur. Print edition of April 20, 2013. Also available on
. Turner, R. M. (2014). Context-mediated behaviors. In P. Brezillon & A. J. Gonzalez (Eds.), Context in computing: A cross-disciplinary approach for modeling the real world. New York: Springer. Van Lent, M., & Laird, J. (1998). Learning by observation in a tactical air combat domain. In Proceedings of the eighth conference on computer generated forces and behavior representation, Orlando, FL. Wogulis, J. & Pazzani, M. (1993). A methodology for evaluating theory revision systems: Results with AUDREY II. In Proceedings of the 13th international joint conference on artificial intelligence, Chambery, France. Zachary, W., Ryder, J. M., & Hicinbothom, J. H. (1998). Cognitive task analysis and modeling of decision making in complex environments. Making decisions under stress: Implications for individual and team training, 315–344.