Intelligent Gateway Troubleshooter Melisse Leib
Bolt, Beranek and Newman, 10 Moulton Street, Cambridge, MA 02138, USA
The Intelligent Gateway Troubleshooter (IGT) is an expert system that has captured some of the network management knowledge of network operators and analysts. Specifically, it detects and diagnoses problems in the DARPA internet. One important feature of the system is that it works in real time: the user is alerted to relevant events as they are detected and diagnosed. Another feature of the system is that it provides a means of integrating expertise from several domains. This is important because network management expertise is spread among several types of human domain experts. The IGT assumes that any anomaly is caused by only one fault. For example, if a gateway does not respond, the system never attributes this to two different problems. This does not mean that the system does not consider multiple faults; however, for the system to consider multiple faults, there must be multiple anomalies.
Key Words: network management, expert systems, knowledge-based systems, troubleshooting, artificial intelligence.
1. INTRODUCTION
Managing the DARPA internet has become an extremely complex and important task. The DARPA internet is large and growing, consisting of more than 300 networks and growing by 10 networks per month. The internet comprises many network components and protocols, and its hosts are located around the world. Hence, if a problem occurs it could spread and affect a good portion of the internet. Furthermore, the complexity makes it difficult for human operators and analysts to track a problem. Therefore, there is a need for a system that will quickly identify faults and notify the operator so that appropriate action is taken and the internet can return to a normal state. The Intelligent Gateway Troubleshooter (IGT) is a first prototype of such a system. The system is restricted to the analysis of gateway faults. A gateway is a computer that provides an interface between two networks. The restricted domain allows issues in building a deeper and larger system to be addressed. However, later versions will analyze a wider range of internet problems. This paper focuses on issues that arose in the development of the IGT. The major issues are:
• How can multiple domains of expertise with distinct suitable representations be integrated?
• Under what conditions can multiple faults occur?
• How can the real-time nature of the domain be addressed?
The first issue deals with integrating incompatible forms of knowledge. For example, one problem type might be best analyzed by rules. The IGT provides a framework for integrating different analysis methodologies.
The second issue deals with the existence of multiple faults. For example, suppose that the system believes that there are two possible problems in the internet. The system contains a methodology that determines the conditions under which both can be true.
The third issue deals with troubleshooting in real time, that is, detecting problems as the information arrives. For example, the information needed to analyze a problem may not have arrived when the problem is first suspected. A real-time troubleshooter also differs from conventional diagnostic systems in that it must monitor the absence as well as the presence of faults. The IGT provides a mechanism for waiting for and requesting new information and for monitoring the absence of internet faults.
This paper presents the framework for the IGT system while addressing the above issues. The framework is independent of the network-management domain, and therefore can be applied to any real-time diagnostic system.
Paper accepted April 1989. Discussion ends April 1990. © 1989 Computational Mechanics Publications
Artificial Intelligence in Engineering, 1989, Vol. 4, No. 4
2. OVERVIEW
This section provides an overview of the system and then provides a detailed example of how the system works. The purpose of this section is to give the reader an understanding of the motivation for the details described in subsequent sections. Figure 1 represents the relationship between the troubleshooting system and other external entities. The leftmost module is the Distributed Management Module (DMM). The DMM obtains status and throughput information from gateways. The troubleshooting system requests these information types from the DMM. When the values are received from the DMM, they are stored in a database. The values in the database can be retrieved by the troubleshooting system, and modules of the troubleshooting system can be activated by changes in incoming information.

Fig. 1. System architecture

Figure 2 represents the relationship between components within the gateway troubleshooter. The gateway troubleshooter consists of a top level expert along with several specific low level experts. Each low level expert understands one specific type of internet problem and can determine the existence of that specific problem. The top level expert activates and coordinates the activities of the low level experts. The experts that have been implemented analyze problems dealing with device congestion and status. For example, there is one low level expert that can determine if gateway interfaces are congested and another low level expert that can determine if gateways are up or down.

Fig. 2. Top and low level experts

2.1. Example
For a general understanding of the system, an example of how the system of experts works on a given problem is provided. Figure 3 gives a sample internet. In this diagram, the rectangle N represents a network, the diamonds (A, B, and C) represent gateways, and the lines (A1, B1, and C1) represent gateway interfaces. A gateway is a computer that provides an interface between two networks. A gateway interface is a port of a gateway. Suppose for this sample internet that the interface A1 is congested. The steps that the system takes to determine that the interface A1 is indeed congested are now traced. The first occurrence is the top level expert receiving a message from the DMM that gateway B reports its neighbor A1 as down. This means that the gateway B polled the gateway A through the B-B1-N-A1-A path and did not receive a response. This information indicates that there could be a problem with any device or link along that path. Given the data that B reports A1 as down, the top level expert asks the low level experts to suggest explanations. The responses returned from some of the low level experts are as follows (other experts also generate hypotheses, but for simplicity we will only consider these):
• The interface up/down expert suggests that at least one of the intermediate interfaces (A1 or B1) is down.
• The interface congestion expert suggests that one of those interfaces is congested.
• The gateway up/down expert hypothesizes that gateway A is down.
When the top level expert has received all generated hypotheses, it delegates their analysis to the low level experts. A low level expert receives a hypothesis for analysis if the hypothesized fault is in the expert's domain of expertise. The low level expert then determines if the hypothesis is true or false. A low level expert may also state that there is not enough information to determine if the hypothesis is true or false. For our example, the results of these analyses are summarized:
• The gateway up/down expert rejects the hypothesis that A is down because A responded to the system five minutes ago.
• The gateway interface up/down expert rejects the hypotheses that either of the interfaces A1 or B1 is down because A1 responded five minutes ago and because another gateway, C, reported B1 as up three minutes ago.
• The interface congestion expert does not have enough information to determine if either interface A1 or interface B1 is congested.
Note that the two up/down experts have sought and found various recent observations in the database that allow them to reject their initial hypotheses. The gateway interface congestion expert, however, still does not have enough information to arrive at a conclusion. A summary of these rejected and outstanding hypotheses is given in Fig. 4.

Fig. 3. Sample internet

After 15 minutes (the fixed polling interval for data acquisition from the subnet) elapse, a new message arrives at the top level expert. The message states that the interface A1 is dropping 20% of its data packets. Again, the top level expert asks the low level experts to generate hypotheses. Now, the only hypothesis generated by the low level experts is that the interface A1 is congested. This hypothesis is given to the interface congestion expert for analysis. Once again it seeks out observations in the database that allow it to determine if any of the hypotheses are true. The analysis indicates that the interface A1 is indeed congested. This result is based on the following activity:
• For the last half hour it had a traffic flow of more than 1000 packets per fifteen-minute interval.
• For the last half hour it has been dropping more than 20% of its packets.
Based upon this information, the system concludes that A1 is congested. As a result of this conclusion, the system rejects the competing hypothesis that B1 is congested. The final summary of concluded and rejected hypotheses is given in Fig. 5.

Fig. 5. Concluded and rejected hypotheses
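The two conditions that lead to the congestion conclusion in the example can be written as a small check. The following is a minimal sketch, not the IGT's own rule: it assumes one sample per 15-minute polling interval (so "the last half hour" is the two most recent samples), and the names `Sample` and `is_congested` are hypothetical. The thresholds are the ones quoted in the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Sample:
    """One 15-minute polling interval for an interface (hypothetical record)."""
    packets: int  # packets handled during the interval
    dropped: int  # packets dropped during the interval


def is_congested(samples: Sequence[Sample],
                 min_packets: int = 1000,
                 min_drop_ratio: float = 0.20,
                 intervals: int = 2) -> Optional[bool]:
    """Return True or False, or None when there is not enough data to decide."""
    if len(samples) < intervals:
        return None  # mirrors the expert's "not enough information" answer
    recent = samples[-intervals:]
    # Both conditions must hold in every recent interval.
    return all(s.packets > min_packets and s.dropped / s.packets > min_drop_ratio
               for s in recent)
```

Returning `None` for too little data corresponds to the first pass of the example, where the interface congestion expert could neither conclude nor reject its hypotheses.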
3. ANM OVERVIEW
The system of gateway troubleshooting experts is one portion of a larger system under development called Automated Network Management (ANM). ANM is a system that can assist network operators and network analysts in controlling and understanding complex internets. Ultimately, ANM will provide an integrated set of tools for real-time monitoring, control and analysis of internets. The internets consist of diverse network entities such as gateways, packet switched nodes, packet radios and hosts. ANM will reduce maintenance costs by providing capabilities such as fault isolation and alarm generation. These capabilities will enable network operators to monitor and control networks efficiently and effectively. ANM will also provide advanced data gathering, analysis, and presentation tools. This will give the network analyst a clearer understanding of the behavior of the network so that network performance can be enhanced.

Fig. 4. Rejected and outstanding hypotheses

Fig. 6. ANM system architecture
3.1. ANM components
The ANM architecture consists of a backbone of Distributed Management Modules (DMMs) and clients that interact with the DMMs to monitor and control a diverse set of network entities, as shown in Fig. 6. To facilitate the interactions between the elements of the ANM system, a protocol called the Network Management Protocol (NMP) has been developed. The NMP provides a standard by which queries for data and control commands can be transacted in the ANM system.
3.2. The client's role
A client's function is to provide the services that are required to manage a network. Clients request and receive raw data collected from network entities by the DMM backbone. In the future, an additional service of the DMM will be to allow clients to request the execution of specific control actions. A client is connected to a single DMM, which may or may not reside in the same processor as the client.
The DMM backbone is responsible for providing the basic network management capabilities of ANM. The DMM backbone is specifically concerned with gathering raw data from and controlling network entities. The backbone consists of cooperating DMMs that may be geographically distributed. The DMMs receive NMP queries and control commands from clients and return results to the clients. Requests that require interaction with network entities are translated into entity-specific protocols and forwarded. Similarly, the responses from the network entities are translated into NMP responses and returned to the requester.
3.3. IGT context
The IGT is an example of a client. The context of IGT within the rest of ANM is given in Fig. 1. Initially, IGT sends a query to the DMM. This is the only query that is made to the DMM, so it includes all information that is ever desired. In future versions, however, the DMM will be able to support dynamic queries. In this case, IGT will be able to request any information on demand. After the DMM receives the query, it periodically polls all requested gateways for all requested information. The information is then returned to the client at a fixed frequency. For the current version of IGT, the frequency is once every 15 minutes. When the data is received it is placed in a database along with the time that it was received. This database is reserved solely for data returned from the DMM. For each piece of data, the database maintains the four previous values so that trends in the data may be analyzed. The database is located with the client, on the same host as the IGT. This allows the IGT to easily retrieve data. IGT can also be activated by relevant updates in the database.
IGT requests gateway status and throughput information from the DMM. The types of data returned with respect to devices can be grouped into three categories: gateway data, gateway interface data, and gateway neighbor data.
• Gateway data deals with information that a gateway reports about itself. Examples of this data include the vendor of the gateway and the number of seconds the gateway has been up.
• Gateway interface data deals with information that a gateway reports about each of its interfaces. Examples of this data include the number of bytes sent out by this interface and the number of buffers allocated for the interface.
• Gateway neighbor data deals with information that a gateway reports about each of its gateway interface neighbors. Examples of this data include the number of packets that have been sent to this neighbor and the positive or negative response from the neighbor.
There are also two classes of data types: throughput data and status data.
• Throughput data describes the amount of information, measured in packets and bytes, that is sent, received or lost between devices or across interfaces.
• Status data describes all other data that entities report. Examples of status information include whether or not the gateway is responding and the amount of memory that the gateway is using.

Fig. 7. Example

4. IGT FRAMEWORK
This section discusses the framework of the IGT. The components of the system are discussed without regard to the network-management domain. Therefore, this discussion may be generalized to the components of a real-time diagnostic system.
4.1. Hypotheses
The IGT experts generate and share data structures called hypotheses. A hypothesis describes some conjectured relevant event in the internet that the system of experts is analyzing. A hypothesis is created when the corresponding event is conjectured to have occurred. Hypotheses are then passed between modules of the system as well as to the user. Hypotheses are also stored in a global table so that any user or module of the system may retrieve them.
A hypothesis is in one of three states: suspected, concluded, or rejected. When a hypothesis is first generated it is said to be suspected. After a hypothesis is generated, a module or user may determine it to be true or false. When a hypothesis is determined to be true it is said to be concluded, and when it is determined to be false it is said to be rejected. A hypothesis that has not been determined to be either true or false is still said to be suspected.
IGT as well as a user can change the state of a hypothesis. IGT may determine a hypothesis to be true or false based on its procedures and knowledge about the domain. A human user of the system may also change the state of a hypothesis. This may involve application of additional knowledge and expertise that the IGT system has not obtained.
IGT may conclude or reject a hypothesis even though it is not completely certain that the hypothesis is true or false. The system concludes and rejects hypotheses when it believes that the corresponding event is true or false. This allows the system to make frequent and usually correct observations.
A hypothesis object consists of four components: summary, state, reasoning, and data.
• The summary component gives a description of the conjectured event that the hypothesis refers to (for instance, that BBN-Gateway is down).
• The state component shows that the hypothesis is either suspected, concluded, or rejected.
• The reasoning component consists of two portions, suspect reasoning and final reasoning:
  • The suspect reasoning states the motivation for the generation of the hypothesis.
  • The final reasoning states the motivation for concluding or rejecting the hypothesis.
• The data component enumerates those pieces of data that are relevant to this event. The data are divided between those that can help to support the conclusion of the hypothesis and those that can help support the rejection of the hypothesis.
The use of hypothesis objects enables a user or a module to examine a hypothesis and receive a description of that event's analysis.
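The four components of a hypothesis object map naturally onto a small record type. The sketch below is illustrative only: the class and field names (`Hypothesis`, `supporting_data`, `opposing_data`, and so on) are assumptions chosen to mirror the description above, not identifiers from the IGT.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    SUSPECTED = "suspected"
    CONCLUDED = "concluded"
    REJECTED = "rejected"


@dataclass
class Hypothesis:
    """Hypothetical layout: summary, state, reasoning (two parts), and data."""
    summary: str                      # description of the conjectured event
    suspect_reasoning: str            # why the hypothesis was generated
    state: State = State.SUSPECTED    # every hypothesis starts as suspected
    final_reasoning: str = ""         # why it was concluded or rejected
    supporting_data: List[str] = field(default_factory=list)  # favors concluding
    opposing_data: List[str] = field(default_factory=list)    # favors rejecting

    def conclude(self, reasoning: str) -> None:
        self.state = State.CONCLUDED
        self.final_reasoning = reasoning

    def reject(self, reasoning: str) -> None:
        self.state = State.REJECTED
        self.final_reasoning = reasoning


# Usage: an expert, or a human user, rejects a suspected hypothesis.
h = Hypothesis("BBN-Gateway is down", "a neighbor reports the gateway as down")
h.opposing_data.append("gateway responded five minutes ago")
h.reject("the gateway responded recently")
```

Keeping the reasoning and the data on the object is what lets a user or module inspect a hypothesis later and see how it was analyzed.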
4.2. Experts
The IGT system consists of a top level expert that coordinates and activates the work of several problem-specific low level experts (Fig. 2). Organizing network analysis knowledge in multiple cooperating experts provides advantages in flexibility and modularity over using a single, large program. It allows the most appropriate knowledge representation to be used in each expert, depending upon the expert's problem domain. One expert may encode its knowledge in a backward-chaining rule-based system, and another may encode its knowledge procedurally.
The use of multiple experts provides a framework for adding or changing knowledge in the system. To add expertise about a new type of network problem, a new expert is added to the system. To change the way the system reasons about a certain type of network problem, only a single expert needs to be modified.
The disadvantage of this approach is that it constrains the communication between different areas of expertise. If one expert needs to know about the results of another expert, it cannot directly access the internal data and procedures of the other expert. It must communicate with the other expert by another means.
Each low level expert communicates its understanding and knowledge of one specific class of problems through hypothesizing and analyzing. In the hypothesizing task, the low level expert is given a message and generates hypotheses that are relevant to the expert's problem domain. In the analyzing task, the expert is given a hypothesis and applies its knowledge to determine if the hypothesis is true, false, or if there is not enough information to conclude anything.
The top level expert is a manager of the low level experts. It has three roles:
• To invoke the low level experts at the appropriate times,
• To provide a means for communication between the low level experts,
• To resolve conflicts between the specific experts.
The role of invoking low level experts at the appropriate time is handled by the top level expert's main loop. This loop first waits for a message about the internet from any source. Once a message is received, the top level expert asks the low level experts to generate hypotheses relevant to that piece of data. Once all experts have hypothesized, the top level expert asks the low level experts to analyze each hypothesis.

Fig. 8. Diagnostic system problem cycle

There are two reasons why this approach is used instead of one where the specific expert is given a message and then both hypothesizes and analyzes. The first reason is that the approach taken allows for an extension that can assign priorities to each expert. The hypothesizing and analyzing of experts with higher priority would be given preference, while the work for lower priority experts might be postponed until later. The second reason is that the approach taken allows the top level expert to discard some hypotheses that have been generated.
There are two reasons why the system would discard hypotheses: the state of the network and human pruning. The system might discard a hypothesis because a previously concluded problem might explain the hypothesis. For example, suppose that gateway A has been concluded to be down. Also suppose that a neighbor of A later reports it as down. The data that the neighbor could not reach A might cause some low level experts to generate hypotheses. However, it is unnecessary for these experts to analyze the hypotheses because 'A is down' explains why the neighbor reported A as down. The system may also discard hypotheses due to human pruning. After hypotheses have been generated, the user can manually reject or conclude them. This work can occur before IGT analyzes the hypotheses.
A second role of the top level expert is to provide a mechanism for experts to communicate with one another. If one low level expert finds the conclusions or suspicions of another low level expert to be relevant, then the top level expert performs the appropriate notifications. For example, the gateway interface flapping expert determines if gateway interfaces are periodically going up and down. Therefore, this expert is interested when an interface is concluded to be up or down.
A third role of the top level expert is to determine the conditions under which multiple faults can occur. For example, suppose that a gateway exhibits a high delay.
One expert hypothesizes that there is a problem with the gateway's routing table, and another might hypothesize that the gateway is congested. The top level expert determines when both hypotheses may be concluded.
The approach that the system uses assumes that any anomaly is caused by only one fault. For example, if a gateway does not respond, then the system never attributes this to two different problems. This does not mean, however, that the system does not consider multiple faults. For the system to consider multiple faults, there must be multiple anomalies. A low level expert operationally defines a piece of data to be anomalous if:
• The expert recognizes the data as problematic.
• The expert can generate a hypothesis that would explain why the data occurred. The data is said to be anomalous with respect to this hypothesis.
For instance, the system might receive the data that a gateway is not responding. The gateway up/down expert recognizes this datum as problematic. The gateway up/down expert also notes that if the gateway were down, then the system would know why the gateway did not respond.
Therefore, if a hypothesis is concluded, then the system knows why all of the data that are anomalous with respect to this hypothesis occurred. When the system knows why a piece of anomalous data occurred, that datum is said to be explained. The method that the IGT system uses is to reject a hypothesis when all of its anomalous data become explained. This approach is used because the presence of anomalous data gives the motivation for analyzing the particular hypothesis. When the anomalous data are removed, then there is typically no motivation for analyzing the problem.
To illustrate these concepts, consider the internet of Fig. 3. Suppose that gateway B reports A1 as down. This generates the following hypotheses:
• A1 is congested
• B1 is congested
• A is down.
These three hypotheses all consider 'B reports A1 as down' to be a piece of anomalous data because any of these hypotheses, if concluded, would explain why the datum occurred (Fig. 7a). It is assumed that none of these hypotheses are concluded or rejected. Suppose that the next event that occurs is 'A1 is dropping twenty percent of its packets'. This is anomalous data for the hypothesis that A1 is congested. The resulting hypothesis/anomalous data relationship is summarized in Fig. 7b.
To illustrate a hypothesis that is rejected because all of its anomalous data are explained, refer to Fig. 7c. Suppose that 'B1 is congested' becomes concluded. As a result, all of its anomalous data become explained. 'A is down' gets rejected because all of its anomalous data are explained. However, 'A1 is congested' is still active because 'A1 is dropping 20% of its packets' is still in need of explanation. The motivation for determining if A1 is congested still exists because additional anomalous data were received. However, there is no longer any motivation for analyzing 'A is down'.

Fig. 9. Modified problem cycle

Fig. 10. Real time cycle

The multiple-fault strategy also extends to the top level expert determining when not to consider newly generated hypotheses. Suppose that the top level expert receives a datum that a gateway is no longer responding. Suppose further that the hypothesis that the gateway is down has been previously concluded. This concluded hypothesis would explain why this piece of anomalous data is being received, and hence there is no motivation to analyze newly generated hypotheses that might explain this piece of data.
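The rejection rule of this section (reject a suspected hypothesis once all of its anomalous data are explained, and discard new hypotheses for data that a concluded hypothesis already explains) can be sketched as simple bookkeeping. This is a hedged illustration, not the IGT's code: `Hyp`, `TopLevelExpert`, `receive` and `conclude` are hypothetical names, and the low level experts are reduced to a list of candidate hypothesis names supplied with each datum.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class Hyp:
    """Hypothetical stand-in for a hypothesis and its anomalous data."""
    name: str
    state: str = "suspected"  # "suspected", "concluded", or "rejected"
    anomalies: Set[str] = field(default_factory=set)


class TopLevelExpert:
    """Illustrative bookkeeping only; names and structure are assumptions."""

    def __init__(self) -> None:
        self.hyps: Dict[str, Hyp] = {}
        self.explained: Set[str] = set()

    def receive(self, datum: str, candidates: Iterable[str]) -> None:
        """Attach a datum to the hypotheses that would explain it."""
        names = list(candidates)
        # If a candidate is already concluded, it explains the datum and
        # the newly generated hypotheses are discarded without analysis.
        if any(n in self.hyps and self.hyps[n].state == "concluded" for n in names):
            self.explained.add(datum)
            return
        for n in names:
            hyp = self.hyps.setdefault(n, Hyp(n))
            hyp.state = "suspected"  # new anomalous data revives a rejected hypothesis
            hyp.anomalies.add(datum)

    def conclude(self, name: str) -> None:
        """Conclude one hypothesis and reject those left with nothing to explain."""
        hyp = self.hyps[name]
        hyp.state = "concluded"
        self.explained |= hyp.anomalies
        for other in self.hyps.values():
            if other.state == "suspected" and other.anomalies <= self.explained:
                other.state = "rejected"


# The trace of Fig. 7, as described in the text.
top = TopLevelExpert()
top.receive("B reports A1 as down", ["A1 congested", "B1 congested", "A down"])
top.receive("A1 drops 20% of its packets", ["A1 congested"])
top.conclude("B1 congested")
```

After the trace, 'A down' is rejected because its only anomalous datum is explained, while 'A1 congested' stays suspected because the packet-drop datum remains unexplained.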
4.3. Problem cycle
To fully understand the motivation for the system's details, the system's context as a real-time diagnostic system must be explained. Most diagnostic systems contain rules and procedures for determining the existence of problems in their domain. If a particular problem is suspected, the system is able to determine if it is true or not, as illustrated in Fig. 8.
However, IGT runs in real time. Since the system is continuously receiving messages about the internet, it must determine what problems are worth considering and then analyze them. Therefore, it must know when a problem should be suspected. Figure 9 illustrates a modification to the previous model. The system initially assumes that no problem exists in the internet. Once a piece of data is received that allows the system to suspect that the problem exists, the system enters the problem-suspected state. Now, the system applies its rules and procedures to determine if the conjecture is true, false, or if there is not enough information to draw a conclusion. If the conjecture is determined to be false, then the system returns to its initial state where it assumes that there is no problem. If it is determined to be true, then the system assumes that a problem exists, and enters the problem-concluded state of the diagram. If the system cannot conclude if the problem exists or not, then the system remains in the problem-suspected state until it receives additional messages concerning this problem and the cycle is continued.
This model is still incomplete. In order to analyze problems in real time, there is the additional constraint of determining that problems have ceased to exist. Figure 10 illustrates the state diagram that the system uses. Determining if problems have ceased to exist is analogous to determining if problems have begun to exist. Once a problem is concluded, the system waits for a message that would make it suspect that the problem has ceased to exist. At this point it enters the no-problem suspected state and again determines if the problem is true, false, or inconclusive. If the problem is still true, the system enters the problem-concluded state; if the problem is false, the system enters the no-problem state; and if the problem is inconclusive, then the system remains in the no-problem suspected state.
Note that the arrows in Fig. 10 do not represent received messages. Rather, they are state changes in the system that can be caused by analyses and human interaction as well as messages. This means that the system could be in the no-problem state and after one message be in the problem-concluded state.
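The four-state cycle described for Fig. 10 can be written down as a small transition table. The sketch below is an assumption-laden illustration: the state names follow the text, but the split into `on_data` (a message that raises a suspicion) and `on_analysis` (the outcome of analysis or of a human decision) is a hypothetical decomposition, not the paper's design.

```python
from enum import Enum, auto
from typing import Optional


class S(Enum):
    NO_PROBLEM = auto()
    PROBLEM_SUSPECTED = auto()
    PROBLEM_CONCLUDED = auto()
    NO_PROBLEM_SUSPECTED = auto()


def on_data(state: S, suggests_problem: bool) -> S:
    """A message that makes the system suspect a change of state."""
    if state is S.NO_PROBLEM and suggests_problem:
        return S.PROBLEM_SUSPECTED
    if state is S.PROBLEM_CONCLUDED and not suggests_problem:
        return S.NO_PROBLEM_SUSPECTED
    return state


def on_analysis(state: S, problem_exists: Optional[bool]) -> S:
    """Outcome of analysis: True, False, or None when inconclusive."""
    suspected = (S.PROBLEM_SUSPECTED, S.NO_PROBLEM_SUSPECTED)
    if state not in suspected or problem_exists is None:
        return state  # inconclusive: remain in the current state
    return S.PROBLEM_CONCLUDED if problem_exists else S.NO_PROBLEM
```

Because a single incoming message can trigger both a suspicion and an immediate analysis, one message can take a problem from the no-problem state to the problem-concluded state, as the text notes.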
4.3.1. Experts and the problem cycle This section discusses how the role of experts fits in with the problem cycle. A trace of the actions of the experts provides a description of the interactions between experts and problem cycle. The trace begins with all possible problems in the no-problem state. Then an anomalous message is received. Here, the top level expert asks for all lower level experts to generate hypotheses that could explain why the piece of anomalous data occurred. We assume that all hypotheses are newly generated. At this point each of the problems corresponding to the generated hypotheses move into the problem-suspected state. Then the top level expert has all lower level experts analyze their hypotheses. Those hypotheses that are rejected have their corresponding problem move into the no-problem state. Those concluded have their corresponding problem move into the problem-concluded state, and those where no action is taken remain in the problem-suspected state. 4.4. Retractions As previously mentioned, hypothesis objects become generated when IGT suspects that a relevant event has occurred in the internet. The relevant event might be that a particular problem exists, or alternatively, that a problem has ceased to exist. A retraction object is a special case of a hypothesis where the relevant event that is being conjectured is that a concluded problem has ceased. Retraction objects are treated in the same manner as hypothesis objects; they are generated when data suggesting that the problem has ceased is received, and are determined to be true, false, or inconclusive. Referring back to Fig. 10, a retraction object is created when the system enters the no problem-suspected state and then is rejected, concluded or remains suspected. The system usually does not use the hypothesis object ofa retraction's corresponding problem hypothesis when analyzing the retraction. 
This approach is used because the analysis of the retraction is taking place at the present time, whereas the problem was concluded at a previous time, and there may be a different set of data and reasoning. An expert may obtain the appropriate problem hypothesis if it believes that its information might be helpful. 4.5. Hypothesis lifetime Suspected hypotheses are maintained by the top level expert. This allows the suspected hypothesis to be obtained by any expert, and so that IGT will know the appropriate state of analysis. Problem hypotheses that have been concluded are also maintained by the top level expert. This also allows the system to know the appropriate state of a particular problem. Once the corresponding retraction of a problem hypothesis is concluded, both the problem hypothesis and retraction are removed from the system. Therefore, if the system contains no hypotheses pertaining to a problem, then it assumes that the problem is in the no-problem state. However, there is one exception to the rule of
removing concluded retractions from the system. An expert or a user may want to save some retracted hypotheses for historical reasons. The user or expert may do so by explicitly asking that the hypothesis remain in the system.
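The problem cycle traced in Sections 4.3.1 through 4.5 can be sketched as a small state machine. This is only an illustrative sketch, not the IGT implementation: the names `State`, `Verdict`, and `ProblemCycle` are hypothetical, and the transition taken when a retraction is rejected (back to the problem-concluded state) is an assumption inferred from the text, since Fig. 10 is not reproduced here.

```python
from enum import Enum


class State(Enum):
    NO_PROBLEM = "no-problem"
    PROBLEM_SUSPECTED = "problem-suspected"
    PROBLEM_CONCLUDED = "problem-concluded"
    NO_PROBLEM_SUSPECTED = "no-problem-suspected"


class Verdict(Enum):
    CONCLUDED = "concluded"        # hypothesis judged true
    REJECTED = "rejected"          # hypothesis judged false
    INCONCLUSIVE = "inconclusive"  # no action taken


# (current state, verdict) -> next state. The last entry is an assumption:
# a rejected retraction is taken to mean that the problem still exists.
TRANSITIONS = {
    (State.PROBLEM_SUSPECTED, Verdict.CONCLUDED): State.PROBLEM_CONCLUDED,
    (State.PROBLEM_SUSPECTED, Verdict.REJECTED): State.NO_PROBLEM,
    (State.NO_PROBLEM_SUSPECTED, Verdict.CONCLUDED): State.NO_PROBLEM,
    (State.NO_PROBLEM_SUSPECTED, Verdict.REJECTED): State.PROBLEM_CONCLUDED,
}


class ProblemCycle:
    """Tracks the state of one possible problem (hypothetical helper)."""

    def __init__(self):
        self.state = State.NO_PROBLEM

    def anomalous_data(self):
        """Anomalous data generates a problem hypothesis."""
        if self.state is State.NO_PROBLEM:
            self.state = State.PROBLEM_SUSPECTED

    def normal_data(self):
        """Data suggesting the problem has ceased generates a retraction."""
        if self.state is State.PROBLEM_CONCLUDED:
            self.state = State.NO_PROBLEM_SUSPECTED

    def analyze(self, verdict):
        """Apply an expert's verdict; inconclusive verdicts leave the state as is."""
        self.state = TRANSITIONS.get((self.state, verdict), self.state)
```

Under these assumptions, a problem moves from no-problem to problem-suspected on anomalous data, stays there while its hypothesis is inconclusive, and returns to no-problem only when its hypothesis is rejected or its retraction is concluded.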
4.6. Messages
In prior sections, it was stated that experts receive and respond to messages concerning the internet. This section describes what these messages are and when they are appropriate. Our system uses three types of messages: taps, timers, and conclusions. A tap is a database mechanism that notifies modules when the value of a particular object exceeds a particular threshold or is updated. IGT implements taps by having experts request to be notified when particular values in the database are in a particular range. For example, an expert may request to be notified whenever the number of packets sent between A and B exceeds 1000. An expert may also use the tapping system to be notified when a function of two or more values falls within a specific range. For example, an expert may want to be notified when the sum of the number of packets from A to B and the number of packets from A to C exceeds 2000. This is implemented by tapping both values and activating the appropriate expert when the combination is in the appropriate range and the time stamps of all values in question are close to each other. The desired amount of 'closeness' between time stamps is determined by the module that sets the tap. Another way that an expert might be notified is through the use of timers. For example, an expert might want to discontinue the analysis of a given hypothesis if a certain period of time elapses and no additional supporting data is received. To handle this, an expert sets a timer and, once the timer expires, performs the necessary actions. A third way that an expert can be notified is by the conclusion of another hypothesis. For example, the gateway interface flapping expert determines whether gateway interfaces are periodically going up and down. Therefore, this expert is interested in knowing when the interface up/down expert concludes that an interface is up or down.
An expert specifies a need for this information by requesting it from the top level expert. When the hypothesis is concluded, either by a lower level expert or by a user of the system, the top level expert notifies the requesting experts. These are the three message mechanisms currently in use. The framework is easily extended to include messages that come from other sources. One possible source is news that comes from a human operator. An operator could report pieces of data not obtainable from the DMM, and the expert could process them appropriately. Another new source of information could be additional data from the DMM that is not sent automatically. This would allow experts to make dynamic queries to the DMM based on suspected problems, as human experts do. The timing of a given message is very important. For example, consider the gateway up/down expert. Suppose a message is received stating that gateway1 has responded. This data is irrelevant if the system believes the gateway to be up, but it is relevant if the system is uncertain or believes that the gateway is down. It is desirable for an
expert to dynamically request when messages should and should not be sent. When an expert requests that a message be received, the message is said to be enabled. When an expert requests that a message not be received, the message is said to be disabled. Figure 11 summarizes the appropriate messages at given states. When the system is in the no-problem state (we assume that there is also no problem at initialization), the only relevant messages are those that trigger the suspicion of a problem. When this state is entered, only these messages are enabled and all others are disabled. When the system is in a suspecting state, any message that is relevant to the given problem is appropriate, and these messages are enabled in this state. Finally, when the problem is concluded, the only relevant messages are those that indicate normal operation. Only these messages are enabled when this state is entered, and all others are disabled.
State Entered                              Enabled Messages
No Problem (initialization also)           Anomalous
Problem Suspected / No Problem Suspected   Any appropriate to the problem's analysis
Problem Concluded                          Normal

Fig. 11. Messages at given states
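The tap mechanism of Section 4.6 can be sketched as follows. This is a minimal sketch under stated assumptions, not the IGT database interface: the names `Tap`, `Database`, `set_tap`, and `update` are hypothetical, time stamps are assumed to be plain numbers in seconds, and 'closeness' is assumed to mean that the spread between the oldest and newest time stamp does not exceed a limit chosen by the module that sets the tap.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Tap:
    keys: Tuple[str, ...]             # database values being watched
    combine: Callable[..., float]     # function of those values
    low: float                        # notify when low < combined value ...
    high: float                       # ... and combined value <= high
    max_skew: float                   # allowed spread between time stamps
    notify: Callable[[float], None]   # callback that activates the expert


class Database:
    """Hypothetical value store that checks taps on every update."""

    def __init__(self):
        self._values: Dict[str, Tuple[float, float]] = {}  # key -> (value, time stamp)
        self._taps: List[Tap] = []

    def set_tap(self, tap: Tap) -> None:
        self._taps.append(tap)

    def update(self, key: str, value: float, timestamp: float) -> None:
        self._values[key] = (value, timestamp)
        for tap in self._taps:
            if key in tap.keys:
                self._check(tap)

    def _check(self, tap: Tap) -> None:
        # Every tapped value must have been reported at least once.
        if not all(k in self._values for k in tap.keys):
            return
        values = [self._values[k][0] for k in tap.keys]
        stamps = [self._values[k][1] for k in tap.keys]
        # The time stamps must be close to each other.
        if max(stamps) - min(stamps) > tap.max_skew:
            return
        combined = tap.combine(*values)
        if tap.low < combined <= tap.high:
            tap.notify(combined)
```

With this sketch, the example from the text (the sum of the packets from A to B and from A to C exceeding 2000) would be a tap on two keys with `combine=lambda ab, ac: ab + ac`, `low=2000`, and `high=float("inf")`. A single-value tap is the special case of one key and an identity `combine` function.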
5. EXTENSIONS
This section describes extensions that, if added to IGT, would improve the system while maintaining its framework. There are three suggested extension areas: generic experts, rule-based systems, and certainty factors.
5.1. Generic experts
The first extension is the use of a generic framework for defining new experts. When a new expert is created, a copy of the generic framework is obtained and the specific expertise is inserted into it. To understand how the generic framework works, consider the illustration of the problem cycle in Fig. 12. Suppose that a new expert is explicitly defined by the messages the expert finds relevant for each state of this cycle and by a procedure that determines whether a given hypothesis is true, false, or inconclusive. A message definition would include a description and a way of enabling and disabling the message. This is the only information needed by most experts. The procedures that implement the state diagram of Fig. 12 could be embedded within the generic framework. The effect of building this framework is that details common to all experts become independent of each expert. As a result, experts become easier to build because this work does not need to be duplicated. Furthermore, the knowledge that an expert embodies becomes more transparent because it is no longer hidden within the expert's details. It is therefore easier to understand the function of an expert, as well as to make changes to an existing expert. A disadvantage of this approach is that it imposes a structure on newly created experts. With the framework, an expert cannot dynamically enable and disable messages within one particular state; the expert must wait until there is an appropriate state change.
5.2. Rule-based system
Another possible extension of the IGT experts is the use of a rule-based system. A rule-based system is a program that uses if-then rules to represent knowledge and combines the rules to make inferences. It would not be advantageous to transform the whole of IGT into a rule-based system. Portions of IGT such as the top
level main loop, the message-enabling/disabling paradigm, and the state diagram are better represented by procedures. However, one portion of the system that might be enhanced by rules is the portion of the low level experts that determines whether hypotheses are true, false, or inconclusive. The following is an example of a rule that IGT might use. It is relevant for determining whether an interface is flapping, that is, periodically going up and down.
IF X is a gateway interface
AND IF The up/down state of X has changed N times in the last 30 minutes
AND IF N > 2
THEN X is flapping
Each expert could use such rules to represent the knowledge for determining when hypotheses are true or false. The advantage of this structure is that the knowledge the rules embody becomes more explicit. It becomes easier to build, modify, and understand the knowledge of a particular expert. The disadvantage of this approach is that it imposes a framework on the knowledge of a given expert, and this framework may not be appropriate for a particular domain. For example, if an expert uses an iterative approach to determine whether a problem exists, it would be difficult to represent this knowledge with rules. The paradigm of multiple experts solves this problem: one expert might represent its knowledge with rules, while another represents its knowledge with procedures.
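For comparison, the flapping rule above can also be written as an ordinary procedure. The sketch below is illustrative only: `Interface`, `is_flapping`, and the representation of state changes as a list of time stamps in seconds are assumptions, and only the 30-minute window and the N > 2 condition come from the rule itself.

```python
from dataclasses import dataclass, field
from typing import List

WINDOW_SECONDS = 30 * 60   # "the last 30 minutes"
MIN_CHANGES = 2            # the rule fires when N > 2


@dataclass
class Interface:
    name: str
    is_gateway_interface: bool = True
    # Times (in seconds) at which the up/down state changed.
    change_times: List[float] = field(default_factory=list)


def is_flapping(x: Interface, now: float) -> bool:
    """Return True when the rule's three conditions all hold for x."""
    # IF X is a gateway interface
    if not x.is_gateway_interface:
        return False
    # AND IF the up/down state of X has changed N times in the last 30 minutes
    n = sum(1 for t in x.change_times if now - WINDOW_SECONDS <= t <= now)
    # AND IF N > 2 THEN X is flapping
    return n > MIN_CHANGES
```

The procedural form is compact, but the three conditions are now spread across control flow rather than stated declaratively, which is the loss of explicitness that the rule-based representation is meant to avoid.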
5.3. Certainty factors
Another extension that can be added to the IGT system is the use of certainty factors. A certainty factor is a number between -1 and 1 that is assigned to a hypothesis and expresses the amount of evidence that the system has. A certainty factor of -1 means that the system is certain that the hypothesis is false, a certainty factor of 0 means that the system has no evidence for or against the hypothesis, and a certainty factor of 1 means that the system is certain that the hypothesis is true. Certainty factors may be thought of as the probability of a hypothesis being true or false. However, they cannot be used as or called probabilities because they do not obey the axioms of mathematical probability. The IGT system can use certainty factors by having each expert assign one to each hypothesis. When new information about the hypothesis is received, the expert can update the certainty factor appropriately. Additional negative evidence lowers the certainty factor, while additional positive evidence raises it. A hypothesis can be concluded if its certainty factor exceeds a given threshold between 0 and 1. Similarly, a hypothesis can be rejected if its certainty factor falls below another threshold between -1 and 0.
6. CONCLUSION
Fig. 12. Problem cycle
This document has described the Intelligent Gateway Troubleshooter. The system simulates the expertise of network operators and analysts in detecting and diagnosing problems with the internet. The issues that were relevant in building the system were the integration of multiple kinds of expertise and the construction of a real-time diagnostic system. The concept of multiple experts allows
experts from several domains to be integrated and provides a means for appending additional expertise to the system. Understanding when and how the experts should be activated through appropriate messages concerning the internet allows us to encode the expertise as a real-time system. Now that a prototype system has been built, the experts need to be used in a real setting. This will verify the expertise of the particular experts as well as the framework within which the experts are built. The system must also be expanded in both breadth and depth. Its breadth should be expanded so that it handles a wider range of problems such as routing inconsistencies
and understands additional network entities such as hosts, PSNs, and packet radios. Its depth should also be extended so that the experts are able to fully understand their domain of expertise.
ACKNOWLEDGEMENTS
This research was supported by the Defense Advanced Research Projects Agency under Contract MDA90383-C-0131. The author wishes to acknowledge Jim Ong for his guidance on this project. His ideas and comments have had a tremendous effect on the development of the system.