Troubleshooting throughput bottlenecks using executable models

John A. Zinky
BBN Communications, 150 Cambridge Park Drive, Cambridge, MA 02140, USA
Joshua Etkin
Boston University, College of Engineering, 44 Cummington Street, Boston, MA 02215, USA
Abstract

Zinky, J.A. and J. Etkin, Troubleshooting throughput bottlenecks using executable models, Computer Networks and ISDN Systems 24 (1992) 33-43.

Troubleshooting performance problems in computer networks can be automated with the help of a tool called an executable performance model. The goal of an executable model is to give a fast, inexpensive, and correct prediction of the operational network's behavior. An executable model is created for a path across the network, and sensitivity analysis is used to determine the cause of a throughput bottleneck. An expert system is used to speed up the search for a simplified yet valid model of the path. A traditional performance model can be extremely complicated and expensive to run, so it does not meet the goals of an executable model. A prototype of this troubleshooting technique has been implemented, and it demonstrates the feasibility of combining an expert system for speed and a model for accuracy.
Keywords: executable models, computer network management, network software testing, computer performance models, rule-based systems.
1. Introduction
The traditional role of performance models is to help analyze the design and specification of a new system. After the analysis is done, the models are discarded. In earlier work, we proposed the Executable Model Approach for network development [3]. In this approach, performance models continue to be developed for day-to-day use by network analysts and operators. Integrating performance modeling into the entire network development life cycle helps maintain the correctness and efficiency of computer networks. Here we describe a troubleshooting technique, based on the Executable Model Approach, that is used to tune the efficiency of computer networks.

Many expert systems have been created to help troubleshoot faults in computer networks. Some [8,11] handle mainly physical faults such as noisy lines or failed components. Others [2] manage economic faults, such as reducing the cost of lines given the current traffic patterns. Here we focus on troubleshooting performance faults caused by traffic loads greater than the capacity of the network's resources.

To troubleshoot a performance fault, the analyst must collect statistics about network traffic and behavior. These statistics represent thousands of data points, which must be reduced to understand the essence of the performance problem. In the end, the analyst must form a mental model of network behavior which explains the problem and its solution. To help the human analyst comprehend what is happening in the network, some work has focused on visualizing network statistics [10]. But the extremely difficult task of calculating how the network is supposed to behave is left to the analyst. Some expert troubleshooting systems in this domain use heuristic reasoning about performance statistics [6], protocol traces [5], or event streams [9] to infer the cause of the problem.
A typical rule for these systems would be: "If the network delay is high, increase the window size". These heuristics can be wrong, and one way to verify their hypotheses is to use a performance model. Other troubleshooting systems [1] offer an integrated environment for comparing network statistics and performance models, leaving the reasoning and troubleshooting techniques to a human analyst. A typical query in this environment would be: "If the polling rate of controller No. 42 is set to 100 ms, what is the end-to-end delay?" All these troubleshooting tools are effective, but their functionality could be enhanced by using the Executable Model Approach.

The Automatic Network Troubleshooter (ANT) prototype [4,12] is based on the Executable Model Approach. The ANT prototype can detect throughput bottlenecks caused by inappropriate configuration parameters or lack of network resources. ANT uses an executable model of network behavior that is tightly coupled with the statistics collected from the network. The performance model is based on the causal mechanics of individual network components. It is customized to a specific situation by specifying different assumptions about traffic patterns or component behavior. The troubleshooting process identifies a set of assumptions such that the modeled behavior is consistent with the observed behavior of the faulty network. A typical assumption would be "Line No. 24 is a throughput bottleneck and Line No. 96 is not a bottleneck". This assumption implies that a performance model of the network should concentrate on modeling Line No. 24 and ignore Line No. 96. ANT uses an expert system to speed up the search through the space of possible assumptions. It uses heuristic reasoning and augments this capability by testing hypotheses against a performance model. The output of the troubleshooting process is a set of configuration parameters that can increase the end-to-end throughput, together with a simplified performance model for exploring the quantitative effect of changing these parameters.

This paper focuses on the issues that arise in using an executable model to troubleshoot throughput bottlenecks. First, we describe how to create an executable model of an end-to-end flow. Special emphasis is given to the flexibility of the model in form and detail. Flexibility allows ANT to create a simplified model that can be tailored to the specific problem instance. Second, we give some background on throughput bottlenecks and how they occur in operational networks. A sample run of ANT is presented and used throughout the paper to illustrate the
troubleshooting process [12]. Third, we explore the use of network statistics to give hints about the source of the bottleneck. Rules are defined that identify patterns in network statistics and relate them to assumptions about the location of the bottleneck. Next, we show how to use these hints to speed the search for a simplified model of the network. Each assumption generated by the rules is tested: a model is generated based on the rules, and its prediction is validated against real network behavior. Last, we show how sensitivity analysis of the model is used to identify the cause of the performance problem. Sensitivity is normalized to help the analyst make tradeoffs when deciding which parameters to change in the operational network.
2. Executable performance model and ANT prototype

The goal of an executable model is to give a fast, inexpensive, and correct prediction of an operational network's behavior. An executable model is optimized for turn-around time, from when it is specified to when its results are available. The model's predictive capability is on the critical path of the analysis process, so reducing execution time is essential. Model evaluation must be inexpensive because the model is run numerous times each day. Finally, the model must be correct because its predictions are used to make decisions that affect the operational network.

These goals differ from the goals of performance models traditionally used to design networks. Design models must be flexible to explore a wide range of hypothetical networks. Execution speed of the model is insignificant compared to the time it takes the analyst to define and debug the different models for each hypothetical network. Executable models are concerned only with the behavior of a specific network and can be customized for the operational network's fixed functionality and specific topology.

Special system requirements are necessary to meet the goals of executable models. First, the model should correctly predict the causal behavior of the operational network. The reliability of this prediction should be greater than or equal to the software reliability of the network itself. Second, no human intervention should be needed to
specify, execute, or report the results of the model. Third, the model's specification should be flexible. Knowledge of how to make the prediction cheaper is held by the executable model; the user of the model specifies what level of complexity is desired.

The ANT prototype uses an interchangeable-parts paradigm to meet the requirements for its executable model. Each component along the path has several submodels, each valid for a different set of assumptions about the component. An overall model is specified by making assumptions for each component along the path. A submodel is selected for each component based on the assumptions for that component, and these submodels are combined into an overall model. Different sets of assumptions generate different models. The more specific the assumptions, the more customized the model is to a given situation. A custom model has a simpler representation than a general model because unnecessary details have been removed. But a custom model is valid only for its specific network context and a narrow range of assumptions.

The ANT prototype builds its model in three phases. First, it takes the physical topology of devices along a path and expands it into a functional topology which represents the different protocol components used along the path. It then chooses a submodel for each component and, finally, builds a queueing model to represent the consumption of resources along the path. For example, a host device will be expanded into several functional components representing the transport and link access protocols. A submodel is chosen for each component which models the consumption of CPU resources needed to process the protocols. Using these multiple representations allows ANT greater flexibility for model customization. It also decouples ANT from specific device types, which is important for troubleshooting a multi-vendor network.

There are four types of components in the functional topology. The protocol processing box contains the state machine that implements the protocol between peers in the same layer. The service interface translates the data items between layers of protocols. The physical line transmits data between the lowest layers in the protocol. Finally, the logical link transmits data between protocol processing boxes. Logical links represent
the services offered by the lower layers and have no physical manifestation.

The ANT prototype implements two kinds of submodels. The first is a queueing model of the behavior of the component and is valid for all situations within ANT's domain of problems. This submodel's assumption is labeled "bottleneck", and computation is needed to evaluate it. The other submodel is valid only if the component is not part of the bottleneck process. Its model is an infinite capacity queue, which is just a connector between queues and can be ignored. This submodel's assumption is labeled "non-bottleneck", and no computation is needed to evaluate it.

A submodel is represented by a queueing network with two parts: a model form and a calibration. The model form defines the queueing discipline used by the component, such as first-come-first-served or a delay server. Calibration sets values for service times, thresholds, or populations in terms of measured parameters. Each modeling parameter is calibrated against some set of measured parameters. For example, the service time of a link queue is the average packet length divided by the line speed. The form of a submodel's queueing network is defined before troubleshooting begins. Likewise, the mapping between the parameters and the corresponding service times in the submodel is also predefined.
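To make the interchangeable-parts paradigm and the form/calibration split concrete, here is a minimal sketch in Python (the paper does not say what ANT is written in, and all names below are illustrative rather than ANT's actual interfaces). It calibrates a link-queue submodel from the example formula above and selects one submodel per component from the current assumption:

    from dataclasses import dataclass

    @dataclass
    class Submodel:
        form: str            # queueing discipline, e.g. "FCFS" or "infinite"
        service_time: float  # seconds per message, set by calibration

    def link_queue(avg_packet_bits, line_bps):
        # Calibration per the text: service time = average packet length
        # divided by the line speed.
        return Submodel("FCFS", avg_packet_bits / line_bps)

    def non_bottleneck():
        # Infinite-capacity queue: a connector that consumes no resources.
        return Submodel("infinite", 0.0)

    # One candidate submodel per (component, assumption) pair; the slow-trunk
    # numbers (2048-bit messages, 9600 b/s line) calibrate the bottleneck case.
    SUBMODELS = {
        ("trunk-link", "bottleneck"):     lambda: link_queue(2048, 9600),
        ("trunk-link", "non-bottleneck"): non_bottleneck,
    }

    def build_model(path, assumptions):
        """Combine one submodel per component into an overall path model."""
        return [SUBMODELS[(c, assumptions[c])]() for c in path]

    print(build_model(["trunk-link"], {"trunk-link": "bottleneck"}))
    # [Submodel(form='FCFS', service_time=0.2133...)]

Changing an assumption swaps in a different part without touching the rest of the model, which is the point of the paradigm.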
The submodel is calibrated for each troubleshooting run by the specific values of the configuration parameters and measured system parameters. Since we are only interested in modeling throughput bottlenecks, a queueing model is adequate. In the future, we plan to use this technique to specify a modular discrete event simulation [7].

Figure 1 shows an example of how different submodels can be used to represent an ACK aggregation component. In ACK aggregation, transmission of ACKs is delayed in hopes that multiple ACKs can be sent in one packet. The first row represents a model where the returning data can carry back one ACK. Under this assumption, the ACK waits for the average interarrival time of the return traffic before it gets piggybacked on the returning data. The second row represents the situation when sending and returning traffic is scarce. In this case, there is no opportunity to piggyback and the ACK is sent by itself after a timeout period. The third row represents when the returning ACK traffic is frequent. In this case, K ACKs are aggregated into one return packet. The last row represents when the ACK aggregation delay is trivial relative to other delays in the path. In this case, the component can be ignored by modeling it as an infinite capacity server. Notice that both the form of the queueing model and the calibration formulas change with different assumptions. This shows that different assumptions depend on different parameters; hence only the submodels with the right assumptions will give an accurate representation of the protocol component's behavior.
    Assumption                                         Model form                                Service time calibration
    Cross traffic is frequent                          (piggyback on returning data)             ServiceTime = interarrival time of cross traffic
    Cross traffic and ACK traffic are scarce           (timeout server)                          ServiceTime = ACK aggregation timeout
    ACK traffic is frequent, cross traffic is scarce   Forward ACKs after K ACKs have arrived    (not recoverable from the scanned figure)
    ACK aggregation is not bottleneck                  (infinite-capacity server)                ServiceTime = 0

    Fig. 1. Multiple sub-models for ACK aggregation. [Table reconstructed from the figure residue; the queueing-diagram column is not recoverable, and parenthesized entries are inferred from the surrounding text.]
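In code, the row selection of Fig. 1 might look like the following sketch (illustrative only; the calibration for the K-ACK row is not recoverable from the scanned figure, so the formula used there is our assumption, flagged below):

    def ack_aggregation_service_time(assumption, *, cross_interarrival=None,
                                     timeout=None, k=None, ack_interarrival=None):
        """Return the calibrated service time (seconds) for the ACK
        aggregation component under a given assumption."""
        if assumption == "cross-traffic-frequent":
            # ACK waits to be piggybacked on the next returning data packet.
            return cross_interarrival
        if assumption == "both-scarce":
            # No piggyback opportunity: the ACK goes alone after the timeout.
            return timeout
        if assumption == "ack-frequent-cross-scarce":
            # Forward ACKs after K ACKs have arrived (calibration assumed here:
            # roughly K ACK interarrival times per aggregated return packet).
            return k * ack_interarrival
        if assumption == "not-bottleneck":
            # Negligible delay: model as an infinite-capacity server.
            return 0.0
        raise ValueError(f"unknown assumption: {assumption}")

    # Example using the 0.087 s aggregation timeout that appears in Fig. 7.
    print(ack_aggregation_service_time("both-scarce", timeout=0.087))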
3. Background on throughput bottlenecks

Throughput bottlenecks are becoming a common complaint as network applications transfer more data. For example, in interactive image transfer applications, the response time is dominated by the network throughput and not by the per-packet delay. Each resource along a path has an inherent throughput limit. When the limit is reached, data cannot flow any faster through the resource. If the offered throughput is less than the limit, then the resource has no effect on the throughput. To fix a throughput problem, one must first identify the bottleneck resource and then determine how to raise its limit. After raising this limit, some other resource becomes the bottleneck. This process continues until the benefit of increasing throughput is less than the cost of increasing resources. There are three major causes of throughput bottlenecks:

(1) Lack of resources. Resources can be limited because of physical constraints or because of competition from other users. A long queue of packets waiting to use the resource is a symptom of an overused resource. To fix the problem, either more resources should be added or the competition should be removed. The Slow Trunk Example (Section 3.1) has this type of bottleneck.

(2) Window too small. Window-based protocols limit the number of outstanding packets. If the window size is too small to compensate for the per-packet roundtrip time, then the throughput will be limited (see the worked relation after this list). Frequent window closings are a symptom that a flow is window limited. To fix a window problem, either the window has to be increased or the per-packet delay must be decreased. In Section 7, the Multiple Parameters Example illustrates a window blocked problem.

(3) Inefficient use of resources. Network algorithms can use resources inefficiently. For example, a retransmission timer can be set to fire before the per-packet roundtrip time. This leads to unnecessary retransmissions and robs the system of transmission bandwidth. Fixing inefficiency problems may be as simple as changing a protocol configuration parameter or as difficult as changing the protocols themselves.

The ANT prototype can handle window problems and lack of resources, but its techniques could be expanded to handle inefficient uses of resources.
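For cause (2), a window-limited flow obeys a simple closed-loop relation; this is Little's law applied to the window of outstanding messages, which the paper uses numerically but does not state as a formula:

    Throughput = WindowSize / RoundtripTime

In the Slow Trunk Example below, 7 messages / 1.4 s comes to about 5 messages per second, matching the simplified model of Section 5. Raising the window or lowering the roundtrip time raises this ceiling until the capacity of the slowest resource binds instead.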
3.1. Slow trunk example

The following test case is referenced throughout the paper to help illustrate the troubleshooting techniques [12]. This example was chosen because it is easy to explain. In a later section, we will briefly examine a more difficult problem with multiple bottleneck parameters. In this example, the throughput bottleneck is a 9.6 kb trunk line between two Packet Switching Nodes (PSNs). The bottleneck causes a window of messages (7) to queue up behind the trunk, resulting in a roundtrip delay of 1.4 s. The topology of the test lab network is shown in Fig. 2.
Fig. 2. Physical topology for the slow trunk example. [Figure: the source host connects over a 56 kb access line carrying X.25 data to the source PSN (PSN 89), which connects over the 9.6 kb trunk to the destination PSN (PSN 88); the destination host attaches to PSN 88 over a 56 kb access line, also carrying X.25 data. The returning RR and IACK acknowledgment flows are indicated.]
Data messages are sent from the source host to the destination host, and acknowledgments are sent back from the destination host to the source host. There is no other traffic competing with the flow under test. Table 1 shows some of the statistics collected. The complete statistics collection represents thousands of data points, but only a few are relevant in our case. Some data points cannot be used to identify throughput bottlenecks, while others represent normal behavior. Table 1 includes those data points needed to explain the example and omits some data points that are used by ANT to troubleshoot the problem.
4. Pattern matching

ANT looks for patterns in the network statistics that imply the location of bottlenecks. Each pattern is encoded in one or more declarative rules. A rule has two parts. The Left Hand Side (LHS) matches statistics against different thresholds. The Right Hand Side (RHS) marks the functional topology with hints about components it thinks are "bottlenecks" and "non-bottlenecks". All the rules are tested, and each one that fires (matches its LHS) marks the functional topology with its hint (executes its RHS). For each component, a tally is kept of how many rules marked it as a bottleneck and how many did not. If two rules have different conclusions about the same component, they cancel each other.
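A minimal sketch of this rule structure follows (all identifiers are ours; the paper does not describe ANT's actual rule encoding). Each rule carries an LHS predicate over the statistics and an RHS list of topology marks, and opposing marks cancel in the tally:

    from collections import defaultdict

    class Rule:
        def __init__(self, lhs, rhs):
            self.lhs = lhs   # predicate over the statistics dictionary
            self.rhs = rhs   # list of (component, "bottleneck"/"non-bottleneck")

        def fire(self, stats, tally):
            if self.lhs(stats):                    # LHS: match statistics
                for component, hint in self.rhs:   # RHS: mark the topology
                    tally[component][hint] += 1

    def run_rules(rules, stats):
        tally = defaultdict(lambda: {"bottleneck": 0, "non-bottleneck": 0})
        for rule in rules:
            rule.fire(stats, tally)
        # Opposite conclusions about the same component cancel each other.
        hints = {}
        for component, votes in tally.items():
            net = votes["bottleneck"] - votes["non-bottleneck"]
            hints[component] = ("bottleneck" if net > 0 else
                                "non-bottleneck" if net < 0 else "no-opinion")
        return hints

    # Two opposing rules about the same component cancel in the tally.
    demo = [Rule(lambda s: s["util"] > 0.7, [("trunk", "bottleneck")]),
            Rule(lambda s: s["queue"] == 0, [("trunk", "non-bottleneck")])]
    print(run_rules(demo, {"util": 0.9, "queue": 0}))  # {'trunk': 'no-opinion'}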
Fig. 3. Marked protocol layers. [Figure: the functional topology of the slow trunk example drawn as layers (X.25 L3, Local L3, EE, SF, L2 trunk processing, and physical link) spanning the source host, source PSN, destination PSN, and destination host. Bold lines mark components the rules consider bottlenecks, dotted lines mark non-bottlenecks, and regular lines mark components with no opinion; the location of the actual bottleneck is indicated.]
ANT uses rules differently from their traditional use in troubleshooting, which is to relate symptoms directly to causes. Here rules are only hints at where to look first; the performance model itself is used to determine the cause. We will now present three patterns in the statistics for the Slow Trunk Example (Table 1). Figure 3 shows in graphical form the results of the rules marking the functional topology. The bold lines represent where the rules indicate possible bottleneck components. The dotted lines represent where the patterns indicate non-bottleneck components. The regular lines represent components for which the rules have no opinion. The three patterns correctly implicate the real bottleneck, which is located at the service interface between the source PSN's trunk processing and the physical link. The next phase, model simplification, will prove that the service interface is indeed the bottleneck. Note that in this example, if the bottleneck were in a different location, these specific rules would not fire, but other rules would detect the location of the bottleneck.

The first pattern is concerned with the network statistics at a global level. It looks at the probability that the source host is being blocked because the window has closed. A BBN X.25 network has two windows. The RR window has end-to-end significance and is used by the hosts for flow control. The IACK window is internal to the network and is used to regulate retransmissions. If the IACK window closes, then the bottleneck
Table 1
Condensed statistics for the slow trunk example

    Device parameter                  Value
    Source Node (PSN 89)
      Message Data Length             2048
      Messages Per Sec                4.0
      IACK Round Trip Time            1.4672
      Trunk Bit Rate                  9443.3
      PSN Utilization                 0.05
      Prob IACK Window Closed         0.4
      Prob RR Window Closed           0.4
    Destination Node (PSN 88)
      Trunk Bit Rate                  1421.0
      PSN Utilization                 0.05
is in the X.25 network, which is the case in this example.

A rule that fires for the first pattern looks at the Prob IACK Window Closed statistic for the source host. The rule states: if the probability is above a threshold (0.3), then mark the X.25 network as the bottleneck and the hosts as non-bottleneck. The top two layers of Fig. 3 show how the components are marked. The X.25 L3 layer represents the end-to-end connection. The components for the source and destination hosts are marked as non-bottleneck, and the logical link between them (i.e. the X.25 network) is marked as bottleneck. The Local L3 layer can be thought of as a detailed view of the upper layer's logical link. On this layer we do not suspect the host access lines or their protocol processing; they are marked as non-bottleneck. The logical link between the Local L3 processing boxes is the suspected bottleneck and is marked as such. Notice that the rule does not continue to mark the remaining layers of the host access lines. The logical link between the Host and the PSN Local L3 processing boxes implies this marking: the lower layer components inherit the marking from their higher layer logical link, so no additional marking is necessary.

A rule that does not fire for the first pattern looks for the possibility that the RR window closes while the IACK window does not. This would imply that the destination host is slow in returning the RRs, but the network's internal IACKs are returning unimpeded. The rule would
mark the destination host and its access line as bottleneck, and the source host and network as non-bottleneck. Comparing the last two rules shows that different network statistics imply different bottlenecks.

The second pattern is concerned with a specific type of device and tests whether it can be marked as non-bottleneck. It looks at the utilization of the processors in the PSNs. Devices with high utilization tend to be bottlenecks, while devices with low utilization tend not to be bottlenecks. In this case, the PSNs have low CPU utilization, so their protocol processing components should not be the bottleneck. A rule that fires for the second pattern looks at PSN Utilization to determine whether it is below a threshold (0.3). It marks the SF and EE protocol processing boxes as non-bottleneck (Fig. 3). This rule fired twice, once for each PSN.

The third pattern is concerned with a specific type of device and tests whether it can be marked as bottleneck. It looks at the resource utilization of trunks. If a trunk has high utilization, then it is a likely bottleneck. Link utilization is not measured directly, but must be derived by combining traffic statistics and configuration information. Utilization is the amount of data flowing over the link divided by its bandwidth. Note that if the configuration information is wrong, then the utilization calculations will be wrong and so will the model. Since configuration information is hand-entered into a network management database, there is a chance that it is not consistent with the real network configuration. When ANT is run, a side benefit is an implicit consistency check between the stored configuration and the real configuration. A rule that fires for the third pattern looks at the derived value for the Trunk Utilization and checks whether it is above a threshold (0.70). Trunk Utilization was calculated by dividing Trunk Bit Rate by Trunk Bandwidth. The rule marks the service layer between the L2 trunk processing box and the PSN physical line. This is the location of the queue for packets waiting to be transmitted. The service time of the queue depends on the transfer rate of the physical line. This is the bottleneck component in the Slow Trunk Example, which will be verified in the model simplification phase.
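Written out, the three patterns reduce to threshold tests over Table 1. The following self-contained sketch uses the thresholds quoted in the text; the component names are illustrative:

    # The three patterns of Section 4 applied to the Table 1 values.
    stats = {
        "prob_iack_window_closed": 0.4,   # Table 1, source node
        "psn_utilization": 0.05,          # Table 1, both PSNs
        "trunk_bit_rate": 9443.3,         # Table 1, source node
        "trunk_bandwidth": 9600.0,        # from the configuration database
    }

    hints = {}
    # Pattern 1: IACK window closes too often => the X.25 network is suspect.
    if stats["prob_iack_window_closed"] > 0.3:
        hints["x25-network"] = "bottleneck"
        hints["source-host"] = hints["destination-host"] = "non-bottleneck"
    # Pattern 2: lightly loaded PSN CPUs are not the bottleneck.
    if stats["psn_utilization"] < 0.3:
        hints["psn-protocol-processing"] = "non-bottleneck"
    # Pattern 3: trunk utilization is derived, not measured:
    # 9443.3 / 9600 = 0.98 > 0.70, so the trunk's output queue is suspect.
    if stats["trunk_bit_rate"] / stats["trunk_bandwidth"] > 0.70:
        hints["trunk-service-interface"] = "bottleneck"

    print(hints)

The derived utilization of 0.98 is what implicates the trunk's service interface; the same arithmetic is also the implicit consistency check on the stored configuration mentioned above.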
Fig. 4. Validating fault assumptions. [Figure: the real traffic entering the faulty network is measured as traffic statistics, which feed a traffic model built from traffic assumptions and a priori knowledge about traffic. The resulting expected traffic drives a model of the network built from configuration parameters, measured system parameters, fault assumptions, and a priori knowledge about network behavior. The model's response is compared with the faulty network's response to decide: valid assumptions?]
5. Model simplification

ANT creates a simple model of the bottleneck process and validates it against network statistics. Validating a model is done by comparing its predictions with the measured behavior of the network (Fig. 4). If the assumptions about which components are non-bottleneck are correct, then the model's predictions should match the network statistics. This relation can be used to validate assumptions about which components are non-bottleneck.

Using the bottleneck submodel for all components yields the most detailed model for the path, but it is also expensive to run and difficult to understand. The detailed model should correctly predict network behavior for any situation within ANT's domain. If the measured view and predicted view differ, then there is a bug in either the model, the statistics collection facility, or the network software. During the development of the ANT prototype, this comparison uncovered bugs in all three subsystems. The most detailed model is run once to verify that ANT can predict the network behavior. The goal of model simplification is to generate a simpler model that can correctly predict the behavior.

The model simplification process starts with a very detailed model of the network and removes detail until it finds a simple model. The search algorithm is sped up by using the hints generated
by the pattern matcher about the location of the bottlenecks. Also, the hierarchy of logical links is used to group the components into trees. The specific algorithm is tailored to create simple models of throughput bottlenecks; the general technique of model simplification is applicable to modeling other types of network behavior, such as per-packet delay.

Figure 5 shows the simplified model that ANT generates for the Slow Trunk Example. The model assumes that a large file is being transferred between the source and destination hosts, so there will always be traffic available from the source host. The one component that is modeled is the service layer between the source PSN and the 9.6 kb trunk. A queue is implemented here and stores packets waiting to be sent across the trunk.
Fig. 5. A simple model of the bottleneck process. [Figure: a closed queueing model in which the window size (7) sets the number of circulating tokens around a single queue. ServiceTime = PacketLength / BitsPerSecond = 2048 / 9600 ≈ 0.2 s; Throughput = 1 / ServiceTime ≈ 5 mps; Delay = Qlength × ServiceTime = 0.2 × 7 = 1.4 seconds.]
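The arithmetic of Fig. 5, together with the validation step, can be run as a check (a sketch; the 25% tolerance used for "close enough" is our illustrative choice, not ANT's actual criterion):

    # Simplified Slow Trunk model of Fig. 5.
    packet_bits, line_bps, window = 2048, 9600, 7

    service_time = packet_bits / line_bps   # ~0.213 s per message
    throughput = 1.0 / service_time         # ~4.7 messages/s ("roughly 5")
    delay = window * service_time           # ~1.5 s (about 0.2 * 7 = 1.4 s)

    # Validation compares the predictions with the measured statistics.
    measured_throughput, measured_delay = 4.0, 1.4672   # Table 1
    ok = (abs(throughput - measured_throughput) / measured_throughput < 0.25
          and abs(delay - measured_delay) / measured_delay < 0.25)
    print(round(throughput, 2), round(delay, 2), ok)    # 4.69 1.49 True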
The service rate of the queue is determined by the average message size divided by the bandwidth of the line. The model predicts a throughput of roughly 5 messages per second, which is close to the 4.0 messages per second measured by the statistics. Additional details can make this model even more accurate, such as adding the message header to get the true message length. Other evidence that the model is correct is that the delay predicted by the model (1.4 s) is almost the same as the measured delay statistic. Figure 6 shows the actual predicted values, which used the additional details.

We will now briefly explain some of the details of model simplification. Each component is marked with a 3-tuple of indicators. The first slot holds the hint given by the pattern rules. The second slot holds the current assumption as to whether the component is bottleneck or non-bottleneck. The third slot holds whether the assumption has been proved correct or is still hypothetical. The simplification process ends when all components' assumptions are marked correct. A test model is specified by setting the assumption slots to specific values. If the model correctly predicts measured behavior, then all the components marked as non-bottleneck are correct. If the model does not predict the measured behavior, then one of the components marked non-bottleneck and hypothetical is really a bottleneck component. Note that several components can be tested for non-bottleneck in one run. The trick is to test the assumptions in groups; the protocol hierarchy and patterns are used to determine good groupings.

The search algorithm uses a breadth-first, most-likely-first scheme which explicitly tests the assumptions for each component. The search starts at the highest level of the functional topology. The components are tested in order according to how they were marked by the patterns: first the non-bottleneck, then the no-opinion, and finally the bottleneck. When a logical link is tested, all its subcomponents are set to non-bottleneck. If the model is correct, then all the subcomponents are confirmed as non-bottleneck. If the model is incorrect, then the logical link is confirmed as containing a bottleneck, and its subcomponents will be further investigated at lower levels.
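The search loop itself is small. Here is a sketch following the description above (hypothetical structure: children encodes the logical-link hierarchy, hints holds the pattern marks, and model_is_correct stands for a run of a test model against the measured statistics):

    def simplify(children, hints, model_is_correct, root):
        """Breadth-first, most-likely-first search for a simple model.
        children[c] lists c's subcomponents ([] for leaves);
        model_is_correct(group) builds a test model with every component in
        `group` assumed non-bottleneck and checks its prediction."""
        order = {"non-bottleneck": 0, "no-opinion": 1, "bottleneck": 2}
        frontier, bottlenecks = [root], []
        while frontier:
            frontier.sort(key=lambda c: order[hints.get(c, "no-opinion")])
            component = frontier.pop(0)
            group = [component] + children.get(component, [])
            if model_is_correct(group):
                continue                         # whole group cleared in one run
            if children.get(component):
                frontier += children[component]  # descend into subcomponents
            else:
                bottlenecks.append(component)    # a leaf that must stay modeled
        return bottlenecks

Because a whole group can be cleared by a single model run, the number of model evaluations stays far below one per component.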
6. Sensitivity analysis

Sensitivity analysis involves perturbing the model's parameters and noting the changes in end-to-end throughput. If a change in a parameter's value results in a change in the predicted throughput, then throughput is sensitive to that parameter. The model is run once for each parameter to measure its sensitivity. The number of parameters that must be checked is greatly reduced by model simplification, because only the parameters in the bottleneck components need to be tested. Also, the simpler model takes less time to run than the detailed model, further reducing the cost of sensitivity analysis.

When testing sensitivity, the value of the parameter is changed in the direction of lowering the component's capacity. For example, a bandwidth parameter is changed to a lower bandwidth value, while a propagation delay is changed to a higher delay value. Lowering the capacity of the bottleneck component always reduces the throughput, but raising the capacity of a bottleneck component may make it no longer the bottleneck component. A rule of thumb for testing the cause of throughput bottlenecks is: to test if something is a bottleneck, lower its capacity; to test if it is not a bottleneck, raise its capacity.

Sensitivity is normalized to the magnitude of the parameter and the throughput:

    Sensitivity = %ChangeInThroughput / %ChangeInParameter
This formula for sensitivity has some advantages and disadvantages. First, it is unitless, so all parameters can be compared against each other. Second, throughput tends to be multiplicatively related to the parameter values; for our example, doubling the bandwidth of the line will double the throughput. Hence the most sensitive parameters have a sensitivity with a magnitude around one, and insensitive parameters are near zero. The disadvantage of this normalization is what to do when the parameter value is zero or when it is not multiplicatively related to throughput. For example, per-packet delay is multiplicatively related to throughput when the system is window blocked. But per-packet delay is the sum of all the component delays along the path, and some of the component delays can be zero.
The normalization of sensitivity is still an open question. Ultimately, sensitivity should be normalized in terms of costs and benefits, which is the subject of the next section.

Tunable parameters are the ones that the network operator can change. For example, the service time of the queue in Fig. 5 depends on trunk bandwidth, message size, and header size, but only the line bandwidth can be changed by the network operator. (Note that the message size can be changed by the host administrator, and the header size can be changed only by the standards body.) Tunable parameters are marked when the model is created, and sensitivity analysis is done only on tunable parameters.

Figure 6 is the textual output of the ANT prototype. The output of Phase 1 is a picture similar to Fig. 3, with possible bottleneck components marked with different colors instead of different line types. The Phase 2 results show the throughput and delay predicted by the most detailed performance model. The model accurately predicts both the throughput and delay of the network. The result of Phase 3 is a list of parameters that can be tuned to increase the throughput. The bandwidth of the trunk between node 89 and node 88 has a sensitivity of one, so doubling the bandwidth of the line may double the throughput of the network. Note that increasing the window size does not change the throughput of the system, but it will increase the delay. Hence the window size's sensitivity to throughput is zero and its sensitivity to delay is one.

    Running ANT Collection 4
    PHASE 1: PATTERN MATCHING
      (see Figure 3)
    PHASE 2: MODEL SIMPLIFICATION
      Throughput model 4.33, measured 4.0. Delay model 1.44, measured 1.47
      Model Succeeds
    PHASE 3: SENSITIVITY ANALYSIS
      bits_per_second line n89/n88 (data), Value = 9600, Sensitivity = 0.99
      window_size [h3,n89], Value = 7, Sensitivity = 0.0

    Fig. 6. Output of ANT prototype.
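Phase 3 can be stated compactly. The sketch below (our illustration; the parameter names follow Fig. 6, and the toy model stands in for ANT's simplified queueing model) perturbs one parameter at a time toward lower capacity and normalizes by the formula of Section 6:

    def sensitivity(model, params, name, step=-0.10):
        """Normalized sensitivity: %change in throughput over %change in
        the parameter, perturbing toward lower capacity (step < 0)."""
        base = model(params)
        perturbed = dict(params, **{name: params[name] * (1.0 + step)})
        return ((model(perturbed) - base) / base) / step

    # Simplified Slow Trunk model: throughput is line-limited, so the
    # window size does not appear in the formula at all.
    def slow_trunk(p):
        return p["bits_per_second"] / p["packet_bits"]

    params = {"bits_per_second": 9600.0, "packet_bits": 2048.0,
              "window_size": 7.0}
    print(sensitivity(slow_trunk, params, "bits_per_second"))  # 1.0
    print(sensitivity(slow_trunk, params, "window_size"))      # 0.0

The line-bandwidth sensitivity comes out 1.0 against the 0.99 of ANT's more detailed model, and the window's comes out 0.0, matching Fig. 6.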
7. Cost-benefit and multiple tunable parameters

The Slow Trunk Example has only one parameter that can be tuned to increase throughput. When more than one parameter can be adjusted, the operator must decide which parameter is the most cost effective. This is a highly constrained and volatile problem. Some of the mappings are quite simple, such as determining the tariff for a leased line. Most of the tradeoffs are political because they involve interactions between human groups. For example, tuning parameters for one user may actually hurt the performance of another user. Also, equipment availability may depend on "exchanging of favors" between different support groups. The cost/benefit mapping is too irregular to be automated using a model-based system (deep reasoning) and is more appropriate for a rule-based system (shallow reasoning). The ANT prototype defers the cost/benefit mapping to the operator, but its output could be fed into another system that does the mapping.

To illustrate this point, we present the results of another test case where the path goes over a satellite line with a window size of two. The bottleneck is that the path is window blocked. The solution is either to increase the window or to decrease the per-packet delay. Sensitivity analysis (Fig. 7) shows that there are three viable solutions: the window size could be increased, the satellite line could be changed to a terrestrial line
(lower propagation delay of the trunk), or an ACK aggregation timer could be reduced.

    Running ANT Collection 14
    PHASE 1: PATTERN MATCHING
    PHASE 2: MODEL SIMPLIFICATION
      Throughput model 2.7764, measured 2.7
      Delay model 0.5744, measured 0.5609
      Model succeeds
    PHASE 3: SENSITIVITY ANALYSIS
      window_size [h3,n89], Value = 2, Sensitivity = 0.988
      prop_delay link n88/n89 (data), Value = 0.242, Sensitivity = -0.345
      prop_delay link n88/n89 (ack), Value = 0.242, Sensitivity = -0.345
      x25_l3_ack_piggyback_timeout [n88], Value = 0.087, Sensitivity = -0.126
      bits_per_second link n88/n89 (data), Value = 51200, Sensitivity = 0.029
      bits_per_second link n88/n89 (ack), Value = 51200, Sensitivity = 0.004

    Fig. 7. Before changing a tunable parameter.

Raising the bandwidth of the trunk will have almost no effect, because transmission time represents only the modest delay of clocking a packet out the trunk. Raising the window size or lowering the propagation delay would offer the most relief, but it may not be possible to change them. For example, changing the window size involves a modification to the host configuration, which is outside the administrative domain of the network operators. It may be physically impossible to get a terrestrial line, such as for users in the Pacific Islands. Changing the ACK aggregation timer may be the most cost effective change, since it is under the administrative control of the network operators and can be made immediately. The disadvantage is that slightly more ACK packets will be sent. The timer has a sensitivity of -12%, which means that if the timer were set to zero we would expect a 12% increase in throughput. When the aggregation parameter was set to zero, the throughput increased by 18%, a little better than expected. ANT identifies which parameters are causing the problem and the increase in throughput associated with changing each parameter. The ANT performance model can be used by the operator to explore different parameter settings without disrupting the operational network. The operator can then use this knowledge to help make cost/benefit tradeoffs and ultimately fix the problem.
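As a rough check on that estimate (our arithmetic, using the Phase 3 output in Fig. 7): setting the timer to zero is a -100% change in the parameter, so the linear prediction from the Section 6 formula is

    %ChangeInThroughput ≈ Sensitivity × %ChangeInParameter = (-0.126) × (-100%) ≈ +12.6%

which matches the roughly 12% quoted above; the measured 18% shows that the linear estimate was conservative in this case.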
8. Conclusion

This paper has presented a technique for using executable models to troubleshoot throughput bottlenecks in computer networks. The approach takes advantage of the characteristics of throughput bottlenecks to reduce the computational resources needed for automatic troubleshooting. A performance model is tightly coupled with the statistics collected from the network to facilitate easy comparison. The model is optimized for execution time by customizing the model's detail to the network's context. Several techniques were shown for using the model to test assumptions about network behavior and to determine the cause of the bottleneck. The output of the
troubleshooting process is the set of analytic tools needed to make the tradeoffs required to fix the fault. The ANT project is a prototype of this technique and shows the feasibility of automating the troubleshooting of performance faults.
Acknowledgments We would like to thank the Automatic Network Troubleshooter development team: Edward Black, Jean-Louis Flechon, Peg Primak, C.G. Venkatesh, along with technical guidance from Dan Friedman and management guidance from Susan Bernstein and Steve Cohn.
References

[1] BGS Systems Inc., Best-SNA Performance Analysis Package, Commercial Product, 1983-present.
[2] E. Elsam and J. Mayersohn, Using AI to plan the defense data network, Defense Electron. (June 1985) 175-184.
[3] J. Etkin and J.A. Zinky, Development life cycle of computer networks: the executable model approach, IEEE Trans. Software Eng. (September 1989) 1078-1089.
[4] J. Fléchon and J. Zinky, Knowledge-based generation of multi-level models for troubleshooting computer network performance, in: AI and Simulation (MultiConference), San Diego, CA (Society for Computer Simulation, January 1989).
[5] B. Hitson, Knowledge-based monitoring and control: an approach to understanding the behavior of TCP/IP network protocols, in: SIGCOMM'88 (ACM, New York, NY, August 1988).
[6] M. Leib, Intelligent Gateway Troubleshooter, ANM Technical Note 7, BBN, April 1988.
[7] M. Livny, DeNet User's Guide, Computer Science Department, University of Wisconsin-Madison, 1987-present.
[8] T.E. Marques, A symptom-driven expert system for isolating and correcting network faults, IEEE Comm. Mag. (March 1988) 6-13.
[9] R. Mathonet, H.V. Cotthem and L. Vanryckeghem, DANTES, in: International Joint Conference on Artificial Intelligence, IJCAI-87 (August 1987) 527-530.
[10] B. Roberts, A network analyst's assistant, in: ICC'88 (IEEE Communication Society, 1988).
[11] W. Sayles and J. Thomas, Finding and fixing network faults with an expert system, Data Comm. (June 1988) 149-165.
[12] J.A. Zinky, An example of automatically troubleshooting a throughput bottleneck using model based techniques, in: ICC'89 (IEEE Communication Society, 1989) 1448-1453.

Joshua Etkin received the B.Sc. E.E. degree from the Technion - Israel Institute of Technology, Haifa, in 1971, the M.Sc. E.E. degree from Tel Aviv University, Tel Aviv, Israel, in 1977, and the Ph.D. degree from Ben-Gurion University, Beer-Sheva, Israel, in 1985. He has been a member of the faculty of the College of Engineering at Boston University since 1983, teaching computer communication, local area networks, switching and ISDN, computer architecture, operating systems, and software engineering. From 1971 to 1983 he was a system designer and Department Manager in the Research and Development Division of Telrad, Israel. He designed and managed projects in the fields of remote control, SPC exchanges, and integrated voice/data communications. His research interests include distributed computer architecture and operating systems, local area networks for real-time applications, and management and testing of computer networks and ISDN.

John A. Zinky received the B.S. degree in electrical engineering from Northwestern University in 1980, the M.S. degree in computer science from the University of California at Davis/Livermore in 1983, and the Ph.D. in systems engineering from Boston University in 1989. From 1980 to 1983, he worked at Lawrence Livermore National Laboratory on distributed data acquisition and control systems. Since 1983, he has been working at BBN Communication Corporation in Cambridge, Massachusetts. His projects include performance analysis of network devices, stabilizing delay-based SPF routing, and the Automatic Network Troubleshooter project. His research interests include network management, performance analysis, and model-based reasoning.