Computer Communications 24 (2001) 19–34
www.elsevier.com/locate/comcom
A selective attenuation feedback mechanism for rate oscillation avoidance

N. Li*, S. Park, S. Li

Department of Electrical and Computer Engineering, Engineering Science Building 143, University of Texas at Austin, Austin, TX 78712-1084, USA

Received 4 September 2000; accepted 4 September 2000
Abstract

We study the delay-related rate oscillation within Diffserv. Rate oscillation can come from large round trip latency, and enforcing low feedback overhead can also cause it. The Selective Attenuation Feedback via Estimation (SAFE) mechanism is proposed to reduce the oscillation while maintaining fast response to network dynamics. SAFE requires no per-flow accounting; furthermore, a hashing technique keeps the operating overhead of SAFE to a minimum. System analysis supports the effectiveness of SAFE. Simulation results also show that SAFE significantly reduces rate oscillation, and therefore achieves high link utilization and small queue size while maintaining very low control overhead. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Rate oscillation; Flow diversity; Feedback overhead; Differentiated service; Fairness
1. Introduction

Rate oscillation is a long-standing problem for ABR feedback control in ATM networks. As different flows have diverse Round Trip Times (RTTs), they react to the feedback control at different time instances. The interference between a new control decision and delayed reactions to earlier ones can cause long-lasting rate oscillation. Several approaches have been proposed to solve this problem [1,13,20]; however, they all involve tremendous overhead at interior nodes. It is not reasonable to spend so many resources on ABR traffic, since it is intended to be the economic class.

With its principles of scalability and low management cost, Differentiated Service (Diffserv) [3,4] has emerged as a promising architecture for deploying quality of service in the Internet. It is based on the notion of flow aggregation at boundary nodes: per-flow processing complexity is pushed to the edge, so the network maintains a simple core. However, as recent studies have shown, once traffic is aggregated it is difficult to provide individual flows with appropriate service guarantees and fairness [5,6]. Diffserv suffers from problems with quantitative service guarantees and per-flow fairness. The fundamental reason is the lack of cooperation between boundary nodes and interior nodes: only the interior nodes have knowledge of the available resources, and only the boundary nodes are

* Corresponding author. Fax: +1-512-471-5532. E-mail addresses:
[email protected] (N. Li),
[email protected] (S. Park),
[email protected] (S. Li).
acting as the network guard by doing admission control and per-flow policing.

Recently, two fundamentally different schemes have emerged for enabling cooperation between Diffserv boundary and interior nodes to improve performance. One works in the data plane by injecting state information into packet headers; the other works in the control plane and relies on dynamic feedback. Dynamic Packet States (DPS) [18,19] achieves cooperation based on the first scheme. Each packet carries per-flow state in its header, initialized by the boundary nodes, and interior nodes use the header state to process incoming packets. In essence, DPS attempts to approximate a "stateful" network, such as Intserv. Although it presents an innovative and flexible framework, DPS reintroduces significant complexity to interior nodes. Core-Stateless Fair Queueing (CSFQ) [17] is a queue management mechanism for approximating fair allocation based on the idea of DPS. It periodically estimates the link fair share, and for every incoming packet it calculates a drop probability using the per-flow information found in the packet header as well as the link fair share. To store this information, CSFQ requires modifying the standard IP header. It further requires significant computational resources, since calculation and specific processing are needed for every packet. As a result, this scheme is not scalable, incurs significant overhead in the forwarding path, and differs in spirit from the Diffserv framework. Therefore, we prefer to enhance Diffserv in the control plane and rely on dynamic feedback. The works in Refs.
0140-3664/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0140-3664(00)00287-5
Fig. 1. The general procedure of a feedback mechanism (sender to receiver across the DS domain: the data path through boundary and interior nodes, with the forward control path marked 1 and the backward control path marked 2).
[5,11] demonstrate the significant performance improvement obtained by introducing feedback control into the Diffserv network. We argue, along with others [3,5,14], that a dynamic control mechanism is an important and possibly essential requirement for addressing Diffserv performance issues. To introduce feedback control into Diffserv for these potential benefits, we first have to solve the rate oscillation problem associated with feedback control. To keep the scalability of Diffserv, the control overhead has to be as low as possible, and per-flow processing is prohibited at interior nodes. As a result, solving the rate oscillation problem is even more difficult within Diffserv than within an ATM network.

There are three major causes of rate oscillation within Diffserv. The first is the diversity of RTTs. The second is the diversity of sample intervals (to be defined later) caused by enforcing low control overhead, to be explained below. The third is the interaction between the rate feedback control and the existing TCP congestion control [21]. READ [11] essentially solves the interaction between the feedback control and TCP; we therefore focus on solving the oscillation problem caused by flow diversity in rates and RTTs. The Selective Attenuation Feedback via Estimation (SAFE) mechanism is proposed to provide appropriate feedback to different flows based on their feedback delay, without per-flow accounting. Theoretical analysis and simulation results show that SAFE effectively reduces oscillation. Aggregating control packets of flows that share the same forwarding path can reduce the control overhead of SAFE even further while achieving the same performance. SAFE is a fairly generic solution: it can be applied to ABR feedback control in an ATM network, where the problem is simpler, and it can work with different fair share estimation algorithms.

In the rest of this section, we give a brief description of the architecture that SAFE targets. To support TCP within Diffserv, an adaptive rate regulating traffic conditioner (READ) [11] has been proposed to operate at boundary nodes. READ proactively regulates flow throughput to its max–min fair allocation based on feedback information from the core network. Each interior node periodically calculates its max–min link fair share using control algorithms such as UT and ERICA [8,10]. In this paper, a control algorithm denotes an algorithm that estimates link fair shares given any necessary information. The feedback mechanism sends control packets regularly along the path of each flow to bring the flow fair share from the core network to the flow's traffic conditioner at the edge. In this way, each TCP flow can achieve its minimum rate requirement and share residual bandwidth with other flows according to max–min fairness.

The overhead of a feedback mechanism includes the forwarding and processing of control packets at interior nodes; it is summarized by the ratio of the amount of control packets to data packets. To limit this overhead, the control packets of a low rate flow are sent out at a lower frequency. Control packets can be sent on a packet-count basis or on a time-period basis; either way, low rate flows only sample the network information once over a large time interval. High rate flows, however, still sample the network dynamics
Fig. 2. The representative timeline for the feedback mechanism (control and data packets between the src boundary, the bottleneck link, and the dst boundary, at time instants a through h).
frequently. This diversity of sample intervals can cause permanent rate oscillation in the network. Rate oscillation results in large queueing delay, increased end-to-end jitter, bursty packet loss, and low bandwidth utilization.

This paper makes three contributions. First, we identify a new problem, namely the rate oscillation caused by enforcing low feedback overhead. Second, the SAFE mechanism we propose is the first to reduce rate oscillation caused by flow diversity with no per-flow accounting; furthermore, a hashing technique is adopted to minimize the processing overhead. Finally, we derive a system model that explains the characteristics of SAFE.

The remainder of the paper is organized as follows: the network model is provided in Section 2; Section 3 analyzes delay-related rate oscillation in a feedback control system; SAFE is proposed in Section 4; the comparison of SAFE with the current feedback mechanism is provided in Section 5; simulation results in Section 6 demonstrate its significant improvements; related work is given in Section 7; the conclusion of our work follows in Section 8.

2. The network model

Packets of a flow enter a Diffserv network through a boundary node, referred to as the source boundary node. When a source boundary node receives the first packet of a flow, it initiates an instance of a traffic conditioner, which installs the service parameters of that flow. The following
packets from that flow are classified by this boundary node and sent to its traffic conditioner. After smoothing out traffic bursts, the traffic conditioner sets a Differentiated Services Code Point (DSCP) in each packet's IP header and injects the packet into the network. Interior nodes determine the service provided to a packet according to its DSCP. Packets of a flow leave the Diffserv network at the flow's destination boundary node; the next hop from the destination boundary node is either the receiver or another network.

Besides the standard Diffserv functions, the traffic conditioner executes a feedback mechanism (Fig. 1). It sends out control packets regularly along the forwarding path (marked as 1) to probe the flow fair share. We denote the fair share of flow i at time n as FS_i(n), which is the minimum of the link fair shares along the path. When a control packet reaches its destination boundary node, it is bounced back; the backward path (marked as 2) may differ from the forwarding path. In this paper, the notions of node and link are interchangeable. An interior node k periodically calculates its link fair share FS_k(n). When the interior node receives a forward control packet, it updates the fair share field of the packet; there is no special processing for backward control packets. Different information can be fed back to different flows depending on how the interior node processes the forward control packets. The fair share update method differentiates SAFE from the current feedback mechanism [8,12], referred to in this paper as the Simple Feedback (SF) mechanism. The interior node in SF compares the fair share in the control packet with its current link fair share, and the minimum value is set in the control packet.

The traffic conditioner of flow i at the source boundary node intercepts the backward control packets. It configures its parameters using the flow allocation FA_i(n), defined as [22]

FA_i(n) = m_i + w_i × FS_i(n),   (1)

where m_i is the minimum rate requirement and w_i is the weight for flow i, and FS_i(n) is the flow fair share along the forward path. The configuration details are provided in Ref. [12]. The network supports both high priority and low priority traffic. The bandwidth left over by the high priority traffic is shared dynamically among the low priority traffic. We only control the low priority traffic, and treat the high priority traffic as background traffic.

The control interval and sample interval will be used frequently in this paper. To clarify their meaning, we provide the following definitions.

Definition 1. The control interval is the time period between two link fair share updates at each interior node. We assume this value is the same for all interior nodes, although their updates are not synchronized.

Definition 2.
The sample interval of a flow is the time
Fig. 3. The control and sample intervals of different flows at a bottleneck link: (a) fair share updates; (b) control packets of a high flow; (c) control packets of a low flow.
interval between its consecutive control packets. The sample interval of a control packet is defined as its flow's sample interval. During its sample interval, a flow does not react to the link fair share updates.
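Eq. (1) is a one-line computation at the traffic conditioner. As a sketch (the function name and the sample numbers are ours, not from the paper):

```python
def flow_allocation(m_i, w_i, fs_i):
    """Eq. (1): FA_i(n) = m_i + w_i * FS_i(n) -- the rate the traffic
    conditioner enforces for flow i after intercepting a backward
    control packet carrying the flow fair share fs_i."""
    return m_i + w_i * fs_i

# A flow with minimum rate 1.0, weight 2.0, and flow fair share 5.0
# is regulated to 1.0 + 2.0 * 5.0 = 11.0 rate units.
print(flow_allocation(1.0, 2.0, 5.0))   # 11.0
```

The minimum rate m_i is always granted; the weight w_i scales the flow's portion of the residual bandwidth signalled by the fair share.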
Fig. 2 illustrates the representative timeline of a flow for the case when its RTT is less than its sample interval. At time a, a control packet of flow i is sent out by its traffic conditioner. The flow fair share is received at the source boundary at time c; FS_i(c) is equal to FS_k(b) in SF, given that link k is the bottleneck. At time d, a packet of flow i conforming to its current flow fair share FS_i(c) reaches its bottleneck link k for the first time. The sample interval of flow i is |e − a|. As mentioned in Section 1, low rate flows have large sample intervals to keep the feedback overhead low. We refer to flows whose sample intervals are larger than or equal to the control interval as low flows, since low rate flows have large sample intervals; otherwise, we call them high flows.

The relation between the control interval and sample intervals is illustrated in Fig. 3. The vertical lines in Fig. 3(a) indicate fair share updates at a bottleneck link, and the time between them is the control interval. The arrows in Fig. 3(b) and (c) represent control packet arrivals of a flow. High flows sample the fair share updates more frequently than low flows do. Suppose every data packet is 1000 bytes and a control packet is sent out every 16 data packets. The overhead is then 5.88%, assuming a router is bottlenecked by the number of packets, not by the bytes processed. For a 1.6 Mb/s flow, the sample interval is 80 ms; the sample interval of a 64 Kb/s flow, on the other hand, is 2 s. The sample interval is thus usually larger, and even more diverse, than the RTT, which is generally less than 300 ms.
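The sample-interval and overhead arithmetic can be checked in a few lines (a sketch; the 16-packet spacing and 1000-byte packets are the values assumed in the text):

```python
def sample_interval_s(rate_bps, data_per_control=16, pkt_bytes=1000):
    """Seconds between a flow's control packets when one control packet
    is sent per data_per_control data packets of pkt_bytes bytes."""
    return data_per_control * pkt_bytes * 8 / rate_bps

# Packet-count overhead: 1 control packet accompanies every 16 data packets.
overhead = 1 / (16 + 1)                      # about 5.88%

print(sample_interval_s(1.6e6))              # 1.6 Mb/s flow -> 0.08 s
print(sample_interval_s(64e3))               # 64 Kb/s flow  -> 2.0 s
```

The 25-fold rate difference between the two flows translates directly into a 25-fold difference in how often they sample the network state.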
3. Analysis of the rate oscillation

In this section, we first analyze the rate oscillation problem caused by large RTTs and by large sample intervals, stating their differences and common points. Given such flow diversity, we then discuss the dilemma faced by any control algorithm based on the SF mechanism, illustrated by a simple example. Finally, we conclude that an enhanced feedback mechanism is needed to achieve fast response without sacrificing stability.
3.1. Delay-related rate oscillation

Within Diffserv, the link fair share has to be updated at interior nodes without per-flow information. The control algorithm at an interior node k can be expressed as

FS_k(n+1) = FS_k(n) + |D_k(n)| × sgn(B_k(n) − AR_k(n)),   (2)

where D_k(n) denotes the step size of the fair share update, B_k(n) is the bandwidth available to the flows bottlenecked at node k, and AR_k(n) denotes the total arrival rate of these flows. The control interval is one unit of time. The step size D_k(n) is algorithm-dependent: when D_k(n) is big, we call the algorithm aggressive; otherwise, conservative. When the updated fair share of flow i is fed back to the traffic conditioner at the source edge, the conditioner adjusts its configuration and controls the next flow arrival rate to be the value defined by Eq. (1). This adjusted arrival rate eventually
reaches the interior node and affects the next fair share update. This is a simplified description: in a real system, the fair share update calculation is done at every interior node, but only the update at the bottleneck link (the minimum among all the updates on the forward path of flow i) is used to adjust the fair allocation at the source edge, so we only need to consider the bottleneck link.

The implicit assumption of Eq. (2) is that AR_k(n) is caused by the current fair share FS_k(n). If AR_k(n) < B_k(n), we can then infer that FS_k(n) should be increased to reach the optimal value. Without this assumption, it is not easy to decide whether FS_k(n) should be increased or decreased given AR_k(n) < B_k(n). Note that a big ratio of the RTT to the control interval can affect the update action in Eq. (2) and therefore result in rate oscillation. If all flows' RTTs are smaller than the control interval, the arrival rate AR_k(n) only depends on FS_k(n); otherwise, AR_k(n) also reflects old fair share values and may lead the update in the wrong direction. RTT consists of propagation delay and queueing delay, so its minimum is system-inherent. Setting the control interval, on the other hand, is an engineering decision. A smaller control interval and/or a larger D_k(n) gives faster response to network dynamics, but a higher risk of rate oscillation; it is a trade-off.

It has not been well addressed in the literature that large sample intervals can also cause permanent rate oscillation. Large sample intervals and large RTTs affect the system in different ways. Flows with large RTTs sample and react to every fair share update, and their actions are delayed by their RTTs. Flows with large sample intervals, on the other hand, only sample and react to a partial set of fair share updates. During their sample interval, they effectively act as background traffic to the control algorithm, since no update information is received; they then suddenly react whenever they receive a backward control packet. This switching keeps changing the network configuration and can result in ever-lasting rate oscillation. Large RTTs and large sample intervals pose the same problem to the control algorithm in Eq. (2): both cause some flows to send packets according to old fair shares instead of the current one. Once the arrival rate AR_k(n) depends on FS_k(m), m < n, the feedback control assumption is invalid and the fair share update may be conducted in the wrong direction.

Fig. 4. Constant oscillation for SF (time evolution of A1(n), A2(n), and A1(n)+A2(n)).

3.2. The dilemma of control algorithms based on the SF mechanism
Recall that an interior node in SF compares the fair share in the control packet with its current link fair share, and the minimum value is set in the control packet. The benefit of such a scheme is its low processing overhead: there is no per-flow accounting, and all control packets are treated in the same way. However, this scheme makes it hard to fully utilize the available bandwidth while avoiding rate oscillation. We focus only on the oscillation caused by large sample intervals, since the problem caused by large RTTs has been studied in many works [1,21].

Consider the case with a mixture of high flows and low flows. If the control algorithm is aggressive, high flows react to network dynamics quickly and the total arrival rate converges to the target before the slow flows react. The fast response can make full use of available bandwidth if the available bandwidth increases (or avoid large queue build-up if it decreases). However, when the low flows react to the control, the current fair share is too big (or too small) for all active flows, and overshoot (or under-utilization) happens. After their large sample intervals, when the low flows react again, the fair share is too small (or too big). Even if the available bandwidth no longer changes, the estimated fair share may never converge, because the low flows keep stimulating the control system. If the algorithm is conservative, the high flows have not converged by the time the low flows react to the control, so the fair share approaches the steady state in one direction without oscillating around the optimal value. Although being conservative can avoid oscillation, such an approach reacts too slowly to traffic dynamics, which can cause long periods of congestion or under-utilization.

3.3. A simple example for SF

Let us look at a simple example using the SF mechanism. Two flows F1 and F2 are bottlenecked at a link, and they share the residual bandwidth equally.
For simplicity, the minimum rate of F1 is included in the background traffic since it never changes, and A1(n) only indicates the dynamic part of the arrival rate at time n. F1 has a high minimum rate requirement, so its sample interval is less than the control interval, i.e. A1(n) = FS(n) always holds. Contrary to F1, the sample interval of F2 is ten times the control interval, i.e. A2(n) = FS(10⌊n/10⌋). Similar to Eq. (2), the adaptive control algorithm can be expressed as

FS(n+1) = FS(n) + |D(n)| × sgn(B(n) − A1(n) − A2(n)).   (3)

Suppose the initial residual bandwidth is 10 and FS(0) = 5. At time n = 0, the residual bandwidth of the link increases from 10 to 30, so the optimal fair share changes from 5 to 15. Figs. 4 and 5 illustrate the time evolution of the arrival rates (excluding the background traffic) after time n = 0; the convergence path of F1 is algorithm-dependent. The control algorithm updates the link fair share to raise the arrival rate to the new residual bandwidth. If the algorithm is not too conservative, we get FS(9) = 25 and B(9) = A1(9) + A2(9) holds. However, at the 10th instance, F2 reacts to the control and the total arrival rate is 50, which is too big. After 9 units of time, we get FS(19) = 5 and B(19) = A1(19) + A2(19). Again, at the 20th moment, the total arrival rate is only 10. Such oscillation lasts forever unless the control algorithm is very conservative. When the algorithm is very conservative, as shown in Fig. 5, oscillation is avoided; however, we then suffer a long period of under-utilization and lose the benefit of dynamic feedback. If, instead, the initial fair share were 25 and the residual bandwidth decreased from 50 to 30 at time n = 0, the queue would keep increasing for a long time before the system converged.

Fig. 5. Slow response with no oscillation for SF.

This example shows that the combination of SF with any control algorithm cannot achieve both fast reaction and system stability. Note that the long-lasting oscillation comes from the full-scale change of the low flows' fair shares. If a fractional update is applied to the low flows, intuitively, the oscillation will finally stop, while the high flows still react to the control as fast as they can. Our solution is based on this idea.
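The example can be replayed with a short simulation (a sketch; the step size 2.5 is an illustrative choice of ours, while the sampling rule A2(n) = FS(10⌊n/10⌋) follows the setup above):

```python
def simulate_sf(steps=31, b=30.0, step=2.5, fs0=5.0, t2=10):
    """Two-flow SF example: A1 tracks FS(n) every control interval,
    while A2 refreshes to the full FS(n) only every t2 intervals."""
    fs, a2 = fs0, fs0
    totals = []
    for n in range(steps):
        if n % t2 == 0:
            a2 = fs                              # low flow reacts, full-scale
        total = fs + a2                          # A1(n) = FS(n)
        totals.append(total)
        diff = b - total
        fs += step * ((diff > 0) - (diff < 0))   # Eq. (3)
    return totals

totals = simulate_sf()
print(totals[10], totals[20], totals[30])        # 50.0 10.0 50.0
```

The total arrival rate keeps swinging between 50 and 10 around the target of 30, matching the constant oscillation of Fig. 4.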
4. The selective attenuation feedback via estimation
The general feedback mechanism was already described in Section 2. In this section, we focus on the fair share update method of SAFE at interior nodes. We first illustrate the key idea of SAFE with the same topology setup as in Section 3, and then present a detailed description of the mechanism and its implementation.
4.1. The key idea illustration of selective attenuation
In this illustration, we make the low flow take only a fractional change, while the high flow still takes the full step. If the updated fair share of the low flow is slightly larger than the optimal value FS*, the total arrival rate still oscillates around the target, as shown in Fig. 6; however, the magnitude of the oscillation decreases, in contrast to the constant oscillation in Fig. 4, and the system finally stabilizes. If, on the other hand, the updated value of the low flow is slightly less than FS*, then the fair share for the high flow keeps decreasing and the one for the low flow keeps increasing (Fig. 7). When their values become equal, the fair shares converge to the steady-state fair share and the total arrival rate equals the available bandwidth.

Fig. 6. Decreased oscillation for selective attenuation.

Fig. 7. Fast response with no oscillation for selective attenuation.

There is no oscillation in either Fig. 5 or Fig. 7; however, selective attenuation achieves a faster response than SF does. This illustration indicates that selective attenuation is an effective way to achieve both fast reaction and system stability. As indicated in Section 5, selective attenuation also effectively reduces the oscillation caused by large RTTs. In the following, we describe the general solution for delay-related oscillation, regardless of whether it comes from large sample intervals or from large RTTs.

4.2. The SAFE mechanism

SAFE consists of two functions. The first measures the delay it takes a flow to react to the feedback control; this feedback delay ranges from the flow's RTT to the sum of its RTT and sample interval. The second decides what fair share value is used to update the control packet, based on the flow's feedback delay. The goal of SAFE is to attenuate the change of the arrival rate when flows with large feedback delay react to the control.
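The fractional-update idea of Section 4.1 can be checked numerically with a variant of the two-flow simulation of Section 3.3 (a sketch; the attenuation factor 0.6 and step size 2.5 are illustrative choices of ours):

```python
def simulate_attenuated(steps=31, b=30.0, step=2.5, fs0=5.0, t2=10, gamma=0.6):
    """Same two-flow setup as the SF example, but the low flow moves only
    a fraction gamma toward the current fair share when it reacts."""
    fs, a2 = fs0, fs0
    totals = []
    for n in range(steps):
        if n % t2 == 0:
            a2 += gamma * (fs - a2)              # attenuated, not full-scale
        total = fs + a2
        totals.append(total)
        diff = b - total
        fs += step * ((diff > 0) - (diff < 0))
    return totals

totals = simulate_attenuated()
# Overshoot shrinks from one low-flow reaction to the next.
assert abs(totals[20] - 30.0) < abs(totals[10] - 30.0)
```

Unlike the SF run, where the total arrival rate swings between 50 and 10 forever, each reaction of the low flow now lands closer to the target of 30, reproducing the decaying oscillation of Fig. 6.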
4.2.1. The feedback delay estimation

The estimation algorithm is based on the key observation that the fair share currently used by a flow can be used to infer its feedback delay. From Fig. 2, we notice that when a control packet of a flow reaches its bottleneck link at time f, the flow is still using the fair share obtained at time b, i.e. FS_k(b). The interior node k can therefore estimate the time b from the position of FS_k(b) in its fair share history. For this example, the feedback delay of the flow is estimated as |f − b|, which is equal to the sample interval |e − a|. If a flow's RTT is less than its sample interval, the estimated feedback delay is the sample interval, which falls within the range of the actual feedback delay. If the RTT is larger than the sample interval, the estimated feedback delay is a multiple of the sample interval, slightly larger than the RTT but less than the RTT plus one sample interval; the estimated value is still within the range of the actual feedback delay. The measurement granularity is one sample interval, since control packets are sent out every sample interval.

For SAFE, a control packet of a flow has two main fields: FS.fwd and FS.fc. FS.fwd is updated by each interior node along the flow's forwarding path and contains the minimum fair share of that path; SF also has this field. The second field, FS.fc, is set by the flow's traffic conditioner at the edge and carries the fair share currently used by that flow. FS.fc is not modified by interior nodes, and it is unique to SAFE. Each interior node keeps its fair share history. When it receives a control packet at time f, it looks for the latest sample time b with

b = max{p | FS_k(p) = FS.fc and p ≤ f}.   (4)

The feedback delay of the control packet is calculated as |f − b|. We use the term time-estimation to denote the procedure of finding the latest sample time b using the field FS.fc of a control packet.

Three points are worth noting. (1) Since FS.fc of a flow is the value previously set by the bottleneck node k, only node k can find a match of FS.fc in its fair share history. When an interior node g does not find a match of FS.fc, the feedback delay of the control packet is assumed to be less than the control interval. If k is still the bottleneck of that flow, the measurement error at node g does not change the control result. (2) Even if a node g was not the bottleneck node when FS.fc was set in the control packet, it might have the same value in its history; as a result, an inaccurate estimation can happen. First, this does not affect the final control result as long as the value set in FS.fwd by node g is larger than the value set by the real bottleneck k. Second, a timestamp can be recorded in the control packet when FS.fc is set, which makes such inaccurate estimation highly improbable. (3) At the bottleneck node k, the measured feedback delay is not unique
when several link fair shares are the same and equal to FS.fc. In this situation, the smallest difference is chosen as the feedback delay, so the estimation may not be accurate. However, our scheme is tolerant of such inaccuracy; we justify this point in Section 4.2.2.

4.2.2. Selective attenuation

From now on, we only consider the control packet handling at the bottleneck link of a flow. If the estimated feedback delay of a control packet is less than or equal to the control interval, the current link fair share is used to update FS.fwd of the control packet, just as SF does. Otherwise, attenuation is performed. Given the current time f and the latest sample time b of flow i, the interior node uses FS_k(l*_i) to update the control packet, with l*_i satisfying

l*_i = f for b = f; otherwise l*_i = argmin_l { F(l) : γ ≤ F(l) ≤ 1, b < l ≤ f },   (5)

where F(l) = (FS_k(l) − FS_k(b)) / (FS_k(f) − FS_k(b)), γ is the attenuation factor, and γ ∈ (0, 1]. According to Eq. (4), FS_k(f) ≠ FS_k(b) always holds for b ≠ f. Since l*_i is guaranteed to be larger than b, flows with large feedback delay are always updated unless the steady state is reached. FS_k(l*_i) is an interpolation of FS_k(b) and FS_k(f), with the constraint that the update value has to be in the fair share history to facilitate the next time-estimation.

There are three alternative approaches other than SAFE. The first uses the link fair share averaged over the interval |f − b|. The second takes l*_i as the interpolation of b and f, and then uses FS_k(l*_i) to update FS.fwd. Two problems are associated with these alternatives: a measurement error in b affects the control directly, and the attenuation extent cannot be controlled tightly unless the link fair share increases or decreases linearly. The third alternative is to use

FS.fc + γ × (FS_k(f) − FS.fc)   (6)

for updating the control packet. This alternative controls the attenuation extent precisely without time-estimation; however, it applies attenuation to all flows, since no feedback delay information is available. On the contrary, our approach automatically distinguishes flows by their estimated feedback delay and selectively attenuates the flows with large feedback delay. In our approach, the update value is calculated using the accurate knowledge of FS_k(b) and FS_k(f), and the attenuation extent is as close to the attenuation factor γ as possible. The feedback delay only defines the feasible set of l, so inaccuracy of the estimation has limited impact on the control result.

Fig. 8. Store the new fair share into Hst and Hash.

4.3. Implementation details
Fig. 9. Time-estimation and finding the update fair share.
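A compact sketch of the two-table scheme shown in Figs. 8 and 9 (the class, its method names, and the concrete values are ours; only the Hst/Hash layout and the selection rule of Eq. (5) come from the paper):

```python
class FairShareHistory:
    """Hst: ring buffer of the latest N link fair shares.
    Hash: binding from a fair share value to its index in Hst."""

    def __init__(self, n):
        self.n = n
        self.hst = [None] * n
        self.hash = {}
        self.idx = 0          # index of the current link fair share

    def store(self, fs):
        """Record a newly computed link fair share (once per control interval)."""
        if self.hst[self.idx] == fs:
            return            # consecutive equal fair shares share one entry
        self.idx = (self.idx + 1) % self.n
        old = self.hst[self.idx]
        if old is not None and self.hash.get(old) == self.idx:
            del self.hash[old]                 # drop the overwritten binding
        self.hst[self.idx] = fs
        self.hash[fs] = self.idx               # latest index wins on duplicates

    def select(self, fs_fc, gamma):
        """Time-estimation plus Eq. (5): the value used to update FS.fwd."""
        b = self.hash.get(fs_fc)
        cur = self.hst[self.idx]
        if b is None or b == self.idx:
            return cur        # delay within one control interval: behave as SF
        best, best_f = cur, 1.0                # F(idx) = 1 is always feasible
        denom = cur - self.hst[b]              # FS_k(f) - FS_k(b), nonzero
        l = b
        while l != self.idx:
            l = (l + 1) % self.n
            f_l = (self.hst[l] - self.hst[b]) / denom
            if gamma <= f_l <= 1.0 and f_l < best_f:
                best, best_f = self.hst[l], f_l
        return best

# The worked example: history 5, 10, 15, 20, 25; FS.fc = 5, gamma = 0.5.
h = FairShareHistory(8)
for v in [5.0, 10.0, 15.0, 20.0, 25.0]:
    h.store(v)
print(h.select(5.0, 0.5))    # 15.0
```

With FS_k(b) = 5 and FS_k(f) = 25, the candidates 10, 15, 20, 25 give F(l) = 0.25, 0.5, 0.75, 1.0; the smallest value no less than γ = 0.5 selects the fair share 15, matching the example in the text.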
Each interior node has two tables: Hst, which holds its fair share history, and Hash, which performs time-estimation with a single table lookup. Hst is of size N and stores the latest N link fair shares; the value of N is determined by the ratio of the maximum sample interval to the control interval. A variable idx keeps the table index of the current link fair share in Hst. Hash contains bindings that map link fair shares to their table indexes in Hst, using the link fair share as the key. Both Hst and Hash are initially empty. The procedure for storing a new fair share into Hst and Hash is illustrated in Fig. 8. Whenever a new fair share FS^k_f is
Fig. 10. The equivalent model of the feedback system for SF.
generated, it is compared with Hst[idx]. If their values are the same, nothing is done; consecutive fair shares with the same value are stored in the same entry. Otherwise, idx is incremented by one and wrapped around according to the table size N. Before setting the value of Hst[idx] to the new fair share, the interior node first checks whether this entry is already occupied.

• If this entry is occupied, it uses the old value of Hst[idx] as the hash key to delete the binding in Hash. Next, it sets the value of Hst[idx] to FS^k_f, and at the same time inserts the new binding (FS^k_f, idx) into Hash. If a binding in Hash has the same fair share value, the new idx overwrites the old one.
• Otherwise, the interior node only needs to store the new fair share in Hst and insert the corresponding binding into Hash.

Upon receiving a control packet, an interior node first performs time-estimation (Fig. 9). It hashes the field FS.fc of the control packet into Hash. If a binding exists, the second element of the binding gives the table index b in Hst, which is the estimate of the wrapped time for FS.fc. If b = idx, the current fair share is used to update the control packet. Otherwise, the interior node searches the Hst entries between b and idx, and selects the fair share defined by Eq. (5) to update the control packet. For example, given FS.fc = 5, the 12th entry of Hash contains the right binding, which indicates that five is the corresponding table index in Hst. Suppose the attenuation factor is 0.5. According to Eq. (5), l* = 7, and the fair share used to update the control packet is 15.

5. Comparison of SAFE with SF

In this section, we first compare the processing differences between SF and SAFE, then indicate when SAFE
and SF are equivalent. Finally, we provide system models for SF and SAFE to understand their characteristics related to RTTs.

When an interior node in SF receives a forwarding control packet, it compares its current link fair share FS^k_f with FS.fwd of the packet, and the minimum value is set in the control packet. An interior node in SAFE, on the other hand, first estimates the latest sample time b, and then chooses FS^k_{l*} using Eq. (5). FS^k_{l*} is compared with FS.fwd, and the minimum is set in the control packet. With the help of SAFE, flows with large delay update their fair shares gradually, while other flows still react to the control quickly, without attenuation.

SAFE is more general, and provides substantial improvement over SF. SF can be considered a special case of SAFE with the attenuation factor equal to one. While SAFE provides more flexibility than SF, it can only reduce the rate oscillation down to a granularity determined by the aggressiveness of the control algorithm. If the control algorithm is extremely aggressive, the fair share of flows with small delay converges right after b; SAFE then degenerates to SF, since no intermediate fair shares are available in the history for attenuation. In a homogeneous environment with little delay, such a control algorithm has the highest convergence speed, which is the best one can hope for. However, it is not suitable for controlling oscillation in a heterogeneous situation with flow diversity.

Now, we provide the system model of SF and derive its transfer function. We then extend the model to express the transfer function of SAFE; comparing their transfer functions identifies the consequence of selective attenuation. We consider a single bottleneck link k here. For simplicity of the analysis, the control algorithm in Eq. (2) is assumed to be

FS^k(n + 1) = FS^k(n) + α × (B^k(n) − AR^k(n)),   (7)

where α is the parameter indicating the aggressiveness of the algorithm: the bigger α is, the more aggressive the control algorithm and the faster the reaction. Since the minimum rate μ_i in Eq. (1) is a constant and can simply be subtracted from the available bandwidth, we ignore the term μ_i in the following.

Fig. 10 depicts the system model of SF, including the effect of the RTTs; the diversity of sample intervals is not included in the model. From this model, the arrival rate AR^k(n) in SF, according to the flow fair allocation in Eq. (1), is

AR^k(n) = Σ_i w_i FS^k(n − τ_i),   (8)

where τ_i is the RTT of flow i normalized by the control interval and quantized to an integer. Based on Eqs. (7) and (8), we find that

FS^k(n + 1) = FS^k(n) + α (B^k(n) − Σ_i w_i FS^k(n − τ_i)).   (9)
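Eq. (9) can be iterated directly to see why an aggressive α oscillates. The sketch below is our illustration, not the paper's simulator: two unit-weight flows with τ1 = 0, τ2 = 1 and constant available bandwidth B = 1.

```python
def simulate_sf(alpha, taus=(0, 1), weights=(1.0, 1.0), B=1.0, steps=60):
    """Iterate Eq. (9): FS(n+1) = FS(n) + alpha * (B - sum_i w_i * FS(n - tau_i))."""
    max_tau = max(taus)
    fs = [0.0] * (max_tau + 1)          # history initialized to zero
    for _ in range(steps):
        # delayed reactions: flow i contributes its fair share from tau_i steps ago
        arrival = sum(w * fs[-1 - t] for w, t in zip(weights, taus))
        fs.append(fs[-1] + alpha * (B - arrival))
    return fs

# With alpha = 1 the poles sit on the unit circle and the fair share
# oscillates forever; with a smaller alpha it settles to B / sum(w_i) = 0.5.
osc = simulate_sf(alpha=1.0)
stable = simulate_sf(alpha=0.4)
```

Running it, the α = 1 trace cycles through 0, 0, 1, 1, 0, 0, … indefinitely, while the α = 0.4 trace converges, matching the stability discussion that follows from the transfer function.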
Similar to Eq. (8), the arrival rate in SAFE is expressed as AR^k(n) = Σ_i w_i FS^k_{l*_i}(n − τ_i).

The transfer function of SF is

FS^k(z) / B^k(z) = α / (z − 1 + α Σ_i w_i z^{−τ_i}).   (10)

A stable system requires that all the poles of its transfer function lie within the unit disc. From Eq. (10), we observe that the parameter α has to be small enough to push all poles within the unit disc; therefore, it is hard to achieve both fast reaction and stability.

With the selective attenuation, FS^k_{l*_i} of flow i at time n (in Eq. (5)) is approximately

FS^k_{l*_i} = γ_i Σ_{m=0}^∞ (1 − γ_i)^m FS^k(n − mτ_i),   (11)

where γ_i = 1 for τ_i < 1 and γ_i = γ otherwise. Eq. (11) is an approximation because of the constraint that FS^k_{l*_i} has to be in the fair share history to facilitate the next time-estimation. Substituting Eq. (11), the arrival rate in SAFE becomes

AR^k(n) = Σ_i γ_i w_i Σ_{m=0}^∞ (1 − γ_i)^m FS^k(n − (m + 1)τ_i).   (12)

The transfer function of SAFE is then found to be

FS^k(z) / B^k(z) = α / (z − 1 + α Σ_i γ_i w_i z^{−τ_i} + α Σ_i γ_i w_i Σ_{m=1}^∞ (1 − γ_i)^m z^{−(m+1)τ_i}).   (13)

By comparing Eqs. (10) and (13), we observe that:

• SAFE is equivalent to SF when τ_i = 0 for all flows.
• SF is a special case of SAFE with γ_i = 1 for all flows.
• SAFE introduces more poles to the system, but relaxes the stability requirement on α imposed by the original poles when γ_i ∈ (0, 1).

In our implementation, the delayed flows can catch up with the current fair share update once fair shares repeat; therefore, flows with large feedback delay are not always delayed as much as Eq. (13) implies. Also, the effect of the high-order elements is negligible, since their magnitudes are small. We therefore approximate Eq. (13) by

FS^k(z) / B^k(z) = α / (z − 1 + α Σ_i γ_i w_i z^{−τ_i} + α Σ_i (1 − γ_i) w_i z^{−2τ_i}).   (14)

Fig. 11. The equivalent model of the feedback system for SAFE.

Table 1. The poles of the transfer functions for SF and SAFE with γ2 ∈ (0, 1]

γ2          0.1     0.2     0.3     0.4     0.5     0.6     0.7     0.8     0.9     1.0
SF   |p1|   1       1       1       1       1       1       1       1       1       1
SF   |p2|   1       1       1       1       1       1       1       1       1       1
SAFE |p1|   0.9832  0.9664  0.9498  0.9342  0.9208  0.9118  0.9113  0.9245  0.955   1
SAFE |p2|   0.9832  0.9664  0.9498  0.9342  0.9208  0.9118  0.9113  0.9245  0.955   1
SAFE |p3|   0.9310  0.8567  0.7760  0.6875  0.5898  0.4811  0.3612  0.2340  0.1096  0
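The pole magnitudes in Table 1 can be checked numerically. The sketch below is ours: for the two-flow example (w1 = w2 = 1, τ1 = 0, τ2 = 1, α = 1, γ1 = 1, γ2 = γ), multiplying the denominators of Eqs. (10) and (14) by z² gives the characteristic polynomials used here.

```python
import numpy as np

def sf_pole_mags():
    # Eq. (10): z - 1 + (z^0 + z^-1) = 0  ->  z^2 + 1 = 0
    return np.abs(np.roots([1.0, 0.0, 1.0]))

def safe_pole_mags(gamma):
    # Eq. (14) with gamma1 = 1: z + gamma*z^-1 + (1 - gamma)*z^-2 = 0
    #   ->  z^3 + gamma*z + (1 - gamma) = 0
    return np.abs(np.roots([1.0, 0.0, gamma, 1.0 - gamma]))
```

For example, `safe_pole_mags(0.5)` yields magnitudes ≈ {0.9208, 0.9208, 0.5898}, the γ2 = 0.5 column of Table 1, while both SF poles have magnitude 1.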
The corresponding system model is shown in Fig. 11. A numeric example based on Eq. (14) shows the effectiveness of SAFE. Suppose there are two flows with equal weights w1 = w2 = 1 and RTTs τ1 = 0 and τ2 = 1. Given α = 1, SF has two poles, both on the unit circle: the SF system is not stable for this α. SAFE, on the other hand, has three poles, all of them within the unit disc with γ1 = 1 and γ2 ∈ (0, 1). Table 1 shows the magnitudes of the poles for SF and SAFE. SAFE can maintain stability with a bigger α than SF can, and thus achieves both fast reaction and stability. It adds only one extra parameter to tune, but offers more flexibility. More numeric study is needed to understand the effectiveness of SAFE under different topologies.

6. Simulation analysis

Within the Diffserv framework, we compare the performance of SAFE with that of SF under two scenarios. UT [8] is used as the control algorithm in the simulation. Data packets are
Fig. 12. Topology 1: ten flows and on-off background traffic. (Source and destination hosts h1-h4 connect through boundary nodes A and B to interior nodes C and D; the access links are 1 ms/24 Mbps, and the bottleneck link C-D is 10 ms/10 Mbps.)
552 bytes and feedback control packets are 40 bytes. Links are full duplex with equal bandwidth in both directions, and the buffer size of each interior node is infinite. We assume that TCP flows have infinite data to send and adapt their sending rate according to TCP Reno [16]. All TCP flows require the weighted max-min fair allocation defined in Eq. (1) with unit weight, so they are assigned equal shares of the residual bandwidth. The goodput, i.e. good throughput, of each TCP connection is measured at its source over an averaging interval of 400 ms. A control packet is sent for each flow every 16 data packets. The attenuation factor is 0.67.

The first simulation topology is shown in Fig. 12, and the simulation lasts 100 s. The control interval at each interior node is 80 ms. There are two source hosts, each connected to a Diffserv boundary node. Host h1 carries five TCP flows; host h2 carries four TCP flows and one high priority UDP flow. All TCP flows are started randomly, with starting times uniformly distributed within the first 100 ms of the simulation. The UDP flow (the 10th flow) has high priority and starts at the beginning of the simulation. It carries on-off traffic: when it is on, its sending rate is 2 Mb/s; when it is off, its sending rate is zero. It switches from one state to the other every five seconds, as shown in Fig. 13. The flow burst sizes and minimum sending rate requirements are shown in Table 2, where CBS denotes the committed burst size and EBS the excess burst size [9].

Fig. 13. Sending rate of the high priority flow.
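The on-off background source is simple enough to state as a rate function of time; a sketch (ours, assuming the flow starts in the on state, which is what Fig. 13 suggests):

```python
def udp_rate_mbps(t_ms, period_ms=5000, on_rate=2.0):
    """Sending rate of the high priority on-off UDP flow (Fig. 13):
    2 Mb/s for five seconds, then zero for five seconds, and so on."""
    return on_rate if (t_ms // period_ms) % 2 == 0 else 0.0
```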
Table 2. Flow service parameters

Flow idx   Src type   Min rate (Mb/s)   CBS (packet)   EBS (packet)   Src rate (Mb/s)
1-5        TCP        0.8               12             12             -
6-9        TCP        0.08              6              6              -
10         UDP        2                 6              6              0/2
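The CBS/EBS parameters above come from the single rate Three Color Marker of [9]. A color-blind sketch of that marker, in packet units to match Table 2, is below; this is our illustration following RFC 2697, not code from the paper.

```python
def make_srtcm(cir, cbs, ebs):
    """Color-blind single rate Three Color Marker (after RFC 2697 [9]).
    cir is in packets per second; cbs/ebs are bucket depths in packets."""
    state = {"tc": float(cbs), "te": float(ebs), "last": 0.0}

    def mark(t, size=1):
        # Replenish: fresh tokens fill the committed bucket first,
        # and the remainder spills into the excess bucket.
        fresh = cir * (t - state["last"])
        state["last"] = t
        room = cbs - state["tc"]
        state["tc"] += min(fresh, room)
        state["te"] = min(ebs, state["te"] + max(0.0, fresh - room))
        # Mark: green against Tc, yellow against Te, otherwise red.
        if state["tc"] >= size:
            state["tc"] -= size
            return "green"
        if state["te"] >= size:
            state["te"] -= size
            return "yellow"
        return "red"

    return mark
```

A burst larger than CBS + EBS packets is partly marked red until the buckets refill at the committed rate.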
The link arrival rate and queue length for SF are shown in Figs. 14 and 15; Figs. 16 and 17 plot the same quantities for SAFE, and the statistics are summarized in Table 3. The queue size for SF is significantly larger than that of SAFE, and its arrival rate at the bottleneck link is less predictable. Given a small buffer size, the rate oscillation of SF could result in large jitter and increased loss probability. For SF, the fair share of the high flows 1-5 is shown in Fig. 18, and the fair share of the low flows 6-9 in Fig. 19. Note that the fair shares of both the high and low flows fail to converge over several intervals; consequently, the queue size at the bottleneck link is very large (Fig. 15). For SAFE, on the other hand, the fair shares always converge (Figs. 20 and 21): the high flows react to the change of available bandwidth as fast as possible, while the low flows update their fair shares gradually.

The second topology is shown in Fig. 22, and the simulation time is 8 s. The control interval is 160 ms. The bottleneck link L1 has a bandwidth of 155 Mb/s. The propagation delay is 5 ms for each of the links L1 to L5; all other links have a propagation delay of 1 ms. Every host carries ten flows, and there are 20 source hosts. The tenth flow at host 19 is a high priority UDP flow whose traffic is the aggregation of 30 MPEG traces with randomly assigned phases; its average rate over the initial 8 s is 11.7 Mb/s, its peak rate is 18 Mb/s, and its standard deviation is 1.72 Mb/s. All other flows are TCP with equal weights. The high flows at host 0, host 5, host 10, and host 15 require a minimum rate of 0.8 Mb/s; their sample intervals are less than 88 ms. The remaining low flows require a minimum rate of 40 kb/s, and their sample intervals are less than 1.8 s. We limit the sample interval to no less than 80 ms to reduce the control overhead further; note that there is no benefit in sampling the fair share more than twice per control interval.

The plots show the performance of SF and SAFE at the bottleneck link L1. The link arrival rate and queue length for SF are shown in Figs. 23 and 24. Significant oscillations occur and persist in both the arrival rate and the queue size: SF is vulnerable to the initial state change and to the background traffic fluctuation. Its control overhead is 2.6%, lower than the theoretical value of 6.3%, because the high priority traffic sends no control packets and the minimum sample interval is limited to 80 ms. SAFE, on the other hand, stabilizes the system in a short time and is robust to the background fluctuation (Figs. 25 and 26); its control overhead is also 2.6%. The simulation results show that SAFE is very effective in reducing the oscillation caused by large feedback delay, and therefore stabilizes the system with high utilization and small queue size.

7. Related work

There are several algorithms that try to solve the rate oscillation problem caused by large RTTs. The Multiple Time Scale (MTS) protocol [20] operates at interior nodes. It
Fig. 14. Arrival rate for SF at the bottleneck link [C-D].
Fig. 15. Queue length for SF at the bottleneck link [C-D].
Fig. 16. Arrival rate for SAFE at the bottleneck link [C-D].

Fig. 18. Fair share of the flows 1-5 for SF.
Table 3. Performance comparison between SF and SAFE at the bottleneck link [C-D]

        SF                          SAFE
        Rate (%)    Queue (byte)    Rate (%)    Queue (byte)
Mean    0.913       3195            0.881       799
Std     0.063       9942            0.052       2628
Max     1.159       83352          1.143        28704

Fig. 19. Fair share of the flows 6-9 for SF.

Fig. 17. Queue length for SAFE at the bottleneck link [C-D].

Fig. 20. Fair share of the flows 1-5 for SAFE.

classifies flows into short latency and long latency flows. The available bandwidth is likewise classified into a short-life part and a long-life part. The short-life part is distributed evenly among the short latency flows only, while the long-life part is distributed to both short and long latency flows. MTS calculates the fair share for each flow from its flow information. This is different from our context, where no per-flow information is available and the control algorithm has to track the steady state iteratively. MTS improves
Fig. 23. Arrival rate for SF at the bottleneck link L1.
causes of the given control result to determine the next update, not the presumed current price. ERAM still involves significant overhead: interior nodes need per-flow information, such as the RTT and utility function of each flow. In the Enhanced Proportional Rate Control Algorithm (EPRCA) [13], each interior node maintains a separate queue for every connection passing through it. The queue occupancy at a connection's bottleneck link is fed back to the source, and each source calculates a sending rate proportional to the available buffer space for that connection at its bottleneck link. EPRCA handles large RTTs through the Smith Predictor [15]: it compensates for the delay by including the delayed packets, which are not yet reflected in the queueing feedback, in the calculation. Per-flow queueing, however, is a strong requirement at interior nodes. All of these algorithms address only the oscillation caused by large RTTs, require per-flow information or processing, and integrate the handling of delay into the control algorithm itself. In contrast, SAFE is a feedback
Fig. 21. Fair share of the flows 6-9 for SAFE.
network performance in terms of stability, throughput, and packet loss. However, MTS requires interior nodes to estimate per-flow feedback latency and to predict the lifetime of the available bandwidth, which imposes tremendous overhead at interior nodes. The Enhanced Random Early Marking (ERAM) [1] lets each interior node iteratively estimate its price per unit of rate; this price is fed back to the end-users, and each user adjusts its sending rate to maximize the difference between its utility and its cost along the path. ERAM reduces the oscillation by averaging over past prices: the effect of past prices is included in the price update, since the delayed arrival rate conforms to the past prices, not just the current price. This algorithm uses the actual
Fig. 22. Simulation topology: a network with two hundred flows.
mechanism: it addresses a more general problem, does not depend heavily on the control algorithm, and requires no per-flow overhead.

8. Conclusions
Fig. 24. Queue length for SF at the bottleneck link L1.
Enforcing low control overhead within Diffserv is essential to keep the network scalable; however, it results in diverse sample intervals for flows with different rates. Flow diversity in RTTs and sample intervals can cause permanent oscillation. We argue that it is necessary to distinguish flows according to their feedback delays so as to keep the reaction fast while eliminating rate oscillation: selective attenuation is preferable to damping all flows, or to not attenuating at all. Based on this idea, SAFE is proposed to reduce the rate oscillation with no per-flow accounting. To our knowledge, SAFE is the first work that maintains the scalability of Diffserv while solving the delay-related oscillation problem. Furthermore, the hashing technique keeps the overhead of SAFE to a minimum. Theoretical analysis supports the effectiveness of SAFE, and the simulation results show that our mechanism significantly reduces rate oscillation: it achieves high link utilization and a small queue size while maintaining very low control overhead. Future work may include a theoretical model of the effect of sample intervals, a study of the sensitivity of the system, and guidelines for choosing proper attenuation factors.

Acknowledgements
Fig. 25. Arrival rate for SAFE at the bottleneck link L1.
We would like to thank Dr G. de Veciana for his valuable feedback and insights into the current approaches. We also appreciate discussions with Marissa Borrego and Shanchieh Yang; their comments have been crucial in shaping our thinking. This work was supported by NSF under grant ANI-9714586.
Fig. 26. Queue length for SAFE at the bottleneck link L1.

References
[1] S. Athuraliya, D. Lapsley, S. Low, An enhanced random early marking algorithm for Internet flow control, in: Proceedings of INFOCOM 2000, Tel Aviv, Israel, March 2000.
[3] Y. Bernet, et al., A Framework for Differentiated Services, IETF Internet Draft, draft-ietf-diffserv-framework-02.txt, February 1999.
[4] S. Blake, et al., An Architecture for Differentiated Services, IETF RFC 2475, December 1998.
[5] H. Chow, A. Leon-Garcia, A Feedback Control Extension to Differentiated Services, IETF Internet Draft, draft-chow-diffserv-fbctrl-00.txt, March 1999.
[6] C. Dovrolis, D. Stiliadis, P. Ramanathan, Proportional differentiated services: delay differentiation and packet scheduling, in: ACM SIGCOMM '99.
[8] C. Fulton, S. Li, C. Lim, An ABR feedback control scheme with tracking, in: INFOCOM '97, Kobe, Japan.
[9] J. Heinanen, R. Guerin, A Single Rate Three Color Marker, IETF RFC 2697.
[10] S. Kalyanaraman, R. Jain, S. Fahmy, R. Goyal, B. Vandalore, The ERICA switch algorithm for ABR traffic management in ATM networks, ATM Forum/96-1172, August 1996.
[11] N. Li, M. Borrego, S. Li, A rate regulating traffic conditioner for supporting TCP over Diffserv, Computer Communications 23 (14-15) (2000) 1349-1362.
[12] N. Li, M. Borrego, S. Li, Achieving per-flow fair rate allocation within Diffserv, in: IEEE Symposium on Computers and Communications 2000, Antibes, France.
[13] S. Mascolo, D. Cavendish, M. Gerla, ATM rate based congestion control using a Smith predictor: an EPRCA implementation, in: Proceedings of INFOCOM '96, San Francisco, 1996.
[14] K.K. Ramakrishnan, G. Hjalmtysson, J.E. Van der Merwe, The role of signaling in quality of service enabled networks, IEEE Communications Magazine 37 (6) (1999) 124-132.
[15] O.J. Smith, A controller to overcome dead time, ISA Journal 6 (2) (1959) 28-33.
[16] W. Stevens, TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms, IETF RFC 2001.
[17] I. Stoica, S. Shenker, H. Zhang, Core-stateless fair queueing: achieving approximately fair bandwidth allocations in high speed networks, in: ACM SIGCOMM '98.
[18] I. Stoica, H. Zhang, et al., Per Hop Behaviors Based on Dynamic Packet States, IETF Internet Draft, draft-stoica-diffserv-dps, February 1999.
[19] I. Stoica, H. Zhang, Providing guaranteed services without per flow management, in: SIGCOMM '99, Boston, MA.
[20] W.K. Tsai, L.C. Hu, Y. Kim, A temporal-spatial flow control protocol for ABR in integrated networks, in: IDMS '98, Oslo, Norway.
[21] W.K. Tsai, Y. Kim, A stability and sensitivity theory for rate-based max-min flow control for ABR service, in: IEEE SICON '98, Singapore.
[22] B. Vandalore, S. Fahmy, R. Jain, R. Goyal, M. Goyal, General weighted fairness and its support in explicit rate switch algorithms, Computer Communications 23 (2) (2000) 149-161.