Computer Networks 56 (2012) 3076–3086
Periodic early detection for improved TCP performance and energy efficiency

Andrea Francini
Computing and Software Principles Research Department, Alcatel-Lucent Bell Laboratories, Mooresville, NC, USA
Article info

Article history: Received 11 October 2011; Received in revised form 20 March 2012; Accepted 17 April 2012; Available online 2 May 2012

Keywords: Buffer management; TCP; Active queue management; Random early detection; Energy efficiency; Simulation
Abstract

Reducing the size of packet buffers in network equipment is a straightforward method for improving the network performance experienced by user applications and also the energy efficiency of system designs. Smaller buffers imply lower queueing delays, with faster delivery of data to receivers and shorter round-trip times for better controlling the size of TCP congestion windows. If small enough, downsized buffers can even fit in the same chips where packets are processed and scheduled, avoiding the energy cost of external memory chips and of the interfaces that drive them. On-chip buffer memories also abate packet access latencies, further contributing to system scalability and bandwidth density. Unfortunately, despite more than two decades of intense research activity on buffer management, current-day system designs still rely on the conventional bandwidth-delay product rule to set the size of their buffers. Instead of decreasing, buffer sizes keep on growing linearly with link capacities. We draw from the limitations of the buffer management schemes that are commonly available in commercial network equipment to define Periodic Early Detection (PED), a new active queue management scheme that achieves important buffer size reductions (more than 95%) while retaining TCP throughput and fairness. We show that PED enables on-chip buffer implementations for link rates up to 100 Gbps while relieving end users from network performance disruptions of common occurrence.

© 2012 Published by Elsevier B.V.
1. Introduction

As link rates keep growing faster than the density and access speed of memory chips, the conventional strategy of sizing packet buffers proportionally to the link capacity is creating critical challenges to the development of future generations of network systems and to the ability of packet networks to sustain the performance requirements of all user applications.
Part of this paper was previously presented in A. Francini, Beyond RED: Periodic Early Detection for On-Chip Buffer Memories in Network Elements, Proceedings of the 2011 IEEE High-Performance Switching and Routing Conference (HPSR 2011), Cartagena, Spain, July 4–6, 2011.
Address: P.O. Box 773, Davidson, NC 28036, USA. Tel./fax: +1 336 697 3405. E-mail address:
[email protected]
1389-1286/$ - see front matter © 2012 Published by Elsevier B.V. http://dx.doi.org/10.1016/j.comnet.2012.04.017
At times of traffic congestion, large packet buffers inflate end-to-end packet delays, extending the round-trip times of TCP connections and deferring the completion of file transfers. Large buffers also hamper the scalability of system designs and contribute prominently to energy consumption. Their instantiation typically requires large memories that cannot be integrated in the same chips that process and forward the packets. Off-chip memory components and the connectivity infrastructure needed to reach them consume a substantial portion of real estate on circuit boards, taking it away from packet-handling devices. Off-chip memories also impose limitations on packet forwarding rates. Compared to the ideal case with on-chip memories only, external memories cause a substantial increase in area and power consumed per unit of forwarding capacity.
Despite being harmful to application performance and bandwidth density, large packet buffers (hundreds of milliseconds) remain a fixture in system designs because they maximize the aggregate throughput of congested TCP connections using the simplest buffer management policy (Tail-Drop). UDP traffic also needs buffering for the absorption of overload conditions, but small buffers are better than large ones at addressing all UDP overload occurrences [1]. Even better is to combine a small Tail-Drop buffer with the differentiation of packet drop priorities based on UDP compliance with pre-negotiated traffic profiles [2,3]. Unfortunately, small Tail-Drop buffers are not nearly as effective at handling TCP traffic, whose elasticity implies long-lasting congestion incidents and the absence of pre-negotiated per-flow contracts.

The bandwidth-delay product (BDP) rule [4] commonly drives the sizing of Tail-Drop buffers for TCP traffic: in front of a link of capacity C, the buffer size should be B = C·θ, where θ is the average round-trip time (RTT) estimated over the set of TCP flows that share the link. The BDP rule generally succeeds in avoiding queue underflow conditions, and therefore reductions of link utilization, when back-to-back packet losses occur at a congested buffer. However, the rule does not guarantee the throughput of individual TCP flows and does not enforce inter-flow fairness.
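To put the rule in concrete terms, the short sketch below compares a BDP-sized buffer with the 32 MB on-chip buffer that Section 5 uses for a 40 Gbps link, and shows how each congested BDP buffer inflates the effective RTT of a flow. The figures reuse numbers quoted elsewhere in the paper; the code itself is only an illustration and is not part of the original work.

```python
# Back-of-the-envelope numbers for the BDP rule B = C * theta, using figures
# quoted elsewhere in the paper (40 Gbps link, 250 ms average RTT, 32 MB PED
# buffer). Illustrative only.

GBPS = 1e9 / 8  # bytes per second carried by one gigabit per second

def bdp_buffer_bytes(capacity_gbps: float, avg_rtt_s: float) -> float:
    """Buffer size mandated by the bandwidth-delay product rule."""
    return capacity_gbps * GBPS * avg_rtt_s

bdp = bdp_buffer_bytes(40, 0.250)   # ~1.25 GB, as quoted in Section 5
ped = 32e6                          # 32 MB on-chip buffer used by PED
print(f"BDP buffer: {bdp/1e9:.2f} GB, PED buffer: {ped/1e6:.0f} MB "
      f"({100*ped/bdp:.1f}% of the BDP size)")

# A congested BDP buffer adds up to ~theta of queueing delay to the RTT, so a
# flow with a 50 ms base RTT that crosses one or two such bottlenecks sees its
# effective RTT (and hence its TCP throughput) degrade accordingly.
base_rtt, per_hop_queueing = 0.050, 0.250
for bottlenecks in (0, 1, 2):
    rtt = base_rtt + bottlenecks * per_hop_queueing
    print(f"{bottlenecks} congested BDP hops -> effective RTT {rtt*1e3:.0f} ms")
```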
BDP buffers that handle long-range TCP connections are commonly sized for RTT values in the 200–300 ms range. When one such buffer along the network path of a TCP flow is congested, its queueing delay adds a substantial contribution to the RTT experienced by the flow, decreasing its steady-state throughput in proportion to the ratio between the RTT values measured with and without congestion. As the number of congested BDP buffers along the same path increases, the consequences for the throughput of the TCP flow are catastrophic, because the effective RTT that defines the share of the flow at one bottleneck is augmented by the delay added by all the other bottlenecks. While the frequency of occurrence of multi-bottleneck congestion episodes may be low for the typical end user, few such episodes in the course of one day can produce major dissatisfaction, because their perceived effects are heavy and may take minutes to subside.

Network operators may not easily find evidence of the performance degradation suffered by TCP flows that encounter large buffers in their multi-bottleneck data paths. The placement of bottlenecks in different administrative domains, or simply the scale of the traffic volume handled by each network node, can make it impractical for monitoring tools to maintain flow-level statistics that reliably reproduce the end-user experience. As a result, the bufferbloat issue keeps lingering [5] while interest in more effective buffer management is fading. Indeed, after recognizing the performance limitations of solutions based on miniature Tail-Drop buffers [6–8] and concluding that active queue management (AQM) techniques like Random Early Detection (RED) [9,10] are promising on paper but inadequate for broad practical applications [11], the networking community has lately been focusing its efforts on re-designing the endpoint behavior so that it better adjusts to the ongoing capacity expansion in host and access links [12,13] and keeps the TCP performance insensitive to the buffer management flaws exposed by the network [14,15]. Growing energy-efficiency and system-scalability concerns now move our attention back onto buffer management enhancements that can reconcile TCP traffic with small buffers.

We find important merits in RED [9], especially in some of the many modifications that have been proposed to amend it [1,16], but also critical flaws. The algorithmic steps that we devise to address those flaws provide the backbone of a new AQM scheme that we call Periodic Early Detection (PED) [17]. Like RED, PED derives its packet drop decisions from the monitoring of the average queue length (AQL). However, PED differs radically from RED in the handling of the packet drop rate. While RED continuously adjusts the drop rate to the evolution of the queue length, PED applies drop rate adjustments only at the end of fixed time intervals, and only if the current rate is clearly off-range. By imposing a fixed packet drop rate, PED aligns with the conclusion presented in [11] that RED's stability improves as the slope of its packet drop probability function decreases. PED's distinctive new feature originates from the assumption that the queue is healthily loaded (neither empty nor overflowing) if the set of TCP flows in additive increase mode (slowly expanding their congestion windows) balances the set of TCP flows that are recovering from recent packet losses. Such a balance is the product of an ideal packet drop rate u(t) that evolves over time with the composition of the traffic mix and with the sequence of packet loss assignments to individual TCP flows. PED's periodic adjustments of the drop rate compensate for the infeasibility of tracking u(t) without error.

Our simulations show that PED retains full link utilization with 32 MB of buffer memory in front of a 40 Gbps link, enabling on-chip buffer implementations with embedded DRAM (eDRAM) technology that was available in 2004 [6]. We argue that on-chip implementations should be possible today also for the 80 MB eDRAM that we consider ideal for a 100 Gbps link. Since more than 20% of the power consumed by a typical high-end router line card [18] is due exclusively to the off-chip placement of packet buffers [19], PED's contribution to energy efficiency can be indeed substantial. Energy gains can be even higher (up to 40%) in the packet processing boards of access nodes [20]. The buffer size reductions enabled by PED bring tremendous benefits to multi-bottleneck TCP flows: we present a simulation experiment derived from a production network, where the throughput of multi-bottleneck flows increases by three orders of magnitude when PED buffers replace conventional Tail-Drop buffers. We remark that the deployment of PED in network buffers does not conflict with the new types of TCP endpoint behavior that have recently gained popularity in the industry [12,14] and actually brings major benefits to them, in the form of faster data deliveries and network performance that is more predictable and stable.

We organize the paper as follows. In Section 2 we describe the RED algorithm in its basic formulation and discuss RED enhancements that have been proposed in the literature. We specify the PED algorithm in Section 3
and show how it is configured in Section 4. In Section 5 we present simulation results for the appreciation of PED's performance in single-bottleneck network paths. In Section 6 we show the benefits of PED versus Tail-Drop and RED when the network path of some TCP flows includes multiple bottlenecks. In Section 7 we summarize the contributions of the paper and outline our plans for future work.

Fig. 2. Network configuration for Scenarios 1–3.
2. Random early detection

In this section we recall the native definition of the RED algorithm and review past proposals for improving it.

2.1. Native RED

According to the definition given in [9], upon arrival of the nth packet a RED queue updates the AQL q̄ based on the current value of the instantaneous queue length (IQL) q (q̄[n] = w·q[n] + (1 − w)·q̄[n − 1]) and marks the packet with probability that depends on the current AQL value (p[n] = p(q̄[n])). The marking of the packet has different consequences depending on the application of the algorithm. For convenience of presentation, in this paper we focus exclusively on the case where marking implies the immediate elimination of the packet. We expect the results of the paper to apply consistently to all applications of packet marking, the most notable being explicit congestion notification (ECN) [21].

Fig. 1 shows a typical profile of the function that maps AQL and packet drop probability (PDP) levels. With the PDP function of Fig. 1, RED drops no packet as long as the AQL remains below the minimum threshold bmin. When the AQL is between bmin and the maximum threshold bmax, an incoming packet is dropped with probability that depends linearly on the current position of the AQL between the two thresholds (the probability is pmax when q̄ = bmax). RED drops every incoming packet when q̄ > bmax and also when the IQL exceeds the threshold Qmax > bmax that defines the total buffer space available. The use of the AQL instead of the IQL isolates the packet drop decision from short-term IQL fluctuations that reflect ordinary TCP dynamics but not the onset of congestion conditions.

Fig. 1. Packet drop probability function of the native RED scheme.

The operation of RED requires the configuration of the following parameters: the weight w of the exponential weighted moving average (EWMA) that computes the AQL, the buffer thresholds bmin, bmax, and Qmax (we assume that all thresholds and queue lengths are expressed in bytes), and the maximum drop probability pmax.
Because of the large number of configuration dimensions involved, qualifiers like "black art" [1] and "inexact science" [22] have been attributed to the tuning of RED in its native formulation. This uncertainty alone has proven sufficient to make network operators wary of activating RED in front of their links. As shown in [11], it is actually possible to configure RED so that it keeps the queue in equilibrium, but only when the composition of the traffic mix is known. With a fixed configuration, the performance is critically sensitive to the traffic conditions.

We illustrate the issue using results from ns2 [23] simulation experiments run on the dumbbell network topology of Fig. 2. The topology includes a source aggregation node (SAN), a bottleneck node (BNN), and a sink distribution node (SDN). A number N of TCP Reno sources are attached by respective 1 Gbps links to the SAN. The propagation delay of each of these links sets the RTT of the corresponding TCP flow. The propagation delay of all other links is negligible. The TCP sinks have the SACK option enabled. They are attached to the SDN also by 1 Gbps links. All links between network nodes have 100 Gbps capacity, except the bottleneck link from the BNN to the SDN (10 Gbps). We run a sequence of trial-and-error experiments to optimize the configuration of the RED parameters for a population of N = 200 long-lived TCP flows with identical RTT (θ = 50 ms). With total buffer space Qmax = 8 MB (in line with the technologically feasible goal of embedding a 32 MB DRAM in the traffic management ASIC of a 40 Gbps line card, as set forth in [6]), we set bmin = 0.8 MB, bmax = 7.2 MB, and pmax = 10⁻⁴. The weight of the EWMA is w = 2.4 × 10⁻⁶. Fig. 3 plots the evolution of the IQL and AQL over a 10 s interval when 200 long-lived flows are present.

Fig. 3. IQL and AQL with RED configuration optimized for 200 flows (200 flows present).
Fig. 4. IQL and AQL with RED configuration optimized for 200 flows (40 flows present).
Fig. 5. IQL and AQL with RED configuration optimized for 200 flows (1000 flows present).
The RED queue maximizes the aggregate TCP throughput by staying always busy without overflowing. In Fig. 4 the number of flows drops to 40: the RED queue drops packets too aggressively because pmax is oversized for the traffic mix (pmax = 10⁻⁵ would be adequate to restore full link utilization). In Fig. 5 the queue handles 1000 flows: pmax is now too small to prevent the queue from overflowing and triggering global synchronization conditions (pmax = 10⁻³ should be used instead). An analysis of the effects of pmax on the stability of the TCP/RED control loop can be found in [11].
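For reference, the following is a minimal model of the native RED behavior just described: a per-packet EWMA update of the AQL plus a drop probability that grows linearly between bmin and bmax. The class layout and names are ours, not taken from [9]; they only restate the rules of Section 2.1.

```python
import random

class NativeRed:
    """Toy model of the native RED drop decision described in Section 2.1.
    Parameter names follow the text (w, b_min, b_max, Q_max, p_max); the class
    structure itself is illustrative, not taken from the RED specification."""

    def __init__(self, w, b_min, b_max, q_max, p_max):
        self.w, self.b_min, self.b_max = w, b_min, b_max
        self.q_max, self.p_max = q_max, p_max
        self.aql = 0.0  # average queue length (bytes)

    def on_arrival(self, iql: float) -> bool:
        """Return True if the arriving packet must be dropped.
        iql is the instantaneous queue length sampled at arrival time."""
        # Event-driven EWMA update: aql[n] = w*q[n] + (1-w)*aql[n-1]
        self.aql = self.w * iql + (1.0 - self.w) * self.aql
        if iql > self.q_max or self.aql > self.b_max:
            return True                 # forced drop: buffer full or AQL too high
        if self.aql < self.b_min:
            return False                # no early detection below b_min
        # Linear ramp of the drop probability between b_min and b_max
        p = self.p_max * (self.aql - self.b_min) / (self.b_max - self.b_min)
        return random.random() < p

# Configuration optimized for 200 flows in the 10 Gbps example of Section 2.1
red = NativeRed(w=2.4e-6, b_min=0.8e6, b_max=7.2e6, q_max=8e6, p_max=1e-4)
```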
2.2. Preferred RED enhancements

Soon after the IETF issued a "strong recommendation for testing, standardization, and widespread deployment of active queue management in routers," pointing to RED as a preferred candidate [10], the race started to better understand the scheme and eventually propose modifications that could strengthen its performance and simplify its configuration. Out of the vast body of RED enhancement proposals that were spawned by that race, here we focus only on those that we consider instrumental to the specification of our solution.

As observed in [1,24], the event-driven computation of the AQL, based exclusively on packet arrivals, can be a source of inaccuracy in the detection of congestion conditions. Better accuracy can be expected from a time-driven
process, with AQL updates triggered by the expiration of a fixed averaging period τq that should be at least as large as the inter-departure time of packets of typical size (e.g., 1500 bytes) when the bottleneck link operates at full capacity (e.g., τq ≥ 0.3 μs with C = 40 Gbps). Since at times of congestion the average packet arrival rate, sampled at the RTT timescale, does not change rapidly, in most cases the event-driven and time-driven computations of the AQL yield very similar results. However, the synchronization of the averaging process still proves highly beneficial because it produces much clearer guidelines for the configuration of the weight parameter of the EWMA. In fact, the fixed spacing of the AQL updates makes it possible to identify the EWMA with a discrete-time low-pass filter with time constant T = τq/w. Since the purpose of the EWMA is the isolation of the AQL from IQL oscillations that occur at the RTT timescale, a time constant that is larger than the expected RTT for a large majority of the TCP flows handled by the queue is sufficient to smooth out undesired AQL fluctuations. With T = 500 ms (RTT values beyond 500 ms are commonly considered unusual in networks that do not include satellite links) and τq = 10 μs, a weight w = 2 × 10⁻⁵ works well for a 40 Gbps link. In general, once the RTT target for the data path that includes the RED buffer is defined, w scales with the inverse of the expected output rate of the buffer, which typically coincides with the capacity of the associated link. In order to reconcile the accuracy of the AQL computation with the workload of the averaging process, the value of τq chosen for the example (10 μs) is about 30 times larger than the minimum recommended (0.3 μs). As a result, the AQL computation can be run as a background process, completely independent of packet arrivals and departures.

Important results have also been obtained in the choice of the maximum drop probability pmax, which defines the slope of the drop probability curve between the thresholds bmin and bmax [16,25]. As shown in the example of Section 2.1, the value of the parameter must be adjusted to the traffic mix to maximize the utilization of the bottleneck link. The Adaptive RED (ARED) algorithm, specified in [16] as a refinement of a similar concept previously presented in [25], subjects pmax to a control algorithm that sets its value within a pre-defined range [pmax^(l), pmax^(u)]. After holding the same pmax value for at least a time T (500 ms is the value of T recommended in [16]), the control algorithm increases pmax as soon as the AQL exceeds a threshold bu, or decreases it as soon as the AQL drops below a threshold bl, with the ultimate goal of settling the AQL around bd = C·d, where d is the target average delay (bmin < bl < bd < bu < bmax). The authors of [16] automatically derive all buffer thresholds from the target average delay d, relieving the user from the uncertainty of their configuration: bmin = 0.5·bd, bl = 0.9·bd, bu = 1.1·bd, and bmax = 1.5·bd. The range of allowed pmax values is also fixed: pmax ∈ [0.01, 0.5]. After following all recommendations for default values, the user is left with the target average delay d or the desired total allocation of buffer space Qmax ≥ btop as the only arbitrary parameter. In a nutshell, ARED combines the native RED with a mechanism for controlling the slope of the linear portion of the packet drop probability curve, driven by the ultimate goal of mapping the AQL of the target average delay onto the ideal packet drop probability p(t) that stabilizes the TCP/RED loop.
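The following sketch models the ARED control of pmax summarized above. The hold time T, the thresholds bl and bu, and the [0.01, 0.5] range come from the description; the specific additive-increase and multiplicative-decrease amounts are our reading of [16] and are shown only for illustration.

```python
# Sketch of the Adaptive RED control of p_max described above. The hold time T,
# the thresholds b_l and b_u, and the [0.01, 0.5] range come from the text; the
# exact increase/decrease amounts below are an assumption based on common
# descriptions of [16], not a detail stated in this paper.

def adapt_p_max(p_max: float, aql: float, b_l: float, b_u: float,
                p_lo: float = 0.01, p_hi: float = 0.5) -> float:
    """Return the new p_max after one adaptation interval of length T."""
    if aql > b_u and p_max < p_hi:
        # AQL persistently high: drop more aggressively (additive increase)
        p_max = min(p_hi, p_max + min(0.01, p_max / 4))
    elif aql < b_l and p_max > p_lo:
        # AQL persistently low: drop less aggressively (multiplicative decrease)
        p_max = max(p_lo, p_max * 0.9)
    return p_max

# Thresholds derived from a target average delay d, as recommended in [16]:
# b_min = 0.5*b_d, b_l = 0.9*b_d, b_u = 1.1*b_d, b_max = 1.5*b_d, with b_d = C*d.
```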
As noted by the authors, the control algorithm specified in [16] is not optimized for speed of convergence. However, as we showed in [17], in the large-BDP data paths that we are targeting the algorithm also fails to set pmax anywhere higher than the minimum allowed value pmax^(l). This happens because the algorithm does not include provisions for suspending the reduction of pmax at times when the AQL is below bl for reasons that do not depend on the value of the parameter, especially during the relatively long period that follows a global synchronization event, however induced. Then, if pmax^(l) is small for the current traffic mix (which is the case when the pmax range [pmax^(l), pmax^(u)] is configured properly), the queue keeps overflowing periodically without ever giving the AQL a chance to grow above bu and thus raise the value of pmax above pmax^(l).

Even assuming that the control loop that modulates pmax can be fixed, we argue that a positive slope is no longer needed in the PDP function. Such a slope provides instant PDP values that are either lower or higher than the value p(t) that is ideal for the traffic mix at time t. The availability of such values makes sense in the native version of RED, where pmax is fixed, but can be avoided in a scheme that periodically adjusts pmax. Consistently with the conclusions presented in [11], in a scheme that controls pmax effectively we can expect a zero-slope PDP function to enforce tighter queue length oscillations around the target queueing delay than a PDP function with positive slope.
above bu and thus raise the value of pmax above pmax . Even assuming that the control loop that modulates pmax can be fixed, we argue that a positive slope is no longer needed in the PDP function. Such a slope provides instant PDP values that are either lower or higher than the value p(t) that is ideal for the traffic mix at time t. The availability of such values makes sense in the native version of RED, where pmax is fixed, but can be avoided in a scheme that periodically adjusts pmax. Consistently with the conclusions presented in [11], in a scheme that controls pmax effectively we can expect a zero-slope PDP function to enforce tighter queue length oscillations around the target queueing delay than a PDP function with positive slope. 3. The periodic early detection algorithm In this section we present the algorithmic core of a new early detection scheme that builds on the idea that a stable instance of RED is one that associates a fixed packet drop probability with an extended range of AQL values, so that marginal AQL oscillations (as the AQL keeps trailing the IQL) do not modify the drop probability and therefore do not induce wider IQL (and AQL) oscillations. To consistently enforce the desired packet drop rate, we replace the notion of packet drop probability, which yields variable inter-drop intervals, with a packet drop period that enforces equally spaced packet drop events. For this reason we refer to our scheme as Periodic Early Detection (PED). PED combines two components that operate at different timescales. At the shorter (packet) timescale, PED drops packets at fixed time intervals when signs of congestion are evident. At the longer (RTT) timescale, a control algorithm adjusts the packet drop period to the evolution of the AQL. PED uses a drop timer with period sD of controllable duration to trigger the sampling of the IQL q and AQL q
like for RED, the definition of a byte version of PED is straightforward.) PED controls the period sD of the drop timer based on the AQL evolution. At time intervals never shorter than a time constant T that is large enough to include the RTT values of most TCP connections (e.g., T = 500 ms), PED com with the minimum PED threshold bPED pares q min and a PED
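A minimal sketch of this packet-timescale rule follows; the class and attribute names are illustrative, not part of the PED specification.

```python
# When the drop timer fires, a single drop is armed only if both the IQL and
# the AQL show signs of congestion; otherwise all packets are accepted until
# the next expiration. Names and structure are ours.

class PedDropGate:
    def __init__(self, b_min_ped: float, b_gate_ped: float):
        self.b_min_ped = b_min_ped      # IQL threshold
        self.b_gate_ped = b_gate_ped    # AQL threshold (half of b_min_ped)
        self.drop_pending = False       # set when the drop timer expires

    def on_drop_timer_expiry(self, iql: float, aql: float) -> None:
        # Arm at most one drop, and only if both congestion signals are present.
        self.drop_pending = iql > self.b_min_ped and aql > self.b_gate_ped

    def on_packet_arrival(self) -> bool:
        """Return True if this packet must be dropped."""
        if self.drop_pending:
            self.drop_pending = False   # at most one drop per timer period
            return True
        return False                    # accept everything until the next expiry
```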
PED controls the period τD of the drop timer based on the AQL evolution. At time intervals never shorter than a time constant T that is large enough to include the RTT values of most TCP connections (e.g., T = 500 ms), PED compares q̄ with the minimum PED threshold b^PED_min and a maximum PED threshold b^PED_max. PED increases τD if q̄ < b^PED_min and decreases it if q̄ > b^PED_max. The size of the period correction is modulated by α = q̄/b^PED_min for period increases and by α = b^PED_max/q̄ for period decreases. The period of the drop timer remains unchanged if the AQL is found to be in between the two thresholds. The pseudo-code of Fig. 6 summarizes the update of the packet drop period after at least a time T has elapsed since the latest drop period update. Notice that τD can increase or decrease at every update at most by a factor of two. In Fig. 6, τD^(l) and τD^(u) are the minimum and maximum values allowed for τD.

Fig. 6. Pseudo-code for drop period update in PED.
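Since the pseudo-code of Fig. 6 is not reproduced in the text, the sketch below encodes one plausible reading of the description above: the period stretches when the AQL is low and shrinks when it is high, with every update clamped to a factor of two and to the configured bounds.

```python
def update_drop_period(tau_d: float, aql: float,
                       b_min_ped: float, b_max_ped: float,
                       tau_d_min: float, tau_d_max: float) -> float:
    """One drop-period update, run no more often than every T seconds.
    This encodes the textual description of Section 3; the exact clamping is
    our reading of Fig. 6, which is not reproduced in the paper body."""
    if aql < b_min_ped:
        alpha = aql / b_min_ped if b_min_ped > 0 else 0.0
        # AQL too low: fewer drops are needed, so lengthen the period
        tau_d = min(tau_d / alpha if alpha > 0 else 2 * tau_d, 2 * tau_d)
    elif aql > b_max_ped:
        alpha = b_max_ped / aql
        # AQL too high: more drops are needed, so shorten the period
        tau_d = max(tau_d * alpha, tau_d / 2)
    # Otherwise the AQL sits between the two thresholds and the period holds.
    return min(max(tau_d, tau_d_min), tau_d_max)
```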
PED uses a synchronous, time-driven background process for updating the AQL. The criteria for setting the averaging period τq are the same that hold for the synchronous version of RED: the period should be larger than the inter-departure time of packets of typical size at the full capacity of the link (e.g., τq ≥ 1500 B / 40 Gbps = 0.3 μs), but not larger than a small fraction (e.g., 5%) of the target average delay (e.g., τq ≤ 0.05 × 1 ms = 50 μs). As usual, PED computes the AQL q̄ as an EWMA: q̄[n] = w·q[n] + (1 − w)·q̄[n − 1]. The EWMA weight w is defined by the ratio between the averaging period and the time constant of the TCP data path: w = τq/T.
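A sketch of the time-driven averaging process follows; the helper and the queue interface it assumes are illustrative, and a real line card would use a hardware timer rather than a software thread.

```python
import time
import threading

def start_aql_updater(queue, tau_q: float, w: float) -> threading.Thread:
    """Run the synchronous AQL computation as a background task, decoupled
    from packet arrivals and departures. `queue` is assumed to expose two
    numeric attributes, `iql` and `aql`; both the helper and that interface
    are illustrative, not part of the PED specification."""
    def loop() -> None:
        while True:
            # EWMA with fixed spacing tau_q: aql[n] = w*q[n] + (1-w)*aql[n-1]
            queue.aql = w * queue.iql + (1.0 - w) * queue.aql
            time.sleep(tau_q)   # tau_q = 10 us in the 40 Gbps example
    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    return worker
```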
To prevent ordinary TCP dynamics from interfering with the control of the drop period, as we observed for ARED in Section 2.2, PED also includes methods for:

1. suspending the correction of the drop period τD under low-load conditions that are not the consequence of recent packet drop events mandated by PED;
2. halving the drop period after the buffer occupancy grows from empty to full within a time that is comparable with the time constant T (such an event is a sign that the packet drop period is currently too large for dealing properly with the traffic mix); and
3. allowing emergency corrections of the drop period even before expiration of the time constant T as soon as the AQL exceeds a safety threshold b^PED_safe > b^PED_max.

A detailed presentation of the three methods can be found in [26].

PED differs from RED in several ways:

- PED keeps the packet drop rate fixed for a minimum time T instead of changing it continuously with the AQL.
- PED minimizes the variations in the inter-drop times by replacing the packet drop probability with a fixed packet drop rate.
- PED synchronizes the averaging process (the synchronization of the averaging process is not included in the original versions of RED and ARED [9,16], although it has been proposed later on [1,24]).
- PED gives the IQL a prominent role in forming the packet drop decision (RED decides exclusively based on the AQL).

The choice of the EWMA time constant T as the holding time for the packet drop period of PED results from careful consideration of the effects that data path and filtering delays have on the interaction between the buffer management scheme and the TCP source dynamics. It takes a time comparable with the RTT for a source to recognize a packet loss and for that recognition to produce visible effects on the IQL of the bottleneck queue. Because of the EWMA with time constant T ≥ θ, it takes a similar extra time for the AQL to catch up with the IQL variation. The accuracy of the control mechanism that sets the drop period depends tightly on the time distance between the adjustments in the activity of the TCP sources and the corrective actions on the queue length that drive those adjustments. While the delay induced by the RTT cannot be avoided, PED excludes the extra delay contribution of the EWMA by looking at the IQL for the packet drop decision. PED also checks the AQL to confirm that early signs of congestion are present, but the threshold b^PED_gate used for this purpose is only half the size of the IQL threshold b^PED_min, so that the EWMA delay has practically no impact on the timing of the decision.
4. PED configuration

The following is the list of all the configuration parameters that drive the operation of PED, inclusive of setting recommendations that we obtained empirically and then found consistently validated in all of our experiments:

1. Qmax is the total buffer space available; its value is set by hardware design constraints, such as the size of the available buffer memory (e.g., Qmax = 32 MB for the on-chip implementation of buffer memories in front of a 40 Gbps link);
2. b^PED_min is the minimum PED threshold; we configure it as a fixed fraction (20%) of Qmax (e.g., b^PED_min = 6.4 MB when Qmax = 32 MB);
3. b^PED_max is the maximum PED threshold; it should be twice as large as the minimum PED threshold (e.g., b^PED_max = 12.8 MB);
4. b^PED_safe is the safety PED threshold; it should be three times as large as the minimum PED threshold (e.g., b^PED_safe = 19.2 MB);
5. b^PED_gate is the gating PED threshold; no packet is dropped by PED as long as the AQL is below this threshold; it should be half the size of b^PED_min (e.g., b^PED_gate = 3.2 MB);
6. τq is the update period for the AQL; it should be large enough to avoid multiple updates of q̄ while the same packet is in transmission out of the queue (e.g., τq = 10 μs);
7. T is the time constant of the control system made of the bottleneck link and of the set of TCP sources whose packets traverse the link; it is also the inverse of the cutoff frequency of the low-pass filter that implements the computation of q̄; to make sure that the RTT values of most TCP connections are included, especially when their actual distribution is unknown, the value of T should be set always to 500 ms; lower values are accepted when the RTT distribution is known to be concentrated around a definitely smaller value;
8. w is the weight used in the EWMA computation of the AQL; its value derives directly from the averaging period τq and the time constant T: w = τq/T (e.g., w = 2 × 10⁻⁵);
9. τD^(l) is the minimum PED drop period allowed; it should never be less than the averaging period τq (e.g., τD^(l) = 100 μs); and
10. τD^(u) is the maximum PED drop period allowed; it should be larger than θ, but not larger than T (e.g., τD^(u) = 500 ms).

Since we offer fixed recommendations for the values of all parameters, we can claim that the configuration of PED is straightforward and can be fully automated once the link capacity and the amount of available memory are known.
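The sketch below encodes these recommendations as a derivation of every PED parameter from Qmax, τq, and T alone; the helper name and the dictionary layout are ours.

```python
# Every PED parameter derived from the available buffer memory Q_max plus the
# fixed recommendations listed above. Helper name and dict layout are ours.

def ped_config(q_max_bytes: float, tau_q: float = 10e-6, t_const: float = 0.5):
    b_min = q_max_bytes / 5.0                  # 20% of the buffer space
    return {
        "Q_max": q_max_bytes,
        "b_min_ped": b_min,
        "b_max_ped": 2.0 * b_min,              # twice the minimum threshold
        "b_safe_ped": 3.0 * b_min,             # three times the minimum threshold
        "b_gate_ped": 0.5 * b_min,             # half the minimum threshold
        "tau_q": tau_q,                        # AQL update period
        "T": t_const,                          # EWMA / holding time constant
        "w": tau_q / t_const,                  # EWMA weight
        "tau_d_min": max(100e-6, tau_q),       # never below the averaging period
        "tau_d_max": t_const,                  # upper bound on the drop period
    }

cfg = ped_config(32e6)   # 40 Gbps example from the list above
print({k: cfg[k] for k in ("b_min_ped", "b_gate_ped", "b_max_ped", "b_safe_ped")})
# -> 6.4 MB, 3.2 MB, 12.8 MB, 19.2 MB, matching items 2-5 of the list
```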
5. PED performance: single-bottleneck flows

In this section we present results from the ns2 simulation of PED in traffic scenarios where each TCP flow traverses only one bottleneck link. Scenario 1 is the same that we have used in Section 2.1 to show that the configuration parameters of RED must be adapted to the traffic mix. The results show that PED does not need to adjust its parameters to achieve full link utilization as the number of flows changes drastically. In Scenarios 2 and 3 we set the rate of the bottleneck link at 40 Gbps and challenge PED first with a heavily synchronized traffic mix and then with flows that operate with widely different sizes of their congestion windows. In both cases PED comfortably achieves full link utilization with only 32 MB of buffer memory. We derive Scenario 4 from a production enterprise network: it includes three bottleneck links running at relatively low speeds with no flow traversing more than one
of them (we will present in Section 6 results obtained with multi-bottleneck flows over the same network topology).

Fig. 7. Scenario 1: IQL and AQL with PED, 200 flows present (compare with RED in Fig. 3).
Fig. 8. Scenario 1: IQL and AQL with PED, 40 flows present (compare with RED in Fig. 4).
Fig. 9. Scenario 1: IQL and AQL with PED, 1000 flows present (compare with RED in Fig. 5).

In Scenario 1 we configure the time parameters of PED with the same values assigned in the example of Section 4: τq = 10 μs, T = 500 ms (i.e., w = 2 × 10⁻⁵), τD^(l) = 100 μs, and τD^(u) = 500 ms. For the buffer thresholds we use instead values that are scaled down by the ratio between the bottleneck rate of the example (40 Gbps) and the rate of Scenario 1 (10 Gbps): Qmax = 8 MB, b^PED_gate = 0.8 MB, b^PED_min = 1.6 MB, b^PED_max = 3.2 MB, and b^PED_safe = 4.8 MB. Figs. 7–9 show the queue length evolution when 200, 40, and 1000 flows are present (the flow setups are identical to those that produced the plots of Figs. 3–5). While the width and frequency of the IQL oscillations change with the number of flows, in all cases the AQL stabilizes well below the 2 MB mark (or less than 1.6 ms of queueing delay). Also the IQL oscillations in Fig. 7 are much narrower than those in Fig. 3 (where RED keeps the queue stable, as opposed to Figs. 4 and 5), thanks to the insensitivity of the packet drop period to the AQL in between period updates.

In Scenarios 2 and 3 we set the buffer thresholds back to the values of the 40 Gbps example of Section 4: Qmax = 32 MB, b^PED_gate = 3.2 MB, b^PED_min = 6.4 MB, b^PED_max = 12.8 MB, and b^PED_safe = 19.2 MB. In Scenario 2 we configure N = 1000 flows with identical RTT of 250 ms. The scenario tests PED's ability to achieve 100% link utilization with only 32 MB of buffer memory in a highly synchronized setup where the Tail-Drop policy needs every byte of the 1.25 GB packet memory mandated by the BDP rule [6]. The steady-state evolution of the IQL and AQL, plotted in Fig. 10 over a 100 s interval, shows uninterrupted utilization of the 40 Gbps link capacity and queueing delays never higher than 3 ms (versus the 250 ms maximum delay of Tail-Drop). The PED comparison with Tail-Drop shows that energy efficiency can be not just neutral to network performance, but actually enhance it in very substantial ways.

Fig. 10. Scenario 2: IQL and AQL with PED and link capacity C = 40 Gbps; 1000 flows with identical RTT (θ = 250 ms).

Scenario 3 distributes the RTT of N = 100 flows uniformly between 10 ms and 290 ms, increasing drastically the average and variance of the congestion window sizes compared to Scenario 2. We expect wider and faster queue length oscillations, because the amount of packets removed from the data path after a periodic packet loss depends linearly on the size of the congestion window of the affected flow. The IQL and AQL plots of Fig. 11 validate the expectation while showing full utilization of the link capacity. It is important to remark that at steady state the PED drop period settles almost permanently (there are only sporadic, short-lived exceptions) on the maximum allowed value of 500 ms, indicating that the PED control loop would likely push the value higher if a wider range was available. However, the value restriction on the maximum PED drop period does not compromise the link utilization performance, because a wider time separation between subsequent packet drop events is still enforced
by the checks on the IQL and AQL that drive the packet drop decision when the PED drop period expires.

Fig. 11. Scenario 3: IQL and AQL with PED and link capacity C = 40 Gbps; 100 flows with RTT uniformly distributed between θmin = 10 ms and θmax = 290 ms.

Table 1
Link utilization and delay measured in Scenario 4.

Buffer and traffic configuration   Link utilization (%)     Average delay (ms)    Maximum delay (ms)
Tail-Drop, L = 200                 100, 100, 100            44.4, 50.8, 38.2      56.0, 52.2, 52.0
RED, L = 200                       99.25, 99.78, 99.41      9.60, 10.2, 5.77      14.2, 20.3, 10.2
PED, L = 200                       99.20, 99.66, 99.31      9.91, 8.46, 5.68      10.5, 15.7, 5.90
Tail-Drop, L = 20                  100, 100, 100            38.3, 38.4, 35.7      56.0, 52.2, 52.0
RED, L = 20                        92.79, 95.75, 98.32      8.88, 7.22, 4.78      13.0, 8.2, 16.5
PED, L = 20                        99.85, 100, 99.85        9.65, 8.62, 5.71      9.67, 5.72, 9.36

In further experiments with Scenarios 2 and 3 we scale the capacity of the bottleneck link from 1 Gbps to 100 Gbps in 1 Gbps increments up to 10 Gbps, and then in 10 Gbps increments up to 100 Gbps. We observe that PED retains full link utilization if we scale the buffer thresholds linearly with the link capacity (from Qmax = 0.8 MB at 1 Gbps to Qmax = 80 MB at 100 Gbps). This linear dependency maintains a buffer-space ratio of less than 5% between PED and the BDP rule and enables on-chip implementations of packet buffers for link capacities up to at least 100 Gbps.
We derive the network configuration of Scenario 4 (Fig. 12) from an enterprise network made of four Wide-Area Network (WAN) nodes and three WAN links between them. The following pairs represent the capacity and propagation delay of the three WAN links: (150 Mbps, 8 ms), (300 Mbps, 4 ms), and (30 Mbps, 4 ms). Each WAN link NLx handles a distinct group of L TCP flows. For each flow group Gx, the data packets flow on the forward path from an ingress node INx to an egress node ENx. The links between WAN nodes and ingress and egress nodes run at 1 Gbps with negligible propagation delay. The links between TCP sources and ingress nodes run at 1 Gbps with
10 ms propagation delay in each direction. The links between TCP sinks and egress nodes also run at 1 Gbps, but the propagation delay is negligible. We run experiments with Tail-Drop, RED (synchronized), and PED deployed in front of the three WAN links. In the Tail-Drop case, we size the three WAN buffers in the forward path to accommodate 48 ms of traffic, more than enough to comply with the BDP rule (the largest propagation-only RTT among the three flow groups is 36 ms). With PED, we set the buffer sizes at 6.4 ms for NL1 and NL2 and at 16 ms for NL3: Qmax,1 = 120 kB, Qmax,2 = 240 kB, and Qmax,3 = 60 kB, all trivial to implement on-chip. At each queue we set the PED buffer thresholds using the same ratios to Qmax as in Scenarios 1, 2, and 3. The lowered capacities of the three WAN links demand reconfiguration of the AQL update period τq and of the minimum allowed value τD^(l) for the PED drop period. Since the transmission of a 1500 B packet takes between 40 μs and 400 μs with the link rates of the topology, we set τq = 500 μs and τD^(l) = 1 ms for all three queues. Finally, we configure the RED queues with the same buffer sizes and EWMA parameters as the PED queues, setting the buffer thresholds at the same ratios to the total buffer space as in the experiments of Section 2.1 (10% for bmin and 90% for bmax).
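As a quick check of the buffer sizes quoted above (ours, for illustration only), converting each per-link delay budget into bytes at the corresponding WAN link rate gives exactly the configured Qmax values.

```python
# Converting the per-link buffering budgets quoted above into bytes.
# 150, 300, and 30 Mbps are the WAN link rates of Scenario 4.
links_mbps = {"NL1": 150, "NL2": 300, "NL3": 30}
budget_ms = {"NL1": 6.4, "NL2": 6.4, "NL3": 16.0}     # PED buffers
for name, rate in links_mbps.items():
    q_max = rate * 1e6 / 8 * budget_ms[name] * 1e-3   # bytes
    print(f"{name}: Q_max = {q_max/1e3:.0f} kB")      # 120, 240, 60 kB
# The 48 ms Tail-Drop buffers are 7.5x, 7.5x, and 3x larger, respectively.
```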
We manually optimize the values of pmax (0.04, 0.02, and 0.067) so that they yield near-full link utilization at all WAN links when there are L = 200 flows in each group.

Fig. 12. Network configuration for Scenario 4.

In Table 1 we list steady-state link utilization, average delay, and maximum delay for each group of TCP flows, measured over a 100 s interval with L = 200 and L = 20 flows per group (the delays refer exclusively to the WAN links, combining queueing and propagation components in the forward direction). The results in Table 1 provide several important indications. First, the buffer size dominates the delay performance of the Tail-Drop policy. Not only does the maximum delay systematically include the queueing delay that derives from the saturation of the buffer (48 ms in the example), but the average delay also gets closer to the maximum delay as the traffic mix becomes smoother. The effect on performance is strikingly counter-intuitive: a more stable queue behavior implies larger average delays. We see this happening when the number of flows per group increases (Fig. 13): more flows need smaller congestion windows to fill out the data path and produce narrower queue length oscillations when they lose packets. We can observe the same trend also when we reduce the average RTT of the flows that traverse the bottleneck link. Since the queue length oscillations are always anchored to the maximum queue length, Tail-Drop produces higher average delays when the traffic mix is friendlier to link utilization. Second, we see once again with L = 20 that RED cannot retain full link utilization under varying traffic conditions if it does not adjust its configuration parameters. Third, PED's performance is practically insensitive to the traffic mix, as opposed to the other two schemes.

Fig. 13. AQL at Tail-Drop buffer of link NL1 with 200 and 20 flows per group in Scenario 4.

6. PED performance: multi-bottleneck flows
In this section we compare the effects of PED, Tail-Drop, and RED on a group of flows that traverse multiple congested links and compete at each link with single-bottleneck flows. This type of flow arrangement is becoming more and more common as the growing capacities of host and access links keep pushing the placement of traffic bottlenecks from the periphery of the network to its core. For simulation purposes, we define a new Scenario 5 by adding to the topology and flow groups of Scenario 4 a new group G4 of TCP flows that traverse all three WAN links (see the sources and sinks with darker background in Fig. 14). We study this parking-lot topology for two reasons. First, we want to assess the capability of PED to retain full link utilization at one queue when some of the TCP flows that share it may have their packets dropped elsewhere. Second, we look for a straightforward demonstration of the disruptive effects of bufferbloat [5] on network applications, and of the mitigation benefits that PED can offer. The term bufferbloat describes the extensive deployment of large Tail-Drop buffers in network nodes by plain observance of the BDP rule. There is growing consensus [27] that bufferbloat can effectively reverse the intended positive outcome of the increased availability of bandwidth in every segment of the network infrastructure. Since the BDP rule calls for a fixed queueing delay in front of a link, independently of its capacity, at times of congestion more bandwidth does not necessarily translate into smaller delays and faster file deliveries. The net result for end users can be less predictable performance of network applications and longer completion times for file transfers. Table 2 lists the link utilization achieved at the WAN links of Scenario 5 with the three buffer management schemes. The PED numbers show that the scheme
Fig. 14. Network configuration for Scenario 5.
Table 2
Link utilization with Tail-Drop, RED, and PED in Scenario 5.

Buffer and traffic configuration   Link utilization (%)
Tail-Drop, L = 200                 100, 100, 100
RED, L = 200                       99.26, 99.77, 99.30
PED, L = 200                       99.08, 99.82, 99.06
Tail-Drop, L = 20                  100, 100, 100
RED, L = 20                        92.16, 95.21, 98.16
PED, L = 20                        99.89, 100.00, 99.35
guarantees stability at each queue without suffering from the distribution of packet losses over multiple bottlenecks. With RED we see again a performance drop when the number of flows is different than the one for which pmax is optimized. A bit surprisingly, we see no degradation of link utilization for Tail-Drop, as if bufferbloat had no relevant impact on network performance. To find evidence that bufferbloat does indeed disrupt network applications we must look at the throughput received by the flows of group G4. Table 3 lists the average and the normalized standard deviation for the throughput of flows in the four groups during a 100 s time interval. With L = 200 flows per group, the average throughput of the multi-bottleneck flows is 1000 times larger with PED than with Tail-Drop. The minimal Tail-Drop throughput (42 bps) certainly brings agony to the G4 end users. PED also more than triples the G4 average throughput compared to RED. With 20 flows the ratio between PED and Tail-Drop is smaller, but still larger than 250. This is because the wider queue length oscillations already observed for Tail-Drop in Scenario 4 tend to reduce the queueing delay, and therefore the penalty for multi-bottleneck flows. We deduce that the impact of bufferbloat is heavier when a larger number of single-bottleneck TCP flows compete for bandwidth with multi-bottleneck flows in front of the WAN links. When L = 20 the G4 throughput is slightly higher with RED than with PED, but this is because the instability of the three RED queues and the associated drop in link utilization obviously bring down the average delay, to the benefit of multi-bottleneck flows. The effect of bufferbloat, particularly evident with the Tail-Drop policy, is to massively transfer bandwidth from multi-bottleneck to single-bottleneck flows. The distribution of bandwidth shares among flows at each bottleneck link is no longer driven by the propagation-delay component of the respective round-trip times, but dominated by the queueing delay at other bottlenecks. The unfairness of Tail-Drop in distributing packet losses, by which
low-bandwidth flows lose a larger quota of their packets than high-bandwidth flows [9], further aggravates the condition of multi-bottleneck flows. The transfer of bottleneck bandwidth shares between groups of flows also explains the immaculate link-utilization performance of Tail-Drop in Scenario 5: since only few G4 packets are entering the data path, each bottleneck buffer ends up handling pretty much the same traffic mix as in Scenario 4. This is consistently not the case with RED and PED. Unfortunately, network monitors that only return queue occupancy and link utilization data offer no insight into the traffic conjunctures that produce abysmal network performance for some end users. Not even the recording of per-flow packet losses can help, because with bufferbloat the low throughput of a flow is induced not by losses, but by the queueing delay accumulated at multiple links along the data path, possibly in a number of different administrative domains. We emphasize that PED successfully mitigates the bufferbloat effects by drastically reducing the queueing delay imposed at each bottleneck and by restoring fairness in the distribution of packet losses.
Table 3
Flow throughput statistics for the four flow groups of Scenario 5.

Buffer and traffic configuration   G1 Avg (kbps) / St.Dev. (%)   G2 Avg (kbps) / St.Dev. (%)   G3 Avg (kbps) / St.Dev. (%)   G4 Avg (kbps) / St.Dev. (%)
Tail-Drop, 200                     750 / 2.37                    1500 / 0.77                   150 / 46.0                    0.04 / 204.5
RED, 200                           730 / 52.5                    1482 / 47.6                   135 / 131.0                   14 / 170.8
PED, 200                           698 / 66.9                    1452 / 49.2                   105 / 162.5                   44 / 130.4
Tail-Drop, 20                      7498 / 3.1                    14998 / 2.7                   1498 / 2.3                    1.7 / 80.0
RED, 20                            6437 / 6.9                    13807 / 6.6                   1005 / 40.5                   467 / 48.2
PED, 20                            7033 / 5.1                    14540 / 5.6                   1038 / 37.1                   452 / 43.2

7. Conclusions

We have defined a new Periodic Early Detection (PED) scheme, where the frequency of the packet drop decisions adjusts to the queue length not instantly but at the RTT timescale. Our simulation results assert PED's ability to consistently enforce 100% link utilization with long-lived TCP flows using only a small fraction (less than 5%) of the memory space of current designs. With technology available in 2012, PED can enable the on-chip implementation of packet buffers for link rates up to 100 Gbps, contributing to better system scalability and energy efficiency. The same reduction of buffer sizes also stabilizes the performance of network applications, removing a major cause for user frustration.

In practical applications, the traffic mix in a bottleneck link is not only made of long-lived TCP flows, but also includes UDP and short-lived TCP flows, which respond differently to packet losses. Simple experiments with PED reveal that the throughput of TCP flows and the utilization of the bottleneck link degrade heavily if TCP and UDP packets are all stored in one queue. However, this is true of any buffer management scheme and is not specifically a PED issue. Moreover, the enforcement of bandwidth and delay guarantees for applications that rely on UDP transport already mandates the assignment of TCP and UDP packets
to different queues. Differentiation in queue treatment between short-lived and long-lived TCP flows has been advocated in other contexts [28,29] and may also prove beneficial in PED queues.

We have executed all simulation experiments using TCP Reno sources. We plan to extend our study to TCP sources of different types, especially those that rely on queueing delays rather than packet losses as congestion indicators [14,30]. We expect to observe important differences in the results, but confidently none that would fail to confirm the benefits of the RTT reduction and stabilization enabled by PED. We are also developing an analytical framework for the formal characterization of PED's properties.

Acknowledgement

This work was completed with the support of the US Department of Energy (DOE), award No. DE-EE0002887. However, any opinions, findings, conclusions and recommendations expressed herein are those of the author and do not necessarily reflect the views of the DOE.

References

[1] V. Jacobson, K. Nichols, K. Poduri, RED in a Different Light, September 1999 (accessed March 2012).
[2] J. Heinanen, R. Guerin, A Single Rate Three Color Marker, IETF RFC 2697, September 1999.
[3] J. Heinanen, R. Guerin, A Two Rate Three Color Marker, IETF RFC 2698, September 1999.
[4] C. Villamizar, C. Song, High-performance TCP in ANSNET, ACM SIGCOMM Computer Communication Review 24 (5) (1994) 45–60.
[5] J. Gettys, Bufferbloat: dark buffers in the Internet, IEEE Internet Computing 15 (3) (2011) 95–96.
[6] G. Appenzeller, I. Keslassy, N. McKeown, Sizing router buffers, in: Proceedings of ACM SIGCOMM 2004, Portland, OR, August 2004.
[7] Y. Ganjali, N. McKeown, Update on buffer sizing in Internet routers, ACM SIGCOMM Computer Communication Review 36 (5) (2006) 67–70.
[8] A. Vishwanath, V. Sivaraman, M. Thottan, Perspectives on router buffer sizing: recent results and open problems, ACM SIGCOMM Computer Communication Review 39 (2) (2009) 34–39.
[9] S. Floyd, V. Jacobson, Random early detection gateways for congestion avoidance, IEEE/ACM Transactions on Networking 1 (4) (1993) 397–413.
[10] R. Braden, et al., Recommendations on Queue Management and Congestion Avoidance in the Internet, IETF RFC 2309, April 1998.
[11] S.H. Low, F. Paganini, J. Wang, S. Adlakha, J.C. Doyle, Dynamics of TCP/RED and a scalable control, in: Proceedings of IEEE INFOCOM 2002, New York City, NY, June 2002.
[12] S. Ha, I. Rhee, L. Xu, CUBIC: a new TCP-friendly high-speed TCP variant, ACM SIGOPS Operating Systems Review 42 (5) (2008) 64–74.
[13] N. Dukkipati et al., An argument for increasing TCP's initial congestion window, ACM SIGCOMM Computer Communication Review 40 (3) (2010) 27–33.
[14] C. Jin et al., FAST TCP: from theory to experiments, IEEE Network 19 (1) (2005) 4–11.
[15] FastSoft, 2012 (accessed March 2012).
[16] S. Floyd, R. Gummadi, S. Shenker, Adaptive RED: An Algorithm for Increasing the Robustness of RED's Active Queue Management, August 2001 (accessed March 2012).
[17] A. Francini, Beyond RED: periodic early detection for on-chip buffer memories in network elements, in: Proceedings of IEEE HPSR 2011, Cartagena, Spain, July 2011.
[18] Cisco CRS-1 multi-shelf system, 2012 (accessed March 2012).
[19] M.K. Thottan et al., Adapting Router Buffers for Energy Efficiency, Bell Labs Technical Document ITD-11-51109D, February 2011.
[20] K. Hooghe, M. Guenach, Towards energy-efficient packet processing in access nodes, in: Proceedings of IEEE GLOBECOM 2011, Houston, TX, December 2011.
[21] K. Ramakrishnan, S. Floyd, D. Black, The Addition of Explicit Congestion Notification (ECN) to IP, IETF RFC 3168, September 2001.
[22] M. May, J. Bolot, C. Diot, B. Lyles, Reasons not to deploy RED, in: Proceedings of IEEE/IFIP IWQoS 1999, London, UK, June 1999.
[23] ns2, 2012 (accessed March 2012).
[24] V. Misra, W.B. Gong, D. Towsley, Fluid-based analysis of a network of AQM routers supporting TCP flows with an application to RED, in: Proceedings of ACM SIGCOMM 2000, Stockholm, Sweden, August 2000.
[25] W. Feng, D. Kandlur, D. Saha, K. Shin, A self-configuring RED gateway, in: Proceedings of IEEE INFOCOM 1999, New York City, NY, March 1999.
[26] A. Francini, Active queue management with variable bottleneck rate, in: Proceedings of the 35th IEEE Sarnoff Symposium, Newark, NJ, May 2012 (to appear).
[27] Bufferbloat, 2012 (accessed March 2012).
[28] R. Pan, B. Prabhakar, K. Psounis, CHOKe – a stateless active queue management scheme for approximating fair bandwidth allocation, in: Proceedings of IEEE INFOCOM 2000, Tel Aviv, Israel, March 2000.
[29] D.M. Divakaran, G. Carofiglio, E. Altman, P. Vicat-Blanc Primet, A flow scheduler architecture, in: Networking 2010, Lecture Notes in Computer Science, vol. 6091, Springer, 2010, pp. 122–134.
[30] L. Brakmo, L. Peterson, TCP Vegas: end-to-end congestion avoidance on a global Internet, IEEE Journal on Selected Areas in Communications 13 (8) (1995) 1465–1480.
Andrea Francini is a member of technical staff in the Computing and Software Principles Research Department at Alcatel-Lucent Bell Laboratories in North Carolina. He holds an Italian Laurea degree (summa cum laude) in Electrical Engineering and a Ph.D. in Electrical Engineering and Communications, both from the Politecnico di Torino, Italy. With Bell Labs since 1996, he has designed Quality-of-Service architectures for packet switching chipsets, multi-service systems, and broadband access platforms, and has been a contributing member of the WiMAX Forum and of the IEEE 802.21 working group. His current research focus is on algorithms and architectures for minimizing energy consumption in packet networks. Dr. Francini received the Bell Labs President's Gold Award in 1999 and the IEEE Globecom, High-Speed Networks Symposium Best Paper Award in 2002. He was a member of the Alcatel-Lucent Technical Academy for 2009–2011, holds 12 USPTO patents with 8 more applications pending, and has published more than 30 papers in international journals and conferences.