Computer Networks 37 (2001) 561–578

www.elsevier.com/locate/comnet

Enhanced weighted round robin schedulers for accurate bandwidth distribution in packet networks ☆

A. Francini *, F.M. Chiussi, R.T. Clancy 1, K.D. Drucker 2, N.E. Idirene 3

Bell Laboratories, Lucent Technologies, Holmdel, NJ 07733, USA

Abstract

Weighted round robin (WRR) schedulers constitute a popular solution for differentiating the bandwidth guarantees of heterogeneous IP flows, mostly because of their minimal implementation cost. However, the existing WRR schedulers are not sufficient to satisfy all the requirements of emerging quality-of-service frameworks. Flexible bandwidth management at the network nodes requires the deployment of hierarchical scheduling structures, where bandwidth can be allocated not only to individual flows, but also to aggregations of those flows. With currently available WRR schedulers, the superimposition of a hierarchical structure compromises the simplicity of the basic scheduler. WRR schedulers are also known for their burstiness in distributing service, which exposes the scheduled flows to higher packet-loss probability at downstream nodes. By construction, WRR schedulers distribute bandwidth proportionally to the service shares allocated to the individual flows. For best-effort (BE) flows, having no specified bandwidth requirements, existing WRR schedulers typically allocate arbitrary service shares. This approach conflicts with the intrinsic nature of BE flows and reduces the availability of bandwidth for the allocation of guaranteed-bandwidth (GB) flows. We present three enhancements for WRR schedulers that solve these problems. In the first enhancement, we superimpose a "soft" scheduling layer on the basic WRR scheduler by simply redefining the computation of the flow timestamps. The second enhancement substantially reduces the service burstiness of the WRR scheduler with only marginal impact on its implementation cost. Finally, the third enhancement allows the smooth integration of GB and BE flows, with efficient management of the available bandwidth and total compliance with the nature of BE flows. © 2001 Elsevier Science B.V. All rights reserved.
Keywords: Packet scheduling; Packet switching; Quality of service; Differentiated services; Multi-protocol label switching; Best effort

☆ A preliminary version of this paper was published in Ref. [29].
* Corresponding author. Address: Room 4G-532, 101 Crawfords Corner Road, Holmdel, NJ 07733, USA. E-mail address: [email protected] (A. Francini).
1 This author was with Bell Laboratories, Lucent Technologies when this work was performed. He is now with Sycamore Networks, Wallingford, CT 06942, USA.
2 This author was with Bell Laboratories, Lucent Technologies when this work was performed. He is now with Agere Systems, Murray Hill, NJ 07974, USA.
3 This author was with Bell Laboratories, Lucent Technologies when this work was performed. He is now with Xebeo Communications, South Plainfield, NJ 07080, USA.

1. Introduction

The increasing popularity of elaborate quality-of-service (QoS) frameworks, such as differentiated services [18] and other QoS proposals related to multi-protocol label switching (MPLS) [2,17], puts emphasis on packet schedulers that allow flexible bandwidth management. Several existing packet schedulers offer excellent worst-case delay performance in addition to providing accurate bandwidth guarantees [3,22,24–27]. However,

in networks where packets have variable size, as is the case in IP, the implementation cost of schedulers with good worst-case delay properties is substantial [22,25,27]. (On the contrary, in networks with fixed-size packets, as is the case in ATM, low-cost scheduling techniques that provide both accurate bandwidth guarantees and tight delay bounds are well known [5,8].) The heavy implementation cost of packet schedulers that feature optimal delay performance, and the common perception that worst-case delay performance is rather secondary to robust bandwidth performance in IP networks, have recently increased the interest in schedulers that have very low complexity and can provide accurate bandwidth guarantees, but do not necessarily achieve tight delay bounds. A typical example is the family of weighted round robin (WRR) schedulers [1,16,21]. Different versions of these scheduling algorithms have appeared in the literature; well-known examples are the deficit round robin (DRR) [21] and surplus round robin (SRR) [1] algorithms. Their delay properties are far from optimal, but they enforce strict bandwidth guarantees with minimal implementation cost.

Existing WRR schedulers suffer from three main drawbacks. First, they are essentially "single-layer" schedulers, which implies that they can control the distribution of bandwidth only to individual flows. Superimposing multiple scheduling layers, and thus implementing a hierarchical and flexible structure that can not only allocate bandwidth to individual flows, but also create aggregations of flows and segregate bandwidth accordingly, compromises the simplicity of these schedulers. Second, WRR schedulers distribute service to the backlogged flows in a highly bursty manner, since they continue servicing a flow until its allocated share within a service frame is exhausted. Reducing such service burstiness is highly desirable, since it leads to lower packet losses, allows for more effective buffer management, and improves the delay distribution. Third, WRR schedulers distribute bandwidth to the backlogged flows proportionally to their allocated service shares. For flows with specified bandwidth requirements (or guaranteed-bandwidth (GB) flows), the allocated service shares reflect the respective minimum-bandwidth guarantees. For best-effort (BE) flows, which have no specified bandwidth guarantees and only require fairness in their relative treatment, the allocation of service shares in a WRR scheduler that also serves GB flows is always problematic. Ideally, BE flows should access the server only after all backlogged GB flows have obtained their guaranteed bandwidth. The amount of bandwidth globally available to BE flows should dynamically adapt to the backlog state of the GB flows, increasing when their activity is low, and decreasing when it intensifies. The bandwidth that is not used by GB flows should then be evenly distributed to all backlogged BE flows. In existing WRR schedulers, a BE flow can be serviced only if it is allocated a service share. The selection of the service share is always arbitrary, and intrinsically compromises the flexibility that should instead characterize the distribution of bandwidth to BE flows. Furthermore, the service shares allocated to the BE flows are subtracted from a bandwidth pool that is shared with the GB flows, which reduces the number of GB flows that the scheduler can support.

In this paper, we propose three enhanced WRR schedulers to solve these problems. The three enhancements are independent, and can be combined with no limitations. They all use the SRR algorithm, which proves more suitable than DRR to segregate bandwidth with minimal overhead. In our first enhancement, we create a WRR scheduler with a hierarchical structure for bandwidth segregation. Our objective is to provide bandwidth guarantees to aggregations of flows (which we call bundles) as well as to individual flows in a completely transparent manner, i.e., without using any additional scheduling structure. We achieve our objective by first defining a timestamp-based version of the SRR algorithm, and then simply enhancing the way the timestamps are manipulated. The resulting scheduling hierarchy is "soft", and therefore has negligible complexity; yet, it is effective, as we demonstrate with both theoretical arguments and experimental results. Our hierarchical structure provides bandwidth
segregation to the bundles as well as robust guarantees to the individual flows. The second enhancement consists of an implementation of the SRR scheduler that reduces the service burstiness. The implementation is based on a modified calendar queue, which includes a very small number of sorting bins. The resulting scheduler achieves a substantial improvement in burstiness with only a marginal increase of its overall complexity. Our third enhancement allows the flexible integration of GB and BE flows in a single scheduling engine. We divide the WRR service frame in two sub-frames, devoting the first sub-frame to satisfying the bandwidth guarantees of the GB flows that are currently backlogged, and the second one to servicing BE flows. For each frame, the duration of the two sub-frames depends on the amount of bandwidth allocated to the GB flows and on the number of GB flows that are backlogged at the beginning of the frame. The service shares of the BE flows are no longer drawn from the same finite pool (constrained by the capacity of the outgoing link) that also sustains the bandwidth guarantees of the GB flows, but from a distinct, unconstrained pool that dynamically adapts to the portion of link capacity left unused by the GB flows. We handle the aggregate of BE flows as a single entity in the WRR scheduler, with a service share that adapts to the backlog dynamics of the GB flows. We enforce fairness in the treatment of backlogged BE flows by sorting them in a replica of the basic WRR queueing structure. The integration of GB and BE flows works in the case where no hierarchical bandwidth segregation is superimposed, as well as in the presence of bundles. We present simulation results that show the effectiveness of our technique in tailoring the service shares of the BE flows to the bandwidth left unused by inactive GB bundles. The rest of this paper is organized as follows.
In Section 2, we briefly recall some general scheduling concepts, review the calendar queue, which we use as the starting point for our implementation of the WRR with reduced burstiness, and describe the DRR and SRR algorithms. In Section 3, we present our enhanced WRR scheduler to provide soft bandwidth segregation, and show simulation results that confirm the operation of the scheduler. In Section 4, we present an implementation of our enhanced WRR scheduler to reduce the service burstiness, together with supporting experimental results. In Section 5, we present our novel technique for the integration of GB and BE flows, and provide simulation results that validate its performance. Finally, in Section 6, we offer some concluding remarks.

2. Background

2.1. Packet schedulers with allocation of service shares

The system that we address in this paper is a packet multiplexer where multiple traffic flows contend for the bandwidth that is available on the outgoing link. The multiplexer maintains a distinct queue of packets for each flow (per-flow queueing), and handles the flows as individual entities in the scheduler that distributes access to the link (per-flow scheduling). We focus on the scheduling part of the system, and restrict our attention to schedulers that associate a service share ρ_i with each configured flow i and always distribute service to the flows in proportion to their allocated shares (popular share-based service disciplines include GPS-related [11,19,24,28] and WRR [16,21] schedulers). The latency θ_i experienced by flow i captures the worst-case delay properties of a share-based scheduler, provided that the sum of the allocated service shares does not exceed the capacity of the server. (The latency of flow i is defined as the maximum interval of time that the flow must wait at the beginning of a busy period before its service rate becomes permanently not lower than the allocated service share ρ_i [23].) Packet-by-packet rate-proportional servers (P-RPS) [24] feature minimum latency among GPS-related schedulers. GPS-related schedulers that are not P-RPS, such as self-clocked fair queueing (SCFQ) [11] and start-time fair queueing (SFQ) [12], as well as WRR schedulers [16,21], have much looser latency guarantees than any P-RPS [23]. However, in the case of variable-sized packets,

their implementation complexity is much lower, and their bandwidth guarantees are still robust, which makes them extremely appealing whenever the flows to be scheduled have no stringent delay requirements.
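The admissibility condition mentioned above can be made concrete with a small sketch; function and variable names are ours, not the paper's:

```python
def admissible(shares, capacity):
    """Return True if the allocated service shares can all be honored.

    Share-based schedulers turn shares into guaranteed rates (and latency
    bounds) only when the shares sum to at most the server capacity r.
    """
    return sum(shares) <= capacity


def guaranteed_rates(shares, capacity):
    """Each flow's long-term rate is at least its share when admissible;
    otherwise only the proportions rho_i / sum(rho) are preserved."""
    if admissible(shares, capacity):
        return list(shares)
    total = sum(shares)
    return [capacity * s / total for s in shares]
```

For example, shares (2, 3, 4) fit a server of capacity 10, while two shares of 2 on a server of capacity 2 are scaled back to rates of 1 each.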

2.2. The calendar queue

Most schedulers with allocation of service shares assign timestamps to the backlogged flows to determine the order of transmission of the respective packets. A timestamp generally expresses the service deadline for the associated flow, in a virtual-time domain that is specific to each algorithm. The implementation cost of a timestamp-based scheduler is commonly dominated by the complexity of sorting the timestamps. The calendar queue [7,20] reduces the complexity of the sorting task by providing an ordered structure of bins, each bin being associated with a certain range of timestamp values. The representative timestamp of a bin is given by the lower end of the timestamp range associated with the bin. The bins are ordered in memory by increasing value of their representative timestamps. Flows (or packets) are stored in the bins based on their current timestamps. The bins are then visited in their order in memory. By construction, when the mechanism visits a bin that is non-empty, the representative timestamp of the bin is the minimum representative timestamp in the calendar queue. The whole idea is to provide a structure where each position in memory has a direct relation with the value of the timestamps, and simplify the sorting task by exploiting the spatial separation that is achieved by associating the timestamps with the correct bins. In practice, each bin contains a list of flows, which is commonly served in first-in-first-out (FIFO) order. The calendar queue is feasible only if the underlying scheduler guarantees the valid range of timestamp values to be finite at any time. If the finite-range condition holds, then the calendar queue can become a circular queue, where the bins are reused as time progresses. In order to prevent timestamps belonging to disjoint intervals from overlapping in the same bin, the range of valid timestamp values must always be a subset of the total range of values that the calendar queue can cover.

2.3. Deficit round robin

The DRR algorithm [21] is one of the most popular examples of a WRR scheduler for variable-sized packets, due to its minimal implementation complexity and its efficiency in servicing the flows in proportion to their allocated shares. WRR schedulers associate a service share ρ_i with each configured flow i. The service shares translate into minimum guaranteed service rates when their sum over all configured flows does not exceed the capacity r of the server:

$$\sum_{i=1}^{V} \rho_i \le r \qquad (1)$$

The bound of Eq. (1), where V is the total number of configured flows, guarantees that flow i receives service at a long-term rate that is not lower than ρ_i. In addition to this minimum-bandwidth guarantee, the bound always provides a latency guarantee to the flow (the provision of the latency guarantee is instead not possible if the total share allocation exceeds the capacity of the server). Conforming to the WRR concept, the DRR algorithm divides the activity of the server into service frames. Throughout this paper, we consider a formulation of the algorithm that uses a reference timestamp increment T_Q to express the frame duration in the virtual-time domain. (This formulation is not the one used in the definition of DRR originally presented in Ref. [21], but it is functionally equivalent, and better suited to the description of the WRR enhancements that we present in the following sections.) Within a frame, each configured flow i is entitled to the transmission of a service quantum Q_i of information units such that

$$Q_i = \rho_i T_Q \qquad (2)$$

The scheduler visits the backlogged flows only once per frame, and therefore fulfills in a single shot their service expectations for the frame. Each flow i maintains a timestamp F_i, which is updated every time a new packet p_i^k of length l_i^k reaches the head of the flow queue:

$$F_i^k = F_i^{k-1} + \frac{l_i^k}{\rho_i} \qquad (3)$$

The scheduler keeps servicing the flow as long as its timestamp remains smaller than T_Q. When the timestamp exceeds the reference timestamp increment, the scheduler declares the visit to flow i over: it subtracts T_Q from the timestamp and looks for another backlogged flow to serve. The timestamps carry over the service credits of the backlogged flows to the following frames, allowing the scheduler to distribute service proportionally to the allocated service shares in the long term (i.e., over multiple frames). When a flow i becomes idle, the scheduler immediately moves to another flow. If flow i becomes backlogged again in a short time, it must wait for the next frame to start in order to receive a new visit from the server. When the flow becomes idle, its timestamp is reset to zero to avoid any loss of service when the flow becomes backlogged again in a future frame. By construction, the timestamp of an idling flow is always smaller than T_Q, so that the timestamp reset never generates extra credits that would otherwise penalize other flows. Generally, the value of timestamp F_i at the beginning of the frame for flow i ranges between 0 and L_i/ρ_i, where L_i is the maximum size of a packet of flow i. This fluctuation of the initial value of the timestamp induces a fluctuation of the amount of information units that flow i transmits in a frame, which ranges within the interval (Q_i − L_i, Q_i + L_i). Accordingly, the total amount of information units that the server transmits in a frame is not fixed, even when all configured flows are permanently backlogged. The DRR scheduler was implemented in Ref. [21] with a single linked list of backlogged flows, visited in FIFO order. The arrangement of the backlogged flows in a single FIFO queue leads to O(1) implementation complexity, provided that the reference timestamp increment T_Q is not smaller than the timestamp increment determined by the maximum-sized packet for the flow with minimum service share:

$$T_Q \ge \frac{L_{\max}}{\rho_{\min}} \qquad (4)$$

If the condition of Eq. (4) is not satisfied, the algorithmic complexity of the scheduler explodes with the worst-case number of elementary operations to be executed between consecutive packet transmissions, since the scheduler may have to process a large number of flows before finding one that it can actually serve (elementary operations include: flow extraction and insertion in the linked list; timestamp update; comparison of the timestamp with the reference timestamp increment). In particular, the scheduler may have to deny service to the same flow for several consecutive frames, until the repeated subtraction of the reference timestamp increment makes the flow timestamp fall within the [0, T_Q) interval. The pseudo-code of Fig. 1 specifies the rules for handling flow i and updating its timestamp in DRR. The price to pay for the minimal complexity of the single-FIFO-queue implementation of DRR comes in terms of service burstiness and latency. The following bound on the latency of DRR, optimized under the assumption that T_Q = L_max/ρ_min, was provided in Ref. [23]:

$$\theta_i^{DRR} \le \left( \frac{3(r - \rho_i)}{\rho_{\min}} + \frac{\rho_i}{\rho_{\min}} \right) \frac{L_{\max}}{r} \qquad (5)$$

The latency of DRR is about three times as large as the latency of SCFQ [11], which in turn has by far the worst delay performance among GPS-related schedulers [23]. The reason for the poor

Fig. 1. Pseudo-code for flow and timestamp handling in DRR upon assignment of a new timestamp to flow i.

delay performance of DRR is in the combination of the single-FIFO-queue implementation of the sorting structure with the credit-accumulation mechanism that is instantiated in the timestamps. In addition to the heavy degradation in latency, a second negative implication of queueing all backlogged flows in a single linked list and exhausting their frame shares in a single visit is the sizable service burstiness that the flows can experience. Typically, a flow i waits for T_Q − Q_i/r time units in between consecutive visits of the server, and then obtains complete control of the server for Q_i/r consecutive time units.

2.4. Surplus round robin

A description of SRR was provided in Ref. [1]. The algorithm features the same parameters and variables as DRR, but a different event triggers the update of the timestamp: a flow i receives a new timestamp F_i^k when the transmission of packet p_i^k gets completed, independently of the resulting backlog state of the flow. The end of the frame is always detected after the transmission of a packet, and not before: the timestamp carries over to the next frame the debit accumulated by the flow during the current frame, instead of the credit that is typical of DRR. An advantage of SRR over DRR is that it does not need to know in advance the length of the head-of-the-queue packet to determine the end of the frame for a flow. On the other hand, in order to prevent malicious flows from stealing bandwidth from their competitors, the algorithm cannot reset the timestamp of a flow that becomes idle. The non-null timestamp of an idle flow is eventually obsoleted by the end of the next frame. Ideally, the timestamp should be reset as soon as it becomes obsolete. However, in a scheduler that handles hundreds of thousands or even millions of flows, a prompt reset of all timestamps that can simultaneously become obsolete is practically impossible.
We therefore focus throughout the paper on implementations of the SRR algorithm which do not perform any check for obsolescence on the timestamps of the idle flows, and where a newly backlogged flow always resumes its activity with the latest value of the timestamp, however old that value can be. The effect of this assumption is that a newly backlogged flow may have to give up part of its due service the first time it is visited by the server, as a consequence of the debit accumulated long before. The pseudo-code of Fig. 2 specifies the rules for handling flow i and updating its timestamp in SRR, whereas Figs. 3 and 4 illustrate the difference between DRR and SRR in determining the duration of a frame. For simplicity of presentation, in the rest of the paper we use the WRR name when we allude to DRR or SRR generically, with no explicit reference to their distinguishing features.

Fig. 2. Pseudo-code for flow and timestamp handling in SRR upon assignment of a new timestamp to flow i.
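The SRR variant can be sketched by moving the timestamp update to after the transmission completes. Again an illustrative reading on our part, not the authors' code:

```python
from collections import deque


def srr(flows, rho, t_q, ts=None):
    """Illustrative SRR sketch. The timestamp is updated after each
    transmission completes, so packet lengths need not be known in advance;
    timestamps of idle flows are kept, carrying their debits forward."""
    active = deque(n for n in flows if flows[n])
    if ts is None:
        ts = {n: 0.0 for n in flows}
    order = []
    while active:
        n = active.popleft()
        q = flows[n]
        while q and ts[n] < t_q:
            order.append(n)                # transmit first ...
            ts[n] += q.popleft() / rho[n]  # ... then charge the timestamp
        if ts[n] >= t_q:
            ts[n] -= t_q                   # end of visit: debit carried over
        if q:
            active.append(n)               # still backlogged: next frame
        # no timestamp reset when the flow goes idle
    return order, ts


order, ts = srr({"A": deque([100] * 4), "B": deque([100] * 2)},
                {"A": 2, "B": 1}, t_q=100)
```

With equal-sized packets the end-of-frame debit is zero and SRR serves exactly Q_i per flow per frame; with variable sizes the debit in `ts` persists across idle periods, as discussed above.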

3. Soft bandwidth segregation

In this section, we present our novel scheme for enforcing bandwidth segregation in a WRR scheduler without modifying its basic structure. We show in Fig. 5 the model for bandwidth segregation that we assume as reference. The set of allocated flows is partitioned into K subsets that we call bundles. Each bundle I aggregates V_I flows and has an allocated service rate R_I. The logical organization of the scheduler reflects a two-layered hierarchy: it first distributes bandwidth to the bundles, according to their aggregate allocations, and then serves the flows based on their share allocations within the bundles. The scheduler treats each bundle independently of the backlog state of the corresponding flows, as long as at least one of them is backlogged.

Fig. 3. Frame definition in DRR. Assumption: at the beginning of the first frame, all flows have timestamp equal to zero.

Fig. 4. Frame definition in SRR. Assumption: at the beginning of the first frame, all flows have timestamp equal to zero.

Fig. 5. Reference model for bandwidth segregation.

We aim at the enforcement of strict bandwidth guarantees for both the flow aggregates and the individual flows within the aggregates (in this context, the flow shares express actual service rates), without trying to support delay guarantees of any sort (frameworks for the provision of stringent delay guarantees in a scheduling hierarchy are already available [4,27], but they all resort to sophisticated algorithms that considerably increase the complexity of the scheduler). Consistently with the condition of Eq. (1) for the provision of per-flow bandwidth guarantees in a WRR scheduler with no bandwidth segregation, the following condition must always hold on the rate allocations of the bundles in the system that we are designing:

$$\sum_{I=1}^{K} R_I \le r \qquad (6)$$

Similarly, the following bound must be satisfied within each bundle I in order to meet the bandwidth requirements of the associated flows:

$$\sum_{i \in I} \rho_i \le R_I \qquad (7)$$

A scheduling solution inspired by the frameworks presented in Refs. [4,27] would introduce a full-fledged (and expensive) scheduling layer to handle the bundles in between the flows and the link server. Generally, the implementation cost of a full-fledged hierarchical scheduler grows linearly with the number of bundles, because each bundle requires a separate replica of the basic per-flow scheduler. In our scheme, on the contrary, the layer that enforces the bundle requirements in the scheduling hierarchy is purely virtual, and is superimposed on a single instance of the basic scheduler. The cost of the structure that handles the individual flows is therefore independent of the number of configured bundles, which leads to substantial savings in the implementation of the scheduling hierarchy. The addition to the system of a virtual scheduling layer that supports strict bandwidth guarantees for pre-configured flow aggregates relies on a simple modification of the mechanism that maintains the timestamps of the flows in the timestamp-based version of the WRR algorithm (at this point of the discussion, we make no distinction between DRR and SRR). For each configured bundle I, the scheduler maintains, together with the guaranteed aggregate service rate R_I, the sum Φ_I of the service shares of the flows that are currently backlogged in the bundle (we refer to Φ_I as the cumulative share of bundle I):

$$\Phi_I = \sum_{i \in B_I} \rho_i \qquad (8)$$

In Eq. (8), B_I is the set of flows of bundle I that are currently backlogged. The idea is to use the ratio between the allocated share of flow i and the cumulative share of bundle I to modulate the guaranteed rate of the bundle in the computation of the timestamp increment associated with packet p_i^k:

$$F_i^k = F_i^{k-1} + \frac{l_i^k}{\rho_i} \cdot \frac{\Phi_I}{R_I} \qquad (9)$$

In order to verify that the timestamp assignment rule of Eq. (9) actually enforces the bandwidth guarantees of the bundles, we compute the amount of service that bundle I may expect to receive during a frame. The computation is based on the following two assumptions: (i) the cumulative share of the bundle remains unchanged during the whole frame, independently of the backlog dynamics of the corresponding flows; and (ii) the set of flows that can access the server during the frame includes only the flows that are backlogged at the beginning of the frame (if some flows in the bundle become backlogged after the frame has started, they must wait until the beginning of a new frame before they can access the server). If we apply the rule of Eq. (9) over all the services received by flow i of bundle I during the frame, the reference per-frame timestamp increment that we obtain for the flow is:

$$T_Q = \frac{Q_i}{\rho_i} \cdot \frac{\Phi_I}{R_I} \qquad (10)$$

Then, by aggregating the service quanta of all the flows in bundle I, we obtain the service quantum Q_I of the bundle:

$$Q_I = \sum_{i \in B_I} Q_i = \frac{\sum_{i \in B_I} \rho_i}{\Phi_I} \, R_I T_Q = R_I T_Q \qquad (11)$$

The expression of Q_I in Eq. (11) is identical to the expression of the flow quantum Q_i in Eq. (2), and therefore proves that the timestamp-updating rule of Eq. (9) preserves the bandwidth guarantees of bundle I, independently of the composition of the set of flows that are backlogged in the bundle at the beginning of the frame. Holding on to the assumption that the cumulative share of bundle I does not change during the frame, we can also show that the timestamp-updating rule of Eq. (9) preserves the service proportions for any two flows i, j of bundle I that never become idle during the frame:

$$\frac{Q_i}{Q_j} = \frac{\rho_i \, (R_I/\Phi_I) \, T_Q}{\rho_j \, (R_I/\Phi_I) \, T_Q} = \frac{\rho_i}{\rho_j} \qquad (12)$$

In order to specify the details of the WRR algorithm with bandwidth segregation, we must first discuss the assumptions that we have made to obtain the results of Eqs. (11) and (12), and then evaluate their algorithmic implications. The use of a constant value of the cumulative share Φ_I in all the timestamp increments that the scheduler computes during a frame provides a common reference for consistently distributing service to the flows of bundle I. The exclusion from the frame of the flows that become backlogged only after the frame has started serves an identical purpose. The timestamp increment is the charge that the system imposes on a flow for the transmission of the related packet. The cost of the transmission depends on the bandwidth that is available within the bundle at the time it is executed. In order to make the timestamp increment consistent with the cost of the bandwidth resource within the bundle, it must be computed when the resource is used, i.e., upon the transmission of the corresponding packet. If the scheduler computes the increment in advance, the state of the bundle (and therefore the actual cost of the bandwidth resource) can undergo radical changes before the transmission of the packet occurs, thus making the charging mechanism inconsistent with the distribution of bandwidth. Within the pair of WRR algorithms that we are considering, SRR is the one that best fits the requirement for consistency between transmissions and timestamp increments, because it uses the length of the just-transmitted packet to update the timestamp and determine the in-frame status of the corresponding flow. In DRR, on the contrary, the scheduler performs the timestamp update and the in-frame status check using the length of the new head-of-the-queue packet, possibly long before it is actually transmitted.
When the DRR server finally delivers the packet, the cumulative share of the bundle, and therefore the cost of bandwidth within the bundle, may have changed considerably since the latest timestamp update.
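A small numeric sketch of the modulated timestamp increment of Eq. (9), together with a check of the bundle-quantum identity of Eq. (11); function and variable names are ours:

```python
def srr_bundle_increment(length, rho_i, rate_bundle, phi_bundle):
    """Eq. (9): the per-flow charge l/rho_i is scaled by Phi_I / R_I, the
    ratio of the bundle's backlogged shares to its guaranteed rate."""
    return (length / rho_i) * (phi_bundle / rate_bundle)


# Eq. (10)/(11) check: with T_Q and R_I fixed, the per-frame quanta of the
# backlogged flows always sum to R_I * T_Q, whatever subset is backlogged.
T_Q, R_I = 1.0, 10.0
for backlogged_shares in ([2.0, 3.0], [2.0], [1.0, 1.0, 4.0]):
    phi = sum(backlogged_shares)
    quanta = [T_Q * R_I * s / phi for s in backlogged_shares]  # from Eq. (10)
    assert abs(sum(quanta) - R_I * T_Q) < 1e-9                 # Eq. (11)
```

When few of the bundle's flows are backlogged, Φ_I shrinks, each increment shrinks with it, and the surviving flows absorb the whole rate R_I, which is exactly the intra-bundle redistribution that the scheme targets.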


Following the classification of fair queueing algorithms introduced in Ref. [1], the adoption of SRR over DRR for the implementation of our soft scheduling hierarchy can be generalized to the classes of causal and non-causal schedulers. In a causal fair queueing algorithm, the selection of the next packet to transmit depends exclusively on the packets that the server has already transmitted, as is the case for SRR and SFQ. In a non-causal algorithm, on the contrary, the scheduling decision is influenced by packets that still have to be transmitted. This is the case for DRR and most GPS-related schedulers, including all P-RPS, where the timestamps associated with the backlogged flows depend on the size of the respective head-of-the-queue packets. Causal algorithms have poorer delay properties than non-causal algorithms, but lend themselves to the implementation of the soft scheduling hierarchy much better than non-causal algorithms.

Introducing the mechanism for bandwidth segregation in SRR is straightforward. In addition to the minimum-bandwidth guarantee R_I and the cumulative share Φ_I, each bundle I maintains a running share φ_I and a start flag σ_I. The running share keeps track of the sum of the service shares of the backlogged flows in the bundle:

    φ_I(t) = Σ_{i∈B_I(t)} ρ_i    ∀t.    (13)

The running share is updated every time a flow of the bundle changes its backlog state. In general, the updates of the running share φ_I do not translate into immediate updates of the cumulative share Φ_I. In fact, the scheduler updates the cumulative share of the bundle only upon detection of mismatching values in the start flag σ_I and in a global single-bit frame counter FRMCNT that the scheduler toggles at every frame boundary (the scheduler compares σ_I and FRMCNT every time it serves a flow of bundle I). A difference in the two bits triggers the update of the cumulative share to be used in the future timestamp computations (Φ_I ← φ_I) and toggles the start flag of the bundle (σ_I ← FRMCNT). If, instead, the two bits are already equal, the service just completed is certainly not the first one that the bundle receives during the current frame, and no action must be taken on the bundle parameters.

When the first flow of a bundle becomes backlogged, the start flag is set equal to FRMCNT (the bundle will receive its first service only during the next frame, after FRMCNT has toggled its current value). In order to identify the end of a frame, each flow i maintains a frame flag FF_i. The frame flag of flow i is set to the complement of FRMCNT whenever the flow is queued to the tail of the list of backlogged flows. When the scheduler finds a frame flag that does not match the frame counter, it declares the start of a new frame and toggles the frame counter. The sequence of operations to be executed after completing the transmission of a packet and processing the corresponding flow is summarized in the pseudo-code of Fig. 6. Compared to the implementation of a basic WRR scheduler, the only additional cost of the soft scheduling hierarchy is in the memory space needed to store the state information for the bundles and in the few elementary operations that maintain that information.

We illustrate in a simple simulation scenario the behavior of the SRR scheduler with bandwidth

Fig. 6. Pseudo-code of SRR with bandwidth segregation.
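The per-service bookkeeping just described can be rendered as follows. This is a minimal, hypothetical sketch (not the paper's actual pseudo-code); the transmission and queueing machinery is omitted, and class and method names are our own.

```python
# A minimal, hypothetical sketch of the per-service bundle bookkeeping
# described above. Names follow the text: FRMCNT is the global single-bit
# frame counter, sigma_I the start flag, phi_I the running share, Phi_I the
# cumulative share.

class Flow:
    def __init__(self, share):
        self.share = share        # rho_i
        self.frame_flag = 0       # FF_i

class Bundle:
    def __init__(self, guaranteed_rate):
        self.R = guaranteed_rate  # minimum-bandwidth guarantee R_I
        self.run_share = 0.0      # phi_I, tracked on every backlog change
        self.cum_share = 0.0      # Phi_I, frozen once per frame
        self.start_flag = 0       # sigma_I

class Scheduler:
    def __init__(self):
        self.frmcnt = 0           # FRMCNT, toggled at every frame boundary

    def flow_backlogged(self, flow, bundle):
        if bundle.run_share == 0.0:            # first backlogged flow of I
            bundle.start_flag = self.frmcnt    # first service in next frame
        bundle.run_share += flow.share
        flow.frame_flag = 1 - self.frmcnt      # queued to the tail: next frame

    def flow_idle(self, flow, bundle):
        bundle.run_share -= flow.share

    def on_service(self, flow, bundle):
        if flow.frame_flag != self.frmcnt:     # mismatch: a new frame starts
            self.frmcnt = 1 - self.frmcnt
        if bundle.start_flag != self.frmcnt:   # first service of I this frame
            bundle.cum_share = bundle.run_share  # Phi_I <- phi_I
            bundle.start_flag = self.frmcnt      # sigma_I <- FRMCNT
```

The key point is that Φ_I is sampled from φ_I at most once per frame, on the bundle's first service, so all timestamp increments within a frame share the same reference.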

Table 1
Bandwidth segregation in SRR (throughput is expressed in percentages of the server capacity)

                  Nominal        Without bundles           With bundles
    Flow          allocation     Expected    Observed      Expected    Observed
    i1 + … + i50  10.00          10.00       9.98          10.00       9.98
    i51           20.00          30.00       30.00         50.00       49.89
    i52           30.00          0.00        0.00          0.00        0.00
    j1            20.00          30.00       30.01         20.00       20.06
    k1            20.00          30.00       30.01         20.00       20.07

segregation. We consider a packet multiplexer with capacity r and three configured bundles (I, J, and K). Bundle I is allocated 60% of the server capacity, and contains 52 flows: the first 50 flows (i1, …, i50) are each allocated 0.2% of the capacity of the server; the service share of flow i51 is ρ_i51 = 0.2r; finally, the service share of flow i52 is ρ_i52 = 0.3r. Flows i1, …, i50 are strictly regulated, i.e., their packet arrival rate at the multiplexer never exceeds their guaranteed service share. These flows typically have negligible backlog whenever the server has bandwidth available in excess of their guaranteed shares. As a consequence, we expect each of them to switch backlog state quite often. Flow i51 is unregulated: the packet arrival rate of the flow is far above its allocated service share. This behavior induces a permanent backlog in the packet queue of flow i51. Finally, flow i52 always remains inactive, with no packets reaching the multiplexer. Bundle J is allocated 20% of the capacity of the server, and contains a single, unregulated flow j1 (similarly to i51, the backlog in the queue of flow j1 is permanent, independently of the service sequence generated by the scheduler). Bundle K is configured exactly the same way as bundle J: it is allocated 20% of the capacity of the server, and contains a single, unregulated flow k1. The inactivity of flow i52 makes 30% of the capacity of the server available to the flows that are active. The objective of the simulation setup is to show that the association of the active flows with three different bundles has a significant impact on the distribution of the bandwidth that is in excess of the nominal guarantees.

We execute two simulation runs under the same pattern of packet arrivals. In the first run, we remove the bundles from the system, and schedule the 53 active flows based on their allocated shares ρ_i1, …, ρ_i51, ρ_j1, and ρ_k1. We restore the bundles in the second run. We measure in both cases the bandwidth received individually by flows i51, j1, and k1, and cumulatively by the aggregate of flows i1, …, i50, and report the results in Table 1. When no bundles are superimposed, we expect the excess bandwidth made available by the inactivity of flow i52 to be evenly shared by flows i51, j1 and k1 (flows i1, …, i50 have no access to the excess bandwidth because they have no extra supply of packets above their nominal bandwidth allocations). When the bundles are overlaid, on the contrary, i51 is the only flow that is entitled to receive extra services, as long as it remains backlogged. The experimental results closely match the ideal shares that the scheduler should distribute, giving evidence of the effectiveness of the soft hierarchy in segregating bandwidth.

4. Reducing the service burstiness

In Section 3, we have identified SRR as the algorithm that successfully enforces bandwidth segregation with no substantial impact on the implementation complexity. However, the single-FIFO-queue implementation of the scheduler still induces poor performance in terms of latency and service burstiness. The scheduler visits each flow only once per frame: in order to maintain the bandwidth proportions defined by the allocated service shares, the scheduler must fulfill the entire per-frame service expectation of the flow in a single shot. If a flow has a large number of in-frame

packets waiting for service, the server transmits them back-to-back until the flow timestamp exceeds the reference timestamp increment T_Q. It is natural to expect a reduction in latency and burstiness if the server interleaves the transmission of packets of different flows within the same frame. This interleaving of packet transmissions is possible only if flows with more than one packet to transmit can reach the head of the list of backlogged flows multiple times during the frame. A single FIFO queue of flows obviously collides with this requirement, because a flow that is extracted from the head of the list is automatically forced to enter the next frame. Thus, the only way to distribute smoother service is to increase the number of queues of backlogged flows that the scheduler maintains. The calendar queue that we have reviewed in Section 2.2 is an array of linked lists, where flows get queued based on the values of the associated timestamps. We resort to this technique, which relates the service pattern generated by the scheduler to the progress in the amount of information that the flows have already transmitted during the frame, to reduce the service burstiness of our scheduler.

The calendar queue that we use to implement SRR, shown in Fig. 7, is divided into two logical segments of equal size. The in-frame segment contains the backlogged flows that can still be serviced during the current frame, while the out-of-frame segment contains the flows with exhausted service share and the newly backlogged flows of the current frame. At every end of a frame, the in-frame and out-of-frame segments swap their positions in the array of bins. In a calendar queue with N_b bins in each segment, the bin associated with the timestamp F_i^k of flow i has the following offset b_i within its segment:

    b_i = ⌊N_b · F_i^k / T_Q⌋.    (14)

When a flow becomes backlogged, it is queued to the tail of the bin corresponding to its latest timestamp in the out-of-frame segment of the calendar queue. When a flow is serviced, it is first extracted from the head of its current bin. Then, the scheduler updates its timestamp and checks whether the new value exceeds T_Q. If this is the case, the scheduler subtracts T_Q from the timestamp and, if the flow is still backlogged, appends it to the proper bin in the out-of-frame segment of the calendar queue. If, instead, the new timestamp does not exceed T_Q, the scheduler appends the flow to a bin of the in-frame segment. Every time the transmission of a packet is completed, the scheduler searches for the next flow to serve, starting from the bin of the just serviced

Fig. 7. Calendar queue for the implementation of SRR.
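The two-segment structure and the bin-offset rule of Eq. (14) can be sketched as follows. This is a hypothetical simplification: the search restarts from the lowest bin rather than from the bin of the just-serviced flow, and the FRMCNT toggle at the frame boundary is not shown.

```python
from collections import deque

class CalendarQueue:
    """Two equal segments of Nb bins; segment roles swap at frame boundaries."""

    def __init__(self, nb, tq):
        self.nb, self.tq = nb, tq
        self.bins = [deque() for _ in range(2 * nb)]
        self.in_frame = 0                     # 0 or 1: which half is in-frame

    def _offset(self, timestamp):
        # Eq. (14): b_i = floor(Nb * F_i / T_Q), clamped to the last bin
        return min(int(self.nb * timestamp / self.tq), self.nb - 1)

    def enqueue(self, flow, out_of_frame):
        # Newly backlogged flows and flows with exhausted share go to the
        # out-of-frame segment; flows still within their share stay in-frame.
        seg = 1 - self.in_frame if out_of_frame else self.in_frame
        self.bins[seg * self.nb + self._offset(flow.timestamp)].append(flow)

    def next_flow(self):
        """Scan in-frame bins first; finding the first non-empty bin in the
        other segment means the current frame is over, so the segments swap."""
        for seg in (self.in_frame, 1 - self.in_frame):
            for b in range(self.nb):
                bucket = self.bins[seg * self.nb + b]
                if bucket:
                    if seg != self.in_frame:
                        self.in_frame = seg   # a new frame begins here
                    return bucket.popleft()
        return None                           # no backlogged flow
```

With N_b = 1 this degenerates to the two-list behavior of CQ1; larger N_b sorts the in-frame flows by how much of their per-frame allowance they have already consumed, which is what smooths the service pattern.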

one. If the first non-empty bin is found in the out-of-frame segment, the scheduler toggles the frame counter FRMCNT, which declares the beginning of a new frame. There is no longer a need to maintain per-flow frame flags, as in the single-FIFO-queue implementation of SRR, because the scheduler detects the start of a new frame directly from the position of the first non-empty bin, without having to retrieve the information from the flow selected for service.

Qualitatively, the fairness and delay properties of the scheduler improve as the number of bins increases. However, it is important to note that the worst-case indices that typically characterize a packet scheduler (latency, worst-case fairness index [4], and service fairness index [11]) barely obtain any benefit from the calendar-queue implementation of SRR. The reason for this counter-intuitive behavior is that all those indices are dominated by the worst possible state that a flow can find in the system when it is newly backlogged. Since the worst-case state of the system is practically independent of the adopted implementation, the indices look very similar with the single FIFO queue and the calendar queue. We prefer not to proceed with a detailed derivation of the indices (which is straightforward with known techniques [23], yet quite tedious), and only point out that they are all dominated by the term 2L_max/ρ_min, which is the time that the server takes to transmit two frames' worth of backlog for all the flows configured in the system, assuming that each of them is allocated the minimum share ρ_min. The considerable benefits of the calendar-queue implementation of SRR can be appreciated as soon as some of the configured flows have a service share greater than ρ_min, a situation that none of the common worst-case indices can capture.

We show with a simple simulation experiment the performance of the calendar-queue implementation of SRR under less extreme traffic conditions. We configure 53 unregulated flows, divided into three distinct classes. All configured flows belong to the same bundle, so that the higher level of the scheduling hierarchy remains transparent. The first class includes 50 flows (i1, …, i50), and each flow in the class is allocated 0.4% of the capacity of the server. The second class consists of two flows (i51 and i52), with a bandwidth allocation of 0.2r for each of them. Flow i53 is the only member of the third class, and has a bandwidth allocation of 0.4r. For simplicity, all packets reaching the system have the same size. We set the minimum service share ρ_min that the scheduler can support to three different values in three distinct simulation runs: ρ_min^(1) = ρ_i1, ρ_min^(2) = 0.1ρ_i1, and ρ_min^(3) = 0.01ρ_i1. The reference timestamp increment T_Q, which determines the duration of a frame, increases as the ratio between the service shares of the configured flows and ρ_min increases. We implement the SRR scheduler with the single-FIFO-queue technique (SQ), and with two calendar queues (CQ1 and CQ32), having respectively 1 and 32 bins per segment (for a total of 2 and 64 bins per calendar queue).

In a server that guarantees a fixed number of packet transmissions to a flow that remains continuously backlogged, the maximum time elapsing between the transmissions of two consecutive packets of a flow is a clear indicator of the service burstiness of the scheduler. In Table 2, we report the maximum inter-departure times observed for flows i1 (ρ_i1 = 0.004r) and i53 (ρ_i53 = 0.4r) in a sequence of delivered packets that is long enough to cover the duration of a frame. In the first scenario, with ρ_min^(1) = ρ_i1, there is no substantial difference between the single-FIFO-queue implementation and the two calendar queues, because flows i1, …, i50 are entitled to the transmission of only one packet per frame. As the number of packets transmitted in a frame

Table 2
Maximum inter-departure times for flows i1 (ρ_i1 = 0.004r) and i53 (ρ_i53 = 0.4r) for different values of the minimum configurable service share (ρ_min^(1) = ρ_i1, ρ_min^(2) = 0.1ρ_i1, and ρ_min^(3) = 0.01ρ_i1); the inter-departure times are normalized to the transmission time of a packet

            ρ_min^(1)              ρ_min^(2)              ρ_min^(3)
    Flow    SQ     CQ1    CQ32     SQ     CQ1    CQ32     SQ     CQ1    CQ32
    i1      250    250    250      2491   2023   1023     2491   2023   299
    i53     153    53     53       1501   53     53       1501   53     53

Fig. 8. Inter-departure times observed for flow i1 (ρ_i1 = 0.004r), normalized to the transmission time of a packet.

by the flows with the lowest share grows, the benefit of the calendar-queue implementation becomes more and more evident. At first glance, the performance of the two calendar-queue implementations is very similar, especially for the flow with the highest service share (the maximum inter-departure times reported in Table 2 for flow i53 are always the same with CQ1 and CQ32). In order to better appreciate the effect of increasing the number of bins in the calendar queue, it is necessary to look in more detail at the sequence of inter-departure times produced by the scheduler. In Fig. 8, we trace the inter-departure times that the SQ, CQ1, and CQ32 implementations of SRR produce for flow i1 in the transmission of 100 consecutive packets, in the case where ρ_min = ρ_min^(2) = 0.1ρ_i1. Similarly, the plot of Fig. 9 traces the inter-departure times observed in the transmission of 1000 consecutive packets of flow i53. We immediately observe that a higher number of bins yields much shorter sequences of consecutive packets that are spaced by the maximum inter-departure time, which testifies to lower burstiness.

Fig. 9. Inter-departure times observed for flow i53 (ρ_i53 = 0.4r), normalized to the transmission time of a packet. The plot reports the inter-departure times in logarithmic scale, in order to distinguish the behavior of the three implementations at the lower end of the covered range.

5. Integration of best-effort flows

The existing QoS frameworks for the network-wide support of service guarantees [6,14,15,18] must cope with the coexistence of guaranteed and best-effort traffic in the same IP network. Additional traffic-management issues need to be solved at the network nodes to efficiently integrate the different types of traffic. BE flows have no specified QoS requirements; accordingly, no bandwidth resources should be reserved for these flows in the scheduler that regulates access to an outgoing link. However, fairness in the relative treatment of distinct BE flows contending for the same link is highly desirable. When considering a set of BE flows in isolation, a WRR scheduler with an identical service-share allocation for all flows is the simplest scheme that can be conceived to meet the fairness objective. However, if GB flows also contend for the same outgoing link, a single WRR scheduler is no longer adequate, and ultimately contradicts the nature of BE flows. In fact, the shares allocated to BE flows subtract bandwidth from the pool that can be allocated to GB flows, and the ratio between the shares allocated to GB and BE flows enforces fixed proportions in the distribution of bandwidth between the two types of traffic.

We illustrate the problem with an example. A single WRR scheduler handles both GB and BE flows. We decide to allocate 1% of the server capacity r to each configured BE flow (of course, the choice of 1% is totally arbitrary, as is the case for any other value). We first configure 20 BE flows,

so that each of them initially obtains 5% of the server capacity. Then, we add to the system two GB flows, each asking for 0.4r. At this point, the capacity of the server is totally allocated, and no additional flows of either type can be set up. A negative consequence of allocating an explicit share to BE flows is that the presence of such flows reduces the amount of nominal bandwidth that the server can reserve for GB flows. Moreover, the availability of nominal bandwidth constrains the number of configurable BE flows. Ideally, the whole capacity of the server should be accessible to GB flows, and the availability of nominal bandwidth should not affect the configuration of BE flows, simply because these flows have no explicit bandwidth requirements. Whenever one of the two configured GB flows becomes idle, the single WRR scheduler grants 0.66r to the GB flow that remains backlogged, while each BE flow gets 1.66% of the capacity of the server (the scheduler keeps servicing the backlogged flows in fixed proportions, according to their explicit shares). Ideally, the backlogged GB flow should instead keep receiving no more than 40% of the capacity of the server, while each BE flow should be serviced at 0.03r.

A fair airport scheduler [13] where BE flows have no reserved bandwidth in the guaranteed service queue (GSQ) and higher priority than GB flows in the auxiliary service queue (ASQ) would provide an elegant solution to all the functional issues involved in the integration of GB and BE flows, but would also increase the implementation cost of the scheduler beyond what is of interest in current IP networks. A much cheaper solution can be found in Refs. [9,10]: the server handles GB and BE flows in two distinct WRR schedulers, and serves the BE aggregate only after having granted to the GB aggregate the sum of the guaranteed service shares of the allocated GB flows.
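The proportional-sharing arithmetic of the example above can be checked directly. This is a throwaway sketch; the flow names are illustrative.

```python
# A single WRR scheduler serves backlogged flows in fixed proportion to their
# shares, so each flow's rate is share / (sum of backlogged shares).
# Flow names are illustrative.

shares = {"gb1": 0.40, "gb2": 0.40, **{f"be{i}": 0.01 for i in range(20)}}

def wrr_rates(shares, backlogged):
    total = sum(shares[f] for f in backlogged)
    return {f: shares[f] / total for f in backlogged}

# All flows backlogged: the GB flows get 40% each, every BE flow gets 1%.
r_all = wrr_rates(shares, list(shares))

# gb2 goes idle: gb1 rises to 0.4/0.6 of the capacity and each BE flow to
# 0.01/0.6, matching the 0.66r and 1.66% figures quoted in the text.
r_one_idle = wrr_rates(shares, [f for f in shares if f != "gb2"])
print(round(r_one_idle["gb1"], 4), round(100 * r_one_idle["be0"], 4))
# -> 0.6667 1.6667
```

The fixed proportions are exactly what the text objects to: an idle GB flow's bandwidth leaks to the remaining GB flow instead of going entirely to the BE aggregate.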
Unfortunately, this approach lacks flexibility in passing to BE flows the bandwidth that is not used by idle GB flows, because no bandwidth is transferred from the GB to the BE aggregate as long as at least one GB flow remains backlogged. The enhancement that we present in this section achieves the finest granularity in transferring unused bandwidth from GB to BE flows, at the only

cost of replicating the queueing structure of the basic WRR scheduler and maintaining some state information for the BE aggregate. We first describe the application of the technique to a flat WRR scheduler (i.e., a WRR scheduler with no superimposed hierarchy for the distribution of bandwidth to GB flows), and then discuss its extension to the case where the set of GB flows is partitioned into multiple bundles.

We divide the service frame of the WRR scheduler into two parts called sub-frames. The first sub-frame is devoted to satisfying the bandwidth requirements of the GB flows that are backlogged at the beginning of the frame; in the second sub-frame, the WRR scheduler serves the BE aggregate until the expected frame duration is reached (the expected frame duration is the duration of the WRR frame when the whole capacity of the link is allocated to GB flows and all allocated GB flows are backlogged). The durations of the GB and BE sub-frames are subject to complementary fluctuations that are triggered by changes in the backlog state of the GB flows, whereas the overall frame duration remains constant as long as backlogged BE flows are available. In order to determine the amount of service to be granted to the BE aggregate within a frame, the scheduler maintains a BE running share φ_BE that tracks the difference between the link capacity r and the sum of the service shares of the backlogged GB flows:

    φ_BE(t) = r − Σ_{i∈B_GB(t)} ρ_i    ∀t,    (15)

where B_GB(t) is the set of GB flows that are backlogged at time t. (The definition of φ_BE in Eq. (15) obviously assumes that the sum of the service shares allocated to the GB flows does not exceed the capacity of the server.) The scheduler samples the BE running share at the end of each GB sub-frame (which is detected when there are no more backlogged GB flows in the current frame), and uses its value to set the BE cumulative share Φ_BE for the incoming BE sub-frame. The scheduler maintains a BE timestamp F_BE to regulate the duration of the BE sub-frame. According to the SRR algorithm, at the end of the transmission of a BE packet of size

l_BE^k the scheduler updates the BE timestamp as follows:

    F_BE^k = F_BE^(k−1) + l_BE^k / Φ_BE.    (16)

The distribution of service to the BE aggregate continues as long as there are backlogged BE flows and the BE timestamp does not exceed the reference timestamp increment T_Q. The negation of either of the two conditions triggers the end of both the BE sub-frame and the whole frame, and resumes the distribution of service to the GB flows in a new frame.

During the BE sub-frame, the scheduler must still determine which individual BE flows to serve. The fairness criterion that requires an equal amount of service for BE flows that are simultaneously backlogged leads to the adoption of a separate instance of the basic WRR scheduler as the mechanism for handling BE flows. In the WRR replica, all BE flows are assigned the same service share, as shown in the flat scheduler of Fig. 10. Since the service shares of the BE flows do not count against the capacity of the server, there is no limit on the number of BE flows that can be allocated in the system. The frame dynamics of the BE scheduler are completely independent of their counterparts in the main WRR scheduler: multiple BE sub-frames may be needed to complete a frame in the BE scheduler or, conversely, a single BE

sub-frame in the main scheduler may be sufficient to complete several frames in the BE scheduler.

In a flat WRR scheduler, the BE running share is incremented every time a GB flow becomes idle, and decremented every time a GB flow becomes backlogged. In a scheduler with a soft hierarchy for bandwidth segregation, the same updates are triggered by changes in the backlog state of the bundles (a bundle is backlogged if at least one of its flows is backlogged).

We validate the behavior of the integrated scheduler for GB and BE flows with a simulation experiment where the set of GB flows is partitioned into three bundles (I, J, and K). Both the main WRR scheduler and the BE replica are implemented using SRR and one bin per segment in the calendar queue. Bundle I contains 51 flows: flows i1, …, i50 are strictly regulated and are each allocated 0.2% of the link capacity r. Flow i51 is unregulated, and its service share is ρ_i51 = 0.2r. The total share allocation for bundle I is therefore equal to 30% of the link capacity. Bundle J is allocated 20% of the link capacity and its only flow j1 is unregulated. Bundle K is allocated 30% of the link capacity and its only flow k1 is permanently idle. The server also handles four unregulated BE flows (h1, …, h4). The total share allocation for GB flows is equal to 0.8r, which leaves a base of 0.2r available to BE flows; including the bandwidth that is not used by the inactive bundle K, the

Fig. 10. Integration of GB and BE flows in a flat WRR scheduler.
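The BE-aggregate accounting described above can be sketched as follows. This is a hypothetical rendering with illustrative names; the timestamp handling is simplified by resetting F_BE at each sub-frame start instead of subtracting T_Q.

```python
# A hypothetical sketch of the BE-aggregate accounting described above:
# phi_BE follows Eq. (15), Phi_BE is sampled at each GB sub-frame end, and
# the BE timestamp advances per Eq. (16) until it exceeds T_Q.

class BEAccounting:
    def __init__(self, capacity, tq):
        self.r = capacity
        self.tq = tq                 # reference timestamp increment T_Q
        self.run_share = capacity    # phi_BE = r - sum of backlogged GB shares
        self.cum_share = 0.0         # Phi_BE, frozen per BE sub-frame
        self.timestamp = 0.0         # F_BE

    def gb_backlogged(self, share):
        self.run_share -= share      # a GB flow (or bundle) wakes up

    def gb_idle(self, share):
        self.run_share += share      # its bandwidth flows back to BE

    def start_be_subframe(self):
        """Called when no backlogged GB flow is left in the current frame."""
        self.cum_share = self.run_share   # sample phi_BE into Phi_BE
        self.timestamp = 0.0              # simplified: reset instead of -T_Q

    def be_packet_sent(self, length):
        """Eq. (16): charge the BE aggregate for a packet of the given size.
        Returns True while the BE sub-frame may continue."""
        self.timestamp += length / self.cum_share
        return self.timestamp <= self.tq
```

Because Φ_BE is resampled at every GB sub-frame end, bandwidth left unused by idle GB flows reaches the BE aggregate within one frame, which is the fine granularity claimed in the text.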

Table 3
Integration of GB and BE flows in SRR with soft hierarchy (throughput is expressed in percentages of the server capacity)

                                         With bundles
    Flow           Type   Nominal        Expected   Observed
                          allocation
    i1 + … + i50   GB     10.00          10.00      9.89
    i51            GB     20.00          20.00      20.12
    j1             GB     20.00          20.00      20.01
    k1             GB     30.00          0.00       0.00
    h1             BE     0.00           12.50      12.50
    h2             BE     0.00           12.50      12.49
    h3             BE     0.00           12.50      12.49
    h4             BE     0.00           12.50      12.50

BE aggregate should have access to 50% of the link capacity, equally distributed over the four BE flows in shares of 0.125r each (in the same scenario, the approach proposed in Refs. [9,10] would provide the BE aggregate with only 20% of the link capacity). Table 3 collects the throughput results for the configured flows, clearly showing that the scheduler behaves as expected in meeting the bandwidth requirements of the GB flows and in distributing the excess bandwidth to the BE flows in a fair manner.

6. Concluding remarks

We have presented three enhancements of WRR schedulers for providing bandwidth guarantees in IP networks. In our first enhancement, we superimpose a ``soft'' hierarchical structure on a WRR scheduler, which allows segregating bandwidth among bundles of flows, in addition to providing bandwidth guarantees to the individual flows. The mechanism has minimal complexity, since it is entirely based on a simple redefinition of the way the timestamps are computed. In the second enhancement, we achieve a considerable reduction in service burstiness by implementing the WRR scheduler with a modified calendar queue. The improvement is substantial even with few sorting bins in the calendar queue. Our third enhancement allows the integration of GB and BE flows in a single WRR engine, at only the cost of

replicating the sorting structure for the flow timestamps and maintaining some state information for the best-effort traffic aggregate. The three enhancements are useful mechanisms to meet the increasingly sophisticated scheduling demands in networks supporting QoS, while keeping the complexity of the schedulers to a minimum.

References

[1] H. Adiseshu, G. Parulkar, G. Varghese, A reliable and scalable striping protocol, in: Proceedings of ACM SIGCOMM'96, August 1996.
[2] D. Awduche, J. Malcolm, J. Agogbua, M. O'Dell, J. McManus, Requirements for Traffic Engineering over MPLS, Request for Comments (RFC) 2702, IETF, September 1999.
[3] J.C.R. Bennett, H. Zhang, WF²Q: worst-case-fair weighted fair queueing, in: Proceedings of IEEE INFOCOM'96, March 1996, pp. 120–128.
[4] J.C.R. Bennett, H. Zhang, Hierarchical packet fair queueing algorithms, in: Proceedings of ACM SIGCOMM'96, August 1996, pp. 143–156.
[5] J.C.R. Bennett, D.C. Stephens, H. Zhang, High speed, scalable, and accurate implementation of fair queueing algorithms in ATM networks, in: Proceedings of IEEE ICNP'97, October 1997, pp. 7–14.
[6] R. Braden, D. Clark, S. Shenker, Integrated Services in the Internet Architecture: an Overview, Request for Comments (RFC) 1633, IETF, June 1994.
[7] F.M. Chiussi, A. Francini, J.G. Kneuer, Implementing fair queueing in ATM switches – Part 2: The logarithmic calendar queue, in: Proceedings of IEEE GLOBECOM'97, November 1997, pp. 519–525.
[8] F.M. Chiussi, A. Francini, Advances in implementing fair queueing schedulers in broadband networks, in: Proceedings of IEEE ICC'99, June 1999 (Invited paper).
[9] F.M. Chiussi, A. Francini, Providing QoS guarantees in packet switches, in: Proceedings of IEEE GLOBECOM'99, High-Speed Networks Symposium, Rio de Janeiro, Brazil, December 1999.
[10] F.M. Chiussi, A. Francini, A distributed scheduling architecture for scalable packet switches, IEEE Journal on Selected Areas in Communications 18 (12) (2000) 2665–2683.
[11] S.J. Golestani, A self-clocked fair queueing scheme for broadband applications, in: Proceedings of IEEE INFOCOM'94, April 1994, pp. 636–646.
[12] P. Goyal, H.M. Vin, H. Chen, Start-time fair queueing: a scheduling algorithm for integrated services, in: Proceedings of ACM SIGCOMM'96, August 1996, pp. 157–168.
[13] P. Goyal, H.M. Vin, Fair airport scheduling algorithms, in: Proceedings of NOSSDAV'97, May 1997, pp. 273–282.

[14] J. Heinanen, F. Baker, W. Weiss, J. Wroclawski, Assured Forwarding PHB Group, Request for Comments (RFC) 2597, IETF, June 1999.
[15] V. Jacobson, K. Nichols, K. Poduri, An Expedited Forwarding PHB, Request for Comments (RFC) 2598, IETF, June 1999.
[16] M. Katevenis, S. Sidiropoulos, C. Courcoubetis, Weighted round robin cell multiplexing in a general-purpose ATM switch, IEEE Journal on Selected Areas in Communications 9 (1991) 1265–1279.
[17] K. Muthukrishnan, A. Malis, A Core MPLS IP VPN Architecture, Request for Comments (RFC) 2917, IETF, September 2000.
[18] K. Nichols, V. Jacobson, L. Zhang, A Two-bit Differentiated Services Architecture for the Internet, Request for Comments (RFC) 2638, IETF, July 1999.
[19] A.K. Parekh, R.G. Gallager, A generalized processor sharing approach to flow control in integrated services networks: the single-node case, IEEE/ACM Transactions on Networking (June 1993) 344–357.
[20] J.L. Rexford, A.G. Greenberg, F.G. Bonomi, Hardware-efficient fair queueing architectures for high-speed networks, in: Proceedings of IEEE INFOCOM'96, March 1996, pp. 638–646.
[21] M. Shreedhar, G. Varghese, Efficient fair queueing using deficit round robin, IEEE/ACM Transactions on Networking 4 (3) (1996) 375–385.
[22] D.C. Stephens, J.C.R. Bennett, H. Zhang, Implementing scheduling algorithms in high-speed networks, IEEE Journal on Selected Areas in Communications 17 (6) (1999) 1145–1158.
[23] D. Stiliadis, A. Varma, Latency-rate servers: a general model for analysis of traffic scheduling algorithms, in: Proceedings of IEEE INFOCOM'96, March 1996, pp. 111–119.
[24] D. Stiliadis, A. Varma, Design and analysis of frame-based fair queueing: a new traffic scheduling algorithm for packet-switched networks, in: Proceedings of ACM SIGMETRICS'96, May 1996, pp. 104–115.
[25] D. Stiliadis, A. Varma, A general methodology for designing efficient traffic scheduling and shaping algorithms, in: Proceedings of IEEE INFOCOM'97, Kobe, Japan, April 1997.
[26] D. Stiliadis, A. Varma, Efficient fair queueing algorithms for packet switched networks, IEEE/ACM Transactions on Networking 6 (2) (1998) 175–185.
[27] I. Stoica, H. Zhang, T.S.E. Ng, A hierarchical fair service curve algorithm for link-sharing, real-time, and priority services, in: Proceedings of ACM SIGCOMM'97, September 1997.
[28] L. Zhang, Virtual clock: a new traffic control algorithm for packet switching, ACM Transactions on Computing Systems (May 1991) 101–124.

[29] M. Ajmone Marsan, A. Bianco (Eds.), Proceedings of the International Workshop on Quality of Service in Multiservice IP Networks (QoS-IP 2001), Lecture Notes in Computer Science, vol. 1989, January 2001, pp. 205–221.

Andrea Francini received the Dr. Eng. degree in Electrical Engineering (summa cum laude) in 1993, and the Ph.D. in Electrical Engineering and Communications in 1998, both from Politecnico of Turin, Italy. Since 1996, he has been with Bell Laboratories, Lucent Technologies, in the Data Networking Systems Research Department. He has contributed to the architectural design of three generations of the Lucent ATLANTA chipset, leading the effort for the deployment of sophisticated QoS support in the modules of the chipset. Dr. Francini is currently working on the architectural specification of a switching system for IP-based wireless networks. His research interests include traffic management, scheduling, scalable packet-switching architectures, and QoS frameworks for next-generation wireless networks.

Fabio M. Chiussi received the Ph.D. in Electrical Engineering from Stanford University in 1993. Since 1993, he has been with Bell Laboratories, Lucent Technologies, where he is currently Director, Data Networking Systems Research. He has led the architectural design of three generations of the Lucent ATLANTA chipset, an industry-leading silicon solution for ATM and IP switching and port processing; within the ATLANTA project, he has also held various development responsibilities, including leading the development of the switch fabric devices for the latest generation of the chipset. He is currently leading the architectural specification and development of a switching system for the wireless infrastructure, supporting MPLS technology and advanced services. Dr. Chiussi has been conducting fundamental research in the areas of scalable switch architectures, traffic management and scheduling, congestion control, and VLSI design. He has written more than 70 technical papers and holds 9 patents, with 20 more pending. Dr. Chiussi was named the 1997 Eta Kappa Nu Outstanding Young Electrical Engineer. He is a Bell Labs Fellow.

Robert T. Clancy received his Bachelor of Electrical Engineering from Manhattan College in 1989, and his Master of Electrical Engineering from the Stevens Institute of Technology in 1993. He was with Bell Laboratories, Lucent Technologies, until the beginning of 2000. He currently works for Sycamore Networks, in the Optical Edge Business Unit.

Kevin D. Drucker received his Bachelor of Computer Engineering from Lehigh University in 1993 and is currently pursuing his Master of Electrical Engineering at Johns Hopkins University. He currently works for Agere Systems, after having been with Bell Labs until February 2001.