Weighted fair bandwidth sharing using SCALE technique




Computer Communications 24 (2001) 51-63

www.elsevier.com/locate/comcom

H. Zhu a,*, A. Sang b, S.-Q. Li b

a Electrical and Computer Engineering Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
b Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, USA

Received 4 September 2000; accepted 4 September 2000

Abstract

Little work exists on weighted fair bandwidth sharing in the Internet without per-flow management, especially when both UDP and TCP flows with different RTTs and different bandwidth targets coexist. This paper makes two contributions:
1. A mechanism called SCALE-WFS (Scalable Core with Aggregation Level labEling, Weighted Fair bandwidth-Sharing) is presented to achieve near-optimal weighted max-min fairness without per-flow management at core routers, in the context of differentiated services (Diffserv) networks. Through extensive simulation and simple analysis, we show the performance problems of current solutions and propose SCALE-WFS to solve them. The scheme works effectively for different flow weights, RTTs and protocols (TCP and UDP), under different bandwidth-provisioning scenarios and with multiple congested gateways, all of which we consider basic requirements for a practical solution.
2. Apart from "per-flow" and completely "core-stateless" schemes, SCALE represents a third management level: the label-based flow "aggregation" level. It removes the information redundancy of per-flow management and surpasses the performance of core-stateless schemes. This matters for Diffserv networks, where we try to avoid the non-scalability of per-flow management while keeping its good quality of service. In this paper, we show that SCALE-WFS is effective, scalable, robust and extensible. © 2001 Elsevier Science B.V. All rights reserved.

Keywords: Differentiated service; Max-min fairness; Profile rate; Aggregation; Packet dropping probability

1. Introduction

1.1. Intserv and Diffserv, expedited and assured forwarding

Due to the increasing usage of the Internet, a wide range of Quality-of-Service (QoS) support is needed to maximize application utility. Two major service models have been proposed: Integrated Services (Intserv) [19] and Differentiated Services (Diffserv) [24]. Intserv provides end-to-end QoS support on a per-flow basis. This usually requires per-flow management, such as packet classification, flow-table search and state maintenance, at all routers, and often also a resource-reservation signaling protocol such as RSVP [22,23]. Its scalability is therefore still highly questionable. Examples of Intserv models include the guaranteed [20] and controlled-load [21] services.

* Corresponding author. E-mail addresses: [email protected] (H. Zhu), [email protected] (A. Sang), [email protected] (S.-Q. Li).

Diffserv provides service differentiation among traffic aggregates and usually adopts relatively lightweight

mechanisms to simplify the network design. It distinguishes routers at the edge from those in the core within a domain: it is considered feasible for edge routers to support per-flow management, while core routers may not. Typical Diffserv models include expedited [25], assured [26] and best-effort forwarding. Expedited forwarding (EF) provides virtual-leased-line service through resource reservation and strict traffic conditioning. Assured forwarding (AF) provides probabilistic packet forwarding for aggregated flows, called classes; currently four classes are defined.

1.2. Weighted fairness and profile rates

A flow can be assigned a weight [1]. The weights given to the same flow can differ from router to router. This paper studies the scenario in which the weight of a flow is fixed at all routers it traverses within the same DS domain (e.g. an ISP network). Within each class, the Diffserv AF service provides high-probability forwarding for IN packets [12], the packets arriving within the aggregate's subscribed information rate (i.e. the profile rate), and low-probability forwarding for

0140-3664/01/$ - see front matter © 2001 Elsevier Science B.V. All rights reserved. PII: S0140-3664(00)00289-9



Fig. 1. SCALE-WFS architecture.

OUT packets [12], the part of the aggregate exceeding the profile rate. Since application utility is more closely related to an individual flow than to the whole aggregate, a designated profile rate for each individual flow is expected to represent its targeted end-to-end average throughput. Since the AF service is provided only probabilistically and is subject to congestion loss, the level of forwarding assurance is not guaranteed and depends on resource provisioning and traffic load. It is desirable that when excess bandwidth is available, all flows share it in a weighted fair way (i.e. in proportion to their profile rates), while when congestion happens, all flows' throughputs degrade in the same manner. This amounts to weighted max-min fair sharing of the total bandwidth among flows according to their profile rates, which we call weighted fairness.

1.3. Related existing work

Traditionally, per-flow weighted fairness was achieved by per-flow queueing mechanisms such as Weighted Fair Queueing [1] or by per-flow accounting [2,5,7]. These mechanisms require costly packet classification and complicated flow-table maintenance at routers to track the state of each flow. Due to these complexities and the huge number of flows, their scalability is highly questionable. A number of recent efforts apply different combinations of dropping precedences to gain some fairness. Typically, there are schemes with two dropping precedences, such as RIO [12], and with three dropping precedences, such as the Three Color Marker (TCM) [13,14] used together with RED-3, the latter denoting a modified RIO working with three dropping precedences. According to Ref. [8], it is highly unlikely that per-flow fairness can be

achieved through different combinations of three dropping precedences alone. Experiments in Ref. [8] and in this paper also show strong biases against TCP flows with high targeted throughput, especially when non-adaptive (e.g. UDP-CBR) and adaptive (e.g. TCP) flows coexist. Thus, these approaches are far from fair. Other approaches use feedback-control methods [10,11]. They require significant changes to the current Internet service model, and for a system as huge as the Internet their control stability in practice is still unclear. Fair bandwidth sharing without per-flow information was first addressed by UT [3] in the ATM literature and then by CSFQ [4], Rainbow Fair Queueing [27] and ChoKe [28] in the Internet literature. However, ChoKe only addresses equal sharing of bandwidth rather than weighted sharing, and it is not straightforward to extend it to weighted fair sharing. Rainbow Fair Queueing mostly discusses equal sharing and shows the performance of its weighted version only in an all-UDP case; due to its heuristic method of adjusting color thresholds based on queue behavior, it is unclear how its weighted version performs when TCP and UDP flows of different weights and RTTs coexist. The CSFQ paper shows only simulations of equal sharing. In this paper, we show that weighted CSFQ cannot provide weighted fairness when UDP and TCP flows of different weights traverse one or multiple congested links together. It also fails to converge in a reasonably short period after sudden traffic changes.

1.4. Our solution

Our approach, SCALE-WFS, provides weighted fairness


to each individual flow without managing per-flow state at core routers. We observe that per-flow state actually provides more than enough information for our control target, which here is weighted fairness. In fact, flows that will be treated identically by the max-min fairness criterion can be merged into one aggregate at core routers and dropped with the same probability. The merging is done according to a label added to the packet header. By checking the label and operating at a per-aggregate level, core routers can effectively provide weighted fairness to each individual flow while maintaining an aggregation table much smaller than a per-flow one. We call this technique "Scalable Core with Aggregation Level labEling" (SCALE); applying SCALE to weighted fair bandwidth sharing yields SCALE-WFS. SCALE-WFS does not require core routers to keep per-flow state, so it is scalable. The aggregation table is small, with a limited set of operations defined on it, so it is simple. It requires neither protocol changes nor an additional feedback-control mechanism, so it is easy to deploy. It can work together with other, orthogonal mechanisms for finer control, so it is extensible; such mechanisms may include re-negotiable traffic-source behavior, traffic conditioners, and finer-grained services for different traffic classes. It also converges quickly after abrupt traffic changes, so it is robust. Simulations show that it provides good service to heterogeneous sources (TCP/UDP) with different RTTs and different profile rates, under different bandwidth provisioning, on one or multiple congested links. The rest of the paper is organized as follows. Section 2 introduces the SCALE-WFS architecture and algorithms. Section 3 provides simulation results. Section 4 gives a technical discussion. Finally, Section 5 concludes the paper and describes future work.

2. SCALE-WFS architecture and algorithm

The SCALE-WFS architecture is based on a common Diffserv model, shown in Fig. 1, whose components include the Marker, Ingress Edge Router, Core Router and Egress Edge Router. First, before entering the DS domain, each individual flow is labeled by the marker with its flow profile rate (r_i), specified through negotiation between a user and an ISP, or between two ISPs, as the expected average end-to-end throughput. The labeling can be supported by popular tagging mechanisms such as DPS [15] or MPLS. Then the ingress edge router, where per-flow management is used, probabilistically forwards the flow's packets. Each forwarded packet is re-labeled with a flow fairness index (f_i), defined in Section 2.1. The only purpose of the profile rate is to facilitate the computation of the flow fairness index at the edge; therefore, if the marker is located at the ingress edge router, the profile-rate labeling is unnecessary.


Each output link at a core or egress edge router maintains a FIFO queue. Incoming packets are either en-queued or dropped. They are dropped with the same probability if their labels (f_s, ..., f_i, ..., f_t) have approximately the same value, here represented by an aggregate-level fairness index (F_j), regardless of their individual flow IDs. In other words, flows can be classified into m + 1 different groups or aggregates (AGs: AG_0, ..., AG_j, ..., AG_m) solely based on their labels, and indexed into an aggregate table (AG table) maintained by the router; Section 2.4 shows that this table is very small. The dropping probability P_j for aggregate AG_j is computed from F_j, the measured aggregate arrival rate A_j, and the link fairness index (L) defined in Section 2.1. Because the AG table is small and can be accessed directly (instead of searched), scalable WFS can be obtained among the different AGs sharing the same queue. Both edge routers and core routers re-label the flow fairness index of every en-queued packet, based on its previously labeled fairness index and the dropping probability. A flow can pass through one or multiple DS domains until it finally reaches its destination host.

2.1. Fairness index

We define three fairness indices: one for a flow (f_i), one for an AG (F_j), and one for a router output link (L). Assume flow i has a specified profile rate r_i, and that the router it is currently passing through has output capacity C on the output link toward which flow i heads. Flow i's arrival rate is a_i. We define the flow fairness index as

$$f_i = \frac{a_i}{r_i} \tag{1}$$
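The flow fairness indices feed the link fairness index L, defined below as the solution of the weighted max-min equations (Eq. (2)). The following water-filling sketch shows how L can be computed; it is a hypothetical Python reconstruction of the Appendix A iteration, not the authors' code:

```python
def link_fairness_index(r, f, C):
    """Water-filling solution of Eq. (2) for the link fairness index L.

    r[i]: profile rate of flow i; f[i] = a_i / r_i: its fairness index;
    C: output link capacity.  Hypothetical reconstruction of the
    Appendix A iteration.
    """
    a = [ri * fi for ri, fi in zip(r, f)]      # per-flow arrival rates
    if sum(a) <= C:
        return max(f)                          # link uncongested: L = max f_i
    # Congested: raise the water level L. Flows with f_i <= L keep a_i,
    # the rest are capped at r_i * L, until sum r_i * min(L, f_i) = C.
    order = sorted(range(len(f)), key=lambda i: f[i])
    fixed = 0.0        # bandwidth of flows already below the level
    rest_r = sum(r)    # total profile rate of flows still above it
    L = 0.0
    for i in order:
        L = (C - fixed) / rest_r
        if L <= f[i]:                          # level reached: done
            return L
        fixed += a[i]                          # flow i stays below the level
        rest_r -= r[i]
    return L
```

For the core-router version, Eq. (4), the same iteration runs over the m + 1 aggregates (R_j, F_j) instead of the n flows, which is what makes the accurate solution scalable.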

At edge routers, a_i can be estimated at the per-flow level with commonly used measurement schemes such as the time sliding window. This index represents how far the current flow rate is from its throughput expectation. For an output link, we define the link fairness index L as the solution to the following weighted max-min fairness equations:

$$\sum_{i=1}^{n} r_i \min(L, f_i) = C \quad \text{if } \sum_{i=1}^{n} a_i > C; \qquad L = \max_{i=1,\dots,n} f_i \quad \text{if } \sum_{i=1}^{n} a_i \le C \tag{2}$$

where flows 1 to n are the flows that go through this link. Note that for simplicity we assume there is only one output buffer per link. The left-hand side of Eq. (2) is a continuous, non-decreasing, concave and piecewise-linear function of L. L can be computed periodically by solving Eq. (2), given the per-flow information (n, r_i, f_i) known at the ingress edge router. The solution algorithm is listed in Appendix A, with time complexity O(n^2). L reflects the workload of the current link and provides a criterion for fairness comparison among the flows sharing the link. With L, the fluid dropping probability p_i for flow i, and the new flow fairness index f'_i that re-labels flow i's en-queued packets, can be obtained:

$$p_i = \max\left(0,\ 1 - \frac{L}{f_i}\right), \qquad f'_i = f_i (1 - p_i), \qquad \forall i \in [1, n] \tag{3}$$

where f'_i is actually min(f_i, L), which represents the de-queued rate over the profile rate of flow i. According to the standard definition, as also used in Refs. [3,4], weighted max-min fairness is achieved at this link when Eq. (3) is fulfilled.

At an output link of a core/egress router, assume all incoming flows are grouped into m + 1 aggregates AG_0 to AG_m, in which flows with the same quantized flow fairness index f_i are grouped into the same aggregate. Assume an aggregate AG_j consists of all flows in [s, t], where f_i ≈ F_j for all i in [s, t]. We call F_j the jth aggregate flow fairness index for the aggregate (i.e. macro-flow) AG_j. Then we can get L by solving the following equations:

$$\sum_{j=0}^{m} R_j \min(L, F_j) = C \quad \text{if } \sum_{j=0}^{m} A_j > C; \qquad L = \max_{j=0,\dots,m} F_j \quad \text{if } \sum_{j=0}^{m} A_j \le C \tag{4}$$

where $R_j = \sum_{i=s}^{t} r_i$ and $A_j = \sum_{i=s}^{t} a_i$ represent the profile rate and the measured arrival rate, respectively, of the jth aggregate AG_j = {f_s, ..., f_i, ..., f_t}. Note that core routers can infer R_j as A_j over F_j. L can be solved using the iterative method in Appendix A; the only difference is that each flow is replaced by an aggregate. The time complexity of solving L thus becomes O(m^2), in contrast to O(n^2) in the per-flow case. The tunable number of aggregate levels m is usually much smaller than, and independent of, the huge flow number n, as discussed later. Thus the accurate solution of L at the core is scalable when performed at the introduced aggregation level. Based on L, the dropping probability and the new label for aggregate AG_j can be derived from:

$$P_j = \max\left(0,\ 1 - \frac{L}{F_j}\right), \qquad F'_j = F_j (1 - P_j), \qquad \forall j \in [0, m] \tag{5}$$

Each packet belonging to AG_j is re-labeled with F'_j before being de-queued from the current core/egress router.

Fig. 2. Dropping-probabilities computing algorithm.

2.2. Markers

The markers may reside at the flow sources, at the ingress edge router of the current DS domain, or at the egress edge router of the upstream DS domain, depending on the system design and service-level agreements. They must work "honestly", i.e. security must be considered to prevent forged labeling; that is outside the scope of this paper. Although not required, SCALE-WFS is able to support multiple dropping precedences (i.e. multiple colors) within each flow. In this case, the marker has one more task: inserting a color label into the packet header. There are two typical mechanisms for color marking. One is to mark packets according to whether they arrive within or beyond the aggregate's profile rate, as in RIO and TCM. The other is typical of the multimedia area, where applications may use layered coding [17] to mark important packets as IN whenever possible; for example, MPEG I-frames may be marked as IN, while B-frames and P-frames are marked as OUT. SCALE-WFS supports both color schemes. It does not need color re-marking to achieve fairness, as in Ref. [6], and therefore is not tied to a particular dropping-precedence scheme. Section 2.1 considered only the single-dropping-precedence case. With two dropping precedences, packets within each flow are color-labeled as either IN or OUT [12]. Both the flow and aggregate fairness indices become two-element tuples, (f_{i,IN}, f_{i,OUT}) and (F_{j,IN}, F_{j,OUT}), where f_{i,IN} = a_{i,IN}/r_i, f_{i,OUT} = a_{i,OUT}/r_i, F_{j,IN} = A_{j,IN}/R_j and F_{j,OUT} = A_{j,OUT}/R_j. Without going into details, we note that flows with approximately the same tuple values can be mapped into one aggregate. Unless otherwise specified, we assume two dropping precedences in this paper. A similar approach supports TCM, or more colors.

2.3. Edge router

Consider the n incoming flows to the ingress edge router in Fig. 1. The link capacity is C.
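The per-flow fluid dropping and re-labeling rule that the edge applies, Eq. (3), amounts to the following sketch (our own minimal Python illustration, not the paper's code):

```python
def flow_drop_and_relabel(L, f_i):
    """Eq. (3): fluid drop probability p_i and new label f'_i for flow i.

    L: link fairness index; f_i = a_i / r_i.  The new label equals
    min(f_i, L), i.e. the de-queued rate over the profile rate.
    """
    p_i = max(0.0, 1.0 - L / f_i) if f_i > 0 else 0.0
    f_new = f_i * (1.0 - p_i)      # identical to min(f_i, L)
    return p_i, f_new
```

Replacing (f_i, p_i) with (F_j, P_j) gives the per-aggregate rule of Eq. (5) used at core/egress routers.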
For this link, the router maintains a ¯ow table, performs per-¯ow packet classi®cation and estimates the ai_IN and ai_OUT, namely the arrival rate of ¯ow i`s IN and OUT packets. In our experiments, we simply use time sliding window for rate estimation. At the beginning of the Nth sliding step (i.e. computation interval, or measurement interval), the L in Eq. (2) and pi in Eq. (3) are computed based on the arrival rates ai_IN(N 2 1) and ai_OUT(N 2 1) averaged over a



Fig. 3. SCALE-WFS management of one output queue at a core/egress router.

measurement window lasting from the (N-W)th to the (N-1)th sliding step, where W is the measurement window size. Note that using a low-pass filter for the measurement is also common practice; more accurate estimation algorithms should generally allow better control. For flow i, the total dropping probability p_i, the dropping probability of IN packets p_{i,IN} and that of OUT packets p_{i,OUT} for the Nth sliding step are computed, as in Fig. 2, at the very beginning of that step. The algorithm ensures that OUT packets are dropped first, letting as many IN packets through as possible. Note that all of the parameters in Fig. 2 refer to a fluid-flow model. In practice, we use the method of RED [16] to convert the fluid dropping probability into a packet dropping probability, multiplying the former by the current packet size divided by the mean packet size. In addition, we use two queue-threshold parameters for congestion control, minth and maxth. The average queue size meanQue is updated upon each packet arrival through an exponential low-pass filter, as in Ref. [16]. When meanQue is between minth and maxth, an arriving packet is dropped according to p_{i,IN} or p_{i,OUT}; otherwise, the dropping probability is either 0 (when meanQue < minth) or 1 (when meanQue > maxth). Thus we rely very little on queue behavior and can tolerate more bursty traffic. The parameter settings are discussed in Section 4.

2.4. Core/egress router: quantization of service granularity using flow aggregation

Fig. 3 shows the SCALE-WFS management of one FIFO output queue at a core/egress edge router. Flows arriving at this queue are mapped into the AG table by a label check. The mapping quantizes the flow fairness indices f_{i,IN} and f_{i,OUT} to the discrete-valued aggregate fairness indices F_{j,IN} and F_{j,OUT}. Let F_j = F_{j,IN} + F_{j,OUT} and R_j(N) = (A_{j,IN}(N) + A_{j,OUT}(N)) / F_j. The (IN or OUT) packets of a flow are then dropped with probability P_{j,IN}(N) or P_{j,OUT}(N), as obtained from Eq. (5) and Fig. 2 (when meanQue is in [minth, maxth]). For an en-queued packet, its old labels are to be

replaced by the new ones, namely F'_{j,IN}(N) and F'_{j,OUT}(N), computed as F_{j,IN}(N) × (1 - P_{j,IN}(N)) and F_{j,OUT}(N) × (1 - P_{j,OUT}(N)), respectively. In the table, A_{j,IN}(N) and A_{j,OUT}(N) are updated upon each packet arrival; the other table items are computed at the end of the Nth interval, based on the measured arrival rates. Storing the two rightmost columns of the table is not required but speeds up the dropping decisions. As Fig. 3 shows, incoming packets of the same aggregate are mapped into the same row of the AG table, regardless of their detailed flow IDs. Thus, flows having approximately the same flow fairness indices and the same next hop can be treated identically in terms of dropping probability; such dropping already implies fairness among the flows within an aggregate. Meanwhile, L can be obtained by solving Eq. (4) accurately, and then the accurate P_j can be obtained. Therefore, weighted fairness among different AGs, and hence among all flows, can be achieved. This scheme allows us to remove the redundant per-flow information while still obtaining a near-optimal weighted-fairness guarantee; it establishes the basic concept of SCALE-WFS, achieving per-flow fairness without per-flow management at core routers. In contrast, UT and CSFQ have to adopt a rough estimate of L (and thus of the p_i's), which results in the unpredictable performance we will show in the simulation section; this is due to the flow information missing from their schemes when trying to achieve fairness with an almost completely stateless core. Most importantly, the AG table is much smaller than the explosive flow count n. In practice, it is highly unlikely that the discrete aggregate fairness index needs more than 100 levels, which would mean a flow's actual arrival rate exceeding its specified profile rate 100-fold; such flows can be cut down to a lower fairness index at the first ingress edge router they pass. Therefore, the AG table entries can be selected from a set of discrete constants, equally spaced or not, over a reasonable value region. In our experiments, we use [0.0,



Fig. 4. Single edge single core case.

8.0] with an equally spaced interval of 0.1; in this case m = 80 in Fig. 1. More importantly, this size does not grow with the number of flows n. In Rainbow Fair Queueing, by contrast, each color represents an absolute rate value. Considering the large range of arrival rates (from several hundred bit/s to several Mbit/s, say) across flows, it is difficult for a small set of common colors to cover the varying range of flow rates, while a small number of colors results in large rate intervals between color layers and hence poor control precision. Its scalability is therefore questionable. Up to now, our control has been mainly rate-based. Queue behavior must be taken care of when bursty traffic is present, since queue overflow can defeat purely rate-based controls. In SCALE-WFS, we use a simple heuristic, similar to but simpler than that of Rainbow Fair Queueing, to prevent queue overflow. Whenever meanQue exceeds a threshold queTh (80% of the maximum queue size in our experiments), upon each packet arrival L is decreased by 1% when both of the following conditions hold: (1) the bits en-queued since the last L update exceed 0.02 of the maximum queue size; (2) the current meanQue is larger than the meanQue at the last L update. This differs from the scheme in Rainbow Fair Queueing and CSFQ in that we do not bound the number of consecutive 1% decreases: because our L is computed from the max-min equations rather than heuristically, it is already near-optimal, and the 1% decrease serves only as a minor adjustment for very bursty traffic, expected to happen infrequently.
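The AG-table quantization described above, with indices in [0.0, 8.0] at step 0.1, can be sketched as follows (a minimal illustration; the function and constant names are our own):

```python
QUANT_STEP = 0.1   # spacing of the aggregate fairness-index levels
F_MAX = 8.0        # AG-table index range [0.0, 8.0], as in the experiments

def ag_row(f):
    """Quantize a flow fairness index f into its AG-table row (0..80)."""
    f = min(max(f, 0.0), F_MAX)    # flows beyond F_MAX are clamped
    return int(round(f / QUANT_STEP))
```

With this grid the table has m + 1 = 81 rows, independent of the number of flows n.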

In SCALE-WFS, the sliding step (i.e. the computation interval of our algorithm) is set to dmax, the maximum queueing delay, i.e. dmax = bufferSize / C. The sliding-window size for measuring mean rates is set to 200 dmax so as to reflect the low-frequency traffic rate; this setting gives more stable and robust control behavior. The minth, maxth and queTh are set to 2.5, 90 and 80% of the buffer size, respectively, leaving enough room for the rate-based dropping probability to take effect. Our experience is that minor changes to these values do not affect performance, as shown in Section 4.4. In the core, the AG-table indices range from 0.0 to 8.0 in steps of 0.1. The weighted-CSFQ parameters K, Kc and Ka are set, according to the guidelines of the CSFQ paper [4], to twice dmax; its ns-2 code was provided by CSFQ's authors [29]. Here we use single-rate TCM with RED-3. For the TCM token-bucket sizes, we set the Committed Burst Size (CBS) of each flow to r_i × max_RTT, i.e. the bandwidth-delay product. To obtain max_RTT, we sum the propagation delay and the maximum queueing delay observed offline. This merely optimizes TCM's performance when TCP is present, and it brings no gain to SCALE-WFS or CSFQ; in practice, other settings are possible. The TCM Excessive Burst Size (EBS) is set to twice the CBS, which avoids preventing TCP sources from transmitting at full speed in the TCM + RED-3 case. The RED-3 queue thresholds follow RIO conventions: (1) for green packets, minth and maxth are set to 60 and 90% of the buffer size, and the maximum dropping probability is 1/50; (2) for yellow packets, they are 35%, 60% and 1/10, respectively; (3) for red packets, they are 5%, 35% and 1/5.
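Combining the RED-style packet-size conversion of Section 2.3 with the minth/maxth gating, the final per-packet drop decision might look like the following sketch (our own hypothetical illustration, not the paper's code):

```python
def packet_drop_prob(p_fluid, pkt_size, mean_pkt_size, mean_que, minth, maxth):
    """Convert a fluid drop probability into a per-packet one and gate it
    on the average queue size, as in RED [16] (hypothetical sketch)."""
    if mean_que < minth:
        return 0.0                 # queue short: never drop
    if mean_que > maxth:
        return 1.0                 # queue too long: drop everything
    # Scale by packet size so bit-level (fluid) and packet-level drops match.
    return min(1.0, p_fluid * pkt_size / mean_pkt_size)
```

The p_fluid argument is the p_{i,IN}/p_{i,OUT} (edge) or P_{j,IN}/P_{j,OUT} (core) value from Fig. 2 and Eq. (5).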

3. Simulations

In this section we evaluate SCALE-WFS by simulation, in comparison with CSFQ and TCM + RED-3. Unless otherwise specified, buffer sizes are set to 100 ms × C at the edge router and 50 ms × C at each core router, where C is the output-link capacity. To obtain steady results, the first 10 s of each simulation are skipped. All experiments were done with ns-2 [18], and all TCP sources are TCP Reno.

3.1. Provisioning scenarios and fairness performance metric

Unless otherwise specified, we test the three schemes above under the following three typical bandwidth-provisioning situations: (1) over-provisioning, where the output-link capacity C is 150% of the total profile rates of all incoming flows; (2) exact provisioning, where C equals the


Fig. 5. Exp. 1-1: under provisioning.

total profile rates; (3) under-provisioning, where C is 50% of the total profile rates. To evaluate fairness, we define the Fairness Performance Metric (FPM) of flow i as

$$\mathrm{FPM}_i = \frac{TH_i}{BW_i}$$

where TH_i is the measured average throughput of flow i, and BW_i is the ideal bandwidth that flow i should get according to the weighted max-min fairness of Eq. (2). If weighted fairness is achieved perfectly, FPM_i always equals 1.

3.2. The case of a single edge plus a single core: both TCP and UDP flows

The topology of Experiment 1 is shown in Fig. 4: one edge router and one core router, with TCP and UDP flows of different profile rates and different RTTs. In total, eight flow groups send data to the sink, with four flows per group; within a group, all flows have the same profile rate. Groups 1, 3, 5 and 7 are UDP flows with specified profile rates of 0.5, 1.0, 1.5 and 2.0 Mbps, respectively; each UDP flow, however, actually sends CBR traffic at 3 Mbps. Groups 2, 4, 6 and 8 are TCP flows carrying bulk FTP, with profile rates of 0.5, 1.0, 1.5


and 2.0 Mbps, respectively. The total profile rate of all sources is therefore 40 Mbps. The access-link capacity between each source and the edge router is 10 Mbps, ensuring no congestion there. In each group, the source-to-edge propagation delay is 10 ms for the first two flows and 30 ms for the other two; for example, flows 1-4 are the UDP flows of Group 1, with a link propagation delay of 10 ms for flows 1 and 2 and 30 ms for flows 3 and 4. In the over-provisioning case, the edge-to-core and core-to-sink link capacities are both 60 Mbps; for exact provisioning, both are 40 Mbps; for under-provisioning, they are 60 and 20 Mbps, respectively. The results are shown in Figs. 5-7. SCALE-WFS stays close to the ideal FPM of 1 in all situations. Both CSFQ and TCM + RED-3 are seriously biased against TCP flows: in particular, UDP flows can grab up to ten times more bandwidth than TCP flows of the same weight. CSFQ and TCM + RED-3 can therefore hardly protect TCP from UDP, and thus fail to provide weighted fairness. In the experiments, CSFQ performs similarly to TCM + RED-3 when bandwidth is over-provisioned, better when under-provisioned, but worse when exactly provisioned. This may be due to two factors: CSFQ's inaccurate estimation of L (called α in CSFQ) and the small measurement windows recommended by the guidelines in Ref. [4]. An overestimated L allows UDP to flood the queue, leading to queue overflow that forces TCP packets to be dropped even when they should not be; an underestimated L results in low throughput. The CSFQ guideline of using 2 dmax as the measurement window actually measures the high-frequency traffic components caused by transient TCP burstiness, resulting in unstable control and a bias against TCP. We also tested CSFQ on the topology of Fig. 4 with pure-TCP and pure-UDP sources. CSFQ successfully provides near-optimal weighted fairness in the all-UDP case. In the all-TCP case, it also provides a certain degree of weighted fairness, except when under-provisioned: there, TCP flows with shorter RTTs grab twice as much bandwidth as TCPs with longer RTTs, even within the same group (i.e. with the same profile rate). When TCP and UDP are combined, CSFQ cannot protect TCP from UDP, as shown above, although we tuned the CSFQ parameters to their optimum. In the all-TCP case, we found that a window of 200 dmax, instead of CSFQ's 2 dmax guideline, provides better fairness among flows; with this tuning, however, CSFQ cannot use the capacity fully. For instance, when bandwidth is exactly provisioned (i.e. 40 Mbps core-link capacity), the link occupancy is only 33 Mbps; when under-provisioned, it is 19.9 Mbps out of 20 Mbps; and when over-provisioned, only 49 Mbps out of 60 Mbps.

3.3. The case of a single edge plus a single core: all TCP flows with different RTTs

Fig. 6. Exp. 1-2: exact provisioning.

Experiment 2's topology is similar to Fig. 4. The major



Fig. 7. Exp. 1-3: over provisioning.

difference is that now all sources are TCP flows with a 1 Mbps profile rate each, and the source-to-edge propagation delay is i × 3 ms (source ID i ∈ [1, 32]). In the over-provisioning case, the edge-to-core and core-to-sink capacities are both 64 Mbps; for exact provisioning, both are 32 Mbps; for under-provisioning, the edge-core link is 64 Mbps and the core-sink link 16 Mbps. The target bandwidth of each flow is therefore 2, 1 and 0.5 Mbps, respectively, in the three scenarios. Figs. 8-10 show the simulation results. When under-provisioned, the three schemes perform similarly well. When exactly or over-provisioned, SCALE-WFS performs best, and TCM + RED-3 is better than CSFQ; in particular, CSFQ again favors short-RTT TCPs in the over-provisioning case. Note that TCM + RED-3 performs reasonably well when only TCP flows are present. This may be due to the adaptive behavior of TCP sources under congestion: in general, it takes TCP longer to reach a higher throughput, and when under-provisioned all flows' target bandwidths are smaller than in the other situations, making the targets relatively easier to reach.

3.4. The case of a single edge plus a single core: convergence experiment

Fig. 8. Exp. 2-1: under provisioning.

Fig. 9. Exp. 2-2: exact provisioning.

Fig. 10. Exp. 2-3: over provisioning.

Fig. 11. Traffic ON and OFF.

In this experiment, we tested each scheme's convergence under a topology similar to Fig. 4. The major differences are: (1) both the edge-core and core-sink capacities are 28 Mbps, which equals the total profile rate of traffic groups 1, 2 and 4; (2) the total simulation time is 120 s. Each group contains both TCP and UDP flows with different RTTs. Groups 1 and 4 are ON all the time; Group 2 is turned on at the 0th second and turned off at the 40th second; Group 3 is turned on at the 80th second and turned off, together with all the others, at the 120th second. The traffic ON/OFF pattern is shown in Fig. 11. During simulation time 0-40 s, the total traffic profile rate is 28 Mbps, so the network is in a state of exact provisioning. During 40-80 s, the total profile rate is 20 Mbps, so the bandwidth is over-provisioned. During 80-120 s, the total profile rate is 32 Mbps, so the bandwidth is under-provisioned. To see how the three schemes respond to these abrupt traffic changes, we randomly select one TCP and one UDP flow from the same group (i.e. with the same profile rate) and observe their transient throughput dynamics. Figs. 12 and 13 show the smoothed throughputs, from which we can see whether they converge within a reasonably long time. As shown by these two figures, SCALE-WFS performs better than the other two schemes. As before, CSFQ allows higher UDP throughputs and is biased against TCP. Considering the performance of CSFQ and TCM + RED3 under Experiment 1 (Section 3.2), where the traffic sources are on all the time, it is natural that they cannot converge to their target bandwidth when the traffic is changing.

3.5. Multiple congested links case

Experiment 4 is designed to test how all three schemes work under multiple congested links, as shown in Fig. 15, a typical "parking lot" structure. Two groups of traffic, in which each flow's profile rate is 1 or 2 Mbps for group 1 and group 2, respectively, enter the network from the left edge router. They traverse all the core routers to reach the RightSink. The output link of the left edge router is 32 Mbps, to allow the traffic to flood into the network. Apart from the traffic coming from the left sources, each core router accepts three disturbing UDP flows coming from the bottom of the figure. These UDP flows head towards the UDPSink. It can be seen that at every core router's output link, traffic from the leftmost sources and the disturbing UDP traffic merge and split before reaching their sinks. All UDP flows are CBR at 3 Mbps. Each core router's output link is 10 Mbps, and is therefore heavily congested. By observing how much bandwidth the sources from the leftmost side obtain at the RightSink, we can evaluate the weighted fairness guarantee. The experimental result for four congested links is shown in Fig. 14. Flows 1-4 and 5-8 are UDP and TCP flows with a profile rate of 1 Mbps, while flows 9-12 and 13-16 have a profile rate of 2 Mbps. It can be seen that SCALE-WFS obtains near-optimal weighted fairness. CSFQ and TCM + RED3 again failed to protect TCP from UDP. In particular, CSFQ's UDP sources get slightly more than their targeted bandwidth, while its TCPs reach only about half of their bandwidth expectation. With TCM + RED3, the TCP throughputs drop almost to 0, while the UDP throughputs reach only about 70% of the target bandwidth. All these tests demonstrate the superior performance of SCALE-WFS over CSFQ and TCM + RED3.

Fig. 12. Exp. 3-1: convergence of UDP.
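The smoothed throughput traces used in the convergence figures (Figs. 12 and 13), like the flow arrival-rate estimates that rate-based schemes such as CSFQ maintain, are typically obtained by exponential averaging over packet interarrivals. The sketch below is illustrative only; the function name and the choice of averaging constant K are our assumptions (CSFQ's guideline ties K to dmax, and Section 3.2 discussed enlarging it to 200*dmax):

```python
import math

def update_rate(rate_old, pkt_len, t_arrival, t_prev, K):
    """Exponentially averaged rate estimate, updated on each packet arrival.

    rate_old          -- previous estimate (bits/s)
    pkt_len           -- packet size (bits)
    t_arrival, t_prev -- arrival times of this and the previous packet (s)
    K                 -- averaging time constant (s), e.g. a multiple of dmax
    """
    T = t_arrival - t_prev        # interarrival gap
    if T <= 0:
        return rate_old           # guard against timestamp ties
    w = math.exp(-T / K)          # older history decays with the gap length
    return (1.0 - w) * (pkt_len / T) + w * rate_old
```

With a constant-rate stream the estimate converges to the true rate regardless of K; a larger K (such as the 200*dmax tuning discussed above) only slows the response to changes.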

4. Technical discussions

4.1. Low cost and simplicity

As discussed in Section 2.4, core routers only need to maintain a much smaller aggregate table (for example, m + 1 items in Fig. 1) in order to implement SCALE-WFS. The table size is very flexible and can differ from router to router; it is up to the ISP to trade off control complexity against service granularity. More importantly, the table size and the fixed table indices (i.e. the AG fairness indices) are not correlated with the explosively growing number of arriving flows n, the continuously valued flow fairness index, or volatile flow arrival rates. Thus, the memory requirement is much smaller than for per-flow management. Items in the AG table can be accessed directly, instead of via longest-prefix match; the search complexity is therefore O(1). In addition, there is no need to identify the start and end of flows in the core. Another advantage of SCALE-WFS over per-flow management is its low complexity, O(m^2), for computing the link fairness index L, as discussed before. Moreover, SCALE-WFS does not require any protocol change, nor does it need additional signaling protocols or a feedback control mechanism. All it needs is a labeling mechanism and a Service Level Agreement for the profile rate specification. In terms of implementation, its complexity is much closer to "core-stateless" than to "per-flow", yet by our analysis and simulations it achieves much better weighted fairness than the former.

4.2. Robustness

In simulation, we tested SCALE-WFS under the scenarios of bandwidth over-provisioning, under-provisioning and exact provisioning, with both TCP and UDP present. We also considered different flow RTTs and heterogeneous flow profile rates (i.e. flow weights) through multiple congested links, in addition to the convergence test after sudden traffic changes. Previous related works [4,27,28] did not test their schemes in such a dynamic environment. By comparison, we found SCALE-WFS to be more robust. Furthermore, SCALE-WFS does not rely on TCP throughput models as a function of RTT



Fig. 14. Multiple congested links.

Fig. 13. Exp. 3-2: TCP convergence.

and loss rate, and can thus effectively protect normal TCPs against "badly behaved" TCPs and UDPs.

4.3. Extensibility and forward compatibility

One advantage of SCALE-WFS is that it replaces the rough estimation of the link fairness index L by making the max-min weighted fairness equation explicitly solvable at core routers. Its performance, or control precision, can be adjusted via the quantization levels of the AG table, and can progress with growing CPU or memory resources. An extreme case is to enlarge the AG table to treat each flow as one aggregate, if next-generation routers allow it; the scheme then becomes per-flow management. To provide different service priorities supporting multiple traffic classes, the AG tables should accommodate more items, or another approach should be adopted to solve the correspondingly changed max-min fairness equation. In either case, the SCALE technique can still be used. Our scheme requires no change to source behavior, nor does it make resource reservations. Therefore, we cannot expect it to improve global network efficiency beyond the local max-min fairness on each router. However, it does not preclude mechanisms such as rate re-negotiation and feedback control (to notify a source about its bottleneck link situation) for attaining global network resource optimization. It is also orthogonal to the existing traffic conditioning and routing schemes in DiffServ networks.

4.4. Parameter settings

During the simulations, we found that when the queue minimum threshold, minth, is too small, dropping tends to be slightly aggressive. In addition, the router's output capacity may not be fully utilized even when the total incoming rate exceeds it. If minth is too big, fairness decays slightly. However, performance is not sensitive to minth, as shown by Table 1, where we can

Fig. 15. Multiple congested links case.

Table 1
minth vs. throughput and fairness (unit: Mbps). Columns give per-flow throughput with minth set to 2.5, 5 and 10% of the maximum queue size.

Under-provisioning:

Flow(s)     TargetBW  2.5%    5%      10%
UDP1-2      0.25      0.253   0.255   0.257
UDP3-4      0.25      0.255   0.255   0.260
TCP1-2      0.25      0.267   0.265   0.279
TCP3-4      0.25      0.259   0.249   0.250
UDP5-6      0.5       0.518   0.513   0.514
UDP7-8      0.5       0.517   0.519   0.517
TCP5-6      0.5       0.523   0.526   0.505
TCP7-8      0.5       0.479   0.487   0.483
UDP9-10     0.75      0.780   0.785   0.787
UDP11-12    0.75      0.788   0.789   0.787
TCP9-10     0.75      0.734   0.718   0.739
TCP11-12    0.75      0.661   0.668   0.662
UDP13-14    1         1.063   1.065   1.068
UDP15-16    1         1.076   1.078   1.080
TCP13-14    1         0.946   0.934   0.909
TCP15-16    1         0.874   0.887   0.897
Total:                19.999  19.999  19.999
Capacity:             20      20      20
see Experiment 1's results for minth set at 2.5, 5 and 10% of the maximum queue size, respectively. The queue maximum threshold, maxth, is related to the burstiness of the incoming traffic: the bigger maxth is, the more tolerant the router is towards bursty traffic. Note that our experiments used only a leaky-bucket color marker and time-sliding-window measurement for all the traffic. It is reported in Ref. [9] that these mechanisms mark TCP's IN-packet rate below the profile rate and overestimate TCP average arrival rates. Hence, we may expect better performance from our model given an improved traffic conditioner.
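To make the per-interval computation concrete, here is a minimal Python sketch of the iterative weighted max-min solver given in Appendix A. It is a sketch under our reading of that algorithm, not production code; the function name and list-based data layout are our own, while the heuristic K = 0.9 is the value suggested in the paper:

```python
def solve_maxmin(C, a_prev, f, K=0.9):
    """Iterative weighted max-min solver (after Appendix A).

    C      -- output link capacity for the coming interval
    a_prev -- measured arrival rate a_i(N-1) of each aggregate
    f      -- fairness index f_i of each aggregate
    Returns (p, L): per-aggregate dropping probabilities p_i(N)
    and the link fairness index L.
    """
    n = len(f)
    p = [0.0] * n
    S = set(range(n))
    C_res = C
    r = [0.0] * n
    for i in range(n):
        if f[i] == 0.0:
            S.discard(i)           # zero-index aggregates are never dropped
            C_res -= a_prev[i]
        else:
            r[i] = a_prev[i] / f[i]  # weight-normalized demand
    r_res = sum(r[i] for i in S)
    L = C_res / r_res if r_res else 0.0
    # Water-filling: peel off aggregates whose index is already at or below L;
    # they keep what they are sending, and L is recomputed on the residue.
    while S and any(f[i] < L for i in S):
        for i in list(S):
            if f[i] <= L:
                S.discard(i)
                C_res -= a_prev[i]
                r_res -= r[i]
        if r_res != 0:
            L = C_res / r_res
    if not S:
        L = max(f)
    else:
        L *= K  # heuristic shrink: rate estimates lag growing TCP flows
    for i in S:
        p[i] = 1.0 - L / f[i]  # drop just enough to pull the index down to L
    return p, L
```

For example, with C = 10 and two aggregates of equal fairness index f_i = 6 arriving at 6 each, the solver yields L = 0.9 * 5 = 4.5 and drops 25% from each, serving 4.5 per aggregate.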

5. Conclusions and future work Through analysis and simulation, we show that previous approaches cannot guarantee Internet weighted fair bandwidth sharing without per-¯ow management, especially when both TCP and UDP ¯ows of different RTTs and different weights coexist. Hence this paper makes a timely contribution. We proposed a rate-based active queue management mechanism, called SCALE±WFS (Scalable Core with Aggregations Level labEling±Weighted Fair bandwidthSharing), to achieve the near-optimal weighted fair bandwidth sharing with a scalable core, under the context of DiffServ (AF service) network. It works effectively for heterogeneous ¯ow weights (i.e. the targeted bandwidth shares), different ¯ow RTTs, different protocols (TCP and UDP). It also works under the scenarios of different bandwidth provisioning or multiple congested gateways, all of which we consider the basic requirements for a practical solution. Apart from ªper-¯owº level and completely ªcore-state-

UDP1-2 UDP3-4 TCP1-2 TCP3-4 UDP5-6 UDP7-8 TCP5-61.5 TCP7-8 UDP9-10 UDP11-12 TCP9-10 TCP11-12 UDP13-14 UDP15-16 TCP13-14 TCP15-16 Total: Capacity:

Over-provisioning TargetBW

2.50%

5%

0.75 0.75 0.75 0.75 1.5 1.5 1.458 1.5 2.25 2.25 2.25 2.25 3 3 3 3

1.018 1.019 0.754 0.740 1.692 1.695 1.472 1.369 2.364 2.361 2.154 2.009 2.991 2.990 2.799 2.564 59.970 60

0.997 0.998 0.761 0.735 1.717 1.718 1.522 1.462 2.427 2.429 2.118 2.017 2.994 2.993 2.705 2.447 59.998 60

10% 1.020 1.020 0.776 0.740 1.747 1.744 1.453 2.463 2.457 2.115 1.912 2.995 2.994 2.723 2.311 59.999 60

lessº, we present a third level, ªaggregationº level management. SCALE, as a control mechanism that reduces information redundancy with ªper-¯owº, also avoids the low performance of completely ªcore-statelessº. We show that ¯ows of same fairness index can be merged at the core and served without difference. At this ¯ow aggregate level, local max±min fairness can be guaranteed in a rather precise and scalable way. This is very important to the current Internet architecture, especially because our scheme has low overhead but is effective and robust. Our future work aims to enhance the SCALE±WFS to guarantee multiple traf®c classes and improve global network ef®ciency with the help of certain feedback rate control or pro®le re-negotiation. Another interesting point is to further integrate the MPLS labeling ideas into the DiffServ. Moreover, we will also try to use the SCALE scheme to investigate other network control problems. The basic idea is to compress per-¯ow information by ¯ow aggregation. This idea may be suitable for the services that were initially developed for IntServ, but are now being replaced by Diffserv. Acknowledgements We would like to thank sincerely the valuable discussion with Professor Simon Lam, especially on the control convergence part, the email discussions with Ion Stoica and Rong Zheng, and valuable advice from Dr Jeffery Hansen. Appendix A Below we give out the iterative algorithm for solving



weighted max-min fairness Eqs. (2) and (4) at the beginning of the Nth computation interval.

Given: C; a_i(N-1); the flow set S = {1, 2, ..., i, ..., n}; and f_i (for all i in S).

Initialization:
  r_i = a_i(N-1) / f_i, for all i in S;
  C_residue = C;
  r_residue = sum of r_i over all i in S;
  L = C_residue / r_residue;
  K = 0.9;

Algorithm:
  While (!Null(S) AND SsmallerThanL(S, L)) Do {
    for (each i in S) {
      if (f_i <= L) {
        p_i(N) = 0; f'_i(N) = f_i;
        Remove i from S;
        C_residue = C_residue - a_i(N-1);
        r_residue = r_residue - r_i;
      }
    }
    if (r_residue != 0) L = C_residue / r_residue;
  }
  if (Null(S)) L = max(f_i, for all i in [1, ..., n]);
  else L = L * K;
  for (each i in S) {
    p_i(N) = 1 - L / f_i;
    f'_i(N) = (1 - p_i(N)) * f_i;
  }

Subroutine SsmallerThanL(S, L):
  if (there exists i in S with f_i < L) return TRUE;
  else return FALSE;

Comment: We assume f_i != 0 (for all i in S) in the above. Otherwise, if f_i = 0, we directly set f'_i(N) = p_i(N) = 0, remove i from S, and let C_residue = C - a_i(N-1) in the initialization. The rest remains the same.

Comment: The K in the algorithm is a heuristic constant whose value should be between 0 and 1; in practice, we simply set it to 0.9. The reason is that our computation of L is based on flow rate estimates that reflect history rather than the future. Therefore, L can be slightly larger than needed while TCP flows are in their growing stages (such as slow start), introducing a slight bias towards UDP. Shrinking L a little helps TCPs reach their target bandwidth when UDPs are present.

References

[1] A. Demers, S. Keshav, S. Shenker, Analysis and simulation of a fair queueing algorithm, Internetworking: Research and Experience 1 (1990) 3-26.
[2] D. Lin, R. Morris, Dynamics of Random Early Detection, ACM SigComm'97, September 1997, France.
[3] C. Fulton, S.Q. Li, C.S. Lim, UT: ABR Feedback Control with Tracking, IEEE Infocom'97, April 1997.
[4] I. Stoica, S. Shenker, H. Zhang, Core-Stateless Fair Queueing: Achieving Approximately Fair Bandwidth Allocation in High Speed Networks, ACM SigComm'98, August 1998, Canada.
[5] O. Bonaventure, S. De Cnodder, A Rate Adaptive Shaper for Differentiated Services, Internet Draft, http://www.info.fundp.ac.be/~obo/doc/draft-bonaventure-diffserv-rashaper-00.txt.
[6] I. Yeom, A.L. Narasimha Reddy, Impact of marking strategy on aggregated flows in a differentiated services network, Proceedings of IEEE/IFIP IWQoS, May 1999.
[7] F.M. Anjum, L. Tassiulas, Fair Bandwidth Sharing among Adaptive and Non-Adaptive Flows in the Internet, IEEE InfoCom'99, March 1999, USA.
[8] N. Seddigh, B. Nandy, P. Pieda, Study of TCP and UDP Interaction for the AF PHB, Internet Draft, draft-nsbnpp-diffserv-tcpudpaf-01.pdf.
[9] W. Lin, R. Zheng, J.C. Hou, How to Make Assured Service More Assured, ICNP'99.
[10] H. Chow, A. Leon-Garcia, A Feedback Control Extension to Differentiated Services, Internet Draft, draft-chow-diffserv-fbctrl-00.pdf.
[11] R. Satyavolu, K. Duvedi, S. Kalyanaraman, P. Bagal, Techniques for Explicit Feedback Control of TCP, IEEE/IFIP IWQoS'99, June 1999, UK.
[12] D.D. Clark, W. Fang, Explicit Allocation of Best-Effort Packet Delivery Service, IEEE/ACM Transactions on Networking 6 (4) (1998).
[13] J. Heinanen, R. Guerin, A Single Rate Three Color Marker, Internet RFC 2697, September 1999.
[14] J. Heinanen, R. Guerin, A Two Rate Three Color Marker, Internet RFC 2698, September 1999.
[15] Dynamic Packet State: http://www.cs.cmu.edu/~istoica/DPS/.
[16] S. Floyd, V. Jacobson, Random Early Detection for Congestion Avoidance, IEEE/ACM Transactions on Networking 1 (4) (1993) 397-413.
[17] T. Turletti, S.F. Parisis, J. Bolot, Experiments with a Layered Transmission Scheme over the Internet, INRIA Research Report no. 3296, November 1997.
[18] UCB/LBNL/VINT Network Simulator - ns (version 2), http://www-mash.cs.berkeley.edu/ns/.
[19] R. Braden, D. Clark, S. Shenker, Integrated Services in the Internet Architecture: an Overview, Internet RFC 1633, June 1994.
[20] S. Shenker, C. Partridge, R. Guerin, Specification of Guaranteed Quality of Service, Internet RFC 2212, September 1997.
[21] J. Wroclawski, Specification of the Controlled-Load Network Element Service, Internet RFC 2211, September 1997.
[22] R. Braden (Ed.), L. Zhang, S. Berson, S. Herzog, S. Jamin, Resource ReSerVation Protocol (RSVP) - Version 1 Functional Specification, Internet RFC 2205, September 1997.
[23] J. Wroclawski, The Use of RSVP with IETF Integrated Services, Internet RFC 2210, September 1997.
[24] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, W. Weiss, An Architecture for Differentiated Services, Internet RFC 2475, December 1998.
[25] V. Jacobson, K. Nichols, K. Poduri, An Expedited Forwarding PHB, Internet RFC 2598, June 1999.
[26] J. Heinanen, F. Baker, W. Weiss, J. Wroclawski, Assured Forwarding PHB Group, Internet RFC 2597, June 1999.
[27] Z. Cao, Z. Wang, E. Zegura, Rainbow Fair Queueing: Fair Bandwidth Sharing Without Per-Flow State, IEEE INFOCOM 2000, March 2000, Tel-Aviv, Israel.
[28] R. Pan, B. Prabhakar, K. Psounis, CHOKe: A stateless active queue management scheme for approximating fair bandwidth allocation, IEEE INFOCOM 2000, March 2000, Tel-Aviv, Israel.
[29] CSFQ Homepage: http://www.cs.cmu.edu/~istoica/csfq/.

Haifeng Zhu is a doctoral student in Computer Engineering at Carnegie Mellon University. He received his Bachelor's and Master's degrees in Computer Science and Technology from Tsinghua University, Beijing, China, in 1993 and 1995, respectively. From 1994 to 1996 he chaired a working group in APNG (Asia-Pacific Networking Group) and published the first RFC from China as a primary author. He served as an expert member of the China National Information Technology Standardization Technical Committee from 1995 to 1997. Prior to joining CMU, he was a graduate student at the University of Texas at Austin. His research interests include Internet Quality of Service (QoS) and its optimization, congestion control for unicast and multicast, and differentiated services.

Aimin Sang received the B.S. degree from the University of Science and Technology of China, and the M.S. degree from the Graduate School and the Institute of Automation, Chinese Academy of Sciences. He is currently a Ph.D. candidate in the Electrical Engineering Department, The University of Texas at Austin. His current research interests include multi-service network design and performance analysis.


San-qi Li received his B.S. degree from Beijing University of Posts and Telecommunications, Beijing, China, in 1976, and the M.A.Sc. and Ph.D. degrees from the University of Waterloo, Waterloo, ON, Canada, in 1982 and 1985, respectively, all in electrical engineering. From 1985 to 1989, he was Associate Research Scientist and Principal Investigator at the Center for Telecommunications Research at Columbia University, New York. In September 1989, he joined the faculty of the Department of Electrical and Computer Engineering, University of Texas at Austin, where he is Temple Foundation Endowed Professor. He is also an Honorary Professor at Beijing University of Posts and Telecommunications, China. He has published more than 150 papers in international archival journals and refereed international conference proceedings. The main focus of his research has been to develop new analytical methodologies and carry out performance analysis of multimedia service networks, in order to understand system fundamentals and explore new design concepts. Dr. Li served as a Member of the Technical Program Committee for the IEEE Infocom Conference from 1988 to 1998. From 1995 to 1997, he served as an Editor for IEEE/ACM TRANSACTIONS ON NETWORKING.