
Revisiting Design Choices in Queue Disciplines: the PIE case

Pasquale Imputato (a,*), Stefano Avallone (a), Mohit P. Tahiliani (b), Gautam Ramakrishnan (b)

(a) Università degli Studi di Napoli Federico II, Dipartimento di Ingegneria Elettrica e delle Tecnologie dell'Informazione, Via Claudio 21, 80125 Napoli, Italy
(b) Wireless Information Networking Group (WiNG), Department of Computer Science and Engineering, NITK Surathkal, Mangalore, Karnataka, 575025, India

(*) Corresponding author

Abstract

Bloated buffers in the Internet add significant queuing delays and have a direct impact on the latency perceived by users. There has been an active interest in addressing the problem of rising queue delays by designing easy-to-deploy and efficient Active Queue Management (AQM) algorithms for bottleneck devices. The real deployment of AQM algorithms is a complex task because the efficiency of every algorithm depends on an appropriate setting of its parameters. Hence, the design of AQM algorithms is usually entrusted to simulation environments, where it is relatively straightforward to evaluate the algorithms with different parameter configurations. Unfortunately, several factors that affect the efficiency of AQM algorithms in real deployments do not manifest during simulations and therefore lead to an inefficient design of the AQM algorithm. In this paper, we revisit the design considerations of Proportional Integral controller Enhanced (PIE), an algorithm widely considered for network deployment, and extensively evaluate its performance using a Linux based testbed. Our experimental study reveals some performance anomalies in certain circumstances, and we prove that they can be attributed to a specific design choice of PIE, namely the use of the estimated departure rate to compute the expected queuing delay. Therefore, we designed an alternative approach based on packet timestamps, implemented it in the Linux kernel and proved its effectiveness through an experimental campaign.

Keywords: Network Testing, Experimental Evaluation, Linux, Traffic Control, Queue Discipline, Active Queue Management, PIE

1. Introduction

Rapid increase in the usage of time sensitive Internet applications has encouraged the research community to identify and eliminate the potential latency hotspots in the network. One of the most recently and widely studied problems is Bufferbloat, wherein excessive buffering leads to a significant rise in the queuing delay [1]. An effective approach to mitigate this problem is to enable Active Queue Management (AQM) in network devices. Some of the well known AQM algorithms are Random Early Detection (RED) [2], BLUE [3], the Proportional Integral (PI) controller [4], Controlled Delay (CoDel) [5], PI controller Enhanced (PIE) [6] and PI with a Square (PI2) [7]. The benefits of using AQM depend on the accuracy of the congestion estimation techniques adopted by these algorithms. Depending on the approach adopted to estimate network congestion, AQM algorithms can be classified into different categories: queue length based algorithms (e.g., RED, PI), rate based algorithms (e.g., BLUE), queue delay based algorithms (e.g., CoDel, PIE, PI2) and others.


The algorithms that belong to the same category further differ in the way congestion is predicted: for instance, RED predicts congestion by tracking the average queue length [2], whereas PI predicts congestion by tracking the instantaneous queue length [4]. Similarly, CoDel tags every incoming packet with a timestamp at enqueue time and uses this tag at dequeue time to measure the queue delay [5]. PIE and PI2, instead, use Little's Law to estimate the queue delay [6, 7]. Although AQM algorithms offer significant benefits, their real deployment is challenging because their performance is highly sensitive to the parameter settings. Therefore, the core design and evaluation of these algorithms is usually performed by resorting to simulation environments and, subsequently, they are considered for real deployment. In this paper, we show that a few important aspects of AQM design do not manifest in a simulation environment, but might arise during the real deployment of the algorithm and affect the overall performance. We emphasize that AQM algorithms, besides being thoroughly evaluated in simulated environments, must be extensively evaluated in a real testbed environment to discover potential design flaws before they are considered for deployment in real systems. In this paper, we take PIE as a prominent example to prove our claims. PIE has been proposed by Cisco Inc., implemented in the Linux kernel and standardized by the IETF in RFC 8033.

Moreover, a variant of PIE called DOCSIS PIE (RFC 8034) [8] is deployed by CableLabs in DOCSIS 3.1 and higher versions of their cable modems. In the original paper presenting PIE [6], the performance of the proposed AQM algorithm is mainly evaluated by means of ns-2 simulations. The experimental study we conducted shows instead that PIE may lose effectiveness in some circumstances. We thoroughly investigated this issue and determined that it is caused by an integral component of PIE, called departure rate estimation. This component is expected to estimate the actual transmission rate of packets over the outgoing link, but ends up estimating the rate at which packets are stored in the device driver's transmission ring: in Linux based network systems, packets leaving the PIE queue discipline are queued in the device ring while waiting for transmission. We stress that the introduction of such a component was a precise design choice to eliminate the need for timestamping packets, which was deemed to be a time consuming operation.¹ Our findings equally apply to other AQM algorithms that rely on departure rate estimation, such as PI2. To overcome the issues arising when packets are buffered again after leaving the queue managed by the PIE AQM algorithm, as happens in Linux based systems, we recommend modifying the design of PIE so as to remove the departure rate estimation component. In this paper, we present a revisited design of PIE, whereby the queue delay is estimated by using a packet timestamp based approach instead of the departure rate estimation component. We implemented the revisited PIE algorithm in the Linux kernel and carried out an experimental campaign to prove that performance is improved in all the scenarios under test.

This paper is organized as follows: related work is discussed in Section 2, while Section 3 provides a background on the Linux Traffic Control infrastructure and briefly illustrates the operations of the PIE AQM algorithm. Section 4 describes the methodology we adopted to carry out experiments and Section 5 presents the experimental results obtained with the original version of PIE. Section 6 presents the proposed revisited design of PIE and Section 7 shows the results of the experiments conducted to prove its effectiveness. Finally, Section 8 concludes the work.

¹ As discussed hereinafter, RFC 8033 opens up to an alternative method to measure the queue delay in PIE by using timestamps, like CoDel.

2. Related work

In this section, we describe the work related to the design and evaluation of the most widely adopted AQM algorithms. The goal is to provide an overview of the context in which this paper is placed and to highlight the contributions of our work.

Authors of AQM algorithms have often resorted to network simulations to evaluate their proposals, because simulation allows to quickly test the performance of AQM algorithms in a variety of conditions in a controlled manner. For example, the RED and CoDel algorithms have been evaluated using the ns-2 network simulator [2, 5]. PIE and a number of its variants have also been evaluated mainly through simulations [6].

In [9, 10], the authors conduct a simulation study based on ns-3 to evaluate the effectiveness of AQM algorithms such as RED, CoDel and PIE. The study aims to evaluate their performance in terms of network delay. However, they do not take any measure to minimize the queuing time in the network device buffer, which can become the most prominent contribution to the latency experienced by packets [11]. Also, a number of factors affecting the effectiveness of AQM algorithms are not taken into account in simulations. For instance, device specific optimizations in the flow control implemented by the device driver are not reproducible by means of simulations.

A few papers present the results of experiments conducted to test AQM algorithms. In [12, 13], the authors set up a testbed to evaluate the performance of AQM algorithms such as PIE and CoDel, mainly in terms of network delay. However, they neglect the additional delay caused by the queuing time in the network device buffer and do not evaluate its impact on the overall performance. Also, experiments are performed with a single network device, thus the results do not capture the different behavior that distinct devices have due to different driver implementations. The impact of network device buffers on the performance of AQM algorithms has been studied in [11] by using a real testbed. However, only algorithms, such as CoDel, that do not use the departure rate estimation approach have been tested. In [14], the authors focus on the impact of device buffers in LTE networks. They propose the use of PIE to minimize the queuing time in the RLC (Radio Link Control) queues. However, a discussion about why PIE is the most suited AQM algorithm for their scenario and an analysis of how PIE (and its components) interplays with the underlying layers are missing.

The initial concern raised in [6] about the excessive computational overhead of the timestamping approach has been disproved by several evaluations of actual CoDel implementations. Experiments in [11] show that the timestamp based approach used by CoDel achieves good performance in terms of throughput and latency over wired (Ethernet) networks. CoDel has also been ported to the 802.11 MAC (Medium Access Control) layer of the Linux kernel to perform software timestamping and control the queuing time inside Wi-Fi device drivers [15]. Furthermore, the work in [16] presents an implementation of CoDel in P4, thus showing that the timestamp based approach can also be used to reduce the latency in programmable network equipment.

In [17], the author carries out a detailed analysis of PIE aiming to evaluate its performance while tweaking a number of parameters, and argues that it is difficult to identify a single value for each parameter that keeps PIE effective and reactive in all circumstances. In [18], the author presents a version of PIE without the departure rate component. The considerations that emerged from the various studies and analyses about the performance of PIE flowed into the IETF RFC 8033 [19]. This RFC recommends that the queue threshold parameter be set in the range from 16 to 64 KB, with a default value of 16 KB. However, the standard does not provide any recommendation about how to determine a proper value for this parameter. Also, RFC 8033 defines the departure rate method as the primary method to estimate the queuing delay. However, it also opens to alternative methods to estimate the queuing delay that exploit a timestamp based approach, thus leaving out the initial concern about the computational complexity. Therefore, to the best of our knowledge, no previous work has analyzed in depth the performance of queue delay based AQM algorithms based on real experiments and disclosed performance anomalies that can be attributed to some design choices of the algorithms themselves.

3. Background

3.1. Linux Traffic Control infrastructure

This section provides some background information about the journey of packets within the Linux kernel, from the time the outgoing network interface is determined to the time they are actually handed to the Network Interface Card (NIC). Every packet to be transmitted through a NIC is passed to the Traffic Control (TC) infrastructure, which enqueues the packet in the queuing discipline (qdisc henceforth) associated with the outgoing network interface. The pseudocode of the Linux send function is reported in Algorithm 1. A qdisc implements various traffic management functions, including scheduling, dropping and marking of the queued packets. Like many other AQM algorithms, PIE is implemented as a qdisc. Immediately after enqueuing a packet in a qdisc, TC requests such qdisc to dequeue at most a configurable number of packets (indicated by the quota variable, which is set to 64 by default) by calling the qdisc run function (Algorithm 2). The restart function (Algorithm 2) dequeues a packet from the qdisc and passes it to the network device driver by calling the transmit function. The network device driver stores received packets in a circular queue, named transmission ring. The ring is made of a (configurable) number of slots (usually referred to as descriptors), each of which stores information about a packet, such as its length and the address of the physical memory where it is stored. Then, network device drivers asynchronously transfer packets to the NIC by employing DMA (Direct Memory Access).

Algorithm 1: Pseudocode of the Linux TC send function.

SEND(packet)
 1  determine the qdisc associated with the network device
 2  qdisc.ENQUEUE(packet)
 3  qdisc.RUN()

Algorithm 2: Pseudocode of the Linux generic qdisc main functions.

RUN()
 1  while (RESTART())
 2      do quota ← quota − 1
 3         if (quota <= 0)
 4             then break

RESTART()
 1  packet ← DEQUEUE()
 2  return TRANSMIT(packet)

TRANSMIT(packet)
 1  determine the device transmission ring for the packet
 2  if device transmission ring is stopped
 3      then requeue the packet in the qdisc
 4           return FALSE
 5  else send packet to device
 6       return TRUE

The transmission ring is a finite buffer and may become full if the rate at which packets are dequeued from the qdisc is consistently higher than the rate at which packets are actually transmitted over the network. In order to avoid packet drops due to buffer overrun, the network device driver stops the transmission ring as soon as there is no room for a further packet, and the transmit function of the TC infrastructure (Algorithm 2) prevents a qdisc from sending packets to the network device driver if the transmission ring is stopped. Whenever the completion of a DMA transfer makes enough room in the transmission ring, the network device driver restarts the transmission ring and wakes the associated qdisc, i.e., it requests the qdisc to resume dequeuing packets, until either a configurable number of packets are dequeued or the transmission ring is stopped again. We note that network device drivers implement various policies for what concerns the minimum amount of room in the transmission ring required to wake up the qdisc and the handling of the DMA transfers of packets to the NIC. During heavy traffic workload, the flow control technique described above leads to alternating periods in which either a burst of packets is dequeued from the qdisc and passed to the network device driver, or no packet is dequeued from the qdisc (because the transmission ring is stopped). Such an observation is key to explain the performance anomalies of AQM algorithms like PIE that we disclose in this paper. Finally, we mention that the Linux kernel includes an algorithm named BQL (Byte Queue Limits) that adaptively determines the minimum amount of bytes to store in the transmission ring to avoid starvation. The goal is clearly to minimize the additional latency due to packets being queued in the transmission ring, while avoiding an impact on throughput. In order to support BQL, network device drivers have to send proper notifications when packets are enqueued into and dequeued from the transmission ring. Most Ethernet device drivers in the Linux kernel support BQL and most Linux distributions enable BQL by default. However, no Wi-Fi device driver supports BQL.
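To make the BQL notification mechanism more concrete, the following is a minimal, hypothetical sketch of how an Ethernet driver typically cooperates with BQL and with the qdisc flow control, using the kernel helpers netdev_tx_sent_queue() and netdev_tx_completed_queue() together with netif_tx_stop_queue()/netif_tx_wake_queue(); the driver-specific names (mydrv_*, struct mydrv_ring) are invented for illustration and do not refer to the e1000e or tg3 code.

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Hypothetical per-queue transmit ring bookkeeping. */
struct mydrv_ring {
        struct netdev_queue *txq;   /* kernel object tracked by BQL     */
        unsigned int free_slots;    /* descriptors still available       */
};

/* Called from the transmit path (TRANSMIT in Algorithm 2) for each packet. */
static netdev_tx_t mydrv_start_xmit(struct sk_buff *skb, struct mydrv_ring *ring)
{
        /* ... map the skb for DMA and fill a descriptor (omitted) ... */
        ring->free_slots--;

        /* Tell BQL how many bytes entered the ring; BQL may stop the queue
         * once its dynamically computed byte limit is reached. */
        netdev_tx_sent_queue(ring->txq, skb->len);

        /* The driver also stops the queue on its own when the ring is full,
         * so that the qdisc is no longer asked to dequeue packets. */
        if (ring->free_slots == 0)
                netif_tx_stop_queue(ring->txq);

        return NETDEV_TX_OK;
}

/* Called from the TX completion interrupt/NAPI path. */
static void mydrv_tx_clean(struct mydrv_ring *ring, unsigned int pkts, unsigned int bytes)
{
        ring->free_slots += pkts;

        /* Report completed work to BQL; this may re-enable a BQL-stopped
         * queue and wake the qdisc, which then resumes dequeuing (possibly
         * in a burst). */
        netdev_tx_completed_queue(ring->txq, pkts, bytes);

        if (netif_tx_queue_stopped(ring->txq) && ring->free_slots > 0)
                netif_tx_wake_queue(ring->txq);
}

Whether the qdisc is woken after a single completed packet or after a large batch depends on this driver-specific logic, which is exactly the source of the bursty dequeue pattern discussed above.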

3.2. PIE

PIE is an enhancement of the well known Proportional Integral (PI) controller, with the aim of controlling the queuing delay around a desired reference queue delay. It differs from the original PI controller in two ways: (i) the goal of PI is to predict congestion by tracking the queue length, whereas the goal of PIE is to minimize the impact of Bufferbloat by tracking the queue delay; (ii) the control parameters are not auto-tuned in PI, whereas they are auto-tuned in PIE [4, 6]. The pseudocode of the PIE algorithm is reported in Algorithm 3. When PIE is requested to enqueue a packet, PIE randomly drops the received packet with a pre-calculated probability p. Dropping packets is the main instrument used by AQM algorithms to notify TCP senders of the need to decrease the sending rate. When PIE is requested to dequeue a packet, PIE computes the average dequeue rate. Such value should estimate the available bandwidth and track its fluctuations [6]. It is worth noting that, in Linux based systems, the TC infrastructure triggers a number of consecutive dequeue operations according to the flow control mechanism described earlier.

The dequeue rate estimation method employed by PIE is based on measurement cycles. After dequeuing a packet, if no measurement cycle is running and the current queue backlog (qlen) exceeds a given threshold (dq_threshold, which is set to 16 kB by default), a new measurement cycle is started. Given that a meaningful estimate of the dequeue rate can only be obtained starting from a certain number of packets dequeued, the idea is to avoid starting a new measurement cycle when the queue backlog is low. A counter (dq_count) keeps track of the amount of bytes dequeued since the beginning of the current measurement cycle. When such counter exceeds the queue threshold, the measurement cycle ends. The dequeue rate in the measurement cycle is then computed as the ratio of the amount of bytes dequeued (provided by the counter) to the duration of the cycle (dtime). The average dequeue rate (avgDqRate) is then derived as a weighted average of the last dequeue rate sample and the previous average dequeue rate value.

The average dequeue rate is then used to periodically determine the queuing delay. Every time interval of length Tupdate (whose value recommended by RFC 8033 is 15 ms), the queuing time (cur_del) is updated by using the estimated average dequeue rate and the probability p is updated based on the estimated congestion. In particular, the current queue delay is estimated (by using Little's Law) as the ratio of the current queue length to the estimated average dequeue rate. The drop probability p is calculated as a linear combination of the previous drop probability value, the difference between the current queue delay and the reference queue delay ref_del (weighted by a control parameter α) and the difference between the current queue delay and the previous queue delay value old_del (weighted by a control parameter β). PIE reaches a steady state when the current queue delay does not change and equals the reference queue delay.

Algorithm 3: Pseudocode of the PIE algorithm with the departure rate method.

ENQUEUE(packet)
 1  randomly drop packet with probability p
 2  if (packet has not been dropped)
 3      then enqueue packet

where "randomly drop packet with probability p" expands to:
 1  rnd ← random_number()
 2  if (rnd < p)
 3      then drop packet
 4  else enqueue packet

DEQUEUE()
 1  dequeue packet
 2  if ((NOT in_measurement) AND (qlen > dq_threshold))
 3      then in_measurement ← true
 4           dq_count ← dq_count + dq_packet_size
 5           start ← now
 6  if (dq_count > dq_threshold)
 7      then dtime ← now − start
 8           dq_rate ← dq_count / dtime
 9           avgDqRate ← (1 − ε) · avgDqRate + ε · dq_rate
10           in_measurement ← false
11           dq_count ← 0
12  return packet

Every Tupdate milliseconds
 1  cur_del ← qlen / avgDqRate
 2  p ← p + α · (cur_del − ref_del) + β · (cur_del − old_del)
 3  old_del ← cur_del
 4  adapt α and β values based on the value of p
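For concreteness, the following is a compact, self-contained C sketch of the two computations above (departure rate estimation and periodic update). It follows the prose description of the measurement cycle rather than any kernel source; the numeric constants (ε, α, β) are illustrative placeholders and do not come from the Linux implementation.

#include <stdbool.h>

/* Illustrative constants (not taken from the kernel implementation). */
#define DQ_THRESHOLD   16384.0    /* bytes, default queue threshold     */
#define EPSILON        0.5        /* weight of the newest rate sample   */
#define ALPHA          0.125      /* illustrative proportional gain     */
#define BETA           1.25       /* illustrative differential gain     */
#define REF_DEL        0.015      /* s, target queue delay (15 ms)      */

static bool   in_measurement;
static double dq_count;           /* bytes dequeued in the current cycle */
static double start;              /* s, cycle start time                 */
static double avg_dq_rate;        /* B/s, smoothed departure rate        */
static double p, cur_del, old_del;

/* Called on every dequeue: departure rate estimation. */
void on_dequeue(double now, double pkt_size, double qlen)
{
        if (!in_measurement && qlen > DQ_THRESHOLD) {
                in_measurement = true;
                dq_count = 0.0;
                start = now;
        }
        if (!in_measurement)
                return;
        dq_count += pkt_size;
        if (dq_count > DQ_THRESHOLD) {
                double dtime = now - start;
                double dq_rate = dq_count / dtime;
                avg_dq_rate = (1.0 - EPSILON) * avg_dq_rate + EPSILON * dq_rate;
                in_measurement = false;
                dq_count = 0.0;
        }
}

/* Called every Tupdate: Little's Law delay estimate and probability update. */
void on_timer(double qlen)
{
        if (avg_dq_rate > 0.0)
                cur_del = qlen / avg_dq_rate;    /* Little's Law */
        p += ALPHA * (cur_del - REF_DEL) + BETA * (cur_del - old_del);
        if (p < 0.0) p = 0.0;
        if (p > 1.0) p = 1.0;
        old_del = cur_del;
        /* The real algorithm additionally scales alpha and beta based on p
         * (auto-tuning); omitted here for brevity. */
}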

4. Methodology

We conducted a thorough experimental campaign to analyze the performance of PIE and to explain the obtained results. We followed the characterization guidelines for AQM performance evaluation defined in RFC 7928 [20]. The RFC suggests to use a topology (see Figure 1) composed of one (or more) Sender node connected to a Router L (Left) through a high bandwidth link, a Router R (Right) connected to Router L by means of a bottleneck link and one (or more) Receiver node connected to Router R by means of a high bandwidth link. The RFC suggests to generate a unidirectional (bidirectional) traffic flow between (every pair of) sender and receiver, and a symmetric (asymmetric) bottleneck link to evaluate the AQM configured on NIC3 (and NIC4).

Figure 1: Network topology defined in RFC 7928 to evaluate AQM performance.

Since we aim to evaluate how PIE interplays with the network device driver, it is sufficient to consider a single unidirectional traffic flow over a symmetric bottleneck link and have PIE installed on NIC3 only. Consequently, the link between Router R and the Receiver node becomes superfluous. Therefore, we used a testbed (shown in Figure 2) consisting of three PCs connected back to back through Ethernet cross cables, with the AQM configured on NIC3. The rate of the link between the sender and the router is 1 Gbps, while the rate of the link between the router and the receiver is 100 Mbps.

Figure 2: Testbed used for experiments.

We use iPerf3 [21] to generate TCP traffic from the sender to the receiver. In this way, the link between the router and the receiver becomes the bottleneck and some backlog accumulates in the qdisc associated with NIC3. Therefore, we use PIE as the qdisc associated with NIC3, while the qdiscs associated with the other NICs are simple FIFO queues. ICMP Echo Request/Reply messages are exchanged every 5 milliseconds between the sender and the receiver for the purpose of collecting samples of the round-trip time experienced over the path. Given that the return path is lightly loaded (it is only traversed by TCP acknowledgments and ICMP Echo Reply messages), the network stack associated with NIC3 is the one that contributes the most to the resulting latency and packet drops. All of the three PCs are equipped with an Intel i7-6700 CPU and 16 GB of RAM, and run Linux kernel 5.0.

In order to test PIE with different network device drivers, we performed two sets of experiments differing in the NIC3 model. For the first set, NIC3 is an Intel I217-LM Gigabit Ethernet controller using the e1000e driver; for the second set, NIC3 is a Broadcom BCM57785 Gigabit Ethernet controller using the tg3 driver. Such adapters adopt different flow control mechanisms due to different interrupt mitigation techniques, i.e., how often the hardware notifies the driver about the completion of packet transmissions. The Intel adapter relies on dynamic interrupt mitigation thresholds (adapted based on the current network and system load), while the Broadcom adapter relies on statically assigned thresholds. In order to also test PIE under different backlog conditions, we considered both the case where BQL is enabled and the case where BQL is disabled. In both cases, the transmission ring of the NIC3 driver is fixed to 256 descriptors (the default for the considered drivers).

For each experiment, we performed 5 consecutive tests, each lasting 30 seconds. The transient state during which queues build up lasts for about a couple of seconds, thus 30 seconds are enough to capture the steady-state evolution of the quantities we measure. Also, results are rather similar from one test to another, thus repeating an experiment 5 times is enough to minimize the impact of undesired factors.

5. Experimental results with the dequeue rate estimation method

In the following, we first show the results obtained in terms of round-trip time, packet loss, PIE backlog and bytes inflight in the device transmission ring. Afterwards, we specifically analyze the behavior of PIE to explain the results shown and substantiate our claim that estimating the dequeue rate is not a correct approach when packets are subsequently stored in another buffer.

5.1. Network delay and queue specific parameters

We performed the experiments as described in the previous section and show in Figure 3 the results in terms of network round-trip time (3a), backlog of the PIE qdisc associated with NIC3 (3b), packets dropped by the PIE qdisc installed on NIC3 (3c) and bytes inflight in the transmission ring of NIC3 (3d). The empirical distribution of the set of samples collected for each quantity in each experiment is summarized by means of box plots. On each box, the central mark indicates the median, and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. For each experiment, we also collected the average throughput over time intervals of 100 ms, as measured by the iPerf3 server. The throughput (not shown) is very stable across the whole duration of every experiment and is approximately equal to 94 Mbps for all scenarios under test. We recall that, by default, the target PIE delay is set to 15 ms and PIE attempts to keep the queue size equal to the amount of bytes transmitted in the target delay at the bottleneck rate (which is about 180 KB if it correctly estimates the bottleneck rate of 100 Mbps).

With the Intel adapter using the e1000e driver, PIE exhibits the expected behavior only when BQL is enabled: the round-trip time is close to 15 ms, the backlog is rather limited (around 200 kB) and there are some packet drops, which are necessary to signal to the TCP sender the need to slow down. When BQL is disabled, the performance achieved by PIE is rather poor: the round-trip time is very high (about 100 ms), the backlog is quite large (over 800 kB) and there are no packet drops, as if no congestion were detected. With the Broadcom adapter using the tg3 driver, results are even worse, as the performance achieved by PIE is rather poor even when BQL is enabled: the RTT is around 40 ms, the backlog is quite large (around 400 kB) and some packet drops occur. When BQL is disabled, the performance is worse: the RTT is around 90 ms, the backlog is quite large (over 600 kB) and some packet drops occur.

We note that it is reasonable to expect an improved performance of AQM algorithms when BQL is enabled, because BQL reduces the amount of bytes stored in the transmission ring (see Figure 3d), and hence the additional contribution to the queuing delay, which is out of the control of AQM algorithms. In fact, for both adapters, the bytes inflight in the device transmission ring are limited to a few kilobytes when BQL is enabled, while they are about 350 kB when BQL is disabled. However, even with BQL enabled, the experienced network delay exceeds the target delay when the Broadcom adapter is used, which is quite unexpected. When BQL is disabled, instead, the network delay is very high. Additionally, PIE drops no packet when the Intel adapter is used (only a few when the Broadcom adapter is used), which means that PIE completely (or almost completely) fails to detect congestion.
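For reference, the 180 KB figure quoted above follows directly from the target delay and the bottleneck rate:

B_target = d_target · C = 0.015 s · (100 × 10^6 bit/s) / (8 bit/byte) = 187,500 bytes ≈ 180 KB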

Figure 3: Results of the experiments with the dequeue rate estimation method: (a) round-trip time of Echo Request/Reply messages, (b) backlog of the PIE qdisc associated with NIC3, (c) packets dropped by the PIE qdisc associated with NIC3, (d) bytes inflight in the NIC3 tx ring. PIE exhibits the expected behavior (it drops packets and keeps the qdisc backlog close to 180 KB, i.e., the amount of bytes transmitted in 15 ms (target delay) at 100 Mbps) only with the e1000e driver and with BQL enabled. In the other cases, the performance achieved by PIE is rather poor. Also, the impact of BQL on the performance differs from one adapter to another.

Figure 4: Histograms of the number of packets dequeued by the PIE qdisc in time intervals of 120 µs each, for (a) the e1000e driver and (b) the tg3 driver. PIE dequeues packets at a regular rate only with the e1000e driver and with BQL enabled. In the other cases, the dequeue rate is rather irregular, with bursts ranging from 10 to 30 packets.

5.2. Traffic control performance parameters

We then conducted a further investigation to understand the reasons for the poor performance exhibited by PIE. In particular, we looked at the times at which packets were dequeued by PIE. For this purpose, we modified the PIE dequeue function so that, every time a packet is dequeued, the current time is printed in the kernel logs. Then, we derived the histograms in Figure 4, showing the number of dequeue events occurring in time intervals of 120 µs each, which is the time required to transmit an MTU-sized packet (1500 B) at the outgoing link rate (100 Mbps). Figure 4a shows the case of the e1000e driver (Intel adapter), while Figure 4b shows the case of the tg3 driver (Broadcom adapter). With the e1000e driver, when BQL is enabled (which is the case in which PIE exhibits the expected behavior), the number of packets dequeued in a time interval of 120 µs normally ranges from 0 to 3, with the average being close to 1. This means that packets are dequeued by the PIE qdisc at a regular rate which basically corresponds to the rate at which packets are transmitted over the outgoing link (one packet every 120 µs). When BQL is disabled, instead, we observe that packets are dequeued in bursts of about 30 packets per single interval and then no other packet is dequeued in a number of subsequent intervals. With the tg3 driver, regardless of whether BQL is used or not, packets are dequeued in bursts of about 10 packets per single interval.

How should such a result be interpreted, and what impact does it have on the performance of PIE? We recall that a qdisc does not autonomously decide when a packet has to be dequeued; rather, it is the TC infrastructure that requests a qdisc to dequeue a number of packets consecutively, in agreement with the flow control mechanism we illustrated earlier. Given that the transmission ring acts as a buffer for the packets that need to be transmitted over the network, a burst of packets can be transferred at a high rate from the qdisc to the transmission ring, until the transmission ring is stopped by either the driver or BQL. The transmission ring stays stopped until enough room (the exact amount of which is driver-dependent) is made by removing packets that have been transferred to the NIC. During that time, no packet is dequeued from the qdisc. Such a mechanism explains why there are time intervals with a burst of packets dequeued and time intervals with no packet dequeued. The implication of this mechanism is that the rate at which packets are dequeued by the qdisc, which is what PIE attempts to estimate, can be much higher than the rate at which packets are transmitted over the network. In particular, if a measurement cycle begins and ends with two packets dequeued in the same burst, the estimated rate will be extremely high. As a consequence, PIE will believe that large backlogs can be dequeued in a small amount of time and hence it will not drop any packet. The result is the poor performance we have observed.
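The instrumentation mentioned above can be as simple as the following hypothetical addition to the dequeue path of the PIE qdisc (net/sched/sch_pie.c); the helper name and its placement are illustrative and do not reproduce the authors' actual patch.

#include <linux/ktime.h>
#include <linux/printk.h>

/* Illustrative instrumentation: log a monotonic timestamp (in ns) for every
 * packet leaving the PIE qdisc, so that inter-dequeue gaps can be binned
 * offline into 120 us intervals. */
static inline void pie_log_dequeue_time(void)
{
        pr_info("pie dequeue at %llu ns\n",
                (unsigned long long)ktime_get_ns());
}

In practice, trace_printk() or ftrace would be preferable to pr_info() to limit the logging overhead during the measurements.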


5.3. PIE specific performance parameters

To further substantiate our insight, we also traced the values of two internal variables of PIE: the duration of a measurement cycle (dtime) and the average dequeue rate (avgDqRate), which is a weighted average of the last estimated dequeue rate sample and the previous average dequeue rate value. Figures 5 and 6 show the empirical CDFs of the collected samples of dtime and avgDqRate, respectively, for both adapters.

Figure 5: ECDFs of PIE dtime. The duration of a measurement cycle is close to the expected one, i.e., about 1.3 ms, which is the time to transmit 16 KB (queue threshold) at 100 Mbps, only with the e1000e driver and when BQL is enabled. In the other cases, the values are lower than expected.

Figure 6: ECDFs of PIE avgDqRate. The average dequeue rate is close to the expected one, i.e., 100 Mbps, which is the bottleneck data rate, only with the e1000e driver and when BQL is enabled. In the other cases, the values are higher than expected.

When the Intel adapter is used and BQL is enabled, we already observed (Figure 4) that the PIE qdisc is requested to dequeue packets at intervals of 120 µs. This is due to the combination of the threshold used by the e1000e driver to determine when the transmission ring has to be restarted after being stopped and the limit dynamically computed by BQL to determine whether or not the transmission ring has to be stopped. Given that the default queue threshold used by PIE is 16 kB and the size of each packet is 1500 B, we expect that a measurement cycle spans a few time intervals of 120 µs. Figure 5 indeed shows that all the measurement cycles last at least 1 ms (and no more than 1.5 ms). Figure 6 shows that all the average dequeue rate samples are lower than about 13 MB per second (about 104 Mbps). When BQL is disabled, Figure 5 shows that about 80% of the samples of the measurement cycle duration are below 5·10^3 ns, i.e., 5 µs. In fact, Figure 6 shows that about 80% of the samples of the average dequeue rate are higher than about 3 GB per second (about 24 Gbps). Given that the link capacity is just 100 Mbps, it is clear that PIE believes it is able to dequeue packets at a much higher rate than the real one, hence it does not drop packets and the backlog accumulates in its queue. When the Broadcom adapter is used and BQL is enabled, we observe that about 80% of the samples of the measurement cycle duration are below about 1 ms. In fact, Figure 6 shows that about 80% of the samples are higher than about 40 MB per second (about 320 Mbps). When BQL is disabled, the performance is very similar to the case when BQL is enabled. We already observed (Figure 4) that, when using the Broadcom adapter, the dequeue rate is not highly affected by the usage of BQL. Given that the link capacity is just 100 Mbps, PIE believes it is able to dequeue packets at a rate higher than the real one, hence it drops fewer packets than necessary. With the Broadcom adapter, regardless of whether BQL is used or not, PIE behaves similarly to the Intel adapter with BQL disabled.

To conclude our insights, we derived the queue delay values estimated by PIE based on the average dequeue rate values. The queue delay (ratio of the queue length to the average dequeue rate) directly impacts the ability of PIE to detect a congestion (and hence to adjust its dropping probability accordingly). Results are reported in Figure 7. Basically, when the Intel adapter is used and BQL is enabled, the estimated delay is close to 15 ms and the algorithm correctly adapts the dropping probability to keep the RTT close to 15 ms. When BQL is disabled, the estimated delay is close to 0 ms and, since it is below the target of 15 ms, no packet is dropped. However, in this case the RTT is well above the target and close to 100 ms (Figure 3a). When the Broadcom adapter is used, the median values are about 20 ms in both cases (with a slightly lower value when BQL is enabled). However, the deviation of the queue delay samples is quite high (the 25th percentile is about 5 ms and the 75th percentile is about 40 ms). When the estimated queue delay is high, packets are dropped; when it is low, the backlog grows. As a result, the RTT is not properly kept below the target delay (Figure 3a).

Figure 7: Queue delay estimated by PIE with the dequeue rate method. The values are close to the target, i.e., 15 ms, only with the e1000e driver and with BQL enabled, while in the other cases the values are 0 or highly variable.

We have shown in Figure 4 that the rate at which packets are dequeued by a qdisc may be rather irregular over time. As explained, the reason is that the qdisc is requested to dequeue packets in the interval between the time the transmission ring is restarted and the time the transmission ring is stopped, after which the qdisc is prevented from dequeuing packets until the transmission ring is restarted again. Thus, the rate at which packets are dequeued by a qdisc depends on how many packets are needed for the driver or BQL to stop a transmission ring that has just been restarted. This, in turn, depends on a number of factors, including the threshold used by the device driver to determine whether the transmission ring can be restarted after being stopped, whether BQL is enabled or disabled, the way DMA transfers are managed by the device driver, etc. None of such factors can be controlled by a qdisc, unfortunately.

Is there anything a qdisc can do to get an estimate of the dequeue rate that is consistently close to the link capacity, independently of such factors? An intuitive answer may be to try to increase the duration of the measurement cycles in PIE, in the attempt to have them start and end with two packets belonging to different bursts. In this way, the dequeue rate is averaged over multiple bursts, likely resulting in a value close to the actual link rate. However, a measurement cycle is terminated when the amount of bytes dequeued since the beginning of the cycle exceeds a configurable queue threshold. Hence, the latter needs to be tuned in order to control the duration of a measurement cycle, and it is difficult to find a value exceeding the burst size in bytes. When BQL is disabled, a burst is made of a number of packets equal to the difference between the transmission ring size and the threshold to restart the ring; given that packets may have variable sizes, the burst size in bytes is variable as well. When BQL is enabled, even though the limit computed by BQL is measured in bytes, the limit itself is variable and dynamically computed by BQL. On top of this, we need to consider that selecting too large values for the queue threshold has the disadvantage that the estimation technique becomes less responsive to link rate variations.
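To make the effect of intra-burst measurement cycles concrete, the following self-contained C snippet computes the dequeue rate that PIE would estimate in two illustrative cases, chosen only to match the orders of magnitude reported above: a cycle spread over regular 120 µs dequeues versus a cycle completed inside a single burst.

#include <stdio.h>

/* Dequeue rate as computed at the end of a PIE measurement cycle. */
static double cycle_rate_bps(double bytes, double duration_s)
{
        return bytes / duration_s;           /* bytes per second */
}

int main(void)
{
        const double threshold = 16384.0;    /* 16 kB queue threshold */

        /* Regular dequeues: ~11 MTU-sized packets, one every 120 us. */
        double regular = cycle_rate_bps(threshold, 11 * 120e-6);

        /* Burst dequeues: the same 16 kB handed to the tx ring in ~5 us. */
        double burst = cycle_rate_bps(threshold, 5e-6);

        printf("regular cycle: %.1f Mbps\n", regular * 8 / 1e6);   /* ~99 Mbps */
        printf("burst cycle:   %.1f Gbps\n", burst * 8 / 1e9);     /* ~26 Gbps */
        return 0;
}

The first value is close to the actual bottleneck rate, whereas the second is orders of magnitude above it, which is why PIE concludes that the backlog can be drained almost instantaneously and stops dropping packets.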

6. Revisiting the design of PIE to address performance anomalies: a timestamp based version

In this section, we revisit the design of PIE and propose a version of PIE adopting a timestamp based approach to estimate the queuing delay. The proposed version of PIE is in line with the provisions in RFC 8033 relating to the possibility of exploring alternative methods to estimate the queuing delay. The motivations behind the proposed version of PIE have been presented throughout this paper. We illustrated the design choices of PIE in Section 3 and provided the necessary background to understand how PIE operates in Linux based systems by describing the Linux Traffic Control infrastructure. We highlighted possible performance anomalies due to the additional queuing in the device buffer and, following the methodology described in Section 4, experimentally proved our claims (Section 5). PIE ends up estimating the rate at which packets are passed to the device buffer, rather than the rate at which they are transmitted by the device. The rate at which packets are passed to the device buffer is determined by the flow control, which in turn depends on the techniques and optimizations adopted by the network device driver in transferring packets to the device. Hence, as shown by our experimental results, the estimated dequeue rate varies from device to device.

Since we proved that the reported performance anomalies can be ascribed to the incorrect estimation of the queue delay caused by the mismatch between the estimated dequeue rate and the actual transmission rate, we decided to replace the departure rate component of the original design of PIE with a new component that directly measures the queuing delay. We followed an approach based on timestamps to estimate the packet sojourn time, i.e., the time spent by a packet in the queue of PIE. The pseudocode of the proposed PIE version is reported in Algorithm 4. When PIE is requested to enqueue a packet and such packet is not dropped, PIE attaches the current time (enqueue_time) to the packet before enqueuing it. When PIE is requested to dequeue a packet, PIE updates the current estimate of the queuing delay (cur_del) with the sojourn time of the dequeued packet. Every time interval of length Tupdate, the probability p and the other parameters are updated based on the current queuing delay, as determined by the most recent dequeue operation.

By directly measuring the queuing delay (as opposed to indirectly estimating it through the queue size and the dequeue rate), we expect that the proposed version of PIE will perform approximately the same with different network devices. The queuing delay is no longer dependent on the dequeue rate, which is heavily affected by how the network device driver transfers packets to the device and handles the start/stop status of the transmission ring. We modified the Linux implementation of PIE to support the timestamp based approach described in this section. The next section presents the results of the experiments conducted to evaluate the performance of the proposed version of PIE.

Algorithm 4: Pseudocode of the PIE algorithm with the timestamp method.

ENQUEUE(packet)
 1  randomly drop packet with probability p
 2  if (packet has not been dropped)
 3      then attach enqueue_time to packet
 4           enqueue packet

where "randomly drop packet with probability p" expands to:
 1  rnd ← random_number()
 2  if (rnd < p)
 3      then drop packet
 4  else enqueue packet

DEQUEUE()
 1  dequeue packet
 2  read enqueue_time from packet
 3  cur_del ← now − enqueue_time
 4  return packet

Every Tupdate milliseconds
 1  p ← p + α · (cur_del − ref_del) + β · (cur_del − old_del)
 2  old_del ← cur_del
 3  adapt α and β values based on the value of p
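The following is a compact user-space C sketch of the timestamp based estimator described above; it is not the authors' kernel patch. In the Linux qdisc the enqueue time would be stored alongside the packet, while here a plain FIFO of timestamps stands in for the queue, and the α/β values are illustrative placeholders.

#include <time.h>

#define QUEUE_CAP 1024
#define REF_DEL   0.015            /* s, target queue delay (15 ms) */

/* FIFO of enqueue timestamps standing in for the PIE queue. */
static double tstamp[QUEUE_CAP];
static int head, tail, count;

static double p, cur_del, old_del;
static const double alpha = 0.125, beta = 1.25;   /* illustrative gains */

static double now_s(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* ENQUEUE: record the arrival time of the packet (drop logic omitted). */
void enqueue(void)
{
        if (count == QUEUE_CAP)
                return;                          /* tail drop */
        tstamp[tail] = now_s();
        tail = (tail + 1) % QUEUE_CAP;
        count++;
}

/* DEQUEUE: the sojourn time of the departing packet is the delay sample. */
void dequeue(void)
{
        if (count == 0)
                return;
        cur_del = now_s() - tstamp[head];        /* measured queuing delay */
        head = (head + 1) % QUEUE_CAP;
        count--;
}

/* Every Tupdate: same PI control law as before, fed by the measured delay. */
void on_timer(void)
{
        p += alpha * (cur_del - REF_DEL) + beta * (cur_del - old_del);
        if (p < 0.0) p = 0.0;
        if (p > 1.0) p = 1.0;
        old_del = cur_del;
}

Note how the delay sample no longer depends on any rate estimate: whatever the pattern of dequeue requests imposed by the driver or BQL, the sojourn time of each packet reflects the real queuing time.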


7. Experimental results with the timestamp method

We repeated the experimental campaign described in Section 4 to evaluate the Linux implementation of the timestamp based version of PIE. Figure 8 shows the results in terms of network round-trip time (8a), backlog of the PIE qdisc associated with NIC3 (8b), packets dropped by the PIE qdisc associated with NIC3 (8c) and bytes inflight in the device transmission ring of NIC3 (8d). The throughput reported by the iPerf3 server is very stable across the whole duration of every experiment, at about 94 Mbps. We recall that, due to the removal of the departure rate component, we expect PIE to perform similarly on both network adapters. Also, we expect that using BQL improves the performance of the timestamp based version of PIE with respect to the case in which BQL is not enabled. The queuing time in the transmission ring is still out of the control of the PIE qdisc and adds up to the queuing time in the PIE queue. BQL allows to decrease the amount of bytes stored in the transmission ring and hence the additional queuing time therein. Therefore, we expect the RTT to be around the target delay for both adapters when BQL is used. Also, we expect that, even if BQL is not used, PIE is able to detect the congestion and drop some packets.

With BQL enabled, PIE exhibits a very similar behavior on both adapters: the round-trip time is close to 15 ms, the backlog is kept limited by PIE (about 200 kB) and some packets are dropped by PIE. Likewise, with BQL disabled, PIE exhibits a very similar behavior on both adapters. As expected, the round-trip time (45 ms) is higher than the target delay, because the queuing time in the transmission ring adds up to the queuing time in the PIE queue. When BQL is disabled, the amount of bytes inflight in the device transmission ring is about 350 kB. We observe that the PIE backlog is quite similar to the case with BQL enabled (about 200 kB), which shows that the PIE algorithm correctly keeps a queue size that guarantees to meet the target delay (it takes 16 ms to transmit 200 kB at a rate of 100 Mbps). This is further proved by the fact that some packets are dropped. The fact that the number of packets dropped when BQL is disabled is lower than when BQL is enabled can likely be explained by considering that the higher RTT due to the additional queuing time in the transmission ring leads the TCP congestion control to send packets at a slightly slower rate, which results in a less frequent need to drop packets to keep the desired queue size.

In order to complete our evaluation of the timestamp based version of PIE, we show the queue delay as calculated by PIE in Figure 9. Regardless of whether BQL is used or not, PIE calculates a queue delay of 15 ms on both network adapters and therefore adjusts the dropping probability based on the real queuing time experienced by packets.

Figure 8: Results of the experiments with the timestamp method: (a) round-trip time of Echo Request/Reply messages, (b) backlog of the PIE qdisc associated with NIC3, (c) packets dropped by the PIE qdisc associated with NIC3, (d) bytes inflight in the NIC3 tx ring. PIE exhibits the expected behavior in all cases and behaves similarly on both network adapters, with BQL disabled or enabled.

Figure 9: Queue delay estimated by PIE with the timestamp method. The values are close to the target, i.e., 15 ms, with BQL disabled or enabled and on both adapters.

8. Conclusions

In this work, we highlighted a potential design flaw in AQM algorithms that simulations are typically not able to disclose. Namely, we showed that AQM algorithms managing their queues based on an estimate of the dequeue rate may exhibit poor performance if dequeued packets are subsequently stored in another buffer, which is the case in real operating systems such as Linux. In short, the reason is that they estimate the rate at which packets are dequeued from the queue, rather than the rate at which packets are actually transmitted over the network. The difference between the two rates may be considerable, as packets can be dequeued from the qdisc at a very high rate because they are subsequently buffered by the network device driver. The very high dequeue rate leads the AQM algorithm to underestimate the queuing delay and ultimately not to detect the occurrence of a congestion, thus performing poorly. Moreover, we showed that queue delay based AQM algorithms exhibit device driver dependent behavior. Additionally, on some adapters, such algorithms may strongly require the use of BQL.

Then, we revisited the design of PIE, a queue delay based AQM algorithm widely considered for network deployments, by introducing the direct measurement of the queuing time through the use of timestamps. Following the revisited design, we modified the Linux implementation of PIE and assessed the improved performance of the timestamp based version of PIE through experiments. Results also showed that the behavior of PIE is not affected by how network device drivers handle operations like the transfer of packets to the device or the start/stop of the transmission ring. To the best of our knowledge, this paper is the first one that disclosed, thoroughly analyzed and addressed the issues arising from using the dequeue rate to drive the calculation of the dropping rate done by AQM algorithms. We believe that our study will bring more awareness in the research community about the potential issues that may arise when deploying AQM algorithms.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] J. Gettys and K. Nichols, "Bufferbloat: Dark buffers in the Internet," Communications of the ACM, vol. 55, no. 1, pp. 57–65, 2012.
[2] S. Floyd and V. Jacobson, "Random early detection gateways for congestion avoidance," IEEE/ACM Transactions on Networking, vol. 1, no. 4, pp. 397–413, 1993.
[3] W. Feng, K. G. Shin, D. D. Kandlur, and D. Saha, "The BLUE Active Queue Management Algorithms," IEEE/ACM Transactions on Networking (ToN), vol. 10, no. 4, pp. 513–528, 2002.
[4] C. V. Hollot, V. Misra, D. Towsley, and W. Gong, "On designing improved controllers for AQM routers supporting TCP flows," in INFOCOM, Twentieth Annual Joint Conference of the IEEE Computer and Communications Societies Proceedings, vol. 3. IEEE, 2001, pp. 1726–1734.
[5] K. Nichols and V. Jacobson, "Controlling Queue Delay," Communications of the ACM, vol. 55, no. 7, pp. 42–50, 2012.
[6] R. Pan, P. Natarajan, C. Piglione, M. S. Prabhu, V. Subramanian, F. Baker, and B. VerSteeg, "PIE: A lightweight control scheme to address the bufferbloat problem," in IEEE 14th International Conference on High Performance Switching and Routing (HPSR). IEEE, 2013, pp. 148–155.
[7] K. De Schepper, O. Bondarenko, I. Tsang, and B. Briscoe, "PI2: A Linearized AQM for both Classic and Scalable TCP," in Proceedings of the 12th International Conference on emerging Networking EXperiments and Technologies. ACM, 2016, pp. 105–119.
[8] G. White and R. Pan, "Active Queue Management (AQM) Based on Proportional Integral Controller Enhanced (PIE) for Data-Over-Cable Service Interface Specifications (DOCSIS) Cable Modems," RFC 8034, 2017.
[9] D. A. Alwahab and S. Laki, "A simulation-based survey of active queue management algorithms," in Proceedings of the 6th International Conference on Communications and Broadband Networking, ser. ICCBN 2018. New York, NY, USA: ACM, 2018, pp. 71–77. [Online]. Available: http://doi.acm.org/10.1145/3193092.3193106
[10] A. Deepak, K. S. Shravya, and M. P. Tahiliani, "Design and implementation of AQM evaluation suite for ns-3," in Proceedings of the Workshop on ns-3, ser. WNS3 '17. New York, NY, USA: ACM, 2017, pp. 87–94. [Online]. Available: http://doi.acm.org/10.1145/3067665.3067674
[11] P. Imputato and S. Avallone, "An analysis of the impact of network device buffers on packet schedulers through experiments and simulations," Simulation Modelling Practice and Theory, vol. 80, no. Supplement C, pp. 1–18, 2018.
[12] T. Høiland-Jørgensen, P. Hurtig, and A. Brunstrom, "The Good, the Bad and the WiFi: Modern AQMs in a residential setting," Computer Networks, vol. 89, pp. 90–106, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1389128615002479
[13] N. Khademi, D. Ros, and M. Welzl, "The new AQM kids on the block: An experimental evaluation of CoDel and PIE," in 2014 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS).
[14] Y. Guo, F. Qian, Q. A. Chen, Z. M. Mao, and S. Sen, "Understanding on-device bufferbloat for cellular upload," in Proceedings of the 2016 Internet Measurement Conference, ser. IMC '16. New York, NY, USA: ACM, 2016, pp. 303–317. [Online]. Available: http://doi.acm.org/10.1145/2987443.2987490
[15] T. Høiland-Jørgensen, M. Kazior, D. Täht, P. Hurtig, and A. Brunstrom, "Ending the anomaly: Achieving low latency and airtime fairness in WiFi," in Proceedings of the 2017 USENIX Annual Technical Conference, ser. USENIX ATC '17. Berkeley, CA, USA: USENIX Association, 2017, pp. 139–151. [Online]. Available: http://dl.acm.org/citation.cfm?id=3154690.3154704
[16] R. Kundel, J. Blendin, T. Viernickel, B. Koldehofe, and R. Steinmetz, "P4-CoDel: Active queue management in programmable data planes," in 2018 IEEE Conference on Network Function Virtualization and Software Defined Networks (NFV-SDN), Nov 2018, pp. 1–4.
[17] B. Briscoe, "Review: Proportional Integral controller Enhanced (PIE) Active Queue Management (AQM) [PNB+15]," Report, 2015.
[18] G. White, "Active queue management in DOCSIS 3.1 networks," IEEE Communications Magazine, vol. 53, no. 3, pp. 126–132, March 2015.
[19] R. Pan, P. Natarajan, F. Baker, and G. White, "Proportional Integral Controller Enhanced (PIE): A Lightweight Control Scheme to Address the Bufferbloat Problem," IETF, RFC 8033, February 2017.
[20] N. Kuhn, P. Natarajan, N. Khademi, and D. Ros, "Characterization guidelines for Active Queue Management (AQM)," Internet Requests for Comments, RFC Editor, RFC 7928, July 2016.
[21] A. Tirumala, F. Qin, J. Dugan, J. Ferguson, and K. Gibbs, "iPerf," 2006. [Online]. Available: https://iperf.fr/


Biographies

Pasquale Imputato received the M.Sc. and Ph.D. degrees from the University of Napoli Federico II in 2015 and 2019, respectively. He is currently a research fellow at the Department of Computer Engineering of the University of Napoli. He was a visiting researcher at the Centre Tecnològic de Telecomunicacions de Catalunya (2017-2018). His research interests include wireless networks and the bufferbloat problem.

Stefano Avallone received the M.Sc. and Ph.D. degrees from the University of Napoli Federico II in 2001 and 2005, respectively. He is currently an Associate Professor with the Department of Computer Engineering at the University of Napoli. He was a visiting researcher at the Delft University of Technology (2003-04) and at the Georgia Institute of Technology (2005). He is on the editorial board of Elsevier Ad Hoc Networks and the technical committee of Elsevier Computer Communications. His research interests include wireless mesh networks, 4G/5G networks and the bufferbloat problem.

Mohit P. Tahiliani received the Ph.D. degree from the National Institute of Technology Karnataka (NITK), Surathkal, India. He is currently an Assistant Professor with the Department of Computer Science and Engineering at NITK Surathkal. He is on the advisory board of the ns-3 Consortium and the technical program committee of the Workshop on ns-3 (WNS3). He served as a General Chair for the 10th edition of WNS3 held at NITK Surathkal in 2018. His research interests include congestion control algorithms, active queue management, traffic engineering, network protocol analysis, network experimentation, fast packet processing techniques and efficient NFV deployments.

Gautam Ramakrishnan is a senior year undergraduate student at the National Institute of Technology Karnataka (NITK), Surathkal, India. He is pursuing a degree in Computer Science and Engineering; his research interests include Computer Networks, Computer Architecture and Operating Systems. He currently focuses on Active Queue Management mechanisms.
