Deadline and Incast Aware TCP for cloud data center networks

Jaehyun Hwang a, Joon Yoo b,*, Nakjung Choi a

a Bell Labs, Alcatel-Lucent, 7fl. DMC R&D Center, 1649 Sangam-dong, Mapo-gu, Seoul 121-904, Republic of Korea
b Department of Software Design & Management, Gachon University, Bokjeong-dong, Sujeong-gu, Seongnam-si, Gyeonggi-do 461-701, Republic of Korea

Article history: Received 15 May 2013; Received in revised form 1 November 2013; Accepted 4 December 2013; Available online xxxx

Keywords: Cloud data center networks; Partition/Aggregate pattern; Deadline awareness; Incast congestion

Abstract

Nowadays, cloud data centers have become a key resource for providing a plethora of rich online services such as social networking and cloud computing. Cloud data center applications typically follow the Partition/Aggregate traffic pattern based on a tree-like logical topology, where an aggregator node may gather response data from thousands of worker nodes. One of the key challenges for such applications, however, is to meet their soft real-time constraints. In this paper, we introduce the design and implementation of DIATCP, a new transport protocol that is both deadline-aware and incast-avoidable for cloud data center applications. Prior work achieves deadline awareness through host-based or network-based approaches, but these are either imperfect in meeting their deadlines or have weaknesses in practical deployment. In contrast, DIATCP is deployed only at the aggregator, which directly controls the peers' sending rates to avoid incast congestion and, more importantly, to meet the application deadline. This design rests on the key observation that, under the Partition/Aggregate traffic pattern, the aggregator knows the bottleneck link status as well as its workers' information. Through detailed ns-3 simulations and real testbed experiments, we show that DIATCP significantly outperforms the previous protocols in the cloud data center environment.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Modern-day cloud data centers are steadily progressing as a key resource for providing many online services, including web search, social networking, cloud computing, financial services, and recommendation systems. These services generally need to deliver results to end-users in a timely manner, so they often have soft real-time constraints that originate from service level agreements (SLAs), which greatly affect user experience and service provider revenue [1–3]. A major portion of the delay is caused by intra-data center communication, which must be kept efficient, as most of the computing and storage resources are located inside the local data centers.

* Corresponding author. Tel.: +82 31 750 5832. E-mail addresses: [email protected] (J. Hwang), joon.[email protected] (J. Yoo), [email protected] (N. Choi).

Unfortunately, it has been reported that the conventional transport protocol, TCP, does not work well in this setting; it forms a communication bottleneck in commercial data centers and, as a result, causes serious performance degradation [4,5,1]. There are several reasons for this. First, cloud data center services often follow the Partition/Aggregate traffic pattern based on a tree-like logical topology, where typically thousands of servers participate to achieve high performance. Here, the numerous servers, called workers, send their response data to a single point called the aggregator, which causes a burst of traffic at the aggregator. Second, the Top-of-the-Rack (ToR) switches to which the aggregators are connected are shallow-buffered, normally having only a 3–4 MB shared packet buffer memory.


Sometimes this shallow buffer size is not enough to handle such a burst of traffic, resulting in buffer overflow, which is called incast congestion. Third, the retransmission timeout used to detect the incast congestion (i.e., packet losses) is too long, as TCP is fundamentally designed for wide area networks. For example, RTOmin is generally set to 200–300 ms, but the actual round-trip times (RTTs) are only hundreds of μs in data center networks. In summary, legacy TCP suffers from incast congestion, showing low aggregate goodput and long query completion times.

Most of the prior work [6,1,7–10] has mainly focused on reducing the flow completion time of the tail.1 In order for the application to meet the deadline defined by its SLA, however, all the flows between the aggregator and its workers should be completed within the application deadline. Existing fair-share based solutions like DCTCP [1] are oblivious to the application deadline, so they may not meet the application deadlines even though they reduce the overall flow completion time. Some recent work such as D3 [11] and D2TCP [12] introduces deadline awareness in designing data center protocols; these protocols try to allocate differentiated bandwidth based on the flow size and the deadline. However, they are either impractical to deploy (D3) or imperfect in meeting the application deadline (D2TCP).

In this paper, we propose a new Deadline and Incast Aware TCP, called DIATCP, which efficiently meets the deadline requirements of online data center applications. Our main contribution is a new design approach that not only overcomes, but also leverages, the Partition/Aggregate traffic pattern of data center applications. While recent solutions for data center networks take the form of either host-based or network-based approaches, DIATCP operates at the aggregator, a so-called aggregator-based approach, which performs admission and flow control to achieve our goal. The key observation is that the aggregator can effectively obtain rich information such as the bottleneck link bandwidth and the workers' flow information, including data sizes and deadlines. Therefore, DIATCP does not require any support from the network switches – implementation and deployment are easy. Further, the incoming network traffic to the aggregator can be managed very efficiently – traffic control is centralized.

We conduct detailed evaluations through ns-3 simulations to compare DIATCP with previous solutions such as DCTCP [1] and D2TCP [12]. We have also built a prototype of the DIATCP algorithm on the Linux kernel 2.6.38 and evaluated it on our real data center testbed, which consists of 46 servers, 3 ToR switches, and one Aggregation switch. The evaluation results of DIATCP are:

• There are no missed deadlines under the real trace-based simulations and very few missed deadlines (under 1%) in the testbed experiments.
• There are no TCP timeouts in either the simulations or the testbed experiments.

• Even for non-deadline flows that do not follow the Partition/Aggregate pattern, the aggregate goodput is comparable.
• For the prototype implementation, only a small modification (about 265 lines) at the aggregator is required. In contrast, the network-based approaches require high-cost custom hardware chips (D3 [11], DeTail [9], and PDQ [13]).
• Flow quenching is supported (unlike D2TCP [12]).
• ECN support is unnecessary (unlike DCTCP [1] and D2TCP [12]).
• No priority inversion is incurred (unlike D3 [11]).

The rest of the paper is organized as follows. In Section 2, we briefly review today's data center communications and describe the previous deadline-aware approaches, followed by our motivation. Section 3 explains our DIATCP algorithm in detail. In Section 4, we present our experimental results obtained with the ns-3 simulator and our data center testbed. We discuss some design issues in Section 5, and Section 6 presents related work. Finally, we conclude the paper in Section 7.

2. Overview

We briefly review today's data center communications and the major problems that occur at the transport layer. To motivate our design, we explain some of the existing approaches and their limitations. Finally, we describe the aggregator-based approach that is the key idea of our new proposal.

2.1. Data center applications and communications

Partition/Aggregate traffic pattern: It is known that cloud data center applications these days often follow the Partition/Aggregate traffic pattern, shown in Fig. 1. Such applications include web search [1], MapReduce [14], Dryad [15], social networking [16], and recommendation systems [17]. Under this traffic pattern, a user request is first sent to the top-level aggregator and then partitioned into several pieces that are distributed to the lower-level aggregators. The request finally arrives at the worker nodes via the last-level aggregators in the data center network. If there is an SLA between the service provider and its users, the final results should be delivered to the end-users within the SLA, typically 200–300 ms within a data center [17,1].

1 In general, all response data from the workers need to be aggregated to produce a meaningful result, so the query completion time is determined by the most congested connection, i.e., the tail.

Fig. 1. Example of the Partition/Aggregate design pattern with different response deadlines at each layer (e.g., 300 ms at the root aggregator, 100 ms at the mid-level aggregators, and 40 ms at the workers).


So the aggregator at each level needs to aggregate the results from its child nodes within an intermediate deadline, which must be earlier than the final deadline (i.e., the SLA). Note that response data from child nodes that miss the deadline are discarded, which significantly degrades the quality of the results and hence the operator revenue. For example, an extra 100 ms of latency at Amazon.com is estimated to result in a 1% drop in sales [3]. Similarly, an extra latency of 500 ms in Google's search drops traffic by 20% [3], and Yahoo! sees a 5–9% drop in traffic with 400 ms of added latency [2]. Therefore, it is very important to meet the deadlines in today's data center communications. In this paper, we deal with 20–50 ms deadlines, as we focus only on the last-hop communication between the last-level aggregator and the workers.

Data center traffic: According to [1], the data center workload can be categorized into three types: bursty query traffic, short message flows, and large flows. The query traffic consists of latency-critical flows, as it follows the Partition/Aggregate pattern, and its flow size is only a few KB (e.g., 2 KB). The short message flows are normally used to update control state on the workers and are also time-sensitive; their sizes are between 50 KB and 1 MB. Lastly, there are a few large background flows (the median number of concurrent large flows is 1), whose sizes are 1–50 MB. The main role of these long flows is to copy new or updated data to the workers, so they are typically throughput-driven and have no deadlines.

2.2. Previous deadline-aware approaches

There are several congestion control protocols that aim to meet the real-time application constraints in data centers. The design philosophy behind these approaches has been inherited from two methodologies: (i) network-based approaches, which are generally implemented at the switches, and (ii) host-based approaches, which are performed at the end-host side. Our main insight, however, is that we can conceptually take a hybrid approach by leveraging the communication pattern of the data center applications. In particular, the Partition/Aggregate traffic pattern enables the aggregator to access both the bottleneck link information and the peer information. We compare the two existing approaches first, and then discuss our new approach in the next subsection.

Network-based approach: In the network-based approach, the network-side elements such as switches and routers are regarded as centralized components, since they can directly monitor the network conditions. Furthermore, they can collect the flow information and the flows' real-time requirements (e.g., deadlines), so it is possible to allocate guaranteed bandwidth to each flow. Well-known examples that behave in this way are XCP [18] and D3 [11]. In addition, suppose the congestion at the bottleneck link is so severe that its capacity cannot support finishing all flows before their deadlines. It is then preferable to quench some of the flows so that the remaining flows can meet their deadlines [11]. The network switch can also support this flow quenching by adequately conducting the total bandwidth allocation.


In summary, the network-based approach generally performs well because it is based on accurate network information, but its weak point is always practical deployment: it requires costly upgrades to the network infrastructure, such as the routers. For instance, D3 needs to change almost every network element, including switches, end-hosts, and applications [11].

Host-based approach: The main advantage of the host-based approach is that it is relatively easy and simple to implement and deploy; it can be realized by modifying each end-host via a simple software upgrade. However, the host-based approach still needs feedback information from the network, since the end-hosts determine their actions based on the network congestion level. One key feedback signal is the round-trip time (RTT), as the queuing delay included in the RTT directly reflects the network congestion. Unfortunately, it has been reported that RTT measurement in data center networks is not that reliable [1,7]. The RTT is normally hundreds of μs in data center networks, so it can fluctuate severely with even a small amount of computing/scheduling delay at the server machines. Thus host-based approaches such as DCTCP actively utilize Explicit Congestion Notification (ECN) [19] in their congestion control algorithms; ECN is employed so that the congestion information is fed back to the end host. Further, flow quenching cannot be supported in host-based approaches such as D2TCP, since the end-hosts cannot determine when they should be paused or stopped without help from a centralized controller. Most importantly, the host-based approach provides some level of deadline awareness, but does not guarantee that the flows meet their deadlines. This is the major drawback compared to the network-based approach.

2.3. Aggregator-based approach

As reported in previous studies [6,1,20,11], one of the main bottleneck points in data center communications is at the ToR switch, specifically, the ToR switch interface to the aggregator. This occurs because the numerous workers try to send their response data to the same aggregator through the shallow-buffered ToR switch at the same time. We assume that the link capacity between the aggregator and the ToR switch is known in advance, as its typical value is 1 Gbps or 10 Gbps in today's data center environments. We also assume that the aggregator keeps track of the traffic information from all its peers. This is a realistic assumption because, in the Partition/Aggregate traffic pattern, a request is sent to the workers through the aggregator. So the aggregator is capable of managing the peer traffic information, and hence the deadline information is also available. In other words, even though the aggregator may be an end-host, it can still acquire this rich information, which is traditionally viewed as available only to network nodes such as switches or routers. Therefore, we can combine the advantages of the two aforementioned approaches: (i) no support is required from the switches – implementation and deployment are easy, and (ii) the incoming network traffic to the aggregator can be managed effectively given the bottleneck link capacity – traffic control is centralized.


2.4. Effect of incast congestion

Existing deadline-aware protocols like D3 and D2TCP aim only to meet the flow deadlines. Our insight is that incast congestion can severely damage the performance of online applications in cloud data center networks; it increases the network delay as well as the query completion time, so a deadline-aware algorithm may not work properly in heavily congested situations. For instance, D2TCP introduces a deadline factor into DCTCP's congestion control algorithm, but DCTCP is not a perfect solution for avoiding incast congestion.2 This implies that D2TCP is not a robust scheme in congested network situations.3 Moreover, if RTOmin is large, the application deadline can be missed with even a single TCP timeout. Furthermore, we found that some loose application deadlines can be met easily even without deadline-awareness features if the incast congestion is avoided. DeTail [9] takes a similar approach; it reduces the flow completion time tail by performing traffic engineering, even though it does not support deadline awareness. Therefore, it is essential to take incast congestion as well as deadline awareness into account in order to achieve our ultimate goal.

2 DCTCP suffers from incast congestion when the number of workers exceeds 35 in their experimental environment [1].
3 D2TCP still incurs missed deadlines under heavily congested conditions [12].

3. Deadline and Incast Aware TCP (DIATCP)

DIATCP consists of two functions – incast awareness and deadline awareness. We first give the design overview of DIATCP, and then explain the detailed algorithms. Lastly, we discuss some design issues, such as how to choose important parameters like Gwnd in practice.

3.1. Design overview

We categorize the flows in data center networks into non-deadline flows, which have no specific deadline on their flow completion time, and deadline flows, which are supposed to be completed within a specific deadline. Such deadlines are generally defined by applications in user space and delivered to the TCP layer via existing socket APIs such as setsockopt(). We therefore assume that the application deadlines and response data sizes are given to the TCP layer before the data transfer begins. Note again that the key idea of DIATCP is that the aggregator node can monitor all the incoming traffic to itself. Therefore, the DIATCP algorithm works only on the incoming traffic from peers to the aggregator. The outgoing traffic from the aggregator can be managed by the destination nodes of that traffic, so we do not consider it in this paper. Thus, the response data size means the amount of data that the peer transmits to the aggregator.
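As a sketch of this interface, the snippet below shows how a worker application might hand its deadline and response size to the transport layer before transmitting. The option names TCP_DIATCP_DEADLINE and TCP_DIATCP_SIZE are hypothetical placeholders for illustration; the paper only states that existing APIs such as setsockopt() carry this information.

```python
import socket

# Hypothetical option numbers; DIATCP's real kernel patch is not shown in the paper.
TCP_DIATCP_DEADLINE = 0x10  # assumed: remaining deadline in milliseconds
TCP_DIATCP_SIZE = 0x11      # assumed: response data size in bytes

def register_deadline_flow(sock, deadline_ms, response_bytes):
    """Pass the application deadline and response size down to the TCP layer."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_DIATCP_DEADLINE, deadline_ms)
    sock.setsockopt(socket.IPPROTO_TCP, TCP_DIATCP_SIZE, response_bytes)

# Example: a worker that must deliver a 10 KB response within 40 ms.
# sock = socket.create_connection(("aggregator.example", 5001))
# register_deadline_flow(sock, deadline_ms=40, response_bytes=10 * 1024)
```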

Deadline and Incast Aware algorithm: Our algorithm is motivated by IA-TCP [8], which is designed to operate at the aggregator side by means of TCP acknowledgement (ACK) regulation to control the TCP data sending rate of the workers. DIATCP also adopts the ACK regulation scheme for similar purposes. We utilize the advertisement window field4 in the TCP ACK header to allocate a specific window size to each peer. By doing this, we control the total amount of traffic so as not to overflow the bottleneck link – details are explained in Section 3.2.

Algorithm 1 presents the overall DIATCP algorithm employed at the aggregator. First, we represent each application connection by an abstract node, which includes information such as the application deadline, data size, and allocated DIATCP window size (Fig. 2). A node is inserted into the Connection list whenever a new connection is created; when a connection is closed, the node is deleted from the list (lines 2 and 6). Thereafter, we update the allocated window sizes whenever there is a change in the Connection list (lines 3 and 7). For this, we develop a new Global window allocation scheme that allocates the Global window based on the deadline and data size – details are explained in Section 3.3. Lastly, by accessing each node's information, the advertisement window in the ACK header is set to the allocated window size (line 10).

Algorithm 1. DIATCP algorithm at the aggregator
1: On creating a new connection:
2:   Insert a node to the Connection list
3:   Call Global_window_allocation()
4:
5: On closing a connection:
6:   Delete a node from the Connection list
7:   Call Global_window_allocation()
8:
9: Sending an ACK:
10:  Set Advertisement window to allocated_window
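A minimal user-space sketch of Algorithm 1, assuming a per-connection node object and a stubbed-out Global window allocation (sketched in Section 3.3); the names are ours, and the real implementation lives in the kernel's ACK path.

```python
class Node:
    """Per-connection state kept by the aggregator (cf. Fig. 2)."""
    def __init__(self, deadline_ms, size_bytes):
        self.deadline = deadline_ms  # 0 denotes a non-deadline flow
        self.size = size_bytes       # remaining response data in bytes
        self.win = 0                 # window (in packets) set by the allocator

connection_list = []                 # kept in EDF order (see below)

def global_window_allocation(conns):
    pass                             # placeholder; see the sketch in Section 3.3

def on_new_connection(node):         # Algorithm 1, lines 1-3
    connection_list.append(node)
    global_window_allocation(connection_list)

def on_close_connection(node):       # Algorithm 1, lines 5-7
    connection_list.remove(node)
    global_window_allocation(connection_list)

def advertised_window_bytes(node, mss=1500):
    # Algorithm 1, line 10: the ACK advertises the allocated window,
    # converted to bytes as carried in the TCP header.
    return node.win * mss
```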

The priority order of the Connection list may be determined by a real-time scheduling policy such as EDF (Earliest Deadline First) or SJF (Shortest Job First). In this paper, we employ the EDF policy, as it is known to minimize the number of late tasks, i.e., to minimize the number of flows that miss their deadlines. To employ EDF, the Connection list is ordered in ascending order of deadlines, with ties broken by flow arrival time. When a new connection arrives, it is immediately inserted into the appropriate slot in the list. As a result, existing lower-priority flows are preempted by the new flow. For example, if a very urgent flow is inserted into the Connection list, it will be placed in front of the existing, less urgent nodes by the Global window allocation.
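The EDF ordering above can be sketched as follows; we assume each node carries its deadline and records its arrival time, and that non-deadline flows (deadline = 0) always sort last, which matches the allocation order in Section 3.3.

```python
import bisect
import math
import time

def edf_key(node):
    # Earlier deadlines first; non-deadline flows (deadline == 0) go last;
    # ties between equal deadlines are broken by arrival time.
    return (node.deadline if node.deadline > 0 else math.inf, node.arrival)

def insert_edf(connection_list, node):
    node.arrival = time.monotonic()
    keys = [edf_key(n) for n in connection_list]
    connection_list.insert(bisect.bisect_right(keys, edf_key(node)), node)
```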

4 The TCP receiver sets the advertisement window based on its receiving capacity to conduct flow control. The TCP sender uses this information to set the sending window size.


3.2. Incast awareness

Data center applications generally induce a large number of concurrent TCP connections that share a common bottleneck link, resulting in buffer overflow at the ToR switch. To avoid such incast congestion, the aggregate sending rates of all flows, including the deadline and non-deadline flows, must not exceed the (bottleneck) link capacity:

$$\sum_{i=1}^{n} Rate^{d}_{i} + \sum_{j=1}^{m} Rate^{nd}_{j} \le \text{Link capacity} \qquad (1)$$

where Rate_i^d is the sending rate of the ith deadline flow and Rate_j^nd is that of the jth non-deadline flow; n and m are the numbers of deadline and non-deadline flows, respectively. Such rate control can be implemented in many ways [21,22,8], but we implement it by newly defining the Global window size (Gwnd) and DiaDelay. Gwnd is the sum of the sending window sizes of all the peer connections to the aggregator. DiaDelay is the common artificial RTT that all the connections should maintain. We express the aggregate sending rate, i.e., the left-hand side of (1), for all the peers transmitting data to the aggregator. The aggregate sending rate should match the link capacity from the ToR switch to the aggregator (i) to avoid incast congestion and (ii) to maintain goodput. So we have

$$\frac{Gwnd \times MSS}{DiaDelay} = \text{Link capacity} \qquad (2)$$

where MSS denotes the Maximum Segment Size. Here, DiaDelay can be regarded as a time slot, like a global RTT. So if the number of outstanding packets (each of size MSS) is Gwnd during DiaDelay, and Gwnd/DiaDelay matches the link capacity, then the aggregate sending rate does not cause buffer overflow at the ToR switch. Fig. 3 shows an example of the incast awareness operation based on ACK regulation. First, the aggregator adds an ACK delay of DiaDelay − measured_RTT_i to each connection, where measured_RTT_i is the RTT of the ith flow measured at the aggregator side, so that all peers have the same RTT, namely DiaDelay. Second, it allocates a proper advertisement window w_i to each peer so that $\sum_{i=1}^{n} w_i = Gwnd$, where n is the total number of peers and w_i is the window size of the ith peer. The aggregator controls Gwnd so that (2) is met and incast congestion is avoided.
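To make (2) and the ACK pacing concrete, the sketch below recomputes the operating point used later in the paper (Gwnd = 18, MSS = 1.5 KB, and a 1 Gbps bottleneck give DiaDelay = 216 μs) and the artificial ACK delay added per connection; the helper names are ours.

```python
MSS_BYTES = 1500
LINK_CAPACITY_BPS = 1_000_000_000   # 1 Gbps from the ToR switch to the aggregator

def dia_delay_us(gwnd, mss=MSS_BYTES, capacity_bps=LINK_CAPACITY_BPS):
    # From (2): Gwnd * MSS / DiaDelay = capacity  =>  DiaDelay = Gwnd * MSS / capacity.
    return gwnd * mss * 8 / capacity_bps * 1e6

def ack_delay_us(dia_delay, measured_rtt_us):
    # Delay each connection's ACKs so that its effective RTT becomes DiaDelay.
    return max(0.0, dia_delay - measured_rtt_us)

print(dia_delay_us(18))              # 216.0 us, matching Table 1
print(ack_delay_us(216.0, 180.0))    # a peer measured at 180 us gets a 36 us ACK delay
```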

3.3. Deadline awareness

The deadline awareness in DIATCP is realized by the Global window allocation algorithm, which we explain here in detail. Our basic strategy is to give higher priority to the deadline flows first, and then allocate the remaining bandwidth to the other flows in a fair-share manner. The total window size, i.e., Gwnd, is determined by (2). If a flow is to complete before its deadline, its window must satisfy:

$$\text{Window requirement} = \frac{s}{d} \times DiaDelay \qquad (3)$$

where s is the remaining data size to transmit and d is the remaining time until the deadline. Algorithm 2 presents the Global window allocation algorithm. Lines 5–9 implement the initial allocation, which corresponds to (3); the window size is allocated so that each flow meets its deadline.


We allocate one window to non-deadline flows so as to cover as many non-deadline flows as possible. Note again that we employ EDF: the Connection list is ordered by giving priority to the earliest-deadline flows, so the flows with earlier deadlines are allocated first. If there are remaining windows after the initial allocation, they are reallocated later in a fair-share manner (lines 31–33). Assuming that a flow that misses its deadline is meaningless, we drop the flow if the deadline is almost missed or the window requirement is larger than Gwnd (lines 12–13). Next, if the initial window requirement is acceptable (line 15), we allocate that window to the node and, if the previously allocated window size was zero, send an ACK in order to advertise the non-zero window (i.e., resume the paused flow) (lines 15–21). If there is no available window for the node, a zero window is allocated and the flow is eventually paused (lines 22–25).

Algorithm 2. Global window allocation
node.deadline: remaining time until deadline
node.size: remaining data size
node.win: allocated window

1: total_alloc = 0
2: node ← the first node in the Connection list
3:
4: while node exists do
5:   if node.deadline > 0 then
6:     alloc = node.size / node.deadline × DiaDelay
7:   else
8:     /* node.deadline = 0 for non-deadline flows */
9:     alloc = 1
10:  end if
11:
12:  if node.deadline expires || alloc > Gwnd then
13:    Drop the flow that corresponds to the node
14:  else
15:    if total_alloc + alloc ≤ Gwnd then
16:      total_alloc = total_alloc + alloc
17:      node.win = alloc
18:      if previous node.win was zero then
19:        /* non-zero window advertisement */
20:        Send an ACK
21:      end if
22:    else
23:      /* zero window advertisement */
24:      node.win = 0
25:    end if
26:  end if
27:
28:  node ← the next node in the Connection list
29: end while
30:
31: if total_alloc < Gwnd then
32:   allocate the remaining window to flows that have a non-zero window in a fair-share manner
33: end if
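A user-space sketch of Algorithm 2, reusing the Node fields and EDF-ordered list introduced above. The test for an "almost missed" deadline and the fair-share redistribution of the leftover window (lines 31–33) are implemented here as one plausible reading of the pseudocode, not as the paper's exact kernel code.

```python
def global_window_allocation(connection_list, gwnd, dia_delay_us, mss=1500,
                             send_ack=lambda node: None,
                             drop_flow=lambda node: None):
    total_alloc = 0
    for node in list(connection_list):
        if node.deadline > 0:
            # Eq. (3): window (in packets) needed to ship the remaining bytes
            # before the deadline, rounded to at least one packet (our choice).
            bytes_per_us = node.size / (node.deadline * 1000.0)  # deadline in ms
            alloc = max(1, round(bytes_per_us * dia_delay_us / mss))
        else:
            alloc = 1                         # lines 8-9: non-deadline flows

        almost_missed = 0 < node.deadline <= dia_delay_us / 1000.0
        if almost_missed or alloc > gwnd:     # lines 12-13
            drop_flow(node)                   # or demote it; see the discussion below
            continue

        if total_alloc + alloc <= gwnd:       # lines 15-21
            total_alloc += alloc
            was_paused = (node.win == 0)
            node.win = alloc
            if was_paused:
                send_ack(node)                # non-zero window advertisement
        else:                                 # lines 22-25
            node.win = 0                      # zero window: pause the flow

    # Lines 31-33: distribute any leftover window among the active flows.
    active = [n for n in connection_list if n.win > 0]
    leftover = gwnd - total_alloc
    if leftover > 0 and active:
        share, extra = divmod(leftover, len(active))
        for i, n in enumerate(active):
            n.win += share + (1 if i < extra else 0)
```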


Fig. 2. Example of the Connection list: each node represents the connection information, e.g., deadline, data size, and window size – for instance (20 ms, 10 KB, win 4), (35 ms, 20 KB, win 3), (40 ms, 30 KB, win 2), (null, 10 MB, win 1), and (null, 20 MB, win 0). The window size is computed by the Global window allocation and the list is ordered by the EDF policy; if a zero window becomes non-zero, an ACK is sent to advertise the non-zero window.

Fig. 3. DIATCP operation: Global window allocation and ACK pacing.

We note that there are several ways to implement the flow dropping (line 13). The easiest way is to simply send a FIN packet as in [11], but in our implementation, we instead give the flow the lowest priority (i.e., a zero deadline) rather than closing the connection, so that it is promptly resumed after all other deadline flows are completed.
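A sketch of this demotion-style "drop": instead of sending a FIN, the node's deadline is cleared so that it sorts behind every deadline flow and is treated like a non-deadline flow by the allocator; the re-sort and the call into the allocator mirror the sketches above.

```python
import math

def demote_instead_of_drop(connection_list, node, reallocate):
    # Give the flow the lowest priority (treat it as a non-deadline flow)
    # rather than closing the connection, so it resumes once the
    # remaining deadline flows are done.
    node.deadline = 0
    connection_list.sort(
        key=lambda n: (n.deadline if n.deadline > 0 else math.inf, n.arrival))
    reallocate(connection_list)  # e.g., the Global window allocation sketch above
```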

3.4. Design issues in choosing parameters

The aggregator node keeps track of all the incoming flows, so it can fully monitor its bottleneck link at the ToR switch. Since this link capacity is a given value, we use pre-set values for Gwnd and DiaDelay, e.g., 30 and 360 μs, assuming the link capacity is 1 Gbps and the MSS is 1.5 KB. These values are acceptable if most of the measured RTTs between the peer servers are within 360 μs; in this case, we can add an ACK delay so that the final RTTs become DiaDelay, as explained in Section 3.2. The practical problem is that the RTT generally oscillates in data center environments, as computation and process scheduling times can greatly affect the transmission delay. So we would need to measure a live RTT to calculate an accurate ACK delay. Note that it is not meaningful to use the average or minimum RTT as the measured RTT, because the flow sizes are generally too small for us to obtain a stable value. Furthermore, a few spikes in the measurements can significantly affect the overall performance, as there are gaps of tens of μs between the average and minimum RTTs, as shown in Fig. 4.

In addition, measuring live RTTs at the aggregator side is difficult in practice unless there is data traffic in both directions. Lastly, the artificial ACK delay may accumulate over the following ACKs, leading to one big ACK delay. We could send an ACK for every two data packets to avoid this, as the existing delayed ACK algorithm does, but that results in inaccurate ACK delay measurements. For these reasons, we take a practical approach: we elect to use a predefined RTT value extracted from prior measurements and computation, rather than live measurements at microsecond granularity, and set DiaDelay according to this predefined RTT. For this purpose, we measured the RTTs between the aggregator node and each of the 45 servers 1000 times in our data center testbed, as shown in Fig. 4. Here, we found that naively taking the average RTT is not reasonable, because the RTT measurements do not follow a normal distribution according to the Shapiro–Wilk normality test [23]. In addition, the average RTT histogram for the servers shows negative skewness, which implies the average RTT may not be reliable. Hence, we estimate the probability density via a nonparametric method (e.g., the kernel density estimation method [24]), since we do not have prior knowledge about the true RTT distribution. Considering the RTT to be an independent and identically distributed (i.i.d.) random variable for each server, we use the kernel density estimation method, which is known to converge quickly to the true density, to obtain a smooth density estimator.5 As a result, the RTT with the highest probability in our density estimator is 200.11 μs, and DiaDelay is chosen based on this computation. Table 1 presents some candidates for the pre-set values of Gwnd and DiaDelay. To achieve full link utilization, a Gwnd of 17 or larger seems reasonable. We measure the aggregate goodput by transmitting 10 MB from each server to the aggregator while varying Gwnd from 13 to 20, as shown in Fig. 5. Finally, we use 18 for Gwnd throughout this paper, with the corresponding DiaDelay of 216 μs, based on Figs. 4 and 5 and Table 1. Note that DiaDelay can be set depending on the particular data center environment. We later validate that this practical approach works well in our real data center testbed.

5 We use the method of Sheather and Jones [25] to select the kernel bandwidth (2.335 μs in this paper).
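A sketch of this DiaDelay selection using SciPy's Gaussian KDE. Note that the paper selects the kernel bandwidth with the Sheather–Jones method [25], whereas gaussian_kde defaults to Scott's rule, so this approximates the procedure rather than reproducing it.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pick_gwnd_and_dia_delay(rtt_samples_us, mss=1500, capacity_bps=1_000_000_000):
    """Estimate the RTT density, take its mode, and derive (Gwnd, DiaDelay)."""
    kde = gaussian_kde(rtt_samples_us)            # bandwidth: Scott's rule by default
    grid = np.linspace(min(rtt_samples_us), max(rtt_samples_us), 2000)
    rtt_mode_us = grid[np.argmax(kde(grid))]      # ~200.11 us in the paper's testbed

    # Smallest Gwnd whose matching DiaDelay (Eq. (2)) is at least the RTT mode,
    # so that most peers need a non-negative ACK delay.
    gwnd = int(np.ceil(rtt_mode_us * 1e-6 * capacity_bps / (mss * 8)))
    dia_delay_us = gwnd * mss * 8 / capacity_bps * 1e6
    return gwnd, dia_delay_us

# With the mode near 200.11 us this gives Gwnd = 17 and DiaDelay = 204 us;
# the paper ultimately uses Gwnd = 18 (DiaDelay = 216 us) based on Fig. 5 and Table 1.
```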

Fig. 4. Measurement of RTTs between the aggregator and 45 servers.


Table 1. Pre-set values of Gwnd and DiaDelay (μs) for 1 Gbps of link capacity. The MSS is 1.5 KB.

Gwnd:     14   15   16   17   18   19   20
DiaDelay: 168  180  192  204  216  228  240

4. Evaluation

In this section, we evaluate DIATCP first through larger-scale ns-3 simulations [26] and then on a small-scale 46-server data center testbed based on our Linux implementation. In the simulations, we compare DIATCP with legacy TCP (NewReno), DCTCP, and D2TCP.6 We also evaluate a fair-share version of DIATCP, denoted DIATCP-FS, to investigate how our congestion avoidance algorithm improves the performance of deadline flows, in terms of the number of missed deadlines, even without deadline awareness. DIATCP-FS also performs the Global window allocation, but it regards all flows as non-deadline flows.

4.1. Simulations

We have implemented DIATCP and DIATCP-FS in the ns-3 simulator (ver. 3.14) [26]. For comparison, DCTCP and D2TCP with ECN functionality7 are also implemented, based on the algorithms described in [1,12] and the DCTCP implementation available in [28].

6 ECN is an essential functionality for avoiding incast congestion in DCTCP and D2TCP. Unfortunately, the current operating system version of the switch in our testbed does not support ECN marking. Therefore, we only use ns-3 simulations to compare DIATCP with DCTCP and D2TCP.
7 The current version of ns-3 does not yet support ECN functionality in either RED (Random Early Detection) [27] or TCP.

Fig. 5. Aggregate goodput (Mbps) with varying Gwnd.

Fig. 6. Simulation topology.

Fig. 6 depicts the simulation topology. It consists of 5 racks; each rack has 45 servers and a ToR switch. All servers are connected to their ToR switch with 1 Gbps links, and the 5 ToR switches are connected to an Aggregation switch with 10 Gbps links. In addition, one server dedicated to the aggregator (i.e., the aggregator node) is attached to the first ToR switch with a 1 Gbps link. The link delay is set to 25 μs so that the average RTT is about 200 μs, a typical value in today's data center networks. The packet buffer size per port is 128 KB, assuming shallow-buffered ToR switches; the Aggregation switch has a deep packet buffer. For the key parameters of DCTCP and D2TCP, we set g, the weighted averaging factor, to 1/16 and K, the buffer occupancy threshold for marking CE bits, to 20 for 1 Gbps links and 65 for 10 Gbps links, according to [1]. For D2TCP, we set d, the deadline imminence factor, to be between 0.5 and 2.0 according to [12]. The RTOmin for all protocols is 20 ms, as production TCPs such as Google's use 20 ms within data centers [29]. The main purpose of the simulations is to compare our algorithm with the previous studies, so we use a scenario similar to that in [12], where five Partition/Aggregate trees run on the network; each tree consists of one aggregator and n workers and has a different deadline and response data size. The aggregators are located on the aggregator node, and the workers are distributed evenly over all servers so that each server hosts exactly one worker. The aggregators send a request to their workers, and the workers immediately respond with a specific amount of data. We increase n, the number of workers per tree, from 20 to 45 to measure the performance at various congestion levels.


Lastly, we set the five trees' (base deadline, response data size) pairs to (tight, 2 KB), (tight, 6 KB), (moderate, 10 KB), (moderate, 14 KB), and (lax, 18 KB), respectively. The values for the data sizes (i.e., a few KB per flow) are based on the characteristics of workloads in production clusters [1], but since real distributions of flow deadlines are not available, we use 20 ms (tight), 30 ms (moderate), and 40 ms (lax) as base deadlines and assume that the actual deadlines are exponentially distributed around them, as previous studies did [11,12]. Specifically, we use a one-sided exponential distribution for flow deadlines whose mean is the base deadline, capped at 150% of the base deadline. All simulation results are repeated 100 times and averaged.
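A sketch of the deadline sampling used in the simulation setup above; the exact generator in the authors' ns-3 scripts is not given, so the literal reading here (exponential with mean equal to the base deadline, truncated at 150% of it) is an assumption.

```python
import random

# Five trees: (base deadline in ms, response data size in KB), as listed above.
TREES = [(20, 2), (20, 6), (30, 10), (30, 14), (40, 18)]

def sample_deadline_ms(base_deadline_ms, cap_factor=1.5):
    # One-sided exponential around the base deadline, capped at 150% of it.
    return min(random.expovariate(1.0 / base_deadline_ms),
               cap_factor * base_deadline_ms)

deadlines = [sample_deadline_ms(base) for base, _size in TREES]
```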

4.1.1. Deadline awareness

Fig. 7 shows the fraction of flows that miss their deadlines (i.e., the Y axis) at increasing congestion levels. When the number of workers per tree is small (e.g., 20 or fewer), all variants meet the deadlines well, but the missed deadlines of NewReno and DCTCP increase rapidly as the number of workers increases. Even though DCTCP normally achieves much lower average flow completion times than NewReno, its ignorance of the deadlines results in a large fraction of missed deadlines, showing performance similar to NewReno. D2TCP performs much better than NewReno and DCTCP, as it gives more bandwidth to near-deadline flows, but it still misses about 27% of the deadlines when the number of workers is large. On the other hand, DIATCP does not miss any deadlines even in highly congested situations. We note that the fair-share version of DIATCP, DIATCP-FS, shows similar results; it misses only 3 deadlines when the number of workers is 45. This implies that most of the deadlines can be met just by avoiding the incast congestion effectively.

4.1.2. Incast avoidance

To show how incast congestion affects the performance, we measure the fraction of flows that suffer at least one timeout, as shown in Fig. 8. We observe that more than 50% of the flows that employ NewReno or DCTCP experience incast congestion in all cases. D2TCP shows better performance with regard to congestion avoidance, but its fraction of timeout flows increases up to around 60% as the number of workers increases. By comparing this result with Fig. 7, we can see that incast congestion directly affects the missed deadlines, as the flow deadlines range from 20 ms to 60 ms while RTOmin is 20 ms. Since one of the design goals of DIATCP is to avoid incast congestion, DIATCP-FS and DIATCP control the total sending window size to within the bottleneck link capacity and, as a result, do not suffer any timeouts.

4.1.3. Normalized latency

Fig. 9 shows the flow completion times normalized to the deadline; a flow whose normalized completion time is less than 100% meets its deadline. We plot three points for each variant, indicating the 50th, 90th, and 99th percentiles of the flow completion times, from bottom to top. In the graph, D2TCP shows much lower flow completion times than DCTCP, and its 90th percentiles are under the deadlines (100%) until the number of workers reaches 35, but exceed the deadlines when the number of workers is 40 or 45. We note that for DIATCP-FS and DIATCP, all the 99th percentiles of the flow completion times are under the deadlines. Moreover, DIATCP shows lower completion times than DIATCP-FS in all cases, and the 99th percentile of DIATCP flows is below 50% even when the number of workers is 45. These low completion times naturally result in good performance in terms of missed deadlines, as shown in Fig. 7.

4.1.4. Deadline awareness with background traffic

We measure the missed deadlines when deadline and non-deadline flows coexist. We add to the previous scenario one background flow that transmits 10 MB of data to the aggregator node, as it is reported that the median number of concurrent large flows is 1 in data center networks [1]. The background flow has no deadline and fully utilizes the bottleneck link before the Partition/Aggregate trees begin. Fig. 10 shows the missed deadlines for the deadline flows: the overall missed deadlines of NewReno, DCTCP, and D2TCP increase by as much as 10–20% compared to Fig. 7, while DIATCP still does not miss any deadlines. DIATCP-FS shows 0.36% missed deadlines when the number of workers is 45. These results imply that the background flow aggravates the incast congestion, so a larger fraction of the deadlines is missed. Nevertheless, DIATCP avoids the congestion by controlling the total amount of incoming traffic and prioritizes the deadline flows over the background flow. As a result, we confirm that DIATCP significantly outperforms the existing schemes.

Fig. 7. Fraction of flows that miss the deadlines.

Fig. 8. Fraction of flows that suffer at least one timeout.


Fig. 9. Normalized flow completion times at the 50th, 90th, and 99th percentiles.

We also measure the average goodput of the background flow to see the link utilization of non-deadline flows, as shown in Fig. 11. The goodput clearly tends to decrease as the number of workers increases, but the goodputs of DIATCP-FS and DIATCP are consistently higher than those of NewReno and DCTCP, and comparable to that of D2TCP. This shows that, with NewReno and DCTCP, even the non-deadline flow suffers from the incast congestion. DIATCP guarantees that the deadline flows meet their requirements by taking a negligible portion of the link bandwidth from the background traffic and allocating it to the deadline flows. D2TCP shows similar goodput performance, but fails to meet the deadlines, as shown in Fig. 7.

4.1.5. Deadline awareness with tight deadline requirements

To investigate how effectively our deadline-aware algorithm works in extreme scenarios, we measure the missed deadlines when one of the trees has an unacceptably tight deadline requirement, as shown in Fig. 12. In this scenario, we change the first tree's deadline from 20 ms to a fixed 5 ms. In the graph, the missed deadlines of DCTCP and D2TCP slightly increase, as most of the tight-deadline flows miss their deadline. We also observe that DIATCP-FS misses the tight deadlines as well, and its fraction of missed-deadline flows increases up to 13%. This is mainly because DIATCP-FS just allocates Gwnd to all flows in a fair-share manner. In contrast, when the tight-deadline flows start to transmit, DIATCP gives them higher priority than the far-deadline flows and allocates them a proper window size so that they finish the transfer in time. As a result, DIATCP misses no deadlines in any case.


Fig. 11. Average goodput of the background flow.

4.1.6. Flow quenching

As described in [11], in some extreme cases there may be flows that require more bandwidth than the link capacity. Those flows cannot meet their deadlines even if we allocate all available bandwidth to them, so it is better to abandon such flows rather than trying to complete all flows. To test this scenario, we set the deadline and data size to 20 ms and 1 KB for four Partition/Aggregate trees, and give an extremely small deadline and a large data size to one tree so that its bandwidth requirement exceeds the link capacity. We measure the fraction of flows that meet their deadlines, as shown in Fig. 13. Note again that the flows in the tree with extreme requirements can never meet their deadline. If there is no extreme flow, all other flows easily meet their deadlines in all cases, as the data size (1 KB) is very small. When the extreme flows coexist, D2TCP misses up to 60% of the deadlines as the number of extreme flows increases. DIATCP, however, effectively drops the extreme flows, as explained in Section 3.3, when the window requirement is larger than Gwnd, i.e., when the bandwidth requirement exceeds the link capacity. As a result, it does not miss any deadlines of the normal flows. We also confirm that DIATCP performs better than D2TCP in terms of the completion times of the extreme flows, but omit those results from the paper.

4.2. Testbed experiments

We evaluate our algorithm using a Linux implementation on a 46-server testbed, to confirm that DIATCP works well in a real data center environment. We implemented DIATCP in the Linux kernel 2.6.38 by adding about 265 lines of code: 219 lines for the main operations, 25 lines for the header files, and 21 lines for the proc-related code used in the Linux system. The testbed topology is similar to that of the simulations, but only 3 racks are used (Fig. 14); each rack has 15 servers and a ToR switch, a Summit X460 with 48 1 Gbps ports, 4 10 Gbps ports, and a 3 MB shared packet buffer memory [30].

Fig. 10. Missed deadlines with background traffic (10 MB).

Fig. 12. Missed deadlines with tight deadline requirements (5 ms).


The three ToR switches are connected to an Aggregation switch, a Summit X670 with 48 10 Gbps ports and a 9 MB shared packet buffer memory [31]. Lastly, one aggregator node is added to the first rack. All switches are Extreme Networks products, and all servers including the aggregator node are Dell OptiPlex 790DT machines with a quad-core Intel i5 3.3 GHz processor, 4 GB of RAM, and a 1 Gbps Ethernet interface, running the Linux kernel 2.6.38. In our experiments, we set the packet buffer size of the bottleneck port in the first ToR switch to 178 KB, which is the minimum value our switch supports. Since the switch does not support ECN, as mentioned before, we compare our algorithm to Reno and to 1win-Reno, whose sending window is fixed to one. 1win-Reno emulates the most conservative window-based congestion control algorithm, as its maximum window size can only grow to one; in other words, a one-packet window is the most incast-avoidable window size under highly congested situations. Existing solutions like DCTCP adopt window-based algorithms. The RTOmin for all protocols is set to 20 ms. We set up five Partition/Aggregate trees on the testbed as we did in the simulations and vary the number of workers per tree from 20 to 45. Since we have only 45 physical servers, multiple workers are launched on each server. The trees' base deadlines are tight (20 ms) for two trees, moderate (30 ms) for two other trees, and lax (40 ms) for the last tree, and a 50% uniform-random variance is added to the base deadlines. The response data sizes are uniformly distributed across [2 KB, 10 KB] for the tight-deadline flows, [10 KB, 20 KB] for the moderate deadlines, and [20 KB, 35 KB] for the lax deadlines. We also add one background (non-deadline) flow whose size is 100 MB, and start the trees after this long flow fully utilizes the bottleneck link. All experimental results are repeated 100 times and averaged.

4.2.1. Deadline awareness with background traffic

We measure the missed deadlines for the deadline flows while increasing the number of workers per tree. In Fig. 15, we observe that the missed deadlines of Reno and 1win-Reno increase as the number of workers increases. 1win-Reno performs slightly better until the number of workers reaches 35, but becomes comparable to Reno under congested network conditions. This implies that even the most conservative strategy (1win-Reno) that window-based congestion control can take is not enough to avoid incast congestion.

Fig. 13. Fraction of flows that meet the deadlines when there is a tree with extreme flows that require larger bandwidth than the link capacity.

Meanwhile, with DIATCP-FS and DIATCP, no timeout is observed, resulting in just a few missed deadlines (under 1% in all cases). We note that DIATCP also misses 0.1–0.9% of the deadlines, because a scheduling delay exists before the aggregator sends a request to the workers at the application layer. This delay can grow to more than 15 ms depending on the number of connections, which in turn affects the performance of short flows. Despite such scheduling overheads, DIATCP still rarely misses the deadlines in most cases. In addition, both variants of DIATCP outperform Reno and 1win-Reno in terms of the average goodput of the background flow, as shown in Fig. 16, by effectively avoiding the incast congestion. In general, we find that our experimental results follow a trend similar to the simulations shown in Figs. 10 and 11. In the experiments, the missed deadlines of Reno and 1win-Reno are around 20% when the number of workers is 20 and increase up to around 60%, as shown in Fig. 15, which almost corresponds to the simulation results for NewReno and DCTCP in Fig. 10. The missed deadlines of DIATCP-FS and DIATCP are almost zero in both the simulations and the real experiments. The results for the average goodput of the background flow are also similar, even though the absolute numbers differ slightly because the environmental settings are different. Therefore, we believe that our simulation results correctly capture the behavior of the real world.

4.2.2. Deadline awareness with tight deadline requirements

In the previous scenario, there are no evident differences between DIATCP-FS and DIATCP in terms of missed deadlines. Here, we build a more stressful scenario with tighter deadlines. Further, we alleviate the effect of the scheduling overhead, to concentrate only on the network performance, by increasing the deadline and response data size; each flow's data size and deadline are uniformly distributed across [100 KB, 150 KB] and [50 ms, 250 ms], respectively. No background flow is added. Under this scenario, we measure the missed deadlines for the two variants, as shown in Fig. 17. In the graph, the missed deadlines of DIATCP-FS increase linearly as the number of workers increases. DIATCP-FS often misses the relatively small deadlines, as it allocates the same amount of window to the flows in a first-come, first-served manner. On the other hand, DIATCP shows almost zero missed deadlines until the number of workers exceeds 40, by allocating more window to the near-deadline flows. When the number of workers is 45, it is impossible to meet the deadlines of some flows even for DIATCP, resulting in about 10% missed deadlines.

4.2.3. Incast avoidance

We now evaluate how DIATCP copes with incast congestion. Unlike the previous scenarios, the flows have no deadlines, and the response data size of each worker is 5 MB/n, where n is the number of workers. We measure the query completion time as n increases up to 220, as shown in Fig. 18. During the experiments, the minimum completion time is about 45 ms, and DIATCP shows consistently low completion times, ranging between 45 ms and 47 ms in all cases. However, the query completion time of Reno increases up to 88 ms because of the severe incast congestion.


Fig. 14. The real data-center testbed: three ToR switches (48 1 Gbps ports + 4 10 Gbps ports each) connected to an Aggregation switch (48 10 Gbps ports).

Fig. 15. Missed deadlines with one background flow (100 MB).

Fig. 17. Missed deadlines with tight deadline requirements.

It is worth noting that 1win-Reno shows low completion times in this scenario, since the workers' small sending window sizes lead to less congested situations. However, it performs poorly with a small number of workers (e.g., fewer than 10). For example, the query completion time of 1win-Reno is more than 900 ms with one worker and 63 ms with five workers, as shown in Fig. 18. This is expected, since 1win-Reno is the most conservative window-based congestion control strategy.

Fig. 16. Average goodput of the background flow.

We also measure the timeout ratio, i.e., the fraction of queries that suffer at least one timeout, as shown in Fig. 19. As expected, Reno suffers at least one timeout among the workers in most experimental rounds when the number of workers is more than 80, which directly results in high completion times, as the query completion time is determined by the most congested flow. 1win-Reno starts to experience timeouts when the number of workers is 120, because the packet buffer size is 178 KB and 120 workers imply that 180 KB of data packets (1.5 KB × 120) are transmitted at the same time through the bottleneck buffer. So the query completion times increase slightly after this point. Finally, we observe that DIATCP suffers no timeout in all cases, directly resulting in its low completion times.

4.2.4. Multi-bottleneck environments

One of our main assumptions in the design phase is that the bottleneck link is the last hop between the aggregator and the shallow-buffered ToR switch. This is true in many cases, but there can be more than one bottleneck point between the aggregator and the workers. To see whether DIATCP copes with such multi-bottleneck environments, we set up a simple multi-bottleneck topology on our testbed, as shown in Fig. 20. In this topology, there are two bottleneck points: BP-1, the port of the left X460 switch connected to the aggregator, and BP-2, the port of the right X460 switch connected to the X670 switch.

Fig. 18. Query completion time vs. the number of workers.



Fig. 19. Timeout ratio vs. the number of workers.

When these artificial bottleneck points are enabled, the buffer size of both BP-1 and BP-2 is set to 178 KB (the minimum configurable size), and the egress rate of BP-2 is set to 1 Gbps. Next, we deploy one aggregator node, 60 workers8 whose response data size is 80 KB each, and two update nodes that transmit large update data to the aggregator; these flows traverse both bottleneck points. There is also one long-term background flow from the BG-S node to the BG-R node. Table 2 presents the average query completion time for the workers with different settings of the bottleneck points. When there is no artificial bottleneck point on the path, the completion time of Reno is relatively low, but it increases by 83.4 ms, 70.9 ms, and 84.5 ms when BP-1, BP-2, and both are enabled, respectively. When there are multiple bottleneck points, the total amount of traffic on the end-to-end path converges to the main bottleneck bandwidth (BP-1 in our case). For this reason, the result for the two-bottleneck case is similar to that of the BP-1-only case. DIATCP, however, efficiently controls the total amount of traffic and avoids incast congestion on both bottlenecks, showing low completion times in all cases.

5. Discussion

Other experimental results: In our testbed, all RTTs between the aggregator node and the other servers vary around 200 μs on average. However, the average RTTs can differ among workers, as the number of hops varies within a data center. Such environments with heterogeneous RTTs may penalize short-RTT workers in terms of performance, as DiaDelay increases due to the long-RTT workers. In our design, this is compensated for by the larger window allocation that these workers receive. Furthermore, the bottleneck link is always fully utilized, since the aggregate sending rate (i.e., Gwnd/DiaDelay) equals the bottleneck link capacity. To show that DIATCP works well in heterogeneous environments, we conduct a set of simulation experiments with the network topology in Fig. 6, where one worker is deployed in each rack (i.e., 5 workers in total).

8 These workers are running on 12 physical servers (i.e., 5 workers per server).


Fig. 20. The multi-bottleneck topology.

To show that DIATCP works well in heterogeneous environments, we conduct a set of simulation experiments with the network topology in Fig. 6, where one worker is deployed in each rack (i.e., 5 workers in total) and the round-trip propagation delays are set to (i) 200 μs for all flows, (ii) 300 μs for all flows, and (iii) diverse values, by giving different round-trip propagation delays between racks, namely 100, 150, 200, 250, and 300 μs. DiaDelay is set to 300 μs, and each worker transmits 2 MB to the aggregator. In Table 3, we measure the aggregate goodput to see the bottleneck link utilization. There is only about a 0.5% drop in the average goodput even when the workers have different RTTs.

We do not perform any experiments specifically targeting the priority inversion problem raised in [12]. As described in Section 3.3, the flows in the Connection list are sorted in ascending order of deadline, so the flows with earlier deadlines are allocated first (i.e., with higher priority) by our algorithm (a short illustrative sketch follows below); unlike [11], no priority inversion is observed in any of our experiments.

Retransmission timeout parameter: In our simulations and experiments, we set the minimum retransmission timeout (RTOmin) to 20 ms. One may argue that an RTOmin of 20 ms seems high, since it is significantly longer than typical RTT values in data center networks. Indeed, prior work has proposed RTOmin values of 1 ms or lower [32]. However, other studies [9] have shown that retransmission timeouts below 10 ms cause spurious retransmissions; that is, a small RTOmin triggers false alarms and unnecessary timeouts. We therefore use an RTOmin of 20 ms, as in Google's production data centers [29] and in previous work [12].

Congestion at the core/aggregation switches: DIATCP mainly focuses on congestion at the ToR switch, whereas other network-based schemes consider congestion at the aggregation or core switches [11,13,9]. There are several reasons why we view ToR congestion as the most critical problem. First, it has been reported that the majority (80%) of the traffic originated by servers in cloud data centers is destined to machines within the same rack [20], because cloud services are commonly placed so that the amount of inter-rack traffic is minimized. Second, packet losses do not correlate with high average link utilization; they occur under low average utilization, since the primary cause of loss is the bursty query traffic generated by Partition/Aggregate applications [20]. The Partition/Aggregate traffic pattern therefore aggravates bursty packet losses at the ToR switch, while the aggregation and core switches merely see higher utilization. Finally, and most importantly, since DIATCP is implemented only at the aggregator, it can be used in parallel with network-based schemes that address aggregation and core switch congestion.
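Returning to the deadline-ordered allocation mentioned above, the following sketch illustrates an earliest-deadline-first pass over a Connection list that hands out pieces of Gwnd until the budget is exhausted. It is illustrative only: the flow fields, the demand model, and the Gwnd value are assumptions for this example, and the actual Global window allocation is the one described in Section 3.3.

```c
/* Illustrative sketch of a deadline-ordered (earliest-deadline-first) pass
 * over the Connection list, granting window from the global budget Gwnd.
 * Data structures and values are assumptions, not the prototype's code.
 */
#include <stdio.h>
#include <stdlib.h>

struct flow {
    int id;
    double deadline_ms;   /* time remaining until the flow's deadline */
    int demand_segments;  /* segments the flow still needs to send    */
    int alloc_segments;   /* window granted in this round             */
};

static int by_deadline(const void *a, const void *b)
{
    const struct flow *fa = a, *fb = b;
    return (fa->deadline_ms > fb->deadline_ms) - (fa->deadline_ms < fb->deadline_ms);
}

int main(void)
{
    struct flow flows[] = {
        { 1, 30.0, 20, 0 }, { 2, 10.0, 15, 0 }, { 3, 50.0, 40, 0 },
    };
    int n = sizeof(flows) / sizeof(flows[0]);
    int gwnd = 40;  /* global window budget, in segments (assumed value) */

    /* Earlier deadlines are served first, so they never wait behind later
     * ones -- the property used to argue that no priority inversion occurs. */
    qsort(flows, n, sizeof(flows[0]), by_deadline);

    for (int i = 0; i < n && gwnd > 0; i++) {
        int grant = flows[i].demand_segments < gwnd ? flows[i].demand_segments : gwnd;
        flows[i].alloc_segments = grant;
        gwnd -= grant;
    }
    for (int i = 0; i < n; i++)
        printf("flow %d (deadline %.0f ms): granted %d segments\n",
               flows[i].id, flows[i].deadline_ms, flows[i].alloc_segments);
    return 0;
}
```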


Non-Partition/Aggregate traffic: DIATCP leverages the Partition/Aggregate traffic pattern by enabling the aggregator to conduct admission and flow control. For non-Partition/Aggregate traffic patterns, DIATCP still achieves good performance, since it conducts flow control to avoid congestion. This is evident from the fact that DIATCP provides better average goodput even for background flows, which represent the non-Partition/Aggregate flows, as shown in Figs. 11 and 16.

Practical design issues: We now discuss two design issues of DIATCP that may arise in practical data center environments. The first issue is how to deploy DIATCP in virtualized environments. In this paper we implicitly assume that each TCP stack has full use of the link capacity of the Network Interface Card (NIC). However, there can be multiple virtual machines (VMs) on the same physical node in commercial cloud data centers, so multiple TCP stacks may compete for the same physical NIC capacity. To deal with this situation, the VM monitor (i.e., the hypervisor) needs to control the link capacity of the virtual NICs (VNICs) so that the total capacity of the VNICs does not exceed the physical link capacity. Our solution is to have the hypervisor periodically set the link capacity of each VNIC in proportion to its share of the incoming traffic (a sketch of this policy is given below). This function can be implemented easily in the hypervisor, since the link capacity is a controllable parameter in each VM via the proc interface and the statistics of the incoming traffic are usually available at the privileged domain (e.g., dom0 in the case of Xen [33]). User requirements, if available, are another option for setting the VNIC capacity.

The second design issue is properly allocating Gwnd to long-lived but quiescent connections, such as heartbeat flows. This type of flow does not tear down the connection even after all of its data has been transmitted, so that future transmissions avoid the cost of the 3-way handshake. The Global window allocation assigns some window to such flows as well, which may waste bandwidth, even though they are served only after the deadline flows. Our solution is to keep a separate global list for these long-lived flows: if a node in the Connection list is not accessed (i.e., is idle) for a specified duration, we move it to the idle list until a new data packet is received from the peer. When the flow is re-activated, the corresponding node is re-inserted into the Connection list and included in the subsequent Global window allocation (a second sketch below illustrates this).
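To make the proportional VNIC policy concrete, here is a minimal user-space sketch, not the hypervisor implementation from the paper: given per-VNIC incoming-traffic counters over the last period, it splits the physical link capacity in proportion to those counters. The counter values and the 10 Gbps physical capacity are assumptions for illustration.

```c
/* Sketch of the proportional VNIC capacity policy described above.
 * Per-VNIC incoming byte counts and the physical capacity are illustrative
 * assumptions; an actual hypervisor would read its own statistics (e.g.,
 * from dom0 in Xen) and apply the result through its rate-limiting interface.
 */
#include <stdio.h>

int main(void)
{
    const double phys_capacity_mbps = 10000.0;                /* physical NIC */
    const double incoming_bytes[]   = { 4.0e9, 1.0e9, 0.5e9 };/* per VNIC     */
    const int n = 3;

    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += incoming_bytes[i];

    for (int i = 0; i < n; i++) {
        double share = (total > 0.0) ? incoming_bytes[i] / total : 1.0 / n;
        printf("VNIC %d: cap = %7.1f Mbps (share %.2f)\n",
               i, share * phys_capacity_mbps, share);
    }
    return 0;
}
```

The idle-list handling for quiescent connections can be sketched in a similar spirit; the list representation, the timestamp field, and the idle threshold below are assumptions made for this example, not the kernel data structures used in the prototype.

```c
/* Sketch of the idle-list handling described above: connections idle longer
 * than a threshold are demoted so the Global window allocation skips them;
 * a newly arriving data packet promotes the connection back.
 */
#include <stdio.h>

#define IDLE_THRESH_S 5.0

struct conn {
    int    id;
    double last_active;   /* seconds, e.g., from a monotonic clock   */
    int    on_idle_list;  /* 0 = Connection list, 1 = idle list      */
};

/* Called periodically: demote connections that went quiet. */
static void sweep_idle(struct conn *c, int n, double now)
{
    for (int i = 0; i < n; i++)
        if (!c[i].on_idle_list && now - c[i].last_active > IDLE_THRESH_S)
            c[i].on_idle_list = 1;
}

/* Called when a data packet arrives: promote the connection so the next
 * Global window allocation includes it again. */
static void on_data_packet(struct conn *c, double now)
{
    c->last_active  = now;
    c->on_idle_list = 0;
}

int main(void)
{
    struct conn conns[2] = { { 1, 0.0, 0 }, { 2, 0.0, 0 } };

    sweep_idle(conns, 2, 10.0);        /* both idle for 10 s -> demoted */
    on_data_packet(&conns[0], 10.5);   /* conn 1 becomes active again   */

    for (int i = 0; i < 2; i++)
        printf("conn %d is on the %s list\n", conns[i].id,
               conns[i].on_idle_list ? "idle" : "Connection");
    return 0;
}
```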

Table 2
Avg. query completion time (ms) on the multi-bottleneck topology.

              Bottleneck
TCP           None      BP-1      BP-2      Both
Reno          63.4      83.4      70.9      84.5
DIATCP        49.5      49.5      49.4      49.5

Table 3
Bottleneck link utilization.

RTT setting       200 μs     300 μs     Heterogeneous
Goodput (Mbps)    967.98     966.24     962.49

6. Related work

There is a plethora of work on the design of today's data center protocols, including advanced transport protocols for wide area networks, Active Queue Management (AQM) schemes, and real-time scheduling policies. Among these, we focus on two threads of recent data center transport protocols: incast congestion avoidance and deadline awareness.

In [4], the authors conduct an in-depth analysis of the incast congestion problem in their cluster-based storage systems. Specifically, they explore the tradeoffs between (i) the switch buffer size and the number of packet losses, and (ii) the data block size and the link idle time. To mitigate TCP throughput collapse, they tune TCP-layer parameters such as the duplicate ACK threshold, or employ Ethernet Flow Control. Unfortunately, they do not arrive at a satisfactory solution that fully addresses the problem.

To effectively avoid incast congestion, new congestion control schemes such as Data Center TCP (DCTCP) [1], Incast Congestion Control for TCP (ICTCP) [7], and Incast-Avoidance TCP (IA-TCP) [8] have been proposed. DCTCP provides fine-grained congestion control that adapts the window size in proportion to the extent of congestion, estimated by counting ECN marks. ICTCP measures the bandwidth of the total incoming traffic to obtain the available bandwidth, and then controls the receive window of each connection based on this information. Similarly, IA-TCP controls the workers' sending rate so that it does not exceed the bandwidth-delay product. However, IA-TCP does not limit the total number of outstanding packets as the number of concurrent connections increases, whereas DIATCP bounds this total by Gwnd. This makes IA-TCP less scalable, since incast congestion is practically inevitable with a large number of outstanding packets, even if the RTT is artificially increased.

While the previous work focuses on fair-share congestion control, D3, a Deadline-Driven Delivery control protocol [11], introduces deadline awareness into the design space of data center protocols. D3 exploits deadline information and performs explicit rate control in a centralized manner at the switches, allocating bandwidth based on each flow's deadline and size. Deadline-Aware Datacenter TCP (D2TCP) [12] is another deadline-aware protocol that provides a fully distributed algorithm. D2TCP introduces a deadline factor into DCTCP's congestion control scheme; it adapts the congestion window using a gamma-correction function so that near-deadline flows take more bandwidth than far-deadline flows (see the sketch below). Preemptive Distributed Quick (PDQ) [13] is a flow scheduling algorithm designed to complete flows quickly and meet flow deadlines. PDQ emulates Shortest Job First (SJF) to give higher priority to short flows, and provides a distributed algorithm by allowing each switch to propagate flow information to others via explicit feedback in packet headers. DeTail [9] is an in-network, multipath-aware congestion control mechanism that takes a traffic engineering approach to reduce the flow completion time tail.
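For readers unfamiliar with the gamma-correction idea, the following sketch contrasts a DCTCP-style cut, W ← W(1 − α/2), with a D2TCP-style cut, W ← W(1 − p/2) with p = α^d, where α is the DCTCP congestion estimate and d is a deadline imminence factor (d > 1 for near-deadline flows, d < 1 for far-deadline flows). This follows the published descriptions of DCTCP [1] and D2TCP [12]; the concrete numbers are illustrative only.

```c
/* Illustrative comparison of the DCTCP and D2TCP window cuts (values are
 * examples, not measurements).  alpha is DCTCP's estimate of the fraction
 * of ECN-marked packets; d is D2TCP's deadline imminence factor.
 */
#include <stdio.h>
#include <math.h>

static double dctcp_cut(double w, double alpha)
{
    return w * (1.0 - alpha / 2.0);
}

static double d2tcp_cut(double w, double alpha, double d)
{
    double p = pow(alpha, d);      /* gamma-correction penalty */
    return w * (1.0 - p / 2.0);
}

int main(void)
{
    double w = 100.0, alpha = 0.5; /* window (segments) and congestion estimate */

    printf("DCTCP:                %.1f -> %.1f\n", w, dctcp_cut(w, alpha));
    printf("D2TCP near deadline:  %.1f -> %.1f (d = 2.0)\n", w, d2tcp_cut(w, alpha, 2.0));
    printf("D2TCP far  deadline:  %.1f -> %.1f (d = 0.5)\n", w, d2tcp_cut(w, alpha, 0.5));
    return 0;
}
```

With these example numbers, the near-deadline flow backs off less than plain DCTCP and the far-deadline flow backs off more, which is exactly the bandwidth shift toward near-deadline flows described above.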


DeTail employs (i) link-layer flow control (LLFC) to react to congestion more quickly, (ii) per-packet adaptive load balancing (ALB) to spread traffic across the available paths (i.e., to support multipath data transfer), and (iii) priority mechanisms for traffic differentiation.

We note that, compared with DIATCP, the existing approaches have the following weaknesses:

• Network-based solutions such as D3, PDQ, and DeTail require costly and/or customized hardware in the network, which is a significant hurdle for deployment.
• Host-based solutions such as DCTCP and D2TCP cannot support flow quenching.
• DCTCP and D2TCP require ECN functionality, which is still not supported by some ToR switches [34].
• D3 suffers from the priority inversion problem, as described in [12].

7. Concluding remarks

In this paper, we propose DIATCP, a new Deadline and Incast Aware TCP designed for cloud data center networks. Unlike existing approaches, which are either host-based or network-based, we design an aggregator-based solution. Our insight is that, under the Partition/Aggregate traffic pattern, the main bottleneck is normally the last hop between the aggregator and the ToR switch, so the aggregator is aware of the bottleneck link capacity as well as the traffic on that link. DIATCP therefore controls the peers' sending rates directly, both to avoid incast congestion and to meet the cloud application's deadline. We implement a prototype of the DIATCP algorithm in the Linux kernel. Through extensive experiments on our data center testbed and in simulation, we confirm that DIATCP outperforms the existing solutions in all experiments, in terms of both deadline awareness and incast avoidance. We believe that the aggregator-based approach is a promising direction for designing data center protocols, and DIATCP is one example optimized particularly for data center applications' requirements. As future work, we plan to design an optimized tuning algorithm for Gwnd based on mathematical analysis.

Acknowledgements

This work was supported by the Seoul R&BD Program (WR080951) funded by the Seoul Metropolitan Government. This work was also supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2013R1A1A1006823). The authors thank Dr. Jeongran Lee for the valuable comments and discussions.

References

[1] M. Alizadeh, A. Greenberg, D.A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, M. Sridharan, Data Center TCP (DCTCP), in: Proc. of ACM SIGCOMM, 2010.
[2] S. Stefanov, YSlow 2.0, December 2008.

[3] T. Hoff, Latency Is Everywhere And It Costs You Sales – How To Crush It, July 2009.
[4] A. Phanishayee, E. Krevat, V. Vasudevan, D.G. Anderson, G.R. Ganger, G.A. Gibson, S. Seshan, Measurement and analysis of TCP throughput collapse in cluster-based storage systems, in: Proc. of USENIX FAST, 2008.
[5] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, R. Chaiken, The nature of datacenter traffic: measurements & analysis, in: Proc. of ACM IMC, 2009.
[6] Y. Chen, R. Griffith, J. Liu, R.H. Katz, A.D. Joseph, Understanding TCP incast throughput collapse in datacenter networks, in: Proc. of ACM WREN, 2009.
[7] H. Wu, Z. Feng, C. Guo, Y. Zhang, ICTCP: incast congestion control for TCP in data center networks, in: Proc. of ACM CoNEXT, 2010.
[8] J. Hwang, J. Yoo, N. Choi, IA-TCP: a rate based incast-avoidance algorithm for TCP in data center networks, in: Proc. of IEEE ICC, 2012.
[9] D. Zats, T. Das, P. Mohan, R. Katz, DeTail: reducing the flow completion time tail in datacenter networks, in: Proc. of ACM SIGCOMM, 2012.
[10] M. Alizadeh, A. Kabbani, T. Edsall, B. Prabhakar, A. Vahdat, M. Yasuda, Less is more: trading a little bandwidth for ultra-low latency in the data center, in: Proc. of USENIX NSDI, 2012.
[11] C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron, Better never than late: meeting deadlines in datacenter networks, in: Proc. of ACM SIGCOMM, 2011.
[12] B. Vamanan, J. Hasan, T.N. Vijaykumar, Deadline-aware datacenter TCP (D2TCP), in: Proc. of ACM SIGCOMM, 2012.
[13] C.-Y. Hong, M. Caesar, P.B. Godfrey, Finishing flows quickly with preemptive scheduling, in: Proc. of ACM SIGCOMM, 2012.
[14] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in: Proc. of USENIX OSDI, 2004.
[15] M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly, Dryad: distributed data-parallel programs from sequential building blocks, in: Proc. of EuroSys, 2007.
[16] D. Beaver, S. Kumar, H.C. Li, J. Sobel, P. Vajgel, Finding a needle in Haystack: Facebook's photo storage, in: Proc. of USENIX OSDI, 2010.
[17] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. Vogels, Dynamo: Amazon's highly available key-value store, in: Proc. of ACM SOSP, 2007.
[18] D. Katabi, M. Handley, C. Rohrs, Congestion control for high bandwidth-delay product networks, in: Proc. of ACM SIGCOMM, 2002.
[19] K. Ramakrishnan, S. Floyd, D. Black, The Addition of Explicit Congestion Notification (ECN) to IP, RFC 3168, IETF, September 2001.
[20] T. Benson, A. Akella, D.A. Maltz, Network traffic characteristics of data centers in the wild, in: Proc. of ACM IMC, 2010.
[21] S. Floyd, M. Handley, J. Padhye, J. Widmer, TCP Friendly Rate Control (TFRC): Protocol Specification, RFC 5348, IETF, September 2008.
[22] L.S. Brakmo, L. Peterson, TCP Vegas: end to end congestion avoidance on a global internet, IEEE J. Sel. Area. Commun. 13 (8) (1995) 1465–1480.
[23] J.P. Royston, An extension of Shapiro and Wilk's W test for normality to large samples, Appl. Stat. 31 (2) (1982) 115–124.
[24] B.W. Silverman, Density Estimation, Chapman and Hall, London, 1986.
[25] S.J. Sheather, M.C. Jones, A reliable data-based bandwidth selection method for kernel density estimation, J. Roy. Stat. Soc.: Ser. B 53 (3) (1991) 683–690.
[26] The ns-3 Discrete-Event Network Simulator.
[27] S. Floyd, V. Jacobson, Random early detection gateways for congestion avoidance, IEEE/ACM Trans. Network. 1 (4) (1993) 397–413.
[28] Data Center TCP.
[29] The D2TCP slide at ACM SIGCOMM 2012.
[30] Summit X460 Switches.
[31] Summit X670 Switches.
[32] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D.G. Anderson, G.R. Ganger, G.A. Gibson, B. Mueller, Safe and effective fine-grained TCP retransmissions for datacenter communication, in: Proc. of ACM SIGCOMM, 2009.


[33] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, A. Warfield, Xen and the art of virtualization, in: Proc. of ACM SOSP, 2003.
[34] R.R. Stewart, M. Tüxen, G.V. Neville-Neil, An investigation into data center congestion with ECN, in: Proc. of BSDCan, 2011.

Jaehyun Hwang received the B.S. degree in computer science from the Catholic University of Korea, Seoul, Korea, in 2003, and the M.S. and Ph.D. degrees in computer science from Korea University, Seoul, Korea, in 2005 and 2010, respectively. His research background is mainly in TCP, focusing on flexible TCP structures, advanced TCP flavors, and their performance. Since September 2010, he has been with the networking research domain at Bell Labs, Alcatel-Lucent, as a member of technical staff. His current research interests include data center networks, software-defined networking, multipath TCP, and HTTP adaptive streaming.


Nakjung Choi received the B.S. and Ph.D. degrees in computer science and engineering from Seoul National University (SNU), Seoul, Korea, in 2002 and 2009, respectively. From September 2009 to April 2010, he was a postdoctoral research fellow in the Multimedia and Mobile Communications Laboratory, SNU. Since April 2010, he has been a member of technical staff at Alcatel-Lucent, Bell Labs Seoul. His research interests include the Future Internet, such as content-centric networking and green networking, as well as mobile/wireless networks, such as wireless LANs and wireless mesh networks.

Joon Yoo received his B.S. in Mechanical Engineering from Korea Advanced Institute of Science and Technology (KAIST), and Ph.D. in Electrical Engineering and Computer Science from Seoul National University in 1997 and 2009, respectively. He worked as a postdoctoral researcher at the University of California, Los Angeles in 2009 and then worked at Bell Labs, Alcatel-Lucent as a Member of Technical Staff from 2010 to 2012. Since 2012, he has been with the department of Software Design and Management at Gachon University as an assistant professor. His research interests include vehicular networks, cloud data center networks, and IEEE 802.11.
