Deadline and Incast Aware TCP for cloud data center networks

Jaehyun Hwang a, Joon Yoo b,*, Nakjung Choi a

a Bell Labs, Alcatel-Lucent, 7fl. DMC R&D Center, 1649 Sangam-dong, Mapo-gu, Seoul 121-904, Republic of Korea
b Department of Software Design & Management, Gachon University, Bokjeong-dong, Sujeong-gu, Seongnam-si, Gyeonggi-do 461-701, Republic of Korea

Article history: Received 15 May 2013; Received in revised form 1 November 2013; Accepted 4 December 2013; Available online xxxx

Keywords: Cloud data center networks; Partition/Aggregate pattern; Deadline awareness; Incast congestion

Abstract

Nowadays, cloud data centers have become a key resource for providing a plethora of rich online services such as social networking and cloud computing. Cloud data center applications typically follow the Partition/Aggregate traffic pattern based on a tree-like logical topology, where an aggregator node may gather response data from thousands of worker nodes. One of the key challenges for such applications, however, is to meet their soft real-time constraints. In this paper, we introduce the design and implementation of DIATCP, a new transport protocol that is both deadline-aware and incast-avoidable for cloud data center applications. Prior work achieves deadline awareness through host-based or network-based approaches, but these are either imperfect in meeting their deadlines or have weaknesses in practical deployment. In contrast, DIATCP is deployed only at the aggregator, which directly controls the peers' sending rates to avoid incast congestion and, more importantly, to meet the application deadline. This design rests on the key observation that, under the Partition/Aggregate traffic pattern, the aggregator knows the bottleneck link status as well as its workers' information. Through detailed ns-3 simulations and real testbed experiments, we show that DIATCP significantly outperforms the previous protocols in the cloud data center environment.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Modern-day cloud data centers are steadily progressing as a key resource for providing many online services, including web search, social networking, cloud computing, financial services, and recommendation systems. These services generally need to deliver results to end-users in a timely manner, so they often have soft real-time constraints that originate from service level agreements (SLAs), which greatly affect user experience and service provider revenue [1–3]. A major portion of the delay is caused by intra-data center communication, which must be kept efficient, as most of the computing and storage resources are located inside the local data centers.

* Corresponding author. Tel.: +82 31 750 5832. E-mail addresses: [email protected] (J. Hwang), joon.[email protected] (J. Yoo), [email protected] (N. Choi).

Unfortunately, it has been reported that the conventional transport protocol, TCP, does not work well in this setting; it forms a communication bottleneck in commercial data centers and, as a result, causes serious performance degradation [4,5,1]. There are several reasons for this. First, cloud data center services often follow the Partition/Aggregate traffic pattern based on a tree-like logical topology, where typically thousands of servers participate to achieve high performance. Here, the numerous servers, called workers, send their response data to a single point called the aggregator, which causes a burst of traffic at the aggregator. Second, the Top-of-the-Rack (ToR) switches to which the aggregators are connected are shallow-buffered, normally having only a 3–4 MB shared packet buffer memory.


Sometimes this shallow buffer size is not enough to handle such a burst of traffic, resulting in buffer overflow, which is called incast congestion. Third, the retransmission timeout used to detect the incast congestion (i.e., packet losses) is too long, as TCP is fundamentally designed for wide area networks. For example, RTOmin is generally set to 200–300 ms, but the actual round-trip times (RTTs) are only hundreds of μs in data center networks. In summary, legacy TCP suffers from incast congestion, showing low aggregate goodput and long query completion times.

Most of the prior work [6,1,7–10] has mainly focused on reducing the flow completion time of the tail.1 In order for the application to meet the deadline defined by its SLA, however, all the flows between the aggregator and its workers should be completed within the application deadline. Existing fair-share based solutions like DCTCP [1] are oblivious to the application deadline, so they may not meet the application deadlines even though they reduce the overall flow completion time. Some recent work such as D3 [11] and D2TCP [12] introduces deadline awareness in designing data center protocols; these protocols try to allocate differentiated bandwidth based on the flow size and the deadline. However, they are either impractical to deploy (D3) or imperfect in meeting the application deadline (D2TCP).

In this paper, we propose a new Deadline and Incast Aware TCP, called DIATCP, which efficiently meets the deadline requirements of online data center applications. Our main contribution is a new design approach that not only overcomes, but also leverages, the Partition/Aggregate traffic pattern of data center applications. While recent solutions for data center networks take the form of either host-based or network-based approaches, DIATCP operates at the aggregator, a so-called aggregator-based approach, which performs admission and flow control to achieve our goal. The key observation is that the aggregator can effectively obtain rich information such as the bottleneck link bandwidth and the workers' flow information, including data sizes and deadlines. Therefore, DIATCP does not require any support from the network switches – implementation and deployment are easy. Further, the incoming network traffic to the aggregator can be managed very efficiently – traffic control is centralized.

We conduct detailed evaluations through ns-3 simulations to compare DIATCP with previous solutions such as DCTCP [1] and D2TCP [12]. We have also built a prototype of the DIATCP algorithm on the Linux kernel 2.6.38 and evaluated it on our real data center testbed, which consists of 46 servers, 3 ToR switches, and one Aggregation switch. The evaluation results of DIATCP are:

• There are no missed deadlines under the real trace-based simulations and very few missed deadlines (under 1%) in the testbed experiments.
• There are no TCP timeouts in either the simulations or the testbed experiments.

• Even for non-deadline flows that do not follow the Partition/Aggregate pattern, the aggregate goodput is comparable.
• For the prototype implementation, only a small modification (about 265 lines) at the aggregator is required. In contrast, the network-based approaches require high-cost custom hardware chips (D3 [11], DeTail [9], and PDQ [13]).
• Flow quenching is supported (unlike D2TCP [12]).
• ECN support is unnecessary (unlike DCTCP [1] and D2TCP [12]).
• No priority inversion is incurred (unlike D3 [11]).

The rest of the paper is organized as follows. In Section 2, we briefly review today's data center communications and describe the previous deadline-aware approaches, followed by our motivation. Section 3 explains our DIATCP algorithm in detail. In Section 4, we present our experimental results obtained with the ns-3 simulator and our data center testbed. We discuss some design issues in Section 5, and Section 6 presents related work. Finally, we conclude the paper in Section 7.

2. Overview

We briefly review today's data center communications and the major problems that occur at the transport layer. To motivate our design, we explain some of the existing approaches and their limitations. Finally, we describe the aggregator-based approach that is the key idea of our new proposal.

2.1. Data center applications and communications

Partition/Aggregate traffic pattern: It is known that cloud data center applications these days often follow the Partition/Aggregate traffic pattern, shown in Fig. 1. Such applications include web search [1], MapReduce [14], Dryad [15], social networking [16], and recommendation systems [17]. Under this traffic pattern, a user request is first sent to the top-level aggregator and then partitioned into several pieces that are distributed to the lower-level aggregators. The request finally arrives at the worker nodes via the last-level aggregators in the data center network. If there is an SLA between the service provider and its users, the final results should be delivered to the end-users within the SLA, typically 200–300 ms within a data center [17,1].

1 In general, all response data from the workers need to be aggregated to produce a meaningful result, so the query completion time is determined by the most congested connection, i.e., the tail.

Fig. 1. Example of the Partition/Aggregate design pattern with different response deadlines at each layer (e.g., 300 ms at the root aggregator, 100 ms at the mid-level aggregators, and 40 ms at the workers).


So the aggregator at each level needs to aggregate the results from its child nodes within an intermediate deadline, which must be earlier than the final deadline (i.e., the SLA). Note that response data from child nodes that miss the deadline are discarded, which significantly degrades the quality of the results and hence the operator revenue. For example, an extra 100 ms of latency at Amazon.com is estimated to result in a 1% drop in sales [3]. Similarly, an extra latency of 500 ms in Google's search drops traffic by 20% [3], and Yahoo! sees a 5–9% drop in traffic with 400 ms of added latency [2]. Therefore, it is very important to meet the deadlines in today's data center communications. In this paper, we deal with 20–50 ms deadlines, as we focus only on the last-hop communication between the last-level aggregator and the workers.

Data center traffic: According to [1], the data center workload can be categorized into three types: bursty query traffic, short message flows, and large flows. The query traffic consists of latency-critical flows, as it follows the Partition/Aggregate pattern, and its flow size is only a few KB (e.g., 2 KB). The short message flows are normally used to update control state on the workers and are also time-sensitive; their sizes are between 50 KB and 1 MB. Lastly, there are a few large background flows (the median number of concurrent large flows is 1), whose sizes are 1–50 MB. The main role of these long flows is to copy new or updated data to the workers, so they are typically throughput-driven and have no deadlines.

2.2. Previous deadline-aware approaches

There are several congestion control protocols that aim to meet the real-time application constraints in data centers. The design philosophy behind these approaches has been inherited from two methodologies: (i) network-based approaches, which are generally implemented at the switches, and (ii) host-based approaches, which are performed at the end-host side. Our main insight, however, is that we can conceptually take a hybrid approach by leveraging the communication pattern of the data center applications. In particular, the Partition/Aggregate traffic pattern enables the aggregator to access both the bottleneck link information and the peer information. We compare the two existing approaches first, and then discuss our new approach in the next subsection.

Network-based approach: In the network-based approach, the network-side elements such as switches and routers are regarded as centralized components, since they can directly monitor the network conditions. Furthermore, they can collect the flow information and the flows' real-time requirements (e.g., deadlines), so it is possible to allocate guaranteed bandwidth to each flow. Well-known examples that behave in this way are XCP [18] and D3 [11]. In addition, suppose the congestion at the bottleneck link is so severe that its capacity cannot support finishing all flows before their deadlines. It is then preferable to quench some of the flows so that the remaining flows can meet their deadlines [11]. The network switch can also support this flow quenching by adequately conducting the total bandwidth allocation.


In summary, the network-based approach generally performs well because it is based on accurate network information, but its weak point is always practical deployment: it requires costly upgrades to the network infrastructure, such as the routers. For instance, D3 needs to change almost every network element, including switches, end-hosts, and applications [11].

Host-based approach: The main advantage of the host-based approach is that it is relatively easy and simple to implement and deploy; it can be realized by modifying each end-host via a simple software upgrade. However, the host-based approach still needs feedback information from the network, since the end-hosts determine their actions based on the network congestion level. One key feedback signal is the round-trip time (RTT), as the queuing delay included in the RTT directly reflects the network congestion. Unfortunately, it has been reported that RTT measurement in data center networks is not that reliable [1,7]. The RTT is normally hundreds of μs in data center networks, so it can fluctuate severely with even a small amount of computing/scheduling delay at the server machines. Thus host-based approaches such as DCTCP actively utilize Explicit Congestion Notification (ECN) [19] in their congestion control algorithms; ECN is employed so that the congestion information is fed back to the end host. Further, flow quenching cannot be supported in host-based approaches such as D2TCP, since the end-hosts cannot determine when they should be paused or stopped without help from a centralized controller. Most importantly, the host-based approach provides some level of deadline awareness, but does not guarantee that the flows meet their deadlines. This is the major drawback compared to the network-based approach.

2.3. Aggregator-based approach

As reported in previous studies [6,1,20,11], one of the main bottleneck points in data center communications is at the ToR switch, specifically, the ToR switch interface to the aggregator. This occurs because the numerous workers try to send their response data to the same aggregator through the shallow-buffered ToR switch at the same time. We assume that the link capacity between the aggregator and the ToR switch is known in advance, as its typical value is 1 Gbps or 10 Gbps in today's data center environments. We also assume that the aggregator keeps track of the traffic information from all its peers. This is a realistic assumption because, in the Partition/Aggregate traffic pattern, a request is sent to the workers through the aggregator. So the aggregator is capable of managing the peer traffic information, and hence the deadline information is also available. In other words, even though the aggregator may be an end-host, it can still acquire this rich information, which is traditionally viewed as available only to network nodes such as switches or routers. Therefore, we can combine the advantages of the two aforementioned approaches: (i) no support is required from the switches – implementation and deployment are easy, and (ii) the incoming network traffic to the aggregator can be managed effectively given the bottleneck link capacity – traffic control is centralized.


2.4. Effect of incast congestion

Existing deadline-aware protocols like D3 and D2TCP aim only to meet the flow deadlines. Our insight is that incast congestion can severely damage the performance of online applications in cloud data center networks; it increases the network delay as well as the query completion time, so a deadline-aware algorithm may not work properly in heavily congested situations. For instance, D2TCP introduces a deadline factor into DCTCP's congestion control algorithm, but DCTCP is not a perfect solution for avoiding incast congestion.2 This implies that D2TCP is not a robust scheme in congested network situations.3 Moreover, if RTOmin is large, the application deadline can be missed with even a single TCP timeout. Furthermore, we found that some loose application deadlines can be met easily even without deadline-awareness features if the incast congestion is avoided. DeTail [9] takes a similar approach; it reduces the flow completion time tail by performing traffic engineering, even though it does not support deadline awareness. Therefore, it is essential to take incast congestion as well as deadline awareness into account in order to achieve our ultimate goal.

2 DCTCP suffers from incast congestion when the number of workers exceeds 35 in their experimental environment [1].
3 D2TCP still incurs missed deadlines under heavily congested conditions [12].

3. Deadline and Incast Aware TCP (DIATCP)

DIATCP consists of two functions – incast awareness and deadline awareness. We first give the design overview of DIATCP, and then explain the detailed algorithms. Lastly, we discuss some design issues, such as how to choose important parameters like Gwnd in practice.

3.1. Design overview

We categorize the flows in data center networks into non-deadline flows, which have no specific deadline on their flow completion time, and deadline flows, which are supposed to be completed within a specific deadline. Such deadlines are generally defined by applications in user space and delivered to the TCP layer via existing socket APIs such as setsockopt(). We therefore assume that the application deadlines and response data sizes are given to the TCP layer before the data transfer begins. Note again that the key idea of DIATCP is that the aggregator node can monitor all the incoming traffic to itself. Therefore, the DIATCP algorithm works only on the incoming traffic from peers to the aggregator. The outgoing traffic from the aggregator can be managed by the destination nodes of that traffic, so we do not consider it in this paper. Thus, the response data size means the amount of data that the peer transmits to the aggregator.
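As a sketch of this interface, the snippet below shows how a worker application might hand its deadline and response size to the transport layer before transmitting. The option names TCP_DIATCP_DEADLINE and TCP_DIATCP_SIZE are hypothetical placeholders for illustration; the paper only states that existing APIs such as setsockopt() carry this information.

```python
import socket

# Hypothetical option numbers; DIATCP's real kernel patch is not shown in the paper.
TCP_DIATCP_DEADLINE = 0x10  # assumed: remaining deadline in milliseconds
TCP_DIATCP_SIZE = 0x11      # assumed: response data size in bytes

def register_deadline_flow(sock, deadline_ms, response_bytes):
    """Pass the application deadline and response size down to the TCP layer."""
    sock.setsockopt(socket.IPPROTO_TCP, TCP_DIATCP_DEADLINE, deadline_ms)
    sock.setsockopt(socket.IPPROTO_TCP, TCP_DIATCP_SIZE, response_bytes)

# Example: a worker that must deliver a 10 KB response within 40 ms.
# sock = socket.create_connection(("aggregator.example", 5001))
# register_deadline_flow(sock, deadline_ms=40, response_bytes=10 * 1024)
```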

Deadline and Incast Aware algorithm: Our algorithm is motivated by IA-TCP [8], which is designed to operate at the aggregator side by means of TCP acknowledgement (ACK) regulation to control the TCP data sending rate of the workers. DIATCP also adopts the ACK regulation scheme for similar purposes. We utilize the advertisement window field4 in the TCP ACK header to allocate a specific window size to each peer. By doing this, we control the total amount of traffic so as not to overflow the bottleneck link – details are explained in Section 3.2.

Algorithm 1 presents the overall DIATCP algorithm employed at the aggregator. First, we represent each application connection by an abstract node, which includes information such as the application deadline, data size, and allocated DIATCP window size (Fig. 2). A node is inserted into the Connection list whenever a new connection is created; when a connection is closed, the node is deleted from the list (lines 2 and 6). Thereafter, we update the allocated window sizes whenever there is a change in the Connection list (lines 3 and 7). For this, we develop a new Global window allocation scheme that allocates the Global window based on the deadline and data size – details are explained in Section 3.3. Lastly, by accessing each node's information, the advertisement window in the ACK header is set to the allocated window size (line 10).

Algorithm 1. DIATCP algorithm at the aggregator
1: On creating a new connection:
2:   Insert a node to the Connection list
3:   Call Global_window_allocation()
4:
5: On closing a connection:
6:   Delete a node from the Connection list
7:   Call Global_window_allocation()
8:
9: Sending an ACK:
10:  Set Advertisement window to allocated_window
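A minimal user-space sketch of Algorithm 1, assuming a per-connection node object and a stubbed-out Global window allocation (sketched in Section 3.3); the names are ours, and the real implementation lives in the kernel's ACK path.

```python
class Node:
    """Per-connection state kept by the aggregator (cf. Fig. 2)."""
    def __init__(self, deadline_ms, size_bytes):
        self.deadline = deadline_ms  # 0 denotes a non-deadline flow
        self.size = size_bytes       # remaining response data in bytes
        self.win = 0                 # window (in packets) set by the allocator

connection_list = []                 # kept in EDF order (see below)

def global_window_allocation(conns):
    pass                             # placeholder; see the sketch in Section 3.3

def on_new_connection(node):         # Algorithm 1, lines 1-3
    connection_list.append(node)
    global_window_allocation(connection_list)

def on_close_connection(node):       # Algorithm 1, lines 5-7
    connection_list.remove(node)
    global_window_allocation(connection_list)

def advertised_window_bytes(node, mss=1500):
    # Algorithm 1, line 10: the ACK advertises the allocated window,
    # converted to bytes as carried in the TCP header.
    return node.win * mss
```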

The priority order of the Connection list may be determined by a real-time scheduling policy such as EDF (Earliest Deadline First) or SJF (Shortest Job First). In this paper, we employ the EDF policy, as it is known to minimize the number of late tasks, i.e., to minimize the number of flows that miss their deadlines. To employ EDF, the Connection list is ordered in ascending order of deadlines, with ties broken by flow arrival time. When a new connection arrives, it is immediately inserted into the appropriate slot in the list. As a result, existing lower-priority flows are preempted by the new flow. For example, if a very urgent flow is inserted into the Connection list, it will be placed in front of the existing, less urgent nodes by the Global window allocation.
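The EDF ordering above can be sketched as follows; we assume each node carries its deadline and records its arrival time, and that non-deadline flows (deadline = 0) always sort last, which matches the allocation order in Section 3.3.

```python
import bisect
import math
import time

def edf_key(node):
    # Earlier deadlines first; non-deadline flows (deadline == 0) go last;
    # ties between equal deadlines are broken by arrival time.
    return (node.deadline if node.deadline > 0 else math.inf, node.arrival)

def insert_edf(connection_list, node):
    node.arrival = time.monotonic()
    keys = [edf_key(n) for n in connection_list]
    connection_list.insert(bisect.bisect_right(keys, edf_key(node)), node)
```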

4 The TCP receiver sets the advertisement window based on its receiving capacity to conduct flow control. The TCP sender uses this information to set the sending window size.


3.2. Incast awareness

Data center applications generally induce a large number of concurrent TCP connections that share a common bottleneck link, resulting in buffer overflow at the ToR switch. To avoid such incast congestion, the aggregate sending rates of all flows, including the deadline and non-deadline flows, must not exceed the (bottleneck) link capacity:

$$\sum_{i=1}^{n} Rate^{d}_{i} + \sum_{j=1}^{m} Rate^{nd}_{j} \le \text{Link capacity} \qquad (1)$$

where Rate_i^d is the sending rate of the ith deadline flow and Rate_j^nd is that of the jth non-deadline flow; n and m are the numbers of deadline and non-deadline flows, respectively. Such rate control can be implemented in many ways [21,22,8], but we implement it by newly defining the Global window size (Gwnd) and DiaDelay. Gwnd is the sum of the sending window sizes of all the peer connections to the aggregator. DiaDelay is the common artificial RTT that all the connections should maintain. We express the aggregate sending rate, i.e., the left-hand side of (1), for all the peers transmitting data to the aggregator. The aggregate sending rate should match the link capacity from the ToR switch to the aggregator (i) to avoid incast congestion and (ii) to maintain goodput. So we have

$$\frac{Gwnd \times MSS}{DiaDelay} = \text{Link capacity} \qquad (2)$$

where MSS denotes the Maximum Segment Size. Here, DiaDelay can be regarded as a time slot, like a global RTT. So if the number of outstanding packets (each of size MSS) is Gwnd during DiaDelay, and Gwnd/DiaDelay matches the link capacity, then the aggregate sending rate does not cause buffer overflow at the ToR switch. Fig. 3 shows an example of the incast awareness operation based on ACK regulation. First, the aggregator adds an ACK delay of DiaDelay − measured_RTT_i to each connection, where measured_RTT_i is the RTT of the ith flow measured at the aggregator side, so that all peers have the same RTT, namely DiaDelay. Second, it allocates a proper advertisement window w_i to each peer so that $\sum_{i=1}^{n} w_i = Gwnd$, where n is the total number of peers and w_i is the window size of the ith peer. The aggregator controls Gwnd so that (2) is met and incast congestion is avoided.
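To make (2) and the ACK pacing concrete, the sketch below recomputes the operating point used later in the paper (Gwnd = 18, MSS = 1.5 KB, and a 1 Gbps bottleneck give DiaDelay = 216 μs) and the artificial ACK delay added per connection; the helper names are ours.

```python
MSS_BYTES = 1500
LINK_CAPACITY_BPS = 1_000_000_000   # 1 Gbps from the ToR switch to the aggregator

def dia_delay_us(gwnd, mss=MSS_BYTES, capacity_bps=LINK_CAPACITY_BPS):
    # From (2): Gwnd * MSS / DiaDelay = capacity  =>  DiaDelay = Gwnd * MSS / capacity.
    return gwnd * mss * 8 / capacity_bps * 1e6

def ack_delay_us(dia_delay, measured_rtt_us):
    # Delay each connection's ACKs so that its effective RTT becomes DiaDelay.
    return max(0.0, dia_delay - measured_rtt_us)

print(dia_delay_us(18))              # 216.0 us, matching Table 1
print(ack_delay_us(216.0, 180.0))    # a peer measured at 180 us gets a 36 us ACK delay
```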

3.3. Deadline awareness

The deadline awareness in DIATCP is realized by the Global window allocation algorithm, which we explain here in detail. Our basic strategy is to give higher priority to the deadline flows first, and then allocate the remaining bandwidth to the other flows in a fair-share manner. The total window size, i.e., Gwnd, is determined by (2). If a flow is to complete before its deadline, its window must satisfy:

$$\text{Window requirement} = \frac{s}{d} \times DiaDelay \qquad (3)$$

where s is the remaining data size to transmit and d is the remaining time until the deadline. Algorithm 2 presents the Global window allocation algorithm. Lines 5–9 implement the initial allocation, which corresponds to (3); the window size is allocated so that each flow meets its deadline.


We allocate one window to non-deadline flows so as to cover as many non-deadline flows as possible. Note again that we employ EDF: the Connection list is ordered by giving priority to the earliest-deadline flows, so the flows with earlier deadlines are allocated first. If there are remaining windows after the initial allocation, they are reallocated later in a fair-share manner (lines 31–33). Assuming that a flow that misses its deadline is meaningless, we drop the flow if the deadline is almost missed or the window requirement is larger than Gwnd (lines 12–13). Next, if the initial window requirement is acceptable (line 15), we allocate that window to the node and, if the previously allocated window size was zero, send an ACK in order to advertise the non-zero window (i.e., resume the paused flow) (lines 15–21). If there is no available window for the node, a zero window is allocated and the flow is eventually paused (lines 22–25).

Algorithm 2. Global window allocation
node.deadline: remaining time until deadline
node.size: remaining data size
node.win: allocated window

1: total_alloc = 0
2: node ← the first node in the Connection list
3:
4: while node exists do
5:   if node.deadline > 0 then
6:     alloc = node.size / node.deadline × DiaDelay
7:   else
8:     /* node.deadline = 0 for non-deadline flows */
9:     alloc = 1
10:  end if
11:
12:  if node.deadline expires || alloc > Gwnd then
13:    Drop the flow that corresponds to the node
14:  else
15:    if total_alloc + alloc ≤ Gwnd then
16:      total_alloc = total_alloc + alloc
17:      node.win = alloc
18:      if previous node.win was zero then
19:        /* non-zero window advertisement */
20:        Send an ACK
21:      end if
22:    else
23:      /* zero window advertisement */
24:      node.win = 0
25:    end if
26:  end if
27:
28:  node ← the next node in the Connection list
29: end while
30:
31: if total_alloc < Gwnd then
32:   allocate the remaining window to flows that have a non-zero window in a fair-share manner
33: end if
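A user-space sketch of Algorithm 2, reusing the Node fields and EDF-ordered list introduced above. The test for an "almost missed" deadline and the fair-share redistribution of the leftover window (lines 31–33) are implemented here as one plausible reading of the pseudocode, not as the paper's exact kernel code.

```python
def global_window_allocation(connection_list, gwnd, dia_delay_us, mss=1500,
                             send_ack=lambda node: None,
                             drop_flow=lambda node: None):
    total_alloc = 0
    for node in list(connection_list):
        if node.deadline > 0:
            # Eq. (3): window (in packets) needed to ship the remaining bytes
            # before the deadline, rounded to at least one packet (our choice).
            bytes_per_us = node.size / (node.deadline * 1000.0)  # deadline in ms
            alloc = max(1, round(bytes_per_us * dia_delay_us / mss))
        else:
            alloc = 1                         # lines 8-9: non-deadline flows

        almost_missed = 0 < node.deadline <= dia_delay_us / 1000.0
        if almost_missed or alloc > gwnd:     # lines 12-13
            drop_flow(node)                   # or demote it; see the discussion below
            continue

        if total_alloc + alloc <= gwnd:       # lines 15-21
            total_alloc += alloc
            was_paused = (node.win == 0)
            node.win = alloc
            if was_paused:
                send_ack(node)                # non-zero window advertisement
        else:                                 # lines 22-25
            node.win = 0                      # zero window: pause the flow

    # Lines 31-33: distribute any leftover window among the active flows.
    active = [n for n in connection_list if n.win > 0]
    leftover = gwnd - total_alloc
    if leftover > 0 and active:
        share, extra = divmod(leftover, len(active))
        for i, n in enumerate(active):
            n.win += share + (1 if i < extra else 0)
```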


Fig. 2. Example of the Connection list: each node represents the connection information, e.g., deadline, data size, and window size – for instance (20 ms, 10 KB, win 4), (35 ms, 20 KB, win 3), (40 ms, 30 KB, win 2), (null, 10 MB, win 1), and (null, 20 MB, win 0). The window size is computed by the Global window allocation and the list is ordered by the EDF policy; if a zero window becomes non-zero, an ACK is sent to advertise the non-zero window.

Fig. 3. DIATCP operation: Global window allocation and ACK pacing.

We note that there are several ways to implement the flow dropping (line 13). The easiest way is to simply send a FIN packet as in [11], but in our implementation, we instead give the flow the lowest priority (i.e., a zero deadline) rather than closing the connection, so that it is promptly resumed after all other deadline flows are completed.
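A sketch of this demotion-style "drop": instead of sending a FIN, the node's deadline is cleared so that it sorts behind every deadline flow and is treated like a non-deadline flow by the allocator; the re-sort and the call into the allocator mirror the sketches above.

```python
import math

def demote_instead_of_drop(connection_list, node, reallocate):
    # Give the flow the lowest priority (treat it as a non-deadline flow)
    # rather than closing the connection, so it resumes once the
    # remaining deadline flows are done.
    node.deadline = 0
    connection_list.sort(
        key=lambda n: (n.deadline if n.deadline > 0 else math.inf, n.arrival))
    reallocate(connection_list)  # e.g., the Global window allocation sketch above
```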

3.4. Design issues in choosing parameters

The aggregator node keeps track of all the incoming flows, so it can fully monitor its bottleneck link at the ToR switch. Since this link capacity is a given value, we use pre-set values for Gwnd and DiaDelay, e.g., 30 and 360 μs, assuming the link capacity is 1 Gbps and the MSS is 1.5 KB. These values are acceptable if most of the measured RTTs between the peer servers are within 360 μs; in this case, we can add an ACK delay so that the final RTTs become DiaDelay, as explained in Section 3.2. The practical problem is that the RTT generally oscillates in data center environments, as computation and process scheduling times can greatly affect the transmission delay. So we would need to measure a live RTT to calculate an accurate ACK delay. Note that it is not meaningful to use the average or minimum RTT as the measured RTT, because the flow sizes are generally too small for us to obtain a stable value. Furthermore, a few spikes in the measurements can significantly affect the overall performance, as there are gaps of tens of μs between the average and minimum RTTs, as shown in Fig. 4.

In addition, measuring live RTTs at the aggregator side is difficult in practice unless there is data traffic in both directions. Lastly, the artificial ACK delay may accumulate over the following ACKs, leading to one big ACK delay. We could send an ACK for every two data packets to avoid this, as the existing delayed ACK algorithm does, but that results in inaccurate ACK delay measurements. For these reasons, we take a practical approach: we elect to use a predefined RTT value extracted from prior measurements and computation, rather than live measurements at microsecond granularity, and set DiaDelay according to this predefined RTT. For this purpose, we measured the RTTs between the aggregator node and each of the 45 servers 1000 times in our data center testbed, as shown in Fig. 4. Here, we found that naively taking the average RTT is not reasonable, because the RTT measurements do not follow a normal distribution according to the Shapiro–Wilk normality test [23]. In addition, the average RTT histogram for the servers shows negative skewness, which implies the average RTT may not be reliable. Hence, we estimate the probability density via a nonparametric method (e.g., the kernel density estimation method [24]), since we do not have prior knowledge about the true RTT distribution. Considering the RTT to be an independent and identically distributed (i.i.d.) random variable for each server, we use the kernel density estimation method, which is known to converge quickly to the true density, to obtain a smooth density estimator.5 As a result, the RTT with the highest probability in our density estimator is 200.11 μs, and DiaDelay is chosen based on this computation. Table 1 presents some candidates for the pre-set values of Gwnd and DiaDelay. To achieve full link utilization, a Gwnd of 17 or larger seems reasonable. We measure the aggregate goodput by transmitting 10 MB from each server to the aggregator while varying Gwnd from 13 to 20, as shown in Fig. 5. Finally, we use 18 for Gwnd throughout this paper, with the corresponding DiaDelay of 216 μs, based on Figs. 4 and 5 and Table 1. Note that DiaDelay can be set depending on the particular data center environment. We later validate that this practical approach works well in our real data center testbed.

5 We use the method of Sheather and Jones [25] to select the kernel bandwidth (2.335 μs in this paper).
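A sketch of this DiaDelay selection using SciPy's Gaussian KDE. Note that the paper selects the kernel bandwidth with the Sheather–Jones method [25], whereas gaussian_kde defaults to Scott's rule, so this approximates the procedure rather than reproducing it.

```python
import numpy as np
from scipy.stats import gaussian_kde

def pick_gwnd_and_dia_delay(rtt_samples_us, mss=1500, capacity_bps=1_000_000_000):
    """Estimate the RTT density, take its mode, and derive (Gwnd, DiaDelay)."""
    kde = gaussian_kde(rtt_samples_us)            # bandwidth: Scott's rule by default
    grid = np.linspace(min(rtt_samples_us), max(rtt_samples_us), 2000)
    rtt_mode_us = grid[np.argmax(kde(grid))]      # ~200.11 us in the paper's testbed

    # Smallest Gwnd whose matching DiaDelay (Eq. (2)) is at least the RTT mode,
    # so that most peers need a non-negative ACK delay.
    gwnd = int(np.ceil(rtt_mode_us * 1e-6 * capacity_bps / (mss * 8)))
    dia_delay_us = gwnd * mss * 8 / capacity_bps * 1e6
    return gwnd, dia_delay_us

# With the mode near 200.11 us this gives Gwnd = 17 and DiaDelay = 204 us;
# the paper ultimately uses Gwnd = 18 (DiaDelay = 216 us) based on Fig. 5 and Table 1.
```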

Fig. 4. Measurement of RTTs between the aggregator and 45 servers.


Table 1. Pre-set values of Gwnd and DiaDelay (μs) for 1 Gbps of link capacity. The MSS is 1.5 KB.

Gwnd:     14   15   16   17   18   19   20
DiaDelay: 168  180  192  204  216  228  240

4. Evaluation

In this section, we evaluate DIATCP first through larger-scale ns-3 simulations [26] and then on a small-scale 46-server data center testbed based on our Linux implementation. In the simulations, we compare DIATCP with legacy TCP (NewReno), DCTCP, and D2TCP.6 We also evaluate a fair-share version of DIATCP, denoted DIATCP-FS, to investigate how our congestion avoidance algorithm improves the performance of deadline flows, in terms of the number of missed deadlines, even without deadline awareness. DIATCP-FS also performs the Global window allocation, but it regards all flows as non-deadline flows.

4.1. Simulations

We have implemented DIATCP and DIATCP-FS in the ns-3 simulator (ver. 3.14) [26]. For comparison, DCTCP and D2TCP with ECN functionality7 are also implemented, based on the algorithms described in [1,12] and the DCTCP implementation available in [28].

6 ECN is an essential functionality for avoiding incast congestion in DCTCP and D2TCP. Unfortunately, the current operating system version of the switch in our testbed does not support ECN marking. Therefore, we only use ns-3 simulations to compare DIATCP with DCTCP and D2TCP.
7 The current version of ns-3 does not yet support ECN functionality in either RED (Random Early Detection) [27] or TCP.

Fig. 5. Aggregate goodput (Mbps) with varying Gwnd.

Fig. 6. Simulation topology.

Fig. 6 depicts the simulation topology. It consists of 5 racks; each rack has 45 servers and a ToR switch. All servers are connected to their ToR switch with 1 Gbps links, and the 5 ToR switches are connected to an Aggregation switch with 10 Gbps links. In addition, one server dedicated to the aggregator (i.e., the aggregator node) is attached to the first ToR switch with a 1 Gbps link. The link delay is set to 25 μs so that the average RTT is about 200 μs, a typical value in today's data center networks. The packet buffer size per port is 128 KB, assuming shallow-buffered ToR switches; the Aggregation switch has a deep packet buffer. For the key parameters of DCTCP and D2TCP, we set g, the weighted averaging factor, to 1/16 and K, the buffer occupancy threshold for marking CE bits, to 20 for 1 Gbps links and 65 for 10 Gbps links, according to [1]. For D2TCP, we set d, the deadline imminence factor, to be between 0.5 and 2.0 according to [12]. The RTOmin for all protocols is 20 ms, as production TCPs such as Google's use 20 ms within data centers [29]. The main purpose of the simulations is to compare our algorithm with the previous studies, so we use a scenario similar to that in [12], where five Partition/Aggregate trees run on the network; each tree consists of one aggregator and n workers and has a different deadline and response data size. The aggregators are located on the aggregator node, and the workers are distributed evenly over all servers so that each server hosts exactly one worker. The aggregators send a request to their workers, and the workers immediately respond with a specific amount of data. We increase n, the number of workers per tree, from 20 to 45 to measure the performance at various congestion levels.


Lastly, we set the five trees' (base deadline, response data size) pairs to (tight, 2 KB), (tight, 6 KB), (moderate, 10 KB), (moderate, 14 KB), and (lax, 18 KB), respectively. The values for the data sizes (i.e., a few KB per flow) are based on the characteristics of workloads in production clusters [1], but since real distributions of flow deadlines are not available, we use 20 ms (tight), 30 ms (moderate), and 40 ms (lax) as base deadlines and assume that the actual deadlines are exponentially distributed around them, as previous studies did [11,12]. Specifically, we use a one-sided exponential distribution for flow deadlines whose mean is the base deadline, capped at 150% of the base deadline. All simulation results are repeated 100 times and averaged.
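A sketch of the deadline sampling used in the simulation setup above; the exact generator in the authors' ns-3 scripts is not given, so the literal reading here (exponential with mean equal to the base deadline, truncated at 150% of it) is an assumption.

```python
import random

# Five trees: (base deadline in ms, response data size in KB), as listed above.
TREES = [(20, 2), (20, 6), (30, 10), (30, 14), (40, 18)]

def sample_deadline_ms(base_deadline_ms, cap_factor=1.5):
    # One-sided exponential around the base deadline, capped at 150% of it.
    return min(random.expovariate(1.0 / base_deadline_ms),
               cap_factor * base_deadline_ms)

deadlines = [sample_deadline_ms(base) for base, _size in TREES]
```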

4.1.1. Deadline awareness

Fig. 7 shows the fraction of flows that miss their deadlines (i.e., the Y axis) at increasing congestion levels. When the number of workers per tree is small (e.g., 20 or fewer), all variants meet the deadlines well, but the missed deadlines of NewReno and DCTCP increase rapidly as the number of workers increases. Even though DCTCP normally achieves much lower average flow completion times than NewReno, its ignorance of the deadlines results in a large fraction of missed deadlines, showing performance similar to NewReno. D2TCP performs much better than NewReno and DCTCP, as it gives more bandwidth to near-deadline flows, but it still misses about 27% of the deadlines when the number of workers is large. On the other hand, DIATCP does not miss any deadlines even in highly congested situations. We note that the fair-share version of DIATCP, DIATCP-FS, shows similar results; it misses only 3 deadlines when the number of workers is 45. This implies that most of the deadlines can be met just by avoiding the incast congestion effectively.

4.1.2. Incast avoidance

To show how incast congestion affects the performance, we measure the fraction of flows that suffer at least one timeout, as shown in Fig. 8. We observe that more than 50% of the flows that employ NewReno or DCTCP experience incast congestion in all cases. D2TCP shows better performance with regard to congestion avoidance, but its fraction of timeout flows increases up to around 60% as the number of workers increases. By comparing this result with Fig. 7, we can see that incast congestion directly affects the missed deadlines, as the flow deadlines range from 20 ms to 60 ms while RTOmin is 20 ms. Since one of the design goals of DIATCP is to avoid incast congestion, DIATCP-FS and DIATCP control the total sending window size to within the bottleneck link capacity and, as a result, do not suffer any timeouts.

4.1.3. Normalized latency

Fig. 9 shows the flow completion times normalized to the deadline; a flow whose normalized completion time is less than 100% meets its deadline. We plot three points for each variant, indicating the 50th, 90th, and 99th percentiles of the flow completion times, from bottom to top. In the graph, D2TCP shows much lower flow completion times than DCTCP, and its 90th percentiles are under the deadlines (100%) until the number of workers reaches 35, but exceed the deadlines when the number of workers is 40 or 45. We note that for DIATCP-FS and DIATCP, all the 99th percentiles of the flow completion times are under the deadlines. Moreover, DIATCP shows lower completion times than DIATCP-FS in all cases, and the 99th percentile of DIATCP flows is below 50% even when the number of workers is 45. These low completion times naturally result in good performance in terms of missed deadlines, as shown in Fig. 7.

4.1.4. Deadline awareness with background traffic

We measure the missed deadlines when deadline and non-deadline flows coexist. We add to the previous scenario one background flow that transmits 10 MB of data to the aggregator node, as it is reported that the median number of concurrent large flows is 1 in data center networks [1]. The background flow has no deadline and fully utilizes the bottleneck link before the Partition/Aggregate trees begin. Fig. 10 shows the missed deadlines for the deadline flows: the overall missed deadlines of NewReno, DCTCP, and D2TCP increase by as much as 10–20% compared to Fig. 7, while DIATCP still does not miss any deadlines. DIATCP-FS shows 0.36% missed deadlines when the number of workers is 45. These results imply that the background flow aggravates the incast congestion, so a larger fraction of the deadlines is missed. Nevertheless, DIATCP avoids the congestion by controlling the total amount of incoming traffic and prioritizes the deadline flows over the background flow. As a result, we confirm that DIATCP significantly outperforms the existing schemes.

Fig. 7. Fraction of flows that miss the deadlines.

Fig. 8. Fraction of flows that suffer at least one timeout.


Fig. 9. Normalized flow completion times at the 50th, 90th, and 99th percentiles.

We also measure the average goodput of the background flow to see the link utilization of non-deadline flows, as shown in Fig. 11. The goodput clearly tends to decrease as the number of workers increases, but the goodputs of DIATCP-FS and DIATCP are consistently higher than those of NewReno and DCTCP, and comparable to that of D2TCP. This shows that, with NewReno and DCTCP, even the non-deadline flow suffers from the incast congestion. DIATCP guarantees that the deadline flows meet their requirements by taking a negligible portion of the link bandwidth from the background traffic and allocating it to the deadline flows. D2TCP shows similar goodput performance, but fails to meet the deadlines, as shown in Fig. 7.

4.1.5. Deadline awareness with tight deadline requirements

To investigate how effectively our deadline-aware algorithm works in extreme scenarios, we measure the missed deadlines when one of the trees has an unacceptably tight deadline requirement, as shown in Fig. 12. In this scenario, we change the first tree's deadline from 20 ms to a fixed 5 ms. In the graph, the missed deadlines of DCTCP and D2TCP slightly increase, as most of the tight-deadline flows miss their deadline. We also observe that DIATCP-FS misses the tight deadlines as well, and its fraction of missed-deadline flows increases up to 13%. This is mainly because DIATCP-FS just allocates Gwnd to all flows in a fair-share manner. In contrast, when the tight-deadline flows start to transmit, DIATCP gives them higher priority than the far-deadline flows and allocates them a proper window size so that they finish the transfer in time. As a result, DIATCP misses no deadlines in any case.


Fig. 11. Average goodput of the background flow.

4.1.6. Flow quenching

As described in [11], in some extreme cases there may be flows that require more bandwidth than the link capacity. Those flows cannot meet their deadlines even if we allocate all available bandwidth to them, so it is better to abandon such flows rather than trying to complete all flows. To test this scenario, we set the deadline and data size to 20 ms and 1 KB for four Partition/Aggregate trees, and give an extremely small deadline and a large data size to one tree so that its bandwidth requirement exceeds the link capacity. We measure the fraction of flows that meet their deadlines, as shown in Fig. 13. Note again that the flows in the tree with extreme requirements can never meet their deadline. If there is no extreme flow, all other flows easily meet their deadlines in all cases, as the data size (1 KB) is very small. When the extreme flows coexist, D2TCP misses up to 60% of the deadlines as the number of extreme flows increases. DIATCP, however, effectively drops the extreme flows, as explained in Section 3.3, when the window requirement is larger than Gwnd, i.e., when the bandwidth requirement exceeds the link capacity. As a result, it does not miss any deadlines of the normal flows. We also confirm that DIATCP performs better than D2TCP in terms of the completion times of the extreme flows, but omit those results from the paper.

4.2. Testbed experiments

We evaluate our algorithm using a Linux implementation on a 46-server testbed, to confirm that DIATCP works well in a real data center environment. We implemented DIATCP in the Linux kernel 2.6.38 by adding about 265 lines of code: 219 lines for the main operations, 25 lines for the header files, and 21 lines for the proc-related code used in the Linux system. The testbed topology is similar to that of the simulations, but only 3 racks are used (Fig. 14); each rack has 15 servers and a ToR switch, a Summit X460 with 48 1 Gbps ports, 4 10 Gbps ports, and a 3 MB shared packet buffer memory [30].

Fig. 10. Missed deadlines with background traffic (10 MB).

Fig. 12. Missed deadlines with tight deadline requirements (5 ms).


The three ToR switches are connected to an Aggregation switch, a Summit X670 with 48 10 Gbps ports and a 9 MB shared packet buffer memory [31]. Lastly, one aggregator node is added to the first rack. All switches are Extreme Networks products, and all servers including the aggregator node are Dell OptiPlex 790DT machines with a quad-core Intel i5 3.3 GHz processor, 4 GB of RAM, and a 1 Gbps Ethernet interface, running the Linux kernel 2.6.38. In our experiments, we set the packet buffer size of the bottleneck port in the first ToR switch to 178 KB, which is the minimum value our switch supports. Since the switch does not support ECN, as mentioned before, we compare our algorithm to Reno and to 1win-Reno, whose sending window is fixed to one. 1win-Reno emulates the most conservative window-based congestion control algorithm, as its maximum window size can only grow to one; in other words, a one-packet window is the most incast-avoidable window size under highly congested situations. Existing solutions like DCTCP adopt window-based algorithms. The RTOmin for all protocols is set to 20 ms. We set up five Partition/Aggregate trees on the testbed as we did in the simulations and vary the number of workers per tree from 20 to 45. Since we have only 45 physical servers, multiple workers are launched on each server. The trees' base deadlines are tight (20 ms) for two trees, moderate (30 ms) for two other trees, and lax (40 ms) for the last tree, and a 50% uniform-random variance is added to the base deadlines. The response data sizes are uniformly distributed across [2 KB, 10 KB] for the tight-deadline flows, [10 KB, 20 KB] for the moderate deadlines, and [20 KB, 35 KB] for the lax deadlines. We also add one background (non-deadline) flow whose size is 100 MB, and start the trees after this long flow fully utilizes the bottleneck link. All experimental results are repeated 100 times and averaged.

4.2.1. Deadline awareness with background traffic

We measure the missed deadlines for the deadline flows while increasing the number of workers per tree. In Fig. 15, we observe that the missed deadlines of Reno and 1win-Reno increase as the number of workers increases. 1win-Reno performs slightly better until the number of workers reaches 35, but becomes comparable to Reno under congested network conditions. This implies that even the most conservative strategy (1win-Reno) that window-based congestion control can take is not enough to avoid incast congestion.

Fig. 13. Fraction of flows that meet the deadlines when there is a tree with extreme flows that require larger bandwidth than the link capacity.

Meanwhile, with DIATCP-FS and DIATCP, no timeout is observed, resulting in just a few missed deadlines (under 1% in all cases). We note that DIATCP also misses 0.1–0.9% of the deadlines, because a scheduling delay exists before the aggregator sends a request to the workers at the application layer. This delay can grow to more than 15 ms depending on the number of connections, which in turn affects the performance of short flows. Despite such scheduling overheads, DIATCP still rarely misses the deadlines in most cases. In addition, both variants of DIATCP outperform Reno and 1win-Reno in terms of the average goodput of the background flow, as shown in Fig. 16, by effectively avoiding the incast congestion. In general, we find that our experimental results follow a trend similar to the simulations shown in Figs. 10 and 11. In the experiments, the missed deadlines of Reno and 1win-Reno are around 20% when the number of workers is 20 and increase up to around 60%, as shown in Fig. 15, which almost corresponds to the simulation results for NewReno and DCTCP in Fig. 10. The missed deadlines of DIATCP-FS and DIATCP are almost zero in both the simulations and the real experiments. The results for the average goodput of the background flow are also similar, even though the absolute numbers differ slightly because the environmental settings are different. Therefore, we believe that our simulation results correctly capture the behavior of the real world.

4.2.2. Deadline awareness with tight deadline requirements

In the previous scenario, there are no evident differences between DIATCP-FS and DIATCP in terms of missed deadlines. Here, we build a more stressful scenario with tighter deadlines. Further, we alleviate the effect of the scheduling overhead, to concentrate only on the network performance, by increasing the deadline and response data size; each flow's data size and deadline are uniformly distributed across [100 KB, 150 KB] and [50 ms, 250 ms], respectively. No background flow is added. Under this scenario, we measure the missed deadlines for the two variants, as shown in Fig. 17. In the graph, the missed deadlines of DIATCP-FS increase linearly as the number of workers increases. DIATCP-FS often misses the relatively small deadlines, as it allocates the same amount of window to the flows in a first-come, first-served manner. On the other hand, DIATCP shows almost zero missed deadlines until the number of workers exceeds 40, by allocating more window to the near-deadline flows. When the number of workers is 45, it is impossible to meet the deadlines of some flows even for DIATCP, resulting in about 10% missed deadlines.

4.2.3. Incast avoidance

We now evaluate how DIATCP copes with incast congestion. Unlike the previous scenarios, the flows have no deadlines, and the response data size of each worker is 5 MB/n, where n is the number of workers. We measure the query completion time as n increases up to 220, as shown in Fig. 18. During the experiments, the minimum completion time is about 45 ms, and DIATCP shows consistently low completion times, ranging between 45 ms and 47 ms in all cases. However, the query completion time of Reno increases up to 88 ms because of the severe incast congestion.


Fig. 14. The real data-center testbed: three ToR switches (48 1 Gbps ports + 4 10 Gbps ports each) connected to an Aggregation switch (48 10 Gbps ports).

Fig. 15. Missed deadlines with one background flow (100 MB).

Fig. 17. Missed deadlines with tight deadline requirements.

It is worth noting that 1win-Reno shows low completion times in this scenario, since the workers' small sending window sizes lead to less congested situations. However, it performs poorly with a small number of workers (e.g., fewer than 10). For example, the query completion time of 1win-Reno is more than 900 ms with one worker and 63 ms with five workers, as shown in Fig. 18. This is expected, since 1win-Reno is the most conservative window-based congestion control strategy.

Fig. 16. Average goodput of the background flow.

We also measure the timeout ratio, i.e., the fraction of queries that suffer at least one timeout, as shown in Fig. 19. As expected, Reno suffers at least one timeout among the workers in most experimental rounds when the number of workers is more than 80, which directly results in high completion times, as the query completion time is determined by the most congested flow. 1win-Reno starts to experience timeouts when the number of workers is 120, because the packet buffer size is 178 KB and 120 workers imply that 180 KB of data packets (1.5 KB × 120) are transmitted at the same time through the bottleneck buffer. So the query completion times increase slightly after this point. Finally, we observe that DIATCP suffers no timeout in all cases, directly resulting in its low completion times.

4.2.4. Multi-bottleneck environments

One of our main assumptions in the design phase is that the bottleneck link is the last hop between the aggregator and the shallow-buffered ToR switch. This is true in many cases, but there can be more than one bottleneck point between the aggregator and the workers. To see whether DIATCP copes with such multi-bottleneck environments, we set up a simple multi-bottleneck topology on our testbed, as shown in Fig. 20. In this topology, there are two bottleneck points: BP-1, the port of the left X460 switch connected to the aggregator, and BP-2, the port of the right X460 switch connected to the X670 switch.

Fig. 18. Query completion time vs. the number of workers.



Fig. 19. Timeout ratio vs. the number of workers.

When these artificial bottleneck points are enabled, the buffer size of both BP-1 and BP-2 is set to 178 KB (the minimum configurable size), and the egress rate of BP-2 is set to 1 Gbps. Next, we deploy one aggregator node, 60 workers8 whose response data size is 80 KB each, and two update nodes that transmit large update data to the aggregator; these flows traverse both bottleneck points. There is also one long-term background flow from the BG-S node to the BG-R node. Table 2 presents the average query completion time for the workers with different settings of the bottleneck points. When there is no artificial bottleneck point on the path, the completion time of Reno is relatively low, but it increases by 83.4 ms, 70.9 ms, and 84.5 ms when BP-1, BP-2, and both are enabled, respectively. When there are multiple bottleneck points, the total amount of traffic on the end-to-end path converges to the main bottleneck bandwidth (BP-1 in our case). For this reason, the result for the two-bottleneck case is similar to that of the BP-1-only case. DIATCP, however, efficiently controls the total amount of traffic and avoids incast congestion on both bottlenecks, showing low completion times in all cases.

5. Discussion

Other experimental results: In our testbed, all RTTs between the aggregator node and the other servers vary around 200 μs on average. However, the average RTTs can differ among workers, as the number of hops varies within a data center. Such environments with heterogeneous RTTs may penalize short-RTT workers in terms of performance, as DiaDelay increases due to the long-RTT workers. In our design, this is compensated for by the larger window allocation that these workers receive. Furthermore, the bottleneck link is always fully utilized, since the aggregate sending rate (i.e., Gwnd/DiaDelay) equals the bottleneck link capacity. To show that DIATCP works well in heterogeneous environments, we conduct a set of simulation experiments with the network topology in Fig. 6, where one worker is deployed in each rack (i.e., 5 workers in total).

8 These workers are running on 12 physical servers (i.e., 5 workers per server).


Fig. 20. The multi-bottleneck topology.

To show that DIATCP works well in heterogeneous environments, we conduct a set of simulation experiments with the network topology in Fig. 6, where one worker is deployed in each rack (i.e., 5 workers in total) and the round-trip propagation delays are set to (i) 200 μs for all flows, (ii) 300 μs for all flows, and (iii) diverse values, by giving different round-trip propagation delays between racks, namely 100, 150, 200, 250, and 300 μs. DiaDelay is set to 300 μs, and each worker transmits 2 MB to the aggregator. In Table 3, we measure the aggregate goodput to see the bottleneck link utilization. There is only about a 0.5% drop in the average goodput even when the workers have different RTTs.

We do not perform any experiments specifically targeting the priority inversion problem raised in [12]. As described in Section 3.3, the flows in the Connection list are sorted in ascending order of deadline, so the flows with earlier deadlines are allocated first (i.e., with higher priority) by our algorithm (a short illustrative sketch follows below); unlike [11], no priority inversion is observed in any of our experiments.

Retransmission timeout parameter: In our simulations and experiments, we set the minimum retransmission timeout (RTOmin) to 20 ms. One may argue that an RTOmin of 20 ms seems high, since it is significantly longer than typical RTT values in data center networks. Indeed, prior work has proposed RTOmin values of 1 ms or lower [32]. However, other studies [9] have shown that retransmission timeouts below 10 ms cause spurious retransmissions; that is, a small RTOmin triggers false alarms and unnecessary timeouts. We therefore use an RTOmin of 20 ms, as in Google's production data centers [29] and in previous work [12].

Congestion at the core/aggregation switches: DIATCP mainly focuses on congestion at the ToR switch, whereas other network-based schemes consider congestion at the aggregation or core switches [11,13,9]. There are several reasons why we view ToR congestion as the most critical problem. First, it has been reported that the majority (80%) of the traffic originated by servers in cloud data centers is destined to machines within the same rack [20], because cloud services are commonly placed so that the amount of inter-rack traffic is minimized. Second, packet losses do not correlate with high average link utilization; they occur under low average utilization, since the primary cause of loss is the bursty query traffic generated by Partition/Aggregate applications [20]. The Partition/Aggregate traffic pattern therefore aggravates bursty packet losses at the ToR switch, while the aggregation and core switches merely see higher utilization. Finally, and most importantly, since DIATCP is implemented only at the aggregator, it can be used in parallel with network-based schemes that address aggregation and core switch congestion.
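Returning to the deadline-ordered allocation mentioned above, the following sketch illustrates an earliest-deadline-first pass over a Connection list that hands out pieces of Gwnd until the budget is exhausted. It is illustrative only: the flow fields, the demand model, and the Gwnd value are assumptions for this example, and the actual Global window allocation is the one described in Section 3.3.

```c
/* Illustrative sketch of a deadline-ordered (earliest-deadline-first) pass
 * over the Connection list, granting window from the global budget Gwnd.
 * Data structures and values are assumptions, not the prototype's code.
 */
#include <stdio.h>
#include <stdlib.h>

struct flow {
    int id;
    double deadline_ms;   /* time remaining until the flow's deadline */
    int demand_segments;  /* segments the flow still needs to send    */
    int alloc_segments;   /* window granted in this round             */
};

static int by_deadline(const void *a, const void *b)
{
    const struct flow *fa = a, *fb = b;
    return (fa->deadline_ms > fb->deadline_ms) - (fa->deadline_ms < fb->deadline_ms);
}

int main(void)
{
    struct flow flows[] = {
        { 1, 30.0, 20, 0 }, { 2, 10.0, 15, 0 }, { 3, 50.0, 40, 0 },
    };
    int n = sizeof(flows) / sizeof(flows[0]);
    int gwnd = 40;  /* global window budget, in segments (assumed value) */

    /* Earlier deadlines are served first, so they never wait behind later
     * ones -- the property used to argue that no priority inversion occurs. */
    qsort(flows, n, sizeof(flows[0]), by_deadline);

    for (int i = 0; i < n && gwnd > 0; i++) {
        int grant = flows[i].demand_segments < gwnd ? flows[i].demand_segments : gwnd;
        flows[i].alloc_segments = grant;
        gwnd -= grant;
    }
    for (int i = 0; i < n; i++)
        printf("flow %d (deadline %.0f ms): granted %d segments\n",
               flows[i].id, flows[i].deadline_ms, flows[i].alloc_segments);
    return 0;
}
```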


Non-Partition/Aggregate traffic: DIATCP leverages the Partition/Aggregate traffic pattern by enabling the aggregator to conduct admission and flow control. For non-Partition/Aggregate traffic patterns, DIATCP still achieves good performance, since it conducts flow control to avoid congestion. This is evident from the fact that DIATCP provides better average goodput even for background flows, which represent the non-Partition/Aggregate flows, as shown in Figs. 11 and 16.

Practical design issues: We now discuss two design issues of DIATCP that may arise in practical data center environments. The first issue is how to deploy DIATCP in virtualized environments. In this paper we implicitly assume that each TCP stack has full use of the link capacity of the Network Interface Card (NIC). However, there can be multiple virtual machines (VMs) on the same physical node in commercial cloud data centers, so multiple TCP stacks may compete for the same physical NIC capacity. To deal with this situation, the VM monitor (i.e., the hypervisor) needs to control the link capacity of the virtual NICs (VNICs) so that the total capacity of the VNICs does not exceed the physical link capacity. Our solution is to have the hypervisor periodically set the link capacity of each VNIC in proportion to its share of the incoming traffic (a sketch of this policy is given below). This function can be implemented easily in the hypervisor, since the link capacity is a controllable parameter in each VM via the proc interface and the statistics of the incoming traffic are usually available at the privileged domain (e.g., dom0 in the case of Xen [33]). User requirements, if available, are another option for setting the VNIC capacity.

The second design issue is properly allocating Gwnd to long-lived but quiescent connections, such as heartbeat flows. This type of flow does not tear down the connection even after all of its data has been transmitted, so that future transmissions avoid the cost of the 3-way handshake. The Global window allocation assigns some window to such flows as well, which may waste bandwidth, even though they are served only after the deadline flows. Our solution is to keep a separate global list for these long-lived flows: if a node in the Connection list is not accessed (i.e., is idle) for a specified duration, we move it to the idle list until a new data packet is received from the peer. When the flow is re-activated, the corresponding node is re-inserted into the Connection list and included in the subsequent Global window allocation (a second sketch below illustrates this).
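To make the proportional VNIC policy concrete, here is a minimal user-space sketch, not the hypervisor implementation from the paper: given per-VNIC incoming-traffic counters over the last period, it splits the physical link capacity in proportion to those counters. The counter values and the 10 Gbps physical capacity are assumptions for illustration.

```c
/* Sketch of the proportional VNIC capacity policy described above.
 * Per-VNIC incoming byte counts and the physical capacity are illustrative
 * assumptions; an actual hypervisor would read its own statistics (e.g.,
 * from dom0 in Xen) and apply the result through its rate-limiting interface.
 */
#include <stdio.h>

int main(void)
{
    const double phys_capacity_mbps = 10000.0;                /* physical NIC */
    const double incoming_bytes[]   = { 4.0e9, 1.0e9, 0.5e9 };/* per VNIC     */
    const int n = 3;

    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += incoming_bytes[i];

    for (int i = 0; i < n; i++) {
        double share = (total > 0.0) ? incoming_bytes[i] / total : 1.0 / n;
        printf("VNIC %d: cap = %7.1f Mbps (share %.2f)\n",
               i, share * phys_capacity_mbps, share);
    }
    return 0;
}
```

The idle-list handling for quiescent connections can be sketched in a similar spirit; the list representation, the timestamp field, and the idle threshold below are assumptions made for this example, not the kernel data structures used in the prototype.

```c
/* Sketch of the idle-list handling described above: connections idle longer
 * than a threshold are demoted so the Global window allocation skips them;
 * a newly arriving data packet promotes the connection back.
 */
#include <stdio.h>

#define IDLE_THRESH_S 5.0

struct conn {
    int    id;
    double last_active;   /* seconds, e.g., from a monotonic clock   */
    int    on_idle_list;  /* 0 = Connection list, 1 = idle list      */
};

/* Called periodically: demote connections that went quiet. */
static void sweep_idle(struct conn *c, int n, double now)
{
    for (int i = 0; i < n; i++)
        if (!c[i].on_idle_list && now - c[i].last_active > IDLE_THRESH_S)
            c[i].on_idle_list = 1;
}

/* Called when a data packet arrives: promote the connection so the next
 * Global window allocation includes it again. */
static void on_data_packet(struct conn *c, double now)
{
    c->last_active  = now;
    c->on_idle_list = 0;
}

int main(void)
{
    struct conn conns[2] = { { 1, 0.0, 0 }, { 2, 0.0, 0 } };

    sweep_idle(conns, 2, 10.0);        /* both idle for 10 s -> demoted */
    on_data_packet(&conns[0], 10.5);   /* conn 1 becomes active again   */

    for (int i = 0; i < 2; i++)
        printf("conn %d is on the %s list\n", conns[i].id,
               conns[i].on_idle_list ? "idle" : "Connection");
    return 0;
}
```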

Table 2
Avg. query completion time (ms) on the multi-bottleneck topology.

              Bottleneck
TCP           None      BP-1      BP-2      Both
Reno          63.4      83.4      70.9      84.5
DIATCP        49.5      49.5      49.4      49.5

Table 3
Bottleneck link utilization.

RTT setting       200 μs     300 μs     Heterogeneous
Goodput (Mbps)    967.98     966.24     962.49

6. Related work

There is a plethora of work on the design of today's data center protocols, including advanced transport protocols for wide area networks, Active Queue Management (AQM) schemes, and real-time scheduling policies. Among these, we focus on two threads of recent data center transport protocols: incast congestion avoidance and deadline awareness.

In [4], the authors conduct an in-depth analysis of the incast congestion problem in their cluster-based storage systems. Specifically, they explore the tradeoffs between (i) the switch buffer size and the number of packet losses, and (ii) the data block size and the link idle time. To mitigate TCP throughput collapse, they tune TCP-layer parameters such as the duplicate ACK threshold, or employ Ethernet Flow Control. Unfortunately, they do not arrive at a satisfactory solution that fully addresses the problem.

To effectively avoid incast congestion, new congestion control schemes such as Data Center TCP (DCTCP) [1], Incast Congestion Control for TCP (ICTCP) [7], and Incast-Avoidance TCP (IA-TCP) [8] have been proposed. DCTCP provides fine-grained congestion control that adapts the window size in proportion to the extent of congestion, estimated by counting ECN marks. ICTCP measures the bandwidth of the total incoming traffic to obtain the available bandwidth, and then controls the receive window of each connection based on this information. Similarly, IA-TCP controls the workers' sending rate so that it does not exceed the bandwidth-delay product. However, IA-TCP does not limit the total number of outstanding packets as the number of concurrent connections increases, whereas DIATCP bounds this total by Gwnd. This makes IA-TCP less scalable, since incast congestion is practically inevitable with a large number of outstanding packets, even if the RTT is artificially increased.

While the previous work focuses on fair-share congestion control, D3, a Deadline-Driven Delivery control protocol [11], introduces deadline awareness into the design space of data center protocols. D3 exploits deadline information and performs explicit rate control in a centralized manner at the switches, allocating bandwidth based on each flow's deadline and size. Deadline-Aware Datacenter TCP (D2TCP) [12] is another deadline-aware protocol that provides a fully distributed algorithm. D2TCP introduces a deadline factor into DCTCP's congestion control scheme; it adapts the congestion window using a gamma-correction function so that near-deadline flows take more bandwidth than far-deadline flows (see the sketch below). Preemptive Distributed Quick (PDQ) [13] is a flow scheduling algorithm designed to complete flows quickly and meet flow deadlines. PDQ emulates Shortest Job First (SJF) to give higher priority to short flows, and provides a distributed algorithm by allowing each switch to propagate flow information to others via explicit feedback in packet headers. DeTail [9] is an in-network, multipath-aware congestion control mechanism that takes a traffic engineering approach to reduce the flow completion time tail.
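For readers unfamiliar with the gamma-correction idea, the following sketch contrasts a DCTCP-style cut, W ← W(1 − α/2), with a D2TCP-style cut, W ← W(1 − p/2) with p = α^d, where α is the DCTCP congestion estimate and d is a deadline imminence factor (d > 1 for near-deadline flows, d < 1 for far-deadline flows). This follows the published descriptions of DCTCP [1] and D2TCP [12]; the concrete numbers are illustrative only.

```c
/* Illustrative comparison of the DCTCP and D2TCP window cuts (values are
 * examples, not measurements).  alpha is DCTCP's estimate of the fraction
 * of ECN-marked packets; d is D2TCP's deadline imminence factor.
 */
#include <stdio.h>
#include <math.h>

static double dctcp_cut(double w, double alpha)
{
    return w * (1.0 - alpha / 2.0);
}

static double d2tcp_cut(double w, double alpha, double d)
{
    double p = pow(alpha, d);      /* gamma-correction penalty */
    return w * (1.0 - p / 2.0);
}

int main(void)
{
    double w = 100.0, alpha = 0.5; /* window (segments) and congestion estimate */

    printf("DCTCP:                %.1f -> %.1f\n", w, dctcp_cut(w, alpha));
    printf("D2TCP near deadline:  %.1f -> %.1f (d = 2.0)\n", w, d2tcp_cut(w, alpha, 2.0));
    printf("D2TCP far  deadline:  %.1f -> %.1f (d = 0.5)\n", w, d2tcp_cut(w, alpha, 0.5));
    return 0;
}
```

With these example numbers, the near-deadline flow backs off less than plain DCTCP and the far-deadline flow backs off more, which is exactly the bandwidth shift toward near-deadline flows described above.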


DeTail employs (i) link-layer flow control (LLFC) to react to congestion more quickly, (ii) per-packet adaptive load balancing (ALB) to spread traffic across the available paths (i.e., to support multipath data transfer), and (iii) priority mechanisms for traffic differentiation.

We note that, compared with DIATCP, the existing approaches have the following weaknesses:

• Network-based solutions such as D3, PDQ, and DeTail require costly and/or customized hardware in the network, which is a significant hurdle for deployment.
• Host-based solutions such as DCTCP and D2TCP cannot support flow quenching.
• DCTCP and D2TCP require ECN functionality, which is still not supported by some ToR switches [34].
• D3 suffers from the priority inversion problem, as described in [12].

7. Concluding remarks

In this paper, we propose DIATCP, a new Deadline and Incast Aware TCP designed for cloud data center networks. Unlike existing approaches, which are either host-based or network-based, we design an aggregator-based solution. Our insight is that, under the Partition/Aggregate traffic pattern, the main bottleneck is normally the last hop between the aggregator and the ToR switch, so the aggregator is aware of the bottleneck link capacity as well as the traffic on that link. DIATCP therefore controls the peers' sending rates directly, both to avoid incast congestion and to meet the cloud application's deadline. We implement a prototype of the DIATCP algorithm in the Linux kernel. Through extensive experiments on our data center testbed and in simulation, we confirm that DIATCP outperforms the existing solutions in all experiments, in terms of both deadline awareness and incast avoidance. We believe that the aggregator-based approach is a promising direction for designing data center protocols, and DIATCP is one example optimized particularly for data center applications' requirements. As future work, we plan to design an optimized tuning algorithm for Gwnd based on mathematical analysis.

Acknowledgements

This work was supported by the Seoul R&BD Program (WR080951) funded by the Seoul Metropolitan Government. This work was also supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2013R1A1A1006823). The authors thank Dr. Jeongran Lee for the valuable comments and discussions.

References

[1] M. Alizadeh, A. Greenberg, D.A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, M. Sridharan, Data Center TCP (DCTCP), in: Proc. of ACM SIGCOMM, 2010.
[2] S. Stefanov, YSlow 2.0, December 2008.

[3] T. Hoff, Latency Is Everywhere And It Costs You Sales – How To Crush It, July 2009.
[4] A. Phanishayee, E. Krevat, V. Vasudevan, D.G. Anderson, G.R. Ganger, G.A. Gibson, S. Seshan, Measurement and analysis of TCP throughput collapse in cluster-based storage systems, in: Proc. of USENIX FAST, 2008.
[5] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, R. Chaiken, The nature of datacenter traffic: measurements & analysis, in: Proc. of ACM IMC, 2009.
[6] Y. Chen, R. Griffith, J. Liu, R.H. Katz, A.D. Joseph, Understanding TCP incast throughput collapse in datacenter networks, in: Proc. of ACM WREN, 2009.
[7] H. Wu, Z. Feng, C. Guo, Y. Zhang, ICTCP: incast congestion control for TCP in data center networks, in: Proc. of ACM CoNEXT, 2010.
[8] J. Hwang, J. Yoo, N. Choi, IA-TCP: a rate based incast-avoidance algorithm for TCP in data center networks, in: Proc. of IEEE ICC, 2012.
[9] D. Zats, T. Das, P. Mohan, R. Katz, DeTail: reducing the flow completion time tail in datacenter networks, in: Proc. of ACM SIGCOMM, 2012.
[10] M. Alizadeh, A. Kabbani, T. Edsall, B. Prabhakar, A. Vahdat, M. Yasuda, Less is more: trading a little bandwidth for ultra-low latency in the data center, in: Proc. of USENIX NSDI, 2012.
[11] C. Wilson, H. Ballani, T. Karagiannis, A. Rowstron, Better never than late: meeting deadlines in datacenter networks, in: Proc. of ACM SIGCOMM, 2011.
[12] B. Vamanan, J. Hasan, T.N. Vijaykumar, Deadline-aware datacenter TCP (D2TCP), in: Proc. of ACM SIGCOMM, 2012.
[13] C.-Y. Hong, M. Caesar, P.B. Godfrey, Finishing flows quickly with preemptive scheduling, in: Proc. of ACM SIGCOMM, 2012.
[14] J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in: Proc. of USENIX OSDI, 2004.
[15] M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly, Dryad: distributed data-parallel programs from sequential building blocks, in: Proc. of EuroSys, 2007.
[16] D. Beaver, S. Kumar, H.C. Li, J. Sobel, P. Vajgel, Finding a needle in Haystack: Facebook's photo storage, in: Proc. of USENIX OSDI, 2010.
[17] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, W. Vogels, Dynamo: Amazon's highly available key-value store, in: Proc. of ACM SOSP, 2007.
[18] D. Katabi, M. Handley, C. Rohrs, Congestion control for high bandwidth-delay product networks, in: Proc. of ACM SIGCOMM, 2002.
[19] K. Ramakrishnan, S. Floyd, D. Black, The Addition of Explicit Congestion Notification (ECN) to IP, RFC 3168, IETF, September 2001.
[20] T. Benson, A. Akella, D.A. Maltz, Network traffic characteristics of data centers in the wild, in: Proc. of ACM IMC, 2010.
[21] S. Floyd, M. Handley, J. Padhye, J. Widmer, TCP Friendly Rate Control (TFRC): Protocol Specification, RFC 5348, IETF, September 2008.
[22] L.S. Brakmo, L. Peterson, TCP Vegas: end to end congestion avoidance on a global internet, IEEE J. Sel. Area. Commun. 13 (8) (1995) 1465–1480.
[23] J.P. Royston, An extension of Shapiro and Wilk's W test for normality to large samples, Appl. Stat. 31 (2) (1982) 115–124.
[24] B.W. Silverman, Density Estimation, Chapman and Hall, London, 1986.
[25] S.J. Sheather, M.C. Jones, A reliable data-based bandwidth selection method for kernel density estimation, J. Roy. Stat. Soc.: Ser. B 53 (3) (1991) 683–690.
[26] The ns-3 Discrete-Event Network Simulator.
[27] S. Floyd, V. Jacobson, Random early detection gateways for congestion avoidance, IEEE/ACM Trans. Network. 1 (4) (1993) 397–413.
[28] Data Center TCP.
[29] The D2TCP slide at ACM SIGCOMM 2012.
[30] Summit X460 Switches.
[31] Summit X670 Switches.
[32] V. Vasudevan, A. Phanishayee, H. Shah, E. Krevat, D.G. Anderson, G.R. Ganger, G.A. Gibson, B. Mueller, Safe and effective fine-grained TCP retransmissions for datacenter communication, in: Proc. of ACM SIGCOMM, 2009.


[33] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, A. Warfield, Xen and the art of virtualization, in: Proc. of ACM SOSP, 2003.
[34] R.R. Stewart, M. Tüxen, G.V. Neville-Neil, An investigation into data center congestion with ECN, in: Proc. of BSDCan, 2011.

Jaehyun Hwang received the B.S. degree in computer science from the Catholic University of Korea, Seoul, Korea, in 2003, and the M.S. and Ph.D. degrees in computer science from Korea University, Seoul, Korea, in 2005 and 2010, respectively. His research background is mainly in TCP, focusing on flexible TCP structures, advanced TCP flavors, and their performance. Since September 2010, he has been with the networking research domain at Bell Labs, Alcatel-Lucent, as a member of technical staff. His current research interests include data center networks, software-defined networking, multipath TCP, and HTTP adaptive streaming.


Nakjung Choi received the B.S. and Ph.D. degrees in computer science and engineering from Seoul National University (SNU), Seoul, Korea, in 2002 and 2009, respectively. From September 2009 to April 2010, he was a postdoctoral research fellow in the Multimedia and Mobile Communications Laboratory, SNU. Since April 2010, he has been a member of technical staff at Alcatel-Lucent, Bell Labs Seoul. His research interests include the Future Internet, such as content-centric networking and green networking, as well as mobile/wireless networks, such as wireless LANs and wireless mesh networks.

Joon Yoo received his B.S. in Mechanical Engineering from Korea Advanced Institute of Science and Technology (KAIST), and Ph.D. in Electrical Engineering and Computer Science from Seoul National University in 1997 and 2009, respectively. He worked as a postdoctoral researcher at the University of California, Los Angeles in 2009 and then worked at Bell Labs, Alcatel-Lucent as a Member of Technical Staff from 2010 to 2012. Since 2012, he has been with the department of Software Design and Management at Gachon University as an assistant professor. His research interests include vehicular networks, cloud data center networks, and IEEE 802.11.
