Future Generation Computer Systems 105 (2020) 33–43
Efficient in-network aggregation mechanism for data block repairing in data centers

Junxu Xia, Deke Guo, Junjie Xie
Science and Technology on Information Systems Engineering Laboratory, National University of Defense Technology, Changsha, Hunan 410073, China

Article history: Received 16 April 2019; Received in revised form 3 October 2019; Accepted 27 October 2019; Available online 30 October 2019

Keywords: In-network aggregation; Distributed storage; Erasure code; Data centers

Abstract

Many distributed storage systems in data centers, e.g., Google Colossus FS, Facebook HDFS, and Microsoft Azure, adopt erasure codes to improve storage reliability and reduce the storage space overhead. Accordingly, when a data block fails, multiple involved blocks are sent to a new node to regenerate the failed block, and repairing those failed data blocks consumes a large amount of bandwidth. Meanwhile, prior repairing methods only focus on how to aggregate data blocks on the involved storage nodes, which fails to efficiently save network bandwidth. By contrast, our observations show that the aggregation is not limited to those storage nodes: servers and switches can all be utilized to aggregate data blocks during the transmission phase. Therefore, in this paper, we propose an efficient in-network aggregation mechanism, called AggreTree, which effectively leverages the intermediate nodes along the transmission paths to aggregate data blocks and thus repair failed blocks. Compared with existing methods, the experimental results show that AggreTree can improve the repairing speed of failed blocks by up to 4.06 and 2.86 times, and further reduce the transmission cost by 80.86% and 81.68% in Fat-tree and BCube data centers, respectively. © 2019 Published by Elsevier B.V.

1. Introduction

Nowadays, cloud computing has given rise to a higher demand for big data storage. Many cloud applications generate large amounts of data, which is stored in data centers. In addition, the emergence of cloud storage also puts forward higher requirements for storage systems. In data centers, distributed storage systems realize large-scale data storage by deploying a large number of storage nodes. To improve the quality of storage service, many distributed storage systems, such as Google ColossusFS [1], Facebook HDFS [2], and Microsoft Azure Storage [3], adopt erasure codes [4] to improve storage reliability and reduce the storage space overhead. These storage systems distribute large amounts of data across multiple storage nodes in data centers. Each storage node holds only a portion of the encoded data, avoiding the loss of a whole file due to machine failures.

Even though many efforts have been made, failures are very common in modern data centers [5–7]. Furthermore, most failures in data centers are single node failures, which account for more than 90% of failures in practice [8]. When a data block fails due to the failure of its storage node, multiple involved data blocks from
other storage servers, called providers, will transfer the corresponding data blocks to a new node, referred to as the newcomer, to repair the failed data block. The repairing process usually consumes a large amount of bandwidth and puts considerable strain on the data center network.

To reduce the bandwidth consumption and increase the repairing speed, many efforts [9–12] have been made to aggregate the data blocks during the repairing process. They mainly focus on the scheduling between the providers and the newcomer: the providers and the newcomer partially aggregate the data blocks according to certain schemes and finally complete the repair of the failed block. Although prior methods obtain certain gains, the scheduling between involved servers still brings a large amount of network overhead.

Servers are frequently used as switch nodes in server-centric data centers [5,13,14], so there is no doubt that those servers can also be utilized to aggregate data blocks. Furthermore, as shown in [15–17], switches have also been deployed to quickly process and cache some intermediate data. Therefore, different from the previous work, we leverage in-network aggregation to significantly reduce the transmission volume during transmission. When data blocks are transferred from the providers to the newcomer, they pass through many intermediate nodes in the network. Those intermediate nodes, including some storage servers, servers that may not hold data blocks, or even some switches, will perform
a simple XOR operation to aggregate small-size blocks from their child nodes, assisting in the repair of failed blocks. Thus, the failed block repairing can be finished naturally along the transmission path to further reduce the transmission cost. In particular, it is worth mentioning that in [17], Sapio et al. have achieved satisfactory in-network aggregation results with programmable switches and their high-speed SRAM memory. Simple logic operations take only milliseconds on switches; therefore, the processing delay on intermediate nodes is negligible compared to the transmission delay.

To efficiently repair failed blocks, two major challenges need to be addressed. The first one is the heterogeneity of available bandwidth. It is difficult to achieve efficient block repairing by relying on the participation of intermediate nodes alone: failed block repairing is a cooperative process among multiple related nodes and links, and the congestion of one link slows down the whole repairing process. The second is how to efficiently design the transfer paths. This is mainly because different links incur different transfer costs. Simply looking for paths with high available bandwidth often carries considerable transmission cost, because the block transmission may occupy more links when we search for maximum bandwidth paths. Therefore, an effective scheme is needed to control the transmission cost during failed block repairing.

To tackle the above challenges, in this paper, we propose an efficient in-network aggregation mechanism, called AggreTree, to repair failed blocks in erasure coding-based storage systems. AggreTree leverages a two-stage decision, which first ensures the fastest transmission speed and then minimizes the transmission cost under the premise of guaranteed speed. Our AggreTree is efficient due to not only the fast repairing speed of failed blocks but also the low bandwidth consumption. It is worth noting that the core of AggreTree is to achieve in-network aggregation, which can efficiently repair failed blocks while achieving lower bandwidth consumption than existing methods. We have conducted extensive experiments to evaluate the performance of AggreTree under two representative data center topologies, fat-tree [18] and BCube [13]. The results of the experiments show the efficiency and effectiveness of AggreTree. In summary, we make the following major contributions in this paper.

• We first put forward and formalize the in-network aggregation in erasure coding-based storage systems. Different from previous methods, the servers without data blocks and the switches along the transmission paths also participate in the repairing of failed blocks, further improving the repairing efficiency.
• We design an efficient in-network aggregation mechanism, called AggreTree, to repair failed blocks. AggreTree is efficient in both the repairing speed and the bandwidth saving.
• Extensive experiments are conducted under two representative data center topologies to evaluate the performance of AggreTree. Compared with existing methods, AggreTree can obtain up to 4.06× and 2.86× repairing speed, and reduce the transmission cost by 80.86% and 81.68% in Fat-tree and BCube data centers, respectively.

The rest of this paper is organized as follows. We introduce the background and related work in Section 2. The design overview of AggreTree is presented in Section 3. Then, we formulate the repairing problem of failed blocks in Section 4 and present the construction process of AggreTree in Section 5. In Section 6, we evaluate the performance of AggreTree by comparing it with existing methods. We draw the conclusion of this paper in Section 7.

2. Background and related work

In this section, we first present background on erasure coding-based storage systems, and then introduce the related work on failed block repairing.

2.1. Background

Cloud computing often needs to store and process massive data, so the fault tolerance of data is of great significance. In distributed systems, data is often stored on different storage nodes with redundancy. Those nodes consist of a large number of commercial storage devices distributed in data centers. However, storage nodes are sometimes unavailable due to disk failures, downtime, etc., causing the loss or failure of data. To this end, distributed systems adopt redundancy to achieve reliable data storage. Replication is a classic solution, which is traditionally adopted in production systems [19,20]: the original data is copied into multiple replicas stored on different nodes to maintain data availability. Replication is simple and practical; however, it consumes considerable storage space, which is not friendly to big data storage. For example, many replication-based storage systems [19,20] generally deploy 3 replicas for each file, which can tolerate up to 2 node failures. In such systems, storing a single file carries three times the storage space overhead.

To this end, erasure codes are applied to distributed systems to reduce the consumption of storage space. Different from replication-based storage systems, files in erasure coding-based storage systems are encoded rather than simply copied, so the system only generates a small amount of redundancy. For example, storing a file with the (6, 3) erasure code [1] only requires (6 + 3)/6 = 1.5 times the storage space, while the system can tolerate up to 3 node failures. Files in erasure coding-based storage systems are stored as fixed-size blocks (with sizes ranging from 64 MB to 256 MB), which form the basic read/write units. The set of blocks that are encoded together is referred to as a stripe. Generally, files are stored as multiple stripes, each of which is encoded independently.

The Reed–Solomon (RS) family of codes [4] is a representative example of erasure codes, which has been intensively applied to production systems for fault tolerance, such as three popular storage systems: Google Colossus FS [1], Facebook HDFS [2], and Microsoft Azure Storage [3]. To be specific, with the (k, m) RS code, the original data is first divided into k data blocks, and then m redundant blocks are generated. Those k + m blocks together form a stripe. Arbitrary k blocks of those k + m blocks can regenerate any failed block of the remaining ones. The data blocks and redundant blocks of the same stripe are stored on k + m different storage nodes to tolerate any m node failures. Each storage node may hold multiple blocks from different stripes but only one block from the same stripe. Thus, when a storage node fails, the failed block can be regenerated by extracting the corresponding blocks of the same stripe from other storage nodes. In this way, the reliability of data storage is guaranteed.

Single node failures account for more than 90% of failures in data centers in practice [8]. To repair a failed block, k providers, each of which stores an involved block, are required to transfer their blocks to a new storage node to repair the failed block P∗. This new storage node, called the newcomer, receives those blocks and regenerates the failed block to maintain the redundancy.
The repair process is a linear combination of those k blocks, as shown in Eq. (1),

P∗ = (β1, β2, . . . , βk) × (P1, P2, . . . , Pk)^T = Σ_{i=1}^k βi Pi    (1)
Fig. 1. Two different methods.

Fig. 2. The topology consists of 6 servers and 2 switches.

where P1, P2, . . . , Pk are the k blocks received from the k providers. The corresponding decoding coefficients are recorded as β1, β2, . . . , βk, βi ∈ Fq, where Fq denotes the Galois field of size q. Additions and multiplications are based on Galois field arithmetic over w-bit units called words, which are the basic encoding units. RS codes require k + m ≤ 2^w + 1 [21]. The size of blocks after aggregation remains unchanged during the whole repair process.
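To make the repair computation in Eq. (1) concrete, the following is a minimal Python sketch, assuming w = 8 and the common 0x11b reduction polynomial purely for illustration; the coefficients, the field width, and the helper names (gf_mul, repair_block) are our own choices and not taken from any particular RS library.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) (carry-less 'Russian peasant' method)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B          # reduce modulo x^8 + x^4 + x^3 + x + 1
        b >>= 1
    return p

def repair_block(coeffs, blocks):
    """P* = sum_i beta_i * P_i, byte by byte; addition in GF(2^w) is XOR."""
    out = bytearray(len(blocks[0]))
    for beta, block in zip(coeffs, blocks):
        for i, byte in enumerate(block):
            out[i] ^= gf_mul(beta, byte)
    return bytes(out)

# Toy example with k = 3 and all decoding coefficients equal to 1 (pure XOR).
print(repair_block([1, 1, 1], [b"\x01\x02", b"\x05\x06", b"\x0f\x0f"]).hex())  # 0b0b
```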

2.2. Related work

The inefficient repair of erasure codes has become a bottleneck restricting their wide application [22]. Hence, it is urgent to increase the efficiency of failed block repairing. The conventional repair method with k = 4 is shown in Fig. 1(a): the newcomer v0 receives all the involved blocks, P1, P2, P3, and P4, and then aggregates them locally to generate the failed block P∗. However, such a repair process often leads to congestion on the downlink of the newcomer, referred to as the incast problem [23], decreasing the repair efficiency. As reported in [24–26], the network transmission time accounts for up to 94% of the repair time. Therefore, optimizations of failed block repairing mainly focus on network transfers.

To reduce the transmission time, the tree-structured method [9] transforms the practical topology into a weighted undirected complete graph and calculates the maximum spanning tree. Blocks are transmitted and processed at the granularity of small-size units, each of which consists of one or more words. We call such a unit a slice, which is the basic transmission or aggregation unit. The key idea of the tree-structured scheme is to find a maximum spanning tree based on the available bandwidth between the providers and the newcomer, as shown in Fig. 1(b). Different from the conventional repair, this tree-structured scheme requires the storage nodes, e.g., v1, v2, v3, and v4, to transfer the processed slices of blocks, βi Pi, instead of the raw blocks Pi. An intermediate storage node in the spanning tree, e.g., v4, receives the slices from its child storage nodes and then immediately aggregates them with its local slices to generate the new aggregated slice. Since the size of a slice is small enough, we can ignore the aggregation time. The total repair time is mainly affected by the transmission time, which is determined by the minimum bandwidth of the tree.

The tree-structured method can avoid congestion and improve the repair efficiency theoretically. However, the gap between the transformed topology and the real topology makes it difficult to achieve the desired result. Fig. 2 shows an example of failed block repairing, where the server s2 is the newcomer, while servers s3, s4, and s6 are three providers. In this topology, a server also performs as a switch, which is quite common in server-centric topologies like BCube [13] and VLCcube [14]. Edge weights represent the available bandwidth (Mbps) between nodes measured in a time slot.

Fig. 3. An example of the tree-structured scheme.

With the tree-structured method, one possible scheme is presented in Fig. 3(a). It can be seen that the minimum bandwidth of the spanning tree is 43 Mbps. By contrast, the actual bandwidth in the practical topology may not match this theoretical value. Fig. 3(b) shows the actual transmission: two blocks would be transmitted on the links {s1, s2}, {s1, s4}, and {s4, s5} instead of one block in theory. The actual minimum bandwidth for the transmission of each block is min{85/2, 73/2, 63/2} = 31.5 Mbps, instead of 43 Mbps. Moreover, the differences between the transformed and real topologies often result in uncertain transmission costs. If the size of each block is D MB, the bandwidth overhead will be 9 × D MB in the above example, rather than 3 × D MB in theory. A feasible improvement is to aggregate blocks at the intermediate server s5, which ensures that only one block is transmitted on the subsequent links, thus achieving a 43 Mbps repair speed and reducing the bandwidth overhead to a relatively low value, i.e., 6 × D MB.

A similar idea is also expressed in [10], called Aggrecode. However, the main purpose of Aggrecode is to minimize the transmission cost, ignoring the bandwidth heterogeneity; hence, it fails to achieve a satisfactory repair speed. The repair pipelining scheme in [11] can obtain the desired repair speed, but its transmission cost is unsatisfactory. As shown in Fig. 4(a), repair pipelining leverages links with high bandwidth to serialize all the providers: each node constantly receives slices from its predecessor, aggregates them with its own slices, and then transmits the aggregated slices to the next storage node. In this way, the speed of repair pipelining is close to the theoretical value in the practical topology. However, the serialization of all involved storage nodes inevitably brings considerable transmission cost. In addition, Shen et al. [12] observe that the intra-rack bandwidth is higher than the inter-rack bandwidth in clustered file systems. They propose to aggregate all available blocks within a rack and send the aggregated block out, thus improving the repair speed. However, this approach only works well in clustered file systems. Moreover, blocks are randomly distributed across different storage nodes; as the scale of the data center grows, the probability that a rack holds multiple blocks decreases, which makes it difficult to achieve the desired effect.

In summary, it is difficult for existing methods to achieve a satisfactory repair speed with low transmission cost in practical data centers. The trade-off between speed and cost is difficult to realize with a single optimization objective. Furthermore, it is worth considering how to model the real topology so that a scheme can achieve its theoretical effect. Therefore, it is challenging to achieve a high repair speed in the practical topology while effectively controlling the cost. This is what our AggreTree addresses in this paper.

3. Design overview

Although many of the aforementioned efforts have made significant contributions, they struggle to achieve a satisfactory trade-off between the repair speed and cost in practical topologies. In particular, it is worth mentioning that introducing switches or servers on the transmission path as intermediate nodes to aggregate data does not add significant latency. For example, Sapio et al. [17] have achieved satisfactory in-network aggregation results for machine learning by deploying programmable switches to process intermediate data. The high-speed SRAM of the switch allows the delay of simple logic operations to be kept within milliseconds. Therefore, the latency of block aggregation through the XOR operation on a switch is negligible compared to the transmission latency. In addition, block aggregation on intermediate servers has been achieved in [11] with significant improvement. Therefore, aggregation of intermediate data can be deployed on both servers and switches, and the delay caused by intermediate node processing does not have a significant impact on the total repair time.

In this paper, we present AggreTree to demonstrate how to use both servers and switches to improve the efficiency of failed block repairing. The purpose of AggreTree is to achieve a satisfactory trade-off between repair speed and cost. To be specific, the failed block should be repaired quickly, while the cost during the repair process is minimized. The key idea of AggreTree is to build a tree connecting all the providers and the newcomer, while the intermediate nodes in this tree are responsible for aggregating or forwarding intermediate data. For the aforementioned example, we present our AggreTree in Fig. 4(b) to illustrate a more efficient repair solution, where the switch w2 is used for aggregating intermediate data. Our AggreTree also obtains 43 Mbps repair bandwidth, while the bandwidth consumption is reduced to 5 × D MB in this example. Next, we elaborate on how AggreTree achieves this goal through a two-stage decision.

Fig. 4. The repair pipelining scheme and the AggreTree scheme presented in this paper.

4. Problem formulation

For the (k, m) erasure coding-based storage system, the original data is divided into k blocks, and then m redundant blocks are generated. Note that all blocks have the same size. Any failed block among the k + m blocks can be regenerated by any other k blocks of the same stripe. When a block fails due to disk damage, downtime, and so on, the k involved storage nodes, i.e., the providers, transfer their blocks to a new node, called the newcomer, to regenerate the failed block.

Definition 1. Failed block repairing refers to building a tree connecting all providers and the newcomer. In this tree, each provider (a leaf node or an intermediate node) transfers its block βi Pi in small-size units toward the newcomer (the root node). The intermediate nodes in the tree (providers or other nodes in the topology) aggregate the units of blocks from their child nodes and send the aggregated units to their parent nodes. Then, the newcomer receives all the units of blocks, aggregates them, and generates the failed block P∗ = Σ_{i=1}^k βi Pi, where there are k providers.

The small-size units of blocks are transmitted continuously and aggregated immediately until the entire block is completely sent out. The aggregation time on intermediate nodes is negligible; therefore, the time of failed block repairing is mainly affected by the transfer time across the network. In addition, the routing path we build is based on the bandwidth of links measured or predicted in a time slot. For the next time slot, we can quickly regenerate a new routing path by updating the bandwidth of each link, until the repairing process is finished.

To formalize the repairing problem of a failed block, as stated in Definition 1, we first model a data center as a graph G = (V, E, W) with the node set V, the edge set E, and the edge weight set W. The set of providers and the newcomer is denoted as M = {v0, v1, v2, . . . , vk}, M ⊆ V, where v0 denotes the newcomer and v1, v2, . . . , vk refer to the k providers. Note that a node vi (vi ∈ V, vi ∉ M) in the graph G = (V, E, W) can be either a server or a switch, since both servers and switches can participate in the repair process by aggregating intermediate data. An edge eij ∈ E denotes a link connecting nodes vi and vj (vi, vj ∈ V). An edge weight w(eij) ∈ W denotes the available bandwidth of the edge eij measured in a time slot, which can be obtained by the control plane of the network [27–29]. To efficiently repair the failed block, the core of our aggregation mechanism is to find such a tree, AggreTree TM = (V′, E′), in the graph G to connect the nodes in M, where M ⊆ V′ ⊆ V, E′ ⊆ E. In particular, the AggreTree may include some switches, or servers that do not hold involved blocks, i.e., nodes vj with vj ∈ V′ but vj ∉ M, which is quite different from prior methods. Those switches and servers are used for data transmission or aggregation in the AggreTree.

Definition 2. The bottleneck bandwidth of a tree refers to the minimum available bandwidth among all links in this tree.

Each link in data centers carries data from different services. Hence, even for links of the same capacity, their available bandwidths are often different. As defined in Definition 2, the bottleneck bandwidth of a tree is the minimum available bandwidth among all links in this tree, and it determines the speed of failed block repairing. Therefore, the first goal of our aggregation mechanism, AggreTree, is to obtain the maximum bottleneck bandwidth, which determines the completion time of failed block repairing. We call the tree with the maximum bottleneck bandwidth the maximum bottleneck tree in this paper.
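As an illustration of Definition 1, the toy sketch below (our own example, not the authors' implementation) simulates the bottom-up aggregation along a repair tree, assuming the decoding coefficients have already been applied by the providers so that aggregation reduces to a bytewise XOR; note that every tree link carries exactly one block.

```python
def repair_at_root(children, provider_blocks, node, block_size=4):
    """Recursively aggregate the blocks flowing up the tree rooted at `node`.
    children: {node: [child, ...]}; provider_blocks: {provider: bytes}."""
    acc = bytearray(provider_blocks.get(node, bytes(block_size)))  # zero block if not a provider
    for child in children.get(node, []):
        for i, byte in enumerate(repair_at_root(children, provider_blocks, child, block_size)):
            acc[i] ^= byte                 # aggregate the single block arriving on this link
    return bytes(acc)

# Hypothetical AggreTree: newcomer v0 <- {switch w1 <- providers v1, v2} and provider v3.
children = {"v0": ["w1", "v3"], "w1": ["v1", "v2"]}
blocks = {"v1": b"\x01\x02\x03\x04", "v2": b"\x05\x06\x07\x08", "v3": b"\x0f\x0f\x0f\x0f"}
print(repair_at_root(children, blocks, "v0").hex())   # XOR of the three provider blocks
```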

The first stage of our AggreTree is to find such a maximum bottleneck tree to achieve the highest repairing speed. Such an objective is formalized in Eq. (2), where w(e′ij) denotes the available bandwidth of the edge e′ij ∈ E′, and |V′| denotes the number of nodes in V′. The first constraint in Eq. (2) ensures that the edge e′ij ∈ E′ is selected from the edge set E. The second constraint guarantees that the AggreTree TM = (V′, E′) connects all nodes in the set M, which consists of all providers and the newcomer. We call the result B the bottleneck bandwidth of the maximum bottleneck tree.

B = max min_{e′ij ∈ E′} w(e′ij),  s.t.  e′ij ∈ E′ ⊆ E,  M ⊆ V′ ⊆ V,  1 ≤ i ≤ |V′|, 1 ≤ j ≤ |V′|    (2)

Definition 3. The transmission cost of AggreTree refers to the number of links in the tree.

In erasure coding-based storage systems, the original data is encoded into k + m blocks, and those blocks are transferred over the involved links. With our AggreTree, each link transmits only one block because intermediate nodes aggregate the blocks arriving from multiple links. Hence, when repairing a failed block, the bandwidth consumed is proportional to the number of links consumed. As defined in Definition 3, the transmission cost refers to the number of links: the more links a transmission occupies, the more bandwidth it consumes. Therefore, the second goal of our aggregation mechanism is to build the tree with the lowest transmission cost while ensuring the highest repairing speed. The second optimization objective of AggreTree is shown in Eq. (3). The objective in Eq. (3) ensures the lowest transmission cost, while the second constraint guarantees that the available bandwidth of each employed link is no lower than the bottleneck bandwidth B.

min |E′|,  s.t.  e′ij ∈ E′ ⊆ E,  w(e′ij) ≥ B,  M ⊆ V′ ⊆ V    (3)

Our aggregation mechanism achieves the above two objectives as shown in Eqs. (2) and (3). The core of our aggregation mechanism is to find a tree, AggreTree TM = (V′, E′), with the maximum bottleneck bandwidth, while the transmission cost is as low as possible. It can be seen as a two-stage decision, i.e., speed first and cost second. The pursuit of bandwidth maximization alone often carries unnecessary additional transmission cost. Considering the example in Fig. 2, if we only consider Eq. (2), it is easy to find each maximum bottleneck path from the newcomer to each provider by modifying the classical Prim [30] or Kruskal [31] algorithm. Then, the repairing tree that consists of those paths could be the solution in Fig. 3(b) or Fig. 4(a). However, it is the bottleneck bandwidth of the tree that determines the transmission speed; there is no need to search for the links with the highest bandwidth. In other words, the transmission speed is not affected as long as the bandwidth of every link in the tree is greater than or equal to the bottleneck bandwidth of the tree. For the aforementioned example in Fig. 3(b), the bottleneck bandwidth is 43 Mbps. Hence, links with bandwidth greater than 43 Mbps can be used to transmit blocks without reducing the repair speed. If we choose the links (s3, w2) and (s6, w2) and use the switch w2 to aggregate blocks, rather than guiding the blocks to s5 for aggregation, the transmission cost can be further reduced. The scheme in Fig. 4(b) has the same repairing bandwidth of 43 Mbps, while the transmission cost is lower than that in Fig. 3(b). Thus, the transmission cost can be further reduced while keeping the bottleneck bandwidth of the tree unchanged.
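As a minimal illustration of the two objectives, the helper below scores a candidate repair tree by its bottleneck bandwidth (Eq. (2)) and its link count (Eq. (3)); the edge list is hypothetical and only shows the bookkeeping, not how a good tree is found.

```python
def score_tree(tree_edges):
    """tree_edges: iterable of (u, v, available_bandwidth_mbps) links of a candidate tree."""
    bandwidths = [w for _, _, w in tree_edges]
    return min(bandwidths), len(bandwidths)   # (Eq. (2) bottleneck bandwidth, Eq. (3) cost |E'|)

# Hypothetical candidate connecting three providers and the newcomer v0 through one switch w1.
candidate = [("v1", "w1", 80), ("v2", "w1", 45), ("w1", "v0", 60), ("v3", "v0", 70)]
print(score_tree(candidate))   # (45, 4): repair speed limited by 45 Mbps, four links consumed
```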

5. Construction of AggreTree

Based on the above analysis, the construction of the AggreTree is a two-stage decision process. To build an efficient AggreTree, we first find a connecting tree with the maximum bottleneck bandwidth, and then build the AggreTree with the minimum transmission cost while keeping the bottleneck bandwidth unchanged.

5.1. Identifying the maximum bottleneck bandwidth

To efficiently repair the failed block, the first stage of AggreTree is to build a maximum bottleneck tree and find its bottleneck bandwidth B, as shown in Eq. (2). The bottleneck bandwidth B determines the speed of failed block repairing. Specifically, this maximum bottleneck tree should include all involved providers and the newcomer: the providers perform as leaf nodes or intermediate nodes, while the newcomer works as the root node. Other nodes, i.e., switches or other storage servers in the data center, may also participate in the process of failed block repairing; they are responsible for forwarding blocks or aggregating some intermediate data. This is different from prior tree-structured schemes and repair pipelining, where only the providers and the newcomer can aggregate intermediate data.

Definition 4. The maximum bottleneck path from vi to vj refers to the path whose bottleneck bandwidth is maximum in a weighted undirected graph.

To find the maximum bottleneck bandwidth B, a maximum bottleneck tree is first required. In the weighted undirected graph G = (V, E, W), building such a tree requires that each provider and the newcomer be connected by a maximum bottleneck path. Hence, we first build the maximum bottleneck path between each provider and the newcomer separately. All the maximum bottleneck paths then form a maximum bottleneck tree Ts = (Vs, Es, Ws). This tree Ts contains all the nodes in M and some other nodes in V; all those nodes make up Vs, i.e., M ⊆ Vs ⊆ V. Each path from the newcomer to a provider consists of multiple links eij, all links in those paths form the edge set Es, and their weights form the weight set Ws. As defined in Definition 2, the bottleneck bandwidth of a tree refers to the minimum available bandwidth among all links in this tree. Therefore, to obtain the bottleneck bandwidth of the maximum bottleneck tree Ts = (Vs, Es, Ws), we search for its minimum weight, i.e., B = min w(eij), eij ∈ Es.

Theorem 1. The maximum bottleneck tree Ts = (Vs, Es, Ws) has a bottleneck bandwidth no lower than that of any other tree containing M in G = (V, E, W).

Proof. Assume that there is a tree Ts′ whose bottleneck bandwidth is higher than that of Ts. According to Definition 2, the bottleneck bandwidth of a tree is the minimum bottleneck bandwidth among the paths from each provider to the newcomer in this tree. That is, there is a path from some provider vi to the newcomer in the tree Ts′ whose bottleneck bandwidth is higher than that of the path connecting the provider vi and the newcomer in the tree Ts. This is a contradiction, because each path in Ts is the maximum bottleneck path from the provider to the newcomer.
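The sketch below illustrates Definition 4, assuming a plain adjacency-list representation: a Dijkstra-style search that maximizes the minimum edge weight returns the bottleneck bandwidth of the maximum bottleneck path between two nodes. The toy graph and its bandwidths are invented for illustration.

```python
import heapq

def widest_path_bandwidth(adj, src, dst):
    """adj: {node: [(neighbor, available_bandwidth), ...]} for an undirected graph.
    Returns the bottleneck bandwidth of the maximum bottleneck path from src to dst."""
    best = {src: float("inf")}
    heap = [(-float("inf"), src)]              # max-heap keyed on the path bottleneck
    while heap:
        neg_b, u = heapq.heappop(heap)
        b = -neg_b
        if u == dst:
            return b
        if b < best.get(u, 0):
            continue                           # stale heap entry
        for v, w in adj.get(u, []):
            cand = min(b, w)                   # bottleneck of the extended path
            if cand > best.get(v, 0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return 0.0                                 # dst unreachable

adj = {"a": [("b", 85)], "b": [("a", 85), ("c", 73), ("d", 43)],
       "c": [("b", 73)], "d": [("b", 43)]}
print(widest_path_bandwidth(adj, "a", "d"))    # 43
```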

According to Theorem 1, we can generate a maximum bottleneck tree Ts = (Vs, Es, Ws) by building the maximum bottleneck path from the newcomer to each provider, respectively. Then, the minimum bandwidth among all edges eij ∈ Es is the maximum bottleneck bandwidth B. The function MaxBottleneck(M, G) in Algorithm 1 is designed to calculate the maximum bottleneck bandwidth B. In a weighted undirected graph G = (V, E, W), each edge eij ∈ E is ordered according to its weight w(eij) ∈ W in a descending sequence, and the result is recorded as the set OrderLink (line 8). The set of providers and the newcomer is denoted as M, M ⊆ V. In particular, we only care about the maximum bottleneck bandwidth B of the maximum bottleneck tree; hence, it is unnecessary to calculate every maximum bottleneck path exactly. Instead, we can adopt a fast and effective method to directly generate the maximum bottleneck bandwidth B. More specifically, the edge set ConnectTree is initialized as an empty set (line 9), and an edge from the edge set OrderLink is selected each time to join ConnectTree (lines 11–15). If multiple edges have the same weight, these edges are added to the set ConnectTree simultaneously, without affecting the value of the maximum bottleneck bandwidth B. When the edges in ConnectTree connect all nodes in the set M, the algorithm stops. Thereby, the weight of the last added edge is the maximum bottleneck bandwidth B (line 16).

Our function MaxBottleneck(M, G) can always find the maximum bottleneck value. The reason is that the links (edges) are first sorted by weight in descending order, and then added step by step until the edges selected so far connect all the providers and the newcomer. Thus, the smallest weight among the edges added from OrderLink is directly the maximum bottleneck bandwidth, without ambiguous edge selections.

Algorithm 1 Construction of AggreTree.
Input: G = (V, E, W), M = {v0, v1, v2, . . . , vk}
Output: TM = (V′, E′)
1: Define B = null;
2: B = MaxBottleneck(M, G);
3: Gc = UpdateTopology(B, G);
4: Build the minimum Steiner tree TM = (V′, E′) in Gc = (Vc, Ec, Wc), subject to M ⊆ V′;
5: return TM.
6: function MaxBottleneck(M, G)
7:   Define OrderLink = ∅;
8:   Order all edges eij ∈ E by their weights w(eij) ∈ W in a descending sequence; the ith element is OrderLink[i];
9:   Define ConnectTree = ∅;
10:  Define B = null;
11:  for l = 1 → size(V) do
12:    if ∃ vi ∈ M is disconnected by ConnectTree then
13:      ConnectTree ← OrderLink[l];
14:    else
15:      break;
16:  B = w(OrderLink[l]);
17:  return B.
18: function UpdateTopology(B, G)
19:   Define graph Gc = (Vc, Ec, Wc);
20:   Gc = G;
21:   for each eij ∈ Ec do
22:     if w(eij) < B then
23:       Delete the edge eij from Ec;
24:       Update Gc = (Vc, Ec, Wc);
25:   Set all elements in Wc as 1;
26:   return Gc.
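A Python sketch of the MaxBottleneck function above, assuming the topology is given as (u, v, available_bandwidth) edge tuples and using a union-find structure to test when M becomes connected; the sample edge list is hypothetical.

```python
def max_bottleneck(edges, M):
    """Scan edges in descending bandwidth order until all nodes in M are connected;
    the weight of the last added edge is the maximum bottleneck bandwidth B."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x
    def union(a, b):
        parent[find(a)] = find(b)

    B = None
    for u, v, w in sorted(edges, key=lambda e: e[2], reverse=True):
        union(u, v)
        B = w
        if len({find(m) for m in M}) == 1:     # all providers and the newcomer connected
            return B
    return B                                   # only reached if the graph cannot connect M

edges = [("s1", "s2", 85), ("s1", "s4", 73), ("s4", "s5", 63),
         ("s3", "w2", 51), ("s6", "w2", 47), ("w2", "s5", 43)]
print(max_bottleneck(edges, {"s2", "s3", "s4", "s6"}))   # 43 for this toy input
```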

5.2. Reducing the transmission cost

The second stage of AggreTree is to reduce the transmission cost while guaranteeing the maximum bottleneck bandwidth. As mentioned above, the transmission speed is determined by the bottleneck bandwidth of the tree. Finding the maximum bottleneck tree yields the maximum transmission speed, but its transmission cost may be relatively high: blindly looking for the links with the largest available bandwidth often leads to additional bandwidth consumption, overly long transmission paths, and too many occupied links. Actually, we only need links whose available bandwidth is no lower than the bottleneck bandwidth B of the tree, which does not affect the transmission speed.

When we construct the maximum bottleneck tree Ts = (Vs, Es, Ws), we do not consider the transmission cost. The tree Ts may therefore not meet the goal of minimizing the transmission cost in Eq. (3), and we need to further reduce its transmission cost without affecting the repairing speed. According to Definition 3, the number of links in the tree determines the transmission cost, so we must find the maximum bottleneck tree with the least number of links. To achieve this goal, we first design the function UpdateTopology(B, G) to customize the topology based on the maximum bottleneck bandwidth B, as shown in Algorithm 1. If the weight of a link in the weighted undirected graph G = (V, E, W) is below the threshold B, the link is removed from the graph G. In this way, the weights of the remaining links are higher than or equal to the bottleneck bandwidth B. We denote the customized graph as Gc = (Vc, Ec, Wc). It is obvious that min w(eij) ≥ B, where eij ∈ Ec ⊆ E. Note that the threshold B is hard to set artificially in advance: if the preset threshold is too large, the customized graph is likely to fail to connect all providers and the newcomer; if it is too small, the transmission speed will be limited.

Theorem 2. The customized graph Gc = (Vc, Ec, Wc) guarantees the connectivity between all the providers and the newcomer.

Proof. As aforementioned, the spanning tree Ts = (Vs, Es) is a maximum bottleneck tree in the graph G = (V, E, W), which connects all providers and the newcomer. Meanwhile, the weights of edges in Es are all no less than B. That is, all edges of Ts are preserved in Gc, i.e., Es ⊆ Ec. Therefore, there is at least one tree, namely Ts, that connects all the providers and the newcomer in the customized graph Gc = (Vc, Ec, Wc).

Theorem 2 shows that in the graph Gc = (Vc, Ec, Wc), we can definitely find a tree connecting all the providers and the newcomer whose transmission cost is less than or equal to that of the maximum bottleneck tree Ts = (Vs, Es). To reduce the transmission cost, it is necessary to connect the providers and the newcomer with as few links as possible. Next, we present two ways to reduce the transmission cost. The former is a fast way with low complexity, while the latter is a more precise way with a slight increase in computation time.

5.2.1. The subtract method based on finding the shortest path

An easy way to reduce the transmission cost is to find the shortest path between each provider and the newcomer in the graph Gc = (Vc, Ec, Wc). Specifically, we define a weight set C, where c(eij) = 1, c(eij) ∈ C. Thus, for each edge eij ∈ Ec in Gc = (Vc, Ec, Wc), the edge weight w(eij) is replaced by c(eij) ∈ C, obtaining a new graph G′c = (Vc, Ec, C). That is, the weight of every edge in the new graph G′c = (Vc, Ec, C) is 1. We can reduce the transmission cost by looking for a shortest path from each provider to the newcomer in the graph G′c = (Vc, Ec, C). In this way, those shortest paths form a new tree TM = (V′, E′), where M ⊆ V′ ⊆ Vc, E′ ⊆ Ec. The nodes where the paths intersect aggregate the blocks from multiple links. Obviously, the bottleneck bandwidth of TM equals the maximum bottleneck bandwidth B, while the transmission cost is much lower than that of the aforementioned maximum bottleneck tree Ts.
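A sketch of this shortest-path subtract method, assuming the pruned graph Gc is available as a networkx graph: every surviving edge implicitly gets unit cost (hop count), and the repair tree is the union of one shortest path per provider; the intersection nodes are where blocks are aggregated.

```python
import networkx as nx

def shortest_path_tree(Gc: nx.Graph, newcomer, providers):
    """Union of unweighted (unit-cost) shortest paths from each provider to the newcomer."""
    tree = nx.Graph()
    for p in providers:
        path = nx.shortest_path(Gc, source=p, target=newcomer)   # hop-count shortest path
        tree.add_edges_from(zip(path, path[1:]))
    return tree        # transmission cost of this tree = tree.number_of_edges()
```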

5.2.2. The subtract method based on finding the minimum Steiner tree

Although the transmission cost can be reduced via the subtract method in Section 5.2.1, the result is not optimal. Furthermore, we prove that minimizing the transmission cost is NP-hard.

Theorem 3. It is NP-hard to find the maximum bottleneck tree with the lowest transmission cost.

Proof. The maximum bottleneck tree with the lowest transmission cost connects all the providers and the newcomer, consuming the least number of links while keeping the bottleneck bandwidth unchanged. In the graph G′c = (Vc, Ec, C), finding such a tree allows other intermediate nodes vi ∈ Vc, vi ∉ M to join the tree TM to reduce the transmission cost. This is the minimum Steiner tree problem, which is NP-hard [32].

Fig. 5. Two different subtract methods.

For the example in Fig. 5, the nodes v1 and v2 are two providers and v0 is the newcomer. Using the subtract method based on finding the shortest path, we find the shortest paths between v1 and v0, and between v2 and v0, respectively. The result is shown in Fig. 5(a); the transmission cost is 3 + 4 = 7. However, if the blocks are aggregated on the node v3, the number of links can be further reduced to 6, as Fig. 5(b) shows. Hence, we should consider all the transmission paths together when building the tree to minimize the total transmission cost. Therefore, minimizing the transmission cost is the process of finding the minimum Steiner tree in the graph G′c = (Vc, Ec, C). This Steiner tree is denoted as TM = (V′, E′), where M ⊆ V′ ⊆ Vc, E′ ⊆ Ec. The tree TM = (V′, E′) is the maximum bottleneck tree with the lowest transmission cost, satisfying the requirements of Eqs. (2) and (3). Many approximation algorithms [32–34] have been designed to solve the minimum Steiner tree problem, which can be utilized to find the AggreTree TM. The heuristic achieves an approximation ratio of ≈ 1.28 within time O(mn^2), where m and n are the numbers of terminals and non-terminals in the graph, respectively [34].

In summary, the construction process of AggreTree can be expressed by Algorithm 1. The inputs are the data center topology G = (V, E, W) and the set M = {v0, v1, v2, . . . , vk}, which consists of the providers and the newcomer. The output is the AggreTree TM = (V′, E′), which has the maximum bottleneck bandwidth with the lowest transmission cost. The function MaxBottleneck(M, G) finds the maximum bottleneck bandwidth B of the maximum bottleneck tree, while the function UpdateTopology(B, G) removes edges with weights less than the bottleneck bandwidth B from the graph G = (V, E, W). The algorithm first computes the maximum bottleneck bandwidth B of a maximum bottleneck tree in the graph G = (V, E, W) (line 2); this maximum bottleneck tree connects all the nodes in the node set M. Next, the original topology G = (V, E, W) is pruned according to the bottleneck bandwidth B, and links with bandwidth lower than B are deleted (line 3). Using the remaining topology, we look for a minimum Steiner tree that connects all providers and the newcomer (line 4). This Steiner tree TM = (V′, E′) is the maximum bottleneck tree with the lowest transmission cost.

The time complexity of building the AggreTree is the same as that of building the minimum Steiner tree, i.e., O(mn^2) as aforementioned. The running time of the function MaxBottleneck(M, G) comes mainly from sorting and testing node connectivity, which is lightweight. The function UpdateTopology(B, G) only updates the topology, so its complexity is much lower than that of the function MaxBottleneck(M, G). For large-scale topologies, Algorithm 1 is fast enough. Moreover, if we need to further reduce the computation time, we can use the reduction method presented in Section 5.2.1 with a slight loss of precision, instead of calculating the minimum Steiner tree. In this way, the time complexity can be reduced to O(n^2).
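Putting the two stages together, the sketch below outlines the Algorithm 1 pipeline using networkx, whose approximation module ships a metric-closure-based steiner_tree routine; it is an illustrative reimplementation under these assumptions, not the authors' code.

```python
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

def build_aggretree(G: nx.Graph, M):
    """G: undirected graph whose 'weight' attribute is the available bandwidth; M: providers + newcomer."""
    # Stage 1: maximum bottleneck bandwidth B (add edges in descending bandwidth
    # order until all nodes in M become mutually reachable).
    ref = next(iter(M))
    H = nx.Graph()
    H.add_nodes_from(G.nodes)
    B = None
    for u, v, w in sorted(G.edges(data="weight"), key=lambda e: e[2], reverse=True):
        H.add_edge(u, v)
        B = w
        if all(nx.has_path(H, ref, m) for m in M):
            break
    # Stage 2: prune links slower than B, give the survivors unit cost, and
    # approximate the minimum Steiner tree spanning M on that subgraph.
    Gc = nx.Graph()
    for u, v, w in G.edges(data="weight"):
        if w >= B:
            Gc.add_edge(u, v, cost=1)
    Gc = Gc.subgraph(nx.node_connected_component(Gc, ref))   # component containing M (Theorem 2)
    return steiner_tree(Gc, list(M), weight="cost"), B
```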

6. Performance evaluation

In this section, we empirically evaluate the performance of our AggreTree mechanism. We first introduce the simulation setup in Section 6.1, including the parameter settings, comparison methods, and performance metrics. Then, we conduct performance evaluations in two typical topologies, fat-tree in Section 6.2 and BCube in Section 6.3.

6.1. Simulation setup

6.1.1. Parameter settings

We evaluate the performance of our AggreTree in two typical data center topologies, Fat-tree [18] and BCube [13]. Fat-tree is the most popular switch-centric data center topology, where data forwarding depends only on the switches. The topology scale is determined by the parameter α: the α-ary fat-tree topology consists of α^3/4 servers and 5α^2/4 switches. In contrast, BCube is a representative server-centric topology, which is adopted in many modular data centers, e.g., HP's POD [35] and IBM's Modular Data Center [36]. Servers in BCube can also forward data like switches. The topology scale is determined by two parameters, n and b: BCube(n, b) consists of n^(b+1) servers and (b + 1) × n^b switches. In the simulation, we set α = 8, 16, and 24 in the fat-tree topologies, giving 128, 1024, and 3456 servers, respectively. For BCube, we evaluate our AggreTree in BCube(4, 3) with 256 servers and 256 switches. The available bandwidth w(eij) in each topology follows a uniform distribution, w(eij) ∼ U(1, 100) Mbps.
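As a quick sanity check of the topology sizes quoted above (a worked example of the formulas only, not part of any simulator):

```python
def fat_tree_sizes(alpha):
    """An alpha-ary fat-tree has alpha^3/4 servers and 5*alpha^2/4 switches."""
    return alpha**3 // 4, 5 * alpha**2 // 4

def bcube_sizes(n, b):
    """BCube(n, b) has n^(b+1) servers and (b+1)*n^b switches."""
    return n**(b + 1), (b + 1) * n**b

for a in (8, 16, 24):
    print(a, fat_tree_sizes(a))   # (128, 80), (1024, 320), (3456, 720) servers/switches
print(bcube_sizes(4, 3))          # (256, 256)
```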

The servers in the topologies can store or aggregate data blocks, while the switches are responsible for aggregating or forwarding intermediate data. Providers and the newcomer are selected randomly from the set of servers. We select multiple servers and specify one server as the newcomer and the others as providers. The providers transfer their blocks in small-size units (slices) to the newcomer via the network.

6.1.2. Comparison methods

We compare our AggreTree with the following three methods. The first one refers to the previous tree-structured repair work [9], while the latter two methods are designed to further evaluate the transmission cost and speed of AggreTree. In particular, both the MinCostOnly method and the MaxBandwidthOnly method use intermediate nodes to aggregate data, but each pursues a single optimization objective, i.e., the minimum transmission cost or the maximum repair speed.

Tree-structured Method: As aforementioned, this method builds the maximum spanning tree connecting the providers and the newcomer. The topology of the data center is ignored, and aggregation of blocks is only performed on the providers or the newcomer.

MinCostOnly Method: The purpose of this method is to build a tree with the least transmission cost, ignoring the available bandwidth of the links. This approach is more efficient than AggreCode [10] because intermediate nodes, including switches, can also aggregate data blocks to avoid occupying duplicate links.

MaxBandwidthOnly Method: This method aims to build a tree connecting the providers and the newcomer with the maximum transmission bandwidth. Meanwhile, it allows intermediate nodes to aggregate blocks, rather than only the providers or the newcomer.

6.1.3. Performance metrics

We mainly evaluate the bottleneck bandwidth and the transmission cost of the different methods in the fat-tree and BCube networks.

Bottleneck bandwidth is directly related to the transmission speed of failed block repairing. If no link is reused, the bottleneck bandwidth is simply the minimum available bandwidth among all links of the tree. Otherwise, the available bandwidth of the repeatedly occupied links is reduced to a lower value, affecting the transmission speed.

Transmission cost is the bandwidth consumed. For the MinCostOnly method, the MaxBandwidthOnly method, and AggreTree, the transmission cost is the number of links consumed. For the tree-structured method, if a link is reused, the bandwidth consumed on this link increases accordingly; in this case, the transmission cost is higher than the number of links in the tree. We define one unit of transmission cost as the bandwidth consumed by transferring one block over one link.

6.2. Performance in fat-tree-like data centers

We compare the bottleneck bandwidth and the transmission cost in fat-tree data centers, with respect to the bandwidth heterogeneity and the number of providers.

6.2.1. Impact of the bandwidth heterogeneity

In fat-tree data centers, the switches are divided into edge switches, aggregation switches, and core switches. Each server is connected to an edge switch via a single link. Therefore, congestion on this single link often results in a sharp decrease in available bandwidth, leading to unsatisfactory connectivity between servers. To this end, only the nodes above the edge layer are considered in our simulations. That is, the links between the edge switches and servers are not the transmission bottleneck by default.

Fig. 6 shows the comparison of the bottleneck bandwidth and the transmission cost with 12 providers. Specifically, Fig. 6(a) shows the bottleneck bandwidth in the fat-tree topology with α = 16. AggreTree obtains the highest bottleneck bandwidth, 69.64 Mbps on average. In particular, compared with the tree-structured method, our AggreTree improves the bottleneck bandwidth by 4.06 times. The reason is that the tree-structured method can hardly reach a high bottleneck bandwidth value due to link reuse. The bottleneck bandwidth of the MinCostOnly method (4.28 Mbps on average) is even much lower than that of the tree-structured method (13.76 Mbps on average); the root cause is that the MinCostOnly method ignores the available bandwidth of links when building the connecting tree. In contrast, the MaxBandwidthOnly method gains the same bottleneck bandwidth as AggreTree, because its construction is similar to the first stage of AggreTree; hence, its result is not shown in the figure.

Fig. 6(b) shows the performance of the comparison methods in terms of transmission cost. The tree-structured method brings the highest transmission cost due to the repeated use of links, at an average of 292.3 units of bandwidth overhead. In contrast, the MaxBandwidthOnly method brings a slightly lower transmission cost, with 223.95 units on average. The reason is that it makes use of intermediate nodes to aggregate blocks, which avoids repeated link occupation and thus reduces the transmission cost. However, the reduction is quite limited because the MaxBandwidthOnly method always chooses the links with high available bandwidth, introducing some unnecessary transmission cost. The MinCostOnly method aims to build the connecting tree with the minimum transmission cost; hence, its transmission cost is the lowest of those three methods. AggreTree generates a slightly higher transmission cost than the MinCostOnly method, because AggreTree reduces the transmission cost in its second stage only among links that preserve the bottleneck bandwidth. In particular, AggreTree reduces the transmission cost by an average of 80.86% compared with the tree-structured method.

Fig. 6(c) further evaluates the bottleneck bandwidth with respect to the topology size. As the topology scale increases, AggreTree obtains a higher bottleneck bandwidth. The reason is that there are more links with high available bandwidth in the data center as the topology grows; therefore, a connecting tree with a higher bottleneck value can be constructed from more candidate links.

6.2.2. Impact of changing the number of providers

Fig. 7 shows the evaluation results with respect to the number of providers in a given topology. The results are based on an average of 100 repeated experiments. In Fig. 7(a), we compare the transmission cost of AggreTree and the MaxBandwidthOnly method. Note that the core of the MaxBandwidthOnly method is building a maximum bottleneck tree to obtain the maximum repair speed, like the first-stage decision of AggreTree.
However, the MaxBandwidthOnly method incurs much more transmission cost than AggreTree in different fat-tree topologies, although it obtains the same bottleneck bandwidth as AggreTree. AggreTree reduces the transmission cost by up to 87.1% in the α = 24 fat-tree topology. The root cause is that our AggreTree adopts effective methods to reduce link occupancy and thus reduces the transmission cost. Fig. 7(b) evaluates the transmission cost with respect to the number of providers.


Fig. 6. Results with 12 providers, averaged over 100 experiments.

Fig. 7. Evaluations in the fat-tree topologies.

Fig. 8. Evaluations in the BCube (4, 3) topology.

In different fat-tree topologies, the bandwidth overhead grows from about 16 units to about 40 units as the number of providers increases. The transmission cost of AggreTree increases only slightly because the fat-tree topology is built from three layers of switches: although its size is determined by the parameter α, changing α does not influence the distance between nodes in the fat-tree topology, and therefore has no significant impact on the transmission cost.

The transmission cost of the different methods is shown in Fig. 7(c). Under different numbers of providers, the transmission cost of the tree-structured method and the MaxBandwidthOnly method is much higher than that of AggreTree and the MinCostOnly method; e.g., the tree-structured method incurs 6.46 times the cost of AggreTree with 12 providers. By contrast, the MinCostOnly method keeps the minimum bandwidth overhead; however, its repair speed is quite slow because of congestion. Our AggreTree consumes a slightly higher transmission cost than the MinCostOnly method while realizing fast repair of failed blocks.

6.3. Performance in BCube-like data centers

Different from the fat-tree topology, BCube is a server-centric topology, where servers can also perform as switches to forward data. We mainly conduct the experiments in BCube(4, 3) to evaluate the performance.

We compare AggreTree with the tree-structured method to evaluate the cost savings and the repair speed improvement in Fig. 8(a). As the number of providers increases, the repair speed improvement becomes larger. The reason is that link collisions appear with higher probability in the tree-structured method when the number of providers grows, causing much more bandwidth overhead. Meanwhile, the rising tendency of the repair speed is more obvious as the number of providers increases, because the increased link collisions dramatically reduce the bottleneck transmission bandwidth. In particular, AggreTree increases the repair speed by 2.86 times and reduces the transmission cost by 81.68% when there are 12 providers.

We further evaluate the bottleneck bandwidth when the number of providers is 12, as Fig. 8(b) shows. Similar to the result in the fat-tree topology, AggreTree obtains the highest transmission bandwidth, at an average of 42.01 Mbps, while the tree-structured method and the MinCostOnly method average 9.674 Mbps and 1.98 Mbps, respectively. Fig. 8(c) shows the transmission cost of the four methods in the
BCube topology. It can be seen that AggreTree consumes a slightly higher transmission cost than the MinCostOnly method, but a lower transmission cost than the MaxBandwidthOnly method and the tree-structured method. Different from the result in the fat-tree topology, AggreTree causes a slightly higher transmission cost than the MinCostOnly method in the BCube topology. This is caused by the structural characteristics of BCube, i.e., changing the transmission path in BCube often brings additional path consumption. Nonetheless, the transmission cost of AggreTree is significantly lower than that of the MaxBandwidthOnly method and the tree-structured method.

7. Conclusion

The inefficient repair of erasure codes has become their application bottleneck. In this paper, we propose AggreTree to improve the efficiency of failed block repairing for erasure coding-based storage systems. Our AggreTree leverages servers and switches along the transmission paths to aggregate intermediate data, thus reducing bandwidth consumption while avoiding congestion. We design a two-stage decision to achieve the best trade-off between repair speed and bandwidth consumption, that is, prioritizing the fastest repair speed while minimizing bandwidth consumption as much as possible. Compared with prior efforts, we build the repair scheme at the level of practical topologies, jointly considering the effect of both repair speed and bandwidth overhead. Finally, we conduct experiments in two representative data center topologies to evaluate the performance of AggreTree. The results show that AggreTree can improve the repairing speed of failed blocks by up to 4.06 and 2.86 times, and reduce the transmission cost by 80.86% and 81.68% in Fat-tree and BCube data centers, respectively.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is partially supported by the National Natural Science Foundation of China under Grant No. 61772544, the Hunan Provincial Natural Science Fund for Distinguished Young Scholars under Grant No. 2016JJ1002, and the Guangxi Cooperative Innovation Center of Cloud Computing and Big Data under Grant Nos. YD16507 and YD17X11.

References

[1] Colossus, Successor to Google File System, http://static.googleusercontent.com/media/research.google.com/en/us/university/relations/facultysummit2010/storage_architecture_and_challenges.pdf.
[2] Facebook's erasure coded Hadoop distributed file system (HDFS-RAID), https://github.com/facebook/hadoop-20.
[3] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, S. Yekhanin, Erasure coding in Windows Azure storage, in: 2012 USENIX Annual Technical Conference, 2012, pp. 15–26.
[4] I.S. Reed, G. Solomon, Polynomial codes over certain finite fields, J. Soc. Ind. Appl. Math. 8 (2) (1960) 300–304.
[5] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, S. Lu, Dcell: A scalable and fault-tolerant network structure for data centers, in: Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, ser. SIGCOMM '08, 2008, pp. 75–86.
[6] K.V. Rashmi, N.B. Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran, A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster, in: HotStorage'13, 2013.

[7] J. Xie, D. Guo, X. Zhu, B. Ren, H. Chen, Minimal fault-tolerant coverage of controllers in IaaS datacenters, IEEE Trans. Serv. Comput. (2017).
[8] D. Ford, F. Labelle, F.I. Popovici, M. Stokely, V. Truong, L. Barroso, C. Grimes, S. Quinlan, Availability in globally distributed storage systems, in: 9th USENIX Symposium on Operating Systems Design and Implementation, OSDI, 2010, pp. 61–74.
[9] J. Li, S. Yang, X. Wang, X. Xue, B. Li, Tree-structured data regeneration with network coding in distributed storage systems, in: 17th International Workshop on Quality of Service, IWQoS, 2009, pp. 1–9.
[10] J. Zhang, X. Liao, S. Li, Y. Hua, X. Liu, B. Lin, AggreCode: Constructing route intersection for data reconstruction in erasure coded storage, in: IEEE Conference on Computer Communications, INFOCOM, 2014, pp. 2139–2147.
[11] R. Li, X. Li, P.P.C. Lee, Q. Huang, Repair pipelining for erasure-coded storage, in: 2017 USENIX Annual Technical Conference, USENIX ATC, 2017, pp. 567–579.
[12] Z. Shen, J. Shu, P.P.C. Lee, Reconsidering single failure recovery in clustered file systems, in: 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, 2016, pp. 323–334.
[13] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, S. Lu, BCube: A high performance, server-centric network architecture for modular data centers, in: Proceedings of the ACM SIGCOMM 2009 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2009, pp. 63–74.
[14] L. Luo, D. Guo, J. Wu, T. Qu, T. Chen, X. Luo, VLCcube: A VLC enabled hybrid network structure for data centers, IEEE Trans. Parallel Distrib. Syst. 28 (7) (2017) 2088–2102.
[15] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, I. Stoica, NetCache: Balancing key–value stores with fast in-network caching, in: Proceedings of the 26th Symposium on Operating Systems Principles, 2017, pp. 121–136.
[16] X. Jin, X. Li, H. Zhang, N. Foster, J. Lee, R. Soulé, C. Kim, I. Stoica, NetChain: Scale-free sub-RTT coordination, in: 15th USENIX Symposium on Networked Systems Design and Implementation, NSDI, 2018, pp. 35–49.
[17] A. Sapio, M. Canini, C. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D.R.K. Ports, P. Richtárik, Scaling distributed machine learning with in-network aggregation, CoRR abs/1903.06701 (2019).
[18] M. Al-Fares, A. Loukissas, A. Vahdat, A scalable, commodity data center network architecture, in: Proceedings of the ACM SIGCOMM 2008 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, 2008, pp. 63–74.
[19] S. Ghemawat, H. Gobioff, S. Leung, The Google file system, in: Proceedings of the 19th ACM Symposium on Operating Systems Principles, SOSP, 2003, pp. 29–43.
[20] B. Calder, J. Wang, A. Ogus, et al., Windows Azure storage: A highly available cloud storage service with strong consistency, in: Proceedings of the 23rd ACM Symposium on Operating Systems Principles, SOSP, 2011, pp. 143–157.
[21] J.S. Plank, J. Luo, C.D. Schuman, L. Xu, Z. Wilcox-O'Hearn, A performance evaluation and examination of open-source erasure coding libraries for storage, in: 7th USENIX Conference on File and Storage Technologies, 2009, pp. 253–265.
[22] M. Sathiamoorthy, M. Asteris, D.S. Papailiopoulos, A.G. Dimakis, R. Vadali, S. Chen, D. Borthakur, XORing elephants: Novel erasure codes for big data, PVLDB 6 (5) (2013) 325–336.
[23] D. Guo, Aggregating uncertain incast transfers in BCube-like data centers, IEEE Trans. Parallel Distrib. Syst. 28 (4) (2017) 934–946.
[24] M. Silberstein, L. Ganesh, Y. Wang, L. Alvisi, M. Dahlin, Lazy means smart: Reducing repair bandwidth costs in erasure-coded distributed storage, in: International Conference on Systems and Storage, SYSTOR, 2014, pp. 15:1–15:7.
[25] K.V. Rashmi, P. Nakkiran, J. Wang, N.B. Shah, K. Ramchandran, Having your cake and eating it too: Jointly optimal erasure codes for I/O, storage, and network-bandwidth, in: Proceedings of the 13th USENIX Conference on File and Storage Technologies, FAST, 2015, pp. 81–94.
[26] J. Li, B. Li, Beehive: Erasure codes for fixing multiple failures in distributed storage systems, IEEE Trans. Parallel Distrib. Syst. 28 (5) (2017) 1257–1270.
[27] P. Qin, B. Dai, B. Huang, G. Xu, Bandwidth-aware scheduling with SDN in Hadoop: A new trend for big data, IEEE Syst. J. 11 (4) (2017) 2337–2344.
[28] J. Xie, D. Guo, Z. Hu, T. Qu, P. Lv, Control plane of software defined networks: A survey, Comput. Commun. 67 (2015) 1–10.
[29] J. Xie, D. Guo, C. Qian, L. Li, B. Ren, H. Chen, Validation of distributed SDN control plane under uncertain failures, IEEE/ACM Trans. Netw. 27 (3) (2019) 1234–1247.
[30] C. Martel, The expected complexity of Prim's minimum spanning tree algorithm, Inform. Process. Lett. 81 (4) (2002) 197–201.
[31] U.B. Sayata, N.P. Desai, An algorithm for hierarchical Chinese postman problem using minimum spanning tree approach based on Kruskal's algorithm, in: Advance Computing Conference, 2015.

[32] L. Kou, G. Markowsky, L. Berman, A fast algorithm for Steiner trees, Acta Inform. 15 (2) (1981) 141–145.
[33] The Steiner problem with edge lengths 1 and 2, Inform. Process. Lett. 32 (4) (1989) 171–176.
[34] G. Robins, A. Zelikovsky, Tighter bounds for graph Steiner tree approximation, SIAM J. Discret. Math. 19 (1) (2005) 122–134.
[35] HP Performance Optimized Datacenter (POD), http://h18000.www1.hp.com/products/servers/solutions/datacentersolutions/pod/.
[36] IBM Portable Modular Data Center, http://www-935.ibm.com/services/us/en/it-services/data-center/modular-data-center/index.html.

Junxu Xia: received the B.S. degree in management science and engineering from the National University of Defense Technology, Changsha, China, in 2018. He is currently working toward the M.S. degree in the same department. His main research interests include data centers, cloud computing, and distributed systems.


Deke Guo: received the B.S. degree in industrial engineering from the Beijing University of Aeronautics and Astronautics, Beijing, China, in 2001, and the Ph.D. degree in management science and engineering from the National University of Defense Technology, Changsha, China, in 2008. He is currently a Professor with the College of Systems Engineering, National University of Defense Technology. His research interests include distributed systems, software-defined networking, data center networking, wireless and mobile systems, and interconnection networks. He is a member of the ACM.

Junjie Xie: received the B.S. degree in computer science and technology from the Beijing Institute of Technology, Beijing, China, in 2013, and the M.S. degree in management science and engineering from the National University of Defense Technology (NUDT), Changsha, China, in 2015. He has been a Ph.D. student at NUDT since 2016 and a joint Ph.D. student at the University of California, Santa Cruz (UCSC), USA, since October 2017, supported by the China Scholarship Council (CSC). His research interests include distributed systems, software-defined networking, and edge computing.