MPLS-based reduction of flow table entries in SDN switches supporting multipath transmission

MPLS-based reduction of flow table entries in SDN switches supporting multipath transmission

Journal Pre-proof MPLS-based reduction of flow table entries in SDN switches supporting multipath transmission Zbigniew Duli´nski, Grzegorz Rzym, Piot...

3MB Sizes 0 Downloads 48 Views

Journal Pre-proof MPLS-based reduction of flow table entries in SDN switches supporting multipath transmission Zbigniew Duli´nski, Grzegorz Rzym, Piotr Chołda

PII: DOI: Reference:

S0140-3664(18)30686-8 https://doi.org/10.1016/j.comcom.2019.12.052 COMCOM 6109

To appear in:

Computer Communications

Received date : 6 August 2018 Revised date : 23 September 2019 Accepted date : 27 December 2019 Please cite this article as: Z. Duli´nski, G. Rzym and P. Chołda, MPLS-based reduction of flow table entries in SDN switches supporting multipath transmission, Computer Communications (2020), doi: https://doi.org/10.1016/j.comcom.2019.12.052. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.

Journal Pre-proof

of

MPLS-based reduction of ow table entries in SDN switches supporting multipath transmission 1∗ 2 2 Zbigniew Duli«ski , Grzegorz Rzym , Piotr Choªda 1

Jagiellonian University, Faculty of Physics, Astronomy, and Applied Computer

pro

Science, ul. Šojasiewicza 11, 30-348 Kraków, Poland

2

AGH University of Science and Technology, Department of Telecommunications,

Abstract

re-

Al. Mickiewicza 30, 30-059 Kraków, Poland

In this paper, the problem of resource utilisation improvement in softwaredened networking (SDN) is addressed. The need for resource optimization is understood here be twofold.

First, bandwidth in links should be saved

when congestion appears. Second, the internal resources represented by ta-

urn al P

ble entries of SDN switches should be minimised to ensure fast processing and exibility. Here, both types of resources are optimised with a new mechanism for ow aggregation. The mechanism is accompanied with a multipath transmission supporting reaction when network conditions change. The proposed mechanism uses classical MPLS labelling, which enables ow aggregation together with multipath transmission; therefore, neither involves any denition of new protocols nor requires the application of legacy signalling protocols.

Only simple yet powerful modications of the exist-

ing solutions assured by exibility of the OpenFlow protocol are necessary. Furthermore, the proposed solution can be incrementally deployed in legacy networks.

The aggregation results in a low number of ow entries in core switches in comparison to legacy OpenFlow operation.

The simulations show that

the number of ow entries in core switches can be reduced by as much as 93%, while the overall network trac is increased by around 171%.

This

Jo

type of scalability improvement of ow processing is obtained as a result of

Corresponding author, e-mail: [email protected], phone: +48 126644871, address: ul. Šojasiewicza 11, 30-348 Kraków, Poland. ∗

Journal Pre-proof

the introduction of a centrally managed MPLS label distribution performed by an SDN controller. Moreover, the proposed method of multipath trans-

of

mission improves network resource utilisation. Additionally, independently of the trac pattern, the proposed approach signicantly reduces the communication overhead between the controller and the switches.

Keywords:

Flow aggregation; Multipath transmission; Multi-protocol label

pro

switching (MPLS); Software-dened networking (SDN)

1. Introduction

In legacy IP networks, packets traverse a single path between a pair of source and destination nodes. The path is established by a routing protocol

re-

as best on the basis of the link metrics (weights).

However, when conges-

tion appears in some links on this path, a new path omitting over-utilised links should be found.

The easiest way for nding a new path consists in

increase of congested links weights. The path is then recalculated [1]. However, even a single modication of a metric can be disruptive to a whole

urn al P

network due to the following scalability issues: (a) the update of routing tables takes a considerable amount of time; (b) it is likely to cause reordering or packet dropping, thus decreasing the performance of TCP. Obviously, the more changes that are introduced, the larger the chaos that is observed. The well known fact is that in almost any network there is at least one concurrent path that is an alternative to the one used [2, 3, 4]. This fact enables the network control system to counteract the abovementioned congestion problem with the so-called multipath transmission. Multipath transmission can be introduced in dierent network layers, for example, in the physical layer (WDM, SONET/SDH), in the link layer (TRILL, MPLS, SPB), in the network layer (ECMP, EIGRP), in the transport layer (MPTCP, CMT), or in the application layer (MPRTP, MRTP) [2, 3, 4].

Apart from enabling the

use of additional paths, this type of transmission assumes that the routing is semi-independent of current link weights. Nowadays, the most popular solution for establishing such paths is bases upon multi-protocol label switching

Jo

(MPLS). Flexible trac engineering [5] is then enabled.

However, MPLS

paths are established for a long-time scale (highly static) and with the purpose of serving very large amounts of data. Therefore, despite the fact that these paths can be periodically re-optimised, such a process again results in the disruption of existing trac and typically does not take into account the

Journal Pre-proof

current utilisation of links. Fortunately, the introduction of the ow-based forwarding in Software-Dened Networking (SDN) [6, 7] provides the possi-

of

bility for a disruption-free transmission of packets using paths that can be changed with the ne granularity of time or data volumes. Unfortunately, the application of ow-based switches supporting ne-grained ow-level control results in scalability problems due to an unmanageable increase in the

pro

sizes of ow tables. Some techniques, such as multi-path TCP (MPTCP), enable better resource utilisation, but simultaneously generate more ows in the network [2]. This fact hinders ow-based forwarding due to storage limitations and lookup delays. Such a problem has already been observed with the introduction of the OpenFlow protocol [8, 9, 10, 11, 12, 13]. The issue has been addressed and, notably, ternary content addressable memory (TCAM) is used for storing ow entries [14, 15, 16].

Moreover, a centralised man-

the

reactive

re-

agement approach can create signicant signalling overhead, especially when approach for ow installation is used. Extensive communication

between an SDN controller and switches is then required [6, 9, 10].

Early

benchmarks have shown that controllers are unable to handle a huge number of requests [17, 18]. Recent research [11, 12] shows that this issue still bothers

urn al P

the scientic community. The problem is burdensome in data centre (DC) environments, where enormous numbers of ows are present [19, 20]. Another option, the

proactive

way of ow installation, can be advantageous, but such

a solution is a trade o with precision of trac management. Another scalability problem is related to ow installation time in hardware switches  according to [21], a single ow installation time can reach up to 100 ms for a TCAM of 4,000 entries.

In this paper, we apply MPLS labelling to ow aggregation in OpenFlow managed networks. In this way, we show that it is possible to improve the network behaviour under congestion (due to the application of multipath approach) while simultaneously reducing the size of ow tables in the core of an SDN network and minimising signalling requirements. Accordingly, each source node may concurrently transmit data via multiple paths and new paths are added on demand to avoid congested links on already used paths. These paths are not following the idea of equal cost multipath (ECMP)

Jo

routing. In fact, the mechanism provides for new ows new paths that are not congested (if possible), while the existing ows use previously established paths. In this way, the trac is not disrupted; therefore, we aim to improve resource utilisation. The proposed mechanism is based on tagging ows with MPLS labels. Therein, the forwarding of packets is performed on the basis

Journal Pre-proof

of labels. However, contrary to the classical MPLS mode, according to our proposal the distribution of labels is not supported by signalling protocols,

of

such as label distribution protocol (LDP) or resource reservation protocol (RSVP). Instead, we use OpenFlow only. Therefore, we neither replace nor improve well known BGP/VPN [22] or similar MPLS-based solutions. Thus, we can summarise the contribution as follows:

pro

• Algorithm for switch self ow installation enabling reduction of the signalling overhead related to Packet_IN processing. Due to this property, it is possible to completely eliminate Packet_IN massages while keeping all benets related to the reactive ow treatment.

• Multipath transmission

enabling a better network resource utilisation Thus, we

re-

based on the fast reaction to network condition changes.

ensure that congestion appearing on a link when other links are underutilised is solved by on-demand and automated path recalculation. Due to this property, we are able to increase the overall throughput of the network.

enabling a decrease of ow tables in

urn al P

• MPLS-based ow aggregation

switches centrally managed by an SDN controller. The forwarding decision on trac ows destined to a selected node is based on a single label in the whole network. Due to this property, the number of served ow entries is drastically diminished.

• No requirement for the involvement of new protocols

since the proposed

mechanism explores only the existing o-the-shelf solutions, specically, MPLS, OpenFlow, basic routing with OSPF/IS-IS, and link discovery with LLDP. Due to this property, the implementation of the proposal is very easy.

Concerning the technological readiness of a network, the new mechanism can be deployed with the coexistence of legacy MPLS switches and OpenFlow nodes. OpenFlow switches are required at the edge of a network, while legacy MPLS or OpenFlow switches can be used as core nodes. MPLS labels are

Jo

centrally distributed by an SDN controller using NetConf for legacy MPLS switches and OpenFlow for the others. In fact, we do not have to base our mechanism on MPLS, because it only requires some form of tagging to simply provide the unique marking of destination nodes in a network.

Instead of

Journal Pre-proof

tagging with MPLS labels, one can also use methods characteristic of virtual local area networks (VLANs). Nevertheless, in the case of VLAN tagging,

of

the scalability is lower than in the case of MPLS [23]. If one expects that a standard VLAN space is not sucient, then one can use the Q-in-Q or even the PBB approach.

In the case of MPLS, a label stacking can be used to

increase the label space.

pro

The paper is organised as follows: Section 2 presents the justication for the proposed approach; Section 3 introduces the mechanism for a centralised path set-up optimisation supporting IP ows in SDN networks, along with the related architecture; Section 4 describes the evaluation details, including the tools used and the obtained of performance results  the mechanism scalability is also discussed; Section 5 presents a review of the related work  we use this background to emphasise a comparison of our approach with clusions.

re-

others presented before; Section 6 summarises the paper with concise con-

2. Problem statement and motivation behind the proposed mech-

urn al P

anism

In this section, we briey describe drivers for the proposed mechanism. The rationale relates to the three important problems which appear in networks operating with the ow-forwarding scheme, namely: (a) scalability of ow tables at switches in the core of the network, (b) link congestion, and (c) ow installation overhead. In our mechanism, these problems are solved by the application of ow aggregation, multipath transmission, and improved ow installation procedure, respectively.

2.1. Flow aggregation

Flow-based forwarding supports eective trac distinction and management.

However, this approach suers from the necessity to serve an enor-

mous number of ow entries that need to be maintained by each of the ow-forwarding nodes. It is a well-known fact that TCAM is the most suitable memory technology for ow storing and forwarding [6, 12].

However,

Jo

it is very expensive, consumes a lot of energy, and can store only a limited number of entries [14, 15, 16, 24, 25].

This last drawback is the most im-

portant from the viewpoint of TCAM applications. The number of entries which has to be served by a switch strongly depends on the level of network aggregation hierarchy.

Journal Pre-proof

As an example of the possible gain from usage of our aggregation method, we present a simple network topology in Fig. 1. The whole discussion related

of

to benets of our mechanism in context of this particular topology is valid for any topology in which trac aggregation is enforced by consecutive nodes. In our experiments presented in Section 4, we use a few dierent topologies in order to conrm the generality of this approach. Note that we distinguish

pro

between the two types of switches performing ow forwarding. To be compatible with traditional MPLS terminology, we divide these network nodes into: (a)

provider edge

(PE) nodes; and (b)

provider

(P) core nodes. Now,

let us suppose that the whole trac from ingress (domain entrance) nodes, i.e., PEN 1 to PEN J and PEM 1 to PEM K , is directed to egress (domain exit) nodes PE-D1 and PE-D2. The ingress nodes represent an access layer. A

re-

number of active ow entries in each PE is depicted in red, for example D1 + N1D2 ow entries, where indices D1 and D2 represent PEN 1 stores N1 which egress node the ows are directed to. At the rst core layer, we ob-

urn al P

serve a signicant increase in the number of ows coming from the access PJ D1 D2 layer. For instance, node P1 maintains as many as j=1 (Nj + Nj ) ows. In the second core layer, many more ow entries has to be served, this is P PJ D1 D2 D1 + NjD2 ) + K k=1 (Mk + Mk ). j=1 (Nj One of the main aims of the proposed mechanism is to reduce a number

of ow entries in core switches (P nodes).

Let us consider only the ows

directed to networks accessible via PE-D1. Since all ows from the access layer are directed to the single egress node (PE-D1), they can be represented by a single global label. If we consider two destinations, namely PE-D1 and PE-D2, we have to use two global labels.

In our mechanism, ingress PE

nodes are responsible for tagging each ow directed to a given destination egress PE node with a global unique label representing a particular egress node.

In such a case, the number of ow entries is limited to exactly two

for every P node in each core layer (depicted in blue). The number of ow entries in the access layer remains unchanged.

2.2. Multipath transmission

Since network trac volume is continuously growing, one can expect that

Jo

at some point, any network may experience congestion. To avoid this problem, one can use alternative paths leading to the same destination in the network.

It is typical for mesh networks for them to contain more than

one path between each source and destination. Such concurrent paths can be either completely or partially disjointed. The use of concurrent paths can

Journal Pre-proof

Core Layer1

Core Layer2

of

Access Layer

PEN1

PE-D1

P1

2 2

PENJ

PEM1

.. .

P2

PE-D2

urn al P

PEMK

2

Aa A A A

re-

P3

pro

.. .

Figure 1: The assumed network structure and the related numbers of ows mitigate the problem of congested links. However, most of the legacy routing protocols do not implement multipath transmission. If they support multipath transmission, it is only based on equal cost multipath (ECMP) [26]. When the MPLS is used, one can consider ECMP or load balancing over an unequal cost path. From the point of view of single application layer transmission, it is important to use the same path for a particular ow. There exists recommendation on how to avoid the potential mis-ordering of packets [27]. The notable exception amongst routing protocols is related to Cisco's EIGRP which can concurrently use paths of dierent costs [28]. Our mechanism aims to use multipath transmission together with ow

Jo

aggregation in order to avoid congestion in ow-based forwarding networks. Our proposal exploits many paths between same source-destination pair, but these paths do not need to be equal cost. Moreover, new paths are activated on demand only, this occurs, when congestion appears and they do not tear

Journal Pre-proof

down existing ows. Our mechanism searches the whole network to nd the best new path avoiding congested links. The proposed solution extensively

of

uses an aggregation procedure based on tagging with labels. The concept of our multipath-based approach is described here with a simple exemplary network presented in Fig. 2. Sources of trac are connected to the ingress node PES , while destination networks are attached to the egress

pro

node PED . One can observe that a few paths exist connecting PES and PED nodes. Let us suppose that a routing procedure has chosen the path going through core node P11 . In line with our aggregation procedure, label L1 has been distributed enabling transfer along the links marked with this label. All ows going from PES to PED traverse this path and they are stamped with MPLS label L1. All switches on the path forward packets according to this In some moment congestion appears on the link marked with a red

cross.

Our mechanism nds an alternative path (here: PES -P21 -P22 -PED )

re-

label.

and distributes a new label L2 which will be used for packet forwarding. When all switches on this new path get this new label, ingress node PES starts to mark new ows with label L2. The existing ows are still marked with label L1 and they continue to traverse the old path, namely PES -P11 The number of these ows is denoted by N1 and does not increase.

urn al P

PED .

Label L2 is used by the number of ows denoted as N2 , and this number may increase since new ows arrive to the network.

When the next congestion

events appear in some links (in our example, the next congestion event is indicated with a blue cross, and the subsequent congestion event is indicated with a purple cross), new paths are found and new labels are distributed (L3 and L4, respectively). A similar scenario takes place after all congestion events occur: existing ows use labels L1 and L2, while new ows use L3, and then the label L4 (after the third congestion event).

The numbers of

ows: N1 , N2 , N3 , related to the existing ows (using L1, L2 and L3 labels, respectively) tend to decrease. The number N4 of new ows may vary, but is likely to increase.

Thanks to the use of the ow-forwarding paradigm, it is possible to distinguish existing ows from new ones.

Only a single active label is used

for tagging newly arriving ows directed to the same destination. The old

Jo

labels are used for forwarding all ows existing before dierent congestion events appear.

The tagging mechanism enables keeping existing ows on

previously selected paths when the routing process chooses new paths between the given source-destination nodes. This mechanism also prevents the increase of a number of ows from a particular source going via a congested

Journal Pre-proof

P11

of

L1

L1 L2, L3

PES

L2

L2

PED

P22

P21

pro

N1 N2 N3 N4

L3

L3, L4

L4

P31

L4

P32

L4

P33

link.

urn al P

2.3. Flow installation

re-

Figure 2: The example supporting explanation of the concept of multipath transmission adopted in this paper

In SDN networks based on the OpenFlow protocol, two methods of ow installation are available:

reactive

and

proactive.

In the former, each new

ow reaching a switch generates signalling between the controller and an ingress switch. Such a ow of signalling messages results in on-demand rule installation. In the case of the proactive mode, forwarding rules are installed before ows arrive at the switch.

A combination of these two approaches

is also possible. The reactive ow insertion allows exible and ne-grained network management. However, such an approach has a few serious limitations. First of all, every packet that does not have a match in a ow table of a switch has to be forwarded to an SDN controller. The controller then has to dene actions for this packet and install a new rule for the next packets belonging to the particular ow in the switch. This situation may lead to the overloading of a controller with signalling messages

Packet_IN, especially in

networks where a huge number of ows and switches is present, for example, Furthermore, other limitations have to be taken into ac-

Jo

in DCs [19, 20].

count; therefore, a limited number of requests per time unit can be handled by a single controller [17, 18, 11, 12], thus decreasing network scalability. The reactive approach introduces an additional delay for the rst packet of every new ow when it is forwarded to the controller. Moreover, it is likely

Journal Pre-proof

that packets belonging to a single ow can arrive with such a high frequency that the installation of a forwarding rule in a switch takes place after many packets from the same ow arrive. This results in triggering many unneces-

Packet_IN messages, causing further overloading of the controller.

of

sary

Such

behaviour can be exploited to attack an SDN controller.

Despite the above, proactive ow insertion can easily mitigate these prob-

pro

lems. It requires advance knowledge about all trac matches that could arrive into a switch. However, exibility and precision of trac control is lost in this case. Proactive rules are usually more general than those dened in a reactive way.

This results from a lack of knowledge regarding all trac

matches.

Existing ow-based switches suer from the delays related to ow entry insertion into TCAM. This problem is mainly related to a weak management

re-

CPU and a slow communication channel present between the management of the CPU and a switching chipset [29, 30, 31]. These delays are especially cumbersome when networks operate in a reactive way [18]. In the proposed mechanism, we limit the signalling overhead, yet we still assume the application of ne-grained ow forwarding. To install a new ow, exclude

urn al P

we do not need to communicate with the SDN controller. In this way, we

Packet_IN

messages. A controller proactively installs only rules for

ow aggregates in a dedicated ow table in PE. Based on these patterns, the switch itself installs ne-grained ows without the necessity to communicate with the controller. The SDN controller performs maintenance and modication of aggregation rules only. The abovementioned modications do not often take place, they occur only when congestion appears and new paths are required. The introduction of these rules is feasible due to the denition of a dedicated ow table. The detailed explanation of aggregation rules and switch behaviour is given in Section 3.

3. Detailed description of the proposed mechanism Concerning the previously given classication of the switches, our mechanism assumes that:

Jo

• provider edge • provider

(PE) nodes map ows to MPLS labels,

(P) core nodes only forward packets according to MPLS labels.

Journal Pre-proof

We dene a

source-client network

(SCN) and a

destination-client network

(DCN) as networks where sources and destinations of trac are located,

of

respectively. SCNs and DCNs are accessible only via PE nodes. To eectively map ows to the labels, the SDN controller builds and maintains a map of the physical topology and stores it in the form of a link state database (LSDB). The LSDB is modied when congestion starts to

pro

appear, i.e., when link metrics are changed. The SDN controller calculates the best path only for pairs of PE nodes.

The reverse Dijkstra algorithm

(described in Section 3.4) is used to perform this task.

For each PE, the

controller allocates a global MPLS label representing a particular PE node. The labels accompanied with information about proper output interfaces (obtained due to executing the shortest-path algorithm) are then populated to each node.

When a packet belonging to a particular ow reaches an

re-

ingress PE node, it is tagged with a proper MPLS label and is subsequently forwarded to a pre-selected interface. This label indicates the egress PE node via which a particular DCN is reachable. Therefore, each node on the path will use a given label to reach the particular related PE node.

Moreover,

the same label will be used by any node in the whole network to reach the

urn al P

specied egress PE node. Such an approach results in ow aggregation and a signicant reduction of ow table entries in P nodes. The proposed mechanism supports a fast reaction to changes in trac conditions. The SDN controller periodically collects information related to the utilisation of links.

OpenFlow port statistics requests are proposed to

be used. However, other protocols for retrieving counter data are discussed in Section 3.6. There are two predened thresholds dened by the administrator as congestion triggers. If throughput of any link exceeds one these thresholds, this indicates that congestion on this particular link appeared. Additional details on these thresholds are discussed in Section 3.2.

When

any congestion in the network is recognised, the controller increases metrics of the over-utilised links. The reverse Dijkstra algorithm is then recalculated using modied metrics and a new label for each PE is allocated. The controller populates all nodes with new labels and the related output interfaces. Therefore, only new ows use the new labels (i.e., the new paths). All the

Jo

existing (old) ows are forwarded using the previously allocated labels (i.e., previously calculated paths). Such an approach stabilises the ow forwarding process and introduces the multipath transmission. The proposed management system running on the SDN controller (see Fig. 3)

is logically divided into the two components that are responsible for dening

Journal Pre-proof

SDN CONTROLLER MEASUREMENT COMPONENT

periodically repeated

No

Tput Warn Th

Yes

No

Metric = NORM

Yes Metric = WARN

Metric = CONG

pro

Tput Cong Th

of

Read Counters

Calculate Tput

LABEL ALLOCATOR COMPONENT Collect from all links

Any metric updated?

Yes

Calculate Dijkstra

Allocate Labels

re-

No

PE Node

P Node

Update Coarse Flow Table

Update Flow Table

urn al P

Figure 3: Flowchart of the proposed mechanism how the nodes process the data:

• measurement component

responsible for gathering link utilisation and

• label allocator component

calculating paths and distributing MPLS la-

modication of metrics,

bels.

Below, we rst describe the way in which the packets and ows are processed in various types of network nodes, and then describe the operation of both of the components introduced above.

3.1. Flow processing in PE and P nodes

Each PE node implements ow-based forwarding. The way the ows are

Jo

dened is neutral from the viewpoint of our mechanism (in the context of the ow aggregation procedure). For instance, a traditional 5-tuple (source and destination addresses/ports with the L4 protocol) can be used. In accordance with the OpenFlow specication, we propose to use two

ow tables in each PE node. The

detailed ow table

(DFT) stores detailed

Journal Pre-proof

information on active ows, i.e., 5-tuples. The

coarse ow table

tains the mapping between DCNs and pairs (output

MPLS label, output

The match elds for rules in the DFT are dierent from matches

of

interface).

(CFT) con-

in the CFT. One can see this dierence in Fig. 4, where the term `Flow' is used in the DFT and the term `Net' is used in the CFT. As explained later, the entries in the CFT exist permanently, i.e., there is always a match for a

pro

particular network, while an action list depends on the currently used path and can be modied. The existence of a particular ow in the DFT depends on its lifetime and a ow idle timeout. Thus, when a packet reaches a PE node, it is processed following the pipeline shown in Fig. 4. We consider the following two cases.

1. If a packet (P1 in the gure) matches an existing ow in the DFT, then

re-

this packet is processed according to the list of actions present for such an entry. This means that, the packet is tagged with a selected MPLS label and forwarded to the indicated output interface. 2. If a packet (P2 in the gure) does not match DFT, then it is redirected to the CFT. It contains entries composed of a DCN and the list of the

urn al P

following actions: push a pre-dened MPLS label and direct the packet to a pre-selected output interface.

Thus, when the match is found,

the specied actions on the packet are performed and the detailed ow entry is created in the DFT. The entry is based on information gathered from the packet's header elds. Therefore, for the new ow dened on the basis of this header, the entry action list is copied from the CFT. The idle timeout of this entry is set to a nite pre-dened value. The issue of timeout setting and usage is explained in detail in Section 3.5. If TCAM is used, the ow table lookup is performed in a single clock cycle. By contrast, while DFT and CFT are searched, two clock cycles are needed. Of course, if the match is found in the DFT, then only one lookup is needed. In the legacy OpenFlow protocol, only an SDN controller may insert ow entries into ow tables. lead to a storm of

As previously mentioned, such a procedure may

Packet_IN

messages received by a controller. When we

Jo

consider huge networks carrying millions of ows, the reactive ow insertion may lead to overload of the controller.

Therefore, in our mechanism we

improve the standard operation of the OpenFlow protocol and, as a result, we reduce the number of messages exchanged between the controller and the switches.

The CFT contains general rules indicating how ows should be

Journal Pre-proof

Detailed Flow Table Flow1

Label1 Out1

Flow2

Label2 Out1

... FlowX

LabelY OutZ

TABLE MISS

of

P1

P1 Label2

Coarse Flow Table Net1

Label1 Out1

Net2

Label2 Out1

Net3

Label3 Out2

pro

P2

P2 Label3

...

NetX

LabelY OutZ

re-

Insert Flow

Figure 4: Packet processing pipeline in an edge (PE) node processed.

Such the approach results in the proactive installation of very

general rules. These rules are in some way persistent and are updated by the

urn al P

controller, only when congestion appears. It does not happen as frequently as a detailed ow installation. The number of CFT entries is related to the number of DCNs.

Since the detailed ow installation happens very often,

insertion of a particular granular ow entry into DFT is made by a PE switch on its own (as presented in Fig. 4).

We should note that by itself,

Open vSwitch (OVS) inserts ows into Exact Match Cache and Tuple Space Search cache [32]. Such an OVS procedure improves performance of packet processing, this is because packets are processed in caches (in a kernel or DPDK module [33]), avoiding slow user space processing where OpenFlow tables are maintained. Although the OVS approach increases the number of entries, it speeds up switch operation. A similar procedure can be applied in order to implement DFT self-insertion. Works on DFT implementation in OVS are now in progress. Our preliminary OVS modication needs only six additional lines of code. Additionally, the authors of DevoFlow proposed the mechanism enabling switch self ow insertion [14].

Furthermore, they

Jo

also show by simulations that such an approach is a valuable concept. Our proposed improvement removes the need to use

Packet_IN

messages, but

still keeps all benets related to the reactive treatment of ows. For P nodes, only a single ow table is required. When a packet reaches

such a node, this packet is matched on the basis of a label only. The packet

Journal Pre-proof

is then sent out to a proper output interface with exactly the same label. If a legacy MPLS router is used as a P node, such a node only performs the

of

ordinary label swapping operation. Legacy MPLS routers allow static label switched path conguration (similar to static routing). This means that due to our mechanism, the SDN controller is able to add (or remove) static MPLS 1

entries in a router conguration via SNMP or NetConf .

pro

The whole knowledge about all the MPLS labels used in the network is possessed by an SDN controller. It knows exactly which label should be installed in a particular P (core) node to ensure that a particular egress PE node can be reached. According to our approach, only P nodes can be served by legacy MPLS routers (with static MPLS entries). With the help of NetConf, the controller can send a conguration to the router in which a static LSP entry will be added. Such the entry denes which input label should be

re-

swapped to another output label and indicates an output interface. To apply our proposal using legacy routers functioning as P nodes, the input label has to be swapped to the same output label.

The label and the output inter-

face are assigned by the controller. Such the conguration can be enforced remotely using NetConf. Application of a static LSP conguration is a stan-

urn al P

dard procedure for manual distribution of labels (that is, without use of LDP or RSVP). This procedure is not disruptive for the packet forwarding because reconguration is done only when congestion appears. When it happens, the controller congures a new path.

During this procedure, packets are still

forwarded via existing paths. When all P nodes, except for an ingress PE, acknowledge the conguration change, a controller installs a proper entry in the ingress PE. The decision on how frequently a new conguration can be applied depends on a number of requested changes in the conguration. In fact, the exact value of this frequency depends on the vendor and a specic device model.

3.2. Measurement and label allocator components The measurement component (MC) periodically retrieves data from link counters. The collected information is used to calculate the bandwidth util-

Jo

isation at links. The controller requests interface counter reports from each

An exemplary conguration of static label switched paths for Juniper routers can be found at: https://www.juniper.net/documentation/en_US/junos/topics/task/ configuration/mpls-qfx-series-static-lsp-cli.html. These CLI commands can be congured via SNMP/NetConf. 1

Journal Pre-proof

node. Each node replies with a single statistic response containing information about all its counters. In the basic version of the proposed mechanism, there are

2N

messages in each polling interval.

N

nodes in the network,

of

we apply OpenFlow statistics messages. If we have

Taking into account the

fact that each measurement-driven mechanism requires statistics collection, such an approach does not involve a huge signalling overhead.

MultipartReq

The single

(statistics request) message for port statistics

MultipartRes (statistics 86 + 112n bytes

pro

OpenFlow 1.3

generates a total of 94 bytes of network trac. The response) message containing counters for

n

ports involves

of signalling. In our simulations, we are able to retrieve the data from nodes periodically with the frequency of one second, but this information can be requested rarely. Our mechanism does not require and

FlowMod

Packet_IN, Packet_Out

triplet maintenance, which form the largest contributor of sig-

re-

nalling overhead in OpenFlow networks [9].

Another method for obtaining port counter values may be based on using SNMP. However, the leading vendors recommend setting the pooling interval 2

in the range of 3060 seconds . However, if a fast reaction to the change of the trac is needed, 30 seconds can be unacceptably long interval. Other o-

urn al P

the-shelf protocols enabling statistics readouts are discussed in Section 3.6. There are two utilisation thresholds congured: warning (WarnTh) and

congestion (CongTh). The latter is determined with a value larger than the

former. Each time the amount of the trac throughput (Tput) exceeds one of the dened thresholds, the MC changes a link metric in the LSDB stored by the label allocator component (LAC). We propose the following three values of the congurable link metrics related to the threshold values: (1)

NORM: nor-

mal metric (a default IGP metric) for a link utilisation of a value not greater than the

WarnTh threshold; (2) WARN: warning metric for the link utilisation of WarnTh and CongTh threshold; (3) CONG: congestion metlink utilisation exceeding the CongTh threshold. In order to prevent

a value between the ric for a

oscillation of the link utilisation around thresholds, we applied hysteresis. Generally, for each link, an operator denes three link weights (metrics) in

NORM, WARN, and CONG. If a particular link is not congested a lower value (NORM) is used as a link weight for Disjkstra algorithm. When trac increases and tends toward congestion exceeding the WarnTh threshold,

Jo

increasing order:

https://www.juniper.net/documentation/en_US/junos/topics/task/conguration/snmpbest-practices-device-end-optimizing.html 2

Journal Pre-proof

the second value of weights (WARN) is assigned to this link and Disjkstra is

recalculated. The third value of link weights (CONG) is used when the

CongTh

of

threshold is exceeded and again Dijksta is recalculated. LAC builds and maintains LSDB. Each time the MC changes any link metric, the recalculation of the shortest paths is triggered for each of the PE nodes treated as a root. By the

shortest path

we are referring to the path

pro

obtained due to the reverse Dijkstra algorithm run for the link weights set after the change. The number of such recalculations can be limited to some PE nodes only. We use the so-called reverse Dijkstra algorithm based on the Dijkstra algorithm. Below, in Section 3.4, we describe this process in detail. After recalculation, a new label set is allocated. The maximum size of this set equals the number of PE nodes. After consecutive recalculations of the reverse Dijkstra algorithm, only one unique label represents a destination PE

re-

for new ows. In this way, between a selected source-destination PE pair, the newly recognised ows are redirected to the new paths, while the existing ows still traverse the old paths  they are forwarded with the previously allocated labels.

As presented in Section 3.1, for PE nodes the CFT table is updated and

urn al P

lled in by the SDN controller. The CFT contains entries composed of an address of a DCN and a list of the following actions: push MPLS label and send it to an output interface.Each entry in the CFT has innite timeout. Each time recalculation of the reverse Dijkstra algorithm is triggered  this results in the CFT update. The old list of actions for each DCN is replaced with a new label and a new output interface (based on the structure of the new shortest-path tree).

After the recalculation, new label entries in the ow tables are also proactively installed in the P nodes. A single entry of this kind contains a match based only on a new input MPLS label and the forwarding action: the output interface is based on the currently calculated shortest-path tree.

The

idle timeout is set to innity.

3.3. Possible extensions of the proposed mechanism The proposed mechanism does not limit functionalities which are present

Jo

in OpenFlow. All OpenFlow actions can still be performed. The only aspect which dierentiates our solution from the standard OpenFlow behaviour is the addition of the

Insert Flow action.

This action is taken by a switch itself

and results in a ow insertion into DFT on the basis of an entry transferred from CFT.

Journal Pre-proof

The match rule present in CFT does not need to be based only on a destination network. It can be composed of any combination of elds and

of

wildcards supported by OpenFlow. In Fig. 5, we depicted a few exemplary match rules. Let us consider three packets arriving at the switch, i.e., P1, P2, P3. None of them match any entry in the DFT, they are thus redirected to CFT. P1 and P2 are destined to the same network, but P1 also matches an extended rule with a destination Layer 4 port; therefore, it is send to a P2 (Label2,

Out2).

Out1) than in the case of

pro

dierent output port with a dierent label (Label1,

Such an approach allows serving distinct applications in

a specic way. Another possibility of packet serving is the use of the DSCP eld to fulll QoS requirements. Our proposal is to consider separate labels for dierent application ports that are reachable via a particular egress PE node. For example, let us consider two applications (e.g. HTTP server and

re-

some VoIP server) which are accessible via the same egress PE node. These two applications may have dierent QoS requirements.

Thus, a modied

version of our mechanism may take into account dierent QoS requirements during path calculation. As a consequence, the same egress PE node may be reachable from all ingress PE nodes via dierent paths at the same time. In

urn al P

this way, some MPLS labels may be allocated for applications with higher QoS requirements, and some for less demanding trac. It is also possible to control trac directly by an SDN controller. If there is a particular type of trac that is expected to be managed by an SDN

Table Miss entry. This entry allows Packet_IN message. After packet analysis, the controller

controller itself, CFT should posses a the generation of a

installs an appropriate entry in DFT.

3.4. Reverse Dijkstra algorithm: the recalculation algorithm Whenever congestion appears, the calculation of new paths avoiding overloaded links is needed. For nding these new paths, a form of the Dijkstra algorithm is used. In the proposed mechanism, we do not need to perform path recalculation for all network nodes. To decide for which PE nodes require recalculation, the mechanism starts the investigation of labels used by the packets transferred via the link at which a new overload has just been Each of these labels indicates a specic destination PE node,

Jo

recognised.

thereupon called thereupon an `aected PE' node. To avoid the negative impact of congestion, new paths directed to the aected PEs should be found. Consequently, new labels have to be allocated.

For the non-aected PE

P1

CFT

DFT

...

TABLE MISS

of

P2

match

priority

Net1, dstPort1

1

Label1 Out1

Net1

100

Label2 Out2

Net2, DSCP1

1

Label3 Out1

100

Label4 Out3

...

...

100

LabelY OutZ

Net2 ...

NetX

actions

P1 Label1

P2 Label2

pro

P3

Packet_IN

Journal Pre-proof

TABLE MISS

re-

Insert Flow Insert Flow

Figure 5: The extension of a coarse ow table

urn al P

nodes, path recalculation and label reallocation are not required. If the regular Dijkstra algorithm is used, every PE node (not only aected PE nodes) will have to recalculate paths to each aected PE node. us consider a network with 100 PE nodes.

For example, let

If in some moment congestion

appears at a single link where only one label is related to one PE node, the regular Dijkstra will have to be performed by 99 ingress PEs (excluding affected PE). By contrast, the reverse Dijkstra requires only one calculation made from the perspective of the aected PE. In the presented mechanism, we calculate paths from the perspective of each aected PE node treated as a root. However, the weights used in the shortest-path algorithm are related to the links directed in the opposite way (i.e., towards the root). Therefore, we call this procedure `reverse Dijkstra'. In case of regular Dijkstra, we answer the question of how to reach each all other nodes (leafs) from a root.

Reverse Dijkstra answers the question of

how to reach a root node from all other nodes. For a better explanation of

Jo

how this procedure works, in Fig. 6 we present an example network topology together with the obtained reverse Dijkstra tree. The destination PE (a root for the reverse Dijkstra calculation) is coloured orange (node 1). A metric of each link for each direction is depicted in Fig. 6a. For example, if we consider connection between nodes 1 and 2, metric 1 is used for trac from node 1 to

Journal Pre-proof

node 2, while metric 7 is used for trac in the opposite direction. When we consider node 1 as a root, the regular Dijkstra algorithm uses metric 1, while

of

the reverse Dijkstra will use metric 7. In Fig. 6b, the outcome of the whole reverse Dijkstra procedure is presented in the form of the tree with the used metrics. With blue arrows, we indicate the trac direction. The MPLS label directed to the destination node 1 is distributed down the reverse Dijkstra In the case of a regular Dijkstra algorithm, one has to perform six

pro

tree.

Dijkstra calculations using nodes 2-7 as roots.

1

1

1

1

2

9 2

7

3

1

2

7

3

7

4 4

3

2

5

5

3

6

urn al P 1

2

re-

7

9

3

3

4

4

5

5

6

4

6

3

7

8

3

7

(a) An example network topology (b) The reverse Dijkstra tree built in with the related link weights the example network Figure 6: An illustration of the proposed reverse Dijkstra algorithm

3.5. Flow garbage procedure

Flow tables require maintenance to remove unused entries. We propose

Jo

the application of a standard OpenFlow procedure for ow entry removal from the DFTs in PE nodes  an idle timeout counter is used for this purpose. Some nite value is assigned for each ow present in the DFT, while rules placed in the CFT always have an innite timeout. The CFT innite timeout is designed deliberately because rules in the DFT are installed by

Journal Pre-proof

the switch itself (on the basis of rules in the CFT). If a timeout value less than innity is used, the entry for this destination network will be removed.

of

If it happens that the trac destined to the mentioned network appears, and there is no DFT entry for this ow, such the trac will be dropped because of the lack of an appropriate CFT entry. On the other hand, if congestion appears, only a modication of the related CFT rule is applied, i.e., labels

pro

and output interfaces are updated. We want to notice that the CFT is used for aggregation while the DFT is used for serving particular ows. A proper setting of idle timeout strongly depends on the network trac pattern. The are ows characterised with either short inter-packet intervals or ows with long packet intervals, and bursty ows with some level of periodicity. The authors of [34] show that dierent ows should be assigned with dierent values of suitable timeouts. Their study shows that 76% of ows

re-

have packet inter-arrival times lower than 1 second. In our simulations, we use 3 seconds idle timeouts. This refers to 80% of ows which have packet inter-arrival times less than 3 seconds [34].

We also checked the impact of

lower values of idle timeout on the ow table occupancy.

Authors of [35]

suggest using low values of the idle timeout, even lower than 1 second. A low

urn al P

value of idle timeout decreases the number of ows in DFTs of PE nodes, but it may cause some ows to be removed from DFTs despite the fact they are active.

Such a situation results in unnecessary CFT lookup and ow

reinstallation into DFTs. However, it is not very costly to our mechanism because it does not involve the

Packet_IN procedure.

A switch modied ac-

cording to our proposal reinstalls ows on its own, without the necessity to communicate with the SDN controller. When the ow is present in a DFT, only a single lookup consuming one clock cycle cycle for TCAM is needed. Additionally, after the ow removal, a lookup is performed in two clock cycles.

An excessive value of the idle timeout will not often trigger the ow

reinstallation procedure, but it will increase the number of rules in the DFT. For the P nodes we propose a procedure aligned with the current functionality of OpenFlow. This procedure states that when an SDN controller calculates new paths and allocates a new label to a particular egress PE, the related entries are proactively installed with an innite idle timeout into for-

Jo

warding tables of P switches belonging to these paths. Simultaneously, in P switches, the controller modies previous rules destined to this PE (identied by previously allocated labels); in other words, the controller changes only timeout timers from innity to a nite value.

Thus, all existing ows are

forwarded without changes. When old ows end, their idle timeout counters

Journal Pre-proof

are exceeded and the removal of such ows from the ow tables takes place. When all the ows related to a particular label expire in all P nodes, this

of

label returns to the pool of available labels used by the SDN controller. The innite value set to the idle timeout of ow entries in P switches is needed to sustain readiness to handle ows, even after a long absence of any related trac.

pro

3.6. Integration with existing o-the-shelf technologies

A full upgrade of network devices may require huge capital expenditure for an operator. An incremental approach can span this task in time. In this subsection, we summarise the ideas of how to integrate our mechanism with existing o-the-shelf technologies.

In the proposed system, we can distinguish between two types of network

re-

nodes: PE and P. The former has to be upgraded, while the latter may be a legacy MPLS router. The PE nodes have to be modied OpenFlow switches working as proposed in Section 3.1.

Since P nodes forward trac only according to MPLS labels in our mechanism, a network operator does not need to replace legacy routers if they

urn al P

support MPLS. We stress again that labels used in our mechanism have a global meaning. The application of the centralised management oered by SDN controllers enables synchronisation of label distribution to all the P and PE nodes. An SDN controller distributes a unique global label related to a particular egress PE node. The P node performs only label swapping. The input label has to be swapped with the same output label. All legacy MPLS routers known to us allow conguration of static label switched paths (LSPs).

In this case, an administrator is obligated to allocate input and

output labels manually. This can be achieved remotely using, for instance, NetConf, SNMP or XMPP. Therefore, an SDN controller may use one of the previously mentioned protocols for conguration of static LSP on legacy P nodes. In this way, a single unique MPLS label can be allocated on each of the PE and P nodes on the path.

Each time a controller recalculates paths and allocates new labels, it recongures static LSP entries on a P node and updates the CFT on a PE Standard signalling mechanisms, such as LDP or RSVP, cannot be

Jo

node.

applied because MPLS labels distributed by them are assigned by each node independently of others, and consequently the labels have a local meaning only.

Journal Pre-proof

Table 1: Signalling protocols between an SDN controller and network nodes Counter readouts

Topology

Pull

management

discovery

NetFlow

OpenFlow

OpenFlow

IPFIX

SNMP

SNMP

of

Flow

Push

LLDP OSPF

NetConf

jFlow

XMPP

IS-IS

pro

sFlow

Due to the fact that our solution represents a measurement-driven mechanism, it needs to collect some link statistics.

These can be gathered and

communicated with use of various protocols depending on functionality supported by switches and routers, as well as the assumed method of obtaining

re-

counter readouts (push or pull). For the push method, protocols such as NetFlow, IPFIX, sFlow, and jFlow may be used. These protocols are designed to periodically report trac statistics. However, they generate a lot of information which is useless from the standpoint of our mechanism. These protocols are able to deliver detailed statistics about each ow, yet our mechanism

urn al P

only needs general interface counter readouts. For the pull method, one can apply OpenFlow or SNMP. These protocols oer on-demand acquisition of general interface statistics. The only drawback is that the pull method requires requestreply communication, so some overhead relating to requests is generated. Contrary to the push method, the pull method limits overall trac exchanged between the controller and P/PE nodes. As controllers have to maintain LSDBs, they need to discover the network topology. Information collected from the well-known protocols such as LLDP, OSPF, and IS-IS can be used to build LSDB in an SDN controller. The use of OSPF/IS-IS to discover the topology originates from the idea of incremental implementation of our mechanism in any network running legacy MPLS routers. As we described in Section 3.1 the core of the network may stay without replacement if it supports MPLS static label switching. Only the new type of edge nodes (PE nodes) have to be deployed in the network. If OSPF is used in the network, we can obtain the advantage of a link-state

Jo

advertisement (LSA) database for discovering network topology. We suppose that an SDN controller only listens to LSAs.

In the case of IS-IS, a piece

of topology information is encoded in link-state protocol data units (LSPs). The controller can then reconstruct the network topology directly from the

Journal Pre-proof

LSA or LSP database.

In Table 1, we summarise some market-available

of

protocols that can be applied with our mechanism.

4. Evaluation

This section presents simulation setups and results for performance evalIt reports test scenarios, assessment

pro

uation of the proposed mechanism.

methodology and metrics used during mechanism evaluation.

4.1. Simulation environment

All the tests were run on the ns-3 simulator [36]. To conduct the evaluation, we implemented the components described in Section 3. A new MPLS module oering concurrent processing of IP packets and MPLS frames was

re-

also implemented. Moreover, we have added features enabling: SDN-based central management of a network, LSDB maintenance, the reverse Dijkstra algorithm calculation, the new MPLS label distribution procedure, and functionalities related to the Measurement Component.

We used four topologies for the experiments: US backbone, Nobel-EU,

urn al P

and Cost266 topologies are from [37], and the three-level Clos topology adequate for an internal data center network [38]. The US backbone network contains 39 nodes and 61 bidirectional links. The Nobel-EU is of 28 nodes and 41 bidirectional links.

In the case of the Cost266 network, 37 nodes

and 57 bidirectional links are used. The Clos topology consists of 9 access switches (each with 3 uplinks), 9 aggregation switches, and 3 core switches. Each aggregation switch is connected to all the core switches.

For all the

networks, some selected nodes (PEs) serve as attachment points for trac sources and destinations playing both the SCN and DCN roles simultaneously (as depicted in Fig. 7). This means that such nodes randomly (uniformly) generate trac to all the nodes of this type.

All the other nodes transit

trac only. For all topologies, all links connecting network nodes are set to 100 Mbps with a 1 ms propagation delay. Links interconnecting SCNs/DCNs with PE are set to 1 Gbps with a 1 ms propagation delay. Such a conguration allows the avoidance of bottlenecks in the access part of the network.

Jo

We study transmission of the TCP trac only. Network trac is injected

with the use of the ns-3 internal random number generators. Flow sizes are generated on the basis of a Pareto distribution with a shape parameter equal to 1.5 and a mean value set to 500 kB. Flow inter-arrival times were selected in accordance with exponential distribution (the mean value equals 3 ms).

Journal Pre-proof

34

19

33

1

23

35 12

20

10 15 28 13

3

12

9

4 2

32 38

39

17

8

5

37 20

7

30

17

34

24

9

36

23

4

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

urn al P 21

29

3

2

35 28

22

18

re-

15 12 33

4

22

1

10

27

8

(b) Nobel-EU

16

13

25

27

17

3

32

1

18

16

25

8

15

6

(a) US backbone

19

24

26

21

28

25

6

11

20

31

11

7

21

22

14

5

1

14

19

24

7

6

23

18

14 30 29 27 16 26

11

37

13

10

pro

5

9

of

36

31

26

2

(c) Cost266

(d) Clos

Figure 7: Topologies used in the numerical evaluation (the green vertices represent SCNs/DCNs) Moreover, we dedicated a separate section (4.5) to present the inuence of dierent trac patterns on the performance of the compared mechanisms. The simulation time was set to 100 seconds. Data collection was started after elapsing of the rst 10 seconds of the simulation warm-up time had elapsed. Simulations of the proposed mechanism were conducted for a few combinations of the (CongTh,

(0.9, 0.4 − 0.7),

WarnTh)

pairs:

(0.8, 0.4 − 0.7), (0.85, 0.4 − 0.7), 0.1 step.

where the warning threshold was increased with

Jo

Each simulation was repeated 20 times. The 95% condence intervals were then computed.

For each pair of thresholds, we use the same set of seed

values to achieve repeatable trac conditions.

Additionally, such a proce-

dure enables us to carry a fair comparison among dierent setups. For all

Journal Pre-proof

the simulation setups, we xed the following values of link metrics placed in LSDB:

NORM

= 1,

WARN

= 1000,

CONG

= 65535.

of

4.2. Comparison to the selected mechanisms

To observe the gain of our mechanism, we have proposed a few scenarios in which simulations of other mechanisms were performed. In order to fairly

pro

compare all scenarios with our mechanism, we performed 20 simulations for each scenario with the same set of seeds, the same value of ow idle timeout was used, i.e., 3 seconds. We compare our mechanism to:

a) centrally calculated Dijkstra algorithm with reactive ow installation (`single path');

re-

b) classical Equal Cost Multipath based on the OSPF (`ECMP'); c) FAMTAR  distributed Dijkstra algorithm with multipath routing (`FAMTAR'), more details about FAMTAR operations are provided in Section 5; d) DevoFlow  a modication of the OpenFlow protocol [14], where mod-

urn al P

ied switch forwards packets according to ECMP pahts. Switch installs detailed ow entries by itself using so called

rule cloning procedure.

When

an elephant ow is detected (elephant ow detection is based on some threshold), switch informs a controller. The controller installs then appropriate detailed ow entry related to this elephant ow on the least congested path;

e) Expedited Eviction  an approach to minimize ow table occupancy based on forecasting TCP ow termination [35]. The rst scenario a) refers to the operation of standard OpenFlow switches, all working in the reactive way. In this scenario only the rst switch on the path sends the

Packet_IN

message to the controller. The controller installs

respective ows in all the switches on the path. This way, we limit a number of the

Packet_IN

messages.

The path computation is performed centrally

by an SDN controller using the Dijkstra algorithm. The weights of all links

Jo

are the same and set to 1.

The second scenario b) uses standard ECMP based on the Dijkstra algo-

rithm implemented in OSPF.

Journal Pre-proof

The third scenario c) used for comparison with our mechanism is FAMTAR [39], where path computation is achieved in a distributed way. FAM-

of

TAR uses not equal cost multipath transmission. In case of the fourth scenario d), we implemented in ns-3 the DevoFlow mechanism.

It applies rule cloning, multipath and threshold-based (i.e.,

not sampling-based) elephant ow detection.

The multipath transmission

pro

is based on ECMP, while the routes for elephant ows are chosen using the decreasing best-t heuristics to solve the bin-packing problem. The authors of [14] propose to detect an elephant ow as a ow that transfers at least a threshold number of bytes in range of 110 MB. We decide to choose the threshold equal to 1 MB because it results in increase of the elephant ows; thus, giving us more opportunity to show the exibly in trac control. The last scenario e) is focused on minimisation of ow table occupancy.

re-

Since we simulated TCP trac only, we implemented only the related part of the mechanism proposed in [35].

This solution expedites rule evictions

recognising TCP ow termination via FIN/RST ags. However, the authors of [35] do not consider any form of multipath trasmission; therefore, we use only a single path transmission in simulations to provide a fair comparison

urn al P

with this mechanism.

4.3. Performance metrics

In this section, we dene performance metrics used for the evaluation of our mechanism and for comparison with others.

For all the scenarios,

we collected data from all nodes. On this basis, we were able to calculate: (a) the total number of the transmitted (`Tx') and received (`Rx') bytes; (b) the percentage of dropped packets (`Drop Pkts'); (c) the mean achieved network throughput (`Avg Tput'). Moreover, we propose a metric expressing the received data gain (`Rx Gain ') dened by the following equation:

Rx Gain =

where

Rx

Rx − Rx compared × 100% Rx compared

(1)

is the total received data during simulation when our mechanism

is used, and

Rx compared

expresses the total received data when one of the

Jo

compared scenarios is applied. This metric is obtained form a comparison of our mechanism with a particular other mechanism specied in a related scenario (Section 4.2). To estimate the scalability of our mechanism, we gather the total number

of ow entries in all the access nodes (PE nodes) per second (`Sum of DFT

Journal Pre-proof

entries (PE)'), and the mean number of ow entries present in a single core node (P node) per second (`Avg label entries (P)'). The number of ow entries

of

in a single OpenFlow P node is equal to the number of labels used by this node. In the case of a legacy MPLS node this number is equal to the number of MPLS labels present in the label forwarding information base (LFIB). To show eciency of the ow processing supported by our tagging approach,

pro

during the whole simulation we observe the number of label entries present on all the P switches and we store the maximum values. The mean of maximum values over all simulations is provided (`Max label entries (P)'). When we simulated other mechanisms, we also collected the number of ow entries in all nodes. This enables a fair comparison of the ow reduction in the core of the network.

Furthermore, for the evaluation of ow processing scalability of the pro-

re-

posed mechanism, we dene the following indicator: the maximum ow reduction indicator (`maxFRI '). Its calculation is performed in the following manner. Firstly, we distinguish ow tuples and labels. The ow tuples are used by PE switches, and labels represent aggregated ows. The latter are used by the P switches for forwarding. Supposing that all the PE nodes serve

urn al P

as trac sources, in each step of the simulation, we observe the total number of ows in the network. Secondly, we verify how many labels on a single P node are active due to the presence of the abovementioned ows. Thirdly, we calculate the average number of active labels per single P node.

This

value shows the mean number of labels used for trac processing in the core of a network. Finally, to present a single indicator for the whole simulation time (simT ime), we use the maximum value to calculate below:



maxFRI



avg(#labels P )  × 100% maxFRI = max 1 − P simTime #flows PE

as shown

(2)

PE

This value exemplies the maximum percentage reduction of the number of entries used by switches in ow tables in comparison to legacy ow

Jo

switching (without any aggregation procedure). This number expresses the decrease rate of ow table entries when our mechanism is used. to stress that the

maxFRI

We want

represents the situation when all ows from all

the PE nodes are present on all core P switches. This is a key performance indicator (KPI) which enables us to quantify the eciency of ow reduction

Journal Pre-proof

during system operation.

This KPI is based on measurements performed

during the system run. No other mechanisms are compared with the use of

of

this indicator. We also dene another indicator, this is the comparative ow reduction indicator (`CFRI '), which measures eciency of ow reduction for core nodes when simulated scenarios are compared. Contrary to

maxFRI , CFRI

uses

pro

simulation data gathered in both the compared mechanisms. The indicator is dened as follows:

# avg(#labels P ) × 100% CFRI = 1 − avg(#flows comparedP ) "

where

avg(#labels P )

(3)

is the average number of ow entries (per second) in

re-

a core P node when our mechanism is applied, and

avg(#flows comparedP )

refers to the average number of ow entries (per second) in a core node when another considered mechanism is used.

4.4. Results

urn al P

This section presents the results achieved during evaluation of our mechanism and comparison of our mechanism with others.

All the simulations

presented and analysed in this section were performed under the following assumptions: the mean ow inter-arrival time equals 3 ms and the mean ow size equals 500 kB.

Table 2 presents a comparison of trac statistics for the considered topologies. One can notice that use of multipath transmission results in an increase of trac transfers. If one compares a single path transmission (based on the centrally calculated Dijkstra algorithm with the reactive ow installation) or Expedited Eviction with ECMP, it can be observed that ECMP enables a better resource utilisation for Nobel-EU and Cost266 topologies. For the Clos topology, ECMP is almost two times more ecient than the single path option. This stems from the fact that the Clos network oers many concurrent equal cost paths. For US backbone, one can notice that use of a single path transmission gives slightly better results than use of ECMP. This stems

Jo

from the fact that for this simulations ns-3 performed ECMP per packet, not per ow. For instance, if two equal cost paths are available for the same destination and one of them is congested, some packets belonging to the same ow reach this destination unordered. This will cause packet drops. Since we operate with the TCP trac, each time the retransmissions occur, TCP

Journal Pre-proof

sources slow down. In fact, some topologies may have a very limited number of equal cost paths; moreover, it can happen that there are no such paths. For

of

considered topologies and trac pattern DevoFlow achieved slightly better

pro

throughput than ECMP, what is inline with results from [14].

If one compares ECMP or DevoFlow with our mechanism and FAMTAR, a better performance is observed for all the inspected topologies with our proposal.

Our mechanism and FAMTAR simultaneously use any available

paths that can have dierent costs. ECMP requires the same cost for concurrent paths. It can happen that there are no concurrent equal cost paths in a particular network topology. On the other hand, DevoFlow outperform

re-

ECMP only in the case when some number of elephant ows is present. The advantage of our mechanism in comparison to FAMTAR lies in the fact that our mechanism possesses better aggregation eciency (as discussed later). Moreover, due to our mechanism, only edge nodes (PE nodes) require modication; however, the core nodes can be o-the-shelf MPLS equipment. For

urn al P

FAMTAR, all the nodes have to be replaced at once. A simple analysis of the results summarised in Table 3 shows that in the case of all congestion and warning threshold pairs, a notable increase of the total received data (Rx Gain ) is observed when our mechanism is used in comparison to a single path, ECMP and DevoFlow. Since Expedited Eviction uses single paths for transmission, the achieved (Rx Gain ) is at the same level as in the single path scenario. The best

Rx Gain of 170.9% is achieved when we

compare our mechanism with a single path transmission for the Clos topology.

Rx Gain is topology-dependent. One can notice that low Rx Gain-c (our mechanism compared to FAMTAR) stem from the fact

However, the value of values of

that both the mechanisms enable a similar link utilisation. Simply stated, they are based on an unequal cost multipath transmission. value of

Rx Gain-c

The negative

indicates a better performance of FAMTAR, but this value

never exceeds 0.8%.

Moreover, the negative values appear less frequently

then positive values; however, this dierence is not strong enough to conrm

Jo

that mechanism performs better: our mechanism and FAMTAR behave very similarly with regard to transmission eciency.

14.5±0.12 5.8±0.05

14.5±0.11 5.8±0.06

14.6±0.12 5.4±0.07

14.6±0.15 5.6±0.07

14.6±0.12 5.6±0.05

14.5±0.13 5.8±0.06

14.7±0.12 5.3±0.09 1338.6±11.0

14.6±0.12 5.5±0.05

14.6±0.14 5.6±0.07

14.6±0.12 5.7±0.06

11.0±0.11 6.9±0.07

10.6±0.14 5.9±0.09

14.6±0.10 5.3±0.08

11.9±0.25 5.4±0.19

12.1±0.11 6.6±0.10

15.4±0.12

15.4±0.11

15.5±0.12

15.5±0.15

15.5±0.13

15.4±0.13

15.6±0.13

15.5±0.12

15.5±0.14

15.5±0.12

11.9±0.11

11.3±0.14

15.4±0.10

12.7±0.24

13.1±0.12

0.6

0.7

0.4

0.5

0.6

0.7

0.4

0.5

0.6

0.7

single path

ECMP

FAMTAR

DevoFlow

Exp. evic.

0.9

0.85

997.5±9.0

981.6±20.9

1325.7±8.8

961.0±12.4

997.7±9.7

1328.3±10.9

1329.2±12.8

1331.6±11.1

1322.3±11.5

1327.9±11.3

1331.0±13.5

1332.1±11.0

1315.5±10.0

1318.4±10.7

1323.1±11.7

14.5±0.13 5.7±0.07

15.4±0.13

0.5

0.8

1326.7±12.8

[Mbps]

14.6±0.14 5.5±0.09

Pkts [%]

Avg Tput

15.4±0.14

[GB]

[GB]

Drop

0.4

CongTh WarnTh

Rx

9.4±0.06

11.5±0.07

14.4±0.08

10.4±0.02

8.6±0.04

14.6±0.10

14.6±0.10

14.6±0.10

14.6±0.10

14.6±0.12

14.5±0.10

14.5±0.11

14.5±0.12

14.5±0.09

14.5±0.11

14.4±0.12

14.5±0.11

[GB]

Tx

11.0±0.04

13.4±0.04

7.4±0.07

7.4±0.06

7.4±0.06

7.4±0.06

7.3±0.05

7.3±0.05

7.3±0.05

7.3±0.04

7.4±0.05

7.3±0.03

7.4±0.06

7.4±0.06

828.8±1.7

665.2±3.4

1222.2±8.8

1220.8±8.9

8.0±0.06

12.8±0.09

10.2±0.06 10.1±0.04

665.3±3.8

841.4±5.1

10.7±0.06

12.7±0.05

16.5±0.08

11.3±0.03

9.8±0.03

16.7±0.07

16.7±0.06

16.7±0.07

16.7±0.08

16.6±0.10

16.7±0.07

16.7±0.07

16.6±0.08

16.5±0.06

16.5±0.08

16.5±0.08

16.5±0.08

[GB]

Tx

5.8±0.05

11.1±0.04

9.0±0.08

12.3±0.20

11.1±0.04 10.3±0.08

15.5±0.07

9.7±0.02

12.9±0.05

6.0±0.06

5.9±0.05

5.9±0.06

5.9±0.06

5.9±0.07

5.9±0.06

5.9±0.06

5.9±0.05

5.9±0.06

6.0±0.08

5.9±0.09

6.0±0.07

741.5±2.8

916.8±3.5

1409.2±6.3

880.2±1.7

741.1±2.6

1425.0±6.2

1427.2±4.9

1425.5±6.0

1422.4±6.4

1420.1±8.4

1423.2±5.4

1420.6±5.6

1418.4±6.6

1406.6±4.3

1407.9±6.8

1407.8±6.7

1405.8±7.1

[Mbps]

Avg Tput

Pkts [%]

Drop

Clos

[Mbps]

Avg Tput

9.9±0.07

682.4±2.5

1847.9±27.7

1843.5±20.7

1844.2±24.2

1846.0±28.1

1845.7±26.3

1839.7±26.5

1842.0±29.4

1838.4±61.9

1843.9±20.9

1820.3±23.6

1840.5±25.1

8.3±0.03

9.9±0.05

682.4±2.1

15.3±0.17 6.1±0.10 1263.1±14.0

21.9±0.30 3.0±0.04 1808.5±24.5

15.0±0.15 6.2±0.06 1237.3±12.7

8.2±0.03

22.3±0.34 2.4±0.02

22.3±0.25 2.4±0.03

22.3±0.29 2.4±0.04

22.3±0.34 2.4±0.05

22.3±0.32 2.4±0.03

22.2±0.32 2.4±0.02

22.3±0.36 2.4±0.04

22.2±0.75 2.4±0.03

22.3±0.25 2.4±0.02

22.0±0.29 2.5±0.09

22.2±0.30 2.4±0.03

22.4±0.46 2.4±0.02 1850.0±37.9

[GB]

Rx

of 9.3±0.03

16.3±0.17

22.6±0.30

16.0±0.15

9.3±0.03

22.9±0.34

22.9±0.26

22.9±0.31

22.9±0.34

22.9±0.33

22.8±0.33

22.8±0.37

22.8±0.77

22.9±0.26

22.6±0.30

22.8±0.32

22.9±0.47

[GB]

Tx

pro 8.1±0.03

15.7±0.07

15.7±0.05

15.7±0.07

15.6±0.07

15.6±0.09

15.6±0.06

15.6±0.06

15.6±0.07

15.5±0.05

15.5±0.08

15.5±0.07

15.4±0.08

Pkts [%]

Drop

Cost266

[GB]

Rx

re-

1220.8±9.2

1219.3±9.1

1220.7±9.7

1217.5±8.6

1218.6±8.7

1215.2±9.8

1209.4±7.3

1209.9±9.3

1209.9±9.9

1210.7±9.3

[Mbps]

Avg Tput

13.3±0.07 7.24±0.06 1209.09±7.5

9.1±0.02

7.3±0.04

13.4±0.10

13.4±0.10

13.4±0.10

13.4±0.10

13.4±0.11

13.4±0.09

13.4±0.10

13.4±0.11

13.3±0.08

13.3±0.10

13.3±0.11

13.3±0.10

Pkts [%]

Drop

Nobel-EU

[GB]

Rx

Table 2: Trac statistics for the considered topologies

urn al P US backbone

Tx

Jo

Journal Pre-proof

urn al P

[%]

38.1±0.9

37.7±1.2

37.2±1.3

36.9±1.1

38.6±1.2

38.5±1.2

38.2±1.1

37.6±1.0

39.3±1.1

38.6±1.1

38.3±1.0

38.2±0.9

[%]

33.0±1.0

32.6±1.2

32.2±1.2

31.9±1.0

33.5±1.0

33.4±1.1

33.1±1.1

32.5±0.9

34.2±0.9

33.5±1.2

33.2±1.1

33.1±1.0

0.4

0.9

0.85

0.8

0.5

0.6

0.7

0.4

0.5

0.6

0.7

0.4

0.5

0.6

0.7

CongTh WarnTh

Rx Gain-b

Rx Gain-a

0.2±0.4

0.3±0.4

0.4±0.5

1.0±0.3

-0.3±0.4

0.2±0.4

0.4±0.5

0.5±0.4

-0.8±0.2

-0.5±0.3

-0.2±0.4

0.1±0.4

[%]

Rx Gain-c

US backbone

22.8±2.5

23.1±2.8

22.9±2.4

23.7±2.9

22.1±2.2

22.7±2.7

22.8±2.8

23.0±2.9

21.6±2.9

22.2±3.1

22.1±2.9

22.8±2.9

[%]

Rx Gain-d

83.7±1.5

83.5±1.5

83.5±1.6

83.3±1.6

83.5±1.6

83.0±1.7

83.2±1.6

82.7±1.8

81.8±1.5

81.9±1.6

81.9±1.6

82.0±1.6

[%]

Rx Gain-a

47.5±1.0

47.3±1.0

47.3±1.1

47.1±1.1

47.3±1.2

46.9±1.1

47.0±1.1

46.6±1.2

45.9±0.9

46.0±1.1

46.0±1.1

46.1±1.1

[%]

Rx Gain-b

1.1±0.7

1.0±0.7

1.0±0.8

0.8±0.8

1.0±0.8

0.7±0.7

0.8±0.7

0.5±0.8

0.0±0.6

0.1±0.8

0.1±0.8

0.1±0.8

[%]

Rx Gain-c

Nobel-EU

32.5±1.6

32.7±1.9

32.6±1.6

32.3±1.9

32.4±1.5

32.2±1.6

92.3±0.7

92.6±0.6

92.4±0.9

91.9±0.7

91.6±1.2

92.1±0.9

91.7±0.8

91.4±0.8

89.8±0.7

90.0±1.0

90.0±0.8

89.7±1.0

[%]

Rx Gain-a

61.9±0.5

62.1±0.5

1.1±0.5

1.3±0.4

1.2±0.5

0.9±0.5

0.8±0.4

1.0±0.3

0.8±0.3

0.7±0.4

-0.2±0.3

-0.1±0.4

-0.1±0.3

-0.2±0.5

[%]

41.0±1.0

41.5±0.8

41.5±1.2

40.9±1.1

40.8±1.3

41.2±0.8

40.9±0.5

40.7±1.3

39.4±0.6

39.7±1.2

39.2±1.0

39.4±1.2

[%]

Rx Gain-d

Clos

49.4±1.4

49.0±1.1

49.1±1.3

49.2±1.6

49.2±1.5

48.7±1.3

48.9±1.6

49.4±3.4

49.0±1.2

47.4±1.8

48.8±1.2

49.8±1.4

[%]

Rx Gain-b

2.2±0.4

1.9±0.4

2.0±0.5

2.1±0.6

2.1±0.6

1.7±0.6

1.9±0.7

2.2±0.7

2.0±0.4

0.9±0.7

1.8±0.3

2.4±0.2

[%]

Rx Gain-c

of 170.8±4.4

170.2±3.4

170.3±3.8

170.5±4.4

170.5±4.0

169.6±4.2

169.9±4.7

168.9±8.2

170.2±3.3

166.6±3.4

169.7±4.0

170.9±6.0

[%]

Rx Gain-a

pro 61.9±0.7

61.6±0.6

61.3±0.9

61.7±0.6

61.4±0.5

61.1±0.6

59.8±0.4

59.9±0.7

59.9±0.7

59.7±0.7

[%]

Rx Gain-c

Cost266

Rx Gain-b

re-

32.1±1.7

32.2±1.4

31.1±1.3

31.2±1.7

31.4±1.9

31.2±2.3

[%]

Rx Gain-d

46.3±1.5

45.9±1.2

46.0±1.5

46.1±1.7

46.1±1.7

45.6±1.6

45.8±1.7

46.4±3.6

46.0±1.3

44.4±2.0

45.7±1.4

46.8±1.6

[%]

Rx Gain-d

Table 3: Trac gains of the proposed mechanism in comparison to other mechanisms (Rx Gain-a : centrally calculated Dijkstra with reactive ow installation, Rx Gain-b : ECMP, Rx Gain-c : FAMTAR, Rx Gain-d : DevoFlow)

Jo

Journal Pre-proof

Journal Pre-proof

The main purpose of introducing our mechanism relates to the need for reducing the number of ow entries in the core switches (P nodes).

We

of

can observe that a signicant reduction has been obtained due to the ow aggregation procedure based on the introduction of centrally managed MPLS label distribution performed by the SDN controller. Since all ows destined to DCNs attached to the particular PE node are represented by a single

pro

label, a large number of ingress ow entries from the edge of the network can be served by the same single label in the core.

Thus, the number of

labels utilised by a single P node depends on the number of PE switches and the number of used paths.

The number of labels is sensitive to the

statistics of ow life-times and the idle timeout value used by the garbage collector (in our simulations, the latter is set to 3 seconds). Since the network core forwards trac from all the PE switches, it is useful to compare the

re-

summarised number of ow entries in the network to the number of labels present in a single P switch. To see the impact of our mechanism, observe columns `Sum of DFT entries (PE)' and `Avg label entries (P)' in Table 4, where the dierence of at least two orders of magnitude can be noticed for all the inspected topologies.

This result is well seen in Fig. 8, where the

urn al P

time changes are shown (for US backbone topology). Despite the fact that the number of ows arriving to PE nodes (in blue) increases, the number of labels used by P nodes (in red) tends to stabilise. This observation conrms the high scalability achieved by the proposed mechanism.

Moreover, the indicator dened in Eq. (2) proves potential and considerable scalability of our mechanism. Namely, the

maxFRI

illustrates the best

achieved result for the considered network conguration and trac conditions.

In Table 4, the best achieved values of

maxFRI

for all considered

topologies are marked in bold. This parameter takes value more than 99.2% proving that our mechanism behaves steadily in various topologies. We have also analysed inuence of the ow timeout on the ow table occupancy for both the PE and P nodes for our mechanism. The results were

Jo

obtain for US backbone network only. The three values of idle timeout were simulated: 1 second, 2 seconds and 3 seconds. All the results are presented in Fig. 9. The orange line indicates the median, while the box extends from the lower to upper quartile values of the data. the box to show the range of the data.

The whiskers extend from

The marked ier points represent

Avg label

entries (P)

174.8±1.9

182.8±1.9

188.5±1.7

183.9±2.2

172.6±2.7

181.6±2.6

186.2±1.9

186.0±2.2

170.7±2.6

179.6±1.4

186.3±2.1

186.8±2.9

entries (PE)

17385.5±189.7

17610.6±118.9

17977.6±188.9

17914.8±135.0

16928.6±186.4

17288.2±234.8

17488.8±96.6

17631.7±172.5

16628.3±198.0

17047.0±199.8

17335.1±181.9

17427.3±158.5

0.4

0.9

0.85

0.8

0.5

0.6

0.7

0.4

0.5

0.6

0.7

0.4

0.5

0.6

0.7

CongTh WarnTh

Sum of DFT

465.2±10.3

473.2±16.2

458.9±10.6

459.6±13.8

473.4±12.6

474.5±15.3

474.0±14.9

470.0±8.3

481.5±13.3

484.5±10.8

486.4±13.3

489.9±11.0

entries (P)

Max label

US backbone

99.27±0.02

99.27±0.01

99.26±0.02

99.27±0.02

99.27±0.01

99.26±0.02

99.26±0.01

99.27±0.02

99.26±0.01

99.26±0.02

99.26±0.01

99.28±0.02

[%]

maxFRI

20944.1±194.5

20891.0±125.5

20912.6±186.0

20907.0±149.5

21171.1±165.4

21217.5±131.7

21141.7±213.1

21108.0±145.1

21380.3±136.4

21325.5±176.9

21329.1±207.9

21265.1±223.7

entries (PE)

Sum of DFT

132.9±1.2

131.5±1.1

130.0±1.0

129.1±1.1

121.7±0.9

120.6±0.5

118.7±1.1

119.0±1.2

120.8±0.8

119.1±1.2

119.1±1.3

118.9±0.8

entries (P)

Avg label

448.4±7.9

450.0±7.4

440.6±7.3

434.1±6.7

382.9±5.6

378.0±5.1

378.3±5.8

379.0±7.5

385.7±7.4

374.8±8.7

378.6±7.3

381.2±4.7

entries (P)

Max label

Nobel-EU

99.60±0.01

99.61±0.01

99.62±0.01

99.62±0.01

99.59±0.01

16163.9±263.0

15960.2±253.3

15952.5±265.4

15925.0±218.5

16307.7±243.0

16148.9±232.9

16200.3±245.9

16224.8±175.4

16755.9±299.7

16607.9±359.2

16482.3±326.6

16684.5±270.7

entries (PE)

Sum of DFT

346.7±7.5

342.8±5.9

348.8±5.1

354.1±4.8

349.9±9.0

344.6±8.4

99.65±0.01

99.66±0.01

99.63±0.01

99.65±0.01

99.65±0.01

99.66±0.01

[%]

maxFRI

6059.8±85.6

5990.2±99.0

5939.0±100.7

6241.3±205.1

6133.4±79.5

5958.8±23.9

entries (PE)

Sum of DFT

98.8±1.0

97.0±1.1

96.5±0.9

96.4±0.8

97.6±1.3

96.3±1.4

342.7±5.4

337.8±9.3

341.3±7.5

344.2±5.9

342.9±8.4

346.7±7.0

99.64±0.01

99.64±0.01

99.65±0.01

99.66±0.01

99.64±0.01

99.64±0.01

46.7±1.5

47.4±2.7

45.2±2.0

43.4±1.3

47.9±2.4

50.5±2.0

47.1±1.8

43.0±2.4

49.2±2.8

54.6±5.3

50.3±1.7

44.6±2.8

101.0±6.6

105.2±8.1

100.2±5.4

94.0±3.2

101.4±9.6

114.4±7.6

104.0±4.8

97.7±11.2

107.6±8.7

124.5±18.3

110.0±5.5

97.5±12.8

entries (P)

Max label

Clos Avg label entries (P)

of

5874.1±51.6

5947.3±96.7

5998.0±105.0

6071.5±52.0

5906.8±79.7

6067.1±73.9

pro

96.7±1.3

95.6±1.1

95.1±1.3

94.6±1.5

94.3±1.6

94.3±1.1

entries (P)

Max label

Cost266 Avg label entries (P)

re-

99.60±0.01

99.60±0.01

99.60±0.01

99.60±0.01

99.60±0.01

99.60±0.01

99.60±0.01

[%]

maxFRI

Table 4: Aggregation eciency for the considered topologies

urn al P

Jo 99.38±0.04

99.36±0.02

99.40±0.04

99.45±0.03

99.38±0.04

99.36±0.02

99.38±0.04

99.44±0.05

99.38±0.04

99.35±0.07

99.38±0.05

99.44±0.07

[%]

maxFRI

Journal Pre-proof

Journal Pre-proof

10000

of

Sum of DFT entries (PE node)

15000

5000

0 20

40

60

Time [s]

Avg number of label entries (P node)

100 80 60 40 20 0 0

20

40

60

100

120

80

100

120

re-

Time [s]

80

pro

0

Figure 8: Average numbers of network ows and the used labels outlier values. As we have already mentioned in Section 3.5, the number of

urn al P

ow rules in ow tables depends on trac characteristics and the value of idle timeout. We can see that it is necessary to dene dierent values of idle timeout for short ows and for long-lasting ows. The authors of [35] suggest using low values of idle timeout, even lower than 1 second. In the case of our mechanism no communication with the controller for a new ow is necessary (it installs them on its own), a low value is desirable. As one can see in Fig. 9, 1 second idle timeout reduces the DFT occupancy by 30% in comparison to the situation when 3 seconds timeout (for US backbone topology). The former case decreases the number of used labels almost twice. Now, we compare our mechanism with FAMTAR from the viewpoint of aggregation eciency. The number of FAMTAR's ow forwarding table (FFT) entries and the number of DFT entries (in our mechanism) for any considered scenario are at the same level. The mechanisms signicantly dier in the number of entries stored at the core nodes. Let us consider a network containing 100 edge nodes, all being entrances and exits of a domain (sources

Jo

and destinations of trac). Suppose that there are no congested links in the network. For the FAMTAR solution, each connection between edge nodes is marked with a dierent tag. This gives a total number of

9900 = 99×100 (99

destinations from each 100 sources) tags in the whole network. This number

14000 1 sec.

2 sec.

3 sec.

1 sec.

maxFRI [%]

400

200 1 sec.

2 sec.

3 sec.

2 sec.

3 sec.

(b) Avg number of label entries (P)

500

300

100

99.5 99.4

re-

Max label entries (P)

(a) Sum of DFT entries (PE)

150

of

16000

pro

18000

Avg label entries (P)

Sum of DFT entries (PE)

Journal Pre-proof

99.3 99.2

1 sec.

(c) Max number of label entries (P)

2 sec.

3 sec.

(d) maxF RI

urn al P

Figure 9: Comparison showing inuence of ow idle timeouts for US backbone topology represents the number of ow entries a single core node has to handle in the worst case. This core node has to process communication with all edge nodes. In our case, each connection from any edge node to a particular exit node is tagged with the same single global MPLS label.

This results in a total

number of 100 tags across the whole network. This number is also obtained in the worst case.

When multipath transmission is considered, both mechanisms recalculate routing in the network. Supposing the worst case scenario, where all the existing ows are still present on the previous paths (before the path recalculation takes place), all the old tags have to be maintained. For the new ows, new tags have to be allocated. Therefore, a single change of routing induces an increase in the number of tags used to

19880 = 2 × 9900 for FAM-

Jo

TAR. When we consider our mechanism, the core nodes have to maintain only

200 = 2 × 100

tags (labels). In this way, our mechanism is much more

scalable. Moreover, FAMTAR uses a DSCP eld of IPv4 header for aggregation. This poses a scalability issue  only 256 simultaneous aggregated ows can be processed in the network, so even the topology discussed in the

Journal Pre-proof

above example cannot be served. Only very limited aggregates are available. In Fig. 10, we separately present CDFs of the ow table occupancy per

of

second for access nodes and core nodes. The analysis was done for US backbone topology. In case of our mechanism, access nodes store simultaneously less ow entries than the other compared mechanisms (Fig. 10a). This observation follows from the fact that when a network is heavily utilised a single

pro

ow is transmitted with smaller throughput values. In all the scenarios, ow inter-arrival time is the same. We use TCP ows that slow down if necessary; thus, they are present for much longer periods in a network. This means that each single node has to maintain ows for a longer time. This results in increase of the number of ow entries. In the case of our mechanism, ows can achieve higher throughputs than in other cases. Consequently, the process of transmission is nished faster and ow entries in ow tables are maintained

re-

for shorter times. We can observe in Fig. 10b that our mechanism requires up to two orders of magnitude less ow entries than the other mechanisms. For example, when one considers the average number of ow entries in core

81.0 ± 11.9 ow entries, while the single 2070.7 ± 662.1 of them. A non-negligible

nodes, our mechanism requires only path approach needs as many as

urn al P

value of the condence intervals for the single path follows from the fact that network resources are unevenly used; in other words, some core nodes are heavily utilised while others are used occasionally only. Such a situation does not appear when our mechanism is applied, and then the network nodes are used in the balanced way.

Our mechanism signicantly limits the communication between switches and an SDN controller. The controller only retrieves link statistics and updates aggregation entries (CFT in PE nodes, and label entries in P nodes). If we suppose that the controller collects link statistics and performs the reverse Dijkstra calculation every second (the worst case), the total number of exchanged messages between the controller and the nodes per second is equal to

2N + N

(the request and response for statistics from

N

nodes plus

N

mes-

sages distributing new labels). For example, for the US backbone network it equals

117 = 2 × 39 + 39.

DevoFlow has to collect statistics in the same

manner (in case of the threshold-based elephant ow detection). Moreover,

Jo

each elephant ow has to be served by the controller; therefore, it gener-

Packet_IN and ow installation messages (Packet_OUT together with Flow_MOD messages). In Fig. 11 we present CDFs of a number of OpenFlow ates

signalling messages per second, used by the single path case, DevoFlow and Expedited Forwarding for US backbone network. Other considered mecha-

Journal Pre-proof



&')

     

 'HYR)ORZ

















pro



([SHGLWHG(YLFWLRQ

of



&')

&')





  6LQJOHSDWK 1XPEHURIIO )$07$5 RZHQWULHV

2XU0HFKDQLVP







 

&')



   1XPEHURIIORZHQWULHV 







  1XPEHURIIORZHQWULHV

(a) Access nodes





(b) Core nodes

6LQJOHSDWK





  







([SHGLWHG(YLFWLRQ



&')

&')





  'HYR)ORZ

urn al P



re-

 Figure 10: CDF of number of ow entries (i.e., a ow table occupancy)

















(a) Number of Packet_IN messages

   

















(b) Number of ow installation messages (Packet_OUT together with Flow_MOD messages)

Figure 11: CDF of number of OpenFlow signalling messages per second nisms do not use

Packet_IN/Packet_OUT signalling.

As expected, DevoFlow

Jo

limits communication with the controller. When an elephant ow is detected by an access switch, it sends

Packet_IN to the controller.

Then, it establishes

the least congested path for this ow and sends installation messages to all switches on this path. All other (non-elephant) ows are routed via ECMP paths without communicating with the controller. For the single path sce-

Journal Pre-proof

nario,

Packet_IN

messages are generated only by access nodes. The related

ow installation messages are forwarded to all nodes on the path. Expedited

Packet_IN

messages because ow installation procedure

is done hop-by-hop. Additionally,

Packet_IN messages are generated even by

of

Eviction uses more

evicted ows since some packets after RST/FIN ags may occur. However, the controller will ignore these messages as described in [35]. A slight dier-

pro

ence in a number of ow installation messages for the single path case and Expedited Eviction results from the behaviour of the compared mechanisms. For the single path case, a single

Packet_IN

is responsible for generation

of a bunch of installation messages, while in the case of Expedited Eviction each node generates

Packet_IN

and processes the responding

messages.

Packet_OUT

Our mechanism Single path DevoFlow

117

Nobel-EU



Cost266



84

Clos



111

63

17569.5±223.38

17397.7±146.52

15286.8±122.10

19601.1±189.41

35080.7±375.89

30154.7±122.08

26739.3±80.27

31751.4±258.08

117.6±13.49

76.0±7.61

urn al P

Expedited Eviction



re-

Table 5: No. of OpenFlow messages US Backbone

93.7±6.07

88.1±19.69

In Table 5, we present the total number of OpenFlow messages served per second by the SDN controller. In all the simulations performed for the single path case, DevoFlow and Expedited Eviction we do not implement the delay related to ow installation (including

Packet_IN messages generation,

processing and insertion of the resulting ow in the switches). We can state that this delay can signicantly inuence the achieved throughput. Moreover, it is likely that packets belonging to a single ow can arrive with such a high frequency that the installation of a forwarding rule in the switch takes place after many packets from the same ow arrive. many unnecessary controller.

Packet_IN

This results in triggering

messages, causing further overloading of the

Additionally, it is worth noting that in our simulations of the

single path case, we consider the situation that only a single

Packet_IN

message is generated for each separate ow. Therefore, in real networks the throughput for all mechanisms using

Packet_IN

communication is expected

Jo

to be even lower than that presented in Table 2. In our simulations, the total number of OpenFlow messages for the single

path scenario is a sum of

Packet_IN messages and ow installation messages Flow_MOD messages). In case of DevoFlow, the

(Packet_OUT together with

Journal Pre-proof

Table 6: Comparative ow reduction indicator (CFRI [%]) Nobel-EU

Cost266

Clos

95.00±0.13

97.83±0.02

98.31±0.01

99.76±0.00

94.29±0.14

97.48±0.02

98.12±0.01

99.73±0.01

92.71±0.18

97.20±0.02

97.82±0.01

98.87±0.02

Packet_IN messages, ow in-

pro

DevoFlow Expedited Eviction

of

US Backbone Single path

total number of signalling messages is a sum of

stallation messages and link statistic messages. We assume that link statistic messages are exchanged every second, as in the case of our mechanism (the request and response for statistics from

N

nodes). For Expedited Eviction

the total number of OpenFlow messages is a sum of

Packet_IN

messages,

ow installation messages, and ow removal messages. We observe a reduc-

re-

tion of up to 99% in the number of OpenFlow messages exchanged between the controller and switches when our mechanism is compared to Expedited Eviction, and on the similar level when compared to DevoFlow.

One can

see that DevoFlow is better than our mechanism for some topologies. It is important to stress that for this calculation, we used the most pessimistic

urn al P

case for our mechanism.

The achieved reductions of ow entries in the core nodes (dened by Eq. (3)) for all considered topologies are summarised in the `CFRI ' column of Table 6. Independently of the simulated topologies and mechanisms, the

CFRI

is greater than 92%. The comparison of the ow reduction eciency

for our mechanism and FAMTAR were discussed with the help of analytical evaluations before. This comparison also shows that our mechanism performs better than FAMTAR.

4.5. Comparison of dierent trac patterns Here, we present the evaluation of the inuence of dierent trac patterns on performance of the compared mechanism run on all the topologies used. To show how the presence of mice/elephant ows aects the considered mechanisms, we use dierent ow size distributions. We do that by generating ow sizes on the basis of a Pareto distribution with various mean values.

Jo

We consider the three scenarios of simulations: (1) with mice ows only, (2) elephant ows only, and (3) the mixture of them. According to [40, 41], we assume the mean mice ow size is equal to 256 kB. However, in the literature many dierent denitions of an elephant ow (the so-called `heavy hitter') can be found. Typically, those denitions are based on a ow size [42, 43] or

Journal Pre-proof

MaxTput 3400 Mbps

Tput [Mbps]

Tput [Mbps]

2000 1500 1000 500

256kB

5MB Mean Flow Size

2500

MaxTput 2100 Mbps

2000

Tput [Mbps]

1750 1500 1250 1000 750 500 250 0

256kB

5MB Mean Flow Size

Mix

(b) Nobel-EU

MaxTput 2700 Mbps

1500

urn al P

Tput [Mbps]

(a) US backbone

Mix

MaxTput 1800 Mbps

re-

0

1400 1200 1000 800 600 400 200 0

pro

2500

of

Tput [Mbps]

1750 1500 1250 its rate [44, 45]. We take 5 MB as a mean elephant ow size. This denition 1000 is in line with [40, 41]. 750 Moreover, to assess the compared mechanisms in various conditions, we 500generate also a mixture of mice and elephant ows with the ratio 9 : 1 in 250 accordance with [41]. 0 256kB 5MB Mix Size Our Mechanism FAMTAR ECMPMean FlowDevoFlow Expedited Eviction Single path

1000 500

256kB

5MB Mean Flow Size

(c) Cost266

Mix

0

256kB

5MB Mean Flow Size

Mix

(d) Clos

Figure 12: Achieved network throughputs for the simulated topologies (the mean ow inter-arrival time is equal to 3 ms). The values in boxes show the maxima obtained from the `multi-commodity max-ow' optimization problem In Fig. 12, we compare the achieved overall network throughputs for all the considered mechanisms in dierent topologies for the three dened trac patterns. Our mechanism and FAMTAR demonstrate similar eciency, while the other mechanisms achieve worse throughput. We see some dependency related to a topology and trac pattern. It is especially visible for the Clos

Jo

topology, where DevoFlow perform better than ECMP. Obtained results for DevoFlow are in line with [14].

Since we use a smaller Clos topology and

dierent trac patterns, our simulations present a lower level of gain for DevoFlow than the results presented by authors of it.

Let us notice that

our simulator is packet-based, while the simulator used in [14] is ow-based.

Journal Pre-proof

This can be one of the reasons behind the observed dierences. We also implement an optimization model based on linear programming.

of

It is a `multi-commodity max-ow problem' [46] that maximises network throughput distributing ows optimally. However, this is a static approach, assuming the full knowledge on non-chainging ows.

Thus, this approach

provides better throughput values than all the mechanisms dealt with in this

pro

paper. On the other hand, the optimization-based approach is too rigid and

Number of flow entries

not useful when we have the dynamicity in ows, what in fact practically is the case. The optimal value for each 103 topology is shown as a number in boxes in Fig. 12. The gap seen between the optimal values and the ones achieved by a particular mechanism show how much trac is transferred unoptimally

102

due to the dynamic behaviour of the ows.

FAMTAR

Number of flow entries

104

Expedited Eviction

103

256kB

5MB Mean Flow Size

102

Mix

256kB

102

Number of flow entries

Number of flow entries

(a) US backbone

103

Single path

103

urn al P

Number of flow entries

105

102

256kB5MBMix Mean Flow Size DevoFlow

re-

Our Mechanism

256kB

5MB Mean Flow Size

(c) Cost266

Mix

5MB Mean Flow Size

Mix

(b) Nobel-EU

104 103 102 256kB

5MB Mean Flow Size

Mix

(d) Clos

Jo

Figure 13: Average ow table occupancy in core nodes (the mean ow inter-arrival time is equal to 3 ms) Average ow table occupancies, expressed per second, for core and access

nodes are presented in Figures 13 and 14, respectively. One can seen that

256kB5MBMix Mean Flow Size DevoFlow

104

103 256kB

5MB Mean Flow Size

256kB

Mix

Mix

103

Mix

256kB

urn al P

(c) Cost266

104

re-

Number of flow entries

Number of flow entries

5MB Mean Flow Size

5MB Mean Flow Size

(b) Nobel-EU

104

256kB

Single path

104

(a) US backbone

103

Expedited Eviction

of

FAMTAR

102

Number of flow entries

Number of flow entries

Our Mechanism

103

pro

Number of flow entries

Journal Pre-proof

5MB Mean Flow Size

Mix

(d) Clos

Figure 14: Average ow table occupancy in access nodes (the mean ow inter-arrival time is equal to 3 ms) the use of our ow aggregation procedure signicantly limits the number of entries in core nodes. In comparison to other solutions, our mechanism requires at least two orders of magnitude less entries. While comparing our mechanism to FAMTAR (without any aggregation procedure), we observe the dierence of even three orders of magnitude. FAMTAR with aggregation was discussed analytically in Section 4.4. We want to note that FAMTAR and our mechanism transfers a huge amount of trac in comparison to other mechanisms.

Since FAMTAR does not aggregate ows, it requires much

more ow entries.

On the other hand, DevoFlow and Expedited Eviction

do not transfer as much trac as our mechanisms and FAMTAR, but they

Jo

experience a relatively huge number of entries in ow tables. In the case of access nodes (Fig. 14), our mechanism also shows the best behaviour. We also consider OpenFlow signaling overhead for dierent trac pat-

terns (Fig. 15). Since DevoFlow uses

Packet_IN/Packet_OUT procedure only

for elephant ow installation, it requires much less signalling messages than

104 103

256kB 5MB Mix Mean Flow Size DevoFlow Expedited Eviction

PacketIn/PacketOut number

PacketIn/PacketOut number

Our mechanism

104

103

102

256kB

5MB Mean Flow Size

104 103 102

256kB

Mix

256kB

5MB Mean Flow Size

Mix

103 102

Mix

urn al P

(c) Cost266

104

re-

104

102

5MB Mean Flow Size

(b) Nobel-EU

PacketIn/PacketOut number

PacketIn/PacketOut number

(a) US backbone

103

Single path

of

102

pro

PacketIn/PacketOut number

Journal Pre-proof

256kB

5MB Mean Flow Size

Mix

(d) Clos

Figure 15: No. of OpenFlow signaling messages (the mean ow inter-arrival time is equal to 3 ms) the other mechanisms installing ows reactively. The dierence between Expedited Eviction and the single path case results from the fact that Expedited

Packet_IN for evicted ows. Our mechanism does not use the Packet_IN/Packet_OUT procedure at all, its signaling overhead is related

Eviction generates

only to statistics gathering and coarse ow installation when needed. In Fig. 16, we present the results achieved for a less utilised network. We select a trac pattern with relatively high number of elephant ows (the mean ow size equals to 5 MB) and not so intensive ow inter-arrival time (30 ms). The results show that in case of the Clos topology DevoFlow achieves a notably higher throughput than ECMP (the same as in the case

Jo

of FAMTAR), but our mechanism outperforms others independently of the topology and trac pattern. In general, our mechanism is less sensitive to a topology structure and trac.

Journal Pre-proof

1000 500

1750 1500 1250 1000 750 500 250 0

USA Nobel Our Mechanism FAMTAR

ECMP DevoFlow

Cost

Clos Expedited Eviction Single path

of

Tput [Mbps]

0

USA

Nobel

pro

Tput [Mbps]

1500

Cost

Clos

5. Related work

re-

Figure 16: Achieved network throughputs for the simulated topologies (the mean ow inter-arrival time is equal to 30 ms, the mean ow size is equal to 5 MB)

Some proposals for ow table reduction have appeared in literature. The

urn al P

authors of FAMTAR [39] (decentralised mechanism, not SDN-based) introduced an aggregation procedure limiting the number of ows in core routers [47]. Separate ows following the same routing path are aggregated by tagging. The tag represents a sequence of nodes on an OSPF path connecting entry and exit nodes of a domain using FAMTAR. In a similar manner to the majority of aggregation mechanisms, nodes are also separated into two classes: edge and core. Edge nodes use three tables, namely standard routing table (RT), ow forwarding table (FFT) and aggregation table (AT). Core routers possess RT and AT. When a new packet arrives to an edge of the domain, the edge node does not have an entry in FFT. It consults the RT and AT to build a new FFT entry containing a ow identity and a tag (from AT), as well as an output interface (from RT). The packets belonging to existing ows are then processed in accordance with FFT. When a packet reaches a core node, it looks up its AT. If a match of the tag is not found, the router looks up its RT for the destination. Next, it creates an AT entry

Jo

containing the received tag identity and the output interface (retrieved from RT).

The authors of [48] introduce partitioning of a network into regions and

assume the use of two MPLS tags to reduce the number of ow table entries. To compare this mechanism to FAMTAR and our mechanism, let us consider

Journal Pre-proof

the previously used scenario again. To be consistent with our terminology, we use the `edge' notion for the nodes where host networks are connected. When

of

a network possessing 100 edge nodes is divided into 10 regions, each with 10 edge nodes, the maximum number of ow entries in a core node is 20 for the mechanism proposed in [48]. Such a number is valid when only one node is a communication point for other regions. This mechanism oers a better ow

pro

scalability than our proposition and FAMTAR. In [48], the authors use an optimisation task for the selection of a number of regions. Ten is the optimal value for regions, resulting in a lower number of ow table entries for the proposed scenario. When one decides to change the number of regions, the average number of ows increases according to the following rule: where

R is the number of regions and E

R + E/R,

is the number of edge routers. How-

ever, the authors do not oer any congestion avoidance procedure. In our

re-

opinion, use of a single node as an entrance to each region increases the probability of congestion. We cannot qualitatively compare this mechanism with our solution because paper [48] does not introduce any multipath transmission method.

Contrary to [48], the next advantage of our mechanism is a

lack of communication between the switches and the controller for every new

urn al P

ow installation (excluding the extension presented is Subsection 3.3). The compared mechanism [48] requires a reactive installation of each new ow in edge nodes.

In [49], the authors propose a dynamic ow aggregation based on VLAN tagging.

The basis for the aggregation procedure is a common path that

many ows traverse at the same time. Since the authors of [49] give a performance evaluation based on a ring, a tree and a ring of tree topologies, it is hard to compare their solution with our universal mechanism.

They

obtain a 48% reduction of ow entries in the core of the network, compared to 96% in our case.

In this solution, a controller has to maintain a huge

database storing information about all the running ows. We perceive this as a weak point. When a packet belonging to a new ow arrives to a switch, a

Packet_IN message is always generated independently whether an aggregate exists or not. Contrary to our approach, this solution creates an enormous communication burden between network nodes and the controller.

Jo

In [14], Curtis

et al.

introduce a modication of the OpenFlow protocol,

known as DevoFlow. In this solution, similar to our proposal, a switch is able to insert some ow rules into a ow table on its own. The authors distinguish two classes of ows based on measurements: mice and elephants. The former can be installed by a switch itself using proper wildcards, while the latter

Journal Pre-proof

requires communication with a controller. When a given ow is recognised as an elephant ow, a switch consults the controller. It nds the least occu-

of

pied path to the destination and re-routes this ow by installing proper ow rules in switches on the path found. Communication between network nodes and the controller is minimised, but may still be signicant (depending on the distribution of elephant ows).

Curtis

et al.

also explore a multipath

DevoFlow re-routes elephant

pro

transmission enabling congestion avoidance.

ows what can cause packet reordering during transient state. However, the number of reordered packets is insignicant in relation to the overall number of packets in an elephant ow.

The authors of [50] propose the weighted cost multipath (WCMP) mechanism.

According to this mechanism, nding multiple paths connecting a

particular source and destination is based on ECMP. Each path then ac-

re-

quires a weight using max-ow min-cut algorithms. The weights determine how ows are distributed amongst paths determined by ECMP. In contrast with our mechanism, the proposal presented in [50] does not implement trafc monitoring and does not provide a reactive mechanism for congestion avoidance. An extension is proposed in Niagara [51], in which a controller

urn al P

performs an ecient approximation of weights for each service. It also optimises the division of the rule table space.

These two solutions generally

focus on datacentre networks.

The best industry TE solution for inter-DC WAN communication based on SDN is well known as Google B4 [52].

According to the application

demands, it allocates bandwidth and distributes trac evenly among tunnels connecting communicating sites. In comparison with our solution, the Google mechanism strongly relies on real-time trac measurements. Both solutions require customised switches.

By contrast, our approach is dedi-

cated for intra-domain communication. However, when one considers more advanced solutions for MPLS, such as a stateful path computation element (PCE) [53, 54] and BGP-LS [55], we note that our mechanism can be extended to inter-domain scenarios.

The Mahout operation is based on the identication of mice and elephant ows [56]. It is dedicated for DC networks and it engages end hosts for ele-

Jo

phant ow detection. The ows are routed on the basis of ECMP. Once an elephant ow is detected, it is marked using DSCP. When such a marked ow reaches the top of a rack switch, it is redirected to an SDN controller and rerouted using the least congested paths of all the ECMP paths. The relocation of the elephant ow detection from the switches to the end hosts reduces

Journal Pre-proof

high monitoring overheads. However, this approach requires dedicated software deployment in an end host, which is likely to be achieved only in DCs.

of

Before classifying a particular ow as an elephant ow, it is treated as a mice ow. After detecting, it is highly probable that this ow will be rerouted. This degrades the TCP performance, and may cause packet reordering and drops. Moreover, each elephant ow must be reactively installed. This re-

Packet_IN

messages. In a similar manner to our

pro

sults in a high number of

approach, the Mahout monitors link utilisation. There is no ow aggregation; therefore, especially for DC networks. aggregation nodes can experience ow table overload. The other solution which makes distinction between mice and elephant ows is known as Hedera [57]. The general dierence between Mahout and Hedera is that Hedera periodically pulls switches to detect elephant ows.

re-

The problem of the reduction of ow tables occupancy is also analysed in [35].

The authors propose the mechanism based on ow timeout opti-

misation.

The approach is aimed at rapid removal of nished ows.

work suggests to serve TCP and UDP ows in dierent ways. they assume that the

FIN/RST

This

For TCP,

ags indicate the end of a ow. As soon as

urn al P

these ags are observed, the ow is evicted. Contrarily, for UDP ows, the ow installation is postponed since many of these ows consist of one or two packets only. The authors show that the procedure proposed enables reduction of ow tables occupancy up to 62%. However, they do not discuss the signalling overhead related to communication between the controller and the switches; therefore, their proposal is purely reactive. the approach can trigger a very huge number of

In our opinion, such

Packet_IN messages and, in

result, it will be likely to degrade the performance of the controller and the network. The delayed UDP ow installation creates even a bigger number of

Packet_IN Wang

messages.

et al.

[58] focus on minimisation of the delay between pairs of

nodes using precomputed multiple paths. Nevertheless, the decrease of latency is obtained at a cost of increase in the number of ow entries at the switches. Our approach avoids this trade-o. The solution presented in [59] concentrates on the minimisation of communication between the controller

Jo

and switches. A separate pool of tags for marking dierent paths (one per source-destination pair) is used. In contrast with our mechanism, this proposal does not implement dynamic reaction to congestion and each new ow must be processed by the controller, while our solution oers a better ow aggregation in the core network and signicantly reduces the number of mes-

Journal Pre-proof

sages exchanged between the SDN controller and switches.

of

6. Conclusions To improve the network utilisation and eciency of SDN-based forwarding, we propose the mechanism for ow aggregation accompanied by multi-

pro

path transmission. The performed experiments show that the application of the mechanism results in a 93% reduction of ow entries in core nodes and a 99% reduction of OpenFlow messages.

We also observe that the overall

network trac increases by around 171%. The performed evaluations show that for all simulated network topologies and trac patterns our mechanism behaves more eciently than a large group of other compared solutions. A signicant reduction in the number of ow entries in the core of the net-

re-

work is obtained due to the ow aggregation procedure. The procedure is enabled by the introduction of a centrally managed MPLS label distribution. The distribution is performed by an SDN controller without the application of legacy signalling protocols. The increase of the network trac has been obtained due to the application of multipath transmission. When a poten-

urn al P

tial link overload is detected, the mechanism dynamically nds new paths for new ows. Beside the above mentioned advantages, our mechanism does not involve new protocols but uses only simple modications of the existing solutions. These include a modied switch which is able to install new ow rules on its own. As a result of this, the communication between the SDN controller and network nodes is minimised. Moreover, the mechanism may be deployed incrementally using legacy MPLS nodes in the core of the network. As the future work extending the results given in this paper, we consider congestion reduction methods based on redirection of existing ows to less utilised links. However, removing some ows from congested paths and redirecting them to other existing, yet not congested paths, may result in creating new congestion in the network (and underutilisation of the previously congested paths). This approach requires the research on new trac prediction methods.

Another interesting topic we plan to study is related

to a specialised treatment of selected applications using SDNs. Such an ap-

Jo

proach will be focused on an expected limited usage of additional signalling related to reactive ow installation.

Journal Pre-proof

Acknowledgment This work was performed under Contract No 15.11.230.387 Dynamic

of

trac management in Software-Dened Networks. This research was carried out with the supercomputer `Deszno' purchased thanks to the nancial support of the European Regional Development Fund in the framework of the Polish Innovation Economy Operational Program (contract no. POIG.

pro

02.01.00-12-023/08). This research was also supported in part by PL-Grid Infrastructure.

References

[1] B. Fortz, M. Thorup, Optimizing OSPF/IS-IS weights in a changing 756767.

re-

world, IEEE Journal on Selected Areas in Communications 20 (4) (2002)

[2] J. Dom»aª, Z. Duli«ski, M. Kantor, J. Rz¡sa, R. Stankiewicz, K. Wajda, R. Wójcik, A survey on methods to provide multipath transmission in

urn al P

wired packet networks, Computer Networks 77 (2015) 1841. [3] R. Wójcik, J. Dom»aª, Z. Duli«ski, G. Rzym, A. Kamisi«ski, P. Gawªowicz, P. Jurkiewicz, J. Rz¡sa, R. Stankiewicz, K. Wajda, A survey on methods to provide interdomain multipath transmissions, Computer Networks 108 (C) (2016) 233259.

[4] S. K. Singh, T. Das, A. Jukan, A survey on internet multipath routing and provisioning, IEEE Communications Surveys Tutorials 17 (4) (2015) 21572175.

[5] E. Rosen, A. Viswanathan, R. Callon, Multiprotocol label switching architecture, IETF RFC 3031 (2001). [6] W. Xia, Y. Wen, C. H. Foh, D. Niyato, H. Xie, A survey on softwaredened networking, IEEE Communications Surveys and Tutorials 17 (1) (2015) 2751.

Jo

[7] Software-dened networking:

the new norm for networks, Open Net-

working Foundation whitepaper (2012).

[8] OpenFlow switch specication v1.5.1, Open Networking Foundation specication (2015).

Journal Pre-proof

[9] D. Kreutz,

F. M. V. Ramos,

P. E. Verissimo,

C. E. Rothenberg,

S. Azodolmolky, S. Uhlig, Software-dened networking: a comprehen-

of

sive survey, Proc. the IEEE 103 (1) (2015) 1476. [10] R. Alvizu, G. Maier, N. Kukreja, A. Pattavina, R. Morro, A. Capello, C. Cavazzoni, Comprehensive survey on T-SDN: software-dened networking for transport networks, IEEE Communications Surveys Tutori-

pro

als 19 (4) (2017) 22322283.

[11] M. Karakus, A. Durresi, A survey: control plane scalability issues and approaches in software-dened networking (SDN), Computer Networks 112 (2017) 279293.

[12] S. Li, K. Han, N. Ansari, Q. Bao, D. Hu, J. Liu, S. Yu, Z. Zhu, Improving

re-

SDN scalability with protocol-oblivious source routing: a system-level study, IEEE Transactions on Network and Service Management 15 (1) (2018) 275288.

[13] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson,

urn al P

J. Rexford, S. Shenker, J. Turner, OpenFlow: enabling innovation in campus networks, SIGCOMM Comput. Commun. Rev. 38 (2) (2008) 6974.

[14] A. R. Curtis, J. C. Mogul, J. Tourrilhes, P. Yalagandula, P. Sharma, S. Banerjee, DevoFlow: scaling ow management for high-performance networks, SIGCOMM Comput. Commun. Rev. 41 (4) (2011) 254265. [15] R. Khalili, W. Y. Poe, Z. Despotovic, A. Hecker, Reducing state of OpenFlow switches in mobile core networks by ow rule aggregation, in: Proc. 2016 25

th

International Conference on Computer Communication

and Networks (ICCCN), 2016, pp. 19. [16] C. C. Chuang, Y. J. Yu, A. C. Pang, G. Y. Chen, Minimization of TCAM usage for SDN scalability in wireless data centers, in: Proc. 2016 IEEE

Jo

Global Communications Conference GLOBECOM, 2016, pp. 17. [17] A. Tavakoli, M. Casado, T. Koponen, S. Shenker, Applying NOX to the datacenter, in: Proc. Workshop on Hot Topics in Networks HotNetsVIII, 2009.

Journal Pre-proof

[18] A. Tootoonchian, S. Gorbunov, Y. Ganjali, M. Casado, R. Sherwood, On controller performance in software-dened networks, in: Proc. 2

nd

of

USENIX Conference on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services Hot-ICE'12, 2012.

[19] T. Benson, A. Akella, D. A. Maltz, Network trac characteristics of data centers in the wild, in: Proc. 10

th

ACM SIGCOMM Conference on

267280.

pro

Internet Measurement IMC'10, ACM, New York, NY, USA, 2010, pp.

[20] S. Kandula, S. Sengupta, A. Greenberg, P. Patel, R. Chaiken, The nath

ture of data center trac: measurements & analysis, in: Proc. 9

ACM

SIGCOMM Conference on Internet Measurement, IMC '09, ACM, New

re-

York, NY, USA, 2009, pp. 202208.

[21] X. Wen, B. Yang, Y. Chen, L. E. Li, K. Bu, P. Zheng, Y. Yang, C. Hu, RuleTris: minimizing rule update latency for TCAM-based SDN switches, in:

Proc. 2016 IEEE 36

th

International Conference on Dis-

urn al P

tributed Computing Systems (ICDCS), 2016, pp. 179188. [22] Y. R. E. Rosen, BGP/MPLS IP virtual private networks (VPNs), IETF RFC 4364 (2006). [23] IEEE,

http://www.ieee802.org/1/pages/802.1Q.html,

IEEE 802.1q:

VLAN (2005).

[24] A. X. Liu, C. R. Meiners, E. Torng, TCAM razor:

a systematic ap-

proach towards minimizing packet classiers in TCAMs, IEEE/ACM Transactions on Networking 18 (2) (2010) 490500. [25] N. Katta, O. Alipourfard, J. Rexford, D. Walker, Innite CacheFlow in software-dened networks, in: Proc. Third Workshop on Hot Topics in Software Dened Networking HotSDN '14, ACM, New York, NY, USA, 2014, pp. 175180.

Jo

[26] C. Hopps, Analysis of an equal-cost multi-path algorithm, IETF RFC 2992 (2000).

[27] S. Swallow, S. Bryant, L. Andersson, Avoiding equal cost multipath treatment in MPLS networks, IETF RFC 4928 (2007).

Journal Pre-proof

[28] D. Savage, J. Ng, S. Moore, D. Slice, P. Paluch, R. White, Cisco's enhanced interior gateway routing protocol (EIGRP), IETF RFC 7868

of

(2016). [29] S. H. Yeganeh, A. Tootoonchian, Y. Ganjali, On scalability of softwaredened networking, IEEE Communications Magazine 51 (2) (2013) 136

pro

141.

[30] K. He, J. Khalid, S. Das, A. Gember-Jacobson, C. Prakash, A. Akella, L. E. Li, M. Thottan, Latency in software dened networks: measurements and mitigation techniques, SIGMETRICS Perform. Eval. Rev. 43 (1) (2015) 435436.

[31] K. He, J. Khalid, A. Gember-Jacobson, S. Das, C. Prakash, A. Akella, switches, in: Proc. 1

st

re-

L. E. Li, M. Thottan, Measuring control plane latency in SDN-enabled ACM SIGCOMM Symposium on Software De-

ned Networking Research SOSR'15, 2015.

[32] B. Pfa, J. Pettit, T. Koponen, E. Jackson, A. Zhou, J. Rajahalme,

urn al P

J. Gross, A. Wang, J. Stringer, P. Shelar, K. Amidon, M. Casado, The design and implementation of Open vSwitch, in: Proc. 12

th

USENIX

Symposium on Networked Systems Design and Implementation (NSDI 15), USENIX Association, Oakland, CA, 2015, pp. 117130. [33] OVS-DPDK datapath classier (2018).

https://software.intel.com/en-us/articles/ovs-dpdkdatapath-classifier

URL

[34] H. Zhu, H. Fan, X. Luo, Y. Jin, Intelligent timeout master: dynamic timeout for SDN-based data centers, in: 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), 2015, pp. 734 737.

[35] S. Shirali-Shahreza, Y. Ganjali, Delayed installation and expedited eviction:

An alternative approach to reduce ow table occupancy in sdn

Jo

switches, IEEE/ACM Transactions on Networking 26 (4) (2018) 1547 1561.

[36] The ns-3 discrete-event network simulator (2018). URL

https://www.nsnam.org

Journal Pre-proof

[37] S. Orlowski, R. Wessäly, M. Pióro, A. Tomaszewski, SNDlib 1.0  survivable network design library, Networks 55 (3) (2010) 276286.

of

[38] C. Clos, A study of non-blocking switching networks, Bell System Technical Journal 32 (5) (1953) 406424.

[39] R. Wójcik, J. Dom»aª, Z. Duli«ski, Flow-aware multi-topology adaptive

pro

routing, IEEE Communications Letters 18 (9) (2014) 15391542. [40] C. Lee, Y. Nakagawa, K. Hyoudou, S. Kobayashi, O. Shiraki, T. Shimizu, Flow-aware congestion control to improve throughput under TCP incast in datacenter networks, in: 2015 IEEE 39th Annual Computer Software and Applications Conference, Vol. 3, 2015, pp. 155162.

re-

[41] T.-Y. Mu, A. Al-Fuqaha, K. Shuaib, F. M. Sallabi, J. Qadir, SDN ow entry management using reinforcement learning, ACM Trans. Auton. Adapt. Syst. 13 (2) (2018) 11:111:23.

[42] H. Luo, Y. Xu, W. Xie, Z. Chen, J. Li, H. Zhang, H. Chao, A framework for integrating content characteristics into the future Internet architec-

urn al P

ture, IEEE Network 31 (3) (2017) 2228.

[43] S. Dashti, M. Berenjkoub, A. Tahmasbi, An ecient sketch-based framework to identify multiple heavy-hitters and its application in dos detection, in:

2014 22nd Iranian Conference on Electrical Engineering

(ICEE), 2014, pp. 11131118.

[44] E. T. B. Hong, C. Y. Wey, An optimized ow management mechanism in openow network, in: 2017 International Conference on Information Networking (ICOIN), 2017, pp. 143147. [45] B. Lee, R. Kanagavelu, K. M. M. Aung, An ecient ow cache algorithm with improved fairness in software-dened data center networks, in: 2013 IEEE 2nd International Conference on Cloud Networking (CloudNet), 2013, pp. 1824.

Jo

[46] M. Pióro, D. Medhi, Routing, Flow, and Capacity Design in Communication and Computer Networks, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2004.

Journal Pre-proof

[47] J. Dom»aª, P. Jurkiewicz, P. Gawªowicz, R. Wójcik, Flow aggregation mechanism for ow-aware multi-topology adaptive routing, IEEE Com-

of

munications Letters 21 (12) (2017) 25822585. [48] N. Kitsuwan, S. Ba, E. Oki, T. Kurimoto, S. Urushidani, Flows reduction scheme using two MPLS tags in software-dened network, IEEE

pro

Access 5 (2017) 1462614637.

[49] A. Mimidis, C. Caba, J. Soler, Dynamic aggregation of trac ows in SDN: applied to backhaul networks, in:

Proc. 2016 IEEE NetSoft

Conference and Workshops (NetSoft), 2016, pp. 136140. [50] J. Zhou, M. Tewari, M. Zhu, A. Kabbani, L. Poutievski, A. Singh, A. Vahdat, WCMP: Weighted cost multipathing for improved fairness Proc. Ninth European Conference on Computer

re-

in data centers, in:

Systems, EuroSys'14, ACM, New York, NY, USA, 2014, pp. 5:15:14. [51] N. Kang, M. Ghobadi, J. Reumann, A. Shraer, J. Rexford, Ecient trac splitting on commodity switches, in: Proc. 11

th

ACM Conference

urn al P

on Emerging Networking Experiments and Technologies, CoNEXT '15, ACM, New York, NY, USA, 2015, pp. 6:16:13. [52] S. Jain,

A. Kumar,

S. Mandal,

J. Ong,

L. Poutievski,

A. Singh,

S. Venkata, J. Wanderer, J. Zhou, M. Zhu, J. Zolla, U. Hölzle, S. Stuart, A. Vahdat, B4: experience with a globally-deployed software dened WAN, SIGCOMM Comput. Commun. Rev. 43 (4) (2013) 314. [53] F. Paolucci, F. Cugini, A. Giorgetti, N. Sambo, P. Castoldi, A survey on the path computation element (PCE) architecture, IEEE Communications Surveys Tutorials 15 (4) (2013) 18191841. [54] E. Crabbe, I. Minei, J. Medved, R. Varga, Path computation element communication protocol (PCEP) extensions for stateful PCE, IETF RFC 8231 (September 2017).

Jo

[55] H. Gredler, J. Medved, S. Previdi, A. Farrel, S. Ray, North-bound distribution of link-state and trac engineering (TE) information using BGP, IETF RFC 7752 (March 2016).

Journal Pre-proof

[56] A. R. Curtis, W. Kim, P. Yalagandula, Mahout: Low-overhead datacenter trac management using end-host-based elephant detection, in:

of

Proc. 2011 IEEE INFOCOM, 2011, pp. 16291637. [57] M. Al-Fares, S. Radhakrishnan, B. Raghavan, N. Huang, A. Vahdat, Hedera: dynamic ow scheduling for data center networks, in: Proc. 7

th

USENIX Conference on Networked Systems Design and Implemen-

pro

tation, NSDI'10, USENIX Association, Berkeley, CA, USA, 2010. [58] Y.-C. Wang, Y.-D. Lin, G.-Y. Chang, SDN-based dynamic multipath forwarding for inter-data center networking, in: Proc. 2017 IEEE International Symposium on Local and Metropolitan Area Networks LANMAN, 2017, pp. 13.

re-

[59] W. Lin, Y. Niu, X. Zhang, L. Wei, C. Zhang, Using path label routing in wide area software-dened networks with OpenFlow, in: Proc. 2016 International Conference on Networking and Network Applications NaNA,

Jo

urn al P

2016, pp. 149154.

Journal Pre-proof Conflict of Interest

Jo

urn al P

re-

pro

of

There is no conflict of interest in publishing this paper. Any people or organisations can have any claims or influence on authors work.