Journal Pre-proof

MPLS-based reduction of flow table entries in SDN switches supporting multipath transmission
Zbigniew Duliński, Grzegorz Rzym, Piotr Chołda

PII: S0140-3664(18)30686-8
DOI: https://doi.org/10.1016/j.comcom.2019.12.052
Reference: COMCOM 6109
To appear in: Computer Communications
Received date: 6 August 2018
Revised date: 23 September 2019
Accepted date: 27 December 2019

Please cite this article as: Z. Duliński, G. Rzym and P. Chołda, MPLS-based reduction of flow table entries in SDN switches supporting multipath transmission, Computer Communications (2020), doi: https://doi.org/10.1016/j.comcom.2019.12.052.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.
MPLS-based reduction of flow table entries in SDN switches supporting multipath transmission

Zbigniew Duliński 1,*, Grzegorz Rzym 2, Piotr Chołda 2

1 Jagiellonian University, Faculty of Physics, Astronomy, and Applied Computer Science, ul. Łojasiewicza 11, 30-348 Kraków, Poland
2 AGH University of Science and Technology, Department of Telecommunications, Al. Mickiewicza 30, 30-059 Kraków, Poland

Abstract
In this paper, the problem of resource utilisation improvement in software-defined networking (SDN) is addressed. The need for resource optimisation is understood here to be twofold. First, bandwidth in links should be saved when congestion appears. Second, the internal resources represented by table entries of SDN switches should be minimised to ensure fast processing and flexibility. Here, both types of resources are optimised with a new mechanism for flow aggregation. The mechanism is accompanied by a multipath transmission supporting reaction when network conditions change. The proposed mechanism uses classical MPLS labelling, which enables flow aggregation together with multipath transmission; therefore, it neither involves the definition of new protocols nor requires the application of legacy signalling protocols. Only simple yet powerful modifications of the existing solutions, assured by the flexibility of the OpenFlow protocol, are necessary. Furthermore, the proposed solution can be incrementally deployed in legacy networks. The aggregation results in a low number of flow entries in core switches in comparison to legacy OpenFlow operation. The simulations show that the number of flow entries in core switches can be reduced by as much as 93%, while the overall network traffic is increased by around 171%. This type of scalability improvement of flow processing is obtained as a result of the introduction of a centrally managed MPLS label distribution performed by an SDN controller. Moreover, the proposed method of multipath transmission improves network resource utilisation. Additionally, independently of the traffic pattern, the proposed approach significantly reduces the communication overhead between the controller and the switches.

* Corresponding author, e-mail: [email protected], phone: +48 126644871, address: ul. Łojasiewicza 11, 30-348 Kraków, Poland.
Keywords: Flow aggregation; Multipath transmission; Multi-protocol label switching (MPLS); Software-defined networking (SDN)
1. Introduction
In legacy IP networks, packets traverse a single path between a pair of source and destination nodes. The path is established by a routing protocol as the best one on the basis of the link metrics (weights). However, when congestion appears in some links on this path, a new path omitting over-utilised links should be found. The easiest way of finding a new path consists in increasing the weights of the congested links; the path is then recalculated [1]. However, even a single modification of a metric can be disruptive to a whole network due to the following scalability issues: (a) the update of routing tables takes a considerable amount of time; (b) it is likely to cause reordering or packet dropping, thus decreasing the performance of TCP. Obviously, the more changes that are introduced, the larger the chaos that is observed.

It is a well-known fact that in almost any network there is at least one concurrent path that is an alternative to the one used [2, 3, 4]. This fact enables the network control system to counteract the abovementioned congestion problem with so-called multipath transmission. Multipath transmission can be introduced in different network layers, for example, in the physical layer (WDM, SONET/SDH), in the link layer (TRILL, MPLS, SPB), in the network layer (ECMP, EIGRP), in the transport layer (MPTCP, CMT), or in the application layer (MPRTP, MRTP) [2, 3, 4]. Apart from enabling the use of additional paths, this type of transmission assumes that the routing is semi-independent of current link weights. Nowadays, the most popular solution for establishing such paths is based upon multi-protocol label switching (MPLS). Flexible traffic engineering [5] is then enabled. However, MPLS paths are established on a long time scale (they are highly static) and with the purpose of serving very large amounts of data. Therefore, despite the fact that these paths can be periodically re-optimised, such a process again results in the disruption of existing traffic and typically does not take into account the current utilisation of links.

Fortunately, the introduction of flow-based forwarding in Software-Defined Networking (SDN) [6, 7] provides the possibility of a disruption-free transmission of packets using paths that can be changed with a fine granularity of time or data volumes. Unfortunately, the application of flow-based switches supporting fine-grained flow-level control results in scalability problems due to an unmanageable increase in the sizes of flow tables. Some techniques, such as multi-path TCP (MPTCP), enable better resource utilisation, but simultaneously generate more flows in the network [2]. This fact hinders flow-based forwarding due to storage limitations and lookup delays. Such a problem has already been observed with the introduction of the OpenFlow protocol [8, 9, 10, 11, 12, 13]. The issue has been addressed and, notably, ternary content addressable memory (TCAM) is used for storing flow entries [14, 15, 16]. Moreover, a centralised management approach can create significant signalling overhead, especially when the reactive approach to flow installation is used. Extensive communication between an SDN controller and switches is then required [6, 9, 10]. Early benchmarks have shown that controllers are unable to handle a huge number of requests [17, 18]. Recent research [11, 12] shows that this issue still bothers the scientific community. The problem is burdensome in data centre (DC) environments, where enormous numbers of flows are present [19, 20]. Another option, the proactive way of flow installation, can be advantageous, but such a solution trades off precision of traffic management. Another scalability problem is related to flow installation time in hardware switches: according to [21], a single flow installation can take up to 100 ms for a TCAM of 4,000 entries.
In this paper, we apply MPLS labelling to flow aggregation in OpenFlow-managed networks. In this way, we show that it is possible to improve the network behaviour under congestion (due to the application of the multipath approach) while simultaneously reducing the size of flow tables in the core of an SDN network and minimising signalling requirements. Accordingly, each source node may concurrently transmit data via multiple paths, and new paths are added on demand to avoid congested links on already used paths. These paths do not follow the idea of equal cost multipath (ECMP) routing. In fact, the mechanism provides new flows with new paths that are not congested (if possible), while the existing flows use previously established paths. In this way, the traffic is not disrupted; therefore, we aim to improve resource utilisation. The proposed mechanism is based on tagging flows with MPLS labels. Therein, the forwarding of packets is performed on the basis of labels. However, contrary to the classical MPLS mode, in our proposal the distribution of labels is not supported by signalling protocols, such as the label distribution protocol (LDP) or the resource reservation protocol (RSVP). Instead, we use OpenFlow only. Therefore, we neither replace nor improve the well-known BGP/VPN [22] or similar MPLS-based solutions. Thus, we can summarise the contribution as follows:

• An algorithm for switch self flow installation, enabling a reduction of the signalling overhead related to Packet_IN processing. Due to this property, it is possible to completely eliminate Packet_IN messages while keeping all benefits related to the reactive flow treatment.

• Multipath transmission based on a fast reaction to network condition changes, enabling better network resource utilisation. Thus, we ensure that congestion appearing on a link when other links are under-utilised is solved by on-demand and automated path recalculation. Due to this property, we are able to increase the overall throughput of the network.

• MPLS-based flow aggregation, enabling a decrease of flow tables in switches centrally managed by an SDN controller. The forwarding decision on traffic flows destined to a selected node is based on a single label in the whole network. Due to this property, the number of served flow entries is drastically diminished.

• No requirement for the involvement of new protocols, since the proposed mechanism explores only the existing off-the-shelf solutions, specifically, MPLS, OpenFlow, basic routing with OSPF/IS-IS, and link discovery with LLDP. Due to this property, the implementation of the proposal is very easy.

Concerning the technological readiness of a network, the new mechanism can be deployed with the coexistence of legacy MPLS switches and OpenFlow nodes. OpenFlow switches are required at the edge of a network, while legacy MPLS or OpenFlow switches can be used as core nodes. MPLS labels are centrally distributed by an SDN controller using NetConf for legacy MPLS switches and OpenFlow for the others. In fact, we do not have to base our mechanism on MPLS, because it only requires some form of tagging to provide the unique marking of destination nodes in a network. Instead of tagging with MPLS labels, one can also use methods characteristic of virtual local area networks (VLANs). Nevertheless, in the case of VLAN tagging, the scalability is lower than in the case of MPLS [23]. If one expects that the standard VLAN space is not sufficient, then one can use the Q-in-Q or even the PBB approach. In the case of MPLS, label stacking can be used to increase the label space.

The paper is organised as follows: Section 2 presents the justification for the proposed approach; Section 3 introduces the mechanism for a centralised path set-up optimisation supporting IP flows in SDN networks, along with the related architecture; Section 4 describes the evaluation details, including the tools used and the obtained performance results; the mechanism scalability is also discussed there; Section 5 presents a review of the related work, and we use this background to emphasise a comparison of our approach with others presented before; Section 6 summarises the paper with concise conclusions.
2. Problem statement and motivation behind the proposed mechanism
In this section, we briefly describe the drivers for the proposed mechanism. The rationale relates to three important problems which appear in networks operating with the flow-forwarding scheme, namely: (a) scalability of flow tables at switches in the core of the network, (b) link congestion, and (c) flow installation overhead. In our mechanism, these problems are solved by the application of flow aggregation, multipath transmission, and an improved flow installation procedure, respectively.
2.1. Flow aggregation
Flow-based forwarding supports effective traffic distinction and management. However, this approach suffers from the necessity to serve an enormous number of flow entries that need to be maintained by each of the flow-forwarding nodes. It is a well-known fact that TCAM is the most suitable memory technology for flow storing and forwarding [6, 12]. However, it is very expensive, consumes a lot of energy, and can store only a limited number of entries [14, 15, 16, 24, 25]. This last drawback is the most important from the viewpoint of TCAM applications. The number of entries which has to be served by a switch strongly depends on the level of the network aggregation hierarchy.

As an example of the possible gain from the usage of our aggregation method, we present a simple network topology in Fig. 1. The whole discussion related to the benefits of our mechanism in the context of this particular topology is valid for any topology in which traffic aggregation is enforced by consecutive nodes. In our experiments presented in Section 4, we use a few different topologies in order to confirm the generality of this approach. Note that we distinguish between two types of switches performing flow forwarding. To be compatible with traditional MPLS terminology, we divide these network nodes into: (a) provider edge (PE) nodes; and (b) provider (P) core nodes. Now, let us suppose that the whole traffic from the ingress (domain entrance) nodes, i.e., PE-N1 to PE-NJ and PE-M1 to PE-MK, is directed to the egress (domain exit) nodes PE-D1 and PE-D2. The ingress nodes represent an access layer. The number of active flow entries in each PE is depicted in red; for example, PE-N1 stores $N_1^{D1} + N_1^{D2}$ flow entries, where the indices D1 and D2 represent which egress node the flows are directed to. At the first core layer, we observe a significant increase in the number of flows coming from the access layer. For instance, node P1 maintains as many as $\sum_{j=1}^{J} (N_j^{D1} + N_j^{D2})$ flows. In the second core layer, even more flow entries have to be served, namely $\sum_{j=1}^{J} (N_j^{D1} + N_j^{D2}) + \sum_{k=1}^{K} (M_k^{D1} + M_k^{D2})$.

One of the main aims of the proposed mechanism is to reduce the number of flow entries in core switches (P nodes). Let us consider only the flows directed to networks accessible via PE-D1. Since all flows from the access layer are directed to the single egress node (PE-D1), they can be represented by a single global label. If we consider two destinations, namely PE-D1 and PE-D2, we have to use two global labels. In our mechanism, ingress PE nodes are responsible for tagging each flow directed to a given destination egress PE node with a globally unique label representing that particular egress node. In such a case, the number of flow entries is limited to exactly two for every P node in each core layer (depicted in blue). The number of flow entries in the access layer remains unchanged.
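To make the gain concrete, the following minimal sketch (an illustration with hypothetical flow counts, not data from the evaluation) compares the number of entries a second-layer core node would hold under per-flow forwarding with the two entries required by per-egress-PE labels.

```python
# Illustration of the entry counts in Section 2.1 (hypothetical numbers).
# flows_N[j] = (N_j^D1, N_j^D2); flows_M[k] = (M_k^D1, M_k^D2)
flows_N = [(70, 50), (30, 50), (45, 50)]
flows_M = [(25, 35), (10, 30)]

# Legacy flow forwarding: one entry per individual flow, i.e.
# sum_j (N_j^D1 + N_j^D2) + sum_k (M_k^D1 + M_k^D2)
per_flow_entries = sum(a + b for a, b in flows_N) + sum(a + b for a, b in flows_M)

# Proposed aggregation: one entry per egress PE label (here PE-D1 and PE-D2).
aggregated_entries = 2

print(per_flow_entries, aggregated_entries)  # 395 vs 2
```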
2.2. Multipath transmission

Since network traffic volume is continuously growing, one can expect that at some point any network may experience congestion. To avoid this problem, one can use alternative paths leading to the same destination in the network. It is typical for mesh networks to contain more than one path between each source and destination. Such concurrent paths can be either completely or partially disjoint. The use of concurrent paths can mitigate the problem of congested links. However, most of the legacy routing protocols do not implement multipath transmission. If they do support multipath transmission, it is only based on equal cost multipath (ECMP) [26]. When MPLS is used, one can consider ECMP or load balancing over unequal cost paths. From the point of view of a single application-layer transmission, it is important to use the same path for a particular flow; a recommendation exists on how to avoid the potential mis-ordering of packets [27]. The notable exception amongst routing protocols is Cisco's EIGRP, which can concurrently use paths of different costs [28].

Figure 1: The assumed network structure and the related numbers of flows

Our mechanism aims to use multipath transmission together with flow aggregation in order to avoid congestion in flow-based forwarding networks. Our proposal exploits many paths between the same source-destination pair, but these paths do not need to be of equal cost. Moreover, new paths are activated on demand only; this occurs when congestion appears, and they do not tear down existing flows. Our mechanism searches the whole network to find the best new path avoiding congested links. The proposed solution extensively uses an aggregation procedure based on tagging with labels.

The concept of our multipath-based approach is described here with a simple exemplary network presented in Fig. 2. Sources of traffic are connected to the ingress node PE_S, while destination networks are attached to the egress node PE_D. One can observe that a few paths exist connecting the PE_S and PE_D nodes. Let us suppose that a routing procedure has chosen the path going through core node P11. In line with our aggregation procedure, label L1 has been distributed, enabling transfer along the links marked with this label. All flows going from PE_S to PE_D traverse this path and are stamped with MPLS label L1. All switches on the path forward packets according to this label. At some moment, congestion appears on the link marked with a red cross. Our mechanism finds an alternative path (here: PE_S-P21-P22-PE_D) and distributes a new label L2 which will be used for packet forwarding. When all switches on this new path get this new label, the ingress node PE_S starts to mark new flows with label L2. The existing flows are still marked with label L1 and they continue to traverse the old path, namely PE_S-P11-PE_D. The number of these flows is denoted by N1 and does not increase. Label L2 is used by the number of flows denoted as N2, and this number may increase since new flows arrive in the network. When the next congestion events appear in some links (in our example, the next congestion event is indicated with a blue cross, and the subsequent congestion event is indicated with a purple cross), new paths are found and new labels are distributed (L3 and L4, respectively). A similar scenario takes place after all congestion events occur: existing flows use labels L1 and L2, while new flows use L3, and then label L4 (after the third congestion event). The numbers of flows N1, N2, N3 related to the existing flows (using the L1, L2 and L3 labels, respectively) tend to decrease. The number N4 of new flows may vary, but is likely to increase.

Thanks to the use of the flow-forwarding paradigm, it is possible to distinguish existing flows from new ones. Only a single active label is used for tagging newly arriving flows directed to the same destination. The old labels are used for forwarding all flows that existed before the respective congestion events appeared. The tagging mechanism enables keeping existing flows on previously selected paths when the routing process chooses new paths between the given source-destination nodes. This mechanism also prevents the increase of the number of flows from a particular source going via a congested link.
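The controller-side bookkeeping implied by this behaviour can be expressed with the following minimal sketch (an illustration only, not the authors' implementation): for every destination PE only the most recently allocated label is used for new flows, while the older labels remain valid for the flows already pinned to them.

```python
# Sketch: per-destination label history kept by the controller.
# Only the last (active) label tags new flows; old labels keep forwarding
# the flows that were established before each congestion event.
labels = {"PE_D": ["L1"]}            # label history per egress PE

def on_congestion(dest, new_label):
    labels[dest].append(new_label)   # old labels stay installed in the nodes

def label_for_new_flow(dest):
    return labels[dest][-1]          # active label only

on_congestion("PE_D", "L2")
on_congestion("PE_D", "L3")
print(label_for_new_flow("PE_D"))    # 'L3'; flows tagged L1/L2 are untouched
```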
Figure 2: The example supporting the explanation of the concept of multipath transmission adopted in this paper

2.3. Flow installation
In SDN networks based on the OpenFlow protocol, two methods of flow installation are available: reactive and proactive. In the former, each new flow reaching a switch generates signalling between the controller and an ingress switch. Such a flow of signalling messages results in on-demand rule installation. In the case of the proactive mode, forwarding rules are installed before flows arrive at the switch. A combination of these two approaches is also possible. The reactive flow insertion allows flexible and fine-grained network management. However, such an approach has a few serious limitations. First of all, every packet that does not have a match in a flow table of a switch has to be forwarded to an SDN controller. The controller then has to define actions for this packet and install in the switch a new rule for the next packets belonging to the particular flow. This situation may lead to the overloading of a controller with Packet_IN signalling messages, especially in networks where a huge number of flows and switches is present, for example, in DCs [19, 20]. Furthermore, other limitations have to be taken into account: a limited number of requests per time unit can be handled by a single controller [17, 18, 11, 12], thus decreasing network scalability. The reactive approach introduces an additional delay for the first packet of every new flow when it is forwarded to the controller. Moreover, it is likely that packets belonging to a single flow arrive with such a high frequency that the installation of a forwarding rule in a switch takes place only after many packets from the same flow have arrived. This results in triggering many unnecessary Packet_IN messages, causing further overloading of the controller. Such behaviour can be exploited to attack an SDN controller.

Proactive flow insertion can easily mitigate these problems. It requires advance knowledge about all traffic matches that could arrive at a switch. However, flexibility and precision of traffic control are lost in this case. Proactive rules are usually more general than those defined in a reactive way. This results from a lack of knowledge regarding all traffic matches.

Existing flow-based switches suffer from delays related to flow entry insertion into TCAM. This problem is mainly related to a weak management CPU and a slow communication channel between the management CPU and the switching chipset [29, 30, 31]. These delays are especially cumbersome when networks operate in a reactive way [18].

In the proposed mechanism, we limit the signalling overhead, yet we still assume the application of fine-grained flow forwarding. To install a new flow, we do not need to communicate with the SDN controller. In this way, we exclude Packet_IN messages. A controller proactively installs only rules for flow aggregates in a dedicated flow table in a PE. Based on these patterns, the switch itself installs fine-grained flows without the necessity to communicate with the controller. The SDN controller performs maintenance and modification of aggregation rules only. The abovementioned modifications do not take place often; they occur only when congestion appears and new paths are required. The introduction of these rules is feasible due to the definition of a dedicated flow table. The detailed explanation of the aggregation rules and switch behaviour is given in Section 3.
3. Detailed description of the proposed mechanism

Concerning the previously given classification of the switches, our mechanism assumes that:

• provider edge (PE) nodes map flows to MPLS labels,

• provider (P) core nodes only forward packets according to MPLS labels.
We define a source-client network (SCN) and a destination-client network (DCN) as networks where sources and destinations of traffic are located, respectively. SCNs and DCNs are accessible only via PE nodes.

To effectively map flows to the labels, the SDN controller builds and maintains a map of the physical topology and stores it in the form of a link state database (LSDB). The LSDB is modified when congestion starts to appear, i.e., when link metrics are changed. The SDN controller calculates the best path only for pairs of PE nodes. The reverse Dijkstra algorithm (described in Section 3.4) is used to perform this task. For each PE, the controller allocates a global MPLS label representing a particular PE node. The labels, accompanied by information about the proper output interfaces (obtained by executing the shortest-path algorithm), are then populated to each node. When a packet belonging to a particular flow reaches an ingress PE node, it is tagged with the proper MPLS label and is subsequently forwarded to a pre-selected interface. This label indicates the egress PE node via which a particular DCN is reachable. Therefore, each node on the path will use a given label to reach the particular related PE node. Moreover, the same label will be used by any node in the whole network to reach the specified egress PE node. Such an approach results in flow aggregation and a significant reduction of flow table entries in P nodes.

The proposed mechanism supports a fast reaction to changes in traffic conditions. The SDN controller periodically collects information related to the utilisation of links. OpenFlow port statistics requests are proposed to be used; however, other protocols for retrieving counter data are discussed in Section 3.6. There are two predefined thresholds, set by the administrator, acting as congestion triggers. If the throughput of any link exceeds one of these thresholds, this indicates that congestion has appeared on this particular link. Additional details on these thresholds are discussed in Section 3.2. When any congestion in the network is recognised, the controller increases the metrics of the over-utilised links. The reverse Dijkstra algorithm is then recalculated using the modified metrics, and a new label for each PE is allocated. The controller populates all nodes with the new labels and the related output interfaces. Therefore, only new flows use the new labels (i.e., the new paths). All the existing (old) flows are forwarded using the previously allocated labels (i.e., the previously calculated paths). Such an approach stabilises the flow forwarding process and introduces the multipath transmission.

Figure 3: Flowchart of the proposed mechanism

The proposed management system running on the SDN controller (see Fig. 3) is logically divided into the two components that are responsible for defining how the nodes process the data:
• measurement component, responsible for gathering link utilisation and modification of metrics,

• label allocator component, calculating paths and distributing MPLS labels.
Below, we first describe the way in which the packets and flows are processed in the various types of network nodes, and then describe the operation of both of the components introduced above.
3.1. Flow processing in PE and P nodes
Each PE node implements flow-based forwarding. The way the flows are defined is neutral from the viewpoint of our mechanism (in the context of the flow aggregation procedure). For instance, a traditional 5-tuple (source and destination addresses/ports with the L4 protocol) can be used.

In accordance with the OpenFlow specification, we propose to use two flow tables in each PE node. The detailed flow table (DFT) stores detailed information on active flows, i.e., 5-tuples. The coarse flow table (CFT) contains the mapping between DCNs and pairs (output MPLS label, output interface). The match fields for rules in the DFT are different from the matches in the CFT. One can see this difference in Fig. 4, where the term 'Flow' is used in the DFT and the term 'Net' is used in the CFT. As explained later, the entries in the CFT exist permanently, i.e., there is always a match for a particular network, while an action list depends on the currently used path and can be modified. The existence of a particular flow in the DFT depends on its lifetime and a flow idle timeout. Thus, when a packet reaches a PE node, it is processed following the pipeline shown in Fig. 4. We consider the following two cases.

1. If a packet (P1 in the figure) matches an existing flow in the DFT, then this packet is processed according to the list of actions present for such an entry. This means that the packet is tagged with a selected MPLS label and forwarded to the indicated output interface.

2. If a packet (P2 in the figure) does not match the DFT, then it is redirected to the CFT. The CFT contains entries composed of a DCN and the list of the following actions: push a pre-defined MPLS label and direct the packet to a pre-selected output interface. Thus, when the match is found, the specified actions are performed on the packet and a detailed flow entry is created in the DFT. The entry is based on information gathered from the packet's header fields. Therefore, for the new flow defined on the basis of this header, the entry action list is copied from the CFT. The idle timeout of this entry is set to a finite pre-defined value. The issue of timeout setting and usage is explained in detail in Section 3.5.

If TCAM is used, a flow table lookup is performed in a single clock cycle. By contrast, when both the DFT and the CFT are searched, two clock cycles are needed. Of course, if the match is found in the DFT, then only one lookup is needed.

In the legacy OpenFlow protocol, only an SDN controller may insert flow entries into flow tables. As previously mentioned, such a procedure may lead to a storm of Packet_IN messages received by a controller. When we consider huge networks carrying millions of flows, the reactive flow insertion may lead to overload of the controller. Therefore, in our mechanism we improve the standard operation of the OpenFlow protocol and, as a result, we reduce the number of messages exchanged between the controller and the switches. The CFT contains general rules indicating how flows should be processed.
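The pipeline of Fig. 4 can be summarised with the following sketch (a simplified model of the PE lookup logic, not the switch code): a miss in the DFT falls through to the CFT, whose matching entry both handles the packet and is copied into the DFT by the switch itself, so no Packet_IN is generated.

```python
# Simplified model of the PE packet pipeline (DFT -> CFT -> self-installation).
dft = {}                                   # exact 5-tuple -> (label, out_port)
cft = {"10.2.0.0/16": ("L7", 2)}           # destination network -> (label, out_port)

def lookup_cft(dst_ip):
    # placeholder longest-prefix match; a real switch does this in TCAM
    return cft.get("10.2.0.0/16")

def process(packet):
    key = (packet["src"], packet["dst"], packet["proto"],
           packet["sport"], packet["dport"])
    if key in dft:                          # case 1: detailed entry exists
        return dft[key]
    action = lookup_cft(packet["dst"])      # case 2: fall through to the CFT
    if action is not None:
        dft[key] = action                   # Insert Flow: the switch installs
        return action                       # the fine-grained entry on its own
    return None                             # optional Table Miss -> controller
```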
Figure 4: Packet processing pipeline in an edge (PE) node

Such an approach results in the proactive installation of very general rules. These rules are to some extent persistent and are updated by the controller only when congestion appears; this does not happen as frequently as detailed flow installation. The number of CFT entries is related to the number of DCNs. Since detailed flow installation happens very often, insertion of a particular granular flow entry into the DFT is made by a PE switch on its own (as presented in Fig. 4). We should note that, by itself, Open vSwitch (OVS) inserts flows into the Exact Match Cache and the Tuple Space Search cache [32]. Such an OVS procedure improves the performance of packet processing because packets are processed in caches (in a kernel or DPDK module [33]), avoiding slow user-space processing where the OpenFlow tables are maintained. Although the OVS approach increases the number of entries, it speeds up switch operation. A similar procedure can be applied in order to implement DFT self-insertion. Work on a DFT implementation in OVS is now in progress; our preliminary OVS modification needs only six additional lines of code. Additionally, the authors of DevoFlow proposed a mechanism enabling switch self flow insertion [14]. Furthermore, they also show by simulations that such an approach is a valuable concept. Our proposed improvement removes the need to use Packet_IN messages, but still keeps all benefits related to the reactive treatment of flows.

For P nodes, only a single flow table is required. When a packet reaches such a node, the packet is matched on the basis of a label only. The packet is then sent out to the proper output interface with exactly the same label. If a legacy MPLS router is used as a P node, such a node only performs the ordinary label swapping operation. Legacy MPLS routers allow static label switched path configuration (similar to static routing). This means that, due to our mechanism, the SDN controller is able to add (or remove) static MPLS entries in a router configuration via SNMP or NetConf.¹

The whole knowledge about all the MPLS labels used in the network is possessed by an SDN controller. It knows exactly which label should be installed in a particular P (core) node to ensure that a particular egress PE node can be reached. According to our approach, only P nodes can be served by legacy MPLS routers (with static MPLS entries). With the help of NetConf, the controller can send a configuration to the router in which a static LSP entry will be added. Such an entry defines which input label should be swapped to another output label and indicates an output interface. To apply our proposal using legacy routers functioning as P nodes, the input label has to be swapped to the same output label. The label and the output interface are assigned by the controller. Such a configuration can be enforced remotely using NetConf. Application of a static LSP configuration is a standard procedure for manual distribution of labels (that is, without the use of LDP or RSVP). This procedure is not disruptive to packet forwarding because reconfiguration is done only when congestion appears. When this happens, the controller configures a new path. During this procedure, packets are still forwarded via the existing paths. When all P nodes, except for an ingress PE, acknowledge the configuration change, the controller installs the proper entry in the ingress PE. The decision on how frequently a new configuration can be applied depends on the number of requested changes in the configuration. In fact, the exact value of this frequency depends on the vendor and a specific device model.
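For completeness, a P node's behaviour reduces to the following minimal sketch (an illustration only): a single table keyed by the incoming label, with the packet forwarded unchanged, which is exactly what a static LSP entry on a legacy router provides.

```python
# Sketch of a P-node forwarding table: match on the MPLS label only and
# swap it to the same label (hypothetical label values and interface names).
label_table = {16001: "eth1", 16002: "eth3"}

def forward(label):
    out_port = label_table.get(label)
    return (label, out_port)        # the label itself is left unchanged
```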
3.2. Measurement and label allocator components

The measurement component (MC) periodically retrieves data from link counters. The collected information is used to calculate the bandwidth utilisation of links. The controller requests interface counter reports from each node. Each node replies with a single statistics response containing information about all its counters. In the basic version of the proposed mechanism, we apply OpenFlow statistics messages. If we have N nodes in the network, there are 2N messages in each polling interval. Taking into account the fact that each measurement-driven mechanism requires statistics collection, such an approach does not involve a huge signalling overhead. A single OpenFlow 1.3 MultipartReq (statistics request) message for port statistics generates a total of 94 bytes of network traffic. The MultipartRes (statistics response) message containing counters for n ports involves 86 + 112n bytes of signalling. In our simulations, we are able to retrieve the data from nodes periodically with a frequency of one second, but this information can be requested less frequently. Our mechanism does not require Packet_IN, Packet_Out and FlowMod triplet maintenance, which forms the largest contributor to signalling overhead in OpenFlow networks [9].

¹ An exemplary configuration of static label switched paths for Juniper routers can be found at: https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/mpls-qfx-series-static-lsp-cli.html. These CLI commands can be configured via SNMP/NetConf.
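Using the message sizes quoted above, the per-poll signalling volume can be estimated as in the following sketch (the node and port counts are illustrative assumptions, not values from the evaluation):

```python
# Polling overhead per interval, from the OpenFlow 1.3 message sizes above:
# one 94-byte MultipartReq and one (86 + 112*n)-byte MultipartRes per node.
def polling_overhead(num_nodes, ports_per_node):
    request = 94
    response = 86 + 112 * ports_per_node
    return num_nodes * (request + response)     # bytes per polling interval

print(polling_overhead(39, 4))   # e.g. a 39-node network with 4 ports each: 24492 bytes
```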
Another method for obtaining port counter values may be based on SNMP. However, the leading vendors recommend setting the polling interval in the range of 30-60 seconds.² If a fast reaction to traffic changes is needed, 30 seconds can be an unacceptably long interval. Other off-the-shelf protocols enabling statistics readouts are discussed in Section 3.6.

There are two utilisation thresholds configured: warning (WarnTh) and congestion (CongTh). The latter is set to a value larger than the former. Each time the traffic throughput (Tput) exceeds one of the defined thresholds, the MC changes the link metric in the LSDB stored by the label allocator component (LAC). We propose the following three values of the configurable link metrics related to the threshold values: (1) NORM: normal metric (a default IGP metric) for a link utilisation not greater than the WarnTh threshold; (2) WARN: warning metric for a link utilisation between the WarnTh and CongTh thresholds; (3) CONG: congestion metric for a link utilisation exceeding the CongTh threshold. In order to prevent oscillation of the link utilisation around the thresholds, we applied hysteresis. Generally, for each link, an operator defines three link weights (metrics) in increasing order: NORM, WARN, and CONG. If a particular link is not congested, the lowest value (NORM) is used as the link weight for the Dijkstra algorithm. When traffic increases and tends towards congestion, exceeding the WarnTh threshold, the second value of the weights (WARN) is assigned to this link and the Dijkstra algorithm is recalculated. The third value of the link weights (CONG) is used when the CongTh threshold is exceeded, and again the Dijkstra algorithm is recalculated.

² https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/snmp-best-practices-device-end-optimizing.html

The LAC builds and maintains the LSDB. Each time the MC changes any link metric, the recalculation of the shortest paths is triggered for each of the PE nodes treated as a root. By the shortest path we are referring to the path obtained by the reverse Dijkstra algorithm run for the link weights set after the change. The number of such recalculations can be limited to some PE nodes only. We use the so-called reverse Dijkstra algorithm based on the Dijkstra algorithm; below, in Section 3.4, we describe this process in detail. After recalculation, a new label set is allocated. The maximum size of this set equals the number of PE nodes. After consecutive recalculations of the reverse Dijkstra algorithm, only one unique label represents a destination PE for new flows. In this way, between a selected source-destination PE pair, the newly recognised flows are redirected to the new paths, while the existing flows still traverse the old paths; they are forwarded with the previously allocated labels.

As presented in Section 3.1, for PE nodes the CFT is updated and filled in by the SDN controller. The CFT contains entries composed of the address of a DCN and a list of the following actions: push an MPLS label and send the packet to an output interface. Each entry in the CFT has an infinite timeout. Each time recalculation of the reverse Dijkstra algorithm is triggered, this results in a CFT update: the old list of actions for each DCN is replaced with a new label and a new output interface (based on the structure of the new shortest-path tree).

After the recalculation, new label entries are also proactively installed in the flow tables of the P nodes. A single entry of this kind contains a match based only on a new input MPLS label and the forwarding action; the output interface is based on the currently calculated shortest-path tree. The idle timeout is set to infinity.
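The measurement logic described above can be condensed into the following sketch (a simplified rendering of Fig. 3; the hysteresis around the thresholds is omitted for brevity, and the threshold values are examples taken from the evaluated ranges):

```python
# Simplified measurement-component step (cf. Fig. 3).
NORM, WARN, CONG = 1, 1000, 65535          # link weights used in the evaluation

def link_metric(tput, capacity, warn_th, cong_th):
    util = tput / capacity
    if util > cong_th:
        return CONG
    if util > warn_th:
        return WARN
    return NORM

def measurement_cycle(links, lsdb, warn_th=0.4, cong_th=0.85):
    changed = False
    for link, (tput, capacity) in links.items():
        metric = link_metric(tput, capacity, warn_th, cong_th)
        if lsdb.get(link) != metric:
            lsdb[link] = metric
            changed = True
    return changed   # if True, the LAC reruns reverse Dijkstra for the affected PEs
```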
3.3. Possible extensions of the proposed mechanism

The proposed mechanism does not limit the functionalities which are present in OpenFlow. All OpenFlow actions can still be performed. The only aspect which differentiates our solution from the standard OpenFlow behaviour is the addition of the Insert Flow action. This action is taken by a switch itself and results in a flow insertion into the DFT on the basis of an entry transferred from the CFT.

The match rule present in the CFT does not need to be based only on a destination network. It can be composed of any combination of fields and wildcards supported by OpenFlow. In Fig. 5, we depict a few exemplary match rules. Let us consider three packets arriving at the switch, i.e., P1, P2, P3. None of them match any entry in the DFT; they are thus redirected to the CFT. P1 and P2 are destined to the same network, but P1 also matches an extended rule with a destination Layer 4 port; therefore, it is sent to a different output port with a different label (Label1, Out1) than in the case of P2 (Label2, Out2). Such an approach allows serving distinct applications in a specific way. Another possibility is the use of the DSCP field to fulfil QoS requirements. Our proposal is to consider separate labels for different application ports that are reachable via a particular egress PE node. For example, let us consider two applications (e.g., an HTTP server and some VoIP server) which are accessible via the same egress PE node. These two applications may have different QoS requirements. Thus, a modified version of our mechanism may take into account different QoS requirements during path calculation. As a consequence, the same egress PE node may be reachable from all ingress PE nodes via different paths at the same time. In this way, some MPLS labels may be allocated for applications with higher QoS requirements, and some for less demanding traffic.

It is also possible to control traffic directly by an SDN controller. If there is a particular type of traffic that is expected to be managed by the SDN controller itself, the CFT should possess a Table Miss entry. This entry allows the generation of a Packet_IN message. After packet analysis, the controller installs an appropriate entry in the DFT.
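The extended CFT of Fig. 5 behaves like a priority-ordered rule list; the sketch below (an illustration with hypothetical port and DSCP values, following the priority numbers shown in the figure) shows how a more specific rule, e.g. one that also matches a destination L4 port or a DSCP value, wins over the plain per-network rule.

```python
# Coarse flow table with the priorities shown in Fig. 5
# (the more specific rules carry the winning priority value).
cft_rules = [
    {"prio": 1,   "net": "Net1", "dst_port": 80,   "action": ("Label1", "Out1")},
    {"prio": 100, "net": "Net1", "dst_port": None, "action": ("Label2", "Out2")},
    {"prio": 1,   "net": "Net2", "dscp": 46,       "action": ("Label3", "Out1")},
    {"prio": 100, "net": "Net2",                   "action": ("Label4", "Out3")},
]

def match_cft(pkt):
    candidates = [r for r in cft_rules
                  if r["net"] == pkt["net"]
                  and r.get("dst_port") in (None, pkt.get("dst_port"))
                  and r.get("dscp") in (None, pkt.get("dscp"))]
    return min(candidates, key=lambda r: r["prio"])["action"] if candidates else None

print(match_cft({"net": "Net1", "dst_port": 80}))   # ('Label1', 'Out1')
print(match_cft({"net": "Net1", "dst_port": 443}))  # ('Label2', 'Out2')
```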
3.4. Reverse Dijkstra algorithm: the recalculation algorithm

Whenever congestion appears, the calculation of new paths avoiding overloaded links is needed. For finding these new paths, a form of the Dijkstra algorithm is used. In the proposed mechanism, we do not need to perform path recalculation for all network nodes. To decide which PE nodes require recalculation, the mechanism starts with an investigation of the labels used by the packets transferred via the link at which a new overload has just been recognised. Each of these labels indicates a specific destination PE node, thereupon called an 'affected PE' node. To avoid the negative impact of congestion, new paths directed to the affected PEs should be found. Consequently, new labels have to be allocated.
Figure 5: The extension of a coarse flow table

For the non-affected PE nodes, path recalculation and label reallocation are not required. If the regular Dijkstra algorithm were used, every PE node (not only the affected PE nodes) would have to recalculate paths to each affected PE node. For example, let us consider a network with 100 PE nodes. If at some moment congestion appears at a single link where only one label, related to one PE node, is used, the regular Dijkstra algorithm would have to be performed by 99 ingress PEs (excluding the affected PE). By contrast, the reverse Dijkstra requires only one calculation, made from the perspective of the affected PE.

In the presented mechanism, we calculate paths from the perspective of each affected PE node treated as a root. However, the weights used in the shortest-path algorithm are related to the links directed in the opposite way (i.e., towards the root). Therefore, we call this procedure 'reverse Dijkstra'. In the case of the regular Dijkstra algorithm, we answer the question of how to reach all other nodes (leaves) from a root. Reverse Dijkstra answers the question of how to reach a root node from all other nodes. For a better explanation of how this procedure works, in Fig. 6 we present an example network topology together with the obtained reverse Dijkstra tree. The destination PE (a root for the reverse Dijkstra calculation) is coloured orange (node 1). The metric of each link for each direction is depicted in Fig. 6a. For example, if we consider the connection between nodes 1 and 2, metric 1 is used for traffic from node 1 to node 2, while metric 7 is used for traffic in the opposite direction. When we consider node 1 as a root, the regular Dijkstra algorithm uses metric 1, while the reverse Dijkstra uses metric 7. In Fig. 6b, the outcome of the whole reverse Dijkstra procedure is presented in the form of the tree with the used metrics. With blue arrows, we indicate the traffic direction. The MPLS label directed to the destination node 1 is distributed down the reverse Dijkstra tree. In the case of a regular Dijkstra algorithm, one would have to perform six Dijkstra calculations using nodes 2-7 as roots.
Figure 6: An illustration of the proposed reverse Dijkstra algorithm: (a) an example network topology with the related link weights; (b) the reverse Dijkstra tree built in the example network
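A compact way to express the reverse Dijkstra computation is to run a standard Dijkstra from the affected PE over the graph with every edge reversed, i.e., using the weight of the opposite link direction, as in the sketch below (an illustration only, not the simulator code).

```python
import heapq

def reverse_dijkstra(weights, root):
    """Distances *towards* root: standard Dijkstra over reversed edges.

    weights[(u, v)] is the metric of the directed link u -> v.
    """
    reversed_adj = {}
    for (u, v), w in weights.items():
        reversed_adj.setdefault(v, []).append((u, w))   # traverse v -> u at the cost of u -> v

    dist, heap = {root: 0}, [(0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue
        for nxt, w in reversed_adj.get(node, []):
            if d + w < dist.get(nxt, float("inf")):
                dist[nxt] = d + w
                heapq.heappush(heap, (d + w, nxt))
    return dist    # cost of reaching `root` from every other node

# Link (1, 2) has metric 1, link (2, 1) has metric 7, as in Fig. 6a:
print(reverse_dijkstra({(1, 2): 1, (2, 1): 7}, root=1))   # {1: 0, 2: 7}
```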
3.5. Flow garbage procedure
Flow tables require maintenance to remove unused entries. We propose the application of a standard OpenFlow procedure for flow entry removal from the DFTs in PE nodes; an idle timeout counter is used for this purpose. Some finite value is assigned for each flow present in the DFT, while rules placed in the CFT always have an infinite timeout. The CFT infinite timeout is designed deliberately because rules in the DFT are installed by the switch itself (on the basis of rules in the CFT). If a timeout value lower than infinity were used, the entry for a destination network could be removed; if traffic destined to that network then appeared and there were no DFT entry for this flow, such traffic would be dropped because of the lack of an appropriate CFT entry. On the other hand, if congestion appears, only a modification of the related CFT rule is applied, i.e., labels and output interfaces are updated. We want to note that the CFT is used for aggregation, while the DFT is used for serving particular flows.

A proper setting of the idle timeout strongly depends on the network traffic pattern. There are flows characterised by either short inter-packet intervals or long inter-packet intervals, and bursty flows with some level of periodicity. The authors of [34] show that different flows should be assigned different values of suitable timeouts. Their study shows that 76% of flows have packet inter-arrival times lower than 1 second. In our simulations, we use 3-second idle timeouts; this covers the 80% of flows which have packet inter-arrival times of less than 3 seconds [34]. We also checked the impact of lower values of the idle timeout on the flow table occupancy. The authors of [35] suggest using low values of the idle timeout, even lower than 1 second. A low value of the idle timeout decreases the number of flows in the DFTs of PE nodes, but it may cause some flows to be removed from the DFTs despite the fact that they are still active. Such a situation results in an unnecessary CFT lookup and flow reinstallation into the DFTs. However, this is not very costly in our mechanism because it does not involve the Packet_IN procedure. A switch modified according to our proposal reinstalls flows on its own, without the necessity to communicate with the SDN controller. When the flow is present in a DFT, only a single lookup consuming one clock cycle of TCAM is needed. Additionally, after the flow removal, a lookup is performed in two clock cycles. An excessive value of the idle timeout will not trigger the flow reinstallation procedure often, but it will increase the number of rules in the DFT.

For the P nodes, we propose a procedure aligned with the current functionality of OpenFlow. This procedure states that when an SDN controller calculates new paths and allocates a new label to a particular egress PE, the related entries are proactively installed with an infinite idle timeout into the forwarding tables of the P switches belonging to these paths. Simultaneously, in the P switches, the controller modifies the previous rules destined to this PE (identified by the previously allocated labels); in other words, the controller changes only their timeout timers from infinity to a finite value. Thus, all existing flows are forwarded without changes. When old flows end, their idle timeout counters are exceeded and the removal of such flows from the flow tables takes place. When all the flows related to a particular label expire in all P nodes, this label returns to the pool of available labels used by the SDN controller. The infinite value set for the idle timeout of flow entries in P switches is needed to sustain readiness to handle flows, even after a long absence of any related traffic.
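A minimal sketch of this garbage-collection idea (an illustration of the controller-side view, not the authors' implementation): when a destination obtains a new label, the entries for the superseded label merely switch from an infinite to a finite idle timeout, and the label itself is returned to the controller's pool once its last entry has expired everywhere.

```python
# Controller-side sketch of label retirement (Section 3.5).
label_pool = []                       # labels available for future allocation

def retire_label(label, p_node_entries, finite_timeout=3):
    """Old label: switch its entries from an infinite to a finite idle timeout."""
    for node, entries in p_node_entries.items():
        entries[label]["idle_timeout"] = finite_timeout   # existing flows keep flowing

def on_all_entries_expired(label):
    """Called once the last entry for `label` has timed out in every P node."""
    label_pool.append(label)          # the label becomes reusable
```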
3.6. Integration with existing off-the-shelf technologies
A full upgrade of network devices may require huge capital expenditure for an operator. An incremental approach can spread this task over time. In this subsection, we summarise the ideas of how to integrate our mechanism with existing off-the-shelf technologies.

In the proposed system, we can distinguish between two types of network nodes: PE and P. The former has to be upgraded, while the latter may be a legacy MPLS router. The PE nodes have to be modified OpenFlow switches working as proposed in Section 3.1.

Since P nodes forward traffic only according to MPLS labels in our mechanism, a network operator does not need to replace legacy routers if they support MPLS. We stress again that the labels used in our mechanism have a global meaning. The application of the centralised management offered by SDN controllers enables synchronisation of label distribution to all the P and PE nodes. An SDN controller distributes a unique global label related to a particular egress PE node. The P node performs only label swapping; the input label has to be swapped with the same output label. All legacy MPLS routers known to us allow configuration of static label switched paths (LSPs). In this case, an administrator is obligated to allocate input and output labels manually. This can be achieved remotely using, for instance, NetConf, SNMP or XMPP. Therefore, an SDN controller may use one of the previously mentioned protocols for configuration of static LSPs on legacy P nodes. In this way, a single unique MPLS label can be allocated on each of the PE and P nodes on the path.

Each time a controller recalculates paths and allocates new labels, it reconfigures static LSP entries on a P node and updates the CFT on a PE node. Standard signalling mechanisms, such as LDP or RSVP, cannot be applied because MPLS labels distributed by them are assigned by each node independently of the others, and consequently the labels have a local meaning only.
Table 1: Signalling protocols between an SDN controller and network nodes

Flow management | Counter readouts (push) | Counter readouts (pull) | Topology discovery
OpenFlow        | NetFlow                 | OpenFlow                | LLDP
NetConf         | IPFIX                   | SNMP                    | OSPF
SNMP            | sFlow                   |                         | IS-IS
XMPP            | jFlow                   |                         |
Due to the fact that our solution is a measurement-driven mechanism, it needs to collect some link statistics. These can be gathered and communicated with the use of various protocols, depending on the functionality supported by switches and routers, as well as the assumed method of obtaining counter readouts (push or pull). For the push method, protocols such as NetFlow, IPFIX, sFlow, and jFlow may be used. These protocols are designed to periodically report traffic statistics. However, they generate a lot of information which is useless from the standpoint of our mechanism: they are able to deliver detailed statistics about each flow, yet our mechanism only needs general interface counter readouts. For the pull method, one can apply OpenFlow or SNMP. These protocols offer on-demand acquisition of general interface statistics. The only drawback is that the pull method requires request-reply communication, so some overhead related to the requests is generated. Contrary to the push method, the pull method limits the overall traffic exchanged between the controller and the P/PE nodes.

As controllers have to maintain LSDBs, they need to discover the network topology. Information collected from well-known protocols such as LLDP, OSPF, and IS-IS can be used to build the LSDB in an SDN controller. The use of OSPF/IS-IS to discover the topology originates from the idea of incremental implementation of our mechanism in any network running legacy MPLS routers. As we described in Section 3.1, the core of the network may stay without replacement if it supports MPLS static label switching; only the new type of edge nodes (PE nodes) has to be deployed in the network. If OSPF is used in the network, we can take advantage of the link-state advertisement (LSA) database for discovering the network topology. We assume that the SDN controller only listens to LSAs. In the case of IS-IS, topology information is encoded in link-state protocol data units (LSPs). The controller can then reconstruct the network topology directly from the LSA or LSP database. In Table 1, we summarise some market-available protocols that can be applied with our mechanism.
4. Evaluation

This section presents simulation setups and results for the performance evaluation of the proposed mechanism. It reports the test scenarios, assessment methodology and metrics used during the mechanism evaluation.
4.1. Simulation environment
All the tests were run on the ns-3 simulator [36]. To conduct the evaluation, we implemented the components described in Section 3. A new MPLS module offering concurrent processing of IP packets and MPLS frames was also implemented. Moreover, we added features enabling: SDN-based central management of a network, LSDB maintenance, the reverse Dijkstra algorithm calculation, the new MPLS label distribution procedure, and functionalities related to the measurement component.

We used four topologies for the experiments: the US backbone, Nobel-EU, and Cost266 topologies are from [37], and the three-level Clos topology is adequate for an internal data centre network [38]. The US backbone network contains 39 nodes and 61 bidirectional links. The Nobel-EU network has 28 nodes and 41 bidirectional links. In the case of the Cost266 network, 37 nodes and 57 bidirectional links are used. The Clos topology consists of 9 access switches (each with 3 uplinks), 9 aggregation switches, and 3 core switches; each aggregation switch is connected to all the core switches. For all the networks, some selected nodes (PEs) serve as attachment points for traffic sources and destinations, playing both the SCN and DCN roles simultaneously (as depicted in Fig. 7). This means that such nodes randomly (uniformly) generate traffic to all the nodes of this type. All the other nodes transit traffic only. For all topologies, all links connecting network nodes are set to 100 Mbps with a 1 ms propagation delay. Links interconnecting SCNs/DCNs with a PE are set to 1 Gbps with a 1 ms propagation delay. Such a configuration allows the avoidance of bottlenecks in the access part of the network.

We study the transmission of TCP traffic only. Network traffic is injected with the use of the ns-3 internal random number generators. Flow sizes are generated on the basis of a Pareto distribution with a shape parameter equal to 1.5 and a mean value set to 500 kB. Flow inter-arrival times were selected in accordance with an exponential distribution (the mean value equals 3 ms).
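For reproducibility, the traffic model can be mimicked outside ns-3 with a few lines (a sketch using numpy rather than the ns-3 generators; the distribution parameters are those given above, while the seed is an arbitrary example):

```python
import numpy as np

rng = np.random.default_rng(1)

MEAN_SIZE_KB, SHAPE = 500.0, 1.5
MEAN_INTER_ARRIVAL_S = 0.003

def sample_flows(n):
    # Classical Pareto with shape 1.5 and mean 500 kB: mean = shape*scale/(shape-1)
    scale = MEAN_SIZE_KB * (SHAPE - 1) / SHAPE           # ~166.7 kB minimum size
    sizes_kb = scale * (1 + rng.pareto(SHAPE, n))
    inter_arrivals = rng.exponential(MEAN_INTER_ARRIVAL_S, n)
    return sizes_kb, inter_arrivals

sizes, gaps = sample_flows(100000)
print(sizes.mean(), gaps.mean())     # roughly 500 kB and 3 ms
```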
(a) US backbone; (b) Nobel-EU; (c) Cost266; (d) Clos
Figure 7: Topologies used in the numerical evaluation (the green vertices represent SCNs/DCNs) Moreover, we dedicated a separate section (4.5) to present the inuence of dierent trac patterns on the performance of the compared mechanisms. The simulation time was set to 100 seconds. Data collection was started after elapsing of the rst 10 seconds of the simulation warm-up time had elapsed. Simulations of the proposed mechanism were conducted for a few combinations of the (CongTh,
(0.9, 0.4 − 0.7),
WarnTh)
pairs:
(0.8, 0.4 − 0.7), (0.85, 0.4 − 0.7), 0.1 step.
where the warning threshold was increased with
Each simulation was repeated 20 times, and the 95% confidence intervals were then computed. For each pair of thresholds, we use the same set of seed values to achieve repeatable traffic conditions. Additionally, such a procedure enables a fair comparison among different setups. For all the simulation setups, we fixed the following values of the link metrics placed in the LSDB: NORM = 1, WARN = 1000, CONG = 65535.
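To make the use of these metrics concrete, the fragment below sketches how measured link utilisation could be mapped to the LSDB metrics with the CongTh and WarnTh thresholds before the reverse Dijkstra computation; the function name, the threshold semantics (greater-or-equal comparisons) and the example links are illustrative assumptions rather than the exact controller implementation.

NORM, WARN, CONG = 1, 1000, 65535
CONG_TH, WARN_TH = 0.9, 0.6        # one of the simulated (CongTh, WarnTh) pairs

def link_metric(utilisation, cong_th=CONG_TH, warn_th=WARN_TH):
    """Map a link's measured utilisation (0..1) to the metric stored in the LSDB."""
    if utilisation >= cong_th:
        return CONG    # congested links are effectively excluded from newly computed paths
    if utilisation >= warn_th:
        return WARN    # heavily loaded links are strongly penalised
    return NORM        # lightly loaded links keep the default metric

# Example metrics fed to the controller's (reverse) Dijkstra computation
links = {("PE1", "P3"): 0.35, ("P3", "P7"): 0.72, ("P7", "PE9"): 0.93}
weights = {link: link_metric(u) for link, u in links.items()}
print(weights)   # {('PE1', 'P3'): 1, ('P3', 'P7'): 1000, ('P7', 'PE9'): 65535}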
4.2. Comparison to the selected mechanisms
To observe the gain of our mechanism, we have proposed a few scenarios in which simulations of other mechanisms were performed. In order to fairly compare all scenarios with our mechanism, we performed 20 simulations for each scenario with the same set of seeds, and the same value of the flow idle timeout was used, i.e., 3 seconds. We compare our mechanism to:
a) the centrally calculated Dijkstra algorithm with reactive flow installation (`single path');
b) classical Equal Cost Multipath based on OSPF (`ECMP');
c) FAMTAR, a distributed Dijkstra algorithm with multipath routing (`FAMTAR'); more details about FAMTAR operation are provided in Section 5;
d) DevoFlow, a modification of the OpenFlow protocol [14], where a modified switch forwards packets according to ECMP paths. The switch installs detailed flow entries by itself using the so-called rule cloning procedure. When an elephant flow is detected (elephant flow detection is based on some threshold), the switch informs the controller. The controller then installs an appropriate detailed flow entry related to this elephant flow on the least congested path;
e) Expedited Eviction, an approach to minimise flow table occupancy based on forecasting TCP flow termination [35].
The first scenario a) refers to the operation of standard OpenFlow switches, all working in the reactive way. In this scenario, only the first switch on the path sends the
Packet_IN message to the controller. The controller installs the respective flows in all the switches on the path. This way, we limit the number of Packet_IN messages. The path computation is performed centrally by an SDN controller using the Dijkstra algorithm. The weights of all links are the same and set to 1.
The second scenario b) uses standard ECMP based on the Dijkstra algo-
rithm implemented in OSPF.
The third scenario c) used for comparison with our mechanism is FAMTAR [39], where path computation is achieved in a distributed way. FAMTAR uses non-equal-cost multipath transmission.
In the case of the fourth scenario d), we implemented the DevoFlow mechanism in ns-3. It applies rule cloning, multipath and threshold-based (i.e., not sampling-based) elephant flow detection. The multipath transmission is based on ECMP, while the routes for elephant flows are chosen using the decreasing best-fit heuristic to solve the bin-packing problem. The authors of [14] propose to detect an elephant flow as a flow that transfers at least a threshold number of bytes in the range of 1–10 MB. We decided to choose a threshold equal to 1 MB because it results in a larger number of elephant flows, thus giving us more opportunity to show the flexibility in traffic control.
The last scenario e) is focused on minimisation of flow table occupancy. Since we simulated TCP traffic only, we implemented only the related part of the mechanism proposed in [35]. This solution expedites rule evictions by recognising TCP flow termination via FIN/RST flags. However, the authors of [35] do not consider any form of multipath transmission; therefore, we use only single path transmission in the simulations to provide a fair comparison with this mechanism.
4.3. Performance metrics
In this section, we define the performance metrics used for the evaluation of our mechanism and for comparison with others. For all the scenarios, we collected data from all nodes. On this basis, we were able to calculate: (a) the total number of transmitted (`Tx') and received (`Rx') bytes; (b) the percentage of dropped packets (`Drop Pkts'); (c) the mean achieved network throughput (`Avg Tput'). Moreover, we propose a metric expressing the received data gain (`Rx Gain'), defined by the following equation:
    Rx Gain = \frac{Rx - Rx_{compared}}{Rx_{compared}} \times 100\%        (1)

where Rx is the total data received during a simulation when our mechanism is used, and Rx_{compared} expresses the total data received when one of the compared scenarios is applied. This metric is obtained from a comparison of our mechanism with a particular other mechanism specified in a related scenario (Section 4.2).
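For clarity, Eq. (1) translates directly into a one-line computation; the totals used below are placeholder numbers, not simulation results.

def rx_gain(rx_ours, rx_compared):
    """Eq. (1): relative gain in the total received data of our mechanism over a compared scenario."""
    return (rx_ours - rx_compared) / rx_compared * 100.0

# Hypothetical totals (e.g., in GB) for one run: our mechanism vs. a compared scenario
print(f"Rx Gain = {rx_gain(12.0, 8.0):.1f}%")   # -> 50.0%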
To estimate the scalability of our mechanism, we gather the total number of flow entries in all the access nodes (PE nodes) per second (`Sum of DFT entries (PE)') and the mean number of flow entries present in a single core node (P node) per second (`Avg label entries (P)'). The number of flow entries in a single OpenFlow P node is equal to the number of labels used by this node. In the case of a legacy MPLS node, this number is equal to the number of MPLS labels present in the label forwarding information base (LFIB). To show the efficiency of the flow processing supported by our tagging approach, during the whole simulation we observe the number of label entries present on all the P switches and we store the maximum values. The mean of the maximum values over all simulations is provided (`Max label entries (P)'). When we simulated the other mechanisms, we also collected the number of flow entries in all nodes. This enables a fair comparison of the flow reduction in the core of the network.
Furthermore, for the evaluation of the flow processing scalability of the proposed mechanism, we define the following indicator: the maximum flow reduction indicator (`maxFRI'). Its calculation is performed in the following manner. Firstly, we distinguish flow tuples and labels. The flow tuples are used by the PE switches, while labels represent aggregated flows; the latter are used by the P switches for forwarding. Supposing that all the PE nodes serve as traffic sources, in each step of the simulation we observe the total number of flows in the network. Secondly, we verify how many labels on a single P node are active due to the presence of the abovementioned flows. Thirdly, we calculate the average number of active labels per single P node. This value shows the mean number of labels used for traffic processing in the core of a network. Finally, to present a single indicator for the whole simulation time (simTime), we use the maximum value, calculating maxFRI as shown below:

    maxFRI = \max_{simTime} \left[ 1 - \frac{avg(\#labels_{P})}{\sum_{PE} \#flows_{PE}} \right] \times 100\%        (2)
This value exemplifies the maximum percentage reduction of the number of flow table entries used by switches in comparison to legacy flow switching (without any aggregation procedure). This number expresses the decrease rate of flow table entries when our mechanism is used. We want to stress that maxFRI represents the situation when all flows from all the PE nodes are present on all core P switches. This is a key performance indicator (KPI) which enables us to quantify the efficiency of flow reduction during system operation. This KPI is based on measurements performed during the system run. No other mechanisms are compared with the use of this indicator.
We also define another indicator, the comparative flow reduction indicator (`CFRI'), which measures the efficiency of flow reduction for core nodes when simulated scenarios are compared. Contrary to maxFRI, CFRI uses simulation data gathered for both the compared mechanisms. The indicator is defined as follows:

    CFRI = \left[ 1 - \frac{avg(\#labels_{P})}{avg(\#flows\_compared_{P})} \right] \times 100\%        (3)

where avg(#labels_P) is the average number of flow entries (per second) in a core P node when our mechanism is applied, and avg(#flows_compared_P) refers to the average number of flow entries (per second) in a core node when another considered mechanism is used.
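A minimal sketch of how both indicators can be computed from per-second measurements is shown below; the arrays and the value for the compared mechanism are illustrative (the latter is of the order reported later for the single path scenario), and the variable names are assumptions of this sketch.

# Per-second measurements gathered during one run (illustrative values):
#   labels_p[t]       - average number of active labels per core P node at second t
#   flows_pe_total[t] - total number of flows observed over all PE nodes at second t
labels_p         = [60, 75, 81, 80, 79]
flows_pe_total   = [4000, 9000, 12000, 15000, 17000]
flows_compared_p = 2070.7    # average per-second flow entries in a core node, compared mechanism

# Eq. (2): maximum flow reduction indicator over the whole simulation time
max_fri = max((1.0 - lp / fl) * 100.0 for lp, fl in zip(labels_p, flows_pe_total))

# Eq. (3): comparative flow reduction indicator against another mechanism
avg_labels_p = sum(labels_p) / len(labels_p)
cfri = (1.0 - avg_labels_p / flows_compared_p) * 100.0

print(f"maxFRI = {max_fri:.2f}%   CFRI = {cfri:.2f}%")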
4.4. Results
This section presents the results achieved during the evaluation of our mechanism and its comparison with the other mechanisms. All the simulations presented and analysed in this section were performed under the following assumptions: the mean flow inter-arrival time equals 3 ms and the mean flow size equals 500 kB.
Table 2 presents a comparison of traffic statistics for the considered topologies. One can notice that the use of multipath transmission results in an increase of transferred traffic. If one compares single path transmission (based on the centrally calculated Dijkstra algorithm with reactive flow installation) or Expedited Eviction with ECMP, it can be observed that ECMP enables better resource utilisation for the Nobel-EU and Cost266 topologies. For the Clos topology, ECMP is almost two times more efficient than the single path option. This stems from the fact that the Clos network offers many concurrent equal cost paths. For the US backbone, one can notice that the use of single path transmission gives slightly better results than the use of ECMP. This stems from the fact that in these simulations ns-3 performed ECMP per packet, not per flow. For instance, if two equal cost paths are available for the same destination and one of them is congested, some packets belonging to the same flow reach this destination unordered. This causes packet drops. Since we operate with TCP traffic, each time retransmissions occur, TCP sources slow down. In fact, some topologies may have a very limited number of equal cost paths; moreover, it can happen that there are no such paths. For the considered topologies and traffic pattern, DevoFlow achieved slightly better throughput than ECMP, which is in line with the results from [14].
If one compares ECMP or DevoFlow with our mechanism and FAMTAR, a better performance is observed for all the inspected topologies with our proposal. Our mechanism and FAMTAR simultaneously use any available paths, which can have different costs. ECMP requires the same cost for concurrent paths, and it can happen that there are no concurrent equal cost paths in a particular network topology. On the other hand, DevoFlow outperforms ECMP only when a certain number of elephant flows is present. The advantage of our mechanism in comparison to FAMTAR lies in the fact that our mechanism possesses better aggregation efficiency (as discussed later). Moreover, with our mechanism, only the edge nodes (PE nodes) require modification, while the core nodes can be off-the-shelf MPLS equipment. For FAMTAR, all the nodes have to be replaced at once.
A simple analysis of the results summarised in Table 3 shows that for all congestion and warning threshold pairs, a notable increase of the total received data (Rx Gain) is observed when our mechanism is used in comparison to single path, ECMP and DevoFlow. Since Expedited Eviction uses single paths for transmission, the achieved Rx Gain is at the same level as in the single path scenario. The best Rx Gain of 170.9% is achieved when we compare our mechanism with single path transmission for the Clos topology. However, the value of Rx Gain is topology-dependent. One can notice that the low values of Rx Gain-c (our mechanism compared to FAMTAR) stem from the fact that both mechanisms enable a similar link utilisation; simply stated, they are both based on unequal cost multipath transmission. A negative value of Rx Gain-c indicates a better performance of FAMTAR, but this value never exceeds 0.8%. Moreover, the negative values appear less frequently than positive values; however, this difference is not strong enough to confirm that either mechanism performs better: our mechanism and FAMTAR behave very similarly with regard to transmission efficiency.
Table 2: Traffic statistics for the considered topologies (for each topology: US backbone, Nobel-EU, Cost266 and Clos, the table reports Tx [GB], Rx [GB], Drop Pkts [%] and Avg Tput [Mbps] for every (CongTh, WarnTh) pair of the proposed mechanism and for the single path, ECMP, FAMTAR, DevoFlow and Expedited Eviction scenarios)
Table 3: Traffic gains of the proposed mechanism in comparison to other mechanisms (Rx Gain-a: centrally calculated Dijkstra with reactive flow installation, Rx Gain-b: ECMP, Rx Gain-c: FAMTAR, Rx Gain-d: DevoFlow); values are reported per topology (US backbone, Nobel-EU, Cost266, Clos) for every (CongTh, WarnTh) pair
The main purpose of introducing our mechanism relates to the need for reducing the number of flow entries in the core switches (P nodes). We can observe that a significant reduction has been obtained due to the flow aggregation procedure based on the introduction of centrally managed MPLS label distribution performed by the SDN controller. Since all flows destined to DCNs attached to a particular PE node are represented by a single label, a large number of ingress flow entries from the edge of the network can be served by the same single label in the core. Thus, the number of labels utilised by a single P node depends on the number of PE switches and the number of used paths. The number of labels is sensitive to the statistics of flow life-times and the idle timeout value used by the garbage collector (in our simulations, the latter is set to 3 seconds). Since the network core forwards traffic from all the PE switches, it is useful to compare the summarised number of flow entries in the network to the number of labels present in a single P switch. To see the impact of our mechanism, observe the columns `Sum of DFT entries (PE)' and `Avg label entries (P)' in Table 4, where a difference of at least two orders of magnitude can be noticed for all the inspected topologies. This result is well seen in Fig. 8, where the changes over time are shown (for the US backbone topology). Despite the fact that the number of flows arriving at PE nodes (in blue) increases, the number of labels used by P nodes (in red) tends to stabilise. This observation confirms the high scalability achieved by the proposed mechanism.
Moreover, the indicator defined in Eq. (2) proves the potential and considerable scalability of our mechanism. Namely, maxFRI illustrates the best achieved result for the considered network configuration and traffic conditions. In Table 4, the best achieved values of maxFRI for all considered topologies are marked in bold. This parameter takes values of more than 99.2%, proving that our mechanism behaves steadily in various topologies.
We have also analysed the influence of the flow idle timeout on the flow table occupancy for both the PE and P nodes for our mechanism. The results were obtained for the US backbone network only. Three values of idle timeout were simulated: 1 second, 2 seconds, and 3 seconds. All the results are presented in Fig. 9. The orange line indicates the median, while the box extends from the lower to the upper quartile of the data. The whiskers extend from the box to show the range of the data. The marked flier points represent outlier values.
Table 4: Aggregation efficiency for the considered topologies (for each topology: US backbone, Nobel-EU, Cost266 and Clos, the table reports Sum of DFT entries (PE), Avg label entries (P), Max label entries (P) and maxFRI [%] for every (CongTh, WarnTh) pair)
Figure 8: Average numbers of network flows and of the used labels (time series of `Sum of DFT entries (PE node)' and `Avg number of label entries (P node)', US backbone topology)

As we have already mentioned in Section 3.5, the number of flow rules in flow tables depends on traffic characteristics and on the value of the idle timeout. We can see that it is necessary to define different values of the idle timeout for short flows and for long-lasting flows. The authors of [35] suggest using low values of the idle timeout, even lower than 1 second. Since in the case of our mechanism no communication with the controller is necessary for a new flow (the switch installs the entry on its own), a low value is desirable. As one can see in Fig. 9, a 1-second idle timeout reduces the DFT occupancy by 30% in comparison to a 3-second timeout (for the US backbone topology). The former case also decreases the number of used labels by a factor of almost two.
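The role of the idle timeout can be illustrated by the eviction rule applied by a garbage collector; the sketch below is a generic illustration, assuming that each DFT entry stores the time of its last matched packet, and it is not the authors' implementation.

import time

IDLE_TIMEOUT = 1.0   # seconds; one of the simulated values (1 s, 2 s, 3 s)

# flow_table maps a flow identifier (5-tuple) to the timestamp of its last matched packet
flow_table = {("10.0.0.1", "10.0.1.7", 6, 44321, 80): time.time()}

def evict_idle_entries(flow_table, now, idle_timeout=IDLE_TIMEOUT):
    """Remove entries that have not matched any packet for idle_timeout seconds."""
    expired = [fid for fid, last_hit in flow_table.items() if now - last_hit > idle_timeout]
    for fid in expired:
        del flow_table[fid]
    return len(expired)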
Now, we compare our mechanism with FAMTAR from the viewpoint of aggregation efficiency. The number of FAMTAR's flow forwarding table (FFT) entries and the number of DFT entries (in our mechanism) are at the same level for any considered scenario. The mechanisms significantly differ in the number of entries stored at the core nodes. Let us consider a network containing 100 edge nodes, all being entrances and exits of a domain (sources
and destinations of traffic). Suppose that there are no congested links in the network. For the FAMTAR solution, each connection between edge nodes is marked with a different tag. This gives a total number of 9900 = 99 × 100 tags (99 destinations for each of the 100 sources) in the whole network.
Figure 9: Comparison showing the influence of flow idle timeouts (1 s, 2 s, 3 s) for the US backbone topology: (a) Sum of DFT entries (PE), (b) Avg number of label entries (P), (c) Max number of label entries (P), (d) maxFRI
This number represents the number of flow entries a single core node has to handle in the worst case: such a core node has to process communication with all edge nodes. In our case, each connection from any edge node to a particular exit node is tagged with the same single global MPLS label. This results in a total number of 100 tags across the whole network. This number is also obtained in the worst case.
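The worst-case tag counts above can be reproduced with a back-of-the-envelope calculation (assuming 100 edge nodes and no congested links, as in the example):

EDGE_NODES = 100

# FAMTAR: every ordered (source, destination) pair of edge nodes receives its own tag.
famtar_tags = EDGE_NODES * (EDGE_NODES - 1)   # 9900

# Proposed mechanism: one global MPLS label per destination (exit) PE node.
our_labels = EDGE_NODES                       # 100

print(f"FAMTAR tags: {famtar_tags}, our labels: {our_labels}")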
When multipath transmission is considered, both mechanisms recalculate routing in the network. Supposing the worst-case scenario, where all the existing flows are still present on the previous paths (before the path recalculation takes place), all the old tags have to be maintained, and new tags have to be allocated for the new flows. Therefore, a single change of routing increases the number of tags used to 19800 = 2 × 9900 for FAMTAR. When we consider our mechanism, the core nodes have to maintain only 200 = 2 × 100 tags (labels). In this way, our mechanism is much more scalable. Moreover, FAMTAR uses the DSCP field of the IPv4 header for aggregation. This poses a scalability issue: only 256 simultaneous aggregated flows can be processed in the network, so even the topology discussed in the above example cannot be served; only very limited aggregates are available.
In Fig. 10, we separately present CDFs of the flow table occupancy per
second for access nodes and core nodes. The analysis was done for the US backbone topology. In the case of our mechanism, access nodes simultaneously store fewer flow entries than with the other compared mechanisms (Fig. 10a). This observation follows from the fact that when a network is heavily utilised, a single flow is transmitted with smaller throughput values. In all the scenarios, the flow inter-arrival time is the same. We use TCP flows that slow down if necessary; thus, they are present in the network for much longer periods. This means that each single node has to maintain flows for a longer time, which results in an increase of the number of flow entries. In the case of our mechanism, flows can achieve higher throughputs than in the other cases. Consequently, the transmission finishes faster and flow entries in flow tables are maintained for shorter times. We can observe in Fig. 10b that our mechanism requires up to two orders of magnitude fewer flow entries than the other mechanisms. For example, when one considers the average number of flow entries in core nodes, our mechanism requires only 81.0 ± 11.9 flow entries, while the single path approach needs as many as 2070.7 ± 662.1 of them. A non-negligible value of the confidence intervals for the single path follows from the fact that network resources are unevenly used; in other words, some core nodes are heavily utilised while others are used only occasionally. Such a situation does not appear when our mechanism is applied, and the network nodes are then used in a balanced way.
Our mechanism significantly limits the communication between the switches and an SDN controller. The controller only retrieves link statistics and updates aggregation entries (CFT entries in PE nodes and label entries in P nodes). If we suppose that the controller collects link statistics and performs the reverse Dijkstra calculation every second (the worst case), the total number of messages exchanged between the controller and the nodes per second is equal to 2N + N (the request and response for statistics from N nodes plus N messages distributing new labels). For example, for the US backbone network it equals 117 = 2 × 39 + 39. DevoFlow has to collect statistics in the same manner (in the case of threshold-based elephant flow detection). Moreover, each elephant flow has to be served by the controller; therefore, it generates Packet_IN and flow installation messages (Packet_OUT together with Flow_MOD messages). In Fig. 11, we present CDFs of the number of OpenFlow signalling messages per second used by the single path case, DevoFlow and Expedited Eviction for the US backbone network. Other considered mecha-