Simulation Modelling Practice and Theory 31 (2013) 169–185
On the effectiveness of Linux containers for network virtualization

G. Calarco, M. Casoni

Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Via Vignolese 905, 41125 Modena, Italy
Article history: Received 25 July 2012; received in revised form 21 November 2012; accepted 22 November 2012; available online 23 December 2012.

Keywords: Network emulation; Virtual network; Software router; Linux container
Abstract

This paper presents a novel approach to the study of multi-technological networks based on Linux containers and software emulators. We illustrate the architecture and implementation issues of a modular and flexible testbed (NetBoxIT) that supports the virtualization and the concurrent, real-time execution of several independent emulators on a single, multi-core hardware platform. Distinct virtual networks can be instantiated and connected to synthesize heterogeneous network configurations. NetBoxIT is also an open platform, which can be interfaced with external networks and nodes, enabling the evaluation of real user applications and protocols. We examine its performance from different viewpoints (scalability, computational load, timing overheads, and realism) and show how the proposed testbed architecture leads to a general-purpose, reliable, and economical tool for assessing multipart networks with respect to real-world applications. Moreover, we discuss the current and future technologies that can be introduced to reduce the testbed timing overheads and to further improve performance.
1. Introduction

Heterogeneous network design is often a challenging task. In particular, when different technologies are required to inter-operate, it is hard to predict whether a planned infrastructure can achieve the expected operational performance. Things become even trickier when the designer aims at foreseeing how the compound system could be tuned to meet the peculiar requirements of a specific user application (e.g., a certain quality-of-service constraint in multimedia communications). In this sense, the a priori availability of a smart design tool would offer a strategic advantage, as it could help to converge toward an optimal network architecture from the very beginning. In particular, it would be attractive to have an assessment tool that enables the rapid comparison of different, alternative architectures. Such a tool should offer at least some fundamental characteristics: realism and repeatability, for a consistent comparison of distinct design choices; flexibility and scalability, to manage more and more intricate scenarios; and, possibly, low cost. In this paper, we illustrate the software architecture and the implementation issues of a modular and scalable software testbed (NetBoxIT) which can be fruitfully utilized to simulate complex networks, possibly organized with different topologies and technologies. NetBoxIT is conceived to support the creation and interconnection of several coexisting virtual networks ("netboxes") on a single, multi-core hardware platform. In summary, using container-based virtualization techniques and a network simulation tool, a number of distinct netboxes (each with its own configuration) can be run concurrently to mimic the separate portions of a heterogeneous network. By means of internal bridges or external real-world equipment, these virtual networks can be interconnected to assemble a model of the overall network infrastructure under investigation. Fig. 1 depicts the conceptual scheme of a netbox.
Fig. 1. The abstract scheme of a netbox: a whole network is emulated within an insulating container and interfaced to external entities or other modules by physical or software interfaces.
The proposed testbed is designed with a novel architecture and aims at offering a number of advantages. First, it is modular and extensible, letting the designer arrange complex network scenarios by combining simple building blocks. Virtualization helps create a first level of insulation among netboxes: the overall computing resources (mainly, the group of available CPU-cores and the memory address space) can be split and decoupled among them, so that their processing co-interference is limited. To reinforce encapsulation further, the network emulation logic (i.e., the operational parameters of any simulated network) is entirely enclosed inside each corresponding netbox (no network emulation processing is executed outside its boundaries). In this manner, netboxes are both computationally and logically self-contained, and can be instantiated and employed as if they were hardware emulation devices. Moreover, thanks to the chosen emulator, NetBoxIT can offer a fairly realistic modeling of several network standards, and supports real-time data handling with negligible timing overheads with respect to the represented real network (i.e., data traverse the virtual networks with the same latencies that would occur in reality). Also, NetBoxIT aims at being an open platform, which can be transparently interfaced with external equipment and nodes, to further increase the realism of simulations and to verify network behavior against real-world applications. Finally, it is low-cost, being based on PC-class hardware and open-source, "off-the-shelf" software only. The paper is organized as follows: in Section 2, we give an overview of the pursued design guidelines in comparison with the tools used in network design nowadays. Section 3 briefly sketches the existing virtualization technologies and how they can be employed to assemble virtual networks. In Section 4 we present the NetBoxIT hardware and software components, the motivations that led to their choice, and the key issues we have taken into account during their deployment. In Section 5 we focus on experimental trials, with the aim of showing that the testbed is suitable for heterogeneous network emulation. Finally, we draw some conclusions and report our current and future work.
2. Networks design: methodologies and trends

A wide variety of techniques can be employed for network design, ranging from pure mathematical models to real-world testbeds. Mathematical queuing modeling is the most abstract methodology, but it is usually useful only for qualitative evaluations and can lead to unmanageable levels of complexity when applied to specific situations (e.g., wireless mesh networks). Physical testbeds, which use real devices, are unquestionably realistic, but frequently impracticable for the study of multi-technological networks, in particular for wide-ranging topologies, due to the time and cost of their implementation. Hardware emulators are based on special-purpose equipment: they are fairly reliable in reproducing the behavior of a certain network link and can be cascaded to reproduce a heterogeneous network. These commercial devices, however, are typically closed to modifications by the researcher and often expensive. Simulation tools are indeed a more popular choice: they are adaptable, quick and cheap to deploy, and repeatable, and they frequently offer reliable and faithful network modeling. Nevertheless, the simulation of complex networks usually requires a vast amount of computing resources; moreover, pure simulations cannot be validated against real-world applications, due to their lack of interoperability with existing applications and systems. For these reasons, software network emulation has recently gained a lot of interest in the networking research community. The key idea behind it is to follow a hybrid approach, where a software simulator (executed on general-purpose hardware) can be mixed (through real network interfaces) with real network components and applications, with the aim of increasing the fidelity of results and validating a planned infrastructure against real traffic. Similarly to pure simulators, software emulators are fast to configure and low-cost, and their realism depends on the accuracy of the network modeling. However, a fundamental distinction is that an emulator cannot run in virtual simulated time, since it must coexist with real network entities. Therefore, it must have enough computing resources to respect real-time constraints when processing incoming data: if real applications exchange traffic through an emulated topology, it is necessary to guarantee that the emulator does not introduce any extrinsic delay (or, at least, that the delay is limited and measurable). There are therefore some critical aspects, mainly related to the choice of the underlying hardware and software platform.
Ideally, one would assemble an evaluation tool that captures the best of all the described techniques: the flexibility and efficiency of software simulators; the modularity and scalability of hardware emulators; and the realism of physical testbeds. Moreover, it should be configurable and manageable through open programming interfaces. Finally, it should be technologically close to the final system, so that little investment is needed for a working prototype. We have tried to conceive NetBoxIT as a sweet spot among all these targets. Therefore, we aimed at creating a framework that is:

- a flexible, general-purpose tool, where heterogeneous multi-technological networks can be studied;
- modular: the "netbox" in Fig. 1 should deploy a whole virtual network environment (a "network in a box", as in hardware network emulators) and should be completely insulated, with private computing resources, so that several netboxes can run concurrently and autonomously;
- highly realistic, validated against real-world systems, with accurate modeling of the PHY/MAC layers;
- real-time, introducing no temporal biases and avoiding "time warping" techniques [1], which can scale up the emulated network size without limits but are unsuited for interfacing the testbed with external real-world networks and applications;
- based on cost-effective, all-in-one, commodity, general-purpose hardware, with no dedicated hardware, FPGA/DSP devices, or large-scale distributed platforms;
- open, possibly interfaced with external devices, networks, nodes and applications through standard physical interfaces;
- based on "off-the-shelf", reusable software products only (avoiding complex software customizations);
- efficient, using the smallest amount of computing resources, to improve scalability.

To practice network emulation, several tools have been developed in the past. A very common approach (e.g., Dummynet [2], NetPath [3]) is to exploit "virtual links" for the modeling of a network and as a means of interconnection among physical or virtual nodes. This approach has been pursued both in single-machine and in distributed testbeds. However, special-purpose software entities (usually shapers, customized bridges or tunnels) must be specifically designed to reproduce the characteristics of a certain kind of channel and to model the (real) network behavior by varying some properties (propagation delay, bandwidth, packet latency, packet loss probability, bit error rate, etc.). For instance, ModelNet [4] is a large-scale network emulator based on a modified version of Dummynet, and probably the earliest attempt to model entire networks on a single host. A set of nodes running user applications are configured to exchange information through a set of core nodes that are employed to adapt the traffic to the bandwidth, latency, and loss profile of a target network topology. Network emulation is synthesized within the core nodes by means of the ipfw IP firewall and the ModelNet kernel module: the first intercepts and selects the incoming packets, whilst the second injects the datagrams into a set of "pipes" that represent the emulated topology (in a sense, this approach can be seen as an applied form of mathematical queuing theory). The ModelNet scheduler runs at the kernel's priority level and uses a 10 kHz clock tick (i.e., with 100 µs granularity) to wake up and move packets from pipe to pipe or to the final destination.
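To make the "virtual link" abstraction concrete, the following minimal C++ sketch shows how a Dummynet/ModelNet-style pipe can be realized as a bandwidth-and-delay queue with probabilistic loss. It is an illustrative reconstruction under our own naming, not code taken from either tool.

```cpp
#include <cstdint>
#include <random>

// Illustrative Dummynet/ModelNet-style "pipe": each packet pays a
// serialization time derived from the link bandwidth, queues behind
// previous packets, suffers a fixed propagation delay, and may be
// dropped probabilistically. Names and structure are ours.
struct Packet { uint32_t bytes; };

class Pipe {
 public:
  Pipe(double bw_bps, double delay_s, double loss_prob)
      : bw_(bw_bps), delay_(delay_s), loss_(loss_prob) {}

  // Returns the departure time (seconds) of a packet entering at 'now',
  // or a negative value if the packet is dropped.
  double Enqueue(const Packet &p, double now) {
    if (uniform_(gen_) < loss_) return -1.0;        // random loss
    const double tx_time = 8.0 * p.bytes / bw_;     // serialization delay
    const double start = now > busy_until_ ? now : busy_until_;
    busy_until_ = start + tx_time;                  // link busy period
    return busy_until_ + delay_;                    // add propagation delay
  }

 private:
  double bw_, delay_, loss_;
  double busy_until_ = 0.0;
  std::mt19937 gen_{42};
  std::uniform_real_distribution<double> uniform_{0.0, 1.0};
};
```

A periodic scheduler (ModelNet's ticks at 10 kHz) would then move every packet whose departure time has expired to the next pipe of the emulated topology, or to its final destination.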
As observed in [2], this kind of approach usually lacks a precise representation of the MAC-layer nuances, like framing (e.g., preamble and checksum insertion), channel access scheduling, or frame retransmissions. A possible strategy to overcome this limitation is to approximate the MAC layer with a probabilistic model (which, however, depends on the type of MAC protocol) used to adjust frame delays. Netkit [5] is a lightweight, flexible emulator based on User-Mode Linux [6] that provides a set of tools for the setup of complex network scenarios. Virtual nodes are created using a modified Linux kernel (which runs as a user-space process) and are networked by means of virtual hubs (called "uml_switch"), which simulate the behavior of a network switch or hub. Furthermore, virtual nodes are able to exchange traffic with external networks, since they can be equipped with TAP devices and employ the real network cards. However, the uml_switch is implemented as a user-space daemon process (which receives packets from the virtual nodes via UNIX domain sockets), a choice that does not seem the most suitable for real-time network emulation. As a consequence, the most natural context of application for Netkit appears to be didactics.
3. Virtualization techniques: opportunities, issues and limitations

Hardware virtualization is essentially the abstraction of functionality from physical components. The value of virtualization has come to light on modern, multi-core, commodity hardware, where the continuous increase of computational power allows multiple virtual guest operating systems (Virtual Machines or GuestOSs) to share the physical machine resources. In a non-virtualized system, a single operating system multiplexes all hardware resources among applications; a virtualized system, instead, employs a dedicated software layer (the Virtual Machine Monitor or VMM, frequently called the hypervisor) to schedule and multiplex the low-level shared resources among high-level applications and operating systems. In brief, virtualization reproduces in software an entire hardware platform, giving the illusion of a real machine to all the software entities executed above it. From the opposite perspective, we can say that virtual machines do not access the physical resources directly, but only through the VMM (even if some distinctions are possible). Typically, virtualization offers several benefits, such as increased security, centralized management, GuestOS live backup and migration, low maintenance costs, energy savings, and hardware utilization efficiency, just to name a few. However, these benefits do not come for free: the performance of a virtual guest can be affected by interference with other virtual guests, and VMM overheads can affect CPU, memory, hard disk, and networking usage. Mainly, two similar approaches have been followed in the past:
the VMM can run on bare hardware (native VMM), or on top of an underlying operating system (hosted VMM). Native and hosted VMMs can be used to offer so-called "full virtualization", where a guest OS runs unaltered and hardware is a completely virtualized resource. In this scenario, the VMM has the task of emulating any underlying physical device. This approach is not the most efficient, since the presence of the VMM decreases the performance of the guest system. To overcome this problem, the GuestOS can be modified to exploit special drivers which provide direct access to hardware (with no VMM intervention): this technique is called para-virtualization and has the advantage that the virtualization overhead becomes much lower. Examples of full virtualization are VirtualBox, VMware, and KVM, whilst XEN is a well-known example of para-virtualization. A more recent technique, called OS-level or container-based virtualization, runs guests inside containers of the host operating system. This virtualization technique was originally developed as an enhancement of the UNIX "chroot" mechanism, but it offers much more isolation and full resource control and management, both for a standalone application and for a whole GuestOS (see Fig. 2). In practice, the kernel of the hosting operating system is modified to allow multiple isolated user-space instances (instead of just one); resource management features are also available to limit the impact of one container's activities on the others and to split the available resources among them. This form of virtualization usually brings little overhead, since programs within containers use the host system's API directly and do not need to rely on hardware emulation or VMMs. OpenVZ, Linux-VServer, Virtuozzo, FreeBSD Jails and LXC [7] are the most popular container-based tools. In recent years, researchers have started to employ software emulation and virtualization techniques jointly for virtual network deployment. A well-established practice is the use of a number of Virtual Machines (VMs) within a single physical node (or a small set of them), with tailored virtual links (e.g., shapers) inserted to mimic a network among them. A number of experimental testbeds have been proposed starting from this scheme, typically employing fully or para-virtualized machines. However, the scalability of this approach is seriously limited by the computing resources consumed by the execution of an entire OS. In some cases, to overcome this problem, time-warping techniques have been introduced within the virtualization platform [1]: however, this choice makes the direct interfacing of the virtual machines with real-world external entities impossible. More recently, container-based virtualization has started to attract interest since, differently from traditional virtualization, its lightweight execution environment can host even a single program alone, and does not necessarily require the installation of a whole operating system. From a technological point of view, the Trellis testbed [8] is somewhat similar to ours, since containers are employed to host a set of distinct virtual nodes.
However, there are some clear distinctions: first, their containers are used to host virtual routers only, and not an entire emulated network, as we do; moreover, the authors connect these nodes using programmable virtual links (EGRE tunnels) with plain traffic modulation capabilities (in essence, shaping), implemented within the kernel space of the host OS. Another testbed architecture comparable to ours is [9] (an implementation guide is available on the NS-3 project Wiki). This is a widely adopted scheme (see Fig. 3), where containers are used to host a number of user applications, and a network emulator is employed to interconnect them. However, it should be noticed again that the (single) emulator is not virtualized, but executed within the raw Linux context. Besides, the authors exploit Tap interfaces to connect the VMs to the network emulator. There are some further examples of container-based testbeds. Mininet [10] is a system for the rapid prototyping of large virtual networks on a single laptop: the virtual nodes are represented by different namespaces, whilst the network links are based on Virtual Ethernet peers [11]. To apply bandwidth limitations and QoS policies to a link, the in-kernel Linux Traffic Control can be employed; nevertheless, only wired links can be emulated within this testbed. CORE (Common Open Research Emulator, [12]) is a multi-platform, user-friendly framework that is able to use Linux or FreeBSD container-based virtualization to build both wired and wireless virtual networks. Real applications can be run unmodified, even connected to real networks and systems to extend the testbed with physical external equipment. CORE strictly deals with the emulation of the network (and higher) layers, and uses a simplified simulation engine for the MAC and physical layers. Virtual networks are created using the Linux bridging tools, which allow the modulation of bandwidth, propagation delay, and packet error probability; on/off connectivity can also be introduced to approximate the behavior of wireless networks. To overcome this limitation, CORE can be combined with other (underlying) emulation tools, for instance EMANE [13] and NS-3 [14], following the same architectural scheme drawn in Fig. 3. Other similar integrations of containers and NS-3 are presented in [15,16], where LXC nodes communicate through an NS-3 emulated network.
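As a concrete illustration of the kernel primitive that container-based tools build upon, the minimal C++ sketch below (our own example for a Linux host, requiring root privileges) starts a child process inside a private network namespace via clone() with CLONE_NEWNET; LXC combines namespaces of this kind (network, PID, mount, IPC) with cgroup-based resource limits to form a full container.

```cpp
#include <sched.h>     // clone(), CLONE_NEWNET
#include <signal.h>    // SIGCHLD
#include <sys/wait.h>  // waitpid()
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

static char stack_mem[1 << 20];  // stack for the cloned child

static int ChildFn(void *) {
  // This process runs in its own network namespace: it sees only a
  // private loopback device until virtual interfaces are moved in.
  std::system("ip link show");
  return 0;
}

int main() {
  // CLONE_NEWNET gives the child a private network stack (requires root).
  pid_t pid = clone(ChildFn, stack_mem + sizeof(stack_mem),
                    CLONE_NEWNET | SIGCHLD, nullptr);
  if (pid < 0) { std::perror("clone"); return 1; }
  waitpid(pid, nullptr, 0);
  return 0;
}
```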
Fig. 2. Containers vs. traditional VMs architectures.
Fig. 3. Virtual networks built with a conventional employment of Linux containers.
In brief, NetBoxIT proposes two architectural novelties. First, we reverse the previous paradigm: we use containers to virtualize the network emulators themselves, and run several of them concurrently (not just a single one) in the user-land context. This scheme leads to a modular framework, better suited to modern multi-core PCs, where the shared CPU resources are split and assigned to each emulator container (i.e., virtual network). Second, we exclusively employ the plain Ethernet protocol for the interconnections among (or with) virtual networks. This choice is not only far more interoperable, but also helps enhance the testbed realism: first, NetBoxIT is ready to be joined with a broad variety of real equipment (routers, switches, wireless portals, etc.); moreover, it makes it easy to attach real-world applications running on external PCs to the emulation platform (besides, this also helps to preserve the limited testbed processing power for the network emulation tasks only).

4. Testbed hardware and software architecture

A flexible emulation tool can be beneficial to assist the designer's early choices, even more so if it is open and can be interfaced with real-world equipment, so that the realism of evaluations can be improved and real-world applications can be tested. In view of that, our plan was to create a framework that is: based on general-purpose hardware and open-source, "off-the-shelf" software products only; modular, where autonomous emulation blocks (the "netboxes") can be used similarly to hardware emulators and are provided with private computing resources, so that several netboxes can run in parallel within a single PC-based platform; realistic, with a high level of trustworthiness, validated against real-world systems (for instance, by exploiting existing network simulators already demonstrated to meet this target); real-time, that is, supporting real-time emulation; efficient, i.e., consuming the smallest amount of computing resources, to improve the testbed scalability; and open and interoperable, in the sense that we want to support data exchange with external devices, networks, and applications through standard physical interfaces. Moreover, to preserve modularity (and improve reusability), we avoid the practice of using programmable virtual links among virtual machines to introduce delays, packet losses, etc. to model the (real) network behavior. In fact, we mean to emulate the characteristics of each network exclusively inside the corresponding netbox and to use plain, standard connections among netboxes. In the following, we illustrate the hardware and software components that have been selected to accomplish these design targets.

4.1. The hardware platform

NetBoxIT is supported by a server-class PC, based on a Dual Xeon 2.4 GHz E5530 chipset, with a total of eight physical CPU-cores (two physical CPUs with four cores each). This emulation server comprises four single-port Gigabit Ethernet cards, all plugged onto the PCI-Express I/O bus. Besides, NetBoxIT is connected (by wired Ethernet links) to an external RFC 1812 software router (based on a Dual-core E2200 chipset), which can be used to dispatch the traffic among the different netboxes. A Linux 2.6 kernel is installed on both these PCs. Finally, two ancillary desktop PCs are employed to host real-world user applications or synthetic traffic generators producing the traffic flows that traverse the virtual networks.

4.2. Netboxes internals
Fig. 4 summarizes a netbox's software architecture and its I/O endpoints. A netbox is a programmable module, designed to act like a hardware emulator. Its nucleus is a simulation engine able to perform as an emulator, i.e., one where some nodes of the simulated network are equipped with Ethernet-like interfaces. These are bound to the virtual interfaces (veth0, veth1) of the VM in which the emulator is insulated. Hence, the assembly of an emulator inside a VM implements the abstract scheme previously depicted in Fig. 1.
Fig. 4. The architectural scheme of a netbox: a whole network is emulated and encapsulated within an insulating virtual machine. A netbox can be interfaced to external entities by physical cards or directly to other netboxes by Ethernet software bridges.
The netbox interfaces can be either connected to the physical Ethernet cards (to exchange frames with external entities) or linked to other netboxes using plain software bridges (instantiated within the raw Linux context). By cascading many netboxes (and merely changing each internal emulator configuration), one can synthesize the different sections of a heterogeneous network.

4.3. Virtual machines and LXC containers

Traditional full- or para-virtualization tools are useful to run a number of operating systems (GuestOSs) concurrently. In our case, one could in principle employ a GuestOS to assemble a netbox and run the emulation environment. This approach, however, is not very efficient, since virtualizing a whole operating system to run just a single emulator would waste a lot of computing resources. This assumption was preliminarily confirmed by our past investigations [17], where we compared different virtualization techniques (VMware, VirtualBox, and LXC Linux Containers) to run a simple UDP traffic generator. We measured that LXC achieves the best loss-free throughput and, more importantly, we observed that containers introduced an imperceptible computing overhead, since performance did not differ from the execution of the generator on the raw Linux OS. For this reason, LXC containers have been chosen as the key virtualization technology for our platform. A step forward is then to examine whether containers are also suitable to properly insulate network emulators, which are more demanding applications, since we aim at a scalable testbed where an increasing number of concurrent netboxes can be run with no mutual computing interference. Trials were conducted using LXC rel. 0.7.4.

4.4. The NS-3 simulator

Several simulators exist for network design (NS-2, NS-3, OMNeT++, QualNet, etc.), but just a few are able to perform emulation tasks (i.e., to exchange real packets with a real network). Among these, NS-3 [14] appears the most appropriate, since it offers some very appealing features. First, several fairly realistic network models are available; in particular, the Wi-Fi and WiMAX models offer a high level of realism, validated against real-world systems, with simulated PHY/MAC layers that closely resemble the behavior of real equipment. Further, it natively supports different emulation interfaces, the Tap and EmuNet devices. The latter is suitable for our purposes, since it allows a netbox to exploit the testbed's physical Ethernet interfaces. Moreover, NS-3 supports real-time simulations, which can be run with a bounded amount of temporal skew, so that ingress data are treated by the emulator with the same timing they would have when traversing a real network. Of course, the choice of NS-3 is not compulsory: other equivalent emulators could be hosted in the future as well; however, NS-3 is open, free, and seems nowadays the most efficient [18]. Experiments have been carried out using NS-3.9.

4.5. The Click Modular Router

Extensions of commercial routers are usually not possible, and an open approach seems much more attractive, in particular at the prototyping stage. As an example, the network designer could attain many benefits if it were possible to employ an open router and optimize its behavior with respect to a particular pattern of traffic (e.g., multimedia communications). The Click Modular Router [19] permits the design of high-speed, PC-based routers, in which customized services can be arranged
using the available libraries or creating new ones. For instance, Click has already been employed in the past to investigate QoS-aware dynamic bandwidth management and admission control schemes [20]. In summary, we tried to meet our testbed design targets by a proper selection of general-purpose hardware and open-source, off-the-shelf software components. The hardware platform is a server-class PC, based on multi-core CPUs (so that many execution contexts can be run disjointedly) and a broadband PCI-Express internal bus (where I/O transfer overhead is negligible). LXC containers allow splitting and encapsulating the available CPU-cores and NICs among a number of distinct emulation tasks in a modular fashion. NS-3 is fairly realistic, since many of its simulation models are already validated against real networks; nevertheless, we aim at verifying whether it still offers truthful outcomes when employed as a virtualized emulator. NS-3 also claims to support real-time packet processing, so that data can be handled with the same timings they would encounter in reality. Moreover, both LXC and the NS-3 simulator seem very efficient: the former introduces a negligible virtualization overhead [17], while the latter offers high simulation performance [18]. Finally, NetBoxIT aims at being highly interoperable, since it exploits Ethernet as the means of information exchange among netboxes and external entities: this is particularly helpful to interconnect the testbed with real-world devices and networks, to increase the realism of investigations and to allow the evaluation of real applications.
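To fix ideas, the sketch below outlines in C++ how a netbox along these lines can be assembled with NS-3: the real-time scheduler is selected with a hard jitter limit, and Emu devices are bound to interfaces visible inside the container. This is a minimal reconstruction of the scheme in Fig. 4, not the authors' actual script: the interface names (veth0, veth1), rates and hard-limit value are placeholders, and header/helper names vary across NS-3 releases (the EmuNetDevice helper of the NS-3.9 era was later superseded by the FdNetDevice family).

```cpp
// Minimal netbox skeleton (NS-3.9-era API): two simulated nodes joined
// by a CSMA link, each side also bound to a container-visible interface
// through an Emu device. Interface names and rates are placeholders.
#include "ns3/core-module.h"
#include "ns3/network-module.h"
#include "ns3/csma-module.h"
#include "ns3/emu-helper.h"

using namespace ns3;

int main(int argc, char *argv[]) {
  // Run in real time, aborting if the simulation clock falls behind the
  // hardware clock by more than the configured hard limit.
  GlobalValue::Bind("SimulatorImplementationType",
                    StringValue("ns3::RealtimeSimulatorImpl"));
  Config::SetDefault("ns3::RealtimeSimulatorImpl::SynchronizationMode",
                     StringValue("HardLimit"));
  Config::SetDefault("ns3::RealtimeSimulatorImpl::HardLimit",
                     TimeValue(MilliSeconds(1)));
  GlobalValue::Bind("ChecksumEnabled", BooleanValue(true));  // real NICs check FCS

  NodeContainer nodes;
  nodes.Create(2);

  // The network emulated inside the netbox (here a plain Ethernet
  // segment; a Wi-Fi or WiMAX channel would be configured instead).
  CsmaHelper csma;
  csma.SetChannelAttribute("DataRate", StringValue("1Gbps"));
  csma.Install(nodes);

  // Bind each edge node to one of the container's interfaces.
  EmuHelper emu;
  emu.SetAttribute("DeviceName", StringValue("veth0"));
  emu.Install(nodes.Get(0));
  emu.SetAttribute("DeviceName", StringValue("veth1"));
  emu.Install(nodes.Get(1));

  // ... IP stacks, addresses and routing would be configured here ...

  Simulator::Stop(Seconds(3600.0));
  Simulator::Run();
  Simulator::Destroy();
  return 0;
}
```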
5. Virtual networks experiments and NetBoxIT performance

5.1. A case study: heterogeneous wireless emergency networks

Fig. 5 sketches a portion of a network infrastructure that we are investigating within the EU 7th FP Large Scale Integrated Project "A holistic approach towards the development of the first responder of the future". Nowadays, civil protection agencies exhibit a rising interest in exchanging enhanced information (audio, video, maps, etc.), compared to the existing low-rate Personal Mobile Radio services (e.g., TETRA or P-25). In [21] we illustrated how this need could be satisfied using broadband, multipart, infrastructure-less wireless networks, where different technologies are used at the Incident, Jurisdictional and Extended Area Networks (IAN, JAN, EAN) to support communications among first responders (FRs), mobile Emergency Operation Centers (MEOCs) and the main Emergency Operation Center (EOC). On the operation field, an IEEE 802.11 network is employed among first responders and their commander, and an IEEE 802.16 connection supports the long-range link between the commander (and his/her team) and the closest MEOC. A geostationary satellite offers coverage between the MEOCs and the remote EOC. Further, we envision the satellite system to support the DVB-RCS Next-Generation services [22], so that a direct MEOC-SAT-EOC (or even a MEOC-SAT-MEOC) path can be created, with no involvement of ground station hubs, leading to a significant reduction of propagation delays. Finally, an IP router is located at each of the MEOCs, to forward information among the different networks. We adopted the project's heterogeneous network as a representative case study for NetBoxIT.
Fig. 5. The Emergency Network under investigation, based on Wi-Fi, WiMAX and satellite networks (dashed-red, blue and dashed-purple lines, respectively). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
5.2. Experimental results in wireless network emulation

To the best of our knowledge, container-based virtualization has never been employed to assemble virtual networks before, and our testbed architecture appears entirely novel: therefore, it seems important to clarify here that our very first target is to examine the testbed's fidelity in network assessments. Some preliminary experiments have been reported in [23], mainly related to the emulation of a Wi-Fi point-to-point link, but now we want to refine our evaluations and carry out a cycle of plain trials, based on very simple heterogeneous scenarios that can be understood and corroborated by intuition, before moving on towards more refined evaluations. Hence, we focus here on some testbed qualities that seem important to confirm our architectural choices: scalability, computational load, realism, and timing overheads. Moreover, we experimented with some available techniques (such as direct LXC-to-NIC interconnections and Virtual Ethernet peers) to introduce some further performance improvements.

5.2.1. Scalability

Scalability is the testbed's ability to sustain, within the limits of the existing hardware resources, an increasing number of concurrent netboxes. These, however, should not produce or experience any mutual interference, so that each can perform its tasks unperturbed. With this aim in mind, we performed a sequence of trials where a pool of netboxes is employed to encapsulate one NS-3 UDP generator each. These produce packets at the maximum achievable rate, so that their pre-assigned CPU resources are fully consumed (in particular, a single CPU-core is reserved for each container). Fig. 6 reports the aggregated data rate, measured at the output NICs of the testbed, as the number of running generators is increased and the available CPUs are progressively exhausted. It is straightforward to observe that, independently of packet size, the global throughput grows quite linearly. This reveals a minimal co-interference among netboxes and seems to confirm that NetBoxIT is fairly scalable in using the available resources, even under different conditions.

5.2.2. Computational load

Computational load is the amount of processing power required by a netbox to run an emulation task: this quantity must be monitored, since an overloaded netbox slows down, producing misleading results. As NS-3 is an event-driven simulator, it consumes the available resources according to the occurrence of ongoing events. Hence, a possible way to evaluate a netbox's CPU load is to inject a packet stream at an increasing rate, so that its virtual CPU is progressively pushed toward saturation. To obtain a preliminary estimate, we employed two distinct netboxes (each with a single CPU-core assigned) that emulate a Wi-Fi and a WiMAX link, respectively. A 64-byte UDP flow is injected into each, and the packet rate is gradually raised until the wireless capacity is entirely saturated. In these conditions, the CPU consumption kept below 55% (Wi-Fi task) and 65% (WiMAX task), showing that both remain well below overload and that there is even room available for more demanding simulations. More interestingly, the CPU consumption grows sub-linearly: a 10× increase of the packet rate produces just a 5× increase of the CPU utilization.
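Reserving one CPU-core per container, as done in these trials, is a cgroup operation; with LXC it can also be requested declaratively through the lxc.cgroup.cpuset.cpus configuration key. The helper below is a hypothetical C++ illustration: the cgroup mount point and hierarchy layout are assumptions that vary with the distribution and LXC version.

```cpp
#include <fstream>
#include <string>

// Pin an LXC container to a single CPU core through the cgroup-v1
// cpuset controller. The base path is an assumption: it depends on
// where the cgroup hierarchy is mounted and on the LXC version.
bool PinContainerToCore(const std::string &container, int core) {
  const std::string base = "/sys/fs/cgroup/cpuset/lxc/" + container + "/";
  std::ofstream cpus(base + "cpuset.cpus");
  std::ofstream mems(base + "cpuset.mems");
  if (!cpus.is_open() || !mems.is_open()) return false;
  cpus << core;  // e.g. 2: the container is scheduled on core #2 only
  mems << 0;     // keep its memory allocations on NUMA node 0
  return cpus.good() && mems.good();
}
```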
However, if a single CPU-core became inadequate because of severely demanding tasks, one could consider splitting an overly complex simulation among several simpler netboxes; it is also possible to allot several cores to a given netbox (NS-3 supports multi-threaded simulations under many circumstances). Finally, with Ethernet as the means of communication, it is easy to extend the emulation beyond a single PC and move toward a distributed testbed.

5.2.3. Realism

As regards the testbed realism, a first-round assessment is carried out to evaluate whether, despite being virtualized, the emulators can provide throughput estimates that are consistent with the corresponding real networks.
Fig. 6. The aggregated throughput obtained by the simultaneous execution of an increasing number of netboxes, each running a UDP generator at the maximum loss-free packet rate.
Three distinct netboxes are instantiated, deploying a Wi-Fi, a WiMAX and a satellite point-to-point link, respectively. A single CBR UDP flow (with a 500-byte payload) is injected into each virtual network separately, and the rate is progressively raised until the radio link is wholly saturated. The measured values of the maximum IP throughput are reported in Table 1, together with the operational parameters chosen for the three emulators. The obtained results are in agreement with experimental trials reported in the literature under similar conditions [24–27], confirming that the network modeling supplied by the NS-3 simulator is fairly realistic and, more importantly, that the obtained performance is not influenced by the LXC virtualization even when the ingress packet rate reaches its maximum. Despite the existing validations, the low WiMAX throughput (less than 2 Mb/s) could seem surprising anyway. However, a careful reading of the IEEE 802.16 standard (TDD profile) helped clarify that the obtained value is truthful. In fact, the 18 Mb/s theoretical capacity represents the gross value at the physical layer, but several diminishing factors concur to a much lower IP goodput in practice. For instance, not all the OFDM subcarriers are devoted to carrying information and, at a distance of 2.5 km, a robust modulation scheme is required (QPSK 3/4 instead of the more efficient 64-QAM), lowering the spectral efficiency. Moreover, guard and trailing symbols inside the physical frames, the FEC code, the MAC-layer overhead, and the scheduling and control information continuously exchanged between the Subscriber Station (SS) and the Base Station (BS) are other important factors that decrease the IP goodput. It is then important to establish whether NS-3 offers realistic outcomes as regards latency even when it is employed to emulate the whole heterogeneous network, i.e., in our particular setting, when a chain of virtualized, real-time network emulators is created. As a reference scenario, we suppose to have a first responder, located near the disaster site, who needs to communicate with the remote EOC personnel. Consequently, a voice communication is opened, and a VoIP stream crosses the Wi-Fi network towards the team commander, the WiMAX link between the commander and the closest MEOC, the MEOC router, and finally the satellite system. In Fig. 7 we sketch the corresponding testbed configuration: three distinct netboxes (using just a single CPU-core each) are instantiated and interconnected (using Linux bridges or the external Click Modular Router) to translate our reference scenario into virtual networks. The three wireless links are configured with the same operational parameters previously reported in Table 1. As stated above, one of the key advantages of NetBoxIT is the possibility of connecting it with real network nodes. Using the available Ethernet interfaces, a couple of supplementary PCs are attached to the testbed to introduce a pair of real-world LinPhone [28] VoIP agents as the source and sink of traffic (Table 2 summarizes their configuration parameters). Moreover, it should be mentioned that, similarly to what could be done in a real deployment, the WiMAX simulator is configured to schedule the uplink (i.e., from the FR to the EOC) VoIP traffic using the rtPS (real-time Polling Service) QoS policy, which is the best suited to support voice communications in terms of efficiency and timing constraints.
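In NS-3, this scheduling choice maps onto the WiMAX module's service-flow API. The fragment below is a hedged sketch patterned after the NS-3 WiMAX examples: it installs an uplink service flow of type rtPS on the subscriber station that carries the VoIP stream. The addresses, port ranges and classifier priority are placeholders, and the surrounding WimaxHelper setup is assumed to exist.

```cpp
#include "ns3/wimax-module.h"
using namespace ns3;

// Hedged sketch patterned after the NS-3 WiMAX examples: classify the
// VoIP UDP traffic and schedule it uplink with rtPS. 'wimax' is the
// WimaxHelper used to build the network; 'ss' is the subscriber station
// device of the commander's node. Addresses and ports are placeholders.
void InstallVoipUplinkFlow(WimaxHelper &wimax,
                           Ptr<SubscriberStationNetDevice> ss) {
  IpcsClassifierRecord voipClassifier(
      Ipv4Address("10.1.1.1"), Ipv4Mask("255.255.255.255"),  // source
      Ipv4Address("10.1.3.2"), Ipv4Mask("255.255.255.255"),  // destination
      0, 65535,   // source port range (any)
      0, 65535,   // destination port range (any)
      17,         // IP protocol 17 = UDP
      1);         // classifier priority
  ServiceFlow voipFlow = wimax.CreateServiceFlow(
      ServiceFlow::SF_DIRECTION_UP,   // uplink: FR towards the EOC
      ServiceFlow::SF_TYPE_RTPS,      // real-time Polling Service
      voipClassifier);
  ss->AddServiceFlow(voipFlow);
}
```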
A SIP-based, bidirectional voice call can now be opened between the two end-points. The well-known Wireshark sniffer [29] is run within the raw Linux context and utilized to track the frame timings. In particular, it is able to record all the frames at any of the testbed interfaces: not only can the physical NICs be monitored, but also any virtual interface that acts as an Ethernet device (i.e., bridges and netbox virtual ports). This is an important advantage: in a single capture, one can obtain an overall view of packet timings along their route through the testbed. Moreover, this can be done with an absolute timing reference and no synchronization mechanisms, since Wireshark and the netboxes are all hosted within the same platform and share the same clock. A sample of 40,000 frames (about 13 min of real conversation) was extracted for further offline analysis. Table 3 and Figs. 8 and 9 illustrate the statistics of the end-to-end latency and jitter that affect the VoIP communication along the Emergency Network return (FR-to-EOC) and forward (EOC-to-FR) channels. In particular, these values are measured at the physical NICs of the testbed where the two VoIP agents are connected (thus, we are not measuring the audio encoding/decoding processing delays that occur inside the two endpoint PCs, since these are not strictly related to the network itself but depend on the particular audio codec adopted). It is worth noting that the delay and jitter distributions for the return channel are considerably narrower than those for the forward channel, exposing the benefit of the rtPS WiMAX scheduler (and the realistic functioning of the netbox as well). The uplink mean latency (269 ms) is in good agreement with our theoretical estimate. In fact, in WiMAX the SS-BS bandwidth request-grant procedure requires a minimum interval equal to the frame duration (10 ms); since the audio source and the WiMAX TDMA scheduler are (unavoidably) desynchronized (and there is a continuous time shift between the two), the delay experienced by packets offered to the WiMAX network should amount to 15 ms on average (to be added to the 254 ms delay introduced by the satellite link). These plain considerations seem to further confirm the realism of the testbed.
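Made explicit, with T_f the WiMAX frame duration and Δ the (approximately uniform) offset between the audio source and the TDMA frame start, the estimate reads:

```latex
\mathbb{E}[D_{\mathrm{up}}] \;\approx\; D_{\mathrm{sat}} + T_f + \mathbb{E}[\Delta]
  \;=\; 254\,\mathrm{ms} + 10\,\mathrm{ms} + \frac{10\,\mathrm{ms}}{2}
  \;=\; 269\,\mathrm{ms},
```

in close agreement with the 269.2 ms mean measured on the return channel.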
Table 1
Networks operational parameters and measured IP throughput.

Technology | Distance | TX power (dBm) | Channel propagation model | Theoretical gross capacity | Maximum IP throughput
802.11a | 80 m | 16 | Friis, LOS | 54 Mb/s | 7 Mb/s
802.16 TDD 10 MHz @ 5 GHz | 2.5 km | 30 | Friis, LOS | Return/forward channel: 18/18 Mb/s | 1.9/1.9 Mb/s
DVB-RCS NG | 76,000 km | – | Delay = 254 ms, PER = 10⁻⁶ | Return/forward channel: 2/10 Mb/s | 2/10 Mb/s
Fig. 7. The trials configuration, exposing the NS-3 emulators (the blue clouds) used to emulate the Wi-Fi, WiMAX and satellite networks, the Click Router and the traffic source and sink. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Table 2
VoIP agents configuration.

Agent | Audio codec | Sampling rate (kHz) | Frame (ms) | Codec delay (ms) | Coding rate (Kb/s) | Ethernet throughput (Kb/s) | VAD
LinPhone rel. 3.4 | Speex WB | 16 | 20 | 50 | 28 (CBR) | 49.6 | No
Table 3
End-to-end latency and jitter experienced by a VoIP flow traversing the Emergency Network through the return and forward channels.

Channel | Latency (ms): mean / max / min | Jitter (ms): mean / max / min
Return channel | 269.2 / 279.5 / 255.8 | 7.9 / 22.1 / 1.3
Forward channel | 303.0 / 350.0 / 256.4 | 42.9 / 55.6 / 25.5
Fig. 8. End-to-end latency and jitter (probability frequencies and cumulative function plots) experienced by a VoIP flow traversing the Emergency Network return channel.
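For reference, the statistics in Table 3 can be derived offline from the captured timestamps with a few lines of code. The sketch below is our own post-processing example (not the authors' scripts): it assumes two time series recorded with the shared clock at the ingress and egress NICs, and takes jitter as the absolute variation between consecutive per-packet latencies; the sample values are placeholders.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Offline post-processing sketch: derive latency and jitter statistics
// (as in Table 3) from ingress/egress capture timestamps that share the
// same clock. Jitter is the absolute difference between the latencies
// of consecutive packets. Sample values are placeholders.
static void PrintStats(const char *label, const std::vector<double> &v) {
  double sum = 0.0;
  for (double x : v) sum += x;
  auto mm = std::minmax_element(v.begin(), v.end());
  std::printf("%s: mean %.1f ms, max %.1f ms, min %.1f ms\n", label,
              1e3 * sum / v.size(), 1e3 * *mm.second, 1e3 * *mm.first);
}

int main() {
  // Timestamps (seconds) of each frame at the two capture points.
  std::vector<double> tIn  = {0.000, 0.020, 0.040, 0.060};
  std::vector<double> tOut = {0.269, 0.291, 0.309, 0.330};

  std::vector<double> latency, jitter;
  for (size_t i = 0; i < tIn.size(); ++i)
    latency.push_back(tOut[i] - tIn[i]);
  for (size_t i = 1; i < latency.size(); ++i)
    jitter.push_back(std::fabs(latency[i] - latency[i - 1]));

  PrintStats("latency", latency);
  PrintStats("jitter", jitter);
  return 0;
}
```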
5.2.4. Timing overheads

Timing overheads are the extra temporal biases introduced by the software elements instantiated inside the testbed platform and required for its functioning. Ideally, information should traverse the virtual networks with exactly the same latencies that would occur in reality. But in a software-based platform this target is clearly unachievable in absolute terms, and some computing latencies are unavoidable. This is of course just a first-order approximation, since the hardware architecture can also be a source of extra skew (for instance, the internal I/O bus cannot enforce an absolute timing guarantee on data transfers between memory and NICs, and the NICs can introduce extra delays due to frame queuing inside their internal FIFO buffers).
Fig. 9. End-to-end latency and jitter (probability frequencies and cumulative function plots) experienced by a VoIP flow traversing the Emergency Network forward channel.
However, PC internal data transfers are nowadays so fast that the introduced lags can be practically neglected in most cases. For our purposes, it is first and foremost important to obtain fair real-time data handling at least inside the netboxes (i.e., within the emulators). Fortunately, this is easy to obtain, since NS-3 allows enforcing a hard limit on the maximum computing jitter of a simulation (i.e., on the skew between the simulation clock and the hardware clock controlled by the NS-3 real-time scheduler): if the limit is violated, the NS-3 engine simply aborts and reports the problem. Therefore, we focused on quantifying the skews occurring outside the netboxes, and analyzed whether they can have some practical impact on measurements. More in detail, the containers' virtual NICs need some time to receive or transmit frames, and the interconnecting bridges delay the frames during their delivery, since expensive lookup and copying operations must be performed. For the evaluation of these timing overheads, we performed a set of measurements where a unidirectional UDP flow traverses the same testbed configuration sketched in Fig. 7. Wireshark is again utilized to record the frame timings. The bar graph on the left of Fig. 10 reports the delays accumulated by frames as they traverse each of the interconnecting entities (i.e., where no emulation processing occurs). For instance, from the time when Wireshark records a frame at the first Ethernet NIC (i.e., the frame has been entirely reassembled by the transceiver and transferred to the receiving ring buffer in kernel-space memory), about 17 µs are required before it is recorded again at the netbox virtual NIC; the same interval is necessary to move a frame from a netbox toward the transmission buffer of a physical NIC; crossing a Linux bridge, instead, costs approximately 12 µs. The global timing overhead keeps below 90 µs, far from being relevant in most common situations. The graph on the right of Fig. 10 illustrates the latencies measured between the I/O interfaces of the Wi-Fi, WiMAX and SAT netboxes and the NICs attached to the Click Modular Router, so that the single contribution of each emerges. It is worth noting that the mean latency of the WiMAX netbox is about 14.3 ms, very close to the theoretical 15 ms value discussed in the previous paragraph.

5.2.5. Virtual networks interconnections

The reported outcomes suggest that an excessive number of interconnecting bridges could become the major limitation to the realism and performance of the testbed. We should bear in mind that each bridge is implemented by a kernel object that stores, copies, and delivers each datagram (more exactly, a kernel socket buffer) generated by the user-space application (the NS-3 emulator in our case) or coming from the NIC. However, Linux kernels after 2.6.35 introduce support for the direct binding of LXC containers to physical NICs.
Fig. 10. Overheads and emulation timings of the testbed.
This feature allows the removal of most of the software bridges depicted in Fig. 7 and the direct interconnection of netboxes with the external equipment. This choice has a beneficial effect, since the (expensive) in-kernel processing is skipped and the related timing overheads disappear. Thanks to the particular topology of our case study, even the last remaining bridge (the direct link between the Wi-Fi and WiMAX emulators) can be replaced with a more efficient forwarding scheme. Virtual Ethernet peers (VETH, [11]) are a (not so widely known) kernel-based socket-buffer forwarding mechanism, which allows a faster delivery between two endpoint applications than kernel bridges. The Layer-2 frames generated by a netbox are copied to kernel land and looped back to user land to be received by the peer netbox. In particular, an AF_NETLINK socket is employed to touch the destination (dst_entry) of the socket buffers (sk_buff), so that these are not moved to a physical NIC, as would happen for outgoing frames, but enqueued for the receiving application. Fig. 11 (left) compares the maximum forwarding rate that can be sustained by a standard Linux bridge with that of a Virtual Ethernet peer, showing the substantial improvement offered by the latter. Finally, Fig. 11 (right) also gives evidence of the overall benefit (in terms of timing overheads) obtained thanks to the bridge removal.

5.3. NetBoxIT capacity, performance analysis, and related work

The evaluation of the NetBoxIT capacity (i.e., the ability to manage the ingress packets within real-time constraints) is necessary to understand the performance limits of our architecture, at what point the system breaks down, and how internal system delays behave when the system is overloaded or the number of employed containers is scaled up. In Fig. 12 we resume the key results extracted from a set of experiments conducted with these targets in mind. Our first evaluation is focused on quantifying the capacity of NetBoxIT as a function of the offered load, measured in packets per second. Similarly to software routers, NetBoxIT "forwarding" performance is largely independent of packet size, since both the hardware overheads (e.g., the time necessary to receive or transmit frames by our Gigabit Ethernet NICs or to move packets through the PCI-Express bus) and the software overheads (due to memcpy operations on the frames) are negligible: for this reason, we report here only the tests conducted with short 64-byte packets. We vary the number of netboxes from 1 to 8 to simulate a chain topology of 2–16 IP routers communicating through 1 Gb/s Ethernet links. It is worth noting that each netbox employs a single CPU core; therefore we progressively consume all the available computing resources as the chain becomes longer. It should be remembered that our hardware platform is based on a Dual Quad-core Xeon chipset: hence, we have two physical CPUs (socket #0 and socket #1), each containing four cores. For clarity, we will use the following definitions: the CPU cores that belong to socket #0 will be enumerated 0, 2, 4, 6, whilst the cores belonging to socket #1 will be named 1, 3, 5, 7. We set up five different testbed configurations, where the number of hops is progressively extended, with some variations in the choice of the CPU cores:

Setup (A): Two IP routers are emulated within a single netbox.
Thus, we create a single LXC container and run an NS-3 simulation inside it; the NS-3 configuration simulates a 1 Gb/s Ethernet link with two virtual routers attached; each virtual router is also equipped with a (Gigabit) EmuNet device and directly bound to a physical (Gigabit) NIC (software bridges are not employed anymore in these tests). Therefore, frames coming from an external source are received by the first router through the physical NIC, and then forwarded to the second router through the simulated Ethernet link. The second router then forwards and transmits the frames through the physical NIC towards an external sink. This lone netbox is executed on core #2 (the second core of socket #0).

Setup (B): Four IP routers are split in pairs and emulated within two netboxes, each constructed as in setup (A). In this configuration, one Virtual Ethernet peer is employed to interconnect the two netboxes. During this test, core #2 is assigned to the first netbox, and core #4 to the second.
Fig. 11. Virtual Ethernet peers vs. Bridge forwarding rate (left); timing overheads of the testbed (right) after the bridges removal.
Fig. 12. Cumulative (filled bars) and per netbox (dotted bars) capacity, and average CPU-core load (left); timings and interconnections overheads in NetBoxIT, during the emulation of a chain of Gigabit Ethernet routers (right).
Setup (C): Eight IP routers are split in pairs and emulated within four netboxes, each constructed as in setup (A). In this configuration, three Virtual Ethernet peers are employed to interconnect the four netboxes. During this test, cores #1, 3, 5, 7 are used (the entire socket #1).

Setup (D): Similarly to setup (C), eight IP routers are split in pairs and emulated within four netboxes, connected by three Virtual Ethernet peers. However, in this test we use cores #0, 2, 4, 6 (the whole socket #0).

Setup (E): Sixteen IP routers are split in pairs and emulated within eight netboxes, and seven Virtual Ethernet peers are employed to interconnect them. During this test we employ all the platform CPU cores.

In summary, these topologies directly connect a transmitter with a sink over a variable number of hops. By changing the number of routers in the path, we modify the total amount of work that NetBoxIT performs to emulate the end-to-end route. In each experiment, an external PC is used to inject a 64-byte UDP flow, to receive the outgoing frames, and to measure the emulator capacity. The left graph in Fig. 12 plots the maximum forwarding capacity, measured in packets per second, for each testbed configuration. In particular, the dotted bars point out the maximum forwarding rate obtainable by each single netbox forming the chain topology, whilst the filled bars account for the overall amount of frames that are instantaneously managed by the whole platform. Within the left graph, we also report the average load of the CPU cores during the same experiments. It should be noticed that in a chain of netboxes, where several cores are used, the load of all the cores usually stays aligned after a short transient. In fact, since the frames leaving a netbox become the input for its followers, the first netbox of the chain inevitably leads the performance of all the others. An exception occurs when we run both a netbox and the kernel/NIC driver on the same core #0, as in setups D and E. In such a situation, that specific core is subjected to a greater load than the others. At saturation, for the baseline case of two hops, the measured NetBoxIT capacity is approximately 44,000 packet/s. It is worth clarifying that we strictly refer here to the capacity of the emulator meant as a "forwarding black box", i.e., to the forwarding capability of the software and hardware ensemble that receives and transmits frames from/to the physical NICs, and not to the capacity of the virtual Gigabit Ethernet segment interconnecting the two routers within the NS-3 simulation (whose nominal capacity of 1 Gb/s was chosen to be large enough not to act as a bottleneck). At this point, the CPU is 95.6% utilized and clearly represents the bottleneck for the emulator: the platform enters a state in which the available computing resources are employed to pull 44,000 packet/s from the input network card, to emulate the two-hop virtual link (e.g., the routers' IP stacks and their table-lookup algorithms, frame propagation, transmission and reception activities, etc.), and to push the packets to the output NIC. Once we attempt to emulate four hops or more, the forwarding rate per single netbox decreases progressively. For 16 hops, each netbox accurately forwards about 26,000 packet/s, whilst the platform as a whole manages 211,000 packet/s.
Even in this 16-hop worst case, however, a netbox is still able to reliably emulate a 10 Mb/s Ethernet link with 64-byte packets, and a 100 Mb/s link if frames larger than 500 bytes are employed. A 1 Gb/s connection, instead, can never be attained, even with 1500-byte frames. For the sake of comparison, the forwarding capacity of a single netbox is certainly lower than in ModelNet [4], where an emulation node using a 1.4 GHz CPU can forward about 120,000 packet/s. However, the two platforms rest on quite different modeling approaches and architectures. ModelNet "pipes" are certainly lighter and faster than NS-3 C++ objects, but they are also simpler. For instance, the NS-3 virtual NICs (the CsmaNetDevice or the EmuNetDevice we used in these tests) offer a set of configuration parameters (e.g., MAC address, MTU, DIX/LLC EncapsulationMode, queue lengths, CRC checksums, TX/RX statistical counters, error modeling, etc.) that make their behavior extremely close to that of a real network adapter. A trade-off between performance and realism is unavoidable, and the NS-3 modeling accuracy comes at a price. Moreover, we can speculate that ModelNet would probably forward about 200,000-300,000 packet/s on our 2.4 GHz CPU. It is important to notice, however, that the ModelNet architecture assumes that the network emulation is performed entirely within the kernel of the host (where "pipes" are instantiated), at the highest priority, whilst our netboxes are executed in user space. Thus, even if the performance of a single netbox is certainly inferior, we have the additional flexibility of easily running different simulations in parallel on different processors.
Therefore, the overall number of packets that can be managed thanks to parallelization (the cumulative capacity in Fig. 12) makes NetBoxIT performance comparable to ModelNet when the number of virtual networks grows, despite the lower priority of user-space processing.
It is now natural to ask what the bottlenecks of the system are, to assess whether its performance can be improved. In Fig. 12 we also report the trend of the CPU load, which smoothly declines from 96% to 65%: clearly, as we increase the number of busy cores, the full exploitation of the available computing resources becomes more and more difficult. Evidently, the drop in forwarding performance cannot be directly related to a lack of computing resources, since our CPU cores remain partially idle. This behavior is not entirely surprising, however: much recent research on software routers shows that increasing the number of physical cores is not sufficient to improve the overall performance of a system, even when the software threads are carefully distributed among them. Some recent studies have also pinpointed the issues to be specifically addressed. In [30], an extensive and detailed analysis of Linux scalability over multicore platforms highlights several kinds of problems that could also apply to our situation. In short, the consistency of shared data structures is probably the main (but not the only) source of scalability limits in modern multi-threaded platforms. When several concurrent tasks use the same data, and hence need to read/write from/to a shared portion of memory, race conditions must be avoided to preserve data coherency. The first effect of this constraint is that a task may (or must) lock a shared data structure while using it: a platform where an increasing number of threads cooperatively process the same data structures is therefore also a platform where thread waiting times tend to grow progressively larger. A second effect, usually (but not necessarily) related to the first, is that when a task writes to a shared memory segment, the cache coherency protocol must invalidate any related cache copy: the CPU cores running tasks that use the same memory portion must then re-fetch data directly from main memory, since their caches are no longer synchronized with the actual data values. This problem is frequently related to locking (since locking is usually just a way to perform safe data accesses), but can occur even with lock-free shared data structures.
We can reasonably expect to encounter all these performance limitations in our architecture, since shared data and locks are both certainly employed. For instance, the Linux kernel moves packets from a network card towards main memory (via the DMA controller) using a pool of buffers, which are referenced from a single, circular list of pointers, the ring buffer. Since this list is a shared structure (to allow many applications to send/receive data to/from the same NIC), it needs some form of protection; the default policy is to lock the list while it is updated (i.e., when a packet buffer needs to be retrieved or freed).
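As a toy illustration (ours, not taken from [30] or from the NetBoxIT code) of why such shared, lock-protected structures limit scaling, the following sketch contrasts threads updating one locked, shared counter, whose cache line bounces between cores, with threads updating private, cache-line-padded counters; timing the two phases (e.g., with time(1), after compiling with -pthread) shows the first scaling far worse.

    /* Toy illustration of the two scalability costs described above:
     * lock contention and cache-line invalidation. */
    #include <pthread.h>
    #include <stdio.h>

    #define THREADS 8
    #define ITERS   10000000L

    static pthread_spinlock_t lock;
    static long shared_counter;                            /* contended data */
    static struct { long v; char pad[56]; } priv[THREADS]; /* 64-byte slots  */

    static void *contended(void *arg)
    {
        (void)arg;
        for (long i = 0; i < ITERS; i++) {
            pthread_spin_lock(&lock);      /* wait for every other thread... */
            shared_counter++;              /* ...and invalidate their caches */
            pthread_spin_unlock(&lock);
        }
        return NULL;
    }

    static void *private_only(void *arg)
    {
        long id = (long)arg;
        for (long i = 0; i < ITERS; i++)
            priv[id].v++;                  /* local slot: scales linearly    */
        return NULL;
    }

    static void run(void *(*fn)(void *))
    {
        pthread_t t[THREADS];
        for (long i = 0; i < THREADS; i++)
            pthread_create(&t[i], NULL, fn, (void *)i);
        for (int i = 0; i < THREADS; i++)
            pthread_join(t[i], NULL);
    }

    int main(void)
    {
        pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
        run(contended);
        run(private_only);
        printf("shared=%ld priv[0]=%ld\n", shared_counter, priv[0].v);
        return 0;
    }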
In the real system, such locking causes contention and could possibly create a bottleneck within the kernel networking subsystem. In our architecture, the Linux networking layer is certainly involved in copying data to/from our NS-3 virtual networks, which are just a special case of receiving/transmitting user-space applications. However, we did not find any strong evidence relating the NetBoxIT performance limits to the ring buffer locking mechanism. First, our previous measurements in Fig. 11 show the forwarding capacity of a Virtual Ethernet peer (which is instantiated entirely within the kernel networking layer) to be at least 500,000 packet/s; yet, even when we employ eight netboxes in setup (E) and a 26,000 packet/s flow crosses the kernel 16 times (moving from kernel to user space, and vice versa), the total amount of packets to be managed by the kernel (about 420,000 packet/s) is still below this limit. A more telling argument comes from the known effects of a heavily locked ring buffer: the DMA controller cannot acquire free memory addresses fast enough, and frame transfers between the card and the memory bank progressively stall. The FIFO buffer of the NIC therefore quickly fills up, ingress frames start being dropped, and the frames that do enter must wait on the NIC for a much longer average time (typically, their delay increases sharply from microseconds to milliseconds; see, for instance, [31]). We never encountered these conditions during our tests: we recorded no frame drops at the level of the network cards, and the Virtual Ethernet peer delays remained minimal (3-4 µs).
We also wondered what the real impact of cache misses on system performance could be. This effect certainly exists: frame headers are updated at the output of each hop, since their destination MAC address must change. For instance, in setup (B), each frame (i.e., its memory data structure) is touched by the first netbox (running on core #2) before being moved towards the second netbox (running on core #4): clearly, a cache miss occurs and core #4 is compelled to retrieve the data from the main memory bank. However, this problem alone does not justify the baseline performance we obtained, since the bus interconnecting the CPU and the DDR3 memory bank offers a 68 Gb/s capacity on our platform, and re-fetching an entire 64-byte frame requires a few nanoseconds, which is only a minor overhead. Hence, locking mechanisms and cache invalidations alone do not suffice to explain the decreasing performance trend we registered.
Besides, a comparison between configurations (C) and (D) shows that it is a best practice not to employ core #0 to run a netbox whenever possible, since doing so leads to lower performance. This core is preferably selected by Linux to execute the NIC driver, because the APIC (Advanced Programmable Interrupt Controller) delivers all the interrupts generated by the card to core #0 by default; an emulation task running on this core is therefore frequently interrupted, and expensive context switches with the driver must be performed.
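In practice, this best practice amounts to pinning each netbox to a core other than #0. A minimal sketch of such pinning, using the standard Linux sched_setaffinity() call, could be the following (the "ns3-netbox" binary name is hypothetical; `taskset -c 2 <cmd>` achieves the same from a shell, and the NIC IRQ affinity itself can be steered by writing to /proc/irq/<n>/smp_affinity):

    /* Sketch (ours): launching a netbox pinned away from core #0,
     * which Linux typically uses to service the NIC interrupts. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(2, &set);                      /* run this netbox on core #2 */

        if (sched_setaffinity(0, sizeof set, &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        execlp("ns3-netbox", "ns3-netbox", (char *)NULL);  /* hypothetical */
        perror("execlp");
        return 1;
    }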
There is, however, another aspect that we have not considered yet. Despite virtualization, a netbox is a process like any other, executed within the user context of Linux: hence, it is compelled to transmit and receive packets through the kernel. One of the milestones of the Linux kernel API is the pair of functions "copy_to_user" and "copy_from_user", which are employed whenever data must be moved from kernel space to user space, or vice versa.
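As an illustration (a generic kernel fragment of ours, not NetBoxIT code), a minimal character-device read handler shows where copy_to_user enters the per-packet path; the staging buffer is assumed to be filled elsewhere.

    /* Generic kernel-module fragment (ours): a minimal character-device
     * read handler. copy_to_user() performs the checked copy across the
     * kernel/user boundary and returns the number of bytes NOT copied. */
    #include <linux/fs.h>
    #include <linux/module.h>
    #include <linux/uaccess.h>

    #define KBUF_LEN 2048
    static char kbuf[KBUF_LEN];              /* kernel-side staging buffer */

    static ssize_t dev_read(struct file *filp, char __user *ubuf,
                            size_t len, loff_t *off)
    {
        if (len > KBUF_LEN)
            len = KBUF_LEN;
        /* the expensive per-packet step discussed in the text: */
        if (copy_to_user(ubuf, kbuf, len))
            return -EFAULT;
        return len;
    }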
For security reasons, user and kernel space operate in completely different address spaces, and these functions are in charge of all the necessary checks and preparations, as well as of the actual copy of the data between the two memory regions. Of course, this is not the only processing that affects a packet on its path from the NIC driver towards the user socket: after a frame is stored within an sk_buff, a long sequence of operations is executed by the layered Linux TCP/IP stack for the management of data (and meta-data), making the delivery of the packet payload a quite expensive task. For this reason, many fast packet processors are implemented within the kernel itself, or as a special in-kernel module, like the Click Modular Router [19].
To acquire a baseline reference and bring to light the price to pay when packet routing is executed in user space, we tested the user-space version of a Click IP router on our platform. During these experiments we measured a maximum loss-free forwarding rate of 120,000 packet/s: compared to the maximum capacity of the in-kernel Click on the same hardware (about 750,000 packet/s), this is a reduction by a factor of 6 that can only be ascribed to the journey of packets through the kernel layers, since the user-space router barely uses 15% of the CPU. The capacity we recorded in setup (A) therefore makes sense, considering that NS-3 is not particularly optimized for routing activities and that the simulation actually contains two routers. Very recent research fully confirms this interpretation and shows that cutting the kernel out of packet processing can bring huge gains. In [32], the Netmap framework is employed to move data from/to the NIC with reduced kernel involvement: multiple packets can be transferred by a single system call, dynamic packet buffers are replaced by static ones, and ring buffers are shared directly between kernel and user space, so that memory copies are reduced. The improvement is impressive: a Netmap-aware, user-space, Click-based bridge is 10 times faster than the in-kernel version; and pktgen, a well-known and highly specialized packet generator, can easily saturate a 10 Gb/s interface in user space (with 64-byte packets), whilst the traditional in-kernel version can barely reach 4 Mpacket/s.
Finally, in Fig. 12 (right) we report the impact of the cumulative overheads introduced by the Virtual Ethernet peers during the end-to-end frame journey, against the intrinsic (cumulative) timings required by all the netboxes of the chain (individually, each netbox processing time varies between 23 µs in setup (A) and 38 µs in setup (E), depending on the corresponding packet rate). As we stated above, these overheads stay at a minimum of 3-4 µs and do not increase as we extend the topology. For a quick comparison, the ModelNet granularity (100 µs) can lead to a cumulative overhead of 1 ms in a 10-hop configuration. It is worth noting that when we were experimenting with the emulation of the Emergency Network, the delays of the various wireless networks were much greater than the peer overheads, which could therefore be considered negligible; here, instead, they become much more comparable. The main question is: how important is this overhead with respect to the kind of network being emulated? Unfortunately, the answer cannot be unique, and must be assessed case by case. For instance, an overhead of 3 µs can be acceptable when compared against the cumulative transmission delay of a chain of two 10-100 Mb/s Ethernet segments carrying a 500-byte packet (800-80 µs), but not if we wish to emulate a pair of Gigabit links with 64-byte packets, since the overheads can then even exceed the network transmission delays.
However, further improvements are possible, and in Section 5.4 we speculate about the techniques that could be introduced in the near future to reduce these overheads and overcome these limitations.
The last question that requires a response is how far we can go with the netboxes, and which types of experiments can be conducted reliably. It is worth noting, first, that we necessarily express the capacity of the emulated networks (within NS-3) and the performance of a netbox (intended as an emulation machine) with two different metrics: the former in bits per second, the latter in packets per second. When we assign a certain capacity to a link within the NS-3 simulation, its nominal throughput represents a key constraint: the transmission time of a packet changes according to the size of the packet itself (for instance, on a 100 Mb/s link, 40 µs are required to transmit a 500-byte frame). A netbox, instead, has a capacity constraint related to its computing efficiency, measured in terms of packet rate and almost independent of packet size. To determine whether a netbox is efficient enough to meet the timing of the virtual network it is emulating, one should always keep in mind that the capacity of the netbox strictly bounds the capacity of the emulated network. Let us explain this with a simple example. In setup (A), the maximum rate we registered was 44,000 packet/s. Since packet processing is almost independent of packet size, our netbox can reliably support the emulation of an Ethernet link whose capacity ranges from 22.5 Mb/s to 528 Mb/s, depending on the ingress packet size (44,000 packet/s × 64 byte × 8 ≈ 22.5 Mb/s; 44,000 packet/s × 1500 byte × 8 = 528 Mb/s). Equivalently, since the netbox requires 23 µs to process each packet, we cannot instantiate within the NS-3 simulation a network that is faster than that: we cannot expect the simulated network to be faster than the emulator itself. This statement could seem obvious, but it should be remembered very clearly when mixed traffic is offered to a netbox, since the strictest requirement for guaranteeing a reliable emulation is set by the size of the shortest packets (referring to setup (A) again, if we inject 64-byte and 500-byte flows simultaneously, the emulator can offer a capacity of at most 22.5 Mb/s, and it makes no sense to instantiate an NS-3 link with a higher capacity).
5.4. Future opportunities
As we stated above, the Netmap framework has been demonstrated as a tool for high-speed packet exchange between a network card and a user application. In principle, it could be exploited for the fast interconnection of a netbox with the physical NIC. Unfortunately, this is not straightforward, since the EmuNetDevice in NS-3 is not Netmap-aware, and it relies on the traditional "net_device" and "sk_buff" kernel structures to communicate with the network card. Netmap, differently, maps the NIC buffers directly into user space: whenever an application needs to exchange data with a device, it must retrieve a file descriptor related to the shared buffers before starting to read from or write into them. This is a completely different approach, and one that requires the modification of the application source code.
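For concreteness, the receive pattern of a Netmap application looks roughly as follows. This is a sketch adapted in spirit from the examples in [32]; the request and ring field names follow the original Netmap API and may differ in later versions. Note how the packet buffers are mapped once into user space, so that no per-packet copy through the kernel socket layers is involved.

    /* Sketch of the Netmap receive pattern (after [32]; API details
     * are version-dependent and should be treated as an assumption). */
    #include <fcntl.h>
    #include <net/if.h>
    #include <net/netmap.h>
    #include <net/netmap_user.h>
    #include <poll.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>

    void rx_loop(const char *ifname)
    {
        struct nmreq req;
        int fd = open("/dev/netmap", O_RDWR);

        memset(&req, 0, sizeof req);
        req.nr_version = NETMAP_API;
        strncpy(req.nr_name, ifname, sizeof req.nr_name - 1);
        ioctl(fd, NIOCREGIF, &req);                  /* bind the NIC rings   */

        void *mem = mmap(NULL, req.nr_memsize,       /* shared buffer region */
                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        struct netmap_if   *nifp = NETMAP_IF(mem, req.nr_offset);
        struct netmap_ring *ring = NETMAP_RXRING(nifp, 0);

        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        for (;;) {
            poll(&pfd, 1, -1);                       /* wait for packets     */
            while (ring->avail > 0) {                /* drain the RX ring    */
                uint32_t i = ring->cur;
                char *buf = NETMAP_BUF(ring, ring->slot[i].buf_idx);
                /* process buf / ring->slot[i].len here, with no copy */
                ring->cur = NETMAP_RING_NEXT(ring, i);
                ring->avail--;
            }
        }
    }

Adapting the EmuNetDevice to this pattern would mean replacing its socket-based read/write path with the mapped-ring accesses above, which is precisely the source-level modification mentioned in the text.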
Furthermore, an even more interesting scenario could be opened by the utilization of user-space shared buffers to provide high-speed inter-process communication directly between different NS-3 emulators, with no kernel involvement. At the time of this writing, we found that VALE [33] might be a tool close to what we are aiming at. Preliminary results currently being published report that it is possible to exchange 60-byte frames between user processes at 17.6 Mpacket/s, i.e., just 50-60 ns per packet, a delay about two orders of magnitude smaller than with VETH peers. Such an improvement could support 10 Gb/s interconnections among netboxes even with short packets, and allow NetBoxIT to be used in a much wider spectrum of experiments.
6. Conclusions
In this paper we presented NetBoxIT, a modular, flexible, and scalable platform that exploits LXC containers and the NS-3 simulator for the assessment of heterogeneous networks. We described its implementation issues, and we used it to evaluate the end-to-end performance of a multipart Emergency Network. Our investigation mostly focused on examining, by means of plain and intuitive trials, whether the proposed testbed architecture leads to realistic results. Moreover, NetBoxIT is conceived as an open platform, where virtual networks can inter-operate with real devices and applications through the testbed Ethernet ports; accordingly, we introduced a real VoIP application and a real IP router in some of our trials, with the aim of increasing the realism of the emulations. In the near future, we intend to further exploit the testbed interoperability for a subjective estimate of the quality of experience perceived by the end user, and to refine the network operating parameters accordingly. Finally, we investigated the performance bottlenecks of our architecture, to show its current limitations and to identify how these could be overcome in the future.
Acknowledgment
The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 242411.
References
[1] D. Gupta, K. Yocum, M. McNett, A.C. Snoeren, A. Vahdat, G.M. Voelker, To infinity and beyond: time warped network emulation, in: ACM Symposium on Operating Systems Principles, 2005.
[2] M. Carbone, L. Rizzo, Dummynet revisited, SIGCOMM Comput. Commun. Rev. 40 (2) (2010) 12-20. doi: 10.1145/1764873.1764876.
[3] S. Agarwal, J. Sommers, P. Barford, Scalable network path emulation, in: Proceedings of the 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS '05, IEEE Computer Society, Washington, DC, USA, 2005, pp. 219-228. doi: 10.1109/MASCOT.2005.61.
[4] A. Vahdat, K. Yocum, K. Walsh, P. Mahadevan, D. Kostić, J. Chase, D. Becker, Scalability and accuracy in a large-scale network emulator, SIGOPS Oper. Syst. Rev. 36 (SI) (2002) 271-284. doi: 10.1145/844128.844154.
[5] M. Pizzonia, M. Rimondini, Netkit: easy emulation of complex networks on inexpensive hardware, in: Proceedings of the 4th International Conference on Testbeds and Research Infrastructures for the Development of Networks & Communities, TridentCom '08, ICST, Brussels, Belgium, 2008, pp. 7:1-7:10.
[6] User-mode Linux Kernel.
[7] The LXC Linux Containers.
[8] S. Bhatia, M. Motiwala, W. Muhlbauer, Y. Mundada, V. Valancius, A. Bavier, N. Feamster, L. Peterson, J. Rexford, Trellis: a platform for building flexible, fast virtual networks on commodity hardware, in: Proceedings of the 2008 ACM CoNEXT Conference, CoNEXT '08, ACM, New York, NY, USA, 2008, pp. 72:1-72:6. doi: 10.1145/1544012.1544084.
[9] A. Alvarez, R. Orea, S. Cabrero, X.G. Pañeda, R. García, D. Melendi, Limitations of network emulation with single-machine and distributed ns-3, in: Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, SIMUTools '10, ICST, Brussels, Belgium, 2010, pp. 67:1-67:9. doi: 10.4108/ICST.SIMUTOOLS2010.8630.
[10] B. Lantz, B. Heller, N. McKeown, A network in a laptop: rapid prototyping for software-defined networks, in: Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks, Hotnets-IX, ACM, New York, NY, USA, 2010, pp. 19:1-19:6. doi: 10.1145/1868447.1868466.
[11] VETH.
[12] J. Ahrenholz, Comparison of CORE network emulation platforms, in: Military Communications Conference, MILCOM 2010, 2010, pp. 166-171. doi: 10.1109/MILCOM.2010.5680218.
[13] J. Ahrenholz, T. Goff, B. Adamson, Integration of the CORE and EMANE network emulators, in: Military Communications Conference, MILCOM 2011, 2011, pp. 1870-1875. doi: 10.1109/MILCOM.2011.6127585.
[14] NS-3 Project Homepage.
[15] J. Zhang, Z. Qin, TapRouter: an emulating framework to run real applications on simulated mobile ad hoc network, in: Proceedings of the 44th Annual Simulation Symposium, ANSS '11, Society for Computer Simulation International, San Diego, CA, USA, 2011, pp. 39-46.
[16] M. Skjegstad, F. Johnsen, J. Nordmoen, An emulated test framework for service discovery and MANET research based on ns-3, in: 5th International Conference on New Technologies, Mobility and Security (NTMS), 2012, pp. 1-5. doi: 10.1109/NTMS.2012.6208683.
[17] G. Calarco, M. Casoni, Virtual networks and software router approach for wireless emergency networks design, in: VTC Spring, IEEE, 2011, pp. 1-5.
[18] E. Weingärtner, H. vom Lehn, K. Wehrle, A performance comparison of recent network simulators, in: Proceedings of the 2009 IEEE International Conference on Communications, ICC '09, IEEE Press, Piscataway, NJ, USA, 2009, pp. 1287-1291.
[19] E. Kohler, R. Morris, B. Chen, J. Jannotti, M.F. Kaashoek, The Click modular router, ACM Trans. Comput. Syst. 18 (3) (2000) 263-297. doi: 10.1145/354871.354874.
[20] G. Calarco, C. Raffaelli, Implementation of implicit QoS control in a modular software router context, in: Proceedings of the Third International Conference on Quality of Service in Multiservice IP Networks, QoS-IP '05, Springer-Verlag, Berlin, Heidelberg, 2005, pp. 390-399. doi: 10.1007/978-3-540-30573-6_30.
[21] G. Calarco, M. Casoni, A. Paganelli, D. Vassiliadis, M. Wódczak, A satellite based system for managing crisis scenarios: the E-SPONDER perspective, in: 5th Advanced Satellite Multimedia Systems Conference (ASMS) and 11th Signal Processing for Space Communications Workshop (SPSC), 2010, pp. 278-285. doi: 10.1109/ASMS-SPSC.2010.5586905.
[22] H. Skinnemoen, Creating the next generation DVB-RCS satellite communication & applications: the largest standards initiative for satellite communication inspires new opportunities, in: 5th Advanced Satellite Multimedia Systems Conference (ASMS) and 11th Signal Processing for Space Communications Workshop (SPSC), 2010, pp. 147-154. doi: 10.1109/ASMS-SPSC.2010.5586868.
[23] G. Calarco, M. Casoni, NetBoxIT: virtual emulation integrated testbed for the heterogeneous networks design, in: 18th IEEE Workshop on Local and Metropolitan Area Networks (LANMAN), 2011, pp. 1-2. doi: 10.1109/LANMAN.2011.6076929.
[24] O. Grondalen, P. Gronsund, T. Breivik, P. Engelstad, Fixed WiMAX field trial measurements and analyses, in: Mobile and Wireless Communications Summit, 16th IST, 2007, pp. 1-5. doi: 10.1109/ISTMWC.2007.4299213.
[25] J. Martin, B. Li, W. Pressly, J. Westall, WiMAX performance at 4.9 GHz, in: Aerospace Conference, 2010 IEEE, 2010, pp. 1-8. doi: 10.1109/AERO.2010.5446943.
[26] N. Baldo, M. Requena-Esteso, J. Núñez Martínez, M. Portolès-Comeras, J. Nin-Guerrero, P. Dini, J. Mangues-Bafalluy, Validation of the IEEE 802.11 MAC model in the ns-3 simulator using the EXTREME testbed, in: Proceedings of the 3rd International ICST Conference on Simulation Tools and Techniques, SIMUTools '10, ICST, Brussels, Belgium, 2010, pp. 64:1-64:9. doi: 10.4108/ICST.SIMUTOOLS2010.8705.
[27] J.A.R.P. de Carvalho, H. Veiga, P.A.J. Gomes, C.F.F.P.R. Pacheco, N. Marques, A.D. Reis, Laboratory performance of Wi-Fi point-to-point links: a case study, in: Proceedings of the 2009 Conference on Wireless Telecommunications Symposium, WTS '09, IEEE Press, Piscataway, NJ, USA, 2009, pp. 143-147.
[28] LinPhone, free SIP VoIP client.
[29] Wireshark.
[30] S. Boyd-Wickizer, A.T. Clements, Y. Mao, A. Pesterev, M.F. Kaashoek, R. Morris, N. Zeldovich, An analysis of Linux scalability to many cores, in: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI '10, USENIX Association, Berkeley, CA, USA, 2010, pp. 1-8.
[31] R. Bolla, R. Bruschi, The IP lookup mechanism in a Linux software router: performance evaluation and optimizations, in: Workshop on High Performance Switching and Routing, HPSR '07, 2007, pp. 1-6. doi: 10.1109/HPSR.2007.4281242.
[32] L. Rizzo, Netmap: a novel framework for fast packet I/O, in: Proceedings of the 2012 USENIX Annual Technical Conference, USENIX ATC '12, USENIX Association, Berkeley, CA, USA, 2012.
[33] VALE, a switched Ethernet for virtual machines.