
Highlights
• Multi-GPU implementation for the nonlinear signal propagation in multimode fibers
• Different approaches to realize the GPU-GPU communication are explored and compared
• A mode-division multiplexed data transmission using 120 modes is evaluated


Simulation of Nonlinear Signal Propagation in Multimode Fibers on Multi-GPU Systems

Marius Brehler a,∗, Malte Schirwon b, Peter M. Krummrich a, Dominik Göddeke b

a TU Dortmund, Chair for High Frequency Technology, Friedrich-Wöhler-Weg 4, 44227 Dortmund, Germany
b University of Stuttgart, Institute for Applied Analysis and Numerical Simulation, Allmandring 5b, 70569 Stuttgart, Germany

∗ Corresponding author. Email address: [email protected] (Marius Brehler)

Abstract

Mode-division multiplexing (MDM) is seen as a possible solution to satisfy the rising capacity demands of optical communication networks. To make MDM a success, fibers supporting the propagation of a large number of modes, i.e. several tens, are of interest. Many of the system aspects occurring during the propagation can be evaluated by using appropriate models. However, fibers are a nonlinear medium and, therefore, numerical simulations are required. For a large number of modes, the simulation of the nonlinear signal propagation, in particular for telecommunication applications, leads to new challenges, for example regarding the required memory, which we address with an implementation incorporating multiple GPU-accelerators. In this paper, we evaluate two different approaches to realize the communication between the GPUs and analyze the performance for simulations involving up to 8 Tesla GPUs. To the best of our knowledge, we are the first to explore a multi-GPU approach to simulate the nonlinear signal propagation in multimode fibers. This allows us to show results for an MDM transmission system utilizing the extremely large number of 120 spatial modes in a fiber with a core diameter of 62.5 µm as an application example and to analyze the impact of the nonlinear effects on the transmitted signals.

Keywords: Multi-GPU, Parallel computing, Fiber optics, Mode-division multiplexing, Nonlinear signal propagation

1. Introduction

One of the main challenges in the design of future optical networks is to satisfy the growing capacity demand. A promising approach to solve this challenge is to use the so far mostly untapped spatial dimension. Space-division multiplexing (SDM) has attracted a lot of attention in recent years, both in industry and academic research. One option to realize an SDM system is the use of multimode fibers (MMF), where each mode capable of propagation is used as a channel for individual signals, referred to as mode-division multiplexing (MDM) [1]. Recently, the utilization of 45 spatial modes in a multimode fiber as individual transmission channels was demonstrated experimentally for the first time [2]. With the availability of mode multiplexers for 45 Hermite-Gaussian modes [3, 4], and the potential to excite even more modes [5], the investigation of MDM systems supporting a large mode count is becoming more and more relevant. During the design process of fiber optic transmission systems, numerical simulations are the common choice to study different system aspects. However, especially for a large mode count, new challenges arise within the simulation. As fused silica is a nonlinear medium [6], the simulation of the propagation of optical radiation in the fiber is quite challenging. The nonlinear signal propagation can be described by coupled partial differential equations for which a closed-form solution exists only in very few special cases. Therefore, numerical methods are required to approximate solutions. Exploring the impact of nonlinear effects in the case of data transmission is already challenging for only a single propagating mode, since long signal sequences need to be simulated. Therefore, GPU-accelerators can be used to speed up such simulations [7, 8, 9]. The numerical effort rises sharply when optical fibers which enable the propagation


of multiple modes, especially fibers with a core diameter ≥ 50 µm, are the target of interest. In those fibers, several tens or even more than 100 spatial modes can be used as spatial channels. Moreover, the restricted amount of GPU-memory limits the approach to accelerate the simulation of the nonlinear signal propagation [10]. As a result, publications considering the nonlinear signal propagation in MMFs numerically are mostly limited to only a few modes if only a single GPU is used, e.g. 15 spatial modes in [11]. In this paper, we explore the possibility to distribute the simulation of a transmission scenario in a single fiber to multiple GPU-accelerators. Here, we realize the communication between the GPUs with the Message Passing Interface (MPI) or the NVIDIA Collective Communications Library (NCCL). Only with multi-GPU implementations are simulations with up to 36 spatial modes and 60 wavelength channels per mode [12] possible in acceptable time; a preliminary version of our MPI-implementation was used for those simulations. To the best of our knowledge, the approach to use multiple GPU-accelerators to simulate the nonlinear signal propagation in multimode fibers has not been reported so far.

The paper is organized as follows: We first briefly present the mathematical description of the nonlinear signal propagation in multimode fibers. Next, we review the numerical methods, followed by our MPI and NCCL implementations, as well as GPU-specific modifications to the code required for the simulation of many modes. The description of the implementation is followed by benchmarking the implementation incorporating up to 8 GPUs. Finally, we use the application to demonstrate the simulation of an MDM transmission system in which a fiber with 62.5 µm core diameter provides 120 spatial modes as MDM channels. Such a simulation would not have been possible without the described multi-GPU implementation due to an otherwise unacceptable processing duration.

2. Modeling of the Nonlinear Signal Propagation in Multimode Fibers

All established models capturing the nonlinear signal propagation are based on the nonlinear Schrödinger equation or the Manakov equation. To describe the nonlinear signal propagation in a multimode fiber, the nonlinear Schrödinger [13] or the Manakov equation [14, 15] can be given as

\frac{\partial A_a}{\partial z} = \underbrace{-\frac{\alpha}{2} A_a + i \sum_{n=0} \beta_{n,a} \frac{i^n}{n!} \frac{\partial^n A_a}{\partial t^n}}_{\hat{L}} + \underbrace{i \gamma \left( \kappa_{aa} |A_a|^2 + \sum_{b \neq a} \kappa_{ab} |A_b|^2 \right) A_a}_{\hat{N}} , \qquad (1)

where A represents the slowly varying envelopes of the spatial modes a and b. The linear part L̂ covers the attenuation and the dispersion occurring in an optical fiber. The attenuation is given by α, and the dispersion is taken into account by βn, the coefficients of the Taylor series expansion of the propagation constants. Within the nonlinear part N̂, the parameter γ is defined as γ = (n2 ω0)/(c0 Aeff,LP0,1). Here, n2 is the nonlinear refractive index related to the Kerr effect, ω0 is the center frequency of the optical signal, and c0 is the speed of light in vacuum. Furthermore, the effective area of the fundamental mode Aeff,LP0,1 is included in the definition of γ. The intramodal nonlinear coupling coefficient is specified as κaa, whereas the intermodal interaction is considered by κab. Accordingly, these coefficients cover the information on the overlap of the spatial mode profiles. The fiber nonlinearity leads to nonlinear phase rotation, i.e. a phase modulation depending on the power of the optical signals occurs. In conjunction with the dispersion, the phase modulation is converted to an amplitude modulation. As a result, the transmitted optical signals are distorted, meaning the Kerr effect leads to nonlinear impairments. While only intramodal nonlinear effects occur during the signal propagation in a single-mode fiber, this is not the case for multimode fibers. In multimode fibers, the intermodal effects can potentially lead to a significant increase of the nonlinear impairments, as multiple spatial modes are involved in the nonlinear interaction. Therefore, it is essential to consider the intermodal effects in the modeling of the signal propagation in multimode fibers. The numerical complexity thus increases significantly, since the intermodal nonlinear effects are considered by incorporating the weighted squared absolute values |Ab|², which we will discuss later. An elaborate description of the nonlinear effects in fiber optics is given inter alia in [6]. In addition, see [16] for a detailed discussion on modeling the nonlinear propagation in multimode fibers.
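To make the structure of the nonlinear part N̂ concrete, the following minimal C sketch evaluates the bracketed nonlinear term of Eq. (1) for a single mode a at one time sample. It is purely illustrative: the names (nonlinear_term, num_modes, kappa, gamma_nl) are our own and do not refer to the GPU kernels discussed later.

#include <complex.h>
#include <stdio.h>

/* Illustrative sketch (hypothetical names, not the authors' kernel): evaluates
 *   i*gamma*( kappa_aa*|A_a|^2 + sum_{b!=a} kappa_ab*|A_b|^2 ) * A_a
 * for one mode a at a single time sample. */
static double complex nonlinear_term(int a, int num_modes,
                                     const double complex *A, /* envelopes of all modes at this sample */
                                     const double *kappa,     /* num_modes x num_modes coupling matrix */
                                     double gamma_nl)
{
    double sum = 0.0;
    for (int b = 0; b < num_modes; b++) {
        double re = creal(A[b]), im = cimag(A[b]);
        /* b == a contributes the intramodal (SPM-like) part, b != a the intermodal (XPM-like) part */
        sum += kappa[a * num_modes + b] * (re * re + im * im);
    }
    return I * gamma_nl * sum * A[a];
}

int main(void)
{
    /* two modes with arbitrary sample values and coupling coefficients */
    double complex A[2] = { 1.0 + 0.5 * I, 0.3 - 0.2 * I };
    double kappa[4]     = { 1.0, 0.8,
                            0.8, 1.0 };
    double complex dAdz = nonlinear_term(0, 2, A, kappa, 6.1e-4);
    printf("nonlinear contribution to dA_0/dz: %g %+gi\n", creal(dAdz), cimag(dAdz));
    return 0;
}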


3. Numerical Approximation

Since analytical solutions can only be calculated for a few special cases, numerical methods are required for the evaluation of Eq. (1). To approximate the solution of Eq. (1), pseudo-spectral methods like the split-step Fourier method (SSFM) [6] or the fourth-order Runge-Kutta in the Interaction Picture (RK4IP) method [17] can be used. Prior to the discussion of the GPU-accelerated implementation, we briefly review these methods.

3.1. The Symmetric Split-Step Fourier Method

The formal solution to Eq. (1) is given by

A(z + h, T) = \exp\left[ h \left( \hat{L} + \hat{N} \right) \right] A(z, T) . \qquad (2)

With the Baker-Hausdorff formula [18], this can be approximated as

A(z + h, T) \approx \exp\left( h \hat{L} \right) \exp\left( h \hat{N} \right) A(z, T) , \qquad (3)

allowing the linear part L̂ and the nonlinear part N̂ to be solved independently of each other. The linear part L̂ is solved in the frequency domain and the nonlinear part N̂ is solved in the time domain. The resulting splitting error can be further reduced by applying a symmetric split-step approach:

A(z + h, T) \approx \exp\left( \frac{h}{2} \hat{L} \right) \exp\left( \int_z^{z+h} \hat{N}(z') \, dz' \right) \exp\left( \frac{h}{2} \hat{L} \right) A(z, T) \qquad (4)

The nonlinear part can either be solved by an iterative approach as described in [6] or with explicit schemes like the Runge-Kutta method. This results in a third-order accuracy in the step size h and a global error of O(h²). The two variants are denoted here as SSFM-Agrawal and SSFM-RK4, the latter using a fourth-order Runge-Kutta method to solve the nonlinear step.

3.2. The Fourth-Order Runge-Kutta in the Interaction Picture Method

To avoid the splitting error, Eq. (1) can be transformed with

A_I = \exp\left( -(z - z_0) \hat{L} \right) A \qquad (5)

into the 'Interaction Picture', where z_0 is the separation distance. This allows the use of explicit schemes like the fourth-order Runge-Kutta method to solve the differentiated form of Eq. (5). In contrast to the SSFM, no splitting is required, and the numerical accuracy is primarily limited by the applied explicit scheme. With the separation distance z_0 defined as z + h/2, the algorithm to advance A(z, T) to A(z + h, T) is

A_I = \exp\left( \frac{h}{2} \hat{L} \right) A(z, T) \qquad (6a)

k_1 = \exp\left( \frac{h}{2} \hat{L} \right) \left[ h \hat{N}\big( A(z, T) \big) \, A(z, T) \right] \qquad (6b)

k_2 = h \hat{N}\left( A_I + \frac{k_1}{2} \right) \cdot \left[ A_I + \frac{k_1}{2} \right] \qquad (6c)

k_3 = h \hat{N}\left( A_I + \frac{k_2}{2} \right) \cdot \left[ A_I + \frac{k_2}{2} \right] \qquad (6d)

k_4 = h \hat{N}\left( \exp\left( \frac{h}{2} \hat{L} \right) \left[ A_I + k_3 \right] \right) \cdot \exp\left( \frac{h}{2} \hat{L} \right) \left[ A_I + k_3 \right] \qquad (6e)

A(z + h, T) = \exp\left( \frac{h}{2} \hat{L} \right) \left[ A_I + \frac{k_1}{6} + \frac{k_2}{3} + \frac{k_3}{3} \right] + \frac{k_4}{6} \qquad (6f)

as given in [17]. This method exhibits a local error of fifth order and is globally fourth-order accurate. A detailed analysis of the RK4IP method as well as a theoretical and numerical comparison between the RK4IP method and the SSFM-RK4 are given in [19]. In addition, [20] provides a comparison between the SSFM and the RK4IP method, focusing on the nonlinear signal propagation in multimode fibers. Regarding multi-GPU implementations of the reviewed methods, the implementations primarily differ in the amount of required communication, as discussed in the next section. Apart from that, the implementations can be kept quite similar to each other.

4. Implementation of the Numerical Methods

The signals can be represented by a matrix of sampled data with the dimension 2M × N. Here, M is the number of spatial modes, and the factor 2 results from taking both orthogonal polarization planes into account. The number of discrete time samples is given by N. Thus, each row represents a spatial or polarization mode. To investigate Kerr-based nonlinear effects, namely self-phase modulation (SPM) and especially cross-phase modulation (XPM), long symbol sequences need to be considered, in particular in optical communication systems. Furthermore, if wavelength-division multiplexing (WDM) is of interest, and thus the impact of four-wave mixing (FWM) should be evaluated, each symbol needs to be represented by an appropriate number of samples to simulate a sufficiently large frequency spectrum. For example, 256 samples per symbol, denoted as Nsps, were used in [11] to simulate a spectral range of 8.192 THz. Here, it is important to note that the mode profiles can be safely assumed to be constant within the signal bandwidth in optical communications [16]. For an even larger frequency range, this might no longer be valid. In this case, also the coefficients κ cannot be considered constant. Taking the frequency dependence of the mode profiles into account, the nonlinear propagation can be modeled based on the approach described in [21] and [22]. This approach is reported to be superior to the more conventional approach used here, which expresses the nonlinear polarization by a sum of mode overlap integrals. However, that method does not relax the memory demand arising in the simulation of long sampled signals combined with a sufficiently large frequency range. In the simulation reported in [11], M = 15 spatial modes and Ns = 2^14 symbols per spatial and per polarization mode were considered. With N = Ns · Nsps, this results in a complex-valued dense matrix of size 30 × 2^22, which requires 1920 MiB of storage. When further increasing the number of spatial modes M, the matrix containing the sampled signal might still fit into the GPU-memory, but not all intermediate results do any longer. Indeed, also the documentation of the freely available solver released along with [23] states that the GPU-memory determines the number of modes. According to the documentation, the nonlinear propagation of up to 125 modes on a Titan X GPU is possible, where each mode is represented by N = 1013 samples and the parallel extent for the Massively Parallel Algorithm (MPA) is set to 10. Hence, for larger N, as of interest here, either the number of modes has to be reduced or an alternative solution becomes necessary. We therefore propose to split the 2M polarization and spatial modes to K processes.
Since N ≫ 2M, this approach has several advantages over splitting N contiguous samples of a unique spatial or polarization mode to different processes, as discussed below. Finally, we point out that several implementations to simulate the nonlinear signal propagation in optical fibers are available to the public [23, 24, 25], either already capable of simulating the propagation of multiple modes or representing a good starting point for implementational extensions. Furthermore, the nonlinear signal propagation in multimode fibers can also be simulated using readily available ordinary differential equation (ODE) solvers in MATLAB [26]. Consequently, in this paper we focus on the challenge arising from the required memory and on how an implementation incorporating multiple GPUs can be realized in a way that the additional data transfer effort does not overly increase the simulation time.

4.1. Splitting the Numerical Problem

In [27, 28] the split-step Fourier method is parallelized by using distributed fast Fourier transform implementations. However, this requires a lot of communication between the involved compute nodes. Instead of letting multiple processes take part in the calculation of a spatial or polarization mode, only entire modes are distributed to the different processes. Here, each process is associated with one GPU, but the process itself can still involve multiple threads. Thus, the N samples of a single signal are only required and processed by one unique process. As proposed, the channels are equally distributed to K processes. With this, each process computes 2M/K channels, as illustrated in Fig. 1.

Figure 1: Signal matrix split to multiple processes. Each of the K processes holds a block of (2M)/K rows of the 2M × N signal matrix.
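The bookkeeping implied by this splitting is straightforward. The following small C program, which is purely illustrative and not part of the solver, computes for the 30 × 2^22 example quoted above and an assumed split onto K = 2 processes the per-process row range, the memory occupied by the complex-valued signal matrix, and the amount of real-valued |A|² data each process has to send and receive for the nonlinear step discussed below.

#include <stdio.h>

int main(void)
{
    /* Illustrative numbers: M = 15 spatial modes, N = 2^22 samples (example from the
     * text), and an assumed split onto K = 2 processes. */
    const long long M = 15, K = 2, N = 1LL << 22;
    const long long rows_total = 2 * M;           /* two polarizations per spatial mode */
    const long long rows_local = rows_total / K;  /* rows held by each process          */

    for (long long rank = 0; rank < K; rank++)
        printf("rank %lld holds rows %lld..%lld\n",
               rank, rank * rows_local, (rank + 1) * rows_local - 1);

    /* complex double sample: 16 B; real-valued |A|^2 sample: 8 B */
    printf("signal matrix total : %lld MiB\n", rows_total * N * 16 / (1024 * 1024));
    printf("per process         : %lld MiB\n", rows_local * N * 16 / (1024 * 1024));
    printf("|A|^2 sent per proc : %lld MiB\n", rows_local * N * 8 / (1024 * 1024));
    printf("|A|^2 recv per proc : %lld MiB\n", (rows_total - rows_local) * N * 8 / (1024 * 1024));
    return 0;
}

For the example values this reproduces the 1920 MiB signal matrix and, summed over both processes, the 960 MiB of squared absolute values mentioned in the text.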

The matrix representing the sampled signal is stored row-major, and the rows are aligned linearly in memory. Therefore, the memory alignment is optimized for the fast Fourier transforms (FFT), as discussed in [20]. The computation of the linear step L̂ can be executed fully in parallel by each process independently. Only for the calculation of the nonlinear step N̂ is information from the other processes required, namely the squared absolute values of the envelopes A of all modes not locally available. The SSFM-Agrawal requires the computation of |A|² once at the position z for the first iteration and at the position z + h for every following iteration. Using the SSFM-RK4, the values |A|² are required to calculate k1, k2, k3, and k4, which is the same for the RK4IP method. The squared absolute values |A|² can be stored real-valued. Therefore, in every iteration, or rather for every calculation of kn, (2M − 2M/K) · N real-valued numbers have to be provided by the other processes and each process has to share its (2M/K) · N computed values. The squared absolute values |A|² are exchanged via MPI or NCCL. Due to the large signal matrices, one has to expect quite large messages even if communication is kept minimal with our splitting approach. For the previous example with matrix size 30 × 2^22, sharing all squared absolute values would result in a message size of 960 MiB. In the following, we apply our modifications to the RK4IP method. The RK4IP allows more than doubled step sizes h in the simulation of MDM transmission systems, as shown in [20]. Hence, less data exchange is required for the RK4IP method. Nevertheless, the presented approach can be applied in an identical fashion to the SSFM-Agrawal and the SSFM-RK4, as previously mentioned.

4.2. MPI-Implementation

One option to realize the communication between the involved GPUs is to use the Message Passing Interface [29]. Using MPI has the advantage that the GPUs do not necessarily have to be placed in the same compute node. Here, one MPI process per GPU is used. With the availability of CUDA-aware MPI [30] implementations, the programmer does not have to stage the data in the host memory, as the GPU buffers can be directly passed to MPI. A naive approach to realize the communication via MPI is the use of collective operations like MPI_Bcast or MPI_Allgather. However, these rely on blocking communication, and CUDA-aware implementations that support non-blocking collectives are still under development. Using non-blocking communication instead has the advantage of overlapping communication and processing of the data. Overlapping communication and computations is essential to hide communication costs and to obtain good scalability. We therefore decided to explicitly exchange data via asynchronous, and therefore non-blocking, send and receive operations, namely MPI_Isend and MPI_Irecv. The program sequence is described in Listing 1.

Listing 1: Basic program flow to compute N̂ incorporating MPI.

void send_sqrabs(const int rank, const int size, const REAL *sqrabs, ..,
                 const int num_elem, MPI_Request *send_req)
{
    for (int rk = 0; rk < rank; rk++)
        MPI_Isend(&sqrabs[..], num_elem, MPI_DOUBLE, .., MPI_COMM_WORLD, &send_req[rk]);
    for (int rk = rank + 1; rk < size; rk++)
        MPI_Isend(&sqrabs[..], num_elem, MPI_DOUBLE, .., MPI_COMM_WORLD, &send_req[rk - 1]);
}

void recv_sqrabs(const int rank, const int size, REAL *sqrabs, ..,
                 const int num_elem, MPI_Request *recv_req)
{
    for (int rk = 0; rk < rank; rk++)
        MPI_Irecv(&sqrabs[..], num_elem, MPI_DOUBLE, .., MPI_COMM_WORLD, &recv_req[rk]);
    for (int rk = rank + 1; rk < size; rk++)
        MPI_Irecv(&sqrabs[..], num_elem, MPI_DOUBLE, .., MPI_COMM_WORLD, &recv_req[rk - 1]);
}

calc_squareabs(..);
recv_sqrabs(..);
send_sqrabs(..);
calc_nonlinear_own(..);

all_done = 0;
while (all_done < size - 1) {
    MPI_Waitany(size - 1, recv_req, &rk_idx, ..);
    calc_nonlinear_others(.., rk_idx, ..);
    all_done++;
}
apply_nonlinear_all(..);
After the computation of |A|² for the (2M)/K modes persisting on the GPU, we initialize the data exchange operations. The values are sent via MPI_Isend in the send_sqrabs() function, and matching MPI_Irecv commands are posted in the recv_sqrabs() function. As mentioned before, these operations are non-blocking and therefore both commands return immediately, even if the transfers are not finished. Next, the CUDA kernel is launched to calculate the contribution to the nonlinear phase rotation of the modes that are persisting on the GPU. This is non-blocking again. Afterwards, the blocking operation MPI_Waitany is called to wait until any of the MPI_Irecv commands has finished; if so, the contribution of the received |A|² values to the nonlinear phase rotation is calculated. Once all |A|² values of the K − 1 other processes are received and all contributions are taken into account, the nonlinear phase rotation is finally applied to the modes persisting on the GPU. Further details on the implementation are given in Appendix A and Appendix B. This approach scales perfectly if the time needed to receive the next data is shorter than the time for the simultaneously performed computations. In this case, the GPU does not have to wait for the next data, since these are received while the GPU is performing computations. The first work package is always available on the GPU, since this is the calculation of calc_nonlinear_own(), for which no data needs to be received. However, the execution of calc_nonlinear_others() relies on data sent from the other processes. In practice, the possible overlap strongly depends on the simulation set-up, i.e. the number of spatial modes M and samples N, and is limited by the number of involved GPUs K as well as the interconnects between the GPUs.

4.3. NCCL-Implementation

A higher-level approach is to exchange data via the NVIDIA Collective Communications Library (NCCL) [31]. NCCL supports multiple GPUs installed in a single node or across multiple nodes. The library provides topology-aware collective communication primitives and features multiple ring formations for high bus utilization. Within NCCL, the collectives are implemented in a single kernel and are therefore associated with a so-called CUDA stream [32]. The NCCL calls return when the operation is enqueued to the specified stream, and the collective operation is executed asynchronously. In our implementation, ncclAllGather is used to aggregate the data. As depicted in Listing 2,

Listing 2: Basic program flow to compute N̂ incorporating NCCL.

calc_squareabs(..);
ncclAllGather((const void *)&sqrabs[..], (void *)sqrabs, num_elem,
              ncclDouble, comm, stream_a);
calc_nonlinear_own(.., stream_b);
cudaStreamSynchronize(stream_b);

for (int rk_idx = 0; rk_idx < size - 1; rk_idx++)
    calc_nonlinear_others(.., rk_idx, .., stream_a);
apply_nonlinear_all(.., stream_a);

we use different streams for the kernel launch within calc_nonlinear_own() and for the remaining kernel calls to enable concurrent execution. To enable the implementation to utilize multiple nodes, we use NCCL together with MPI. Hence, each GPU is associated with an MPI process as before. A common NCCL communicator spanning all processes is initialized as described in Appendix C.

4.4. GPU-Acceleration

The basic GPU-accelerated implementation of the RK4IP method using a single GPU-accelerator is described in [20] and further optimization is presented in [33]. However, in addition to the previously described adaptations, further modifications to the GPU code are required to make it capable of running on multiple GPUs. In the preceding single-node implementation, only a single CUDA kernel capturing the nonlinear effects was launched. As shown before, this is now split up into an 'own' kernel, responsible for calculating the nonlinear phase rotation of the locally stored modes, and an 'others' kernel, responsible for calculating the nonlinear phase rotation induced by the modes not locally available. Thus, K − 1 instances of the latter kernel have to be launched. The overall nonlinear phase rotation is stored in an additional array of size (2M/K) · N. Both kernels incorporate so-called shared memory to alleviate the penalty occurring due to column access to the memory [20, 33]. In contrast to the single-node implementation, applying the nonlinear phase rotation to the locally available modes now only requires row access instead of column access to the memory. Applying the nonlinear phase rotation is performed by an additional kernel, as already indicated in Listings 1 and 2. In addition, the interaction matrix no longer fits into the constant memory for a large number of modes without using a splitting approach. Storing all κ values requires a matrix of 2M × 2M elements. Assuming a symmetric matrix, which is the case for linearly polarized (LP) modes [13], it is sufficient to only store the upper triangular matrix, reducing the number of elements to (2M · 2M)/2 + M. This is exemplified for the case of M = 2 and K = 2 in Fig. 2. For a huge number of modes, e.g. M = 120, still 28920 double precision values of 8 B would need to

Figure 2: Exemplary interaction matrix (elements κ11 ... κ44) for M = 2 and K = 2. The upper triangular matrix is highlighted in blue; all elements required for the calculations on GPU 2 are shaded red and green. By considering the symmetry, the light and dark shaded green elements are identical and only one of the sub-matrices needs to be stored. Furthermore, the light red shaded element can be neglected.

be stored in the constant memory, of which only 64 KiB are available. Therefore, this approach does not lead to a sufficient saving. However, only 2M/K · 2M · 2 − (2M/K)² elements need to be accessed for the calculations. For GPU 2, these are the red and green shaded elements in Fig. 2. The other elements of the matrix are only required on the other involved GPU. Taking the symmetry into account again, it is sufficient to save only the rows or columns which apply to the modes considered on a given GPU. With this in mind, the number of elements can be reduced to 2M/K · 2M. Furthermore, for the κ coefficients describing the nonlinear coupling between the modes persisting on the GPU, it would again be sufficient to only store the upper triangular matrix, as visualized in Fig. 2. However, distributing the matrix via MPI and the necessary index arithmetic are more complicated in this case, and only (2M/K · 2M/K)/2 − M/K additional elements can be saved.
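A minimal sketch of this reduced storage scheme, with illustrative names of our own choosing (the actual implementation is not reproduced here): each process keeps only the 2M/K rows of the full 2M × 2M κ matrix that belong to its locally held modes and looks up coefficients with a simple offset calculation.

#include <stdlib.h>

/* Illustrative sketch (hypothetical names): per-GPU storage of the coupling
 * coefficients. Each process keeps only the 2M/K rows of the full 2M x 2M
 * kappa matrix that correspond to its locally held modes. */
typedef struct {
    int rows_total;   /* 2M                                  */
    int rows_local;   /* 2M/K                                */
    int row_offset;   /* rank * 2M/K, first local mode index */
    double *kappa;    /* rows_local x rows_total, row-major  */
} kappa_local_t;

static kappa_local_t kappa_alloc(int M, int K, int rank)
{
    kappa_local_t k;
    k.rows_total = 2 * M;
    k.rows_local = 2 * M / K;
    k.row_offset = rank * k.rows_local;
    k.kappa = (double *)calloc((size_t)k.rows_local * k.rows_total, sizeof(double));
    return k;
}

/* kappa_{a,b} for a locally held mode a (global index) and an arbitrary mode b */
static double kappa_get(const kappa_local_t *k, int a_global, int b_global)
{
    int a_local = a_global - k->row_offset;
    return k->kappa[(size_t)a_local * (size_t)k->rows_total + (size_t)b_global];
}

The additional saving from exploiting the symmetry of the local (2M/K) × (2M/K) block is omitted in this sketch since, as noted above, the index arithmetic becomes more involved.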


5. Benchmark

To achieve the maximum performance, peer-to-peer access between the GPUs is essential. The benchmark is therefore performed on an AWS EC2 instance of type p2.8xlarge. This instance incorporates 4 Tesla K80 accelerators. Each K80 provides a pair of GK210 GPUs, resulting in 8 available GPUs. On this instance type the GPUs are connected via a common PCIe fabric. The configuration used for the benchmark is given in Table 1.

Table 1: Configuration used for the benchmark.

M   | Ns   | Nsps | K | M/K
15  | 2^14 | 128  | 1 | 15
30  | 2^14 | 128  | 2 | 15
60  | 2^14 | 128  | 4 | 15
90  | 2^14 | 128  | 6 | 15
120 | 2^14 | 128  | 8 | 15

Considering a sequence with a symbol rate of 32 GBaud, a spectral range of 4.096 THz is simulated. Incorporating 8 GPUs allows us to evaluate the nonlinear interaction between 120 spatial modes. This is of interest as it is the number of potentially usable spatial modes in a fiber with 62.5 µm core diameter [34]. The number of involved processes, or rather GPUs, is scaled up from 1 to 8 to investigate the scaling of the proposed implementations. The number of spatial modes M persisting per GPU is kept constant. In consequence, the total signal matrix occupies up to 7680 MiB, of which 960 MiB are stored per GPU. For every calculation of N̂, each GPU needs to share 480 MiB. For the benchmark, 150 steps have been simulated and all calculations are executed with double precision. Recall that N̂ is calculated 4 times per step. This is the same for an SSFM-RK4 implementation, whereas for an SSFM-Agrawal implementation the number of N̂ evaluations depends on the number of iterations. The initial distribution and the final collection of the sampled signal matrix, as well as the transfer of further necessary parameters and data, are excluded from the benchmark. Results are shown in Fig. 3.

Figure 3: Scaling of the implementation. Shown is the relative execution time T_K / T_{K=1, Single GPU Impl.} over the number of GPUs K (1 to 8) for the MPI- and NCCL-implementations as well as for both variants with the additional calculations omitted (only communication).

Here, the execution times T_K are normalized to the execution time of our previous single-node, single-GPU implementation [20, 33]. With only a single GPU used, K = 1, the relative runtime is > 1. Due to splitting the calculation of N̂ into several kernels, the runtime increases by approximately 8.5 %. For K ≤ 4, the MPI- and the NCCL-implementation scale nearly equally. With even more GPUs involved, the execution time of the MPI-implementation rises sharply. Incorporating all 8 GPUs, the MPI-implementation requires 6.76 times the execution time of the single-GPU implementation, whereas the NCCL-implementation scales with a factor of 4.26. To evaluate the reason, the benchmarks are rerun without the additional calculations performed in calc_nonlinear_others(). Therefore, only the amount of

communication grows with an increasing K. Here, the MPI-implementation shows nearly identical results, whereas the relative runtime of the NCCL-implementation drops. For the MPI-implementation, this clarifies that the increase of execution time is caused by communication and not by the additional calculations. In conclusion, communication and calculations can be perfectly overlapped using MPI, as additionally confirmed by profiling the application. However, the implementation shows an improvable communication pattern for K > 4. With the NCCL-implementation, on the contrary, communication and calculations do not fully overlap. Nevertheless, the topology-aware communication patterns show clear benefits for simulations with more than 4 GPUs involved. For K = 2, highlighted in Fig. 3, the MPI-implementation slightly outperforms the NCCL-implementation (factor 1.38 vs. 1.51). With only two GPUs taking part in the simulation, NCCL's topology-awareness cannot improve the communication. In this case, overlapping of communication and calculations is much more important. From a weak-scaling point of view, an improved performance is desirable, especially for a large number of involved GPUs. However, regarding the required all-to-all communication, the performance metrics are not surprising. Nevertheless, the application enables the simulation of the nonlinear signal propagation of a huge number of spatial modes over a large frequency range, which was not possible so far. Improved performance of the MPI-implementation can be expected when decoupling the CPU-GPU control flow. With the future availability of MPI-GDS [35], the asynchronous send operations can be triggered directly after the squared absolute values are computed, leading to better hiding of the communication. In addition, the optimization of collective operations is also under investigation [36, 37]. Therefore, future library implementations offer the potential to further improve the performance of the proposed implementation. In the next section we show what is already possible with an implementation based on the currently available libraries.

6. Simulation of a 62.5 µm Fiber

In this application example we demonstrate the feasibility of an MDM transmission over a multimode fiber with graded-index profile featuring a core diameter of 62.5 µm and a numerical aperture of 0.275. The highest order modes feature small effective refractive indices and are therefore affected most by the cladding. In such a fiber, 120 of the spatial modes capable of propagation can be used for a mode-multiplexed transmission [34]. We assume a profile exponent of 1.94 and a trench, i.e. a section with a reduced refractive index within the cladding of the fiber, at a distance of 1.25 µm from the core and with a width of 3.5 µm. The refractive index difference between the cladding and the trench is 6.5 · 10⁻³. Further, an attenuation coefficient of 0.23 dB/km is assumed for all modes. The modes form strongly coupled groups of modes, meaning the linear coupling between the modes belonging to the same mode group is strong, whereas the linear coupling between modes of different mode groups is weak. This is taken into account by choosing the appropriate nonlinear coupling coefficients κ for the Manakov equation [15]. The mode groups (MG) are considered as given in Table 2, which further provides the effective areas Aeff of the spatial modes.

Table 2: Mode groups (MG), Aeff in µm², and number of spatial modes.

MG | Modes in group (Aeff) | Total number of spatial modes up to this MG
1  | LP0,1 (172) | 1
2  | LP1,1 (231) | 3
3  | LP0,2 (347), LP2,1 (311) | 6
4  | LP1,2 (373), LP3,1 (372) | 10
5  | LP0,3 (504), LP2,2 (469), LP4,1 (428) | 15
6  | LP1,3 (499), LP3,2 (545), LP5,1 (475) | 21
7  | LP0,4 (653), LP2,3 (605), LP4,2 (615), LP6,1 (521) | 28
8  | LP1,4 (618), LP3,3 (690), LP5,2 (674), LP7,1 (561) | 36
9  | LP0,5 (795), LP2,4 (732), LP4,3 (768), LP6,2 (733), LP8,1 (601) | 45
10 | LP1,5 (731), LP3,4 (824), LP5,3 (835), LP7,2 (783), LP9,1 (635) | 55
11 | LP0,6 (932), LP2,5 (853), LP4,4 (907), LP6,3 (900), LP8,2 (834), LP10,1 (671) | 66
12 | LP1,6 (841), LP3,5 (951), LP5,4 (979), LP7,3 (957), LP9,2 (879), LP11,1 (702) | 78
13 | LP0,7 (1066), LP2,6 (969), LP4,5 (1038), LP6,4 (1050), LP8,3 (1015), LP10,2 (926), LP12,1 (735) | 91
14 | LP1,7 (946), LP3,6 (1072), LP5,5 (1124), LP7,4 (1112), LP9,3 (1066), LP11,2 (966), LP13,1 (764) | 105
15 | LP0,8 (1195), LP2,7 (1081), LP4,6 (1162), LP6,5 (1188), LP8,4 (1174), LP10,3 (1119), LP12,2 (1009), LP14,1 (795) | 120

The mode profiles and the later used propagation constants are calculated numerically with

JCMsuite [38]. With a nonlinear refractive index n2 = 2.6 · 10⁻²⁰ m² W⁻¹, the fiber features a nonlinearity coefficient γ = 0.61 W⁻¹ km⁻¹. In the definition of γ, the center frequency of the optical signal is given by ω0 and c0 is the

speed of light in vacuum. The differential mode group delays ∆β1 = β1,a − β1,LP0,1 are calculated with Eq. (29) from [39]. With ∆β1 fulfilling the so-called phase-matching condition, the cross-phase modulation between the strongly coupled groups is maximized. This is potentially the worst case for the nonlinear effects. The phase-matching condition for multimode fibers incorporates the group-velocity dispersion parameter β2, i.e. the second derivatives of the propagation constants β, which are calculated numerically. The values of ∆β1 and β2 assumed for the simulation are given in Table 3.

Table 3: DMGDs ∆β1 in ps/km and group-velocity dispersion parameters β2 in ps²/km for the different mode groups (MG).

MG | ∆β1   | avg. β2
1  | 0.0   | -23.1
2  | -7.32 | -23.3
3  | -7.38 | -23.5
4  | -7.48 | -23.8
5  | -7.54 | -24.0
6  | -7.60 | -24.2
7  | -7.70 | -24.5
8  | -7.79 | -24.8
9  | -7.85 | -25.0
10 | -7.95 | -25.3
11 | -8.04 | -25.6
12 | -8.18 | -25.9
13 | -8.23 | -26.2
14 | -8.39 | -26.7
15 | -8.61 | -27.4
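As a plausibility check of our own (not part of the original text), inserting the values above together with Aeff,LP0,1 = 172 µm² from Table 2 and an assumed center frequency of roughly 193.4 THz (the middle of the WDM band used below) into the definition of γ from Section 2 gives

\gamma = \frac{n_2 \, \omega_0}{c_0 \, A_{\mathrm{eff,LP_{0,1}}}} = \frac{2.6\cdot10^{-20}\,\mathrm{m^2\,W^{-1}} \cdot 2\pi \cdot 193.4\cdot10^{12}\,\mathrm{s^{-1}}}{2.998\cdot10^{8}\,\mathrm{m\,s^{-1}} \cdot 172\,\mathrm{\mu m^2}} \approx 6.1\cdot10^{-4}\,\mathrm{W^{-1}\,m^{-1}} = 0.61\,\mathrm{W^{-1}\,km^{-1}},

consistent with the nonlinearity coefficient stated above.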

In the simulation, each spatial mode carries 60 WDM channels within a 50 GHz grid. The center frequency of the wavelength channel with the lowest carrier frequency is placed at 191.95 THz, the one with the highest carrier frequency at 194.9 THz. Thus, a spectral bandwidth of 3 THz is used for transmission. Within the simulation, each WDM channel carries a dual polarization (DP) Quadrature Phase-Shift Keying (QPSK) modulated signal with a symbol rate of 32 GBaud. The average launch power per DP-QPSK signal is set to −1 dBm. Here, we simulate the transmission over two 80 km spans, resulting in a transmission distance of 160 km. After each span, the fiber losses are compensated by a noiseless amplifier with flat gain profile. White noise is added to the signals before the receiver, setting the optical signal-to-noise ratio (OSNR) to 20 dB. In this regime, the nonlinear effects are the dominating source of signal degradation. Within the digital signal processing stage, the dispersion is perfectly compensated. Finally, a clock recovery [40] as well as a phase recovery [41] are applied. To quantify the nonlinear impairments, the squared Q-factors are estimated for each mode and each WDM channel. The minimal, mean, and maximal Q²-factors for each mode after the transmission over 160 km are depicted in Fig. 4.

Figure 4: Squared Q-factors (minimum, mean, and maximum over the WDM channels, Q² in dB) for each spatial mode after transmission over 160 km, evaluated for an OSNR of 20 dB.

Since the fundamental mode features the smallest effective area Aeff and the highest coupling coefficients,

it suffers most from the nonlinear impairments and features the smallest Q²-factor. For the higher order modes, the mean Q²-factors improve. To assess the nonlinear signal distortion, we evaluate the mean Q²-factors relative to the Q²-factors obtained for a back-to-back (b2b) transmission, shown in Fig. 5. With about −0.06 dB after the first and −0.13 dB after the second 80 km span, the fundamental mode again shows the highest signal degradation. Independent of the transmission distance, the higher order modes are less affected by the nonlinear effects. However, one can clearly identify the lower order mode groups, especially based on the results for a transmission over 160 km. Although the induced nonlinear penalty increases with increasing transmission distance, the overall penalty is rather small. For lower OSNRs, an even smaller penalty can be expected [12]. Hence, the utilization of 120 spatial modes in an MDM transmission system seems possible, and the Kerr-based nonlinear impairments do not prohibit the use of such a fiber.

Figure 5: Mean squared Q-factors (Q² − Q²b2b in dB) for each spatial mode after transmission over 80 and 160 km, relative to the squared Q-factors obtained for a back-to-back (b2b) transmission.

7. Conclusion

In this paper, we presented a multi-GPU implementation to simulate the nonlinear signal propagation in multimode fibers. Our approach allows the simulation of more than 100 spatial modes while considering a large spectral bandwidth at the same time. With this approach, it was possible for the first time to simulate a mode-division multiplexing system utilizing 120 spatial modes in a 62.5 µm fiber along with 60 wavelength channels per spatial mode by using 8 GPUs. In the application example, we have evaluated the nonlinear impairments for each spatial mode and each wavelength channel. The results allow us to conclude that the nonlinear impairments do not prohibit the usage of such a large number of spatial modes in a mode-division multiplexing system. It can be expected that one can scale up mode-division multiplexing systems far beyond the most recent transmission experiment in which 45 spatial modes were transmitted over 26.5 km [2]. The implementation presented here allows the study of those future systems, thereby delivering an important tool for the advancement of research on telecommunication applications. We described the necessary modifications to a GPU-accelerated implementation like [12], which enable us to simulate many spatial modes, and discussed various approaches to realize the communication between the GPUs in a way that the additional data transfer effort does not overly increase the simulation time. The performance of the implementation was analyzed, whereby the communication between the GPUs was realized with either MPI or NCCL. While MPI shows performance benefits for a small number of GPUs, the implementation clearly profits from NCCL's topology-awareness if more than 4 GPUs are involved in the simulation. Hence, taking the topology into account can be more important than concurrent communication and computations. Especially in the development of memory-bound applications, like the one presented here, this possibility should be considered.

Acknowledgments

The authors are grateful for the donation of a Tesla K40c by NVIDIA through the GPU Grant program. The work of M. Schirwon and D. Göddeke was supported by the German Excellence Initiative through EXC 2075/1 – 390740016 (SimTech).

Appendix A. Common Code

If multiple processes and GPUs are involved in the simulation, each process needs to be associated with a specific GPU. In a shared memory system, or rather when all processes run on the same machine, this can be realized by choosing the device with the cudaSetDevice function. However, this is not possible if multiple machines are used, as the global MPI rank does not match the GPU identifier. A more flexible approach, usable on both system types, is

to create a local communicator with MPI, as presented in [42]. The code is given in Listing 3 and is relevant for the MPI- as well as for the NCCL-implementation.

Listing 3: Local communicator

int numDevices = -1;
cuda_status = cudaGetDeviceCount(&numDevices);

MPI_Comm local_comm;
MPI_Info info;
MPI_Info_create(&info);
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, rank, info, &local_comm);
int local_rank = -1;
MPI_Comm_rank(local_comm, &local_rank);
MPI_Comm_free(&local_comm);
MPI_Info_free(&info);

int cudaDevice = local_rank % numDevices;
cuda_status = cudaSetDevice(cudaDevice);

First, an MPI_Info object is created, which is used to split the global communicator into a new local communicator. One communicator per shared memory region is created, specified by the split type MPI_COMM_TYPE_SHARED. The local communicator is used to determine the local rank of each process on each machine. Since the enumeration of the local ranks starts with zero on each machine, the local rank matches the GPU identifiers. After the local rank is determined, the local communicator can be freed, as this new communicator is only used to calculate a local rank. Finally, the GPU to use is determined from the local rank and selected. Both the MPI- and the NCCL-implementation further have in common that the sampled data representing the signal needs to be equally shared between the involved processes. To distribute the data to each MPI process one can use MPI_Scatter; to collect and merge the data one can use the inverse operation MPI_Gather. The distribution as well as the collection of the data only needs to be performed once, at the start and the end of the simulation, respectively. The use of both operations is documented in the standard [29], where one can further find information on the initialization and termination of the global MPI communicator.

Appendix B. MPI Specific Code

The presented multi-GPU implementation pursues the strategy of hiding the time required for communication behind computation time. To enable overlapping communication and computations, non-blocking MPI is used, which also gives MPI more opportunities to build efficient pipelines [43]. As mentioned in Section 4.2, our implementation requires CUDA-aware MPI. Otherwise, additional host-to-device and device-to-host cudaMemcpy calls would be necessary, see e.g. [43]. In preparation for the code presented in Listing 1, the following Listing 4 provides further implementation details necessary to use non-blocking MPI.

Listing 4: Generation of request handles

MPI_Request *send_req;
send_req = (MPI_Request *) malloc((size - 1) * sizeof(MPI_Request));
MPI_Request *recv_req;
recv_req = (MPI_Request *) malloc((size - 1) * sizeof(MPI_Request));

int *rk_others;
rk_others = (int *) malloc((size - 1) * sizeof(int));
for (int rk = 0; rk < rank; rk++)
    rk_others[rk] = rk;
for (int rk = rank + 1; rk < size; rk++)
    rk_others[rk - 1] = rk;


To track the status of asynchronous, non-blocking MPI operations, so-called request objects are used. Therefore, memory for the requests of the send and the receive operations is allocated. Here, the total number of involved MPI processes is stored in the variable size. In addition, we create an array which holds the ranks of all the other MPI processes involved in the communication, excluding the rank of the current process itself, which is stored in rank. Together with the MPI_Waitany call shown in Listing 1, this is used to determine the rank of the process that sent the data by evaluating rk_others[rk_idx]. Combined with information on the number of elements per process, the memory address, or rather the offset, to access the array on the GPU containing the squared absolute values can be calculated in a straightforward manner. Finally, as the nonlinear part N̂ is calculated four times in Eq. (6), it must be guaranteed that all values have been sent prior to calling calc_squareabs() once again. For this purpose, one can use MPI_Waitall, passing the array holding the send requests as an argument.

Appendix C. NCCL Specific Code

Instead of using MPI for the GPU-GPU communication, the data exchange can be realized by using NCCL, which provides multi-GPU and multi-node collective communication primitives. As the use of NCCL is not as common as MPI, this section provides details on creating, sharing, and initializing the NCCL communicator. Essentially, the code below reproduces the example given in [44, Example 2: One Device per Process or Thread].

Listing 5: NCCL communicator

ncclUniqueId id;
ncclComm_t comm;

if (rank == 0)
    ncclGetUniqueId(&id);
MPI_Bcast((void *)&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
ncclCommInitRank(&comm, size, id, rank);

After an identifier object and a communicator object have been declared, only the MPI process with rank 0 creates the unique identifier, which is stored in the identifier object. This identifier is shared among all MPI processes with the broadcast operation MPI_Bcast and is afterwards used to initialize the communicator. As mentioned in Section 4.3, only the GPU-GPU communication is realized with NCCL, whereas all other communication is MPI-based. Therefore, the information stored in the variables size and rank is already generated within the MPI initialization. This is now reused to initialize the NCCL communicator.

References

[1] D. J. Richardson, J. M. Fini, L. E. Nelson, Space-division multiplexing in optical fibres, Nat. Photon. 7 (5) (2013) 354–362. doi:10.1038/nphoton.2013.94.
[2] R. Ryf, et al., High-Spectral-Efficiency Mode-Multiplexed Transmission over Graded-Index Multimode Fiber, in: 44th European Conference and Exhibition on Optical Communication (ECOC), Rome, Italy, 2018, paper Th3B.1. doi:10.1109/ECOC.2018.8535536.
[3] S. Bade, B. Denolle, G. Trunet, N. Riguet, P. Jian, O. Pinel, G. Labroille, Fabrication and Characterization of a Mode-selective 45-Mode Spatial Multiplexer based on Multi-Plane Light Conversion, in: Optical Fiber Communication Conference (OFC) Postdeadline Papers, San Diego, CA, USA, 2018, paper Th4B.3. doi:10.1364/OFC.2018.Th4B.3.
[4] N. K. Fontaine, R. Ryf, H. Chen, S. Wittek, J. Li, J. C. Alvarado, J. E. A. Lopez, Packaged 45-Mode Multiplexers for a 50-µm Graded Index Fiber, in: 44th European Conference and Exhibition on Optical Communication (ECOC), Rome, Italy, 2018, paper Mo4E.1. doi:10.1109/ECOC.2018.8535302.
[5] N. K. Fontaine, R. Ryf, H. Chen, D. T. Neilson, K. Kim, J. A. Carpenter, Scalable mode sorter supporting 210 Hermite-Gaussian modes, in: Optical Fiber Communication Conference (OFC) Postdeadline Papers, San Diego, CA, USA, 2018, paper Th4B.4. doi:10.1364/OFC.2018.Th4B.4.
[6] G. P. Agrawal, Nonlinear Fiber Optics, 5th Edition, Academic Press, 2012.
[7] S. Hellerbrand, N. Hanik, Fast Implementation of the Split-Step Fourier Method Using a Graphics Processing Unit, in: Optical Fiber Communication Conference (OFC), San Diego, CA, USA, 2010, paper OTuD7. doi:10.1364/OFC.2010.OTuD7.
[8] S. Pachnicke, A. Chachaj, M. Helf, P. M. Krummrich, Fast parallel simulation of fiber optical communication systems accelerated by a graphics processing unit, in: International Conference on Transparent Optical Networks (ICTON), Munich, Germany, 2010, paper Th.B1.5. doi:10.1109/ICTON.2010.5549002.


[9] J. Alcaraz-Pelegrina, P. Rodríguez-García, Simulations of pulse propagation in optical fibers using graphics processor units, Comp. Phys. Commun. 182 (7) (2011) 1414–1420. doi:10.1016/j.cpc.2011.03.007.
[10] A. Uvarov, N. Karelin, I. Koltchanov, A. Richter, H. Louchet, G. Shkred, GPU-assisted simulations of SDM systems, in: International Conference on Transparent Optical Networks (ICTON), Girona, Spain, 2017, paper We.D1.6. doi:10.1109/ICTON.2017.8025061.
[11] M. Brehler, P. M. Krummrich, Impact of WDM Channel Count on Nonlinear Effects in MDM Transmission Systems, in: Optical Fiber Communication Conference (OFC), Los Angeles, CA, USA, 2017, paper Th2A.63. doi:10.1364/OFC.2017.Th2A.63.
[12] M. Brehler, P. M. Krummrich, Nonlinear impairment scaling with the number of mode groups in mode-multiplexed transmission over a 50 µm multimode fiber, Opt. Express 26 (13) (2018) 16393–16401. doi:10.1364/OE.26.016393.
[13] F. Poletti, P. Horak, Description of ultrashort pulse propagation in multimode optical fibers, J. Opt. Soc. Am. B 25 (10) (2008) 1645–1654. doi:10.1364/JOSAB.25.001645.
[14] S. Mumtaz, R.-J. Essiambre, G. P. Agrawal, Nonlinear Propagation in Multimode and Multicore Fibers: Generalization of the Manakov Equations, J. Lightw. Technol. 31 (3) (2013) 398–406. doi:10.1109/JLT.2012.2231401.
[15] A. Mecozzi, C. Antonelli, M. Shtaif, Coupled Manakov equations in multimode fibers with strongly coupled groups of modes, Opt. Express 20 (21) (2012) 23436–23441. doi:10.1364/OE.20.023436.
[16] C. Antonelli, M. Shtaif, A. Mecozzi, Modeling of Nonlinear Propagation in Space-Division Multiplexed Fiber-Optic Transmission, J. Lightw. Technol. 34 (1) (2016) 36–54. doi:10.1109/JLT.2015.2510511.
[17] J. Hult, A Fourth-Order Runge-Kutta in the Interaction Picture Method for Simulating Supercontinuum Generation in Optical Fibers, J. Lightw. Technol. 25 (12) (2007) 3770–3775. doi:10.1109/JLT.2007.909373.
[18] G. H. Weiss, A. A. Maradudin, The Baker-Hausdorff Formula and a Problem in Crystal Physics, J. Math. Phys. 3 (4) (1962) 771–777. doi:10.1063/1.1724280.
[19] S. Balac, A. Fernandez, F. Mahé, F. Méhats, R. Texier-Picard, The Interaction Picture method for solving the generalized nonlinear Schrödinger equation in optics, ESAIM: Math. Model. Numer. Anal. 50 (4) (2016) 945–964. doi:10.1051/m2an/2015060.
[20] M. Brehler, M. Schirwon, D. Göddeke, P. M. Krummrich, A GPU-Accelerated Fourth-Order Runge-Kutta in the Interaction Picture Method for the Simulation of Nonlinear Signal Propagation in Multimode Fibers, J. Lightw. Technol. 35 (17) (2017) 3622–3628. doi:10.1109/JLT.2017.2715358.
[21] J. Lægsgaard, Efficient simulation of multimodal nonlinear propagation in step-index fibers, J. Opt. Soc. Am. B 34 (10) (2017) 2266–2273. doi:10.1364/josab.34.002266.
[22] J. Lægsgaard, Full-vectorial multimode nonlinear simulations on a real-space Fourier-Gauss grid, J. Opt. Soc. Am. B 36 (8) (2019) 2235–2243. doi:10.1364/josab.36.002235.
[23] L. G. Wright, Z. M. Ziegler, P. M. Lushnikov, Z. Zhu, M. A. Eftekhar, D. N. Christodoulides, F. W. Wise, Multimode nonlinear fiber optics: Massively parallel numerical solver, tutorial, and outlook, IEEE J. Sel. Top. Quantum Electron. 24 (3) (2018) 1–16. doi:10.1109/JSTQE.2017.2779749.
[24] SSPROP Homepage, online, accessed: 01-Oct-2019 (2019). URL https://www.photonics.umd.edu/software/ssprop/
[25] S. Balac, A. Fernandez, SPIP: A computer program implementing the Interaction Picture method for simulation of light-wave propagation in optical fibre, Comput. Phys. Commun. 199 (2016) 139–152. doi:10.1016/j.cpc.2015.10.012.
[26] R. Khakimov, I. Shavrin, S. Novotny, M. Kaivola, H. Ludvigsen, Numerical solver for supercontinuum generation in multimode optical fibers, Opt. Express 21 (12) (2013) 14388–14398. doi:10.1364/OE.21.014388.
[27] S. M. Zoldi, V. Ruban, A. Zenchuk, S. Burtsev, Parallel Implementation of the Split-step Fourier Method For Solving Nonlinear Schrödinger Systems, SIAM News 32 (1) (1999) 1–5.
[28] T. R. Taha, X. Xu, Parallel Split-Step Fourier Methods for the Coupled Nonlinear Schrödinger Type Equations, J. Supercomput. 32 (1) (2005) 5–23. doi:10.1007/s11227-005-0183-5.
[29] Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, Version 3.1, High Performance Computing Center Stuttgart (HLRS), 2015.
[30] J. Kraus, An Introduction to CUDA-Aware MPI, online, accessed: 01-Nov-2018 (Mar. 2013). URL https://devblogs.nvidia.com/parallelforall/introduction-cuda-aware-mpi/
[31] NCCL Homepage, online, accessed: 01-Oct-2019 (2019). URL https://developer.nvidia.com/nccl
[32] N. Wilt, The CUDA Handbook: A Comprehensive Guide to GPU Programming, 1st Edition, Addison-Wesley Professional, 2013.
[33] M. Brehler, M. Schirwon, D. Göddeke, P. M. Krummrich, Modeling the Kerr-Nonlinearity in Mode-Division Multiplexing Fiber Transmission Systems on GPUs, in: Advanced Photonics Congress, Nonlinear Photonics (NP), Zurich, Switzerland, 2018, paper JTu5A.27. doi:10.1364/BGPPM.2018.JTu5A.27.
[34] P. Sillard, Few-Mode Fibers for Space Division Multiplexing, in: Optical Fiber Communication Conference (OFC), Anaheim, CA, USA, 2016, paper Th1J.1. doi:10.1364/OFC.2016.Th1J.1.
[35] A. Venkatesh, K. Hamidouche, S. Potluri, D. Rosetti, C.-H. Chu, D. K. Panda, MPI-GDS: High Performance MPI Designs with GPUDirect aSync for CPU-GPU Control Flow Decoupling, in: 2017 46th International Conference on Parallel Processing (ICPP), Bristol, UK, 2017, pp. 151–160. doi:10.1109/ICPP.2017.24.
[36] A. A. Awan, K. Hamidouche, A. Venkatesh, D. K. Panda, Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, in: Proceedings of the 23rd European MPI Users' Group Meeting - EuroMPI 2016, Edinburgh, Scotland, 2016, pp. 15–22. doi:10.1145/2966884.2966912.
[37] A. A. Awan, C.-H. Chu, H. Subramoni, D. K. Panda, Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters, in: Proceedings of the 25th European MPI Users' Group Meeting - EuroMPI'18, Barcelona, Spain, 2018, pp. 1–9. doi:10.1145/3236367.3236381.
[38] J. Pomplun, S. Burger, L. Zschiedrich, F. Schmidt, Adaptive finite element method for simulation of optical nano structures, phys. stat. sol. (b) 244 (10) (2007) 3419–3434. doi:10.1002/pssb.200743192.


[39] C. Antonelli, O. Golani, M. Shtaif, A. Mecozzi, Nonlinear interference noise in space-division multiplexed transmission through optical fibers, Opt. Express 25 (12) (2017) 13055–13078. doi:10.1364/OE.25.013055.
[40] R. Schmogrow, M. Winter, M. Meyer, D. Hillerkuss, S. Wolf, B. Baeuerle, A. Ludwig, B. Nebendahl, S. Ben-Ezra, J. Meyer, M. Dreschmann, M. Huebner, J. Becker, C. Koos, W. Freude, J. Leuthold, Real-time Nyquist pulse generation beyond 100 Gbit/s and its relation to OFDM, Opt. Express 20 (1) (2012) 317–337. doi:10.1364/OE.20.000317.
[41] S. Tsukamoto, K. Katoh, K. Kikuchi, Coherent Demodulation of Optical Multilevel Phase-Shift-Keying Signals Using Homodyne Detection and Digital Signal Processing, IEEE Photon. Technol. Lett. 18 (10) (2006) 1131–1133. doi:10.1109/LPT.2006.873921.
[42] J. Kraus, J. Stephan, Multi-GPU Programming Models, GPU Technology Conference Europe (GTC Europe), Munich, Germany, 2017, Session ID: 23031. URL http://on-demand.gputechconf.com/gtc-eu/2017/presentation/23031-jiri-kraus-multi-gpu-programming-models.pdf
[43] J. Kraus, Multi-GPU Programming with MPI, GPU Technology Conference Silicon Valley (GTC), San Jose, CA, USA, 2017, Session ID: S7133. URL http://on-demand.gputechconf.com/gtc/2017/presentation/S7133-jiri-kraus-multi-gpu-programming.PDF
[44] NVIDIA Collective Communication Library (NCCL) Documentation, online, accessed: 01-Nov-2018 (2018). URL https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/index.html


Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
