Efficiency optimization of a fast Poisson solver in beam dynamics simulation

Efficiency optimization of a fast Poisson solver in beam dynamics simulation

Computer Physics Communications 198 (2016) 82–96 Contents lists available at ScienceDirect Computer Physics Communications journal homepage: www.els...

1MB Sizes 1 Downloads 26 Views

Computer Physics Communications 198 (2016) 82–96

Contents lists available at ScienceDirect

Computer Physics Communications journal homepage: www.elsevier.com/locate/cpc

Efficiency optimization of a fast Poisson solver in beam dynamics simulation Dawei Zheng ∗ , Gisela Pöplau 1 , Ursula van Rienen Faculty of Computer Science and Electrical Engineering, University of Rostock, Albert-Einstein-Straße 2, Rostock 18059, Germany

article

info

Article history: Received 3 January 2015 Received in revised form 20 June 2015 Accepted 4 September 2015 Available online 15 September 2015 Keywords: Optimization Poisson solver Green’s function Space charge Beam dynamics

abstract Calculating the solution of Poisson’s equation relating to space charge force is still the major time consumption in beam dynamics simulations and calls for further improvement. In this paper, we summarize a classical fast Poisson solver in beam dynamics simulations: the integrated Green’s function method. We introduce three optimization steps of the classical Poisson solver routine: using the reduced integrated Green’s function instead of the integrated Green’s function; using the discrete cosine transform instead of discrete Fourier transform for the Green’s function; using a novel fast convolution routine instead of an explicitly zero-padded convolution. The new Poisson solver routine preserves the advantages of fast computation and high accuracy. This provides a fast routine for high performance calculation of the space charge effect in accelerators. © 2015 Elsevier B.V. All rights reserved.

1. Introduction Future particle accelerator facilities such as next generation light sources [1,2] and linear colliders [3,4] are worldwide under active development and construction. Aiming for beams of very high quality with properties such as high brightness and low emittance these machines require the development of new technologies. This process always goes along with new challenges for the numerical simulation of the related beam dynamics. Hereby the detailed study of space charge driven effects which can lead to severe instabilities is essential in order to ensure the designed performance of the accelerator. State-of-the-art simulations in this field require a higher and higher resolution which includes a large number of particles, tracking of long distances (start-to-end-simulations), and a high resolution in space and time. Typical examples for such demanding simulations are the investigation of micro bunch instabilities for the European X-ray Free Electron Laser (XFEL) [5] or the numerical simulation of the effect of ionized residual gas on an electron bunch in an Energy Recovery Linac (ERL) [6]. Since space charge forces can produce emittance growth and beam loss in accelerators, precise simulation results are essentially important to control and optimize beams. Several beam dynamics packages are available such as IMPACT [7,8], ASTRA [9], GPT [10], MOEVE PIC Tracking [11], etc. In all these packages one central issue is how to compute the space charge force. Generally, there are two models: the particle–particle (PP) model [12], and the particle-in-cell (PIC) model [12–15] which is considered in this paper. For the PP model, the complexity is Np2 , where Np denotes the number of particles. This tremendously increases the CPU time consumption compared to the PIC model. In the PIC model, we generate meshes in space and arrange the charge of the particles on this grid as charge density. Afterwards we compute the potential and the electric field on the mesh rather than on the particle position. Finally, we compute the space charge forces for each particle from the grid by interpolation. The space charge forces are directly connected to Poisson’s equation, which solves for the potential. With the potential, we can easily compute the electric fields and the space charge forces of particles. There are several 3D computational methods to solve Poisson’s equation which are used in beam dynamics simulation: the adaptive multigrid method [16], FFT-based methods [13], the Green’s function method [12], etc.



Corresponding author. E-mail address: [email protected] (D. Zheng).

1 G. Pöplau was with the Faculty of Computer Science and Electrical Engineering, University of Rostock, Germany. She is now with compaec e.G., Rostock, 18059, Germany. http://dx.doi.org/10.1016/j.cpc.2015.09.005 0010-4655/© 2015 Elsevier B.V. All rights reserved.

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

83

However, these 3D Poisson solvers may become inaccurate and ineffective in the simulation of next-generation accelerator facilities [17], which have very long bunches (i.e. a large longitudinal-to-transverse aspect ratio in bunch size). The ratio is usually several millimeters in longitudinal size and ten to hundred micron in transverse size. Therefore, some numerical methods cut the long bunches into thin slices along the longitudinal direction, solve the 2D Poisson’s equation in transverse direction for each slice and map back the potential with the corresponding slice length along the longitudinal direction to resolve the 3D solution [18,19]. Another model from Qiang et al. [20,21] deals accurately with such bunches in a 3D, rather than a 2D computational domain. However, the computations involved are very time consuming so that parallel computers are needed for an efficient performance. There are two main time consuming parts inside the model routine. One part comes from the computation of the integrated Green’s function and the related Fourier transform, the other comes from the computation of the convolution with pruned Fourier transform. This paper is organized as follows: We study and compare the standard Green’s function (GF) and the integrated Green’s function (IGF) Poisson solvers applied in the model by Qiang et al. [20]. The major improvement of the IGF approach is on the numerical integration, which splits the integral domain into grid cells. Considering grid cells with their center in the discretization grid points and side lengths of one step size in each axis direction, the IGF can achieve more accurate integrals compared to the GF as shown in Section 2. Furthermore, only a few parts of these grid cell integrals considerably disagree, while the remaining parts of these grid cell integrals are rather close. Based on this observation, we have presented a reduced integrated Green’s function (RIGF) method in our former work [22]: only the parts near the origin of the IGF integrals are computed to replace the GF integrals, the remaining parts are still the GF integrals as described in the first subsection of Section 3. Furthermore, we show in the second subsection of Section 3 that the discrete 3D Fourier transform (DFT) of the extended GF can be expressed as a discrete cosine transform (DCT) of the GF according to Theorem 1. A classical routine based on the convolution theory in Fourier analysis is widely used as Poisson solver: by extensions and the application of the DFT, the charge density and the Green’s function are transferred into Fourier space. The potential is easily achieved by multiplying the two and by the application of an inverse DFT. For higher efficiency in operations and storage, Hockney and Eastwood [12] processed a pruned FFT for the charge density rather than the aforementioned routine. The FFT performs only on the non-zero vector in each direction rather than on the whole values of the extended charge density. This reorganization saves 7/12 of the computer operations in 3D. Recently, Bowman and Roberts [23] provided an efficient dealiased convolution without zero-padding. They deal with a convolution of sequences padded with zeros for both, the object and kernel (in our context space charge density and GF). However, since the GFs are not padded with zeros as Bowman and Roberts have assumed them, we cannot use their routine directly. Fortunately, we can achieve and reconstruct the GFs’ DCT values by the even and odd indices without padding, too. In Section 4, a related novel fast convolution method is introduced. Furthermore, we provide an error analysis of the RIGF method in comparison with the IGF routine in Section 5 and show the reliability of the algorithm. A threshold parameter Rz , which determines where to switch from the IGF integrals to the GF integrals is also given. The Rz is chosen automatically by a procedure based on an adapted accuracy control. The numerical results and examples in Section 5 show the efficiency improvement and the applicability of the RIGF method to practiceoriented simulations. The efficiency gain (the elapsed time ratio of the previous routine to the new routine) from the optimization of the IGF varies from two to seven, from the substitution of the 3D FFT computation for GFs it is higher than linear, and from the optimization of the convolution it is more than 1.3. In total, the whole speed-up of the Poisson solver is larger than three to four. Three different examples are provided to verify that the RIGF method is as accurate as the IGF method in practice. 2. Green’s function and integrated Green’s function methods The simulation object is a bunch of charged macro particles, which usually contains a large number of particles (for instance 106 –1013 ) concentrated around one synchronous particle, which is tracked along the accelerator sections. Since the velocity of the particle (e.g. an electron) is nearly relativistic, we choose the laboratory frame as the working frame, while the electric field is calculated in each time step in the rest frame, as shown in Fig. 1. Firstly, the routine loads bunch information and transforms the bunch position from the laboratory frame to the rest frame by a Lorentz transform: x = xˆ , y = yˆ ,

(1)

z = γ zˆ . Here, the z direction is the bunch’s travel direction and the coordinates (ˆx, yˆ , zˆ ) have been transformed to (x, y, z ) with the Lorentz factor γ = 1/ 1 − β 2 , where β is defined by β = vˆ zˆ /c with the velocity in z direction vˆ zˆ and the speed of light c. Secondly, we solve for the electromagnetic field in the rest frame. Apart from the driven electric field E driven provided by the radio frequency (RF) cavities, the particles inside the first part of the accelerator affect each other due to space charge forces generated by the particles themselves. As introduced before, we use the PIC 3D model for simulation. Therefore, the electromagnetic field is computed on the grid rather than on the involved particles. It is well-known, that the computation of the electric space charge field directly connects to the solution of Poisson’s equation:

−∆ϕ(x, y, z ) =

ρ(x, y, z ) , ε0

in Ω ⊂ R3 ,

(2)

with the Laplace operator ∆, the electric potential ϕ , the charge density ρ , the permittivity in vacuum ε0 and the considered domain Ω . In our consideration, Ω is a box shaped domain over [0, Lx ] × [0, Ly ] × [0, Lz ], where Lx , Ly , Lz are the lengths in each direction. In the PIC model, the above Lx × Ly × Lz domain is then discretized with Nx × Ny × Nz mesh points in each direction. By equidistant discretization, the

84

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

Fig. 1. A schematic model of bunch motion with transformation between laboratory frame Kˆ (left) and rest frame K (right).

step sizes are directly formed as: hx =

Lx , N x −1

hy =

Ly , Ny −1

hz =

Lz , Nz −1

and the corresponding coordinates of the mesh points are: xi = ihx ,

yj = jhy , zk = khz with i = 0, 1, 2, . . . , Nx − 1, j = 0, 1, 2, . . . , Ny − 1, k = 0, 1, 2, . . . , Nz − 1. For each particle, the charge should be assigned to the neighboring grid points (Fig. 2) with some special assignment functions for different schemes, as e.g. the Nearest Grid Point (NGP) or Cloud In Cell (CIC) schemes, which are beyond the scope of our topic in this paper. Introductions to those methods can be found in [12] or [13]. Thirdly, the corresponding electric and magnetic fields have to be transformed back from the rest frame to the laboratory frame: Eˆ ⊥ = γ E⊥ ,

(3)

Eˆ ∥ = E∥ ,

(4)

ˆ= B

γ c2

v × E,

(5)

where E is the electric field strength, B the magnetic field strength, ⊥ stands for the transverse directions and ∥ for the longitudinal ˆ l. direction. Then, the fields are interpolated from the grid to each particle l as Eˆ l and B Fourthly, for particle l, the relativistic equations of motion in the laboratory frame are given as:

ˆl dp dtˆ drˆ l dtˆ

= ql (Eˆ l + vˆ l × Bˆ l ), =

ˆl p ˆ l γl m

(6)

,

(7)

ˆ l the rest mass. Furthermore, rˆl refers to the position and vˆ l to the velocity with βl = vˆ zˆ ,l /c , γl = where p  ˆ l denotes the momentum, m

1/ 1 − βl2 for particle l. For the time integration, we can use either the Runge–Kutta or the Leap-Frog scheme, e.g.

The above four procedural steps are circulated by time step 1t until a certain time limit or distance limit is reached. However, the focus of the discussion in this paper is on solving Poisson’s equation. The methods of GF and IGF solve Poisson’s equation in free space with the Green’s function G(x, x′ , y, y′ , z , z ′ ) = 

1

(x −

x′ )2

+ (y − y′ )2 + (z − z ′ )2

,

(8)

which is the fundamental solution from which the potential distribution of a point charge can be determined. The Green’s function is also expressed by three terms as G(x, y, z ), which means G(x, 0, y, 0, z , 0). The solution of Poisson’s equation using the Green’s function reads as [12,20]:

ϕ(x, y, z ) =

1 4π ε0

   ·

ρ(x′ , y′ , z ′ )G(x, x′ , y, y′ , z , z ′ )dx′ dy′ dz ′ .

(9)

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

85

Fig. 2. A schematic plot of particle assignment to the neighboring grid points. A fraction of its charge is assigned to each of the eight neighboring grid points of a particle at some position inside the grid cell.

GF integral With the well-known midpoint rule for the numerical integral in (9), the discrete GF formula is given by

ϕ(xi , yj , zk ) ≈

1 4π ε0

N x −1 N y −1 N z −1

·



ρ(xi′ , yj′ , zk′ )G˜ GF (xi , xi′ , yj , yj′ , zk , zk′ ),

(10)

i′ =0 j′ =0 k′ =0

with

˜ GF (xi , xi′ , yj , yj′ , zk , zk′ ) = hx hy hz G(xi , x′i , yj , y′j , zk , zk′ ). G

(11)

IGF integral With the summation of integrals for the Green’s function over each grid cell [i − shown in Fig. 3, the discrete IGF formula is given by

ϕ(xi , yj , zk ) ≈

1 4π ε0

hx 2

,i +

hx 2

] × [j −

hy 2

,j +

hy 2

] × [k −

hz 2

,k +

hz 2

] as

Nx −1 Ny −1 Nz −1



ρ(xi′ , yj′ , zk′ )G˜ IGF (xi , xi′ , yj , yj′ , zk , zk′ ),

(12)

i′ =0 j′ =0 k′ =0

˜ IGF (xi , xi′ , yj , yj′ , zk , zk′ ) can be calculated with the Green’s function (8) as follows: where G ˜ IGF (xi , xi′ , yj , yj′ , zk , zk′ ) = G

xi′ +hx /2



yj′ +hy /2



xi′ −hx /2

zk′ +hz /2



yj′ −hy /2

zk′ −hz /2

G(xi , x′ , yj , y′ , zk , z ′ )dx′ dy′ dz ′ .

(13)

This integral can be calculated from the primitive function (antiderivative) of (8), which can be finally simplified to

  

1



x2 + y2 + z 2

z2 . dxdydz = − arctan



2



x2 2

z





+ xz ln y +



x2 + y2 + z 2

 arctan



xy



yz



x x2 + y 2 + z 2





y2 2

 arctan



xz



y x2 + y2 + z 2

   + yz ln x + x2 + y2 + z 2 

x2 + y2 + z 2 + xy ln z +





x2 + y2 + z 2 ,

(14)

we repeat the simple form from [21] in (14). Calculating the sums of (10) and (12) leads to a very high numerical complexity. Therefore, the convolution theory based upon discrete Fourier transforms is naturally applied to replace the direct summation, so-called cyclic convolution. Inside the cyclic convolution, we ˜ ex and ρex rather than G and ρ . Simply speaking, the charge density ρex is padded with zeros in all coordinate directions, the tilde consider G ˜ is expanded to be symmetric. The details of the expansions are introduced in [20,12]. In fact, the summations for both Green’s function G the original and the expanded ways are identical. Using the 3D discrete Fourier transformation F and convolution theory, the expanded potential expression is:

ϕex =

1 4π ε0





˜ ex · Fρex , F−1 FG

(15)

˜ ex with Fρex . The size of ϕex is 2Nx × 2Ny × 2Nz . In contrast, if we whereby the operation · means the elementwise multiplication of FG directly perform the cyclic convolution on G and ρ of the original Nx × Ny × Nz domain, the charge density and the Green’s function G will be treated as periodic data. In this case, wrong values for G and ϱ will automatically be used (defined) in the negative direction. This leads to a meaningless solution. Finally, the potential at each grid point equals the first Nx , Ny , Nz values of the expanded potential expression on each axis.

86

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

Fig. 3. A schematic plot of the grid arrangement used for the integrals. The red colored points are the discretized grid points. The grid cell, edged by the blue colored lines around each red central point, is the cube which is considered for the numerical evaluation of both GF and IGF integrals. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Yet, Hockney and Eastwood’s efficient routine implements differently as the following two parts: Green’s function calculation and Fourier transform part 1. Compute each tilde Green’s function value:

˜ IGF (xi , xi′ , yj , yj′ , zk , zk′ ) = G



xi′ +hx /2 xi′ −hx /2



yj′ +hy /2



yj′ −hy /2

zk′ +hz /2 zk′ −hz /2

G(xi , x′ , yj , y′ , zk , z ′ )dx′ dy′ dz ′

by means of formula (14). 2. Expand the tilde Green’s function values as [20]:

G˜ (x , y , z ), i j k    ˜  , G ( x  2Nx −i yj , zk ),   ˜  G(xi , y2Ny −j , zk ),     G˜ (x , y , z i j 2Nz −k ), ˜ ex (xi , yj , zk ) = G ˜  G(x2Nx −i , y2Ny −j , zk ),      ˜ (x2Nx −i , yj , z2Nz −k ), G    ˜   G(x , y ,z ),   i 2Ny −j 2Nz −k ˜ (x2Nx −i , y2Ny −j , z2Nz −k ), G

0 ≤ i ≤ Nx ; 0 ≤ j ≤ Ny ; 0 ≤ k ≤ Nz , Nx < i ≤ 2Nx − 1; 0 ≤ j ≤ Ny ; 0 ≤ k ≤ Nz , 0 ≤ i ≤ Nx ; Ny < j ≤ 2Ny − 1; 0 ≤ k ≤ Nz , 0 ≤ i ≤ Nx ; 0 ≤ j ≤ Ny ; Nz < k ≤ 2Nz − 1, Nx < i ≤ 2Nx − 1; Ny < j ≤ 2Ny − 1; 0 ≤ k ≤ Nz , Nx < i ≤ 2Nx − 1; 0 ≤ j ≤ Ny ; Nz < k ≤ 2Nz − 1, 0 ≤ i ≤ Nx ; Ny < j ≤ 2Ny − 1; Nz < k ≤ 2Nz − 1, Nx < i ≤ 2Nx − 1; Ny < j ≤ 2Ny − 1; Nz < k ≤ 2Nz − 1.

˜ ex (xi , yj , zk ), get FG˜ ex (xi , yj , zk ). The storage of Green’s function values can be reduced due to the symmetric property. 3. Take the 3D FFT of G Fast convolution with pruned Fourier transform part The well-known fast convolution uses a pruned FFT with zero-padding, because 7/8 of the extended domain is equal to zero. It has been introduced by Hockney and Eastwood [12]. The storage requirement is of 2Nx × Ny × Nz elements, plus a temporary 2D array of 2Ny × Nz elements and a temporary 1D vector of 2Nz elements. Yet the temporary storage is not a significant issue with the modern computer technology. We observed that the implementation with a 2D temporary array plus a 1D temporary vector shows a worse performance compared to an implementation with a 2D 2Ny × Nz temporary array plus a 2D 2Ny × 2Nz temporary array as it is used in Algorithm 1. The basic purpose is the same: to avoid the Fourier transform of part padded with zeros and to save storage. In the implementation, the difference is to treat the FFT in batches to speed up rather than involving a separate 1D convolution. In practice, the FFTs in the loops are performed with the FFTW package [24] with the advanced interface, which transforms the multiple arrays simultaneously. 3. Optimization of Green’s function calculation and related Fourier transform 3.1. Reduced integrated Green’s function method In principle, the IGF method is much more accurate than the GF method. Yet the corresponding computations are more time-consuming, as we can see in Table 1. The elapsed time of the IGF integrals is more than dozen times higher than that of the standard GF integrals. This is due to the fact that the IGF integral leads to high time-consuming terms, while it is only a simple term to calculate for the GF integral. In fact, the standard GF integrals work sufficiently well in simulations, where the bunch’s transverse size does not differ much from the longitudinal size. Nowadays, next-generation accelerators usually aim for the generation of very short bunches. However, computing the

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

87

Table 1

˜ The computation Comparison of elapsed time of the two Green’s function integrals G. program is implemented in C under Intel Core 2.60 GHz CPU (N = Nx = Ny = Nz ). N +1

GF integral elapsed time

IGF integral elapsed time

65 129 257

0.004 s 0.029 s 0.217 s

0.073 s 0.527 s 4.030 s

Algorithm 1 Fast Convolution 3D

˜ ex (0 : 2Nx − 1, 0 : 2Ny − 1, 0 : 2Nz − 1) Input: f (0 : Nx − 1, 0 : Ny − 1, 0 : Nz − 1), FG Output: u(0 : Nx − 1, 0 : Ny − 1, 0 : Nz − 1) 1: function Convolution3D 2: fex (0 : 2Nx − 1, 0 : Ny − 1, 0 : Nz − 1) ← PaddingZeroInX[f (0 : Nx − 1, 0 : Ny − 1, 0 : Nz − 1)] 3: for j = 0 → Ny − 1 do 4: for k = 0 → Nz − 1 do 5: fex (:, j, k) ← FFT[fex (:, j, k)] 6: end for 7: end for 8: for i = 0 → 2Nx − 1 do 9: Temp2D(0 : 2Ny − 1, 0 : Nz − 1) ← PaddingZeroInY[fex (i, 0 : Ny − 1, 0 : Nz − 1)] 10: for k = 0 → Nz − 1 do 11: Temp2D(:, k) ← FFT[Temp2D(:, k)] 12: end for 13: for j = 0 → 2Ny − 1 do 14: Temp2Db (j, 0 : 2Nz − 1) ←PaddingZeroInZ[Temp2D(j, 0 : Nz − 1)] 15: end for 16: for j = 0 → 2Ny − 1 do 17: Temp2Db (j, :) ←FFT[Temp2Db (j, :)] 18: end for 19: for j = 0 → 2Ny − 1 do 20: for k = 0 → 2Nz − 1 do ˜ ex (i, j, k) 21: Temp2Db (j, k)=Temp2Db (j, k) ∗ FG 22: end for 23: end for 24: for j = 0 → 2Ny − 1 do 25: Temp2Db (j, :) ← IFFT[Temp2Db (j, :)] 26: end for 27: for j = 0 → Ny − 1 do 28: Temp2D(j, :) ←Temp2Db (j, 0 : Nz − 1) 29: end for 30: for k = 0 → Nz − 1 do 31: Temp2D(:, k) ← IFFT[Temp2D(:, k)] 32: end for 33: for k = 0 → Nz − 1 do 34: fex (i, :, k) ←Temp2D(0 : Ny, k) 35: end for 36: end for 37: for j = 0 → Ny − 1 do 38: for k = 0 → Nz − 1 do 39: fex (:, j, k) ← IFFT[fex (:, j, k)] 40: end for 41: end for 42: u(0 : Nx − 1, 0 : Ny − 1, 0 : Nz − 1) ← fex (0 : Nx − 1, :, :) 43: end function space charge forces in the rest frame, even these bunches appear as bunches with large longitudinal-to-transverse aspect ratio due to the Lorentz transformation. This induces large numerical errors for the standard integrals of the Green’s function. For further considerations we define the Green’s function integral error as:

δ G˜ GF (xi , yj , zk ) = G˜ GF (xi , yj , zk ) − G˜ IGF (xi , yj , zk ). ˜ GF (xi , yj , zk ) from the common error analysis for the midpoint rule in 1D, as: We can derive the bound of δ G   h3 h3 h3  ˜  x y z Kx Ky Kz , δ GGF (xi , yj , zk ) ≤ 3 24

(16)

88

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

where Kx is the largest value of ∂ 2 G(x, y, z )/∂ x2 with x ∈ [xi −

]. The numbers Ky , Kz are defined in the same way. Therefore, δ G˜ GF (xi , yj , zk ) is influenced by the step sizes hx , hy , hz and by the second derivatives of G(x, y, z ) in all three coordinate directions. hx 2

, xi +

hx 2

We comparatively define the local relative error of the Green’s function integral as:

   δ G˜ (x , y , z )   GF i j k  ηG (xi , yj , zk ) =  .  G˜ IGF (xi , yj , zk )  For very long bunches, it is neither realistic nor efficient to discretize with the same stepsize hx = hy = hz in all three directions aiming for an accurate ηG (xi , yj , zk ) since the calculation time increases proportionally with the aspect ratio, which could be several hundred times. The FFT computation time increases as well. If we would want to keep high efficiency, i.e. N = Nx = Ny = Nz , then ηG (xi , yj , zk ) increases significantly with the aspect ratio. The larger the aspect ratio is, the worse the integral errors are (see Fig. 4(a)), which can also be seen from (16). Therefore, the IGF integral is the better choice in this case. It calculates accurately and, depending on the aspect ratio, usually costs less time than the GF integral discretized on a fine equidistant grid. In order to derive an accurate and efficient RIGF method, we should carefully study ηG (xi , yj , zk ) in the case of Nx = Ny = Nz . Where is the largest value of ηG (xi , yj , zk ) located? How does ηG (xi , yj , zk ) vary? What can be changed? To figure out the facts visually and easily, we chose a computational domain with a large aspect ratio: Lx :Ly :Lz = 1:1:30. Firstly (see Fig. 4(b)), it can be observed that ηG (xi , yj , zk ) is large when z is small. It drops down exponentially with increasing values of z. Next, according to (11) and (13) the results for the two kinds of integrals over the grid cells are very different near the origin of the z-axis, but with increasing values of z, they get closer. All the facts above tell us that the midpoint rule for the integrals over the grid cells given in (11) is not accurate at the front parts of the large aspect ratio direction. When we increase the distance from the origin, the two kinds of integrals reveal less different and show acceptable local relative errors, which give us some insight into the reduced integrated method. ˜ (xi , yj , zk ), we first determine the exact integrals over grid cells where The idea of the RIGF method is simple: when we compute G necessary. This means that the grid cells near the origin in the large aspect ratio direction are treated with the IGF model. An integer parameter Rz determines where to switch from the IGF model to the GF model. Then, the remaining parts of the integrals are calculated by the simpler standard GF model. In the following we suppose that the large aspect ratio direction is the z-axis. Then the new algorithm reads as follows: Algorithm: (RIGF) Calculate each tilde Green’s function value: If k ≤ Rz :

˜ (xi , xi′ , yj , yj′ , zk , zk′ ) = G



xi′ +hx /2 xi′ −hx /2



yj′ +hy /2

zk′ +hz /2



yj′ −hy /2

zk′ −hz /2

G(xi , x′ , yj , y′ , zk , z ′ )dx′ dy′ dz ′ ,

use the formula from (14) to calculate the integrals. Otherwise:

˜ (xi , xi′ , yj , yj′ , zk , zk′ ) = hx hy hz G(xi , x′i , yj , y′j , zk , zk′ ). G 3.2. Substitution of Fourier transform for Green’s function Normally, after the tilde Green’s function calculation, we used to extend the GF values and take the 3D Fourier transform of it. However, now we provide a discrete cosine transform-I (DCT-I) of the tilde GF to replace the extended 3D Fourier transform. ˜ ex , a 3D real even symmetric vector is defined by van Loan in [25]: suppose g = {g0 , g1 , . . . , gn } is real, the The extension of GF values G extension defined as g˜ = {g0 , g1 , . . . , gn , gn−1 , gn−2 , . . . , g1 }2n is the real even symmetric extension of g. The Fourier transform of a real even symmetric vector has the same structure as itself—real even. ˜ ex can be substituted by the DCT-I of G, ˜ which is based on the following theorem: Theoretically, the Fourier transform of G Theorem 1. Suppose g = {g0 , g1 , . . . , gn }, the real even symmetric extension of g is g˜ . If y˜ = 21 F2n g˜ and y = DCTn (g ), where DCT is the first kind of discrete cosine transform (DCT-I). Then y˜ is the real even symmetric extension of y, i.e. y˜ = {y0 , y1 , . . . , yn , yn−1 , yn−2 , . . . , y1 }2n . The detailed proof can be found in [25]. For multi-dimensions, the conclusion is also valid. From Theorem 1, the improvement of the GF Fourier transform is significant for both storage and efficiency. The extension of tilde GF values is avoided. The storage remains the same size rather than increasing by factor eight. Secondly, the 3D 2Nx × 2Ny × 2Nz FFT is replaced by a 3D (Nx + 1) × (Ny + 1) × (Nz + 1) DCT-I. The complexity and efficiency improvement is apparent. For the further usage of

˜ ex , the extension can be expressed implicitly from the DCT(G˜ ) with real even symmetry rather than with the extension in storage. The FG corresponding indices are as shown: ˜ ex (i, j, k) = 8 · DCT(G˜ )(I , J , K ), FG  J =

j, 2Ny − j

 where I =

j ∈ [0, Ny ], j ∈ [Ny + 1, 2Ny − 1],

i, 2Nx − i

 K =

k, 2Nz − k

i ∈ [0, Nx ], i ∈ [Nx + 1, 2Nx − 1], k ∈ [0, Nz ], k ∈ [Nz + 1, 2Nz − 1].

(17)

A closely related statement is also very important and useful: the inverse transform of DCT-I is the same as DCT-I itself, apart from a constant factor. The same conclusion applies for the IFFT and FFT of real-even symmetric vectors.

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

89

4. Optimization of the fast convolution with pruned Fourier transform: a novel fast convolution routine The classical routine of Hockney and Eastwood [12] as shown in Section 2 has been widely used for several decades. Recently, Bowman and Roberts have proposed an efficient dealiased convolution algorithm without the expense of conventional zero-padding [23]. They use a routine to calculate an implicitly rather than the explicitly padded convolution. The efficient method avoids explicit zero-padding in convolutions and saves both, memory and computation time especially in a multidimensional convolution. In general, the speed-up of the implicitly zero-padding routine has a factor of 2. The precision of the implicitly padded convolution is the same as that of the explicitly zero-padded convolution. They use this kind of convolution in the study of turbulence. However, the input data’s extension and convolution in Bowman and Roberts’ study is different from the convolution for the Poisson solver we introduced in Section 2. This is ˜ ex . The goal of this section is to construct a convolution without because we apply a real even symmetric extension, not a zero-padded G ˜ The efficiency improvement needs to be tested for our explicit zero-padding and without explicit real even symmetric expansion of G. new routine and the novel fast convolution routine will be compared with the classical routine of Hockney and Eastwood. 4.1. FFT without a bit reversal stage and implicitly padded convolution in 1D Suppose the input data vector fˆ of length n, is padded with zeros to the length of m = 2n. By the definition of the discrete Fourier transform (DFT), the backward DFT of fˆ reads as follows: m−1

fk =



ei

2π m kl

m−1

fˆl =



l =0

kl ˆ ωm fl ,

k = 0, 1, 2 . . . , m − 1,

l =0 2π

where ωm denotes the mth root of unit defined by ωm = ei m , and fˆl is a vector padded with zeros, i. e. fˆl = 0 with l = n, . . . , m − 1. On one hand, as any number multiplied by zero equals zero, the multiplication with zeros can be avoided by truncating the summation. On the other hand, the separation between even and odd indices expresses the summations in terms of subtransforms efficiently: f2k =

n−1 

2kl ˆ ω2n fl =

l=0

n −1 

ωnk fˆl ,

f2k+1 =

n−1 

l =0

(2k+1)l ˆ fl = ω2n

n−1 

l ˆ ωnk ω2n fl ,

(18)

l=0

l=0

2k m due to the properties of roots of units: ω2n = ωnk and ωm = 1. For the scaled forward DFT we can proceed inversely: firstly, the original forward DFT is given by

fˆl =

m−1 1  −lk ωm fk , 2n k=0

l = 0, 1, 2 . . . , m − 1.

Then, the sum is split into even and odd index divisions: fˆl =

1 2n



n −1 

ωn f2k + ω2n −lk

k=0

−l

n−1 

 ωn f2k+1 , −lk

l = 0, 1, 2 . . . , m − 1.

(19)

k=0

From Eqs. (18) and (19), we achieve a FFT without a bit reversal stage for the 1D implicitly padded convolution, they are named fftpadBackward and fftpadForward in [23]. Algorithm 2 fftpadBackward Input: vector f (0 : n − 1) Output: vector f (0 : n − 1), vector u(0 : n − 1) 1: function fftpadBackward 2: for k = 0 → n − 1 do k 3: u(k) ← ω2n f (k); 4: end for 5: f (:) ← FFT−1 [f (:)]; 6: u(:) ← FFT−1 [u(:)]; 7: end function

Algorithm 3 fftpadForward Input: vector f (0 : n − 1), vector u(0 : n − 1) Output: vector f (0 : n − 1) 1: function fftpadForward 2: f (:) ← FFT[f (:)]; 3: u(:) ← FFT[u(:)]; 4: for k = 0 → n − 1 do −k u(k); 5: f (k) ← f (k) + ω2n 6: end for 7: end function

90

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

The ω2n used in Algorithms 2 and 3 is processed by Buneman’s recursion [23]. In the 1D convolution shown in Algorithm 4, we have two input vectors: f (charge density) and g (GF after DCT). The size of f is n, while g has one element more than f . Since, the GF vector is extended in the real even symmetric way, g will be operated differently from u and f in Bowman and Roberts’ convolution routine after the fftpadBackward. We reorganize the g indexed with even and odd orders by multiplying it with f and u, respectively. The extension of g as real even vector implicitly takes place by reusing indices twice as explained in the function Multiply1D: 1: 2: 3: 4: 5: 6: 7: 8: 9: 10:

function Multiply1D[f (:), u(:), g (:)] for i = 0 → n − 1 do if i < n/2 then Ie = 2i, Io = 2i + 1 ; else Ie = 2(n − i), Io = 2(n − i) − 1; end if f (i) = f (i) ∗ g (Ie ), u(i) = u(i) ∗ g (Io ); end for end function

Next, f is multiplied with even indices of g and u with odd indices of g. At last, we perform fftpadForward with f and u to achieve the result of the 1D convolution. Algorithm 4 Implicitly padded convolution in 1D Input: vector f (0 : n − 1), vector g (0 : n) Output: vector f (0 : n − 1) 1: function convolution1D 2: [u, f ] ← fftpadBackward[f ]; 3: [f (:), u(:)] ← Multiply1D[f (:), u(:), g (:)]; 4: [f ] ← fftpadForward[f , u]; 5: end function

4.2. Implicitly padded convolution in 2D For the 2D implicitly padded convolution shown in Algorithm 5, the basic idea is the same as for the 1D convolution. There are two inputs: f sized by Nj × Nk and g sized by Nj + 1 × Nk + 1. Firstly, we apply the routine fftpadBackward along the j direction for all k indices for f . The resulting vectors are f and foe with size Nj × Nk . Secondly, we apply fftpadBackwards along the k direction for all j indices for both f and foe . The four resulting vectors are f , feo for f and foe , foo for foe , respectively. The fftpadBackwards along both directions can be understood as a 2D fftpadBackward. After the 2D fftpadBackward, we multiply the achieved vectors with g obtained by the 2D DCT of the GF values in 2D. The details are demonstrated in the function Mulitply2D: 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18:

function Multiply2D[f (:, :), feo (:, :), foe (:, :), foo (:, :), g (:, :)] for k = 0 → Nk − 1 do for j = 0 → Nj − 1 do if k < Nk /2 then Ke = 2k, Ko = 2k + 1 ; else Ke = 2(Nk − k), Ko = 2(Nk − k) − 1; end if if j < Nj /2 then Je = 2j, Jo = 2j + 1 ; else Je = 2(Nj − j), Jo = 2(Nj − j) − 1; end if f (j, k) = f (j, k) ∗ g (Je , Ke ), feo (j, k) = feo (j, k) ∗ g (Je , Ko ), foe (j, k) = foe (j, k) ∗ g (Jo , Ke ), foo (j, k) = foo (j, k) ∗ g (Jo , Ko ); end for end for end function

Since g should be implicitly multiplied with all of the vectors generated by the 2D fftpadBackward (see Algorithm 5), we have to reexpress the indices of g for each of the four vectors by real even symmetry and order them by even and odd indices separately. Then, we proceed the inverse procedure of 2D fftpadBackward to achieve the solution: the fftpadForward along k for f and feo to obtain f , meanwhile, for foe and foo to obtain foe . Finally, the fftpadForward along j for f and foe to achieve f as the final 2D convolution result.

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

91

0.8

8x10-2

0.6

6x10-2

0.4

4x10-2

0.2

2x10-2

0

10

20

30

40

50

60

0

10

20

30

40

50

60

Fig. 4. (a) The local relative error of standard GF integrals, ηG (ihx , ihy , 0), with different aspect ratios. The computational domain’s ratio is Lx :Ly :Lz = 1:1:30. The number of mesh points is 65 × 65 × 65 = 274,625. When we increase the aspect ratio, the error ηG (ihx , ihy , 0) increases, especially for the grid cells near (0, 0, 0). This directly reflects the reason why the standard GF method solves Poisson’s equation less accurately in the case of large aspect ratios. (b) The local relative error of the GF integrals. The number of mesh points is 65 × 65 × 65 = 274,625. We can see that the integrals significantly disagree from each other at the boundaries of the z axis (in the plot (ihx , ihy , 0)). The different integral results drop fast with growing values of z (for all curves). A clear difference can be seen between the axis linked to the large longitudinal-to-transverse aspect-ratio and the other directions: compare the plot (black solid line) of (ihx , ihy , 0) to the plot (red dot line) of (0, ihy , ihz ). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Algorithm 5 Implicitly padded convolution in 2D Input: f (0 : Nj − 1, 0 : Nk − 1), g (0 : Nj , 0 : Nk ) Output: f (0 : Nj − 1, 0 : Nk − 1) 1: function Convolution2D 2: for k = 0 → Nk − 1 do 3: [f (:, k), foe (:, k)] ← fftpadBackward[f (:, k)]; 4: end for 5: for j = 0 → Nj − 1 do 6: [f (j, :), feo (j, :)] ← fftpadBackward[f (j, :)]; 7: [foe (j, :), foo (j, :)] ← fftpadBackward[foe (j, :)]; 8: end for 9: [f (:, :), feo (:, :), foe (:, :), foo (:, :)] ← Multiply2D[f (:, :), feo (:, :), foe (:, :), foo (:, :), g (:, :)]; 10: for j = 0 → Nj − 1 do 11: [f (j, :)] ← fftpadForward[f (j, :), feo (j, :)]; 12: [foe (j, :)] ← fftpadForward[foe (j, :), foo (j, :)] ; 13: end for 14: for k = 0 → Nk − 1 do 15: [f (:, k)] ← fftpadForward[f (:, k), foe (:, k)]; 16: end for 17: end function The 2D convolution routine is presented differently as the commonly used routine processing the recurrence of a 1D convolution as introduced in [12] and [23]. Reusing the decoupled work memory and memory savings as mentioned in the references is a good efficient optimization way. In practice, we use another efficient implementation by FFTW package. In our implementation, we transform the multiple arrays simultaneously by FFTW advance interface instead of for-loops, which can accelerate the calculation [24]. In this case, we use the advanced complex DFTs routine fftw_plan_many_dft in our code. This allows us to implement the 1D FFTs of an array simultaneously for both directions without further manually operation. The difference between the two implementations in programming occurs in the 2D convolution routine. Additionally, the FFTW package may provide some professional speedup technology in their advanced simultaneously FFT routine, which may further make the difference obvious. The 2D convolution algorithm is used as recurrence in the following 3D convolution algorithm. 4.3. Implicitly padded convolution in 3D For the 3D implicitly padded convolution given in Algorithm 6, we have the two inputs f and g of dimension Ni × Nj × Nk and Ni + 1 × Nj + 1 × Nk + 1, respectively. Firstly, the routine fftpadBackwards transfers f along the i direction, for all j from 0 to Nj − 1 and all k from 0 to Nk − 1 to achieve f and fo . Secondly, for each index i from 0 to Ni − 1, we set the corresponding 2D slices of the 3D FFT of GF values as ge and go for f and fo separately by the function Set2Dg: The 2D vectors are grouped by even and odd indices for i. We simply use the former 2D implicitly convolution routine for both f and fo with the corresponding ge and go for i. After the loop of i, we transfer f and fo by fftpadForwards along i for all j and k indices as the first step. Finally, we obtain the 3D convolution result of f and g. 5. Error classification of Green’s function methods Simulation results should verify (even predict) physical theory and experiment. Unfortunately, simulation results are never one hundred percent accurate. Sometimes they are even far from reality. Therefore, it is necessary to carry out an error analysis of the methods.

92 1: 2: 3: 4: 5: 6: 7: 8:

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

function Set2Dg[g (:, :, :), i] if i < Ni /2 then Ie = 2i, Io = 2i + 1 ; else Ie = 2(Ni − i), Io = 2(Ni − i) − 1; end if ge (0 : Nj − 1, 0 : Nk − 1) = g (Ie , :, :), go (0 : Nj − 1, 0 : Nk − 1) = g (Io , :, :); end function

Algorithm 6 Implicitly padded convolution in 3D Input: f (0 : Ni − 1, 0 : Nj − 1, 0 : Nk − 1), g (0 : Ni , 0 : Nj , 0 : Nk ) Output: f (0 : Ni − 1, 0 : Nj − 1, 0 : Nk − 1) 1: function Convolution3D 2: for j = 0 → Nj − 1 do 3: for k = 0 → Nk − 1 do 4: [f (:, j, k), fo (:, j, k)] ← fftpadBackward[f (:, j, k)]; 5: end for 6: end for 7: for i = 0 → Ni − 1 do 8: [ge (0 : Nj − 1, 0 : Nk − 1), go (0 : Nj − 1, 0 : Nk − 1)] = Set2Dg [g (:, :, :), i]; 9: f (i, :, :) = CONVOLUTION2D(f (i, :, :), ge (:, :)); 10: fo (i, :, :) = CONVOLUTION2D(fo (i, :, :), go (:, :)); 11: end for 12: for j = 0 → Nj − 1 do 13: for k = 0 → Nk − 1 do 14: f (:, j, k) ← fftpadForward[f (:, j, k), fo (:, j, k)]; 15: end for 16: end for 17: end function Optimally, this should be done before implementing them. There are various error sources for the Green’s function methods, we focus on some sources regarding the numerical method itself. Model error: In the PIC model, the error sources are classified into two main sources. Firstly, the fluctuation of feature density, which is connected to a small number of macro particles in the simulation [26]. We can increase the number of grid points, but in this case the computing time increases dramatically as well. The second source is the coupling between the particles and the grid, which comprises the usage of interpolation and deposition algorithms for the particles on the grid. One approach showed that the error (or ‘‘noise’’ in their words) in PIC simulations is Poisson-distributed [27]. In fact, error studies are usually achieved by estimating the total error according to test calculation results. The model error reflects the discretization of ρ(x, y, z ) in (9) as ρ(xi , yj , zk ) in (10) and (12). Numerical integration error: We are more interested in the numerical integration error, which is the main error source besides the model error. As we can see from (9) and (12), there is no more discretization error for the IGF method, since the integral is calculated analytically. However, there is such an error for both, the GF method and the RIGF method, that we should consider. Comparing the RIGF method to the IGF method, how does the parameter Rz affect the relative numerical difference of the computed potential? We define:

δϕex (xi , yj , zk ) = ϕexRIGF (xi , yj , zk ) − ϕexIGF (xi , yj , zk ).

(20)

˜ ex = G˜ exRIGF − G˜ exIGF and 0 < |δ G˜ ex | ≤ ηRz , where ηRz is the upper bound of |δ G˜ ex | and we have: Suppose δ G δϕex (xi , yj , zk ) =

2Nx −1 2Ny −1 2Nz −1

1

  

4π ε0

i ′ =0

j ′ =0

ρex (xi′ , yj′ , zk′ )δ G˜ ex (xi , xi′ , yj , yj′ , zk , zk′ ).

(21)

k′ =0

˜ ex (xi , 0, yj , 0, zk , 0) = 0 for the RIGF method. Hence, if |k − k′ | ≤ Rz , then all error terms are zero for the potential at If k ≤ Rz , δ G (xi , yj , zk ). In other words, we calculate the accurate potentials around (xi , yj , zk ) by means of the IGF method within a distance determined by Rz . Outside of this ‘‘Rz distance’’ we calculate the less accurate potentials using the GF method. The ‘‘ 1r ’’ potential rule shows us: IGF and GF are becoming closer as the ‘‘Rz distance’’ increases. Let Az = N[0, k − Rz − 2] ∪ N[k − Rz − 1, 2Nz − 1], where N[a, b] = {n ∈ N, a ≤ n ≤ b}. Applying the Cauchy–Schwarz inequality we obtain:

|δϕex (xi , yj , zk )| ≤

1 4π ε0

  2Nx −1 2Ny −1  x −1 2Ny −1 2Nz −1  2       2 2N   ρex (xi′ , yj′ , zk′ )  δ G˜ ex (xi , xi′ , yj , yj′ , zk , zk′ ) . i′ =0

j′ =0 k′ ∈Az

i ′ =0

(22)

j′ =0 k′ =Rz +1

From Eq. (22), we see that the local error of the potential is limited by both the charge density outside the ‘‘Rz distance’’ and the error of GF values, when the index is larger than Rz . Since both are controlled and limited by parameter Rz , the error converges to zero when we increase Rz .

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

93

0.6

for RIGF method

3 0.4

0.2

0

2

1

0 0

2

4

6

8

10

12

14

0

5

10

15

˜ Rz +1 | (left) and the error ηˆ ϕ (defined in Example 1) of the RIGF integrals for various Rz (right). Fig. 5. The error |δ G Table 2 Comparison of different Green’s function integrals’ elapsed time with increasing grid resolution. N +1

GF

IGF

RIGF(s = 1)

RIGF(s = 2)

65 129 257

0.004 s 0.029 s 0.217 s

0.062 s 0.448 s 4.030 s

0.019 s 0.103 s 0.595 s

0.034 s 0.164 s 0.871 s

Table 3 Elapsed time comparison of 3D DCT and FFT transforms for GFs with increasing grid resolution. N +1

3D DCT

2N

3D FFT

33 65 129 257

0.003 s 0.011 s 0.063 s 0.558 s

64 128 256 512

0.022 s 0.194 s 2.053 s 22.32 s

One interesting problem is how to decide on the parameter Rz , 0 ≤ Rz ≤ Nz . With Rz = 0, RIGF is the same as GF. For Rz = Nz , RIGF is the same as IGF. In all other cases, the parameter is some threshold value of difference of the IGF over all grid cells that we could tolerate. Also it is the threshold value of the acceptable distance where we calculate the potential accurately, as it is described above. Of course, the smaller the value is, the less accurate the solution is. On the other hand, the smaller the value is, the more computing time we save. This is due to the fact that all the integrals with a relative error smaller than the tolerance error are calculated by simple, fast standard GF integrals. Adapted accuracy control: We calculate the tilde IGF values with increasing coordinate k. Additionally, we calculate tilde GF at (0, 0, 0, 0, zk , 0) and record the δ G˜ k = δ G˜ (0, 0, 0, 0, zk , 0). ˜ IGFk−1 − G˜ IGFk )/G˜ IGFk−1 drops down to be less than 1/ log2 Nz . 1. Locate the stable area of GF: Find the first k for which the magnitude of (G Record k as kstable . ˜ k /δ G˜ kstable drops down to 10−s , s ≥ 0. Here, 2. Locate the accuracy tolerance: The proper Rz is as the first k, where the magnitude of δ G s = 1 or 2 is the accuracy control integer for the RIGF method. When higher accuracy is preferred, we could choose the next or even a higher accuracy number for switching. Comparing the plots in Fig. 5, we can conclude that kstable = 7. For higher accuracy, we could choose s = 1, 2 or even larger integers for switching. In fact, the solution is not very sensitive to Rz . It does not need to be a large value in practice, because the grid cell integrals drop very fast as we see from Fig. 4. Again, after some steps, the model error may play the major error role and the integral error is weaker and has less influence on the solution. Hence, ηˆ ϕ is less dependent on Rz than δ˜G as Fig. 5 shows. 6. Results and examples 6.1. Optimization results: speed-up of the elapsed time The efficiency improvements are shown by three classified optimizations: the optimization of the IGF calculation, the optimization of the 3D FFT calculation for the GFs and the optimization of the convolution calculation. The comparison results are listed in Tables 2–4 with mesh number N = Nx = Ny = Nz . The elapsed time is measured with increasing number of grid points by the CPU time routine implemented in C on a (Debian) linux system with an Intel 2.6 GHz processor. Optimization of IGF calculation: We use the RIGF with the adapted accuracy control to choose the parameter Rz for both s = 1 and s = 2, then the Rz is chosen automatically in the code routine. The optimization effect is obvious, the elapsed time of the IGF calculation is 2–7 times higher than the RIGF calculation. Optimization of 3D FFT calculation for GFs: For the 1D problem, the complexity of DCT is 2.5n log n while the complexity of FFT is 5n log 2n. For the 3D problem, the complexity and the efficiency improvement of DCT versus FFT is higher than linear, the used storage is only 1/8 of the storage of the extended tilde GFs explained in Section 2.

94

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96 Table 4 Comparison of different convolutions’ elapsed time with increasing grid resolution. N +1

H&E’s routine

updated H&E’s routine

Implicitly padded routine

33 65 129 257

0.068 s 0.621 s 2.124 s 164.5 s

0.015 s 0.201 s 1.867 s 65.5 s

0.012 s 0.089 s 1.295 s 11.79 s

Table 5 Comparison of the three different Green’s function Poisson solvers’ relative errors and elapsed time in Example 1. N = Nx = Ny = Nz . GF + updated H&E method

RIGF(s = 1) + implicitly padded method

IGF + updated H&E method

N

ηˆ ϕ

ηˆ E

Elapsed time

N

ηˆ ϕ

ηˆ E

Elapsed time

N

ηˆ ϕ

ηˆ E

Elapsed time

32 64 128

0.3741 0.1239 0.03183

2.008 1.0590 0.5201

0.041 s 0.411 s 4.006 s

32 64 128

0.0239 0.00317 0.00125

0.0500 0.0212 0.0140

0.060 s 0.464 s 4.455 s

32 64 128

0.0239 0.00316 0.00127

0.0501 0.0212 0.0140

0.019 s 0.121 s 1.482 s

Table 6 Parameters of the bunch in Example 3. Parameters for Bunch Number of macro particles Beam energy Beam charge Normalized emittance Bunch length rms bunch radius

20,000 0.800 MeV −1.000 nC 1.000 π mrad mm 22.5 mm 0.75 mm

Optimization of convolution calculation: We present three different convolution routines: Hockney and Eastwood’s (H&E’s) routine, updated H&E’s routine (Algorithm 1) and our aforementioned implicitly padded convolution routine. The speed-up varies for different grid numbers. 6.2. Test examples: confirm the accuracy of the RIGF with the implicitly padded convolution method The RIGF with the implicitly padded convolution method is tested by three different examples for the agreement with the IGF method. These are: simulating an ideal uniform electron bunch, a very long Gaussian bunch representing the ILC bunch and the tracking of a Gaussian bunch along a distance about 0.3 m without any external electric field. Example 1. We use an ideal uniform ellipsoidal bunch with an aspect ratio of 30 as an example. The relative errors which we used are defined as follows:

ηϕ (i, j, k) :=

|ϕi,j,k − ϕtruei,j,k | max |ϕtruei,j,k |

,

and

i ,j ,k

ηE (i, j, k) :=

∥Ei,j,k − Etruei,j,k ∥2 max ∥Etruei,j,k ∥2 i,j,k

,

ηˆ ϕ := max(ηϕ (i, j, k)), i,j,k

and ηˆ E := max(ηE (i, j, k)). i,j,k

Here, ϕi,j,k = ϕxi ,yj ,zk denotes the computed potential and ϕtruei,j,k = ϕtruexi ,yj ,zk the analytical potential at the grid point (xi , yj , zk ), respectively. Similarly, the values of the electric field Ei,j,k and Etruei,j,k are defined. From Table 5, we can see a very good result: the RIGF method combines the advantages of the standard GF method and of the IGF method. Nearly the same accuracy as with the IGF approach is achieved at even less computational time than required by the GF method. The results for Nx = Ny = Nz = 128 i.e. 2,097,152 mesh points, are presented in Fig. 6. We can observe that the relative electric field error of the RIGF method is nearly the same as that of the IGF method in the case of a uniform bunch with large aspect ratio 30. Example 2. This example represents the ILC bunch with an aspect ratio of 17:1:708 in x, y and z direction, respectively. It is generated with 20,000 macro particles with a Gaussian distribution by the tool generator of the tracking code ASTRA [9]. Fig. 7 represents the zcomponent of the electric field (in the rest frame) along the z-axis using three different Green’s function methods. It turns out that the difference between the standard GF method and the IGF method is significant. However, the RIGF method still strictly agrees with the IGF method despite the much lower computational effort. Example 3. In the third example we tracked a Gaussian bunch for 1000 time steps with 1t = 1 ps in the beam line without any external electric field. The bunch parameters are shown in Table 6. We compare the emittance and the bunch size of the tracked bunch in Fig. 8. The RIGF agrees well with the IGF method. It is not surprising that the RIGF method calculates the electric field to the ‘‘same’’ accuracy as the IGF method does. On one hand, the grid cell integrals (13) lead to smaller errors as explained before. In contrast, we can see from the frequency domain: the value at (1, 1, 1) ˜ exRIGF − FG˜ exIGF is equal to the sum of all δ G˜ ex . This Fδ G˜ ex (1, 1, 1) does not affect the electric field in the calculation, since Fδ G˜ ex (1, 1, 1) of FG

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

(a) GF at (:,

Ny 2

(d) GF at (:, :,

, :).

Nz 2

).

(b) IGF at (:,

Ny 2

(e) IGF at (:, :,

, :).

Nz 2

).

(c) RIGF at (:,

95

Ny 2

(f) RIGF at (:, :,

, :).

Nz 2

).

Fig. 6. Comparison of ηE (i, j, k) for the three Green’s function methods in Example 1. GF (a) (d), IGF (b) (e), RIGF (c) (f).

Fig. 7. Electric field component Ez as function of z of a Gaussian bunch in Example 2 from the solutions of IGF, RIGF and standard GF method, respectively.

describes the main potential shift of the RIGF method to the IGF method, and the electric field (gradient of the potential) is not influenced ˜ ex (1, 1, 1) shift does not perturb by the potential shift. The perturbations for ‘‘other frequencies’’ are much smaller. Summing up, the Fδ G the electric field and all other perturbations are very small in the frequency domain, benefitting from the fact that the RIGF method solves accurately. 7. Conclusions In this paper, we introduced an optional 3D Poisson solver routine with the optimizations of the IGF and the relative double sized discrete Fourier transform by a reduced integrated Green’s function (RIGF) and the relative discrete cosine transform, the optimization of fast convolution by a novel implicitly padded convolution. We tested the new routine with three model problems. On the practical side, the new routine with RIGF is not only less time consuming, but also achieves almost the same accuracy for the electric field as the integrated Green’s function method (IGF). Thus, we can conclude that we successfully aimed to find a faster method as accurate as the classical and commonly used convolution routine by Hockney and Eastwood. The whole speed-up of the Poisson solver can reach as large as 3 to 4. So we propose the use of all parts of our new routine rather than the aforementioned classical Poisson solver routine in order to speed up calculations.

96

D. Zheng et al. / Computer Physics Communications 198 (2016) 82–96

(a) GF.

(d) GF.

(b) IGF.

(c) RIGF.

(e) IGF.

(f) RIGF.

Fig. 8. Comparison of emittance growth and bunch size increase for the three Green’s function methods in Example 3. GF (a) (d), IGF (b) (e), RIGF (c) (f).

Acknowledgments The authors would like to thank T. Flisgen for providing the graphics for Fig. 3, A. Markovik´ for providing the bunch data of Example 2 and J. Qiang for his suggestions and valuable discussions. The work of the first author was supported by a grant (grant number: 2011618041) of the CSC (China Scholarship Council). References [1] G.H. Hoffstaetter, and S.M. Tigner (Eds.), Cornell Energy Recovery Linac: Science Case and Project Definition Design Report, Tech. Report, 2013. URL https://www.classe.cornell.edu/Research/ERL/PDDR.html. [2] Massimo Altarelli, et al. (Eds.) The European X-ray Free-Electron Laser Technical Design Report; DESY 2006-097, Tech. Report, 2007. URL http://www.xfel.eu/sites/site_ xfel-gmbh/content/e63617/e79991/e74695/european-xfel-tdr_eng.pdf. [3] Ties Behnke, et al. (Eds.) The International Linear Collider Technical Design Report Volume 1: Executive Summary, Tech. Report, 2013. URL http://www.linearcollider. org/ILC/Publications/Technical-Design-Report. [4] M. Aicheler, et al. (Eds.) A Multi-TeV linear collider based on CLIC technology: CLIC Conceptual Design Report, Tech. Report, Geneva, 2012. http://dx.doi.org/10.5170/ CERN-2012-007. [5] M. Clemens, M. Dohlus, S. Lange, G. Pöplau, T. Limberg, U. van Rienen, Microbunch Amplification in the European XFEL, TESLA-FEL-2009-02, DESY, Hamburg, 2009. URL http://flash.desy.de/reports_publications/tesla_fel_reports/tesla_fel_2009/. [6] G. Pöplau, U. van Rienen, A. Meseck, Numerical studies of the behavior of ionized residual gas in an energy recovering linac, Phys. Rev. ST Accel. Beams 18 (2015) 044401. http://dx.doi.org/10.1103/PhysRevSTAB.18.044401. URL http://link.aps.org/doi/10.1103/PhysRevSTAB.18.044401. [7] J. Qiang, R.D. Ryne, S. Habib, V. Decyk, An object-oriented parallel particle-in-cell code for beam dynamics simulation in linear accelerators, J. Comput. Phys. 163 (2) (2000) 434–451. http://dx.doi.org/10.1006/jcph.2000.6570. URL http://linkinghub.elsevier.com/retrieve/pii/S0021999100965707. [8] J. Qiang, P. Colestock, D. Gilpatrick, H. Smith, T. Wangler, M. Schulze, Macroparticle simulation studies of a proton beam halo experiment, Phys. Rev. ST Accel. Beams 5 (12) (2002) 124201. http://dx.doi.org/10.1103/PhysRevSTAB.5.124201. URL http://link.aps.org/doi/10.1103/PhysRevSTAB.5.124201. [9] K. Flöttmann, ASTRA, DESY, Hamburg, 2000. URL http://www.desy.de/~mpyflo. [10] Pulsar Physics, General Particle Tracer, GPT, 2004. URL http://www.pulsar.nl/gpt. ´ Simulation of the interaction of positively charged beams and electron clouds (Ph.D. thesis), University of Rostock, Rostock, Germany, 2013. [11] A. Markovik, [12] R.W. Hockney, J.W. Eastwood, Computer Simulation Using Particles, CRC Press, 2010. [13] C. Birdsall, A. Langdon, Plasma Physics via Computer Simulation, Taylor and Francis Group, 1985, URL http://www.osti.gov/scitech/servlets/purl/6702524. [14] J.M. Dawson, Particle simulation of plasmas, Rev. Modern Phys. 55 (2) (1983) 403. URL http://rmp.aps.org/abstract/RMP/v55/i2/p403_1. [15] D. Tskhakaya, K. Matyash, R. Schneider, F. Taccogna, The particle-in-cell method, Contributions to Plasma Physics 47 (8–9) (2007) 563–594. http://dx.doi.org/10.1002/ ctpp.200710072. URL http://onlinelibrary.wiley.com/doi/10.1002/ctpp.200710072/abstract. [16] G. Pöplau, U. van Rienen, B. van der Geer, M. de Loos, Multigrid algorithms for the fast calculation of space-charge effects in accelerator design, IEEE Trans. Magn. 40 (2) (2004) 714–717. http://dx.doi.org/10.1109/TMAG.2004.825415. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1284514. [17] C. Huang, et al., Recent results and future challenges for large scale particle-in-cell simulations of plasma-based accelerator concepts, J. Phys. Conf. Ser. 180 (2009) 012005. http://dx.doi.org/10.1088/1742-6596/180/1/012005. URL http://stacks.iop.org/1742-6596/180/i=1/a=012005?key=crossref.5deb0f41166e805afbc3ae6dcced074e. [18] A. Luccio, N. D’Imperio, J. Beebe-Wang, Space Charge Dynamics Simulated in 3-D in the Code ORBIT, in: Proceedings of EPAC 2002, Paris. URL https://accelconf.web.cern.ch/accelconf/e02/PAPERS/WEPLE058.pdf. [19] J. Qiang, M.A. Furman, R.D. Ryne, A parallel particle-in-cell model for beam-beam interaction in high energy ring colliders, J. Comput. Phys. 198 (1) (2004) 278–294. http://dx.doi.org/10.1016/j.jcp.2004.01.008. URL http://linkinghub.elsevier.com/retrieve/pii/S0021999104000282. [20] J. Qiang, S. Lidia, R. Ryne, C. Limborg-Deprey, Three-dimensional quasistatic model for high brightness beam dynamics simulation, Phys. Rev. ST Accel. Beams 9 (4) (2006) 044204. http://dx.doi.org/10.1103/PhysRevSTAB.9.044204. URL http://link.aps.org/doi/10.1103/PhysRevSTAB.9.044204. [21] J. Qiang, S. Lidia, R. Ryne, C. Limborg-Deprey, Erratum: Three-dimensional quasistatic model for high brightness beam dynamics simulation [phys. rev. ST accel. beams 9, 044204 (2006)], Phys. Rev. ST Accel. Beams 10 (12) (2007) 129901. http://dx.doi.org/10.1103/PhysRevSTAB.10.129901. URL http://link.aps.org/doi/10.1103/ PhysRevSTAB.10.129901. [22] D. Zheng, G. Pöplau, U. van Rienen, On several Green’s function methods for fast Poisson solver in free space, in: Proceedings of SCEE, Scientific Computing in Electrical Engineering, 2014, Wuppertal. [23] J.C. Bowman, M. Roberts, Efficient dealiased convolutions without padding, SIAM J. Sci. Comput. 33 (1) (2011) 386–406. http://dx.doi.org/10.1137/100787933. URL http://epubs.siam.org/doi/abs/10.1137/100787933. [24] M. Frigo, S.G. Johnson, The design and implementation of FFTW3, Proc. IEEE 93 (2) (2005) 216–231. [25] C. van Loan, Computational frameworks for the fast Fourier transform, in: Frontiers in Applied Mathematics, vol. 10, SIAM, 1992. [26] Y.N. Grigoryev, et al., Numerical Particle-In-Cell Methods: Theory and Applications, DE GRUYTER, Utrecht; Boston, 2002. [27] B. Terzić, I.V. Pogorelov, C.L. Bohn, Particle-in-cell beam dynamics simulations with a wavelet-based Poisson solver, Phys. Rev. ST Accel. Beams 10 (2007) 034201. http://dx.doi.org/10.1103/PhysRevSTAB.10.034201. URL http://link.aps.org/doi/10.1103/PhysRevSTAB.10.034201.