2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)
FP-AMR: A Reconfigurable Fabric Framework for Adaptive Mesh Refinement Applications

Tianqi Wang∗, Tong Geng†, Xi Jin∗, and Martin Herbordt†
∗ Department of Physics; University of Science and Technology of China, Hefei, China
† Department of Electrical and Computer Engineering; Boston University, Boston, MA, USA
Abstract—Adaptive mesh refinement (AMR) is one of the most widely used methods in High Performance Computing, accounting for a large fraction of all supercomputing cycles. AMR operates by dynamically and adaptively applying computational resources non-uniformly to emphasize regions of the model as a function of their complexity. Because AMR generally uses dynamic and pointer-based data structures, acceleration is challenging, especially in hardware. As far as we are aware, no previous work has been published on accelerating AMR with FPGAs. In this paper, we introduce a reconfigurable fabric framework called FP-AMR. The work is in two parts. In the first, FP-AMR offloads the bulk per-timestep computations to the FPGA; analogous systems have previously done this with GPUs. In the second part we show that the remaining CPU-based tasks, including particle-mesh mapping, mesh refinement, and coarsening, can also be mapped efficiently to the FPGA. We have evaluated FP-AMR using the widely used AMReX framework and found that a single FPGA outperforms a Xeon E5-2660 CPU server (8 cores) by 21x-23x, depending on problem size and data distribution.

Index Terms—Reconfigurable Computing, Adaptive Mesh Refinement, Scientific Computing, High Performance Computing

I. INTRODUCTION

The increase in supercomputer performance over the last several decades has, by itself, been insufficient to solve many challenging modeling and simulation problems. For example, the complexity of solving evolutionary partial differential equations (PDEs) scales as Ω(n⁴), where n is the number of mesh points per dimension. Thus, the three-order-of-magnitude improvement in performance over the past 25 years has meant just a 5.6x gain in spatio-temporal resolution [1]. To address this problem, many simulation packages [2]–[6] use Adaptive Mesh Refinement (AMR), which selectively applies computing to the regions of most interest, to increase resolution. Its utility and versatility have made AMR one of the most widely used frameworks in HPC [7].

The problem with traditional uniform meshes is that difficult regions (discontinuities, steep gradients, shocks) require high resolution; but since that high resolution is applied everywhere equally, most of the computation is wasted. For example, the physical universe is populated by structures at very different scales, from superclusters down to galaxies. These strongly correlated structures need fine resolution in both space and time, while regions of space that are mostly void can make do with coarse resolution (e.g., [6], [8]). When using AMR, simulations start from a coarse grid. The program then identifies regions that need more resolution and superimposes finer sub-grids only on those regions.

With the emergence of FPGAs in large-scale clouds and clusters, it makes sense to investigate whether they are an appropriate tool for AMR acceleration; we are not aware of any previous study. The challenge is that AMR generally uses dynamic and pointer-based data structures, characteristics that make any acceleration difficult [9], [10]: random off-chip DRAM accesses are expensive and inefficient. AMR frameworks that support GPU acceleration [8] generally use the GPU only to execute the PDE solver and use a host CPU to handle operations on the dynamic data structures [5]. FPGAs can be designed with a cache to support traversal of pointer-based data structures [11], but this approach does not support their modification. Our approach provides a carefully designed data structure and memory subsystem to support all relevant operations.

In this work, we propose FP-AMR, a reconfigurable fabric framework for block-structured adaptive mesh refinement applications [2]. FP-AMR helps programmers develop AMR applications for both FPGA-centric clusters and traditional HPC clusters with FPGAs as accelerators. The novel contributions are as follows:
• FP-AMR can offload all CPU-based tasks to FPGAs, including particle-mesh mapping, mesh refinement, and coarsening. In all GPU AMR versions of which we are aware, these functions must be executed on a CPU. The advantage of complete offload is a drastic reduction in communication overhead.
• FP-AMR uses a Z-order space-filling curve to map the simulation space and the adaptive mesh to a customized data structure. This design uses spatial locality to reduce data access conflicts and improve memory access efficiency.
• FP-AMR supports direct FPGA-to-FPGA communication. This means that inter-accelerator communication, which is necessary to periodically update ghost-zones, avoids transit through the CPUs.
• Experiments show that, with FP-AMR, a single FPGA outperforms a Xeon E5-2660 CPU by 21x to 23x, depending on cluster scale and simulation initial conditions. Also, using the secondary network, performance scales perfectly to eight FPGAs and is likely to scale similarly on much larger FPGA clusters.
II. BACKGROUND
A. AMR

We use cosmological simulation as a motivating example of AMR (Algorithm 1). The functions Calc_Potential, Calc_Accel, and Motion_Update, listed in Lines 2-4, can be replaced by any application-specific customized functions. The key feature is the function Particle_Map in Line 1.
Algorithm 1 Algorithm for time forward
Input: P_i, the particle information at time T_i
Output: P_{i+1}, the particle information at time T_{i+1}
1: Mass_Mesh_i = Particle_Map(P_i)            # map T_i particles to the adaptive mesh
2: Potential_i = Calc_Potential(Mass_Mesh_i)  # calculate the T_i potential distribution
3: Accel_i = Calc_Accel(Potential_i)          # calculate all particles' accelerations at T_i
4: P_{i+1} = Motion_Update(Accel_i, P_i)      # motion update to generate P_{i+1}
5: return P_{i+1}
Algorithm 2 Algorithm for Particle_Map
Input: P_i, the particle information at time T_i
Output: Mass_Mesh_i, the adaptive mesh at time T_i
1: Initialize(Mesh_Conf, P_Sort_i)
2: for each block ∈ Space do
3:   count = 0
4:   par_list = ∅
5:   for each par ∈ P_i do
6:     if par_loc ∈ block then
7:       count += 1
8:       par_list ← par
9:     end if
10:  end for
11:  if count > threshold_1 then
12:    Mesh_Conf ← Refine(block)
13:  else if Σ_adj count < threshold_2 then
14:    Mesh_Conf ← Coarsen(block)
15:  else
16:    Mesh_Conf ← Keep(block)
17:  end if
18:  P_Sort_i ← par_list
19: end for
20: for each par ∈ P_Sort_i do
21:  Mass_Mesh ← Interpolation(par, Mesh_Conf)
22: end for
23: return Mass_Mesh
Details of Particle_Map are shown in Algorithm 2. In Lines 2-19, the first nested loop traverses the whole simulation space and counts the number of particles in each mesh block. According to the particle density of each block and of its adjacent blocks, the block is refined, coarsened, or kept. Based on Mesh_Conf, generated by Lines 2-19, Lines 20-22 traverse all particles and interpolate the particles' masses onto the adaptive mesh. At the same time, all particles are sorted by spatial locality for the next timestep's particle interpolation.
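To make Algorithm 2's control flow concrete, here is a minimal software sketch in Python (our illustration, not the paper's hardware pipeline). The block grid, the thresholds, and the simplified coarsening test are assumptions for illustration; FP-AMR realizes the same two loops in the PE_Sort and PE_Map hardware described in Section IV.

from collections import defaultdict

def particle_map(particles, block_size, t_refine=4, t_coarsen=1):
    """Software sketch of Algorithm 2 in 2D: count particles per block,
    decide refine/coarsen/keep, then interpolate the sorted particles."""
    # First nested loop (Lines 2-19): bucket particles by block and count them.
    buckets = defaultdict(list)
    for x, y, m in particles:
        block = (int(x // block_size), int(y // block_size))
        buckets[block].append((x, y, m))

    mesh_conf, p_sort = {}, []
    for block, par_list in sorted(buckets.items()):  # sorted by block coords,
        if len(par_list) > t_refine:                 # not a true SFC, for brevity
            mesh_conf[block] = 'refine'
        elif len(par_list) < t_coarsen:              # stand-in for the adjacent-block test
            mesh_conf[block] = 'coarsen'
        else:
            mesh_conf[block] = 'keep'
        p_sort.extend(par_list)                      # particles now grouped by block

    # Second loop (Lines 20-22): map each particle's mass onto the mesh
    # with standard cloud-in-cell (bi-linear) weights.
    mass_mesh = defaultdict(float)
    for x, y, m in p_sort:
        ix, iy = int(x // block_size), int(y // block_size)
        dx, dy = x / block_size - ix, y / block_size - iy
        mass_mesh[(ix,     iy    )] += m * (1 - dx) * (1 - dy)
        mass_mesh[(ix + 1, iy    )] += m * dx * (1 - dy)
        mass_mesh[(ix,     iy + 1)] += m * (1 - dx) * dy
        mass_mesh[(ix + 1, iy + 1)] += m * dx * dy
    return mesh_conf, p_sort, mass_mesh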
Figure 1 shows details of Particle_Map: how a cosmological simulation uses AMR to apply different resolutions to the spatio-temporal regions of high mass density. As shown in Figure 1(A), AMR starts from the coarsest grid (Level 0), identifies blocks with high mass density (red), applies finer resolution to those blocks, and thereby generates a finer grid (Level 1) for just those blocks. Continuing, the next grid (Level 2) is generated for the high-density blocks in Level 1 (blue). Figure 1(B) shows the subcycling-in-time approach used by AMR. Step 1: Integrate Level 0 over t. Step 2: Integrate Level 1 over t/2. Step 3: Integrate Level 2 over t/4. Step 4: Integrate Level 2 over t/4. Step 5: Synchronize Levels 1 and 2. Step 6: Integrate Level 1 over t/2. Step 7: Integrate Level 2 over t/4. Step 8: Integrate Level 2 over t/4. Step 9: Synchronize Levels 1, 2, and 3.
Fig. 1. Spatio-temporal refinement in the AMR algorithm: (A) space-dimension refinement; (B) time-dimension refinement
Fig. 2. Bi-linear interpolation. A particle of mass M at offsets (dx, dy) within a cell is partitioned over the four surrounding grid points in proportion to the opposite sub-areas: m1 = M·S1 = M·dx·dy; m2 = M·S2 = M·dx·(1−dy); m3 = M·S3 = M·(1−dx)·(1−dy); m4 = M·S4 = M·(1−dx)·dy
In summary, the key features of Particle_Map used to build and adapt the adaptive mesh are: 1) Refine and coarsen mesh: according to the mass density, the AMR method decides whether each block should be refined or coarsened. 2) Interpolate: as shown in Fig. 2, bi-linear or tri-linear interpolation is used for the 2D or 3D case; based on area/volume, each particle's mass is partitioned and mapped onto the mesh (see the sketch below).
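As a quick check of the partition-by-area/volume idea, the sketch below (an illustrative Python rendering, not part of the paper's design) computes the tri-linear weights for the 3D case: each of the eight corner weights is the volume of the sub-box diagonally opposite that corner, and the weights always sum to 1, so mass is conserved.

def trilinear_weights(dx, dy, dz):
    """Weights for the 8 corners of a unit cell given in-cell offsets
    (dx, dy, dz) in [0, 1): corner (cx, cy, cz) receives the volume of
    the sub-box diagonally opposite the particle."""
    return {
        (cx, cy, cz): (dx if cx else 1 - dx) *
                      (dy if cy else 1 - dy) *
                      (dz if cz else 1 - dz)
        for cx in (0, 1) for cy in (0, 1) for cz in (0, 1)
    }

w = trilinear_weights(0.25, 0.5, 0.1)
assert abs(sum(w.values()) - 1.0) < 1e-12   # mass is conserved
print(w[(1, 1, 1)])                          # 0.25 * 0.5 * 0.1 = 0.0125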
B. FPGA-centric clusters

In traditional HPC clusters, accelerators communicate through their hosts, which results in substantial latency overhead. In FPGA-centric clusters, such as Catapult I [12] and many other recent systems [13], [14], FPGAs are interconnected directly through their transceivers [15]–[17]. As shown in Figure 3, the reconfigurable devices (marked yellow) can exchange data without detouring through the host CPUs. While there is no technological reason why GPUs (and other accelerators) cannot be built with a similar capability, to date NVLink supports only a small number of GPUs and, unlike with FPGAs, adds substantially to the cost of the cluster.

C. Challenge and Motivation

For AMR-based applications, accelerators use optimized kernels to handle the functions in Algorithm 1, Lines 2-4. Particle_Map, however, uses tree-based dynamic data structures. For scientific simulations, the size of these data structures is usually hundreds of MB, so these pointer-based dynamic structures must be stored off-chip. Since the large number of random accesses makes them difficult to accelerate, programmers generally run Particle_Map on the host CPUs.

Figure 4 shows how traditional clusters execute AMR through alternating phases of computation and host-device communication. The fraction of time consumed by Particle_Map and the associated host-device communication can be expressed as

  Ratio_amr = (Th + 2Tc) / (Th + Td + 2Tc)    (1)

where Th is the CPU time, Td is the accelerator time, and Tc is the communication latency. A better accelerator can reduce Td, but the time required to exchange hundreds of MB of data between host and device (e.g., through PCIe) is substantial. Moreover, Tc cannot be overlapped, because timestep 2 cannot start before timestep 1 finishes. Offloading Particle_Map to the accelerator is therefore essential to improve performance further.
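A quick numeric reading of equation (1) shows the problem; the per-timestep costs below are invented purely for illustration.

def amr_fraction(t_h, t_d, t_c):
    """Eq. (1): fraction of a timestep spent in host-side Particle_Map
    plus the two host-device transfers."""
    return (t_h + 2 * t_c) / (t_h + t_d + 2 * t_c)

# Assumed costs (ms): host AMR 40, accelerated solver 20, one-way mesh
# transfer 15. Over 3/4 of the step is AMR plus transfer, and a faster
# accelerator (smaller t_d) only makes that fraction larger.
print(amr_fraction(40, 20, 15))   # ~0.78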
Fig. 4. Communication and computation phases

III. FP-AMR FRAMEWORK

A. Overview

Figure 3 shows FP-AMR in an FPGA-centric cluster. Figure 3(A) shows the system overview; Figure 3(B) gives details of the FPGA design. A router handles FPGA-to-FPGA communication through the transceivers. Functions 2-4 in Algorithm 1 are instantiated as user-specific kernels 1-n. FP-AMR (the red block) wraps these user-specific kernels. According to Algorithm 2, two nested loops are executed sequentially. As shown in Figure 3(C), the FP-AMR module therefore consists of two parts, PE_Sort and PE_Map, corresponding to the two loops. PE_Sort reads the particle information list and generates the sorted particle information list and the AMR grid status. From these structures, PE_Map generates the final mass-mapped mesh. For most simulation applications, the adaptive mesh data structures are too large (usually several GB) to be stored on-chip.

Fig. 3. Overview of FP-AMR: (A) system overview; (B) FPGA design; (C) FP-AMR module

B. Space Filling Curve

According to Algorithm 2 (Lines 2-19 and 20-22), AMR must traverse the whole simulation space and the particle information list. Since all the necessary data structures (Figure 3(C)) are stored off-chip, FP-AMR accesses off-chip memory frequently. To make these accesses efficient, FP-AMR uses an on-chip cache to take advantage of spatial locality. To map the 2D/3D simulation space to the 1D memory address space, a space-filling curve (SFC) is advantageous.
Fig. 5. Using a space-filling curve to index the whole simulation space and the adaptive mesh grid: (A) simulation space containing blocks and grid points; (B) extracted blocks and their indices; (C) extracted grid points and their indices
Fig. 6. Different space-filling curves: (A) Z-ordering curve; (B) Peano-Hilbert curve; (C) adaptive mesh
SFCs have ranges that contain the entire 2D/3D unit square; the Z-order and Peano-Hilbert curves are two widely used SFCs. As shown in Figure 5(A), the whole simulation space contains two basic elements: blocks and mesh grids. Blocks are the orange regions in Figure 5(A) and are used to count the particles' spatial distribution. The mesh grids are the black points in Figure 5(A)(C) and are used to store the mapped particle masses. To exploit spatial locality, the blocks and mesh grids both need to be indexed with the SFC. We use the Z-order curve as an example. In Figure 5(B)(C), the blocks and mesh grid points are extracted separately and indexed. Figure 6 shows the indexing methods for the Z-order and Peano-Hilbert curves. Note that the Z-order curve needs only a bit-shuffle operation, as shown in Figure 6(A). In contrast, the Peano-Hilbert curve generator is a recursive algorithm, which is extremely expensive to implement in hardware. The Z-order curve is therefore clearly the more reasonable choice. For the adaptive mesh, the blocks of the whole simulation space are shown in Figure 7(A). From the adaptive mesh, we can extract the different mesh grid levels (Level 1 to Level 3), shown in Figure 7(B). AMR uses a finer grid to index critical areas. For a particle located in the block marked with the star in Figure 7(A), we generate its indices at all three grid levels (see Figure 7(B)) and concatenate the three indices to reference the particle.
Fig. 7. Adaptive mesh indexing: the per-level block indices (L1, L2, L3) are concatenated, e.g., Index = 03_01_02
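To illustrate why the Z-order curve is hardware-friendly, the sketch below (an illustrative Python version; in the FPGA the same mapping is pure wiring with no logic on the path) interleaves the bits of (X_Block_ID, Y_Block_ID) to form the block index, which is exactly the bit-shuffle operation of Figure 6(A). The bit ordering, with Y in the even positions, is inferred from the block numbering recoverable from Figure 5.

def z_order(x_id: int, y_id: int, bits: int = 3) -> int:
    """Interleave the bits of (X_Block_ID, Y_Block_ID) into a Morton code.
    In hardware this is a fixed rewiring of the index bits."""
    z = 0
    for i in range(bits):
        z |= ((y_id >> i) & 1) << (2 * i)      # Y supplies the even bits
        z |= ((x_id >> i) & 1) << (2 * i + 1)  # X supplies the odd bits
    return z

# Block (X_Block_ID=011, Y_Block_ID=010) carries index 14 in Fig. 5.
assert z_order(0b011, 0b010) == 14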
C. Data Structures

In FP-AMR, the particle information and the adaptive mesh are stored in off-chip DRAM, so their data structures must be designed carefully. Figure 8 shows all the data structures of Algorithm 1 and their relationships. For time step i: (1) Sort: the particle information list of Time_i (P_i: from particle 0 to particle n) is sorted by spatial locality to generate P_Sort_i (from particle 0' to particle n'). (2) Interpolation: generate Mass_Mesh_i of Time_i. (3) Calc_Potential/Calc_Accel: user-specific kernels calculate the discrete potential and each particle's acceleration at Time_i based on Mass_Mesh_i. (4) Motion update: based on Time_i's accelerations and the (location, velocity) fields of P_Sort_i, the next timestep's particle information list P_{i+1} is generated.

There are two kinds of data structures: the particle information lists (P_i, P_Sort_i) and the adaptive mass mesh point list (Mass_Mesh_i). Both have a similar memory layout. The particle information list (Figure 9(A)) is divided into several sections, where each section contains all particles that share the same block-index. The sections are stored contiguously, and sections with larger block-indices are stored at higher addresses. As shown in Figure 8, from time step i to time step i+1, the location field of the particle information list is updated. In the following time step, even if the particle spatial distribution is relatively stable, some particles will flow into and out of each block. Each section therefore has extra backup space at its end, in case more particles need to be located in the block.
Besides the memory that stores the particle information, FP-AMR needs an extra base-address array and counter array. The base-address array keeps each section's base address, and the counter array records how many particles currently exist in each block. Figure 9 shows how the coarsen and refine operations affect the particle information list: first, the section or a group of adjacent sections (including the backup blanks) is split or merged; then all affected particles are re-sorted, and the base-address and counter arrays are updated.
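The bookkeeping just described can be summarized in a short sketch (illustrative Python with invented section sizes; the real lists live in off-chip DRAM). It shows how the base-address and counter arrays locate a block's section and how the backup blank absorbs newly arriving particles.

SECTION_CAPACITY = 8          # particle slots per section, incl. backup (assumed)

base_addr = [0, 8, 16]        # base-address array: one entry per block section
counter   = [5, 2, 7]         # counter array: live particles in each block

def insert_particle(block_idx: int) -> int:
    """Append a particle to its block's section, consuming backup space.
    Returns the slot address: section base plus current particle count."""
    assert counter[block_idx] < SECTION_CAPACITY, "section full: re-sort needed"
    addr = base_addr[block_idx] + counter[block_idx]
    counter[block_idx] += 1   # the counter array is updated in place
    return addr

# A refine splits one section (and its counter) into children; a coarsen
# merges adjacent sections. Both then rebuild base_addr and counter.
print(insert_particle(1))     # -> 10: base 8 + 2 live particles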
Fig. 8. Data structures' relationship
Fig. 9. Data structure of the particle information list: (A) layout of the list; (B) coarsen and refine operations
IV. ARCHITECTURE

In this section we introduce the hardware architecture of the two phases of FP-AMR, PE_Sort and PE_Map (shown in Figure 3(C)). The PE_Sort module sorts all particle information by spatial locality and decides which blocks need to be refined or coarsened. PE_Map maps the sorted particles onto the adaptive mesh with bi-/tri-linear interpolation.
A. PE_Sort

The data path of the PE_Sort module is shown in Figure 10. The inputs of PE_Sort are: the coordinates (x0, y0) of this level's origin; this level's block edge length L; the particle coordinate vector (x, y); and the particle mass m. First, the pipeline calculates (id_x, id_y) and (dx, dy):

  (id_x, id_y) = (⌊(x − x0)/L⌋, ⌊(y − y0)/L⌋)    (2)

  (dx, dy) = ((x − x0)/L − id_x, (y − y0)/L − id_y)    (3)

(id_x, id_y) identifies the block in which the particle is located, and (dx, dy) is the particle's offset within that block. Using (id_x, id_y), the Z-order module generates the block index, which is used
as the key to search the base-address array and to increment the corresponding counter in the counter array. The results of the base-address and counter lookups are added to form the address of the particle's information. Then all the related particle information (such as (dx, dy, m)) is written to the on-chip cache. Because of the spatial locality of the particles, the cache has a low miss rate. Based on the counter-array search result, PE_Sort decides whether the block needs to be refined, coarsened, or kept as is. If a refine or coarsen operation is necessary, the data is streamed backward, as shown by the blue arrow.

Fig. 10. Data path of the PE_Sort module
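Putting equations (2)-(3) and the array lookups together, one PE_Sort transaction can be modeled behaviorally as below (a software sketch; the hardware performs these steps in successive pipeline stages and buffers the write in the on-chip cache). The block_key argument stands in for the Z-order bit-shuffle module; the demo key function and all the numbers are invented for illustration.

import math

def pe_sort_step(x, y, m, x0, y0, L, base_addr, counter, block_key):
    """One PE_Sort transaction: locate the particle's block, form its
    block index, and compute the write address for (dx, dy, m)."""
    id_x = math.floor((x - x0) / L)        # Eq. (2): block coordinates
    id_y = math.floor((y - y0) / L)
    dx = (x - x0) / L - id_x               # Eq. (3): in-block offsets
    dy = (y - y0) / L - id_y
    key = block_key(id_x, id_y)            # Z-order bit shuffle in hardware
    addr = base_addr[key] + counter[key]   # base-address + counter lookup
    counter[key] += 1                      # increment this block's counter
    return addr, (dx, dy, m)

# Demo: a 4-wide row-major key stands in for the Z-order module.
addr, rec = pe_sort_step(3.7, 1.2, 1.0, 0.0, 0.0, 1.0,
                         base_addr={7: 40}, counter={7: 2},
                         block_key=lambda i, j: 4 * j + i)
print(addr)   # 42: section base 40 + 2 particles already stored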
B. PE_Map

The datapath of the PE_Map module is shown in Figure 11. Bi-linear interpolation of the particle mass m produces (m0, m1, m2, m3), which are then added to the corresponding grid points:

  m0 = m · dx · dy              (4)
  m1 = m · (1 − dx) · dy        (5)
  m2 = m · dx · (1 − dy)        (6)
  m3 = m · (1 − dx) · (1 − dy)  (7)

Fig. 11. Datapath of the PE_Map module
At this point, we examine the design with respect to FPGA resource constraints. We find that the numbers of computational and on-chip memory units are sufficient, but the latter only if data are mapped so as to remove bank conflicts. In PE_Map, each particle-map operation needs to add the mass interpolation results to four (2D) or eight (3D) adjacent grid points. If these adjacent grid points are not located in different memory banks, they cannot all be accessed in a single cycle. Since waiting for data causes idle stages in the pipeline, a suitable memory partition is necessary.

1) Memory Partition: The Z-order indexing is helpful in creating a conflict-free data layout. As shown in Figure 12, grid points are mapped to four different RAM banks: the grid point whose Z-order index is z is placed in bank z mod 4, at address offset ⌊z/4⌋. For example, grid point 14 is mapped to offset 3 of RAM 2. Figure 13 shows the resulting simultaneous memory accesses. For particles located in green Block_0, the memory access request for grid points (00, 01, 02, 03) is conflict-free. For particles located in orange Block_3, the request for (03, 06, 09, 12) is conflict-free. For particles located in ivory Block_5, the request for (12, 13, 06, 07) is also conflict-free.

Fig. 12. Memory partition method

Fig. 13. Conflict-free parallel memory access
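The bank mapping can be stated in a few lines (illustrative Python; the hardware simply splits the index bits, the two lowest selecting the RAM and the rest forming the offset). The asserts replay the three examples from the text.

NUM_BANKS = 4

def bank_and_offset(z: int):
    """Grid point with Z-order index z lives in bank z mod 4 at offset z // 4."""
    return z % NUM_BANKS, z // NUM_BANKS

assert bank_and_offset(14) == (2, 3)   # grid point 14 -> RAM 2, offset 3

# The four grid points touched by one 2D particle map hit four distinct banks:
for corners in [(0, 1, 2, 3), (3, 6, 9, 12), (12, 13, 6, 7)]:
    banks = {bank_and_offset(z)[0] for z in corners}
    assert len(banks) == 4             # conflict-free access in a single cycle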
Fig. 14. Read-after-write data hazard
2) RAW hazard: In Figure 11, PE_Map reads the running mass sum from RAM, adds the interpolation result, and stores the sum back to RAM. However, the floating-point adder has a long latency (more than ten cycles). As shown in Figure 14, if we handle the particles of a single block sequentially (marked blue), Particle_0 to Particle_2, then PE_Map repeatedly reads and writes the same RAM locations while earlier sums are still in the adder pipeline, causing read-after-write (RAW) hazards. Therefore, FP-AMR interleaves particles from different blocks (marked brown). Because consecutive particles then have different block ids and touch different grid points, the hazard is avoided.
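The hazard-avoiding issue order can be sketched as follows (illustrative Python; the block ids and particle names are invented, echoing the i, i+4, i+8 pattern of Figure 14). Issuing round-robin across blocks keeps back-to-back additions on different grid points, so no partial sum is read before it leaves the adder pipeline, provided enough blocks are in flight to cover the adder latency.

def issue_order(particles_by_block):
    """Interleave particles round-robin across blocks so that consecutive
    issues never target the same block's grid points."""
    order = []
    queues = [list(q) for q in particles_by_block.values()]
    while any(queues):
        for q in queues:               # one particle from each non-empty block
            if q:
                order.append(q.pop(0))
    return order

# Three blocks (ids i, i+4, i+8 as in Figure 14), three particles each:
blocks = {0: ['a0', 'a1', 'a2'], 4: ['b0', 'b1', 'b2'], 8: ['c0', 'c1', 'c2']}
print(issue_order(blocks))
# ['a0', 'b0', 'c0', 'a1', 'b1', 'c1', 'a2', 'b2', 'c2']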
V. EXPERIMENTAL EVALUATION

We use an astrophysics AMR-based N-body simulation, which solves the Poisson equation, to evaluate the performance of FP-AMR in an FPGA-centric cluster. We implement three systems as the control group: 1) a CPU-only cluster, 2) a CPU-GPU cluster (GPUs as co-processors), and 3) a CPU-FPGA cluster (FPGAs as co-processors). These three versions are based on AMReX, a publicly available software framework for building massively parallel block-structured AMR applications.

A. Experimental Setup

We use NVIDIA Tesla P100 GPUs and Xilinx VCU118 boards as platforms; these devices all use a 16nm process. The host CPU is a single-socket Intel Xeon E5-2660 with 8 cores running a multithreaded version. Table I shows details of FP-AMR and the three control clusters. We currently have resources to test all four configurations up to eight nodes with one accelerator per node. The control clusters all use CPUs to handle AMReX-based AMR and use CPU-side Ethernet for communication between nodes. For the Poisson solver, the CPU-only version uses the Intel AVX-enhanced FFTW library; the CPU-GPU version uses cuFFT; and the two FPGA versions both use the Xilinx FFT IP.

TABLE I
EXPERIMENT SETUP

Design    | CPU Side | Poisson Solver | Interconnection
CPU-only  | AMReX    | FFTW (AVX)     | Eth
CPU-GPU   | AMReX    | cuFFT          | Eth
CPU-FPGA  | AMReX    | Xilinx FFT IP  | Eth
FP-AMR    | Initial  | Xilinx FFT IP  | Eth + Trans

In the FP-AMR version, the CPU handles only program initialization and workload scheduling. All AMR and Poisson-solver tasks are completed on the FPGA side, and all other communication is direct FPGA-to-FPGA through the FPGA transceivers. The design is coded in Verilog. The Poisson solver runs at 250MHz; the utility logic (DMA, bus, etc.) and the FP-AMR module run at 200MHz.

Performance depends on the initial conditions of the simulation. For example, the particles' kinetic energy influences the number of refine/coarsen operations and the frequency of particle information exchanges. We use NGenIC, a widely used initial-conditions generator for cosmological simulations, to generate two different initial conditions: strong interaction and weak interaction. These have the same particle count and distribution, but the kinetic energy of the first is twice that of the second. In both cases the results with 32- and 64-bit floating point are nearly indistinguishable, so the CPU, GPU, and FPGA implementations all use the FP32 format.

B. Performance Summary

To measure performance we run the system for 100 timesteps after a 10-timestep warm-up. Figure 15 shows the average per-node performance for the different initial conditions and node counts. Table II summarizes the systems' performance. All of the accelerated systems do very well compared with the CPU-only system: CPU+GPU is 35x-37x better, CPU+FPGA 17x-19x, and FP-AMR 21x-23x.

Fig. 15. Different initial conditions

TABLE II
PERFORMANCE OVERVIEW

                   | CPU-only | CPU+GPU     | CPU+FPGA    | FP-AMR
Performance        | 1x       | 34.6x-36.6x | 17.3x-19.5x | 20.7x-23.2x
Cache Miss Ratio   | 17.1%    | 17.1%       | 17.1%       | 18.7%
Energy Consumption | 1x       | 0.075x      | 0.015x      | 0.012x

Compared with the CPU+FPGA system, FP-AMR's performance is 19%-20% better. The improvement of FP-AMR over CPU+FPGA originates from two effects. First, at the end of each time step, nodes transfer particles as they move across boundaries; FP-AMR benefits from the direct FPGA-to-FPGA connections. Second, as mentioned in Section II, FP-AMR offloads several additional tasks to the FPGAs. An initial condition with higher kinetic energy causes more particle-information exchange and degrades the particles' spatial locality, which hurts the particle map. As shown in Figure 15, the time consumption of each timestep is 22.8% larger for the strong-interaction initial condition.

C. Memory subsystem

In our experiments, we find diminishing returns beyond a 4KB 4-way set-associative cache; the replacement policy is LRU. To evaluate FP-AMR's memory subsystem, we measure the cache miss rate of the particle map in FP-AMR and in the CPU-only system. For FP-AMR, we instrument the FPGA cache with counters and find a cache miss rate of 18.7%. For the CPU-only system, we use Linux perf to profile the CPU's last-level cache miss ratio, which is 17.1%. Compared with the CPU's highly optimized and sophisticated cache design, FP-AMR's memory subsystem incurs only a 1.6% higher cache miss ratio.

D. Performance and Resource Utilization Break-Down

Figure 16 gives a break-down of time consumption for the 8-node versions of CPU-GPU, CPU-FPGA, and FP-AMR. For the Poisson solver, for each initial condition, the two FPGA clusters have similar times (as expected), while the GPU cluster is more than three times faster. For the intra-node particle map, both the CPU-FPGA and CPU-GPU clusters execute this operation on the CPU, so, as expected, their performance is similar; FP-AMR executes it on the FPGAs and is twice as fast. For the inter-node particle exchange, again the CPU-FPGA and CPU-GPU clusters have similar performance, while FP-AMR, which uses the direct FPGA-FPGA connections, is so fast that the cost is barely visible.

Figure 16 also shows the FPGA resource utilization. The FP-AMR design has three parts: basic utility (DRAM controller, transceiver controllers, etc.), the FP-AMR module, and the Poisson solver. For FP-AMR, the FPGA resource bottleneck is DSP slices; the FP-AMR framework itself uses less than 5% of them. This also explains the disparity in performance between the GPU and the FPGA on the Poisson solver.

Fig. 16. CPU-FPGA and FP-AMR system time consumption break-down and utilization break-down

Fig. 17. Scalability
E. System Scalability

Figure 17(A-B) shows the performance scalability of the four systems under the two initial conditions. The AMReX framework ensures that the three control groups scale well. Within this limited-scale experiment, FP-AMR's performance scales with comparable linearity.

VI. DISCUSSION AND FUTURE WORK

In this work, we study the mapping of AMR onto FPGAs and FPGA clusters; we believe that this is the first such study. We create two versions: one where the FPGAs execute only the computations that are traditionally offloaded to the accelerator, while the CPUs retain control and the data structures; and a second, FP-AMR, where the entire computation (after initialization) is executed on the FPGAs. Having the FPGAs execute the balance of the CPU-based tasks, including particle mesh mapping, mesh refinement, and coarsening, requires customized data structures and a customized memory subsystem. Another advantage of FP-AMR is that inter-iteration particle exchanges are executed over the direct FPGA-FPGA connections. The FP-AMR enhancements lead to a 20% performance improvement over the CPU+FPGA version.

Overall we find that both FPGA-based versions of AMR substantially improve performance over the CPU-based versions, by factors of 17x-19x and 21x-23x, respectively. Energy consumption is improved by a much greater factor. The limiting factor on FPGA performance is the number of DSP units. When compared with a high-end GPU of a similar process generation, we find that the performance of the GPU is 1.46x-1.67x better than FP-AMR. This appears to be entirely due to the floating-point resources available on the GPU for executing the intra-iteration payload computations; having more DSPs would allow FP-AMR to achieve equal or better performance. The energy consumption of the FPGA-based systems is about 6x less than that of the GPU-based systems.

We tested the scalability of all of the systems and found that all scaled well within the size of the clusters to which we had access. Testing on larger systems is required to demonstrate the expected benefit of the direct FPGA-FPGA connections on scalability. It is also likely that we can improve the performance of the payload operations on the FPGA (the Poisson solver). This is something we did not concentrate on in this study, but it is the primary hindrance to matching GPU performance. Exchanging the IP-based FFTs for custom FFTs could bridge the gap, even given the disparity in floating-point resources.
REFERENCES

[1] C. Burstedde, O. Ghattas, G. Stadler, T. Tu, and L. C. Wilcox, "Towards adaptive mesh PDE simulations on petascale computers," Proceedings of Teragrid, vol. 8, 2008.
[2] A. Dubey, A. Almgren, J. Bell, M. Berzins, S. Brandt, G. Bryan, P. Colella, D. Graves, M. Lijewski, F. Löffler et al., "A survey of high level frameworks in block-structured adaptive mesh refinement packages," Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3217–3227, 2014.
[3] D. Calhoun, "Adaptive Mesh Refinement Resources," 2017. [Online]. Available: math.boisestate.edu/~calhoun/www_personal/research/amr_software/
[4] W. Zhang, A. Almgren, M. Day, T. Nguyen, J. Shalf, and D. Unat, "BoxLib with tiling: an AMR software framework," arXiv preprint arXiv:1604.03570, 2016.
[5] H.-Y. Schive, J. A. ZuHone, N. J. Goldbaum, M. J. Turk, M. Gaspari, and C.-Y. Cheng, "GAMER-2: a GPU-accelerated adaptive mesh refinement code–accuracy, performance, and scalability," Monthly Notices of the Royal Astronomical Society, vol. 481, no. 4, pp. 4815–4840, 2018.
[6] M. N. Farooqi, T. Nguyen, W. Zhang, A. S. Almgren, J. Shalf, and D. Unat, "Phase asynchronous AMR execution for productive and performant astrophysical flows," in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 2018, p. 70.
[7] P. Davis, "Adaptive Mesh Refinement: An Essential Ingredient in Computational Science," SIAM News, May 1, 2017.
[8] A. Almgren, V. Beckner, J. Bell, M. Day, L. Howell, C. Joggerst, M. Lijewski, A. Nonaka, M. Singer, and M. Zingale, "CASTRO: A new compressible astrophysical solver. I. Hydrodynamics and self-gravity," The Astrophysical Journal, vol. 715, no. 2, p. 1221, 2010.
[9] B. Peng, T. Wang, X. Jin, and C. Wang, "An Accelerating Solution for N-body MOND simulation with FPGA-SOC," International Journal of Reconfigurable Computing, vol. 2016, 2016.
[10] T. Wang, L. Zheng, X. Jin, B. Peng, and C. Wang, "FPGA acceleration of TreePM N-body simulations for Modified Newton Dynamics," in International Conference on Field-Programmable Technology, 2016, pp. 201–204.
[11] J. Coole, J. Wernsing, and G. Stitt, "A traversal cache framework for FPGA acceleration of pointer data structures: A case study on Barnes-Hut N-body simulation," in International Conference on Reconfigurable Computing and FPGAs, 2009, pp. 143–148.
[12] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray et al., "A reconfigurable fabric for accelerating large-scale datacenter services," ACM SIGARCH Computer Architecture News, vol. 42, no. 3, pp. 13–24, 2014.
[13] A. D. George, M. C. Herbordt, H. Lam, A. G. Lawande, J. Sheng, and C. Yang, "Novo-G#: Large-scale reconfigurable computing with direct and programmable interconnects," in IEEE High Performance Extreme Computing Conference, 2016, pp. 1–7.
[14] T. Geng, T. Wang, A. Sanaullah, C. Yang, R. Patel, and M. Herbordt, "A framework for acceleration of CNN training on deeply-pipelined FPGA clusters with work and weight load balancing," in International Conference on Field Programmable Logic and Applications, 2018, pp. 394–398.
[15] J. Sheng, B. Humphries, H. Zhang, and M. Herbordt, "Design of 3D FFTs with FPGA Clusters," in IEEE High Performance Extreme Computing Conference, 2014.
[16] J. Sheng, C. Yang, A. Caulfield, M. Papamichael, and M. Herbordt, "HPC on FPGA Clouds: 3D FFTs and Implications for Molecular Dynamics," in Proc. IEEE Conf. on Field Programmable Logic and Applications, 2017.
[17] J. Sheng, C. Yang, and M. Herbordt, "High Performance Dynamic Communication on Reconfigurable Clusters," in Proc. IEEE Conf. on Field Programmable Logic and Applications, 2018.