Computer Physics Communications 185 (2014) 2558–2565
Performance and precision of histogram calculation on GPUs: Cosmological analysis as a case study

Miguel Cárdenas-Montes a,∗, Juan José Rodríguez-Vázquez a, Miguel A. Vega-Rodríguez b, Ignacio Sevilla-Noarbe a, Eusebio Sánchez Alvaro a

a CIEMAT, Department of Fundamental Research, Avda. Complutense 40, 28040, Madrid, Spain
b University of Extremadura, ARCO Research Group, Department Technologies of Computers and Communications, Escuela Politécnica, Campus Universitario s/n, 10003, Cáceres, Spain
Article info

Article history: Received 20 January 2014; Received in revised form 30 May 2014; Accepted 4 June 2014; Available online 11 June 2014

Keywords: GPU computing; Histogram calculation; Cosmology; Correlation function
Abstract

Histogram calculation is an essential part of many scientific analyses. In Cosmology, histograms are employed intensively in the computation of correlation functions of galaxies, as part of Large Scale Structure studies. Among the most commonly used are the two-point, three-point and shear–shear correlation functions. In these computations, the precision of the calculation of the counts in each bin is a key element for achieving the highest accuracy. In order to accelerate the analysis of increasingly larger datasets, GPU computing is becoming widely employed in this field. However, the recommended histogram calculation procedure becomes less precise when bins become highly populated in this sort of algorithm. In this work, an alternative implementation to correct this problem is proposed and tested. This approach is based on distributing the creation of histograms between the CPU and GPU. The implementation is tested using three cosmological analyses with observational data. The results show increased accuracy while keeping the same execution time.

© 2014 Elsevier B.V. All rights reserved.
1. Introduction

The calculation of correlation functions is a computing-intensive activity which, in many implementations, uses histograms as an intrinsic part of the algorithm. They are frequently employed as part of Large Scale Structure studies in cosmological analyses. Due to the increase in the volume of scientific data available, cosmologists have researched diverse alternatives to accelerate these computations. Currently, GPU computing has the capability to process large datasets within a reasonable execution time, and many publications have tackled the analysis of experimental data employing GPUs. However, for large datasets, histogram construction presents some precision weaknesses associated with number representation. In this work, these weaknesses are shown in relation to the study of correlation functions in Cosmology: the two-point angular correlation function (2PACF), the three-point angular correlation function (3PACF) and the shear–shear correlation function. However, it is expected that these imprecisions are independent of the
problem, and related only to factors such as the dataset size, the histogram parameters and the number representation. In order to overcome them, an alternative implementation for histogram construction is presented and evaluated using observational data. This approach divides the histogram construction task into two phases. Firstly, the counts are accumulated in sub-histograms which reside in the shared memory of the GPU and are collated in global memory. Later, these sub-histograms are added up in the host (CPU) memory (hereafter termed host memory), to profit from the double precision representation. The proposed implementation is as fast as, and more accurate than, previous schemes. The tests performed and reported in the current work demonstrate that this new approach is able to deal with up to 10^12 counts per bin. To the authors' knowledge, no modifications in this line have been proposed to overcome the weaknesses of the number representation in histogram construction.

This paper is organized as follows: Section 2 summarizes related work and previous efforts in this area. In Section 3.1 a brief description of the weaknesses of number representation is presented. The relevant algorithms for cosmological analysis used in this work are briefly described in Section 3.2. The CFHTLenS dataset, which is used later, is presented in Section 3.3. The comparison of the new implementation presented here versus the
classic approach is presented and analysed with both artificial (Section 4) and real data (Section 5). Finally, Section 6 contains the conclusions of this work.

2. Related work

A standard and essential reference for histogram construction on GPUs is the white paper from NVIDIA [1]. Although originally devoted to image processing and data mining, it has served as a guide for many other scientific areas. In this paper, 64-bin and 256-bin histogram implementations are proposed and evaluated. Nowadays, the hardware has evolved enough to allow for larger histogram sizes. The proposed implementation of [1] is the 'classic' one using per-block¹ sub-histograms in shared memory. In this approach, they are later gathered in global memory to form the final histogram. Two more variants are proposed: per-thread and per-warp² sub-histograms. Depending on the architecture of the GPU, the histogram size and the data size, these strategies might be a limiting factor for the performance and for the capability to contain the counts in the bins (bin-count accuracy).³

In [2], the authors propose two new efficient methods for histogram calculation on GPUs. This article uses the NVIDIA white paper as a starting point for its improvements. However, of the two implementations for histogram construction presented in [1], the authors only cite the one smallest in capability. In any case, this paper was written when the compute capability was 1.0,⁴ and no atomic operations on shared memory were available. Therefore, it can be considered an obsolete strategy.

In [3], another efficient implementation to compute image histograms on GPUs is proposed. Among other aspects, the authors discuss the precision and the bin-count accuracy. In this paper, an implementation which is unable to accumulate more than 256 counts per bin is compared with a new proposal. In order to mitigate the lack of bin-count accuracy, local histograms are created. However, this new implementation introduces errors when accumulating more than 2048 counts per bin. This is clearly insufficient for cosmological analyses such as the 2PACF, 3PACF and shear–shear correlation, as will be shown later. Neither of these implementations uses atomic functions or shared memory to enhance the performance.

In the book [4] a very efficient implementation of per-block sub-histograms in shared memory is described. This implementation presents a few advantages: it is easy to implement, it is widely applicable to many different disciplines, and it offers great performance and bin-count accuracy. This implementation seems inspired by the per-block implementation of [1], but incorporates atomic functions and uses shared memory for storing the intermediate sub-histograms, therefore improving the performance. This implementation has been used by the authors in previous works, such as the analysis of the 2PACF [5,6] and the shear–shear analysis [7]. At the same time, it is employed for comparison purposes in the current work.

In [8] a per-block sub-histogram implementation is proposed. The novelty of this approach lies in the multiple sub-histograms which are embodied in a single thread block. This approach is an extension of the one-sub-histogram-per-thread-block approach [9]
¹ A block of threads, thread block or simply block is a logical group of threads which are executed on a streaming multiprocessor.
² A warp is a group of 32 threads. The instructions are issued per warp. This is the minimum unit of scheduling in the streaming multiprocessor.
³ The ability to correctly gather the number of counts assigned to a bin is termed bin-count accuracy.
⁴ The compute capability categorizes the features of the compute device. A higher number indicates more advanced features and a newer generation of the device.
which includes multiple sub-histograms per thread block. As a final result, the implementation is a hybrid between per-warp and per-block sub-histogram implementations. Given that this implementation gathers the sub-histograms into the final histogram in device memory, it will face difficulties with the largest number attainable for some number representations. It will also show some imprecision when adding small and large numbers in float representation.

The comparison between the methods proposed by Shams et al. [2] and by Nugteren et al. [9] shows that the different application areas (data mining and image processing) establish different requirements on the histogram size, larger for data mining than for image processing. The input sizes are also different: 8 bits are enough for image processing, whereas 32 bits are typical in data mining. This orients the performance study towards objectives different from those in Cosmology, and therefore forces modifications to be implemented.

In [9] two histogram implementations are proposed: a per-warp sub-histogram and a per-thread sub-histogram. When processing images, a successful and simple proposition to improve the performance is to shuffle the input data. This is useful for images; however, it has no impact on the analysis of the galaxy catalogues used in cosmology. For images, it is probable that close pixels will feed the same bin, therefore generating collisions (sequential updates of the number of counts in the same bin) which will degrade the performance. For cosmological inputs, this a priori knowledge of the data structure does not exist, except in the case that the data have been previously ordered.

None of the mentioned approaches proposes simultaneously modifying both the kernel and the CPU part of the code, nor profits from the double representation accessible only on the CPU side. Furthermore, in all of the previous implementations, after constructing the sub-histograms they are accumulated into the final histogram on the GPU (device memory). In the implementation proposed here, the final gathering is performed in host memory (RAM), which benefits from the double representation to improve the bin-count accuracy. In addition, no considerations about the input size and how it impacts the precision of the final result are presented in the works mentioned in this section. In contrast to image processing, cosmological analysis requires trigonometric calculations before feeding the appropriate bin of the histogram. For this reason, a performance comparison between the mentioned works and the current work for cosmological problems is not feasible.

Regarding the precision of floating point representation on GPUs, a review of this issue is presented in [10]. That work underlines the inexactness associated with the use of float representation (operation accuracy and rounding), and how the programming affects the final result. Furthermore, in [11] the floating point format and its weaknesses are fully described. In [12] the latest version of the floating-point standard can be found.

Concerning other cosmological studies, in [13] the authors present an alternative approach to calculate the 2PACF. In the description of the implementation, the use of integer representation for the histogram (256 bins and logarithmic binning), as well as the use of the atomicAdd() function and shared memory for supporting the sub-histograms, are highlighted.
Although they state that the final aim is to be able to process up to billions of galaxies with this code, the largest dataset processed is composed of one million galaxies. From the description of the implementation, similar difficulties for the bin-count accuracy as in the float-based implementation will arise when increasing the size of the dataset. The lack of precision in this implementation is mitigated because the data are processed in bunches, and then accumulated.

Another implementation of the 2PACF is presented in [14]. This approach uses an array in double representation to hold
the histograms and thus avoids using atomicAdd() functions. The selection of the bin to be filled is done through a waterfall if–else-if structure. This strategy avoids the problem of the lack of precision of the number representation, but penalizes the performance of the application.

Finally, other works from the authors where histograms are an essential part of the cosmological analysis include [5–7,15]. In all of these cases, the classic, device memory gathering implementation described in [4] is employed.

3. Methods and materials

3.1. Weaknesses of number representation

Floating point encoding allows representing numbers in the range from 1.18 × 10^−38 to 3.40 × 10^38. However, this representation is done with a certain approximation. Some numbers can be exactly represented, for example 1; whereas others are represented approximately, for example 0.1 or the fraction 2/3. This incapability to represent some numbers exactly can translate into inaccuracies in the analyses. In [10], a selection of the peculiarities related with the precision of the float representation and its operations on GPUs is presented.

Another aspect of floating point representation related with the current work is the so-called epsilon comparison, boolIsEqual = fabs(f1 − f2) ≤ epsilon; in other words, adding a unit to a value in float representation may leave the result unchanged. In float representation this happens in the range from 16 777 216 to 3.40 × 10^38.

When analysing galaxy catalogues with correlation functions, the procedure is roughly as follows. After calculating the separation (linear or angular) between a specific pair of galaxies, if the distance or the angle is in the range of the target histogram to be analysed, a value is added to the corresponding bin. For the 2PACF and 3PACF, the value is a single unit, whereas for the shear–shear correlation this value depends on the shapes and the orientation of the galaxies, and is in general a non-integer value. Therefore, as a side effect of the lack of accuracy for representing some integers larger than 16 777 216, if a bin reaches this number of counts, then the additional coincidences of two galaxies for the angle or the distance corresponding to this bin will be lost. Thus, for the case of adding counts unit by unit in the bins, the number of counts will never grow above this value and, consequently, the accuracy of the cosmological calculation will be affected.

The use of a number representation based on integers would theoretically solve the lack of accuracy of the float representation. For example, integer representation is able to reproduce numbers up to 2.1 · 10^9 and unsigned integer representation up to 4.3 · 10^9. However, as will be shown in Section 4, even these limits are inadequate for cosmological studies when using large datasets.⁵ On the other hand, it will also be demonstrated that unsigned long–long integer representation fulfils the requirements of accuracy, but severely penalizes the execution time. The approach presented in this work uses double precision for the histogram by making use of the host memory for histogram creation, as described in Section 4.

3.2. Estimators of large scale structure of the Universe

The statistical functions described below, applied to a catalogue of galaxies at different cosmological distances, encode information
about the Large Scale Structure of the Universe, which in turn can provide constraints on the average matter density and the presence of dark energy. How these functions relate to cosmological models is beyond the scope of this work; the reader is referred to [16]. They will be used as a test-bench to verify the proposed implementation.

3.2.1. The two-point angular correlation function

The two-point angular correlation function (2PACF), ω(θ), is a computationally intensive function which measures the excess or lack of probability of finding a pair of galaxies separated by a certain angle θ with respect to a random distribution. The 2PACF has a computational complexity of O(N²), where N is the number of galaxies. Although diverse estimators exist for the 2PACF, the estimator proposed by [17] (Eq. (1)) is the most widely used by cosmologists by virtue of its minimum variance.
ω(θ) = 1 + (Nrandom/Nreal)² · DD(θ)/RR(θ) − 2 · (Nrandom/Nreal) · DR(θ)/RR(θ)    (1)
where
• DD(θ) is the number of pairs of galaxies for a given angle θ chosen from the observational data catalogue, D, with Nreal galaxies.
• RR(θ) is the number of pairs of galaxies for a given angle θ chosen from the random catalogue, R, with Nrandom galaxies.
• DR(θ) is the number of pairs of galaxies for a given angle θ taking one galaxy from the observational data catalogue D and another from the random catalogue R.

3.2.2. The three-point angular correlation function

The three-point angular correlation function (3PACF), ζ(θ), is similar to the 2PACF but involves three galaxies instead of two. This modification increases the computational complexity to O(N³). In this case, one of the most used estimators is the one proposed by [18] (Eq. (2)).
ζ(θ) = (Nrandom/Nreal)³ · DDD/RRR − 3 · (Nrandom/Nreal)² · DDR/RRR + 3 · (Nrandom/Nreal) · DRR/RRR − 1    (2)
where
• DDD(θ1, θ2, θ3) denotes the number of triplets of galaxies for a given set of angles θ1, θ2, θ3, where the three galaxies are selected from the observational data catalogue, D.
• RRR(θ1, θ2, θ3) denotes the number of triplets of galaxies for a given set of angles θ1, θ2, θ3, where the three galaxies are selected from the random data catalogue, R.
• DDR(θ1, θ2, θ3) and DRR(θ1, θ2, θ3) are similar to the previous ones, taking two galaxies from one catalogue and the third one from the other catalogue.

3.2.3. The shear–shear correlation function

The shear–shear correlation function involves the correlation between the shapes of pairs of galaxies as a function of their relative distance (the term shear is derived from the distortion of the shape of far away galaxies by the integrated effect of the mass in the line of sight). See [19] for a review. In this case, the cosmological information is encoded in two functions of the distance angle θ, defined in Eq. (3).
ξ+(θ) = Σ_ij wi wj (γt(θi) · γt(θj) + γ×(θi) · γ×(θj)) / Σ_ij wi wj
⁵ When analysing 1 million galaxies with the 2PACF or the shear–shear correlation codes, the most populated bins reach the order of 10^10 counts. For the 3PACF with an input of 10^4 galaxies, the most populated bins have a similar number of counts.
ξ−(θ) = Σ_ij wi wj (γt(θi) · γt(θj) − γ×(θi) · γ×(θj)) / Σ_ij wi wj    (3)
where
• wi,j are the weights of the measured objects (which depend on the accuracy of the shape measurement).
• γt,×(θ) are the shear components of the object, which are obtained through the measurements of the orientation of the shapes with respect to a local reference frame.

For more details on how these quantities are obtained and constructed, the reader is referred to [7]. All three functions involve the computation of histograms and the use of a very large number of entries per bin, whether by counting pairs at a certain angular distance and adding one to the corresponding bin (2PACF and 3PACF) or by computing a more complicated function of the said angle and adding a non-integer value (shear–shear correlation). Therefore, the questions related with the precision in the calculation of the counts per bin are very pertinent for the precision of the analysis and the reliability of the results.

3.3. The CFHTLenS survey

Data from the Canada–France–Hawaii Lensing Survey [20], hereafter referred to as CFHTLenS, is used for testing in Section 5. The CFHTLenS survey analysis combined weak lensing data processing with THELI [21], shear measurement with lensfit [22] and photometric redshift measurement with PSF-matched photometry [23]. A full systematic error analysis of the shear measurements in combination with the photometric redshifts is presented in [20], with additional error analyses of the photometric redshift measurements presented in [24].

A query was done on the CFHTLenS catalogue query page⁶ for right ascension, declination, the ellipticities (as proxies of the shear) and the weight, without any selection cuts except the requirement that the measured ellipticities are non-zero. The purpose here is to run the code on a catalogue with ellipticities even if they are not accurate or contain contamination from non-galaxy components. An area was selected from one of the four fields to hold exactly one million galaxies (randomly selected).

4. Histogram construction: device memory versus host memory gathering

In this section several tests are performed on building histograms with variations in the number representation and the choice of memory in which to build the final histogram. The figures of merit we will use are the kernel execution time⁷ and the maximum number of objects allowed in the test input file. We will concentrate on the 3PACF case in this section, as this problem is the most demanding in these terms. Several artificial files containing an increasing number of objects at the same coordinate position are created for these tests, just to check these figures of merit.

During the execution, the threads go through all the triplets of galaxies (Eq. (2)) and, for each triplet, the bin where a count has to be added is calculated (in this case, always the same bin). In general, if the bin is within the range of the histogram, the addition is performed in the sub-histograms in shared memory. Since the kernel is multithreaded, simultaneous updates of the same bin might occur, so the use of atomic functions (atomicAdd()) is mandatory. Finally, all the sub-histograms are accumulated on
⁶ http://www.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/community/CFHTLens/query.html
⁷ To measure the kernel execution time, the CUDA event API has been used. This mechanism is more precise than CPU or operating system timers because it eliminates the latency associated with the operating system.
the final histogram using either the GPU device memory (classic approach) or host memory (new implementation). Results are reported in Table 1 and detailed below. The variations explored are the following:
• The classic float-based implementation, following the guidelines of the NVIDIA white paper [1] and the procedure recommended in the book [4]. In this case, all the sub-histograms are gathered into the final histogram in device memory (GPU global memory) using float precision, by using again the atomicAdd() function. This approach will sometimes be termed device memory gathering (a minimal code sketch of this scheme is given after this list).
• Final histogram is integer-based, but otherwise the same as above.
• Final histogram is unsigned-integer-based, same as above.
• Final histogram is unsigned long–long integer-based, same as above.
• A float-based implementation, modifying the gathering of sub-histograms into the final one by using host memory instead of device memory. This modification allows implementing the final histogram in double precision. The approach proposed here will sometimes be termed host memory gathering (e.g. RAM).
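For illustration, a minimal sketch of the classic per-block, device memory gathering scheme (the first variation above) is given below. It is a simplified 2PACF-like pair-counting kernel, not the exact code used in this work; names such as NUM_BINS and angularBin() are hypothetical placeholders, and the pair-distribution strategy is only schematic.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define NUM_BINS 256   // e.g. 16 degrees with 16 bins per degree, as in Section 5.1

// Hypothetical helper: maps the separation of a pair of objects to a bin index.
__device__ int angularBin(float2 a, float2 b)
{
    float d = sqrtf((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
    return (int)(d * 16.0f);           // 16 bins per degree (placeholder binning)
}

// Classic scheme: one float sub-histogram per thread block in shared memory,
// gathered into the final float histogram in device (global) memory.
__global__ void pairHistogramDeviceGathering(const float2 *gal, int n,
                                             float *histGlobal)
{
    __shared__ float subHist[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        subHist[b] = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        for (int j = i + 1; j < n; ++j) {
            int bin = angularBin(gal[i], gal[j]);
            if (bin >= 0 && bin < NUM_BINS)
                atomicAdd(&subHist[bin], 1.0f);   // concurrent updates need atomics
        }
    }
    __syncthreads();

    // Device memory gathering: the final histogram is accumulated in single
    // precision on the GPU, which is where the bin-count accuracy limit appears.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&histGlobal[b], subHist[b]);
}
```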
All the numerical experiments have been executed on a machine with two Intel Xeon X5570 processors at 2.93 GHz and 8 GB of RAM, and a C2075 NVIDIA GPU card. This card is composed of 14 streaming multiprocessors, each including 32 CUDA cores, which makes a total of 448 cores. CUDA release 5.0 and compute capability 2.0 have been used.

4.1. Results for implementations with histogram gathering on device memory

4.1.1. Float-based implementation

The numerical results for this test show that the float-based implementation with device memory gathering handles correctly up to 8 · 10^9 counts per bin, which corresponds to an input size of 2 · 10^3 (2k) galaxies. When the input size increases to 4k, the final result shows a deficit of counts. This indicates the incapability of this implementation to contain 6.4 · 10^10 counts per bin. In the 3k case, on the other hand, no deficit of counts appears; on the contrary, the count is overestimated. This arises from the limited precision available to represent some numbers when using float precision. When adding numbers iteratively, rounding-up takes place and the final result is slightly higher than expected⁸ (a small snippet illustrating this effect is shown after footnote 8).

In order to demonstrate this effect, a new execution is performed with 3k galaxies but spanning more than a single bin in the histogram. Instead of having 3k identical single positions, 2k galaxies keep the same coordinates whereas the remaining 1k have different ones at one degree of separation on the sky, but identical among themselves. When forming triplets between points selected from both subsets, the bins corresponding to a one-degree angle of separation are filled. Otherwise, when the points are selected inside each subset, only the zero-degree bin is incremented, as before. This way, the overall 2.7 · 10^10 counts are distributed among two bins. When using this 2k+1k input file as input data, the overall number of counts is correctly accumulated. Therefore, it can be concluded that the use of float as the base of the histogram construction can produce either a deficit or an overestimation of the number of counts per bin.

4.1.2. Integer-based implementations

In this case, float precision is replaced by integer precision, which worsens the final result. Only the 1k input file is correctly processed. Larger numbers of counts lead to incorrect results due
⁸ The typical academic example of this underlying imprecision is the calculation of the number one by iteratively adding ten times the value 0.1 in float precision. The final result obtained is 1.0000001.
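The following minimal host-side snippet (plain C++, also compilable with nvcc) illustrates both float artefacts discussed in Section 3.1 and above: the round-up of repeated small additions mentioned in footnote 8, and the saturation of a float counter at 16 777 216, which double precision avoids. It is an illustration only, independent of the histogram codes.

```cpp
#include <cstdio>

int main()
{
    // Footnote 8: adding 0.1 ten times in single precision overshoots 1.0.
    float sum = 0.0f;
    for (int i = 0; i < 10; ++i) sum += 0.1f;
    std::printf("ten times 0.1f        = %.7f\n", sum);       // 1.0000001

    // Saturation: above 2^24 = 16 777 216 a float cannot represent the next
    // integer, so adding one unit to a float-based bin counter is lost.
    float fbin = 16777216.0f;
    fbin += 1.0f;
    std::printf("16777216 + 1 (float)  = %.1f\n", fbin);      // 16777216.0

    // The same counter held in double precision, as in the host memory
    // gathering of the final histogram, remains exact.
    double dbin = 16777216.0;
    dbin += 1.0;
    std::printf("16777216 + 1 (double) = %.1f\n", dbin);      // 16777217.0
    return 0;
}
```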
Table 1
Comparison of execution time and precision for artificial test files in the 3PACF case for different number representations and memory gathering approaches. The approach proposed in this work corresponds to float precision with host memory gathering.

Number representation        | Memory gathering | Maximum input size | Execution time for 1k (ms) | Execution time for maximum size (ms)
Float                        | Device           | 2 · 10^3           | 2861                       | 22 803
Integer                      | Device           | 1 · 10^3           | 3319                       | 3 319
Unsigned integer             | Device           | 1 · 10^3           | 3318                       | 3 318
Unsigned long–long integer   | Device           | >10 · 10^3         | 9778                       | 9 762 523
Float                        | Host             | >10 · 10^3         | 2873                       | 2 819 102
Table 2
Profile of the implementations.

Implementation               | Static shared memory | Registers | Occupancy
Float                        | 2.05 KB              | 25        | 67%
Integer                      | 2.05 KB              | 28        | 67%
Unsigned integer             | 2.05 KB              | 28        | 67%
Unsigned long–long integer   | 4.10 KB              | 28        | 67%
to the limit of the largest value that can be represented in integer precision. By replacing integer with unsigned integer precision, the problem is not solved and the largest input file is still 1k. This new implementation does not handle correctly cases where 8 · 10^9 counts per bin ought to be accumulated. In real cases, up to 10^10 counts are accumulated in the most populated bins when analysing 1 million galaxies.

On the other hand, by changing to unsigned long–long integer it is demonstrated that the code calculates correctly up to 10^12 counts per bin, which corresponds to an input file size of 10k objects and largely fulfils the precision requirements of current cosmological problems. Unfortunately, the execution time is much longer than for the previous implementations.⁹ For example, to process 1k input files (10^9 counts per bin), the unsigned long–long integer implementation takes more than three times longer (a factor of 3.4) than the float-based implementation.

In-depth analysis of the kernel (Table 2) reveals that the integer-based implementations consume more registers and more static shared memory than the float-based implementation, although this does not reduce the occupancy.¹⁰ Since a trade-off between the precision and the performance of the implementation is the final goal, this implementation will be at an important disadvantage when considering the analysis of large input files due to its long execution time.

4.2. Float-based implementation with histogram gathering on host memory

In order to overcome the difficulties associated with the lack of precision and the long execution time, the implementation proposed in this work creates the sub-histograms in shared memory with float precision but, instead of accumulating them in device memory, gathers them in host memory. The key point of this modification is to allow the final histogram to be in double precision. Until now, the implementation was limited to single precision due to limitations of the atomicAdd() function. In this new approach, the histogram in device memory (global memory) does not have the size of the sub-histograms, but instead the equivalent size of all the sub-histograms end-to-end. The sub-histograms are copied to this memory in such a way that they are sequentially arranged (a schematic code sketch of this scheme is shown below, just before Table 3). Then, all the sub-histograms grouped in this
manner are sent to the host memory, where they are gathered in a final histogram in which we can take advantage of double precision.

This alternative has certain drawbacks. The new array holding all the sub-histograms increases the global memory consumption by a factor of the number of thread blocks used during the kernel invocation. Moreover, this larger array makes the copies to and from device memory take longer. Lastly, the creation of the final histogram in this new implementation is performed in a sequential manner on the CPU, instead of using the inherent parallelism of the GPU.

In Table 1, the reported execution time (bottom row) includes the gathering of the sub-histograms into the final histogram on the CPU. This table shows that the proposed host memory gathering algorithm is able to process correctly up to 10^12 counts per bin, just like the unsigned long–long integer-based implementation with device memory gathering, but with a reduction of the execution time by a factor of 3.4. A similar execution time is obtained when comparing with the float-based, device memory gathering implementation. Therefore, this host memory gathering approach has the capability of dealing with large input files while keeping a fast execution time.

It should be underlined that the precision of the host memory gathering implementation depends on the parameters used during the kernel invocation. In all cases, the number of threads per block has been considered to depend only on the histogram structure and the current limitations of the hardware. The limit on the number of threads per block on the Fermi architecture (1024) forces, for the 3PACF, a thread block structure of 8 × 8 × 8 threads: 8 bins for each one of the three angles subtended by each triplet of points. This way, the number of thread blocks becomes relevant for the precision of the implementation.¹¹ More thread blocks mean that the counts will be distributed among a larger number of sub-histograms in shared memory, and therefore it is less likely to reach the value 16 777 216 in any bin. If the number of thread blocks is reduced, then the problem associated with the lack of precision in float representation emerges. On the other hand, if the kernel is invoked with a larger number of blocks, then this problem is mitigated. As an example, in Table 3 the number of thread blocks¹² required for correctly processing very large input files (5k and 10k galaxies) is presented.

The profiling of the kernel which implements the new histogram calculation does not differ from the values presented in Table 2. The register consumption is the same as with device memory gathering (25), as is the static shared memory consumption (2.05 KB). The most significant difference emerges in the time it takes for the new histogram array to be copied to and from device memory. In previous implementations it took less than 2 µs, whereas in the new approach (host memory gathering) it takes much more. This transfer time depends on the number of thread blocks with
⁹ In order to compare fairly, the execution times correspond to the kernel execution, not including factors such as data copy between the CPU and GPU or the conversion of data from right ascension and declination to Euclidean coordinates.
¹⁰ The occupancy is the ratio between the active warps and the maximum active warps that can execute on a streaming multiprocessor.
¹¹ Further details of the Fermi architecture and its relation with the histogram construction can be found in [7], which focuses on the shear–shear algorithm, but many considerations can be extrapolated to other correlation functions.
¹² For performance reasons, the number of thread blocks has been chosen as a multiple of the number of streaming multiprocessors (14) on the Fermi architecture.
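A schematic sketch of the host memory gathering scheme just described is given below (placed here, before Table 3). It is a simplified illustration under the assumptions of this section: one float sub-histogram of 8 × 8 × 8 bins per thread block, written end-to-end to global memory and reduced on the host in double precision. Names such as NBINS3, tripletHistogramHost and gatherOnHost are hypothetical, and the triplet loop itself is elided.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define NBINS3 (8 * 8 * 8)   // 8 bins for each of the three angles of a triplet

// Kernel: each block fills its own float sub-histogram in shared memory and
// writes it, untouched, to its slot in a global array of size numBlocks*NBINS3.
__global__ void tripletHistogramHost(const float2 *gal, int n, float *subHists)
{
    __shared__ float subHist[NBINS3];   // 512 floats = 2 KB, as in Table 2
    int t = threadIdx.z * blockDim.y * blockDim.x
          + threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y * blockDim.z;
    for (int b = t; b < NBINS3; b += nthreads) subHist[b] = 0.0f;
    __syncthreads();

    // ... loop over the triplets of gal[0..n) assigned to this block, compute
    // the bin for each triplet and accumulate it with an atomic update, e.g.:
    // atomicAdd(&subHist[bin], 1.0f);
    __syncthreads();

    // No gathering on the device: each block copies its sub-histogram to its
    // own segment of global memory, arranged end-to-end.
    for (int b = t; b < NBINS3; b += nthreads)
        subHists[blockIdx.x * NBINS3 + b] = subHist[b];
}

// Host side: gather the sub-histograms sequentially in double precision.
void gatherOnHost(const float2 *d_gal, int n, int numBlocks, double *finalHist)
{
    float *d_subHists;
    cudaMalloc((void **)&d_subHists, numBlocks * NBINS3 * sizeof(float));
    dim3 threads(8, 8, 8);               // 8 x 8 x 8 bins, below the Fermi limit
    tripletHistogramHost<<<numBlocks, threads>>>(d_gal, n, d_subHists);

    std::vector<float> h_subHists(numBlocks * NBINS3);
    cudaMemcpy(h_subHists.data(), d_subHists,
               h_subHists.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_subHists);

    for (int b = 0; b < NBINS3; ++b) finalHist[b] = 0.0;
    for (int blk = 0; blk < numBlocks; ++blk)   // sequential CPU reduction
        for (int b = 0; b < NBINS3; ++b)
            finalHist[b] += (double)h_subHists[blk * NBINS3 + b];
}
```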
Table 3
Precision of the results for diverse numbers of thread blocks and two large input sizes: 5k and 10k galaxies.

Thread blocks | Counts          | Correct result? | Execution time
5k galaxies
84            | 115.956 · 10^9  | No              | n/a
98            | 125 · 10^9      | Yes             | 353 717 ms
10k galaxies
210           | 0.74 · 10^12    | No              | n/a
280           | 10^12           | Yes             | 2 819 102 ms
which the kernel is invoked. For 280 thread blocks, the most unfavourable case, the copy to or from device memory takes 47 ms. In spite of this penalty, the time for the two copies is tiny compared with the kernel execution time, 2 819 102 ms, and has a very limited impact on the overall performance.

5. Application to real data

Up to now, the proposed host memory gathering implementation has been tested using artificial input files in order to fully control the tests (using the 3PACF algorithm). In order to complete the analysis, a study is performed with observational data coming from the CFHTLenS survey (see Section 3.3). For these tests, a set of 10^6 galaxies has been extracted from the survey. The differences with respect to the device memory gathering implementation will be showcased for the three algorithms described in Section 3.2. It is expected that the new dataset will distribute the counts along the histogram and therefore the deficit of counts observed in the previous studies will be partially mitigated.

5.1. Two-point angular correlation function

Considering that the proposed implementation (host memory gathering) has been demonstrated to correctly reproduce the number of counts per bin up to very large figures, any deviation from this value when using device memory gathering can be attributed to an imprecision of the number representation. In order to underline this deviation, the relative error, 100 · (DDdevice − DDhost)/DDhost, for the 2PACF is shown in Fig. 1. In this figure, some differences appear between the number of counts calculated for both implementations.¹³ In spite of the differences, they are far from being significant for cosmological studies and the relative error is below the threshold for instrumental errors. The relative error is low enough to maintain the validity of previous works performed with the device memory gathering implementation [5,6]. However, since the proposed implementation does not increase the execution time, while improving the bin-count accuracy, it is recommended for the analysis of the 2PACF.

A side effect of the device memory gathering implementation is that the repetition of the executions produces variations of one unit in the last significant digit of the counts accumulated in the bins. Again, the origin of this effect is the float representation. When gathering the counts accumulated in the sub-histograms, the order of the addition becomes relevant.¹⁴ Depending on the order in which the atomicAdd() operations are executed, a difference in the last significant digit might occur. If the bins with a large number of counts are added first, then adding the bins
¹³ In this case, the histogram covers up to 16 degrees with 16 bins per degree. Due to the reduced area covered by the CFHTLenS survey, counts above the 160th bin do not exist.
¹⁴ In general, when using float representation it is advised to sum the numbers in order, starting with the smallest ones.
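A minimal snippet illustrating the order dependence mentioned in footnote 14, and hence the run-to-run variations discussed above. It is an illustration only; the values are arbitrary and do not correspond to any particular bin of the analysis.

```cpp
#include <cstdio>

int main()
{
    // Gathering a large sub-histogram bin first and many small ones afterwards:
    // the small contributions are lost once the partial sum exceeds 2^24.
    float largeFirst = 16777216.0f;              // 2^24 counts already accumulated
    for (int i = 0; i < 1000; ++i) largeFirst += 1.0f;

    // Gathering the small contributions first and the large bin last keeps them.
    float smallFirst = 0.0f;
    for (int i = 0; i < 1000; ++i) smallFirst += 1.0f;
    smallFirst += 16777216.0f;

    std::printf("large bin first: %.1f\n", largeFirst);   // 16777216.0
    std::printf("small bins first: %.1f\n", smallFirst);  // 16778216.0
    return 0;
}
```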
Fig. 1. Relative error, 100 · (DDdevice − DDhost)/DDhost, for the 2PACF between the device memory gathering implementation and the proposed host memory gathering implementation for the CFHTLenS input file (10^6 galaxies). The histogram is composed of 256 bins in this case: 16 degrees with 16 bins per degree.
with a small number of counts will not contribute to the sum (this is inherent to the lack of accuracy of the float representation). In this case the bins with the small number of counts have to be added together first. This effect is completely corrected in the host memory gathering implementation. In Fig. 2, the standard deviation for each bin after 10 executions is shown for both implementations. It can be seen that, when using the device memory gathering implementation, some dispersion of the counts accumulated per bin emerges (Fig. 2(a)). On the contrary, no dispersion appears for the proposed implementation (Fig. 2(b)).

5.2. Three-point angular correlation function

Tests similar to those performed in the previous section are executed, but with the 3PACF instead of the 2PACF. This function emphasizes the weaknesses of the device memory gathering implementation, since more counts are accumulated per galaxy in the input file.¹⁵ Due to the large execution time of the 3PACF in relation to the 2PACF, for this study a randomly selected sample of 10^4 galaxies has been extracted from the CFHTLenS file. In the case of the 3PACF (Fig. 3), the relative error for the CFHTLenS dataset is one order of magnitude larger than for the 2PACF, even if a smaller input file is employed. This increment comes from the higher computational intensity of the 3PACF with respect to the 2PACF. Similarly to what happened with the 2PACF, for the 3PACF the dispersion of the counts accumulated per bin disappears when the host memory gathering implementation is used (Fig. 4). When using the device memory gathering implementation (Fig. 4(a)), the counts per bin exhibit some dispersion which is not present in the proposed one (Fig. 4(b)).

5.3. Shear–shear correlation function

Three main correlations and corresponding histograms are calculated during the shear–shear analysis: the number of pairs, ξ+, and ξ−. In these, small, non-integer values are added to the bins. As seen in Fig. 5, the device memory gathering implementation produces results different from the host memory gathering implementation proposed in this work, again due to the lack of accuracy of the float representation. Nonetheless, the relative error is low enough to still validate the cosmological analysis performed [7]. However, the host memory gathering alternative is able to deal with larger datasets with higher precision while
¹⁵ If N identical points are processed with the 2PACF, then N² counts are expected in the zero bin. On the other hand, if this sample is processed with the 3PACF, then N³ counts are expected.
Fig. 2. Standard deviation for the 2PACF using the CFHTLenS dataset for 10 runs using the device memory gathering implementation (a) and the host memory gathering implementation (b). The latter always reproduces an identical result, therefore its standard deviation is null; whereas using device memory gathering the last significant digit varies among the executions.
Fig. 3. Relative error, 100 · (DDdevice − DDhost)/DDhost, for the 3PACF between the device memory gathering implementation and the host memory gathering implementation for a sub-set of the CFHTLenS input file (10^4 galaxies in this test). An 8 × 8 × 8 binning has been used.
employing a similar execution time, and for this reason it is recommended for future analyses. The study of ξ+ shows a relative error of a similar order of magnitude as the 2PACF. However, the relative error for ξ− is much higher. The origin of this higher relative error is that ξ− is very close to zero, so the division in the relative error calculation magnifies its value. Since the absence of dispersion in the final results has been demonstrated in the previous studies when implementing the host memory gathering, this study has not been repeated for the shear–shear correlation function.

6. Conclusions

In this paper, a new approach for calculating histograms on GPUs has been presented, making use of host memory. It takes the
case of correlation function computation to show how gathering the histograms in host memory overcomes the precision problems of histogram construction on GPUs when operating in device memory and dealing with bins with a very high number of counts. We have first compared this host memory gathering approach side by side against different implementations using device memory gathering, for varying number representations, finding that the former achieves a relatively high performance in precision and execution time. This new implementation avoids the loss of counts when bins are close to the top limit of float precision, and therefore achieves a higher precision, which translates into a higher accuracy in cosmological studies.

This new implementation has been successfully tested with observational data by using three real cosmological applications (2PACF, 3PACF, and the shear–shear correlation function), where it has demonstrated its capability to consistently process up to 10^12 counts per bin without introducing a penalty in the execution time. Beyond the application to cosmological studies, the improvements presented in this work are applicable to other areas where histogram construction for large datasets is an essential part of the analysis. More comparative studies of the performance and the precision of other implementations when applied to cosmological analysis are considered as future work.

Acknowledgements

The authors would like to thank the Spanish Ministry of Economy and Competitiveness (MINECO) for funding support through grant FPA2012-39684-C03-03 and through the Consolider Ingenio2010 program, under project CSD2007-00060.
Fig. 4. Standard deviation for the 3PACF and a sub-set of 10^4 galaxies of the CFHTLenS dataset for 10 runs using the device memory gathering implementation (a) and the host memory gathering implementation (b). The latter always reproduces an identical result, therefore its standard deviation is null; whereas in the device memory gathering implementation the last significant digit varies among the executions.
Fig. 5. Relative error, 100 · (ξdevice − ξhost)/ξhost, for the parameters involved in the shear–shear correlation function (ξ+, ξ−) between the device memory gathering implementation and the host memory gathering implementation for the CFHTLenS input file (10^6 galaxies), for: (a) ξ+, and (b) ξ−.
The CFHTLenS data is based on observations obtained with MegaPrime/MegaCam, a joint project of CFHT and CEA/DAPNIA, at the Canada–France–Hawaii Telescope (CFHT) which is operated by the National Research Council (NRC) of Canada, the Institut National des Sciences de l'Univers of the Centre National de la Recherche Scientifique (CNRS) of France, and the University of Hawaii. This research used the facilities of the Canadian Astronomy Data Centre operated by the National Research Council of Canada with the support of the Canadian Space Agency. CFHTLenS data processing was made possible thanks to significant computing support from the NSERC Research Tools and Instruments grant program.

The improvements extracted from this work are implemented in the 2PACF and shear–shear correlation function codes, made available at http://wwwae.ciemat.es/cosmo/gp2pcf/.

References

[1] V. Podlozhnyuk, Histogram calculation in CUDA, NVIDIA White Paper, Nov. 2007.
[2] R. Shams, R.A. Kennedy, Efficient histogram algorithms for NVIDIA CUDA compatible devices, in: Proc. Int. Conf. on Signal Processing and Communications Systems (ICSPCS), December 2007, pp. 418–422.
[3] Thorsten Scheuermann, Justin Hensley, Efficient histogram generation using scattering on GPUs, in: Proceedings of the 2007 Symposium on Interactive 3D Graphics and Games, ACM, 2007, pp. 33–37.
[4] Jason Sanders, Edward Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming, Addison-Wesley Professional, 2010.
[5] Miguel Cárdenas-Montes, Juan José Rodríguez-Vázquez, Rafael Ponce, Ignacio Sevilla, Eusebio Sánchez, Nicanor Colino, Miguel A. Vega-Rodríguez, New computational developments in cosmology, in: Ibergrid Conference, Laboratório de Instrumentação e Física Experimental de Partículas, 2012, pp. 101–112.
[6] Miguel Cárdenas-Montes, Miguel A. Vega-Rodríguez, Jan Westerholm, Rafael Ponce, Evren Yurtesen, Ignacio Sevilla, Eusebio Sánchez Alvaro, Juan José Rodríguez-Vázquez, Mats Aspnas, Ville Timonen, Nicanor Colino, Antonio Gómez-Iglesias, Calculation of two-point angular correlation function: implementations on many-core and multicore processors, in: Ibergrid Conference, Editorial Universitat Politècnica de València, 2013, pp. 203–214.
[7] Miguel Cárdenas-Montes, Miguel A. Vega-Rodríguez, Christopher Bonnett, Ignacio Sevilla-Noarbe, Rafael Ponce, Eusebio Sánchez Alvaro, Juan José Rodríguez-Vázquez, Comput. Phys. Commun. 185 (1) (2014) 11–18. http://dx.doi.org/10.1016/j.cpc.2013.08.005.
[8] Juan Gómez-Luna, José María González-Linares, José Ignacio Benavides, Nicolas Guil, Mach. Vis. Appl. 24 (5) (2013) 899–908.
[9] C. Nugteren, G. van den Braak, H. Corporaal, B. Mesman, High performance predictable histogramming on GPUs: exploring and evaluating algorithm trade-offs, in: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, ACM, 2011, p. 1.
[10] Nathan Whitehead, Alex Fit-Florea, Precision and Performance: Floating Point and IEEE 754 Compliance of NVIDIA GPUs, NVIDIA, 2011.
[11] David Goldberg, ACM Comput. Surv. 23 (1991) 5–48.
[12] IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2008, Aug. 29, 2008, pp. 1–70. http://dx.doi.org/10.1109/IEEESTD.2008.4610935.
[13] D. Bard, M. Bellis, M.T. Allen, H. Yepremyan, J.M. Kratochvil, Astron. Comput. 1 (2013) 17–22. http://dx.doi.org/10.1016/j.ascom.2012.11.001.
[14] Dylan W. Roeh, Volodymyr V. Kindratenko, Robert J. Brunner, Accelerating cosmological data analysis with graphics processors, in: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, ACM, 2009, pp. 1–8.
[15] R. Ponce, M. Cárdenas-Montes, J.J. Rodríguez-Vázquez, E. Sánchez, I. Sevilla, Application of GPUs for the calculation of two point correlation functions in cosmology, in: P. Ballester, D. Egret, N.P.F. Lorente (Eds.), Astronomical Data Analysis Software and Systems XXI, ASP Conference Series, vol. 461, Astronomical Society of the Pacific, 2012, p. 73.
[16] L. Fu, E. Semboloni, H. Hoekstra, M. Kilbinger, L. van Waerbeke, I. Tereno, Y. Mellier, C. Heymans, J. Coupon, K. Benabed, J. Benjamin, E. Bertin, O. Doré, M.J. Hudson, O. Ilbert, R. Maoli, C. Marmo, H.J. McCracken, B. Ménard, Astron. Astrophys. 479 (1) (2008) 9–25.
[17] S.D. Landy, A.S. Szalay, Astrophys. J. 412 (1993) 64–71.
[18] István Szapudi, Alexander S. Szalay, Astrophys. J. Lett. 494 (1998) L41–L44.
[19] Martin Kilbinger, Liping Fu, Catherine Heymans, Fergus Simpson, Jonathan Benjamin, Thomas Erben, Joachim Harnois-Déraps, Henk Hoekstra, Hendrik Hildebrandt, Thomas D. Kitching, Yannick Mellier, Lance Miller, Ludovic Van Waerbeke, Karim Benabed, Christopher Bonnett, Jean Coupon, Michael J. Hudson, Konrad Kuijken, Barnaby Rowe, Tim Schrabback, Elisabetta Semboloni, Sanaz Vafaei, Malin Velander, Mon. Not. R. Astron. Soc. 430 (3) (2013) 2200–2220.
[20] C. Heymans, L. Van Waerbeke, L. Miller, T. Erben, H. Hildebrandt, H. Hoekstra, T.D. Kitching, Y. Mellier, P. Simon, C. Bonnett, J. Coupon, L. Fu, J. Harnois-Déraps, M.J. Hudson, M. Kilbinger, K. Kuijken, B. Rowe, T. Schrabback, E. Semboloni, E. van Uitert, S. Vafaei, M. Velander, Mon. Not. R. Astron. Soc. 427 (1) (2012) 146–166.
[21] T. Erben, H. Hildebrandt, L. Miller, L. van Waerbeke, C. Heymans, H. Hoekstra, T.D. Kitching, Y. Mellier, J. Benjamin, C. Blake, C. Bonnett, O. Cordes, J. Coupon, L. Fu, R. Gavazzi, B. Gillis, E. Grocutt, S.D.J. Gwyn, K. Holhjem, M.J. Hudson, M. Kilbinger, K. Kuijken, M. Milkeraitis, B.T.P. Rowe, T. Schrabback, E. Semboloni, P. Simon, M. Smit, O. Toader, S. Vafaei, E. van Uitert, M. Velander, CFHTLenS: The Canada-France-Hawaii Telescope Lensing Survey - Imaging Data and Catalogue Products, 2012. arXiv:astro-ph/1210.8156.
[22] L. Miller, C. Heymans, T.D. Kitching, L. van Waerbeke, T. Erben, H. Hildebrandt, H. Hoekstra, Y. Mellier, B.T.P. Rowe, J. Coupon, J.P. Dietrich, L. Fu, J. Harnois-Déraps, M.J. Hudson, M. Kilbinger, K. Kuijken, T. Schrabback, E. Semboloni, S. Vafaei, M. Velander, Mon. Not. R. Astron. Soc. 429 (4) (2013) 2858–2880.
[23] H. Hildebrandt, T. Erben, K. Kuijken, L. van Waerbeke, C. Heymans, J. Coupon, J. Benjamin, C. Bonnett, L. Fu, H. Hoekstra, T.D. Kitching, Y. Mellier, L. Miller, M. Velander, M.J. Hudson, B.T.P. Rowe, T. Schrabback, E. Semboloni, N. Benítez, Mon. Not. R. Astron. Soc. 421 (3) (2012) 2355–2367.
[24] Jonathan Benjamin, Ludovic Van Waerbeke, Catherine Heymans, Martin Kilbinger, Thomas Erben, Hendrik Hildebrandt, Henk Hoekstra, Thomas D. Kitching, Yannick Mellier, Lance Miller, Barnaby Rowe, Tim Schrabback, Fergus Simpson, Jean Coupon, Liping Fu, Joachim Harnois-Déraps, Michael J. Hudson, Konrad Kuijken, Elisabetta Semboloni, Sanaz Vafaei, Malin Velander, Mon. Not. R. Astron. Soc. 431 (2013) 1547–1564.