Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 108C (2017) 877–886
International Conference on Computational Science, ICCS 2017, 12-14 June 2017, Zurich, Switzerland
Toward hybrid platform for evolutionary computations of hard discrete problems

Dominik Żurek¹, Kamil Piętak¹, Marcin Pietroń¹, and Marek Kisiel-Dorohinicki¹

¹ AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Krakow, Poland
{dzurek,kpietak,pietron,doroh}@agh.edu.pl
Abstract

The memetic agent-based paradigm, which combines evolutionary computation and local search techniques, is one of the promising meta-heuristics for solving large and hard discrete problems such as Low Autocorrelation Binary Sequence (LABS) or optimal Golomb ruler (OGR). In the paper, as a follow-up of the previous research, a short concept of a hybrid agent-based evolutionary systems platform, which spreads computations among CPU and GPU, is introduced. The main part of the paper presents an efficient parallel GPU implementation of a LABS local optimization strategy. As a means for comparison, speed-ups between the GPU implementation and sequential and parallel CPU versions are shown. This constitutes a promising step toward building a hybrid platform that combines evolutionary meta-heuristics with highly efficient local optimization of chosen discrete problems.
Keywords: evolutionary computing, GPU computing, memetic computing, LABS

© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the International Conference on Computational Science
10.1016/j.procs.2017.05.201
1 Introduction

In the paper, as a follow-up of the research presented in [13], the next step in solving Low Autocorrelation Binary Sequence with efficient techniques is shown. LABS, one of the hard discrete problems, despite wide research, is still an open optimization problem for long sequences. The paper introduces a very efficient, parallel realization of local optimization for LABS implemented on GPU. The implementation is thought of as a part of a hybrid computational environment that utilizes agent-based evolutionary meta-heuristics. The integration of the environment and the proposed components will be a topic for further research.

There are a lot of various methods that try to solve the LABS problem. The simplest one is exhaustive enumeration (i.e. the brute-force method) that provides the best results, but can be applied only to small values of L. Some researchers use a partial enumeration, choosing so called skew-symmetric sequences [14] that are the most likely solutions for many lengths (e.g. for L ∈ [31, 65], 21 best sequences are skew-symmetric). However, enumerative algorithms (complete or partial) are limited to small values of L by the exponential size of the search space. Heuristic algorithms use some plausible rules to locate
good sequences more quickly. Examples include simulated annealing, evolutionary algorithms, and local optimization techniques – the list of heuristic algorithms can be found in [7]. A well-known method of local optimization for LABS is Tabu Search [9, 8] – the best results have been obtained using some variations of this technique [11]. All these methods give relatively good solutions for L < 200, but fail for larger lengths. As a result, the LABS problem found its place in the CSPLIB list, a library of test problems for constraint solvers [18]. The best known results (BKVs) for LABS can be found in [3].

The paper concentrates on solving the LABS problem using efficient parallel computations on GPU, which will be further combined with evolutionary meta-heuristics. The authors introduce the basic concept of a hybrid computational environment that combines CPU and GPU computations and present in detail the way of solving the LABS problem on GPU using one of the local search strategies, called Steepest Descent Local Search (SDLS) [2].

The paper is organized as follows: an introduction to the LABS problem together with an optimized algorithm for energy computation is presented to show a background for the GPU algorithm. Then a short introduction to the concept of evolutionary multi-agent systems (EMAS) with local optimization techniques is provided, followed by an overview of a platform allowing to run such systems in a hybrid environment comprised of CPU and GPU. The next section presents details about LABS evaluation and local optimization implemented on GPU, which constitutes the central point of the paper. Then, sample results are presented to compare efficiency between pure CPU and GPU computations and to compute the speed-up gained using GPU. Finally, after the conclusions, future work such as the integration of CPU and GPU in the shape of a hybrid computational platform is presented.
2 LABS problem
Low Autocorrelation Binary Sequence (LABS) is an NP-hard combinatorial problem with a simple formulation. It has been under intensive study since the 1960s by the physics and artificial intelligence communities. It consists in finding a binary sequence S = {s₀, s₁, ..., s_{L−1}} of length L, where sᵢ ∈ {−1, 1}, which minimizes the energy function E(S):
$$C_k(S) = \sum_{i=0}^{L-k-1} s_i\, s_{i+k}, \qquad E(S) = \sum_{k=1}^{L-1} C_k^2(S) \qquad (1)$$
M. J. Golay also defined a so-called merit factor [10], which binds the LABS energy level to the length of a given sequence:

$$F(S) = \frac{L^2}{2E(S)} \qquad (2)$$
The search space for the problem with length L has size 2^L and the energy of a sequence can be computed in time O(L²). The LABS problem has no constraints, so S can be represented naturally as an array of binary values. One of the reasons for the high complexity of the problem is that in LABS all sequence elements are correlated. One change that improves some Cᵢ(S) also has an impact on many other Cⱼ(S) and can lead to big changes of the solution's energy.
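For concreteness, definitions (1) and (2) translate directly into code. The following C++ sketch (our own illustration – the names energy and meritFactor are not taken from the discussed implementation) evaluates a sequence in O(L²):

    #include <vector>

    // Naive O(L^2) evaluation of the energy E(S) from equation (1);
    // s[i] is expected to be -1 or +1.
    long long energy(const std::vector<int>& s) {
        const int L = (int)s.size();
        long long E = 0;
        for (int k = 1; k <= L - 1; ++k) {
            long long Ck = 0;                 // C_k(S) = sum of s_i * s_{i+k}
            for (int i = 0; i + k < L; ++i)
                Ck += s[i] * s[i + k];
            E += Ck * Ck;                     // E(S) = sum of C_k^2(S)
        }
        return E;
    }

    // Merit factor from equation (2): F(S) = L^2 / (2 * E(S)).
    double meritFactor(const std::vector<int>& s) {
        const double L = (double)s.size();
        return L * L / (2.0 * energy(s));
    }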
T(S) (row k holds the products at distance k):
  s1s2  s2s3  s3s4  s4s5
  s1s3  s2s4  s3s5
  s1s4  s2s5
  s1s5

C(S) (row sums of T(S)):
  C1(S) = s1s2 + s2s3 + s3s4 + s4s5
  C2(S) = s1s3 + s2s4 + s3s5
  C3(S) = s1s4 + s2s5
  C4(S) = s1s5
Figure 1: Data structures introduced to efficiently compute fitness in the neighborhood.
Computing the energy can be sped up with a technique proposed by Cotta et al. [9], who introduced the notion of the neighborhood of a solution S with length L, obtained by flipping exactly one symbol in the sequence:

$$N(S) = \{flip(S, i)\ |\ i \in \{1, .., L\}\} \qquad (3)$$

where $flip(s_1 \ldots s_i \ldots s_L, i) = s_1 \ldots (-s_i) \ldots s_L$ [9].
Then, all computed products can be stored in an (L − 1) × (L − 1) table T(S), such that T(S)_{ij} = s_j s_{i+j} for j ≤ L − i, while the values of the different correlations are saved in an (L − 1)-dimensional vector C(S), defined as C(S)_k = C_k(S) for 1 ≤ k ≤ L − 1. Figure 1 shows these data structures for an L = 5 instance. Cotta observed that flipping a single symbol s_i multiplies by −1 the value of the cells in T(S) where s_i is involved, so the fitness of the sequence flip(S, i) can be efficiently recomputed in time O(L) using Algorithm 1.

Algorithm 1 Computing LABS energy using T(S) and C(S) structures
1: function ValueFlip(S, i, T, C)
2:   f := 0
3:   for p := 1 to L − 1 do
4:     v := C_p
5:     if p ≤ L − i then v := v − 2T_{p,i} end if
6:     if p < i then v := v − 2T_{p,(i−p)} end if
7:     f := f + v²
   end for
8:   return f
end function
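A direct C++ transcription of Algorithm 1 might look as follows (a sketch under our own indexing conventions – 1-based, as in Figure 1 – and data layout; not the authors' code):

    #include <vector>

    // O(L) energy of flip(S, i) recomputed from T(S) and C(S) (cf. Algorithm 1).
    // T[p][j] = s_j * s_{p+j} and C[p] = C_p(S); both are 1-based, with row p
    // of T holding the L-p products at distance p.
    long long valueFlip(int L, int i,
                        const std::vector<std::vector<int>>& T,
                        const std::vector<long long>& C) {
        long long f = 0;
        for (int p = 1; p <= L - 1; ++p) {
            long long v = C[p];
            if (p <= L - i) v -= 2 * T[p][i];      // s_i is the first factor of T[p][i]
            if (p < i)      v -= 2 * T[p][i - p];  // s_i is the second factor of T[p][i-p]
            f += v * v;                            // new C_p(S)^2 after the flip
        }
        return f;
    }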
The flipping method can be successfully used in various local optimization techniques where usually one interaction modifies only one sequence element. Such a technique can be used for example in the Steepest Descent Local Search method, which modifies one of the sequence elements and then checks if the new sequence has better energy. If so, the next random element is modified. The process is continued until no better sequence is created. The number of iterations can also be set to a chosen constant – this technique can be very useful when one optimizes many sequences in parallel, because the time of optimization for each solution is more or less the same, i.e. there is almost no time lost on synchronization. The SDLS algorithm that uses flip operations to compute LABS energy has also been proposed in [9].
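Building on valueFlip from the previous sketch, one steepest-descent pass over the whole neighborhood N(S) can be outlined as below (again our own illustrative code, not the authors' implementation; it accepts the single best improving flip and keeps T and C consistent):

    // One SDLS pass: try flipping every position, keep the best improving flip.
    // Returns true if the sequence was improved; T and C must initially be
    // consistent with S. Uses valueFlip from the sketch above.
    bool sdlsStep(std::vector<int>& S, long long& Er,
                  std::vector<std::vector<int>>& T, std::vector<long long>& C) {
        const int L = (int)S.size();
        int bestI = -1;
        long long bestE = Er;
        for (int i = 1; i <= L; ++i) {            // explore the whole neighborhood N(S)
            long long e = valueFlip(L, i, T, C);
            if (e < bestE) { bestE = e; bestI = i; }
        }
        if (bestI < 0) return false;              // local optimum reached
        S[bestI - 1] = -S[bestI - 1];             // accept the steepest-descent move
        Er = bestE;
        // Update C and flip the sign of every T cell that contains s_i.
        for (int p = 1; p <= L - 1; ++p) {
            if (p <= L - bestI) { C[p] -= 2 * T[p][bestI];     T[p][bestI]     = -T[p][bestI]; }
            if (p < bestI)      { C[p] -= 2 * T[p][bestI - p]; T[p][bestI - p] = -T[p][bestI - p]; }
        }
        return true;
    }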
3 Evolutionary meta-heuristics with local optimization for LABS
One of the promising meta-heuristics for solving hard problems is the evolutionary multi-agent system (EMAS) proposed by K. Cetnarowicz [6] and developed over many years of research by the IISG group at AGH-UST in Krakow. EMAS extended with local optimization techniques, the so-called memetic EMAS, has already been applied to hard discrete problems such as LABS [13, 12].
3.1 The concept of the memetic EMAS
In a basic EMAS, the system is comprised of individual agents living in a population¹. The main concepts of evolution are realized in the shape of reproduction and death operations. In optimization problems the individuals contain a solution encoded in the shape of a genotype. The genotype is inherited from parents using variation operators (i.e. mutation and recombination). As there is no global knowledge, in contrast to a classic evolutionary algorithm, the selection is based on an energy resource belonging to each individual. The level of energy is related to its quality and causes various behaviors of an individual: a high level of energy allows it to reproduce, an average level leads to energy transfer between individuals (the "fight" operation) and a low level causes the agent's death. Each newborn individual has to be evaluated by a fitness function which rates its genotype in the context of a given problem. In the memetic variant of EMAS, the evaluation can be further improved utilizing various local optimization techniques such as Tabu Search, Steepest Descent Local Search or Random Mutation Hill Climbing.

Algorithm 2 A simplified algorithm of population processing in memetic EMAS
1: function processPopulation(population)
2:   pairs := selectPairs(population)
3:   newBorn := Array[]
4:   for pair in pairs do
5:     if canReproduce(pair) then newBorn += reproduce(pair)
6:     else fight(pair) end if
   end for
7:   newBorn := evaluate(newBorn)
8:   newBorn := improve(newBorn)
9:   population += newBorn
10:  for ind in population do
11:    if shouldDie(ind) then remove(population, ind) end if
   end for
12:  return population
end function

The concept of memetic EMAS can also be illustrated using the simplified, sequential algorithm shown in Algorithm 2. The processPopulation function contains a single "step" that is called sequentially until a stop condition is reached. Within each step, pairs for reproduction or fight are selected from the current population. If a selected pair is "good enough", then a new individual is "born" from the selected parents. Otherwise, individuals from the pair fight with each other, and in consequence a portion of life energy is transferred between them. Next, all newborn individuals are evaluated and improved using local optimization techniques. Finally, all weak individuals are "killed", i.e. removed from the population.

¹ There is also a multi-population variant of EMAS, but it is out of the scope of this paper.
In [13, 12] we showed results coming from many experiments performed using a pure CPU architecture with one or many threads. The results show that memetic EMAS is a promising technique for solving not only relatively small LABS problems (L ∼ 50), but also sizes that are still a challenge (e.g. L ∼ 200).
3.2 The concept of a hybrid platform
Evolutionary multi-agent systems, as well as classic evolutionary algorithms, can be easily parallelised using various techniques [5, 1]. Two different research directions in this field can be observed:

• decomposition models that assume a geographic division of the population and constraints on interaction between particular individuals; the whole process of evolution is then parallelised in particular sub-populations; the most popular decomposition concepts are migration and cell models;

• global parallelisation, in which particular steps of evolution performed on different individuals are implemented on many computational nodes.

When computations with GPUs are considered, the global parallelisation model can be applied by using a master-slave architecture. The architecture assumes that a central unit performs most of the evolutionary process (i.e. reproduction, energy transfer or death operations) and delegates some of the most expensive computations to the GPU. In the case of memetic EMAS (see Algorithm 2) the operations of evaluation and improvement are a good choice to delegate to slaves. These two operations are combined and extended with the required conversions between CPU and GPU representations of a given problem. To minimize the communication overhead and use the parallel nature of the GPU, the evaluation and improvement operation is done for the whole population at once, as sketched below. The GPU interface gets a variable number of solutions that belong to newly created individuals and then performs evaluation and improvement. In effect, the GPU returns a vector of evaluations together with improved solutions to update the adequate individuals.

The described concept of the hybrid computational environment is partially implemented on the AgE platform² – a Java-based solution developed as an open-source project by the Intelligent Information Systems Group of AGH-UST. AgE provides a platform for the development and execution of distributed agent-based applications, mainly in simulation and computational tasks [4, 15]. The modular architecture of AgE allows one to use components to assemble and run agent-based computations for various problems such as black-box hard discrete problems (e.g. LABS, OGR, Job-shop) [12, 13] or continuous optimization problems. The platform is prepared to be integrated with external solvers such as the presented LABS optimization on GPU.

² https://gitlab.com/age-agh/age3
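The master-slave contract described above can be summarized by a narrow interface: the CPU hands a batch of genotypes to the GPU and receives evaluated, improved solutions back. The following C++ sketch shows one possible shape of such an interface (the names LabsBatch, evaluateAndImproveOnGpu, packNewborn and unpackInto are our assumptions for illustration, not the actual AgE or GPU API):

    #include <cstdint>
    #include <vector>

    // A batch of LABS solutions exchanged between the evolutionary master (CPU)
    // and the local-optimization slave (GPU); one entry per newborn individual.
    struct LabsBatch {
        int L;                            // sequence length
        std::vector<int8_t> genomes;      // count * L values in {-1, +1}
        std::vector<long long> energies;  // filled by the slave, one per genome
    };

    // Hypothetical slave-side entry point: evaluates and improves every solution
    // in the batch at once (SDLS with a fixed iteration budget), overwriting
    // `genomes` with the improved sequences and filling `energies`.
    void evaluateAndImproveOnGpu(LabsBatch& batch, int sdlsIterations);

    // Master-side usage inside one EMAS step (cf. Algorithm 2, lines 7-8):
    //   LabsBatch batch = packNewborn(newBorn);   // agents -> flat representation
    //   evaluateAndImproveOnGpu(batch, 128);      // whole batch in one delegation
    //   unpackInto(newBorn, batch);               // flat representation -> agents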
4 Parallel SDLS algorithm for LABS local optimization
To better understand the details of evaluation and improvement of LABS solutions, let us introduce the basics of GPU processing. GPU processors are constructed as a structure of N multiprocessors with M cores each. The cores share an Instruction Unit with the other cores in a multiprocessor. The multiprocessors contain dedicated memories which are shared among all their cores and are much faster than the global memory. There are also efficient read-only constant
and texture memories with global access. Constant memory is efficient for reading constants and texture memory for random read operations. In contrast to CPUs, GPGPUs have less logic for control and branch prediction. A GPGPU enables running thousands of parallel threads, which are grouped in blocks with shared memory. The main aspects in the development of highly efficient GPU programs are: the usage of memory, an efficient division of code into parallel threads, and communication between the threads. Additionally, synchronization between the threads should be considered an important issue.

In this paper the authors used graphics cards from the Tesla family, which are produced by NVIDIA³. This kind of GPGPU can be programmed with the high-level programming framework CUDA (Compute Unified Device Architecture). CUDA is a set of software layers to communicate with the GPU, based on standard C/C++. The platform gives direct access to the GPU's virtual instruction set and parallel computational elements (C or C++ code can be directly sent to the GPU without using assembly code). Creating optimized code requires knowledge of how threads are mapped to blocks and grids, how to synchronize them, and how to distinguish access to various types of memory (registers, shared, global and read-only memories).

CUDA provides two kinds of APIs: the runtime API and the Driver API. In the first one, kernels are defined and compiled together with C/C++ files. The source code is compiled using the NVIDIA CUDA Compiler (NVCC). As a result of the compilation, an executable file containing the entire program is returned. Using NVCC it is not possible to compile source code from another language like Java or Python. The Driver API is used when one needs to invoke CUDA kernels from code in a high-level programming language. As an example, JCuda (Java for CUDA)⁴ can be pointed out.

³ http://www.nvidia.com/object/cuda_home_new.html
⁴ Project homepage: https://jcuda.org
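For readers unfamiliar with the runtime API, the overall shape of a CUDA program is illustrated below: a kernel is marked with __global__, launched over a grid of thread blocks, and data is moved explicitly between host and device. This is a minimal generic example of our own, unrelated to the LABS kernels discussed next:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Trivial kernel: each thread negates one element (the {-1,+1} "flip").
    __global__ void negate(int* data, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n) data[idx] = -data[idx];
    }

    int main() {
        const int n = 1024;
        int host[n];
        for (int i = 0; i < n; ++i) host[i] = (i % 2) ? 1 : -1;

        int* dev = nullptr;
        cudaMalloc(&dev, n * sizeof(int));
        cudaMemcpy(dev, host, n * sizeof(int), cudaMemcpyHostToDevice);

        negate<<<(n + 255) / 256, 256>>>(dev, n);  // grid of blocks, 256 threads each

        cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost);
        cudaFree(dev);
        printf("first element after flip: %d\n", host[0]);
        return 0;
    }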
4.1 Parallel SDLS realization for LABS on GPU
Based on an analysis of efficient heuristics, the authors concluded that the SDLS algorithm for the LABS problem is suitable for implementation on GPU cards, because this approach can be fully parallelized. The module responsible for optimizing the LABS problem using the SDLS method receives binary sequences of length L represented as strings in {−1, 1}^L, which were randomly generated, for example by the curand library⁵. The provided vectors are spread between blocks, so each of them contains its own sequence.

In the first step of the proposed algorithm, each block evaluates a given sequence S_I using two data structures, T(S) and C(S) (see Figure 1), which are stored in shared memory. In order to create these structures, L − 1 iterations have to be performed. During the first iteration, the distance (s_i s_{i+distance}, where distance < L) has value 1 and L − 1 threads are used. In each following iteration the distance is incremented and the number of threads used to compute the next values of the structures (the next row in T(S)) is decremented. To create an element of C(S), after each iteration the results of the current row of T(S) are summed using a reduction operation. To calculate the first energy value, which will be used as a reference energy E_r, the result of each reduction is raised to the power of 2 and added to the value from the previous iteration.

⁵ The curand library provides facilities that focus on the simple and efficient generation of high-quality pseudo-random and quasi-random numbers.
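This first phase could be realized by a device-side fragment like the one below (a simplified sketch under our own assumptions – one block per sequence, blockDim.x threads with L rounded up to a power of two, shared arrays passed in by the enclosing kernel; not the authors' exact kernel). Persisting each row into the packed T(S) table is omitted for brevity:

    // Builds C(S) row by row and accumulates the reference energy E_r.
    // s: input sequence, row: scratch of blockDim.x ints, C: L ints, Er: scalar;
    // all in shared memory. The tree reduction assumes blockDim.x is a power
    // of two (real code pads the scratch array accordingly).
    __device__ void buildStructures(const short* s, int* row, int* C,
                                    long long* Er, int L) {
        const int j = threadIdx.x;
        if (j == 0) *Er = 0;
        __syncthreads();
        for (int d = 1; d < L; ++d) {                    // one iteration per distance d
            row[j] = (j < L - d) ? s[j] * s[j + d] : 0;  // row d of T(S), zero-padded
            __syncthreads();
            for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
                if (j < stride) row[j] += row[j + stride];   // tree reduction
                __syncthreads();
            }
            if (j == 0) {
                C[d] = row[0];                           // C_d(S), equation (1)
                *Er += (long long)row[0] * row[0];       // accumulate E(S)
            }
            __syncthreads();
        }
    }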
The next step of the algorithm is a local optimization that utilizes the SDLS approach with a given number of iterations. The method explores a sequence neighborhood and therefore allows
to use the fast flip strategy in sequence energy computations (see Section 2). The concept is similar to the sequential version of SDLS [9], but each thread mutates a different bit of the input sequence in parallel. In this process each block uses L threads, each of which:

1. changes the value of the sequence element s_indexOfThread to the opposite one (−1 · s_indexOfThread),

2. computes the new energy value according to the SDLS algorithm with value-flip evaluation [9],

3. stores the new energy in shared memory and a temporary C′(S) in the thread's cache memory (in order not to recalculate the common C(S) after it is updated by the thread which calculated the best energy).
According to the ValueFlip function (see Algorithm 1), the final energy f is summed up from the v values that represent cells of the new C(S) structure. Next, from the L energy values the best one is chosen using a reduction operation on a vector stored in shared memory. Additionally, the ids of the corresponding threads are stored in the same shared memory. In order to find the best energy and the corresponding thread id, two parallel reduction processes are run (Figure 2): the first finds the minimum energy, the second finds the index of the thread which computed this energy. In the proposed algorithm, all threads create their own C′(S) structure in their cache memory based on the v values. Only the one thread which computed the best energy in the current iteration updates the common C(S) structure (C(S) is overwritten by the C′(S) of the winner thread).
Figure 2: Reduction to find the best energy and the corresponding thread in a single block

Then the minimum energy E_min is compared with the reference energy E_r which was computed from the input sequence S_I. If E_min < E_r then the E_min value becomes the new reference energy and the sequence for this energy is used as the new input sequence S_I. The thread which computed the E_min energy copies its C(S) table from its cache to shared memory. Additionally, the same thread updates the T(S) structure, changing the cells which contain elements with index equal to the thread id.
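The two reductions of Figure 2 can also be fused by reducing (energy, thread id) pairs together, which is a common CUDA pattern; a sketch under the same assumptions as before (shared arrays of blockDim.x entries, power-of-two block size, names of our own choosing):

    // After this call, energies[0] holds the minimum flip energy of the block
    // and ids[0] the id of the thread that produced it (cf. Figure 2); both
    // arrays live in shared memory, one entry per thread.
    __device__ void argminReduce(long long* energies, int* ids) {
        const int j = threadIdx.x;
        ids[j] = j;
        __syncthreads();
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (j < stride && energies[j + stride] < energies[j]) {
                energies[j] = energies[j + stride];  // keep the smaller energy
                ids[j]      = ids[j + stride];       // and remember its thread
            }
            __syncthreads();
        }
    }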
Cotta proposed to run the SDLS optimization until no further improvement is achieved [9]; however, this concept in a parallel version could cause synchronization overhead (when one block waits for another that still seeks a better sequence, performing more iterations). This is why a constant number of iterations has been chosen, to make all blocks execute in the same (or very similar) time and thus minimize the waiting time before synchronization. After N iterations, the achieved results (energies and sequences) are copied to global memory and then to the CPU, where they can be further processed by evolutionary meta-heuristics.

During this process memory management is crucial. For a sequence of length L, the following structures have to be allocated in the shared memory by each block:

• L · sizeof(short) – sequence representation,
• ((L − 1) · L)/2 · sizeof(short) – T(S) table,
• (L − 1) · sizeof(int) – C(S) table,
• 2 · L · sizeof(int) – energies after flipping with ids of the corresponding threads.
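In a dynamic launch this budget becomes a single byte count passed at kernel launch; a small helper could compute it as follows (illustrative only, assuming 2-byte short and 4-byte int, and a hypothetical kernel name sdlsKernel):

    #include <cstddef>

    // Shared memory needed per block for a sequence of length L (the list above).
    size_t sharedBytesPerBlock(int L) {
        size_t seq   = (size_t)L * sizeof(short);                // sequence
        size_t T     = (size_t)(L - 1) * L / 2 * sizeof(short);  // packed T(S)
        size_t C     = (size_t)(L - 1) * sizeof(int);            // C(S)
        size_t flips = (size_t)2 * L * sizeof(int);              // energies + ids
        return seq + T + C + flips;
    }

    // Hypothetical launch with dynamic shared memory, one block per sequence:
    //   sdlsKernel<<<numSequences, threadsPerBlock, sharedBytesPerBlock(L)>>>(...);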
5 Experimental Results
To show the effectiveness of the presented SDLS algorithm for LABS realized on GPU, both GPU and CPU implementations have been considered. On CPU, sequential and parallel versions of the algorithm (for 2 and 4 cores) have been run. The experiments for GPU were performed on an NVIDIA Tesla M2090 and for CPU on an Intel(R) Xeon(R) CPU E5-2630 v3 (2.4 GHz).

The reference version of SDLS for LABS on CPU has been implemented using the OpenMP⁶ library. It is a concurrency platform for multi-threaded, shared-memory parallel processing for the C, C++ and Fortran languages. It is a natural choice for building such algorithms, because OpenMP allows one to easily transform sequential code into parallel code using appropriate compiler directives to generate threads for the parallel processor platform (there is no need to create threads or assign tasks to each of them manually). In the experiments the -O3 compiler optimization flag was used to optimize the multi-core CPU implementation.

Both versions have been run for LABS sequences of lengths L = 48 and L = 201, comparing times of execution for different numbers of solutions (1000, 4000, 8000, 16000, 32000, 64000). The number of SDLS iterations has been set to 128. The results were gathered from 10 executions of each configuration – the average values are presented. The observed standard deviation of the average results was negligible.

No. of solutions | GPGPU   | CPU (1 core) | Speed-up | CPU (2 cores) | Speed-up | CPU (4 cores) | Speed-up
1000             | 22.61   | 698.8        | 30.90    | 348.2         | 15.40    | 239.2         | 10.58
4000             | 73.93   | 2663.8       | 36.03    | 1398.5        | 18.92    | 1120.9        | 15.16
8000             | 144.39  | 5341.6       | 36.99    | 2791.8        | 19.33    | 2281          | 15.80
16000            | 285.74  | 10683.6      | 37.39    | 5600.5        | 19.60    | 5030.9        | 17.61
32000            | 568.99  | 21380.4      | 37.58    | 11795.9       | 20.73    | 6605.2        | 11.61
64000            | 1129.87 | 42803.3      | 37.88    | 23574.8       | 20.87    | 13077.1       | 11.57

Table 1: Execution times in milliseconds for SDLS with 128 iterations for LABS L = 48

⁶ http://www.openmp.org/
No. of solutions | GPGPU    | CPU (1 core) | Speed-up | CPU (2 cores) | Speed-up | CPU (4 cores) | Speed-up
1000             | 916.35   | 11645.4      | 12.71    | 5756.9        | 6.28     | 3016.8        | 3.29
4000             | 3314.44  | 46524.3      | 14.04    | 23114.4       | 6.97     | 12059.2       | 3.64
8000             | 6869.51  | 93227.1      | 13.57    | 46574.2       | 6.78     | 24575.4       | 3.58
16000            | 12583.08 | 185832.4     | 14.77    | 93483.8       | 7.43     | 50674.3       | 4.03
32000            | 26573.16 | 376872.9     | 14.18    | 197617.3      | 7.44     | 112675.5      | 4.24
64000            | 53760.38 | 742100.9     | 13.80    | 396481.4      | 7.37     | 230078.9      | 4.28
Table 2: Execution times in milliseconds for SDLS with 128 iterations for LABS L = 201

The results shown in Tables 1 and 2 present average times in milliseconds for particular configurations together with the speed-up value, computed as the relation between the time on CPU and the GPU version ($S_p = T_{CPU}/T_{GPU}$). For L = 48 sequences (Table 1), the GPU algorithm is significantly faster than the sequential version on CPU (∼ 30–37×) and still gives a very good speed-up even against 4 CPU cores (∼ 10–17×), regardless of the number of solutions. For L = 201 sequences (Table 2), the GPU executes around ∼ 13× faster than the sequential version and around ∼ 3.2–4.2× faster than the 4-core CPU version. Some decrease of the speed-up is caused by the fact that the proposed algorithm is optimized for L ∼ 200 sequences, for which it utilizes most of the GPU's logical resources. In such cases, the physical GPU resources have to be shared between particular blocks and threads, and so the efficiency is lower. In all cases the parallel version of SDLS on GPU gives a significant efficiency improvement and shows that the presented implementation can be used for quickly finding better LABS sequences, narrowing the domain for the further process of LABS search using other techniques such as evolutionary algorithms.
6 Conclusions and further work
This paper is one of the first steps toward an efficient hybrid platform dedicated to solving difficult discrete problems such as LABS or Golomb-ruler optimization, which utilizes both evolutionary multi-agent systems and local optimization techniques implemented on GPU. The presented implementation of LABS optimization shows that a significant speed-up can be achieved using parallel GPU computations. The implementation can be further integrated with meta-heuristics such as evolutionary algorithms, which constitute a basis for the concept of a hybrid computational environment in the master-slave model that seems to be very promising.

In the near future the authors plan to fully integrate the AgE platform with the local optimization algorithm implemented on GPU. An important direction of the research is to adjust both algorithms to lower the communication overhead between CPU and GPU, which includes the required data conversions. Also, the plan assumes the implementation of the hybrid concept for other difficult problems that can be solved using algorithms with a parallel structure – some research on optimal Golomb ruler search has already been published in [17, 16].
6.1 Acknowledgments
The research presented in the paper received support from the AGH University of Science and Technology statutory project and from the Faculty of Computer Science, Electronics and Telecommunications Dean's Grant for Ph.D. Students and Young Researchers.
References

[1] E. Alba and M. Tomassini. Parallelism and evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 6(5):443–462, Oct 2002.
[2] Michael Bartholomew-Biggs. The Steepest Descent Method, pages 1–8. Springer US, Boston, MA, 2008.
[3] B. Bošković, F. Brglez, and J. Brest. A GitHub archive for solvers and solutions of the LABS problem. For updates, see https://github.com/borkob/git_labs, January 2016.
[4] Aleksander Byrski and Marek Kisiel-Dorohinicki. Agent-based model and computing environment facilitating the development of distributed computational intelligence systems. In Proceedings of the 9th International Conference on Computational Science, ICCS 2009, pages 865–874, Berlin, Heidelberg, 2009. Springer-Verlag.
[5] Erick Cantu-Paz. Efficient and Accurate Parallel Genetic Algorithms. Kluwer Academic Publishers, Norwell, MA, USA, 2000.
[6] K. Cetnarowicz, M. Kisiel-Dorohinicki, and E. Nawarecki. The application of evolution process in multi-agent world (MAW) to the prediction system. In Proc. of 2nd Int. Conf. on Multi-Agent Systems (ICMAS'96). AAAI Press, 1996.
[7] C. de Groot, K.H. Hoffmann, and D. Würtz. Low Autocorrelation Binary Sequences: Exact Enumeration and Optimization by Evolution Strategies. IPS research report. Eidgenössische Technische Hochschule Zürich, Interdisziplinäres Projektzentrum für Supercomputing, 1989.
[8] Iván Dotú and Pascal Van Hentenryck. A Note on Low Autocorrelation Binary Sequences, pages 685–689. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.
[9] José E. Gallardo, Carlos Cotta, and Antonio J. Fernández. Finding low autocorrelation binary sequences with memetic algorithms. Appl. Soft Comput., 9(4):1252–1262, September 2009.
[10] M. Golay. The merit factor of long low autocorrelation binary sequences (corresp.). IEEE Transactions on Information Theory, 28(3):543–549, May 1982.
[11] Steven Halim, Roland H. C. Yap, and Felix Halim. Engineering Stochastic Local Search for the Low Autocorrelation Binary Sequence Problem, pages 640–645. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008.
[12] Magdalena Kolybacz, Michal Kowol, Lukasz Lesniak, Aleksander Byrski, and Marek Kisiel-Dorohinicki. Efficiency of memetic and evolutionary computing in combinatorial optimisation. In Webjørn Rekdalsbakken, Robin T. Bye, and Houxiang Zhang, editors, ECMS, pages 525–531. European Council for Modeling and Simulation, 2013.
[13] Michał Kowol, Aleksander Byrski, and Marek Kisiel-Dorohinicki. Agent-based evolutionary computing for difficult discrete problems. Procedia Computer Science, 29:1039–1047, 2014.
[14] Tom Packebusch and Stephan Mertens. Low autocorrelation binary sequences. Journal of Physics A: Mathematical and Theoretical, 49(16):165001, 2016.
[15] Kamil Piętak, Adam Woś, Aleksander Byrski, and Marek Kisiel-Dorohinicki. Functional Integrity of Multi-agent Computational System Supported by Component-Based Implementation, pages 82–91. Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[16] M. Pietron, A. Byrski, and M. Kisiel-Dorohinicki. Leveraging heterogeneous parallel platform in solving hard discrete optimization problems with metaheuristics. Journal of Computational Science, 18:59–68, 2017.
[17] Marcin Pietron, Aleksander Byrski, and Marek Kisiel-Dorohinicki. GPGPU for difficult black-box problems. Procedia Computer Science, 51:1023–1032, 2015.
[18] Toby Walsh. CSPLib problem 005: Low autocorrelation binary sequences. http://www.csplib.org/Problems/prob005. Accessed: 2017-01-31.