Multi-objective optimisation of software application mappings on heterogeneous MPSoCs: TONPET versus R2-EMOA

Gereon Führ a,∗, Ahmed Hallawa a,∗∗, Rainer Leupers a, Gerd Ascheid a, Juan Fernando Eusse b

a RWTH Aachen University, Germany
b Silexica GmbH, Germany

Keywords: Power-performance trade-off; Mapping; Heterogeneous MPSoCs; Virtual platform; Multi-objective optimisation; Pareto

Abstract

For heterogeneous multi-core architectures, efficient development of parallel software is paramount. Fast and accurate compiler technology is required in order to exploit their advantages and to optimise for multiple objectives, such as performance and power. The work at hand presents a heuristic and a state-of-the-art Evolutionary Multi-Objective Algorithm (EMOA) approach to tackle this problem. The performance and consistency of the population-based heuristic TONPET and the indicator-based EMOA are compared and thoroughly analysed. For the evaluation, both are integrated into the SLX tool suite. Representative benchmarks and three different MPSoC platforms are chosen for an in-depth, realistic analysis. For small and medium-sized solution spaces, TONPET outperforms the EMOA with 4.7% better Pareto fronts on average, while being 18× faster in the worst case. In vast solution spaces, the EMOA consistently produces 3% better Pareto fronts on average, but TONPET runs 88× faster in the worst case. Furthermore, for comparison purposes, a full performance consistency analysis of the EMOA is conducted.

1. Introduction

Advancements in the field of heterogeneous Multi- and Many-Processor Systems-on-Chip (MPSoCs) offer a solution to the increasing demand for computational performance under tight power budgets. However, developing software applications for MPSoCs presents many challenges. Firstly, parallel programming is more complicated than writing sequential code. Secondly, developers must take the added hardware intricacies into consideration, such as Processing Element (PE) communication and bus contention, in addition to frequency and voltage settings. Finally, the software application mapping process is challenging, as it is guided by multiple objectives, such as reducing power consumption, meeting deadlines, or optimising memory throughput. In order to facilitate the modelling of parallel behaviour, the Kahn Process Network (KPN) [22] programming model is widely used for MPSoC software application development. KPN relies on First-In First-Out (FIFO) channels for the communication between deterministic, sequential processes. Consequently, the

efficiency of the design depends on the mapping procedure, which determines the scheduling and distribution of the processes on the MPSoC. A wide range of tools currently addresses this step. For example, the commercially available SLX tool suite [2] provides a language capable of representing the parallel nature of applications; it is thus suited for software mapping of different tasks and generates code for multiple targets. MPSoC mapping problems can be modelled as a Multi-Objective Optimisation Problem (MOOP). In contrast to single-objective optimisation, the solution of a MOOP consists of a set of non-dominated optimal solutions rather than a single solution. This set is dubbed the Pareto Optimal (PO) set. Due to the presence of multiple objectives, one solution can dominate another w.r.t. one objective, but fail to achieve such dominance in another. In the context of MPSoCs, the objectives are usually associated with meeting pre-set deadlines, reducing power consumption, or optimising memory throughput. In Ref. [29], it is shown that this is an NP-hard problem, even for single-objective optimisation. Consequently, mapping heuristics are widely adopted, as highlighted in Ref. [31]. This


is one reason why we selected a heuristic for finding acceptable solutions of MOOPs in reasonable time. There are many multi-objective optimisation algorithms in the literature, such as particle swarm optimisation [36,37]. However, one widely adopted family of algorithms suited for MOOPs are Evolutionary Multi-Objective Algorithms (EMOAs). These metaheuristic global search algorithms find close-to-optimal Pareto fronts in multi-dimensional solution spaces. Inspired by biological evolution, EMOAs are suited for finding a Pareto front in a MOOP as they are population based: they start with a population of solutions rather than a single solution. This suits the search for a PO set well, as the end goal is to find multiple solutions. Furthermore, EMOAs are global search algorithms, which gives them a performance edge in vast multimodal solution spaces. However, EMOAs suffer from performance sensitivity to hyperparameter tuning, thus requiring rigorous tuning. Further, they are stochastic, which may affect the consistency of their performance, and they require relatively high computational resources. Taking a look at related work (Section 2), EMOAs are commonly used to solve MOOPs in the context of MPSoC optimisation. This is one reason why we selected a state-of-the-art EMOA as comparison candidate in this work. In an attempt to avoid these disadvantages while exploiting some of the EMOAs' advantages, the authors in Ref. [27] presented a population-based heuristic dubbed TONPET. It is designed to produce Pareto fronts for MPSoC software application mappings faster than state-of-the-art EMOAs. In their comparisons, they picked an indicator-based EMOA, the R2-indicator EMOA (R2-EMOA). This choice was driven by the fact that, unlike the popular Hypervolume-indicator EMOA, the R2-EMOA requires less computational resources. In this work, we provide a thorough and elaborate analysis of the difference in performance between TONPET and the R2-EMOA. As it is one of the main differences, the focus is on consistency. Firstly, we present the heuristic TONPET [27] in detail. Afterwards, we provide an explanation of the inner workings of the R2-EMOA algorithm. Finally, we provide an in-depth analysis of the difference in performance between TONPET and the R2-EMOA on three different platforms: the ODROID-XU3 [1], the Keystone II 66AK2H [4] and a heterogeneous Many-core Virtual Platform (McVP) developed in-house. This analysis includes studying the statistical behaviour of the R2-EMOA on each platform and its effect on the consistency of its performance. Consequently, this paper presents the following main contributions:

∙ Implementing an efficient heuristic to conduct MOOP for SW mappings on MPSoCs.
∙ Designing a novel R2-EMOA algorithm.
∙ Integrating both the heuristic and the R2-EMOA into the SLX tool suite.
∙ Evaluating the heuristic performance relative to the R2-EMOA with representative benchmarks on three different platforms.
∙ Analysing the consistency and statistical behaviour of the R2-EMOA.

2. Related work

Single-objective mapping approaches, especially power-aware mapping strategies, have been explored in recent years. The authors of [29] discuss a detailed mathematical representation of the multi-dimensional problem of power-driven application mapping onto heterogeneous MPSoCs. Heuristics dealing with this NP-hard optimisation problem use statistical execution profiles of the software as input. This mathematical problem definition is extended to incorporate voltage and frequency scaling in Ref. [28]. There, a heuristic integrated into a software mapping framework is proposed with the goal of minimising the average power consumption while obeying deadlines. Both report acceptable performance of their heuristics. In Ref. [25], the authors present an algorithm which first maximises the throughput; afterwards, processes are moved to PEs with lower power consumption to fulfil power constraints.

In this context, taking multiple objectives into account becomes a paramount research direction. Of note, the authors of [23] propose a decomposition approach instead of dealing with the entire problem. The large-scale problem is divided into sub-problems, following the divide-and-conquer concept. By balancing the workload for each processor, cluster and network, the Pareto front of this mapping optimisation problem is calculated in less than one hour. In Ref. [21], the objectives performance and energy are optimised, taking memory usage into account. Configurable Network-on-Chip (NoC) based MPSoCs are addressed in Ref. [8], where the performance and cost requirements of an NoC configuration are optimised concurrently. These works have in common that the approaches are based on evolutionary algorithms (EAs). To cope with their rather long computation times, we propose a fast-executing heuristic to enable multi-objective optimisation and Pareto front calculation.

Another popular approach to MOOPs are machine-learning-based models that find the best power and performance trade-off. The authors of [5] propose to use multivariate linear regression to perform run-time optimisation of the task mapping and of the best frequency and voltage settings. A library entry is stored with these values for each application, and an entry is selected based on abstract power and performance models. In Ref. [19], multinomial logistic regression classification is deployed to map a set of classifiers offline to Pareto-optimal platform configurations. These classifiers are consulted during run time to select the appropriate configuration depending on the current workload. A recent approach uses deep Q-learning for dynamically controlling the PE type, the number of PEs and their frequency [18]. As for the other approaches, a design-time training is necessary to correlate the Q-values with actions depending on the workload of known applications. At run time, the network acts based on the current workload. In contrast, this work proposes a design-time method, which avoids the potential run-time overhead. Further, no instrumentation of the applications is necessary to obtain workload statistics at run time and drive the optimisation approach.

Combining mapping optimisation into an entire framework to support diverse application scenarios is done in Refs. [33,34]. They target architecture synthesis in FPGAs and SystemC. Parallelism within applications is specified with a language based on Kahn Process Networks (KPNs), and performance optimisation is computed with the help of EAs. Focusing on software synthesis for MPSoCs, the commercially available SLX tool suite [2] offers performance-optimised mappings for streaming applications with the help of a KPN-based language. We take the facilities of the SLX tool suite as a starting point to incorporate TONPET.

Regarding EMOAs, the literature is rich with a wide range of algorithms. One categorisation identifies two main classes of EMOA: dominance based and indicator based. In dominance-based EMOAs, solutions are considered fitter than others based on dominance properties, such as dominance rank, dominance count, and dominance depth. In dominance ranking, the fitness is decided based on the number of solutions which dominate the solution under investigation. One example of this is MOGA [26]. In dominance count, in contrast, the fitness is driven by counting all solutions that the solution under investigation dominates. This is problematic because a better dominance count without taking the rank of the solution into consideration, i.e. which Pareto front it is located on, can lead to a misrepresentation of the quality of the solution. One example adopting this method is the Strength Pareto Evolutionary Algorithm (SPEA) [24,40]. In dominance depth, on the other hand, the fitness is decided based on the position of the solution from the Pareto front point of view. This means that all solutions are first grouped into Pareto fronts; then, a solution that belongs to a more dominating front (higher rank) is considered fitter. An example is the Non-dominated Sorting Genetic Algorithm 2 (NSGA-2) [12]. The literature suggests that dominance depth consistently outperforms the other methods. However, one big problem is that algorithms such as NSGA-2

cannot handle more than two objectives without suffering from a deterioration in performance. As a result, the EMOA literature currently leans towards the use of indicators rather than dominance-based methods. In indicator-based EMOAs, the multi-objective problem is transformed into a single-objective problem, where the pre-chosen indicator is used as a fitness function. In this regard, there are two widely adopted indicators: the Hypervolume Indicator (HI) and the R2 indicator. This is covered in detail in Section 4.4. It is worth mentioning that there are other indicator-based algorithms, such as IBEA [39]. However, they are commonly outperformed by HI- and R2-based approaches.

3. System model

We base this work on the SLX tool suite. Fig. 1 illustrates the mapping tool flow for already parallelised applications. The MOOP analysis part is the new piece of technology that we propose; it is discussed in detail in Section 4. The already parallelised application (Section 3.1) and the MPSoC platform model (Section 3.2) are required as inputs of the tool flow. Optionally, constraints can be defined, e.g., real-time and performance demands, or already known process-to-PE assignments. The subsequent steps, Mapper and Constraints check, offer heuristics to optimise for performance, PE power consumption, or both. The target-specific code generator translates the parallel code into plain C code, incorporating the findings of the Mapper. The plain C code can then be processed by the MPSoC compiler.

Fig. 1. Toolflow overview for parallel applications.

3.1. Application model

Applications are described as directed graphs in the KPN model of computation [22], i.e. A = {Z, C}, where Z is the set of application processes and C is the set of directed FIFO channels. A process z_i ∈ Z communicates with a process z_j ∈ Z via an unbounded point-to-point FIFO channel c_ij ∈ C. Relying on a small set of keywords, the KPN applications are implemented in ANSI-C. With C for Process Networks (CPN), processes and channels, and the operations required to access them, are described [30]. For the realisation of unbounded FIFO channels, the minimum deadlock-free size is chosen. An event-driven approach is used to simulate a CPN application. The Trace Replay Module (TRM) computes all required timing and power information, such as process and total execution time, and individual and collective PE power values. The TRM is based on dynamic and static profiling. The former collects traces that contain the dynamic behaviour of the application. As KPNs are untimed, timing information is added to those traces based on the approach described in Ref. [14]. Source-level instrumentation and cost calculation are applied at the granularity of Basic Blocks (BBs).1 The processor instruction set cost is the input for the clock cycle calculation of each BB. The cost of a BB is generated by performing lowering, resource-constrained list scheduling, and code placement. Timing information to annotate the traces becomes available after summing up all BBs that are executed between two channel accesses, called a segment. Using the abstract power model introduced in Section 3.2, a trace is computed that contains the required power information at the granularity of a segment. As no channel accesses occur within a segment, the PE is always in the same power state.

1 A basic block is a code sequence with only one entry and one exit. There are no branches in and no branches out, except at the entry and at the exit, respectively.

3.2. MPSoC model

The architectural model of the target platform is specified in the following way to describe the relation of, e.g., memory and communication architecture, and the type and number of PEs. An MPSoC platform L is modelled as a directed graph L = {R, E}, where R is the set of hardware resources present in the platform and E represents the set of connections. R consists of all PEs Q and all memories M, with R = Q ∪ M and Q ∩ M = ∅. The set D_q = {d_1, d_2, …} denotes all memories reachable by q ∈ Q. PE power modelling is done as follows. The underlying model consists of the basic CMOS power consumption parts, i.e. the leakage power P_s = I · V and the dynamic power P_d = u · C · f · V², where I denotes the leakage current and V the present voltage. C is the switching capacitance and f the operating frequency. The utilisation value u is the fraction of the entire execution time the PE has been active and thus consumed dynamic power. PEs connected to the same power supply are part of a common voltage domain. Similarly, all PEs connected to the same clock are part of the same frequency domain. In Ref. [28], it is shown that this model achieves estimation errors of about 9% on average for PEs including L1 caches.
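To make the abstract power model concrete, the following minimal Python sketch evaluates the leakage and dynamic terms for a single PE. The class and parameter names are illustrative assumptions and are not part of the SLX tool suite.

```python
from dataclasses import dataclass

@dataclass
class PEPowerConfig:
    """Illustrative per-PE parameters of the abstract power model (names are ours)."""
    leakage_current: float        # I, in ampere
    voltage: float                # V, minimum permitted voltage for the chosen frequency
    switching_capacitance: float  # C, in farad
    frequency: float              # f, in hertz

def pe_power(cfg: PEPowerConfig, utilisation: float) -> float:
    """Leakage plus dynamic power of one PE: P_s = I*V and P_d = u*C*f*V^2."""
    p_static = cfg.leakage_current * cfg.voltage
    p_dynamic = utilisation * cfg.switching_capacitance * cfg.frequency * cfg.voltage ** 2
    return p_static + p_dynamic

# Example: a hypothetical PE at 1 GHz and 0.9 V that is active 60% of the time.
print(pe_power(PEPowerConfig(0.01, 0.9, 1e-9, 1e9), utilisation=0.6))
```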

4. Multi-objective optimisation

A general MOOP requires the simultaneous minimisation of n objectives and is defined as

minimise f(x) = (f_1(x), f_2(x), …, f_n(x))
subject to x ∈ X_f

A solution is represented by x = (x_1, …, x_k) and the set of feasible solutions is given by X_f ⊆ X, which is called the decision space with k decision variables. f(x) represents the objective functions that map the solution x to the respective objective values o = (o_1, …, o_n) of the objective space O. There is typically not a single solution that optimises all objectives. Rather, a set of solutions exists that presents a trade-off between the different objectives: the objectives cannot be further improved without degrading at least one of the other objectives. These solutions dominate the other feasible ones and are called non-dominated solutions or Pareto points. More formally, let two solutions x_a and x_b exhibit the objective values o_a and o_b, respectively. The solution x_a is said to dominate x_b:

x_a ≺ x_b iff o_a,i ≤ o_b,i ∀ i ∈ {1, …, n} and ∃ j ∈ {1, …, n}: o_a,j < o_b,j   (1)

This means that for a solution x_a to Pareto-dominate a solution x_b, x_a must be better than or equal to x_b in all objectives, while being strictly better in at least one objective. This implies that x_a is superior to x_b if o_a dominates o_b. All non-dominated solutions are considered optimal solutions of an MOOP; together they are known as the Pareto front.
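The dominance relation of Eq. (1) translates directly into code. The following self-contained Python sketch checks dominance between two objective vectors and filters a set of candidate mappings down to its non-dominated solutions; the example data is made up for illustration.

```python
from typing import List, Sequence

def dominates(oa: Sequence[float], ob: Sequence[float]) -> bool:
    """Eq. (1): oa dominates ob if it is no worse in every objective
    and strictly better in at least one (minimisation)."""
    return all(a <= b for a, b in zip(oa, ob)) and any(a < b for a, b in zip(oa, ob))

def non_dominated(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the Pareto points of a set of objective vectors."""
    front = []
    for p in points:
        if not any(dominates(q, p) for q in points if q is not p):
            front.append(p)
    return front

# Objective vectors (t_exec in s, P_exec in W) of four hypothetical mappings.
candidates = [(1.0, 3.0), (1.2, 2.0), (1.1, 2.5), (1.3, 2.6)]
print(non_dominated(candidates))  # (1.3, 2.6) is dominated and therefore dropped
```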


4.1. Problem definition

For the MOOP in this work, the decision space consists of k = 2 parameters, namely the task mapping and the platform power configuration. The former is based on a static mapping, i.e., no task migration during run time. Evaluating the objective functions is equivalent to simulating a task mapping and platform power configuration, i.e. executing the TRM. The objective space has n = 2 values: the execution time texec of the application and the average power, peak power or energy consumption. In the rest of this paper, only the average power Pexec is discussed, but it can be replaced by the others with little effort. Necessary inputs are the processes Z and the available PEs Q, given by the application and platform model, respectively. The latter defines the platform power configurations K, which essentially comprise the set of possible frequencies F(q), the permitted minimum voltage V_f,q for a selected frequency f, the switching capacitance C_q and the leakage current I_q of each q ∈ Q. The utilisation value u_q ∈ [0, 1] is the fraction of the entire execution time q has been active and thus consumed dynamic power. The TRM computes Pexec according to Eq. (2c), with Ps_f,q and Pd_f,q denoting the leakage and dynamic power, respectively.

Ps_f,q = I_q · V_f,q   (2a)

Pd_f,q = C_q · f_q · V_f,q² · u_q   (2b)

Pexec = ∑_{q∈Q} (Pd_f,q + Ps_f,q)   (2c)

The execution time texec is calculated as shown in Eq. (3), where tcycle_z,q denotes the number of cycles used by process z when scheduled on q, and tschedule_z,q takes inter-process dependencies and concurrencies into account. The TRM, which computes texec, is aware of the latency incurred by context switches and FIFO data communication. This includes memory and interconnect delays as well as congestion. M_z,q = 1 indicates that process z is mapped to q.

texec = ∑_{q∈Q} ∑_{z∈Z} ((tcycle_z,q + tschedule_z,q) / f_q) · M_z,q   (3)

The resulting minimisation problem is given in Eq. (4), where (4b) and (4c) define that each process is mapped to exactly one PE. This is required, as we solve for a static mapping with no task migration during run time.

min f = (texec, Pexec)   (4a)
s.t. ∑_{q∈Q} M_z,q = 1, ∀ z ∈ Z   (4b)
M_z,q ∈ {0, 1}, ∀ z ∈ Z, ∀ q ∈ Q   (4c)
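The following Python sketch evaluates the two objectives of Eq. (4) for a given binary mapping matrix, mirroring Eqs. (2) and (3). It is a simplification for illustration only: in the tool flow these values are produced by the TRM, which additionally models scheduling, FIFO communication and contention. All argument names and the example numbers are assumptions.

```python
import numpy as np

def evaluate_mapping(M, t_cycle, t_schedule, f, I, V, C, u):
    """Toy evaluation of the objectives of Eq. (4) for a binary mapping matrix M
    of shape (|Z|, |Q|). Follows Eqs. (2) and (3); arguments are illustrative
    NumPy arrays indexed per PE (or per process x PE), not an SLX API."""
    assert (M.sum(axis=1) == 1).all(), "constraint (4b): exactly one PE per process"
    # Eq. (3): cycles plus scheduling cycles of mapped processes, scaled by f_q.
    t_exec = (((t_cycle + t_schedule) / f) * M).sum()
    # Eq. (2): P_exec = sum_q (I_q*V_q + u_q*C_q*f_q*V_q^2)
    p_exec = (I * V + u * C * f * V ** 2).sum()
    return t_exec, p_exec

# Two processes on two PEs: process 0 -> PE 0, process 1 -> PE 1.
M = np.array([[1, 0], [0, 1]])
t_cycle = np.array([[4e6, 2e6], [6e6, 3e6]])   # cycles of each process on each PE
t_schedule = np.zeros_like(t_cycle)            # scheduling overhead neglected here
f = np.array([1.0e9, 2.0e9]); V = np.array([0.9, 1.1])
I = np.array([0.01, 0.02]); C = np.array([1e-9, 1.2e-9]); u = np.array([0.5, 0.7])
print(evaluate_mapping(M, t_cycle, t_schedule, f, I, V, C, u))
```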

4.2. Heuristic: TONPET

For a fast and efficient examination, TONPET finds a Pareto front approximation for the optimisation objectives execution time and average PE power consumption. The latter can be exchanged for PE peak power or PE energy consumption with little effort. Due to the vast search space, classification qualifiers are utilised to prune it independently of the application. Further, invoking the TRM is time-consuming and application dependent, and thus forms the major bottleneck. With a reasonably minimal exploration space, the final Pareto front is generated. The design goal of the heuristic is to require a minimum number of TRM calls. Algorithm 1 shows the pseudo code of the top-level function. Inputs are the PEs Q, the processes Z, and all platform power configurations K per PE, the latter sorted in ascending order by frequency. Optional fixed mapping and latency constraints are defined by M̂_ẑ,q̂ and 𝜏, respectively. Outputs are all objective values of the Pareto front, stored in Pexec and texec, the corresponding platform power configuration c, as well as the process mapping M_z,q. The following comprises the individual steps of TONPET.

Algorithm 1. Heuristic TONPET.
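As a minimal sketch of the top-level flow just described — pre-pruning, classification and pruning, and Pareto front calculation — the following Python skeleton shows how the three phases fit together. The callables and their signatures are placeholders for illustration; they do not reproduce the published pseudo code or the SLX implementation.

```python
import random

def tonpet(pes, processes, configs, n_max, classify_prune, calc_pareto_front):
    """High-level sketch of the TONPET flow (Algorithm 1 is not reproduced here).
    The individual phases are passed in as callables; their signatures are our
    assumption, not the published pseudo code."""
    # Pre-pruning (Section 4.2.1): uniform random subset if the configuration
    # space exceeds the user-defined limit N.
    if len(configs) > n_max:
        configs = random.sample(list(configs), n_max)
    # Classification and pruning (Section 4.2.2): keep configurations that are
    # non-dominated with respect to the TNP and ETI qualifiers.
    pareto_class_configs = classify_prune(configs, pes, processes)
    # Pareto front calculation (Section 4.2.3): evaluate the selected
    # configurations with loadBalancer and the TRM, keep non-dominated results.
    return calc_pareto_front(pareto_class_configs, pes, processes)
```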

4.2.1. Pre-pruning

The pre-pruning of the mapping search space is performed in case the number of platform power configurations |K| is larger than a user-defined number N. As shown in lines 2–4 of Algorithm 1, a uniformly distributed random process selects N platform configurations. This reduces the run time of the classification and pruning to an acceptable length, while the uniform distribution ensures that a representative set of all possible configurations is selected. As further discussed in Section 6, this pre-pruning step is intended for next-generation MPSoCs, such as the McVP.

Algorithm 2. Classify and prune configurations.

4.2.2. Classification and pruning

The classification and pruning of platform power configurations are done within ClassifyPruneConfigs(), presented in Algorithm 2. This phase characterises each frequency setting of the platform and application quickly and independently of the mapping. To this end, the characterisation is based on two qualifiers, namely the Total Nominal Power (TNP) and the Execution Time Indicator (ETI). The TNP value is computed according to Eq. (2c), with u_q = 1, ∀ q ∈ Q; the required f_q for q ∈ Q is stored in c ∈ K. The ETI is calculated without considering inter-process dependencies and concurrencies, i.e., it is only based on tcycle_z,q. This assumption means that a process has all input tokens and is ready to run. The ETI is determined by ETI = ∑_{q∈Qtype} ∑_{z∈Z} tcycle_z,q / f_q, where Qtype ⊆ Q denotes the set of PE types. Algorithm 2 shows the computation of the ETI and TNP qualifiers for every c ∈ K in lines 2–4. The pruning procedure starts in line 5. A configuration is considered sub-optimal when other configurations with better TNP and ETI are available.

Fig. 2. ETI and TNP classifier values of platform power configurations after pruning for JPEG and ODROID-XU3.

Both qualifiers give a notion of whether c ∈ K is a potential candidate for the final Pareto front, treating them as a preliminary objective space. This can be seen by comparing the ETI and TNP of the non-dominated platform configurations shown in Fig. 2 with the paretoClassConfigs in Fig. 3: the latter already forms a good approximation. To further save run time of TONPET, not every entry of paretoClassConfigs is analysed during the subsequent steps. Sorted in ascending order by ETI, an equally distributed selection is chosen. On the one hand, a fixed number would not be sufficient, since |paretoClassConfigs| ≠ const. On the other hand, selecting a linearly dependent fraction has little effect on the subsequent run time. Consequently, only every

step = ⌊log2(|paretoClassConfigs|)⌋

entry of paretoClassConfigs is analysed afterwards (lines 6–8). This results in the set selectedConfigs with the size

|selectedConfigs| = ⌈|paretoClassConfigs| / ⌊log2(|paretoClassConfigs|)⌋⌉

Choosing a log2-based step size causes an efficient reduction of the solution space, while providing a trade-off between a fixed number of selected Pareto-classified configurations and a constant step size.
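A possible re-implementation of the classification and pruning phase is sketched below in Python: it computes the TNP and ETI qualifiers per configuration, discards configurations dominated in both qualifiers, and applies the log2-based selection. Data structures and names are our assumptions, not the ClassifyPruneConfigs() code.

```python
from math import floor, log2

def classify_prune_configs(configs, t_cycle, pe_types):
    """Illustrative TNP/ETI classification as described above (not the SLX code).
    Each config maps PE name -> (frequency, voltage, capacitance, leakage current);
    t_cycle maps (process, PE type) -> cycle count; pe_types maps PE name -> type."""
    processes = {z for (z, _) in t_cycle}
    qualified = []
    for cfg in configs:
        # TNP: Eq. (2c) with u_q = 1 for every PE.
        tnp = sum(i * v + c * f * v ** 2 for (f, v, c, i) in cfg.values())
        # ETI: pure compute time, ignoring dependencies and concurrency.
        eti = sum(t_cycle[(z, pe_types[q])] / cfg[q][0] for q in cfg for z in processes)
        qualified.append((tnp, eti, cfg))
    # Pruning: drop configurations dominated in both TNP and ETI.
    pareto = [a for a in qualified
              if not any(b[0] <= a[0] and b[1] <= a[1] and (b[0] < a[0] or b[1] < a[1])
                         for b in qualified)]
    # Keep only every floor(log2(n))-th entry, sorted in ascending order by ETI.
    pareto.sort(key=lambda entry: entry[1])
    step = max(1, floor(log2(max(2, len(pareto)))))
    return pareto[::step]
```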

4.2.3. Pareto front

The calculation of the final Pareto front starts with the analysis of the remaining selectedConfigs, shown in Algorithm 3. First, a process mapping is generated using loadBalancer. Second, the objectives Pexec and texec are computed (lines 2–5). The load-balancing heuristic loadBalancer is designed to achieve an equal workload for each PE, as described in Ref. [11]. In the presence of communication dependencies, balancing could introduce a high communication overhead. However, according to the thorough evaluation presented in Ref. [15], loadBalancer provides a reasonable trade-off between an optimal texec and algorithm run time. Further, a balanced load gives a fair trade-off between Pexec and texec. This is in line with the design goal of TONPET, which is to find a Pareto front approximation fast and efficiently.

Algorithm 3. Calculation of Pareto front.

Fig. 3. Intermediate steps of the Pareto front calculation with TONPET for JPEG and ODROID-XU3.

The entries of selectedConfigs which yield the best Pareto front approximation are determined by storing just the non-dominated ones in the set paretoSelectedConfigs (line 6). Fig. 3 exemplifies this. It is worth mentioning that the objective values of paretoClassConfigs are never computed; they are shown in this figure for demonstration purposes only. Each entry of paretoSelectedConfigs is then evaluated by taking the frequency domain structure into account. The previous step of invoking loadBalancer in line 3 maps the processes to all PEs of the platform. However, another power-saving effect has to be explored. Mapping to a few PEs potentially increases execution time but saves power, as idling PEs can be put into a low-power mode; if all PEs within one voltage or frequency domain are idle, an even lower power mode is available. On the other hand, distributing the processes among many processors offers better performance but higher power consumption. The smallest granularity exploiting this is the frequency domain. loadBalancer therefore has to find mappings multiple times, starting with a mapping for only the frequency domain with the lowest TNP, then two domains, and so on until all frequency domains containing all PEs are considered. While state-of-the-art MPSoCs are explored quite fast, future platforms offer further potential for saving run time (lines 8–13). With an increasing number of frequency domains, adding just a few PEs rarely has an impact on Pexec and texec above a certain number of PEs. This behaviour is covered by Amdahl's law, which describes the theoretical speed-up of parallel applications when increasing the number of PEs. Approximating it requires the most samples in the beginning, where the steepest increase occurs, and fewer samples as soon as the speed-up levels out. At first glance, a log2-based step size would be the ideal candidate. However, the number of frequency domains is increased within TONPET, hence several PEs are added per step. To avoid under-sampling, the Fibonacci series has been chosen, as it grows more slowly than the power-of-two steps implied by log2, but still fast enough to approximate the aforementioned effects. The remaining lines 14–17 set the platform power configuration under evaluation, and invoke loadBalancer and the TRM to obtain Pexec and texec. In the end, the set paretoEvalConfigs stores just the non-dominated results, i.e. the final Pareto front. Fig. 4 exemplifies this by showing all results that have been evaluated and the points that form the Pareto front approximation.
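The frequency-domain exploration of this phase can be sketched as follows, assuming placeholder load_balancer and trm callables that return a mapping and the objective pair (texec, Pexec), respectively. The Fibonacci-based sampling of the number of active frequency domains follows the rationale given above.

```python
def fibonacci_domain_counts(n_domains):
    """Numbers of frequency domains to evaluate: the Fibonacci series 1, 2, 3, 5, 8, ...
    capped at the total number of domains, as motivated by Amdahl's law in the text."""
    counts, a, b = [], 1, 2
    while a < n_domains:
        counts.append(a)
        a, b = b, a + b
    counts.append(n_domains)  # always evaluate the full platform as well
    return counts

def calc_pareto_front(selected_configs, domains_by_tnp, load_balancer, trm):
    """Illustrative sketch of the final phase (Algorithm 3 is not reproduced):
    map onto an increasing number of frequency domains, evaluate with the TRM,
    and keep the non-dominated results. load_balancer and trm are placeholders."""
    results = []
    for cfg in selected_configs:
        for i in fibonacci_domain_counts(len(domains_by_tnp)):
            mapping = load_balancer(cfg, domains_by_tnp[:i])   # cheapest i domains first
            results.append((trm(mapping, cfg), mapping, cfg))  # trm -> (t_exec, P_exec)
    # Keep only non-dominated (t_exec, P_exec) pairs, i.e. the Pareto front approximation.
    front = [r for r in results
             if not any(o[0][0] <= r[0][0] and o[0][1] <= r[0][1] and o[0] != r[0]
                        for o in results)]
    return front
```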

Fig. 4. Final steps of the Pareto front calculation with TONPET for JPEG and ODROID-XU3.

4.3. Hypervolume indicator

The literature established the HI to capture diversity and dominance of a Pareto front in one single value [38]. A non-dominated front is better if its solutions are well distributed over the objective space and cover a wider range of each objective function value. Although defined for n dimensions, the HI is an area coverage indicator when used with two objective functions. An optimally selected reference point and the resulting reference plane are visualised by the light-coloured area in Fig. 5. The actual coverage of the set is the area towards the reference point, calculated as illustrated by the dark-coloured area. The HI is the ratio of the covered area to the reference area. The left plot can be understood as a most diverse and non-dominated set with an HI of 51.8% (a). Without the middle (b) or the left-most value (c), the diversity and the dominance are reduced, respectively. This is reflected in both cases by the HI, with 46.4% and 50.9%. One interesting property of the HI is that it is strictly monotonic: given two sets of solutions a and b, where a strictly dominates b, the HI of a, I_HI(a), must always be greater than I_HI(b) [7,13,38]. However, there are two main drawbacks in using the HI. Firstly, it is computationally demanding, particularly as the number of objectives increases; this becomes an issue when the HI is used as a fitness function. Secondly, there is a bias in its output w.r.t. regions that have a knee-shaped form compared to other regions. Further details can be found in Ref. [9].

Fig. 5. Hypervolume Indicator examples.
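For two objectives, the HI reduces to the area computation described above. The sketch below computes it for a minimisation problem with a sweep over the points sorted by the first objective; normalising by the rectangle spanned by the reference point and the origin is one possible choice and an assumption on our side.

```python
def hypervolume_2d(front, ref):
    """Area dominated by a 2-D front (minimisation) up to the reference point,
    normalised by the reference rectangle. Assumes every point is <= ref in
    both objectives."""
    pts = sorted(set(front))          # ascending in the first objective
    area, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y < prev_y:                # dominated points contribute nothing
            area += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return area / (ref[0] * ref[1])

# Example with a reference point chosen as the per-objective maxima of the fronts.
front_a = [(1.0, 3.0), (1.2, 2.0)]
print(round(hypervolume_2d(front_a, ref=(2.0, 4.0)), 3))
```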

4.4. R2-EMOA

The R2 indicator is inspired by the set of indicators dubbed R, presented in Ref. [20]. The R2 indicator adopts a utility function, which is faster to compute and does not suffer from the HI's bias. There are many R2-indicator-based algorithms, such as those in Refs. [9,39]. The general form of R2 can be formalised as follows [9]:

R2(K, S, U, p) = ∫_{u∈U} max_{k∈K} {u(k)} p(u) du − ∫_{u∈U} max_{s∈S} {u(s)} p(u) du   (5)

where K is a reference set (reference Pareto front), S is the solution set, U is a set of utility functions and p is the probability distribution over the utility functions. In our implementation, the probability distribution is uniform, thus Eq. (5) can be simplified as follows:

R2(K, S, U) = (1/|U|) ∑_{u∈U} [ max_{k∈K} {u(k)} − max_{s∈S} {u(s)} ]   (6)

This can be simplified further if the reference set is defined to be constant, as proposed in Ref. [10]:

R2(S, U) = (−1/|U|) ∑_{u∈U} max_{s∈S} {u(s)}   (7)

In our implementation, we chose the weighted Chebyshev function as utility function. However, we adopt its normalised form in order to handle objective functions that are measured in different units, which is to be expected in our mapping optimisation problem. Thus, our utility function is defined as follows:

U_w(s) = − max_{i∈{1,…,m}} w_i · |s_i − z*_i| / (z_nad,i − z*_i)   (8)

where w is the Chebyshev weight vector, z*_i is the lower bound of each objective over the entire Pareto set (ideal vector), z_nad,i is the corresponding upper bound (nadir vector), and m is the number of objectives. Furthermore, we follow the proposal of [17] of using a rank-based system over the output of the utility function. In other words, we group the solutions based on their performance on the utility function and rank them based on which Pareto set they are located in. Regarding population size and number of iterations, we defined two computation budgets for the R2-EMOA. The first one is dubbed constrained. This version of the R2-EMOA is intended to be a compromise between computational effort, i.e. algorithm run time, and solution quality. The idea is to get closer to a run time similar to that of a heuristic, while still utilising the R2-EMOA. For comparison purposes, the computation budget is set by statistics of TONPET as follows: the number of platform power configurations |K| multiplied by the number of frequency domains, plus the number of TRM calls, forms the number of evaluations, i.e., |K| × |frequencyDomains| + |TRMcalls|. Consequently, the population size adopted in the R2-EMOA is set to be equal to the number of Pareto points that TONPET determined as its PO set approximation. These decisions are justified by the fact that we wanted to end up with the same number of solutions in the final stage of the algorithm while executing a similar number of evaluations. The second budget is dubbed unconstrained, where we gave the R2-EMOA a relatively high budget: we chose a reasonable number of 6000 evaluations with a population size of 50. This facilitated studying the performance changes of the R2-EMOA when given a much bigger computational budget compared to the constrained one.
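A compact Python sketch of the unary R2 value of Eq. (7) with the normalised Chebyshev utility of Eq. (8) is given below. The weight vectors, the ideal and nadir points and the example solutions are illustrative; inside the R2-EMOA the indicator is additionally combined with the rank-based scheme of [17], which is not reproduced here.

```python
import numpy as np

def chebyshev_utility(s, w, z_ideal, z_nadir):
    """Normalised weighted Chebyshev utility of Eq. (8); s, z_ideal and z_nadir
    are objective vectors, w is a weight vector."""
    return -np.max(w * np.abs((s - z_ideal) / (z_nadir - z_ideal)))

def r2_indicator(solutions, weights, z_ideal, z_nadir):
    """Unary R2 value of Eq. (7): negative average, over all weight vectors,
    of the best utility achieved by the solution set; lower is better."""
    best = [max(chebyshev_utility(s, w, z_ideal, z_nadir) for s in solutions)
            for w in weights]
    return -np.mean(best)

# Assumed data: three solutions and five uniformly spread 2-D weight vectors.
sols = [np.array(p) for p in [(1.0, 3.0), (1.2, 2.0), (1.6, 1.5)]]
weights = [np.array([lam, 1.0 - lam]) for lam in np.linspace(0.0, 1.0, 5)]
print(r2_indicator(sols, weights, z_ideal=np.array([1.0, 1.5]),
                   z_nadir=np.array([2.0, 4.0])))
```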

5. Experimental results

Three case studies have been carried out to assess the quality and the impact of the classification and pruning phase of TONPET. The speed-up of TONPET is contrasted with the HI, which compresses diversity and dominance of the produced Pareto front into one value. Moreover, the TONPET performance is compared to the R2-EMOA. To round off the evaluation, an R2-EMOA performance analysis is given. A set of representative parallel applications has been chosen from different domains [6]; the number of processes is given in brackets: audio filter (11), JPEG encoder (24), multiple-input multiple-output orthogonal frequency-division multiplexing transceiver MIMO OFDM (36), space-time adaptive processing STAP (16), and sobel filter (5). In-house implementations of an LTE uplink receiver physical layer benchmark LTE (19) [32] and a Mandelbrot set computation with 16 (Man16) and 150 (Man150) worker processes round off the benchmark set.

5.1. Evaluation platforms

The ODROID-XU3 [1] serves as evaluation platform for the first case study, deploying the Samsung Exynos-5422 processor with ARM's big.LITTLE architecture. The frequency ranges from 200 MHz to 1400 MHz for the little cluster (four ARM Cortex-A7 PEs) and up to 2000 MHz for the big cluster (four ARM Cortex-A15 PEs), in steps of 100 MHz per cluster. This

adds up to |K| = 247 different frequency combinations. Unused PEs are put into the idle power state. The operating system also sets the lowest possible voltage. As the second platform, a model of the Keystone II 66AK2H [4] is utilised. All PEs operate at a fixed voltage of 1 V. The C66x DSP cluster allows two different frequencies, 800 MHz and 1200 MHz. The ARM Cortex-A15 cluster can be set from 200 MHz to 1400 MHz with a step size of 100 MHz. As a result, this forms the smallest search space with |K| = 26. Targeting the largest search space, the third case study is realised with an in-house SystemC [3] heterogeneous McVP. Deploying a hierarchical structure, subsystems consist of a local bus, memory and either an ADSP Blackfin 609 DSP or an ARM Cortex-A9. Combining four ARM and four Blackfin subsystems, a cluster additionally includes a bus and a memory. Four clusters and a shared memory are connected via a global bus. In total, the McVP includes 32 PEs. Two subsystems of the same PE type are grouped into a frequency domain, while four share the same voltage domain. The frequency ranges from 200 MHz to 1200 MHz for the ARMs, and from 100 MHz to 500 MHz for the Blackfins. For both, the step size is 100 MHz. The lowest applicable voltage per voltage domain is automatically set by the bare-metal runtime environment. All in all, the McVP has |K| = 3.5 · 10^9 different frequency combinations.

5.2. Case studies

The case studies are intended to (i) assess the quality and performance of the heuristic; (ii) determine the consistency and performance of the R2-EMOA; and (iii) identify the actual power-performance trade-off by contemplating the Pareto fronts. In order to conduct (i), the output of TONPET is contrasted with the Pareto front computed with the R2-EMOA. The HI of both fronts is calculated using the same reference point. It is worth mentioning that the reference point is computed per benchmark and platform; the maximum values of each dimension, resulting from both TONPET and the R2-EMOA, are used. The difference of these HI values gives a fair result about the accuracy of the Pareto approximation. Speed-up values are obtained by comparing both run times.

Table 1
Constrained R2-EMOA configuration for ODROID-XU3.

Benchmark | Population | Eval. | TRM (TONPET)
audio filter | 10 | 510 | 16
JPEG | 8 | 508 | 14
LTE | 10 | 509 | 15
Man150 | 5 | 505 | 11
Man16 | 5 | 505 | 11
MIMO OFDM | 7 | 510 | 16
sobel filter | 9 | 509 | 15
STAP | 9 | 509 | 15

Fig. 6. Pareto front comparison for ODROID-XU3.

5.2.1. Case study: ODROID-XU3

In this case study, the number of platform power configurations is |K| = 247 and N is considered larger than |K|. Consequently, the pre-pruning phase is not required. For the example of JPEG, the classification and pruning phase determines that |paretoClassConfigs| = 31. The step size of ⌊log2(|paretoClassConfigs|)⌋ = ⌊log2(31)⌋ = 4 causes |selectedConfigs| = 8 configurations to be selected for further analysis. Up to this point, TONPET ran for 518 ms. After invoking loadBalancer, 6 configurations are classified as non-dominated and evaluated allowing one and two frequency domains. The Pareto front approximation contains |paretoEvalConfigs| = 8 values. The remaining run time of the heuristic for calculating the Pareto front approximation is 1.32 min, while the TRM is executed 14 times. In contrast, exploring all platform power configurations K and frequency domain combinations results in calling the TRM 494 times with a total run time of 34.4 min. Hence, the speed-up of TONPET resulting from the classification and pruning phase is 26×. Compared to the run time of the budget-constrained R2-EMOA, the heuristic is 150× faster. As stated in Section 4.4, the parameters of the budget-constrained R2-EMOA are based on the resource demand of TONPET and the underlying platform. With |K| = 247 and two frequency domains for the ODROID-XU3, and 14 TRM calls, the resulting number of evaluations is set to 508. The population size equals 8, which is the same as the resulting number of Pareto points computed by TONPET. All of those run time numbers are shown in Fig. 7. The configuration of the constrained R2-EMOA is given in Table 1. For a visual comparison of the computed Pareto fronts, Fig. 6 shows the PO sets of TONPET, constrained and unconstrained R2-EMOA. Due to the stochastic nature of the R2-EMOA, one result is selected randomly out of ten performed

repetitions. It is worth mentioning that this comes closest to a productive and realistic usage. Looking at the ClassifyPruneConfigs and CalcParetoFront run times, it becomes clear that the former is benchmark independent: its run time ranges between 300 ms and 600 ms. On the other hand, the Pareto front calculation run time exceeds this by two orders of magnitude. This strengthens the design goal of the heuristic to reduce the TRM calls to a minimum. In Table 2, an overview of the HI mean performance relative to the constrained and unconstrained R2-EMOA is given. TONPET calculates Pareto fronts that are less than 1% worse compared to the constrained R2-EMOA for half of the benchmarks. In the case of Man150, the heuristic even outperforms the unconstrained metaheuristic. In terms of speed-up, TONPET is faster than the constrained R2-EMOA: 80× in the worst case and 120× on average. Moreover, Table 2 presents the p-value according to the Wilcoxon signed-rank test [35]. This test reveals the statistical significance for a non-uniformly distributed set and a single population. This means that the resulting HI performance of TONPET relative to the R2-EMOA is statistically significant. The lower the p-value, the more probable it is that another execution of the stochastic R2-EMOA outputs a similar HI performance difference as given in Table 2.

Fig. 7. Run times of constrained R2-EMOA, Pareto front calculation without pruning, ClassifyPruneConfigs and CalcParetoFront for ODROID-XU3.

Table 2
TONPET HI performance relative to R2-EMOA.

Benchmark | ODROID-XU3 | McVP | Keystone II
audio filter | −0.4% (0.014) | −2.0% (0.002) | + (0.065)
JPEG | + (0.160) | −1.7% (0.002) | + (0.106)
LTE | −0.3% (0.084) | −6.4% (0.002) | −4.6% (0.002)
Man150 | ++ (0.106) | + (0.049) | ++ (0.193)
Man16 | + (0.002) | −1.9% (0.002) | ++ (0.193)
MIMO OFDM | + (0.131) | −1.5% (0.002) | + (0.106)
sobel filter | −1% (0.014) | −13.4% (0.002) | + (0.084)
STAP | −0.9% (0.037) | −0.7% (0.002) | −6.3% (0.037)

+: Better than budget-constrained EMOA. ++: Better than budget-unconstrained EMOA. (): p-value according to the Wilcoxon signed-rank test [35].
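The p-values of Table 2 can be reproduced in principle with the Wilcoxon signed-rank test as sketched below; pairing the ten stochastic R2-EMOA repetitions against the single deterministic TONPET result is our reading of the setup, and the numbers are made up.

```python
import numpy as np
from scipy.stats import wilcoxon

# HI values (in %) of ten stochastic R2-EMOA repetitions for one benchmark/platform,
# and the single deterministic TONPET result. All numbers are illustrative.
emoa_hi = np.array([51.1, 50.8, 51.4, 50.9, 51.2, 51.0, 50.7, 51.3, 51.1, 50.9])
tonpet_hi = 50.95

# One-sample Wilcoxon signed-rank test on the paired differences.
stat, p_value = wilcoxon(emoa_hi - tonpet_hi)
print(f"W={stat:.1f}, p={p_value:.3f}")
```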

5.2.2. Case study: Keystone II

N is considered larger than |K| = 26. The ClassifyPruneConfigs, CalcParetoFront, Pareto front calculation without pruning and constrained R2-EMOA run times are shown in Fig. 9. The former ranges from 10 ms to 80 ms. CalcParetoFront has its minimum of 373 ms when analysing the audio filter and its maximum of 37982 ms when evaluating the JPEG benchmark. This underlines the design goal of a benchmark-independent classification and pruning phase. The configuration of the constrained R2-EMOA is shown in Table 3. The resulting Pareto fronts are visualised in Fig. 8. Table 2 reveals that the heuristic computes Pareto fronts with an HI 4.6% and 6.3% worse than the constrained R2-EMOA for just two benchmarks. TONPET is 18× faster in the worst case and 30× faster on average. Considering the p-values presented in Table 2 reveals that the HI performance of TONPET relative to the R2-EMOA is statistically significant throughout the experiments.

Table 3
Constrained R2-EMOA configuration for Keystone II.

Benchmark | Population | Eval. | TRM (TONPET)
audio filter | 8 | 61 | 9
JPEG | 6 | 61 | 9
LTE | 9 | 62 | 10
Man150 | 9 | 62 | 10
Man16 | 9 | 62 | 10
MIMO OFDM | 10 | 62 | 10
sobel filter | 9 | 62 | 10
STAP | 9 | 62 | 10

5.2.3. Case study: McVP

The aforementioned case studies show a linear dependency of the ClassifyPruneConfigs run time on |K|. With |K| = 3.5 · 10^9, this would result in approximately 12 days for the McVP case study. To avoid such unacceptable computation times, N is set to 10^5 to enable pre-pruning. The large number of frequency domains can benefit from the speed-up capability of allowing only i frequency domains within the inner loop of CalcParetoFront, where i is a number of the Fibonacci series. The constrained R2-EMOA, ClassifyPruneConfigs and CalcParetoFront run times are shown in Fig. 10. The former dominates the entire run time of the heuristic, while CalcParetoFront takes at most half of this time. Future work could determine a more efficient way for this pre-pruning step. Nevertheless, TONPET is 88× faster in the worst case and 750× faster on average than the constrained R2-EMOA. The configuration of the constrained R2-EMOA is shown in Table 4. It is worth mentioning that, due to the high number of configurations, we limited the number of evaluations to 6000 but allowed a population size of 50. The resulting Pareto fronts are visualised in Fig. 11. Table 2 reveals that the heuristic computes Pareto fronts with an HI 3% worse than the constrained R2-EMOA on average. For Man150, TONPET produces a better Pareto front. Further, the presented p-values strengthen all HI performance results.

5.3. R2-EMOA performance analysis

Since one of the key differences between TONPET and the R2-EMOA is the stochastic nature of the latter's performance, we prepared a set of experiments dedicated solely to capturing that. Firstly, each optimisation problem, i.e., each benchmark on all three platforms, is repeated ten times and the HI percentage is calculated and used as performance metric. Furthermore, we have done this for both the constrained and the unconstrained R2-EMOA to observe the gain from giving more computational resources to the R2-EMOA. Finally, the upper and lower quartiles are highlighted in Figs. 12–14, indicating the performance consistency. In order to have a full statistical profile, Tables 5 and 6 present the variance of the HI for the constrained and unconstrained configuration, respectively. As stated in Section 5.2, the reference point

used in calculating the hypervolume space for each benchmark on each platform is chosen differently. This is because the solution space size varies. For example, if the reference point is set too far away, the relative gain in hypervolume of a more dominating Pareto front is not observed proportionally. Thus, changing the reference point to fit the problem enables an adequate capture of the performance difference. The downside of having a different reference point for each benchmark is that the HI performance values of different problems are not comparable.
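The two consistency measures used in this section — the quartile spread shown in Figs. 12–14 and the HI variance of Tables 5 and 6 — can be computed as in the following sketch; the repetition values are illustrative and the sample variance is one possible convention.

```python
import numpy as np

# HI percentages of ten repetitions of one R2-EMOA configuration (made-up numbers).
hi_runs = np.array([48.2, 51.0, 50.1, 49.7, 50.4, 47.9, 50.8, 49.5, 50.2, 50.6])

# Consistency as reported here: inter-quartile spread (upper minus lower quartile,
# cf. Figs. 12-14) and the variance of the HI (cf. Tables 5 and 6).
q1, q3 = np.percentile(hi_runs, [25, 75])
print(f"quartile spread = {q3 - q1:.2f} % points, variance = {hi_runs.var(ddof=1):.2f}")
```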

Fig. 8. Pareto front comparison for Keystone II.

Fig. 9. Run times of constrained R2-EMOA, Pareto front calculation without pruning, ClassifyPruneConfigs and CalcParetoFront for Keystone II.

Table 4
Constrained R2-EMOA configuration for McVP.

Benchmark | Population | Eval. | TRM (TONPET)
audio filter | 50 | 6000 | 50
JPEG | 50 | 6000 | 39
LTE | 50 | 6000 | 50
Man150 | 50 | 6000 | 40
Man16 | 50 | 6000 | 56
MIMO OFDM | 50 | 6000 | 58
sobel filter | 50 | 6000 | 31
STAP | 50 | 6000 | 10

Fig. 10. Run times of constrained R2-EMOA, ClassifyPruneConfigs and CalcParetoFront for the Virtual Platform.

6. Discussion

The experimental results given for TONPET reveal a convincing applicability. Even next-generation platforms, such as the McVP, can be addressed by the heuristic. This is enabled by the pre-pruning step discussed in Section 4.2.1. Without it, the large number of configurations |K| would result in extreme run times of the heuristic. The random selection of N platform configurations allows for an acceptable termination time. Investigating the run time spent in ClassifyPruneConfigs and CalcParetoFront reveals further potential for improvement. Fig. 10 shows that the former takes half of the entire run time in the best case. A possible solution is the work proposed in Ref. [16]. Applying the mathematical theory of groups, the authors achieve speed-ups of up to 10× on current platforms, such as the Keystone II, by exploiting architectural symmetry. Many of those platform configurations have a similar effect due to the hierarchical and symmetric structure of the McVP. It is worth mentioning that this inherent symmetry of the architecture can be found in almost all MPSoCs. Adopting the group-theory approach of [16] for the pre-pruning step could result in even faster run times, while the final PO set quality would be improved. Looking at the HI performance in Figs. 12 and 13, one general observation is that the unconstrained R2-EMOA outperforms the constrained R2-EMOA and is more consistent. This is expected behaviour, given that the unconstrained R2-EMOA uses more computational power. However, the gain in performance is not always significant and depends on both the platform and the benchmark. For example, for Man150 executed on the ODROID-XU3, the performance difference between constrained and unconstrained R2-EMOA is 17% (difference of the

median). For the same benchmark tested on the Keystone II, a lower difference of 3.4% is shown. Moreover, the performance consistency of the constrained EMOA (difference between upper and lower quartiles) showed a value of 21.4% in the worst case and 0.075% in the best case. Taking all these findings into consideration, one conclusion is that there is no global solution which fits all benchmarks on different platforms. TONPET offers performance very close to the R2-EMOA for problems with a smaller solution space at a lower run time, but fails to deliver that for benchmarks with a vast solution space. On the other hand, the R2-EMOA can suffer from inconsistencies in its performance. Furthermore, giving more computational power to the R2-EMOA does not always manifest in a proportional performance gain.

Fig. 11. Pareto front comparison for the Virtual Platform.

Fig. 12. HI comparison of constrained and unconstrained R2-EMOA for ODROID-XU3.

Fig. 13. HI comparison of constrained and unconstrained R2-EMOA for Keystone II.

Table 5
Constrained R2-EMOA HI variance.

Benchmark | ODROID-XU3 | McVP | Keystone II
audio filter | 0.30 | 0.017 | 4.66
JPEG | 14.41 | 0.210 | 6.04
LTE | 8.65 | 0.002 | 3.96
Man150 | 49.17 | 0.877 | 17.86
Man16 | 3.76 | 0.029 | 9.65
MIMO OFDM | 21.14 | 0.018 | 3.00
sobel filter | 4.23 | 0.033 | 5.64
STAP | 2.43 | 0.014 | 5.05

Table 6
Unconstrained R2-EMOA HI variance.

Benchmark | ODROID-XU3 | McVP | Keystone II
audio filter | 0.114 | – | 0.175
JPEG | 0.136 | – | 0.348
LTE | 0.111 | – | 2.021
Man150 | 1.100 | – | 7.775
Man16 | 0.200 | – | 2.827
MIMO OFDM | 0.790 | – | 2.748
sobel filter | 0.001 | – | 0.307
STAP | 0.075 | – | 0.134

Fig. 14. HI overview of constrained R2-EMOA for the Virtual Platform.

7. Conclusion and future work

In this work, we proposed the population-based heuristic TONPET and a state-of-the-art R2-EMOA, which optimise the objectives power and performance and calculate the PO set. The quality, performance and consistency of both approaches were evaluated and compared, together with an applicability analysis. The experimental environment consisted of three different case studies, including a selection of representative benchmarks. TONPET and the R2-EMOA were integrated into the SLX tool suite for a thorough test setup. TONPET performed 80× and 18× faster in the worst case on the ODROID-XU3 and on the Keystone II, respectively. Further, its average HI is 4.7% better compared to the constrained R2-EMOA. Moreover, TONPET is applicable in highly complex search spaces, as shown by the McVP case study with over 16 different frequency domains, 32 PEs and 3.5 billion configurations. The worst-case speed-up was 88×, while the HI was only 3% lower on average. The consistency of the R2-EMOA's performance has been studied thoroughly by repeating each experiment ten times. Generally, the unconstrained R2-EMOA showed a higher consistency, with a variance of 7.8% in the worst case. This is acceptable, especially when considering the gain in the performance difference (median HI difference) for benchmarks with a vast solution space. On the other hand, the constrained R2-EMOA presented a worst-case consistency of 49.2% for the ODROID-XU3, 0.8% for the McVP and 17.9% for the Keystone II.

For future work, we recommend the development of a hybrid approach where both TONPET and the R2-EMOA are available and can be adopted based on the characteristics of the problem solution space. Consequently, this may require adding a further classification process before assigning the problem to either of these algorithms. This can be done by adding an exploration step capable of capturing the solution space properties, then classifying it and consequently adopting the adequate algorithm. Furthermore, this classification step can be linked to the initialisation phase of the chosen algorithm. With this, the overall run time is not affected significantly.

Acknowledgement

This work is funded as part of the CONFIRM project (16ES0570) within the research program ICT 2020 by the German Federal Ministry of Education and Research (BMBF) and supported by Infineon Technologies AG, Robert Bosch GmbH, Intel Deutschland GmbH, and Mentor Graphics GmbH.

References

[1] ODROID-XU3. [Online] http://odroid.com/dokuwiki/doku.php?iden:odroid-xu3 (accessed 06/2019).
[2] Silexica GmbH. [Online] http://silexica.com (accessed 06/2019).
[3] SystemC. [Online] http://www.accellera.org/downloads/standards/systemc (accessed 06/2019).
[4] Texas Instruments Literature: SPRS866: 66AK2H12/06 Multicore DSP+ARM KeyStone II System-on-Chip (SoC). [Online] http://www.ti.com/product/66AK2H12 (accessed 02/2018).
[5] A. Aalsaud, R. Shafik, A. Rafiev, F. Xia, S. Yang, A. Yakovlev, Power-aware performance adaptation of concurrent applications in heterogeneous many-core systems, in: Proceedings of the 2016 International Symposium on Low Power Electronics and Design, ISLPED '16, ACM, New York, NY, USA, 2016, pp. 368–373.
[6] M. Aguilar, R. Jimenez, R. Leupers, G. Ascheid, Improving performance and productivity for software development on TI multicore DSP platforms, in: EDERC, Sept 2014.
[7] N. Beume, C.M. Fonseca, M. López-Ibáñez, L. Paquete, J. Vahrenhold, On the complexity of computing the hypervolume indicator, IEEE Trans. Evol. Comput. 13 (5) (2009).
[8] S.L. Beux, G. Nicolescu, G. Bois, Y. Bouchebaba, M. Langevin, P. Paulin, Optimizing configuration and application mapping for MPSoC architectures, in: NASA/ESA Conference on Adaptive Hardware and Systems, July 2009.
[9] D. Brockhoff, T. Wagner, H. Trautmann, On the properties of the R2 indicator, in: Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation, ACM, 2012.
[10] D. Brockhoff, T. Wagner, H. Trautmann, R2 indicator-based multiobjective search, Evol. Comput. 23 (3) (2015) 369–395.
[11] J. Castrillon, R. Leupers, G. Ascheid, MAPS: mapping concurrent dataflow applications to heterogeneous MPSoCs, in: IEEE Transactions on Industrial Informatics, Feb 2013.
[12] K. Deb, A. Pratap, S. Agarwal, T. Meyarivan, A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. 6 (2) (2002) 182–197.
[13] M. Emmerich, N. Beume, B. Naujoks, An EMO algorithm using the hypervolume measure as selection criterion, in: International Conference on Evolutionary Multi-Criterion Optimization, 2005.
[14] J.F. Eusse, C. Williams, L.G. Murillo, R. Leupers, G. Ascheid, Pre-architectural performance estimation for ASIP design based on abstract processor models, in: SAMOS, 2014.
[15] A. Goens, R. Khasanov, J. Castrillon, S. Polstra, A. Pimentel, Why comparing system-level MPSoC mapping approaches is difficult: a case study, in: Proceedings of MCSoC-16, Sept. 2016.
[16] A. Goens, S. Siccha, J. Castrillon, Symmetry in software synthesis, ACM Transactions on Architecture and Code Optimization (TACO), July 2017.
[17] R.H. Gómez, C.A.C. Coello, MOMBI: a new metaheuristic for many-objective optimization based on the R2 indicator, in: 2013 IEEE Congress on Evolutionary Computation, 2013.
[18] U. Gupta, S.K. Mandal, M. Mao, C. Chakrabarti, U.Y. Ogras, A deep Q-learning approach for dynamic management of heterogeneous processors, IEEE Comput. Archit. Lett. 18 (1) (Jan 2019) 14–17.
[19] U. Gupta, C.A. Patil, G. Bhat, P. Mishra, U.Y. Ogras, DyPO: dynamic Pareto-optimal configuration selection for heterogeneous MPSoCs, ACM Trans. Embed. Comput. Syst. 16 (5s) (Sept. 2017) 123:1–123:20.
[20] M.P. Hansen, A. Jaszkiewicz, Evaluating the Quality of Approximations to the Non-dominated Set, IMM, Department of Mathematical Modelling, Technical University of Denmark, 1994.
[21] O. Holzkamp, Memory-aware Mapping Strategies for Heterogeneous MPSoC Systems, PhD thesis, Technical University of Dortmund, Germany, 2017.
[22] G. Kahn, The semantics of a simple language for parallel programming, in: Proceedings of Information Processing, Stockholm, Sweden, Aug 1974.
[23] S.H. Kang, H. Yang, L. Schor, I. Bacivarov, S. Ha, L. Thiele, Multi-objective mapping optimization via problem decomposition for many-core systems, in: IEEE 10th Symposium on Embedded Systems for Real-Time Multimedia, Oct 2012.
[24] M. Kim, T. Hiroyasu, M. Miki, S. Watanabe, SPEA2: improving the performance of the strength Pareto evolutionary algorithm 2, in: International Conference on Parallel Problem Solving from Nature, Springer, 2004, pp. 742–751.
[25] G. Liu, J. Park, D. Marculescu, Procrustes1: power constrained performance improvement using extended maximize-then-swap algorithm, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 34 (10) (Oct 2015) 1664–1676.
[26] T. Murata, H. Ishibuchi, MOGA: multi-objective genetic algorithms, in: Evolutionary Computation, 1995, IEEE International Conference on, vol. 1, IEEE, 1995, p. 289.
[27] G. Onnebrink, A. Hallawa, R. Leupers, G. Ascheid, A.-U.-D. Shaheen, A heuristic for multi objective software application mappings on heterogeneous MPSoCs, in: Proceedings of ASPDAC, 2019.
[28] G. Onnebrink, F. Walbroel, J. Klimt, R. Leupers, G. Ascheid, L.G. Murillo, S. Schürmans, X. Chen, Y. Harn, DVFS-enabled power-performance trade-off in MPSoC SW application mapping, in: SAMOS, July 2017.
[29] A. Schranzhofer, J.J. Chen, L. Thiele, Dynamic power-aware mapping of applications onto heterogeneous MPSoC platforms, IEEE Trans. Ind. Inf. 6 (4) (Nov 2010) 692–707, https://doi.org/10.1109/TII.2010.2062192.
[30] W. Sheng, S. Schürmans, M. Odendahl, R. Leupers, G. Ascheid, Automatic calibration of streaming applications for software mapping exploration, IEEE Design & Test, 2013.
[31] A.K. Singh, M. Shafique, A. Kumar, J. Henkel, Mapping on multi/many-core systems: survey of current and emerging trends, in: Proceedings of Design Automation Conference (DAC), 2013.
[32] M. Sjölander, S. McKee, P. Brauer, D. Engdal, A. Vajda, An LTE uplink receiver PHY benchmark and subframe-based power management, in: Performance Analysis of Systems and Software (ISPASS), 2012.

[33] T. Stefanov, C. Zissulescu, A. Turjan, B. Kienhuis, E. Deprettere, System design using Kahn process networks: the Compaan/Laura approach, in: Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2004.
[34] L. Thiele, I. Bacivarov, W. Haid, K. Huang, Mapping applications to tiled multiprocessor embedded systems, in: Application of Concurrency to System Design, 2007.
[35] F. Wilcoxon, Individual comparisons by ranking methods, Biom. Bull. 1 (6) (1945) 80–83.
[36] Y. Zhang, D.-W. Gong, J. Cheng, Multi-objective particle swarm optimization approach for cost-based feature selection in classification, IEEE ACM Trans. Comput. Biol. Bioinform. 14 (1) (2017) 64–75.
[37] Y. Zhang, D.-W. Gong, Z. Ding, A bare-bones multi-objective particle swarm optimization algorithm for environmental/economic dispatch, Inf. Sci. 192 (2012) 213–227.
[38] E. Zitzler, D. Brockhoff, L. Thiele, The hypervolume indicator revisited: on the design of Pareto-compliant indicators via weighted integration, in: Evolutionary Multi-Criterion Optimization, Springer, 2007, pp. 862–876.
[39] E. Zitzler, S. Künzli, Indicator-based selection in multiobjective search, in: International Conference on Parallel Problem Solving from Nature, Springer, 2004, pp. 832–842.
[40] E. Zitzler, M. Laumanns, L. Thiele, SPEA2: Improving the Strength Pareto Evolutionary Algorithm, 2001.
