Hardware implementation of a novel genetic algorithm


Neurocomputing 71 (2007) 95–106 www.elsevier.com/locate/neucom

Z. Zhu, D.J. Mulvaney, V.A. Chouliaras
Department of Electronic and Electrical Engineering, Loughborough University, LE11 3TU, UK
Available online 3 August 2007

Abstract

This paper introduces a novel genetic algorithm whose features have been purposely designed to be suited to hardware implementation. This is distinct from previous hardware designs that have been realized directly from conventional genetic algorithm approaches. To be suitable for hardware implementation, we propose that a genetic algorithm should attempt both to minimize final layout dimensions and to reduce execution time while remaining a valid implementation. Consequently, the new genetic algorithm specifically aims to keep the requisite silicon area to a minimum by incorporating a monogenetic strategy that retains only the optimal individual, resulting in a dramatic reduction in the memory requirement and obviating the need for crossover circuitry. The results given in this paper demonstrate that the new approach improves on a number of existing hardware genetic algorithm implementations in terms of the quality of the solution produced, the calculation time and the hardware component requirements.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Genetic algorithms; Genetic hardware; Machine learning

E-mail address: [email protected] (D.J. Mulvaney). Corresponding author. Tel.: +44 1509 227042; fax: +44 1509 227014.
0925-2312/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2006.11.031

1. Introduction

Hardware implementations of a range of genetic algorithms (GAs) have been described by a number of authors, with the principal purpose of reducing execution time [10,20]. Compared with their software counterparts, hardware solutions typically reduce calculation times by a factor of around 50 [2]. A further benefit of such solutions is the possibility of replicating the GA, thus allowing parallel exploration of the search space. Work in the literature includes investigations of both coarse-grained (distributed) hardware solutions [11,15], where the individual processing elements (PEs) are loosely interconnected and operate on distinct elements of the population, and fine-grained (cellular) implementations [9,13], where the PEs work collectively on the population. The latter is the main target of the current work, and we specifically aim to apply modern electronic design automation (EDA) tools to generate hardware solutions that provide extremely fine division of the population between PEs.

Although the architectural nature of GAs appears inherently parallel (for example, fitness calculations can be applied independently to the individuals in separate threads of computation), the application of the genetic operators frequently involves combinations of chromosomes. Consequently, a close mapping to fine-grained parallel solutions can be difficult to realize in practice. Through an investigation of novel GA architectures, the work presented in this paper is able to deliver an innovative approach to maximizing the independence between individuals, namely by completely removing the need to maintain a population.

Where the GA implementation includes a population then, to achieve the most significant reduction in execution time, the population is best stored in on-chip memory. In such a case it is normally feasible to access the individual chromosomes at the full clock speed, but such data occupy significant physical area that could otherwise have been used for other GAs, processing elements, control devices or peripherals on the hardware system. With populations of over 100 individuals in realistic applications [7], storage of around 10–100 kB would be required. Modern field-programmable gate arrays or application-specific integrated circuits would have no practical difficulty in assimilating such memory requirements, but, as many modern implementations are system-on-chip (SoC) solutions that incorporate multi-functionality, and many are deployed in portable embedded applications, it is


important to avoid unnecessary use of silicon area, which not only consumes power but also could otherwise have been used by other functional components. Should such on-chip memory dominate the final layout, an alternative is to provide off-chip memory. In such cases, not only may cost considerations dictate the use of slower memory requiring a number of clock cycles to access, but also, if a number of GAs are combined in a single device, it is unlikely that the data bandwidth will be sufficient to allow all GAs simultaneous access to their respective populations.

To address the memory usage issue, the compact GA [1] represented the population as a probability distribution over the set of solutions rather than requiring the storage of the entire population. The elements of a probability vector, equal in length to that of the individuals in the population, indicate the probability that the corresponding bit of an individual is unity. All vector elements are initially set to a value of 0.5. Each generation involves producing new pairs of individuals whose bit patterns are determined according to the vector probabilities. The probabilities in the vector are then modified in favor of the bit values stored in the individual of better fitness. The vector itself holds the final solution.

Ramamurthy and Vasanth [10] implemented a conventional roulette-based GA in hardware. The roulette wheel was used to select pairs of individuals to be operated upon by single-point crossover, where the number of slots allotted to an individual depends on its rank, determined according to its fitness. Consequently, the implementation of the algorithm requires that the population is sorted in order of fitness before selection for crossover. Mutation is implemented by inverting randomly selected bits of an individual.

In the implementation of Hereboy, Levi [12] combined features of simulated annealing (in that only one individual is required) and GAs (to mutate that individual).
Combined with a novel method for the adaptation of the mutation rate, the approach was found to be particularly suitable for the solution of problems requiring representation by long chromosomes. As Levi considered the method particularly suited to serial rather than parallel hardware implementations, we have chosen not to consider Hereboy as one of the alternative GAs in this paper; nevertheless, it is conceivable that minor modifications to the algorithm would allow it to become a suitable candidate for future investigation.

While addressing the performance and memory usage issues, it is important to ensure that any hardware implementation does not sacrifice the quality of the resulting solution. Indeed, the mere process of designing a hardware solution can almost inadvertently facilitate the development of new methods that either would not have come to light or would have been unrealistic in purely software approaches. As an example, Sharawi et al. [21] developed a crossover mechanism based on a ‘half-siblings-and-a-clone’ approach that was able to significantly shorten the GA convergence time. In the approach, chromosomes that best meet the fitness criteria are kept in a subsequent generation, while others are replaced by

individuals that surpass a threshold generated following crossover. The crossover rate is lowered as more individuals satisfy the threshold, whose value is effectively the mutation rate.

Software implementations generally allow significant flexibility, not only in the modification of fitness calculations to meet the application requirement, but also in terms of the ease with which parameters, such as the population size, the lengths of individuals and the rate of application of operators, can be varied. These often need to be set once initially or varied during the search, for example, according to the perceived or estimated size of the solution space, the current progress of the evolution or the required diversity in the population. To permit their use in a wide range of applications, hardware implementations of GAs also need to be flexible in their structure to allow for such parameter variations [24]. Ideally, a facility to set parameters should be available prior to each application of the GA but, as a minimum requirement, it should be possible to specify such parameters at the design stage, before the design is progressed through the EDA flow. The more far-reaching the effects of these parameter changes, the longer it will generally take to generate a new hardware solution. As memory blocks typically occupy substantial silicon area, their reconfiguration often involves substantial redesign effort. Consequently, most hardware solutions do not tailor the memory requirement to the application, but assume a worst-case usage, thereby wasting significant silicon area and consuming additional power in many applications.

This paper introduces the optimal individual monogenetic algorithm (OIMGA). It is specifically designed to address the issues discussed above and achieves the following:

- Compared with conventional GAs, the memory requirement is substantially reduced, since only two individuals need to be kept in on-chip registers.
- The memory requirement of OIMGA is largely independent of the application.
- The solution has the potential for dynamic reconfiguration according to the problem at hand.
- In comparison with a range of existing GA hardware methods, its performance on benchmark problems is shown to exhibit an improvement, in some cases a significant one.

The paper is organized as follows. The OIMGA algorithm is introduced in Section 2 and its hardware implementation is described in Section 3. Section 4 presents results that compare, for four hardware GA implementations, the qualities of the solution produced, the calculation times and the hardware component requirements. Conclusions that discuss the benefits of OIMGA and the planned future developments can be found in Section 5.


2. OIMGA algorithm

OIMGA comprises two searches that interact in a hierarchical manner, namely a global search and a local search. The global search selects a series of (possibly overlapping) regions from the entire search space for detailed exploration by the local search. The global search maintains a single individual (termed topChrom) that is the best (according to the fitness criterion) obtained from all the local searches performed to date. The local search investigates the regions selected by the global search in order to determine the local optimal individual (LOI). This is achieved by generating the local population in a narrow range using micro mutation. We use the term micro mutation as the operator is permitted to alter only the least-significant bits of an individual (rather than the individual as a whole), where the number of bits that can be affected is reduced if previous generations have not improved the fitness of the LOI. If the micro mutation results in a better individual, this becomes the new LOI. The process is repeated until a specified number of generations has passed without an improvement in LOI fitness.

As OIMGA maintains only the single LOI, traditional crossover operators cannot be applied directly. However, it is apparent that the micro mutation, by leaving unaltered the most-significant bits of the LOI, could be considered as also performing a type of crossover at the selected point in the chromosome. Although the least-significant bits to which the selected upper bits become joined are chosen in a random manner, in most cases substantial genetic material is carried forward to later generations. Rudolph [18] proved that GAs that maintain the best solution in the population will always converge to the global optimum. As OIMGA always retains the best individual, it being lost neither to mutation nor to crossover, the algorithm is convergent. As the algorithm repeatedly re-initializes the population space following a global search, OIMGA is likely to be very effective in maintaining diversity and preventing premature convergence. Compared with the existing methods described above, the convergence time of OIMGA is likely to be shorter due to reductions both in the total search space explored and in

Table 1
OIMGA parameters available to the designer

Parameter     Meaning
l             Individual length
n             Population size
m             The size of the miniature space around the LOI
t_gens        Maximum number of consecutive global generations without improvement
k_gens        Maximum number of consecutive local generations without improvement
d_adjustor    Range of mutation of an individual
pm            Probability of mutation

Main Procedure OIMGA genetic algorithm
  Randomly generate a new individual to be the best global individual (topChrom)
  Repeat
    Call Procedure LOI Generation
    Call Procedure Local Search
    If the local optimal individual (bestChrom) is fitter than the best global individual then
      the local optimal individual becomes the best global individual
  Until the best global individual has not changed in t_gens attempts
  The best global individual is the output

Procedure LOI Generation
  Randomly generate a new individual to be the local optimal individual
  Repeat
    Randomly generate a new individual
    If the new individual is fitter than the local optimal individual then
      the new individual becomes the local optimal individual
  Until n new individuals have been generated

Procedure Local Search
  Repeat
    Repeat
      Use micro mutation to generate a new individual from the local optimal individual
      If the new individual is fitter than the local optimal individual then
        the new individual becomes the local optimal individual
    Until m new individuals have been generated
    If the local optimal individual was not changed in the above loop then
      decrease the range of micro mutation
  Until the local optimal individual has not changed in k_gens attempts

Fig. 1. The pseudocode for the OIMGA algorithm.
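To make the control flow of Fig. 1 concrete, the pseudocode can be transcribed into a short software sketch. This is an illustrative transcription rather than the authors' implementation: the one-max fitness function, the parameter defaults and the bitwise form of micro mutation used here are placeholder assumptions (Section 3 describes the actual register-level mechanism).

```python
import random

def oimga(fitness, l, n=64, m=32, t_gens=20, k_gens=8, d_initial=4, pm=0.5):
    """Sketch of OIMGA: a global search keeps one best individual (topChrom);
    each local search refines a local optimal individual (bestChrom) by
    micro mutation of its least-significant bits only."""

    def random_chrom():
        return random.getrandbits(l)

    def micro_mutate(chrom, d):
        # Mutate only the (l - d) least-significant bits, each with
        # probability pm; the d most-significant bits are left untouched.
        for i in range(l - d):
            if random.random() < pm:
                chrom ^= 1 << i
        return chrom

    def loi_generation():
        best = random_chrom()
        for _ in range(n):                  # n = population size
            cand = random_chrom()
            if fitness(cand) > fitness(best):
                best = cand
        return best

    def local_search(best):
        d, stale = d_initial, 0
        while stale < k_gens:
            improved = False
            for _ in range(m):              # m candidates per generation
                cand = micro_mutate(best, d)
                if fitness(cand) > fitness(best):
                    best, improved = cand, True
            if improved:
                stale = 0
            else:
                stale += 1
                d = min(l, d + 1)           # shrink the mutation range
        return best

    top = random_chrom()                    # topChrom
    stale = 0
    while stale < t_gens:
        best = local_search(loi_generation())
        if fitness(best) > fitness(top):
            top, stale = best, 0
        else:
            stale += 1
    return top

# Example with a placeholder one-max fitness (count of set bits).
random.seed(0)
result = oimga(fitness=lambda c: bin(c).count("1"), l=16)
```

Note how the elitist structure of the pseudocode carries over directly: the current best individual is never discarded, only replaced by a fitter one.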


the population size [14]. A further execution speed enhancement in the hardware implementation is also readily identifiable, since the executions of the global and the local searches are prime candidates for hardware pipelining. Table 1 shows the parameters available to a designer using the OIMGA algorithm, while Fig. 1 shows the pseudocode of OIMGA itself.

In a purely software or in an embedded processor implementation, the fitness unit will be executed as part of the normal instruction sequence, but in a hardware solution there are two principal methods that could be considered for the realization of the fitness unit. Firstly, the fitness unit could be implemented in hardware; although this is likely to result in a short execution time, many fitness functions are complex to compute in hardware, as they require not only multiplication and division operations but often also trigonometric and logarithmic expressions. A further drawback of this approach is inflexibility, resulting from the need to redesign the entire fitness unit should the GA be applied to a different problem. Secondly, an SoC solution could be adopted in which the functionality of the GA remains in hardware, but the fitness unit is implemented in software on an embedded processor. While this has the advantage of flexibility, the software implementation is very likely to have a longer execution time than a hardware solution. In this paper, hardware fitness units have been used to generate the results.

3. OIMGA hardware design

Fig. 2 shows the main structure of the hardware implementation of OIMGA. The loi generator initiates the local process by randomly producing a population of n individuals, from which the LOI is determined.

Fig. 2. The main structure of OIMGA showing the interaction of the global and local evaluation units.

In the micro mutation unit, the individuals are allowed to evolve in value only within the range indicated by d_adjustor; any alteration of that range is controlled by the d_controller. The fitness value of the generated individual is calculated by the fitness unit, and the local evaluator compares the fitness of the current LOI with that of the previous one, replacing it if the new fitness is better. The search in the local space is repeated m times. If, during these searches, a new LOI is not found, then the range that d_adjustor indicates is decreased. When the fitness of the LOI does not improve over k_gens cycles, the LOI is sent to the global evaluator. The global evaluator implements the global process and retains the globally best individual, and its fitness value, found from all the local searches. The global process terminates when the fitness has not improved over t_gens operations of the local process.

3.1. LOI generation

The LOI generator shown in Fig. 3 includes a random number generator (RNG) that produces an l-bit random individual whose fitness value is calculated by the fitness unit and stored in the register loiFit. The unit cmp1 compares the fitness in loiFit with the best fitness value held in bestFit and, if it is better, bestFit is replaced by loiFit and the new individual (of length l) replaces that held in the register bestChrom. The n counter ensures that the entire process is carried out n times, where n is the population size. Note that in order to modify the size of the population, it is only necessary to change the length of the counter.

3.2. Micro mutation unit

The micro mutation unit is shown in Fig. 4. If the probability of mutation pm is greater than RNGi and d_MRSRi is set, the ith bit of bestChrom is mutated. The register tempChrom holds the value of the chromosome following mutation and is evaluated in the fitness unit. If its fitness is better than that in bestFit (as determined by the comparator in the local evaluator shown in Fig. 5), the

Fig. 3. The LOI generator produces n individuals and the fittest (bestChrom) effectively defines a search region for later detailed investigation.


signal cmp2 operates the tri-state gate to replace bestChrom by the value in tempChrom. To modify the length of the individual, a corresponding change can be made to the number of bits in the micro mutation unit.
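In software terms, the per-bit decision taken by the micro mutation unit can be sketched as follows. This is a hedged illustration: the register and signal names (tempChrom, d_MRSR_i, RNG_i) follow the text, but the word widths and the example values are assumptions.

```python
import random

def micro_mutation(best_chrom, mask, pm, l):
    """Per-bit mutation as in Fig. 4: bit i of bestChrom is inverted only
    when the mutation probability pm exceeds a fresh random draw RNG_i
    AND the corresponding mask bit d_MRSR_i is set."""
    temp_chrom = best_chrom                 # tempChrom register
    for i in range(l):
        rng_i = random.random()             # per-bit random number RNG_i
        mask_bit = (mask >> i) & 1          # d_MRSR_i
        if pm > rng_i and mask_bit:
            temp_chrom ^= 1 << i            # invert bit i
    return temp_chrom

# Example: a mask of 0x00FF confines mutation to the 8 least-significant
# bits of a 16-bit chromosome (values here are placeholder assumptions).
random.seed(2)
mutated = micro_mutation(0xABCD, mask=0x00FF, pm=0.3, l=16)
```

Because the mask bits gate the mutation, the most-significant byte of the example chromosome is guaranteed to survive unchanged, which is exactly the property the text interprets as an implicit crossover.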

3.3. Local evaluator

The local evaluator shown in Fig. 5 uses the fitness values to select the better individual from tempFit and bestFit, and keeps this elite individual and its fitness value during local evolution.
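The compare-and-latch behavior of the local evaluator can be sketched as a small function; the cmp2 output mirrors the signal that drives the tri-state gate, though the tuple-returning interface is purely illustrative.

```python
def local_evaluate(temp_chrom, temp_fit, best_chrom, best_fit):
    """Local evaluator (Fig. 5): keep the elite individual and its fitness.
    Returns the surviving (chromosome, fitness) pair together with the
    cmp2 signal that would operate the tri-state gate in hardware."""
    cmp2 = temp_fit > best_fit
    if cmp2:
        return temp_chrom, temp_fit, cmp2
    return best_chrom, best_fit, cmp2

# The mutated candidate wins only when its fitness is strictly better.
winner = local_evaluate(0b1010, 5, 0b0001, 3)   # -> (0b1010, 5, True)
keeper = local_evaluate(0b0001, 2, 0b1010, 5)   # -> (0b1010, 5, False)
```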


3.4. Adjusting the range of mutation

During an evolution process, the times between modifications to the fitness values generally increase, indicating that the evolution is converging to a final value. To facilitate rapid convergence, it is normally appropriate to reduce the bounds of the allowed change of mutation values in order to investigate the space in the more immediate vicinity of the current best individual. Initially, the bits in the mask right shift register shown in Fig. 6 are all set, MRSR_i = 1, i ∈ [1, l]. The initial value of the range held in d_initial is set to a predefined value, signifying that all but this number of bits in the individuals should be mutated. This value is copied into d_counter. The exchange register (exchange) is initialized to 0 and is incremented whenever the local evaluator replaces the current best individual. The value in d_counter defines the number of shifts that are performed by the MRSR (with the left-most bit zero filled); at each shift, d_counter decrements by 1. To understand the operation, consider the case where the initial value held in d_counter is 3. In this case, following the shift operations, the state of the MRSR is given by Eq. (1):

\[
\mathrm{MRSR}_i = \begin{cases} 0, & 0 \leq i \leq 3,\\ 1, & 4 \leq i \leq l. \end{cases} \tag{1}
\]

Fig. 4. Micro mutation involves the mutation of the least-significant bits of bestChrom using a defined probability stored in a register.

Fig. 5. If the fitness of a new individual is better than the current local best, the local evaluator replaces the latter with the former.

Fig. 7. If the fitness of the chromosome produced by the current local search is the best seen so far, it is used to replace the currently stored global best individual (topChrom).

Fig. 6. The range of mutation of bestChrom is restricted by only allowing the least-significant bits of the individual to be changed. When no improvement in fitness occurs during the local generation of m individuals, the number of least-significant bits that can be affected is reduced.
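The shrinking-range mechanism of the mask right shift register can be sketched in a few lines; the sketch assumes the behavior described in Section 3.4 (all bits set initially, right shifts with the left-most bit zero-filled) and is not a register-accurate model.

```python
def shift_mrsr(l, d_counter):
    """Mask right shift register (MRSR): starts with all l bits set, then is
    right-shifted d_counter times with the left-most bit zero-filled, so the
    d_counter most-significant bits end up clear and mutation is confined to
    the remaining least-significant bits."""
    mrsr = (1 << l) - 1          # MRSR_i = 1 for i in [1, l]
    while d_counter > 0:
        mrsr >>= 1               # a zero enters at the left-most bit
        d_counter -= 1
    return mrsr

# With l = 8 and d_counter = 3 the three most-significant bits are cleared,
# leaving mutation restricted to the five least-significant bits.
mask = shift_mrsr(8, 3)          # 0b00011111
```

Incrementing d_initial after an unproductive local generation therefore narrows the mutable window by one bit per generation, which is how the search is focused onto the immediate vicinity of the current best individual.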


These values indicate that the range of mutation is in [4, l]. During local evolution based on the LOI, exchange will be incremented if bestChrom and bestFit are replaced. After each generation of local evolution, d_initial will increase by 1 if exchange is still 0, thereby reducing the number of bits that are mutated in the micro mutation unit.

3.5. Global evaluator

The principle of operation of the global evaluator, shown in Fig. 7, is very similar to that of the local evaluator. The global evaluator selects the better individual from bestFit and topFit, and keeps the elite individual from all generations and its corresponding fitness value in topChrom and topFit, respectively.

4. Experimental results

This section investigates the efficiency, in terms of the quality of the solution produced, the calculation time and the hardware component requirements, of four hardware GA implementations, namely the compact GA [1], roulette [19], half-siblings-and-a-clone [21] and OIMGA. The fitness calculations were provided by the benchmark functions shown in Eqs. (2)–(5) below [4,25,26]:

\[
f_1(x) = \left( 1 - 2\sin^{20}(3\pi x) + \sin^{20}(20\pi x) \right)^{20}, \qquad x \in [0, 1], \tag{2}
\]

\[
f_2(\mathbf{x}) = \left( x_2 - \frac{5.1}{4\pi^2}x_1^2 + \frac{5}{\pi}x_1 - 6 \right)^2 + 10\left( 1 - \frac{1}{8\pi} \right)\cos x_1 + 10, \qquad \mathbf{x} \in [-5, 10] \times [0, 15], \tag{3}
\]

\[
f_3(\mathbf{x}) = 30 + \sum_{i=1}^{3}\left( x_i^2 - 10\cos(2\pi x_i) \right), \qquad \mathbf{x} \in [-5.12, 5.12]^3, \tag{4}
\]

\[
f_4(\mathbf{x}) = \sum_{i=1}^{4} \sin(x_i)\sin^{20}\!\left( \frac{i x_i^2}{\pi} \right), \qquad \mathbf{x} \in [0, \pi]^4. \tag{5}
\]

Benchmark functions are deliberately designed to exhibit properties similar to those found in real-world search problems and, in particular, it is difficult to determine analytically the maximum or minimum values of the above functions by methods other than some form of search. The above set of functions was purposely chosen to present a progressively increasing problem complexity to the GAs, as the dimensionality of the tasks

Fig. 8. Maxima of the benchmark functions found by the GAs for a range of population sizes and at a fixed individual length (l) of 32. Each data point shown is the mean value calculated from results obtained from 200 tests, except for the compact GA where, due to its long calculation time, only 20 tests were carried out. Note that the compact GA is also omitted from figure (c) as its calculation time curve contains values significantly (around 10 times) larger than those obtained from the other GAs. (a) Estimated f1(x) maxima (actual value 1.048  106), (b) estimated f2(x) minima (actual value 0.398), (c) estimated f3(x) minima (actual value 0), and (d) estimated f4(x) minima (actual value 0).


presented by the functions rises from 1 to 4 across the set shown. The GA implementations (other than OIMGA) were carried out according to the descriptions given by the respective authors. The simulations were initially developed and run in Matlab [16], and experimental results for all the GAs were generated on the same host computer system. The rapid development facilities of Matlab are well suited to the generation of results for assessing the relative qualities of the estimates produced by the GAs, but the tool is not designed to reproduce fully the cycle-accurate timings of a hardware implementation. The execution timings of practical implementations were therefore assessed in two alternative systems: firstly, on an embedded processor-based system targeted at an existing platform [3] and, secondly, on a purely hardware solution implemented in a high-level modeling language (here SystemC [23]). A third alternative would be to adopt an SoC solution in which perhaps the majority of the GA is implemented in hardware, but software (for example, the fitness unit and various GA parameters) is designed for execution on one or more embedded processors that reside within the final hardware.
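For readers wishing to reproduce the experiments, the four benchmark fitness functions of Eqs. (2)–(5) can be coded directly. The sketch below follows the standard forms of these benchmarks (f2 is the Branin function and f3 a three-dimensional Rastrigin function); the exact constants should be checked against the equations above.

```python
import math

def f1(x):
    """Eq. (2): one-dimensional, x in [0, 1]; maximum approx 2**20."""
    return (1 - 2 * math.sin(3 * math.pi * x) ** 20
              + math.sin(20 * math.pi * x) ** 20) ** 20

def f2(x):
    """Eq. (3): Branin function on [-5, 10] x [0, 15]; minimum approx 0.398."""
    x1, x2 = x
    return ((x2 - 5.1 / (4 * math.pi ** 2) * x1 ** 2
             + 5 / math.pi * x1 - 6) ** 2
            + 10 * (1 - 1 / (8 * math.pi)) * math.cos(x1) + 10)

def f3(x):
    """Eq. (4): 3-D Rastrigin function on [-5.12, 5.12]^3; minimum 0."""
    return 30 + sum(xi ** 2 - 10 * math.cos(2 * math.pi * xi) for xi in x)

def f4(x):
    """Eq. (5): Michalewicz-type function on [0, pi]^4; value 0 at the origin."""
    return sum(math.sin(xi) * math.sin((i + 1) * xi ** 2 / math.pi) ** 20
               for i, xi in enumerate(x))

# Spot checks at known points (e.g. the Branin minimum near (pi, 2.275)).
branin_min = f2([math.pi, 2.275])     # approx 0.398
rastrigin_min = f3([0.0, 0.0, 0.0])   # exactly 0
```

The rising dimensionality (1, 2, 3, 4) is what gives the set its progressively increasing difficulty.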

For the embedded processor solution, the algorithms were re-written in the C programming language and the code was executed using the commercial ARMulator instruction set simulator, modelling an ARM9 processor running at 100 MHz. The SystemC implementation was developed and simulated using the Synopsys CoCentric tools [22].

4.1. Determination of function extrema

In the first set of experiments, the performance of the GAs in determining the maximum value of the function f1(x) and the minimum values of the functions f2(x), f3(x) and f4(x) was investigated for various population sizes and individual lengths. Fig. 8 shows that, for a fixed individual length, OIMGA outperformed the other GA implementations, particularly for small populations. The performance of the compact GA was noticeably inferior to that of the other implementations. The poor performance of the compact GA was also apparent when the population size was fixed and the extreme function values were determined for a range of individual lengths (Fig. 9). The remaining three GAs all performed similarly under this test.


Fig. 9. Maxima of the benchmark functions found by the GAs for a range of individual lengths and at a fixed population size (n) of 128. Each data point shown is the mean value calculated from results obtained from 200 tests, except for the compact GA where only 20 tests were carried out. The curve for the compact GA is omitted from figure (c) for the same reason as that given in the caption of Fig. 8. (a) Estimated f1(x) maxima (actual value 1.048  106), (b) estimated f2(x) minima (actual value 0.398), (c) estimated f3(x) minima (actual value 0), and (d) estimated f4(x) minima (actual value 0).



Fig. 10. Calculation times of the benchmark functions implemented on an embedded processor for a range of population sizes and at a fixed individual length (l) of 32. Each data point shown is the mean value calculated from results obtained from 200 tests. (a) Estimated f1(x) calculation times, (b) estimated f2(x) calculation times, (c) estimated f3(x) calculation times, and (d) estimated f4(x) calculation times.

4.2. Calculation times for the embedded software implementations

The second set of experiments used the embedded ARM processor to determine the calculation times taken to reach convergence when determining the extreme values of each of the functions for a range of population sizes and individual lengths. The compact GA performed poorly compared with the other GAs, its calculation times being greater by at least an order of magnitude across all the population sizes and individual lengths investigated. To allow the reader to compare the results from the remaining GAs, the calculation times for the compact GA are omitted from the results below. In Fig. 10, it can be seen that, as the population size is increased, the calculation times of OIMGA increase less steeply than those of the other GAs; more detailed investigations revealed that, with each doubling of the population size, the calculation times for OIMGA increased at only half the rate of those of the half-siblings-and-a-clone and roulette GAs. In Fig. 11, it is apparent that, for the individual lengths considered, OIMGA was able to demonstrate a reduced calculation time compared with the other GA

approaches. It is clear from these results that OIMGA performed particularly well in the more demanding cases where the individuals were of greater length and the population larger.

4.3. Calculation times for the hardware implementations

In the hardware-based implementation using SystemC, the compact GA again performed poorly in comparison with the other approaches and, for clarity, its calculation times are again omitted from the results. In Figs. 12 and 13, it can be seen that, compared with the OIMGA and clone GA implementations, the roulette GA performed somewhat worse. On further inspection of the timings, it is apparent that the increased calculation times for the roulette method are largely due to the complexity of the logic required to implement the roulette selection. Note that in all cases, the fitness unit has been designed and implemented as part of the hardware solution. In order to allow each approach to function at its best, the experimental procedure involved adjustment of the respective parameters of each of the GAs (other than l and n, whose values were

ARTICLE IN PRESS Z. Zhu et al. / Neurocomputing 71 (2007) 95–106

15000 OIMGA Clone Roulette

8000

calculation time (µs)

calculation time (µs)

10000

6000 4000 2000

OIMGA Clone Roulette

12500 10000 7500 5000 2500

0

0 16

32

64 128 length of each individual, l

256

16

32

64 128 length of each individual, l

256

25000

15000 OIMGA Clone Roulette

12500

calculation time (µs)

calculation time (µs)

103

10000 7500 5000 2500 0

OIMGA Clone Roulette

20000 15000 10000 5000 0

16

32

64 128 length of each individual, l

256

16

32

64 128 length of eachindividual, l

256

Fig. 11. Calculation times of the benchmark functions implemented on an embedded processor for a range of individual lengths and at a fixed population size (n) of 128. Each data point shown is the mean value calculated from results obtained from 200 tests. (a) Estimated f1(x) calculation times, (b) estimated f2(x) calculation times, (c) estimated f3(x) calculation times, and (d) estimated f4(x) calculation times.

purposely varied to obtain the results). Typical parameters used for OIMGA for the estimation of the maxima and minima of the benchmark functions are given in Table 2. For the compact GA no parameters need to be set, for the roulette GA the mutation rate was 0.005 and the crossover rate 0.004, while for half-siblings-and-a-clone the mutation rate was 0.003. Note that altering the width of the fitness value used in internal calculations affects not only the system performance, but also influences other hardware requirements, such as the number of comparators required. Hardware implementations of GAs mainly consist of random number generators, comparators, registers and memory. A somewhat simplified but useful approximation of the implementation requirement of each component in the circuitry is its total bit number (TBN). For example, if there are 10 of 8-bit registers in a circuit, their TBN is 80 bits. To illustrate the relative complexities of the GAs investigated in the current work, the values in Table 3 were obtained from algorithmic estimates of the hardware requirements of four different classes of component. It can be seen that the TBN for the compact GA and OIMGA solutions are an order of magnitude less than those for the other GA methods. However, in contrast with OIMGA, the modest hardware requirement of the

compact GA has clearly been achieved at the expense of performance.
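The TBN approximation described above can be sketched directly; the function name is illustrative rather than taken from the paper, but the arithmetic mirrors the example given in the text (ten 8-bit registers have a TBN of 80 bits):

```python
# Minimal sketch of the TBN (total bit number) metric: each component class
# contributes (component count x bit width) bits to the implementation estimate.
def tbn(count, bit_width):
    """Total bit number of `count` components, each `bit_width` bits wide."""
    return count * bit_width

# The example from the text: ten 8-bit registers -> TBN of 80 bits.
print(tbn(10, 8))  # -> 80
```

The per-class TBNs of a design can then be summed to give a single figure of merit for comparing implementations, which is how Table 3 is used in the text.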

4.4. Hardware realization of OIMGA

Our current work is now concentrating on the hardware realization of OIMGA. The approach of particular interest is to combine the GA implementation with one or more processors that provide the flexibility to adapt the solution to a wide range of application-specific fitness calculations. Our first hardware implementation incorporates a Leon3-MP [8] multi-processor platform (with one processor in this instance) combined with the GA module, instantiated as an advanced high-speed bus (AHB) slave [3], and was implemented in a high-performance, 1-poly/8-metal 0.13 µm CMOS process. The flow included validation in ModelSim [17] and front-end synthesis in Synopsys Design Compiler [22], followed by power, clock and detailed routing in Cadence SoC Encounter [5]. Although the flow did not stress the back-end tools to the full, a post-route clock frequency of 300 MHz was achieved. The total area of the VLSI macro was 4.75 mm² and the design included 34,368 standard cells and 26 RAM macros for the instruction cache, data cache, register file and debug unit RAM.

Fig. 12. Calculation times of the benchmark functions implemented in hardware for a range of population sizes and at a fixed individual length (l) of 32. Each data point shown is the mean value calculated from results obtained from 200 tests. (a) Estimated f1(x) calculation times, (b) estimated f2(x) calculation times, (c) estimated f3(x) calculation times, and (d) estimated f4(x) calculation times.

5. Conclusions

This paper has introduced a new GA that is particularly suited to hardware implementation. When run on benchmark problems, the new algorithm compared favorably with a number of hardware solutions found in the literature, both in execution time and in the quality of the solutions produced. Local convergence is achieved in OIMGA by retaining elite individuals, while population diversity is ensured by continually searching for the best individuals in fresh regions of the search space. As OIMGA only requires two individuals to be stored, its memory requirement is substantially reduced compared with that of the other high-performance candidate hardware GA implementations investigated. As the size of the population and the length of the individuals can be altered simply by replicating existing logic units, additional memory is not required when the design is applied to a new problem. Moreover, by setting maximum values for the length of the individuals, it is possible to use the hardware GA across a wide range of applications without the need to modify the hardware design. It is a relatively simple addition to the hardware design to allow the acquisition of values for these parameters and to use them to dynamically configure hardware registers prior to evolution.

While this paper has demonstrated the feasibility of the new method, further investigations of practical hardware implementation of OIMGA are underway. An advantage of OIMGA is the small area it occupies in hardware, making it particularly suitable for fine-grained parallel SoC configurations. Also planned are detailed investigations of OIMGA using alternative architectural approaches and its application to more extensive practical problems. A further issue that needs to be addressed is that of the fitness calculations. Not only are these calculations often the most intensive of all the GA computations, but ideally they also need to be dynamically modifiable. One architectural approach found in the literature is the master–slave GA, which allows each PE in a fine-grained system to determine the fitness values of its own population, while selection, application of operators and mutation are carried out serially by a single PE [6]. To allow dynamic reconfiguration of the fitness calculation, we are now investigating the inclusion of soft microprocessor cores into the design to enable the fitness function to be described in software prior to GA execution.
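The master–slave division of work described above can be sketched in software. This is an illustrative sketch, not the hardware design: the thread pool stands in for the slave PEs that evaluate fitness in parallel, a single "master" performs selection and mutation serially, and the function name, truncation selection and mutation rate are all assumptions:

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Sketch of a master-slave GA step [6]: slaves compute fitness in parallel;
# the master then performs selection and mutation serially.
def master_slave_step(population, fitness, pm=0.01, rng=None):
    rng = rng or random.Random(0)
    # "Slaves": fitness evaluations are independent, so they map cleanly
    # onto parallel workers.
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(fitness, population))
    # "Master": serial selection (keep the fitter half, best first) ...
    ranked = [ind for _, ind in sorted(zip(scores, population),
                                       key=lambda p: p[0], reverse=True)]
    survivors = ranked[:len(ranked) // 2]
    # ... and serial mutation (bit-flip with probability pm) to refill.
    children = [[b ^ (rng.random() < pm) for b in ind] for ind in survivors]
    return survivors + children
```

In a hardware realization the parallel part would be replicated PEs (or, as proposed above, soft processor cores running a software-defined fitness function), while the serial part remains a single control unit.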


Fig. 13. Calculation times of the benchmark functions implemented in hardware for a range of individual lengths and at a fixed population size (n) of 128. Each data point shown is the mean value calculated from results obtained from 200 tests. (a) Estimated f1(x) calculation times, (b) estimated f2(x) calculation times, (c) estimated f3(x) calculation times, and (d) estimated f4(x) calculation times.

Table 2
OIMGA parameter values

m    t_gens   k_gens   d_adjuster   pm      Width of fitness value (bits)
10   4        5        3            0.382   32
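The full OIMGA procedure is not reproduced in this excerpt, so the following is only a hedged sketch of a monogenetic, mutation-only search in its spirit: only a local best and a global best individual are stored, outer restarts probe fresh regions of the search space, and the parameter names follow Table 2. Their exact roles, and the d_adjuster range-adjustment mechanism (omitted here), are assumptions:

```python
import random

# Hedged sketch, not the authors' RTL: a one-individual, mutation-only search
# that keeps just two individuals in memory (local best and global best).
def oimga_sketch(fitness, l, m=10, t_gens=4, k_gens=5, pm=0.382, seed=0):
    rng = random.Random(seed)
    best = [rng.randint(0, 1) for _ in range(l)]          # global best
    for _ in range(t_gens):                               # restarts in fresh regions
        local = [rng.randint(0, 1) for _ in range(l)]     # new random individual
        for _ in range(k_gens):                           # local refinement
            for _ in range(m):                            # m mutants per generation
                cand = [b ^ (rng.random() < pm) for b in local]
                if fitness(cand) > fitness(local):
                    local = cand                          # retain only the best
        if fitness(local) > fitness(best):
            best = local
    return best
```

The two-individual storage is what drives the dramatic memory reduction reported above: no population array, and hence no crossover circuitry, is needed.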

Table 3
Hardware requirements of the GA implementations

GA         Random number generators   Comparators   Registers   Memory
OIMGA      160                        224           296         0
Clone      32                         96            256         4096
Roulette   59                         64            478         8192
Compact    256                        262           352         0
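Reading each Table 3 entry as the TBN (in bits) of one component class, the per-GA totals can be summed to check the order-of-magnitude claim made in the text; the dictionary below simply transcribes the table:

```python
# Table 3 transcribed: (random number generators, comparators, registers, memory),
# each value taken as a TBN in bits.
TABLE3 = {
    "OIMGA":    (160, 224, 296, 0),
    "Clone":    (32,  96,  256, 4096),
    "Roulette": (59,  64,  478, 8192),
    "Compact":  (256, 262, 352, 0),
}
totals = {ga: sum(bits) for ga, bits in TABLE3.items()}
print(totals)  # -> {'OIMGA': 680, 'Clone': 4480, 'Roulette': 8793, 'Compact': 870}
```

The memory-free OIMGA and compact GA totals (680 and 870 bits) are roughly an order of magnitude below the clone and roulette totals (4480 and 8793 bits), dominated by the latter pair's population memories.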

References

[1] C. Aporntewan, P. Chongstitvatana, A hardware implementation of the compact genetic algorithm, in: Proceedings of the IEEE Congress on Evolutionary Computation, Seoul, Korea, May 2001, pp. 27–30.
[2] S. Areibi, M. Moussa, G. Koonar, A genetic algorithm hardware accelerator for VLSI circuit partitioning, ISCA Int. J. Comput. Appl. 12 (3) (2005) 163–180.
[3] ARM, http://www.arm.com/.
[4] B. Bilgin, Common benchmark functions used in evolutionary algorithms, http://cse.yeditepe.edu.tr/bbilgin/resources/cse416/benchmarking.pdf.
[5] Cadence, http://www.cadence.com/.
[6] E. Cantú-Paz, D.E. Goldberg, Efficient parallel genetic algorithms: theory and practice, Comput. Methods Appl. Mech. Eng. 186 (2000) 221–238.
[7] K.A. De Jong, W.M. Spears, Using genetic algorithms to solve NP-complete problems, in: Proceedings of the 3rd International Conference on Genetic Algorithms, San Mateo, CA, 1989, pp. 124–132.
[8] Gaisler Research, http://www.gaisler.com/.
[9] M. Giacobini, E. Alba, A. Tettamanzi, M. Tomassini, Modeling selection intensity for toroidal cellular evolutionary algorithms, in: Proceedings of the Genetic and Evolutionary Computation Conference, Seattle, WA, Part I, June 2004, pp. 1138–1149.
[10] J.W. Hauser, C.N. Purdy, Sensor data processing using genetic algorithms, in: Proceedings of the IEEE Midwest Symposium on Circuits and Systems, Lansing, MI, August 2000.
[11] E.Y. Kim, S.H. Park, S.W. Hwang, H.J. Kim, Video sequence segmentation using genetic algorithms, Pattern Recognition Lett. 23 (7) (2002) 843–863.
[12] D. Levi, HereBoy: a fast evolutionary algorithm, in: Proceedings of the 2nd NASA/DoD Workshop on Evolvable Hardware, Palo Alto, CA, July 2000, pp. 17–24.
[13] X. Li, S. Sutherland, A cellular genetic algorithm simulating predator–prey interactions, in: Proceedings of the 4th Asia-Pacific Conference on Simulated Evolution and Learning, Singapore, 2002, pp. 76–80.
[14] J. Li, S. Wang, Optimum family genetic algorithm, J. Xi'an Jiaotong Univ. 38 (2004).


[15] C. Maple, L. Guo, J. Zhang, Parallel genetic algorithms for third generation mobile network planning, in: Proceedings of Parallel Computing in Electrical Engineering, Dresden, Germany, September 2004, pp. 229–236.
[16] Matlab, http://www.mathworks.com/.
[17] Mentor Graphics, http://www.mentor.com/.
[18] G. Rudolph, Convergence analysis of canonical genetic algorithms, IEEE Trans. Neural Networks 5 (1) (1994) 96–101.
[19] P. Ramamurthy, J. Vasanth, VLSI implementation of genetic algorithms, unpublished.
[20] S.D. Scott, A. Samal, S. Seth, HGA: a hardware-based genetic algorithm, in: Proceedings of the 3rd ACM International Symposium on FPGAs, Monterey, CA, February 1995, pp. 53–59.
[21] M.S. Sharawi, J. Quinlan, H.S. Abdel-Aty-Zohdy, A hardware implementation of genetic algorithms for measurement characterization, in: Proceedings of the IEEE International Conference on Electronics, Circuits, and Systems, Dubrovnik, Croatia, September 2002, pp. 1267–1270.
[22] Synopsys, http://www.synopsys.com/.
[23] SystemC, http://www.systemc.org/.
[24] S. Wakabayashi, T. Koide, N. Toshine, M. Yamane, H. Ueno, Genetic algorithm accelerator GAA-II, in: Proceedings of the Asia and South Pacific Design Automation Conference, Yokohama, Japan, January 2000, pp. 9–10.
[25] X. Yao, Y. Liu, G. Lin, Evolutionary programming made faster, IEEE Trans. Evol. Comput. 3 (2) (1999) 82–102.
[26] L. Zhang, B. Zhang, Research on the mechanism of genetic algorithms, J. Software 11 (7) (2000).

Zhenhuan Zhu received a B.Sc. in Computer Applications from Hefei University of Technology, China, in 1982. He worked as a Lecturer at Jiujiang Polytechnic College and as an Associate Professor at Anhui University of Technology and Science, both in China. He was an academic visitor in the Department of Electronic and Electrical Engineering at Loughborough University, UK, where he researched hardware implementations of machine learning algorithms. Currently, he is working on a secure ad hoc fire and emergency safety network in the Department of Computer Science at Loughborough University. His research interests include embedded systems, AI, system-on-chip, industrial process control and network communications.

David Mulvaney has been a Senior Lecturer in the Department of Electronic and Electrical Engineering at Loughborough University since June 2001. His main research interests include novel real-time embedded machine-learning techniques and electronic hardware solutions for real-time applications. Before joining Loughborough, Dr. Mulvaney was employed by a UK engineering consultancy, where he developed a number of real-time AI systems for military applications, as well as working on knowledge-based systems to aid the assembly of surface mount components and on the causal modeling of diesel engines for use in predictive maintenance. Dr. Mulvaney has carried out consultancy work for BP, Otis, Cadbury-Schweppes and GE Lighting, gives commercial training courses in real-time embedded C++ and has over 70 publications in professional journals and at international conferences.

Vassilios A. Chouliaras was born in Athens, Greece, in 1969. He received a B.Sc. in Physics and Laser Science from Heriot-Watt University, Edinburgh, in 1993 and an M.Sc. in VLSI Systems Engineering from UMIST in 1995. He worked as an ASIC design engineer for INTRACOM SA and as a senior R&D engineer/processor architect for ARC International. Currently, he is a Senior Lecturer in the Department of Electronic and Electrical Engineering at Loughborough University, UK, where he is leading the research in CPU architecture and microarchitecture, SoC modeling and software parallelization. His research interests include superscalar and vector CPU microarchitecture, high-performance embedded CPU implementations, performance modeling, custom instruction set design and self-timed design.