Expert Systems With Applications 78 (2017) 32–50
Contents lists available at ScienceDirect
Expert Systems With Applications journal homepage: www.elsevier.com/locate/eswa
Optimization based on phylogram analysis Antonio Soares a,∗, Ricardo Râbelo a, Alexandre Delbem b a b
Federal University of Piauí, Piauí, Brazil University of São Paulo, Piauí, Brazil
a r t i c l e
i n f o
Article history: Received 2 September 2016 Revised 3 February 2017 Accepted 4 February 2017 Available online 5 February 2017 Keywords: Estimation of distribution algorithms Phylogram analysis Hierarchical clustering Combinatorial optimization
a b s t r a c t Estimation of Distribution Algorithms (EDAs) is evolutionary algorithms with relevant performance in handling complex problems. Nevertheless, their efficiency and effectiveness directly depends on how accurate the deployed probabilistic models are, which in turn depend on methods of model building. Although the best models found in the literature are often built by computationally complex methods, whose corresponding EDAs require high running time, these methods may evaluate a lesser number of points in the search space. In order to find a better trade-off between running time (efficiency) and the number of evaluated points (effectiveness), this work uses probabilistic models built by algorithms of phylogenetic reconstruction, since some of them are able to efficiently produce accurate models. Then, an EDA, namely, Optimization based on Phylogram Analysis, and a new search technique, namely, Composed Exhaustive Search, are developed and proposed to find solutions for combinatorial optimization problems with different levels of difficulty. Experimental results show that the proposed new EDA features an interesting trade-off between running time and number of evaluated points, attaining solutions near to the best results found in the literature for each one of such performance measures. © 2017 Elsevier Ltd. All rights reserved.
1. Introduction An important field of research in Evolutionary Algorithms (EAs) consists in developing EDAs (Larrañaga & Lozano, 2001; Pelikan, Goldberg, & Lobo, 2002; Pelikan, Sastry, & Cantú-Paz, 2006) to build probabilistic models of variable values by means of a set of samples selected in promising regions of the search space. From these models, an EDA generates new solutions with values for each variable that are consistent with such regions. In general, a probabilistic model is built at each generation. The exactitude of a probabilistic model directly affects the EDA’s convergence rate (Goldberg, 2002) and its ability to attain the optimal solution of the problem. In other words, the sole use of a probabilistic model does not guarantee neither efficiency nor effectiveness of an EDA; under certain conditions, it may be similar to ordinary EAs (Harik, Lobo, & Goldberg, 1999; Mühlenbein, 1997). In this context, a typical example is the relationship between Genetic Algorithm (GA) and compact GA (Harik et al., 1999); the latter uses a univariate probabilistic model and has similar performance to GA (Harik et al., 1999) for various problems tested in the literature. One advantage of a univariate probabilistic model is
∗
Corresponding author. E-mail addresses:
[email protected] (A. Soares),
[email protected] (R. Râbelo),
[email protected] (A. Delbem). http://dx.doi.org/10.1016/j.eswa.2017.02.012 0957-4174/© 2017 Elsevier Ltd. All rights reserved.
that the construction of such a model requires relatively low running time. On the other hand, the Bayesian Optimization Algorithm (BOA) (Pelikan, Goldberg, & Cantú-Paz, 1999) requires significantly more running time to build multivariate and multimodal probabilistic models, namely, Bayesian Networks (Pelikan et al., 1999). This EDA is able to solve certain problems that usually have poor performance for many EAs. Thus, an important aspect in evaluating the performance of EDAs is the trade-off between the efficiency of the model building process and the exactitude (effectiveness) of the built probabilistic model (Pelikan et al., 2002). The strategy deployed in this work to reach such a trade-off is based on (exact or approximately) decomposing relatively complex problems into simpler sub-problems. Then, from the obtained models for these sub-problems, probabilistic models for the original problem can be efficiently obtained while maintaining their capability of representing promising regions. The decomposition of the original problem into sub-problems may be achieved by identifying groups of highly correlated variables, which, in some cases, are called Building Blocks BBs. The theory presented in Goldberg (2002) shows that more efficient and effective EDAs, in general, build probabilistic models capable of correctly mapping more BBs by sampling several optimal instances of them. They also present a better use of information from BB to generate new solutions. Moreover, such EDAs must ensure that the optimal instances of BB prevail in relation to sub-optimal
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
33
Fig. 1. Phylogeny of a group of plants obtained from test set sequences of nucleotides rbcL_55 (Cancino & Delbem, 2007) and Neighbor Joining algorithm (Saitou & Nei, 1987; Studier & Keppler, 1988).
ones during the evolutionary process and that convergence drift is avoided (Goldberg, 2002). This work investigates aspects of computational biology methods used to exact or approximately decompose a problem into subproblems. These methods have been successfully used for phylogenetic reconstruction, which provides evolutionary relationships between species and other biological objects (Cancino & Delbem, 2007; Felsenstein, 2003). From such an investigation on phylogenetic methods, we propose a new EDA, namely, Optimization based on Phylogram Analysis (OPA), to attain solutions for combinatorial optimization problems with different levels of difficulty with five main features: (i) the use of a set of trees, instead of only one, to model the correlation between variables; (ii) a summary of relevant information, from the set of trees, to a network structure; (iii) the use of a community detection algorithm (Newman & Girvan, 2004) to identify possible BBs of the problem from one or more network structures; (iv) resampling in several steps of the algorithm to make network structures more representative of correlations between problem variables; and (v) a new search technique, namely, Composed Exhaustive Search (CES). Features (i) to (iv) are organized into a new method of clustering, namely, Clustering based on Phylogram Analysis (CPA); OPA consists in the association of CPA and CES. Thus, CPA, CES, and OPA, consequently, show how such features can deal with aspects of combinatorial optimization problems that make them complex. The paper presents a theoretical analysis that enables predictions on the performance of OPA; moreover, it shows experimental results that corroborate the theoretical estimates. The remaining of the paper is organized as follows. Section 2 introduces the concept of phylogeny as a probabilistic model for EDAs. Section 3 describes all steps of CPA. Section 4 shows how possible BBs can be identified for various problems by using CPA. Section 5 describes some EDAs that were also developed to solve combinatorial problems by means of variable partitioning. Section 6 presents CES, which efficiently solves combinatorial optimization problems by means of partitions obtained by CPA. Section 7 shows the results by combining the CPA and CES techniques to solve deceptive trap problems. Section 8 shows the results for OneMax and BinInt problems and with overlapping BBs. Finally, Section 10 summarizes the main contributions of this work.
Fig. 2. Possible clades (dashed lines) obtained from the phylogeny in Fig. 1.
Usually, phylogenetic trees (a connected and acyclic graph) are binary trees where their leaves represent species; thus, leaves are labeled with the name of a corresponding species. Fig. 1 shows a phylogeny that highlights evolutionary relationships for a group of plants. The leaves represent existing species, while the internal nodes indicate hypothetical ancestrals or extinct species. Clades are groups that involve an ancestor and all descendants starting from the lowest (leaf) to a higher (ancestor) internal node in the tree structure. They are, in general, the most useful information in a phylogeny, since relationships within a clade are strong. Fig. 2 illustrates possible clades containing the Nephrolepis and Xiphopteris species, whose DNA sequences were obtained from test set rbcL_55 (Cancino & Delbem, 2007). In a computational biology method to solve optimization problems, a clade involves no biological data (except for some bioinformatic problems) and corresponds to a subset of correlated variables, which can be viewed as BBs. A variable can be present in more than one BBs, and BBs can overlap if they have variables in common. In Biology, DNA sequences overlap if they have genetic material in common, as it usually occurs in a clade. A resampling process (Felsenstein, 2003) can be used to generate various phylograms (phylogenies when DNA sequences are used) from the same data set in order to provide statistic significance to the found relationships. By means of phylograms, CPA can find a consensus of those relationships and compose a network based on them, where highly correlated variables (indicating a BBs) are modeled by a graph clique. Afterwards, the CPA technique uses a community detection algorithm from Complex Networks (Donetti & Muñoz, 2004; Duch & Arenas, 2005) to automatically detect communities (BBs from the optimization point of view) from the network.
2. Phylogram-based models 3. Description of CPA A phylogeny is tree representation of species relationships with common origin. The term phylogenetic tree is used for phylogenies obtained from both morphological data and genetic sequences. In this work, phylogenies from various types of data are built in order to determine groups of objects (phyla), correlating such data.
CPA can construct models of relationships among variables considering problems with different levels of difficulty, i.e., variables can be partitioned into: (i) BBs with no overlap, (ii) BBs with some overlaps, and (iii) BBs with overlap. CPA has eight main steps that
34
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
Fig. 3. Examples of an initial population (n = 10, = 9) for a problem composed of 3 ftrap3 functions (k = 3) and of a selected population obtained by tournament size 2, using both = 0.50 and = 1.00.
Fig. 4. Distance matrix for variables x0 to x8 calculated using MI for a problem with = 9. Each column i (indexed from 0 to 8) of the left table, which represents the selected population with fitnesses of its individuals, corresponds to a vector vi whose elements are values of varibale xi .
run in a pipeline way. A description of each step is detailed next and their operation is depicted in Figs. 3–9. Fig. 10 summarizes CPA as a whole. Step 1: Selected populations by resampling First, an initial random population of size n is randomly generated and its individuals are evaluated by a fitness function. Then, Step 1 makes a tournament by resampling from the initial population, selecting ns individuals. The parameter ∈ R+ multiplies n, defining the size of the selected population ns , so that ns = n. Fig. 3 summarizes Step 1, where the fitness function correspond to 3 ftrap3 functions (Harik, Lobo, & Sastry, 2006; Pelikan et al., 1999), with non overlapping BBs of size k = 3 (k represents the BBs size) and = 9 variables ( is the number of biological objects in a data set, or the number of variables of a problem). It is important to highlight that the resampling processes can be repeated several times in order to obtain several selected populations, improving the estimates produced by CPA. Step 2: Distance matrix For each selected population sampled in Step 1, a distance matrix is calculated using a metric applied to each pair of variables of a problem. All the values assigned to a variable xi in the selected population composes a vector vi ; and the distance between variables xi and xj is calculated by using their corresponding vectors vi and vj .
CPA uses two metrics to construct the distance matrix: Mutual Information (MI) (Kraskov, Stogbauer, Andrzejak, & Grassberger, 2003) for problems with binary variables, and Normalized Compression Distance (NCD) (Cilibrasi & Vitányi, 2005) for real or mixed variables. Other metrics can be applied according to a priori knowledge on the problem’s domain. For example, for DNA sequences, metrics based on evolutionary models (Felsenstein, 2003) can be used. Fig. 4 depicts Step 2. Step 3: Resampling of phylograms Step 3 generates a set of possible phylograms from the distance matrix determined in Step 2. Neighbor Joining (NJ) algorithm (Saitou & Nei, 1987; Studier & Keppler, 1988) is used to construct each phylogram (that can represent hierarchical relationships between objects, variables in CPA), since it may be considered an efficient algorithm (O(3 )) when compared to other methods used to construct phylogenetic trees. However, NJ cannot guarantee an optimal phylogram since it is a greedy algorithm and it generates only one possible phylogram. Such a hurdle can be overcome by permuting columns and rows of the distance matrix. When columns and rows of the distance matrix are permuted, NJ usually builds different sub-optimal phylograms for each permutation. Thus, Step 3 constructs a set of phylograms by applying NJ to a set of distance matrices obtained by column and row permutations. Those phylograms differ from each other, but they, in general, preserve some clades (possible BBs), since a subtree associated to
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
35
Fig. 5. Two phylograms generated by NJ applied to two distance matrices determined by means of random permutations of columns and rows from the same distance matrix. Equal ribbons highlight three partitions of objects (clustered according to certain subtrees) that are common in both phylograms.
a clade is calculated by NJ using less indirections (relative distances involving internal nodes) for objects. Fig. 5 shows two different phylograms originated from the same distance matrix determined by means of permutations of columns and rows, highlighting three partitions preserved in both phylograms. Those common relationships can be determined by calculating the consensus from several trees (as it will be seen in Step 5), enabling to find the correct BBs in a computationally efficient way. Step 4: Network format of each phylogram Step 4 converts the output of NJ from Newick format (Felsenstein, 2003) (usually applied to bioinformatics tools like PHYLIP (Felsenstein, 20 0 0)) to an adjacency matrix, namely, network format. The use of a network format makes the application of several community detection algorithms in the field of Complex Networks possible (Donetti & Muñoz, 2004; Duch & Arenas, 2005) (as it will be seen in Step 5). Fig. 6 depicts this conversion. Step 5: Partitionings Based on the network obtained in Step 4, Step 5 determines a set of possible variable partitionings (a set of clades, as seen in
Step 3) of the problem. It applies the Fast Newman (FN) community detection algorithm, which has an adequate trade-off between the quality of the obtained communities and the required running time (Crocomo, Martins, & Delbem, 2013) (O((a + b)b), where a is the number of edges in the network, and b is its number of nodes (Newman & Girvan, 2004)). As NJ, FN is also a greedy algorithm, for which random permutations of rows and columns of the adjacency matrix usually produce different variable partitionings. Step 5 applies FN to various permutations providing a set of partitionings. Then, the less frequent partitionings are eliminated.
Step 6: Networks of leaves Step 6 simply converts each partitioning of leaf nodes (obtained in Step 5) into adjacency matrix format representing a graph, namely, network of leaves. It is assumed that all nodes in a same partition (a possible clade or BB) of a partitioning forms a graph clique (Diestel, 2006). This adjacency matrix format is required in Step 7, which also represents the indices of rows and columns in a standardized order (the ascending order of indexes is considered). Fig. 7 depicts the conversion process performed by Step 6.
36
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
Fig. 6. Conversion of a phylogram from Newick format into network format. Nodes with indices greater than 8 are internal nodes of the phylogram.
Step 7: Equivalent network related to a selected population All adjacency matrices representing partitionings found from each network of leaves obtained from the same selected population (same resampling) are summed (using logical sum or the Boolean operator OR (Lipschutz & Lipson, 2004)). The result is an equivalent adjacency matrix or equivalent network. Fig. 8depicts Step 7, showing two equivalent networks associated with two selected populations. Each equivalent network was obtained from two networks of leaves generated by means of resamplings from the same distance matrix, associated with a selected population. The two selected populations were obtained from the same random initial population.
relatively low quality network (for example, a network with erroneous connection among partitions) are expected to not interfere the final result (obtained communities or BBs) of the final network. This is due to the fact that the importance of a spurious connection between nodes is relatively small compared to the importance of properly included edges, which are more frequent. Moreover, the final network obtained from the summation of those matrices show that a small partition in an adjacency matrix can become larger (forming a clade) when checking consensus among them. In addition, a partition can become more representative or reliable by the inclusion of edges between its nodes. Fig. 10 summarizes all steps of CPA in a diagram. 4. Identification of partitionings and overlaps
Step 8: Final partitioning from several selected populations FN is applied, this time, over each equivalent adjacency matrix (equivalent network) providing a partitioning from each selected population. Next, all equivalent adjacency matrices are summed (using logical sum or the Boolean operator OR (Lipschutz & Lipson, 2004)) by the same procedure described in Step 7, producing a single adjacency matrix, namely, final network. At last, FN is applied to the final network, obtaining the final partitioning. Fig. 9 depicts Step 8. At this step, CPA reveals how it can extract quality (relevant relationships) from quantity. Notice that relationships present in a
Additively Separable Deceptive Problems (ASDP) (Goldberg, 2002) have clear BBs. Nevertheless, they must be identified from sampled data, a task that is usually not trivial. Problems with overlapping BBs have more than one way of interaction between two variables; in some cases, there are several levels of relationships among variables, with different intensities. One way of representing relationships between BBs is by a phylogram. CPA can construct phylogram-based models for complex combinatorial optimization problems, facilitating their resolution. Overlaps, represented hierarchically, have direct relationship with trees, where taxa with common ancestor have high homology
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
37
Differential Evolution (DE) (Qu & Suganthan, 2010), which converts the bit string of each BB into the corresponding integer, generating integer meta-variables. Strategy (i) was proposed in Crocomo et al. (2013), and it executes an ES on BBs obtained in the first generation of the BOA (Pelikan et al., 1999)). Such a method, namely, StrOp, is able to find the solution of ASDPs with significant reduction in Number of Evaluations (NE); it has reached up to 78% less NE when compared to BOA. Furthermore, it showed an efficacy (percentage of test repetitions with different seeds for the random number generator to find optimum) equal to or above 95%, similar to the efficacy found by BOA. In strategy (ii), a DE explores the search space of metavariables (Melo, Vargas, Crocomo, & Delbem, 2011). The integer meta-variables (obtained based on a partitioning procedure that also uses a phylogenetic tree) enables DE to find better solutions for the tested combinatorial problems. DE demonstrated better efficiency, requiring only about 50% NE of Extended Compact Genetic Algorithm (ECGA) (Harik et al., 2006) for solving ASDP with BB of size k = 8. These results motivated the development of an EDA, which uses a new search strategy for determining a more adequate partitioning. In order to achieve this goal, a new search strategy is proposed, namely, CES, which is described next. 6. Composed exhaustive search Fig. 7. Example of conversion from a partitioning (a vector of labels) to an adjacency matrix, and the corresponding network of leaves.
(similarity with a common origin and overlapping genetic material). Relevant relationships in a phylogram are usually in the same sub-tree, namely, clade. In the context of variable partitioning, clades correspond to subsets of overlapping BBs. In this paper, the term partition directly corresponds to a BB or a set of overlapping BB, forming a clade. On the other hand, taxa with larger distance relationships (internal nodes on top of a phylogram) have such a low correlation that, in general, can be ignored for practical purposes. Since there is little genetic material in common, there is no guarantee that this material has the same origin. For example, the material in common may be transposable elements (Philippsen, 2014). It is noteworthy to mention that the composition of equivalent networks by logical sum (as seen in Step 7) usually groups variables of a BB, making them compose a unique partition, independently of the number of connections that a variable has between two or more BBs. To assess this, CPA deploys variable partitioning of a selected population and analyzes the final partitioning (a consensus) in order to evaluate its representativeness, i.e., the reliability of clades obtained by this process compose a BB. In this paper, it is considered that a BB in the final partitioning is not relevant if it is not present in half of partitionings of a selected population. Fig. 11 depicts how CPA identifies BBs from several samples of partitionings for three types of problems: (i) ASDP, in which none of the variables of BBs overlap (Fig. 11(a)), (ii) problems in which one variable of some BBs overlaps (Fig. 11(b)), and (iii) problems in which at least one variable of each BBs overlaps (Fig. 11(c)). 5. Estimation of distribution algorithms The partitioning generated by CPA can be used as a step in an EDA for solving optimization problems. For example, a method can be used to explore the search space based on each BB found by CPA (as seen in Step 8). In this sense, we highlight two strategies: (i) straight optimization, which basically corresponds to make an Exhaustive Search (ES) on each identified BB, and (ii) application of
CES is proposed to find the optimal solution for combinatorial optimization problems (Goldbarg & Luna, 20 0 0) by means of partitions found by CPA. It performs an ES at each partition (possible BB) from the final partitioning found by CPA and analyzes the results of such ESs to determine the problem’s global optimum. CES executes the four steps described next, which are summarized in Fig. 13. Step 1: Step 1 searches for one solution around the global optimum and another in the neighborhood of the best local optimum from the selected population in the CPA. These solutions are used to evaluate new solutions generated by ES applied to a partition. In each evaluation of a solution, the values of the bits outside the BB where ES is operating are kept constant; these values are obtained from the solution around the global or local optima. It is noteworthy to mention that the modeling of the competition between global and local optimum instances of a BB, as proposed by Goldberg (2002), resulted in a theoretical model that corroborates the fact that solutions with various instances of BBs of the global and best local optima are highlighted in the selected population, as long as the initial population is randomly sampled and n is greater than or equal to the value obtained by means of n = χ k (k ln χ + ln m ) (Goldberg, 2002), where χ is the arity of discrete variables (in this paper, χ = 2 for all problems) and m is the problem’s estimated number of BBs. Based on a selected population with such characteristics, a clustering algorithm as k-Nearest Neighbors (kNN) may be used to divide this population into two groups, one for each of the competing instances in each neighborhood. Fig. 12 illustrates the bimodal aspect of the distribution of the selected population. Such an aspect can be highlighted by a resampling using a tournament with = 2, 00. Then, the most different solutions between these two groups (by maximizing the significance of their difference) are determined in order to obtain a solution near the global optimum and another near the best local optimum (even though, at first, it is not known which solution belongs to which neighborhood). For combinatorial problems, the clustering algorithm may be avoided by calculating the Hamming
38
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
Fig. 8. Equivalent networks obtained from two networks of leaves associated with two different selected populations.
distance (Hamming, 1950) among solutions of the selected population and, then, choosing the pair of solutions (one from each cluster) that corresponds to the largest distance.
Step 2: Step 2 determines the best size for partitions (k∗ ) of a partitioning, which may contain overlap (clades). In these cases, it is possible to choose the most convenient size for BBs involved in a clade (notice that the result of ES in a group of bits smaller than the size of a BB, and the complement of this result, produces an estimate of a neighborhood of the optimal, or local optimal, BB instance). With this choice, the NE and Running Time (RT) of the algorithm may be controlled, avoiding relatively large clades, in which an ordinary ES is prohibitive. Table 1 shows each step of how to determine k∗ for a problem with = 9. The NE on CES is determined by 1. Table 2
Table 1 Calculation of k∗ for = 9. k
m = /k
m
− mk
m
NE
1 2 k∗ = 3 4 5 6 7 8 9
9 4.5 3 2.25 1.8 1.5 1.2857 1.125 1
9 4 3 2 1 1 1 1 1
0 1 0 1 4 3 2 1 0
9 5 3 3 2 2 2 2 1
1060 100 64 84 104 152 272 524 1028
lists several values for k∗ as a function of .
NE = 2(m2k + (m − m )2−mk + 2m ).
(1)
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
39
Fig. 9. Identification of BBs (communities) of the problem by means of equivalent networks and partitioning of a selected population.
Table 2 Relationship of and k∗ , for ranging from 5 to 180.
Step 3:
5
10
20
40
50
60
70
80
85
k∗ k∗
3 90 9
3 100 9
4 120 10
5 140 10
6 150 11
7 180 12
7 200 13
8 250 14
8 300 15
The complexity of CES (CCES ) is O
2k . Based on Table 2, it
k can be seen that k increases according to in a rate that can be approximated by 2log . Assuming that this behavior is3 kept for larger than the ones evaluated in Table 2, CCES = O . log
With k∗ , an ES finds the optimal BB instance. However, not only the optimal instance is stored, but also its complement. Notice that the BB instance found directly by ES may belong to a trap (a BB instance present in the solution corresponding to the best local optimum). The EDA theory proposed in Goldberg (2002) shows that both instances predominate in the selected population. The trap (if any) is not in the vicinity of the optimal solution; otherwise, an ES in this neighborhood would converge in relatively quick way, avoiding the trap (then, it would not be a trap). Step 4: The result of Step 3 (one ES for each partition of size k) is the global and local optima and their complements for each BB of a
40
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
Fig. 10. Summary of CPA.
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
41
Fig. 11. (a) ASDP with 3 ftrap3 functions (3 separated BBs), (b) problem with 1 ftrap3 function (without overlap) and 3 ftrap3 functions in which two pairs of them have one bit in common overlapping each pair, and (c) problem with 4 ftrap3 funtions where tree pairs of them have one bit in common overlapping each pair.
42
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
This section is divided into seven experiments to evaluate the performance of the OPA. Experiment 1 uses trap problems with k ranging from 3 to 6 (first experiments using = 1.00). Experiment 2 evaluates OPA for problems based on more complex functions regarding the size of BBs and the number of local and global optima (greater multimodality). Experiment 3 presents simple function tests for evaluating OPA both in terms of NE and RT. Experiment 4 evaluates the influence of the amount of resampling for NE and RT values. Experiment 5 shows the effect of for NE and RT values. Finally, the Experiment 6 shows that there may be which results in OPA with a suitable compromise between NE and RT. Experiment 1: Problems composed of 5 ftrap functions with k ranging from 3 to 6:
Fig. 12. Fitness distribution of an initial population (n = 10 0 0, = 30) and two selected populations (truncation (10% best) and tournament size 8) for a problem composed of 6 ftrap5s (k = 5), wherein the two corresponding modes of distribution are highlighted using = 2, 00.
partitioning. From these results, CES combines all instances of the obtained values. The combination with the best evaluation is, then, the solution of the problem. CES is activated only if the partition found by CPA is larger than k∗max bits, which may be informed as an input parameter of OPA. In this paper, k∗max = 10 is used for all tests. When a BB is large, or, for example, it involves the entire chromosome, it can be inferred that: (i) CPA cannot find correlations among problem variables (as, for example, due to a poor resampling); (ii) variables are uncorrelated; (iii) the variables of the whole problem compose a single BB (such as a single trap with several bits). Fig. 13 summarizes the four CES steps in one example, where C1 and C2 are individuals of selected population predicted around the global and best local optima (Step 1).
To carry out this initial experiment, tests were conducted for ftrap3 (k = 3, = 15), ftrap4 (k = 4, = 20), ftrap5 (k = 5, = 25) and ftrap6 (k = 6, = 30) functions, so that m = 5 for all these sub-problems. The results in Fig. 14 show OPA values compared with the values obtained by BOA, CD-BOA, StrOp and StrOp+REDA algorithms. Both OPA and StrOp+REDA use 10 samples for resampling the selected population. In Fig. 14, as in other figures in this section, each point on the graph is composed of average, maximum and minimum values obtained for thirty runs of the algorithm, just replacing the seed random number in each run. These results with relatively small problems show some equivalence between OPA and StrOp+REDA and their better performance in terms of NE when compared to BOA, CD-BOA and StrOp. Fig. 14 also shows that the BOA, CD-BOA and StrOp present a bigger NE for k > 5. This is probably due to the fact that the construction of the Bayesian network requires an estimate of v = k − 1. So, for k > 5, v is overestimated, generating more computationally complex Bayesian networks to compute and, thereby, reducing the representativeness of the resulting Bayesian networks. On the other hand, OPA does not require this parameter for problems with BBs with variable size. StrOp+REDA probably can, by the resampling process, dispose BBs mistakenly great, maintaining a good performance for k > 5. Experiment 2: Problems composed of different types of deceptive functions:
7. Evaluation of OPA with deceptive problems The experiments carried out in this section use deceptive trap problems. The performance of OPA is compared to BOA and three other algorithms developed in Crocomo et al. (2013): CD-BOA, StrOp and StrOp+REDA. The BOA has been one of the most widely used algorithms in the literature for EDA performance comparisons. Furthermore, the CD-BOA, StrOp and StrOp+REDA algorithms are derived from BOA, showing that it is possible to significantly increase the original performance EDAs based on Bayesian networks for the benchmark problems considered in this paper. On the other hand, BOA uses the input parameter v, fixing the maximum number of parent vertices by variable in the Bayesian network. The ideal value of v is (k − 1 ); the BOA’s performance is reduced if v is different than the ideal value. Notice that the optimal value can be previously obtained for ASDP, as in StrOp. The algorithms developed in Crocomo et al. (2013) also require some information on v, since they use as input the interval [vl , vu ], which must contain all sizes of BBs of the problem. On the other hand, OPA does not need any parameter related to the size of a BB. A direct advantage of this feature is OPAs effectively deals with problems in which BBs have significantly different sizes.
The performance of OPA is compared to BOA, CD-BOA, StrOp and StrOp+REDA using ideal parameters to optimize four ASDPs, each composed of BBs involving one of the following functions: f3deceptive ( = 90, m = 30), ftrap5 ( = 90, m = 18), f6bipolar ( = 90, m = 15), and 15 f3deceptive functions with 9 ftrap5 functions ( = 90, m = 24). Fig. 15 summarizes such a comparison among algorithms. As in Experiment 1, these results show that OPA has a similar performance, in terms of NE, to StrOp+REDA considering 10 samples for resampling the selected population. Experiment 3: NE vs. RT for a problem composed of ftrap5: In this experiment, one ASDP composed of ftrap5 (k = 5) and m varying with (m = /k) is used to analyze the performance of OPA in terms of NE and RT as increases. Tests are carried out for values ranging from 30 to 180 by steps of 30. Figs. 16 and 17 compare the performance of OPA regarding NE and RT to the performances of BOA, CD-BOA, StrOp and StrOp+REDA algorithms. This experiment shows, again, the alignment of OPA with StrOp+REDA, with
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
43
Fig. 13. The four CES steps based on C1 and C2 predict solutions around the global and best local optima.
certain advantage for OPA in terms of NE. OPA excels in RT BOA, CD-BOA and StrOp+REDA as increases. This can be explained by the greater complexity of building a probabilistic model based on Bayesian network in every generation in StrOp+REDA. On the other hand, StrOp constructs this probabilistic model only once, presenting a RT below 5 seconds for all tests. Notice that in terms of NE, StrOp requires less evaluations than BOA and CD-BOA and more evaluations than StrOp+REDA and OPA.
Experiment 4: Effect of resampling the selected population in terms of NE and RT: To verify the influence on the number of resampling, tests were performed to a problem composed of ftrap5 with = 90, with the number of resampling varying from 1 to 10. OPA is only compared to StrOp+REDA (vl = 3 and vu = 7) because BOA, CD-BOA and StrOp do not use resampling in their algorithms.
Figs. 18 and 19 respectively show the results with respect to NE and RT as the number of resampling increases. The results of this experiment show a major difference between the two algorithms. OPA has satisfactory results even using small number of resampling (four or less). This is due to the fact that OPA generates more models (equivalent networks) from the same resampling of the selected population and that its average communication time is limited by a constant. NE values for StrOp+REDA with two resamplings correspond to the maximum NE tested by the bisection method (Sastry, 2001), i.e., StrOp+REDA does not solve the problem with the maximum NE tested and RT is relatively high in such cases. OPA presents a lower RT due to a less complex method to build the probabilistic model and because the processing can be distributed among a number of processors equal to the number of resampling, as shown in Fig. 20. It is important to notice that the time for communication between machines has been neglected to compute RT, since the parallelizing scheme used (Fig. 20) requires 2ABC communication between each processing elements, resulting in a total communica-
44
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
Fig. 14. Comparison of NE among BOA (v = 4), CD-BOA, StrOp and StrOp+REDA (with vl = 3 and vu = 7 for both), and OPA for problems composed of 5 ftrap functions with k ranging from 3 to 6.
Fig. 15. Comparison of NE among BOA (v = 4), CD-BOA, StrOp and StrOp+REDA (with vl = 3 and vu = 7 for both), and OPA for different types of deceptive functions with = 90.
tion time of tt ≤ ct 2ABC, where ct is a positive constant indicating the maximum communication time between 2 elements. This time can be amortized by ABC processes in parallel, resulting in amortized time O(1). To further verify this aspect, notice that the parallelization scheme forms a tree structure wherein the edges are associated with communications between two machines (one for sending data and the other for receiving). Thus, the total number of nodes is ABC + 1, and of edges is ABC. As shown in Fig. 20, tt = tA A + tB B + tC C, wherein tA , tB e tC are the highest communication time associated with levels A, B and C of the tree. Be tm = max {tA , tB , tC }, then tt ≤ ABCtm . This time is amortized by ABC processing of OPA in parallel occurring at level C to generate each of the network of leaves, resulting in average time O(1).
Fig. 16. Comparison of NE among OPA, BOA (v = 4), CD-BOA, StrOp and StrOp+REDA (both vl = 3 e vu = 7) for problems composed of ftrap5.
Fig. 17. Comparison of RT among OPA, BOA (v = 4), CD-BOA, StrOp and StrOp+REDA (both vl = 3 e vu = 7) for problems composed of ftrap5.
Notice that this parallelization assumes ABC processing elements; the tests performed in this experiment requested up to A = 16, B = 10. C = 10. Therefore, the achievement of all tests with OPA in amortized time O(1) would use the most of a cluster with 1.600 processing elements. Experiment 5: Influence of in NE and RT In Experiment 5, the analysis in Experiment 4 is again realized, now with varying values of in order to show the influence of the size of the resampled population (ns ) on the performance of OPA. Figs. 21 and 22 indicate that OPA has certain robustness over the
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
45
Fig. 18. NE of StrOp+REDA and OPA for a number of resampling ranging from 1 to 10 for a problem composed of ftrap5 with = 90.
Fig. 19. RT of StrOp+REDA and OPA for a number of resampling ranging from 1 to 10 for a problem composed of ftrap5 with = 90.
variation of parameter , since for ranging from 0.25 to 16.00, NE is lower. Equivalent results occur for RT, wherein ranging from 0.07 to 1.00 correspond to RT significantly lower. The intersection of both ranges ( [0.25; 16.00] and [0.07; 1.00]) is [0.25; 1.00], and it indicates values that enable a better performance of OPA in terms of both NE and RT. Experiment 6:
significantly close to StrOp, but = 0.25 results in a curve clearly separated from the group formed by StrOp, OPA0.07 and OPA0.13 for higher values of . Therefore, OPA0.13 is close to the rival algorithm with the best performance in NE (StrOp+REDA) and the rival algorithm with the best performance in RT (StrOp). This indicates that the OPA0.13 presents a compromise between NE and RT, which can be important in applications in which it is not known beforehand if the priority is a lesser NE or RT.
RT in interval 0.07 ≤ ≤ 1.00
Experiment 7:
In Experiment 5, it was found that in the interval 0.07 ≤ ≤ 1.00, OPA presents a relatively low RT (Fig. 22). In Experiment 6, the analysis in Experiment 3 is carried out again, now using values of in interval associated with lower RT found in Experiment 5. Fig. 23 shows that OPA0.13 has performance close to OPA1.00 (best algorithm in terms of NE) whereas the curves of the OPAs with 0.13 ≤ ≤ 1.00 and StrOp+REDA form a grouping. In terms of RT, Fig. 24 shows that = 0.07 and = 0.13 maintains OPA
vs. number of resamplings: The Experiment 7 puts together the analyses presented in Experiment 3 and Experiment 5. Figs. 25–30, summarize the results for this experiment. These results indicate that, for a number of resampling equal to or greater than 4 and ≥ 0.50, NE is relatively stable and close to the lowest level found. On the other hand, ≥ 0.50 may not
Fig. 20. Configuration of communication time between the processing elements.
46
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
Fig. 21. NE of Experiment 4 (ftrap5 with = 90) for ranging from 0.07 to 16.00.
Fig. 22. RT of Experiment 4 (ftrap5 with = 90) for ranging from 0.07 to 16.00.
Fig. 23. Comparison of NE among BOA (v = 4), StrOp and StrOp+REDA (with vl = 3 and vu = 7 for both), and OPA for problems composed of ftrap5, with ranging from 0.07 to 1.00.
Fig. 24. Comparison of RT among BOA (v = 4), StrOp and StrOp+REDA (with vl = 3 and vu = 7 for both), and OPA for problems composed of ftrap5, with ranging from 0.07 to 1.00..
8. Evaluation of OPA with OneMax and BinInt problems and overlapping BBs be the most appropriate in terms of RT (see Experiment 6), but it also presents an interesting compromise, since it possesses intersections with the interval 0.07 ≤ ≤ 1.00 obtained in Experiment 5. In addition, OPA0.13 , which presented the best compromise in terms of NE and RT (Experiment 6), attain significantly better results in terms of NE for number of resampling equal to 10. This is an interesting result since most resampling could be performed with parallelization, involving a constant number of processing elements in relation to the problem size and limited time of communication between these elements (Experiment 6).
The following experiments are carried out to evaluate the performance of OPA for (i) problems with overlapping trap functions, and (ii) not deceptive problems BinInt and OneMax. Experiment 1 evaluates the algorithm developed for problems involving ftrap5 functions with overlapping BBs in two bits of each BB. These tests verify the performance of the algorithms on important aspects of hierarchical problems. Experiments 2 and 3 test the proposed algorithm with two non deceptive problems (OneMax and BinInt), modeling important as-
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
Fig. 25. Trap problem composed of ftrap5 with = 30.
47
Fig. 27. Trap problem composed of ftrap5 with = 90.
Fig. 26. Trap problem composed of ftrap5 with = 60.
pects of real-world problems. BinInt is especially a difficult problem for typical EAs and EDAs. OPA1.00 is compared to BOA, CD-BOA and StrOp+REDA. The value of = 1.00 in OPA was used in the experiments of this section so that a comparison with StrOp+REDA were more appropriate, since StrOp+REDA generates a selected population with the same size of the initial population. Experiment 1: Problems composed by overlapping ftrap5 functions: This experiment is composed by deceptive functions with BBs of size 5 overlapping in two bits of each BB and ftrap5. This type of problem is used to represent hierarchical problems (Pelikan & Goldberg, 2003). From such a problem, certain limitations of the first multivariate EDAs (Harik et al., 2006) may be observed, since
Fig. 28. Trap problem composed of ftrap5 with = 120.
these algorithms ensure a near optimal solution (at leaf one BB with a wrong instance) only for ASDP. In the tests, the following values for we considered: 20, 38, 56, 74, 92 and 110. Fig. 31 shows a significant difference among the results obtained by OPA1.00 and the other algorithms. This is due to the use of CES, since for problems with, OPA1.00 has as the final partitioning only one partition involving all variables. This indicates that correlations between smaller blocks are not as strong as the correlation between all variables. Notice that in this context, OPA1.00 uses the optimal partition k∗ aiming at reducing NE.
48
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
Fig. 29. Trap problem composed of ftrap5 with = 150.
Fig. 30. Trap problem composed of ftrap5 with = 180.
Fig. 31. NE of OPA1.00 compared to BOA (v = 4) and StrOp+REDA (vl = 3 e vu = 7) for a problem composed of ftrap5 functions with two overlaps.
Fig. 32. NE of OPA1.00 compared to CD-BOA and StrOp+REDA (both vl = 1 and vu = 1) for OneMax problem.
Experiment 2: Experiment 3: OneMax problems: BinInt problems: The OneMax problem was tested for the following values of : 5, 10, 15, 20 and 25. As in Experiment 1, each resampling results in a different partitioning. The consensus of the obtained partitions of equivalent networks is a final partitioning and it does not represent the sampling partitioning, showing that is not possible to find (or that it does not exist) correlation between the problem variables (all relatively weak BBs are united in a single clade). Again, the CES of OPA results in relatively low NE, as shown in Fig. 32, indicating a certain similarity between OPA1.00 and StrOp+REDA, with the advantage of OPA1.00 presenting NE virtually constant.
This experiment is similar to Experiment 2 above, but at this time it is considered the BinInt problem. Despite the similarity with OneMax in terms of BB size (k = 1), such a problem stresses the possibility of probabilistic algorithms converge for drift (Goldberg, 2002), i.e., the most significant bits for calculating the fitness direct the evolution initially, allowing the least significant bits to easily converge at random. The results are shown in Fig. 33. As in Experiment 2, OPA1.00 and StrOp+REDA show equivalent performance.
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
49
9. Discussions
Fig. 33. NE of OPA1.00 compared to CD-BOA and StrOp+REDA (both vl = 1 and vu = 1) for BinInt problem.
Experiment 1 in Section 8, OPA1, 00 attained better results than the best algorithm tested for problems with overlapping variables, CD-BOA. In Experiments 2 and 3, OPA1, 00 achieved comparable or better performance than the StrOp+REDA, the best rival tested for problems with not correlated variables (BBs of size 1): OneMax (Experiment 2) and BinInt (Experiment 3). Experiment 4 shows that OPA can also solve problems that combine aspects considered challenging and benchmark problems of EDAs, in which all BBs have some overlap. Thus, OPA showed that it is possible to build an algorithm whose performance is similar to the best performances found in the literature for problems with significantly different difficulties. This fact is due to OPA’s ability of identifying types of correlation between variables of a problem and suitably exploiting this information through the CES. EDAs use a probabilistic model of the values of the variables of a selected group of individuals to generate new individuals. The quality of the model depends on how well these individuals sample promising regions in the search space. In other words, NE generally increases with multimodality and the dimension of the problem. The building of more representative models may require methods with relatively high computational cost, as the K2 algorithm used to build Bayesian networks in BOA. This results in an increased RT of the EDA. In this paper, we proposed the construction of models that compromise NE and RT. The CPA allows to obtain models that adequately represent the BBs of a problem with a computational cost that can be controlled by the parameter and number of resamplings from the initial population. In this paper, the CPA has been extended to be applied to the resolution of combinatorial EDAs benchmark problems with different levels of difficulty, generating OPA. Thus, this paper aims at assessing the main features of OPA, making use of comparisons of the performances of OPA and other EDAs. 10. Conclusion Each EDA has a better performance for specific problems, such as BOA is suited for hierarchical problems, CD-BOA is suited for problems with overlap, and StrOp is suited for ASDP. OPA was able to solve a larger set of problems (ASDPs, non deceptive and overlapping) with similar results to the best ones found in the correlate literature of each problem. The main contributions of the OPA are listed below:
Fig. 34. NE of OPA1.00 for a problem in which one part of the chromosome is composed of ftrap5 functions with two overlapping bits and the other part is composed of ftrap5 functions.
Experiment 4: Problem composed of separable and overlapping BBs: This experiment presents a new problem in that one part of the chromosome is composed of ftrap5 functions with two bits overlapped in the same way as Experiment 1, and the other part is composed of ftrap5 functions without overlap as in Experiment 3 of Section 7. This experiment shows that the OPA1.00 is able to solve problems with distinct difficulties with NE growing in an apparent linear way in relation to the size of the problem (). Fig. 34 shows NE values obtained for Experiment 4, with a number of resampling equals 10.
1. The resampling of selected population used by OPA results in greater efficiency in identifying the problem of partitioning; 2. The generation of various partitionings by means of the permutation of rows and columns of the distance matrix and adjacency matrix (with all nodes of the phylogram) shows the importance of resampling procedures for EDAs; 3. With parameter , the size of selected population can be reduced without reducing the algorithm’s efficiency; 4. Such a parameter can be adjusted so that OPA performs with relatively high efficiency in terms of NE as RT. OPA0.13 , for example, presented a performance similar to the one of the algorithm that had the best RT and also to the one of the algorithm that achieved the best NE in the experiments; 5. The parallelization structure of resampling involved in the CPA proves to be scalable, allowing the resolution of complex combinatorial problems with certain compromise in terms of computational cost; 6. The fact that the OPA is able to solve ASDP, not deceptive problems and problems with overlapping BBs with computational efficiency comparable to the best results obtained by different
50
A. Soares et al. / Expert Systems With Applications 78 (2017) 32–50
algorithms shows that the same EDA is able to efficiently solve a greater quantity of types of problems, since: (i) best models may be obtained with a compromise in terms of computational cost, and (ii) the obtained partitionings are better exploited according to their global and local optimal instances. Acknowledgment The authors would like to thank the CAPES and FAPESP for financial support. References Cancino, W., & Delbem, A. (2007). Inferring phylogenies by multi-objective evolutionary algorithms. International Journal of Information Technology and Intelligent Computing, 2, 1–26. Cilibrasi, R., & Vitányi, P. M. B. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51, 1523–1545. Crocomo, M. K., Martins, J. P., & Delbem, A. C. B. (2013). Decomposition of black-box optimization problems by community detection in Bayesian networks. International Journal of Nature Computing Research (IJNCR), 3(4), 1–19. doi:10.4018/jncr. 2012100101. Diestel, R. (2006). Graph theory. Electronic library of mathematics. Springer. URL: http://books.google.com.br/books?id=aR2TMYQr2CMC. Donetti, L., & Muñoz, M. A. (2004). Detecting network communities: A new systematic and efficient algorithm. Journal of Statistical Mechanics: Theory and Experiment, 2004(10), P10012. Duch, J., & Arenas, A. (2005). Community detection in complex networks using extremal optimization. Physical Review E, 72(2), 027104+. doi:10.1103/PhysRevE.72. 027104. Felsenstein, J. (20 0 0). Phylip (phylogeny inference package). http://evolution. genetics.washington.edu/phylip.html. Felsenstein, J. (2003). Inferring phylogenies (2nd). Sinauer Associates. Goldbarg, M., & Luna, H. (20 0 0). Otimização combinatória e programação linear: modelos e algoritmos. Campus. URL: http://books.google.com.br/books?id= kctuAQAACAAJ. Goldberg, D. E. (2002). The design of innovation: Lessons from and for competent genetic algorithms. Norwell, MA, USA: Kluwer Academic Publishers. Hamming, R. W. (1950). Error detecting and error correcting codes. Bell System Technical Journal, 29(2), 147–160. doi:10.1002/j.1538-7305.1950.tb00463.x. Harik, G., Lobo, F., & Sastry, K. (2006). Linkage learning via probabilistic modeling in the extended compact genetic algorithm (ECGA). In Scalable Optimization via Probabilistic Modeling (pp. 39–61).
Harik, G. R., Lobo, F. G., & Goldberg, D. E. (1999). The compact genetic algorithm. IEEE Transactions on Evolutionary Computation, 3(4), 287–297. Kraskov, A., Stogbauer, H., Andrzejak, R., & Grassberger, P. (2003). Hierarchical clustering based on mutual information. Arxiv preprint q-bio/0311039,. Larrañaga, P., & Lozano, J. A. (2001). Estimation of distribution algorithms: A new tool for evolutionary computation (genetic algorithms and evolutionary computation). Springer. Lipschutz, S., & Lipson, M. (2004). Matemática Discreta: Coleção Schaum. BOOKMAN COMPANHIA ED. URL: http://books.google.com.br/books?id=2S9bwDmD1P0C. Melo, V. V., Vargas, D. V., Crocomo, M. K., & Delbem, A. C. B. (2011). Phylogenetic differential evolution.. International Journal of Natural Computing Research, 2(1), 21–38. Mühlenbein, H. (1997). The equation for response to selection and its use for prediction. Evolutionary Computation, 5(3), 303–346. Newman, M. E. J., & Girvan, M. (2004). Finding and evaluating community structure in networks. Physical Review E, 69(2), 026113. PT: J; PN: Part 2; PG: 15. Pelikan, M., & Goldberg, D. E. (2003). Hierarchical BOA solves ising spin glasses and maxsat. In Proceedings of the 2003 international conference on genetic and evolutionary computation: Part II (pp. 1271–1282). Berlin: Springer-Verlag. URL: http://portal.acm.org/citation.cfm?id=1756582.1756586. Pelikan, M., Goldberg, D. E., & Cantú-Paz, E. (1999). Boa: The Bayesian optimization algorithm. In W. Banzhaf, J. Daida, A. E. Eiben, M. H. Garzon, V. Honavar, M. Jakiela, & R. E. Smith (Eds.), Proceedings of the genetic and evolutionary computation conference GECCO-99: I (pp. 525–532). Orlando, FL, San Fransisco, CA: Morgan Kaufmann Publishers. Pelikan, M., Goldberg, D. E., & Lobo, F. G. (2002). A survey of optimization by building and using probabilistic models. Computational Optimization and Applications, 21(1), 5–20. http://dx.doi.org/10.1023/A:1013500812258. Pelikan, M., Sastry, K., & Cantú-Paz, E. (2006). Scalable optimization via probabilistic modeling: From algorithms to applications (studies in computational intelligence). Secaucus, NJ, USA: Springer-Verlag New York, Inc. Philippsen, G. S. (2014). Estudo da influência de elementos transponíveis nos genomas das algas C. reinhardtii e V. carteri. Física - IFSC - USP (Ph.D. thesis). Qu, B.-Y., & Suganthan, P. N. (2010). Novel multimodal problems and differential evolution with ensemble of restricted tournament selection. In IEEE congress on evolutionary computation (pp. 1–7). Los Alamitos: IEEE. Saitou, N., & Nei, M. (1987). The neighbor-joining method: A new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406–425. URL: http://mbe.oxfordjournals.org/content/4/4/406.abstract. Sastry, K. (2001). Evaluation-relaxation schemes for genetic and evolutionary algorithms. Urbana, IL: University of Illinois at Urbana-Champaign, Department of General Engineering (Master’s thesis). Studier, J., & Keppler, K. (1988). A note on the neighbor-joining algorithm of Saitou and Nei. Molecular Biology and Evolution, 5(6), 729.