Identifying practical significance through statistical comparison of meta-heuristic stochastic optimization algorithms


Applied Soft Computing Journal 85 (2019) 105862


Tome Eftimov a,b,∗, Peter Korošec a

a Computer Systems Department, Jožef Stefan Institute, 1000 Ljubljana, Slovenia
b Stanford University, Palo Alto, 94305 California, USA
∗ Corresponding author at: Computer Systems Department, Jožef Stefan Institute, 1000 Ljubljana, Slovenia. E-mail addresses: [email protected] (T. Eftimov), [email protected] (P. Korošec).

Article info

Article history: Received 5 December 2018; Received in revised form 10 October 2019; Accepted 14 October 2019; Available online 21 October 2019

Keywords: Benchmarking; Practical significance; Ranking scheme; Single objective problem; Statistical comparison; Stochastic optimization algorithms

https://doi.org/10.1016/j.asoc.2019.105862

Abstract

In this paper, we propose an extension of the recently proposed Deep Statistical Comparison (DSC) approach, called practical Deep Statistical Comparison (pDSC), which takes into account practical significance when making a statistical comparison of meta-heuristic stochastic optimization algorithms for single-objective optimization. To achieve practical significance, two variants of the standard DSC ranking scheme are proposed. The first, called sequential pDSC, takes practical significance into account by preprocessing the independent optimization runs in sequential order. The second, called Monte Carlo pDSC, avoids any dependency of practical significance on the ordering of the optimization runs. The analysis of identifying practical significance on benchmark tests for single-objective problems shows that, in some cases, the two variants of pDSC and the Chess Rating System for Evolutionary Algorithms (CRS4EAs) approach give different conclusions. Preprocessing for practical significance is carried out in a similar way, but there are cases where the conclusions about practical significance differ, which comes from the different statistical concepts used to identify practical significance.

© 2019 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

1. Introduction

Over recent years, many meta-heuristic stochastic optimization algorithms have been developed, which makes analyzing the performance of a new algorithm compared with the state-of-the-art algorithms a crucial task [1]. There exist many studies in which theoretical evaluations of optimization algorithms are presented [2,3]. On the other hand, in the empirical evaluation of optimization algorithms, one of the most common ways to compare the performance of algorithms is to use statistical tests based on hypothesis testing [4]. This requires sufficient knowledge from the user, including knowing which conditions must be fulfilled so that the relevant and proper statistical test (e.g., parametric or nonparametric) can be applied [5,6]. The aim of making such a comparison is to find the strengths and weaknesses of a newly introduced algorithm. Existing approaches for assessing the performance of two or more algorithms on the same problem consider only statistical significance, and to the best of our knowledge there is only one approach that considers practical significance [7]. Even though it is crucial in research that state-of-the-art methods for assessing performance are related to statistical significance, there is still a large gap between theory and real-world scenarios. This is because sometimes the statistical significance that exists is not significant in a practical sense [8].

Practical significance is defined as the relationship among the qualities of solutions of real-world applications [8] and is concerned with the usefulness of the obtained solutions to the problem, defined by its quality metrics (results) in the real world [9,10]. Practical significance goes beyond statistical significance and tries to answer a larger question about differences: ''Are the differences among samples big enough to have real meaning?'' Let us assume that for some ϵ we do not consider the difference between two values to be of real-world relevance. We can then define fϵ(x1, x2) as

$$f_\epsilon(x_1, x_2) = \begin{cases} \text{equal}, & \text{if } x_1 < x_2 + \epsilon \text{ and } x_1 > x_2 - \epsilon \\ \text{different}, & \text{otherwise,} \end{cases} \qquad (1)$$

which identifies whether there is a practical difference between the values x1 and x2 (in our case, estimations of the quality of solutions), based on which we can identify practical significance. Algorithms that check for practical significance therefore do not compare two values based on their strict difference, but using Eq. (1). In the real world we could define a separate ϵ for each value pair, but for simplicity we use the same ϵ for all comparisons.
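As a concrete illustration of Eq. (1), the following Python sketch (not taken from the paper's implementation) checks whether two solution-quality values are practically equal for a user-chosen ϵ:

```python
# Illustrative sketch of the practical-equality check in Eq. (1); eps is the
# user-chosen level below which a difference has no real-world relevance.
def practically_equal(x1: float, x2: float, eps: float) -> bool:
    """Return True ('equal' in Eq. (1)) if x1 lies within the eps-neighborhood of x2."""
    return x2 - eps < x1 < x2 + eps

# Example: two energy consumptions that differ by 0.01 W are 'equal' at eps = 0.1.
print(practically_equal(134.00, 134.01, 0.1))   # True
print(practically_equal(1e-10, 1e-16, 1e-9))    # True: both errors below the practical level
```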



Fig. 1. Parametrization of rotor and stator lamination for electric motor design. Source: Taken from [11].

Let us look at some examples where practicality is relevant. In the real world it is often important to consider practical significance, and its influence can be observed in different industrial tasks: for example, in production scheduling, where simulations are based on predefined production norms (expected performance measures) and the simulation result is only an approximation of the actual performance. In production, various products are often created with some tolerances that minimally influence their performance, which means that products within these tolerances are deemed equal. A more illustrative example is presented in Fig. 1, which shows the parametrization of an electric motor design, its rotor and stator. Without going into details (see [11]), there are 11 independent parameters, which define several important geometric characteristics. A simulation tool is required to estimate the performance of the proposed solution. Simulation tools also have their own accuracy, i.e., the correlation between the simulation-based estimated performance and the real-world performance; in most cases there is always some discrepancy. This discrepancy determines another aspect of practicality: differences found between two simulations at the level of some decimal might not be reflected in the end product. Finally, some minute improvement in the final result is not relevant in the real world, e.g., an electrical energy consumption of 134.00 W and 134.01 W is practically the same for the electric motor designer or user. Taking into account all these aspects, one can easily see why looking at practicality is important in industry. From the perspective of algorithms, if one had to choose between a fast- and a slow-converging algorithm, where the latter returns better but practically insignificant results, one would always select the first algorithm. It is therefore important to understand the difference between practical significance and statistical significance. For instance, consider two algorithms designed to find the optimal solution to a given problem. The first algorithm solves the problem with an approximation error of 10^−10, while the second does so with an error of 10^−16. Although a statistical significance can be found when comparing the outcomes of these two algorithms, this difference can be insignificant in a practical sense with respect to the application of the problem.

Besides defining practical significance in terms of industry needs, a similar definition also applies to benchmarking, where comparisons are made with the results from the literature or from various competitions (e.g., BBOB [12]). The end results of such comparisons can be influenced by the way the test function values are calculated (the influence of computer accuracy, i.e., values closer to zero can be represented more accurately, while values further away are rounded less accurately, due to the IEEE 754 standard, which defines floating-point numbers with a mantissa and a biased exponent), by the type of variables used (e.g., 4-byte float, 8-byte float, 10-byte float), or, as in the case of competitions where an error threshold is defined, by when algorithms can be stopped. All these cases can result in different final solutions, which are not representative of actual performance if only statistical significance is identified. To include practicality in the comparison of algorithm performance, the obtained results must first be preprocessed at the practical level ϵ in which we are interested. The preprocessed data can then be evaluated using a statistical test in order to find whether there is a practical significance between the performances of the compared algorithms with regard to the obtained results. The main contributions of the paper are:

• A methodology for identifying practical significance between meta-heuristic stochastic optimization algorithms through statistical comparison for one-dimensional data;
• The proposal of two ranking schemes for dealing with practical significance;
• Experimental results showing that the proposed methodology provides promising results and can be used for identifying practical significance.

The paper is organized as follows: Section 2 gives an overview of the related work, while Section 3 introduces the two variants of the practical Deep Statistical Comparison ranking scheme. Section 4 presents a benchmarking analysis of stochastic optimization algorithms on multiple problems using the proposed ranking schemes, followed by a discussion of the results. Section 5 gives a discussion of the proposed approach and, finally, the conclusions of the paper are presented in Section 6.

2. Related work

To determine the strengths and weaknesses of a stochastic optimization algorithm, its performance must be compared with that of the state-of-the-art algorithms. Several competitions for comparing optimization algorithms at evolutionary computation conferences (e.g., the Genetic and Evolutionary Computation Conference (GECCO) [13] and the IEEE Congress on Evolutionary Computation (CEC) [14]) have been organized, in which different optimization algorithms are compared using a set of benchmark functions. Consequently, many papers in the field of evolutionary algorithms follow the guidelines from such competitions, since they provide benchmark functions and experiments for comparison. The results depend on the applied performance metrics and statistical techniques. The statistical analyses must be chosen with care because they provide the information from which the conclusions are drawn. In this paper, we concentrate on how to evaluate performance [5,15–18] using statistical approaches, once the problem instances are chosen and the experimental environment is established. Demšar [19] presented suitable statistical tests that can be used to compare machine learning algorithms. Garcia et al. [5] discussed two scenarios that use nonparametric tests for analyzing the behavior of evolutionary algorithms for optimization problems: single-problem analysis and multiple-problem analysis. In single-problem analysis, the algorithms are compared using the results from multiple runs obtained on a single benchmark problem, while in multiple-problem analysis they are compared on a set of benchmark problems.


In the case of multiple-problem analysis, the authors use the average of the results for each problem for each algorithm involved in the comparison. This approach is known as the ''common approach'' because it is the most widely used for comparing meta-heuristic stochastic optimization algorithms. We need to be aware that averages are known to be sensitive to outliers, so instead of averages the medians can also be used in the common approach, because they are less sensitive to outliers. The differences between averages or medians can lie in some ϵ-neighborhood (e.g., 10^−9, 10^−10, etc.), which further affects the ranking results [18]. The decision to use the average or the median can thus have a great influence on the final result of the statistical test. In cases when averages or medians lie in some ϵ-neighborhood, the common approach identifies this as a difference in the performance of the algorithms, even though the distributions of the multiple runs may be the same [20]. The opposite scenario can also happen: the averages or medians are the same, but the distributions of the multiple runs are not, suggesting a difference between the performance of the algorithms. For these reasons, we recently proposed Deep Statistical Comparison (DSC) for comparing meta-heuristic stochastic optimization algorithms over multiple single-objective problems [18]. The DSC approach allows users to acquire more robust statistical results, which reduces the cases in which wrong conclusions are drawn due to the presence of outliers or due to the ranking scheme of some standard statistical tests [18,20,21] (e.g., the Wilcoxon signed-rank test, the Friedman test, the Iman–Davenport test). Additionally, we must be aware that the sequential order of the algorithm results used for the comparison can impact the outcome of the statistical analysis. Veček et al. [7,22] presented another approach known as the Chess Rating System for Evolutionary Algorithms (CRS4EAs), which is especially designed for empirically dealing with practical significance. In this case, the practical significance is calculated using confidence intervals. All previously described approaches, except the CRS4EAs, deal only with statistical significance, neglecting practical significance, which is relevant for obtaining a better understanding of the compared algorithms. For this reason, we describe two variants of the DSC approach and compare them with the CRS4EAs approach.

2.1. Chess rating system for evolutionary algorithms

The CRS4EAs is an empirical algorithm for comparing and ranking evolutionary algorithms [7]. The algorithm is like a chess tournament where the optimization algorithms are considered as chess players and a comparison between the solutions of two optimization algorithms as the outcome of a single game. The algorithm is a tournament in which m algorithms (chess players) solve k optimization problems over n independent runs. One comparison is considered as one game, so all m · k · n solutions are compared pairwise and the results are recorded as wins, losses, or draws. The winner is the algorithm that calculates the best solution for the considered problem. If the difference between the solutions of two algorithms is smaller than some predefined ϵd (the draw limit, or practical significance of the position evaluation function values), the result of the game is considered a draw. The performance of a player is generated after all k · n · m · (m − 1)/2 comparisons have been performed.
Each player is then described using a rating R, a rating deviation RD, and a rating (confidence) interval RI, calculated according to the Glicko-2 rating system [23]. R is the absolute power of a player, and RD is an indicator of how reliable that rating evaluation is. The choice of the interval RI is similar to selecting the significance level α when performing a statistical test.
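The interval-overlap logic can be sketched as follows. The construction of the 95% RI as R ± 2·RD is an assumption inferred from the ratings reported later for problem f7 (Section 4.3.2); the function names are illustrative and this is not the exact Glicko-2 computation:

```python
# Hedged sketch of the CRS4EAs-style check: the 95% rating interval is assumed to be
# R +/- 2*RD, consistent with the values reported later in the paper
# (e.g., R = 1580.6356, RD = 91.5124 gives RI = [1397.611, 1763.660]).
def rating_interval(r: float, rd: float, width: float = 2.0):
    return r - width * rd, r + width * rd

def intervals_disjoint(r1: float, r2: float, rd: float) -> bool:
    lo1, hi1 = rating_interval(r1, rd)
    lo2, hi2 = rating_interval(r2, rd)
    return hi1 < lo2 or hi2 < lo1   # no overlap -> CRS4EAs declares a significant difference

# Single-problem ratings for f7 reported in Section 4.3.2 (RD = 91.5124):
print(intervals_disjoint(1580.6356, 1661.2711, 91.5124))  # False: intervals overlap
print(intervals_disjoint(1661.2711, 1258.0933, 91.5124))  # True: no overlap
```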


If the RIs of two algorithms do not overlap, then the performances of the two algorithms differ significantly. For the experimental setup, the authors proposed that RD = 50 is an appropriate value. The tournament parameters, the number of independent runs n and the draw limit ϵd, have to be determined at the start of the tournament, along with selecting the algorithms and the benchmark test suite. However, assessing statistical significance by determining whether confidence intervals overlap, as the CRS4EAs does, is statistically inconsistent [24]. This is one of the most common statistical errors made when practitioners are asked to compare confidence intervals and determine whether they overlap. When the 95% confidence intervals of two independent populations do not overlap, there is indeed a statistical significance between them (at the 0.05 level of significance). However, the opposite is not necessarily true, i.e., confidence intervals may overlap and there can still be a statistical significance between them. Because the CRS4EAs detects practical significance only when the confidence intervals do not overlap, this might lead to wrong conclusions: in the case where the confidence intervals overlap, the CRS4EAs determines that there is no practically significant difference, although there may actually be one. According to the authors, the rank update computations in CRS4EAs are less complicated and less sensitive than those in statistical significance tests, the method is less sensitive to outliers, reliable rankings can be obtained over a small number of runs, and the conservativeness/liberality is easier to control [23].

2.2. Deep statistical comparison

The main feature of DSC [18] is its ranking scheme, which is based on the whole distribution, instead of using only one statistic, such as the average or median, to describe the distribution. DSC removes the sensitivity of such simple statistics to the data found in common approaches and allows more robust statistics to be calculated without taking extra measures to prevent the influence of outliers or of some error inside the ϵ-neighborhood. It consists of two steps: (1) the use of a ranking scheme to obtain data for statistical comparison, and (2) a standard omnibus statistical test, which uses the data obtained in step (1). The benefits of using this approach over the common approach can be found in [18].

2.2.1. DSC ranking scheme

By using a statistical test for comparing distributions, pairwise comparisons between the algorithms must be made and the obtained p-values are organized in a matrix. Because this matrix has a fundamental role in the DSC ranking scheme, its definition is reintroduced here. Let m and k be the number of algorithms and the number of problems, respectively, and n the number of runs performed by each algorithm on the same problem. Let Xi be an n × m matrix, where i = 1, . . . , k. The rows of this matrix correspond to the results obtained from multiple runs on the ith problem, and the columns correspond to the different algorithms. The matrix element Xi[j, l], where j = 1, . . . , n and l = 1, . . . , m, corresponds to the result obtained in the jth run of the lth algorithm. Let αT be the significance level used by the statistical test for comparing distributions. By using a statistical test for comparing these distributions, m · (m − 1)/2 pairwise comparisons between the algorithms are performed, and the results are organized in an m × m matrix, Mi, as follows:

$$M_i[a, b] = \begin{cases} p_{\text{value}}, & a \neq b \\ 1, & a = b, \end{cases} \qquad (2)$$

where a and b are algorithms and a, b = 1, . . . , m.
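A minimal sketch of this step is shown below on synthetic data. For simplicity it uses the two-sample Kolmogorov–Smirnov test from SciPy; the paper's experiments use the two-sample Anderson–Darling test, which could be substituted (e.g., via scipy.stats.anderson_ksamp):

```python
# Sketch of building the m x m matrix M_i of Eq. (2) for one problem. X_i is the
# n x m matrix of results (one column of n runs per algorithm). The KS test stands in
# for the two-sample Anderson-Darling test used in the paper.
import numpy as np
from scipy import stats

def pairwise_pvalue_matrix(X_i: np.ndarray) -> np.ndarray:
    n, m = X_i.shape
    M_i = np.ones((m, m))                      # diagonal: M_i[a, a] = 1
    for a in range(m):
        for b in range(a + 1, m):
            p = stats.ks_2samp(X_i[:, a], X_i[:, b]).pvalue
            M_i[a, b] = M_i[b, a] = p          # the matrix is symmetric
    return M_i

rng = np.random.default_rng(0)
X_i = np.column_stack([rng.normal(0.0, 1.0, 15),   # algorithm 1: 15 runs
                       rng.normal(0.0, 1.0, 15),   # algorithm 2: same distribution
                       rng.normal(3.0, 1.0, 15)])  # algorithm 3: clearly different
print(np.round(pairwise_pvalue_matrix(X_i), 3))
```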


Because multiple pairwise comparisons are made, this can lead to a family-wise error rate (FWER) [25]. To reduce the FWER, the Bonferroni correction [26] is used to correct the obtained p-values. The matrix Mi is reflexive and symmetric, but the key point for the ranking scheme is to check transitivity, since it is used to determine the ranking. For this purpose, the matrix Mi′ is defined using the following equation:

$$M_i'[a, b] = \begin{cases} 1, & M_i[a, b] \geq \alpha_T / C_m^2 \\ 0, & M_i[a, b] < \alpha_T / C_m^2. \end{cases} \qquad (3)$$

The elements of the matrix Mi′ are defined according to the p-values obtained by the statistical test used for the comparison of distributions, corrected by the Bonferroni correction. However, other corrections for all-vs-all pairwise comparisons, such as Shaffer's correction, can be used [5]. For example, if the element Mi′[a, b] is 1, this means that the null hypothesis used in the statistical test for comparing distributions, which is the hypothesis that the two data samples obtained by the ath and the bth algorithm come from the same distribution, is not rejected. If the element Mi′[a, b] is 0, this means that the null hypothesis is rejected, so the two data samples come from different distributions. Before any ranking is performed, the matrix Mi′^2 is calculated to check transitivity. If Mi′ has a 1 in each position for which Mi′^2 has a non-zero element, transitivity is satisfied; otherwise it is not. At the end, each algorithm is assigned a ranking according to the transitivity analysis. If the distributions are the same, the algorithms are ranked the same. If the distributions are different, the algorithms are ranked according to some performance metric specified by the experimental design. Depending on which features of an algorithm's behavior are of interest, the performance metric can be the average, the median, a combination of the average with the standard deviation, etc. In DSC, the average value is used as the metric because the average is an unbiased estimator. More details about the DSC ranking scheme and how the algorithms are given their rankings can be found in [18].
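A simplified sketch of this ranking step is given below (assuming minimization). It covers only the two extreme outcomes — all distributions equal, or ranking by the average otherwise — whereas the full scheme in [18] also averages ranks within groups of statistically indistinguishable algorithms:

```python
# Simplified sketch of the DSC ranking step for one problem: threshold the p-values
# (Eq. (3) with the Bonferroni correction), check transitivity through M'^2, and
# assign ranks. Lower average results are assumed to be better (minimization).
import numpy as np

def dsc_ranks(M_i: np.ndarray, X_i: np.ndarray, alpha_T: float = 0.05) -> np.ndarray:
    m = M_i.shape[0]
    n_pairs = m * (m - 1) / 2                          # C_m^2 pairwise comparisons
    M1 = (M_i >= alpha_T / n_pairs).astype(int)        # Eq. (3)
    transitive = np.all(M1[M1 @ M1 > 0] == 1)          # M' has 1 wherever M'^2 is non-zero
    if transitive and M1.all():
        return np.full(m, (m + 1) / 2.0)               # all distributions equal: shared rank
    means = X_i.mean(axis=0)                           # otherwise rank by the average result
    ranks = np.empty(m)
    ranks[means.argsort()] = np.arange(1, m + 1)
    return ranks

# Toy example with three algorithms: the first two look alike, the third differs.
M_i = np.array([[1.0, 0.80, 0.001], [0.80, 1.0, 0.002], [0.001, 0.002, 1.0]])
X_i = np.column_stack([np.full(15, 0.1), np.full(15, 0.2), np.full(15, 3.0)])
print(dsc_ranks(M_i, X_i))   # [1. 2. 3.]
```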

3. Practical deep statistical comparison

The search for practical significance can be made as a preprocessing step. For benchmarking purposes, two variants of the practical Deep Statistical Comparison (pDSC) ranking scheme are introduced. Both variants are extensions of the DSC ranking scheme that address the question of practical significance. In the first variant, called sequential pDSC, the practical significance is treated similarly to the draw limit in CRS4EAs. This means that the preprocessing is done in sequential order, in which the gth run from one algorithm is compared with the gth run of the other algorithm, g = 1, . . . , n. In this variant, the order of the independent runs can affect practical significance, because these algorithms are stochastic in nature and there is no guarantee that the same order will be produced if the algorithms are run again. To avoid this, a second scheme known as Monte Carlo pDSC is proposed, where a Monte Carlo simulation for each pairwise comparison is made using permutations of the independent runs of both algorithms. This simulates N runs of the algorithms in which the final solutions are obtained in a different order. Both variants require a threshold parameter ϵp that should be given by the user. When ϵp is set to 0, both ranking schemes reduce to the standard DSC ranking scheme. Although we could show only the results for the second variant, results for both variants are presented, to make a fair comparison with CRS4EAs, which takes algorithm results in sequential order, and to show the advantages of the pDSC approach. At the same time, it is possible to show the influence of ordering, which can be observed when comparing the sequential and the Monte Carlo approach.

3.1. Sequential pDSC ranking scheme

In the sequential variant of the pDSC ranking scheme, the search for practical significance is made using a preprocessing step. First, the level of practical significance, ϵp, is set by the user. Before comparing the distributions of multiple runs between each pair of algorithms, (a, b), a, b = 1, . . . , m, the data used by each pairwise comparison must be preprocessed in order to consider the practical significance. For each algorithm applied to each problem, data is available from n independent runs. If the ath and bth algorithms are involved in a pairwise comparison, their data is preprocessed as follows:

$$\begin{cases} a_g = b_g = \dfrac{a_g + b_g}{2}, & |a_g - b_g| \leq \epsilon_p \\ a_g = a_g,\; b_g = b_g, & |a_g - b_g| > \epsilon_p, \end{cases} \qquad (4)$$

where g = 1, . . . , n represents one independent run. The preprocessed data for each pairwise comparison is further used in Eq. (2) of the standard DSC ranking scheme in order to define the p-value for each pairwise comparison. The following steps remain the same as in the standard DSC ranking scheme: the obtained p-values are corrected to control the FWER, the transitivity of the matrix Mi′ is checked, and the rankings of the algorithms are assigned accordingly. The preprocessing step is only a pre-step that must be performed to handle the practical significance level. The flowchart that represents the relevant steps of the pDSC is presented in Fig. 2, where it can be seen how data is preprocessed when the difference in fitness evaluations is smaller than some predefined ϵ (see the shaded squares, with the affected algorithm runs marked with ∗). This kind of preprocessing requires that the algorithms are run the same number of times. However, some algorithms may be run fewer times than others. In order to make this preprocessing possible, some sampling technique (e.g., bootstrapping) is needed to sample data for the algorithm that was run the fewest number of times, so that both algorithms have the same amount of data.
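A small sketch of the preprocessing in Eq. (4), under the simplifying assumption that both algorithms already have the same number of runs (the bootstrapping case mentioned above is not shown):

```python
# Sketch of the sequential preprocessing of Eq. (4): runs are paired in their stored
# order, and a pair whose difference is within eps_p is replaced by the pair's mean.
import numpy as np

def preprocess_sequential(a_runs, b_runs, eps_p):
    a = np.asarray(a_runs, dtype=float).copy()
    b = np.asarray(b_runs, dtype=float).copy()
    close = np.abs(a - b) <= eps_p          # |a_g - b_g| <= eps_p
    mid = (a[close] + b[close]) / 2.0
    a[close] = mid                          # both runs replaced by their mean
    b[close] = mid
    return a, b

a, b = preprocess_sequential([0.250, 6.308, 2.381], [0.397, 0.966, 2.483], eps_p=0.2)
print(a)   # first and third pairs merged, second pair left unchanged
print(b)
```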

3.2. Monte Carlo pDSC ranking scheme

In the sequential variant of the pDSC ranking scheme, practical significance is considered in the same way as the draw limit in CRS4EAs. In the preprocessing step, the solutions of the two algorithms, a and b, for the same problem over the gth run are compared. So the first run from the ath algorithm is compared with the first run of the bth algorithm, the second run from the ath algorithm with the second run from the bth algorithm, and so on. However, if n new independent runs for the algorithms are obtained, the practical significance may be different. To avoid the dependence of the practical significance on the order of the independent runs, and to obtain a more robust comparison, a Monte Carlo [27] variant of the pDSC ranking scheme is proposed, which uses permutation testing. In this case, for the n independent runs of each algorithm, a different order of the data is defined by generating its permutations. The number of such permutations is n!. Because multiple pairwise comparisons are used, for each pairwise comparison, (a, b), (n!)^2 different combinations exist, which can be used to check for practical significance. Each combination consists of one permutation of the set of multiple runs from the ath algorithm and one permutation from the set of multiple runs from the bth algorithm. Eq. (2) requires one p-value for each pairwise comparison, which is further corrected to control the FWER. The ranking scheme randomly selects N different combinations for each pairwise comparison, and a statistical test is used to compare the distributions for each combination. For each combination, a search for practical significance is made by preprocessing the data using Eq. (4).
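The loop below sketches this simulation for one pairwise comparison: it draws N random orderings, applies the preprocessing of Eq. (4), and collects one p-value per combination (how the N p-values are then reduced to a single one is described in the rest of this subsection). The KS test again stands in for the two-sample Anderson–Darling test used in the paper:

```python
# Sketch of the Monte Carlo pDSC simulation for one pairwise comparison (a, b).
import numpy as np
from scipy import stats

def monte_carlo_pvalues(a_runs, b_runs, eps_p, N=1000, seed=0):
    rng = np.random.default_rng(seed)
    a_runs = np.asarray(a_runs, dtype=float)
    b_runs = np.asarray(b_runs, dtype=float)
    pvals = np.empty(N)
    for i in range(N):
        a = a_runs[rng.permutation(a_runs.size)]   # one random ordering per algorithm
        b = b_runs[rng.permutation(b_runs.size)]
        close = np.abs(a - b) <= eps_p             # preprocessing of Eq. (4)
        mid = (a[close] + b[close]) / 2.0
        a[close] = mid
        b[close] = mid
        pvals[i] = stats.ks_2samp(a, b).pvalue
    return pvals

rng = np.random.default_rng(1)
pvals = monte_carlo_pvalues(rng.normal(0.0, 1.0, 15), rng.normal(0.5, 1.0, 15), eps_p=0.1)
print((pvals < 0.05).mean())   # fraction of rejections, used by Eq. (5) below
```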


Fig. 2. The flowchart for sequential pDSC.

By selecting different combinations for the same pairwise comparison, N p-values are obtained. Therefore, we need to find a way of selecting the appropriate p-value from the set of N p-values. One way is permutation testing, which uses the average p-value as an estimator, but averaging can be affected by outliers. To obtain a more robust estimation of the p-value, we can use the distribution of the p-values and its mode. The mode of a continuous probability distribution is the value at which the probability density reaches its maximum. In this case, it can happen that the distribution of the obtained p-values is multimodal, so the question that arises is which mode to use. For this reason, a variable V is defined as the number of combinations for which the null hypothesis is rejected. To estimate whether the compared distributions are the same or not, a prior level of significance, αp, needs to be set, which is the probability threshold that tells us when the distributions are considered different:

$$\begin{cases} P(V) < \alpha_p, & a \text{ and } b \text{ have the same distribution} \\ P(V) \geq \alpha_p, & a \text{ and } b \text{ have different distributions.} \end{cases} \qquad (5)$$
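A minimal sketch of this decision rule, under the assumption that P(V) is estimated as the fraction of the N combinations in which the null hypothesis was rejected:

```python
# Sketch of Eq. (5): decide whether the compared distributions are treated as equal,
# assuming P(V) is estimated by the rejection fraction over the N combinations.
import numpy as np

def same_distribution(pvals, alpha_T=0.05, alpha_p=0.05) -> bool:
    p_reject = np.mean(np.asarray(pvals) < alpha_T)   # estimate of P(V)
    return bool(p_reject < alpha_p)                   # True: treated as the same distribution

print(same_distribution([0.30, 0.41, 0.22, 0.08, 0.55]))  # True: no rejections at 0.05
```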

If the distributions of the algorithms involved in the pairwise comparison are the same, then the p-value for this pairwise comparison can be randomly selected from the subset of N p-values that are greater than α. If the distributions are different, a kernel density estimation [28] is used to estimate the probability density function of the subset of N p-values that are lower than α.
The mode of the probability density function is used as the appropriate p-value, which will be used in the M matrix. In this case, the kernel density estimation and the mode are used because, if a p-value is chosen at random, it can be further affected when it needs to be corrected to control the FWER. The FWER is the probability of making one or more false discoveries, or type I errors, among all hypotheses when performing multiple hypothesis testing. The relevant steps of the Monte Carlo pDSC are given as a flowchart in Fig. 3.

3.3. Selection of an appropriate omnibus statistical test

After ranking the algorithms, the next step is to choose an appropriate statistical test. Guidelines on which test to choose (parametric or nonparametric) are given in [5]. The result gives information about practical significance, because the data that is preprocessed takes into account the practical threshold.

4. Evaluation

To evaluate the proposed approaches, we followed an already established methodology [7] and show how different practical levels can drastically change the results of a comparison. For this reason we chose to use benchmarking test functions [29], which can be used to simulate real-world scenarios with respect to the influence of practicality on the results of comparisons. However, since the performance of meta-heuristic stochastic optimization algorithms can drastically change in real-world scenarios, we also evaluated the proposed approach in the case of the Parameter Estimation (PE) of biochemical systems, where the comparison between algorithms is done using a set of synthetic biochemical models characterized by an increasing number of dimensions in the search space [9,30].

4.1. Black-box benchmarking 2015 test problems

To evaluate how both variants of the pDSC approach work for practical significance and to compare them with the CRS4EAs, we used the results from the Black-Box Benchmarking 2015 (BBOB 2015) competition [29], which consists of a well-defined experimental protocol and a well-documented open source ecosystem with abundant information on benchmarking problems and transformations. BBOB 2015 is a competition that provides single-objective problems for benchmarking, which are then optimized by the competing algorithms. The competition organizer prepared different so-called instances of a problem to prevent authors from tailoring algorithms to use information from previous runs. This means that the shape of the problem search space remains static, but the optimal value is at a different location in each run. Since we are comparing algorithms with regard to the offset to the optimum, we can still treat all runs as independent. We used BBOB 2015 results because selecting a good benchmark is surprisingly difficult, and as such one should avoid biased experiments and evaluations. From the competition, 15 out of 18 algorithms were used for evaluation, because 3 of the algorithms did not provide data organized in the template provided. This fact does not influence the experiment, since the goal is to show how the approaches work when practical significance is considered and not to provide information about which algorithm is the best. The algorithms used were: BSif [31], BSifeg [31], BSqi [31], BSrr [31], CMA-CSA [32], CMA-MSR [32], CMA-TPA [32], GP1-CMAES [33], GP5-CMAES [33], RAND-2xDefault [34], RF1-CMAES [33], RF5-CMAES [33], Sif [31], Sifeg [31], and Srr [31].

For each one, the results for 22 different noiseless test problems in five dimensionalities (2, 3, 5, 10, and 20) are available. More details about the test problems can be found in [35]. For the experiments, a statistical comparison was performed by comparing the performance of the algorithms in solving the 22 different noiseless problems with the dimension fixed at 10. For each algorithm, data for 15 runs on each problem are provided and used in our experiments. The 15 runs are taken because this is the number of runs provided by the GECCO 2015 competition. The power analysis of the DSC approach and the common approach using 15 runs has already been presented in [18]. Fifteen runs are sufficient to see how the proposed approach works, but if more runs are available the approach can only benefit from them, which is the case for any statistical analysis.

4.2. Biochemical models for parameter estimation of biochemical systems

To perform an evaluation using real-world optimization problems, we used the results obtained for the parameter estimation of biochemical systems presented in [9,30]. Biochemical systems are mechanistic and fully parameterized reaction-based models (RBMs) [36]. An RBM is defined by specifying the set of molecular species, the set of biochemical reactions that describe the interactions among the species, the set of kinetic constants associated with the reactions, and the initial concentration of each species. Eight algorithms were used for the PE problem: ABC [37], Covariance Matrix Adaptation ES (CMA-ES) [38], DE with rand/1/bin strategy [39], Estimation of Distribution Algorithm (EDA) [40], GA [41], PSO [42], and its fuzzy-based settings-free variant FST-PSO [43]. More details about the algorithms and their implementation can be found in [30]. To evaluate the algorithms, 12 different randomly generated RBMs of increasing size were used. Six RBMs are characterized by 25 reactions and species, while the other six RBMs are characterized by 50 reactions and species. For each algorithm, 15 repetitions were performed on each model.

4.3. Experiments

Five experiments are presented to compare the results achieved by both versions of pDSC and CRS4EAs:

• In the first experiment, the practical threshold is set to 0 (i.e., ϵp = 0 and ϵd = 0). Both variants of the proposed pDSC ranking scheme then act the same as the standard DSC ranking scheme. We did this in order to see the statistical significance that can be detected by the pDSC approach and the CRS4EAs;
• In the second experiment, the focus is on practical significance, and the results obtained from the sequential variant of the pDSC ranking scheme are compared with those from CRS4EAs;
• In the third experiment, the results from the Monte Carlo variant of the pDSC ranking scheme are compared with those from CRS4EAs;
• The fourth experiment represents the most common scenario in benchmarking for comparing a newly proposed algorithm with state-of-the-art algorithms, which involves multiple comparisons with a control algorithm (i.e., a one-versus-all comparison). The results from the sequential variant of pDSC and CRS4EAs are presented;
• Finally, the last experiment shows the results of the pDSC when it is used for comparing algorithms on real-world optimization problems, which in our case is the parameter estimation of biochemical systems.


Fig. 3. The flowchart for Monte Carlo pDSC.

DSC approaches are based on the idea of null-hypothesis testing, where the null and the alternative hypothesis are:

• H0: there is no practical significance between the performance of the compared algorithms using a set of benchmark problems;
• HA: there is a practical significance between the performance of the compared algorithms using a set of benchmark problems.

The same problem is also addressed by the CRS4EAs, with the difference that it follows the idea of confidence intervals for identifying differences between the performances of the algorithms. Common to both approaches is that they use a practical threshold to preprocess the data before using it with an approach that tries to identify whether there is a practical significance. The pDSC approach follows null-hypothesis testing, while the CRS4EAs uses confidence intervals to provide the same kind of result.

In the case when the practical threshold for both of them is set to 0, they report the result only for statistical significance. We should mention that comparing these approaches is not an easy task, since they use different concepts; however, we compared them at the level of the final results (i.e., whether a statistical significance can be detected or not). For the pDSC approaches, a two-sample Anderson–Darling (AD) test with a significance level of αT = 0.05 is used as the statistical criterion for comparing distributions in the pDSC ranking scheme. The benefit of using the two-sample AD test instead of the two-sample Kolmogorov–Smirnov test with the DSC approach is presented in [44]. The transformed data is further used as input data for an appropriate omnibus statistical test, which in our case is the Friedman test with the significance level α set at 0.05. For the CRS4EAs, the recommended RD = 50 is used and the results for the 95% RI are reported. In both cases, the end result determines whether there is a statistical significance between the performance of the compared algorithms, using the same significance level.
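The final omnibus step can be sketched as follows, with purely illustrative per-problem rankings (they are not taken from the tables below); the Friedman test is taken from SciPy:

```python
# Sketch of the omnibus step: feed the per-problem pDSC rankings (rows: problems,
# columns: algorithms) to the Friedman test at alpha = 0.05.
import numpy as np
from scipy import stats

rankings = np.array([[2.0, 1.0, 3.0],
                     [1.5, 1.5, 3.0],
                     [2.0, 2.0, 2.0],
                     [3.0, 1.0, 2.0],
                     [2.5, 1.0, 2.5]])     # 5 problems x 3 algorithms (illustrative values)

statistic, p_value = stats.friedmanchisquare(*rankings.T)   # one argument per algorithm
print(statistic, p_value)
print("difference detected" if p_value < 0.05 else "no significant difference")
```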


Table 1
Ratings for the algorithms obtained by the CRS4EAs.

     Algorithms                          R1        R2        R3
1    RF1-CMAES, Sifeg, BSif              1683.497  1410.604  1405.899
2    BSifeg, Sif, BSif                   1581.554  1573.713  1344.733
3    Sifeg, GP5-CMAES, BSif              1659.972  1430.992  1409.035
4    BSif, RF1-CMAES, Sif                1614.490  1479.611  1405.899
5    BSrr, Sif, Srr                      1554.892  1482.740  1462.359
6    RF1-CMAES, Sifeg, BSifeg            1650.562  1476.475  1372.953
7    BSif, BSqi, BSifeg                  1540.777  1481.180  1478.043
8    BSifeg, RF1-CMAES, Srr              1647.425  1504.705  1347.869
9    GP5-CMAES, BSif, Srr                1627.037  1437.266  1435.697
10   BSifeg, BSrr, Srr                   1573.713  1463.928  1462.359
11   RF1-CMAES, BSifeg, Sif              1578.418  1536.072  1385.510
12   Sifeg, GP1-CMAES, GP5-CMAES         1622.332  1547.051  1330.618
13   Srr, BSif, BSifeg                   1674.087  1430.992  1394.920
14   Sifeg, GP5-CMAES, RF1-CMAES         1661.540  1474.906  1363.553
15   BSrr, Sif, Sifeg                    1570.576  1526.662  1402.762
16   BSifeg, GP1-CMAES, BSqi             1547.051  1509.410  1443.354
17   GP1-CMAES, BSifeg, Sif              1572.144  1473.338  1454.518
18   BSqi, Srr, GP1-CMAES                1537.640  1509.410  1452.949
19   BSqi, Sifeg, Sif                    1617.627  1504.705  1377.668
20   RF1-CMAES, BSifeg, Sif              1578.418  1536.072  1385.510
21   RF1-CMAES, GP1-CMAES, Srr           1650.562  1583.123  1266.315
22   BSif, Srr, CMA-CSA                  1805.829  1401.194  1292.977
23   BSrr, BSif, CMA-TPA                 1743.095  1471.770  1285.135
24   Sifeg, CMA-TPA, BSqi                1794.851  1401.194  1303.956
25   BSif, BSqi, CMA-MSR                 1758.779  1379.237  1361.985

The statistical comparisons for the pDSC approach are performed in the R programming language, while the results for the CRS4EAs are obtained using Java. CRS4EAs gives results for pairwise comparisons (pairs of algorithms), i.e., similar to post-hoc statistical tests, while pDSC gives the result of an omnibus statistical test and, if there is a difference between the performance of the algorithms, a post-hoc test is used to find the difference between the pairs of algorithms. In order to properly compare both approaches, the results are compared on the level provided by the pDSC (i.e., on the level of the omnibus statistical test). This means that if there is at least one pair of algorithms for which the CRS4EAs finds a statistical significance, then we assume that there is a statistical significance between the performance of all algorithms.

4.3.1. A comparison between pDSC and CRS4EAs considering only statistical significance

In this experiment, the focus is on statistical significance, so the practical significance level is set to 0, ϵp = 0 and ϵd = 0. To compare pDSC and CRS4EAs, 25 different combinations, each involving a comparison of three algorithms, are selected. To make a comparison with our previous work easier, the same combinations are used as in [18], where the standard DSC approach is compared with the common approach. The ratings Rl, l = 1, 2, 3, obtained by the CRS4EAs for each algorithm in each combination are presented in Table 1. The rating Rl corresponds to the rating of the lth indexed algorithm in the combination. These ratings are then used to calculate the 95% RI in order to find whether there is a difference between the performance of the compared algorithms using the CRS4EAs. Table 2 shows the results of the comparisons when the pDSC approach and the CRS4EAs are used for finding a statistical significance, neglecting practical significance. There are 9 combinations where the results differ, i.e., the 1st, 2nd, 3rd, 4th, 6th, 8th, 12th, 13th, and 14th combination. Using the pDSC, the result is that there is no statistical significance between the performance of the algorithms, while using the CRS4EAs, the result is that there is a statistical significance between the performance of the algorithms.

Table 2
Statistical comparisons of 3 algorithms.

     Algorithms                          pDSC approach   CRS4EAs
1    RF1-CMAES, Sifeg, BSif              1               0
2    BSifeg, Sif, BSif                   1               0
3    Sifeg, GP5-CMAES, BSif              1               0
4    BSif, RF1-CMAES, Sif                1               0
5    BSrr, Sif, Srr                      1               1
6    RF1-CMAES, Sifeg, BSifeg            1               0
7    BSif, BSqi, BSifeg                  1               1
8    BSifeg, RF1-CMAES, Srr              1               0
9    GP5-CMAES, BSif, Srr                1               1
10   BSifeg, BSrr, Srr                   1               1
11   RF1-CMAES, BSifeg, Sif              1               1
12   Sifeg, GP1-CMAES, GP5-CMAES         1               0
13   Srr, BSif, BSifeg                   1               0
14   Sifeg, GP5-CMAES, RF1-CMAES         1               0
15   BSrr, Sif, Sifeg                    1               1
16   BSifeg, GP1-CMAES, BSqi             1               1
17   GP1-CMAES, BSifeg, Sif              1               1
18   BSqi, Srr, GP1-CMAES                1               1
19   BSqi, Sifeg, Sif                    1               0
20   RF1-CMAES, BSifeg, Sif              1               1
21   RF1-CMAES, GP1-CMAES, Srr           0               0
22   BSif, Srr, CMA-CSA                  0               0
23   BSrr, BSif, CMA-TPA                 0               0
24   Sifeg, CMA-TPA, BSqi                0               0
25   BSif, BSqi, CMA-MSR                 0               0

0 - means that there is a statistical significance between the performance of the algorithms.
1 - means that there is no statistical significance between the performance of the algorithms.

For these combinations, the results from the CRS4EAs are the same as the results obtained using the common approach presented in [18]. In [18], it was shown that the DSC approach gives satisfactory results for the same combinations compared to the common approach, since it is less affected by averaging or by insignificant statistical differences that exist between data values. The power analysis of the DSC approach is presented in [18], while the robustness of the results using different statistics is analyzed in [21]. For the other six combinations out of the first 15, both approaches give the same result, that there is no statistical significance between the performance of the algorithms, but they differ from the results of the common approach [18]. Since the same combinations are used in this paper, we can empirically investigate the behavior of the CRS4EAs compared to the common approach and the DSC. Using the same 15 combinations as in our previous work [18], where we showed that the common approach is affected by outliers in these 15 combinations, the CRS4EAs could catch outliers only in six combinations out of 15. This indicates that CRS4EAs is indeed less sensitive to outliers than the common approach, but not to the same degree as the DSC. For the 19th combination, the pDSC and the common approach provide the same result, that there is no statistical significance between the performance of the algorithms, which differs from the result obtained using the CRS4EAs. Using these results, it follows that the pDSC gives more robust results than the CRS4EAs when the focus is on statistical significance.

4.3.2. A comparison between the sequential pDSC and CRS4EAs considering practical significance

In the second experiment, the focus is on practical significance with the sequential variant of the pDSC ranking scheme. Preprocessing is performed similarly to CRS4EAs. In both cases, the practical significance level, ϵp or ϵd, is set by the user. Table 3 shows the results obtained by the sequential variant of the pDSC, while Table 4 shows the results for CRS4EAs. In both tables the results are presented for different practical significance levels.


Fig. 4. Empirical cumulative distributions for f7 at a practical significance level of 10^−1: (a) (RF1-CMAES, Sifeg), (b) (RF1-CMAES, BSif), (c) (Sifeg, BSif).

Comparing both sets of results, we find that the compared approaches behave differently. They give the same results for seven combinations, i.e., the 7th, 15th, 16th, 17th, 18th, 19th, and 24th combination, while the remaining results differ. This happens because the CRS4EAs is more sensitive to outliers than the DSC, which is also reflected in the sequential variant of the pDSC, since the preprocessed data can be considered as any data that is input into the standard DSC ranking scheme. Because the results reported in Tables 3 and 4 tell us whether there is a practical significance between the performances of the algorithms with regard to all 22 benchmark problems, and since the outcome depends on the rankings obtained on the individual problems, we decided to further analyze the pDSC and CRS4EAs using problem f7. The CRS4EAs provides the following ratings: RF1-CMAES - 1580.6356, Sifeg - 1661.2711, and BSif - 1258.0933, with RD = 91.5124, which is the RD used for a single-problem analysis. The 95% RIs are [1397.611, 1763.66], [1478.246, 1844.296], and [1075.069, 1441.118], respectively. From them, it follows that the CRS4EAs finds a practical significance on this benchmark problem, due to the difference observed between the algorithms (Sifeg, BSif). However, the pDSC approach gives the following p-values: 0.833 for (RF1-CMAES, Sifeg), 0.174 for (RF1-CMAES, BSif), and 0.026 for (Sifeg, BSif). Each hypothesis is tested at a statistical significance level of 0.016, taking into account the Bonferroni correction.

In this example, the transitivity is satisfied and the algorithms obtain the same ranking, meaning there is no statistical significance between them. To show this, the empirical cumulative distributions for the pairwise comparisons between the three algorithms RF1-CMAES, Sifeg, and BSif are given in Fig. 4. The x-axis, ''Value'', represents the achieved error of the optimization runs. Using this figure, it can be seen that there is no practically significant difference between them. From Tables 3 and 4, we can see that both approaches return ''logical'' results, since after a certain ϵ threshold the results change from ''different'' (denoted by 0) to ''equal'' (denoted by 1). ''Different'' means that there is a practical significance between the performance of the compared algorithms using the set of 22 benchmark problems, while ''equal'' means that there is no practical significance between the performance of the compared algorithms using the set of 22 benchmark problems. As expected, the practical significance of the compared algorithms differs for different practical significance levels. To see what happens in the case of a single-problem analysis, let us look at the 21st combination in Table 3 and practical significance levels ϵp ∈ {10^−1, 10^0, 10^1}. This combination is shown in detail in Table 5, where the rankings from the sequential variant of the pDSC ranking scheme are given for the selected values of the practical significance level.


Table 3
Statistical comparison of three algorithms using the sequential variant of the pDSC.

     Algorithms                          ϵp
                                         10^−9  10^−6  10^−3  10^−2  10^−1  10^0  10^1  10^2
1    RF1-CMAES, Sifeg, BSif              1      1      1      1      1      1     1     1
2    BSifeg, Sif, BSif                   1      1      1      1      1      1     1     1
3    Sifeg, GP5-CMAES, BSif              1      1      1      1      1      1     1     1
4    BSif, RF1-CMAES, Sif                1      1      1      1      1      1     1     1
5    BSrr, Sif, Srr                      1      1      1      1      1      1     1     1
6    RF1-CMAES, Sifeg, BSifeg            1      1      1      1      1      1     1     1
7    BSif, BSqi, BSifeg                  1      1      1      1      1      1     1     1
8    BSifeg, RF1-CMAES, Srr              1      1      1      1      1      1     1     1
9    GP5-CMAES, BSif, Srr                1      1      1      1      1      1     1     1
10   BSifeg, BSrr, Srr                   1      1      1      1      1      1     1     1
11   RF1-CMAES, BSifeg, Sif              1      1      1      1      1      1     1     1
12   Sifeg, GP1-CMAES, GP5-CMAES         1      1      1      1      1      1     1     1
13   Srr, BSif, BSifeg                   1      1      1      1      1      1     1     1
14   Sifeg, GP5-CMAES, RF1-CMAES         1      1      1      1      1      1     1     1
15   BSrr, Sif, Sifeg                    1      1      1      1      1      1     1     1
16   BSifeg, GP1-CMAES, BSqi             1      1      1      1      1      1     1     1
17   GP1-CMAES, BSifeg, Sif              1      1      1      1      1      1     1     1
18   BSqi, Srr, GP1-CMAES                1      1      1      1      1      1     1     1
19   BSqi, Sifeg, Sif                    1      1      1      1      1      1     1     1
20   RF1-CMAES, BSifeg, Sif              1      1      1      1      1      1     1     1
21   RF1-CMAES, GP1-CMAES, Srr           0      0      0      1      1      1     1     1
22   BSif, Srr, CMA-CSA                  0      0      0      0      0      0     1     1
23   BSrr, BSif, CMA-TPA                 0      0      0      0      0      0     1     1
24   Sifeg, CMA-TPA, BSqi                0      0      0      0      0      0     1     1
25   BSif, BSqi, CMA-MSR                 0      0      0      0      0      0     1     1

0 - indicates that the null hypothesis is rejected, p-value < 0.05.
1 - indicates that the null hypothesis fails to be rejected, p-value ≥ 0.05.
p-value corresponds to the p-value obtained by the Friedman test.

Table 4
Statistical comparison of three algorithms using CRS4EAs.

     Algorithms                          ϵd
                                         10^−9  10^−6  10^−3  10^−2  10^−1  10^0  10^1  10^2
1    RF1-CMAES, Sifeg, BSif              0      0      0      0      0      0     1     1
2    BSifeg, Sif, BSif                   1      0      0      0      0      0     1     1
3    Sifeg, GP5-CMAES, BSif              0      0      0      0      0      0     1     1
4    BSif, RF1-CMAES, Sif                0      0      0      0      0      0     1     1
5    BSrr, Sif, Srr                      1      0      0      0      0      1     1     1
6    RF1-CMAES, Sifeg, BSifeg            0      0      0      0      0      0     1     1
7    BSif, BSqi, BSifeg                  1      1      1      1      1      1     1     1
8    BSifeg, RF1-CMAES, Srr              0      0      0      0      0      0     1     1
9    GP5-CMAES, BSif, Srr                0      0      0      0      0      0     1     1
10   BSifeg, BSrr, Srr                   1      0      0      0      0      0     1     1
11   RF1-CMAES, BSifeg, Sif              0      0      0      0      0      0     1     1
12   Sifeg, GP1-CMAES, GP5-CMAES         0      0      0      0      1      1     1     1
13   Srr, BSif, BSifeg                   0      0      0      0      0      0     1     1
14   Sifeg, GP5-CMAES, RF1-CMAES         0      0      0      0      0      1     1     1
15   BSrr, Sif, Sifeg                    1      1      1      1      1      1     1     1
16   BSifeg, GP1-CMAES, BSqi             1      1      1      1      1      1     1     1
17   GP1-CMAES, BSifeg, Sif              1      1      1      1      1      1     1     1
18   BSqi, Srr, GP1-CMAES                1      1      1      1      1      1     1     1
19   BSqi, Sifeg, Sif                    1      1      1      1      1      1     1     1
20   RF1-CMAES, BSifeg, Sif              0      0      0      0      0      0     1     1
21   RF1-CMAES, GP1-CMAES, Srr           0      0      0      0      0      0     0     1
22   BSif, Srr, CMA-CSA                  0      0      0      0      0      0     0     1
23   BSrr, BSif, CMA-TPA                 0      0      0      0      0      0     0     1
24   Sifeg, CMA-TPA, BSqi                0      0      0      0      0      0     1     1
25   BSif, BSqi, CMA-MSR                 0      0      0      0      0      0     0     1

0 - indicates that there is a statistical significance between the performance of the algorithms using the 95% RI.
1 - indicates that there is no statistical significance between the performance of the algorithms using the 95% RI.

From this table, we see that the rankings for different practical significance levels differ. The rankings obtained for ϵp = 10^−1 and ϵp = 10^0 differ only in one problem, f18. When the practical significance level increases to ϵp = 10^1, the rankings differ from those at ϵp = 10^−1 and ϵp = 10^0 in eight problems. To see what happens at the single-problem level, we focus on problem f18. Here, the rankings differ for different values of practical significance. This is because comparisons of distributions are made to obtain the rankings and, as a result of the preprocessing, the distributions of the data change. To explain this, the results for problem f18 are presented in detail. In Figs. 5, 6, and 7, the empirical cumulative distributions for the pairwise comparisons between the three algorithms RF1-CMAES, GP1-CMAES, and Srr are given for different practical significance levels. The x-axis, ''Value'', represents the achieved error of the optimization runs. When the practical significance level is ϵp = 10^−1 (Table 5), the rankings are 2.00, 1.00, and 3.00, respectively.


Table 5
Rankings for the algorithms RF1-CMAES, GP1-CMAES, and Srr, using the sequential variant of the pDSC ranking scheme.

       (a) ϵp = 10^−1                   (b) ϵp = 10^0                    (c) ϵp = 10^1
F      RF1-CMAES  GP1-CMAES  Srr        RF1-CMAES  GP1-CMAES  Srr        RF1-CMAES  GP1-CMAES  Srr
f1     2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00
f2     3.00       2.00       1.00       3.00       2.00       1.00       3.00       2.00       1.00
f3     2.50       2.50       1.00       2.50       2.50       1.00       2.50       2.50       1.00
f4     3.00       2.00       1.00       3.00       2.00       1.00       3.00       2.00       1.00
f5     2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00
f6     3.00       1.50       1.50       3.00       1.50       1.50       3.00       1.50       1.50
f7     2.50       1.00       2.50       2.50       1.00       2.50       2.00       2.00       2.00
f8     3.00       2.00       1.00       3.00       2.00       1.00       2.00       2.00       2.00
f9     3.00       1.50       1.50       3.00       1.50       1.50       3.00       1.50       1.50
f10    2.50       1.00       2.50       2.50       1.00       2.50       2.50       1.00       2.50
f11    2.50       1.00       2.50       2.50       1.00       2.50       2.50       1.00       2.50
f12    2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00
f13    3.00       1.50       1.50       3.00       1.50       1.50       3.00       2.00       1.00
f14    2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00
f15    1.50       1.50       3.00       1.50       1.50       3.00       1.50       1.50       3.00
f16    3.00       2.00       1.00       3.00       2.00       1.00       2.00       2.00       2.00
f17    2.00       1.00       3.00       2.00       1.00       3.00       2.00       2.00       2.00
f18    2.00       1.00       3.00       1.50       1.50       3.00       2.00       2.00       2.00
f19    2.00       3.00       1.00       2.00       3.00       1.00       2.00       2.00       2.00
f20    2.50       2.50       1.00       2.50       2.50       1.00       2.00       2.00       2.00
f21    2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00
f22    2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00       2.00

The p-values from the two-sample AD test are 0.013 for (RF1-CMAES, GP1-CMAES), 0.000 for (RF1-CMAES, Srr), and 0.000 for (GP1-CMAES, Srr). Each hypothesis is tested at a significance level of 0.016, taking into account the Bonferroni correction. In this example, all distributions differ (Fig. 5), the transitivity is not satisfied, and the algorithms obtain their rankings according to their average values. When the practical significance level increases to ϵp = 10^0 (Table 5), the rankings are 1.50, 1.50, and 3.00, respectively. In this case, the p-values for the pairwise comparisons are 0.021 for (RF1-CMAES, GP1-CMAES), 0.000 for (RF1-CMAES, Srr), and 0.000 for (GP1-CMAES, Srr), which are again tested at a significance level of 0.016, taking into account the Bonferroni correction. From the p-values, the distributions of the algorithms RF1-CMAES and GP1-CMAES are the same and the transitivity is satisfied, so the algorithms are split into two sets, {RF1-CMAES, GP1-CMAES} and {Srr}, and the algorithms obtain a ranking from the set to which they belong. When the practical significance level increases to ϵp = 10^1 (Table 5), the rankings are 2.00, 2.00, and 2.00, respectively. The p-values are 1.000 for (RF1-CMAES, GP1-CMAES), 0.08 for (RF1-CMAES, Srr), and 0.06 for (GP1-CMAES, Srr). From the obtained p-values and from Fig. 7, we see that the distributions are the same and all the algorithms receive the same ranking. To see what happens when the order of the independent runs changes, a different permutation of the runs for problem f18 is used for each algorithm at the practical significance level ϵp = 10^0 (Table 6). Using this order of independent runs for problem f18, the rankings obtained using the sequential variant of the pDSC ranking scheme are 2.00, 1.00, and 3.00, respectively. Previously, they were 1.50, 1.50, and 3.00. We can therefore conclude that the order of the independent runs influences the preprocessing, which affects practical significance. It also affects the CRS4EAs, because different orders result in a different number of wins, losses, and draws, which are used to calculate the ratings that are then used to calculate the RI [7].

4.3.3. A comparison between Monte-Carlo pDSC and CRS4EAs considering practical significance

In this experiment, we use the Monte Carlo variant of the pDSC ranking scheme. For each algorithm, the number of independent runs provided by the BBOB 2015 for each problem is n = 15.

Table 6
Permutations of independent runs obtained for problem f18 by the algorithms RF1-CMAES, GP1-CMAES, and Srr.

      RF1-CMAES   GP1-CMAES   Srr
1     9.864       2.439       9.862
2     2.509       0.971       9.693
3     6.259       0.387       6.309
4     1.579       3.848       3.970
5     6.208       0.396       9.908
6     1.510       0.579       6.307
7     0.250       0.397       24.844
8     6.308       0.966       15.842
9     2.381       0.056       6.305
10    5.576       1.577       3.882
11    0.957       0.604       23.460
12    2.244       0.391       6.121
13    2.490       6.271       14.341
14    1.584       0.244       3.930
15    2.497       2.483       9.937

For each algorithm, 15! permutations can be generated on a single problem. To avoid the influence of the order of the independent runs for each pairwise comparison, a Monte-Carlo simulation is performed in which N = 1000 different combinations out of (15!)^2 are randomly selected. For each combination, the preprocessing and a two-sample AD test (αT = 0.05) are used to obtain a p-value. Because 1000 p-values are obtained for each pairwise comparison, but the Monte Carlo variant of the pDSC ranking scheme requires only one, Eq. (5) is used to select a p-value with a prior level of significance, αp = 0.05. The results of the Friedman test with statistical significance α = 0.05 are presented in Table 7. These results for the different practical significance levels are the same as those obtained using the sequential variant. The order of the data provided by the BBOB 2015 in these combinations does not affect the results. However, the order of the independent runs can affect the results, as seen in the previous example, where a different order of the data for one problem changed the result. To see what happens at the single-problem level when the Monte-Carlo variant is used, we focus on the 21st combination from Table 7, problem f18, where ϵp = 10^0. The three algorithms involved in this comparison are RF1-CMAES, GP1-CMAES, and Srr. For each pairwise comparison, 1000 permutations of the 15 independent runs of each algorithm for problem f18 were selected.

Table 7
Statistical comparison of three algorithms using the Monte-Carlo variant of the pDSC.

                                     ϵp
#   Algorithms                       10^-9  10^-6  10^-3  10^-2  10^-1  10^0  10^1  10^2
1   RF1-CMAES, Sifeg, BSif           1      1      1      1      1      1     1     1
2   BSifeg, Sif, BSif                1      1      1      1      1      1     1     1
3   Sifeg, GP5-CMAES, BSif           1      1      1      1      1      1     1     1
4   BSif, RF1-CMAES, Sif             1      1      1      1      1      1     1     1
5   BSrr, Sif, Srr                   1      1      1      1      1      1     1     1
6   RF1-CMAES, Sifeg, BSifeg         1      1      1      1      1      1     1     1
7   BSif, BSqi, BSifeg               1      1      1      1      1      1     1     1
8   BSifeg, RF1-CMAES, Srr           1      1      1      1      1      1     1     1
9   GP5-CMAES, BSif, Srr             1      1      1      1      1      1     1     1
10  BSifeg, BSrr, Srr                1      1      1      1      1      1     1     1
11  RF1-CMAES, BSifeg, Sif           1      1      1      1      1      1     1     1
12  Sifeg, GP1-CMAES, GP5-CMAES      1      1      1      1      1      1     1     1
13  Srr, BSif, BSifeg                1      1      1      1      1      1     1     1
14  Sifeg, GP5-CMAES, RF1-CMAES      1      1      1      1      1      1     1     1
15  BSrr, Sif, Sifeg                 1      1      1      1      1      1     1     1
16  BSifeg, GP1-CMAES, BSqi          1      1      1      1      1      1     1     1
17  GP1-CMAES, BSifeg, Sif           1      1      1      1      1      1     1     1
18  BSqi, Srr, GP1-CMAES             1      1      1      1      1      1     1     1
19  BSqi, Sifeg, Sif                 1      1      1      1      1      1     1     1
20  RF1-CMAES, BSifeg, Sif           1      1      1      1      1      1     1     1
21  RF1-CMAES, GP1-CMAES, Srr        0      0      0      0      0      0     1     1
22  BSif, Srr, CMA-CSA               0      0      0      0      0      0     1     1
23  BSrr, BSif, CMA-TPA              0      0      0      0      0      0     1     1
24  Sifeg, CMA-TPA, BSqi             0      0      0      0      0      0     1     1
25  BSif, BSqi, CMA-MSR              0      0      0      0      0      0     1     1
0 indicates that the null hypothesis is rejected (p-value < 0.05); 1 indicates that the null hypothesis is not rejected (p-value ≥ 0.05). The p-value is the one obtained by the Friedman test.

Table 8
Multiple comparisons with a control algorithm (CMA-CSA) by using multiple Wilcoxon tests with the sequential variant of the pDSC ranking scheme.

j   CMA-CSA vs.       p-value
1   CMA-TPA           1.7208e−01
2   RAND-2xDefault    4.3968e−10
3   BSif              1.1054e−05
4   GP1-CMAES         2.4090e−08
5   RF5-CMAES         4.3968e−10
6   Sifeg             2.8937e−05
7   BSrr              2.8937e−05
8   BSqi              2.8937e−05
9   Srr               2.8937e−05

For each combination, which contains one permutation from the first algorithm and one permutation from the second algorithm, the data is preprocessed using Eq. (4) and then used by the two-sample AD test. After processing the 1000 different combinations, the ranking scheme uses Eq. (5) to select one p-value according to a prior level of significance, αp. Fig. 8 shows the kernel density estimation of the p-values for each pairwise comparison. The blue vertical line corresponds to the mode of the probability density function, which is the value selected by Eq. (5): 0.008 for (RF1-CMAES, GP1-CMAES), 0.000 for (RF1-CMAES, Srr), and 0.000 for (GP1-CMAES, Srr). When these p-values are tested with the Bonferroni correction (significance level 0.016), the transitivity is not satisfied and the obtained rankings are 2.00, 1.00, and 3.00, respectively. This result is different from the one obtained by the sequential variant of the pDSC.

4.3.4. Multiple comparisons with a control algorithm using the sequential variant of the pDSC
In the previous three experiments, we compared three algorithms. However, in a typical comparison of meta-heuristic algorithms, more than three algorithms are often compared with the proposed one, in the form of multiple comparisons with a control algorithm. In this case, the data for the statistical analysis is obtained using the sequential variant of the pDSC ranking scheme and then used in an appropriate omnibus statistical test. If the p-value is smaller than the predefined statistical significance, the null hypothesis is rejected and a post-hoc test is needed, which requires a correction procedure such as the Bonferroni–Dunn procedure [45], the Holm procedure [46], or the Hochberg procedure [47]. However, when the number of algorithms increases, the correction of the p-values can influence the results. To avoid this, multiple comparisons with a control algorithm can be performed with the sequential variant of the pDSC ranking scheme, without the influence of a correction procedure, by running multiple Wilcoxon tests, one for each pairwise comparison. If we try to draw a conclusion from a larger number of pairwise comparisons, we accumulate error from combining them and consequently lose control of the FWER. Because of this, the true statistical significance for combining pairwise comparisons is given by

$p_{value} = 1 - \prod_{i=1}^{k-1} \left[ 1 - p_{value_{H_i}} \right]$.  (6)

Here, the sequential variant of the pDSC approach is used, because the results obtained from it are the same as those obtained from the Monte-Carlo variant of the pDSC. For this experiment, the practical significance level is ϵp = 10^-1 for the sequential variant of the pDSC ranking scheme and ϵd = 10^-1 for the CRS4EAs. The algorithms are CMA-TPA, RAND-2xDefault, BSif, GP1-CMAES, RF5-CMAES, Sifeg, BSrr, BSqi, CMA-CSA, and Srr, so a comparison is made between nine algorithms and a control one. The control is the CMA-CSA algorithm, which was randomly selected from these 10 algorithms. The Wilcoxon test compares two algorithms, and the p-values of the pairwise comparisons are independent from one another (i.e., a one-versus-all scenario). To control the FWER, the true statistical significance [5] is calculated using Eq. (6).


Fig. 5. Empirical cumulative distributions for f18 at practical significance level of 10^-1: (a) (RF1-CMAES, GP1-CMAES), (b) (RF1-CMAES, Srr), (c) (GP1-CMAES, Srr).

Looking at Table 8, before the FWER is controlled, it would be incorrect to conclude at a significance level of α = 0.05 that the performance of the CMA-CSA algorithm differs with practical significance from that of RAND-2xDefault, BSif, GP1-CMAES, RF5-CMAES, Sifeg, BSrr, BSqi, and Srr. Considering the pairwise comparisons independently, there exists practical significance between the performance of CMA-CSA and each of these eight algorithms, since the p-values are smaller than α = 0.05. However, the true significance for combining the pairwise comparisons for these eight hypotheses can only be obtained using Eq. (6); in our case it equals 0.000, which is smaller than the significance level of α = 0.05. From this we can conclude that there exists practical significance between the performance of CMA-CSA and these eight algorithms.
The same 10 algorithms were further tested using the CRS4EAs, and the obtained ratings are presented in Table 9. The 95% RIs are used with RD = 50 to report each player's RI. If the RIs of two algorithms do not overlap, the performances of these two algorithms are statistically different. In this example, the RI of the CMA-CSA algorithm is [1713.2604, 1913.2604] and does not overlap with the RIs of Srr, Sifeg, GP1-CMAES, BSqi, BSrr, BSif, RAND-2xDefault, and RF5-CMAES, so there is a practical difference between the CMA-CSA algorithm and each of the other eight algorithms. In this instance, the result is the same as that obtained with the sequential variant of the pDSC, but in general the results of the two approaches could differ.
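As a quick check of Eq. (6), the combined significance of the eight rejected pairwise hypotheses from Table 8 can be computed directly; the value below reproduces the 0.000 (to three decimals) reported above.

```python
import math

# p-values of the eight rejected pairwise comparisons from Table 8 (j = 2..9)
pvals = [4.3968e-10, 1.1054e-05, 2.4090e-08, 4.3968e-10,
         2.8937e-05, 2.8937e-05, 2.8937e-05, 2.8937e-05]

p_combined = 1.0 - math.prod(1.0 - p for p in pvals)   # Eq. (6)
print(f"{p_combined:.3f}")                             # 0.000 < alpha = 0.05
```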

Table 9
Ratings obtained by the CRS4EAs for 10 algorithms.

Algorithm   CMA-TPA     RAND-2xDefault   BSif        GP1-CMAES   RF5-CMAES
Rating      1795.6143   1201.9396        1408.9745   1527.6046   1157.2131

Algorithm   Sifeg       BSrr             BSqi        CMA-CSA     Srr
Rating      1557.4805   1476.0643        1488.4689   1813.2604   1573.3794
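For reference, the overlap check described above can be reproduced from Table 9. The sketch assumes, consistently with the interval [1713.2604, 1913.2604] reported for CMA-CSA, that the 95% RI is the rating ± 2·RD with RD = 50; this is an assumption inferred from the text, not the CRS4EAs definition itself.

```python
RD = 50  # rating deviation used to build the 95% rating intervals (RIs)
ratings = {
    "CMA-TPA": 1795.6143, "RAND-2xDefault": 1201.9396, "BSif": 1408.9745,
    "GP1-CMAES": 1527.6046, "RF5-CMAES": 1157.2131, "Sifeg": 1557.4805,
    "BSrr": 1476.0643, "BSqi": 1488.4689, "CMA-CSA": 1813.2604, "Srr": 1573.3794,
}

def ri(rating):
    # assumed 95% rating interval: rating +/- 2 * RD
    return rating - 2 * RD, rating + 2 * RD

lo_c, hi_c = ri(ratings["CMA-CSA"])
for name, rating in ratings.items():
    if name == "CMA-CSA":
        continue
    lo, hi = ri(rating)
    overlap = lo <= hi_c and lo_c <= hi  # overlapping RIs -> no difference reported
    print(f"CMA-CSA vs {name}: {'overlap' if overlap else 'no overlap (difference)'}")
```

Running this confirms that only CMA-TPA's interval overlaps with that of CMA-CSA.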

4.3.5. A comparison done using real-world optimization problems
To evaluate and show the usage of the pDSC on real-world optimization problems, the Monte-Carlo variant was used to compare the data obtained for eight algorithms on 12 different biochemical models presented in [30]. The 12 models can be treated as independent real-world problems. First, we randomly selected three out of the eight algorithms to present the results on a single-problem level for different practical thresholds. Additionally, we compared all eight algorithms on the whole set of real-world problems (i.e., the 12 biochemical models) by performing a multiple comparison with a control algorithm, a scenario that has already been explained in Section 4.3.4. Let us assume that three algorithms are available for the parameter optimization of biochemical systems and that we are interested in which algorithm provides the best results.


The three randomly selected algorithms are CMA-ES, DE, and PSO. The algorithm rankings obtained for each model and for the different practical thresholds (10^-1, 0, 10^1, 2·10^1, and 3·10^1) are presented in Table 10. In Table 10, we can see that the rankings of the algorithms can differ for the same model when the practical threshold increases.

Fig. 6. Empirical cumulative distributions for f18 at practical significance level of 10^0: (a) (RF1-CMAES, GP1-CMAES), (b) (RF1-CMAES, Srr), (c) (GP1-CMAES, Srr).

Table 10
Statistical comparison of three algorithms using the Monte-Carlo variant of the pDSC on real-world optimization problems.

                    ϵp = 10^-1             ϵp = 0                 ϵp = 10^1              ϵp = 2·10^1            ϵp = 3·10^1
Model               CMA-ES  DE    PSO      CMA-ES  DE    PSO      CMA-ES  DE    PSO      CMA-ES  DE    PSO      CMA-ES  DE    PSO
Model1 (25 × 25)    1.50    3.00  1.50     1.50    3.00  1.50     1.50    3.00  1.50     2.00    2.00  2.00     2.00    2.00  2.00
Model2 (25 × 25)    1.00    3.00  2.00     1.00    3.00  2.00     1.50    3.00  1.50     1.50    3.00  1.50     1.50    3.00  1.50
Model3 (25 × 25)    3.00    1.50  1.50     3.00    1.50  1.50     3.00    1.50  1.50     3.00    1.50  1.50     3.00    1.50  1.50
Model4 (25 × 25)    2.00    2.00  2.00     2.00    2.00  2.00     2.00    2.00  2.00     2.00    2.00  2.00     2.00    2.00  2.00
Model5 (25 × 25)    2.50    2.50  1.00     2.50    2.50  1.00     2.50    2.50  1.00     2.50    2.50  1.00     2.50    2.50  1.00
Model6 (25 × 25)    3.00    2.00  1.00     3.00    2.00  1.00     3.00    2.00  1.00     3.00    1.50  1.50     3.00    1.50  1.50
Model1 (50 × 50)    3.00    1.50  1.50     3.00    1.50  1.50     3.00    1.50  1.50     3.00    1.50  1.50     3.00    1.50  1.50
Model2 (50 × 50)    3.00    2.00  1.00     3.00    2.00  1.00     3.00    2.00  1.00     3.00    1.50  1.50     3.00    1.50  1.50
Model3 (50 × 50)    1.00    3.00  2.00     1.00    3.00  2.00     1.00    3.00  2.00     1.00    2.50  2.50     1.00    2.50  2.50
Model4 (50 × 50)    3.00    2.00  1.00     3.00    2.00  1.00     3.00    2.00  1.00     3.00    2.00  1.00     3.00    1.50  1.50
Model5 (50 × 50)    3.00    1.50  1.50     3.00    1.50  1.50     3.00    1.50  1.50     3.00    2.00  1.00     2.00    2.00  2.00
Model6 (50 × 50)    3.00    2.00  1.00     3.00    2.00  1.00     3.00    2.00  1.00     3.00    2.00  1.00     3.00    2.00  1.00
p-value             0.03                   0.03                   0.02                   0.04                   0.15

For example, let us focus on the second model (i.e., Model2 25 × 25). If the experts working on PE for biochemical systems assume that every difference between the fitness values is crucial, they are interested in the scenario in which the practical threshold is set to 0 (ϵp = 0). In this case, we can conclude that CMA-ES provides results that are statistically significantly better than those of DE and PSO, which are ranked third and second, respectively.


Fig. 7. Empirical cumulative distributions for f18 at practical significance level of 10^1: (a) (RF1-CMAES, GP1-CMAES), (b) (RF1-CMAES, Srr), (c) (GP1-CMAES, Srr).

However, if the experts assume that differences between fitness values smaller than or equal to 10^1 are not significant from a practical point of view for the real-world scenario, they can set the practical threshold to 10^1. In this case, the performances of CMA-ES and PSO are not statistically significantly different, but both are better than DE. It follows that both can provide good practical results when they are used on Model2 25 × 25. The last row of Table 10 provides the p-value obtained by the Friedman test when the three algorithms are compared over the set of 12 models, for each practical threshold separately. A p-value smaller than 0.05 indicates a statistically significant difference between the performances of the algorithms. We should also point out that the practical threshold should be assigned by the domain experts.
Next, we compared all eight algorithms by performing a multiple comparison with a control algorithm. In our case, the CMA-ES algorithm was selected as the control algorithm. The results of the multiple pairwise comparisons for different practical thresholds are presented in Table 11. Looking at Table 11, there are no big differences in the multiple-problem analysis scenario, in which all pairwise comparisons between CMA-ES and the other algorithms are performed, for the two practical thresholds 0 and 10^1.

Table 11
Multiple comparisons with a control algorithm (CMA-ES) by using multiple Wilcoxon tests with the Monte Carlo variant of the pDSC ranking scheme.

CMA-ES vs.   ϵp = 0   ϵp = 10^1
ABC          0.01     0.04
DE           0.15     0.15
EDA          0.57     0.59
FST-PSO      0.04     0.09
FST-PSO2     0.01     0.00
GA           0.01     0.04
PSO          0.07     0.07
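Each entry in Table 11 comes from a Wilcoxon test on the per-model pDSC rankings of the control algorithm and one competitor. The sketch below is a mechanical illustration only: for lack of the eight-algorithm rankings behind Table 11, it uses the three-algorithm rankings from Table 10, and zero_method="zsplit" is chosen so that the test stays defined when some per-model rankings are tied.

```python
from scipy.stats import wilcoxon

# Per-model pDSC rankings for eps_p = 0, taken from Table 10 (illustration only;
# Table 11 itself is based on rankings from the eight-algorithm comparison)
cma_es = [1.5, 1.0, 3.0, 2.0, 2.5, 3.0, 3.0, 3.0, 1.0, 3.0, 3.0, 3.0]
de     = [3.0, 3.0, 1.5, 2.0, 2.5, 2.0, 1.5, 2.0, 3.0, 2.0, 1.5, 2.0]

stat, p = wilcoxon(cma_es, de, zero_method="zsplit")
print(round(p, 3))   # p-value of the pairwise Wilcoxon comparison
```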

The difference is only seen in the case when CMA-ES is compared with FST-PSO. When any difference between the fitness values obtained by the algorithms is considered important (ϵp = 0), a statistically significant difference is observed between the performances of the two algorithms. However, when the practical threshold is changed to ϵp = 10^1, there is no practical significance between their performances over the set of all 12 models. We should mention here that the presented results come from independent pairwise comparisons, so to make a general conclusion we should continue with the procedure described in Section 4.3.4.
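The multiple-problem analysis mentioned above (the last row of Table 10 and the omnibus step of Section 4.3.4) aggregates the per-problem pDSC rankings with the Friedman test. A minimal sketch using SciPy is shown below for the ϵp = 0 rankings from Table 10; SciPy's tie handling may differ from the authors' implementation, so the resulting p-value is only expected to be close to the 0.03 reported in the table.

```python
from scipy.stats import friedmanchisquare

# Per-model pDSC rankings for eps_p = 0 (second column block of Table 10)
cma_es = [1.5, 1.0, 3.0, 2.0, 2.5, 3.0, 3.0, 3.0, 1.0, 3.0, 3.0, 3.0]
de     = [3.0, 3.0, 1.5, 2.0, 2.5, 2.0, 1.5, 2.0, 3.0, 2.0, 1.5, 2.0]
pso    = [1.5, 2.0, 1.5, 2.0, 1.0, 1.0, 1.5, 1.0, 2.0, 1.0, 1.5, 1.0]

stat, p = friedmanchisquare(cma_es, de, pso)
print(round(p, 3))   # reject the null hypothesis of equal performance if p < 0.05
```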


Fig. 8. Kernel density estimation for the obtained p-values for each pairwise comparison: (a) (RF1-CMAES, GP1-CMAES), (b) (RF1-CMAES, Srr), (c) (GP1-CMAES, Srr).
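The p-value finally used by the ranking scheme for each pairwise comparison is the mode of this estimated density (the blue vertical line in Fig. 8). The sketch below reproduces only that mode-selection step and is an assumption-based simplification: it ignores how the prior significance level αp enters Eq. (5), which is defined earlier in the paper.

```python
import numpy as np
from scipy.stats import gaussian_kde

def mode_of_pvalues(pvals, grid_size=1000):
    """Return the mode of a kernel density estimate fitted to the p-values."""
    pvals = np.asarray(pvals, dtype=float)
    if np.ptp(pvals) == 0.0:          # degenerate case: all p-values identical
        return float(pvals[0])
    kde = gaussian_kde(pvals)
    grid = np.linspace(0.0, 1.0, grid_size)
    return float(grid[np.argmax(kde(grid))])

# The selected p-value for each pairwise comparison would then be tested against
# the Bonferroni-corrected significance level (0.05 / 3 = 0.016), as in the text.
```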

Finally, we would like to point out that different models can have different practical thresholds, which need to be assigned as a priori information by the domain experts before the analysis.
5. Discussion
Benchmarking is necessary to evaluate the performance of a newly introduced stochastic optimization algorithm against the performance of state-of-the-art algorithms [48]. It is related to three main questions:

• What problems should be chosen for comparison?
• How should comparative experiments be designed?
• How should the performance of approaches be measured?
In this paper, we assumed that the first two questions are already solved (i.e., the benchmark problems are selected and the experiments are set up), and our focus was to identify practical significance between the performances of the compared algorithms. CRS4EAs is one approach that can identify practical significance, where a difference between two algorithms is reported if their confidence (rating) intervals do not overlap. However, using the CRS4EAs

some information can be misleading, since even though the confidence intervals overlap, practical significance can still exist. For this reason, we propose an extension of Deep Statistical Comparison, called practical Deep Statistical Comparison. The difference from the standard DSC ranking scheme, which deals only with statistical significance, is that practical significance is handled with a preprocessing step before the data is used by the standard DSC ranking scheme. The user sets the practical significance level for the preprocessing; if it is set to 0, the pDSC acts like the standard DSC ranking scheme, which identifies statistical significance. Two ranking schemes are proposed: the sequential variant, in which the independent runs are preprocessed in sequential order, and the Monte-Carlo variant, which avoids the dependence of the practical significance on the order of the independent runs. Although the Monte-Carlo variant requires many more comparisons, this does not reduce its relevance, since the analysis is not run in a real-time system. Experimental results show that the pDSC gives more robust results for practical significance than the CRS4EAs, and by doing so it improves the state of the art for identifying practical significance.
First, we compared DSC and CRS4EAs from the perspective of statistical significance, to see if CRS4EAs can identify statistical significance. Based on the test data sets, CRS4EAs performed better


than the common approach by giving more robust results, since it provides relevant statistics in more cases than the common approach. However, in terms of detecting outliers, its performance still lags behind DSC.
Second, we compared the sequential variant of the pDSC and CRS4EAs, since both handle practical significance in sequential order. Predictably, pDSC was more consistent: once practical significance was detected, increasing the practical level did not alter the outcome. Tables 2 and 4 show how CRS4EAs changes from 1 (no significance) to 0 (significance) when the practical level increases, which indicates that CRS4EAs does not perform consistently. Both approaches yielded the same results in 7 out of 25 cases, while in the other 18 there was a difference at at least one practical level.
Third, we used the Monte Carlo variant of the pDSC to negate the influence of preprocessing the independent runs in sequential order, since there is no guarantee that the same order would be obtained if the algorithms were run again. The results reveal only one difference compared to the sequential variant, at test case 21, where for practical levels 10^-2, 10^-1, and 10^0 the outcome of the evaluation changes from 1 to 0. This confirms that using the sequential order of the algorithm results for the comparison impacts the outcome of the statistical analysis. However, this happens for only one out of our 25 test instances. Since the statistical analysis is done only once on the acquired data, we suggest always applying the Monte Carlo variant of the pDSC, although it is more time consuming.
Finally, we compared the sequential variant of the pDSC and CRS4EAs by performing an experiment based on the well-known multiple comparisons with a control algorithm. In this case, both approaches yielded the same results. However, different results can also be obtained, depending on the algorithms involved and on the selected practical level.
6. Conclusions
In this paper, two variants of the DSC ranking scheme that take into account practical significance are proposed. The first is called sequential pDSC, in which practical significance is considered by introducing preprocessing of the data; the preprocessing is performed sequentially, in the order of the independent runs. The second variant is called Monte Carlo pDSC, in which practical significance does not depend on the order of the independent runs. An evaluation of both pDSC approaches was made using the results from the BBOB 2015 competition, which uses single-objective problems for benchmarking, and comparing them to the Chess Rating System for Evolutionary Algorithms. The results revealed that CRS4EAs and the pDSC give different results. Preprocessing for practical significance is carried out in a similar way, but there are cases when the results for practical significance differ. This happens because pDSC is more robust against outliers.
The compared approaches are based on inferential statistics; however, for our future work, we plan to extend pDSC using Bayesian statistics. The difference is that inferential statistics compares a p-value with a significance level, whereas in Bayesian statistics there is no need to specify a significance level for rejecting the null hypothesis; instead, there is a distribution of the parameter from which we can obtain the probability of the null hypothesis being true.
In Bayesian statistics, a prior probability model (based on known information or a non-informative prior distribution) should be set, the posterior distribution should be calculated from the data, and the model should be evaluated. We aim to change the way statistical comparisons are made, so that instead of statistical tests based on inferential statistics, statistical tests based on Bayesian statistics are applied.


Declaration of competing interest No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.asoc.2019.105862. Acknowledgment The authors acknowledge the financial support from the Slovenian Research Agency (research core funding No. P2-0098, and project Z2-1867) and from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 692286. We would like to thank Miha Ravber and Matej Črepinšek from University of Maribor, Slovenia, for providing the code for the CRS4EAs algorithm. We would also like to thank Marco Nobile from University of Milano-Bicocca, Italy, for providing the results of study on biochemical parameter estimation. References [1] D. Molina, A. LaTorre, F. Herrera, An insight into bio-inspired and evolutionary algorithms for global optimization: review, analysis, and lessons learnt over a decade of competitions, Cogn. Comput. 10 (2018) 517–544. [2] V. Karavaev, D. Antipov, B. Doerr, Theoretical and empirical study of the (1+(λ, λ)) ea on the leadingones problem, in: Proceedings of the Genetic and Evolutionary Computation Conference Companion, ACM, 2019, pp. 2036–2039. [3] D. Vinokurov, M. Buzdalov, A. Buzdalova, B. Doerr, C. Doerr, Fixed-target runtime analysis of the (1+ 1) ea with resampling, in: Proceedings of the Genetic and Evolutionary Computation Conference Companion, ACM, 2019, pp. 2068–2071. [4] E.L. Lehmann, J.P. Romano, G. Casella, Testing Statistical Hypotheses, Vol. 150, Wiley New York et al, 1986. [5] S. García, D. Molina, M. Lozano, F. Herrera, A study on the use of nonparametric tests for analyzing the evolutionary algorithms’ behaviour: a case study on the cec’2005 special session on real parameter optimization, J. Heuristics 15 (6) (2009) 617–644. [6] J. Derrac, S. García, D. Molina, F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm Evol. Comput. 1 (1) (2011) 3–18. [7] N. Veček, M. Mernik, M. Črepinšek, A chess rating system for evolutionary algorithms: a new method for the comparison and ranking of evolutionary algorithms, Inform. Sci. 277 (2014) 656–679. [8] R.E. Kirk, Practical significance: A concept whose time has come, Educ. Psychol. Meas. 56 (5) (1996) 746–759, http://dx.doi.org/10.1177/ 0013164496056005002. [9] M.S. Nobile, A. Tangherloni, L. Rundo, S. Spolaor, D. Besozzi, G. Mauri, P. Cazzaniga, Computational intelligence for parameter estimation of biochemical systems, in: 2018 IEEE Congress on Evolutionary Computation (CEC), IEEE, 2018, pp. 1–8. [10] J.H. Kämpf, M. Wetter, D. Robinson, A comparison of global optimization algorithms with standard benchmark functions and real-world applications using energyplus, J. Build. Perform. Simul. 3 (2) (2010) 103–120. [11] P. Korošec, Stigmergy as an Approach to Metaheuristic Optimization (Ph.D. thesis), Jožef Stefan International Postgraduate School, Ljubljana, Slovenia, Ljubljana, Slovenia, 2006, Slovenian title: Stigmergija kot pristop k metahevristični optimizaciji. [12] N. Hansen, A. Auger, O. Mersmann, T. Tusar, D. Brockhoff, Coco: A platform for comparing continuous optimizers in a black-box setting, 2016, arXiv preprint arXiv:1603.08785. 
[13] Genetic and evolutionary computation conference (gecco), gecco workshop on real-parameter black-box optimization benchmarking (bbob), 2018, http://numbbo.github.io/workshops/BBOB-2018/, Accessed: 15-06-19. [14] Congress on evolutionary computation (cec), special session & competitions on real-parameter single objective optimization, 2017, http: //www.ntu.edu.sg/home/EPNSugan/index_files/CEC2017/CEC2017.htm, Accessed: 15-06-19. [15] D. Shilane, J. Martikainen, S. Dudoit, S.J. Ovaska, A general framework for statistical performance comparison of evolutionary computation algorithms, Inform. Sci. 178 (14) (2008) 2870–2879. [16] J. Derrac, S. García, S. Hui, P.N. Suganthan, F. Herrera, Analyzing convergence performance of evolutionary algorithms: a statistical approach, Inform. Sci. 289 (2014) 41–58.


[17] E.G. Carrano, E.F. Wanner, R.H. Takahashi, A multicriteria statistical based comparison methodology for evaluating evolutionary algorithms, IEEE Trans. Evol. Comput. 15 (6) (2011) 848–870. [18] T. Eftimov, P. Korošec, B.K. Seljak, A novel approach to statistical comparison of meta-heuristic stochastic optimization algorithms using deep statistics, Inform. Sci. 417 (2017) 186–215. [19] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30. [20] T. Eftimov, P. Korošec, B.K. Seljak, Disadvantages of statistical comparison of stochastic optimization algorithms, in: Proceedings of the Bioinspired Optimizaiton Methods and their Applications, BIOMA, 2016, pp. 105–118. [21] T. Eftimov, P. Korošec, The impact of statistics for benchmarking in evolutionary computation research, in: Proceedings of the Genetic and Evolutionary Computation Conference Companion, ACM, 2018, pp. 1329–1336. [22] N. Veček, M. Črepinšek, M. Mernik, On the influence of the number of algorithms, problems, and independent runs in the comparison of evolutionary algorithms, Appl. Soft Comput. 54 (2017) 23–45. [23] N. Veček, M. Črepinšek, M. Mernik, D. Hrnčič, A comparison between different chess rating systems for ranking evolutionary algorithms, in: Computer Science and Information Systems (FedCSIS), 2014 Federated Conference on, IEEE, 2014, pp. 511–518. [24] R.V. Hogg, J. McKean, A.T. Craig, Introduction to Mathematical Statistics, Pearson Education, 2005. [25] M.J. Laan van der, S. Dudoit, K.S. Pollard, Multiple testing. part ii. stepdown procedures for control of the family-wise error rate, Statist. Appl. Genet. Mol. Biol. 3 (1) (2004) 1–33. [26] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power, Inform. Sci. 180 (10) (2010) 2044–2064. [27] C.Z. Mooney, Monte Carlo Simulation, Vol. 116, Sage Publications, 1997. [28] B.W. Silverman, Density Estimation for Statistics and Data Analysis, Routledge, 2018. [29] B. black box optimization competition, black-box benchmarking 2015, 2015, http://coco.gforge.inria.fr/doku.php?id=bbob-2015, Accessed: 01-0216. [30] A. Tangherloni, S. Spolaor, P. Cazzaniga, D. Besozzi, L. Rundo, G. Mauri, M.S. Nobile, Biochemical parameter estimation vs. benchmark functions: A comparative study of optimization performance and representation design, Appl. Soft Comput. 81 (2019) 105494. [31] P. Pošík, P. Baudiš, Dimension selection in axis-parallel brent-step method for black-box optimization of separable continuous functions, in: Proceedings of the Companion Publication of the 2015 on Genetic and Evolutionary Computation Conference, ACM, 2015, pp. 1151–1158. [32] A. Atamna, Benchmarking ipop-cma-es-tpa and ipop-cma-es-msr on the bbob noiseless testbed, in: Proceedings of the Companion Publication of the 2015 on Genetic and Evolutionary Computation Conference, ACM, 2015, pp. 1135–1142.

[33] L. Bajer, Z. Pitra, M. Holeňa, Benchmarking gaussian processes and random forests surrogate models on the bbob noiseless testbed, in: Proceedings of the Companion Publication of the 2015 on Genetic and Evolutionary Computation Conference, ACM, 2015, pp. 1143–1150. [34] D. Brockhoff, B. Bischl, T. Wagner, The impact of initial designs on the performance of matsumoto on the noiseless bbob-2015 testbed: A preliminary study, in: Proceedings of the Companion Publication of the 2015 on Genetic and Evolutionary Computation Conference, ACM, 2015, pp. 1159–1166. [35] N. Hansen, A. Auger, S. Finck, R. Ros, Real-parameter black-box optimization benchmarking 2010: Experimental setup, 2010. [36] D. Besozzi, Reaction-based models of biochemical networks, in: Conference on Computability in Europe, Springer, 2016, pp. 24–34. [37] D. Karaboga, B. Basturk, A powerful and efficient algorithm for numerical function optimization: artificial bee colony (abc) algorithm, J. Glob. Optim. 39 (3) (2007) 459–471. [38] N. Hansen, A. Ostermeier, Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation, in: Proceedings of IEEE International Conference on Evolutionary Computation, IEEE, 1996, pp. 312–317. [39] S. Das, P.N. Suganthan, Differential evolution: A survey of the state-of-the-art, IEEE Trans. Evol. Comput. 15 (1) (2010) 4–31. [40] P. Larrañaga, J.A. Lozano, Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation, Vol. 2, Springer Science & Business Media, 2001. [41] J.H. Holland, et al., Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence, MIT press, 1992. [42] J. Kennedy, Particle swarm optimization, Encyclopedia Mach. Learn. (2010) 760–766. [43] M.S. Nobile, P. Cazzaniga, D. Besozzi, R. Colombo, G. Mauri, G. Pasi, Fuzzy self-tuning pso: A settings-free algorithm for global optimization, Swarm Evol. Comput. 39 (2018) 70–85. [44] T. Eftimov, P. Korošec, B.K. Seljak, The behavior of deep statistical comparison approach for different criteria of comparing distributions, in: Proceedings of the 9th International Joint Conference on Computational Intelligence - Volume 1: IJCCI, INSTICC, SciTePress, 2017, pp. 73–82, http: //dx.doi.org/10.5220/0006499900730082. [45] O.J. Dunn, Multiple comparisons among means, J. Amer. Statist. Assoc. 56 (293) (1961) 52–64. [46] S. Holm, A simple sequentially rejective multiple test procedure, Scand. J. Statist. (1979) 65–70. [47] Y. Hochberg, A sharper bonferroni procedure for multiple tests of significance, Biometrika 75 (4) (1988) 800–802. [48] A. Rolstadås, Benchmarking—Theory and Practice, Springer, 2013.