Evolutionary Bayesian Network design for high dimensional experiments

Chemometrics and Intelligent Laboratory Systems 135 (2014) 172–182
Debora Slanzi ⁎, Irene Poli

Department of Environmental Science, Informatics and Statistics, University Ca' Foscari, Cannaregio 873, 30121 Venice, Italy
European Centre for Living Technology, Ca' Minich, S. Marco 2940, 30124 Venice, Italy

⁎ Corresponding author. Tel.: +39 3469707575. E-mail addresses: [email protected] (D. Slanzi), [email protected] (I. Poli).

Article history: Received 19 November 2013; Received in revised form 14 April 2014; Accepted 20 April 2014; Available online 30 April 2014.

Keywords: Design of experiments; Evolutionary Bayesian networks; High dimensional systems; Optimisation; Vesicles self-organisation process.

Abstract

Laboratory experimentation is increasingly concerned with systems whose dynamical behaviour can be affected by a very large number of variables. The objectives of experimentation on such systems are generally both the optimisation of some experimental responses and the efficiency of experimentation, in terms of low investment of resources and low impact on the environment. Design and modelling for high dimensional systems with these objectives present hard and challenging problems, to which much current research is devoted. In this paper, we introduce a novel approach based on the evolutionary principle and Bayesian network models. This approach can discover optimum values while testing just a very limited number of experimental points. The very good performance of the approach is shown both in a simulation analysis and in a biochemical study concerning the emergence of new functional bio-entities. © 2014 Elsevier B.V. All rights reserved.

1. Introduction

Designing experiments and modelling data in current scientific research increasingly contend with the problem of high dimensionality. Natural systems in particular are described by a very large number of variables whose dynamical interaction leads to the emergence of a particular behaviour of the system. Understanding the organisation of such systems and identifying the relationships among variables generally involve complex laboratory experimentation. Strategies to plan experiments have to address the difficulty of deriving efficient designs and optimisation procedures that test only a small set of experimental points, since each point may involve a great investment of resources and sometimes have negative environmental effects. Modelling high-dimensional systems also confronts the difficulties of building nonlinear dependence relations and of estimating a number of parameters that increases rapidly with the dimension of the system. These difficulties suggest the importance of alternatives to standard experimental designs and linear regression models.

As a result of these considerations, many authors have developed strategies to achieve efficient designs by reducing the effective number of parameters through the assumption of sparsity (only a few variables are relevant in determining the dynamics of the system). Georgiou et al. [1,2], Marley et al. [3], and Sun et al. [4], among others, contributed to the construction and evaluation of the large class of supersaturated designs (SSDs), which have been derived as a class of fractional factorial designs for systems with a large number of variables and few experimental points. Building on the pioneering work of Booth and Cox [5],


there have been significant developments, initially for problems with two-level variables and later for multi-level and mixed-level variables [2]. Based on the classical assumption of a linear main-effects model with Gaussian experimental errors, the construction of SSDs is realised through different algorithms and computer search strategies that find optimal designs with respect to a measure built on the information matrix of the model [3]. Another class of designs addressing high dimensionality is that of exchange algorithms and their developments [6–9]. These designs are based on the exchange between the selected design points and points in a candidate list. The exchange is performed with the aim of identifying the design that satisfies a measure of optimality; in these approaches the exchange is realised at each iteration of the algorithm, generally to increase the determinant of the information matrix. Optimal design procedures based on the information matrix, such as SSDs and exchange algorithms, are adequate for several problems, but they require prior knowledge of the form of the response function (a linear model), which frequently is not available. Model-robust experimental designs based on exchange algorithms have recently been proposed [9] to respond to this issue. These designs do not assume a single model form but allow for a set of user-specified models, and the design is derived by maximising the product of the determinants of the information matrices associated with each of the suggested models.

A different way to address the problem of designing experimentation for high dimensional systems is to use computer experiments [10,11]. In computer experiments, rather than conducting laboratory experimentation or making field observations, simulators based on mathematical models are constructed to study how the model behaves under relevant variables and conditions. Physical processes can be simulated and the simulation code can serve as an efficient mode for exploring the properties of the process [12–15]. This is becoming a popular approach and it is applied in different research fields; however, high dimensionality makes this approach hard to use, since it requires complex models that may be prohibitively expensive to simulate. In designing computer simulations, Latin hypercube sampling has been proposed [16,17]. These designs generally provide uniform samples for the marginal distributions of the variables and have achieved successful results in a set of research studies. Recent contributions, based on these developments, have proposed to design high dimensional experiments by adopting search procedures built on evolutionary approaches, such as genetic algorithms and particle swarm optimisation techniques [18–22].

In this paper, we propose a design strategy that is developed according to the evolutionary approach. Our strategy does not involve an a priori choice of the experimental design but evolves the design through a number of experimental generations, moving in different areas of the search space. Each generation of planned experiments is achieved by combining evolutionary principles and inference from statistical models. In this procedure we combine the ability of the evolutionary approach to intelligently navigate the search space with the capacity of statistical models to uncover hidden information. The common practice of choosing the experimental points a priori may in fact be inappropriate for high dimensional problems, as it leads to a premature and possibly misleading selection of the experimental points. More specifically, our approach is based on the evolution of probabilistic graphical models, PGMs [23,24]. We focus on a particular class of PGMs, the class of Bayesian network models, where nodes in the graph correspond to random variables and arcs between nodes describe the dependence structure that may characterise the set of variables, on which we then develop statistical inference. We study this Evolutionary Bayesian Network design (EBN-design) both in a simulation analysis and in laboratory biochemical experimentation concerning the self-organisation of amphiphilic molecules. In both studies, EBN-design exhibits excellent performance in optimising the response of the systems, in comparison to other common experimental approaches.

The paper is organised as follows. Section 2 introduces the design for optimisation and presents the Evolutionary Bayesian Network approach to designing high dimensional experiments. Section 3 presents a simulation study to test the efficiency of EBN-design compared with common alternative design-of-experiment approaches. Section 4 describes a biochemical study concerning the self-organising process of amphiphilic molecules, and Section 5 presents some concluding remarks.

2. Design for optimisation

An optimisation problem can be described in its general structure as follows. Let S be a subset of the Euclidean space R^d and f be a real-valued function on S. Let x = (x_1, …, x_d) be the set of variables defined in S, and y be the response variable, y = f(x_1, …, x_d). The optimisation problem consists in finding x* in S such that f(x*) ≥ f(x) for all x ∈ S (in maximisation problems). The inferential problem concerns the form of f. Experimentation provides the data to construct a function $\hat{f}(x_1, \dots, x_d)$ that can serve as an approximation to f(x_1, …, x_d) over the domain S of interest.
Data from experimentation are collected according to a chosen design $\xi_n$, consisting of a set of experimental points, $\xi_n = (\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$, where each experimental point $\mathbf{x}_k$, k = 1, …, n, is a d-dimensional vector whose values are the levels of the variables. The set of data from experimentation, namely (X, y) with X an (n × d) matrix and y an n-vector, then represents the evidence for inferring the dependence relations among variables, i.e. the form of f, and for identifying the design points that give the optimum value of the response variable.
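As a point of reference for the notation just introduced, the following minimal R sketch builds a small random design as an n × d matrix X, evaluates a stand-in response function to obtain y, and reports the best tested point; the response function and factor levels are purely illustrative assumptions, not part of the original study.

## Illustrative only: a toy design matrix X (n x d), its responses y,
## and the best experimental point tested so far.
f <- function(x) -sum((x - 1)^2)     # hypothetical response, maximised at x = (1, ..., 1)

d <- 4; n <- 8
X <- matrix(sample(-1:1, n * d, replace = TRUE), nrow = n, ncol = d)  # n experimental points
y <- apply(X, 1, f)                  # responses collected from "experimentation"

X[which.max(y), ]                    # best design point among those tested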


In classical design theory, a linear regression model for f is selected to describe the dependence relation of y on x_1, …, x_d, which in matrix notation can be represented as

$$y = X\beta + e$$

where β are the unknown parameters of the model and e is an additive stochastic component with E(e) = 0 and covariance matrix Var(e) = σ²I_n. Under the least squares approach, the estimates of the parameters take the form $\hat{\beta} = (X'X)^{-1}X'y$, with $\mathrm{Cov}(\hat{\beta}) = \sigma^2 (X'X)^{-1}$, and the information matrix associated with the particular n-point design $\xi_n$ is defined as $I(\xi_n) = X'X/\sigma^2 = [\mathrm{Cov}(\hat{\beta})]^{-1}$. To select an efficient design $\xi_n$, the theory of optimal design [25–27] prescribes the choice of n experimental points $\mathbf{x}_k$, k = 1, …, n, such that some criterion $\phi(I(\xi_n))$, based on the information matrix, is optimised. The most common criterion is D-optimality, for which $\phi(I(\xi_n)) = |I(\xi_n)|$. This approach requires knowledge of the precise form of f, assuming a linear model. However, the assumption of linearity is difficult to justify when a large number of variables characterise the system and the form of f is unknown, or nonlinear, as in complex high dimensional systems. Our contribution to address this problem is to adopt evolution as a paradigm to build a design strategy and to drive the evolution with Bayesian network models.

2.1. Evolving the design with models

The key idea of evolutionary design is to derive a small set of informative experimental points by adopting the principle of evolution: the design is evolved through generations of selected experimental points according to a particular function measuring the goodness of the design in reaching the objective of optimisation. The design is then evolved in a sequence of small sub-designs achieved with some probabilistic rules that emulate the laws of genetic evolution [18,28–30,20].

In building the evolutionary design, we first create a small low-dimensional set of experimental points, $\xi_{n_1} = X_1$ (the design D_1 at time 1), selecting variables and variable levels at random. The random initial design, as a random sample from the population of all possible experimental points, is an intriguing selection criterion since it allows the exploration of the search space in areas not anticipated by prior knowledge but where relevant information may reside. The first population of experimental points is then tested in the laboratory (or evaluated in simulation) and generates the first set of data, consisting of different combinations of factor levels and their corresponding experimental responses (X_1, y_1). This initial dataset is small and each experimental point is low dimensional, but the random selection allows the system to explore the many dimensions of the whole design space. In a simple evolutionary design, as in Genetic Algorithm design [31,18,28,20,32], a set of probabilistic transition rules (based on genetic operators) is defined and applied to guide the evolutionary process towards the next generation of candidate experimental points. Probabilistic selection is the most important element of the evolving design. Adopting proportional, truncated or tournament selection approaches, it leads to the construction of a new small population of experimental points selected according to their expected capacity to provide high-quality responses. Other operators, such as recombination and mutation, contribute to the evolutionary design, exchanging and changing elements of the experimental points to improve the quality of the compositions.
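A minimal R sketch of the generic genetic operators just mentioned (tournament selection, uniform crossover and mutation) is given below; it is a generic illustration on coded experimental points, not the specific GA-design implementation compared later in the paper, and the set of admissible levels is a hypothetical choice.

## Illustrative genetic operators acting on coded experimental points.
levels_set <- -5:5                                   # admissible factor levels (hypothetical)

tournament_select <- function(X, y, k = 3) {
  idx <- sample(nrow(X), k)
  X[idx[which.max(y[idx])], ]                        # the best of k randomly drawn points survives
}

uniform_crossover <- function(p1, p2) {
  mask <- runif(length(p1)) < 0.5
  ifelse(mask, p1, p2)                               # each coordinate inherited from one of the two parents
}

mutate <- function(x, pm = 0.1) {
  hit <- runif(length(x)) < pm
  x[hit] <- sample(levels_set, sum(hit), replace = TRUE)
  x                                                  # a few coordinates reset to random admissible levels
}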
These operators lead the evolution from the initial design to a sequence of successive designs with higher quality and more precise responses. The class of GA-designs has been very successful in addressing search problems in high dimensional spaces and in a variety of research fields. Another class of evolutionary procedures is Particle Swarm Optimisation (PSO) [33,22], where each experimental point is regarded as a particle of a swarm. In this procedure, particles move in the search space with an adaptable velocity, recording the best position they have visited, and communicate with their neighbourhood to improve the search in the design space. However, these designs are frequently considered blind search tools, since the evolutionary process is built with probabilistic transition rules that operate in a mechanistic way, without knowledge of the underlying system.

To overcome this weak aspect of the procedure and make the search for the optimum more powerful, we consider an evolutionary design where the information on the system is extracted at each generation of data by a class of statistical models. At each generation q, q = 1, …, Q, a set of data (X_q, y_q) is in fact available to be modelled and to provide insights into the dependence relation between the response and the selected set of variable levels and their interactions. This design, named DEEM (Design of Evolutionary Experiments based on Models, as in Refs. [20,34,35]), is built by integrating the information that can be inferred by statistical models with the information achieved by the evolutionary adaptive procedure. The DEEM approach is very flexible since it can be developed using different classes of models and different evolutionary procedures, depending on the problem under study. When the dependence and independence relations in the data are a crucial element of the optimisation problem, Bayesian network models can be the most appropriate choice for the DEEM approach. Following this idea, we derive an evolutionary design based on the information provided by a Bayesian network model.

2.2. The Bayesian network models

A Bayesian network (BN) is a multivariate statistical model that can be described as a directed acyclic graph (DAG), where nodes represent variables and arcs represent direct probabilistic dependence relations among them [36–38]. As illustrated in a large variety of applications, BNs are a powerful tool to extract information from data and infer dependence relations among variables. A BN can be represented as in Fig. 1, where a set of variables x_i, i = 1, …, 7, are described as the nodes of the graph and the presence of arcs indicates a dependence relation between variables. In this network, x_1 and x_3 are directly dependent on x_2, and x_6 is dependent on x_3 and x_4. The absence of the arc between x_2 and x_5 implies that x_5 depends on x_2 only through x_1 and x_3. The dependence relations between variables are quantified by conditional probabilities, such as P(x_1 | x_2, x_6). In a BN, the conditional probabilities P(x_i | x_j) represent the set of parameters of the network and allow the derivation of a joint probability distribution over the set of variables. This joint probability distribution can be written as follows:

$$P(\mathbf{x}) = \prod_{i=1}^{d} P(x_i \mid \mathrm{Pa}(x_i)) \qquad (1)$$

where Pa(x_i), a subset of {x_1, …, x_d}\{x_i}, is the parent set of x_i. In the BN, variables take the roles of "parent" and "child" according to the dependence relations that link them: x_i is a child of x_j if x_i directly depends on x_j, i ≠ j = 1, …, d. The BN therefore encodes the basic Markov property, for which each node x_i is probabilistically independent of all its non-descendants, given its parents. In Fig. 1, x_2 takes the role of parent of x_1 and x_3, and x_5 is the parent of x_7.

Fig. 1. An example of a Bayesian network.

For a particular set of data we can build and represent a BN by inferring from the data both the structure of the network and the conditional probabilities. In this paper we consider inference procedures based on the maximisation of a scoring function which measures the ability of the network to represent the observed data (Search and Score methods) [37,38]. This approach is known to provide more accurate and robust results in comparison with other approaches [39,40]. More specifically, the approach is based on defining (1) a scoring function for evaluating the quality of a given structure, and (2) a search strategy for investigating the space of candidate models. The scoring function can be defined on the basis of different principles, such as entropy and information [41,42], minimum description length [43,44] and Bayesian approaches [45,46]. In this research we consider Bayesian approaches, which involve the assessment of prior probability distributions on the possible networks and the computation of the posterior probability distribution conditioned on the data; the best network is then the one that maximises the posterior distribution. As the search strategy, we consider the hill-climbing search procedure, a local heuristic search method commonly adopted in this context [46].
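A minimal illustration of the Search and Score idea with the bnlearn R package (one of the packages listed in the Computational tools appendix) is sketched below; the toy data set is an assumption introduced only for the example, with variable names echoing Fig. 1.

## Toy illustration of Search & Score structure learning (not the study data).
library(bnlearn)

set.seed(1)
toy <- data.frame(lapply(1:7, function(i) factor(sample(c("low", "high"), 50, replace = TRUE))))
names(toy) <- paste0("x", 1:7)

dag <- hc(toy, score = "bde")   # hill-climbing search maximising a Bayesian (BDeu) score
score(dag, toy, type = "bde")   # the score of the selected structure: higher is better
arcs(dag)                       # the inferred dependence structure (arc list)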

2.3. Evolutionary Bayesian Network design

The Evolutionary Bayesian Network design (EBN-design) that we propose is presented in Fig. 2. In order to optimise the response of the system, this design begins by randomly generating a first population of experimental points $X_1 = (\mathbf{x}_{1,1}, \mathbf{x}_{1,2}, \dots, \mathbf{x}_{1,n_1})$; the random selection concerns both the variables and the levels of each variable (line 2 in Algorithm 1). This initial design D_1, representing the first population of n_1 experimental points, is then tested in the laboratory (line 3 in Algorithm 1) and provides the first set of responses, namely the vector y_1 (line 4 in Algorithm 1). This set of data (D_1, y_1) is then ordered according to the response values (line 6 in Algorithm 1) and the subset of experimental points with response values greater than a fixed threshold is selected for the next generation (line 7 in Algorithm 1). The threshold is chosen as a particular quantile of the observed response distribution: high thresholds (i.e. above the third quartile of the observed responses) reduce the number of selected experimental points for the next generation, inducing the procedure to exploit the optimality region, whereas low thresholds (i.e. equal to or less than the median of the observed responses) make the procedure explore larger areas of the search space. In order to optimise the response of the system, we choose the third quartile as the threshold. On this first subset of data, namely D⋆_1, representing the experimental points with responses higher than the threshold, we build a Bayesian network model to infer the dependence relations among variables. Working with the class of discrete Bayesian networks, we discretise the response variable based on the quantiles of its observed distribution, obtaining the new vector of responses y⋆_1 (line 8 in Algorithm 1). The network is therefore constructed on d + 1 factor variables, namely the d variables in D⋆_1 and the corresponding response variable y⋆_1 (line 9 in Algorithm 1). To infer the BN, we adopt the Search and Score learning approach, using a hill-climbing search procedure with a specified number of random restarts and the Bayesian Dirichlet equivalent uniform (BDeu) score [46,47]. Among the possible structural inference procedures, we chose the Search and Score approach since it can produce robust networks when the structure is sparse and can incorporate prior information about possible relations among the variables of the system [48]. The particular choice of the inferential procedure depends on the number of experimental points: when testing a very small set of points with respect to the whole search space, the use of greedy hill-climbing procedures and equivalent score metrics is frequently suggested [40,48].
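To make the generation step concrete, the following sketch outlines in R (with the bnlearn package) how one EBN generation could be coded: selection above the third quartile, quantile discretisation, hill-climbing with the BDeu score, and sampling of new candidates from the fitted network. The function name, the candidate sample size and the assumption that the input data frame holds factor predictors x1…xd plus a numeric response y are illustrative, not the authors' original implementation.

## Illustrative sketch of one EBN generation (hypothetical data layout).
library(bnlearn)

one_ebn_generation <- function(data, response = "y", n_new = 30) {
  ## 1. Selection: keep the points above the third quartile of the observed response
  thr  <- quantile(data[[response]], 0.75)
  kept <- data[data[[response]] > thr, ]

  ## 2. Discretise the response by its observed quartiles (discrete BN)
  kept[[response]] <- cut(kept[[response]],
                          breaks = unique(quantile(kept[[response]], 0:4 / 4)),
                          include.lowest = TRUE)

  ## 3. Search & Score learning: hill-climbing with the BDeu score and random restarts
  dag <- hc(kept, score = "bde", restart = 10)
  fit <- bn.fit(dag, kept, method = "bayes")

  ## 4. Forward (logic) sampling from the joint distribution encoded by the network
  cand <- rbn(fit, n = 5000)

  ## 5. Retain the candidates whose sampled response falls in the highest class
  top  <- levels(cand[[response]])[nlevels(cand[[response]])]
  cand <- cand[cand[[response]] == top, ]
  head(cand[, setdiff(names(cand), response), drop = FALSE], n_new)
}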


Fig. 2. A representation of EBN-design.

From the joint probability distribution of the BN, we then estimate a set of new experimental points, $\hat{D}_2$, and derive the corresponding estimated responses, $\hat{y}_2$, using Probabilistic Logic Sampling (PLS) [49] (line 10 in Algorithm 1). This sampling procedure consists in (1) deriving an ordered structure of the nodes (where each node is preceded by its parents) and (2) assigning the values of each variable according to this ordering and the conditional probabilities of the network. EBN-design then ranks these new experimental points according to their estimated responses $\hat{y}_2$ and selects only the subset of them, D⋆_2, with estimated response values greater than the established threshold (line 11 in Algorithm 1). This new set of experimental points becomes the second population of the evolutionary procedure and is evaluated in the laboratory (line 12 of Algorithm 1), producing the new set of data (line 13 of the Algorithm). The process is repeated iteratively, generation after generation, maintaining the same size in each population of experimental points, and ends when the global optimum is achieved or the maximum total number of experimental points is reached. The number of experimental points (population size) depends on the technology (microwell plate) and the number of generations depends on the costs and time of experimentation: EBN-design is a flexible procedure and it adjusts to different problem constraints. From the evolutionary aspect of this sequential design, we can learn how each variable affects the response of the experimentation, the dependence relations among these variables and their effect on the response, and the region of variables and interactions which gives the best response, in such a way that only the best experimental points, producing relevant information, will survive and will lead the algorithm to the optimality region of the search space.

Algorithm 1. Evolutionary Bayesian Network design
Require: Observational data set
Ensure: The optimal experimental points via EBN
1: initialise n_q, N_gen, threshold
2: D_1 ← first random population of n_1 experimental points
3: evaluate y_1 ∀ D_1
4: EBN-design ← {[D_1, y_1]}
5: for q in 2:N_gen do
6:   rank the experimental points based on decreasing values of the corresponding responses
7:   select the subset of experimental points D⋆_{q−1} with y_{q−1} > threshold
8:   y⋆_{q−1} ← discretise y_{q−1} based on the quantiles of its observed distribution
9:   learn the BN Net_{q−1} using (D⋆_{q−1}, y⋆_{q−1}) as input
10:  estimate a new set of observations $(\hat{D}_q, \hat{y}_q)$ from the JPD encoded by Net_{q−1}
11:  define the n_q new experimental points $D_q = D^{\star}_q \subset \hat{D}_q$ with the highest estimated values $\hat{y}_q$
12:  evaluate y_q ∀ D_q
13:  EBN-design ← concatenate {EBN-design, [D_q, y_q]}
14: end for

3. Simulation study

We evaluate the performance of EBN-design by developing a simulation study based on the Rosenbrock function [50], frequently used to compare optimisation procedures [51,52]. This function has the following form:

$$f(\mathbf{x}) = \sum_{i=1}^{d-1} \left[ 100\,\left(x_i^2 - x_{i+1}\right)^2 + (x_i - 1)^2 \right].$$

The Rosenbrock function f represents a system in which the response function takes its maximum values at extreme points of the domain. For this simulation study, we assume a set of discrete variables x_i, i = 1, …, 6, with possible values x_{i,j} ∈ {−5, −4, −3, −2, −1, 0, 1, 2, 3, 4, 5}. Under this setting, the whole combinatorial space consists of 1,771,561 different element combinations (experimental points) which, in the case of real experimentation, should all be tested in the laboratory (as suggested by classical factorial design). Under these assumptions, the global optimum is equal to 450,180, and we aim to find this optimal value, or a region of optimality, by evaluating just a very small set of element combinations. We limit this set to 150 points, representing roughly 0.01% of the candidate tests to be evaluated (a short R sketch of this test problem is given after the following list). We study the performance of EBN-design in comparison with other designs proposed in the literature for high dimensional systems, namely:

• MC-design: Fedorov's exchange algorithm design [6,7,53];
• LHS-design: Latin Hypercube Sampling design [54,55];
• GA-design: Genetic Algorithm search procedure [31,32,56];
• PSO-design: Particle Swarm Optimisation [33,57,58].
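For reference, the discrete Rosenbrock test problem described above can be written out in a few lines of R; the evaluations shown in the comments follow directly from the definition.

## The discrete Rosenbrock test problem used in the simulation study.
rosenbrock <- function(x) {
  d <- length(x)
  sum(100 * (x[-d]^2 - x[-1])^2 + (x[-d] - 1)^2)
}

grid <- -5:5                        # 11 admissible levels per variable
rosenbrock(rep(0, 6))               # one of the 11^6 = 1,771,561 candidate points (value 5)
rosenbrock(rep(-5, 6))              # 450,180: the global maximum over the grid quoted in the text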

MC and LHS designs select the 150 experimental points in one step, whereas GA, PSO and EBN designs, as evolutionary approaches, derive smaller populations of candidate experimental points through a finite number of generations. We optimise GA-design over different forms of the crossover operator (Uniform, Arithmetic and Blend), different forms of the selection operator (Rank, Proportional, Tournament) and different probabilities of mutation and crossover; the best GA-design achieved (Rank selection operator, Uniform crossover operator, p_m = 0.1 and p_c = 0.9) is considered here for the comparison. In the simulations for GA-design and PSO-design, each population is composed of 30 experimental points, so the algorithms evolve the experimental points through 5 generations. In the simulation for EBN-design, we assume an initial population of 60 experimental points and let the algorithm evolve through 3 generations of 30 candidate points each, so that, like the other procedures, a total of 150 design points is sampled.
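The two one-step reference designs can be generated with the R packages named in the Computational tools appendix; the fragment below is an illustrative sketch under the grid defined above, not the authors' original script, and the evolutionary designs (GA, PSO and EBN) are instead run as 30-point populations over successive generations as just described.

## Illustrative generation of the one-step comparison designs (150 points each).
library(AlgDesign)
library(lhs)

grid <- -5:5
candidates <- expand.grid(rep(list(grid), 6))        # the full 11^6 candidate list (large; a subset can be used in practice)

## MC-design: Fedorov exchange algorithm over the candidate list
mc_design <- optFederov(~ ., data = candidates, nTrials = 150)$design

## LHS-design: 150 Latin hypercube points in [0, 1]^6 mapped onto the 11-level grid
u <- randomLHS(150, 6)
lhs_design <- matrix(grid[cut(u, breaks = 11, labels = FALSE)], ncol = 6)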

We evaluate the performance of EBN-design with respect to the other designs considered here by generating Monte Carlo simulations, i.e. 10 simulated datasets from the domain of the Rosenbrock function. We present here the behaviour of the best Monte Carlo simulation, i.e. the simulation in which the highest simulated response values are achieved by each procedure. The distribution of the response values achieved by the five design approaches, presented in the box-plot representations of Fig. 3, reveals that the MC, LHS, GA and PSO designs have similar distributions, while EBN-design shows a skewed response distribution due to the presence of very high values of the response. We can notice, in fact, that the top whisker for EBN-design is much higher than for the other designs, and we can also notice the presence of extreme values above the top whisker in the plot, which include the global optimum (these values are represented as dots in the box-plot). This first result shows a great capacity of EBN-design to uncover the optimality region (high values) of the response function while testing a very small set of experimental points.

Working in simulation, we know the whole experimental space that we explore, namely each experimental point and the corresponding response. It is therefore particularly informative, when comparing these design approaches, to evaluate the position of the sampled values with respect to the population they are supposed to represent (and from which they derive). This comparison is presented in Fig. 4. From the response values of each design (150 values for each design) we can see a concentration of these values in the central area of the response density function of the whole experimental space (as expected), and very high values of EBN-design in the right tail of the function. Being able to find high values in the response domain is quite relevant for the optimisation target. Ordering the experimental points according to the values of their responses and comparing EBN-design with each of the other design strategies, we notice in Fig. 5 that the EBN experimental points, represented in red, exhibit a strong concentration towards the right region of the domain, i.e. the region with higher response values.

Fig. 3. Box-plots of the simulated response value distributions achieved in optimising the Rosenbrock function with the MC, LHS, GA, PSO and EBN-design procedures.

Fig. 4. The response value density function of the whole experimental space and the experimental points of each design for the Rosenbrock function.

Fig. 5. Pairwise comparison of the ordered response values from each evaluated design for the Rosenbrock function.

To evaluate the performance of the strategy in terms of regions of optimality, we count the number of experimental points with response values greater than particularly high p-th percentiles of the response distribution, with p = {90, 95, 99}. We report in Table 1 the average number of experimental points obtained from the 10 data sets and the corresponding standard deviation. From this table we can observe the better performance of EBN-design in uncovering the optimality region with respect to the other designs selected here: testing the small set of 150 experimental points, EBN-design finds an average of 39.5 experimental points with response values above the 90th percentile of the distribution, i.e. almost 40 experimental points out of the 150 chosen by EBN-design give a response in the domain of the 10% best responses of the experiment. This average takes the value 10.2 for the domain of the 99th percentile of the distribution. Comparing EBN-design with the other designs considered here, we can confirm the very good performance of the approach, with a large difference in results: the number of experimental points in the domain of p = 90 ranges from 39.5 for EBN-design to 1.0 for LHS-design, and this difference even increases from p = 95 to p = 99.

Also very interesting is the study of the evolution through the generations of the best response (highest value) and the average response of the 10 data sets. We describe this evolution in Fig. 6, comparing EBN-design with GA-design and PSO-design, which are evolutionary, and also reporting the results achieved by MC-design and LHS-design in their one-step procedures. The evolutionary process guided by the Bayesian network provides higher values at each generation and an increasingly diverging pattern between EBN-design and the other designs considered here.

As a measure of design goodness, we introduce the following criterion. Let $\hat{f}_{max}$ be the maximum estimated value found by the design strategy, and let $f_{max}$ and $f_{min}$ be the known maximum and minimum values of f on the whole search space, respectively. The design goodness criterion for optimisation can be computed as follows:

$$DG = |\hat{f}_{max} - f_{max}| / |f_{min} - f_{max}|. \qquad (2)$$

Low values of DG indicate a small distance between $\hat{f}_{max}$ and $f_{max}$. In the best case, where $\hat{f}_{max} = f_{max} = f^{*}$, with $f^{*}$ the optimum value of the system, DG equals 0; if instead $\hat{f}_{max} = f_{min}$, then DG becomes 1. In our simulation study, with 10 replications of the process, we derive the DG criterion by considering the following cases: the average of the estimated optimal function values over the 10 replications ($M\hat{f}_{max}$); the median value over the 10 replications ($Me\hat{f}_{max}$); and the best value in the 10 replications ($\hat{f}_{max}$). In Table 2 we report the values of the DG criterion for the five designs considered here and for a fixed number of design points n = 150. From these results we notice that EBN-design is able to reach the global optimum (which is equal to 450,180), whereas the maximum value achieved by the other designs is 276,508, 216,401, 248,555 and 351,050 for MC, LHS, GA and PSO-design, respectively. Moreover, for EBN-design, all three versions of the DG criterion take the lowest value in comparison with the values achieved by the other designs, showing that EBN-design is able to find the best solution (global optimum).

4. Emergence of vesicles by EBN-design

Self-organising processes of amphiphilic molecules under particular conditions may lead to the emergence of new biological entities with specific functionalities that can be extremely relevant for medical and pharmaceutical research [59–61]. Experimentation in this field generally involves a huge number of parameters, generating an extremely large search space in which the target optimum value is very hard to find. Moreover, this experimentation is generally expensive, time consuming and frequently has some negative environmental effects.

Table 1
Comparison of the average number of experimental points with response value higher than the p-th percentile of the response distribution (standard deviation in parentheses) for the Rosenbrock function.

Design   p = 90        p = 95        p = 99
MC       10.6 (3.2)    4.8 (2.2)     0.8 (0.6)
LHS      1.0 (1.3)     0.3 (0.5)     0.1 (1.0)
GA       13.7 (8.9)    4.1 (3.2)     0.3 (0.6)
PSO      13.2 (2.3)    7.5 (3.1)     2.3 (1.8)
EBN      39.5 (8.9)    26.5 (7.1)    10.2 (3.9)


Fig. 6. Best response values (a) and Mean response values (b) through generations for the Rosenbrock function.

Table 2
Design goodness (DG) values of the optimisation strategies for the Rosenbrock function.

Design   DG(f̂max)   DG(M f̂max)   DG(Me f̂max)
MC       0.3858      0.4480        0.4324
LHS      0.5193      0.6056        0.6135
GA       0.4067      0.4855        0.4896
PSO      0.2202      0.3653        0.3759
EBN      0           0.1785        0.2242

To design the experiments for this high dimensional system and to model the resulting data, we adopt the EBN-design approach. Our specific interest is in the formation of biological compartments, namely vesicles, that can be used for drug delivery [28,62–64]. Experimentation for this research consists in evaluating mixtures of molecules, taken at different levels of concentration and under different laboratory protocols. The purpose of this experimentation is to discover the particular molecular mixture that optimises a measure of the emergence of vesicles. The set of candidate molecules is very large, and an a priori selection based on expert knowledge and exploratory statistical analyses reduced this set to 16 molecules. The amphiphilic molecules selected for this study are described in Appendix A. Each molecule has been considered at 6 possible levels of concentration. Formally, the variables of this experiment can be represented by x = (x_1, …, x_16), and each k-th experimental point is the mixture $\mathbf{x}_k = (x_{k,1}, \dots, x_{k,16})$, where $x_{k,i}$, with 0 ≤ x_{k,i} ≤ 1, represents the concentration level (the number of volume units) of the x_i molecule. This mixture composition has to satisfy the condition $\sum_{i=1}^{16} x_{k,i} = 1$. The response of each experimental point is a measure of the emergence of vesicles from the experimentation, and it is derived from a turbidity indicator provided by a fluorescent microscope. For the details of the experimentation we refer to Refs. [65,28].

According to the classical factorial design, we should conduct 15,504 tests to evaluate all the mixtures of the experimental space. In order to reduce this space, we develop EBN-design by randomly selecting a first set of 60 candidate mixtures (experimental points), observing their experimental responses (repeated three times to provide information on the experimental error), and then modelling the achieved data with the Bayesian network to infer the dependence relations among the variables. The process is then repeated for 3 generations, estimating 30 candidate mixtures in each generation of the algorithm, following the process described in Section 2.3. Therefore the experimentation involves 150 tests, that is, less than 1% of the whole experimental space (0.97%). In Fig. 7 a population of vesicles achieved in a single experimental point is represented, showing the diversity of the obtained structures.
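As an illustration of this mixture space, the sketch below assumes, as the compositions reported in Table 4 suggest, that each mixture allocates five volume units of 0.2 among the 16 molecules; under that assumption the space contains choose(20, 5) = 15,504 candidate mixtures, consistent with the count above, and a random initial population can be drawn as follows (the function and object names are hypothetical).

## Illustrative sampling of random candidate mixtures summing to one.
molecules <- c("CHOL", "DOPC", "DPPC", "DPPG", "PC", "DIPC", "YPHY", "PA",
               "PPRO", "LPC", "MEPC", "CARD", "TRIO", "POPG", "VITE", "SPM")

random_mixture <- function() {
  units <- rmultinom(1, size = 5, prob = rep(1, 16))   # five 0.2-volume units over 16 molecules
  setNames(as.vector(units) / 5, molecules)            # concentrations in {0, 0.2, ..., 1}, summing to 1
}

first_population <- t(replicate(60, random_mixture())) # a random initial population of 60 mixtures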

Fig. 7. Fluorescent micrograph of a vesicle population showing the diversity of vesicle structures (Martin Hanczyc, Protolife).

Table 3
Number of experimental points achieved by the three design strategies with response value higher than the p-th percentile of the response distribution.

Design   p = 90   p = 95   p = 99
MC       26       24       10
GA       24       14       4
EBN      49       30       11


Fig. 8. Pairwise comparison of the ordered experimental response values from amphiphilic mixtures.


Fig. 9. Best response values (a) and Mean response values (b) of the amphiphilic mixtures at each generation.

In evaluating the performance of EBN-design with respect to the GA and MC designs,1 we count the number of experimental points with response values in the right tail of the response distribution. From Table 3 we notice that EBN-design is able to discover a very high number of mixtures in this area of the distribution; specifically, for p = 90 and p = 95, EBN-design finds almost double the number of experimental points found by the other designs considered here. We notice, in fact, that 49 experimental points out of the 150 chosen by EBN-design give a response in the domain of the 10% best mixtures of the experiment, whereas this number is equal to 26 for MC-design and 24 for GA-design. This proportion remains essentially invariant for p = 95, while for p = 99 MC-design has a higher number of very good mixtures than GA-design, but still fewer than EBN-design. This comparison of design strategies exhibits the very good performance of EBN-design for optimisation.

1 To our knowledge, LHS-design has not been extended to highly constrained irregularly shaped experimental regions involving mixture components.

The pairwise design comparisons involving all the tested mixtures, ranked by their responses, are presented in Fig. 8. In this figure, EBN-design, described by the red bars, shows a majority of experimental points in the region of high responses. The design comparison can also be made with respect to the evolution across generations. In Fig. 9 we describe the evolution of the best response and of the mean response achieved in each generation of experimental points. We can notice that the best response achieved with EBN-design (red dotted line) is always better than the best response achieved with GA-design (black solid line), and it maintains this order across all the generations. The same behaviour is exhibited by the mean value of the responses, with a difference that seems to increase generation by generation, indicating the very good performance of EBN-design for optimisation. In order to understand how the composition of the mixtures changes to achieve the particular molecular interactions that give rise to high values of the Turbidity response, we now present the estimated network structures and their evolution across generations.


Fig. 10. EBN-design for the amphiphilic molecules self-organising process in experimentation (the three arc styles correspond to arc strengths in the intervals (0.5, 0.7], (0.7, 0.9] (→) and (0.9, 1] (⇒), respectively).

We infer the structure of the dependence relations among the variables, and we also observe how this inferred structure evolves from one generation of the algorithm to the next, approaching the area of optimality in the experimental space. The evolution of the structure of the network is represented in Fig. 10. At the beginning of the algorithm (generation 1), a first random population of experimental points is selected and tested. The estimated Bayesian network is able to give a first hint of the dependence relations among variables: a sub-network is identified in which the DPPG molecule seems to be the most informative. From the Bayesian network estimated at generation 2, we can identify two sub-networks of dependence relations. The first sub-network involves seven variables and highlights the key role of the DPPG molecule in interacting with the other variables and in affecting Turbidity. The second sub-network involves only the DPPC and LPC molecules, and shows that DPPC has a direct dependence relation with Turbidity. At generation 3 we can identify three sub-networks of dependence relations. We notice, in particular, three variables directly affecting the response of the system: the DPPC, SPM and CARD molecules.

At generation 4 the dependence relations among the most informative variables of the system and their effects on the response are outlined. We can clearly identify three sub-networks and four informative variables: DPPC, DPPG, CARD and SPM. The interaction of these variables plays a key role in providing very high Turbidity values. From the estimated network we can also evaluate the strength of the arcs, which informs on the accuracy and intensity of the dependence relations. This arc strength is calculated as the bootstrap probability of the true-positive presence of the arc in the network [66,67]. In Fig. 10 we indicate by a double-line arc the dependence relations with arc strength greater than 0.9.

The dependence of Turbidity on the mixture variables can also be confirmed by observing the changes in the mixture compositions across generations. In Table 4 we illustrate the detailed structure of four mixtures achieved at generation 1 and four achieved at generation 4 of the experimentation. The value of Turbidity increases from 0.120 in the first generation to 0.988 in the fourth generation. The composition of the best mixtures reveals the key role of the DPPC, DPPG, CARD and SPM molecules in achieving the highest Turbidity values, as inferred by the Bayesian network at generation 4 in Fig. 10.
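The bootstrap arc-strength computation mentioned above is available in the bnlearn R package; the sketch below shows one way it could be obtained, where mixtures stands for a hypothetical data frame of the discretised mixture variables together with the discretised Turbidity response.

## Illustrative bootstrap estimate of arc strengths (hypothetical data object).
library(bnlearn)

strengths <- boot.strength(mixtures, R = 200, algorithm = "hc",
                           algorithm.args = list(score = "bde"))
strengths[strengths$strength > 0.9, ]            # arcs recovered in more than 90% of bootstrap networks
avg_net <- averaged.network(strengths, threshold = 0.5)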


Table 4
Experimental mixtures and the resulting response T.

Molecule short name   Variable   Mixture composition at generation 1      Mixture composition at generation 4
CHOL                  X1         0      0      0      0.4                 0      0      0      0
DOPC                  X2         0      0      0      0                   0      0      0      0.2
DPPC                  X3         0      0      0      0                   0      0      0.4    0.2
DPPG                  X4         0      0      0      0                   0.2    0.2    0.2    0.2
PC                    X5         0      0.4    0.8    0                   0      0      0      0
DIPC                  X6         0      0.2    0      0                   0      0      0      0
YPHY                  X7         0.4    0      0      0                   0      0      0      0
PA                    X8         0      0      0      0.2                 0      0      0      0
PPRO                  X9         0      0      0      0.2                 0      0      0      0
LPC                   X10        0.6    0.2    0      0.2                 0      0      0      0
MEPC                  X11        0      0.2    0      0                   0      0      0      0
CARD                  X12        0      0      0      0                   0.6    0.8    0      0.2
TRIO                  X13        0      0      0      0                   0.2    0      0      0
POPG                  X14        0      0      0.2    0                   0      0      0      0
VITE                  X15        0      0      0      0                   0      0      0      0
SPM                   X16        0      0      0      0                   0      0      0.4    0.2
Turbidity             T          0.120  0.136  0.180  0.204               0.988  0.860  0.852  0.776

The bold values in the original table identify the variables with a direct dependence relation with the response variable T.

5. Concluding remarks

In this research we addressed the problem of designing experiments and modelling data from high dimensional systems with the objective of maximising a particular emergent property. Towards this aim, we introduced an evolutionary approach to design the experimental space, where the evolution is guided by a Bayesian network model. The model was able to identify the most informative dependence relations among the system variables and their interactions, and to reach the target optimality region. Combining the evolutionary strategy to explore the search space with the information from the Bayesian network model, EBN-design has been more successful than commonly used alternatives, both in simulation and in a real-data biochemical problem. The approach achieved the optimum value in at least one simulation, and the optimality region in the real-data problem, while testing only a very small fraction of the whole experimental space, namely less than 1% of the candidate experimental points. The comparison with other approaches introduced for designing experiments on high dimensional systems has always favoured EBN-design.

Conflict of interest

There is no conflict of interest.

Acknowledgements

This work was supported by the EU integrated project Programmable Artificial Cell Evolution (PACE) in FP6-IST-FET-002035, Complex Systems Initiative, and by the Fondazione di Venezia — DICE project. The authors would like to acknowledge the European Centre for Living Technology (www.ecltech.org) for providing opportunities for presentation and fruitful discussion of the research. Thanks to the colleagues of the University of Venice and of ProtoLife who gave valuable suggestions for this work. Thanks also to the referees for their very helpful suggestions.

Appendix A

Laboratory protocols, which include different reaction times, heating temperatures and a few technological devices, have been fixed by the experimenters on the basis of previous experience and a priori knowledge.

Computational tools

The statistical analyses were performed using the R-project free software environment for statistical computing and graphics (www.r-project.org), version R 2.15.2 GUI 1.53 Leopard build 32-bit, on OS X 10.9, 2.53 GHz Intel Core 2 Duo. EBN-design was built in the R environment by combining functions of the packages bnlearn, version 3.3, gRbase, version 1.6-8, and gRain, version 1.1-0. MC-design was built by using functions of the package AlgDesign, version 1.1-7, and LHS-design was built by using functions of the package lhs, version 0.10. GA-design was built by using functions of the packages genalg, version 0.1.1, and ga, version 1.1; we fix the probability of crossover equal to 0.9 and the probability of mutation equal to 0.1, and no elitism is admitted. PSO-design was built by using functions of the package hydroPSO, version 0.3-3, with c1 = c2 = 0.5 + log(2) and a star topology.

List of amphiphilic molecules

Below is the set of variables representing the amphiphilic molecules in the self-organising process described in Section 4:

• X1: Cholesterol, CHOL;
• X2: 1,2-Dioleoyl-sn-Glycero-3-Phosphocholine, DOPC;
• X3: 1,2-Dipalmitoyl-sn-Glycero-3-Phosphocholine, DPPC;
• X4: 16:0 PG or 1,2-Dipalmitoyl-sn-Glycero-3-[Phospho-rac-(1-glycerol)] (Sodium Salt), DPPG;
• X5: L-a-Phosphatidylcholine, Soy, PC;
• X6: 1,2-Di-O-Octadecenyl-sn-Glycero-3-Phosphocholine, DIPC;
• X7: Phytosphingosine (Saccharomyces cerevisiae) 4-Hydroxysphinganine, YPHY;
• X8: 1,2-Dioleoyl-sn-Glycero-3-Phosphate (Monosodium Salt), PA;
• X9: 1,2-Dioleoyl-sn-Glycero-3-Phosphopropanol (Sodium Salt), PPRO;
• X10: 1-Oleoyl-2-Hydroxy-sn-Glycero-3-Phosphocholine, LPC;
• X11: 1,2-Diphytanoyl-sn-Glycero-3-Phosphocholine, MEPC;
• X12: 1,1,2,2-Tetramyristoyl Cardiolipin (Sodium Salt), CARD;
• X13: 1,2,3-Trioleoylglycerol (Triolein), TRIO;
• X14: 1-Palmitoyl-2-Oleoyl-sn-Glycero-3-[Phospho-rac-(1-glycerol)] (Sodium Salt), POPG;
• X15: L-all-rac-α-Tocopherol (97%), VITE;
• X16: (2S,3R,4E)-2-acylaminooctadec-4-ene-3-hydroxy-1-Phosphocholine (Egg, Chicken), SPM.

References

[1] S. Georgiou, C. Koukouvinos, P. Mantas, On multi-level supersaturated designs, J. Stat. Plan. Infer. 136 (8) (2006) 2805–2819.
[2] S.D. Georgiou, Supersaturated designs: a review of their construction and analysis, J. Stat. Plan. Infer. (2012), http://dx.doi.org/10.1016/j.jspi.2012.09.014.
[3] C.J. Marley, D.C. Woods, A comparison of design and model selection methods for supersaturated experiments, Comput. Stat. Data Anal. 54 (12) (2010) 3158–3167.
[4] F. Sun, D.K.J. Lin, M.Q. Liu, On construction of optimal mixed-level supersaturated designs, Ann. Stat. 39 (2) (2011) 1310–1333.


[5] K.H.V. Booth, D.R. Cox, Some systematic supersaturated designs, Technometrics 4 (1962) 489–495. [6] V.V. Fedorov, Theory of Optimal Experiments, Academic Press, New York, 1972. [7] N.K. Nguyen, A.J. Miller, A review of some exchange algorithms for constructing discrete D-optimal designs, Comput. Stat. Data Anal. 14 (4) (1992) 489–498. [8] K. Ogungbenro, G. Graham, I. Gueorguieva, L. Aarons, The use of a modified fedorov exchange algorithm to optimise sampling times for population pharmacokinetic experiments, Comput. Methods Prog. Biomed. 80 (2) (2005) 115–125. [9] B. Smucker, E.D. Castillo, J.L. Rosenberger, Exchange algorithms for constructing model-robust experimental designs, J. Qual. Technol. 43 (2011) 29–44. [10] S. Levy, D. Steinberg, Computer experiments: a review, Adv. Stat. Anal. 94 (4) (2010) 311–324. [11] S. Kuhnt, D. Steinberg, Design and analysis of computer experiments, Adv. Stat. Anal. 94 (4) (2010) 307–309, http://dx.doi.org/10.1007/s10182-010-0143-0. [12] B. Auder, A.D. Crecy, B. Iooss, M. Marques, Screening and metamodeling of computer experiments with functional outputs. application to thermalhydraulic computations, Reliab. Eng. Syst. Saf. 107 (2012) 122–131. [13] J. Crapps, S. Daniewicz, A study of combined roughness and plasticity induced fatigue crack closure for long cracks using a modified strip-yield model, Int. J. Fatigue 43 (2012) 98–104. [14] V. Zakharov, R. Shamin, Statistics of rogue waves in computer experiments, JETP Lett. 96 (1) (2012) 66–69. [15] T. Mhlenstdt, M. Gsling, S. Kuhnt, How to choose the simulation model for computer experiments: a local approach, Appl. Stoch. Model. Bus. Ind. 28 (4) (2012) 354–361. [16] F. Sun, M.Q. Liu, D.K.J. Lin, Construction of orthogonal Latin hypercube designs, Biometrika 96 (4) (2009) 971–974. [17] L. Gu, J.F. Yang, Construction of nearly orthogonal Latin hypercube designs, Metrika 76 (6) (2013) 819–830. [18] J. Cawse, Experimental Design for Combinatorial and High Throughput Material Developments, Springer-Verlag, New York, 2003. [19] M. Pelikan, D. Goldberg, F. Lobo, A survey of optimization by building and using probabilistic models, Comput. Optim. Appl. 21 (1) (2002) 5–20. [20] R. Baragona, F. Battaglia, I. Poli, Evolutionary Statistical Procedures, Springer-Verlag, Berlin, 2011. [21] R. Poli, J. Kennedy, T. Blackwell, Particle swarm optimization, Swarm Intell. 1 (1) (2007) 33–57. [22] M. Forlin, D. Slanzi, I. Poli, Combining probabilistic dependency models and particle swarm optimization for parameter inference in stochastic biological systems, in: F.L. Gaol, Q.V. Nguyen (Eds.), Proceedings of the 2011 2nd International Congress on Computer Applications and Computational Science, vol. 145 of Advances in Intelligent and Soft Computing, Springer, Berlin Heidelberg, 2012, pp. 437–443. [23] R. Cowell, A. Dawid, S. Lauritzen, D. Spiegelhalter, Probabilistic Networks and Expert Systems, Springer Verlag, New York, 1999. [24] C. Borgelt, R. Kruse, Graphical Models: Methods for Data Analysis and Mining, Wiley & Son, New York, 2002. [25] A. Atkinson, A. Donev, R. Tobias, Optimum Experimental Designs, with SAS, Oxford University Press, Oxford, UK, 2007. [26] H. Maruri-Aguilar, E. Saenz-de Cabezon, H. Wynn, Betti numbers of polynomial hierarchical models for experimental designs, Ann. Math. Artif. Intell. 64 (4) (2012) 411–426. [27] H. Dette, V.B. Melas, P. Shpilev, Robust t-optimal discriminating designs, Ann. Stat. 41 (4) (2013) 1693–1715. [28] M. Forlin, I. Poli, D. De March, N. Packard, G. Gazzola, R. 
Serra, Evolutionary experiments for self-assembling amphiphilic systems, Chemom. Intell. Lab. Syst. 90 (2) (2008) 153–160. [29] F. Caschera, G. Gazzola, M.A. Bedau, C. Bosch Moreno, A. Buchanan, J. Cawse, N. Packard, M.M. Hanczyc, Automated discovery of novel drug formulations using predictive iterated high throughput experimentation, PLoS ONE 5 (1) (2010). [30] J.N. Cawse, G. Gazzola, N. Packard, Efficient discovery and optimization of complex high-throughput experiments, Catal. Today 159 (1) (2011) 55–63. [31] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1989. [32] W. Limmun, J.J. Borkowski, B. Chomtee, Using a genetic algorithm to generate d-optimal designs for mixture experiments, Qual. Reliab. Eng. Int. 29 (7) (2013) 1055–1068. [33] J. Kennedy, R. Eberhart, Particle swarm optimization, Proceedings of the IEEE International Conference on Neural Networks, IEEE, vol.4, 1995, pp. 1942–1948. [34] G. Zemella, D.D. March, M. Borrotti, I. Poli, Optimised design of energy efficient building facades via evolutionary neural networks, Energy Build. 43 (12) (2011) 3297–3302. [35] M. Borrotti, I. Poli, Nave bayes ant colony optimization for experimental design, in: R. Kruse, M.R. Berthold, C. Moewes, M.n. Gil, P. Grzegorzewski, O. Hryniewicz (Eds.), Synergies of Soft Computing and Statistics for Intelligent Data Analysis, Vol. 190 of Advances in Intelligent Systems and Computing, Springer, Berlin Heidelberg, 2013, pp. 489–497.

[36] F. Jensen, Bayesian Networks and Decision Graphs, Springer-Verlag, New York, 2001. [37] A. Darwiche, Modeling and Reasoning with Bayesian Networks, Cambridge University Press, Cambridge, UK, 2009. [38] R. Neapolitan, Learning Bayesian Networks, Prentice Hall, 2003. [39] S. Yang, K. Chang, Comparison of score metrics for Bayesian network learning, IEEE Trans. Syst. Man Cybern. Syst. Hum. 32 (3) (2002) 419–428. [40] Z. Liu, B. Malone, C. Yuan, Empirical evaluation of scoring functions for Bayesian network model selection, BMC Bioinforma. 13 (15) (2012) 1–16. [41] C. Chow, C. Liu, Approximating discrete probability distributions with dependence trees, IEEE Trans. Inf. Theory 14 (3) (1968) 462–467. [42] E. Herskovits, G.F. Cooper, Kutato: an entropy-driven system for construction of probabilistic expert systems from databases, Proceedings of the Sixth conference on Uncertainty in Artificial Intelligence, UAI'90, AUAI Press, Cambridge, MA, USA, 1990, pp. 54–62. [43] W. Lam, F. Bacchus, Learning Bayesian belief networks: An approach based on the MDL principle, Comput. Intell. 10 (1994) 269–293. [44] N. Friedman, M. Goldszmidt, Learning Bayesian networks with local structure, Proceedings of the Twelfth Annual Conference on Uncertainty in Artificial Intelligence, UAI'96, Morgan Kaufmann, Reed College, Portland, Oregon, USA, 1996, pp. 252–262. [45] G. Cooper, E. Herskovits, A Bayesian method for the induction of probabilistic networks from data, Mach. Learn. 9 (1992) 309–347. [46] D. Heckerman, D. Geiger, D. Chickering, Learning Bayesian networks: the combinations of knowledge and statistical data, Mach. Learn. 20 (1995) 197–243. [47] R. Daly, Q. Shen, S. Aitken, Learning Bayesian networks: approaches and issues, Knowl. Eng. Rev. 26 (2011) 99–157. [48] R. Nagarajan, M. Scutari, S. Lbre, Bayesian Networks in R: with Applications in Systems Biology, Springer Publishing Company, Incorporated, 2013. [49] M. Henrion, Propagating uncertainty in Bayesian networks by probabilistic logic sampling, in: J.F. Lemmer, L.N. Kanal (Eds.), Uncertainty in Artificial Intelligence, vol. 2, Elsevier Science Publishing Company, Inc., New York, USA, 1988, pp. 149–163. [50] H.H. Rosenbrock, An automatic method for finding the greatest or least value of a function, Comput. J. 3 (3) (1960) 175–184. [51] Y. Yongjian, L. Yumei, A new discrete filled function algorithm for discrete global optimization, J. Comput. Appl. Math. 202 (2) (2007) 280–291. [52] S.F. Woon, V. Rehbock, A critical review of discrete filled function methods in solving nonlinear discrete optimization problems, Appl. Math. Comput. 217 (1) (2010) 25–41. [53] B. Wheeler, optMonteCarlo. AlgDesign. The R project for statistical computing, R package version 1.1-7, 2011. [54] R. Stocki, A method to improve design reliability using optimal Latin hypercube sampling, Comput. Assist. Mech. Eng. Sci. 12 (2005) 87–105. [55] E.R. van Dam, G. Rennen, B. Husslage, Bounds for maximin Latin hypercube designs, Oper. Res. 57 (3) (2009) 595–608. [56] L. Scrucca, GA: a package for genetic algorithms in R, J. Stat. Softw. 53 (4) (2013) 1–37. [57] M. Zambrano-Bigiarini, R. Rojas, A model-independent particle swarm optimisation software for model calibration, Environ. Model Softw. 43 (2013) 5–25. [58] M. Zambrano-Bigiarini, R. Rojas, hydroPSO: Particle Swarm Optimisation, with focus on Environmental Models, R Package Version 0.3-3, 2013. [59] P. Stano, P. Carrara, Y. Kuruma, T. Pereira de Souza, P.L. 
Luisi, Compartmentalized reactions as a case of soft-matter biotechnology: synthesis of proteins and nucleic acids inside lipid vesicles, J. Mater. Chem. 21 (2011) 18887–18902. [60] T. Pereira de Souza, F. Steiniger, P. Stano, A. Fahr, P.L. Luisi, Spontaneous crowding of ribosomes and proteins inside vesicles: a possible mechanism for the origin of cell metabolism, ChemBioChem 12 (15) (2011) 2325–2330. [61] S.D. Pickett, D.V.S. Green, D.L. Hunt, D.A. Pardoe, I. Hughes, Automated lead optimization of MMP-12 inhibitors using a genetic algorithm, ACS Med. Chem. Lett. 2 (1) (2011) 28–33. [62] F. Caschera, M. Hanczyc, S. Rasmussen, Machine learning for drug design, molecular machines and evolvable artificial cells, Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, GECCO'11, ACM, New York, NY, USA, 2011, pp. 831–832. [63] F. Caschera, S. Rasmussen, M. Hanczyc, Machine learning optimization of evolvable artificial cells, in: E. Giacobino, R. Pfeifer (Eds.), Proceedings of the 2nd European Future Technologies Conference and Exhibition 2011, Vol. 7 of Procedia Computer Science, Elsevier Science, 2011, pp. 187–189. [64] M.A. Bedau, A. Buchanan, G. Gazzola, M. Hanczyc, T. Maeke, J. Mccaskill, I. Poli, N.H. Packard, Evolutionary design of a DDPD model of ligation, Lecture Notes in Computer Science 3871, Springer, 2005, pp. 201–212. [65] M. Theis, G. Gazzola, M. Forlin, I. Poli, M. Hanczyc, M. Bedau, Optimal formulation of complex chemical systems with a genetic algorithm, ECCS06 online Proceedings, Oxford, 2006, pp. 193–197. [66] N. Friedman, M. Goldszmidt, A. Wyner, Data analysis with Bayesian networks: A bootstrap Approach, 1999. [67] M. Scutari, R. Nagarajan, Identifying significant edges in graphical models of molecular networks, Artif. Intell. Med. 57 (3) (2013) 207–217.