Improving the efficiency of evolutionary de novo peptide design: Strategies for probing configuration and parameter settings

Improving the efficiency of evolutionary de novo peptide design: Strategies for probing configuration and parameter settings

BioSystems 88 (2007) 35–55 Improving the efficiency of evolutionary de novo peptide design: Strategies for probing configuration and parameter settin...

2MB Sizes 0 Downloads 19 Views

BioSystems 88 (2007) 35–55

Improving the efficiency of evolutionary de novo peptide design: Strategies for probing configuration and parameter settings Wuming Zhang a , Kazuyoshi Yano b , Isao Karube b,∗ a

Research Center for Advanced Science and Technology, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8904, Japan b School of Bionics, Tokyo University of Technology, 1401-1 Katakura, Hachioji, Tokyo 192-0982, Japan Received 28 September 2004; received in revised form 30 November 2005; accepted 11 April 2006

Abstract Evolutionary molecular design based on genetic algorithms (GAs) has been demonstrated to be a flexible and efficient optimization approach with potential for locating global optima. Its efficacy and efficiency are largely dependent on the operations and control parameters of the GAs. Accordingly, we have explored new operations and probed good parameter setting through simulations. The findings have been evaluated in a helical peptide design according to “Parameter setting by analogy” strategy; highly helical peptides have been successfully obtained with a population of only 16 peptides and 5 iterative cycles. The results indicate that new operations such as multi-step crossover–mutation are able to improve the explorative efficiency and to reduce the sensitivity to crossover and mutation rates (CR–MR). The efficiency of the peptide design has been furthermore improved by setting the GAs at the good CR–MR setting determined through simulation. These results suggest that probing the operations and parameter settings through simulation in combination with “Parameter setting by analogy” strategy provides an effective framework for improving the efficiency of the approach. Consequently, we conclude that this framework will be useful for contributing to practical peptide design, and gaining a better understanding of evolutionary molecular design. © 2006 Elsevier Ireland Ltd. All rights reserved. Keywords: Genetic algorithm; De novo peptide design; Helical peptide; Evolutionary design; Sequence-comparison

1. Introduction Although both are usually synthesized from the same set of 20 natural amino acids, peptides are readily distinguished from proteins by their easy production and lack of higher-order structures. These facilities of studying peptide enable us to develop our understanding of the relationship between the sequences and their structures of natural proteins and, furthermore, to expedite our discovery of particular proteins for various kinds

∗ Corresponding author. Tel.: +81 426 37 2111; fax: +81 426 37 3134. E-mail address: [email protected] (I. Karube).

of applications (Baldwin and Rose, 1999; Kohn and Hodges, 1998). Although exploring particular peptides for these purposes is a problem difficult to be formalized and solved mathematically, this task could be successfully accomplished by using the de novo design, an irrational design strategy whereby a combinatorial sequence space screening allows discovery of specific peptides. However, the number of potential peptides in the space is so large that it is infeasible to exhaustively synthesize and screen all the sequences to obtain the specific peptides, notwithstanding that combinatorial chemistry provides a powerful means to synthesize and screen far more peptides. The problem remains as to how to obtain particular peptides efficiently. One approach to address this difficulty is de novo peptide design based

0303-2647/$ – see front matter © 2006 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2006.04.007

36

W. Zhang et al. / BioSystems 88 (2007) 35–55

on genetic algorithms (GAs), one of the evolutionary algorithms, which can be used for exploring vast combinatorial sequence spaces. This GA-based approach is particularly appropriate to peptide design owing to the natural extension of biological evolutionary operators to amino acid representations of peptides in computers (Jones, 1994). Moreover, effective peptides can be generated in a heuristic way, much like natural evolutionary processes, thereby enhancing the efficiency of design and increasing the possibility of success. This approach will become a more feasible peptide design framework if the GAs are tailored to meet the demands of the de novo peptide design. In a typical GA-based framework, the GAs generally work with a population of sequences (peptide library); each sequence is a possible solution for the design problem in hand; its “goodness” is assessed by a fitness function (objective function) reflecting the satisfaction of the design specifications. The new candidates are then generated by applying the selection, crossover, mutation, and other genetic operators to the evaluated population. This process is generally iterated in such a way that the more “fit” a member of the population is, the more likely its reproduction is performed, thus mimicking the principles of Darwinian evolution (Goldberg, 1989). Overall, this algorithm is not only an evolutionary model applicable to organisms, but also for solving combinatorial problems like de novo peptide design (Forrest, 1993; Fogel, 1994). In a GA-based molecular design, the molecular design cycle is framed by combining a stochastic variation process whereby new populations are generated by genetic operations according to the properties of the molecules, with an evaluation process whereby the molecules are assessed. For example, the generation of molecular structures by genetic operations in combination with the computational prediction of their properties has enabled the concept of evolutionary de novo molecule design, which has been applied with success in the design and synthesis of small molecules (Sheridan and Kearsley, 1995; Brown and Martin, 1997; Douguet et al., 2000; Pegg et al., 2001). In these applications, the numerical fitness scores have been derived from theoretical models or empirical evaluations. Since our present knowledge concerning the sequence–property relationships does not allow the accurately computational prediction of their properties, the experimental results of peptides have to be used as a component in the estimation of fitness scores. Therefore, the generation of peptide sequences, in combination with the synthesis, purification, and analysis of real peptides, constructs a new framework for evolutionary de novo

molecular design. Due to its conceptual simplicity, flexibility, and lack of reliance on a priori information, this optimization approach has successfully been introduced in generating and synthesizing peptides with specific properties (Yokobayashi et al., 1996; Kamphausen et al., 2002) and peptides with specific structure (Zhang et al., 2003). However, an excessively large population size and, more importantly, a quite large number of generations are often required to ensure satisfactory results such as the optimal fitness. These disadvantages may be overcome, thereby minimizing the time and expense consumed in peptide synthesis and screening experiments, if the GAs could be customized for a specific de novo peptide design. In the GA-based approach, the configuration of genetic operations and the setting of control parameters of GAs generally manage the way they operate and largely determine the success and efficiency this approach solves problems (Eiben et al., 1999). Customization of GAs can thus be regarded as probing the satisfactory configurations and parameter settings. For more than two decades, numerous theoretical and experimental analyses of the configurations and parameter settings of GAs have been performed to solve this problem (De Jong, 1975; Grefenstette, 1986; B¨ack et al., 1997; Sundaram and Venkatasubramanian, 1998; Illgen et al., 2000; Belmont-Moreno, 2001; Rojas et al., 2002). Unfortunately, in spite of these efforts, there has never been a general theoretical framework allowing us to recognize the optimal status of the algorithms. Despite the lack of theoretical frameworks, there are two empirical methodologies of handling this problem. In some complex situations, the framework is managed by adaptive parameters; this framework enables GAs to optimize their performance by modifying their parameters dynamically during the evolutionary processes (Hinterding et al., 1997; Eiben et al., 1999). The optimal parameter setting could thus be achieved through an adaptive process. In contrast, in most practical situations, GAs are set to be controlled by static parameters, which remain invariable during the evolutionary process. Whereas the adaptive parameters may rely solely on the framework of the design system, the static parameters have to be empirically chosen in term of the literature recommendations or practitioner’s experience, or by means of trial-and-error experimentation. In trial-and-error method, the optimal parameter setting could be determined by exhaustive comparison of the experimental outcomes at all possible parameter settings. Unfortunately, the number of the experiments required might be far more than that in an evolutionary de novo peptide design itself, let alone this exhaustive

W. Zhang et al. / BioSystems 88 (2007) 35–55

comparison. “Parameter setting by analogy” is an alternative method for determining good parameter settings (Eiben et al., 1999; Harik et al., 1999; Rojas et al., 2002). In this method, it is assumed that for a given GA, good parameter settings for one problem would be similar to those for a closely related problem. In other words, good parameter settings that have been successful in solving one problem could be used for its “similar” problems. Herein, we set up a sequence-searching problem relevant to de novo peptide design to provide a benchmark for probing the good parameter settings. Clearly, the shared problem is how to efficiently explore a combinatorial sequence space (peptide library) for specific peptides; this problem could be efficiently solved by using a customized framework of the evolutionary de novo peptide design approach. Consequently, the findings from the simulation of a analogous problem can be utilized as strategies for setting the practical peptide design according to the principle of “Parameter setting by analogy”. In other words, the evolutionary de novo design framework will become a more efficient design system, if it is set at the optimal (or good) configuration and parameter setting identified by analyzing the dynamic behavior of the GAs in the simulation. This method has been evaluated in a real peptide design in this study. In this article, we attempt to place de novo peptide design in a flexible optimization framework and, furthermore, to investigate the suitable configurations and parameter settings of the GAs for locating the best peptide using small populations and the minimum number of design cycles. Our aims are therefore two-fold: to gain a better insight into the most relevant aspects of the GAs and to achieve a better implementation of the elemental operations in a real peptide design. 2. Principles and methods

37

some). These sequences are then assessed by a fitness score reflecting their properties (evaluation). This quantitative representation can be attained using a theoretical or empirical computation (prediction), an experimental analysis (chemical synthesis and measurement), or a combination of them. The program sequentially selects promising individuals from the current library according to their fitness scores (selection) and generates new candidates for next generation by performing crossover, mutation, and other genetic operations on these selected sequences (stochastic variation or generation). These operations are managed by the corresponding parameters such as the population size (PS), crossover rate (CR), and mutation rate (MR). In this way, the evolutionary algorithm may be implemented through the iteration of the stochastic generation of a population of new sequences, the evaluation of these generated sequences, and the selection of the promising sequences from these evaluated sequences. The iteration can be set to terminate upon achieving optimal fitness, reaching a preset number of iterations, or satisfying other applicationdependent criteria. As shown in Fig. 1(b), the generated sequences may be regarded as those sampled from the combinatorial sequence space, whereas their fitness scores could be considered the values sampled at the corresponding sequences on fitness landscapes. For example, in the current study of helical peptide design, the sequences of potentially coiled-coil helical peptides are those sampled from a 31-residue sequence space, whereas their helicities or ellipticities at 222 nm are the values at the sites of these sequences on the helicity landscape of the folded peptides. With the process proceeding, the selected sequences in the current library are getting increasingly advantageous and are approaching the optimal sequence. Similarly, the helicities or ellipticities at 222 nm of the advantageous sequences in the library will be getting increasingly greater with the generation proceeding, and eventually the peptide with the highest helicity could be obtained after a process of a number of generations. In this way, through the iteration of variation of sequences and screening of peptides in the working peptide library, this algorithm fulfills the exploration of the most advantageous peptide in the combinatorial sequence space.

2.1. Overview of evolutionary de novo peptide design Peptides are oligomers usually comprising the 20 natural amino acids; hence, each N-residue peptide could be regarded as 1 of the 20N possible peptides in the combinatorial sequence space. While the peptides in the space are too many to be synthesized, useful peptides within these peptides tend to be very few. The evolutionary de novo peptide design is an effective way to the screening because the GAs used in this approach are optimization algorithms appropriate to the combinatorial problem. The evolutionary de novo design framework is typically set to operate a working peptide library in such a way as to mimic the natural evolution. As can be seen in Fig. 1(a), each peptide in the library is generally encoded in a string of binary, real, or integer variables for genetic computation (virtual chromo-

2.2. Description of simulated evolutionary de novo peptide design The volume, polarity, charge, hydrophobicity, and other properties of amino acids are generally responsible for the tertiary structures and physiochemical properties of the peptides. With a combination of these distinct properties of the amino acids in sequences, natural proteins fulfill numerous design specifications such as bio-catalytic activity, specific structure, and structural stability through evolution. However, homogeneous proteins are still far beyond de novo design due to lack of information regarding which properties of amino acids are the most important in various parts of the proteins, and what criteria constrain the choice of amino acids at different locations in sequences. These difficulties could be circumvented

38

W. Zhang et al. / BioSystems 88 (2007) 35–55

Fig. 1. Schematic representation of the evolutionary de novo peptide design approach. The cycle is started from a population of random sequences, and terminated after a few design cycles. Schematic representation of the: (a) steps involved in evolutionary de novo peptide design and (b) principle of evolutionary de novo peptide design.

W. Zhang et al. / BioSystems 88 (2007) 35–55

if particular proteins were created in such a way mimicking natural proteins. That is, proteins can be explored through the iteration of generating new sequences and characterizing these sequences. 2.2.1. Protein characterization by sequence similarity Another problem in protein design is that even characterizing a single protein by NMR or X-ray crystallographic methods and biochemical experiments is very expensive and time-consuming, let alone solving the properties of many proteins generated in an evolutionary design process. The tertiary structures and physicochemical properties of proteins might be inferred from the information available concerning resolved proteins. The basic assumption underlying this application is the “similar property principle”, whereby similar chemical structures (or amino acid sequences) should lead to similar tertiary structures, physicochemical properties, and biological activities (Rouvray, 1990). This hypothesis is strongly supported by the analyses of natural protein sequences. Since protein structure is conserved in divergent evolution, proteins with different but related functions in present-day organisms often have similar amino acid sequences. These homologous proteins generally have a common ancestor and share a similar tertiary structure. Thus, identifying homologous proteins of known structure can provide detailed tertiary structures and physicochemical properties of sequences. At present, the comparison of a new protein sequence against a database of resolved proteins is perhaps the most important means of understanding the function and origin of a protein. We argue that this sequencecomparison method would likewise become a potent technique to characterize de novo sequences. 2.2.2. Sequence similarity based on amino acid similarity matrices The Smith–Waterman algorithm is an effectual sequencecomparison method for discovering even weak similarities between sequences (Smith and Waterman, 1981). The frequently used database searching programs, such as FASTA and BLAST, are generally based on heuristic algorithms derived from the Smith–Waterman algorithm (Altschul et al., 1990; Pearson, 1990). These algorithms could be expressed as follows. Let R = R1 R2 ···Rl ···RL and T = T1 T2 ···Tm ···TM be two sequences containing 20 amino acids. A local sequence alignment between two such sequences is performed along an alignment path p represented as a set of ordered index pairs pk = (lk , mk ) of matched residues in the sequences R and T, where k = 1, 2, . . ., K; lk ≤ lk+1 , mk ≤ mk+1 ; lK ≤ L, mK ≤ M. The alignment scoring function Sp (R, T) at an alignment path p comprises a sequence similarity scoring function and a gap scoring function. While the sequence similarity scoring function consists of a substitution matrix describing all possible identities and substitutions between matched residues, the gap scoring function comprises penalty scores for various insertions and deletions occurred in aligned sequences.

39

The Smith–Waterman algorithm optimizes the alignment scoring function over all possible paths between two sequences to obtain the optimal local alignment score given by Eq. (1), that is, the local alignment score along the optimal path poptimal . S(R, T ) = Max{Sp (R, T )}

(1)

p

The optimal alignment scores computed by Smith–Waterman algorithm have traditionally been interpreted as measures of similarity (maximum) or distance (minimum) between the sequences and generally been viewed as criteria that determine whether a mutated sequence is likely to conserve the structural and physicochemical properties necessary to maintain the structure and function of the protein (Smith et al., 1981). For two gapless sequences R and T (L = M), the ordered index pairs (1, 1), (2, 2), . . ., (L, L) of matched residues in the sequences are the optimal alignment path of local alignment. The local alignment score, or sequence similarity score S(R, T) given by Eq. (2) could thus be obtained by summing the amino acid similarity s(Rl , Tl ) between amino acid pairs occupying the same position in sequences. S(R, T ) =

L 

s(Rl , Tl )

(2)

l=1

The amino acid similarity matrices describing all possible pairs of amino acids are generally determined through comparative analyses of natural sequences, or in terms of structural and physicochemical properties of amino acids. The former are statistical scores for explicit and implicit evolutionary models; Point Accepted Mutation (PAM) series and Block Substitution Matrix (BLOSUM) series are the examples. These matrices may tabulate numerical values for the frequency of various substitutions occurring in selected protein sequences. On the other hand, the latter are systematic scores computed according to the selected properties of amino acids such as polarity, volume, atomic composition, and ␣-helix preferences. These scores might be given by the distances between the selected properties of amino acid pairs (Grantham, 1974; Miyata et al., 1979; Mohana Rao, 1987; George et al., 1990). 2.2.3. New sequence similarity based on physicochemical distance In peptide design, let sequences R and T be gapless sequences in a combinatorial sequence space SS defined by a given sequence length N and a predefined amino acid composition. The peptide T was set as the target peptide possessing the desired physicochemical properties (design specifications) while the peptide R was some peptide in the same peptide library. As was stated above, the sequence similarity between the gapless sequences T and R could be obtained with Eq. (2), and therefore the sequence R could be characterized in terms of the sequence similarity to the sequence T. While the statistical amino acid similarity matrices are more adequate to compute the similarity between natural sequences, especially the conserved regions in natural sequences, the sys-

40

W. Zhang et al. / BioSystems 88 (2007) 35–55

tematic score matrices are more suitable for diverse sequences. Accordingly, a new sequence similarity based on systematic scores was required to evaluate de novo designed sequences with sequence-comparison. Its level should be only related to the properties that are important for the structural and physiochemical properties of peptides during the evolutionary design processes. As given by Eq. (3), the new sequence similarity was defined as a Euclidean distance between the structural and physicochemical property arrays of the amino acids in sequences and normalized by peptide length.

 1 F (R, T ) = · N

N i=1

2

(PiR − PiT ) + σP2

N i=1

2

(ViR − ViT ) σV2

2.3. Description of the new GA A new GA was established to enhance the explorative ability to the sequence space. The basic steps in the new GA are shown in the flowchart in Fig. 2. An initial population was randomly generated and the numerical fitness of each sequence in the population was determined using Eq. (3). The selection–reproduction scheme was then performed according to the rank of their “goodness” in the population. In this stage, the promising sequences in the current population were selected as members of a parent population. Of them, several very promising sequences were duplicated to an elitism library, the SeedLibrary, and several best indi-

(3) where PiR , ViR and PiT , ViT are the numerical values of polarity P and volume V of the amino acid residues at position i in the sequences R and T; these quantitative representations are same as those in Grantham’s article (Grantham, 1974; Willett et al., 1998). The symbols σ P and σ V are the standard deviations of the polarities and volumes of the amino acids used in sequence space SS. This normalized physiochemical distance is signified as the similarity index between the sequences R and T. It varies between zero, indicating that both sequences have identical properties, and positive values, indicating lesser degrees of similarity between them. In this way, each sequence in the sequence space SS could easily be assigned a numerical value for its normalized physicochemical distance from the target sequence, that is, “similarity index”, as its fitness score. 2.2.4. Simulated evolutionary de novo peptide design A sequence satisfying the design specifications was first set as the optimal sequence or the target peptide in combinatorial sequence space. The algorithm was then given the task of identifying the sequences possessing the structural and physiochemical “properties” as similar as possible to those of this target peptide. A simulated evolutionary de novo design strategy based on the sequence-comparison was performed to solve this sequence-searching problem. Clearly, the optimal sequence should be found by comparing the fitness scores (similarity indices) of the generated sequences, rather than the amino acid sequences or the physiochemical property arrays of the amino acids in sequences. In the design process, candidates were easily generated by randomly mutating and shuffling a group of sequences selected from the proceeding population. The similarity index between a generated sequence and the optimal sequence was assigned as the fitness score of the generated sequence, convincing the degree of satisfying the design specifications. In principle, the optimal sequence would have a maximum similarity (or minimum similarity index). There must be only one optimal sequence, the target peptide, because the array of physicochemical properties of the target peptide is unique in the fitness landscapes. Accordingly, we were able to stipulate the termination criteria of the GAs and, furthermore, to analyze the performance of the evolutionary de novo design approach.

Fig. 2. Flowchart for the new genetic algorithm (GA). LoopMax is the preset number of crossover–mutation steps.

W. Zhang et al. / BioSystems 88 (2007) 35–55

viduals in the SeedLibrary were reproduced to the parent population. This parent population was then subjected to a multi-step crossover–mutation operation, comprising threepoint crossover and mutation procedures, and an offspring population was formed. In this algorithm, the three-point crossover, and mutation procedures were repeated several times in order to increase the stochastic variation, thereby inducing a more widely varied offspring population. The individuals in this offspring population were finally evaluated according to Eq. (3) in a new generation. In this way, the evaluation, selection, and multi-step crossover–mutation procedures were iterated until a similarity index less than preset threshold, a generation larger than preset number of iterations, or other quantitative feature satisfying convergence criterion was reached. 2.3.1. Reference GA Traditionally, a standard GA is based on bit representation, one-point crossover, bit-flip mutation, and roulette wheel selection (with or without elitism). The algorithm design is thus limited in its setting of parameters (PS, MR, and CR). To evaluate the new algorithm, we constructed a standard GA as a reference. This GA had the same starting population, as did the new GA in each run. However, a one-step crossover–mutation procedure containing a one-point crossover operation, rather than the eight-step crossover–mutation procedure containing a three-point crossover operation, was used. 2.3.2. Population size The population must contain a sufficient number of sequences to maintain the diversity for exploration. There is a high risk of premature convergence to a local sub-optimum if the population size is too small. In contrast, although a large population discourages premature convergence to sub-optimal solutions, this action requires more evaluations per generation, possibly resulting in unacceptably high expense and lengthy delays prior to convergence. Hence, limiting the population to an appropriately small size is required to improve the efficiency of peptide design. We wished to determine the optimal crossover rate and mutation rate setting (CR–MR setting) and to investigate whether this setting would change upon variation of the population sizes.

41

2.3.3. Selection and reproduction procedures In the new GA, a ranking strategy was used for the selection of sequences whereby the reproduction of a sequence depended on its rank order in population rather than its absolute fitness score. The selection procedure was implemented by choosing the selection cutoffs of the sequences in the evaluated population. This strategy avoids giving the largest share of offspring to a small group of highly fitted sequences, thereby reducing the selection pressure when the fitness variance is high, especially at the situations of small population and early stages of the process. The cutoffs were experimentally determined by considering the influence of small population sizes and a limited number of design cycles on the diversity of sequences in population. The parent population was therefore formed by copying the top about 24% twice (reproduction), the middle approximately 40% once (survival), and the last 30% nothing (death) of the evaluated population. The remaining circa 6% came from the SeedLibrary. 2.3.4. Crossover and mutation operations The crossover operations are generally performed on all sequences. If a one-point crossover is performed, pairs of sequences are randomly selected from the parent population. Then, the paired sequences will be cut at a same random site or not according to the preset CR. If they are cut, the first half of one sequence is subsequently combined with the second part of the other sequence, resulting in the formation of two new sequences. Alternatively, the paired sequences simply become two new individuals. The two-point and three-point crossovers can be performed similarly, as shown in Fig. 3. In our experiments, a one-point crossover operation was used in the standard GA, whereas a three-point crossover operation was utilized in the new GA. In binary implementations of GAs, mutations are generally performed by changing a bit in the chromosome from 1 to 0, or vice versa (bit flipping). In our integer-coded GA, this was implemented by substituting one amino acid residue in sequence with another randomly selected from a set of amino acids. Supposing an MR of MR0 is used in a GA with S steps of crossover–mutation, the equivalent MR of MRS is then given

Fig. 3. Schematic representations of crossovers. The parent sequences Ac-NAKVQNSRMK-Amide and Ac-KQHGLAKEDQ-Amide are two illustrative examples of decapeptides. (a) One-point crossover; (b) two-point crossover; (c) three-point crossover.

42

W. Zhang et al. / BioSystems 88 (2007) 35–55

by Eq. (4). MRS = 1 − (1 − MR0 )S

(4)

Similarly, the equivalent CR can be determined. 2.4. Experimental design for simulated evolutionary de novo peptide design In order to exhaustively search the optimal parameter setting, we set the algorithm to identify the optimal sequence from the defined sequence space and compared the experimental outcomes at all possible parameter settings. The experiments were designed to identify the suitable parameter setting comprising four factors: the PS, CR, MR, and step number of crossover–mutations. The number of possible parameter settings is very large even if we use only a few values for each parameter. The experiments were conducted using PSs of 8, 16, 24, 32, and 48. For each population size, the experiments allowed 21 MR values, increasing linearly from 0 to 0.20; 11 CR values, varying from 0 to 1.00; and 4 types of multi-step crossover–mutations, containing 1-, 4-, 8-, and 16-step. A total of 4620 combinations were tested. For each given parameter setting, the outcome was determined by averaging the results from five runs of each experiment to account for the stochastic characteristics of the evolutionary processes. In addition, two sequences, a naturally occurring sequence and a randomly generated sequence, were set as the optimal sequences for the combinatorial sequence spaces SS1 and SS2 derived from the 20 natural amino acids and the 8 hydrophilic amino acids selected from the 20 natural amino acids, respectively. The eight hydrophilic amino acids are serine (S), threonine (T), glutamine (Q), glutamic acid (E), asparagine (N), aspartic acid (D), lysine (K), and arginine (R). The evolutionary process in this study was started from a random population and terminated after 30 generations. 2.5. Experimental design for helical peptide design using GAs The ellipticity, especially at 222 nm, is often used as a helicity index of proteins and peptides. A typical ␣-helical peptide has a large negative response over the wavelength range from 190 to 260 nm, with a peak at 222 nm, in the circular dichroism (CD) spectrum. The ellipticity at 222 nm allows the accurate determination of the helicity of a peptide (Chen et al., 1974; Chang et al., 1978). However, the ellipticities at 222 nm of synthetic peptides have to be measured because they are unable to be predicted theoretically. The measured ellipticities were consequently used as the fitness scores in helical peptide design. In each design cycle, a group of sequences was generated by the new GA. After being synthesized and purified, the peptides with these sequences were evaluated using circular dichroism spectrometry. The new sequences for next generation were subsequently generated by the GA programs according to the CD results.

2.5.1. Coiled-coil helical peptide design using the new GA The findings in the simulated evolutionary de novo peptide design were used in the further experiment of screening coiledcoil helical peptides in a 31-residue peptide library SS3 derived from eight hydrophilic amino acids. This peptide library was based on a coiled-coil helical peptide model (target peptide), i.e., the leucine zipper domain in the protein GCN4. Thus, the designed peptides possess potential leucine zipper conformation. These peptides were denoted LZP peptides. The sequence space SS3 consists of the LZP sequences given by formula (5). Ac-YG-(Xg Va Xb Xc Ld Xe Xf )4 -G-Am

(5)

where V and L stand for the hydrophobic amino acids valine and leucine, and the residue X is one of the eight hydrophilic amino acids. The N-terminal tyrosine (Y) was used as a UV absorbent residue for the determination of peptide concentrations. The N-terminal glycine (G) was used as a helix destroyer to isolate the tyrosine residue from the helical fragments in sequences. Another glycine residue at the C-terminus had the function of simplifying the synthetic protocols. In addition, an internal asparagine in position a inhibits nonspecific hydrophobic interactions between the helices. This effect prevents the peptides from over-oligomerization and consequent precipitation during experiments, thus, resulting in the formation of stable coiled-coil dimers or trimers of the peptides. In order to produce soluble peptides, the residues on the outer surface (positions b, c, e, f, and g) of coiled-coil helices, as shown in Fig. 4(a), were selected from the eight hydrophilic amino acids. Thus, there are 85 possible arrangements of hydrophilic residues in the library SS3. Each arrangement of the eight hydrophilic residues at positions b, c, e, f, and g could produce a peptide possessing a unique ellipticity. The residues in the hydrophobic core (positions a and d) between the helices, on the other hand, were hydrophobic amino acids valine (V) and leucine (L), the same as those in the leucine zipper domain of GCN4. Furthermore, four repeated heptads of residues (gabcdef)4 make the suitable peptides fold into stable conformation at room temperature. Thus, the wellfolded coiled-coil helices could be represented as the helical wheel model shown in Fig. 4(b). Moreover, highly helical peptides could be obtained by identifying the best arrangement of hydrophilic amino acids from the 85 possibilities. This task was accomplished using the new GA set at good CR–MR setting of (0.90, 0.05) and the minimum acceptable population size of 16. 2.5.2. Monomeric helical peptide design using a standard GA Another example of evolutionary de novo peptide design is the monomeric helical peptide design. The sequence space comprises the hexadecapeptides given by formula (6). Ac-X01 X02 X03 X04 X05 X06 X07 X08 X09 X10 X11 X12 X13 X14 X15 X16 -Am

(6)

W. Zhang et al. / BioSystems 88 (2007) 35–55

43

Fig. 4. Sequence and structure of coiled-coil helical peptides. Whereas the Y, V, L, N, and G stand for the amino acids tyrosine, valine, leucine, asparagine, and glycine, respectively, the symbol X is one of the eight hydrophilic amino acids. The Ac- and -Am are the N-terminal acetyl-group and C-terminal amide, respectively. (a) Designed 31-residue sequence of LZP peptides in the peptide library SS3 derived from the eight hydrophilic amino acids. (b) Helical wheel model of the coiled-coil helical structure. N-terminal view of the two identical helical peptides in dimeric conformation is illustrated.

where the X stands for one of the three amino acid residues: glutamic acid (E), alanine (A), and lysine (K). These hexadecapeptides were noted as EAK peptides. Therefore, there are 316 EAK peptides in the sequence space. A standard GA set at a bit MR of 0.09 (MR of 0.172), a CR of 1.0, and a PS of 32 was used for generating highly helical peptides from the sequence space (Zhang et al., 2003). 2.5.3. Evaluation of the designed peptides The EAK and LZP sequences generated by the GAs were synthesized by parallel solid-phase synthesis and purified according to an experimental protocol described previously (Zhang et al., 2003). The EAK and LZP samples were prepared in 15 mM phosphate buffer, pH 7.0, and 20 mM phosphate buffer, pH 7.4, containing 150 mM sodium chloride, respectively. The EAK (at 20 ␮M, 25 ◦ C) and LZP (at 25 ␮M, 20 ◦ C) samples were then scanned using a circular dichroism spectrometer Model JASCO TS-60 equipped with a Peltier temperature control accessory. The oligomerization of the LZP peptides was analyzed using Size-Exclusion Chromatography (SEC) with AKTA explorer XT-10 system (Amersham Pharmacia Biotech, Tokyo, Japan). Analyses were carried out at room temperature using a Superdex 75 HR10/30 column (Amersham Pharmacia Biotech) eluted at 0.5 ml/min with 20 mM phosphate buffer, pH 7.4, containing 150 mM sodium chloride. The UV absorbance of the elution was recorded at 280 nm.

3. Results Generally, the choice of genetic operations and control parameters of GAs is regarded as a trade-off between exploiting the beneficial properties of promising sequences (by crossover) and exploring new promising sequences from sequence space (by mutation). The main problem is that the effect of even one individual factor is often unpredictable, let alone the combined influence

of all factors. In the current experiments, the most relevant variables governing the dynamics of the GAs were analyzed from two viewpoints. The first, the so-called “offline performance”, is the best fitness score (minimum similarity index) obtained in each generation. This definition demonstrates the ability of the GAs to identify the optimal sequence in sequence space, and reflects the exploration of the sequence space. The second, the so-called “online performance”, is the average fitness score (average similarity index) in each generation. This designation can be regarded as the ability to exploit the beneficial properties of the existing sequences, and indicates the convergence of the evolutionary process. What really matters is to identify the best sequence and, furthermore, to converge on it as rapidly as possible. In this study, we were particularly interested in these two scores within the population of the final generation. On the one hand, regarding the offline performance, it is convenient to select parameter setting for peptide design, allowing a thorough exploration of sequence space. On the other, with respect to the online performance, a high level of exploitation of the selected sequences could be taken into consideration for GA setting. As shown in Figs. 5–7, four variables analyzed in the simulation process had a significant influence on the evolution. Hence, the good parameter settings both for the peptide design and for the system design could be ascertained by analyzing the extent of their influence. 3.1. Evolutionary processes at different parameter settings The convergence trajectories of the sequencesearching problem, i.e., the minimum and average similarity indices of several runs of the new GA at different

44

W. Zhang et al. / BioSystems 88 (2007) 35–55

Fig. 5. Variations in similarity indices during the simulated evolutionary de novo peptide design. All experiments in graphs (a)–(f) were performed with a population size of 16. The curves in graphs (a)–(d) illustrate the results for the decapeptide library SS1 derived from the 20 natural amino acids, whereas those in graphs (e) and (f) are the results for the decapeptide library SS2 derived from the eight hydrophilic amino acids. (a) Minimum distance (SS1), at CR = 0.5, MR = 0.19; (b) average distance (SS1), at CR = 0.5, MR = 0.19; (c) minimum distance (SS1), at CR = 0.9, MR = 0.05; (d) average distance (SS1), at CR = 0.9, MR = 0.05; (e) minimum distance (SS2), at CR = 0.9, MR = 0.05; (f) average distance (SS2), at CR = 0.9, MR = 0.05. Graphs (g) and (h) illustrate the results at various population sizes. The GAs were set at the CR–MR setting of (0.90, 0.05) and operated on the decapeptide library SS1 derived from the 20 natural amino acids. (g) minimum distance (SS1) at variable population sizes; (h) average distance (SS1) at variable population sizes.

W. Zhang et al. / BioSystems 88 (2007) 35–55

45

Fig. 6. Effects of multi-step crossover–mutation. Experiments were performed by using a decapeptide library SS1 derived from the 20 natural amino acids with a population size of 16 peptides. The value at each CR–MR setting is the average of the similarity indices (physicochemical distances) obtained from five runs. (a) Minimum distance, GA with 16 steps; (b) average distance, GA with 16 steps; (c) minimum distance, GA with 8 steps; (d) average distance, GA with 8 steps; (e) minimum distance, GA with 4 steps; (f) average distance, GA with 4 steps; (g) minimum distance, reference GA with 1 step; (h) average distance, reference GA with 1 step.

46

W. Zhang et al. / BioSystems 88 (2007) 35–55

Fig. 7. Effects of population size. The simulations were performed by using a decapeptide library SS1 derived from the 20 natural amino acids. These graphs, in combination with Fig. 6(c and d), show the results for population sizes of 8, 24, 32, 48, and 16, respectively. The value at each CR–MR setting is the average of the similarity indices (physicochemical distances) obtained from five runs. (a) Minimum distance, population size 8; (b) average distance, population size 8; (c) minimum distance, population size 24; (d) average distance, population size 24; (e) minimum distance, population size 32; (f) average distance, population size 32; (g) minimum distance, population size 48; (h) average distance, population size 48.

W. Zhang et al. / BioSystems 88 (2007) 35–55

parameter settings, are presented in Fig. 5. These curves illustrate not only the stochastic properties of the evolutionary processes, but also the influence of the CR–MR settings, the population sizes, and the sequence spaces. In these trials, sub-optimal sequences were found in most situations; furthermore, the optimal sequence (target peptide) was obtained after 13 generations in one trial (Fig. 5(c)), and after 10 generations in three trials (Fig. 5(e)). Most of the sub-optimal sequences were one-residue mutations of the optimal sequence (data not shown). As those occurred in natural proteins, these mutations often occurred between pairs of amino acids possessing similar physicochemical properties such as leucine and isoleucine. Compared with those at other CR–MR settings such as (0.5, 0.19) (Fig. 5(a and b)), the new GA at a CR–MR setting of (0.90, 0.05) had an obvious tendency to give both the lower minimum similarity indices (Fig. 5(c)) and lower average similarity indices (Fig. 5(d)). These features indicated that GAs with different parameter settings had apparently differing capabilities of locating the optimal sequence, thus evolving at different speeds. Consequently, it is possible to find a parameter setting (or a range of parameter settings) for evolution that is more efficient. In addition, the new GA at the CR–MR setting of (0.90, 0.05) attained the lower minimum and average similarity indices in the decapeptide library SS2 (Fig. 5(e and f)) than those in the decapeptide library SS1 (Fig. 5(c and d)). Therefore, the new GA at any given CR–MR setting close to (0.9, 0.05) would be far more powerful in dealing with a small peptide library than in dealing with a large library. 3.2. Effects of MR-CR settings The effects of the CR–MR settings on the performance of the new GA were evaluated for five population sizes: 8, 16, 24, 32, and 48. In view of the particular effects of CR–MR setting, the characteristics of the best and average fitness scores for these population sizes were observed in Figs. 6(c and d) and 7.

47

As shown in Fig. 6(c and d) and Table 1, the best and average fitness scores of the new GA with a population of 16 sequences demonstrated effects typical of CR–MR settings. The features of such two scores indicated that both the explorative ability and the convergence were very sensitive to the MRs, although both were also dependent upon the CRs. While a MR range of 0.05–0.08 in the graph of best fitness scores facilitated an efficient search for the optimal sequences (minimum similarity index less than 0.15), a MR range of 0.02–0.05 in the graph of average fitness scores enabled the new GA to achieve good convergence (average similarity index less than 0.3). Moreover, as illustrated by the curves of MR = 0 in Fig. 6(c and d), the best and average fitness scores were greatly enhanced by crossovers at MRs less than 0.05. However, as shown by the curves of MR = 0.2, this enhancement was nearly negligible at MRs near 0.2. Clearly, the good CR–MR settings exhibiting extensive exploration (minimized best fitness scores) and high convergence (minimized average fitness scores) should be within both the CR–MR range (0.70–1.00, 0.05–0.08) in the graph of the best fitness scores in Fig. 6(c) and the range (0.70–1.00, 0.02–0.05) in the graph of the average fitness scores in Fig. 6(d). As shown in Figs. 6(c and d) and 7, the best and average fitness scores at population sizes of 8, 16, 24, 32, and 48 displayed both characteristics in common and distinct characteristics. While the range of acceptable CR–MR settings in the best fitness scores obviously expanded with an increase in the population sizes, this expansion in the average fitness scores was insensitive to the increase of the population sizes. Although expansion was slow with the increase of population sizes, the range in the average fitness scores eventually extended to the MRs near 0 at the population size of 48. In summary, the characteristics of the best and average fitness scores indicated that the algorithm at the appropriate range of CR–MR settings is able not only to efficiently identify the best sequence, but also to achieve good convergence. Consequently, we conclude

Table 1 Dependence of the range of the acceptable CR–MR settings upon the step numbers of crossover–mutation Step number

1 4 8 16

Acceptable CR–MR settings

Referent CR–MR settings

Minimum distance index ≤0.15

Average distance index ≤0.3

Minimum distance index ≤0.2

Average distance index ≤0.4

(0.9–0.9, 0.05–0.05) (0.9–1.0, 0.04–0.07) (0.7–1.0, 0.05–0.08) (0.9–1.0, 0.04–0.06)

(0.9–0.9, 0.05–0.05) (0.9–1.0, 0.04–0.06) (0.7–1.0, 0.02–0.05) (0.8–0.9, 0.05–0.05)

(0.8–1.0, 0.05–0.08) (0.7–1.0, 0.04–0.12) (0.6–1.0, 0.03–0.11) (0.6–1.0, 0.03–0.12)

(0.7–1.0, 0.04–0.05) (0.5–1.0, 0.03–0.06) (0.6–1.0, 0.02–0.07) (0.7–1.0, 0.01–0.07)

48

W. Zhang et al. / BioSystems 88 (2007) 35–55

that the optimal CR–MR setting probably occurs within the range of acceptable CR–MR settings near (0.9, 0.05); and the CR–MR setting of (0.90, 0.05) was thus denoted as representative of the good CR–MR settings for the new GA. 3.3. Effects of multi-step crossover–mutation The performance of the GA was examined in four different simulations: the first with 16 crossover–mutation steps; the second with 8 steps (new GA); the third with 4 steps; and the fourth with only 1 step (standard GA). The best and average fitness scores at a population size of 16 are shown in Fig. 6. As can be seen in Table 1, the effects of CR–MR setting were relevant to the step numbers. The range of acceptable CR–MR settings, especially in the best fitness scores, improved with an increase in the step number from 1 to 8. However, the performance of the 16-step crossover–mutation was not obviously better than that of the 8-step operation. The eight-step crossover–mutation was consequently deemed suitable for a population size of 16. In addition, the range of acceptable CR–MR settings was enlarged from (0.90–0.90, 0.05–0.05) for the standard GA to (0.70–1.00, 0.05–0.08) for the new GA in the best fitness scores, and from (0.90–0.90, 0.05–0.05) to (0.70–1.00, 0.02–0.05) in the average fitness scores. 3.4. Effects of population sizes Figs. 5(g and h), 6(c and d), and 7 provide details of the new GA at the population sizes of 8, 16, 24, 32, and 48. As can be seen, both the best and average fitness scores showed that the new GA was able to evolve more efficiently with larger population sizes, i.e., displaying better exploration and greater convergence. However, the performance improvement gained by increasing population size decreased with increasing size, especially according to the best fitness scores. This behavior may be particularly observed in the curves shown in Fig. 5(g and h), which are the results of the new GA preset at the good CR–MR setting of (0.90, 0.05) and for the decapeptide library SS1. The results suggested that population size at least 24 was necessary for the decapeptide library SS1 because little improvement was achieved by increasing the population beyond this. The population size of 24 was consequently recognized as the minimum acceptable population size for the decapeptide library SS1. Similarly, a population size of 16 peptides was acceptable for the decapeptide library SS2. As shown in Figs. 6(c and d) and 7, the results at the population sizes of 8, 16, 24, 32, and 48 indicated

that the range of acceptable CR–MR settings expanded with increases in the population sizes, which is more evident in the best fitness scores than in the average fitness scores. Although the range of the acceptable CR–MR settings did not obviously expand with the increase of the population sizes so much in the average fitness scores as in the best fitness scores, the average fitness scores were more dependent upon the CR–MR settings than the best fitness scores. The performance of the new GA might be more closely related to the population sizes if the chosen CR–MR setting was malapropos. Using a large population size, the new GA tended to overcome the disadvantages conferred by poor parameter settings, including those CR–MR settings far from the good CR–MR setting. 3.5. Experimental evaluation of the new GA in coiled-coil helical peptide design Since the GA’s performance depends on its parameter settings and the fitness function, we chose the helical peptide design as a test case to verify the efficacy of the proposed operations and the identified parameter settings of the new GAs in improving the efficiency of peptide design. 3.5.1. Helical peptide design using the new GA As a test case for evolutionary optimization, the overall de novo design process comprised five cycles of three steps: stochastic variation of the selected sequences; synthesis of the generated sequences; and evaluation of the synthesized peptides. The new GA preset at CR–MR setting of (0.90, 0.05) and PS of 16 was used for helical peptide design. The mean residue molar ellipticities at 222 nm of the LZP peptides were then measured and used as the fitness scores of sequences. The design process was started from a population of 16 random sequences and terminated after five design cycles. As shown in Fig. 8(a), a group of highly helical peptides was result from synthesizing and characterizing 80 LZP peptides. The ␣-helix percentage of each LZP peptide was calculated based on the mean residue molar ellipticity of −35,874.46 and −36,225.23 deg cm2 dmol−1 for 100% helical 28- and 31-residue peptides, respectively. Both ellipticities (deg cm2 dmol−1 ) were derived from Eq. (7) below for a helix of infinite length:   k [θ] = −36, 500 · 1 − (7) n where n is the number of residues per helix and k is a wavelength-dependent constant equal to 2.57

W. Zhang et al. / BioSystems 88 (2007) 35–55

49

Fig. 8. Ellipticity values of the helical peptides designed using the evolutionary approach. Diamonds denote the mean residue molar ellipticities (at 222 nm) of the designed peptides. The best and worst individuals in each generation are linked by dotted lines. Solid circles linked by solid lines denote the average values in each generation. (a) LZP peptides (coiled-coil helical peptides) designed using the new GA. (b) EAK peptides (monomeric helical peptides) designed using a standard GA.

at 222 nm (Chen et al., 1974; Chang et al., 1978). The best coiled-coil helical peptide found in current experiments had a mean residue molar ellipticity of −31,717.93 deg cm2 dmol−1 (87.56% helicity) for whole sequence including a tyrosine label and two glycine spacers, or approximately −35,116.28 deg cm2 dmol−1 (97.89% helicity) for the 28-residue helical fragment. Moreover, the results of SEC analyses showed that this peptide was a dimer in solution. These results suggested that this peptide possesses a dimeric coiled-coil conformation as shown in Fig. 4(b). Besides the best peptide, several others had mean residue molar ellipticities far greater than that of a referent sequence, the natural model peptide from GCN4. This model peptide had a mean residue molar ellipticity of −18,078.84 deg cm2 dmol−1 (49.91% helicity) for the whole sequence, or approximately −20,015.86 deg cm2 dmol−1 (55.79% helicity) for the 28-residue helical fragment. 3.5.2. Helical peptide design using a standard GA As the new GA was used for coiled-coil helical peptide design, a standard GA was used to design monomeric helical peptides. The GA was binary-coded, with a bit MR of 0.09, a CR of 1.0, and a PS of 32 sequences. After four design cycles, a total of 128 EAK peptides were synthesized and evaluated. Fig. 8(b) shows the mean residue molar ellipticities at 222 nm of EAK peptides in four design cycles. Theoretically, a 100% 16residue helical peptide has mean residue molar ellipticities at 222 nm of −33,155.31 deg cm2 dmol−1 , according to Eq. (7). The best monomeric helical peptide obtained had a mean residue molar ellipticity of −24,600 deg cm2 dmol−1 at 222 nm, that is, approximately 74.20% helicity.

4. Discussion In this study, a new sequence similarity has been established to evaluate de novo sequences; each generated sequence could be assigned a fitness score based on this sequence similarity. Thus, the effects of genetic operations and parameter settings in GAs on the exploration of the optimal sequence have been successively evaluated by observing the variations of the best fitness scores (minimum similarity indices) and average fitness scores (average similarity indices) in population. 4.1. New sequence similarity for characterizing de novo designed sequences Traditionally, sequence similarity is associated with individual amino acid similarities. There are two types of quantitative representations of amino acid similarity: the amino acid substitution matrices statistically estimated from natural sequences and the physiochemical similarity matrices systematically computed from properties of amino acids. PAM matrix series and BLOSUM matrix series are the most widely used substitution matrices. The PAM matrices for explicit evolutionary model are determined from the substitutions observed in conserved regions and used throughout a global alignment including conserved and highly mutable regions. The BLOSUM matrices for implicit evolutionary model, on the other hand, are based only on conserved regions of the proteins, regardless of evolutionary distance. Due to this constraint, the BLOSUM matrices have therefore been shown to be very effective in identifying similar sequences, even those with distant relationships (Dayhoff et al., 1978; Henikoff and Henikoff, 1992).

50

W. Zhang et al. / BioSystems 88 (2007) 35–55

The systematic matrices based on the properties of amino acids have been widely evaluated by comparison with the substitution patterns observed during protein evolution (Zuckerkandl and Pauling, 1965; Sneath, 1966; Epstein, 1967; Clarke, 1970; Grantham, 1974; Miyata et al., 1979; Yang et al., 1998). Although amino acids were represented by as many as 134 physicochemical properties in one study, the measures based on these parameters did not appear useful in measuring amino acid similarity because it was difficult to determine the importance of each property (Sneath, 1966). In other studies, only three attributes: polarity, volume, and composition, were considered (Grantham, 1974). In further studies, only polarity and volume were used in distance measures between amino acids (Miyata et al., 1979). These measures of amino acid similarity were analyzed in a large data set containing all proteins in the mitochondrial genome from 20 species of mammals (Yang et al., 1998). The results showed that the polarity and volume of amino acids have the greatest influence on amino acid substitutions in mitochondrial proteins. Moreover, the “physicochemical distance” based on the Euclidean distance between the amino acid pair yields the best performance (Miyata et al., 1979). In this article, the sequence similarity for characterizing de novo designed peptides is given by Eq. (3). The results of simulation and experiments suggest that this definition is successfully used to establish a fitness landscape analogue to the helicity fitness landscape for evolutionary peptide design and to evaluate the new GAs. On the one hand, the structural and physicochemical properties of peptides are generally associated more with small sequence fragments than with individual amino acids. The definition appears to reflect this relationship. On the other, whereas the statistical scores are directly relevant to the natural sequences, the systematic matrices are generally dependent upon the amino acids. This feature indicates the latter seem to be more adequate for diverse sequences including de novo sequences and natural sequences. Furthermore, in spite of suffering from problems with balancing the contributions of the different properties to the selection of encouraging mutations, as was stated above, the “physiochemical distance” appeared to be more effective in identifying similar sequences from the mitochondrial proteins from 20 species mammals. This means that even with volume and polarity two properties, the systematic matrices have effectively established the criteria that constrained the natural evolution. In consequence, the “physiochemical distance” could be effectual to impel the generation of de novo sequences likewise. Intuitively, this measure is more reasonable since it has been argued that the polarity

and volume are the dominant factors underlying substitutions in evolution and have a great impact on protein folding (Grantham, 1974; Tomii and Kanehisa, 1996). 4.2. General observation of evolutionary de novo peptide design processes As can be seen in Fig. 5, the average fitness scores tended to be more sensitive to CR–MR settings than the best fitness scores. The probable reasons include the diversity of the sequences in population, which is closely dependent upon the MRs. In fact, higher MRs generally induce a greater diversity of sequences, which lead the average fitness scores (average similarity indices) to remain high as the evolution progressed. The best fitness scores, on the other hand, appeared to progress faster than the average fitness scores at a given CR–MR setting. Moreover, the best fitness scores at good CR–MR setting tended to develop more efficiently than at other CR–MR settings, thus plateauing first after a small number of iterations from start; the plateaued best fitness scores ranged over CR–MR settings near the good CR–MR setting a little later, as shown in Figs. 6 and 7. After more iteration, the average fitness scores at good CR–MR setting plateaued, and as above, the plateaued average fitness scores ranged CR–MR settings near the good CR–MR setting progressively later. In this way, the best and average fitness scores are getting increasingly lower; moreover, the range of acceptable CR–MR settings near the good CR–MR setting is getting increasingly larger with the increase of the iteration number. Fig. 5 also indicates that in addition to the CR–MR settings and generations, the size of population and the size of sequence space greatly influence on the range of acceptable CR–MR settings for the best and average fitness scores. Figs. 6 and 7 furthermore provide demographic information for the plateaus of the best and average fitness scores. These results suggest that GAs become less sensitive to CR–MR settings when a large number of generations, a large population size, or multi-step crossover–mutation is used. In contrast, in a practical peptide design, a minimized size of population and minimized number of design cycles are often required to design peptide efficiently. These restrictions upon system design will decrease the range of acceptable CR–MR settings, thus increasing the sensitive to the CR–MR settings. Consequently, the determination of the good and acceptable CR–MR settings is probably more significant. However, the determination of the optimal parameter setting is seriously limited by number of preset parameter settings and is strongly influenced by stochastic prop-

W. Zhang et al. / BioSystems 88 (2007) 35–55

erties of the evolutionary processes. First, as with any trial-and-error strategy that exhaustively compares the experimental outcomes at different parameter settings, the optimal parameter setting has to be chosen from all possible combinations of the control parameters varying among large number of values. In this study, the good parameter setting was chosen from a limited number of parameter settings; for example, the good CR–MR setting was obtained from the preset CR–MR settings comprising 11 CR values and 21 MR values. Thus, the difference between the optimal parameter setting and the good parameter setting is largely attributed to this limitation number of the parameter settings. Second, as with any sample strategy that generates sequences in terms of genetic operations, the best and average fitness scores in population are stochastic at each given parameter setting, although they are definite in each run of experiment. Due to their stochastic properties, the best and average fitness scores at each given parameter setting should be estimated from the results of a large number of repeats of experiments. Through the repetitive experiments, we can decrease the random errors of the best and average fitness scores, thus improving the statistical significance of the good CR–MR setting and acceptable CR–MR settings. In this study, only five repeats of experiments were performed at each given parameter setting; thus, the random errors of the obtained results, especially those at small population size of 8, can be attributed to this small sample size. However, these errors at each given parameter setting would be getting increasingly lower as the evolutionary process progresses owing to the evolution and convergence of the best and average fitness scores. Third, as with any sampling strategy that does not evaluate all sequences in a combinatorial sequence space, there is certainly no guarantee of finding the optimal sequence within a given number of generations. In general, a large population size and a large number of generations are often needed to ensure finding the optimal sequence. However, simulation results within 31 generations show that sub-optimal sequences in most situations, and furthermore, the optimal sequence in some cases were obtained. The plateaus of the best and average fitness scores indicate that further improvement is quite arduous. Clearly, these sub-optimal sequences have similar sequences and fitness scores comparable to the optimal sequence. As the homologous proteins do in natural evolution, these sequences probably conserve the structures and possess the properties similar to those of the optimal sequence. Consequently, the sequences satisfying the design specifications can be obtained within a limited number of generations.

51

In summary, although more repeats of experiments and larger number of parameter settings would benefit for better results, a combination of the preset number of parameter settings with a 31-generation process is sufficient to observe the effects of CR–MR settings and population sizes and to analyze both the explorative ability and the convergence of the new GA in solving decapeptide design problem. 4.3. Optimization of CR–MR setting The results of the simulations indicate that the fundamental behavior of the new GA, especially that shown by the average fitness scores, is highly dependent upon the MRs. As shown in Fig. 6(c and d), the GA at an MR of 0.2 yielded a great diversity of sequences but a low efficiency in generating the optimal sequence, whereas the algorithm at an MR of approximately 0.05 showed an efficient evolution in spite of lower diversity of sequences. The choice of MRs could therefore be regarded as a trade-off, which was investigated by observing the curves of CR = 0 (mutation only) in Figs. 6 and 7. The average fitness scores at MRs near 0.2 show that new sequences were generated over a broader area in sequence space because more mutations were generated in the current sequences. This action led to a greater ability to explore new areas in the sequence space, but increased the possibility of destructing the promising sequences in the current population, thus producing a higher average fitness score. In contrast, a value less than optimal MR tended to decrease the possibility of the mutations occurred in the current sequences; the new sequences were generated in a narrower area in sequence space. As a result, while this action generally decreases the explorative ability of the algorithm, it induces a greater ability to exploit promising sequences. In consequence, MRs far from the optimal MR will upset the balance between the exploration and exploitation at optimal MR, and produce a higher average fitness scores. On the other hand, although the influence of the CRs on the performance was less obvious than that of MRs, the CRs, especially the high values, enabled the new GA to progress more efficiently. This improvement could be observed in the curves of MR = 0 (crossover only) in Figs. 6 and 7. The results also reveal that the new GAs could accomplish their optimal performance at CR–MR setting of (0.90, 0.05) by simultaneous optimization of CR and MR; furthermore, this good performance widely ranged over CR–MR settings near the good CR–MR setting after a process of 31 generations. Consequently, the performance improvement of the new GA could be

52

W. Zhang et al. / BioSystems 88 (2007) 35–55

attributed mainly to the effective combination of the crossover and mutation operations. 4.4. Effects of multi-step crossover–mutation operation As shown in Fig. 6(b, d, f, and h), the average fitness scores in the curves of CR = 0 (mutation only) remained nearly unchanged as the number of steps increasing. This result indicates that the effects of the mutation in the multi-step crossover–mutation operation were similar to those in the one-step crossover–mutation operation of the standard GA. Consequently, the effects of the mutation operation are independent upon its procedure. This is probably due to the mutual independence of point mutations. In contrast to the mutation operation, the effects of crossover operation are very closely dependent upon the crossover procedure. The influence of the crossover in the multi-step crossover–mutation operation on the performance of the GAs was observed in the curves of MR = 0 (crossover only), as shown in Fig. 6. First, in the CR range of 0.7–1.0, the best and average fitness scores were much less than those of the standard GA. Second, with an increase of CRs in this range, both scores decreased more obviously than those of the standard GA. Furthermore, the results of four GA simulations, as shown in Fig. 6 and Table 1, revealed that the range of the acceptable CR–MR settings near the good CR–MR setting varied considerably with an increase in the number of crossover–mutation steps from 1 to 16, whereas the good CR–MR setting remained almost unchanged according to the best fitness scores. The GA with an eight-step crossover–mutation appeared to have the largest area of acceptable CR–MR settings at population size of 16. We conclude that an appropriate step number of repetitive crossover–mutations made the GAs more efficient in the exploration of the optimal sequence and less sensitive to the CR–MR settings. Consequently, the performance improvement of the new GA could be mainly attributed to the eight iterations of three-point crossover operation, rather than the repeated mutation operation. Due to the crossover effects, the multi-step crossovermutation could diminish the problems caused by the decrease of population sizes. Generally, a small population size tends to converge rapidly to high levels of homogeneity, which reduces the explorative ability of the crossover operations. The multi-step crossover–mutation operation exerts the multi-point and multi-parent crossovers of sequences to increase the diversity without raising MRs. There have been situ-

ations where a larger number of crossover points are beneficial. One example is the uniform crossover, which performs crossover at a large number of points. Experiments also show that uniform crossover can give greater recombination behavior and improve empirical performance of GAs (Syswerda, 1989). Furthermore, the performance of GAs can also be enhanced significantly by using more parent sequences in crossover (Eiben et al., 1995). Clearly, the multi-parent crossovers of the multi-step crossover–mutation operation in the new GA can be considered more explorative than the twoparent situation in standard GAs because the resemblance between parents and their offspring generated using multi-parent crossover is, on average, smaller. The multi-point and multi-parent crossover effects of the multi-step crossover–mutation therefore help the new GA overcome both the limited information capacity of smaller populations and the resulting tendency for more homogeneity, thus yielding greater exploration while maintaining their exploitation. Generally, the crossover in standard GAs is an exploitative operation providing much less explorative effects than the mutation operation. However, the crossover effects of the multi-step crossover–mutation operation in the new GA have greatly enhanced the exploration of the sequence space (Figs. 6 and 7). Because of this operation, fewer mutations (smaller MRs) than in a standard GA were required to achieve an equivalent exploration. The MR at good CR–MR setting was less than previous settings of Zhang et al. 0.172 (bit MR of 0.09) (Zhang et al., 2003); Grefenstette 0.166 (bit MR of 0.01) (Grefenstette, 1986); and Franis–Lavergne 0.2–0.3 (Franis and Lavergne, 2001). In consequence, through this multi-step crossover operation, while retaining its exploitative ability, the new GA improves its explorative ability, especially in the situations of small populations and early stages of the process. In other words, the good CR–MR setting can be considered as a fine balance between the exploration of sequence space and exploitation of the desired properties of promising sequences: not only a trade-off of MRs but also a fair compromise between the mutation effects and the recombination effects. 4.5. Effects of population size Generally, an appropriate population must contain sufficient diversity for effective recombination without carrying an excessive number of sequences. In the current simulation study, the results show that small populations of 16 and 24 sequences were acceptable for the new GA setting at the good CR–MR setting of (0.9, 0.05).

W. Zhang et al. / BioSystems 88 (2007) 35–55

Furthermore, the population size had a strong influence on the behavior of the new GA in several aspects. First, the range of acceptable CR–MR settings for the best fitness scores considerably expanded as the population size increased from 8 to 48 (Figs. 6(c and d) and 7), whereas the good CR–MR setting within this CR–MR range remained almost unchanged. This means that the new GA became more explorative and, furthermore, less sensitive to CR–MR settings when a larger population was used. To a certain extent, this suggests that if the GA set at a problem-non-specific CR–MR setting, a very large population size would be needed to ensure thorough exploration of the sequence space. Second, although the expansion of the range of acceptable MRs for the average fitness scores was much slower than that for the best fitness scores, this range gradually broadened with the increase of population from 8 to 48, and eventually contained the MRs near 0 at the population size of 48 (Fig. 7). In other words, the new GA is able to fulfill the explorative task even without the help of the main explorative operation-mutation if the population size is large enough. This implied that our multi-step crossover–mutation operation is probably the principal reason why a smaller MR was required in the new GA. Finally, of the factors analyzed, it is noteworthy that the population size is less influential on the average fitness scores than on the best fitness scores. In principle, an average fitness score is derived from the total number of individuals in the population and, statistically, this average is not very sensitive to the population size. 4.6. Efficacy of the new GA in helical peptide design As in any heuristic search algorithm, the major aspects of applying GAs to the peptide design problem are the representation (encoding) of the peptides, the generation of new sequence by genetic operations, and the evaluation of the resulting peptides (fitness function). In this work, a framework comprising the integercoded representation of peptides, the generation of new sequences by the good genetic operations in the new GA, and the experimental measurement of the ellipticities of the peptides was used in helical peptide design. Furthermore, the new GA used in the real peptide design was preset at the good parameter setting determined through the simulated peptide design. Consequently, the success in the synthesis of the highly helical peptides may be attributed to the efficacy of the new GA as well as a considerable similarity between the coiled-coil helical peptide design and the sequence-searching problem. Compared with the standard GA-based monomeric helical peptide design, the new GA-based coiled-coil

53

helical peptide design demonstrates a much more efficient identification of the highly helical peptides despite using only 16 peptides in the population. The variation of the mean residue molar ellipticities at 222 nm exhibits the signs of improvement of this evolutionary design approach. We have found that in addition to almost all the peptides in the starting populations, a small part of peptides in subsequent generations are the lowly helical peptides. Thus, the highly helical peptides in both projects could be considered as the outcome of the evolution starting with pools of lowly helical peptides. Furthermore, the best and average ellipticities in the populations improved more significantly in the coiled coil helical peptide design process than those in the monomeric helical peptide design process, although the population in the former was half the size of the latter. We conclude that this improvement is the result of a fine balance between the exploration and exploitation. Using the suitable operations and the pre-identified CR–MR setting, the new GA achieves this balance and enables the design framework to generate highly helical peptides more rapidly, even with a small population size. However, as opposed to the situation in the simulated de novo peptide design, whereby the optimal sequence is preset, there is certainly no guarantee of finding the absolute optimum in practical de novo peptide designs. In general, a large population size and a large number of generations are often needed to ensure finding the optimal sequence. In this study, the simulation results suggest that for the new GA preset at the CR–MR setting of (0.90, 0.05) and population size of 16 peptides, at least 10 design cycles were needed to obtain the optimal sequences from a decapeptide library SS2. For the peptide library SS3 used for coiled-coil helical peptide design, the design cycles could be much reduced. In the synthetic experiments, highly helical peptides were synthesized de novo within five design cycles. The ellipticities of these LZP peptides are far greater than that of the model peptide from the natural protein GCN4. Of them, the best peptide possesses an ellipticity very close to the value of a 100% helical peptide. Despite the fact that better peptides could be obtained more efficiently by the new GA, finding the absolute optimal peptide cannot be assured, as is the case for the standard GA. 5. Conclusions Evolutionary de novo peptide design based on GAs is a heuristic strategy, which searches for specific peptides efficiently within a combinatorial sequence space. Nevertheless, the efficiency and convergence are unpredictable and strongly dependent upon the underlying

54

W. Zhang et al. / BioSystems 88 (2007) 35–55

parameter setting and the fitness landscapes of peptides. Consequently, we have investigated the significance and effects of the variations and combinations of the genetic operations in GAs by means of the simulation of sequence-searching problem, a benchmark of evolutionary de novo peptide design. Furthermore, the good combination of genetic operations and good parameter setting identified through this simulation has been utilized as a good setting of the GAs for the coiledcoil helical peptide design according to the principle of “Parameter setting by analogy”. This strategy is successful in improving the efficiency of peptide design; the highly helical peptides have been generated even with a small population and few design cycles. This success can be contributed to the analogy between the helical peptide design and the sequence-searching problem. Due to this analogy, the good configuration and the good parameter setting identified through the simulation has become a promising setting of the evolutionary design approach for helical peptide design. In this study, the influences of four parameters: PS, CR, MR, and step number of repeated crossover–mutation operation, on the performance of the GA-based peptide design have been evaluated. The results indicate that the multi-step crossover–mutation operation greatly enhanced the stochastic variation of the sequences, and provided the new GA with a more extensive exploration capability, thus inducing a wider range of acceptable CR–MR settings than those in a standard GA. As a result, like a long evolutionary process and a large population size, an appropriate number of crossover–mutation steps could enhance the efficiency of discovering optimal sequence and reduce the negative influence of problem-non-specific CR–MR settings. The results further show that CR–MR setting of (0.9, 0.05) is a good CR–MR setting of the new GA comprising eight-step crossover–mutation. When the new GA is set at this CR–MR setting, a minimum acceptable population size has been identified: a population size of 24 for a decapeptide library SS1, and size of 16 for a decapeptide library SS2. Consequently, this setting could be considered as a good setting of the evolutionary de novo peptide design framework. Clearly, compared to merely choosing a large number of generations, or a large population size, carefully determining the CR–MR setting relevant to the minimized population size and minimized number of design cycles considerably enhances the success of design and reduces directly the efforts required in chemical synthesis. The experimental results of the design and synthesis of helical peptide show that this setting has been successfully used in practical de novo peptide design.

In conclusion, by determining a good setting through the simulation of sequence-searching problem, this evolutionary algorithm could be implemented much more efficiently in peptide design. Further work is concerned with the analyses of the amino acid substitutions occurred in designed helical peptides and, furthermore, the comparison to those in natural sequences. Of course, cautions should be used in applying the configuration and parameter settings obtained in this study to other evolutionary design problems. Acknowledgements Thanks are due to Dr. S. McNiven for helpful comments and assistance with the manuscript and everyone else who helped. References Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J., 1990. Basic local alignment search tool. J. Mol. Biol. 215, 403–410. B¨ack, T., Hammel, U., Schwefel, H.P., 1997. Evolutionary computation: comments on the history and current state. IEEE Trans. Evol. Comput. 1, 3–17. Baldwin, R.L., Rose, G.D., 1999. Is protein folding hierarchic? I. Local structure and peptide folding. Trends Biochem. Sci. 24, 22–33. Belmont-Moreno, E., 2001. The role of mutation and population size in genetic algorithms applied to physics problems. Int. J. Mod. Phys. C 12, 1345–1355. Brown, R.D., Martin, Y.C., 1997. Designing combinatorial library mixtures using a genetic algorithm. J. Med. Chem. 40, 2304–2313. Chang, C.T., Wu, C.-S.C., Yang, J.T., 1978. Circular dichroic analysis of protein conformation: inclusion of the ␤-turns. Anal. Biochem. 91, 13–31. Chen, Y.H., Yang, J.T., Chau, K.H., 1974. Determination of the helix and beta-form of proteins in aqueous solution by circular dichroism. Biochemistry 13, 3350–3359. Clarke, B., 1970. Selective constraints on amino-acid substitutions during the evolution of proteins. Nature 228, 159–160. De Jong, K.A., 1975. An analysis of the behaviour of a class of genetic adaptive systems. Ph.D. Thesis. Department of Computer and Communication Sciences, University of Michigan, Ann Arbor, MI. Dayhoff, M., Schwartz, R.M., Orcutt, B.C., 1978. A model of evolutionary change in proteins. In: Dayhoff, M. (Ed.), Atlas of Protein Sequence and Structure, vol. 5, Suppl. 3. National Biomedical Research Foundation, Silver Spring, MD, pp. 345–352. Douguet, D., Thoreau, E., Grassy, G., 2000. A genetic algorithm for the automated generation of small organic molecules: drug design using an evolutionary algorithm. J. Comput. Aided Mol. Des. 14, 449–466. Eiben, A.E., Venkemenade, C.H.M., Kok, J.N., 1995. Orgy in the computer: multi-parent reproduction in genetic algorithms. Adv. Artif. Life Lecture Notes Artif. Intell. 929, 934–945. Eiben, A.E., Hinterding, R., Michalewicz, Z., 1999. Parameter control in evolutionary algorithms. IEEE Trans. Evol. Comput. 3, 124–141. Epstein, C.J., 1967. Non-randomness of amino-acid changes in the evolution of homologous proteins. Nature 215, 355–359.

W. Zhang et al. / BioSystems 88 (2007) 35–55 Fogel, D., 1994. An introduction to simulated evolutionary optimisation. IEEE Trans. Neural Networks 5, 3–14. Forrest, S., 1993. Genetic algorithms—principles of natural-selection applied to computation. Science 261, 872–878. Franis, O., Lavergne, C., 2001. Design of evolutionary algorithms—statistical perspective. IEEE Trans. Evol. Comput. 5, 129–147. George, D.G., Barker, W.C., Hunt, L.T., 1990. Mutation data matrix and its uses. Methods Enzymol. 183, 333–351. Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing Company, Reading, MA. Grantham, R., 1974. Amino acid difference formula to help explain protein evolution. Science 185, 862–864. Grefenstette, J.J., 1986. Optimization of control parameters for genetic algorithms. IEEE Trans. Syst., Man, Cybern. C 16, 122–128. Harik, G., Cant´u-Paz, E., Goldberg, D.E., Miller, B.L., 1999. The gambler’s ruin problem, genetic algorithms, and the sizing of populations. Evol. Comput. 7, 231–253. Henikoff, S., Henikoff, J.G., 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89, 10915–10919. Hinterding, R., Michalewicz, Z., Eiben, A.E., 1997. Adaptation in evolutionary computation: a survey. In: Proceedings of 4th IEEE International Conference on Evolutionary Computation, Indianapolis, IN, USA, pp. 65–69. Illgen, K., Enderle, T., Broger, C., Weber, L., 2000. Simulated molecular evolution in a full combinatorial library. Chem. Biol. 7, 433–441. Jones, D.T., 1994. De novo protein design using pairwise potentials and a genetic algorithm. Protein Sci. 3, 567–574. Kamphausen, S., H¨oltge, N., Wirsching, F., Morys-Wortmann, C., Riester, D., Goetz, R., Th¨urk, M., Schwienhorst, A., 2002. Genetic algorithm for the design of molecules with desired properties. J. Comput. Aided Mol. Des. 16, 551–567. Kohn, W.D., Hodges, R.S., 1998. De novo design of ␣-helical coiled coils and bundles: models for the development of protein-design principles. Trends Biotechnol. 16, 379–389. Miyata, T., Miyazawa, S., Yasunaga, T., 1979. Two types of amino acid substitutions in protein evolution. J. Mol. Evol. 12, 219–236. Mohana Rao, J.K., 1987. New scoring matrix for amino acid residue exchanges based on residue characteristic physical parameters. Int. J. Pept. Protein Res. 29, 276–281. Pegg, S.C.H., Haresco, J.J., Kuntz, I.D., 2001. A genetic algorithm for structure-based de novo design. J. Comput. Aided Mol. Des. 15, 911–933.

55

Pearson, W.R., 1990. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183, 63–98. Rojas, I., Gonz´alez, J., Pomares, H., Merelo, J.J., Castillo, P.A., Romero, G., 2002. Statistical analysis of the main parameters involved in the design of a genetic algorithm. IEEE Trans. Syst., Man, Cybern. C 32, 31–37. Rouvray, D., 1990. The evolution of the concepts of molecular similarity. In: Johnson, M.A., Maggiora, G.M. (Eds.), Concepts and Applications of Molecular Similarity. Wiley, New York, pp. 15–42. Sheridan, R.P., Kearsley, S.K., 1995. Using a genetic algorithm to suggest combinatorial libraries. J. Chem. Inf. Comput. Sci. 35, 310–320. Smith, T.F., Waterman, M.S., 1981. Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197. Smith, T.F., Waterman, M.S., Fitch, W.M., 1981. Comparative biosequence metrics. J. Mol. Evol. 18, 38–46. Sneath, P.H.A., 1966. Relations between chemical structure & biological activity in peptides. J. Theor. Biol. 12, 157–195. Sundaram, A., Venkatasubramanian, V., 1998. Parametric sensitivity and search-space characterization studies of genetic algorithms for computer-aided polymer design. J. Chem. Inf. Comput. Sci. 38, 1177–1191. Syswerda, G., 1989. Uniform crossover in genetic algorithms. In: Schaffer, J.D. (Ed.), Proceedings of 3rd International Conference on Genetic Algorithms. George Mason University. Morgan Kaufmann Publishers, San Mateo, CA. USA, pp. 2–8. Tomii, K., Kanehisa, M., 1996. Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng. 9, 27–36. Willett, P., Barnard, J.M., Downs, G.M., 1998. Chemical similarity searching. J. Chem. Inf. Comput. Sci. 38, 983–996. Yang, Z.H., Nielsen, R., Hasegawa, M., 1998. Models of amino acid substitution and applications to mitochondrial protein evolution. Mol. Biol. Evol. 15, 1600–1611. Yokobayashi, Y., Ikebukuro, K., McNiven, S., Karube, I., 1996. Directed evolution of trypsin inhibiting peptides using a genetic algorithm. J. Chem. Soc. Perkin Trans. 1 20, 2435–2437. Zhang, W., Loughran, M.G., Kanna, S., Yano, K., Ikebukuro, K., Yokobayashi, Y., Kuroda, R., Karube, I., 2003. Exploration of structural features of monomeric helical peptides with a genetic algorithm. Proteins 7, 503–555. Zuckerkandl, E., Pauling, L., 1965. Evolutionary divergence and convergence in proteins. In: Bryson, V., Vogel, H.J. (Eds.), Evolving Genes and Proteins. Academic Press, New York, pp. 97–166.