A practical algorithm based on particle swarm optimization for haplotype reconstruction

A practical algorithm based on particle swarm optimization for haplotype reconstruction

Applied Mathematics and Computation 208 (2009) 363–372 Contents lists available at ScienceDirect Applied Mathematics and Computation journal homepag...

822KB Sizes 2 Downloads 5 Views

Applied Mathematics and Computation 208 (2009) 363–372

Contents lists available at ScienceDirect

Applied Mathematics and Computation journal homepage: www.elsevier.com/locate/amc

A practical algorithm based on particle swarm optimization for haplotype reconstruction Jingli Wu a,b, Jianxin Wang a,*, Jian’er Chen a,c a b c

School of Information Science and Engineering, Central South University, Changsha 410083, China Department of Computer Science, Guangxi Normal University, Guilin 541004, China Department of Computer Science, Texas A&M University, College Station, TX 77843, USA

a r t i c l e

i n f o

Keywords: Algorithm Single nucleotide polymorphisms Haplotype The minimum error correction Particle swarm optimization

a b s t r a c t The minimum error correction (MEC) model is an important computational model for haplotype reconstruction problem. Owing to the NP-hardness of the MEC model, Qian et al. presented a particle swarm optimization (PSO) algorithm to solve the model, and the length of a particle code is equal to the number of SNP fragments. However, there are hundreds and thousands of SNP fragments in practical applications, the PSO algorithm based on this kind of long particle code cannot obtain high reconstruction rate efficiently. In this paper, a kind of short particle code is proposed by taking advantage of the low heterozygous frequencies of single nucleotide polymorphisms (SNPs), and a practical particle swarm optimization algorithm P-MEC based on this kind of particle code is presented. Experimental results indicate that P-MEC algorithm can get higher reconstruction rate than the particle swarm optimization algorithm designed by Qian et al. to the MEC model. Moreover, this kind of short particle code makes P-MEC algorithm efficient even for reconstructing long haplotypes. Ó 2008 Elsevier Inc. All rights reserved.

1. Introduction Single nucleotide polymorphisms (SNPs) are believed to be the most predominant form to address genetic differences, and are important in medical, drug-design, diagnostic and forensic applications [1]. In diploid organisms, genomes are organized into pairs of chromosomes. A set of related SNPs on each copy of a pair of chromosomes define a haplotype. Recent works indicate that haplotypes generally have more information content than individual SNP in disease association studies [2]. However, it is substantially difficult to determine haplotypes through experiments directly. Computational methods have been studied and used in the construction of haplotypes. The corresponding computational problem is called haplotyping that has become one of the hottest areas in Computational Biology. There are generally two classes of combinatorial versions of the haplotyping problem. One class is concerned with haplotyping a population [3,4]; and the other class is concerned with haplotyping a single individual based on the data and methodology of shotgun sequence assembly [5]. The latter category is studied in this paper. The problem of haplotyping a single individual was first formulated as an optimization problem by Lancia et al. [5] and can be described as follows. Suppose there are a set of short genome fragments with SNPs of a pair of haplotypes, coming from DNA shotgun sequencing or generated by a resequencing effort for the purpose of large-scale haplotyping. When SNP positions are focused, these short genome fragments are actually SNP fragments. The problem of haplotyping a single individual is concerned with the partition of the SNP fragments into two sets, with each set determining a haplotype. The * Corresponding author. E-mail address: [email protected] (J. Wang). 0096-3003/$ - see front matter Ó 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.amc.2008.12.040

364

J. Wu et al. / Applied Mathematics and Computation 208 (2009) 363–372

problem becomes difficult and challenging due to DNA sequencing errors and the diploidy of human genome. While emphasizing on different types of errors, the minimum fragment removal (MFR) and the minimum SNP removal (MSR) models have been proposed [5]. Rizzi et al. presented algorithms for the two models by using dynamic programming [6]. Another important computational model for the problem, the minimum error correction (MEC) model, was proposed by Lippert et al. [7]. An exact algorithm based on branch-and-bound method and two algorithms based on dynamic clustering were presented for this model [8]. A heuristic method based on genetic algorithm was proposed by Wang et al. for the MEC model [9]. Qian et al. presented a new heuristic algorithm based on particle swarm optimization (PSO) (it is named as PSO-Q in this paper) and got higher reconstruction rate than the genetic algorithm designed by Wang et al. to this model [10]. The particle code length of PSO-Q algorithm is equal to the number of SNP fragments. However, there are hundreds and thousands of SNP fragments in practical applications, PSO-Q cannot obtain high reconstruction rate efficiently due to its long particle code. In this paper, a kind of short particle code is proposed by taking advantage of the low heterozygous frequencies of SNPs, and a practical particle swarm optimization algorithm P-MEC based on this kind of particle code is presented. Experimental results show that P-MEC algorithm can find good solutions and achieve higher reconstruction rate than PSO-Q algorithm. Moreover, it is still efficient even for reconstructing long haplotypes. The rest of this paper is organized as follows. In Section 2, preliminary definitions and notations are given. In Section 3, P-MEC algorithm is described. In Section 4, the comparisons between P-MEC and PSO-Q algorithms are given through experimental results. Finally, some conclusions are drawn in Section 5. 2. Definitions and notations Since DNA of diploid organisms is organized in pairs of chromosomes, each SNP site can be either a homozygous site if it has the same allele on both chromosomes, or a heterozygous site if it has different alleles on both chromosomes. In general, each SNP shows a variability of only two possible alleles, so a haplotype can be represented as a string over a 2-letter alphabet {0, 1}. Suppose that there are m SNP fragments from a pair of chromosomes and the length of the corresponding haplotypse is n. Define an m  n SNP matrix M, in which each entry M½i; j has value 0, 1 or – (null, lack of information). Each row corresponds to a SNP fragment and each column corresponds to a SNP site. Define n0 ðjÞ (resp. n1 ðjÞ) as the number of entries of column j that are equal to 0 (resp. 1). Let f0 ðjÞ (resp. f1 ðjÞ) be the proportion of 0 (resp.1) entries in the not null entries of column j, i.e. f0 ðjÞ ¼ n0 ðjÞ=ðn0 ðjÞ þ n1 ðjÞÞ, f1 ðjÞ ¼ n1 ðjÞ=ðn0 ðjÞ þ n1 ðjÞÞ. Given x; y 2 f0; 1; g and let

sðx; yÞ ¼ dðx; yÞ ¼



1 if x – ; y – ;

0 otherwise  1 if x – ; y – ; 0

and x ¼ y and x–y

otherwise

ð1Þ ð2Þ

For two strings U ¼ u1 ; . . . ; un and V ¼ v 1 ; . . . ; v n over the alphabet {0, 1,  }, let

X

SðU; VÞ ¼

sðui ; v i Þ

ð3Þ

dðui ; v i Þ

ð4Þ

16i6n

DðU; VÞ ¼

X

16i6n

When the strings U and V are regarded as two SNP fragments, if DðU; VÞ > 0, fragments U and V are said to conflict, otherwise they are said to agree. The conflict between two SNP fragments means that they come from different chromosome copies or there are errors in the data. If there are no errors in the data, the rows of M can be partitioned into two classes of non-conflicting fragments. A pair of haplotypes can be reconstructed by assembling the fragments in the two classes. In this case, the SNP matrix M is called error-free. An important and widely accepted optimization model was proposed by Lippert et al. [7] based on the above concepts for the SNP haplotyping problem, as follows: Minimum error correction (MEC): Given a SNP matrix M, correct the minimum number of entries (0 into 1 and vice versa) so that the resulting matrix is error-free. Let M be a SNP matrix, M½i;  denotes the ith row of the matrix M. Suppose that the rows in the matrix M are partitioned into two disjoint classes of fragments. Let G ¼ fM½i1 ; ; . . . ; M½ik ; g be any one of the two classes, i.e. G½j;  ¼ M½ij ; , Then the consensus haplotype of G, denoted as hðGÞ, is the string hðGÞ ¼ h1 ðGÞ; . . . ; hn ðGÞ, where

hc ðGÞ ¼



0

if N 0 ðG; cÞ > N1 ðG; cÞ; c ¼ 1; 2; . . . ; n;

1 otherwise;

ð5Þ

here N 0 ðG; cÞ and N 1 ðG; cÞ denote the number of 0 and 1 in the set {G½1; c; . . . ; G½k; c} respectively. In this paper, the reconstruction rate (RR) [8,9] is used to measure the accuracy of the pair of reconstructed haplotypes. ^ ¼ ðh ^1 ; h ^2 Þ is the pair of reconstructed haplotypes. The proAssuming that h ¼ ðh1 ; h2 Þ is the pair of original haplotypes, and h portion RR of nucleotides that are reconstructed correctly is defined as follows:

J. Wu et al. / Applied Mathematics and Computation 208 (2009) 363–372

^ ¼1 RRðh; hÞ

minfr 11 þ r 22 ; r12 þ r 21 g ; 2n

365

ð6Þ

^ Þ; i; j ¼ 1; 2. where rij ¼ Dðhi ; h j 3. Algorithm 3.1. PSO-Q algorithm PSO-Q algorithm uses a binary string of {0,1} as the particle presentation, which expresses a classification of the SNP fragments. The length of a particle code is equal to the number of the SNP fragments and the value 0 or 1 on ith site denotes ith fragment’s class-membership. Then the size of solution space is 2m . The procedure of PSO-Q is described in Fig. 1. DNA sequencing usually produces hundreds and thousands of SNP fragments, i.e. m is very large in practical applications. Therefore, PSO-Q has a very large solution space, which makes PSO-Q difficult to find a good solution efficiently. It is well known that a pair of haplotypes has the same allele on a homozygous site, i.e. it has different alleles only at a heterozygous site. The proportion of heterozygous sites of a pair of haplotypes is low, e.g. it is about 0.2 in realistic situation [11]. If we encode only the heterozygous sites of a pair of haplotypes, we can get a smaller solution space than PSO-Q, for this kind of code is much shorter than the particle code of PSO-Q. Base on this idea, we introduce the following P-MEC algorithm. 3.2. P-MEC algorithm In this section, P-MEC algorithm based on a novel particle representation is presented. The input is an m  n SNP matrix M and a parameter t. The output is a pair of reconstructed haplotypes h of length n. Firstly, the matrix M is preprocessed by deleting some redundant information. After the preprocessing, the fragments in the new matrix only retain heterozygous 0 SNPs. Secondly, our designed PSO algorithm is implemented to generate h , which denotes a pair of haplotypes with only 0 heterozygous SNP sites, for the fragments in the new matrix have only heterozygous SNPs. Finally, h is augmented and the final pair of haplotypes h is got. Some important steps of P-MEC algorithm will be introduced in detail as follows. 3.2.1. Preprocessing In order to solve the problem effectively, at first the input matrix M is preprocessed. For each column j, if f0 ðjÞ 6 t (resp. f1 ðjÞ 6 t), it is deleted from M and is called as 1-field (resp. 0-field), here t is set to 0.2 [11]. The remained SNPs are all heterozygous sites. After dropping columns, if all letters in one fragment become , the fragment is also discarded because it is useless in reconstructing haplotypes. For convenience of description, the new SNP matrix is still denoted by Mmn .

Fig. 1. PSO-Q Algorithm.

366

J. Wu et al. / Applied Mathematics and Computation 208 (2009) 363–372

3.2.2. Particle swarm optimization (PSO) PSO simulates bird flocking behavior to achieve a self-evolution system [12]. A swarm of particles are initialized randomly, and the optimal solution is found after a number of generations. At every generation, each particle moves along each dimension of the problem space. Its velocity and direction will be altered and influenced by its personal best experience and the global best experience of the swarm, which are defined as formulas (9) and (10) [13].

V i ðT þ 1Þ ¼ w  V i ðTÞ þ C 1  rand1  ðPi  X i ðTÞÞ þ C 2  rand2  ðPg  X i ðTÞÞ;

ð8Þ

X i ðT þ 1Þ ¼ X i ðTÞ þ V i ðT þ 1Þ;

ð9Þ

where X i and V i denote the position and velocity of the ith particle respectively, T is the iterative times, w denotes the inertia weight factor, Pi is the personal best position of a given particle i so far, P g is the position of the best particle of the entire swarm so far, C 1 and C 2 are constants and are known as acceleration coefficients. The random values rand1 and rand2 are in the range of (0, 1). Our designed PSO algorithm for haplotype reconstruction is described as follows. (1) Particle representation Because only the heterozygous site has different alleles on both copies of a pair of haplotypes, the encoding of only heterozygous sites does not affect the reconstruction rate. A binary string Xðx1 ; x2 ; . . . ; xn Þ ðxi 2 f0; 1gÞ (resp. Vðv 1 ; v 2 ; . . . ; v n Þðv i 2 f0; 1gÞ) is used to represent a particle position (resp. velocity). Then the size of solution space is 2n . A particle position denotes a haplotype having only heterozygous SNPs. Because a pair of haplotypes has different alleles on heterozygous SNPs, a pair of haplotypes that has only heterozygous SNPs can be inferred from a particle position. A particle velocity measures the difference between two particle positions. (2) Generation of the initial swarm A swarm is made of N particles, i.e. the swarm size is N. For ith particle in a swarm, let X i ðTÞ (resp. V i ðTÞ) denote the particle position (resp. velocity) in generation T. An initial swarm is generated as follows. An initial velocity is a string of n ‘‘0”. An initial position is generated by the following method: partition the fragments of matrix M into two groups randomly, and generate a pair of haplotypes of length n according to formula (5). Then choose one copy of the pair of haploypes randomly to be a particle position. (3) Operations Because particle positions and velocities are all binary discrete variables, several operators are defined for the operations in formulas (9) and (10), and they can deal with binary discrete variables directly. (1) SubðX 1 ; X 2 Þ: It is used between two positions X 1 ðx11 ; x12 ; . . . ; x1n Þ and X 2 ðx21 ; x22 ; . . . ; x2n Þ, and the result denotes velocity V:

V ¼ SubðX 1 ; X 2 Þ ¼ ðv 1 ; v 2 ; . . . ; v n Þ  0 x1i ¼ x2i vi ¼ 1 x1i – x2i i ¼ 1; 2; . . . ; n

ð10Þ ð11Þ

(2) AddðV 1 ; V 2 Þ: It is defined as the logical OR of every bit of velocity V 1 ðv 11 ; v 12 ; . . . ; v 1n Þ and V 2 ðv 21 ; v 22 ; . . . ; v 2n Þ, and the result is velocity V:

V ¼ AddðV 1 ; V 2 Þ ¼ ðv 1 ; v 2 ; . . . ; v n Þ;

v i ¼ v 1i ORv 2i ;

v i 2 f0; 1g;

ð12Þ

i ¼ 1; 2; . . . ; n

(3) MulðC; V 1 Þ: It is the multiplication between probability Cð0 6 C 6 1Þ and velocity V 1 ðv 11 ; v 12 ; . . . ; v 1n Þ, and the result is velocity V:

V ¼ MulðC; V 1 Þ ¼ ðv 1 ; v 2 ; . . . ; v n Þ  v if 0:5 6 C 6 1 v i ¼ 1i 1  v 1i otherwise i ¼ 1; 2; . . . ; n;

ð13Þ ð14Þ

(4) XorðX 1 ; VÞ: It is defined as the logical XOR of every bit of position X 1 ðx11 ; x12 ; . . . ; x1n Þ and velocity Vðv 1 ; v 2 ; . . . ; v n Þ, and the result is denoted as position X:

X ¼ XorðX 1 ; VÞ ¼ ðx1 ; x2 ; . . . ; xn Þ; xi ¼ x1i XORv i ;

xi 2 f0; 1g;

ð15Þ

i ¼ 1; 2; . . . ; n

According to the above defined operators, formulas (9) and (10) can be translated into formulas (17) and (18) respectively:

V i ðT þ 1Þ ¼ AddðAddðMulðw; V i ðTÞÞ;

MulððC 1  rand1 Þ;

ðSubðPi ; X i ðTÞÞÞÞÞ;

MulððC 2  rand2 Þ;

ðSubðP g ; X i ðTÞÞÞÞÞ ð16Þ

X i ðT þ 1Þ ¼ XorðX i ðTÞ;

V i ðT þ 1ÞÞ

ð17Þ

J. Wu et al. / Applied Mathematics and Computation 208 (2009) 363–372

367

Fig. 2. PSO Algorithm.

Fig. 3. P-MEC Algorithm.

(4) Evaluation of particle fitnessFitness is used to evaluate the searching performance of particles and guide the search process of a particle swarm. Given a particle position X and all fragments fi ði ¼ 1; 2; . . . ; mÞ of M, the particle fitness FitnessðXÞ is defined as follows:

EðXÞ FitnessðXÞ ¼ 1  mn X EðXÞ ¼ minðSðfi ; XÞ; Dðfi ; XÞÞ

ð18Þ ð19Þ

16i6m 0

0

As mentioned above, position X denotes one copy of a pair of haplotypes ðh1 ; h2 Þ which has only heterozygous SNPs, e.g. 0 0 X ¼ h1 . Sðfi ; XÞ counts the number of the sites that have the same alleles on fi and h1 , i.e. the number of the sites that have 0 0 different alleles on fi and h2 . Dðfi ; XÞ counts the number of the sites that have different alleles on fi and h1 . EðXÞ is the min0 0 imum error correction number corresponding to the pair of haplotypes ðh1 ; h2 Þ. Therefore, the smaller EðXÞ is, the stronger the fitness of particle X is. The PSO algorithm for haplotype reconstruction is summarized in Fig. 2. 3.2.3. Augmenting 0 0 0 0 0 The homozygous SNPs deleted in preprocessing must be inserted again. For h ¼ ðh1 ; h2 Þ, augment h1 and h2 by the bits of the columns discarded in preprocessing to generate h1 and h2 . If a deleted column j is 0-field (resp. 1-field) then position j of 0 both haplotypes will also be 0 (resp. 1). Then we get the final pair of haplotypes h ¼ ðh1 ; h2 Þ. After augmentation, h becomes the final result h. Based on the above mentioned steps, P-MEC algorithm for haplotype reconstruction is illustrated in Fig. 3. 4. Experimental results In this section, real haplotype data was used to compare P-MEC with PSO-Q designed by Qian et al. [10]. The experiments were implemented on an Intel Pentium IV 2.0 GHz with a 512 MB of RAM. The operating system was Windows XP Professional and the compiler was Microsoft Visual C++ 6.0.

368

J. Wu et al. / Applied Mathematics and Computation 208 (2009) 363–372

Fig. 4. Reconstruction rates of different swarm sizes and iteration times with c ¼ 10; n ¼ 1000; Ps ¼ 0:05; f min ¼ 2; f max ¼ 4.

4.1. Experimental instances The experiments were performed using the data of genotypes_chr1_CEU_r22_nr.b36_fwd.phase.gz file,1 which was released by the International HapMap Project [14] on 20 December, 2007. The file records the haplotypes on chromosome1 of 60 individuals in CEPH samples (Utah residents with ancestry from northern and western Europe). The SNP fragments were generated as follows. Firstly, we select a individual at random from the 60 individuals. Then beginning with a random SNP site, a pair of haplotypes of length n can be obtained from the haplotypes of the selected individual. Secondly, in order to make the generated SNP fragments have the same statistical features as the real data, a widely used shotgun assembly simulator CELSIM [15] is invoked to generate m1 single SNP fragments (not mate-pair) and m2 matepair SNP fragments so that every mate-pair is made up of two single fragments from the same haplotype. The length of a single fragment was between f min and f max, and the length of a mate-pair was set to n=10 in our experiments. The values of f min and f max are 3 and 7 respectively [11] by using traditional Sanger sequencing technology. With the development of new sequencing technologies, e.g. 454 Sequencing [16], the length of a single fragment is shortened. Hence we also test the algorithms with shorter single SNP fragment length. The coverage was c=2 for both single and mate-pair SNP fragments, and the total coverage was c. Finally, the output SNP fragments are processed to plant reading errors with probability P s . In practical applications, c is between 5 and 10, and Ps is between 2% and 5% [11]. Interested readers refer to [15] for details about how to generate simulated data. 4.2. Performance evaluation In this section, the evaluation of P-MEC and PSO-Q algorithms is described. 100 data sets were generated for each parameter setting. The average over 100 runs at each parameter setting was calculated and presented. The parameters of PSO-Q algorithm were set according to Ref. [10]. Fig. 4 shows the reconstruction rates got by P-MEC algorithm with different swarm sizes and iteration times, where c ¼ 10; n ¼ 1000; Ps ¼ 0:05; f min ¼ 2 and f max ¼ 4. In this figure, RR varies between a very narrow range, which is from 0.9699 to 0.9725. Therefore, N was set to 20 and iteration times was set to 100 for the trade-off between reconstruction rate and running time. The rest parameters of P-MEC algorithm were set as follows: w ¼ 0:8; C 1 ¼ C 2 ¼ 0:7. In Table 1, three sets of parameters were set in dealing with single fragment length range ½f min; f max, and Ps ¼ 0:05; c ¼ 10; n ¼ 100. From this table we can see that the reconstruction rate of P-MEC algorithm is not affected by the change of single fragment length, while the reconstruction rate of PSO-Q algorithm decreases with the decreasing of single fragment length. When single fragment length decreases, the fragments number increases, so does the particle code length of PSO-Q. Hence the solution space of PSO-Q algorithm is enlarged, and PSO-Q algorithm cannot keep good performance with fixed swarm size and iterative times. The running time of the two algorithms increases with the decreasing of single fragment length. In Fig. 5, six sets of parameters were set in dealing with error rate Ps , and c ¼ 10; n ¼ 100; f min ¼ 2; f max ¼ 4. Fig. 5 shows that the reconstruction rates of P-MEC is higher than that of PSO-Q in every Ps setting. The RR of P-MEC changes from 0.9972 to 0.9862, while the RR of PSO-Q changes from 0.8748 to 0.8693. Fig. 6 shows the reconstruction rates of the two algorithms in dealing with coverage c, where P s ¼ 0:05; n ¼ 100; f min ¼ 2; f max ¼ 4. Form this figure we can see that the reconstruction rates of the two algorithms are all increased with

1

download from http://www.hapmap.org/downloads/phasing/2007-08_rel22/phased/

J. Wu et al. / Applied Mathematics and Computation 208 (2009) 363–372

369

Table 1 Reconstruction rate with Ps ¼ 0:05; c ¼ 10; n ¼ 100. ½f min; f max

[3,7] [2,4] [1,2]

RR

Running time (s)

P-MEC

PSO-Q

P-MEC

PSO-Q

0.9835 0.9862 0.9851

0.8926 0.8693 0.8612

0.2500 0.3130 0.8280

12.3440 21.0470 55.2577

Fig. 5. Reconstruction rates of different P s with c ¼ 10; n ¼ 100; f min ¼ 2; f max ¼ 4.

Fig. 6. Reconstruction rates of different c with P s ¼ 0:05; n ¼ 100; f min ¼ 2; f max ¼ 4.

the augment of c, and P-MEC algorithm has higher reconstruction rate than PSO-Q algorithm. The RR of P-MEC changes from 0.8512 to 0.9862, while the RR of PSO-Q changes from 0.7837 to 0.8693. The increase of fragments number caused by the augment of c increases the particle code length of PSO-Q algorithm, then enlarges its solution space. Therefore, fixed swarm size and iterative times hider the improvement of reconstruction rate for PSO-Q with the increase of c. In Fig. 7, the two algorithms are compared with different haplotype length n, where Ps ¼ 0:05; c ¼ 10; f min ¼ 2; f max ¼ 4. As shown in Fig. 7, the reconstruction rates of P-MEC are higher than that of PSO-Q when n varies from 100 to 1000. Generally, the reconstruction rate will decrease while n increases. But there is a strange case in Fig. 7. The reconstruction rate of PSO-Q algorithm with n ¼ 100 is smaller than that with n ¼ 200 and n ¼ 300, which could not be understood and explained now. We have done a lot of experiments with different settings, but we obtained the same results. The comparison results of reconstruction rate show that the pair of haplotypes generated by P-MEC algorithm has higher reconstruction rate than that obtained from PSO-Q algorithm. PSO is a kind of evolutionary computation or swarm intelligence optimization method. As mentioned above, it simulates bird flocking behavior to search the optimal result in

370

J. Wu et al. / Applied Mathematics and Computation 208 (2009) 363–372

Fig. 7. Reconstruction rates of different n with P s ¼ 0:05; c ¼ 10; f min ¼ 2; f max ¼ 4.

the solution space after a number of generations. The performance of PSO is closely related to the size of solution space, which is decided by the length of the particle code. It is apparent that the length of our designed particle code, which is equal to the number of heterozygous SNPs of a haplotype, is about nd, where d denotes the proportion of heterozygous sites of a pair of haplotypes. Due to the low heterozygous frequencies of SNPs, e.g. d is about 0.2 in realistic situation [11], our designed particle code is much shorter than the particle code of PSO-Q algorithm, which is equal to the number of fragments, i.e. about 2nc=ðf min þ f maxÞ þ 10c. For example, when n ¼ 100, c ¼ 10, f min ¼ 3 and f max ¼ 7, the particle code length of P-MEC algorithm is about 20, while the particle code length of PSO-Q algorithm is about 300. Therefore, the solution space corresponding to P-MEC algorithm (220) is much smaller than that of PSO-Q algorithm (2300). It is easier for P-MEC algorithm to get a good solution. In Fig. 8, five sets of parameters were generated in dealing with coverage c, which was varied from 2 to 10, where Ps ¼ 0:05; n ¼ 100; f min ¼ 2; f max ¼ 4. Fig. 8 shows that the running speed of P-MEC algorithm is 40 to 70 times faster than PSO-Q algorithm. In Fig. 9, six sets of parameters were generated in dealing with haplotype length n, where Ps ¼ 0:05; c ¼ 10; f min ¼ 2; f max ¼ 4. The running time of P-MEC is much shorter than that of PSO-Q. When n is 1000, the running time of PSO-Q is 856.7564 s, while P-MEC only uses 9.1061 s. Hence P-MEC is still efficient for reconstructing long haplotypes. From the comparison results of running time, it can be seen that P-MEC algorithm runs faster than PSO-Q algorithm, and the running time of P-MEC algorithm increases more slowly than that of PSO-Q algorithm with the increase of c and n or the decrease of single fragment length. The two algorithms spend most of their running time on computing the particle position and velocity, hence their running time mainly depends on the following three parameters: the particle code length, swarm size and iteration times. The larger the three parameters are, the longer the running time is. Because the two algorithms use the same iteration times in our experiments, the difference of their running time is closely related to particle code length and swarm size. It is well known that the swarm size is dependent upon the particle code length, small size warm can be used

Fig. 8. Running time of different c with P s ¼ 0:05; n ¼ 100; f min ¼ 2; f max ¼ 4.

J. Wu et al. / Applied Mathematics and Computation 208 (2009) 363–372

371

Fig. 9. Running time of different n with P s ¼ 0:05; c ¼ 10; f min ¼ 2; f max ¼ 4.

when the particle code length is short. As mentioned above, P-MEC algorithm has much shorter particle code than PSO-Q algorithm. Therefore, P-MEC algorithm can use much smaller size swarm than PSO-Q algorithm, and the running time of P-MEC algorithm is very short even for reconstructing long haplotypes. 5. Conclusions Haplotyping a single individual problem is one of the hottest areas in Computational Biology today. In this paper, a practical algorithm P-MEC based on particle swarm optimization is presented. A kind of short particle code is designed for P-MEC by taking advantage of the low heterozygous frequencies of single nucleotide polymorphisms. This kind of short code can also be adopted by other meta-algorithms for solving this problem, which will be studied in the future. Several operators are defined for computing particle positions and velocities. These operators can deal with binary discrete variables directly, and they also can be used to solve other combinatorial optimization problems. Experimental results indicate that P-MEC algorithm can get higher reconstruction rate than PSO-Q algorithm in different parameter settings. Because P-MEC algorithm uses short particle code, it can generate a small size solution space. It is easier for P-MEC algorithm to get a good solution. The running time of P-MEC algorithm is much shorter than that of PSO-Q algorithm. Furthermore, P-MEC algorithm is intended to reconstructing long haplotypes and has highly efficiency. P-MEC algorithm can still get high reconstruction rate when single fragment length decreases, hence P-MEC may be also adaptive to the new sequencing technology, which produces shorter DNA sequences than traditional one. In conclusion, P-MEC algorithm is a practical solution for haplotyping a single individual. Acknowledgements The authors are grateful to anonymous referees for their helpful comments and to Professor Gene Myers for his kindly providing the source codes of CELSIM. This work is in part supported by the National Natural Science Foundation of China under Grant No. 60433020 and No. 60773111, Provincial Natural Science Foundation of Hunan (06JJ10009), the Program for New Century Excellent Talents in University No. NCET-05-0683 and the Program for Changjiang Scholars and Innovative Research Team in University No. IRT0661. References [1] V. Bafna, S. Istrail, G. Lancia, R. Rizzi, Polynomial and APX-hard cases of the individual haplotyping problem, Theoretical Computer Science 335 (2005) 109–125. [2] J.C. Stephens et al, Haplotype variation and linkage disequilibrium in 313 human genes, Science 293 (2001) 489–493. [3] A.G. Clark, Inference of haplotypes from PCR-amplified samples of diploid populations, Molecular Biology and Evolution 7 (1990) 111–122. [4] D. Gusfield, Inference of haplotypes from samples of diploid populations: complexity and algorithms, Journal of Computational Biology 8 (2001) 305– 323. [5] G. Lancia, V. Bafna, S. Istrail, R. Lippert, R. Schwartz, SNPs problems, complexity and algorithms, in: Proceedings of the Ninth Annual European Symposium on Algorithms, 2001, pp. 182–193. [6] R. Rizzi, V. Bafna, S. Istrail, G. Lancia, Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem, in: Proceedings of Second International Workshop on Algorithms in Bioinformatics, 2002, pp. 29–43. [7] R. Lippert, R. Schwartza, G. Lancia, S. Istrail, Algorithmic strategies for the SNPs haplotype assembly problem, Briefings in Bioinformatics 3 (2002) 23– 31. [8] R.S. Wang, L.Y. Wu, J.H. Zhang, X.S. Zhang, Algorithms for SNP haplotype assembly problem, Applied Mathematics-A. Journal of Chinese Universities 19 (2004) 515–528.

372

J. Wu et al. / Applied Mathematics and Computation 208 (2009) 363–372

[9] R.S. Wang, L.Y. Wu, Z.P. Li, X.S. Zhang, Haplotype reconstruction from SNP fragments by minimum error correction, Bioinformatics 21 (2005) 2456– 2462. [10] W.Y. Qian, Y.J. Yang, N.N. Yang, L. Chun, Particle swarm optimization for snp haplotype reconstruction problem, Applied Mathematics and Computation 196 (2008) 266–272. [11] A. Panconesi, M. Sozio, Fast hare: a fast heuristic for single individual SNP haplotype reconstruction, in: Proceedings of Fourth Workshop on Algorithms in Bioinformatics, 2004, pp. 266–277. [12] J. Kennedy, R.C. Eberhart, Particle swarm optimization, in: Proceedings of the IEEE International Conference on Neural Networks, vol. 4, 1995, pp. 1942–1948. [13] J. Kennedy, R.C. Eberhart, Y. Shi, Swarm Intelligence, Morgan Kaufman Publishers, San Francisco, 2001. [14] T.I.H. Consortium, The international HapMap project, Nature 426 (6968) (2003) 789–796. [15] G. Myers, A dataset generator for whole genome shotgun sequencing, in: Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology, 1999, pp. 202–210. [16] M. Margulies, Genome sequencing in microfabricated high-density picolitre reactors, Nature 437 (7057) (2005) 376–380.