Biochemical and Biophysical Research Communications 404 (2011) 593–598
Contents lists available at ScienceDirect
Biochemical and Biophysical Research Communications journal homepage: www.elsevier.com/locate/ybbrc
Effective haplotype assembly via maximum Boolean satisfiability Sayyed R. Mousavi a,⇑, Maryam Mirabolghasemi a, Nadia Bargesteh a, Majid Talebi b a b
Department of Computer Engineering and Information Technology, Isfahan University of Technology, Isfahan 84156-83111, Iran Department of Agricultural Biotechnology, Isfahan University of Technology, Isfahan 84156-83111, Iran
a r t i c l e
i n f o
Article history: Received 16 November 2010 Available online 7 December 2010 Keywords: Haplotype assembly Heuristic algorithms Max-SAT Single Individual Haplotyping SNP
a b s t r a c t The haplotype assembly problem seeks the haplotypes of an individual from which a set of aligned SNP fragments are available. The problem is important as the haplotypes contain all the SNP information, which is essential to such studies as the analysis of the association between specific diseases and their potential genetic causes. Using Minimum Error Correction as the objective function, the problem is NP-hard, which raises the demand for effective yet affordable solutions. In this paper, we propose a new method to solve the problem by providing a novel Max-2-SAT formulation for the problem. The proposed method is compared with several well-known algorithms proposed for the problem in the literature on a recent extensive benchmark, outperforming them all by achieving solutions of higher average quality. Ó 2010 Elsevier Inc. All rights reserved.
1. Introduction The recognition of genetic variations among individuals is essential to such research areas as investigating the potential association between certain genes and human diseases and their treatment. Consequently, the study of DNA variation, especially in human being, has been an active research area in the recent years. The demand for such research areas has been further raised by the recent advances in DNA sequencing technology and the announcement of a complete individual human genome sequencing [1]. One of the most common forms of genetic variation is Single Nucleotide Polymorphism, SNP, which is the alteration of a single base in the genome. The sequence of SNPs in a certain chromosome is called a haplotype (for ‘‘haploid genotype’’). Haplotypes are a combination of alleles at different markers along the same chromosome that are inherited as a unit. Humans are diploid organisms; that is, except for the sexual chromosomes of males, the chromosomes come in two copies: one inherited from the mother and the other from the father. Therefore, there are two copies of the haplotypes for each of the SNP sequences. Although current sequencing techniques can detect the presence of SNP sites, they cannot specify to which copy of the haplotypes the alleles belong. Since haplotypes contain all the information about DNA variations, they play a crucial role in many studies about variations in gene expression and prediction of diseases. As a consequence, The International HapMap Project is intended to develop a detailed haplotype map of human genome [2]. ⇑ Corresponding author. Fax: +98 311 391 2451. E-mail address:
[email protected] (S.R. Mousavi). 0006-291X/$ - see front matter Ó 2010 Elsevier Inc. All rights reserved. doi:10.1016/j.bbrc.2010.12.001
The problem of Haplotype Assembly, also known as Single Individual Haplotyping (SIH), is to obtain the pair of haplotypes from which a number of aligned fragments are available. The main difficulty is the possible inconsistencies due to imperfect reading instruments. It is important to note that no algorithm can claim the ability to always obtain the true haplotypes. The way such algorithms work is to try to minimize some error measure, also called an objective function, which is expected to be low when the right haplotypes are recognized. Among well-known objective function for this purpose is Minimum Error Correction (MEC) [3]. Further objective functions may be found in [4]. The proposed method is this paper is based on minimizing MEC. In the rest of the paper, we will use SIH to mean the haplotype assembly problem based on using the MEC objective function. The SIH problem has already been proved NP-hard [5], which implies that all existing exact algorithms proposed for the problem are of exponential time complexity. Two such algorithms are the dynamic programming algorithms in [6] and [3], which are both exponential in the maximum length k of the input fragments, hence not practical for large values of k. Therefore, in [3], a third exact algorithm based on a Partial Max-SAT formulation was proposed for large values of k (more specifically for k > 15). However, the proposed algorithm was also rather time-consuming, hence run on a cluster (with 480 nodes), while achieving slightly improved solutions compared to those of a fast heuristic algorithm HapCUT [7], as reported in [3]. Several other heuristic algorithms have already been proposed for the problem [1,7,8]. We avoid explaining such algorithms here, but a brief survey, including comparisons, of them is available in [4]. It includes those which are most practical and excludes, for example, those requiring extensive run-time and those not specified precisely enough to reproduce. In
594
S.R. Mousavi et al. / Biochemical and Biophysical Research Communications 404 (2011) 593–598
this paper, we propose a novel method HapSAT, based on a Max-2SAT formulation for the problem. We compare the proposed method with the seven selected algorithms in [4] but not with those which were excluded from [4] because of their run-time inefficiency on incomplete specifications. 2. Materials and methods 2.1. Basic notations and problem formulation P P Let = {A, T, C, G, –} and s be a string over . We use |s| to denote the length of s and s[k], k = 1, . . ., |s|, to denote the character at its kth position. We call the alphabet character 0 –0 a gap and any P other alphabet character in a base. By a read, or a fragment, we P mean a string over . We assume that all reads are of the same length n (since they have already been aligned). Such reads are usually presented as an m n matrix R, called the read matrix, where m is the number of reads each comprising a row of the matrix such that the element Rji will be the ith character of the jth read. P A haplotype is also a string of length n over . Let r be a read and h a haplotype. We define Minimum Error Correction (MEC) P of r with respect to h as MEC(r, h) = ni¼1 dðr½i; h½iÞ, where:
dðr½i; h½iÞ ¼
0 if r½i ¼ h½i or r½i ¼ 0 —0 1
otherwise:
Then, the SIH problem is formally defined as follows. Problem: SIH Input instance: a read matrix Rm n Outputs: (1) a pair of haplotypes H = (h0, h1), and (2) a covering function f(.) from {1, . . ., m} to {0, 1}. P Minimize: m j¼1 MECðRj ; hf ðjÞ Þ, where Rj denotes the jth read, i.e. the read in the jth row of R. Note that the function f(.) assigns each read to one of the haplotypes. Finally, we use pol(i), i = 1, . . ., n, to denote the number of base characters in the ith column of the input read matrix R. Without loss of generality, we assume that pol(i) > 0, i = 1, . . ., n (otherwise, the values h0[i] and h1[i] could arbitrarily be chosen among P those in without affecting the MEC value). 2.2. The Max-SAT problem Let u be a Boolean formula over binary variables x1, x2, . . ., xn. The Boolean satisfiability problem, SAT, is to decide whether there exists an assignment to the variables x1, x2, . . ., xn such that u evaluates to true [9]. If such a variable assignment, which satisfies the formula F, exists, then the formula is called satisfiable; it is called unsatisfiable otherwise. SAT was the first problem proved NP-complete [9], following which many other problems were proved so by reductions from SAT or other NP-complete problems [10]. Among special cases of SAT is CNF-SAT, where the formula F is in the conjunctive normal form (CNF). That is, the formula is represented as the conjunction of a number of clauses, each a disjunction of a number of literals, where a literal is a binary variable or its negation. A variant of the CNF-SAT problem is the maximum Boolean satisfiability problem, Max-SAT for short, which is to obtain a variable assignment that satisfies as many clauses as possible [11]. In this paper, we represent an instance of Max-SAT by a set C of clauses; the variables are those whose literals are used in some clauses in C. Finally, a special case of Max-SAT is Max-k-SAT, k = 2, 3, . . ., where each clause contains exactly k literals. Example. Consider the following simple instance of Max-SAT: C = {:a _ :b; :a _ b; a; a _ b}. In this instance, there are four clauses
and two variables a and b. The assignment (a = False, b = True), for example, satisfies three out of four clauses; no assignment can satisfy all of them. Now consider the Max-2-SAT instance C = {:a _ :b; :a _ b; a _ :b}. The optimal variable assignment is (a = False, b = false) which satisfies all the three clauses. Despite the NP-hardness of SAT and Max-SAT, last decade has witnessed remarkable improvements in SAT and Max-SAT solvers [12–14]. Such solvers can be used to solve a problem if it can be formulated as a SAT or a Max-SAT problem [15]. 2.3. The proposed HapSAT method In this section, we propose a new method, HapSAT, to solve the SIH problem by formulating it as a Max-2-SAT problem. That is, given an instance ISIH of SIH, we create an instance IM2S of Max-2-SAT such that a solution to ISIH is obtained from a solution to IM2S. A similar work has recently been accomplished in [3], where a Partial Max-SAT formulation was provided for the SIH problem. However, the proposed Max-2-SAT formulation in this paper has several advantages over the Partial Max-SAT formulation of [3]. First, there are in general more Max-SAT solvers available than Partial MaxSAT solvers. For example, most of the (heuristic) solvers in UBCSAT may readily be used as (heuristic) Max-SAT solvers [16]. In addition, any Partial Max-SAT solver can also be used to solve Max-SAT problems (by defining all the clauses as to be soft), whereas the converse does not hold. Second, our Max-SAT formulation results in instances with fewer variables and clauses than those of the Partial Max-SAT formulation. To be precise, the number of variables and clauses in the Partial Max-SAT formulation of [3] are n + m + q and 5 q (4 q ternary hard and q unary soft clauses), respectively, where q is the total number of base characters in columns i, i = 1, . . ., n, such that pol(i) > 1. However, the number of variables and clauses in our proposed Max-2-SAT formulation are only 2 n + m and 2 q, respectively. Third, our formulation generalizes the formulation of [3]. More specifically, the Partial Max-SAT formulation of [3] presumes that h0[i] – h1[i] at all sites i where pol(i) > 1, i = 1, . . ., n, because it considers such a site as to be heterozygote. However, even for a homozygote site i, i = 1, . . ., n, we may well have pol(i) > 1 simply due to a reading error. Our formulation is more general in the sense that it allows for the case h0[i] = h1[i] even at sites i, i = 1, . . ., n, where pol(i) > 1. In fact, the solution space in our formulation is a superset of the solutions space in the Partial Max-SAT formulation. Consequently, given an instance ISIH of the haplotype assembly problem, any optimal solution obtained via our proposed formulation will yield an MEC error value, which is less than, or equal to, the MEC value of the Partial Max-SAT formulation for the same instance ISIH. Fig. 1 shows a high-level pseudo-code for HapSAT, which consists of four steps. In step 1, at first, a two-dimensional array Most[ ][ ] is populated such that Most[i][1], i = 1, . . ., n, pol(i) > 0, contains the most frequent base character in the ith column of R and that Most[i][0] , i = 1, . . ., n, pol(i) > 1, contains the second most frequent base character in the ith column of R (tie are broken arbitrarily). Recall that we have assumed pol(i) > 0. Then, still in step 1, the read matrix R is converted to a binary read matrix, by replacing every base character with 1, 0, or 0 –0 . To this end, every base character in every column i, i = 1, . . ., n, such that pol(i) = 1 is replaced with 1. On the other hand, if there are more than one distinct base character in a column, then the first and the second most frequent of them will be replaced with 1 and 0, respectively, and the rest, if any, will be replaced with 0 –0 . In step 2 of the algorithm, for each column i, i = 1, . . ., n, where pol(i) = 1, the ith elements of both haplotypes are set to 1, i.e. h0[i] = h1[i] = 1. The challenge is then to determine the values of h0[i], h1[i], i = 1, . . ., n, where pol(i) > 1, together with the
S.R. Mousavi et al. / Biochemical and Biophysical Research Communications 404 (2011) 593–598
595
Fig. 1. The proposed HapSAT method for solving the SIH problem via the Max-2-SAT formulation.
bi-clustering function f(.) such that the resulting MEC value is minimized. Note that this problem is slightly different from the original haplotype assembly problem in that the alphabet over which the reads and the haplotypes are defined is now {1, 0, –} as opP posed to the original alphabet = {A, T, C, G, –}. The rest of step 2 converts this problem instance to an equivalent Max-2-SAT instance IM2S. Let us simply use fj to refer to f(j), j = 1, . . ., m. Then the Max-2-SAT instance IM2S created in step 2 is C ¼ C 0 [ C 1 , where:
C0 ¼
n [
m n _ o [ _ fj h0 ½i ; fj h1 ½i
i¼1 j¼1 polðiÞ>1 Rji ¼0
and
C1 ¼
n [
m n _ o [ _ fj h0 ½i ; fj h1 ½i
i¼1 j¼1 polðiÞ>1 Rji ¼1
Note that for each element Rji, i = 1, . . ., n, j = 1, . . ., m, pol(i) > 1, exactly two clauses are added to C. We do not present in this paper an extensive formal proof on the equivalence of the haplotype assembly instance and the generated Max-2-SAT instance but provide the main point for this equivalence. Recall that we want to minimize the total MEC value, and note that each element Rji, i = 1, . . ., n, j = 1, . . ., m, pol(i) > 1, contributes one unit to the total MEC value if, and only if, Rji – hf(j)[i]. We now show that in this case exactly one of two clauses corresponding to Rji will be false, hence contributing exactly one unit to the number of unsatisfied clauses of IM2S. To this end, first assume that Rji = 0. This in turn splits into two possible cases of fj = 0 and fj = 1; the former when the jth read Rj is assigned to the haplotype h0 and the latter when it is assigned to h1. Therefore, an error occurs when fj = 0 and h0 = 1 or when fj = 1 and h1 = 0. In either case, exactly one of the clauses ðfj _ h0 ½iÞ or ð fj _ h1 ½iÞ will be false. Now assume that Rji = 1, which in turn splits into two cases of fj = 0 and fj = 1. Then, an error occurs when fj = 0 and h0 = 0 or when fj = 1 and h1 = 1. In either case, again,
596
S.R. Mousavi et al. / Biochemical and Biophysical Research Communications 404 (2011) 593–598
Table 1 Comparison of HapSAT with seven algorithms of [4] for l = 100. e
c
Baseline
SpeedHap
Fast Hare
2d-mec
HapCUT
MLF
SHR-three
DGS
HapSAT
0
3 5 8 10
1.0000 1.0000 1.0000 1.0000
0.9989 1.0000 1.0000 1.0000
0.9998 0.9996 1.0000 1.0000
0.9905 0.9973 1.0000 1.0000
1.0000 1.0000 1.0000 1.0000
0.9730 0.9922 0.9970 0.9984
0.8162 0.8609 0.9119 0.9440
1.0000 1.0000 1.0000 1.0000
1.0000 1.0000 1.0000 1.0000
0.1
3 5 8 10
0.9707 0.9918 0.9972 0.9992
0.8948 0.9675 0.9892 0.9904
0.9128 0.9642 0.9930 0.9981
0.9116 0.9508 0.9835 0.9881
0.9286 0.9204 0.9006 0.8921
0.8891 0.9697 0.9854 0.9951
0.6957 0.7377 0.7584 0.7621
0.9301 0.9851 0.9894 0.9967
0.9738 0.9950 0.9995 0.9999
0.2
3 5 8 10
0.8984 0.9444 0.9672 0.9802
0.6230 0.7992 0.8517 0.8654
0.7150 0.7974 0.8807 0.9154
0.7380 0.7931 0.8730 0.8943
0.7822 0.8380 0.8640 0.8710
0.7251 0.8358 0.9176 0.9376
0.6148 0.6546 0.6812 0.6989
0.7250 0.8127 0.8785 0.9175
0.9080 0.9664 0.9912 0.9970
0.3
3 5 8 10
0.7886 0.8404 0.8780 0.9030
0.4802 0.6370 0.6666 0.6758
0.6169 0.6391 0.6610 0.6754
0.6233 0.6403 0.6749 0.6779
0.6019 0.6294 0.6726 0.7086
0.6176 0.6528 0.6968 0.7146
0.5571 0.5993 0.6322 0.6321
0.6111 0.6469 0.6634 0.6876
0.7889 0.8874 0.9534 0.9744
exactly one of the clauses ðfj _ h0 ½iÞ or ð fj _ h1 ½iÞ will be false. Similarly, it can be shown that when Rji = hf(j)[i], both of the corresponding clauses will be true. Consequently, the total MEC value is equal to the total number of false clauses. This implies that any variable assignment which minimizes the total number of false clauses also minimizes the total MEC value, and an optimal solution to IM2S is also an optimal solution to ISIH. Once the Max-2-SAT instance is constructed, a Max-SAT solver is invoked, in step 3, to obtain a variable assignment. For this purpose, we used the heuristic Max-SAT solver irots [17] provided in the UBCSAT package [16]. Based on the obtained variable assignment and using the two-dimensional array Most[ ][ ], the final haplotypes are restored, in step 4, in terms of the original alphabet characters of A, T, C, and G. 3. Results In order to evaluate the proposed method to address the haplotype assembly problem, we compared it with the (seven) algorithms examined in [4] on the same benchmark. We implemented HapSAT in C++, using Visual Studio 2005, and ran it on a Dell vostro 1510 Laptop with 4 GB of RAM, 2.5 MHz CPU, and Windows OS. The benchmark used to evaluate and compare the proposed algorithm with the other algorithms is exactly the one used in [4], which was created from real haplotypes of Phase I HapMap data.1 We avoid, for brevity, the details of the data and how the read matrixes are obtained from the real haplotypes, as these can be found in [18] and [4]. We just mention that there are three parameters of the haplotypes length, l = 100, 350, and 700, the error rate, e = 0, 0.1, 0.2, and 0.3, and the coverage rate, c = 3, 5, 8, and 10. For each triple, hl, c, ei, there are 100 problem instances, i.e. read matrixes. Given a problem instance based on a pair of real haplotypes H = (h0, h1), the performance of an algorithm, which returns a pair b 0; b b ¼ ðh of haplotypes H h 1 Þ, is measured by:
Rb ¼ 1 H ;H
n o b 0 Þ þ Dðh1 ; b b 1 Þ þ Dðh1 ; h b0Þ min Dðh0 ; h h 1 Þ; DðH0 ; h 2m
;
where D(.,.) is a function used to determine the (generalized hamming) distance between its arguments as defined in [4]:
1 Available at http://hapmap.ncbi.nlm.nih.gov/downloads/phasing/2005-03_phaseI/full.
bjÞ ¼ Dðhi ; h
n X k¼1
b j ½kÞ; where dðhi ½k; b dðhi ½k; h h j ½kÞ ¼
(
b ½k 0 if hi ½k ¼ h j 1 otherwise:
Tables 1–3 show the results for fragment lengths of 100, 350, and 700, respectively. The first columns in these tables represent the error rate. The second columns show the coverage rate c. The third columns report the results obtained by Baseline, which is not actually a true haplotype assembly algorithm as it needs to know to which haplotype each read belongs. The purpose of including Baseline in [4] was to provide a (near-optimal) solution to show how good the outputs of the other algorithms are. The results obtained by the other seven algorithms are reported in the fourth to the tenth columns; these results have been taken directly from [4]. The last columns report the results obtained by HapSAT. All the reported results are rounded up to four decimal figures, the best of which (excluding those of Baseline), for each row, shown in gray. As can be seen in Table 1, in all the cases, HapSAT outperforms all the other seven algorithms, except for the cases of e = 0, where both HapSAT and some other algorithm achieve the maximum value of 1. However, in all cases of e – 0, HapSAT outperforms not only the other seven algorithms but also Baseline. It can be seen in Table 2 that HapSAT still achieves the maximum value of 1 for all cases of e = 0. In all other cases, it outperforms all the other seven algorithms; it even outperforms Baseline in all these cases except in a single case of e = 0.3 and c = 3. The results in Table 3 are slightly different from those of Tables 1 and 2 in that HapSAT is now outperformed by some other algorithm, e.g. DGS and HapCut, in the cases of e = 0. However, in all such cases, HapSAT achieves a value of greater than 0.999, which is so close to the maximum value of 1. In all the other cases, where e – 0, HapSAT again outperforms all the other seven algorithms; it even outperforms Baseline in most of these cases. On average over all the cases, not shown in the tables, HapSAT achieves the accuracy of 0.9589 which is more than the average accuracy of all the algorithms. The best average value among the other seven algorithms is 0.8493 by DGS. The average value for Baseline is 0.9469 which is slightly less than that of HapSAT (recall that Baseline is not a true haplotype assembly algorithm). We have not provided a precise run-time comparison of the algorithms because of using different machines, and we leave that as a future work. We just mention that the average run-time of HapSAT, including the time to read in the input file, was less than a second for most of the cases. The longest run-time was 8 s for a few instances of Table 3. The source code of our HapSAT is freely available upon request.
597
S.R. Mousavi et al. / Biochemical and Biophysical Research Communications 404 (2011) 593–598 Table 2 Comparison of HapSAT with seven algorithms of [4] for l = 350. e
c
Baseline
SpeedHap
Fast Hare
2d-mec
HapCUT
MLF
SHR-three
DGS
HapSAT
0
3 5 8 10
1.0000 1.0000 1.0000 1.0000
0.9993 1.0000 1.0000 1.0000
0.9896 0.9996 1.0000 0.9999
0.9650 0.9927 0.9978 0.9991
1.0000 1.0000 1.0000 1.0000
0.8642 0.9288 0.9692 0.9810
0.8297 0.8295 0.8954 0.8784
0.9998 1.0000 1.0000 1.0000
1.0000 1.0000 1.0000 1.0000
0.1
3 5 8 10
0.9707 0.9905 0.9972 0.9990
0.8188 0.9592 0.9843 0.9836
0.8711 0.9453 0.9852 0.9948
0.8386 0.9130 0.9641 0.9781
0.9299 0.9134 0.8963 0.8883
0.7517 0.8582 0.9327 0.9616
0.6822 0.7244 0.7416 0.7285
0.9260 0.9785 0.9963 0.9982
0.9738 0.9947 0.9996 1.0000
0.2
3 5 8 10
0.8959 0.9430 0.9680 0.9811
0.4392 0.7287 0.8247 0.8555
0.6843 0.7456 0.8529 0.8774
0.6746 0.7284 0.7912 0.8169
0.7709 0.8306 0.8616 0.8672
0.6418 0.7278 0.7985 0.8314
0.5915 0.6318 0.6699 0.6682
0.6914 0.7689 0.8423 0.8784
0.8977 0.9629 0.9919 0.9969
0.3
3 5 8 10
0.7826 0.8401 0.8734 0.9026
0.2509 0.5784 0.6294 0.6381
0.5901 0.6021 0.6259 0.6437
0.5927 0.6061 0.6230 0.6340
0.5648 0.5817 0.6206 0.6641
0.5808 0.6063 0.6339 0.6408
0.5476 0.5575 0.6043 0.6189
0.5781 0.6095 0.6285 0.6408
0.7180 0.8846 0.9533 0.9849
Table 3 Comparison of HapSAT with seven algorithms of [4] for l = 700. e
c
Baseline
SpeedHap
Fast Hare
2d-mec
HapCUT
MLF
SHR-three
DGS
HapSAT
0
3 5 8 10
1.0000 1.0000 1.0000 1.0000
0.9993 1.0000 1.0000 1.0000
0.9876 0.9989 1.0000 0.9998
0.9461 0.9760 0.9917 0.9966
1.0000 1.0000 1.0000 1.0000
0.7816 0.8544 0.9194 0.9333
0.7815 0.8324 0.8682 0.8983
0.9999 1.0000 1.0000 1.0000
0.9998 0.9999 0.9999 0.9994
0.1
3 5 8 10
0.9712 0.9913 0.9972 0.9992
0.7047 0.9471 0.9848 0.9861
0.8295 0.9408 0.9859 0.9955
0.7860 0.8805 0.9483 0.9649
0.9269 0.9158 0.8957 0.8892
0.6982 0.8094 0.8632 0.8839
0.6679 0.7158 0.7429 0.7260
0.9315 0.9775 0.9873 0.9966
0.9708 0.9937 0.9993 0.9999
0.2
3 5 8 10
0.8978 0.9427 0.9661 0.9798
0.1990 0.6810 0.8006 0.8127
0.6518 0.7118 0.8078 0.8719
0.6468 0.6969 0.7512 0.7780
0.7531 0.8250 0.8562 0.8610
0.6240 0.6820 0.7475 0.7650
0.5913 0.6170 0.6529 0.6748
0.6692 0.7415 0.8177 0.8607
0.8753 0.9633 0.9902 0.9952
0.3
3 5 8 10
0.7860 0.8382 0.8748 0.9024
0.0953 0.5232 0.6158 0.6271
0.5814 0.5915 0.6147 0.6165
0.5828 0.5961 0.6126 0.6219
0.5524 0.5553 0.5966 0.6455
0.5701 0.5944 0.6139 0.6248
0.5363 0.5622 0.6113 0.6251
0.5726 0.5946 0.6138 0.6222
0.6443 0.8805 0.9519 0.9691
4. Discussion
Acknowledgments
To obtain the pair of haplotypes of an individual is a prerequisite to various genetic analyses such as the study of the association between certain diseases and their potential genetic causes. The demand for efficient solutions to the haplotype assembly problem has further increased as a consequence of recent advances in the sequencing technology and human whole-genome sequencing. In this paper, based on a new Max-2-SAT formulation, a novel method called HapSAT was proposed for the problem and compared with several distinguished algorithms, providing significantly superior results. The input to the haplotype assembly problem considered in this paper is only a read matrix. However, additional information can help increase the likelihood of determining the right haplotypes. For example, genotype information may be used to detect homozygote sites and exclude the corresponding variables and clauses from the proposed Max-2-SAT formulation. In addition, statistical probability distribution of reading error could also be exploited for this purpose. We believe that parts of such statistical information can be obtained by comparing the input read matrix with the genotype information, if available. Machine learning techniques could also be used for this purpose. We leave the extension of the proposed method to exploit such additional information as a future work.
The authors thank Dr. F. Geraci for kindly providing us with his extensive benchmark. References [1] S. Levy, G. Sutton, P. Ng, L. Feuk, A. Halpern, et al., The diploid genome sequence of an individual human, PLoS Biol. 5 (2007) e254. [2] ‘‘International HapMap Project’’. Available from:
, 12 November 2010. [3] D. He, A. Choi, K. Pipatsrisawat, A. Darwiche, E. Eskin, Optimal algorithms for haplotype assembly from whole-genome sequence data, Bioinformatics 26 (2010) i183–i190. [4] F. Geraci, A comparison of several algorithms for the single individual SNP haplotyping reconstruction problem, Bioinformatics 26 (2010) 2217–2225. [5] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-completeness, W.H. Freeman, New York, 1979. [6] J. Wang, M. Xie, J. Chen, A practical exact algorithm for the individual haplotyping problem MEC/GI, Algorithmica 56 (2010) 283–296. [7] V. Bansal, V. Bafna, HapCUT: an efficient and accurate algorithm for the haplotype assembly problem, Bioinformatics 24 (2008) 153–159. [8] S.H. Kang, I.S. Jeong, H.G. Cho, H.S. Lim, HapAssembler: a web server for haplotype assembly from SNP fragments using genetic algorithm, Biochem. Biophys. Res. Commun. 397 (2010) 340–344. [9] S.A. Cook, The complexity of theorem proving procedures, in: Proc. ACM Symp. Theor. Comput., ACM, New York, 1971, pp. 151–158. [10] R.M. Karp, Reducibility among combinatorial problems, in: R.E. Miller, J.W. Thatcher (Eds.), Complexity of Computer Computations, Plenum, New York, 1972, pp. 85–103.
598
S.R. Mousavi et al. / Biochemical and Biophysical Research Communications 404 (2011) 593–598
[11] A.Z. Broder, A.M. Frieze, E. Upfal, On the satisfiability and maximum satisfiability of random 3-CNF formulas, in: Proc. fourth ACM-SIAM Symp. Disc. Alg., The Society for Industrial and Applied Mathematics, USA, 1993, pp. 322–330. [12] M. Moskewicz, C. Madigan, Y. Zhao, et al., Chaff: engineering an efficient SAT solver, in: Proc. Conf. Des. Auto., ACM, New York, 2001, pp. 530–535. [13] E. Goldberg, Y. Nonikov, BerkMin: a fast and robust SAT solver, in: Proc. DATE ’02, IEEE Computer Society, USA, 2002, pp. 142–149. [14] J. Marques-Silva, J. Planes, Algorithms for maximum satisfiability using unsatisfiable cores, in: Proc. Conf. DATE’08, 2008, ACM, New York, pp. 408–413. [15] I. Lynce, J. Marques-Silva, SAT in bioinformatics: making the case with haplotype inference, in: Proc. SAT06, LNCS, vol. 4121, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 136–141.
[16] D.A.D. Tompkins, H.H. Hoos, UBCSAT: an implementation and experimentation environment for SLS algorithms for SAT and MAX-SAT, in: Proc. Seventh Int. Conf. Theor. Appl. Sat., LNCS, vol. 3542, Springer-Verlag, Berlin, Heidelberg, 2005, pp. 306–320. [17] K. Smyth, H.H. Hoos, T. Stutzle, Iterated robust tabu search for MAX-SAT, in: Proc. AI-03, LNCS, vol. 2671, Springer-Verlag, Berlin, Heidelberg, 2003, pp. 129–144. [18] D. Altshuler, L.D. Brooks, A. Chakravarti, et al., A haplotype map of the human genome, Nature 437 (2005) 1299–1320.