The complete parsimony haplotype inference problem and algorithms based on integer programming, branch-and-bound and Boolean satisfiability

Journal of Discrete Algorithms 37 (2016) 68–83 Contents lists available at ScienceDirect Journal of Discrete Algorithms www.elsevier.com/locate/jda ...

Download PDF

436KB Sizes 0 Downloads 24 Views

Report

PDF Reader
Full Text

Journal of Discrete Algorithms 37 (2016) 68–83

Contents lists available at ScienceDirect

Journal of Discrete Algorithms www.elsevier.com/locate/jda

The complete parsimony haplotype inference problem and algorithms based on integer programming, branch-and-bound and Boolean satisﬁability Gerold Jäger a,∗ , Sharlee Climer b , Weixiong Zhang b a b

Department of Mathematics and Mathematical Statistics, University of Umeå, SE-90187 Umeå, Sweden Department of Computer Science and Engineering, Washington University St. Louis, US-63130-4899 St. Louis, United States

a r t i c l e

i n f o

Article history: Available online 7 June 2016 Keywords: Haplotype inference Pure parsimony Branch-and-bound Integer programming Boolean satisﬁability

a b s t r a c t Haplotype inference by pure parsimony (Hipp) is a well-known paradigm for haplotype inference. In order to assess the biological signiﬁcance of this paradigm, we generalize the problem of Hipp to the problem of ﬁnding all optimal solutions, which we call Chipp. We study intrinsic haplotype features, such as backbone haplotypes and fat genotypes as well as equal columns and decomposability. We explicitly exploit these features in three computational approaches that are based on integer linear programming, depth-ﬁrst branch-and-bound, and Boolean satisﬁability. Further we introduce two hybrid algorithms that draw upon the diverse strengths of the approaches. Our experimental analysis shows that our optimized algorithms are signiﬁcantly superior to the baseline algorithms, often with orders of magnitude faster running time. Finally, our experiments provide some useful insights into the intrinsic features of this important problem. © 2016 Elsevier B.V. All rights reserved.

1. Introduction In this age of rapid advances in biological and medical ﬁelds, a number of compelling computational challenges have arisen and propelled the state-of-the-art. Haplotype inference is one such challenge. A haplotype is a set of nucleotides that are in physical proximity on a chromosome strand. Diploid species have pairs of chromosomes and, consequently, pairs of homologous haplotypes. Genotypes are conﬂations of these haplotype pairs, and their states are the primary type of data supplied using current technologies. For example, at a particular site, an individual might have inherited ‘A’ and ‘C’ nucleotides from their mother and father, respectively. In this case, one haplotype would include the ‘A’ nucleotide, while the other includes the ‘C’ nucleotide. The genotype for this site would be ‘AC’, and the entire genotype would be comprised of this pair, along with a number of other nucleotide pairs in close genomic proximity. Thanks to recent advances, genotypic data are highly accurate and complete. For example, the International HapMap project has produced genotypes that are 99.7% accurate and 99.3% complete [47]. Technological advances have also enabled the resolution of each individual haplotype contributing to a genotype for small studies of short regions and/or single individuals [26]. However, the identiﬁcation of haplotypes directly in a laboratory setting is currently infeasible for most demands of current research. For this reason, researchers commonly rely on mathematical models and computational algorithms for inferring haplotypes from genotypes. Genotype data for computational haplotype inference are typically comprised of a small subset of “tag” markers from a genomic region that aim to adequately capture the haplotype variations of the region within the given population. The phased

*

Corresponding author. E-mail addresses: [email protected] (G. Jäger), [email protected] (S. Climer), [email protected] (W. Zhang).

http://dx.doi.org/10.1016/j.jda.2016.06.001 1570-8667/© 2016 Elsevier B.V. All rights reserved.

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

69

alleles for each chromosome derived from these tag markers are commonly referred to as “haplotypes” in this domain and we employ this deﬁnition here. The ﬁrst widely used algorithm for inferring haplotypes was introduced by Clark in 1990 [5]. The objective of his approximation, referred to as Clark’s Rule, is based on the assumption that the number of unique haplotypes in a given population is relatively small. A number of additional algorithms based on Clark’s Rule have been developed (e.g. [2,8,10,11,22,34–36, 41]). In contrast to these heuristic approaches, Gusﬁeld and Orzack independently proposed a haplotype inference model using Pure Parsimony (Hipp), in which the number of unique haplotypes is minimized [19]. Given a set of genotypes, Hipp is to ﬁnd a set of distinct haplotypes with minimum cardinality such that each genotype can be resolved by two haplotypes. In other words, absolutely no solution exists for which the number of unique haplotypes is less than the number required by an optimal Hipp solution. Due to its simplicity in formulation and complexity in computation, Hipp has attracted much attention. Hipp was ﬁrst computationally studied using integer linear programming (IP) by Gusﬁeld [17,18]. Many approaches to the problem have since been developed, for example those based on IP [2–4,20], depth-ﬁrst branch-and-bound (DFBnB or BnB for short) [39], and Boolean satisﬁability [14,28–30]. Additionally, one method has been developed that provides provably optimal solutions in polynomial time for a subset of instances and a heuristic for instances outside this subset [21]. It is important to mention that there may exist multiple optimal solutions to a Hipp problem and an algorithm simply returns an arbitrary optimal solution. It is also critical to emphasize that the true solution, which is the ground truth of the haplotype structure of a given population, may not be a computationally optimal solution, as shown in our previous study [6]. Furthermore, not all computationally optimal solutions in the formulation are biologically equivalent. That is to say that some solutions might closely emulate, or possibly exactly replicate, the true haplotypes that exist within the population, while other solutions may be error-prone. In other words, the mathematical formulation of Hipp captures only some of the biological aspects of haplotype inference (HI). Similar to other computational biology problems, such as RNA folding and multiple sequence alignment, the true solutions to HI are often in the set of optimal or near optimal solutions to the Hipp formulation [6]. Therefore, it is desirable to ﬁnd all optimal solutions, which is the subject of this paper. For convenience, we call the problem of ﬁnding all optimal solutions to a Hipp problem complete Hipp, or Chipp. The Chipp formulation and our novel Chipp algorithms have helped to gain some initial biological insight into haplotype structures in human population [6]. Note that since it requires to ﬁnd all optimal solutions, Chipp is computationally more diﬃcult than Hipp, which is N P -complete [27]. In order to develop effective Chipp algorithms, we consider several intrinsic haplotype features in this paper. These include

• • • • •

backbone haplotypes, which are haplotypes common to all Hipp solutions, backbone haplotype sites, which are haplotype sites common to all Hipp solutions, decomposability of a problem into sub-problems, fat genotypes, which are genotypes that reduce the solution size by 2, if they are omitted, equal columns, which are sites that appear exactly the same for all haplotypes.

We then propose and study three classes of Chipp algorithms by exploiting these intrinsic haplotype features. The three classes of algorithms are based on and extend the existing Hipp algorithms: Gusﬁeld’s IP [18]; Wang and Xu’s BnB [39]; and Lynce, Marques-Silva and Prestwich’s Sat [30]. We extend beyond these existing algorithms by expanding their capabilities to enumerate all solutions instead of a single arbitrary solution and introducing effective optimization techniques that take advantage of the intrinsic haplotype features. Finally, we investigate classes of algorithms which strategically hybridize the greatest strengths of the three approaches. Note that exploiting these intrinsic haplotype features can also be viewed in the concept of ﬁxed parameter tractability (FPT) (for an overview of FPT see [12,33]). The motivation of FPT is that even large instances of N P -hard problems can be easy, because they might contain structure that can be exploited. For example, Holder and Langley presented a method for exposing Hipp problem instances that can be optimally solved in polynomial time using a partial-ordering technique [21]. One special idea is FPT kernelization which reduces a hard instance by preprocessing to a smaller, equivalent problem kernel [16]. As the intrinsic features backbone haplotypes, backbone haplotype sites, decomposability, fat genotypes and equal columns lead to a reduction technique, which in a preprocessing step reduces the given Chipp instance to a smaller one, our approach is also a type of problem kernelization technique. The baseline IP and BnB algorithms for Chipp including the corresponding optimization techniques have already been introduced in our previous work [24]. The current work additionally contains a baseline Sat algorithm for Chipp and a detailed description of sophisticated optimization techniques for Sat. Furthermore, the algorithms are additionally presented by a pseudo-code. Finally, an extensive experimental analysis of these three algorithms is presented, and the test bed, including its biological signiﬁcance, is described in detail. The paper is organized as follows. Section 2 begins with a formal deﬁnition of the problems Hipp and Chipp. Then three well-known algorithms for Hipp are described, and it is shown how they can be extended to Chipp algorithms. Section 3 introduces features of Chipp and optimization techniques, which are able to signiﬁcantly improve the baseline Chipp algorithms presented in Section 2. Section 4 gives an experimental analysis of the behavior of the introduced three Chipp algorithms and the optimization techniques, as well as a description of the test bed. The paper closes with a summary and suggestions for future work in Section 5.

70

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

2. Complete haplotype inference by pure parsimony Let G represent a set of genotypes, where G = { g 1 , g 2 , . . . , gn } and g i = ( g i1 , g i2 · . . . , g im, ) ∈ {0, 1, 2}m for i = 1, 2, . . . , n, for n individuals with m single nucleotide polymorphism (SNP) sites. Haplotype inference is to ﬁnd a set of haplotypes, H = {h1 , h2 , . . . , h p } and h i = (h i1 , h i2 · . . . , h im ) ∈ {0, 1}m for i = 1, 2, . . . , p, such that all genotypes are explained, or covered by the haplotypes. If a site g i j has a value of 0 or 1, it can only be explained by two haplotype sites hkj and hqj which both have value 0 or value 1, respectively. These genotype sites are referred to as homozygous. A heterozygous genotype site g i j has a value of 2, and can only be explained by two haplotype sites hkj and hqj with values of 0 and 1 (in any order). Then a genotype g i is explained/covered by a pair of haplotypes hk and hq , if for each 1 ≤ j ≤ m, the site g i j is explained by the pair of sites hkj and hqj . Note that it is also possible that an explaining haplotype pair is a pair of the same haplotypes, e.g., the genotype (0, 0) can be explained only by the pair of haplotypes (0, 0) and (0, 0). For example, G = { g 1 = (1, 2), g 2 = (0, 0), g 3 = (2, 1), g 4 = (2, 2)}, will be explained by the haplotype set H = {h1 = (0, 0), h2 = (1, 0), h3 = (0, 1), h4 = (1, 1)}. Then the haplotypes h2 and h4 explain genotype g 1 and we call h2 and h4 an explaining haplotype pair for g 1 . We also say that h2 or h4 partly explain genotype g 1 . Note that genotype g 4 can be explained by h1 and h4 or by h2 and h3 . Given a set of genotypes, Hipp is to ﬁnd a minimum set of unique haplotypes that explains the genotypes. For simplicity and without loss of generality, we assume in this research that all individuals (genotypes) are unique. Notice that a Hipp problem may have multiple optimal solutions. This is evident by a trivial example of one genotype g 1 = (2, 2), which has two optimal solutions, H 1 = {(0, 0), (1, 1)} and H 2 = {(0, 1), (1, 0)}. Chipp is to ﬁnd all optimal solutions, i.e., all sets of minimal unique haplotypes covering all genotypes. In the following we brieﬂy describe three well-established computational paradigms for Hipp as baseline algorithms, which are based on IP, BnB and Sat, respectively. Then we present how they can be extended to solve Chipp. 2.1. Chipp algorithm based on integer linear programming In 2003, Gusﬁeld [18] developed the following exponential-size Integer Linear Program formulation of Hipp. Consider n genotypes g 1 , g 2 , . . . , gn , and let ki be the number of possible explaining haplotype pairs for genotype g i for i = 1, 2, . . . , n. It is evident that ki = max{2l−1 , 1} if g i contains l 2’s. Let H i be the set containing the haplotypes of these ki n pairs, and let i =1 H i = {h1 , h2 , . . . , hq }. Let x = (x1 , . . . , xq ) be a vector deﬁned as

xs =

1,

if haplotype h s appears in the Hipp solution,

0,

otherwise

for s = 1, 2, . . . , q. Furthermore, let

ωi j =

ω = (ωi j )1≤i≤n,1≤ j≤ki be a matrix representing haplotype pairs, deﬁned as

1,

if the j-th explaining haplotype pair for genotype g i is selected in the Hipp solution,

0,

otherwise

for i = 1, 2, . . . , n and j = 1, 2, . . . , ki . In addition, let f 1 (i , j ) and f 2 (i , j ), respectively, be the indices of the two haplotypes of the j-th explaining haplotype pair for the genotype g i for i = 1, 2, . . . , n and j = 1, 2, . . . , ki . Then Hipp can be represented by the following integer linear program:

min

q

xs subject to

(1)

s =1

⎧ k i ⎪ ω =1 ⎪ j =1 i j ⎪ ⎪ ⎪ ⎨x ≥ω

for i = 1, 2, . . . , n

for i = 1, 2, . . . , n, j = 1, 2, . . . , ki i, j f 1 (i , j ) ⎪ ⎪ x f 2 (i , j ) ≥ ωi j for i = 1, 2, . . . , n, j = 1, 2, . . . , ki ⎪ ⎪ ⎪ ⎩ xs , ωi , j ∈ {0, 1} for s = 1, 2, . . . , q, i = 1, 2, . . . , n, j = 1, 2, . . . , ki In the worst case, this IP needs an exponential number of variables and constraints. In order to solve Chipp using the IP representation in (1), we implicitly enumerate all Hipp solutions to the IP. The key lies in the idea of avoiding the Hipp solutions that have been found earlier in the process. In our method, we ﬁrst solve Hipp using the IP in (1), obtaining one Hipp solution. Let p represent the value of the objective function for this solution (we call p solution value, i.e., the minimal number of haplotypes). In other words, p unique haplotypes are the minimum number of haplotypes that can resolve this set of genotypes. This means that for this solution, there are exactly p entries of the haplotype index vector x having value 1; let i 1 , i 2 , . . . , i p be the indices of these haplotypes. In order to avoid the Hipp solution that we just found, we add to the IP in (1) the following inequality p s =1

x i s ≤ p − 1.

(2)

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

71

We then solve the newly expanded IP to ﬁnd the next Hipp solution, if any. Notice that (2) can lead to two possibilities. Either the new IP has an objective value larger than p, which means that no new Hipp solution exists; or the new IP has an objective value equal to p, which gives rise to another Hipp solution. We repeat this process of adding new inequalities in the form of (2) and solving the incrementally expanded and more constrained new IP to ﬁnd all Hipp solutions. This Chipp algorithm, called Chipp-IP, is described in Pseudo-code 1. Pseudo-code 1. (Chipp-IP) 1

Solve the IP (1) with objective value p and indices i 1 , i 2 , . . . , i p .

2

List the haplotypes h i 1 , . . . , h i p as one Hipp solution.

3

Add the inequality (2) to the IP.

4

Solve the new IP with objective value p new .

5

IF p new > p OR there is no feasible solution.

6

THEN TERMINATE.

7

ELSE List the haplotypes h i 1 , . . . , h i p as one Hipp solution, where i 1 , i 2 , . . . , i p are the new indices.

8

GOTO 3.

2.2. Chipp algorithm based on branch-and-bound In 2003, Wang and Xu [39] introduced an algorithm for Hipp based on BnB. The algorithm starts with a heuristic solution, where for each genotype the explaining haplotype pair is chosen based upon a greedy approach. The haplotype pair is chosen so that the corresponding haplotypes can partly explain the most genotypes. This provides the initial incumbent (or upper bound) for the BnB search. Now, the search implicitly considers all possible explaining haplotype pairs for each genotype. The best solution found is the Hipp solution to be returned. If during the search the node cost is equal to or exceeds the incumbent, the current explaining haplotype pair is discarded and the algorithm continues to the next possible explaining haplotype pair, if it exists, thus moving on to the next branch of the current search node. In our implementation of the above BnB algorithm, we ﬁrst sorted the genotypes in an increasing order of the numbers of 2’s possessed by the genotypes. In other words, we prefer genotypes with less heterozygous sites over ones with more heterozygous sites at nodes near the top of the search tree in order to minimize the branching factor of the search. For each genotype, we further order the explaining haplotype pairs in a decreasing order of the number of genotypes that the haplotypes can cover. We use a decreasing order, as haplotype pairs which can cover a large number of genotypes are expected to often occur in a Hipp solution. As in Section 2.1, ki is the number of possible haplotype pairs for genotype g i for i = 1, 2, . . . , n. This Hipp algorithm, called Hipp-BnB, is described in Pseudo-code 2. Pseudo-code 2. (Hipp-BnB) 1

p := solution value of a greedy algorithm for haplotype inference.

2

FOR i := 1, 2, . . . , n (i means the i-th genotype).

3

si := 1 (start with the ﬁrst explaining haplotype pair for the i-th genotype).

4

i := 1.

5 6 7

Let H be the set of explaining haplotypes for the current ﬁrst i genotypes. IF |H| ≥ p ∨ i = n. THEN Search the largest t ≤ i with st < kt .

8 9 10 11 12 13 14 15 16 17

IF No such t exists. THEN TERMINATE. ELSE i := t. si := si + 1. FOR l := i + 1, i + 2, . . . , n. sl := 1. IF |H| < p ∧ i = n. THEN p := |H|. ELSE i := i + 1. GOTO 5.

We can directly extend the BnB algorithm for Hipp to an algorithm for Chipp. In order to ﬁnd all Hipp solutions, pruning is applied only when the node cost strictly exceeds the incumbent. This allows the algorithm to explore a branch that may

72

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

lead to another Hipp solution. Furthermore, if we have found a better solution, all previous “optimal solutions” are discarded. This Chipp algorithm, called Chipp-BnB, is described in Pseudo-code 3. Pseudo-code 3. (Chipp-BnB) 1

p := solution value of a greedy algorithm for haplotype inference.

2

FOR i := 1, 2, . . . , n (i means the i-th genotype).

3 4

si := 1 (start with the ﬁrst explaining haplotype pair for the i-th genotype). i := 1.

5

Let H be the set of explaining haplotypes for the current ﬁrst i genotypes.

6

IF |H| > p ∨ i = n.

7

THEN Search the largest t ≤ i with st < kt .

8

IF No such t exists.

9

THEN TERMINATE. ELSE i := t.

10 11

si := si + 1.

12

FOR l := i + 1, i + 2, . . . , n. sl := 1.

13 14

IF |H| ≤ p ∧ i = n.

15

THEN IF |H| < p. THEN p := |H|.

16 17

Omit the current list of Hipp solutions. Add H as one Hipp solution.

18 19 20

ELSE i := i + 1. GOTO 5.

2.3. Chipp algorithm based on Boolean satisﬁability In recent years much progress has been made in the optimization of practical Boolean satisﬁability (Sat) solvers (see the Sat competition [50]). This has made Sat encodings for combinatorial problems highly attractive. For instance, such encodings were given for the Hamiltonian Cycle Problem [23,37], the Social Golfer Problem [13], and Sudoku [31]. Another such encoding was given in 2006 by Lynce and Marques-Silva, who created a (Sat) model for Hipp leading to a Hipp algorithm [28]. In their subsequent work [29] and in a common work with Prestwich [30] they signiﬁcantly extended and improved the basic model. In the following let “1” represent the Boolean value “True” and “0” the Boolean value “False”. Let n genotypes g 1 , g 2 , . . . , gn ∈ {0, 1, 2}m with g i = ( g i1 , g i2 , . . . , g im ) ∈ {0, 1, 2}m for i = 1, 2, . . . , n and a number p ∈ N with 1 ≤ p ≤ 2n be given, where p equals the cardinality of a possible Hipp solution. Then solving the following Sat model answers the question whether there are exactly p different haplotypes such that each genotype can be explained by the haplotypes. The model uses Boolean variables hlj ∈ {0, 1} for l = 1, 2, . . . , p, j = 1, 2, . . . , m, where hl = (hl1 , hl2 , . . . , hlm ) for l = 1, 2, . . . , p are the haplotypes in the Hipp solution to be identiﬁed. We refer to these values as haplotype variables. The model also employs further Boolean variables that are referred to as selector variables slia and slib for l = 1, 2, . . . , p, i = 1, 2, . . . , n for selecting haplotypes which explain genotype g i . More precisely, for i = 1, 2, . . . , n, g i is explained by the haplotypes hl1 and hl2 if sla i = 1 and slb i = 1. Consider a ﬁxed site g i j ∈ {0, 1, 2} of a genotype, where i = 1, 2, . . . , n and j = 1, 2, . . . , m. Then 1 2 we have three cases:

• g i j = 0. We use the following clauses:

(¬slia ∨ ¬hlj ),

(¬slib ∨ ¬hlj ) for l = 1, 2, . . . , p .

• g i j = 1. We use the following clauses:

(¬slia ∨ hlj ), • g i j = 2.

(¬slib ∨ hlj ) for l = 1, 2, . . . , p .

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

73

To ensure that the two haplotypes explaining genotype g i have opposing values, we introduce a new Boolean variable t i j ∈ {0, 1} such that site j of the haplotype selected by the a representative selector has the same value as t i j , and site j of the haplotype selected by the b representative selector has the complementary value of t i j . We use the following clauses:

(¬slia ∨ hlj ∨ ¬t i j ), (¬slib ∨ hlj ∨ t i j ),

(¬slia ∨ ¬hlj ∨ t i j ), (¬slib ∨ ¬hlj ∨ ¬t i j ) for l = 1, 2, . . . , p .

As for i = 1, 2, . . . , n each genotype g i is assigned exactly two (possibly equal) explaining haplotypes, we have the following constraints:

p

slia

l =1

=1 ,

p

slib

=1

l =1

These constraints can be easily transformed to Sat clauses by following the procedure developed in [30]. Now, the Sat model is created and solved for p = 1, then for p = 2 and so on, until the model has a feasible Sat solution for a ﬁxed p. This feasible Sat solution directly leads to a Hipp solution with haplotypes h1 , h2 , . . . , h p . This model is correct, but as pointed out in [30], it is not effective at all. Therefore, additional features were introduced in [30], in particular symmetry breaking and computation of lower bounds, to improve this model. Symmetry breaking means adding further clauses to a Sat model, by making use of symmetries of the Boolean variables, with purpose to prune the search space. In this model we have two types of symmetries. First, the ordering of the selector variables can be changed. This leads to the constraints of

sai ≤ sbi

for i = 1, 2, . . . , n,

where sai = (sa1i , sa2i , . . . , sapi ) and sbi = (sb1i , sb2i , . . . , sbpi ) and the “≤”-relation corresponds to a lexicographic ordering. Second, the ordering of the haplotype variables can be changed. This leads to the constraints of

h1 < h2 < . . . < h p , where the “<”-relation again corresponds to a lexicographic ordering. These constraints can be easily transformed to Sat clauses. Observe that if we have a lower bound r on the cardinality of haplotypes of a Hipp solution, the algorithm can start with this ﬁxed r instead with 1. In [30] the authors present different procedures to compute such lower bounds. We have implemented their most elementary and effective procedure which is based on the term of incompatibility of genotypes. Two genotypes g 1 and g 2 are considered to be incompatible, if they do not share any explaining haplotype, otherwise they are called compatible. g 1 and g 2 at one site are incompatible if g 1 has an entry 0 while g 2 has an entry 1, or vice versa. More generally, a set of genotypes G is called incompatible if every pair of different genotypes of G is incompatible. It is evident that for a set of genotypes G and an incompatible subset G ⊆ G of cardinality k a Hipp solution of G contains at least 2k − t haplotypes, where t is the number of genotypes in G with only homozygous sites. Therefore, it is reasonable to attempt to ﬁnd an incompatible subset with maximal cardinality. For this purpose, the set of genotypes can be modeled as a graph, where the genotypes correspond to vertices and two genotypes are connected by an edge if they are incompatible. In terms of graph theory, for a graph G = ( V , E ) with vertex set V and edge set E the aim is to ﬁnd a maximal clique, i.e., a subset of the vertices V ⊆ V so that every pair of two different vertices of V is connected by an edge and this clique cannot be extended by another vertex (note that this is not necessarily a maximum clique that contains a maximum number of vertices). A simple greedy algorithm can ﬁnd such a maximal clique. This algorithm starts with the vertex with maximal degree as a basis of the clique and then adds vertices to the clique as long as there are vertices that are connected to all vertices of the clique. The role of the lower bound in this context is twofold. The ﬁrst is that the lower bound reduces the number of Sat models that have to be created and solved. The second (and more important one) is that the Sat model can be signiﬁcantly simpliﬁed. This simpliﬁcation is based on the observation that each g ∈ G with only homozygous sites has an explaining haplotype h that equals g, and every additional g ∈ G has two explaining haplotypes h1 and h2 that equal g at each homozygous site of g. Thus, some haplotypes are explicit at some sites, i.e., the values of these haplotypes in the Sat model can be ﬁxed to 0 or 1 to simplify the Sat model. We can extend the Sat algorithm for Hipp to an algorithm for Chipp as follows. The key idea again is to ﬁnd a Hipp solution and then to create – step by step – new Hipp solutions by systematically ruling out all the Hipp solutions that have been discovered so far during the search, until all Hipp solutions have been found. More precisely, we ﬁrst solve the described Sat model to ﬁnd a Hipp solution with r haplotypes. As already explained, some of the haplotype variables of the model are explicit, i.e., the values hlj are ﬁxed before solving the model. A feasible Sat solution assigns Boolean variables to all non-explicit haplotype variables. Let hl1 , j 1 , hl2 , j 2 , . . . , hls , j s be the values of the s non-explicit haplotype variables, derived from the feasible Sat solution. In order to explicitly forbid this Hipp solution, we add the following blocking clause:

74

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

¬hl1 , j 1 ∨ ¬hl2 , j 2 ∨ . . . ∨ ¬hls , j s

(3)

As the haplotype variables are lexicographically ordered, the same Hipp solution with a different ordering is not possible. By adding this clause, the updated Sat model has another solution, giving rise to another Hipp solution, or is not satisﬁable meaning that no new Hipp solution exists. Repeating this process leads to a new Chipp solution. 3. Features of C HIPP and optimization techniques We now consider some important features of Chipp, and describe how they can be exploited to develop effective methods for solving this challenging problem. 3.1. Backbones The backbone of a combinatorial optimization problem refers to the set of variables that have common values across all optimal solutions to the problem. Backbones are intrinsic features of combinatorial optimization problems, and have been used to characterize many diﬃcult optimization problems, such as the Traveling Salesman Problem and maximum Satisﬁability (Max-Sat) [38,42,44] and have been exploited in algorithms for solving these well-studied problems [7,43,45]. We denote a haplotype as backbone haplotype, if it appears in every Hipp solution. One important consequence of using backbone haplotypes is that those genotypes that can be explained by two backbone haplotypes already have an explaining haplotype pair, and thus can be ﬁxed in solving the problem to simplify the Hipp or Chipp procedure. We call such genotypes backbone genotypes. Given a set of backbone haplotypes, the corresponding backbone genotypes can be found very quickly. Therefore, solving Chipp can be accelerated, if we can identify all backbone haplotypes. A special case of backbone haplotypes is when some genotypes have zero or one heterozygous SNP site, which can be easily identiﬁed. A genotype with no 2’s gives rise to one backbone haplotype, which is the same as the genotype. A genotype with only one 2 leads to two backbone haplotypes, which are equal to the genotype except for the site with the 2, where one backbone haplotype has entry 0 and the other has entry 1. We call such backbone haplotypes trivial backbone haplotypes. These trivial backbone haplotypes were implicitly considered in Wang and Xu’s BnB program [39]. Trivial backbone haplotypes can be easily computed and used in the IP, the BnB and the Sat approach. Note that in the Sat approach the trivial backbone haplotypes correspond to the haplotypes in the Sat model which are explicit at all sites. A more diﬃcult problem is to identify all backbone haplotypes. At ﬁrst sight, the problem seems to be as diﬃcult as ﬁnding all Hipp solutions, but it turns out that all backbone haplotypes can be identiﬁed without ﬁnding all Hipp solutions. Our idea to the problem is based on the fact that when a backbone haplotype is omitted, no optimal solution to a Hipp or Chipp problem can be found. Therefore, we ﬁrst ﬁnd a Hipp solution, and then iteratively remove, one at a time, the haplotypes in the solution, and repeatedly solve each of the resulting problems to determine if a new Hipp solution can be found. If this is the case, the considered haplotype is no backbone haplotype, otherwise it is a backbone haplotype. This procedure is summarized in Pseudo-code 4. Pseudo-code 4. (Identifying all backbone haplotypes for Hipp) 1

Solve Hipp with solution value p and haplotypes h i 1 , . . . , h i p .

2

FOR s := 1, 2, . . . , p.

3

Omit haplotype h i s and all corresponding explaining haplotype pairs from the original Hipp.

4

Solve the new Hipp with solution value p new .

5

IF p new > p OR there is no feasible solution.

6

THEN h i s is a backbone haplotype.

7

ELSE h i s is not a backbone haplotype.

In the worst case the complexity of this algorithm is p times the complexity of ﬁnding a Hipp solution, where p is the cardinality of the haplotypes of a Hipp solution. Note that Step 3 of Pseudo-code 4 can be easily included in the IP and the BnB algorithms for Hipp and may simplify the algorithms by decreasing the parameter ki of explaining haplotype pairs for some genotypes. Thus, this backbone feature is expected to work effectively for both approaches. On the other hand, for the Sat algorithm omitting a haplotype is more diﬃcult. There are several approaches to identify backbones by Sat formulas [25,32,46]. We use the following idea. As in the Sat model the p haplotypes are lexicographically ordered and each haplotype in this order could equal the ﬁxed haplotype. This means that for each of the p haplotypes we have to add a blocking clause like in (3). In other words, the model is not simpliﬁed. Because of this reason, we can replace backbone haplotypes by backbone haplotype sites, to be explained next.

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

75

A backbone haplotype site is a variable hlj for l = 1, 2, . . . , p, j = 1, 2, . . . , m which has the same value 0 or 1 in all Hipp solutions. Clearly, all explicit haplotype sites in the original Sat model are backbone haplotype sites. However, there may be additional ones, which can be found as follows. Assume that a haplotype site has a ﬁxed value, say 0, in the ﬁrst solution of the Sat model. Then a new Sat model is created, where this value is set to 1 and all other haplotypes sites are unchanged. If this new model does not lead to a Hipp solution, the considered site is a backbone haplotype site, otherwise it cannot be a backbone haplotype site. Thus, for each site it can be tested by an additional Sat model whether it is a backbone haplotype site or not. This feature can be used to reduce the number of non-explicit haplotype sites to accelerate the procedure for ﬁnding all Hipp solutions (see Section 2.3). Furthermore, backbone haplotype sites can be also computed during this procedure as follows. Set a non-explicit haplotype site to a ﬁxed value, say 0, and then the following Hipp solutions are only searched with this particular value 0. If all these Hipp solutions have been found, then the site can be set to 1 for the remainder of the search, i.e., the blocking clause contains one variable less. 3.2. Equal column technique One important technique in Hipp algorithms is the so-called equal column technique [39]. Assume that we have a Hipp solution for the genotype matrix G 1 = [ g 1 , g 2 , . . . , gk ], where for l = 1, 2, . . . , k, gl is the l-th column of G 1 , and we aim at ﬁnding a Hipp solution for the genotype matrix G 2 = [ g 1 , g 2 , . . . , gk , gk +1 ], where gk +1 = gl for some l ∈ {1, 2, . . . , k}. We can simply obtain a Hipp solution to G 2 by copying the l-th column of all haplotypes in the Hipp solution to the (k + 1)-th column, because the (k + 1)-th column of all genotypes can be explained in the same way as the l-th column of all genotypes. However, the equal column technique cannot be directly used to ﬁnd all Hipp solutions, since some Hipp solutions might be lost by this technique. This can be seen from a simple example of one genotype g 1 = (2, 2), which has two Hipp solutions, H 1 = {(0, 0), (1, 1)} and H 2 = {(0, 1), (1, 0)}. However, the equal column technique will only result in the ﬁrst Hipp solution, because only in this case the ﬁrst and the second columns of the Hipp solution are equal. This example also shows that the equal column technique cannot be used to ﬁnd all backbone haplotypes, as g 0 = (2) has two backbone haplotypes (0), (1), and g 1 = (2, 2) has no backbone haplotype. Therefore, special care must be taken when extending the equal column technique to ﬁnding all Hipp solutions. We proceed as follows. Let G 1 = [ g 1 , g 2 , . . . , gk ] be a genotype matrix with solution value p and G 2 = [ g 1 , g 2 , . . . , gk , gk +1 ] another genotype matrix, where gk +1 = gl for some l ∈ {1, 2, . . . , k}. The solution value of G 2 is p as well, and each Hipp solution for G 2 can be written as a Hipp solution for G 1 plus an additional last column. Therefore, given a Hipp solution H 1 to G 1 , we have to ﬁnd all Hipp solutions H 2 to G 2 which are equal to H 1 in the ﬁrst k columns. Using the equal column technique for Hipp, we can obtain one Hipp solution of G 2 , if we use the l-th column of H 1 as (k + 1)-th column of H 2 . However, there may exist additional Hipp solutions of G 2 . Each such Hipp solution has p haplotypes and trivially, each entry of the last column can be 0 or 1. Thus, there are a total of 2 p possible Hipp solutions for G 2 from which at least one must be a Hipp solution. Let H 0 be such a possible Hipp solution. Then H 0 is a Hipp solution, if and only if all genotypes can be explained by the haplotypes of H 0 . So we can make use of this characteristic. Furthermore, we can reduce the number of all possible Hipp solutions to be tested, which is 2 p , by the following idea. Each haplotype of a Hipp solution must be used to partly explain at least one genotype. We test this characteristic for all 2p haplotypes which can possibly appear in a Hipp solution. If this characteristic is not fulﬁlled for a particular haplotype, we do not have to consider this haplotype and all possible Hipp solutions that contain this haplotype. The extended equal column technique for Chipp may still require up to 2 p steps for each equal column in the worst case. Nevertheless, as we will observe in Section 4, it is rather effective in practice. Furthermore, columns of a genotype matrix which contain only 0’s or only 1’s can also be omitted for solving Hipp and Chipp because of the following reason. If a column gl of the genotype matrix contains only 0’s or 1’s, then the l-th column of the haplotype matrix of each Hipp solution also contains only 0’s or 1’s, respectively. We denote this technique as ﬁxed column technique. It is clear that the equal column technique as well as the ﬁxed column technique can be combined with all three approaches IP, BnB and Sat. 3.3. Decomposability The concept of incompatibility of single genotypes introduced in Section 2.3 can be generalized to two sets of genotypes E = {e 1 , e 2 , . . . , e s } and F = { f 1 , f 2 , . . . , f t }. If each genotype e ∈ E is incompatible with each genotype f ∈ F , E is incompatible with F as well. A Hipp problem instance can be decomposed if it contains l > 1 sets of genotypes, where each two sets are incompatible and each sub-problem consists of exactly one of these sets. It is evident that it is suﬃcient to solve each of the l subproblems in order to solve the original problem, where a solution of the original problem is a union of the solutions to the sub-problems. This method can be simply extended to Chipp. Again a problem instance is decomposable if it contains l > 1 incompatible sets of genotypes, each of which forms a sub-problem. Assume that each sub-problem has q i solutions for i = 1, 2, . . . , l. The

76

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

l

original problem will have i =1 qi solutions, corresponding to all combinations of the solutions to the sub-problems. A decomposable Hipp or Chipp problem can be solved with a signiﬁcantly reduced complexity due to the potential exponential growth in computation time as a function of the size of the instance. 3.4. Omitting explaining haplotype pairs R TIP: In order to improve his IP approach, Gusﬁeld [18] used a reduction formulation, called Rtip. Rtip removes from the problem all explaining haplotype pairs in which the two corresponding haplotypes partly explain only one genotype. The genotypes that can only be explained by such haplotype pairs can be explained in the Hipp solution in a number of different ways without affecting the solution value. Wang and Xu [39] used a very similar idea in their BnB program. In order to extend this idea to Chipp, we ﬁrst introduce the notion of fat genotypes. Fat genotypes: Let G be a set of genotypes with Hipp solution value p, and g a genotype in G such that removing g from G results in a solution to the new Hipp instance with solution value p − 2. We call such a genotype fat genotype. A special case of a fat genotype is a genotype that is incompatible with any other genotype and has at least one entry 2. For example, consider g 1 = (0, 2, 0), g 2 = (2, 1, 0), g 3 = (1, 2, 1), where g 3 is incompatible with g 1 and g 2 . Then for G = {( g 1 , g 2 )} the (only) Hipp solution is {(0, 0, 0), (0, 1, 0), (1, 1, 0)} with solution value 3, whereas for G = ( g 1 , g 2 , g 3 ) the (only) Hipp solution is {(0, 0, 0), (0, 1, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)} with solution value 5. Thus, g 3 is a fat genotype in G . Note that there are cases of fat genotypes that are compatible with other genotypes. For example, consider g 1 = (1, 2, 2), g 2 = (2, 1, 1), g 3 = (2, 0, 1), where g 3 is compatible with g 1 . Then for G = ( g 1 , g 2 ) the (only) Hipp solution is {(1, 0, 0), (0, 1, 1), (1, 1, 1)} with solution value 3, whereas for G = ( g 1 , g 2 , g 3 ) one Hipp solution is {(1, 0, 0), (0, 1, 1), (1, 1, 1), (0, 0, 1), (1, 0, 1)} with solution value 5. Thus, g 3 is a fat genotype. It is evident that fat genotypes are the only genotypes for which omitting a corresponding explaining haplotype pair by Rtip could cause some Hipp solutions to be lost. The fatness of all genotypes can be easily tested. Our task is to ﬁnd all Hipp solutions that might be lost by Rtip. For this purpose, assume that the explaining haplotype pair (h1 , h2 ) is omitted by Rtip, but nevertheless is contained in at least one Hipp solution. Furthermore, let H 0 be a Hipp solution found by Rtip. By the deﬁnition of Rtip, h1 and h2 partly explain only one genotype g. The only possibility to recover a lost Hipp solution that includes h1 and h2 is to replace an explaining haplotype pair for g, (h3 , h4 ) ∈ H 0 , by (h1 , h2 ). Let H 0new be H 0 after the replacement. Then H 0new is a new Hipp solution if and only if all genotypes can be explained by haplotype pairs in H 0new . This idea leads to an algorithm that ﬁnds all Hipp solutions after using Rtip as described by Pseudo-code 5. Pseudo-code 5. (Repairing Chipp after Rtip) 1

Identify all fat genotypes.

2

Solve Chipp using Rtip.

3

Let H be the set of Hipp solutions found by Rtip.

4

WHILE (H = ∅)

5

Remove one Hipp solution H 0 from H.

6

FOR Each fat genotype g.

7 8 9 10 11

FOR Each explaining haplotype pair (h1 , h2 ) for g not in H 0 . FOR Each explaining haplotype pair (h3 , h4 ) for g in H 0 . Replace in H 0 (h3 , h4 ) by (h1 , h2 ) and receive H 0new .

IF All genotypes can be explained by H 0new . THEN Add H 0new to H.

For saving the (possibly expensive) computation time required for the identiﬁcation of all fat genotypes, we can try all possible replacements not only for the fat genotypes, but for all genotypes. As mentioned, for all non-fat genotypes no new Hipp solutions will be found. On the other hand, the time saved by omitting the identiﬁcation of all fat genotypes can be lost by many more tests for possible new Hipp solutions. Similar to the backbone feature (see Section 3.1), Rtip is also used in the Sat algorithm. However, it needs additional blocking clauses. Overlapping pairs: Wang and Xu [39] considered a case of two genotypes g 1 and g 2 , where g 1 only has explaining haplotype pairs (h1 , h2 ) and (h4 , h5 ), and g 2 only has explaining haplotype pairs (h2 , h3 ) and (h5 , h6 ). In the considered case, all haplotypes h1 , h2 , h3 , h4 , h5 , h6 do not partly explain any other genotype. For this case when solving Hipp, we can use (h1 , h2 ) as explaining haplotype for genotype g 1 and (h2 , h3 ) as explaining haplotype for genotype g 2 , and omit the pairs (h4 , h5 ) and (h5 , h6 ).

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

77

Table 1 Known data sets used for comparisons, including name of data set, nucleotide range in reference sequence, number of genotypes in set (n), number of sites in each genotype (m), number of genotypes that are ambiguous (number of genotypes that have at least two heterozygous sites), percentage of all sites that are heterozygous, and recombination rate. Note that recombination rate is an indicator of breakdown of a haplotype structure. Consequently, high recombination rates tend to coincide with higher haplotype diversity [15]. Data

Nucleotide range

n

m

# Amb.

% Heter.

Recombination

K-A K-B K-C K-D K-E K-F K-G

17874–21388 667–1464 32107–34389 12867–13729 32107–35800 8043–9256 33117–38365

80 39 39 39 39 39 39

9 5 17 8 26 22 47

47 13 27 15 28 26 33

21.4% 22.6% 20.5% 16.7% 19.0% 21.7% 14.0%

None None None Low Low High High

When extending this case to Chipp, we only have to replace in each Hipp solution the haplotypes h1 , h2 , h3 by the haplotypes h4 , h5 , h6 to ﬁnd a new Hipp solution. 4. Experimental results 4.1. Test bed Our test bed is composed of human biological data of two types. The ﬁrst is genotype data from the International HapMap Project [47], and the second consists of known haplotype pairs that were experimentally identiﬁed by Orzack et al. [34] and Andrés et al. [1]. The HapMap data is genome-wide and drawn from four populations:

• • • •

Utah, USA residents with ancestry from northern and western Europe (CEU) Yoruba in Ibadan, Nigeria (YRI) Japanese in Tokyo, Japan (JPT) Han Chinese in Beijing, China (CHB)

Genetic data was gathered from 30 parent-offspring trios from each of the YRI and CEU populations and from 45 unrelated individuals from each of the JPT and CHB populations. We removed the offspring samples from the trio data to produce 60 unrelated individuals for the YRI and CEU populations. We combined the JPT and CHB data, resulting in 90 unrelated individuals. From the resultant three populations, we randomly drew a series of m consecutive SNPs from n individuals, discarding duplicate genotypes, from each of 22 autosomal chromosomes. We refer to these data as random instances. The second type of data consists of known haplotype pairs that were experimentally derived by Orzack et al. [34] and Andrés et al. [1]. The actual biological haplotypes were directly identiﬁed and did not need to be computationally inferred. The Orzack et al. data is composed of nine SNPs drawn from the apolipoprotein E (ApoE) locus [34]. There were 80 individuals in the study. The subjects were unrelated and 18 were classiﬁed as Asians, 19 as Blacks, and 43 as Caucasians. This data set is summarized as K-A in Table 1. The Andrés et al. [1] data consists of haplotype pairs that were identiﬁed from hybrid somatic cell lines that were produced by electrofusion of a mouse embryonic ﬁbroblast host cell line with human lymphoblastoid cell lines. The data was generated from a 48 kb region of chromosome 19 containing KLK13 and KLK14 genes. The sample included 20 African Americans from Atlanta, Georgia, and 19 European Americans from Rochester, Minnesota. The Hipp model assumes complete data, so six sets of contiguous complete data were used for our tests. These six sets are referred to as K-B through K-G and their summaries are tabulated in Table 1. Note that K-E consists of 26 SNPs, in which the ﬁrst 17 have no recombination and are used for data set K-C. In all, the following 73 test instances were analyzed in our study:

• • • • • • • • • •

22 CEU random instances (C-1 to C-22) with 20 (20 non-equal) genotypes and 20 sites, 22 YRI random instances (Y-1 to Y-22) with 20 (20 non-equal) genotypes and 20 sites, 22 JPT+CHB random instances (J-1 to J-22) with 20 (20 non-equal) genotypes and 20 sites, known instance K-A with 80 (30 non-equal) genotypes and 9 sites, known instance K-B with 39 (10 non-equal) genotypes and 5 sites, known instance K-C with 39 (18 non-equal) genotypes and 17 sites, known instance K-D with 39 (10 non-equal) genotypes and 8 sites, known instance K-E with 39 (23 non-equal) genotypes and 26 sites, known instance K-F with 39 (26 non-equal) genotypes and 22 sites, known instance K-G with 39 (35 non-equal) genotypes and 47 sites.

78

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

4.2. Experimental analysis We have implemented all different combinations of Hipp and Chipp algorithms in C++. All our experiments were carried out on a SunOS system with an 386 CPU with 2613 MHz clock rate and 32192 MB of RAM, where our programs ran on 12 CPUs of a single processor. We used Cplex, version 12 [48] as IP solver and a Sat solver implemented by Eén and Sörensson, called MiniSat [9,49]. Each individual experiment was limited to 6 hours of CPU time. If a program ran longer than the allocated 6 hours or if it stopped, because it needed more memory than available, we denote “+∞” for the execution time in the result tables. All reported running times are in seconds. In Section 2 we introduced three baseline Chipp algorithms, one based on IP, one based on BnB and one based on Sat. Furthermore, for BnB we consider a hybrid algorithm (called HY-BnB), in which the heuristic for computing the initial upper bound is replaced by the Hipp-IP algorithm. Analogously, for Sat we consider a hybrid algorithm (called HY-Sat), in which the heuristic for computing the initial lower bound is replaced by the Hipp-IP algorithm. In other words, the initial upper and lower bound, respectively, is the cost of the Hipp solution, which can signiﬁcantly reduce the search space to be explored and computation time needed. The optimization techniques that we discussed in Section 3 can be combined in different ways with these three algorithms, and the possible combinations give rise to many different algorithms. These optimization techniques or algorithmic components include

• equal column technique (E) or not, • trivial backbones (T), all backbones (A) or no backbones (note that for Sat, (A) considers backbone haplotype sites instead of backbone haplotypes, see the last paragraph of Section 3.1),

• Rtip with computation of fat genotypes (F), Rtip with no computation of fat genotypes (R) or no Rtip. In total we have 90 = 5 × 2 × 3 × 3 Chipp algorithms to analyze, where each algorithm is written in the following way: W-XYZ, where W indicates the type of algorithm, IP, BnB, Sat, hybrid BnB (HY-BnB), or hybrid Sat (HY-Sat); and X, Y, and Z represent the use of equal columns, backbones, and Rtip, respectively, using the encodings above. For example, IP-EAF is used to indicate an algorithm based on IP using the equal column technique, all backbones and Rtip with fat genotypes. As the Chipp-HY algorithm contains a Hipp algorithm and ﬁnding a single Hipp solution is required by the identiﬁcation of all backbone haplotypes and by the identiﬁcation of fat genotypes, an effective Hipp algorithm is an important component of many variants of the Chipp algorithms. Experiments (not described here due to space limit) show that for the three Hipp versions, using all of the three features equal column technique, Rtip, and trivial backbones leads to the greatest eﬃciency, where the most important feature is the equal column technique. This is not surprising, as each of these features substantially reduces the problem sizes. The equal column technique can help to remove some sites that carry redundant information, and the techniques of trivial backbones and Rtip can be used to omit some highly constrained explaining haplotype pairs from the core computation of haplotype inference. Note that – in contrast to the Chipp algorithms – applying these features requires negligible additional computation. The mentioned experiments also show that the Hipp algorithm based on IP is more effective than those based on BnB and on Sat. As a consequence of this comparison, we used this Hipp-IP algorithm as a sub-routine in all Chipp algorithms, where a Hipp algorithm is needed, except one special case. The exception is the algorithm for ﬁnding all backbone haplotypes, where we have to omit the equal column technique. As mentioned in Section 3.2, this technique cannot be used for ﬁnding all backbone haplotypes. Finally, note that with the purpose to simplify the experiments, we decided to use the technique of decomposability (Section 3.3) and the case of overlapping pairs of Section 3.4 in all versions except the versions with no Rtip (Z = ∅). To reduce the overall computation for testing all 90 versions and all 73 instances (the 66 random and the 7 known instances), we tested 8 instances that required relatively short computation time in the ﬁrst-stage analysis. The corresponding results can be found in Tables 2–6. For each instance the best version is given in bold face. We observe that the hybrid versions of BnB and Sat are superior to the corresponding original versions. Furthermore, each single optimization technique may lead to signiﬁcant improvements, but this strongly depends on the considered instance. However, except for a few cases, most top performers for IP and HY-BnB use the all backbone technique without Rtip, or with Rtip with the computation of fat genotypes, i.e., the versions with options A, AF, EA, EAF. For HY-Sat the top performers are mostly A and EA, as here Rtip needs additional blocking clauses and so complicates the Sat model. Therefore, we have identiﬁed one top contender for each of the three paradigms, namely IP-A, HY-BnB-EAF and HY-Sat-A. In order to ﬁnd the overall champion, we further tested these top contenders on the 73 instances. Additionally, we ran the baseline algorithms, with no optimization techniques at all, for these instances. Note that there are several very easy instances where all six tested algorithms are rather fast (e.g., for the CEU instance C-13 and the known instance K-D all six tested algorithms required less than a second of computation time). For many such instances we observed that the optimization techniques did not have any noticeable effect, but required short overhead time leading to slightly larger overall computation times. As we are not primarily interested in cases where the baseline algorithms are generally eﬃcient, we focused on 31 of the 73 instances for which at least one of the baseline algorithms required more than 1 hour and present those results in Table 7. As before, the fastest version for each instance is given in bold face.

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

79

Table 2 Comparison of the execution times in seconds of all 18 Chipp-IP algorithms for 8 example instances.

IP IP-T IP-A IP-F IP-TF IP-AF IP-R IP-TR IP-AR IP-E IP-ET IP-EA IP-EF IP-ETF IP-EAF IP-ER IP-ETR IP-EAR

C-12

C-22

Y-12

Y-22

J-3

J-15

K-E

K-F

80.92 81.59 51.32 78.50 77.61 42.31 1116.85 1117.15 1144.15 193.94 196.44 182.12 196.35 195.84 186.50 1233.32 1235.03 1287.95

48.31 49.03 20.05 50.46 50.26 24.38 63.48 63.31 37.58 72.48 73.76 72.44 75.16 75.29 74.25 86.92 88.04 87.20

50.19 49.29 28.57 51.02 55.07 26.79 51.34 54.53 27.03 55.91 55.41 29.43 49.65 50.31 32.05 49.81 50.54 32.48

5.53 5.56 3.66 6.19 6.01 3.96 29.29 29.27 28.94 5.73 5.56 3.76 6.41 6.11 4.14 29.54 29.48 29.14

10.71 10.65 7.44 11.51 11.81 7.64 14.49 14.89 10.51 61.23 61.81 57.62 62.43 61.76 62.76 64.10 64.07 66.04

14.06 3.44 1.74 8.81 4.64 3.91 10.87 6.70 4.53 2.44 2.47 2.44 2.74 2.76 2.89 4.79 4.98 5.06

183.71 187.94 130.70 189.29 194.61 136.80 2489.49 2504.73 2592.00 131.87 132.81 110.01 141.25 132.04 114.37 2440.90 2466.66 2576.77

98.02 60.10 45.18 99.50 59.00 46.16 2883.67 3041.18 3209.77 199.37 189.45 182.74 198.15 192.46 189.79 2990.88 3182.11 3338.78

Table 3 Comparison of the execution times in seconds of all 18 Chipp-BnB algorithms for 8 example instances. C-12 BnB BnB-T BnB-A BnB-F BnB-TF BnB-AF BnB-R BnB-TR BnB-AR BnB-E BnB-ET BnB-EA BnB-EF BnB-ETF BnB-EAF BnB-ER BnB-ETR BnB-EAR

+∞ +∞ 2017.52

+∞ +∞ 4764.55

+∞ +∞ 1654.19 736.74 749.75 266.01 4737.21 4760.58 4542.26 1538.29 1589.27 1454.78

C-22

Y-12

957.27 645.30 17.44 918.55 628.20 33.73 921.59 630.49 30.33 56.31 51.44 53.16 70.98 68.07 68.91 66.95 64.81 65.99

176.92 182.36 5.83 465.94 470.28 300.18 180.91 180.60 6.52 34.92 23.26 10.97 327.68 319.34 301.22 35.25 24.58 11.94

Y-22 9625.94 9591.72 956.16

+∞ +∞ +∞ 10494.57 10582.53 952.45 4863.05 4869.69 943.83

+∞ +∞ +∞ 4134.24 4167.32 949.54

J-3

J-15

K-E

83.45 48.07 4.46 73.48 45.64 22.00 59.27 31.60 7.70 60.26 48.06 55.29 75.71 65.54 72.08 61.98 51.21 57.07

95.23 30.54 1.19 288.80 112.41 3.52 291.57 114.81 4.25 2.33 2.08 2.16 2.55 2.22 2.53 4.74 4.47 4.83

+∞ +∞ 6966.70

+∞ +∞ +∞ +∞ +∞ 9014.59

+∞ +∞ 4826.78

+∞ +∞ +∞ +∞ +∞ 7372.47

K-F

+∞ +∞ 1055.50

+∞ +∞ +∞ +∞ +∞ 3783.42

+∞ +∞ 562.70

+∞ +∞ +∞ 11328.61 7356.36 3756.66

Table 4 Comparison of the execution times in seconds of all 18 Chipp-HY-BnB algorithms for 8 example instances.

HY-BnB HY-BnB-T HY-BnB-A HY-BnB-F HY-BnB-TF HY-BnB-AF HY-BnB-R HY-BnB-TR HY-BnB-AR HY-BnB-E HY-BnB-ET HY-BnB-EA HY-BnB-EF HY-BnB-ETF HY-BnB-EAF HY-BnB-ER HY-BnB-ETR HY-BnB-EAR

C-12

C-22

Y-12

Y-22

J-3

J-15

K-E

+∞ +∞

939.59 622.25 16.89 907.63 609.94 16.03 890.95 601.76 29.71 56.27 51.39 52.49 71.62 67.60 51.59 67.37 64.76 65.50

43.69 44.17 0.66 331.40 333.45 1.03 44.26 44.79 1.41 12.67 10.10 5.53 302.70 300.08 5.85 13.46 10.74 6.24

6119.92 6172.23 35.99

68.88 35.12 3.59 72.69 45.67 3.63 58.88 32.64 6.79 58.72 48.16 53.40 76.32 67.17 54.05 61.23 51.27 57.28

85.12 27.94 1.19 33.93 16.68 3.56 35.96 18.78 4.30 2.33 2.09 2.12 2.64 2.35 2.61 4.73 4.49 4.83

+∞ +∞

1923.28

+∞ +∞ 373.84

+∞ +∞ 1563.92 343.47 350.05 166.04 4538.28 4541.07 167.62 1307.45 1362.17 1357.87

+∞ +∞ 3.80 2189.55 2176.55 30.59 3732.93 3737.53 18.99

+∞ +∞ 3.41 975.39 956.54 30.27

2401.28

+∞ +∞ 1962.29

+∞ +∞ 4439.08 10732.39 4610.74 248.69

K-F

+∞ 18181.66 656.24

+∞ +∞ 168.89 11061.58 8714.23 3385.00 1189.73 861.59 160.32

+∞ +∞

+∞ +∞

255.57 13020.06 6939.76 2787.58

148.76 3523.25 3719.50 3364.07

80

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

Table 5 Comparison of the execution times in seconds of all 18 Chipp-Sat algorithms for 8 example instances.

Sat Sat-T Sat-A Sat-F Sat-TF Sat-AF Sat-R Sat-TR Sat-AR Sat-E Sat-ET Sat-EA Sat-EF Sat-ETF Sat-EAF Sat-ER Sat-ETR Sat-EAR

C-12

C-22

Y-12

Y-22

J-3

J-15

K-E

K-F

1400.89 1408.33 347.19 5667.67 5665.15 4597.58 2527.75 2532.29 1478.34 576.89 574.58 495.99 4845.90 4866.13 4750.98 1702.85 1707.50 1631.85

4791.52 4702.45 104.58 4820.09 4725.36 123.36 4805.42 4716.21 117.60 1716.77 1823.72 119.00 1741.99 1844.73 136.83 1730.03 1837.48 131.57

3537.26 3856.16 74.79 3836.22 4155.97 380.35 3544.32 3862.97 78.72 1194.09 1198.28 60.57 1494.64 1494.56 361.74 1190.79 1200.68 56.76

6509.59 6552.62 4101.14

576.87 554.79 13.96 523.50 592.22 33.28 580.70 558.51 17.69 317.89 315.78 59.78 366.97 332.55 78.60 321.29 320.13 63.48

0.48 0.34 1.01 0.94 0.79 1.51 3.13 3.01 3.77 2.02 1.97 2.36 2.47 2.47 2.72 4.75 4.69 4.98

+∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞

4674.68 4705.74 4102.93

+∞ +∞ +∞ 6495.59 6522.02 4173.55 6053.70 6002.73 4123.44

+∞ +∞ +∞ 6059.09 6002.25 4178.65

+∞ +∞ +∞ 7840.55 7872.79 7240.89 4623.16 4600.90 4229.20

+∞ +∞ +∞ 7758.67 7752.96 7389.96

Table 6 Comparison of the execution times in seconds of all 18 Chipp-HY-Sat algorithms for 8 example instances.

HY-Sat HY-Sat-T HY-Sat-A HY-Sat-F HY-Sat-TF HY-Sat-AF HY-Sat-R HY-Sat-TR HY-Sat-AR HY-Sat-E HY-Sat-ET HY-Sat-EA HY-Sat-EF HY-Sat-ETF HY-Sat-EAF HY-Sat-ER HY-Sat-ETR HY-Sat-EAR

C-12

C-22

Y-12

Y-22

J-3

J-15

K-E

1076.79 1083.61 26.56 5356.85 5394.73 4313.90 2201.95 2209.51 1153.87 252.74 254.48 181.19 4524.95 4525.02 4438.85 1378.87 1379.04 1306.67

4796.79 4708.57 104.31 4827.25 4740.47 127.21 4807.79 4716.66 116.79 1719.38 1837.86 118.27 1746.85 1841.86 136.09 1748.95 1843.72 131.07

3520.00 3839.12 59.62 3811.96 4131.05 354.82 3521.75 3841.49 61.74 1174.69 1178.24 41.42 1477.82 1472.26 333.53 1175.83 1181.24 43.87

2295.87 2356.19 17.40

576.03 554.36 13.06 595.23 591.01 32.44 579.85 557.37 16.64 317.52 315.25 57.88 366.72 333.23 77.72 320.58 318.90 61.51

0.56 0.32 1.02 1.05 0.95 1.77 3.15 3.06 3.78 2.08 2.02 2.34 2.65 2.58 2.92 4.68 4.67 4.96

+∞ +∞

+∞ +∞ +∞ 2322.46 2382.44 41.87 1949.11 1897.04 15.57

+∞ +∞ +∞ 1973.11 1924.35 40.10

6455.40

+∞ +∞ +∞ +∞ +∞ 8847.98

+∞ +∞ 2686.11

+∞ +∞ +∞ +∞ +∞ 5068.63

K-F 602.24 642.40 28.52

+∞ +∞ +∞ 3759.65 3797.25 3187.27 550.14 528.49 158.70

+∞ +∞ +∞ 3701.60 3684.86 3320.62

The results show that for the 93 = 3 × 31 comparisons between baseline and corresponding optimized algorithm, the baseline algorithm was faster only 8 times – once for the IP version and 7 times for the Sat version. On most problem instances the optimized algorithms ran much faster, often with orders of magnitude, demonstrating the beneﬁts of the optimization techniques. When comparing the baseline algorithms, the IP version is generally most effective, although Sat is overall leading for seven instances. Surprisingly, although BnB is clearly the worst baseline algorithm, the corresponding optimized algorithm HY-BnB-EAF using the techniques of equal columns, all backbones, Rtip with the computation of fat genotypes is the champion for 10 of the 31 instances tested, i.e., we can consider this version as the overall champion of all Chipp algorithms. 5. Summary and future work In summary, we made three major contributions in this paper. First, we introduced the problem Chipp to expand the capability of haplotype inference by pure parsimony for ﬁnding all optimal solutions. The extensive experiments in [6] showed that a Chipp problem can have a large number of optimal solutions and the ﬁrst optimal solution returned by an algorithm may not necessarily be the true solution. Our results on seven known problem instances also showed that the true solutions to four of these problems are indeed among the optimal solutions. All these results support to ﬁnd all optimal solutions. Second, we studied many intrinsic haplotype features, some of which were studied in earlier research on Hipp, particularly by Gusﬁeld [18] and Wang and Xu [39]. However, strategies to exploit these features cannot be directly applied to Chipp. Consequently, we formulated methods to recapture solutions lost by these strategies, and we introduced several FPT kernelization techniques, namely the concepts of backbone haplotypes, decomposability, and fat genotypes.

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

81

Table 7 Comparison of the execution times in seconds of the original approaches and selected enhanced versions for the most diﬃcult random 20 × 20 instances drawn from the CEU, YRI, JPT + CHB HapMap data and from known instances. IP C-1 C-2 C-6 C-7 C-9 C-10 C-12 C-14 C-17 C-19 C-21 C-22 Y-4 Y-5 Y-9 Y-10 Y-11 Y-14 Y-15 Y-17 Y-18 Y-19 Y-22

152.01 397.39 3103.87 2.94 23.48 60.86 80.51 105.73

+∞ 18901.72 1179.21 48.02 2354.12 12.81 40.38 61.75

+∞ 704.23 732.25

+∞ 196.10 5.79 5.42

BnB

+∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ +∞ 949.33

+∞ 5404.75 3814.33

+∞ +∞ +∞ +∞ +∞ +∞ +∞ 9620.59

J-8 J-13 J-14 J-17 J-18

143.94 11.68 45.40 177.84 752.01

+∞

K-E K-F K-G

183.58 97.61

+∞ +∞ +∞

+∞

5907.30

+∞ +∞ +∞

Sat 2095.83 9146.11

+∞ 19.73 0.40 14.33 1075.13 46.89

+∞ +∞

IP-A 100.16 152.09 2902.64 1.22 9.08 13.31 50.86 53.40

HY-BnB-EAF 25.50 36.83 808.81 23.94 1.28 6.61 166.09 24.48

+∞

+∞

1.61 4805.14

13982.95 709.32 20.37

13906.56 5.23 51.69

0.33 0.30 0.96 63.51

198.10 5.92 5.20 13.29

0.68 1.23 1.87 15.23 106.40 210.52 157.12

+∞ +∞ +∞ +∞

+∞ 470.58 409.27

+∞

+∞

HY-Sat-A 51.04 117.57

+∞ 4.90 2.94 8.34 29.17 6.68

+∞ +∞ 4.23 108.69 3.39 4.51 14.14 23.79

+∞ 348.25

+∞ +∞

6267.97 10.04 2294.70

98.03 4.28 3.58

134.14 4.33 3.52

82.93 3.27 18.38

3.52 1.64 118.39 6445.54 17163.86

32.52 4.44 35.93 113.01 769.98

10.42 1.96 29.15 3.75 4184.31

7.50 2.85 17.37 82.94 1007.93

+∞

130.27 45.34

255.43 149.36

6492.29 34.31

607.05

+∞

+∞

+∞

+∞

Third, we systematically studied three approaches to Chipp, one based on integer linear programming [18], another based on depth-ﬁrst branch-and-bound [39], a third one based on Boolean satisﬁability [30], and two hybrids integrating some of these three versions. In our algorithms, we explicitly exploited the intrinsic haplotype features that we studied. In our experiments, we analyzed the possible interactions of different problem features and the three approaches. These studies provided the best version of these three general problem solving paradigms, namely IP-A, and the best hybrid versions that combine the favorable features of integer linear programming, depth-ﬁrst branch-and-bound and Boolean satisﬁability, namely HY-BnB-EAF and HY-Sat-A. In closing, although it is expected that the number of unique haplotypes in a population is relatively small [5,6,18,34], strict adherence to the Pure Parsimony objective might not always be ideal. However, when Pure Parsimony is employed, it is important that all solutions are enumerated rather than just relying upon the single solution that might happen to be the ﬁrst identiﬁed, in order to properly evaluate alternative solutions. Regarding future work, we suggest to develop methods for choosing the haplotypes that are most likely to be true from the set of all the Hipp solutions. For example, if they span a coding region, the sequences could be checked to be certain they correctly encode amino acid sequences. Also, some haplotypes might be deemed as not likely if they include unexpected sequences in speciﬁc noncoding regions, such as promoter regions, 3’-UTRs, and intronic regions. On the ﬂipside, some haplotypes might be of particular interest if they include motifs [40] or other sequences that are associated with biological functions that are expected for the given data. Furthermore, the identiﬁcation of backbone haplotypes and fat genotypes may provide additional biological clues. Due to the fact that they appear in every optimal solution, backbone haplotypes provide a sense of higher reliability. On the other hand, fat genotypes might represent recent gene ﬂow and those outliers might be omitted from subsequent analyses if the individuals are assumed to represent a single homogeneous population. Finally, in general we expect that the concepts of backbone haplotypes, decomposability, and fat genotypes may be applicable to other challenging enumeration problems of scientiﬁc interest. Acknowledgements The research was supported in part by an Olin Fellowship to SC, an Alzheimer’s Association grant, two NSF grants (IIS-0535257, DBI-0743797) to WZ and also by NIH grant R01GM100364 to WZ and SC.

82

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

References [1] A.M. Andrés, A.G. Clark, E. Boerwinkle, C.F. Sing, J.E. Hixson, Understanding the accuracy of statistical haplotype inference with sequence data of known phase, Genet. Epidemiol. 31 (7) (2007) 659–671. [2] P. Bertolazzi, A. Godi, M. Labbé, L. Tininini, Solving haplotyping inference parsimony problem using a new basic polynomial formulation, Comput. Math. Appl. 55 (5) (2008) 900–911. [3] D.G. Brown, I.M. Harrower, Integer programming approaches to haplotype inference by pure parsimony, IEEE/ACM Trans. Comput. Biol. Bioinform. 3 (2) (2006) 141–154. [4] D. Catanzaro, A. Godi, M. Labbé, A class representative model for pure parsimony haplotyping, INFORMS J. Comput. 22 (2) (2010) 195–209. [5] A.G. Clark, Inference of haplotypes from PCR-ampliﬁed samples of diploid populations, Mol. Biol. Evol. 7 (1990) 111–122. [6] S. Climer, G. Jäger, A.R. Templeton, W. Zhang, How frugal is mother nature with haplotypes?, Bioinformatics 25 (1) (2009) 68–74. [7] S. Climer, W. Zhang, Searching for backbones and fat: a limit-crossing approach with applications, in: Proc. 18th National Conference on Artiﬁcial Intelligence, AAAI, AAAI Press, Menlo Park, Unites States, 2002, pp. 707–712. [8] D.D. Do, S.V. Le, X.H. Hoang, ACOHAP: an eﬃcient ant colony optimization for the haplotype inference by pure parsimony problem, Swarm Intell. 7 (1) (2013) 63–77. [9] N. Eén, N. Sörensson, An extensible SAT-solver, in: Proc. 6th International Conference on Theory and Applications of Satisﬁability Testing, Sat, in: Lect. Notes Comput. Sci., vol. 2919, Springer, Berlin–Heidelberg, Germany, 2003, pp. 502–518. [10] A. Elmas, G.H. Jajamovich, X. Wang, Maximum parsimony xor haplotyping by sparse dictionary selection, BMC Genomics 14 (2013) 645. [11] E. Erdem, F. Türe, Eﬃcient haplotype inference with answer set programming, in: Proc. 23rd National Conference on Artiﬁcial Intelligence, AAAI, AAAI Press, Menlo Park, Unites States, 2008, pp. 436–441. [12] J. Flum, M. Grohe, Parameterized Complexity Theory, Springer, Berlin–Heidelberg, Germany, 2006. [13] I.P. Gent, I. Lynce, A SAT encoding for the social golfer problem, in: Proc. 19th International Joint Conference on Artiﬁcial Intelligence, IJCAI, Workshop on Modelling and Solving Problems with Constraints, Morgan Kaufmann Publishers Inc., San Francisco, United States, 2005. [14] A. Graça, J. Marques-Silva, I. Lynce, A.L. Oliveira, Haplotype inference with pseudo-Boolean optimization, Ann. Oper. Res. 184 (1) (2011) 137–162. [15] T.A. Greenwood, B.K. Rana, N.J. Schork, Human haplotype block sizes are negatively correlated with recombination rates, Genome Res. 14 (7) (2004) 1358–1361. [16] J. Guo, R. Niedermeier, Invitation to data reduction and problem kernelization, SIGACT News 38 (1) (2007) 31–45. [17] D. Gusﬁeld, Inference of haplotypes from samples of diploid populations: complexity and algorithms, J. Comput. Biol. 8 (3) (2001) 305–323. [18] D. Gusﬁeld, Haplotype inference by pure parsimony, in: Proc. 14th Ann. Symp. Combinatorial Pattern Matching, in: Lect. Notes Comput. Sci., vol. 2676, Springer, Berlin–Heidelberg, Germany, 2003, pp. 144–155. [19] D. Gusﬁeld, S.H. Orzack, Haplotype inference, 18-1 to 18-28, in: Handbook on Bioinformatics, CRC Press, Boca Raton, United States, 2005. [20] B.V. Halldórsson, V. Bafna, N. Edwards, R. Lippert, S. Yooseph, S. Istrail, A survey of computational methods for determining haplotypes, in: Computational Methods for SNPs and Haplotype Inference, Proc. DIMACS/RECOMB Satellite Workshop, in: Lect. Notes Comput. Sci., vol. 2983, Springer, Berlin–Heidelberg, Germany, 2004, pp. 24–47. [21] A. Holder, T. Langley, A decomposition of the pure parsimony haplotyping problem, in: Bioinformatics Research and Applications, in: Lect. Notes Comput. Sci., vol. 5542, Springer, Berlin–Heidelberg, Germany, 2009, pp. 198–208. [22] G.H. Jajamovich, X. Wang, Maximum-parsimony haplotype inference based on sparse representations of genotypes, IEEE Trans. Signal Process. 60 (4) (2012) 2013–2023. [23] G. Jäger, W. Zhang, An effective algorithm for and phase transitions of the directed hamiltonian cycle problem, J. Artif. Intell. Res. 39 (2010) 663–687. [24] G. Jäger, S. Climer, W. Zhang, Complete parsimony haplotype inference problem and algorithms, in: Proc. 17th Annual European Symposium on Algorithms (ESA), in: Lect. Notes Comput. Sci., vol. 5757, Springer, Berlin–Heidelberg, Germany, 2009, pp. 337–348. [25] M. Janota, I. Lynce, J. Marques-Silva, Algorithms for computing backbones of propositional formulae, AI Commun. 28 (2) (2015) 161–177. [26] J.O. Kitzman, A.P. Mackenzie, A. Adey, J.B. Hiatt, R.P. Patwardhan, P.H. Sudmant, S.B. Ng, C. Alkan, R. Qiu, E.E. Eichler, J. Shendure, Haplotype-resolved genome sequencing of a Gujarati Indian individual, Nat. Biotechnol. 29 (1) (2011) 59–63. [27] G. Lancia, M.C. Pinotti, R. Rizzi, Haplotype populations by pure parsimony: complexity of exact and approximation algorithms, INFORMS J. Comput. 16 (4) (2004) 348–359. [28] I. Lynce, J. Marques-Silva, Eﬃcient haplotype inference with Boolean satisﬁability, in: Proc. 21st National Conference on Artiﬁcial Intelligence, AAAI, AAAI Press, Menlo Park, Unites States, 2006, pp. 104–109. [29] I. Lynce, J. Marques-Silva, SAT in bioinformatics: making the case with haplotype inference, in: Proc. 9th International Conference on Theory and Applications of Satisﬁability Testing, SAT, in: Lect. Notes Comput. Sci., vol. 4121, Springer, Berlin–Heidelberg, Germany, 2006, pp. 136–141. [30] I. Lynce, J. Marques-Silva, S. Prestwich, Boosting haplotype inference with local search, Constraints 13 (1–2) (2008) 155–179. [31] I. Lynce, J. Ouaknine, Sudoku as a SAT problem, in: Proc. 9th International Symposium on Artiﬁcial Intelligence and Mathematics, ISAIM, Springer, Berlin–Heidelberg, Germany, 2006. [32] J. Marques-Silva, M. Janota, I. Lynce, On computing backbones of propositional theories, in: Proc. 19th European Conference on Artiﬁcial Intelligence, ECAI, in: Frontiers in Artiﬁcial Intelligence and Applications, vol. 215, IOS Press, Amsterdam, The Netherlands, 2010, pp. 15–20. [33] R. Niedermeier, Invitation to Fixed-Parameter Tractability, Oxford University Press, United Kingdom, 2006. [34] S.H. Orzack, D. Gusﬁeld, J. Olson, S. Nesbitt, L. Subrahmanyan, V.P. Stanton Jr., Analysis and exploration of the use of rule-based algorithms and consensus methods for the inferral of haplotypes, Genetics 165 (2003) 915–928. [35] S. Ramadhani, S.R. Mousavi, M. Talebi, An improved heuristic for haplotype inference, Gene 507 (2) (2012) 177–182. [36] L. Tininini, P. Bertolazzi, A. Godi, G. Lancia, CollHaps: a heuristic approach to haplotype inference by parsimony, IEEE/ACM Trans. Comput. Biol. Bioinform. 7 (3) (2010) 511–523. [37] M.N. Velev, P. Gao, Eﬃcient SAT techniques for absolute encoding of permutation problems: application to Hamiltonian cycles, in: Proc. 8th Symposium on Abstraction, Reformulation and Approximation, SARA, Springer, Berlin–Heidelberg, Germany, 2009, pp. 159–166. [38] T. Walsh, J. Slaney, Backbones in optimization and approximation, in: Proc. 17th Intern. Joint Conf. on Artiﬁcial Intelligence, IJCAI, Morgan Kaufmann Publishers Inc., San Francisco, United States, 2001, pp. 254–259. [39] L. Wang, Y. Xu, Haplotype inference by maximum parsimony, Bioinformatics 19 (14) (2003) 1773–1780. [40] G. Wang, T. Yu, W. Zhang, WordSpy: identifying transcription factor binding motifs by building a dictionary and learning a grammar, Nucleic Acids Res. 33 (2005) W412-6. [41] Y. Xu, W. Cheng, P. Nie, F. Zhou, WinHAP: an eﬃcient haplotype phasing algorithm based on scalable sliding windows, PLoS ONE 7 (8) (2012) e43163. [42] W. Zhang, Phase transitions and backbones of 3-SAT and maximum 3-SAT, in: Proc. Intern. Conf. on Principles and Practice of Constraint Programming, CP-01, in: Lect. Notes Comput. Sci., vol. 2239, Springer, Berlin–Heidelberg, Germany, 2001, pp. 153–167. [43] W. Zhang, Conﬁguration landscape analysis and backbone guided local search: Part I: satisﬁability and maximum satisﬁability, Artif. Intell. 158 (1) (2004) 1–26. [44] W. Zhang, Phase transitions and backbones of the asymmetric traveling salesman problem, J. Artif. Intell. Res. 21 (1) (2004) 471–497.

G. Jäger et al. / Journal of Discrete Algorithms 37 (2016) 68–83

83

[45] W. Zhang, M. Looks, A novel local search algorithm for the traveling salesman problem that exploits backbones, in: Proc. 19th International Joint Conference on Artiﬁcial Intelligence, IJCAI, Morgan Kaufmann Publishers Inc., San Francisco, United States, 2005, pp. 343–350. [46] C.S. Zhu, G. Weissenbacher, D. Sethi, S. Malik, SAT-based techniques for determining backbones for post-silicon fault localisation, in: Proc. High Level Design Validation and Test Workshop, HLDVT, IEEE International, New York, United States, 2011, pp. 84–91. [47] The International HapMap Consortium, A haplotype map of the human genome, Nature 437 (2005) 1299–1320. [48] IP solver Cplex, https://www-01.ibm.com/software/commerce/optimization/cplex-optimizer/. [49] Source code of [9] (MiniSat). Available: http://minisat.se/. [50] International SAT Competition: http://www.satcompetition.org/.

The complete parsimony haplotype inference problem and algorithms based on integer programming, branch-and-bound and Boolean satisfiability

The complete parsimony haplotype inference problem and algorithms based on integer programming, branch-and-bound and Boolean satisfiability

Recommend Documents