Exposing Phylogenetic Relationships by Genome Rearrangement YING CHIH LIN AND CHUAN YI TANG Department of Computer Science National Tsing Hua University Hsinchu 300 Taiwan, ROC
[email protected] [email protected]
Abstract
Evolutionary studies based on large-scale rearrangement operations, as opposed to the traditional approaches based on point mutations, are considered a promising alternative for inferring the evolutionary history of species. Genome rearrangement problems lead to combinatorial puzzles of finding parsimonious scenarios that measure how different two species are and explain how one species evolves from another. Throughout this chapter, we focus on computing the genomic distance between a pair of genomes, which arises from the effects of a set of rearrangement events. At the end, two experiments on Campanulaceae and Proteobacteria are used to show how to exploit the genome rearrangement approach for exposing phylogenetic relationships.

1. Introduction
1.1. Molecular Biology Primer
1.2. Algorithm and Complexity
1.3. Computational Biology
2. Genome Rearrangement Overview
2.1. Pancake Flipping Problem
2.2. The Breakpoint Graph
3. Sorting by Reversals
3.1. Unsigned Permutations
3.2. Signed Permutations
3.3. Circular Permutations
4. Sorting by Transpositions/Block-Interchanges
4.1. Sorting by Transpositions

ADVANCES IN COMPUTERS, VOL. 68
ISSN: 0065-2458/DOI: 10.1016/S0065-2458(06)68001-7
Copyright © 2006 Elsevier Inc. All rights reserved.
Y.C. LIN AND C.Y. TANG
4.2. Sorting by Block-Interchanges
5. Sorting by Translocations
6. Sorting by Multiple Operations
6.1. Reversal + Transposition/Block-Interchange
6.2. Reversal + Translocation (Including Fusion and Fission)
6.3. Other Considerations
7. Experimental Results
7.1. Chloroplast in Campanulaceae
7.2. γ-Proteobacteria
8. Conclusions
References
1.
Introduction
Today, molecular biology has in many respects become an information science with close ties to computer science. Large databases and sophisticated algorithms have been developed as essential tools for understanding complex biological systems, determining the functions of nucleotide and protein sequences, and reconstructing the evolution of species. Before studying biological tools, models of biological problems and biologically related algorithms, one should first learn the background in both biology and algorithm theory. In this section, we first introduce some basics of biology, then of algorithms and complexity, and finally of a relatively young field, computational biology, which contains the topics studied in this chapter.
1.1
Molecular Biology Primer
Each living organism is completely described by its genome. From the viewpoint of computer science, the genome can be regarded as a “program” in some particular language, which describes a set of instructions to be followed by an organism for growth and living. In other words, the genome is a template or a blueprint on which the construction of a building relies. Therefore, in order to understand and interpret the hidden information of a genome, we first define the “life language” by the representation of DNA codes. A genome is composed of deoxyribonucleic acid (DNA), which was discovered in 1869 during a study of the chemistry of white blood cells. DNA appears in the cells of organisms and comprises two sequences, called strands, of tightly coiled threads of nucleotides. Each nucleotide is a molecule composed of a phosphoric group, a pentose sugar and a nitrogen-containing chemical, called a base. Four different bases
EXPOSING PHYLOGENETIC RELATIONSHIPS
FIG. 1. The DNA is a double-stranded molecule twisted into a helix (like a spiral staircase). Each spiral strand, comprised of a sugar-phosphate backbone and attached bases, is connected to a complementary strand by non-covalent hydrogen bonding between paired bases. The two ends of a backbone are conventionally called the 5′ end and the 3′ end [2].
are adenine (A), thymine (T), cytosine (C) and guanine (G), as illustrated in Fig. 1. A particular order of the bases is called the DNA sequence, which varies greatly from organism to organism. It is the content of this sequence that specifies the precise genetic instructions to produce the unique features of an organism. Among the four bases, base A is always paired with base T, and C is always paired with G. Thus bases A and T (C and G) are said to be the complement of each other, or a pair of complementary bases. Two DNA sequences are complementary if one is the other read backwards with the complementary bases interchanged, e.g., ATCCGA and TCGGAT are complementary because ATCCGA with A ↔ T and C ↔ G becomes TAGGCT, which is TCGGAT read backwards. There are strong interactions formed by hydrogen bonds between two complementary bases, called base pairing. Hence, two complementary DNA sequences are intertwisted in the double helix structure described by Watson and Crick in 1953 [173]. Despite the complex 3-dimensional structure of this molecule, the genetic material depends only on the sequence of nucleotides and can thus be described without loss of information as a string over the alphabet {A, T, G, C}.

FIG. 2. Schematic DNA replication [2].

Because of this complementarity, DNA has the capability to replicate itself. The full genome of a cell is duplicated when the cell divides into two new cells. At this moment, the DNA molecule unwinds and the bonds between the base pairs break one by one (Fig. 2). Each strand acts as a template for the synthesis of a new DNA molecule by the sequential addition of complementary bases, thereby generating a new DNA strand that is the complementary sequence to the parental strand. In this way, each new molecule should be identical to the old one. However, although the replication process is very reliable, it is not completely error-free, and it is possible that some bases are lost, duplicated or simply changed. Such variations of the original DNA sequence are known as mutations, and they can create diversity among organisms or their offspring. In most cases, mutations are harmful, but sometimes they are harmless or even advantageous, helping species adapt to new environments.

In spite of the huge variety among the existing forms of life, the basic mechanism for representing a living being is the same for all organisms. As a matter of fact, all the information describing an organism is encoded in the DNA sequence of its genome by means of a universal code, known as the genetic code. This code describes how to construct the proteins, which are responsible for performing most of the work of cells, e.g., helping to build structures and carrying out essential biochemical reactions.
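The complementarity rule described in this subsection (A pairs with T, C pairs with G, and one strand is read backwards) can be illustrated with a few lines of Python. This is only a minimal sketch; the names `COMPLEMENT`, `complement` and `are_complementary` are illustrative and do not come from the chapter.

```python
# Sketch of the complementarity rule: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(seq: str) -> str:
    """Interchange each base with its complementary base, position by position."""
    return "".join(COMPLEMENT[base] for base in seq)


def are_complementary(seq1: str, seq2: str) -> bool:
    """True if seq2 is seq1 with bases interchanged and then read backwards."""
    return complement(seq1)[::-1] == seq2


print(complement("ATCCGA"))                   # TAGGCT
print(are_complementary("ATCCGA", "TCGGAT"))  # True
```

The two printed values reproduce the ATCCGA/TCGGAT example from the text.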
Interestingly, not all of the DNA sequence carries coding information. In fact, coding information appears in small regions only, e.g., in humans, the total size
FIG. 3. A schema of chromosomes in eukaryotes [2].
of such regions covers about 10% of the whole DNA sequence. The coding DNA regions are the so-called genes, where each gene is mapped to a different protein. A gene sequence contains exons and introns (Fig. 3): the former are translated into proteins, while the latter are intervening sequences whose functions are obscure (in general, they are regarded as irrelevant to the functions of organisms).
Genes are linearly located on chromosomes and are also the primary material of chromosomes, as shown in Fig. 3. In the eukaryotic cells of higher organisms, the nucleus contains several chromosomes, which are in general all linear sequences, e.g., humans have 46 chromosomes (23 pairs). A chromosome usually consists of one or two chromatids, telomeres and a centromere. In most cases, the centromere is located roughly in the middle of a chromosome, but sometimes it lies near an end. When a chromosome is composed of two chromatids, we sometimes term it a doubled chromosome. In contrast, the prokaryotic cells of lower organisms usually contain circular chromosomes, e.g., most bacteria have one circular chromosome while Vibrio species have two. More background on molecular biology can be found in the textbooks of Clark and Russell [45], and Weaver [175].
1.2
Algorithm and Complexity
In a general sense, an algorithm is a well-defined and finite sequence of steps used to solve a well-formulated problem, that is, a problem that is unambiguous and can be specified by inputs and outputs. An algorithm solves a problem if, for every instance of the problem, it gives a solution satisfying the requirement. A famous example in computer science is the Traveling Salesman Problem, or TSP for short. In this problem, a salesman has a map specifying a collection of cities and the cost of travel between each pair of them. He wants to visit all cities and eventually return to the city from which he started. The inputs of a TSP are the map and a starting city, while the output is the cheapest route for visiting all cities. Most problems in computer science can be abstracted and then redescribed as graphs of vertices and edges, each edge connecting two vertices. For instance in TSP, the cities correspond to vertices and the cost of travel between a pair of cities corresponds to the weight of the edge joining them. The output is the shortest tour that starts from one vertex, visits all vertices and finally returns to the starting vertex, where the length of a tour is the sum of its edge weights. See Fig. 4 for a simple example of TSP. In spite of the success in the example of Fig. 4, the TSP is indeed harder than most combinatorial optimization problems considered. Edmonds [65] conjectured in 1965 that there is no efficient algorithm for TSP, that is, no algorithm that solves the problem on all instances in a reasonable time. However, he did not state precisely what “reasonable time” means; a common definition since then is that a reasonable running time is one where the number of steps is polynomial in the size of the input, i.e., if n is the input size, then the number of steps is at most f(n), where f is a polynomial function.
Furthermore, the class of problems having such algorithms is denoted as P, introduced by Cobham [46]
FIG. 4. An example with four vertices, where the cost of each pair of vertices is labeled on the edge. A shortest tour starting from vertex A is A → B → D → C → A with cost 4.
and, independently, Edmonds [65]; such problems are also referred to as tractable problems. In the early 1970s, better progress was made on the problems for which no efficient algorithms were known. These harder problems are classified as decision or optimization problems according to what they ask. A decision problem is a problem whose answer is “yes” or “no,” e.g., “Is there a TSP tour of distance at most 5?”, while an optimization problem asks a question such as “What is the shortest tour visiting all cities?” Cook [51] published a milestone paper presenting the class NP (Non-deterministic Polynomial time) and first showed that the Boolean satisfiability problem is NP-complete. NP contains the decision problems that can be solved in polynomial time by a non-deterministic machine. This machine has the capability to “guess” a solution non-deterministically and then verify it in polynomial time. In some sense, non-deterministic guessing is equivalent to trying all possible solutions in parallel. For the decision version of the TSP, we can design a non-deterministic machine as follows: given a starting vertex S, it non-deterministically enumerates the paths from S and verifies that a path passes through all vertices, ends in S and has total cost within the given bound. This verification can be done in polynomial time, thus implying TSP ∈ NP. A special class of problems is NP-hard: a polynomial-time solution to any problem in this class implies that all problems in NP have polynomial-time solutions. A problem is NP-complete if it is both in NP and NP-hard. Thus NP-complete problems can be regarded as the hardest problems in NP. Clearly, P ⊆ NP, and whether P is equal to NP or not is one of the most notable open problems in both theoretical computer science and mathematics (one of the seven Prize Problems [3]). It is widely believed that P ≠ NP, an assumption on which many results are based.
Therefore, a polynomial-time algorithm for solving an NP-complete problem seems highly unlikely, since it would imply polynomial-time solutions to all problems in NP and hence P = NP. For this reason, NP-complete problems are also said to be intractable. For more properties and classes of computational complexity, we refer the reader to the books of Garey and Johnson [79], Hopcroft, Motwani and Ullman [99], Papadimitriou [142] and Sipser [156].
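The “guess and verify” view of NP can be made concrete with a small certificate checker for the decision version of the TSP. The sketch below is illustrative only: the edge costs in `COST` are hypothetical (the actual weights of Fig. 4 are not given in the text), and the function name `verify_tour` is an assumption of this example.

```python
from typing import Dict, FrozenSet, List

# Hypothetical symmetric travel costs on four cities (not taken from Fig. 4).
COST: Dict[FrozenSet[str], int] = {
    frozenset("AB"): 1, frozenset("BD"): 1, frozenset("CD"): 1,
    frozenset("AC"): 1, frozenset("BC"): 2, frozenset("AD"): 2,
}
CITIES = {"A", "B", "C", "D"}


def verify_tour(tour: List[str], bound: int) -> bool:
    """Check a guessed certificate for "is there a tour of cost at most bound?".

    The tour lists every city exactly once; the return to tour[0] is implicit.
    The check takes time linear in the number of cities.
    """
    if len(tour) != len(CITIES) or set(tour) != CITIES:
        return False
    legs = zip(tour, tour[1:] + tour[:1])
    return sum(COST[frozenset(leg)] for leg in legs) <= bound


print(verify_tour(["A", "B", "D", "C"], 5))  # True: this tour costs 4
print(verify_tour(["A", "B", "C", "D"], 5))  # False: this tour costs 6
```

Verifying one guessed tour is cheap; the difficulty lies in the number of candidate tours that a deterministic machine would have to try.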
Many problems of practical significance are NP-complete, implying that obtaining an optimal solution is intractable, but they are too important to be discarded. If the input size is small, an exhaustive search with exponential running time may be adequate. Nevertheless, for an input of large size, waiting for the solution output by a computer may take a very long time, perhaps several years. One way to attack such problems is to use heuristics, which are typically implemented as sophisticated searches in a part of the solution space. Solutions of heuristic methods often have no guarantees with respect to optimal solutions, i.e., there are no bounds on how near to or far from optimal the solutions are. Although we are most unlikely to find a polynomial-time algorithm for solving an NP-complete problem exactly, it may still be possible to find near-optimal solutions in polynomial time. An algorithm that returns a near-optimal solution with a proven guarantee is called an approximation algorithm. Graham [81] made the first attempt to introduce the concept of an approximation algorithm, which he used to solve the parallel-machine-scheduling problem. Subsequently, Garey, Graham and Ullman [78] and Johnson [103] formalized the concept of a polynomial-time approximation algorithm. We say that an algorithm for an optimization problem has an approximation ratio $\beta$ (a $\beta$-approximation algorithm) if, for any input of size $n$, the cost $C$ of the solution produced by the algorithm is within a factor of $\beta$ of the cost $C^{*}$ of an optimal solution, i.e., $\max(C/C^{*}, C^{*}/C) \le \beta$ [52]. The definition of approximation ratio applies to both minimization (ratio $C/C^{*}$) and maximization (ratio $C^{*}/C$) problems. Taking the TSP as an example, if there are $n$ cities, a brute-force search over at most $(n - 1)!$ possibilities finds the shortest tour. However, for the TSP with triangle inequality, even though it is still NP-complete, we have a simple 2-approximation algorithm [52, p. 1028], which outputs the tour A → B → C → D → A with cost 6 in the example of Fig. 4. Therefore, the approximation ratio of this solution is 6/4 = 1.5 (cost 4 is optimal), which is under 2. There is a handy website recording the history and progress of TSP [4]. In addition, for more classic approximation algorithms, the books of Ausiello et al. [8], Hochbaum [97] and Vazirani [164] cover the topic extensively.
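The brute-force search over $(n - 1)!$ tours and the resulting approximation ratio can be sketched as follows. The edge costs are hypothetical: they were chosen only so that the two tours quoted in the text cost 4 and 6, and they are not claimed to be the weights of Fig. 4. The approximate tour is taken from the text rather than computed, since the 2-approximation algorithm itself is not reproduced here.

```python
from itertools import permutations

# Hypothetical costs chosen to agree with the two tours quoted in the text.
COST = {
    frozenset("AB"): 1, frozenset("BD"): 1, frozenset("CD"): 1,
    frozenset("AC"): 1, frozenset("BC"): 2, frozenset("AD"): 2,
}


def tour_cost(tour):
    """Cost of visiting the cities in order and returning to the first one."""
    return sum(COST[frozenset(leg)] for leg in zip(tour, tour[1:] + tour[:1]))


def brute_force_tsp(cities, start):
    """Try all (n - 1)! orderings of the remaining cities; keep the cheapest."""
    rest = [c for c in cities if c != start]
    tours = ((start,) + p for p in permutations(rest))
    return min(tours, key=tour_cost)


best = brute_force_tsp("ABCD", "A")
approx = ("A", "B", "C", "D")          # tour quoted in the text
ratio = tour_cost(approx) / tour_cost(best)
print(best, tour_cost(best), ratio)    # ('A', 'B', 'D', 'C') 4 1.5
```

With four cities only six orderings are examined, but the factorial growth makes this approach impractical beyond small inputs.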
1.3
Computational Biology
Computational (Molecular) Biology started in the late 1970s as an area that aims to solve the problems arising in the biological lab. In this period, computers became cheaper and simpler to use, so that some biologists adopted them for storing and managing genomic data. Thanks to the computational power of computers, projects and research that previously took a long time could be completed quickly. An early striking example of a biological discovery made by using a computer was in 1983
when Doolittle et al. [61] used the nascent genetic sequence database to show that a cancer-causing gene was a close relative of a normal gene. From this, it became clear that cancer might arise from a normal growth factor acting at the wrong time. At that time, molecular biology labs throughout the world began installing computers to search databases via networks or to develop their own databases. More recently, owing to the convenience and robustness of the Internet, biologists can share their data and make them available worldwide through several genomic data banks, such as GenBank [18], PDB [25], EMBL [47], etc. Moreover, the well-developed website NCBI [176], established in 1988, serves as an interface for accessing these databases. These databases and other resources are valuable services not only to the biological community, but also to computer scientists searching for domain information. Up to now, the meaning of “Computational Biology” has remained rather unclear. Some researchers use the two names, Bioinformatics and Computational Biology, interchangeably, but there is actually a slight difference. We adopt the definitions provided by Lancia [112], in which Bioinformatics problems are concerned with the storage, organization and distribution of large amounts of genomic data, while Computational Biology deals with the mathematical and computational problems of interpretation and theoretical analysis of genomic data. In general, the work of constructing algorithms that address problems with biological relevance, i.e., the work of constructing algorithms in computational biology, consists of two interacting steps. The first step is to pose a biologically interesting question and to construct a model of the biological phenomenon that makes it possible to formulate the posed question unambiguously as a computational problem.
We need to be careful in this step because almost every area of theoretical computer science starts as an attempt to solve applied problems and later becomes more theoretically oriented. These theoretical aspects may even become more important and scientifically valuable than the original applications that motivated the entire area. The second step is to design an algorithm for solving the carefully formulated computational problem. The first step requires knowledge of molecular biology, while the second needs a background in algorithm theory. To measure the quality of a constructed algorithm, we traditionally apply standard algorithmic analysis to the cost of the resources, most prominently running time and space, that it requires to solve the problem. However, since the problem solved by the algorithm originates from a question with biological relevance, the algorithmic quality could also be judged by the biological relevance of the answer it produces. For example, in Fig. 5(a), the distance matrix lists the distances between each pair of the four species, VC, VP, VV and VF, indicating their evolutionary relationship, where a bigger distance represents a more distant relationship between two species.
FIG. 5. A distance matrix (a) of four species, VC, VP, VV and VF, and a phylogenetic tree (b) representing their evolutionary relationship, where filled circles represent terminal nodes while open circles correspond to internal nodes, A1, A2 and A3. In particular, A1 is called the root.
Now, we want to describe the distance matrix by using a phylogenetic tree made by arranging nodes and branches (Fig. 5(b)), where every node represents a distinct taxonomical unit. Nodes at the right ends of branches (terminal nodes or leaves) correspond to genes or organisms for which data have actually been collected for analysis, while internal nodes usually represent an inferred common ancestor that gave rise to two independent lineages at some point in the past. Furthermore, each branch of the tree is labeled by a value to reflect the phylogenetic relationship in the distance matrix. The problems of inferring evolutionary trees have been extensively studied for many years, and unfortunately, many of them are NP-hard or NP-complete. Here, we want to construct a tree such that the length of the path between each pair of leaves on the tree is equal to the corresponding value in the distance matrix. Figure 5(b) is an example of a phylogenetic tree constructed by the neighbor-joining method [147], whose input is the distance matrix in Fig. 5(a). In this tree, the path from VC to VF has length 32, and the distance between VC and VF is also 32 in Fig. 5(a). From the algorithmic viewpoint, this tree construction method has polynomial running time, and thus we can expect to obtain the output in a reasonable time. However, the tree is not an optimal solution, because the distance between VC and VV in the matrix (21) differs from that in the constructed tree (21.25). Besides, from the biological point of view, the quality of this tree is evaluated against the real relationships among the four species. For example, VV is closer to VP than VC is to VP in the tree; if this is also true in reality, we will believe that the tree is good, which in turn suggests that the neighbor-joining method performs well. For more formulated problems and their corresponding algorithms, we refer the reader to the textbooks of Gusfield [83], Jones and Pevzner [104], Pevzner [145], Setubal and Meidanis [154], and Waterman [171].
The details of a specific model and algorithm of course depend on the questions being asked. Most questions in computational biology are related to molecular or evolutionary biology, and focus on analyzing and comparing the composition of the key biomolecules, DNA, RNA and proteins, that together form the basic components of an organism. The success of ongoing efforts to develop and use techniques for getting data about the composition of these biomolecules, like the DNA sequencing technique for extracting the genetic material from species, e.g., the Human Genome Project [101,165], has resulted in a flood of available biological data to be compared and analyzed.
2.
Genome Rearrangement Overview
The genome of an organism consists of a small number of segments called chromosomes, and genes, which are responsible for encoding proteins, are spread along the DNA sequence of a chromosome in a particular order. Each gene has an orientation, either forward or backward, depending on the direction in which it is assumed to be read. Therefore, a genome can be abstracted as a set of chromosomes, and a chromosome as an ordered set of oriented genes. The chromosomes of higher organisms such as mammals are generally linear (the DNA sequence has a beginning and an end), while those of lower organisms such as bacteria are circular (their DNA sequences have no beginning or end). Traditional comparison between two genomes pays attention to local operations (mutations), such as substitution, insertion, deletion and duplication (Fig. 6) which
FIG. 6. The bar represents a chromosome and each black block indicates the position of a gene. The four types of local operations, substitution (a), insertion (b), deletion (c) and duplication (d) [2], usually affect a very small number of genes in a chromosome.
affect only a small stretch of the DNA sequence. These local operations have been widely observed by biologists because they occur frequently when the difference between two genomes is studied. Further, from the theoretical point of view, the minimum difference caused by local operations between two genomes is regarded as their edit distance, and this value can, in most cases, be easily calculated by dynamic programming [52]. Most published phylogenetic research is based on these types of operations. On the other hand, the study of genome rearrangement focuses on inferring the most parsimonious explanation, using a set of non-local operations, for the disruption in gene order among two or more genomes. In general, such non-local operations are called rearrangement events. A rearrangement event occurs when a chromosome is broken at two or more positions and the resulting segments are reassembled in a different order. The rearranged DNA sequence is essentially identical to the original sequence, except for the changed order of the reassembled segments. The non-local operations causing reassembly include reversal (or inversion), transposition, block-interchange and translocation. A reversal event flips a segment in a chromosome and changes the direction of each element in the segment. A transposition event exchanges two adjacent segments in a chromosome, while a block-interchange swaps two non-intersecting segments. Because it involves segments in two chromosomes, the translocation event is more complicated and its effect will be introduced in Section 5. Moreover, most non-local operations are derived from biological observations on the differences between the DNA sequences of species. For example, in the late 1930s, Dobzhansky and Sturtevant [60] published a milestone paper presenting a rearrangement scenario with inversions for the Drosophila fruit fly, which is regarded as the pioneering work on genome rearrangement in molecular biology.
Moreover, in the late 1980s, Palmer and Herbon [140] compared the mitochondrial genomes of Brassica oleracea (cabbage) and Brassica campestris (turnip), in which many genes are 99% identical but differ dramatically in gene order (Fig. 7). Palmer and his coworkers also found a similar phenomenon within the chloroplast genomes of legume [141] and anemone [98]. These discoveries provide convincing evidence that genome rearrangement plays a role in molecular evolution. In contrast to the edit distance, the rearrangement distance (or genetic distance) resulting from rearrangement events is commonly defined as the minimum number of operations needed to transform one genome into the other. For instance, in Fig. 7, if only reversals are considered, the rearrangement distance between cabbage and turnip is 3, where the minimum can be verified by the “pencil-and-paper” method. For genomes consisting of a small number of homologous genes (or conserved blocks), we can find the most parsimonious rearrangement scenarios by exhaustive search or observation. However, for genomes consisting of more genes, exhaustive search over all potential solutions is time-consuming even when performed by a computer. As a result, developing efficient algorithms is an urgent requirement for dealing with the genome rearrangement problems arising from the large-scale mapping of species.

FIG. 7. The pentagons represent the positions, orientations and order of common genes shared by cabbage and turnip, and each pair of dotted lines represents an inverted segment. As shown in this figure, three reversals can transform cabbage into turnip [145].

The computational approach based on the comparison of gene orders was pioneered by Sankoff et al. [148,150,151]. Depending on which operations we consider, the genome rearrangement problems lead to different combinatorial puzzles. This model simply treats chromosomes as permutations and genes as the elements of the permutations, each associated with a + or − sign indicating the direction of its transcription. Taking Fig. 7 as an example, cabbage ($\pi$) is modeled as +1 −5 +4 −3 +2 and turnip ($\sigma$) as +1 +2 +3 +4 +5, and thus the problem becomes that of finding the minimum number of reversals for transforming $\pi$ into $\sigma$. Let $\Sigma = \{1, 2, \ldots, n\}$, and let $\pi = \pi_1 \pi_2 \ldots \pi_n$ be a signed permutation on $\Sigma$, where each $\pi_i$ is labeled by a sign of + or −. For $1 \le i \le j < k \le l \le n$, we express three types of operations in mathematical form:
• A reversal $r(i, j)$ affects $\pi$, denoted as $r(i, j) \cdot \pi$, by inverting the block $\pi_i \pi_{i+1} \ldots \pi_j$ to $-\pi_j\, {-\pi_{j-1}} \ldots {-\pi_i}$, i.e., $r(i, j) \cdot \pi = \pi_1 \ldots \pi_{i-1}\, {-\pi_j}\, {-\pi_{j-1}} \ldots {-\pi_i}\, \pi_{j+1} \ldots \pi_n$.
• A transposition $tr(i, j, k)$ affects $\pi$, denoted as $tr(i, j, k) \cdot \pi$, by swapping two consecutive segments $\pi_i \pi_{i+1} \ldots \pi_j$ and $\pi_{j+1} \pi_{j+2} \ldots \pi_k$, i.e., $tr(i, j, k) \cdot \pi = \pi_1 \ldots \pi_{i-1}\, \pi_{j+1} \ldots \pi_k\, \pi_i \ldots \pi_j\, \pi_{k+1} \ldots \pi_n$.
• A block-interchange $bi(i, j, k, l)$ affects $\pi$, denoted as $bi(i, j, k, l) \cdot \pi$, by swapping two non-intersecting segments $\pi_i \pi_{i+1} \ldots \pi_j$ and $\pi_k \pi_{k+1} \ldots \pi_l$, that is, $bi(i, j, k, l) \cdot \pi = \pi_1 \ldots \pi_{i-1}\, \pi_k \ldots \pi_l\, \pi_{j+1} \ldots \pi_{k-1}\, \pi_i \ldots \pi_j\, \pi_{l+1} \ldots \pi_n$.
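The three operations defined above translate directly into list slicing. The following Python sketch represents a signed permutation as a list of non-zero integers and keeps the 1-based indices of the definitions; the function names are illustrative, and the three reversals shown for cabbage are one possible scenario, not necessarily the one drawn in Fig. 7.

```python
from typing import List

Perm = List[int]  # signed permutation; e.g. [1, -5, 4, -3, 2] is +1 -5 +4 -3 +2


def reversal(pi: Perm, i: int, j: int) -> Perm:
    """r(i, j): reverse the block pi_i..pi_j and flip every sign in it."""
    return pi[:i - 1] + [-x for x in reversed(pi[i - 1:j])] + pi[j:]


def transposition(pi: Perm, i: int, j: int, k: int) -> Perm:
    """tr(i, j, k): swap the adjacent blocks pi_i..pi_j and pi_{j+1}..pi_k."""
    return pi[:i - 1] + pi[j:k] + pi[i - 1:j] + pi[k:]


def block_interchange(pi: Perm, i: int, j: int, k: int, l: int) -> Perm:
    """bi(i, j, k, l): swap the disjoint blocks pi_i..pi_j and pi_k..pi_l."""
    return pi[:i - 1] + pi[k - 1:l] + pi[j:k - 1] + pi[i - 1:j] + pi[l:]


cabbage = [1, -5, 4, -3, 2]
step1 = reversal(cabbage, 2, 5)   # [1, -2, 3, -4, 5]
step2 = reversal(step1, 2, 2)     # [1, 2, 3, -4, 5]
turnip = reversal(step2, 4, 4)    # [1, 2, 3, 4, 5]
print(turnip)

print(transposition([1, 2, 3, 4, 5], 2, 3, 5))          # [1, 4, 5, 2, 3]
print(block_interchange([1, 2, 3, 4, 5], 1, 1, 4, 5))   # [4, 5, 2, 3, 1]
```

Each function returns a new list, so a scenario can be written as a chain of calls without mutating the starting genome.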
Given two permutations $\pi$ and $\sigma$, sorting by reversals is the problem of finding a series of reversals $\rho_1, \rho_2, \ldots, \rho_t$ such that $\rho_t \cdot \rho_{t-1} \cdots \rho_1 \cdot \pi = \sigma$, where $t$ is the minimum and is considered as the reversal distance $d_r(\pi)$ between $\pi$ and $\sigma$. Usually, the target permutation $\sigma$ is replaced by the identity permutation $I = +1\ +2 \ldots +n$, and this is why we call the transformation of $\pi$ into $I$ a sorting problem. Therefore, the reversal distance is the distance $d_r(\pi)$ between $\pi$ and $I$. In 1995, Hannenhalli and Pevzner [89] surprisingly provided a polynomial-time algorithm for exactly solving the sorting by reversals problem, which led to great interest among later researchers. Other problems such as sorting by transpositions, sorting by block-interchanges and sorting by translocations can be defined similarly, differing only in the operations used. For convenience, we use the term genomic distance to represent the distance between two permutations no matter what operations are used to sort.
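For very small signed permutations, the reversal distance can be computed directly from the definition by exhaustive breadth-first search. The sketch below is exponential in $n$ and is emphatically not the polynomial-time algorithm of Hannenhalli and Pevzner [89]; the function name `reversal_distance` is an assumption of this example.

```python
from collections import deque
from typing import Tuple


def reversal_distance(pi: Tuple[int, ...]) -> int:
    """Exact d_r(pi) by breadth-first search over all signed reversals.

    Exponential in n, so only usable for very small permutations; this is
    not the polynomial-time method of Hannenhalli and Pevzner.
    """
    n = len(pi)
    identity = tuple(range(1, n + 1))
    dist = {pi: 0}
    queue = deque([pi])
    while queue:
        cur = queue.popleft()
        if cur == identity:
            return dist[cur]
        for i in range(n):
            for j in range(i, n):
                # Apply r(i+1, j+1): reverse cur[i..j] and negate its elements.
                nxt = cur[:i] + tuple(-x for x in reversed(cur[i:j + 1])) + cur[j + 1:]
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
    raise ValueError("input is not a signed permutation of 1..n")


print(reversal_distance((1, -5, 4, -3, 2)))  # 3
```

For the cabbage permutation +1 −5 +4 −3 +2 the search returns 3, in agreement with the distance stated for Fig. 7.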
2.1
Pancake Flipping Problem
Before introducing the genome rearrangement problems defined later, we first present an interesting problem called the pancake flipping problem, originally posed by Dweighter [64]. This problem comes from a real-life situation in which a waiter wants to rearrange a stack of pancakes, all of different sizes, by grabbing several from the top and flipping them over, so that the smallest pancake winds up on top, and so on, down to the largest at the bottom. If there are $n$ pancakes, what is the minimum number of flips needed to rearrange them? The pancake flipping problem corresponds to the sorting by prefix reversals problem described as follows. Given an arbitrary permutation $\pi = \pi_1 \pi_2 \ldots \pi_n$ (a stack of $n$ pancakes), each $\pi_i$ corresponds to a pancake according to its value, i.e., a bigger $\pi_i$ corresponds to a larger pancake. The sorting by prefix reversals problem is to find the minimum number of prefix reversals of the form $r(1, i)$, denoted as $d_{\mathrm{pref}}(\pi)$, needed to sort $\pi$. Since there is no difference between the two sides of a pancake, the permutation $\pi$ is unsigned, i.e., each $\pi_i$ is always positive. A reversal thereby acts on $\pi$ by inverting the order of the elements in a segment without changing their signs (Fig. 8). Bogomolny has also developed a website for simulating this problem [31]. The first result attempting to solve this problem was published by Gates and Papadimitriou [80]. They proved that the prefix reversal diameter, $D_{\mathrm{pref}}(n) = \max_{\pi \in S_n} d_{\mathrm{pref}}(\pi)$, where $S_n$ is the symmetric group containing all permutations of size $n$, satisfies $D_{\mathrm{pref}}(n) \le (5n + 5)/3$ and that, for infinitely many $n$, $17n/16 \le D_{\mathrm{pref}}(n)$. Subsequently, Heydari and Sudborough [94] improved the lower bound to $15n/14$. To our surprise, no complexity result was known for this seemingly simple problem until Fischer and Ginzinger [75] recently gave a 2-approximation algorithm to find $d_{\mathrm{pref}}(\pi)$.
FIG. 8. Four prefix reversals can transform π = 1 5 4 3 2 into I = 1 2 3 4 5. The left-bracket segments show where reversals take place.
FIG. 9. Each πi represents a pancake; a sign “−” is associated with πi if its burnt side is up, and otherwise its good side is up. The burnt side is indicated by the rectangle in a pancake. Eight prefix reversals can transform π = 1 −5 4 −3 2 into I = 1 2 3 4 5.
Gates and Papadimitriou [80] also considered a variation of the pancake flipping problem in which each pancake has one burnt side. The pancakes must be sorted by size with every burnt side down. This variation can also be transformed into a sorting by prefix reversals problem analogous to the one above. Because the two sides of a pancake now differ, the permutation π becomes signed, and each prefix reversal changes the signs of all elements in the inverted segment (Fig. 9). Gates and Papadimitriou found that 3n/2 − 1 ≤ dpref(π) ≤ 2n + 3, and Cohen and Blum [50] further improved these bounds to 3n/2 ≤ dpref(π) ≤ 2n − 2, where the upper bound holds for n ≥ 10. However, there has been little further progress on either the unsigned or the signed pancake problem. Although Heydari [95] proved the NP-completeness of a modified version of the (unsigned) pancake problem, it remains unknown whether the original problems are in P.
FIG. 10. The breakpoint graph G(π) of π = 1 5 4 3 2, where black edges are represented as solid lines and gray edges as dotted lines.
2.2 The Breakpoint Graph
In the field of genome rearrangement, the most famous analysis tool is the breakpoint graph, on which many results are based. Watterson et al. [174] and Nadeau and Taylor [135] introduced the notion of a breakpoint. They also noticed a correlation between the number of breakpoints and the reversal distance. Below we define breakpoints and show how to construct the breakpoint graph of an unsigned or signed permutation. For an unsigned permutation π = π1 π2 ... πn, we extend it by adding π0 = 0 and πn+1 = n + 1. A pair of elements (πi, πi+1), 0 ≤ i ≤ n, is a breakpoint if |πi − πi+1| > 1. For instance, if π = 1 5 4 3 2, then there are two breakpoints, (1, 5) and (2, 6). Since the identity permutation has no breakpoints, sorting π corresponds to eliminating breakpoints. If we use only reversals, the observation that a reversal can eliminate at most two breakpoints immediately implies b(π)/2 ≤ dr(π), where b(π) is the number of breakpoints in π. Similar arguments apply to transpositions and block-interchanges, giving the lower bounds b(π)/3 ≤ dtr(π) and b(π)/4 ≤ dbi(π) on the transposition and block-interchange distances, respectively. Several definitions of the breakpoint graph appear in the literature; we choose one of the most common models, introduced by Bafna and Pevzner [11], and use it in the following sections. The breakpoint graph of an unsigned permutation π is an edge-colored graph G(π) with n + 2 vertices {π0, π1, ..., πn+1}, defined as follows. For 0 ≤ i ≤ n, πi and πi+1 are connected by a black edge, and πi is joined to πj by a gray edge if |πi − πj| = 1, as shown in Fig. 10. The sections below describe in detail how the breakpoint graph assists in sorting a permutation.
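The breakpoint definition above translates directly into a few lines of Python. The sketch below, with function names of our own choosing, lists the breakpoints of an unsigned permutation and evaluates the lower bound b(π)/2 on the reversal distance, rounded up because a distance is an integer.

```python
def breakpoints(perm):
    """List the breakpoints of an unsigned permutation of 1..n."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]  # add pi_0 = 0 and pi_{n+1} = n + 1
    return [(a, b) for a, b in zip(ext, ext[1:]) if abs(a - b) > 1]


def reversal_lower_bound(perm):
    """Lower bound ceil(b(pi) / 2) on the reversal distance d_r(pi)."""
    return (len(breakpoints(perm)) + 1) // 2


print(breakpoints([1, 5, 4, 3, 2]))           # [(1, 5), (2, 6)]
print(reversal_lower_bound([1, 5, 4, 3, 2]))  # 1
```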
3. Sorting by Reversals
The reversal is the first event we discuss: not only was it the first event observed, in Drosophila species by Dobzhansky and Sturtevant [60], but it also commonly exists in viruses [102], bacteria [66,100], chloroplasts of plants [54,111,141], and animals including mammals [74], and is therefore accepted by most biologists. Modeling the reversal distance for higher organisms is reasonable, as the biological literature reports that reversals are the primary mechanism of genome rearrangement for many eukaryotic genomes [125]. Furthermore, considering reversals only has produced practical results, both tools and theoretical analyses. Watterson et al. [174] made the first attempt to deal with reversal events and defined the sorting by reversals problem, together with a heuristic for computing the reversal distance. Schöniger and Waterman [152] also presented a heuristic method for the case where only non-overlapping inversions, i.e., inversions whose inverted segments do not overlap, are allowed. Biologists acquire gene orders either by sequencing entire genomes or by constructing comparative physical maps. Error-free sequencing provides correct information about the directions of genes and thus allows one to represent a genome as a signed permutation. However, sequencing a whole genome is still expensive and may contain errors, so most available data on gene orders are based on comparative physical maps. Physical maps usually do not provide full information about the directions of genes and hence lead to representing a chromosome as an unsigned permutation. An unsigned permutation may look like a special case of a signed one in which all elements are positive, which would suggest that sorting an unsigned permutation is simpler than sorting a signed one; on the contrary, the former is often harder than the latter in genome rearrangement. Indeed, sorting unsigned permutations by reversals is NP-hard, whereas the signed version can be solved in polynomial time.
The next part first focuses on sorting unsigned linear chromosomes and then considers the signed version; the end of this section demonstrates the equivalence between sorting linear chromosomes and sorting circular ones.
3.1 Unsigned Permutations
When information about the directions of gene segments is not available, a chromosome can be modeled as an unsigned permutation π = π1 π2 ... πn. Given two unsigned linear permutations π and I, the sorting by reversals problem is then to find reversals ρ1, ρ2, ..., ρt such that ρt · ρt−1 · · · ρ1 · π = I and t is minimum. Caprara [33] first showed this problem to be NP-hard, and Berman and Karpinski [28] later proved it to be MAX-SNP hard, implying that, unless P = NP, it cannot be approximated within a factor of 1 + ε for some ε > 0. From the two observations that a reversal eliminates at most two breakpoints and that n − 1 reversals can create any permutation, we instantly obtain the bounds b(π)/2 ≤ dr(π) ≤ n − 1, where b(π) is the number of breakpoints in π. To explain the upper bound, take π = 6 4 1 5 2 3 as an example: the reversal r(1, 3) moves π3 = 1 to its correct position. In general, a series of reversals in which the first moves 1 to its correct position, the second deals with 2, and so on, sorts π. One of the worst cases for this method is π = n 1 2 ... n − 1, which it sorts with n − 1 reversals. In addition to these straightforward bounds, Kececioglu and Sankoff [109] also derived efficient bounds on dr(π) by simulation, allowing a computer to output dr(π) in a few minutes for n ≤ 30. Furthermore, Kececioglu and Sankoff obtained a 2-approximation algorithm for dr(π) based on the structure of strips, where a strip is a maximal subsequence of π without breakpoints. For example, if π = 6 4 1 5 2 3, then 2 3 is a strip; moreover, the strip 2 3 is increasing, whereas the strip 6 is decreasing. By greedily choosing, among the reversals derived from the decreasing or increasing strips, the one that removes the most breakpoints, they proved that each such reversal removes at least one breakpoint, thereby obtaining a 2-approximation algorithm, as illustrated in Fig. 11. They further conjectured that for every permutation there exists an optimal sorting series whose reversals cut no strip of size more than 2, and that there also exists an optimal reversal series that never increases the number of breakpoints. Both conjectures were verified by Hannenhalli and Pevzner [88] by means of their duality theorem for signed permutations [89]. Nevertheless, they found an example for which this procedure fails on strips of size 2, and described an algorithm to fix this problem. In particular, Hannenhalli and Pevzner showed that the sorting by reversals problem for permutations without strips of size one, called singletons, can be solved in polynomial time, which implies that singletons are the major obstacle on the way towards an efficient algorithm.
FIG. 11. (a) Three reversals optimally sort π = 6 4 1 5 2 3, so the reversal distance between π and I is 3; (b) the approximation algorithm developed by Kececioglu and Sankoff [109] sorts π using 4 reversals, each of which removes at least 1 breakpoint. Underlined segments indicate where reversals happen.
The approximation ratio of 2 obtained by Kececioglu and Sankoff [109] was further improved to 1.75 by Bafna and Pevzner [11], then to 1.5 by Christie [43], and finally to 1.375 by Berman, Hannenhalli and Karpinski [27].
FIG. 12. (a) The breakpoint graph G(π) of π = 6 4 1 5 2 3; (b) a maximum cycle decomposition of G(π), with c(π) = 4.
In Section 2, we introduced the breakpoint graph G(π) of a permutation π; such a graph can be recognized in linear time, as shown by Caprara [36]. The graph is closely related to the sorting by reversals problem and can be used to explain why this problem is difficult. Gray and black edges in G(π) constitute an alternating cycle if every two consecutive edges of the cycle have distinct colors. For instance, in Fig. 12(a), the cycle 1 →(b) 5 →(g) 6 →(b) 0 →(g) 1 is alternating, where g and b indicate a gray and a black edge, respectively. From the structure of G(π), two gray edges and two black edges are incident to every vertex, except π0 and πn+1, which each have one gray and one black edge. Since the number of gray edges incident to a vertex v equals the number of black edges incident to v, every vertex has even degree, and thus there exists a decomposition of G(π) into alternating cycles such that every edge of the graph belongs to exactly one cycle of the decomposition [13]. We are interested in the maximum number c(π) of alternating cycles in such a decomposition of G(π). For example, in Fig. 12, c(π) = 4. A reversal can change the number of cycles in a maximum cycle decomposition by at most one, and the number of breakpoints by at most two. Bafna and Pevzner [11] presented a lower bound of n + 1 − c(π) on the distance dr(π), which is much tighter than the bound b(π)/2 derived from breakpoints. Besides, Caprara [36] described a transformation between the sorting by reversals problem and the maximum cycle decomposition problem. The latter was shown to be NP-hard, implying the same complexity for the former. Extensive simulation studies [37,38,109] showed that dr(π) = n + 1 − c(π) in numerous cases, and Caprara [35] demonstrated that dr(π) = n + 1 − c(π) with probability 1 − Θ(1/n^5) for a random permutation π of size n. These results
prompted researchers to derive algorithms that solve the sorting by reversals problem by directly optimizing the parameter c(π). The first approximation algorithm for computing c(π) was obtained by Bafna and Pevzner [11] with ratio 1.75, and was further improved to 1.5 by Christie [43]. Subsequently, Caprara and Rizzi [40] improved the ratio to 33/23 + ε ≈ 1.4348 + ε, for any positive ε, by reducing the problem to the maximum independent set problem and the set packing problem [79]. Lin and Jiang [114] recently extended the techniques of Caprara and Rizzi and incorporated a balancing argument to further improve the approximation ratio to (5073 − 15√1201)/3208 + ε ≈ 1.4193 + ε, for any positive ε. Because of the practical importance of sorting by reversals, numerous researchers have tried to solve it optimally with programs that may run in exponential time. Heath and Vergara [92] implemented an O(n^3 · n!) time algorithm based on dynamic programming to test their conjectures. One of these interesting conjectures is that there exists a sequence of reversals that optimally sorts a permutation π such that each reversal positions either the minimum or the maximum unpositioned element. For example, the permutation π = 3 4 2 5 1 can be optimally sorted by the following sequence: 3 4 2 5 1 ⇒ 3 4 2 1 5 ⇒ 1 2 4 3 5 ⇒ 1 2 3 4 5. The three reversals fix positions 5, 1 and 3 (or 4), respectively, where the corresponding elements are either minimum or maximum unpositioned elements. This conjecture agrees with intuition; however, their program found a counterexample, π = 2 5 3 1 4, which is sorted by the following process: 2 5 3 1 4 ⇒ 2 1 3 5 4 ⇒ 2 1 3 4 5 ⇒ 1 2 3 4 5. Note that the element positioned by the first reversal is neither 1 nor 5, and insisting on positioning 1 or 5 first requires more than 3 reversals to sort π.
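For very small permutations, the reversal distances quoted above can be checked by exhaustive search. The sketch below uses a plain breadth-first search over all permutations; this is our own illustrative choice and not the dynamic programming algorithm of Heath and Vergara. It explores up to n! states and is therefore only usable for small n.

```python
from collections import deque


def reversal_distance(perm):
    """Exact unsigned reversal distance to the identity, by breadth-first search."""
    start = tuple(perm)
    target = tuple(sorted(perm))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == target:
            return dist[cur]
        n = len(cur)
        for i in range(n):
            for j in range(i + 2, n + 1):  # reverse the segment cur[i:j], length >= 2
                nxt = cur[:i] + cur[i:j][::-1] + cur[j:]
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
    return None  # not reached for a valid permutation


for p in ([3, 4, 2, 5, 1], [2, 5, 3, 1, 4], [6, 4, 1, 5, 2, 3]):
    print(p, reversal_distance(p))  # each of these has distance 3
```

All three permutations have b(π)/2 rounded up to 3, so the three-step sorting sequences given in the text are optimal and the search reports 3 in each case.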
In particular, Tran [163] identified a special set of permutations that can be optimally sorted by reversals in polynomial time, by giving a graph-theoretical characterization of these permutations. They are the permutations whose number of breakpoints is exactly twice the reversal distance; the permutation π = 2 5 3 1 4 above is an example. Caprara, Lancia and Ng [37–39] also attempted to find exact algorithms for the sorting by reversals problem. They first designed a branch-and-bound algorithm for this problem [37], whose lower bound is based on the result of Bafna and Pevzner [11] (see also the closely related work of Kececioglu and Sankoff [109]) relating the reversal distance to c(π). To estimate this parameter, they solved a Linear Programming problem with a possibly exponential number of variables by using a column generation scheme, which has been shown to be efficient for many combinatorial optimization problems [15]. This algorithm optimally solved random instances with n = 100 within 2–3 minutes and was later made more efficient by a new Linear Programming technique [38,39].
We now turn to the sorting by reversals problem on signed permutations, which has been extensively studied in computer science and has also led to many practical results.
3.2 Signed Permutations
From the above discussion of unsigned permutations, we know that for the general problem of finding the genomic distance caused by disrupted gene order between two genomes, it is very difficult to find algorithms with better performance. However, this is the situation when information about gene directions is lacking. In practice, every gene in a chromosome has a direction, because DNA is double stranded and each gene resides on one of the strands. Here we consider the sorting by reversals problem on permutations in which each element has a + or − sign indicating its direction. For example, in Fig. 7, the gene content of cabbage is modeled as the signed permutation π = +1 −5 +4 −3 +2. Furthermore, every reversal acting on a segment of a signed permutation changes both the order and the signs of the elements in this segment. We are still interested in the minimum number of reversals dr(π) needed to transform a signed permutation π into the identity permutation I = +1 +2 ... +n. Given a signed permutation π of {1, 2, ..., n}, Hannenhalli and Pevzner [89] first transformed it into an unsigned permutation π′ = π′0 π′1 ... π′2n π′2n+1 of {0, 1, ..., 2n + 1}, called the unsigned mapping of π, with π′0 = 0 and π′2n+1 = 2n + 1, by replacing each positive element +x of π by 2x − 1 and 2x, and each negative element −x by 2x and 2x − 1. For example, if π = +1 −5 +4 −3 +2, then π′ = 0 1 2 10 9 7 8 6 5 3 4 11. Clearly, the signed identity I corresponds to the unsigned identity I′ = 0 1 ... 2n + 1, and each reversal on π corresponds to a reversal on π′. A reversal of the form r(2i + 1, 2j) is said to be legal for π′ because it mimics the reversal r(i + 1, j) on π. The problem of sorting π′ by legal reversals is then equivalent to the problem of sorting π by reversals. The analysis of Hannenhalli and Pevzner is based on the breakpoint graph. Section 2 showed how to construct it, but only for unsigned permutations.
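The doubling transformation described above is easy to express in code. The following Python sketch, in which the function name unsigned_mapping is our own, reproduces the example from the text.

```python
def unsigned_mapping(signed_perm):
    """Map a signed permutation of 1..n to an unsigned permutation of 0..2n+1."""
    image = [0]                            # leading element 0
    for x in signed_perm:
        if x > 0:
            image += [2 * x - 1, 2 * x]    # +x becomes 2x-1, 2x
        else:
            image += [2 * -x, 2 * -x - 1]  # -x becomes 2x, 2x-1
    image.append(2 * len(signed_perm) + 1) # trailing element 2n+1
    return image


print(unsigned_mapping([+1, -5, +4, -3, +2]))
# [0, 1, 2, 10, 9, 7, 8, 6, 5, 3, 4, 11]
```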
For the breakpoint graph of a signed permutation π, we use the breakpoint graph of its unsigned mapping π′ instead. It is again an edge-colored graph G(π), now with 2n + 2 vertices π′0, π′1, ..., π′2n+1, defined as follows: for 0 ≤ i ≤ n, π′2i and π′2i+1 are connected by a black edge, and the elements 2i and 2i + 1 are joined by a gray edge, as shown in Fig. 13. In the previous section, finding a maximum cycle decomposition was a difficult problem closely related to sorting unsigned permutations by reversals. Fortunately, for signed permutations this problem is easy, because each vertex in G(π) is incident to exactly one black and one gray edge, so the decomposition into alternating cycles is unique. It is not hard to verify that the G(π) in Fig. 13 decomposes uniquely into c(π) = 3 alternating cycles. Since the number of cycles for I is the largest among all permutations of size
FIG. 13. The breakpoint graph G(π′) of π′ = 0 1 2 10 9 7 8 6 5 3 4 11, which is the unsigned mapping of π = +1 −5 +4 −3 +2.
FIG. 14. An optimal sorting series for transforming π = +3 +2 +1 into I contains at least one non-proper reversal.
n, finding the reversal distance dr(π) can be regarded as increasing the number of cycles as rapidly as possible. By demonstrating that Δcr = c(r · π) − c(π) ≤ 1 for every reversal r, Hannenhalli and Pevzner immediately obtained the lower bound n + 1 − c(π) on dr(π). Therefore, if all reversals used have Δcr = 1, in which case they are called proper, then we can optimally sort a permutation in n + 1 − c(π) steps. As shown in Fig. 13, the lower bound is 5 + 1 − 3 = 3, suggesting that the three proper reversals in Fig. 7 perform an optimal sorting. Nevertheless, for the permutation π = +3 +2 +1, there is no proper reversal in the first step and thus it cannot be sorted in n + 1 − c(π) = 2 steps (an optimal sorting process is shown in Fig. 14), indicating that, apart from the number of cycles, there are hidden parameters for sorting a signed permutation. Hannenhalli and Pevzner defined the hurdle structure to describe such a hidden obstacle. The permutation shown in Fig. 14 needs one more reversal than the lower bound of 2 because it contains a hurdle. For details on hurdles, we refer the reader to the paper by Hannenhalli and Pevzner [89]. Let h(π) be the number of hurdles in the unsigned mapping of π. They showed that n + 1 − c(π) + h(π) ≤ dr(π) ≤ n + 2 − c(π) + h(π), and
however, there is still a small gap to the optimal solution. They then found that, in some cases where h(π) is odd, there is a singular structure called a fortress, which makes sorting harder. After identifying the fortress, they finally presented a duality theorem for optimally sorting a signed permutation by reversals:
dr(π) = n + 1 − c(π) + h(π) + 1, if π is a fortress,
dr(π) = n + 1 − c(π) + h(π), otherwise.
Furthermore, they also provided two algorithms for this problem, where the complicated one runs in O(n^4) time and the simpler one in O(n^5) time. Since the time complexity of the algorithm developed by Hannenhalli and Pevzner is rather high, Berman and Hannenhalli [26] first improved it to O(n^2 α(n)) time, where α() is the inverse Ackermann function [5], by exploiting more combinatorial properties of the breakpoint graph. By avoiding special data structures, Kaplan, Shamir and Tarjan [105] further improved the running time to O(n dr(π) + nα(n)), based on a union-find structure for efficiently finding reversals. Since α(n) is no larger than four for almost all practical purposes, their algorithm is efficient to implement. Subsequently, Tannier and Sagot [160] proposed an algorithm running in O(n^(3/2) √(log n)) time, which has been the fastest practical algorithm to date, and which also answers an open question of Ozery-Flato and Shamir [138] on whether a subquadratic complexity could ever be achieved for the sorting by reversals problem. If only the reversal distance is needed, Bader, Moret and Yan [9] presented a simple and practical linear-time algorithm for computing the connected components, which results in a linear-time algorithm for calculating the reversal distance. Moreover, the following works try to reduce the computational complexity by using randomization.
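The cycle term c(π) in the duality theorem above can be computed by a direct walk over the breakpoint graph of the unsigned mapping, since every vertex has exactly one black and one gray edge. The Python sketch below, with names of our own choosing, counts the alternating cycles and evaluates only the cycle bound n + 1 − c(π); hurdles and fortresses are not detected, so the result is a lower bound on dr(π) and not the exact distance.

```python
def count_cycles(signed_perm):
    """Number of alternating cycles c(pi) for a signed permutation pi."""
    n = len(signed_perm)
    image = [0]
    for x in signed_perm:                  # doubling: +x -> 2x-1, 2x and -x -> 2x, 2x-1
        image += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
    image.append(2 * n + 1)

    black = {}
    for i in range(0, 2 * n + 2, 2):       # black edge joins positions 2i and 2i+1
        a, b = image[i], image[i + 1]
        black[a], black[b] = b, a

    seen, cycles = set(), 0
    for v in image:
        if v in seen:
            continue
        cycles += 1
        while v not in seen:               # alternate black edge, then gray edge
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = w ^ 1                      # gray edge joins the values 2i and 2i+1
    return cycles


def cycle_lower_bound(signed_perm):
    """Lower bound n + 1 - c(pi) on the signed reversal distance."""
    return len(signed_perm) + 1 - count_cycles(signed_perm)


print(count_cycles([+1, -5, +4, -3, +2]), cycle_lower_bound([+1, -5, +4, -3, +2]))  # 3 3
print(count_cycles([+3, +2, +1]), cycle_lower_bound([+3, +2, +1]))                  # 2 2
```

For π = +3 +2 +1 the bound is 2, while the text explains that a hurdle forces one additional reversal.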
A randomized algorithm is an algorithm that makes random choices during its execution, which can save execution time because the program spends no time finding optimal choices and works with arbitrary ones instead. Although the major disadvantage of this method is that the output may be incorrect, i.e., a non-optimal solution, a well-designed randomized algorithm returns a correct answer with very high probability. For more details, we refer the reader to the textbook by Motwani and Raghavan [133]. Bansal [14] classified all possible reversals and considered a probability of choosing reversals from the classes. Nevertheless, the reversals she chose have no guarantee of being helpful to the transformation of π into I. Recently, Kaplan and Verbin [106,107] described a randomized algorithm that sorts signed permutations by repeatedly drawing a random oriented reversal, which is a reversal that makes two consecutive elements adjacent with the same sign, i.e., it creates an adjacent pair of the form i, i + 1 or −(i + 1), −i. Their method relies on the observation that typically a
very large percentage of oriented reversals is indeed part of a most parsimonious scenario [21]. Furthermore, Kaplan and Verbin designed efficient data structures that maintain a permutation after a reversal is applied and support drawing random oriented reversals from it, where each operation costs sub-linear time. Their randomized algorithm runs in O(n^(3/2) √(log n)) time but fails with very high probability on some permutations. The first polynomial-time algorithm, proposed by Hannenhalli and Pevzner [89], relied on several intermediate constructions that have since been simplified [9,26,105,160], but grasping all the details remains a challenge. Consequently, Bergeron focused on finding a simpler explanation of the Hannenhalli–Pevzner theory relying directly on the overlap graph, and also gave a bit-vector implementation for the reversal problem that runs in O(n^2) bit-vector operations, or in O(n^3/w) operations, where w is the word size of the processor [20,24]. Besides, instead of the cumbersome hurdles and fortresses in the duality theorem of Hannenhalli and Pevzner, Bergeron, Mixtacki and Stoye [22] used PQ-trees to handle them, yielding an efficient and simple algorithm for computing reversal distances. On the other hand, there may exist many optimal sorting series transforming π into I, and it is unclear how to choose among them. In the absence of auxiliary information for determining a plausible scenario, Ajana et al. [7] and Siepel [155] enumerated all minimum-length series of reversals that sort a signed permutation, for the purpose of further tests. Apart from the theoretical analyses, there are several practical tools. Mantin and Shamir [124] implemented the algorithm of [105] as a Java applet. Furthermore, Tesler [162] developed an integrated website, GRIMM, implementing the algorithms of [9,88,89] to sort signed or unsigned, linear or circular permutations by reversals.
Figure 15 is an example provided by GRIMM to show the possible rearrangement scenarios among Herpes simplex virus (HSV), Epstein–Barr virus (EBV) and Cytomegalovirus (CMV) [86], and their phylogenetic tree.
3.3 Circular Permutations
Watterson et al. [174] made the first attempt at the problem of computing the reversal distance between a circular permutation πc and the circular identity permutation Ic in the unsigned case. A circular unsigned permutation is the circular arrangement of the elements of a linear permutation in clockwise direction. Watterson et al. gave the rudimentary bounds b(πc)/2 ≤ dr(πc) ≤ n − 1, where dr(πc) is the reversal distance between πc and Ic, and also presented a stochastic algorithm for this problem. Subsequently, Solomon, Sutcliffe and Lister [158] assumed that rotations and reflections of an unsigned circular permutation make no difference, see Fig. 16 as an
FIG. 15. (a) Three signed permutations of HSV, EBV and CMV; (b) Their reversal distance matrix and the corresponding phylogenetic tree; (c) A possible rearrangement scenario consists of five reversals for transforming HSV into CMV outputted by GRIMM [162].
example, which are straightforward from a three-dimensional view. A reversal applied to πc has a similar effect to one applied to a linear permutation π, i.e., it just reverses the order of the elements in a segment (Fig. 16(c)). Therefore, the circular version of the sorting by reversals problem is well defined. To our surprise, Solomon et al. showed that, under these assumptions, sorting circular permutations by reversals can be reduced to the same problem in the linear case, thereby indicating that it is also NP-hard. Sorting signed circular permutations has an analogous result. A signed circular permutation πc = (πc1 πc2 ... πcn) can be regarded as a circular arrangement of the elements of a signed permutation π, where the sign “+” indicates the clockwise direction and “−” the counterclockwise one. As shown in Fig. 17, rotations and reflections of a signed circular permutation are similar to those of an unsigned one, but reflections differ slightly: a reflection of a signed πc changes both the order and the signs of its elements. Under these assumptions, Meidanis, Walter and Dias [128] demonstrated the equivalence of sorting by reversals on linear permutations and on circular ones.
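The identification of rotations and reflections can be illustrated with a short Python sketch. It enumerates all linear readings of a signed circular permutation and tests whether two linear representations denote the same circular permutation. The function names are our own, and the brute-force enumeration is chosen for clarity, not efficiency.

```python
def circular_forms(signed_perm):
    """All linear readings of a signed circular permutation.

    A rotation keeps the signs; a reflection reverses the order
    and flips the sign of every element.
    """
    p = list(signed_perm)
    mirrored = [-x for x in reversed(p)]
    forms = set()
    for q in (p, mirrored):
        for k in range(len(q)):
            forms.add(tuple(q[k:] + q[:k]))  # rotation starting at index k
    return forms


def same_circular(a, b):
    """True if a and b denote the same signed circular permutation."""
    return len(a) == len(b) and tuple(b) in circular_forms(a)


print(same_circular([+1, -5, +4, -3, +2], [+2, +1, -5, +4, -3]))  # True (rotation)
print(same_circular([+1, -5, +4, -3, +2], [-1, -2, +3, -4, +5]))  # True (reflection)
```

Both pairs are the ones shown in Fig. 17.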
FIG. 16. (a) Rotations of a permutation, πc = (1 5 4 3 2) = (2 1 5 4 3) = · · · = (5 4 3 2 1); (b) Reflection of a permutation, πc = (1 5 4 3 2) = (1 2 3 4 5); (c) The reversal acts on the segment containing 1 and 5, indicated by the dotted line, by inverting their order.
FIG. 17. (a) Rotations of πc = (+1 −5 +4 −3 +2) = (+2 +1 −5 +4 −3) = · · · = (−5 +4 −3 +2 +1); (b) Reflection of πc = (+1 −5 +4 −3 +2) = (−1 −2 +3 −4 +5); (c) The reversal acts on the segment containing +1 and −5, indicated by the dotted line, by changing both their order and their signs.
In fact, there is a simple way to view the relationship between linear and circular permutations. Figure 18 shows two equivalent reversals of the signed circular permutation πc = (+1 −5 +4 −3 +2). A reversal acting on πc can therefore always leave πc1 unchanged: if an inverted segment contains πc1, we use the equivalent reversal that does not involve πc1 instead. With this replacement, a reversal series that sorts πc is also a feasible series for sorting π, implying that the problem of sorting signed linear permutations by reversals reduces to that of sorting signed circular ones. The other direction, from linear to circular, can be proved similarly. The problems of sorting linear and circular permutations by reversals are consequently equivalent.
FIG. 18. (a) The reversal acts on the segment containing +1 and −5; (b) the reversal acts on the segment containing +2, −3 and +4, and its effect is the same as in (a) since reflections and rotations are not taken into account.
4. Sorting by Transpositions/Block-Interchanges
A transposition is an exchange of two adjacent segments on a chromosome, while a block-interchange swaps two non-intersecting segments that need not be adjacent, so the latter is a generalization of the former. The effect of a transposition can also be seen as the result of two steps: cutting a segment and placing it at another location on the chromosome. For this reason, some people call it a cut-and-paste operation. In biology, transpositions are rare events in contrast with reversals, and usually accompany other events such as reversals or translocations (introduced in the next section). Liu and Sanderson [119] identified inversions and transpositions in the bacterium Salmonella typhi, while Seoighe et al. [153] estimated that gene adjacencies in yeast have been broken frequently by rearrangements such as inversions, transpositions and translocations. Moreover, Coghlan and Wolfe [49] inferred that 517 chromosomal rearrangements, including inversions, transpositions and translocations, separate the nematodes Caenorhabditis elegans and Caenorhabditis briggsae. Zhang and Peterson [180], in particular, demonstrated a new intramolecular transposition mechanism by which transpositions can greatly impact genome evolution. Although block-interchange is a generalization of transposition, it has been studied much less than reversal, transposition and translocation. A justification may be that large-scale exchanges of segments are much less often observed by biologists. Fliess, Motro and Unger [76] reported swaps of short fragments in protein evolution, and Slamovits et al. [157] observed a similar phenomenon of swapped segments when comparing Antonospora locustae (formerly Nosema locustae) with the human parasite Encephalitozoon cuniculi. From the theoretical point of view, sorting by transpositions (resp. sorting by block-interchanges) is the problem of finding the minimum number of transpositions (resp. block-interchanges), denoted dtr(π) (resp. dbi(π)), needed to sort an unsigned permutation π. In 1996, Christie [42] solved the sorting by block-interchanges problem in polynomial time, while the complexity of sorting by transpositions has remained unknown so far. Interestingly, for both problems, sorting linear permutations is equivalent to sorting circular ones, which can be shown by a simple observation. Below we first introduce the history of several approximation algorithms for the transposition problem and some efficient implementations. Then we present two approaches for optimally solving the block-interchange problem, together with two websites, ROBIN [121] and SPRING [117], which automatically find a rearrangement scenario among two or more homologous sequences given as input.
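The two operations can be written as list manipulations. In the Python sketch below, the function names and the 0-based, half-open index conventions are assumptions of this illustration and do not follow any particular paper; the example permutation is arbitrary.

```python
def transposition(perm, i, j, k):
    """Exchange the adjacent segments perm[i:j] and perm[j:k] (0-based, i < j < k)."""
    assert 0 <= i < j < k <= len(perm)
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


def block_interchange(perm, i, j, k, l):
    """Swap the disjoint segments perm[i:j] and perm[k:l] (0-based, i < j <= k < l)."""
    assert 0 <= i < j <= k < l <= len(perm)
    return perm[:i] + perm[k:l] + perm[j:k] + perm[i:j] + perm[l:]


p = [1, 6, 5, 4, 7, 3, 2]
print(transposition(p, 1, 3, 5))         # [1, 4, 7, 6, 5, 3, 2]
print(block_interchange(p, 1, 2, 5, 7))  # [1, 3, 2, 5, 4, 7, 6]
```

When j equals k, the two swapped segments are adjacent and block_interchange reduces to transposition, which reflects the statement that block-interchange generalizes transposition.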
4.1 Sorting by Transpositions
In the late 1980s, Aigner and West [6] considered two rearrangement problems whose operations can be regarded as variations of the transposition. One restricts the operation to removing the leading element and reinserting it somewhere in the permutation. The other is analogous, except that the leading element is always reinserted at the position equal to its value, e.g., 3 4 1 2 ⇒ 4 1 3 2. The sorting by transpositions problem itself was first studied by Bafna and Pevzner [12], who first derived a 1.75-approximation algorithm and then improved it to a factor of 1.5 with running time O(n^2). Since no transposition can change the signs of elements when it acts on a permutation, all permutations discussed here are unsigned. Nevertheless, the breakpoint graph G(π) of π is built as for a signed permutation, rather than by the construction introduced in Section 2. In other words, we replace each element x in π by 2x − 1 and 2x, and add π0 = 0 and π2n+1 = 2n + 1 to π. The remainder of the procedure, i.e., the connection of black and gray edges, is the same as for the breakpoint graph of a signed permutation. Besides, let the size of a cycle in G(π) be the number of gray edges
it contains. A cycle is odd if its size is odd, and the number of odd cycles of a permutation π is denoted by codd(π). Sorting π by transpositions is then equivalent to increasing the number of odd cycles to its maximum, because all cycles in G(I) are odd. Bafna and Pevzner demonstrated that dtr(π) ≥ (n + 1 − codd(π))/2, a lower bound that was later strengthened to (n + 1 − codd(π))/2 + h(π)/2 [44], and developed an algorithm that sorts π in at most (3/4)(n + 1 − codd(π)) transpositions, thereby ensuring an approximation ratio of 1.5. Subsequent works mainly focused on simplifying this approximation algorithm. Christie [44] gave a somewhat simpler algorithm with the same approximation ratio, but a worse running time of O(n^4). Next, Hartman and Shamir [90] were the first to study the transposition problem on circular permutations, and obtained a simpler approximation algorithm with the same running time and ratio as the result of Bafna and Pevzner. In order to handle circular unsigned permutations, they also constructed a breakpoint graph for them, analogous to G(π), as shown in Fig. 19. In particular, they also obtained the same ratio of 1.5 and O(n^2) running time for the problem of sorting by transpositions and transreversals, where a transreversal inverts one of the two transposed segments [91].
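The odd-cycle lower bound can be evaluated with the same doubling construction, treating every element as positive. The Python sketch below, with a function name of our own, counts the gray edges on each cycle to obtain codd(π) and returns (n + 1 − codd(π))/2; the hurdle term of the strengthened bound is not computed.

```python
def odd_cycle_lower_bound(perm):
    """Lower bound (n + 1 - c_odd(pi)) / 2 on the transposition distance."""
    n = len(perm)
    image = [0]
    for x in perm:
        image += [2 * x - 1, 2 * x]   # every element is treated as positive
    image.append(2 * n + 1)

    black = {}
    for i in range(0, 2 * n + 2, 2):  # black edge joins positions 2i and 2i+1
        a, b = image[i], image[i + 1]
        black[a], black[b] = b, a

    seen, odd = set(), 0
    for start in image:
        if start in seen:
            continue
        size, v = 0, start
        while v not in seen:
            seen.add(v)
            w = black[v]
            seen.add(w)
            v = w ^ 1                 # gray edge joins the values 2i and 2i+1
            size += 1                 # one gray edge traversed per step
        odd += size % 2
    # The cycle sizes sum to n + 1, so n + 1 - odd is always even.
    return (n + 1 - odd) // 2


print(odd_cycle_lower_bound([3, 2, 1]))     # 2
print(odd_cycle_lower_bound([4, 3, 2, 1]))  # 2
```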
FIG. 19. An example in which the permutation $\pi^c = (1\ 6\ 5\ 4\ 7\ 3\ 2)$ is sorted with 4 transpositions produced by the algorithm of Hartman and Shamir [90]. Each exchanged segment of a transposition is the region delimited by two short lines placed on two black edges in $G(\pi^c)$, and is also underlined in $\pi^c$.
Furthermore, Walter et al. [167,169] developed implementations and slightly improved the results obtained by the three algorithms mentioned above for the transposition problem. More recently, Elias and Hartman [67] presented a 1.375-approximation algorithm whose proofs were systematically generated with the aid of a computer. It improves the ratio of 1.5 for finding $d_{tr}(\pi)$, which had stood for ten years since 1995. On the other hand, Guyer, Heath and Vergara [84] provided several heuristic approaches and experiments for this problem. The transposition diameter $D_{tr}(n)$ of the symmetric group $S_n$, which is the maximum of $d_{tr}(\pi)$ over all permutations π of size n, is still unknown. Bafna and Pevzner [12] showed that $\frac{3}{4}n$ is an upper bound for $D_{tr}(n)$, which was reduced to $\lfloor (2n - 2)/3 \rfloor$ for $n \ge 9$ by Eriksson et al. [71]. Christie [44], Eriksson et al. [71], and Meidanis, Walter and Dias [127] independently gave a lower bound of $\lfloor n/2 \rfloor + 1$ by showing that the transposition distance between a permutation and its reverse is $\lfloor n/2 \rfloor + 1$. Furthermore, Elias and Hartman [67] provided the exact diameters for some kinds of permutations and an upper bound of $11\lfloor n/24 \rfloor + \lfloor 3((n/3) \bmod 8)/2 \rfloor + 1$ on the diameter of 3-permutations, a special collection of permutations π such that all cycles in G(π) have length 3. This upper bound is also the basis of the ratio 1.375 for the sorting by transpositions problem. In addition, by restricting the operation to prefix transpositions of the form tr(1, i, j) for $1 < i < j \le n$, Dias and Meidanis [59] obtained a 2-approximation algorithm for the problem of determining the minimum number of prefix transpositions needed to sort π. They conjectured that the diameter of the prefix transposition distance is $n - \lfloor n/4 \rfloor$ and presented several tests to support it. Subsequently, Fortuna and Meidanis [77] gave a complete proof that $d_{pref}(\pi) = n - \lfloor n/4 \rfloor$ when π = n n − 1 . . . 1, i.e., the reverse of I.
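As a small check of the lower bound on the diameter, the exact transposition distance can be computed by brute force for very short permutations. The sketch below is a plain breadth-first search over all permutations, not any of the approximation algorithms cited above; it is only feasible for small n, and the function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def transpositions(p):
    """Yield every permutation obtained by swapping two adjacent blocks of p."""
    n = len(p)
    for i, j, k in combinations(range(n + 1), 3):
        # Blocks p[i:j] and p[j:k] are both non-empty and adjacent.
        yield p[:i] + p[j:k] + p[i:j] + p[k:]


def transposition_distance(perm):
    """Exact distance to the identity by breadth-first search (small n only)."""
    start, goal = tuple(perm), tuple(sorted(perm))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        if p == goal:
            return dist[p]
        for nxt in transpositions(p):
            if nxt not in dist:
                dist[nxt] = dist[p] + 1
                queue.append(nxt)


for n in range(3, 7):
    reverse = list(range(n, 0, -1))
    print(n, transposition_distance(reverse), n // 2 + 1)
```

For n from 3 to 6 the two printed values agree, in line with the stated distance of the reverse permutation.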
4.2 Sorting by Block-Interchanges
As to block-interchange, Mohammed and Subi [132] first mentioned it in 1987, to the best of our knowledge. Their problem is how to swap two non-overlapping blocks of contiguous elements using a minimum number of constrained block-interchanges, each of which exchanges two elements at a time. For example, given the permutation π = 1 8 9 5 6 7 2 3 4 10, the task is to sort it using the minimum number of constrained block-interchanges, such as swapping the elements 8 and 2. They solved the problem exactly, and Fig. 20 shows an example of their algorithm. The general sorting by block-interchanges problem was first studied by Christie [42], who gave an $O(n^2)$-time algorithm that solves it optimally based on the breakpoint graph. He also determined the diameter of the block-interchange distance, which is $\lfloor n/2 \rfloor$. Figure 21(b) is an example of his algorithm for sorting π = 4 2 1 3 6 5 8 7. Moreover, Lin et al. [116] studied the same problem on circular chromosomes based
FIG. 20. The two blocks to be swapped are shown in boldface. Let $S_1$ and $S_2$ be the numbers of elements in the two swapped blocks, and let M be the number of elements in the middle block between them, which may be zero. Mohammed and Subi [132] showed that the minimum number of constrained block-interchanges required to swap the two blocks in this example is $S_1 + S_2 + M - \gcd(S_1 + M, S_2 + M) = 2 + 3 + 3 - \gcd(5, 6) = 7$.
FIG. 21. (a) A circular chromosome is taken as a permutation in the group-theoretic sense, and the effect of a block-interchange is modeled as the composition of two 2-cycles, indicated by the underlines; (b) The permutation π is optimally sorted by the algorithm of Christie [42] with $(n + 1 - c(\pi))/2 = 4$ block-interchanges, where $c(\pi)$ is the number of cycles in G(π); (c) The permutation $\pi^c$ can be optimally sorted by $(n - f(I(\pi^c)^{-1}))/2 = (8 - 2)/2 = 3$ block-interchanges derived from the algorithm of Lin et al. [116], where $f(I(\pi^c)^{-1})$ denotes the number of disjoint cycles in the cycle decomposition of $I(\pi^c)^{-1}$.
on the permutation group in algebra. Here we somewhat abuse the term permutation, since it appears both in the permutation groups of algebra and in the traditional model of genome rearrangement problems. In their model, chromosomes correspond to permutations in group theory and a block-interchange corresponds to two particular 2-cycles. The effect of applying a block-interchange to a chromosome is modeled as the permutation composition (function composition) of two 2-cycles with $\pi^c$, as illustrated in Fig. 21(a). Their strategy is to decompose $I(\pi^c)^{-1}$, where $(\pi^c)^{-1}$ is the inverse permutation of $\pi^c$, and $I(\pi^c)^{-1}$ is again a permutation in the group. Although they started from circular chromosomes, they also showed the equivalence between sorting linear permutations and sorting circular ones. Figure 21(c) is an example of their algorithm for sorting $\pi^c$ = (4 2 1 3 6 5 8 7). From their experimental results, Lin et al. concluded that block-interchange events seem to play a significant role in the evolution of three vibrio species, V. vulnificus, V. parahaemolyticus and V. cholerae. A website, called ROBIN, was developed by Lu et al. [121] for the sorting by block-interchanges problem. Instead of gene order, they use the order of landmarks to represent sequences and compute the block-interchange distance for each pair of them. ROBIN automatically identifies Locally Collinear Blocks (LCBs) to serve as landmarks among the input sequences by integrating the program of Darling et al. [58]. Lu et al. also repeated the experiment of Lin et al. and obtained the same result.
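The three counting formulas quoted in Figs. 20 and 21 can be checked numerically with the sketch below. It only evaluates the formulas and does not produce the sorting scenarios of the cited algorithms. Two assumptions are made: for the linear case, cycles are counted on the usual cycle graph of an unsigned permutation (black edge from $\pi_i$ to $\pi_{i-1}$, gray edge from i to i + 1), which may differ in detail from G(π) of Section 2; for the circular case, fixed points are counted as cycles of length one. All function names are hypothetical.

```python
from math import gcd


def constrained_swap_count(s1, s2, m):
    """Formula from Fig. 20: S1 + S2 + M - gcd(S1 + M, S2 + M)."""
    return s1 + s2 + m - gcd(s1 + m, s2 + m)


def linear_distance(perm):
    """(n + 1 - c(pi)) / 2 for a linear unsigned permutation."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]
    pos = {v: i for i, v in enumerate(ext)}
    seen = [False] * (n + 2)
    cycles = 0
    for start in range(1, n + 2):
        if seen[start]:
            continue
        cycles += 1
        i = start
        while not seen[i]:
            seen[i] = True
            # black edge i leads to ext[i-1]; its gray edge leads to ext[i-1] + 1
            i = pos[ext[i - 1] + 1]
    return (n + 1 - cycles) // 2


def circular_distance(chromosome):
    """(n - f(I (pi^c)^-1)) / 2 for a circular chromosome given as one cycle."""
    cyc = list(chromosome)
    n = len(cyc)
    pi = dict(zip(cyc, cyc[1:] + cyc[:1]))          # x -> successor on pi^c
    pi_inv = {b: a for a, b in pi.items()}
    identity = {x: x % n + 1 for x in range(1, n + 1)}  # the cycle (1 2 ... n)
    product = {x: identity[pi_inv[x]] for x in pi_inv}
    seen, f = set(), 0
    for x in product:
        if x in seen:
            continue
        f += 1
        while x not in seen:
            seen.add(x)
            x = product[x]
    return (n - f) // 2


print(constrained_swap_count(2, 3, 3))               # example of Fig. 20
print(linear_distance([4, 2, 1, 3, 6, 5, 8, 7]))     # example of Fig. 21(b)
print(circular_distance([4, 2, 1, 3, 6, 5, 8, 7]))   # example of Fig. 21(c)
```

The number of cycles of the product does not depend on the order of composition, since the two possible products are conjugate, so the convention chosen here does not affect the result.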
5. Sorting by Translocations
We have introduced three kinds of events in previous sections, reversal, transposition and block-interchange, all of which act on a single chromosome. In this section, we are interested in the translocation event, which acts on two segments of a multichromosomal genome, where the two segments belong to two different chromosomes. Before formulating this operation, some background is needed to describe the corresponding situation in biology. Depending on the position of the centromere along the length of a chromosome, chromosomes are classified into two types. One is the acrocentric chromosome, in which the centromere occurs at one end of the chromosome, while the other is the metacentric chromosome, whose centromere lies near the middle of the chromosome (Fig. 22). Within a genome, every chromosome is either acrocentric or metacentric and, furthermore, an acrocentric chromosome has a reading direction determined by the location of its centromere. In the early 1930s, Creighton and McClintock [56] presented an elegantly simple experiment on Zea mays to show the interactions of two allelomorphic factors in the
FIG. 22. (a) The structure of a chromosome [2]; (b) Representation of the 23 paired chromosomes (the chromosomes X and Y are paired) of the human male, where chromosome 6 is a metacentric chromosome that constitutes about 6% of the human genome [134] and chromosome 13 is the largest acrocentric chromosome, constituting about 4% of the human genome [62].
same linkage group accompanied by cytological and genetical crossing-over. Although translocation events were not mentioned explicitly, the phenomenon they described makes this arguably the first work related to translocations. Recently, Coe and Kass [48] reviewed the data surrounding the paper of Creighton and McClintock and provided a perspective on the significance of their findings. Translocation events occur as frequently as reversals and are commonly observed in viruses [96], bacteria [100], yeast [63] and mammals [172,110]. In particular, Courtay-Cahen, Morris and Edwards [55] demonstrated that translocation events appear in breast cancer, and, from the clinical diagnosis of a patient, Heller et al. [93] reported a complex translocation event between the two homologous chromosomes 5 in Philadelphia-negative chronic myelogenous leukemia (CML). On the theoretical side, given two multichromosomal genomes Π and Γ that share the same set of genes, the sorting by translocations problem is to find the minimum number of translocations, denoted by $d_{tl}(\Pi)$, for transforming Π into Γ. Here, Π = {π(1), . . . , π(N)} and Γ = {γ(1), . . . , γ(M)} are genomes consisting of N and M chromosomes respectively, and $\pi(i) = \pi(i)_1 \pi(i)_2 \ldots \pi(i)_{n_i}$ is composed of the $n_i$ genes in the ith chromosome (γ(i) is defined similarly). In particular, the direction of each chromosome is irrelevant, i.e., π(i) = −π(i).
Kececioglu and Ravi [108] first noticed this problem and provided two approximation algorithms with respect to two types of translocations in a directed and an undirected model. Given two chromosomes X = X1X2 and Y = Y1Y2, a prefix-prefix translocation exchanges X1 and Y1, and a prefix-suffix translocation exchanges X1 and Y2, as illustrated in Fig. 23(a). Note that one of the two swapped segments may be empty. To make the definition concrete, Fig. 23(b) shows a parsimonious scenario, obtained from the tool developed by Feng, Wang and Zhu [73], that transforms Π into Γ by translocations. The directed model concerns acrocentric chromosomes and allows only prefix-prefix translocations. In other words, there are no orientations in either genes or chromosomes. The other is the undirected model, which deals with metacentric chromosomes and allows both prefix-prefix and prefix-suffix translocations. Signed data are considered only in the case of chromosomes with no absolute reading directions, i.e., the undirected model. In both the directed and undirected models, Kececioglu and Ravi gave 2-approximation algorithms for the sorting by translocations problem that run in $O(k^2 N^2)$ time, where N is the number of chromosomes and k is the maximum number of genes among all chromosomes. Cui, Wang and Zhu [57] recently improved the approximation ratio to a factor of 1.75. Furthermore, if the two swapped
FIG. 23. (a) The two types of translocations on signed chromosomes. For unsigned chromosomes, the directions of the chromosomes are simply omitted; (b) An example of a parsimonious series of translocations for transforming Π into Γ, obtained from the website CTRD [73].
segments of a translocation are restricted to be of equal length, Kececioglu and Ravi proposed an exact algorithm with $O(kN)$ time for both models. Later, Hannenhalli [85] studied the most common type of translocation, the reciprocal translocation, in the directed model, where the four segments X1, X2, Y1 and Y2 are all assumed to be non-empty. His analysis was also based on the breakpoint graph and ignored the centromere for simplicity. Hannenhalli solved this problem exactly by providing an algorithm with $O(n^3)$ running time and a formula for $d_{tl}(\Pi)$, which can be computed in linear time according to a recent study by Li et al. [113]. Afterward, Wang et al. [170] gave an algorithm running in $O(n^2)$ time that produces an optimal series of translocations, improving on the analogous algorithm with $O(n^2 \log n)$ time presented by Zhu and Ma [181]. However, the translocation distance calculated by Hannenhalli's algorithm may contain an error that leads to failure in finding parsimonious scenarios in some cases. Recently, Bergeron, Mixtacki and Stoye [23] corrected the error and gave a new algorithm for the sorting by translocations problem.
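The two kinds of translocations defined for X = X1X2 and Y = Y1Y2 can be sketched on signed chromosomes stored as Python lists. This is only an illustration of the operations, not any of the sorting algorithms cited above. The convention that segments are reversed and negated when a prefix takes the place of a suffix is an assumption made here for signed genes, and the function names and cut-point parameters i and j are hypothetical.

```python
def flip(segment):
    """Reverse a signed segment and negate every gene."""
    return [-g for g in reversed(segment)]


def prefix_prefix(x, y, i, j):
    """Exchange the prefix x[:i] (X1) with the prefix y[:j] (Y1)."""
    return y[:j] + x[i:], x[:i] + y[j:]


def prefix_suffix(x, y, i, j):
    """Exchange the prefix x[:i] (X1) with the suffix y[j:] (Y2).

    Each moved segment is flipped so that it reads correctly in its new place.
    """
    return flip(y[j:]) + x[i:], y[:j] + flip(x[:i])


X, Y = [1, 2, 3], [4, 5, 6]
print(prefix_prefix(X, Y, 1, 2))
print(prefix_suffix(X, Y, 1, 2))
```

Choosing i = 0 or j = len(y) gives an empty swapped segment, which the definition above allows; a reciprocal translocation in the sense of Hannenhalli would require all four segments to be non-empty.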
6. Sorting by Multiple Operations
Considering several kinds of events together reflects the real evolution of species more faithfully. For some groups of species, rearrangement events appear to be strongly biased toward one type of event, but most of the time all types of events can occur. Reversals are the most common events within a single chromosome, while translocations are the most common events in a multichromosomal genome. Even so, they are usually accompanied by fissions, fusions, transpositions, block-interchanges, etc. Below we introduce several combinations of operations for sorting permutations. Moreover, by assigning weights to the operations, the evolutionary process can favor or disfavor some events, thereby exhibiting more diverse phylogenetic paths.
6.1 Reversal + Transposition/Block-Interchange
Sorting by reversals and transpositions is the problem of finding the cheapest series of reversals and transpositions that transforms the unsigned permutation π (resp. the signed permutation $\vec{\pi}$) into I (resp. $\vec{I}$). The minimum number of reversals and transpositions is conventionally taken as the distance between two permutations and denoted by $d_{r+tr}(\pi)$. A computational approach to this problem was pioneered by Sankoff [148], who designed the program DERANGE based on the techniques of alignment reduction and a branch-and-bound search. Figure 24 is an example of how alignment reduction can help the sorting process.
FIG. 24. (a) An example of alignment reduction. Dotted lines represent elements with the same direction in both permutations, while solid lines indicate elements with opposite directions; (b) A sorting example with three reversals that shows how alignment reduction is used.
Furthermore, DERANGE allows user-specified weights $w_r$ for reversals and $w_{tr}$ for transpositions, and looks for the parsimonious series having the minimum sum of weights. Sankoff experimented on mitochondrial data of fungi with several choices of weights and concluded that assigning equal weights to reversals and transpositions is appropriate. Next, Blanchette, Kunisawa and Sankoff [30] improved the performance and provided a newer version, DERANGE II [1]. They tested 37 homologous genes in human and Drosophila, and further concluded that $2w_r < w_{tr} < 2.5w_r$, obtained by comparing the Drosophila–human permutation with random permutations, is an appropriate weighting in their experiment. On the other hand, Walter, Dias and Meidanis [168] gave a 3-approximation algorithm for computing $d_{r+tr}(\pi)$ (unsigned case) and a 2-approximation algorithm for computing $d_{r+tr}(\vec{\pi})$ (signed case). Apart from reversals and transpositions, Gu, Peng and Sudborough [82] also took into consideration the inverted transposition, i.e., the transreversal (Fig. 25), and proposed a 2-approximation algorithm for this problem on signed permutations. Subsequently, Lin and Xue [115] also presented 2-approximation algorithms for two problems: sorting by reversals and transpositions, and sorting by reversals, transpositions and inverted transpositions. Furthermore, they allowed a special event called both inverted transposition, which inverts two adjacent segments at a time (Fig. 25), and presented a better 1.75-approximation algorithm for the problem of sorting by reversals and the three types of transpositions shown in Fig. 25. When $\vec{\pi} = -1\ -2\ \ldots\ -n$, Meidanis, Walter and Dias [129] found that $d_{r+tr}(\vec{\pi}) = \lfloor n/2 \rfloor + 2$ for $n \ge 3$, and conjectured that this value is the diameter of this genomic distance.
FIG. 25. Examples showing the effects of a transposition, an inverted transposition and a both inverted transposition, where the two swapped segments are −5, +4 and −3, indicated by underlines.
On the other hand, the combination of reversals and (inverted) transpositions was favored by several researchers when assigning different weights to the operations. Eriksen [70] designed a simulation to show that a suitable weight is 1 for reversals and 2 for (inverted) transpositions, and also proposed a (1 + ε)-approximation algorithm for the sorting by reversals and (inverted) transpositions problem under such a weight assignment [69]. In particular, the approach proposed by Miklós [130] can estimate the weighted sum of reversals and (inverted) transpositions without fixing their weights beforehand, by using the Markov Chain Monte Carlo (MCMC) method on a stochastic model of the three operations. Recently, Miklós, Ittzés and Hein [131] implemented a web server, ParIS, for a Bayesian analysis of the same three operations. Moreover, Erdem and Tillier [68] treated genome rearrangement as a planning problem, and allowed restrictions on the number or cost of events, the length of the involved segments and additional constraints to guide the search. With this planning approach, they constructed the phylogenetic tree of the chloroplast genomes of Campanulaceae (flowering plants) according to their reversal and transposition distance matrix. The groupings of chloroplast genomes on their tree coincided with the ones in the consensus tree proposed by Cosner et al. [53, Fig. 4].
Since there was less progress on the problem of sorting by reversals and transpositions, and transposition is a special case of block-interchange, a feasible approach is to consider the problem of sorting by reversals and block-interchanges. When the weight to reversals is 1 and to block-interchanges is 2, Lin, Lu and Tang [118] solved
it by proposing a simple algorithm with $O(n^2)$ running time. Their algorithm first distinguishes between oriented and unoriented components, and then sorts them independently by reversals and block-interchanges, respectively. Furthermore, the number of block-interchanges in their sorting series is shown to be minimum among all optimal sorting sequences. Such a sorting series implicitly suggests that the scenario derived from it agrees with the biological observation that transpositions are rare in contrast to reversals [16].
6.2 Reversal + Translocation (Including Fusion and Fission)
Given two multichromosomal genomes Π and Γ as defined above, the problem considered in this section is to find a minimum number of reversals and translocations for transforming Π into Γ. As noted in Section 5, restricting attention to reciprocal translocations, in which the two swapped segments of the two chromosomes are non-empty, leads to a polynomial-time algorithm, so this restriction is adopted here. Moreover, two special operations are additionally considered. One is fusion, which concatenates two chromosomes π(i) and π(j), resulting in a new chromosome $\pi(i)_1 \pi(i)_2 \ldots \pi(i)_{n_i} \pi(j)_1 \pi(j)_2 \ldots \pi(j)_{n_j}$ and an empty chromosome. The other is fission, in which one chromosome π(i) is broken into two chromosomes $\pi(i)_1 \pi(i)_2 \ldots \pi(i)_{j-1}$ and $\pi(i)_j \pi(i)_{j+1} \ldots \pi(i)_{n_i}$. Clearly, a fusion reduces the number of (non-empty) chromosomes, whereas a fission increases it. Fusions and fissions bring about differences in the number of chromosomes between two genomes, which are rather common in mammalian evolution. For example, the human genome has 46 chromosomes, while the mouse genome contains 40 chromosomes. Kececioglu and Ravi [108] first analyzed rearrangements of multichromosomal genomes, and proposed a 1.5-approximation algorithm based on the result of Bafna and Pevzner [11] for sorting by reversals alone. Nevertheless, they assumed that all chromosomes in a genome have the same number of genes, which is not the case for many organisms, e.g., human and mouse. The subsequent model, including fissions and fusions, was first proposed by Hannenhalli and Pevzner [87], who gave a duality theorem for computing the genomic distance in terms of seven parameters, together with a polynomial-time algorithm. Their idea is first to concatenate the N (resp. M) chromosomes of Π (resp. Γ) into a new permutation π (resp. γ), and then to mimic the genomic sorting of Π into Γ by transforming π into γ with reversals (Fig. 26). The difficulty of this approach is that there are $N!\,2^N$ different concatenates for Π and Γ, and only some of them, called optimal concatenates, can mimic an optimal sorting of Π into Γ. Hannenhalli and Pevzner used techniques called flipping and capping to find an optimal concatenate among the numerous concatenates.
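The fusion and fission operations, and the count of $N!\,2^N$ concatenates, can be illustrated with a small sketch in which a genome is a list of signed chromosomes. It simply enumerates all orders and orientations of the chromosomes; it does not implement flipping, capping or the search for an optimal concatenate, and the function names are hypothetical.

```python
from itertools import permutations, product


def flip(chromosome):
    """Read a signed chromosome in the opposite direction."""
    return [-g for g in reversed(chromosome)]


def fusion(genome, i, j):
    """Concatenate chromosomes i and j; chromosome j becomes empty."""
    g = [list(c) for c in genome]
    g[i] = g[i] + g[j]
    g[j] = []
    return g


def fission(genome, i, k):
    """Break chromosome i into two chromosomes before 0-based index k."""
    g = [list(c) for c in genome]
    g[i:i + 1] = [g[i][:k], g[i][k:]]
    return g


def concatenates(genome):
    """All concatenations over every order and orientation of chromosomes."""
    result = []
    for order in permutations(genome):
        for flips in product((False, True), repeat=len(order)):
            seq = []
            for chromosome, flipped in zip(order, flips):
                seq += flip(chromosome) if flipped else list(chromosome)
            result.append(tuple(seq))
    return result


genome = [[1, 2], [3, 4, 5], [6]]
print(fusion(genome, 0, 2))
print(fission(genome, 1, 1))
print(len(concatenates(genome)))   # 3! * 2**3 concatenates for N = 3
```

Even for this toy genome with three chromosomes there are 48 concatenates, which indicates why an exhaustive search over concatenates is not practical for larger N.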
FIG. 26. Each of the two types of translocations can be mimicked by a reversal in a concatenated permutation. Notice that X = −X for a chromosome X.
Although the sorting by reversals and translocations problem was solved by Hannenhalli and Pevzner, some problems remain in constructing the rearrangement scenarios. First, they claimed that the rearrangement scenario can be derived from the sorting series of reversals obtained by solving the sorting by reversals problem, but there is a gap in the construction. Next, the genomic distance $d_{r+tl}(\Pi, \Gamma)$ between Π and Γ is symmetric, i.e., $d_{r+tl}(\Pi, \Gamma) = d_{r+tl}(\Gamma, \Pi)$, but their algorithm requires that Π has fewer chromosomes than Γ when computing $d_{r+tl}(\Pi, \Gamma)$. Finally, their strategy is based on the algorithm for sorting by reversals only, and it is natural to ask whether a better algorithm for the sorting by reversals problem leads to a better algorithm for this problem. With regard to these three problems, Tesler [161] closed the gap in the construction, modified the asymmetric computation of $d_{r+tl}(\Pi, \Gamma)$, and, by incorporating the algorithm of Bader et al. [9], improved the running time to $O(n)$ for computing the genomic distance and to $O(n^2)$ for computing a rearrangement scenario. In addition, Ozery-Flato and Shamir [138] found a case in which the two polynomial algorithms mentioned above fail, and presented a revised duality theorem together with an algorithm to deal with the problem.
6.3 Other Considerations
In this section, we will introduce two rearrangement problems with unequal weights to their sorting operations. One of the interesting considerations is the
FIG. 27. (a) A genome with two chromosomes is modeled as a permutation with two cycles (1 5 4 3) and (2 7 6); (b) Fission, fusion and transposition can be mimicked by 2-cycles and a 3-cycle, respectively.
weighted combination of fusions, fissions and transpositions on circular unsigned multichromosomal genomes, which was proposed by Meidanis and Dias [126]. Based on classical results on permutation groups in algebra, they obtained a polynomial-time algorithm that finds a minimum-weight series of the three operations for transforming one genome into another, with transpositions weighted twice as much as fusions and fissions. In their model, a permutation may have several cycles to represent a multichromosomal genome (Fig. 27(a)) in which all chromosomes are circular. A fusion or fission acting on π is mimicked by the composition of a special 2-cycle with π, while the effect of a transposition corresponds to the composition of a 3-cycle with π (Fig. 27(b)). Therefore, the sorting by fissions, fusions and transpositions problem is reduced to a special decomposition of π into a series of 2- and 3-cycles, which has been well studied in algebra. Later, they attempted to assign an arbitrary weight $w_{tr}$ to transpositions and concluded that this problem is at least as hard as the sorting by transpositions problem. Finally, they obtained an approximation algorithm with guaranteed ratio $2/w_{tr}$. Recently, Yancopoulos, Attie and Friedberg [178] proposed an algorithm for solving the problem of sorting by reversals, translocations (including fusions and fissions) and block-interchanges on genomes with multiple linear chromosomes. They used a universal double-cut-and-join operation that accounts for reversal, fission, fusion and translocation, but does not by itself describe block-interchanges. In order to avoid a complicated analysis, they assigned weight 1 to all operations except block-interchanges, which receive weight 2; this is also consistent with the biological observation that block-interchanges are relatively rare.
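The way a 2-cycle mimics a fission or a fusion can be seen on the genome of Fig. 27(a). The sketch below composes a 2-cycle with the genome and lists the resulting cycles: when the two elements lie on the same chromosome the cycle splits, and when they lie on different chromosomes the two cycles merge. The composition convention (apply the genome first, then the 2-cycle) and the function names are assumptions made for illustration and need not match those of Meidanis and Dias.

```python
def from_cycles(cycles):
    """Build a mapping x -> successor of x from a list of cycles."""
    perm = {}
    for c in cycles:
        for a, b in zip(c, c[1:] + c[:1]):
            perm[a] = b
    return perm


def to_cycles(perm):
    """Cycle decomposition, each cycle starting at its smallest element."""
    seen, cycles = set(), []
    for start in sorted(perm):
        if start in seen:
            continue
        cycle, x = [], start
        while x not in seen:
            seen.add(x)
            cycle.append(x)
            x = perm[x]
        cycles.append(cycle)
    return cycles


def compose(sigma, pi):
    """(sigma o pi)(x) = sigma(pi(x)); elements absent from sigma are fixed."""
    return {x: sigma.get(pi[x], pi[x]) for x in pi}


genome = from_cycles([[1, 5, 4, 3], [2, 7, 6]])
print(to_cycles(compose(from_cycles([[1, 4]]), genome)))  # fission of (1 5 4 3)
print(to_cycles(compose(from_cycles([[1, 2]]), genome)))  # fusion of both cycles
```

Since a 3-cycle can be written as a product of two 2-cycles, the same representation extends to transpositions, although that case is not exercised here.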
7. Experimental Results
A complete experimental procedure on genome rearrangement starts with the sequence data as its input, then looks for genes, conserved segments or other features that can serve as landmarks among the input sequences, and finally computes the distance matrix according to the operations considered. Sometimes, when the sequences are well annotated in the database, a set of homologous genes among them can easily be obtained from biologists by identifying gene functions, names or even the similarity of gene segments. However, for many reasons, such as annotation errors, lack of annotations or insufficient biological knowledge, it is hard to determine whether genes of two species are homologous or not. This problem has greatly perplexed not only biologists, but also anyone who wants to study related topics. Therefore, the approach of comparative mapping, which allows the observation of chromosomal segments conserved in both genomes since divergence from a common ancestor, has arisen from techniques in biology, statistics, computer science, etc. [32,41,179]. We consider the problem of sorting by weighted reversals and block-interchanges, where the weight is 1 for reversals and 2 for block-interchanges. In order to obtain the genomic distances automatically, the optimal algorithm of Lin et al. [118] is implemented by integrating the algorithm of Kaplan et al. [105]. Moreover, previous research suggests that block-interchanges appear frequently in lower organisms. Hence, in the rest of this section, we carry out two experiments, on 18 species of Campanulaceae and on 29 γ-proteobacterial genomes, to study their evolution.
7.1 Chloroplast in Campanulaceae
In general, the chloroplast DNA (cpDNA) of land plants is highly conserved in nucleotide sequence, gene content and order, and genome size. Chloroplast genomes of photosynthetic angiosperms average about 160 kilobase pairs (kb) in size and contain approximately 120 genes. Major disruptions in gene order, such as those caused by inversions, inverted repeats and gene losses, are usually rare. Its relatively slow rate of evolution makes cpDNA an excellent molecule for evolutionary studies [137]. We used gene maps released by Cosner et al. [54] to encode each of the 18 genera and the outgroup Tobacco as a circular ordering of signed gene segments. Their analysis suggested a remarkable diversity of mutations, including inversions, insertions, deletions, duplications (inverted repeats) and putative transpositions. Transpositions in particular are hypothesized to be rare in chloroplast evolution, and therefore their inference for the Campanulaceae is surprising. The variety of rearrangements far exceeds that reported in any other group of land plants, so that it is a challenge to determine the exact number and the evolutionary sequence of rearrangement events. However, in order to apply our algorithm, we have to remove an incompletely mapped genus Roella from the dataset due to the lack of gene segment in some
FIG. 28. Phylogenetic relationships among 18 genomes of Campanulaceae inferred from a breakpoint distance matrix (a), and a reversal and block-interchange distance matrix (b). Values at clades reflect the distances among species. Very small distances are omitted for readability of the whole tree. Asterisks in (b) mark the major differences from the phylogenetic tree of Cosner et al. [54, Fig. 3]. The bar at the top indicates, in terms of edge length in the tree, 2 breakpoints (a) or weight 2, i.e., two reversals or one block-interchange (b).
experimental segments. Moreover, the genes affected by repeated regions, gene duplications and losses are all eliminated, thereby reducing the original 105 genes to 91 genes. This number of genes is sufficient for the analysis of reversal and block-interchange events, in place of the reversals and putative transpositions of the original study. It is worth mentioning that previous research has found that the differences in Campanulaceae lie mainly in duplications, insertions and inverted repeats. Here, we bypass the effects of these mutations, at the cost of making certain pairs of genera indistinguishable.
We analyze the dataset of 18 circular genomes for their breakpoint distances, and for their reversal and block-interchange distances. After calculating the matrices for the two distance measures, we reconstruct the phylogenetic trees by means of the distance-based neighbor-joining method [147] contained in the PHYLIP package [72]. Although this method gives no guarantee on the constructed tree, it remains widely used because it outputs a “better” tree topology than many tree construction methods. Moreover, the tree drawing program NJplot [144] is used to draw the phylogenetic tree from the solution produced by the neighbor-joining method. Our breakpoint tree (Fig. 28(a)) is very similar to the endpoint tree of Cosner et al. [54, Fig. 2], even though we use different methods to construct the trees. However, in our reversal and block-interchange tree (Fig. 28(b)), there are four species, indicated by asterisks, whose positions do not agree with those in the tree constructed by Cosner et al. [54, Fig. 3]. The inconsistency may be caused by the disregard of other mutations or by the different methods of tree construction. Except for these four divergent species, the remaining genera of Campanulaceae are consistent with the result of Cosner et al.
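The first distance measure used above can be sketched as follows. The code assumes the common definition of the breakpoint distance for signed circular gene orders, in which an adjacency (a, b) of one genome is conserved if (a, b) or (−b, −a) is adjacent in the other; this may differ in detail from the definition given earlier in the chapter. The three toy genomes are hypothetical and are not the Campanulaceae data. The resulting matrix is the kind of input a distance-based method such as neighbor-joining takes.

```python
def adjacency_pairs(genome, circular=True):
    """Consecutive gene pairs of a signed gene order."""
    pairs = list(zip(genome, genome[1:]))
    if circular and len(genome) > 1:
        pairs.append((genome[-1], genome[0]))
    return pairs


def breakpoint_distance(g1, g2, circular=True):
    """Number of adjacencies of g1 that are not conserved in g2."""
    conserved = set()
    for a, b in adjacency_pairs(g2, circular):
        conserved.add((a, b))
        conserved.add((-b, -a))   # same adjacency read on the other strand
    return sum(1 for p in adjacency_pairs(g1, circular) if p not in conserved)


def distance_matrix(genomes):
    """Symmetric matrix of pairwise breakpoint distances."""
    names = list(genomes)
    return [[breakpoint_distance(genomes[a], genomes[b]) for b in names]
            for a in names]


# Hypothetical toy gene orders, for illustration only.
toy = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, -3, -2, 4, 5],
    "C": [1, 4, 2, 3, 5],
}
for row in distance_matrix(toy):
    print(row)
```

A reversal and block-interchange distance matrix would be assembled in the same way, with the distance function replaced by an implementation of the weighted algorithm, which is not reproduced here.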
7.2 γ-Proteobacteria
Within the domain Bacteria, the phylum Proteobacteria constitutes at present the largest and most diverse phylogenetic lineage. The Proteobacteria contain many species, scattered over 5 major phylogenetic lines of descent known as the classes “α-proteobacteria”, “β-proteobacteria”, “γ-proteobacteria”, “δ-proteobacteria” and “ε-proteobacteria”, with genome lengths of about 1–8 megabase pairs (Mb), where γ-proteobacteria is the largest among these classes (at least 180 genera and 750 species). Genome rearrangements have been studied in several bacterial groups, including γ-proteobacteria, with inversions being one of the most frequent rearrangement types in interspecies comparisons. Apart from inversions and transpositions, other types of changes, e.g., deletion, duplication or horizontal (or lateral) gene transfer, may disrupt the gene order of γ-proteobacteria. Deletion and duplication events result in gaps and redundant genomic segments, respectively, when the genomes of two species are compared. Horizontal gene transfer, sometimes referred to as recombination, dominates the evolution of prokaryotic genomes and may produce insertions throughout the genome. However, these changes are beyond the ability of our algorithm. As usual, we ignore their effects to simplify the experiment. Recently, Belda, Moya and Silva [16] studied the breakpoint and inversion distances in 30 complete γ-proteobacterial genomes by comparing the order of 244 genes on the chromosome. They also showed a high correlation between the two distance measures, with a correlation coefficient of r = 0.996. Furthermore, the genes they used for analyzing the proteobacteria are recorded in the supplementary material of their paper, and are thus conveniently available online. In this experiment, we extract the gene orders of 29 γ-proteobacteria released by Belda et al. as the input of our algorithm, and exclude S. flexneri 301 (sfl) from our experiment because of its divergence from S. flexneri 2457T (sfx). Figure 29 shows our two phylogenetic trees built from the two distance measures. Since we use the same breakpoint distance and the same tree construction method (neighbor-joining) as Belda et al. [16, Fig. 5a], Fig. 29(a) is almost identical to their result. When reversals and block-interchanges are considered simultaneously, our tree in Fig. 29(b) seems to be superior to that of Belda et al. [16, Fig. 5b], in spite of the high similarity of the two tree topologies. The Shi. flexneri (sfx) moves closer to E. coli (ecc, eco, etc.) when the two trees of Fig. 29 are compared, and the result of Belda et al. shows the same variation; however, the She. oneidensis (son) slightly changes its position in the result of Belda et al., but not in ours. In other
FIG. 29. Phylogenetic relationships among 29 γ-proteobacteria inferred from a breakpoint distance matrix (a), and a reversal and block-interchange distance matrix (b).
EXPOSING PHYLOGENETIC RELATIONSHIPS
45
F IG . 30. Comparison of distance calculations on the dataset of 29 γ -proteobacteria with a correlation coefficient of γ = 0.997.
words, two tree topologies in Fig. 29 are more coincident in contrast to the comparison between breakpoint and reversal trees of Belda et al. This is why our correlation coefficient (γ = 0.997, see Fig. 30) is slightly higher than theirs (γ = 0.996 [16, Fig. 1]).
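The comparison of two distance measures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used in the experiment: the gene orders are hypothetical toy data, genomes are treated as signed linear permutations framed by 0 and n + 1, and the second measure is the exact reversal distance found by breadth-first search (practical only for very small n), standing in for the reversal and block-interchange distance. The function names `breakpoint_distance`, `reversal_distance` and `pearson` are our own.

```python
from collections import deque
from itertools import combinations
from math import sqrt


def breakpoint_distance(pi, sigma):
    """Count adjacencies of pi that are not conserved in sigma.

    Both arguments are signed linear gene orders over the genes 1..n.
    """
    n = len(pi)
    ext_pi = [0] + list(pi) + [n + 1]
    ext_sigma = [0] + list(sigma) + [n + 1]
    conserved = set()
    for a, b in zip(ext_sigma, ext_sigma[1:]):
        conserved.add((a, b))
        conserved.add((-b, -a))  # the same adjacency read on the other strand
    return sum((a, b) not in conserved for a, b in zip(ext_pi, ext_pi[1:]))


def reversal_distance(pi, sigma):
    """Exact signed reversal distance by breadth-first search (tiny n only)."""
    start, goal = tuple(pi), tuple(sigma)
    seen = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            return seen[cur]
        for i in range(len(cur)):
            for j in range(i, len(cur)):
                # Reverse the segment cur[i..j] and flip the sign of each gene.
                flipped = tuple(-g for g in reversed(cur[i:j + 1]))
                nxt = cur[:i] + flipped + cur[j + 1:]
                if nxt not in seen:
                    seen[nxt] = seen[cur] + 1
                    queue.append(nxt)
    raise ValueError("gene orders must contain the same genes")


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


# Hypothetical toy gene orders (not the data of Belda et al.).
genomes = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, -3, -2, 4, 5],
    "C": [3, 1, -2, 5, 4],
    "D": [-5, -4, -3, -2, -1],
}
pairs = list(combinations(sorted(genomes), 2))
bp = [breakpoint_distance(genomes[x], genomes[y]) for x, y in pairs]
rev = [reversal_distance(genomes[x], genomes[y]) for x, y in pairs]
r = pearson(bp, rev)
print(f"r = {r:.3f}")
```

The two lists `bp` and `rev` hold the pairwise distances in the same order, so `pearson` measures how closely one distance tracks the other. A distance matrix assembled from either list could then be passed to a neighbor-joining implementation to build a tree.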
8. Conclusions
In this chapter we have given a basic introduction to the genome rearrangement problems. Almost all rearrangement events studied in this area have come up in our discussion of the history and recent progress, which is further summarized in Table I. We have mentioned not only theoretical analyses but also biological evidence for rearrangement events, in order to connect theory with application. However, many interesting topics related to genome rearrangement are not covered in our discussion. For example, constraints may be imposed on the length of inverted or transposed segments [17,122,123,136,159,166]. Furthermore, recent research focuses on inferring a special scenario called perfect sorting, which conserves all common intervals during the transformation [19,146]. As to multiple genome rearrangement [149,177], the most frequently studied problem is its special case, the so-called median problem: given a set of permutations, find a permutation (the median) that minimizes the total distance to them under a specific genomic distance. Unfortunately, it has been shown to be NP-hard for both the breakpoint [143] and the reversal [34] distance. Very recently, Bernt et al. [29] proposed a heuristic algorithm for the median problem that considers only reversals that do not break the common gene intervals.
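To make the median problem concrete, the following Python sketch finds a breakpoint median by exhaustive search. It is only an illustration under stated assumptions: genomes are signed linear permutations framed by 0 and n + 1, the function names `breakpoint_distance` and `breakpoint_median` are our own, and the input gene orders are hypothetical. The search examines all n!·2^n signed permutations, which is in keeping with the NP-hardness of the problem and is feasible only for very small n; it is not the heuristic of Bernt et al. [29].

```python
from itertools import permutations, product


def breakpoint_distance(pi, sigma):
    """Count adjacencies of pi that are not conserved in sigma."""
    n = len(pi)
    ext_pi = [0] + list(pi) + [n + 1]
    ext_sigma = [0] + list(sigma) + [n + 1]
    conserved = set()
    for a, b in zip(ext_sigma, ext_sigma[1:]):
        conserved.add((a, b))
        conserved.add((-b, -a))  # the same adjacency read on the other strand
    return sum((a, b) not in conserved for a, b in zip(ext_pi, ext_pi[1:]))


def breakpoint_median(genomes):
    """Return (median, score) minimizing the total breakpoint distance.

    Exhaustive search over every signed permutation of the genes 1..n.
    Ties are broken in favor of the first candidate enumerated.
    """
    n = len(genomes[0])
    best, best_score = None, None
    for order in permutations(range(1, n + 1)):
        for signs in product((1, -1), repeat=n):
            cand = tuple(s * g for s, g in zip(signs, order))
            score = sum(breakpoint_distance(cand, g) for g in genomes)
            if best_score is None or score < best_score:
                best, best_score = cand, score
    return best, best_score


# Hypothetical input genomes over four genes.
inputs = [[1, 2, 3, 4], [1, -3, -2, 4], [2, 1, 3, 4]]
median, score = breakpoint_median(inputs)
print(median, score)
```

Since every input genome is itself a candidate, the score of the median never exceeds the best total distance achieved by any of the inputs.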
TABLE I. Summary of the progress of some genome rearrangement problems covered in this chapter

Unsigned permutations
Cycle decomposition: (1.4193 + ε)-app. [114]
Prefix reversal: 2-app. [75]; $\frac{15}{14}n \le D_{pref}(\pi) \le \frac{5n+5}{3}$ [80,94]
Reversal: 1.375-app. [27]; MAX-SNP hard [28]; $D_r(\pi) = n - 1$ [11]
Prefix transposition: 2-app. [59]
Transposition: 1.375-app. [67]; $n/2 \le D_{tr}(\pi)$ [44,127]; $D_{tr}(\pi) \le 11\lfloor n/24 \rfloor + \lfloor 3(n/3 \bmod 8)/2 \rfloor + 1$ [67]
Block interchange: $O(n^2)$, $O(n)$ [42,116]
Translocation: 1.75-app. [57]; MAX-SNP hard [182]
Reversal and transposition: 3-app. [168]

Signed permutations
Prefix reversal: $\frac{3n}{2} \le D_{pref}(\pi) \le 2n - 2$ [50]
Reversal: $O(n^{3/2}\sqrt{\log n})$ [160], $O(n)$ [9]; $D_r(\pi) = n + 1$ [89]
Translocation: $O(n^{3/2}\sqrt{\log n})$ [139], $O(n)$ [113]
Transposition and transreversal: 1.5-app. [91]
Reversal and transposition: 2-app. [82,115,168]; $n/2 \le D_{r+tr}(\pi)$ [129]
Reversal, transposition and transreversal: 2-app. [82,115]
Reversal (weight 1), transposition and transreversal (weight wt, 1 < wt < 2): 1.5-app. [10]
Reversal (weight 1), transposition and transreversal (weight 2): (1 + ε)-app. [69]
Reversal (weight 1) and block interchange (weight 2): $O(n^2)$, $O(n)$ [118]; $D_{r+bi}(\pi) = n - 1$ [118]
Fusion, fission and block interchange: $O(n^2)$, $O(n)$ [120]
Fusion, fission and transposition: $O(n^2)$, $O(n)$ [126]
Translocation, reversal and block interchange: $O(n^2)$, $O(n)$ [178]
Reversal, translocation, fusion and fission: $O(n^2)$, $O(n)$ [138,161]

A row naming a single operation ρ gives the results for the sorting by ρ problem. A row naming several operations corresponds to the sorting by multiple operations problem, in which each operation carries a weight (the integer 1 or 2, or a value wt with 1 < wt < 2). Two time complexities are listed in a row if the corresponding problem is polynomially solvable: the larger one is the running time of finding the sorting series, and the smaller one is that of computing the genomic distance.
REFERENCES
[1] “Derange II”, ftp://ftp.ebi.ac.uk/pub/software/unix/derange2.tar.Z.
[2] “Graphics gallery of the National Human Genome Research Institute (NHGRI)”, www.accessexcellence.org/RC/VL/GG/.
[3] “The seven Prize Problems”, www.claymath.org/millennium/.
[4] “Traveling Salesman Problem”, www.tsp.gatech.edu/.
[5] Ackermann W., “Zum Hilbertschen Aufbau der reellen Zahlen”, Math. Ann. 99 (1928) 118–133.
[6] Aigner M., West D.B., “Sorting by insertion of leading elements”, J. Combin. Theory Ser. A 45 (1987) 306–309.
[7] Ajana Y., Lefebvre J.-F., Tillier E.R.M., El-Mabrouk N., “Exploring the set of all minimal sequences of reversals—an application to test the replication-directed reversal hypothesis”, in: Proceedings of the 2nd Workshop on Algorithms in Bioinformatics, WABI2002, in: Lecture Notes in Computer Science, vol. 2452, Springer-Verlag, Berlin/New York, 2002, pp. 300–315.
[8] Ausiello G., Crescenzi P., Kann V., Marchetti-Spaccamela A., Gambosi G., Spaccamela A.M., Complexity and Approximation: Combinatorial Optimization Problems and Their Approximability Properties, Springer-Verlag, Berlin/New York, 1999.
[9] Bader D.A., Moret B.M.E., Yan M., “A linear-time algorithm for computing inversion distance between signed permutations with an experimental study”, J. Comput. Biol. 8 (2001) 483–491.
[10] Bader M., Ohlebusch E., “Sorting by weighted reversals, transpositions, and inverted transpositions”, in: Apostolico A., Guerra C., Istrail S., Pevzner P.A., Waterman M.S. (Eds.), Proceedings of the 10th Annual International Conference on Research in Computational Molecular Biology, RECOMB2006, in: Lecture Notes in Computer Science, vol. 3909, Springer-Verlag, Berlin/New York, 2006, pp. 563–577.
[11] Bafna V., Pevzner P.A., “Genome rearrangements and sorting by reversals”, SIAM J. Comput. 25 (1996) 272–289.
[12] Bafna V., Pevzner P.A., “Sorting by transpositions”, SIAM J. Discrete Math. 11 (1998) 221–240.
[13] Balakrishnan R., Ranganathan K., A Textbook of Graph Theory, Springer-Verlag, Berlin/New York, 2000.
[14] Bansal S.A., “Genome rearrangements and randomized sorting by reversals”, unpublished, 2002.
[15] Barnhart C., Johnson E.L., Nemhauser G.L., Savelsbergh M.W.P., Vance P.H., “Branch-and-price: Column generation for solving huge integer programs”, Oper. Res. 46 (1998) 316–329.
[16] Belda E., Moya A., Silva F.J., “Genome rearrangement distances and gene order phylogeny in γ-proteobacteria”, Mol. Biol. Evol. 22 (2005) 1456–1467.
[17] Bender M.A., Ge D., He S., Hu H., Pinter R.Y., Skiena S., Swidan F., “Improved bounds on sorting with length-weighted reversals”, in: Proceedings of the 15th Annual ACM–SIAM Symposium on Discrete Algorithms, SODA2004, ACM/SIAM, New York, 2004, pp. 919–928.
[18] Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J., Wheeler D.L., “GenBank”, Nucleic Acids Res. 34 (2006) D16–D20.
[19] Bérard S., Bergeron A., Chauve C., Paul C., “Perfect sorting by reversals is not always difficult”, in: Casadio R., Myers G. (Eds.), Proceedings of the 5th Workshop on Algorithms in Bioinformatics, WABI2005, in: Lecture Notes in Computer Science, vol. 3692, Springer-Verlag, Berlin/New York, 2005, pp. 228–238.
[20] Bergeron A., “A very elementary presentation of the Hannenhalli–Pevzner theory”, Discrete Appl. Math. 146 (2005) 134–145.
[21] Bergeron A., Chauve C., Hartman T., St-Onge K., “On the properties of sequences of reversals that sort a signed permutation”, in: Proceedings of JOBIM, JOBIM2002, 2002, pp. 99–108.
[22] Bergeron A., Mixtacki J., Stoye J., “Reversal distance without hurdles and fortresses”, in: Sahinalp S.C., Muthukrishnan S., Dogrusöz U. (Eds.), Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, CPM2004, in: Lecture Notes in Computer Science, vol. 3109, Springer-Verlag, Berlin/New York, 2004, pp. 388–399.
[23] Bergeron A., Mixtacki J., Stoye J., “On sorting by translocations”, J. Comput. Biol. 13 (2006) 567–578.
[24] Bergeron A., Strasbourg F., “Experiments in computing sequences of reversals”, in: Gascuel O., Moret B.M.E. (Eds.), Proceedings of the 1st Workshop on Algorithms in Bioinformatics, WABI2001, in: Lecture Notes in Computer Science, vol. 2149, Springer-Verlag, Berlin/New York, 2001, pp. 164–174.
[25] Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E., “The protein data bank”, Nucleic Acids Res. 28 (2000) 235–242.
[26] Berman P., Hannenhalli S., “Fast sorting by reversal”, in: Hirschberg D.S., Myers E.W. (Eds.), Proceedings of the 7th Annual Symposium on Combinatorial Pattern Matching, CPM1996, in: Lecture Notes in Computer Science, vol. 1075, Springer-Verlag, Berlin/New York, 1996, pp. 168–185.
[27] Berman P., Hannenhalli S., Karpinski M., “1.375-approximation algorithm for sorting by reversals”, in: Mohring R.H., Raman R. (Eds.), Proceedings of the 10th Annual European Symposium on Algorithms, ESA2002, in: Lecture Notes in Computer Science, vol. 2461, Springer-Verlag, Berlin/New York, 2002, pp. 200–210.
[28] Berman P., Karpinski M., “On some tighter inapproximability results”, in: Wiedermann J., Boas P.E., Nielsen M. (Eds.), Proceedings of the 26th International Colloquium on Automata, Languages and Programming, ICALP1999, in: Lecture Notes in Computer Science, vol. 1644, Springer-Verlag, Berlin/New York, 1999, pp. 200–209.
[29] Bernt M., Merkle D., Middendorf M., “Genome rearrangement based on reversals that preserve conserved intervals”, IEEE/ACM Trans. Comput. Biol. Bioinform. 3 (2006) 275–288.
[30] Blanchette M., Kunisawa T., Sankoff D., “Parametric genome rearrangement”, Gene 172 (1996) 11–17.
[31] Bogomolny A., “Interactive mathematics miscellany and puzzles”, www.cut-the-knot.org/SimpleGames/Flipper.shtml.
[32] Bourque G., Zdobnov E.M., Bork P., Pevzner P.A., Tesler G., “Comparative architectures of mammalian and chicken genomes reveal highly variable rates of genomic rearrangements across different lineages”, Genome Res. 15 (2005) 98–110.
[33] Caprara A., “Sorting by reversal is difficult”, in: Proceedings of the 1st Annual International Conference on Research in Computational Molecular Biology, RECOMB1997, ACM Press, New York, 1997, pp. 75–83.
[34] Caprara A., “Formulations and hardness of multiple sorting by reversals”, in: Istrail S., Pevzner P.A., Waterman M. (Eds.), Proceedings of the 3rd Annual International Conference on Research in Computational Molecular Biology, RECOMB1999, ACM Press, New York, 1999, pp. 84–93.
[35] Caprara A., “On the tightness of the alternating cycle lower bound for sorting by reversals”, J. Combin. Opt. 3 (1999) 149–182.
[36] Caprara A., “Sorting permutations by reversals and Eulerian cycle decompositions”, SIAM J. Discrete Math. 12 (1999) 91–110.
[37] Caprara A., Lancia G., Ng S.K., “A column-generation based branch-and-bound algorithm for sorting by reversals”, in: Farach-Colton M., Roberts F.S., Vingron M., Waterman M. (Eds.), in: DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 47, AMS Press, New York, 1999, pp. 213–226.
[38] Caprara A., Lancia G., Ng S.K., “Faster practical solution of sorting by reversals”, in: Proceedings of the 11th Annual ACM–SIAM Symposium on Discrete Algorithms, SODA2000, ACM/SIAM, New York, 2000, pp. 12–21.
[39] Caprara A., Lancia G., Ng S.K., “Sorting permutations by reversals through branch-and-price”, INFORMS J. Comput. 13 (2001) 224–244.
[40] Caprara A., Rizzi R., “Improved approximation for breakpoint graph decomposition and sorting by reversals”, J. Combin. Opt. 6 (2002) 157–182.
[41] Chen C.Y., Wu K.M., Chang Y.C., Chang C.H., “Comparative genome analysis of Vibrio vulnificus, a marine pathogen”, Genome Res. 13 (2003) 2577–2587.
[42] Christie D.A., “Sorting by block-interchanges”, Inform. Process. Lett. 60 (1996) 165–169.
[43] Christie D.A., “A 3/2-approximation algorithm for sorting by reversals”, in: Proceedings of the 9th Annual ACM–SIAM Symposium on Discrete Algorithms, SODA1998, ACM/SIAM, New York, 1998, pp. 244–252.
[44] Christie D.A., “Genome rearrangement problem”, PhD thesis, University of Glasgow, 1999.
[45] Clark D.P., Russell L.D., Molecular Biology Made Simple and Fun, second ed., Cache River Press, 2000.
[46] Cobham A., “The intrinsic computational difficulty of functions”, in: Proceedings of the 1964 Congress for Logic, Methodology and the Philosophy of Science, 1964, pp. 24–30.
[47] Cochrane G., et al., “EMBL nucleotide sequence database: developments in 2005”, Nucleic Acids Res. 34 (2006) D10–D15.
[48] Coe E., Kass L.B., “Proof of physical exchange of genes on the chromosomes”, Proc. Natl. Acad. Sci. USA 102 (2005) 6641–6646.
[49] Coghlan A., Wolfe K.H., “Fourfold faster rate of genome rearrangement in nematodes than in Drosophila”, Genome Res. 16 (2002) 857–867.
[50] Cohen D.S., Blum M., “On the problem of sorting burnt pancakes”, Discrete Appl. Math. 61 (1995) 105–120.
[51] Cook S.A., “The complexity of theorem-proving procedures”, in: Proceedings of the 3rd Annual ACM Symposium on Theory of Computing, STOC1971, ACM Press, New York, 1971, pp. 151–158.
[52] Cormen T.H., Leiserson C.E., Rivest R.L., Introduction to Algorithms, second ed., MIT Press, Cambridge, MA, 2001.
[53] Cosner M.E., Jansen R.K., Moret B.M.E., Raubeson L.A., Wang L.-S., Warnow T., Wyman S., “An empirical comparison of phylogenetic methods on chloroplast gene order data in Campanulaceae”, in: Sankoff D., Nadeau J.H. (Eds.), Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families, Kluwer Academic Press, Dordrecht/Norwell, MA, 2000, pp. 99–122.
[54] Cosner M.E., Raubeson L.A., Jansen R.K., “Chloroplast DNA rearrangements in Campanulaceae: phylogenetic utility of highly rearranged genomes”, BMC Evol. Biol. 4 (2004) 1471–2148.
[55] Courtay-Cahen C., Morris J.S., Edwards P.A.W., “Chromosome translocations in breast cancer with breakpoints at 8p12”, Genomics 66 (2000) 15–25.
[56] Creighton H.B., McClintock B., “A correlation of cytological and genetical crossing-over in Zea mays”, Proc. Natl. Acad. Sci. USA 17 (1931) 492–497.
[57] Cui Y., Wang L., Zhu D., “A 1.75-approximation algorithm for unsigned translocation distance”, in: Deng X., Du D. (Eds.), Proceedings of the 16th Annual Symposium on Algorithms and Computation, ISAAC05, in: Lecture Notes in Computer Science, Springer-Verlag, Berlin/New York, 2005.
[58] Darling A.C.E., Mau B., Blattner F.R., Perna N.T., “Mauve: multiple alignment of conserved genomic sequence with rearrangements”, Genome Res. 14 (2004) 1394–1403.
[59] Dias Z., Meidanis J., “Sorting by prefix transpositions”, in: Laender A.H.F., Oliveira A.L. (Eds.), Proceedings of the 9th International Symposium on String Processing and Information Retrieval, SPIRE2002, in: Lecture Notes in Computer Science, vol. 2476, Springer-Verlag, Berlin/New York, 2002, pp. 65–76.
[60] Dobzhansky T., Sturtevant A.H., “Inversions in the chromosomes of Drosophila pseudoobscura”, Genetics 23 (1938) 28–64.
[61] Doolittle R.F., Hunkapiller M.W., Hood L.E., Devare S.G., Robbins K.C., Aaronson S.A., Antoniades H.N., “Simian sarcoma Onc gene, v-sis, is derived from the gene (or genes) encoding platelet derived growth factor”, Science 221 (1983) 275–277.
[62] Dunham A., et al., “The DNA sequence and analysis of human chromosome 13”, Nature 428 (2004) 522–528.
[63] Dunham M.J., Badrane H., Ferea T., Adams J., Brown P.O., Rosenzweig F., Botstein D., “Characteristic genome rearrangements in experimental evolution of Saccharomyces cerevisiae”, Proc. Natl. Acad. Sci. USA 99 (2002) 16144–16149.
[64] Dweighter H., “Elementary problems”, Amer. Math. Monthly (1975) 1010.
[65] Edmonds J., “Paths, trees and flowers”, Canadian J. Math. 17 (1965) 449–467.
[66] Eisen J.A., Heidelberg J.F., White O., Salzberg S.L., “Evidence for symmetric chromosomal inversions around the replication origin in bacteria”, Genome Biol. 1 (2000).
[67] Elias I., Hartman T., “A 1.375-approximation algorithm for sorting by transpositions”, in: Casadio R., Myers G. (Eds.), Proceedings of the 5th Workshop on Algorithms in Bioinformatics, WABI2005, in: Lecture Notes in Computer Science, vol. 3692, Springer-Verlag, Berlin/New York, 2005, pp. 204–215.
[68] Erdem E., Tillier E., “Genome rearrangement and planning”, in: Veloso M.M., Kambhampati S. (Eds.), Proceedings of the 20th National Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, AAAI2005, AAAI Press/The MIT Press, 2005, pp. 1139–1144.
[69] Eriksen N., “(1 + ε)-approximation of sorting by reversals”, Theoret. Comput. Sci. 289 (2002) 517–529.
[70] Eriksen N., “Combinatorial methods in comparative genomics”, PhD thesis, Royal Institute of Technology, 2003.
[71] Eriksson H., Eriksson K., Karlander J., Svensson L., Wästlund J., “Sorting a bridge hand”, Discrete Math. 241 (2001) 289–300.
[72] Felsenstein J., “PHYLIP”, http://evolution.genetics.washington.edu/phylip.html.
[73] Feng W., Wang L., Zhu D., “CTRD: a fast applet for computing signed translocation distance between genomes”, Bioinformatics 48 (2004) 3256–3257.
[74] Feuk L., MacDonald J.R., Tang T., Carson A.R., Li M., Rao G., Khaja R., Scherer S.W., “Discovery of human inversion polymorphisms by comparative analysis of human and chimpanzee DNA sequence assemblies”, PLOS Genetics 1 (2005) 489–498.
[75] Fischer J., Ginzinger S.W., “A 2-approximation algorithm for sorting by prefix reversals”, in: Brodal G.S., Leonardi S. (Eds.), Proceedings of the 13th Annual European Symposium on Algorithms, ESA2005, in: Lecture Notes in Computer Science, vol. 3669, Springer-Verlag, Berlin/New York, 2005, pp. 415–425.
[76] Fliess A., Motro B., Unger R., “Swaps in protein sequences”, Proteins 48 (2002) 377–387.
[77] Fortuna V.J., Meidanis J., “Sorting the reverse permutation by prefix transpositions”, Technical Report IC-04-04, Institute of Computing, 2004.
[78] Garey M.R., Graham R.L., Ullman J.D., “Worst-case analysis of memory allocation algorithms”, in: Proceedings of the 4th Annual ACM Symposium on Theory of Computing, STOC1972, 1972, pp. 143–150.
[79] Garey M.R., Johnson D.S., Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, New York, 1979.
[80] Gates W.H., Papadimitriou C.H., “Bound for sorting by prefix reversals”, Discrete Math. 27 (1979) 47–57.
[81] Graham R.L., “Bounds for certain multiprocessor anomalies”, AT&T Tech. J. 45 (1966) 1563–1581.
[82] Gu Q.P., Peng S., Sudborough H., “A 2-approximation algorithm for genome rearrangements by reversals and transpositions”, Theoret. Comput. Sci. 210 (1999) 327–339.
[83] Gusfield D., Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge Univ. Press, Cambridge, MA, 1997.
[84] Guyer S.A., Heath L.S., Vergara J.P.C., “Subsequence and run heuristics for sorting by transpositions”, Technical Report TR-97-20, Virginia Polytechnic Institute and State University, 1997.
[85] Hannenhalli S., “Polynomial algorithm for computing translocation distance between genomes”, Discrete Appl. Math. 71 (1996) 137–151.
[86] Hannenhalli S., Chappey C., Koonin E., Pevzner P.A., “Genome sequence comparison and scenarios for gene rearrangement: a test case”, Genomics 30 (1995) 299–311.
[87] Hannenhalli S., Pevzner P.A., “Transforming men into mice (polynomial algorithm for genomic distance problem)”, in: Proceedings of the 36th IEEE Symposium on Foundations of Computer Science, FOCS1995, IEEE Comput. Soc., Los Alamitos, CA, 1995, pp. 581–592.
[88] Hannenhalli S., Pevzner P.A., “To cut . . . or not to cut (applications of comparative physical maps in molecular evolution)”, in: Proceedings of the 7th Annual ACM–SIAM Symposium on Discrete Algorithms, SODA1995, ACM/SIAM, New York, 1995, pp. 304–313.
[89] Hannenhalli S., Pevzner P.A., “Transforming cabbage into turnip: Polynomial algorithm for sorting signed permutations by reversals”, J. ACM 46 (1999) 1–27. Preliminary version in: Proceedings of the 27th Annual ACM Symposium on Theory of Computing, 1995, STOC1995, pp. 178–189.
[90] Hartman T., Shamir R., “A simpler and faster 1.5-approximation algorithm for sorting by transpositions”, Inform. Comput. 204 (2006) 275–290.
[91] Hartman T., Sharan R., “A 1.5-approximation algorithm for sorting by transpositions and transreversals”, J. Comput. Syst. Sci. 70 (2005) 300–320.
[92] Heath L.S., Vergara J.P.C., “Some experiments on the sorting by reversals problem”, Technical Report TR-95-16, Virginia Polytechnic Institute and State University, 1995.
[93] Heller A., et al., “A complex translocation event between the two homologues of chromosomes 5 leading to a del(5)(q21q33) as a sole aberration in a case clinically diagnosed as CML: characterization of the aberration by multicolor banding”, Internat. J. Oncol. 20 (2002) 1179–1181.
[94] Heydari H., Sudborough H.I., “On the diameter of the pancake network”, J. Algorithms 25 (1997) 67–94.
[95] Heydari M.H., “The Pancake Problem”, PhD thesis, University of Wisconsin at Whitewater, 1993.
[96] Ho T.-C., Jeng K.-S., Hu C.-P., Chang C., “Effects of genomic length on translocation of Hepatitis B virus polymerase-linked oligomer”, J. Virol. 74 (2000) 9010–9018.
[97] Hochbaum D.S., Approximation Algorithms for NP-Hard Problems, PWS Publishing Company, Warsaw, 1997.
[98] Hoot S.B., Palmer J.D., “Structural rearrangements, including parallel inversions, within the chloroplast genome of anemone and related genera”, J. Mol. Biol. 38 (1994) 274–281.
[99] Hopcroft J.E., Motwani R., Ullman J.D., Introduction to Automata Theory, Languages and Computation, second ed., Addison–Wesley, Reading, MA, 2001.
[100] Hughes D., “Evaluating genome dynamics: the constraints on rearrangements within bacterial genomes”, Genome Biol. 1 (2000).
[101] International Human Genome Sequencing Consortium, “Initial sequencing and analysis of the human genome”, Nature 409 (2001) 860–912.
[102] Jancovich J.K., Mao J., Chinchar V.G., Wyatt C., Case S.T., Kumar S., Valente G., Subramanian S., Davidson E.W., Collins J.P., Jacobs B.L., “Genomic sequence of a ranavirus (family iridoviridae) associated with salamander mortalities in North America”, Virology 316 (2003) 90–103.
[103] Johnson D.S., “Approximation algorithms for combinatorial problems”, J. Comput. Syst. Sci. 9 (1974) 256–278.
[104] Jones N.C., Pevzner P.A., An Introduction to Bioinformatics Algorithms, The MIT Press, Cambridge, MA, 2004.
[105] Kaplan H., Shamir R., Tarjan R.E., “A faster and simpler algorithm for sorting signed permutations by reversals”, SIAM J. Comput. 29 (1999) 880–892.
[106] Kaplan H., Verbin E., “Efficient data structures and a new randomized approach for sorting signed permutations by reversals”, in: Baeza-Yates R.A., Chávez E., Crochemore M. (Eds.), Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, CPM2003, in: Lecture Notes in Computer Science, vol. 2676, Springer-Verlag, Berlin/New York, 2003, pp. 170–185.
[107] Kaplan H., Verbin E., “Sorting signed permutations by reversals, revisited”, J. Comput. Syst. Sci. 70 (2005) 321–341.
[108] Kececioglu J.D., Ravi R., “Of mice and men: algorithms for evolutionary distances between genomes with translocation”, in: Proceedings of the 6th ACM–SIAM Symposium on Discrete Algorithms, SODA1995, ACM/SIAM, New York, 1995, pp. 604–613.
[109] Kececioglu J.D., Sankoff D., “Exact and approximation algorithms for the inversion distance between two permutations”, Algorithmica 13 (1995) 180–210.
[110] Kent W.J., Baertsch R., Hinrichs A., Miller W., Haussler D., “Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes”, Proc. Natl. Acad. Sci. USA 100 (2003) 11484–11489.
[111] Kim K.-J., Choi K.-S., Jansen R.K., “Two chloroplast DNA inversions originated simultaneously during the early evolution of the sunflower family (Asteraceae)”, Nucleic Acids Res. 22 (2005) 1783–1792.
[112] Lancia G., “Applications to computational molecular biology”, in: Appa G., Williams P. (Eds.), Modeling for Discrete Optimization, in: International Series in Operations Research and Management Science, Kluwer Academic Publishers, Dordrecht/Norwell, MA, 2004, in press.
[113] Li G., Qi X., Wang X., Zhu B., “A linear-time algorithm for computing translocation distance between signed genomes”, in: Sahinalp S.C., Muthukrishnan S., Dogrusöz U. (Eds.), Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, CPM2004, in: Lecture Notes in Computer Science, vol. 3109, Springer-Verlag, Berlin/New York, 2004, pp. 323–332.
[114] Lin G., Jiang T., “A further improved approximation algorithm for breakpoint graph decomposition”, J. Comb. Opt. 8 (2004) 183–194.
[115] Lin G.H., Xue G., “Signed genome rearrangement by reversals and transpositions: models and approximations”, Theoret. Comput. Sci. 259 (2001) 513–531.
[116] Lin Y.C., Lu C.L., Chang H.Y., Tang C.Y., “An efficient algorithm for sorting by block-interchanges and its application to the evolution of Vibrio species”, J. Comput. Biol. 12 (2005) 102–112.
[117] Lin Y.C., Lu C.L., Liu Y.-C., Tang C.Y., “SPRING: a tool for the analysis of genome rearrangement using reversals and block-interchanges”, Nucleic Acids Res. 34 (2006) W696–W699.
[118] Lin Y.C., Lu C.L., Tang C.Y., “Sorting permutation by reversals with fewest block-interchanges”, manuscript, 2006.
[119] Liu S.-L., Sanderson K.E., “Rearrangements in the genome of the bacterium Salmonella typhi”, Proc. Natl. Acad. Sci. USA 92 (1995) 1018–1022.
[120] Lu C.L., Huang Y.L., Wang T.C., Chiu H.-T., “Analysis of circular genome rearrangement by fusions, fissions and block-interchanges”, BMC Bioinform. 7 (2006).
[121] Lu C.L., Wang T.C., Lin Y.C., Tang C.Y., “ROBIN: a tool for genome rearrangement of block-interchanges”, Bioinformatics 21 (2005) 2780–2782.
[122] Mahajan M., Rama R., Vijayakumar S., “Towards constructing optimal strip move sequences”, in: Chwa K.-Y., Munro J.I. (Eds.), Proceedings of the 10th International Computing and Combinatorics Conference, COCOON2004, in: Lecture Notes in Computer Science, vol. 1644, Springer-Verlag, Berlin/New York, 2004, pp. 33–42.
[123] Mahajan M., Rama R., Vijayakumar S., “On sorting by 3-bounded transpositions”, Discrete Math. 306 (2006) 1569–1585.
[124] Mantin I., Shamir R., “An algorithm for sorting signed permutations by reversals”, www.math.tau.ac.il/~rshamir/GR/, 1999.
[125] McLysaght A., Seoighe C., Wolfe K.H., “High frequency of inversions during eukaryote gene order evolution”, in: Sankoff D., Nadeau J.H. (Eds.), Comparative Genomics: Empirical and Analytical Approaches to Gene Order Dynamics, Map Alignment and the Evolution of Gene Families, Kluwer Academic Press, Dordrecht/Norwell, MA, 2000, pp. 47–58.
[126] Meidanis J., Dias Z., “Genome rearrangements distance by fusion, fission, and transposition is easy”, in: Navarro G. (Ed.), Proceedings of the 8th International Symposium on String Processing and Information Retrieval, SPIRE2001, in: Lecture Notes in Computer Science, IEEE Comput. Soc., Los Alamitos, CA, 2001, pp. 250–253.
[127] Meidanis J., Walter M.E.T., Dias Z., “Transposition distance between a permutation and its reverse”, in: Proceedings of the 4th South American Workshop on String Processing, WSP1997, Carleton Univ. Press, 1997, pp. 70–79.
[128] Meidanis J., Walter M.E.T., Dias Z., “Reversal distance of signed circular chromosomes”, Technical Report IC-00-23, Institute of Computing, 2000.
[129] Meidanis J., Walter M.E.T., Dias Z., “A lower bound on the reversal and transposition diameter”, J. Comput. Biol. 9 (2002) 743–746.
[130] Miklós I., “MCMC genome rearrangement”, Bioinformatics 19 (2003) 130–137.
[131] Miklós I., Ittzés P., Hein J., “ParIS genome rearrangement server”, Bioinformatics 21 (2005) 817–820.
[132] Monammed J.L., Subi C.S., “An improved block-interchange algorithm”, J. Algorithms 8 (1987) 113–121.
[133] Motwani R., Raghavan P., Randomized Algorithms, Cambridge Univ. Press, Cambridge, UK, 1995.
[134] Mungall A.J., et al., “The DNA sequence and analysis of human chromosome 6”, Nature 425 (2003) 805–811.
[135] Nadeau J.H., Taylor B.A., “Lengths of chromosomal segments conserved since divergence of man and mouse”, Proc. Natl. Acad. Sci. USA 81 (1984) 814–818.
[136] Nadeau J.H., Taylor B.A., “Sorting by restricted-length-weighted reversals”, Genomics Proteomics Bioinform. 3 (2005) 120–127.
[137] Olmstead R.G., Palmer J.D., “Chloroplast DNA systematics: a review of methods and data analysis”, Amer. J. Bot. 81 (1994) 1205–1224.
[138] Ozery-Flato M., Shamir R., “Two notes on genome rearrangement”, J. Bioinform. Comput. Biol. 1 (2003) 71–94.
[139] Ozery-Flato M., Shamir R., “An $O(n^{3/2}\sqrt{\log(n)})$ algorithm for sorting by reciprocal translocations”, in: Lewenstein M., Valiente G. (Eds.), Proceedings of the 17th Annual Symposium on Combinatorial Pattern Matching, CPM2006, in: Lecture Notes in Computer Science, vol. 4009, Springer-Verlag, Berlin/New York, 2006, pp. 258–269.
[140] Palmer J.D., Herbon L.A., “Plant mitochondrial DNA evolves rapidly in structure, but slowly in sequence”, J. Mol. Evol. 28 (1988) 87–97.
[141] Palmer J.D., Osorio B., Thompson W.R., “Evolutionary significance of inversions in legume chloroplast DNAs”, Curr. Genetics 14 (1988) 65–74.
[142] Papadimitriou C.H., Computational Complexity, Addison–Wesley, Reading, MA, 1994.
[143] Pe’er I., Shamir R., “The median problems for breakpoints are NP-complete”, Technical Report TR98-071, Electronic Colloquium on Computational Complexity, 1998.
[144] Perrière G., Gouy M., “WWW-query: an on-line retrieval system for biological sequence banks”, Biochimie 78 (1996) 364–369.
[145] Pevzner P.A., Computational Molecular Biology: An Algorithmic Approach, MIT Press, Cambridge, MA, 2000.
[146] Sagot M.-F., Tannier E., “Perfect sorting by reversals”, in: Wang L. (Ed.), Proceedings of the 11th International Computing and Combinatorics Conference, COCOON2005, in: Lecture Notes in Computer Science, Springer-Verlag, Berlin/New York, 2005.
[147] Saitou N., Nei M., “The neighbor-joining method: a new method for reconstructing phylogenetic trees”, Mol. Biol. Evol. 4 (1987) 406–425.
[148] Sankoff D., “Edit distance for genome comparison based on non-local operations”, in: Apostolico A., Crochemore M., Galil Z., Manber U. (Eds.), Proceedings of the 3rd Annual Symposium on Combinatorial Pattern Matching, CPM1992, in: Lecture Notes in Computer Science, vol. 644, Springer-Verlag, Berlin/New York, 1992, pp. 121–135.
[149] Sankoff D., Blanchette M., “Multiple genome rearrangement and breakpoint phylogeny”, J. Comput. Biol. 5 (1998) 555–570.
[150] Sankoff D., Cedergren R., Abel Y., “Genomic divergence through gene rearrangement”, Methods in Enzymology 183 (1990) 428–438.
[151] Sankoff D., Leduc G., Antoine N., Paquin B., Lang B.F., Cedergren R., “Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome”, Proc. Natl. Acad. Sci. USA 89 (1992) 6575–6579.
[152] Schöniger M., Waterman M.S., “A local algorithm for DNA sequence alignment with inversions”, Bull. Math. Biol. 54 (1992) 521–536.
[153] Seoighe C., et al., “Prevalence of small inversions in yeast gene order evolution”, Proc. Natl. Acad. Sci. USA 97 (2002) 14433–14437.
[154] Setubal C., Meidanis J., Introduction to Computational Molecular Biology, PWS Publishing, Warsaw, 1997. [155] Siepel A.C., “An algorithm to enumerate sorting reversals for signed permutations”, J. Comput. Biol. 10 (2003) 575–597. [156] Sipser M., Introduction to the Theory of Computation, PWS Publishing, Warsaw, 1997. [157] Slamovits C.H., Fast N.M., Law J.S., Keeling P.J., “Genome compaction and stability in microsporidian intracellular parasites”, Curr. Biol. 14 (2004) 891–896. [158] Solomon A., Sutcliffe P., Lister R., “Sorting circular permutations by reversal”, in: Dehne F.K.H.A., Sack J.-R., Smid M.H.M. (Eds.), Algorithms and Data Structures, 8th International Workshop, WADS2003, in: Lecture Notes in Computer Science, vol. 2748, Springer-Verlag, Berlin/New York, 2003, pp. 319–328. [159] Swidan F., Bender M.A., Ge D., He S., Hu H., Pinter R.Y., “Sorting by length-weighted reversals: dealing with signs and circularity”, in: Sahinalp S.C., Muthukrishnan S., Dogrusöz U. (Eds.), Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, CPM2004, in: Lecture Notes in Computer Science, vol. 3109, SpringerVerlag, Berlin/New York, 2004, pp. 32–46. [160] Tannier E., Sagot M.-F., “Sorting by reversals in subquadratic time”, in: Sahinalp S.C., Muthukrishnan S., Dogrusöz U. (Eds.), Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching, CPM2004, in: Lecture Notes in Computer Science, vol. 3109, Springer-Verlag, Berlin/New York, 2004, pp. 1–13. [161] Tesler G., “Efficient algorithms for multichromosomal genome rearrangements”, J. Comput. Syst. Sci. 65 (2002) 587–609. [162] Tesler G., “GRIMM: genome rearrangements web server”, Bioinformatics 18 (2002) 492–493. [163] Tran N., “An easy case of sorting by reversals”, J. Comput. Biol. 5 (1998) 741–746. [164] Vazirani V.V., Approximation Algorithms, Springer-Verlag, Berlin/New York, 2001. [165] Venter J.C., et al., “The sequence of the human genome”, Science 291 (2001) 1304– 1351. 
[166] Vergara J.P.C., “Sorting by bounded permutations”, PhD thesis, Virginia Polytechnic Institute and State University, 1997.
[167] Walter M.E.T., Curado L.R.A.F., Oliveira A.G., “Working on the problem of sorting by transpositions on genome rearrangements”, in: Baeza-Yates R.A., Chavez E., Crochemore M. (Eds.), Proceedings of the 14th Annual Symposium on Combinatorial Pattern Matching, CPM2003, in: Lecture Notes in Computer Science, vol. 2676, Springer-Verlag, Berlin/New York, 2003, pp. 372–383.
[168] Walter M.E.T., Dias Z., Meidanis J., “Reversal and transposition distance of linear chromosomes”, in: Proceedings of String Processing and Information Retrieval, SPIRE1998, in: Lecture Notes in Computer Science, IEEE Comput. Soc., Los Alamitos, CA, 1998, pp. 96–102.
[169] Walter M.E.T., Sobrinho M.C., Oliveira E.T.G., Soares L.S., Oliveira A.G., Martins T.E.S., Fonseca T.M., “Improving the algorithm of Bafna and Pevzner for the problem of sorting by transpositions: a practical approach”, J. Discrete Algorithms 3 (2005) 342–361.
[170] Wang L., Zhu D., Liu X., Ma S., “An O(n^2) algorithm for signed translocation problem”, in: Chen Y.-P.P., Wong L. (Eds.), Proceedings of 3rd Asia-Pacific Bioinformatics Conference, APBC2005, Imperial College Press, London, 2005, pp. 349–358.
[171] Waterman M.S., Introduction to Computational Biology: Maps, Sequences and Genomes, Chapman & Hall, London/New York, 1995.
[172] Waterston R.H., et al., “Initial sequencing and comparative analysis of the mouse genome”, Nature 420 (2002) 520–562.
[173] Watson J.D., Crick F.H.C., “Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid”, Nature 171 (1953) 737–738.
[174] Watterson G.A., Ewens W.J., Hall T.E., Morgan A., “The chromosome inversion problem”, J. Theor. Biol. 99 (1982) 1–7.
[175] Weaver R.F., Molecular Biology, second ed., McGraw–Hill, New York, 2001.
[176] Wheeler D.L., et al., “Database resources of the National Center for Biotechnology Information”, Nucleic Acids Res. 34 (2006) D173–D180.
[177] Wu S., Gu X., “Algorithms for multiple genome rearrangement by signed reversals”, in: Pacific Symposium on Biocomputing, PSB2003, 2003, pp. 363–374.
[178] Yancopoulos S., Attie O., Friedberg R., “Efficient sorting of genomic permutations by translocation, inversion & block interchange”, Bioinformatics 21 (2005) 3340–3346.
[179] Yogeeswaran K., Frary A., York T.L., Amenta A., Lesser A.H., Nasrallah J.B., Tanksley S.D., Nasrallah M.E., “Comparative genome analyses of Arabidopsis spp.: Inferring chromosomal rearrangement events in the evolutionary history of A. thaliana”, Genome Res. 15 (2005) 505–515.
[180] Zhang J., Peterson T., “Transposition of reversed Ac element ends generates chromosome rearrangements in maize”, Genetics 167 (2004) 1929–1937.
[181] Zhu D.M., Ma S.H., “Improved polynomial-time algorithm for computing translocation distance between genomes”, Chinese J. Comput. 25 (2002) 189–196 (in Chinese).
[182] Zhu D., Wang L., “On the complexity of unsigned translocation distance”, Theoret. Comput. Sci. 352 (2006) 322–328.