Techniques for Designing Bioinformatics Algorithms


Techniques for Designing Bioinformatics Algorithms Massimo Cafaro, Italo Epicoco, and Marco Pulimeno, University of Salento, Lecce, Italy r 2018 Elsevier Inc. All rights reserved.

Introduction

This article deals with design techniques for algorithms, a fundamental topic which deserves an entire book. Indeed, several books have been published, including Cormen et al. (2009), Kleinberg (2011), Knuth (1998), Kozen (1992), Levitin (2006), Manber (1989), Mehlhorn and Sanders (2010), Sedgewick and Wayne (2011), and Skiena (2010). Owing to space limits, we cannot hope to provide an in-depth discussion and thorough treatment of each of the design techniques that shall be presented. Rather, we aim at providing a modern introduction that, without sacrificing formal rigour when needed, emphasizes the pros and cons of each design technique, putting it in context. The interested reader may refer to the provided bibliography to delve into this fascinating topic.

Informally, an algorithm is the essence of a computational procedure, and can be thought of as a set of step-by-step instructions to transform the input into the output according to the problem's statement. The first known algorithm is the Euclidean algorithm for computing the greatest common divisor, circa 400–300 B.C. The modern study of algorithms dates back to the early 1960s, when the limited availability and resources of the first computers were compelling reasons for users to strive to design efficient computer algorithms. The systematic study of computer algorithms to solve literally thousands of problems in many different contexts had begun, with extensive progress made by a huge number of researchers active in this field. A large number of efficient algorithms were devised to solve different problems, and the availability of many correct algorithms for the same problem stimulated the theoretical analysis of algorithms. Looking at the similarities among different algorithms designed to solve certain classes of problems, researchers were able to abstract and infer general algorithm design techniques. We cover here the most common techniques used in the design of sequential algorithms.

Exhaustive Search

We begin our discussion of design techniques for algorithms with exhaustive search, which is also known as the brute force approach. From a conceptual perspective, the technique represents the simplest possible approach to solve a problem. It is a straightforward algorithmic approach which, in general, involves trying all of the possible candidate solutions to the problem being solved and returning the best one. The name exhaustive search is therefore strictly related to the modus operandi of the technique, which exhaustively examines and considers all of the possible candidate solutions. The actual number of solutions returned depends on the problem's statement. For instance, consider the problem of determining all of the divisors of a natural number n. Exhaustive search solves the problem by trying each integer x from 1 to n, one by one, and verifying whether x divides n exactly, i.e., whether n modulo x returns a remainder equal to zero. Each x satisfying the problem's statement is output. Therefore, for this problem exhaustive search returns a set of solutions, according to the problem's statement.

However, it is worth noting that the technique may also be used to solve other problems which admit one or more optimal solutions (e.g., the class of optimization problems). In this case, we are not usually concerned with determining all of the possible solutions, since we consider all of the optimal solutions practically equivalent (from an optimality perspective with regard to the problem's statement). For these problems, exhaustive search consists of trying all of the possible solutions one by one and returning one of the satisfying candidate solutions, typically the first encountered. Once a solution is returned, the remaining candidates (if any) are simply discarded from further consideration. Of course, if the problem admits exactly one solution, discarding the remaining candidates, which cannot be the solution, avoids a waste of time.
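As a minimal illustration of the technique, the divisor-enumeration example just described can be sketched in Python (the function name is ours):

```python
def divisors(n):
    """Exhaustive search: try every candidate x in 1..n and keep those dividing n."""
    return [x for x in range(1, n + 1) if n % x == 0]

print(divisors(12))  # -> [1, 2, 3, 4, 6, 12]
```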
For instance, consider the sorting problem. We are given an input sequence a1, a2, …, an of n elements, and must output a permutation a′1, a′2, …, a′n such that a′1 ≤ a′2 ≤ … ≤ a′n. One may try all of the possible permutations of the input sequence, stopping as soon as the one under consideration satisfies the output specification and can therefore be returned as the solution to the problem.

Exhaustive search is therefore a design technique characterized by its conceptual simplicity and by the assurance that, if a solution actually exists, it will be found. Nonetheless, enumerating all of the possible candidate solutions may be difficult or costly, and the cost of exhaustive search is proportional to the number of candidates. For instance, for the problem of determining all of the divisors of a natural number n, the number of candidates is n itself. The cost of exhaustive search for this problem depends on the actual number of bits required to store n and on the division algorithm used (it is worth recalling that a division costs O(1) only for sufficiently small n, since we cannot assume constant-time arbitrary precision arithmetic when the size of n grows). Regarding the sorting problem, since there are n! possible permutations of the input sequence, the worst-case computational complexity is exponential in the input size, making this approach unsuitable for all but the smallest instances.
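A brute-force sorter along the lines just described, trying the n! permutations one by one, might look as follows (a sketch for illustration only; the function name is ours):

```python
from itertools import permutations

def permutation_sort(seq):
    """Exhaustive search over all n! permutations of the input:
    return the first non-decreasing one. Worst-case cost is O(n * n!)."""
    for perm in permutations(seq):
        if all(perm[i] <= perm[i + 1] for i in range(len(perm) - 1)):
            return list(perm)

print(permutation_sort([3, 1, 2]))  # -> [1, 2, 3]
```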

Encyclopedia of Bioinformatics and Computational Biology

doi:10.1016/B978-0-12-809633-8.20316-6


Since for many problems of practical interest a small increase in the problem size corresponds to a large increase in the number of candidates, the applicability of this technique is strictly confined to small problem sizes. Even though exhaustive search is often inefficient as an algorithmic design technique, it may be used as a useful complementary test to check that the results reported by other, more efficient algorithms, when run on small inputs, are indeed correct. Since exhaustive search is based on the enumeration of all of the possible candidate solutions, which are then checked one by one, applying the technique requires learning (by practice) how to identify the structure of a solution and how to rank candidate solutions in order to select the best one. A notable example of exhaustive search is the linear search algorithm for searching for an element in an unsorted array (Knuth, 1998).

A good example in bioinformatics is the so-called restriction mapping problem (Danna et al., 1973). Restriction enzyme mapping was a powerful tool in molecular biology for the analysis of DNA, long before the first bacterial genome was sequenced. The technique relied on restriction endonucleases, each one recognizing and reproducibly cleaving a specific base pair sequence in double-stranded DNA, generating fragments of varying sizes. Determining the lengths of these DNA fragments is possible, taking into account that the rate at which a DNA molecule moves through an agarose gel during electrophoresis is inversely proportional to its size. This information can then be exploited to determine the positions of cleavage sites in a DNA molecule. Given only the pairwise distances between a set of points, the restriction mapping problem requires recovering the positions of the points; in other words, we are required to reconstruct the set of points.
Let X be a set of n points on a line segment, in increasing order, and let ΔX be the multiset (i.e., a set that allows duplicate elements) of all pairwise distances between points in X: ΔX = {x_j − x_i : 1 ≤ i < j ≤ n}. How can we reconstruct X from ΔX? We start by noting that the set of points giving rise to the pairwise input distances is not necessarily unique, since the following properties hold:

ΔA = Δ(A ⊕ {v})
ΔA = Δ(−A)
Δ(A ⊕ B) = Δ(A ⊖ B)     (1)

where A ⊕ B = {a + b : a ∈ A, b ∈ B} and A ⊖ B = {a − b : a ∈ A, b ∈ B}. More generally, two sets A and B are said to be homometric if ΔA = ΔB, and biologists are usually interested in retrieving all of the homometric sets. Even though highly inefficient for large n, an exhaustive search algorithm for this problem is conceptually simple. Let L be the input list of distances and n the cardinality of X. The algorithm determines M, the maximum element in L, and then, for every set of n − 2 integers taken from L such that 0 < x2 < … < x_{n−1} < M, it forms X = {0, x2, …, x_{n−1}, M} and checks whether ΔX = L. Of course, the complexity of this algorithm is exponential in n. A better (slightly more practical) exhaustive search algorithm for this problem was designed by Skiena in 1990 (Skiena et al., 1990); it is an exponential algorithm as well. The first polynomial-time algorithm efficiently solving this problem was designed by Daurat et al. in 2002 (Daurat et al., 2002).
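The exhaustive algorithm just described can be sketched as follows (a Python sketch with our own naming; the multiset ΔX is represented via collections.Counter):

```python
from itertools import combinations
from collections import Counter

def delta(X):
    """Multiset of all pairwise distances of a point set X."""
    pts = sorted(X)
    return Counter(pts[j] - pts[i]
                   for i in range(len(pts)) for j in range(i + 1, len(pts)))

def brute_force_map(L, n):
    """Exhaustive search for the restriction mapping problem: 0 and M = max(L)
    are fixed, and every choice of n - 2 inner points drawn from the input
    distances is tried until one reproduces the distance multiset L."""
    target = Counter(L)
    M = max(L)
    candidates = sorted(set(d for d in L if 0 < d < M))
    for inner in combinations(candidates, n - 2):
        X = (0,) + inner + (M,)
        if delta(X) == target:
            return list(X)
    return None

# X = {0, 2, 4, 7} yields the distances 2, 2, 3, 4, 5, 7
print(brute_force_map([2, 2, 3, 4, 5, 7], 4))  # -> [0, 2, 4, 7]
```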

Decrease and Conquer

In order to solve a problem, decrease and conquer (Levitin, 2006) works by reducing the problem instance to a smaller instance of the same problem, solving the smaller instance, and extending the solution of the smaller instance to obtain the solution to the original instance. Therefore, the technique is based on exploiting a relationship between a solution to a given instance of a problem and a solution to a smaller instance of the same problem. This kind of approach can be implemented either top-down (recursively) or bottom-up (iteratively), and it is also referred to as the inductive or incremental approach. Depending on the problem, decrease and conquer can be characterized by how the problem instance is reduced to a smaller instance:

1. Decrease by a constant (usually by one);
2. Decrease by a constant factor (usually by half);
3. Variable-size decrease.

We point out here the similarity between decrease and conquer, when the decrease is by a constant factor, and divide and conquer. Algorithms that fall into the first category (decrease by a constant) include, for instance: insertion sort (Cormen et al., 2009), graph traversal algorithms (DFS and BFS) (Cormen et al., 2009), topological sorting (Cormen et al., 2009), and algorithms for generating permutations and subsets (Knuth, 1998). Among the algorithms in the second category (decrease by a constant factor) we recall here exponentiation by squaring (Levitin, 2006), binary search (Knuth, 1998), the strictly related bisection method, and Russian peasant multiplication (Levitin, 2006). Finally, examples of algorithms in the last category (variable-size decrease) are Euclid's algorithm (Cormen et al., 2009), the selection algorithm (Cormen et al., 2009), and searching and insertion in a binary search tree (Cormen et al., 2009).

Insertion sort exemplifies the decrease by a constant approach (in this case, decrease by one).
In order to sort an array A of length n, the algorithm assumes that the smaller problem, sorting the subarray A[1…n − 1] consisting of the first n − 1 elements, has been solved; therefore, A[1…n − 1] is a sorted subarray of size n − 1. Then, the problem reduces to finding the appropriate position (i.e., the index) for the element A[n] within the sorted elements of A[1…n − 1], and inserting it. Even though this leads naturally to a recursive, top-down implementation, insertion sort is often implemented iteratively using a bottom-up approach instead: it is enough to start inserting the elements, one by one, from A[2] to A[n]. Indeed, in the first iteration, A[1] is already a sorted subarray, since an array consisting of just one element is already sorted. The worst-case complexity of insertion sort is O(n²) to sort n elements; optimal sorting algorithms with worst-case complexity O(n lg n) are Merge sort and Heap sort.

Exponentiation by squaring is an example of an algorithm based on decrease by a constant factor (decrease by half). The algorithm is based on the following equation to compute aⁿ, which takes into account the parity of n:

aⁿ = (a^(n/2))²           if n is even and positive
aⁿ = (a^((n−1)/2))² · a   if n is odd
aⁿ = 1                    if n = 0
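The case analysis above translates directly into a recursive procedure; a Python sketch (function name ours):

```python
def power(a, n):
    """Decrease-by-half: compute a**n with O(lg n) recursive squarings."""
    if n == 0:
        return 1
    half = power(a, n // 2)          # one subproblem of half the size
    return half * half if n % 2 == 0 else half * half * a

print(power(3, 10))  # -> 59049
```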

Therefore, aⁿ can be computed recursively by an efficient algorithm requiring O(lg n) iterations, since the size of the problem is reduced in each iteration by about a half, at the expense of one or two multiplications per iteration.

Euclid's algorithm for computing the greatest common divisor of two numbers m and n such that m > n (otherwise, we simply swap m and n before starting the algorithm) provides an example of variable-size decrease. Denoting by gcd(m, n) the greatest common divisor of m and n, and by m mod n the remainder of the division of m by n, the algorithm is based on repeated application of the equation

gcd(m, n) = gcd(n, m mod n)

until m mod n = 0. Since gcd(m, 0) = m, the last value of m is also the greatest common divisor of the initial m and n. Measuring the instance size of the problem of determining gcd(m, n) by the size of m, it can easily be proved that the instance size decreases by at least a factor of two after every two successive iterations of Euclid's algorithm. Moreover, a consecutive pair of Fibonacci numbers provides a worst-case input for the algorithm with regard to the total number of iterations required.
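Euclid's algorithm can be sketched iteratively as follows (Python; the function name is ours):

```python
def gcd(m, n):
    """Variable-size decrease: repeatedly apply gcd(m, n) = gcd(n, m mod n)
    until the remainder is 0; the last nonzero value is the answer."""
    while n != 0:
        m, n = n, m % n
    return m

print(gcd(48, 36))  # -> 12
```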

Transform and Conquer

A group of techniques, known as transform and conquer (Levitin, 2006), can be used to solve a problem by applying a transformation; in particular, given an input instance, we can transform it to:

1. a simpler or more suitable/convenient instance of the same problem, in which case we refer to the transformation as instance simplification;
2. a different representation of the same input instance, a technique also known in the literature as representation change;
3. a completely different problem, for which we already know an efficient algorithm; in this case, we refer to this technique as problem reduction.

As an example of instance simplification, we discuss Gaussian elimination, in which we are given a system of n linear equations in n unknowns with an arbitrary coefficient matrix. We apply the technique and transform the input instance to an equivalent system of n linear equations in n unknowns with an upper triangular coefficient matrix. Finally, we solve the latter triangular system by back substitution, starting with the last equation and moving up to the first one.

Another example is element uniqueness. We are given an input array consisting of n elements, and we want to determine whether all of the elements are unique, i.e., there are no duplicate elements in the array. Applying the exhaustive search technique, we could compare all pairs of elements in worst-case running time O(n²). However, by instance simplification we can solve the problem in O(n lg n) as follows. First, we sort the array in time O(n lg n) using Merge sort or Heap sort; then, we perform a linear scan of the array, checking pairs of adjacent elements, in time O(n). Overall, the running time is O(n lg n) + O(n) = O(n lg n).

Heap sort (Williams, 1964) provides an excellent example of representation change.
This sorting algorithm is based on the use of a binary heap data structure, and it can be shown that a binary heap corresponds to an array, and vice versa, if certain conditions are satisfied. Regarding problem reduction, this variation of transform and conquer solves a problem by transforming it into a different problem for which an algorithm is already available. However, it is worth noting that problem reduction is valuable and practical only when the sum of the time required by the transformation (i.e., the reduction) and the time required to solve the newly generated problem is smaller than the time required to solve the input problem by another algorithm. Examples of problem reductions include:

• computing lcm(x, y) via gcd(x, y): lcm(x, y) = |x · y| / gcd(x, y);
• counting the number of paths of length n in a graph by raising the graph's adjacency matrix to the nth power;
• transforming a linear programming maximization problem to a minimization problem and vice versa;
• reduction to graph problems (e.g., solving puzzles via state-space graphs).
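The first reduction in the list, for example, might be sketched as follows (Python; we lean on the standard library's gcd and our own function name):

```python
from math import gcd

def lcm(x, y):
    """Problem reduction: lcm is computed via the known algorithm for gcd."""
    return abs(x * y) // gcd(x, y)

print(lcm(4, 6))  # -> 12
```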


Divide and Conquer

Divide and conquer (from the Latin divide et impera) is an important design technique and works as follows. When the input instance is too big or complex to be solved directly, it is advantageous to divide the input instance into two or more subproblems of roughly the same size, solve the subproblems (usually recursively, unless the subproblems are small enough to be solved directly), and finally combine the solutions to the subproblems to obtain the solution for the original input instance.

Merge sort, invented by John von Neumann in 1945, is a sorting algorithm based on divide and conquer. In order to sort an array A of length n, the algorithm divides the input array into two halves A[1…⌊n/2⌋] and A[⌊n/2⌋+1…n], sorts them recursively, and then merges the resulting smaller sorted arrays into a single sorted one. The key point is how to merge two sorted arrays, which can easily be done in linear time as follows. We scan both arrays using two pointers, initialized to point to the first elements of the arrays we are going to merge. We compare the elements and copy the smaller to the new array under construction; then, the pointer to the smaller element is incremented so that it points to its immediate successor in the array. We continue comparing pairs of elements, determining the smaller and copying it to the new array, until one of the two input arrays becomes empty. When this happens, we simply add the remaining elements of the other input array to the merged array. Let p and q be, respectively, the sizes of the two input arrays to be merged, such that n = p + q. Then, the merge procedure requires in the worst case O(n) time.

Recursive algorithms such as Merge sort are analyzed by deriving and solving a recurrence equation. Indeed, recursive calls in algorithms can be described using recurrences, i.e., equations or inequalities that describe a function in terms of its value on smaller inputs.
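The two-pointer merge and the overall recursion can be sketched as follows (a Python sketch; function names are ours):

```python
def merge(left, right):
    """Linear-time merge of two sorted arrays using two pointers."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])   # one of the two slices
    out.extend(right[j:])  # is already empty here
    return out

def merge_sort(a):
    """Divide and conquer: split in halves, sort recursively, then merge."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))

print(merge_sort([5, 2, 4, 6, 1, 3]))  # -> [1, 2, 3, 4, 5, 6]
```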
For instance, the recurrence for Merge sort is:

T(n) = O(1)             if n = 1
T(n) = 2T(n/2) + O(n)   if n > 1     (2)

Actually, the exact equation should be

T(n) = O(1)                           if n = 1
T(n) = T(⌊n/2⌋) + T(⌈n/2⌉) + O(n)     if n > 1     (3)
but it can be shown that neglecting the floor and the ceiling does not matter asymptotically. There are many methods to solve recurrences. The most general method is the substitution method, in which we guess the form of the solution, verify it by induction, and finally solve for the constants associated with the asymptotic notation.

In order to guess the form of the solution, the recursion-tree method can be used; it models the cost (time) of a recursive execution of an algorithm. In the recursion tree, each node represents a different substitution of the recurrence equation, so that each node corresponds to a value of the argument n of the function T(n) associated with it. Moreover, each node q in the tree is also associated with the value of the nonrecursive part of the recurrence equation for q. In particular, for recurrences derived by a divide and conquer approach, the nonrecursive part is the one related to the work required to combine the solutions of the subproblems into the solution for the original problem, i.e., the solutions of the subproblems associated with the children of node q in the tree.

To generate the recursion tree, we start with T(n) as the root node. Let the function f(n) be the only nonrecursive term of the recurrence; we expand T(n) and put f(n) as the root of the recursion tree. We obtain the first level of the tree by expanding the recurrence, i.e., we put each of the recurrence terms involving the T function on the first level, and then we substitute them with the corresponding f terms. Then we proceed to expand the second level, substituting each T term with the corresponding f term, and so on, until we reach the leaves of the tree. To obtain an estimate of the solution to the recurrence, we sum the nonrecursive values within each level of the tree and then sum the contributions of all of the levels.
Equations of the form T(n) = aT(n/b) + f(n), where a ≥ 1, b > 1 and f(n) is asymptotically positive, can be solved immediately by applying the so-called master theorem (Cormen et al., 2009), in which we compare the function f(n) with n^(log_b a). There are three cases to consider:

1. ∃ε > 0 such that f(n) = O(n^(log_b a − ε)). In this case, f(n) grows polynomially slower (by an n^ε factor) than n^(log_b a), and the solution is T(n) = Θ(n^(log_b a));
2. ∃k ≥ 0 such that f(n) = Θ(n^(log_b a) log^k n). Then, the asymptotic growth of f(n) and n^(log_b a) is similar, and the solution is T(n) = Θ(n^(log_b a) log^(k+1) n);
3. f(n) = Ω(n^(log_b a + ε)) for some ε > 0, and f(n) satisfies the regularity condition a·f(n/b) ≤ c·f(n) for some constant c < 1. Then, f(n) grows polynomially faster (by an n^ε factor) than n^(log_b a), and the solution is T(n) = Θ(f(n)).

A more general method, devised by Akra and Bazzi (1998), allows solving recurrences of the form

T(n) = Σ_{i=1..k} a_i T(n/b_i) + f(n)     (4)

Let p be the unique solution to Σ_{i=1..k} a_i b_i^(−p) = 1; then the solution is derived exactly as in the master theorem, but considering n^p instead of n^(log_b a). Akra and Bazzi also prove an even more general result. Many constant order linear recurrences are also easily solved by applying the following theorem.


Let a_1, a_2, …, a_h ∈ ℕ with h ∈ ℕ, and c, b ∈ ℝ with c > 0, b ≥ 0, and let a = Σ_{i=1..h} a_i. Then, the solution to the recurrence

T(n) = k, k ∈ ℕ                         if n ≤ h
T(n) = Σ_{i=1..h} a_i T(n − i) + c·n^b  if n > h     (5)

is

T(n) = O(n^(b+1))    if a = 1
T(n) = O(a^n · n^b)  if a ≥ 2     (6)

Specific techniques for solving general constant order linear recurrences are also available.

Divide and conquer is a very powerful design technique, and for many problems it provides fast algorithms, including, for example, Merge sort, Quick sort (Hoare, 1962), binary search (Knuth, 1998), algorithms for powering a number (Levitin, 2006) and computing Fibonacci numbers (Gries and Levin, 1980), Strassen's algorithm (Strassen, 1969) for matrix multiplication, Karatsuba's algorithm (Karatsuba and Ofman, 1962) for multiplying two n-bit numbers, etc. Since so many problems can be solved efficiently by divide and conquer, one can get the wrong impression that divide and conquer is always the best way to approach a new problem. However, this is of course not true, and the best algorithmic solution to a problem may be obtained by means of a very different approach.

As an example, consider the majority problem. Given an unsorted array of n elements, using only equality comparisons we want to find the majority element, i.e., the one which appears in the array more than n/2 times. An algorithm based on exhaustive search simply compares all of the possible pairs of elements and requires worst-case O(n²) running time. A divide and conquer approach provides an O(n log n) solution. However, there exists an even better algorithm, requiring just a linear scan of the input array: the Boyer–Moore algorithm (Boyer and Moore, 1981, 1991) solves this problem in worst-case O(n) time.
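The linear-scan algorithm mentioned above, the Boyer–Moore majority vote, can be sketched as follows (Python, with our own naming; a verification pass confirms the candidate and returns None when no majority element exists):

```python
def majority(a):
    """Boyer-Moore majority vote: one linear scan keeps a candidate and a
    counter; a second pass verifies the candidate really is a majority."""
    candidate, count = None, 0
    for x in a:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    return candidate if a.count(candidate) > len(a) // 2 else None

print(majority([2, 8, 2, 2, 5, 2]))  # -> 2
```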

Randomized Algorithms

Randomized algorithms (Motwani and Raghavan, 2013) make random choices during their execution. In addition to its input, a randomized algorithm also uses a source of randomness; in particular, it:

• can flip coins as a basic step, i.e., it can toss a fair coin c which is either Heads or Tails, each with probability 1/2;
• can generate a random number r from a range {1…R}; decisions and/or computations are based on the value of r.

On the same input, on different executions, a randomized algorithm may:

• run for a different number of steps;
• produce different outputs.

Indeed, on different executions, different coins are flipped (different random numbers are used), and the value of these coins can change the course of execution. Why does it make sense to toss coins? Here are a few reasons. Some problems cannot be solved deterministically at all; an example is the asynchronous agreement problem (consensus). For some other problems, only exponential deterministic algorithms are known, whereas polynomial-time randomized algorithms do exist. Finally, for some problems, a randomized algorithm provides a significant polynomial-time speedup with regard to a deterministic algorithm.

The intuition behind randomized algorithms is simple. Think of an algorithm as battling against an adversary who attempts to choose an input to slow it down as much as possible. If the algorithm is deterministic, then the adversary may analyze the algorithm and find an input that will elicit the worst-case behaviour. However, for a randomized algorithm the output does not depend only on the input, since it also depends on the random coins tossed. The adversary does not control and does not know which coins will be tossed during the execution, therefore his ability to choose an input eliciting a worst-case running time is severely restricted. Where do we get coins from? In practice, randomized algorithms use pseudo-random number generators.

Regarding the analysis of a randomized algorithm, this is different from average-case analysis, which requires knowledge of the distribution of the input and in which the expected running time is computed taking the expectation over the distribution of possible inputs. In particular, the running time of a randomized algorithm, being dependent on random bits, is actually a random variable, i.e., a function in a probability space Ω consisting of all of the possible sequences r, each of which is assigned a probability Pr[r].
The running time of a randomized algorithm A on input x and a sequence r of random bits, denoted by A(x, r), is characterized by the expected value E[A(x, r)], where the expectation is over r, the random choices of the algorithm: E[A(x, r)] = Σ_{r ∈ Ω} A(x, r) · Pr[r]. There are two classes of randomized algorithms, which were originally named by Babai (1979):



• Monte Carlo algorithm: for every input, regardless of the coins tossed, the algorithm always runs in polynomial time, and the probability that its output is correct can be made arbitrarily high;


• Las Vegas algorithm: for every input, regardless of the coins tossed, the algorithm is correct and runs in expected polynomial time (for all except a "small" number of executions, the algorithm runs in polynomial time).

The probabilities and expectations above are over the random choices of the algorithm, not over the input. As stated, a Monte Carlo algorithm fails with some probability, but we are not able to tell when it fails. A Las Vegas algorithm also fails with some probability, but we are able to tell when it fails. This allows us to run it again until it succeeds, which implies that the algorithm eventually succeeds with probability 1, even though at the expense of a potentially unbounded running time.

In bioinformatics, a good example of a Monte Carlo randomized algorithm is the random projections algorithm (Buhler and Tompa, 2001, 2002) for motif finding. Another common example of a Monte Carlo algorithm is Freivalds' algorithm (Freivalds, 1977) for checking matrix multiplication. A classical example of a Las Vegas randomized algorithm is Quick sort (Hoare, 1962), invented in 1962 by Charles Antony Richard Hoare, which is also a divide and conquer algorithm. Even though the worst-case running time of Quick sort is O(n²), its expected running time is O(n lg n), as for Merge sort and Heap sort. However, Quick sort is, in practice, much faster.
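Freivalds' check mentioned above can be sketched as follows (a Python sketch with our own naming; for simplicity it uses dense lists of lists and random 0/1 vectors):

```python
import random

def freivalds(A, B, C, iterations=20):
    """Monte Carlo check of A*B == C: test A(Br) == Cr for random 0/1 vectors r.
    Each round costs O(n^2) instead of the O(n^3) of a full multiplication;
    a wrong C is accepted with probability at most 2**-iterations."""
    n = len(A)
    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    for _ in range(iterations):
        r = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, r)) != matvec(C, r):
            return False  # certainly A*B != C
    return True  # probably A*B == C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[19, 22], [43, 50]]
print(freivalds(A, B, C))  # -> True
```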

Dynamic Programming

The dynamic programming design technique provides a powerful approach to the solution of problems exhibiting (i) optimal substructure and (ii) overlapping subproblems. Property (i) (also known as the principle of optimality) means that an optimal solution to the problem contains within it optimal solutions to related subproblems. Property (ii) tells us that the space of subproblems related to the problem we want to solve is small (typically, the number of distinct subproblems is polynomial in the input size). In this context, a divide and conquer approach, which recursively solves all of the subproblems encountered in each recursive step, is clearly unsuitable and highly inefficient, since it repeatedly solves the same subproblems whenever it encounters them again and again in the recursion tree. On the contrary, dynamic programming suggests solving each of the smaller subproblems only once and recording the results in a table, from which a solution to the original problem can then be obtained.

Dynamic programming is often applied to optimization problems. Solving an optimization problem through dynamic programming requires finding an optimal solution, since there can be many possible solutions with the same optimal value (minimum or maximum, depending on the problem).

Computing the nth number of the Fibonacci series provides a simple example of application of dynamic programming (it is worth noting that for this particular problem a faster divide and conquer algorithm, based on matrix exponentiation, actually exists). Denoting with F(n) the nth Fibonacci number, it holds that F(n) = F(n − 1) + F(n − 2). This problem is explicitly expressed as a composition of subproblems: to compute the nth number we have to solve the same problem on the smaller instances F(n − 1) and F(n − 2).
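A bottom-up computation of F(n), storing each value exactly once, can be sketched as follows (Python; the function name is ours):

```python
def fib(n):
    """Bottom-up dynamic programming: each F(i) is computed once, in O(n),
    keeping only the last two values of the series."""
    if n < 2:
        return n
    prev, curr = 0, 1
    for _ in range(2, n + 1):
        prev, curr = curr, prev + curr
    return curr

print(fib(10))  # -> 55
```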
A divide and conquer approach would recursively compute all of the subproblems in a top-down fashion, including those subproblems already solved: to compute F(n) we have to compute F(n − 1) and F(n − 2); to compute F(n − 1) we have to compute again F(n − 2) and F(n − 3); in this example, the subproblem F(n − 2) would be evaluated twice by the divide and conquer approach. Dynamic programming avoids recomputing the already solved subproblems. Typically, dynamic programming follows a bottom-up approach, even though a recursive top-down approach with memoization is also possible (without memoizing the results of the smaller subproblems, the approach reverts to classical divide and conquer).

As an additional example, we introduce the problem of sequence alignment. A common approach to infer a newly sequenced gene's function is to find its similarities with genes of known function. Revealing the similarity between different DNA sequences is non-trivial, and comparing corresponding nucleotides is not enough; a sequence alignment is needed before comparison. Hirschberg's space-efficient algorithm (Hirschberg, 1975) is a divide and conquer algorithm that can perform the alignment in linear space (whilst the traditional dynamic programming approach requires quadratic space), even though at the expense of doubling the computational time.

The simplest form of sequence similarity analysis is the Longest Common Subsequence (LCS) problem, in which only insertions and deletions between two sequences are allowed. We define a subsequence of a string v as an ordered sequence of characters, not necessarily consecutive, from v. For example, if v = ATTGCTA, then AGCA and ATTA are subsequences of v, whereas TGTT and TCG are not. A common subsequence of two strings is a subsequence of both of them. The longer a common subsequence of two strings is, the more similar the strings are.
We can hence formulate the Longest Common Subsequence problem as follows: given two input strings v and w, respectively of length n and m, find the longest subsequence common to the two strings. Denoting with s_{i,j} the length of the longest common subsequence between the first i characters of v (denoted as the i-prefix of v) and the first j characters of w (denoted as the j-prefix of w), the solution to the problem is s_{n,m}. We can solve the problem recursively, noting that the following relation holds:

s_{i,j} = s_{i−1,j−1} + 1             if v_i = w_j
s_{i,j} = max(s_{i−1,j}, s_{i,j−1})   if v_i ≠ w_j     (7)

Clearly, s_{i,0} = s_{0,j} = 0 for all 1 ≤ i ≤ n and 1 ≤ j ≤ m. The first case corresponds to a match between v_i and w_j; in this case, the solution for the subproblem s_{i,j} is the solution for the subproblem s_{i−1,j−1} plus one (since v_i = w_j, we can append v_i to the common subsequence we are building, increasing its length by one). The second case refers to a mismatch between v_i and w_j, giving rise to two possibilities: the
solution s_{i−1,j} corresponds to the case in which v_i is not present in the LCS of the i-prefix of v and the j-prefix of w, whilst the solution s_{i,j−1} corresponds to the case in which w_j is not present in the LCS. The problem has been expressed as a composition of subinstances; moreover, it can be easily proved that it meets the principle of optimality (i.e., if a string z is an LCS of v and w, then any prefix of z is an LCS of a prefix of v and w) and that the number of distinct LCS subproblems for two strings of lengths n and m is only nm. Hence the dynamic programming design technique can be applied to solve the problem. In general, to apply dynamic programming, we have to address a number of issues: 1. Show optimal substructure, i.e. an optimal solution to the problem contains within it optimal solutions to subproblems; the solution to a problem is derived by:

• making a choice out of a number of possibilities (look at what possible choices there can be);
• solving one or more subproblems that are the result of a choice (we need to characterize the space of subproblems);
• showing that solutions to subproblems must themselves be optimal for the whole solution to be optimal;

2. Write a recurrence equation for the value of an optimal solution:

• M_opt = min (or max, depending on the optimization problem) over all choices k of {(sum of M_opt over all of the subproblems resulting from choice k) + (the cost associated with making the choice k)};
• show that the number of different instances of subproblems is bounded by a polynomial;

3. Compute the value of an optimal solution in a bottom-up fashion (alternatively, top-down with memoization);
4. Optionally, try to reduce the space requirements, by “forgetting” and discarding solutions to subproblems that will not be used any more;
5. Optionally, reconstruct an optimal solution from the computed information (which records the sequence of choices made that leads to an optimal solution).
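To make the steps above concrete, the following Python sketch (ours, not part of the original article) solves the LCS problem of Eq. (7) bottom-up, and then reconstructs one optimal subsequence by replaying the choices recorded in the table (step 5); note that several distinct optimal subsequences may exist, and this sketch returns one of them.

```python
def lcs(v, w):
    """Bottom-up dynamic programming for the Longest Common Subsequence.

    s[i][j] holds the length of an LCS of the i-prefix of v and the
    j-prefix of w, filled in according to recurrence (7)."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:                  # match: extend the diagonal solution
                s[i][j] = s[i - 1][j - 1] + 1
            else:                                     # mismatch: drop v_i or w_j
                s[i][j] = max(s[i - 1][j], s[i][j - 1])
    # Reconstruct one optimal solution by replaying the choices backwards.
    i, j, out = n, m, []
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1]:
            out.append(v[i - 1])
            i, j = i - 1, j - 1
        elif s[i - 1][j] >= s[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

The table takes Θ(nm) time and space; only the reconstruction step needs the full table, which is why Hirschberg's algorithm, which recomputes rows on demand, can get away with linear space.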

Backtracking and Branch-and-Bound

Some problems require finding a feasible solution by exploring the solutions domain, which for these problems grows exponentially. For optimization problems we additionally require that the feasible solution is the best one according to an objective function. Many of these problems might not be solvable in polynomial time. In Section “Exhaustive Search” we discussed how such problems can be solved, in principle, by exhaustive search, i.e. by sweeping the whole solutions domain. In this section we introduce the Backtracking and Branch-and-Bound design techniques, which can be considered improvements of the exhaustive search approach. The main idea is to build the candidate solution to the problem by adding one component at a time and evaluating the partial solution constructed so far. For optimization problems, we also consider a way to estimate a bound on the best value of the objective function of any solution that can be obtained by adding further components to the partially constructed solution. If the partial solution does not violate the problem constraints and its bound is better than the currently known feasible solution, then further components can be added, until a complete feasible solution is reached. If during the construction of the solution no further component can be added, either because no feasible solution exists starting from the partially constructed solution or because its bound is worse than that of the currently known feasible solution, then the remaining components are not generated at all and the process backtracks, changing the previously added components. This approach makes it possible to solve some large instances of difficult combinatorial problems, though, in the worst case, we still face the same curse of exponential explosion encountered in exhaustive search. Backtracking and Branch-and-Bound differ in the nature of the problems they can be applied to.
Branch-and-Bound is applicable only to optimization problems, because it is based on computing a bound on the possible values of the problem's objective function. Backtracking is not constrained by this requirement, and the partially built solution is pruned only if it violates the problem constraints. Both methodologies require building the state-space tree, whose nodes reflect the specific choices made for a solution's components. Its root represents the initial state, before the search for a solution begins. The nodes at the first level in the tree represent the choices made for the first component of a solution, and so on. A node in a state-space tree is said to be promising if it corresponds to a partially constructed solution that may still lead to a feasible solution; otherwise, it is called nonpromising. Leaves represent either nonpromising dead ends or complete solutions found by the algorithm. We can better explain how to build the state-space tree by introducing the n-Queens problem, in which we have to place n queens on an n × n chessboard so that no two queens attack each other. A queen may attack any chess piece on the same row, column or diagonal. For this problem the backtracking approach brings valuable benefits with respect to exhaustive search. We know that exactly one queen must be placed in each row; we hence have to find the column in which to place each queen so that the problem constraints are met. A solution can be represented by n values {c1,…,cn}, where ci represents the column of the ith queen. At the first level of the state-space tree we have n nodes representing all of the possible choices for c1. We make a choice for the first value of c1, exploring
the first promising node and adding n nodes at the second level, corresponding to the available choices for c2. The partial solution made of the choices c1, c2 is evaluated and the process continues, visiting the tree in a depth-first manner. If all of the nodes on the current level are nonpromising, then the algorithm backtracks to the upper level, up to the first promising node. Several other problems can be solved by a backtracking approach. In the Subset-Sum problem we have to find a subset of a given set A = {a1,…,an} of n positive integers whose sum is equal to a given positive integer d. The Hamiltonian circuit problem consists in finding a cyclic path that visits each vertex in a graph exactly once. In the n-Coloring problem we have to color all of the vertices in a graph such that no two adjacent vertices have the same color; each vertex can be colored using one of the n available colors. Subset-Sum, Hamiltonian circuit and graph coloring are examples of NP-complete problems for which backtracking is a viable approach if the input instance is very small. As a final example, we recall here the restriction map problem already described in Section “Exhaustive Search”. The restriction map problem is also known in computer science as the Turnpike problem, defined as follows: let X = {x1,…,xn} be a set of n points on a line; given the multiset ΔX of the pairwise distances between each pair {xi, xj}, ΔX = {xj − xi : 1 ≤ i < j ≤ n}, we have to reconstruct X. Without loss of generality, we can assume that the first point on the line is at x1 = 0. Let L be the input multiset with all of the distances between pairs of points; we have to find a solution X such that ΔX = L. The key idea is to start by considering the greatest distance in L; let us denote it as δ1. We can state that the furthest point is at distance δ1, i.e. xn = δ1. We remove δ1 from L and consider the next highest distance δ2.
This distance derives from two cases: xn − x2 = δ2 or xn−1 − x1 = δ2; we can make an arbitrary choice and start building the state-space tree. Let us choose x2 = xn − δ2; we hence have a partial solution X̃ = {0, δ1 − δ2, δ1}. In order to verify that this partial solution does not violate the constraints, we compute ΔX̃ and verify that L ⊇ ΔX̃. If the constraint is satisfied, the node is a promising one and we can continue with the next point; otherwise we change the choice, moving to the next promising node. The algorithm iterates until all of the feasible solutions are found. At each level of the state-space tree only two alternatives can be examined. Usually only one of the two alternatives is viable at any level. In this case the computational complexity of the algorithm can be expressed as:

    T(n) = T(n − 1) + O(n log n)        (8)

where O(n log n) is the time taken for checking the partial solution. In this case the computational complexity is T(n) = O(n^2 log n). In the worst case both alternatives must be evaluated at each level; in this case the recurrence equation is:

    T(n) = 2T(n − 1) + O(n log n)        (9)

whose solution is T(n) = O(2^n n log n). The algorithm remains an exponential time algorithm in the worst case, like the one based on exhaustive search, but, usually, the backtracking approach greatly improves the computational time by pruning the nonpromising branches. We recall, finally, that Daurat et al. (Daurat et al., 2002) proposed a polynomial algorithm to solve the restriction map problem. In the majority of cases, a state-space tree for a backtracking algorithm is constructed in a depth-first search manner, whilst Branch-and-Bound usually explores the state-space tree by using a best-first rule, i.e. the node with the best bound is explored first. Compared to Backtracking, Branch-and-Bound requires two additional items:

• a way to provide, for every node of the state-space tree, a bound on the best value of the objective function for any solution that can be obtained by adding further components to the partially constructed solution represented by the node;
• the value of the best solution seen so far.

If the bound value is not better than the value of the best solution seen so far, the node is nonpromising and can be pruned. Indeed, no solution obtained from it can yield a better solution than the one already available. Some of the most studied problems tackled with the Branch-and-Bound approach are: the Assignment problem, in which we want to assign n people to n jobs so that the total cost of the assignment is as small as possible; the Knapsack problem, in which we have n items with weights wi and values vi and a knapsack of capacity W, and we must find the most valuable subset of the items that fits in the knapsack; and the Traveling Salesman Problem (TSP), in which we have to find the shortest possible route that visits each city exactly once and returns to the origin city, knowing the distances between each pair of cities. Assignment, Knapsack and TSP are examples of NP-complete problems for which Branch-and-Bound is a viable approach if the input instance is very small.
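As a concrete illustration of the backtracking scheme, the following Python sketch (ours, not taken from the article) enumerates all solutions of the n-Queens problem, exploring the state-space tree depth-first and pruning nonpromising nodes as soon as a partial placement violates the constraints.

```python
def n_queens(n):
    """Backtracking over the state-space tree of the n-Queens problem.

    A partial solution cols[0:k] places one queen per row; a child node
    is promising if the new queen attacks none of the queens above it."""
    solutions = []

    def promising(cols, c):
        k = len(cols)  # row of the new queen
        # Attack iff same column, or same diagonal (|Δcolumn| == Δrow).
        return all(c != cj and abs(c - cj) != k - i
                   for i, cj in enumerate(cols))

    def extend(cols):
        if len(cols) == n:               # leaf: a complete feasible solution
            solutions.append(tuple(cols))
            return
        for c in range(n):               # children of the current node
            if promising(cols, c):
                cols.append(c)           # make the choice
                extend(cols)             # explore depth-first
                cols.pop()               # backtrack: undo the choice
    extend([])
    return solutions
```

For example, `n_queens(4)` finds the two solutions of the 4-Queens problem; without the `promising` test the same code would degenerate into exhaustive search over all n^n placements.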

Greedy Algorithms

The Greedy design technique defines a simple methodology related to the exploration of the solutions domain of optimization problems. The greedy approach suggests constructing a solution through a sequence of steps, each expanding the partially constructed solution obtained so far, until a complete solution to the problem is reached. A few main aspects make the greedy approach different from Branch-and-Bound. First, in the greedy approach no bound must be associated with the partial solution; second, the choice made at each step is irrevocable, hence backtracking is not allowed in the greedy approach. During the construction of the solution, the choice made on each step must be:

• feasible: the partial solution has to meet the problem's constraints;
• locally optimal: it has to be the best local choice among all of the feasible choices available on that step;
• irrevocable.

The Greedy approach is based on the hope that a sequence of locally optimal choices will yield an optimal solution to the entire problem. There are problems for which a sequence of locally optimal choices does yield an optimal solution for every input instance of the problem. However, there are others for which this is not the case, and the greedy approach can fail to find an optimal solution. As an example, let us consider the change-making problem. Given a set of coins with decreasing values C = {ci : ci > ci+1, i = 1,…, n − 1} and a total amount T, find the minimum number of coins needed to reach the total amount T. The solution is represented by a sequence of n occurrence counts for the corresponding coins. A greedy approach to the problem considers on step i the coin ci, and chooses the maximum possible number of occurrences subject to the constraint that the total amount accumulated so far must not exceed T. Let us suppose that we can use the coin values C = {50, 20, 10, 5, 2, 1} and that we have to change T = 48; the greedy approach suggests choosing no coins of value 50 on the first step, then 2 coins of value 20 on the second step, since this is the best choice to quickly reach the total amount T, and so on, building the solution S = {0, 2, 0, 1, 1, 1}. Greedy algorithms are both intuitively appealing and simple. Given an optimization problem, it is usually easy to figure out how to proceed in a greedy manner. What is usually more difficult is to prove that a greedy algorithm yields an optimal solution for all of the instances of the problem. The greedy approach applied to the change-making example given above provides an optimal solution for any value of T, but what happens if the coin values are C = {25, 10, 1} and the amount is T = 40? In this case, following the greedy approach, the solution would be S = {1, 1, 5}, but the best solution is instead S = {0, 4, 0}. Therefore, proving that the solution given by the greedy algorithm is optimal becomes a crucial aspect.
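The change-making discussion above can be captured in a few lines of Python (our illustrative sketch, not from the article); with the first coin system the greedy choice happens to be optimal, while with the second it is not.

```python
def greedy_change(coins, t):
    """Greedy change-making: coins must be listed in decreasing value.

    Returns the number of coins taken of each value, built by a sequence
    of irrevocable, locally optimal choices."""
    counts = []
    for c in coins:
        counts.append(t // c)   # take as many coins of value c as fit
        t %= c                  # remaining amount still to be covered
    if t != 0:
        raise ValueError("amount cannot be represented with these coins")
    return counts
```

With `coins = [50, 20, 10, 5, 2, 1]` and `t = 48` this yields {0, 2, 0, 1, 1, 1} as in the text; with `coins = [25, 10, 1]` and `t = 40` it yields {1, 1, 5}, i.e. 7 coins, even though 4 coins of value 10 suffice.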
One of the common ways to do this is through mathematical induction, where we must prove that the partially constructed solution obtained by the greedy algorithm on each iteration can be extended to an optimal solution to the problem. The second way to prove optimality is to show that on each step the algorithm does at least as well as any other algorithm could in advancing toward the problem's goal. The third way is simply to show that the final result is optimal, based on the algorithm's output rather than on the way it operates. Finally, if the problem's underlying combinatorial structure is a matroid (Cormen et al., 2009), then it is well known that the greedy approach leads to an optimal solution. The matroid mathematical structure was introduced by Whitney in 1935; his matric matroid abstracts and generalizes the notion of linear independence. In bioinformatics, one of the most challenging problems which can be solved through a greedy approach is genome rearrangement. Every genome rearrangement results in a change of gene ordering, and a series of these rearrangements can alter the genomic architecture of a species. The elementary rearrangement event is the flipping of a genomic segment, called a reversal. Biologists are interested in the smallest number of reversals between the genomes of two species, since it gives a lower bound on the number of rearrangements that have occurred and indicates the similarity between the two species. In their simplest form, rearrangement events can be modelled by a series of reversals that transform one genome into another. Given a permutation π = π1 π2 ⋯ πn−1 πn, a reversal ρ(i, j) has the effect of reversing the order of the block πi πi+1 ⋯ πj−1 πj. For example, the reversal ρ(3, 5) of the permutation π = 654298 produces the new permutation π · ρ(3, 5) = 659248.
The Reversal Distance problem can be formulated as follows: given two permutations π and σ, find the shortest series of reversals ρ1 · ρ2 ⋯ ρt that transforms π into σ. Without loss of generality, we can consider as the target permutation σ the ascending order of the elements; in this case the problem is also known as Sorting by Reversals. When sorting a permutation it hardly makes sense to move the elements already sorted. Denoting by p(π) the number of already sorted elements of π, a greedy strategy for sorting by reversals is to increase p(π) at every step. Unfortunately, this approach does not guarantee that the solution is optimal. As an example, consider π = 51234; following the greedy strategy we need four reversals to sort π: {ρ(1,2), ρ(2,3), ρ(3,4), ρ(4,5)}, but we can easily see that two reversals are enough: {ρ(1,5), ρ(1,4)}.
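The greedy strategy just described can be sketched in Python as follows (our illustrative code, with 1-based indices matching the ρ(i, j) notation); on π = 51234 it indeed performs the four reversals ρ(1,2), ρ(2,3), ρ(3,4), ρ(4,5) rather than the optimal two.

```python
def reversal(p, i, j):
    """Apply rho(i, j): reverse the block p[i..j] (1-based, inclusive)."""
    return p[:i - 1] + p[i - 1:j][::-1] + p[j:]

def greedy_reversal_sort(p):
    """Greedy sorting by reversals: at each step flip the block that puts
    the next element into place, extending the sorted prefix p(pi).

    Simple and intuitive, but not guaranteed to use the minimum number
    of reversals."""
    p = list(p)
    steps = []
    for k in range(1, len(p) + 1):
        if p[k - 1] != k:
            j = p.index(k) + 1      # 1-based position of element k
            p = reversal(p, k, j)   # rho(k, j) brings k to position k
            steps.append((k, j))
    return p, steps
```

For instance, `reversal([6, 5, 4, 2, 9, 8], 3, 5)` reproduces the example π · ρ(3, 5) = 659248 from the text.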

Conclusions

We have presented a survey of the most important algorithmic design techniques, highlighting their pros and cons, and putting them in context. Even though, owing to space limits, we sacrificed in-depth discussion and thorough treatment of each technique, we hope to have provided the interested reader with enough information to fully understand and appreciate the differences among the techniques.

References

Akra, M., Bazzi, L., 1998. On the solution of linear recurrence equations. Comput. Optim. Appl. 10 (2), 195–210. Available at: http://dx.doi.org/10.1023/A:1018353700639.
Babai, L., 1979. Monte-Carlo algorithms in graph isomorphism testing. Technical Report D.M.S. 79-10, Universite de Montreal.
Boyer, R., Moore, J., 1981. MJRTY – A fast majority vote algorithm. Technical Report 32, Institute for Computing Science, University of Texas, Austin.
Boyer, R., Moore, J.S., 1991. MJRTY – A fast majority vote algorithm. In: Automated Reasoning: Essays in Honor of Woody Bledsoe, Automated Reasoning Series. Dordrecht, The Netherlands: Kluwer Academic Publishers, pp. 105–117.
Buhler, J., Tompa, M., 2001. Finding motifs using random projections. In: Proceedings of the 5th Annual International Conference on Computational Biology. RECOMB '01. ACM, New York, NY, USA, pp. 69–76. Available at: http://doi.acm.org/10.1145/369133.369172.
Buhler, J., Tompa, M., 2002. Finding motifs using random projections. J. Comput. Biol. 9 (2), 225–242. Available at: http://dx.doi.org/10.1089/10665270252935430.
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C., 2009. Introduction to Algorithms, third ed. The MIT Press.
Danna, K., Sack, G., Nathans, D., 1973. Studies of simian virus 40 DNA. VII. A cleavage map of the SV40 genome. J. Mol. Biol. 78 (2),

Daurat, A., Gérard, Y., Nivat, M., 2002. The chords' problem. Theor. Comput. Sci. 282 (2), 319–336. Available at: http://dx.doi.org/10.1016/S0304-3975(01)00073-1.
Freivalds, R., 1977. Probabilistic machines can use less running time. In: IFIP Congress, pp. 839–842.
Gries, D., Levin, G., 1980. Computing Fibonacci numbers (and similarly defined functions) in log time. Inform. Process. Lett. 11 (2), 68–69.
Hirschberg, D.S., 1975. A linear space algorithm for computing maximal common subsequences. Commun. ACM 18 (6), 341–343. Available at: http://doi.acm.org/10.1145/360825.360861.
Hoare, C.A.R., 1962. Quicksort. Comput. J. 5 (1), 10–15.
Karatsuba, A., Ofman, Y., 1962. Multiplication of many-digital numbers by automatic computers. Dokl. Akad. Nauk SSSR 145, 293–294. [Translation in Physics-Doklady 7, 595–596, 1963].
Kleinberg, J., 2011. Algorithm Design, second ed. Addison-Wesley Professional.
Knuth, D.E., 1998. The Art of Computer Programming, vol. 1–3, Boxed Set, second ed. Boston, MA: Addison-Wesley Longman Publishing Co., Inc.
Kozen, D.C., 1992. The Design and Analysis of Algorithms. New York, NY: Springer-Verlag.
Levitin, A.V., 2006. Introduction to the Design and Analysis of Algorithms, second ed. Boston, MA: Addison-Wesley Longman Publishing Co., Inc.
Manber, U., 1989. Introduction to Algorithms: A Creative Approach. Boston, MA: Addison-Wesley Longman Publishing Co., Inc.
Mehlhorn, K., Sanders, P., 2010. Algorithms and Data Structures: The Basic Toolbox, first ed. Berlin: Springer.
Motwani, R., Raghavan, P., 2013. Randomized Algorithms. New York, NY: Cambridge University Press.
Sedgewick, R., Wayne, K., 2011. Algorithms, fourth ed. Addison-Wesley Professional.
Skiena, S.S., 2010. The Algorithm Design Manual, second ed. Berlin: Springer Publishing Company.
Skiena, S.S., Smith, W.D., Lemke, P., 1990. Reconstructing sets from interpoint distances (extended abstract). In: Proceedings of the 6th Annual Symposium on Computational Geometry. SCG '90.
ACM, New York, NY, pp. 332–339. Available at: http://doi.acm.org/10.1145/98524.98598.
Strassen, V., 1969. Gaussian elimination is not optimal. Numerische Mathematik 13 (4), 354–356.
Williams, J.W.J., 1964. Heapsort. Commun. ACM 7 (6), 347–348.