BioSystems 90 (2007) 687–697
A P system and a constructive membrane-inspired DNA algorithm for solving the Maximum Clique Problem
Marc García-Arnau a,∗, Daniel Manrique a, Alfonso Rodríguez-Patón a, Petr Sosík a,b
a Departamento Inteligencia Artificial, Universidad Politécnica de Madrid (UPM), Boadilla del Monte s/n, 28660 Madrid, Spain
b Institute of Computer Science, Faculty of Philosophy and Science, Silesian University, Bezručovo nám. 13, 74601 Opava, Czech Republic
Received 2 April 2006; received in revised form 22 December 2006; accepted 20 February 2007
Abstract We present a P system with replicated rewriting to solve the Maximum Clique Problem for a graph. Strings representing cliques are built gradually. This involves the use of inhibitors that control the space of all generated solutions to the problem. Calculating the maximum clique for a graph is a highly relevant issue not only on purely computational grounds, but also because of its relationship to fundamental problems in genomics. We propose to implement the designed P system by means of a DNA algorithm. This algorithm is then compared with two standard papers that addressed the same problem and its DNA implementation in the past. This comparison is carried out on the basis of a series of computational and physical parameters. Our solution features a significantly lower cost in terms of time, the number and size of strands, as well as the simplicity of the biological implementation. © 2007 Elsevier Ireland Ltd. All rights reserved. Keywords: Membrane computing; Maximum Clique Problem; DNA computing; Constructive approach; NP-complete problem
1. Introduction

Natural computing is the name given to the discipline covering a number of areas working with unconventional and naturally inspired computational models. One of the primary fields making up this discipline is biomolecular computing, which came into being in 1994 after Leonard Adleman’s famous experiment (Adleman, 1994). In his paper, Adleman used DNA molecules for the first time to solve a complex computational problem: the Hamiltonian Path Problem. A year later Lipton generalized the techniques that Adleman had used, proposing a computational model that solved the Satisfiability Problem for logical formulas, SAT (Lipton, 1995). Since then, many papers exploring the construction of new models and examining the use of molecules as support for computation have been published. The following documents are good references for researchers in this area: Păun (1998), Hagiya (1999), Amos (1999) and Păun et al. (1998). Moreover, two excellent references of current work in this discipline are Benenson et al. (2004) and Seelig et al. (2006).

Another of the major milestones in the short history of natural computing was, unquestionably, the emergence of membrane computing (also known as P systems), introduced by Gheorghe Păun in 1998 (Păun, 2000). In his seminal article “Computing with Membranes”, Păun presented a new abstract computational model inspired by the structure and behaviour of living cells. As a result of that early work, a sizeable group of researchers was drawn to membrane computing and a large body of related literature has appeared since then (Calude and Păun, 2004).

∗ Corresponding author. Tel.: +34 91 336 69 07; fax: +34 91 352 48 19. E-mail address: [email protected] (M. García-Arnau).
0303-2647/$ – see front matter © 2007 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.biosystems.2007.02.005
Many of these papers addressing both membrane and DNA computing have tackled computationally difficult problems (class NP-complete) (Ledesma et al., 2004, 2005). NP-complete problems have two prominent features: (1) there are as yet no polynomial algorithms to solve them, and (2) all their “yes” instances are certified to be verifiable efficiently (Garey and Johnson, 1979). Although, thanks to the parallel and distributed capabilities of these new paradigms, it has been possible to solve NP-complete problems in polynomial time, researchers have not managed to limit the exponential growth of some of the other resources involved in problem solving, such as the number of strings or molecules or even the number of hardware components needed. Therefore, a great deal of effort (especially in molecular computing) has gone into optimizing the exponential consumption of these resources. Several strategies have been proposed for this purpose: first, the adaptation of especially space-efficient classical algorithms to a DNA scenario, as was done for 3SAT in Ogihara (1996). Second, the creation of new constructive algorithms conscientiously designed to optimize the number of DNA strings, as in the solution of the maximum independent set problem in Bach et al. (1996) (which nevertheless involves some pre-processing work on a conventional computer). Third, the use of strategies that reduce the number of biological operations needed and the resulting risk of error, as in the work of Manca and Zandron (2001) to solve the SAT. Fourth, the application of dynamic DNA programming techniques, as demonstrated in Baum and Boneh (1996) to solve the Knapsack Problem or even the creation of computational models based on a destructive strategy over RNA strands as in Cukras et al. (1999) for solving the Knight Problem. 
After analysing these and other papers, it is clear that, when dealing with an NP-complete problem, a compromise has to be reached between space optimization and algorithm time, without overlooking the number and complexity of the operations involved. Indeed, space-inefficient linear-time algorithms soon place limitations on the size of the instances that they can manage to solve, whereas algorithms that mind space efficiency can turn out to be relatively slow or complex because of the type of operations they use, which increases the likelihood of processing errors. Based on these points, we use membrane computing in this article to solve a standard NP-complete problem: calculating the maximum clique for a graph. This problem is very interesting for several reasons. First, it is related to the Common Algorithmic Problem (CAP) defined by Tom Head in Head et al. (1999). In this article, Head defended that the solution of many NP-complete problems fits one and the same algorithmic mould. Later,
this problem was further examined in Pérez-Jiménez and Romero-Campero (2005), and a family of recognizer P systems with active membranes was used to solve it. The CAP definition can be easily mapped to the Maximum Clique Problem for a graph. Indeed, it has been found that there is a series of NP-complete Constraint Satisfaction Problems (CSP) that can be solved by finding the maximum clique of the complementary graph of the problem constraints graph.

Apart from its purely computational appeal, the Maximum Clique Problem is also of notable biological interest. Indeed, important problems have recently been identified within the field of systems biology that call for the solution of the Maximum Clique Problem, the maximum independent set problem or calculating all the maximal cliques for a graph. Some of these problems are compiled and stated in Butenko and Wilhelm (2005), including matching three-dimensional molecular structures, protein docking or genome rearrangements and mapping genome data.

The remainder of the article is organized as follows. Section 2 presents the definition of the Maximum Clique Problem, reviews a number of papers that have addressed the problem to date and proposes a P system with replicated rewriting and inhibitors to solve the problem. The proposed P system adopts a constructive problem-solving strategy, quite different from the brute force strategies usually used in both membrane computing and many biomolecular algorithms. In Section 3, we propose the implementation of this P system using a DNA algorithm. Thus implemented, the proposed solution can be compared in Section 4 with two standard papers that tackled the same problem earlier, Ouyang et al. (1997) and Head et al. (1999). A number of computational and physical parameters are defined to make this comparison. Finally, Section 5 sets out the final remarks.

2. A P system with replicated rewriting and inhibitors for solving the Maximum Clique Problem

2.1. The Maximum Clique Problem and related work

Let G = (V, E) be a graph with n nodes. A clique in G is defined as a subset V′ ⊆ V such that every two vertices in V′ are connected by an arc in E. The Maximum Clique Problem then involves finding the biggest subset V′ of totally connected nodes in the graph.

Over the last decade, many papers have tackled this problem from the point of view of the natural computing paradigm. Indeed, a DNA computing model that uses the generation of integer combinations and a series of biological operations to solve NP-complete problems is presented in Amos et al. (1996). Among other algorithms proposed in that paper, there is one that solves the Maximum Clique Problem for a graph. However, a DNA bioalgorithm to solve this problem was implemented for the first time in Ouyang et al. (1997). Since this ground-breaking paper, many others have addressed the same problem from different viewpoints. For example, the work of Bäck et al. (1999) focuses on the relation between evolutionary computing and DNA computing, proposing an evolutionary DNA approach to the Maximum Clique Problem. Later, a parallel algorithm using fluid displacement in a three-dimensional microfluidic system to solve the Maximum Clique Problem in a six-vertex graph was presented in Chiu et al. (2001). Almost simultaneously, another microflow reactor was used in McCaskill (2001) to solve the same problem using a brute force strategy and codifying each possible subgraph as a DNA strand. An aqueous algorithm was proposed in Head et al. (1999) to solve the same instance of the Maximum Clique Problem as in Ouyang et al. (1997), but this time employing a new model using plasmids to store the algorithm string information. Finally, the work of Zimmermann (2002) is an example of the use of the DNA computing sticker model to get all the cliques of size K of a graph. In the following, we present a new proposal for solving the Maximum Clique Problem of a graph, using a constructive P system with replicated rewriting and inhibitors.

2.2. A P system with replicated rewriting and inhibitors

P systems with replicated rewriting are defined as membrane systems whose basic objects are not symbols but structured elements, like strings. These systems contain multisets of strings that are processed using replicated rewriting rules, that is, rules that, apart from modifying strings, can increase the number of their copies.
Each such rule consists of n ≥ 1 subrules. When a rule is applied to a string, the string is first replicated into n copies and then each subrule is applied to one copy. Usually, both P systems with replicated rewriting and other types of P systems (for example P systems with active membranes) adopt brute force strategies to solve NP-complete problems. Hence, these systems use different tools (rewriting rules with replication or membrane creation, respectively) to generate the space of all solutions to the combinatorial problem and then select the best one (Krishna and Rama, 2001; Zandron et al., 2000).
For a problem like calculating the maximum clique for a graph with n nodes, this would mean first generating the 2^n possible cliques and then selecting the biggest one. However, it is fairly clear that not all the 2^n possible cliques of a graph will exist in most cases. Indeed, for any graph G, only those cliques that do not contain two nodes linked by an arc in its complementary graph G′ are valid, since any pair of nodes linked in G′ is not connected in G. If that complementary graph is G′ = (V, E′), then E′ is what we will term the constraints set. In order to make use of the valuable information contained in the complementary graph G′, we define sets of inhibitors for the rules of our P system. The idea of using promoters or inhibitors in P system rules has a clear biological inspiration (Bottoni et al., 2002; Ionescu and Sburlan, 2004): a rule modelling a biological reaction may or may not take place in the presence of certain enzymatic proteins. In our case, for each rule Ri of the P system, we define a set of inhibitors Ui containing those symbols aj in the presence of which the rule cannot be applied. Hence, the P system proposed here uses the information from the constraints set E′ to gradually build only those cliques of G that are valid, thereby optimizing the number of strings needed to solve the problem. Given a graph G = ({a1, ..., an}, E), we formally define this P system as a construct

\[ \Pi = (V, \mu, M_1, \ldots, M_n, R_1, \ldots, R_n) \]

where

\begin{align*}
V &= \{a_i, d \mid 1 \le i \le n\},\\
\mu &= [_n \cdots [_2 [_1\, ]_1 ]_2 \cdots ]_n,\\
M_1 &= \{d\}, \quad M_i = \{\lambda\} \text{ with } 2 \le i \le n,\\
R_i &= \{dw \to (dwa_i, out)_{\neg U_i} \,\|\, (dw, out) \mid w \in \{a_j \mid 1 \le j \le i-1\}^{*},\ 0 \le |w| \le i-1\},\\
U_i &= \{a_j \in V \mid \{a_i, a_j\} \in E'\}, \quad 1 \le i \le n.
\end{align*}
Thus, the alphabet of the P system consists of n + 1 elements: d, an auxiliary symbol, and a1 , . . ., an , the nodes of the graph. Furthermore, the defined P system Π has a membrane structure μ composed of n embedded membranes. Each membrane contains a set of rules Ri with replication. Each rule is composed of a couple of subrules ri,a : dw → (dwai , out)¬Ui and ri,b : dw → (dw, out). Initially, the innermost membrane 1 contains a string with a single symbol d (M1 = {d}), whereas the other membranes are empty (Mi = {λ}, 2 ≤ i ≤ n). The system starts to work by applying the two rewriting subrules of membrane 1 to that string d. At the end of that first step, two strings are created and sent to the next
membrane. The process is repeated in each membrane, applying the respective rules Ri, so that, step by step, all the valid cliques of the graph are generated. Note that the subrules ri,b are always applicable to all strings, whereas the subrules ri,a are only applicable if the string to be rewritten does not contain any of the symbols in the inhibitor set Ui. Each of these inhibitor sets contains the symbols aj linked by an arc in G′ to the node ai, that is, Ui = {aj ∈ V | {ai, aj} ∈ E′}. When a string does contain such a symbol, the only applicable subrule is ri,b, and there is no replication. In step n, then, the system outputs a language consisting of all the valid cliques for the graph G, the biggest of which is the solution to the Maximum Clique Problem for that graph. In Section 3 we propose an implementation of Π by means of a DNA algorithm that uses a constructive strategy to simulate the behaviour of the P system.
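To make the behaviour of the rules concrete, the following minimal Python sketch simulates Π on a small hypothetical four-node graph whose constraints set contains the single pair {a1, a3}. The function name `run_p_system`, the encoding of strings as tuples of symbols and the toy graph itself are illustrative assumptions, not part of the formal definition.

```python
def run_p_system(n, constraints):
    """Simulate the n membranes of the P system on the initial string d.

    Strings are tuples of symbols, e.g. ("d", "a1", "a2").
    `constraints` lists the pairs (i, j) with {a_i, a_j} in the constraints set.
    """
    # Inhibitor sets U_i: symbols whose presence blocks subrule r_{i,a}.
    inhibitors = {i: set() for i in range(1, n + 1)}
    for i, j in constraints:
        inhibitors[i].add(f"a{j}")
        inhibitors[j].add(f"a{i}")

    strings = [("d",)]  # initial contents of membrane 1
    for i in range(1, n + 1):
        sent_out = []
        for s in strings:
            sent_out.append(s)                   # r_{i,b}: dw -> (dw, out)
            if inhibitors[i].isdisjoint(s):      # r_{i,a} fires only without inhibitors
                sent_out.append(s + (f"a{i}",))  # dw -> (dw a_i, out)
        strings = sent_out                       # strings enter the next membrane
    return strings


# Hypothetical toy instance: 4 nodes; only a1 and a3 are NOT adjacent in G.
language = run_p_system(4, [(1, 3)])
longest = max(len(s) for s in language)
maximum_cliques = sorted(s[1:] for s in language if len(s) == longest)
print(len(language), maximum_cliques)
```

In this sketch a string is replicated only when the subrule that appends a_i is not inhibited, so no string containing both a1 and a3 is ever produced.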
2.3. Extracting only the maximum size cliques

The P system proposed in the previous section simultaneously generates strings representing all the possible cliques of the graph G. When implemented in the DNA framework as described in the next sections, the strings are represented by DNA strands. Therefore, using the technique called electrophoresis, we can easily separate the longest strands, which represent the maximum cliques. One can ask, however, how the P system itself could separate the strings representing the maximum cliques, and so provide the solution of the Maximum Clique Problem. To achieve this, we enrich the structure of the P system Π as follows:

\[ \Pi' = (V', \mu', M_1, \ldots, M_{n+1}, R_1, \ldots, R_{n+1}) \]

where

\begin{align*}
V' &= \{a_i, d_j \mid 1 \le i \le n,\ 0 \le j \le n\},\\
\mu' &= [_{n+1} \cdots [_2 [_1\, ]_1 ]_2 \cdots ]_{n+1},\\
M_1 &= \{d_0\}, \quad M_i = \{\lambda\} \text{ with } 2 \le i \le n+1,\\
R_i &= \{d_k w \to (d_{k+1} w a_i, out)_{\neg U_i} \,\|\, (d_k w, out) \mid w \in \{a_j \mid 1 \le j \le i-1\}^{*},\ |w| = k,\ 0 \le k \le i-1\},\\
U_i &= \{a_j \in V \mid \{a_i, a_j\} \in E'\}, \quad 1 \le i \le n,\\
R_{n+1} &= \{d_k w \to (d_{k+1} w, here) \mid w \in \{a_j \mid 1 \le j \le n\}^{*},\ 0 \le |w| \le n,\ 0 \le k \le n-1\}\\
&\quad \cup \{d_n w \to (d_n w, out) \mid w \in \{a_j \mid 1 \le j \le n\}^{*},\ 0 \le |w| \le n\}.
\end{align*}

The function of the P system Π′ is quite similar to that of Π, with the following difference. Each produced string bears explicit information about the size of the clique it represents, encoded in the symbol dk. After n steps, the strings representing all cliques are collected in membrane n + 1. From then on, the system outputs in each step only strings of a certain size, in decreasing order. Therefore, in step n + 1 the strings representing cliques of size n are sent out (if there are any), in step n + 2 the cliques of size n − 1, and so on. The first output from the system represents the maximum cliques. By adding further symbols and rules, one can easily lengthen the intervals between sending out solutions of different sizes, so that there is enough time to separate the first (and maximum) solution.

3. A constructive membrane-inspired DNA algorithm for solving the Maximum Clique Problem

3.1. Computational model

The computational benefits of using DNA come from its complementarity. DNA strands, formed by repeating four types of nitrogenous bases {A, C, G, T}, have the natural property of pairing, meaning that two complementary strands oriented in opposing directions pair to form a double helix structure. This is known as DNA’s secondary structure and was discovered by Watson and Crick in 1953. Chemical properties of DNA guarantee that adenine (A) can only pair with thymine (T), whereas guanine (G) can only pair with cytosine (C).

Lipton’s computational model, or the test tube model, can be used to solve generic computational problems by defining the problem strings as binary words constructed over the {A, C, G, T} alphabet. In the algorithms based on this model, the initial tube is composed of the binary words of n bits that encode the space of all problem solutions. Additionally, this model defines a series of biological operations that are used to operate on the initial contents of that test tube until the correct solution is finally reached.

In the model that we propose to implement the P system described in Section 2.2, the problem solution is represented in a manner more reminiscent of what Adleman did, that is, as a word over a finite alphabet determined by the problem domain. Quite contrary to the stipulations of Adleman’s and Lipton’s models, the initial test tube is empty in our case, that is, it does not contain the space of all the possible solutions of the problem. In this way the problem strings (in this case valid cliques for the graph) are built gradually. This point is conceptually much closer to how the proposed P system operates, gradually generating the strings from the initial symbol d. Apart from these differences, the biological operations used to implement the P system are similar to the
ones defined in Adleman’s and Lipton’s models. Their function is to allow problem strings to be built gradually, removing any that are invalid as they are detected. Specifically, the following operations on DNA are used:
1. Merge (t1, t2, t3): Mixes two different multisets of strings (tubes) t1 and t2 to form a single multiset in t3.
2. Replicate (t1, t2, t3): Replicates the contents of a multiset of strings (tube) t1 into two multisets t2 and t3 such that each of t2 and t3 contains one copy of the original content of t1, which is emptied.
3. Separate (t1, a, t2): Given a tube or multiset of strings t1 and a substring a, we extract from t1 all the strings containing a, and a tube t2 is generated with all the extracted strings.
4. Append (t1, a): Given a tube t1, append the string a to the end of all the strings the tube contains. If t1 is empty, then the result of the operation is just {a}.
5. Delete (t1): Given a tube t1, delete its contents.
6. MeasureStrings (t1): Measure the strings of tube t1 to identify the biggest.

Following the design instructions set out in Ouyang et al. (1997), the use of 20-base pair long strands (20-mer) to encode each node ai of the problem domain is considered sufficient for the proposed examples. The last model operation, MeasureStrings, can be implemented using gel electrophoresis. The Merge and Delete operations are simple manipulations on test tubes. The Replicate operation can be performed via one cycle of the polymerase chain reaction (PCR). To ensure that both t2 and t3 contain one copy of each original strand, we can mark one primer of each pair (using, e.g. magnetic beads) and, after performing PCR, separate all the marked strands. Finally, a series of operators are needed to implement the Separate and Append operations (Fig. 1). Specifically, n separation operators (complementary strands of the n elements ai) are needed for the Separate operation.
As regards the Append operation, the strand encoding ai can be considered to be composed of a prefix and a suffix of 10-mer each, called p(ai) and s(ai). Then, each element ai needs exactly i − 1 append operators of the form 5′-s(aj)p(ai)-3′, 1 ≤ j < i. Therefore, the algorithm needs a total of k operators of this type, with:

\[ k = \sum_{i=1}^{n-1} i = \frac{n(n-1)}{2} \]
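As a quick check of this count, the short Python sketch below enumerates the append operators as index pairs (j, i) with 1 ≤ j < i ≤ n. The function name `append_operators` and the pair encoding are assumptions made for illustration only.

```python
def append_operators(n):
    """List the pairs (j, i): the operator joining s(a_j) to p(a_i), with j < i."""
    return [(j, i) for i in range(1, n + 1) for j in range(1, i)]


operators = append_operators(6)            # a six-node graph
per_element = {i: sum(1 for _, t in operators if t == i) for i in range(1, 7)}
print(len(operators), per_element)
```

For n = 6 the enumeration yields 15 operators, in agreement with n(n − 1)/2, and element ai contributes exactly i − 1 of them.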
3.2. The membrane-inspired DNA algorithm In this section, we present the DNA algorithm based on the structure and behaviour of the P system with replicated rewriting and inhibitors proposed in Section 2.2. As discussed earlier, this algorithm does not use binary words of constant length, but variable-sized strings on a problem domain-equivalent alphabet. In the case of the Maximum Clique Problem, this domain is the set V of nodes of the graph. Therefore, a clique for the graph is encoded by means of a word that contains only the nodes that belong to that clique, thereby ruling out the representation of those nodes that are not part of that clique. The implementation of our P system in terms of a DNAbased algorithm in the computational model set out in the last section is based on the following points: 1. The different membranes of the P system constitute the physical space that delimits the scope of the replicated rewriting operations on problem strings in each iteration. Therefore, this number of membranes is directly related to the number of iterations needed by the DNA algorithm. Specifically, the algorithm makes n iterations, one for each P system membrane. 2. As it possesses rules Ri with replication (ri,a , ri,b ), the P system is able to multiply its number of strings at every step, thereby generating the exponentially increasing space needed to solve NP-complete problems in linear time. From the viewpoint of a DNA algorithm, this replication property is equivalent to the capability of dividing the strands space into
Fig. 1. DNA strands for coding domain elements, append operators and separation operators: (a) domain element ai ; (b) element ai separation operator; (c) element ai append operators.
several subsets, enabling different operations to be performed on each one. The Replicate operation is responsible for this task. The maximum number of strings that can be generated after a rule has been applied is called degree of replication. This degree of replication is then directly related to the number of subsets (tubes) into which the DNA strands space can be replicated. As the degree of replication of our P system is two (each rule Ri generates at most two strings in each step), the Replicate operation has to generate no more than two subsets t2 and t3 from an input set t1 (Replicate (t1 , t2 , t3 )). 3. The subrule ri,a of each P system rule Ri carries out the rewriting operation that adds a new node ai to the
strings. However, this operation is only performed on valid strings, that is, strings that contain no node aj such that {ai, aj} ∈ E′, with i > j. From the viewpoint of a DNA algorithm, it takes more than one operation to evaluate this condition. First of all, the algorithm loops through the constraints set E′ in search of any such nodes aj, with the aim of separating and deleting all the strands containing one of them. Once this process is complete, the node ai is appended to the remaining strings. Therefore, the P system rewriting rules are directly related to the Separate, Delete and Append operations of the DNA computing model described in the last section. 4. After applying the P system rules to a set of strings in a particular membrane, the resulting strings are sent to the next system membrane. Therefore, the region defined by that receptor membrane is responsible for grouping all the strings generated in the last system step. From the viewpoint of the DNA algorithm, there needs to be an operation that can be used to cluster the
different subsets of strands. In the model used, this task is carried out by the Merge (t1 , t2 , t3 ) operation, which, of course, should be able to group as many subsets of strings as the Replicate (t1 , t2 , t3 ) operation can generate. 5. Finally, the DNA algorithm makes use of the MeasureStrings (t1 ) operation to determine the problem solution (composed of the biggest string exiting the skin membrane of the system in the last step n). In the following, we present the DNA algorithm, working in linear time with respect to the number of nodes |V|, that implements the P system Π proposed to solve the Maximum Clique Problem for a graph:
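The five points above can be put together in a minimal Python sketch. It is an illustration under stated assumptions rather than the authors' listing: tubes are modelled as Python lists of tuples of node labels, the bare start symbol d is modelled as an explicit empty tuple (the empty clique has no physical strand), and all function names are hypothetical. The instance is the six-node graph of Section 3.3, whose constraints set is {{0, 2}, {0, 5}, {1, 5}, {1, 3}}.

```python
def replicate(t1):
    """Replicate: copy tube t1 into two tubes."""
    return list(t1), list(t1)

def separate(t1, a):
    """Separate: split t1 into (strands without a, strands containing a)."""
    return [s for s in t1 if a not in s], [s for s in t1 if a in s]

def append(t1, a):
    """Append: attach node a to the end of every strand in t1."""
    return [s + (a,) for s in t1]

def merge(t1, t2):
    """Merge: pour tubes t1 and t2 into a single tube."""
    return t1 + t2

def delete(t1):
    """Delete: discard the contents of tube t1."""
    t1.clear()

def measure_strings(t1):
    """MeasureStrings: return the longest strands in t1."""
    longest = max(len(s) for s in t1)
    return [s for s in t1 if len(s) == longest]

def constructive_max_clique(nodes, constraints):
    tube = [()]              # assumption: () stands for the bare symbol d
    processed = set()
    sizes, separations = [], 0
    for a_i in nodes:                        # one iteration per membrane
        keep, work = replicate(tube)         # `keep` plays the role of r_{i,b}
        for pair in constraints:
            if a_i in pair:
                (a_j,) = set(pair) - {a_i}
                if a_j in processed:         # only earlier nodes occur in strands
                    work, invalid = separate(work, a_j)
                    delete(invalid)
                    separations += 1
        work = append(work, a_i)             # r_{i,a} on the remaining strands
        tube = merge(keep, work)
        processed.add(a_i)
        sizes.append(sum(1 for s in tube if s))  # count non-empty strands only
    return measure_strings(tube), sizes, separations


constraints = [(0, 2), (0, 5), (1, 5), (1, 3)]
cliques, sizes, separations = constructive_max_clique(range(6), constraints)
print(cliques, sizes, separations)
```

Under these assumptions the sketch returns the clique (2, 3, 4, 5), produces 1, 3, 5, 8, 17 and 25 distinct non-empty strands after the six iterations, and performs four separation operations, which agrees with the values reported for the constructive algorithm in Table 2.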
3.3. Algorithm execution

In the following, we detail an execution of the algorithm presented in the above section (Table 1). To do this, we chose exactly the same instance of the problem as published in the papers Ouyang et al. (1997) and Head et al. (1999). Fig. 2 shows the graph G and its complementary graph G′.

Fig. 2. Graphs G and G′. The Maximum Clique of G is {2, 3, 4, 5} and its constraints set is E′ = {{0, 2}, {0, 5}, {1, 5}, {1, 3}}.
Table 1 Algorithm execution for the graph illustrated in Fig. 2
4. Comparison

In this section, we compare the algorithm proposed in this paper with those described in Ouyang et al. (1997) and Head et al. (1999). In this comparison, we look at how the three algorithms work on two different examples. The first example is exactly the same as the one described in the last section. The second is a modification of the first in which the maximum clique is still the same, but the size of the constraints set E′ is bigger. The next two paragraphs describe how the other two algorithms work.

In their paper, Ouyang et al. (1997) propose the use of a brute force strategy to solve the Maximum Clique Problem. Hence, the cliques in this “brute force algorithm” are represented as binary strings of n pairs of elements (position, value). The position component in each pair indicates the respective node, whereas the value component contains one or zero depending on whether or not this node is in the clique. The sequence of nucleotides (from the {A, C, G, T} alphabet) selected to represent the position and value components is designed such that a DNA strand representing an invalid clique is generated with a particular restriction enzyme recognition sequence. In this way, once the space of all possible problem solutions (2^n cliques) has been generated, a series of restriction enzymes are applied to break the strands encoding all these invalid cliques. Hence, these strands are disabled and are not multiplied in successive PCR cycles applied to the working set. As a result, gel electrophoresis can be applied in the last step to measure the strands in the set and identify the optimum problem solution.

Table 2 Comparison between the “brute force algorithm”, the “aqueous algorithm” and the “constructive algorithm” for the graph illustrated in Fig. 2, with V = {0, 1, 2, 3, 4, 5} and E′ = {{0, 2}, {0, 5}, {1, 5}, {1, 3}}

| Parameter | Brute force algorithm | Aqueous algorithm | Constructive algorithm |
|---|---|---|---|
| No. of iterations | 4 | 4 | 6 |
| No. of different molecules per iteration (a) | 48, 40, 32, 26 | 2, 4, 7, 12 | 1, 3, 5, 8, 17, 25 |
| Volume of the no. of copies of the problem-solving molecule after the last iteration (b) | 1/26 | 1/16 | 1/25 |
| No. of operations performed (c) | 8 | 8 | 4 |
| Mean string size per iteration (d) | 173.3, 175, 177.5, 179.6 | 179, 182, 185.3, 188.3 | 20, 26.7, 28, 30, 38.8, 42.4 |
| Total mean string size | 176.3 | 185.2 | 31 |
| No. of restriction enzymes needed | 6 | 6 | 0 |

(a) The brute force algorithm starts with the space of all possible solutions and deletes strands until it gets all the strands that encode valid cliques. The constructive algorithm starts from an empty space and gradually constructs the set of all valid cliques (an empty clique is not generated, as the string that represents it does not physically exist). Finally, the aqueous algorithm solves the problem without having generated all possible molecule types (cliques) for the graph in this case.
(b) The different generated molecule types occupy the same volume in both the brute force algorithm and the constructive algorithm. Therefore, this is calculated as (1/no. of different molecule types). With respect to the aqueous algorithm, even though it generates a smaller number of molecule types, some of these are generated several times, which upsets the balance between the proportions of each molecule type.
(c) Although all three algorithms perform different operations, this row counts, for each algorithm, the operations that are considered biologically most complex. Therefore, we have counted the number of cut operations using restriction enzymes for the brute force algorithm, each Reset on a station for the aqueous algorithm, and the number of separation operations for the constructive algorithm.
(d) Even though each copy of one and the same molecule type is stored in a plasmid of about 3000 bp in the aqueous algorithm, the size that is shown in the table refers to the MCS region of that plasmid, which is where the information about the clique encoded by that molecule is stored.

The problem is tackled in quite a different way in Head et al. (1999). This paper defines a new model called “aqueous computing”, which proposes the use of plasmids (circular DNA strings that are approximately 3000 base pairs (bp) long) to represent binary-encoded cliques. The information about the clique is actually to be found in a 175 bp subsegment of the plasmid, called MCS (multiple cloning site). That subsegment is further divided into n regions called stations. Each station represents a node of the graph and is associated with a particular restriction enzyme. Additionally, the model has an operation, called Reset(k), which can be used to set the value of station k to 0. This operation consists of three basic steps: (1) linearize the plasmid by cutting at station k using its associated restriction enzyme, (2) extend the 3′ ends with polymerase to produce a linear molecule with blunt ends and (3) apply ligase to make that blunt-ended linear molecule circular again. This produces a plasmid in which the size of station k has been increased by 4 bp. Therefore, it no longer encodes the recognition region of its associated restriction enzyme.

Hence, the “aqueous algorithm” works as follows. Initially, all the molecule stations are set to 1, indicating that all the nodes are present in the clique. Then, a step is carried out for every arc {ai, aj} that is in the complementary graph G′. The strings space is split into two subsets t1 and t2 in each step. The molecules in t1 are subject to the Reset(ai) operation and the molecules in t2 to the Reset(aj) operation. Finally, the contents of t1 and t2 are poured into a single set. At the end of the algorithm, the number of 1s in the molecules is counted to determine the maximum clique that solves the problem.
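The aqueous procedure just described can be traced with a short Python sketch. It assumes that molecules are modelled as tuples of station bits, that every molecule type is present in both halves when the pool is split (each type exists in many copies), and that the arcs of the complementary graph are processed in the listed order; the function names are hypothetical.

```python
def reset(molecule, k):
    """Reset(k): set station k of a molecule to 0."""
    return molecule[:k] + (0,) + molecule[k + 1:]

def aqueous(n, complement_arcs):
    pool = {(1,) * n}                          # all stations initially set to 1
    types_per_step = []
    for i, j in complement_arcs:               # one step per arc of the complement
        t1 = {reset(m, i) for m in pool}       # first half: Reset(a_i)
        t2 = {reset(m, j) for m in pool}       # second half: Reset(a_j)
        pool = t1 | t2                         # pour both halves together
        types_per_step.append(len(pool))
    best = max(pool, key=sum)                  # molecule with the most 1s
    return pool, types_per_step, best


arcs = [(0, 2), (0, 5), (1, 5), (1, 3)]        # complement arcs of the Fig. 2 graph
pool, types_per_step, best = aqueous(6, arcs)
clique = [k for k, bit in enumerate(best) if bit == 1]
print(types_per_step, clique)
```

With these assumptions the number of distinct molecule types after each step is 2, 4, 7 and 12, as listed for the aqueous algorithm in Table 2, and the molecule with the most 1s encodes the clique {2, 3, 4, 5}.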
We have selected a number of computational and physical parameters in order to make the intended comparison. The goal is to characterize how the three algorithms work well enough to reveal the strengths and weaknesses of each one. Table 2 presents this comparison based on the following parameters:

1. No. of iterations: This indicates how many steps it takes for the algorithm to finish.
2. No. of different molecules per iteration: This indicates the number of physically different molecules (not copies of the same molecule) that the algorithm generates after each iteration.
3. Volume of the no. of copies of the problem-solving molecule after the last iteration: This parameter measures the total volume taken up by the copies of the problem-solving molecule at the end of the algorithm. If all the algorithms are considered to work on a constant volume of molecules equal to 1, the volume taken up by the copies of each individual molecule type will fall as the algorithms operate and generate new molecule types. This parameter is directly related to the error probability of the biological operations. If the volume of the problem-solving string drops substantially, there is a higher risk of losing the problem-solving string during some biological manipulation.
4. No. of operations performed: This indicates the total number of biological operations carried out by the algorithm. This parameter needs further explanation, as the type and complexity of the operations used in each algorithm may differ significantly.
5. Mean string size per iteration: This parameter is calculated for each iteration as the mean of the sizes of the molecule types, each weighted by the proportion of its number of copies. Size is measured in terms of number of bases or base pairs (bp).
6. Total mean string size: This measures the mean string size in each of the algorithms. It is calculated as the arithmetic mean of the mean sizes of all the iterations.
7. No. of restriction enzymes needed: This indicates the total number of restriction enzymes that the algorithm requires to solve a particular problem instance.

Some of the values reported in Table 2 (referring to the brute force algorithm and the aqueous algorithm) were already reported explicitly or implicitly in the compared papers (e.g. the number of iterations, the number of different molecules per iteration, the size of molecules or the restriction enzymes needed). All other values have been calculated by tracing the execution of the different algorithms for the corresponding instances of the problem.

In the following, we present a second example to complete the comparison. The selected graph in this case (Fig. 3) has the same set of nodes as the previous one, V = {0, 1, 2, 3, 4, 5}, and its maximum clique is still {2345}. However, the size of the constraints set has been increased by three, and is now E = {{0, 1}, {0, 2}, {0, 3}, {0, 5}, {1, 3}, {1, 4}, {1, 5}}. Table 3 shows the results of applying each algorithm to this new graph.

Fig. 3. Graph G and its complementary graph. The maximum clique of G is still {2345}, but its constraints set is now E = {{0, 1}, {0, 2}, {0, 3}, {0, 5}, {1, 3}, {1, 4}, {1, 5}}.

Table 3 Comparison between the brute force algorithm, the aqueous algorithm and the constructive algorithm for the graph illustrated in Fig. 3, with V = {0, 1, 2, 3, 4, 5} and E = {{0, 1}, {0, 2}, {0, 3}, {0, 5}, {1, 3}, {1, 4}, {1, 5}}

| Parameter | Brute force algorithm | Aqueous algorithm | Constructive algorithm |
|---|---|---|---|
| No. of iterations | 7 | 7 | 6 |
| No. of different molecules per iteration (a) | 48, 40, 36, 34, 26, 22, 20 | 2, 4, 8, 16, 13, 22, 20 | 1, 2, 4, 6, 11, 19 |
| Volume of the no. of copies of the problem-solving molecule after the last iteration | 1/20 | 1/64 | 1/19 |
| No. of operations performed | 14 | 14 | 7 |
| Mean string size per iteration | 173.3, 175, 175.5, 175.6, 178.8, 180.5, 181 | 179, 182, 184.5, 186.8, 188.5, 191.2, 191.4 | 20, 20, 25, 26.7, 32.7, 40 |
| Total mean string size | 177.1 | 186.2 | 27.4 |
| No. of restriction enzymes needed | 6 | 6 | 0 |

(a) In this case, the proposed aqueous algorithm does generate all the possible cliques on the graph.

5. Conclusions
In this paper, we present a P system with replicated rewriting to solve the Maximum Clique Problem for a graph. The system uses inhibitors to avoid generating the whole solution space. This problem is highly relevant not only at the purely computational level, but also because of its relation to a number of problems underlying genomics. Although this problem has been tackled on numerous occasions in the DNA computing literature, the same cannot be said of membrane computing. In order to compare the efficacy and efficiency of the designed system with those of the systems proposed by other researchers, we have proposed an implementation of this P system using a DNA algorithm. Hence, we have been able to compare the algorithm with two standard works (the brute force algorithm and the aqueous algorithm) that solved this problem earlier. A series of computational and physical parameters have been used to carry out this comparison. Looking at the results, the constructive strategy used appears to have a number of advantages. First, the number of iterations of the brute force algorithm and the aqueous algorithm matches the size of the constraints set E, that is,
the number of arcs of the complementary graph of G. Since, at each iteration, the working space necessarily has to be split into two to apply any individual operation, the maximum theoretical size of E is about log2 10^18 ≈ 60, assuming an initial set of 10^18 DNA strands. In our constructive algorithm, however, each of the constraints in E matches a separation operation, whereas it is the size of the set of nodes of G that determines the number of iterations. This would allow our algorithm to tackle problems on any graph of up to 60 nodes in linear time, irrespective of the size of the constraints set E. Secondly, the fact that the only basic operation that our algorithm uses on DNA is parallel overlap assembly (POA) helps to increase the effectiveness and simplicity of its biological implementation. Additionally, the constructive strategy allows substantially smaller molecules than those in the compared works to be used throughout the algorithm. This reduction of the mean strand size is an improvement, as it can help to reduce the number of manipulation errors. Another benefit of the constructive algorithm stems from the fact that it uses no restriction enzymes. Both the brute force algorithm and the aqueous algorithm call for a number of restriction enzymes that increases linearly with the number of nodes of G. The fact that our algorithm has no need of these enzymes, on the one hand, lifts the numerical limits that they place on the design of the problem-solving molecules and, on the other, does away with the complications derived from the appearance of occasional mutations in their recognition region. The constructive algorithm trades the use of these enzymes for select and append operators. All the results presented in this paper are based on a theoretical model. However, the operations required to implement this algorithm have already been carried out in the laboratory many times.
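The node-by-node character of the constructive strategy can be illustrated with a plain set-based sketch in Python. It is not the P system or the DNA protocol itself. The function name `constructive_cliques`, the tuple encoding of strings, the equal weighting of molecule types and the figure of 20 bases per node are assumptions made for illustration; the last one is merely inferred from the sizes listed in Table 3. One iteration is run per node, and a string is extended with the new node only if no constraint in E forbids it.

```python
BASES_PER_NODE = 20  # assumption inferred from Table 3, not a stated design value


def constructive_cliques(n, constraints):
    """Grow cliques one node per iteration, filtering by the constraints set.

    Returns the final list of cliques, the number of distinct strings after
    each iteration, and the mean string size after each iteration (assuming
    equal copy numbers for every string type).
    """
    forbidden = {frozenset(c) for c in constraints}
    cliques, counts, mean_sizes = [], [], []
    for v in range(n):
        # Append v only to strings that share no constraint with it.
        extended = [
            c + (v,)
            for c in cliques
            if all(frozenset((u, v)) not in forbidden for u in c)
        ]
        cliques = cliques + extended + [(v,)]
        counts.append(len(cliques))
        total_bases = BASES_PER_NODE * sum(len(c) for c in cliques)
        mean_sizes.append(round(total_bases / len(cliques), 1))
    return cliques, counts, mean_sizes


# Second example graph: six nodes and seven constraints.
E = [(0, 1), (0, 2), (0, 3), (0, 5), (1, 3), (1, 4), (1, 5)]
cliques, counts, mean_sizes = constructive_cliques(6, E)
largest = max(cliques, key=len)
```

For this instance the sketch produces 1, 2, 4, 6, 11 and 19 strings over its six iterations, with mean sizes 20, 20, 25, 26.7, 32.7 and 40, in agreement with the constructive algorithm column of Table 3, and the longest string is the clique {2, 3, 4, 5}. The number of iterations depends only on the number of nodes, while the constraints only filter which extensions are kept.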
Acknowledgements
This research has been partially funded by the Spanish Ministry of Science and Education under projects TIC2002-04220-C03-03 (co-financed by FEDER funds) and DEP2005-00232-C03-03, by the Ramón y Cajal Program of the Spanish Ministry of Science and Technology, and by the Czech Science Foundation, grant 201/06/0567.