The bounded complexity of DNA computing

Max H. Garzon a,*, Nataša Jonoska b, Stephen A. Karl c

a Department of Computer Science, University of Memphis, Memphis, TN 38152, USA
b Department of Mathematics, University of South Florida, Tampa, FL 33620, USA
c Department of Biology, University of South Florida, Tampa, FL 33620, USA

* Corresponding author. E-mail addresses: [email protected] (M.H. Garzon), [email protected] (N. Jonoska), [email protected] (S.A. Karl).

This paper has been prepared expressly for the special issue of BioSystems, based on a plenary presentation at the Fourth Annual Workshop on DNA-Based Computing, Philadelphia, 15-19 June 1998.

BioSystems 52 (1999) 63-72, www.elsevier.com/locate/biosystems
Abstract

This paper proposes a new approach to analyzing DNA-based algorithms in molecular computation. Such protocols are characterized abstractly by three phases: encoding, tube operations and extraction. Implementing them involves encoding a problem instance in a multiset of molecules assembled in a tube that has a number of physical attributes. The physico-chemical state of a tube can be changed by a prescribed number of elementary operations. Based on realistic definitions of these elementary operations, we define the complexity of a DNA-based algorithm using the physico-chemical properties of each operation. We show that new algorithms for Hamiltonian path are about twice as efficient as Adleman's original one, and that a recent algorithm for Max-Clique provides a similar increase in efficiency. Consequences of this approach for tube complexity and DNA computing are discussed. © 1999 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Molecular computation; DNA-based algorithm; Complexity; Computational efficiency

1. Introduction

DNA computing has been established as a field with great potential for computation, at least of a kind that is unattainable by conventional computers. Research to date has focused on several areas (Lipton et al., 1996; Amos et al., 1997; Koza et al., 1997, 1998; Landweber and Baum, 1999), ranging from algorithm design and complexity analysis for a number of applications (e.g. NP-complete problems, circuit evaluation, associative memories) to reliability and error analyses, as well as further experimental results on the limits of Adleman's ideas in specific applications. An initial attempt to characterize the complexity of DNA-based algorithms in terms of the traditional concepts of 'time' and 'space' was introduced in Amos et al. (1997). Understanding the actual power of DNA to solve computational problems in practice, however, requires a notion of complexity of DNA-based algorithms that captures the physico-chemical reality in which they take place (entirely different from that of VLSI-based programs). This is necessary so that results can be used to gauge the scope of DNA-based computations.
Therefore, algorithm analysis in DNA computing should be tackled with tools that bear direct relevance to the number of molecules floating in a solution of a given volume and density in a small tube, sensitive to temperature variations and subject to operations of various degrees of complexity and implementation difficulty (and therefore more or less expensive depending on the operation). A tool of this sort for algorithm analysis will allow, among other things, the comparison of different procedures that solve the same problem, the determination of their relative efficiency more objectively than profiling on isolated runs of the experiment, and even the comparison of algorithms for different problems on a common yardstick, in order to eventually find lower bounds on their difficulty for DNA protocols.

This paper proposes a new approach to analyzing DNA-based algorithms. The basic assumptions of our approach are as follows. To date, DNA-based computing aims at solving the same ordinary algorithmic problems that are commonly posed for conventional VLSI-based computers, albeit by entirely different types of operational processes. A DNA-based computation is here characterized by three phases: encoding, which maps the problem onto DNA strands; tube operations, which perform the basic core processing at molecular nanoscales (such as that involved in the separation and amplification steps in Adleman's model (Adleman, 1996)); and extraction, which makes the results visible.

Instances of a problem are encoded in molecules, generally oligonucleotides or short strands of DNA, although other structures, such as graphs, are certainly possible. These molecules are complex structures possessing physico-chemical properties that cannot be entirely described by a syntactic string representing their composition. The molecules are assembled in a tube C, which can be abstractly described as a multiset of molecules having a number of physical attributes. There are, in particular, four important properties of a tube C: volume V(C), temperature T(C), number of nucleotides n(C), and amount N(C) (usually given in picomoles) of each kind of molecule contained in C. (In this notation, mention of C will generally be omitted.)

The physico-chemical state of a tube C can be changed by a prescribed number of elementary operations (sometimes called steps), which are generally of two types: physical and chemical. Examples of physical operations are temperature changes, attachment of magnetic beads, separation by application of electric fields or centrifugal forces, merging of tubes, etc. Examples of chemical operations are reactions using enzymes such as restriction endonucleases, ligases, exonucleases, DNA polymerases, etc. Certainly, some of these operations are of a mixed character. An algorithmic problem is solved by using a program to manipulate DNA molecules, herein referred to as a protocol, which is defined as a time-ordered sequence of elementary steps. Since new laboratory techniques for working with DNA molecules appear frequently in molecular biology, the elementary bio-operation set (BioS) is left unspecified, and a complexity analysis is always made relative to a given set of elementary operations, just as specific computer programs are always written relative to the basic instruction set of a digital computer.

In order to measure the complexity (or efficiency) of a protocol, we associate a cost with each elementary operation x. Each operation is implemented as a laboratory procedure, and therefore this cost depends, in general, on the full state of the tube (e.g. on many factors such as temperature, volume of the test tube, number of different DNA molecules present, enzymes used, and number of nucleotides expected to interact within the tube). In general, however, we make the simplifying assumption that the bulk of the cost depends only on a subset of these parameters; we assume that operations depend explicitly only on V, T, N and n. The cost of a protocol p is then given by

C(p) = Σ_{x ∈ p} Cx(n(C), N(C), V(C), T(C))

where it is understood that C denotes the successive tubes that arise from the application of the preceding operation(s). The initial tube(s) is just the encoding of the problem instance. For simplicity, we will generally emphasize one-pot protocols involving a single tube; it is clear that protocols involving several tubes can be handled, for example, by adding up the cost incurred for each tube.
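
A minimal sketch of this cost model, written here in Python for concreteness (it is not part of the paper, and all names are hypothetical): a tube is a record of the four attributes n, N, V and T, an elementary operation pairs a cost function with a state transition, and the protocol cost is the sum of the per-step costs evaluated on the successive tubes.

from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Tube:
    n: float  # number of nucleotides involved (k oligos of length l give n = k*l)
    N: float  # amount of each kind of molecule, in picomoles
    V: float  # volume-deviation factor (defined in Section 2)
    T: float  # temperature, in degrees Celsius

# An elementary operation: a cost function Cx(tube) in TVn units h,
# and a transition producing the successive tube.
Operation = Tuple[Callable[[Tube], float], Callable[[Tube], Tube]]

def protocol_cost(initial: Tube, protocol: List[Operation]) -> float:
    # C(p): sum the cost of each step, applied to the tube produced by
    # the preceding operations (a one-pot protocol with a single tube).
    tube, total = initial, 0.0
    for cost_fn, apply_fn in protocol:
        total += cost_fn(tube)
        tube = apply_fn(tube)
    return total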


The rest of the paper is laid out as follows. In Section 2 we discuss how to assign a cost to the operations that would appear to constitute the current BioS. Section 3 illustrates a typical complexity analysis with three cases: Adleman's protocol (Adleman, 1994) for the Hamiltonian path problem (HPP), the Morimoto-Arita-Suyama refinement (Morimoto et al., 1999), and the Ouyang-Kaplan-Liu-Libchaber (Ouyang et al., 1997) solution of Max-Clique. Finally, Section 4 discusses the advantages and disadvantages of this type of protocol analysis.

2. The complexity of elementary operations

Although new molecular techniques may introduce fundamentally new approaches in molecular research, many of them are mainly refinements of the physical and chemical operations discussed below. We will concentrate in this section on the cost of common operations, such as oligonucleotide synthesis, hybridization, ligation, endonuclease digestion, etc. In principle, this set might be the full spectrum of operations available in molecular biology (see, for example, Ausubel et al., 1993). We group elementary operations into two classes according to their degree of complexity, as detailed in the next two subsections.

The number of nucleotides is denoted by n. Usually n is calculated as n = k·l, where k denotes the number of different oligos and l is the length of the oligos. The quantity, in picomoles, is denoted by N, and the volume of a reaction is denoted by V. The variable V actually measures the deviation of the actual reaction volume from a standard volume, reflecting the fact that a deviation makes the operation correspondingly more difficult. Specifically, we assume that an optimal reaction takes place in a 50 μl standard volume. The deviation is measured by

V := e^(1/y) − 1,   if y ≤ 1
V := 1,             if 1 ≤ y ≤ 5
V := (y − 5)^2,     if y ≥ 5

where y is the actual volume at which the operation takes place (in 50 μl units). These choices reflect actual lab experience, where volumes in the range of 50-250 μl are desirable for conservation of material, faster temperature changes, and other factors, whereas other volumes become harder to handle.

2.1. The TVn complexity unit

Because there are many different factors involved in each protocol, we measure all costs in a common unit, h, which will be referred to as a TVn unit. One TVn unit (1h) is given as:

h: unit cost per 1°C change of temperature, per one nucleotide, per 1 μl of reaction volume.

In particular, an operation that changes the temperature by ΔT has complexity h·ΔT. Strictly speaking, h may be a function of temperature, the specific oligos involved, etc., but we will assume it constant as a first approximation.
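
The following short Python sketch (hypothetical helper names, not from the paper) implements the volume-deviation factor and the cost of a pure temperature change in TVn units:

import math

def volume_deviation(volume_ul: float) -> float:
    # Deviation factor V for a reaction of the given volume in microliters;
    # y is the volume in 50 ul units, and 50-250 ul is the comfortable range.
    y = volume_ul / 50.0
    if y <= 1:
        return math.exp(1.0 / y) - 1.0
    if y <= 5:
        return 1.0
    return (y - 5.0) ** 2

def temperature_change_cost(delta_T: float, h: float = 1.0) -> float:
    # Cost h*deltaT of changing the tube temperature by delta_T degrees C.
    return h * abs(delta_T)

# Example: a 25 ul reaction (y = 0.5) is penalized relative to a 100 ul one:
# volume_deviation(25) is about 6.39, while volume_deviation(100) == 1.0.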

2.2. Complexity of 'simple' operations

2.2.1. Generating the input oligos

Each elementary operation assumes a set of input oligos. Generating (i.e. synthesizing) the input oligos should be a source of cost incurred by the algorithm, since they will eventually generate the molecules containing the answer, which is the purpose of the protocol. Deciding on the choice of encoding for an initial set of oligos can be a rather difficult problem (Adleman, 1996; Garzon et al., 1997; Baum, 1998; Deaton et al., 1998a,b). Here we assume that the problem of choosing an encoding strategy has been solved and that its cost can be amortized by adding a constant factor cI to the cost of manufacturing the specific encoding molecules. Therefore, if we denote by I the operation of generating the input oligos and by C(I) the complexity of this computational step, we put

C(I) = cI·n·N·h

where n and N are as defined above.
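
As a simple sketch (the constant and the numbers below are placeholders, not values from the paper), C(I) can be computed as:

def input_generation_cost(n: float, N: float, c_I: float = 1.0, h: float = 1.0) -> float:
    # Cost of synthesizing the input oligos: n nucleotides overall
    # (k different oligos of length l give n = k*l), N picomoles of each.
    return c_I * n * N * h

# e.g. 20 oligos of 20 nt at 50 pmol each:
# input_generation_cost(n=20 * 20, N=50) == 20000.0  (in units of cI*h)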

In the following, other cost constants specific to each type of procedure are included in the complexity analysis. These include the costs cH (associated with hybridization), cL (ligation), cRE (restriction endonuclease digestion), cG (gel electrophoresis), cP (polymerization), and cB (biotin-labeled oligo separation). The specific values of these constants vary with the available technology and, presumably, will tend to decrease as better technology becomes available. The calculation of appropriate values for these constants is multifaceted, highly complex and beyond the scope of this paper. In general, however, consideration should be given to a variety of factors such as the dollar value of the equipment required, the robustness of the procedure to varying reaction conditions, inherent error rates, etc.

For example, cleavage of double-stranded DNA with restriction endonucleases is a commonly used step in DNA computing. Restriction digestion is also a notoriously sloppy reaction; in most cases, digestions never go to completion, and often only 80-90% of available sites are cut. Furthermore, reaction conditions (temperature, salt concentration, necessary co-factors) can be quite specific, with deviations from the optimum resulting in reduced activity and increased error rates (e.g. cleaving sequences that normally would not be cut). On the other hand, ligation and polymerase chain reactions can be fairly robust to reaction conditions and efficiency, but suffer from different limitations. Nonetheless, we believe that these constants can be assigned relatively good values for the purposes of complexity analysis. We treat this issue further in the discussion and conclusions.

2.2.2. Hybridization

We consider hybridization as one of the basic operations. This operation, following the Watson-Crick (WC) complementarity of nucleotides, joins single or partially double-stranded molecules together to form double-stranded molecules. Because of its importance in biology, there have been extensive studies of the thermodynamics and physical chemistry of DNA hybridization (see, for example, Hames and Higgins, 1985). Many empirical formulas for melting temperatures (for different lengths of DNA segments, different DNA structures and mispaired or mismatched bases) have been obtained (see Wetmur, 1999 and Hartemink and Gifford, 1999 and references therein). These studies show that DNA hybridization is a very complex process, and capturing its real complexity in a single operation might be extremely difficult, if not impossible. Yet many DNA-based algorithms, including Adleman's solution to HPP, rely on the successful performance of this operation. Here we include a simplified calculation of the operation viewed from a computational angle, assuming that it is performed successfully (i.e. ignoring partial hybridizations and mismatches). If we denote this operation by H, we therefore set its complexity as

C(H) = cH·n·N·V·h

2.2.3. Ligation

Ligation is another basic operation frequently used in DNA-based algorithms. Often it is performed in conjunction with hybridization, to close open 'nicks' remaining after molecules are first joined into double-stranded DNA. There are cases, however, when ligation needs to be performed on molecules with blunt ends, without hybridization. For that reason we treat ligation as a separate operation. It is impossible to know how many molecules are in fact ligated in a single reaction; hence, we estimate the complexity of this operation by an upper bound on the number of molecules expected to be ligated. Therefore, if we denote this operation by L, we set

C(L) = cL·k·N·V·h

where the variable k denotes the number of different oligos that need to be ligated (i.e. the number of different open nicks) and N is as above.
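
A sketch of these two costs in the same style as above (helper names and default constants are illustrative, not from the paper):

def hybridization_cost(n, N, V, c_H=1.0, h=1.0):
    # C(H) = cH*n*N*V*h, assuming the hybridization succeeds
    # (partial hybridizations and mismatches are ignored).
    return c_H * n * N * V * h

def ligation_cost(k, N, V, c_L=1.0, h=1.0):
    # C(L) = cL*k*N*V*h, where k is the number of different open nicks;
    # k*N is an upper bound on the number of ligated molecules.
    return c_L * k * N * V * h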

2.2.4. Restriction enzymes

Many algorithms in DNA computing use operations that are based on cleaving with restriction enzymes (see Morimoto et al., 1999; Ouyang et al., 1997). If we denote this operation by R, we set its complexity at
C(R) = cRE·k·N·V·h

where the constant cRE is characteristic of the restriction enzyme used.

2.2.5. Gel electrophoresis

Gel electrophoresis has been used routinely by molecular biologists for more than 20 years. It was instrumental in the success of Adleman's original algorithm (Adleman, 1994) for selecting DNA molecules by length in the extraction phase. There are many different ways this operation can be performed. Most of the time the cost of this operation depends only on the 'kind' of gel electrophoresis used and not on the content of the test tube that is checked. Therefore we set the cost of this operation, denoted G, at

C(G) = cG·r·h

where the constant factor cG characterizes the electrophoresis and the variable r indicates the sensitivity of the gel, given as the number of different molecule sizes that it distinguishes (in effect, the maximal length resolved divided by the smallest resolvable length difference). This means that r = 100 if a gel can distinguish molecules with length up to 100 bp and at least a one-nucleotide difference between molecules, but r = 5 if the gel can only distinguish molecules with length up to 100 bp and no less than a 20-nucleotide difference.
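
The two remaining 'simple' operations can be sketched the same way (again with illustrative names and constants):

def restriction_cost(k, N, V, c_RE=1.0, h=1.0):
    # C(R) = cRE*k*N*V*h; cRE is characteristic of the restriction enzyme used.
    return c_RE * k * N * V * h

def gel_cost(max_length_nt, resolution_nt, c_G=1.0, h=1.0):
    # C(G) = cG*r*h, with the sensitivity r taken as the maximal length
    # resolved divided by the smallest resolvable length difference
    # (e.g. 100 bp at 1 nt resolution gives r = 100; 200 nt at 10 nt gives r = 20).
    r = max_length_nt / resolution_nt
    return c_G * r * h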

2.3. Complexity of 'complicated' operations

Many operations in suggested and performed DNA-based algorithms are fairly complex and use combinations of the basic operations described above. There are algorithms that are based on forming secondary structures with DNA molecules and require a much more careful complexity analysis than the one described here. For illustration purposes, we will concentrate here on two complex operations that are routinely called for by DNA-based algorithms: the polymerase chain reaction (PCR) and biotin-streptavidin bead separation.

2.3.1. PCR

PCRs are routinely used in DNA-based algorithms to amplify target molecules in the computation and/or extraction phase(s).
The procedure is characterized by 30-40 cycles of denaturing, hybridization, and polymerization in a single PCR. In each cycle the molecule(s) targeted for amplification are denatured (melted) into single-stranded templates, two single-stranded primers are hybridized (annealed) to the template molecules, and the single-stranded molecules are polymerized to double strands, usually by Taq polymerase. Each of these processes (melting, hybridization, and polymerization) is generally performed at a different temperature. If we denote the PCR operation by P, we set its complexity at

C(P) = cP·k·N·n·h + C(H) + q·(ΔT1 + ΔT2 + ΔT3)·h

The first of the three terms in C(P) is the complexity of the polymerization. The factor k is the number of pairs of primers present in the reaction. We point out that (a) there is no need to include the number of PCR cycles in the estimate of the polymerization complexity, since at each step the number of primer pairs is reduced by the number of polymerizations that occur; and (b) this choice of k excludes asymmetric PCR which, although used in molecular biology, has not been used in DNA computing. Hence there cannot be more molecules amplified than kN, since no primers exist to initiate further polymerization. Of course, in each PCR reaction other shorter molecules in the tube can act as primers, but their number compared to the primer target molecules is very small and can be disregarded. Furthermore, we are estimating the complexity of the operation by estimating the complexity of the desired outcome. Unfortunately, it is rarely possible to know the length of the molecule polymerized or how many distinct molecules are actually amplified by the procedure. For that reason, n reflects the length of the desired target molecules.

The second term, C(H), is the complexity of the hybridization. Here we only count the hybridization between the primers and the template which results in new molecules; we discount the hybridization between the two complementary single-stranded template molecules, which can also happen.
Since the volume of the primers is included in C(H) (and, as we said, there cannot be more amplification than the number of primers), the number of PCR cycles is not included. The third term reflects the changes in temperature during the PCR and must, therefore, include the number of cycles q. The change of temperature for each of the three steps (melting, hybridization, and polymerization) is denoted by ΔTi (i = 1, 2, 3) and is measured relative to room temperature (20°C).
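
A sketch of C(P), reusing the hybridization_cost sketch above; the parameter names are illustrative, and the primer hybridization term is parameterized here as 2k primers of length primer_len, following the worked example in Section 3:

def pcr_cost(k, N, n, V, q, dT1, dT2, dT3, primer_len,
             c_P=1.0, c_H=1.0, h=1.0):
    # k: primer pairs; N: pmol of each primer; n: length (nt) of the desired
    # product; V: volume-deviation factor; q: number of thermal cycles;
    # dTi: temperature changes relative to room temperature (20 C).
    polymerization = c_P * k * N * n * h
    primer_hyb = hybridization_cost(n=2 * k * primer_len, N=N, V=V, c_H=c_H, h=h)
    cycling = q * (dT1 + dT2 + dT3) * h
    return polymerization + primer_hyb + cycling

# For example, one primer pair at 50 pmol, a 120 nt product, V = 1, 35 cycles
# with dT = (64, 64, 0) and 20 nt primers gives 6000*cP*h + 2000*cH*h + 4480*h,
# matching step 3 of Adleman's protocol in Section 3.1.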

2.3.2. Biotin-streptavidin bead separation

Biotin-labeled oligos attached to paramagnetic beads have been used as a separation technique in Adleman's experiment (Adleman, 1994) and also in many other theoretical descriptions of algorithms and computational models (Jonoska and Karl, 1997; Roweis et al., 1999). Studies have shown that this technique is not very reliable (Khodor and Gifford, 1999) and is probably best avoided. Nonetheless, we include the complexity of this technique because it was used in Adleman's experiment, whose complexity we estimate in the next section. The method can be characterized as follows. Biotinylated primers and target molecules that are WC complementary, at least in part, are mixed and allowed to hybridize. These partially double-stranded molecules are then mixed with paramagnetic beads coated with streptavidin. The biotin and the streptavidin bind, thus attaching the paramagnetic bead to the WC-hybridized DNA. A magnetic field is used to immobilize the complex (containing the target molecule) against the side of the reaction container while the remaining, unhybridized molecules are removed. We denote this operation by B and set its complexity C(B) at

C(B) = C(H) + cB·k·N·V·h

The first term in the equation, C(H), denotes the complexity of the hybridization process. As before, we concentrate on the desired hybridization and desired outcome rather than on partial hybridizations and mismatches. The second term is the complexity of the separation itself.
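
In the same illustrative style, C(B) can be sketched by reusing hybridization_cost:

def bead_separation_cost(n, k, N, V, c_B=1.0, c_H=1.0, h=1.0):
    # n: nucleotides involved in the probe-target hybridization;
    # k: number of different biotinylated probes; N: pmol of each probe.
    return hybridization_cost(n=n, N=N, V=V, c_H=c_H, h=h) + c_B * k * N * V * h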

3. Examples of complexity analyses

With definitions in place, we can proceed to analyze and compare the complexity of various protocols that have been implemented in the lab. We exemplify this type of analysis with a detailed description of Adleman's original protocol (Adleman, 1994). A suggested improvement of that algorithm (Morimoto et al., 1999) and an algorithm for Max-Clique (Ouyang et al., 1997) are also considered (detailed analyses are available from the authors).

3.1. Complexity of Adleman's algorithm

In this section we will not review Adleman's algorithm, denoted A, but characterize it instead by the computational steps that were used in his paper (Adleman, 1994). He solved a small, seven-vertex instance of HPP. HPP asks, for a given instance consisting of a (directed) graph with specified source and destination vertices, whether there is a path that starts at the source vertex, ends at the destination vertex, and visits each vertex exactly once. Adleman's algorithm:

1. Generate the input oligos: seven random 20 nt DNA sequences for the vertex set V, 20 nt oligos for the edges in the set E′ that do not start at the initial vertex and do not end at the terminal vertex, and 30 nt oligos for the edges in the set E″ that start at the initial vertex or end at the terminal vertex. At least 100 pmol of each edge oligo was used in the experiment. For the vertex oligos, at least 1100 pmol were used, since they were needed both in the construction of the graph (100 pmol) and for separation (1000 pmol). For the graph that Adleman used, |V| = 7, |E′| = 9 and |E″| = 5. Hence,

C(I) = 20·|V|·1100·cI·h + 20·|E′|·100·cI·h + 30·|E″|·100·cI·h = 187,000 cI h

2. Generate random paths through the graph: 50 pmol of each vertex oligo (except the initial and the terminal vertex) and 50 pmol of each edge oligo were mixed in a 100 μl reaction and allowed to hybridize and then ligate. Two elementary operations are performed here: (a) hybridization, H, and (b) ligation, L.
The complexity for H is thus given by

C(H) = (|E′|·20·50 + |E″|·10·50)·cH·h = 11,500 cH h

The edge oligos are used as splints for the vertex oligos, and potentially all edge oligos can hybridize to some vertex oligos. Since the initial and the terminal vertex oligos are not present, edges from the set E″ can hybridize with only one vertex (i.e. only ten nucleotides of the edge will form hydrogen bonds). For the complexity of L we have

C(L) = 1/2·(|V| − 2 + |E|)·50·cL·h = 475 cL h

Any oligo can potentially be ligated to another oligo; however, since each successful ligation involves two molecules, we divide by 2.

3. Isolate paths that start at the initial vertex and end at the terminal vertex: the mix from the previous step was amplified by PCR using the initial and the terminal vertex oligos as primers. 50 pmol of each primer were added in a PCR of 50 μl total volume, cycled 35 times. Hence, the volume factor in C(H) is 1. The denaturing temperature was 94°C, and the hybridization and the polymerization occur during ramping from 30 to 94°C. So ΔT1 = 64°C, ΔT2 = 64°C and ΔT3 = 0°C, and we get

C(P1) = 50·20·(|V| − 1)·cP·h + 20·50·2·cH·h + 35·(64 + 64 + 0)·h = 6000 cP h + 2000 cH h + 4480 h

The desired molecules amplified by this method are those that visit exactly |V| vertices. Since one of the primers is the initial vertex and the other is the terminal vertex, the length of the polymerized molecule is 20·(|V| − 1) nt; in Adleman's experiment that is 20 nt × 6 = 120 nt.

4. Extract paths that visit exactly |V| vertices: the product of the previous step is run on an agarose gel. The band corresponding to the DNA molecules of the desired length is excised from the gel and washed with water. The gel electrophoresis used by Adleman is a standard 3 or 5% agarose gel stained with ethidium bromide, which can distinguish about a ten-nucleotide difference between molecules up to 200 nt in length.
The complexity of this step thus becomes

C(G) = 20 cG h

5. For each vertex i, keep the paths that visit i at least once: biotin-streptavidin bead separation was used for this step, and was performed for each vertex (except the source and the destination) using the complementary vertex oligo. This was done by denaturing the molecules at 80°C and incubating the single-stranded DNA with 1 nmol (10³ pmol) of the i-th vertex primer in 150 μl at room temperature. Thus we get

C(B) = (|V| − 2)·(ΔT·h + 20·10³·cH·h + 1·10³·1·cB·h)

For the entire experiment, this becomes

C(B) = 5·(60h + 20,000 cH h + 1000 cB h) = 300h + 100,000 cH h + 5000 cB h

6. Check whether any paths are left: this was performed by 'graduated PCR' in |V| − 1 tubes, using the initial vertex oligo and the i-th vertex oligo, followed by gel electrophoresis. If the i-th PCR produces a product of a single length, then the i-th vertex is visited exactly once. The PCR reactions are the same as before, except that the length of the molecule polymerized is now specific to each vertex. Hence, the complexity for Adleman's graph is

C(P2) = Σ_{i=1}^{|V|−1} 50·20·(|V| − i)·cP·h + (|V| − 1)·(20·50·2·cH·h + 35·(64 + 64 + 0)·h + C(G))
      = 21,000 cP h + 12,000 cH h + 26,880 h + 120 cG h

According to our definition, the total complexity of Adleman's experiment therefore becomes

C(A) = C(I) + C(H) + C(L) + C(P1) + C(G) + C(B) + C(P2)
     = 187,000 cI h + 125,500 cH h + 475 cL h + 27,000 cP h + 5000 cB h + 140 cG h + 31,660 h
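
As a sanity check (not part of the paper), the following short script tallies the coefficients of steps 1-6 above symbolically, as multipliers of cI, cH, cL, cP, cB, cG and of the bare unit h; it reproduces the totals in C(A):

from collections import Counter

V, E1, E2 = 7, 9, 5              # |V|, |E'|, |E''| for Adleman's graph
E = E1 + E2

total = Counter()
total["cI"] += 20 * V * 1100 + 20 * E1 * 100 + 30 * E2 * 100   # step 1, C(I)
total["cH"] += E1 * 20 * 50 + E2 * 10 * 50                     # step 2, C(H)
total["cL"] += 0.5 * (V - 2 + E) * 50                          # step 2, C(L)
total["cP"] += 50 * 20 * (V - 1); total["cH"] += 20 * 50 * 2   # step 3, C(P1)
total["h"]  += 35 * (64 + 64 + 0)
total["cG"] += 20                                              # step 4, C(G)
total["h"]  += (V - 2) * 60                                    # step 5, C(B)
total["cH"] += (V - 2) * 20 * 1000
total["cB"] += (V - 2) * 1000
total["cP"] += sum(50 * 20 * (V - i) for i in range(1, V))     # step 6, C(P2)
total["cH"] += (V - 1) * 20 * 50 * 2
total["h"]  += (V - 1) * 35 * (64 + 64 + 0)
total["cG"] += (V - 1) * 20

print(dict(total))
# {'cI': 187000, 'cH': 125500, 'cL': 475.0, 'cP': 27000,
#  'h': 31660, 'cG': 140, 'cB': 5000}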

3.2. Complexity of the Morimoto-Arita-Suyama (MAS) and Ouyang-Kaplan-Liu-Libchaber (OKLL) algorithms

In Morimoto et al. (1999), the authors present an alternative algorithm for solving HPP. Their algorithm deviates from Adleman's in that, instead of building random paths, the Hamiltonian paths are built step by step. The algorithm is based on several operations that the authors designate EXT, CAP, REMove and DEBlock. We refer the reader to the original article for a detailed description of these steps. In a fashion similar to the analysis of Adleman's protocol, the complexity of the MAS algorithm B is estimated as

C(B) = C(I) + C(B) + Σ_{i=1}^{|V|−2} (C(EXT) + C(CAP) + C(REM) + C(DEB)) + C(P)
     = 97,500 cI h + 770.5 cB h + 24,325.5 cH h + 288.9 cL h + 15,885 cP h + 321 cRE h + 1094 h + 120 cG h

where the second term on the right is the contribution of the bead-separation operation B (Section 2.3.2).

We analyze the algorithm for the Maximal Clique problem used by Ouyang-Kaplan-Liu-Libchaber (OKLL) (Ouyang et al., 1997) in a similar fashion. A clique is a graph in which every pair of vertices is connected by an edge. The Maximal Clique problem asks, for a given graph G, what is the maximal clique that is a subgraph of G; it is a well-known NP-complete problem. As before, we refer the reader to Ouyang et al. (1997) for the details of the algorithm; here we concentrate only on the complexity analysis. According to our definition, the total complexity of the OKLL experiment C is

C(C) = C(I) + C(POA) + C(P) + C(R) + C(G)
     = 58,000 cI h + 12,800 cP h + 3287.04 cH h + 7360 h + 205.44 cRE h + 40 cG h

4. Discussion and conclusions

We have proposed a new approach to analyzing DNA computing algorithms and protocols that is founded on physico-chemical realities while being abstract enough to make practical computation of the complexity of DNA-based protocols feasible. In this section we show some of the advantages of this type of analysis, as well as some problems and questions posed by the approach.

For the purpose of comparison, we need to assign values to the cost constants cB, ..., cP. As stated earlier, this is difficult at this time, since the constants might be tweaked in a number of ways. In order to assess the sensitivity of our results to the specific values chosen, we calculated the complexity values for each algorithm using all possible combinations of cost values from 1 to 4. Overall, the rank order of the complexity of the algorithms did not change: Adleman's algorithm A had the largest complexity and the OKLL algorithm C the least in all cases. (We remark that, for fairness, the complexity of C was calculated for a graph equivalent to Adleman's, with seven vertices and seven edges in the complement graph.)

The maximal difference between the complexity of A and B was obtained when all constants were assigned a value of 4 except cRE = 1. In this case, C(A) = 1,412,120h and C(B) = 566,820h, a difference of 845,300h. The minimal difference in complexities involving A and B was obtained when all constants were set equal to 1 except cRE = 4. In this case C(A) = 376,775h and C(B) = 151,113h, a difference of 225,661h. This can be regarded as reasonable evidence of the robustness of the complexity measure with respect to the constant values, since in both cases the complexity of A is more than twice the complexity of B. This may not be surprising since, even though A and B make use of the same procedures a different number of times, the restriction digestion step (hence cRE) is the only step they do not share. Therefore, according to bounded complexity, algorithm B is better than A by a factor of about 2. Although the improvement was intuitively expected from heuristic considerations, it would be difficult to quantify it without a realistic measure.
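
A sketch of this sensitivity check (not from the paper): sweep every cost constant over 1, ..., 4 and compare the resulting totals for A and the MAS algorithm B. The coefficient vectors below are the aggregate totals quoted in Section 3; the authors' detailed analyses may contain terms not reproduced there, so individual values can differ slightly from the ones quoted in this section, but the rank-order comparison is the point.

from itertools import product

# coefficients of (cI, cH, cL, cP, cB, cG, cRE) plus a constant h term
A = {"cI": 187000, "cH": 125500, "cL": 475, "cP": 27000,
     "cB": 5000, "cG": 140, "cRE": 0, "h": 31660}
B = {"cI": 97500, "cH": 24325.5, "cL": 288.9, "cP": 15885,
     "cB": 770.5, "cG": 120, "cRE": 321, "h": 1094}

def total(alg, consts):
    return alg["h"] + sum(alg[name] * value for name, value in consts.items())

names = ["cI", "cH", "cL", "cP", "cB", "cG", "cRE"]
always_larger = all(
    total(A, dict(zip(names, values))) > total(B, dict(zip(names, values)))
    for values in product(range(1, 5), repeat=len(names))
)
print(always_larger)   # True: A ranks above B for all 4**7 assignments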

Further, it follows from this complexity analysis that the most costly operation is the encoding phase (i.e. generating the input oligos). Hybridization, as a separate operation, also has a very high complexity. Besides the cost of generating the input oligos, Adleman's algorithm uses PCR operations as a necessary step (step 3), as well as biotin-streptavidin bead separation. Both of these operations are rather complex and, in particular, their repetitive use increases the complexity of the algorithm. It should be pointed out that both algorithms, Adleman's in particular at step 4, require PCR amplifications between steps in order to obtain enough molecules for the extraction phase. These steps are not necessary to carry out the algorithm and are performed in order to assure accuracy; we therefore decided not to include them in our complexity estimates. A more accurate estimation should consider these steps.

The maximal difference between the complexity of B and C was again obtained when all constants were assigned a value of 4. For these values, C(B) = 567,783h and C(C) = 352,268h, which gives a difference of 263,093h. The minimal difference in the complexities of B and C was obtained when all constants were equal to 1. In this case C(B) = 150,150h and C(C) = 93,612h, with a difference of 68,458h. As before, the constants do not play a substantial role in the comparison of the bounded complexities, since in both cases the complexity of B is almost twice that of C, reflecting the cleverness of the clique representation in avoiding brute-force algorithms.

In conclusion, we have proposed a new approach to analyzing DNA-based algorithms based on physico-chemical realities and knowledge of the laboratory protocols associated with each computational step. These features certainly have both positive and negative aspects. An important aspect is that this kind of analysis emphasizes the behavior of the algorithm on instance sizes that are precisely the ones that tend to occur and be of interest in practice. In addition, it does produce a less 'wholistic' indication of the overall quality of an algorithm and does not focus exclusively on a single resource (such as time or space, although it involves them indirectly). Another positive aspect is the requirement for more specific descriptions of the algorithms.
As a result, the complexity measures are much more useful in terms of estimating the difficulty of actually running experiments in the lab. Moreover, they may also be useful more generally in experimental biochemistry.

On the other hand, one might argue that this approach is unrealistic (in the words of an anonymous referee, 'a dream of social scientists' for a wholistic measure of cost). One may say, for example, that for many computational steps the actual, let alone optimal, laboratory conditions are unknown. Moreover, one could also say that the approach deviates too much in spirit from the well-known asymptotic analysis of classical complexity theory based on the Turing model.

In response, we argue that a bounded complexity analysis is useful for at least three reasons. First, after designing the input oligos, a rough estimate of the complexity of each operation is almost always possible by ignoring the temperature changes and the volume. Hence, an estimate of the cost of the protocol can always be given that is probably more valuable than subjective impressions. (For example, one might have expected that algorithm B would be much better than A based on heuristic considerations of how the paths are built.) This cost is robust enough to allow meaningful comparisons of the efficiency of molecular protocols and even to make relatively well-founded predictions on the speed of progress in the field (according to the estimates above, progress in DNA-based protocols is steady, with efficiency doubling about every 18 months). Second, the laboratory protocols used by molecular biologists are designed to serve purposes particular to molecular biology, which are essentially different from those of DNA computing. For example, success rates (1 in billions) acceptable in molecular biology protocols are not adequate for DNA computing purposes. Third, we are interested in capturing the real capability of feasible tube algorithms, the ones that can be manipulated on planet Earth. Pursuing bounded complexity analysis will eventually produce quantifiably better protocols of potential use not only in DNA computing but also in molecular biology. A programmer's guide for the DNA computing practitioner may then take the form of a manual similar to Ausubel et al. (1993), assembling both the protocols and their costs.

References

Adleman, L.M., 1994. Molecular computation of solutions to combinatorial problems. Science 266, 1021-1024.

Adleman, L.M., 1996. On constructing a molecular computer. In: Lipton, R., Baum, E. (Eds.), DNA Based Computers. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 27. American Mathematical Society, Providence, RI, pp. 1-21.

Amos, M., Gibbons, A., Dunne, P., 1997. The complexity and viability of DNA computations. In: Lundh, D., Olson, B., Narayanan, A. (Eds.), Proceedings of Biocomputing and Emergent Computation (BCEC97). World Scientific, Singapore.

Ausubel, F.M., Brent, R., Kingston, R.E., Moore, D.D., Seidman, J.G., Smith, J.A., Struhl, K., Wang-Iverson, P., Bonitz, S.G., 1993. Current Protocols in Molecular Biology. Greene Publishing Associates and Wiley-Interscience, New York, NY.

Baum, E., 1998. DNA sequences useful for computation. In: Landweber, L., Baum, E. (Eds.), DNA Based Computers II: Proceedings of the Second Annual Meeting on DNA Based Computers, DIMACS Workshop, Princeton, NJ, June 10-12. American Mathematical Society, Providence, RI, pp. 122-127.

Deaton, R., Murphy, R.C., Garzon, M., Franceschetti, D.R., Stevens, S.E., Jr., 1998a. Good encodings for DNA-based solutions to combinatorial problems. In: Landweber, L., Baum, E. (Eds.), DNA Based Computers II: Proceedings of the Second Annual Meeting on DNA Based Computers, DIMACS Workshop, Princeton, NJ, June 10-12. American Mathematical Society, Providence, RI, pp. 159-171.

Deaton, R., Garzon, M., Murphy, R.C., Rose, J.A., Franceschetti, D.R., Stevens, S.E., Jr., 1998b. On the reliability and efficiency of a DNA-based computation. Phys. Rev. Lett. 80 (2), 417-420.

Garzon, M., Neathery, P., Deaton, R., Murphy, R.C., Franceschetti, D.R., Stevens, S.E., Jr., 1997. A new metric for DNA computing. In: Koza, J.R., Deb, K., Dorigo, M., Fogel, D.B., Garzon, M., Iba, H., Riolo, R.L. (Eds.), Proceedings of the 2nd Annual Genetic Programming Conference. Morgan Kaufmann, San Mateo, CA, pp. 472-478.

Hames, B.D., Higgins, S.J., 1985. Nucleic Acid Hybridization: A Practical Approach. IRL Press, Washington, DC, 244 pp.

Hartemink, A.J., Gifford, D.K., 1999. Thermodynamic simulation of deoxyoligonucleotide hybridization for DNA computation. In: Rubin, H., Wood, D. (Eds.), DNA Based Computers III, Proceedings of the Annual Meeting. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, Providence, RI, pp. 25-37.
Jonoska, N., Karl, S.A., 1997. Ligation experiments in DNA computations. In: Proceedings of the 1997 IEEE International Conference on Evolutionary Computation (ICEC'97), April 13-16, pp. 261-265.

Khodor, J., Gifford, D., 1999. The efficiency of the sequence-specific separation of DNA mixtures for biological computation. In: Rubin, H., Wood, D. (Eds.), DNA Based Computers III, Proceedings of the Annual Meeting. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, Providence, RI, pp. 39-46.

Koza, J.R., Deb, K., Dorigo, M., Fogel, D.B., Garzon, M., Iba, H., Riolo, R.L. (Eds.), 1997. Proceedings of the 2nd Annual Genetic Programming Conference. Morgan Kaufmann, San Mateo, CA.

Koza, J.R., Deb, K., Dorigo, M., Fogel, D.B., Garzon, M., Iba, H., Riolo, R.L. (Eds.), 1998. Proceedings of the 3rd Annual Genetic Programming Conference. Morgan Kaufmann, San Mateo, CA.

Landweber, L., Baum, E. (Eds.), 1999. DNA Based Computers II: Proceedings of the Second Annual Meeting on DNA Based Computers, DIMACS Workshop, Princeton, NJ, June 10-12. American Mathematical Society, Providence, RI.

Lipton, R., Baum, E. (Eds.), 1996. DNA Based Computers. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, vol. 27. American Mathematical Society, Providence, RI.

Morimoto, N., Arita, M., Suyama, A., 1999. Solid phase DNA solution to the Hamiltonian path problem. In: Rubin, H., Wood, D. (Eds.), DNA Based Computers III, Proceedings of the Annual Meeting. DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, Providence, RI, pp. 193-206.

Ouyang, Q., Kaplan, P.D., Liu, S., Libchaber, A., 1997. DNA solution of the Maximal Clique problem. Science 278, 446-449.

Roweis, S., Winfree, E., Burgoyne, R., Chelyapov, N., Goodman, M., Rothemund, P., Adleman, L., 1999. A sticker based architecture for DNA computation. In: Landweber, L., Baum, E. (Eds.), DNA Based Computers II: Proceedings of the Second Annual Meeting on DNA Based Computers, DIMACS Workshop, Princeton, NJ, June 10-12. American Mathematical Society, Providence, RI, pp. 1-30.

Wetmur, J.G., 1999. Physical chemistry of nucleic acid hybridization. In: Wood, D., Lipton, R., Seeman, N. (Eds.), Proceedings of the Third Annual Meeting on DNA Based Computers, DIMACS Workshop, University of Pennsylvania, June 23-25, 1997, in press, pp. 1-23.