Algebraic properties of DNA operations


BioSystems 52 (1999) 55 – 61 www.elsevier.com/locate/biosystems

Zhuo Li

Department of Computer Science, University of Western Ontario, London, Ont., N6A 5B7, Canada

Abstract

Any DNA strand can be identified with a word in the language $X^*$ where $X = \{A, C, G, T\}$. By encoding A as 000, C as 010, G as 101, and T as 111, we treat the DNA operations of concatenation, union, reverse, complement, annealing and melting from an algebraic point of view. Concatenation and union play the roles of multiplication and addition over some algebraic structures, respectively. The rest of the operations then turn out to be homomorphisms or anti-homomorphisms of these algebraic structures. Using this technique, we find the relationships among these DNA operations. © 1999 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Insertion; Deletion; Code; Generator

1. Introduction

This paper is a first attempt to treat DNA operations from an algebraic point of view. It offers a new formalization of DNA operations, based on the notation defined in (Boneh et al., 1997). In 1994, Leonard Adleman (Adleman, 1994) successfully solved an instance of the directed Hamiltonian path problem solely by manipulating DNA strands. Following (Adleman, 1994), DNA algorithms have been found to solve other problems: expansion of symbolic determinants (Leete et al., 1998), matrix multiplication (Oliver, 1998), addition (Guarnieri and Bancroft, 1998), exascale computer algebra (Williams and Wood, 1998), and so on. The basic idea behind DNA computing is to use DNA strands to encode information and to employ enzymes to simulate simple computations. This shows that the tools of biology could be widely used in solving mathematical problems. Conversely, it is natural and necessary to study DNA strands and their operations in a mathematically precise way. This is the motivation of this paper.

This paper has been prepared for the Proceedings of the Fourth International Conference on DNA-Based Computers, Philadelphia, June 1998. E-mail address: [email protected] (Z. Li). 0303-2647/99/$ - see front matter © 1999 Elsevier Science Ireland Ltd. All rights reserved. PII: S0303-2647(99)00032-5

Let us consider DNA strings, which are DNA strands with polarity ignored. It is natural to relate these strings to formal languages and power series. More precisely, let $X = \{A, C, G, T\}$, where A, C, G, T represent the building blocks of a DNA strand, called nucleotides: adenine, cytosine, guanine, and thymine. The single nucleotides are linked together end-to-end to form DNA strands. A DNA sequence has a polarity: a sequence of DNA is distinct from its reverse. The two ends are known as the 5′ end and the 3′ end, respectively. Taken as pairs, the nucleotides C and G, and the nucleotides A and T, are said to be complementary. Two complementary single strands with opposite orientation will join together to form a double helix in a process called annealing. The reverse process is called melting (refer to Kari, 1997 for more details).

We encode A by 000, C by 010, G by 101, and T by 111. The reason for this choice is the following. It would seem natural to encode the nucleotides A, C, G, and T as 00, 01, 10 and 11, respectively. However, this encoding, which satisfies the condition that $\hat{A} = \hat{\ }(00) = 11 = T$ and $\hat{C} = \hat{\ }(01) = 10 = G$, does not satisfy the natural requirement that the reverse of a nucleotide is the nucleotide itself. Indeed, $C^R = (01)^R = 10 = G$, which is not what we expect. In contrast, our encoding satisfies both natural constraints mentioned above.

Let $\Sigma = \{0, 1\}$. Let $S$ be any commutative semiring (Definition 2.2). Let $S\langle\langle \Sigma^* \rangle\rangle$ and $S\langle\langle X^* \rangle\rangle$ be the corresponding sets of power series (Definition 2.3) over $\Sigma^*$ and $X^*$, respectively. Then $S\langle\langle \Sigma^* \rangle\rangle$ and $S\langle\langle X^* \rangle\rangle$ are semirings (Example 2 in Section 2), where the multiplication on both semirings is the generalization of the concatenation of two words over $\Sigma^*$ and $X^*$, respectively, and the addition on both semirings is the generalization of the union of two words over $\Sigma^*$ and $X^*$, respectively. Under our encoding scheme, $S\langle\langle X^* \rangle\rangle$ is a subsemiring of $S\langle\langle \Sigma^* \rangle\rangle$.

Next, we generalize the operations on DNA strings, reverse, denoted by $R$, and complement, denoted by $\hat{\ }$, to an anti-automorphism and an automorphism (Definition 2.6) of the semiring $S\langle\langle \Sigma^* \rangle\rangle$. The subsemiring $S\langle\langle X^* \rangle\rangle$ of $S\langle\langle \Sigma^* \rangle\rangle$ is invariant under these two morphisms. That is, the automorphism $\hat{\ }$ is still an automorphism of $S\langle\langle X^* \rangle\rangle$ and the anti-automorphism $R$ is still an anti-automorphism of $S\langle\langle X^* \rangle\rangle$. To consider the polarity of the DNA strands, we introduce semimodules (Definition 2.4) and the direct sum of two semimodules (Definition 2.5).
To get a rough idea of what a semimodule and a direct sum are, the reader can think of a semimodule as a vector space and of the direct sum as the Cartesian product of two vector spaces. Each element of the direct sum has two components. In our interpretation, the first component can be thought of as the DNA strand which starts with 5′ and ends with 3′, while the second component can be thought of as the DNA strand which starts with 3′ and ends with 5′. This provides a mathematical model to simulate double DNA strands. The operations on DNA strands denoted by $\uparrow$, $\downarrow$, and $\updownarrow$ in (Boneh et al., 1997) can then be generalized to homomorphisms from the semimodule $S\langle\langle X^* \rangle\rangle$ to the direct sum $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. This yields the relationships among these operators.
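The two constraints on the encoding discussed above can be checked mechanically. The following is a minimal sketch; the helper names `encode`, `decode` and `flip` are illustrative and do not come from the paper.

```python
# Sketch of the 3-bit encoding; all names here are hypothetical helpers.
ENCODE = {"A": "000", "C": "010", "G": "101", "T": "111"}
DECODE = {bits: base for base, bits in ENCODE.items()}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def encode(dna: str) -> str:
    """Map a DNA string to its word over {0, 1}."""
    return "".join(ENCODE[base] for base in dna)


def decode(bits: str) -> str:
    """Inverse of encode; expects a concatenation of valid 3-bit blocks."""
    return "".join(DECODE[bits[i:i + 3]] for i in range(0, len(bits), 3))


def flip(bits: str) -> str:
    """Bitwise complement of a binary word."""
    return "".join("1" if b == "0" else "0" for b in bits)


x = "ACCTGAC"
# Complementing the DNA string corresponds to flipping every bit.
assert decode(flip(encode(x))) == "".join(COMPLEMENT[b] for b in x)
# Reversing the binary word corresponds to reversing the DNA string,
# because every 3-bit codeword is a palindrome.
assert decode(encode(x)[::-1]) == x[::-1]
# The 2-bit encoding fails the second requirement: reversing C = 01 gives 10 = G.
assert "01"[::-1] == "10"
```

The palindromic codewords are what make bit-level reversal agree with nucleotide-level reversal; the 2-bit scheme has no such property.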

2. Background of linear algebra

We introduce some mathematical background. Interested readers should refer to (Kuich and Salomaa, 1986) for more detail.

Definition 2.1. A monoid $\langle M, \cdot, 1 \rangle$ consists of a set $M$, an associative operation $\cdot$ on $M$, and an identity 1 such that $1 \cdot a = a \cdot 1 = a$ for every $a$ in $M$. A monoid is commutative if and only if $a \cdot b = b \cdot a$ for every $a$ and $b$ in $M$.

Definition 2.2. A semiring is a set $S$ together with two binary operations $+$ and $\cdot$ and two elements 0 and 1 such that
1. $\langle S, +, 0 \rangle$ is a commutative monoid,
2. $\langle S, \cdot, 1 \rangle$ is a monoid,
3. the distribution laws $a \cdot (b + c) = a \cdot b + a \cdot c$ and $(b + c) \cdot a = b \cdot a + c \cdot a$ hold for every $a$, $b$, and $c$ in $S$,
4. $0 \cdot a = a \cdot 0 = 0$ for every $a$.
If $a \cdot b = b \cdot a$ for any $a, b \in S$ then $S$ is called a commutative semiring.

Example 1. (i) $\langle B, +, \cdot \rangle$ is a semiring, where '+' is 'OR', '·' is 'AND', and $B = \{0, 1\}$. (ii) $\langle X^*, \cdot \rangle$ is the free monoid generated by a nonempty countable set $X$. It has all the finite strings, also referred to as words, $x_1 x_2 \cdots x_n$, $x_i \in X$, as its elements, and the product $w_1 \cdot w_2$ is formed by writing the string $w_2$ immediately after the string $w_1$. The identity element is the empty word, denoted by $\varepsilon$.

Definition 2.3. Let $S$ be a semiring. Let $X$ be an alphabet, i.e. a finite nonempty set. A map $r: X^* \to S$ is called a formal power series. $r$ itself is written as a formal sum

$$r = \sum_{w \in X^*} r(w)\, w.$$

The values $r(w)$ are referred to as the coefficients of the series. The collection of all power series over $X^*$ on $S$ is denoted by $S\langle\langle X^* \rangle\rangle$. Given $r \in S\langle\langle X^* \rangle\rangle$, the subset of $X^*$ defined by $\{w \mid r(w) \neq 0\}$ is termed the support of $r$ and denoted by supp($r$). The subset of $S\langle\langle X^* \rangle\rangle$ consisting of all series with finite support is denoted by $S\langle X^* \rangle$. Series of $S\langle X^* \rangle$ are referred to as polynomials.

Example 2. For $r_1, r_2 \in S\langle\langle X^* \rangle\rangle$, define $r_1 + r_2$ by $(r_1 + r_2)(w) = r_1(w) + r_2(w)$ and $r_1 r_2$ by

$$(r_1 r_2)(w) = \sum_{w_1 w_2 = w} r_1(w_1)\, r_2(w_2)$$

for all $w \in X^*$. Then $S\langle\langle X^* \rangle\rangle$ is a semiring whose zero element is the same as the zero element of $S$ and whose identity element is the empty word $\varepsilon$.

Definition 2.4. A left $S$-semimodule $M$ is a commutative monoid $\langle M, +, 0 \rangle$ together with a left scalar multiplication satisfying, for $a, b, 1 \in S$ and $x, y \in M$:
(i) $a(x + y) = ax + ay$,
(ii) $(a + b)x = ax + bx$,
(iii) $(ab)x = a(bx)$,
(iv) $1x = x$, $0x = 0$, and
(v) $a0 = 0$.
One can define a right $S$-semimodule in a similar way.

Example 3. For $a \in S$, $r \in S\langle\langle X^* \rangle\rangle$, if we define the left scalar product $ar$ by $ar = \sum_{w \in X^*} a\, r(w)\, w$, then $S\langle\langle X^* \rangle\rangle$ is a left $S$-semimodule. For $a \in S$, $r \in S\langle\langle X^* \rangle\rangle$, if we define the right scalar product $ra$ by $ra = \sum_{w \in X^*} r(w)\, a\, w$, then $S\langle\langle X^* \rangle\rangle$ is a right $S$-semimodule.
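To make the semiring of power series concrete, the sketch below represents a polynomial (a series with finite support) as a Python dictionary from words to coefficients. The representation, the choice of non-negative integers as the coefficient semiring, and the function names `add`, `mul`, `scale` and `supp` are all assumptions made for illustration.

```python
from collections import defaultdict

# A polynomial is a dict: word -> coefficient.
# Coefficients are non-negative integers, one example of a commutative semiring.

ZERO = {}          # the zero series
ONE = {"": 1}      # the empty word with coefficient 1


def add(r1: dict, r2: dict) -> dict:
    """Pointwise sum: (r1 + r2)(w) = r1(w) + r2(w)."""
    out = defaultdict(int)
    for r in (r1, r2):
        for w, c in r.items():
            out[w] += c
    return {w: c for w, c in out.items() if c != 0}


def mul(r1: dict, r2: dict) -> dict:
    """Cauchy product: (r1 r2)(w) = sum over w1 w2 = w of r1(w1) r2(w2)."""
    out = defaultdict(int)
    for w1, c1 in r1.items():
        for w2, c2 in r2.items():
            out[w1 + w2] += c1 * c2
    return {w: c for w, c in out.items() if c != 0}


def scale(a: int, r: dict) -> dict:
    """Left scalar product: (a r)(w) = a r(w)."""
    return {w: a * c for w, c in r.items() if a * c != 0}


def supp(r: dict) -> set:
    """Support: the words with a nonzero coefficient."""
    return {w for w, c in r.items() if c != 0}


r1 = {"A": 1, "AC": 2}
r2 = {"C": 1, "": 3}
# "AC" arises twice: as A.C (coefficient 1) and as AC.empty (coefficient 6).
assert mul(r1, r2) == {"A": 3, "AC": 7, "ACC": 2}
assert mul(ONE, r1) == r1 and mul(r1, ZERO) == ZERO
```

With 0/1 coefficients and Boolean operations in place of integer arithmetic, the same two functions reduce to union and concatenation of finite languages.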


If $S$ is commutative then $S\langle\langle X^* \rangle\rangle$ is both a left $S$-semimodule and a right $S$-semimodule. From now on, we assume that the semiring $S$ is commutative.

Definition 2.5. Let $M$, $N$ be two left $S$-semimodules. Then the direct sum of $M$ and $N$, denoted by $M \oplus N$, is the Cartesian product set $M \times N$ of tuples $(m, n)$ where $m \in M$ and $n \in N$, together with an addition, a 0 element, and a left scalar multiplication satisfying the following for $a \in S$, $m_i \in M$, and $n_i \in N$ for $i = 1, 2$:
(i) $(m_1, n_1) + (m_2, n_2) = (m_1 + m_2, n_1 + n_2)$,
(ii) $0 = (0, 0)$, and
(iii) $a(m_1, n_1) = (am_1, an_1)$.
Any element in the direct sum $M \oplus N$ is denoted by $(m, n)$ for $m \in M$ and $n \in N$.

Example 4. The direct sum $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ is a left $S$-semimodule. If we define the multiplication on the direct sum by $(r_1, r_2)(r_3, r_4) = (r_1 r_3, r_2 r_4)$ and the addition on the direct sum by $(r_1, r_2) + (r_3, r_4) = (r_1 + r_3, r_2 + r_4)$, then $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ is a semiring whose 0 element is $(0, 0)$ and whose identity element is $(\varepsilon, \varepsilon)$, where $\varepsilon$ is the empty word of $X^*$. Similarly, since $S$ itself is a left $S$-semimodule, $S \oplus S$ is also a semiring whose 0 element is $(0, 0)$ and whose identity element is $(1, 1)$.

Note that we can think of elements of $S\langle\langle X^* \rangle\rangle$ as single DNA strands, and of elements of $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ as double DNA strands. Then the multiplication on $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ is the ligation of two double DNA strands, while the addition is just the union of two double DNA strands.

A map $r: X^* \times X^* \to S \oplus S$ is a power series. $r$ itself can be written as a formal sum

$$r = \sum_{w_1 \in X^*,\, w_2 \in X^*} r(w_1, w_2)\, (w_1, w_2).$$

The collection of all power series over $X^* \times X^*$ on the semiring $S \oplus S$ is denoted by $(S \oplus S)\langle\langle X^* \times X^* \rangle\rangle$. One can define the support of the power series by

$$\mathrm{supp}(r) = \{(w_1, w_2) \mid r(w_1, w_2) \neq (0, 0)\} \subseteq X^* \times X^*.$$
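The componentwise operations of Example 4 can be sketched with pairs of polynomials. As an assumption for illustration, a polynomial is a dictionary from words to non-negative integer coefficients, and the names `pair_add` and `pair_mul` are hypothetical.

```python
from collections import defaultdict


def add(r1: dict, r2: dict) -> dict:
    """Pointwise sum of two polynomials (dict: word -> coefficient)."""
    out = defaultdict(int)
    for r in (r1, r2):
        for w, c in r.items():
            out[w] += c
    return {w: c for w, c in out.items() if c != 0}


def mul(r1: dict, r2: dict) -> dict:
    """Cauchy product of two polynomials."""
    out = defaultdict(int)
    for w1, c1 in r1.items():
        for w2, c2 in r2.items():
            out[w1 + w2] += c1 * c2
    return {w: c for w, c in out.items() if c != 0}


def pair_add(p: tuple, q: tuple) -> tuple:
    """(r1, r2) + (r3, r4) = (r1 + r3, r2 + r4): union of double strands."""
    return (add(p[0], q[0]), add(p[1], q[1]))


def pair_mul(p: tuple, q: tuple) -> tuple:
    """(r1, r2)(r3, r4) = (r1 r3, r2 r4): ligation of double strands."""
    return (mul(p[0], q[0]), mul(p[1], q[1]))


PAIR_ZERO = ({}, {})
PAIR_ONE = ({"": 1}, {"": 1})   # (empty word, empty word)

# Two double strands: first component read 5'->3', second read 3'->5'.
d1 = ({"AC": 1}, {"TG": 1})
d2 = ({"GT": 1}, {"CA": 1})
assert pair_mul(d1, d2) == ({"ACGT": 1}, {"TGCA": 1})
assert pair_mul(PAIR_ONE, d1) == d1
```

Because both operations act on each component separately, the semiring laws for pairs follow directly from those of the two components.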


Definition 2.6. A map $f$ from a semiring $R$ to a semiring $S$ is a homomorphism if it satisfies, for any $a, b \in R$,
(i) $f(a + b) = f(a) + f(b)$,
(ii) $f(ab) = f(a) f(b)$.
If a homomorphism is both injective and surjective then it is called an isomorphism. A homomorphism from $S$ to itself is called an endomorphism. An isomorphism from $S$ to itself is called an automorphism. A map $f$ from a semiring $R$ to a semiring $S$ is an anti-homomorphism if it satisfies
(i) $f(a + b) = f(a) + f(b)$,
(ii) $f(ab) = f(b) f(a)$.
Similarly we can define anti-isomorphism and anti-automorphism.

Proposition 2.1. $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ is a subsemiring of $(S \oplus S)\langle\langle X^* \times X^* \rangle\rangle$ up to an isomorphism.

Proof. Given two maps (or power series) $r_1, r_2: X^* \to S$, define a map $f$ from $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ to $(S \oplus S)\langle\langle X^* \times X^* \rangle\rangle$ by $f((r_1, r_2)) = r$, where $r(w_1, w_2) = (r_1(w_1), r_2(w_2))$. It is easy to check that the map $f$ is an injection. We now need to prove that $f$ is a homomorphism. In fact, for any $(r_i, s_i) \in S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ for $i = 1, 2$,

$$f((r_1, s_1)(r_2, s_2)) = f(r_1 r_2, s_1 s_2) = \sum_{w_1, w_2 \in X^*} \big((r_1 r_2)(w_1), (s_1 s_2)(w_2)\big)\, (w_1, w_2),$$

and

$$f((r_1, s_1))\, f((r_2, s_2)) = \Big(\sum_{w_{11}, w_{12} \in X^*} \big(r_1(w_{11}), s_1(w_{12})\big)\, (w_{11}, w_{12})\Big) \cdot \Big(\sum_{w_{21}, w_{22} \in X^*} \big(r_2(w_{21}), s_2(w_{22})\big)\, (w_{21}, w_{22})\Big)$$

$$= \sum_{\substack{w_1, w_2 \in X^* \\ w_{11} w_{21} = w_1,\ w_{12} w_{22} = w_2}} \big(r_1(w_{11})\, r_2(w_{21}),\ s_1(w_{12})\, s_2(w_{22})\big)\, (w_1, w_2) = \sum_{w_1, w_2 \in X^*} \big((r_1 r_2)(w_1), (s_1 s_2)(w_2)\big)\, (w_1, w_2).$$

Therefore, $f(r_1 r_2, s_1 s_2) = f(r_1, s_1)\, f(r_2, s_2)$. It is easy to show that

$$f(r_1 + r_2, s_1 + s_2) = f(r_1, s_1) + f(r_2, s_2).$$

Hence, $f$ is a homomorphism. (Q.E.D.)

By Proposition 2.1, we can define the support of a power series $r \in S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. Let $r = (r_1, r_2)$, where $r_1, r_2 \in S\langle\langle X^* \rangle\rangle$. Then

$$\mathrm{supp}(r) = \{(w_1, w_2) \in X^* \times X^* \mid r(w_1, w_2) = (r_1(w_1), r_2(w_2)) \neq (0, 0)\}.$$

3. DNA Strings and regular languages

DNA strings are words over $\{A, C, G, T\}^*$, the free monoid generated by the alphabet $\{A, C, G, T\}$ under the concatenation operation (e.g. $x$ = ACCTGAC). DNA strands are DNA strings with a polarity (e.g. 3′-ACCTGAC-5′). Now, let us ignore the polarity for the moment, and consider all the DNA strings. We encode A by 000, C by 010, G by 101 and T by 111. Then the following proposition is easy to show.

Proposition 3.1. Let $x$, $y$ be two DNA string encodings. Then $x$ and $y$ are complementary if and only if $|x| = |y|$ and $x$ XOR $y = 1 \cdots 1$. Here, XOR represents the bitwise exclusive OR.

Let $S$ be an arbitrary semiring and let $\Sigma = \{0, 1\}$. Now we are ready to build a regular grammar $G = (N, \Sigma, R, Q)$ such that the set of all the DNA string encodings is the regular language $L(G)$, where $N = \{Q\}$ consists of only one nonterminal symbol $Q$, and the rewriting rules of $R$ are:

$$Q \to 000 \mid 010 \mid 101 \mid 111 \mid 000Q \mid 010Q \mid 101Q \mid 111Q.$$

Theorem 3.1. Let $G$ be defined as above. Then the set of all the single DNA string encodings is the regular language $L(G)$.

Given the $S\langle\langle \Sigma^* \rangle\rangle$ left linear system

$$Q = 000 + 010 + 101 + 111 + (000 + 010 + 101 + 111)\, Q \qquad (1)$$

corresponding to the above regular grammar, there exists a unique quasiregular solution (theorem 14.11 in Kuich and Salomaa, 1986), where a solution of Eq. (1) is a power series $r \in S\langle\langle \Sigma^* \rangle\rangle$ such that $r = 000 + 010 + 101 + 111 + (000 + 010 + 101 + 111)\, r$, and $r$ is quasiregular if $r(\varepsilon) = 0$.

Theorem 3.2. Let $r = \sum_{w \in L(G)} w$. Then the power series $r$ is the only quasiregular solution.

Proof. Replacing $Q$ in Eq. (1) by $r$, we can easily show that Eq. (1) holds. (Q.E.D.)
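The grammar of Section 3 and the XOR criterion of Proposition 3.1 can be sketched as follows. The regular expression is one way to recognize $L(G)$, and `derives` follows the rewriting rules directly; both function names, and the reading of the length condition as $|x| = |y|$, are assumptions of this sketch.

```python
import re

ENCODE = {"A": "000", "C": "010", "G": "101", "T": "111"}

# L(G): one or more codewords, mirroring Q -> c | cQ for each codeword c.
DNA_ENCODING = re.compile(r"(?:000|010|101|111)+")


def encode(dna: str) -> str:
    """Map a DNA string to its word over {0, 1}."""
    return "".join(ENCODE[base] for base in dna)


def is_dna_encoding(bits: str) -> bool:
    """Membership in L(G), decided with a regular expression."""
    return DNA_ENCODING.fullmatch(bits) is not None


def derives(bits: str) -> bool:
    """Membership in L(G), decided by following Q -> c | cQ directly."""
    head, tail = bits[:3], bits[3:]
    if head not in ENCODE.values():
        return False
    return tail == "" or derives(tail)


def are_complementary(x: str, y: str) -> bool:
    """XOR criterion: equal length and x XOR y is all ones."""
    return len(x) == len(y) and all(a != b for a, b in zip(x, y))


for word in ["000", "000010", "001", "0000", "", "111101010"]:
    assert is_dna_encoding(word) == derives(word)

assert are_complementary(encode("ACCTGAC"), encode("TGGACTG"))
assert not are_complementary(encode("ACCTGAC"), encode("ACCTGAC"))
```

The empty word is rejected by both recognizers, which matches the requirement $r(\varepsilon) = 0$ for a quasiregular solution.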

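4. DNA Operations

The notations $\hat{\ }$, $R$, $\downarrow$, $\uparrow$, and $\updownarrow$ for biological operations have been defined in (Boneh et al., 1997). Let $x$ be any DNA string. Then
(i) $\hat{x}$ is the string that results when each character of $x$ has been replaced by its complement. By Proposition 3.1, $\hat{x}$ is obtained by flipping the bits of $x$ if the string $x$ is thought of as a string in $\Sigma^*$ under our encoding scheme.
(ii) $x^R$ is the reverse of the string $x$.
(iii) $\uparrow x$ denotes the DNA strand whose corresponding DNA string is $x$ and whose polarity is 5′→3′.
(iv) $\downarrow x$ denotes the 3′-5′ DNA strand complementary to $\uparrow x$.
(v) $\updownarrow x$ denotes the double strand that results when $\uparrow x$ and $\downarrow x$ anneal in solution.

Example 1. The table below summarizes these operators on a DNA string:

$x$ = ACCTGAC
$\hat{x}$ = TGGACTG
$x^R$ = CAGTCCA
$x^{\hat{R}}$ = GTCAGGT ($= \hat{x}^R = \widehat{x^R}$)
$\uparrow x$ = 5′-ACCTGAC-3′
$\downarrow x$ = 3′-TGGACTG-5′
$\updownarrow x$ = 5′-ACCTGAC-3′ paired with 3′-TGGACTG-5′

We generalize these operators to homomorphisms of semirings. For every $r \in S\langle\langle \Sigma^* \rangle\rangle$, we define

$$\hat{r} = \sum_{w \in \Sigma^*} r(w)\, \hat{w}, \qquad r^R = \sum_{w \in \Sigma^*} r(w)\, w^R, \qquad r^{\hat{R}} = \sum_{w \in \Sigma^*} r(w)\, w^{\hat{R}}.$$

Theorem 4.1. The operators $R$ and $\hat{R}$ are anti-automorphisms of $S\langle\langle \Sigma^* \rangle\rangle$. The operator $\hat{\ }$ is an automorphism of $S\langle\langle \Sigma^* \rangle\rangle$.

Proof. For any two power series $r_1, r_2 \in S\langle\langle \Sigma^* \rangle\rangle$,

$$\widehat{(r_1 + r_2)} = \sum_{w \in \Sigma^*} (r_1(w) + r_2(w))\, \hat{w} = \sum_{w \in \Sigma^*} r_1(w)\, \hat{w} + \sum_{w \in \Sigma^*} r_2(w)\, \hat{w} = \hat{r}_1 + \hat{r}_2,$$

and, since $\hat{w} = \hat{w}_1 \hat{w}_2$ whenever $w = w_1 w_2$,

$$\widehat{(r_1 r_2)} = \sum_{\substack{w \in \Sigma^* \\ w_1 w_2 = w}} (r_1(w_1)\, r_2(w_2))\, \hat{w} = \Big(\sum_{w \in \Sigma^*} r_1(w)\, \hat{w}\Big) \Big(\sum_{w \in \Sigma^*} r_2(w)\, \hat{w}\Big) = \hat{r}_1 \hat{r}_2.$$

Since $\hat{\ }$ is equivalent to the bitwise exclusive OR with $1 \cdots 1$, it is one to one. So $\hat{\ }$ is an automorphism of $S\langle\langle \Sigma^* \rangle\rangle$. Similarly, we can show that $(r_1 + r_2)^R = r_1^R + r_2^R$, and, since $S$ is commutative,

$$(r_1 r_2)^R = \sum_{\substack{w \in \Sigma^* \\ w_1 w_2 = w}} (r_1(w_1)\, r_2(w_2))\, w^R = \sum_{\substack{w \in \Sigma^* \\ w_1 w_2 = w}} (r_1(w_1)\, r_2(w_2))\, w_2^R w_1^R = \Big(\sum_{w \in \Sigma^*} r_2(w)\, w^R\Big) \Big(\sum_{w \in \Sigma^*} r_1(w)\, w^R\Big) = r_2^R\, r_1^R.$$

So $R$ is an anti-automorphism of $S\langle\langle \Sigma^* \rangle\rangle$. Since $\hat{R} = R\, \hat{\ } = \hat{\ }\, R$, $\hat{R}$ is an anti-automorphism of $S\langle\langle \Sigma^* \rangle\rangle$ (Q.E.D.).

Let $X = \{A, C, G, T\}$. Under our encoding scheme, $X^*$ is a subset of $\Sigma^*$. More precisely, $X^*$ is a submonoid of $\Sigma^*$, that is, $X^*$ is closed under the multiplication of $\Sigma^*$. From now on we consider $X^*$ as a submonoid of $\Sigma^*$.

Theorem 4.2. The semiring $S\langle\langle X^* \rangle\rangle$ is a subsemiring of $S\langle\langle \Sigma^* \rangle\rangle$. It is invariant under the automorphism $\hat{\ }$ and the anti-automorphism $R$ of $S\langle\langle \Sigma^* \rangle\rangle$, that is, $\hat{\ }(S\langle\langle X^* \rangle\rangle) = S\langle\langle X^* \rangle\rangle$ and $R(S\langle\langle X^* \rangle\rangle) = S\langle\langle X^* \rangle\rangle$.

Proof. First, we show that $S\langle\langle X^* \rangle\rangle$ is a subsemiring of $S\langle\langle \Sigma^* \rangle\rangle$. It suffices to prove that for any two power series $r_1, r_2 \in S\langle\langle X^* \rangle\rangle$, $r_1 + r_2,\ r_1 r_2 \in S\langle\langle X^* \rangle\rangle$. The fact $r_1 + r_2 \in S\langle\langle X^* \rangle\rangle$ is clearly true. Assume that $r_i = \sum_{w \in X^*} r_i(w)\, w$ for $i = 1, 2$. Then

$$r_1 r_2 = \sum_{\substack{w, w_1, w_2 \in X^* \\ w_1 w_2 = w}} (r_1(w_1)\, r_2(w_2))\, w.$$

Since $w_1$, $w_2$ are in $X^*$ and $X^*$ is a monoid, $w_1 w_2$ belongs to $X^*$. Therefore, $r_1 r_2$ is contained in $S\langle\langle X^* \rangle\rangle$.

Secondly, we show that $S\langle\langle X^* \rangle\rangle$ is invariant under the morphisms $\hat{\ }$ and $R$. That is, for any power series $r \in S\langle\langle X^* \rangle\rangle$, $r^R$ and $\hat{r}$ are contained in $S\langle\langle X^* \rangle\rangle$. Since the morphisms have nothing to do with the coefficients of the power series, it suffices to prove that $w^R$ and $\hat{w}$ are contained in $X^*$ for any $w \in X^*$. By our encoding scheme, $(000)^R = 000$; $(010)^R = 010$; $(101)^R = 101$; and $(111)^R = 111$. And $\hat{A} = \hat{\ }(000) = 111 = T$, $\hat{C} = \hat{\ }(010) = 101 = G$, $\hat{G} = \hat{\ }(101) = 010 = C$, and $\hat{T} = \hat{\ }(111) = 000 = A$. Therefore, $w^R$, $\hat{w}$ are in $X^*$ for any $w \in X^*$, since $w$ is a word formed by A (= 000), C (= 010), G (= 101) and T (= 111).

Corollary 4.1. The compositions $R\, \hat{\ }$ and $\hat{\ }\, R$ are anti-automorphisms of $S\langle\langle X^* \rangle\rangle$. Moreover, $R\, \hat{\ } = \hat{\ }\, R$.

Since $R\, \hat{\ } = \hat{\ }\, R$, we use the notation $\hat{R}$ for both.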
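The statements of Theorem 4.1 and Corollary 4.1 can be spot-checked on small polynomials. This is a numerical illustration under assumptions, not a proof: polynomials are dictionaries from binary words to non-negative integer coefficients, and the names `lift`, `comp` and `rev` are hypothetical.

```python
from collections import defaultdict

FLIP = str.maketrans("01", "10")


def add(r1: dict, r2: dict) -> dict:
    """Pointwise sum of two polynomials (dict: word -> coefficient)."""
    out = defaultdict(int)
    for r in (r1, r2):
        for w, c in r.items():
            out[w] += c
    return {w: c for w, c in out.items() if c != 0}


def mul(r1: dict, r2: dict) -> dict:
    """Cauchy product of two polynomials."""
    out = defaultdict(int)
    for w1, c1 in r1.items():
        for w2, c2 in r2.items():
            out[w1 + w2] += c1 * c2
    return {w: c for w, c in out.items() if c != 0}


def lift(word_map, r: dict) -> dict:
    """Extend a map on words to series: sum r(w) w -> sum r(w) word_map(w)."""
    out = defaultdict(int)
    for w, c in r.items():
        out[word_map(w)] += c
    return dict(out)


def comp(r: dict) -> dict:
    """The complement operator: flip every bit of every word."""
    return lift(lambda w: w.translate(FLIP), r)


def rev(r: dict) -> dict:
    """The reverse operator R: reverse every word."""
    return lift(lambda w: w[::-1], r)


r1 = {"000010": 2, "111": 1}     # 2*AC + T under the 3-bit encoding
r2 = {"101": 3, "010000": 1}     # 3*G + CA

assert comp(add(r1, r2)) == add(comp(r1), comp(r2))
assert comp(mul(r1, r2)) == mul(comp(r1), comp(r2))    # homomorphism
assert rev(add(r1, r2)) == add(rev(r1), rev(r2))
assert rev(mul(r1, r2)) == mul(rev(r2), rev(r1))       # order is swapped
assert rev(mul(r1, r2)) != mul(rev(r1), rev(r2))       # R is not a homomorphism
assert rev(comp(r1)) == comp(rev(r1))                  # the two compositions agree
```

The swapped order in the fourth assertion is exactly what distinguishes an anti-homomorphism from a homomorphism.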

5. Double strands and annealing

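By Theorem 4.2, the semiring $S\langle\langle X^* \rangle\rangle$ is a subsemiring of $S\langle\langle \Sigma^* \rangle\rangle$. Hence it is both a left and a right $S$-semimodule (recall that we assume that $S$ is commutative). Therefore, we can define $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$, which is also a subsemiring of $S\langle\langle \Sigma^* \rangle\rangle \oplus S\langle\langle \Sigma^* \rangle\rangle$.

Now we consider the polarity of the DNA strings. Suppose that $r$ is a power series in $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. An element $w \in \mathrm{supp}(r)$ is of the form $(w_1, w_2)$. We make the convention that $w_1 = \uparrow w_1$ = 5′-$w_1$-3′, and $w_2$ = 3′-$w_2$-5′. For example, assume that $(w_1, w_2)$ = (ACTG, AG) $\in \mathrm{supp}(r)$; then what we really mean is that $w_1$ = 5′-ACTG-3′ and $w_2$ = 3′-AG-5′.

Now we define $i_j: S\langle\langle X^* \rangle\rangle \to S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$, for $j = 1, 2$, by $i_1(r) = r \oplus 0$ and $i_2(r) = 0 \oplus r$. Note that the supports of the images of $i_1$ and $i_2$ are empty. Both maps $i_1$ and $i_2$ are homomorphisms of semirings from $S\langle\langle X^* \rangle\rangle$ to $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. Conversely, we define $p_j: S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle \to S\langle\langle X^* \rangle\rangle$ for $j = 1, 2$ by $p_1(r_1 \oplus r_2) = r_1$ and $p_2(r_1 \oplus r_2) = r_2$. Then $p_1$ and $p_2$ are homomorphisms of semirings from $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ to $S\langle\langle X^* \rangle\rangle$.

We identify the operators $\uparrow$, $\downarrow$, and $\updownarrow$ with homomorphisms from $S\langle\langle X^* \rangle\rangle$ to $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ as follows:

$$\uparrow\ = i_1, \qquad \downarrow\ = i_2\, \hat{\ }, \qquad \updownarrow(r) = (r \oplus \hat{r}).$$

Theorem 5.1. The operators $\uparrow$, $\downarrow$, and $\updownarrow$ are homomorphisms from $S\langle\langle X^* \rangle\rangle$ to $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$.

Proof. The proof is straightforward and similar to the proof of Theorem 4.1 (Q.E.D.).

We interpret $p_1$ and $p_2$ as melting and $\updownarrow$ as the process of annealing. Note that $\updownarrow$ describes the process of a single DNA strand looking for its matching single DNA strand to form a double DNA strand.

Theorem 5.2. The map $\updownarrow \hat{\ }\, p_2$ is an endomorphism of $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. The invariant of the endomorphism is the set $\updownarrow(S\langle\langle X^* \rangle\rangle) = \{\updownarrow(r) \text{ for all } r \in S\langle\langle X^* \rangle\rangle\}$, which is a subsemiring of $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$.

Proof. Let us consider the following diagram, in which the composite of the three arrows is $\updownarrow \hat{\ }\, p_2$:

$$S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle \xrightarrow{\ p_2\ } S\langle\langle X^* \rangle\rangle \xrightarrow{\ \hat{\ }\ } S\langle\langle X^* \rangle\rangle \xrightarrow{\ \updownarrow\ } S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle.$$

Since $p_2$, $\hat{\ }$, and $\updownarrow$ are homomorphisms, the map $\updownarrow \hat{\ }\, p_2$ is an endomorphism of the semiring $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$. Now we should prove that the invariant of the endomorphism $\updownarrow \hat{\ }\, p_2$ is not empty. For any $r \in S\langle\langle X^* \rangle\rangle$, $\updownarrow(r) = (r \oplus \hat{r})$ is in the invariant since

$$\updownarrow \hat{\ }\, p_2(\updownarrow(r)) = \updownarrow \hat{\ }\, p_2(r \oplus \hat{r}) = \updownarrow \hat{\ }(\hat{r}) = \updownarrow(\hat{\hat{r}}) = \updownarrow(r).$$

This also proves that the invariant of the endomorphism contains the set $\{\updownarrow(r) \text{ for all } r \in S\langle\langle X^* \rangle\rangle\}$. Conversely, assume that $r_1 \oplus r_2 \in S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ is an element in the invariant. Then

$$\updownarrow \hat{\ }\, p_2(r_1 \oplus r_2) = \updownarrow \hat{\ }(r_2) = \updownarrow(\hat{r}_2) = \hat{r}_2 \oplus \hat{\hat{r}}_2 = \hat{r}_2 \oplus r_2 = r_1 \oplus r_2.$$

So $r_1 = \hat{r}_2$. Thus $r_1 \oplus r_2 = \updownarrow(\hat{r}_2)$. This shows that the invariant is contained in the set $\{\updownarrow(r) \text{ for all } r \in S\langle\langle X^* \rangle\rangle\}$. Hence the invariant of the endomorphism is exactly the set $\{\updownarrow(r) \text{ for all } r \in S\langle\langle X^* \rangle\rangle\}$. It is a subsemiring of $S\langle\langle X^* \rangle\rangle \oplus S\langle\langle X^* \rangle\rangle$ since it is the image of $S\langle\langle X^* \rangle\rangle$ under the homomorphism $\updownarrow$ (Q.E.D.).

The following theorem summarizes the relationships among the operators.

Theorem 5.3.
(i) $\hat{\ }\, R = R\, \hat{\ } = R\, p_2 \downarrow\ = p_1 \uparrow \hat{R} = \hat{R}$,
(ii) $\hat{\ }\, p_1 \updownarrow\ = p_2 \updownarrow\ = \hat{\ }$,
(iii) $p_1 \updownarrow$ is the identity map,
(iv) $\uparrow x \oplus \downarrow x = \updownarrow x$ for $x \in S\langle\langle X^* \rangle\rangle$.

Proof. For relation (i), it suffices to prove that $R\, p_2 \downarrow\ = p_1 \uparrow \hat{R} = \hat{R}$. In fact, for any $r \in S\langle\langle X^* \rangle\rangle$,

$$R\, p_2 \downarrow(r) = R\, p_2(0 \oplus \hat{r}) = R\, \hat{r} = r^{\hat{R}}, \qquad p_1 \uparrow \hat{R}(r) = p_1(r^{\hat{R}} \oplus 0) = r^{\hat{R}}.$$

The relation (ii) is true since, for any $r \in S\langle\langle X^* \rangle\rangle$,

$$\hat{\ }\, p_1 \updownarrow(r) = \hat{\ }\, p_1(r \oplus \hat{r}) = \hat{r}, \qquad p_2 \updownarrow(r) = p_2(r \oplus \hat{r}) = \hat{r}.$$

The relation (iii) is true since, for any $r \in S\langle\langle X^* \rangle\rangle$,

$$p_1 \updownarrow(r) = p_1(r \oplus \hat{r}) = r.$$

The relation (iv) is true by the definitions of $\downarrow$, $\uparrow$ and $\updownarrow$ (Q.E.D.).

Acknowledgements

The author wants to thank Professor Lila Kari for her suggestions and comments.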

References

Adleman, L., 1994. Molecular computation of solutions to combinatorial problems. Science 266, 1021-1024.

Boneh, D., Dunworth, C., Lipton, R.J., 1997. A notation for DNA operations. Unpublished manuscript.

Guarnieri, F., Bancroft, C., 1998. Use of a horizontal chain reaction for DNA-based addition. In: Landweber, L.F., Baum, E. (Eds.), DNA Based Computers II. Dynamic Series, Vol. 44. AMS Press, pp. 105-111.

Kari, L., 1997. DNA Computing: Arrival of Biological Mathematics. The Mathematical Intelligencer 2, 9-22.

Kuich, W., Salomaa, A., 1986. Semirings, Automata, Languages. Springer-Verlag.

Leete, T., Schwartz, M., Williams, R., Wood, D., Salem, J., Rubin, H., 1998. Massively parallel DNA computation: expansion of symbolic determinants. In: Landweber, L.F., Baum, E. (Eds.), DNA Based Computers II. Dynamic Series, Vol. 44. AMS Press, pp. 45-48.

Oliver, J., 1998. Computation with DNA: matrix multiplication. In: Landweber, L.F., Baum, E. (Eds.), DNA Based Computers II. Dynamic Series, Vol. 44. AMS Press, pp. 113-122.

Williams, R., Wood, D., 1998. Exascale computer algebra problems interconnect with molecular reactions and complexity theory. In: Landweber, L.F., Baum, E. (Eds.), DNA Based Computers II. Dynamic Series, Vol. 44. AMS Press, pp. 267-275.