DWT based coding DNA watermarking for DNA copyright protection

Information Sciences 273 (2014) 263–286 Contents lists available at ScienceDirect Information Sciences journal homepage: www.elsevier.com/locate/ins...

Download PDF

1MB Sizes 0 Downloads 57 Views

Report

PDF Reader
Full Text

Information Sciences 273 (2014) 263–286

Contents lists available at ScienceDirect

Information Sciences journal homepage: www.elsevier.com/locate/ins

DWT based coding DNA watermarking for DNA copyright protection Suk-Hwan Lee ⇑,1 Department of Information Security, Tongmyong University, 535, Yongdang-dong, Namgu, Busan, Republic of Korea

a r t i c l e

i n f o

Article history: Received 13 November 2013 Received in revised form 21 February 2014 Accepted 9 March 2014 Available online 19 March 2014 Keywords: DNA watermarking DNA copyright protection Coding DNA Lifting based DWT Genetic information security Mutation resistance

a b s t r a c t DNA watermarking is a technique for copyright protection and ownership authentication of DNA sequences and ensures the security of private genetic information. This paper addresses issues regarding watermarking DNA coding sequences in the frequency domain that confer mutation resistance, amino acid conservation, and security. Multimedia watermarking is designed for robustness and invisibility mainly based on frequency domain representations. However, frequency domain watermarking for a coding DNA sequence is signiﬁcantly constrained because the transformation and inverse transformation must be performed while completely conserving the amino acid sequence. In this paper, we present a coding DNA watermarking method in a lifting-based discrete wavelet transform (DWT) domain that focuses on the feasibility of frequency domain watermarking for DNA sequences. Our method divides a coding DNA sequence into a number of subsequences and allocates all codons in subsequences to a numerical code using the histogram ranks of the amino acids. Our method then calculates a set of DWT coefﬁcients for subsequences of synonymous codons and ﬁnds a subsequence among them with DWT coefﬁcients that are optimal for embedding watermark bits. Finally, our method substitutes this sequence for a subsequence of codons. To secure the watermark, our method generates the binary watermark based on nonlinear congruential – pseudorandom number generator (NC-PRNG) and randomly selects the embeddable position in the DWT domain of the subsequence. We experimentally veriﬁed that our method ensures not only amino acid conservation and security but also resists a point mutation rate of approximately 18.5% point mutations. 2014 Elsevier Inc. All rights reserved.

1. Introduction Patent and copyright protection for DNA sequences and gene products generated by innovations in biotechnology have been available [26,43,52,62,75] for decades. Proponents of DNA protection argue that DNA sequences isolated and puriﬁed from organisms can be protected by copyright or patents based on the analogy between genetic sequences and computer programs. Stemmer et al. [75] proposed that DNA sequences can be protected by converting them to music ﬁles (e.g., mp3 ﬁle) with copyright protection, and most genomic companies would publish their sequences using DNA music ﬁles while retaining IPRs of their published sequences. Opponents argue that DNA protection cannot be justiﬁed, because DNA

⇑ Tel.: +82 51 629 1285. 1

E-mail addresses: [email protected], [email protected]. Present address.

http://dx.doi.org/10.1016/j.ins.2014.03.039 0020-0255/ 2014 Elsevier Inc. All rights reserved.

264

S.-H. Lee / Information Sciences 273 (2014) 263–286

sequences are products of nature, and gene patenting hampers subsequent innovation. Further, puriﬁed DNA sequences are no more than copied or modiﬁed representations of DNA sequences that exist in organisms before isolation and puriﬁcation. Recently, J. Craig Venter’s research team created the synthetic Mycoplasma mycoides JCVI-1.0 genome starting from digitized genome sequence information [29,30]. Synthetic biology could have led the developments of bioindustry as well as have come a turning point in the copyright protection of synthetic genome [43]. DNA watermarking provides a solution for the copyright, authentication, and integrity of DNA sequences and the protection of private genetic information. The security techniques of DNA cryptography [2,27,32,49,60,66,82] and DNA steganography [4,5,7,11,14,18,46,56,69, 72,77,80] have been studied since the late 1990s, and DNA watermarking [34,35,38–42,47,48,58] has recently become an issue. DNA cryptography involves the encryption and decryption of biological DNA. Cryptographies based on the polymerase chain reaction or DNA chips are typically used in DNA cryptography. This makes it difﬁcult to decrypt messages in the DNA without knowledge of the probes or PCR primer pairs and the experimental environment. DNA cryptography is now recognized as a biological encryption technique of the future, but it has not yet begun to replace the conventional encryption techniques because it is difﬁcult to implement [2,66]. DNA steganography hides information in DNA and is useful for creating DNA signatures for authentication and secret communication through a DNA sequence. It primarily requires capacity and security. Most methods convert the message to a DNA sequence using the genetic code and hide it in noncoding regions while considering capacity. Although it is not a trade secret, DNA storage [31,74,78,81,83–85] has been proposed for high-density and low-maintenance data storage of DNA sequences using DNA steganography. These methods can be used to store large quantities of data in noncoding regions while considering the degeneracy of the genetic code, data compression, and the data decoder using multiple alignments. DNA storage primarily requires the capacity and the reliability of the decoder. DNA watermarking embeds a watermark for copyright or private information with imperceptibility, resistance, security, and preservation of biological function. In particular, it must verify the copyright or ownership through the watermark to resist attacks, spontaneous mutation, or intended mutation by pirates. Conversely, the watermark can be embedded so that it is fragile to mutation. However, fragile or semi-fragile DNA watermarking is not effective for application to DNA. Two studies have been performed on DNA embedding methods for DNA steganography or DNA watermarking for noncoding and coding regions [36,71]. Moreover, because DNA steganography and DNA watermarking serve different purposes, they should be designed differently. The primary purpose of the former is secret communication, which requires high capacity and imperceptibility. The purpose of the latter is copyright protection, which requires high resistance. DNA watermarking should be distinguished from DNA steganography and storage. In our present study, we address a DNA watermarking method for the copyright protection of a DNA sequence. The application of DNA cryptography and DNA steganography must consider capacity and security more than their resistance. Conversely, DNA watermarking must consider resistance more than other properties. Such applications include the copyright protection of genetically modiﬁed organisms (GMOs) [39] or the discrimination between wild-type and artiﬁcial genes [29,30]. Therefore, DNA watermarking should be distinguished from DNA steganography and storage. Venter’s team inserted a watermark at intergenic sites known to tolerate transposon insertions [29,30] in order to identify the M. mycoides JCVI-1.0 genome as synthetic. Jupiter et al. [47,48] proposed watermarking strategies for tracking infectious or other agents. Moreover, they compared and analyzed ﬁve goals or features of DNA watermarking and multimedia watermarking. These goals are as follows: message ﬁdelity, error tolerance, easy interpretation, signature availability, and resistance. A watermark in a coding DNA sequence never changes the encoded protein sequence. This is one main difference between DNA watermarking and multimedia watermarking. Balado et al. [8,9,37] modeled the optimum embedding capacity using Shannon theory in the presence of mutations in noncoding and coding regions. The watermark may be embedded in noncoding or coding DNA sequences, and the embedding scheme depends on the region that is used. The noncoding sequence facilitates signal processing for watermark embedding because it does not encode the protein sequence [34,35,38]. However, the noncoding DNA may not represent so-called junk DNA, and whether this DNA in nonfunctional is disputed. Therefore, it has yet to be considered an appropriate medium for DNA steganography or watermarking. Although we assume that noncoding DNA is ‘‘junk,’’ noncoding DNA watermarking suffers from its lack of resistance, because noncoding DNA can be easily substituted by any sequence. Therefore, noncoding DNA is suited to DNA steganography and coding DNA is suited to DNA watermarking. The coding DNA watermarks must conserve the amino acid sequence of the target protein proﬁle, which we call ‘‘amino acid conservation’’. This can be solved using the codon redundancy of the genetic code. As a typical coding DNA watermarking method, Heider et al. proposed the DNA-Crypt algorithm that embeds the binary watermark in the last two-bit base of a codon and corrects a mutation using either the 8/4 Hamming or WDH codes [39]. They applied the DNA-Crypt algorithm to noncoding regions within regulatory RNA [38] and to coding regions according to codon redundancy [40]. They also developed a mitochondrial-adapted version of the DNA-Crypt algorithm [41], which combines the correction code with a simple base substitution that is similar to the least-signiﬁcant bit (LSB) substitution of steganography. LSB substitution is little used in multimedia watermarking because of fragility and security problems. Liss et al. [58] embedded text messages into the open reading frame (ORF) of synthetic genes by considering gene optimization. Thus, they designed ORFs of watermarked genes using the optimized codon usage table of a host organism. For example, given the codon usage table of a particular amino acid, odd-ranked codons in this table represent a binary 0, and even ranked codons represent a binary 1. They encoded a text message into four or six synonymous codons that retain a high degree of codon assignment ﬂexibility. However, this method does not consider mutation resistance and security, although it retains gene optimization.

S.-H. Lee / Information Sciences 273 (2014) 263–286

265

Haughton et al. [36] present two noncoding and coding embedding algorithms for DNA watermarking and DNA steganography, which they called ‘‘BioCode ncDNA (noncoding DNA)’’ for noncoding regions and ‘‘BioCode pcDNA (protein coding DNA)’’ for coding regions. They designed BioCode ncDNA under the constraint of ‘‘no start codon’’ and the BioCode pcDNA under the constraints of ‘‘primary structure preservation’’ and ‘‘codon count preservation.’’ In BioCode pcDNA, they designed a code table that maps a set of available codons to message bits according to dynamic graduated mapping for considering two constraints. First, the code table is initialized according to the codon count. The codon of the host sequence is substituted with a codon that corresponds to input message bits of the code table. The count of this codon is then decreased by one. If the count of any codon is zero, the code table is updated using dynamic graduated mapping. This process is repeated continuously until the message bits or the host codon sequences end. Haughton et al. [36] corrected the decoding errors using marker codes or watermark codes for message bitframe resynchronization. BioCode methods have the strength of codon-count preservation but do not efﬁciently resist intentional mutation and are therefore not sufﬁciently secure. Because the above methods embed the watermark in all available codons, they open the watermarked positions. This means that any pirate can easily destroy the watermark, although it is difﬁcult to extract the watermark. The bit substitution in least signiﬁcant bits (LSB) is not used in multimedia watermarking because of the problem of resistance and security. Most of multimedia watermarking methods are processed in the frequency domain of the discrete cosine transform (DCT) [19], discrete wavelet transform (DWT) [55,65], scale-invariant feature transform (SIFT) [57], as well as others. In particular, the DWT can be widely used for genomic signal processing in DNA frequency analysis [6,61,88], and for the identiﬁcation of protein-coding regions [63,68]. Here, we propose the application of DWT-based watermarking of coding DNA to copyright a DNA sequence and investigate performance from the perspective of signal processing in terms of mutation resistance, amino acid conservation, and security. Unlike DWT-based DNA signal analysis, coding sequence watermarking must process the DWT and the IDWT (inverse DWT) with amino acid conservation. This makes it difﬁcult to embed a watermark. Thus, the coding sequence that is obtained from the IDWT of the watermarked DWT coefﬁcients must conserve the original amino acid sequence. However, the amino acid sequence is difﬁcult to conserve using the general procedure of frequency domain watermarking, speciﬁcally, the series of DWT, embedding, and IDWT steps. Surmounting this difﬁculty motivated our research. Our method processes a number of codon subsequences for the embedding target of a watermark bit. When a codon sequence is processed, we obtain a set of DWT coefﬁcient sequences for all synonymous codon sequences of the processed codon sequence. We then ﬁnd an optimal DWT coefﬁcient sequence for a watermark bit and then we replace one codon subsequence with the optimal DWT coefﬁcient sequence. We experimentally veriﬁed that our method outperforms Heider e al. DNA-Crypt algorithm [39], Liss et al. method [58], Haughton et al. BioCode pcDNA [36] in terms of resistance and demonstrated that the bit error rate (BER) of our method is lower by 1.7–3.9% compared with the BERs of conventional methods and also conserves the amino acids. Moreover, we veriﬁed that our method performs better with the codon adaptation index (CAI) than DNA-Crypt algorithm and BioCode pcDNA, although it is not superior to that of Liss et al. Moreover, the watermark is highly secure from analysis of the randomization of embedding positions based on differential entropy. In the narrative that follows, Section 2 discusses the DNA watermarking framework and related research; Section 3 presents the proposed DWT-based DNA watermarking method; Section 4 evaluates the resistance, security, and amino acid conservation of the proposed method; and Section 5 expresses our views on the effectiveness of DWT for DNA watermarking. 2. Coding DNA watermarking As mentioned above, the watermark is more effectively embedded in the coding region than in the noncoding region because of the modiﬁability of unknown gene regulators and the resistance. In this section, we discuss the watermarking of coding DNA sequences. In this chapter, we review DNA data embedding methods and discuss coding DNA watermarking. 2.1. DNA information hiding Research on DNA security has been conducted from the 2000s and can be classiﬁed as DNA cryptography and DNA information hiding using DNA steganography, including DNA storage and DNA watermarking. Most of the studies on existing methods for hiding information in DNA show that the information is hidden in noncoding or coding DNA sequences and use DNA steganography and DNA watermarking without stating the distinction between the two. Multimedia watermarking for images, videos, and 3D models [19,55,57,65,76] has been studied for years in the diverse ﬁeld of multimedia steganography. Similarly, DNA steganography and DNA watermarking should be studied separately according to a speciﬁc purpose. We suggest the purpose and main features of DNA watermarking and steganography in Table 1. From these considerations, we classify existing studies on DNA watermarking and DNA steganography in Table 2. We compared our coding DNA watermarking method with those of other studies [36,39,58]. DNA watermarking is frequently mentioned as a DNA embedding method. However, direct DNA watermarking for copyright protection started with Heider’s DNA-Crypt watermarking, which was published in 2008. More research on DNA watermarking methods is required as well as its objective evaluation and comparing its features to those of multimedia watermarking.

266

S.-H. Lee / Information Sciences 273 (2014) 263–286

Table 1 Main features of DNA watermarking and DNA steganography (coding DNA watermarking and noncoding DNA steganography). Category Purpose Embedding target

Embedded information Main requirements Mutation resistance

Security

Capacity Amino acid conservation Gene optimization

DNA watermarking – Copyright/patent protection, ownership, and private information protection of DNA sequence – Coding DNA is suitable for watermarking – Noncoding DNA can be replaced by any sequence, and the watermark can be easily removed from noncoding DNA – Watermark for Copyright or ownership information of DNA sequence – Mutation resistance, amino acid conservation (with codon optimization), security for coding DNA – The watermark can be detected using spontaneous mutations or intentional mutations by the pirate. It is as robust as image watermarking – Robust watermarking is required in addition to error correction code (ECC) or repetition – It must be very difﬁcult for any pirate to detect and remove the watermark – The unpredictability of embedded position or embedded values is important. If known, the watermark can be removed by a pirate – A DNA watermark should have sufﬁcient length to represent only the copyright or ownership information – Coding DNA watermarking must preserve the amino acid. Thus, the watermark must not change biological function – Coding DNA watermarking considers gene optimization using parameters such as codon usage, codon text, DNA motifs, GC content, and repetitions/inverse repetitions – Gene optimization limits watermark capacity

DNA steganography – Secret communication, signature, or storage using a DNA sequence – Noncoding DNA is suitable for steganography, because it can store the highest density of data – In coding DNA, the embeddable data rate is limited by amino acid conservation – Message for secret communication/transmission or for signature – Capacity, security for noncoding DNA – Mutation resistance is not as important in steganography. If the hidden information is damaged by mutation, the receiver has only to demand retransmission – DNA storage must resist spontaneous mutations. The powerful ECC of turbo code will provide a solution – The hidden information must be secured using the encryption algorithm. Thus, it is very important to make it difﬁcult to extract the hidden information without the key – If the hidden information is removed by a pirate, a sender only has to retransmit a stego DNA – High-density data should be hidden or stored. The capacity is the primary requirement for stego DNA – Noncoding DNA steganography considers ‘‘no start codon’’ [36] and not amino acid conservation – Noncoding DNA steganography considers ‘‘no start codon,’’ but does not consider gene optimization

2.2. Requirements for coding DNA watermarking A coding DNA watermark has the following requirements: mutation resistance, amino acid conservation, and security. The goals of these requirements are similar to those of multimedia watermarking. (1) Mutation Resistance: Mutations are caused by radiation, insertion of viral genomes, transposons, and mutagenic chemicals, as well as errors during meiosis or DNA replication [22,53]. Randomly mutated sequences can be deﬁned as either spontaneous or induced. The ability to introduce speciﬁc mutations using molecular genetic techniques is virtually unlimited. Coding DNA watermarks must be robust with respect to intentionally introduced mutations as well as spontaneous and induced mutations. Mutations can be classiﬁed as small scale, which affect one or a few nucleotides, or large scale, which may affect chromosomal structure. The former include point mutations, insertions, and deletions, and the latter include multiple chromosomal duplications, deletions, and loss of heterozygosity. DNA watermarking enables veriﬁcation of the watermark in the pirated DNA sequence data for copyright protection or ownership certiﬁcation. When a dispute arises over copyright or ownership of a DNA sequence, watermarked and pirated DNAs are perceptually similar. If the perceptual difference is high, the pirated DNA sequence will not be valuable. Therefore, we consider here only small-scale mutations at the level of single bases, in contrast to large-scale mutations on the chromosomal level. Large-scale mutation requires that the length of coding DNA regions should be large enough to identify the watermark in the region that survives the large-scale mutation. A point mutation changes a single nucleotide base and is often caused by chemicals or errors in DNA replication. Point mutations include silent mutations that encode the original amino acid, missense mutations that encode a different amino acid, and nonsense mutations encode a stop codon that truncates the protein. Deletions and insertions of one or a few nucleotides are often caused by a transposable element (TE) and may shift the reading frame that may encode a nonfunctional protein or generate a so-called fusion protein with altered, new, or no function. Point mutations can be reversed by correcting the code or by repeated watermark embedding. However, it is not easy to extract the watermark when point mutations occur simultaneously in neighboring positions. Most deletions and insertions can be solved by alignment algorithms and are displayed as dot-plots [50,70], pairwise alignments [23,51,89], multiple alignments [16,25,44] or whole sequence alignments [24] when a watermarked DNA sequence is used as the reference for alignment. Löytynoja [59] presented strategies, challenges, benchmarking, and a comparative overview of a number of alignment methods. The watermark can be designed to resist indels using the

S.-H. Lee / Information Sciences 273 (2014) 263–286

267

Table 2 Published studies on DNA watermarking and DNA steganography. DNA security

Category

DNA steganography

Secret message hiding / signatures (Noncoding / Coding)

Data storage

DNA watermarking

Noncoding DNA watermarking

Coding DNA watermarking

Strategies and perspectives

Information hiding for DNA steganography and DNA watermarking

Noncoding and Coding DNA data embedding

Features – Most methods hide the message in noncoding DNA (some of methods use coding DNA). The message (encrypted message) is converted to a message-bearing DNA sequence and is substituted for the host sequence. It resembles DNA signature and secret communication. – Refs.: [4,5,7,11,14,18,46,56,69,72,77,80] – High-density data are stored in noncoding DNA using DNA steganography techniques with error correction code, data compression, and data extraction based on multiple alignments. – Refs.: [31,74,78,81,83–85] – The watermark base sequence is embedded in noncoding DNA. Mutation resistance is considered using the error correction code (ECC), repetition code, or alignment. – Refs.: [38,34,35] – The watermark codon sequence is embedded in coding DNA while considering the mutation resistance by ECC, repetition code, or alignment. – Refs. [36,39,58] [40,41] apply the method of reference [39] – Strategies, perspectives, or progress regarding DNA watermarking applications are presented. – Refs.: [42,47,48] – Data embedding methods are presented for DNA steganography and DNA watermarking without distinguishing between them. The information is embedded in noncoding DNA and coding DNA including ECC. – Refs.: [36,71]

alignment methods described. The watermark’s resistant to indels depends on the performance of the alignment method. However, because the alignment technique is beyond the scope of the present study, we did not evaluate watermark resistance to indels using different alignment methods. Mutations can occur simultaneously at any position regardless of type. Coding DNA watermarking must be designed for resistance by considering all types of mutations. (2) Amino acid conservation: The watermark in a sequence is similar to noise. If the watermark changes the amino acid sequence, similar to a missense mutation, the resulting proteins may be nonfunctional. The process of embedding a watermark creates an intended silent mutation. Thus, the modiﬁcation range of codons should be limited to the synonymous codons of each amino acid. Amino acid conservation corresponds to the invisibility of multimedia watermarking. Existing methods map bits to synonymous codons and then substitute a host codon with one codon that corresponds to input bits. This is similar to LSB substitution in multimedia watermarking. In the case of coding DNA watermarking, it should be considered how the watermark can be effectively embedded in synonymous codons for each amino acid. (3) Security: Obviously, the watermark should not be extracted or removed easily, even when the watermarking algorithm and the conditions of the biological experiment are known. Therefore, the watermark must be embedded in random targets or distributed similarly as a wide spread spectrum (WSS) in the frequency domain with encryption of the watermark. Creating a unique watermark depends on random targets or positions. The DNA-Crypt algorithm [39] uses the cryptographic algorithms of AES, Blowﬁsh, RSA keys, or the one-time pad to encrypt the binary watermark. In this case, it is difﬁcult to extract the watermark without knowledge of the keys but easy to remove an encrypted watermark because the embedding position is easily predicted. (4) Codon optimization: DNA watermarking should consider codon optimization for gene expression and gene analysis. Parameters for codon optimization are codon usage, codon text, DNA motifs, GC content, and direct/inverted repeats. Among these parameters, codon usage is a major factor for efﬁcient translation [17] and is a signiﬁcant factor for optimizing the primary genetic properties of a species [58]. When applying synthetic genes in silico, codon optimization concentrates on codons with the highest frequency of target species [67]. Liss et al. [33] consider codon optimization by embedding the watermark in the reading frame of a synthetic gene using the codon frequency table. Haughton et al. [36] preserved codon count using dynamic graduated mapping. (5) Blind watermark detection: DNA watermarking in bioinformatics has distinct advantages compared to multimedia watermarking. Generally, digital watermarking can be classiﬁed by detection type as blind watermarking or non-blind watermarking. Nonblind watermarking that requires the original data (nonwatermarked data) in the detection process has higher robustness. However, because of leakage of the original data, multimedia watermarking has rarely been considered non-blind watermarking, except for speciﬁc applications. DNA sequencing requires the knowledge of reference genome sequences that have become indispensable for biological research and applications for diagnostics, biotechnology, forensic biology, and biological systematics. The sequence data for a number of most genomes are freely in databases provided, for example, by the United States National Center for Biotechnology Information (NCBI),

268

S.-H. Lee / Information Sciences 273 (2014) 263–286

the European Molecular Biology Laboratory (EMBL), and the DNA Data Bank of Japan (DDBJ). Therefore, blind watermarking is practically impossible, except for applications that do not require access to the genome sequence. In DNA watermarking, the watermarked sequence can be used as the reference in the databases and for watermark extraction. Further, capacity, uniqueness, and transparency are required. Some requirements represent a tradeoff with others. It is very difﬁcult for a coding DNA watermark to satisfy all of the requirements through the embedding of the intended silent mutation. 2.3. Coding DNA watermarking In this subsection, we consider the embedding process available for watermarking coding DNA. Most amino acids, except for methionine (M) and tryptophan (T), are encoded by multiple (degenerate) codons [12]. Watermarking coding DNA obeys the embedding rules to substitute the embedded codon for one codon among synonymous codons. Fig. 1(a) shows an example of a watermarked DNA coding sequence. This generation embeds two bits in the third base of four-fold synonymous codons except for the ﬁrst and last codons. Similarly, n bits can be embedded into n-fold synonymous codons. This process has been applied in most cases of DNA steganography and DNA watermarking, most typically using the DNA-Crypt algorithm [39]. In DNA signal analysis, four letters of a DNA sequence should be mapped into integral, ﬂoating-point, or complex numbers for signal processing. Similarly, coding DNA watermarking maps codons or bases into numerical values and then embeds the watermark in the numerical sequence using signal processing. Cristea [20,21] presented a codon base code (T = 0, C = 1, A = 2, G = 3) that was optimally selected under the minimally non-monotonic correspondence between 64 codons and 20 amino acids. Fig. 1(b) shows an example of watermark embedding using the genetic code of Cristea. This process resembles that of a simple numerical substitution. Here, we used the genetic code published by Cristea for considering integer DWTs and IDWTs that conserve amino acids and provide security, and we introduce an embedding method in the DWT domain using its genetic code. 2.4. DWT-based DNA Watermarking Most multimedia watermarking methods [19,55,57,65,76] for audio, still image, and video media are based on spreadspectrum watermarking (SSW) in the domain of discrete cosine transform (DCT), discrete wavelet transform (DWT), and others. SSW, which is derived from spread-spectrum communication, distributes [19,65] very small-amplitude signals over frequency bins of wide bandwidth so that they become undetectable. SSW requires that the noise generated by signals is higher than that of the amplitudes of all watermarks in order to remove a watermark that is distributed over a wide bandwidth; however, this degrades the quality of multimedia data. Further, it is very difﬁcult to detect all watermark values embedded in a wide bandwidth. Conventional coding DNA watermarking [36,39–41,58] represents spatial domain methods. DNA frequency watermarking is required for improving mutation resistance, amino acid conservation, and security. The most difﬁcult aspect of DNAfrequency watermarking is that the watermarked frequency coefﬁcients should be the inverse of the watermarked DNA sequence while conserving the amino acid sequence encoded by the host DNA sequence. Generally, the frequency transformation of real-value or complex representation hardly conserves the amino acids by the inverse transformation of the embedded watermark. Particularly, the effects of the DWT coefﬁcient on codons within its support region depends on the length of wavelet ﬁlter the scale. All codons within the support region on the maximum scale should be mutated to their synonymous codons to conserve the identity of the amino acid residue. Accordingly, the numerical values of codons that are modiﬁed by the watermark should include the range of numerical values of synonymous codons and allow gene optimization. Considering these limitations, the lifting-based integer DWT is more suitable for DNA watermarking than other transforms. We address below the possibility of watermarking coding DNA using lifting-based integer DWT. 3. Proposed DWT-based coding DNA watermarking Here, we introduce a DWT-based coding-DNA watermarking scheme with the availability of DNA-signal processing 3.1. Deﬁnitions The symbols for alphanumeric and numeric values are written in Roman style and italics, respectively, as follows: A base is b = {T, C, A, G}, and numerical code is b. A codon consists of three consecutive bases; c = (b1b2b3), and numerical value is c. An amino acid A of a codon c is A = h(c), where h() is an n-to-1 mapping function and n is the number of synonymous codons. N D represents the sequence of nucleotide bases of coding DNA; a codon sequence C of D is C ¼ ðck ¼ ðb3k2 b3k1 b3k ÞÞk¼1 . N is the number of codons of C. c1 and cN indicate start and stop codons, respectively, which are not selected for embedding.

269

S.-H. Lee / Information Sciences 273 (2014) 263–286

Non-coding region DNA

Coding region

ncDNA

Base sequence Amino acid sequence

cDNA

ATG GTG GGC CTC M

V

G

L

----Preservation

Watermarked Base sequence Watermarked Amino acid sequence

ATG GTT GGG CTG M

Watermarked DNA

V

G

L

ncDNA

-----

Watermarked cDNA

(a) Watermarked ATG GTT GGG CTG --Base sequence

Base sequence ATG GTG GGC CTC --Number mapping Base number

203

303

331

Ex) T=0, C=1, A=2, G=3

101 ---

Watermarked Base number

203

303

331

101 ---

Watermarked Codon code

35

48

63

19

Codon number mapping Codon code

35

51

61

17

0-63

---

---

Watermark Embedding

(b) Fig. 1. (a) Example of generating a watermarked CDS and (b) another example using the numerical codes of base and the codons of Cristea [20,21].

A codon subsequence BC with NB codons is BCi ¼ ðci1 ci2 ciNB Þ. Here, C is the sum of codon subsequences and the remaining codons. A sequence of watermarked codons is C0 and a sequence of watermarked nucleotide bases is D0 . The amino acids of a host codon sequence C and a watermarked codon sequence C0 must be the same, that is, A = h(C) = h(C0 ). The DWT coefﬁcients of a numerical sequence of codons are X = dwt(C). Given a binary watermark W, the embedding function is f(X, W). 3.2. Embedding constraints in frequency domain The main constraints for DWT-based coding-DNA watermarking are amino acid conservation and codon optimization. We present a codon DNA watermarking method with mutation resistance and security while satisfying the above two constraints. The two constraints are as follows: 3.2.1. Embedding in the DWT domain while conserving amino acid identity DNA signal analysis determines the nucleotide base sequence D from DWT signal X = dwt(D) of numerical sequence D (Fig. 2(a)). It does not require recovery of D from X such as the inverse transformation D0 = dwt1(X) = dwt1(dwt(D)). Unlike DNA signal analysis, DNA watermarking shown in Fig. 2(b) embeds the watermark W in DWT coefﬁcients X of a numerical codon sequence C

X 0 ¼ f ðX; WÞ

ð1Þ 0

1

and obtains a watermarked numerical codon sequence C from the inverse transform dwt 0

1

0

1

1

C ¼ dwt ðX Þ ¼ dwt ðf ðX; WÞÞ ¼ dwt ðf ðdwtðCÞ; WÞÞ:

().

ð2Þ

DNA watermarking requires the inverse transform of the watermarked signal to recover a DNA sequence and the numerical coding of bases and codons that is applied to the forward and inverse DWTs with amino acid conservation (i.e., the amino acids of D and D0 must be the same).

270

S.-H. Lee / Information Sciences 273 (2014) 263–286

(a)

(b) Fig. 2. Process of (a) DNA signal analysis and (b) DNA watermarking based on frequency domain. 1

hðCÞ ¼ hðC 0 Þ ¼ hðdwt ðf ðdwtðCÞ; WÞÞÞ

ð3Þ

The above equation can be expressed by the embedding function f as follows: 1

f ¼ ðdwtðCÞ; WÞ ¼ dwtðh ðhðCÞÞÞ 1

ð4Þ

1

where h (h(C)) = h (A) is one of a number of synonymous codon sequences of C. Thus, X0 should be included in a set of coefﬁcients for synonymous numerical codons, dwt(h1(h(C))), which conserves amino acid identity. Therefore, DNA watermarking must design the embedding function f depending on the above condition. Lifting-based integer DWT serve as a better control for embedding while conserving amino acid identity, compared with the DWT/DCT of the real form and DFT (discrete Fourier transform)/FFT(fast Fourier transform) of the complex form. Given a sequence C of N codons, the number of synonymous codons is extremely high (66N). A long time and considerable memory are required to ﬁnd an embeddable codon among the number of synonymous codon sequences of C. Therefore, a DWT-based DNA watermark should be processed as a unit of the subsequence. As the length of a codon subsequence BC increases, the number of synonymous codon subsequences increases exponentially. This takes more time to calculate f, which should ﬁnd optimal coefﬁcients among dwt(h1(h(BC))) for a watermark. We attempted to design the embedding function f(dwt(BC), W) based on a subsequence using a small number of codons. 3.2.2. Embedding watermarks compatible with codon optimization Because the watermark likely imposes noise on the signal, the watermarked codon sequence will not be optimum. As watermark capacity increases, codon optimization will be diminished. Therefore, we designed the embedding process so that the watermarked codon sequence may be suboptimal for codon usage by using the highest frequency codons in a host DNA sequence. 3.3. Watermark embedding Our embedding scheme focused on the potential method of DWT-based watermarking with amino acid conservation (Fig. 3). Codons are mapped to numerical codes, which are determined by the permutation table of the histogram ranks of the amino acids. We divide a full codon sequence C into subsequences BCi with NB codons, gather the subsequences of all synonymous codons {h1(Ai)}, of BCi, and calculate their DWT coefﬁcients {dwt(h1(Ai))}. Then, we identify a codon subsequence BC0i with optimal DWT coefﬁcients for embedding and replace BCi with BC0i . The length of the watermark bits depends on the number of codon subsequences BCi. The details of each stage are given below. 3.3.1. Numerical codon coding Many researchers have presented numerical base-mapping methods [3,10,13,20,21,33,64,73,79,87] or have analyzed them [45,54,86]. Kwan et al. [54] classiﬁed numerical mapping methods as ﬁxed mappings or physicochemical mappings and analyzed their features. Fixed mapping converts a base sequence into any numerical sequence according to a Voss representation [78], tetrahedral representation [20,73], integers [20,21], ﬂoating-point numbers [13], or complex numbers [3]. Physicochemical mapping that uses the biophysical and biochemical structure of DNA include the electron-ion interaction potentials (EIIP) method [64], DNA-walk method [10,33], and the Z-curve method [87]. These numerical mapping methods for DNA signal analysis are inapplicable to DNA watermarking that performs forward and inverse transforms with amino acid conservation.

S.-H. Lee / Information Sciences 273 (2014) 263–286

271

cDNA A sequence of NG codons, C

Binary information I

NC-PRNG

Numerical mapping

Segment C to sub-sequences with NB codons

Random binary K

i=1 (Sub-sequence index) i
+

no

yes Ci Watermark W

Obtain a set of sub-sequences of synonymous codons Si Obtain a set of DWT subsequences Xi Select pairs of (label,position) for embedding (v=(l,k))

Bit extraction

Wi

Determine the bit number

Find an optimized subsequence Watermarked sub-sequence C i' i++

Watermarked sequence C’ Fig. 3. Proposed embedding process in the DWT domain.

For forward and inverse DWTs with amino acid conservation, our method maps codons to integer values according to the histogram of amino acids and codons. Deﬁne synonymous codons of an amino acid A as S:

S ¼ fck jA ¼ hðck Þ;

8k 2 ½1; jSjg:

ð5Þ

Deﬁne the histogram rank of amino acids of A as rank(A) and the histogram rank of synonymous codons of ck as rank(ck). |S| is the number of synonymous codons of A. rank(A) has the range of [0, 63] and rank(ck) has the range of [1, |S|]. A codon ck is mapped to a numerical value ck,

ck ! ck ¼

rankðAÞ1 X

jSrank1 ðjÞ j þ rankðck Þ

ð6Þ

j¼1

according to rank(A) and rank(ck). The above equation means that a codon ck with high rank(A) is mapped to the direction of zero, and numerical values in synonymous codons map from codons with high rank(ck) to codons with low rank(ck) in order. The numerical values of codons are stored in codon code. Fig. 4 shows numerical values of Homo sapiens ANG (NM_001145) that are mapped according to the histogram of amino acids and synonymous codons. In this ﬁgure, codons with highest rank are mapped to the lowest value of synonymous codons. These codons are used as the center point for codon optimization according to codon usage. 3.3.2. DWT of numerical codon subsequence Our method divides a numerical sequence C with NG (=N 2)codons (except for start and stop codons) into subsequences BC i ¼ ðci1 ci2 ciNB Þ with NB codons, where NB = 2s NG(s P 1).The number of subsequences is bN G =N B c. We use the binary watermark W generated by Gaussian random distribution. The length of the watermark is determined by sequence length NG and the histogram of DWT coefﬁcients in subsequences. Fig. 5(a) shows the bits of a watermark embedded into subsequences with length NB = 8. A numerical subsequence BCi can be transformed to a lifting-based integer DWT up to lmax = log 2NB levels. Here, BCi can be deconstructed into an approximation coefﬁcient Li and detail coefﬁcients Hi(lmax), . . . , Hi(1) on levels from lmax to 1.

272

S.-H. Lee / Information Sciences 273 (2014) 263–286

Occurrence frequency

15 10 5 0

L R G P V N T S K I D A Q H C F E Y M W

(a) 8

Occurrence frequency

6

4

2

0 TGG ATG GAG GAA TAT TAC GAC GAT GCA GCG GCT GCC CAA CAG CAT CAC TGC TGT TTT TTC AAA AAG ATA ATT ATC GTA GTG GTC GTT AAT AAC ACG ACT ACA ACC AGT TCG TCA TCC TCT AGC CCG CCC CCA CCT GGG GGA GGT GGC CGA CGC AGG CGG CGT AGA CTC CTT TTA CTA TTG CTG

Code

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(b) Fig. 4. Numerical values of H. sapiens ANG (NM_001145) according to histogram of (a) amino acids and (b) synonymous codons (blue codons have the highest number of synonymous codons). (For interpretation of the references to color in this ﬁgure legend, the reader is referred to the web version of this article.)

X i ¼ dwtðBC i Þ ¼ ðLi ; Hi ðlmax Þ; Hi ðlmax 1Þ; . . . ; Hi ð1ÞÞ;

ð7Þ

where

NB : Hi ðkÞ ¼ hi ðk; 1Þ; hi ðk; 2Þ; . . . ; hi k; k 2 Fig. 5(b) shows an example of a lifting-based integer DWT for NB = 8. Here, the maximum level lmax is 3. DWT coefﬁcients at level 1 are Hi(1) = (hi(1, 1), hi(1, 2), hi(1, 3), hi(1, 4)). DWT coefﬁcients at levels 2 are Hi(2) = (hi(2, 1), hi(2, 2)). DWT coefﬁcient at level 3 is Hi(3) = (hi(3, 1), ). Our method uses a lifting-based integer DWT for NB = 8, (Fig. 5(b)) and embeds bits of a watermark into detail coefﬁcients that are randomly selected among (Hi(1), Hi(2), Hi(3)). 3.3.3. Bit embedding Our method embeds bits of a watermark as the unit of a subsequence. This section explains how to embed Wi bits of watermark into DWT coefﬁcients Xi of numerical subsequence BCi. The watermarked numerical subsequence BC 0i can be ob1 1 tained by evaluating BC 0i ¼ dwt ðf ðdwtðBC i Þ; W i ÞÞ ¼ dwt ðX 0i Þ from Eqs. (1) and (2). Here, the embedding function f should be designed so that amino acids of BC 0i and BCi are the same (conserved): hðBC i Þ ¼ Ai ¼ hðBC 0i Þ ¼ A0i . If the embedding function is f ðX i ; W i Þ ¼ X 0i ¼ ð1 þ aW i ÞX i , similar to that used for multimedia watermarking, the strength a should be controlled for amino acid conservation. However, it is very difﬁcult to ﬁnd the best strength a that reconciles objectives of the embedding condition, amino acid conservation, and codon optimization in the frequency domain. To resolve this difﬁculty, our method ﬁnds the best subsequence for the three objectives stated above, in the frequency domain, among all synonymous codon subsequences of BCi in the codon numerical domain.

S.-H. Lee / Information Sciences 273 (2014) 263–286

273

(a)

(b) Fig. 5. (a) Watermark embedding based on the DWT of codon subsequences and (b) lifting-based DWT of codon subsequence with 8 codons (3-level).

Given a numerical subsequence BC i ¼ ðci1 ; ci2 ; ; ciNB Þ, our method obtains the set Si of numerical synonymous codon subsequences of BCi

S i ¼ fSi;j jj 2 ½1; jS i jg;

where hðBC i Þ ¼ hðSi;j Þ

ð8Þ

and the set Xi of DWT coefﬁcients of Si,

X i ¼ fX i;j ¼ dwtðSi;j Þjj 2 ½1; jS i jg:

ð9Þ

|Si| is the number of synonymous codon subsequences of BCi;

jS i j ¼

NB Y jSi;j j

ð10Þ

j¼1

that is, the multiple of the number of synonymous codons for each codon in a subsequence. The dynamic range of |Si| is 1 jS i j 2NB . The embedding function f ﬁnds DWT coefﬁcients of synonymous codon subsequences on Xi that satisfy the embedding condition in selected pairs of (level, position).

e i ¼ f ðX i ; W i Þ X i X

ð11Þ

fi f i ¼ dwt 1 ð X e i are inversely transformed; BC e i Þ are considered according to the waterCodon subsequences to which X BC marked codon subsequences that can be substituted with the host subsequence. f i that is closest to the target codon Here, to determine codon optimization, we select a codon subsequence BC0i among BC adaptation index, CAI, and substitute BC0i with host subsequence BCi. As C BCi indicates the sequence except for subsef ij 2 BC f i is deﬁned as CAIðC BCi þ BC f ij Þ as follows: quence BCi, a CAI for any codon subsequence BC

f ij Þ ¼ exp CAIðC BCi þ BC

! jCn j 1 1 X 1 X CUðcnm Þ log 20 n¼1 jCn j m¼1 maxðCUðCn ÞÞ

ð12Þ

CU(cnm) is the frequency of a codon cnm and max (CU(Cn)) is maximum frequency of synonymous codons. Accordingly, the watermarked codon subsequence BC0i can be obtained by

f ij Þj; BC0i ¼ arg minjCAIðC BCi þ BC ei BC

f ij 2 BC fi 8 BC

ð13Þ

274

S.-H. Lee / Information Sciences 273 (2014) 263–286

Let us consider the embedding function of Eq. (11). The DWT coefﬁcient Xi of a numerical codon subsequence BCi has NB 1 detail coefﬁcients: {hi(l, k)|"l e [1, lmax], k e [1, NB/2]}. A detail coefﬁcient on l-level and k-position in the ith subsequence is represented by hi(l, k). From here onwards, we denote (l, k) for (level, position) as v = (l, k) for simple notation. Our method M

selects randomly M ¼ dðN B 1Þ=2e pairs of (level, position); fv ¼ ðl; kÞg1 , among NB 1 detail coefﬁcients as a unit of subsequence and embeds ni,v bits on each of selected coefﬁcient. Thus, the embeddable bit number for a DWT codon subsePM quence Xi is v ¼1 ni;v . The embeddable bits for a DWT coefﬁcient are explained in the next section.The embedding e i must be non-empty. Thus, there must be more than one synonymous subsequence that comprises condition requires that X all selected coefﬁcients that are sufﬁciently embeddable. The embeddable bits and ranges on each of the selected coefﬁcients e i . Our method determines the embeddable bits for each coefﬁcient should be determined concurrently for non-empty X depending on the probability density of coefﬁcients and divides the dynamic range of the probability density function into 2ni;v sections. DWT coefﬁcients determined experimentally using selected (level, position) synonymous codon subsequences, hi(v) = {hi,j(v)|j e [1, |Si|}, are normally distributed on Nðli;j ; r2i;j Þ. Fig. 6(a) shows the probability density functions (pdfs) of approximation coefﬁcients and detail coefﬁcients at level 2 and level 3 for all synonymous subsequences of the ﬁrst codon subsequence Ci ‘‘GTGATGGGCCTGGGCGTTCTCCTC’’ in H. sapiens ANG (NM_001145) sequence. From this ﬁgure, we know that all coefﬁcients are normally distributed. Because hi(v) is distributed normally, pi;v ðhÞ ¼ Nðli;j ; r2i;j Þ, we divide the range of hi(v) ton 2ni;v sections so that cumulative i;v probabilities in sections are equal (Fig. 6(b)). Here, the reference value of section fzi;v ðtÞg2t¼1 is

qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ t 1 2r2i;v erf t 2 ½1; 2ni;v ; ni;v 1 ; 2 where 2 0 13 Z zi;v ðtÞ z ðtÞ l 1 16 t B i;v i;v C7 pi;v ðhÞdh ¼ ni;v ; 41 þ erf @ qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ A5 ¼ ni;v : 2 2 2 2 zi;v ðt1Þ 2r zi;v ðtÞ ¼ li;v þ

ð14Þ

i;v

From the reference values of sections, detail coefﬁcients Yi,v(t) included in a section can be deﬁned by

Y i;v ðtÞ ¼ fpi;j ðv Þjzi;v ðt 1Þ pi;j ðv Þ < zi;v ðtÞ;

8 j 2 ½1; jS i jg:

ð15Þ

e i that have When a watermark wi,v of ni,v bits is converted to decimal number di,v, we ﬁnd synonymous codon subsequences X a detail coefﬁcient Yi,v(di,v) within the range of [zi,v(di,v 1), zi,v(di,v)]. Here, for embedding as many bits as possible in a DWT codon subsequence Xi, our method intends to embed fwi;v gM v ¼1 bits M

M

into M selected coefﬁcients fhi ðv Þgv ¼1 in Xi. Given decimal numbers fdi;v gv ¼1 of watermark bits W i ¼ fwi;v gM v ¼1 , the embedM fY i;v ðdi;v Þgv ¼1

M

ding function f(Xi, Wi) (Eq. (11)) ﬁnds detail coefﬁcients on selected pairs of (level, position) v i ¼ fv ¼ ðk; nÞgv ¼1 that satisfy the embedding condition of Eq. (15). It then selects synonymous DWT codon subsequences in Xi that combine M

fY i;v ðdi;v Þgv ¼1 . M ~ i: f ðX i ; W i Þ ¼ fX i;j fhi;j ðv Þ 2 Y i;v ðdi;v Þgv ¼1 j 8 j 2 ½1; jS i jg ¼ X

ð16Þ

3.3.4. Embeddable bits M e The embeddable bit numbers fni;v gM v ¼1 of selected coefﬁcients fhi ðv Þgv ¼1 should be determined for X i to be non-empty under any binary watermark fwi;v gM . Therefore, our method determines the embeddable bit numbers for a codon subsev ¼1 quence using the joint probability according to the process as follows: (1) Initialize all bit numbers fni;v gM v ¼1 to one. M (2) Calculate the joint probabilities of coefﬁcients fY i;v ðdi;v Þgv ¼1 , ð0 t < 2ni;v Þ for fni;v gM v ¼1 . (3) If all joint probabilities are not zero, increase bit numbers fni;v gM v ¼1 by one in order and return to step 2). If not, determine bit numbers fni;v gM v ¼1 in the previous step and stop. Let Hv represent the random variable on the space of coefﬁcients hi(v) on any (level, position) and let pH1 ;...;HM ðhi ð1Þ; . . . ; hi ðMÞÞ represent the joint probability of random variables fHv gM joint probability v ¼1 . Here, the cumulative n 2 i;v of detail coefﬁcients fY i;v ðtÞgM v ¼1 can be deﬁned according to the reference value of section fzi;v ðtÞgt¼1 by the bit number fni;v gM v ¼1 for selected pairs of (level, position) as follows:

Prðzi;v 1 ðt 1 1Þ H1 < zi;v 1 ðt1 Þ; . . . ; zi;v M ðt M 1Þ HM < zi;M ðt M Þ Z zi;v ðt1 Þ pH1 ;...;HM ðhi ð1Þ; . . . ; hi ðMÞÞ dhi ð1Þ; . . . ; dhi ðMÞ; ¼

8 t 1 2 ½1; 2ni;v 1 ; . . . ; t M 2 ½1; 2ni;v M

ð17Þ

zi;v ðt1 1Þ

When H = (H1, . . . , HM) is a random vector of M order, and it can be deﬁned by a nondegenerate multivariate normal distribution as follows:

S.-H. Lee / Information Sciences 273 (2014) 263–286

275

(a)

(b) Fig. 6. Using H. sapiens ANG (NM_001145) in NB = 8, (a) Probability density functions (pdf) of approximation coefﬁcient and detail coefﬁcients on levels 1 and 2 for synonymous subsequences X1 of the ﬁrst codon subsequence ‘‘GTGATGGGCCTGGGCGTTCTCCTC’’ (except for the start codon) and (b) division of the pdf domain for classifying coefﬁcients h1(v) on v = (l, k) of (level, position).

1 1 T pH ðhi ð1Þ; . . . ; hi ðMÞÞ ¼ qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ exp ðh lÞ R1 ðh lÞ 2 ð2pÞM jRj

ð18Þ

where R is the covariance matrix and |R| is the determinant of R, where l is the average vector of coefﬁcients on pairs of M (level, position), fv ¼ ðl; kÞgv ¼1 . P Using the method above, the embeddable bit number N W i for a codon subsequence BCi is M v ¼1 ni;v . Accordingly, the total embeddable bit number NW for a codon sequence C

NW ¼

bðN2Þ=N M X B cX

ni;v

i¼1

ð19Þ

v ¼1

and the bit rate per codon is NW/(N 1) [bits/codon]. 3.3.5. Embedding process Fig. 7 illustrates the embedding process for a numerical codon subsequence BCi, in NB = 8 and M = 4. The main embedding process for BCi is as follows: (1) Map all codons in coding DNA sequence C to numerical values using histogram ranks of amino acids and codons.

276

S.-H. Lee / Information Sciences 273 (2014) 263–286

(2) Given a numerical codon subsequence BC i ¼ ðci1 ; . . . ; ciNB Þ, obtain the set Si of numerical synonymous codon subsequences of BCi and DWT coefﬁcients Xi of Si. M (3) Randomly select pairs of (level, position) v i ¼ fv ¼ ðl; kÞgv ¼1 . (4) Determine the embeddable bit number N W i ¼ fni;v gM v ¼1 from the joint-probability pH(hi(1), . . . , hi(M). M e i of a numerical synbits from the binary watermark: W i ¼ fwi;v gM . Obtain DWT coefﬁcients X (5) Extract N W ¼ fni;v g i

v ¼1

v ¼1

e onymous codon subsequence with all of coefﬁcients fY i;v ðtÞgM v ¼1 that exist in the range of decimal numbers of Wi. If X i is empty, decrease the bit number by one and move to step 3). f i . Then ﬁnd an optimum subsequence BC 0 with the highest CAI among BC f i and substitute e i to BC (6) Inversely transform X i it with the host subsequence BCi. bðN2Þ=N c

B This process is performed on all subsequences. Both randomly selected pairs of (level, position) V ¼ fv i gi¼1 and random binary data K for encrypting the watermark are used as keys for extracting the watermark. K is explained in the next section.

3.3.6. Generation of NC-PRNG-based binary watermarks To preserve the security of a watermark, our method combines the watermark W with the random binary data K using XOR operator before embedding.

W0 ¼ W K

ð20Þ

We generate random binary data K using the Rény map-based nonlinear congruential-pseudorandom number generator (NC-PRNG) described by Addabbo et al. [1]. They reported that this generator can be used very effectively for PRNG or PRNB design in terms of statistical performance and hardware implementation and is well suited to the NIST SP800-22 test. The basic structure of NC-PRNG comprises the n-bit register for storing the current state xi, the input function g(xi) = xi+1 for reaching the next state xi+1 on recursive relationship, and the output function z(xi) = yi for determining the output on the current state. Given the set of natural numbers of discretized state space of n-bits as

Kn ¼ fx 2 N; 0 x < 2n g;

ð21Þ

the input function g:Kn ? Kn for the maximum period kmax of NG-PRNG is deﬁned by

x gðxÞ ¼ q2ni x þ b i c mod 2n : 2

Fig. 7. Embedding process for a numerical codon subsequence BCi, in NB = 8 and M = 4.

ð22Þ

S.-H. Lee / Information Sciences 273 (2014) 263–286

277

The variable i is a positive integer and q is an odd positive integer. Addabbo et al. present a list of factored parameters (n, i, q, 2n 1) with maximum period kmax ¼ 2n 1. The length of the random binary data K depends on the length of watermark bits NW and the length of codon sequence. Our method generates K on three classes of NW 6 31, 32 6 NW 6 127, NW P 128. The ﬁrst class is for a very short sequence of NW 6 31. Here, we generate K using an n-bit single system with parameters (n, i, q, 2n 1) for n e [3, 31] presented by Addabbo et al. The second class of 32 6 NW 6 127 is used for the codon sequences with lengths less than 128. Here, we generate K using a 32-bit combined system of 17-bit (n = 17, i = 15, q = 5529) and 15-bit systems (n = 15, i = 14, q = 623). Here, K represents random binary 32-bit data. If NW is over 64, K is repeated bN W =32c times. The last class is applied to a longer codon sequence of NW P 128. Here, we generate K using a 128-bit combined system of ﬁve single systems with the parameters as follows: {(n, i, q)} = {(17, 15, 5599), (25, 24, 6106), (26, 25, 3321), (29, 28, 134,217), (31, 20, 524,289)}. Because maximum periods of ﬁve single systems are prime numbers with respect to each other, in this combined system, the period is approximately 2128-bits. Here, K represents random binary 128-bit-data, similar to the second class. If NW > 128, K is repeated bN W =128c times. 3.4. Watermark extraction b the watermark is extracted using two keys, which are embedded pairs of Given a potential target DNA coding sequence C, bðN2Þ=NB c b The extraction and the random binary data K of watermark from C. (level, position) in each subsequence, V ¼ fv i gi¼1 and embedding processes are similar. b to numerical values using histogram ranks of amino acids and codons. (1) Map all codons in a DNA coding sequence C f i , obtain DWT coefﬁcients X f i Þ of BC f i and the set X b i ¼ dwtð BC b i ¼ dwtð b (2) Given a numerical codon subsequence BC S i Þ of f i. DWT coefﬁcients of numerical synonymous codon subsequences b S i of BC M M (3) Calculate reference values fzi;v ð1Þ; . . . zi;v ð2ni;v Þgv ¼1 for M pairs of (level, position) v i ¼ fv ¼ ðl; kÞgv ¼1 using probability M b i. density functions fpi;v ðhÞgv ¼1 from X M (4) Extract bits by comparing detail coefﬁcients fhi ðhÞgv ¼1 on M pairs of (level, position) vi and reference values M ni;v fzi;v ð1Þ; . . . zi;v ð2 Þgv ¼1 as follows:

^ i ¼ fw ^ i;v gM W v ¼1 ;

^ i;v ¼ binaryðdi;v Þ if hi ðv Þ 2 ½zi;v ðdi;v 1Þ; zi;v ðdi;v Þ where; w

ð23Þ

(5) Perform steps (2)–(4) for all numerical codon subsequences and then extract all bits. Because the bits of a watermark bðN2Þ=N c

B c ¼ fW cig are embedded R times, calculate the binary watermark W extracting one bit per R -bits by majority i¼1 rule. b using the XOR operator of W c and the random binary data K; W c K. (6) Obtain the ﬁnal watermark of C

In the equation, above a decimal number di,v for v = (l, k) indicates any region that is divided by reference values. If hi(v) is ^ i;v are binary of decimal number di,v. within the di,vth region, bits w 4. Experimental results We conducted in silico experiments using DNA sequences acquired from the NCBI database (Table 3) and compared our method with DNA-Crypt algorithm of Heider et al. [39], a method [58] of Liss et al., and BioCode pcDNA of Haughton et al. [36]. 4.1. Simulation environment The data capacity of our method differs from those of the two methods mentioned above. Moreover, the method reported by Heider et al. [39] does not consider codon optimization, and the method reported by Liss et al. [58] does not consider mutation resistance. Thus, these differences must be taken into account. Our experimental set of watermark capacities is similar, and its mutation resistance was evaluated in a comparable manner. Consider the capacity of our method. The memory and time required to ﬁnd an optimal subsequence increase with increase in the subsequence length NB. Fig. 8 shows the maximum number of synonymous codons, log10 ðmax jXi jÞ ¼ log10 4NB , on a log scale of exponent s when NB = 2s. The maximum number max |Xi| is 104.8 for s = 3. However, it increases exponentially starting from s = 4. Therefore, we set NB to 8 and (R = 3) and embedded bits of watermark as the unit of codon subsequence. Here, the number of selected coefﬁcients on pairs of (level, position) in a subsequence is M ¼ dðN B 1Þ=2e ¼ 4 (Eq. (13)). This equals the subsequence length NB/2. If a bit number ni,v for a coefﬁcient hi(v) is high, e i in Eq. (16) will be empty. We set the range of {ni,v} to [1,2] and then the embeddable synonymous codon subsequences X 8 N2 determined the embeddable bit number NW using the process described in Section 3.3.4. Here, NW is N W 2 43 N2 ;3 8 8 (ni;v ¼ ½1; 2; N B ¼ 8; R ¼ 3). Accordingly, the watermark capacity of our method CProposed is

278

S.-H. Lee / Information Sciences 273 (2014) 263–286

4 N2 8 N2 N C Proposed N 3 8 3 8

½bits=codon

ð24Þ

The method reported by Heider et al. embeds binary watermarks into four and six synonymous codons as a 2-bit unit and uses WDH(n) for correcting the mutation code. We assume that six synonymous codons for ‘‘Leu,’’ ‘‘Ser,’’ ‘‘Arg’’ are changed to consecutive four-fold codons for embedding 2-bits. Given N4,6 as the number of four and six synonymous codons in a codon sequence, the embeddable bit number NW is 2N4,6/n. Therefore, the watermark capacity of the method reported by Heider et al. is

C Heider ¼

2N 4;6 nN

½bits=codon

ð25Þ

The method reported by Liss et al. embeds binary ASCII codes of character streams into four and six synonymous codons as a 1-bit unit. Therefore, the embeddable bit number NW = N4,6. However, because this method does not affect mutation resistance, the watermark can be lost to mutation. Our experiment embedded the watermark R times to generate mutation resistance. Thus, the watermark capacity the method reported by Liss et al. with the repetition code is

C LissþRepetition ¼

N4;6 RN

½bits=codon

ð26Þ

Haughton et al. reported that dynamic graduated mapping makes the capacity of BioCode pcDNA difﬁcult to analyze. They show that when the codon is uniformly distributed and the length of host DNA sequence is very long, the optimum embedding rate with codon bias preservation approaches the optimum embedding rate without this constraint. Under this assumption, they set the embedding rate of BioCode pcDNA by the BCE (binary codon equivalency) rate [37]: 1.75 bits/codon. If BioCode pcDNA includes the ‘‘watermark codes’’ for correcting errors, this rate depends on the length of a sparse matrix and watermark codes. Our experiment sets the watermark codes of BioCode pcDNA to the same watermark length of our method and then embeds it repeatedly within available codons. For comparable evaluation of mutations, we adjusted n of Heider et al., R of Liss et al., and repeated times of Haughton et al. so that the watermarking capacities of all methods are similar to ours, as follows: C Proposed C Heider C LissþRepetition C BioCode pcDNA . Here, n and R are

n¼

2N4;6 ; C Proposed N

R¼

N4;6

C Proposed N

:

ð27Þ

The watermark capacity CProposed of our method varies within the range of Eq. (24), depending on codon sequences. From 100 experiments, we know that n is close to 4,5 and R is close to 2. However, our experiment set n = 5, R = 3, and 5 repeated times for BioCode pcDNA to improve mutation resistance and evaluated the mutation resistance of all methods. 4.2. Amino acid conservation and CAI We repeated each method 1000 times using the coding DNA sequences presented in Table 3 and analyzed the rates of change of nucleotide bases and amino acids before and after watermarking. In all experiments, we veriﬁed that no amino acids were changed by the three methods. This result was expected, because all methods embed the watermark to conserve the amino acids. Therefore, DNA watermarking is similar to the introduction of silent mutations.

Table 3 DNA sequences analyzed (NCBI, http://www.ncbi.nlm.nih.gov/). GeneBank accession

Gene

Deﬁnition of test sequence

Length

Codon length

JQ670900 JQ439993 NM001179490 L34837 NM000520 NM001145 BC111374

bg1 mgpB CCT2 FET4 HEXA ANG TUBB6

1407bp 256bp 1584bp 1951bp 1590bp 444bp 1770bp

469 85 528 553 530 148 448

BC094877

ACTG2

1304bp

377

BC094878

ARL2BP

848bp

153

BC140809

ANKRD20A2

Bacillus subtilis strain PS beta-glucosidase gene, partial cds Mycoplasma genitalium genotype 91 adhesin gene, partial cds Saccharomyces Cerevisiae S288c Cct2p mRNA, complete cds Saccharomyces cerevisiae Fe(II) transport protein (FET4) mRNA, complete cds Homo sapiens hexosaminidase A (alpha polypeptide), mRNA Homo sapiens angiogenin, ribonuclease, RNase A family, 5, transcript variant 1, mRNA Homo sapiens tubulin, beta 6, mRNA (cDNA clone MGC:132410 IMAGE:7503364), complete cds Homo sapiens actin, gamma 2, smooth muscle, enteric, mRNA (cDNA clone MGC:104923 IMAGE:6728431), complete cds Homo sapiens ADP-ribosylation factor-like 2 binding protein, mRNA (cDNA clone MGC:104924 IMAGE:6728613), complete cds Homo sapiens ankyrin repeat domain 20 family, member A2, mRNA (cDNA clone MGC:176486 IMAGE:9021677), complete cds

2713bp

823

S.-H. Lee / Information Sciences 273 (2014) 263–286

279

Fig. 8. The maximum number of synonymous codons, log10(max |Xi|), on an exponent s shown using a log scale if the number of codons in subsequence NB = 2s.

Table 4 shows the rates of change of nucleotide bases in the watermarked sequence. Although similar in capacity, our method produced more silent mutations from 1.24 times to 1.87 times compared with the methods of Heider et al., Liss et al., and Haughton et al. This explains why all codons in the support region of the wavelet ﬁlters are inﬂuenced by the watermark. The DWT-based method does not embed any bits in a speciﬁc codon but embeds any bits in the combination of codons within the support region of the wavelet ﬁlter. This imparts effective resistance to speciﬁc mutations and insures security. Table 5 shows the CAIs and GC contents of watermarked DNA sequences using four methods. The CAI of Liss et al. is the highest, followed by that of our method. The method reported by Liss et al. embeds 1-bit of a watermark into a speciﬁc codon with the ﬁrst or second-highest codon usage (CU) among four and six synonymous codons. Haughton et al. BioCode pcDNA uses the same CAI value of the host sequence because of codon count preservation. However, because the method reported by Heider et al. allocates four synonymous codons to 2-bits without considering codon optimization, the CAI of our method is the lowest, because all codons in a subsequence are used irrespective of n-fold synonymous codons. To improve CAI, our method ﬁnds a speciﬁc subsequence with the highest CAI among the set of embeddable subsequences (see Eq. (16)) and substitutes it with a host subsequence. Therefore, the CAI of our method is greater by a factor of 0.007 than that of Heider et al. The results of GC content are similar to the results of CAI. Reviewing the results for GC content, Haughton et al. BioCode pcDNA uses the same GC content as the host sequence because of the codon count preservation. Our method is closer by 50% compared with Heider et al. and Liss et al. methods. 4.3. Mutation resistance Most published methods [36,39,40] evaluate mutation resistance to base substitution mutation that is most prevalent in bacteria, and the majority mutations in the coding region [15]. They state that indels can be solved using a powerful alignment algorithm. Our experiment compared our method with those of others for determining resistance to a point mutation. To evaluate mutation resistance, we introduced point mutations at randomly selected nucleotide positions except for the start and stop codons in watermarked sequences. Let the point mutation rate r be deﬁned as the rate of the number of mutated nucleotide bases and the number of total nucleotide bases in a sequence. Our experimental value of r varied from 1% to 20%. Fig. 9(a) shows the BERs of watermarks extracted from a mutated DNA sequence. These results veriﬁed that the BER of our method is lower by 0.0004–0.043 than that of Heider et al. and lower by 0.005–0.093 than that of Liss et al. If we assume that the threshold of mutation resistance is BER <0.15, our method is resistant to a mutation rate of approximately 18.1%. However, the methods reported by Heider et al., Liss et al., and Haughton et al. are resistant to mutation rates of approximately 16.4%, 14.2%, and 15.8%, respectively. We performed statistical analysis of watermark detection by using the receiver operating characteristic (ROC). We deﬁned the probabilities false positives (PFP) and false negatives (PFN) for the detection threshold as follows:

^ < eÞ; PFP ¼ Probabilityðeðw f ðCÞÞ

e 2 ½0:1; 0:3 ðDetect H1 when H0 is trueÞ;

ð28Þ

^ > dÞ; PFN ¼ Probabilityðeðw f ðCÞÞ

d ¼ 1:5e ðDetect H0 when H1 is trueÞ

ð29Þ

280

S.-H. Lee / Information Sciences 273 (2014) 263–286

Table 4 Watermark capacities and change rates of nucleotide bases (n = 5 in WDH(n) for Heider’s DNA-Crypt algorithm [39], R = 3 for Liss’s method. [58], 5 repeated times of Haughton’s BioCode pcDNA with watermark codes [36]). GeneBank accession

Gene

Watermark capacity (bit/codon)

JQ670900 JQ439993 NM001179490 L34837 NM000520 NM001145 BC111374 BC094877 BC094878 BC140809

bg1 mgpB CCT2 FET4 HEXA ANG TUBB6 ACTG2 ARL2BP ANKRD20A2

Base change rate (%)

Proposed Heider Liss + Repetition Haughton + Repetition Proposed Heider Liss + epetition Haughton + Repetition

Average

0.172 0.188 0.21 0.164 0.213 0.216 0.162 0.161 0.150 0.164

0.142 0.164 0.193 0.169 0.196 0.189 0.183 0.18 0.156 0.157

0.119 0.129 0.16 0.141 0.164 0.155 0.151 0.151 0.13 0.131

0.180 0.194 0.223 0.199 0.226 0.219 0.213 0.210 0.186 0.187

20.31 20.39 21.90 20.43 23.08 21.17 17.85 18.65 17.21 21.18

7.03 10.51 11.80 10.84 12.83 12.61 11.24 11.58 9.59 9.88

8.20 13.22 17.74 12.29 14.71 19.14 13.17 11.67 16.56 14.58

11.21 16.22 20.74 15.29 17.71 18.14 16.17 14.67 15.52 17.58

0.180

0.173 0.143

0.204

20.21

10.79 14.13

16.32

Table 5 CAI and GC content of host DNA and watermarked DNA sequences. GeneBank accession

CAI

GC content (%)

JQ670900 JQ439993 NM001179490 L34837 NM000520 NM001145 BC111374 BC094877 BC094878 BC140809

0.825 0.918 0.812 0.766 0.833 0.878 0.732 0.785 0.869 0.790

Average

0.821 0.825

Host Proposed Heider Liss + Repetition Haughton + Repetition Host Proposed Heider Liss + Repetition Haughton + Repetition 0.769 0.914 0.802 0.814 0.810 0.874 0.767 0.789 0.879 0.834

0.758 0.912 0.796 0.821 0.804 0.841 0.754 0.799 0.863 0.837

0.799 0.932 0.828 0.809 0.854 0.889 0.796 0.850 0.904 0.871

0.825 0.918 0.812 0.766 0.833 0.878 0.732 0.785 0.869 0.790

46.05 35.16 43.62 40.32 52.14 51.57 58.93 52.52 47.06 39.98

43.93 37.50 42.11 39.18 54.90 52.25 58.92 52.51 46.19 36.45

0.818 0.853

0.821

46.73 46.40

44.13 33.20 46.52 43.64 53.58 50.90 53.72 48.89 46.18 40.82

44.70 34.37 46.90 38.93 54.72 48.42 60.04 55.44 46.40 38.07

46.05 35.16 43.62 40.32 52.14 51.57 58.93 52.52 47.06 39.98

46.15 46.79

46.73

b extracts the pseudo-watermark W c from an unwaThe expression e(a b) represents the BER between two vectors a and b. f ð CÞ b and f ð C b 0 Þ extracts the watermark W b0. c 0 from the watermarked and mutated sequence C termarked and mutated sequence C We generated 1000 random watermarks for each DNA sequence and calculated a random watermark based on PFP and PFN using generated watermarks as references. PFP and PFN at the thresholds e and d produce an ROC curve for point mutations. We deﬁned the mutation rate r to 10% and varied e from 0.1 to 0.3 (d = 1.5e) and produced the ROC curve shown in Fig. 9(b). From this ROC curve, we found that the values of PFP and PFN obtained with our method are lower than those for the method reported by Heider et al. and Liss et al. 4.4. Security DNA watermarking should ensure watermark security, although the algorithm and the full sequence are known. Thus, the watermark can be very difﬁcult to detect or removed without the knowledge of the keys. Heider et al. and Liss et al. used four and six synonymous codons for the embedding targets at known embedding positions. Therefore, the watermarks introduced using these methods can be easily removed or detected. Our method uses the embeddable pairs of (level, position) V on each subsequence and the random binary data K using NC-PRNG as the watermark keys. If V is unknown, a pirate must ﬁnd all embedded coefﬁcients of subsequences. Although V is known, the watermark W can be very difﬁcult to obtain without the knowledge of K. We can ensure the security of K, because it is generated by the Rény map-based NC-PRNG [1], which is suited to the NIST SP800-22 test. bðN2Þ=N B c We estimate the entropy of embeddable pairs of (level, position) V ¼ fv i gi¼1 for ensuring the randomization quantity. The number of selected pairs in a subsequence is M. Therefore, the probability that M coefﬁcients are selected among NB 1 detail coefﬁcients in a subsequence is NB 1 1= . Because the number of subsequences is bðN 2Þ=N B c, the probability that the embedded coefﬁcients are deM tected in all subsequences is

PrðVÞ ¼

bðN2Þ=N Y Bc i¼1

1 : NB 1 M

ð30Þ

281

S.-H. Lee / Information Sciences 273 (2014) 263–286

0.30 Proposed method Heider's method Haughton's method Liss's method

Bit error rate

0.25 0.20 0.15 0.10 0.05 0.00 0

2

4

6

8

10

12

14

16

18

20

Mutation rate [%]

(a)

False negative probability

0.4

Proposed method Heider's method Haughton's method Liss's method

0.3

0.2

0.1

0.0 0.0

0.1

0.2

0.3

0.4

False positive probability

(b) Fig. 9. (a) BERs of the extracted watermarks in point mutated codon sequences and (b) ROC curve for a point mutation rate of 10%.

Assume that the space V is X and Pr (V) is uniform, the entropy H(V)

X NB 2 N2 log2 : HðVÞ ¼ PrðVÞlog2 ðPrðVÞÞ log2 ðPrðVÞÞ ¼ NB M V2X

ð31Þ

The entropy H(V) is close to a ﬁrst-order linear slope as a function of codon length N. Fig. 10 shows the dependence of entropy H(V) on the codon length N in NB = 8 and M = 4. When the codon length N is 202, H(V) is 128.23, indicating that approximately 2128 cases can be produced similar to using a 128-bit key. Other methods may improve security. For example, the lengths of subsequences can be speciﬁcally selected, or the number of embeddable pairs of (level, position) can be varied for each subsequence. Moreover, this can be combined with DNA watermarking that encrypts the sequence using keys of PCR primer pairs or a speciﬁc probe. 5. Conclusions We report here the introduction of a lifting-based DWT watermarking scheme for coding DNA sequences. Our method maps codons to numerical codes using a permutation table of the histogram ranks of the amino acids and embeds the watermark into randomly selected coefﬁcients in H bands in the DWT domain of numerical codon subsequences. To prevent the variation of amino acids by watermarked DWT coefﬁcients, the embedding process should deviate from the general process of forward DWT, watermark embedding, and inverse DWT. Thus, we obtained a set of DWT coefﬁcients of synonymous codon subsequences for a host codon subsequence BCi and then found a codon subsequence BC i among a set that is optimum for bit embedding; next, we replaced BCi with BC i . We compared our method with those of Heider et al. [39], Liss et al. [58],

282

S.-H. Lee / Information Sciences 273 (2014) 263–286

350 300

Entropy H(V)

250 200 150 100 50 0

0

100

200

300

400

500

600

Codon length N Fig. 10. Entropy H(V) of embedded pairs of (level, position) in subsequences varying the codon length N (NB = 8, M = 4).

and Haughton et al. [36] under comparable conditions by adjusting similar watermark capacities. From the experimental results, we determined that our method is more resistant to mutation than other methods, although our method has lower CAI and GC content than that of Liss et al. Further, we determined that our method provides good security because of the watermark generation using Rény map-based NC-PRNG and the high entropy of embeddable positions. The use of any frequency-based coding DNA sequence watermarking poses problems. The ﬁrst problem is that it takes a long time to search among a set of maximum synonymous codon subsequences to ﬁnd an optimal subsequence that preserves the amino acids. Therefore, the number of codons in subsequences cannot be high. The second problem is that the L-band method is more sensitive to point mutations if the level of DWT is increased, and this produces unstable performance. This differs from the results of DWT-based multimedia watermarking. The L-band method must become more robust and stable. If these problems are solved, the DWT as well as the DCT and the FFT will be effective for watermarking DNA coding sequences in the frequency domain. Georgiou et al. [28] recently deﬁned the fuzzy entropy and clarity of DNA sequence using fuzzy polynucleotide space (FPS) for DNA analysis and classiﬁcation. They reported that these values indicate the complexity of polynucleotide. In the future, we will analyze the entropy of watermarked DNA sequences, control the embedding domain and strength, and evaluate the differential entropy of watermarks through fuzzy entropy and clarity by using FPS. Acknowledgement This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2011-0023118). Appendix A. Effect of codon subsequence on wavelet ﬁlter Let us look into the effect of codon sub-sequence on wavelet coefﬁcients and the method for ﬁnding wavelet coefﬁcients while conserving the amino acids, dwt(h1(h(BC))). This is very necessary for designing the embedding function with amino acid conservation based on the DWT domain. We experimented using a three-level lifting-based wavelet of CDF 2/2 ﬁlters on sub-sequences with NB = 8 codons. The wavelet coefﬁcients X are considered as the sequence [L3, H1, H2, H1, H3, H1, H2, H1]. A.1. Analysis of effect of codon subsequence on wavelet coefﬁcients Wavelet coefﬁcients that are changed by the watermark make the codon subsequence change. Nevertheless, the amino acids of the codon subsequence must be conserved. We analyze the effect of codon subsequence on each wavelet coefﬁcient through the response of the inverse wavelet ﬁlter, Y = dwt1(d[k]), for impulse wavelet coefﬁcients X = d[k] (k e [0, 7]). The response that is computed as the output is not on the boundary of the support region of the wavelet ﬁlter. We denote the response for X = d[0] (L3) as YL3[n] = dwt1(d[0]), that for X = d[4] (H3) as YH3[n] = dwt1(d[4]), that for X = d[k] (k = 2, 6) (H2) as YH2[n] = dwt1(d[k]), and that for X = d[k] (k = 1, 3, 5, 7) (H1) as YH1[n] = dwt1(d[k]) Responses for impulse wavelet coefﬁcients are shown in Fig. 11 and Table 6. These results show the natural effect of codon subsequence on levels and support regions of wavelet ﬁlters. If the levels of a wavelet are high, such as L3 and H3, the support regions become wide and many codons are inﬂuenced by the wavelet coefﬁcients. This means that as the level becomes higher the number of codons for which the amino acid should be conserved increases but the number of synonymous codon sequences that satisfy the embedding condition decreases.

283

S.-H. Lee / Information Sciences 273 (2014) 263–286

Impulse respone for inverse wavelet filter

1.0

0.8

0.6

0.4

0.2

0.0

-8

-6

-4

-2

0

2

4

6

8

Index

(a) Impulse respone for inverse wavelet filter

1.0 0.8 0.6 0.4 0.2 0.0 -0.2 -0.4 -12

-10

-8

-6

-4

-2

0

2

4

6

8

10

12

Index

(b) Fig. 11. Response of inverse wavelet ﬁlters on impulse wavelet coefﬁcients; (a) approximation coefﬁcient (L3) X = d[0] and (b) detail coefﬁcients (X = d[4] for H3, X = d[k] (k = 2, 6) for (H2), X = d[k] (k = 1, 3, 5, 7) for (H1)).

A.2. Estimation of displacement coefﬁcients with amino acid conservation The displacement coefﬁcients of the watermark that conserves the amino acid can be estimated from the impulse responses of the inverse wavelet ﬁlters. These coefﬁcients are used for directly computing the DWT coefﬁcients of synonymous codon subsequences, dwt(h1(h(BC))), without wavelet transformation. The watermarked coefﬁcients X0 can be set as the original coefﬁcients X and separately as the displacement coefﬁcients D of the watermark separately.

X 0 ¼ f ðX; WÞ ¼ X þ D ¼ X þ

NX B 1

d½nd½n

ð32Þ

n¼0

Since the DWT is a linear transformation, the recovered codon numerical subsequence BC0 can be obtained by the inverse DWT, dwt1(X0 ) as follows: 1

dwt ðX 0 Þ ¼ BC 0 ¼ dwt

1

Xþ

NX B 1

! d½nd½n

1

¼ dwt ðXÞ þ dwt

n¼0

1

NX B 1

! d½nd½n

n¼0

4 X d½2k þ 1Y H1 ¼ BC þ d½0Y L3 þ d½4Y H3 þ ðd½2 þ d½6ÞY H2 þ k¼1

ð33Þ

284

S.-H. Lee / Information Sciences 273 (2014) 263–286

Table 6 Responses of inverse wavelet ﬁlter for impulse wavelet coefﬁcients. Category

Level

Response

Approximation

3

Y L3 ½n ¼ ½ 0:12; 0:25; 0:37; 0:5; 0:62; 0:75; 0:87;

Detail

3

Y L3 ½n ¼

Support region 1; "

0:87; 0:75; 0:62; 0:5; 0:37; 0:25; 0:12

½0:03; 0:06; 0:09; 0:12; 0:15; 0:18; 0:21; 0:25; 0; 0:25; 0:5; 0:5; 0:25; 0; 0:25; 0:21; 0:18; 0:15; 0:12; 0:09; 0:06; 0:03

2

Y H2 ½n ¼ ½ 0:06; 0:12; 0:18; 0:25; 0:25;

1

Y H1 ½n ¼ ½ 0:1; 0:3;

0:75; "

0:75; "

[7 7] [11 11]

0:7

" 0:25; 0:25; 0:18; 0:12; 0:06

[5 5] [2 2]

0:3; 0:1

Two amino acid sequences must be the same as h(BC0 ) = h(BC). If h(BC0 ) = h(BC) is set as A, this equation is expressed as follows: 4 X hðBCÞ ¼ h BC þ d½0Y L3 þ d½4Y H3 þ ðd½2 þ d½6ÞY H2 þ d½2k þ 1Y H1

! ¼A

ð34Þ

k¼1

This equation can be represented by

d½0Y L3 þ d½4Y H3 þ ðd½2 þ d½6ÞY H2 þ

4 X 1 d½2k þ 1Y H1 ¼ h ðAÞ BC:

ð35Þ

k¼1

Any synonymous codon sub-sequence h1(A) is set as BCi. Then, if the above equation is represented as the nth codon numerical code BC[n] (or cn), it becomes

d½0Y L3 þ d½4Y H3 þ ðd½2 þ d½6ÞY H2 þ

4 X d½2k þ 1Y H1 ¼ BC i ½n BC½n;

8 n 2 ½0; 7;

ð36Þ

k¼1

This can be represented in matrix form.

YD ¼ BC i BC;

ð37Þ

D ¼ Y 1 ðBC i BCÞ

ð38Þ

where

2

d½0

3

2

6 d½1 7 6 7 7 D¼6 6 .. 7; 4 . 5

Y L3 ½0

3

6 BC i ½1 BC½1 7 6 7 7; BC i BC ¼ 6 .. 6 7 4 5 . BC i ½7 BC½7

d½7 2

BC i ½0 BC½0

Y H1 ½0

Y H2 ½0

Y H1 ½0

Y H3 ½0

Y H1 ½0

Y H2 ½0

Y H1 ½0

3

2

3

y00

y10

y07

6 Y L3 ½1 Y H1 ½1 Y H2 ½1 Y H1 ½1 Y H3 ½1 Y H1 ½1 Y H2 ½1 Y H1 ½1 7 6 y 7 6 10 6 7 6 Y¼6 .. .. .. .. .. .. .. 7 ¼ 6 .. 6 .. 4 . . . . . . . . 5 4 . y70 Y L3 ½7 Y H1 ½7 Y H2 ½7 Y H1 ½7 Y H3 ½7 Y H1 ½7 Y H2 ½7 Y H1 ½7

y11 .. .

y71 7 7 7 .. .. 7 . . 5

y71

y77

The displacement coefﬁcient D can be computed by solving the seventh-order non-homogeneous simultaneous equations by Cramer’s rule.

d½n ¼

jY n j ; jYj

ð39Þ

where

2

y00 6y 6 10 Yn ¼ 6 6 .. 4 . y70

y0n1

BC i ½0 BC½0

y0nþ1

y1n1 .. .. . . y7n1

BC i ½1 BC½1 y1nþ1 .. .. . . BC i ½7 BC½7 y7nþ1

y07

3

y17 7 7 7 .. .. 7 . . 5 y77

S.-H. Lee / Information Sciences 273 (2014) 263–286

285

|Y| and |Yn| are the determinants of the matrices Y and Yn. When D is a set of displacement coefﬁcients such that the determinants of all components d[n](n e [0, 7]) exist, a set of DWT coefﬁcients that are available for watermarking is X0 = X + D. The embedding function g() should map an original coefﬁcient X to one coefﬁcient in X0 ; that is, g(X, W) = X0 , X0 e X0 . The DWT coefﬁcients of synonymous codon subsequences can be easily computed with dwt(h1(h(BC))) = dwt(h1(A)) = dwt(BCi). Moreover, the difference between dwt(BC) and dwt(BCi) can be computed with dwt(YD) = dwt(BCi BC). References [1] T. Addabbo, M. Alioto, A. Fort, A. Pasini, S. Rocchi, V. Vignoli, A class of maximum-period nonlinear congruential generators derived from the Rényi chaotic map, IEEE Trans. Circ. Syst.-I: Reg. Pap. 54 (2007) 816–828. [2] B. Anam, K. Sakib, M.A. Hossain, K. Dahal, Review on the advancements of DNA cryptography, in: Proceedings of the 4th International Conference on Software, Knowledge, Information Management and Applications, 2010. [3] D. Anastassiou, Genomic signal processing, IEEE Signal Process. Magaz. 18 (2001) 8–20. [4] M. Arita, Writing information into DNA, Mol. Comput., LNCS 2950 (2004) 23–35. [5] M. Arita, Y. Ohashi, Secret signatures inside genomic DNA, Biotechnol. Prog. 20 (2004) 1605–1607. [6] A. Arneodo, C. Vaillant, B. Audit, F. Argoul, Y. d’Aubenton-Carafa, C. Thermes, Multi-scale coding of genomic information: from DNA sequence to genome structure and function, Phys. Rep. 498 (2011) 45–188. [7] O.O. Babatunde, Deoxyribonucleic acid (DNA) as a hypothetical information hiding medium: DNA mimics basic information security protocol, J. Eng. Technol. Res. 3 (2011) 148–154. [8] F. Balado, On the Shannon capacity of DNA data embedding, in: Proceedings of the IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2010, pp. 1766–1769. [9] F. Balado, On the embedding capacity of DNA strands under insertion, deletion and substitution mutations, in: SPIE Media Forensics and Security XII, vol. 7541, San José, USA, 2010. [10] J.A. Berger, S.K. Mitra, M. Carli, A. Neri, Visualization and analysis of DNA sequences using DNA walks, J. Frank. Inst. 341 (2004) 37–53. [11] M. Borda, O. Tornea, DNA secret writing techniques, in: Proceedings of the 8th International Conference on Communications (COMM), 2010, pp. 451– 456. [12] T.A. Brown, Genomes 3, Garland Science, 2006. [13] N. Chakravarthy, A. Spanias, L.D. Iasemidis, K. Tsakalis, Autoregressive modeling and feature analysis of DNA sequences, EURASIP J. Appl. Sig. Process. 1 (2004) 13–28. [14] C.C. Chang, T.C. Lu, Y.F. Chang, R.C.T. Lee, Reversible data hiding schemes for Deoxyribonucleic Acid (DNA) medium, Int. J. Innov. Comput., Inform. Control 3 (2007) 1145–1160. [15] J. Chen, Y. Wu, H. Yang, J. Bergelson, M. Kreitman, D. Tian, Variation in the ratio of nucleotide substitution and Indel rates across genomes in mammals and Bacteria, Mol. Biol. Evol. 26 (2009) 1523–1531. [16] R. Chenna, H. Sugawara, T. Koike, R. Lopez, T.J. Gibson, D.G. Higgins, J.D. Thompson, Multiple sequence alignment with the Clustal series of programs, Nucl. Acids Res. 31 (2003) 3497–3500. [17] B.K.-S. Chung, D.-Y. Lee, Computational codon optimization of synthetic gene for protein expression, BMC Syst. Biol. 6 (2012). [18] C.T. Clelland, V. Risca, C. Bancroft, Hiding messages in DNA microdots, Nature 399 (1999) 533–534. [19] I.J. Cox, J. Kilian, F.T. Leighton, T. Shamoon, Secure spread spectrum watermarking for multimedia, IEEE Trans. Image Process. 6 (1997) 1673–1687. [20] P.D. Cristea, Conversion of nucleotides sequences into genomic signals, J. Cell. Mol. Med. 6 (2002) 279–303. [21] P.D. Cristea, Genetic signals an emerging concept, in: Proceedings of IWSSIP‘01, 2011, pp. 17–22. [22] R.C. Deonier, S. Tavaré, M.S. Waterman, Computational Genome Analysis: An Introduction, Springer, 2005. [23] K.W. DeRonne, G. Karypis, Pareto optimal pairwise sequence alignment, IEEE/ACM Trans. Comput. Biol. Bioinform. 10 (2013) 481–493. [24] C.N. Dewey, Whole-genome alignment, Evol. Genom., Meth. Mol. Biol. 855 (2012) 237–257. [25] R.C. Edgar, MUSCLE: a multiple sequence alignment method with reduced time and space complexity, BMC Bioinform. 5 (2004). [26] B.M. Gaff, R.A. Loren, G. Dickson, Protecting bioinformatics as intellectual property, Computer 46 (2013) 15–17. [27] A. Gehani, T. LaBean, J. Reif, DNA-based cryptography, aspects of molecular computing, Lect. Notes Comp. Sci. 2950 (2004) (2004) 34–50. [28] D.N. Georgiou, T.E. Karakasidis, J.J. Nieto, A. Torres, A study of entropy/clarity of genetic sequences using metric spaces and fuzzy sets, J. Theoret. Biol. 267 (2010) 95–105. [29] D.G. Gibson, G.A. Benders, C. Andrews-Pfannkoch, E.A. Denisova, H. Baden-Tillson, J. Zaveri, T.B. Stockwell, A. Brownley, D.W. Thomas, M.A. Algire, C. Merryman, L. Young, V.N. Noskov, J.I. Glass, J.C. Venter, C.A. Hutchison III, H.O. Smith, Complete chemical synthesis, assembly, and cloning of a Mycoplasma genitalium genome, Science 319 (2008) 1215–1220. [30] D.G. Gibson, J.I. Glass, C. Lartigue, V.N. Noskov, R.-Y. Chuang, M.A. Algire, G.A. Benders, M.G. Montague, L. Ma, M.M. Moodie, C. Merryman, S. Vashee, R. Krishnakumar, N. Assad-Garcia, C. Andrews-Pfannkoch, E.A. Denisova, L. Young, Z.-Q. Qi, T.H. Segall-Shapiro, C.H. Calvey, P.P. Parmar, C.A. Hutchison III, H.O. Smith, J.C. Venter, Creation of a bacterial cell controlled by a chemically synthesized genome, Science 329 (2010) 52–56. [31] N. Goldman, P. Bertone, S. Chen, C. Dessimoz, E.M. LeProust, B. Sipos, E. Birney, Towards practical, high-capacity, low-maintenance information storage in synthesized DNA, Nature (2013). doi:http://dx.doi.org/10.1038/nature11875. [32] G. Gui, L. Qin, Y. Wang, X. Zhang, An encryption scheme using DNA technology, in: Proceedings of IEEE International Conference on Bio-Inspired Computing: Theories and Applications (BICTA08), 2008, pp. 37–42. [33] A.D. Haimovich, B. Byrne, R. Ramaswamy, W.J. Welsh, Wavelet analysis of DNA walks, J. Comput. Biol. 13 (2006) 1289–1298. [34] D. Haughton, F. Balado, Repetition coding as an effective error correction code for information encoded in DNA, in: Proceedings of IEEE 11th International Conference on Bioinformatics and Bioengineering (BIBE), 2011, pp. 253–260. [35] D. Haughton, F. Balado, A modiﬁed watermark synchronisation code for robust embedding of data in DNA, Acoustics, in: Proceedings of IEEE International Conference on Speech and Signal Processing (ICASSP), 2013, pp. 1148–1152. [36] D. Haughton, F. Balado, BioCode: two biologically compatible algorithms for embedding data in non-coding and coding regions of DNA, BMC Bioinform. 14 (2013). [37] D. Haughton, F. Balado, Performance of DNA data embedding algorithms under substitution mutations, in: Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW), 2010, pp. 201–206. [38] D. Heider, A. Barnekow, DNA watermarks in non-coding regulatory sequences, BMC Res. Notes 2 (2009). doi:http://dx.doi.org/10.1186/1756-0500-2125. [39] D. Heider, A. Barnekow, DNA-based watermarks using the DNA-Crypt algorithm, BMC Bioinform. 8 (2007). [40] D. Heider, A. Barnekow, DNA watermarks – a proof of concept, BMC Mol. Biol. 9 (2008). doi:http://dx.doi.org/10.1186/1471-2199-9-40. [41] D. Heider, D. Kessler, A. Barnekow, Watermarking sexually reproducing diploid organisms, Bioinformatics 24 (2008) 1961–1962. [42] D. Heider, A. Barnekow, DNA Watermarking: challenging perspectives for biotechnological applications, Curr. Bioinform. 6 (2011) 375–382. [43] C.M. Holman J.D, Copyright for engineered DNA: An idea whose time has come?, West Virg Law Rev. 113 (2011) 699–738. [44] K.S. Hossain, D. Patnaik, S. Laxman, P. Jain, C. Bailey-Kellogg, N. Ramakrishnan, Improved multiple sequence alignments using coupled pattern mining, IEEE/ACM Trans. Comput. Biol. Bioinform. 10 (2013) 1098–1112.

286

S.-H. Lee / Information Sciences 273 (2014) 263–286

[45] M.K. Hota, V.K. Srivastava, Performance analysis of different DNA to numerical mapping techniques for identiﬁcation of protein coding regions using tapered window based short-time discrete Fourier transform, in: Proceedings of International Conference on Power, Control and Embedded Systems (ICPCES ‘10), 2010, pp. 1–4. [46] S. Jiao, R. Goutte, Code for encryption hiding data into genomic DNA of living organisms, in: Proceedings of the 9th International Conference on Signal Processing (ICSP 2008), 2008, pp. 2166–2169. [47] D. Jupiter, T. Ficht, Q.-M. Qin, A. Rice-Ficht, J. Samuel, P. Figueiredo, Genomic polymorphisms as inherent watermarks for tracking infectious agents, Front. Microbiol. 1 (2010). [48] D. Jupiter, T. Ficht, J. Samuel, Q.-M. Qin, P. Figueiredo, DNA watermarking of infectious agents: progress and prospects, PLoS Path. 6 (2010). [49] T. Kazuo, O. Akimitsu, S. Isao, Public-key systems using DNA as a one-way function for key distribution, BioSystems 81 (2005) 25–29. [50] V. Kirzhner, S. Frenkel, A. Korol, Minimal-dot plot: ‘‘Old tale in new skin’’ about sequence comparison, Inform. Sci. 181 (2011) 1454–1462. [51] B. Knudsen, M. Miyamoto, Sequence alignments and pair hidden Markov models using evolutionary history, J. Mol. Biol. 333 (2003) 453–460. [52] D.-H. Koo, Effective protection of DNA sequences and gene innovations, Institute of Intellectual Property (IIP) Bulletin, 2007. [53] T.A. Kunkel, DNA replication ﬁdelity, J. Biol. Chem. 279 (2004) 16895–16898. [54] H.K. Kwan, S.B. Arniker, Numerical representation of DNA sequences, in: Proceedings of IEEE International Conference on Electro/Information Technology (EIT ‘09), 2009, pp. 307–310. [55] C.-C. Lai, C.-C. Tsai, Digital image watermarking using discrete wavelet transform and singular value decomposition, IEEE Trans. Instrument. Measur. 59 (2010) 3060–3063. [56] A. Leier, C. Richter, W. Banzhaf, H. Rauhe, Cryptography with DNA binary strands, Biosystems 57 (2010) 13–22. [57] Y.-T. Lin, C.-Y. Huang, G.C. Lee, Rotation, scaling, and translation resilient watermarking for images, IET Image Process. 5 (2011) 328–340. [58] M. Liss, D. Daubert, K. Brunner, K. Kliche, U. Hammes, A. Leiherer, R. Wagner, Embedding permanent watermarks in synthetic genes, PLOS One 7 (2012) e42465. [59] A. Löytynoja, Alignment methods: strategies, challenges, benchmarking, and comparative overview, Meth. Mol. Biol. 855 (2012) 203–235. [60] M. Lu, X. Lai, G. Xiao, Lei Qin, Symmetric-key cryptosystem with DNA technology, Sci. China Ser. F: Inform. Sci. 50 (2007) 324–333. [61] J.A. Machado, A.C. Costa, M.D. Quelhas, Wavelet analysis of human DNA, Genomics 98 (2011) 155–163. [62] M.S. McBride, Bioinformatics and intellectual property protection, Berk. Technol. Law J. 17 (2002) 1332–1363. [63] J.P. Mena-Chalco, H. Carrer, Y. Zana, R.M. Cesar Jr., Identiﬁcation of protein coding regions using the modiﬁed Gabor-wavelet transform, IEEE/ACM Trans. Comput. Biol. Bioinform. 5 (2008) 198–207. [64] A.S. Nair, S.P. Sreenadhan, A coding measure scheme employing electron-ion interaction pseudopotential (EIIP), Bioinformation 1 (2006) 197–202. [65] C.I. Podilchuk, W. Zeng, Image-adaptive watermarking using visual models, IEEE J. Select. Areas Commun. 16 (1998) 525–539. [66] C. Popovici, Aspects of DNA cryptography, Ann. Univ. Craiova, Math. Comp. Sci. Ser. 37 (2010) 147–151. [67] D. Raab, M. Graf, F. notka, T. Schödl, R. Wagner, The GeneOptimizer Algorithm: using a sliding window approach to cope with the vast sequence space in multiparameter DNA sequence optimization, Syst. Synth. Biol. 4 (2010) 215–225. [68] K.D. Rao, M.N.S. Swamy, Analysis of genomics and proteomics using DSP techniques, IEEE Trans. Circ. Syst. I 55 (2008) 370–378. [69] V.I. Risca, DNA-based steganography, Cryptologia 25 (2001) 37–49. [70] J. Schultz, F. Leese, C. Held, Introduction to Dot-Plots , 2008. [71] B. Shimanovsky, J. Feng, M. Potkonjak, Hiding data in DNA, IH ‘02 in: Revised Papers from the 5th International Workshop on Information Hiding, 2002, pp. 373–386. [72] H.J. Shiu, K.L. Ng, J.F. Fang, R.C.T. Lee, C.H. Huan, Data hiding methods based upon DNA sequences, Inform. Sci. 180 (2010) 2196–2208. [73] B.D. Silverman, R. Linker, A measure of DNA periodicity, J. Theor. Biol. 118 (1986) 295–300. [74] G.C. Smith, C.C. Fiddes, J.P. Hawkins, J.P. Cox, Some possible codes for encrypting data in DNA, Biotechnol. Lett. 25 (2003) 1125–1130. [75] W.P. Stemmer, How to publish DNA sequences with copyright protection, Nat. Biotechnol. 20 (2002) 217. [76] Z. Su, W. Li, J. Kong, Y. Dai, W. Tang, Watermarking 3D CAPD models for topology veriﬁcation, Comp. Aided Des. 45 (2013) 1042–1052. [77] D. Tulpan, C. Regoui, G. Durand, L. Belliveau, S. Léger, HyDEn: a hybrid steganocryptographic approach for data encryption using randomized errorcorrecting DNA codes, BioMed Res. Int. (2013). Article ID 634832. [78] R. Vishwakarma, N. Amiri, High density data storage in DNA using an efﬁcient message encoding scheme, Int. J. Inform. Technol. Converg. Serv. 2 (2012) 41–46. [79] R.F. Voss, Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Phys. Rev. Lett. 68 (1992) 3805–3808. [80] Z. Wang, X. Zhao, H. Wang, G. Cui, Information hiding based on DNA steganography, in: 4th IEEE International Conference on Software Engineering and Service Science (ICSESS), 2013, pp. 946–949. [81] P.C. Wong, K.K. Wong, H.P. Foote, Organic data memory using the DNA approach, Commun. ACM 46 (2003). [82] L. XueJia, L. MingXin, Q. Lei, H. JunSong, F. XiWen, Asymmetric encryption and signature method with DNA technology, Sci. China Inform. Sci. 53 (2010) 506–517. [83] N. Yachie, K. Sekiyama, J. Sugahara, Y. Ohashi, M. Tomita, Alignment-based approach for durable data storage into living organisms, Biotechnol. Prog. 23 (2007) 501–505. [84] N. Yachie, Y. Ohashi, M. Tomita, Stabilizing synthetic data in the DNA of living organisms, Syst. Synth. Biol. 2 (2008) 19–25. [85] M. Yamamoto, S. Kashiwamura, A. Ohuchi, M. Furukawa, Large-scale DNA memory based on the nested PCR, Nat. Comput. 7 (2008) 335–346. [86] C. Yin, S. Yau, Numerical representation of DNA sequences based on genetic code context and its applications in periodicity analysis of genomes, in: IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB ‘08), 2008, pp. 223–227. [87] R. Zhang, C.T. Zhang, Z curves, an intutive tool for visualizing and analyzing the DNA sequences, J. Biomol. Struct. Dynam. 11 (1994) 767–782. [88] X.-Y. Zhang, F. Chen, Y.-T. Zhang, S.C. Agner, M. Akay, Z.-H. Lu, M.M.Y. Waye, S.K.-W. Tsui, Signal processing techniques in genomic engineering, Proc. IEEE 90 (2002) 1822–1833. [89] M. Zhao, W.P. Lee, E.P. Garrison, G.T. Marth, SSW library: an SIMD Smith-Waterman C/C++ library for use in genomic applications, PLoS One 8 (2013) e82138.