Efficient computation of shortest absent words in complete genomes

Efficient computation of shortest absent words in complete genomes

Information Sciences 435 (2018) 59–68 Contents lists available at ScienceDirect Information Sciences journal homepage: www.elsevier.com/locate/ins ...

1MB Sizes 0 Downloads 201 Views

Information Sciences 435 (2018) 59–68

Contents lists available at ScienceDirect

Information Sciences journal homepage: www.elsevier.com/locate/ins

Efficient computation of shortest absent words in complete genomes Abdulrakeeb M. Al-Ssulami Department of Computer Science, College of Computer & Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia

a r t i c l e

i n f o

Article history: Received 13 February 2017 Revised 23 October 2017 Accepted 29 December 2017 Available online 29 December 2017 Keywords: Shortest absent words Sequence analysis Alignment free techniques Algorithms Complexity analysis

a b s t r a c t Computation of shortest absent words in genomes of organisms is a recent alignment free technique to compare genomes of species and to make functional inferences. This technique offers a good indication to mutations, phylogenetic tree construction, genome reconstruction, drug target identification, and pesticide development. The traditional alignments techniques are computationally expensive and impractical on complete and large genomes. Currently, there are two solutions exist to solve this problem on genome level of large sizes, stochastic-based method is running in time complexity of O (n ), where the shortest length of absent words is approximated and entered manually and the other is standalone method costs O (nlog2 log2 (n + 1 )/2 )-time, where the shortest length is determined automatically by the algorithm itself. In this paper, we present standalone method to compute the shortest absent words in time complexity of O (n ). The proposed method produces the complete set of shortest absent words from the two strands of the whole human genome in 3.5 min. Our method scans the genomes twice. In the first round, the upper bound for the shortest length is computed and in the second round the absent words are identified. The method is efficient and easy to use; it requires only the genome as input parameter. © 2017 Elsevier Inc. All rights reserved.

1. Introduction Identification of shortest absent words is an emerging alignment free technique in which one can simply comparing very large DNA sequences by comparing few absent words instead of alignment those very large sequences that consume time and space [13,14]. That is, looking for missing short subsequences in one or more genomes is same as looking for similarity [10]. Discovery of absent words can be useful in applications such as drug target identification and pesticide development and it helps in studying the mutational characteristics in species [1,2]. In addition, absent words can be useful in forensics [8], phylogenetic tree construction [5], and sometimes we are interested in discovering short absent subsequences that are present in viruses such as Ebola or Zika and absent from Human genome [17]. The shortest absent word is the subsequence which is never exist in the complete genome or the input sequence such that removing the rightmost or leftmost symbol will lead to that the remaining word is exist in the genome. On the other hand, we have the minimal absent words that constitute a superset of longer lengths and this problem uses an indexing data structures such as suffix tree and suffix array that consume large amount of memory and processing time that make it impractical with very large genomes [4,9,15]. By looking for the recent method MAW [4] for discovering the minimal absent words, the allocated RAM for Human Chromosome 1 of size 249 MB is 6 GB (about 24 times the size of Chromosome 1)

E-mail address: [email protected] https://doi.org/10.1016/j.ins.2017.12.055 0020-0255/© 2017 Elsevier Inc. All rights reserved.

60

A.M. Al-Ssulami / Information Sciences 435 (2018) 59–68

and the allocated RAM by the previous method PFG [15] is 30 GB (about 123 times the size of Chromosome 1). That is, the complete human genome requires 72 GB of RAM using MAW and 369 GB of RAM using PFG. Asymptotically, the time and size are expressed as O (n ) while actually the size and time are influenced by the length of absent words γ and the actual complexity is O (γ n ), where n is the size of input sequence. The specified length of absent words (γ ) is constant, so it was omitted from the complexity. To the best of our knowledge, there are two solutions proposed to solve the shortest absent words problem, deterministic solution (BIW) [11] that has O (nlog2  log4 (n + 1 ) )-time complexity. BIW method uses the binary search to determine the shortest length with scanning the complete genome log2  log4 (n + 1 ) times and thus the shortest absent words are identified. The second solution is stochastic-based linear in time solution (ESW) [18] having time complexity of O (n ), where n is the size of genome. The second solution is stochastic-based which is not standalone. That is, the shortest length is computed separately using Maple package with cost O (γ 3 ), where γ is the shortest length for absent words. This means that the stochastic solution consists of two separated parts, Maple for computing the shortest length and ESW for identification the shortest absent words. Remarkably, the time of computing the shortest length using Maple was neglected. The main problem of the approximation method is that the shortest length must be entered manually and if the approximated length is a little bit longer than the actual length, more space and time are required and if the approximated length is less than the actual length by one, the program fails to obtain the shortest absent words so that the length must be increased by one and the program is run again. In this article, we present two solutions. Stochastic-based solution as a competitor for ESW in computing the shortest length in time complexity of O (1 ) instead of O (γ 3 )-time complexity of ESW, where the two solutions assume that the probability distribution is available. Our mathematical formula is based on Poisson distribution to estimate the shortest length for the absent words for the case when the memory is very important. The proposed formula estimates the shortest length of very large genomes exactly in time complexity of O (1 ). The second solution is deterministic and standalone as a competitor for BIW where the shortest length is determined by scanning the complete genome in the first round and performing the second scan to identify the shortest absent words. Thus, the proposed algorithm scans the complete genome just twice instead of log2  log4 (n + 1 ) times as BIW. The remainder of this paper is organized as follows. Section 2 presents preliminaries on DNA alphabet and absent words; derivation of the formula for estimating the shortest length; the pseudocode of our deterministic algorithm; and the complexity analysis of the algorithm, Section 3 shows the results on complete genomes, and Section 4 provides our conclusion. 2. Methods 2.1. Preliminaries A finite set of symbols is denoted by  and the cardinality of this set is denoted by | |. In this paper, we consider the DNA alphabet  = {A, C, G, T } or  = {a, c, g, t }. Let the input S = {s1 , s2 , . . . , sq } refers to the set of sequences over  . The size of input sequences is n = |s1 | + |s2 | + · · · + |sq |. Experimentally, sequencers produce genome sequences containing symbols don’t belong to  such as “N”. So, we say that c is a valid symbol if and only if c ∈  and the word ω is valid if and only if ω[i] ∈  , where 0 ≤ i < k and k = |ω|. The valid word ω is said to be present in S if there is at least one position j such that ω[0.k − 1] = st [ j. j + k − 1], where st ∈ S and 0 ≤ j ≤ |st | − k. Otherwise, the valid word ω is absent. The valid word ϖ of length γ is said to be shortest absent if the following two conditions are satisfied: (1) The word ϖ is absent, does not occur in S. (2) No word of length less than γ does not occur in S which means that all words shorter than ϖ must to occur in S. Computation of the shortest absent words is the problem of identification all valid absent words of shortest length in very large biological sequences such as Human, Chimpanzee, Mus musculus, and other large genomes without using indexing data structures such as suffix tree and suffix array that consume large amount of memory and more processing times. 2.2. Shortest length approximation The expected number of occurrences of a valid word ω can be calculated using Poisson distribution [16,18]. This method requires a priori knowledge about the probability distribution of the alphabet  . In this section we present a good approximation for the shortest length of the absent words. Let P(ωi , S ) is the likelihood of occurring the symbol ωi at a specific position in the genome S. The probability that ω to occur at a specific position in S is,

P (ω , S ) =

|ω| −1 

P ( ωi , S )

(1)

i=0

Therefore, the expected number of occurrences is estimated as:



E ( ω , S ) = |S |

|ω| −1  i=0



P ( ωi , S )

(2)

A.M. Al-Ssulami / Information Sciences 435 (2018) 59–68

61

Poisson distribution provides a good approximation for counting the word ω in the genome S. The random variable X is called a Poisson random variable with the parameter μ, if its probability distribution is given by the following equation:



P (X = x ) =

μx e−μ

0

f or x = 0, 1, 2, . . . , otherwise.

x!

(3)

Where μ = λt and λ is the average (mean) of X per unit t. Now, the expected number of occurrences of ω in genome S, E(ω, S ), represents the mean μ. Thus,

P ( Xω = c ) =

[E(ω, S )]c e−E(ω,S ) c!

(4)

Based on Eq. (4), the probability that the word ω does not occur in genome S is,

P(Xω = 0 ) = e−E(ω,S )

(5)

From Eq. (5), the word ω is absent from genome S if P(Xω = 0 ) > 1/e. By counting the occurrences of each symbol, we have to choose the least probability distribution ν as follows.

ν = min{P(i , S ) | i ∈ {0, 1, . . . , | | − 1}}

(6)

 i is the (i + 1 )-th symbol belongs to  . Then the shortest approximated length (L) for the absent words is derived using Eq. (5) as follows. L e−E(ω,S ) = e−ν |S| > 1/e

−ν L |S| > ln(1/e )

ν L < 1/|S | 1

)/l n(ν ) |S | L ≈ −l n(|S| )/l n(ν ) L < l n(

(7)

The function ln() is the natural logarithm with the Euler’s number base e. Eq. (7) gives the approximated shortest length for the absent words in terms of the length of the genome and the least probability distribution of symbols ν of the alphabet set  . Now, the formula in Eq. (7) can be generalized by counting the β -mer instead of a single symbol. This method also gives us more accurate shortest length. For β -mer, we have | |β different words. Then, the word of length L contains L/β words of length β . So, we should rewrite the formula in Eq. (7) as,



L≈

−βl n(|S| )/l n(νβ )

β

I f νβ  = 0 , otherwise.

(8)

where ν β is,

νβ = min{P(iβ , S ) | i ∈ {0, 1, . . . , | |β − 1}}

(9)

 β is the set of β -mer over  of equal lengths. Experimentally, we test the formula in Eq. (8) on set of genomes. The results show that when the β is increased, more accurate results are achieved. So, it is enough to set β =5 (see Table 2). Our approximation formula that is represented by Eq. (8) gives optimal space in most cases. On human genome, the required space is 0.5 MB to discover all absent words. Note that, ESW assumes that the content of the four nucleotides are available and this assumption not always satisfied. Similarly, our shortest length prediction, using Eq. (8), assumes that the ν β is available where it can be computed by scanning the complete genome only once using Karp and Rabin hashing function [12]. In spite that this approximation is better, it is not so enough to rely on. So, in the next section, we present a standalone linear time method to compute the shortest absent words. 2.3. The linear proposed method Our method utilizes two simple data structures, a bit array for computing the shortest absent words and another temporary short list of type integer (8 bytes integer) to count the number of occurrences as a preprocessing step to compute the best upper bound for the shortest length. We call this method as Shortest Absent Words (SAW). Let n is the size of genome S. The upper bound for the shortest length,

l = log4 (n + 1 ) = log2 (n + 1 )/2

(10)

and temporary list temp[] of length 4τ must be created and initialized with zero. The following step

Then τ = l/2 is scanning the complete genome to count the occurrences of each valid present word. Let is the smallest number of occurrences. If is greater than one, an additional necessary length is computed as: = log2 ( )/2. Thus, the computed length will be ϕ = τ + . If is less than or equal one, we set ϕ = τ + . The size of temp[] list is computed in Table 3 with different input data sizes using the first encoding scheme: (A, a) → 0, (C, c) → 1, (G, g) → 2, and (T, t) → 3.

62

A.M. Al-Ssulami / Information Sciences 435 (2018) 59–68

Fig. 1. The layout of Bit _Array[] list.

The second step is computing the shortest absent word. This step is performed by scanning the whole genome only once. After computing the possible shortest length, the temporary list, temp[], is now can be discarded and a new bit array, Bit _Array[], must be reserved to compute the absent words. The size of the bit array, Bit _Array[], exactly computed as follows.

nˇ = 1 +

ϕ 

4k = 4/3 ( 4ϕ − 1 ) + 1

(11)

k=1

Note that the first bit is used as a sentinel to avoid testing the end of a word each time a valid word is hashed and the second encoding scheme: (A, a) → 1, (C, c) → 2, (G, g) → 3, and (T, t) → 4 is used. The Karp and Rabin hashing model [12] is used to hash the valid present words. For every valid present word ω of size k,

hash(ω ) =

k−1 

Encode(ω[i] )4i

(12)

i=0

Fig. 1 depicts the Bit _Array[] as a list of holes and non-holes locations. The hole location is the list position whose entery

δ has not been hashed, location is set to 0. This position δ is the key to compute the actual length of the shortest absent words and then calculating them. The actual length γ is computed in terms of δ from Eq. (11) as follows.

δ = 4/3 ( 4γ − 1 ) + 1

(13)

γ = (log2 (3(δ − 1 )/4 + 1 )/2 ) + 1

(14)

Note that δ indicates the size of bit array starting at 0 and ending at δ − 1. Now, using γ we can compute the shortest absent words in the range determined by m = 4γ in Fig. 1. Hence, the absent words are identified by the holes across the bit array locations starting at location 4/3(4γ − 1 ) − m + 1 to location 4/3(4γ − 1 ). For efficiency, we do not need to return back to the staring position. The absent words are computed starting at location δ and ending at 4/3(4γ − 1 ). Within each encountered hole at index i, the reverse complement word rci is derived and thus the hash value h ¯ r = hash2 (rci ) is computed and tested. If the bit array location at index r is also a hole, the index i must be processed by doing the reverse hash of Eq. (12). Let i = hash(ω ), the absent word is printed letter by letter using the mask 3 and the shift right operation by 2. If the remainder (r = i & 3 ) equals 0, the program prints T or t and updates i = (i − 4 ) 2; else the program prints the corresponding DNA symbol and updates i = i 2. The & refers to the AND bitwise operation and represents the shift right bitwise operation. These two operations have to be repeated γ − 1 times within each hole. The proposed method is described by Algorithms 1–3. Considering both strands of DNA (see lines 10–15 of Algorithm 3), our algorithm first runs on the 5´ → 3´ strand and the reported absent words are checked at the end. If the reverse complement ¯ of the absent word ϖ is present in the strand 5´ → 3´ , this means that the word ϖ is also present in 3´ → 5´ strand, so it is discarded. If the word ϖ and its reverse complement ¯ are not present, the two words are reported as absent words. So, we do not need to check the whole second strand and no need to additional space as well. 2.4. Complexity analysis By analyzing the time and space complexity of the method, the total time complexity is O (n ), where n is the size of genome and the used memory is very small ( ≤ 10.67 MB) for very large genomes (3 GB). In Algorithm 1, lines 2–4

A.M. Al-Ssulami / Information Sciences 435 (2018) 59–68

63

Algorithm 1: The first scan to determine the upper bound length. Input: Genome S. Output: The computed length ϕ . 1 2 3 4 5 6 7 8 9 10 11 12 13

begin n ← |S| // costs O (1 ). l ← log2 (n + 1 )/2 τ ← l/2 temp[0 . 4τ − 1] ← 0 for i ← 0, 1, . . . , n − 1 do if ωi is valid and present then h ¯ ← hash1 (ωi ) //|ωi | = τ and hash1 is based the first scheme encoding. temp[ h ¯ ] ← temp[ h ¯]+1

← temp[0]

for i ← 1, 2, . . . , 4τ − 1 do if (temp[i] < ) then ← temp[i]

15

if > 1 then ϕ ← τ + log2 ( )/2

16

else

14

17

ϕ ←τ +

τ log16 (n+1 ) ) = cost O (1 ). The size of √ genome S in line 2 is computed in O (1 ). Line 5 of √ Algorithm 1 costs O (4 ) = O (4 O (4log16 (n+1 )+1 ) = O ( n + 1 ). Lines 6–9 cost O (n ) √ and lines 11–13 cost O ( n + 1 ) while the other lines cost O (1 ). Thus, the total time complexity of Algorithm 1 is O (n + n + 1 ) = O (n ). Considering Algorithm 2 , the total time complexity is O (n ). Line 4 of Algorithm 2 costs 2(4/3(4ϕ − 1 ) + 1 ) = O (4ϕ ) = O (4l ) = O (n ), where ϕ ≤ l. Lines 6–22 cost O (n ), let us consider the worst case when the locations of bit array are all non-holes except one is hole. The while loop condition (line 17 and line 27) of Algorithm 2 is satisfied 4/3(4ϕ − 1 ) times as a total considering the for loop (line 6) and this is equivalent to time complexity O (n ). When all locations become non-holes, the while loop condition is examined only once within each character being read. Therefore, the total time complexity becomes O (2n ) = O (n ). Considering Algorithm 3 , the worst case is that when 4γ − 1 locations are holes and only one location is non-hole. That is, printing the shortest absent words costs time complexity of O (γ 4γ ). All algorithms BIW, ESW, and SAW have this printing cost. Overall, the proposed method takes time complexity of O (n ) to count all shortest absent words and O (γ 4γ ) for output the absent words. The required space is very small because we use a bit array and its size determined by ϕ , where γ ≤ ϕ ≤ l. For all genomes that we have used in this study, we found that the computed length is very close to the actual length of absent words (ϕ − γ ≤ 2).

3. Experimental results The test is carried out on a Desktop PC with Intel® CoreTM i7-3770 x64-based CPU running at 3.40 GHz and 8 GB RAM under Windows 8.1 Pro. and the algorithm has been implemented using C# language. For the sake of fair, ESW and BIW were compiled and run under the same environment, Windows OS. The experiments are made on Homo sapiens (Human) genome (Assembly: GRCh38.p6), Pan troglodytes (Chimpanzee) (Assembly: Pan-troglodytes-2.1.4), Mus musculus (Mouse) (Assembly: GRCm38.p4), Drosophila melanogaster (Fruitfly) (Assembly: 6_plus_ISO1_MT), Caenorhabditis elegans (WBcel235), Neurospora crassa (Assembly: NC12) [7], Saccharomyces cerevisiae (Assembly: R64), and Thermococcus kodakarensis (Assembly: ASM996v1) [6] (see Table 1). Considering our approximation formula in Eq. (8) for computing the shortest length when β = 5, Table 2 lists the approximated length (L) to the actual length (γ ) and the allocated memory for the bit array by our method against ESW as the form (SAW/ESW). It has been noted that the actual length is approximated exactly on very large genomes, Homo sapiens (L = 11), Pan troglodytes (L = 11), and Mus musculus (L = 11). That is, the amount of required memory will be very small, 0.5 MB while the approximated length using the algorithm by Wu et al. [18] is 12 which means that the required memory is 2 MB. This is not the issue because this amount of memory is trivial. So, our focus is to design standalone software, albeit a little bit memory size ( ≤ 10.67 MB) is sacrificed. Table 4 lists the computed length (ϕ ) and the required memory for Bit _Array on different genomes. On human genome, the computed upper length is 13. That is, a bit array of size 10.67 MB is allocated. This upper bound length is safe and exactly greater than or equal to the actual length, (γ = 11 ) ≤ (ϕ = 13 ) ≤ (l = 16 ). The comparisons are presented in Table 5 where the results reveal that our SAW method is

64

A.M. Al-Ssulami / Information Sciences 435 (2018) 59–68

Algorithm 2: The second scan to discover the absent words. Input: Genome S. Output: Bit _Array[]. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

begin n ← |S | ϕ ← Algorithm 1 Bit _Array[0 . 4/3(4ϕ − 1 )] ← f alse // f alse refers to bit 0. Bit _Array[0] ← true // used as a sentinel. for i ← 0, 1, . . . , n − 1 do if (|ωi | ≥ ϕ ) then //ωi is valid and present. h ¯ ← hash2 (ωi ) //based on the second scheme encoding. r←h ¯ &3 if (r = 0 ) then r←4 h ¯ ←h ¯ −r h ¯ ←h ¯ 2 h ¯ ←h ¯ + Encode(Si )4ϕ −1 Sig ← h ¯ while (Bit _Array[Sig] is hole) do Bit _Array[Sig] ← true //make it non-hole. if (Sig & 3 = 0 ) then Sig ← (Sig 2 ) − 1 else Sig ← Sig 2

21 22

23 24 25 26 27 28 29 30 31 32

else

ρ ← |ωi |

h ¯ ←h ¯ + Encode(Si )4ρ // 4ρ must be computed in advance. Sig ← h ¯ while (Bit _Array[Sig] is hole) do Bit _Array[Sig] ← true //make it non-hole. if (Sig & 3 = 0 ) then Sig ← (Sig 2 ) − 1 else Sig ← Sig 2

about twice faster than the previous standalone method, BIW. For the correctness, SAW detected exactly the same absent words by BIW and ESW. For computing the minimal absent words, superset of shortest absent words with longer length, Herold et al. [11] suggested adding the shortest absent words to the genome sequences and run the program again. This solution does not consider the extensions of the shortest absent words as noted by Pinho et al. [15]. For this problem, we suggest a new solution to continue computing absent words of longer lengths by simply hashing the absent words of longer length. Let the shortest length is γ , then for every shortest absent word (hole) ϖ at index i , we must compute its 8 extensions (4 leftmost symbols + 4 rightmost symbols) and hashing these 8 generated words as non-holes locations. This step should be followed by continuous scanning the complete genome for hashing words of longer length (γ + 1). Consequently, the remaining holes locations will be the correct absent words of longer length (γ + 1). These steps must be repeated for longer lengths. By comparing the shortest absent words from the three very large genomes (Homo sapiens, Pan troglodytes, and Mus musculus), we found that Human and Chimpanzee genomes have 36 common absent words (see Table 6), 68 words are absent from Human and present in Chimpanzee, and 82 words are absent from Chimpanzee and present in Human genome. In addition, we have analyzed the absent words of the three genomes by computing the frequency of 1-mer, 2-mer, and 3-mer where we noted that the absent words tend to have similar frequency distribution with a difference in the number of occurrences (see Figs. 2 and 3). As has been noted from Fig. 2, the CG-dinucleotide is highly presented within the absent words of Human, Chimpanzee, and Mus musculus genomes with occurrences of 304, 354, and 568, respectively. In contrast, the CG-dinucleotide is highly presented in Human, Chimpanzee, and Mus musculus genomes with percentages of 28.18%,

A.M. Al-Ssulami / Information Sciences 435 (2018) 59–68

65

Algorithm 3: Printing the shortest absent words. Input: Bit _Array[] ← Algorithm 2. Output: Actual length γ and display shortest absent words. 1 2 3 4 5 6 7

begin i←0 while (Bit _Array[i] is non-hole) do i←i+1

δ←i γ ← (log2 (3(δ − 1 )/4 + 1 )/2 ) + 1 U p ← 4/3 ( 4 γ − 1 )

15

Count ← 0 for (i ← δ, δ + 1, . . . , U p) do if There is a hole at i then rci ← reverse complement of i h ¯ r ← hash2 (rci ) if There is a hole at h ¯ r then Compute the reverse hash of i // Print the absent word. C ount ← C ount + 1

16

Print the number of absent words Count.

8 9 10 11 12 13 14

Table 1 Genomes used in the test. Genome

GenBank accession

Size

l

Homo sapiens Pan troglodytes Mus musculus Drosophila melanogaster Caenorhabditis elegans Neurospora crassa Saccharomyces cerevisiae Thermococcus kodakarensis

GCA_0 0 0 0 01405.21 GCA_0 0 0 0 01515.4 GCA_0 0 0 0 01635.6 GCA_0 0 0 0 01215.4 GCA_0 0 0 0 02985.3 GCA_0 0 0182925.2 GCA_0 0 0146045.2 GCA_0 0 0 0 09965.1

2.91 GB 2.96 GB 2.56 GB 132 MB 96.8 MB 39.0 MB 11.6 MB 2.01 MB

16 16 16 14 14 13 12 11

Table 2 Approximated shortest length L using our Eq. (8) when β =5 and memory usage in MB for both Eq. (8) and the approximation method in [18]. Genome

Size (bp)

min(i ) (5-mer)

β

νβ = min(iβ )/Size

L

γ

Memory (MB) (Our/[18])

H. sapiens P. troglodytes M. musculus D. melanogaster C. elegans N. crassa S. cerevisiae T. kodakarensis

2,937,639,113 2,756,176,116 2,647,521,431 137,057,575 100,272,607 40,425,397 12,071,326 2,088,737

78,780 (cgacg) 71,800 (cgacg) 59,287 (tcgcg) 37,863 (cctag) 16,735 (gggcc) 18,262 (acgcg) 1,814 (gcgcg) 346 (ctagt)

2.68d-05 2.61d-05 2.24d-05 2.76d-04 1.67d-04 4.52d-04 1.50d-04 1.66d-04

11 11 11 12 11 12 10 9

11 11 11 11 10 11 9 8

0.5/2 0.5/2 0.5/2 2/0.5 0.5/0.5 2/0.5 0.125/0.03 0.03/0.03

Table 3 Memory consumption by temp[] with different input data sizes. Input data in GB

τ

Memory in MB

1 3 10 60 70 100

8 8 9 9 10 10

0.5 0.5 2 2 8 8

66

A.M. Al-Ssulami / Information Sciences 435 (2018) 59–68 Table 4 The computed length and the required memory in MB using SAW. Genome

ϕ

Bit array size (MB)

Homo sapiens Pan troglodytes Mus musculus Drosophila melanogaster Caenorhabditis elegans Neurospora crassa Saccharomyces cerevisiae Thermococcus kodakarensis

13 12 12 13 12 12 11 9

10.67 2.67 2.67 10.67 2.67 2.67 0.67 0.042

Table 5 Elapsed time in seconds, shortest length, and the number of shortest absents words for the two methods, SAW and BIW. Genome

Time (S) (SAW/BIW)

Length

Absent words

Homo sapiens Pan troglodytes Mus musculus Drosophila melanogaster Caenorhabditis elegans Neurospora crassa Saccharomyces cerevisiae Thermococcus kodakarensis

206.7/398.22 182.51/418.59 168.04/346.34 13.61/19.33 7.08/11.92 3.48/5.86 0.98/2.36 0.17/0.25

11/11 11/11 11/11 11/11 10/10 11/11 9/9 8/8

104/104 118/118 188/188 122/122 2/2 2010/2010 4/4 1/1

Table 6 Common shortest absent words from both Homo Sapiens and Pan troglodytes genomes. cgtcgctcgaa ccgacgatcga cgtcgttcgac cgcgtaatacg tcgcgtaatcg tacgcgattcg acgatcgtcgg cgacgaacggt

cgtccgtcgaa tatcgcgtcga ccgtcgaacgc tacgtcgagcg ccgatacgtcg acgaccgttcg tcgatcgtcgg

cgattacgcga tcgacgcgata gtcgaacgacg tcgattacgcg acggtacgtcg cgacgtaacgg cgacgtaccgt

cgattcggcga cgctcgacgta ttcgagcgacg cgtattacgcg ccgttacgtcg gcgttcgacgg ccgacgatcgt

cgcgtaatcga cgaatcgcgta ttcgacggacg tcgccgaatcg accgttcgtcg cgacgtatcgg cgaacggtcgt

Fig. 2. The 1-mers and 2-mers frequency distribution of the absent words.

28.33%, and 12.31%, respectively. It is worth to note that almost all the absent words of the three genomes consisting of 3 CG-dinucleotides. In summary, the proposed method is purely linear, efficient, and easy to use. It discovers complete lists of absent words in both strands of DNA simultaneously, so it can be used in genome comparisons and various applications.

A.M. Al-Ssulami / Information Sciences 435 (2018) 59–68

67

Fig. 3. The 3-mers frequency distribution of the absent words.

4. Conclusion A linear and standalone method is presented to identify the shortest absent words from very large and complete genomes. Our method scans the input genome only twice for computing the safe upper bound for the shortest length and for identifying the absent words using a bit array of small size. The new method, SAW, is rigorous, fast, simple, and easy to use. It produces a complete list of absent words from both strands of genomes with no need to manually entering the upper bound length. Moreover, we have presented a stochastic formula derived using Poisson distribution to estimate the shortest length of the absent words in time complexity of O (1 ) instead of the previous one that requires a time complexity of O (γ 3 ), where γ is the length of shortest absent words. Also, we have analyzed the shortest absent words of three very large genomes, Human, Chimpanzee, and Mouse and we observed that the absent words have similar frequency distribution with a difference in the number of occurrences. We also observed that the CG-dinucleotides occur almost three times in the shortest absent words. Our next work is adapting SAW wisely to compute the minimal absent words of longer lengths using Encoded Expansion technique in [3]. Availability and requirements The SAW software is publicly available for non-commercial purposes at https://github.com/Abdulrakeeb/SAW.git. SAW is Windows based and command-line program. To run SAW, both binary file and the Directory containing fasta files must be made in the same path. References [1] C. Acquisti, G. Poste, D. Curtiss, S. Kumar, Nullomers: really a matter of natural selection? PLoS One 2 (10) (2007) e1022, doi:10.1371/journal.pone. 0 0 01022. [2] A. Alileche, J. Goswami, W. Bourland, M. Davis, G. Hampikian, Nullomer derived anticancer peptides (nullops): differential lethal effects on normal and cancer cells in vitro, Peptides 38 (2) (2012) 302–311. https://doi.org/10.1016/j.peptides.2012.09.015. [3] A.M. Azmi, A. Al-Ssulami, Encoded expansion: an efficient algorithm to discover identical string motifs, PLoS One 9 (5) (2014) 1–9, doi:10.1371/journal. pone.0095148. [4] C. Barton, A. Heliou, L. Mouchard, S.P. Pissis, Linear-time computation of minimal absent words using suffix array, BMC Bioinform. 15 (1) (2014) 1–10, doi:10.1186/s12859- 014- 0388- 9. [5] S. Chairungsee, M. Crochemore, Using minimal absent words to build phylogeny, Theor. Comput. Sci. 450 (2012) 109–116. https://doi.org/10.1016/j.tcs. 2012.04.031. [6] T. Fukui, H. Atomi, T. Kanai, R. Matsumi, S. Fujiwara, T. Imanaka, Complete genome sequence of the hyperthermophilic archaeon thermococcus kodakaraensis kod1 and comparison with pyrococcus genomes, Genome Res. 15 (3) (2005) 352–363. [7] J.E. Galagan, et al., The genome sequence of the filamentous fungus neurospora crassa, Nature 422 (6934) (2003) 859–868. [8] J. Goswami, M.C. Davis, T. Andersen, A. Alileche, G. Hampikian, Safeguarding forensic {DNA} reference samples with nullomer barcodes, J. Forensic Legal Med. 20 (5) (2013) 513–519. https://doi.org/10.1016/j.jflm.2013.02.003.

68

A.M. Al-Ssulami / Information Sciences 435 (2018) 59–68

[9] G. Hampikian, T. Andersen, Absent sequences: nullomers and primes, in: Proceedings of the Pacific Symposium on Biocomputing 2007, World Scientific Publishing, Singapore, Maui, Hawaii, 2007, pp. 355–366. [10] B. Haubold, N. Pierstorff, F. Möller, T. Wiehe, Genome comparison without alignment using shortest unique substrings, BMC Bioinform. 6 (1) (2005) 1–11, doi:10.1186/1471-2105- 6- 123. [11] J. Herold, S. Kurtz, R. Giegerich, Efficient computation of absent words in genomic sequences, BMC Bioinform. 9 (1) (2008) 1–9, doi:10.1186/ 1471-2105-9-167. [12] R.M. Karp, M.O. Rabin, Efficient randomized pattern-matching algorithms, IBM J. Res. Dev. 31 (2) (1987) 249–260, doi:10.1147/rd.312.0249. [13] Y. Lianping, Z. Weilin, A multiresolution graphical representation for similarity relationship and multiresolution clustering for biological sequences, J. Comput. Biol. 24 (4) (2017) 299–310. https://doi.org/10.1089/cmb.2016.0030. [14] Y. Lianping, Z. Xiangde, W. Tianming, Z. Hegui, Large local analysis of the unaligned genome and its application, J. Comput. Biol. 20 (1) (2013) 19–29. https://doi.org/10.1089/cmb.2011.0052. [15] A.J. Pinho, P.J. Ferreira, S.P. Garcia, J.M. Rodrigues, On finding minimal absent words, BMC Bioinform. 10 (1) (2009) 1–11, doi:10.1186/1471-2105- 10- 137. [16] S. Rahmann, E. Rivals, On the distribution of the number of missing words in random texts, Comb. Probab. Comput. 12 (2003) 73–87, doi:10.1017/ S0963548302005473. [17] R.M. Silva, D. Pratas, L. Castro, A.J. Pinho, P.J.S.G. Ferreira, Three minimal sequences found in ebola virus genomes and absent from human dna, Bioinformatics 31 (15) (2015) 2421–2425, doi:10.1093/bioinformatics/btv189. [18] Z.-D. Wu, T. Jiang, W.-J. Su, Efficient computation of shortest absent words in a genomic sequence, Inf. Process. Lett. 110 (14–15) (2010) 596–601. https://doi.org/10.1016/j.ipl.2010.05.008.