proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation

proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation

Accepted Manuscript Whole genome/proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation...

1MB Sizes 6 Downloads 35 Views

Accepted Manuscript Whole genome/proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation Wei-Feng Yang, Zu-Guo Yu, Vo Anh PII: DOI: Reference:

S1055-7903(15)00390-5 http://dx.doi.org/10.1016/j.ympev.2015.12.011 YMPEV 5383

To appear in:

Molecular Phylogenetics and Evolution

Received Date: Revised Date: Accepted Date:

2 May 2015 17 December 2015 18 December 2015

Please cite this article as: Yang, W-F., Yu, Z-G., Anh, V., Whole genome/proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation, Molecular Phylogenetics and Evolution (2015), doi: http://dx.doi.org/10.1016/j.ympev.2015.12.011

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Whole genome/proteome based phylogeny reconstruction for prokaryotes using higher order Markov model and chaos game representation Wei-Feng Yang1, 2, Zu-Guo Yu 1, 3,* and Vo Anh 3 1

Hunan Key Laboratory for Computation and Simulation in Science and Engineering and Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, Hunan 411105, P.R. China.

2

Department of Mathematics and Physics, Hunan Institute of Engineering, Hunan 411104, P. R. China.

3

School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.

*Corresponding author Email addresses: Zu-Guo Yu: [email protected] Wei-Feng Yang: [email protected] Vo Anh: [email protected]

Abstract Traditional methods for sequence comparison and phylogeny reconstruction rely on pair wise and multiple sequence alignments. But alignment could not be directly applied to whole genome/proteome comparison and phylogenomic studies due to their high computational complexity. Hence alignment-free methods became popular in recent years. Here we propose a fast alignment-free method for whole genome/proteome comparison and phylogeny reconstruction using higher order Markov model and chaos game representation. In the present method, we use the transition matrices of higher order Markov models to characterize amino acid or DNA sequences for their comparison. The order of the Markov model is uniquely identified by maximizing the average Shannon entropy of conditional probability distributions. Using one-dimensional chaos game representation and linked list, this method can reduce large memory and time consumption which is due to the large-scale conditional probability distributions. To illustrate the effectiveness of our method, we employ it for fast phylogeny reconstruction based on genome/proteome sequences of two species data sets used in previous published papers. Our results demonstrate that the present method is useful and efficient. Availability and implementation: The source codes for our algorithm to get the distance matrix and genome/proteome sequences can be downloaded from ftp://121.199.20.25/. The software Phylip and EvolView we used to construct phylogenetic trees can be referred from their websites. Keywords Alignment-free whole proteome comparison; higher order Markov model; chaos game representation; Shannon entropy; phylogenetic tree.

1.

Introduction Sequence comparison and phylogeny reconstruction are fundamental problems in

bioinformatics. Recent developments of the nucleotide and protein sequencing technology have resulted in a fast growth of available genomic/proteomic data, which

makes phylogeny reconstruction at the whole genome/proteome scale feasible. Traditional methods for sequence comparison and phylogeny reconstruction rely on pair wise and multiple sequence alignments. However, because different genes have different evolutionary rates, sequence alignment based on different genes may lead to inconsistent results (Wong et al., 2008). Evolutionary events such as genomic rearrangements and, in particular, genetic shuffling also make direct sequence alignment unreliable. Moreover, at the whole genome scale, the high computational complexity renders these approaches based on multiple sequence alignments impractical (Wong et al., 2008). Consequently, alignment-free methods have been proposed for genome/proteome analysis and phylogeny reconstruction by utilizing the ever-increasing genome and proteome data in recent years (Vinga and Almeida 2003; Sims et al., 2009). Among these approaches: the compression based methods (Chen et al., 2000) generally regard the whole genomes as plain text, and define the similarity between two genomes as the relative compression ratio; the gene content based methods (Snel et al., 1999; Herniou et al., 2001; House and Fitz-Gibbon, 2002) mainly concentrate on the portion of homologous genes shared by multiple genomes, while the gene order based methods (Sankoff et al., 1992; Moret et al., 2001; Belda et al., 2005; Lin and Moret, 2011; Shifman et al., 2014) are concerned with the order of genes shared by multiple genomes; the methods based on string matching statistics, including average common substring (ACS) (Ulitsky et al., 2006), the Kr method (Haubold et al., 2009; Domazet-Loso et al., 2009) and its modification (Haubold et al., 2009; Domazet-Loso et al., 2009; Leimeister et al., 2014b) consider the longest common substrings or the shortest unique substring starting at each site of a sequence, and define a distance measure from the average length of these substrings; the word based methods (Karlin and Burge, 1995; Li et al., 2002; Stuart et al., 2002a,b, 2004; Hao and Qi, 2003; Qi et al., 2004a,b; Yu et al., 2005, 2010a; Wang et al., 2005; Wu et al., 2007; Sims et al., 2009; Wu et al., 2009; Xu and Hao, 2009; Jun et al., 2010; Hatje and Kollmar, 2012; Leimeister et al., 2014a) commonly characterize the whole genome/proteome with frequencies of k-length words. Afreixo et al. (2009, 2011) explored the inter-nucleotide distance method and the distance to the nearest

dissimilar nucleotide to construct phylogenetic trees using genomic data. Based on the inter-nucleotide distance sequence, Chang and Wang (2011) proposed the conditional multinomial distribution profile for phylogeny on the genomic level. Our group (Xie et al., 2015) proposed to use inter-amino-acid distances and the conditional geometric distribution profiles (IAGDP) for whole-proteome based phylogenetic tree construction. These alignment-free methods can be arranged into two classes: (i) resolution-free methods such as ACS (Ulitsky et al., 2006) and Kr (Haubold et al., 2009; Domazet-Loso et al., 2009), where no parameter value choice is necessary. (ii) Methods having some free parameters such as CVTree (Qi et al., 2004a, Xu and Hao, 2009), Frequency chaos game representation (FCGR) (Wang et al., 2005; Hatje and Kollmar, 2012), feature frequency profiles (FFP) (Sims et al., 2009), spaced-word (Leimeister et al., 2014a) and IAGDP (Xie et al., 2015), the utility of which may depend on the parameter value choice. For example, the words length k is a critical free parameter in the utility of the CVTree, FCGR and FFP; profile length K is a free parameter in IAGDP; the weight k and length l of spaced words are free parameters in using spaced-word. In practice, parameter values are generally determined from hindsight in using these methods. Higher order Markov models characterized by their transition matrices are widely used in biological sequence analyses, including identification of motifs (D’haeseleer, 2006; Thijs et al., 2001, 2002), CpG islands (Durbin et al., 1998), exons (Burge and Karlin, 1997; Spies et al., 2009), genes (Lomsadze et al., 2005; Salzberg et al., 1998, 1999), regulatory elements (Carlson et al., 2006; Lin et al., 2008), RNA structures (Hansen, 2009), genome segmentation (Churchill, 1989,1992; Thakur et al., 2007) and especially nucleotide and amino acid substitution (Goldman, 1993; Adachi and Hasegawa, 1996; Rambaut and Grass, 1997; Eddy, 1998; Whelan and Goldman, 1999), which is commonly used in DNA and protein sequence alignments. Some word based alignment-free methods such as CVTree are also derived from higher order Markov models. The approach described by Narlikar et al. (2013), which employs higher order Markov models with BIC-predicted optimal (BPO) order, is successful in

constructing mammalian genomic trees. However it does not produce satisfactory results on many testing data sets used for other alignment-free methods. In the present paper, we propose a new aligment-free method which uses the transition matrix of a higher order Markov model as feature vector to characterize a whole genome/proteome sequence, and then construct a distance measure based on the angle between two feature vectors. Using this distance measure, we then reconstruct distance-based phylogenetic trees with the neighbor-joining algorithm (Saitou and Nei, 1987). A critical issue in using Markov models is the determination of their order (Narlikar et al., 2013). In application in biological sequence comparison, simpler low-order Markov models tend to leave out some essential features in the sequence data, and higher order models enable more features to be captured. On the other hand, in addition to time-consuming, too high an order often results in over-fitting of the sequence data. For this reason, the optimal order of the Markov model is essential. We propose to identify this order by maximizing the Shannon entropy (see next section) rather than the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) (Narlikar et al., 2013). To reduce memory and time consumption, we also propose to combine linked list with one-dimensional chaos game representation of symbolic sequences (Gutierrez et al., 2001). To illustrate the effectiveness of our method, we employ it for fast phylogeny reconstruction on two data sets used in previous published papers.

2.

Materials and Methods

2.1 Data sets We applied our method for phylogeny reconstruction on whole genome and proteome sequences of two species data sets, respectively. Species Data set 1: 172 Bacilli species (Collins and Higgs, 2012). Collins and Higgs (2012) used 172 Bacilli species to analyze the evolution properties of the pangenomes and core genomes. According to the taxonomy based on the classification of the NCBI, the population of 172 Bacilli consists of 2 orders (Bacillales and Lactobacillales), 10 families (Alicyclobacillaceae, Paenibacillaceae,

Bacillaceae, Staphylococcaceae, Listeriaceae, Leuconostocaceae, Lactobacillaceae, Streptococcaceae, Enterococcaceae, Bacillales Family XII.Incertae Sedis) and 20 genera

(Alicyclobacillus,

Lysinibacillus, Paenibacillus,

Kyrpidia,

Oceanobacillus, Macrococcus,

Anoxybacillus,

Exiguobacterium, Staphylococcus,

Bacillus,

Geobacillus,

Listeria,

Brevibacillus,

Enterococcus,

Lactobacillus,

Pediococcus, Leuconostoc, Oenococcus, Lactococcus, Streptococcus). Species Data set 2: 109 species of prokaryotes and eukaryotes (Qi et al., 2004a; Yu et al., 2005) . According to the reference taxonomy in NCBI, these 109 species are composed of 6 eukaryotes (including 3 Fungi, and those 3 from the Plantae, Animalia and Chromalveolata kingdoms repectively),

16 Archaea (including

4

Archaea

Crenarchaeota, and 12 Archaea Euryarchaeota) and 87 Bacteria (including 5 Tenericutes, 3 Spirochaetae, 37 Proteobacteria, 22 Firmicutes, 3 Cyanobacteria, 5 Chlamydiae, 7 Actinobacteria, 5 from phylum Thermotogae, Fusobacteria, Deinococcus-Thermus, Chlorobi and Aquificae respectively). All genome sequences and translated proteome sequences were obtained from the NCBI Genomes database; they were completely sequenced and were not further edited. The 172 Bacilli species, 109 species of prokaryotes and eukaryotes are listed in Tables S1 and S2 in Supplementary material, respectively.

2.2 Higher order Markov model For characterizing a sequence S over an alphabet A , a Markov model of order

k denoted by M k , is represented in the form of a mk  m transition matrix, i. e., a non-negative matrix of P(b | b1b2

bk ) , where b, b1 , b2 ,

| A | —the size of alphabet A , and P(b | b1b2

of a k -length subsequence b1b2

, bk  A , m takes value

bk ) means the conditional probability

bk followed by b . In the case of k  0 , the

Markov model M 0 corresponds to the aggregate probabilities of occurrence of the m different symbols in A . Each row of the transition matrix is a conditional

probability distribution, hence the sum of each row is 1. Obviously for nucleotide sequences and amino acid sequences, m takes the value 4 and 20 respectively. The critical task of applying higher order Markov models is to estimate the order and the large scale conditional probabilities. The maximum likelihood estimate Pˆ (b | b1b2

P(b | b1b2

bk ) of the conditional probability

bk ) can be computed as Pˆ (b | b1b2

where n(b1b2

bk ) 

n(b1b2 bk b) ,  aA n(b1b2 bk a)

bk a) is the count of the words b1b2

(1)

bk a in the considered

sequence. For the aim of whole proteome based phylogeny reconstruction, we should capture the feature information as much as possible in the sequence—this is the principle of information maximization. In information theory, the Shannon entropy is a measure of the volume of the information. We specify the order of Markov models by maximizing the Shannon entropy of conditional probability distributions, rather than AIC or BIC. For a discrete probability distribution P  ( p1 , p2 , ) , the Shannon entropy H ( P) is defined as H ( P)  i pi log pi . For the k -order Markov model

M k mentioned above, the corresponding Shannon entropy MH k is defined as (Shannon, 2001; Strait and Dewey, 1996): M Hk     P ( )b P ( b1| 2b b k )bl o g P(1 b2| b k b, bi  A b

b)

(2)

A

where P(b) is the probability of finding the symbol b in the considered sequence. The order k of Markov models is selected by maximizing MH k , i.e.

kˆ  argk max MH k  . In order to compare the sequences in the set   S1 , S2 ,

(3)

, S N  , we use Markov

models with a same order k to characterize all sequences in  . In order to synchronously capture the feature information as much as possible in ∑, the same

order k is uniquely specified by maximizing the average Markov-Shannon entropy (MME) as follows:





Kˆ  argk max MH k ,

(4)

1 N  MH k (i) . N i 1

(5)

where MH k 

Fig. 1 (a) shows the curve of average Shannon entropy MH k (red) and the curves of Shannon entropy MH k curves for 172 proteomes of Species Data set 1 (172 bacilli); Fig. 1 (b) shows the curve of average Shannon entropy MH k (red) and the curves of Shannon entropy MH k for 109 proteomes of Species Data set 2 (109 species including prokaryotes and eukaryotes). The orders of higher order Markov model for the two data sets identified by MME, AIC and BIC are illustrated in Table 1. Table 1. The orders of Markov model for four data sets identified by MME, AIC and BIC. Data set

AIC

BIC

MME

172 proteomes 109 proteomes 172 genomes 109 genomes

4 4 11 10

3 3 8 7

8 12 13 30

2.3 One-dimensional chaos game representation Chaos game representation (CGR) has been proposed to visually reveal patterns within nucleotide sequences (Jeffrey et al., 1990). It was also applied in alignment-free comparisons (Almeida et al., 2001) and alignment-based comparisons (Joseph and Sasikumar, 2006) of genomic sequences. The CGR is a bijection from a nucleotide sequence to a unit square scatter plot, whose vertices are the four assigned nucleotides: PA   0,0  , PC   0,1 , PG  1,1 , PT  1,0  . The point coordinates corresponding to each nucleotide in the sequence is produced by the iterated function system:

CGRi  0.5  CGRi 1  Pi  , CGR0   0.5,0.5 . The one-dimensional chaos game representation (Gutierrez et al., 2001) maps a symbolic sequence over any alphabet A to a numeric sequence in the unit interval [0, 1] via the iterated function system:

CGRi   CGRi 1  Pi  / m, CGR0  0.5 , where m | A | , CGRi is the coordinate value in interval [0, 1] and Pi is the order number in alphabet A (in the case of nucleotide sequence, the order number of A,C ,G and T can be set as 0, 1, 2 and 3 respectively). For example, the first exon (92 bases) of human beta-globin gene is listed as follows: ATG GTG CAT CTG ACT CCT GAG GAG AAG TCT GCC GTT ACT GCC CTG TGG GGC AAG GTG AAC GTG GAT GAA GTT GGT GGT GAG GCC CTG GGC AG. The one-dimensional chaos game representation of this sequence is 0.125, 0.78125, 0.6953125, 0.67382813, 0.91845703, ⋯. The spectrum-like graph of one-dimensional chaos game representation of human beta globin gene is illustrated in Fig. 2. The amino acid sequence translated from the first exon of human beta-globin gene is MVHLTPEEKSAVTALWGKVNVDEVGGEALG. The one-dimensional chaos game representation of this sequence is 0.525, 0.87625, 0.3438125, 0.46719063, 0.82335953, ··· Apart from visualizing a biological sequence, the one-dimensional chaos game representation inherits all the properties of the sequence. In the case of nucleotide sequence, the unit interval [0, 1] is subdivided into 4 sub-intervals, each sub-interval corresponding to a particular nucleotide. Recursively, each sub-interval is subdivided into 4 sub-sub-intervals each corresponding to a particular dinucleotide, and so on (similar idea was also used by Yu et al. (2001)). This is illustrated in Fig. 3 (a). Each point/number in the unit interval embodies the structure of the suffix of the nucleotide sequence that ends with the base corresponding to the point. This property can be reflected by the inversion formula:

CGR i 1  m C GiR ,i P

(6)

To be specific, from CGRi , the prefix Si = s1 s2

si can be identified by using

formula (6) as follows: we can specify the symbol si from the integer  mCGRi  , where,

and

henceforth,



denotes

the

integer

ceil

operation.

Then CGRi 1 = mCGRi  Pi , where Pi  mCGRi   1. Recursively, we can specify the symbol si 1 from the value mCGRi 1 , and so on. In practice, each k -length words is mapped into one of the m k sub-intervals with width m k . The integer  mk CGRi  , called structural index, indicates the structure of the k -length words ending with si . In fact, the structural index of each k-length word depends on the way to sort all symbols in alphabet A . Here, we use the lexicographic order. For example, the structural index of the first 3-length words ATG in the above mentioned human beta-globin gene is 43  0.6953125  45 . All structural indexes of 3-length words in nucleotide sequence are shown in Fig. 3 (b). The situation for amino acid sequences is similar. The chaos game representation maps a k-length word one-to-one to a number, and the comparison for numbers is faster than that for strings in computer. Here, we use the chaos game representation to convert a k-length word one-to-one to an integer—structural index. Because we employ linked list (see section 2.4) to store the count (or frequency) of actual emerging k -length words in a considered sequence, when the program reads sequence orderly, the structural index of the corresponding

k -length words can reveal the position in the linked list. This will reduce the time consumption. We employ one-dimensional chaos game representation for amino acid sequences, thus turn the comparison of strings into that of integers. Combined with linked list, it can accelerate the process of identifying the optimal order and generating the transition matrix of a higher order Markov model and then the distance matrix.

2.4 Linked list application In applications of higher order Markov models to characterize sequences, the large-scale sets of conditional probabilities P(b | b1b2 words b1b2

bk ) and the count of the

bk b need to be recorded. For the k -order Markov model M k of a

nucleotide or amino acid sequence, there are 4k 1 and 20k 1 different words need to be handled respectively. This situation also occurs in using CVTree and FFP. Due to the memory limit of personal computers, the length of words that CVTree can deal with is less than 8 for amino acid sequence and 16 for nucleotide sequence (as mentioned in CVTree v4.0), and FFP involves some measures for feature filtering. For any k , the number of different words b1b2

bk b has a maximum value less than

the length of the considered proteome sequence, and only a fraction of all theoretical word compositions b1b2

bk b actually emerges in the sequence. Moreover, the

proportion of theoretical word compositions that actually emerges in the sequence drop rapidly to zero when k increases. Fig. 4 illustrates that the proportion decrease with the order k for all sequences of 172 proteomes and 109 proteomes, respectively. In view of this, we use the well known technique of linked list to record counts of all actual emerging words in the considered sequence, where the structure information of each word is turned into integer by one-dimensional chaos game representation. By this way, we can solve the problem of memory limit of personal computer for all sequenced prokaryote genomes and proteomes. In order to estimate the conditional probability P(b | b1b2

bk ) by using formula (1), we employ m different linked

lists to record m different classes of (k  1) -length words classified by the ending symbol. All nodes are sorted by the structural index. For the case of nucleotide sequences, the distributions of all 3-length words in the 4 different linked lists are illustrated in Fig. 3 (b) as an example. It is also consistent with the one-dimensional chaos game representation. This arrangement is beneficial to fast computation of the conditional probability Pˆ (b | b1b2

bk ) .

Remark 1: Actually the size of each sequenced prokaryote genomes is less than 50 megabase (https://gold.jgi-psf.org/), and the number of different (k+1)-length words in a genome with size 50 megabase (also the number of integers--- structural index) is no more than 50 millions. Using 8 byte (64 bit) of memory to store a structural index, which is enough to handle 30-order Markov models for genome and 13-order Markov models for proteome, thus the memory consumption is no more than 400 megabyte. So a general personal computer can meet such condition.

2.5

Distance measure For two different genomes/proteome sequences S1 and S 2 , characterized by two

Markov models with the same order k , we can describe them with vectors V1 =  p1 , p2 ,

, pN 

and

V2 =  q1 , q2 ,

, qN 

in

the

N -dimensional

space,

where N  20k 1 , pi and qi are conditional probabilities. The normalized angle between vectors V1 and V2 is evaluated as Dangle V1 ,V2



1



arccos

 pi qi 1 2 2

1 2 2

 p   q  i

.

(7)

i

The normalized angle between vectors in Euclidean space is a distance measure (Stuart et al., 2002b) and is used as the evolutionary distance between two corresponding genomes/proteomes in the present approach. The chord distance (Edwards, 1971; Yu et al., 2010b), defined as Dchord V1 ,V2



1 (1  2

 pi qi 1 2 2

1 2 2

 p   q  i

),

(8)

i

is an alternative distance measure and yields the same topology trees on the data sets used in our work.

2.6

Tree construction and comparison From the distance matrices, we used the neighbor-joining algorithm (Saitou and

Nei,

1987)

(implemented

in the

software

package

PHYLIP3.695,

http://

evolution.genetics.washington.edu/phylip.html) to reconstruct phylogenetic trees. In addition, we apply the online tool EvolView (Zhang et al., 2012) to visualize and annotate phylogenetic trees. The normal way to evaluate the accuracy of a tree reconstruction method is to measure the distance between the reconstructed tree and a reference tree. There are several approaches to measure similarity between phylogenetic trees. We chose the Robinson-Foulds distance (Robinson and Foulds, 1981) as tree metric, which is implemented in the program treedist contained in the PHYLIP package (Felsenstein, 1989). We used the NCBI taxonomy common tree (del Campo, et al., 2013;Wang et al., 2015) as reference trees for the two species sets.

3.

Results and discussion Species Data set 1 was used in Collins and Higgs (2012) to analyze the evolution

properties of the pangenomes and core genomes. A phylogenetic tree (Fig. 5), which was

downloaded

from

TreeBASE

(Sanderson

et

al.,

1994)

(http://purl.org/phylo/treebase/phylows/study/TB2:S11647), was reconstructed via concatenated alignment translated amino acid sequences of 55 single-copy core genes (Collins and Higgs, 2012), while the grouping of this phylogenetic tree does not agree with the reference taxonomy at the family level. Especially, the family Lactobacillaceae was separated into two subgroups by the family Leuconostocaceae, the family Bacillaceae was split into four subgroups, the family Paenibacillaceae was split into two subgroups. For the 172 proteomes of Species Data set 1. From formulas (2) and (5), we calculated the Shannon entropies MH k and the average Shannon entropies MH k for k  0,1, 2,

,13 , these curves are illustrated in Fig. 1. Subsequently, the optimal

order of the Markov model on this proteome set was uniquely specified as k  8 by using formula (4).

Employing the conditional probability vectors of a 8-order

Markov models as feature vectors and the normalized angle between vectors as distance measure, the distance matrix was obtained, and a distance based phylogenetic

tree was reconstructed, which is shown in Fig. 6. In the resulted tree, these 2 orders and 10 families are clearly separated into a monophyletic group, which agrees with the reference taxonomy. At the genus level, except Bacillus and Lactobacillus, the remaining 18 genera collect together into a monophyletic group. The Bacillus genus is separated by four genera including Geobacillus, Lysinibacillus, Oceanobacillus and Anoxybacillus; this event is also presented in the resulted tree of Collins and Higgs (2012). The Lactobacillus is separated by genus Pediococcus. Another merit of the resulted tree by our method is that the phylogenetic relationship among the 10 families is consistent with that in Bergey’s Manual of Systematic Bacteriology (Ludwig et al., 2009). For 109 proteomes of Species Data set 2 (Qi et al., 2004a; Yu et al., 2005). With a similar process to handle the 172 proteomes, we calculated the Shannon entropies

MH k and the average Shannon entropies MH k for k  0,1, 2,

,13 . The optimal

order of the Markov model was identified as k  12 . We used the conditional probability vectors of the 12-order Markov models to characterize these 109 proteome sequences. Finally, a distance based phylogenetic tree was reconstructed (Fig. 7). Archaea, Bacteria and Eukaryota of the 109 organisms are clearly separated. Except 3 Spirochaetae, all organisms are well grouped into a monophyletic group at the Phylum level, 3 Fungi group well together. Except the 22 Gammaproteobacteria, all considered organisms collect well into a monophyletic group at the class level. The 22 Gammaproteobacteria are separated into two subgroups by 3 Betaproteobacteria. These events are also presented in the resulted trees obtained by other methods (Qi et al., 2004a; Yu et al., 2005; Jun et al., 2010). Moreover, almost all groups are monophyletic at the taxonomic level of class, order, family and genus. In the resulted tree of Qi et al. (2004a), the 3 Spirochaetes, 3 Fungi and 3 Betaproteobacteria were not grouped together, respectively, while the 22 Gammaproteobacteria are separated into three subgroups by 3 Betaproteobacteria. We also tested the applicability of the present method on whole genome based phylogeny reconstruction. To compare the phylogenetic information of whole

genomes and whole proteomes, we used whole genomes and whole proteomes from the same two species data sets. Moreover, we run the new method in two ways including auto-order (the order identified by MME, AIC and BIC) and fixed-order. In order to evaluate the effectiveness of our alignment-free approach, we calculated Robinson-Foulds distance from resulted trees constructed by our new method, FFP, CVTree (Xu and Hao, 2009) and concatenated alignment (Collins and Higgs, 2012) to the reference trees on these sequencedata sets, respectively. Since the parameter value is not specified in FFP and CVTree, we used their optimal parameter values for these sequences data sets. The results are shown in Figs. 8 and 9, and Table 2. Except “alignment tree” which corresponds to that using concatenated alignment in (Collins and Higgs, 2012), the green part of Figs. 8 and 9 shows the Robinson-Foulds distance from trees reconstructed based on whole proteomes, while the blue part shows the Robinson-Foulds distance from trees reconstructed based on whole genomes. From the point of view on tree topology, the results show that: (1) The order obtained by MME is better than that obtained by AIC and BIC (see Table 1, Figs. 8 and 9). (2) The order obtained by MME is the optimal order for whole proteome based phylogeny reconstruction, while it not always the optimal order for whole genome based phylogeny reconstruction. (3) Using the three methods, the proteome based phylogeny is better than the whole genome based phylogeny. (4) For whole proteome based phylogeny on the two species data sets, our new method is better than FFP, CVTree. Even it is better than concatenated alignment based method on the 172 proteomes of Species Data set 1. Table 2. The time consumed, parameter selected and the Robinson-Foulds distance (RF) from the obtained tree to the reference tree. Methods HOMCGR HOMCGR*

CVTree FFP

172 proteomes

109 proteomes

172 genomes

109 genomes

RF

time

k/l

RF

time

k/l

RF

time

k/l

RF

time

k/l

129 129 129 157

3108 336 276 223

8 8 5 7

53 53 63 121

2708 207 456 317

12 12 6 7

147 139 141 143

10312 1088 674 246050

13 16 11 13

79 79 95 117

6874 553 304 323

30 30 11 9

Where HOMCGR indicates our new method with the order auto-identified by MME, and HOMCGR* indicates our new method with the optimal fixed-order. The results came from the three programs run in the same computer server. The time unit is second.

Remark 2: Using our present method, from the step of identifying the unique parameter to the step of calculating the final distance matrix, the program “homcgr” can run in a personal computer (The single core processor: Intel (R) Celeron(R) CPU 1.88GHz, Memory 1.99GB). The time consumed is less than 6 hours for each kinds of data. When the unique parameter is given, the time consumed is less than 30 minutes. 4. Conclusions Higher order Markov models have been widely used in biological sequence analyses; however, it is rarely applied in alignment-free phylogenetic tree reconstruction. In this paper, we presented a novel application of higher order Markov models for alignment-free phylogenetic tree reconstruction based on whole proteome sequences. A critical issue in applications of these models is the identification of their order. We illustrated a new method of identifying the optimal order of Markov models for whole genome/proteome comparison. A challenging problem faced by all word based methods is large memory consumption due to large size sets of different long words. Any strategy to decrease the dimension of feature vectors causes the loss of specific individual feature information. By using one-dimensional chaos game representation and linked list, this approach can reduce large memory consumption significantly, while specific individual feature information is completely preserved. Our method has been tested on whole genomes/proteomes of two species data sets used by other researchers. For these test data sets, using whole proteome sequences, the reconstructed trees by our method agree with the reference taxonomy at the domain, phylum, class, order and family levels. Moreover, for these test data sets, from the point of view on tree topology, our new method is better than FFP, CVTree. Thus, our present method is useful and efficient for sequence comparison and phylogenetic tree reconstruction based whole genome/proteome data of prokaryotes.

Acknowledgements We thank the two anonymous referees for their constructive suggestions and

critical comments, which helped to improve the quality of the manuscript. We also would like to express our thanks to software engineer Li Xiao, Dr. Guoshen Han (Xiangtan University, P.R. China) and Zilong He (Beijing Institute of Genomics, Chinese Academy of Sciences) for helpful discussion. This work is supported by the National Natural Science Foundation of China (Grant No. 11371016), the Chinese Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) (Grant No. IRT1179), the Lotus Scholars Program of Hunan province of China.

References Afreixo, V., Bastos, C. A., Pinho, A. J., Garcia, S. P., & Ferreira, P. J. (2009). Genome analysis with inter-nucleotide distances. Bioinformatics, 25(23), 3064-3070. Afreixo, V., Bastos, C. A., Pinho, A. J., Garcia, S. P., & Ferreira, P. J. (2011). Genome analysis with distance to the nearest dissimilar nucleotide. J. Theor. Biol., 275(1), 52-58. Adachi, J., & Hasegawa, M. (1996). Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol.,42(4), 459-468. Almeida, J. S., Carrico, J. A., Maretzek, A., Noble, P. A., & Fletcher, M. (2001). Analysis

of

genomic

sequences

by

Chaos

Game

Representation.

Bioinformatics, 17(5), 429-437. Belda, E., Moya, A., & Silva, F. J. (2005). Genome rearrangement distances and gene order phylogeny in γ-proteobacteria. Mol. Biol. Evol., 22(6), 1456-1467. Burge, C., & Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268(1), 78-94. Carlson, J. M., Chakravarty, A., & Gross, R. H. (2006). BEAM: a beam search algorithm for the identification of cis-regulatory elements in groups of genes. J. Comput. Biol., 13(3), 686-701. Chang, G., & Wang, T. (2011). Genome analysis with the conditional multinomial

distribution profile. J. Theor. Biol., 271(1), 44-50. Chen, X., Kwong, S., & Li, M. (2000). A compression algorithm for DNA sequences and its applications in genome comparison. In Proceedings of the Sixth Annual International Computing and Combinatorics Conference (RECOMB), ACM Press. Tokyo, Japan. pp. 107-117. Chu, K. H., Qi, J., Yu, Z. G., & Anh, V. O. (2004). Origin and phylogeny of chloroplasts revealed by a simple correlation analysis of complete genomes. Mol. Biol. Evol., 21(1), 200-206. Churchill, G. A. (1989). Stochastic models for heterogeneous DNA sequences. B. math. biol., 51(1), 79-94. Churchill, G. A. (1992). Hidden Markov chains and the analysis of genome structure. Comput. chem., 16(2), 107-115. Cohen, E., & Chor, B. (2012). Detecting phylogenetic signals in eukaryotic whole genome sequences. J. Comput. Biol., 19(8), 945-956. Collins, R. E., & Higgs, P. G. (2012). Testing the infinitely many genes model for the evolution of the bacterial core genome and pangenome. Mol. Biol. Evol., 29(11), 3413-3425. del Campo, E. M., Casano, L. M., Barreno, E., (2013) Evolutionary implications of intron–exon distribution and the properties and sequences of the RPL10A gene in eukaryotes, Mol. Phylogenet. Evol., 66 (2013) , 857-867. D'haeseleer, P. (2006). How does DNA sequence motif discovery work? Nat. biotechnol., 24(8), 959-961. Domazet-Lošo, M., & Haubold, B. (2009). Efficient estimation of pairwise distances between genomes. Bioinformatics, 25(24), 3221-3227. Durbin, R., Eddy, S. Krogh, A., Mitchison, G., 1998. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK. Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755-763. Edwards, A. W. F. (1971). Distances between populations on the basis of gene frequencies. Biometrics, 873-881.

Gao, L. and Qi, J. (2007) Whole genome molecular phylogeny of large dsDNA viruses using composition vector method. BMC Evol. Biol., 7(1), 41. Goldman, N. (1993) Statistical tests of models of DNA substitution. J. Mol. Evol., 36(2), 182-198. Guyon, F., Brochier-Armanet, C., & Guénoche, A. (2009). Comparison of alignment free string distances for complete genome phylogeny. Adv. Data Anal. Classif., 3(2), 95-108. Gutierrez, J. M., Rodrıguez, M. A., & Abramson, G. (2001). Multifractal analysis of DNA sequences using a novel chaos-game representation. Physica A, 300(1), 271-284. Hansen, N. R. (2009) Statistical models for local occurrences of RNA structures. J. Comput. Biol., 16, 845-858. Hao, B. and Qi, J. (2003) Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance. In Proceedings of the 2003 IEEE Bioinformatics Conference (CSB 2003), pp. 375-385. Hatje, K. and Kollmar, M. (2012) A phylogenetic analysis of the brassicales clade based on an alignment-free sequence comparison method. Front. Plant Sci., 3, 192. Haubold, B., Pfaffelhuber, P., Domazet-Los˘ o, M., & Wiehe, T. (2009). Estimating mutation distances from unaligned genomes. J. Comput. Biol., 16(10), 1487-1500. Herniou, E. A., Luque, T., Chen, X., Vlak, J. M., Winstanley, D., Cory, J. S., & O'Reilly, D. R. (2001). Use of whole genome sequence data to infer baculovirus phylogeny. J. Virol., 75(17), 8117-8126. House, C. and Fitz-Gibbon, S. (2002) Using homolog groups to create a whole-genomic tree of free-living organisms: an update. Mol. Evol., 54, 539-547. Jeffrey, H. J. (1990) Chaos game representation of gene structure. Nucleic Acids Res., 18(8), 2163-2170. Joseph, J. and Sasikumar, R. (2006) Chaos game representation for comparison of

whole genomes. BMC bioinformatics, 7(1), 243. Jun, S. R., Sims, G. E., Wu, G. A., & Kim, S. H. (2010). Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. Proc. Natl. Acad. Sci. USA, 107(1), 133-138. Karlin, S. and Burge, C. (1995) Dinucleotide relative abundance extremes: a genomic signature. Trends Genet., 11, 283-290. Leimeister, C. A., Boden, M., Horwege, S., Lindner, S., & Morgenstern, B. (2014a) Fast alignment-free sequence comparison using spaced-word frequencies. Bioinformatics, 30(14), 1991-1999. Leimeister, C. A., & Morgenstern, B. (2014b) kmacs: the k-mismatch average common substring

approach to

alignment-free

sequence

comparison.

Bioinformatics, 30(14), 2000-2008. Li, W., Fang, W., Ling, L., Wang, J., Xuan, Z., & Chen, R. (2002) Phylogeny based on whole genome as inferred from complete information set analysis. J. Biol. Phy., 28, 439-447. Lin, T. H., Ray, P., Sandve, G. K., Uguroglu, S., & Xing, E. P. (2008) Baycis: a bayesian hierarchical hmm for cis-regulatory module decoding in metazoan genomes. In Research in Computational Molecular Biology, Springer Berlin Heidelberg, pp. 66-81. Lin, Y. and Moret, B. M. (2011) A new genomic evolutionary model for rearrangements, duplications, and losses that applies across eukaryotes and prokaryotes. J. Comput. Biol., 18(9), 1055-1064. Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O., & Borodovsky, M. (2005) Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res., 33, 6494-6506. Ludwig, W., Schleifer, K. H., & Whitman, W. B. (2009). Revised road map to the phylumFirmicutes. In Bergey’s Manual® of Systematic Bacteriology (pp. 1-13). Springer New York. Moret, B. M., Wang, L. S., Warnow, T., & Wyman, S. K. (2001). New approaches for reconstructing phylogenies from gene order data. Bioinformatics, 17(suppl 1),

S165-S173. Narlikar, L., Mehta, N., Galande, S., & Arjunwadkar, M. (2013). One size does not fit all: On how Markov model order dictates performance of genomic sequence analyses. Nucleic Acids Res., 41(3), 1416-1424. Qi, J., Wang, B., & Hao, B. L. (2004a) Whole proteome prokaryote phylogeny without sequence alignment: a k-string composition approach. J. Mol. Evol., 58, 1-11. Qi, J., Luo, H., & Hao, B. L. (2004b) CVTree: a phylogenetic tree reconstruction tool based on whole genomes. Nucleic Acids Res., 32(suppl 2), W45-W47. Rambaut, A. and Grass, N. C. (1997) Seq-Gen: an application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci., 13(3), 235-238. Robinson, D. and Foulds, L. (1981) Comparison of phylogenetic trees. Math. Biosci., 53, 131-147. Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. biol. evol., 4, 406-425. Salzberg, S. L., Pertea, M., Delcher, A. L., Gardner, M. J., & Tettelin, H. (1999) Interpolated Markov models for eukaryotic gene finding. Genomics, 59(1), 24-31. Salzberg, S. L., Delcher, A. L., Kasif, S., & White, O. (1998) Microbial gene identification using interpolated Markov models. Nucleic Acids Res., 26(2), 544-548. Sanderson, M. J., Donoghue, M. J., Piel, W. H., & Eriksson, T. (1994) TreeBASE: a prototype data base of phylogenetic analyses and an interactive tool for browsing the phylogeny of life. Am. J. Bot., 81(6), 183. Sankoff, D., Leduc, G., Antoine, N., Paquin, B., Lang, B. F., & Cedergren, R. (1992) Gene order comparisons for phylogenetic inference: Evolution of the mitochondrial genome. Proc. Natl. Acad. Sci. USA, 89(14), 6575-6579. Shannon, C. E. (2001) A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1), 3-55.

Shifman, A., Ninyo, N., Gophna, U., & Snir, S. (2014) Phylo SI: a new genome-wide approach for prokaryotic phylogeny. Nucleic Acids Res., 42(4), 2391-2404. Sims, G. E., Jun, S. R., Wu, G. A., & Kim, S. H. (2009) Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc. Natl. Acad. Sci. USA, 106(8), 2677-2682. Snel, B., Bork, P., & Huynen, M. A. (1999) Genome phylogeny based on gene content. Nat. Genet., 21, 108-110. Snel, B., Bork, P., & Huynen, M. (2000) Genome evolution: gene fusion versus gene fission. Trends Genet.,16, 9-11. Spies, N., Nielsen, C. B., Padgett, R. A., & Burge, C. B. (2009) Biased chromatin signatures around polyadenylation sites and exons. Mol. cell, 36(2), 245-254. Strait, B. J., & T. G. Dewey. (1996) The Shannon information entropy of protein sequences. Biophys. J., 71, 148–155. Stuart, G., Moffett, K., & Bozarth, R. F. (2004) A whole genome perspective on the phylogeny of the plant virus family tombusviridae. Arch. Viro., 149, 1595-1610. Stuart, G. and Berry, M. (2003) A comprehensive whole genome bacterial phylogeny using correlated peptide motifs defined in a high dimensional vector space. J. Bioinform. Comput. Biol., 1, 475-493. Stuart, G. W., Moffett, K., & Leader, J. J. (2002a) A comprehensive vertebrate phylogeny using vector representation of protein sequences from whole genomes. Mol. Biol. Evol., 19, 554-562. Stuart, G. W., Moffett, K., & Baker, S. (2002b) Integrated gene and species phylogenies from unaligned whole genome sequence. Bioinformatics, 18, 100-108. Thakur, V., Azad, R. K., & Ramaswamy, R. (2007). Markov models of genome segmentation. Physical Review E, 75(1), 011915. Thijs, G., Lescot, M., Marchal, K., Rombauts, S., De Moor, B., Rouze, P., & Moreau, Y. (2001). A higher-order background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics, 17(12),

1113-1122. Thijs, G., Marchal, K., Lescot, M., Rombauts, S., De Moor, B., Rouzé, P., & Moreau, Y. (2002) A Gibbs sampling method to detect overrepresented motifs in the upstream regions of coexpressed genes. J. Comput. Biol., 9(2), 447-464. Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., & Higgins, D.G., (1997) The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res. 25,4876–4882. Ulitsky, I., Burstein, D., Tuller, T., & Chor, B. (2006) The average common substring approach to phylogenomic reconstruction. J. Comput. Biol., 13(2), 336-350. Vinga, S. and Almeida, J. (2003) Alignment-free sequence comparisonła review. Bioinformatics, 19(4), 513-523. Wang, Y., Hill, K., Singh, S., & Kari, L. (2005) The spectrum of genomic signatures: from dinucleotides to chaos game representation. Gene, 346, 173C185. Wang, Y., Wang, J., Chen, C., Chen, Y., Li, C., (2015) A novel BLAST-Based Relative Distance (BBRD) method can effectively group members of protein arginine methyltransferases and suggest their evolutionary relationship. Mol. Phylogenet. Evol., 84 (2015), 101-111. Whelan, S. and Goldman, N. (1999) Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics. Mol. Biol. Evol., 16(9), 1292. Wong, K. M., Suchard, M. A., & Huelsenbeck, J. P. (2008) Alignment uncertainty and genomic analysis. Science, 319(5862), 473-476. Wu, X., Cai, Z., Wan, X. F., Hoang, T., Goebel, R., & Lin, G. (2007) Nucleotide composition string selection in HIV-1 subtyping using whole genomes. Bioinformatics, 23(14), 1744-1752. Wu, G. A., Jun, S. R., Sims, G. E., & Kim, S. H. (2009) Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method. Proc. Natl. Acad. Sci. USA, 106(31), 12826-12831. Xie, X. H., Yu, Z. G., Han, G. S., Yang, W. F., & Anh, V. (2015), Whole-proteome based phylogenetic tree construction with inter-amino-acid distances and the conditional geometric distribution profiles, Mol. Phylogenet. Evol. 89, 37-45.

Xu, Z. and Hao, B. L. (2009) CVTree update: a newly designed phylogenetic study platform using composition vectors and whole genomes. Nucleic Acids Res., 37(Web Server issue), W174-W178. Yu, Z. G., Anh, V., & Lau, K. S. (2001) Measure representation and multifractal analysis of complete genome, Phys. Rev. E, 64, 031903. Yu, Z. G., Zhou, L. Q., Anh, V. V., Chu, K. H., Long, S. C., & Deng, J. Q. (2005) Phylogeny of prokaryotes and chloroplasts revealed by a simple composition approach on all protein sequences from complete genomes without sequence alignment. J. Mol. Evol., 60, 538-545. Yu, Z. G., Zhou, L. Q., Chu, K. H., Li, C. P., & Anh, V. V. (2008) Phylogenetic analysis of Polyomaviruses based on their complete genomes. In Proceeding of Fourth International Conference on Natural Computation (ICNC’08), 2008, Vol. 5, pp. 80-84 . Yu, Z. G., Chu, K. H., Li, C. P., Anh, V., Zhou, L. Q., & Wang, R. W. (2010a) Whole-proteome phylogeny of large dsDNA viruses and parvoviruses through a composition vector method related to dynamical language model. BMC Evol. Biol., 10(1), 192. Yu, Z. G., Zhan, X. W., Han, G. S., Wang, R. W., Anh, V., & Chu, K. H. (2010b) Proper distance metrics for phylogenetic analysis using complete genomes without sequence alignment. Int. J. Mol. Sci., 11(3), 1141-1154. Zhang, H., Gao, S., Lercher, M. J., Hu, S., & Chen, W. H. (2012) EvolView, an online tool for visualizing, annotating and managing phylogenetic trees. Nucleic Acids Res., 40(W1), W569-W572.

Figure captions: Fig. 1. The curve of average Shannon entropy MH k (red) and curves of Shannon entropy

MH k for two data sets. (a) For Data set 2 (109 proteomes of prokaryotes and eukaryotes), (b) For Data set 1 (172 proteomes of bacilli). Fig. 2. Spectrum-like graph of one-dimensional Chaos game representation of human beta globin gene. Fig. 3. (a) The distribution of polynucleotides in the unit interval. (b) The distribution of trinucleotides in the 4 different linked lists, where the integers are structural index of trinucleotide. Fig. 4. The proportion of theoretical word compositions with different length that actual emerging in sequences from two data sets.

(a) For Data set 2 (109 proteomes of prokaryotes and

eukaryotes), (b) For Data set 1 (172 proteomes of bacilli). Fig. 5. Phylogenetic tree of 172 Bacilli based on concatenated alignment. Fig.6. Phylogenetic tree of 172 Bacilli based on higher order Markov model from complete proteome sequences. Fig. 7. Phylogenetic tree of 109 prokaryotes and eukaryotes based on higher order Markov model from complete proteome sequences. The red-colored branches indicate Eukaryota, the green-colored branches indicate Archaea and the blue-colored branches indicate Bacteria. Fig. 8. The Robinson-Foulds distance from resulted trees to the reference tree of 172 Bacilli species. The green part corresponds to trees reconstructed based on whole proteomes except “alignment tree” which corresponds to that using concatenated alignment in (Collins and Higgs, 2012). The blue part corresponds to trees reconstructed based on whole genomes. Fig. 9. The Robinson-Foulds distance from resulted trees to the reference tree on 109 species of prokaryotes and eukaryotes. The green part corresponds to trees reconstructed based on whole proteomes. The blue part corresponds to trees reconstructed based on whole genomes.

Figure 1

Figure 2

Figure 3

Figure 4

Figure 5

Figure 6

Figure 7

Figure 8

Figure 9

*Graphical Abstract (for review)

Whole-proteome sequences Whole-genome sequences

One-dimensional chaos game representation

Higher order Markov models

Distance matrix of sequences

Phylogenetic tree

Highlights 

A new alignment-free proteome based method for phylogenetic tree construction is proposed.



We convert the whole proteome sequences into transition matrices of a higher order Markov model.



One-dimensional CGR and the linked list are used to reduce the problem of large memory storage.



A distance measure based on the angle between two feature vectors is used to refer the phylogenetic distance.



Our results on two data sets demonstrate that the new method is useful and efficient.