Robust classifying of prokaryotic genomes


Computational Biology and Chemistry 40 (2012) 20–29


Katerina Korenblat a, Zeev Volkovich a, Alexander Bolshoy b,∗

a Software Engineering Department, ORT Braude Academic College, Karmiel, Israel
b Department of Evolutionary and Environmental Biology, University of Haifa, Haifa 31905, Israel


Article history: Received 20 February 2012; Received in revised form 3 July 2012; Accepted 3 July 2012.

Keywords: Agglomerative clustering; Information bottleneck method; Jackknifing; Gene length; Phylogenomics

Abstract

In this paper, we propose a method to classify prokaryotic genomes using the agglomerative information bottleneck method for unsupervised clustering. Although the method we present here is closely related to a group of methods based on detecting the presence or absence of genes, our method is different because it uses gene lengths as well. We show that this amended method is reliable. For robustness evaluation, we apply bootstrap and jackknife techniques to the input data. As a result, we are able to propose an approach to determine the stability level of a cladogram. We demonstrate that the genome tree produced for a selected small group of genomes closely resembles a phylogenetic tree of this group.

1. Introduction

There are different views on how the terms "classification" and "clustering" should be understood. We follow Mirkin's approach (Mirkin, 1996), which suggests that clustering is an approximation of a data set by a cluster structure. This approach defines clustering as a data-driven phase of classification. Classification problems arise in a variety of machine learning fields, as well as in pattern recognition, optimization, and statistics. Our study deals with the classification of microbes, which has been defined as 'the arrangement of prokaryotes into groups' (Tindall et al., 2007). The objective of clustering is to determine a small number of groups with low distortion within individual groups. Various clustering algorithms provide ways to determine groups of vectors in a high-dimensional space. Group membership is detected by means of a distance-like function that measures the resemblance between data points and class centroids (vector quantization). When the objects to be classified are represented by numerical vectors, their classification may be realized by clustering the corresponding vectors. To perform classification, one must make three choices: (a) A choice of object representation. When texts are the objects to be classified, these documents may be represented by numerical vectors in the spirit of the established Bag of Words method (or

Abbreviations: IB, information bottleneck; COGs, clusters of orthologous groups; UPGMA, unweighted pair group method with arithmetic mean; NJ, neighbor joining method.
∗ Corresponding author. E-mail address: [email protected] (A. Bolshoy).
http://dx.doi.org/10.1016/j.compbiolchem.2012.07.001

Bag of Tokens method – see (Bolshoy et al., 2010) and associated references). In (Bolshoy et al., 2010), different ways of representing texts numerically are discussed. For instance, a document may be represented by its N-gram conditional distribution:

$$P(y|x) = \frac{n(y|x)}{\sum_{y' \in Y} n(y'|x)},$$

where Y is the set of N-grams, X is the set of documents, and n(y|x) is the number of occurrences of N-gram y in document x.

In this study, the objects in question are prokaryotic genomes. A common approach is to investigate a genome as a "DNA text" (see the discussion in the chapter "DNA Texts" in (Bolshoy et al., 2010)). In (Bolshoy and Volkovich, 2009) and (Bolshoy et al., 2010), we introduced a way to use an extract from a genome annotation instead of a complete or partial genome sequence. First, from the whole genome annotation we take the annotations of protein-coding genes exclusively. From such an annotation we take only two properties: the protein length and the association with a certain gene family. Neglecting gene order and other details leads to a situation in which a genome is seen as a "bag of protein-coding genes" and is represented by a vector of the lengths of these genes. Gene lengths were obtained using the database of Clusters of Orthologous Groups of proteins (Tatusov et al., 2000, 2003). One may ask: is this representation relevant for further genome classification? An answer relies on the fact that this representation is closely related to another well-known representation named "gene content" (Snel et al., 1999; Wolf et al., 2001; Huson and Steel, 2004) or "conservation profile" (Tekaia and Yeramian, 2005). Gene content certainly carries a very strong phylogenetic component, so that even the presence of horizontally transferred genes in the data does not affect the classifications produced.
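To make the N-gram representation of point (a) concrete, here is a minimal Python sketch (illustrative only; the function and variable names are ours, not from the original study) that estimates P(y|x) for a toy pair of documents:

    from collections import Counter

    def ngram_distribution(document, n=3):
        # Estimate P(y|x): the relative frequency of each N-gram y in document x.
        counts = Counter(document[i:i + n] for i in range(len(document) - n + 1))
        total = sum(counts.values())
        return {gram: c / total for gram, c in counts.items()}

    # Toy usage: two "DNA texts" represented by their 3-gram distributions.
    docs = {"doc1": "ACGTACGTGG", "doc2": "ACGTTTGGAA"}
    distributions = {name: ngram_distribution(text) for name, text in docs.items()}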


Therefore, taxonomically and evolutionarily related organisms have very similar gene presence/absence profiles, and the lengths of orthologous genes in such organisms are very similar as well. Our object representation is more informative than a gene-content one: it uses all the information present in a conservation profile, but it also uses the information related to gene lengths. Bolshoy and Volkovich (2009) and Bolshoy et al. (2010) compared genome clusterings produced for the same small group of genomes and based either on gene content or on genomic data related to the lengths of homologous proteins. It was shown that, as expected, the dendrogram based on the larger amount of evolutionary information presents a more taxonomically convincing genomic tree.

(b) A choice of a measure of dissimilarity between data points. How do we quantify dissimilarity between two documents? As mentioned above, the choice of such a metric is a vital concern in clustering. Often, a distance function is tailored so closely to the current data representation that it fails to capture the structure of high-dimensional data, particularly in text mining tasks. When a document is represented by its N-gram conditional distribution, two documents are similar if their conditional distributions are similar. Thus, two documents with close distributions are expected to be found in the same cluster. The fundamental information bottleneck (IB) approach was proposed by Tishby et al. (1999), with the declared purpose of avoiding an arbitrary choice of distance measure. In the framework of this general methodology, given the empirical joint distribution P(X, Y) of two random variables, a compact representation of X is constructed while preserving as much information as possible about the relevant variable Y. Further information regarding our application of the IB approach appears in Section 2.4.

(c) A choice of an algorithm for genome classification (clustering). In this manuscript we utilize a variant of the information bottleneck method proposed for text classification problems (Tishby et al., 1999). The success of the IB approach has motivated its application to many other problems, in particular the classification of proteomes and genomes and the evolutionary conclusions drawn from such analyses. A hierarchical multi-bottleneck classification method was successfully applied to gene microarray expression data in (Xiong et al., 2004). An investigation of complex modular biological networks provided in (Hintze and Adami, 2008) makes it possible to conclude that the IB method should be capable of reflecting the structure of a set of prokaryotic genomes and of discovering their natural robust clustering. Are the lengths of genes belonging to one gene family (variable Y) a suitable variable for the application of IB to a set of genomes (variable X)? Our claim that the gene lengths of a COG are well suited to serve as variable Y in the IB method is based on the observation that protein length most commonly changes during prokaryotic evolution as a consequence of insertions and deletions (indels). In (Bolshoy and Volkovich, 2008) we introduced a novel IB-based method of genome classification. Here we introduce another modification of the IB method and pursue several primary goals.

• First, to show that the agglomerative IB algorithm is an efficient way to construct a genome tree from a set of available prokaryotic genomes.
• Second, to show the robustness of the proposed algorithm. In other words, we need to demonstrate that, for a chosen genome dataset, the tree structures obtained using different subsets of gene families are sufficiently similar.
• Third, to demonstrate that a dendrogram produced by our method, which presents the classification of a selected small group of genomes, is very similar to a phylogenetic tree of the same group.


• Fourth, to determine the parameters of the method, basing our results on a few empirical case studies.

To measure the robustness of our algorithm, we must choose an appropriate metric (i.e., a measure of dissimilarity) for comparing the produced trees. Such a metric allows a quantitative assessment of the similarity of trees reconstructed from different data sets. The distribution of the metric then indicates whether the observed tree similarity could have arisen by chance. A number of tree comparison metrics have been proposed, for example, nearest-neighbor interchange (Waterman and Steel, 1978), subtree transfer distance (Allen and Steel, 2001), tree juxtaposition (Nye et al., 2006), quartets (Estabrook et al., 1985), partition or symmetric difference metrics (Bourque, 1978; Robinson and Foulds, 1979, 1981) and path length metrics (Steel and Penny, 1993). Newer developments in the field are given in (Rannala and Yang, 1996). We have chosen one particular type of tree metric, the partition metric (Bourque, 1978; Robinson and Foulds, 1979, 1981), which fits the purposes of this study. In this paper, we investigate the robustness of trees based on numerous gene families, in spite of the fact that some of the gene families are specific to certain groups of organisms. In this way, we intend to show that the stability of the tree structure increases as the percentage of considered data increases. We will demonstrate the success of the "more is better" approach.

2. Data and methods

2.1. COGs

We evaluate our algorithm on a subset of the database of Clusters of Orthologous Groups of proteins (COGs). The principles of database construction are described in (Tatusov et al., 1997, 2000, 2001, 2003). COGs were constructed by applying the criterion of consistency of genome-specific best hits to the results of an exhaustive comparison of all protein sequences from these genomes. The data in COGs are updated continuously as new prokaryotic genomes are sequenced. For example, at the beginning of the project, proteins from 52 Archaeal and 601 Eubacterial genomes (a total of 653 complete genomes) were assigned to 5663 COGs.

2.2. Whole genomes

To illustrate the evaluation of the tree construction algorithm, a relatively small group of genomes was chosen based on the following rules:

• Genomes in the group should belong to both Bacteria and Archaea.
• To as large an extent as possible, well-differentiated subgroups of organisms should be included in the group, so that the trees obtained can be assessed against external criteria.
• Several organisms known to be highly related, or even strains of the same organism, should be included to check that they come out monophyletic.

In this study, we used a group of 60 genomes (Table 1) that satisfies these three rules. Indeed, in this dataset we can see representatives of both Kingdoms, Eubacteria and Archaea; representatives of different classes of organisms (e.g., Euryarchaeota and Crenarchaeota); and representatives of different strains of the same bacteria: two strains of Burkholderia pseudomallei, 5 strains of Campylobacter jejuni, 8 strains of Escherichia coli, 2 strains of Helicobacter pylori, 4 strains of Methanococcus maripaludis, 2 strains of Mycobacterium bovis and 3 strains of Mycobacterium tuberculosis (non-biologists may find a short description of the terms Kingdom, Class, and strain in (Bolshoy et al., 2010), Chapter "Biological Classification").


Table 1
List of genomes.

# | Name | Kingdom | Group
1 | Aeropyrum pernix K1 | A | C
2 | Archaeoglobus fulgidus DSM 4304 | A | E
3 | Burkholderia cepacia AMMD | B | β
4 | Burkholderia mallei NCTC 10229 | B | β
5 | Burkholderia pseudomallei 1106a | B | β
6 | Burkholderia pseudomallei 668 | B | β
7 | Burkholderia vietnamiensis G4 | B | β
8 | Burkholderia xenovorans LB400 | B | β
9 | Caldivirga maquilingensis IC-167 | A | C
10 | Campylobacter fetus subsp. fetus 82–40 | B | ε
11 | Campylobacter jejuni RM1221 | B | ε
12 | Campylobacter jejuni 269.97 | B | ε
13 | Campylobacter jejuni 81–176 | B | ε
14 | Campylobacter jejuni 81116 | B | ε
15 | Campylobacter jejuni NCTC 11168 | B | ε
16 | Clostridium beijerinckii NCIMB 8052 | B | μ
17 | Clostridium botulinum F str. Langeland | B | μ
18 | Clostridium kluyveri DSM 555 | B | μ
19 | Clostridium novyi NT | B | μ
20 | Clostridium perfringens str. 13 | B | μ
21 | Clostridium phytofermentans ISDg | B | μ
22 | Escherichia coli APEC O1 | B | γ
23 | Escherichia coli E24377A | B | γ
24 | Escherichia coli HS | B | γ
25 | Escherichia coli O157-H7 EDL933 | B | γ
26 | Escherichia coli O157-H7 str. Sakai | B | γ
27 | Escherichia coli UTI89 | B | γ
28 | Escherichia coli W3110 DNA | B | γ
29 | Escherichia coli str. K-12 MG1655 | B | γ
30 | Haloarcula marismortui ATCC 43049 | A | E
31 | Halobacterium salinarum R1 | A | E
32 | Halobacterium sp. NRC-1 | A | E
33 | Haloquadratum walsbyi DSM 16790 | A | E
34 | Helicobacter pylori HPAG1 | B | ε
35 | Helicobacter pylori J99 | B | ε
36 | Hyperthermus butylicus DSM 5456 | A | C
37 | Ignicoccus hospitalis KIN4/I | A | C
38 | Metallosphaera sedula DSM 5348 | A | C
39 | Methanobrevibacter smithii ATCC 35061 | A | E
40 | Methanococcoides burtonii DSM 6242 | A | E
41 | Methanococcus aeolicus Nankai-3 | A | E
42 | Methanococcus maripaludis C5 | A | E
43 | Methanococcus maripaludis C6 | A | E
44 | Methanococcus maripaludis C7 | A | E
45 | Methanococcus maripaludis S2 | A | E
46 | Methanosaeta thermophila PT | A | E
47 | Methanosarcina acetivorans C2A | A | E
48 | Methanosarcina barkeri str. fusaro | A | E
49 | Methanosarcina mazei Go1 | A | E
50 | Methanosphaera stadtmanae DSM 3091 | A | E
51 | Methanospirillum hungatei JF-1 | A | E
52 | Mycobacterium avium 104 | B | α
53 | Mycobacterium bovis AF2122/97 | B | α
54 | Mycobacterium bovis BCG 1173P2 | B | α
55 | Mycobacterium gilvum PYR-GCK | B | α
56 | Mycobacterium tuberculosis CDC1551 | B | α
57 | Mycobacterium tuberculosis F11 | B | α
58 | Mycobacterium tuberculosis H37Ra | B | α
59 | Mycobacterium ulcerans Agy99 | B | α
60 | Mycobacterium vanbaalenii PYR-1 | B | α

Notations of the groups: α – actinobacteria, β – betaproteobacteria, γ – gammaproteobacteria, ε – epsilonproteobacteria, μ – firmicutes, E – Euryarchaeota, C – Crenarchaeota.

2.3. Gene-length matrix preparation

The original data are transformed into pairs (genome, COG), where each pair corresponds to one standardized protein length. Hence, if 5663 COGs were created for the set of 653 complete genomes, then shrinking to a subset of 60 genomes decreases the number of relevant COGs to 4217, which remain for further consideration. In the second step, for a given COG, each organism must be represented by one median length; we choose the median value of all paralog protein sizes, keeping only one representative size per COG per organism.
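As an illustration only (a minimal pandas sketch with hypothetical genome and COG identifiers; it does not reproduce the actual pipeline of this study), the reduction of such (genome, COG, length) pairs to a gene-length matrix with one median value per cell might look as follows:

    import pandas as pd

    # Hypothetical input: one row per annotated protein, with its COG assignment and length.
    records = pd.DataFrame(
        [("E_coli_K12", "COG0001", 310), ("E_coli_K12", "COG0001", 298),   # paralogs
         ("E_coli_K12", "COG0002", 451), ("H_pylori_J99", "COG0001", 305)],
        columns=["genome", "COG", "length"],
    )

    # Keep one representative size per COG per organism: the median over paralogs.
    gene_length_matrix = (records.groupby(["genome", "COG"])["length"]
                                 .median()
                                 .unstack("COG"))   # rows: genomes, columns: COGs (NaN = COG absent)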

2.4. The (agglomerative) information bottleneck method and genome tree construction

The IB approach was proposed by Tishby et al. (1999). In our previous work (Bolshoy and Volkovich, 2009) we applied it to the genome classification problem and described the approach in detail. Briefly, we use an application of the clustering method that takes a sparse matrix as input and produces, as output, a dendrogram reflecting relations among the objects (rows of the matrix) represented through their attributes (columns of the matrix). Here is a general description of the method.

In the clustering context, the IB approach admits the following natural information-theoretic formulation: create a partition X̃ of the set X into clusters such that the mutual information

$$I(\tilde{X}; Y) = \sum_{\tilde{x}} \sum_{y} p(\tilde{x}, y) \log_2 \frac{p(y|\tilde{x})}{p(y)}$$

is maximized, under a constraint on the mutual information I(X̃; X). This value defines the compactness of the clustering, while its quality is given by the ratio I(X̃; Y)/I(X; Y). We must maximize the value I(X̃; Y) − β⁻¹I(X̃; X) with respect to p(x̃|x), where β is the annealing factor, representing the amount of information about a cluster that is allowed to be abandoned and determining the "softness" of the classification. In fact, the parameter β reveals the tradeoff between compression and distortion. Thus, in the case β = 0, every item belongs to every cluster with the same probability, and no information is transmitted. For β → ∞, hard clustering occurs (i.e., each object is found in a cluster with probability 1); here I(X̃; Y) is maximized, given the capacity of X̃. Intuitively, the applicable information contained in X about Y is obtained through a compact 'bottleneck' of clusters, which filters out any 'irrelevant' information. This optimization problem can be formulated by the following three formulae:

$$p(\tilde{x}|x) = \frac{p(\tilde{x})}{Z(\beta, x)} \exp\left(-\beta\, D_{KL}\left[p(y|x)\,\|\,p(y|\tilde{x})\right]\right)$$

$$p(y|\tilde{x}) = \frac{1}{p(\tilde{x})} \sum_{x} p(\tilde{x}|x)\, p(x)\, p(y|x)$$

$$p(\tilde{x}) = \sum_{x} p(\tilde{x}|x)\, p(x)$$

where Z(β, x) is a normalization factor and D_KL is the Kullback–Leibler divergence:

$$D_{KL}\left[p(y|x)\,\|\,p(y|\tilde{x})\right] = \sum_{y} p(y|x) \log \frac{p(y|x)}{p(y|\tilde{x})}.$$
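For illustration, here is a minimal numpy sketch of one pass of these self-consistent updates (assuming a joint-distribution array p_xy and an initial soft assignment p_t_given_x; all names are ours, and the study itself relies on the agglomerative hard-clustering variant described next):

    import numpy as np

    def ib_update(p_xy, p_t_given_x, beta, eps=1e-12):
        # One round of the IB self-consistent equations for p(x~|x), p(y|x~), p(x~).
        p_x = p_xy.sum(axis=1)                      # p(x)
        p_y_given_x = p_xy / p_x[:, None]           # p(y|x)

        p_t = p_t_given_x.T @ p_x                   # p(x~) = sum_x p(x~|x) p(x)
        # p(y|x~) = (1/p(x~)) sum_x p(x~|x) p(x) p(y|x)
        p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x / p_t[:, None]

        # D_KL[p(y|x) || p(y|x~)] for every pair (x, x~)
        dkl = (p_y_given_x[:, None, :] *
               np.log((p_y_given_x[:, None, :] + eps) /
                      (p_y_given_t[None, :, :] + eps))).sum(axis=2)

        new_p_t_given_x = p_t[None, :] * np.exp(-beta * dkl)
        new_p_t_given_x /= new_p_t_given_x.sum(axis=1, keepdims=True)   # Z(beta, x)
        return new_p_t_given_x, p_y_given_t, p_t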

As β → ∞ we approach the aforementioned case of 'hard' clusters, in which each item belongs to precisely one cluster. Slonim and Tishby offered a simple implementation of the information bottleneck method restricted to the case of hard clustering (Slonim and Tishby, 2000), while a natural distance function can be straightforwardly obtained from an agglomerative hierarchical greedy clustering procedure (Slonim and Tishby, 1999). This bottom-up procedure is initialized by the trivial partition into singleton clusters, each containing a single item. Clusters are combined at each step so as to reduce the loss of mutual


information I(X̃; Y). This method leads to the following distance function (Slonim and Tishby, 1999):

$$d_{ij} = \left(P(\tilde{x}_i) + P(\tilde{x}_j)\right) D_{JS}\left(P(Y|\tilde{x}_i), P(Y|\tilde{x}_j)\right),$$

where D_JS is the Jensen–Shannon divergence

$$D_{JS}\left(P(Y|\tilde{x}_i), P(Y|\tilde{x}_j)\right) = \pi_i D_{KL}\left(P(Y|\tilde{x}_i)\,\|\,\bar{P}\right) + \pi_j D_{KL}\left(P(Y|\tilde{x}_j)\,\|\,\bar{P}\right),$$

defined for

$$\pi_i = \frac{P(\tilde{x}_i)}{P(\tilde{x}_i) + P(\tilde{x}_j)}, \qquad \pi_j = \frac{P(\tilde{x}_j)}{P(\tilde{x}_i) + P(\tilde{x}_j)},$$

and

$$\bar{P} = \pi_i P(Y|\tilde{x}_i) + \pi_j P(Y|\tilde{x}_j).$$
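A short sketch of this merge cost, assuming the cluster weights and the conditional distributions over Y are stored as numpy arrays (helper names are ours, not from the original implementation):

    import numpy as np

    def kl(p, q, eps=1e-12):
        # Kullback-Leibler divergence between two discrete distributions over Y.
        return np.sum(p * np.log((p + eps) / (q + eps)))

    def merge_cost(p_xi, p_xj, p_y_given_xi, p_y_given_xj):
        # d_ij = (P(x~i) + P(x~j)) * D_JS(P(Y|x~i), P(Y|x~j)), the agglomerative IB merge cost.
        w = p_xi + p_xj
        pi_i, pi_j = p_xi / w, p_xj / w
        p_bar = pi_i * p_y_given_xi + pi_j * p_y_given_xj
        d_js = pi_i * kl(p_y_given_xi, p_bar) + pi_j * kl(p_y_given_xj, p_bar)
        return w * d_js

At each agglomeration step, the pair of clusters with the smallest d_ij is merged; the merged cluster receives the weight P(x̃_i) + P(x̃_j) and the conditional distribution P̄.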


DO {
  (0) Select only those X COGs which contain more than A% of the maximal COG size.
  (1) Iteratively (Ncycle times) do {
      1.1. Select a simple random sample (a B%-subset of Y COGs) from the whole population of X relevant COGs.
      1.2. Select a simple random sample (a C%-subset of Z columns) from the B%-subset selected above and randomly reshuffle the chosen Z columns.
      1.3. Generate a tree corresponding to the constructed input data.
  }
  (2) For each pair of constructed trees, calculate the distance between these trees using the partition metric.
  (3) Collect the distances for all pairs of generated trees corresponding to fixed values of the procedure parameters B and C and draw a distance histogram.
  (4) Analyze the influence of randomization on the closeness of the generated trees.
}

Schema 1. General flow chart of actions applied to the full set of COG lengths.

Note that finding a globally optimal solution is NP-complete (Garey et al., 1982), and this agglomerative IB algorithm finds a local optimum, as many other agglomerative procedures do. Fortunately, the information distance measure makes it possible to provide results equivalent to or even exceeding those of other existing methods (Wang et al., 2010).

2.5. Partition metric

The partition metric was proposed originally in (Bourque, 1978; Robinson and Foulds, 1979, 1981). As shown in (Robinson and Foulds, 1981), this metric works well for the comparison of weighted labeled trees. It allows comparison of tree topologies, which is essential for phylogenetic trees, and it is fast to calculate. All considerations that make this metric useful for phylogenetic tree comparison are applicable to the genome trees studied in this manuscript. The idea of the metric is to evaluate how often two trees can be divided into the same two parts. Specifically, a split is a partition of a tree into two sub-trees obtained by the deletion of one edge. The partition metric measures how many splits are found in one tree but not the other. Let us denote by n(T) the number of internal (non-leaf) edges of tree T. Given two trees T1 and T2, the partition metric is calculated as follows:

$$d(T_1, T_2) = n(T_1) + n(T_2) - 2\, s(T_1, T_2),$$

where s(T1, T2) denotes the number of pairs of identical splits induced by deleting a non-leaf edge from both T1 and T2. The maximum value of the metric across all pairs of trees with n leaves is 2n − 6. In (Steel and Penny, 1993) it was shown that the distribution of this metric is asymptotically Poisson; the authors also showed that convergence to the asymptotic values is relatively fast. These properties make the metric especially appropriate for our purposes. Additionally, we used the software of Nye et al. (2006) to compare the resulting trees. Its algorithm pairs up each branch in one tree with a matching branch in the second tree and finds the optimum mapping. This approach gives a simple, easily understandable measure of similarity between trees.
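A minimal sketch of this partition distance for rooted dendrograms written as nested tuples of leaf names (for rooted trees, each internal edge corresponds to a clade, i.e., the set of leaves below it; helper names are ours and this is not the implementation used in the study):

    def leaf_set(node):
        # Leaves are strings; internal nodes are tuples of subtrees.
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaf_set(child) for child in node))

    def internal_clades(tree):
        # Leaf sets below internal edges: proper, non-singleton subsets of all leaves.
        all_leaves = leaf_set(tree)
        clades, stack = set(), [tree]
        while stack:
            node = stack.pop()
            if isinstance(node, str):
                continue
            ls = leaf_set(node)
            if 1 < len(ls) < len(all_leaves):
                clades.add(ls)
            stack.extend(node)
        return clades

    def partition_distance(t1, t2):
        # d(T1, T2) = n(T1) + n(T2) - 2*s(T1, T2), computed on clades.
        c1, c2 = internal_clades(t1), internal_clades(t2)
        return len(c1) + len(c2) - 2 * len(c1 & c2)

    # Example: two trees with identical topology give distance 0.
    t = ((("A", "B"), "C"), ("D", "E"))
    print(partition_distance(t, ((("B", "A"), "C"), ("E", "D"))))  # 0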

2.6. Tree robustness estimation

Assessing the robustness of phylogenetic trees remains an issue of contention. Methods commonly used to assess support for cladograms include nonparametric bootstrapping and jackknifing. Nonparametric bootstrapping (Felsenstein, 1985) involves randomly resampling characters (with replacement) from a dataset to construct a new dataset of the same size as the original; the new dataset is then analyzed to give a tree or trees. This procedure is repeated many times. Another popular resampling method is jackknifing (Farris et al., 1996; Shi et al., 2010). A large number of reduced datasets (pseudoreplicates) are formed by deleting a certain fraction of characters at random from the original data. A tree is obtained from each such subset, and a set of trees is thus produced. The proportion of pseudoreplicates in which a particular clade appears is its jackknifing frequency; more robust clades are more likely to survive character deletions and thus appear in more pseudoreplicates.

In the case of our data (the matrix [genome, COG]), application of the bootstrapping method means that, while the dimensions of the matrix and its number of non-empty elements remain unchanged, a certain randomly chosen fraction of columns (COGs) is reshuffled. The expectation is that such data randomization strongly affects the clustering results, so that the randomized trees become very distant from one another. The jackknifing method instead deletes a randomly chosen fraction of columns (COGs) from the original input data; that is, instead of using the full set of COGs, we employ only a subset of the total number of gene families.

3. Results

3.1. General description of the experimental procedure

Schema 1 presents a general flow chart of the procedure. Parameter C is a bootstrapping parameter and designates the fraction of columns (COGs) of the input dataset that are randomly resampled (with reshuffling). Parameter B is a jackknifing parameter, and (100% − B) designates the fraction of randomly deleted columns. Additionally, we considered it necessary to use the preprocessing parameter A, whose meaning is to consider only COGs containing more than A% of the maximal COG size. In this way, we restrict our consideration to the most informative COGs, because COGs containing "too few" elements may become useless or even destructive to our analysis. Non-deterministic steps appear three times in the procedure described in Schema 1: steps 0 and 1.1 are simple random sampling procedures, and step 1.2 is a reshuffling procedure. The input data form the matrix M[60, X], where X is the number of COGs containing more than A% of the maximal COG size; step 1.1 is a random selection of Y (B%) columns of the matrix, and step 1.2 reshuffles each of the Z (C%) selected columns separately and independently.
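The following sketch illustrates the non-deterministic steps of Schema 1 – the preprocessing filter (step 0), the jackknife column selection (step 1.1) and the bootstrap reshuffling (step 1.2) – on a genome-by-COG matrix; it assumes numpy, uses our own parameter names, and is not the code used in the study:

    import numpy as np

    rng = np.random.default_rng(0)

    def resample_columns(M, A=0.15, B=0.8, C=0.0):
        # Steps 0-1.2 of Schema 1 on a genome-by-COG matrix M (NaN marks an absent gene).
        # Step 0: keep only COGs represented in more than A% of the maximal COG size.
        cog_sizes = np.sum(~np.isnan(M), axis=0)
        kept = M[:, cog_sizes > A * cog_sizes.max()]          # X columns remain

        # Step 1.1: jackknifing - draw a simple random sample of Y = B% of the X columns.
        X = kept.shape[1]
        cols = rng.choice(X, size=int(B * X), replace=False)
        sub = kept[:, cols].copy()                            # sparse supermatrix M[60, Y]

        # Step 1.2: bootstrapping - reshuffle a C% fraction of the selected columns,
        # each column permuted independently across genomes.
        Z = int(C * sub.shape[1])
        for j in rng.choice(sub.shape[1], size=Z, replace=False):
            sub[:, j] = rng.permutation(sub[:, j])
        return sub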

3.2. Bootstrapping: randomizing and robustness are anticorrelated events

In this section, we demonstrate how randomization damages robustness. For the purpose of demonstration we assign constant values to parameters A and B, and we test a few selected values of C. A threshold of A = 35% means that only 888 COGs are considered in this procedure. The subset percentage B was chosen to be equal to 65%, yielding Y = 577. The bootstrapping parameter C (the randomization parameter) was chosen to be equal to


Fig. 1. Distributions of distances for trees based on a 35%-COG subset with partial randomization. The four curves, reading from left to right, correspond to 0%, 30%, 60%, and 100% randomized COGs. The partition metric was used.

0%, 30%, 60%, and 100%, making the sizes of the resampling subsets equal to 0, 173, 346, and 577 COGs out of the total 577, respectively. Thus, the general procedure presented in Schema 1 was transformed for this section in the following manner:

A = 0.35   (X is fixed)
B = 0.65   (Y is fixed)
FOR (C = 0, 0.3, 0.6, 1) DO {
  (1) Z = Y * C;
  (2) DO 100 times {
      1.1. Randomly select Y out of the X COGs, giving a sparse supermatrix M[60, Y].
      1.2. Randomly reshuffle the Z chosen columns of the above-mentioned matrix M.
      1.3. Generate a tree for the obtained supermatrix M.
  }
  (3) For each pair of constructed trees, calculate the distance between these trees using the partition metric.
  (4) Draw a histogram of the distance distribution.
}

Fig. 1 presents the distance distributions obtained when different fractions of COGs are chosen for resampling (randomization). The four curves depict trees with 0%, 30%, 60%, and 100% of randomized COGs (Z = 0, 173, 346, 577), from left to right. Each curve in Fig. 1 shows the distribution of distances between two trees constructed in different iterations of Step 1 of our algorithm. First of all, the curves are Poisson-like, P(x = k) = λ^k e^{−λ}/k!, as expected (Steel and Penny, 1993). The leftmost curve exhibits the distance distribution between trees based on the original 65% datasets and corresponds to a Poisson-like curve with the parameter λ + 6 = 30; the rightmost curve – the distance distribution for fully randomized data – corresponds to a Poisson-like curve with parameter λ + 30 = 59. Fig. 1 confirms our expectation that randomization should increase the distance between the corresponding trees. It also demonstrates that the distance distributions are nearly Poisson, as should be the case for the partition metric.

3.3. Jackknifing: gene-families' subset size and tree robustness are correlated events – more is better

In this section, we investigate the similarity of trees built from different COG subsets of a fixed size. The jackknifing parameter B is varied from 10% to 100% in steps of 10%. As in our previous procedure, we choose A = 35% (i.e., 888 COGs). The bootstrapping parameter C is set equal to 0 here; i.e., there is no randomization. Thus, the general procedure presented in Schema 1 is transformed for this section into the following amended schema:

DO {
  (0) Y = X * B;
  (1) DO 100 times {
      1.1. Randomly select Y COGs out of the total X, giving a sparse supermatrix M[60, Y].
      1.2.–1.3. Generate a tree for the supermatrix M.
  }
  (2) For each pair of trees constructed, calculate the distance between these trees using the partition metric.
  (3) Draw a histogram of the distance distribution.
}

Analysis of the histograms obtained by this procedure may be summarized in the three words appearing in the subtitle of this section: "more is better". For practical purposes, not all the generated histograms have been rendered in Fig. 2, which demonstrates that if we choose a sufficiently large sample of COGs (e.g., 800 of 888, as exhibited by the leftmost curve in the figure), then the specific choice of COGs is not important, because the trees produced are all very similar; as a result, the algorithm applied to sufficiently large COG subsets produces very robust trees. Indeed, the resulting distribution has a mean value equal to 12, and this distance indicates only 1–3 permutations in a tree. In contrast, if we choose less than 10% of all COGs, then the resulting tree structure is similar to the tree structure obtained under randomization. In fact, the rightmost curve in Fig. 2 is very similar to the rightmost curve in Fig. 1, with the same mean distance value of 60.

3.4. Filtering: the relationship between the gene-family preprocessing parameter and tree robustness

In this section, we investigate whether it is worthwhile to consider as many gene families as possible or whether we should instead filter out COGs with an insufficient number of proteins. Higher values of the preprocessing (i.e., separation) parameter A lead to stronger selection, which implies that only those COGs having a broad genomic representation remain for further consideration. Four values


Fig. 2. Partition metric tree distributions based on different COG sample sizes. The six graphs, reading from left to right, correspond to 90%, 70%, . . ., 10% sample sizes.

of A were used: 5%, 15%, 35%, and 50%, which correspond to 4000, 2088, 888, and 322 COGs out of 4217, respectively. The subset size B was chosen to be equal to 80%, resulting in Y = 258 COGs out of 322; 710 out of 888; 1670 out of 2088; and 3200 out of 4000. The bootstrap parameter C is equal to zero; in other words, we choose to have no randomization. The general procedure presented in Schema 1 for this section is transformed into:

DO {
  (0) Preprocess the complete dataset of COGs with separation parameter A. Obtain X COGs, all of which contain more than A% of the maximal COG size. Y = X * B%.
  (1) DO 100 times {
      1.1. Randomly select Y COGs from X, yielding a sparse supermatrix M[60, Y].
      1.2.–1.3. Generate a tree for the obtained supermatrix M.
  }
  (2) For each pair of constructed trees, calculate the distance between these trees using the partition metric.
  (3) Generate a histogram of the distance distribution.
}

The optimal distribution (the leftmost curve in Fig. 3) is related to the value A = 15%, while the cases in which A is equal to 5% and 35% induce distributions with worse parameters. This implies that a preprocessing parameter value of 5%, which leads to the utilization of very small COGs, is too small to be optimal, while the value of 15% appears to be close to optimal. As a result, we have arrived, via empirical observations, at the choice of A = 15% and B = 80% as the suggested optimal parameters for further investigations.

3.5. Consensus trees

In this section, we discuss consensus trees obtained with the CONSENSE software from the Phylip package (Felsenstein, 1980). The subset size B was chosen to be equal to 80%. Again, the bootstrapping parameter C is equal to 0, yielding no randomization. Two consensus trees, for the values A = 15% and A = 35%, are constructed. Taking A = 15%, we obtain X = 2088 and Y = 1670; choosing A = 35%, X = 888 and Y = 710.

Table 2
Comparison of consensus and 100% resulting trees. The labels 4a, 4b, 5a, 5b designate the trees from the corresponding Fig. 4a, 4b, 5a, and 5b.

   | 4a | 4b | 5a | 5b
4a | 100% | 90.3% | 92.5% | 91.9%
4b | 90.3% | 100% | 90.8% | 98.1%
5a | 92.5% | 90.8% | 100% | 92.6%
5b | 91.9% | 98.1% | 92.6% | 100%

The general procedure presented in Schema 1 for this section is thus transformed into the following:

DO {
  (0) Select only those X COGs which contain more than A% of the maximal COG size.
  (1) DO 100 times {
      1.1. Randomly select Y COGs out of X, giving a sparse supermatrix M[60, Y].
      1.2. Generate a tree for the supermatrix M.
  }
  (2) Construct a consensus tree for the 100 trees obtained.
}

This schema outputs 100 very similar trees. The problem now is to summarize this collection of trees into a single synthetic, yet informative, picture. By far the most common methods for performing such a summary are consensus methods (see, e.g., (Wilkinson and Thorley, 2001)). Their aim is simply to build a single tree displaying the frequencies of the splits (or clades) observed in the tree collection. Consensus methods capture all common properties of these trees, creating a branch-labeled consensus tree. CONSENSE uses certain rules for consensus tree construction (Margush and Mcmorris, 1981). Each fork in the consensus tree is labeled by the percentage of trees in which the group consisting of the species to the right of the fork occurred; we refer to this percentage as the branch label. Fig. 4 presents two consensus trees, for A = 15% and A = 35%. The trees presented in Fig. 4 are quite similar – ∼90% similarity according to the metric introduced in (Nye et al., 2006). Interestingly, the given consensus trees, which are based on randomly selecting only 80% of all relevant COGs, coincide well with the trees based on selecting 100% of our COGs (compare Fig. 5 with Fig. 4; see Table 2).
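For illustration, the clade (branch-label) frequencies underlying such a consensus can be computed as in the sketch below, which reuses internal_clades from the partition-metric sketch in Section 2.5 (CONSENSE itself was used in this study; this is only an outline of the majority-rule idea, with helper names of our own):

    from collections import Counter

    def clade_frequencies(trees):
        # Fraction of input trees in which each clade (leaf subset) occurs.
        counts = Counter()
        for t in trees:
            counts.update(internal_clades(t))       # from the partition-metric sketch
        return {clade: counts[clade] / len(trees) for clade in counts}

    def majority_rule_clades(trees, threshold=0.5):
        # Clades occurring in more than `threshold` of the trees; their frequencies
        # correspond to the branch labels of the consensus tree.
        freqs = clade_frequencies(trees)
        return {clade: f for clade, f in freqs.items() if f > threshold}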


Fig. 3. Partition metric distributions of trees based on different values of our preprocessing parameter A. The four curves, reading from left to right, correspond to 15%-, 5%-, 35%-, and 50% filtering values.

One additional important point is the comparison of consensus trees obtained with different filtering rates. Comparing consensus trees based on an 80% subset of COGs under the two filtering rates (see Fig. 4a for 35% filtering and Fig. 4b for 15% filtering), we can conclude that, while the resulting tree topologies are quite similar (∼90%), the branch labels are larger in Fig. 4b. This fact can reasonably be interpreted as follows: there is some essential information contained only in small COGs, e.g., information specific to some small subgroup of organisms.

3.6. The consensus dendrogram is similar to the phylogenetic tree

Let us verify the phylogenetic viability of the tree presented in Fig. 4b.

1. Representatives of the two prokaryotic Kingdoms (Eubacteria and Archaea) are grouped separately. In technical terms, Archaeal organisms (genomes 0, 1, 8, 29–32, 35–50) form a monophyletic group (see the arrow marked A/B in Fig. 4b).
2. Euryarchaeota and Crenarchaeota form monophyletic groups (see the arrow marked E/C).
3. Representatives of different strains of the same bacteria are grouped together: 4 strains of M. maripaludis (genomes 41–44), 5 strains of C. jejuni (genomes 10–14), 2 strains of M. bovis (genomes 52–53), 8 strains of E. coli (genomes 21–28), 2 strains of H. pylori (genomes 33–34), and 2 strains of B. pseudomallei (genomes 4 and 5). Unfortunately, the 3 strains of M. tuberculosis (genomes 55–57) do not form a monophyletic group; however, all Mycobacterium strains taken together do form a monophyletic group. Interestingly, in Fig. 4a the M. tuberculosis strains (genomes 55–57) do form a monophyletic group (see the arrow marked M/B).
4. In addition, we notice certain elements of the 60-genome dataset that increase the instability of the consensus tree structure. Lower branch label values indicate unstable taxonomic units. For example, Archaeoglobus fulgidus is one such organism within the Euryarchaeota group, and Mycobacterium is an unstable clade among the bacterial clades.

4. Discussion and summarizing notes

In general, this paper has presented an application of the agglomerative information bottleneck algorithm to a set of completely sequenced prokaryotic genomes and the corresponding set of orthologous protein-coding gene lengths. We claim that this study offers an approach that will prove especially valuable for prokaryotic classification. Why do we single out prokaryotes?

• Protein-coding genes constitute the majority of any prokaryotic genome, in contrast to eukaryotes; therefore, considering only genic characters is a premise that seems more suitable for the classification of prokaryotic genomes. We assume that the method we have proposed may be successfully applied to various choices of numeric genic characters; however, our study has been devoted to only one genic character, namely, gene length.
• The most common ways in which protein length changes during prokaryotic evolution make gene length more suitable for the classification of prokaryotic genomes. Indeed, not only point insertions and deletions (indels) change protein length; domain duplication and reshuffling (Li, 1997; Graur and Li, 2000) do so as well, as does the excision/insertion of mobile elements. (As evidence that mobile elements affect protein size, we may note that fragments of transposable elements have been found in a large number of human proteins (Makalowski, 2000).) Our claim that the method is suitable for prokaryotic genic characters is based on the observation that changes in gene length are mainly the consequence of indels, whereas multi-domain architectures are far more frequent in eukaryotes than in prokaryotes (Gerstein and Levitt, 1998; Apic et al., 2001; Hoischen et al., 2004; Liu and Rost, 2004a,b; Ekman et al., 2005).
• Evolutionary studies of eukaryotic protein length distribution (Wang et al., 2005) have resulted in a surprising discovery: protein sequence lengths have been highly conserved since the common ancestor of yeast, nematode, Drosophila, Homo sapiens, and Arabidopsis.

Given the framework of the approach presented in (Bolshoy et al., 2010), there are various ways to transform a genome into a


Fig. 4. Consensus trees obtained from 100 genome trees, produced from 80% of the COGs randomly chosen, with (a) 35% filtering (left) and (b) 15% filtering (right). The A/B arrow points to the split between the Archaea and Bacteria groups. The E/C arrow points to the split between Euryarchaeota and Crenarchaeota inside the Archaea group. The M/B arrow points to the group of M. tuberculosis strains.

vector of numerical genomic characters; in this study, each genome was transformed into a numerical vector of gene lengths from homologous genes obtained using the COG database (Tatusov et al., 1997, 2000, 2001, 2003). Given such numerical vectors, there are many ways to calculate distances between them (Bolshoy et al., 2010); thus, there are numerous ways to transform data into a distance matrix for further classification (see a review of these data transformation methods in (San Mauro and Agorreta, 2010)). For example, several common algorithms can be used to construct a tree directly from pairwise distances, including UPGMA and NJ. In this study, we have conducted extensive experiments to validate the performance of bootstrapping and jackknifing to estimate the robustness of trees produced by our proposed method. These experiments show that randomization, when included in the bootstrap procedure, substantially decreases the stability of

the trees obtained, and that jackknifing is very useful for determining the stability level of a genome classification. We found that a jackknife rate of 80% should be used in further applications of our proposed method. We also concluded that a preprocessing filtering parameter is valuable (this parameter allows us to consider only those columns of the supermatrix that contain more elements than a certain threshold and to eliminate the other columns), with 15% serving as a reasonable value for this filtering parameter. To summarize, we are confident in proposing prokaryotic classification using the efficient and reliable method described in this manuscript, with the preprocessing parameter equal to 15% of the maximal COG size and the jackknifing parameter equal to 80%; this classification results in dendrograms strongly similar to prokaryotic phylogenetic trees.


Fig. 5. Dendrograms obtained on 100% of COGs with 0% randomization. (a) Genomic tree obtained using A = 35% preprocessing threshold. (b) Genomic tree obtained using the value of the filtering parameter A = 15%.

Acknowledgments We thank Lior Perlmuter-Shoshani and Zakharya Frenkel for technical assistance.

References

Allen, L., Steel, M., 2001. Subtree transfer operations and their induced metrics on evolutionary trees. Annals of Combinatorics 5, 1–15.
Apic, G., Gough, J., Teichmann, S.A., 2001. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. Journal of Molecular Biology 310, 311–325.
Bolshoy, A., Volkovich, Z., 2008. Whole-genome prokaryotic clustering based on gene lengths. Discrete Applied Mathematics 157, 2370–2377.
Bolshoy, A., Volkovich, Z., 2009. Whole-genome prokaryotic clustering based on gene lengths. Discrete Applied Mathematics 157, 2370–2377.
Bolshoy, A., Volkovich, Z., Kirzhner, V., Barzily, Z., 2010. Genome Clustering: From Linguistic Models to Classification of Genetic Texts. Springer-Verlag, Berlin, Heidelberg.
Bourque, M., 1978. Arbres de steiner et réseaux dont varie l'emplagement de certains sommets. Ph.D. Universite de Montreal, Montreal.
Ekman, D., Bjorklund, A.K., Frey-Skott, J., Elofsson, A., 2005. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. Journal of Molecular Biology 348, 231–243.
Estabrook, G.F., McMorris, F.R., Meacham, C.A., 1985. Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units. Systematic Zoology 34, 193–200.
Farris, J.S., Albert, V.A., Kallersjo, M., Lipscomb, D., Kluge, A.G., 1996. Parsimony jackknifing outperforms neighbor-joining. Cladistics – The International Journal of the Willi Hennig Society 12, 99–124.
Felsenstein, J., 1980. Phylip (phylogeny inference package). Distributed by the author, Department of Genetics, University of Washington, Seattle. http://www.phylip.com/.
Felsenstein, J., 1985. Confidence-limits on phylogenies – an approach using the bootstrap. Evolution 39, 783–791.
Garey, M.R., Johnson, D.S., Witsenhausen, H.S., 1982. The complexity of the generalized Lloyd–Max problem. IEEE Transactions on Information Theory 28, 255–256.
Gerstein, M., Levitt, M., 1998. Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Science 7, 445–456.
Graur, D., Li, W.-H., 2000. Fundamentals of Molecular Evolution. Sinauer Associates, Sunderland, MA.
Hintze, A., Adami, C., 2008. Evolution of complex modular biological networks. PLoS Computational Biology 4.
Hoischen, C., Bolshoy, A., Gerdes, K., Diekmann, S., 2004. Centromere parC of plasmid R1 is curved. Nucleic Acids Research 32, 5907–5915.
Huson, D.H., Steel, M., 2004. Phylogenetic trees based on gene content. Bioinformatics 20, 2044–2049.
Li, W.-H., 1997. Molecular Evolution. Sinauer Associates, Sunderland, MA.
Liu, J., Rost, B., 2004a. CHOP proteins into structural domain-like fragments. Proteins – Structure, Function and Genetics 55, 678–688.
Liu, J., Rost, B., 2004b. CHOP: parsing proteins into structural domains. Nucleic Acids Research 32, W569–W571.
Makalowski, W., 2000. Genomic scrap yard: how genomes utilize all that junk. Gene 259, 61–67.
Margush, T., Mcmorris, F.R., 1981. Consensus N-trees. Bulletin of Mathematical Biology 43, 239–244.
Mirkin, B., 1996. Mathematical Classification and Clustering. Kluwer Academic Press, Boston-Dordrecht.
Nye, T.M., Lio, P., Gilks, W.R., 2006. A novel algorithm and web-based tool for comparing two alternative phylogenetic trees. Bioinformatics 22, 117–119.
Rannala, B., Yang, Z.H., 1996. Probability distribution of molecular evolutionary trees: a new method of phylogenetic inference. Journal of Molecular Evolution 43, 304–311.
Robinson, D.F., Foulds, L.R., 1979. Comparison of weighted labelled trees. Lecture Notes in Mathematics, vol. 748. Springer-Verlag, Berlin, pp. 119–126.
Robinson, D.F., Foulds, L.R., 1981. Comparison of phylogenetic trees. Mathematical Biosciences 53, 131–147.
San Mauro, D., Agorreta, A., 2010. Molecular systematics: a synthesis of the common methods and the state of knowledge. Cellular and Molecular Biology Letters 15, 311–341.
Shi, J., Zhang, Y.W., Luo, H.W., Tang, J.J., 2010. Using jackknife to assess the quality of gene order phylogenies. BMC Bioinformatics 11.
Slonim, N., Tishby, N., 1999. Agglomerative information bottleneck. Advances in Neural Information Processing Systems (NIPS-12), MIT Press.
Slonim, N., Tishby, N., 2000. Document clustering using word clusters via the information bottleneck method. In: 23rd Annual ACM Conference on Research and Development in Information Retrieval, pp. 208–215.
Snel, B., Bork, P., Huynen, M.A., 1999. Genome phylogeny based on gene content. Nature Genetics 21, 108–110.
Steel, M.A., Penny, D., 1993. Distributions of tree comparison metrics – some new results. Systematic Biology 42, 126–141.
Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., Rao, B.S., Smirnov, S., Sverdlov, A.V., Vasudevan, S., Wolf, Y.I., Yin, J.J., Natale, D.A., 2003. The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4.
Tatusov, R.L., Galperin, M.Y., Natale, D.A., Koonin, E.V., 2000. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Research 28, 33–36.
Tatusov, R.L., Koonin, E.V., Lipman, D.J., 1997. A genomic perspective on protein families. Science 278, 631–637.
Tatusov, R.L., Natale, D.A., Garkavtsev, I.V., Tatusova, T.A., Shankavaram, U.T., Rao, B.S., Kiryutin, B., Galperin, M.Y., Fedorova, N.D., Koonin, E.V., 2001. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Research 29, 22–28.
Tekaia, F., Yeramian, E., 2005. Genome trees from conservation profiles. PLoS Computational Biology 1, 604–616.
Tindall, B.J., Sikorski, J., Smibert, R.A., Krieg, N.R., 2007. Phenotypic Characterization and the Principles of Comparative Systematics. ASM Press, Washington, DC.
Tishby, N., Pereira, F.C., Bialek, W., 1999. The information bottleneck method. In: 37th Allerton Conference on Communication and Computation, Allerton.
Wang, D., Hsieh, M., Li, W.H., 2005. A general tendency for conservation of protein length across eukaryotic kingdoms. Molecular Biology and Evolution 22, 142–147.
Wang, M., He, Y., Jiang, M., 2010. Text categorization of enron corpus based on information bottleneck and maximal entropy. In: IEEE 10th International Conference on Signal Processing (ICSP), Beijing, pp. 2472–2475.
Waterman, M.S., Steel, M., 1978. On the similarity of dendrograms. Journal of Theoretical Biology 73, 789–800.
Wilkinson, M., Thorley, J.L., 2001. Efficiency of strict consensus trees. Systematic Biology 50, 610–613.
Wolf, Y.I., Rogozin, I.B., Grishin, N.B., Tatusov, R.L., Koonin, E.V., 2001. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evolutionary Biology 1, 8.
Xiong, X.J., Yuan, G.J., Tan, K.L., 2004. Classification of multiple cancer types by using a two-stage multi-bottleneck-based classification method. Molecular and Cellular Proteomics 3, S265.