Operon prediction based on SVM

Operon prediction based on SVM

Computational Biology and Chemistry 30 (2006) 233–240 Brief communication Operon prediction based on SVM Guo-qing Zhang a , Zhi-wei Cao b , Qing-min...

387KB Sizes 68 Downloads 123 Views

Computational Biology and Chemistry 30 (2006) 233–240

Brief communication

Operon prediction based on SVM Guo-qing Zhang a , Zhi-wei Cao b , Qing-ming Luo a , Yu-dong Cai c,d,∗ , Yi-xue Li b,e,∗ a

Hubei Bioinformatics and Molecular Imaging Key Laboratory, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China b Shanghai Center for Bioinformation Technology, 100 Qinzhou Road, Shanghai 200235, China c Department of Chemistry, College of Sciences, Shanghai University, 99 Shang-Da Road, Shanghai 200436, China d Biomolecular Sciences Department, University of Manchester, Institute of Science and Technology, P.O. Box 88, Manchester M60 1QD, UK e Bioinformation Center of Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Yue Yang Road 320, Shanghai 200031, China Received 2 December 2005; received in revised form 17 March 2006; accepted 24 March 2006

Abstract The operon is a specific functional organization of genes found in bacterial genomes. Most genes within operons share common features. The support vector machine (SVM) approach is here used to predict operons at the genomic level. Four features were chosen as SVM input vectors: the intergenic distances, the number of common pathways, the number of conserved gene pairs and the mutual information of phylogenetic profiles. The analysis reveals that these common properties are indeed characteristic of the genes within operons and are different from that of non-operonic genes. Jackknife testing indicates that these input feature vectors, employed with RBF kernel SVM, achieve high accuracy. To validate the method, Escherichia coli K12 and Bacillus subtilis were taken as benchmark genomes of known operon structure, and the prediction results in both show that the SVM can detect operon genes in target genomes efficiently and offers a satisfactory balance between sensitivity and specificity. © 2006 Elsevier Ltd. All rights reserved. Keywords: Operon; SVM; Feature

1. Introduction An operon consists of a set of adjacent genes transcribed into a polycistronic message and is one of the key regulatory features of transcriptional regulatory networks of bacteria. Since genes within specific operons, in most cases, co-operate together to perform a certain function, an operon can be considered as a functional unit. The ability to create operon maps at the genomic level would be helpful in reconstructing regulatory circuits and to gain a detailed understanding of how genes interact with each other to perform specific functions. The genes found in an operon have common biological features, such as sharing a single promoter and single terminator, executing the same or similar functions, participating in the same metabolic or signal pathway, and have on average a much shorter intergenic distance than non-operon genes (Salgado et al., 2000). These unique features have made it possible to map operons at the genomic level and several simple methods based on utiliz-



Corresponding authors. Tel.: +86 2161313680. E-mail addresses: [email protected] (Y.-d. Cai), [email protected] (Y.-x. Li). 1476-9271/$ – see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2006.03.002

ing limited numbers of these features were developed recently (Ermolaeva et al., 2001; Salgado et al., 2000; Yada et al., 1999). These simple methods are based on restricted data sets, such as experimental promoter data, intergenic distances or conservation between gene pairs. Using such restricted data sets has the severe disadvantage that a good balance between sensitivity and specificity is not easily attained. In order to overcome these shortcomings, combinatorial methods were developed utilizing more features. One or more of the features mentioned above, as well as COG (clusters of orthologous groups) based functional classification and codon usage similarities can be utilized in a flexible way. For instance, these features can be used to form a numerically weighted set for gene pair scoring (Wang et al., 2004) and can be taken as data sources for machine-learning, such as Bayesian network and joint probabilistic distribution (Bockhorst et al., 2003; Chen et al., 2004; Price et al., 2005). They can also be converted into fuzzy values for genetic algorithms (Jacob et al., 2005). Some combined methods have been applied to novel genomes and evaluated against biological data (Edwards et al., 2005; Price et al., 2005). In summary, the combinatorial methods have led to some progress, however, this work has so far failed to describe the distribution of operons in the whole genome in detail and

234

G.-q. Zhang et al. / Computational Biology and Chemistry 30 (2006) 233–240

Table 1 Basic statistical analysis of Escherichia coli and Bacillus subtilis E. coli

B. subtilis

Gene numbers Total Protein genes Other genes

4411 4237 174

4225 4106 99

Gene clusters Total Single-gene clusters Multi-gene clusters Number of operons Number of genes in operons

1321 504 817 345 1228

1108 470 638 100 410

did not scrutinize the different prediction accuracies produced by alternative features. In this paper, we present a new method based on SVM (support vector machine) to predict operons at a genomic level in any target bacterial genome. Compared to the other methods described above, the SVM performs with high accuracy and reaches an optimal balance between sensitivity and specificity. 2. Materials and methods 2.1. Sequence data and tools The operons in Escherichia coli K12 and Bacillus subtilis are commonly used as benchmarks in operon prediction. The experimentally defined E. coli operons were provided by RegulonDB (Huerta et al., 1998). B. subtilis operons were obtained from http://www.cib.nig.ac.jp/dda/taitoh/bsub.operon.html. The basic information on both genomes is listed in Table 1. The benchmark and reference genomes, which were used to build a local BLAST (Altschul et al., 1997) database, were downloaded from the KEGG databases of sequenced bacterial genomes (http://www.genome.ad.jp/kegg/catalog/org list.html). A single species was taken per genus to avoid redundancy. In total 98

bacterial genomes were set as reference genomes. (Details are listed in Supplementary materials.) The SVM system used in this work is LIBSVM (Chang and Lin, 2001) and was obtained from http://www.csie. ntu.edu.tw/∼cjlin/libsvm/. In this program, gene features can be set as the input vectors for SVM. The Java routines used to extract these features are available on request. 2.2. Scheme of prediction All genes in a target genome can be grouped into clusters based on their orientation relative to flanking genes, i.e. genes in any one cluster are consecutive, located on one strand and are not interrupted by genes on the opposite strand (Edwards et al., 2005; Wang et al., 2004). A cluster may thus contain a single gene or any number of genes. A transcription unit (TU) lies between sites of initiation and termination by RNA polymerase and may cover more than one gene (Lewin, 2003). In other words, a transcription unit either encodes a single gene or forms an operon. It follows that all genes in a transcription unit must be located on the same strand, a cluster may contain one or more transcription units, while a transcription unit must only be located in one cluster. Any two adjacent genes can be designated as a gene pair, regardless of their orientation. According to whether or not both genes belong to an operon and are thus part of the same transcription unit, a gene pair can be either a WO (within operon) pair or a TUB (transcription unit boundary) pair (Chen et al., 2004). In order to illustrate the relationship of clusters, transcription units and gene pairs, we have taken as an example the segment of the E. coli genome shown in Fig. 1. In this nucleotide segment of eight genes, there are three clusters (cynX// lacA, lacY, lacZ, lacI, mhpR// mhpA, mhpB). The segment contains five transcription units, of which TU B (lacZ–lacy–lacA) makes up operon lacZYA and TU E (mhpA–mhpB) is part of operon mhpABCDFE. In this arrangement, the genes cynX, lacA, lacZ, lacI, mhpR, mhpA are on the boundaries of transcription units. If one now looks at the genes in

Fig. 1. Nucleotide segment of Escherichia coli genome sequence eight genes are located in this nucleotide segment. lacI is the regulator of operon lacZYA, and mhpR is the regulator of operon mhpABCDFE, which is partly shown in this example. As mhpR, lacI, lacZ, lacY, lacA are located on the complementary strand, the segment can be divided into three clusters. All genes in the segment belong to five TU (transcription units): TU A (cynX), TU B (lacA, lacY, lacZ), TU C (lacI), TU D (mhpR), TU E (mhpA, mhpB). Accordingly, cynX, lacA, lacZ, lacI, mhpR, mhpA are on the boundaries of transcription units. Hence, seven gene pairs in the segment can be classified into three WO pairs (lacA–lacY, lacy–lacZ, mhpA–mhpB) and four TUB pairs (cynX–lacA, lacZ–lacI, lacI–mhpR, mhpR–mhpA). In the four TUB pairs, cynX–lacA and mhpR–mhpA are TUB pairs of across clusters, while lacZ–lacI and lacI–mhpR are TUB pairs within clusters.

G.-q. Zhang et al. / Computational Biology and Chemistry 30 (2006) 233–240

pairs, the seven gene pairs in this segment can be classified into three WO pairs (lacA–lacY, lacy–lacZ, mhpA–mhpB) and four TUB pairs (cynX–lacA, lacZ–lacI, lacI–mhpR, mhpR–mhpA). If the genes of a pair belong to different clusters, they are classified as a TUB pair across clusters, such as the pairs cynX–lacA and mhpR–mhpA. If they belong to the same cluster, the pair is called a TUB pair within cluster, such as lacZ–lacI and lacI–mhpR. In general, if both genes are located in one cluster, the gene pair may be a WO pair or TUB pair within cluster; however, if two genes are located in different clusters, the gene pair must be a TUB pair across clusters. Predicted operons are thus the collection of consecutive WO pairs that are flanked by transcriptional boundaries. In this context, the process of operon prediction includes the following steps: (1) (2) (3) (4) (5)

Clustering the genes in a genome. Eliminating genes, which are located in single-gene clusters. Breaking down multi-gene clusters into potential WO pairs. Extracting features of candidate WO pairs. Identifying true WO pairs and eliminating TUB pairs from candidate WO pairs.

2.3. Extraction of gene features We evaluated features commonly used by other operon prediction and function analysis methods. Five features were chosen  = (x1 , x2 , x3 , x4 , x5 ), where X1–5 are, as SVM input vectors X respectively the intergenic distances, the number of common pathways, the number of conserved pairs, the mutual information (MI) of phylogenetic profiles, and the number of domain interactions. These features reflect different aspects of operons: the intergenic distance is a physical property, the number of common pathways is related to biological processes, the number of conserved pairs describes their degree of conservation, the mutual information is a measure of co-evolution, and the number of domain interactions is related to the common function of the genes within an operon. Other common features of gene pairs were excluded, as their biological meaning is already covered by inclusion of their degree of conservation. The intergenic distances can be extracted from the genome annotation files directly. The information on the common pathways and domain interactions are based on the target genome and other databases, including the KEGG pathway database, the Pfam (Bateman et al., 2004), and iPfam (Finn et al., 2005) databases. The other two features, the conserved pairs and the MI of phylogenetic profiles, were extracted from the BLASTP results of the genes of the target genome against those of the reference genomes. The selective BLAST e-value was set to 1e−5 for the extraction process. For simplification, it is assumed that a gene pair consists of two adjacent genes: gene A and gene B, where gene B is located at the right side of gene A. 2.3.1. Intergenic distances The intergenic distance x1 of a gene pair is calculated as the start of gene B minus the end of gene A. The distances in WO pairs are generally much shorter than those in TUB pairs. Most

235

intergenic distances of genes within operons are below 100 bp (see Fig. 1). This trend is more obvious in E. coli than in B. subtilis. 2.3.2. Common pathways As described above, the genes within an operon often participate in one biological process, and their co-operation may occur sequentially or contiguously, which their organization into an operon may reflect. The information regarding which pathway the genes belong to in the annotated genomes was provided by the KEGG pathway database. The pathway of new genomes was recreated from the BLAST results of the genes of the target genome against those of the annotated genomes. The numerical value x2 (one of the SVM input vectors) of a gene pair was calculated using the following rule: if both genes of a gene pair can be found in n pathways simultaneously, x2 is equal to n, if either gene has no pathway information, or both genes do not share any common pathways, x2 is zero. It is clear that this feature relies on the integrity of the KEGG pathway database. 2.3.3. Conserved pairs Genes within operons are often conserved between different genomes. If both genes (A, B) in one gene pair have a homologous gene pair (A , B ) in another genome, which means A is homologous to A , and B is homologous to B , this gene pair is a conserved pair. A gene pair is deemed not to be conserved if the similarity between A and B is higher than the similarity between A and A or B and B (Ermolaeva et al., 2001). However, local gene rearrangements happen frequently during evolution, which may disrupt the gene order in gene clusters. To account for possible gene insertions, deletions and rearrangements during evolution, gapped conserved pairs were allowed, such as AB versus A xB and A xyB (here x, y stand for genes, the maximum number of intervening genes is set as 2) (Zheng et al., 2002). Conserved pairs were obtained by applying BLASTP to the genes of the target genome against those of the reference genomes. The defining cut-off limits of homology were set as follows: the alignment of query gene against hit gene has to cover more than 50% (Price et al., 2005), the e-value has to be below 1e−5 (Li et al., 2003), and the identity has to be more than 20%. For every gene pair, the value x3 is the number of conserved pairs in all reference genomes. To reduce the effects of duplicated genes in a genome, additional conserved gene pairs in the same reference genome were ignored. The number of conserved pairs depends on the number of reference genomes and the definition of homology. 2.3.4. Mutual information of phylogenetic profiles The phylogenetic profile of a gene describes the presence or absence of homologs in the reference organisms (Pellegrini et al., 1999). The profile is an array with n entries, where n corresponds to the number of reference genomes. Each element in the profile includes information about this gene and the related genes in the corresponding genomes. Genes with similar phylogenetic profiles have often evolved in a correlated fashion. The information that lies in similar genetic profiles can be extracted efficiently (Date and Marcotte, 2003). The information gained

236

G.-q. Zhang et al. / Computational Biology and Chemistry 30 (2006) 233–240

from a comparison of profile A and B depends on their degree of co-occurrence. It tends to zero with decreasing co-occurence. The mutual information (MI) of the phylogenetic profiles at the whole genome level is calculated as follows:

that are treated as positive data and TUB pairs that are treated as negative data. The SVM should define a hyper plane to classify them

1. Construction of the raw phylogenetic profiles: The profile of each gene A is a vector consisting of elements q(A)j (j = 1, 2, . . ., n), corresponding to the information gained from a comparison between the target genome and the jth genome in the reference genomes. In the literature, there are several definitions of q(A)j (Chen et al., 2004; Date and Marcotte, 2003; Pellegrini et al., 1999). Here, q(A)j is defined as the highest percent identity between gene A and all genes in the jth genome. 2. Probabilities calculation: All values of q(A)j that represent percent identity can be classified into 11 values on a scale of 0.0–1.0. If there are k values of q(A)j they can be classified into one of these 11 values; the frequencies of these values can be calculated as k/n. For gene A, the frequencies can be treated as relative probability, which can be represented as p(A)i (i = 1, 2, . . ., 11). The joint probability p(A,B)i (i = 1, 2, . . ., 11) of gene A and B is the probability of both genes having value i and can be calculated analogously to that of p(A)i . 3. Calculation of mutual information: Given the probability profile, the mutual information x4 can be calculated with the following equation:

where yi is the class index, the value is +1 for all WO pairs, and −1 for all TUB pairs; w is the weight vector; b is the offset value.  i = (xi1 , xi2 , . . . xin )T , where xi is the ith gene pair, xij (j = 1, X 2, . . ., n, n is the feature number) is the ith input feature vector described above. The margin, denoting the distance of the hyper plane to the nearest of the WO and TUB pairs, 2/||w|| is and evaluates the classification ability of SVM. For optimal resolution of the data, the SVM analysis should arrive at an optimized hyper plane with a maximal margin. It can be expressed as the following optimization problem:  min 21 wT w

MI(A, B) = H(A) + H(B) − H(A, B)  where H(A) = − 11 i=1 P(A)i ln p(A)i represents the entropy that is a measure of probability of event A. H(A, B) = − 11 I=1 p(A, B)i ln p(A, B)i represents the entropy that is a measure of probability of co-occurrence of genes A and B. 2.3.5. Domain interaction Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common gene domains and families (Bateman et al., 2004). The iPfam data set describes domain–domain interactions that are observed in PDB entries (Finn et al., 2005). KEGG is an integration of the Pfam domain information and the GENE database. As one gene may contain several domains, and one domain may interact with several domains, one gene pair may have several interacting domain pairs. The value of x5 is the number of interacting domain pairs between gene A and gene B. The value depends on the integrity of Pfam and iPfam databases. 2.4. Support vector machine theory SVM is a classification algorithm based on statistical learning theory. It has been successfully used to tackle several biological problems, such as functional prediction, membrane gene classification and structural classification (Cai et al., 2003, 2002, 2004; Platt, 1998; Simon, 1999). With respect to the task of operon prediction, genes pairs fall into two classes: WO pairs

yi (wT xi + b) ≥ 1,

∀i

yi (wT xi + b) ≥ 1,

∀i

The theory of SVM had been widely described in the literature (Cai et al., 2002; Chang and Lin, 2001; Platt, 2005), thus only a simplified introduction into the basic ideas behind SVM for the two class classification problems is given here. By introducing Lagrange multipliers αi , the optimization problem converts into a dual form that is a quadratic programming (QP) problem where the objective function Q can be described as max Q(α) =

N 

N

αi −

i=1

N

1  αi αj yi jj xiT xi 2 i=1 j=1

This is subject to the following two conditions: ⎧ N ⎪ ⎪ ⎨ αi y i = 0 i=1 ⎪ ⎪ ⎩ α ≥ 0, ∀i i Once a SVM is trained, the decision function can be written as  N  yi αi xxi + b f (x) = sgn i=1

In non-linearly separable cases, the SVM technology introduces the slack variable {ξi }N i=1 to the definition of hyper plane  xi + b) ≥ 1 − ξi , yi (w

∀i

And the problem can be converted into a QP problem ⎧ N N N ⎪  1  ⎪ ⎪ ⎪ max Q(α) = α − αi αj yi jj xiT xi i ⎪ ⎪ 2 ⎪ ⎪ i=1 i=1 j=1 ⎨ N  ⎪ ⎪ αi y i = 0 ⎪ ⎪ ⎪ ⎪ i=1 ⎪ ⎪ ⎩ 0 ≤ α ≤ C, ∀i i

where C is a user chosen positive parameter.

G.-q. Zhang et al. / Computational Biology and Chemistry 30 (2006) 233–240

SVM maps the input variable into a high dimensional feature space with a kernel function K(xi , xj ). With this mapping, the target problem can be described as ⎧ N N N ⎪  1  ⎪ ⎪ ⎪ max Q(α) = α − αi αj yi jj K(xi , xi ) i ⎪ ⎪ 2 ⎪ ⎪ i=1 i=1 j=1 ⎨ N  ⎪ ⎪ αi y i = 0 ⎪ ⎪ ⎪ ⎪ i=1 ⎪ ⎪ ⎩ 0 ≤ α ≤ C, ∀i i

There are two typical kernel functions: linear kernel and RBF kernel (also called Gaussian kernel) ⎧ T ⎨ x xi

K(xi , xj ) = 1 2 ⎩ exp − ||x − xi || 2σ 2 The decision function can be rewritten as  N  f (x) = sgn yi αi K(xxi ) + b i=1

For a given problem, once the kernel function and the regularity parameter C are selected, the constructed SVM system can be used to solve related problems. In this paper, two kernel functions have been tested: linear kernel and RBF kernel. 3. Result and discussion Any operon prediction method can be evaluated through a comparison of the predicted operons with the experimentally defined operons. As the predicted and experimental operons may be of different sizes, we have compared gene pairs derived from the predicted and experimentally defined operons (Jacob et al., 2005). If any genes of a gene pair belong to an experimental discovered operon, the gene pair may be an experimental WO pair or TUB pair. Otherwise, the gene pair is not included by benchmark data, as no experimental evidence to prove whether the genes in the pair are organized as an operon. There are four kinds of gene pairs: predicted WO pairs, predicted TUB pairs, experimental WO pairs and experimental TUB pairs, which can be used to calculate three objective measures: sensitivity, specificity and accuracy (Moreno-Hagelsieb and Collado-Vides, 2002). The sensitivity is the fraction of true predicted WO pairs out of the total experimental WO pairs. The specificity is the ratio of true predicted TUB pairs compared to total experimental TUB pairs. It should be noted that the TUB pairs across clusters are excluded in the calculation of specificity. We have calculated accuracy using a Jackknife based test, which uses the ratio of true predicted WO and TUB pairs against the total experimental pairs. Although both protein genes and RNA genes can be expressed as operons, RNA operons have a very low proportion in genomes and can not be assessed with BLASTP, thus RNA genes were ignored in the evaluation. In our SVM system, the E. coli operons are used as training data, while E. coli and B. subtilis data are used as test data.

237

3.1. Distribution of available features The available statistical features of E. coli genes are listed in Fig. 2. Most intergenic distances of WO pairs are between 0 and 100 bp. The distribution of intergenic distances of TUB pairs is wider than that of the WO pairs. Almost half of WO pairs share common pathways, while very few TUB pairs are involved in the same pathway. Thus, when both genes in a gene pair share a common pathway, the probability of the pair to be a WO pair is very high. The number of TUB pairs is greater than that of WO pairs where the number of conserved pairs x3 is smaller than 3, while the number of WO pairs exceeds that of TUB pairs when x3 is larger than 3. The greater the mutual information of the phylogenetic profiles, the more obvious is the difference between WO and TUB pairs. Of five data sets, four data sets display obvious differences between WO and TUB pairs, while the data set of domain interaction has no obvious difference. The reason may be that the most of interaction data in the iPfam database are physical interactions derived from PDB data, when actually the interaction of operon genes may be functional rather than physical. In order to investigate the relationship of the employed features, least squares fitting was applied to standardized feature values yielding y = −2.0077x1 + 0.9137x2 + 0.3705x3 + 0.2800x4 − 0.0348x5 where y is 1 for WO pair, and −1 for TUB pair; xi (i = 1, 2, 3, 4, 5) is the ith input vector. When the linear prediction method incorporating these least squares values was applied to the E. coli benchmark data, the following values were obtained for sensitivity, specificity and accuracy: 0.9341, 0.6339, and 0.8219, respectively, with a standard deviation of 0.0208. The obtained specificity is thus too low for satisfactory prediction of operons. This result suggests that the intergenic distances may be the most important vector, whilst the domain interaction is the least distinguishing value. The domain interaction feature was thus omitted from the SVM system. 3.2. Jackknife testing on features and kernels Jackknife testing (otherwise known as leave-one-out approach) was used for cross-validation, and the accuracy was defined as the fraction of overlap between the known and predicted WO and TUB pairs over all of the benchmark data. In Jackknife tests, each gene pair is used as testing data while the other data in the benchmark dataset are used as training data. The results are shown in Table 2. When only intergenic distance was applied as input vector, the prediction performance between the linear and RBF kernel was similar. As more features were introduced, the performance of RBF kernel exceeded that of the linear to a small extent. The RBF kernel SVM with four feature input vectors achieved the highest accuracy, thus the final SVM is based on RBF kernel and four input feature vectors.

238

G.-q. Zhang et al. / Computational Biology and Chemistry 30 (2006) 233–240

Fig. 2. Distribution of features of E. coli.

3.3. E. coli and B. subtilis prediction result The SVM system accuracy was evaluated by two sets of benchmark operons: E. coli and B. subtilis. The evaluation results were listed in Table 3. Both sensitivity and specificity were higher than 80%, which means the SVM achieved a good balance between the two measurements. When the SVM system was applied to the whole genome, 837 operons in E. coli and 759 operons in B. subtilis were detected, which are presented in Supplementary materials. Furthermore, as shown in Fig. 3, the distribution of the decision values of bench pairs (WO and TUB pairs) and unknown pairs was similar in both genomes. In E. coli, 92 WO pairs and 97 TUB pairs are mis-assigned. In these mis-assigned data, we analyzed the distribution of input features, as listed in Table 4. The distribution of mis-assigned

Table 3 SVM prediction results

Table 2 Jackknife test results in E. coli Kernel type

Linear RBF

data is distinctly different to the distribution of all benchmark data. The mis-assigned WO pairs have longer intergenic distances, while mis-assigned TUB pairs have shorter intergenic distances. Few mis-assigned WO pairs have common pathways, while only a few mis-assigned TUB pairs share common pathways. A few mis-assigned WO pairs are conserved in more than three genomes, while many mis-assigned TUB pairs are conserved in more than three genomes. In general, the features of mis-assigned WO pairs are similar to that of experimentally confirmed TUB pairs, while the features of mis-assigned TUB pairs are similar to that of experimentally confirmed WO pairs. For such gene pairs, the currently employed features appear not to be sufficient to distinguish WO pairs from TUB pairs and vice versa.

Feature 1000

1100

1110

1111

0.8211 0.8265

0.8310 0.8433

0.8379 0.8486

0.8417 0.8502

Four features are suitable as input vectors: intergenic distances, the number of common pathways, the number of conserved pairs and the mutual information of phylogenetic profiles. If the ith feature is chosen, the ith character of the feature group name is set at 1, otherwise, it is set at 0. For example, 1100 means first and second features (intergenic distance and the number of common pathways) are taken as input vectors. Linear means the data are trained with a linear kernel, while RBF means the data are trained by RBF kernel.

E. coli

B. subtilis

Verified WO pairs Predicted WO pairs Verified TUB pairs Predicted TUB pairs

819 727 489 392

310 279 121 104

Sensitivity Specificity Accuracy

727/819 = 88.77% 392/489 = 80.16% 85.55%

279/310 = 90.00% 104/121 = 85.95% 88.86%

WO pairs and TUB pairs are defined in the text. In prediction results, verified means that the results match with experimental data, predicted means SVM prediction results including verified results. The prediction is based on four input feature vectors and the RBF kernel SVM.

G.-q. Zhang et al. / Computational Biology and Chemistry 30 (2006) 233–240

239

Table 4 The distribution of mis-assigned gene pairs Feature

Feature scale

Mis-assigned WO pair

Mis-assigned TUB pair

Intergenic distance

<0 0–49 50–99 100–149 150–199 ≥200

0 0 30 28 12 22

12 49 23 7 3 3

Number of common pathways

0

87

83

1 2

5 0

11 3

0

36

17

1 2 3 4 ≥5

27 5 7 6 11

21 13 12 5 29

0.0–0.1

47

32

0.1–0.2 0.2–0.3 0.3–0.4

32 12 1

43 15 7

Number of conserved pairs

Mutual information of phylogenetic profile

The SVM method failed to identify some WO and TUB pairs. These gene pairs are mis-assigned and are false negative or false positive data, respectively. In the paper, mis-assigned WO pairs are gene WO pairs in the experimentally defined benchmark genomes and are predicted to be TUB pairs. Fig. 3. Distribution of decision values in E. coli and B.subtilis. The SVM can classify a gene pair by the decision value that is calculated by a decision function. If the decision value is greater than zero, the pair is a WO pair, otherwise, the pair is a TUB pair. Bench pairs contain WO and TUB pairs. Unknown pairs contain the pairs for which no experiential assignment exists.

3.4. ROC curve analysis An ROC (receiver operating characteristic) curve is a graphical representation of the trade off between false negative and false positive rates for every possible cut off. In other words, the ROC curve is the representation of the tradeoff between sensitivity and specificity. By convention, the false positive rate is plotted as the abscissa and the true positive rate as the ordinate. The area under the ROC curve is a measure of the predictive value of the method. The ROC curves of both genomes are shown in Fig. 4. The areas under both curves are approximately 0.9 which indicates that balance of the sensitivity and specificity has been achieved. 4. Conclusion

Fig. 4. ROC curves of E. coli and Bacillus subtilis.

The prediction results from two microbial genomes reveal that the SVM system can detect operons at the whole genome level. The intergenic distances, the number of common pathways, the number of conserved pairs and MI of phylogenetic profiles are quite different between WO and TUB pairs, and thus make appropriate input vectors; the parameter related with domain interaction is indistinctive between WO and TUB pairs and is therefore excluded in the SVM. The prediction results show high accuracy in the benchmark tests, which suggests that this approach is highly suitable for operon mapping in bacterial genomes.

240

G.-q. Zhang et al. / Computational Biology and Chemistry 30 (2006) 233–240

Acknowledgements The work is supported by the 863 Hi-Tech Program of China (2004BA711A21), the National Key Technology R & D Program of China (2005BA711A04), the State Key Program of Basic Research of China (Nos. 2003CB715900, 2004CB518606), and the Key Program of Basic Research of Shanghai (03XD14018, 04QMX1450). Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.compbiolchem. 2006.03.002. References Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389– 3402. Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L.L., Studholme, D.J., Yeats, C., Eddy, S.R., 2004. The Pfam protein families database. Nucl. Acids Res. 32, D138–D141. Bockhorst, J., Craven, M., Page, D., Shavlik, J., Glasner, J., 2003. A Bayesian network approach to operon prediction. Bioinformatics 19, 1227– 1235. Cai, C.Z., Han, L.Y., Ji, Z.L., Chen, X., Chen, Y.Z., 2003. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. Nucl. Acids Res. 31, 3692– 3697. Cai, Y.D., Liu, X.J., Xu, X.b., Chou, K.C., 2002. Prediction of protein structural classes by support vector machines. Comput. Chem. 26, 293– 296. Cai, Y.D., Ricardo, P.W., Jen, C.H., Chou, K.C., 2004. Application of SVM to predict membrane protein types. J. Theor. Biol. 226, 373–376. Chang, C.-C., Lin, C.-J., 2001. LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/∼cjlin/libsvm. Chen, X., Su, Z., Xu, Y., Jinag, T., 2004. Computational prediction of operons in Synechococcus sp. WH8102. Genome Inform. Ser. Workshop Genome Inform. 15, 211–222.

Date, S.V., Marcotte, E.M., 2003. Discovery of uncharacterized cellular systems by genome-wide analysis of functional linkages. Nat. Biotech. 21, 1055–1062. Edwards, M.T., Rison, S.C.G., Stoker, N.G., Wernisch, L., 2005. A universally applicable method of operon map prediction on minimally annotated genomes using conserved genomic context. Nucl. Acids Res. 33, 3253–3262. Ermolaeva, M.D., White, O., Salzberg, S.L., 2001. Prediction of operons in microbial genomes. Nucl. Acids Res. 29, 1216–1221. Finn, R.D., Marshall, M., Bateman, A., 2005. iPfam: visualization of proteinprotein interactions in PDB at domain and amino acid resolutions. Bioinformatics 21, 410–412. Huerta, A.M., Salgado, H., Thieffry, D., Collado-Vides, J., 1998. RegulonDB: a database on transcriptional regulation in Escherichia coli. Nucl. Acids Res. 26, 55–59. Jacob, E., Sasikumar, R., Nair, K.N.R., 2005. A fuzzy guided genetic algorithm for operon prediction. Bioinformatics 21, 1403–1407. Li, L., Stoeckert Jr., C.J., Roos, D.S., 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189. Lewin, B., 2003. Genes VIII, Prentice Hall, Upper Saddle River, NJ. Moreno-Hagelsieb, G., Collado-Vides, J., 2002. A powerful non-homology method for the prediction of operons in prokaryotes. Bioinformatics 18, 329S–336S. Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., Yeates, T.O., 1999. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. PNAS 96, 4285–4288. Platt, J., 1998. Sequential minimal optimization: a fast algorithm for training support vector machines. https://research.microsoft.com/ users/jplatt/smoTR.pdf. Price, M.N., Huang, K.H., Alm, E.J., Arkin, A.P., 2005. A novel method for accurate operon predictions in all sequenced prokaryotes. Nucl. Acids Res. 33, 880–892. Salgado, H., Moreno-Hagelsieb, G., Smith, T.F., Collado-Vides, J., 2000. Operons in Escherichia coli: genomic analyses and predictions. PNAS 97, 6652–6657. Simon, H., 1999. Neural Networks: A Comprehensive Foundation, PrenticeHall, New Jersey. Wang, L., Trawick, J.D., Yamamoto, R., Zamudio, C., 2004. Genomewide operon prediction in Staphylococcus aureus. Nucl. Acids Res. 32, 3689–3702. Yada, T., Nakao, M., Totoki, Y., Nakai, K., 1999. Modeling and predicting transcriptional units of Escherichia coli genes using hidden Markov models. Bioinformatics 15, 987–993. Zheng, Y., Roberts, R., Kasif, S., 2002. Genomic functional annotation using co-evolution profiles of gene clusters. Genome Biol. 3 (research0060.1–research0060.9).