Ckm. Vol. 20, No. I, pp. 123-133, 1996 Copyright 0 1996 Elsevier Science Ltd Printedin Great Britain. All rights reserved 0097-8485/96 $15.00 + 0.00
Com/Mers
0097~95)ooo~7
STATISTICAL ANALYSIS OF GENEMARK PERFORMANCE BY CROSS-VALIDATION JUERGEN KLEFFE,’ KLAUS HERMANN’ and MARK BORODOVSKY2* ‘Department of Molecular Biology and BioInformatics, Institute of Molecular Biology and Biochemistry, Free University of Berlin, Amimallee 22, D-14195, Berlin, Germany *School of Biology, College of Science.,Georgia Institute of Technology, Atlanta, GA 30332-0230,U.S.A. (Received 20 September 1995)
Abstract-We have explored the performance of the GeneMark gene identification method using cross-validation over learning samples of E. coli DNA sequences. The computations gave more accurate
estimations of the error rates in comparison with previous results when a sample of non-coding regions was derived from GenBank sequences with many true coding regions unannotated. The error rate components have been classified and delineated. It was shown that the method performs differently on class I, II and III genes. The most frequent errors come from misinterpreting the coding potential of the
complementary sequence in the same frame. The effectsof stop-codons present in alternative frames were also studied to understand better the main factors contributing to GeneMark performance.
INTRODUCTION Fickett and Tung (1992) studied the efficiency of many measures for coding potential of DNA sequence fragments and emphasized the importance of systematic comparisons and derivation of bench marks for decision making by statistical analysis. The GeneMark method, as more recently recognized, was only briefly mentioned. This method has been proven to be a useful tool for discovering genes in newly determined DNA sequences (Sofia et al., 1994; Fleischmann et al., 1995), as well as in those that have been already analysed in experiments and submitted to the GenBank database [see Borodovsky et al. (1994a, b)]. The GeneMark method is mathematically tractable and in this paper we collect detailed results of elaborate assessments of the GeneMark method prediction error rates under a variety of conditions. All error rates have been obtained by cross-validation over samples of E. coli coding and non-coding DNA sequences. It has been shown that the method performs differently on class I, class II and class III genes (Medigue et al., 1991), thus, the benchmarks have to be derived case dependently for each species and, most likely, as in the E. coli case, separately forgroups of sequences with homogeneous DNA composition. THE GENEMA=
MODEL FOR CODING AND
NON-CODING
REGIONS
cations of Markov models to functional regions of DNA (Borodovsky et al., 1986; Kleffe, 1991; Kleffe & Borodovsky, 1992; Borodovsky & McIninch, 1993). This method was developed under the assumption that DNA sequences can be partitioned into segments of three categories with regard to the distribution of protein-coding information among two complementary DNA strands (Fig. 1): (i) DNA sequence segments that code for a protein (in the direct strand). (ii) DNA sequence segments that are complementary to those coding for a protein (located in the inverse strand). (iii) DNA sequence segments which do not code for a protein (in both strands). According to current knowledge segments of different categories do not overlap for species higher than phages and viruses. Earlier algorithms have distinguished only sequence segments of categories (i) and (iii). The output of these algorithms is unpredictable for sequence segments of category (ii). The GeneMark method describes these three categories of sequence segments by three different Markov models. A DNA sequence Markov model of order k relates the frequency of a nucleotide b in a given sequence position to the word of length k that immediately precedes b. The model is characterized by the set of transition probabilities P(b ( W) for WE {words of length k}
The GeneMark method for identification of protein coding regions emerged from a series of appli-
and b E {A, C, G, T} and stationary initial probabilities
*To whom correspondence should be addressed. E-mail:
[email protected]. edu.
P,,(W) for WE {words of length k}. 123
Juergen Kleffe et al.
124
gene 1 inverse strand direct strand segment category Fig. 1. Sequence segments of category (i), (ii) and (iii) along the direct strand of DNA.
The Markov model is called homogeneous if the transition probabilities P(b 1W) do not depend on the position of W in the sequence. Such a model is used to describe sequence segments of category (iii). However, to incorporate the reading frame of genetic code into the model for sequence fragments of category (i) and (ii) the GeneMark method assumes the probabilities P (b 1W) and PO( W) to depend on Wand also on the frame position where ,the word W begins within the sequence. Such a model is called periodic and is described by three sets of probabilities P,(b 1W) for WE {words of length k} and b E (A, C, G, T} Poi(W) for WE (words of length k} where i = 1,2,3 describes the frame position of W. These probabilities are estimated from counts of overlapping words of length k + 1 in sequences of known category. The theory of Markov chains (Billingsley, 1961) recommends to estimate P,(b 1W) and P,,,(W) by the formulae Pi(b I W)
ni(bI W) = ni(A I W) + ni(C I W) + ni(G1W)
+ ni(TJ W)
poivv = ni(A 1W) + ni(C ( W) + ni(G I W) + ni(TI W) X where n,(b IW) denotes the count of events that the word W is observed beginning in frame position i of a sequence fragment and is followed by the nucleotide b and Ni is the total count of words of length k + 1 beginning in frame position i. These probabilities, which are different for sequence segments of the three categories, are then used to calculate probabilities for sequence fragments S = (b, , . . . , b,) of unknown type by the formula
Wltype) = Po,dWI)
n
j=l,...,n-k
f’/(j,(bj+,IWj)
wheref(j) is the frame position of thejth nucleotide and Wj is the word of length k beginning with the jth nucleotide. The frame position of nucleotides is irrelevant if we calculate these probabilities using the model for non-coding regions. However, for calculation of these probabilities for the other two models we need to
knowf(Z) what is certainly not the case for sequence fragments of unknown type. It is therefore necessary to further subdivide sequence segments of categories (i) and (ii) according to the initial frame positionf([). We therefore consider seven types of sequence segments defined as follows: Type 1: segments of category (i) which begin in frame position 1. Type 2: segments of category (i) which begin in frame position 2. Type 3: segments of category (i) which begin in frame position 3. Type 4: segments of category (ii) which begin in frame position 1. Type 5: segments of category (ii) which begin in frame position 2. Type 6: segments of category (ii) which begin in frame position 3. Type 7: segments of category (iii). It is then possible to assign a set of seven probabilities to any sequence fragment assuming that the fragment is of a pure single type out of possible seven. Let us probabilities by P(Slm) for denote these m = 1, . . . ,7. Under the assumption that a-priori probabilities a,,, for sequence segments of the seven types are given, the GeneMark algorithm determines the a-posteriori probabilities
&%l(s) =
%lP(S Im) =P(mlS)form=l,...,7
C ajp(s Ii) to classify sequence fragments. The a-priori probabilities are CI,= cz2= . . . = a6= l/12 and a, = l/2. The performance of the method depends on how the a-posteriori probabilities are distributed over randomly sampled sequence fragments. To start with Markov models of order one we show in Fig. 2a the empirical distribution functions of the a-posteriori probabilities over 8908 type 1 sequence fragments of 96 nucleotides length from E. co/i class I genes. The shape of the distribution function of g,(S) deviates largely from those of g,,,(S) for m =2,. . . ,7. The a-posteriori probability g4(S) has a distribution function closest to that of g,(S). It corresponds to the correct reading frame but on the anti-sense strand of the DNA sequence. This observation corresponds to the known fact that strong coding signals are frequently found at the anti-sense strand in phase with experimentally confirmed genes
Statistical analysis of GeneMark performance by cross-validation in the sense strand. In particular, Merino et al. (1994) have shown recently that a substantial number of genes from several species are antiparallel to antisense ORFs that entirely overlap the coding regions. Based on the RNY gene sequence model by Shepherd (1981) one can get an insight into this phenomenon since the RNY triplet is selfcomplementary, i.e. RNY triplets in true genes generate RNY triplets on their complementary strand which contribute to elevated values of g4(S). We can therefore expect misclassification of a strand to be a relatively frequent error. Next follows the distribution of g,(S) for the noncoding model. The distribution functions of g*(S), g,(S), g,(S), g6(S) are almost degenerated and make misclassification of a reading frame very unlikely. The decision problem practically reduces to choose (i) between coding, non-coding events and (ii) between strands if the region is predicted as coding.
125
Figure 3a presents the empirical distribution functions of g,,,(S) for m = 1, . . ,7 over 2976 fragments of non-coding E. coli sequences of 96 nucleotides length using Markov models of order 1. Here the shape of the distribution function of g,(S) differs considerably from all the others which are more or less similar. However, only about 81% of the sequence fragments generate values g,(S) greater than 0.5 indicating a misclassification rate of about 19%. Figures 2b and 3b show an analogous result for the second-order processes used in the GeneMark method. Now, for higher-order models, the differences between the distributions of g,,,(S) increased, indicating a better performance of the GeneMark method. Only 9% of the coding fragments and 13% of non-coding fragments are now mispredicted. Note that the order of the Markov model cannot be increased without limit. If the order grows for a
Empirical distribution functions of gl ,...,g7 over 8908 coding sequence fragments (E. coli I) of 96 nucleotides order 1
1
096
-
61
-02
-63
~
65
-66
-
Fig. 2(a)
-sr 97
qg1
126
Juergen Kleffe er al. Empirical dietrlbutlon functions of gl,...,g7 over 8908 coding eequence fragmente (E. toll I) of 96 nucleotidee order 2
0.6
O,l
0
0,2
0,3
44
45
0,0
47
0,8
0,s
-
Ql
-@
-@
-sr
-
gs
-ss
-Q
. . . . . . qg1 < 0.5)
1
Fig. 2(b) Fig. 2. (a) Empirical ~st~buti~n functions of g, , . . . , g, obtained by the ~gorit~
using first-order models for 8908 protein coding fragments of 96 bp from class 1 E. c&i genes. The random variables g, take the value g,(S) for a random sequence fragment S. (b) The same as in (a), except the results were obtained by the algorithm using second-order models.
fixed number of sequence fragments in the training sample, the Markov model tends to remember the type of each sequence fragment based on the unique words they contain. This artifact, called overidentification or overfrtting, results in rejection of all sequence fragments not included into the training sample. Hence the method can no longer be used for accurate prediction. For this reason we have to be careful about choosing the order of the Markov models. With growing order k more and more of possible words of length k + 1 do not appear in the training sample of sequence fragments causing related transition probabilities to be zero. We call such a model a singular one opposite to the non-singular
model with all elements of transition probability matrices being positive. Specific types of non-singular models can be obtained from rather small sequence samples if we initialize word counts not by zeros but by ones. The model starts with uniform distribution of words and then gradually deviates from this distribution upon training. We call such a model a forced non-singular model.
DECISION RULES AND ERROR RATES
The precise evaluation of the GeneMark performance requires rigorously defined decision rules based
Statistical analysis of GeneMark performance by cross-validation
on the computed a-posteriori probabilities definition we have
gi(S). By
Cg,(S) = 1 for all S i
and hence at most one of the values g,(S) can be greater than 0.5 for a given S. It is suggested in Borodovsky and McIninch (1993) to predict a sequence fragment S as coding in the corresponding strand and frame if one of the values g,(S), . . . , g6(S) is greater than 0.5. In all other cases the sequence fragment is predicted as non-coding regardless of the value of g,(S). The reasoning for this rule is to stress the necessity of strong evidence of a coding potential. If the statistical evidence is weak it is disregarded until another type of evidence (sequence similarity or
127
biochemical data) would again turn the attention of a biologist to the sequence fragment. In a sense such a prediction may omit uninteresting but not necessarily non-coding fragments. More generally this type of decision rule is defined for any L 3 0.5 as follows: A:
If g,(S) > L predicted to If g,(S) < L dicted to be
for some i E {I, . ,6} then S is be of type i. for all i E (1,. . . ,6} then S is preof type 7.
Note that this decision rule is biased for it decides in favour of non-coding sequence fragments if none of the g,(S) exceed L. This bias becomes important if L approximates to 1. Then the above decision rule degenerates by predicting all sequence fragments as
Empirical distribution functions of gl,...,g7 2676 non-coding sequence fragments (E. coli I) of 96 nucleotides order 1
"gl ~
,-3
95
-0s
-
99
-
Fig. 3(a)
-04
97
2
P(97< 0.5)
Juergen Kleffe et al.
128
Empirical distribution functions of gl,...,g7 2676 non-coding ssquence fragments (E. coli I) of 96 nucleotldes order 2
096
095
0 ~
I
I
I
I
I
I
I
I
I
I
0,l
0,2
0,3
0,4
0,6
0,6
0,7
0,6
0,s
1
91 95
-92
-93
------Cl6
-
97
94 . .
.
P(g7 <
0.5)
Fig. 3(b) Fig. 3. (a) Empirical distribution functions of g, , . . . , g, obtained by the algorithm using first-order models for 2876 non-coding fragments of 96 bp from E. coli intergeneic regions. (b) The same as in (a), except
the results were obtained by the algorithm using second-order models.
non-coding. Formal decision theory would rather recommend the rule B: B:
If gj(S) = max g,(S) i= I....,7
then S is predicted to be of type j. The case when the maximum value is taken by more than one g,(S) does not need to be considered in practice. The main point, regardless of which decision rule we choose, is that the performance of the prediction method is measured by the matrix of probabilities e(i, j) to predict a sequence fragment to be of type i if in fact it is of type j. The probabilities of correct decisions are e(i, i ) and the smaller the offdiagonal
elements of the matrix (e(i, j)) are the better the method is. We are not equally interested in all the 42 possible types of error. Let us focus on the following four error rates: ERll:
The rate of predicting a type 1 sequence fragment to be of type 7 (false negative). ER12: The rate of predicting a type 1 sequence fragment to be of type 4 (wrong strand and correct frame). ER13: The rate of predicting a type 1 sequence fragment to be of type j E {2,3,5,6} (wrong frame in any strand).
Statistical
analysis
of GeneMark
performance
129
by cross-validation
Table I, Cross-validation estimates for error probabilities using the GeneMark method: 8908 coding (E. cob class 1) sequence fragments, 2876 non-coding sequence fragments, 9 cross-validation groups, fragment length 96 nucleotides Order of models Method Decision rule ER II ER 2 ER 3 Decision rule Eli II ER 2 ER 3 Decision rule ER II ER 2 ER 3 Decision rule ER 11 ER 2 Eli 3
ER2:
0
1
2
3
4
5
0.2547 0.1478 0.1962
0.1405 0.1540 0.1050
0.0650 0.1248 0.0236
0.0612 0.1172 0.0170
0.0570 0.1078 0.0164
0.0696 0.1019 0.0192
0.3998 0.0967 0.1344
0.2049 0.1220 0.0775
0.0807 0.1102 a.0201
0.0756 0.1012 0.0148
0.0708 0.0946 0.0139
0.0802 0.0921 0.0168
0.5413 0.0678 0.0832
0.2694 0.0949 0.0577
0.1000 0.0974 0.0172
0.0903 0.0880 0.0136
0.0830 0.0824 0.0 I29
0.0923 0.08 10 0.0148
0. I548 0.1996 0.2453
0.1117 0.1725 0.1174
0.0617 0.1266 0.0247
0.0592 0.1203 0.0185
0.0552 0.1102 0.0172
0.0683 0.1036 0.0195
A: L = 0.5
A: L = 0.6
A: L = 0.7
B:
The rate of predicting a type 7 sequence fragment to be of type j E {1,2,3,4,5,6} (false positive).
The sum ERl = ERll + ER12 + ER13 is the misprediction rate for sequence fragments of type 1. The sum ER3 = ER12 + ER13 is the rate of predicting a wrong frame or strand for a type 1 sequence fragment. Looking at the distribution functions d, of gI shown by Figs 2 and 3, ER13 is expected to be very small. For these distribution functions di holds d#) = P(gi < I) = P( gi(S) < 1) for an arbitrary I, i.e. they allow one to derive the above mentioned error rates graphically in case training samples and prediction (control) sets are the same. Such error rates do not apply if predictions are made for sequence fragments which do not belong to the training sample. Y-FOLD
CROSS-VALIDATION
A well accepted method for deriving estimates of error rates for prediction algorithms is V-fold crossvalidation. We consider a sample S of sequence fragments of known type and split s = s, u
s, . . . u s,
into V subsamples of approximately equal size. Then V experiments are performed, each consisting of (i) training of the GeneMark algorithm on the sample S - S, deriving parameters for the used Markov model; (ii) applying the GeneMark algorithm to each sequence fragment S of sample Sj and generating values g,(S), . . . , g,(S). This procedure yields scores g,(S), . . . , g,(S) assigned to each sequence fragment S in S based on a training set that did not contain the considered sequence fragment. The scores g,(S), . . . , g,(S) are used to make a prediction and error rates are estimated by counting the cases of success and failure. The necessary programs have been written in Borland
Pascal 7.0 based on tools for statistical analysis of sequence data described in Kleffe et al. (1995). Table 1 provides the error rates ERll, ER2 and ER3 obtained by nine-fold cross-validation procedure for 8908 type 1 sequence fragments of E. coli class I genes and 2876 fragments of non-coding E. coli sequences. They are obtained by cutting into pieces of 96 nt long 8 12 E. coli class I gene sequences as well as concatenated E. coli non-coding sequences (derived from EcoSeq6 database). The higher orders of the Markov model: 2, 3 and 4 seem to perform the best and at about the same level of accuracy. Let us note that these error rates are estimates which may change if another sample of sequence fragments is considered. Table 2 provides mean and standard deviation of estimated error rates over 10 samples of sequence fragments formed by cutting the same coding and non-coding sequences into pieces using an offset of 1,3,7, 10,. . ,28 triplets. It is seen that small differences of error rates can be attributed to the estimation error and should be ignored for the comparison of methods. The increase of ERll with growing L is a consequence of the biased decision rule A which predicts non-coding whenever a true type 1 sequence fragment does not receive sufficiently high scores g,(S), . . . , gs(S). For the same reason the error rate ER2 decreases. The error rate ER3 remains almost unaffected from the choice of L. Considering the GeneMark method in the case of model order 2 and rule A with L = 0.5, about 6.5% of type 1 sequence fragments are predicted as noncoding (ERll). About 2.5% are predicted as coding but in the wrong frame or strand (ER3) and more than 12% of the non-coding sequence fragments are predicted as coding (ER2). Table 3 shows the effect of choosing a-priori probabilities IX,,. . , tl, on the misprediction rates. We considered for u, the values l/2, l/3 and l/7 and aj= (1 - c(,)/6 for j = 1,. . . ,6. It is seen that this effect is not negligible and compromises between the
Juergen Kleffe et al.
130
Table 2. Means and variances (given in parentheses in units of l/lOOQ of 10 cross-validation estimates for the GeneMark method: 8908 coding (E. coli class I) sequence fragments, 2876 non-coding sequence fragments, 9 cross-validation groups, fragment length 96 nucleotides Order of models 0
1
2
3
4
5
0.243 (6) 0.146 (3) O.I91(3)
0.133 (4) 0.147 (4) 0.098 (1)
0.059 (3) 0.119(4) 0.024 (I)
0.055 (4) 0.109 (6) 0.019 (1)
0.051 (4) 0.105 (3) 0.016(l)
0.064 (3) 0.101 (3) 0.018 (I)
0.388 (7) 0.103 (4) 0.137 (3) ~
0.192(5) 0.117(2) 0.076 (3)
0.076 (3) 0.106 (4) 0.020 (I)
0.068 (4) 0.097 (4) 0.016(l)
0.062 (4) 0.094 (4) 0.014(l)
0.074 (4) 0.093 (3) 0.016 (I)
0.526 (7) 0.068 (3) 0.090 (4)
0.256 (6) 0.091 (2) 0.057 (2)
0.095 (3) 0.093 (4) 0.017 (I)
0.083 (5) 0.087 (4) 0.014(l)
0.075 (5) 0.084 (4) 0.013(l)
0.086 (4) 0.084 (3) 0.014(l)
0.143 (5) 0.201 (31 0.247 <3j
0.105(4) 0.164(4) 0.110(4)
0.055 (3) 0.122 141 0.026ilj
0.052 (4) 0.111(5) 0.020 (1)
0.049 (4) 0.107(3) 0.017(l)
0.062 (3) 0.103 (3) 0.019 (1)
Method Decision ER 11 ER 2 ER 3 Decision ER I1 ER 2 ER 3 Decision ER 11 ER 2 ER 3 Decision ER 11 ER 2 ER 3
rule A: L = 0.5
rule A: L = 0.6
I
rule A: L = 0.7
rule B:
error rates ERll and ER2 substantially. We are therefore forced to confine the following investigations to the choice u, = l/2 which is the accepted value in the current GeneMark software. Looking at the values of Table 1 we observe a major decrease of error rates when shifting from order 1 to order 2 models. A hypothesis can be put forward that this decrease is due to the recognition of stop-codons, for Markov models of order less than 2 are unable to recognize them. To test this hypothesis we changed the GeneMark functions by truncating gi(S) for i = 1,. . . ,6 to zero if S contains a stopcodon or its anti codon in the corresponding frame. Blocks 1 and 4 of Table 4 show the results for this version of the GeneMark method for decision rule A with L = 0.5and rule B. As expected the results are
the same as in Table 1 for orders greater than 1. For orders 0 and 1, however, we observe a dramatic drop of the error rate ER3. Wrong prediction of strand or frame for coding sequence fragments drops due to noticing off-phase stop-codons. However, some of the type 1 sequence fragments which were predicted as coding at a wrong strand or frame are now predicted as non-coding, which explains an increase of error rate ERI 1 especially for rule B. The sum of error rates ERll and ER3 clearly drops as it is the total misprediction rate of coding sequence fragments. Hence the overall performance increases as expected. Nevertheless, order 0 and order 1 Markov models do not reach the smaller error rates of order 2, 3 and 4 models, i.e. recognition of stop-codons alone does not explain the better performance of higher-order models.
Table 3. Cross-validation estimates for error probabilities using the GeneMark method: 8908 coding (E. coli class I) sequence fragments, 2876 non-coding sequence fragments. 9 crossvalidation groups (decision rule A with L = 0.5). fragment length 96 nucleotides Order of models Method Rule A: ct, = l/2 ER 11 ER 2 ER 3 Rule A: OL,= l/3 ER II ER 2 ER 3 Rule A: OL,= 117 ER 11 ER 2 ER 3 Rule B: a, = l/2 ER 11 ER 2 ER 3 Rule B: q = l/3 ER II ER 2 ER 3 Rule B: OL,= l/7 ERl1’ ’ ER 2 ER 3
0
1
2
3
4
5
0.2547 0.1478 0.1962
0.1405 0. I540 0.1050
0.0650 0.1248 0.0236
0.0612 0.1172 0.0170
0.0570 0.1078 0.0164
0.0696 0.1019 0.0192
0.2011 0.1860 0.2145
0.1092 0. I954 0.1130
0.0497 0.1464 0.0246
0.0485 0.1360 0.0182
0.0471 0. I293 0.0172
0.0585 0.1193 0.0200
0.1420 0.2611 0.2359
0.0734 0.2594 0.1229
0.0327 0.1864 0.0258
0.0322 0.1638 0.0209
0.0326 0.1610 0.0187
0.0449 0.1495 0.0217
0.1548 0.1996 0.2453
0.1117 0.1725 0.1174
0.0617 0. I266 0.0247
0.0592 0.1203 0.0185
0.0552 0.1102 0.0172
0.0683 0.1036 0.0195
0. IO56 0.2670 0.2640
0.081 I 0.2184 0.1257
0.0469 0. I523 0.0256
0.0463 0.1401 0.0 194
0.0456 0.1325 0.0177
0.0569 0.1210 0.0207
0.0473 0.3821 0.2907
0.0467 0.2973 0.1354
0.0297 0.1923 0.0273
0.031 I 0.1697 0.0214
0.0304 0.1676 0.0194
0.0438 0. I526 0.0219
Statistical
of
analysis
GeneMark
performance
131
by cross-validation
Table 4. Cross-validation estimates for error probabilities using the GeneMark method: 8908 coding (E. coliclass I) sequence fragments, 2876 non-coding sequence fragments, 9 cross-validation groups (decision rule A with L = 0.5), fragment length 96 nucleotides Order of models Method Rule A: stop ER II ER 2 ER 3 Rule A: non-singular ER 11 ER 2 ER 3 Rule A: stop non-singular ER II ER 2 ER 3 Rule 8: stop ER 11 ER 2 ER 3 Rule B: non-sineular ER11 k ER 2 ER 3 Rule B: stop non-singular ER 11 ER 2 ER 3
0
I
2
3
4
5
0.2249 0.1478 0.1461
0.1376 0.1540 0.0647
0.0650 0.1248 0.0236
0.0612 0.1172 0.0170
0.0570 0.1078 0.0164
0.0696 0.1019 0.0192
0.2547 0.1478 0.1962
0.1405 0.1540 0.1050
0.0650 0. I248 0.0236
0.0608 0.1213 0.0172
0.0558 0.1147 0.0174
0.0560 0.1200 0.0201
0.2249 0.1478 0.1416
0.1376 0.1540 0.0647
0.0650 0.1248 0.0236
0.0608 0.1213 0.0171
0.0562 0.1147 0.0166
0.0567 0.1200 0.0186
0.1762 0.1996 0.1639
0.1258 0. I725 0.0698
0.06 I 7 0.1266 0.0247
0.0592 0.1203 0.0185
0.0552 0.1102 0.0172
0.0683 0.1036 0.0195
0.1548 0.1996 0.2453
0.1117 0.1725 0.1174
0.0615 0.1273 0.0247
0.0587 0.1245 0.0186
0.0537 0.1186 0.0180
0.0546 0.1234 0.0209
0.1762 0.1996 0.1639
0.1258 0.1725 0.0698
0.0615 0.1273 0.0247
0.0588 0.1245 0.0185
0.0543 0.1186 0.0172
0.0557 0.1234 0.0192
Looking at the right-hand side of Table 1 we observe that the error rate ERll starts to increase again with order 5 while the error rate ER2 decreases further. This fact has its reason in the singularity of the model discussed in Section 1 and the bias of decision rule A. Therefore we used the forced nonsingular model and show some analogue results in Table 4 in blocks 2 and 5. Here the error rate ERl 1 decreases slight!y with growing order but, in contrast to Table 1, the error rate ER2 also grows notably, i.e. forced non-singular models do not seem to allow for higher orders, too. To avoid the disadvantage that the forced non-singular model loses the ability to recognize stop-codons for orders higher than 1, we applied the above mentioned modification of the a-posteriori probabilities, i.e. set g,(S) to zero in case a stopcodon or its anti codon is found in the corresponding
reading frame in S. This modification has not essentially affected the prediction results for higher-order models, as presented in blocks 3 and 6 of Table 4. The same investigation has been performed for .3002 E. coli class II sequence fragments of type 1. The results are presented in Tables 5 and 6. The error rates are smallest for orders 2 and 3 and are much less than for the E. coli class I sequence fragments. More than 3% of the E. coli class II gene sequence fragments of type 1 are predicted as non-coding and only about 0.4% are predicted as coding in the wrong frame or strand. Less than 4% of the non-coding sequence fragments are wrongly predicted as coding. These considerably smaller error rates are explained by the much higher codon usage bias of E. coli class II genes which ease their recognition. However, the danger of overidentification by using too high-order
Table 5. Cross-validation estimates for error probabilities using the GeneMark method: 3002 coding (E. coli class II) sequence fragments, 2876 non-coding sequence fragments, 9 crossvalidation arows. fragment length 96 nucleotides Order of models Method Decision ER II ER 2 ER 3 Decision ER I1 ER 2 ER 3 Decision ER II ER 2 ER 3 Decision ER 11 ER 2 ER 3
0
1
2
3
4
5
0.1346 0.1366 0.2199
0.0693 0.0873 0.0266
0.0310 0.0369 0.0043
0.0333 0.0334 0.0030
0.0516 0.0264 0.0033
0.2032 0.0129 0.0030
0.2632 0.0977 0.1642
0.0833 0.0758 0.0237
0.0353 0.0327 0.0043
0.0360 0.0303 0.0030
0.0546 0.0254 0.0027
0.2045 0.0122 0.0030
0.3874 0.0675 0.1169
0.1026 0.0640 0.0210
0.0420 0.0278 0.0033
0.0400 0.0271 0.0027
0.0593 0.0226 0.0027
0.2055 0.01 I5 0.0030
0.0849 0.1697 0.2455
0.0643 0.0897 0.0290
0.0306 0.0372 0.0043
0.0323 0.0334 0.0030
0.0513 0.0268 0.0037
0.2032 0.0132 0.0030
rule A: L = 0.5
rule A: L = 0.6
rule A: L = 0.7
rule B:
Juergen Kleffe er al.
132 Table 6. Cross-validation II) sequence fragments,
estimates for error probabilities using the GeneMark method: 3002 coding (E. coliclass 2876 non-coding sequence fragments, 9 cross-validation groups (decision rule A with L = 0.5). fraament length 96 nucleotides Order of models
Method Rule A: stop ER 11 ER 2 ER 3 Rule A: non-singular ER 11 ER 2 ER 3 Rule A: stop non-singular ER 11 ER 2 ER 3 Rule B: stop ER II ER 2 ER 3 Rule B: non-singular ER II ER 2 ER 3 Rule B: stop non-singular ER II ER2 ER 3
0
I
2
3
4
5
0.1299 0.1366 0. I895
0.0706 0.0873 0.0193
0.0310 0.0369 0.0043
0.0333 0.0334 0.0030
0.0516 0.0264 0.0033
0.2032 0.0129 0.0030
0. I346 0.1366 0.2199
0.0693 0.0873 0.0266
0.0303 0.0386 0.0047
0.0333 0.0358 0.0030
0.0336 0.0382 0.0027
0.0400 0.0417 0.0040
0. I299 0. I366 0. I895
0.0706 0.0873 0.0193
0.0303 0.0386 0.0043
0.0333 0.0358 0.0030
0.0336 0.0382 0.0027
0.0400 0.0417 0.0033
0.0963 0.1697 0.2062
0.0680 0.0897 0.0210
0.0306 0.0372 0.0043
0.0323 0.0334 0.0030
0.0513 0.0268 0.0037
0.2032 0.0132 0.0030
0.0849 0.1697 0.2455
0.0643 0.0897 0.0290
0.0300 0.0393 0.0047
0.0323 0.0358 0.0030
0.0330 0.0382 0.0030
0.0393 0.0417 0.0043
0.0963 0.1697 0.2062
0.0680 0.0897 0.0210
0.0300 0.0393 0.0043
0.0323 0.0358 0.0030
0.0330 0.0382 0.0030
0.0393 0.0417 0.0037
models is more pronounced. The error rate ERll grows to 20% for order 5 models, while the error rate ER2 tends quickly to zero. Consideration of forced non-singular models helps to stop this effect but still, as seen in Table 6, the error rates grow again beginning with order 4. It is in fact the strong bias in triplets and tetramers that helps to recognize E. coli class II gene sequence fragments. The effect of search for stop-codons is shown in blocks 1 and 4 of Table 6 and is similar to the effect seen for E. coli class I genes. The performance of forced non-singular models combined with detecting stop-codons is very insensitive to the change of model order and decision rule. We may conclude that high codon usage bias is the major factor that makes E. coli class II genes easily recognizable.
The results for E. coli class III genes are given in Tables 7 and 8. About 20% of the gene sequence fragments are predicted as non-coding (ERl 1) and in more than 6% the method fails to predict the correct strand or frame (ER3). The misprediction rate for non-coding sequence fragments (ER2) is about 25%. These results are expected since the E. coli class III genes, which are believed to be horizontally transferred into E. coli genome, show the least pronounced codon usage bias. The error rates are very sensitive to the choice of model order and decision rule. This is also an effect of the relatively small size of the training sample. Summarizing we conclude that GeneMark performance varies considerably among gene classes that have been subjected to different evolutionary
Table 7. Cross-validation estimates for error probabilities using the GeneMark method: 1156 coding (E. cofi class III) sequence fragments, 2876 non-coding sequence fragments, 9 crossvalidation erouos. fragment length 96 nucleotides Order of models Method
0
Decision rule A: L = 0.5 ER II 0.5424 ER 2 0.2017 ER 3 0.0830 Decision rule A: L = 0.6 ER 11 0.6497 ER 2 0.1349 ER 3 0.0510 Decision rule A: L = 0.7 ER II 0.7474 ER 2 0.0800 ER 3 0.0329 Decision rule B: ER 11 0.4351 ER 2 0.2698 ER 3 0.1367
I
2
3
4
5
0.4343 0.2284 0.0701
0.1981 0.2417 0.0614
0.2206 0.2180 0.0545
0.2941 0.1951 0.0580
0.6652 0.0824 0.0441
0.5320 0.1617 0.0528
0.2647 0.2058 0.0450
0.2846 0.1864 0.0433
0.3460 0.1686 0.0467
0.6903 0.0716 0.0398
0.6176 0.1127 0.0355
0.3478 0.1638 0.0285
0.3503 0.1516 0.0277
0.4083 0.1401 0.0398
0.7197 0.0622 0.0355
0.3538 0.2848 0.1073
0.1782 0.261 I 0.0692
0.1946 0.2305 0.0649
0.2699 0.2010 0.0709
0.6618 0.0834 0.0458
Statistical analysis of GeneMark performance by cross-validation Table 8. Cross-validation III) sequence fragments,
133
estimates for error probabilities using the GeneMark method: 1156 coding (E. coliclass 2876 non-coding sequence fragments, 9 cross-validation groups (decision rule A with L = OS), fragment length 96 nucleotides Order of models
Method Rule A: stop ER II ER 2 ER 3 Rule A: non-singular ER 11 ER 2 ER 3 Rule A: stop non-singular ER 11 ER 2 ER 3 Rule B: stop ER 11 ER 2 ER 3 Rule B: non-singular ER 11 ER 2 ER 3 Rule B: stop non-singular ER 11 ER 2 ER 3
0
1
2
3
4
5
0.5320 0.2017 0.0493
0.4221 0.2284 0.0476
0.1981 0.2417 0.0614
0.2206 0.2180 0.0545
0.2941 0.1951 0.0580
0.6652 0.0824 0.0441
0.5424 0.2017 0.0830
0.4343 0.2284 0.0701
0.1981 0.2420 0.0614
0.2232 0.221 I 0.0536
0.2898 0.2131 0.0623
0.4853 0.1791 0.0735
0.5320 0.2017 0.0493
0.4221 0.2284 0.0476
0.1981 0.2420 0.0614
0.2240 0.221 I 0.0519
0.2881 0.2131 0.0597
0.4922 0.1791 0.0562
0.4948 0.2698 0.0675
0.3841 0.2848 0.0683
0.1782 0.261 I 0.0692
0. I946 0.2305 0.0649
0.2699 0.2010 0.0709
0.66 I8 0.0834 0.0458
0.4351 0.2698 0.1367
0.3538 0.2848 0.1081
0.1782 0.2618 0.0692
0.1929 0.2350 0.0683
0.2604 0.2239 0.0779
0.4637 0.1881 0.0856
0.4948 0.2698 0.0675
0.3841 0.2848 0.0683
0.1782 0.2618 0.0692
0. I946 0.2350 0.0657
0.2638 0.2239 0.0735
0.4801 0.1881 0.063 1
constraints. It also heavily depends on the size of the training sample. We have demonstrated this within the genome of E. coli. Even more variability is expected to occur between different species. We have also demonstrated a danger to use Markov models of higher orders if the training sample is not sufficiently large. On the other hand not much performance is lost when we use models of order 3 and 4. Even order 2 models work sufficiently well although order 3 and 4 models generally work better. With growing length of sequence fragments the error rates decrease (they will approximately be decreased to one-half for 144 nt length in comparison with 96 nt length). Conversely, shorter sequence fragments cannot be predicted well. The error rates become impractically large for 48 nt sequence fragments. A parallel analysis of the GeneMark accuracy with other a-priori probability values for non-coding fragments revealed even stronger effects of the different modifications we have discussed in this paper. Therefore, this study suggests that the assessment of the GeneMark method accuracy is far from being simple. Further work is in progress to assess GeneMark performance for other species as well as applicability of GeneMark trained on one species (gene class) to another species (gene classes). Acknowledgement-We
are grateful to William Hayes for valuable programming assistance. This work was sup-
ported by grant from Deutsche Forschungsgemeinschaft Foundation (Germany) and by the National Institutes of Health (U.S.A.) grant HGO0783 to M.B.
REFERENCES Billingsley P. (1961) Ann. M&h. Sraf. 82, 12. Borodovsky M. & McIninch J. (1993) Computers Chem. 17, 123. Borodovsky M., Sprizhitsky Yu., Golovanov E. & Alexandrov A. (1986) Molecular Biol. 20, 833. Borodovsky M., Koonin E. & Rudd K. (1994) Trenrls Biochem. Sri. 19, 309.
Borodovsky M., Rudd K. & Koonin E. (1994) Nucl. Acids Res. 22, 4756.
Fickett J. W. & Tung C. (1992) Nucleic Acids Res. 20, 6441. Fleischmann R. D. (1995) Science 269, 496. Kleffe J. (1991) Proceedings of the Annual Intemalional Conference of the IEEE Enxineerinp in Medicine and Bioiogy So&y 13, 1253. Kleffe J. & Borodovskv M. (1992) Modeline and Commuter Methoak in Molecular Biology atzd Genetics (Edited by V.
A. Ratner & N. A. Kolchanov), p. 103. Nova Science Publishers Inc. Kleffe J., Hermann K., Gunia W., Vahrson W. & Wittia B. (1995) CABZOS 4, 449. Medieue C.. Rouxel T.. Viaier P.. Henaut A. & Danchin A. (1931) J. ‘Mol. Biol. i22; 851. Merino E., Balbas P., Puente J. L. & Bolivar F. (1994) Nucl. Acids Rex 22, 1903.
Shepherd (1981)PNAS 78, 1596. Sofia H. J., Burland V., Daniels D. L., Plunkett III G. & Blattner F. R. (1994) Nucl. Acids Res. 22.