Using weighted features to predict recombination hotspots in Saccharomyces cerevisiae


Author's Accepted Manuscript

Using weighted features to predict recombination hotspots in Saccharomyces cerevisiae Guoqing Liu, Yongqiang Xing, Lu Cai

www.elsevier.com/locate/yjtbi

PII: S0022-5193(15)00309-4
DOI: http://dx.doi.org/10.1016/j.jtbi.2015.06.030
Reference: YJTBI8248

To appear in: Journal of Theoretical Biology

Received date: 17 February 2015
Revised date: 4 June 2015
Accepted date: 20 June 2015

Cite this article as: Guoqing Liu, Yongqiang Xing, Lu Cai, Using weighted features to predict recombination hotspots in Saccharomyces cerevisiae, Journal of Theoretical Biology, http://dx.doi.org/10.1016/j.jtbi.2015.06.030

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Using weighted features to predict recombination hotspots in Saccharomyces cerevisiae

Guoqing Liu1,2,*, Yongqiang Xing1,2, Lu Cai1,2,*

1 School of Mathematics, Physics and Biological Engineering, Inner Mongolia University of Science and Technology, Baotou, 014010, China

2 The Institute of Bioengineering and Technology, Inner Mongolia University of Science and Technology, Baotou, 014010, China

ABSTRACT Characterization and accurate prediction of recombination hotspots and coldspots have crucial implications for the mechanism of recombination. Several models have predicted recombination hot/cold spots successfully, but there is still much room for improvement. We present a novel classifier in which k-mer frequency, physical and thermodynamic properties of DNA sequences are incorporated in the form of weighted features. Applying the classifier to recombination hot/cold ORFs in Saccharomyces cerevisiae, we achieved an accuracy of 90%, which is ~5% higher than existing methods, such as iRSpot-PseDNC, IDQD and Random Forest. The model also predicted non-ORF recombination hot/cold spots sequences in Saccharomyces cerevisiae with high accuracy. A broad applicability of the model in the field of classification is expected.

Key words: recombination hotspots; weighted features; dinucleotide flexibility; thermodynamic properties

* To whom correspondence should be addressed.

1 INTRODUCTION

Meiotic recombination is the exchange of genetic information between homologous chromosomes, achieved by cutting and rejoining the chromosomes at allelic loci during meiosis. Recombination is a fundamentally important process in evolution, and along with mutation it is responsible for producing the patterns of genetic diversity seen in extant populations (1). Recombination also helps the correct disjunction of homologous chromosomes by forming chiasmata (2), and it provides a molecular basis for natural selection by separating beneficial mutations from deleterious ones (3,4). Furthermore, recombination affects genomic composition via gene conversion or mutagenesis (4-7). The evolutionary effects of recombination link it to various genomic features, such as local GC content (8,9), dinucleotide bias (10), codon bias (11), transposon distribution (12,13), pseudogene distribution (14), intron length (15) and protein adaptation rate (16). Although the biological importance of recombination has been clearly demonstrated, its mechanism remains an open question. In recent years, the causes of genomic heterogeneity in recombination have received considerable attention (17-29). Recombination is distributed non-uniformly across the genome, occurring preferentially in regions called hotspots and rarely in regions called coldspots (29,30). Recombination correlates with various DNA sequence features, such as local GC content, dinucleotide bias, biased codon usage, palindromes (31,32), and possible conserved sequences (20). Apart from these DNA factors, many epigenetic features, such as DNA methylation (33), nucleosome positioning (18,19) and histone modification (34,35), have been shown to correlate with recombination.
For instance, the double-strand breaks (DSBs) that initiate recombination are enriched in open chromatin in mouse (18); nucleosome occupancy correlates with recombination frequency in yeast (19); DNA methylation tends to inhibit the formation of crossovers (33); and histone modification is associated with recombination (34,35). In addition, the PRDM9 protein, a meiosis-specific histone methyltransferase that trimethylates H3 at K4, is known to be a major determinant of meiotic recombination hotspots in human and mouse (21-23). PRDM9 is a DNA-binding zinc finger protein whose binding motif is enriched in hotspots. The rapid evolution of the zinc finger binding region of PRDM9 partially explains the hotspot-related genomic polymorphism (21) and the lack of conservation of hotspot positions between human and chimpanzee (24,36). However, PRDM9, which is believed to initiate recombination in ~40% of human hotspots (21), cannot explain all human hotspots. To date, the heterogeneous distribution of recombination across the genome is ascribed largely to


the density and intensity of recombination hotspots, whereas the factors that determine the positions and activities of the hotspots are still less clear. Identification of hotspots and coldspots is required for understanding both the regulatory mechanisms and the evolutionary effects of recombination hotspots (e.g. the locations and intensities of hotspots). Although considerable knowledge of recombination has been acquired from experimental genome-scale mapping, computationally identifying recombination hotspots and coldspots in the genome is still a challenging task. In recent years, several models have been proposed to predict recombination hotspots and coldspots. For example, Zhou et al. (37) presented a Support Vector Machine (SVM) classifier based on codon composition, and Jiang et al. (38) presented a Random Forest model based on gapped dinucleotide composition, to predict hotspots and coldspots in Saccharomyces cerevisiae; Guo et al. (39) presented an SVM model based on DNA physical properties in yeast; Liu et al. (40) presented an IDQD model based on oligonucleotide frequencies; Wu et al. (41) presented an SVM model based on the combined features of transcription factor binding site occurrences, GC content and histone modification in human and mouse; most recently, Chen et al. (42) presented an SVM model based on pseudo dinucleotide composition in yeast; and Qiu and Xiao (43) predicted recombination hotspots based on trinucleotide composition and pseudo amino acid components. Overall, these models achieved accuracies above a moderate level in predicting recombination hotspots, but none of them produced a marked improvement in prediction performance. In general, three key points should be considered to improve prediction accuracy in classification problems of any kind. The first is the selection (or extraction) of features that best represent the sample. The second is the way in which different kinds of features are incorporated in the model.
The third is the intrinsic power of the classifier per se, which depends on the way it works. The absence of convincing improvement in previous predictions of recombination hot/cold spots might be due either to the intrinsic power of the classifiers or to improper feature selection or incorporation. Alternatively, it might be because none of the sequence and epigenetic features associated with recombination explains all hotspots consistently and exclusively. The number of useful features integrated into a predictive model is likely to be positively correlated with prediction accuracy. Constructing many features to describe the sample, however, may bear a serious risk of overfitting. In addition, an efficient approach to extracting features from DNA sequence is also important in various kinds of prediction, including recombination hotspot prediction (42), predicting nucleosome positioning sequences (44), predicting


DNA methylation status (45), predicting splice sites (46), predicting promoters (47) and so on. In this regard, a series of bioinformatics tools have been proposed to generate various features based only on DNA or RNA sequences (48-50), and a web server called "Pse-in-One" has been established to generate various pseudo components for DNA, RNA, and protein sequences (51). It is thus highly desirable to develop novel methods to accurately identify recombination hotspots and coldspots. Although recombination hotspots are thought to be evolutionarily unstable due to the self-destructive nature of the biased gene conversion that occurs during recombination (52,53), many recombination hotspots in extant human populations have existed for thousands of generations. Moreover, recombination hotspots in yeast exhibit high conservation, which is attributed to the low frequency of sex and out-crossing in the studied yeasts, acting to reduce the population genetic effect of biased gene conversion (54). The genomic correlates of recombination and the conservation of hotspots in yeast underlie our sequence-dependent prediction of recombination hotspots. In this study, we developed a classifier that consists of three steps: representing sequences with weighted-feature vectors, representing sequences with the distances between feature vectors, and predicting with quadratic discriminant analysis. Applying the method to recombination hot/cold spots in Saccharomyces cerevisiae, with several kinds of information incorporated in the form of weighted features, we achieved a clear improvement in prediction performance compared with existing methods.

2 MATERIALS AND METHODS

There are several steps (55) in developing a machine-learning prediction model: construction of a benchmark dataset to train and test the model, mathematical formulation of the biological samples, prediction with a classifier (or algorithm), and cross-validation tests to evaluate prediction power. Establishment of a web server for the model is also important. The steps of our prediction model are described in detail below.

2.1 Benchmark datasets

Recombination hot/cold ORFs in Saccharomyces cerevisiae whose recombination rates were determined experimentally from DSB rates (30) were used to construct the benchmark datasets. The ratio of hybridization to a DSB-enriched probe (P2) to that of a total genomic probe (P1) was measured to estimate DSB formation adjacent to each gene ORF. The experiments were repeated seven times for each of the 6200 genes, and the median value was taken as the relative recombination rate for each sequence.


Excluding sequences for which one or more of the repeated arrays were missing, a total of 5266 ORFs were collected. In the original study (30), an ORF was classified as “hot” if it ranked in the top 12.5% in at least five of the seven microarrays and “cold” if it ranked in the bottom 12.5% in at least five of the seven experiments. After clustering adjacent hot ORFs (cold ORFs) that represent a single hotspot (coldspot), the authors found 177 hotspots and 40 coldspots. In another study, Jiang et al. (38) computationally predicted hot/cold spots after redefining the hot/cold cutoffs. To allow comparison with their prediction results, we constructed the benchmark dataset in the same way: sequences whose relative hybridization ratios are larger than 1.5 were classified as hotspots, and sequences whose relative hybridization ratios are smaller than 0.82 were classified as coldspots. This yielded 490 hot ORFs and 590 cold ORFs. The length distribution of the sequences is provided in the Supplementary Information (Fig. S1).

Benchmark datasets usually consist of a learning (or training) dataset and an independent test dataset. Following the 5-fold cross-validation adopted in this study, both the hotspot and coldspot datasets were randomly divided into 5 equal-sized subsets. In the first round, four subsets were used for training and the remaining one for testing. This train–test procedure was repeated five times, using a different holdout set each time.

To test our model, we also predicted two sets of experimentally identified non-ORF-based recombination hotspots in yeast, comprising 128 hotspots in yeast (29) and 452 hotspots on yeast chromosome IV (56). Note that 136 hotspots were identified in the previous study (29) from high-resolution recombination data averaged over crossover and non-crossover recombination. Of those, the 128 hotspots larger than 100 bp were used in this study. All datasets used in this study are provided in the Supplementary data (Excel file).
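The labelling cutoffs and the 5-fold partition described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the dictionary input format and the fixed random seed are assumptions, and only the cutoffs 1.5 and 0.82 come from the text.

```python
import random

HOT_CUTOFF = 1.5    # ratios above this are labelled hot (cutoff from the text)
COLD_CUTOFF = 0.82  # ratios below this are labelled cold (cutoff from the text)


def label_orfs(ratios):
    """Split a {orf_id: relative hybridization ratio} mapping into hot and cold IDs.

    ORFs with intermediate ratios belong to neither class and are dropped.
    """
    hot = sorted(orf for orf, r in ratios.items() if r > HOT_CUTOFF)
    cold = sorted(orf for orf, r in ratios.items() if r < COLD_CUTOFF)
    return hot, cold


def five_fold_splits(hot, cold, n_folds=5, seed=0):
    """Yield (train_hot, train_cold, test_hot, test_cold) once per fold.

    Each class is shuffled and partitioned separately, so every fold keeps
    roughly the same hot/cold proportions as the full dataset.
    """
    rng = random.Random(seed)  # hypothetical seed, for reproducibility only
    hot, cold = list(hot), list(cold)
    rng.shuffle(hot)
    rng.shuffle(cold)
    hot_folds = [hot[i::n_folds] for i in range(n_folds)]
    cold_folds = [cold[i::n_folds] for i in range(n_folds)]
    for t in range(n_folds):
        train_hot = [o for i in range(n_folds) if i != t for o in hot_folds[i]]
        train_cold = [o for i in range(n_folds) if i != t for o in cold_folds[i]]
        yield train_hot, train_cold, hot_folds[t], cold_folds[t]
```

Partitioning hot and cold ORFs separately is one way to read "both hotspots and coldspots datasets were randomly divided"; other partitioning schemes would also fit the description.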

2.2 Representing sequences with weighted features

To quantify the importance of different features in prediction and to incorporate different kinds of features in a proper way, we weighted the features using a variance-based feature selection method. First, samples were represented by possibly important features, such as the k-mer probability distribution in sequences and sequence-order related information. Second, weights of the features were calculated according to their ability to distinguish the positive dataset (hotspots) from the negative dataset (coldspots). Third, values of the features were normalized via a standard conversion. Finally, samples were represented by weighted features obtained by multiplying the normalized features by their corresponding weights. This procedure is described below.

We consider three kinds of features in our prediction. The first is the traditional k-mer frequency, whose capacity in recombination hot/cold spot prediction has been demonstrated (38,40). The second is physical properties of DNA sequences, including dinucleotide flexibility and structure parameters calculated from the protein–DNA crystal structures in the latest NDB database (http://ndbserver.rutgers.edu/, update of Aug. 1, 2014). The dinucleotide structure parameters are the equilibrium values of roll, tilt, twist, shift, slide and rise that describe the DNA structure. Similarly to the procedure used in (57,58), the dinucleotide flexibility parameters were computed by inverting the covariance matrix of deviations of local geometric parameters from their average values. We consider these properties because hotspot centers are characterized by a depletion of nucleosomes (56) and DNA flexibility plays an important role in nucleosome positioning (59,60). The performance of the dinucleotide structure parameters in hot/cold spot prediction was previously demonstrated (42). The third is thermodynamic properties, including dinucleotide free energy, entropy and enthalpy (61). These fifteen parameters are listed in Table 1.

If a sequence is represented by a feature vector composed of the occurrence frequencies of k-mers in the sequence, we have

$$S = [f_1, f_2, \ldots, f_n]^T \tag{1}$$

where $n = 4^k$. A k-mer is an oligonucleotide of length k. In this study, k takes integer values from 1 to 4. To avoid the drawback that may arise from short sequences, a pseudo-count correction was applied to the k-mer frequencies. The modified k-mer frequencies are given by

$$f_i = \frac{n_i + b_i}{N + B} \tag{2}$$

where $n_i$ is the real count of the i-th k-mer in a sequence, $b_i$ is the corresponding pseudo-count, $N = \sum_i n_i$ and $B = \sum_i b_i$. As reported in previous studies (62), $b_i$ was set to $p_0 N$, where $p_0 = (1/4)^k$ is the expected background probability of each k-mer. The pseudo-count corrects the real counts less for long sequences than for short ones.

To include sequence-order related information in the feature vector, each feature associated with the dinucleotide physical and thermodynamic properties is formulated in a manner similar to the procedure described in (63),


$$h_{n,m} = \frac{1}{L-1-n}\sum_{i=1}^{L-1-n}\left[p_m(i,i+1) - p_m(i+n,i+n+1)\right]^2, \quad 1 \le n \le L-2 \tag{3}$$

where $p_m(i,i+1)$ and $p_m(i+n,i+n+1)$ denote the values of the m-th parameter listed in Table 1 at dinucleotide positions (i,i+1) and (i+n,i+n+1), respectively, and L is the sequence length. $h_{n,m}$ is called the n-th tier correlation factor; it reflects the sequence-order correlation with respect to the m-th parameter between all the n-th most contiguous dinucleotides along a DNA sequence. In fact, sequence-order correlation is the heterogeneity of the corresponding property along a sequence. In Eq. (3), however, the sequence-order information related to different parameters is formulated separately rather than reduced to a single feature by direct summation as done in previous studies, since the parameters may have different distributions in the positive and negative datasets. For example, it is possible that $h_{n,1}$ tends to be large in hotspot sequences while $h_{n,2}$ tends to be small in hotspot sequences.

The largest value of the integer n is set to 5 for the following reason. Recombination is affected by nucleosome positioning, which is characterized by a ~10-bp periodic occurrence of dinucleotides. With n up to 5 (half of the periodicity), Eq. (3) can capture most of the sequence-order information related to the 10-bp periodicity. Note that although restricting n<6 is certain to lose some long-distance sequence-order information in DNA sequences, a large n is not encouraged, as it can cause overfitting or the curse of dimensionality in prediction. At this step, each sequence in the training and test datasets is denoted as a feature vector of ($4^k$+5×15) dimensions:

$$S = [f_1, f_2, \ldots, f_n, h_{1,1}, h_{1,2}, \ldots, h_{5,15}]^T \tag{4}$$
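The construction of the unweighted feature vector in Eqs. (1)-(4) can be sketched as below. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the property tables stand in for the fifteen parameters of Table 1 (whose values are not reproduced here), and the pseudo-count $b_i = p_0 N$ follows the text as printed, with the scaling exposed as a parameter in case a different one is intended. Sequences are assumed to contain only A, C, G and T and to be longer than the largest tier.

```python
from itertools import product


def kmer_frequencies(seq, k, pseudo_scale=None):
    """Pseudo-count corrected k-mer frequencies, Eq. (2).

    pseudo_scale is the factor multiplying p0 in b_i; by default it is N,
    the total k-mer count, as printed in the text (an assumption of this sketch).
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing ambiguous bases
            counts[word] += 1
    n_total = sum(counts.values())           # N
    p0 = 0.25 ** k                            # background probability of a k-mer
    scale = n_total if pseudo_scale is None else pseudo_scale
    b = p0 * scale                            # b_i, identical for every k-mer
    big_b = b * len(kmers)                    # B = sum of all b_i
    return [(counts[w] + b) / (n_total + big_b) for w in kmers]


def correlation_factor(seq, table, n):
    """n-th tier correlation factor h_{n,m} of Eq. (3) for one property table.

    table maps each of the 16 dinucleotides to the value of one parameter.
    """
    values = [table[seq[i:i + 2]] for i in range(len(seq) - 1)]  # L-1 steps
    terms = len(values) - n                                      # L-1-n
    if terms < 1:
        raise ValueError("sequence too short for this tier")
    return sum((values[i] - values[i + n]) ** 2 for i in range(terms)) / terms


def feature_vector(seq, k, tables, max_tier=5):
    """Concatenate k-mer frequencies and correlation factors as in Eq. (4)."""
    features = kmer_frequencies(seq, k)
    for n in range(1, max_tier + 1):      # tiers 1..5
        for table in tables:              # parameters m = 1..len(tables)
            features.append(correlation_factor(seq, table, n))
    return features
```

With fifteen property tables and k = 2, `feature_vector` would return 16 + 5 × 15 = 91 values, matching the dimension stated for Eq. (4).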

Then the weights of the features are calculated as

$$w(\lambda) = \frac{F(\lambda)}{\sum_{\lambda} F(\lambda)} \tag{5}$$

where $F(\lambda)$ is a measure of the contribution of the λ-th feature to classification and is defined as

$$F(\lambda) = \frac{s_B^2(\lambda)}{s_W^2(\lambda)} \tag{6}$$

where $s_B^2(\lambda)$ is the variance of the λ-th feature between classes and $s_W^2(\lambda)$ is the variance of the λ-th feature within classes. They can be calculated by (64)

$$s_B^2(\lambda) = \frac{1}{df_B}\sum_{i=1}^{K} n_i\left(\frac{\sum_{j=1}^{n_i} x_{ij}(\lambda)}{n_i} - \frac{\sum_{i=1}^{K}\sum_{j=1}^{n_i} x_{ij}(\lambda)}{\sum_{i=1}^{K} n_i}\right)^2 \tag{7}$$

$$s_W^2(\lambda) = \frac{1}{df_W}\sum_{i=1}^{K}\sum_{j=1}^{n_i}\left(x_{ij}(\lambda) - \frac{\sum_{j=1}^{n_i} x_{ij}(\lambda)}{n_i}\right)^2 \tag{8}$$

where K (K=2 in our prediction) and N represent the number of classes and the total number of samples, respectively, $x_{ij}(\lambda)$ represents the value of the λ-th feature of the j-th sample in the i-th class, $n_i$ is the number of samples in the i-th class, $df_B = K-1$ is the degrees of freedom for $s_B^2(\lambda)$ and $df_W = N-K$ is the degrees of freedom for $s_W^2(\lambda)$. According to statistical theory, the F(λ) in Eq. (6) follows the F sampling distribution with degrees of freedom $df_B$ and $df_W$. The larger the value of F(λ), the larger the contribution of the feature to classification. In other words, a feature with a large F(λ) is a good discriminator of the two classes. The features can be ranked according to their F values, and those with no significant difference between the two classes (F(λ)≈1, P>0.05) can be removed from the feature list according to the F-test. Before weighting the features, they were first normalized by their respective standard deviation:

$$x_{ij}^{*} = \frac{x_{ij} - \mu}{SD} \tag{9}$$

where μ and SD denote the mean and standard deviation of the feature over the whole benchmark dataset, respectively. After normalization and weight calculation, a sequence is finally represented by a weighted-feature vector as

$$S = [w_1 f_1^{*}, w_2 f_2^{*}, \ldots, w_n f_n^{*}, w_{1,1} h_{1,1}^{*}, w_{1,2} h_{1,2}^{*}, \ldots, w_{5,15} h_{5,15}^{*}]^T \tag{10}$$

where $w_1, w_2, \ldots, w_{5,15}$ are the weights of the normalized features $f_1^{*}, f_2^{*}, \ldots, h_{5,15}^{*}$, respectively.
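The weighting scheme of Eqs. (5)-(10) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, each sample is assumed to be a plain list of feature values, and the sample standard deviation (N-1 denominator) is an assumption, since the text does not say which estimator was used. Features with zero within-class variance or zero overall variance are not handled.

```python
import statistics


def f_statistic(classes):
    """F(lambda) of Eq. (6) for one feature.

    classes is a list of K lists, each holding the feature's values in one class.
    """
    k = len(classes)
    n_total = sum(len(c) for c in classes)
    grand_mean = sum(sum(c) for c in classes) / n_total
    means = [sum(c) / len(c) for c in classes]
    # Between-class variance, Eq. (7), with df_B = K - 1
    s_between = sum(len(c) * (m - grand_mean) ** 2
                    for c, m in zip(classes, means)) / (k - 1)
    # Within-class variance, Eq. (8), with df_W = N - K
    s_within = sum((x - m) ** 2
                   for c, m in zip(classes, means) for x in c) / (n_total - k)
    return s_between / s_within


def feature_weights(hot, cold):
    """Weights w(lambda) of Eq. (5) from hot and cold feature vectors."""
    n_features = len(hot[0])
    f_values = [f_statistic([[v[i] for v in hot], [v[i] for v in cold]])
                for i in range(n_features)]
    total = sum(f_values)
    return [f / total for f in f_values]


def weighted_vectors(vectors, weights):
    """Standardize each feature over all vectors (Eq. 9), then weight it (Eq. 10)."""
    columns = list(zip(*vectors))
    mus = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]  # assumption: sample SD
    return [[w * (x - mu) / sd for x, w, mu, sd in zip(v, weights, mus, sds)]
            for v in vectors]
```

A feature whose class means coincide gets F = 0 and therefore weight 0, so it drops out of any distance computed on the weighted vectors.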

2.3 Representing samples with Euclidean distance

Suppose that two sequences are represented by n-dimensional vectors $S_1 = \{x_i : i = 1, \ldots, n\}$ and $S_2 = \{y_i : i = 1, \ldots, n\}$; the Euclidean distance between them is

$$ED = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \tag{11}$$
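Eq. (11) and its use in reducing a sequence to two numbers can be sketched as below. The sketch assumes, following the description of the WF model in this paper, that each sequence is compared with the average vectors of the positive and negative training sets; the function names are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors, Eq. (11)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def distance_pair(vector, hot_train, cold_train):
    """Represent one weighted-feature vector by its distances to the two class means."""
    return [euclidean(vector, mean_vector(hot_train)),
            euclidean(vector, mean_vector(cold_train))]
```

Because the features are multiplied by their weights before the distance is taken, a feature with a small weight contributes little to either distance.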

2.4 Quadratic discriminant analysis (QD)

In multivariate discrimination of two normal populations, the optimal classification procedure is based on the quadratic discriminant function. In this study, a quadratic discriminant function based on Mahalanobis distance was used for prediction of recombination hot/cold spots (65). Mahalanobis distance is based on correlations between variables, by which different patterns can be identified and analyzed (66). It is a useful measure of the difference between an unknown sample set and a known one. It differs from Euclidean distance in that it takes into account the correlations of the dataset and is scale-invariant.

In two-class prediction, there are two training datasets (a positive and a negative set). Each training dataset is represented by N vectors of n dimensions, in which the j-th vector is denoted as $X_j^u = [x_{j1}, x_{j2}, \ldots, x_{jn}]^T$ ($j = 1, 2, \ldots, N$; $u = 1, 2$). The superscript u denotes the class of the data: u=1 corresponds to the positive set and u=2 to the negative set. The mean vector averaged over the dataset is denoted as $\bar{X}^u = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n]$, where $\bar{x}_i = \sum_{j=1}^{N} x_{ji} / N$, $i = 1, 2, \ldots, n$. The covariance matrix of the u-th training dataset is denoted as

$$C_u = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1n} \\ c_{21} & c_{22} & \cdots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nn} \end{pmatrix} \tag{12}$$

where $c_{ji} = \sum_{s=1}^{N} (x_{sj} - \bar{x}_j)(x_{si} - \bar{x}_i) / (N-1)$ and $c_{ji} = c_{ij}$.

In the same state space as the training datasets, an individual in the test set is represented by an n-dimensional vector $Y = [y_1, y_2, \ldots, y_n]^T$, and the quadratic discriminant function (58) that determines the class of the individual is given by

$$\xi = \log_2\frac{N_1}{N_2} - \frac{\delta_1 - \delta_2}{2} - 0.5\log_2\frac{|C_1|}{|C_2|} \tag{13}$$

where $N_1$ and $N_2$ represent the sample sizes of the positive and negative training sets, respectively, $\delta_u = (Y - \bar{X}^u)^T (C_u)^{-1} (Y - \bar{X}^u)$ is the squared Mahalanobis distance between $Y$ and $\bar{X}^u$, and $|C_u|$ is the determinant of the covariance matrix $C_u$. Let $\xi_0$ be the optimal threshold of ξ for discriminating the two test datasets. Generally, $\xi_0$ is around zero for equal-sized positive and negative training sets and deviates from zero for different-sized samples. In this study, the optimal threshold is determined according to an empirical rule (67,68) described in detail in reference (67). The discriminant rule assigns the test individual to the positive class if $\xi > \xi_0$, and otherwise to the negative class.
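The discriminant function of Eq. (13) can be sketched for the two-dimensional case, which is the case that arises when each sequence is represented by two distances. This is a sketch under assumptions: the function names are hypothetical, the covariance matrices are inverted with the closed-form 2×2 formula, singular covariance matrices are not handled, and the base-2 logarithm follows the equation as printed in this manuscript.

```python
import math


def mean_and_cov(samples):
    """Mean vector and 2x2 covariance matrix (N-1 denominator) of 2-D points."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    cxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    cyy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    cxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), ((cxx, cxy), (cxy, cyy))


def determinant(cov):
    """Determinant of a 2x2 matrix."""
    (a, b), (c, d) = cov
    return a * d - b * c


def mahalanobis_sq(y, mean, cov):
    """Squared Mahalanobis distance delta_u, using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / determinant(cov)


def discriminant(y, positive, negative):
    """xi of Eq. (13) for a 2-D test point y and two 2-D training sets."""
    m1, c1 = mean_and_cov(positive)
    m2, c2 = mean_and_cov(negative)
    return (math.log2(len(positive) / len(negative))
            - (mahalanobis_sq(y, m1, c1) - mahalanobis_sq(y, m2, c2)) / 2
            - 0.5 * math.log2(determinant(c1) / determinant(c2)))


def classify(y, positive, negative, threshold=0.0):
    """Assign y to the positive class when xi exceeds the threshold xi_0."""
    return "positive" if discriminant(y, positive, negative) > threshold else "negative"
```

The default threshold of zero is only a placeholder; the text determines $\xi_0$ by an empirical rule that is not reproduced here.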

2.5 Predictive model

A predictive model called WF (Weighted-Feature) was developed to predict recombination hot/cold spots by quadratic discriminant analysis based on a weighted-feature distance measure. The model carries out two-class prediction, using hotspots as the positive dataset and coldspots as the negative dataset. There are four steps in the modeling: first, sequences are represented by feature vectors; second, the feature vectors are weighted after normalization; third, the sequences are further represented by their distances to the two training sets; finally, the class of each test sequence is determined by quadratic discriminant analysis using the distances as input. Specifically, after the benchmark dataset was constructed, we represented each sequence by a vector composed of the k-mer frequencies and the sequence-order correlation factors of six dinucleotide flexibility parameters, six dinucleotide structure parameters and three dinucleotide thermodynamic parameters. Each feature in the vector is then normalized by its standard deviation in the benchmark dataset and multiplied by its weight. Then the two Euclidean distances of each vector to the two average vectors of the positive and negative training sets were calculated. In this way, each sequence in the training and test datasets was represented by a two-dimensional vector consisting of these two Euclidean distances. We used Euclidean distances to represent sequences because the Euclidean distance not only captures the weight differences between features but also reduces data dimensionality before prediction. If hot/cold spots were instead predicted by the quadratic discriminant classifier using the weighted features directly as input, the Mahalanobis distance in the quadratic discriminant function would be unable to capture the weight information, because it is scale-invariant and the per-feature weights cancel out. Finally, quadratic discriminant analysis was used to predict the classes of the sequences in the test dataset according to the Euclidean distances between the test-sequence feature vectors and the training datasets.

2.6 Assessment of prediction performance

The performance of a predictor can be evaluated by three cross-validation methods, namely the independent dataset test, the sub-sampling test, and the jackknife test, combined with several indices that quantify prediction accuracy. Among the three cross-validation methods, the jackknife test is the least arbitrary and most objective (55) and has been increasingly adopted by investigators to test the power of various predictors (69-72). However, to reduce computational time we used 5-fold cross-validation, as done by many investigators (68,73-76). Four indices, namely Sensitivity (Sn), Specificity (Sp), Total accuracy (TA) and Matthews correlation coefficient (MCC), are used to quantify prediction accuracy:

$$S_n = \frac{TP}{TP + FN} \tag{14}$$

$$S_p = \frac{TN}{TN + FP} \tag{15}$$

$$TA = \frac{TP + TN}{TP + FP + TN + FN} \tag{16}$$

$$MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP) \times (TN + FN) \times (TP + FN) \times (TN + FP)}} \tag{17}$$

Here, TP (true positive) denotes the number of correctly predicted positive sequences, FN (false negative) denotes the number of positive sequences predicted as negative, TN (true negative) denotes the number of correctly predicted negative sequences, and FP (false positive) denotes the number of negative sequences predicted as positive. Although the set of indices described above is often used to measure the quality of a prediction method, it is valid only for single-label systems. For multi-label systems, which have become more frequent in systems biology (77), a different set of metrics is needed, as defined in (78).
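The four indices of Eqs. (14)-(17) can be computed from the confusion-matrix counts as in the sketch below. The function name and the choice to return MCC = 0 when the denominator vanishes are assumptions of this sketch, and the counts in the usage note are made-up numbers rather than results from this study.

```python
import math


def performance_indices(tp, fn, tn, fp):
    """Return (Sn, Sp, TA, MCC) from confusion-matrix counts, Eqs. (14)-(17)."""
    sn = tp / (tp + fn)                        # sensitivity
    sp = tn / (tn + fp)                        # specificity
    ta = (tp + tn) / (tp + fp + tn + fn)       # total accuracy
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    # Convention assumed here: MCC is reported as 0 when a margin is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, ta, mcc
```

For example, hypothetical counts of TP = 45, FN = 5, TN = 40 and FP = 10 give Sn = 0.90, Sp = 0.80, TA = 0.85 and an MCC of about 0.70.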

3 RESULTS

Assuming that recombination-associated sequence features are simply reflected in k-mer frequencies, Liu et al. (40) predicted recombination hot/cold spots with the IDQD approach, using k-mer frequencies of genomic sequences as features. The total accuracy of ~80% demonstrated the efficacy of k-mer frequencies in discriminating hotspots from coldspots. More recently, Chen et al. (42) presented a prediction model based on the pseudo amino acid composition concept, in which other kinds of features apart from k-mer frequencies are incorporated. That model achieved an improved accuracy of ~85% in classifying hot/cold spots. Before incorporating different kinds of features into our prediction, we predicted recombination hot/cold spots using only k-mer frequencies without weighting. Different k-mers exhibit similar prediction performance, with an accuracy of ~82% (Table 2). To explore whether the distance measure used in our model influences prediction performance, we replaced the Euclidean distance in the model with other distance metrics (see Supplementary Information), including Relative Entropy (79), Increment of Diversity (68) and Cosine distance, and compared their predictions with those of the Euclidean distance-based model (EDQD). These models achieved similar prediction accuracy, indicating that the distance measure is not a key factor influencing performance (Table 2). In addition, an SVM model based solely on 2-mer frequencies achieved similar accuracy (Table 2). It should be noted that the relative entropy distance is restricted to

probabilistic features and the increment of diversity to occurrence-based features, so neither can be applied to the weighted features in our model. By contrast, Euclidean distance and Cosine distance are suitable for any kind of numeric feature, which is why we used Euclidean distance in our model. After DNA physical and thermodynamic properties were incorporated in our model as weighted features rather than as pseudo dinucleotide composition, an accuracy of 90% was achieved (Table 3). This is higher than the accuracy of 85% achieved by the iRSpot-PseDNC model (42), which is the best predictor among existing methods (Table 3). Besides the overall accuracy, the other indices achieved by our predictor, namely specificity, sensitivity and correlation coefficient, are all markedly improved compared with the iRSpot-PseDNC model. When the benchmark sets were further filtered to the 177 hotspots and 40 coldspots defined by Gerton et al. (30), the prediction performance of WF was better than that of the codon usage-based SVM predictor (37) (Table 4). Our results also indicate that if the positive and negative datasets are composed of more extreme sequences, k-mer frequency is adequate to obtain high prediction accuracy, and additional sequence-order information improves prediction performance less than it does otherwise. To test the ability of our model, we also predicted the 452 experimentally identified recombination hotspots on yeast chromosome IV (56), from which no information was extracted in the form of feature weights. Our prediction accuracy of 98.5% ($\xi_0 = 0$) is higher than the 76.8% generated by a pseudo dinucleotide composition-based classifier (42). Using the Euclidean distances calculated on our weighted features as input, an SVM classifier also achieved an accuracy of 98.5%. Similarly, we predicted the 128 hotspot sequences identified from high-resolution recombination data (29). High success rates of 98.4% by WF and 97.7% by SVM were obtained.
These hotspot sequences are not pure ORFs, yet by training our model on hot/cold ORFs we still achieved a high success rate in the prediction. This is understandable because the major part of the predicted hotspots consists of ORF sequence, although at least part of the hotspot sequences is composed of non-ORF fragments, which mix recombination-associated signals with noise owing to the different evolutionary patterns of various genomic elements. For example, it is known that most DSBs occur in promoter regions, but hotspot intervals primarily overlap coding sequence: only 25% of the bases in hotspot intervals overlap promoters, whereas 68% overlap coding sequences (29).

4 DISCUSSION

Feature selection is crucial for classification and prediction. There are two major concerns associated with feature selection: selecting, from a large number of possible features, those that can discriminate different classes without overfitting, and incorporating different kinds of features into prediction models. In most practical cases the former is a knowledge-based issue, since possibly important features are selected according to prior knowledge. The features, however, can also be sorted and filtered by rigorous mathematical methods. Feature incorporation is not easy, because different kinds of features carry different weights in defining the samples and take various mathematical forms, which in general cannot be correctly evaluated or combined. Our characterization of sequences with weighted features addresses both problems effectively. As correct estimation of the weights of different features is crucial in our model, it is better to base the calculation on large, unbiased, normally distributed positive and negative datasets. One may argue that the weights in the prediction were calculated using the overall benchmark dataset (including the training set and test set) and that this makes our test, at least partially, not independent. To avoid this bias, we also predicted recombination hot/cold spots based on weights re-calculated over each training dataset in 5-fold cross-validation, and obtained prediction performance as good as before (Table 3). Sequence homology may create bias in machine-learning prediction. A cutoff threshold of 25% sequence identity was proposed to exclude homologous sequences from a benchmark dataset (55). Accordingly, to exclude possible bias caused by sequence homology, we obtained a non-redundant set of 183 hotspots and 45 coldspots by excluding sequences with more than 25% sequence identity from our original benchmark dataset (490 hotspots and 590 coldspots).
Prediction over the non-redundant dataset achieved a reduced accuracy of 86.8%, with a low specificity (Table 3), compared with the previous result. There are two possible reasons for the reduced accuracy. First, sequence similarity within the original benchmark dataset may have elevated the prediction accuracy, because test sequences resemble training sequences. Second, the small sample size of the non-redundant dataset leads to biased estimates of the feature weights, which in turn affects the prediction accuracy. Note that the reduction in sample size caused by excluding homologous sequences differs from the reduction caused by selecting sequences with extreme recombination rates. The low specificity reflects the small number of coldspots. Our results


suggest that it is practically important to select a proper sequence-identity cutoff, which is a compromise between sample size and non-redundancy.

There are several ways to define a pseudo-count for different biological problems. Since recombination occurs genome-wide, in addition to the pseudo-count described in Methods we also used a different pseudo-count, defined as the dinucleotide frequency of the whole yeast genome, and obtained prediction results (data not shown) highly similar to those listed in Table 3. This is probably because the majority of the sequences used in this study (98%) exceed 300 bp, so the pseudo-count has little effect on our results.

A method referred to as pseudo amino acid composition has been used extensively to incorporate different kinds of features in machine-learning prediction and has achieved significant improvements in prediction performance (42,55,80). From its formulation, one can see that the weight factor in the pseudo amino acid composition (or pseudo oligonucleotide composition) modulates the relative importance of the traditional k-mer composition and the sequence-order information in prediction. However, each feature formulated by Eq. (5) in ref. (42) to represent sequence-order information may contain various kinds of information that contribute differently to classification. Assigning different weights to different kinds of information may therefore improve prediction performance. Our method is able to sort the features and delete noisy ones from the feature list before prediction. In Eq. (6), if the difference between the between-class variance and the within-class variance is not significant (F-test with P>0.05), the feature cannot efficiently discriminate the classes and can therefore be deleted.
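As an illustration of the F-test criterion, the sketch below computes a one-way ANOVA F statistic (between-class variance divided by within-class variance) for each feature and keeps only the features that exceed a critical value. It assumes that this ratio corresponds to the quantity compared in Eq. (6), which is not reproduced here; the function names and the `f_crit` parameter are hypothetical.

```python
def f_statistic(values, labels):
    """One-way ANOVA F statistic of a single feature across classes."""
    groups = {}
    for v, c in zip(values, labels):
        groups.setdefault(c, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {c: sum(g) / len(g) for c, g in groups.items()}
    # Between-class variance: spread of class means around the grand mean.
    between = sum(len(g) * (means[c] - grand_mean) ** 2
                  for c, g in groups.items()) / (k - 1)
    # Within-class variance: spread of samples around their own class mean.
    within = sum((v - means[c]) ** 2
                 for c, g in groups.items() for v in g) / (n - k)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


def informative_features(X, labels, f_crit):
    """Indices of features whose F statistic exceeds the critical value f_crit."""
    columns = list(zip(*X))
    return [j for j, col in enumerate(columns)
            if f_statistic(col, labels) > f_crit]
```

The critical value for P = 0.05 depends on the degrees of freedom: it is about 5.32 for two classes and ten samples (1 and 8 degrees of freedom) and about 3.85 for two classes and roughly a thousand samples. If SciPy is available, `scipy.stats.f.ppf(0.95, dfn, dfd)` returns it directly.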
Of course, including such a feature may sometimes have little effect on prediction performance, because the weight computed for a non-significant feature is too small to affect the computed Euclidean distance appreciably. Selecting non-redundant features is also important. If two features correlate strongly with each other, they carry redundant information and only one of them is needed in the prediction. In other words, it is advisable to select features that capture unique properties of the samples and to delete those that correlate strongly with features already selected. As deletion of redundant and noisy features may be especially important when the feature dimensionality must be reduced, we shall develop a method for this purpose.
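One simple way to remove redundant features, given here only as a sketch of the idea and not as the method the authors intend to develop, is a greedy filter: features are visited in decreasing order of some score (for example their weight), and a feature is kept only if its absolute Pearson correlation with every feature already kept stays below a threshold. The names `drop_redundant` and `max_abs_r`, and the default threshold of 0.9, are assumptions made for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant feature is treated as uncorrelated
    return cov / math.sqrt(var_a * var_b)


def drop_redundant(X, scores, max_abs_r=0.9):
    """Greedily keep high-scoring features that are not strongly correlated."""
    columns = list(zip(*X))
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    kept = []
    for j in order:
        if all(abs(pearson(columns[j], columns[i])) < max_abs_r for i in kept):
            kept.append(j)
    return sorted(kept)
```

Because the best-scoring feature of each correlated group is visited first, it is the one that survives; the returned indices refer to the columns of `X`.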


As demonstrated in a series of publications (42,81-85), user-friendly and publicly accessible web servers can greatly facilitate the work of relevant investigators; in our future work we shall make efforts to provide a web server for the prediction method presented in this study.

To conclude, we presented a predictive model that focuses on incorporating different kinds of information in the form of weighted features. When applied to recombination hot/cold spots in yeast, our model outperforms existing methods by as much as 5% in overall accuracy. Our results also indicate that the model captures the recombination-associated information in ORFs better than other models. Furthermore, the model may be broadly applicable to many other classification problems, as it is suitable for any kind of numeric feature.

ACKNOWLEDGEMENTS

We thank Lin H. for his helpful suggestions and the reviewers for their comments on this study. This work was supported by grants from the National Natural Science Foundation (61102162, 61271448), the Natural Science Foundation of Inner Mongolia (2014MS0312) and the Program for Young Talents of Science and Technology in Universities of Inner Mongolia Autonomous Region (NJYT-14-B10).


REFERENCES

1. Lynn, A., Ashley, T. and Hassold, T. (2004) Variation in human meiotic recombination. Annu. Rev. Genomics Hum. Genet., 5: 317–349.
2. Coop, G. and Przeworski, M. (2007) An evolutionary view of human recombination. Nat. Rev. Genet., 8: 23–34.
3. Felsenstein, J. (1974) The evolutionary advantage of recombination. Genetics, 78: 737–756.
4. Lewin, B. (2004) Genes VIII, Upper Saddle River, NJ: Pearson Prentice Hall.
5. Webster, M.T. and Hurst, L.D. (2012) Direct and indirect consequences of meiotic recombination: implications for genome evolution. Trends Genet., 28: 101–109.
6. Galtier, N., Piganeau, G., Mouchiroud, D. and Duret, L. (2001) GC-Content evolution in mammalian genomes: the biased gene conversion hypothesis. Genetics, 159: 907–911.
7. Lercher, M.J. and Hurst, L.D. (2002) Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet., 18: 337–340.
8. Meunier, J. and Duret, L. (2004) Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol., 21: 984–990.
9. Birdsell, J.A. (2002) Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution. Mol. Biol. Evol., 19: 1181–1197.

10. Liu, G. and Li, H. (2008) The correlation between recombination rate and dinucleotide bias in Drosophila melanogaster. J. Mol. Evol., 67: 358–367.
11. Singh, N.D., Davis, J.C. and Petrov, D.A. (2005) Codon bias and non-coding GC content correlate negatively with recombination rate on the Drosophila X chromosome. J. Mol. Evol., 61: 315–324.
12. Bartolome, C., Maside, X. and Charlesworth, B. (2002) On the abundance and distribution of transposable elements in the genome of Drosophila melanogaster. Mol. Biol. Evol., 19: 926–937.
13. Jensen-Seaman, M.I., Furey, T.S., Payseur, B.A., Lu, Y.T., Roskin, K.M., Chen, C.F., Thomas, M.A., Haussler, D. and Jacob, H.J. (2004) Comparative recombination rates in the rat, mouse, and human genomes. Genome Res., 14: 528–538.
14. Liu, G., Li, H. and Cai, L. (2010) Processed pseudogenes are located preferentially in regions of low recombination rates in the human genome. J. Evol. Biol., 23: 1107–1115.
15. Comeron, J.M. and Kreitman, M. (2000) The correlation between intron length and recombination in Drosophila: dynamic equilibrium between mutational and selective forces. Genetics, 156: 1175–1190.
16. Presgraves, D.C. (2005) Recombination enhances protein adaptation in Drosophila melanogaster. Curr. Biol., 15: 1651–1656.
17. Myers, S., Bottolo, L. and Freeman, C. (2005) A fine-scale map of recombination rates and hotspots across the human genome. Science, 310: 321–324.
18. Getun, I.V., Wu, Z.K., Khalil, A.M. and Bois, P.R. (2010) Nucleosome occupancy landscape and dynamics at mouse recombination hotspots. EMBO reports, 11: 555–560.
19. de Castro, E., Soriano, I., Marín, L., Serrano, R., Quintales, L. and Antequera, F. (2012) Nucleosomal organization of replication origins and meiotic recombination hotspots in fission yeast. The EMBO Journal, 31: 124–137.
20. Myers, S., Freeman, C., Auton, A. et al. (2008) A common sequence motif associated with recombination hot spots and genome instability in humans. Nat. Genet., 40: 1124–1129.


21. Myers, S., Bowden, R., Tumian, A. et al. (2010) Drive against hotspot motifs in primates implicates the PRDM9 gene in meiotic recombination. Science, 327: 876–879.
22. Parvanov, E.D., Petkov, P.M. and Paigen, K. (2010) PRDM9 controls activation of mammalian recombination hotspots. Science, 327: 835.
23. Baudat, F., Buard, J., Grey, C. et al. (2010) PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice. Science, 327: 836–840.
24. Auton, A., Fledel-Alon, A., Pfeifer, S. et al. (2012) A fine-scale chimpanzee genetic map from population sequencing. Science, 336: 193–198.
25. Hansen, L., Kim, N.K., Mariño-Ramírez, L. and Landsman, D. (2011) Analysis of biological features associated with meiotic recombination hot and cold spots in Saccharomyces cerevisiae. PLoS One, 6: e29711.
26. Cromie, G.A. and Smith, G.R. (2007) Branching out: meiotic recombination and its regulation. Trends in Cell Biology, 9: 448–455.
27. Brachet, E., Sommermeyer, V. and Borde, V. (2012) Interplay between modifications of chromatin and meiotic recombination hotspots. Biol Cell., 104: 51–69.
28. Youds, J.L. and Boulton, S.J. (2011) The choice in meiosis-defining the factors that influence crossover or non-crossover formation. J. Cell Sci., 124: 501–513.
29. Mancera, E., Bourgon, R., Brozzi, A., Huber, W. and Steinmetz, L.M. (2008) High-resolution mapping of meiotic crossovers and non-crossovers in yeast. Nature, 454: 479–485.
30. Gerton, J.L., DeRisi, J., Shroff, R. et al. (2000) Global mapping of meiotic recombination hotspots and coldspots in the yeast Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. USA, 97: 11383–11390.
31. Lobachev, K.S., Shor, B.M., Tran, H.T., Taylor, W., Keen, J.D., Resnick, M.A. and Gordenin, D.A. (1998) Factors affecting inverted repeat stimulation of recombination and deletion in Saccharomyces cerevisiae. Genetics, 148: 1507–1524.
32. Nasar, F., Jankowski, C. and Nag, D.K. (2000) Long palindromic sequences induce double-strand breaks during meiosis in yeast. Mol. Cell Biol., 20: 3449–3458.
33. Maloisel, L. and Rossignol, J.L. (1998) Suppression of crossing-over by DNA methylation in Ascobolus. Genes Dev., 12: 1381–1389.
34. Yamada, S., Ohta, K. and Yamada, T. (2013) Acetylated Histone H3K9 is associated with meiotic recombination hotspots, and plays a role in recombination redundantly with other factors including the H3K4 methylase Set1 in fission yeast. Nucleic Acids Res., 41: 3504–3517.
35. Cesarini, E., D'Alfonso, A. and Camilloni, G. (2012) H4K16 acetylation affects recombination and ncRNA transcription at rDNA in Saccharomyces cerevisiae. Mol Biol Cell., 23: 2770–2781.
36. Winckler, W., Myers, S.R. et al. (2005) Comparison of fine-scale recombination rates in humans and chimpanzees. Science, 308: 107–111.
37. Zhou, T., Weng, J., Sun, X. and Lu, Z. (2006) Support vector machine for classification of meiotic recombination hotspots and coldspots in Saccharomyces cerevisiae based on codon composition. BMC Bioinformatics, 7: 223.
38. Jiang, P., Wu, H., Wei, J., Sang, F., Sun, X. and Lu, Z. (2007) RF-DYMHC: detecting the yeast meiotic recombination hotspots and coldspots by random forest model using gapped dinucleotide composition features. Nucleic Acids Res., 35: W47–W51.


39. Guo, S.H., Xua, L.Q., Chen, W., Liu, G. and Lin, H. (2012) Recombination spots prediction using DNA physical properties in the Saccharomyces cerevisiae genome. AIP Conf Proc, 9: 1479–1556.
40. Liu, G., Liu, J., Cui, X. and Cai, L. (2012) Sequence-dependent prediction of recombination hotspots in Saccharomyces cerevisiae. J. Theor. Biol., 293: 49–54.
41. Wu, M., Kwoh, C.K., Przytycka, M.T., Li, J. and Zheng, J. (2012) Integration of genomic and epigenomic features to predict meiotic recombination hotspots in human and mouse. BCB '12, Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, 297–304.
42. Chen, W., Feng, P.M., Lin, H. and Chou, K.C. (2013) iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res., 41: e68.
43. Qiu, W.R. and Xiao, X. (2014) iRSpot-TNCPseAAC: Identify recombination spots with trinucleotide composition and pseudo amino acid components. Int. J. Mol. Sci., 15: 1746–1766.
44. Chen, W., Lin, H., Feng, P.M., Ding, C., Zuo, Y.C. and Chou, K.C. (2012) iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties. PLoS One, 7: e47843.
45. Feng, P., Chen, W. and Lin, H. (2014) Prediction of CpG island methylation status by integrating DNA physicochemical properties. Genomics, 104: 229–233.
46. Chen, W., Feng, P.M., Lin, H. and Chou, K.C. (2014) iSS-PseDNC: identifying splicing sites using pseudo dinucleotide composition. Biomed. Res. Int., 2014: 623149.
47. Lin, H., Deng, E.Z., Ding, H., Chen, W. and Chou, K.C. (2014) iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res., 42: 12961–12972.
48. Liu, B., Liu, F., Fang, L. and Wang, X. (2015) repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics, 31: 1307–1309.
49. Chen, W., Zhang, X., Brooker, J., Lin, H., Zhang, L. and Chou, K.C. (2015) PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics, 31: 119–120.
50. Chen, W., Lei, T.Y., Jin, D.C., Lin, H. and Chou, K.C. (2014) PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. Anal. Biochem., 456: 53–60.
51. Liu, B., Liu, F., Wang, X., Chen, J. and Fang, L. (2015) Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research, doi:10.1093/nar/gkv458.
52. Boulton, A., Myers, R.S. and Redfield, R.J. (1997) The hotspot conversion paradox and the evolution of meiotic recombination. Proc Natl Acad Sci USA, 94: 8058–8063.
53. Pineda-Krch, M. and Redfield, R.J. (2005) Persistence and loss of meiotic recombination hotspots. Genetics, 169: 2319–2333.
54. Tsai, I.J., Burt, A. and Koufopanou, V. (2010) Conservation of recombination hotspots in yeast. Proc. Natl. Acad. Sci. USA, 107: 7847–7852.
55. Chou, K.C. (2011) Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). J. Theor. Biol., 273: 236–247.


56. Pan, J., Sasaki, M., Kniewel, R., Murakami, H., Blitzblau, H.G., Tischfield, S.E., Zhu, X., Neale, M.J., Jasin, M., Socci, N.D. et al. (2011) A hierarchical combination of factors shapes the genomewide topography of yeast meiotic recombination initiation. Cell, 144: 719–731.
57. Morozov, A.V., Fortney, K., Gaykalova, D.A., Studitsky, V.M., Widom, J. and Siggia, E.D. (2009) Using DNA mechanics to predict in vitro nucleosome positions and formation energies. Nucleic Acids Res., 37: 4707–4722.
58. Olson, W.K., Gorin, A.A., Lu, X.J., Hock, L.M. and Zhurkin, V.B. (1998) DNA sequence-dependent deformability deduced from protein-DNA crystal complexes. Proc Natl Acad Sci USA, 95: 11163–11168.
59. Richmond, T.J. and Davey, C.A. (2003) The structure of DNA in the nucleosome core. Nature, 423: 145–150.
60. Tolstorukov, M.Y., Colasanti, A.V., McCandlish, D., Olson, W.K. and Zhurkin, V.B. (2007) A novel ‘Roll-and-Slide’ mechanism of DNA folding in chromatin. Implications for nucleosome positioning. J Mol Biol, 371: 725–738.
61. Ignatova, Z., Martinez-Perez, I. and Zimmermann, K.H. (2008) DNA Computing Models. New York: Springer.
62. Li, Q.Z. and Lin, H. (2006) The recognition and prediction of sigma70 promoters in Escherichia coli K-12. J. Theor Biol., 242: 135–141.
63. Chou, K.C. (2001) Prediction of protein cellular attributes using pseudo amino acid composition. Proteins, 43: 246–255.
64. Lin, H. and Ding, H. (2011) Predicting ion channels and their types by the dipeptide mode of pseudo amino acid composition. J. Theor. Biol., 269: 64–69.
65. Zhang, M.Q. (1997) Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proc. Natl. Acad. Sci. USA, 94: 565–568.
66. Mahalanobis, P.C. (1936) On the generalised distance in statistics. Proceedings of the National Institute of Sciences of India, 2: 49–55.
67. Lu, J., Luo, L.F., Zhang, L.R., Chen, W. and Zhang, Y. (2010) Increment of diversity with quadratic discriminant analysis–an efficient tool for sequence pattern recognition in bioinformatics. Open Access Bioinformatics, 2: 89–96.
68. Zhang, L.R. and Luo, L.F. (2003) Splice site prediction with quadratic discriminant analysis using diversity measure. Nucleic Acids Res., 31: 6214–6220.
69. Liu, B., Fang, L., Liu, F., Wang, X. and Chen, J. (2015) Identification of real microRNA precursors with a pseudo structure status composition approach. PLoS ONE, 10: e0121501.
70. Chen, W., Feng, P. and Lin, H. (2012) Prediction of replication origins by calculating DNA structural properties. FEBS letters, 586: 934–938.
71. Lin, H., Chen, W. and Ding, H. (2013) AcalPred: a sequence-based tool for discriminating between acidic and alkaline enzymes. PloS ONE, 8: e75726.
72. Liu, B., Xu, J., Lan, X., Xu, R., Zhou, J. and Wang, X. (2014) iDNA-Prot|dis: Identifying DNA-Binding Proteins by Incorporating Amino Acid Distance-Pairs and Reduced Alphabet Profile into the General Pseudo Amino Acid Composition. PLoS ONE, 9: e106691.
73. Liu, B., Chen, J. and Wang, X. (2015) Protein remote homology detection by combining Chou's distance-pair pseudo amino acid composition and principal component analysis. Molecular Genetics and Genomics, DOI: 10.1007/s00438-00015-01044-00434.
74. Liu, B., Wang, X., Zou, Q., Dong, Q. and Chen, Q. (2013) Protein Remote Homology Detection by Combining Chou's Pseudo Amino Acid Composition and Profile-Based Protein Representation. Molecular Informatics, 32: 775–782.
75. Liu, B., Wang, X., Chen, Q., Dong, Q. and Lan, X. (2012) Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection. PLoS ONE, 7: e46633.
76. Lin, H. and Li, Q.Z. (2007) Predicting conotoxin superfamily and family by using pseudo amino acid composition and modified Mahalanobis discriminant. Biochem. Biophys. Res. Commun., 354: 548–551.
77. Xiao, X., Wang, P. and Lin, W.Z. (2013) iAMP-2L: A two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal. Biochem., 436: 168–177.
78. Chou, K.C. (2013) Some remarks on predicting multi-label attributes in molecular biosystems. Molecular Biosystems, 9: 1092–1100.
79. Kullback, S. and Leibler, R.A. (1951) On information and sufficiency. Annals of Mathematical Statistics, 22: 79–86.
80. Chou, K.C. (2005) Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21: 10–19.
81. Liu, B., Xu, J., Fan, S., Xu, R., Zhou, J. and Wang, X. (2015) PseDNA-Pro: DNA-Binding Protein Identification by Combining Chou's PseAAC and Physicochemical Distance Transformation. Molecular Informatics, 34: 8–17.
82. Liu, B., Fang, L., Liu, F. and Wang, X. (2015) iMiRNA-PseDPC: microRNA precursor identification with a pseudo distance-pair composition approach. Journal of Biomolecular Structure and Dynamics, DOI: 10.1080/07391102.07392015.01014422.
83. Liu, B., Fang, L., Jie, C., Liu, F. and Wang, X. (2015) miRNA-dis: microRNA precursor identification based on distance structure status pairs. Molecular BioSystems, 11: 1194–1204.
84. Liu, B., Zhang, D., Xu, R., Xu, J., Wang, X., Chen, Q. and Dong, Q. (2014) Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics, 30: 472–479.
85. Chou, K.C. (2015) Impacts of bioinformatics to medicinal chemistry. Medicinal Chemistry, 11: 218–234.


Table 1. The parameters that describe physical properties and thermodynamic properties of DNA sequences

| step | AA/TT | AT | AG/CT | AC/GT | TA | TG/CA | TC/GA | GG/CC | GC | CG |
|---|---|---|---|---|---|---|---|---|---|---|
| F-roll | 0.041 | 0.054 | 0.042 | 0.065 | 0.031 | 0.035 | 0.049 | 0.041 | 0.054 | 0.039 |
| F-tilt | 0.078 | 0.098 | 0.058 | 0.071 | 0.065 | 0.056 | 0.065 | 0.057 | 0.065 | 0.059 |
| F-twist | 0.069 | 0.071 | 0.053 | 0.064 | 0.048 | 0.052 | 0.057 | 0.055 | 0.059 | 0.051 |
| F-slide | 6.689 | 9.611 | 3.472 | 6.803 | 1.853 | 2.003 | 4.268 | 2.99 | 4.206 | 2.713 |
| F-shift | 6.239 | 4.658 | 2.801 | 2.911 | 4.107 | 2.882 | 3.580 | 2.670 | 2.655 | 3.019 |
| F-rise | 21.340 | 24.792 | 17.477 | 21.977 | 14.239 | 14.512 | 18.410 | 14.252 | 17.311 | 14.655 |
| roll | 1.051 | 0.612 | 3.600 | 2.005 | 3.499 | 5.600 | 2.444 | 4.682 | 1.697 | 6.015 |
| tilt | -1.261 | 0 | -1.655 | 0.334 | 0 | 0.137 | 1.437 | -0.770 | 0 | 0 |
| twist | 35.02 | 30.72 | 32.29 | 31.53 | 36.94 | 35.43 | 35.67 | 33.54 | 34.07 | 33.67 |
| slide | -0.176 | -0.679 | -0.223 | -0.593 | 0.044 | 0.481 | -0.046 | -0.166 | -0.190 | 0.443 |
| shift | 0.013 | 0 | -0.023 | -0.018 | 0 | 0.009 | -0.011 | 0.026 | 0 | 0 |
| rise | 3.253 | 3.208 | 3.322 | 3.243 | 3.389 | 3.366 | 3.299 | 3.361 | 3.267 | 3.291 |
| energy | -1 | -0.88 | -1.28 | -1.44 | -0.58 | -1.45 | -1.3 | -1.84 | -2.24 | -2.17 |
| enthalpy | -7.6 | -7.2 | -7.8 | -8.4 | -7.2 | -8.5 | -8.2 | -8 | -9.8 | -10.6 |
| entropy | -21.3 | -20.4 | -21 | -22.4 | -21.3 | -22.7 | -22.2 | -19.9 | -24.4 | -27.2 |

Note: the first six rows give dinucleotide flexibility parameters, the next six rows give dinucleotide structure parameters, and the bottom three rows give thermodynamic properties.


Table 2. The average performances of EDQD in which k-mer frequencies were used as features in discriminating recombination hot/cold ORFs defined in Jiang et al. (38)

| feature | Sn(%) | Sp(%) | TA(%) | MCC |
|---|---|---|---|---|
| 1-mer | 74.9 | 89.7 | 82.9 | 0.661 |
| 2-mer | 73.9 | 87.3 | 81.2 | 0.624 |
| 3-mer | 75.1 | 88.0 | 82.1 | 0.645 |
| 4-mer | 72.4 | 90.5 | 82.3 | 0.648 |


Table 3. The average performances of different models in discriminating recombination hot/cold ORFs defined in Jiang et al. (38)

| method | feature | Sn(%) | Sp(%) | TA(%) | MCC |
|---|---|---|---|---|---|
| EDQD | 2-mer | 75.1 | 84.4 | 80.2 | 0.603 |
| REQD | 2-mer | 73.9 | 87.3 | 81.2 | 0.624 |
| CDQD | 2-mer | 69.2 | 89.7 | 80.4 | 0.609 |
| IDQD | 2-mer | 72.9 | 86.6 | 80.4 | 0.605 |
| SVM | 2-mer | 70.6 | 90.3 | 81.4 | 0.628 |
| WF | weighted-features (c) | 86.6 | 94.1 | 90.7 | 0.815 |
| WF | weighted-features (d) | 86.1 | 94.3 | 90.4 | 0.812 |
| WF | weighted-features (e) | 91.1 | 70.2 | 86.8 | 0.605 |
| iRSpot-PseDNC (a) | PseDNC | 81.6 | 88.1 | 85.2 | 0.692 |
| RF-DYMHC (b) | Gap{0,1} | 80.6 | 84.3 | 82.1 | 0.638 |

(a) prediction from Chen et al. (42)
(b) prediction from Jiang et al. (38)
(c) weights in the prediction were calculated using the overall benchmark dataset (including training set and test set)
(d) weights in the prediction were calculated using the training dataset
(e) prediction over the non-redundant benchmark dataset (see Discussion section)
EDQD, REQD, CDQD and IDQD represent the Euclidean Distance-based, Relative Entropy-based, Cosine Distance-based and Increment of Diversity-based models, respectively.
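For readers who want to relate the Sn, Sp, TA and MCC columns of the tables to raw prediction counts, the sketch below uses the standard definitions of sensitivity, specificity, total accuracy and the Matthews correlation coefficient. It is assumed, not confirmed by this section, that the definitions given in Methods match them; the function name `binary_metrics` and the counts in the usage note are purely illustrative and are not data from this study.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sn, Sp, TA and MCC from the four cells of a 2x2 confusion matrix."""
    sn = tp / (tp + fn)                      # fraction of hotspots recovered
    sp = tn / (tn + fp)                      # fraction of coldspots recovered
    ta = (tp + tn) / (tp + fn + tn + fp)     # fraction of all samples correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "TA": ta, "MCC": mcc}
```

With hypothetical counts of 100 positives and 50 negatives, `binary_metrics(90, 10, 30, 20)` gives Sn = 0.9, Sp = 0.6 and TA = 0.8. Because TA is the class-size-weighted mean of Sn and Sp, a small negative class can have a low Sp while TA remains fairly high, which is the pattern seen for the non-redundant dataset.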


Table 4. The average performances of different models in discriminating recombination hot/cold ORFs defined in Gerton et al. (30)

| method | feature | Sn(%) | Sp(%) | TA(%) | MCC |
|---|---|---|---|---|---|
| EDQD | 2-mer | 94.8 | 67.5 | 89.8 | 0.665 |
| WF | weighted-features | 92.0 | 92.5 | 92.1 | 0.799 |
| SVM (a) | FCU | 86.6 | 75.0 | 85.0 | |

(a) prediction from Zhou et al. (37), in which codon use frequency (FCU) was used as feature.


Highlights

1. We presented a novel model to predict hot/cold spots in yeast.
2. Diverse information was incorporated in the form of weighted features in the model.
3. The model achieved a much higher prediction accuracy than existing methods.
