Gene 366 (2006) 219 – 227 www.elsevier.com/locate/gene
Characterization and prediction of alternative splice sites
Magnus Wang ⁎, Antonio Marín
Departamento de Genética, Facultad de Biología, Universidad de Sevilla, Avenida de Reina Mercedes 6, E-41012 Sevilla, Spain
Received 30 October 2004; received in revised form 20 April 2005; accepted 8 July 2005
Available online 13 October 2005
Abstract
Human alternative isoform, cryptic, skipped, and constitutive splice sites from the ALTEXTRON database were analysed with regard to splice site strength, composition, GC content, and the position and binding site strength of the polypyrimidine tract and branch site. Several features were identified which distinguish alternative isoform and cryptic splice sites, but not skipped splice sites, from constitutive ones. These include splice site strength, intron GC content, U2AF35 binding site score, and oligonucleotide frequencies. For the predictive classification of splice sites, pattern recognition models for different splicing factor binding sites and oligonucleotide frequency models (OFMs) were combined using backpropagation networks. Networks trained to distinguish constitutive from alternative isoform/cryptic splice sites correctly classified 67.45% of acceptor sites and 71.23% of donor sites. A web application for the prediction of alternative splice sites is available at http://es.embnet.org/~mwang/assp.html. © 2005 Elsevier B.V. All rights reserved.
Keywords: Alternative splicing; Splice site prediction; GC content; Oligonucleotide-frequency models; Neural networks; ASSP
Abbreviations: A, adenine; G, guanine; T, thymine; C, cytosine; H, A/C/T; ASSP, alternative splice site predictor; EST, expressed sequence tag; MDD, maximum dependence decomposition; OFM, oligonucleotide frequency model; PAM, percent accepted mutation; PPT, polypyrimidine tract; PSSM, position specific score matrix; U2AF35, U2 snRNP auxiliary factor, 35 kDa subunit; U2AF65, U2 snRNP auxiliary factor, 65 kDa subunit.
⁎ Corresponding author. Tel.: +34 954557113. E-mail addresses: [email protected] (M. Wang), [email protected] (A. Marín).
0378-1119/$ - see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.gene.2005.07.015
1. Introduction
Alternative splicing is a key mechanism enriching proteomic diversity and regulating developmental and tissue-specific processes by producing several transcripts from single genes (Lopez, 1998; Smith and Valcárcel, 2000; Graveley, 2001; Modrek et al., 2001). Up to 74% of all human genes are estimated to produce more than one transcript (Mironov et al., 1999; Modrek et al., 2001; Johnson et al., 2003). The mechanisms leading to alternative splicing range from the blocking of splicing factor binding sites, e.g. the polypyrimidine tract, and the affinity increase of splicing factors by splice enhancers, to inhibition by pre-mRNA secondary structures (Lopez, 1998; Smith and Valcárcel, 2000; Graveley, 2001; Modrek et al., 2001). The resulting transcripts tend to insert or delete entire protein domains, or to modify functional residues of the products (Kriventseva et al., 2003). The corresponding proteins may exhibit slightly different or even antagonistic activities (Lopez, 1998; Smith and Valcárcel, 2000; Graveley, 2001).
Current detection of alternative splicing is based either on EST comparisons (e.g. Modrek et al., 2001; Coward et al., 2002) or on a combination of machine learning approaches and comparisons of gene homologs in human and mouse (Cawley and Patcher, 2003; Sorek et al., 2004; Dror et al., 2004). Although several approaches for the ab initio prediction of gene structure have been developed (e.g. Guigó et al., 1992; Kulp et al., 1996; Burge and Karlin, 1997; Milanesi and Rogozin, 1998; Milanesi et al., 1999), the ab initio prediction of alternative splicing has not been considered. A general problem of identifying alternative splicing events with current gene finding programs is that they usually search for optimal exons, splice sites, and gene structure. Alternative splice sites are usually weaker than constitutive sites, and alternative exons or introns may show an atypical composition (e.g. hexamer frequencies); they are therefore hard to detect with most gene finding programs (see Thanaraj and Stamm, 2003).
In the present study we analysed several features of constitutive, skipped (rarely not recognized), cryptic (rarely recognized), and alternative isoform splice sites (rarely recognized splice sites of extended or truncated exons),
M. Wang, A. Marín / Gene 366 (2006) 219–227
extracted from the ALTEXTRON database (Clark and Thanaraj, 2002), in order to find evidence for the ab initio prediction of alternative splicing events. Based on quantitative differences between alternative and constitutive splice sites, we developed a first approach for the ab initio detection of alternative splice sites by combining pattern recognition and composition models using neural networks.
2. Methods
Human sequences of constitutive, skipped, cryptic and alternative splice sites were extracted from the ALTEXTRON database (Clark and Thanaraj, 2002), pre-release 2 (downloaded on 29.5.2003), including 70 nucleotides of the neighbouring exons and introns. The dataset included splice sites of 23,680 constitutive, 3765 skipped and 1204 cryptic exons, 1347 alternative isoform acceptor splice sites, and 698 isoform donor splice sites. This set was divided into a learning set of 10,000 constitutive, 3500 skipped, 1100 cryptic, and 1200 isoform acceptor sites, and a test set containing 1000 constitutive, 265 skipped, 104 cryptic, and 147 isoform acceptor sites. The training and test sets of the 5′ splice site comprised the same number of constitutive, skipped, and cryptic splice sites, but only 600 (training set) and 98 (test set) alternative isoform donor sites. Since we planned to analyse oligonucleotide frequencies of length four or five nt, we split the data into a large learning set, at the expense of a rather small test set.
False splice sites were extracted within ±70 nt of constitutive acceptor and donor splice sites, by labelling all subsequences which contained AG or GT, respectively, and which were not included as splice sites in the ALTEXTRON set, as “false”.
Position specific score matrices (PSSMs) were implemented using information content (Schneider, 1997). Oligonucleotide-frequency models (OFMs) measure oligonucleotide frequencies for a reference set of sequences, e.g. constitutively spliced exons, and calculate the probability of a test sequence, e.g. an alternatively spliced exon, to be derived from the reference set, given the oligonucleotide distribution observed in the test sequence. Log-odds scores are first calculated for each oligonucleotide observed within a given window size of the test sequence. Finally, total scores are calculated by summing up the log-odds scores for all oligonucleotides observed in that window. Pseudocounts of all models were derived from a PAM substitution matrix. Three-layer backpropagation networks were trained using the standard momentum algorithm. All sequence statistics and models were calculated using the program package SEQOOL (Wang and Marín, in preparation).
3. Results
3.1. Splice site strength
For estimating splice site strength, PSSMs were built for all constitutive splice sites of the learning set. For the 3′ splice site a PSSM measuring information content was built for the range of the binding site of U2AF65, which binds to the polypyrimidine tract, and the AG dinucleotide of the splice site (−15 to +2 nt). Splice site strength was low in cryptic and especially alternative isoform acceptor and donor sites (Fig. 1). “AG-dependent” introns show a weak or no PPT. In these introns binding affinity of U2AF65 needs to be increased by U2AF35, which interacts with the AG dinucleotide and some nucleotides of the downstream exon (Guth et al., 2001). In order to evaluate the strength of the binding site for U2AF35, another PSSM, using information content (+1 to +10 nt), was built for constitutive acceptor sites. This PSSM divided constitutive splice sites into two distributions overlapping at score zero. Most sequences above this score (i.e. 78%) contained “AGG”, while almost all below this score contained “AGH” (H: any nucleotide other than G). Scores of most constitutive acceptor sites were above zero, containing AGG, while most cryptic acceptor sites showed scores below zero. The nucleotide frequencies of acceptor sites with scores larger than zero (64.1% of constitutive and 46.0% of alternative sites) are very similar to a set of sequences known to bind U2AF35 (see Fig. 2). Remarkably, 89.7% of the latter also show a PSSM score larger than zero. Therefore, we assume that the bimodal score distribution of the PSSM for U2AF35 actually corresponds to AG-dependence. Acceptor site scores and scores of U2AF35 showed a significant but low correlation in constitutive (Spearman rank test: r = 0.155, p < 0.001), skipped (r = 0.160, p < 0.001), cryptic (r = 0.158, p < 0.001), and isoform (r = 0.155, p < 0.001) splice sites (no significant correlation was observed for false splice
Fig. 1. Acceptor and donor splice site scores of constitutive, skipped, cryptic, and alternative isoform splice sites. Acceptor scores were calculated with a PSSM, donor scores using a set of 22 PSSMs derived by MDD.
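The information-content PSSM behind the acceptor scores in Fig. 1 can be illustrated with a minimal sketch. It assumes a uniform four-letter background, so that each matrix entry is 2 + log2(frequency), and it uses a flat add-one pseudocount in place of the PAM-derived pseudocounts described in the Methods. The function names and the toy training sites are hypothetical and are not taken from SEQOOL.

```python
import math

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0):
    """Build one {base: weight} column per alignment position.

    Weight = 2 + log2(frequency): the individual information of a base
    under a uniform background. The add-one pseudocount is an assumption;
    the paper derives pseudocounts from a PAM matrix instead.
    Sequences are expected to contain only A, C, G and T.
    """
    length = len(sites[0])
    pssm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({b: 2.0 + math.log2(counts[b] / total) for b in BASES})
    return pssm

def score(pssm, seq):
    """Sum the column weights of the bases observed in seq."""
    return sum(col[base] for col, base in zip(pssm, seq))

# Toy, made-up acceptor-like sites ending in the AG dinucleotide.
toy_sites = ["TTTCAG", "CTTCAG", "TCTTAG", "TTCCAG"]
pssm = build_pssm(toy_sites)
strong = score(pssm, "TTTCAG")   # pyrimidine-rich, resembles the training set
weak = score(pssm, "GAGAAG")     # purine-rich, resembles it poorly
```

A pyrimidine-rich candidate scores higher than a purine-rich one, which is the sense in which the paper speaks of strong and weak splice sites.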
sites: r = 0.029, p = 0.37). Combining both scores, acceptor sites fall into two groups, AG-dependent and AG-independent, with most constitutive and skipped acceptor sites being AG-dependent (Fig. 3). For the 5′ splice site, 22 PSSMs were built using maximum dependence decomposition (MDD; Burge and Karlin, 1997) for the region from −10 to +15 nt. Again, cryptic and alternative isoform splice sites showed low splice site scores (Fig. 1).
Fig. 2. Nucleotide frequencies of acceptor sites of constitutive exons with a U2AF35 score below (a) and above zero (b), measured by a PSSM built for the U2AF35 binding site. (c) Sequences known to bind U2AF35 (31 sequences from Wu et al., 1999).
3.2. GC content profiles
GC content was measured for a sliding window of five nucleotides within a range of − 70 to + 70 nt of each splice site. The GC content of the intronic sequences upstream acceptor sites and downstream donor sites is remarkably lower in cryptic than in the other splice sites (Fig. 4). The GC distribution of intronic sequences flanking constitutive and skipped exons is bimodal, with peaks at GC = 0.34 and GC = 0.60 (Fig. 5). The low GC content of introns flanking cryptic exons might be either a specific feature of cryptic splice sites, which should then not be present in the introns adjacent to constitutive exons of the same genes, or it might be a general feature related to the genomic location, i.e. isochore, where the genes with cryptic exons are situated. In order to test both possibilities we examined all genes containing cryptic exons (5400 sequences, extracted from the complete ALTEXTRON set) and computed the GC content of the introns (70 nt) flanking constitutive exons in these genes. GC content of these introns (upstream: GC = 0.459, downstream: GC = 0.485) was lower than GC of introns neighbouring constitutive exons from all genes (upstream: GC = 0.473, downstream: GC = 0.498), but still higher than in introns adjacent to cryptic exons (upstream: GC = 0.429, downstream: GC = 0.461). Additionally, we analysed the distributions of GC content between position − 70 and − 1 in relation to the strength and composition of the acceptor splice site. While no groups were
Fig. 3. Correlation of splice site scores (splice site strength) and U2AF35 binding site scores (AG-dependence) calculated from 1000 constitutive, skipped, cryptic, and alternative isoform acceptor sites. Constitutive and skipped acceptor sites show generally a higher score and a higher percentage of AG-independent acceptor sites.
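The Spearman rank correlation used to relate splice site scores and U2AF35 binding site scores can be sketched in a few lines. This is an illustrative implementation only: ties receive average ranks, the coefficient is the Pearson correlation of the ranks, and no p-value is computed. The helper names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Applied to two score lists of equal length, `spearman(acceptor_scores, u2af35_scores)` yields a coefficient of the kind reported in Section 3.1 (the variable names here are placeholders).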
Fig. 4. GC within a range of 70 nucleotides of the acceptor and the donor site, measured for a sliding window of five nucleotides.
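The sliding-window GC measurement underlying Fig. 4 is simple enough to sketch directly. The window size of five nucleotides comes from the text; the function names are hypothetical, and the example sequence is made up.

```python
def gc_content(seq):
    """Fraction of G and C in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)

def gc_profile(seq, window=5):
    """GC fraction for every window of the given size, moving one nt at a time."""
    seq = seq.upper()
    return [gc_content(seq[i:i + window])
            for i in range(len(seq) - window + 1)]

profile = gc_profile("GGCCAATT", window=5)
```

For a 140-nt region (−70 to +70) this gives 136 window values per splice site; averaging the profiles position by position over all sites of one type would give curves of the kind plotted in Fig. 4.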
evident when plotting GC against the score of the PSSMs for the binding sites of U2AF65 (not shown), up to three groups were observed when drawn against the score of the U2AF35 binding site (Fig. 6, calculated for 1000 constitutive, skipped and cryptic acceptor sites, and for 430 extending and 776 truncating alternative acceptor sites). Constitutive acceptor sites showed three groups: (1) AG-dependent with high GC content, (2) AG-dependent with low GC content, and (3) AG-independent with high GC content. Most skipped and isoform splice sites, which were mostly preceded by GC-rich introns (−70 to −1), were divided into the two groups AG-dependent and AG-independent. Cryptic acceptor sites were preceded by GC-poor introns and were mainly AG-independent.
3.3. Position and strength of polypyrimidine tracts
The PSSM of the region −15 to +2 nt covers the core AG of the splice site and the adjacent PPT. Since PPTs may be situated much further from acceptor sites, we additionally searched for the strongest PPT within the range of −70 to −1 nt relative to the splice site, using a PSSM (21 nt) built from PPTs given in Singh et al. (1995). The positions of the highest scoring sites are
similar for all types of acceptor sites (Fig. 7). Scores of these sites were highest in cryptic (7.981) and lowest in alternative isoform acceptor sites (extending: 5.235; truncating: 4.787), while constitutive (6.934) and skipped (6.745) splice sites showed intermediate scores. In order to evaluate the influence of GC content on PPT binding site strength, scores of the best hits in the range −70 to −1 were calculated separately for GC-rich and GC-poor introns preceding constitutive acceptor sites (see Fig. 7f). Splice sites with GC-rich introns had a remarkably lower PPT score (5.062) compared to sites with GC-poor upstream introns (9.781). The highest scoring PPT hits per sequence were observed between −20 and −10 nt, with PPT hits of exon-extending alternative sites being slightly closer to, and those of truncating alternative sites being more distant from, the exon start, compared to the other acceptor sites. These differences correspond to the position of alternative splice sites relative to the constitutive site (acceptor: +8.7 nt; donor: −5.5 nt, calculated from the ALTEXTRON dataset). Since weak PPTs are only efficiently recognized if they are close to the acceptor site (Coolidge et al., 1997), scores and distances of PPTs were expected to correlate. In fact, scores and
Fig. 5. GC content distribution of neighbouring introns of 1000 constitutive, skipped, cryptic, and alternative isoform acceptor (top) and donor sites (bottom). GC content was calculated for 70 nucleotides of the neighbouring introns.
Fig. 6. Correlation of GC content of the upstream intron of acceptor sites (from position − 70 to − 1) and the U2AF35 binding site scores. Data were calculated from 1000 sequences of each splice site type, except for extending (430 sequences) and truncating alternative splice sites (776 sequences).
Fig. 7. Position (a–c) and score (d–f) of the highest scoring PPTs identified by a PSSM, calculated from all sequences of the learning set, except for graph (d) (3500 sequences).
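Besides the fixed-length PSSM used for Fig. 7, Section 3.3 measures PPT lengths with a simple heuristic: at least five consecutive pyrimidines, interrupted only by single purines. The sketch below is one possible reading of that definition, not the authors' implementation. It assumes that a tract starts and ends with a pyrimidine, that isolated purines may separate pyrimidine runs, and that at least one run contains five or more pyrimidines. The function names are hypothetical.

```python
import re

# Maximal stretch of pyrimidines in which purines occur only singly.
_TRACT = re.compile(r"[CT]+(?:[AG][CT]+)*")
# Core requirement: at least five consecutive pyrimidines somewhere in the tract.
_CORE = re.compile(r"[CT]{5,}")

def find_ppts(seq):
    """Return (start, end) pairs (0-based, end exclusive) of candidate PPTs."""
    seq = seq.upper()
    return [(m.start(), m.end())
            for m in _TRACT.finditer(seq)
            if _CORE.search(m.group())]

def longest_ppt_length(seq):
    """Length of the longest candidate PPT, or 0 if none is found."""
    return max((end - start for start, end in find_ppts(seq)), default=0)

# Made-up example: a 13-nt tract with one isolated purine, ended by "AGG".
tracts = find_ppts("AACTTTTTCACTTCTAGG")
```

Because the tract length is not fixed in advance, this kind of scan can report the variable PPT lengths that a fixed-width PSSM cannot capture.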
Fig. 8. Position of branch sites in constitutive, skipped, cryptic, and isoform splice sites, and in GC-rich and GC-poor introns neighbouring constitutive acceptor sites. Branch points were identified using the weight matrix from Senapathy et al. (1990).
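Searching the upstream intron for the best weight-matrix hit, as done for branch sites and for the strongest PPT, amounts to sliding the matrix along the sequence and keeping the top-scoring window. The sketch below illustrates only the scan. The five-column matrix is a made-up toy loosely shaped like a CTRAY motif; it is not the Senapathy et al. (1990) matrix, and the function name is hypothetical.

```python
# Toy weights, invented for illustration only.
TOY_MATRIX = [
    {"C": 1.0, "T": 0.5, "A": -1.0, "G": -1.0},
    {"T": 1.0, "C": 0.0, "A": -1.0, "G": -1.0},
    {"A": 0.5, "G": 0.5, "C": -1.0, "T": -1.0},
    {"A": 2.0, "C": -2.0, "G": -2.0, "T": -2.0},
    {"C": 0.5, "T": 0.5, "A": -1.0, "G": -1.0},
]

def best_hit(upstream, matrix):
    """Return (score, position) of the highest-scoring window.

    `upstream` is the intron sequence directly preceding the acceptor
    site, so its last nucleotide is position -1. The reported position
    is that of the first nucleotide of the window. Returns None if the
    sequence is shorter than the matrix.
    """
    width = len(matrix)
    best = None
    for start in range(len(upstream) - width + 1):
        window = upstream[start:start + width]
        s = sum(col.get(base, 0.0) for col, base in zip(matrix, window))
        if best is None or s > best[0]:
            best = (s, start - len(upstream))
    return best

hit = best_hit("GGGCTGACGGGG", TOY_MATRIX)
```

Collecting the hit positions over many sequences gives a position distribution of the kind shown in Fig. 8, and collecting the scores gives the per-class averages quoted in the text.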
distances correlated significantly for all splice sites, but the correlation coefficients were very low (Spearman rank-test: constitutive: r = 0.110, p < 0.0001; skipped r = 0.156; p < 0.0001; cryptic r = 0.148; p < 0.0001; extending alternative r = 0.134; p = 0.0055; truncating alternative r = 0.112; p = 0.0015; GC-rich constitutive r = 0.104; p < 0.0001; GC-poor constitutive r = 0.044; p = 0.0048). Since PSSMs have a fixed length but PPT-length is variable, we additionally applied a simple heuristic method to identify PPTs, based on experimental results (Bouck et al., 1995; Coolidge et al., 1997). PPTs were defined as sequences containing at least five consecutive pyrimidines being interrupted only by single purines. With this heuristic we measured the following lengths of PPTs: Constitutive acceptor sites: 19.5 nt; skipped: 19.5 nt; cryptic: 19.4 nt; extending alternative: 17.5 nt; truncating alternative: 17.4 nt. 3.4. Position and strength of branch sites Using the weight matrix from Senapathy et al. (1990), branch sites were searched within the 70 upstream nucleotides of acceptor sites. Constitutive and skipped acceptor sites showed a concentration of branch site hits at about − 20 nt (Fig. 8). Cryptic and alternative isoform acceptor sites displayed a less concentrated distribution. The distance between branch sites and truncating acceptor sites was slightly greater than in the other
acceptor sites. Branch site scores were almost identical for all splice sites (constitutive: 5.639; skipped: 5.404; cryptic: 5.404; extending alternative: 5.588; truncating alternative: 5.438).
3.5. Inclusion of regulatory elements
We tried to include sub-sequences containing enhancer motifs by analysing oligonucleotide frequencies near the splice sites. For this purpose we used oligonucleotide-frequency models (OFMs), which score the probability of a sequence belonging to a reference set of sequences, e.g. constitutive exons, by comparing the oligonucleotide distribution of the test sequence with the oligonucleotide frequencies of the reference set (see Methods). For the acceptor site, OFMs were built for downstream sequences up to position −36, which might contain silencer motifs, the region covered by U2AF65, and the downstream exon until +24 nt. Although regulatory elements will be present beyond this position, we decided not to include a longer stretch of the exons in order not to limit the model to longer exons. Corresponding models were also built for the donor site, i.e. for the last 24 nt of the upstream exon, the core of the splice site, and the downstream intron until +36 nt. Frequencies were calculated for oligonucleotides of length three to five. Pseudocounts were adjusted in order to avoid over-learning by comparison with the sequences of the test set. All OFMs of each splice site were
Fig. 9. Score distributions of constitutive and alternative isoform (a, d), cryptic (b, e) and skipped (c, f) acceptor (top) and donor (bottom) sites calculated with combined OFMs (see text). Scores reflect the probability (calculated exclusively from oligonucleotide frequencies around a splice site) that a splice site is constitutive.
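The OFM scores plotted in Fig. 9 are sums of per-oligonucleotide log-odds values. A minimal sketch of that idea follows. It makes two assumptions that the paper does not state in this form: the log-odds are taken against a uniform background of 1/4^k, and a flat add-one pseudocount replaces the PAM-derived pseudocounts. The function names and toy sequences are hypothetical.

```python
import math
from collections import Counter

def kmers(seq, k):
    """All overlapping oligonucleotides of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train_ofm(reference_seqs, k=3, pseudocount=1.0):
    """Return a function mapping a k-mer to its log-odds score.

    Log-odds = log2(smoothed reference frequency / uniform background).
    Both the background and the pseudocount scheme are assumptions.
    """
    counts = Counter()
    for seq in reference_seqs:
        counts.update(kmers(seq, k))
    total = sum(counts.values()) + pseudocount * 4 ** k
    background = 1.0 / 4 ** k

    def log_odds(kmer):
        return math.log2((counts[kmer] + pseudocount) / total / background)

    return log_odds

def ofm_score(log_odds, seq, k=3):
    """Sum of log-odds over all k-mers observed in the window seq."""
    return sum(log_odds(km) for km in kmers(seq, k))

# Made-up purine-rich reference set.
model = train_ofm(["GAAGAAGAAGAA", "AGAAGAAGAAGA"], k=3)
similar = ofm_score(model, "GAAGAAGA", k=3)
different = ofm_score(model, "CCTCCTCC", k=3)
```

Scores from several such models (intron region, splice site core, exon region) can then be added, which is how the section describes combining the OFMs of each splice site.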
combined by adding up the single scores of each OFM. The resulting combination of OFMs revealed partly overlapping score-distributions (see Fig. 9). 3.6. Predictive classification of splice sites For the classification of alternatively and constitutively spliced splice sites the best-separating models were combined using a backpropagation network, i.e. models for the U2AF65 and U2AF35 binding site, GC content of the intronic upstream 70 nt, OFMs, and branch site position for the acceptor site, and the MDD model for the donor site, GC content of the adjacent introns (+ 1 to +70), and OFMs for the donor site. The learning set was divided in two halves, using one as a training-set and the other for cross-verification. Networks were trained until the mean square error of the verification set indicated over-learning. The final classification performance was calculated from the test-set. Networks were first trained for classification of: alternative isoform–constitutive, cryptic–constitutive, and skipped–constitutive (Table 1). Since networks trained for classification of skipped and constitutive splice sites did not perform sufficiently, skipped splice sites were excluded in further networks. The classification performance of a network trained for constitutive, cryptic, and isoform splice sites was intermediate (about 54% compared to a random separation of 33%). Since alternative isoform and cryptic splice sites share a lower average splice site strength compared to constitutive or skipped splice sites, and since they also show similar oligonucleotide distributions, we combined both classes. This significantly increased the prediction power to 67.45% for the acceptor site and 71.23% for the donor site. More specifically, the resulting network for the acceptor sites recognized 70.77% of constitutive, 72.36% of alternative isoform, and 59.22% of cryptic splice sites. 64.39% of skipped splice sites were identified as constitutive acceptor sites. 
The network for the donor site classified 73.04% of all constitutive, 82.14% of all alternative isoform, and 58.52% of all cryptic splice sites correctly. 66.39% of skipped sites were classified as constitutive. For the prediction of splice sites within unknown sequences the two networks were combined with pre-processing models, i.e. the PSSM for the acceptor site and the MDD model for the donor site. These pre-processing models identify putative real splice sites prior to classification. Default score thresholds (acceptor: 2.2, donor: 4.5) were adjusted for minimizing false positives and negatives. These thresholds correspond to 23.5% false acceptor and 20.6% false donor splice sites, i.e. subsequences containing AG or GT, respectively, but not being confirmed splice sites, and 21.5% missed isoform acceptor and 19.9% missed isoform donor sites. Since alternative isoform splice sites exhibit the lowest splice site scores, applying score thresholds results mainly in the loss of isoform splice sites, while the detection of cryptic, skipped, or constitutive splice sites is less affected. A web application for the prediction of alternative splice sites was developed, the Alternative Splice Site Predictor (ASSP), which uses the described combination of pre-processing models and backpropagation networks (http://es.embnet.org/~mwang/assp.html).
Table 1
Classification performance (percentage of correctly identified sequences) of backpropagation networks trained on two or three classes of splice sites
Training                        Acceptor (%)   Donor (%)   Random a (%)
Isoform–constitutive            73.48          79.12       50
Cryptic–constitutive            63.02          64.19       50
Skipped–constitutive            51.09          51.91       50
Isoform–cryptic–constitutive    53.85          54.75       33
Isoform/cryptic–constitutive    67.45          71.23       50
a Random indicates the percentage of correctly identified sequences by a random classification.
4. Discussion
In this work we have identified general characteristics of alternatively spliced exons, and evaluated whether a prediction of alternative splice sites is feasible using information about signal strength and composition of splice sites. The present analysis complements previous analyses of alternatively spliced genes (e.g. Thanaraj and Stamm, 2003), revealing new and more detailed insights into alternative and constitutive splice sites. In concordance with Thanaraj and Stamm (2003), splice site scores were considerably lower in cryptic and especially alternative isoform acceptor and donor sites, compared to other splice sites. In addition we observed a relation between splice site score and AG-dependence. In AG-dependent introns with weak PPTs, binding of U2AF65 to the PPT needs to be supported by U2AF35 (Wu et al., 1999). Guth et al. (2001) reported a dual function of U2AF35: first, an enhancement of the binding of U2AF65 to the PPT, and second, a trigger for spliceosome assembly. Although we observed significantly correlating scores for U2AF65 and U2AF35 in all exons, the correlation was very low (r = 0.155–0.160).
However, the PSSM for the binding site of U2AF65 might not reflect adequately PPT strength. First, the applied PSSM was restricted to score only position − 15 to +2, whereas the PPT-position is highly variable. Second, PPTs might not be represented adequately using a PSSM. In donor splice sites, methods taking the mutual dependence of nucleotides into account, significantly increase the prediction accuracy (Burge and Karlin, 1997; Rogozin and Milanesi, 1997). Experimental studies also revealed a mutual influence of PPT composition, length, and distance to the branch site or acceptor splice site (Coolidge et al., 1997; Rossigno et al., 1993). Interestingly, the PSSM for U2AF35 divided all but cryptic splice sites into two groups, one showing a nucleotide distribution very similar to a set of experimentally confirmed U2AF35 binding sites (Wu et al., 1999), and the other showing an almost random nucleotide distribution (Fig. 2). Overall GC content of constitutive and alternatively spliced genes were compared by Clark and Thanaraj (2002), demonstrating that genes with cassette exons, i.e. skipped and cryptic exons, occur more frequently in GC-poor regions. In the present analysis we focused on introns, since we observed marked differences of GC between constitutive and alternative introns, but
not exons. Furthermore, since intron GC might influence the assembly of the spliceosome, GC was correlated to splice site strength (U2AF35) for each splice site type. The GC content in introns neighbouring constitutive, isoform or skipped splice sites showed a bimodal distribution with one peak at GC = 0.34 and the other at GC = 0.60. Most cryptic exons were neighboured by low-GC introns, while skipped exons are equally neighboured by GC-rich or GC-poor introns. Constitutive exons are predominantly flanked by GC-rich introns. Acceptor sites adjacent to GC-poor introns were almost exclusively AG-dependent (except for cryptic exons). This might indicate that PPT strength, possibly branch site strength, or generally speaking splice site strength is influenced by the GC content of the corresponding introns. However, when searching for the strongest PPT hits within the last 70 intron nucleotides, cryptic acceptor sites showed stronger PPT scores than constitutive ones, and constitutive acceptor sites showed higher PPT scores when neighboured by GC-poor introns. High intron GC might promote splicing by increasing the probability of purine-rich intronic enhancers. Purine-rich intronic enhancer elements were reported to support the recognition of the donor splice site (McCullough and Berget, 2000). GC content has been shown to correlate positively with gene expression patterns in several studies (e.g. Konu and Li, 2002; Vinogradov, 2003). Vinogradov (2003) found that GC3 and especially intronic GC correlate with expression level. Tissue-specifically expressed genes were reported to be lower in GC than ubiquitously expressed genes (Vinogradov, 2003; Bernardi, 1995; Pesole et al., 1999), although some studies contradict these findings (Goncalves et al., 2000; Ponger et al., 2001). In the present study we measured the lowest GC in introns neighbouring cryptic exons.
Introns neighbouring constitutive exons of the same genes, in which cryptic exons were present, had higher GC than those flanking cryptic exons, but lower GC than the average of all constitutive exons. This suggests that low GC might also be a feature of introns flanking tissue- or development-specifically spliced exons. Regulatory elements seem to be the most important factors controlling alternative splicing (Lopez, 1998; Smith and Valcárcel, 2000; Cartegni et al., 2002). We applied a general approach for detecting possible regulatory elements using oligonucleotide-frequency models (OFMs). Alternative exon isoforms and cryptic exons showed markedly differing oligonucleotide distributions, compared to constitutive exons. Of course, oligonucleotide frequencies of alternative exons are not only affected by the presence of regulatory elements, but also by the presence of additional intronic or exonic stretches which are included in exon isoforms only. In conclusion, alternative isoform splice sites are mainly characterized by low splice site strength, independent of U2AF35 binding site strength, and atypical exon and intron oligonucleotide frequencies, compared to constitutive splice sites. Cryptic splice sites are on average slightly weaker than constitutive ones, and exhibit atypical oligonucleotide frequencies too. Additionally, most cryptic splice sites show weak U2AF35 binding sites and are neighboured by GC-poor introns. Skipped splice sites are as stringent as constitutive ones (scores of U2AF65 and U2AF35),
and neighbouring introns cover the same GC range. Only exon and intron oligonucleotide frequencies differ slightly from those of constitutive ones. This demonstrates that the functional differences between skipped and cryptic exons, i.e. rarely skipped vs. rarely included, are also reflected by different sequence features. For the classification of splice sites, binding site models and sequence statistics were combined using neural networks. About 70% of constitutive and alternative isoform/cryptic splice sites could be classified correctly, while skipped splice sites were hard to distinguish, since they resemble constitutive splice sites in most of their features. For the identification of putative splice sites, pre-processing models with variable thresholds were used, which facilitate an application-dependent elimination of false splice sites. Nevertheless, since score distributions of false splice sites overlap with the distribution of alternative isoform splice sites, both being characterized by low scores, the overall recognition of isoform splice sites is reduced by high score thresholds, while the number of false splice sites is increased by low thresholds. Given the abundance of false splice sites, compared to isoform splice sites, the identification of false positives and false negatives remains a problem for the prediction of splice sites. The present analysis is a first step in ab initio alternative splice site prediction. It complements previous methods for the recognition of skipped exons, which are based on the comparisons to gene homologs in mouse (Sorek et al., 2004; Dror et al., 2004). The classification performance of the applied neural networks might significantly be elevated by the future increase of data in alternative splicing databases and tissue specific databases. The inclusion of the latter might then permit to predict specific splicing mechanisms.
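The threshold trade-off described in the Discussion can be made concrete with a small sketch of the pre-processing step: every AG (acceptor) or GT (donor) dinucleotide is a candidate, and only candidates whose pre-processing score reaches the threshold are passed on to the classifier. The default thresholds (2.2 and 4.5) are the values given in Section 3.6. Everything else is hypothetical, in particular `toy_score`, which merely stands in for the PSSM and MDD models and is not the ASSP scoring function.

```python
DEFAULT_THRESHOLDS = {"acceptor": 2.2, "donor": 4.5}  # defaults from Section 3.6
DINUCLEOTIDE = {"acceptor": "AG", "donor": "GT"}

def candidate_positions(seq, site_type):
    """0-based start positions of every AG (acceptor) or GT (donor)."""
    motif = DINUCLEOTIDE[site_type]
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == motif]

def prefilter(seq, site_type, score_fn, threshold=None):
    """Keep candidates whose score reaches the threshold."""
    if threshold is None:
        threshold = DEFAULT_THRESHOLDS[site_type]
    kept = []
    for pos in candidate_positions(seq, site_type):
        s = score_fn(seq, pos)
        if s >= threshold:
            kept.append((pos, s))
    return kept

def toy_score(seq, pos):
    """Hypothetical stand-in scorer: 0.5 per pyrimidine in the 8 nt upstream."""
    upstream = seq[max(0, pos - 8):pos]
    return 0.5 * sum(base in "CT" for base in upstream)

seq = "TTTTCTTTAGGAAGAAAGT"
strict = prefilter(seq, "acceptor", toy_score)                  # threshold 2.2
lenient = prefilter(seq, "acceptor", toy_score, threshold=1.0)  # lower threshold
```

Lowering the threshold admits an additional low-scoring candidate, which mirrors the trade-off in the text: low thresholds let more false sites through, while high thresholds discard low-scoring isoform sites.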
Acknowledgements M. W. was supported by a fellowship within the postdocprogramme of the German Academic Exchange Service (DAAD). This work was supported by the Spanish Government (BIO2002-04014-c03-01, TIC2003-09331-C02-01), and Junta de Andalucía (Grupo CVI-162). We would like to thank Stefan Haas for constructive comments on the manuscript. References Bernardi, G., 1995. The human genome: organization and evolutionary history. Annu. Rev. Genet. 29, 445–476. Bouck, J., Fu, X., Skalka, A.M., Kratz, R.A., 1995. Genetic selection for balanced retroviral splicing: novel regulation involving the second step can be mediated by transitions in the polypyrimidine tract. Mol. Cell. Biol. 15, 2663–2671. Burge, C., Karlin, S., 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94. Cartegni, L., Chew, S.L., Krainer, A.R., 2002. Listening to silence and understanding nonsense: exonic mutations that affect splicing. Nat. Rev., Genet. 3, 285–298. Cawley, S.L., Patcher, L., 2003. HMM sampling and applications to gene finding and alternative splicing. Bioinformatics 19 (Suppl. 2), ii36–ii41.
Clark, F., Thanaraj, T.A., 2002. Categorization and characterization of transcript-confirmed constitutively and alternatively spliced introns and exons from human. Hum. Mol. Genet. 11, 451–464.
Coolidge, C.J., Seely, R.J., Patton, J.G., 1997. Functional analysis of the polypyrimidine tract in pre-mRNA splicing. Nucleic Acids Res. 25, 888–896.
Coward, E., Haas, S.A., Vingron, M., 2002. SpliceNest: visualization of gene structure and alternative splicing based on EST clusters. Trends Genet. 18, 53–55.
Dror, G., Sorek, R., Shamir, R., 2004. Accurate identification of alternatively spliced exons using support vector machine. Bioinformatics 21, 897–901.
Goncalves, I., Duret, L., Mouchiroud, D., 2000. Nature and structure of human genes that generate retropseudogenes. Genome Res. 10, 672–678.
Graveley, B.R., 2001. Alternative splicing: increasing diversity in the proteomic world. Trends Genet. 17, 100–107.
Guigó, R., Knudsen, S., Drake, N., Smith, T.F., 1992. Prediction of gene structure. J. Mol. Biol. 226, 141–157.
Guth, S., Tange, T.Ø., Kellenberger, E., Valcárcel, J., 2001. Dual function of U2AF35 in AG-dependent pre-mRNA splicing. Mol. Cell. Biol. 21, 7673–7681.
Johnson, J.M., et al., 2003. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302, 2141–2144.
Konu, O.O., Li, M.D., 2002. Correlations between mRNA expression levels and GC contents of coding and untranslated regions of genes in rodents. J. Mol. Evol. 54, 35–41.
Kriventseva, E.V., et al., 2003. Increase of functional diversity by alternative splicing. Trends Genet. 19, 124–128.
Kulp, D., Haussler, D., Reese, M.G., Eeckman, F.H., 1996. A generalized hidden Markov model for the recognition of human genes in DNA. In: States, D.J., Agarwal, P., Gaasterland, T., Hunter, L., Smith, R.F. (Eds.), Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, pp. 134–142.
Lopez, A.J., 1998. Alternative splicing of pre-mRNA: developmental consequences and mechanisms of regulation. Annu. Rev. Genet. 32, 279–305.
McCullough, A.J., Berget, S.M., 2000. An intronic splicing enhancer binds U1 snRNPs to enhance splicing and select 5′ splice sites. Mol. Cell. Biol. 20, 9225–9235.
Milanesi, L., Rogozin, I.B., 1998. Prediction of human gene structure. In: Bishop, M.J. (Ed.), Guide to Human Genome Computing, 2nd ed. Academic Press, London, pp. 215–259.
Milanesi, L., D'Angelo, D., Rogozin, I.B., 1999. GeneBuilder: interactive in silico prediction of gene structure. Bioinformatics 15, 612–621.
Mironov, A.A., Fickett, J.W., Gelfand, M.S., 1999. Frequent alternative splicing of human genes. Genome Res. 9, 1288–1293.
Modrek, B., Resch, A., Grasso, C., Lee, C., 2001. Genome-wide analysis of alternative splicing using human expressed sequence tags. Nucleic Acids Res. 29, 2850–2859.
Pesole, G., Bernardi, G., Saccone, C., 1999. Isochore specificity of AUG initiator context of human genes. FEBS Lett. 464, 60–62.
Ponger, L., Duret, L., Mouchiroud, D., 2001. Determinants of CpG islands: expression in early embryo and isochore structure. Genome Res. 11, 1854–1860.
Rogozin, I.B., Milanesi, L., 1997. Analysis of donor splice sites in different eukaryotic organisms. J. Mol. Evol. 45, 50–59.
Roscigno, R.F., Weiner, M., Garcia-Blanco, M.A., 1993. A mutational analysis of the polypyrimidine tract of introns. J. Biol. Chem. 268, 11222–11229.
Schneider, T.D., 1997. Information content of individual genetic sequences. J. Theor. Biol. 189, 427–441.
Senapathy, P., Shapiro, M.B., Harris, N.L., 1990. Splice junctions, branch point sites, and exons: sequence statistics, identifications, and applications to Genome Project. Methods Enzymol. 183, 252–278.
Singh, R., Valcárcel, J., Green, M.R., 1995. Distinct binding specificities and functions of higher eukaryotic polypyrimidine tract-binding proteins. Science 268, 1173–1176.
Smith, C.W., Valcárcel, J., 2000. Alternative pre-mRNA splicing: the logic of combinatorial control. Trends Biochem. Sci. 25, 381–388.
Sorek, R., Shemesh, R., Cohen, Y., Basechess, O., Ast, G., Shamir, R., 2004. A non-EST-based method for exon-skipping prediction. Genome Res. 14, 1617–1623.
Thanaraj, T.A., Stamm, S., 2003. Prediction and statistical analysis of alternatively spliced exons. Prog. Mol. Subcell. Biol. 31, 1–31.
Vinogradov, A.E., 2003. Isochores and tissue-specificity. Nucleic Acids Res. 31, 5212–5220.
Wang, M., Marín, A., in preparation. Identification and analysis of splicing regulatory elements with Seqool, a sequence analysis tool.
Wu, S., Romfo, C.M., Nilsen, T.W., Green, M.R., 1999. Functional recognition of the 3′ splice site AG by the splicing factor U2AF35. Nature 402, 832–835.