232
Update
TRENDS in Genetics Vol.20 No.5 May 2004
with orthologous loci in the mouse genome. However, given the evolutionary distance between rodents and primates, even functionless sequences (such as pseudogenes or ancient repeats) might have remained conserved. Indeed, it has been shown that , 40% of the human genome forms alignments with the mouse genome sequence but only 3 – 5% is under selective pressure [14– 16]. Concluding remarks One-third of the predicted FTUs does not correspond to known protein-coding genes, and thus might encode ncRNA genes. Our analysis therefore provides independent evidence that a significant fraction of the mammalian transcriptome corresponds to functional ncRNAs [3,4]. We found that 5% of the transcribed sequences are expressed on both strands. This observation is consistent with other reports in the literature [18] and could be biologically relevant: many non-coding RNAs are developmental regulators on the antisense strand of a coding gene [19]. Because the human-gene catalogue is not complete, and the discovery of non-coding RNAs is still in its infancy [3,4], our estimate of the fraction of the genome covered by FTUs is not outstanding. Wong and colleagues [13] estimated that protein-coding FTUs might constitute up to 90% of our genome. They argue that there are some long protein-coding genes that remain to be discovered and that these transcription units might cover a large fraction of our genome. Given the uncertainty surrounding the exact number of long genes, their estimate (90%) should be taken as an upper limit. Conversely, our estimate should be considered conservative. Indeed, because of the size of the window (20 kb), our method can only detect relatively long FTUs. Moreover, as mentioned previously, the specificity of our method is probably underestimated. Taken together, this suggests that FTUs constitute .50% of our genome.
References 1 International Human Genome Sequencing Consortium, (2001) Initial sequencing and analysis of the human genome. Nature 409, 860 – 921 2 Scherer, S.W. et al. (2003) Human chromosome 7: DNA sequence and biology. Science 300, 767 – 772 3 Okasali, Y. et al. (2002) Analysis of the mouse transcriptome based on functional annotation of 60 770 full-length cDNAs. Nature 420, 563–573 4 Kapranov, P. et al. (2002) Large-scale transcriptional activity in chromosomes 21 and 22. Science 296, 916– 919 5 Smit, A.F. (1999) Interspersed repeats and other mementos of transposable elements in Mamm. Genomes. Curr. Opin. Genet. Dev. 9, 657–663 6 Medstrand, P. et al. (2002) Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Res. 12, 1483 – 1495 7 Duret, L. et al. (1994) HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res. 22, 2360– 2365 8 Castillo-Davis, C.I. et al. (2002) Selection for short introns in highly expressed genes. Nat. Genet. 31, 415– 418 9 Duret, L. (2002) Evolution of synonymous codon usage in metazoans. Curr. Opin. Genet. Dev. 12, 640 – 649 10 Green, P. et al. (2003) Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 33, 514 – 517 11 McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, Chapman & Hall 12 Szymanski, M. et al. (2003) Noncoding regulatory RNAs database. Nucleic Acids Res. 31, 429– 431 13 Wong, G.K. et al. (2001) Most of the human genome is transcribed. Genome Res. 11, 1975 – 1977 14 Thomas, J.W. et al. (2003) Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424, 788 – 793 15 Mouse Genome Sequencing Consortium, (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520 – 562 16 Dermitzakis, E.T. et al. (2003) Evolutionary discrimination of mammalian conserved non-genic sequences (CNGs). Science 302, 1033–1035 17 Rinn, J.L. et al. (2003) The transcriptional activity of human chromosome 22. Genes Dev. 17, 529 – 540 18 Shendure, J. and Church, G.M. (2002) Computational discovery of sense – antisense transcription in the human and mouse genomes. Genome Biol. 3, 1 – 14 19 Eddy, S.R. (2001) Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet. 2, 919 – 929
0168-9525/$ - see front matter q 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.tig.2004.03.001
The impact of very short alternative splicing on protein structures and functions in the human genomeq Fang Wen1,*, Fei Li1,*, Huiyu Xia1, Xin Lu1,2, Xuegong Zhang1 and Yanda Li1 1 2
MOE Key Laboratory of Bioinformatics and Department of Automation, Tsinghua University, Beijing 100084, China Current address: Department of Statistics, Harvard University, Cambridge, MA 02138, USA
The systematic analysis of very short alternative splicing (VSAS) in the human genome showed that VSAS might contribute more to protein-function diversity than expected. More than 65% of VSAS fragments have different secondary structures from flanking regions. q Supplementary data associated with this article can be found at doi: 10.1016/j.tig.2004.03.005 * These authors have contributed equally. Corresponding author: Xuegong Zhang (
[email protected]).
www.sciencedirect.com
They tend to have a non-loop structure and have an important influence on protein functions, as shown by the predicted 3D structure of human IL-4d2. The observed VSAS events can be classified into two groups depending on whether they insert new structure domains in the proteins, and they might be of different evolutionary status. It is a surprise that only 30 000– 40 000 genes exist in the
Update
TRENDS in Genetics Vol.20 No.5 May 2004
human genome. This focuses attention on alternative splicing, which is regarded as an important mechanism of protein diversity [1,2]. Although the exact origin and evolution of alternative splicing remains unclear, several recent efforts have shown that alternative splicing might accompany evolutionary changes [3 –7]. A comparison of alternatively spliced orthologous genes between human, mouse and rat genomes showed that most alternative exons are not conserved, indicating that alternative exons are the products of recent exon creation or loss events [3,4]. Another study on the positions of alternative splicing revealed that alternative-splicing events mostly affect entire protein domains or functional elements. Consequently, the influence of alternative splicing on proteins can be large. Those alternative-splice events that have little or no influence on protein function would have been eliminated by evolution [6]. Our research has shown that there are several alternative-splicing events that contain very short alternative splicing (VSAS, ,50 nucleotides) segments in the human genome. Recent reports [3–7] have reinforced our long-existing curiosity and efforts in investigating the impact of VSAS segments on the protein products and evolution trends behind such short segments – because these variations, presumably by having a minor influence on the protein products, would not be selected for. We scanned all of the 899 alternative-splicing genes from the complete genome of Homo sapiens deposited in the Alternative Splice Database of Mammals (AsMamDB; http://166.111.30.65/ASMAMDB.html) [8]. AsMamDB is a manually generated database and alternatively spliced transcripts in this database have been detected in biological experiments reported in the literature. Of these genes, 371 have two or more protein isoforms with their complete coding sequence (CDS) available; this gives us a total of 914 alternative segments, 258 of which (28%), from 134 genes (36%), have ,50 nucleotides (the average size of a human internal exon is ,120 bp). We call these variants VSAS segments for convenience of discussion. To investigate the effect of VSAS events on protein products, we further screened the set of alternative-splicing genes by selecting those alternative variants that have a single VSAS segment (to ensure that any differences in the protein isoforms are solely the results of these VSAS events) and such variants would directly endure the pressure of selection in evolution. This stringent criterion left us with 43 pairs of alternative transcripts (protein isoforms) from 43 genes. These transcripts included internal alternatively spliced exons, transcripts with retained introns and alternative 30 - and 50 -splice sites. We then scanned the Alternative Exon Database (AEDB; http://www.ebi.ac.uk/asd/aedb/) [9] with the same criteria and obtained another 14 VSAS events from 14 genes. In total, these 57 genes (114 alternatively spliced variants) form the major starting material of this study (see the supplementary material online for more details of the dataset and a list of the 57 genes; http://archive.bmn.com/ supp/tig/tge200506.pdf). VSAS events tend to insert new protein structures The impact of these VSAS segments on the protein products was investigated. First, we studied the protein www.sciencedirect.com
233
secondary structures of VSAS fragments and their flanking regions by predicting the secondary structures of the protein isoforms from their full sequences using the widely accepted PHD software (http://cubic.bioc.columbia.edu/ predictprotein/), which predicts secondary structure from homology alignments based on neural networks [10]. Similar to what had been observed previously [5], 65% of VSAS occurred in the loop area†. The result showed that 38 of the 57 (66.7%) inserted fragments arising from VSAS are predicted to have different secondary structures from those of their flanking regions and 19 (33.3%) of the inserted fragments have secondary structures identical to those of one or both of their flanking residues (Figure 1a). The insertion or deletion of a distinct secondary structure fragment (e.g. a short b-sheet fragment inserted into a previously consistent a-helix or an a-helix fragment inserted into a loop) would influence the folding of the tertiary or quaternary structure. However, if the inserted fragment has an identical secondary structure to its flanking regions, the influence would be relatively weak because it only extends an existing secondary structure. This result is surprising because it is generally thought that short alternative splicing gently regulates protein function; in which case, one would expect VSAS to have the same secondary structure as the surrounding sequence. Our observation in this study shows that VSAS segments that might have been presumed to be trivial tend to have an important effect on the protein products, which could be the result of selection in evolution. Then we analyzed the frequencies of three types of secondary structures (a-helix, b-sheet and loop) that occurred in VSAS fragments. The frequencies of the three structures did not show a great difference when we looked at all the VSAS events overall (Figure 1b). However, when we separated the genes into two groups, according to whether the VSAS segments generate distinct secondary structures, these frequencies become different for each group. In group A (where VSAS results in a distinct structure), only nine (23.1%) of the inserted structures are loops and 30 (76.9%) are either a-helix or b-sheet (Figure 1c)‡. However, in group B (where alternative segments do not insert new structures), the majority (14 out of 21; 66.7%)§ of the inserted fragments are loops (Figure 1d). The percentages of VSAS events in group A and B occurring in the loop area are similar. It is known that the loop structure has more flexibility than the a-helix and b-sheet [11]. We presume that different types of inserted secondary structure arising from VSAS would have a varied influence on the protein structures and that a small variation of loop fragments would have relatively little effect. The statistically significant difference in the frequencies of alternative loops between the two groups (P-value ¼ 0.0018 by Fisher’s Exact Test; † A VSAS occurs in the loop area if one or both of the regions flanking the VSAS segment are loops. If we count those VSAS in which the alternative segment is a loop but the flanking regions are not, this percentage will be 80%. ‡ There is one VSAS event that inserted two types of secondary structure; therefore, the total number of inserted secondary structure in the 38 group A genes is 39. § In group B, there are two cases where one VSAS event inserts two types of secondary structure; therefore, the total number of inserted secondary structure by the 19 VSAS events is 21.
Update
TRENDS in Genetics Vol.20 No.5 May 2004
(a)
(b)
Number of VSAS events
25
38 (66.7%)
40 35 30 25
19 (33.3%)
20 15 10 5
Number of VSAS events
234
0
20
16 (26.7%)
10 5 0
16 12 (30.8%) 9 (23.1%)
8 4
L H E Secondary structures (d)
15
Number of VSAS events
18 (46.1%)
(c) 20 Number of VSAS events
21 (35.0%)
15
Distinct Identical Groups of VSAS events
12
23 (38.3%)
14 (66.7%)
12 9 6
4 (19.0%)
3
3 (14.3%)
0
0 L
H E Secondary structures
L H E Secondary structures TRENDS in Genetics
Figure 1. Differences in the secondary structures of very short alternative splicing (VSAS) segments according to PHD prediction results. (a) 66.7% of the inserted fragments arising from VSAS have different secondary structures compared with those of their flanking sequences on both sides (group A shown in yellow), whereas the remaining 33.3% have secondary structures identical to those of the flanking fragments (group B shown in dark blue). Insertion of a distinct structure is presumed to have more influence on protein structure and function than the extension of an existing structure. Thus, most VSAS segments tend to have an important effect on the protein products. (b) Frequencies of the three types of secondary structures, a-helix (H), b-sheet (E) and the loop (L), for all the VSAS genes. The frequencies of the three types are not significantly different. (c) Frequencies of the three types (H, E and L) within group A. Only ,23.1% of the inserted structures are loops, and 76.9% are either a-helix or b-sheet. (d) Frequencies of the three types (H, E and L) within group B. The majority (66.7%) of the inserted fragments are loops.
Table 1) could imply differences in evolutionary selection pressure. In general, because ancient VSAS events that have little influence on protein-function diversity would have been eliminated in evolution, the VSAS events we see today would be either newly emerged events, which had not yet experienced evolutional selection, or ancient alternative-splicing events that had a high impact on protein-function diversity and remained in the organism. Based on this assumption, the major part of VSAS observed in group A could be functional and might have been positively selected; thus, distinct alternative structures and fewer alternative-loop structures are observed. By contrast, those VSAS observed in group B would most likely be newly emerged events, therefore, more alternative-loop fragments with minor Table 1. The number of VSAS inserted loop domains and nonloop domains in group A and group Ba Group A Group B
Loop domain inserted
Non-loop domain inserted
9 14
30 7
P-value ¼ 0.0018 by Fisher’s Exact Test. a Abbreviation: VSAS, very short alternative splicing. www.sciencedirect.com
effects are observed, which might contribute less to function diversity. Group A VSAS events are conserved in the mouse and rat genomes To further probe the possible evolutionary distinction between groups A and B, we performed a comparative analysis with the Mus musculus and Rattus norvegicus genomes in the NCBI genome database (http://www.ncbi. nlm.nih.gov/) to determine whether these VSAS events are conserved in rodents. We calculated the number of conserved and non-conserved VSAS events in mouse and rat from the VSAS genes that contain homologous genes (48 genes in rat and 49 in mouse). The results are Table 2. The number of group A and group B VSAS events that are conserved or not conserved in mouse and rata
Group A Group B a
Non-conserved
Ratc Conserved
Non-conserved
15 5
16 13
13 3
18 14
Abbreviation: VSAS, very short alternative splicing. P-value ¼ 0.132 by Fisher’s Exact Test. P-value ¼ 0.081 by Fisher’s Exact Test.
b c
Mouseb Conserved
Update
TRENDS in Genetics Vol.20 No.5 May 2004
summarized in Table 2. More than 40% of the group A VSAS events were found to be conserved in mouse and rat, whereas in group B VSAS events, this percentage is much lower (, 20%). Although a significant correlation between the grouping and the number of conserved events is not observed (P-value ¼ 0.132 for mouse; P-value ¼ 0.081 for rat), there is a strong tendency for the more functional group A VSAS events to be more conserved. These results are consistent with several recent reports. It had been shown that functional alternative-splice events are most likely to be conserved among human, mouse and rat [12], and that the introns flanking the conserved alternative exons share high sequence similarities among these species [13]. There has been great interest in investigating nonconserved alternative-splicing events that are specific to human diseases [14]. By checking the annotation of group A and B VSAS genes from GenBank and references in the literature, we found that 12 of the 19 group B genes (63.2%) have been reported to have an association with human diseases, whereas for group A, this percentage is only 42.1% (16 genes out of 38). Further investigation of these genes might provide new information about the evolution and impact of VSAS events. There is evidence that conserved (functional) and nonconserved (newly emerged) cassette exons predicted by the EST database (http://www.ncbi.nlm.nih.gov/dbEST/) differ in the characteristics of size, repeat content and influence on the protein by affecting reading frame or inserting a stop codon [15]. In this article, we have focused on experimentally verified VSAS events. They maintain reading frames and do not insert stop codons. We have observed that the conserved and non-conserved VSAS events have a different influence on protein secondary structures. 3D structural analysis of the influence of VSAS on IL-4 Among the genes examined in this study, the human interleukin-4 (IL-4) is a well-studied lymphoid cell growth factor that has important roles in stimulating the growth and survivability of certain B cells and T cells. The 3D structure of human IL-4 is available in Protein Data Bank (PDB) (http://www.rcsb.org/pdb/), and that of the short-splicing variant (IL-4d2) was predicted using SWISS-MODEL (http://www.expasy.org/swissmod/ SWISS-MODEL.html) to investigate the influence of the VSAS segment on tertiary structure. In human IL-4d2, exon two was skipped, which resulted in the deletion of 16 amino acids. Secondary structure prediction revealed that a short b-sheet was inserted into a consistent a-helix region in the long-splicing variant (IL-4). The known 3D structure of IL-4 was used as main template for structurehomology modeling of the short variant, IL-4d2. Figure 2 shows the 3D structure of the long variant and the predicted structure of the short variant. As demonstrated by the 3D structure and a previous report [14], the human IL-4 is a four-helix-bundle protein. In the predicted 3D structure of IL-4d2 variant, the stable linkage of the short b-sheet between the two a-helices is missing and is substituted by a loop fragment, which is consistent with the result of the secondary structure www.sciencedirect.com
235
(a)
β-sheet fragment
(b)
Loop fragment
TRENDS in Genetics
Figure 2. (a) The 3D structure of human interleukin 4 (IL-4). (b) Homology modeling of the alternative IL-4 short variant, IL-4d2, with SWISS-MODEL using the known structure of human IL-4 [Protein Data Bank (PDB) accession number: 1BCN and 1BBN] as the main template. A 16-amino acid fragment is absent from IL-4d2, which results in the absence of a stable and short b-sheet linkage between two helix domains; this is substituted by a loop fragment. Biological experiments have shown that this VSAS event resulted in IL-4 losing its function as a co-stimulator for T-cell proliferation [14].
prediction. Biological experiments revealed that alteration of this short b-sheet fragment dramatically influences the protein function. In vitro expression experiments proved that, unlike recombinant human (rh)IL-4, rhIL-4d2 could not act as a co-stimulator for T-cell proliferation [16]. It has been suggested that rhIL-4d2 is preferentially expressed in the thymus and inhibits the function of IL-4 [17,18]. The examples of human IL-4 and its shorter alternative variant IL-4d2 illustrated that a very short alternative fragment can have a major effect on protein function. Several other research groups have added evidence that supports this observation. A study on the crystal structures
236
Update
TRENDS in Genetics Vol.20 No.5 May 2004
of glutathione S-transferases (GST) isozymes 1-3 and 1-4 from the mosquito Anopheles dirus (an important malaria vector in South East Asia), indicated that, despite the high overall sequence identities and the only difference in 30 nucleotides arising by alternative splicing, the active site residues of GST1-3 and GST1-4 have different conformations, which might have an important role in inactivating pesticides [19– 21]. Another study of two human pyrophosphorylases revealed that the two isoforms of alternative splicing AGX1 and AGX2 (which have a 17 amino-acid-peptide difference), have a different oligomeric arrangement in solution and this resulted in distinct catalytic properties, suggesting a VSAS event could influence the activity of the enzyme [22]. To date, at least two mechanisms have been suggested to explain how short alternative splicing enhances its role in functional diversity of the genome and improved its viability during evolution: (i) by inserting and/or deleting a short alternative segment with a different secondary structure, as demonstrated in the present study; and (ii) by short alternative splicing events (, 50 amino acids) within domains that tend to affect the functional residues [6]. Apparently, short alternative splicing and VSAS events that might have been presumed to be trivial have more important roles than expected. Concluding remarks Recently, Modrek and Lee [3] discovered that exons included only in alternative-splice forms are mostly the product of recent exon creation or loss events. In our study, the observed differences between VSAS segments that cause distinct or identical secondary structure insertion and/or deletion indicated that VSAS might also be associated with evolutionary change. We suggest that there might be two types of alternative-splicing events: (i) an ancient alternative-splicing event presumed to be functional, which can often exist in several relatives; or (ii) a newly emerged splicing event most likely organism-specific, which might be functional, neutral or deleterious. The fixation of these alternative-splicing events in the genome would depend on their function and some evolutionary events, for example, drift or hitchhiking. Acknowledgements This work is supported in part by NSFC (60234020, 60275007 and 60171038) and a MOST project (No.2001CCA01400) of China. We thank the editor and referees for their valuable suggestions that helped us to complete this work.
www.sciencedirect.com
References 1 Modrek, B. and Lee, C. (2002) A genomic view of alternative splicing. Nat. Genet. 30, 13 – 19 2 Hanke, J. et al. (1999) Alternative splicing of human genes: more the rule than the exception? Trends Genet. 15, 389 – 390 3 Modrek, B. and Lee, C. (2003) Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat. Genet. 34, 177– 180 4 Packer, A. (2003) Splicing and evolutionary change. Nat. Rev. Genet. 4, 401 5 Kondrashov, F.A. and Koonin, E.V. (2003) Evolution of alternative splicing: deletions, insertions and origin of functional parts of proteins from intron sequences. Trends Genet. 19, 115 – 119 6 Kriventseva, E.V. et al. (2003) Increase of functional diversity by alternative splicing. Trends Genet. 19, 124 – 128 7 Kondrashov, F.A. and Koonin, E.V. (2001) Origin of alternative splicing by tandem exon duplication. Hum. Mol. Genet. 10, 2661– 2669 8 Ji, H.K. et al. (2001) AsMamDB: an alternative splice database of mammals. Nucleic Acids Res. 29, 260 – 263 9 Stamm, S.J. et al. (2000) An alternative exon database (AEDB) and its statistical analysis. DNA Cell Biol. 19, 739 – 756 10 Rost, B. (1996) PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol. 266, 525– 539 11 Tsou, C-L. (1988) Folding of the nascent peptide chain into a biologically active protein. Biochemistry 27, 1809 – 1812 12 Waterson, R.H. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520 – 562 13 Sorek, R. et al. (2003) Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Res. 13, 1631– 1637 14 Xu, Q. et al. (2003) Discovery of novel splice forms and functional analysis of cancer-specific alternative splicing in human expressed sequences. Nucleic Acids Res. 31, 5635 – 5643 15 Sorek, R. et al. (2004) How prevalent is functional alternative splicing in the human genome? Trends Genet. 20, 68 – 71 16 Atamas, S.P. et al. (1996) An alternative splice variant of human IL-4, IL-4d2, inhibits IL-4 stimulated T cell proliferation. J. Immunol. 156, 435– 441 17 Alms, W.J. et al. (1996) Generation of a variant of human interleukin -4 by alternative splicing. Mol. Immunol. 33, 361 – 370 18 Kruse, N. et al. (1993) Two distinct functional sites of human interleukin 4 are identified by variants impaired in either receptor binding or receptor activation. EMBO J. 12, 5121– 5129 19 Oakley, A.J. et al. (2001) The crystal structure of glutathione S-transferases isozymes 1-3 and 1-4 from Anopheles dirus species B. Protein Sci. 10, 2176 – 2185 20 Oakley, A.J. et al. (2001) Crystallization of two glutathione S-transferases from an unusual gene family. Acta Crystallogr. D Biol. Crystallogr. 57, 870 – 872 21 Jirajaroenrat, K. et al. (2001) Heterologous expression and characterization of alternatively spliced glutathione S-transferases from a single Anopheles gene. Insect Biochem. Mol. Biol. 31, 867– 875 22 Peneff, C. et al. (2001) The Crystal structures of two human pyrophosphorylase isoforms in complexes with UDPGlc(Gal)NAc: role of the alternatively spliced insert in the enzyme oligomeric assembly and active site architecture. EMBO J. 20, 6191– 6202 0168-9525/$ - see front matter q 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.tig.2004.03.005