Alternative Splice Variants Encoding Unstable Protein Domains Exist in the Human Brain


doi:10.1016/j.jmb.2004.09.028 J. Mol. Biol. (2004) 343, 1207–1220


Keiichi Homma1,2, Reiko F. Kikuno3, Takahiro Nagase3, Osamu Ohara3,4 and Ken Nishikawa1*

1 Laboratory of Gene-Product Informatics, Center for Information Biology-DNA Data Bank of Japan, National Institute of Genetics, Research Organization of Information and Systems, 1111 Yata, Mishima, Shizuoka 411-8540, Japan

2 Japan Science and Technology Corporation, 1-8, Honcho 4-chome, Kawaguchi City, Saitama 332-0012, Japan

3 Department of Human Gene Research, Kazusa DNA Research Institute, 2-6-7 Kazusa-Kamatari, Kisarazu 292-0818, Japan

4 RIKEN Research Center for Allergy and Immunology, 1-7-22 Suehiro, Tsurumi-ku, Yokohama 230-0045, Japan

*Corresponding author

Alternative splicing has been recognized as a major mechanism by which protein diversity is increased without significantly increasing genome size in animals and has crucial medical implications, as many alternative splice variants are known to cause diseases. Despite the importance of knowing what structural changes alternative splicing introduces to the encoded proteins for the consideration of its significance, the problem has not been adequately explored. Therefore, we systematically examined the structures of the proteins encoded by the alternative splice variants in the HUGE protein database derived from long (>4 kb) human brain cDNAs. Limiting our analyses to reliable alternative splice junctions, we found alternative splice junctions to have a slight tendency to avoid the interior of SCOP domains and a strong statistically significant tendency to coincide with SCOP domain boundaries. These findings reflect the occurrence of some alternative splicing events that utilize protein structural units as a cassette. However, 50 cases were identified in which SCOP domains are disrupted in the middle by alternative splicing. In six of the cases, insertions are introduced at the molecular surface, presumably affecting protein functions, while in 11 of the cases alternatively spliced variants were found to encode pairs of stable and unstable proteins. The mRNAs encoding such unstable proteins are much less abundant than those encoding stable proteins and tend not to have corresponding mRNAs in non-primate species. We propose that most unstable proteins encoded by alternative splice variants lack normal functions and are an evolutionary dead-end. © 2004 Elsevier Ltd. All rights reserved.

Keywords: alternative splicing; 3D structure; structure prediction; insertion sequence; genome comparison

Abbreviations used: BLAST, basic local alignment search tool; PSI-BLAST, position-specific iterated BLAST; RT-PCR, reverse transcriptase-polymerase chain reaction; EST, expressed sequence tag; EMBL, European Molecular Biology Laboratory; DDBJ, DNA Data Bank of Japan; PDB, Protein Data Bank; NMD, nonsense-mediated mRNA decay.

E-mail address of the corresponding author: [email protected]

Introduction

Alternative splicing is a mechanism in eukaryotes by which protein variety is increased without significantly expanding the genome.1 Recent accumulation of cDNA sequence data revealed that 38–48% of human genes are alternatively spliced2,3 and that the situation is similar in other animals.4 One therefore cannot overlook alternative splicing when investigating the functioning of higher eukaryotes in genetic terms. Alternative splicing affects developmental processes, often causing genetic diseases.1,5,6 For example, alterations of the alternative splicing of the WT1, BRCA1, and ATP7A genes have been reported to cause Wilms’ tumor,7–9 breast cancer,10 and occipital horn syndrome,11 respectively. As alternative splicing occurs at a high frequency in the nervous system,3,12 it is natural that aberrant alternative splicing is implicated as a cause of some human diseases associated with the nervous system: the inherited dementia FTDP-17,13 schizophrenia,14–16 and Alzheimer’s disease,17 among others. Hence, examination of this phenomenon in the nervous system is of medical importance. To assess the effects of alternative splicing, it is beneficial to investigate what structural changes alternative splicing introduces to


the products. Although the distribution of protein domains in alternatively spliced human genes was recently investigated,18 the protein domains used were not genuine structural domains like those of SCOP19 and CATH,20 and the correspondence of splice junctions with domain boundaries was not examined. The three-dimensional structures of alternatively spliced proteins have been compared in a small number of specific cases.7,21,22 To our knowledge, there is only one systematic study related to this subject, carried out with alternative splice variants in fully sequenced higher organisms.23 The paper, however, examined the correspondence of alternatively spliced regions with Prosite profiles and not with structural domains of proteins per se. If we search for proteins in the Swiss-Prot database possessing a SCOP domain, we generally obtain good non-partial alignments like those shown in Figure 1. In the two examples shown,

Alternative Splicing Can Produce Unstable Proteins

the coverage of the SCOP domains by all the aligned Swiss-Prot entries exceeded 80%. In general, normal proteins do not have fragments of SCOP domains, as such structures would be unstable and unable to fold. Partial alignment to SCOP domains was used as a criterion by which to select unstable protein structures and thereby to identify pseudogenes.24 We examined alternative splicing variants individually whose protein products are partially aligned to SCOP domains, and judged them to be structurally unstable if structurally crucial regions are missing. Furthermore, insertions into SCOP domains that preserve the functionality of the proteins are invariably located at the molecular surface,21,25 presumably because insertions into structural cores destabilize and consequently inactivate the proteins. We use this observation when assessing the stability of proteins with insertions.

Figure 1. Alignments of proteins in the Swiss-Prot database to SCOP domains. The Swiss-Prot entries not commented as protein fragments and possessing amino acid sequences aligned to the amino acid sequence of the SCOP b.29.1.4 section of PDB 1dykA and that of the SCOP c.37.1.12 section of PDB 1jj7A were selected ((a) and (b), respectively). The entries whose aligned regions have better alignments to other PDB sequences were excluded. The SCOP domains are represented by green rectangles, while the aligned sections in the retained Swiss-Prot entries are symbolized by red rectangles placed at the corresponding vertical positions. The Swiss-Prot name of each alignment is shown to the right. The bottom two lines in each panel do not represent Swiss-Prot entries, but alternatively spliced variants that we identified (see the text).


Figure 2. Representation of various types of alternative splice variants. Introns are represented by black line segments, while exons are shown as black rectangles, with the regions that differ between the alternative splice variants indicated by color gradations. The locations of the start and stop codons of a coding sequence are indicated by vertical bars. (a) Alternative splice variants that differ at five different internal sites in conventional representation. (b) The same variants as in (a) in the new representation. Note that in some exons multiple acceptor and/or donor sites exist, as represented by the green and orange-tinted sections. (c)–(f) Other types of alternative splice variants in the new representation.

Here, we present results of a structural analysis of the proteins encoded by alternatively spliced variants identified from cDNAs in the human brain. We show that, although alternative splice junctions tend to coincide with the boundaries of protein structures, there are alternative splicing events that disrupt protein structures. Moreover, we present evidence that most alternative splice variants encoding structurally unstable protein domains are confined to a narrow branch of species and are less abundant than those encoding stable protein domains. We discuss the significance of these findings.

Results

Selection of reliable alternative splice junctions

We introduce a novel representation of the exon structure of alternative splice variants. Not only is the conventional representation (Figure 2(a)) inconvenient when exons are much shorter than the intervening introns, as is often the case with human genes, but it is also unsuitable for presentation of the corresponding protein structures. In our notation, only the coding sequences are presented with the corresponding exons lined up vertically (Figure 2(b)), making it possible to show encoded protein structures along the exons. We decided not to include the intron retention type (Figure 2(c)) in our analysis, because this may be a result of an accidental inclusion of immature RNAs in cDNA preparation. All the other types of splice variants that differ in internal sites (Figure 2(b)) are considered as reliable alternative splice variants and are therefore subjected to further analysis. The type of splice variants presented in Figure 2(d) is excluded from our analysis, as this may be a result of a 5′ deletion in cDNA preparation or intron retention. Since the lower splice variant shown in Figure 2(e) may reflect a retained intron at the 5′ end of the second exon of the upper variant, this type of splice variant is not included in our analysis, either. On the other hand, splice variants with different exons at the 5′ termini (Figure 2(f)) are accepted. We also incorporate splice variants that differ at the 3′ termini in our analysis. As alternative splice variants frequently differ at more than one site with multiple types of splice junctions, we retain those junctions that are classified as reliable according to the above criteria, while discarding the rest.

Table 1. Correspondence of alternative splice junctions with SCOP boundaries

                                                                   Inside (%)    Coincident (%)  Outside (%)    Total (%)
Observed: all splice junctions of constitutively spliced genes     17.2 (3598)   5.4 (1139)      77.4 (16215)   100 (20952)
Observed: alternative splice junctions                             13.2 (50)     9.5 (36)        77.2 (292)     100 (378)
Observed: alternative splice junctions without NMD candidates      13.4 (34)     9.4 (24)        77.2 (196)     100 (254)
Expected (%)                                                       14.3          3.9             81.8           100

Percentages of the total are shown and, wherever appropriate, actual numbers are presented in parentheses.
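The comparison of observed and expected fractions in Table 1 can be illustrated with a short calculation. The sketch below runs a chi-square goodness-of-fit test on the three location classes of alternative splice junctions. The paper does not state which statistical test the authors used, so the choice of test and the name `chi_square_gof` are assumptions made for illustration only.

```python
import math

# Observed counts of alternative splice junctions (Table 1).
observed = {"inside": 50, "coincident": 36, "outside": 292}
# Expected percentages from the "Expected (%)" entries of Table 1.
expected_pct = {"inside": 14.3, "coincident": 3.9, "outside": 81.8}


def chi_square_gof(observed, expected_pct):
    """Chi-square goodness-of-fit for three categories (illustrative helper)."""
    total = sum(observed.values())
    contributions = {}
    for key, count in observed.items():
        expected = total * expected_pct[key] / 100.0
        contributions[key] = (count - expected) ** 2 / expected
    stat = sum(contributions.values())
    # Three categories give 2 degrees of freedom, for which the chi-square
    # survival function is exactly exp(-x / 2).
    p_value = math.exp(-stat / 2.0)
    return stat, p_value, contributions


stat, p_value, contributions = chi_square_gof(observed, expected_pct)
print(f"chi2 = {stat:.1f}, p = {p_value:.1e}")
for key, value in contributions.items():
    print(f"  {key}: {value:.2f}")
```

Under these assumptions, nearly all of the statistic comes from the coincident class, which fits the emphasis the text places on domain boundaries.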


Alternative splice junctions tend to coincide with SCOP domain boundaries

We examined where in the protein structure each of the reliable alternative splice junctions falls. We classified the location as the inside, the boundaries, or the outside of SCOP domains, allowing a ten amino acid residue ambiguity for the “coincidence” of alternative splice junctions with SCOP domain boundaries (Table 1). Compared with the exon boundaries of constitutively spliced genes, alternative splice junctions tend not to fall inside SCOP domains (13.2% versus 17.2%). The difference, however, is not statistically significant at p<0.1. In addition, alternative splice junctions show a strong tendency to coincide with SCOP domain boundaries (9.5% versus 5.4%) and the difference is statistically significant (p<0.001).

Figure 3 presents two cases of alternative splicing showing such coincidence ((a) and (b)) and another with the alternative splice junction corresponding to the outside of SCOP domains ((c)): these cases belong to class A. The 5′-most ef00431-specific section encodes one immunoglobulin-like domain (b.1.1.4) and two fibronectin type III-like cell adhesion domains (b.1.2.1), while one exon in the middle that is unique to variant ef00431 almost exactly corresponds to two b.1.2.1 domains (Figure 3(a)). These additional regions of ef00431 therefore insert complete SCOP domains into the structure of the hg04401 product. The situation is similar in Figure 3(b): the additional region consisting of two exons in variant hk04562 encodes the entire fibronectin type III-like domain (b.1.2.1). Although human neural cell adhesion molecule L1 is homologous to both variants, the YRSL motif required for targeting the growth cone is absent from the cytoplasmic (C-terminal) segment of the protein encoded by fh00384.26 On the other hand, the transmembrane domain and the entire cytoplasmic segment are missing from hk04562. In the case depicted in Figure 3(c), the PDZ domain (b.36.1.1) is aligned to the 5′-terminal additional exon in variant hf00331, while the P-loop containing nucleotide triphosphate hydrolase domain (c.37.1.9) is assigned to the common region. In this case, the alternative splice junction is placed 25 amino acid residues downstream of a SCOP domain, adding one domain (b.36.1.1) to one variant.

Alternative splicing sometimes introduces insertion sequences to proteins

In six cases (designated to belong to class B) in which alternative splice junctions correspond to the inside of SCOP domains, the alignment of one variant is normal, while that of the other includes an insertion(s). Two variants of neurexin 2-alpha precursor27 are shown in Figure 4(a), with two extra exons at two different places in hh02750 as compared with ah00573, corresponding to insertions into the same SCOP domain (b.29.1.4), a laminin G-like module. The insertions are represented by hand-drawn lines to indicate not the probable structures, but the approximate sizes and points of insertion. Another case presented in

Figure 3. Many alternative splice junctions fall near structural boundaries of the encoded proteins. Exon correspondence is shown in the novel representation with the names and the total numbers of amino acid residues added to the left and right, respectively. Regions aligned to SCOP domains in the exon correspondence figures are filled with colors, using the same color for the regions aligned to the same SCOP domain in each panel, with the SCOP numbers added above or below. In each panel contiguous segments aligned to the same SCOP domain are shown in the same color with different brightness so that the boundaries are clearly visible. Grey rectangles labeled with TM represent potential transmembrane domains. The three-dimensional structures aligned to alternatively spliced exons are also shown, with aligned and unaligned segments in red and white, respectively. (a) From the upper left, the structural figures shown are alignments to PDB ID 1nct, 1qg3A, 1cfb, 1fna, and 1fna. (b) The alignment is to PDB ID 1qgeA. (c) The structure alignment is to PDB ID 1kwaA.
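The three location classes used in Table 1 (inside, coincident, outside) can be sketched as a small function. This is a minimal illustration, not the authors' implementation: the function name `classify_junction`, the residue coordinates, and the reading of the "ten amino acid residue ambiguity" as a distance of at most ten residues from either domain boundary are all assumptions.

```python
def classify_junction(junction, domains, tolerance=10):
    """Classify a splice junction relative to SCOP domain assignments.

    junction  -- residue position of the junction in the encoded protein
    domains   -- list of (start, end) residue ranges aligned to SCOP domains
    tolerance -- assumed allowance, in residues, for calling a coincidence
    """
    # A junction within the tolerance of any domain boundary is "coincident".
    for start, end in domains:
        if abs(junction - start) <= tolerance or abs(junction - end) <= tolerance:
            return "coincident"
    # Otherwise, a junction strictly within a domain is "inside".
    for start, end in domains:
        if start < junction < end:
            return "inside"
    return "outside"


# Hypothetical domain spanning residues 100-200.
domains = [(100, 200)]
for position in (150, 205, 225):
    print(position, classify_junction(position, domains))
```

With these hypothetical coordinates, a junction 25 residues downstream of the domain is classed as outside, as in the case described for Figure 3(c).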


Figure 4(b) also shows the correspondence between an inserted exon in one variant and a sequence inserted into the pleckstrin-homology domain (b.55.1.1). Interestingly, all of the insertions are located not in the interior, but at the surface of protein domains.

Some alternative splice variants encode unstable proteins

There are 50 cases in which alternative splice junctions fall on the inside of SCOP domains, although their proportion (13.2%) is only 0.2% less than that of exon boundaries in constitutively spliced genes (Table 1; NMD candidates will be discussed below). Eleven of these were found to result in pairs of stable and unstable protein structures, which we define as constituting class C. Two alternative splice variants of protocadherin Fat 2 precursor are one such example (Figure 5(a)), as the region of the lower variant aligned to SCOP b.29.1.4 is judged to be unstable due to the absence of a number of hydrogen bonds between β-sheets. The same alignments to SCOP b.29.1.4 are entered at the bottom of Figure 1(a) in a different representation, clearly showing a nearly complete alignment of the upper variant and a fragmentary alignment of the lower variant. Alternative splicing of ATP-binding cassette transporter 9 at the C terminus also gives rise to stable and unstable structures (fg05606 and fh14074, respectively) (Figure 5(b)), since the absence of several β-sheets from the lower variant destabilizes the SCOP domain. The alignments depicted at the bottom of Figure 1(b) demonstrate an unusually low coverage (63%) of the SCOP domain in the lower variant. As the SCOP domain c.37.1.12 probably functions as an ABC transporter, fg05606 may encode such a transporter, while fh14074 is likely to encode a dysfunctional protein.

Alternative splice variants encoding unstable proteins are less abundant than those encoding stable proteins

We examined the relative abundance of alternative splice variants encoding unstable proteins compared to those encoding stable proteins. In eight representative pairs out of a total of 11, we carried out reverse transcriptase PCR (RT-PCR) analysis to determine the relative abundance of the variants, using the 5′ and 3′ regions shared by both variants as primers and a brain mRNA library as


Figure 4. Alternative splicing sometimes introduces insertions at the molecular surface. Exon and protein structures are depicted as for Figure 3, together with insertions of unknown structure. The lengths of the inserted sequence that are indicated in (a) (9 and 30 amino acid residues) essentially match those of the inserted exons, while the length of the inserted sequence in (b) (11 amino acid residues) corresponds exactly to that of the inserted exon. All the structural figures shown in (a) and (b) are alignments to PDB ID: 1dykA and 1btn, respectively.

template (Table 2, class C). All the variants were detected, demonstrating the non-artifactual nature of all the variants, including those encoding unstable proteins. However, the variants encoding unstable proteins were amplified to much lower levels than those of the corresponding stable variants. In contrast, the pairs of variants presented in Figures 3 and 4 are amplified to more equal levels (Table 2, classes A and B). We conclude that alternative splice variants encoding unstable proteins exist but are less abundant than those encoding stable proteins.

Discussion

Based on comparison with the observed data of constitutively spliced genes, we found that alternative splice junctions have a small, statistically insignificant tendency to avoid the inside of protein domains and a marked, statistically significant tendency to coincide with domain boundaries. If we instead use expected fractions for comparison as in a previous study,23 we detect the same tendencies (Table 1). However, the difference between the observed and the expected fractions of those falling on the inside of SCOP domains (13.2% versus 14.3%) is even smaller than before and is not statistically significant at p<0.1. On the other hand, the tendency to coincide with domain boundaries becomes more pronounced, making the difference statistically significant at p<0.0001. This conclusion is consistent with the idea that a significant fraction of alternative splicing events inserts or deletes structural units of proteins in a cassette-like manner. These cases are likely to play a role in the evolution of protein structures.

The product of ef00431 shown in Figure 3(a) is known as human leukocyte common antigen (LAR, a possible cell adhesion receptor) with an N-terminal addition and a deletion of nine amino acid residues, corresponding to the gap (indicated by a downward red arrow) in the fifth fibronectin type III-like domain (b.1.2.1). It is of interest to note that the nine amino acid sequence, termed LASE-c, is primarily found in variants during development


Figure 5. Alternative splice variants sometimes encode pairs of stable and unstable proteins. Exon correspondence and protein structures are presented as for Figure 3. The structural figures presented in (a) and (b) are alignments to PDB ID 1dykA and 1jj7A, respectively.

Table 2. Pairs of alternative splice variants

Regions are given as start–end nucleotide positions (nt); "none" marks a region absent from that variant. Ratio: ratio of RT-PCR products. Mouse: existence (+) or absence (−) of corresponding splice-conserved cDNAs in mouse.

Class  Pair#  cDNA name  5′ common  Specific   3′ common  Ratio         # EST hits  Mouse
A      1–1    hg04401    28–1047    none       1048–1872  Not detected  27          (a)
A      1–1    ef00431    1–1020     1021–1803  1804–2628  1.0           6           (a)
A      1–2    hg04401    1900–2017  none       2018–2953  1.0           7           −
A      1–2    ef00431    2629–2746  2747–3325  3326–4261  0.4           3           −
A      2      fh00384    1787–2380  none       2381–2677  1.0           7           +
A      2      hk04562    1258–1851  1852–2172  2173–2469  0.3           6           −
A      3      ha04661    none       1–109      110–4962   ND            0           −
A      3      hf00331    none       1–976      977–5829   ND            7           +
B      1–1    ah00573    1218–2735  none       2736–4076  0.2           0           −
B      1–1    hh02750    38–1555    1556–1582  1583–2923  1.0           2           +
B      1–2    ah00573    2736–4076  none       4077–5502  0.5           0           −
B      1–2    hh02750    1583–2923  2924–3013  3014–4439  1.0           1           +
B      2      hj02757    2401–3239  none       3240–4582  0.1           0           −
B      2      hj03347s1  3229–4067  4068–4100  4101–5443  1.0           2           +
C      1      fg04087    none       1–793      794–5809   ND            4           (b)
C      1      hg01289    none       1–35       36–5051    ND            2           (b)
C      2      fj22173    1075–1161  1162–1377  1378–2952  1.0           10          +
C      2      ha02916    13–100     none       101–1675   0.2           0           −
C      3      hg01605    712–1015   1016–1384  1385–4427  1.0           21          +
C      3      hj02942    1–304      none       305–3347   <0.1          0           −
C      4      hg01774    1–135      136–393    394–3193   1.0           5           +
C      4      hj04806    1910–2044  none       2045–4844  0.4           0           −
C      5      fg05606    638–1473   1474–1791  1792–2262  1.0           2           +
C      5      fh14074    1–836      none       837–1307   0.3           0           −
C      6      hg00785a   1–834      835–1089   1090–2233  1.0           30          +
C      6      fh08981    1521–2354  none       2355–3498  0.1           0           −
C      7      hj03579s1  1–4183     4184–6011  none       1.0           0           +
C      7      ha06368    45–4230    4231–4319  none       <0.1          0           −
C      8      hh02763s1  362–885    886–1011   1012–6244  1.0           37          −
C      8      ph00869    1–524      none       525–5757   <0.1          0           −

In C2, for example, nucleotides 1075–1161 and 1378–2952 of fj22173 are identical with nucleotides 13–100 and 101–1675 of ha02916, respectively, while nucleotides 1162–1377 of fj22173 are unique to the variant.
Class A: The boundaries of the specific region coincide with or fall outside of those of the structural domain(s) of the encoded protein. Four representative cases out of 328 were examined. In each pair, the lower variant contains an additional structural domain(s).
Class B: The boundaries of the specific region fall on the inside of the structural domains of the encoded protein, resulting in an insertion in the lower variant as compared to the upper variant. Three out of a total of six cases were investigated.
Class C: The boundaries of the specific region correspond to the inside of the structural domain(s) of the encoded protein, making the product of the lower variant structurally unstable. Eight examples out of 11 cases were analyzed.
In the ratio of RT-PCR products, the amount of the more highly amplified product is taken as unity. ND, not determined.
(a) As no KIAA number has been assigned to the variants, no search could be conducted.
(b) A mouse cDNA aligned to the common region exists, but it has no sequence corresponding to either of the specific regions.
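The EST totals quoted in the Discussion can be re-derived from the "# EST hits" entries of Table 2. The sketch below is only a bookkeeping aid: the list `EST_HITS` transcribes the table values as (class, pair, upper variant, lower variant), and the helper name `est_totals` is hypothetical.

```python
# (class, pair, EST hits for upper variant, EST hits for lower variant),
# transcribed from the "# EST hits" entries of Table 2.
EST_HITS = [
    ("A", "1-1", 27, 6), ("A", "1-2", 7, 3), ("A", "2", 7, 6), ("A", "3", 0, 7),
    ("B", "1-1", 0, 2), ("B", "1-2", 0, 1), ("B", "2", 0, 2),
    ("C", "1", 4, 2), ("C", "2", 10, 0), ("C", "3", 21, 0), ("C", "4", 5, 0),
    ("C", "5", 2, 0), ("C", "6", 30, 0), ("C", "7", 0, 0), ("C", "8", 37, 0),
]


def est_totals(cls):
    """Return (upper total, lower total) of EST hits for one class."""
    upper = sum(u for c, _, u, _ in EST_HITS if c == cls)
    lower = sum(l for c, _, _, l in EST_HITS if c == cls)
    return upper, lower


for cls in ("A", "B", "C"):
    upper, lower = est_totals(cls)
    print(f"class {cls}: upper = {upper}, lower = {lower}")
```

In class C the upper variant of each pair is the one encoding the stable protein, so the class C totals correspond to the stable and unstable counts discussed in the text.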

and expression of LASE-c-containing isoforms is limited to neurons.28 As the other variant, hg04401, contains LASE-c and two cytoplasmic protein tyrosine phosphatase domains (c.45.1.2),29 the product may function as a cell adhesion receptor and play a developmental role in neurons.

Multiple splice variants of neurexin 2-alpha, like those presented in Figure 4(a), have been found in the human brain, from which five canonical sites of differential splicing, SS#1 to SS#5, have been identified.30 Both variants we obtained contain the SS#2 insertion, while only one of them (hh02750) has the SS#3 and SS#4 insertions (Figure 4(a)). Intriguingly, only the insertions of SS#2 and SS#4 inhibit neurexin binding to its ligands, neuroligins, alpha-latrotoxin, and dystroglycan.31–34 Although SS#2, SS#3, and SS#4 introduce insertions to the same SCOP domain (b.29.1.4), the insertion points significantly differ (Figure 4(a), and data not shown), possibly explaining the inhibition of ligand binding caused by SS#2 and SS#4 and its absence with SS#3. In view of the literature, therefore, both variants of neurexin alpha we identified are unlikely to bind to the ligands and to function in cell recognition and cell adhesion. It is relevant to note that random sequences of 120–130 amino acid residues inserted into a surface loop region of Escherichia coli RNase HI did not abolish its activity.35 Moreover, such a segment inserted into the surface of a protein is generally not traceable in the electron density map of the X-ray structure,21 indicating a lack of definite structure. While alternative splice events making insertions to the structural core generally encode dysfunctional proteins, those introducing insertions into the surface of the proteins are likely to simply modify the functions.36

The exclusive use of reliable alternative splice junctions made it possible for us to propose that there are alternative splice variants encoding unstable structures.
However, it is conceivable that the mRNAs of these variants are rapidly degraded and rarely get translated. Eukaryotic mRNAs containing premature termination codons (defined as those occurring more than 50 nucleotides upstream of the final splice junction) are almost always degraded rapidly,37 and therefore are called candidates for nonsense-mediated mRNA decay (NMD).38 We searched for NMD candidates in our list and found only 13.2% of variants to be NMD candidates. None of the examples presented in Figure 5 was identified as an NMD candidate. Moreover, the exclusion of these NMD candidates leaves the fraction of alternative splice junctions falling on the inside of SCOP domains virtually unchanged, despite the reduction in the total number of junctions analyzed (Table 1). Furthermore, the removal of NMD candidates does not alter the statistical insignificance and significance of the differences between the observed and expected fractions mentioned above. These findings demonstrate that the elimination of NMD candidates does not preferentially remove variants encoding


unstable proteins. Additionally it is hard to conceive a mechanism by which variants encoding unstable protein structures are specifically degraded in the absence of clear differentiating DNA sequence patterns like those signifying NMD cases. We thus consider it likely that alternative splicing produces some mRNAs encoding unstable protein structures that are not subject to rapid degradation. The fact that such variants exist at levels detectable by RT-PCR (Table 2) supports this idea. It is plausible that alternative splice variants encoding unstable protein domains are defective in normal functions, as the encoded proteins are unlikely to fold to unique structures. RT-PCR experiments revealed that splice variants encoding unstable proteins are less abundant than those encoding stable proteins. We checked the consistency of this finding with the relative abundance of corresponding human expressed sequence tags (ESTs). In all the pairs for which at least one corresponding human EST was found, more ESTs were found to correspond to the variants encoding stable proteins than to those encoding unstable proteins (Table 2, class C). In total, 109 and two ESTs were identified to correspond to variants encoding stable and unstable proteins, respectively, corroborating the finding. The EST data show the rarity of variants encoding unstable proteins as compared to those encoding stable proteins, just as the RT-PCR data. This does not necessarily mean that variants encoding unstable proteins are insignificant, as some of them may exert a disproportionately large negative effect and cause disease. By contrast the difference in the total numbers of EST hits to the upper and lower variants in class A is small (41 versus 22; Table 2), in agreement with the results of RT-PCR showing less unequal abundance of mRNAs in this class. The number of EST hits to the class B variants is too small to provide any quantitative information (Table 2). 
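The NMD-candidate criterion quoted earlier in the Discussion (a termination codon more than 50 nucleotides upstream of the final splice junction) is simple enough to express directly. The following is a minimal sketch under stated assumptions: the function name `is_nmd_candidate`, the use of 1-based mRNA coordinates, and measuring the distance from the last nucleotide of the stop codon are illustrative choices not specified in the text.

```python
def is_nmd_candidate(stop_codon_end, final_junction, min_distance=50):
    """Flag a transcript whose stop codon lies far upstream of the last junction.

    stop_codon_end -- mRNA position of the last nucleotide of the stop codon
    final_junction -- mRNA position of the final exon-exon junction
    min_distance   -- threshold in nucleotides (50 in the definition quoted)
    """
    # Only stop codons located upstream of the junction by more than the
    # threshold count; stops in the last exon give a negative distance.
    return (final_junction - stop_codon_end) > min_distance


# Hypothetical transcripts with the final junction at position 1100.
print(is_nmd_candidate(1000, 1100))  # stop 100 nt upstream of the junction
print(is_nmd_candidate(1060, 1100))  # stop 40 nt upstream of the junction
print(is_nmd_candidate(1200, 1100))  # stop downstream, in the last exon
```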
Are alternative splice variants encoding unstable protein domains evolutionarily conserved?

For the eight pairs of class C, we searched the ROUGE database39 for cDNAs in the mouse brain corresponding to each variant (termed corresponding splice-conserved cDNAs; see Materials and Methods). In all the pairs except for two for which no mouse variants specific to either variant were found, splice-conserved cDNAs corresponding to variants encoding stable proteins were identified, while no cDNAs corresponding to variants encoding unstable proteins were detected (Table 2, class C). Thus, alternative splice variants encoding stable protein structures in the human brain tend to have specific orthologous cDNAs in mouse, while those encoding unstable protein structures do not. Considering the brain-specific nature of the HUGE (human) and ROUGE (mouse) databases, splice variants encoding unstable protein structures are probably primate-specific rather than brain-specific. We also searched for splice-conserved cDNAs in


mouse corresponding to variants in the other two classes (Table 2, classes A and B). For variant pairs (A1 and B1) with two specific regions, we examined each region separately. Although the paucity of cases prevents detection of significant tendencies, we note that all three human alternative splice variants whose specific regions correspond to insertion at the surface of structural domains (class B) have corresponding mouse cDNAs. Conservation of such insertion sequences implies their functional significance.

In addition to mouse cDNA data, we examined the genomic conservation of each variant in class C across metazoans. No conserved variants were found in the zebrafish, Caenorhabditis elegans, mosquito, or Drosophila melanogaster genomes, while conservation of some variants was detected in the chimpanzee, mouse, rat, and chicken genomes. In cases C2–C6 and C8, i.e. the cases in which the stable variants have an extra region inserted between the common regions, the specific regions as well as the 5′ and 3′ common regions were always conserved in the chimpanzee, mouse, and rat genomes. In the chicken genome, three of the cases exhibited the same conservation, while the remaining three revealed the absence of conservation even in the common regions. Thus, whenever the common regions are conserved in a metazoan genome, so is the specific region, indicating the possible presence of the stable variant. We note that in these cases we cannot exclude the presence of variants encoding unstable proteins from examinations of cross-species genomic conservation alone, because the variants encoding unstable proteins are those with the specific region skipped, not included. In cases C1 and C7, on the other hand, we can directly check the possible conservation of both variants from genomic data, because the two variants have different specific regions.
In case C1, both variants are conserved in the chimpanzee genome, while only the stable variant is conserved in the mouse, rat, and chicken genomes. Case C7 gives a similar conservation pattern: the conservation of both variants in the chimpanzee genome and the exclusive conservation of the variant encoding a stable protein in the mouse and rat genomes. The latter two observations show that both variants may be present in chimpanzee, while only the splice variant encoding a stable protein can possibly be present in the rodent species. Genomic conservation patterns therefore support the idea that variants encoding stable proteins tend to be conserved, while those encoding unstable proteins are confined to a small branch of species (the order of primates in the cases examined).

In the brain, alternative splice variants encoding unstable proteins are less abundant than those encoding stable proteins. Does the same hold in other tissues? As discussed above, preferential prevention of translation of splice variants encoding unstable proteins is unlikely. A preponderance of unstable variants is thus likely to result in a

massive amount of unstable proteins, probably causing aggregate formation and impairing the normal functionality of the tissue, an implausible event. We therefore consider it likely that in all tissues the splice variants encoding unstable proteins are rarer than those encoding stable proteins. We discovered that human splice variants encoding unstable proteins tend to be confined to species very close to Homo sapiens. Interestingly, less abundant alternative splice variants in the human, mouse, and rat genomes in general were found to be mostly species-specific.40 We consider it probable that many of the less abundant alternative splice variants they identified encode unstable protein structures. We also think that most splice variants encoding unstable protein structures represent those that are an evolutionarily dead-end. However, we cannot exclude the possibility that some of them possess functions specific to a narrow range of organisms.

Materials and Methods

cDNA data

We cloned and accumulated the sequence data of long cDNAs (>4 kb) mainly derived from human brain. So far, 2037 cDNA sequences have been published and results of computer-assisted sequence analyses are summarized in the HUGE database,39,41,42 with each cDNA designated by KIAA plus a four-digit number, e.g. KIAA0001. To analyze the alternative splice variants at the protein structural level systematically, we searched for candidates of alternative splice variants of KIAA cDNAs in the original cDNA sequence database using BLAST, with the cutoff E value of 1e-10. We required that each candidate share a region of at least 1000 nt with similarity of at least 97% to one of the KIAA cDNAs, and found 544 pairs. A total of 269 pairs had their exons aligned unambiguously and were subjected to further analysis.

Assignment of protein structures and assessment of their stability

To each cDNA product, we assigned three-dimensional structures using BLAST43 and PSI-BLAST,43,44 setting the cutoff E value at 0.001 just as in the GTOP database.45 SCOP release 1.63 was used for SCOP domain assignments.19 BLAST assigned structures to 40.4% of the HUGE genes, while PSI-BLAST assigned structures to 60.3% of them. A protein is judged to be unstable if one or more structurally crucial components (i.e. components at the core) are missing in the alignment to a SCOP domain or if the protein has an insertion at the core of a SCOP domain. A protein with no structural alignment is deemed unclassifiable. All the other proteins are considered to be stable. Although the stability of protein domains was assessed individually, in practice no protein aligned to over 85% of a SCOP domain was appraised as unstable, while none aligned to less than 70% of a SCOP domain was regarded as stable. All the protein structures presented here were drawn with MOLSCRIPT.46
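The practical coverage bounds reported above (no protein aligned to over 85% of a SCOP domain was judged unstable, none aligned to less than 70% was judged stable) suggest a simple pre-screen. The following is a minimal sketch under stated assumptions, not the procedure used in this study: the function name, labels, and inputs are hypothetical, and the actual judgment rests on whether core structural components are missing or interrupted, which a coverage fraction alone cannot decide.

```python
def triage_stability(aligned_residues, domain_length):
    """Pre-screen one protein/SCOP-domain alignment by coverage.

    Hypothetical helper: the 0.85 and 0.70 bounds are the empirical
    observations quoted in the text, not a definition of stability.
    """
    if domain_length <= 0:
        raise ValueError("domain_length must be positive")
    if not aligned_residues:
        # No structural alignment at all: cannot be classified.
        return "unclassifiable"
    coverage = aligned_residues / domain_length
    if coverage > 0.85:
        return "likely stable"
    if coverage < 0.70:
        return "likely unstable"
    # Between the bounds, the core elements must be inspected individually.
    return "inspect core manually"


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    for aligned, length in [(95, 100), (60, 100), (78, 100), (0, 100)]:
        print(aligned, length, triage_stability(aligned, length))
```

A pre-screen of this kind would only reduce the number of alignments needing manual inspection; an insertion at the core of a fully aligned domain would still have to be checked separately.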

Correspondence between exon boundaries and SCOP domain boundaries and statistics

Exon boundaries placed within the allowance of ten amino acid residues of SCOP domain boundaries were considered to be coincident. On the other hand, exon boundaries that are aligned to SCOP domains except for the ten-residue ends were regarded as falling inside the domain. The conclusions presented here remain unchanged even if the allowance is increased up to and including 25 amino acid residues. The expected fraction of coincidence was calculated based on the idea that the probability is basically equal to the product of the expected frequency per amino acid residue of SCOP domains and 40 amino acid residues, i.e. 20 for each end. Corrections were made for SCOP domains aligned to the N and C termini and for nearly contiguous SCOP alignments separated by less than 20 amino acid residues. The expected fraction of exon boundaries falling on the inside of SCOP domains is the fractional coverage by SCOP domains after subtracting ten amino acid residues from each end. Statistical significance was evaluated by chi-square tests.

Enumeration of EST hits

To count EST hits, we searched the GenBank/EMBL/DDBJ database by BLAST for ESTs of human origin with sequence identity not less than 97% to the 5′ and 3′ common regions shared by variants (Table 2). The selected ESTs were examined for the presence or absence of the specific region(s) and were classified accordingly. The ESTs derived from the variants themselves were not counted.
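The coincidence criterion and the expected fractions described under "Correspondence between exon boundaries and SCOP domain boundaries and statistics" can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: all function names are hypothetical, domains are assumed to be given as inclusive (start, end) residue positions, the corrections for terminal and nearly contiguous domains are omitted, and the one-degree-of-freedom goodness-of-fit form of the chi-square test is an assumption, since the text does not specify the exact test layout.

```python
import math


def classify_boundary(pos, domains, allowance=10):
    """Classify an exon boundary at residue `pos` against SCOP domains."""
    for start, end in domains:
        # Within `allowance` residues of either domain end: coincident.
        if abs(pos - start) <= allowance or abs(pos - end) <= allowance:
            return "coincident"
    for start, end in domains:
        # Aligned to the domain but away from its ends: inside.
        if start + allowance < pos < end - allowance:
            return "inside"
    return "outside"


def expected_coincident_fraction(n_domains, total_residues, allowance=10):
    """Domains per residue times 4 * allowance (40 residues by default)."""
    return n_domains * 4 * allowance / total_residues


def expected_inside_fraction(domains, total_residues, allowance=10):
    """Fractional domain coverage after trimming `allowance` from each end."""
    covered = sum(max(0, (end - start + 1) - 2 * allowance)
                  for start, end in domains)
    return covered / total_residues


def chi_square_1df(observed_hits, n_total, expected_fraction):
    """Goodness-of-fit chi-square (1 d.f.) for hits versus non-hits."""
    exp_hit = n_total * expected_fraction
    exp_miss = n_total - exp_hit
    obs_miss = n_total - observed_hits
    chi2 = ((observed_hits - exp_hit) ** 2 / exp_hit
            + (obs_miss - exp_miss) ** 2 / exp_miss)
    # For 1 d.f. the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical toy data, not values from the paper.
    domains = [(100, 200), (400, 520)]
    print(classify_boundary(105, domains), classify_boundary(150, domains))
    print(chi_square_1df(30, 100, expected_coincident_fraction(2, 800)))
```

Changing `allowance` makes it straightforward to repeat the comparison with a wider window, as the text reports doing for values up to 25 residues.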


Identification of mouse splice-conserved cDNAs corresponding to human cDNAs

We have also accumulated cDNA sequence data of mouse counterparts of KIAA cDNAs, which we designated mKIAA plus the same four-digit number as the corresponding human cDNAs.47–49 Results of sequence analyses of mKIAA cDNAs and sequence comparison with the KIAA cDNAs are summarized in the ROUGE database,39 which was utilized to examine the species specificity of alternative splice variants. We selected mouse cDNAs in the ROUGE database that are mapped to the mouse genome region(s) syntenic to the human genome region(s) to which a human cDNA is mapped, and considered them orthologs. A mouse splice-conserved cDNA corresponding to a variant with a specific region is defined to exist if at least one mouse orthologous cDNA was aligned to the specific region as well as the common region(s). A mouse splice-conserved cDNA corresponding to a variant without a specific region is deemed to exist if at least one mouse orthologous cDNA was aligned to both of the common regions with no sequence in between.

Quantification of the ratio of mRNA abundance by RT-PCR

RT-PCR was carried out using human adult brain poly(A)+ RNA (Clontech, Mansfield, UK) and a set of primers flanking an alternatively spliced region of each gene as described.50 The RT-PCR products were run on a 3% (w/v) agarose gel (Agarose 21; Nippon Gene, Tokyo, Japan), and then transferred onto a nylon membrane by vacuum blotting. After UV cross-linking, the RT-PCR products on the nylon membrane were subjected to hybridization at 45 °C according to a published procedure51 with 32P- or 33P-labeled oligonucleotide probes that detect both variants given in Table 2. After a final wash with 0.1×SSC (SSC is 0.15 M NaCl, 0.015 M trisodium citrate (pH 7.0)) including 0.1% (w/v) SDS at room temperature, hybridization signals were detected quantitatively using a Fuji BAS-2000 bioimage analyzer (Fuji Photo Film, Tokyo, Japan). We accepted only hybridized products of the lengths predicted from the cDNA sequences. The actual sequences of the RT-PCR primers and the oligonucleotide probes used are available upon request.

Genomic conservation of splice variants

To examine genomic conservation of splice variants, we ran BLASTN at the Ensembl site† with the default settings against the 2 June 2004 sequences of the following genomes: Pan troglodytes (chimpanzee), Mus musculus (mouse), Rattus norvegicus (rat), Gallus gallus (chicken), Danio rerio (zebrafish), Caenorhabditis elegans, Anopheles gambiae (mosquito), and Drosophila melanogaster (fruit fly). When two common regions exist (cases C2–C6 and C8), we define the common regions to be genomically conserved if the common regions immediately contiguous to the specific region are aligned to nearby regions on the same chromosome in the same orientation. In cases C1 and C7, on the other hand, alignment of the single common region suffices. We consider a specific region genomically conserved if the common region(s) is conserved and if the specific region is aligned to the region contiguous to the common region(s) in the same orientation.

Acknowledgements

We thank Dr T. Kawabata, Mr S. Sakamoto, and Dr S. Fukuchi for constructing and continuously updating the GTOP database. We are also grateful to the other members of the extended lab group, especially Dr K. Fukami-Kobayashi, for discussions and help in preparing Figures. This work was supported, in part, by a postdoctoral fellowship (to K.H.) in the BIRD program of the Japan Science and Technology Corporation and a grant-in-aid from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References

1. Lopez, A. J. (1998). Alternative splicing of pre-mRNA: developmental consequences and mechanisms of regulation. Annu. Rev. Genet. 32, 279–305.
2. Brett, D., Hanke, J., Lehmann, G., Haase, S., Delbruck, S., Krueger, S. et al. (2000). EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Letters, 474, 83–86.
3. Modrek, B., Resch, A., Grasso, C. & Lee, C. (2001). Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucl. Acids Res. 29, 2850–2859.

† http://www.ensembl.org/


4. Brett, D., Pospisil, H., Valcarcel, J., Reich, J. & Bork, P. (2002). Alternative splicing and genome complexity. Nature Genet. 30, 29–30.
5. Cooper, T. A. & Mattox, W. (1997). The regulation of splice-site selection, and its role in human disease. Am. J. Hum. Genet. 61, 259–266.
6. Caceres, J. F. & Kornblihtt, A. R. (2002). Alternative splicing: multiple control mechanisms and involvement in human disease. Trends Genet. 18, 186–193.
7. Laity, J. H., Chung, J., Dyson, H. J. & Wright, P. E. (2000). Alternative splicing of Wilms’ tumor suppressor protein modulates DNA binding activity through isoform-specific DNA-induced conformational changes. Biochemistry, 39, 5341–5348.
8. Sakamoto, J., Takata, A., Fukuzawa, R., Kikuchi, H., Sugiyama, M., Kanamori, Y. et al. (2001). A novel WT1 gene mutation associated with Wilms’ tumor and congenital male genitourinary malformation. Pediatr. Res. 50, 337–344.
9. Klamt, B., Koziell, A., Poulat, F., Wieacker, P., Scambler, P., Berta, P. & Gessler, M. (1998). Frasier syndrome is caused by defective alternative splicing of WT1 leading to an altered ratio of WT1 +/−KTS splice isoforms. Hum. Mol. Genet. 7, 709–714.
10. Liu, H. X., Cartegni, L., Zhang, M. Q. & Krainer, A. R. (2001). A mechanism for exon skipping caused by nonsense or missense mutations in BRCA1 and other genes. Nature Genet. 27, 55–58.
11. Qi, M. & Byers, P. H. (1998). Constitutive skipping of alternatively spliced exon 10 in the ATP7A gene abolishes Golgi localization of the Menkes protein and produces the occipital horn syndrome. Hum. Mol. Genet. 7, 465–469.
12. Grabowski, P. J. & Black, D. L. (2001). Alternative RNA splicing in the nervous system. Prog. Neurobiol. 65, 289–308.
13. Hutton, M., Lendon, C. L., Rizzu, P., Baker, M., Froelich, S., Houlden, H. et al. (1998). Association of missense and 5′-splice-site mutations in tau with the inherited dementia FTDP-17. Nature, 393, 702–705.
14. Vawter, M. P., Frye, M. A., Hemperly, J. J., VanderPutten, D. M., Usen, N., Doherty, P. et al. (2000). Elevated concentration of N-CAM VASE isoforms in schizophrenia. J. Psychiatr. Res. 34, 25–34.
15. Huntsman, M. M., Tran, B. V., Potkin, S. G., Bunney, W. E. & Jones, E. G. (1998). Altered ratios of alternatively spliced long and short gamma2 subunit mRNAs of the gamma-amino butyrate type A receptor in prefrontal cortex of schizophrenics. Proc. Natl Acad. Sci. USA, 95, 15066–15071.
16. Le Corre, S., Harper, C. G., Lopez, P., Ward, P. & Catts, S. (2000). Increased levels of expression of an NMDARI splice variant in the superior temporal gyrus in schizophrenia. Neuroreport, 11, 983–986.
17. Ho, L., Guo, Y., Spielman, L., Petrescu, O., Haroutunian, V., Purohit, D. et al. (2001). Altered expression of a-type but not b-type synapsin isoform in the brain of patients at high risk for Alzheimer’s disease assessed by DNA microarray technique. Neurosci. Letters, 298, 191–194.
18. Liu, S. & Altman, R. B. (2003). Large scale study of protein domain distribution in the context of alternative splicing. Nucl. Acids Res. 31, 4828–4835.
19. Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540.
20. Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). CATH: a hierarchic classification of protein domain structures. Structure, 5, 1093–1108.
21. Peneff, C., Ferrari, P., Charrier, V., Taburet, Y., Monnier, C., Zamboni, V., Winter, J. et al. (2001). Crystal structures of two human pyrophosphorylase isoforms in complexes with UDPGlc(Gal)NAc: role of the alternatively spliced insert in the enzyme oligomeric assembly and active site architecture. EMBO J. 20, 6191–6202.
22. Oakley, A. J., Harnnoi, T., Udomsinprasert, R., Jirajaroenrat, K., Ketterman, A. J. & Wilce, M. C. (2001). The crystal structures of glutathione S-transferases isozymes 1–3 and 1–4 from Anopheles dirus species B. Protein Sci. 10, 2176–2185.
23. Kriventseva, E. V., Koch, I., Apweiler, R., Vingron, M., Bork, P., Gelfand, M. S. & Sunyaev, S. (2003). Increase of functional diversity by alternative splicing. Trends Genet. 19, 124–128.
24. Homma, K., Fukuchi, S., Kawabata, T., Ota, M. & Nishikawa, K. (2002). A systematic investigation identifies a significant number of probable pseudogenes in the Escherichia coli genome. Gene, 294, 25–33.
25. Aroul-Selvam, R., Hubbard, T. & Sasidharan, R. (2004). Domain insertions in protein structures. J. Mol. Biol. 338, 633–641.
26. Kenwrick, S., Watkins, A. & De Angelis, E. (2000). Neural cell recognition molecule L1: relating biological complexity to human disease mutations. Hum. Mol. Genet. 9, 879–886.
27. Nagase, T., Ishikawa, K.-I., Suyama, M., Kikuno, R., Hirosawa, M., Miyajima, N. et al. (1999). Prediction of the coding sequences of unidentified human genes. XIII. The complete sequences of 100 new cDNA clones from brain which code for large proteins in vitro. DNA Res. 6, 63–70.
28. Zhang, J. S., Honkaniemi, J., Yang, T., Yeo, T. T. & Longo, F. M. (1998). LAR tyrosine phosphatase receptor: a developmental isoform is present in neurites and growth cones and its expression is regional- and cell-specific. Mol. Cell Neurosci. 10, 271–286.
29. Streuli, M., Krueger, N. X., Thai, T., Tang, M. & Saito, H. (1990). Distinct functional roles of the two intracellular phosphatase like domains of the receptor-linked protein tyrosine phosphatases LCA and LAR. EMBO J. 9, 2399–2407.
30. Tabuchi, K. & Sudhof, T. C. (2002). Structure and evolution of neurexin genes: insight into the mechanism of alternative splicing. Genomics, 79, 849–859.
31. Ichtchenko, K., Hata, Y., Nguyen, T., Ullrich, B., Missler, M., Moomaw, C. & Sudhof, T. C. (1995). Neuroligin 1: a splice site-specific ligand for beta-neurexins. Cell, 81, 435–443.
32. Ichtchenko, K., Nguyen, T. & Sudhof, T. C. (1998). Structures, alternative splicing, and neurexin binding of multiple neuroligins. J. Biol. Chem. 271, 2676–2682.
33. Sugita, S., Khvochtev, M. & Sudhof, T. C. (1999). Neurexins are functional alpha-latrotoxin receptors. Neuron, 22, 489–496.
34. Sugita, S., Saito, F., Tang, J., Satz, J., Campbell, K. & Sudhof, T. C. (2001). A stoichiometric complex of neurexins and dystroglycan in brain. J. Cell Biol. 154, 435–445.
35. Doi, N., Itaya, M., Yomo, T., Tokura, S. & Yanagawa, H. (1997). Insertion of foreign random sequences of 120 amino acid residues into an active enzyme. FEBS Letters, 402, 177–180.


36. Kinbara, K., Ishiura, S., Tomioka, S., Sorimachi, H., Jeong, S. Y., Amano, S. et al. (1998). Purification of native p94, a muscle-specific calpain, and characterization of its autolysis. Biochem. J. 335, 589–596.
37. Nagy, E. & Maquat, L. E. (1998). A rule for termination-codon position within intron-containing genes: when nonsense affects RNA abundance. Trends Biochem. Sci. 23, 198–199.
38. Lewis, B. P., Green, R. E. & Brenner, S. E. (2003). Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans. Proc. Natl Acad. Sci. USA, 100, 189–192.
39. Kikuno, R., Nagase, T., Nakayama, M., Koga, H., Okazaki, N., Nakajima, D. & Ohara, O. (2004). HUGE: a database for human KIAA proteins, a 2004 update integrating HUGEppi and ROUGE. Nucl. Acids Res. 32, D502–D504.
40. Modrek, B. & Lee, C. J. (2003). Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nature Genet. 34, 177–180.
41. Ohara, O., Nagase, T., Ishikawa, K., Nakajima, D., Ohira, M., Seki, N. & Nomura, N. (1997). Construction and characterization of human brain cDNA libraries suitable for analysis of cDNA clones encoding relatively large proteins. DNA Res. 4, 53–59.
42. Nagase, T., Kikuno, R. & Ohara, O. (2003). The Kazusa cDNA project for identification of unknown human transcripts. C. R. Biol. 326, 959–966.
43. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389–3402.
44. Kawabata, T., Arisaka, F. & Nishikawa, K. (2000). Structural/functional assignment of unknown bacteriophage T4 proteins by iterative database searches. Gene, 259, 223–233.
45. Kawabata, T., Fukuchi, S., Homma, K., Ota, M., Araki, J., Ichiyoshi, N. & Nishikawa, K. (2002). GTOP: a database of protein structures predicted from genome sequences. Nucl. Acids Res. 30, 294–298.
46. Kraulis, P. J. (1991). MOLSCRIPT: a program to produce both detailed and schematic plots of protein structures. J. Appl. Crystallog. 24, 946–950.
47. Okazaki, N., Kikuno, R., Ohara, R., Inamoto, S., Hara, Y., Nagase, T. et al. (2002). Prediction of the coding sequences of mouse homologues of KIAA gene: I. The complete nucleotide sequences of 100 mouse KIAA-homologous cDNAs identified by screening of terminal sequences of cDNA clones randomly sampled from size-fractionated libraries. DNA Res. 9, 179–188.
48. Okazaki, N., Kikuno, R., Ohara, R., Inamoto, S., Hara, Y., Nagase, T. et al. (2003). Prediction of the coding sequences of mouse homologues of KIAA gene: II. The complete nucleotide sequences of 400 mouse KIAA-homologous cDNAs identified by screening of terminal sequences of cDNA clones randomly sampled from size-fractionated libraries. DNA Res. 10, 35–48.
49. Okazaki, N., Kikuno, R., Ohara, R., Inamoto, S., Koseki, H., Hiraoka, S. et al. (2003). Prediction of the coding sequences of mouse homologues of KIAA gene: III. The complete nucleotide sequences of 500 mouse KIAA-homologous cDNAs identified by screening of terminal sequences of cDNA clones randomly sampled from size-fractionated libraries. DNA Res. 10, 167–180.
50. Nagase, T., Ishikawa, K., Nakajima, D., Ohira, M., Seki, N., Miyajima, N. et al. (1997). Prediction of the coding sequences of unidentified human genes. VII. The complete sequences of 100 new cDNA clones from brain which can code for large proteins in vitro. DNA Res. 4, 141–150.
51. Church, G. M. & Gilbert, W. (1984). Genomic sequencing. Proc. Natl Acad. Sci. USA, 81, 1991–1995.

Edited by F. E. Cohen (Received 22 April 2004; received in revised form 30 July 2004; accepted 7 September 2004)