© Institut Pasteur/Elsevier, Paris 1997
Res. Microbiol. 1997, 148, 731-744

Trinucleotide repeats in yeast

G.-F. Richard (*) and B. Dujon (**)

Unité de Génétique moléculaire des Levures (UMR1300 CNRS and UFR927 Univ. P. M. Curie, Paris), Institut Pasteur, 75724 Paris Cedex 15
SUMMARY
The yeast genome exhibits a variety of trinucleotide repeat arrays within protein-coding genes and intergenic regions. In the first situation, repeats are often not random relative to the translational frame, resulting preferably in long stretches of the two acidic amino acids or of their corresponding amine forms. Interestingly, the longest trinucleotide repeats are often found in genes encoding nuclearly located proteins. Repeats tend to be more frequent in long genes, but less frequent among members of gene families than among unique genes. In gene families, repeat arrays often differ in length or composition between the gene homologs, indicating their instability.

Key-words: Yeast, Trinucleotide; Repeats, Genome analysis, Gene duplication.
INTRODUCTION
It has long been known that eucaryotic genomes contain distinct molecular constituents, including repetitive elements (Britten and Kohne, 1968). Satellite DNA made of simple sequence repeats is a characteristic signature of centromeric or telomeric regions of some genomes (Singer and Berg, 1991). Dispersed tandem repeats forming mini- or microsatellites have also been characterized and extensively used, for example, for physical mapping of the human genome (Charlesworth et al., 1994; Dib et al., 1996). Short dispersed repetitive elements, such as the Alu sequences in human, the B2 sequences in mouse and the GC clusters in Neurospora or yeast mitochondrial DNA, are characteristic features of eucaryotic genomes (Deininger, 1989; Weiller et al., 1989; Yin et al., 1981) that also contribute to repetitiveness, along with mobile elements such as LINEs (long interspersed nuclear elements) in vertebrates, copia in Drosophila, or the Ty elements (and their solo LTRs or long terminal repeats) in yeast (Hutchinson III, 1989; McDonald, 1993). The heterogeneity of eucaryotic genomes is also observed at the compositional level, creating isochores that are related to chromosome banding patterns and to variation in gene densities (Bernardi, 1993).

Prior to systematic genome sequencing programs, the frequency and distribution of short oligonucleotide stretches in the DNA of various organisms were known from the analysis of the sequences present in general databases (Burge et al., 1992). In the case of yeast, the relative frequencies of mono- through hexanucleotides were studied and shown to be strongly biased (Arnold et al., 1988). Systematic sequencing projects offer the opportunity to reexamine this question in a more comprehensive manner. Not surprisingly, the release of the first complete sequence of a eucaryotic chromosome, chromosome III of yeast (Oliver et al., 1992), was followed by a number of studies on the compositional variation along the chromosome (Sharp and Lloyd, 1993), the frequency of short sequence iterations (Yagil, 1994) and the distribution of mono- through heptanucleotides (Kalogeropoulos, 1993). It was shown, for example, that the frequency of two-base tracts was higher than expected, the excess being considerably larger within non-coding regions than in genes. Yeast chromosome III was also shown to contain a relative abundance of tandem DNA repeats, a non-random distribution of delta elements (the solo LTRs from the Ty1 and Ty2 retroposons), as well as two regions with multiple types of repetitive sequences and unusual DNA composition, one of them around the HMR locus (Karlin et al., 1993). Subsequent analyses performed during the yeast sequencing project have essentially confirmed and extended these findings, illustrating the remarkable homogeneity between most yeast chromosomes (Dujon, 1996; Dujon et al., 1994; Ollivier et al., 1995).

Since the yeast Saccharomyces cerevisiae is the first, and so far only, eucaryote whose genome is entirely sequenced, the nature and distribution of trinucleotide repeats in the entire yeast genome are worth reviewing, both from a fundamental point of view and because ten human neurological diseases, mostly of the neurodegenerative type, were found to be associated with the expansion of such repeats near or within genes (Warren, 1996). The precise mechanism responsible for trinucleotide repeat expansion is unclear, although strand slippage induced by replication errors is thought to be involved. In Escherichia coli, mismatch-repair genes are responsible for the instability of such sequences (Jaworski et al., 1995). Their homologs in S. cerevisiae are also involved in the instability of dinucleotide repeats (Strand et al., 1993). In a preliminary analysis of the first eight sequenced chromosomes, it was found that the genome of S. cerevisiae contains many trinucleotide repeats of variable lengths and sequences, the perfect ones (i.e. those not interrupted by non-consensus triplets) showing polymorphic size variation, as in the normal human population, when various laboratory strains are compared (Richard and Dujon, 1996).

Submitted June 18, 1997, accepted July 22, 1997.
(*) Present address: Rosenstiel Center and Department of Biology, Brandeis University, Waltham, MA 02254-9110, USA.
(**) Corresponding author.
LINE = long interspersed nuclear element. LTR = long terminal repeat.
MATERIALS AND METHODS
We used the program “Repeats” (Benson and Waterman, 1994), which finds repeated regions in a DNA sequence and aligns them with the consensus of the repeat. The program was run with the following parameters: match bonus = 3, mismatch penalty = 6, insertion/deletion penalty = 12, threshold score to report an alignment = 36 or 45 (see text), pattern size = 3, lookcount = 3, noshortperiods = 1 (do not detect smaller repeats inside the array, i.e., do not detect the homopolymeric repeats AAA/TTT and CCC/GGG). The duplicated ORFs and corresponding protein sequences were aligned with ClustalW (Thompson et al., 1994). The distribution of repeat arrays along the chromosomes was examined using the program “Repdis” (written by F. Galisson and available on request: [email protected]), which computes the number of nucleotides contained within all repeat arrays beginning within a window (here, sliding windows of 50 kb and steps of 10 kb were used). The distributions observed are similar for arrays selected using threshold values of 36, 45 or even 54 (data not shown).

RESULTS

Distribution of trinucleotide repeats in the yeast genome
The nuclear yeast genome is made of 16 chromosomes which, altogether, contain 12,067 kb of DNA if one ignores the tandemly repeated rDNA sequences (Goffeau et al., 1996). We have systematically scanned each chromosome sequence for the presence of trinucleotide repeats using the program Repeats (Benson and Waterman, 1994), which detects repeated DNA sequences of any given length (see “Materials and Methods”). In a first analysis, all repeats with scores of 36 or more were considered. This figure was selected based on our preliminary analysis (Richard and Dujon, 1996). Since the occurrence of arrays of four triplets or fewer was found not to differ significantly from the random occurrence of a succession of identical triplets, such arrays were ignored. Similarly, arrays of five triplets were considered only if perfect, to reduce the background of random occurrences. Arrays of six triplets or more were all considered, whether perfect or not. Imperfect repeats of variable length due to the telomeric C1-3A repeats were ignored to eliminate bias. Finally, repeats corresponding to homopolymeric tracts (AAA/TTT and CCC/GGG) were excluded. The first type is very frequent in intergenic regions, due to AT-rich homopolymeric tracts (Yagil, 1994), while the second is almost absent from the yeast genome (Richard and Dujon, 1996).

Under such definitions, a total of 1,769 trinucleotide repeat arrays were found in the yeast genome, corresponding to an average of one array per ca. 7 kilobases. The length distribution of the 1,769 arrays is represented in figure 1. The longest array (consensus (ACT/AGT)n) was 487 nucleotides long. There were 15 other arrays of 50 triplets or more.

Fig. 1. Length distribution of trinucleotide repeat arrays in the yeast genome.
Trinucleotide repeat arrays considered as significant (a total of 1,769, see text) were mapped relative to ORFs or intergenic regions (IG). Lengths of arrays are indicated in triplets (abscissa) and the number of occurrences is shown in ordinate. Arrays of 6 triplets or more contain both perfect and imperfect repeats, whereas arrays of 5 contain only perfect repeats. All arrays equal to or longer than 30 triplets are grouped.

The application of a higher initial score (45) in a second analysis, with otherwise identical definitions, resulted in a total of 924 trinucleotide repeat arrays in the yeast genome, corresponding to an average of one array per ca. 13 kilobases. Their list is given in figure 2. Trinucleotide repeat arrays were found to be almost equally distributed between the various chromosomes, both in number (the range is from one array per 9.6 kb for chromosome XI to one array per 18.2 kb for chromosome X) and in length (average length varies from 32 nucleotides for chromosome III to 47 nucleotides for chromosome XI; the overall average is 40 nucleotides). As can be seen from figure 2, arrays were noticeably underrepresented in subtelomeric regions (except for the right subtelomere of chromosome X, which contained the longest trinucleotide repeat of the whole yeast genome). Centromeres were also often contained in “repeat-poor” regions (11 cases out of 16).

Such observations prompted us to examine the possible clustering of arrays along yeast chromosomes. Examination of density profiles along the various chromosomes (not shown) confirmed that repeat arrays tended to be clustered, giving rise, in many cases, to successions of “repeat-rich” and “repeat-poor” regions. We verified that the presence of Ty elements and their solo LTRs, which are essentially devoid of trinucleotide repeats (see below), did not alone explain the appearance of “repeat-poor” regions. The fact that the “repeat-rich” regions themselves tended to correlate with the GC-rich regions of the chromosome raised an interesting question concerning the formation of trinucleotide repeats. If their expansion (or contraction) was the result of replication-slippage errors not corrected by the mismatch-repair system (Strand et al., 1993), then it is possible that some chromosomal regions would be more likely to contain large repeats because of slight differences in the efficiency of the replication and/or repair apparatus.

ORF = open reading frame.
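The window-based density profile described in “Materials and Methods” can be illustrated with a minimal sketch. It is neither the Repeats nor the Repdis program: the function names `find_perfect_arrays` and `window_profile` are hypothetical, only perfect arrays are detected (a plain regular expression cannot score imperfect arrays), and the rule “at least five identical triplets, homopolymeric tracts excluded” is taken from the definitions given above.

```python
import re


def find_perfect_arrays(seq, min_units=5):
    """Return (start, length_nt, unit) for each perfect tandem array of a triplet.

    Simplified assumption: only uninterrupted arrays of at least `min_units`
    identical triplets are reported; homopolymeric tracts are skipped.
    """
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_units - 1))
    arrays = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:  # AAA/TTT and CCC/GGG tracts are excluded
            continue
        arrays.append((match.start(), match.end() - match.start(), unit))
    return arrays


def window_profile(arrays, seq_len, window=50_000, step=10_000):
    """Sum the nucleotides of all arrays that begin within each sliding window."""
    profile = []
    for w_start in range(0, max(1, seq_len - window + 1), step):
        w_end = w_start + window
        covered = sum(length for start, length, _ in arrays
                      if w_start <= start < w_end)
        profile.append((w_start, covered))
    return profile


# Toy sequence (100 nt) scanned with small windows for illustration only.
toy = "T" * 10 + "CAG" * 6 + "T" * 20 + "A" * 30 + "ACT" * 5 + "G" * 7
toy_arrays = find_perfect_arrays(toy)
toy_profile = window_profile(toy_arrays, len(toy), window=50, step=25)
```

On a real chromosome the default window of 50 kb and step of 10 kb would be used, as in the text; the toy values above only keep the example small.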
Fig. 2. Chromosomal locations and lengths of each trinucleotide repeat in the yeast genome.
The chromosome maps are oriented from left (top) to right (bottom) and drawn to scale. To facilitate viewing, chromosomes were classified by decreasing size order from right to left (top), then from left to right (bottom). Trinucleotide repeat arrays considered as highly significant (a total of 924, see text) were mapped relative to ORFs, intergenic regions and other genetic elements, as defined in text. They are represented along the chromosome maps (right: repeats in ORFs; left: repeats in intergenic regions or other genetic elements). The position of the first nucleotide of the array is indicated (next to the chromosome line) together with the size of the array in nucleotides (outmost figures). Centromeres are indicated by circles on the chromosome maps.

Nature and distribution of trinucleotide repeats in the different genetic elements

In the yeast genome, ORFs occupy, on average, 72 % of the total sequence outside of rDNA (Dujon, 1996). Out of the 1,769 trinucleotide repeat arrays of our first analysis, 1,365 were found to occur within ORFs and 404 within intergenic regions. If the distribution of repeat arrays were random between ORFs and intergenic regions, a total of 1,273 arrays should be found in ORFs and 495 in intergenic regions. The result, therefore, shows that trinucleotide repeats were slightly overrepresented in ORFs (χ² = 6.64, with P(χ² > 6.63) = 0.01). The same conclusion holds true if one considers only the 924 arrays from our second and more conservative screening (781 arrays found in ORFs, compared to 664 predicted).

The difference between ORFs and intergenic regions was more obvious if one examined the nature of the triplets involved. Ignoring frame and strandedness, there existed ten different trinucleotide repeats, plus the two homopolymeric repeats (AAA/TTT and CCC/GGG). Their distribution within ORFs and intergenic regions is shown in figure 3. As already observed with the first chromosomes analysed (Richard and Dujon, 1996), the triplet (ACG/CGT)n was highly overrepresented in ORFs, while the triplets (AAC/GTT)n, (AGC/GCT)n, (AGG/CCT)n and (ATC/GAT)n were slightly overrepresented. Conversely, the triplets (AAT/ATT)n and (ACT/AGT)n were overrepresented in intergenic regions. Other triplets were almost equally distributed between ORFs and intergenic regions, the triplet (CCG/CGG)n being too rarely found to conclude as to its distribution (only three occurrences in ORFs). Interestingly, (AGC/GCT)n is one of the triplets found in human neurological disorders.

Fig. 3. Sequence distribution of trinucleotide repeat arrays in the yeast genome.
Trinucleotide repeat arrays considered as significant (a total of 1,769, see text) were classified according to the consensus triplet and mapped relative to ORFs or intergenic regions (IG). The figure shows the number of occurrences in each case (columns) with the ratio of distribution between ORFs and intergenic regions, directly as observed (first numerical line) or corrected for the overall distribution of ORFs and intergenic regions in the entire yeast genome (72 and 28 %, respectively). Monotonous repeats (AAA/TTT) and (CCC/GGG) are ignored. The first ones are frequent in intergenic regions, due to the presence of monotonous base tracts (Yagil, 1994). The second ones are extremely rare in the yeast genome (Richard and Dujon, 1996).
Ratio (in column order): 6.9, 4.1, 0.8, 4.2, 64, 2.1, 7.2, 12.7, 8.7, -
Relative ratio (in column order): 2.7, 1.6, 0.3, 1.6, 21, 0.8, 2.8, 4.9, 3.4, -
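The comparison of observed and expected counts quoted above can be re-derived with a short goodness-of-fit calculation. This is only a sketch: the authors do not describe how their statistic was computed, and the helper name `chi_square_terms` is hypothetical. With the rounded expected counts given in the text (1,273 and 495), the ORF category alone contributes a term of about 6.65, close to the reported value, and the intergenic category adds a further positive term, so the overrepresentation in ORFs is, if anything, more strongly supported.

```python
def chi_square_terms(observed, expected):
    """Return the per-category terms (O - E)^2 / E of a goodness-of-fit test."""
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]


# Counts taken from the text: arrays in ORFs, arrays in intergenic regions.
observed = [1365, 404]
# Expected counts under a random distribution, as quoted (rounded) in the text.
expected = [1273, 495]

orf_term, intergenic_term = chi_square_terms(observed, expected)
total = orf_term + intergenic_term

print(f"ORF term: {orf_term:.2f}")
print(f"intergenic term: {intergenic_term:.2f}")
print(f"sum over both categories: {total:.2f}")
```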
The distribution of the various triplets in the trinucleotide repeat arrays of the different chromosomes was essentially homogeneous, except for (AGC/GCT)n, which was slightly overrepresented on chromosome IX, while (AAT/ATT)n and (AGC/GCT)n were slightly underrepresented on chromosomes X and XIII, respectively (data not shown). Genetic elements other than ORFs and intergenic regions were noticeably poor in trinucleotide repeats. No tRNA gene or other RNA gene contained such repeats, including in their introns, and the same was true of Ty elements. Only one delta element (on chromosome III) contained a trinucleotide repeat, (AAT/ATT)7. This element was next to a Ty5 LTR remnant. Three other repeats were found in the subtelomeric core X elements (chromosomes IV, XIII and XVI) and a final one in a subtelomeric DCBA region (chromosome II). All such repeats were short (21-23 nucleotides).
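The statement that, ignoring frame and strandedness, there are ten different trinucleotide repeats plus two homopolymeric ones can be checked by enumeration. The sketch below groups the 64 triplets into classes that are identical up to rotation (frame) and reverse complementation (strand); the function name `canonical_class` and the choice of the alphabetically smallest member as class label are conventions of this example, not of the original analysis.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_class(triplet):
    """Alphabetically smallest triplet among all rotations on both strands."""
    revcomp = triplet.translate(COMPLEMENT)[::-1]
    variants = [t[i:] + t[:i] for t in (triplet, revcomp) for i in range(3)]
    return min(variants)


# Group all 64 triplets by their canonical class label.
classes = {}
for bases in product("ACGT", repeat=3):
    triplet = "".join(bases)
    classes.setdefault(canonical_class(triplet), []).append(triplet)

homopolymeric = sorted(c for c in classes if len(set(c)) == 1)
others = sorted(c for c in classes if len(set(c)) > 1)

print(len(others), "non-homopolymeric classes:", ", ".join(others))
print(len(homopolymeric), "homopolymeric classes:", ", ".join(homopolymeric))
```

Under this labelling, (CAG/CTG)n and (AGC/GCT)n fall in the same class, which is why the two notations are interchangeable.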
Amino acids encoded in trinucleotide repeat arrays in ORFs

A total of 599 different ORFs (including 5 partially overlapping ORF pairs) harboured one or a few trinucleotide repeat arrays. This was close to ten percent of all ORFs of the yeast genome. Noticeably, trinucleotide repeats were often observed in the longest ORFs. The average length of the repeat-containing ORFs was 2,034 nucleotides as compared to 1,450 nucleotides for the whole yeast genome (Dujon, 1996), and 104 of the 453 ORFs with more than 1,000 codons contained trinucleotide repeats. A significant number of them (56 %) were genes of known function, compared to ca. 30 % for the entire genome (Dujon, 1996) and 42 % for all trinucleotide repeat-containing ORFs.

One essential property of trinucleotide repeats, as opposed to di- or tetranucleotide repeats for example, is that, when occurring in ORFs, they correspond to monotonous amino acid repeats in the translated protein which may contribute to, or interfere with, the function of the protein. We have examined the amino acid repeats which correspond to the longest trinucleotide repeat arrays identified. Using a lower limit of 75 nucleotides, 105 cases were found, all corresponding to imperfect repeats. They coded for one, two or sometimes three different amino acids. Out of the 59 corresponding ORFs (several arrays may occur in a given ORF), 21 were genes of characterized function (ca. 35 % of the total, a figure close to the average proportion of known genes in the entire genome). Out of 37 occurrences, the most frequent amino acid stretches encountered among them were Glu (10 occurrences), Asp (8 occurrences), Asn (8 occurrences) and Gln (7 occurrences) (see table I). The strong bias in favour of this limited set of amino acids showed that trinucleotide repeat arrays in ORFs (at least the longest ones) were not random with respect to the reading frame and the DNA strand. Interestingly, out of the 21 gene products containing these amino acid repeats, 12 were clearly nuclearly localized, based on the function of the protein. When the remaining 38 ORFs of unknown function were considered, roughly the same amino acid distribution was observed (data not shown), suggesting that at least some of these ORFs may also encode proteins of nuclear localization, a useful prediction for function search.

Table I. The 21 genes of known function containing the longest trinucleotide repeat arrays of the yeast genome.

Gene              Length(s) of array in nucleotides   Function of the gene
YDL167c (ARP1)    227                                 Actin-related
YKR072c (SIS2)    169                                 NaCl or LiCl resistance
YPL190c (NAB3)    155, 49, 19, 18                     PolyA+ binding
YBR289w (SNF5)    154, 28, 20                         Transcription activator
YKL032c (IXR1)    149, 48, 35, 32, 23                 Cisplatin sensitivity
YJL115w (ASF1)    137                                 Anti-silencing
YHR077c (NMD2)    137                                 mRNA decay
YIL119c (RPI1)    127                                 Neg. regulator RAS-cAMP pathway
YFR033c (QCR6)    121                                 Ubiquinol Cyt.C reductase subunit
YPL026c (SHA3)    108                                 Ser-Thr kinase
YGL086w (MAD1)    100                                 Spindle assembly
YPL089c (RLM1)    97, 42                              Transcription factor
YBR112c (SSN6)    94                                  Transcription repressor
YLR055c (SPT8)    88, 61, 19                          Transcription start site selection
YMR070w (HMS1)    87, 46, 33, 29                      Zn finger
YML074c (FPR3)    85, 36                              Peptidylprolyl isomerase
YGR233c (PHO81)   84, 45                              Kinase inhibitor
YOR113w (AZF1)    84, 57, 54, 49                      Zn finger
YDR170c (SEC7)    82, 43                              Non-clathrin vesicle component
YGL203c (KEX1)    79, 38                              Carboxypeptidase
YPL016w (SWI1)    75, 28, 15                          Transcription activator

Most frequent amino acid (column entries in order of appearance; the assignment to rows is not recoverable): N, D-E, D-E-Q, 2, D-E, D, N, D-E, Q, D-N, N, Q-A, D-E, N-A, D-E-K, N, Q-N, D-E, D-E, T-N-Q
Cellular location (column entries in order of appearance; void cells and the assignment to rows are not recoverable): Nuc., Nuc., Nuc., Nuc., Nuc., Mit., Nuc., Nuc., Nuc., Nuc., Nuc., Nuc., Gol., Nuc.

The gene names are indicated (parentheses) along with the systematic nomenclature. For each gene, the length of the trinucleotide repeat array(s) is given in nucleotides, followed by the most frequent amino acid(s) encoded with the array (one letter code). The function of the gene has been taken from the MIPS data base (http://speedy.mips.biochem.mpg.de/mips/yeast/), and the product localization from the YPD data base (http://www.proteome.com/YPDhome.html) or deduced from the function: Nuc. = nuclear, Mit. = mitochondrial, Gol. = Golgi; void = unknown cellular localization.
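The dependence of the encoded amino acid on reading frame and strand can be made concrete with a small sketch. It assumes the standard genetic code (written out below in the usual T, C, A, G ordering) and uses a hypothetical helper, `encoded_residues`, that lists the codon read in each of the three frames on each strand of a perfect repeat. For (AGC/GCT)n, for example, only one of the six readings gives a glutamine stretch.

```python
# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def encoded_residues(triplet):
    """Map each codon read from a perfect repeat (3 frames x 2 strands) to its residue."""
    revcomp = triplet.translate(COMPLEMENT)[::-1]
    residues = {}
    for strand_unit in (triplet, revcomp):
        for shift in range(3):
            codon = strand_unit[shift:] + strand_unit[:shift]
            residues[codon] = CODON_TABLE[codon]
    return residues


for unit in ("AGC", "AAC", "ATC", "AAG"):
    readings = ", ".join(f"{codon}={aa}" for codon, aa in encoded_residues(unit).items())
    print(f"({unit})n: {readings}")
```

Stretches of Gln, Asn, Asp or Glu each correspond to a single frame and strand of their repeat class, which is the sense in which the bias noted above is non-random.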
Trinucleotide repeats in duplicated genes and in paralogous gene families

It has been proposed that trinucleotide repeats, being potential regions of rapid sequence change, may offer an efficient source of genetic variability within genomes (Green and Wang, 1994). As the yeast genome contains a significant number of duplicated genes and of paralogous gene families, examination of trinucleotide repeats in such cases may be informative about their divergence rate, compared to that of the flanking sequences. In order to examine unambiguous sequence alignments only, we limited our search to those pairs of ORFs and translation products showing at least 70 % amino acid identity when aligned using the GAP program. A total of 854 ORFs were in this case, being members of families of two or more closely related genes (A. Perrin and B. Dujon, in prep.). Assuming a random distribution of trinucleotide repeats between unique ORFs and members of gene families, it is anticipated that 92 out of the set of 854 ORFs should contain trinucleotide repeat arrays (figure calculated from the 924 arrays with initial scores of 45 or more). Instead, only 32 such ORFs were found (a third of the expected figure), indicating significant underrepresentation of trinucleotide repeat arrays in members of gene families. The subset of 372 ORFs belonging to even more closely related families (above 85 % amino acid identity) shows a similar deficiency (12 cases observed for 40 predicted).

The 32 ORFs identified above are indicative of the variability of trinucleotide repeats among paralogous genes. It was observed that only 12 such ORFs formed 6 pairs in which both members were found to contain equivalent trinucleotide repeats at the same position (table II). The 20 others were members of gene pairs in which the partner was not identified as containing a significant trinucleotide repeat array using an initial score of 45 or more. Closer examination of such cases revealed that in 4 ORFs, the trinucleotide repeat was totally absent from the partner, whereas in the 16 other cases the array was so degenerate that it could not be identified from direct sequence analysis. Examples of the various situations encountered are illustrated in figure 4. In the cases where both members of the pair contained similar arrays, it was often observed that the number of repeats differed (e.g. YOR389w/YPL277c, YDR099w/YER177w), consistent with the fact that trinucleotide repeats were unstable in length. Length variation between homologous genes was similar to that observed when comparing the same ORF in different yeast strains (Richard and Dujon, 1996). In the 4 cases mentioned above in which one member of the pair did not contain any repeat, the segment of the gene where the array was located was entirely missing in the homolog. Other cases exist, however, in more divergent gene families where only the repeat array is missing, with the flanking regions being present in both ORFs (Galisson and Dujon, 1996). Overall, our observations were therefore consistent with the hypothesis that trinucleotide repeats within ORFs may be a source of genetic variability that may be beneficial to a cell for “adapting” a protein sequence to a new environment, or for creating different proteins from a recent gene duplication. This idea is strengthened by the fact that natural size polymorphism of arrays is higher for ORFs than for intergenic regions (Richard and Dujon, 1996).

One explanation for the relative deficit of trinucleotide repeats in duplicated genes or members of gene families may be that they are located in chromosomal regions which contain fewer repeat arrays. This is the case for the subtelomeric regions of yeast chromosomes, which often contain genes with multiple homologs on other subtelomeres. We observed that these regions were very poor in trinucleotide repeats (see fig. 2). However, even if we exclude the subtelomeric ORFs from the calculation, the number of occurrences of trinucleotide repeat arrays in non-unique ORFs remained significantly lower than expected (32 out of 75 predicted). Other explanations, such as the nature of the genes that are duplicated or in families, or the possibility of ectopic gene conversion between them, must also be considered.

Table II. Trinucleotide repeats in closely related gene families.

ORF pairs with conserved repeats
ORF               Position of array(s)   Homolog           Position of array(s)   % I    Conserved repeat
YBR031w (RPL2A)   300485                 YDR012w (RPL2B)   472257                 99.7   Yes
YER177w (BMH1)    544332                 YDR099w (BMH2)    654376                 93.6   Yes
YOR389w           1074681                YPL277c           16368                  92.0   Yes
YMR183c (SSO2)    626947                 YPL232w (SSO1)    108095                 74.8   Yes
YFR024ca          202278                 YHR016c (YSC84)   137414                 72.4   Yes
YGR086c           649663, 649734         YPL004c           550777                 71.8   Degener., Yes

ORF pairs without conserved repeats
ORF               Position of array(s)   Homolog           % I    Conserved repeat
YMR186w (HSC82)   633071                 YPL240c (HSP82)   97.7   Degener.
YLR037c           222998                 YPL282c           95.2   Degener.
YGR115c           720867                 YGR113ca          93.2   No
YHR211w (FLO5)    527579                 YAR050w (FLO1)    93.0   Degener.
YNL069c (RP23)    493996                 YIL133c (RP22)    90.4   Degener.
YGL008c (PMA1)    482622                 YPL036w (PMA2)    89.1   Degener.
YHR211w (FLO5)    527579                 YAL065c           84.8   No
YLR249w (YEF3)    639803, 639839         YNL014w           83.8   Degener., Degener.
YJR045c (SSC1)    519353                 YEL030w           80.8   Degener.
YLR328w           785091                 YGR010w           78.0   Degener.
YKL129c (MYO3)    196357                 YMR109w           76.8   Degener.
YHR135c (YCK1)    372882                 YNL154c (YCK2)    76.6   Degener.
YIL130wa          128505, 128653         YNL066w (SUN4)    74.4   Degener., Degener.
YJR151c           714365, 714575         YPL282c           74.0   No, No
YLR037c           222998                 YJR151c           74.0   Degener.
YDL088ca          299863                 YDL087c           73.8   Degener.
YOR383c           1060770                YDR534c           73.0   Degener.
YBR084w (MIS1)    412354                 YGR204w (ADE3)    72.1   Degener.
YBR054w (YRO2)    343983                 YDR033w           70.9   Degener.
YHR217c           556825                 YBL109w           70.7   No

Gene names are indicated (parentheses) along with the systematic nomenclature. The position of the first nucleotide of the repeat is given. In the first part of the table, both members of a gene family contain trinucleotide repeats. In the second part of the table, the homolog does not contain a significant trinucleotide repeat (No) or contains a degenerated form of the repeats not found during the primary search (Degener.). See text. The percentage of amino acid identity between the two genes is indicated (% I).
DISCUSSION
The complete genomic sequence of yeast enables a comprehensive description of the occurrence of trinucleotide repeats in this species, which cannot be directly compared to any other eucaryotic organism in the present status of genome programs. However, extrapolations from the present sequences of mammalian genomes reveal interesting similarities with yeast, which suggests that the mechanisms underlying trinucleotide repeat formation or instability are conserved in eucaryotic evolution.

Fig. 4. Examples of conservation and divergence of trinucleotide repeats within closely related genes.
ORF sequences and the corresponding amino acid sequences are shown in the regions corresponding to the trinucleotide repeat arrays identified. Sequences within the arrays are in bold type. Pairs shown: YPL277c/YOR389w, YER177w/YDR099w, YMR186w/YPL240c, YIL130wa/YNL066w, YGR115c/YGR113ca.

An analysis of the distribution of trinucleotide microsatellites in mammalian genomes showed that the most common repeats encountered in exons were (CAG/CTG)n and (CGG/CCG)n, with the latter also being the most frequent in 5’ untranslated regions (Stallings, 1994). Similarly, we observe here that the triplet (AGC/GCT)n, which is equivalent to (CAG/CTG)n, is overrepresented within ORFs. The triplet (CGG/CCG)n is extremely rare in yeast (only three occurrences in the whole genome), consistent with the overall low G-C content and the underrepresentation of the dinucleotide CG, the latter feature being common to all eucaryotes (Burge et al., 1992; Dujon et al., 1994; Kalogeropoulos, 1993).

Likewise, the amino acids encoded in repeats can be compared between yeast and mammalian genomes. Codon reiteration was previously examined from general sequence databanks to which mammalian genomes contributed to a large extent (Green and Wang, 1994), and it was found that the variety of amino acids represented rapidly diminished with the length of the reiterants, with the most common amino acid in long reiterations (≥ 20 aa) being glutamine, the frequency of which largely exceeds that of all other amino acids. The situation is different in yeast, since three other amino acids (Asp, Asn and Glu) are equally represented in translation products of long trinucleotide repeats (≥ 25 aa).

Finally, while yeast and mammals both have trinucleotide repeats in their genomes, their frequency of occurrence differs greatly. Whereas trinucleotide repeat arrays occur on average once every ca. 13 kb in yeast, similar arrays are found in mammals only once every ca. 215 kb or even less often, since the present databanks tend to contain many microsatellites (Stallings, 1994). The difference may be related to the compactness of the yeast genome compared to the mammalian genomes, since trinucleotides are often present in exons.
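The spacing figures used in this comparison follow directly from the counts given in the Results. As a minimal arithmetic sketch (the helper name `kb_per_array` is hypothetical; the genome size and array counts are those quoted in the text):

```python
GENOME_KB = 12_067  # nuclear genome outside rDNA, in kb (from the Results)


def kb_per_array(n_arrays, genome_kb=GENOME_KB):
    """Average spacing, in kb, between repeat arrays."""
    return genome_kb / n_arrays


spacing_36 = kb_per_array(1_769)  # arrays with initial score >= 36
spacing_45 = kb_per_array(924)    # arrays with initial score >= 45
fold = 215 / spacing_45           # mammalian spacing (ca. 215 kb) relative to yeast

print(f"score >= 36: one array per {spacing_36:.1f} kb")
print(f"score >= 45: one array per {spacing_45:.1f} kb")
print(f"mammalian spacing is roughly {fold:.0f} times larger")
```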
Acknowledgments

We wish to thank Arnaud Perrin for sharing results on gene duplication in yeast prior to publication, Frédérique Galisson for her help with computer programming and for critical reading of the manuscript, and Emmanuelle Fabre for fruitful discussions. Part of this analysis would not have been possible without the yeast database provided by MIPS, and we would like to thank H.W. Mewes and all MIPS members for their very fruitful collaboration during the Yeast Sequencing Program. G.-F. R. is the recipient of a postdoctoral fellowship from the Association Française contre les Myopathies (A.F.M.). This work was supported by a grant from the MENSER (ACC-SVI).
Trinucleotide repeats in yeast

The yeast genome shows various trinucleotide repeats in protein-coding genes as well as in intergenic regions. In the first case, these repeats are non-random with respect to the reading frame, which, at the protein level, often leads to long repeats of amino acids with acidic or amino side chains. It is interesting to observe that the longest trinucleotide repeats are often present in genes that encode proteins with a nuclear localization. Trinucleotide repeats tend to be more frequent in long genes. In contrast, they are less frequent among genes that are members of structural families than among unique genes. In the case of genes in families, the repeats in homologous genes often differ in length or composition, which indicates their instability.

Key-words: Yeast, Trinucleotide; Repeats, Genome analysis, Gene duplication.
References
Arnold, J., Cuticchia, A.J., Newsome, D.A., Jennings, W.W.D. & Ivarie, R. (1988), Mono- through hexanucleotide composition of the sense strand of yeast DNA: a Markov chain analysis. Nucleic Acids Res., 16, 7145-7158.
Benson, G. & Waterman, M.S. (1994), A method for fast database search for all k-nucleotide repeats. Nucleic Acids Res., 22, 4828-4836.
Bernardi, G. (1993), The vertebrate genome: isochores and evolution. Mol. Biol. Evol., 10, 186-204.
Britten, R.J. & Kohne, D.E. (1968), Repeated sequences in DNA. Hundreds of thousands of copies of DNA sequences have been incorporated into the genomes of higher organisms. Science, 161, 529-540.
Burge, C., Campbell, A.M. & Karlin, S. (1992), Over- and under-representation of short oligonucleotides in DNA sequences. Proc. Natl. Acad. Sci. USA, 89, 1358-1362.
Charlesworth, B., Sniegowski, P. & Stephan, W. (1994), The evolutionary dynamics of repetitive DNA in eukaryotes. Nature (Lond.), 371, 215-220.
Deininger, P.L. (1989), in “Mobile DNA” (D.E. Berg and M.M. Howe) (pp. 619-636). American Society for Microbiology.
Dib, C., Faure, S., Fizames, C., Samson, D., Drouot, N., Vignal, A., Millasseau, P., Marc, S., Hazan, J., Seboun, E., Lathrop, M., Gyapay, G., Morissette, J. & Weissenbach, J. (1996), A comprehensive genetic map of the human genome based on 5,264 sequences. Nature (Lond.), 380, 152-154.
Dujon, B. (1996), The yeast genome project: what did we learn? TIG, 12, 263-270.
Dujon, B., Alexandraki, D., Andre, B., Ansorge, W., Baladron, V., Ballesta, J.P.G. et al. (1994), Complete DNA sequence of yeast chromosome XI. Nature (Lond.), 369, 371-378.
Galisson, F. & Dujon, B. (1996), Sequence and analysis of a 33-kb fragment from chromosome XV of the yeast Saccharomyces cerevisiae. Yeast, 12, 977-985.
Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B., Feldman, H., Galibert, F., Hoheisel, J.D., Jacq, C., Johnston, M., Louis, E.J., Mewes, H.W., Murakami, Y., Philippsen, P., Tettelin, H. & Oliver, S.G. (1996), Life with 6000 genes. Science, 274, 546-567.
Green, H. & Wang, N. (1994), Codon reiteration and the evolution of proteins. Proc. Natl. Acad. Sci. USA, 91, 4298-4302.
Hutchinson III, C.A. (1989), in “Mobile DNA” (D.E. Berg and M.M. Howe) (pp. 593-617). American Society for Microbiology.
Jaworski, A., Rosche, W.A., Gellibolian, R., Kang, S., Shimizu, M., Bowater, R.P., Sinden, R.R. & Wells, R.D. (1995), Mismatch repair in Escherichia coli enhances instability of (CTG)n triplet repeats from human hereditary diseases. Proc. Natl. Acad. Sci. USA, 92, 11019-11023.
Kalogeropoulos, A. (1993), Linguistic analysis of chromosome III DNA sequence of Saccharomyces cerevisiae. Yeast, 9, 889-905.
Karlin, S., Blaisdell, B.E., Sapolsky, R.J., Cardon, L. & Burge, C. (1993), Assessments of DNA inhomogeneities in yeast chromosome III. Nucleic Acids Res., 21, 703-711.
McDonald, J.F. (1993), Evolution and consequences of transposable elements. Curr. Opin. Genet. Dev., 3, 855-864.
Oliver, S.G., van der Aart, Q.J.M., Agostini-Carbone, M.L., Aigle, M., Alberghina, L., Alexandraki, D. et al. (1992), The complete DNA sequence of yeast chromosome III. Nature (Lond.), 357, 38-46.
Ollivier, E., Delorme, M.-O. & Henaut, A. (1995), DusDNA occurs along yeast chromosomes, regardless of functional significance of the sequence. C.R. Acad. Sci. Paris, 318, 599-608.
Richard, G.-F. & Dujon, B. (1996), Distribution and variability of trinucleotide repeats in the genome of the yeast Saccharomyces cerevisiae. Gene, 174, 165-174.
Sharp, P.M. & Lloyd, A.T. (1993), Regional base composition variation along yeast chromosome III: evolution of chromosome primary structure. Nucleic Acids Res., 21, 179-183.
Singer, M. & Berg, P. (1991), Genes and genomes. University Science Books, Mill Valley, CA.
Stallings, R.L. (1994), Distribution of trinucleotide microsatellites in different categories of mammalian genomic sequence: implications for human genetic diseases. Genomics, 21, 116-121.
Strand, M., Prolla, T.A., Liskay, R.M. & Petes, T.D. (1993), Destabilization of tracts of simple repetitive DNA in yeast by mutations affecting DNA mismatch repair. Nature (Lond.), 365, 274-276.
Thompson, J.D., Higgins, D.G. & Gibson, T.J. (1994), CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
Warren, S.T. (1996), The expanding world of trinucleotide repeats. Science, 271, 1374-1375.
Weiller, G., Schueller, C.M. & Schweyen, R.J. (1989), Putative target sites for mobile G+C rich clusters in yeast mitochondrial DNA: single elements and tandem arrays. Mol. Gen. Genet., 218, 272-283.
Yagil, G. (1994), The frequency of oligopurine-oligopyrimidine and other two-base tracts in yeast chromosome III. Yeast, 10, 603-611.
Yin, S., Heckman, J. & RajBhandary, U.L. (1981), Highly conserved GC-rich palindromic DNA sequences flank tRNA genes in Neurospora crassa mitochondria. Cell, 26, 325-332.