Frameshift mutation events in β-glucosidases


Gene 314 (2003) 191 – 199 www.elsevier.com/locate/gene

Antonio Rojas, Santiago Garcia-Vallvé, Miguel A. Montero, Lluís Arola, Antoni Romeu *

Evolutionary Genomics Group, Department of Biochemistry and Biotechnology, Rovira i Virgili University, Pl. Imperial Tàrraco, 1. E-43005 Tarragona, Catalonia, Spain

Received 11 February 2003; received in revised form 7 July 2003; accepted 22 July 2003
Received by G. Pesole

Abstract

Compensated frameshift mutation is a modification of the reading frame of a gene that takes place by way of various molecular events. It appears to be a widespread event that is only observed when homologous amino acid and nucleotide sequences are compared. To identify these mutation events, the sequence analysis rationale was based on the search for short regions that have much lower degrees of conservation in protein, but not in DNA, in well-conserved β-glucosidase families. We restricted our study to a seed set of sequences of O-glycoside hydrolase families 1 and 3. We found compensated frameshift mutations in the family 1 β-glucosidases for the Erwinia herbicola, Cellulomonas fimi, and (non-cyanogenic) Trifolium repens gene sequences, and in the family 3 β-glucosidases for the Clostridium thermocellum and Clostridium stercorarium gene sequences. The observed mutation events in the gene frameshifting sub-sequence were neutralised computationally: each nucleotide insertion was eliminated and each nucleotide deletion was replaced by the symbol N (any nucleotide). When the frameshifting fragments of the amino acid sequences were substituted by the computationally neutralised subsequences, the β-glucosidase alignments improved. We also discuss the structural implications of the compensated frameshift mutation events. © 2003 Elsevier B.V. All rights reserved.

Keywords: Compensated frameshift mutation; O-glycoside hydrolases; Reading frame; Sequence alignment; Molecular evolution

1. Introduction

The evolution of macromolecules encompasses the patterns of change over evolutionary time in genetic material (e.g. DNA sequences) and its encoded products (e.g. proteins) (Kimura, 1983; Doolittle, 1990). The mechanisms responsible for such inherited changes include base-pair substitution and/or insertions or deletions of the nucleic acid that makes up the genome of an organism. Insertions and deletions of nucleotides occur in DNA, but at a considerably lower rate than nucleotide substitutions (Doolittle, 1990; Li, 1997; Hartl and Jones, 2000). Insertions or deletions that occur in coding DNA and involve a number of nucleotides that is not a multiple of three cause a shift in the reading frame. This may also obliterate the termination codon or bring a new stop codon into phase (Gilkes et al., 1991; Henrissat, 1993). If an FSM occurs, partial restoration of the reading frame can often be accomplished by the insertion or deletion of new base pairs that compensate for the mutation event. As long as a compensated FSM does not affect the function of a given protein, this event could be maintained throughout evolution (Henrissat et al., 1995). This seems to be the case with some β-glucosidases (EC 3.2.1.21), which are carbohydrate-active enzymes (Coutinho and Henrissat, 1999a,b,c) that belong to the O-glycoside hydrolases (EC 3.2.1.-) families 1 and 3. β-Glucosidases are an important group of enzymes that are responsible for cleaving a range of biologically significant compounds by hydrolysing the glycosidic bond between two or more carbohydrates or between a carbohydrate and a non-carbohydrate moiety (Freer, 1993). In this paper, we describe different patterns of compensated FSM in some β-glucosidase family 1 and family 3 genes.

Abbreviations: FSM, frameshift mutation; ERH, Family 1 Erwinia herbicola β-glucosidase; CFI, Family 1 Cellulomonas fimi β-glucosidase; TRS, Family 1 non-cyanogenic Trifolium repens β-glucosidase; CST, Family 3 Clostridium stercorarium β-glucosidase; CTH, Family 3 Clostridium thermocellum β-glucosidase; BCI, Family 1 Bacillus circulans β-glucosidase; BPA, Family 1 Bacillus polymyxa β-glucosidase; CTA, Family 1 Clostridium thermocellum β-glucosidase; STR, Family 1 Streptomyces sp. β-glucosidase; MBB, Family 1 Microbispora bispora β-glucosidase; TMA, Family 1 Thermotoga maritima β-glucosidase; CBG, Family 1 cyanogenic T. repens β-glucosidase; PRA, Family 1 Prunus avium β-glucosidase; MES, Family 1 Manihot esculenta β-glucosidase; ATU, Family 3 Agrobacterium tumefaciens β-glucosidase; MBI, Family 3 Microbispora bispora β-glucosidase; ECO, Family 3 Escherichia coli β-glucosidase.

* Corresponding author. Department of Biochemistry and Biotechnology, Faculty of Chemistry, Rovira i Virgili University, Pl. Imperial Tàrraco, 1. E-43005 Tarragona, Catalonia, Spain. Tel.: +34-977-55-81-88; fax: +34-977-55-82-32. E-mail address: [email protected] (A. Romeu).

0378-1119/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/S0378-1119(03)00828-X

A. Rojas et al. / Gene 314 (2003) 191–199

2. Materials and methods

The sequence analysis rationale was based on the search for short regions that have much lower degrees of conservation in protein, but not in DNA, in well-conserved β-glucosidase families. The systematic search for compensated FSM was based on the following criteria:

(1) Characterising patterns and identity profiles. Unchanged amino acids across a multialignment may be seen as a fingerprint of that multialignment. We can then establish groups of patterns or conserved motifs that are perfectly related and that correspond to specific functional parts of the protein.

(2) Setting up the group of sequences. The numerous unchanged amino acid residues in an entire block of the multialignment greatly facilitate the frameshift analysis in the whole sample of homologous sequences. However, in proteins with low sequence identity, as the size of the whole sequence sample increases, the number of unchanged residues in the multialignment usually decreases. So, the number of protein sequences in the analysis must be adjusted to produce a suitable number of unchanged residues.

(3) Indicating an FSM. A good method is to check whether the DNA multialignment contains a strictly conserved triplet or doublet that does not correspond to any unchanged amino acid residue in the equivalent position in the protein alignment. We should also determine whether one nucleotide or one amino acid (or a short stretch of either) of one specific sequence makes an aligned gap appear in the other sequences, or whether a gap involving one single sequence appears in a given position of the multialignment.

(4) Limiting the FSM. This involves counting the compensatory indels upstream and downstream from the control position. To do this, we should take as reference one sequence, or a small number of sequences, that shows high sequence identity with the frameshifting sequence.

(5) Retrieving and translating the inverted gene frameshifting sub-sequence. This involves computationally treating the observed mutation events. Each nucleotide insertion must be eliminated and each nucleotide deletion must be substituted by the symbol N (any nucleotide). This new neutralised gene fragment must then be translated.

(6) Aligning the inverted amino acid stretch of the frameshifting amino acid sequence. The amino acid stretch affected by the frameshift event must be substituted by the neutralised subsequence obtained by computational insertions and deletions. In a new protein alignment, this hypothetical sub-sequence should leave the conserved residues unchanged and recover the local amino acid similarity.

(7) Analysing the structural implications of the FSM. Secondary structure predictions for the analysed sequences were made using the EMBL server Predict-Protein (http://www.embl-heidelberg.de/predictprotein/predictprotein.html), which incorporates neural network training (Schneider and Sander, 1996). Secondary structure predictions were also made with the new hypothetical protein sequences obtained by substituting the frameshifting stretch with the neutralised subsequence. Secondary structures were aligned with the SECALING program (written in FORTRAN 77) in the following way: each amino acid in the alignment was substituted with its secondary-structure prediction, and the gaps were maintained. If the tridimensional structure of a given analysed protein was known, the secondary structure state of each residue was introduced.

(8) Controlling quality. Computational analyses such as those described here may not only reveal natural mutation events; they may also have far-reaching implications for controlling the information stored in biological databases.

Sequence alignments were done using the ClustalW program (Thompson et al., 1994). The parameters for ClustalW analysis were: scoring matrix, PAM250; opening gap penalty, 10; end gap penalty, 10; extending gap penalty, 0.05; separation gap penalty, 0.05. We calculated the orthologous and paralogous clusters using the neighbour-joining method and 1000 bootstrap replicates with the ClustalW program. Our own programs, written in FORTRAN 77, were used to: (1) read sequence alignment files and generate new text files in which unchanged nucleotides or amino acids were depicted in the corresponding multialignment positions; and (2) align the elements of the secondary structure. Points 3, 4 and 5 of the above systematic search had to be done manually.
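The neutralisation and translation described in criteria (5) and (6) can be illustrated with a minimal Python sketch. It is not the authors' FORTRAN 77 code: the function names (`neutralise`, `translate`, `possible_residues`), the 0-based indexing convention and the 18-nucleotide toy sequence are all assumptions made for illustration, and the toy sequence is invented rather than taken from any of the genes studied. Only the standard genetic code is assumed.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code written in the usual TCAG-ordered layout.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def possible_residues(codon):
    """Set of amino acids a codon can encode when N stands for any base."""
    choices = (BASES if base == "N" else base for base in codon)
    return {CODON_TABLE["".join(c)] for c in product(*choices)}


def translate(seq):
    """Translate in frame 0; a codon left ambiguous by N is written as X."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        residues = possible_residues(seq[i:i + 3])
        protein.append(residues.pop() if len(residues) == 1 else "X")
    return "".join(protein)


def neutralise(seq, insertions, deletions):
    """Undo indels: drop inserted bases and write N where a base was lost.

    insertions: set of 0-based indices of inserted bases in seq.
    deletions: list of 0-based indices in seq before which one base was lost
               (len(seq) means the base was lost at the very end).
    """
    out = []
    for i, base in enumerate(seq):
        out.extend("N" * deletions.count(i))
        if i not in insertions:
            out.append(base)
    out.extend("N" * deletions.count(len(seq)))
    return "".join(out)


# Hypothetical toy data: an "ancestral" fragment and a copy carrying one
# upstream insertion (A at index 0) and one downstream deletion (last base).
ancestral = "CTGGCGTGGACCCGTATT"
shifted = "ACTGGCGTGGACCCGTAT"

print(translate(ancestral))                          # LAWTRI
print(translate(shifted))                            # TGVDPY
print(translate(neutralise(shifted, {0}, [18])))     # LAWTRX
print(sorted(possible_residues("ATN")))              # ['I', 'M']
```

In this toy case the restored fragment ends in an ATN codon, which translates to X because it could encode either isoleucine or methionine. A codon such as GCN would remain fully determined as alanine.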
At present, glycoside hydrolases are classified on the basis of linear amino acid sequence into 88 different families (Coutinho and Henrissat, 1999a). There are approximately 300 known members of glycoside hydrolase family 1 and 200 known members of glycoside hydrolase family 3. Because O-glycoside hydrolase families 1 and 3 have so many members, in the respective multialignments of all the sequences of each family only the catalytic residues involved in the PROSITE signature patterns (Falquet et al., 2002) (Family 1, PDOC00495; Family 3, PDOC00621) appear as unchanged residues across the sequences (data not shown). For this reason, the overall multialignment is not suitable for analysing compensated FSM. Using the above-mentioned rationale, we restricted our study to a seed set of sequences of O-glycoside hydrolase families 1 and 3 in order to have unchanged amino acids that act as reference points for the multialignment. From the GenBank/EMBL and SWISS-PROT databases, we took a representative set of 43 and 18 protein sequences from glycoside hydrolase families 1 and 3, respectively. The set of 43 family 1 glycoside hydrolase proteins contained: bacterial β-glucosidases (EC 3.2.1.21), 6-phospho cellobiase and 6-phospho-β-glucosidases (EC 3.2.1.86); archaeal β-glucosidases; and eukaryotic β-glucosidases, myrosinases (EC 3.2.3.1) and LPH (lactase-phlorizin hydrolase), which is an integral membrane glycoprotein that splits lactose in the small intestine. The set of 18 family 3 glycoside hydrolase proteins contained bacterial and eukaryotic β-glucosidase sequences.
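The unchanged-position profiles used in criteria (1) and (3) above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' FORTRAN 77 program: the function names and the three-sequence toy alignment are hypothetical, and the alignment is assumed to be a list of equal-length strings with gaps written as dashes.

```python
def unchanged_columns(alignment):
    """Indices of columns where every sequence has the same non-gap symbol."""
    return [i for i, column in enumerate(zip(*alignment))
            if len(set(column)) == 1 and column[0] != "-"]


def conserved_runs(columns, min_len=2):
    """Group consecutive column indices into (start, end) runs of >= min_len."""
    runs, start, prev = [], None, None
    for c in columns:
        if start is None:
            start = prev = c
        elif c == prev + 1:
            prev = c
        else:
            if prev - start + 1 >= min_len:
                runs.append((start, prev))
            start = prev = c
    if start is not None and prev - start + 1 >= min_len:
        runs.append((start, prev))
    return runs


# Hypothetical toy DNA alignment with a strictly conserved TGG triplet.
dna = ["GCGTGGACC",
       "TCATGGTCT",
       "CCTTGGGCA"]

columns = unchanged_columns(dna)
print(columns)                   # [1, 3, 4, 5, 7]
print(conserved_runs(columns))   # [(3, 5)]
```

A conserved doublet or triplet reported by `conserved_runs` on the DNA alignment becomes a candidate control position when the protein alignment shows no unchanged residue at the equivalent place.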

3. Results

3.1. Family 1 Erwinia herbicola β-glucosidase (SWISS-PROT: Q59437)

The gene alignment indicated an unchanged TGG triplet in a conserved block near the 5′ end. This cluster suggested the existence of at least one strictly conserved tryptophan in the equivalent region of the protein alignment. However, the only unchanged residue in this region of the protein alignment was a proline (alignment position 181, Fig. 1a). Remarkably, five positions upstream from this proline there was a tryptophan that was conserved in all sequences except that of the ERH. Likewise, between the conserved tryptophan and the unchanged proline, ERH was the only sequence that did not show a conserved arginine. The DNA alignment showed that, around the unchanged TGG triplet, an adenine residue of the ERH gene generated a single gap position in all other DNA sequences (Fig. 1b). This represents a relative change in the reading frame. However, sixteen residues downstream, a gap was generated in the ERH gene sequence that compensated for the shift produced in the reading frame. This is evidence for a double frameshift mutation event in the evolutionary history of the ERH gene, limited by these indels. The translation of the ERH gene frameshifting string produces the six-amino-acid stretch HRLDTY, which breaks the consensus pattern [I, V, L, F]-[A, S, E, G, P]-W-[S, P, A, T]-R-[I, L, V] in the protein alignment. To retrieve the hypothetical ancestral reading frame in this region of the ERH gene, the above-mentioned adenine, which generated the first indel, was computationally removed. For the second indel, a nucleotide (N) was inserted in the downstream gap position. The translation of this new hypothetical gene fragment produced the amino acid fragment LAWTRX. Computationally substituting the frameshifting string in the ERH amino acid sequence (HRLDTY) by the new hypothetical string (LAWTRX) means that, in the protein alignment, the tryptophan and arginine remain unchanged, and the consensus pattern is recovered (Fig. 1a). Note that the undefined last residue X probably agrees with the consensus because, of the four possible codons for ATN, three code for isoleucine (I) and one codes for methionine. In the ERH gene, the nucleotides of the strictly conserved TGG triplet found in the DNA alignment correspond to the leucine second codon position (T), the leucine third codon position (G) and the aspartic acid first codon position (G), respectively (L93, D94, ERH numbering). For the other genes, the TGG triplet corresponds to the tryptophan codon.

Fig. 1. (a) Fragment of the amino acid sequence multialignment of 4 out of 43 selected family 1 O-glycoside hydrolases around the frameshift region of the ERH. ERH1 means the ERH sequence, in which the frameshifting stretch has been computationally substituted by the new hypothetical ancestral stretch (see Section 3.1). Gaps are denoted by a dash. Numbers on the right are the residue positions from the N-terminus in each sequence. Top-row numbers correspond to the protein multialignment position. The unchanged proline and the conserved tryptophan and arginine (see Section 3.1) are denoted in red. The hypothetical ancestral stretch of the ERH1 is depicted in bold. The black bar represents the β2 strand of the canonical (β/α)8-barrel structure. BCI, Swiss-Prot Q03506; BPA, Swiss-Prot P22073; CTA, Swiss-Prot 26208. (b) Fragment of the nucleotide sequence multialignment of 4 out of 43 selected family 1 O-glycoside hydrolase genes around the frameshift region of the ERH. Gaps are denoted by a dash. Numbers on the right are the residue positions from the 5′ end in each sequence. Top-row numbers correspond to the DNA multialignment position. The unchanged nucleotides (see Section 3.1) are denoted in bold.

3.2. Family 1 Cellulomonas fimi β-glucosidase (TrEMBL: Q46043)

The gene alignment showed an unchanged TA doublet in a conserved block close to the 3′ end. This suggests the presence of an unchanged tyrosine in the equivalent position of the protein alignment. This was indeed the case in all the amino acid sequences except that of the CFI (Fig. 2a). In the gene of this enzyme, the unchanged TA doublet occupies the second and third positions of a leucine codon (L373, CFI numbering). This suggests an FSM event in the CFI gene. The block of DNA alignment around this unchanged TA doublet revealed no significantly different feature in the nucleotide sequence of the CFI gene (Fig. 2b). So, in order to set bounds to this possible FSM event, we explored a more extensive part of the DNA alignment, taking as reference the Streptomyces sp. β-glucosidase gene, which is very similar to CFI. Fig. 3 shows a local comparison of these genes. In the CFI gene, a single-nucleotide deletion (seen as a single gap after position 779, CFI gene numbering) and a four-nucleotide string deletion after position 898 were observed upstream of the conserved TA doublet (positions 1118–1119). These five nucleotide deletions had a dramatic effect on the local amino acid similarity between the CFI and the other sequences. The deletions were compensated by the insertion of a five-nucleotide string, AGGGC, after position 1142, downstream from the conserved TA. This insertion appears as a five-gap position in the Streptomyces sp. β-glucosidase gene and produced a notable recovery of the similarity between the CFI sequence and its counterparts. To retrieve the hypothetical ancestral CFI sub-sequence, we inverted the described deletions and insertions computationally, i.e. a single nucleotide insertion after gene position 779, a four-nucleotide string insertion after position 898 and a five-nucleotide stretch deletion after position 1142. With this treatment, translating the new hypothetical gene fragment produced a CFI amino acid stretch that was very similar to that of the reference.
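The indel bookkeeping described above for the CFI gene can be checked with a short Python sketch. The positions and sizes are those quoted in the text (deletions of 1 and 4 nucleotides after positions 779 and 898, an insertion of 5 nucleotides after position 1142); the function name and the sign convention (negative for deletions, positive for insertions) are assumptions made for illustration.

```python
def frame_offsets(indels):
    """Cumulative reading-frame offset (mod 3) after each indel.

    indels: list of (position, size) pairs; size < 0 for a deletion,
            size > 0 for an insertion. An offset of 0 means in frame.
    """
    total, trace = 0, []
    for position, size in sorted(indels):
        total += size
        trace.append((position, total % 3))
    return trace


# Values quoted in Section 3.2 (CFI gene numbering).
cfi_indels = [(779, -1), (898, -4), (1142, +5)]

trace = frame_offsets(cfi_indels)
print(trace)                 # [(779, 2), (898, 1), (1142, 0)]
print(trace[-1][1] == 0)     # True: the last indel restores the frame
```

The frame is out of phase from the first deletion until the final insertion, which is why the stretch between positions 779 and 1142 is the region whose translation departs from the reference.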
Therefore, the FSM event in CFI affected an internal 123-amino-acid segment that is rich in arginine and proline residues. According to the canonical structure of the family 1 (β/α)8-barrel, the segment extends along the β5- and β6-strands. CFI secondary structure prediction showed that the frameshifting segment falls into the coil state (results not shown). Therefore, in the putative structure of the CFI there is a predicted coil fragment rather than the α-helix-loop-β-strand-loop modules corresponding to the canonical barrel β5- and β6-strands. This means that the CFI structure has a (β/α)6-barrel scaffold with a 123-amino-acid basic domain that protrudes between the fourth and fifth β-strands. The catalytic glutamic acids, the proton donor E244 (CFI numbering) and the nucleophile E452, remain unchanged. These catalytic residues, which in O-glycoside hydrolase family 1 are located at the C-terminal part of the parallel β4- and β7-strands of the (β/α)8-barrel, fall in CFI at the C-terminal part of the β4- and β5-strands of the putative (β/α)6-barrel domain, respectively.

Fig. 2. (a) Fragment of the amino acid sequence multialignment of 4 out of 43 selected family 1 O-glycoside hydrolases around the frameshift region of the CFI. Gaps are denoted by a dash. Numbers on the right are the residue positions from the N-terminus in each sequence. Top-row numbers correspond to the protein multialignment position. The double bar means that gaps in all members of the alignment fragment have been eliminated. The conserved tyrosine, whose codon involves the unchanged TA doublet (see Section 3.2), is denoted in bold. The CFI leucine (L373), whose codon involves the unchanged TA doublet (see text), is denoted in bold. STR, TrEMBL Q59976; MBB, Swiss-Prot P38645; TMA, Swiss-Prot Q08638. (b) Fragment of the DNA nucleotide sequence multialignment of 4 out of 43 selected family 1 O-glycoside hydrolases around the frameshift region of the CFI. Gaps are denoted by a dash. Numbers on the right are the residue positions from the 5′ end in each sequence. Top-row numbers correspond to the DNA multialignment position. The double bar means that gaps in all members of the alignment fragment have been eliminated. The unchanged nucleotides are denoted in bold.

Fig. 3. Comparison of the CFI and STR gene fragments involved in both the CFI frameshifting segment and the flanking regions. The numbers in the side columns are the residue positions from the 5′ end in each sequence. DNA translation is depicted. Each amino acid symbol is located below (for the STR sequence) and above (for the CFI sequence) the nucleotide corresponding to the first codon position. The CFI nucleotide deletions (1 and 4) and the CFI nucleotide insertions (5) are denoted by black points and blue dashes or characters. These deletions and insertions are the limits of the FSM event in the CFI sequence (see Section 3.2). In the figure, the effect of the FSM is also shown by the dealignment of the amino acid residues. In the regions flanking the frameshift, the identical amino acids are denoted in bold. The unchanged TA doublet (see text) is also denoted in bold.

3.3. Family 1 non-cyanogenic Trifolium repens β-glucosidase (SWISS-PROT: P26204)

Careful observation of the DNA alignment of the whole sample of O-glycoside hydrolase family 1, at several positions upstream from the region coding the E-N-G signature pattern (nucleophile catalytic residues), shows that an adenine of the TRS gene (A1257, gene numbering) generates a single gap in all the other aligned gene sequences (Fig. 4a). Remarkably, the gap event occurred six positions upstream from the unchanged doublets GA and GG, which code the also-unchanged glutamic acid and glycine, respectively, of the above-mentioned signature pattern (PROSITE, PDOC00495). Because of the strict conservation of these doublets, which code a part of the active site of these enzymes and reflect an important evolutionary functional constraint, the indel produced by that adenine in the TRS before the conserved cluster strongly suggests that there is an FSM event in this gene. However, the block of DNA alignment around this indel and before the unchanged doublets did not show any feature that distinguished the TRS gene. Therefore, to investigate the genetic grounds for a possible FSM in this gene, we analysed a more extensive region of the DNA alignment upstream from the indel. To facilitate the study, we compared the TRS to the three similar plant β-glucosidase genes of the sequence sample. This sub-group included the cyanogenic T. repens β-glucosidase (Fig. 4a). This analysis revealed that in the TRS, 61 nucleotides upstream from the inserted adenine, one string of 14 nucleotides was inserted, thus generating the equivalent gap positions in the other sequences. In all, as far as the TRS gene is concerned, the sum of nucleotide insertions involved in the string plus the single adenine is 15 (14 + 1), which is a multiple of three. This is a compelling sign of a compensatory FSM event that took place in the evolution of the TRS gene. At protein level, this event affected a 31-amino-acid stretch before the catalytic nucleophile glutamic acid. The protein alignment of this region showed that, in the TRS sequence, a five-amino-acid string, YMFIQ (Y395-Q399, sequence numbering), also produced a gap of five positions in the other amino acid sequences (Fig. 4b). The coincidence of a five-amino-acid insertion in the protein alignment with a nucleotide stretch insertion at the equivalent positions in the gene was surprising. Returning to the DNA alignment, an in-depth analysis of this particular region of the string revealed the existence of a 14-nucleotide stretch duplication in the TRS gene (Fig. 4a). Given these mechanisms, the hypothetical ancestral amino acid stretch can be retrieved by computationally deleting one of the two duplicated 14-nucleotide strings and the adenine at the end of the multiple FSM event. This computational gene treatment, which serves as a confirmatory procedure for the natural occurrence of the FSM, improves the local amino acid similarity between the TRS and the other amino acid sequences (Fig. 4b).

Fig. 4. (a) Fragment of the DNA nucleotide sequence multialignment of four plant β-glucosidase sequences out of 43 selected family 1 O-glycoside hydrolases around the frameshifting gene region of the TRS (see Section 3.3). Gaps are denoted by a dash. Numbers on the right are the residue positions from the 5′ end in each sequence. DNA translation is depicted. Each amino acid symbol is located below the nucleotide corresponding to the first codon position. The frameshifting gene fragment of TRS is limited by the internal gene string duplication and the single adenine insertion. The 14-nucleotide strings that result from the internal gene duplication are shown in bold. The downstream-inserted adenine is in red. CBG, Swiss-Prot P26205; PRA, TrEMBL Q43014; MES, TrEMBL Q40283. (b) Fragment of the amino acid sequence multialignment of four plant β-glucosidase sequences out of 43 selected family 1 O-glycoside hydrolases shown in (a). TRS1 means the TRS sequence, in which the five-amino-acid string that produced a gap of five positions in the other amino acid sequences has been computationally deleted and the frameshifting stretch has been computationally substituted by the new hypothetical ancestral stretch (see Section 3.3). Gaps are denoted by a dash. Numbers on the right are the residue positions from the N-terminus in each sequence. Top-row numbers correspond to the protein multialignment position. The five-amino-acid string of the TRS that produced the gap is depicted in bold. The hypothetical ancestral stretch of the TRS1 is depicted in bold.
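The internal string duplication reported for the TRS gene was found by inspection of the alignment. As a hedged illustration of how such a duplication could be flagged automatically, the sketch below looks for a k-mer that occurs again further downstream. The function name and the toy sequence are hypothetical; the 14-nucleotide unit used here is invented and is not the TRS sequence.

```python
def find_direct_repeats(seq, k):
    """Return (first_start, second_start, kmer) for k-mers seen again downstream."""
    seen, hits = {}, []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in seen:
            hits.append((seen[kmer], i, kmer))
        else:
            seen[kmer] = i
    return hits


# Hypothetical toy sequence: an invented 14-nt unit duplicated in tandem.
unit = "ATGCCGTTAGGCTA"
toy_gene = "GG" + unit + unit + "CCA"

print(find_direct_repeats(toy_gene, 14))   # [(2, 16, 'ATGCCGTTAGGCTA')]

# A 14-nt duplication alone shifts the frame; one more inserted base
# brings the total to a multiple of three, as in the text (14 + 1 = 15).
print(14 % 3, (14 + 1) % 3)                # 2 0
```

An exact-match search of this kind only finds duplications that have not diverged since the event, so it should be read as a first screen rather than a replacement for manual inspection.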

3.4. Family 3 Clostridium stercorarium and Clostridium thermocellum β-glucosidases (CST, TrEMBL: O08331; CTH, SWISS-PROT: P14002)

The gene alignment indicated two unchanged guanine (GG) doublets near the 3′ end (Fig. 5a). We therefore expected to find two unchanged glycine (G) residues in the corresponding positions of the protein alignment. This only happened with the first of the guanine doublets. At the second guanine doublet we found a glycine that was conserved but not unchanged. The CST and CTH were the only sequences that broke the glycine consensus. This suggests that there was a frameshift event in the CST and CTH genes. Analysis of the DNA alignment around the regions of the invariant guanine doublets showed that, a few positions upstream from the first doublet, a single gap position appeared in the CST and CTH genes (for both genes, after position 1431). Analogously, a few positions downstream from the last guanine doublet, a two-position gap appeared in these genes (after position 1473). This is a double FSM based on the total deletion of three nucleotides (1 + 2). As we can see in the protein alignment (Fig. 5b), the amino acid residues of the frameshifting stretches were unlike those of the other sequences. The computational insertion of a nucleotide (N) in each gap position of the CST and CTH genes means that, at protein level, the local similarity of the pattern G-x(6)-L-x-G-x(3)-P was recovered in the frameshifting region (Fig. 5c). From the DNA alignment we can speculate about which codons were affected by the nucleotide deletions. The first computational N-nucleotide insertion affected the third codon position of a GCN alanine codon. In the downstream NN computational nucleotide doublet insertion, the first N-nucleotide corresponds to the third codon position of CCN (proline) and CGN (arginine) codons in the CST and CTH genes, respectively. The second N-nucleotide affects the first position of NCA and NTC codons in the CST and CTH genes, respectively, and the coded amino acid remains undetermined in both sequences. On the basis of the known tridimensional structure of an O-glycoside hydrolase family 3 (Varghese et al., 1999; Harvey et al., 2000), these Clostridium double FSM affect the α-helix and its following loop of the (β/α)6-sandwich domain.

Fig. 5. (a) Fragment of the DNA nucleotide sequence multialignment of 5 out of 18 selected family 3 O-glycoside hydrolases around the frameshift region of the CST and CTH genes (see Section 3.4). Gaps are denoted by a dash. The numbers on the right are the residue positions from the 5′ end in each sequence. Top-row numbers correspond to the DNA multialignment position. In the frameshifting region, the two unchanged GG doublets (see text) are depicted in bold. ATU, Swiss-Prot P27034; MBI, TrEMBL Q59506; ECO, Swiss-Prot P33363. (b) Fragment of the amino acid sequence multialignment of 5 out of 18 selected family 3 O-glycoside hydrolases around the frameshift region of CST and CTH, corresponding to the protein region encoded by the DNA region shown in (a). Gaps are denoted by a dash. Numbers on the right are the residue positions from the N-terminus in each sequence. Top-row numbers correspond to the protein multialignment position. In the frameshifting region, the unchanged glycine and the conserved leucine, glycine and proline are denoted in bold. (c) Fragment of the amino acid sequence multialignment of 5 out of 18 selected family 3 O-glycoside hydrolases around the frameshift region of CST and CTH, corresponding to the protein region encoded by the DNA region shown in (a). CST1 and CTH1 mean the CST and CTH sequences, respectively, in which the frameshifting stretch has been computationally substituted by the new hypothetical ancestral stretch (see text). Gaps are denoted by a dash. Numbers on the right are the residue positions from the N-terminus in each sequence. Top-row numbers correspond to the protein multialignment position. In the frameshifting region, the unchanged glycine and leucine and the unspecific amino acid residues are denoted in red. The hypothetical ancestral stretches of the CST1 and CTH1 are depicted in bold.
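The consensus patterns quoted in the Results are written in a PROSITE-like notation. As a rough sketch, such a pattern can be converted to a regular expression to check whether a candidate stretch satisfies it. The converter below is an assumption made for illustration and handles only the elements that appear in this paper (single residues, x wildcards, bracketed sets and fixed repeat counts); the test strings other than HRLDTY are hypothetical.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regex string.

    Supports residues, 'x' wildcards, [..] residue sets and (n) repeats only.
    """
    parts = []
    for element in pattern.replace(" ", "").split("-"):
        element = element.replace("x", ".")               # any residue
        element = re.sub(r"\((\d+)\)", r"{\1}", element)  # x(6) -> .{6}
        parts.append(element.replace(",", ""))            # [I,V] -> [IV]
    return "".join(parts)


# Family 1 pattern from Section 3.1 and family 3 pattern from Section 3.4.
erh_pattern = prosite_to_regex("[I,V,L,F]-[A,S,E,G,P]-W-[S,P,A,T]-R-[I,L,V]")
clostridium_pattern = prosite_to_regex("G-x(6)-L-x-G-x(3)-P")

print(erh_pattern)            # [IVLF][ASEGP]W[SPAT]R[ILV]
print(clostridium_pattern)    # G.{6}L.G.{3}P

# HRLDTY is the frameshifted ERH stretch; LAWTRI and LAWTRM are the two
# possible readings of the restored stretch LAWTRX (X from an ATN codon).
for stretch in ("HRLDTY", "LAWTRI", "LAWTRM"):
    print(stretch, bool(re.fullmatch(erh_pattern, stretch)))
```

Under this pattern only the isoleucine reading of the undetermined residue matches, which is consistent with the remark in Section 3.1 that three of the four ATN codons encode isoleucine.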

4. Discussion

A compensated FSM event should be seen as a consequence of insertions and deletions that affect genes (Li, 1997; Hartl and Jones, 2000). Molecular events have precise effects on the structure of the DNA text, whereas a frameshift compensation modifies the meaning of a text in a way that can only be achieved by various molecular events (mechanisms). Frameshifts (compensated or not), at the time they occur, are direct consequences of insertions/deletions. They are not independent phenomena but simply two aspects of one phenomenon. Insertions or deletions within a coding sequence cause the reading frame to be lost, which has adverse effects on gene function. Consequently, secondary mutations that restore the function are expected to be selected. The event leading to the frameshifting strings in the core of a protein must have happened either at a late stage in the evolution of a given family, so that it affects only a limited number of sequences, or at an early stage, so that it affects more than a single sequence. The latter is probably the case for the double frameshift mutation event in β-glucosidase genes from two different Clostridium species, which suggests that the FSM episode happened before C. thermocellum and C. stercorarium speciation. The secondary structure prediction of the amino acid stretch delimited by these Clostridium double FSM matched the secondary structure elements predicted for the other sequences. If gene sequences were available, therefore, it would be extremely interesting to analyse more family 1 β-glucosidases from Clostridium species. Analogously, if it were possible to analyse Cellulomonas family 1 β-glucosidase sequences other than CFI, this would provide not only important information about evolution but also some valuable structural insights. The presence of (β/α)8-barrel structures in the proteins analysed here is worthy of mention (Barrett et al., 1995; Garcia-Vallve et al., 1998; Harvey et al., 2000). Topological analysis shows that the frameshift region of the E. herbicola β-glucosidase is located in the loop after the barrel β2-strand (Fig. 1a). The important catalytic role of these structural elements is well known (Arrizubieta and Polaina, 2000). This agrees with the strong conservation of the residues involved. Accordingly, the secondary structure predictions of the natural ERH amino acid sequence and of the sequence in which the frameshifting stretch was computationally substituted by the new hypothetical string showed the same propensities for the secondary structure of the residues (results not shown).
This may explain the viability of the ERH gene mutation, since the product maintains the β2-strand of the barrel, which is an element of the canonical structure of family 1 proteins (Barrett et al., 1995). However, it is surprising to see how an FSM event in the C. fimi β-glucosidase CFI produces a putative (β/α)6-barrel scaffold with a basic domain that protrudes between specific β-strands, instead of the (β/α)8-barrel structure. The dramatic effect of the FSM on the CFI protein structure occurs without affecting the catalytic residues. CFI is also the first known example of a family 1 O-glycoside hydrolase with a putative three-dimensional structure that differs from the canonical (β/α)8-barrel structure. These results mean that we can envisage FSM as a novel tool for studying the biological activity, structural features, stability, folding and dynamics of proteins because, through directed mutagenesis, it should be possible to engineer compensated FSM events. Our results also indicate the great value of multiple alignment of the DNA coding region in the phylogeny of proteins (Williams and Fitch, 1990), in particular when there are long frameshifting stretches such as in CFI. In general, because most amino acids have multiple codons, the nucleotide sequence of the coding region provides more phylogenetic information than the amino acid sequence (Grantham et al., 1980). As a rule in evolutionary biology, if gene information is available, use it.
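The point about codon redundancy can be made concrete with a short sketch. The Python below is an illustration added here, not part of the original analysis. It assumes the standard genetic code (NCBI translation table 1); the names CODON_TABLE, translate and degeneracy, and the two toy sequences, are hypothetical. It counts how many codons encode each amino acid and shows two different nucleotide strings that translate to the same peptide, a difference that is visible only at the DNA level.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# Number of codons per amino acid ('*' stands for a stop codon).
degeneracy = Counter(CODON_TABLE.values())

# Two made-up coding fragments (assumption: toy data, not from any real gene).
seq_a = "CTGTCTCGT"
seq_b = "TTAAGCAGA"
nucleotide_differences = sum(a != b for a, b in zip(seq_a, seq_b))

print(degeneracy.most_common())
print(translate(seq_a), translate(seq_b), nucleotide_differences)
```

The two fragments differ at several nucleotide positions yet yield the same peptide, so a protein alignment would score them as identical while a DNA alignment would still record the divergence.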

According to the known three-dimensional structure of family 1 O-glycoside hydrolases, the secondary structure prediction of the T. repens β-glucosidase showed that the amino acid subsequence generated by the double frameshift also strictly matched the α-helix-loop-β7-strand module of the barrel (results not shown). This suggests that the compensatory frameshift mutation event maintained the structural constraints imposed by the (β/α)8-barrel structure of the protein. TRS highlights how an internal gene fragment duplication produces a double FSM event. In the gene elongation process, nature does not ensure that the duplicated string is a multiple of three nucleotides long. Any internal insertion of a string that is not a multiple of three nucleotides is unlikely to persist in evolution unless a new downstream deletion/insertion mutation enables the open reading frame to be recovered. The evolutionary history of the TRS gene includes another FSM, which we have described previously. This enzyme lacks a fragment with a high level of consensus in the C-terminal part of the multiple alignment as a result of a nonsense mutation (Rojas and Romeu, 1996). On the other hand, in T. repens there are at least two paralogous family 1 O-glycoside hydrolase genes. These encode the cyanogenic and the non-cyanogenic β-glucosidases. We can therefore speculate that, because β-glucosidase activity is distributed between paralogs in T. repens, the TRS gene could be more sensitive to a major evolutionary pressure, and the organism may have had to endure a transitory period of gene inactivity. The computational recovery of the similarity of a frameshifting stretch highlights the paradox that, even if the frameshifting protein is relatively abnormal, the evolutionary pressure on the gene region delimited by the frameshift mutation points is the same as that on the corresponding region in the homologous genes.
This raises the following question: are the known examples of FSM merely the tip of the iceberg? If so, naturally frameshifting proteins should have very diverse biological functions. The rationale behind all these schemes is that the management of sequence data is the road towards a clear description of the probably widespread and hidden FSM events, which only become apparent when homologous sequences are compared. At present, genome sequencing provides an enormous number of protein sequences. This provides an excellent model for systematically studying the role of compensated FSM. Such a study would include searching, aligning and determining the significance of similarities. In this respect, we expect that new examples of naturally compensated FSM events will be found in the future. It is debatable whether the analysis reported takes into account the most obvious pitfall in this field, namely sequencing errors. In fact, there is a significant amount of error in DNA and protein sequence databanks (Roberts, 1991). Most errors do not prevent global detection of sequence similarities (States and Botstein, 1991) but they can lead to predictions of erroneous compensated FSM events. In this respect, re-sequencing the relevant DNA sequences, such as that encoding the CFI β-glucosidase, where insertions and deletions disrupt the canonical secondary structure, will decide the issue. However, it is highly unlikely that the TRS compensated FSM, in which the frameshift induced by a 14-nucleotide duplication is restored by a single-nucleotide insertion several nucleotides downstream, just before the catalytic residue, is a sequencing error. Likewise, the case for the Clostridium species frameshift events is strong, because they are found in two independently determined sequences from two species of the same genus. They also involve a much shorter stretch of amino acids. In general, therefore, the evolutionary conservation of frameshift events may be an alternative to re-sequencing for supporting computational analysis.
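The reading-frame arithmetic behind the TRS case (a 14-nucleotide duplication compensated by a single-nucleotide insertion downstream) can be sketched in a few lines of Python. This is an illustration only: the reference sequence is a made-up toy string, not the TRS gene, and the helper names (translate, insert_at) and the insertion positions are assumptions chosen for the example. The duplication alone shifts the frame; the extra nucleotide restores it because 14 + 1 = 15 is a multiple of three.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate in frame 0, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def insert_at(dna, pos, fragment):
    """Return dna with fragment inserted before index pos."""
    return dna[:pos] + fragment + dna[pos:]


# Toy reference coding sequence (hypothetical, 14 codons including the stop).
REF = "ATG GCT AAA GGT CTG GAA CCG TTC GAT CAC TGG AAA GTT TAA".replace(" ", "")

# Step 1: tandem duplication of a 14-nucleotide fragment (frame is shifted).
dup = REF[6:20]
single = insert_at(REF, 20, dup)

# Step 2: one extra nucleotide four positions downstream (frame is restored).
double = insert_at(single, 38, "A")

downstream = translate(REF[24:])  # reference residues after the second event
print(len(single) % 3, len(double) % 3)
print(translate(REF))
print(translate(single))
print(translate(double))
print(translate(double).endswith(downstream))
```

In this toy case the residues downstream of the second insertion are read in the original frame again, whereas the stretch between the two events is translated in a shifted frame. That short, poorly conserved protein stretch flanked by well-conserved regions is the signature the sequence analysis looks for.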

Acknowledgements

We thank Kevin Costello (of the Language Service of our university) for his help in writing the manuscript.

References

Arrizubieta, M.J., Polaina, J., 2000. Increased thermal resistance and modification of the catalytic properties of a beta-glucosidase by random mutagenesis and in vitro recombination. J. Biol. Chem. 275, 28843–28848.
Barrett, T., Suresh, C.G., Tolley, S.P., Dodson, E.J., Hughes, M.A., 1995. The crystal structure of a cyanogenic beta-glucosidase from white clover, a family 1 glycosyl hydrolase. Structure 3, 951–960.
Coutinho, P.M., Henrissat, B., 1999a. Carbohydrate-Active Enzymes server at URL: http://afmb.cnrs-mrs.fr/~cazy/CAZY/index.html.
Coutinho, P.M., Henrissat, B., 1999b. Carbohydrate-active enzymes: an integrated database approach. In: Gilbert, H.J., Davies, G., Henrissat, B., Svensson, B. (Eds.), Recent Advances in Carbohydrate Bioengineering. The Royal Society of Chemistry, Cambridge, pp. 3–12.
Coutinho, P.M., Henrissat, B., 1999c. The modular structure of cellulases and other carbohydrate-active enzymes: an integrated database approach. In: Ohmiya, K., Hayashi, K., Sakka, K., Kobayashi, Y., Karita, S., Kimura, T. (Eds.), Genetics, Biochemistry and Ecology of Cellulose Degradation. Uni Publishers, Tokyo, pp. 15–23.
Doolittle, R.F., 1990. Searching through sequence database. In: Doolittle, R.F. (Ed.), Molecular evolution: computer analysis of protein and nucleic acid sequences. Methods in Enzymology, vol. 183. Academic Press, New York, pp. 99–110.
Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C.J., Hofmann, K., Bairoch, A., 2002. The PROSITE database, its status in 2002. Nucleic Acids Res. 30, 235–238.
Freer, S.N., 1993. Kinetic characterization of a beta-glucosidase from a yeast, Candida wickerhamii. J. Biol. Chem. 268, 9337–9342.
Garcia-Vallve, S., Rojas, A., Palau, J., Romeu, A., 1998. Circular permutants in beta-glucosidases (family 3) within a predicted double-domain topology that includes a (beta/alpha)8-barrel. Proteins 31, 214–223.
Gilkes, N.R., Claeyssens, M., Aebersold, R., Henrissat, B., Meinke, A., Morrison, H.D., Kilburn, D.G., Warren, R.A., Miller Jr., R.C., 1991. Structural and functional relationships in two families of beta-1,4-glycanases. Eur. J. Biochem. 202, 367–377.
Grantham, R., Gautier, C., Gouy, M., Mercier, R., Pavé, A., 1980. Codon usage and the genome hypothesis. Nucleic Acids Res. 8, r49–r62.
Hartl, D.L., Jones, E.W., 2000. Genetics. Analysis of Genes and Genomes, 5th ed. Jones and Bartlett Publishers, Boston.
Harvey, A.J., Hrmova, M., De Gori, R., Varghese, J.N., Fincher, G.B., 2000. Comparative modelling of the three-dimensional structures of family 3 glycoside hydrolases. Proteins 41, 257–269.
Henrissat, B., 1993. Hidden domains and active site residues in beta-glycanase-encoding gene sequences? Gene 125, 199–204.
Henrissat, B., Callebaut, I., Fabrega, S., Lehn, P., Mornon, J.P., Davies, G., 1995. Conserved catalytic machinery and the prediction of a common fold for several families of glycosyl hydrolases. Proc. Natl. Acad. Sci. U. S. A. 92, 7090–7094.
Kimura, M., 1983. The Neutral Theory of Molecular Evolution. Cambridge Univ. Press, Cambridge.
Li, W.H., 1997. Molecular Evolution. Sinauer Associates, Publishers, Sunderland, Massachusetts.
Roberts, L., 1991. Finding DNA sequencing errors. Science 252, 1255–1256.
Rojas, A., Romeu, A., 1996. 3′ Flanking region of a family 1 beta-glucosidase. Biochem. J. 320, 693–694.
Schneider, R., Sander, C., 1996. The HSSP database of protein structure sequence alignments. Nucleic Acids Res. 24, 201–205.
States, D.J., Botstein, D., 1991. Molecular sequence accuracy and the analysis of protein coding regions. Proc. Natl. Acad. Sci. U. S. A. 88, 5518–5522.
Thompson, J.D., Higgins, D.G., Gibson, T.J., 1994. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673–4680.
Varghese, J.N., Hrmova, M., Fincher, G.B., 1999. Three-dimensional structure of a barley beta-D-glucan exohydrolase, a family 3 glycosyl hydrolase. Structure 7, 179–190.
Williams, P.L., Fitch, W.M., 1990. Phylogeny determination using dynamically weighted parsimony method. In: Doolittle, R.F. (Ed.), Molecular evolution: computer analysis of protein and nucleic acid sequences. Methods in Enzymology, vol. 183. Academic Press, New York, pp. 615–645.