Combining structural and bioinformatics methods for the analysis of functionally important residues in DNA glycosylases


Free Radical Biology & Medicine, Vol. 32, No. 12, pp. 1254–1263, 2002
Copyright © 2002 Elsevier Science Inc. Printed in the USA. All rights reserved
0891-5849/02/$–see front matter

PII S0891-5849(02)00828-6

Serial Review: Oxidative DNA Damage and Repair
Guest Editor: Miral Dizdaroglu

COMBINING STRUCTURAL AND BIOINFORMATICS METHODS FOR THE ANALYSIS OF FUNCTIONALLY IMPORTANT RESIDUES IN DNA GLYCOSYLASES

DMITRY O. ZHARKOV*† and ARTHUR P. GROLLMAN*

*Laboratory of Chemical Biology, Department of Pharmacological Sciences, State University of New York at Stony Brook, Stony Brook, NY, USA; and †Novosibirsk Institute of Bioorganic Chemistry, Siberian Division of Russian Academy of Sciences, Novosibirsk, Russia (Received 3 January 2002; Revised 19 February 2002; Accepted 1 March 2002)

Abstract: An essential function of DNA glycosylases is the recognition and excision of damaged bases in DNA, thereby preserving genomic integrity. Lesion recognition is a multistep process, which is only partially revealed by structural analysis of the catalytically competent complex. The functional role of additional residues can be predicted by combining structural data with analysis of amino acid conservation. The following postulate underlies this approach: if a family or superfamily can be broken into subgroups with different substrate specificities, residues highly conserved between these subgroups represent those important for enzyme catalysis and structure maintenance while residues highly conserved within a subgroup but not between subgroups represent residues important for substrate specificity. We review the bioinformatics approach used for this quantitative analysis and describe its application to the Nth superfamily and Fpg family of DNA glycosylases. These results serve as a starting point in planning site-directed mutagenesis experiments to elucidate the functional role of similar and dissimilar residues in DNA repair and other proteins. © 2002 Elsevier Science Inc.

Keywords: DNA damage, DNA glycosylase, Sequence alignment, Substrate specificity

INTRODUCTION

DNA repair capabilities are critical for prokaryotic and eukaryotic cells. At least six different repair processes act to prevent cytotoxic or mutagenic effects of DNA damage [1]. Base excision repair (BER) involves multiple enzymatic species. During the first stage of repair, damaged bases are recognized by DNA glycosylases. These enzymes vary in their specificity; for example, E. coli formamidopyrimidine-DNA glycosylase (Fpg) recognizes several oxidatively damaged purines; endonuclease III (Nth), a variety of oxidatively damaged pyrimidines; 3-methylpurine-DNA glycosylase, ring-alkylated purines and exocyclic DNA adducts; while uracil-DNA glycosylase almost exclusively recognizes uracil. DNA glycosylases excise damaged bases, producing AP sites; some members of this class, known as DNA glycosylases/AP lyases, nick the nascent AP site by β- or β,δ-elimination. Following base excision, components of BER are common for all lesions. These include AP endonucleases that nick DNA 5′ to AP sites (including those modified by AP lyases), deoxyribophosphodiesterases that remove the sugar from the resulting 5′-end, DNA polymerases that perform repair synthesis, and DNA ligases that seal the nick in DNA.

The structures of several DNA glycosylases have been solved by X-ray crystallography [2–12], in some cases complexed with DNA [8,10,12–17]. Although functionally important residues can be identified from these structures, certain biochemical and site-directed mutagenesis-derived data cannot be interpreted by structural analysis alone. For example, biochemical data on MutY suggest that Lys-142 may be involved both in recognition of the base positioned opposite the adenine targeted for excision and in formation of a Schiff base [18–21]. The structure of the MutY catalytic core [9] provides few clues regarding these interactions. In this paper, we argue that analysis of protein conservation, coupled with structural information, provides a valuable tool for predicting functional roles for critical amino acid residues and designing site-directed mutants of DNA glycosylases. We present examples of such analyses to demonstrate the validity of this approach.

This article is part of a series of reviews on “Oxidative DNA Damage and Repair.” The full list of papers may be found on the homepage of the journal. Address correspondence to: Dr. Arthur P. Grollman, Department of Pharmacological Sciences, Laboratory of Chemical Biology, State University of New York at Stony Brook, Stony Brook NY 11794, USA; Tel: (631) 444-3080; E-Mail: [email protected].

CLASSIFICATION OF DNA GLYCOSYLASES

A primary question in DNA repair concerns the mechanism by which proteins engaged in repair “recognize” specific classes of damage. Structures of DNA glycosylases complexed with oligodeoxynucleotides containing specific lesions reveal that the damaged nucleotide generally is everted from the DNA helix with the base inserted into a pocket [12,16,22], a notable exception being bacteriophage T4 endonuclease V (DenV), which excises cyclobutane pyrimidine dimers but does not extrude the dimer nor interact with it directly [13]. Interactions with the protein tend to be damage specific. Such interactions generally are defined in terms of specific hydrogen bonds or van der Waals contacts. It is clear, however, that the structure of an enzyme-substrate complex poised for catalysis (henceforth termed “catalytically competent complex”) may not account fully for the observed specificity of DNA glycosylase action. For example, human 8-oxoguanine(8-oxoG)-DNA glycosylase (hOgg1) bound to 8-oxoG-containing DNA forms a single hydrogen bond to N7 of the modified base [12], which is unlikely to explain the >10⁴-fold increase in the rate of excision of 8-oxoG compared with G [23]. Lesion recognition is a multistep pathway that traverses several enzyme-DNA complexes before arriving at the catalytically competent complex. Amino acid residues important in the formation of such intermediate complexes may not be immediately evident by examining the structure, even when the enzyme is bound to DNA. In addition, no crystal structure exists for the complex of certain DNA glycosylases with their substrate DNA.

DNA glycosylases may be divided into several families based on their primary sequences [24]. For this analysis, which is concerned with functional rather than evolutionary relationships, we define a “family” based exclusively on sequence homology and not phylogeny. It consists, therefore, of both orthologous (connected by vertical descent) and paralogous (evolved by gene duplication within one genome) species [25]. The largest family of DNA glycosylases is the endonuclease III (Nth) family, including the prototype endonuclease III and the mismatch adenine DNA glycosylase MutY [26]. Several conserved structural motifs are common to these enzymes and to certain other DNA glycosylases, including eukaryotic Ogg1, Nth1, and MAG proteins, E. coli AlkA, M. luteus pyrimidine dimer glycosylase (Pdg), and M. thermoformicicum T:G mismatch glycosylase (Tdg), thereby defining the Nth superfamily [26,27]. Another well-defined group of DNA glycosylases is the Fpg family, comprising formamidopyrimidine-DNA glycosylase (Fpg) and endonuclease VIII (Nei). Enzymes with uracil-DNA glycosylase activity belong to two unrelated families, prototypes of which are Ung and Mug proteins in E. coli.

The existence of enzymes with different specificities within the same group (family or superfamily) of DNA glycosylases provides a unique way to search for protein residues important for catalysis and specificity. The following postulate guides the analysis: if a group may be broken into two subgroups with different substrate specificities, then: (i) residues highly conserved between these subgroups are those important for general catalytic mechanism and structure maintenance, and (ii) residues highly conserved within but not between subgroups (“dissimilar” residues) are those important for substrate specificity or, more generally, for a subgroup-specific function (see example of Nth-MutY below).

PRINCIPLES OF CONSERVATION ANALYSIS

Quantifying similarity

To implement the above postulate, several conditions must be fulfilled. First, a quantitative tool capable of accounting for similarities in the physico-chemical properties of amino acid residues is required to analyze conservation within a group of protein sequences. Fortunately, such tools already exist. The AMAS (Analysis of Multiply Aligned Sequences) algorithm [28] permits comparison of physico-chemical properties of amino acids in a given position in a multiple alignment, based on a predefined set of properties (for an example of such a set, see [29]). This set is represented as a Venn diagram, and the number of borders crossed to visit all residues in a given position is counted. A “conservation number,” or Cn, which is a function of the number of crossed borders, is assigned to each position of the alignment [28,30]. Analysis of conservation within and between subgroups of the complete alignment is achieved by calculating conservation numbers for these subgroups. AMAS can be obtained from the World Wide Web (barton.ebi.ac.uk/servers/amas_server.html, as accessed on Nov. 14, 2001). Other algorithms also exist, some applying stricter probabilistic treatment to the sequence alignments [31], but to our knowledge the corresponding software is not publicly available.
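As a rough illustration of the border-counting idea, the following Python sketch scores one alignment column. It is not the AMAS implementation: the property table is a simplified, hypothetical subset written for this example, and both the function name `conservation_number` and the choice of Cn as ten minus the number of properties on which the residues disagree are assumptions about how such a score could be formed.

```python
# Ten physico-chemical properties, in the spirit of a Taylor-style Venn diagram.
PROPERTIES = (
    "hydrophobic", "polar", "small", "tiny", "aliphatic",
    "aromatic", "positive", "negative", "charged", "proline",
)

# Hypothetical, simplified property assignments for a few residues only
# (an assumption for illustration; not the table distributed with AMAS).
RESIDUE_PROPERTIES = {
    "I": {"hydrophobic", "aliphatic"},
    "L": {"hydrophobic", "aliphatic"},
    "V": {"hydrophobic", "aliphatic", "small"},
    "D": {"polar", "small", "negative", "charged"},
    "E": {"polar", "negative", "charged"},
    "K": {"polar", "positive", "charged"},
}


def conservation_number(column):
    """Return an assumed Cn for one alignment column (string of residues).

    Each property on which the residues disagree counts as one crossed
    border; Cn is the number of properties minus the crossed borders.
    Gap characters ('-') are ignored.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0
    property_sets = [RESIDUE_PROPERTIES[r] for r in residues]
    crossed = sum(
        1 for prop in PROPERTIES
        if len({prop in props for props in property_sets}) > 1
    )
    return len(PROPERTIES) - crossed


if __name__ == "__main__":
    for col in ("IIII", "IV", "IK"):
        print(col, conservation_number(col))
```

Under these assumed assignments, a column containing only Ile and Val differs in a single property (“small”), so the sketch scores it one below the maximum.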

Sequence classification and qualification

Functional classification and its qualification are a major consideration when performing conservation analysis with functional assignment in mind. To sort enzymes of a family into groups, one needs to know their substrate specificity. In practice, however, such data are usually available for only a handful of members of the family. For example, based on genomes sequenced to date, at least 40 Fpg homologs are known, of which only three outside the E. coli species have been characterized biochemically in any detail [32–34]. The easiest way to circumvent this problem is to start with functionally defined paralogs (e.g., Fpg and Nei of E. coli in the case of the Fpg family) and to split the family into nonoverlapping groups homologous to each of these. Examples of this approach are discussed below. Qualification of the sequences must be performed on the basis of elements known to be critical for the enzyme’s function. For example, in E. coli Fpg, a C-terminal zinc finger is absolutely required for DNA binding; therefore, sequences lacking the zinc finger may be excluded from the alignment of the Fpg family as they differ in a critical aspect of Fpg activity.

Sequence selection

The choice of sequences to be analyzed is critical. One might assume that including more sequences in the alignment would provide more predictive power. However, the current distribution of available sequences is heavily skewed in favor of certain phyla of eubacteria, which might lead to an artificial increase in conservation numbers if all sequences available in databases are included in the analysis. Weighting of sequences based on a maximum-likelihood evolutionary model may reduce this bias [31]. In a less strict but simpler approach, a subset of sequences evenly distributed through major taxonomic branches may be chosen to better reflect the functional significance of amino acid similarity. The National Center for Biotechnology Information’s Cluster of Orthologous Groups (COG) tool [35] provides an excellent initial sequence source for such an analysis. COGs were initially defined in the analysis of seven complete genomes as groups of orthologous proteins contained in at least three of five major clades (Gram-negative bacteria, Gram-positive bacteria, Cyanobacteria, Archaea, and Eukarya) [36]. COGs currently pool data from 44 complete genomes, including two eukaryotes. Importantly, these genomes represent 30 different phylogenetic lineages, thereby reducing the bias associated with the wider availability of sequences from certain clades.

Similarity and dissimilarity

In principle, physico-chemical properties at any given position may be identified at the same time as similar and dissimilar. Consider the following situation: in subgroup A, a certain position is always occupied by Ile, and in subgroup B, by Val. If the standard Taylor set is used, both Ile and Val have “hydrophobic” and “aliphatic” properties, producing Cn = 9; nevertheless, they may be identified as dissimilar due to absolute conservation within the subgroups (Val, but not Ile, has the property “small”). Such dissimilarity may be functionally important: in this example, Val may accommodate an extra methyl group at the binding interface. Those interested in dissimilarity rather than similarity can follow a simple hierarchical set of rules to avoid such ambiguities by using AMAS. AMAS produces a summary file that sorts different positions into the following categories: identity between all subgroups, identity between subgroup pairs (not applicable when comparing two groups), identity within one subgroup, conservation between all subgroups, conservation between subgroup pairs (not applicable when comparing two groups), difference between subgroup pairs, conserved within one subgroup, and unconserved within one subgroup. The rules to follow, in order of decreasing priority, are:

1. Residues identified as “identity between all subgroups” are treated as similar in both subgroups.
2. Residues identified as “difference between subgroup pairs” are treated as dissimilar in both subgroups.
3. Residues identified as “conservation between all subgroups” are treated as similar (if the Cn value exceeds a certain threshold) in both subgroups.
4. Residues identified as “identity within one subgroup” are treated as dissimilar in this subgroup.
5. Residues identified as “conserved within one subgroup” are treated as dissimilar (if the Cn value exceeds a certain threshold) in this subgroup.
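The hierarchical rules for labeling a position as similar or dissimilar can be sketched in Python as a first-match lookup. This is a minimal illustration rather than part of AMAS: the function name `classify_position`, the dictionary input format, and the inclusive comparison (Cn at or above the threshold passes) are all assumptions made for the example; only the category names and their priority order come from the text.

```python
# (AMAS summary category, resulting label, whether a Cn threshold applies),
# listed in order of decreasing priority as in the rules above.
RULES = (
    ("identity between all subgroups", "similar", False),
    ("difference between subgroup pairs", "dissimilar", False),
    ("conservation between all subgroups", "similar", True),
    ("identity within one subgroup", "dissimilar", False),
    ("conserved within one subgroup", "dissimilar", True),
)


def classify_position(categories, threshold=9):
    """Label one alignment position for one subgroup.

    `categories` maps each category assigned to the position to the Cn
    reported for it (or None when no Cn applies). The first rule that
    matches decides the label; None means the position is left unlabeled.
    Treating Cn >= threshold as passing is an assumption of this sketch.
    """
    for name, label, needs_threshold in RULES:
        if name not in categories:
            continue
        if needs_threshold:
            cn = categories[name]
            if cn is None or cn < threshold:
                continue
        return label
    return None


if __name__ == "__main__":
    # Weakly conserved between subgroups, but identical within this one.
    position = {
        "conservation between all subgroups": 7,
        "identity within one subgroup": None,
    }
    print(classify_position(position))
```

Because the rules are tried in priority order, a position that fails the threshold of a higher rule simply falls through to the next applicable one.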

While the similarity relationship is symmetrical between two subgroups (similar residues are identified as such in both groups), dissimilarity is not necessarily symmetrical. In the example with Ile/Val, dissimilarity would be symmetrical. Suppose, however, that in subgroup B this position is not conserved (and therefore presumably not functionally important). This position should be considered dissimilar for subgroup A, but not for subgroup B. Another point to consider during conservation analysis is the method of assigning loops present in one subgroup but absent in the other. We suggest that such loops should be marked as dissimilar in structures where they are present, and their position marked in the other group by designating as dissimilar the residues flanking the gap. In the examples discussed below, such dissimilar loops generally are present outside of functionally important regions.

Structural mapping

After highly conserved and dissimilar residues have been identified, the next step is to map them on known structures of proteins under comparison. Even in the absence of DNA, structures of DNA glycosylases provide important clues regarding localization of the catalytic center and DNA-binding sites. It is expected that functionally important residues will be found in these regions. Residues outside of the functional regions identified as dissimilar may result from noise generated by small sample size or may reflect unexpected functions, for example, in protein-protein interactions. Statistical methods can be used to estimate the significance of detected dissimilarity [31,37] but a description of their application is beyond the scope of this paper. Based on the position of dissimilar residues and their interactions with other elements of the structure, hypotheses can be generated regarding their functions, providing a basis for biochemical experiments and site-directed mutagenesis.

Having outlined the principles of conservation analysis, we will now describe their application to the analysis of several DNA glycosylases for which the crystal structure and biochemical characteristics are known. The primary data on which these analyses are based can be accessed at www.pharm.sunysb.edu/lcb/FRBM2002.

Nth and MutY

Comparative analysis of Nth and MutY represents a nearly ideal case where the groups being compared contain many sequences, have well-distinguished substrate specificities but similar reaction chemistry, and are highly homologous. Nth catalyzes removal of a variety of oxidized pyrimidines and employs Schiff base chemistry for nucleophilic attack by Lys-120 at C1′ of the damaged nucleotide. MutY excises adenine paired with G, 8-oxoguanine (8-oxoG) or C through attack by an activated water molecule [38].
MutY consists of two domains readily separated by proteolysis [39]. The N-terminal p25 domain, which retains catalytic activity but has reduced specificity towards 8-oxoG-containing substrates [39,40], is homologous to Nth. Both Nth and MutY contain an iron-sulfur cluster [26,41,42] that is absolutely required for activity although not directly involved in catalysis. The structures of Nth [3,26] and p25 of MutY [9] show close similarity in the overall folding but structural information on their interactions with DNA is not available.


Nth COG (COG0177) comprises 47 bacterial proteins from 25 phylogenetic groups and MutY COG (COG1194) consists of 28 proteins from 17 groups. Following qualification analysis for the presence of four cysteines in the iron-sulfur cluster, 41 proteins of 24 phylogenetic groups of the Nth COG and 26 proteins of 16 groups of the MutY COG were included in the analysis. Structures of E. coli Nth ([26], 2ABK) and the p25 domain of E. coli MutY ([9], 1MUY) were used for mapping. Here and, unless stated otherwise, in all of the following examples, alignments were produced by ClustalW [43] with manual checking and correction. The standard Taylor set of amino acid properties was used for analysis by AMAS, cysteines were considered reduced, and 10% minimum residue occupancy was allowed. Amino acids were identified as similar or dissimilar following the hierarchical set of rules described above with a Cn threshold of 9 for comparisons within and between subgroups. Loops present in one subgroup but missing in the other were considered dissimilar regardless of the calculated Cn; when mapping missing loops on a structure, flanking residues were labeled dissimilar.

In Nth, 12% of the 211 amino acid residues were identified as similar and 9% as dissimilar; in MutY, the comparable values were 12 and 15%, respectively. Most similar and dissimilar residues face the DNA-binding cleft (Fig. 1). The only region of similarity fully buried inside the protein globule is composed of hydrophobic residues in α-helices 2, 4 and 5 in the six-helix barrel domain, which probably is crucial for interhelical packing. Other similar residues may be divided into four patches. The largest of these is an iron-sulfur cluster and the surrounding residues (S1 on Fig. 1A, B; R143, V144, R147, G183, C187, P192, C194, C197, and C203 in Nth; R143, V144, R147, G188, C192, P197, C199, C202, and C208 in MutY). Two others are centered around positions of the catalytic dyad of the Nth superfamily, 138 (S2 on Fig.
1A, B; V137, D138, and T139 in Nth; L137, D138, and G139 in MutY), and 120 (S3 on Fig. 1A, B; P105, L114, G116, V117, G118, T121, and A122 in Nth; P105, L114, G116, V117, G118, T121, and A122 in MutY). The latter similarity regions almost coincide with the conserved G/P...D loop (GPD) motif and the helix-hairpin-helix (HhH) motif, respectively [26,27]. Obviously, these elements are absolutely required for correct positioning of key catalytic amino acid residues in both enzymes. The smallest patch of similarity involves two residues at the six-helix barrel domain edge of the DNA-binding cleft (S4 on Fig. 1A, B; A40 and Q41 in Nth; Q41 and Q42 in MutY), and may participate in DNA binding.

Dissimilar residues in Nth and MutY may play a crucial role as determinants of the markedly different substrate specificity of these two enzymes. Most dissimilar residues in Nth are concentrated on the DNA-binding face of the protein (Fig. 1A) while MutY has a patch of dissimilar residues on the opposite face (see below). As expected from the reaction mechanism, position 120 is dissimilar in Nth, but, unexpectedly, based on structural and biochemical information for E. coli MutY, is not dissimilar in MutY. Analysis of the alignment (Fig. 1C) shows that this position in the MutY family is occupied by Ser or Tyr, residues of quite different physico-chemical properties. The residues spatially close to position 120 form distinct dissimilarity patches (D1 in Fig. 1A, B; K120, V124, V125, and L126 in Nth; I25, V33, S36, E37, and A124 in MutY) inside deep pockets of these enzymes where the everted nucleotide presumably is bound. In MutY, Glu-37 was identified by structural and mutagenic analysis as a crucial residue for recognition of adenine in the active site pocket [9].

A number of dissimilar residues are present at the edges of the DNA-binding groove in both enzymes, forming “lips” around the “mouth” of the groove (D2 in Fig. 1A, B; S39, D44, H140, and R184 in Nth; V45, N140, D186, and A189 in MutY). For MutY, V45 and N140 have been implicated in the mechanism of base excision [9]. DNA glycosylases generally insert several amino acid residues into the DNA double helix, assisting in eversion of the damaged nucleotide and recognition of the opposing base. The residues listed above are good candidates for performing these functions.

The largest region of dissimilarity is found in the six-helix barrel domain. In Nth, it surfaces on the DNA-binding face of the enzyme (D3 in Fig. 1A) and penetrates deep into the hydrophobic core of the domain while, in MutY, it runs through the entire domain, from the DNA-binding face of the enzyme to the opposite face. This region on the surface of the DNA-binding face of MutY (D3 in Fig. 1B) is rich in hydrophobic residues exposed to the solvent and projects approximately in the same direction as the C-terminal domain of MutY (absent in the core domain structure). We suggest that these residues may be involved in interactions with the C-terminal domain which is proposed, based on molecular modeling, to fold back on the barrel domain of MutY [44]. Buried residues of this region may play a role in overall shape maintenance, accounting for the large observed difference in the size of the DNA-binding cleft of Nth and MutY. The final significant dissimilarity region in both enzymes (not shown in Fig. 1; L13, C145, and H182 in Nth, W13, Y14, L187, and P203 in MutY) lies half-buried in the iron-sulfur cluster domain and partially contacts the lower “lip” of dissimilarity described above, presumably helping to position residues of the lip motifs.

Fig. 1. (A and B) Structures of E. coli Nth and the catalytic domain of E. coli MutY, respectively. Residues identified as similar are colored red, dissimilar residues are blue; all others are shaded grey. The margins of certain clusters of similarity or dissimilarity are traced with yellow lines. Individual residues and clusters are labeled (see text for details). (C) Alignment of Nth and MutY sequences from different species. Nth subgroup (top panel): AF1692, Archaeoglobus fulgidus; aq_282, aq_496, Aquifex aeolicus; MJ0613_1, Methanococcus jannaschii; jhp0532, Helicobacter pylori J99; HP0585, Helicobacter pylori; Cj0595c, Campylobacter jejuni; Rv3674c, Mycobacterium tuberculosis; ML2301, Mycobacterium leprae; slr1822, Synechocystis sp.; DR0289, DR0928, DR2438, Deinococcus radiodurans; BS_nth, Bacillus subtilis; BH1698, Bacillus halodurans; L0253, Lactococcus lactis; SPy0929, Streptococcus pyogenes; nth, Escherichia coli K12; Znth, Escherichia coli O157:H7; VC1011, Vibrio cholerae; HI1689, Haemophilus influenzae; PM0381, Pasteurella multocida; NMB0533, Neisseria meningitidis MC58; NMA0711, Neisseria meningitidis Z2491; PA3495, Pseudomonas aeruginosa; XF0647, Xylella fastidiosa; BU119, Buchnera sp. APS; RP746, Rickettsia prowazekii; CC2272, CC3731, Caulobacter crescentus; mll3176, Mesorhizobium loti; TP0775, Treponema pallidum; PH1498, Pyrococcus horikoshii; PAB0459, Pyrococcus abyssi; TM0366, Thermotoga maritima; Ta0790m, Thermoplasma acidophilum; TVN0804, Thermoplasma volcanium; VNG0592G, Halobacterium sp. NRC-1; YOL043c, Saccharomyces cerevisiae; CDa4033, Candida albicans; APE0150, Aeropyrum pernix. MutY subgroup (bottom panel): CT107, Chlamydia trachomatis; CPn0402, Chlamydia pneumoniae; Rv3589, Mycobacterium tuberculosis; ML1920, Mycobacterium leprae; mutY, Escherichia coli K12; ZmutY, Escherichia coli O157:H7; HI0759, Haemophilus influenzae; PM1319, Pasteurella multocida; VC0452, Vibrio cholerae; BU552, Buchnera sp. APS; PA5147, Pseudomonas aeruginosa; NMB1396, Neisseria meningitidis MC58; NMA1614, Neisseria meningitidis Z2491; XF1909, Xylella fastidiosa; BS_yfhQ_1, Bacillus subtilis; BH1931, Bacillus halodurans; L0296, Lactococcus lactis; mll7523, Mesorhizobium loti; CC0377, Caulobacter crescentus; jhp0130, Helicobacter pylori J99; HP0142, Helicobacter pylori; Cj1620c, Campylobacter jejuni; DR2285, Deinococcus radiodurans; MTH496, Methanobacterium thermoautotrophicum; APE0875, Aeropyrum pernix; VNG1520G, Halobacterium sp. NRC-1. Only part of the alignment surrounding the catalytic dyad is shown. See AMAS web page for a description of the coloring convention in the alignment.
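For readers who wish to reproduce the coloring convention of Fig. 1 (similar residues red, dissimilar residues blue, all others grey) on their own structure files, the following Python sketch builds a short list of molecular-viewer commands. The use of PyMOL-style `color` commands and `resi` selections is an assumption of this example, since the text does not state which graphics program was used; the function name `pymol_coloring_commands` and the object name `protein` are likewise hypothetical. The residue numbers are the Nth S2 and D1 patches listed above.

```python
def pymol_coloring_commands(similar, dissimilar, obj="protein"):
    """Build PyMOL-style commands that color residues by conservation class.

    `similar` and `dissimilar` are iterables of residue numbers; `obj` is
    the (hypothetical) name of the loaded structure object. The whole
    object is first colored grey, then the two classes are overlaid.
    """
    def selection(residues):
        # PyMOL-style residue lists join numbers with '+', e.g. 137+138+139.
        return "+".join(str(r) for r in sorted(set(residues)))

    commands = [f"color grey, {obj}"]
    if similar:
        commands.append(f"color red, {obj} and resi {selection(similar)}")
    if dissimilar:
        commands.append(f"color blue, {obj} and resi {selection(dissimilar)}")
    return commands


if __name__ == "__main__":
    nth_s2_patch = [137, 138, 139]        # V137, D138, T139 (similar)
    nth_d1_patch = [120, 124, 125, 126]   # K120, V124, V125, L126 (dissimilar)
    for line in pymol_coloring_commands(nth_s2_patch, nth_d1_patch):
        print(line)
```

The printed lines can be pasted into a viewer session or saved as a script; keeping the residue lists in code makes it easy to regenerate the figure when the threshold or the alignment changes.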


Table 1. Substrate Specificity of Selected DNA Glycosylases

Nth superfamily
  Nth: oxidatively damaged pyrimidines (thymine glycol, dihydrothymine, 5-hydroxycytosine)
  MutY: mismatched adenine (A:8-oxoG, A:G, A:C)
  Pdg: UV photoproducts (cyclobutane pyrimidine dimers)
  Tdg: mismatched thymine (T:G)
  AlkA: alkylated purines (3-methyladenine)
  Ogg1: oxidatively damaged purines (8-oxoguanine, formamidopyrimidines)

Fpg family
  Fpg: oxidatively damaged purines (8-oxoguanine, formamidopyrimidines)
  Nei: oxidatively damaged pyrimidines (thymine glycol, dihydrothymine, 5-hydroxycytosine)

Fpg and Nei

Nei represents a unique member of the Fpg family, similar to Fpg in its catalytic mechanism [45] and structure [46,47] but recognizing a different set of substrates (Table 1). So far, Nei has been identified only in E. coli, its close relatives of the genus Salmonella, and, tentatively, two other prokaryotes, Carboxydothermus hydrogenoformans and Streptomyces coelicolor. The sequence of Nei differs from other members of the Fpg family at many positions, including several in the N-terminal domain that is involved in nucleophilic attack on C1′ of the damaged nucleotide. Fpg and Nei are bilobal proteins, consisting of an N-terminal β-sandwich domain and a C-terminal domain containing a single zinc finger motif. Both domains contain residues crucial for catalytic activity. The N-terminal proline, lying at the bottom of the DNA-binding groove, acts as a nucleophile in reactions catalyzed by these proteins [48,49].

Fpg COG (COG0266) contains 27 proteins (not counting Nei) belonging to 14 phylogenetic groups. Qualification of Fpg sequences was based on the presence of four Cys residues in the zinc finger motif and the consensus sequence PELPEV at the N-terminus, leaving 23 Fpg sequences of 14 phylogenetic groups for comparison with Nei. Since COG0266 includes Nei from a single species, E. coli, we constructed the Nei subgroup to include sequences of Nei from S. typhimurium, C. hydrogenoformans, and S. coelicolor, which are not represented in this COG (3 phylogenetic groups in total). As only the C-terminal section of C. hydrogenoformans Nei has been sequenced, alignments of the N-terminus and C-terminus of Nei were performed separately to avoid a disproportionately large gap percentage for Nei.
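A qualification step of this kind, requiring the N-terminal consensus and four zinc finger cysteines, can be approximated by a simple sequence filter. The Python sketch below is an assumption-laden illustration and not the procedure actually used: the regular expression for the zinc finger (two Cys pairs separated by a variable spacer), the 60-residue C-terminal window, the stripping of an initiator Met, and the function name `qualifies_as_fpg` are all hypothetical choices, and the test sequences are synthetic.

```python
import re

# Hypothetical zinc finger pattern: C-x2-C, a spacer of 10-25 residues, C-x2-C.
# The spacer range is an assumption made for illustration only.
ZINC_FINGER = re.compile(r"C.{2}C.{10,25}C.{2}C")


def qualifies_as_fpg(sequence, n_term_consensus="PELPEV", c_term_window=60):
    """Return True if a sequence passes the sketched Fpg qualification.

    The check requires (i) the consensus at the N-terminus, ignoring an
    initiator Met if present, and (ii) a four-Cys zinc finger pattern
    within the last `c_term_window` residues.
    """
    seq = sequence.upper()
    mature = seq[1:] if seq.startswith("M") else seq
    has_n_term = mature.startswith(n_term_consensus)
    has_zinc_finger = ZINC_FINGER.search(mature[-c_term_window:]) is not None
    return has_n_term and has_zinc_finger


if __name__ == "__main__":
    finger = "C" + "AA" + "C" + "A" * 16 + "C" + "AA" + "C"
    candidates = {
        "synthetic_pass": "MPELPEV" + "A" * 50 + finger + "AAAA",
        "synthetic_no_finger": "MPELPEV" + "A" * 80,
        "synthetic_nei_like": "MPEG" + "A" * 50 + finger + "AAAA",
    }
    for name, seq in candidates.items():
        print(name, qualifies_as_fpg(seq))
```

In practice such a filter would only flag candidates for the manual inspection of the alignment described in the text, since real homologs may deviate from a strict consensus.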
Qualification of these sequences as Nei was based on the presence of four Cys residues in the zinc finger motif, the consensus sequence PEG at the N-terminus, and on grouping with the Nei rather than the Fpg subgroup in the complete phylogenetic tree, as constructed by the neighbor-joining method [50]. Structures of Thermus thermophilus Fpg ([46], 1EE8) and E. coli Nei covalently complexed to DNA ([47], 1K3X) were used for mapping.

Overall, a higher percentage of the residues were identified as similar (16%) and dissimilar (17% in Fpg and 23% in Nei) for the Fpg family compared with the Nth-MutY analyses. A higher percentage of conserved residues likely reflects lower divergence and the more conserved reaction mechanism of the Fpg family. The higher percentage of dissimilar residues may be attributed to the small number of sequences of the Nei subgroup, which increases noise in attribution of dissimilarity.

As with Nth-MutY, similar residues in Fpg and Nei are located primarily in the active site of both proteins and are buried within the two domains (Fig. 2). The zinc finger domain contains a tight conserved cluster composed of residues found mostly within the zinc finger motif (S1 in Fig. 2A, B; L182, L190, F224, V230, R233, C238, C241, G242, V245, F257, C258, C261, and Q262 in Fpg; L189, L197, F225, V229, R232, C237, C240, G241, I244, W256, C257, C260, and Q261 in Nei). A cluster of conserved hydrophobic residues is buried inside the β-sandwich domain (L13, V17, L22, I44, L54, L64, and G72 in Fpg; L13, I17, L22, V44, L54, L64, and G72 in Nei). This cluster opens up in the wide interdomain groove where the active site of the enzyme is located; nearly the entire surface of the active site is lined and supported with conserved residues (S2 in Fig. 2A, B; P1, E2, V6, R50, G51, K52, L55, H67, L121, G122, E124, L150, L151, D152, Q153, A156, G158, G160, N161, and R253 in Fpg; P1, E2, I6, R50, G51, K52, L55, H67, V125, G126, D128, L157, L158, D159, Q160, L163, G165, G167, N168, and R252 in Nei). Many of these residues are functionally important, as evident from mutagenesis studies and the structure of the DNA complex.
For example, K52/K52, N161/N168, and R253/R252 are involved in coordination of phosphates around the lesion in the bound DNA; mutation of these residues eliminates the glycosylase activity of Nei [47]. E2/E2 forms a hydrogen bond with O4′ of the damaged nucleotide and possibly protonates it during the catalytic reaction; mutation at this position abolishes glycosylase activity of Fpg [51] and Nei [47]. Separate from other active site residues is the conserved position 70 (S3 in Fig. 2A, B; M70 in Fpg, L70 in Nei) which lies at the position resembling that of the lip of the Nth-MutY structure and is inserted into the DNA helix upon nucleotide eversion. Interestingly, in Fpg this lip also contains two dissimilar residues (D1A in Fig. 2A), R99 and F101, which may fill the void in DNA bound to the enzyme.

In contrast to Nth-MutY, the distribution of dissimilar residues in Fpg and Nei is predominantly asymmetrical. In Nei, the largest cluster of dissimilar residues lies in the β-sandwich domain, encasing the conserved cluster, with residues buried and on the surface. A large surface patch of this cluster (D2B in Fig. 2B; A9, A16, K20, W74, R75, V76, D97, and K98) lies on the rim of the DNA-binding groove but is not contacted by DNA due to the kink introduced by eversion of the damaged nucleotide from the helix. In Fpg, dissimilar residues of this domain are located almost exclusively on the surface but away from the DNA-binding groove (D2A in Fig. 2A).

Furthermore, the two patches of dissimilarity located in the putative damage recognition site are entirely different. In Nei, patch D3B in Fig. 2B, including R171, L224, R226, F227, F230, P253, and Y255, resides in the zinc finger domain adjacent to the active site and overlaps with a putative shallow thymine glycol-binding site defined by molecular dynamics simulation [47]. In Fpg, this patch of dissimilarity is totally absent. Rather, a cluster forming a deep pocket (D3A in Fig. 2A) consisting of residues L3, P4, E5, G7, R49, P125, I162, Y163, L198, V202, G206, Y214, and G222 is present next to the conserved active site cleft in the approximate location where the extruded base is expected to project. This dissimilarity region is very small in Nei (G3, V129, L130).

A well-defined cluster of dissimilarity lies buried between the active site cleft and the zinc finger similarity cluster in Fpg (L146, K147, E166, L168, F169, L173, and P175). The only relatively solvent-accessible residue in this cluster is K147, whose counterpart in E. coli Fpg, K154, has been implicated in substrate discrimination [52]. This residue is unlikely to contact the damaged base in the catalytically competent complex, but may be important, for example, in the recognition of 8-oxoG when the enzyme scans DNA by facilitated diffusion [53].
In Nei, this position is not conserved and only two dissimilar residues, R171 and I204, are found at the active site cleft-zinc finger interface; however, the base of the zinc finger cluster is flanked by dissimilar residues R239 and P258, absent from Fpg. Differences in the environment of the zinc finger domain may determine zinc finger flexibility and may be important for substrate recognition.

Fig. 2. (A and B) Structures of T. thermophilus Fpg and E. coli Nei complexed with DNA, respectively. The general coloring scheme is described in the legend to Fig. 1; DNA is shown as a stick model in cyan. Individual residues and clusters are labeled (see text for details).

Nth and AlkA

Nth and AlkA represent a marginal case for this analysis. Three-dimensional structures of the three-domain protein AlkA [6,7] show that domains II and III have a fold similar to that of Nth [3,26]. However, sequence similarity between AlkA and Nth is quite limited and the mechanisms of catalysis differ in two respects. First, AlkA is a monofunctional glycosylase while Nth has an AP lyase activity. The catalytic Lys is absent from AlkA but the catalytic Asp is intact. Second, unlike Nth, which likely recognizes damaged bases by formation of hydrogen bonds, AlkA is believed to use π-system interactions to recognize its mostly electron-deficient substrates [7].

AlkA COG (COG0122) comprises 25 proteins of 15 phylogenetic groups. After qualification based on the presence of the critical catalytic Asp (D238 in E. coli AlkA), 21 sequences of 14 phylogenetic groups of the AlkA COG were used for analysis. Nth sequences and structure were the same as those used for the Nth-MutY analysis. Due to low sequence similarity between the subgroups, conservation requirements were slightly relaxed; positions in which Cn ≥ 8 were considered conserved between subgroups and the Cn ≥ 9 threshold was kept for conservation within a subgroup. The structure of E. coli AlkA complexed with an abasic site transition state analog ([17]; PDB entry 1DIZ) was used for mapping.

Very few similarity clusters are found in the Nth-AlkA structures (5% of residues in both sequences). Only the HhH motif (S1 in Fig. 3A, B) and parts of the GPD motif (S2 in Fig. 3A, B) can be viewed as conserved elements, underscoring their functional importance. However, the HhH motif contains a functionally relevant region of dissimilarity (K120 and V124 in Nth, W218 in AlkA), reflecting the difference in catalytic mechanism. In fact, W218 is the only residue identified as dissimilar in the active site of AlkA. This residue does not appear to be part of the damage recognition π-system [17]. A recognition pocket of this kind, if it exists, must therefore be formed by residues not conserved within the AlkA subgroup. Interestingly, the substrate specificity of AlkA is probably the broadest among DNA glycosylases, with even unmodified nucleotides being cleaved [54]. Overall, 27% of residues in Nth and 19% in AlkA are dissimilar. Extensive regions of dissimilarity are found on the face of both proteins opposite the DNA-binding groove, reflecting the difference in overall organization and, in the case of AlkA, the presence of domain I (not considered in this analysis). In Nth, large dissimilarity regions are present in the six-helix barrel and iron-sulfur cluster domains (for example, D1A and D2A in Fig. 3A), covering a significant area of the DNA-binding groove. In contrast, in AlkA, very few dissimilar residues are found at the DNA-binding interface, such as a prominent spot (D3B in Fig. 3B) formed by L172, G173, and M174, which contacts the sugar-phosphate DNA backbone by van der Waals interactions and contributes to DNA bending. Differences in conservation at the DNA-binding interface suggest that the modes of DNA binding by Nth and AlkA may differ.

Fig. 3. (A and B) Structures of E. coli Nth and E. coli AlkA complexed with DNA, respectively. Only domains II and III of AlkA are shown. Coloring scheme is described in the legend to Fig. 1. Individual residues and clusters are labeled (see text for details).

SUMMARY AND CONCLUSIONS
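
As a recap of the rule applied in the analyses above, the following Python sketch labels each alignment position from two conservation scores. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the input lists, and the example scores are hypothetical, and the thresholds default to the relaxed values quoted for the Nth-AlkA comparison (Cn ≥ 8 between subgroups, Cn ≥ 9 within a subgroup). Treating every position that passes the within-subgroup threshold but fails the between-subgroup threshold as dissimilar is an assumption drawn from the postulate stated in the abstract.

```python
from typing import List


def classify_positions(
    cn_between: List[int],
    cn_within: List[int],
    between_threshold: int = 8,  # relaxed Nth-AlkA value quoted in the text
    within_threshold: int = 9,   # within-subgroup value quoted in the text
) -> List[str]:
    """Label each alignment position as 'similar', 'dissimilar', or 'variable'.

    cn_between: conservation score of each position over all subgroups together.
    cn_within:  conservation score of the same position inside one subgroup.
    Both lists are hypothetical inputs; computing Cn itself is not shown here.
    """
    if len(cn_between) != len(cn_within):
        raise ValueError("score lists must cover the same alignment positions")
    labels = []
    for between, within in zip(cn_between, cn_within):
        if between >= between_threshold:
            # Conserved across subgroups: candidate for catalysis or structure.
            labels.append("similar")
        elif within >= within_threshold:
            # Conserved only inside the subgroup: candidate for specificity.
            labels.append("dissimilar")
        else:
            labels.append("variable")
    return labels


def percent(labels: List[str], label: str) -> float:
    """Percentage of positions carrying the given label (0.0 for empty input)."""
    if not labels:
        return 0.0
    return 100.0 * labels.count(label) / len(labels)


# Made-up scores for six positions, for illustration only.
cn_between = [10, 8, 3, 5, 7, 2]
cn_within = [10, 9, 9, 4, 10, 9]
labels = classify_positions(cn_between, cn_within)
print(labels)
print(percent(labels, "dissimilar"))
```

With real scores, the label fractions would play the role of the percentages of similar and dissimilar residues quoted for each protein, and the labeled positions would then be mapped onto the corresponding structure.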

Analysis of conserved amino acid residues within groups and subgroups of proteins is widely used for predicting functions. In the past decade, the number of protein structures solved by X-ray crystallography has increased nearly exponentially, making it possible to analyze not only sequence positions of conserved or dissimilar amino acids, but also their spatial distribution. A combined analysis is made possible by the development of new methods for quantitative comparisons of amino acid conservation. This bioinformatics approach has wide applicability; similar methods have been used to analyze substrate specificity of caspases of CED-3 and ICE subfamilies [55] and to predict functional groups for members of a multimodular penicillin-binding protein family [56]. In this paper, we describe a method by which protein sequences in related families may be analyzed to identify residues important for group-specific functions and demonstrate its applicability to the Nth superfamily and Fpg family of DNA glycosylases. These families represent a particularly good system for applying conservation analysis because of the number of known sequences and existence of a large number of representative structures. Importantly, the data serve as a starting point for designing site-directed mutagenesis experiments to elucidate the functional role of similar and dissimilar residues.

Acknowledgements: The authors thank Erich Bremer for his expert assistance in constructing images. This research was supported by grants CA-17395 from the National Institutes of Health (to A.P.G.) and 0204049605 (to D.Z.) from the Russian Foundation for Basic Research.

REFERENCES

[1] Friedberg, E. C.; Walker, G. C.; Siede, W. DNA repair and mutagenesis. Washington, DC: ASM Press; 1995.
[2] Morikawa, K.; Matsumoto, O.; Tsujimoto, M.; Katayanagi, K.; Ariyoshi, M.; Doi, T.; Ikehara, M.; Inaoka, T.; Ohtsuka, E. X-ray structure of T4 endonuclease V: an excision repair enzyme specific for a pyrimidine dimer. Science 256:523–526; 1992.
[3] Kuo, C.-F.; McRee, D. E.; Fisher, C. L.; O’Handley, S. F.; Cunningham, R. P.; Tainer, J. A. Atomic structure of the DNA repair [4Fe-4S] enzyme endonuclease III. Science 258:434–440; 1992.
[4] Savva, R.; McAuley-Hecht, K.; Brown, T.; Pearl, L. The structural basis of specific base-excision repair by uracil-DNA glycosylase. Nature 373:487–493; 1995.
[5] Mol, C. D.; Arvai, A. S.; Slupphaug, G.; Kavli, B.; Alseth, I.; Krokan, H. E.; Tainer, J. A. Crystal structure and mutational analysis of human uracil-DNA glycosylase: structural basis for specificity and catalysis. Cell 80:869–878; 1995.
[6] Yamagata, Y.; Kato, M.; Odawara, K.; Tokuno, Y.; Nakashima, Y.; Matsushima, N.; Yasumura, K.; Tomita, K.-I.; Ihara, K.; Fujii, Y.; Nakabeppu, Y.; Sekiguchi, M.; Fujii, S. Three-dimensional structure of a DNA repair enzyme, 3-methyladenine DNA glycosylase II, from Escherichia coli. Cell 86:311–319; 1996.
[7] Labahn, J.; Schärer, O. D.; Long, A.; Ezaz-Nikpay, K.; Verdine, G. L.; Ellenberger, T. E. Structural basis for the excision repair of alkylation-damaged DNA. Cell 86:321–329; 1996.
[8] Lau, A. Y.; Schärer, O. D.; Samson, L.; Verdine, G. L.; Ellenberger, T. Crystal structure of a human alkylbase-DNA repair enzyme complexed to DNA: mechanisms for nucleotide flipping and base excision. Cell 95:249–258; 1998.
[9] Guan, Y.; Manuel, R. C.; Arvai, A. S.; Parikh, S. S.; Mol, C. D.; Miller, J. H.; Lloyd, R. S.; Tainer, J. A. MutY catalytic core, mutant and bound adenine structures define specificity for DNA repair enzyme superfamily. Nat. Struct. Biol. 5:1058–1064; 1998.
[10] Barrett, T. E.; Savva, R.; Panayotou, G.; Barlow, T.; Brown, T.; Jiricny, J.; Pearl, L. H. Crystal structure of a G:T/U mismatch-specific DNA glycosylase: mismatch recognition by complementary-strand interactions. Cell 92:117–129; 1998.
[11] Xiao, G.; Tordova, M.; Jagadeesh, J.; Drohat, A. C.; Stivers, J. T.; Gilliland, G. L. Crystal structure of Escherichia coli uracil DNA glycosylase and its complexes with uracil and glycerol: structure and glycosylase mechanism revisited. Proteins 35:13–24; 1999.
[12] Bruner, S. D.; Norman, D. P. G.; Verdine, G. L. Structural basis for recognition and repair of the endogenous mutagen 8-oxoguanine in DNA. Nature 403:859–866; 2000.
[13] Vassylyev, D. G.; Kashiwagi, T.; Mikami, Y.; Ariyoshi, M.; Iwai, S.; Ohtsuka, E.; Morikawa, K. Atomic model of a pyrimidine dimer excision repair enzyme complexed with a DNA substrate: structural basis for damaged DNA recognition. Cell 83:773–782; 1995.
[14] Slupphaug, G.; Mol, C. D.; Kavli, B.; Arvai, A. S.; Krokan, H. E.; Tainer, J. A. A nucleotide-flipping mechanism from the structure of human uracil-DNA glycosylase bound to DNA. Nature 384:87–92; 1996.
[15] Parikh, S. S.; Mol, C. D.; Slupphaug, G.; Bharati, S.; Krokan, H. E.; Tainer, J. A. Base excision repair initiation revealed by crystal structures and binding kinetics of human uracil-DNA glycosylase with DNA. EMBO J. 17:5214–5226; 1998.
[16] Parikh, S. S.; Walcher, G.; Jones, G. D.; Slupphaug, G.; Krokan, H. E.; Blackburn, G. M.; Tainer, J. A. Uracil-DNA glycosylase–DNA substrate and product structures: conformational strain promotes catalytic efficiency by coupled stereoelectronic effects. Proc. Natl. Acad. Sci. USA 97:5083–5088; 2000.
[17] Hollis, T.; Ichikawa, Y.; Ellenberger, T. DNA bending and a flip-out mechanism for base excision by the helix-hairpin-helix DNA glycosylase, Escherichia coli AlkA. EMBO J. 19:758–766; 2000.
[18] Zharkov, D. O.; Grollman, A. P. MutY DNA glycosylase: base release and intermediate complex formation. Biochemistry 37:12384–12394; 1998.
[19] Williams, S. D.; David, S. S. Evidence that MutY is a monofunctional glycosylase capable of forming a covalent Schiff base intermediate with substrate DNA. Nucleic Acids Res. 26:5123–5133; 1998.
[20] Hickerson, R. P.; Chepanoske, C. L.; Williams, S. D.; David, S. S.; Burrows, C. J. Mechanism-based DNA-protein cross-linking of MutY via oxidation of 8-oxoguanosine. J. Am. Chem. Soc. 121:9901–9902; 1999.
[21] Zharkov, D. O.; Gilboa, R.; Yagil, I.; Kycia, J. H.; Gerchman, S. E.; Shoham, G.; Grollman, A. P. Role for lysine 142 in the excision of adenine from A:G mispairs by MutY DNA glycosylase of Escherichia coli. Biochemistry 39:14768–14778; 2000.
[22] Lau, A. Y.; Wyatt, M. D.; Glassner, B. J.; Samson, L. D.; Ellenberger, T. Molecular basis for discriminating between normal and damaged bases by the human alkyladenine glycosylase, AAG. Proc. Natl. Acad. Sci. USA 97:13573–13578; 2000.
[23] Zharkov, D. O.; Rosenquist, T. A.; Gerchman, S. E.; Grollman, A. P. Substrate specificity and reaction mechanism of murine 8-oxoguanine-DNA glycosylase. J. Biol. Chem. 275:28607–28617; 2000.
[24] Eisen, J. A.; Hanawalt, P. C. A phylogenomic study of DNA repair genes, proteins, and processes. Mutat. Res. 435:171–213; 1999.
[25] Fitch, W. M. Distinguishing homologous from analogous proteins. Syst. Zool. 19:99–113; 1970.
[26] Thayer, M. M.; Ahern, H.; Xing, D.; Cunningham, R. P.; Tainer, J. A. Novel DNA binding motifs in the DNA repair enzyme endonuclease III crystal structure. EMBO J. 14:4108–4120; 1995.
[27] Nash, H. M.; Bruner, S. D.; Schärer, O. D.; Kawate, T.; Addona, T. A.; Spooner, E.; Lane, W. S.; Verdine, G. L. Cloning of a yeast 8-oxoguanine DNA glycosylase reveals the existence of a base-excision DNA-repair protein superfamily. Curr. Biol. 6:968–980; 1996.
[28] Livingstone, C. D.; Barton, G. J. Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Comput. Appl. Biosci. 9:745–756; 1993.
[29] Taylor, W. R. Classification of amino acid conservation. J. Theor. Biol. 119:205–218; 1986.
[30] Livingstone, C. D.; Barton, G. J. Identification of functional residues and secondary structure from protein multiple sequence alignment. Methods Enzymol. 266:497–512; 1996.
[31] Gu, X. Maximum-likelihood approach for gene family evolution under functional divergence. Mol. Biol. Evol. 18:453–464; 2001.
[32] Duwat, P.; de Oliveira, R.; Ehrlich, S. D.; Boiteux, S. Repair of oxidative DNA damage in gram-positive bacteria: the Lactococcus lactis Fpg protein. Microbiology 141:411–417; 1995.
[33] Sentürker, S.; Bauche, C.; Laval, J.; Dizdaroglu, M. Substrate specificity of Deinococcus radiodurans Fpg protein. Biochemistry 38:9435–9439; 1999.
[34] Gao, M.-J.; Murphy, T. M. Alternative forms of formamidopyrimidine-DNA glycosylase from Arabidopsis thaliana. Photochem. Photobiol. 73:128–134; 2001.
[35] Tatusov, R. L.; Natale, D. A.; Garkavtsev, I. V.; Tatusova, T. A.; Shankavaram, U. T.; Rao, B. S.; Kiryutin, B.; Galperin, M. Y.; Fedorova, N. D.; Koonin, E. V. The COG database: new developments in phylogenetic classification of proteins from complete genomes. Nucleic Acids Res. 29:22–28; 2001.
[36] Tatusov, R. L.; Koonin, E. V.; Lipman, D. J. A genomic perspective on protein families. Science 278:631–637; 1997.
[37] Gu, X. Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. 16:1664–1674; 1999.
[38] David, S. S.; Williams, S. D. Chemistry of glycosylases and endonucleases involved in base-excision repair. Chem. Rev. 98:1221–1261; 1998.
[39] Manuel, R. C.; Czerwinski, E. W.; Lloyd, R. S. Identification of the structural and functional domains of MutY, an Escherichia coli DNA mismatch repair enzyme. J. Biol. Chem. 271:16218–16226; 1996.
[40] Gogos, A.; Cillo, J.; Clarke, N. D.; Lu, A.-L. Specific recognition of A/G and A/7,8-dihydro-8-oxoguanine (8-oxoG) mismatches by Escherichia coli MutY: removal of the C-terminal domain preferentially affects A/8-oxoG recognition. Biochemistry 35:16665–16671; 1996.
[41] Cunningham, R. P.; Asahara, H.; Bank, J. F.; Scholes, C. P.; Salerno, J. C.; Surerus, K.; Münck, E.; McCracken, J.; Peisach, J.; Emptage, M. H. Endonuclease III is an iron-sulfur protein. Biochemistry 28:4450–4455; 1989.
[42] Porello, S. L.; Cannon, M. J.; David, S. S. A substrate recognition role for the [4Fe-4S]2+ cluster of the DNA repair glycosylase MutY. Biochemistry 37:6465–6475; 1998.
[43] Thompson, J. D.; Higgins, D. G.; Gibson, T. J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673–4680; 1994.
[44] Volk, D. E.; House, P. G.; Thiviyanathan, V.; Luxon, B. A.; Zhang, S.; Lloyd, R. S.; Gorenstein, D. G. Structural similarities between MutT and the C-terminal domain of MutY. Biochemistry 39:7331–7336; 2000.
[45] Jiang, D.; Hatahet, Z.; Melamede, R. J.; Kow, Y. W.; Wallace, S. S. Characterization of Escherichia coli endonuclease VIII. J. Biol. Chem. 272:32230–32239; 1997.
[46] Sugahara, M.; Mikawa, T.; Kumasaka, T.; Yamamoto, M.; Kato, R.; Fukuyama, K.; Inoue, Y.; Kuramitsu, S. Crystal structure of a repair enzyme of oxidatively damaged DNA, MutM (Fpg), from an extreme thermophile, Thermus thermophilus HB8. EMBO J. 19:3857–3869; 2000.
[47] Zharkov, D. O.; Golan, G.; Gilboa, R.; Fernandes, A. S.; Gerchman, S. E.; Kycia, J. H.; Rieger, R. A.; Grollman, A. P.; Shoham, G. Structural analysis of an Escherichia coli endonuclease VIII covalent reaction intermediate. EMBO J. 21:789–800; 2002.
[48] Zharkov, D. O.; Rieger, R. A.; Iden, C. R.; Grollman, A. P. NH2-terminal proline acts as a nucleophile in the glycosylase/AP-lyase reaction catalyzed by Escherichia coli formamidopyrimidine-DNA glycosylase (Fpg) protein. J. Biol. Chem. 272:5335–5341; 1997.
[49] Rieger, R. A.; McTigue, M. M.; Kycia, J. H.; Gerchman, S. E.; Grollman, A. P.; Iden, C. R. Characterization of a cross-linked DNA-endonuclease VIII repair complex by electrospray ionization mass spectrometry. J. Am. Soc. Mass Spectrom. 11:505–515; 2000.
[50] Saitou, N.; Nei, M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425; 1987.
[51] Lavrukhin, O. V.; Lloyd, R. S. Involvement of phylogenetically conserved acidic amino acid residues in catalysis by an oxidative DNA damage enzyme formamidopyrimidine glycosylase. Biochemistry 39:15266–15271; 2000.
[52] Rabow, L. E.; Kow, Y. W. Mechanism of action of base release by Escherichia coli Fpg protein: role of lysine 155 in catalysis. Biochemistry 36:5084–5096; 1997.
[53] Berg, O. G.; Winter, R. B.; von Hippel, P. H. Diffusion-driven mechanisms of protein translocation on nucleic acids. 1. Models and theory. Biochemistry 20:6929–6948; 1981.
[54] Berdal, K. G.; Johansen, R. F.; Seeberg, E. Release of normal bases from intact DNA by a native DNA repair enzyme. EMBO J. 17:363–367; 1998.
[55] Wang, Y.; Gu, X. Functional divergence in the caspase gene family and altered functional constraints: statistical analysis and prediction. Genetics 158:1311–1320; 2001.
[56] Goffin, C.; Ghuysen, J.-M. Multimodular penicillin-binding proteins: an enigmatic family of orthologs and paralogs. Microbiol. Mol. Biol. Rev. 62:1079–1093; 1998.