GENOMICS 43, 52–61 (1997)
ARTICLE NO. GE974790
Identification of Endogenous Retroviral Sequences Based on Modular Organization: Proviral Structure at the SSAV1 Locus

Jürgen H. Blusch,* Manuela Haltmeier,†,‡ Kornelie Frech,* Ingrid Sander,* Christine Leib-Mösch,†,‡ Ruth Brack-Werner,† and Thomas Werner*,1

GSF-National Research Center for Environment and Health, *Institute of Mammalian Genetics, †Institute of Molecular Virology, Ingolstädter Landstraße 1, D-85764 Neuherberg; and ‡Klinik III Mannheim, University of Heidelberg, Mannheim, Germany

Received January 31, 1997; accepted May 2, 1997
1 To whom correspondence should be addressed at GSF-National Research Center for Environment and Health, AG BIODV/Institute of Mammalian Genetics, Ingolstädter Landstraße 1, D-85764 Neuherberg, Germany. Telephone: 89-3187-4050. Fax: 89-3187-4400. Email: [email protected].

0888-7543/97 $25.00 Copyright © 1997 by Academic Press. All rights of reproduction in any form reserved.

The current genome sequencing projects reveal megabases of unknown genomic sequences. About 1% of these sequences can be expected to be of retroviral origin. These are often severely deleted or mutated. Therefore, identification of the retroviral origin of these sequences can be very difficult due to the absence of convincing overall sequence similarity. There are also many copies of solo-LTRs (long terminal repeats) distributed throughout genomic sequences. LTR and envelope sequences in general are among the most divergent parts of the retroviral genome and thus especially hard to detect in mutated endogenous sequences. We took advantage of the fact that these retroviral sections contain short highly conserved sequence regions providing retroviral hallmarks even after loss of overall similarity. We defined several sequence elements and peptide motifs within LTR and Env sequences and used these elements to construct models for LTRs and Env proteins of mammalian C-type retroviruses. We then used this strategy to identify successfully the hitherto missing LTRs and an env-like region in the S71 human retroviral sequence. Our approach provides a new strategy for identifying remotely related retroviral sequences in genomic DNA (especially human DNA), of potential significance for the interpretation of genomic sequences obtained from the current large-scale sequencing projects. © 1997 Academic Press

INTRODUCTION

Endogenous retrovirus-like sequences (ERVs) constitute at least 1% of the human genome, and thus identification of these sequences will be an important contribution to the understanding of human genomic sequences. In contrast to avians and rodents, only replication defective endogenous proviruses have been identified in the human genome (for review see Wilkinson et al., 1994), although HERV-K transcripts with open reading frames have been described recently (e.g., Löwer et al., 1993). Despite their replication defects, endogenous retroviral sequences have occasionally acquired functions in the host genome based on transcriptional control sequences of their long terminal repeats (LTRs) (e.g., Feuchter-Murphy et al., 1993). Some have been found activated in autoimmune diseases and tumorigenesis (Sauter et al., 1995; Seifarth et al., 1995; for review see Krieg et al., 1992) or have been shown to be able to amplify via retrotransposition (Tchenio and Heidmann, 1991; Goodchild et al., 1995) or reinsert into the host genome (e.g., Mager and Goodchild, 1989). Even transcriptionally defective endogenous viral sequences may augment genomic instability, functioning as targets for homologous recombination (e.g., Taruscio and Manuelidis, 1991). They can even interfere with retroviral vector technology as demonstrated by the transmission of endogenous retroviral material by a packaging cell line (Ronfort et al., 1995). Therefore, the identification of endogenous retroviral loci will be important for the interpretation of human genomic sequences in general.

Identification of endogenous retroviral sequences is complicated by the accumulation of random mutations. These are not present in functionally important regions of exogenous retroviral genomes due to stringent selection for replication competence. Sequence elements involved in vital functions are usually among the best conserved sequence regions of retroviral genomes (Brack-Werner et al., 1989; Werner et al., 1990). Therefore, they often retained detectable similarity to present day sequences on the nucleotide or amino acid levels despite the loss of overall sequence similarity (Werner et al., 1990; Frech et al., 1996).

We took advantage of these local sequence conservations to develop a general strategy that allows identification of the retroviral origin of endogenous sequences on the basis of conserved sequence modules. We applied this strategy to identifying the missing env and LTR sequences of the locus S71 (SSAV1). S71 is a member of a middle repetitive family of endogenous C-type retroviral sequences.
MODULAR ORGANIZATION OF RETROVIRAL SEQUENCES
Similarity to the gag and pol genes of SSAV and a solitary HERV-K LTR inserted between gag and pol were reported previously for S71 (Leib-Mösch et al., 1986, 1993; Werner et al., 1990). Our results established S71 as a retroviral element with all basic features of a provirus (LTR-gag-pol-env-LTR) despite extensive deletions and mutations present in the genes and no overall sequence similarity of the S71 LTRs to known LTRs. We were able to overcome these obstacles in the case of S71 by application of our concept of modular sequence organization.

MATERIAL AND METHODS

Clones, phages, and plasmids. Recombinant DNA techniques were carried out according to standard protocols (Sambrook et al., 1989). EcoRI and HindIII fragments of the λ clones S71-4 and λS71-5 (Leib-Mösch et al., 1986) were subcloned in the plasmid BlueScribe::lox (Boyd, 1993). Plasmid pS71JB-2 contains the 6-kb HindIII fragment of S71-4 encompassing the S71 5′-region and part of the gag sequences. pS71JB-23 contains a 3.2-kb HindIII fragment from λ phage S71-5 (Leib-Mösch et al., 1986) with the 3′-region and the adjacent 3′ integration locus. The 7-kb EcoRI/SalI fragment of λS71-5 was subcloned in pUC119. Pl124 has been described in Leib-Mösch et al. (1992).

Genomic DNA preparation and Southern blot hybridization. Genomic DNA was purified from human white blood cells or tissue culture cells (NIH3T3/AKV clone 623 as control). Cell lysis and digestion of proteins were performed according to Miller et al. (1988), with the following purification steps according to Sambrook et al. (1989). Ten micrograms genomic DNA per lane was digested for genomic Southern blots. The resulting fragments were separated on 0.8% agarose gels overnight (30 V) in 1× TBE buffer and transferred to nylon membranes (Nytran, NY12N; Schleicher & Schüll, Dassel, Germany) according to Sambrook et al. (1989). Southern hybridizations were performed with nonfat dry milk as unspecific competitor as described in Haltmeier et al. (1995).
Probe templates were generated by polymerase chain reaction (PCR) as described below, purified with the QiaQuick gel extraction kit (Diagen, Germany), and labeled with [α-32P]dCTP with the Rediprime reaction kit (Boehringer, Germany).

Oligonucleotide preparation/polymerase chain reaction. Oligonucleotides for PCR and DNA sequencing were synthesized on an Applied Biosystems Model 394-8A synthesizer and used without purification. Primers were designed to a melting temperature of 50°C using the program OligoEd (T. Werner, unpublished). Polymerase chain reactions were performed using 2.5 U AmpliTaq DNA polymerase and 10× buffer with magnesium (Perkin–Elmer), dNTPs at 0.2 mM, and 20 pmol of each primer in a reaction volume of 50 μl in a Perkin–Elmer PE9600 PCR machine. One microgram of template was used for amplification from genomic DNA; plasmid templates were in the nanogram range. All PCR amplifications were carried out together with negative controls. PCR cycle 25 (1 cycle, 1 min, 96°C/2 min, 55°C/3 min, 72°C; 39 cycles, 30 s, 96°C/1 min, 55°C/2 min, 72°C; 10 min, 72°C) was used if not stated otherwise. With few exceptions, the identity of the PCR products obtained was confirmed by direct sequencing. The forward primers for isolation of S71-related env sequences from human genomic DNA were located at the 3′-end of the S71 integrase gene (primer 1, CCTTTTGAAATCACGTATAGG) and after the S71 pol gene (primer 2, CCAACAGGTACGAGATATCA). The reverse primers were oriented at the 3′ end of the published S71 sequence (primer 3, GGAGTGAATGAATCTAAGC; primer 4, GGAGGGCATAACAGGTGG) and in the putative 3′-LTR region (primer 5, AGTTTGACAGAAGCTATGC; primer 6, CAGGTTGGACACTACATTCC). Primers 1–4 were based on the S71 sequence; primers 5 and 6 were designed for the amplification of pI-124 (Leib-Mösch et al., 1992). The primers were used under relaxed annealing conditions (45°C).
The PCR amplifications were separated on agarose gels and screened by hybridization with a S71 LTR-like structure probe to
isolate longer products. The largest fragment that was obtained with primer pair 1/6 was subcloned in pUC19 and named pCRENV1. Forward primers 990 (CAAGGGCTCCTTACCCACC), 1136 (GCCCTCGCACCTATCAC), and 1550 (GGGCTGAGAGAATTTTGAGG) were used for the isolation of the S71 5′ LTR and leader region in combination with the reverse primer 3367 (CGCTGGACTTGCTGTGGCAC) on the templates described in the text. The primer pair 7712 (GTACCCCTCTTAGTCACTTCACTATGTG)/7788 (GAGAAACAGAGTCAGATCTGACTTAC) was used for specific amplification of the S71 5′ LTR and part of the leader region from genomic DNA. The hybridization probes for the 3′ integration locus, the leader region, and the S71 LTRs were generated with primer combinations 2777 (CCTAGAGTGGTTTCTGTTTTCCTG)/4791 (CTTGATGGAGGAGTGTTGACATC), 3791 (GAGAGCTGGACTCATTCCAGG)/3792 (CTGCCAAGAAGGAGCGTC), and 6520 (CCCTCAAAATTCTCTC)/6521 (GACCCTCCCTTAGCTGAGAAAGCAGCC) with pS71JB23 and pS71JB2 as template.

DNA sequence analysis. Plasmid and direct sequencing of PCR products on both strands were carried out on an ABI 373A sequencer using the DyeDeoxyTerminator technology as described in previous reports (Werner et al., 1990; Haltmeier et al., 1995). Sequences were corrected with the help of the sequence editing program SeqEd (Applied Biosystems). The S71 locus was sequenced by a primer walking strategy. Primers were tested in PCR experiments using the above-described plasmids and phages, as well as genomic DNA, as templates to ensure the accuracy of the transition points of the phages and plasmids and to control for cloning artifacts.

Computer-assisted sequence analysis. Sequence alignments were carried out with the software package GeneWorks 2.3 (IntelliGenetics, 1994). The sequences and references of the endogenous and exogenous retroviruses used for comparison are described in Werner et al. (1990). Data bank searches were carried out with the FASTA routine of GCG 8.1 on GenBank Release 90.0 and EMBL Release 43.0.
Transcription factor binding sites were identified with the programs ConsInspector (Frech et al., 1993) and MatInspector (Quandt et al., 1995). S71 sequences were analyzed for potential LTR sequences by the program ModelInspector (Frech et al., 1996; Frech et al., 1997). Sequences of mammalian tRNA genes were extracted from GenBank Release 90.0, restricted to the terminal 18 bases, and aligned as inverse complement to the S71 region between 5′ LTR and gag. The putative S71 and the pCRENV1 env sequences were translated in all three reading frames, and similarities were assessed by GeneWorks dot matrix analysis of these sequences and other Env proteins derived from various mammalian C-type exogenous and endogenous retroviruses. Sequences identified as partners of diagonals in the dot matrix comparisons were extracted and refined by multiple alignment (GeneWorks). These selected peptide regions were analyzed with the GCG programs PileUp and ProfileMake. The SwissProt data bank was scanned with the obtained profiles using the GCG program ProfileSearch.

Accession numbers. The following sequences were used: AKV: AC J01998, J01999, X00016, X00017, K01394; BAEV: AC D10032, D00088, N00088; GALV: AC M26927; FLK03162: K03162; FCVENVC: AC M14331; FCVENVGEN: AC M89997; FCVGAENV: AC X01209; FCVGLENV: M12500; FCVSTENV: AC X01208; MCF: AC K02725; Friend MuLV: AC M93134; RD114: AC X87829, X59001; RTVL-H2: AC M18048; S71 5′LTR: AC Y10938; S71 3′LTR: AC Y10937; S71ENVLTR: AC Y10939.
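The tRNA comparison described under computer-assisted sequence analysis (terminal 18 bases of a tRNA gene, aligned as inverse complement to the region between 5′ LTR and gag) can be illustrated with a minimal sketch. The function names and the toy sequences below are hypothetical and are not taken from this study; the authors used alignment software, whereas this sketch only slides the probe along the region without gaps and counts identical positions.

```python
def reverse_complement(seq):
    """Return the inverse complement of a DNA sequence (A/C/G/T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def best_pbs_match(region, trna_terminal_18):
    """Slide the inverse complement of the tRNA 3' end along `region`.

    Returns (number of matching positions, 0-based start) of the best
    ungapped window; the first window wins in case of a tie.
    """
    probe = reverse_complement(trna_terminal_18)
    region = region.upper()
    best = (-1, -1)
    for start in range(len(region) - len(probe) + 1):
        window = region[start:start + len(probe)]
        matches = sum(a == b for a, b in zip(window, probe))
        if matches > best[0]:
            best = (matches, start)
    return best


# Toy data (invented for illustration, not a real tRNA or S71 sequence).
toy_trna_end = "ACGTTGCAAGGCTTACCA"
# Region = 4 filler bases + probe with one mismatch + 4 filler bases.
toy_region = "CCCC" + "TGGTACGCCTTGCAACGT" + "AAAA"

print(best_pbs_match(toy_region, toy_trna_end))  # (17, 4), i.e. 17/18 matches
```

A result such as 17/18 corresponds to the way primer binding site matches are reported in the Results; a gapped alignment would be needed to tolerate insertions or deletions.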
RESULTS
Retrovirus-related sequences in the human genome are usually identified by overall sequence similarity of the nucleotide or the deduced amino acid sequences with known retroviral genes. The putative LTR and env regions of the S71 element showed no overall sequence similarity to known retroviral sequences (Fig. 1). Therefore, characteristic features of retroviral LTRs
BLUSCH ET AL.
FIG. 1. Strategy used for the identification of retroviral LTR and Env sequences in S71. Dot matrices show the actual results of nucleotide sequence comparisons of ERV-3 LTR and S71 LTR (top) and ERV-3 env with S71 env sequences (bottom). Both regions in S71 are identical to the sequences verified in the analysis to represent S71 LTR and env-related sequences (see text for details). The second step summarizes the development of the LTR and the Env elements/models (for details see text). The last step shows the comparison of the general models with the actual elements found in the S71 sequence. Black boxes indicate individual elements. LTR model: TIR, terminal inverted repeat; USE, upstream element; CCAAT, CAAT-box; TATA, TATA-box; RIR, R-hairpin; PAS, poly(A) signal; PDS, poly(A) downstream element; 5IR, U5-hairpin. The arrows indicate the terminal inverted repeats. Details of the model are described elsewhere (Frech et al., 1997). Env model: gp70-hp1; gp70-hp2; FP, fusion peptide; ISP, immunosuppressive peptide; TM, transmembrane segment. The white boxes indicate the polypurine tract (PPT) downstream of the env gene.
and env genes had to be defined to trace back the retroviral origin of S71 sequences. We relied on comparative analysis of known mammalian C-type retroviral genomes to deduce some characteristic features from both LTRs and env gene/protein sequences. A characteristic pattern of sequence elements had already been derived from comparison of known sequences of Lentivirus LTRs (Frech et al., 1996). These organizational models allowed extraordinarily high specificity and improved sensitivity compared to conventional alignment strategies (Frech et al., 1996). We applied this element-based strategy to define and detect mammalian C-type LTRs and to identify env-related sequences of the S71 proviral structure (Fig. 1).

Definition of the S71 3′-LTR and targeted isolation of the corresponding 5′-LTR. The complete S71 sequence known so far was analyzed by the program ModelInspector for the presence of mammalian C-type LTRs for which a model has been defined (Frech et al., 1997). A single C-type-LTR-related sequence was detected about 1.1 kb downstream of the pol-related sequences and downstream of the previously defined LTR-like sequence of S71 (Brack-Werner et al., 1989). A sequence very similar to the polypurine rich tract
(PPT) was present in S71 (Fig. 4b) immediately upstream of the putative LTR, which would be compatible with a 3′ LTR. This 3′ LTR was subsequently verified by the targeted isolation of the corresponding 5′ LTR. This was achieved by PCR amplification with DNA from λ clone S71lK5 (Leib-Mösch et al., 1986) and three different forward primers derived from the putative 3′ LTR region, which were combined with a reverse primer located in the S71 gag preceding region (Fig. 2a). No prominent PCR product could be obtained from human genomic DNA templates. An unexpectedly large amplification product of at least 1.6 kb was observed (Fig. 2b). This fragment was shown to contain the S71 LTR sequence by direct sequence analysis.

FIG. 2. Targeted isolation of the S71 5′ LTR. (a) Location of the primers relative to the S71 sequence used for PCR amplification of the 5′ LTR-leader region of S71. Forward primers a (990), b (1136), and c (1550) were derived from the predicted 3′ LTR and combined with reverse primer 3367 located in the S71 gag preceding region. (b) Agarose gel electrophoresis of the PCR products obtained. The primer combinations described above and primer pair 7712/7788 (positive control) were tested under stringent annealing conditions in PCR experiments using no template, human genomic DNA, S71-4, or pS71JB2 as template. Lanes 1, 18, 1-kb ladder. Lanes 2, 6, 10, 14, negative controls for primer pairs 990/3367 (2), 1136/3367 (6), 1550/3367 (10), 7712/7788 (14). Lanes 3–5, 990/3367 with human genomic DNA (3), S71-4 (4), and pS71JB2 (5) as template. Lanes 7–9, 1136/3367 with human genomic DNA (7), S71-4 (8), and pS71JB2 (9) as template. Lanes 11–13, 1550/3367 with human genomic DNA (11), S71-4 (12), and pS71JB2 (13) as template. Lanes 15–17, 7712/7788 with human genomic DNA (15), S71-4 (16), and pS71JB2 (17) as template. All fragments identified by arrows were verified by direct sequencing. There is a very weak band in human genomic DNA, which may be due to primer quenching since the S71 leader is a multicopy sequence in genomic DNA (see Fig. 5).

Detailed comparison of S71 sequences around the LTRs revealed several additional retroviral features. A direct repeat (ACTA) in the integration locus flanking the "terminal inverted repeats" (TGT/ACA) of the LTR was identified, which suggested a genuine retroviral integration process (Varmus and Brown, 1989). Alignment of the S71 5′ and 3′ LTRs (Fig. 3) showed high sequence similarity to each other (7% sequence divergence) but no clear sequence similarity to either the AKV/SSV group (data not shown) or the human endogenous retrovirus ERV-3 (Fig. 1). A FASTA scan of GenBank revealed no additional matches. The U3, R, and U5 substructures of the S71 LTRs were assigned on the basis of DNA elements detected by the program ConsInspector (Frech et al., 1993). None of the typical regulatory elements found in AKV-like C-type LTRs (e.g., CAAT box, NF1, GRE, LVb binding motifs) were detected in the putative U3 region, except a TATA box, although putative binding sites for several other transcription factors were detected by the program MatInspector (Quandt et al., 1995). In contrast to the U3 region, the S71 LTR R and U5 regions contain most of the typical C-type retroviral structural elements. Mammalian C-type TATA boxes and Cap sites (CTTCGGGGCTGA in the 3′ LTR and CTTCGGGGCCGA in the 5′ LTR; Frech and Werner, 1996) were identified. A potential R hairpin (Cupelli and Lenz, 1991) with ΔG = −10.10 kcal/mol (5′-LTR) and −11.60 kcal/mol (3′-LTR) was identified by the program ModelInspector (Frech et al., 1997). A poly(A) signal (AATAAA) was found at the expected distance (70–80 bp) downstream of the TATA box in both LTRs (70 bp for the 5′ LTR and 73 bp for the 3′ LTR). Finally, the poly(A) addition site was assigned to the first CA after the poly(A) signal, and a putative poly(A) downstream
signal typically located at a distance of 10–30 nucleotides downstream of the poly(A) signal (McLauchlan et al., 1985) was identified. The best match found for such a signal in both LTRs is AGCGTTTCT (general consensus sequence YGTGTTTYY described by McLauchlan et al., 1985), although it was different from the consensus for C-type retroviral poly(A) downstream sites (Frech et al., 1993). A summary of these findings is shown in Fig. 3a.

FIG. 3. (a) Alignment of S71 5′ and 3′ LTR sequences. Only nucleotides different from the consensus (shown below the individual sequences) are shown. The LTR elements terminal inverted repeat (TIR), TATA box (TATA), Cap site (Cap), poly(A) signal (PAS), and poly(A) downstream element (PDS) are boxed. The R-hairpin (RIR) is indicated by arrows. The S71 LTR U3 region includes bp 1–456 (5′ LTR), 1–457 (3′ LTR), the R region bp 457–525 (5′ LTR), 458–526 (3′ LTR), and the U5 region bp 526–563 (5′ LTR), 527–564 (3′ LTR). (b) Alignment of tRNA sequences with the primer binding site (PBS) of S71. Only nucleotides different from the S71 PBS (on top) are shown. (c) Alignment of S71 leader region positions 1095–1235 with the Chimpanzee RTVL-Ib leader region. The S71 sequence starts at position 531 and the RTVLH sequence starts at position 670, both with respect to the end of the LTR (TIR). Identical nucleotides are boxed.

Identification of the S71 leader region. The sequence between the S71 5′ LTR and the start codon of the gag gene encompassed 1656 nucleotides, which is an unusually large leader region, more than twice the length of an average retroviral C-type leader region. The longest leader sequence in the database is the RTVL-H2 leader with 945 bases (AC M18048), which is not a C-type. No duplication or insertion of a known sequence was detected in this putative leader region by dot matrix and FASTA analysis. The PCR products obtained from genomic DNA were always of the size expected for S71 but were not S71 sequence-specific, as indicated by ambiguous results in direct sequence analysis (data not shown). A tRNA binding site (PBS) with 17/18 matches to tRNAThr from Mus musculus and 16/18 matches to the human tRNAsGln Nug1 and 2 (Fig. 3b) was detected downstream of the 5′ LTR (genomic sequences and PCR fragments). No clear evidence for a packaging signal was found. The only sequence similar to the S71 leader region was part of the Chimpanzee RTVL-Ib putative leader region (Maeda and Kim, 1990; 66% identity over 127 nucleotides; Fig. 3c). The S71 leader sequence contains several candidate splice acceptor sites and a single splice donor site AAGGTAAGT 190 bp downstream of the PBS, exactly matching the published consensus (Ohshima and Gotoh, 1987; Horowitz and Krainer, 1994).
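The element-and-spacing reasoning used for the LTRs (for example, a poly(A) signal AATAAA located 70–80 bp downstream of the TATA box) can be illustrated with a minimal sketch. This is not ModelInspector and does not reproduce its models: it uses exact string matching, the TATA motif "TATAAA" is an assumed stand-in for illustration, the distance is assumed to be measured from motif start to motif start, and the toy sequence is invented.

```python
def find_all(seq, motif):
    """Return all 0-based start positions of `motif` in `seq` (overlaps allowed)."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def spaced_pairs(seq, upstream, downstream, min_gap, max_gap):
    """Return (i, j) start positions where `downstream` begins between
    min_gap and max_gap bases after the start of `upstream`."""
    seq = seq.upper()
    pairs = []
    for i in find_all(seq, upstream):
        for j in find_all(seq, downstream):
            if min_gap <= j - i <= max_gap:
                pairs.append((i, j))
    return pairs


# Toy sequence: an assumed TATA motif at position 10 and AATAAA 73 bp downstream.
toy_ltr = "G" * 10 + "TATAAA" + "C" * 67 + "AATAAA" + "G" * 10

print(spaced_pairs(toy_ltr, "TATAAA", "AATAAA", 70, 80))  # [(10, 83)]
print(spaced_pairs(toy_ltr, "TATAAA", "AATAAA", 10, 30))  # []
```

Requiring two or more elements in a defined order and spacing is what makes such a check more selective than searching for any single short motif, which by itself occurs frequently by chance.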
Identification of S71 env-related sequences. The identification of both LTRs established a region of approximately 1.1 kb between pol-related sequences and the 3′-LTR, which should be env-related. However, since this is only about half the length of a typical C-type env gene (which should be about 2 kb; Hunter and Swanstrom, 1990), the putative S71 env appeared to be severely deleted. S71 was previously shown to be a member of a middle repetitive gene family (Leib-Mösch et al., 1986). This allowed a PCR approach for isolation of less deleted env regions from other family members.
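Because deletions and mutations can shift the reading frame, the env-related candidate regions were examined in all three forward reading frames (see Material and Methods). A minimal sketch of such a three-frame translation is given here, assuming the standard genetic code; the function name is hypothetical, the example sequence is a toy, and the original analysis was done with GeneWorks rather than with code of this kind.

```python
# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_three_frames(dna):
    """Translate the forward strand in reading frames 1, 2, and 3.

    Stop codons are written as '*', codons with unknown bases as 'X';
    incomplete trailing codons are ignored.
    """
    dna = dna.upper()
    frames = {}
    for offset in range(3):
        peptide = []
        for pos in range(offset, len(dna) - 2, 3):
            peptide.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
        frames[offset + 1] = "".join(peptide)
    return frames


print(translate_three_frames("ATGGCCTGA"))  # {1: 'MA*', 2: 'WP', 3: 'GL'}
```

Conserved peptide motifs can then be searched in each frame separately, which is how a motif may turn up in frame 1 of one family member and in frame 3 of another.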
The largest S71-related fragment obtained by PCR from human genomic DNA was pCRENV1. It encompassed 1482 bp, only 151 bp more than the corresponding S71 region. No evidence for a longer S71-related putative env gene in the human genome was obtained in these experiments. Sequence comparison of pCRENV1, S71, and pl124, an expressed S71 family member (Leib-Mösch et al., 1992), revealed a highly similar stretch in the candidate env region (Fig. 4a). The pCRENV1 sequence showed only 48–50% overall nucleotide identity with the env genes of AKV, GALV, and FeLV, reaching 56% similarity with the p15E region (data not shown). However, a number of short, much better conserved peptide regions could be identified within the deduced Env proteins of C-type retroviruses. Within the transmembrane protein (p15E) three conserved peptides, the fusion, immunosuppressive, and the membrane anchor peptides, have been previously identified (Gallaher et al., 1989). Of these only the membrane anchor peptide could be found in reading frame 1 of pCRENV1 and reading frame 3 of S71. Putative peptides from all three reading frames of pCRENV1 were compared in protein dot matrices to Env proteins of other C-type retroviruses. Two additional moderately conserved peptide modules present in pCRENV1 and other Env proteins (GaLV, murine and feline leukemia, and sarcoma viruses) were detected within the surface protein gp70 around amino acids 100 and 200, respectively. They were designated gp70-homologous peptides 1 and 2 (gp70-hp1 and gp70-hp2; Figs. 4a and 4b). Corresponding peptides from retroviral Env proteins excluding the pCRENV1 sequences were used to generate search profiles for the GCG ProfileSearch program. The profiles recognized all three motifs in the pCRENV1 reading frames (Fig. 4b). Table 1 shows the specificity of the profiles for retroviral Env proteins. The gp70-hp1 and gp70-hp2 motifs could be identified in S71 reading frames 2 and 1 by comparison with the pCRENV1 motifs (Figs. 5a and 5b) despite their low scoring against the profiles (Table 1, Table 2). The deletion in pCRENV1 was estimated to encompass about 800 nucleotides from the region coding for the C-terminal half of the surface protein up to the membrane anchor of the transmembrane protein (Fig. 4a).

TABLE 1
Database Search with Conserved Peptides Derived from Retroviral Env Proteins

Motif      No. of matches   Scores(a)   Sequence          Specification (No. of sequences)
gp70-hp1   32               23–17       Env               Mammalian C-type (32)
           >100             16–10       Not Env-related   (all)
gp70-hp2   31               24–21       Env               Mammalian C-type (31)
           1                18.33                         end. feline retrovirus (1)
           0                17–13                         (0)
           >100             12–9        Not Env-related   (all)
TM         23               42–38       Env               Mammalian C-type (23)
           2                37–33                         Mammalian C-type (2)
           3                32–28                         Mammalian D-type (2), Mammalian C-type (1)
           1                27–23                         Avian C-type (1)
           2                22–18                         Mammalian C-type (1), Avian C-type (1)
           >100             17–13       Not Env-related   (all)

(a) Alignment scores determined by the GCG program ProfileSearch.

Genomic characterization of the S71 provirus. The S71 retrovirus-related gag and pol genes were complemented by the newly defined LTRs, the leader region, and the env-related region to a proviral structure of 8276 bp. Genomic Southern blot analysis of human DNA with S71 LTR (Fig. 5a), leader (Fig. 5b), and the 3′ integration locus probes (Fig. 5c) revealed a multitude of bands with the S71 LTR and leader probes, as expected because S71 is a member of a multigene family (Leib-Mösch et al., 1986). The 3′ integration locus appeared to be single copy. Consistently, HindIII-digested human genomic DNA showed a single 3.2-kb band, whereas the signal after EcoRI digestion split to the known RFLP bands of 13.5 and 6.8 kb (Leib-Mösch et al., 1989). Therefore the 3′ integration locus probe circumvented the problems associated with cross-hybridization of other S71 family members and should be especially useful in combination with the refined restriction map of the S71 locus that spans a region of nearly 18 kb and is based on the sequenced plasmid subclones of the region. In addition to this, the S71 locus can now be identified by routine PCR analysis with primer pairs 7712/7788 detecting the S71 5′ LTR (see the control lane of Fig. 2) and 1550/4791, which is specific for the 3′ integration locus (data not shown). These primer combinations yield a single fragment after amplification from genomic DNA and give unambiguous sequence data after direct sequence analysis.

DISCUSSION

We demonstrated the successful application of a new strategy for the identification of distantly related sequences with no distinct overall similarity. This strategy is based on the identification of the presence and spatial organization of short sequence motifs (in both DNA and amino acid sequences). The identification of the missing parts of the S71 provirus env-related sequences and both LTRs showed this strategy to work in principle for both DNA (LTRs) and protein sequences (Env). Multiple additional elements and their correct localization in the proviral context as well as experimental results added further evidence to the computer-assisted identification of the new proviral sequences. As a result the full proviral structure of the S71 locus could be established.

The basis for our strategy was the observation that sequence elements directly related to functional properties of the sequences (e.g., protein binding sites in the LTRs) are better conserved than the overall sequence as reported previously (Frech et al., 1996). These elements are characteristic enough to allow detection of LTRs even if several of the sites are missing, as in the case of S71. The targeted identification of the S71 5′ LTR as well as the detection of various additional retroviral features (terminal repeat, primer binding site, leader similarity, polypurine tract) confirmed the predictions. A conventional "hybridization/sequencing" approach to identify the 5′ LTR would have been possible but tedious even with a phage clone carrying the whole S71 locus and very difficult or even impossible for genomic library screening, since the leader region is a multicopy sequence. The 5′ and 3′ S71 LTR sequences show 7% differences. Based on the criteria for linkage of sequence divergence and age described by Mager and Freeman
(1995), S71 entered the primate germline at least 30 million years ago. This agrees with previous findings, showing the S71 structural genes to be present already in Old World Monkeys (Leib-Mo¨sch et al., 1992). Here we demonstrate that the concept of modular sequence organization can be extended to protein sequences, at least in the case of retroviral Env proteins. The combination of this modular approach with a comparison of several phylogenetically related sequences (S71 and PCRENV1) allowed identification of S71 envrelated sequences despite severe deletions. In addition to the known conserved peptides within the transmembrane protein (Gallaher et al., 1989), we identified two new marker peptides (gp70-hp1 and gp70-hp2) within the surface protein gp70. These conserved peptides colocalize with lethal mutations in the Moloney murine leukemia virus env gene defined by random insertional mutagenesis (Gray and Roth, 1993). Both gp70-hp1 and gp70-hp2 motifs in S71 could not be clearly identified individually by ProfileSearch as significant against the background of the whole SwissProt database, though apparently similar (Fig. 4). However, no nonretroviral sequence contained both motifs in the approximate distance, which demonstrated the discriminative power of motif-based models for protein sequences. Systematic analysis of Env proteins with the program DIALIGN (Morgenstern et al., 1996) revealed several additional conserved motifs in retroviral Env proteins, which will be analyzed in more detail in another study (Morgenstern et al., in preparation). We described residual transcription control elements within this envrelated region previously (Brack-Werner et al., 1989). It
FIG. 4. Summary of Env-related peptides and env structure of S71-related sequences. (a) Location of the five diagnostic peptides. gp70hp1 and gp70-hp2 were newly defined in this study. FP, fusion peptide; ISP, immunosuppressive peptide; TM, transmembrane peptide (membrane anchor). For S71, pCRENV1, and PL124 all three reading frames for the env-related region are shown, and the locations of the peptides are indicated by boxes. The polypurine tract (PPT, hatched box) and the 3* LTRs (open box) are also indicated. (b) Alignments of the Env peptides for the three regions also present in S71 and pCRENV1. Sequences are marked by their database identifier joined to the peptide name. The S71 sequence for gp70-hp2 was not part of the alignment and was added manually. PPT is a nucleotide sequence alignment of polypurine tract sequences. The beginning of the 3* LTR is indicated by an arrow. The sequence order was determined by GeneWorks.
FIG. 5. Southern blot analysis. Human and mouse genomic DNA were digested with EcoRI and HindIII and hybridized with probes for the S71 LTRs (a), leader (b), and 3′ integration locus (c).
has since been shown that even functional promoter structures are present in the env gene of human foamy virus (Löchelt et al., 1993) and are involved in HTLV-I Tax expression (Nosaka et al., 1993). Thus, a dual function for retroviral env genes (coding and transcription control) may be a more general phenomenon.
Our results show the gag- and pol-related sequences of S71 (SSAV1 locus) to be part of a proviral structure with multiple internal deletions. Southern blot analysis showed the 3′ site of the integration locus to be a single-copy sequence in the human genome. Phylogenetic analysis in several primate species confirmed this to be a single locus in Old World primates (Blusch et al., in preparation). Furthermore, S71's unique 3′ integration locus on 18q21 may be of interest as a chromosomal marker, since this genomic locus is associated with several diseases (Brack-Werner et al., 1989; van Kessel et al., 1994; Schenk et al., 1996). Finally, the absence of gp70-hp1 and gp70-hp2 from the env genes of the endogenous retroviruses BaEV and RD114, in contrast to GaLV, Akv, or MuLV, agrees with previous results showing S71 to be most similar to infectious mammalian retroviruses (Werner et al., 1990).
TABLE 2 Motif Matches in S71-Related Sequences
Motif scores
Sequence        gp70-hp1   gp70-hp2   Membrane anchor
C-type spec:    >17        >16        >18
S71             11.95      11.67      26.67
pCRENV1         14.18      13.63      29.40

ACKNOWLEDGMENTS
The help of U. Linzner with synthesizing deoxyoligonucleotides is gratefully acknowledged. We thank Dr. R. Balling for critically reading the manuscript. Part of this work was supported by EU Grant BI04-CT95-0226 (TRADAT) and by EU Grant GENE-CT93-0019.

REFERENCES
Boyd, A. C. (1993). Turbo cloning: A fast, efficient method for cloning PCR products and other blunt-ended DNA fragments into plasmids. Nucleic Acids Res. 21: 817–821.
Brack-Werner, R., Barton, D. E., Werner, T., Foellmer, B. E., Leib-Mösch, C., Francke, U., Erfle, V., and Hehlmann, R. (1989). Human SSAV-related endogenous retroviral element: LTR-like sequence and chromosomal localization to 18q21. Genomics 4: 68–75.
Cupelli, L. A., and Lenz, J. (1991). Transcriptional initiation and postinitiation effects of Murine Leukemia Virus Long Terminal Repeat R-region sequences. J. Virol. 65: 6961–6968.
The concept of modular sequence organization is not new; it is well established for other functional regions, such as promoters (Prestridge, 1995; Frech et al., 1996). We show here that this concept can also be applied successfully to the identification of highly diverged protein sequences. We therefore propose that this methodology may help trace the putative origin of sequences across larger evolutionary distances in general. Successful identification of endogenous retroviral sequences may be just one example of the potential of this strategy.
Feuchter-Murphy, A. E., Freeman, J. D., and Mager, D. L. (1993). Splicing of a human endogenous retrovirus to a novel phospholipase A2 related gene. Nucleic Acids Res. 21: 135–143.
Frech, K., Herrmann, G., and Werner, T. (1993). Computer-assisted prediction, classification, and delimitation of protein binding sites in nucleic acids. Nucleic Acids Res. 21: 1655–1664.
Frech, K., Brack-Werner, R., and Werner, T. (1996). Common modular structure of Lentivirus LTRs. Virology 224: 256–267.
Frech, K., and Werner, T. (1996). Specific modeling of regulatory units in DNA-sequences. In “Pacific Symposium on Biocomputing '97” (R. Altman, A. K. Dunker, L. Hunter, and T. E. Klein, Eds.), pp. 151–162.
Frech, K., Danescu-Mayer, J., and Werner, T. (1997). A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J. Mol. Biol., in press.
Gallaher, W. R., Ball, J. M., Garry, R. F., Griffin, M. C., and Montelaro, R. C. (1989). A general model for the transmembrane proteins of HIV and other retroviruses. AIDS Res. Hum. Retroviruses 5: 431–440.
Goodchild, N. L., Freeman, D., and Mager, D. L. (1995). Spliced HERV-H endogenous retroviral sequences in human genomic DNA: Evidence for amplification via retrotransposition. Virology 206: 164–173.
Gray, K. D., and Roth, M. J. (1993). Mutational analysis of the envelope gene of Moloney Murine Leukemia Virus. J. Virol. 67: 3489–3496.
Haltmeier, M., Seifarth, W., Blusch, J., Erfle, V., Hehlmann, R., and Leib-Mösch, C. (1995). Identification of S71-related human endogenous retroviral sequences with full-length pol genes. Virology 209: 550–560.
Horowitz, D. S., and Krainer, A. R. (1994). Mechanisms for selecting 5′ splice sites in mammalian pre-mRNA splicing. Trends Genet. 10: 100–106.
Hunter, E., and Swanstrom, R. (1990). Retrovirus envelope glycoproteins. Curr. Top. Microbiol. Immunol. 157: 187–253.
van Kessel, A. G., Straub, R. E., Silverman, G. A., Gerken, S., and Overhauser, J. (1994). Report of the second international workshop on human chromosome 18 mapping. Cytogenet. Cell Genet. 65: 141–165.
Krieg, A. M., Gourley, M. F., and Perl, A. (1992). Endogenous retroviruses: Potential etiologic agents in autoimmunity. FASEB J. 6: 2537–2544.
Leib-Mösch, C., Brack, R., Werner, T., Erfle, V., and Hehlmann, R. (1986). Isolation of an SSAV-related endogenous sequence in human DNA. Virology 155: 666–676.
Leib-Mösch, C., Barton, D., Geigl, E.-M., Brack-Werner, R., Hehlmann, R., Erfle, V., and Francke, U. (1989). Two RFLPs associated with the human endogenous retroviral element S71 on chromosome 18q21. Nucleic Acids Res. 17: 2367.
Leib-Mösch, C., Bachmann, M., Brack-Werner, R., Werner, T., Erfle, V., and Hehlmann, R. (1992). Expression and biological significance of human endogenous retroviral sequences. Leukemia 6: 72S–75S.
Leib-Mösch, C., Haltmeier, M., Werner, T., Geigl, E.-M., Brack-Werner, R., Francke, U., Erfle, V., and Hehlmann, R. (1993). Genomic distribution and transcription of solitary HERV-K LTRs. Genomics 18: 261–269.
Löchelt, M., Muranyi, W., and Flügel, R. M. (1993). Human foamy virus genome possesses an internal, Bel-1-dependent and functional promoter. Proc. Natl. Acad. Sci. USA 90: 7317–7321.
Löwer, R., Boller, K., Hasenmaier, B., Korbmacher, C., Müller-Lantzsch, N., Löwer, J., and Kurth, R. (1993). Identification of human endogenous retroviruses with complex mRNA expression and particle formation. Proc. Natl. Acad. Sci. USA 90: 4480–4484.
Maeda, N., and Kim, H.-S. (1990). Three independent insertions of retrovirus-like sequences in the haptoglobin gene cluster of Primates. Genomics 8: 671–683.
Mager, D. L., and Goodchild, N. L. (1989). Homologous recombination between the LTRs of a human retrovirus-like element causes a 5-kb deletion in two siblings. Am. J. Hum. Genet. 45: 848–854.
Mager, D. L., and Freeman, J. D. (1995). HERV-H endogenous retroviruses: Presence in the New World Branch but amplification in the Old World primate lineage. Virology 213: 395–404.
McLauchlan, J., Gaffney, D., Whitton, L. J., and Clements, J. B. (1985). The consensus sequence YGTGTTYY located downstream from the AATAAA signal is required for efficient formation of mRNA 3′ termini. Nucleic Acids Res. 13: 1347–1368.
Miller, S. A., Dykes, D. D., and Polesky, H. F. (1988). A simple salting out procedure for extracting DNA from human nucleated cells. Nucleic Acids Res. 16: 1215.
Morgenstern, B., Dress, A., and Werner, T. (1996). Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. USA 93: 12098–12103.
Nosaka, T., Arimuri, Y., Sakurai, M., Takeuchi, R., and Hatanaka, M. (1993). Novel internal promoter/enhancer of HTLV-I for Tax expression. Nucleic Acids Res. 21: 5124–5129.
Ohshima, Y., and Gotoh, Y. (1987). Signals for the selection of a splice site in pre-mRNA. Computer analysis of splice junction sequences and like sequences. J. Mol. Biol. 195: 247–259.
Prestridge, D. S. (1995). Predicting Pol II promoter sequences using transcription factor binding sites. J. Mol. Biol. 249: 923–932.
Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. (1995). MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23: 4878–4884.
Ronfort, C., Girod, A., Cosset, F.-L., Legras, C., Nigon, V. M., Cheblune, Y., and Verdier, G. (1995). Defective retroviral endogenous RNA is efficiently transmitted by infectious particles produced on an avian retroviral vector packaging cell line. Virology 207: 271–275.
Sambrook, J., Fritsch, E. F., and Maniatis, T. (1989). “Molecular Cloning: A Laboratory Manual,” 2nd ed., Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Sauter, M., Schommer, S., Kremmer, S., Remberger, K., Dölken, G., Lemm, I., Buck, M., Best, B., Neumann-Haefelin, D., and Mueller-Lantzsch, N. (1995). Human endogenous retrovirus K10: Expression of gag protein and detection of antibodies in patients with seminomas. J. Virol. 69: 414–421.
Schenk, M., Leib-Mösch, C., Schenck, I. U., Jaenicke, M., Indraccolo, S., Saeger, H.-D., Dallenbach-Hellweg, G., and Hehlmann, R. (1996). Lower frequency of allele loss on chromosome 18q in human breast cancer than in colorectal tumors. J. Mol. Med. 74: 155–159.
Seifarth, W., Skladny, H., Krieg-Schneider, F., Reichert, A., Hehlmann, R., and Leib-Mösch, C. (1995). Retrovirus-like particles released from the human breast cell cancer cell line T47-D display type B- and C-related endogenous retroviral sequences. J. Virol. 69: 6408–6416.
Taruscio, D., and Manuelidis, L. (1991). Integration site preferences of endogenous retroviruses. Chromosoma 101: 141–156.
Tchenio, T., and Heidmann, T. (1991). Defective retroviruses can disperse in the Human genome by intracellular transposition. J. Virol. 65: 2113–2118.
Varmus, H., and Brown, P. (1989). Retroviruses. In “Mobile DNA” (D. E. Berg and M. M. Howe, Eds.), pp. 53–108, Am. Soc. Microbiol., Washington DC.
Werner, T., Brack-Werner, R., Leib-Mösch, C., Backhaus, H., Erfle, V., and Hehlmann, R. (1990). S71 is a phylogenetic distinct human endogenous retroviral element with structural and sequence homology to simian sarcoma virus (SSV). Virology 174: 225–238.
Wilkinson, D. A., Mager, D. L., and Leong, J. C. (1994). Endogenous human retroviruses. In “The Retroviridae” (J. A. Levy, Ed.), Vol. 3, pp. 465–535, Plenum, New York.