GENOMICS 37, 147–160 (1996)
ARTICLE NO. 0536
Sequence Analysis in the Olfactory Receptor Gene Cluster on Human Chromosome 17: Recombinatorial Events Affecting Receptor Diversity

GUSTAVO GLUSMAN,* SANDY CLIFTON,† BRUCE ROE,† AND DORON LANCET*,¹
*Department of Membrane Research and Biophysics, The Weizmann Institute of Science, Rehovot 76100, Israel; and †Department of Chemistry and Biochemistry, University of Oklahoma, Oklahoma 73019

Received April 19, 1996; accepted July 29, 1996
A cosmid clone covering a region of high olfactory receptor (OR) gene density inside the OR gene cluster on human chromosome 17 (17p13.3) was subjected to shotgun automated DNA sequencing. The resulting 40-kb sequence revealed three known OR coding regions, as well as a new OR pseudogene (OR17-25), fused to one of the previously identified OR genes (OR17-24). The suggested mechanism for the generation of this doublet structure involves an initial duplication mediated by flanking repeats and a subsequent deletion via nonhomologous recombination. Sequence analysis further suggests that the two other OR genes present in the cosmid (OR17-40 and OR17-228) may have evolved by ancient tandem duplication of an 11-kb fragment, mediated by recombination between mammalian-wide interspersed repeats. The duplicated genes appear to be complete and potentially functional. Their conserved structure reveals a long upstream intron and a previously uncharacterized 5′ noncoding exon. No additional genes could be discerned in the cosmid, suggesting that the cluster may be part of a dedicated OR subgenome. © 1996 Academic Press, Inc.
INTRODUCTION
Olfactory receptors (ORs)² are seven-transmembrane-domain proteins (Buck and Axel, 1991; Lancet and Pace, 1987; Reed, 1990), representing at least eight families of the G-protein-coupled receptor (GPCR) superfamily (Lancet and Ben-Arie, 1993). OR genes are thought to be expressed with clonal specificity, with one specific OR gene being expressed in every olfactory cell (Lancet, 1986). In contrast to the immunoglobulin system, where proteins specific for diverse antigenic ligands are generated by a complex system of somatic recombination and clonal selection, olfactory receptors are present in the genome in a large germ-line repertoire, and their ligand binding can be described by a probabilistic model (Lancet et al., 1993), with many receptors binding many ligands with different affinities. In this context, the addition of novel receptors confers the advantage of broadening the ligand spectrum that can be recognized. Conversely, OR gene loss can cause specific anosmias or reduced discriminating capability (Lancet et al., 1993).

Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under Accession No. U58675.
¹ To whom correspondence should be addressed. Telephone: 972-8-9343683. Fax: 972-8-9344112. E-mail: bmlancet@weizmann.weizmann.ac.il.
² Abbreviations used: CDS, coding sequence; Inr, initiator site; GPCR, G-protein-coupled receptor; ESL, estimated substitution level; L1Hs, LINE-1, Homo sapiens; LINE, long interspersed nuclear element; MER, medium frequency repeat; MIR, mammalian-wide interspersed repeat; Mya, million years ago; OR, olfactory receptor; ORF, open reading frame; SDR, short direct repeat; SINE, short interspersed nuclear element; TM, transmembranal domain.

OR genes have been found to be organized in the mammalian genome in several clusters (Ben-Arie et al., 1994; Griff and Reed, 1995; Sullivan et al., 1996). One of these clusters has recently been described by us on human chromosome 17 (17p13.3) (Ben-Arie et al., 1994) and includes at least 14 OR genes of the expected several hundred in the human olfactory subgenome. From the physical map of this cluster, no statement about the presence or absence of additional, non-OR genes could be made.

The coding region of OR genes is intronless (Buck and Axel, 1991; Nef et al., 1992), making their recognition straightforward by computational or experimental means. On the other hand, little is known about the OR gene structure upstream and downstream from the coding region; the mechanism and sequence requirements of transcriptional control are unknown.

In this paper, we describe the results of the complete sequencing of an OR-rich cosmid (cos39) spanning the center of the human chromosome 17 OR gene cluster.
The group of genes included in it is of particular interest because it represents the vast majority of the known members of the 3A subfamily of OR genes (Ben-Arie et al., 1994). The results reveal mechanisms involved in the generation and reduction of OR gene diversity, suggest a likely structure for the OR genes, and provide the first indications about their possible RNA splicing and transcriptional control.

0888-7543/96 $18.00 Copyright © 1996 by Academic Press, Inc. All rights of reproduction in any form reserved.

MATERIALS AND METHODS

Reagents and equipment. The library identifier of the cosmid sequenced (cos39) is ICRFc105B01193, from the fivefold coverage chromosome 17 cosmid library, isolated from human cell line LCL127 (Nizetic et al., 1991) and maintained in the Imperial Cancer Research Fund, United Kingdom. Other reagents and equipment are described elsewhere (Bodenteich et al., 1993).

Sequencing procedure. Cosmid DNA isolation, random shotgun cloning, fluorescence-based DNA sequencing, and computer analysis were performed as described (Bodenteich et al., 1993; Chissoe et al., 1995; Pan et al., 1994). Cos39 DNA was isolated free from host genomic DNA via a cleared lysate-diatomaceous earth-based protocol (Pan et al., 1994). Subsequently, 50-μg portions of purified cosmid DNA were randomly sheared by nebulizing and made blunt-ended. After phosphorylation by T4 polynucleotide kinase and agarose gel purification, fragments in the 1- to 3-kb range were ligated into SmaI-cut, BAP-treated pUC18 vector (Stratagene). Calcium chloride competent Escherichia coli bacteria (strain XL1-Blue MRF′, Stratagene) were then transformed with the ligated DNA. Approximately 450 colonies were picked and grown in TB medium (Sambrook et al., 1989) supplemented with 100 μg of ampicillin for 14 h at 37°C with shaking at 250 rpm, and the sequencing templates were isolated by a cleared lysate filter-based protocol (Bodenteich et al., 1993). Sequencing reactions were performed as previously described (Chissoe et al., 1995) using Thermus aquaticus (Taq) DNA polymerase and fluorescent-labeled universal forward or reverse DNA sequencing primers.
The reactions were incubated for 30 cycles in a Perkin–Elmer Cetus DNA Thermocycler 9600, and after removal of excess primers by ethanol precipitation, the fluorescent-labeled nested fragment sets were resolved by electrophoresis on an ABI 373A DNA Sequencer, modified with a stretch upgrade. After base calling with the ABI Analysis software (Version 2.1), the analyzed data were transferred to a Sun SparcstationII, and the electrophoretograms were edited using the TED trace editor program (Dear and Staden, 1991; Gleeson and Staden, 1991). Initial assembly of the overlapping sequences and contigs (clone assemblies) was performed using the XBAP program (Dear and Staden, 1991; Gleeson and Staden, 1991). Gap closure was performed either by custom primer walking or by PCR amplification of the region corresponding to the gap in the sequence followed by subcloning into pUC18. In both instances, cycle sequencing reactions employed the Perkin–Elmer Cetus dye-labeled Taq terminators and custom synthesized primers. When necessary, additional synthetic custom primers were used to obtain sequences for at least threefold coverage for each base.

In parallel, the project was assembled on a Macintosh Quadra platform. Vector sequences and contaminants (e.g., E. coli genomic sequences) were identified and discarded using the Inherit Analysis software from Applied Biosystems Division, Perkin–Elmer. Assembly was performed using the Sequencher software from GeneCodes, by iterating over (1) automated assembly at decreasing stringency parameters and (2) interactive quality control after each automated round. Various assembly strategies were attempted, including removal of known repeated sequences (SINEs, LINEs, and OR genes) prior to assembly and their addition to the project after assembling the nonrepeated sequences. All the assembly strategies arrived at essentially the same sequence, without major rearrangements, as verified by dot-plot analysis.

Sequencing quality control.
The quality of the sequence can be estimated by comparing it to independently obtained sequences. The GenBank entry under Accession No. X80391, locus HSOR1740, submitted by M. L. Crowe, includes 1282 bp surrounding OR17-40. Comparison of this sequence with the relevant part of cos39 revealed only 16 discrepancies (10 mismatches and 6 ambiguities in X80391), clustered at both ends of X80391. Therefore, within 1099 contiguous
bp there is 100% identity between the independently derived sequences: the comparison shows that, in the sampled region, the cos39 sequence is of very high quality. Comparison of the resulting cosmid sequence with the previously published OR coding sequences reveals zero, two, and five nucleotide differences with HSUO4680 (OR17-24), HSUO4683 (OR17-40), and HSUO4713 (OR17-228), respectively. One of the differences with HSUO4683 is nonsilent; similarly, four of the differences with HSUO4713 imply amino acid sequence differences, making the predicted amino acid sequence of OR17-228 more similar to the consensus for OR genes (not shown).

Sequence analysis. Sequence handling was done using the commercial Inherit Analysis software from Applied Biosystems Division, Perkin–Elmer. General database searching was performed using Inherit Analysis and BLAST (Altschul et al., 1990); GenBank releases 88–91 were used, as they became available. Alu repeat classification was performed using the Pythia server (Jurka et al., 1992); other repetitive sequences were identified by comparing with RepBase, a reference collection of human repetitive elements, Release 3.6 (Jurka et al., 1992). Detailed alignments were constructed using MACAW (Schuler et al., 1991). Potential coding regions and gene structure were identified using GRAIL (Uberbacher and Mural, 1991), GRAIL II (Xu et al., 1994), and the hspl, fgeneh, and fexh programs of the BCM Genefinder (Solovyev et al., 1994). Analysis of transcription control regions was performed using the tssw program of the BCM Genefinder and MatInspector for Macintosh, Version 1.2 (Quandt et al., 1995).

Sequence comparison. The percentage of identity for pairwise comparisons was calculated by using the GAP program of the GCG package (Devereux et al., 1984), with gap opening and extension penalties of 5 and 0.3, respectively, after removal of short local duplications in the compared sequences.
Statistical significance was estimated by randomizing the sequences 50 times and calculating, for the tested alignment quality, the number of standard deviations away from the mean of the random alignments. Whenever relevant, this value is presented in brackets in Table 2; any alignment quality farther than 2 SD from the mean is considered significant.

Divergence and timing estimation. Estimations of the actual level of nucleotide substitution (estimated substitution level, ESL), accounting for superimposed substitutions, were made by using the one-parameter formula (Jukes and Cantor, 1969):
\[ d = 1 - \frac{\%\mathrm{ID}}{100}, \qquad \mathrm{ESL} = -100 \cdot \frac{3}{4} \ln\left(1 - \frac{4}{3}\,d\right), \]

where d is the divergence, calculated as shown from %ID, the pairwise percentage of identity. Translation from ESL to million years ago (Mya) was performed by using local molecular clocks as described for the ψη-globin gene locus (Bailey et al., 1991). When estimating the time of divergence of repetitive sequences, ESL from the published consensus of the relevant subfamily was used directly, assuming that mutations in retroposed elements are fixed neutrally. On the other hand, when comparing duplicated sequences, the mutual ESL is halved prior to translation to Mya, assuming that the two copies mutated independently over the same time period and at similar rates.

Long-range autocorrelation analysis. We have developed a simple custom algorithm for detecting internal homologies in the sequence. The algorithm scans the sequence, searching for all the occurrences of every different word present (of a preset size k), calculates the distances between all pairs of identical words, and pools them in 100-bp bins. Specifically, bin n includes all pairwise distances L for which 100(n − 1) < L ≤ 100·n, n = 1, 2, 3, . . . , therefore suppressing signals deriving from self-identity, i.e., L = 0.
FIG. 1. Outline of the full cosmid, indicating the locations of the OR genes (solid bars), the ORcr regions (dotted bars), and the repetitive sequences found: Alu repeats (solid circles), MIR repeats (open circles), L1 repeats (triangles), a MER repeat (cross-hatched bar) and the major dinucleotide repeats. Icons on top of the line represent elements with their 5′ end near the centromere, while icons under the line represent those with their 5′ end near the telomere, i.e., in the opposite direction.

The plot of the cumulative histogram of the distances L between all identical words has local maxima highly exceeding the expected values from random similarities: these maxima are expected to represent homologous sequences deriving from duplication, transposition, or retroposition events or from regeneration of simple sequences. This approach is equivalent to projecting the results of a dot-plot along the main diagonal and displaying the number of hits vs the distance from the main diagonal, but it lends itself to quantification and statistical analysis.
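The word-distance binning described under "Long-range autocorrelation analysis" can be sketched as follows. This is a minimal illustration and not the original program: the function name `word_distance_histogram`, its parameters, and the toy sequence are assumptions introduced here; only the word size k, the 100-bp bins, and the binning rule 100(n − 1) < L ≤ 100·n come from the text.

```python
from collections import defaultdict


def word_distance_histogram(seq, k=10, bin_size=100):
    """Count distances between all pairs of identical k-words, pooled in bins.

    Bin n (n = 1, 2, 3, ...) collects distances L with
    bin_size * (n - 1) < L <= bin_size * n, so L = 0 (self-identity)
    is never counted.
    """
    # Record every start position of every distinct word of size k.
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)

    # For each word, bin the distance between every pair of its occurrences.
    histogram = defaultdict(int)
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                distance = starts[b] - starts[a]              # always > 0
                n = (distance + bin_size - 1) // bin_size     # ceiling division
                histogram[n] += 1
    return dict(histogram)


# Toy usage (hypothetical sequence): the word ACGT occurs at positions 0 and 12,
# so the single distance of 12 falls in bin 3 when bins are 5 bp wide.
toy = "ACGTGGCCAATTACGT"
print(word_distance_histogram(toy, k=4, bin_size=5))  # {3: 1}
```

Bins whose counts rise well above the background expected from random similarity would then be candidates for the duplication and repeat signals discussed in the text. The pairwise loop is quadratic in the number of occurrences of a word, which is acceptable for a sketch but would deserve attention for long simple-sequence runs.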
RESULTS AND DISCUSSION
General Analysis of the Cosmid Sequence

The final sequence of the cos39 insert is 40176 bp. Its predicted EcoRI/HindIII restriction pattern (Fig. 1) is consistent with the available experimental data from partial digest mapping (Ben-Arie et al., 1994), enabling us to determine the orientation of the sequence in the framework of the whole cluster and therefore relative to the whole chromosome (Ben-Arie et al., 1994). The single-base composition of the full cosmid sequence is A = 29.1%; T = 28.1%; C = 21.5%; G = 21.3%. The weak-base bias (57.2% A + T vs 42.8% G + C) is typical for human genomic sequences and is intermediate between the isochore families (Bernardi, 1993) L (~40% G + C) and H1 (~45% G + C). A plot of the G + C content of the sequence along the cosmid (Fig. 2, G + C% panel) reveals extensive deviations from these average values, with long stretches of 25–30% G + C. No CpG islands were identified; CpG dinucleotides are 4.3-fold underrepresented, relative to the number expected from the single-base composition (other dinucleotides do not deviate significantly from random frequency). As expected from previous studies (Ben-Arie
et al., 1994), neither BssHII (GCGCGC) nor NotI (GCGGCCGC) restriction sites were found in the sequence. Interestingly, there is a very clear correlation between CpG clustering and Alu repeats (Fig. 2, gray bars connecting CpG and Alu panels), which is expected for newly inserted elements, before their CpG dinucleotides decay by mutation (Britten et al., 1988).

The cosmid was searched for open reading frames (ORFs, Fig. 2) and many short ORFs were found; the largest ones coincide with the OR genes or are located inside L1 repeats. GRAIL (Uberbacher and Mural, 1991), a multiple-sensor neural network trained to recognize protein-coding regions in genomic DNA sequence, assigned coding potential only to the larger ORFs, which represent four OR coding regions (Fig. 2, GRAIL panel). While no experimental transcript mapping has been performed, computational analyses indicate that no other, non-OR genes are present in the cosmid. This suggests that there could be an OR-dedicated “subgenome” from which non-OR genes would be excluded. Such an arrangement might prove to be related to the (yet undescribed) mechanism of activation of OR genes. Interestingly, all the OR-encoding ORFs have antisense overlapping open reading frames, probably reflecting the codon usage constraints (Merino et al., 1994).

Olfactory Receptor Coding Sequences

Three expected OR coding sequences (Ben-Arie et al., 1994) were found in the cosmid (OR17-24, OR17-40, and OR17-228), and their absolute orientation with respect to the cluster could be determined (Fig. 1), with
FIG. 2. A summary of sequence analyses on the full-length cosmid. From top to bottom: the G + C content, averaged with a running window of 500 bp; the number of CpG dinucleotides in a running window of 250 bp; the location and direction of Alu, MIR, and L1 repeats and a summary of the repetitive sequences of all types; the location and direction of the predicted coding regions according to GRAIL; and the locations and extent of the open reading frames longer than 300 bp, in all six frames. Stippled bars correlate Alu repeats with peaks of CpG concentration. Cross-hatched bars indicate OR coding ORFs; dotted bars indicate their antisense-overlapping ORFs.
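The windowed statistics summarized in Fig. 2, together with the CpG under-representation reported in the text, can be approximated with a short sketch. The helper names (`gc_fraction`, `cpg_count`, `running_window`, `cpg_observed_over_expected`) and the `step` parameter are assumptions made for illustration. The expected CpG count is taken here as f(C)·f(G)·(N − 1), one common way of deriving it from the single-base composition; the original analysis may have differed in detail.

```python
def gc_fraction(seq):
    """Fraction of G + C bases in seq."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_count(seq):
    """Number of CpG dinucleotides in seq (CG cannot overlap itself)."""
    return seq.upper().count("CG")


def running_window(seq, window, stat, step=1):
    """Apply stat to each full window of the given size.

    Returns a list of (window start, value) pairs.
    """
    return [(start, stat(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


def cpg_observed_over_expected(seq):
    """Observed CpG count divided by the count expected from base composition.

    Assumes the sequence contains at least one C and one G.
    """
    seq = seq.upper()
    n = len(seq)
    expected = (seq.count("C") / n) * (seq.count("G") / n) * (n - 1)
    return cpg_count(seq) / expected


# Hypothetical usage, with `cosmid` holding the sequence as a string:
#   gc_profile  = running_window(cosmid, 500, gc_fraction)
#   cpg_profile = running_window(cosmid, 250, cpg_count)
```

Under these assumptions, the reciprocal of `cpg_observed_over_expected` gives the fold under-representation, so a ratio of about 0.23 would correspond to the 4.3-fold value quoted above.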
OR17-40 and OR17-228 running from telomere to centromere and OR17-24 from centromere to telomere. In addition to these, a new OR coding region was found immediately telomeric to OR17-24 and with the same orientation. It is hereafter referred to as OR17-25. Three other previously suggested OR coding regions (OR17-82, OR17-207, and OR17-219) (Ben-Arie et al., 1994) could not be found and are probably due to a PCR recombination artifact (Meyerhans et al., 1990).

A Gene Fusion Event

Phylogenetically, OR17-25 belongs in the 3A subfamily of OR genes (Lancet and Ben-Arie, 1993), together with OR17-24, OR17-40, OR17-201, and OR17-228. The presence of all but one (OR17-201) of the known members of this subfamily in the same region suggests that subfamily 3A arose by repeated duplications inside this subcluster. Furthermore, the similarity between OR17-24 and OR17-25 and their codirectionality suggest that they arose by tandem duplication.

While OR17-40 and OR17-228 include complete ORFs, OR17-24 and OR17-25 are both partial and fused to each other (Fig. 3B). A simple alignment (Fig. 3D) of the OR doublet sequence (OR17-24/25) with TM2 of OR17-24 and TM7 of OR17-25 (the most similar OR sequences, from the same subfamily) suggests that the fusion point is located between TM7 of OR17-24 and TM2 of OR17-25 (Figs. 3B and 3C). Thus, OR17-24’s ORF extends from the initial ATG codon to the seventh transmembranal domain (TM7), while OR17-25’s ORF begins at TM2, has a 2-bp deletion causing a
frameshift, and a premature termination codon (Fig. 3B), but otherwise extends to the end of the coding region. A polyadenylation signal can be recognized as well, located 325 bp 3′ to the expected stop codon of OR17-25 (Fig. 3B).

We propose a possible mechanism for the gene fusion event: OR17-24 was duplicated, potentially through the involvement of Alu repeats (Fig. 4). The possibility that some other, older type of repeat was involved cannot be ruled out at this stage. Due to an incomplete duplication, or for other reasons, OR17-25 may have become a pseudogene. Subsequently, the telomeric-most, pseudogenic OR copy (OR17-25) may have mutated. We show that the occurrence of two deletions of 9 and 3 bp may have rendered the TM2 and TM7 sequences similar to each other (Fig. 3B). Later on, the gene and the pseudogene could have recombined unequally at this point, causing a deletion from TM7 of OR17-24 to TM2 of OR17-25 (Fig. 4) and yielding the current sequence.

The features described above mark OR17-25 clearly as a pseudogene. On the other hand, it is less straightforward, from the sequence alone, to assess whether the extant OR17-24 could function as a normal gene. The extracellular N-terminal domain and all the TMs are present: OR17-24 would therefore be abnormal only in its C-terminal cytosolic domain, which would be 33 amino acids longer than the OR gene prototype and would lack sites suitable for phosphorylation (Kennelly and Krebs, 1991) by protein kinase C (Ser-X-Arg) or cAMP-dependent protein kinase (Lys/Arg-Lys/Arg-X-Thr), hypothesized to be required for negative feedback (Breer, 1994). The respective phosphorylation signals on OR17-25 are mutated as well. In addition, many of the putative 5′-untranslated sequence features seen in OR17-40 and OR17-228 appear to be modified or absent in OR17-24. These considerations suggest that OR17-24 is a pseudogene, potentially of relatively new origin, that is, dating from the time when the two genes fused.

FIG. 3. The OR gene doublet. (A) The locations of the two flanking Alu repeats relative to the fused OR coding regions. (B) The structure of the two partial OR genes (with transmembrane domains denoted schematically by numbers). Vertical arrows indicate the predicted signals for translation initiation (ATG), polyadenylation (AATAAA), and the in-frame stop codons (TGA). Horizontal arrows indicate the sequences recognized by the PCR primers employed. (C) Detail of the sequence at the recombination locus, compared to the 5B and 3B degenerate PCR primers. Solid, dotted, and open circles represent matches to non-, twofold-, and manyfold-degenerate positions, respectively. (D) Alignment of the recombination site within the doublet with the TM2 region of OR17-24 and the TM7 region of OR17-25. Vertical bars denote identities; dots indicate a 2-bp deletion. The apparent site of transition is boxed. (E) A similar alignment, after deletion of 9 and 3 bp from TM2 of OR17-24 (indicated by vertical arrows). The region of enhanced similarity between nonhomologous sequences is boxed.

Close inspection of the site of transition between the two partial genes shows why previous PCR analyses had failed to locate OR17-25, but did detect OR17-24 (Ben-Arie et al., 1994). The method used involved amplification by nested PCR with two pairs of general OR primers (OR5B/OR3B, OR5A/OR3A) designed to recognize the most conserved regions of OR genes. The site of recombination matches OR3B almost perfectly, but the same sequence matches OR5B extremely poorly (Fig. 3C), so no PCR product is expected to include OR17-25. The remainder of the cluster could contain additional OR pseudogenes, as has been described for
other multigene families (Raisonnier, 1991; Schable et al., 1994). Such pseudogenes could play an important role in the evolution of the OR gene superfamily (Garcia-Meunier et al., 1993; Trabesinger-Ruef et al., 1996).

Repetitive Elements

The cosmid was found to include representatives of the major classes of repetitive sequences (Jurka et al., 1992): simple sequence repeats (SSR), long and short interspersed nuclear elements (LINEs and SINEs), medium frequency repeats (MERs), and retrotransposon-derived sequences. Over 27% of the sequence of cos39 can be recognized to be derived from various families of retroposed elements and SSRs, while only ~8% represents OR coding sequences. These figures are not unlike those reported for the human MHC Class II region (Beck et al., 1996), namely 23.5% repetitive sequences vs 7.5% coding regions. Details about the retroposed sequences in cos39 are summarized in Table 1.
FIG. 4. Proposed mechanism for the fusion of OR17-24 with OR17-25. Solid bars denote OR coding regions, and open bars indicate their TM2 and TM7: 2, 7, 2m and 7/2m represent TM2, TM7, mutated TM2, and recombined TM7/TM2, respectively. Hatched circles represent the flanking repeats that recombined (possibly Alus); an open, stippled bar links the TM domains between which the nonhomologous recombination is proposed to have taken place.
Seven partial-sequence LINEs belonging to various L1 families were identified (Jurka, 1989; Smit et al., 1995). Seven mammalian-wide interspersed repeats (MIR) (Jurka et al., 1995; Smit and Riggs, 1995) can be identified in the cosmid, at a density that fits the
published rough estimate of their genomic average (Smit and Riggs, 1995). Twelve Alu elements (Deininger, 1989) were identified, six in each direction, or about 0.3 Alu/kb, which is in the genomic average range (Hwu et al., 1986). In addition, a copy each of
TABLE 1
Characteristics of the Repetitive Elements Found in the Cosmid

Type    Family   Subfamily   Number   Lengths (bp)   % of cosmid   % ID
LINE    L1       L1M         2        213–396        1.53          77–77.3
                 L1P         4        232–1232       8.43          78.6–93
                 L1Hs        1        1302           3.28          93.4–94.7
                 Subtotal    7        153–1855       13.24         77–94.7
SINE    Alu      J           2        279–318        1.5           72.8–79.5
                 S           10       141–306        7.05          81.5–91
                 Subtotal    12       141–318        8.55          72.8–91
        MIR      MIR         4        165–183        1.53          64–69
                 MIR2        3        106–197        1.04          75–76
                 Subtotal    7        106–197        2.57          64–76
MER     MER7                 1        234            0.58          80.5
MaLR    MLT      MLT1d       1        128            0.32          75
Total                        28                      25.26

Note. For simplicity, all Alu repeats belonging to the various Alu-S subfamilies are grouped together, as well as L1 repeats from the various mammalian (L1M-) and primate (L1P-) subfamilies. For each repeat type, the number of copies in the cosmid is presented, as well as the range of lengths in basepairs, the percentage of the cosmid they represent, and the range of percentage identities to the respective subfamily consensus sequences.
FIG. 5. Results of the long-range autocorrelation algorithm for word size k = 10. The strongest signals (solid peaks) represent elements of the large-scale duplication and of the OR17-24/25 doublet. Solid and open circles denote signals deriving from pairs of Alu repeats and (A)n stretches, respectively. The vertically hatched area denotes the background of random similarity between subsequences.
MER7, a medium frequency repeat (Jurka, 1990; Jurka et al., 1993; Kaplan et al., 1991), and of the MLT1d (Smit, 1993) mammalian apparent LTR-retrotransposon (MaLR) have been recognized. These two repeats harbor deletions and are much diverged from their respective family consensus sequences.

A Large Duplication

A long-range autocorrelation algorithm was used to detect internal sequence homologies, using various word lengths. Typical results (for word length of 10 bp) are displayed in the histogram in Fig. 5, which shows clearly two large signals at distances of 13.5 and 14 kb. These represent the duplication of a large segment, with an approximate length of 13.7 kb. Therefore, about 60% of the cosmid sequence would be involved in a large duplication event. Each of the two repeats includes an OR coding region (OR17-228 and OR17-40, respectively), as well as noncoding regions (see below) and a MIR repetitive element. While the OR genes share 85.3% nucleotide identity in their ORF (80% identity in amino acid sequence), the noncoding duplicated regions average 50–55% identity. Other significant autocorrelation signals detected by the algorithm represent the OR17-24/25 gene doublet and various combinations of codirectional Alu repeat pairs, as well as poly(A) stretches (Fig. 5).

For convenience, we subdivided the duplicated region into eight sections, called D1 through D8 (Fig. 6; Table 2). From the centromere to the telomere, the large duplication includes an OR coding region (D4), a highly diverged region (D6, which includes an Alu cluster or an L1 repeat), a significantly conserved region (D7), and a MIR repeat (D1 and D8). In addition, an Alu repeat and an L1 repeat appear to be duplicated; they are located at approximately the same relative positions with respect to the OR genes and are in the correct orientations (Fig. 6). On the other hand, the two Alus and the two L1s belong to different subfamilies (Alu-J and Alu-Sx, L1PA7 and L1Hs, respectively). Therefore, they must have retroposed in separate events, after the duplication took place. Moreover, Alu-6 is inserted inside MIR-2, while an additional MIR repeat (MIR-4, in the reverse direction) lies between MIR-3 and Alu-7. MIR-4 is at least as old as MIR-2 and MIR-3, and there is no indication that it could have retroposed after the duplication took place.

These findings, as well as the different similarity levels on both sides of the MIR repeats, lead us to postulate that the mechanism for the generation of the large-scale duplication was unequal recombination between the MIR repeats (Fig. 7). Therefore, the duplication boundary would be located inside them: MIR-2 would thus be a recombinant element, composed of the 5′ half of MIR-3 and the 3′ half of another MIR repeat (MIR-0), which is predicted to be located outside the sequenced cosmid, about 6 kb toward the centromere (Fig. 6). Indeed, the optimal alignment of the two MIR repeats (not shown), after removal of sequences that were inserted later (see next section), shows that the 5′ halves of the two MIRs (~100 bp) display higher pairwise similarity than the corresponding 3′ halves.

FIG. 6. Scheme of the large duplication event, including the positions and sizes of the duplicated elements (solid bars), the elements that appear to be duplicated, but are younger than the duplication (dotted bars with solid outlines), and additional elements that represent later insertion events (vertically hatched bars). Cross-hatching represents available cosmid sequence.

The highly diverged region (D6) includes an Alu cluster in the centromeric duplication arm and a featureless region in the telomeric one, with an L1 element (L1-4) for which short direct repeats (SDRs) can be recognized. The SDRs for all the Alu repeats in this cluster can be recognized readily, enabling us to reconstruct the original sequence before these insertions. The timing of retroposition of all the Alu repeats in the cosmid is summarized in Fig. 8.

The D7 section in both duplication arms represents
~2 kb with low but significant mutual similarity, averaging 65% nucleotide identity. A ~300-bp subsection shows much better conservation (73.8% nucleotide identity) than the surrounding sequences, which are 52.1% identical. These sequences lie 6–7 kb 5′ to the coding regions of OR17-228 and OR17-40 (4.1 and 6.2 kb, respectively, after removal of repetitive sequences). They do not have extensive open reading frames, and GRAIL did not assign any coding potential to them. An analysis described below leads us to hypothesize that the D7 section may contain a noncoding exon and that the 300-bp subsection with higher conservation may be a gene control region. The latter is subsequently referred to as the putative OR control region (OR228cr and OR40cr). Database searches via BLAST or FastA, using both ORcr sequences as queries, found no hits of statistical significance, suggesting that these are not members of a previously uncharacterized repetitive sequence family. Therefore, the probability that the OR228cr and OR40cr sequences did not arise by duplication, but rather were inserted at a later stage, in the exact locations and in the same orientation, is negligible.

TABLE 2
Characteristics of the Duplicated Sections

Name  Description                        Lengths     %ID           ESL/2 (%)
D1    3′ half of predicted MIR-0/MIR-1   ~70         NA            NA
D2    Unsequenced/various repeats        ?/4158      NA            NA
D3    Cosmid start/OR 3′ UTR             332/338     52 [9.9]      38.4
D4    Full OR coding                     948/948     85.3 [87]     8.2
      1st codon position                 316/316     84.5          8.7
      2nd codon position                             92.1          4.2
      3rd codon position                             79.1          12.2
D5    OR 5′ from CDS                     70/70       64 [10.9]     24.5
D6    Alu cluster/L1-2 region            4026/4033   random        NA
D7    Outside ORcr (sample)              700/719     52.1 [17.2]   38.2
      ORcr                               300/313     73.8 [21.3]   16.1
D8    5′ half of MIR-2/MIR-3             99/103      62.9 [5.8]    25.6
Out   3′ half of MIR-2/MIR-3             73/71       43.7 [2.9]    52.2

Note. Similarity was assessed as described under Materials and Methods. The results presented include the % identity (%ID), with the number of standard deviations away from the mean of the distribution for 50 randomizations indicated in brackets, and half the estimated substitution level (ESL/2). The sequences were compared after removal of repetitive elements thought to have been inserted after the duplication took place. NA, unavailable results.

FIG. 7. The proposed unequal recombination scheme, showing the contents of the region before the duplication (solid bars), the proposed mechanism (recombination between collinear MIR repeats), the expected products (OR duplication and OR loss), and the sequence expansion due to subsequent insertion of repetitive sequences into the duplication (cross-hatched bars) and outside of it (stippled bars). MIR-a, MIR-b, and MIR-c represent the postulated MIR repeats present before the duplication occurred; MIR-ab and MIR-ba are the recombinant MIR products. The bottom panel represents the extant sequence, with the actual names of the OR genes and the MIR repeats.

Duplication Timing
The timing of the duplication can be estimated from the extent of identity and from the inferred times of divergence of the various duplicated and nonduplicated elements. As MIR-2 and MIR-3 appear to represent the duplication boundaries, they must have been present before the segment was duplicated: they set an upper limit to the duplication age. MIR-3 is ~69% identical to the MIR consensus (the 3′ half of MIR-2 is somewhat less conserved), so the duplication must have occurred less than 100 million years ago, as estimated by using local molecular clocks (see Materials and Methods).
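The age estimates in this section rest on converting percent identity into an estimated substitution level (ESL) that accounts for multiple substitutions. The following is a minimal sketch of such a conversion, assuming the one-parameter correction of Jukes and Cantor (1969); that choice is an assumption made here for illustration (the procedure actually used is the one described under Materials and Methods), and the function names `jukes_cantor_esl` and `half_esl` are hypothetical.

```python
import math


def jukes_cantor_esl(identity_pct):
    """Estimated substitution level (%) from percent identity.

    Assumes the Jukes-Cantor one-parameter model:
        d = -(3/4) * ln(1 - (4/3) * p)
    where p is the observed fraction of differing sites.
    """
    p = 1.0 - identity_pct / 100.0
    if not 0.0 <= p < 0.75:
        # At 25% identity or less the correction is undefined (saturation).
        raise ValueError("identity must be above 25% and at most 100%")
    return -75.0 * math.log(1.0 - 4.0 * p / 3.0)


def half_esl(identity_pct):
    """Average divergence of each copy from the ancestral sequence (ESL/2)."""
    return jukes_cantor_esl(identity_pct) / 2.0


# Identity values taken from the %ID column of Table 2.
for label, pct in [("D7, outside ORcr", 52.1), ("ORcr", 73.8), ("Full OR coding", 85.3)]:
    print(f"{label}: ESL/2 = {half_esl(pct):.1f}%")
```

Under this assumption the three calls give 38.2%, 16.1%, and 8.2%, which agree with the ESL/2 column of Table 2 for the same sections. Translating an ESL into millions of years additionally requires a locally calibrated substitution rate, which this sketch does not attempt.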
On the other hand, MIR-1 seems to be the most ancient element to have retroposed into the cluster center after the duplication took place. The MLT1d repeat (Fig. 6) seems to have invaded the cluster even earlier, but as it is located in section D2, the centromeric copy of which has not yet been sequenced, we cannot ascertain whether it “arrived” before or after the duplication occurred. MIR-1 is ~76% identical to the MIR2 consensus, setting 90 million years as a rough lower bound to the duplication age.

Assuming that there has been no selection over section D7 (excluding the ORcr regions), we can estimate the duplication time more accurately. The 52.1% identity of this region corresponds to an ESL of 76.4%. Since both copies of D7 have mutated independently, their average ESL from the original sequence is 38.2%. This is equivalent to ~95 Mya and agrees with the estimate from the retroposition times of the MIR repeats (Fig. 8). Therefore, this segment of the OR gene cluster on human chromosome 17, including an OR gene, is proposed to have duplicated not later than 90–100 Mya, before the eutherian radiation (Bailey et al., 1991; Bulmer et al., 1991; Li et al., 1990) but after the eutherian/metatherian split.

Coding sequences under conservative selection have been shown to mutate much more freely in the third position of the codons, or “wobble position.” Detailed comparison of the nucleotide sequences of OR17-40 and OR17-228 shows that the first position is 84.5% conserved (identical) between the two, the second is 92.1% conserved, and the third is 79.1% conserved (Fig. 8). This implies strong conservative selection over both OR genes (OR17-40 and OR17-228), an indication that both are currently functional or were functional after the duplication and until very recent times.

FIG. 8. Summary of events in the cosmid, with an estimation of their timing. The left scale indicates the estimated substitution level, accounting for multiple substitutions (see Materials and Methods). The right scale represents a tentative translation into millions of years ago. ORcr, 1st, 2nd, and 3rd represent the apparent divergence of the OR putative control regions and of the three codon positions of the coding regions. Repetitive elements marked in bold circles are thought to have been present before the duplication; those marked with rectangular frames are thought to have been inserted into the duplicated sequence.

Prediction of the OR Gene Structure
To identify features of the OR gene structure, we analyzed the noncoding genomic sequences surrounding the OR coding regions. This was done in parallel for both OR17-40 and OR17-228, assuming that any feature that appears in both genes is more likely to be significant. In the first stage, the linear discriminant function (LDF) algorithm was used (Solovyev et al., 1994), which is optimized for recognizing intron–exon boundaries, as well as promoter regions. The analysis revealed the presence of a splice acceptor site (AG, embedded in the appropriate context) 6 bp upstream of
the initial ATG codon of the OR coding regions (Fig. 9). This feature was found to be conserved for both duplicated OR genes, with relatively high LDF values of 0.77 for OR17-40 and 0.70 for OR17-228. At a distance of 30 bp upstream, a branchpoint consensus for lariat formation was found in both genes. Finally, high-scoring donor sites (GT in the right context) were recognized, with LDF values of 0.87 and 0.82 for OR17-40 and OR17-228, respectively. This analysis suggests the presence of introns of 6134 and 5433 bp, for OR17-40 and OR17-228, respectively, immediately 5′ to the OR coding regions. These introns encompass the least-conserved duplication section (D6, Fig. 6), including the Alu cluster and the L1-1 and L1-4 repeats (Fig. 9). No additional splice sites common to both genes were indicated by the LDF analysis, implying the presence of a single 5′ noncoding exon. A similar intron–exon structure could be identified for OR17-24 (not shown). The lower LDF scores for the splice donor and acceptor, and the lack of many of the features described below, are consistent with the view that OR17-24 is a pseudogene.

The highest-scoring splice donor and acceptor sites recognized by the linear discriminant function are conserved between the two duplicated genes, defining introns of different lengths. Their removal brings the upstream noncoding conserved regions (ORcr) to identical distances from the OR coding regions. Interestingly, while the putative intron of OR17-228 contains six retroposed elements (five Alu repeats and a short L1 repeat), the OR17-40 intron has only a single such repeat (L1). This could indicate negative selection on random repeat insertions into the latter, which might reflect a size limitation in the distance between the coding and the noncoding/control regions of these OR genes.
Presumably, OR17-40 may have reached this maximal distance earlier in evolution, and no further insertions into it were acceptable, while OR17-228 could accommodate additional insertions. A similar interpretation of differences in repetitive sequence insertion has recently been suggested for the human MHC class II region (Beck et al., 1996).

Further LDF analysis was performed to identify transcription start sites in a 1-kb segment upstream of the splice donor site. Only one transcription start site with an associated TATA box was identified. This is located 225 bp upstream of the splice donor site in both OR17-40 and OR17-228. Two potential initiator (Inr) sites (Javahery et al., 1994) were found for each gene (Fig. 9; Table 3), both downstream of the putative transcription start site, suggesting that the noncoding exon might be somewhat shorter.

FIG. 9. Proposed structure of the duplicated, complete OR genes in the cosmid. Gray boxes indicate coding sequences; cross-hatched bars represent the respective ORcr regions. The long introns are displayed in a different scale, with solid bars indicating the repetitive elements that inserted after the duplication. The LDF values for the donor and acceptor sites are indicated in parentheses. The conserved donor, branchpoint, and acceptor sites are boxed in the sequence detail.

Next, we searched for transcription factor binding sites in the ~800-bp segment upstream of the putative transcription initiation site, as defined by the weak TATA box. This was done by both the LDF algorithm and the MatInspector program, using the TRANSFAC database (Wingender, 1994). Only sites that appeared at the same location in both genes were further considered. A variety of high-scoring transcription factor binding sites can be recognized (Fig. 10), and many have similar sequences in the two genes (Table 3). All of the identified transcription control sites are found in or near the putative ORcr regions of the two genes and not in the 200-bp section immediately upstream of the transcription start site.

The observation that both OR17-40 and OR17-228 constitute seemingly intact coding regions suggests that the duplicated 5′ noncoding sequences of both might include some local upstream noncoding functional elements required for OR transcription and RNA splicing. This inference is strengthened by the fact that a whole array of consensus sequences for functional sites is found in both OR genes at corresponding locations. This includes a splice acceptor, a lariat branchpoint, a splice donor, a transcription start site, a weak TATA box, Inr sites, and more than a dozen putative binding sites for transcription factors. While at present no experimental corroboration is provided, these results form a consistent framework for a putative mammalian OR gene structure. Considerable work will be required to validate these suggestions experimentally and to assess the universality of such structure. For example, comparison between the control regions of OR genes belonging to different families will provide much deeper insights into the regulation of the OR gene superfamily.

Many of the identified sequence features reported display relatively low statistical confidence when analyzed in a single gene. We therefore utilized a powerful
comparative strategy that relies on the fact that the sequenced cosmid contains two apparently complete OR genes. Considering only those features that appear in parallel in both genes greatly enhances their statistical significance. Such a strategy could be used extensively in the future analysis of other members of the OR gene superfamily, as well as in studies of other multigene families.

The identified putative transcription control sites did not include the Olf-1 binding sites, known to be present upstream of several other olfactory-specific genes (Kudrycki et al., 1993). Yet the upstream DNA sites described here could form part of the OR gene cluster transcription control machinery. At present, it is unclear whether the weak TATA sequence in the two OR genes simply indicates a deviation from the consensus or suggests that the OR gene is actually TATA-less, in which case transcription could be directed by Inr-like elements (Javahery et al., 1994; Kaufmann and Smale, 1994). Additional elements could be involved in the regulatory events that lead to OR clonal exclusion, i.e., the expression of only one OR gene type in each sensory cell (Chess et al., 1994; Lancet, 1994). Consequently, the ORcr regions might also include some as yet unidentified functional signals, e.g., for somatic recombination joining the OR gene and a putative cluster control region.

TABLE 3
Details of the 18 Conserved Putative Transcription Binding Sites, as Recognized by MatInspector, Using Matrices from the TRANSFAC Database, the Weak Putative TATA Boxes Suggested by the LDF Analysis, and the Two Putative Inr Sites

Factor           Matrix score     Position        Sequence detail
                 40      228      40      228     40                   228
IK2              0.892   0.95     −674    −650    aacaGGGAgtag         tagtGGGAagag
S8               0.869   0.871    −625    −600    ccgttcATTAa          ctgttcATTAa
δEF1             0.859   0.855    −619    −594    attaACCTcat          attaACCTgcc
SRY              0.866   0.857    −441    −431    attaACAAggtc         actaACAAgtcc
CETS1P           0.825   0.851    −429    −421    ctgagGGAGaaag        ccccaGGAGagaa
HSF1             0.87    0.866    −422    −412    AGAAagcact           AGAAagcgcc
CEBPB            0.919   0.895    −383    −373    ggttggaGAAAgct       ctttgaaGAAAgct
δEF1             0.96    0.85     −338    −329    tctcACCTcca          tcctACCTccc
TH1E47           0.908   0.889    −281    −268    agaggaatCTGGgaga     gttgatatCTGGgaaa
IK2              0.946   0.964    −275    −262    atctGGGAgaac         atctGGGAaaga
CETS1P           0.813   0.885    −275    −262    atctgGGAGaaca        atctgGGAAagaa
LYF1             0.928   0.893    −274    −261    tctGGGAga            tctGGGAaa
AP4              0.872   0.924    −235    −219    ctCAGCacca           tcCAGCtgca
Weak TATA box                     −29     −29     tCATTAaaagcttggg     gTATGAagggcctga
CETS1P           0.915   0.824    24      26      ttcctGGAAgggc        tcctaGGAGagcc
Weak Inr site                     37      39      gctAATtct            gccAGTttc
AP2              0.822   0.83     48      50      tcCCCAgaggct         gcCCCAgaggag
Strong Inr site                   93      116     ctcATTtct            ctcATTtct
CETS1P           0.842   0.842    137     145     tcCGGAgcac           ctccgGGAGgata
δEF1             0.831   0.837    187     193     agaaACCTtgc          agaaACCTcag
NF1              0.84    0.842    195     201     tgcTGGCctctgtatcca   cagTGGCttccgtgccct

Note. Matrix score, position (relative to the postulated transcription start site, as indicated by the LDF analysis), and sequence detail are presented for each site, for both OR17-40 (columns headed 40) and OR17-228 (columns headed 228). Capitalized bases represent matrix “cores.” Sites upstream and downstream from the TATA box are located in the OR control region and in the 5′ noncoding exon, respectively.

FIG. 10. The emerging structure of a complete OR gene. The transcription control sites indicated are detailed in Table 3.

Generation of Receptor Diversity
OR gene duplication with subsequent divergence provides a mechanism for generating new odorant recognition specificities. Deletion of an OR gene (or part of it) would eliminate some odorant recognition capacity. If appearing in a polymorphic fashion in the human population, such gene inactivation could be the basis for specific anosmia, i.e., congenital odor blindness (Whissell and Amoore, 1973). A general mechanism for duplication of genes in the GPCR family has been proposed recently (Marchese et al., 1995), involving large (>16 kb) mRNA intermediates from Alu-repeat expression, which are assumed to be reverse-transcribed and reinserted into the genome. We propose here an alternative model for generation and reduction of GPCR gene diversity, in which unequal recombination between flanking repetitive sequences duplicates (or deletes) whole genomic regions, including full or partial genes. Repeat-mediated recombination has been shown to be a major driving force in the shaping of mammalian genomes (Charlesworth et al., 1994).

Specifically, we provide evidence for the complete duplication of an OR gene by recombination between MIRs, which is estimated to have occurred about 90–100 million years ago. The high degree of conservation between the duplicated coding sequences, compared to the much lower conservation of the duplicated intergenic sequences, implies very strong purifying selection maintaining them, presumably indicating that both genes are intact and functional or at least have been so for most of the time since the duplication took place.

It is striking that two olfactory receptor genes that belong to one subfamily, and are extremely similar, started diverging in such ancient times. In light of this finding, it is much less of a surprise that the various OR gene families have representatives in such widely separate mammalian orders as humans, mice, and dogs (Ben-Arie et al., 1994; Lancet and Ben-Arie, 1993). By extrapolation, we can only assume that the divergence between the various families of OR genes with human representatives occurred during the times of an early amphibian ancestor. It is tempting to suggest that this happened during the emergence of tetrapods from marine to terrestrial life, where the capability to detect diverse odorants in the air would be of enormous adaptive value.
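The purifying-selection argument made in this section rests on comparing identity separately at the three codon positions of the duplicated coding regions (84.5%, 92.1%, and 79.1% for OR17-40 versus OR17-228). The following is a minimal sketch of such a per-position comparison, assuming two equal-length, in-frame, gap-free coding sequences (as the 948/948-bp lengths in Table 2 would allow). The function name `codon_position_identity` and the toy sequences are hypothetical; they are not the OR sequences themselves.

```python
def codon_position_identity(cds_a, cds_b):
    """Percent identity at codon positions 1, 2, and 3 of two aligned CDSs.

    Both sequences must be non-empty, of equal length, in frame
    (length divisible by 3), and free of alignment gaps.
    """
    if not cds_a or len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be non-empty, equal-length, and in frame")
    a, b = cds_a.upper(), cds_b.upper()
    identities = []
    for pos in range(3):
        # Every third base, starting at codon position pos + 1.
        pairs = list(zip(a[pos::3], b[pos::3]))
        matches = sum(x == y for x, y in pairs)
        identities.append(100.0 * matches / len(pairs))
    return identities


# Toy example (hypothetical sequences): codons ATG GCT CTG AAA vs ATG GCC TTG AAG.
first, second, third = codon_position_identity("ATGGCTCTGAAA", "ATGGCCTTGAAG")
print(f"1st: {first:.1f}%  2nd: {second:.1f}%  3rd: {third:.1f}%")
```

In the toy example the third position is the least conserved and the second the most, the same ordering reported for the two OR genes; a markedly lower third-position identity is the pattern expected when amino acid changes are selected against.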
ACKNOWLEDGMENTS
This research was supported by grants from the US National Institutes of Health (DC00305), the Minerva Foundation, a Wolfson Research Award of the Israel Science Foundation, the genome project of the Israel Academy of Science, and the BMFT-Israel Ministry of Sciences and the Arts. We thank Jaime Prilusky, Irit Rubin, Eitan Rubin, Tzachi Pilpel, Edna Ben-Asher, and André Rosenthal for helpful discussions.
REFERENCES
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
Bailey, W. J., Fitch, D. H., Tagle, D. A., Czelusniak, J., Slightom, J. L., and Goodman, M. (1991). Molecular evolution of the psi eta-globin gene locus: Gibbon phylogeny and the hominoid slowdown. Mol. Biol. Evol. 8: 155–184.
Beck, S., Abdulla, S., Alderton, R. P., Glynne, R. J., Gut, I. G., Hosking, L. K., Jackson, A., Kelly, A., Newell, W. R., Sanseau, P., Radley, E., Thorpe, K. L., and Trowsdale, J. (1996). Evolutionary dynamics of non-coding sequences within the class II region of the human MHC. J. Mol. Biol. 255: 1–13.
Ben-Arie, N., Lancet, D., Taylor, C., Khen, M., Walker, N., Ledbetter, D. H., Carrozzo, R., Patel, K., Sheer, D., Lehrach, H., and North, M. A. (1994). Olfactory receptor gene cluster on human chromosome 17: Possible duplication of an ancestral receptor repertoire. Hum. Mol. Genet. 3: 229–235.
Bernardi, G. (1993). The isochore organization of the human genome and its evolutionary history: A review. Gene (Netherlands) 135: 57–66.
Bodenteich, A., Chissoe, S., Wang, Y. F., and Roe, B. A. (1993). Shotgun cloning as the strategy of choice to generate templates for high-throughput dideoxynucleotide sequencing. In “Automated DNA Sequencing and Analysis Techniques” (Venter, J. C., Ed.), pp. 42–50, Academic Press, London.
Breer, H. (1994). Odor recognition and second messenger signaling in olfactory receptor neurons. Semin. Cell Biol. 5: 25–32.
Britten, R. J., Baron, W. F., Stout, D. B., and Davidson, E. H. (1988). Sources and evolution of human Alu repeated sequences. Proc. Natl. Acad. Sci. USA 85: 4770–4774.
Buck, L., and Axel, R. (1991). A novel multigene family may encode odorant receptors: A molecular basis for odor recognition. Cell 65: 175–187.
Bulmer, M., Wolfe, K. H., and Sharp, P. M. (1991). Synonymous nucleotide substitution rates in mammalian genes: Implications for the molecular clock and the relationship of mammalian orders. Proc. Natl. Acad. Sci. USA 88: 5974–5978.
Charlesworth, B., Sniegowski, P., and Stephan, W. (1994). The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 371: 215–220.
Chess, A., Simon, I., Cedar, H., and Axel, R. (1994). Allelic inactivation regulates olfactory receptor gene expression. Cell 78: 823–834.
Chissoe, S. L., Bodenteich, A., Wang, Y.-F., Wang, Y.-P., Iyer, K., Jian, L., Ma, Y., McLaury, H.-J., Pan, H.-Q., Sarhan, O., Toth, S., Wang, Z., Zhang, G., Heisterkamp, N., G, J., and Roe, B. A. (1995). Sequence and analysis of the human ABL gene, the BCR gene, and regions involved in the Philadelphia chromosomal translocation. Genomics 27: 67–82.
Dear, S., and Staden, R. (1991). A sequence assembly and editing program for efficient management of large projects. Nucleic Acids Res. 19: 3907–3911.
Deininger, P. L. (1989). SINEs. In “Mobile DNA” (Berg, D. E., and Howe, M. M., Eds.), pp. 619–636, American Society for Microbiology, Washington, DC.
Devereux, J., Haeberli, P., and Smithies, O. (1984). A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12: 387–395.
Garcia-Meunier, P., Etienne-Julan, M., Fort, P., Piechaczyk, M., and Bonhomme, F. (1993). Concerted evolution in the GAPDH family of retrotransposed pseudogenes. Mamm. Genome 4: 695–703.
Gleeson, T. J., and Staden, R. (1991). An X windows and UNIX implementation of our sequence analysis package. Comput. Appl. Biosci. 7: 398.
Griff, I. C., and Reed, R. R. (1995). The genetic basis for specific anosmia to isovaleric acid in the mouse. Cell 83: 407–414.
Hwu, H. R., Roberts, J. W., Davidson, E. H., and Britten, R. J. (1986). Insertion and/or deletion of many repeated DNA sequences in human and higher ape evolution. Proc. Natl. Acad. Sci. USA 83: 3875–3879.
Javahery, R., Khachi, A., Lo, K., Zenzie-Gregory, B., and Smale, S. T. (1994). DNA sequence requirements for transcriptional initiator activity in mammalian cells. Mol. Cell. Biol. 14: 116–127.
Jukes, T. H., and Cantor, C. R. (1969). Evolution of protein molecules. In “Mammalian Protein Metabolism” (Munro, H. N., Ed.), pp. 21–123, Academic Press, New York.
Jurka, J. (1989). Subfamily structure and evolution of the human L1 family of repetitive sequences. J. Mol. Evol. 29: 496–503.
Jurka, J. (1990). Novel families of interspersed repetitive elements from the human genome. Nucleic Acids Res. 18: 137–141.
Jurka, J., Kaplan, D. J., Duncan, C. H., Walichiewicz, J., Milosavljevic, A., Murali, G., and Solus, J. F. (1993). Identification and characterization of new human medium reiteration frequency repeats. Nucleic Acids Res. 21: 1273–1279.
Jurka, J., Walichiewicz, J., and Milosavljevic, A. (1992). Prototypic sequences for human repetitive DNA. J. Mol. Evol. 35: 286–291.
Jurka, J., Zietkiewicz, E., and Labuda, D. (1995). Ubiquitous mammalian-wide interspersed repeats (MIRs) are molecular fossils from the Mesozoic era. Nucleic Acids Res. 23: 170–175.
Kaplan, D. J., Jurka, J., Solus, J. F., and Duncan, C. H. (1991). Medium reiteration frequency repetitive sequences in the human genome. Nucleic Acids Res. 19: 4731–4738.
Kaufmann, J., and Smale, S. T. (1994). Direct recognition of initiator elements by a component of the transcription factor IID complex. Genes Dev. 8: 821–829.
Kennelly, P. J., and Krebs, E. G. (1991). Consensus sequences as substrate specificity determinants for protein kinases and protein phosphatases. J. Biol. Chem. 266: 15555–15558.
Kudrycki, K., Stein-Izsak, C., Behn, C., Grillo, M., Akeson, R., and Margolis, F. L. (1993). Olf-1-binding site: Characterization of an olfactory neuron-specific promoter motif. Mol. Cell. Biol. 13: 3002–3014.
Lancet, D. (1986). Vertebrate olfactory reception. Annu. Rev. Neurosci. 9: 329–355.
Lancet, D. (1994). Olfaction. Exclusive receptors [news]. Nature 372: 321–322.
Lancet, D., and Ben-Arie, N. (1993). Olfactory receptors. Curr. Biol. 3: 668–674.
Lancet, D., Gross-Isseroff, R., Margalit, T., Seidmann, E., and Ben-Arie, N. (1993). Olfaction: From signal transduction and termination to human genome mapping. Chem. Senses 18: 217–225.
Lancet, D., and Pace, U. (1987). The molecular basis of odor recognition. Trends Biochem. Sci. 12: 63–66.
Lancet, D., Sadovsky, E., and Seidemann, E. (1993). Probability model for molecular recognition in biological receptor repertoires: Significance to the olfactory system. Proc. Natl. Acad. Sci. USA 90: 3715–3719.
Li, W. H., Gouy, M., Sharp, P. M., O’hUigin, C., and Yang, Y. W. (1990). Molecular phylogeny of Rodentia, Lagomorpha, Primates, Artiodactyla, and Carnivora and molecular clocks. Proc. Natl. Acad. Sci. USA 87: 6703–6707.
Marchese, A., Beischlag, T. V., Nguyen, T., Niznik, H. B., Weinshank, R. L., George, S. R., and O’Dowd, B. F. (1995). Two gene duplication events in the human and primate dopamine D5 receptor gene family. Gene 154: 153–158.
Merino, E., Balbas, P., Puente, J. L., and Bolivar, F. (1994). Antisense overlapping open reading frames in genes from bacteria to humans. Nucleic Acids Res. 22: 1903–1908.
Meyerhans, A., Vartanian, J. P., and Wain-Hobson, S. (1990). DNA recombination during PCR. Nucleic Acids Res. 18: 1687–1691.
Nef, P., Hermans, B. I., Artieres, P. H., Beasley, L., Dionne, V. E., and Heinemann, S. F. (1992). Spatial pattern of receptor expression in the olfactory epithelium. Proc. Natl. Acad. Sci. USA 89: 8948–8952.
Nizetic, D., Zehetner, G., Monaco, A. P., Gellen, L., Young, B. D., and Lehrach, H. (1991). Construction, arraying, and high-density screening of large insert libraries of human chromosomes X and 21: Their potential use as reference libraries. Proc. Natl. Acad. Sci. USA 88: 3233–3237.
Pan, H. Q., Wang, Y. P., Chissoe, S. L., Bodenteich, A., Wang, Z., Iyer, K., Clifton, S. W., Crabtree, J. S., and Roe, B. A. (1994). The complete nucleotide sequences of the SacBII Kan domain of the P1 pAD10-SacBII cloning vector and three cosmid cloning vectors: pTCF, svPHEP, and LAWRIST16. Genet. Anal. Tech. Appl. 11: 181–186.
Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. (1995). MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23: 4878–4884.
Raisonnier, A. (1991). Duplication of the apolipoprotein C-I gene occurred about forty million years ago. J. Mol. Evol. 32: 211–219.
Reed, R. R. (1990). How does the nose know? Cell 60: 1–2.
Sambrook, J., Fritsch, E. F., and Maniatis, T. (1989). “Molecular Cloning: A Laboratory Manual,” Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Schable, K., Thiebe, R., Flugel, A., Meindl, A., and Zachau, H. G. (1994). The human immunoglobulin kappa locus: Pseudogenes, unique and repetitive sequences. Biol. Chem. Hoppe Seyler 375: 189–199.
Schuler, G. D., Altschul, S. F., and Lipman, D. J. (1991). A workbench for multiple alignment construction and analysis. Proteins 9: 180–190.
Smit, A. F. (1993). Identification of a new, abundant superfamily of mammalian LTR-transposons. Nucleic Acids Res. 21: 1863–1872.
Smit, A. F., and Riggs, A. D. (1995). MIRs are classic, tRNA-derived SINEs that amplified before the mammalian radiation. Nucleic Acids Res. 23: 98–102.
Smit, A. F., Toth, G., Riggs, A. D., and Jurka, J. (1995). Ancestral, mammalian-wide subfamilies of LINE-1 repetitive sequences. J. Mol. Biol. 246: 401–417.
Solovyev, V. V., Salamov, A. A., and Lawrence, C. B. (1994). Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Res. 22: 5156–5163.
Sullivan, S. L., Adamson, M. C., Ressler, K. J., Kozak, C. A., and Buck, L. B. (1996). The chromosomal distribution of mouse odorant receptor genes. Proc. Natl. Acad. Sci. USA 93: 884–888.
Trabesinger-Ruef, N., Jermann, T., Zankel, T., Durrant, B., Frank, G., and Benner, S. A. (1996). Pseudogenes in ribonuclease evolution: A source of new biomacromolecular function? FEBS Lett. 382: 319–322.
Uberbacher, E. C., and Mural, R. J. (1991). Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. USA 88: 11261–11265.
Whissell, B. D., and Amoore, J. E. (1973). Odour-blindness to musk: Simple recessive inheritance. Nature 245: 157–158.
Wingender, E. (1994). Recognition of regulatory regions in genomic sequences. J. Biotechnol. 35: 273–280.
Xu, Y., Einstein, J. R., Mural, R. J., Shah, M., and Uberbacher, E. C. (1994). An improved system for exon recognition and gene modeling in human DNA sequences. Ismb 2: 376–384.