The complex genetic locus polyhomeotic in Drosophila melanogaster potentially encodes two homologous zinc-finger proteins


Gene, 105 (1991) 185-195. © 1991 Elsevier Science Publishers B.V. All rights reserved. 0378-1119/91/$03.50. GENE 05081



(Transcription factors; trans-regulators; Polycomb-group genes; gene duplication; nucleotide sequence analysis)

Janet Deatrick a,*, Mark Daly b, Neel B. Randsholt a and Hugh W. Brock b

a Centre de Génétique Moléculaire du CNRS, 91198 Gif s/Yvette Cedex (France), and b Department of Zoology, University of British Columbia, Vancouver, BC, V6T 2A9 (Canada) Tel. (604)288-2619

Received by G. Bernardi: 16 March 1991
Revised/Accepted: 3 May/7 May 1991
Received at publishers: 4 June 1991

SUMMARY

Differential expression of the homeotic gene complexes, ANT-C and BX-C, of Drosophila melanogaster is partly controlled by trans-regulating factors located outside the two complexes. The complex genetic locus, polyhomeotic (ph), is one of these trans-regulators required during development for correct expression of the homeotic selector genes. The ph locus comprises two genetically independent units whose functions are largely redundant. There are two duplicated sequences arranged as a tandem repeat in the ph region, defining two molecular ph units. Sequence analysis of the 28.6 kb of DNA comprising the locus shows varying degrees of sequence conservation between these two molecular units. Long open reading frames with a high degree of conservation have been localized in each tandem repeat. Putative protein products encoded by both the proximal and the distal unit contain several identical or practically identical protein domains: a zinc-finger-forming motif, an α-helix motif, a domain rich in serine and threonine residues and stretches of glutamine residues. The presence of these protein domains supports the hypothesis that ph encodes a transcription factor that may function as part of a protein complex. Possible molecular mechanisms leading to the particular structure of the locus are discussed.

Correspondence to: Dr. N.B. Randsholt, Centre de Génétique Mol. du CNRS, 91198 Gif s/Yvette Cedex (France) Tel. (33)69823201; Fax (33)69075322.

* Present address: Institut für Entwicklungsphysiologie, Universität zu Köln, Gyrhofstrasse 17, 5000 Cologne 41 (F.R.G.) Tel. (49-221)4702618.

Abbreviations: aa, amino acid(s); ANT-C, Antennapedia gene complex; bp, base pair(s); BX-C, bithorax gene complex; cDNA, DNA complementary to RNA; HLH, helix-loop-helix motif; kb, kilobase or 1000 bp; nt, nucleotide(s); ORF, open reading frame; Pc, Polycomb gene; PC, Pc gene product; Pc-group, Polycomb-group gene(s); ph, polyhomeotic gene; Ph, polyhomeotic gene product; wt, wild type.

INTRODUCTION

Genetic analysis in Drosophila has revealed a complex genetic locus, ph, which corresponds molecularly to a tandem duplication of DNA. The ph gene is one of a class of genes in Drosophila called the Pc group, named after the first mutant of this class identified, Polycomb (Lewis, 1978). When Pc-group genes are mutant, they provoke homeotic transformations, due to the inappropriate expression of genes in the bithorax and Antennapedia complexes. The Pc-group genes act in normal development as repressors of BX-C and ANT-C genes (Struhl, 1981; Duncan and Lewis, 1982; Duncan, 1982; Ingham, 1984; Dura et al., 1985; Dura and Ingham, 1988). It has been proposed that Pc-group genes comprise a network of regulatory functions, due to the similarity of their mutant phenotypes and the synergistic interactions between the loci (Jürgens, 1985). The molecular mechanisms underlying these interactions have yet to be elucidated, since to date, few of these loci have been characterized on the molecular level. On salivary gland polytene chromosomes the PC protein binds to sites corresponding both to ANT-C and BX-C genes, and to a subset of the Pc-group genes (Zink and Paro, 1989). The significance of this association at the chromosomal level is not yet clear, but these results indicate that there may be a more or less direct molecular interaction between members of the Pc group.

The ph locus includes two genetically independent units whose functions are largely redundant (Dura et al., 1987). The mutation of both units results in a severe embryonic lethal phenotype, whereas mutation of only one of the units is functionally compensated by the remaining unit. In these weaker mutant alleles, development proceeds to completion, and the viable adults show more or less severe homeotic transformations (Dura et al., 1985; 1987; 1988). The localization of DNA lesions in these weak ph mutant strains supports the hypothesis that one genetic unit corresponds to one of the tandem repeats identified molecularly (Dura et al., 1987).

The aims of the present study were to determine the nt sequence of the ph locus, to obtain precise knowledge about the molecular structure of this complex genetic locus, to identify putative protein-coding regions within the gene which might provide insight into the function of the Ph product itself, and possibly into the mechanisms by which the Pc-group genes may interact.

RESULTS AND DISCUSSION

(a) Organization of the ph locus

Preliminary molecular analysis of the ph locus has revealed that it is largely composed of a tandem duplication of DNA (Dura et al., 1987). There is a cross-hybridization at high stringency defining the two halves of the locus, a proximal and a distal molecular unit. In ph mutants, DNA lesions may be localized in either of these molecular units, and the transcription pattern is affected when either of these units is altered (N.B.R. and J.J.D., unpublished). Restriction fragments, within the 28.6 kb of genomic DNA sequenced, hybridize to a family of transcripts, all transcribed on the same strand. There are two major transcripts of 6.4 and 6.1 kb, as well as other, less abundant, smaller transcripts (Dura et al., 1987).

The extent of the locus has been delimited by genetic and molecular criteria by Dura et al. (1985; 1987). We sequenced only the genomic DNA fragments that hybridized to ph transcripts. The nt sequence reported here (Fig. 1) extends from the 5’-most restriction fragment that hybridized to this family of transcripts, through both repeated regions, and to the end of the 3’-most fragment that hybridizes to ph messages. The proximal unit (Fig. 2A) extends from the 5’ limit mentioned, to a SalI site, which falls within the unique region between the tandem duplications at map coordinate 15 728. The distal unit, also represented in Fig. 2A, extends from this SalI site to the 3’ limit mentioned. The proximal unit contains 15.7 kb, and the distal one 12.9 kb.

Genetic data indicate that the ph locus is composed of two redundant genetic units, one of which can largely compensate for mutation of the other (Dura et al., 1987). If each of these molecular units corresponds to one of these genetic units, one would expect that functional regions should be highly conserved between the two units. The two genomic sequences were compared to each other to more precisely define the extent and degree of similarity between the two halves of the locus. A schematic representation of the organization of homologous nt stretches in each molecular unit is presented in Fig. 2B. Blocks of extremely similar sequences are separated by stretches with little or no sequence similarity. There are five extremely similar stretches, with 100%, 97%, 96%, 93% and 91% similarity, respectively. Four of these blocks are in a long putative protein-coding region. The ORFs indicated have the capacity to code for proteins which are similar between the two molecular units. The fifth highly similar region is 5’ of this long putative protein-coding region. Three less similar stretches, 88%, 83% and 81%, respectively, are also indicated. Two of these stretches are 5’ of the long putative protein-coding region, and the third composes part of this coding region. Lastly, a stretch of 78% similarity is localized in the 5’-non-protein-coding region.
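As an illustration of how block-wise similarity values like those quoted for the homologous stretches might be obtained, the following is a minimal Python sketch. It assumes the two stretches have already been aligned to equal length with `-` marking gaps, and it skips gapped positions; the text does not state how the published percentages were computed, so the function names, the gap handling and the window size are all assumptions made for illustration.

```python
def percent_identity(a, b):
    """Percent of identical positions between two pre-aligned nt strings.

    Positions where either string has a gap ('-') are skipped; this gap
    treatment is an assumption, not the method used in the paper.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def windowed_identity(a, b, window=50):
    """Percent identity in consecutive, non-overlapping windows.

    Returns a list of (window_start, percent_identity) pairs, which makes
    blocks of high similarity separated by divergent stretches visible.
    """
    return [
        (start, percent_identity(a[start:start + window], b[start:start + window]))
        for start in range(0, len(a), window)
    ]
```

Applied to the aligned proximal and distal units, a profile of this kind would be expected to show plateaus of high identity over the conserved blocks and low values in between.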

(b) Position of putative coding regions in the ph locus

We searched the length of the genomic sequence for ORFs. The largest ORFs are indicated in Fig. 2B, and they are localized in homologous regions along the two molecular units. A partial cDNA clone has been isolated (D. Pierre and H.W.B., unpublished) which begins in one of these large ORFs. The cDNA ORF corresponds to the fusion of the three genomic ORFs at two introns, the first from nt 11332-11392 and the second from nt 11844-11902 on the genomic sequence. The nt sequences corresponding to the splice junctions and introns of the proximal unit are perfectly conserved in the distal genomic sequence. If the RNA encoded by the distal unit is spliced at the same positions as in the proximal unit, the two putative protein products may be aligned along 1442 aa (Fig. 3). The great similarity between the two putative products correlates well with the genetic analysis of the locus. The ORFs we analyzed may not correspond to the only coding regions in the locus. Further analysis of cDNA clones will be necessary to identify other, smaller ORFs which may contribute to the function of the locus.
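A search for long ORFs of the kind described in this section can be sketched as follows. This is a minimal Python example, not the procedure used in the study: it assumes an ORF is simply a stretch free of stop codons, it scans only the three frames of the given strand (the ph transcripts are all transcribed on the same strand), and the function name and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_frames(seq, min_codons=100):
    """Find stop-free stretches in the three forward reading frames.

    Returns (frame, start, end) tuples with 0-based start and exclusive
    end, sorted from longest to shortest. Stretches shorter than
    min_codons codons are ignored.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    found.append((frame, start, pos))
                start = pos + 3  # next open stretch begins after the stop
            pos += 3
        # stretch still open at the end of the sequence
        if (pos - start) // 3 >= min_codons:
            found.append((frame, start, pos))
    return sorted(found, key=lambda f: f[2] - f[1], reverse=True)
```

Because introns interrupt the coding region, a scan like this on genomic DNA reports separate ORFs that are only joined in the cDNA, which is consistent with the three genomic ORFs fused at two introns described above.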

(c) Protein domains of the Ph products

To predict the molecular function of the ph locus, we searched for protein domains which were conserved between the two proteins. This analysis revealed a zinc-finger-forming motif, heavily underlined in Fig. 3, coded by both the proximal and distal units and also present in the cDNA clone. This zinc-finger belongs to the class of zinc-fingers including the DNA-binding domains of steroid hormone receptors. This class is defined by the presence of two pairs of Cys residues capable of forming a tetrahedral coordinate binding site for a Zn2+ ion, thereby stabilizing the finger structure. The 14-aa stretch separating the two Cys pairs is rich in basic aa. This is appropriate for the nucleic acid binding interaction that has been proposed between the finger loop and DNA. The possible structure and role of zinc-fingers have been recently reviewed by Struhl (1989). Previous experiments have demonstrated DNA binding by proteins having multiple clustered zinc-finger motifs (Evans and Hollenberg, 1988). The unique finger motif present in both putative Ph proteins may indicate that cooperative binding of multiple proteins may be necessary if the Ph protein acts in nucleic acid binding. Zinc-fingers have also been identified in other proteins for which no role in binding nucleic acids has been demonstrated, such as protein kinase C, reviewed by Nishizuka (1988). Further experiments will be necessary to demonstrate such a role for the Ph protein.

Also present in both genomic units and in the cDNA clone, at the C-terminal region of the putative protein, there is a stretch of α-helix-forming aa, predicted according to Chou and Fasman (1978), thinly underlined in Fig. 3. This C-terminal region of the protein is particularly interesting, since the protein coding sequences are more highly conserved between the two repeats than the corresponding nt sequences. The 57 nt potentially coding for this α-helix are part of a stretch of 186 nt within which 11 out of 12 nt changes are silent. The one aa change is a conservative one of a charged aa residue for another, Asp to Lys. Long stretches of α-helix-forming aa are reported in proteins such as lamins, which assemble together in a protein complex to form intermediate filaments (McKeon et al., 1986). Short stretches of α-helix-forming aa also compose part of the HLH motif. It has been proposed that the HLH may play a role in dimerization of different HLH proteins together (Murre et al., 1989).
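The distinction drawn above between silent and aa-changing nt differences in the C-terminal region (11 of 12 changes silent within 186 nt) can be illustrated by comparing two aligned coding sequences codon by codon. The Python sketch below uses the standard genetic code and assumes both sequences are in frame, ungapped and of equal length; it also counts every nt difference inside a codon according to whether that codon as a whole changes the aa, which is a simplification. The function name is hypothetical and this is not presented as the authors' procedure.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_changes(cds_a, cds_b):
    """Count nt differences between two aligned, in-frame coding sequences.

    Returns (silent, replacement): nt differences lying in codons that
    encode the same aa in both sequences, and nt differences lying in
    codons that encode different aa. Ambiguous bases are not handled.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must be equal length and a multiple of 3")
    silent = 0
    replacement = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff == 0:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            silent += n_diff
        else:
            replacement += n_diff
    return silent, replacement
```

A strong excess of silent over replacement differences, as reported for this region, is what one would expect if the encoded aa sequence is under selective constraint in both units.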
One conceivable role for this α-helix-forming motif could be in the assembly of the Ph protein into a protein complex.

Stretches of Gln residues are present in the putative Ph proteins, underlined by thick dashed lines in Fig. 3. Polyglutamine motifs, encoded by repeated sequences, called opa repeats, were first described by Wharton et al. (1985) in another Drosophila gene, Notch, involved in neurogenesis. The opa repeats are widespread in Drosophila genes as well as in other organisms. They may be present in introns, but are extremely abundant in messenger RNA. In the mammalian transcription factor, Sp1, it has been established that Gln-rich regions are involved in transcription activation (Courey and Tjian, 1988). Such transcription activation domains may interact with the general transcriptional machinery and may as a result be conserved throughout evolution. The functional expression of transcription-activating proteins in heterologous systems supports this view, since Drosophila proteins are functional in yeast (Samson et al., 1989), and yeast proteins are functional in Drosophila cell lines (Fischer et al., 1988). The polyglutamine stretches in Ph might play a similar role in transcription activation.

Another domain present in both putative Ph proteins is a Ser + Thr-rich region. Indeed, 50% of aa 982-1125 in Fig. 3 are either Ser or Thr. Theill et al. (1989) have reported that transcription activation by the pituitary-specific transcription factor GHF-1 is mediated by such a domain rich in hydroxylated aa residues. Ser + Thr-rich domains could either constitute acidic activators due to phosphorylation, or as in GHF-1, form an activation domain different from the negatively charged activation domains found in a number of mammalian and yeast transcription factors. The Ser + Thr-rich region of the Ph proteins might play a similar role in transcription activation.

Comparison of the putative Ph aa sequence with different protein databases revealed no further similarity to functionally characterized protein domains. In particular, no similarity was found with the PC product (Paro and Hogness, 1991). Protein databases were searched using FastA programs (Pearson and Lipman, 1988).

The presence of a zinc-finger, the Ser + Thr-rich domain and the Gln-rich regions supports the hypothesis that the Ph protein product is a transcription factor. The functional form of the Ph protein remains to be elucidated, but the α-helical stretch could play a role in the association of the Ph protein into a protein complex.

It has been hypothesized that the Pc-group genes form a network of interacting functions. The mutation of one gene in the group gives a more severe mutant phenotype when associated with a mutation of another gene of the Pc-group (Hannah-Alava, 1958; Jürgens, 1985; Dura et al., 1985).
There is a suggestion of similar interactions between genes of another family, the modifier-of-variegation genes, which play a role in heterochromatinization. Sensitivity to gene dosage has been demonstrated in the case of some of these modifier-of-variegation genes. The resulting hypothesis attributes this sensitivity to the formation of a protein complex in which each of these gene products is present in a limited quantity (Locke et al., 1988). There are other parallels between the Pc-group genes and modifier-of-variegation genes. A modifier-of-variegation gene, Su(var)(3)7, encodes

188 1 91 181 271 361 451 541 631 721 811 901 991 1081 1171 1261 1351 1441 1531 1621 1711 1801 1891 1981 2071 2161 2251 2341 2431 2521 2611 2701 2791 2881 2971 3061 3151 3241 3331 3421 3511 3601 3691 3781 3871 3961 4051 4141 4231 4321 4411 4501 4591 4681 4771 4861 4951 5041 5131 5221 5311 5401 5491 5581 5671 5761 5851 5941 6031 6121 6211 6301 6391 6481 6571 6661 6751 6841 6931 7021 7111 7201 7291 7381

Fig. 1.

CTCGAGGTGT CTCGGGCTCT CCACGCACAC AAGATCTTTA ATTGGAAGAG TTACGGTGGA TTAAATTCAC GTCTATCTCT CAGTGCCCAC GTCAAAAAAT TGGGTTAACA CTGCAGTTAC TCAATGGAAA ATGTACGGAT ATCACAACAA CCCCCCCTCC AGTGAGCACA ATGGATTCGT RACCGCAAAC TATTATTATT GTCTTAACAT ATTCCTAAAA GCGTTAGCCA CACCCCATAA TTCTACAATT AGTTGACTCC TCAATATTTA TCTGTTTAGG TAGGATTTTT GGAAATTATA TTTGCAAATT TAGAAGACAA GGTGGGATTA CAGGCCACGG TTAAGGATCC ATGGCTTACT TTAACTTAAG ACCATATTTA CTCCTCGATA TGGAAAGGAC GGAAAAAGAA CRATATGTAA GCACCTGAAA AAAGGAATTT ATTACGATTG TAGGTAGCTC AGGGAGGAAG TTTACGAAAT AAATGGCAAG GGTCACAACT AGAGAGCAAA ATCGGAATGA CGCACACATA AGTGGTGGAT TTTTCATTTT TCTCAAACTG TGCGCATACG ATCTGATGTG TGAAGTATAG TGTACATACC CAAATAAATA GTCACCAGAC TTCCTGTTTT TCGCACGCGC GCATCTGAGA AAAAGTAGAG GTTGGTGTGG CTGCAAATTT CACAGTGGAG TTATGTAAGA ACTGTGAGTG GACACTGGGA GTTGCCTGTT AAACTGAGGA CTAAGCCGAC AGGTATTTTT TGCTCTTGAG ATTCTGTGCT CGTCGCCCTC TGCCCGTGTC GCAACATRRA CGAAATAAAC TATATTTCTT

GGACGCAATC AGGTATATCA ACACACTCGC TGAGTCCACA ATACAGATTC TTAGCCTTCG TATTGCAATT TACATATTTA CCTCGTAATC TCGAGCTTTT GGTAGCTGAT TAAAGGGATT ATTGAAAAAC GGATGGGGGT ACCGCAACTA ATCAAAGAAT CCACAGCGGC GAACGGGAAA AAAATCTTAA ACTTCTGTGT CTTTTATATA TTGCAAACTT CGCCCACTGA AAAACCTTTT CAATCAACTT AATAATTGCT ATTTTGTATC ATTAGTAAAC CTTCAATACT AAATAACGCC TACAGACGGT TTGAAACGGC CGTGTACACC AAAATGGTGG GGCCTTGGTA AAAGATTAGA TTCTTMGTT CATCGCCAAT TTTCATCCAC AATGGCAAAT GAGCATTTGG AATGGTAAGA PAATAAAGCA ATACTCCCAT CAATAAGCAC TGACAATTGC GCCAGCGTTA GCAAATATCG AAGAAMGCG AATAGAAMA AGCACTTAAA AAACATTTGT CTACAAATGG ATTTGAAACA ACTTATCGTA CAACTGTAAA CTATCCTCCC AAAAAGAAGG CAACAATCGT CGGTACCCAC TTTAAGAAGG CATGTTATGT TTTTACTCGT TAACTGGCCA AGAGAGAAGC TTGTCGCTGT TGTCGTATGT GTTAAATTTT CCCCTAGCTG TCGCGTGCAC GAGGTTGTAC CTTTTGTCGG GTATGTTTAT CTAAAAAGAA GCATGCATAC TGTTTAGTTT CATTGGATAT CTCTGTCATC TCCGCCATTC TGCATTTGTG GAAAACAAAA AGTCTAGGGT ACAGTCCGTT

TTCTCCTCAC CTATATATGG ACACACGCAC CAGATTATTG GGATTCGGAT TGCAAAAATG GTTATATAAT AGTTAAGTAC ATTATTGAGA GTGTATGTGT AAGCGCGAAA AATTAGGATC CGAAAGAGAA CACGTCGAGG AAATTCGAAA TCAAAATGAT GTTGTAGAAA TAACACAATA TGCTTTGGGA TTTATTTATG AAGAGATTAT GTAATCAAAA ACCACTAAAC CCGGTGGTGC TGTTATTGTA TCTTGGGTAT TCATGTTGGG ACTTTCATTT TCTGAGTATC AAGTGCTGCA GACCAAAATA AAATAGTCCT CCTTGGGACA AAAGGGGTGG CAGTTTCGAA ACGTTTCTTT CTTTCCAGCG GCATTTCTAG TGAAAACTTC GTGTGCAATG GACTCGACTT AATATGACAC TTTACTTTTA TTATTTCAGT AACAACAACA CAGCACCAAC ACGTGCCAGA ACTCTCTCGC CATGCTCGAC AAGCTCCGM AAGCCTAAAA GGGGCTCAGC TGCATGCACA TTAATAMCT ACATTGTTTA GCAATCCAAT CTTCTCAACC AATGAGAMG TGTCGAGTCG AAAGTMAAG GTTTTCATTT TTTCAGCAAC TTTGGTTTTG GGGCCACTCC GGATCCAGCT GGCGCGCGGT GTCGCATACC TTTCAAAAAG GCGAATCGTC AGGTGCTTAG TAATTACTAC ATTTTCGTAT ATTTATTGCC AGCGCCATGC CTATAATTTA GGTTTTATTT TGATTCTATA ATAAAATGCG ATCATCCAAC TGTGTGCGTC GTTCTATATC TGTCAGCTGT ATCACTGAAT

CACGGGCCGT CGACGTTATC ACCGCAGCAC CCGGCAGGTA ACGGATACGG TATTTGTATT CTGGAGCTGC TTAATATAAA TCTATGTTTA GTGTGTGGGG ACAATATCAG CTCATTCACT ACTAACGCAG TCCGGCTACT CGCAAAACAG CACGGAGCGA TTTCAACAGT TACTAATACA AATCAAATTG CCTTAGTTGC ATAAACCGCT CCAGTACGGA ATTTAGCTTG GCCCATTTTT CCCAAAGAAA CGATAAAGTC TATGTTTTGA TCCTTACCAG AAGGTGTGGA TTATATATAT TAGGCATATT ATTTGGCTCC TCAGAATGGA GTTGGTTCCA AGGGTTTTAT TATAACTTTT AGACTATTTG AGTTGGGAGC AAGCAAAACT CGCAGAGGGT ACCTGAAATA GAGCATAAGG CTAGTATGAA GGTGATCCCT ACCCACTTTT AACCACACAA GTGAGAGGCA TCATGCCCTA TTCTTGTTGT AAAACAAGCA ATAATATGGC GACGTTGTTG CACGCACGGG TAAAACTTAA TATGTATAAA AATAATGGAT GACACACCCG GAGGTAGAGG GGCCGGGCGG GGTATACTGG TAAGGATACG GGAAGGGGGC CTTTGTCAAA TAGTGGCGTC GCACACCAAC TCAGTTTGAA GCATCGCATG AGCTAACGGT GCTGGCGACG TCGTTACACT TGTCCGTGCA TCGTTTTGTG TTCACACATG CGACTATAAC TTATAAATAT TTATCACTGC TCGTATTCGG CTCGACTGGC CGACACACGC GGCTGGTGGT GAAAGGTATC AACGAATCTG GCTAATTTTA

ATCGCACTGA AGCCCCTCCG GGTCTAGATT GATAGATCCC GTACGGGTAT TTGCAACGAA ACATACGCAA TACTTTMGT TTCTCACTCG GCGGGTGCGA CATCTGCATA AACACACGAC AAGCACTCAA CCTTTCATCC GGGCAGATTA AAGCTCCCCA TCATATTGGG AAATTTTTTG ATAAATCTCT TTTTCAAMT AGAATTAGTT TATCACCTCG TCTCTGTCCT TATTCGCTTC AATGGAGAAG CCAATAACGA TTAATTTGTG TAAAATTTTC TCCATTAAAT TGCAATTAAA TTGAAAATGG CCCTGCCATT TCCCCTTTTG AAGGTCGCAA ATGTTGGGGT TATTGTTATT TTGCTTAAAT ATTCCTATTT CGAGGACTTC GAGTTATTAA GGTGTTTGAT AATTTGGAAT AAAATGAATT TGCGATCCAT GGTTATCACA GTACAACAAC ATATGACAAA GGAGTACCAT GTTGAACTTG TACACACACA GGCAGCAAAA TTGGATGAGG CTGGCACAAA TAAATCAACT AAGTTTGGAA CGCTTACCAC TGTGTTACTG TAGGAAGGTA TAGACATTCG GCACTTGGTT ACTGAAGTTA TTTCAGTCGG CGAACTGCTG GCACTCGCAC ACACACACAT ACTTAACTTG TGTATTGAAC GCTGTTGATT TAGGCAGtGC GAGAAAAAGA CGAAACACTC TACTTTTGTT TGCGTGTTTG AAAAACAACC TGTTTTTATT TTTTGTACTG ATAATAATGA GGGCCAAGAA CCCCCGCTGT GGTTGATTTC AGTTTTTGTT AGAAATCTGT CCTACAAACT

TAGCAGGGAC ACTCTGCGCG TGGTCTGGTT TACACAGAAA ATGCATGGAT CAATATTTCA TTTGTTAATT ATAATGCATA CTCTCTCTTG ACGCGTGTTG CATATATAAA AAGTRACCGG AAGCATGTTG ATAAGTCAGT GCGTTAGTTA GGAGAGAGAG GATCGATTTG TTTTTTGGGA GTTTACCTAT ATAGCTAGCT TTAAATTTAA CATAACAACT ATTCTTCTCC GATCTTCCGT ATAACATTCT TTGCTAATAA TGTCTCCATA CTAAAACCAG TTAGAACTTC ACAATGGCAA AGTTATTGCG TTCAATTCCT AGGGTAATTG GTGTCACAGG ATTGAGTGAA TTAAAAGGTG AAATTTTTGC GGATAATRAT CTGTTTACGC TAATGGAGAA CCATGCAGTT TTTAGAGTGT TGATATCCTA TTTCATTTCT TCCGCATGCA AGGTTCGTTT AGCACGCGCA TCGCCCCAAA TGAGGAAAAG TCGAATGAGA AAGAAAAGGA TGGGAAGAGT ATGAAAAATG CAATACGTAT AAATTGCTGT TGTTCAATGG ACAAGCACAA GTCGTAGTGG AGTGTACACA TTAACTGGAA GCGAGGAATG GGTGGTTGAG CCGTTTCATT CCGTGCATTT ATTCAGCGCA GCGGTGTACG GAAGAAAAAA AGCTAGTGCA AGTTAAAACA AAAGTGCCTG ATTTACACGC GTCTTGCGTT TTAATGTACA ACGAGTGTCT TTGAATAATG CCTGCACTTG CACGTTAATC AAGAAAATAA GCGGCGTTGT TCCGTCTCTC GTTGCTGCGG GTTCCGGTTT TAAATTGTTC

ACCAGGAAAC TGCATGCGAA TGAAAAGTGC CGGTCATAAA AGATATGCCT TGTTATGTAC TTCACAGGAG TATGAATATA CTCTGTATTT TCTGTAAACA CCGGTTGTGA GAATGGGAGA ATCGGAACAC GATGTCCTTC AGAGATACGA CGAACTGAGC TGTTTTACTG TGTTTTTTCT TGGAATTTGT GCATTGCAAA GCATATTTRA GTTATACAGC GCAGGCGGCA CGGCATTTTG CATACACTTG CTTAAAATTT CCGAAAGGTA TAATTTAGTT TGCCAAGTAA GCGCAACCAC GTTTAATAGA ACGCTCGTTC TATAATGTTT AATCAACCCT AGTAGGGTTT CACAATAACT CAAAAAAAA C GAACAGTCAC TGCACAGAGT TAGGCGATAT GAAATGCAAA TTAGAAATCA AAACAAAAAT GGCCCCGCCT CACCCGTTGG AAATTTTTAT GCAATTTACG CACAAGGAGC GCGAAAGAGG ACAACCAAAG AACGAGACGA AAGAAGAAGA AAATGACAAA GTCTATTATA TCTAAAAAA T GGTGGAGAGA CTAATAATAA TGGTGGTTGT CATAAAACTG TTTGTGTTAG GTATATGAAA CACGCATACA CCAACCCCAA CGGTATAGAG CATGCGCTCA GTTCTTGCGC CCCGCCGACG GATGTGCAGA AGTACTTAGT CAGCCAGCGG AAGCACACAC CCACGTTACA ATATAATACG CCGCCGCCGC GATCGTCGTG TGCTTTTTGT ATAATGCTCG ATTCATGMA TGCATTTTTA CGTGCAACCC CAACCCTGCT AAGCGAAGAT GTGGCAAGAA

CGAACTTTTC ATTAGCATAT AATCCACGGT GCAATTTGGC GGAGGATTTG ATATTTAAGA TATTTATAAC GTCTTTTAGC TTGTCGTTTT AGCGAGACAA AACGAATATC GCCTCACGCG CAACTGAACT AGAGCAGCGT TAGGCGCCAA TTTCAGAGCG TTTCTGTTTA ATAATTCAAT TATCAACAAT TTCACTGTGT TACATATTTT CACTTTAATT CAATGGTGCT TACTGCCAAT TACCTGTTTC ATAGCTTACA TTAAGTTAAA CAGTTATTAT AACGGACTCT ACATCTCTGT ATACATAAAG ACACCCCTAA GTTTGGGAAA TTATCGCAGT ATTTGCTTTG CGGATGPAAT AATCTTAAAT CACTTAAGGA TGGCAGGtCA TAAAGGAAGT ATAGTTTATA TGAAAAGAAT ACAAATAAAA CAAAACCTTT AAAGGCCAAC TACAGCAGGG CTCATTAGGC AATAAGTTGA GAAATAACGG CAAGGCAGAG GACAAGCCAA CCAACAAGAG GACAGGCACA TAATTCTATT TGACCCGCTG ATAAGATTTC GGTAAAGCCG GATGGCACGA GCTTCGCGCG TAGTCAGCAC AAGGGGTATT CATGCCCGCC GCGAACGCAC AAAAGTTATT TTTTGTTTCC TCTGCTCTCT CGACATACTC CATAAAAAGT GCTGTGACTG GGAGAAGTTG ACACACACAC TACATATGTA GCAAATAGCA CGAAAACAAA CATTGAAGTT TTACAATTTT TAAAAATGCA CCCAAAACGA ACAACGAATT TCTCTCACTC TAdATTTTGC TCAATTTTAT AATAAACCAG

CACTAGACCT TTATTATGGA CCGTGGAGTC TCGGGGCCAG CACCACCCGG CCAGTAGGCA ACCTCCTTCT GGGTTAATAG GTTTCATTAC ACAGAAACAC AGATGTGTCA CCAGAATCTC TACAATGTTT CCAACCCGGA ACATCCCCCC AAAGCGAGTG CCACTTCTTT TACTTCTAAT TACCATTTAC AGAAATGAAA TATACCGTTT CACCACACAT ATCTCTTTTG TAGCAGCGAC GCAACTCGTG ATCATGGAGT TGTCTATTGT GACAGTTTCT ACCCACGCAT TTGAAAATCG AGGACATGAT TGATCTTAGG TCGGTTTGTA TTCCTGTTTT AATTTGAGTA TTGAAAGTTG TAACGCTTGT CCCGAAGGAC ATTAGCAGAT ATGGGCTGTT AAACTTTTTA TTTTAAGGTT ATCAAAGTGG AGAAGACTTC AAAAATGGTG TGCGACCGAG ATTTGTCATT AGGCAATTAT CGPAGGAGGA AGCGAGAGAG CAAAAAGCTA CCCAAAGCAA GTGGGTCATG TTACTAAACG TACTCTTGGT GCTCTGCCTC ATCCGACCCG AAAAAGAAGA TATTTATGTA CAATAACTTA TGAAAGTCGA AGCTGTAGTC GGCATCCCTC AGTCGCAACG GATCCGAACG CTGCGTTCGT ACTGGATACC GATGCCGCGC TGGCTTAATT AAGTGGAGTG ACACACACGA TATGTTTGGT AAAAGAGAAG RACAAAACGC TATGCAAAAA TGGTTTAATC GCGAGTGGAT GTTTCCCCTC TCGCAATGCA GCGCAACAAA TTTACTCTGC TGAATACTTT AATGATTAAA

7471 7561 7651 7741 7831 7921 8011 8101 a191 8281 8371 8461 8551 8641 8731 8821 8911 9001 9091 9181 9271 9361 9451 9541 9631 9721 9811 9901 9991 10081 10171 10261 10351 10441 10531 10621 10711 10801 10891 10981 11071 11161 11251 11341 11431 11521 11611 11701 11791 11881 11971 12061 12151 12241 12331 12421 12511 12601 12691 12781 12871 12961 13051 13141 13231 13321 13411 13501 13591 13681 13771 13861 13951 14041 14131 14221

TTATTTTAGA GTTTATGTGA TATCAATCAA TTGAGCGCGA TGCTTTCCCC AAACCAGTCT CCTGTGGCGA TTTTACCGGC GCTTTTTGGG CGCTTCTCCA TTACCATGTG CTCCCAACTG ATATCGAACG CTGCTCATGG CGAAAGTCCG GGGATAGGGA TGGTACATAT CATATTTACT TTTGTTTGCT CAAAAARAAA GTTTTCAAAA TGGGAGAGCT TCGCTAGGAC GAGCAAATGT CATCCCAGGG ATAACTACAA AGCGGCCACT GCATTGCCCA CCACGGGCCG GCCACACCAG AGCTGCAGCA TAAAGCATGC GAGCCGGAGG TGGCAGCGGC AAATGATTGT ACGGCCCCCA GCATTCCCAC AATCAATGGC AACAGCAGTT GTAGCGCGGT CCACCCAAAC TGAACGCTGG TGCGAAACCA TTGTCATGTA GCACGCACTC GCGGTGGCTA GGAACTCAAG CCGGTGCGCC CAGCAGCAAC ACTTATTTTT CTGCGGCAGG TTGCTGCCAT CCCAGCAACA CCCAGGCGCA AGCAGCAGCA CGCAACAGCA CCCAGCAGCA CTACCTCTAG CAAGCAGTGT CTGCTCAGGT CCACCACAAA CAAGCACCGC CCAAAGAGAC CCCCCAAATC ATAGGTCTTC ACAGTGTCAG ATGGGATTGG GCACCAGCAC ACGTCATCGA CGCCAAGTGA AGGACATCAA TGAAACGGAA GGACAGGTGG CCCCAAAGCT CCTCGCCTCT CCAAGTCTTC

Fig. 1 (continued)

CTCGTTTAAA CTGTGTGTAC TAGGCTGTCA CCTTTCCCCC CCGATTGCAA TTTTTTTTCT GTATTCCACC TTTTATAACC CGATGGAAAG GTTAGATCAT GAGTACTCAG TCAGCATTTT GTGCCAGCTG AAATTCGAAA TTAAGTAGTT TTGGGATCGG CTTCAGGCTT TTTCCCGCGT AGTTCCCGCA WC CATCACGGCC TGTTTTGCAT AAGAGAAGGC TTACTGATCG GATTTCAGCA CCACAATAAC AAAGTGCCTG GCAGCAGGCG CAGGCAAACC CCTCACGCTG GTTCTATGCG AACCAACATT AGCAGCTAGT CACTGGAGGA GTCGGGAAAT GATGCTTACC AAACCACAAT CGCTGCAGCT GACAGCTGCT CCAGGCGGCG CACTCAGCAG CCAAGCGCAA GCCAGACGGA TATTGCATCG AACTGGATGC CTGCCCAGTT TGCGTCCGGC CCGCTTTAGC AGGCGACGAA GCATCTTTCC AGTCGACAAG TCAAATGCAG GCAAGCGGTG GCATCAACAG GCAACTTCAC ACAACAGCAG ACAGCACCAC TGCTCTGCAG GGTCACAATT CCACCAACAT TCCAATTTTG TGTTACTTCG ACCTTCGAAA ATCTACTCCT CACGCCGTCA TGGGAAAGAA AGTAGCCAGA AACCACAACC TGGCTTCATC GTATAAACTT GCTAAGTGGA GCGCTACTGT TATAGTTGGG TTCAGAATCG TGCAATGCCT GGAAGTGAAT

GCACAACAGT GGCCTGGCCA AGGCCCCCCC GCACCCCTAT GCAGGCTGTT CTCAAGAGGT CCTCTGGACC CATTTCGCCC GGGTGGGGAT TCATAGTTGA CCATTTACGT TAGCATTTCG CGTGGCTAAA ACCCATAGAT GGCCAAATGT GATCGGGCTC GACCTTGTGA TTGTTTCTGC TTTGTCTCAC GGATATCACT AGTATAAATA GAGCGGGGAG CATTAGGAAC CCAAACTTCT TCAGCGATTC RACAACAGCT GAAACGCTCG ACTTCAGGAA CACACCCCAA GAGAAGGCGC AGCAATCCCT ATGGAAGTTC CCGGCCAACT GTTGGCGGGG CTGTTGCATC ACCACGACTC CAATCGCCCC CAGCAGCAGC CTGGCCAAGG ACGGGTCCTG TGCGTGCAGG CAAATGCAGA ACCCAAGGCA GATAGGTACT ACTTGCTCCC GCAGCAACAG CAGTTCCGTA CACATTGAAA TCTCCAGCAG AGAATGGTTG CAGCAACAGC

CAGCAGCAAG GCGCAACAAC GCTCTCGCGA AACCAACTGA CGAGAGCAGC CAATCCGGGC GCAGCCCTCT ACCAACCAGA CAGCAGCTAA GCCATGACCT TCGCCGGGTC GGGCCCACCG GCCACTGTCA AAGGGCGCTA GAGCCGAAGC ACGACAGCCA ACCACGTCGA ATCCAGGAGG CTGGTACCAA ATAGCATCAG TCGCCAGGAT GTGGACGCCA TTTCCTATTT CTAGGATCGC GGAACAGATC

TTTTGTAAAT TAGCCTCATC CCCTCCTCGC CGGCGTGTCT TTACGGGGGT TTTTCGCAGC ACCCACCCCA CTTTTTCTGT CAGGCTCTCT ACTATATCAG TTGTCCACTA ATGGAAAGGT TTAGTTTTCC AAAGATCGAT TCATTGTCGC GGGATCTCTA TTAAGTCGGA CACTCTCGAA ACTTTTTTGA TTGTTATGTT ATAATAACAA GGGAGCAGGG ACTGTTTTTC CTTTTTCAAT TAGCAGGAGG CCCAGCACTC CCCAGAAGGC CAGGCCCAGC GCACTCCGAA AGAATCCCGG ACGCCATTCA AGCAGCAGTT CGCAGCAAAG ATTGGACACA CAGGAGGCCT AAAACGCCAA AGACGGTGCT AACTCACCCA TGGGAGTGGA GATCTACTGG TTTCACAGTC TTCCCTGGTT TGTTCATTCA AAAGTCAACT AAGCAGCAAC CAGCAGCAAC TCCACACAGA ACCGAAATCG GTGGTTAATG TGATGAGCAC AGCTACAACT CGGCTGTTCA AACAGGCAGT ATGCCACTCA TACAGCAGCA AGCAGAATAT AACTACAGCT CCGCCTCTGG GCAGCACTCC TCAGCGCCAC CGATGATGAA AGCTGGTTCT CAACCCTGGT GCGCATCCGT CCACTCCCAC TGGCAACCTG GCACGGCTGT GTATCAGTAA CCAACGAGCC TGCTTTTTCG CTCCAGGCTC GCTCGAGGCA TGGCATTGGT TGGGAGCCTC CATTGTCAGT GTCCGCCAAT

AAATAGTTGA TCGTTCACTC TCGCTCCCGC GTGTGTGTGT TCTCACACAC GTGCGGCCGT TATTCCCTTT CTTTTTTGCA GGACTGCCGG GGAACTGATT ATCCTATTAG CCTTGAACTT GGGCTGTTGT GTGAAATGCG TGAAAAAGGG CCTAAAAGTT ACTATAGTTT AGTCATCTTT ATGACTAAAA AGTTTTCCAT ATAATGATGA GGCAGAACGG CAGTTTGTTT CCTCAGAGCG CACTCTTCCC ACACTCGCAC GGGAATCACC AACGGGATCA CAGGCCCAGT CCAGCAGGTG GGTGAAGCAA GCAGCTGCAG CCAGCAACAG GGGAAGGACG CGGTCAGCAG GCAAATGATC CTTCTCACCG ACAGCAGCAA TGCGCAGGGC GTCAACACAG GACTTTGCCA CTTACAGAAT ACAGCAACCG ATCTTCCTAC AGCAGCAGCA TCACTGCAGC CTGCCCAGM GTCAAGTCGC CGGCGGGCAA AACGGGCACT GTTTCAGAAA GGCCCAGCAA CGCGCAGGCT GCAAATCCTT GCTACAGCAA TATCCAGCAG AAGTAGCGTG CGCCATCTTT TTTGGTCACC AATTGCCGGA TGCTACAGTG CTTAAGCACG GCCCATTGGT AGAGGCCAGT CAGCAAGCAA CGGCAGTTTA CTCAACCGCT TGGATCGRAG ATTTCCCGTC CAATCTTAAC GGATATGGTT GGCAAAGAAC GGACAGGCTG AACAGAAGTA TGCACTTCCA CAGCAGCTGG

CAACTCTGTT ACATTTGTTT GCATTTTGCA GTGCTGCCCC TCGAGCTCGA GAGCTTAACT CTTTTACGGG CTCATGCTTT GCATCACAGT CAGGAAGTAA CTTGAGCACT CGCCTGCAAT TGTCGAACGT AAGGCTGTCA CGGGTTCGGA AACCAGGGTA TGATATTATC TATGTTTACA CCAGATTTTT GATCATTAAA GAAAGCTGAC GGATGCTGAC GCGTGTGGTG GACACAGAAR TTGRAGGACA CAGCAGCAGC TTCGACGAGA GGCTCAGTCA GCTCCCAGCA GCCGCCACCA GAGTTTCCCA CAGCTGTCGG CAGCACTCCA GTGCAGCTAA CCAATCCAGG GGTGGCCAAG ATGAACGTCA CAGTTTAACC AAGCTGGCCC ACCCAGCAGG GTCGGTGTGG GCTGCAGGAC GCGACGCAGA AGAGATTATT GCAGGTTGCC AGCTCTGCAG CCAGAGCCTG AGGACAAAAT CAAGTGAGTA CCGATCACCC CAGCAAATCC CAACAGCAGC CAGCAACAGC CAGGTGGCGC CAGGCGCAGG

ATTGTGGTGC CCCTTCTCAG CAGACAGCTA AGCAGTACGG GGGACTCAAC GGTCACCTTT GCTAGTAGCG TCGCCCAAGA AGTTCCACAG AGCAATGCAG ACGTCCGCAA AGCACAACCA GATCTCCCCA ACCAGACAGC GTGTCATTCC GCTTGCGAGC GGCATCGGTG GATGAAGCCA CCTCCAATGT ACTCTTGCAC AGTGTGGACG

TACACCTACG CATCTTCTTC CGCCCGAATC GCATTTTGAT ATGTATGTAC TACACACACT GTAGCAGCGA TTATGCTCTT CGCGGTCAAT TATTAGTTAA ACTGTATAAA TCGCATCGGC TGAACGTGGA GGCAGCTCCT AAAAGGTCCA CACAATTCTT CCGAGTTAAA TTGTAAATAT GTATTTACCA TTTAATTTTT CTTATTTTTG AACGTGTTTG TGGTCCGAAA GCGATACGAC ACTCGAACAT AACMCAGCA AATACGATGT CCCCCACAAG CACCCAACAC CCACGGTGCC CGCACACGAC AAGCCAACGG CAGCCATCAG TGCAACCCTC TGATCACCGC CGGGATTCGC TTTCGCCACA AGCAGCAACA AGAAAGTGGT TGCAGCAGGT GTGGACAGTC TGCAGCCGTT CTTTGCAGAC CAATGCAACG ACTACCAACC CGACCAGGAG CTGAAGGCCA AAGGTAGTAG TTGTTTGTTT TGCAGAATGG TGCAACAACA AACAGGTCTC AGAGGGAGCA CAAATCAATT CACAAGTTCA AACAGTCTGG TTTCTTCGTC AGCCGGGTAC TGGCCAGTAT AACAGCCACA CCACTGCTCC GTGGAGGAGG CTCCTGTATC GCGAAGCCCT CAGTGCAGCC CATCAACTTC CTACCAGTTC AGGCGATGAT GATATGCAGA TTCGCGCAGA AGTGTGGAAA GAGTTGGATC TGGCTGAGGA CACTGCCAGT CACTGTCTGT ATGTCAGCAA

CTACACCGCA GATTGCAGAA CACTGGAGGT CATTCCTTTC CCTACTCCAT TGCACACGCG CATGAGGGTT GCTGTTTATG GCAATAGGAC TTATTTCTAG ACACTAACCA TTTATTGCCC AAACGAGTGC TTCTCTTCAT CAAAAAGAAC AATTAGAAAT GGTGATCGTT GATTATTTAA AGTAGCCATA TAATGTTTAA GGTCAAGTCC CATTGCCTTC TGACATTGCT CACACCCGTG CCGCGAGAAG GGTGGGTGGC GGCCAGCCCC CCATCGGCAC TAACTGCMC ACTGCAGATA CAGTGGCAGT TGGAGGAGCA CACCATGTCG CACCAGTTTC CGGCAAGCCA TGGCGGAAAT GCAGCAACAG ACAGCAGCTT TCAGAAAGTG TCAGCAACAG TGTTCAGACT TGGGCCAAAC CCAGCAAAAC TGACGCAGAC AGACGCAGCA CCCCTGTCAT AAATGCGCAA GCCACCTGAC ATTATCTGCC ACAGACCCTT ACAAATGTTG TCAGCAGCAG ACAGCAGCAA CATCACGTCC AGCCCAAGTG AGCGACTTCT AACGACGCCA TTGCAGTTCC ACAGCAGGCT GGGACCGCCA GCCTGTAACT TAGCATACCA AGGAAAGGAC GTCCAATGGA ACCGAGTAGC AACCACGACA TGGCACCTTT TAAGCCGAAC

CTCGCTCTGA CTCGTTTGTG GATGTTTGAT CCACCGCTAC GGCGACATGG CACCCACCAC GCCAAAATGC GGCCTTGCGG CTTGAAACCA AAAAACATCC TCTCGTGTTG TTTTACAATA GACTACCATG TCGTTGAATG GRATATTTTC TGAAATGACA ACCTCAGAGT TAAGAGCGTT TTTGCTTPAA TTTAAGGGCC GATTCTCAGC CTATTGTTTG CTTTTTTGCT AGCACCACAG CCCCTCCACC AAGCAGCTCG CCGCATCCCG GGAACTCCGC TCAATTGCCC TCCCCTGAGC GGAACTGAAC GCCTCGGCCG CCGATGCAAT CTGTATCCCC TTCCAAGGCA TACGCGACCT AACCTGCTGC ACTCAGCAGC ACTACCACCA CAGCAGCAGA GCCCAACTTC CAGATCATCC CGTAAGAATA GCCCACTRAG GCAGCAACTA GCCCCACAAT CAAGCAGCAG CACCGTGCAG TTGTCACAGT CATGCAGCCA CAACAGCAGA CAGGTTAACG GTTGCCCAAG CACCAGCRAC CAGGCTCAAG CAACAGACTT GCCGGAATAG TCCTCCCCCA CAGACGCAAT TCACTTACAC GTTTCTGTGA GCCACGCCCA ACCTGCACTA GATGCCTCAG ACCACTCCCA ACGATCACCA ATCACAAGTT GTCTTAACTC

CAAAGACGTC AGCGATGAGC GAAAMGGCA GATGGAGCAC AGGAGAGACG GAAGATGCAG CCAAGCGGCG AGTCACTTCT CTTCATTCGA

ACCATGCAGG AAAGCAAAGC AACGGCCTGG ACAGAGGCCA ATTTCTGCGC GGCGCGGCGC GAACTGCCTG

14311 14401 14491

14581 14671 14761 14851 14941 15031 15121 15211 15301 15391 15481 15571 15661 15751 15841 15931 16021 16111 16201 16291 16381 16471 16561 16651 16741 16831 16921 17011 17101 17191 17281 17371 17461 17551 17641 17731 17821 17911 18001 18091 18181 18271 18361 18451 18541 18631 18721 18811 18901 18991 19081 19171 19261 19351 19441 19531 19621 19711 19801 19891 19981 20071 20161 20251 20341 20431 20521 20611 20701 20791 20881 20971 21061 21151 21241 21331 21421 21511 21601 21691 21781

GTTGTCAGGA TGGGCATGAA GAGCGCAGTA AGGTGATTCT TGACTTTGTA TAAACTAAGG GCATATTAAT TAAGAATTTT TTTTTTTCAA TTTTTTAAGT TAAAATGCAT GAATATGRAA CATTCTTTTG ACTGTTATAC AGCTATCCCT GCGACGCTGG AATCAAAATG CGCTGGTATA AGGTATGAAA TGATCTATAC GACTTGCTCC CTGAGATCAT GAAAGTACAA AGCACTAACT CTTTTGCATT GATCCGTCTC AATACAAATG AGAAAGCAAA CAATACTAAT CGCACCCATA AACGGTAAAT AATATAATAA CCCAAGTCGA CTCCAGTGTG TCTGCGCATA AAGAGCAAGG TTTGAGTCGG AGTCTTGATT CGCTCGCACG ACGGTATCTG CCGAGCAAAA GTTCGTGTTG ACCACCGTTT TGCTGTGCGG CAAAAATGAA ATTGCTCCCT ATGATGGCAT GCAAGTTGAA TCCACCCCCG TTTCTTACTC ATAACCATCC GATGGAGAAG TGTTGGTGTG AAAAGTACTG CTTACCACGT TTCTCACATT GTAGTGCGTC AGTAACATGC GAATATGAAG GTTTCCTCTT ATCTGTAAAG CGACGAGGAC CCCACTACTA ACCGATGGCG GAAGCAAGAG CGTGCAGCAG CCAGCAACAG GGGAAGGACG CGGTCAGCAG GCAAATGATC CATCTCGCCG TACCCAACAG GGTGGGAGTG TGGATCTACT GGTTTCACAG GATCCCTTGG CATGTTCATT CTAAAGTCAA CCAAGCAGCA AGCAGCAGCA TATCCACACA AAACCGAAAT AGGTGGTTAA TGTGATGAGC

Fig. 1 (continued)

CTACGTGGAC GCTGGGTCCA GGGCAGCTAG GATTCAATCG CATACTCCCA AAAAGCTATG AACTCCCAAC AAGCCTATGA CATTTTTTTT CACTTGACCA ACCTAGCTTT ATATAACAAA CAGCTCAGCA TACAATTTCA GTCTTTTAGC CAGAGCGGCG GAGAACGTTA TCTATGAAAT CGTAACAATC AAACATTGTC TTCACAAACC GGTTGTATTT TTAAATGTAT TTTAACTTCG TTGATTTCTT GGCAATAAAA TTCAAGGAAG AGCACTTAAT CGGAATGATG CCACAAATGA ATGTGAAACA TAATTATTTA CAGTTATGTA CCTTTGGTTG CGCTACCCCT GAGAAAGGCG GTATTTTATT CCTGTTTTTT CGCTATCTGA AGAGGAGAGA GTAGAGTTGT GTGTGGTGTC CGTAGTATTT CGTCTAAAGT AACGGTAAAA GCCTCTTCGT TATTGTATGC GATGCTGCTG ATACACACCA CTGGACAACT GCCAAGGGCC GGGAATATGC CATGCACATA AATACACGGT CTCACCTAAA TTCGTCCCGA CCTGTCGCAC AGTCGCAGTT CATGCCCCAC TCTCTTTCCA GTCAACTCCA TTTGCCAAGA CCGCTCAGCA GCCGCCACCA TTTCCCACGC CAGCTGTCGG CAGCACTCCA GTGCAGCTAA CCAATCCAGG GGTGGCCAAG GTGAATGTCA CAGCAGCAAC GATGCGCAGG GGGTCAACAC TCGACTTTGC TTCTGGCAGA CAACAGCAAC CTATCTTCCT ACAGCAGCAG ACTCACTGCA GACTGCCCAG CGGTCAAGTC TGCGGCGGGC ACAACGGGCA

GACTTTATAC AGCAGGAGAT CGACGGCCRA GCGCTTCTGT GCTCTTAAAA TTGTGGCCAA GGTGGAGTCC ATTAAGGAGG AGCACGAAAR GCCGAAAAAG ATGATCTCCT AACCGAGCAG AGCAGGCGAA AAGGACGCGA ATCCATTTGC AAAATATTAT AGATCACTTA TAAGCATATT CATTTATAAA TTAAACTAAC AATTAATTGC CAGCCAAGTA AATGGAATAA AGTACATTTT GTCATGGATA GTTTGTACAA GTATTTTBAA TCTTAGGAAT ATAATAATTG ATATCTAAAT GTGAATTTGA ACTTTTTACT TTGCAATTGC TGAAGAAATT AAAATGGCAC TGAATAGTGT ATTACAATAA TTGATGTGAT TAGTTTACTT TGATTCGACC ATCAAATGTA AGATGGAAAT ACTTCTGACT TACAATGACT CTATGGCAGT TCAAATTATA AATACATAGA AGATTGTCTG GAAGAATTGA CTTAATTTCA CTGTGCGCAA AATAAGTATA CCACCGGGCT AAGCCCCGCC CTCGAGCACG ACAATGTCGC GGGCACTCCC TTAGCCAAAA ACTAAAATCG AGTCCATTTT ACTTCACGCT TAGATCGGTG GACTCACCTA TGAATAGTCG AATCCGCTCA ACACATGTAT GTATCTTTGG TAGTATTTAC GGGATAACAA GATTATATTT TTCACCGAAT AATAGTATAA TATTGCAGTC TCAGAATGTT AAACTTTMT AAGTGACCTC TCGACAATCT TATACCAAAT AAGAAAAAAA CATATTTAGA CTATTATAGT GTTTAGCGGT CPAGCTCGGT ACGATGCTCT AATTATGCAT CTTATTAAGT AGATATATTG AAGAAATTTA GGCAAGGTTT TGTTAGTTGT CAGTTTCTAC AAATTTAAAA GCTAACAATA TATTCAAAAA TTGCATTAGA TCGTGATACG ATTAACCTAA ATTTAGATAT AACAGTTGCA TGGTAGAATA TCCCTAGCGT TTCTGCATCC CTGGAAACTT TGCAGCTACA ATTACACGTT TGGTGAAAAG GCGAAAGAGG GAAATAACCA AAGCATAAM ATAATATGGC GGCTGCAAAA AAGAAAAGAA AACACTTGTG GGCCTCAGCG ACGTTGTTGT TGGAAGGGGT TGCACGCACA CACACACGGG CTGGAACAAA ATGAAAAATG TTT GAAAAAT TAATTAATAA ATATATTAAT ATGTAAGTGA ATATAATAAT AATTATTTAA TAATAATAAT AATTATTTTA CATACAAAGA TATTTCTGTT CTACACAGCT ATGTAAATCC CCAAAGTACA ACAATAATGC ATTCCAATAA CAATAGATCG TCCCTTCGCA AACGCTTCGT TCGTCTGTTT CCGACAAGCA CTAGAGGTTG CCAAGTAGTG GTGGTTGTGA AGGAACGAAA TTGTTTTCAC AGTGGAAGGG GGCGCACAAG TCGGGGTGGT TTACTCGTTT TGGTTTTGCT TTGTCAACTG CTGCCGTTTC CCAGGGCCAC TCCTAGTGGC GTCGCACTCG CACCCGTGCA AGCGGATCCA GCTGCACACC AACACACACA CACACACATT CGCTGTAGCG CGCGGTTCAG TTTGAAACTT AACTTGGCGG GATGTGTCGC ATACCGCATG TGTATTGAAC GGGGGAAAAA ATTTATATAT TTATTTTTGG CGATCAATGC AAATCGGTGC TGCGTCGTTT TGAATGTTAG GCATGTACAG GTGCCTCCAA CATATTTTTT CTGATGAGAC GCTGTCGTCG GCGCATGCTC TTTCTCCTTC 
TCCTCTTCCT CTGTTTCTGC CTCTTGCAAC CGCCACACGC ACATGCACAC ACACACATGC ACATATCGTG TGCCTAGGTT TATTACCCCC CTTCCAAATC CGTTCCGCTC AATGTAACTG AAACTCCTAA CTATTATTGG TGGCGCCGGT GGCGAACAGC ATTTTGAGCG CCCAGCTTAG GTCTTTGAAC ATCCGTAAAT AAAAGTGGTC TCCCTTACAC CCACTCGAAC AGCAGAGAGC AATCCTACTA TAAATAGATG TTTCGGGCAC TGAAAAACAG TGCACTTTTA GTGGGTCACG AATCATGAAC GGGGCTAGTG TTGTACAAAG TTGTAGCTGC AATCATTTGT CAGTATCCAC TGTAAGTAAA TTACCTCGTG AGTGAACGAT CTTTGTTTTC GTTTTCGTTT TCTTTGGCGT TTTTGGACCC CCCCTTCTTT CTTTTTCTGT TCCTCGTCCC CTTTTCGCGT TGTTGTTGTG CGGTTTGTTC GTCAAGCGCT AAGAGAATAT GGATTCGGAG CCGCGCCAAG TGTGTCCGTG TTTGTTTGTG GCAGTGGGGA CACCGAAAGT GAATCAGCAA CAACAATACG CCACTCGCGT GGACCCCCAG CGGCCACTAA GGTGCCTGGA GTCCATCCCA ATCGCCCAGC TCTAAGGCAG CACGTGGGTC GCAGATCGCC AAGCGCACCC GACTCAAAGA CAACCGGCCG ATGTGCCGCT GCAGATCTCC CCCGAGCAGC TGCAGCAGTT ACACGACCAG TGGCAGTGGA ACTGAACTAA AGCATGCAAC AAGCCAACGG TGGAGGAGCA GCCTCGGCCG GAGCCGGAGG CAGCCATCAG CACCATGTCG CCGATGCAAT TGGCAGGGCC TGCAACCCTC CACCAGTTTC CTGTATCCCC AAATGATTGT TGATCACCGC CGGCAAGCCA TTCCAAGGCA ACGGCCCCCA CGGGATTCGC TGGCGGAAAT TACGCGACCT GCATTCCTTC TCTCCCACTC GCCACAGCAG CAGCAAAACC TTCTGCAATC AGCTTAACCA GCAGCAACAG CAGCTCAACC AGCAGCAGCA GCAAGCTGGC CCAGAAAGTG GTTCAGAAGG TGACCACCAC AGACCCAGCA GGTGCAGCAG GTTCAGCAAC AGCAGCAGCA CAGTCGGTGT GGGTGGACAG TCTGTTCAGA CTGCCCAACT ATGCGGCGGG CCTGCAACCC TTCGGCTCCA ATCAGATCAT CGGCGACGCA GACTTTGCAG ACCCAGCAAA ACCGTAAGAA ACAGAGATTA TTCAATGCAA CGTGACGCAG ACGCCCACTA CAGCAGGTTG GCACTACCAA CCAGACGCAG CAGCAGCAAC GCAGCTCTGC AGCGACCAGG AGCCCCTGTC ATGCCCCACA AACCAGAGCC TGCTGAAGGC CAAAATGCGC AACAAGCAGC GCAGGACAAA ATAAGGTAGT AGGCCACCTG ACCACCGTGC AACAAGTGAG TATTGTTTGT TTATTATCTG CCTTGTCACA CTCCGATCAC CCTGCAGAAT GGACAGACCC TTCATGCAGC

TGCTCAAGGA TCCCGCCACC TGGACCTGGT TAGCATCAGG AGTCAACAGT ATAAATAAGC CAAATGTAGC AAATAATATT TTAATAAATG ATTTGATCAA AAGAAACCAG TCTACCGAAT CGTACAAACG AGAGAGAGAG TCCCGTCGGC TTGTTGTCGA AAGCATTACT TCATTAGTAG AATGTAGCAT GTGCAATCCA CGTAATCCAA TTTCAGATAA CATATCTTAA CAAAACGCAA ACGCCCTTCG TATGTATTTA ACTAAGCMA AACGAGRAAA GGGAAGAGTG AAATGCCAPA AACTCTAAAT CACCATTACT TAAGTTTTAC AGTGCCACTG CAACCAATAA AAAAGAAGAT TGAGCACGCA ATTCCAACCC TTTCGGTATA CAGCACACAT CGTGCGGTTC AAAAGCGCCG TGACTATAAG ATATACAATA TTTGCCTGTT TCAATTTTTG GCGCGGTAGA TTAGCCTGCC TGCTCCTTCC CTGTCCCCTT TTCCTCTCTG AGGCAACGCT AAATTAAAAT AATTTCCCAT CATCCAAAAC AGATCGATCC TGGAGAGCAA GAATAAGAAT TTTAAAGCAA TACACCACCA AACACTCGCC AGTCGGAACG CAAACTGGAG ATATGCAAAC CMCATTATG AGCAGCTAGT CACTGGAGGA GTCGGGAAAT GATGCTTACC GAACCACAAT AATGGCCGCC ACAGCAACAG CAGCAGCACG GACCACCCAA TTTGAACGCT CCTGCGAAAC TATTGTCATG AGGCACGCAC TAGCGGTGGC ATGGAACTCA AGCCGGTGCG AGCAGCAGCA GTACTTATTT CACTGCGGCA

GAAGCATTTG TGGCGAGGCC TCAACCAAGT CCATCAGGTC CAAAAACGAA ATAAGTTTAT ACTATGATTG TGAATGCCTA TGAAAAAGAA ATATAGGTGG AGAGCAGAGC CATCTACCGA TGGGCTTCGT AGTGCGATTA ATTTTGTTAT CGTCGTCGTG TACAATTTTT AGTTACGTCA TAAGAAAATA ACTAGTTTCA TTGAATCCTA GAACTACTAC CGTTGTTTTG TCATTTCTGT TGTGGGAGAC TGTATGTATG ATAGAGAGTG AACGAGCCGA AGAAGACCGA GACAGGCACC CTTATATCTA TTTTTGTTTT GTTGCTGCTC TGGGAACATA TAAAGTAAGT GAAGTATAGC TACACATGCC CAAGCGAACG GAGAAAAGTT GCGCTCATTT TTGCGCTCTG ACGCGACGCA TGATTAGTGA ATACAGTACA GTTTGCGCCG GCTAAGGGAA AAGAAACCAG GATGGAAATG GCTCCTCTGA GGATATCACC CGTAGTAGAT GACAAGTGAC ATTTATTTAC ATAATATGTT AAAGTGTATC CGTCTTTGTC GAGAGAGAGT ATAAATATAA TTTACTGATA CCTTCCCCGG CAGAAGGCAG CCATCAATCA AAGTCACAGA AATCCCTACG GAAGTTCAGC CCGGCCAACT GTTGGCGGGG CTGTTGCATC ACCACGACTC CAGTCGCCTC GCAGCTCAAC CTGACTGCCG GTGCAGGCGG ACCACTCAGC GGCCAAGCGC CAGCCAGACG TATATTGCAT TCAACTGGAT TACTGCCCAG AGTGCGTCCG CCCCGCTTTA ACAGGCGACG TTGCATCTTT GGAGTCGACA

GTGAACGCTA AAGGATCCAG CTGTCGTGCC TTAAACCCAT TGGAATTACT AGTCTAAGTA TTCTAACMC GACCTAAGCT TTTTCTGAGT GAAAAGTGTA TTTTAATTCA AGACTTTAAT TCAACGCAAA AGCGCCGTAT TGGCTGAGCA GCTATAGGCA CTCGCTCTTT TTCGTACCAC ATTACCACAC TCTTCGTTCT AATTGGGGCG TGAGATGTAA GAACTTTTCC TCCCTTTTTT CCCGAATCCA TATGTTTGTA AGAGAGAGAG GCTAAAAAAG GTCCAACCAA TTGGGTCATG TGTTCCTAAT ACTGGCATAA TATAAGATTG TCACTCTGTC TGTATAAAGT AGCAATCGAA CGCCAGCTGT CACGGCATCC ATTAGTCGCA TGTTTCCGAT CTCTCTTTGC CACACACCCT AAACAATTAA GCAAGGAAAG CATTGTTGTC AAATCAAATA TCTTTCGGTT GAAGCCAAAA GCCGCGGCCA ATGATGGCTT GGGAGGAGGT CTGCTTTAGG CTGTTCCGTA GGTGTTTAAG TCTCGTTCCC TTTCTCAATG CTTTGGCGGG ATTTAAACAT GTCTGTGTGC AAGCCACCAC GCATCAGCTT GACGGCGCCA GTCCAGCTCA CCATTCAGGT AGCAGTTGCA CGCAGCAAAG ATTGGACACA CAGGAGGCCT AAAACGCCAA AGACGGTGCT AACAGCAACT CTCTGGCCAA CGACGGGTCC AGTGCGTGCA AACAAATGCA GAACCCAAGG CGGATAGGTA GCACTTGCTC TTGCAGCAk GCCAGTTCCG GCCACATTGA AATCTCCAGC CCAGAATGGT AGCAGCAACA

21871 21961 22051 22141 22231 22321 22411 22501 22591 22681 22771 22861 22951 23041 23131 23221 23311 23401 23491 23581 23671 23761 23851 23941 24031 24121 24211 24301 24391 24481 24571 24661 24751 24841 24931 25021 25111 25201 25291 25381 25471 25561 25651 25741 25831 25921 26011 26101 26191 26281 26371 26461 26551 26641 26731 26821 26911 27001 27091 27181 27271 27361 27451 27541 27631 27721 27811 27901 27991 28081 28171 28261 28351 28441 28531 28621

GCAGCTACAA CTGTTTCAGA AACAGCAAAT AGCGGCTGTT CAGGCCCAGC AACAACAGCA ACAACAGGCA GTCGCGCAGG CTCAGCAACA GAATGCCACT CAGCAAATCC TTCAGGTGGC GATACAGCAG CAGCTACAGC AACAGGCGCA GCAGCAGAAT ATTATCCAGC AGATTGTGGT ACAGTTGCAG CTTAGCAGCG TGCCGTTTTC AGCTCTCTCG GTGTCTGGCG CCATCTTTCA TACCAGCCAG AGCAGCACTC CTCTGGTCAC TCAGCAGCTA ATCAGCGCCA CTATTGCCGG GCCCTCACCT ACAACAAATC CCATTCTGGC TTCTAGCACC GCTGTCACTC CATCGTCTGG CACCAAAGAG ACACCTTCAA AAGGGCCCAC TACCCCCAAA TCATCTACTC CTGCCACTGT AGATAGGTCT TCCACGCCGT CAAAGGGCGC CAA‘CAGTGTC AGTGGGAAAG AAGAGCCGAA CAATGGGATT GGAGTAGCCA GAACGACAGC TTGCACCAGC ACAACCACAA CCACCACGTC TCACGTCATC GATGGCTTCA TCATCCAGGA GCCGCCAAGT GAGTATAAAC TTCTGGTACC GGAGGACATC AAGCTAAGTG GAATAGCATC GCTGAAACGG AAGCGCTACT GTTCGCCAGG GGGGACAGGT GGTATAGTTG GGGTGGACGC ATACCAGACA GTATCGGACG CTTTGCCAAT ATCTTCACCA CTTTCGTTGC CCCTGACATT TGCGCCGGTC CTAGCAATAC CATCC'KGAA CATCCGAGAA CTGCCTGGTT GCCAGGACTA CCATTTGGTT AACGCCATGG GCATGAAGCT TGTAAAGGAT TAAAAACACG CAACAAAGTC ACTTCGTGAG CGAATGTGAT CAGACAGAAC CATGCACCTA ATCTACAAAG GGAACTCCCC TATGCGAGAA ATCCATAATT AGGTGATGTA CAGCTAGTTT AAGCACCCCG ATCAGACCCC GGACCGCTGG CGCCTCCGCG TTGGATCAAC AGCTAGTTTA AGCACCCCGA TCAGACCCCA GACCGCTGGC GCCACCGCCG TTGGATCAAC TGACCATGCT GATCTCCCAC GACCTTTGCA TTCACTTGAA GCGGACTTTC AATCCTTTTG GTATGATAGA GCATAGACTG CCATCCCAGT TGAACGTTGT TAATCATTTC A~AACACCCT TTAGAATTAA AACTTAAGCC TGAAGTAATT AATGAGTAAC GTCTGAAGAA TTTTTGCCTT AATTTTTTTT TTAGTTTTAC GAAACCTTTT TGATCGAGTT TTATTGGAAT TGTTCATTTC ATTCTAAAAA CACATGGTM TTTTCCGAAA TAAAAATGTA GAATATATCT TCAATATTGT TATTTTGTGG CTGTGTTTRA TTTTTATGTT CGACTTTGGT TTTGTTTACA ATTTATTTGG ACGATCTAGA GCTACGAAAA CTTTGTTTAT GTTTCAATCA AATCGCAGTA TTCTACACGG CTTCTTGTAT ACCGATACAT AAACGTTTGC CGCATCTAAA TATATATTAT TCATACTCCA ACGTCGTCTC TATTTGGTTT TAATTCTTGC CACGGGTTGA GGTTGGATCG CGTGGTTAGC CGGTTGAGGT CCGAGGTTGC TGGGGTTTCG TTTGTTTGTT TGTCTACTAT GGATAGTTGG GTTTCCTTGC TCTTGTTTTT TTTGTTTTTT TTTTGATTTT GGTTCGCTCT TATTGATTAT TTGCTGTTGG CTTTTGTGCT TTGGCATTCA ACCTCTTTCT ATATGCTATA AAAATTCAAC GTGGGATTCA 
TATTCGGATT CACCTTCGTT TAATCGTATT ATAGGCTAAC TATATAAAGT GATGGTGGGT GGAAGCAAAG ACATTAGAAT GGGGGTGAAC ACATGATTAC TTATTGTTCT TCGTCGAGGC GTTTTATATA AGGCTATGTT GAACCAAAAA GGGAAAAGCG TTCAACAAAC GTATAAATAC A~~ATATAATT TGTGTTGTTT GAATAACAAT TTAAAGCACT TACAAAAGCA AATCTAATTG ATTTACATTT CGCTGCTATT TTTAATTTCT TTGCAGCGCC CAAGGTACTG AAGAAAACAC TCGTGCATAG ACAGTGAGGG ATGCATAATT TCTAAAGTAC AAACTTTCTA CAATATACRA TTAAAAATAT ACACACAATC AGATCTGGAT CTGAATCTGG ATCTGTGGAG TTTACCTCTA ATCTGTTCTC TTCTCTTCTG GGAATTC

CCTGCAACAA CAACAAATGT TGCAACAGCA GATTGCTGCC ATTCAAATGC AGCAGCAGCA GCAACAGGTC TCTCAGCAGC AGCAGGTTAA CGCCCAGCAA CAGCAAGCGG TGGCGCAACA GCAGAGGGAG CAACAGCAGC AAGTTGCCCA AGCCCAGGCG CAGCATCAAC AGGCTCTCGC GCCAAATCAA TTCATCACGT CCCACCAGCA ACAGCAGCAG CAGCAACTTC ACAACCAACT GGCACAAGTT CAAGCCCAAG TGCAGGCTCA AGCGCAACAG CAACAACAGC AGCGAGAGCA GCAACAGTCA ACTGGAGCGA CTTCTCAACA GCAGCAGCAG CAACCGCAAC AGCAGTCTGG GGTTTCACCA TCGATGACGG CGGAAGATAT TGCCGGAATA ACATCCAGTG CCCTACAAGA GACAACCAAA CCGATTACTT GCAGTTCCTC TACGCTCCCC ACAAGCAGTG TGGTCACAAT CAGCAGTACG GTGGCCAGTA TGCAGCAGGC TCAGACGCAA GGTACTCAGA TCCATCAACA AGGGTCTCAA CAGCAGCAGC AGCAGCAGCA ACTGGGACTA CCTTCACTTA CACCCACCAC CATGACCTCG ATGATGAATG CCACCGTGGG TCACCTATCC ACTGCCCCAC CCGTTAGTGT ACAGCTGGTC ACACTAAGCA GTGCTAGTAG CGGTGGAGGA GCAGGCTTTC CAGCCACGCC CGCAACCCTG GTGCCCATTG ATTCGCCCAA GACTCCTGTA TCAGGAAAGG ACACCTGCAC TAGCGCATCC GTAGAGGCCA GTAGTTCCAC AGGCGAAGCC CTGTCCAATG GAGATGCCTC TACCACTCCC ACCAGCAAGC AAAGCAATGC AGCAGTGCAG CCACCGAGTA GCACCATTCC GCTGCACAAC TGCGGCAGTT TAACGTCCGC AACATCAACA TCAACCACGA CAACGATCAC CAGCACGGCT GTCTCAACCG CTAGCACAAC CACTACCAGT TCTGGCACCT TTACCACAAG GAGTATCAGT AATGGATCGA AGGATCTCCC CAAGGCGATG ATTAAGCCGA ACGTCTTAAC GGCCAACGAG CCATTTCCCG TCACCAGACA GCGATATGCA GACAAAGACG TCAGCGATGA AATGCTTTTT CGCAATCTTA ACGTGTCATT CCTTCGCGCA GAGAAAAAGG CAACCATGCA AGCTCCAGGC TCGGATATGG TTGCTTGCGA GCAGTGTGGA AAGATGGAGC ACAAAGCAAA ATGCTCGAGG CAGGCAAAGA ACGGCATCGG TGGAGTTGGA TCAGGAGAGA CGAACGGCCT AATGGCATTG GTGGACAGAC TGGATGAAGC CATGGCTGAG GAGAAGATGC AGACAGAATC TCAAGCGGCT ACGCCGGAGG TCCCACCGAT TTCGATGCCA GTGCTGGCGG CTATGTCGAC GCCCTTGCCA ATTGCAATAG CTCCCACTGT GTCACTGCCA GTGGTTTCAG CTGGAGTGGT TATAAATGGA TCCGATCGCC CTCCCATCAG CAGTTGGAGT GTGGAAGAAG TTAGCAATTT CGTGGACGAC TTTATACAGC AGGAGATCGA CGGCCAAGCG CTGCTGCTGC TCAAAGAAAA GGGTCCAGCT CTCAAAATTG TGGCCAAGGT GGAGTCCATT AAGGAGGTCC CGCCAGGCGA AAGGTTTCAA AAGACCGCTT TCTTTAGTTT CCCGCGTTTC ACCTAAATGT AACGACATTT AAAGTGAATC ACGTTCCGAC TCACCACTTC TCACACGACG TACACCCTAA TCATCAGCTA AGAGAGCAAC
CGGTECCTGG AATCACTGAC TCTGTTGCGA GGCCCATCCC ATCCAGAATC GTTGTTTTTC CCGCACATGA CGAAAGCAAG GAATATGACC CTCCTTCGGC GCCGAAGCTG AAGATTGTGG CAATAGTAGA GTCCATGACT CTGTGCGACG AAAAGGACGG GGAGGTTATA AGTCTTCAGC AGTCTACCAG AGTCTGAGGA TAGGAGCGGG CAGTATCTGA GCTCTACTGC AGATTGTGGC AATAGTAGAG TCCATGACTC TGTGCGACGA AAAGGACGGG GAGGTTATAG AGTCTTCAGC AGTCTACCAG AGTCTGAGGA TAGGAGCGGG CAGTATCCTG AGCTTCTATT TGGACCTTTG CCTACCTGTG GACCGGGTCC AACCGGGTCT GCCACCCAAG CTGAATTTGA TACGTAATCT AAACATAGCA AGAGGGGTAA TATCGTAGCT CAAATGACAG GACGCCACAT CATCAATTCA TCACTTTGTA TAGAAGTTCA CAATTACTCA TAATCACTAA CGTATTTATT ATGCAAAGAA CATGAAAAAA TCATTTGAAC TAGAGGTAAT CTGGGATTAT ATTTACGTAG CCTAAGTGAA ACTGAATTGA ATCCAACTAA ACCTTTAATT TATTGATAAA CTTAGGTCGT TATTGCAAGG TTCCGAAAAC GCTGTCCAAC TTTTATAGAT CAAATCAGTG GTCGGAGTTT TTTCCTACTG CTGMGTM TAAAACTGTA AAGTGATTAA TGGAAGTGAA TATAATGTCG GTACCTTCTT GCCAAATATG TGGATATTGA AACAAATATG TGGGTAGTM GTAGTCACTC GGTTATTTGA AATATGGTTT ATTCAATTAT ATATATTTTT CTTTC~TA TGCAGACACA TATCGGTATA ATGTTTCGAA CAGCTATGCA TCTTAAGGTT TCAATTTAAT TTCATAATAA TGTTACGTTT AGTTAGCGCT TGCAGATTGA TTTACATATA GATTTGCTTC AACATTACAA TGTTGCCTTT TGTCTCGATT TCGCGATCGT TTTGTTGCGT TTCATTAATT TGTTTATCTT TGCGTCAATC GCGTTTCGTC GTGGGCTCGT TTACCATTTT GTTCTCGATT GTTATCTATA ATTTCTAATG TTTTTGGGTC GATTAATATT TTATATGGCA GGAGAACTAC CTCTCAGCTG CTCGCCGTCT CGTTTTGGGC TCAATGAAAA TGCTTCTGAG CTGAGTACTG AGCTGTATTT TATACTATAT GGTAGTTTAA CACTTGTATT TTGCAAATTG GCAACATTTG AACAAAGAAA AACATCGAAA CCGCACCGAG GTTTCACTAA AAATCGCACA ACTTCTCATC GGATTATGAA TTATAGACGT ACTGGAGGAG GGGACGTTGG GTTGGATTCG GGTCGAGGGC ACGAGGAAAT AGGTCCTACC CGAGAATATC GTTATCTAAA GTAAGTCACT CAAGCAATTT TTTCTTGGTG TGTATGGTTT ACATTTCGAT ATTGCCTAGA TACAATTTTA GAATTATTTA AGTTTAAAGC TTTTTTTTTG TTTTGGTGTT TGGGTCTCCA TCGATCCCTC CCTCAATATT CACGTCGTTA AACATCGTTG TTTGTGAGAT TTGTGGTCAA ATGACTGCGA TGAGGCCTAT TGGTGTTTTG TCTCTCCCTC CGCCGGCTGC TCCATTGGAC TGCGTTCTTC AATATCCATA TCCGATATCC ACAAGAAGCG CGCGACTTTG TTCGCCCCGC GCTCCAAGAT CAAAACATAT CGTATCCCCA TTTTGGGGGC CACTAGGTCG CCGCCCAGTC
GAAGCCAGGC TTCGCAATAT TCCGTTTGAA ACTAGATCAC GCACTTACAC ACACGTTTGT CGTCCGATCG GGATCACCCG CTAGTGCTCA AATATTATTC TAGTGTCTTT GGTGGAAGGT CAGGGGTCAG TTTAGATTCG TGTGGGTGGT AACTGGGAGA GGGGAGATGT CTCCCTCTTC TCCGTGGTGT GATTCAACTT TCCGTCTTGG CCAAAGTGTT CGGGTTTCTT TCAGTTCGTT GGTGATGTGT GGTAACATAA CTGAAGTTGA ATGCAGGGAA AATAAAAATA ATGTGCTATC GTGTATCGCA TATAAATGTT TTATTTTTCT TTTAGTTTTG TATTCGTTTT TCAGTTTTGT GGTTTTTTCA TTTTTTTGCA AGATCAAGAT TAAAGTTAAA TAAGATAAAA TAAGGGTATG TTTTAAATAT ATTACTCAAA TTTGCACAAT ATTCCTTCCT GTATATATTT TGTATTTATG ATTTAGCAGT TTGTTTTTAT TTTAAAATTT TGCTTAAATA TTGCATATAT AGTTTGATTA CGCTTCAGCG TTGAAATMT AAACTTAGTT AGAGAGAGAG AGAGAGACAG AGATAGGGAG ACTGGGTAGA GAGACACAGA CACACAATTA AATACAATTC TTTCGAACAT TTGAAATGTA ACACAATGGA CAATCGCTAT AAATAATCTT ACTAAATTAA TGGGAAAACT AAGAGGGAAG GGG-TCAG AAATTGGCTG CTAGAGATAT CTGTAATATT AGTTTTCTTA TTCTAGGGAC GACACGCTTA AGCGCTTCTT AGcTTTTTCT TTCATTTCTG TTCTGTTCGG TTGGCGTCTC CGCTGTTCTG TTCTGTTCTG TACTGATCCG

Fig. 1. Sequence of the ph locus. Genomic fragments used for sequencing were derived from cosmid or phage clones encompassing the locus (Dura et al., 1987), subcloned into the vector pBluescript (Stratagene) or into vectors M13mp18 or mp19. These clones were sequenced using the M13 universal 17-mer, or reverse primers, or using primers synthesized from the genomic nt sequence. Nested deletions in M13 clones were obtained using the ‘Cyclone biosystem’ (International Biotechnologies Inc.) or the ‘Double-stranded DNA nested deletion kit’ (Pharmacia). Sequencing reactions were performed using the dideoxy-sequencing method (Sanger et al., 1977) and the Sequenase kit (US Biochemical). The sequence has been deposited in the GenBank database under accession No. M64750.14. The entire sequence was determined on both strands.
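Because the sequence was determined on both strands, short motifs such as the chi site 5'-GCTGGTGG discussed in section d can be searched for on either strand. The following is a minimal Python sketch of such a scan, not the procedure used by the authors. The function names `clean_sequence`, `reverse_complement` and `find_motif` are hypothetical, and the sketch assumes that the figure text is first stripped of coordinates and spacing.

```python
def clean_sequence(raw: str) -> str:
    """Keep only letters, dropping the coordinates and spacing of the figure layout."""
    return "".join(ch for ch in raw.upper() if ch.isalpha())


def reverse_complement(seq: str) -> str:
    """Return the reverse complement; any non-ACGT character becomes N."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement.get(base, "N") for base in reversed(seq.upper()))


def find_motif(seq: str, motif: str = "GCTGGTGG"):
    """Return sorted (position, strand) pairs for every occurrence of the motif.

    Positions are 1-based and always refer to the leftmost nt on the given
    (top) strand, including for hits on the complementary strand ("-").
    """
    seq = seq.upper()
    hits = []
    for strand, pattern in (("+", motif.upper()), ("-", reverse_complement(motif))):
        start = seq.find(pattern)
        while start != -1:
            hits.append((start + 1, strand))
            start = seq.find(pattern, start + 1)  # allow overlapping hits
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence (an assumption for illustration, not taken from the locus).
    toy = clean_sequence("1 AAGCTGGTGG TTCCACCAGC AA")
    print(find_motif(toy))  # prints [(3, '+'), (13, '-')]
```

A scan of this kind only locates exact matches; characters introduced by transcription errors in the sequence text would have to be resolved against the deposited entry before the hit count could be compared with the number of sites reported in section d.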

Fig. 2. Organization of the ph locus. (A) A restriction map of the ph locus, with coordinates in kb. Sites XhoI, BglII and SalI are designated, respectively, X, B and S. Blackened boxes represent fragments of the genomic DNA that hybridize to a family of embryonic transcripts (see section a). (B) The two molecular units of the locus defined by the comparison of nt sequences. The proximal unit is projected over the distal one. The division between the two units was arbitrarily chosen within a 2.24-kb unique stretch separating the duplicated sequences. Filled-in boxes represent regions of perfect sequence similarity. Diagonal-striped boxes represent regions where this similarity is greater than 90%. Stippled boxes represent regions where the nt sequence similarity is between 80 and 90%. Herringbone boxes correspond to 78% nt similarity. The white boxes signify unique sequences. The position of the zinc-finger motif is designated by an asterisk. Protein structure and nt sequence analyses were performed using the UWGCG (University of Wisconsin Genetic Computer Group) software (Devereux et al., 1984). (C) Graphic comparison of nt sequences between the proximal unit, from nt 10089-14030, and the distal unit, from nt 19943-23933. Vertical bars between the sequences mark the positions of differences. Asterisks indicate the position of chi sites along the corresponding sequences. The double-headed arrow marks the position of the 1.25-kb segment referred to in section d.

a putative zinc-finger protein (Reuter et al., 1990). The Polycomb and Su(var)205 loci encode a stretch of similar aa (Paro and Hogness, 1991). As the molecular characterization of these interacting loci progresses, we can hope to understand the mechanisms by which these loci act together to mediate the correct expression of the genome.

(d) Evolution of the polyhomeotic locus

Given the extreme similarity of structure and sequence between the two molecular units, a duplication of genomic DNA may have been at the origin of the locus. Comparison of the nt sequences also reveals that the similarity between the two units is not homogeneous along the entire length. Fig. 2C represents a partial alignment of nt sequences which reveals a bias in the distribution of sequence divergence. As illustrated in this figure, there is a 1.25-kb region of perfect nt conservation between the two units which is bordered by more divergent regions. Both of the identified introns lie within this perfectly conserved stretch (D. Pierre and H.W.B., unpublished). In the well studied mammalian globin gene families, it has been proposed that gene conversion is at the origin of stretches of variable nt similarity

193 ____________,__N_“_________________________________”

I

300

““QQ____________________________________~~~~~~~

UPLOlSPEQLQQFVRSNPVRlOUKQEFPTHllSGSGlELKHRlNl~EUOOQLQL~OOLSERNGGGRRSRGRGGRRSPRHSOOSOOOOHSTRlSlNSP~O

STOTQOUOQUOOOOQOTTQTTQOCUOUSTSTLPUGUGGQSUOTROLLNRGQRO@~OIPUFLONRRGLOPFGPWQIILRNQPDGTOGHFlOOOPflTOTLQT 111-111111

1100

GTFlTSCTSTTTTTTSSlSNGSKDLPKR~lKPNULTHUlDGFllQERNEPFPUTR@RVAOKOUSDEPPSEVKLLUP~LFRNLNUSFLRREKKRlROEOl -------I

1400

ORLLLLKEKHLUNRllGIlKLGPRLYlURKUESlKEUPPPGERK

________N__-_____________---_________GDUKD

Fig. 3. A best-fit alignment of aa sequences deduced from the genomic nt sequences. The position of splice sites in the proximal unit, as predicted from a partial cDNA clone referred to in section b, was used to predict the splice sites at identical sequences in the distal unit. The bottom line corresponds to the predicted aa sequence coded by the proximal unit, and the top line to that of the distal unit. Dashes in the top sequence represent positions where aa are identical in both sequences. Dots represent gaps introduced to optimize alignment. The protein domains deduced from this putative aa sequence are: a zinc-finger, heavily underlined; an α-helix domain, thinly underlined; a Ser + Thr-rich region, underlined with thin dashes; and poly-Gln stretches, underlined with heavy dashes.
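The notation of Fig. 3 (a dash in the top line for an identical aa, a dot for a gap) can be decoded mechanically. The following Python sketch is offered only as an illustration of that convention. The function names `expand_top` and `alignment_identity` are hypothetical, the two input strings are assumed to be of equal length, and the toy sequences are not taken from the figure.

```python
def expand_top(top: str, bottom: str, identical: str = "-") -> str:
    """Spell out the top sequence by copying the bottom residue at each dash."""
    if len(top) != len(bottom):
        raise ValueError("aligned lines must have the same length")
    return "".join(b if t == identical else t for t, b in zip(top, bottom))


def alignment_identity(top: str, bottom: str, identical: str = "-", gap: str = "."):
    """Return (identical positions, compared positions), skipping gap columns."""
    if len(top) != len(bottom):
        raise ValueError("aligned lines must have the same length")
    same = 0
    compared = 0
    for t, b in zip(top, bottom):
        if t == gap or b == gap:
            continue  # a gap in either line is not a comparable position
        compared += 1
        if t == identical or t == b:
            same += 1
    return same, compared


if __name__ == "__main__":
    top = "--A-..--"      # toy distal line: dashes = identical, dots = gap
    bottom = "MKQLSTQQ"   # toy proximal line
    print(expand_top(top, bottom))          # prints MKAL..QQ
    print(alignment_identity(top, bottom))  # prints (5, 6)
```

Excluding gap columns from the denominator is one of several possible conventions; counting them as mismatches would give a lower percentage for the same alignment.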

along a duplicated gene (Liebhaber et al., 1981; Shen et al., 1981). Such a mechanism may explain how exonic and intronic regions are conserved to the same extent within the 1.25-kb region of perfect similarity. Studies of recombination in bacteria have shown that enzyme-promoted recombination, mediated by the RecBCD protein, requires chi sites (5'-GCTGGTGG). This has been reviewed by Smith (1988). Along the 28.6 kb of the ph locus, we have identified only six such sites. The positions of five of these sites are indicated by an asterisk in Fig. 2C. These sites are concentrated in or near the regions where we propose a conversion event. The significance of these sequences is not clear, but better understanding of conversion events in higher eukaryotes may reveal a link.

On the other hand, there exist other regions along the locus which are extremely divergent. One such region, immediately 5' of the long putative protein-coding region represented in Fig. 2, is 4.1 kb in length in the proximal unit, whereas the comparable region in the distal unit is 1.8 kb. Not only does the length of this region vary between the two molecular units we sequenced, but in the course of characterizing the genomic DNA in mutant and wt strains, we reported that some wt strains contain an additional 2 kb within this proximal region (Dura et al., 1987). This result suggests a molecular mechanism creating rapid divergence between the two molecular units involving insertion or deletion events. No mutant phenotype is associated with this additional 2 kb of DNA, and the transcription pattern is not different from that of strains without the insertion (N.B.R., unpublished). This insertion is thus localized in an intronic region. The observed rapid divergence in this region may reflect the fact that this region is partly nonfunctional. To verify these hypotheses concerning the origin and evolution of the ph locus, it would be informative to study the structure of ph in other species. In this way, we could better evaluate the age of the duplication event and the subsequent modifications.

ACKNOWLEDGEMENTS

We thank D. Pierre for allowing us to quote unpublished results and M. Ashburner for critical reading of an early version of this manuscript. This work was supported by the Contrat de Recherche Externe No. 891016 from the Institut National de la Santé et de la Recherche Médicale and by a Natural Sciences and Engineering Research Council grant to H.W.B., whereas J.D. was supported by a grant from the Association pour le Développement de la Recherche sur le Cancer.

REFERENCES

Chou, P.Y. and Fasman, G.D.: Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol. 47 (1978) 45-148.
Courey, A.J. and Tjian, R.: Analysis of Sp1 in vivo reveals multiple transcriptional domains, including a novel glutamine-rich activation motif. Cell 55 (1988) 887-898.
Devereux, J., Haeberli, P. and Smithies, O.: A comprehensive set of sequence analysis programs for the VAX. Nucleic Acids Res. 12 (1984) 387-395.
Duncan, I.M.: Polycomblike: a gene that appears to be required for the normal expression of the bithorax and Antennapedia gene complexes of Drosophila melanogaster. Genetics 102 (1982) 49-70.
Duncan, I.M. and Lewis, E.B.: Genetic control of body-segment differentiation in Drosophila. In: Subtelny, S. and Green, P.B. (Eds.), Developmental Order: Its Origin and Regulation. Liss, New York, 1982, pp. 533-554.
Dura, J.-M. and Ingham, P.: Tissue- and stage-specific control of homeotic and segmentation gene expression in Drosophila embryos by the polyhomeotic gene. Development 103 (1988) 733-741.
Dura, J.-M., Brock, H.W. and Santamaria, P.: polyhomeotic: a gene of Drosophila melanogaster required for correct expression of segmental identity. Mol. Gen. Genet. 198 (1985) 213-220.
Dura, J.-M., Randsholt, N.B., Deatrick, J., Erk, I., Santamaria, P., Freeman, D., Freeman, S., Weddell, D. and Brock, H.W.: A complex genetic locus, polyhomeotic, is required for segmental specification and epidermal development in D. melanogaster. Cell 51 (1987) 829-839.
Dura, J.-M., Deatrick, J., Randsholt, N.B., Brock, H.W. and Santamaria, P.: Maternal and zygotic requirement for the polyhomeotic complex genetic locus in Drosophila. Roux's Arch. Develop. Biol. 197 (1988) 239-246.
Evans, R.M. and Hollenberg, S.M.: Zinc fingers: gilt by association. Cell 52 (1988) 1-3.
Fisher, J.A., Giniger, E., Maniatis, T. and Ptashne, M.: GAL4 activates transcription in Drosophila. Nature 332 (1988) 853-856.
Hannah-Alava, A.: Developmental genetics of the posterior legs in Drosophila melanogaster. Genetics 43 (1958) 878-905.
Ingham, P.W.: A gene that regulates the bithorax complex differentially in larval and adult cells in Drosophila. Cell 37 (1984) 815-823.
Jürgens, G.: A group of genes controlling the spatial expression of the bithorax complex in Drosophila. Nature 316 (1985) 153-155.
Lewis, E.B.: A gene complex controlling segmentation in Drosophila. Nature 276 (1978) 565-570.
Liebhaber, S.A., Goossens, M. and Kan, Y.W.: Homology and concerted evolution at the α1 and α2 loci of human α-globin. Nature 290 (1981) 26-29.
Locke, J., Kotarski, M. and Tartof, K.D.: Dosage dependent modifiers of position effect variegation in Drosophila and a mass action model that explains their effect. Genetics 120 (1988) 181-198.
McKeon, F.D., Kirschner, M. and Caput, D.: Homologies in both primary and secondary structure between nuclear envelope and intermediate filament proteins. Nature 319 (1986) 463-468.
Murre, C., Schonleber McCaw, P., Vaessin, H., Caudy, M., Jan, L.Y., Jan, Y.N., Cabrera, C.V., Buskin, J.N., Hauschka, S.D., Lassar, A.B., Weintraub, H. and Baltimore, D.: Interactions between heterologous helix-loop-helix proteins generate complexes that bind specifically to a common DNA sequence. Cell 58 (1989) 537-544.
Nishizuka, Y.: The molecular heterogeneity of protein kinase C and its implications for cellular regulation. Nature 334 (1988) 661-665.
Paro, R. and Hogness, D.S.: The Polycomb protein shares a homologous domain with a heterochromatin-associated protein of Drosophila. Proc. Natl. Acad. Sci. USA 88 (1991) 263-267.

Pearson, W.R. and Lipman, D.J.: Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85 (1988) 2444-2448.
Reuter, G., Giarre, M., Farah, J., Gausz, J., Spierer, A. and Spierer, P.: Dependence of position-effect variegation in Drosophila on dose of a gene encoding an unusual zinc-finger protein. Nature 344 (1990) 219-223.
Samson, M.-L., Jackson-Grusby, L. and Brent, R.: Gene activation and DNA binding by Drosophila Ubx and abd-A proteins. Cell 57 (1989) 1045-1052.
Sanger, F., Nicklen, S. and Coulson, A.R.: DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA 74 (1977) 5463-5467.
Shen, S., Slightom, J.L. and Smithies, O.: A history of the human fetal globin gene duplication. Cell 26 (1981) 191-203.
Smith, G.R.: Homologous recombination in procaryotes. Microbiol. Rev. 52 (1988) 1-28.
Struhl, G.: A gene product required for correct initiation of segmental determination in Drosophila. Nature 293 (1981) 36-41.
Struhl, K.: Zinc finger motifs and DNA binding. Trends Biochem. Sci. 14 (1989) 137-140.
Theill, L.E., Castrillo, J.-L., Wu, D. and Karin, M.: Dissection of functional domains of the pituitary-specific transcription factor GHF-1. Nature 342 (1989) 945-948.
Wharton, K.A., Yedvobnick, B., Finnerty, V.G. and Artavanis-Tsakonas, S.: opa: a novel family of transcribed repeats shared by the Notch locus and other developmentally regulated loci in D. melanogaster. Cell 40 (1985) 55-62.
Zink, B. and Paro, R.: In vivo binding pattern of a trans-regulator of homoeotic genes in Drosophila melanogaster. Nature 337 (1989) 468-471.