Vol. 173, No. 3, 1990
BIOCHEMICAL AND BIOPHYSICAL RESEARCH COMMUNICATIONS
Pages 1019-1029
December 31, 1990
Structure and Expression of the Human Polymorphic Epithelial Mucin Gene: an expressed VNTR unit*
Carole A. Lancaster, Nigel Peat, Trevor Duhig, David Wilson, Joyce Taylor-Papadimitriou and Sandra J. Gendler¹
Imperial Cancer Research Fund, P.O. Box 123, Lincoln's Inn Fields, London WC2A 3PX, U.K.
Received October 15, 1990
Summary: The human polymorphic epithelial mucin (PEM) is expressed apically by glandular epithelium and by the carcinomas that develop from these tissues. Previously isolated cDNA clones revealed that the core protein contained a large domain consisting of variable numbers of 60 bp tandem repeats (TR), making it an expressed minisatellite. We now report the full genomic sequence of the PEM gene, including 803 bp of 5' flanking sequence. The gene is composed of 7 exons and varies in size from ≈4 to ≈7 kb, depending on the number of tandem repeats in exon 2. Expression of PEM was obtained from a genomic clone in an Epstein-Barr virus based vector, after transfection into a human epithelial cell line, indicating the presence of effective regulatory sequences in this clone. © 1990 Academic Press, Inc.

PEM is an extensively glycosylated high molecular weight mucin glycoprotein (250 to >500 kDa) expressed by normal and malignant human mammary epithelial cells. PEM contains at least 50% carbohydrate, which is O-linked through N-acetylgalactosamine. This mucin expresses the epitopes for a large number of monoclonal antibodies, including HMFG-1, -2 and SM3, selected for reactivity with normal and/or malignant epithelial cells (1,2). PEM is thought to be expressed in an aberrantly glycosylated form by breast and other carcinomas, the carbohydrate side chains of the cancer-associated mucin being shorter than those found in normal cells (3,4); this apparently results in the unmasking of the recently mapped core protein epitope recognised by SM3 (5). The wide clinical use of these PEM-reactive antibodies in diagnostic and therapeutic studies (6,7,8) led us to determine the full-length cDNA sequence coding for the core protein of this molecule (9). An important feature of this protein is the presence of varying numbers of a conserved 20 amino acid TR, which gives rise to an extensive polymorphism seen at DNA, RNA and protein levels (10,11,12).
Variable number tandem repeat (VNTR) units may be a common feature of mucins: the core proteins of the porcine submaxillary mucin (13) and the human colonic mucins (14,15) have recently been shown to contain repeat units of 81, 23 and 17 amino acids, respectively. The expression of PEM is both tissue-specific and related to the stage of differentiation of the mammary epithelium (16). Thus, it was of interest to determine the full genomic sequence for PEM, including possible 5' promoter/enhancer sequences. A cosmid clone selected from a library made from normal human DNA was used to define the intron-exon structure of the gene by comparison with the cDNA sequence, and the coding sequence determined from the genomic clones was found to be identical to that of the cDNA. The complete sequence for the PEM gene has been defined (with the exception of ≈1 kb of sequence from intron VI) together with an additional 803 bp of the 5' flanking sequence.

*GenBank Accession Numbers are 54350 and 54351.
¹To whom correspondence should be addressed.
Abbreviations: PEM, polymorphic epithelial mucin; TR, tandem repeat(s); VNTR, variable number tandem repeat; EBV, Epstein-Barr virus; EBNA, EBV nuclear antigen.
0006-291X/90 $1.50 Copyright © 1990 by Academic Press, Inc. All rights of reproduction in any form reserved.
Materials and Methods
Isolation and analysis of genomic clones for PEM. Cosmid clones were isolated from the human genomic libraries pCOS2EMBL and HA203, made from peripheral blood lymphocytes and from HA1.7, a nontransformed helper T cell clone, respectively (17). These libraries, which were kindly donated by Dr. A.M. Frischauf (pCOS2EMBL) and Dr. D. Kioussis (HA203), were screened by hybridisation with the pMUC10 probe (18) according to the method of Church and Gilbert (19). Purified clones (GPEM-1, -2 and -3) were restriction mapped using the λ terminase method of Rackwitz et al. (20). The GPEM-2 and -3 cosmid vector carries the EBV origin of replication, the EBV nuclear antigen (EBNA-1) gene, the hygromycin phosphotransferase gene and pBR322 sequences. This cosmid can replicate autonomously in the nucleus of human tissue culture cells, even when it carries a 35 kb insert. A BamHI fragment (8.8 kb) and two EcoRI fragments (8.3 and 4.5 kb) were subcloned into pBluescript SK+ (Stratagene, La Jolla, CA) for sequencing by the dideoxy chain termination method (21) using a combination of Bluescript primers and synthetic oligonucleotides to known cDNA and genomic sequences. All reported sequences were fully determined in both directions. Computer analysis of the sequence data was performed using Intelligenetics software on a VAX system.
Results
Isolation and characterisation of cosmid clones for PEM. The cosmid libraries pCOS2EMBL and HA203 were screened with a 1600 bp probe, pMUC10, containing part of the tandem repeat domain of PEM (18). One clone, designated GPEM-1, was isolated from 2 × 10⁶ recombinants of pCOS2EMBL, and two clones, GPEM-2 and -3, were isolated from 2.5 × 10⁵ recombinants of HA203 (17). These 3 clones were shown to have different inserts on the basis of preliminary restriction digests with EcoRI/HinfI; an alignment and partial restriction map of the 3 cosmids is shown in Figure 1. All 3 clones contained the entire coding sequence of the PEM gene within 35-37 kb inserts. The orientation of the clones was confirmed by total BamHI digestion of the cosmids followed by Southern analysis using the ³²P-labelled BamHI bands to cross-hybridise with BamHI bands from each of the other two cosmid clones. Two bands from GPEM-1 (8.8 and 3 kb) hybridised to similar sized bands of GPEM-3, whilst only the 8.8 kb band hybridised with GPEM-2, confirming that the clones overlapped centrally (Fig. 1). As we have shown previously, the PEM gene is polymorphic, giving a restriction fragment length polymorphism with a variety of restriction enzymes (including EcoRI and BamHI) which cut outside the tandem repeat domain. In fact, the allele sizes (EcoRI and BamHI) in the two libraries used here appear to be similar. Southern blot analysis of the EcoRI digested DNA used in the construction of the pCOS2EMBL library gave the same EcoRI allele size as that observed in the GPEM-1 clone (8.3 kb), indicating that the genomic cellular DNA had not undergone gross rearrangement in the library construction. In addition, the restriction maps of the insert regions common to GPEM-1, -2 and -3 indicate that there is no detectable rearrangement between the two types of DNA used in the library construction or between the cellular DNA and cosmid clones.
Figure 1. Restriction map of genomic DNA containing the PEM gene. A restriction map is shown of 62 kb of genomic DNA containing the PEM gene, obtained from analysis of 3 overlapping clones GPEM-1, -2 and -3 with varying amounts of 5' and 3' sequence (B, BamHI; E, EcoRI; S, SacII). The PEM gene is located centrally, within 2 EcoRI bands (bold line) of 8.3 and 5 kb, respectively. The region sequenced is expanded to show the organization of exons (E1 to E7, with the VNTR unit in E2) and introns (sizes given in bp). Boxes indicate protein coding regions, whilst lines indicate untranslated regions.
Hybridisation of the GPEM-2 BamHI bands to Northern blots of RNA from T47D and MCF-7 cells (data not shown) suggested that the whole of the coding sequence for PEM was contained within the 8.8 and 3 kb BamHI bands which cross-hybridised between the clones. Since the VNTR region, consisting of varying numbers of the 60 bp repeat, was located in the 8.8 kb BamHI fragment and in the 8.3 kb EcoRI fragment, these bands from the GPEM-1 cosmid were subcloned into pBluescript SK+ for sequencing (Fig. 2). Both of these bands contained exons 1-6 (Fig. 1). The adjacent EcoRI band (≈4.5 kb) overlapping the 3' end of the 8.8 kb BamHI fragment was also subcloned into Bluescript and used to sequence exon 7.
Exon/intron structure and alternate splicing of the PEM gene. The exon/intron boundaries were mapped precisely by comparison with the previously determined cDNA sequence (9) (Fig. 3A). The gene contains 7 exons spanning ≈4.2-7.0 kb, according to the size of the tandem repeat domain. The sequence shown in Fig. 2 contains 803 bp of sequence 5' to the transcriptional start site, which was previously identified by primer extension (9); this site is followed by a 72 bp leader sequence prior to the translational start site (ATG) in exon 1 and an almost completely homologous Kozak consensus sequence, CCACCATGA (22). This exon has a further 55 bp of translated sequence to the boundary with intron I, which splices through the penultimate amino acid of a potential signal sequence (23) (Fig. 2). An alternative splice site to that found by ourselves has been observed for the intron I/exon 2 boundary in the partial 5' sequence of the DF-3 and H23 antigens and the episialin molecule, all of which appear to be the same as the PEM gene (24,25,26); the use of this splice site results in exon 2 having an additional 27 base pairs, which alters the codon where the sequence is inserted but does not affect
-803 tactcctctc g~qgaggagct tggaaatttc actgtgggtt
cgcccggtcc gagcggcccc -703 cctggccagt ggtggagagt ttcccccact -553 caggggagaa
cctccttggc cggggtgtgg
ggggagggag cccaaaacta gcacctagtc -403 cggggaqgqc gqggaagtgg agtgggagac -303 tttgttcccc atccccacgt tagttgttgc ccccctcccc
cggagccagg gagtggttgg -153 gattagagcc cttgtaccct acccaggaat tgtcacctgt cacctgctcg ctgtgcctas -i+i gccCGCTCCA CCTCTCAAGC AGCCAGCGCC 107 ACCGGGCACC CAGTCTCCTT TCTTCCTGCT tgccctgctt tgcagacaga
aggtggtctt cgtggtcttt +257 ggctgccctg tctgtgccag
gagtgggtac caggggcaag caaatgtcct +407 gaagaaatgc aggggccatg agccaaggcc +507 gggttactga ggctgcccac tccccagtcc tatttttcct
tcataaagac ccaaccctat +657 ACAGGTTCTG GTCATGCAAG CTCTACCCCA TACTGAGAAG AATGCTGTGA GTATGACCAG +807 GACAGGATGT CACTCTGGCC CCGGCCACGG +907 CCAGTCACCA GGCCAGCCCT GGGCTCCACC GGGCTCCACC GCCCCCCCAG CCCACGGTGT +1057 CCCATGGTGT CACCTCGGCC CCGGACAACA TCAGGCTCTG CATCAGGCTC AGCTTCTACT +1207 GAGCACTCCA TTCTCAATTC CCAGCCACCA +1307 GTAGCACTCA CCATAGCACG GTACCTCCTC TTCTTTTTCC TGTCTTTTCA CATTTCAAAC +1457 GCTGCAGAGA GACATTTCTG AAATGgtgag cacacccttt gcatcaagcc cgagtccttt +1607 CTCTCCAATA TTAAGTTCAG gtacagttct
tcagcttgcg cggcccagcc
-753 ccgcaaggct
cccggtgacc
actaga~gQc -653 ggcaaggaag gaccctaggg ttcatcggag cccaggttta ctcccttaag -603 tttctccaag gagggaaccc aggctgctgg aaagtccggc tggggggggg -503 aacgggacag ggagcggtta gaagggtggg gctattccgg gaagtggtgg -453 cactcattat ccagccctct tatttctcgg ccgctctgct tcagtggacc -353 ctaggggtgg gcttcccgac cttgctgtac aggacctcga cctagctggc -253 cctgaggcta aaactagagc ccaggggccc caagttccag actgcccctc -203 tgaaaggggg aggccagctg gagaacaaac gggtagtcag ggggttgagc -103 ggttggggag gaggaggaag aggtaggagg taggggagga ggcggggttt -53 ~Qc~gqcq~g cg~qgagtgg ggggaccggt ataaagcggt aggcgcctgt +57 TGCCTGAATC TGTTCTGCCC CCTCCCCACC CATTTCACCA CCACCATGAC 157 GCTGCTCCTC ACAGTGCTTA CAGgtgaggg gcacgaggtg gggagtgggc +207 ctgtgggttt tgctccctgg cagatggcac catgaagtta aggtaagaat +307 aaggagggag aggctaagga caggctgaga agagttgccc ccaaccctga +357 gtagagaagt ctagggggaa gagagtaggg agagggaagg cttaagaggg +457 tatgggcaga gagaaggagg Ctgctgcagg gaaggaggct tccaacccag +557 tcctggtatt atttctctgg tggccagagc ttatattttc ttcttgctct +607 gactttaact tcttacagct accacagccc ctaaacccgc aacagTTGTT +707 GGTGGAGAAA AGGAGACTTC GGCTACCCAG AGAAGTTCAG TGCCCAGCTC +757 CAGCGTACTC TCCAGCCACA GCCCCGGTTC AGGCTCCTCC ACCACTCAGG +857 AACCAGCTTC AGGTTCAGCT GCCACCTGGG GACAGGATGT CACCTCGGTC +957 ACCCCGCCAG CCCACGATGT CACCTCAGCC CCGGACAACA AGCCAGCCCC +1007 CACCTCGGCC CCGGACACCA GGCCGGCCCC GGGCTCCACC GCCCCCCCAG +1107 GGCCCGCCTT GGGCTCCACC GCCCCTCCAG TCCACAATGT CACCTCGGCC +1157 CTGGTGCACA ACGGCACCTC TGCCAGGGCT ACCACAACCC CAGCCAGCAA +1257 CTCTGATACT CCTACCACCC TTGCCAGCCA TAGCACCAAG ACTGATGCCA +1357 TCACCTCCTC CAATCACAGC ACTTCTCCCC AGTTGTCTAC TGGGGTCTCT +1407 CTCCAGTTTA ATTCCTCTCT GGAAGATCCC AGCACCGACT ACTACCAAGA +1507 tatcggcctt tccttcccca tgctcccctg aagcagccat cagaactgtc +1557 ccctctcacc ccagTTTTTG CAGATTTATA AACAAGGGGG TTTTCTGGGC +1657 gggtgtggac ccagtgtggt ggttggaggg ttgggtggtg gtcatgaccg
Figure 2. Nucleotide sequence of the PEM gene. The sequence shown extends for 4.3 kb and includes 803 bp of 5' flanking sequence, the complete coding sequence (upper case letters), introns I-V (lower case letters), part of intron VI and 131 bp of 3' untranscribed sequence. Only one complete TR is shown (marked by brackets). The number of repeats (n) varies from 21 to 125 in the Northern European population (9). Numbering was arbitrarily chosen to include only 1 TR and to exclude the ≈1.3 kb of undefined sequence in intron VI. Potential Sp1 binding sites and the TATAA box are underlined.
+1707 taggagggac tggtgcactt aaggttgggg gaagagtgct gagccagagc +1807 tgtgaccagG CCAGGATCTG TGGTGGTACA ATTGACTCTG GCCTTCCGAG +1857 CACAGTTCAA TCAGTATAAA ACGGAAGCAG CCTCTCGATA TAACCTGACG +1957 ctggctgcag ccagcaccat gccggggccc ctctccttcc agtgtctggg +2007 gaggggcgcc tcctctggga gactgccctg accactgctt ttccttttag +2107 AGTCTGGGGC TGGGGTGCCA GGCTGGGGCA TCGCGCTGCT GGTGCTGGTC +2207 CTCATTGCCT TGgtgagtgc agtccctggc cctgatcaga gccccccggt +2257 ctatctcccc agGCTGTCTG TCAGTGCCGC CGAAAGAACT ACGGGCAGCT +2357 TCCTATGAGC GAGTACCCCA CCTACCACAC CCATGGGCGC TATGTGCCCC +2407 AGgtgagatt ggccccacag gccaggggaa gcagagggtt tggctgggca +2507 aaagagcttg gaagaggtga gaagtggcgt gaagtgagca ggggagggcc +2607 gttttggggg acaggcctgg gaggagacta tggaagaaag gggcctcaag +2657 aaagatcatt ggccgtccac attcatgctg gctggcgctg gctgaactgg +2757 ttttttgcac ccagaggcaa aatgggtgga gcactatgcc caggggagcc +2807 tgatccccta atcaatctcc taggaatgga gggtagaccg agaaaaggct +2907 agcaagaag//ggtaccttttgctcctcaccc tggatctctt ttccttccac +3007 CCTCTCTTAC ACAAACCCAG CAGTGGCAGC CACTTCTGCC AACTTGTAGG +3057 CCAGTGCCAT TCCACTCCAC TCAGGTTCTT CAGGGCCAGA GCCCCTGCAC +3157 GTGGGCTGCT CACACGTCCT TCAGAGGCCC CACCAATTTC TCGGACACTT +3207 GAGGCTCATG CCTGGGAAGT GTTGTGGTGG GGGCTCCCAG GAGGACTGGC +3307 AACTGGACTG AATAAAACGT GGTCTCCCAC TGgcgccaac ttctgatctt +3407 agaatgtgtg tgagggggct gggggaggag acagggaggc caggaggcag
+1757 tgggacccgt ggctgaagtg cccatttccc AAGGTACCAT CAATGTCCAC GACGTGGAGA +1907 ATCTCAGACG TCAGCGgtga ggctacttcc tccccgctct ttccttagtg ctggcagcgg +2057 TGAGTGATGT GCCATTTCCT TTCTCTGCCC +2157 TGTGTTCTGG TTGCGCTGGC CATTGTCTAT agaaggcact ccatggcctg ccataacctc +2307 GGACATCTTT CCAGCCCGGG ATACCTACCA CTAGCAGTAC CGATCGTAGC CCCTATGAGA +2457 aggattctga aqggggtact tggaaaaccc +2557 tggcaaggat gaggggcaga ggtcagagga agggagtggc cccactgcca gaattcctaa +2707 tgccaccgtg gcagttttgt tttgttttgc cttcccgagg agtccagggg tgagcctctg +2857 ggcatagggg gagtcagttt cccaggtaga +2957 ccagGTTTCT GCAGGTAATG GTGGCAGCAG GGCACGTCGC CCGCTGAGCT GAGTGGCCAG +3107 CCTGTTTGGG CTGGTGAGCT GGGAGTTCAG CTCAGTGTGT GGAAGCTCAT GTGGGCCCCT +3257 CCAGAGAGCC CTGAGATAGC GGGGATCCTG +3357 tcatctgtga cccgtgggca gcagggcgtc taaggagcga gtttgtttga gaagccaggg
aga +3440
Figure 2. - Continued.
the reading frame of the translated product; the putative signal sequence is however shortened by one amino acid (25). The sequence of this region and its translated product are shown in Fig. 3B. An antisense oligo consisting of these additional 27 bp hybridises to both forms of the mRNA in T47D and MCF-7 breast cancer cells, indicating that the alternate splice is present in both transcripts (data not shown). The 3' end of the signal sequence is therefore found at the 5' end of exon 2; this is the largest exon and, since it codes for the TR, it varies in size according to the individual allele (9,10). In GPEM-1, the VNTR unit is estimated to account for approximately 2.3 kb of genomic DNA. Introns II to V are all small (<150 bp) and located between the coding exons in the 3' end of the gene. Intron V occurs within the transmembrane sequence, designated as such due to the high content of hydrophobic amino acids in the predicted product. The 3' terminus of the gene is contained within exon 7 (378 bp), which codes for the remainder of the translated product, a TAG stop codon at
A.
EXON                      INTRON                      EXON
      5' boundary                       3' boundary

1.         130              I               633
      TTACAG|gtgagg ................ cccgcaacag|TTGT

2.        1462              II             1562
      GAAATG|gtgagt ................ ctcaccccag|TTTT

3.        1617              III            1767
      GTTCAG|gtacag ................ ctgtgaccag|GCCA

4.        1903              IV             2048
      TCAGCG|gtgagg ................ ttccttttag|TGAG

5.        2169              V              2250
      GCCTTG|gtgagt ................ atctccccag|GCTG

6.        2309              VI             2932
      GAGAAG|gtgaga ............ ---- ttccacccag|GTTT

Consensus
      NNNNAG|gtaagt ................ yyyyyyncag|NNNN

B.
Alternate splice junction
          EXON 1 | INTRON I | EXON 2

PEM, Pancreatic Mucin
      CTTACAG|gtgagg....ttaacttcttacagctaccacagcccctaaacccgcaacag|TTGTTACAGGTTCT
      L  T  V  [27 bp]  V  T  G  S

DF3, Episialin
      CTTACAG|gtgagg....ttaacttcttacag|CTACCACAGCCCCTAAACCCGCAACAGTTGTTACAGGTTCT
      L  T  A  T  T  A  P  K  P  A  T  V  V  T  G  S
Figure 3. Splice junction consensus sequences and alternate splicing. A. The genomic sequence of PEM is shown at the boundaries of exons and introns. The centre numbers indicate the introns; the numbers above each line refer to the genomic sequences in Fig. 2. The intron I/exon 2 border is given for the mRNA species obtained in BT20 cells. B. An alternate splice junction at the 3' end of intron I was observed for DF-3 (24), H23 (25) and episialin (26); it occurs 27 nucleotides upstream of the PEM splice site, resulting in an additional 9 codons in the mRNA which are in frame with the translated PEM product. Any nucleotide, N; pyrimidine, Y.

position +3006 and a 3' untranslated region of 303 bp, which includes a single AATAAA consensus signal for polyadenylation.
Expression of the PEM gene. To ascertain whether the regulatory sequences required for expression of PEM were present in any of the cosmids, GPEM-2, which contained the most 5' flanking sequence, was transfected into a cell line expressing the EBNA gene to allow maintenance of the EBV-based HA203 cosmid as an episome (17). The cell line fR2, developed by SV40 immortalisation of milk epithelial cells (27), contains an allele of PEM which is not expressed at detectable levels and is sufficiently different in size from the GPEM-2 allele to be distinguished from it. Examination of protein expression by Western blotting (Fig. 4) showed that PEM was expressed from the HA203 cosmid GPEM-2 in fR2 EBNA cells. When cells were grown in the absence of hygromycin for several weeks, expression from the GPEM-2 PEM allele was not detected, suggesting that PEM expression was from the episomal EBV-based cosmid, which was lost during cultivation in the absence of hygromycin.
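The reading-frame argument made for the alternate splice junction of Fig. 3B can be checked with a short sketch. It assumes that the reading frame is the one implied by the amino acids printed under the sequences in Fig. 3B, and that the three sequence segments below are transcribed correctly from that figure; the names `EXON1_END`, `EXTRA_27`, `EXON2_START` and `translate` are illustrative only.

```python
from itertools import product

# Standard genetic code, built in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string whose length is a multiple of three."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


# Segments as read from Fig. 3B (assumed to start in frame at CTT).
EXON1_END = "CTTACAG"                        # last 7 nt of exon 1
EXTRA_27 = "CTACCACAGCCCCTAAACCCGCAACAG"     # included only by the upstream site
EXON2_START = "TTGTTACAGGTTCT"               # exon 2 from the PEM splice site

pem_form = translate(EXON1_END + EXON2_START)
df3_form = translate(EXON1_END + EXTRA_27 + EXON2_START)

print(pem_form)  # L T V V T G S in one-letter code
print(df3_form)  # nine residues longer, same downstream frame
```

Because the extra segment is 27 nt long (a multiple of three), the residues downstream of the junction (V T G S) are unchanged, while the codon spanning the junction changes from GTT (V) to GCT (A), as stated in the text.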
Figure 4. Expression of PEM protein. Immunoblot analysis of PEM protein produced in fR2 cells (lane e), in fR2 EBNA cells (lane d) and in 3 clones selected in hygromycin (250 µg/ml) after transfection of fR2 EBNA cells with the GPEM-2 cosmid (lanes a, b, c). Cell lysates were fractionated by electrophoresis through a 5% SDS/polyacrylamide gel and transferred to nitrocellulose; PEM protein was detected by binding of HMFG-2 as previously described (18).
Discussion
The full-length genomic sequence has been determined for the polymorphic epithelial mucin, PEM, together with an additional 803 bp of 5' flanking sequences. The coding sequence determined from analysis of the cosmid clones is in agreement with that previously determined for the cDNA. Since the DNA used for the construction of the cosmid library, pCOS2EMBL, was from peripheral blood lymphocytes of a normal individual whilst the cDNA libraries were based on mRNA from breast cancer cells, it is clear that the sequences coding for the core protein are identical in normal and malignant tissues. It seems likely therefore that the differences in the reaction of a panel of mucin antibodies with the normally produced and cancer-associated mucin are due to aberrant glycosylation. The human PEM gene spans approximately 4 to 7 kb of genomic DNA (the size being variable according to the size of the VNTR unit) and contains 7 exons. Previous studies of the cDNA for PEM (9) have shown it to be a typical integral membrane protein with an N-terminal region containing a hydrophobic signal sequence, a tandem repeat domain and a C-terminal region, consisting of a transmembrane sequence and a 69 amino acid cytoplasmic tail. Due to alternative splicing the signal sequence may include 13 or 12 amino acids. It is conceivable that differences in the signal sequence could affect the transport to the plasma membrane and/or secretion of the molecule. The recently cloned cDNA for the pancreatic apomucin appears to be identical to PEM (28) and codes for the same 13 amino acid signal sequence. Computer searches of the EMBL and NIH databases did not reveal any statistically significant similarities at the DNA level between PEM and other genes. The sequence is unusual both in its composition and structure. The sequence of the TR region is 82% G+C, in marked contrast to the 40% G+C composition usually observed in mammalian genomic DNA.
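The G+C figure quoted for the TR region can be reproduced with a small calculation. The sketch below is illustrative only: `TR_UNIT` is a 60 bp unit assembled from the repeat fragments printed in Fig. 2, and the register chosen for the start of the unit is an assumption (the G+C fraction does not depend on it); `gc_fraction` is a hypothetical helper name.

```python
def gc_fraction(dna: str) -> float:
    """Return the fraction of G and C bases in a DNA string."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in dna) / len(dna)


# One 60 bp tandem repeat unit, joined from six 10-mers shown in Fig. 2
# (assumed register; any rotation of the unit gives the same composition).
TR_UNIT = (
    "GGGCTCCACC" "GCCCCCCCAG" "CCCACGGTGT"
    "CACCTCGGCC" "CCGGACACCA" "GGCCGGCCCC"
)

tr_gc_percent = 100 * gc_fraction(TR_UNIT)
print(f"{len(TR_UNIT)} bp repeat, G+C = {tr_gc_percent:.0f}%")
```

For this unit the calculation gives 49 G or C bases out of 60, which rounds to the 82% stated in the text.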
A large domain of the PEM gene, found in the second exon, consists of 60 bp repeats which are extremely homogeneous in sequence and code for a 20 amino acid repeat unit, making it an expressed VNTR locus. Tandem repeats appear to be a characteristic of mucins, as the other three mucins for which partial cDNAs have been reported all consist of precise repeats which are unrelated to, and have different lengths from, the PEM repeats (13,14,15). The PEM TR consists of 25% prolines and 25% serine and threonine residues, giving it the potential to be a highly glycosylated, extended
molecule. The variability observed in the size of the PEM molecule suggests that the length is not crucial to mucin function, but rather that the core protein exists in an extended form as a scaffold for O-linked carbohydrate. The carbohydrate side chains may differ depending on the tissue studied or on the change to malignancy. The sequence homogeneity and the variability in repeat unit number (20 to >100 repeats (9)) may be explained by unequal crossing over (29,30,31). Sequence similarities between repeats may cause chance misalignments in pairing, and the resultant crossover results in duplication of a series of repeats in one homologue or sister chromatid and their deletion in the other. Thus, point mutations that have occurred will either spread quickly through the array or be removed. The extent of length polymorphisms will depend on the rate of new allele production, with the most polymorphic VNTR loci (of which PEM is one) being the most homogeneous in sequence. Jeffreys and co-workers have postulated that the VNTR core may be a eukaryotic recombination signal (29). A key part of the hypothesis is that the Chi sequence (which is a signal for recombination in λ and E. coli), or a slight variation of Chi, is present in many VNTR loci; likewise, Chi or a 7/8 match is present in the TRs of PEM, the intestinal mucin (14), the porcine submaxillary apomucin (13), as well as in the mouse mucin homologous to PEM (A. Spicer and S. Gendler, unpublished data). If Chi or some variation of Chi can serve as a signal for recombination, then a speculative model for the generation of TRs can be invoked. The DNA duplex near the Chi sequence is nicked; repair synthesis and ligation of the nicked strand result in duplication of the Chi-like core sequence, which subsequently promotes mispairing and unequal exchange, leading to amplification to form a TR (cf. Jarman and Wells (31) for a fuller explanation).
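The bookkeeping behind unequal crossing over can be made concrete with a toy model. This is a minimal sketch under simplifying assumptions that are not taken from the paper: each array is a list of repeat labels, pairing is misaligned by a whole number of repeats (`offset`), and a single exchange occurs at a repeat boundary (`breakpoint`). The function name and parameters are hypothetical.

```python
def unequal_crossover(array_a, array_b, offset, breakpoint):
    """Exchange two tandem arrays paired out of register.

    Repeat i of array_a is assumed to pair with repeat i + offset of
    array_b. The exchange happens after `breakpoint` repeats of array_a,
    i.e. after `breakpoint + offset` repeats of array_b.
    """
    partner = breakpoint + offset
    if not 0 <= breakpoint <= len(array_a):
        raise ValueError("breakpoint outside array_a")
    if not 0 <= partner <= len(array_b):
        raise ValueError("misaligned breakpoint outside array_b")
    shortened = array_a[:breakpoint] + array_b[partner:]
    lengthened = array_b[:partner] + array_a[breakpoint:]
    return shortened, lengthened


# Two 40-repeat alleles, mispaired by 3 repeats, exchanging after repeat 10.
allele_a = [f"a{i}" for i in range(1, 41)]
allele_b = [f"b{i}" for i in range(1, 41)]
short, long_ = unequal_crossover(allele_a, allele_b, offset=3, breakpoint=10)
print(len(short), len(long_))
```

In this model the total number of repeats is conserved: one product gains `offset` repeats and the other loses the same number, which is the duplication/deletion pair described above. Each product also carries repeats from both parents, which illustrates how a variant repeat can either spread through an array or be lost.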
While changes in the level of expression of PEM by cultured cells have been related to growth rate (32), to treatment with human interferon (HuIFN) (33) and to calcium levels (34), all of these studies have dealt with long-term effects which may not relate to direct effects on gene transcription. One reason for obtaining genomic sequence for PEM was to investigate the sequences involved in the control of expression of a gene which is selectively expressed in certain epithelial tissues, is developmentally regulated and is apparently upregulated in many carcinomas. Since it is possible to obtain expression in mammalian cells of genes cloned as inserts in EBV based vectors, expression of PEM from GPEM-2 was tested in a human mammary epithelial cell line (fR2) previously transfected with the EBNA gene. Expression of the exogenous gene was distinguished on the basis of the size of the protein product and was dependent on the presence of hygromycin to maintain the GPEM-2 episome. Since GPEM-2 contained very little 3' flanking sequence, but 29 kb of 5' flanking sequence, it seems likely that crucial regulatory sequences are located in the 5' flanking region and possibly within the gene itself. An examination of the 5' flanking sequence defined here shows the presence of many motifs identical to or showing high homology with known response elements (Table 1). Primer extension and cDNA analysis have established the transcription start site for PEM (9). A TATAA box is located at -24 to -19, but no CAAT consensus sequence is present. This region also contains multiple sites for the cellular transcription factor Sp1, which activates transcription by RNA polymerase II (35); one such site occurs as a GC box with dyad symmetry, and the site at -97 matches perfectly the high-affinity decanucleotide Sp1 binding sequence GGGGCGGGGT.
Sequences were also found with homology to other potential regulatory sequences (see Table 1), and work is in progress to assess their relevance to regulation of PEM expression.
Table 1. Potential regulatory elements within the 5' flanking sequence

Regulatory element               Consensus sequence    Sequence in PEM     Location    Ref.

Sp1                              GGGCGG                GGGCGG              -727        35
                                                       GGGCGG              -397
                                                       GGGCGG              -94
                                                       GGGCGGGCGGGCGGG     -54

SV40 enhancer element                                                                  36
  a                              ATGTGTGT              CTGTGGGT            -562
  b                              GCATGCAT              GCCTGCCT            +25
  c                              GTGGATAG              GTGGAGAG            -702

AP-1                             CTGACTCA (A)          GTGACCAC            -739        37,38
                                                       CTGCTTCA            -418
                                                       GTGCCTAG            -61
                                                       CTGCCTGA            +27

AP-2                             CCCCAGGC (G, G)       ACCCAGGC            -597        38,40
                                                       CACCGGGC            +77

NFI/CTF                          TTGGCTNNNAGCCAA       TTGGCTTTC-TCCAA     -618        41,42

Glucocorticoid regulatory element:                                                     43,44
  Core sequence                  TGTTCT                TGTTCT              +38
                                                       TGTTCC              -321
  Consensus sequence             GGTACANNNTGTTCT       GCCTGAATCTGTTCT     +29
                                                       AGCTGGCTTTGTTCC     -330

CACCC factor                     CACCC                 CACCC               +84/+54     45

Progesterone receptor            ATTCCTCTGT            ACTCCTCTCC          -800        46
consensus sequence                                     ACTCCTCCTT          -626
                                                       ATTTCTCGGC          -432

Estrogen consensus sequence      GGTCANNNTGACC         GCTCCCGGTGACC       -746        47

RNA Polymerase III                                                                     48
  Box A                          RRYNNARYXGG           GACCTAGCTGG         -335
                                                       AGTGGAGTGGG         -388
  Box B                          GWTCRANNC             GTTCCAGAC           -260

Enhancer sequences:
  Interferon-β seq               GGAAATTCCTCTG         GGAAATTTCTTCC       -642        49
  CMV enhancer                   GGAAAGTCCCGTT         GGAAAGTCCGGCT       -585        50
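The comparison of PEM sequences with consensus motifs in Table 1 can be illustrated by counting mismatches against a consensus written with IUPAC ambiguity codes. The sketch below is not the search procedure used in the paper (that was performed with Intelligenetics software); the function names, the `max_mismatches` parameter and the 0-based offsets it reports are assumptions made for illustration.

```python
# IUPAC nucleotide codes used by the consensus sequences in Table 1.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT", "N": "ACGT",
}


def count_mismatches(consensus: str, site: str) -> int:
    """Count positions where `site` is not allowed by `consensus`."""
    if len(consensus) != len(site):
        raise ValueError("consensus and site must have equal length")
    return sum(
        base not in IUPAC[symbol]
        for symbol, base in zip(consensus.upper(), site.upper())
    )


def find_sites(sequence: str, consensus: str, max_mismatches: int = 0):
    """Yield (0-based offset, window) for windows within the mismatch limit."""
    width = len(consensus)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if count_mismatches(consensus, window) <= max_mismatches:
            yield start, window


# Estrogen consensus versus the PEM sequence listed at -746 in Table 1.
ere_mismatches = count_mismatches("GGTCANNNTGACC", "GCTCCCGGTGACC")
# CMV enhancer consensus versus the PEM sequence listed at -585.
cmv_mismatches = count_mismatches("GGAAAGTCCCGTT", "GGAAAGTCCGGCT")
# Sp1 core hexamer inside the high-affinity decanucleotide quoted in the text.
sp1_hits = list(find_sites("GGGGCGGGGT", "GGGCGG"))

print(ere_mismatches, cmv_mismatches, sp1_hits)
```

Offsets returned by `find_sites` are positions within the string supplied, not the gene coordinates used in Fig. 2; converting between the two would require knowing the coordinate of the first base of the string.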
Acknowledgments. We thank Dr. I. Goldsmith for providing oligonucleotides, Drs. M. Fried and C. Dickson for reading the manuscript, and Liz Eaton and Kim Richardson for excellent secretarial assistance.
References
1. Burchell, J.M., Durbin, H. and Taylor-Papadimitriou, J. (1983) J. Immunol. 131, 508-513.
2. Burchell, J.M., Gendler, S., Taylor-Papadimitriou, J., Girling, A., Lewis, A. and Millis, R. (1987) Cancer Res. 47, 5476-5482.
3. Hull, S.R., Bright, A., Carraway, K.L., Abe, M., Hayes, D.F. and Kufe, D.W. (1989) Cancer Commun. 1, 261-267.
4. Hanisch, F-G., Uhlenbruck, G., Peter-Katalinic, J., Egge, H., Dabrowski, J. and Dabrowski, U. (1989) J. Biol. Chem. 264, 872-883.
5. Burchell, J., Taylor-Papadimitriou, J., Boshell, M., Gendler, S. and Duhig, T. (1989) Int. J. Cancer 44, 691-696.
6. Hilkens, J., Kroezen, V., Bonfrer, J.M.G., De Jong-Bakker, M. and Bruning, P.F. (1986) Cancer Res. 46, 2582-2587.
7. Granowska, M., Shepherd, J., Britton, K.E., Ward, B., Mather, S., Taylor-Papadimitriou, J., Epenetos, A.A., Carroll, M.J., Nimmon, C.C., Hawkins, L.A., Slevin, M., Flatman, W., Home, T., Burchell, J., Durbin, H. and Bodmer, W. (1984) Nucl. Med. Commun. 5, 485-499.
8. Hammersmith Oncology Group and Imperial Cancer Research Fund (1984) Lancet ii, 1441-1443.
9. Gendler, S.J., Lancaster, C.A., Taylor-Papadimitriou, J., Duhig, T., Peat, N., Burchell, J., Pemberton, L., Lalani, E.-N. and Wilson, D. (1990) J. Biol. Chem. 265, 15286-15293.
10. Gendler, S., Taylor-Papadimitriou, J., Duhig, T., Rothbard, J. and Burchell, J. (1988) J. Biol. Chem. 263, 12820-12823.
11. Gendler, S., Burchell, J., Girling, A., Millis, R., Duhig, T. and Taylor-Papadimitriou, J. (1988) In Breast Cancer: Scientific and Clinical Progress (M.A. Rich, J.C. Hager and D.M. Lopez, eds.), pp. 112-126. Kluwer Academic Publishers, Boston, MA.
12. Swallow, D.M., Gendler, S., Griffiths, B., Corney, G., Taylor-Papadimitriou, J. and Bramwell, H. (1987) Nature 328, 82-84.
13. Timpte, C.S., Eckhardt, A.E., Abernethy, J.L. and Hill, R.L. (1988) J. Biol. Chem. 263, 1081-1088.
14. Gum, J.R., Byrd, J.C., Hicks, J.W., Toribara, N.W., Lamport, D.T.A. and Kim, Y.S. (1989) J. Biol. Chem. 264, 6480-6487.
15. Gum, J.R., Hicks, J.W., Swallow, D.M., Lagace, R.L., Byrd, J.C., Lamport, D.T.A., Siddiki, B. and Kim, Y.S. (1990) Biochem. Biophys. Res. Commun. 171, 407-415.
16. Zotter, S., Hageman, P.C., Lossnitzer, A., Mooi, W.J. and Hilgers, J. (1988) Cancer Rev. 11-12, 55-101.
17. Kioussis, D., Wilson, F., Daniels, C., Leveton, C., Taverne, J. and Playfair, J.H.L. (1987) EMBO J. 6, 355-361.
18. Gendler, S.J., Burchell, J.M., Duhig, T., Lamport, D., White, R., Parker, M. and Taylor-Papadimitriou, J. (1987) Proc. Natl. Acad. Sci. USA 84, 6060-6064.
19. Church, G.M. and Gilbert, W. (1984) Proc. Natl. Acad. Sci. USA 81, 1991-1995.
20. Rackwitz, H.R., Zehetner, G., Murialdo, H., Delius, H., Chai, J.H., Poustka, A., Frischauf, A. and Lehrach, H. (1985) Gene 40, 259-266.
21. Sanger, F., Nicklen, S. and Coulson, A.R. (1977) Proc. Natl. Acad. Sci. USA 74, 5463-5467.
22. Kozak, M. (1989) Nucl. Acids Res. 14, 4683-4690.
23. Von Heijne, G. (1986) Nucl. Acids Res. 14, 4683-4690.
24. Abe, M., Siddiqui, J. and Kufe, D. (1989) Biochem. Biophys. Res. Commun. 165, 644-649.
25. Wreschner, D.H., Hareuveni, M., Tsarfaty, I., Smorodinsky, N., Horev, J., Zaretsky, J., Kotkes, P., Weiss, M., Lathe, R., Dion, A. and Keydar, I. (1990) Eur. J. Biochem. 189, 463-473.
26. Ligtenberg, M.J.L., Vos, H.L., Gennissen, A.M.C. and Hilkens, J. (1990) J. Biol. Chem. 265, 5573-5578.
27. Chang, S.E., Keen, J., Lane, E.B. and Taylor-Papadimitriou, J. (1982) Cancer Res. 42, 2040-2053.
28. Lan, M.S., Batra, S.K., Qi, W.-W., Metzgar, R.S. and Hollingsworth, M.A. (1990) J. Biol. Chem. 265, 15824-158.
29. Jeffreys, A.J., Wilson, V. and Thein, S.L. (1985) Nature 314, 67-73.
30. Jeffreys, A.J., Royle, N.J., Wilson, V. and Wang, Z. (1988) Nature 332, 278-281.
31. Jarman, A.P. and Wells, R.A. (1989) Trends in Genet. 5, 367-371.
32. Chang, S.E. and Taylor-Papadimitriou, J. (1983) Cell Differ. 12, 143-154.
33. Tran, R., Horan Hand, P., Greiner, J.W., Pestka, S. and Schlom, J. (1988) J. Interfer. Res. 8, 75-88.
34. Bader, S.A. and Harris, H. (1987) J. Cell Sci. 87, 375-381.
35. Kadonaga, J.T., Jones, K.A. and Tjian, R. (1986) TIBS 11, 20-23.
36. Lee, W., Mitchell, P. and Tjian, R. (1987) Cell 49, 741-752.
37. Curran, T. and Franza, B.R. Jr. (1988) Cell 55, 395-397.
38. Franza, B.R. Jr., Rauscher, F.J., Josephs, S.F. and Curran, T. (1988) Science 239, 1150-1153.
39. Mitchell, P.J., Wang, C. and Tjian, R. (1987) Cell 50, 847-861.
40. Imagawa, M., Chiu, R. and Karin, M. (1987) Cell 51, 251-260.
41. Jones, K.A., Kadonaga, J.T., Rosenfeld, P.J., Kelly, T.J. and Tjian, R. (1987) Cell 48, 79-89.
42. Chodosh, L.A., Baldwin, A.S., Carthew, R.W. and Sharp, P.A. (1988) Cell 53, 11-24.
43. Karin, M., Haslinger, A., Holtgreve, H., Richards, R.I., Krauter, P., Westphal, H.M. and Beato, M. (1984) Nature 308, 513-519.
44. Jantzen, H.-M., Strahle, U., Gloss, B., Stewart, F., Schmid, W., Boshart, M., Miksicek, R. and Schutz, G. (1987) Cell 49, 29-38.
45. Schule, R., Muller, M., Otsuka-Murakami, H. and Renkawitz, R. (1988) Nature 332, 87-90.
46. Von der Ahe, D., Janich, S., Scheidereit, C., Renkawitz, R., Schutz, G. and Beato, M. (1985) Nature 313, 706-709.
47. Peale, F.V., Ludwig, L.B., Zain, S., Hilf, R. and Bambara, R.A. (1988) Proc. Natl. Acad. Sci. USA 85, 1038-1042.
48. Ciliberto, G., Raugei, G., Costanzo, F., Dente, L. and Cortese, R. (1983) Cell 32, 725-733.
49. Goodbourn, S., Burstein, H. and Maniatis, T. (1986) Cell 45, 601-610.
50. Boshart, M., Weber, F., Jahn, G., Dorsch-Hasler, K., Fleckenstein, B.G.B. and Schaffner, W. (1985) Cell 41, 521-530.