Sequence divergence among members of a trypanosome variant surface glycoprotein gene family


J. Mol. Biol. (1992) 225, 973-983

Thomas P. Beals† and John C. Boothroyd‡

Department of Microbiology and Immunology, D-305 Fairchild Building, Stanford University School of Medicine, Stanford, CA 94305, U.S.A.

(Received 6 August 1991; accepted 24 February 1992)

We have used analysis of DNA sequence data from four members of a Trypanosoma brucei variant surface glycoprotein gene family to investigate the molecular basis of the generation of antigenic diversity in African trypanosomes. Among these four sequences we find the greatest similarity in the untranslated sequences immediately upstream from the coding region. A complex pattern of nucleic acid and predicted amino acid sequence divergence appears starting at the coding sequence. Two related but highly divergent hydrophobic leaders are associated with different members of this gene family; both forms of these hydrophobic leaders appear to exist in other isolates of T. b. brucei. We find conservative replacements in the first 120 predicted amino acid residues of the mature protein; the following 80 predicted residues show less conservative replacements, and we suggest that this region may be hypervariable and exposed to the aqueous environment.

Keywords: Trypanosoma brucei; VSG; evolution; protozoan parasite

1. Introduction

Multigene families and supergene families are comprised of sequences that have been duplicated and have subsequently diverged to serve related functions. The VSG§ genes of African trypanosomes, which enable these protozoan parasites to evade the immune responses of their mammalian host, are an informational gene family (Hood, 1975) in the sense that, like the immunoglobulin genes, the VSG genes are a reservoir of diversified genetic information. The VSGs are the major antigen of the intact parasite; by successively expressing antigenically distinct VSGs, the trypanosome population survives the immune response to any particular VSG. Unlike the immunoglobulins, however, the VSGs are generally encoded intact in the genome; the repertoire of antigenic diversity in trypanosomes is generated on evolutionary time-scales.

Expression of most VSG genes requires the duplication of a basic copy gene encoding the VSG into a telomeric expression site to create an expression linked copy. This duplication is often bounded by, and is probably mediated by, two loosely conserved sequence elements flanking the duplicated sequence. The upstream repeated sequences (URS), averaging 76 bp, have been found upstream from most basic copy VSG genes characterized to date, and the 5' limit to VSG gene duplication is often within the URS (Liu et al., 1983; Campbell et al., 1984). However, when an expression site is occupied by one member of a gene family, another member of that gene family may replace it by gene conversion, not within the URS, but closer to or even within the gene itself (Pays et al., 1983). Similar processes may allow VSG genes containing an in-frame stop codon to contribute genetic information to a composite VSG gene in an expression site (Thon et al., 1990 and references therein).

In the accompanying paper (Beals & Boothroyd, 1992), we describe the genomic organization of the VSG 117 gene family and propose that, at a gross level, the rapid rate of change in the genomic organization of VSG genes, relative to housekeeping and recombination genes, is due to segregation between chromosomes bearing VSG genes; and that these processes could, in addition, generate dispersed families of VSG genes. However, we wished to determine the mechanisms by which VSG gene families are created, and the patterns of sequence divergence among VSG gene families. To that end, we have determined DNA sequences from three members of the VSG 117 gene family, and compared these sequences to the homologous sequence previously determined from the VSG 117 basic copy. We show here that members of this gene family have undergone a complex pattern of sequence divergence, with sequence conservation outside the coding region but a high degree of variation within it, displaying the sequence diversity expected of an informational gene family.

† Current address: UCLA Department of Biology, Los Angeles, CA 90024, U.S.A.
‡ To whom all correspondence should be addressed.
§ Abbreviations used: VSG, variant surface glycoprotein; URS, upstream repeat sequences; bp, base-pairs; kb, 10³ base-pairs.

0022-2836/92/120973-11 $03.00/0 © 1992 Academic Press Limited

2. Materials and Methods

The origin of the sequenced DNA is described in detail in the accompanying paper. Each fragment sequenced was cloned in both orientations into either M13mp19 (Norrander et al., 1983) or Bluescript KS (Stratagene, San Diego, CA). A deletion series of each clone was prepared as described by Henikoff (1984). Single-stranded templates for DNA sequencing were generated from M13 clones as described by Amersham (M13 Sequencing manual) and from Bluescript clones as described by Stratagene Cloning Systems. DNA sequence was obtained by dideoxy chain termination; [35S]dATP-labeled products were resolved on buffer gradient gels as described by Biggin et al. (1983). Extension reactions were performed with the Klenow fragment of DNA polymerase I, or with the Sequenase enzyme supplied by United States Biochemical Corp. DNA sequence was obtained either from both strands, or by multiple sequencing of one strand. Final assembly of the sequences was done with the fragment assembly package of the University of Wisconsin Genetics Computer Group (Devereux et al., 1984). Filter hybridizations were done as described in the accompanying paper.
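Confirming a sequence from both strands amounts to checking that a read from one strand equals the reverse complement of the read from the other. The Python sketch below illustrates that check only; it is not the fragment assembly package cited above, the function names are hypothetical, and the short read is illustrative.

```python
# Minimal sketch: does a read from the opposite strand confirm a forward read?
# Function names are hypothetical; this is not the assembly software cited in the text.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def strands_agree(forward_read: str, reverse_read: str) -> bool:
    """True if the opposite-strand read is the exact reverse complement of the forward read."""
    return forward_read.upper() == reverse_complement(reverse_read).upper()


# Illustrative ten-base read and its opposite-strand counterpart.
forward = "AATGGTGCAC"
reverse = reverse_complement(forward)
print(reverse)
print(strands_agree(forward, reverse))
```

A real comparison would first have to align overlapping reads and tolerate ambiguous base calls; the exact-match test here is the simplest possible case.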

3. Results and Discussion

(a) Origin of the sequenced DNA

The isolation of members of the VSG 117 gene family as cosmid clones, and the construction of subclones containing those gene family members, is described in the accompanying paper. Figure 1 shows schematic maps of the VSG 117 basic copy and of three subclones containing members of the MITat 1.4 gene family; DNA sequences from the indicated regions of these subclones are reported here. We refer to these subclones by the length of the HindIII fragment on which they are found; e.g. the 8.5 gene family member is contained on an 8.5 kb HindIII fragment.

The DNA sequence of the only member of the VSG 117 gene family known to be expressed, the VSG 117 basic copy, was determined previously (Boothroyd & Cross, 1982; Boothroyd et al., 1982). The basic copy sequence in Figure 2 was compiled from basic copy genomic sequence and cDNA sequence; no sequence difference between the genomic clone and the cDNA clones in the region shown is known, except for a trans-spliced mini-exon found on cDNAs that is derived from elsewhere in the genome (Boothroyd & Cross, 1982).

To determine the pattern of divergence among members of the VSG 117 gene family, we determined the DNA sequence from three gene family members from a point slightly upstream from the sequence used as the defining probe for this gene family, to an apparently common SalI site about halfway through the coding region (Fig. 1). More sequence was obtained from the 8.5 kb gene family member, extending somewhat further upstream, and including a complete potential coding region. The DNA sequences of these three gene family members are shown in Figure 2; the deduced amino acid sequences are shown in Figure 3. Comparison of the sequences obtained from this region shows a complex pattern of sequence divergence with internal boundaries that roughly coincide with functional regions determined for the basic copy. Therefore, we divide this sequence, for the purpose of comparison of divergence, into the four regions shown schematically in Figure 1.

(b) Divergence in sequences not represented in the mature VSG mRNA

Region I (bases 1 to 1165, Fig. 2) is comprised of sequence upstream from the splice acceptor site, to which the trypanosome mini-exon is trans-spliced (Boothroyd & Cross, 1982). This non-coding region is the most conserved among the four regions compared here; if insertions introduced to maintain optimal alignment are counted as mismatches, the basic copy, the 8.5 family member, and the 6.0 family member differ pairwise by about 7%; the 7.0 family member differs from each of the other three by 11 to 12%. We find no notable pattern to the sequence changes, with the exception of bases 420 to 429 where, relative to the basic copy, the 8.5 family member has an apparent insertion of nine base-pairs and the 7.0 family member has an apparent insertion of three base-pairs; each insertion is a repetition of a three base motif, either AAA or GAA. We discuss further below the relatively greater divergence of the 7.0 family member.

(c) Divergence in sequences encoding the 5' untranslated region

Region II (bases 1166 to 1242) consists of sequence between the splice acceptor site and the start of the VSG coding region. Here the three non-basic copy gene family members are more similar to each other (14 to 22% divergence) than any is to the basic copy (26 to 40% divergence). However, the sequence that in the basic copy encodes the splice junction (base 1166 in Fig. 2; Boothroyd & Cross, 1982) is conserved in the three non-basic copy gene family members; the junction is centered in a 27 bp region of perfectly conserved sequence (Fig. 2, bases 1155 to 1181).

(d) Variation in sequences encoding a hydrophobic leader

Region III consists of 99 bp of sequence encoding, in the basic copy, a long hydrophobic leader (Boothroyd et al., 1981). The corresponding region in the non-basic copy gene family members appears also to encode a hydrophobic leader, but these sequences differ from the basic copy by greater than 50% in nucleotide sequence and by greater than 65% in predicted amino acid sequence. Among these non-basic copy gene family members, sequence differences in this region are minimal (2 to 4% at the nucleotide level; or not greater than 3 out of 33 amino acid residues). Accordingly, we use the 8.5 gene family member hydrophobic leader sequence to represent the non-basic copy hydrophobic leader motif in the discussion to follow.

Figure 1. Origin of the sequenced DNA. The restriction maps shown are of subclones from cosmids containing members of the VSG 117 gene family (see the accompanying paper). pGB117.1 contains the VSG 117 basic copy (Bernards et al., 1981). pSUB8.5A, pSUB7.0A and pSUB6.0A were derived from cosmids containing the VSG 117 family member genes present on HindIII fragments of 8.5, 7.0 and 6.0 kb, respectively (see the accompanying paper). The duplicated block is the region involved in activation of the basic copy by duplication into a telomeric expression site. Regions I, II, III and IV are coding region segments described in the text. Abbreviations: URS, the upstream repeat sequences averaging 76 bp (Liu et al., 1983; Campbell et al., 1984); DP, the defining probe for the gene family, a 720 bp HinfI fragment; HPL, the hydrophobic leader sequence encoding the VSG 117 signal peptide; MC, the mature VSG 117 coding region; HPT, the hydrophobic tail sequence encoding the glycolipid anchoring signal; RI, EcoRI; Sa, SalI; Hd, HindIII; Xh, XhoI.

Despite the differences between the basic copy hydrophobic leader and the non-basic copy sequences, their hydrophobicity profiles (Fig. 4) are remarkably similar, suggesting that these sequences have diverged under selective pressure for functionality but not necessarily for sequence conservation. Comparison of the sequence at which the leader peptide is processed away from the mature VSG also suggests functional conservation. The amino acid sequences at the end of eukaryotic processed leaders tend to have small, neutral amino acid residues at positions -1 (the last amino acid residue of the leader) and -3 (3 amino acid residues from the end of the leader; Von Heijne, 1983, 1986). These two positions are conserved in the protein sequence predicted for the gene family members and the basic copy hydrophobic leader; positions +1, -2 and -4 fail to be conserved. This pattern of divergence is consistent with the notion that these sequences have diverged under selective pressure to retain attributes that preserve the function of the gene.

The degree of divergence of the non-basic copy family member hydrophobic leader sequences from that of the basic copy, and the striking similarity of the non-basic copy hydrophobic leader sequences, suggested the possibility that sequences encoding either the non-basic copy family member hydrophobic leader or the basic copy hydrophobic leader might have originated elsewhere in the genome. To investigate this possibility, the hydrophobic leaders were used as radiolabeled probes against genomic DNA digested with HindIII. Figure 5(a) shows that the basic copy hydrophobic leader hybridizes only to the basic copy gene family member and to the 117 expression linked copy. The 8.5 family member hydrophobic leader hybridizes to, and only to,


Figure 2. DNA sequence of the VSG 117 basic copy, and the corresponding sequence from three VSG 117 gene family members. Dashes indicate identity to the basic copy sequence. Vertical bars indicate spaces inserted to maintain homologous alignment. BC, basic copy; 85, pSUB8.5A; 70, pSUB7.0A; 60, pSUB6.0A. Numbering starts at the first base of the pSUB8.5 sequence. (a) Sequences upstream from the basic copy coding region and the corresponding sequences from the 3 gene family members. (b) The DNA sequence encoding the VSG 117 basic copy and the corresponding sequence from 3 VSG 117 gene family members. In the pSUB7.0A sequence, at positions 1614 and 1615, lower-case letters indicate uncertainty in a G + C-rich region. We are confident of the existence of these bases but, despite repeated sequencing of this region on both DNA strands, we are uncertain of their order.

[Figure 2(a) and (b): aligned nucleotide sequences for BC, 85, 70 and 60; the alignment rows are not legible in this copy.]

HindIII fragments of the same size as those to which the defining probe for the gene family hybridizes. For the 8.5, 7.0 and 6.0 gene family members, sequence data (Fig. 2) establish that the 8.5 hydrophobic leader and the defining probe are present on the same HindIII fragment for each family member. Although not proven, the coincident sizes argue that the two probes recognize the same HindIII fragment in the remaining gene family members as well. Figure 5(b) shows that sequences hybridizing to the 8.5 hydrophobic leader are present in two other


trypanosome isolates, EATRO 110 and ILTat 1.3, while the basic copy hydrophobic leader probe detects sequences only in DNA from MITat 1.4 and ILTat 1.3. The basic copy hydrophobic leader probe hybridizes, in ILTat 1.3 DNA, to at least two HindIII fragments, both smaller than the 9.8 kb VSG 117 basic copy HindIII fragment in MITat 1.4. At least two intriguing possibilities could explain these results. First, both these sequences may be associated with a homolog of the VSG 117 basic copy, distinguishable from the VSG 117 basic copy by HindIII polymorphisms. In this case, ILTat 1.3 may be effectively diploid for the basic copy. Second, one or both of these sequences may be associated with a non-basic copy member of the VSG 117 gene family; if so, this would suggest that this motif could, indeed, be transferred between members of a gene family.

(e) Divergence of the coding region

Region IV is the sequence encoding, in the VSG 117 BC: the mature VSG. The pattern of divergence of the predicted amino acid sequence of the 8.5 kb gene family member with respect to that of the basic copy is plotted in Figure 6. The majority of the amino acid sequence divergence appears in the first 200 amino acid residues of the mature VSG. Four out of five cysteine residues of the N-terminal sequence are conserved. In VSG 117, Cysl4 is disulfide-bonded with Cysl40, and Cysl21 with Cys182 (Allen & Gurnett, 1983). These cysteine residues are completely conserved in the non-basic copy gene family members. Cys244 is the only non-disulfide bonded cysteine residue in the VSG 177, and it is the only cysteine residue not completely conserved in the four coding sequences reported here; codon 244 encodes phenylalanine in the 6.0 kb gene family member. Open reading frames over the sequenced region were found in the 8.5 kb and 6.0 kb gene family members; in the 7.0 kb gene family member, the translation termination codon TAA is found at codon 11 (bases 1372 to 1374, Fig. 2). The X-ray structures of two VSGs have been determined (Freyman et al., 1984; Metcalf et al., 1987); although from their primary amino acid sequences these VSGs appeared unrelated, their predicted structures were remarkably similar. These VSGs form homodimers in solution, and this pairing results in the close apposit,ion of two long a-helical bundles. Cohen et al. (1984) showed the potential for coiled-coil interactions among VSGs (including

VSG 117) mediated by hydrophobic amino acid residues along the side of the helix (specifically, at positions A and D, when the primary amino acid sequence is written in heptads). VSG 117 contains extensive a-helical regions; Jahnig et al. (19873 showed by Raman spectroscopy that intact VSG 117 and its N-terminal tryptic fragment had an a-helix content, respectively, of 60 and 61%. The predicted amino acid sequences from the VSG 117 gene family members are presented in the heptad arrangement of Cohen et al. (1984) in Figure 3(b). In VSG 117, hydrophobic residues in columns A and D (Fig. 3(b)) were proposed by Cohen et al. (1984) to mediate coiled-coil interactions. In the gene fa.mily members, column D is one of the least divergent, column A one of the most, divergent (although most of the latter changes are conservative and mainta,in the hydrophobic character of the corresponding sequence of VSG 117). If this amino acid sequence is an a-helix, these columns form a face of the helix (which could interact to form a coiled coil). Resolution of the amino acid residues involved in the structure of the homodimer (if one is formed by VSG 117) will require a solution to the cryst~al structure of this VSG. In the comparison of the 8.5 predicted amino acid sequence to the basic copy, amino acid residues 111 to 205 are the most divergent in the sequence encoding the mature VSG (Figs 2 and 6). This region is predicted to cont,ain the major antigenic determinants in the VSG 117 structure proposed by Jahnig et al. (1987). The predicted amino acid sequence encoded by this region is more uniformly hydrophilic, in both the basic copy and 8.5 gene family member, than are the first 110 predicted amino acid residues, as expected if this segment of the VSG is exposed to the aqueous environment. The concentration of nucleotida and amino acid changes in this part of the sequence is consistent with its being subject to positive selection for antigenie variation. 
These changes are distributed over this variable region, rather than being concentrated in a few hypervariable regions. The nature of the epitopes accessible to antibodies on the intact parasite is not known; however, we (R. Nsia, T.P.B. & J.C.B., unpublished results) and others have not been able to define a linear epitope using monoclonal antibodies that recognize VSGs both on intact parasites and as produced in bacteria. Thus, these epitopes may be conformational, and liable to be altered by changes throughout the portion of the protein exposed to the aqueous environment.
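The column-by-column comparison of heptad positions discussed above (columns A and D of Fig. 3(b)) can be illustrated with a minimal sketch. It assumes two pre-aligned, equal-length sequences and a known heptad register for the first residue; the function name, the register argument, and the toy sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter

HEPTAD = "ABCDEFG"


def heptad_column_divergence(reference, other, start_position="A"):
    """Return the fraction of non-identical residues in each heptad column.

    `reference` and `other` must already be aligned to the same length.
    `start_position` is the heptad column assigned to the first residue
    (an assumption supplied by the caller, not inferred from the sequence).
    """
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    offset = HEPTAD.index(start_position)
    total = Counter()
    changed = Counter()
    for i, (ref_res, oth_res) in enumerate(zip(reference, other)):
        column = HEPTAD[(i + offset) % 7]
        total[column] += 1
        if ref_res != oth_res:
            changed[column] += 1
    return {col: changed[col] / total[col] for col in HEPTAD if total[col]}


# Hypothetical toy sequences (two heptads each), for illustration only.
toy_reference = "LKELEDKLKELEDK"
toy_other = "IKELEDKMKELEDR"
toy_divergence = heptad_column_divergence(toy_reference, toy_other)
for column in HEPTAD:
    print(column, toy_divergence[column])
```

A per-column fraction of this kind says nothing about whether a replacement is conservative; judging that would require an additional residue-similarity scheme.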

Figure 3. (a) Translation of the coding region of the DNA sequences. The hydrophobic leader sequence is numbered in parentheses; the following numbering is from the first residue of the known mature VSG. Predicted amino acid residues are given in the single-letter code. Identity to the VSG 117 basic copy sequence in the gene family member sequences is indicated by a dash. The asterisk (*) at position 11 in the pSUB7.0A sequence indicates an in-frame stop codon; X in that sequence indicates a codon where a space in the nucleotide sequence was added to maintain homologous alignment in the nucleotide sequence. BC, VSG 117 basic copy amino acid sequence; 85, pSUB8.5A predicted amino acid sequence; 70, pSUB7.0A predicted amino acid sequence; 60, pSUB6.0A predicted amino acid sequence. (b) The VSG 117 basic copy amino acid sequence is shown in the heptad arrangement of Cohen et al. (1984), with the corresponding predicted amino acid sequences from the other gene family members. Abbreviations and numbering are as in (a).

[Fig. 3(a): alignment of the predicted amino acid sequences; rows BC, 85, 70 and 60.]

[Fig. 3(b): the same sequences in heptad arrangement; rows BC, 85, 70 and 60.]

[Fig. 4 plot: hydrophobicity (moving average of 7 residues) against leader residue position, 1 to 33.]

Figure 4. Hydrophobicity profiles of the VSG 117 basic copy and predicted pSUB8.5A hydrophobic leader amino acid sequences. Hydrophobicity according to Kyte & Doolittle (1982) is plotted as a moving average over a window of 7 amino acid residues. (-) VSG 117 leader; (....) 8.5 kb leader.
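The profile described in this legend can be sketched as follows. The hydropathy values are the published Kyte & Doolittle (1982) scale; the function name is hypothetical, and the leader sequence is transcribed from the extracted text of Fig. 3(a), so it may carry extraction errors and should be checked against the deposited sequence before use.

```python
# Kyte & Doolittle (1982) hydropathy index, single-letter amino acid code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=7):
    """Return the moving average of Kyte-Doolittle values over `window` residues.

    One value is produced per full window, so a sequence of length n
    yields n - window + 1 values.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[residue] for residue in sequence]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Basic copy leader (33 residues) as transcribed from Fig. 3(a);
# treat as illustrative input only, since the transcription is unverified.
BC_LEADER = "MDCHTKETLGVTQWRRSTMLTLSLLYAITPADG"
profile = hydropathy_profile(BC_LEADER)
for start, value in enumerate(profile, start=1):
    print(f"residues {start}-{start + 6}: {value:.2f}")
```

Each value is reported against the first residue of its window here; plotting against the window midpoint is an equally common convention, and the legend does not say which was used.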


Divergence of the 8.5 predicted amino acid sequence, relative to the basic copy, decreases following codon 181, and divergence is least, in the coding region, from amino acid residue 206 to residue 456. This gradation of divergence, generally decreasing 5′ to 3′, is consistent with previously published observations that VSGs diverge most in their amino-terminal domains (Donelson & Rice-Ficht, 1985). We suggest a functional role for this homology below. Divergence increases in the region encoding the hydrophobic tail of the nascent VSG, although features invariant in C-terminal sequences of expressed VSG genes (Boothroyd, 1985) are conserved in the 8.5 kb family member. Finally, as for the 117 basic copy, 15 out of 16 bp of a sequence (TGATATATTTTAACAC) conserved in VSG gene 3′ untranslated regions (Borst & Cross, 1982) are found also in the 8.5 sequence. The function of this conserved block is not known, but possibilities include a role in gene conversion and/or RNA processing (e.g. polyadenylation or RNA stability).
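A match such as the 15 out of 16 bp reported above can be located with a simple ungapped scan. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the target string is an invented toy sequence (not the 8.5 kb sequence), and only the 16 bp motif is taken from the text.

```python
def best_ungapped_match(motif, target):
    """Return (matches, position, window) for the best ungapped alignment.

    The motif is slid along the target one base at a time; `position` is
    the 0-based start of the first window with the highest identity count.
    """
    if len(motif) > len(target):
        raise ValueError("target must be at least as long as the motif")
    best = (-1, -1, "")
    for start in range(len(target) - len(motif) + 1):
        window = target[start:start + len(motif)]
        matches = sum(a == b for a, b in zip(motif, window))
        if matches > best[0]:
            best = (matches, start, window)
    return best


CONSERVED_BLOCK = "TGATATATTTTAACAC"  # 16 bp motif quoted in the text

# Hypothetical toy target: the motif with one substitution, plus flanks.
toy_target = "CCGG" + "TGATATATTTTAACGC" + "AATT"
matches, position, window = best_ungapped_match(CONSERVED_BLOCK, toy_target)
print(f"{matches}/{len(CONSERVED_BLOCK)} bp identical at offset {position}: {window}")
```

An ungapped scan is the simplest choice; it would miss a match interrupted by an insertion or deletion, for which a gapped alignment would be needed.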

[Fig. 5: Southern blot panels (a) and (b); lanes numbered as in the legend, with the ELC and BC bands marked.]

Figure 5. (a) Southern blot filters of HindIII digests of genomic DNA were probed with: lanes 1 and 2, the defining probe for the VSG 117 gene family; lanes 3 and 4, the basic copy hydrophobic leader; lanes 5 and 6, the pSUB8.5 hydrophobic leader. The defining probe for the gene family is described in Fig. 1. The basic copy hydrophobic leader probe extends from the HinfI site at base 1234 in Fig. 2, to the BanI site at base 1339. The pSUB8.5 hydrophobic leader probe extends from the NruI site at base 1237 (on line 85 of Fig. 2) to the TaqI site at base 1355. Lanes 1, 3 and 5 contain HindIII-digested genomic DNA from MITar 1 trypanosomes not expressing VSG 117; lanes 2, 4 and 6 contain HindIII-digested genomic DNA from MITar 1 trypanosomes possessing a VSG 117 ELC. (b) Lanes 1 and 4 contain HindIII-digested DNA from MITat 1.5; lanes 2 and 5 from EATRO 110, and lanes 3 and 6 from ILTat 1.3. Following Southern transfer, lanes 1, 2 and 3 were hybridized to the basic copy hydrophobic leader probe; lanes 4, 5 and 6 were hybridized to the pSUB8.5 hydrophobic leader probe.

[Fig. 6 plot: x-axis, amino acid residue (0 to 450).]

Figure 6. The histogram shows the pattern of amino acid divergence between the basic copy and pSUB8.5. The percentage non-identity between the basic copy mature VSG amino acid sequence and the corresponding predicted amino acid sequence of the 8.5 gene family member was plotted as a moving average over a 20 amino acid residue window. The amino acid residues of the mature VSG are numbered as in Fig. 3(a).
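The windowed divergence described in this legend can be sketched as below. The function names are hypothetical. The two short rows are transcribed from the extracted text of Fig. 3(a) (first 27 residues of the mature basic copy, and the row taken here to be the 85 sequence, with dashes marking identity); they may contain extraction errors and serve only as illustrative input.

```python
def expand_dash_notation(reference, row):
    """Replace each '-' (identity to the reference) with the reference residue."""
    if len(reference) != len(row):
        raise ValueError("row must be aligned to the reference")
    return "".join(ref if res == "-" else res for ref, res in zip(reference, row))


def percent_nonidentity_profile(reference, other, window=20):
    """Return percent non-identity as a moving average over `window` residues."""
    if len(reference) != len(other):
        raise ValueError("sequences must be aligned to the same length")
    if window < 1 or window > len(reference):
        raise ValueError("window must be between 1 and the sequence length")
    mismatches = [int(a != b) for a, b in zip(reference, other)]
    return [
        100.0 * sum(mismatches[i:i + window]) / window
        for i in range(len(mismatches) - window + 1)
    ]


# Rows as transcribed from Fig. 3(a); unverified, for illustration only.
BC_N_TERM = "AKEALEYKTWTNHCGLAATLRKVAGGV"
ROW_85 = "SY-----T--ST-------P---H--I"

seq_85 = expand_dash_notation(BC_N_TERM, ROW_85)
divergence = percent_nonidentity_profile(BC_N_TERM, seq_85, window=20)
for start, value in enumerate(divergence, start=1):
    print(f"window starting at residue {start}: {value:.1f}% non-identity")
```

The sketch counts every non-identical position equally and does not handle alignment gaps, which the full-length comparison would need to treat explicitly.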

(f) Proposed mechanisms of selection for regional conservation and variation

Why is the basic copy more diverged from the gene family consensus? The source of this divergence has two components: the unselected source of variation (i.e. mutation) and selection of variants. We expect that there is a certain probability of base misincorporation during DNA replication, and that this frequency, per base, is the same for all the nuclear DNA. Other sources of variation, such as reverse transcription of an mRNA followed by gene conversion, may preferentially affect transcribed sequences. Whatever the source of variation, it seems likely that advantageous mutations in VSG genes will be rapidly fixed in the population. A frequently activated VSG gene is more often exposed to antigenic selection in the mammalian host. An antigenically novel trypanosome initiating a parasitemic wave may have a founder effect, establishing a population with an advantageous change in a basic copy VSG gene. The VSG 117 basic copy is the only member of the VSG 117 gene family to have the upstream repeat sequences believed to be important in duplication activation of VSG genes; we do not find this sequence motif in a corresponding position upstream from the other gene family members (see the accompanying paper). If the VSG 117 basic copy is activated more frequently (due to the presence of the URS) than the other gene family members, changes in the basic copy sequence may be established in the trypanosome population at a higher frequency than changes in the other gene family members. This would explain both divergence from the gene family consensus and concentration of most changes in the region of the molecule believed to be exposed on the surface of the parasite. The basic copy is, under this interpretation, not the progenitor of the gene family, but the gene family member most diverged from the progenitor.

It seems unlikely, however, that external selection can explain the divergence of the basic copy hydrophobic leader sequence from the gene family member hydrophobic leader sequences. Two related issues arising from the dramatic difference between the basic copy and gene family member hydrophobic leaders are: first, the conservation of the non-basic copy hydrophobic leader sequence, and second, the source of the variation that resulted in the divergence. The fact that the hydrophobic leader sequences are flanked by very similar sequences suggests the possibility of gene conversion among gene family members. This could produce the near perfect conservation of the hydrophobic leader sequences we observe among the three non-basic copy gene family members described here. Obviously, the basic copy hydrophobic leader has not been converted to the family member consensus. The extent of sequence identity suggests that the two hydrophobic leaders are homologous; they could, therefore, have evolved by accumulation of point mutations over time, but there is no obvious selection to fix these mutations. Figure 5 shows an apparently single copy (exclusive of the expression linked copy) of the basic copy hydrophobic leader in MITar 1 DNA; however, two HindIII fragments hybridizing to the basic copy hydrophobic leader are seen in ILTat 1.3 DNA. This suggests the possibility that the VSG 117 basic copy may have accumulated the changes relative to the gene family members in a different genomic environment, then been (re)acquired by MITat 1.4 by genetic exchange.

The non-basic copy gene family members have diverged with respect to each other; how have these patterns of divergence been produced? These VSG genes, although lacking the URS, may be activated by replacing the basic copy in an expression site.
This replacement might be accomplished by gene conversion initiated upstream from the coding sequence (our first region of comparison), and ending within or downstream from the coding region. If the non-basic copy family members are activated by this mechanism, and such activation confers a selective advantage, this would provide a selective pressure for maintenance of the sequence homology in the regions where gene conversion is initiated and terminated. The generally well conserved sequences upstream from the basic copy coding region may thus function similarly to the upstream repeat sequences, except that these sequences would function only among members of the gene family. If the frequency of gene conversion initiated in these sequences is a function of sequence similarity, the greater divergence of the 7.0 gene family member would make the duplication of this sequence into an expression site relatively infrequent; this might be beneficial to the trypanosome population, since this incomplete VSG gene would require a more complex conversion event to make a functional expression linked copy. Alternatively, the conservation in this region may reflect the absence of positive selection for variation.

In conclusion, it appears that this gene family has diverged in both expected and unexpected ways: variation is generally greatest in the regions encoding portions of these antigens thought to be exposed to the environment and least in the noncoding regions; further, there has been substantial variation in limited sequences without any obvious selective pressure. The mechanisms underlying this situation may become apparent when the VSG 117 family members from other strains of trypanosomes are analyzed.

We are grateful to Drs Jay Bangs and C. C. Wang for trypanosome strains, and thank Mr Matthew Lawrence for assistance in sequence determination. This work was supported by grants from NIH (AI21025) and the John D. and Catherine T. MacArthur Foundation. T.P.B. was supported in part by an N.I.H. Cell and Molecular Biology Training Grant (GM07276) and by a grant from the Johnson and Johnson Co. J.C.B. is a Burroughs Wellcome Scholar in Molecular Parasitology. The sequences presented herein have been assigned GenBank/EMBL accession numbers Z11674 (pSUB8.5), Z11675 (pSUB7.0A) and Z11676 (pSUB6.0A). The basic copy sequence presented here for comparison was combined from sequence accession numbers K00638, V01387 and K00639.

References

Allen, G. & Gurnett, L. P. (1983). Locations of the six disulphide bonds in a variant surface glycoprotein (VSG 117) from Trypanosoma brucei. Biochem. J. 209, 481-487.

Beals, T. P. & Boothroyd, J. C. (1992). Genomic organization and context of a trypanosome variant surface glycoprotein gene family. J. Mol. Biol. 225, 961-971.

Bernards, A., Van der Ploeg, L. H. T., Frasch, A. C., Borst, P., Boothroyd, J. C., Coleman, S. L. & Cross, G. A. M. (1981). Activation of trypanosome surface glycoprotein genes involves a duplication-transposition leading to an altered 3′ end. Cell, 27, 497-505.

Biggin, M. D., Gibson, T. J. & Hong, G. F. (1983). Buffer gradient gels and 35S label as an aid to rapid DNA sequence determination. Proc. Nat. Acad. Sci., U.S.A. 80, 3963-3965.

Boothroyd, J. C. (1985). Antigenic variation in African trypanosomes. Annu. Rev. Microbiol. 39, 475-502.

Boothroyd, J. C. & Cross, G. A. M. (1982). Transcripts coding for different variant surface glycoproteins of Trypanosoma brucei have a short, identical exon at their 5′ end. Gene, 20, 281-289.

Boothroyd, J. C., Paynter, C. A., Cross, G. A. M., Bernards, A. & Borst, P. (1981). Variant surface glycoproteins of Trypanosoma brucei are synthesized with cleavable hydrophobic sequences at the carboxy and amino termini. Nucl. Acids Res. 9, 4735-4743.

Boothroyd, J. C., Paynter, C. A., Coleman, S. L. & Cross, G. A. M. (1982). Complete nucleotide sequence of complementary DNA coding for a variant surface glycoprotein from Trypanosoma brucei. J. Mol. Biol. 157, 547-556.

Borst, P. & Cross, G. A. M. (1982). Molecular basis for trypanosome antigenic variation. Cell, 29, 291-303.

Campbell, D. A., Van Bree, M. & Boothroyd, J. C. (1984). The 5′-limit of transposition and upstream barren region of a trypanosome VSG gene: tandem 76 base-pair repeats flanking (TAA)90. Nucl. Acids Res. 12, 2759-2774.

Cohen, C., Reinhardt, B., Parry, D. A. D., Roelants, G. E., Hirsch, W. & Kanwe, B. (1984). α-Helical coiled-coil structures of Trypanosoma brucei variable surface glycoproteins. Nature (London), 311, 169-171.

Devereux, J., Haeberli, P. & Smithies, O. (1984). A comprehensive set of sequence analysis programs for the VAX. Nucl. Acids Res. 12, 387-395.

Donelson, J. E. & Rice-Ficht, A. C. (1985). Molecular biology of trypanosome antigenic variation. Microbiol. Rev. 49, 107-125.

Freymann, D. M., Metcalf, P., Turner, M. & Wiley, D. C. (1984). 6 Å-resolution X-ray structure of a variable surface glycoprotein from Trypanosoma brucei. Nature (London), 311, 167-169.

Henikoff, S. (1984). Unidirectional digestion with exonuclease III creates targeted breakpoints for DNA sequencing. Gene, 28, 351-359.

Hood, L., Campbell, J. H. & Elgin, S. C. R. (1975). The organization, expression, and evolution of antibody genes and other multigene families. Annu. Rev. Genet. 9, 305-353.

Jahnig, F., Bulow, R., Baltz, T. & Overath, P. (1987). Secondary structure of the variant surface glycoproteins of trypanosomes. FEBS Letters, 221, 37-42.

Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105-132.

Liu, A. Y. C., Van der Ploeg, L. H. T., Rijsewijk, F. A. M. & Borst, P. (1983). The transposition unit of variant surface glycoprotein gene 118 of Trypanosoma brucei: presence of repeated elements at its border and absence of promoter-associated sequences. J. Mol. Biol. 167, 57-75.

Metcalf, P., Blum, M., Freymann, D., Turner, M. & Wiley, D. C. (1987). Two variant surface glycoproteins of Trypanosoma brucei of different sequence classes have similar 6 Å resolution X-ray structures. Nature (London), 325, 84-86.

Norrander, J., Kempe, T. & Messing, J. (1983). Construction of improved M13 vectors using oligodeoxynucleotide-directed mutagenesis. Gene, 26, 101-106.

Pays, E., Van Assel, S., Laurent, M., Darville, M., Vervoort, T., Van Meirvenne, N. & Steinert, M. (1983). Gene conversion as a mechanism for antigenic variation in trypanosomes. Cell, 34, 371-381.

Thon, G., Baltz, T., Giroud, C. & Eisen, H. (1990). Trypanosome variable surface glycoproteins: composite genes and order of expression. Genes Develop. 4, 1374-1383.

Von Heijne, G. (1983). Patterns of amino acids near signal-sequence cleavage sites. Eur. J. Biochem. 133, 17-21.

Von Heijne, G. (1986). A new method for predicting signal sequence cleavage sites. Nucl. Acids Res. 14, 4683-4690.

Edited by R. Schleif