The hepatitis C virus genome: a guide to its conserved sequences and candidate epitopes


Virus Research, 30 (1993) 27-41. © 1993 Elsevier Science Publishers B.V. All rights reserved. 0168-1702/93/$06.00

VIRUS 00921

Hsiang Ju Lin a, Johnson Y.N. Lau b, Ian J. Lauder c, Naiyi Shi a, Ching-lung Lai d and F. Blaine Hollinger a

a Division of Molecular Virology, Baylor College of Medicine, Houston, TX 77030, USA, b Division of Hepatobiliary Diseases, Section of Gastroenterology, Hepatology and Nutrition, University of Florida, Gainesville, FL 32610, USA, c Department of Statistics and d Department of Medicine, University of Hong Kong, Hong Kong

(Received 13 October 1992; revision received and accepted 19 April 1993)

Summary

A comprehensive analysis of reported hepatitis C virus genomic sequences, comprising 151 partial or complete nucleotide sequences and 159 partial or complete amino acid sequences, revealed an irregular composition of conserved and variable regions. There were only eight conserved nucleotide sequences, none outside the 5’ noncoding and structural regions. A search among conserved amino acid sequences revealed 14 candidate B-cell epitopes, which were chosen mainly on the basis of their hydrophilicity profiles. Twenty-five candidate T-cell epitopes were selected according to the criteria of absolute conservation of amino acid sequence, together with characteristic sequence motifs, amphipathic helical structure, or both. Conserved peptide sequences with the characteristics of both B- and T-cell epitopes were identified in the nonstructural 5 (NS5) region of the genome.

Primer design; Polymerase chain reaction; Antigenic sites; Kernel density analysis; Computer algorithms

Correspondence to: Dr. H.J. Lin, Division of Molecular Virology, Baylor College of Medicine, One Baylor Plaza, Houston, TX 77030, USA.


Introduction

The use of assays to detect antibodies to hepatitis C virus (HCV) proteins (Aach et al., 1991; Alter et al., 1989; Esteban et al., 1991; Kuo et al., 1989) and of tests based on the polymerase chain reaction (PCR) to detect genomic HCV RNA (Garson et al., 1990; Weiner et al., 1990) has led to the identification of HCV as the major causative agent of non-A, non-B hepatitis. Since 1988, seven HCV genomes have been sequenced almost in their entirety (Chen et al., 1992; Choo et al., 1991; Houghton et al., 1988; Kato et al., 1990b; Okamoto et al., 1991, 1992; Takamizawa et al., 1991; Tanaka et al., 1992), and numerous partial sequences have been published. These studies show nucleotide sequence diversity which is reflected in amino acid sequence diversity (Kato et al., 1990a). Thus, when searching for primers that could be used in PCR for the detection of all strains of HCV, it is imperative that highly conserved regions of the genome be identified. The same requirement should apply to the search for amino acid sequences which might serve as B- or T-cell epitopes, since any strategy for immune-based therapy must be directed at different strains of the virus.

In this paper, we summarize studies carried out on HCV genomic sequences with the objective of identifying conserved sequences and the candidate epitopes contained therein.

We have used as the model of HCV the generally accepted division of the linear 9.4 kb genome into noncoding and coding regions (Houghton et al., 1991b). The functions of the 5’ and 3’ noncoding regions are unknown, but they are strikingly different in their degree of conservation. The 3’ terminus of the genome is a noncoding region containing either poly(A) or poly(U) tracts (Han et al., 1991; Kato et al., 1990b; Okamoto et al., 1991; Takamizawa et al., 1991). The highly conserved 5’ noncoding region is contiguous with the region encoding the structural proteins (capsid, membrane and envelope).
There are five genes encoding putative nonstructural (NS) proteins. NS1 may encode an additional envelope protein (Hijikata et al., 1991b). The functions of the gene products of NS2 and NS4 are unknown. NS3 encodes the protease/helicase and NS5 is deduced to be the RNA-dependent RNA polymerase gene owing to its homology to the flavivirus NS5 region (Houghton et al., 1991a).

Methods

Nucleotide positions were numbered according to Okamoto et al. (1990b). In this system, position 1 is 324 nucleotides upstream of the initiation codon. The exact placements of the subdivisions within the genomic structural and nonstructural regions vary (Ogata et al., 1991; Weiner et al., 1991b); ours were made according to Takamizawa et al. (1991).

Kernel density analysis was performed according to published methods (Lauder, 1983; Lauder et al., 1993). Briefly, this statistical procedure gives a picture of continuous variation along the genome. The kernel model centers on a given nucleotide or amino acid position and gives a smoothed estimate of the frequency of conserved nucleotides or amino acids flanking the given position, putting more weight on the nucleotides or amino acids that lie closer to the given position. Kernel density analysis results in a value called the conservation score, which lies between 0 and 1.0 for each nucleotide or amino acid position.

Hydrophilicity was calculated according to Hopp and Woods (1981). Antigenicity values were calculated according to Welling et al. (1985). In all of the above calculations, nucleotides were aligned for maximum concordance. The additions and deletions present in the NS1 and NS5 regions of the HCJ6 and HCJ8 sequences (Okamoto et al., 1991, 1992) did not affect the choice of candidate epitopes and conserved nucleotide sequences.

T-cell epitopes were located with the use of the TSites software program (Feller and de la Cruz, 1991), which carries out searches using four computer algorithms called A, R, D and d. The ‘A’ algorithm scans 11-amino acid blocks for amphipathic helices (in which one portion of the structure is hydrophilic and the other hydrophobic) and other structural properties (Margalit et al., 1987). The ‘R’ algorithm searches for amino acid sequence motifs that are recognized by MHC class I and II molecules (Rothbard and Taylor, 1988). The third and fourth algorithms search for antigenic sites recognized by the two MHC class II molecules IAd (‘D’ motif) and IEd (‘d’ motif) (Sette et al., 1989).
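The kernel-smoothed conservation score described above can be illustrated with a minimal sketch. The exact kernel and bandwidth used by Lauder (1983) are not stated here, so the Gaussian weights, the `bandwidth` parameter, the truncation at four bandwidths, and the function names `conserved_flags`, `conservation_scores` and `block_means` are all assumptions made for illustration only. A position is treated as conserved when every aligned sequence carries the same character there.

```python
import math


def conserved_flags(aligned):
    """Return 1.0 where all aligned sequences agree at a position, else 0.0.

    Assumption: gap characters are compared like any other character.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to equal length")
    return [1.0 if len({seq[i] for seq in aligned}) == 1 else 0.0
            for i in range(length)]


def conservation_scores(aligned, bandwidth=5.0):
    """Kernel-weighted fraction of conserved positions around each position.

    Gaussian weights are an assumed stand-in for the published kernel;
    nearer positions get more weight, and each score lies between 0 and 1.
    """
    flags = conserved_flags(aligned)
    n = len(flags)
    reach = int(math.ceil(4 * bandwidth))  # ignore negligible far weights
    scores = []
    for i in range(n):
        weighted_sum = 0.0
        weight_total = 0.0
        for j in range(max(0, i - reach), min(n, i + reach + 1)):
            w = math.exp(-0.5 * ((j - i) / bandwidth) ** 2)
            weighted_sum += w * flags[j]
            weight_total += w
        scores.append(weighted_sum / weight_total)
    return scores


def block_means(scores, size=50):
    """Average scores over consecutive blocks, as done for plotting."""
    blocks = [scores[k:k + size] for k in range(0, len(scores), size)]
    return [sum(block) / len(block) for block in blocks]
```

Calling `block_means(conservation_scores(alignment), size=50)` on a list of aligned sequences would give one averaged value per 50 positions, which corresponds to the kind of profile averaged over 50 nucleotides or amino acids in the Results.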

Results

Nucleotide sequence conservation

Kernel density analysis of nucleotide conservation along the HCV genome was carried out on the seven complete genomes (Chen et al., 1992; Choo et al., 1991; Houghton et al., 1988; Kato et al., 1990b; Okamoto et al., 1991, 1992; Takamizawa et al., 1991; Tanaka et al., 1992), which we have designated the reference sequences. Conservation scores for each of the positions on the 9.4 kb genome and its 3011 amino acids were calculated, expressing continuous variation in the degree of conservation along the length of the genome. For better graphical presentation we took the averages of the conservation scores for 50 nucleotides or 50 amino acids and used them to show the nucleotide and amino acid conservation profiles of the HCV genome (Fig. 1). With respect to nucleotides, the 5’ noncoding region was the most highly conserved. The most poorly conserved region was in the NS5 region around positions 7400-7550. Further variability among the reference sequences was introduced when two additional sequences present in the HCJ6 genome were inserted in this region (Okamoto et al., 1991). In general, the degree of amino acid conservation was greater than that found with nucleotide sequences, except when conservation scores for the latter fell below 65%.

A search for conserved sequences with 15 or more nucleotides, which might serve as primers for sequencing or for PCR assays, was carried out through all HCV nucleotide sequences published through April 1992. Table 1 lists the eight sequences which are 95-100% conserved within the current data base. Six sequences ranging in length from 15 to 69 nucleotides were located in the 5’ noncoding region. Two others with 16 or 17 nucleotides were found in the structural region. There were no conserved sequences of primer length beyond position 1304. Thus, the choice of primers for sequencing or for PCR assays must be tailored to the HCV strains under study (Chen et al., 1991a).

[Fig. 1 image not recovered. Axis labels: Genomic regions; Nucleotide positions.]

Fig. 1. Conservation of nucleotide and of amino acid sequences along the HCV genome, based upon the seven reference sequences. The genomic regions are, left to right: N, the 5’ noncoding region, the genes encoding the capsid (C), membrane (M) and envelope (E) proteins, the genes encoding nonstructural (NS) proteins 1-5, and the 3’ noncoding region (partially shown). Nucleotide position numbers correspond to the first nucleotide for each region. Nucleotide sequence conservation values, depicted as a continuous line, are based on the averages of conservation scores over 50 consecutive positions. The highest and lowest nucleotide conservation scores, 0.94 and 0.15, were found at positions 100-150 and 7450, respectively. Amino acid sequence conservation values are shown as serrated bars each representing the average of conservation scores over 50 consecutive amino acids. The highest and lowest amino acid conservation scores were found, respectively, in the membrane (0.84) and NS2 (0.34) regions.

Amino acid conservation

Despite the paucity of conserved nucleotide sequences in the regions downstream of the 5’ noncoding region, amino acid sequences of the capsid and

TABLE 1
Conserved nucleotide sequences

Genomic location       Position numbers a   Sequence: 5’ to 3’   Conservation among isolates b   References
5’ noncoding region    21-45                CAC...GTC            12/12 (100%)                    c,d
5’ noncoding region    62-79                CTA...GTA            70/71 (99%)                     c-h
5’ noncoding region    122-156              CAT...GAA            73/76 (96%)                     c-h
5’ noncoding region    171-185              GGG...GGA            73/74 (99%)                     c-h
5’ noncoding region    210-225              TTG...CGC            60/63 (95%)                     c-g
5’ noncoding region    254-322              GCG...GCA            58/59 (98%)                     c,d,f-k
Capsid gene            434-449              TGC...CCC            18/18 (100%)                    c,e,g-l
Envelope gene          1288-1304            ATG...GTC            21/22 (95%)                     c,e,g,j,k,m,n

a Nucleotide positions are numbered according to Okamoto et al. (1990b).
b Variant sequences differed from the conserved sequences in up to three positions.
c Chen et al., 1992; Choo et al., 1991; Kato et al., 1990b; Okamoto et al., 1991, 1992; Takamizawa et al., 1991; Tanaka et al., 1992.
d Han et al., 1991.
e Fuchs et al., 1991.
f Bukh et al., 1992.
g Ogata et al., 1991; Okamoto et al., 1990b.
h Martell et al., 1992.
i Takeuchi et al., 1990a.
j Takeuchi et al., 1990b.
k Liu et al., 1992.
l Weiner et al., 1991c.
m Weiner et al., 1991a.
n Hijikata et al., 1991a.

membrane proteins were highly conserved, as can be seen from Fig. 1. The seven reference genomes exhibited 81% amino acid homology in the capsid region; with the inclusion of 22 other sequences homology fell to 75% (Fuchs et al., 1991; Liu et al., 1992; Ogata et al., 1991; Okamoto et al., 1990b; Takeuchi et al., 1990a,b; Weiner et al., 1991a,b). The reference genomes showed 87% homology among the 76 amino acids making up the membrane protein. Only five more mutation sites were uncovered by 24 additional analyses, the large majority of which spanned the entire length of the region, bringing the degree of homology among all 31 isolates to 80% (Fuchs et al., 1991; Kremsdorf et al., 1991; Liu et al., 1992; Ogata et al., 1991; Okamoto et al., 1990b; Takeuchi et al., 1990a,b; Weiner et al., 1991a,b).

In contrast to the capsid and membrane proteins, the putative envelope proteins were extremely variable. The amino acid sequences of the envelope protein (E), which is encoded by nucleotides 898-1491, have been reported in 33-37 isolates apart from the seven reference genomes (Delisse et al., 1991; Fuchs et al., 1991; Hijikata et al., 1991a; Kremsdorf et al., 1991; Liu et al., 1992; Ogata et al., 1991; Okamoto et al., 1990b; Takeuchi et al., 1990a,b; Weiner et al., 1991a,b). Among the latter, there were only 80 conserved amino acids out of 198 (40% homology). The additional 33-37 sequences introduced mutations at 12 other positions, decreasing the amino acid homology to 34%. Another putative envelope peptide, encoded by nucleotides 1492-1566 (Weiner et al., 1991a), has been determined in a total of 76 isolates. The degree of homology was 12% among the reference genomes. Inclusion of the additional 69 analyses had almost no effect on the degree of homology; only three of the 25 amino acids in this peptide were conserved in 75/76 isolates (Chen et al., 1991b; Choo et al., 1991; Kato et al., 1990b, 1992; Kremsdorf et al., 1991; Liu et al., 1992; Ogata et al., 1991; Okamoto et al., 1990b, 1991; Takamizawa et al., 1991; Takeuchi et al., 1990b; Weiner et al., 1991a, 1991b).

Despite the overall mutability of the putative envelope proteins, there are certain fixed features. Within the E protein there is a 14-amino acid sequence, HRMAWDMMMNWSPT, that is conserved in 42/44 isolates (Delisse et al., 1991; Fuchs et al., 1991; Hijikata et al., 1991a; Kremsdorf et al., 1991; Liu et al., 1992; Ogata et al., 1991; Okamoto et al., 1990b; Takeuchi et al., 1990a,b; Weiner et al., 1991a,b; and the reference genomes). In the two variant sequences, a stop codon replaced a tryptophan residue (Delisse et al., 1991), and MN was replaced by LS (Okamoto et al., 1992). The MMMNW sequence is encoded by the conserved sequence shown in Table 1. The conserved peptide, located at amino acid positions 316-325 (numbered from the start codon of the polyprotein), has negative hydrophilicity and antigenicity values (-1.13 and -0.183, respectively). The second constant feature is the conservation of the eight cysteine residues in the E protein. Among the 44 isolates that were sequenced, cysteine residues are conserved in 351/352 instances; a single mutation (to serine) was found in one isolate (Weiner et al., 1991b). Taking the arrangement of the flavivirus envelope protein as a model (Heinz and Roehrig, 1990), which is appropriate since HCV is related to the flaviviruses (Miller and Purcell, 1990), we propose that cysteine disulfide linkages play essential roles in determining the three-dimensional configuration of the envelope protein, however variable the intervening amino acid sequences.

Amino acid homologies among the reference sequences in the nonstructural regions were as follows: 65% in amino acids encoded by nucleotide 1567 to the 3’ terminus of NS1; 47% in NS2; 73% in NS3; 65% in NS4; and 61% in NS5. All six cysteine residues in NS1 were highly conserved; among the 59-67 isolates studied they were invariant in 360/362 instances (99%). It was interesting that few or no changes in the number of conserved amino acids within regions NS2 to NS5 were found when additional sequences were taken into account. The inclusion of two additional sequences covering amino acid positions 1-49 in NS2 (Ogata et al., 1991), and of five additional sequences for amino acids 88-267 in NS4, did not reveal new sites of mutation (Kaneko et al., 1991). Inclusion of up to 17 additional sequences for amino acids 409-609 in NS3 resulted in the appearance of only seven other mutation sites (Chen et al., 1991; Kubo et al., 1989; Li et al., 1991; Martell et al., 1992; Ogata et al., 1991; Ulrich et al., 1990). Similarly, the inclusion of up to 32 additional sequences in positions 633-745 of NS5 decreased by only four the number of conserved amino acids (Enomoto et al., 1990; Kato et al., 1990a, 1991; Mori et al., 1992; Ogata et al., 1991), although it must be noted that the sequence in question was partially deleted in one isolate (Kato et al.,


1990a). It would appear, then, that the amino acids specified by the reference genomes were to a large extent also conserved in many other HCV strains.

Candidate B-cell epitopes

Our interest lay in identifying conserved amino acid sequences that might serve as B- or T-cell epitopes. The prediction of antigenicity based on amino acid sequence, hydrophilicity, conformation and composition is admittedly inadequate (Van Regenmortel, 1989). Nonetheless, in an attempt to focus upon the likeliest B-cell epitope candidates, guidelines were formed. We defined as ‘conserved’ any amino acid in a given position that was invariant in at least 80% of all reported sequences. We looked for sequences that consisted of 12-40 conserved amino acids, bearing in mind their intended use as synthetic antigenic peptides. Because of its predictive value (Parker et al., 1986), the hydrophilicity value or profile was given more weight than the antigenicity value. Therefore, some hydrophilic regions with relatively low antigenicity scores also were considered. We used peak hydrophilicity values of 0.9 or more as a criterion for inclusion.

Table 2 lists the HCV sequences that meet these criteria. We propose these sequences as candidate B-cell epitopes. All the genes are represented by at least one sequence, with the exception of the envelope gene and NS2. Thus the search for a possible neutralization epitope located in the envelope protein would have to be narrowed to sequences that were conserved among a smaller number of strains. Five candidate B-cell epitopes were found in NS5. Of particular interest were sequences 681-695, 746-774 and 780-798, which were present in a relatively large number of isolates.

Candidate T-cell epitopes

Identification of T-cell epitopes is potentially useful for devising therapeutic strategies against HCV infection. The fact that the host’s immune response may be augmented by peptides displayed by the MHC complex seems all the more remarkable considering that MHC class I molecules may bind peptides with only eight or nine amino acid residues (Rotzschke et al., 1990; Van Bleek et al., 1990). Sequence patterns common to T-cell epitopes can be even shorter (Rothbard and Taylor, 1988). However, economy of length has set stringent requirements as to specificity, and antigenicity may be lost with an unsuitable amino acid substitution (Rothbard and Taylor, 1988; Sette et al., 1989).

The computer search revealed large numbers of potential T-cell epitopes, but these were narrowed down by imposing the criterion of absolute conservation of amino acid sequences within the existing data bases (Table 3). Sequence conservation is an important factor in identifying candidate epitopes. For example, the sequence HTVSGFVSL in the NS1 region of the prototype HCV (Choo et al., 1991; Houghton et al., 1988) had the A, D and R motifs, but these features were not consistently carried through the corresponding amino acid sequences in 18 other isolates (Chen et al., 1992; Kato et al., 1990b; Kremsdorf et al., 1991; Liu et al., 1992; Ogata et al., 1991;

TABLE 2
Candidate B-cell epitopes within conserved regions of HCV gene products

Gene product   Amino acid positions a   Sequence a                                 Peak hydrophilicity b   Peak antigenicity c   Number of isolates   References
Capsid         12-26 (12-26)            KRNTNRRPQDVKFPG                            2.1                     0.098                 29                   d,e
Capsid         46-75 (46-75)            VRATRKTSERSQPRGRRQPIPKARRPEGRt             2.0                     0.048                 29                   d,e
Capsid         92-109 (92-109)          GWAGWLLSPRGSRpSWGP                         1.1                     -0.012                29                   d,e
Membrane       31-69 (146-184)          GAARALAHGVRVLEDGVNYATGNLPGCSFSIFLLALLSC    1.0                     0.13                  30-31                d-f
NS1            192-203 (581-592)        CPTDCFRKHPeA                               1.3                     0.094                 10                   d,f,g
NS1            241-277 (630-666)        RMYVGGVEHRLeAACNWTRGERCdLEDRDRSELSPLL      2.6                     0.076                 10                   d,f,g
NS3            222-248 (1228-1254)      LHAPTGSGKSTKVPAAYAAQGYKVLVL                1.0                     0.071                 9                    d,g
NS3            381-399 (1387-1405)      GGRHLIFCHSKKKCDELAA                        2.3                     0.12                  9                    d,g
NS4            89-106 (1704-1721)       eFDEMEECASHLPYIEQG                         1.6                     0.043                 12                   d,h
NS5            159-198 (2172-2211)      TSMLTDPSHITAEtAkRRLARGSPPSIASSSASQLSAPSL   1.9                     0.092                 7                    d
NS5            570-584 (2583-2597)      PDLGVRVCEKMALYD                            0.9                     0.028                 9                    d,g
NS5            681-695 (2694-2708)      CGYRRCRASGVLTTS                            1.3                     0.022                 29                   d,g,i
NS5            746-774 (2759-2787)      FTEAMTRYSAPPGDPPQPEYDLELITSCS              1.0                     0.025                 25                   d,g,j
NS5            780-798 (2793-2811)      AHDASGKRvYYLTRDPTTP                        1.5                     0.086                 9-25                 d,g,j

a Position numbers refer to amino acid numbers within the genomic regions defined by Takamizawa et al. (1991). The positions given within the parentheses are numbered from the start codon of the polyprotein. Consensus sequences are given. The occurrence of variant amino acids in > 20% of the isolates is indicated by the lower case.
b Known antigenic determinants are located in or adjacent to points of highest local hydrophilicity that typically range from about 1 to 2 (Hopp and Woods, 1981).
c Antigenicity values > 0.05 are considered high (Welling et al., 1985).
d Chen et al., 1992; Choo et al., 1991; Kato et al., 1990b; Okamoto et al., 1991, 1992; Takamizawa et al., 1991; Tanaka et al., 1992.
e Fuchs et al., 1991; Liu et al., 1992; Ogata et al., 1991; Okamoto et al., 1990b; Takeuchi et al., 1990b; Weiner et al., 1991b.
f Kremsdorf et al., 1991.
g Ogata et al., 1991.
h Kaneko et al., 1991.
i Enomoto et al., 1990; Kato et al., 1991; Mori et al., 1992.
j Kato et al., 1990a.

Okamoto et al., 1991, 1992; Takamizawa et al., 1991; Takeuchi et al., 1990b; Tanaka et al., 1992; Weiner et al., 1991a). Thus it was excluded. We have listed in Table 3 the T-cell epitopes most likely to be found among different HCV strains. All the gene products are represented with the exception of the capsid protein. The envelope protein is represented by a portion of its sole conserved peptide. Several candidate T-cell epitopes overlap some B-cell epitopes. In fact all five of

TABLE 3
Candidate T-cell epitopes within absolutely conserved regions of HCV gene products

Gene product   Amino acid positions a   Sequence          T-cell motifs b   Number of isolates   References
Membrane       34-46 (149-161)          RALAHGVRVLEDG     A,D               30                   d,e
Membrane       63-70 (178-185)          LLALLS c          A,D               31                   d-f
Envelope       126-132 (317-323)        RMAWDMM           R                 44                   d-g
NS1            193-201 (582-590)        PTDCFRKHP c       A                 10                   d,f,h
NS1            222-228 (611-617)        YPYRLWH           A,R               10                   d,f,h
NS2            234-239 (963-968)        GLRDLA            A                 7                    d
NS3            20-24 (1026-1030)        LAPIT             A,D               7                    d
NS3            155-160 (1161-1166)      LKGSSG            D                 7                    d
NS3            229-235 (1235-1241)      GKSTKVP c         A                 9                    d,h
NS3            250-255 (1256-1261)      PSVAAT            D                 9                    d,h
NS3            323-331 (1329-1337)      IGTVLDQAE         A,R               9                    d,h
NS3            408-413 (1414-1419)      AVAYYR            A                 9-12                 d,h,i
NS3            439-447 (1445-1453)      TGDFDSVID         A                 12                   d,h,i
NS3            481-485 (1487-1491)      RRGRT             d                 15                   d,h,j
NS3            548-553 (1554-1559)      HLEFWE            A,R               24                   d,h,i,k
NS3            555-558 (1561-1564)      VFTG              A,R               24                   d,h,j,k
NS4            154-166 (1769-1781)      ISGIQYLAGLSTL     A,D               12                   d,l
NS4            238-243 (1853-1858)      AGYGAG            A,R               7                    d
NS4            283-288 (1898-1903)      CAAILR            A                 7                    d
NS4            299-304 (1914-1919)      QWMNRL            A                 7                    d
NS5            160-165 (2173-2178)      SMLTDP c          A,D               7                    d
NS5            167-171 (2180-2184)      HITAE c           R                 7                    d
NS5            186-195 (2199-2208)      ASSSASQLSA c      D                 7                    d
NS5            574-580 (2587-2593)      VRVCEKM c         A                 9                    d,h
NS5            768-775 (2781-2788)      ELITSCSS c        R,D               25                   d,h,m

a Position numbers refer to amino acid numbers within the genomic regions defined by Takamizawa et al. (1991). The positions given within the parentheses are numbered from the start codon of the polyprotein.
b A: amphipathic helix (Margalit et al., 1987). R: sequence pattern recognizable by MHC class I and II molecules (Rothbard and Taylor, 1988). D, d: sequence patterns recognizable by mouse MHC molecules class II, IAd and IEd, respectively (Sette et al., 1989).
c Sequence overlaps a B-cell sequence given in Table 2.
d Chen et al., 1992; Choo et al., 1991; Kato et al., 1990b; Okamoto et al., 1991, 1992; Takamizawa et al., 1991; Tanaka et al., 1992.
e Fuchs et al., 1991; Liu et al., 1992; Ogata et al., 1991; Okamoto et al., 1990b; Takeuchi et al., 1990b; Weiner et al., 1991b.
f Kremsdorf et al., 1991.
g Hijikata et al., 1991a; Weiner et al., 1991a; Takeuchi et al., 1990a; Delisse et al., 1991.
h Ogata et al., 1991.
i Ulrich et al., 1990.
j Li et al., 1991; Chen et al., 1991; Kubo et al., 1989.
k Martell et al., 1992.
l Kaneko et al., 1991.
m Kato et al., 1990a.


the T-cell sequences in NS5 fall within three candidate B-cell epitopes (see Table 2, positions 159-198, 570-584, 746-774), a condition that may further simplify the choice of possible antigenic sites. The juxtaposition of B- and T-cell epitopes is advantageous because it may increase antigenicity (Francis et al., 1987; Rothbard, 1987).
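The overlap between the NS5 candidates can be checked with a short interval comparison. The sketch below is illustrative only: the coordinates are the polyprotein positions printed for NS5 in Tables 2 and 3, while the dictionary names and the helper functions `shared_residues` and `overlaps` are hypothetical names introduced here.

```python
# NS5 candidate B-cell epitopes (Table 2), keyed by NS5 position range,
# with polyprotein start and end positions (inclusive).
B_CELL_NS5 = {
    "159-198": (2172, 2211),
    "570-584": (2583, 2597),
    "681-695": (2694, 2708),
    "746-774": (2759, 2787),
    "780-798": (2793, 2811),
}

# NS5 candidate T-cell epitopes (Table 3), same coordinate convention.
T_CELL_NS5 = {
    "160-165": (2173, 2178),
    "167-171": (2180, 2184),
    "186-195": (2199, 2208),
    "574-580": (2587, 2593),
    "768-775": (2781, 2788),
}


def shared_residues(a, b):
    """Number of positions common to two inclusive (start, end) ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def overlaps(t_epitopes, b_epitopes):
    """Map each T-cell epitope to the B-cell epitopes it overlaps."""
    result = {}
    for t_name, t_range in t_epitopes.items():
        hits = {}
        for b_name, b_range in b_epitopes.items():
            shared = shared_residues(t_range, b_range)
            if shared > 0:
                hits[b_name] = shared
        result[t_name] = hits
    return result
```

With the coordinates as printed, `overlaps(T_CELL_NS5, B_CELL_NS5)` assigns each of the five T-cell sequences to exactly one of the three B-cell epitopes named above. Four are fully contained, and the fifth (768-775) shares seven of its eight residues with the 746-774 epitope.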

Discussion

In general, RNA viruses have a higher mutation rate than DNA viruses, probably because they lack the proofreading function which corrects the mismatches that occur in the course of replication (Steinhauer and Holland, 1987). The observed heterogeneity of amino acid sequences in the envelope proteins would appear, then, to be less remarkable than the conservation of other gene products. We deduce that the tolerance for amino acid mutations among the envelope proteins is much wider than that for other gene products. The mutation rate is presumably uniform along the length of the genome, but those mutations resulting in altered membrane, capsid or helicase/protease proteins and portions of the NS5 gene product apparently do not persist because the gene products are defective. In all probability, the constraints of survival and replication in the host eliminate many variant hepatitis C virions bearing amino acid mutations in vital regions. As a result there are pronounced differences in the degree of conservation among the gene products.

Our objectives have been to identify conserved nucleotide sequences to facilitate primer design for the detection of HCV and for nucleotide sequencing, and to identify conserved amino acid sequences that might be used in immunological strategies. The data base from which the analyses were made drew upon HCV isolates from many parts of the world. Although Japanese and US strains predominated, strains from Europe, South America and other parts of Asia also were represented. The information gained from examination of these data may assist in the development of immune-based therapy that would be effective against many different HCV strains.

Attempts to identify B- and T-cell epitopes are no more than informed guesses. It is possible that our criterion for selection of candidate T-cell epitopes, that of absolute conservation among all reported HCV sequences, was too stringent. An NS5 amino acid sequence, MSYtWTGALiTPCAAE, that has been shown to elicit a cytotoxic T-lymphocyte response (Shirai et al., 1992) was omitted from our selection because variants were present in three of the reference genomes. The success rate of using hydrophilicity and other parameters in identifying epitopes is about 60% (Rooman and Wodak, 1988); that for predicting T-cell epitopes based on sequence patterns and structural features is about 75% (Margalit et al., 1987; Sette et al., 1989). These figures are acceptable in view of the alternative, namely the probability of success with blind guesses.

Our analyses appear to be borne out by several independent experimental studies. The criteria of amino acid conservation and a relatively high degree of hydrophilicity that we have used for the identification of candidate B-cell epitopes may explain the efficiency with which the first generation anti-HCV assay is able to detect exposure to HCV (Aach et al., 1991). The only candidate B-cell epitope that we found in the NS4 region falls within the sequence encoding the 5-1-1 peptide sequence which is present in the C-100 protein that is used in that assay (Kuo et al., 1989). It also lies within an immunodominant B-cell epitope of the HCV genome that was identified with the use of synthetic peptides (Cerino and Mondelli, 1991). Further, three laboratories have obtained promising results in detecting antibodies to synthetic peptides based on sequences in the highly conserved 115 amino acid capsid protein, in which we found three candidate B-cell epitopes. Amino acid sequences covering positions 1-74, 69-120 (Nassoff et al., 1991), 13-41, 73-89 (Ching et al., 1992), and 39-74 (Okamoto et al., 1990a) have been identified as immunoreactive regions.

As to the use of the nucleotide sequence conservation data, we have successfully used them in designing primers for a nested PCR assay (Lin et al., 1992). Our chief contribution has been to highlight possible antigenic sites in other regions of the HCV genome, particularly NS5, where some conserved peptide sequences, possessing the characteristics of both B- and T-cell epitopes, may be found.
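As an illustration of the hydrophilicity criterion discussed above, the following sketch computes a Hopp and Woods (1981) profile as a moving average over hexapeptide windows and reports its peak. The window length of six follows the original Hopp and Woods procedure and is an assumption here, since the window and flanking residues used for Table 2 are not stated; the function names `hydrophilicity_profile`, `peak_hydrophilicity` and `meets_threshold` are hypothetical. Lower-case (variant) residues are treated like their upper-case equivalents.

```python
# Hopp and Woods (1981) hydrophilicity values for the 20 amino acids.
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3,
    "N": 0.2, "Q": 0.2, "G": 0.0, "P": 0.0, "T": -0.4,
    "A": -0.5, "H": -0.5, "C": -1.0, "M": -1.3, "V": -1.5,
    "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5, "W": -3.4,
}


def hydrophilicity_profile(peptide, window=6):
    """Mean Hopp-Woods value for every window of consecutive residues."""
    seq = peptide.upper()
    if len(seq) < window:
        raise ValueError("peptide is shorter than the averaging window")
    return [
        sum(HOPP_WOODS[aa] for aa in seq[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def peak_hydrophilicity(peptide, window=6):
    """Return (start index of the best window, its mean hydrophilicity)."""
    profile = hydrophilicity_profile(peptide, window)
    peak = max(profile)
    return profile.index(peak), peak


def meets_threshold(peptide, threshold=0.9, window=6):
    """Apply the inclusion criterion of a peak value of 0.9 or more."""
    return peak_hydrophilicity(peptide, window)[1] >= threshold
```

For example, `peak_hydrophilicity("KRNTNRRPQDVKFPG")` scans the first capsid candidate from Table 2. The value obtained this way need not equal the tabulated peak, because it depends on the window length and on whether flanking residues are included.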

Acknowledgement

We are grateful to the Eugene B. Casey Foundation for support.

References

Aach, R.D., Stevens, C.E., Hollinger, F.B., Mosley, J.W., Peterson, D.A., Taylor, P.E., Johnson, R.G., Barbosa, L.H. and Nemo, G.J. (1991) Hepatitis C virus infection in post-transfusion hepatitis: an analysis with first- and second-generation assays. New Engl. J. Med. 325, 1325-1329.
Alter, H.J., Purcell, R.H., Shih, J.W., Melpoder, J.C., Houghton, M., Choo, Q.L. and Kuo, G. (1989) Detection of antibody to hepatitis C virus in prospectively followed transfusion recipients with acute and chronic non-A, non-B hepatitis. New Engl. J. Med. 321, 1494-1500.
Bukh, J., Purcell, R.H. and Miller, R.H. (1992) Sequence analysis of the 5’ noncoding region of hepatitis C virus. Proc. Natl. Acad. Sci. USA 89, 4942-4946.
Cerino, A. and Mondelli, M.U. (1991) Identification of an immunodominant B cell epitope on the hepatitis C virus nonstructural region defined by human monoclonal antibodies. J. Immunol. 147, 2692-2696.
Chen, P.J., Lin, M.H., Tu, S.J. and Chen, D.S. (1991a) Isolation of a complementary DNA fragment of hepatitis C virus in Taiwan revealed significant sequence variations compared to other isolates. Hepatology 14, 73-78.
Chen, P.J., Lin, M.H., Tai, K.F., Liu, P.C., Lin, C.J. and Chen, D.S. (1992) The Taiwanese hepatitis C virus genome: sequence determination and mapping the 5’ termini of viral genomic and antigenomic RNA. Virology 188, 102-113.
Chen, W.R., Okamoto, H. and Tao, Q.M. (1991b) Similarity and diversity in sequence of HCV genome among Chinese, Japanese and American strains. Chin. Med. J. 104, 825-829.
Ching, W.M., Wychowski, C., Beach, M.J., Wang, H., Davies, C.L., Carl, M., Bradley, D.W., Alter, H.J., Feinstone, S.M. and Shih, J.W.K. (1992) Interaction of immune sera with synthetic peptides corresponding to the structural protein region of hepatitis C virus. Proc. Natl. Acad. Sci. USA 89, 3190-3194.
Choo, Q.L., Richman, K.H., Han, J.H., Berger, K., Lee, C., Dong, C., Gallegos, C., Coit, D., Medina-Selby, A., Barr, P.J., Weiner, A.J., Bradley, D.W., Kuo, G. and Houghton, M. (1991) Genetic organization and diversity of the hepatitis C virus. Proc. Natl. Acad. Sci. USA 88, 2451-2455.
Delisse, A.M., Descurieux, M., Rutgers, T., D’Hondt, E., De Wilde, M., Arima, T., Barrera-Sala, J.M., Ercella, M.G., Rouelle, J.L. and Cabezon, T. (1991) Sequence analysis of the putative structural genes of hepatitis C virus from Japanese and European origin. J. Hepatol. 13 (Suppl. 4), S20-23.
Enomoto, N., Takada, A., Nakao, T. and Date, T. (1990) There are two major types of hepatitis C virus in Japan. Biochem. Biophys. Res. Commun. 170, 1021-1025.
Esteban, R., Esteban, J.I., Lopez-Talavera, J.C., Genesca, J., Buti, M., Vargas, C. and Guardia, J. (1991) Epidemiology of hepatitis C virus infection. In: F.B. Hollinger, S.M. Lemon and H. Margolis (Eds), Viral hepatitis and liver disease, pp. 413-415. Williams and Wilkins, Baltimore, MD.
Feller, D.C. and de la Cruz, V.F. (1991) Identifying T-cell antigenic sites. Nature 349, 720-721.
Francis, M.J., Hastings, G.Z., Syred, A.D., McGinn, B., Brown, F. and Rowlands, D.J. (1987) Non-responsiveness to a foot-and-mouth disease virus peptide overcome by addition of foreign helper T-cell determinants. Nature 330, 168-170.
Fuchs, K., Motz, M., Schreier, E., Zachoval, R., Deinhardt, F. and Roggendorf, M. (1991) Characterization of nucleotide sequences from European hepatitis C virus isolates. Gene 103, 163-169.
Garson, J.A., Tedder, R.S., Briggs, M., Tuke, P., Glazebrook, J.A., Trute, A., Parker, D., Barbara, J.A.J., Contreras, M. and Aloysius, S. (1990) Detection of hepatitis C viral sequences in blood donations by ‘nested’ polymerase chain reaction and prediction of infectivity. Lancet 335, 1419-1422.
Han, J., Shyamala, V., Richman, K.H., Brauer, M.J., Irvine, B., Urdea, M.S., Tekamp-Olsen, P., Kuo, G., Choo, Q.L. and Houghton, M. (1991) Characterization of the terminal regions of hepatitis C viral RNA: identification of conserved sequences in the 5’ untranslated region and poly(A) tails at the 3’ end. Proc. Natl. Acad. Sci. USA 88, 1711-1715.
Heinz, F.X. and Roehrig, J.T. (1990) Flaviviruses. In: M.H.V. Van Regenmortel and A.R. Neurath (Eds), Immunochemistry of viruses II, pp. 403-458. Elsevier, New York.
Hijikata, M., Kato, N., Ootsuyama, Y., Nakagawa, M., Ohkoshi, S. and Shimotohno, K. (1991a) Hypervariable regions in the putative glycoprotein of hepatitis C virus. Biochem. Biophys. Res. Commun. 175, 220-228.
Hijikata, M., Kato, N., Ootsuyama, Y., Nakagawa, M. and Shimotohno, K. (1991b) Gene mapping of the putative structural region of the hepatitis C virus genome by in vitro processing analysis. Proc. Natl. Acad. Sci. USA 88, 5547-5551.
Hopp, T.P. and Woods, K.R. (1981) Prediction of protein antigenic determinants from amino acid sequences. Proc. Natl. Acad. Sci. USA 78, 3824-3828.
Houghton, M., Choo, Q.L. and Kuo, G. (1988) Eur. Pat. Applic. 88,310,922.5 and Publ. 0,318,216.
Houghton, M., Richman, K., Han, J., Berger, K., Lee, C., Dong, C., Overby, L., Weiner, A., Bradley, D., Kuo, G. and Choo, Q.L. (1991a) Hepatitis C virus (HCV): a relative of the pestiviruses and flaviviruses. In: F.B. Hollinger, S.M. Lemon and H. Margolis (Eds), Viral hepatitis and liver disease, pp. 328-333. Williams and Wilkins, Baltimore, MD.
Houghton, M., Weiner, A., Han, J., Kuo, G. and Choo, Q.L. (1991b) Molecular biology of the hepatitis C viruses: implications for diagnosis, development and control of viral disease. Hepatology 14, 381-388.
Kaneko, S., Kuno, K., Yanagi, M., Unoura, M., Hattori, N., Murakami, S. and Kobayashi, K. (1991) Sequence analysis of hepatitis C virus genomes isolated from five patients with chronic non-A, non-B hepatitis. In: F.B. Hollinger, S.M. Lemon and H. Margolis (Eds), Viral hepatitis and liver disease, pp. 364-367. Williams and Wilkins, Baltimore, MD.
Kato, N., Hijikata, M., Ootsuyama, Y., Nakagawa, M., Ohkoshi, S. and Shimotohno, K. (1990a) Sequence diversity of hepatitis C viral genomes. Mol. Biol. Med. 7, 495-501.
Kato, N., Hijikata, M., Ootsuyama, Y., Nakagawa, M., Ohkoshi, S., Sugimura, T. and Shimotohno, K. (1990b) Molecular cloning of the human hepatitis C virus genome from Japanese patients with non-A, non-B hepatitis. Proc. Natl. Acad. Sci. USA 87, 9524-9528.

Kato, N., Ootsuyama, Y., Ohkoshi, S., Nakazawa, T., Mori, S., Hijikata, M. and Shimotohno, K. (1991) Distribution of plural HCV types in Japan. Biochem. Biophys. Res. Commun. 181, 279-285.

Kato, N., Ootsuyama, Y., Tanaka, T., Nakagawa, M., Nakazawa, T., Muraiso, K., Ohkoshi, S., Hijikata, M. and Shimotohno, K. (1992) Marked sequence diversity in putative envelope proteins of hepatitis C virus. Virus Res. 22, 107-123.

Kremsdorf, D., Porchon, C. and Brechot, C. (1991) Hepatitis C virus (HCV)-RNA in non-A, non-B chronic hepatitis in France. J. Hepatol. 13 (Suppl. 4), S24-32.

Kubo, Y., Takeuchi, K., Boonmar, S., Katayama, T., Choo, Q.L., Kuo, G., Weiner, A.J., Bradley, D.W., Houghton, M., Saito, I. and Miyamura, T. (1989) A cDNA fragment of hepatitis C virus isolated from an implicated donor of post-transfusion non-A, non-B hepatitis in Japan. Nucleic Acids Res. 17, 10367-10372.

Kuo, G., Choo, Q.L., Alter, H.J., Gitnick, G.L., Redeker, A.G., Purcell, R.H., Miyamura, T., Dienstag, J.L., Alter, M.J., Stevens, C.E., Tegtmeier, G.E., Bonino, F., Colombo, M., Lee, W.S., Kuo, C., Berger, K., Schuster, J.R., Overby, L.R., Bradley, D.W. and Houghton, M. (1989) An assay for circulating antibodies to a major etiologic agent of human non-A, non-B hepatitis. Science 244, 362-364.

Lauder, I.J. (1983) Direct kernel assessment of diagnostic probabilities. Biometrika 70, 251-256.

Lauder, I.J., Lin, H.J., Lau, J.Y.N., Siu, T.S. and Lai, C.L. (1993) The variability of the hepatitis B virus genome: statistical analysis and biological implications. Mol. Biol. Evol. 10, 457-470.

Li, J., Tong, S., Vitviski, L., Lepot, D. and Trepo, C. (1991) Evidence of two major genotypes of hepatitis C virus in France and close relatedness of the predominant one with the prototype virus. J. Hepatol. 13 (Suppl. 4), S33-37.

Lin, H.J., Shi, N., Mizokami, M. and Hollinger, F.B. (1992) Polymerase chain reaction assay for hepatitis C virus RNA using a single tube for reverse transcription and serial rounds of amplification with nested primer pairs. J. Med. Virol. 38, 220-225.

Liu, K., Hu, Z., Li, H., Prince, A.M. and Inchauspe, G. (1992) Genomic typing of hepatitis C viruses present in China. Gene 114, 245-250.

Margalit, H., Spouge, J.L., Cornette, J.L., Cease, K.B., Delisi, C. and Berzofsky, J.A. (1987) Prediction of immunodominant helper T cell antigenic sites from the primary sequence. J. Immunol. 138, 2213-2229.

Martell, M., Esteban, J.I., Quer, J., Genesca, J., Weiner, A., Esteban, R., Guardia, J. and Gomez, J. (1992) Hepatitis C virus (HCV) circulates as a population of different but closely related genomes: quasispecies nature of HCV genome distribution. J. Virol. 66, 3225-3229.

Miller, R.H. and Purcell, R.H. (1990) Hepatitis C virus shares amino acid sequence similarity with pestiviruses and flaviviruses as well as members of two plant virus supergroups. Proc. Natl. Acad. Sci. USA 87, 2057-2061.

Mori, S., Kato, N., Yagyu, A., Tanaka, T., Ikeda, Y., Petchclai, B., Chiewsilp, P., Kurimura, T. and Shimotohno, K. (1992) A new type of hepatitis C virus in patients in Thailand. Biochem. Biophys. Res. Commun. 183, 334-342.

Nasoff, M., Zebedee, S.L., Inchauspe, G. and Prince, A.M. (1991) Identification of an immunodominant epitope within the capsid protein of hepatitis C virus. Proc. Natl. Acad. Sci. USA 88, 5462-5466.

Ogata, N., Alter, H.J., Miller, R.H. and Purcell, R.H. (1991) Nucleotide sequence and mutation rate of the H strain of hepatitis C virus. Proc. Natl. Acad. Sci. USA 88, 3392-3396.

Okamoto, H., Munekata, E., Tsuda, F., Takahashi, K., Yotsumoto, Y., Tanaka, T., Tachibana, K., Akahane, Y., Sugai, Y., Miyakawa, Y. and Mayumi, M. (1990a) Enzyme-linked immunosorbent assay for antibodies against capsid protein of hepatitis C virus with a synthetic oligopeptide. Jpn. J. Exp. Med. 60, 223-233.

Okamoto, H., Okada, S., Sugiyama, Y., Yotsumoto, S., Tanaka, T., Yoshizawa, H., Tsuda, F., Miyakawa, Y. and Mayumi, M. (1990b) The 5’ terminal sequences of the hepatitis C viral genome. Jpn. J. Exp. Med. 60, 167-177.

Okamoto, H., Okada, S., Sugiyama, Y., Kurai, K., Iizuka, H., Machida, A., Miyakawa, Y. and Mayumi, M. (1991) Nucleotide sequence of genomic RNA of hepatitis C virus isolated from a human carrier: comparison with reported isolates for conserved and divergent regions. J. Gen. Virol. 72, 2697-2704.

Okamoto, H., Kurai, K., Okada, S., Yamamoto, K., Iizuka, H., Tanaka, T., Fukuda, S., Tsuda, F. and Mishiro, S. (1992) Full-length sequence of a hepatitis C virus genome having poor homology to reported isolates: comparative study of four distinct genotypes. Virology 188, 331-341.

Parker, J.M.R., Guo, D. and Hodges, R.S. (1986) New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry 25, 5425-5432.

Rooman, M.J. and Wodak, S.J. (1988) Identification of predictive sequence motifs limited by protein structure data base size. Nature 335, 45-49.

Rothbard, J. (1987) Synthetic peptides as vaccines. Nature 330, 106-107.

Rothbard, J.B. and Taylor, W.R. (1988) A sequence pattern common to T cell epitopes. EMBO J. 7, 93-100.

Rotzschke, O., Falk, K., Deres, K., Schild, H., Norda, M., Metzger, J., Jung, G. and Rammensee, H.G. (1990) Isolation and analysis of naturally processed viral peptides as recognized by cytotoxic T cells. Nature 348, 252-255.

Sette, A., Buus, S., Apella, E., Smith, J.A., Chestnut, R., Miles, C., Colon, S.M. and Grey, H.M. (1989) Prediction of major histocompatibility complex binding regions of protein antigens by sequence pattern analysis. Proc. Natl. Acad. Sci. USA 86, 3296-3300.

Shirai, M., Akatsuka, T., Pendleton, C.D., Houghten, R., Wychowski, C., Mihalik, K., Feinstone, S. and Berzofsky, J.A. (1992) Induction of cytotoxic T cells to a cross-reactive epitope in the hepatitis C virus non-structural RNA polymerase-like protein. J. Virol. 66, 4098-4106.

Steinhauer, D.A. and Holland, J.J. (1987) Rapid evolution of RNA viruses. Annu. Rev. Microbiol. 41, 409-433.

Takamizawa, A., Mori, C., Fuke, I., Manabe, S., Murakami, S., Fujita, J., Onishi, E., Andoh, T., Yoshida, I. and Okayama, H. (1991) Structure and organization of the hepatitis C virus genome isolated from human carriers. J. Virol. 65, 1105-1113.

Takeuchi, K., Kubo, Y., Boonmar, S., Watanabe, Y., Katayama, T., Choo, Q.L., Kuo, G., Houghton, M., Saito, I. and Miyamura, T. (1990a) Nucleotide sequence of core and envelope genes of the hepatitis C virus genome derived directly from healthy human carriers. Nucleic Acids Res. 18, 4626.

Takeuchi, K., Kubo, Y., Boonmar, S., Watanabe, Y., Katayama, T., Choo, Q.L., Kuo, G., Houghton, M., Saito, I. and Miyamura, T. (1990b) Putative nucleocapsid and envelope protein genes of hepatitis C virus determined by comparison of the nucleotide sequences of two isolates derived from an experimentally infected chimpanzee and human carriers. J. Gen. Virol. 71, 3027-3033.

Tanaka, T., Kato, N., Nakagawa, N., Ootsuyama, Y., Cho, M.J., Nakazawa, T., Hijikata, M., Ishimura, Y. and Shimotohno, K. (1992) Molecular cloning of hepatitis C virus genome from a single Japanese carrier: sequence variation from the same individual and among infected individuals. Virus Res. 23, 39-53.

Ulrich, P.P., Romeo, J.M., Lana, P.K., Daniel, L.J. and Vyas, G.N. (1990) Detection, semiquantitation and genomic variation in hepatitis C virus sequences amplified from the plasma of blood donors with elevated alanine aminotransferase. J. Clin. Invest. 86, 1609-1614.

Van Bleek, G.M. and Nathenson, S.G. (1990) Isolation of an endogenously processed immunodominant viral peptide from the class I H-2K molecule. Nature 348, 213-216.

Van Regenmortel, M.H.V. (1989) Structural and functional approaches to the study of protein antigenicity. Immunol. Today 10, 266-272.

Weiner, A.J., Kuo, G., Bradley, D.W., Bonino, F., Saracco, G., Lee, C., Rosenblatt, J., Choo, Q.L. and Houghton, M. (1990) Detection of hepatitis C viral sequences in non-A, non-B hepatitis. Lancet 335, 1-3.

Weiner, A.J., Brauer, M.J., Rosenblatt, J., Richman, K.H., Tung, J., Crawford, K., Bonino, F., Saracco, G., Choo, Q.L., Houghton, M. and Han, J.H. (1991a) Variable and hypervariable domains are found in the regions of HCV corresponding to the flavivirus envelope and NS1 proteins and the pestivirus envelope glycoproteins. Virology 180, 842-848.

Weiner, A.J., Christopherson, C., Hall, J.E., Bonino, F., Saracco, G., Brunetto, M.R., Crawford, K., Venkatakrishna, S., Miyamura, T., McHutchison, J., Cuypers, T. and Houghton, M. (1991b) Sequence variation in hepatitis C viral isolates. J. Hepatol. 13 (Suppl. 4), S6-14.

Weiner, A.J., Truett, M.A., Rosenblatt, J., Han, J., Quan, S., Polito, A.J., Kuo, G., Choo, Q.L. and Houghton, M. (1991c) HCV immunologic and hybridization-based diagnostics. In: F.B. Hollinger, S.M. Lemon and H. Margolis (Eds), Viral hepatitis and liver disease, pp. 360-363. Williams and Wilkins, Baltimore, MD.

Welling, G.W., Weijer, W.J., van der Zee, R. and Welling-Wester, S. (1985) Prediction of sequential antigenic regions in proteins. FEBS Lett. 188, 215-218.