Genetic diversity of Hepatitis C Virus in Pakistan using Next Generation Sequencing


Accepted Manuscript
Title: Genetic diversity of Hepatitis C Virus in Pakistan using Next Generation Sequencing
Authors: Sana Saleem, Amjad Ali, Bushra Khubaib, Madiha Akram, Zareen Fatima, Muhammad Idrees
PII: S1386-6532(18)30221-X
DOI: https://doi.org/10.1016/j.jcv.2018.09.001
Reference: JCV 4049
To appear in: Journal of Clinical Virology
Received date: 3-4-2018
Revised date: 14-8-2018
Accepted date: 7-9-2018

Please cite this article as: Saleem S, Ali A, Khubaib B, Akram M, Fatima Z, Idrees M, Genetic diversity of Hepatitis C Virus in Pakistan using Next Generation Sequencing, Journal of Clinical Virology (2018), https://doi.org/10.1016/j.jcv.2018.09.001 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Genetic diversity of Hepatitis C Virus in Pakistan using Next Generation Sequencing

Sana Saleem1, Amjad Ali2, Bushra Khubaib1,3, Madiha Akram1,3, Zareen Fatima1,4, Muhammad Idrees1,5*

1 Division of Molecular Virology and Molecular Centre of Excellence in Molecular Biology (CEMB), University of the Punjab, 87-West Canal Bank Road, Thokar Niaz Baig, Lahore, Pakistan
2 Molecular Virology laboratory, Centre for Applied Molecular Biology (CAMB), University of the Punjab, 87-West Canal Bank Road, Thokar Niaz Baig, Lahore, Pakistan
3 Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan
4 Bioinformatics & Biotechnology, International Islamic University, Sector H-10, New Campus, Islamabad
5 Vice Chancellor, Hazara University Mansehra, Khyber Pakhtunkhwa, Pakistan

*Corresponding author: Centre of Excellence in Molecular Biology, University of the Punjab, 87-West Canal Bank Road, Thokar Niaz Baig, Lahore-53700, Pakistan; Tel: +92-42-5293141; Fax: +92-42-5293149; Email: MI: [email protected]

SS: [email protected]
AA: [email protected]
BK: [email protected]
MA: [email protected]
ZF: [email protected]
MI: [email protected]

Word count: 2137

Highlights
• The pyrosequencing approach is used to analyze complex viral genomes because it can detect minor variants.
• It is crucial to understand viral evolution and quasispecies diversity in complex viral strains.
• NGS was used to determine intra-host viral diversity of HCV in 13 chronically infected patients.
• NGS of E2 (HVR1), NS3 and NS5B of HCV-3a was performed for a comprehensive analysis of the viral population.
• Phylogenetic analysis of the studied genes revealed great variability within the Pakistani population.
• The average nucleotide diversity for the studied genes was 0.029, 0.011 and 0.010, respectively.
• Results indicate that patient-2 had more heterogeneity than other patients with the same genotype 3a.
• No significant difference was seen when the nucleotide variability of genotype 3a was compared with that of other genotypes.

Abstract

Background: In Pakistan, HCV disease is considered a major public health issue, with about 10-17 million people suffering from this infection, and the rate is increasing daily without hindrance. The currently available pyrosequencing approach is used to analyze complex viral genomes because it can detect minor variants. It is crucial to understand viral evolution and quasispecies diversity in complex viral strains.

Objectives: To assess genetic diversity in patients with HCV using Next Generation Sequencing (NGS) and to compare the nucleotide diversity of genotype 3a with that of other genotypes.

Study design: Intra-host viral diversity of HCV was determined using NGS in 13 chronically HCV-infected individuals. NGS of three different regions (E2 (HVR1), NS3 and NS5B) of HCV-3a allowed a comprehensive analysis of the viral population.

Result: Phylogenetic analysis of different HCV genes revealed great variability within the Pakistani population. The average nucleotide diversity for HVR1, NS3 and NS5B was 0.029, 0.011 and 0.010, respectively.

Conclusion: Our findings indicate that patient-2 had greater quasispecies heterogeneity than other patients with the same genotype 3a, as shown by phylogenetic and one-step network analyses. Initial phylogenetic analysis of the three genes suggested that genotype 3a samples have greater genetic diversity. However, no significant difference was found when the nucleotide variability of genotype 3a was compared with that of other genotypes (1a, 1b, 2a & 4a).

Keywords: HCV, 3a, HVR1, NS3, NS5B, Phylogenetic analysis, quasispecies, NGS

1. Background

Nearly 3% of the world's population is infected with hepatitis C virus (HCV), a leading cause of liver disease [1]. HCV belongs to the family Flaviviridae and is classified into seven major genotypes and many subtypes based on sequence variability [2, 3]. The most significant feature of HCV is that infection becomes chronic in nearly 50-80% of individuals [4]. There is no vaccine against HCV. The current standard of care is treatment with direct-acting antiviral agents (DAAs), which have shown high sustained virological response (SVR) rates [5].

Genetic heterogeneity is a characteristic of HCV. In each patient, HCV occurs as multiple variants, or quasispecies. The genetic variability is not uniformly distributed over the whole genome; the most variable region is hypervariable region 1 (HVR1) of the envelope E2 protein [6-9]. HCV diversity is thought to have major clinical consequences, as it may result in the production of immune escape mutants, which might affect disease severity and treatment response [10]. High sequence variation in HVR1 makes it a suitable model for quasispecies analysis [11].

The comprehensive analysis of minor HCV quasispecies variants is hindered by the lack of suitable approaches for detecting low-frequency genomes. Previously used methods were costly and time consuming [12]. With Next Generation Sequencing (NGS), it is now feasible to explore viral quasispecies in greater depth. The high output of NGS yields thousands of reads in each sequencing run, enabling detailed sequence analysis. This technology can identify variants at low frequencies, which might go unnoticed by conventional sequencing procedures [13]. However, reconstructing reliable viral quasispecies from the large amounts of data generated by NGS requires proper data analysis [14, 15].

In this study we performed ultradeep pyrosequencing to characterize the complexity and heterogeneity of hypervariable region 1 (HVR1) and nonstructural genes 3 (NS3) and 5B (NS5B) in individuals infected with HCV genotype 3a using p-distances. The distance between each pair of variants was calculated, and networks were generated for all variants such that every node represents a single variant and each link represents a single base change. To the best of our knowledge, this is the first study of HCV 3a quasispecies from Pakistan that sequences HVR1, NS3 and NS5B in depth using NGS to evaluate the diversity of HCV quasispecies within genotype 3a in the Pakistani population.

2. Objectives

To determine genetic diversity in patients with HCV using Next Generation Sequencing (NGS) and to compare the nucleotide diversity of genotype 3a with that of other genotypes.

3. Study design

We performed ultradeep pyrosequencing to characterize the heterogeneity of hypervariable region 1 (HVR1) and nonstructural genes 3 (NS3) and 5B (NS5B) in individuals infected with HCV genotype 3a using p-distances. The distance between each pair of variants was calculated, and networks were generated for all variants such that every node represents a single variant.

4. Materials and methods

4.1 Patient Samples

Sera were collected from 40 different HCV-infected individuals in 2013 from Punjab and Khyber Pakhtunkhwa and then analyzed. The protocol was approved by the Ethics Review Board of the National Centre of Excellence in Molecular Biology (CEMB), University of the Punjab, Lahore, Pakistan. An enzyme immunoassay (anti-HCV ELISA kit, Abbott, Germany) was used for serological testing. Genotyping of HCV samples was done at CEMB [16].

4.2 Isolation of RNA and PCR amplification

RNA was extracted using the Total Nucleic Acid extraction kit (Roche Applied Science, USA) and the MagNA Pure LC system (Roche Applied Science). cDNA was synthesized using the SuperScript VILO cDNA synthesis kit (Invitrogen, Carlsbad, CA). The reverse transcription parameters were 25°C for 10 min, 42°C for 90 min, and 85°C for 6 min. The HCV HVR1, NS3 and NS5B regions were amplified using a nested PCR protocol. The HCV primers were designed with the Primer3 online tool (http://bioinfo.ut.ee/primer3-0.4.0/).

Primer sequences are shown in Table 1. In the first round of amplification, cDNA served as template with PerfeCTa SYBR FastMix (Quanta BioSciences, Gaithersburg, MD) and gene-specific outer primers for the three regions. Amplification was performed on a LightCycler (Roche Applied Science, Indianapolis) under the following conditions: 95°C for 5 min, followed by 40 cycles of 95°C for 30 s, 50°C for 30 s, and 72°C for 50 s. 2 µl of the first-round PCR product was used as template in the nested PCR reaction.

4.3 Amplicon sequencing

Amplicons were purified using E-Gel SizeSelect 2% agarose gels on the E-Gel Power System (Life Technologies). Direct Sanger sequencing of the amplicons of all three genes was performed using BigDye v3.1 chemistry (Applied Biosystems) on an automated analyzer (3130xl, Applied Biosystems, Foster City, CA). Sequences were cleaned and analyzed using SeqMan and MegAlign (DNASTAR).

4.4 454 Pyrosequencing of the HVR1, NS3 and NS5B Regions

To analyze quasispecies in the three regions (HVR1, NS3 and NS5B) of HCV, each sample was amplified using fusion primers comprising the 454 primer key, a distinct multiplex identifier (MID) and HCV-specific primers, and sequenced with Roche/454 pyrosequencing technology. After purification, PCR products were quantified on the Agilent 2100 Bioanalyzer platform (Agilent Technologies, Inc., Waldbronn, BW, Germany). To analyze the pyrosequencing data, the SFFFILE tool (version 1.5.1) was used to process the raw sequence reads obtained from 454 sequencing [17]. Sequence reads of each sample were identified and sorted by their MIDs. Poor-quality and short sequences were removed from the analysis. The pyrosequencing data were then processed with the KEC error correction algorithm to recover high-quality HCV haplotypes from the reads, using the KEC software (http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm) [18].

4.5 Analysis

Unbiased estimates of nucleotide diversity were calculated according to Nei (1987) using the program ARLEQUIN (version 3.5) [19, 20]. Nucleotide frequency diversity and normalized frequency diversity were also calculated for the three genes. Neighbor-joining trees were generated based on p-distances. Phylogenetic analysis of NS3 was also performed for all patients together with known genotype 3a sequences from Pakistan [21]. Reference sequences taken from GenBank are shown in green; blue and red represent previously published sequences from Pakistan [21] and sequences from this study, respectively. All analyses were performed using statistical and bioinformatics approaches implemented in MATLAB (version 2010) [22] as previously described [17]. Nucleotide diversity of different genotypes was also determined according to Nei (1987) using the ARLEQUIN program [19]. We tested the null hypothesis of no difference in diversity among genotypes using analysis of variance (ANOVA). Data for other genotypes were taken from GenBank.
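As an illustration of the estimator named above, the following Python sketch computes Nei's unbiased nucleotide diversity, pi = n/(n-1) * sum over pairs i<j of 2 * x_i * x_j * d_ij, from aligned haplotypes and their read counts. It is a minimal sketch under stated assumptions and not the ARLEQUIN implementation: the function names and the toy haplotypes are hypothetical, sequences are assumed to be pre-aligned and of equal length, and d_ij is taken as the plain p-distance with gap characters treated like any other symbol.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nucleotide_diversity(haplotypes: dict[str, int]) -> float:
    """Nei's unbiased nucleotide diversity.

    haplotypes maps each distinct aligned sequence to its read count
    (hypothetical input layout chosen for this sketch).
    """
    n = sum(haplotypes.values())
    if n < 2:
        raise ValueError("at least two reads are required")
    total = 0.0
    for (h1, c1), (h2, c2) in combinations(haplotypes.items(), 2):
        # Each unordered pair contributes twice to the double sum.
        total += 2 * (c1 / n) * (c2 / n) * p_distance(h1, h2)
    # n / (n - 1) is the small-sample correction.
    return n / (n - 1) * total


# Toy data, invented for illustration only.
toy = {"ACGTACGT": 6, "ACGTACGA": 3, "ACCTACGA": 1}
print(round(nucleotide_diversity(toy), 4))
```

Working from haplotype counts rather than individual reads keeps the number of pairwise comparisons proportional to the square of the number of unique variants, which matters when a sample holds thousands of reads.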

4.6 Distance matrices

For each patient, a multiple sequence alignment was performed with MAFFT (version 7) [23], and the Hamming distance between each pair of variants was then calculated to form a distance matrix. The histogram of distances and a heatmap were built using MATLAB [22].
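The distance matrix described here can be sketched in a few lines of Python. This is an illustrative sketch only, not the MATLAB code used in the study: the function names and the toy variants are hypothetical, and the input is assumed to be already aligned (for example by MAFFT) so that all sequences have the same length.

```python
def hamming(a: str, b: str) -> int:
    """Number of aligned positions at which two variants differ."""
    if len(a) != len(b):
        raise ValueError("variants must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(variants: list[str]) -> list[list[int]]:
    """Symmetric matrix of pairwise Hamming distances."""
    n = len(variants)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = hamming(variants[i], variants[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy aligned variants, invented for illustration only.
aligned = ["ACGTAC", "ACGTAT", "ACCTAT", "TCCTAT"]
matrix = distance_matrix(aligned)
for row in matrix:
    print(row)
```

Dividing each entry by the alignment length gives the corresponding p-distance, and the upper triangle of the matrix supplies the values for a distance histogram.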


4.7 One Step Network

For the set of all distinct HCV variants found in each sample, we built a one-step network as previously described [24]. Using the distance matrix of each patient, a network was created in which each node is a quasispecies variant and two nodes are connected by a link if the Hamming distance between them is 1. As a single patient may have several disconnected sets (components), this work focused on large components containing > 5% of all reads. The networks were drawn with PAJEK (version 1.26) [25].
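The construction above can be illustrated with a short Python sketch that links variants at Hamming distance 1, extracts connected components by breadth-first search, and keeps those holding more than 5% of all reads. It is a sketch under stated assumptions rather than the published pipeline: the function names, the input layout (aligned variant mapped to read count) and the toy data are hypothetical.

```python
from collections import deque


def hamming(a: str, b: str) -> int:
    """Number of aligned positions at which two variants differ."""
    if len(a) != len(b):
        raise ValueError("variants must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def one_step_components(variants: dict[str, int]) -> list[set[str]]:
    """Connected components of the graph linking variants that differ at one site."""
    seqs = list(variants)
    neighbours: dict[str, list[str]] = {s: [] for s in seqs}
    for i, a in enumerate(seqs):
        for b in seqs[i + 1:]:
            if hamming(a, b) == 1:
                neighbours[a].append(b)
                neighbours[b].append(a)
    seen: set[str] = set()
    components: list[set[str]] = []
    for start in seqs:
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:  # breadth-first search from the start variant
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


def large_components(variants: dict[str, int],
                     min_fraction: float = 0.05) -> list[set[str]]:
    """Components whose variants together hold more than min_fraction of reads."""
    total = sum(variants.values())
    return [c for c in one_step_components(variants)
            if sum(variants[s] for s in c) / total > min_fraction]


# Toy read counts, invented for illustration only.
toy_reads = {"AAAA": 50, "AAAT": 30, "AATT": 15, "CCCC": 3, "GGGG": 2}
for comp in large_components(toy_reads):
    print(sorted(comp))
```

The all-pairs comparison is quadratic in the number of unique variants, which is acceptable for a sketch; samples with several thousand haplotypes would benefit from an indexed neighbour search.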

5. Results

5.1 HCV Genetic Heterogeneity

Of the 40 samples, 13 were successfully sequenced for all three HCV genes. All belonged to HCV genotype 3a. Phylogenetic analysis of the E2 (HVR1), NS3 and NS5B genes showed that sequences were genetically distinct among the 13 patients with genotype 3a (Fig 1). Among the 13 samples, Pak2 had the highest genetic diversity for all genes. HVR1 was more variable than NS3 and NS5B: the average nucleotide diversity for HVR1, NS3 and NS5B was 0.029, 0.011 and 0.010, respectively (Table 3). HCV isolates for NS3 were intermixed with the reference sequences obtained from two different sources (Fig 2).

5.2 Intra and inter-Host HCV Diversity

Quasispecies analysis of HCV isolates was performed using deep pyrosequencing of the three regions. The average number of reads for HVR1, NS3 and NS5B was 6910, 1172 and 1526, respectively. Phylogenetic trees showed that viral variants from the same patient clustered together, being more closely related to each other than to variants from any other patient. There was no intermixing of HCV variants among patients, indicating that the infections were not related through transmission. The level of intra-host diversity varied among patients (Table 3).

5.3 One step Network

The Hamming distance between each pair of variants was calculated, and networks were generated for all variants such that every node represents a single variant and two nodes are connected by a link if the Hamming distance between them is 1 (Fig 3). The length of the longest path in the one-step network was calculated (Table 4).

5.4 Comparison of genotype 3a with other genotypes

Initial assessment of the phylogenetic trees of genotype 3a (Fig 1) suggested that samples of this genotype are more genetically diverse than other HCV subtypes. We therefore compared the nucleotide diversity of genotype 3a with that of other genotypes (1a, 1b, 2a & 4a) (Table 2). No significant difference was found.
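The genotype comparison rests on a one-way ANOVA of per-sample nucleotide diversity. The sketch below shows how the F statistic for such a test is formed from between-group and within-group sums of squares. It is illustrative only: the function name is hypothetical, the diversity values are invented and are not the study data, and converting F to a p-value would additionally require the F distribution (available, for example, through `scipy.stats.f_oneway`).

```python
def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented per-sample diversity values for three hypothetical genotype groups.
example = [
    [0.018, 0.020, 0.022],
    [0.019, 0.021, 0.023],
    [0.020, 0.022, 0.024],
]
f_stat, df_b, df_w = one_way_anova_f(example)
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

A small F relative to the critical value for the given degrees of freedom means the null hypothesis of equal mean diversity across genotypes is not rejected.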

6. Discussion

In Pakistan, HCV infection is a major health concern, as more than 17 million individuals are suffering from this disease [26]. An important feature of HCV is its genetic variability [27] and rapid rate of replication during its lifespan [28]. As a result, in each individual HCV exists as a population of heterogeneous quasispecies with a high degree of variability. A group of viruses within a population that are linked to each other through similar mutations is described as a quasispecies [29]. NGS is an advanced technique for analyzing intra-host quasispecies and drug resistance [24].

To the best of our knowledge, this is the first study to use NGS, specifically ultra-deep sequencing (454/Roche), to sequence and analyze Pakistani HCV 3a quasispecies in HVR1, NS3 and NS5B. We used it to determine the complexity and diversity of HCV 3a within the Pakistani population. All individuals in this study were infected with genotype 3a and had no history of drinking alcohol. Our study is consistent with the previous finding that HCV genotype 3a is the major circulating genotype in Pakistan [30].

We performed deep sequencing of different regions at high coverage. We investigated the level of genetic heterogeneity within Pakistani isolates using different statistical analyses. For HVR1, a quasispecies sample contained on average approximately 6910 reads with 1508 unique sequences; the population frequency of the major variant was 22%, the nucleotide diversity was 0.0290, and 96 of 303 nucleotide positions were polymorphic. Phylogenetic analysis showed that each patient's sequences clustered together, indicating that a patient's variants were more closely related to each other than to other patients' variants (Fig 1). This indicates that the infections were not related through transmission.

Phylogenetic trees of these three genes suggested that samples of this genotype are genetically more diverse than those of other genotypes (Fig 1). To determine whether this was true, we examined whether other HCV genotypes had differing nucleotide variability. Sequencing data for other genotypes (1a, 1b, 2a, and 4a) were taken from GenBank. It must be noted that our study was not an exhaustive analysis of all genotypes or subtypes; we did not compare the diversity of genotype 3a with other, less common subtypes because NGS data for them are scarce. We had greater numbers of genotype 3a sequences than of other genotypes (ANOVA, p < 0.05). Samples with a higher total number of reads also showed a higher number of unique sequences, with a correlation of 0.763 between the two. For this reason, the large number of unique genotype 3a sequences in the phylogenetic trees creates the illusion of greater diversity. However, the number of reads obtained from NGS depends mainly on the level of complementarity between primers and template, which can differ among genotypes. Nucleotide variability was not associated with sample size (correlation with number of reads, 0.176). No significant difference was found between the nucleotide diversity of genotype 3a and that of the other genotypes (Table 2).

The NS3 and NS5B samples had 1172 and 1526 total reads, respectively, with 416 and 276 unique sequences. The population frequency of the major variant was 19% and 26%, with nucleotide diversity of 0.011 and 0.010, respectively. The nucleotide diversity of NS3 and NS5B was lower than that of HVR1. The phylogenetic tree for NS3 showed that the NS3 samples were evenly intermixed with one another and with the reference sequences (Fig 2). No cluster related to a specific region was observed in the phylogenetic tree.

HCV variability constantly changes throughout chronic infection, resulting in a complex network of intra-host viral subpopulations [31]. Linking quasispecies variants that differ by one mutation gives a natural framework for organizing the diverse population of variants [24]. Although this framework was derived through theoretical analysis, we constructed one-step networks using our viral sequence data. It offers a simple method for assessing the genetic relatedness of viral populations [32]. The one-step networks for each patient showed that the viral population circulating in Pakistan was complex and heterogeneous.

Authors' contributions

SS, AA & BK performed bench work and drafted the manuscript. MA & ZF analyzed the data. AA & MI critically reviewed the manuscript. All the authors read and approved the final manuscript.

Acknowledgments

Thanks to Dr Yury Khudyakov for excellent facilities at the Centers for Disease Control and Prevention, USA. We are also grateful to Dr Gilberto Vaughan, Zoya Dimitrova, Mike A Purdy and David S Campo from the Centers for Disease Control and Prevention, USA, for their help in this study.

References

1. B.D. Lindenbach, C.M. Rice, Molecular biology of flaviviruses, Adv. Virus Res. 59 (2003) 23–61.
2. K.M. Hanafiah, J. Groeger, A.D. Flaxman, S.T. Wiersma, Global epidemiology of hepatitis C virus infection, new estimates of age specific antibody to HCV seroprevalence, Hepatology. 57 (2012) 1333-42.
3. S.M. Lemon, C.M. Walker, M.J. Alter, M. Yi, Hepatitis C Virus, In: D.M. Knipe, P.M. Howley (Eds.), Fields Virology, Lippincott Williams & Wilkins, Philadelphia, PA, (2007) pp. 1253-1304.
4. P. Simmonds, Genetic diversity and evolution of hepatitis C virus – 15 years on, J. Gen. Virol. 85 (2004) 3173-3188.
5. D. Schuppan, A. Krebs, M. Bauer, E.G. Hahn, Hepatitis C and liver fibrosis, Cell Death Differ. 10 (2003) 59–67.
6. D. Hunt, P. Pockros, What are the promising new therapies in the field of chronic hepatitis C after the first-generation direct-acting antivirals? Current Gastroenterology Reports. 15 (2013) 303.
7. M. Martell, J.I. Esteban, J. Quer, J. Genesca, Hepatitis C virus (HCV) circulates as a population of different but closely related genomes: quasispecies nature of HCV genome distribution, Journal of Virology. 66 (1992) 3225–3229.
8. P. Moreno, M. Alvarez, L. López, G. Moratorio, Evidence of recombination in Hepatitis C Virus populations infecting a hemophiliac patient, Virology Journal. 6 (2009) 203.
9. J.M. Cuevas, M. Torres-Puente, N. Jiménez-Hernández, M.A. Bracho, Refined analysis of genetic variability parameters in hepatitis C virus and the ability to predict antiviral treatment response, Journal of Viral Hepatitis. 15 (2008) 578–590.
10. E.A. Duarte, I.S. Novella, S.C. Weaver, E. Domingo et al., RNA virus quasispecies: significance for viral disease and epidemiology, Infectious Agents and Disease. 3 (1994) 201–214.
11. M. Sala, S. Wain-Hobson, Are RNA viruses adapting or merely changing? J. Mol. Evol. 51 (2000) 12-20.
12. T. Laskus, J. Wilkinson, J.F. Gallegos-Orozco, M. Radkowski et al., Analysis of hepatitis C virus quasispecies transmission and evolution in patients infected through blood transfusion, Gastroenterology. 127 (2004) 764–776.
13. L. Barzon, E. Lavezzo, V. Militello, S. Toppo et al., Applications of next-generation sequencing technologies to diagnostic virology, Int. J. of Mol. Sciences. 12 (2011) 7861–7884.
14. N. Beerenwinkel, H.F. Günthard, V. Roth, K.J. Metzner, Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data, Frontiers in Microbiology. 3 (2012) 329.
15. N. Beerenwinkel, Ultra-deep sequencing for the analysis of viral populations, Current Opinion in Virology. 1 (2011) 413–418.
16. M. Idrees, S. Riazuddin, Frequency distribution of hepatitis C virus genotypes in different geographical regions of Pakistan and their possible routes of transmission, BMC Infect. Dis. 8 (2008) 69.
17. J.C. Forbi, J.E. Layden, R.O. Phillips, N. Mora et al., Next-generation sequencing reveals frequent opportunities for exposure to hepatitis C virus in Ghana, PLoS one. 18 (2015) 12.
18. P. Skums, Z. Dimitrova, D.S. Campo, G. Vaughan et al., Efficient error correction for next-generation sequencing of viral amplicons, BMC Bioinformatics. 13 (2012) S6.
19. M. Nei, Molecular Evolutionary Genetics, Columbia University Press, New York, (1987).
20. S. Schneider, D. Roessli, L. Excoffier, Arlequin: software for population genetics data analysis. User manual ver 2.000. Genetics and Biometry Lab, Dept. of Anthropology, University of Geneva, Geneva, (2000).
21. I.ur. Rehman, G. Vaughan, M.A. Purdy, G.L. Xia et al., Genetic history of hepatitis C virus in Pakistan, Infect. Genet. Evol. 27 (2014) 318-24.
22. Mathworks (2010). Matlab. Natick, MA.
23. K. Katoh, D.M. Standley, MAFFT multiple sequence alignment software version 7: improvement in performance and usability, Mol. Biol. Evol. 30 (2013) 772-780.
24. D.S. Campo, Z. Dimitrova, L. Yamasaki, P. Skums et al., Next-generation sequencing reveals large connected networks of intra-host HCV variants, BMC Genomics. 15 (2014) S4.
25. V. Batagelj, A. Mrvar, Pajek - Analysis and Visualization of Large Networks, In: M. Juenger, P. Mutzel (Eds.), Graph Drawing Software, Springer, Berlin, (2003) 77-103.
26. M. Idrees, A. Lal, M. Naseem, M. Khalid, High prevalence of hepatitis C virus infection in the largest province of Pakistan, J. Dig. Dis. 9 (2008) 95-103.
27. D.A. Steinhauer, E. Domingo, J.J. Holland, Lack of evidence for proofreading mechanisms associated with an RNA virus polymerase, Gene. 122 (1992) 281–288.
28. A. Neumann, N. Lam, H. Dahari et al., Hepatitis C viral dynamics in vivo and the antiviral efficacy of interferon-alpha therapy, Science. 282 (1998) 103–107.
29. J. Pawlotsky, Hepatitis C virus population dynamics during infection, Curr. Top Microbiol. Immunol. 299 (2006) 261–284.
30. S. Butt, M. Idrees, H. Akbar, I. Rehman et al., The changing epidemiology pattern and frequency distribution of hepatitis C virus in Pakistan, Infec. Gen. Evol. 10 (2010) 595–600.
31. S. Ramachandran, D. Campo, Z. Dimitrova, G. Xia et al., Temporal Variations in the Hepatitis C Virus Intra-Host Population During Chronic Infection, J virol. 85 (2011) 6369-6380.
32. H. Li, M.B. Stoddard, S. Wang, L.M. Blair et al., Elucidation of hepatitis C virus transmission and early diversification by single genome sequencing, PLoS Pathog. 8 (2012) 8.

Figure captions


Figure 1: Phylogenetic trees based on analyses of HVR1 (A), NS3 (B) and NS5B (C) showing different quasispecies of Pak isolates. Only unique sequences are included. There is no intermixing of HCV variants among patients.


Figure 2. Inter-host diversity: Phylogenetic tree based on NS3 sequences of HCV 3a Pakistani isolates. Green indicates the reference sequences downloaded from GenBank; blue and red represent previously published sequences from Pakistan and sequences included in this study, respectively.


Figure 3. One-step components of a single patient. Panel A is a histogram of all p-distances found within the largest one-step network of the patient. Panel B is a heatmap of the distance matrix among all sequences belonging to that patient. Panel C is the largest one-step component of patient Pak 2, where each node is a variant and two nodes are connected by a link if the Hamming distance between them is 1.


Table

Table 1: Sequence of primers

Primer name   Sequences 5' ------> 3'
HVR1-OS       TGGCTTGGGATATGATGATGAACT
HVR1-OAS      GCAGTCCTGTTGATGTGCCA
HVR1-IS       GGATATGATGATGAACTGGT
HVR1-IAS      ATGTGCCAGCTGCCGTTGGTG
NS3-OS        GCAAACTAGGGGCCTTCTTGGGAC
NS3-OAS       GGAGGAGTTGAATTGTCAGAGAAAGAT
NS3-IS        GGGGCCTTCTTGGGACTATTGTGAC
NS3-IAS       AGTTGAATTGTCAGAGAAAGATGGAGACCT
NS5b-OS       TGAAGATGTGGACCTCAAAGAAAACCC
NS5b-OAS      AGCATCTCCGGGTGGAGCAGA
NS5b-IS       CAAGAAAACCCCCTTGGGGTTCTC
NS5b-IAS      GGTCATAGCCTCCGTGAAGGCTCTC

Table 2: Nucleotide diversity of genotype 3a with respect to other genotypes

Genotypes   n    Seqnum   Pop_size   Nucleotide_diversity   Nuc_div_Standard_error_Mean
1a          63   728.8    4370       0.020                  0.002
1b          38   853.3    4078       0.021                  0.003
2a          56   581.9    2126       0.020                  0.002
3a          35   1906.7   9212       0.019                  0.002
4a          2    668.5    5042       0.021                  0.013

*n, Number of samples; Seqnum, Average number of unique haplotypes; Pop_size, all reads; Nuc_div_Standard_error_Mean, Nucleotide diversity standard error of the mean.

Table 3: Intra patient Distance mean and Standard deviation analysis of HVR1, NS3 and NS5B

          HVR1                               NS3                                NS5B
Patient   Mean intra dist   Std intra dist   Mean intra dist   Std intra dist   Mean intra dist   Std intra dist
PAK1      0.013             0.004            0.008             0.003            0.007             0.002
PAK2      0.056             0.028            0.020             0.007            0.016             0.006
PAK3      0.011             0.006            0.009             0.003            0.008             0.003
PAK4      0.015             0.005            0.007             0.004            0.009             0.003
PAK6      0.038             0.017            0.016             0.005            0.010             0.003
PAK8      0.031             0.010            0.013             0.004            0.007             0.002
PAK9      0.040             0.021            0.011             0.003            0.010             0.007
PAK11     0.020             0.006            0.006             0.002            0.006             0.002
PAK12     0.036             0.025            0.005             0.001            0.012             0.004
PAK13     0.016             0.004            0.011             0.003            0.007             0.002
PAK14     0.021             0.015            0.013             0.005            0.014             0.006
PAK18     0.034             0.017            0.016             0.005            0.005             0.002
PAK19     0.050             0.034            0.013             0.006            0.010             0.004
Mean      0.029                              0.011                              0.010

Table 4: One step longest path analysis of big component of HVR1, NS3 and NS5B

Patient   Seqnum   Total_freq   1-step components   Max length path of big component
HVR1
PAK11     471      8744         15                  11
PAK12     2362     16959        179                 17
PAK13     1116     7078         44                  15
PAK14     1645     8378         214                 16
PAK18     163      1402         41                  7
PAK19     47       68           35                  3
PAK1      450      8681         29                  8
PAK2      5524     18133        1024                30
PAK3      12       43           6                   4
PAK4      495      4393         25                  9
PAK6      1255     3179         214                 14
PAK8      4506     8231         1235                23
PAK9      2014     4552         414                 13

358
15
170
14
PAK2
594
1210
PAK2
NS3 777 NS5B 4680

*Seqnum, Unique haplotypes; Total_freq, all reads; Max length path of big component, Maximum length path of big component.