Accepted Manuscript
Title: Genetic diversity of Hepatitis C Virus in Pakistan using Next Generation Sequencing
Authors: Sana Saleem, Amjad Ali, Bushra Khubaib, Madiha Akram, Zareen Fatima, Muhammad Idrees
PII: S1386-6532(18)30221-X
DOI: https://doi.org/10.1016/j.jcv.2018.09.001
Reference: JCV 4049
To appear in: Journal of Clinical Virology
Received date: 3-4-2018
Revised date: 14-8-2018
Accepted date: 7-9-2018
Please cite this article as: Saleem S, Ali A, Khubaib B, Akram M, Fatima Z, Idrees M, Genetic diversity of Hepatitis C Virus in Pakistan using Next Generation Sequencing, Journal of Clinical Virology (2018), https://doi.org/10.1016/j.jcv.2018.09.001 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Genetic diversity of Hepatitis C Virus in Pakistan using Next Generation Sequencing
Sana Saleem1, Amjad Ali2, Bushra Khubaib1,3, Madiha Akram1,3, Zareen Fatima1,4, Muhammad Idrees1,5*
1 Division of Molecular Virology and Molecular Centre of Excellence in Molecular Biology, (CEMB), University of the Punjab, Lahore 87-West Canal Bank Road Thokar Niaz Baig, Lahore, Pakistan
2 Molecular Virology laboratory, Centre for Applied Molecular Biology (CAMB) University of the Punjab, Lahore 87-West Canal Bank Road Thokar Niaz Baig, Lahore, Pakistan
3 Department of Biotechnology, Lahore College for Women University, Lahore, Pakistan
4 Bioinformatics & Biotechnology, International Islamic University, Sector H-10, New Campus, Islamabad
5 Vice Chancellor Hazara University Mansehra, Khyber Pakhtunkhwa, Pakistan
*Corresponding author: Centre of Excellence in Molecular Biology, University of the Punjab, 87-West Canal Bank Road, Thokar Niaz Baig, Lahore-53700, Pakistan; Tel: +92-42-5293141; Fax: +92-42-5293149; Email: MI: [email protected]
SS: [email protected]
AA: [email protected]
BK: [email protected]
MA: [email protected]
ZF: [email protected]
MI: [email protected]
Word count: 2137
Highlights
A pyrosequencing approach was used to analyze complex viral genomes because it can detect minor variants.
It is crucial to understand viral evolution and quasispecies diversity in complex viral strains.
NGS was used to determine the intra-host viral diversity of HCV in 13 chronically infected patients.
NGS of E2 (HVR1), NS3 and NS5B of HCV-3a was performed for a comprehensive analysis of the viral population.
Phylogenetic analysis of the studied genes revealed great variability within the Pakistani population.
The average nucleotide diversity for HVR1, NS3 and NS5B was 0.029, 0.011 and 0.010, respectively.
Results indicate that patient-2 had more heterogeneity than other patients with the same genotype 3a.
No significant difference was seen when the nucleotide variability of genotype 3a was compared with that of other genotypes.
Abstract
Background: In Pakistan, HCV disease is considered a major public health issue, with about 10-17 million people suffering from this infection, and the rate is increasing every day without any hindrance. The currently available pyrosequencing approach is used to analyze complex viral genomes because it can detect minor variants. It is crucial to understand viral evolution and quasispecies diversity in complex viral strains.
Objectives: To assess genetic diversity in patients with HCV using Next Generation Sequencing (NGS) and to compare the nucleotide diversity of genotype 3a with that of other genotypes.
Study design: Intra-host viral diversity of HCV was determined using NGS in 13 chronically HCV-infected individuals. NGS of three different regions (E2 (HVR1), NS3 and NS5B) of HCV-3a allowed for a comprehensive analysis of the viral population.
Results: Phylogenetic analysis of different HCV genes revealed great variability within the Pakistani population. The average nucleotide diversity for HVR1, NS3 and NS5B was 0.029, 0.011 and 0.010, respectively.
Conclusion: Our findings, based on phylogenetic and one-step network analyses, clearly indicate that patient-2 had greater quasispecies heterogeneity than other patients with the same genotype 3a. Initially, phylogenetic analysis of these three genes suggested that genotype 3a samples have greater genetic diversity. However, no significant difference was found when the nucleotide variability of genotype 3a was compared with that of other genotypes (1a, 1b, 2a and 4a).
Keywords: HCV, 3a, HVR1, NS3, NS5B, Phylogenetic analysis, quasispecies, NGS
1. Background
Nearly 3% of the world's population is infected with Hepatitis C Virus (HCV), a leading cause of liver disease [1]. HCV belongs to the family Flaviviridae and is classified into seven major genotypes and many subtypes based on sequence variability [2, 3]. The most significant feature of HCV is that infection becomes chronic in nearly 50-80% of individuals [4]. There is no vaccine against HCV. The current standard of care is treatment with direct-acting antiviral agents (DAAs), which have shown high sustained virological response (SVR) rates [5].
Genetic heterogeneity is a characteristic of HCV. In each patient, HCV occurs as multiple variants, or quasispecies. The genetic variability is not uniformly distributed over the whole genome; the most variable region is HVR1 of the envelope E2 protein [6-9]. HCV diversity is thought to have major clinical consequences, as it may result in the production of immune escape mutants, which might affect disease severity and treatment response [10]. The high sequence variation in HVR1 makes it a suitable model for quasispecies analysis [11].
The comprehensive analysis of minor HCV quasispecies variants has been hindered by the lack of suitable approaches for finding low-frequency genomes. Previously used methods were costly and time consuming [12]. With Next Generation Sequencing (NGS), it is now feasible to explore viral quasispecies in greater detail. The high output of NGS yields thousands of reads in each sequencing run, enabling detailed sequence analysis. This technology can identify variants at low frequencies, which might go unnoticed by regular sequencing procedures [13]. However, reconstructing consistent viral quasispecies from the large amounts of data generated by NGS requires appropriate data analysis [14, 15].
In this study we performed ultradeep pyrosequencing to characterize the complexity and heterogeneity of hypervariable region 1 (HVR1) and nonstructural genes 3 (NS3) and 5B (NS5B) in individuals infected with HCV genotype 3a using p-distances. The distance between each pair of variants was calculated and networks were generated for all variants such that every node represents a single variant and each link in the network represents a single base change. To the best of our knowledge, this is the first study of HCV 3a quasispecies from Pakistan to use NGS for in-depth sequencing of HVR1, NS3 and NS5B in order to evaluate the diversity of HCV quasispecies within genotype 3a in the Pakistani population.
2. Objectives
To determine genetic diversity in patients with HCV using Next Generation Sequencing (NGS) and to compare the nucleotide diversity of genotype 3a with that of other genotypes.
3. Study design
We performed ultra-deep pyrosequencing to characterize the heterogeneity of hypervariable region 1 (HVR1) and nonstructural genes 3 (NS3) and 5B (NS5B) in individuals infected with HCV genotype 3a using p-distances. The distance between each pair of variants was calculated and networks were generated for all variants such that every node represents a single variant.
4. Materials and methods
4.1 Patient Samples
Sera were collected in 2013 from 40 different HCV-infected individuals from Punjab and Khyber Pakhtunkhwa and then analyzed. The protocol was approved by the Ethics Review Board of the National Centre of Excellence in Molecular Biology (CEMB), University of the Punjab, Lahore, Pakistan. An enzyme immunoassay protocol (anti-HCV positive ELISA kit, Abbott, Germany) was used for serological testing. Genotyping of HCV samples was done at CEMB [16].
4.2 Isolation of RNA and PCR amplification
RNA was extracted using the total Nucleic Acid extraction kit (Roche Applied Science, USA) and the MagNA Pure LC system (Roche Applied Science). cDNA was synthesized using the SuperScript VILO cDNA synthesis kit (Invitrogen, Carlsbad, CA). The reverse transcription parameters were 25°C for 10 min, 42°C for 90 min and 85°C for 6 min. Amplification of the HCV HVR1, NS3 and NS5B regions was performed using a nested PCR protocol. The HCV primers were designed with the Primer3 online tool (http://bioinfo.ut.ee/primer3-0.4.0/).
Primer sequences are shown in Table 1. In the first round of amplification, cDNA served as the template using PerfeCTa SYBR FastMix (Quanta BioSciences, Gaithersburg, MD) with gene-specific outer primers for the three regions. Amplification was done on a LightCycler (Roche Applied Science, Indianapolis) under the following conditions: 95°C for 5 min, followed by 40 cycles of 95°C for 30 s, 50°C for 30 s and 72°C for 50 s. A 2 µl aliquot of the first-round PCR product was used as the template in the nested PCR reaction.
4.3 Amplicon sequencing
Amplicons obtained from PCR were purified on E-Gel SizeSelect 2% agarose gels using the E-Gel Power System (Life Technologies). Direct Sanger sequencing of amplicons was performed for all PCR products of the three genes using BigDye v3.1 chemistry (Applied Biosystems) on an automated analyzer (3130xl, Applied Biosystems, Foster City, CA). Sequences were cleaned and analyzed using SeqMan and MegAlign (DNASTAR).
4.4 454 Pyrosequencing of the HVR1, NS3 and NS5B Regions
To analyze the quasispecies of these three regions (HVR1, NS3 and NS5B) of HCV, each sample was amplified using fusion primers comprising the 454 primer key, a distinct multiplex identifier (MID) and HCV-specific primers, and sequenced using Roche/454 pyrosequencing technology. After purification, PCR products were quantified using the Agilent 2100 Bioanalyzer platform (Agilent Technologies, Inc., Waldbronn, BW, Germany). For analysis of the pyrosequencing data, the SFFFILE tool (version 1.5.1) was used to process the raw sequence reads obtained from 454 sequencing [17]. Sequence reads of every sample were identified and sorted with the help of the MIDs. Poor-quality and short sequences were removed from the analysis. The pyrosequencing data were then processed with the KEC error correction algorithm to recover high-quality HCV haplotypes from the reads (http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm) [18].
4.5 Analysis
Unbiased estimates of nucleotide diversity were calculated according to Nei (1987) using the program ARLEQUIN (version 3.5) [19, 20]. Nucleotide frequency diversity and normalized frequency diversity were also calculated for these three genes. Neighbor-joining trees were generated based on p-distances. Phylogenetic analysis of NS3 was also performed for all patients together with known genotype 3a sequences from Pakistan [21]. Reference sequences taken from GenBank are shown in green, previously published sequences from Pakistan [21] in blue, and sequences from this study in red. All analyses were carried out using statistical and bioinformatics approaches implemented in MATLAB (version 2010) [22], as previously described [17]. Nucleotide diversity of different genotypes was also determined according to Nei (1987) using the ARLEQUIN program [19]. We tested the null hypothesis of no difference in diversity among genotypes using analysis of variance (ANOVA). Data for other genotypes were taken from GenBank.
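To make the diversity estimate concrete, the following minimal Python sketch computes nucleotide diversity in the sense of Nei (1987) as the frequency-weighted mean p-distance between haplotypes, with the n/(n-1) small-sample correction. It is an illustration only and not the ARLEQUIN implementation; the function names and the toy haplotype counts are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nucleotide_diversity(haplotypes):
    """Unbiased nucleotide diversity (assumed form of Nei's estimator).

    haplotypes: dict mapping an aligned haplotype sequence to its read count.
    pi = n / (n - 1) * sum over pairs i < j of 2 * x_i * x_j * d_ij,
    where x_i is the haplotype frequency and d_ij the p-distance.
    """
    n = sum(haplotypes.values())
    if n < 2:
        raise ValueError("at least two reads are required")
    total = 0.0
    for (seq_i, count_i), (seq_j, count_j) in combinations(haplotypes.items(), 2):
        total += 2 * (count_i / n) * (count_j / n) * p_distance(seq_i, seq_j)
    return n / (n - 1) * total


if __name__ == "__main__":
    # Toy data (hypothetical): two haplotypes differing at one of four sites.
    toy = {"ACGT": 3, "ACGA": 1}
    print(nucleotide_diversity(toy))
```

With the correction factor, the estimate equals the mean p-distance over all pairs of reads, which is why haplotype read counts rather than only unique sequences enter the calculation.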
4.6 Distance matrices
For each patient, a multiple sequence alignment was performed with MAFFT (version 7) [23], and then the Hamming distance between each pair of variants was calculated to form a distance matrix. The histogram of distances and a heatmap were built using MATLAB [22].
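The distance-matrix step can be sketched as follows. This is a minimal Python illustration, not the MATLAB code used in the study; it assumes the variants are already aligned to equal length (for example by MAFFT) and, as a simplifying assumption, treats a gap character as just another symbol. The function names are hypothetical.

```python
from collections import Counter


def hamming(a, b):
    """Number of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(variants):
    """Symmetric matrix of pairwise Hamming distances for a list of variants."""
    n = len(variants)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = hamming(variants[i], variants[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


def distance_histogram(matrix):
    """Count how often each distance occurs among distinct pairs (upper triangle)."""
    n = len(matrix)
    return Counter(matrix[i][j] for i in range(n) for j in range(i + 1, n))


if __name__ == "__main__":
    # Toy aligned variants (hypothetical).
    variants = ["ACGT", "ACGA", "TCGA"]
    m = distance_matrix(variants)
    print(m)
    print(distance_histogram(m))
```

Dividing each entry by the alignment length converts the Hamming distances into p-distances.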
4.7 One Step Network
For the set of all distinct HCV variants found in each sample, we built a one-step network as previously described [24]. Using each patient's distance matrix, a network was created in which each node is a quasispecies variant and two nodes are connected by a link if the Hamming distance between them is 1. As a single patient may have several disconnected sets (components), this work focused on large components containing > 5% of all reads. The networks were drawn with PAJEK (version 1.26) [25].
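The network construction described above can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the published pipeline: variants are given as a dictionary of aligned sequence to read count, components are found by breadth-first search, and the 5% threshold is applied to the share of reads in each component. All function and parameter names are hypothetical.

```python
from collections import deque


def hamming(a, b):
    """Number of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))


def one_step_network(variants):
    """Adjacency sets: two variants are linked when their Hamming distance is 1."""
    seqs = list(variants)
    adjacency = {s: set() for s in seqs}
    for i, a in enumerate(seqs):
        for b in seqs[i + 1:]:
            if hamming(a, b) == 1:
                adjacency[a].add(b)
                adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Split the network into connected components using breadth-first search."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(component)
    return components


def large_components(variants, min_fraction=0.05):
    """Keep components whose reads exceed min_fraction of all reads."""
    total_reads = sum(variants.values())
    adjacency = one_step_network(variants)
    kept = []
    for component in connected_components(adjacency):
        reads = sum(variants[s] for s in component)
        if reads / total_reads > min_fraction:
            kept.append(sorted(component))
    return kept


if __name__ == "__main__":
    # Toy variant counts (hypothetical): one chain of three variants, one outlier.
    toy = {"AAAA": 50, "AAAT": 30, "AATT": 17, "GGGG": 3}
    print(large_components(toy))
```

The pairwise comparison is quadratic in the number of variants, which is acceptable for a sketch but may need indexing for samples with thousands of unique sequences.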
5. Results
5.1 HCV Genetic Heterogeneity
Of the 40 samples, 13 were successfully sequenced for the three HCV genes. All belonged to HCV genotype 3a. Phylogenetic analysis of the E2 (HVR1), NS3 and NS5B genes showed that sequences were genetically distinct among the 13 patients with genotype 3a (Fig 1). Among the 13 samples, the Pak2 sample had the highest genetic diversity for all genes. We found that HVR1 is more variable than NS3 and NS5B. The average nucleotide diversity for HVR1, NS3 and NS5B was 0.029, 0.011 and 0.010, respectively (Table 3). HCV isolates for NS3 were intermixed with the reference sequences obtained from two different sources (Fig 2).
5.2 Intra and inter-Host HCV Diversity
Quasispecies analysis of HCV isolates was performed using deep pyrosequencing of three regions. The average number of reads for HVR1, NS3 and NS5B was 6910, 1172 and 1526, respectively. Phylogenetic trees showed that viral isolates from the same patient clustered together, such that they were more closely related to each other than to isolates from any other patient. There was no intermixing of HCV variants among patients, which indicates that the patients were not linked through transmission. The level of intra-host diversity varied among patients (Table 3).
5.3 One step Network
The Hamming distance between each pair of variants was calculated and networks were generated for all variants such that every node represents a single variant and two nodes are connected by a link if the Hamming distance between them is 1 (Fig 3). The length of the longest path in the one-step network was calculated for each patient (Table 4).
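The text does not define how the longest path was computed. One possible reading, taken here purely as an assumption, is the longest shortest path (the diameter) of the largest one-step component. The Python sketch below computes that quantity by breadth-first search from every node; the adjacency structure and function names are hypothetical, and the input is assumed to be a single connected component.

```python
from collections import deque


def eccentricity(adjacency, source):
    """Largest shortest-path distance (in links) from source to any reachable node."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return max(distance.values())


def longest_shortest_path(adjacency):
    """Diameter of a connected network: the maximum eccentricity over all nodes."""
    return max(eccentricity(adjacency, node) for node in adjacency)


if __name__ == "__main__":
    # Toy component (hypothetical): a chain of four variants a-b-c-d.
    chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
    print(longest_shortest_path(chain))
```

Under this reading, a larger value means that the most distant variants in the component are separated by more single-base steps.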
5.4 Comparison of genotype 3a with other genotypes
Initial assessment of the phylogenetic trees of genotype 3a (Fig 1) suggested that samples of this genotype are more genetically diverse than other HCV subtypes. We analyzed and compared the nucleotide diversity of other genotypes (1a, 1b, 2a and 4a) with that of genotype 3a (Table 2). No significant difference was found.
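The comparison among genotypes relies on one-way ANOVA. The following Python sketch computes only the F statistic and its degrees of freedom from groups of per-sample nucleotide diversity values; the p-value would then be read from the F distribution, which is omitted here. The function name is hypothetical and the numbers in the usage example are made-up placeholders, not data from Table 2.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of observations per group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more observations than groups")
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


if __name__ == "__main__":
    # Placeholder values only; these are not measurements from the study.
    toy_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
    print(one_way_anova_f(toy_groups))
```

A small F relative to the critical value for the given degrees of freedom corresponds to failing to reject the null hypothesis of equal mean diversity.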
6. Discussion
In Pakistan, HCV infection is a major health concern, as more than 17 million individuals are suffering from this disease [26]. Important features of HCV are its genetic variability [27] and its rapid rate of replication during its lifespan [28]. As a result, in each individual HCV exists as a population of heterogeneous quasispecies with a high degree of variability. A group of viruses within a population that are linked to each other through similar mutations is described as a quasispecies [29]. NGS is an advanced technique for analyzing intra-host quasispecies and drug resistance [24].
To the best of our knowledge, this is the first study to sequence and analyze Pakistani HCV 3a quasispecies from HVR1, NS3 and NS5B using ultra-deep sequencing (454/Roche). In this study we determined the complexity and diversity of HCV 3a within the Pakistani population. All individuals in this study were infected with genotype 3a and had no history of drinking alcohol. Our study is consistent with the previous finding that HCV genotype 3a is the major circulating genotype in Pakistan [30].
We performed deep sequencing of different regions at high coverage and investigated the level of genetic heterogeneity within Pakistani isolates using different statistical analyses. For HVR1, a quasispecies sample contained approximately 6910 reads with 1508 unique sequences; the population frequency of the major variant was 22%, the nucleotide diversity was 0.0290, and 96 out of 303 nucleotide positions were polymorphic. Phylogenetic analysis showed that the sequences of each patient clustered together, indicating that a patient's variants were more closely related to each other than to those of other patients (Fig 1). This indicates that the patients were not linked through transmission.
Phylogenetic trees of these three genes suggested that samples of genotype 3a are genetically more diverse than those of other genotypes (Fig 1). To determine whether this reflected true sequence diversity, we examined whether other HCV genotypes differed in nucleotide variability. Sequencing data for other genotypes (1a, 1b, 2a and 4a) were taken from publicly available online sequences. It must be taken into consideration that our study was not an exhaustive analysis of all genotypes or subgenotypes; we did not compare the diversity of genotype 3a with other, less common subtypes because NGS data for them are scarce. We had greater numbers of genotype 3a sequences than of other genotypes (ANOVA, p < 0.05). Samples with a higher total number of reads also showed a higher number of unique sequences, with a correlation of 0.763 between the two. For this reason, the phylogenetic trees contain many unique genotype 3a sequences, creating the illusion of greater diversity. However, the number of reads obtained from NGS depends mainly on the level of complementarity between primers and template, which can differ among genotypes. Nucleotide variability was not associated with sample size (correlation with number of reads, 0.176). No significant difference was found between the nucleotide diversity of genotype 3a and that of other genotypes (Table 2).
The NS3 and NS5B samples had 1172 and 1526 total reads, respectively, with 416 and 276 unique sequences. The population frequency of the major variant was 19% and 26%, with nucleotide diversity of 0.011 and 0.010, respectively. The nucleotide diversity of NS3 and NS5B was lower than that of HVR1. The phylogenetic tree for NS3 showed that the NS3 samples were evenly distributed among each other and the reference sequences (Fig 2). No cluster related to a specific region was observed in the phylogenetic tree.
HCV variability changes constantly throughout chronic infection, resulting in a complex network of intra-host viral subpopulations [31]. Linking quasispecies variants that differ by a single mutation provides a natural framework for organizing the diverse population of variants [24]. Although this framework was derived through theoretical analysis, we constructed one-step networks from our viral sequence data. The approach offers a simple, biologically meaningful method for assessing the genetic relatedness of viral populations [32]. The one-step networks for each patient showed that the viral population circulating in Pakistan was complex and heterogeneous.
Authors' contributions
SS, AA & BK performed bench work and drafted the manuscript. MA & ZF analyzed the data. AA & MI critically reviewed the manuscript. All the authors read and approved the final manuscript.
Acknowledgments
We thank Dr Yury Khudyakov for providing excellent facilities at the Centers for Disease Control and Prevention, USA. We are also grateful to Dr Gilberto Vaughan, Zoya Dimitrova, Mike A Purdy and David S Campo from the Centers for Disease Control and Prevention, USA, for their help in this study.
References
1. B.D. Lindenbach, C.M. Rice, Molecular biology of flaviviruses, Adv. Virus Res. 59 (2003) 23–61.
2. K.M. Hanafiah, J. Groeger, A.D. Flaxman, S.T. Wiersma, Global epidemiology of hepatitis C virus infection, new estimates of age specific antibody to HCV seroprevalence, Hepatology. 57 (2012) 1333–42.
3. S.M. Lemon, C.M. Walker, M.J. Alter, M. Yi, Hepatitis C Virus, In: Knipe, D.M. Howley, P.M. Eds., Fields Virology, Lippincott Williams & Wilkins, Philadelphia, PA, (2007) pp. 1253–1304.
4. P. Simmonds, Genetic diversity and evolution of hepatitis C virus – 15 years on, J. Gen. Virol. 85 (2004) 3173–3188.
5. D. Schuppan, A. Krebs, M. Bauer, E.G. Hahn, Hepatitis C and liver fibrosis, Cell Death Differ. 10 (2003) 59–67.
6. D. Hunt, P. Pockros, What are the promising new therapies in the field of chronic hepatitis C after the first-generation direct-acting antivirals? Current gastroenterology reports, 15 (2013) 303.
7. M. Martell, J.I. Esteban, J. Quer, J. Genesca, Hepatitis C virus (HCV) circulates as a population of different but closely related genomes: quasispecies nature of HCV genome distribution, Journal of Virology. 66 (1992) 3225–3229.
8. P. Moreno, M. Alvarez, L. López, G. Moratorio, Evidence of recombination in Hepatitis C Virus populations infecting a hemophiliac patient, Virology Journal. 6 (2009) 203.
9. J.M. Cuevas, M. Torres-Puente, N. Jiménez-Hernández, M.A. Bracho, Refined analysis of genetic variability parameters in hepatitis C virus and the ability to predict antiviral treatment response, Journal of Viral Hepatitis. 15 (2008) 578–590.
10. E.A. Duarte, I.S. Novella, S.C. Weaver, E. Domingo et al., RNA virus quasispecies: significance for viral disease and epidemiology, Infectious Agents and Disease. 3 (1994) 201–214.
11. M. Sala, S. Wain-Hobson, Are RNA viruses adapting or merely changing? J. Mol. Evol. 51 (2000) 12–20.
12. T. Laskus, J. Wilkinson, J.F. Gallegos-Orozco, M. Radkowski et al., Analysis of hepatitis C virus quasispecies transmission and evolution in patients infected through blood transfusion, Gastroenterology. 127 (2004) 764–776.
13. L. Barzon, E. Lavezzo, V. Militello, S. Toppo et al., Applications of next-generation sequencing technologies to diagnostic virology, Int. J. of Mol. Sciences. 12 (2011) 7861–7884.
14. N. Beerenwinkel, H.F. Gunthard, V. Roth, K.J. Metzner, Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data, Frontiers in Microbiology. 3 (2012) 329.
15. N. Beerenwinkel, Ultra-deep sequencing for the analysis of viral populations, Current Opinion in Virology. 1 (2011) 413–418.
16. M. Idrees, S. Riazuddin, Frequency distribution of hepatitis C virus genotypes in different geographical regions of Pakistan and their possible routes of transmission. BMC. Infect. Dis. 8 (2008) 69.
17. J.C. Forbi, J.E. Layden, R.O. Phillips, N. Mora et al., Next-generation sequencing reveals frequent opportunities for exposure to hepatitis C virus in Ghana, PLoS one. 18 (2015) 12.
18. P. Skums, Z. Dimitrova, D.S. Campo, G. Vaughan et al., Efficient error correction for next-generation sequencing of viral amplicons, BMC Bioinformatics. 13 (2012) S6.
19. M. Nei, (1987). Molecular Evolutionary genetics. New York, Columbia University Press.
20. S. Schneider, D. Roessli, L. Excoffier, Arlequin: software for population genetics data analysis. User manual ver 2.000. Genetics and Biometry Lab, Dept. of Anthropology, University of Geneva, Geneva. (2000).
21. I.ur. Rehman, G. Vaughan, M.A. Purdy, G.L. Xia et al., Genetic history of hepatitis C virus in Pakistan, Infect. Genet. Evol. 27 (2014) 318–24.
22. Mathworks (2010). Matlab. Natick, MA.
23. K. Katoh, D.M. Standley, MAFFT multiple sequence alignment software version 7: improvement in performance and usability. Mol. Biol. Evol. 30 (2013) 772–780.
24. D.S. Campo, Z. Dimitrova, L. Yamasaki, P. Skums et al., Next-generation sequencing reveals large connected networks of intra-host HCV variants, BMC Genomics. 15 (2014) S4.
25. V. Batagelj, A. Mrvar, Pajek - Analysis and Visualization of Large Networks. Graph Drawing Software. M. Juenger and P. Mutzel. Berlin, Springer, (2003) 77–103.
26. M. Idrees, A. Lal, M. Naseem, M. Khalid, High prevalence of hepatitis C virus infection in the largest province of Pakistan, J. Dig. Dis. 9 (2008) 95–103.
27. D.A. Steinhauer, E. Domingo, J.J. Holland, Lack of evidence for proofreading mechanisms associated with an RNA virus polymerase, Gene. 122 (1992) 281–288.
28. A. Neumann, N. Lam, H. Dahari et al., Hepatitis C viral dynamics in vivo and the antiviral efficacy of interferon-alpha therapy, Science. 282 (1998) 103–107.
29. J. Pawlotsky, Hepatitis C virus population dynamics during infection, Curr. Top Microbiol. Immunol. 299 (2006) 261–284.
30. S. Butt, M. Idrees, H. Akbar, I. Rehman et al., The changing epidemiology pattern and frequency distribution of hepatitis C virus in Pakistan, Infec. Gen. Evol. 10 (2010) 595–600.
31. S. Ramachandran, D. Campo, Z. Dimitrova, G. Xia et al., Temporal Variations in the Hepatitis C Virus Intra-Host Population During Chronic Infection, J virol. 85 (2011) 6369–6380.
32. H. Li, M.B. Stoddard, S. Wang, L.M. Blair et al., Elucidation of hepatitis C virus transmission and early diversification by single genome sequencing, PLoS Pathog. 8 (2012) 8.
Figure captions
Figure 1: Phylogenetic trees based on analyses of HVR1 (A), NS3 (B) and NS5B (C) showing different quasispecies of Pakistani isolates. Only unique sequences are included. There is no intermixing of HCV variants among patients.
Figure 2. Inter-host diversity: Phylogenetic tree based on NS3 sequences of HCV 3a Pakistani isolates. Reference sequences downloaded from GenBank are shown in green, previously published sequences from Pakistan in blue, and sequences included in this study in red.
Figure 3. One-step components of a single patient. Panel A is a histogram of all p-distances found within the largest one-step network of the patient. Panel B is a heatmap of the distance matrix among all sequences belonging to that patient. Panel C shows the largest one-step component of patient Pak 2, where each node is a variant and two nodes are connected by a link if the Hamming distance between them is 1.
Table
Table 1: Sequence of primers
Primer name    Sequences 5' ------> 3'
HVR1-OS        TGGCTTGGGATATGATGATGAACT
HVR1-OAS       GCAGTCCTGTTGATGTGCCA
HVR1-IS        GGATATGATGATGAACTGGT
HVR1-IAS       ATGTGCCAGCTGCCGTTGGTG
NS3-OS         GCAAACTAGGGGCCTTCTTGGGAC
NS3-OAS        GGAGGAGTTGAATTGTCAGAGAAAGAT
NS3-IS         GGGGCCTTCTTGGGACTATTGTGAC
NS3-IAS        AGTTGAATTGTCAGAGAAAGATGGAGACCT
NS5b-OS        TGAAGATGTGGACCTCAAAGAAAACCC
NS5b-OAS       AGCATCTCCGGGTGGAGCAGA
NS5b-IS        CAAGAAAACCCCCTTGGGGTTCTC
NS5b-IAS       GGTCATAGCCTCCGTGAAGGCTCTC
Table 2: Nucleotide diversity of genotype 3a with respect to other genotypes
Genotypes    n     Seqnum    Pop_size    Nucleotide_diversity    Nuc_div_Standard_error_Mean
1a           63    728.8     4370        0.020                   0.002
1b           38    853.3     4078        0.021                   0.003
2a           56    581.9     2126        0.020                   0.002
3a           35    1906.7    9212        0.019                   0.002
4a           2     668.5     5042        0.021                   0.013
*n, Number of samples; Seqnum, Average number of unique haplotypes; Pop_size, all reads; Nuc_div_Standard_error_Mean, Nucleotide diversity standard error of the mean.
Table 3: Intra patient Distance mean and Standard deviation analysis of HVR1, NS3 and NS5B
Patient    Mean intra dist    Std intra dist
HVR1
PAK1       0.013              0.004
PAK2       0.056              0.028
PAK3       0.011              0.006
PAK4       0.015              0.005
PAK6       0.038              0.017
PAK8       0.031              0.010
PAK9       0.040              0.021
PAK11      0.020              0.006
PAK12      0.036              0.025
PAK13      0.016              0.004
PAK14      0.021              0.015
PAK18      0.034              0.017
PAK19      0.050              0.034
Mean       0.029
NS3
PAK1       0.008              0.003
PAK2       0.020              0.007
PAK3       0.009              0.003
PAK4       0.007              0.004
PAK6       0.016              0.005
PAK8       0.013              0.004
PAK9       0.011              0.003
PAK11      0.006              0.002
PAK12      0.005              0.001
PAK13      0.011              0.003
PAK14      0.013              0.005
PAK18      0.016              0.005
PAK19      0.013              0.006
Mean       0.011
NS5B
PAK1       0.007              0.002
PAK2       0.016              0.006
PAK3       0.008              0.003
PAK4       0.009              0.003
PAK6       0.010              0.003
PAK8       0.007              0.002
PAK9       0.010              0.007
PAK11      0.006              0.002
PAK12      0.012              0.004
PAK13      0.007              0.002
PAK14      0.014              0.006
PAK18      0.005              0.002
PAK19      0.010              0.004
Mean       0.010
Table 4: One step longest path analysis of big component of HVR1, NS3 and NS5B
Patient    Seqnum    Total_freq    1-step components    Max length path of big component
HVR1
PAK11      471       8744          15                   11
PAK12      2362      16959         179                  17
PAK13      1116      7078          44                   15
PAK14      1645      8378          214                  16
PAK18      163       1402          41                   7
PAK19      47        68            35                   3
PAK1       450       8681          29                   8
PAK2       5524      18133         1024                 30
PAK3       12        43            6                    4
PAK4       495       4393          25                   9
PAK6       1255      3179          214                  14
PAK8       4506      8231          1235                 23
PAK9       2014      4552          414                  13
358
15
170
14
PAK2
594
1210
PAK2
NS3 777 NS5B 4680
*Seqnum, Unique haplotypes; Total_freq, all reads; Max length path of big component, Maximum length path of big component.