Comprehensive analysis of the codon usage patterns of polyprotein of Zika virus


Progress in Biophysics and Molecular Biology xxx (xxxx) xxx

Contents lists available at ScienceDirect

Progress in Biophysics and Molecular Biology journal homepage: www.elsevier.com/locate/pbiomolbio

Jun Tao, Huipeng Yao*

College of Life Science, Sichuan Agriculture University, Ya'an, 625014, Sichuan, PR China

Article info

Article history: Received 29 March 2018. Received in revised form 25 February 2019. Accepted 1 May 2019. Available online xxx.

Keywords: Zika virus; Codon usage; Evolutionary; Mutation pressure; Natural selection

Abstract

Zika virus (ZIKV) is a mosquito-borne virus in the family Flaviviridae, and its massive outbreaks have endangered public health. Codon usage patterns of viruses reflect a series of evolutionary changes that shape their survival and fitness in the external environment and, most importantly, in their hosts. In this study, 90 ZIKV isolates were used for a comprehensive analysis of codon usage patterns. The overall codon usage among ZIKV strains is similar and slightly biased. The effective number of codons (ENC) showed that the overall extent of codon usage bias in ZIKV is relatively low. Nucleotide analysis showed that the overall codon usage is biased toward A- and G-ending codons. Phylogenetic analysis indicated that the strains diverged from a common ancestor. The RSCU analysis showed that the codon usage pattern of ZIKV is more similar to that of Homo sapiens than to that of its mosquito vectors. Correlation analysis, correspondence analysis, the ENC-GC3S plot, and the PR2 plot indicated that the codon usage patterns of the viruses are influenced by both mutational pressure and natural selection, while neutrality plot analysis showed that the latter plays the major role. These results lay a foundation for further research on the molecular evolution of ZIKV. © 2019 Elsevier Ltd. All rights reserved.

1. Introduction

Zika virus (ZIKV), the causative agent of Zika fever, is a pathogen in the genus Flavivirus, family Flaviviridae (McLean et al., 2017; van Hemert and Berkhout, 2016), and is transmitted to humans mainly through the bites of Aedes aegypti and Aedes albopictus (Wang et al., 2016). ZIKV is an enveloped, single-stranded, positive-sense RNA virus with a genome of approximately 11 kb (Faye et al., 2013; Butt et al., 2016). The genome contains two flanking untranslated regions (a longer 3′ UTR and a shorter 5′ UTR) and a large open reading frame (ORF) encoding a polyprotein that is cleaved into the capsid (C), precursor membrane (prM), envelope (E) and seven nonstructural (NS) proteins (5′-C-prM-E-NS1-NS2A-NS2B-NS3-NS4A-NS4B-NS5-3′) (Cristina et al., 2016; Zhang et al., 2011).

ZIKV was first isolated in 1947 from a monkey blood sample in Uganda (Wang et al., 2016; van Hemert and Berkhout, 2016), and before its outbreak in Oceania in 2007 it was found only in Africa and Southeast Asia (Wang et al., 2016). Between 2007 and 2015, ZIKV spread further to other continents, but it did not attract international attention until the epidemic in Brazil in 2015 (Maharajan et al., 2016). As of 26 February 2016, ZIKV had spread to more than 30 countries and territories, drawing the attention of the World Health Organization (WHO) (Samarasekera and Triunfol, 2016; Reefhuis et al., 2016). ZIKV infection ranges from asymptomatic to mild illness with fever, exanthematous rash, conjunctivitis, or arthralgia (Wang et al., 2016; Reefhuis et al., 2016). Importantly, ZIKV infection in pregnant women can increase the number of newborns with microcephaly, and it can also cause neurological disorders (Musso and Gubler, 2016).

Codons encoding the same amino acid are called synonymous codons, and all amino acids except methionine and tryptophan are encoded by more than one codon (Butt et al., 2014a). Synonymous codons are not used equally either within or between genomes (Kumar et al., 2016a; Gu et al., 2004), and this phenomenon, which occurs in a wide range of organisms from viruses and prokaryotes to eukaryotes, is termed codon usage bias (Nasrullah et al., 2015). Previous studies of ZIKV codon usage were based on incomplete sequence data (Wang et al., 2016; van Hemert and Berkhout, 2016; Butt et al., 2016; Cristina et al., 2016; Singh and Tyagi, 2017), and they disagree on the main influencing factor, with some pointing to mutation pressure (van Hemert and Berkhout, 2016; Cristina et al., 2016; Singh and Tyagi, 2017) and others to natural selection (Butt et al., 2016). For this reason, we downloaded all complete coding sequences available up to 24 April 2017 and analyzed their codon usage patterns.

* Corresponding author. Tel.: +86 835 2886126. E-mail address: [email protected] (H. Yao).

https://doi.org/10.1016/j.pbiomolbio.2019.05.001 0079-6107/© 2019 Elsevier Ltd. All rights reserved.

Please cite this article as: Tao, J., Yao, H., Comprehensive analysis of the codon usage patterns of polyprotein of Zika virus, Progress in Biophysics and Molecular Biology, https://doi.org/10.1016/j.pbiomolbio.2019.05.001

J. Tao, H. Yao / Progress in Biophysics and Molecular Biology xxx (xxxx) xxx

2. Materials and methods

2.1. Sequence data

The complete genome sequences of all ZIKV isolates were obtained from the NCBI GenBank database (http://www.ncbi.nlm.nih.gov) on 24 April 2017. Isolates lacking a collection date, country or host were discarded. This left 90 ZIKV isolates, whose complete coding regions were extracted with the DNAstar software. Information on the selected ZIKV strains is provided in Supplementary Table S1.

2.2. Nucleotide composition analysis

The following nucleotide contents of the complete coding region of the ZIKV genomes were calculated with the DNAstar software, Cusp (http://www.bioinformatics.nl/emboss-explorer) and CodonW (http://codonw.sourceforge.net): (i) the overall frequency of each nucleotide (A%, C%, U%, G%, GC%); (ii) the frequency of each nucleotide at the third position of synonymous codons (A3s%, C3s%, U3s%, and G3s%); (iii) the G + C frequency at the first (GC1), second (GC2), and third (GC3) codon positions; and (iv) the mean G + C frequency at the first and second positions (GC12).

2.3. Parameters of codon usage

Several parameters were calculated to analyze the codon usage of the ZIKV polyprotein-coding region. The relative synonymous codon usage (RSCU), effective number of codons (ENC), hydrophobicity (GRAVY), and aromaticity (AROMO) were calculated with the CodonW 1.4.4 program.
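As a minimal sketch of items (iii) and (iv) in Section 2.2, the position-specific G + C fractions can be computed directly from an in-frame coding sequence. The function name `gc_by_position` is hypothetical and is not part of DNAstar, Cusp or CodonW. It counts every codon, whereas the synonymous-site measures (A3s, GC3s and so on) are restricted to synonymous codons.

```python
def gc_by_position(cds):
    """Return GC1, GC2, GC3 and GC12 (as fractions) for an in-frame coding sequence.

    Illustrative helper only: accepts DNA or RNA letters and counts all codons.
    """
    seq = cds.upper().replace("T", "U")
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("coding sequence must be non-empty and a multiple of 3 long")
    n_codons = len(seq) // 3
    fractions = []
    for pos in range(3):
        bases = seq[pos::3]  # every base at codon position pos + 1
        fractions.append(sum(b in "GC" for b in bases) / n_codons)
    gc1, gc2, gc3 = fractions
    return {"GC1": gc1, "GC2": gc2, "GC3": gc3, "GC12": (gc1 + gc2) / 2}


# Toy input, not ZIKV data: codons AUG, GCC, GAU
print(gc_by_position("AUGGCCGAU"))
```

Applied to each of the 90 coding regions, the GC12 and GC3 values from such a function would supply the two axes of the neutrality plot described in Section 2.8.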

2.4. Relative synonymous codon usage (RSCU) analysis

Relative synonymous codon usage (RSCU) is defined as the ratio of the observed frequency of a codon to the frequency expected if all synonymous codons for that amino acid were used equally (Yang et al., 2014). The RSCU value of the ith codon for the jth amino acid was calculated as:

$$\mathrm{RSCU} = \frac{g_{ij}}{\sum_{j}^{n_i} g_{ij}} \cdot n_i$$

where $g_{ij}$ is the observed number of the ith codon for the jth amino acid, which has $n_i$ kinds of synonymous codons. A codon with an RSCU value above 1.0 shows positive codon usage bias, whereas a value below 1.0 shows negative codon usage bias. An RSCU value close to 1.0 means that the codon is used about as often as expected under equal, random usage (Sharp and Li, 1986). In this study, the RSCU values of ZIKV were calculated with CodonW, the RSCU values for Homo sapiens were taken from Singh and Tyagi (2017), and the RSCU values for the mosquitoes (Aedes aegypti and Aedes albopictus) were taken from the Codon Usage Database (http://www.kazusa.or.jp/codon).
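The RSCU definition in Section 2.4 can be illustrated with a short sketch. This is not the CodonW implementation. The names `CODON_TABLE` and `rscu` are hypothetical, the standard genetic code is assumed, and amino acids that do not occur in the input are simply left out of the result.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in UCAG order (UUU, UUC, UUA, ...).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for the 59 codons with at least one synonym."""
    seq = cds.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        families[aa].append(codon)

    values = {}
    for aa, family in families.items():
        if aa == "*" or len(family) == 1:
            continue  # skip stop codons, Met and Trp
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this sequence
        for codon in family:
            # observed count divided by the count expected under equal usage
            values[codon] = counts[codon] * len(family) / total
    return values


# Toy input, not ZIKV data: three CUG and one CUU, all leucine codons
print(rscu("CUGCUGCUGCUU")["CUG"])
```

In the toy example leucine has six synonymous codons and four observations, so CUG (three of the four) gets 3 × 6 / 4 = 4.5, and the six leucine values sum to 6.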

2.5. Effective number of codons analysis

The ENC was calculated to quantify codon usage bias at the gene and genome level; it is regarded as the best estimator of absolute codon usage bias (Nasrullah et al., 2015; Wright, 1990). ENC values range from 20 to 61, and computing them requires no prior knowledge or reference set. A value of 20 indicates extreme codon usage bias, and a value of 61 indicates no bias. A gene with an ENC value of 35 or less is generally considered to have an obvious codon bias.

ENC-GC3S plot. The ENC-GC3S plot was used to investigate the influence of GC3S content on codon usage. The expected ENC value for each GC3S was calculated using the following formula:

$$\mathrm{ENC}_{\mathrm{expected}} = 2 + s + \frac{29}{s^{2} + (1 - s)^{2}}$$

where s represents the GC3S value.

2.6. Correspondence analysis (COA)

Correspondence analysis (COA) is a multivariate statistical method widely used to explore the major trends in relative synonymous codon usage (RSCU). In correspondence analysis, each gene is represented as a 59-dimensional vector, and each dimension corresponds to the RSCU value of one codon, excluding the stop codons and the codons for methionine and tryptophan (Liu et al., 2012). The COA on RSCU was performed with CodonW.

2.7. Parity rule 2 analysis

The parity rule 2 (PR2) plot was used to explore the effects of mutation and natural selection on the codon usage of genes. In the PR2 plot, the AU-bias [A3s/(A3s + U3s)] at the third codon position is plotted on the ordinate and the GC-bias [G3s/(G3s + C3s)] on the abscissa (Sueoka, 1999).

2.8. Neutrality plot (GC12 vs GC3) analysis

The neutrality plot was used to examine the extent to which mutation pressure and natural selection affect codon usage patterns, by plotting the GC12 values against the GC3 values. In this plot, mutation pressure is assumed to be the main force shaping codon usage when the regression line falls near the diagonal. Conversely, a regression line that tilts toward or runs parallel to the horizontal axis indicates a dominant role of natural selection in codon usage bias (Sueoka, 1988).

2.9. Correlation analysis

A correlation analysis was performed with the statistical software SPSS 21 to identify relationships between codon usage patterns and nucleotide composition or properties of the gene product in Zika virus (Lobry and Gautier, 1994; Rao et al., 2014), and thus to gauge the influence of mutation or selection (Liu et al., 2012).

2.10. Phylogenetic analysis

For many viruses, the codon usage pattern is thought to be related to their evolutionary processes (Singh and Pandey, 2017). A phylogenetic tree was constructed from the nucleotide sequences of the coding regions of the ZIKV isolates, using the neighbor-joining (NJ) method with 1000 bootstrap replicates in MEGA7 (Kumar et al., 2016b).
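The expected-ENC curve of Section 2.5 is easy to reproduce. The sketch below simply evaluates the formula given there; the function name `expected_enc` is hypothetical, and the observed ENC values themselves would still come from CodonW.

```python
def expected_enc(gc3s):
    """Expected ENC when GC3s composition alone shapes codon usage.

    gc3s is a fraction between 0 and 1 (for example 0.5188 for 51.88%).
    """
    s = gc3s
    if not 0.0 <= s <= 1.0:
        raise ValueError("GC3s must be given as a fraction between 0 and 1")
    return 2 + s + 29 / (s ** 2 + (1 - s) ** 2)


# The curve peaks at s = 0.5 and falls off symmetrically in the denominator.
for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"GC3s = {s:.2f}  expected ENC = {expected_enc(s):.2f}")
```

At s = 0.5 the formula gives 60.5. For a GC3s near the mean reported in Section 3.2 (51.88%) the expected value is about 60.4, so an isolate with an observed ENC near 53 lies below the curve.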


3. Results

3.1. Phylogenetic analysis of ZIKV based on polyprotein-coding region

To determine the phylogenetic relationships among ZIKV strains, a phylogenetic tree was constructed (Supplementary Fig. S1). The 90 strains fall into two groups: group 1 comprises strains from Africa, including Senegal and Uganda, and group 2 comprises strains from other countries and territories.

3.2. Nucleotide composition of the ZIKV and parameter analysis

The nucleotide contents of the complete coding region of all 90 ZIKV genomes were analyzed (Supplementary Table S2). The A%, U%, G%, C%, and GC% are 27.44% ± 0.09 (mean ± SD), 21.48% ± 0.03, 29.17% ± 0.06, 21.90% ± 0.09, and 51.08% ± 0.11, respectively. G and A predominate over C and U, and the G + C content exceeds the A + U content. The A3s%, U3s%, G3s%, C3s%, and GC3s% are 33.06% ± 0.37 (mean ± SD), 24.78% ± 0.13, 32.09% ± 0.25, 32.19% ± 0.24, and 51.88% ± 0.34, respectively. Thus A3s% and C3s% are higher than G3s% and U3s% in the ZIKV polyprotein-coding region. To assess the extent of codon usage bias, ENC values of the 90 ZIKV isolates were calculated. According to Supplementary Table S2, the ENC values range from 52.65 to 53.56, with a mean of 53.21 and a standard deviation (SD) of 0.18. Because these values are close to 61, the codon usage bias of ZIKV is low.

3.3. Relative synonymous codon usage (RSCU) analysis

The patterns of synonymous codon usage in ZIKV coding sequences were assessed by RSCU analysis. Among the 18 preferred codons (one for each amino acid except methionine and tryptophan), 11 are C/G-ended (four G-ended; seven C-ended) and the remaining seven are U/A-ended (one U-ended; six A-ended). Therefore, most of the preferentially used codons in ZIKV are C-ended or A-ended (Table 1).

Of the 18 preferred codons, five, CUG (L), GUG (V), CCA (P), AGA (R), and GGA (G), have RSCU values > 1.6, whereas the remaining preferred codons have RSCU values between 0.6 and 1.6 (Table 1). Nucleotide composition (A/G-rich) and RSCU analysis (A/C-ended) show that selection of the preferred codons has been influenced by compositional constraints, which indicates that mutation pressure mostly shaped the codon usage pattern. To determine the potential influence of the hosts on codon usage, the RSCU values of the codons in ZIKV coding sequences were compared with those of its three hosts (Homo sapiens, Aedes aegypti and Aedes albopictus). The ratio of coincident to antagonistic preferred codons is 11/7 between ZIKV and H. sapiens, 10/8 between ZIKV and Aedes aegypti, and 9/9 between ZIKV and Aedes albopictus (Table 1). Overall, the codon usage pattern of ZIKV is more similar to that of Homo sapiens than to that of Aedes aegypti or Aedes albopictus. These results suggest that selection pressure from the host may affect the codon usage pattern of ZIKV, which may help the virus adapt to the cellular environment of its hosts and replicate efficiently in them (Wong et al., 2010; Ma et al., 2014).

3.4. Correlation analysis

To determine whether the codon usage patterns of ZIKV coding sequences are mainly influenced by mutation pressure or natural selection, we performed a correlation analysis between the nucleotide compositions and the third base of synonymous codons


(Table 2). The results show that the A content is significantly positively correlated with A3s and significantly negatively correlated with G, C, GC, G3s, C3s and GC3s, but not significantly correlated with U or U3s. The G content is significantly positively correlated with GC, G3s and GC3s and significantly negatively correlated with A and A3s, but not significantly correlated with U, C, U3s or C3s. The U content is significantly positively correlated with U3s and significantly negatively correlated with C, GC, C3s and GC3s, but not significantly correlated with A, G, A3s or G3s. The C content is significantly positively correlated with GC, G3s, C3s and GC3s and significantly negatively correlated with A, U, A3s and U3s, but not significantly correlated with G. The GC content is significantly positively correlated with G, C, G3s, C3s and GC3s and significantly negatively correlated with A, U, A3s and U3s. The ENC value is significantly positively correlated with C, GC, G3s, C3s and GC3s and significantly negatively correlated with A and A3s, but not significantly correlated with G, U or U3s. These results indicate that compositional constraints under mutation pressure may affect the codon usage pattern of ZIKV. Correlation analysis was also performed between the first two axes of the correspondence analysis and the nucleotide constraints of the ZIKV polyprotein (Table 3). Axis 1 is positively correlated with C, GC, G3s, C3s, GC3s, and ENC, and negatively correlated with A, A3s and U3s. Axis 2 is not significantly correlated with C, U, C3s, or U3s. Overall, these results indicate that mutation pressure has played a major role in shaping the codon usage patterns of ZIKV genomes.

To determine the potential influence of natural selection, correlation analysis was performed between amino acid properties (GRAVY and AROMO) and codon bias indices (Axis 1, Axis 2, ENC, and GC3s) (Table 4). Both axes are significantly positively correlated with AROMO, whereas GRAVY is significantly negatively correlated with Axis 1 and not significantly correlated with Axis 2. In addition, GRAVY is significantly negatively correlated with GC, GC3s, and ENC, whereas AROMO is significantly positively correlated with all three. Overall, the aromaticity and hydrophobicity of the encoded amino acids have some effect on the codon usage pattern of ZIKV, which reveals the importance of translational selection (Chen and Chen, 2014).

3.5. ENC-GC3S plot, PR2 plot, neutrality plot analysis

To determine whether the codon usage patterns of ZIKV coding sequences have been shaped by mutation pressure, natural selection or both, we constructed an ENC-GC3S plot, a PR2 plot and a neutrality plot (Fig. 1, Supplementary Fig. S2, Fig. 2). If all strains lie on the curve of expected ENC values, codon bias is constrained only by mutation pressure (Butt et al., 2014b). As shown in Fig. 1, all ZIKV isolates cluster below the expected curve, as in previous studies (Wang et al., 2016; van Hemert and Berkhout, 2016; Butt et al., 2016; Cristina et al., 2016; Singh and Tyagi, 2017), indicating that, in addition to mutation pressure, the codon usage patterns have been influenced by other factors such as translational selection. According to Supplementary Fig. S2, the third base of synonymous codons in all ZIKV complete coding regions follows A > U in every period. However, only a small proportion of the isolates from 1947, 2013, 2014, 2015, and 2016 show G > C, in contrast to all strains from 1966 to 2012.
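The PR2 coordinates defined in Section 2.7 can be sketched as follows. The function name `pr2_point` is hypothetical, and the example input uses the mean third-position values reported in Section 3.2 purely for illustration. The PR2 plot itself has one point per isolate.

```python
def pr2_point(a3, u3, g3, c3):
    """Return (GC-bias, AU-bias): the abscissa and ordinate of one PR2 point."""
    if a3 + u3 <= 0 or g3 + c3 <= 0:
        raise ValueError("third-position frequencies must have positive sums")
    gc_bias = g3 / (g3 + c3)  # abscissa, 0.5 means G = C
    au_bias = a3 / (a3 + u3)  # ordinate, 0.5 means A = U
    return gc_bias, au_bias


# Illustrative input: mean A3s, U3s, G3s, C3s (in %) from Section 3.2
x, y = pr2_point(a3=33.06, u3=24.78, g3=32.09, c3=32.19)
print(f"GC-bias = {x:.3f}, AU-bias = {y:.3f}")  # GC-bias = 0.499, AU-bias = 0.572
```

A point at (0.5, 0.5) corresponds to A = U and G = C. With the mean values the point sits clearly above 0.5 on the AU axis (A > U) and only just below 0.5 on the GC axis.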
If the codon bias of ZIKV were caused entirely by random mutation, then A = U and G = C, that is, the usage frequency of purines would equal that of pyrimidines. That the usage frequency of A differs from that of U, and that of G from that of C, indicates that the formation of codon bias is weakly influenced by


Table 1 The relative synonymous codon usage (RSCU) of ZIKV and its hosts.

AA(a)  Codon  RSCU(b)
              ZIKV   Homo sapiens   Aedes aegypti   Aedes albopictus
Phe    UUU    1.00   0.94           0.56            0.48
Phe    UUC    1.00   1.06           1.44            1.52
Leu    UUA    0.32   0.47           0.36            0.24
Leu    UUG    1.32   0.78           1.32            1.14
Leu    CUU    0.79   0.80           0.66            0.48
Leu    CUC    0.99   1.15           0.84            0.84
Leu    CUA    0.68   0.43           0.54            0.54
Leu    CUG    1.91   2.36           2.28            2.76
Ile    AUU    0.88   1.10           0.99            0.75
Ile    AUC    1.14   1.38           1.59            1.86
Ile    AUA    0.98   0.52           0.39            0.39
Val    GUU    0.84   0.89           1.04            0.88
Val    GUC    1.12   1.06           1.08            1.32
Val    GUA    0.38   1.60           0.60            0.52
Val    GUG    1.65   0.44           1.28            1.32
Ser    AGU    0.97   0.91           0.96            0.78
Ser    AGC    1.26   1.44           1.08            1.08
Ser    UCU    0.87   1.12           0.66            0.54
Ser    UCC    0.98   1.28           1.20            1.38
Ser    UCA    1.52   0.91           0.66            0.48
Ser    UCG    0.40   0.33           1.44            1.68
Pro    CCU    0.66   1.14           0.68            0.36
Pro    CCC    1.13   1.29           0.84            1.12
Pro    CCA    1.79   1.10           1.20            1.08
Pro    CCG    0.42   0.47           1.32            1.44
Thr    ACU    0.99   1.01           0.80            0.64
Thr    ACC    1.15   1.39           1.48            1.80
Thr    ACA    1.44   1.15           0.72            0.60
Thr    ACG    0.43   0.46           1.00            1.00
Tyr    UAU    0.74   0.90           0.64            0.56
Tyr    UAC    1.26   1.10           1.36            1.44
His    CAU    0.83   0.85           0.84            0.76
His    CAC    1.17   1.15           1.16            1.24
Gln    CAA    1.15   0.54           0.82            0.60
Gln    CAG    0.85   1.46           1.18            1.40
Asn    AAU    0.66   0.96           0.80            0.64
Asn    AAC    1.34   1.06           1.20            1.36
Lys    AAA    0.92   0.88           0.80            0.58
Lys    AAG    1.08   1.12           1.20            1.42
Asp    GAU    0.93   0.94           1.12            0.96
Asp    GAC    1.07   1.08           0.88            1.04
Glu    GAA    0.93   0.86           1.16            1.10
Glu    GAG    1.07   1.14           0.84            0.90
Cys    UGU    0.92   0.94           0.84            0.70
Cys    UGC    1.08   1.06           1.16            1.30
Arg    CGU    0.46   0.48           1.38            1.50
Arg    CGC    0.59   1.10           1.26            1.32
Arg    CGA    0.23   0.65           1.20            0.96
Arg    CGG    0.56   1.22           1.02            1.20
Arg    AGA    2.41   1.29           0.66            0.60
Arg    AGG    1.75   1.27           0.54            0.42
Gly    GGU    0.52   0.64           1.12            1.24
Gly    GGC    0.69   1.35           1.04            1.08
Gly    GGA    1.76   1.01           1.48            1.20
Gly    GGG    1.04   1.00           0.36            0.48
Ala    GCU    1.12   1.05           1.08            1.00
Ala    GCC    1.30   1.59           1.48            1.80
Ala    GCA    1.10   0.92           0.76            0.60
Ala    GCG    0.48   0.44           0.68            0.60

(a) AA represents amino acid.
(b) The RSCU value represents the pattern of relative synonymous codon usage.

Table 2 Summary of correlation analysis of nucleotide composition and ENC.

       A        G        U        C        GC       A3s      G3s      U3s      C3s      GC3s
A      1        0.515**  0.088    0.809**  0.899**  0.987**  0.934**  0.196    0.710**  0.940**
G      0.515**  1        0.181    0.042    0.605**  0.468**  0.692**  0.017    0.036    0.388**
U      0.088    0.181    1        0.362**  0.408**  0.051    0.134    0.850**  0.404**  0.310**
C      0.809**  0.042    0.362**  1        0.778**  0.820**  0.652**  0.540**  0.954**  0.911**
GC     0.899**  0.605**  0.408**  0.778**  1        0.881**  0.888**  0.439**  0.713**  0.923**
ENC    0.699**  0.015    0.206    0.699**  0.499**  0.755**  0.529**  0.021    0.636**  0.666**

The numbers in each column are the correlation coefficients (r) calculated in each correlation analysis. Values are shown as magnitudes; the direction of each significant correlation is described in Section 3.4. *represents 0.01 < P < 0.05. **represents P < 0.01.

random mutation, and is strongly influenced by mutation pressure, natural selection, and other factors in ZIKV. A neutrality plot was constructed to determine the relative influence of mutation pressure and natural selection by comparing GC12 with GC3 (Wang et al., 2011a). When GC12 is significantly correlated with GC3 and the slope of the regression line is close to 1, mutation pressure is regarded as the main force shaping codon usage bias. Conversely, if selection is the dominant factor, the slope of the regression line is close to 0. The analysis shows a significant correlation between GC12 and GC3 (r = 0.205, P < 0.01), which seemed to indicate that mutation pressure plays the greater role in the codon usage bias of ZIKV polyprotein sequences (Fig. 2). However, the slope of the regression line shows otherwise. The slope is 0.0237, which means that the relative neutrality (mutation pressure) is 2.4% while the relative constraint on GC3 (natural selection) is 97.6%. Natural selection is therefore the dominant factor, compared with mutation pressure, in shaping the codon usage pattern of ZIKV genes.

3.6. Correspondence analysis

Correspondence analysis of the synonymous codon usage variation among the strains shows that the first axis (axis 1) accounts for 67.23% of the total variation, and the second, third and fourth axes (axis 2, axis 3, axis 4) account for 10.03%, 9.17%, and 2.42%, respectively. Because axis 1 and axis 2 together account for 77.26% of the variation, the axis 1 and axis 2 values of each isolate are plotted according to collection date and country (Fig. 3A, Fig. 3B, Supplementary


Table 3 Summary of correlation between the first two axes and nucleotide constraints in ZIKV genomes.

Base composition   Axis1     Axis2
A                  0.488**   0.255*
G                  0.137     0.356**
U                  0.072     0.061
C                  0.646**   0.038
GC                 0.408**   0.264*
A3s                0.550**   0.291**
G3s                0.353**   0.293**
U3s                0.245*    0.048
C3s                0.662**   0.044
GC3s               0.582**   0.217*
ENC                0.613**   0.296**

Values are shown as magnitudes; the direction of each significant correlation is described in Section 3.4. *represents 0.01 < P < 0.05. **represents P < 0.01.

Fig. 2. Neutrality plot: correlation between the mean GC content at the first and second codon positions (GC12) and the GC content at the third codon position (GC3). The solid black line is the regression line, and its equation is shown on the plot.
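The reading of the neutrality plot in terms of its regression slope can be sketched with an ordinary least-squares fit of GC12 on GC3. The function names below are hypothetical, the input values are toy numbers rather than the ZIKV data, and the authors' own fit was presumably produced by their plotting or statistics software.

```python
def neutrality_slope(gc3, gc12):
    """Least-squares slope and intercept of GC12 (y) regressed on GC3 (x)."""
    n = len(gc3)
    if n != len(gc12) or n < 2:
        raise ValueError("need two equally long lists with at least two points")
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    if sxx == 0:
        raise ValueError("GC3 values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def neutrality_split(slope):
    """Return (relative neutrality %, relative constraint %) for a given slope."""
    return slope * 100, (1 - slope) * 100


# Toy points lying exactly on a line with the slope reported in Section 3.5
toy_gc3 = [0.50, 0.52, 0.54, 0.56]
toy_gc12 = [0.5 + 0.0237 * x for x in toy_gc3]
slope, intercept = neutrality_slope(toy_gc3, toy_gc12)
neutrality, constraint = neutrality_split(slope)
print(f"slope = {slope:.4f}: neutrality {neutrality:.1f}%, constraint {constraint:.1f}%")
```

A slope of 0.0237 therefore corresponds to the 2.4% relative neutrality and 97.6% relative constraint quoted in the text.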

Table 4 Correlation analysis among AROMO, GRAVY, the first two axes, GC3s, ENC and GC in the polyprotein-coding region of ZIKV isolates.

            Axis1     Axis2     ENC       GC3s      GC
GRAVY   r   0.415**   0.07      0.504**   0.593**   0.607**
        p   0         0.51      0         0         0
AROMO   r   0.490**   0.396**   0.742**   0.690**   0.575**
        p   0         0         0         0         0

Values are shown as magnitudes; the direction of each significant correlation is described in Section 3.4. *represents 0.01 < P < 0.05. **represents P < 0.01.

Fig. S3, Supplementary Fig. S4). The points in Fig. 3A and Supplementary Fig. S4 separate into three groups, group A, group B and group C. Group A is composed of viruses from Asia and other regions, group B of all African types, and group C of three Asian types from Malaysia (Fig. 3A, Supplementary Fig. S4). This indicates that ZIKV codon usage is region-specific. In addition, ZIKV shows some country specificity according to Supplementary Fig. S4: the isolates from Uganda (Africa) cluster relatively closely, as do those from Malaysia (Asia) and Nicaragua (Central America). Overall, geographical distribution has some effect on the ZIKV codon usage pattern. According to Fig. 3B, the strains isolated from 2010 to 2016 tend to cluster in group A; the strains from 1947, 1962 and 1984 and one strain from 2014 cluster in group B; and the strains from 1966 cluster in group C. Isolates from the same period are distributed in the same region in Supplementary Fig. S3 and Fig. 3B, which reveals that viruses with different collection dates have different codon usage patterns. From the above results, ZIKV was isolated in 1947 in Uganda (Africa) and spread to other parts of Africa, forming group B. In 1966, group B spread to Malaysia (Asia) and produced an independent branch, group C. After 2010, ZIKV gradually spread from Africa to Asia, America, Europe, and Oceania, forming group A. Therefore, group B is thought to be the common ancestor of group C and group A. Group C does not appear to have evolved since it was discovered in 1966. One possible explanation is that the region where these viruses appeared is too remote or too isolated for group C to spread.

Fig. 1. ENC-GC3S plot: relationship between the effective number of codons (ENC) and the GC content at the third synonymous codon position (GC3S). The curve indicates the expected codon usage if GC compositional constraints alone account for codon usage bias.

4. Discussion

According to our result, all ZIKV isolates can be clustered into two groups, group 1 from Africa and group 2 from other districts, which is similar to the published articles (Wang et al., 2016; Singh and Tyagi, 2017; Maurer-Stroh et al., 2016). It indicated that all ZIKV strains diverge from a common ancestor and the codon usage pattern is influenced by evolutionary processes and geographical distribution. ENC is a simple measure of the degree of codon usage bias. Generally, when the ENC value is greater than 45, the codon usage bias is low in a given gene (Haddow et al., 2012). From Supplementary Table S2, the ENC values range from 52.65 to 53.56, with the mean value of 53.21 and the standard deviation of 0.18, so the codon usage bias of ZIKV is slightly weak, which is similar to the virus from previous studies (Wang et al., 2016; van Hemert and Berkhout, 2016; Butt et al., 2016; Cristina et al., 2016; Singh and Tyagi, 2017) and other RNA viruses, such as NDV (Wang et al., 2011a), HAV (Chen and Chen, 2014), SARSCoV (Gu et al., 2004), MARV (Nasrullah et al., 2015), and FCV (Zang et al., 2017). A possible explanation is that the weak codon bias helps to replication, transcription and translation for the virus (Wang et al., 2011a), when it intrudes into its host cells or evolves in its transmission cycles. Base composition is an important feature of a genome and is the main factor that affects codon usage pattern. The organisms with AT-rich genome, such as Plasmodium falciparum, Mycoplasma capricolum, and Onchocerca volvulus (Waterkeyn et al., 1998), tend to use A or T at the third position in coding sequence. However, GCrich species, show a preference for G or C at the same position, such as bacteria, fungi, Triticum aestivum, Oryza sativa, and Hordium vulgare (Hershberg and Petrov, 2009; Kawabe and Miyashita, 2003). In all, mutational pressure is the main factor affecting the codon usage bias of the species. Surprisingly, although the overall

Please cite this article as: Tao, J., Yao, H., Comprehensive analysis of the codon usage patterns of polyprotein of Zika virus, Progress in Biophysics and Molecular Biology, https://doi.org/10.1016/j.pbiomolbio.2019.05.001

6

J. Tao, H. Yao / Progress in Biophysics and Molecular Biology xxx (xxxx) xxx

Fig. 3. A plot of values of the first axis (Axis 1) and the second axis (Axis 2) of polyprotein-coding region of each ZIKV in correspondence analysis (A) plot according to isolated regions; (B) plot according to isolated date. The first axis accounts for 67.23% of total variation, and the second axis accounts for 10.03% of total variation.

GC content is higher than the AU content (G > A > C > U), most optimal codons end in C or A, rather than G or U in RSCU analysis (C > A > G > U). We think the reason is that G mainly exists on the first and second position of codon, while C mainly resides on the third position. In addition, GC3S and G3S are not only strongly positively correlated with ENC but also with two principle axes, the first axe and second axe. PR2 plot shows that A and C are used more frequently than G and U in the third base of synonymous codons in all ZIKV complete coding regions. This result is similar to previous studies (Wang et al., 2016) and differs from another (Butt et al., 2016), in which showed that A3S and G3S were used more frequently than U3S and C3S. In general, the low U content and the high G content of ZIKV coding region may be the reason of the high GC content in the third base of synonymous codons, related to the high ENC value, which indicate that the codon bias is relatively low. In another respect, because the A, GC, A3s, G3s and GC3s content have a correlation with the two principle axes, which also reveal that mutation pressure from base composition is an important factor shaping the codon usage patterns. However, according to our ENC-GC3S plot, the codon usage patterns of the virus have also been influenced by natural selection, such as the hydropathicity, the aromaticity, its host, its infection district and so on, which plays a major role in shaping the codon usage bias by the result of neutrality plot in ZIKV. From Table 4, AROMO has a positive correlation with Axis1and Axis2 while GRAVY has a negative correlation with Axis1, which indicates the role of hydropathicity and aromaticity forming the codon usage pattern of ZIKV similar to previous studies (Chen et al., 2013; Wang et al., 2011b; Tao et al., 2009). The host is also an important factor shaping the codon usage bias. 
In comparison to Aedes aegypti and Aedes albopictus, the codon usage pattern of ZIKV is more similar to that of Homo sapiens. This shows that the codon usage pattern of

the virus in our research is mainly influenced by Homo sapiens. Another explanation is that the different of similarity in the codon usage between ZIKV and different hosts may be caused by the various defense mechanisms from different hosts against ZIKV infection. The similar pattern of codon preference to Homo sapiens has also been detected in previous studies for ZIKV (Wang et al., 2016; Butt et al., 2016; Cristina et al., 2016). The district of collecting ZIKV is also a major element shaping its codon usage pattern, which is the same as result the former phylogenetic analysis. For example, by using COA related to its codon pattern for assortment, the viruses are divided into three groups, each of which is from different countries or districts. In summary, natural selection is an important factor shaping codon usage pattern of the virus. Mutational pressure and natural selection are considered as two important factors that affect the codon usage of virus (Zhang et al., 2011). Generally, for RNA viruses, it was found that mutational pressure is a decisive factor compared with natural selection, because RNA viruses have higher mutation rates (Zhang et al., 2011; Peixoto et al., 2003; Romero et al., 2003; Jenkins and Holmes, 2003). However, for ZIKV, it is different that our research and Butt et al. (2016) revealed that natural selection dominates over mutation pressure in the codon usage analysis on ZIKV, Cristina et al. (2016) and Singh et al. (Singh and Tyagi, 2017) revealed that mutation pressure is a major factor shaping the codon usage patterns, and Wang et al. (2016) thought that both mutation pressure and natural selection have an important influence shaping the codon usage bias. 5. Conclusion In short, our analysis revealed that the codon usage bias in ZIKV is low and natural selection such as aromaticity, hydropathicity,


host, geography and so on is the main factor that affects codon usage variation in ZIKV. Mutation pressure arising from nucleotide composition is also an important factor influencing codon usage bias. The evolution of ZIKV probably reflects a dynamic process of mutation and natural selection that adapts its codon usage to different environments and hosts. This study of codon usage patterns in ZIKV not only reveals information about its molecular evolution but also lays a foundation for disease prevention and vaccine design.

Acknowledgements

This work was supported by research grants from the Discipline Construction Double Support Project of Sichuan Agriculture University (00770114), China.

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.pbiomolbio.2019.05.001.

References

Butt, A.M., Nasrullah, I., Tong, Y., 2014. Genome-wide analysis of codon usage and influencing factors in chikungunya viruses. PLoS One 9 (3), e90905.
Butt, A.M., Nasrullah, I., Qamar, R., Tong, Y., 2016 Oct. Evolution of codon usage in Zika virus genomes is host and vector specific. Emerg. Microb. Infect. 5 (10), e107.
Chen, Y., Chen, Y.F., 2014. Analysis of synonymous codon usage patterns in duck hepatitis A virus: a comparison on the roles of mutual pressure and natural selection. Virusdisease 25 (3), 285–293.
Chen, L., Yang, D.Y., Liu, T.F., Nong, X., Huang, X., Xie, Y., Fu, Y., Zheng, W.P., Zhang, R.H., Wu, X.H., Gu, X.B., Wang, S.X., Peng, X.R., Yang, G.Y., 2013. Synonymous codon usage patterns in different parasitic platyhelminth mitochondrial genomes. Genet. Mol. Res. 12, 587–596.
Cristina, J., Fajardo, A., Soñora, M., Moratorio, G., Musto, H., 2016 Sep 2. A detailed comparative analysis of codon usage bias in Zika virus. Virus Res. 223, 147–152.
Faye, O., Faye, O., Diallo, D., Diallo, M., Weidmann, M., Sall, A.A., 2013. Quantitative real-time PCR detection of Zika virus and evaluation with field-caught mosquitoes. Virol. J. 10, 311.
Gu, W., Zhou, T., Ma, J., Sun, X., Lu, Z., 2004. Analysis of synonymous codon usage in SARS Coronavirus and other viruses in the Nidovirales. Virus Res. 101 (2), 155–161.
Haddow, A.D., Schuh, A.J., Yasuda, C.Y., Kasper, M.R., Heang, V., Huy, R., et al., 2012. Genetic characterization of Zika virus strains: geographic expansion of the Asian lineage. PLoS Negl. Trop. Dis. 6 (2), e1477.
Hershberg, R., Petrov, D.A., 2009. General rules for optimal codon choice. PLoS Genet. 5, e1000556.
Jenkins, G.M., Holmes, E.C., 2003. The extent of codon usage bias in human RNA viruses and its evolutionary origin. Virus Res. 92, 1–7.
Kawabe, A., Miyashita, N.T., 2003. Patterns of codon usage bias in three dicot and four monocot plant species. Genes Genet. Syst. 78, 343–352.
Kumar, N., Bera, B.C., Greenbaum, B.D., Bhatia, S., Sood, R., Selvaraj, P., Anand, T., Tripathi, B.N., Virmani, N., 2016. Revelation of influencing factors in overall codon usage bias of equine influenza viruses. PLoS One 11 (4), e0154376.
Kumar, S., Stecher, G., Tamura, K., 2016 Jul. MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Mol. Biol. Evol. 33 (7), 1870–1874.
Liu, X.S., Zhang, Y.G., Fang, Y.Z., Wang, Y.L., 2012. Patterns and influencing factor of synonymous codon usage in porcine circovirus. Virol. J. 9, 68.


Lobry, J.R., Gautier, C., 1994 Aug 11. Hydrophobicity, expressivity and aromaticity are the major trends of amino-acid usage in 999 Escherichia coli chromosome-encoded genes. Nucleic Acids Res. 22 (15), 3174–3180.
Ma, Y.P., Zhou, Z.W., Liu, Z.X., Hao, L., Ma, J.Y., Feng, G.Q., et al., 2014. Codon usage bias of the phosphoprotein gene of spring viraemia of carp virus and high codon adaptation to the host. Arch. Virol. 159 (7), 1841–1847.
Maharajan, M.K., Ranjan, A., Chu, J.F., Foo, W.L., Chai, Z.X., Lau, E.Y., Ye, H.M., Theam, X.J., Lok, Y.L., 2016 Dec. Zika virus infection: current concerns and perspectives. Clin. Rev. Allergy Immunol. 51 (3), 383–394.
Maurer-Stroh, S., Mak, T.M., Ng, Y.K., et al., 2016. South-east Asian Zika virus strain linked to cluster of cases in Singapore, August 2016. Euro Surveill.: bulletin europeen sur les maladies transmissibles = European communicable disease bulletin 21 (38).
McLean, E., Bhattarai, R., Hughes, B.W., Mahalingam, K., Bagasra, O., 2017. Computational identification of mutually homologous Zika virus miRNAs that target microcephaly genes. Libyan J. Med. 12 (1), 1304505.
Musso, D., Gubler, D.J., 2016 Jul. Zika virus. Clin. Microbiol. Rev. 29 (3), 487–524.
Nasrullah, I., Butt, A.M., Tahir, S., Idrees, M., Tong, Y., 2015. Genomic analysis of codon usage shows influence of mutation pressure, natural selection, and host features on Marburg virus evolution. BMC Evol. Biol. 15, 174.
Peixoto, L., Zavala, A., Romero, H., Musto, H., 2003. The strength of translational selection for codon usage varies in the three replicons of Sinorhizobium meliloti. Gene 320 (1), 109–116.
Rao, Y., Wang, Z., Chai, X., Nie, Q., Zhang, X., 2014. Hydrophobicity and aromaticity are primary factors shaping variation in amino acid usage of chicken proteome. PLoS One 9 (10), e110381.
Reefhuis, J., Gilboa, S.M., Johansson, M.A., Valencia, D., Simeone, R.M., Hills, S.L., Polen, K., Jamieson, D.J., Petersen, L.R., Honein, M.A., 2016 May. Projecting month of birth for at-risk infants after Zika virus disease outbreaks. Emerg. Infect. Dis. 22 (5), 828–832.
Romero, H., Zavala, A., Musto, H., Bernardi, G., 2003. The influence of translational selection on codon usage in fishes from the family Cyprinidae. Gene 317 (317), 141–147.
Samarasekera, U., Triunfol, M., 2016 Feb 6. Concern over Zika virus grips the world. Lancet 387 (10018), 521–524.
Sharp, P.M., Li, W.H., 1986. An evolutionary perspective on synonymous codon usage in unicellular organisms. J. Mol. Evol. 24 (1–2), 28–38.
Singh, R.K., Pandey, S.P., 2017. Phylogenetic and evolutionary analysis of plant ARGONAUTES. Methods Mol. Biol. 1640, 267–294.
Singh, N.K., Tyagi, A., 2017. A detailed analysis of codon usage patterns and influencing factors in Zika virus. Arch. Virol. 162 (7), 1963–1973.
Sueoka, N., 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. USA 85, 2653–2657.
Sueoka, N., 1999 Sep 30. Translation-coupled violation of Parity Rule 2 in human genes is not the cause of heterogeneity of the DNA G+C content of third codon position. Gene 238 (1), 53–58.
Tao, P., Dai, L., Luo, M., et al., 2009. Analysis of synonymous codon usage in classical swine fever virus. Virus Gene. 38, 104–112.
van Hemert, F., Berkhout, B., 2016. Nucleotide composition of the Zika virus RNA genome and its codon usage. Virol. J. 13, 95.
Wang, M., Liu, Y.S., Zhou, J.H., Chen, H.T., Ma, L.N., Ding, Y.Z., et al., 2011. Analysis of codon usage in Newcastle disease virus. Virus Gene. 42 (2), 245–253.
Wang, M., Zhang, J., Zhou, J., et al., 2011. Analysis of codon usage in bovine viral diarrhea virus. Arch. Virol. 156, 153–160.
Wang, H., Liu, S., Zhang, B., Wei, W., 2016. Analysis of synonymous codon usage bias of Zika virus and its adaption to the hosts. PLoS One 11 (11), e0166260.
Waterkeyn, J.G., Gauci, C., Cowman, A.F., Lightowlers, M.W., 1998. Codon usage in Taenia species. Exp. Parasitol. 88, 76–78.
Wong, E.H., Smith, D.K., Rabadan, R., Peiris, M., Poon, L.L., 2010. Codon usage bias and the evolution of influenza A viruses. Codon usage biases of influenza virus. BMC Evol. Biol. 10 (1), 1–14.
Wright, F., 1990. The effective number of codons used in a gene. Gene 87 (1), 23–27.
Yang, X., Luo, X., Cai, X., 2014. Analysis of codon usage pattern in Taenia saginata based on a transcriptome dataset. Parasites Vectors 7, 527.
Zang, M., He, W., Du, F., Wu, G., Wu, B., Zhou, Z., 2017 Jun 16. Analysis of the codon usage of the ORF2 gene of feline calicivirus. Infect. Genet. Evol. 54, 54–59.
Zhang, Y., Liu, Y., Liu, W., Zhou, J., Chen, H., Wang, Y., Ma, L., Ding, Y., Zhang, J., 2011 Apr 16. Analysis of synonymous codon usage in hepatitis A virus. Virol. J. 8, 174.

Please cite this article as: Tao, J., Yao, H., Comprehensive analysis of the codon usage patterns of polyprotein of Zika virus, Progress in Biophysics and Molecular Biology, https://doi.org/10.1016/j.pbiomolbio.2019.05.001