Discovering key residues of dengue virus NS2b-NS3-protease: New binding sites for antiviral inhibitors design


Accepted Manuscript

Discovering key residues of dengue virus NS2b-NS3-protease: New binding sites for antiviral inhibitors design

D. Aguilera-Pesantes, L.E. Robayo, P.E. Méndez, D. Mollocana, Y. Marrero-Ponce, F.J. Torres, M.A. Méndez

PII: S0006-291X(17)30566-1
DOI: 10.1016/j.bbrc.2017.03.107
Reference: YBBRC 37494
To appear in: Biochemical and Biophysical Research Communications
Received Date: 3 December 2016
Revised Date: 1 March 2017
Accepted Date: 19 March 2017

Please cite this article as: D. Aguilera-Pesantes, L.E. Robayo, P.E. Méndez, D. Mollocana, Y. Marrero-Ponce, F.J. Torres, M.A. Méndez, Discovering key residues of dengue virus NS2b-NS3-protease: New binding sites for antiviral inhibitors design, Biochemical and Biophysical Research Communications (2017), doi: 10.1016/j.bbrc.2017.03.107.

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT


Discovering Key Residues of Dengue Virus NS2b-NS3-Protease: New Binding Sites for Antiviral Inhibitors Design

D. Aguilera-Pesantes (a,b) · L. E. Robayo (a,b) · P. E. Méndez (b) · D. Mollocana (a,b) · Y. Marrero-Ponce (a,b,c) · F. J. Torres (a,b) · M. A. Méndez (a,b,c,*)

(a) Universidad San Francisco de Quito, Grupo de Química Computacional y Teórica (QCT-USFQ), Diego de Robles sn y Vía Interoceánica, 17-1200-841, Quito, Ecuador

(b) Universidad San Francisco de Quito, Instituto de Simulación Computacional (ISC-USFQ), Diego de Robles sn y Vía Interoceánica, 17-1200-841, Quito, Ecuador

(c) Universidad San Francisco de Quito, Grupo de Medicina Molecular y Traslacional (MeM&T), Escuela de Medicina, Colegio de Ciencias de la Salud (COCSA), Av. Interoceánica Km 12 ½ y Av. Florencia, 17-1200-841, Cumbayá, Quito, Ecuador

*Corresponding author: [email protected]


Abstract

The NS2b-NS3 protease is essential for the Dengue Virus (DENV) replication process. This complex constitutes a target for efficient antiviral discovery because a drug could inhibit the viral polyprotein processing. Furthermore, since the protease is highly conserved among the four Dengue virus serotypes, it is probable that a drug would be equally effective against all of them. In this article, we report a strategy that allowed us to identify residues that influence the function of the Dengue NS2b-NS3 protease. Moreover, this strategy could be applied to virtually any protein to search for alternative influential residues and to develop non-competitive inhibitors. First, we incorporated several features derived from computational alanine scanning mutagenesis, sequence and structure conservation, and other structure-based characteristics. Second, these features were used as variables to obtain a multilayer perceptron model that identifies defined groups (clusters) of key residues as possible candidate pockets for binding sites of new leads on the DENV protease. The identified residues included: i) amino acids close to the beta sheet-loop-beta sheet known to be important in its closed conformation for NS2b; ii) residues close to the active site; iii) several residues evenly spread on the NS2b-NS3 contact surface; and iv) some inner residues most likely related to the overall stability of the protease. In addition, our list of residues agrees with previously identified amino acids that are part of a highly conserved peptide studied for vaccine development.

Keywords: Dengue Virus; DENV NS2b-NS3 Protease; bindability sites; protease inhibitor; Computational Alanine Scanning Mutagenesis; Machine Learning; multilayer perceptron

1 Introduction


In spite of the efforts of many research groups, no antiviral drugs are currently available against dengue viral infections [1-3]. Furthermore, DENV diseases are considered neglected tropical diseases by the WHO [4]. DENV diseases, such as dengue fever, dengue hemorrhagic fever (DHF), or dengue shock syndrome (DSS), occurring from 2013 onward [3, 5-11] constitute a heavy social and economic burden for many countries. The etiologic agent of dengue is a virus belonging to the genus Flavivirus, with four known serotypes. That is, DENV co-circulates as a complex of four closely related but antigenically distinct serotypes (DENV1-4), all of which are etiologic agents of dengue fever and of the life-threatening DHF and DSS. Dengue fever is characterized by high fever, headache, joint pain, fatigue, swollen lymph nodes, and skin rashes [12]. The most prevalent of the four dengue serotypes is DENV-2 [12]. Globally, nearly 2.5 billion people are at risk of dengue virus infection, and over 100 million infections are reported annually. Dengue virus infection also causes thousands of deaths each year in areas where it is endemic [13]. Therefore, control strategies against DENV have become a global health priority [6, 14]. Many efforts have been made at the industrial and academic levels in recent years [6, 15-17]. However, only one Dengue vaccine has been approved to date, and basic research is still needed for the mitigation of DENV [18]. Thus, there is an immense ongoing interest in developing antiviral agents to fight diseases caused by DENV [1].

The NS3 protease domain (NS3Pro) is essential for the DENV replication process [6]. NS3 is a nonstructural protein that has two domains [1,6]. The first domain (NS3Pro) has a protease activity that helps with the processing of the DENV polyprotein. NS3Pro works together with a peptidic cofactor, NS2b, which is wrapped around NS3Pro, assembling a complex (NS2b-NS3Pro, or NS2b-NS3 for short) [2]. The NS2b cofactor is essential for substrate recognition and for maintaining the stability of the complex [1,2]. The second domain has a helicase activity that works independently from the NS3Pro domain [14]. The DENV NS2b-NS3 protease complex is considered a primary target for the design of antiviral agents [6]. In fact, at least ten clinically available inhibitors of the HIV and Hepatitis C virus proteases have been successfully developed and put on the market [19]. At present, most of the DENV NS2b-NS3 inhibitors, including anthracene-based lead compounds, interact with the NS3Pro active site and P1 pocket [17]. However, many of these compounds present either weak activity or a low selectivity index [15]. These issues are attributed to the charged, shallow nature of the binding site and neighboring pockets [15].


The NS2b-NS3 protease has a catalytic triad (His51, Asp75, Ser135) in the active site (see Figure SI-1). However, this enzyme has an altered specificity relative to other trypsin-like serine proteases. The NS2b-NS3 protease recognizes only sites that contain two cationic residues, whereas trypsin recognizes sites containing a single cationic residue; this difference has created the need for new classes of inhibitors targeting the NS3Pro active site. One of the main hurdles in early drug development against this target has been that the DENV protease active site is flat. For an inhibitor to bind, the substrate binding site requires a substantial conformational change of the NS2b fragment. As a consequence, designing inhibitors by structure-based design has been challenging [20-21]. On the other hand, some authors have reported an approach that scans the NS2b-NS3Pro surface by cysteine mutagenesis and then uses cysteine-reactive probes to identify regions of the protein that are susceptible to allosteric inhibition [22]. This method identified a new allosteric site using a circumscribed panel of just eight cysteine variants and only five cysteine-reactive probes [22]. Overall, strategies for identifying allosteric sites may be useful to overcome the challenges of drug design targeted towards NS2b-NS3Pro [16, 22-25]. Another challenge in finding a good drug lead is that certain inhibitors recognize the Arg in the P1 pocket present in human proteases such as trypsin, thrombin, and elastase [15]. This difficulty also suggests the need for detecting novel sites, other than the active site, that can be targeted by inhibitors, namely non-competitive inhibitors [26-30]. Structure-based analysis and other computational approaches might shed light on this particular problem [15].

Hence, the main aim of the present report is to obtain a detailed computational description of the DENV NS2b-NS3 protease in order to identify residue clusters with the potential to interact with drug-like molecules. The contribution of this research is twofold: i) to identify possible relevant residues and sites for the functionality of the enzyme using a novel computational strategy, and ii) to present a list of sites specifically on the DENV protease that could provide bindability sites for drug development [31]. Therefore, here we report an ensemble in silico approach as an integrative tool to assess key residues that may form bindable sites for drug-like molecules.

2 Models and Methods

2.1 The DENV protease model

The methodology followed here employed a crystallographic 3D structure of the DENV protease, corresponding to the serotype 3 DENV protease (PDB code 3U1I, which was described in detail by Nitsche et al. [6]). This structure corresponds to the NS3Pro protein folded in a conformation obtained by co-crystallizing the complex with a peptide-like ligand. Other available structures may give additional insights, but many do not include the protein in its functional fold bound to a ligand, as is the case for the 3U1I structure [6]. Therefore, the DENV NS2b-NS3 protease crystal structure (3U1I) was chosen for our computational studies.


This reported structure is a dimer that contains two highly similar protease/cofactor assemblies. For the present study only the A and B chains were used. Since these chains were originally bound to a peptidic inhibitor that is highly similar to the original substrate of the protease, A and B most likely resemble the active conformation [1]. The C and D chains are very similar to chains A and B (RMSD of 0.47 Å) [1]; chains C and D were deleted to decrease the computational time of the analysis. The ligands (E, F) were also deleted since their presence was not needed (the backbone was not relaxed or modified by any of the procedures used), and the ligands could have interfered with some of the calculations that we performed later, such as the computational alanine scanning mutagenesis [32]. Figure SI-1 shows this structure, where the three most important residues for the catalytic activity of the enzyme (His51, Asp75, and Ser135) are highlighted. The complete sequence of the protein and the protein structure preparation are described in the Supplementary Information (SI1). All representations and diagrams were created using Maestro (Schrödinger LLC) or Visual Molecular Dynamics Software, version 1.9.1 [33].
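The chain selection described above (keeping chains A and B, discarding chains C and D and the ligand chains E and F) can be illustrated with a minimal sketch that filters the coordinate records of a PDB file by chain identifier. The function name `keep_chains` is hypothetical, and this is not the preparation workflow used in the study, which relied on the Schrödinger tools; the only format fact assumed is that the chain identifier occupies column 22 of ATOM, HETATM, ANISOU, and TER records.

```python
COORDINATE_RECORDS = ("ATOM", "HETATM", "ANISOU", "TER")


def keep_chains(pdb_lines, chains=("A", "B")):
    """Return the PDB lines with coordinate records limited to the given chains.

    Hypothetical helper for illustration only.
    """
    kept = []
    for line in pdb_lines:
        record = line[:6].strip()
        if record in COORDINATE_RECORDS:
            # Chain identifier is column 22 of the PDB format (string index 21).
            if len(line) > 21 and line[21] in chains:
                kept.append(line)
        else:
            # Header, remark, and other non-coordinate records pass through.
            kept.append(line)
    return kept
```

Applied to the lines of the 3U1I entry with the default arguments, such a filter would leave only the NS2b and NS3Pro chains of the first assembly, assuming the chain labels match those given in the text.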

2.2 Computation of features for each amino acid in DENV protease


Four main tools were used to compute the features for each amino acid in the DENV protease. The FASTA sequence obtained from PDB ID 3U1I was used as input for the first two tools; for the others, the input was the prepared crystal structure described above. First, the mutational susceptibility tool within the Phyre2 platform, based on SuSPect, predicts whether a missense mutation is likely to have a functional/phenotypic effect [34]. Second, RaptorX was employed to predict whether a residue binds to a known ligand. This predictive tool may be especially helpful for proteins where little is known about the catalytic site. In brief, it identifies residues that are more likely to be part of a binding site with a known molecule [40]. We analyzed all residues in the DENV protease sequence for both structures, NS2b and NS3Pro, at the same time. Third, the BioLuminate Residue Analysis Tool was used to calculate some physicochemical properties (the Solvent Accessible Surface Area (SASA), the hydropathy, and the residue charge) of each residue of NS2b and NS3Pro [35, 36]. Finally, computational alanine scanning mutagenesis (CASM) was used to obtain the protein binding affinity and protein stability predictions.


To prepare the system and to compute the features, the steps described below were performed. PROPKA in the Protein Preparation Wizard was used for the correct assignment of the protonation states. Next, CASM uses the Prime/Molecular Mechanics Generalized Born Surface Area (Prime/MM-GBSA) method [35, 36], which performs a series of minimizations of the receptor and ligand, as well as point energy calculations. Prime MM-GBSA uses an implicit VSGB (continuum) solvation model, the OPLS2005 force field, Prime (version 3.1, Schrodinger, LLC, New York, NY, 2012), and a rotamer search algorithm, all included in the BioLuminate suite (version 1.0, Schrodinger, LLC, New York, NY, 2012). For each mutation, energy minimizations are carried out according to the standard MM-GBSA protocol. With this approach, it is feasible to determine whether a certain residue is important for the ligand-receptor interaction. The underlying principle behind this analysis is that the total ∆G binding will decrease when an important residue is replaced by alanine [35]. If a certain residue is important for the protein stability, the total ∆G stability will decrease when it is mutated to alanine. These two parameters gave us insights into the importance of a residue for the whole protein structure, including the interaction between NS2b and NS3Pro [37].

According to the MM-GBSA method [36] for alanine scanning mutagenesis, the first step is to individually compute the energies of: i) the NS3Pro system, ii) the NS2b system, and iii) the NS2b-NS3Pro system for the wild type proteins. This is followed by the calculation of the energies of the same three systems but, depending on the case, with one of the protein partners mutated: for example, the NS3Pro system mutated, the NS2b system native, and the complex of native NS2b with mutated NS3Pro. The ∆ affinity score is obtained by comparing the energies of the wild type systems with those of the mutant systems; the process is based on a thermodynamic cycle described elsewhere [36]. All these calculations are done automatically by the program, which directly outputs the desired values of ∆G binding and ∆ affinity. The computation was performed for all available residues. In all cases we used a cutoff of 4 kcal/mol to define a hotspot, referred to later as a "high" ∆ affinity or "high" ∆ stability [32]. Finally, it is important to point out that all the NS2b-NS3Pro residues were mutated with the Alanine Scanning Mutagenesis BioLuminate tool, except for the outermost residues (NS2b: Asp50, Asp88; NS3: Gly1, Gly0, Ser1, Gly2, Val3, Leu4, Trp5, Asp6, Gln167, Thr168, Asn169, Ala170, Glu171). Furthermore, the protein sites that already had an alanine residue in the wild type were not considered.

The structure conservation study considering the NS2b cofactor cannot be made with all serotypes because of the lack of crystal structures for serotype 4 [6]. As additional amino acid features for the classifier, we also used the conservation of the NS2b-NS3 residues in other related viruses found in the same sero-complex group, though in a different sero-complex than the DENV serotypes [29]. The residue sequences of NS2b and NS3Pro were analyzed individually using Phyre2 [38]. We used Phyre2 because it has several protein analysis tools that allowed us to extract some of the residue features.
In this case, the DENV NS2b-NS3 protease was compared to the protease structures with the highest scores on Phyre2, such as the proteases of the other DENV serotypes, West Nile Virus, Murray Valley Encephalitis Virus, and Japanese Encephalitis Virus. To obtain the degree of residue conservation we used the advanced Phyre2 module known as Investigator. Then, we built a score with three possible values: a) high conservation, b) moderate conservation, and c) low conservation. These conservation scores are coded as 2, 1, and 0, respectively (Table 1).

Table 1 comes about here
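Two of the per-residue features described in Section 2.2 reduce to simple rules: the 4 kcal/mol cutoff that marks a "high" ∆ affinity or ∆ stability, and the 2/1/0 coding of the conservation level. The sketch below illustrates both. The function names are hypothetical, and two points are assumptions rather than statements from the text: the difference is taken as mutant minus wild type (so that a positive value means the alanine mutant is less favorable), and a value exactly at the cutoff counts as a hotspot.

```python
HOTSPOT_CUTOFF = 4.0  # kcal/mol, the cutoff stated in the text

# Conservation levels coded as in Table 1 of the paper.
CONSERVATION_CODES = {"high": 2, "moderate": 1, "low": 0}


def delta_energy(wild_type, mutant):
    """Energy change on mutation to alanine (assumed sign: mutant - wild type)."""
    return mutant - wild_type


def is_hotspot(delta, cutoff=HOTSPOT_CUTOFF):
    """Flag a residue as 'high' when the change reaches the cutoff (assumed >=)."""
    return delta >= cutoff


def encode_conservation(level):
    """Map a conservation level name to its numeric code."""
    return CONSERVATION_CODES[level.lower()]
```

For instance, a hypothetical residue whose binding energy goes from -50.0 to -44.5 kcal/mol on mutation has a change of 5.5 kcal/mol and would be flagged as a hotspot under these assumptions.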


2.3 Datasets and Classification based on Machine Learning Methods


To generate a holistic interpretation of the results of the previous analyses, we used all the evaluated properties as the features needed to train a machine learning algorithm. The aim was to identify residues similar to residues known to be important for the protein function. The preparation for implementing the classifier consisted of two main tasks. First, we searched the literature for reported changes in activity upon site-directed mutagenesis of a residue. Although five classes were initially established based on these literature findings, they were in the end condensed into three classes: A (total or high loss of activity), C (moderate loss of activity), and WT (activity similar to that of the wild type protein). Building these artificial classes was imperative since the experimental conditions differed among published works. Hence, using the published raw data would have been inappropriate. The classes corresponding to each residue in the training and validation set are presented in Table 1. Second, fourteen amino acid features were selected arbitrarily, including properties related to the structural stability of the protein, the conservation of the residue in other related viruses, the function of the protein, the accessibility of the residue to a drug, and predictions of whether the amino acid can bind to a known ligand. Table 1 shows these fourteen relevant amino acid features computed for the DENV protease (it includes the features reviewed in the previous section):


1) residue type, 2) position in the sequence for NS3 residues, 3) activity (NS3 or NS2b), 4) ∆ affinity (as defined above), 5) ∆ stability solvated (as defined above), 6) SASA, 7) hydropathy, degree of conservation of the protease amino acid with respect to 8) DENV1 M, 9) DENV 2 pH 8.5, 10) DENV2, 11) West Nile Virus, 12) Murray Valley EV, 13) Japanese Encephalitis Virus, and 14) RaptorX prediction of whether the amino acid binds to a known ligand (see Table 1).

Given these features, AMBIT Discovery v0.04 [39] was used to decide whether an instance (a residue and its calculated properties) is within the applicability domain of the available training/validation set. For the decision, we considered the consensus among the following methods: Euclidean distance, city-block distance, probability density distribution, and Hellinger distance. A brief description of each method is found in the Supplementary Information (SI1).
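The idea of a consensus applicability domain can be sketched for the two distance-based methods named above. This is an illustrative simplification, not the AMBIT Discovery implementation: the threshold rule (a query is in-domain if it lies no farther from the training centroid than the farthest training instance) is an assumption, the probability density and Hellinger distance methods are omitted, and all function names are hypothetical.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def in_domain(train, query, metric):
    """Assumed rule: in-domain if the query is no farther from the training
    centroid than the farthest training instance under the given metric."""
    center = centroid(train)
    threshold = max(metric(row, center) for row in train)
    return metric(query, center) <= threshold


def consensus_in_domain(train, query, metrics=(euclidean, city_block)):
    """Accept an instance only if every method places it inside the domain."""
    return all(in_domain(train, query, metric) for metric in metrics)
```

Under this rule, an instance rejected by at least one method would be treated as outside the applicability domain, which mirrors the consensus criterion described in the text.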


3 Results


Several machine learning algorithms were tested, depending on their availability in the Waikato Environment for Knowledge Analysis (WEKA) v.3.6.13 [40]; among them were Random Forest, Multilayer Perceptron, LAD Tree, and the Voting Feature Interval algorithm. The multilayer perceptron (MLP) algorithm used is a feedforward neural network (NNW) trained with backpropagation. The MLP was automatically built by WEKA. The parameters and conditions used for the algorithm included a learning rate of 0.3, a momentum term of 0.2, and a nominal-to-binary filter. All the attributes were normalized, as was the numerical class. A decay scheme was not applied in the present study. For validation, all machine learning algorithms were tested using an internal cross-validation protocol consisting of a leave-one-out cross-validation. The training and validation set consisted of 40 residues (5 of NS2b and 35 of NS3); see Table 1 [41]. In addition, we evaluated how informative each individual feature was in each model by eliminating one feature at a time, repeating the cross-validation protocol, and observing the effect on the performance metrics. In a first evaluation, we found that including the NS2b residue number/position in the set decreased the quality of our model. This happens because the numerical values for NS2b overlap with the values for NS3. For one set of experiments, the NS2b values of the feature "residue number" were replaced by blank spaces in the final models, since the algorithm is able to handle missing values. Alternatively, we created a second set of models where the NS2b residues were removed from the training/validation set and from the test set. The parameters for generating the second trained model (MLP 2) were the same as for the first trained model (MLP 1).
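The leave-one-out protocol mentioned above holds out each instance once, trains on the remaining instances, and predicts the held-out one. The sketch below shows the loop in plain Python. To keep it short and self-contained, a 1-nearest-neighbour rule is used as a stand-in classifier; this is an assumption made for illustration and is not the WEKA multilayer perceptron used in the study. All function names are hypothetical.

```python
import math


def nearest_neighbour_predict(train_x, train_y, query):
    """Stand-in classifier (not the MLP): label of the closest training instance."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def leave_one_out(x, y, predict=nearest_neighbour_predict):
    """Hold out each instance once, train on the rest, collect the predictions."""
    predictions = []
    for i in range(len(x)):
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        predictions.append(predict(train_x, train_y, x[i]))
    return predictions


def accuracy(y_true, y_pred):
    """Fraction of held-out instances that were predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Any classifier with the same `predict(train_x, train_y, query)` shape could be passed in place of the stand-in, so the same loop applies regardless of the learning algorithm being validated.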

3.1 Structure based analysis

A summary of the results of the feature calculations for each residue mutated to Ala is presented in Table 1 for the residues where experimental site-directed mutagenesis information was available. As discussed above, the ∆ affinity was chosen as a feature for the classification since it helps to identify whether a residue is important for the interaction between NS2b and NS3Pro. Nevertheless, for this particular case study (NS2b-NS3Pro) the number of contacts between the two proteins is large in proportion to the number of NS2b residues under study. As a consequence, almost all of the NS2b residues showed a high ∆ affinity, making it impossible to discriminate hot spots based on the ∆ affinity criterion alone. ∆ stability was therefore selected as an additional feature to provide insights into the importance of a residue for the overall stability of the active conformation under study. For example, Asn152 (a residue found to bind some ligands [1]) was predicted to have a high ∆ affinity. In fact, this residue interacts with NS2b-Gly82 via a hydrogen bond, which agrees with the high interaction energy calculated. In the case of His51, known to be part of the catalytic triad, a high ∆ stability energy is observed. Considering that His51 could be positively charged and NS2b-Asp81 negatively charged, these residues could form a salt bridge, which would explain the high ∆ stability calculated. In contrast, Ser135 did not show a high ∆ stability energy in spite of being a catalytic residue. Overall, there was no simple correlation between the ∆ affinity or ∆ stability energy and the importance of the residue for the activity of the enzyme. These observations justified our decision to use ∆ stability and ∆ affinity as two features among many others of equal or greater informative power. Indeed, the significance or influence of a particular residue for the function of the DENV protease should be assessed on several properties rather than on a single feature.

3.2 Similarity analysis based on sequence and structure homology


Residues that are more likely to be part of a binding site with a known molecule were identified using RaptorX-binding [42]. In Table 1 these values are reported as 1 or 0 to denote whether or not a residue was predicted to bind to a known ligand (see SI1.3 and Figure SI2). Twenty-two residues were predicted to be part of a binding site, including Asp75, His51, and Ser135 (the three residues of the catalytic triad). Moreover, the residues identified by RaptorX as potentially belonging to a binding site were analyzed with the Phyre2 tools; the results showed that 12 of the 22 residues were highly conserved among the four main DENV examples. In general, the residues of the NS2b cofactor are not highly conserved among the DENV serotypes studied. When we clustered the residues that are highly conserved and have a high ∆ affinity energy, we found that only six NS3 residues are conserved in all four DENV examples studied: Gly21, Tyr23, Ile25, Phe46, Leu58, and Asn152. Figure SI-3 depicts the interaction diagram of some of these residues. Similarly, we clustered the residues that are highly conserved and have a high ∆ stability energy; the residues found were all neutral amino acids: His51, Val52, Thr53, Gly133, Ser135, Tyr150, Gly151, and Gly153. All these residues are accessible to the solvent (SASA evaluation).

3.3 Machine Learning classification


We have developed a classifier in order to corroborate previous findings, to suggest residues that may form pockets of importance as targets for novel inhibitors, and to deepen our understanding of the biological function of the enzyme. Briefly, in previous studies the search for competitive as well as non-competitive inhibitors has been pursued actively [16]. For example, Lys74, Leu149, and Asn152 were previously identified to be in a non-active-site pocket [43]. In the case of Lys74, directly bonded to Asp75, it has been suggested that the formation of a hydrogen bond between Lys74 and one tested inhibitor may have caused a conformational change in Asp75 that altered the catalytic activity [43]. Recently, another inhibitor has been found for this pocket close to the active site [44]. Other inhibitors directed to pockets close to the active site inhibit all four serotypes, suggesting that it is possible to develop drug-like molecules against the four serotypes [45]. In fact, our training/validation set already included the residues Leu149 and Asn152, both of which were assigned to class A as residues with a relevant change in enzyme activity when mutated (see Table 1). These two residues exemplify the correspondence between the criteria used for our class definition and the potential of a specific residue to become part of a non-active-site pocket for inhibitor development in certain cases. According to the AMBIT v0.04 analyses, only two residues fell outside the applicability domain for the set of selected attributes in at least one method (see S.I. Table 1): NS3-Glu12 and NS2b-Asp88 (see SI1.4 for further information). For this reason these two residues were not included in the test set.

Initially, several classification techniques were explored with the WEKA software in order to discard the techniques that performed badly on our particular problem. Finally, the four best algorithms were selected for further discussion of their ability to classify the available data: a) random forest (RF) [46], b) least absolute deviation tree (LAD Tree) [47], c) voting feature interval (VFI) [48], and d) multilayer perceptron (MLP). As discussed previously, the first attempts to classify into five categories resulted in classifiers that assigned most residues to the category with the largest number of data points, and the performance of the models was poor because the groups were unbalanced. Accordingly, we decided that there were not enough data for five categories, and merged the categories into three. Categories A and B were merged into class A, category C was left as class C, and categories D and WT were merged into class WT. In addition, one of the features (number of the residue, N) was found to be important but could not be used for both proteins at the same time. As a consequence, we decided to perform two classifications: the first where N for NS3Pro was included as a feature, and the second where the NS2b residues were excluded altogether. The results of both classifications are reported in Table 2. For these new data, validation, and training sets, we found that the best performance was obtained by the multilayer perceptron (MLP 2 model). Here, we mainly report on this algorithm, which best classifies the data. A summary of results for the other methods is found in S.I. Table 2.
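The merging of the five initial categories into three classes described above is a fixed relabeling, which the following minimal sketch makes explicit. The mapping follows the text (A and B to A, C unchanged, D and WT to WT); the function names are hypothetical, and counting the labels afterwards is only a convenient way to inspect how balanced the classes are.

```python
from collections import Counter

# Five initial categories merged into three classes, as stated in the text.
MERGED_CLASS = {"A": "A", "B": "A", "C": "C", "D": "WT", "WT": "WT"}


def merge_classes(labels):
    """Relabel a list of five-category labels with the three merged classes."""
    return [MERGED_CLASS[label] for label in labels]


def class_counts(labels):
    """Count instances per class, e.g. to check how unbalanced the set is."""
    return Counter(labels)
```

Comparing `class_counts` before and after `merge_classes` on a labeled set shows directly how the merge reduces the number of sparsely populated categories.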


In addition, we evaluated how informative each variable (feature) was for the predictions of the classifiers. The features, as indicated in the methodology, were carefully selected to combine several strategies in order to identify influential residues important for the function of DENV-NS3Pro. The evaluation consisted of a series of training and validation runs in which the fourteen features were removed one at a time. We used the following classifiers: MLP, VFI, and LAD-T. Figure 1 and Figure SI-4 show the results using Recall as the performance metric. Each feature of MLP 1, once removed from the training, lowered the Recall for class A, as seen in Figure 1. The only exception to this behavior was the conservation with Japanese Encephalitis Virus. In the cases of LAD-Tree and VFI, removing several of the features had no effect on the Recall, suggesting that for these methods those features did not contribute significantly to the predictions (see Figure SI-4). In addition, for all methods studied, removing the feature "name" (the identity of the residue) had a considerable effect on the capacity of the algorithm to correctly predict class A. In another experiment, when we eliminated the NS2b residues from our validation/training set, the Recall improved significantly (Figure 1). This was expected since, unfortunately, we had just three labeled NS2b instances in the validation/training set. Nevertheless, the fact that the Recall value is still high when the NS2b residues are taken into account provides evidence that several of the features calculated for NS3 were informative enough to help classify residues of the NS2b polypeptide.

Figure 1 comes about here
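The feature evaluation described above (removing one feature at a time, repeating the cross-validation, and comparing the Recall for class A) can be sketched as follows. As a loudly stated assumption, a 1-nearest-neighbour rule with leave-one-out validation stands in for the WEKA classifiers used in the study, so the numbers it produces are not comparable to those in Figure 1. All function names are hypothetical.

```python
import math


def predict_1nn(train_x, train_y, query):
    """Stand-in classifier (not MLP, VFI, or LAD Tree): nearest training label."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def loo_predictions(x, y):
    """Leave-one-out predictions using the stand-in classifier."""
    return [
        predict_1nn(x[:i] + x[i + 1:], y[:i] + y[i + 1:], x[i])
        for i in range(len(x))
    ]


def recall(y_true, y_pred, positive):
    """Recall = TP / (TP + FN) for the chosen class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else 0.0


def drop_feature(x, index):
    """Copy of the feature matrix with one column removed."""
    return [row[:index] + row[index + 1:] for row in x]


def ablation_recall(x, y, feature_names, positive="A"):
    """Recall for the positive class with all features and with each one removed."""
    results = {"all features": recall(y, loo_predictions(x, y), positive)}
    for j, name in enumerate(feature_names):
        reduced = drop_feature(x, j)
        results["without " + name] = recall(y, loo_predictions(reduced, y), positive)
    return results
```

A feature whose removal lowers the Recall for class A is informative for that class under the chosen classifier, while a feature whose removal leaves the Recall unchanged contributes little, which is the reading applied to Figure 1 and Figure SI-4 in the text.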

Performance measures and model validation

The confusion matrices for MLP, LAD-T, and VFI are reported in Table 3; the Precision and Recall values are reported in Table 4. If we compare only the Recall values, the highest value is reached when the MLP is trained on a set that includes only the NS3 residues (MLP 2). Nevertheless, obtaining some classification for the NS2b residues would also be important in the search for novel regions that are important for the function of the complex. Our hypothesis was that we could produce meaningful predictions for NS2b from the NS3 training data because several of the NS3 features should also be meaningful in the NS2b case. For example, if an NS3 residue has a high ∆ affinity and high conservation with the DENV sequences analyzed, and as a consequence belongs to class A, the same should apply to an NS2b residue. A lower Recall and Precision were obtained (0.722 vs 0.867, and 0.650 vs 0.733) when we included the NS2b residues (only three) in the training/validation set. An internal cross-validation protocol consisting of a leave-one-out cross-validation was performed (see all performance measures in Table 5). The classifications obtained by the two methods are reported in Table 2. The ROC curve for the MLP trained with a training set of 38 instances (only NS3Pro residues), also validated through leave-one-out cross-validation, shows very good behavior (ROC area 0.867), as shown in Figure SI-5.

Predictions


On the DENV NS3 protease, both MLP models identified the residues Tyr23, Gly37, Phe46, Thr48, His51, Thr53, Leu58, Asp75, Tyr79, Trp89, and Thr156 as the eleven highest-ranked class A residues (see Figures 2 - 5, SI-6 - 9). Additional residues from both chains classified as class A with a score higher than 0.9955 are reported in Table 2 and Figures SI 6-8. Here, we attempt to find the rationale behind the obtained classification by analyzing some of the features and the distribution of the influential residues. First, Figure 2 shows all residues located on the surface of the complex, which can be grouped into approximately three clusters. The first cluster is close to the NS2b loop, previously identified in the literature to be in a closed conformation in a functional protease (see Figure 3). Furthermore, as expected for a classifier that performs well on the system, several important residues identified are clustered at the experimental active site (see Figure 4). For example, residues Val52 and Thr53 from this cluster are exposed to the solvent (SASA 15.55 and 8.18, respectively; Figure 4); both residues are in close proximity to the catalytic site. The calculations showed that both residues contribute to the overall stability of the complex. In addition, Thr53 forms a hydrogen bond with Tyr79 and is in close proximity to NS2b-His72. Consistent with this last observation, Thr53 presents a high ∆ affinity, though this residue was not conserved in all four DENV examples. Finally, the third cluster is made up of the NS2b residues. These are evenly distributed on the NS2b structure (see Figure 5).

Figures 2 and 3 come about here

In addition to the previous three clusters, further important residues are found in the interior of the protein. One such group in NS3Pro is formed by Leu58, Thr59 (exposed to the solvent), and Leu65 on a beta sheet-loop-beta sheet structure, as shown in Figure SI-7. This appears to be another important region, where Leu58 and Thr59 may contribute to the molecular recognition of NS2b, while Leu65 and Tyr79 contribute to the stability of the secondary structure of that region (Figure SI-7). NS2b-Arg84 is also in close proximity to this region and may contribute to the molecular recognition between the two peptides. This is supported by the observation that the ∆ affinity for NS2b-Arg84 was one of the highest. In addition, Phe46 presents a high ∆ affinity and a hydrophobic interaction with NS2b-Val53 (see Figure SI-7). NS2b-Val53 is not conserved among the DENV serotypes; in practice, Phe46 on NS3 would therefore be a better target than the residue on NS2b.

Figure 4 comes about here

Figure 5 comes about here

A general observation is that, as in the previous cases, several residues classified as influential on NS3Pro interact with NS2b residues, but conservation on the NS2b side is low; as a consequence, these NS2b residues were not classified as viable influential targets. Another example is NS2b-His72, which was not identified as important and is only moderately conserved among the DENV serotypes, but lies in close proximity to two NS3 residues classified as important, NS3-Phe116 and NS3-Thr156 (see Figure 5). This suggests that this position may constitute a bindable site from the NS3 perspective. Finally, a possible important NS2b-NS3Pro recognition site is formed by NS3-Phe116 and NS3-Thr156 (exposed to solvent), as shown in Figure 5.
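Since conservation among serotypes drives several of the conclusions above, the following sketch shows one simple way a per-position conservation label could be derived from aligned sequences. It is an assumption-laden illustration: the function names, the toy alignment, and the cutoffs used to assign NC, C, and HC are hypothetical and are not the criteria used for Table 1.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue at each column."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be aligned to the same length")
    fractions = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]  # gaps never count as matches
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        fractions.append(top / len(aligned))
    return fractions


def conservation_label(fraction, conserved=0.75, highly_conserved=1.0):
    """Map a fraction to NC / C / HC using assumed (hypothetical) cutoffs."""
    if fraction >= highly_conserved:
        return "HC"
    if fraction >= conserved:
        return "C"
    return "NC"


# Toy alignment of four hypothetical sequences, not real DENV data.
toy_alignment = ["GAVT", "GAIT", "GSLT", "GA-T"]
fractions = column_conservation(toy_alignment)
labels = [conservation_label(f) for f in fractions]
print(labels)
```

A position that is variable on the cofactor side would receive NC under this scheme even when its interaction partner on NS3 is labelled HC, which is the situation described for NS2b-Val53 and NS3-Phe46.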

4 Discussion

The finding of novel sites on the DENV protease that could be targeted by drug-like molecules constitutes an important step towards the development of clinically effective noncompetitive inhibitors; until now, compounds directed at or near the active site have not generated clinically viable drugs. Methods that predict ligand binding sites from conservation alone have shown low precision because many non-binding residues are highly conserved owing to other functions in the protein [49]. Other methodologies use 3D information to predict ligand binding sites (through template-based or ab initio approaches) [49]. Recently, methods that combine these two general approaches, such as those incorporating support vector machines, have emerged with the aim of genome-wide protein predictions [49]. In this report a different path is proposed: instead of developing a new tool applicable to a broad range of proteins, the main focus is the use of publicly available tools to study, and make useful predictions on, a single protein of importance. This method allows us to combine 3D information based on experimental structures, conservation comparisons focused on related viruses, and a genome-wide support vector machine method (SuSPect). Data mining tools may help to unveil residues that are influential for protein function by correlating multiple characteristics of the amino acids constituting both the NS2b cofactor and the NS3Pro domain. We found that two MLP models were suitable to classify the NS3Pro residues into the following three groups: residues likely to cause a major change in activity (class A), residues likely to cause a moderate change in activity (class C), and residues with activity similar to wild type (class WT). The classifiers performed well at identifying class A, the main aim of this work.
However, they performed poorly on the WT class. Therefore, we treated all WT residues as not classified. More importantly, a similar overall approach has been used before on complex protein problems, where it gave comparable or better performance than previously published methods [50]. Specifically, that approach assisted the design and engineering of new proteins by studying the effect of introducing certain mutations [50]. In our case, we cannot make effective comparisons with experimental results for the complete amino acid test set, since no experimental values exist for most of these residues. Nevertheless, experimental data are available for some key residues, such as His51 and Asp75 (part of the catalytic triad). Neither residue was included in the training/validation set, so in our approach they served as controls. MLP 1 and MLP 2 classified these residues as class A, in agreement with the experimental fact that His51 and Asp75 are key residues for normal enzymatic function. Furthermore, for all clusters of residues classified as class A, sound chemical reasons for the classification can be found by analyzing their attributes; no chemical contradictions were found in the automatic assignment by the classifier model. Notably, Khan et al. studied all the residues of the DENV protease using a large screening of sequenced DENV samples to find peptides conserved in the majority of known DENV strains, with the aim of identifying feasible antigens for vaccine development [51]. The peptides identified were termed PanDenv peptides [51]. In NS3Pro, only one PanDenv peptide was found. In agreement with their finding, all the residues in this PanDenv peptide were either already in our training/validation set or, more importantly, classified as class A by MLP 1 and MLP 2 (see Table 2). In summary, the conserved residues (part of a PanDenv peptide) established from a large sample of different DENV strains agree completely with the residues found in this work by a faster and simpler methodology. This validates our results and shows the predictive capability of our models. Furthermore, it suggests that our overall methodology may be used to identify potential peptides suitable for vaccine development. This application would require introducing an a posteriori criterion on the predicted influential residues of a suitable protein candidate. For this purpose, a filter can be used that looks for contiguity of residues and fills the gaps in the possibly antigenic peptide sequence with residues from the training/validation set.
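The contiguity filter proposed above can be sketched as follows. This is only one possible reading of the idea, written under stated assumptions: the function name, the `min_length` parameter, and the residue numbers in the example are hypothetical, and no such filter is described in further detail in the text.

```python
def candidate_peptides(predicted, training, min_length=5):
    """Find contiguous residue stretches built from predicted and training residues.

    predicted: residue numbers classified as influential (class A).
    training:  residue numbers from the training/validation set; they are
               only used to bridge gaps between predicted residues.
    Returns inclusive (start, end) tuples for stretches of at least
    min_length residues that contain at least one predicted residue.
    """
    predicted, training = set(predicted), set(training)
    runs, current = [], []
    for number in sorted(predicted | training):
        if current and number != current[-1] + 1:
            runs.append(current)  # a break in numbering closes the stretch
            current = []
        current.append(number)
    if current:
        runs.append(current)
    return [
        (run[0], run[-1])
        for run in runs
        if len(run) >= min_length and predicted.intersection(run)
    ]


# Hypothetical residue numbers, for illustration only.
predicted_class_a = [10, 11, 13, 14, 20]
training_set = [12, 30]
peptides = candidate_peptides(predicted_class_a, training_set)
print(peptides)
```

In this toy case residue 12 from the training set bridges the gap between 11 and 13, giving a single five-residue candidate; the isolated residues 20 and 30 are discarded by the length requirement.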
The significant agreement between predicted and available experimental data (see Table 2) shows that the model performs well at predicting activity hot spots of the complex (regions or single residues where changes have a significant impact on enzyme activity). The results show that our methodology is viable for identifying residues influential for function and bindable sites for drug-like molecules. The fact that the MLP algorithm was the most appropriate of the sampled methods is consistent with its properties. In particular, since attributes were chosen mainly by expert decision, it is of great value that MLP can deal effectively with irrelevant attributes. This important property is not found in more advanced methods such as support vector machines or RBF NNW [52]. Concerning our methodology, the decision to employ only the default minimization steps in Prime MM-GBSA is supported by the careful comparison reported by Beard et al. [53]. These researchers found that minimizing only the side chain of interest produced the best correlation between calculated and experimental protein-protein binding affinities when using point mutations and Prime MM-GBSA [53]. In fact, mutating an amino acid to alanine, always a smaller residue, avoids clashes with neighboring atoms; consequently, a moderate minimization step is sufficient. Therefore, although other methods may be used to characterize the effect of a point mutation on the binding affinity between two proteins, the Prime MM-GBSA method was shown to perform very well at a fraction of the computational cost [53]. Another requirement for the success of the methodology is the correct protonation state of the amino acids, a property with a great impact on the binding energy. PROPKA, as available in the Protein Preparation Wizard, was used to assign these protonation states.
Previously, PROPKA has been shown to be reliable for assignments on entire proteins [53]. We found that, although computational alanine scanning mutagenesis with Prime MM-GBSA provides great insight into a protein, it is not sufficient on its own. For this reason, other attributes such as conservation-based features were also added to the model.
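As a rough illustration of how an alanine-scanning result can be turned into a per-residue ∆ affinity attribute, the sketch below subtracts the wild-type binding free energy from that of each alanine mutant and ranks the sites. It does not call Prime MM-GBSA; the function names, the energies, and the sign convention (a positive value meaning weaker binding after mutation) are assumptions made for this example.

```python
def delta_affinity(dg_bind_mutant, dg_bind_wild_type):
    """Change in binding free energy on mutation (same units as the inputs).

    Assumed convention: positive means the mutant complex binds more weakly.
    """
    return dg_bind_mutant - dg_bind_wild_type


def rank_by_delta_affinity(dg_bind_wild_type, dg_bind_by_mutant):
    """Return (site, delta affinity) pairs, largest loss of affinity first."""
    deltas = {
        site: delta_affinity(dg, dg_bind_wild_type)
        for site, dg in dg_bind_by_mutant.items()
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Hypothetical binding free energies, for illustration only.
wild_type = -80.0
alanine_mutants = {"site_1": -70.0, "site_2": -79.5, "site_3": -82.0}
ranking = rank_by_delta_affinity(wild_type, alanine_mutants)
print(ranking)
```

On its own such a ranking only reflects the energetic contribution to the interface, which is why it is combined with conservation and other attributes in the classifier.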

The RaptorX tool was used to explore binding sites; residues more likely to be part of a binding site for a known molecule were identified [42]. The results were in good agreement with experimentally identified residues previously reported in the literature [27, 28, 54, 55]. In addition to the structure-based analysis, we performed a structure conservation analysis to locate residues less prone to the emergence of resistant viral strains; HIV and HCV, for instance, have shown rapid emergence of strains resistant to protease inhibitors [15]. The assumption is that the more conserved a residue is, the less likely it is to mutate without a great cost for the virus. Nonetheless, NS2b conservation is known to be relatively low [56]. In addition, the amino acid sequence conservation of NS3Pro is between 63% and 74% among all DENV serotypes [15]. On the other hand, this degree of conservation makes it more difficult to identify the non-obvious key residues of the enzyme. To overcome this, we also used as classifier features conservation predictions that take into account related viruses from a serocomplex different from that of the DENV serotypes [57].

In conclusion, the major contribution of this article is the identification or confirmation of several pockets on the native (catalytic) NS2b-NS3 protease. These sites can be used as targets for better broad-spectrum noncompetitive inhibitors that work for all four serotypes. Given the criteria used for their identification, the sites may also work for the proteases of other flaviviruses. In addition, we suggest that the computational approach presented here can be generalized to a broad range of scenarios where some, but not exhaustive, experimental information is available for a protein of interest. It can be applied to find novel binding sites other than the native catalytic site. Moreover, this can lead to the design of allosteric or noncompetitive inhibitors and help to overcome drug design challenges that originate from undesirable features of the catalytic site, such as a charged nature or a shallow topology, as in the NS2b-NS3 protease.

Funding

This research was partially funded by Chancellor Grant - USFQ 2014-2015 (granted to CZ, JT) and Chancellor Grant 2015-2016 (granted to MM).

Acknowledgements

The authors thank Universidad San Francisco de Quito for the use of the High Performance Computing System-USFQ. The authors also thank CECIRA-III-2015-02, Bioinformatica program from Consorcio Ecuatoriano para el Desarrollo de Internet Avanzado (CEDIA), Ecuador.

References

1. Christian G Noble, Cheah Chen Seh, Alexander T Chao, and Pei Yong Shi. Ligand-bound structures of the dengue virus protease reveal the active conformation. Journal of virology, 86(1):438-46, 2012.
2. Zhili Zuo, Oi Wah Liew, Gang Chen, Pek Ching Jenny Chong, Siew Hui Lee, Kaixian Chen, Hualiang Jiang, Chum Mok Puah, and Weiliang Zhu. Mechanism of NS2B-mediated activation of NS3pro in dengue virus: molecular dynamics simulations and bioassays. Journal of virology, 83(2):1060-70, 2009.
3. Mirta Roses Periago and Maria G. Guzman. Dengue y dengue hemorragico en las Americas. Revista Panamericana de Salud Publica, 21(3):187-191, 2007.
4. World Health Organization et al. Working to overcome the global impact of neglected tropical diseases: First WHO report on neglected tropical diseases. 2010.
5. Linfeng Li, Chandrakala Basavannacharya, Kitti Wing Ki Chan, Luqing Shang, Subhash G Vasudevan, and Zheng Yin. Structure-guided Discovery of a Novel Non-peptide Inhibitor of Dengue Virus NS2B-NS3 Protease. Chemical biology & drug design, (1):1-10, 2014.
6. Christoph Nitsche, Steven Holloway, Tanja Schirmeister, and Christian D Klein. Biochemistry and Medicinal Chemistry of the Dengue Virus Protease. Chemical Reviews, 114(22):11348-11381, 2014.
7. W Van Bortel, F Dorleans, J Rosine, A Blateau, D Rousset, S Matheus, I Leparc-Goffart, O Flusin, C Prat, R Cesaire, et al. Chikungunya outbreak in the Caribbean region, December 2013 to March 2014, and the significance for Europe. Euro Surveill, 19(13):20759, 2014.
8. Scott C Weaver. Arrival of chikungunya virus in the new world: prospects for spread and impact on public health. PLoS Negl Trop Dis, 8(6):e2921, 2014.
9. Adrian Rollins et al. WHO declares health emergency over Zika outbreak. Australian Medicine, 28(1):11, 2016.
10. Christopher Chang, Kristina Ortiz, Aftab Ansari, and M Eric Gershwin. The Zika outbreak of the 21st century. Journal of autoimmunity, 68:1-13, 2016.
11. R Aruna. Review on Dengue viral Replication, assembly and entry into the host cells. 3(11):1025-1039, 2014.
12. Duane J Gubler. Dengue and dengue hemorrhagic fever. Clinical microbiology reviews, 11(3):480-496, 1998.
13. Anthony SY Leong, K Thong Wong, Trishe YM Leong, Puay Hoon Tan, and Pongsak Wannakrairot. The pathology of dengue hemorrhagic fever. In Seminars in Diagnostic Pathology, volume 24, pages 227-236. Elsevier, 2007.
14. Dahai Luo, Subhash G Vasudevan, and Julien Lescar. The flavivirus NS2B-NS3 protease/helicase as a target for antiviral drug development. Antiviral research, 118(APRIL):148-158, 2015.
15. Siew Pheng Lim, Qing Yin Wang, Christian G. Noble, Yen Liang Chen, Hongping Dong, Bin Zou, Fumiaki Yokokawa, Shahul Nilar, Paul Smith, David Beer, Julien Lescar, and Pei Yong Shi. Ten years of dengue drug discovery: Progress and prospects. Antiviral Research, 100(2):500-519, 2013.
16. Choon Han Heh, Rozana Othman, Michael J C Buckle, Yusrizam Sharifuddin, Rohana Yusof, and Noorsaadah Abd Rahman. Rational discovery of dengue type 2 noncompetitive inhibitors. Chemical biology & drug design, 82(1):1-11, 2013.
17. Hemalatha Beesetti, Navin Khanna, and Sathyamangalam Swaminathan. Drugs for dengue: a patent review (2010-2014). Expert Opinion on Therapeutic Patents, 24(11):1171-1184, 2014.
18. Punnee Pitisuttithum and Alain Bouckenooghe. The first licensed dengue vaccine: an important tool for integrated preventive strategies against dengue virus infection. Expert Review of Vaccines, (just-accepted), 2016.
19. Erik De Clercq. The design of drugs for HIV and HCV. Nature Reviews Drug Discovery, 6(12):1001-1018, 2007.
20. Paul Erbel, Nikolaus Schiering, Allan D'Arcy, Martin Renatus, Markus Kroemer, Siew Pheng Lim, Zheng Yin, Thomas H Keller, Subhash G Vasudevan, and Ulrich Hommel. Structural basis for the activation of flaviviral NS3 proteases from dengue and West Nile virus. Nature structural & molecular biology, 13(4):372-373, 2006.
21. Donmienne Leung, Kate Schroder, Helen White, Ning-Xia Fang, Martin J Stoermer, Giovanni Abbenante, Jennifer L Martin, Paul R Young, and David P Fairlie. Activity of recombinant dengue 2 virus NS3 protease in the presence of a truncated NS2B cofactor, small peptide substrates, and inhibitors. Journal of Biological Chemistry, 276(49):45762-45771, 2001.
22. Muslum Yildiz, Sumana Ghosh, Jeffrey A. Bell, Woody Sherman, and Jeanne A. Hardy. Allosteric Inhibition of the NS2B-NS3 Protease from Dengue Virus. ACS Chemical Biology, 8(12):2744-2752, 2013.
23. Shyama Sidique, Sergey A Shiryaev, Boris I Ratnikov, Ananda Herath, Ying Su, Alex Y Strongin, and Nicholas DP Cosford. Structure-activity relationship and improved hydrolytic stability of pyrazole derivatives that are allosteric inhibitors of West Nile virus NS2B-NS3 proteinase. Bioorganic & medicinal chemistry letters, 19(19):5773-5777, 2009.
24. Paul A Johnston, Jennifer Phillips, Tong Ying Shun, Sunita Shinde, John S Lazo, Donna M Huryn, Michael C Myers, Boris I Ratnikov, Jeffrey W Smith, Ying Su, et al. HTS identifies novel and specific uncompetitive inhibitors of the two-component NS2B-NS3 proteinase of West Nile virus. Assay and drug development technologies, 5(6):737-750, 2007.
25. Young Mee Kim, Shovanlal Gayen, CongBao Kang, Joma Joy, Qiwei Huang, Angela Shuyi Chen, John Liang Kuan Wee, Melgious Jin Yan Ang, Huichang Annie Lim, Alvin W Hung, et al. NMR analysis of a novel enzymatically active unlinked dengue NS2B-NS3 protease complex. Journal of Biological Chemistry, 288(18):12891-12900, 2013.
26. Wan Na Chen, Karin V. Loscha, Christoph Nitsche, Bim Graham, and Gottfried Otting. The dengue virus NS2B-NS3 protease retains the closed conformation in the complex with BPTI. FEBS Letters, 588(14):2206-2211, 2014.
27. B Falgout, M Pethel, Y M Zhang, and C J Lai. Both nonstructural proteins NS2B and NS3 are required for the proteolytic processing of dengue virus nonstructural proteins. Journal of virology, 65(5):2467-2475, 1991.
28. Barry Falgout, Roger H Miller, and Ching-juh Lai. Deletion Analysis of Dengue Virus Type 4 Nonstructural Protein NS2B: Identification of a Domain Required for NS2B-NS3 Protease Activity. 67(4):2034-2042, 1993.
29. C.J. Lai, M. Pethel, L.R. Jan, H. Kawano, A. Cahour, and B. Falgout. Processing of dengue type 4 and other flavivirus nonstructural proteins. Archives of Virology, Supplementum, No.9:359-368, 1994.
30. D. Luo, T. Xu, C. Hunke, G. Gruber, S. G. Vasudevan, and J. Lescar. Crystal Structure of the NS3 Protease-Helicase from Dengue Virus. Journal of Virology, 82(1):173-183, 2008.
31. Robert P Sheridan, Vladimir N Maiorov, M Katharine Holloway, Wendy D Cornell, and Ying-Duo Gao. Drug like density: a method of quantifying the bindability of a protein target based
32. D S Gesto, N M F S A Cerqueira, M J Ramos, and P A Fernandes. Discovery of new druggable sites in the anti-cholesterol target HMG-CoA reductase by computational alanine scanning mutagenesis. Journal of molecular modeling, 20(4):2178, 2014.
33. William Humphrey, Andrew Dalke, and Klaus Schulten. VMD - Visual Molecular Dynamics. Journal of Molecular Graphics, 14:33-38, 1996.
34. Christopher M Yates, Ioannis Filippis, Lawrence A Kelley, and Michael JE Sternberg. SuSPect: enhanced prediction of single amino acid variant (SAV) phenotype using network features. Journal of molecular biology, 426(14):2692-2701, 2014.
35. Juan Du, Huijun Sun, Lili Xi, Jiazhong Li, Ying Yang, Huanxiang Liu, and Xiaojun Yao. Molecular modeling study of checkpoint kinase 1 inhibitors by multiple docking strategies and prime/MM-GBSA calculation. Journal of Computational Chemistry, 32(13):2800-2809, 2011.
36. Mani Srivastava, Harvinder Singh, and Pradeep Kumar Naik. Molecular modeling evaluation of the antimalarial activity of artemisinin analogues: molecular docking and rescoring using prime/MM-GBSA approach. Current Research Journal of Biological Sciences, 2(2):83-102, 2010.
37. Irina S. Moreira, Pedro A. Fernandes, and Maria J. Ramos. Unravelling Hot Spots: a comprehensive computational mutagenesis study. Theoretical Chemistry Accounts, 117(1):99-113, 2006.
38. Lawrence A Kelley, Stefans Mezulis, Christopher M Yates, Mark N Wass, and Michael J E Sternberg. The Phyre2 web portal for protein modeling, prediction and analysis. Nature Protocols, 10(6):845-858, 2015.
39. RS Boethling and J Costanza. Domain of EPI Suite biotransformation models. SAR and QSAR in Environmental Research, 21(5-6):415-443, 2010.
40. Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H Witten. The WEKA data mining software. ACM SIGKDD Explorations, 11(1):10-18, 2009.
41. Amir Navot, Lavi Shpigelman, Naftali Tishby, and Eilon Vaadia. Nearest neighbor based feature selection for regression and its application to neural activity. 2006.
42. Jian Peng and Jinbo Xu. RaptorX: exploiting structure information for protein alignment by statistical inference. Proteins, 79 Suppl 1(Suppl 10):161-71, 2011.
43. Rozana Othman, Tan Siew Kiat, Norzulaani Khalid, Rohana Yusof, E. Irene Newhouse, James S. Newhouse, Masqudul Alam, and Noorsaadah Abdul Rahman. Docking of noncompetitive inhibitors into dengue virus type 2 protease: Understanding the interactions with allosteric binding sites. Journal of Chemical Information and Modeling, 48(8):1582-1591, 2008.
44. Hongmei Wu, Stefanie Bock, Mariya Snitko, Thilo Berger, Thomas Weidner, Steven Holloway, Manuel Kanitz, Wibke E. Diederich, Holger Steuber, Christof Walter, Daniela Hofmann, Benedikt Weißbrich, Ralf Spannaus, Eliana G. Acosta, Ralf Bartenschlager, Bernd Engels, Tanja Schirmeister, and Jochen Bodem. Novel Dengue Virus NS2B/NS3 Protease Inhibitors. Antimicrobial Agents and Chemotherapy, 59(2):1100-1109, 2015.
45. Rajendra Raut, Hemalatha Beesetti, Poornima Tyagi, Ira Khanna, Swatantra K Jain, Variam U Jeankumar, Perumal Yogeeswari, Dharmarajan Sriram, and Sathyamangalam Swaminathan. A small molecule inhibitor of dengue virus type 2 protease inhibits the replication of all four dengue virus serotypes in cell culture. Virology journal, 12(1):16, 2015.
46. Leo Breiman. Random forests. Machine learning, 45(1):5-32, 2001.
47. Geoffrey Holmes, Bernhard Pfahringer, Richard Kirkby, Eibe Frank, and Mark Hall. Multiclass alternating decision trees. In Machine learning: ECML 2002, pages 161-172. Springer, 2002.
48. Gülşen Demiröz and H Altay Güvenir. Classification by voting feature intervals. In Machine Learning: ECML-97, pages 85-92. Springer, 1997.
49. Jianyi Yang, Ambrish Roy, and Yang Zhang. Protein ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. 2013.
50. Majid Masso and Iosif I. Vaisman. Accurate prediction of stability changes in protein mutants by combining machine learning with structure based computational mutagenesis. Bioinformatics, 24(18):2002-2009, 2008.
51. Asif M Khan, Olivo Miotto, Eduardo JM Nascimento, KN Srinivasan, AT Heiny, Guang Lan Zhang, ET Marques, Tin Wee Tan, Vladimir Brusic, Jerome Salmon, et al. Conservation and variability of dengue virus proteins: implications for vaccine design. PLoS Negl Trop Dis, 2(8):e272, 2008.
52. Ian H Witten and Eibe Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.
53. Hege Beard, Anuradha Cholleti, David Pearlman, Woody Sherman, and Kathryn A. Loving. Applying physics based scoring to calculate free energies of binding for single amino acid mutations in protein-protein complexes. PLoS ONE, 8(12):1-11, 2013.
54. J F Bazan and R J Fletterick. Detection of a trypsin like serine protease domain in flaviviruses and pestiviruses. Virology, 171(2):637-639, 1989.
55. R P Valle and B Falgout. Mutagenesis of the NS3 protease of dengue virus type 2. Journal of virology, 72(1):624-632, 1998.
56. Tawin Iempridee, Ratchanu Thongphung, Chanan Angsuthanasombat, and Gerd Katzenmeier. A comparative biochemical analysis of the NS2B(H)-NS3pro protease complex from four dengue virus serotypes. Biochimica et Biophysica Acta (BBA) - General Subjects, 1780(7):989-994, 2008.
57. Niklaus H Mueller, Nagarajan Pattabiraman, Camilo Ansarah-Sobrinho, Prasanth Viswanathan, Theodore C Pierson, and R Padmanabhan. Identification and biochemical characterization of small-molecule inhibitors of West Nile virus serine protease by a high throughput screen. Antimicrobial agents and chemotherapy, 52(9):3385-3393, 2008.
58. Wanisa Salaemae, Muhammad Junaid, Chanan Angsuthanasombat, and Gerd Katzenmeier. Structure guided mutagenesis of active site residues in the dengue virus two component protease NS2B-NS3. Journal of biomedical science, 17:68, 2010.
59. Pornwaratt Niyomrattanakit, Pakorn Winoyanuwattikun, Santad Chanprapaph, and Chanan Angsuthanasombat. Identification of Residues in the Dengue Virus Type 2 NS2B Cofactor That Are Critical for NS3 Protease Activation. Journal of virology, 78(24):13708-13716, 2004.

Tables/Figure Legends

Fig. 1: Analysis of how informative each attribute is. We report the recall for a set of experiments in which one feature at a time was left out of the training/validation set. The blue line corresponds to the recall for the class A classification. For class A instances (blue circles), a value under 0.8 indicates a loss in Recall when the indicated feature was left out. The yellow circle shows the 0.8 threshold; any value close to or above this line indicates a performance similar or superior to the model including all features. The standard experiment is at 12 o'clock (all) and includes all the features selected for the current study. Clockwise, we report the results when the following features were left out: mutational sensitivity derived from Phyre2 (No Sensibility); ligand binding derived from RaptorX (No Raptor); conservation with respect to control sequences/structures of JEV (No JEV), Murray Encephalitis Virus (No Murray), WNV (No WNV), DENV2 (No DENV2), DENV at pH 8.4 (No DENV), DENV1 (No DENV1); residue hydropathy (no hydropathy); SASA (no SASA); delta affinity (No affinity); delta stability (No Stability); protein chain (No Activity); name of the residue (No Name); number of the residue on NS3Pro (No N); and exclusion of all NS2b residues from the training/validation sets (No NS2b).

Fig. 2: Residues classified as important by the classifier. NS2b residues and NS3Pro residues are indicated with red labels and black labels, respectively. Only class A residues on the surface of the protein are highlighted (orange). Two different views are shown (the bottom view is a 90 degree clockwise rotation around the vertical axis). Asp75 and Hid51 are part of the catalytic triad (these two residues were not included in the training or validation sets; they act as positive controls). Protein representation: Blue ribbon NS3Pro, Red ribbon NS2b.

Fig. 3: Residues classified as important (in orange) by the classifier. Site 1. NS2b residues and NS3Pro residues are indicated with red labels and black labels respectively. Protein representation: Blue ribbon NS3Pro, Red ribbon NS2b.

Fig. 4: Residues classified as important (in orange) by the classifier. Site 2. NS2b residues and NS3Pro residues are indicated with red labels and black labels respectively. Protein representation: Blue ribbon NS3Pro, Red ribbon NS2b.

Fig. 5: NS3-Thr156, predicted as class A by the Multilayer Perceptron algorithm. Close-up of a possible bindable site. NS2b is shown as a transparent surface (NS2b-His72 and NS3-Phe116 are shown as a guide to the eye), while NS3Pro is shown as a solid red ribbon. The pocket shown here appears to be important for the binding of NS2b to NS3. Asp75 (in purple), from the catalytic triad, is shown as a guide to the eye.

Table 1: Training/Validation set and references from where the results used to define the classes were obtained. References: a[22], b[58], c[2], d[59], e[55].

Table 2: Higher probability residues predicted as class A by Multilayer Perceptron. p stands for probability value (p.v. > 0.95; in bold, p.v. > 0.99), where a value of 1 indicates that the classifier is absolutely certain that the instance belongs to the assigned class. For residues predicted by a multilayer perceptron trained and validated without any NS2b data, the predictions reported here (as Yes) correspond to p.v. values greater than 0.95, and in bold p.v. > 0.99. Asterisk indicates residues of the catalytic triad. NA, not applicable.

Table 3: Summary of predictions for the experimental set of 171 instances.

Table 4: Performance measures for the Multilayer Perceptron on the validation/training set 1. Leave-one-out cross-validation. ROC Area: area under the curve for Receiver Operating Characteristics. The closer the Recall and Precision are to 1.00, the better the performance. A classifier that arbitrarily assigns all instances to class A, applied to the validation/training set 1, would have a Recall equal to 1.0 and a Precision equal to 0.44 for the predictions about class A. The Paired Corrected T Tester as implemented in WEKA did not find a significant difference in Recall and Precision between the three classifiers presented here (two-tailed test at the 95% confidence level).

Table 5: Performance measures for the Multilayer Perceptron on the validation/training set 1 & *validation/training set 2. Leave-one-out cross-validation. ROC Area: area under the curve for Receiver Operating Characteristics. The closer the Recall and Precision are to 1.00, the better the performance. A classifier that arbitrarily assigns all instances to class A, applied to the validation/training set 1, would have a TP rate of 0.44, and FP rate of 0.66 for the predictions about class A.
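The all-class-A baseline quoted in the Table 4 legend can be checked with a short calculation. The sketch below assumes the class counts that can be read from the Exp. Mut column of Table 1 (18 class A, 15 class C, and 8 WT instances); the function name and the label encoding are hypothetical.

```python
def precision_recall(true_labels, predicted_labels, positive="A"):
    """Precision and recall for one class, computed from paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Class counts assumed from the Exp. Mut column of Table 1.
true_labels = ["A"] * 18 + ["C"] * 15 + ["WT"] * 8

# Trivial classifier: every instance is assigned to class A.
all_class_a = ["A"] * len(true_labels)

precision, recall = precision_recall(true_labels, all_class_a)
print(round(precision, 2), recall)
```

With these counts the trivial classifier reaches a Recall of 1.0 and a Precision of 18/41, which rounds to 0.44, so any useful model has to exceed that Precision for class A.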

-4.52 8.66 0.27 0.28 -6.22 0.12 -0.95 -0.2 3.13 23.46 6.7 0 14.65 29.09 17.35 7.72 5.15 4.66 -2.46 6.85 21.54

11.82 -1.64 -0.55 0.02 0.1 4.5 -0.52 6.52 -1.68 -0.29 0.15

12 -2.97 20.13 15.39 59.52 -2.28 40.33 8.14 3 -1.92 1.17

150.94 22.38 139.7 148.48 129.32 238.52 61.93 133.81 83.85 11.91 169.75 0 0 0 0 16.58 7.89 97.16 33.53 0 6.02 125.58 0 115.22 24.59 11.16 0 23.58 6.66 4.28 6.07 0 16.32

-1.74 -0.29 -1.7 -1.74 -1.5 1.91 -0.21 -1.69 -0.96 -0.13 -2.04 0 0 0 0 -0.03 0.12 1.47 -0.07 0 0.09 -1.57 0 -1.49 0.2 -0.02 0 -0.3 -0.02 -0.01 -0.02 0 -0.05

C C NC NC NC C NC NC NC HC C HC HC HC C HC C NC NC HC HC C C HC NC HC HC HC HC C NC C HC

C C NC NC NC C NC NC NC HC C HC HC HC C HC C NC NC HC HC C C HC NC HC HC HC HC C NC C HC

NC C NC NC NC C NC NC NC HC C HC HC HC C HC C NC NC HC C NC NC HC NC HC HC HC HC C NC C HC

C C NC NC NC C NC NC NC HC C HC HC HC C HC C NC NC HC HC C C HC NC HC HC HC HC C NC C HC

C C NC NC NC C NC NC NC HC C HC HC HC C HC C NC NC HC C NC NC HC NC HC HC HC HC C NC C HC

1.3 -0.44 9.38 -1.57 1.93 0 -0.05 0.02 0.44 0.27 1.28 0 -0.16 -0.21 8.51 -4.64 10.58 0.02 -0.39 0.36 -0.27

DENV1 DENV3(pH8.5) DENV2 WNV Murray JEV RaptorX

NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS2B NS2B NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3 NS3

Hydropathy

LYS ASN ARG GLU GLU PHE SER ASN GLN GLN ARG ALA ILE ILE ILE GLY VAL VAL GLY SER ILE ASP ILE ASP PHE GLY GLY ASN TRP THR THR THR SER

SASA

131 141 142 143 19 31 86 105 27 35 54 125 126 139 140 144 154 155 160 163 165 50 76 129 130 133 136 152 83 111 115 134 135

Activity Affinity Stability

Name

N

NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC NC C NC NC NC NC NC NC NC NC NC NC NC

1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1

Sensibility 5 7 6 5 6 7 4 4 8 7 7 7 7 7 7 5 6 3 7 8 4 6 7 7 7 8 7 7 5 7 7

Exp. Mut WT WT WT WT WT WT WT WT C C C C C C C C C C C C C C C A A A A A A A A A A

Ref e e e e a a a a c c c a e e e e e e e b b c d b, e e b, e e b, e a a a, b b, e e

NS3    148   GLY   -0.06   55.79   0       0       HC   HC   HC   HC   HC   NC   0   7   A   e
NS3    149   LEU   -0.14   21.96   0       0       HC   HC   HC   HC   HC   NC   0   8   A   e
NS3    150   TYR   -0.32   25.45   4.42    -0.02   HC   HC   HC   HC   HC   NC   1   8   A   b, e
NS3    151   GLY   -0.13   52.7    10.98   -0.02   HC   HC   HC   HC   HC   NC   1   8   A   b, e
NS3    153   GLY   -0.04   42.23   15.33   -0.03   HC   HC   HC   HC   HC   NC   1   7   A   e
NS2B   61    TRP   30.83   -5.99   35.34   -0.08   C    C    NC   C    NC   C    0   8   A   d
NS2B   74    LEU   7.49    11.81   34.2    0.42    NC   NC   NC   NC   NC   NC   0   6   A   d
NS2B   78    VAL   2.24    5.41    30.34   0.46    C    C    NC   C    NC   NC   0   4   A   d

Table 1: Training/Validation set and references from which the results used to define the classes were obtained. References: a [22], b [58], c [2], d [59], e [55]. NC = non-conserved residue, C = conserved residue, HC = highly conserved residue.

Res ID   Chain   Residue   T/V Set 1 (p.v.)   T/V Set 2 (A w/o NS2B)   Pan-Denv sequence [18]

4 NS3 Leu 0.952 No No
13 NS3 Thr No 0.998 Yes
18 NS3 Leu No No 0.994 Yes (21 NS3 Gly NA No 0.964)
23 NS3 Tyr No 0.999 Yes
34 NS3 Thr 0.976 Yes No
37 NS3 Gly No 1 Yes
46 NS3 Phe Yes 1 Yes
47 NS3 His Yes 0.999 Yes
48 NS3 Thr Yes 1 Yes
50 NS3 Tr 0.989 No Yes
51 NS3 His Yes 0.998* Yes
52 NS3 Val No Yes 0.997
52 NS2B Thr 0.96 NA No
53 NS3 Thr Yes 1 Yes
58 NS3 Leu No 1 Yes
59 NS3 Thr No 0.995 Yes
60 NS2B Thr 0.952 NA No
65 NS3 Leu 0.09 Yes No
67 NS3 Pro 0.976 No No
68 NS2B Thr NA No 1
69 NS3 Trp No No 0.99
72 NS3 Val No No 0.991
75 NS3 Asp No 0.998* Yes
77 NS2B Thr NA No 0.999
79 NS3 Tyr No 1 Yes
83 NS2B Thr NA No 1
84 NS2B Arg 0.986 NA No
89 NS3 Trp No 1 Yes
116 NS3 Phe 0.964 No No
118 NS3 Thr No 0.997 Yes
138 NS3 Pro 0.977 No No
156 NS3 Thr Yes 0.997 Yes
168 NS3 Thr 0.986 Yes No

Table 2: Residues predicted as class A with the highest probability by the Multilayer Perceptron. p.v., probability value (p.v. > 0.95); in bold, p.v. > 0.99, where a value of 1 indicates that the classifier is absolutely certain that the instance belongs to the assigned class. For residues predicted by a multilayer perceptron trained and validated without any NS2B data, the predictions reported here (as Yes) correspond to p.v. greater than 0.95, and in bold p.v. > 0.99. Asterisk indicates residues of the catalytic triad. NA, not applicable.
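The two probability thresholds named in the Table 2 caption can be written as a small rule. The following is a minimal sketch, not the authors' code: the function name `confidence_tier` and the tier labels are hypothetical, and the three example values are probability values that appear in Table 2 rows.

```python
def confidence_tier(p_value: float) -> str:
    """Map a class-A probability value to the tiers described in the Table 2 caption.

    The tier labels are hypothetical names chosen for this illustration.
    """
    if p_value > 0.99:
        return "bold"          # p.v. > 0.99, shown in bold in the table
    if p_value > 0.95:
        return "reported"      # 0.95 < p.v. <= 0.99, listed but not bold
    return "not reported"      # at or below the 0.95 cut-off


# Probability values taken from Table 2 rows (NS3 Leu 4, NS3 His 51, NS3 Gly 37).
examples = {"Leu4": 0.952, "His51": 0.998, "Gly37": 1.0}
tiers = {name: confidence_tier(p) for name, p in examples.items()}
print(tiers)
```

Under this rule, only residues above the 0.95 cut-off would appear in the table at all, and those above 0.99 would be emphasized.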

Method     A    C    WT   Blank
MLP 1      53   83   35   0
LAD Tree   50   77   44   0
MLP 2      27   74   36   34
VFI        53   51   67   0

Table 3: Summary of predictions for the experimental set of 171 instances.

Method            Performance measure   A       C       WT
MLP1              Recall                0.778   0.267   0.500
MLP1              Precision             0.636   0.364   0.5
MLP 2             Recall                0.733   0.533   0.5
MLP 2             Precision             0.733   0.533   0.5
LAD Tree with N   Recall                0.833   0.733   0.375
LAD Tree with N   Precision             0.789   0.688   0.5
LAD Tree set 1    Recall                0.722   0.4     0.625
LAD Tree set 1    Precision             0.591   0.545   0.625
VFI set 1         Recall                0.722   0.533   0.625
VFI set 1         Precision             0.765   0.571   0.5

Table 4: Performance measures for the Multilayer Perceptron on the validation/training set 1. Leave-one-out cross-validation. ROC Area: area under the curve for Receiver Operating Characteristics. The closer Recall and Precision are to 1.00, the better the performance. A classifier that assigns all instances to class A, applied to the validation/training set 1, would have a Recall equal to 1.0 and a Precision equal to 0.44 for the predictions about class A. The Paired Corrected T-Tester as implemented in WEKA did not find a significant difference in Recall and Precision between the three classifiers presented here (two-tailed test at the 95% confidence level).
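The baseline quoted in the Table 4 caption (Recall 1.0, Precision 0.44 for a classifier that labels everything as class A) can be reproduced from the class sizes alone. The sketch below assumes the class sizes obtained by counting the "Exp. Mut" column of Table 1 (18 A, 15 C, 8 WT); the function name `always_predict` is hypothetical.

```python
# Class sizes of training/validation set 1, counted from the "Exp. Mut"
# column of Table 1 (an assumption of this sketch: 18 A, 15 C, 8 WT).
class_counts = {"A": 18, "C": 15, "WT": 8}


def always_predict(label: str, counts: dict) -> tuple:
    """Recall and precision for `label` when every instance is assigned to it."""
    total = sum(counts.values())
    true_pos = counts[label]          # every real member of the class is caught
    false_pos = total - true_pos      # every other instance is a false alarm
    false_neg = 0                     # nothing from the class is missed
    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    return recall, precision


recall_a, precision_a = always_predict("A", class_counts)
print(f"recall={recall_a:.2f} precision={precision_a:.2f}")
# prints "recall=1.00 precision=0.44"
```

Any classifier in Table 4 is useful for class A only to the extent that its Precision exceeds this 0.44 baseline while keeping a high Recall.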

Performance measure   A/A*          C/C*          WT/WT*
TP Rate               0.778/0.733   0.267/0.533   0.5/0.5
FP Rate               0.348/0.147   0.269/0.261   0.121/0.167
Precision             0.636/0.733   0.364/0.571   0.5/0.444
Recall                0.778/0.733   0.267/0.533   0.5/0.5
F-Measure             0.7/0.733     0.308/0.552   0.5/0.471
ROC Area              0.686/0.867   0.526/0.672   0.841/0.721

Table 5: Performance measures for the Multilayer Perceptron on the validation/training set 1 and *validation/training set 2. Leave-one-out cross-validation. ROC Area: area under the curve for Receiver Operating Characteristics. The closer Recall and Precision are to 1.00, the better the performance. A classifier that assigns all instances to class A, applied to the validation/training set 1, would have a TP rate of 0.44 and an FP rate of 0.66 for the predictions about class A.
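The F-Measure row of Table 5 can be cross-checked against the Precision and Recall rows. This is a minimal sketch that assumes the F-Measure is the usual harmonic mean of precision and recall; the function name `f_measure` and the dictionary layout are hypothetical, and the numbers are the set 1 values from Table 5.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F-Measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-Measure) for set 1, taken from Table 5.
table5_set1 = {
    "A": (0.636, 0.778, 0.7),
    "C": (0.364, 0.267, 0.308),
    "WT": (0.5, 0.5, 0.5),
}

for label, (precision, recall, reported) in table5_set1.items():
    computed = f_measure(precision, recall)
    print(f"{label}: computed={computed:.3f} reported={reported}")
```

The computed values agree with the reported F-Measure entries to three decimals, which is consistent with the rows of Table 5 being read in the order shown.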

DENV NS2B-NS3 protease is essential for the viral replication process.

DENV NS2B-NS3 protease constitutes a target for the discovery of efficient antivirals.

We report a strategy to identify residues that influence the function of the DENV NS2B-NS3 protease.

We present a strategy for the search of alternative influential residues within a protein, which is useful for non-competitive inhibitor development.