QSAR modeling of antitubercular activity of diverse organic compounds


Chemometrics and Intelligent Laboratory Systems 107 (2011) 69–74

Journal homepage: www.elsevier.com/locate/chemolab

Vasyl Kovalishyn a,b,⁎, João Aires-de-Sousa a,⁎, Cristina Ventura c,d, Ruben Elvas Leitão c,e, Filomena Martins c

a REQUIMTE and CQFB, Departamento de Química, Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, 2829–516 Caparica, Portugal
b Institute of Bioorganic Chemistry & Petroleum Chemistry, National Ukrainian Academy of Sciences, Kyiv-94, 02660, Murmanskaya, 1, Ukraine
c Departamento de Química e Bioquímica, Faculdade de Ciências, Universidade de Lisboa and Centro de Química e Bioquímica (CQB), Campo Grande, 1749–016 Lisboa, Portugal
d Instituto Superior de Educação e Ciências, Alameda das Linhas de Torres 179, 1750 Lisboa, Portugal
e Área Departamental de Engenharia Química, Instituto Superior de Engenharia de Lisboa (ISEL), Instituto Politécnico de Lisboa, R. Conselheiro Emídio Navarro, 1, 1950–062 Lisboa, Portugal

Article info

Article history:
Received 26 November 2010
Accepted 25 January 2011
Available online 3 February 2011

Keywords: QSAR; Neural Networks; Random Forests; Antitubercular; Drug design

Abstract

Tuberculosis (TB) is a worldwide infectious disease that has shown extremely high mortality levels over time. The urgent need to develop new antitubercular drugs stems from the increasing emergence of strains resistant to several of the commonly used drugs, and from the longer durations of therapy and recovery, particularly in immunocompromised patients. The major goal of the present study is to explore data from different families of compounds with a variety of machine learning techniques, so that robust QSAR-based models can be developed to guide the search for new potent anti-TB compounds. Eight QSAR models were built using various types of descriptors (from the ADRIANA.Code and Dragon software) with two publicly available, structurally diverse data sets, including recent data deposited in PubChem. The QSAR methodologies used were Random Forests and Associative Neural Networks. Predictions for the external evaluation sets reached accuracies in the range 0.76–0.88 (for active/inactive classification) and Q² = 0.66–0.89 for regression. Models developed in this study can be used to estimate the anti-TB activity of drug candidates at early stages of drug development. © 2011 Elsevier B.V. All rights reserved.

⁎ Corresponding authors. Tel.: +351 212948300x10907; fax: +351 212948550. E-mail addresses: [email protected] (V. Kovalishyn), [email protected] (J. Aires-de-Sousa).
0169-7439/$ – see front matter © 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2011.01.011

1. Introduction

The current need for new antivirals and antimicrobials is determined by the critical situations associated with the rise of multi-drug resistant bacteria, HIV/AIDS, and the emergence or re-emergence of deadly infectious diseases (e.g. tuberculosis, Lyme disease, West Nile virus, Hantavirus pulmonary syndrome, Norwalk-like virus, SARS, and novel forms of Cryptococcal infection), and also by the needs of the third world, left unattended by the industry for reasons of commercial unattractiveness. Furthermore, medical experts forecast a record increase of drug resistance among major human pathogens.

Tuberculosis (TB) is one of the most destructive infectious diseases, showing increasingly high mortality levels over time [1]. It is estimated that approximately one-third of the world's population is presently infected with TB, whilst nine million new cases appear every year [1,2]. To make the situation worse, many large pharmaceutical companies ('Big Pharma') have withdrawn from the field of anti-infectives and thus, despite the obvious need for this type of drug, very few novel antibacterial therapeutics have emerged over the last decade.

Cheminformatics and computer-aided drug design (CADD) are expected to contribute to a possible solution for the perilous situation regarding this infectious disease by assisting in the rapid identification of novel effective anti-TB agents. In fact, QSAR models serve as important tools for automated pre-virtual screening, combinatorial library design and data mining. A variety of QSAR descriptors and techniques have been applied to drug/non-drug classification problems. Multiple Linear Regression analysis (MLRA), Partial Least Squares (PLS) regression, Support Vector Machines (SVM), Random Forests, Artificial Neural Networks (ANN) and Bayesian Neural Networks (BNN) are widely used and efficient QSAR techniques [3–5]. Several groups, working with data sets of antibacterial compounds, have built regressions and binary classifiers of general antitubercular activity (antitubercular-likeness models) using some of the aforementioned techniques [4,6,7]. These results clearly demonstrate that the development of judicious QSAR approaches for regression and classification of molecules active against Mycobacterium tuberculosis represents a promising tool.

Some of us have previously reported QSAR models for the hydrazide family, using MLR [3]. The major goal of the present study is to explore other families of compounds and much larger databases, and to use different machine learning techniques to develop robust QSAR-based models for the virtual screening of compound databases in order to identify potential new anti-TB agents.

2. Materials and methods

2.1. Data preprocessing

Data sets of anti-TB compounds with known minimum inhibitory concentration (MIC) values for M. tuberculosis (mostly for strain H37Rv) were collected from different published papers [6,8–26] and PubChem [27]. All molecules were processed by the Chemaxon standardizer [28]:


2D coordinates of atoms were recalculated, counter ions and salts were removed from molecular structures, and molecules were neutralized, mesomerized and aromatized. Duplicates were then removed from the data sets. Finally, the dominant protonation state of each molecule at pH 7.4 was calculated using the Major Microspecies Plugin [28].

2.2. Data sets

QSAR data sets for practical use may be narrow (i.e., consist of one or a few chemical series) or diverse (many chemical series). Since we wish to simulate both possibilities, narrow and random (diverse) data sets were built for performance comparison purposes.

Two bioassay data entries in PubChem (AID 1626 and AID 1949), corresponding to the screening of antitubercular activity for a highly diverse set of compounds, were used to construct random data set 1. All active compounds were included. Inactive compounds were selected by Kennard–Stone design [29] from the first 10,000 compounds of the bioassay, as ranked by the PubChem score, in order to obtain the most diverse and informative subset. In this way, the training set consisted of a total of 2040 active and 2060 inactive compounds from bioassay AID 1626. Similarly, test set 1 was formed from AID 1949 data and included 3200 compounds (1594 active and 1606 inactive).

The evaluation of a large number of isoniazid (INH) [3,6,8–13], benzimidazole [6,14–19] and indole [6,20–26] derivatives against various strains of mycobacteria has been reported in the literature. INH has the highest activity against M. tuberculosis and is still the most widely used drug in antituberculous regimens. Even though the majority of the indole and benzimidazole derivatives displayed rather moderate activity, some of these compounds demonstrated an appreciable antimycobacterial effect [6,14–26].
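The Kennard–Stone selection used above to pick a diverse subset of inactive compounds can be illustrated with a minimal Python sketch of the maximin rule. It is a sketch under stated assumptions, not the authors' implementation: Euclidean distance in some descriptor space is assumed (the text does not state which space or metric was used), and the function name `kennard_stone` and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations


def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone maximin rule.

    Assumption: compounds are compared by Euclidean distance between
    descriptor vectors (given here as tuples of floats).
    """
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")

    # Full pairwise distance matrix; fine for a small illustrative set.
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Step 1: start from the two most distant points.
    first, second = max(combinations(range(n), 2),
                        key=lambda pair: dist[pair[0]][pair[1]])
    selected = [first, second]

    # nearest[k] = distance from point k to its closest selected point.
    nearest = [min(dist[k][first], dist[k][second]) for k in range(n)]

    # Step 2: repeatedly add the point farthest from everything selected so far.
    while len(selected) < n_select:
        remaining = [k for k in range(n) if k not in selected]
        best = max(remaining, key=lambda k: nearest[k])
        selected.append(best)
        nearest = [min(nearest[k], dist[k][best]) for k in range(n)]
    return selected


# Hypothetical 2D descriptor vectors for five compounds.
descriptors = [(0.0, 0.0), (10.0, 0.0), (5.0, 0.0), (1.0, 0.0), (9.0, 0.0)]
print(kennard_stone(descriptors, 3))  # the two extremes, then the midpoint: [0, 1, 2]
```

The pairwise distance matrix keeps the sketch short but grows quadratically with the number of compounds, so a pool of 10,000 candidates would call for a more memory-conscious variant.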
We have thus selected a more focused data set 2, a series of 584 INH, benzimidazole and indole derivatives with known MIC values, in order to investigate the possibility of building a predictive model of antitubercular activity en route to the design of new active compounds. The MIC values for these 584 compounds ranged from 0.00328 to 2800.34 μM. All compounds were divided into two classes, active (251 with MIC ≤ 5 μM) and inactive (333 with MIC > 5 μM), as described in Ref. [6]. For regression purposes, data set 2 was divided into two subsets: data set 2.a with 365 INH derivatives, and data set 2.b with 219 benzimidazole and indole derivatives. For data sets 2, 2.a and 2.b, about 10% of the compounds were selected to form external independent test sets, and the remaining molecules were used as training sets. An additional external evaluation set composed of 45 INH derivatives, data set 3, was derived from more recent publications [30–32].

2.3. Descriptors

A broad range of atomic and molecular properties, from purely empirical to quantum-mechanical, can nowadays be easily calculated with available software. The most commonly used QSAR arsenals can include hundreds or even thousands of descriptors computable for extensive molecular data sets. Such a variety of available descriptors, in combination with numerous powerful statistical and machine learning techniques, allows the establishment of effective and robust models with superior performance. Descriptors calculated by Dragon [33] and ADRIANA.Code [34] were used in this study.

2.3.1. Dragon descriptors

A set of 3224 theoretical molecular descriptors was computed using the Dragon software. 3D structures were generated via the Chemaxon standardizer [28] from the SMILES notation available for each compound, and stored in SDF format. Descriptors were then calculated from the SDF files. Dragon can calculate many molecular descriptors, which are divided into 22 blocks: constitutional descriptors (48), topological descriptors (119), walk and path counts (47), connectivity indices (33), information indices (47), 2D autocorrelations (96), edge adjacency indices (107), Burden eigenvalue descriptors (64), topological charge indices (21), eigenvalue-based indices (44), Randić molecular profiles (41), geometrical descriptors (74), radial distribution function (RDF) descriptors (150), 3D-MoRSE descriptors (160), WHIM descriptors (99), GETAWAY descriptors (197), functional group counts (154), atom-centered fragments (120), charge descriptors (14), molecular properties (29), 2D binary fingerprints (780) and 2D frequency fingerprints (780) [35]. Constant and near-constant variables were deleted. If two descriptors were at least 99% correlated, one of them was deleted. Calculation procedures for these descriptors, with related literature references, are reported in [35].

2.3.2. ADRIANA.Code descriptors

ADRIANA.Code contains a series of methods for the generation of 3D structures and the calculation of physicochemical descriptors and molecular properties based on fast empirical models. In addition, it contains a hierarchy of increasing levels of sophistication in representing chemical compounds, from constitution to 3D structure to the surface of a molecule. For all data sets, 1271 molecular descriptors were calculated. Descriptors were generated from the SMILES notation available for each compound. 3D structures were obtained by the CORINA program as implemented in ADRIANA.Code [34]. This program provides many types of molecular descriptors, such as the number of hydrogen bond donors and acceptors, topological polar surface area, RDF descriptors and many others. RDF descriptors encode the 3D arrangement of atomic properties and enable the consideration of different physicochemical atomic properties.

2.4. Machine learning techniques (MLT)

2.4.1. Random Forests

Random Forests (RF), a recursive partitioning ensemble method, consist of many individual trees and are devised to operate quickly over large data sets. They also present a number of attractive features, such as a built-in procedure for descriptor selection [36]. In addition, the algorithm is not affected by correlated descriptors, since it uses random samples to build each tree in the forest. Furthermore, it provides a measure of importance for each descriptor in the model, which can subsequently be used in other modeling approaches that operate on smaller sets of descriptors (see Section 2.5) [36]. RFs calculate predictions by majority vote of the individual trees. The model performance is internally assessed by the prediction error for the objects left out in the bootstrap procedure (out-of-bag estimation, OOB). RFs were grown with the R program using the randomForest library [37]. The algorithm contains a number of user-definable parameters. We explored the influence of the number of trees and of the number of descriptors selected at each split point. In all cases, results showed no significant change in the predictive ability of the model when the number of trees was increased beyond 1000 or the number of descriptors beyond 30. These results are not surprising since, as shown by Breiman et al., Random Forests avoid overfitting [36].

2.4.2. Associative Neural Network

An Associative Neural Network (ASNN) is a combination of an ensemble of Feed-Forward Neural Networks and the k-Nearest Neighbor method (kNN) [38]. This approach uses the correlation between ensemble outputs (each molecule is represented in the neural network model space as a vector of model predictions) as the measure of distance between the analyzed cases in the nearest neighbor method. Therefore, ASNN performs kNN in the space of ensemble residuals. This approach improves prediction through the bias correction of the neural network ensemble. The ASNN algorithm trained by SuperSAB was used [39,40]. The number of inputs was equal to the number of descriptors, and five neurons in one hidden layer were used in the calculations. Weights were initialized with random numbers. A bias neuron was present in both the input and hidden layers. For classification tasks, ASNN was used with two output neurons; the target values for active and inactive compounds were set to (0.1, 0.9) and (0.9, 0.1), respectively. For regression tasks, the output values were linearly scaled between 0.1 and 0.9 [41]. An ensemble of 200 neural networks was trained. All the networks had the same architecture. The possibility of overfitting the data was rigorously controlled by cross-validation techniques [41]. Further details of the algorithm can be found elsewhere [42].

2.5. Evaluation of descriptor importance

Sensitivity analysis methods estimate the level of change in the output of a model that results from changes in the model inputs. They are mainly used to determine which input descriptors are more important for achieving accurate output values. In this work, the descriptor importance measure was obtained from RF models by the "Mean Decrease Accuracy" value [36]. Additional selection was done by the "pruning methods" implemented in the ASNN software. These methods are very efficient in QSAR studies [43,44]. Pruning algorithms introduce a measure of importance of the ASNN matrix weights through the so-called "sensitivities". These algorithms operate in a manner similar to stepwise multiple regression analysis, excluding at each step one input parameter that is estimated to be non-significant. At each step, the model sensitivities to all weights and input nodes are estimated, and the descriptor associated with the input neuron showing the smallest sensitivity is deleted [45,46]. For these studies, the 300–400 most important descriptors obtained from RF models were used.
We analyzed the influence of the number of selected descriptors on ASNN model quality (on the basis of the leave-one-out results for the training sets). Results demonstrated that pre-selecting fewer than 300–400 descriptors by RF decreased the predictive ability of the ASNN models.
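The idea behind the "Mean Decrease Accuracy" ranking and the subsequent pre-selection of the top descriptors can be illustrated with a simplified permutation test: shuffle one descriptor at a time and record how much the accuracy drops. The Python sketch below is a simplification under stated assumptions, not the procedure of the R library or of the ASNN software. Random Forests compute this measure per tree on out-of-bag samples, whereas here a generic `predict` callable and a fixed data set stand in; all function names and the toy data are hypothetical.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows whose predicted class equals the observed class."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each descriptor column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the table with only column j permuted.
            X_perm = [row[:j] + [value] + row[j + 1:]
                      for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def top_k(importances, k):
    """Indices of the k descriptors with the largest importance."""
    order = sorted(range(len(importances)),
                   key=lambda j: importances[j], reverse=True)
    return order[:k]


# Toy data: descriptor 0 determines the class, descriptor 1 is noise.
toy_X = [[0.1, 5.0], [0.9, 3.0], [0.2, 7.0], [0.8, 1.0], [0.3, 2.0], [0.7, 6.0]]
toy_y = [0, 1, 0, 1, 0, 1]


def toy_predict(row):
    """Hypothetical stand-in for a trained classifier."""
    return int(row[0] > 0.5)


scores = permutation_importance(toy_predict, toy_X, toy_y)
print(top_k(scores, 1))  # index of the descriptor kept for the reduced model
```

In the study itself, the analogous step kept the 300–400 highest-ranked descriptors from the RF models before the ASNN pruning was applied.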

2.6. Statistical coefficients

Models were first tested by the leave-one-out cross-validation coefficient (Q²) [47], defined as

$$Q^2 = \frac{SSY - PRESS}{SSY} \qquad (1)$$

where SSY is the sum of squared deviations of the target variable values from their mean, and PRESS is the prediction error sum of squares obtained from the leave-one-out cross-validation procedure. Use of the cross-validation coefficient Q² makes the analysis of residuals by means of the standard deviation redundant, because both coefficients are interrelated and can be derived from one another. In addition, the prediction performance and the comparison of the methods used in this study were assessed using the root mean squared error, RMSE.

To assess the classification ability, and to separately monitor the classification performance for the two classes, sensitivity (Sn), specificity (Sp) and overall accuracy (Ac) were calculated [48]. Note that sensitivity is also called the true positive rate or positive class accuracy, while specificity is also called the true negative rate or negative class accuracy.

$$Sn = \frac{TP}{TP + FN} \qquad (2)$$

$$Sp = \frac{TN}{TN + FP} \qquad (3)$$

$$Ac = \frac{TP + TN}{TP + FN + TN + FP} \qquad (4)$$

where TP, FP, TN and FN denote true positives, false positives, true negatives and false negatives, respectively. The overall accuracy, Ac, is used throughout to measure the predictive power of the models, as the numbers of active and inactive compounds are approximately the same in all data sets.

3. Results and discussion

Since the main goal of this study was to compare the accuracy of the anti-TB classification and regression models built by different machine learning methods, the influence of the chosen set of descriptors upon model accuracy was studied. Descriptors were filtered by an "importance evaluation method" using Random Forests and by the "pruning methods" implemented in the ASNN software (see Section 2, Materials and methods). Experiments were performed both with the full set of descriptors and with selected descriptors, and the results of both approaches were compared. For each modeling set, four QSAR models were developed: with ASNN and with Random Forests, each using either all descriptors or a selection of descriptors (see Tables 1 and 2).

3.1. Classification models

3.1.1. ASNN results

Table 1 contains the detailed information for the eight classification models developed. The overall best performance for the training set was achieved by the ASNN method. In the first stage, ASNN models were developed using all sets of descriptors. The initial pool of descriptors was submitted to the following additional reduction procedure: descriptors with constant values were removed, after which a pairwise correlation analysis was performed in which a given descriptor was eliminated if its correlation coefficient with another descriptor was equal to or higher than 0.95. As a result, the 1135 most important descriptors for data set 1 were selected, and the statistical coefficients obtained for the training set were Sn = 0.91, Sp = 0.88 and Ac = 0.90 (see Table 1, model 1). In the second stage, the importance of the descriptors for the observed activity was evaluated by Random Forests and by ASNN gradual pruning methods. The application of descriptor selection methods by ASNN did not lead to any increase in performance. Therefore, the descriptors used in model 2 were selected only by RF. With the reduced number of descriptors, the accuracy for the 3200 compounds in the external test set 1 was maintained (Ac = 0.76).

Regarding data set 2, the first stage selected 1969 descriptors. Statistical coefficients for the training set were Sn = 0.90, Sp = 0.91 and Ac = 0.90 (see Table 1, model 5). In this case, the number of descriptors could be reduced to 49 by RF and ASNN pruning methods while keeping basically the same accuracy for the training and test sets (Table 1, models 5–6). The 60 compounds in the external test set 2 were predicted with good accuracy, Ac = 0.85–0.87. The prediction accuracy for test set 3 was equal to 0.78 for both models.

Table 1. Comparison of classification models built with different MLT.

| M. (a) | Set | Molec. | Descr. | MLT (b) | Sn | Sp | Ac |
|---|---|---|---|---|---|---|---|
| Data set 1 | | | | | | | |
| 1 | Training 1 | 4100 | 1135 | ASNN | 0.91 | 0.88 | 0.90 |
| 1 | Test 1 | 3200 | 1135 | ASNN | 0.77 | 0.75 | 0.76 |
| 2 | Training 1 | 4100 | 300 | ASNN | 0.91 | 0.88 | 0.90 |
| 2 | Test 1 | 3200 | 300 | ASNN | 0.79 | 0.74 | 0.76 |
| 3 | Training 1 | 4100 | 1135 | RF | 0.86 | 0.79 | 0.82 |
| 3 | Test 1 | 3200 | 1135 | RF | 0.82 | 0.73 | 0.77 |
| 4 | Training 1 | 4100 | 300 | RF | 0.86 | 0.80 | 0.83 |
| 4 | Test 1 | 3200 | 300 | RF | 0.84 | 0.72 | 0.77 |
| Consensus prediction | Test 1 | 3200 | | | 0.85 | 0.73 | 0.78 |
| Data set 2 | | | | | | | |
| 5 | Training 2 | 524 | 1969 | ASNN | 0.90 | 0.91 | 0.90 |
| 5 | Test 2 | 60 | 1969 | ASNN | 0.85 | 0.88 | 0.87 |
| 5 | Test 3 | 45 | 1969 | ASNN | 0.82 | 0.64 | 0.78 |
| 6 | Training 2 | 524 | 49 | ASNN | 0.89 | 0.91 | 0.90 |
| 6 | Test 2 | 60 | 49 | ASNN | 0.87 | 0.84 | 0.85 |
| 6 | Test 3 | 45 | 49 | ASNN | 0.81 | 0.67 | 0.78 |
| 7 | Training 2 | 524 | 1969 | RF | 0.84 | 0.84 | 0.85 |
| 7 | Test 2 | 60 | 1969 | RF | 0.88 | 0.88 | 0.88 |
| 7 | Test 3 | 45 | 1969 | RF | 0.83 | 0.78 | 0.82 |
| 8 | Training 2 | 524 | 49 | RF | 0.86 | 0.86 | 0.86 |
| 8 | Test 2 | 60 | 49 | RF | 0.87 | 0.86 | 0.87 |
| 8 | Test 3 | 45 | 49 | RF | 0.85 | 0.72 | 0.82 |
| Consensus prediction | Test 2 | 60 | | | 0.87 | 0.86 | 0.88 |
| Consensus prediction | Test 3 | 45 | | | 0.83 | 0.78 | 0.82 |

(a) M: model. (b) MLT: machine learning technique.

3.1.2. RF results

The results of this method for all data sets were statistically similar to the ASNN results (see Table 1, models 3–4 and 7–8). However, the prediction accuracy for both test sets was slightly better than that of the ASNN models.

3.2. Regression models

With the more focused data set 2, for which MIC values are available, regression models were investigated with the subsets of INH derivatives (data set 2.a) and of benzimidazole and indole derivatives (data set 2.b). Results are summarized in Table 2. Four QSAR models were developed for each subset, similarly to the classification studies.

Table 2. Statistical coefficients calculated using the two MLT for data sets 2.a and 2.b.

| M. (a) | Set | Molec. | Descr. | MLT (b) | Q² | RMSE (c) |
|---|---|---|---|---|---|---|
| Data set 2.a | | | | | | |
| 1 | Training 3 | 329 | 1969 | ASNN | 0.68 | 0.59 |
| 1 | Test 4 | 36 | 1969 | ASNN | 0.67 | 0.58 |
| 2 | Training 3 | 329 | 49 | ASNN | 0.72 | 0.55 |
| 2 | Test 4 | 36 | 49 | ASNN | 0.75 | 0.51 |
| 3 | Training 3 | 329 | 1969 | RF | 0.55 | 0.70 |
| 3 | Test 4 | 36 | 1969 | RF | 0.66 | 0.58 |
| 4 | Training 3 | 329 | 49 | RF | 0.57 | 0.68 |
| 4 | Test 4 | 36 | 49 | RF | 0.69 | 0.56 |
| Consensus prediction | Training 3 | 329 | | | 0.68 | 0.59 |
| Consensus prediction | Test 4 | 36 | | | 0.73 | 0.51 |
| Data set 2.b | | | | | | |
| 5 | Training 4 | 195 | 1969 | ASNN | 0.90 | 0.32 |
| 5 | Test 5 | 24 | 1969 | ASNN | 0.83 | 0.45 |
| 6 | Training 4 | 195 | 49 | ASNN | 0.91 | 0.31 |
| 6 | Test 5 | 24 | 49 | ASNN | 0.85 | 0.42 |
| 7 | Training 4 | 195 | 1969 | RF | 0.89 | 0.34 |
| 7 | Test 5 | 24 | 1969 | RF | 0.87 | 0.39 |
| 8 | Training 4 | 195 | 49 | RF | 0.89 | 0.34 |
| 8 | Test 5 | 24 | 49 | RF | 0.87 | 0.39 |
| Consensus prediction | Training 4 | 195 | | | 0.92 | 0.29 |
| Consensus prediction | Test 5 | 24 | | | 0.89 | 0.36 |

(a) M: model. (b) MLT: machine learning technique. (c) RMSE: root mean squared error.

3.3. Consensus prediction

An important question related to the above discussion is how to select the most predictive QSAR model from all available models for a given endpoint. If overall accuracy (Ac) is used to evaluate the quality of the QSAR models, then the top-ranking model for both data set 1 and data set 2 is the ASNN model. However, it was the RF models that gave the best predictions for the external evaluation sets (see Table 1). Thus, relying on the results obtained with only one learner could obscure the choice of the best modeling method for achieving the most accurate external prediction of activity. Some researchers have shown that a consensus prediction based on the results obtained by all predictive models provides the most stable decision [49,50]. In general, consensus prediction implies averaging the predictions made for each compound by the individual models for regression QSARs, or majority voting for classification QSARs. This approach removes the need to select the best model based solely on the statistics for the training and test sets. The final consensus prediction was made by integrating the anti-TB activity predicted by all four types of models (Tables 1 and 2). In this study, the performance of the consensus prediction was always very close to that of the best individual model, and often better.

Fig. 1 shows the plot of the predicted versus experimental −log(MIC) values for data sets 2.a and 2.b. Regression models can also be used as a reliability measure for classification purposes. For example, all 8 compounds predicted by the consensus model for test set 4 of data set 2.a with −log(MIC) > 6 are active (assuming an activity threshold of 5 μM), and only one of the 13 compounds predicted with −log(MIC) < 5 was active. This possibility is of obvious interest for ranking compounds in virtual screening experiments.

4. Conclusions

Usually, the goal of model building is interpretation, prediction, or both. Interpretation involves dissecting the model to find out more about the interaction of a set of molecules with a target. Interpretation is of course facilitated by data of very good quality, by descriptors that can be easily translated into molecular features familiar to synthetic chemists, and by a mapping method in which the form of the model and the contribution of every descriptor to the model are clear. Predictive models, on the other hand, often involve much larger training sets, often of lower quality, computationally efficient descriptors and less transparent mapping methods. Here, we have presented a series of new predictive models built with several machine learning techniques and a broad range of atomic and molecular properties. Results indicate that the computational approach and the number and nature of the molecular descriptors are not the limiting factors determining the accuracy of predictions (Tables 1 and 2). The application of pruning algorithms made it possible to detect subsets of relevant input descriptors determining the molecular activity. By integrating Random Forests and Artificial Neural Network methods with pruning algorithms and consensus prediction, it was possible to build models with high predictive ability. These can be applied as virtual screening tools to evaluate molecules not yet synthesized and to help in prioritizing lead discovery. In addition, these models can also be useful for the analysis of new structures obtained by systematic modification of lead structures, i.e., structures for which promising MIC values were predicted.

Acknowledgement

Financial support from Fundação para a Ciência e Tecnologia, Portugal, under Project and BPD grant (VK) FCT/PTDC/QUI/67933/2006 is greatly appreciated.

Fig. 1. Consensus predicted versus observed −log(MIC) values for the training and test sets of data set 2.a (panel a, top) and data set 2.b (panel b, bottom). Horizontal axis: observed −log(MIC); vertical axis: predicted −log(MIC). MIC = minimum inhibitory concentration.
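The consensus scheme of Section 3.3 (averaging for regression, majority voting for classification) and the statistical coefficients of Eqs. (1)–(4) can be combined in a short Python sketch. It is only an illustration under stated assumptions: classes are coded 1 (active) and 0 (inactive), a tie in the vote is counted as inactive (the text does not say how ties among the four models were resolved), Q² is computed with the mean of the observed values passed in, and all function names and toy numbers are hypothetical.

```python
import math


def consensus_class(votes):
    """Majority vote over 0/1 predictions; a tie counts as inactive (assumption)."""
    return 1 if 2 * sum(votes) > len(votes) else 0


def consensus_value(predictions):
    """Average of the individual regression predictions for one compound."""
    return sum(predictions) / len(predictions)


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity and overall accuracy, Eqs. (2)-(4)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return {"Sn": tp / (tp + fn),
            "Sp": tn / (tn + fp),
            "Ac": (tp + tn) / (tp + fn + tn + fp)}


def q2(y_obs, y_pred):
    """Q2 = (SSY - PRESS) / SSY, Eq. (1)."""
    mean_obs = sum(y_obs) / len(y_obs)
    ssy = sum((y - mean_obs) ** 2 for y in y_obs)
    press = sum((y - p) ** 2 for y, p in zip(y_obs, y_pred))
    return (ssy - press) / ssy


def rmse(y_obs, y_pred):
    """Root mean squared error of the predictions."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_obs, y_pred)) / len(y_obs))


# Toy classification: three hypothetical models voting on four compounds.
observed_class = [1, 1, 0, 0]
model_votes = [[1, 1, 0, 1], [1, 0, 0, 0], [1, 1, 1, 0]]
consensus = [consensus_class(v) for v in zip(*model_votes)]
print(consensus, classification_metrics(observed_class, consensus))

# Toy regression on -log(MIC): two hypothetical models, three compounds.
observed_value = [5.0, 6.0, 7.0]
model_values = [[5.2, 6.1, 6.5], [4.8, 5.9, 7.1]]
averaged = [consensus_value(v) for v in zip(*model_values)]
print(round(q2(observed_value, averaged), 2), round(rmse(observed_value, averaged), 2))
```

In the toy classification each model misclassifies one compound, yet the vote recovers all four labels, which is the kind of stabilising effect the consensus rows of Tables 1 and 2 report.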

Appendix A. Supplementary data

Supplementary data to this article can be found online at doi:10.1016/j.chemolab.2011.01.011.

References

[1] Global tuberculosis control: a short update to the 2009 report, WHO, Geneva, 2009; Global tuberculosis control: epidemiology, strategy, financing: WHO report 2009, Geneva, 2009; http://www.who.int/tb/publications/2009/factsheet_tb_2009update_dec09.pdf (accessed in October 2010).
[2] http://www.taacf.org/ (accessed in January 2010).
[3] C. Ventura, F. Martins, Application of quantitative structure activity relationships to the modeling of antitubercular compounds. 1. The hydrazide family, J. Med. Chem. 51 (2008) 612–624.
[4] K.F. Pasqualoto, E.I. Ferreira, O.A.I. Santos-Filho, A.J. Hopfinger, Rational design of new antituberculosis agents: receptor-independent four-dimensional quantitative structure–activity relationship analysis of a set of isoniazid derivatives, J. Med. Chem. 47 (2004) 3755–3764.
[5] D.A.R.S. Latino, J.A. de Sousa, Assignment of EC numbers to enzymatic reactions with MOLMAP reaction descriptors and random forests, J. Chem. Inf. Model. 49 (2009) 1839–1846.
[6] P. Prathipati, N.L. Ma, T.H. Keller, Global Bayesian models for the prioritization of antitubercular agents, J. Chem. Inf. Model. 48 (2008) 2362–2370.
[7] N. Karali, A. Gürsoy, F. Kandemirli, N. Shvets, F.B. Kaynak, S. Özbey, V. Kovalishyn, A. Dimoglo, Synthesis and structure-antituberculosis activity relationship of 1H-indole-2,3-dione derivatives, Bioorg. Med. Chem. 15 (2007) 5888–5904.
[8] G. Klopman, D. Fercu, J. Jacob, Computer-aided study of the relationship between structure and antituberculosis activity of a series of isoniazid derivatives, Chem. Phys. 204 (1996) 181–193.
[9] M.C. Bagchi, B.C. Maitib, S. Bose, QSAR of anti tuberculosis drugs of INH type using graphical invariants, J. Mol. Struct. THEOCHEM 679 (2004) 179–186.
[10] D. Sriram, P. Yogeeswari, K. Madhu, Synthesis and in vitro and in vivo antimycobacterial activity of isonicotinoyl hydrazones, Bioorg. Med. Chem. Lett. 15 (2005) 4502–4505.
[11] B. Bottari, R. Maccari, F. Monforte, R. Ottanà, E. Rotondo, M.G. Vigorita, Isoniazid-related copper(II) and nickel(II) complexes with antimycobacterial in vitro activity. Part 9, Bioorg. Med. Chem. Lett. 10 (2000) 657–660.
[12] N. Georgieva, V. Gadjeva, Isonicotinoylhydrazone analogs of isoniazid: relationship between superoxide scavenging and tuberculostatic activities, Biochemistry (Moscow) 67 (2002) 588–591.
[13] A.D. Logu, V. Onnis, B. Saddi, C. Congiu, M.L. Schivo, M.T. Cocco, Activity of a new class of isonicotinoylhydrazones used alone and in combination with isoniazid, rifampicin, ethambutol, para-aminosalicylic acid and clofazimine against Mycobacterium tuberculosis, J. Antimicrob. Chemother. 49 (2002) 275–282.
[14] O. Geban, H. Ertepinar, S. Ozden, QSAR analysis of a set of benzimidazole derivatives based on their tuberculostatic activities, Pharmazie 51 (1996) 34–36.
[15] V. Klimesova, J. Koci, M. Pour, J. Stachel, K. Waisser, J. Kaustova, Synthesis and preliminary evaluation of benzimidazole derivatives as antimicrobial agents, Eur. J. Med. Chem. 37 (2002) 409–418.
[16] L. Zahajská, V. Klimešová, J. Kočí, K. Waisser, J. Kaustová, Synthesis and antimycobacterial activity of pyridylmethylsulfanyl and naphthylmethylsulfanyl derivatives of benzazoles, 1,2,4-triazole, and pyridine-2-carbothioamide/-2-carbonitrile, Arch. Pharm. Pharm. Med. Chem. 337 (2004) 549–555.
[17] V. Klimesova, J. Koc, K. Waisser, J. Kaustova, New benzimidazole derivatives as antimycobacterial agents, Farmaco 57 (2002) 259–265.
[18] Z. Kazimierczuk, M. Andrzejewska, J. Kaustova, V.S. Klimesova, Synthesis and antimycobacterial activity of 2-substituted halogenobenzimidazoles, Eur. J. Med. Chem. 40 (2005) 203–208.
[19] H. Foks, D. Pancechowska-Ksepko, W. Kumierkiewicz, Z. Zwolska, E. Augustynowicz-Kopec, M. Janowiec, Synthesis and tuberculostatic activity of new benzimidazole derivatives, Chem. Heterocycl. Compd. 42 (2006) 611–614.
[20] S.M. Allin, L.J. Duffy, J.M.R. Towler, P.C. Bulman Page, M.R.J. Elsegood, B. Saha, Facile asymmetric construction of a functionalized dodecahydrobenz[a]indolo[3,2-h]quinolizine template, Tetrahedron 65 (2009) 10230–10234.
[21] S. Mo, A. Krunic, G. Chlipala, J. Orjala, Antimicrobial ambiguine isonitriles from the cyanobacterium Fischerella ambigua, J. Nat. Prod. 72 (2009) 894–899.
[22] S. Guo, S.K. Tipparaju, S.D. Pegan, B. Wan, S. Mo, J. Orjala, A.D. Mesecar, S.G. Franzblau, A.P. Kozikowski, Natural product leads for drug discovery: isolation, synthesis and biological evaluation of 6-cyano-5-methoxyindolo[2,3-a]carbazole based ligands as antibacterial agents, Bioorg. Med. Chem. 17 (2009) 7126–7130.
[23] S.V. Karthikeyan, S. Perumal, K.A. Shetty, P. Yogeeswari, D.A. Sriram, A microwave-assisted facile regioselective Fischer indole synthesis and antitubercular evaluation of novel 2-aryl-3,4-dihydro-2H-thieno[3,2-b]indoles, Bioorg. Med. Chem. Lett. 19 (2009) 3006–3009.
[24] H. Lu, P.J. Tonge, Inhibitors of FabI, an enzyme drug target in the bacterial fatty acid biosynthesis pathway, Acc. Chem. Res. 41 (2008) 11–20.
[25] S.K. Srivastava, D. Dube, V. Kukshal, A.K. Jha, K. Hajela, R. Ramachandran, NAD+-dependent DNA ligase (Rv3014c) from Mycobacterium tuberculosis: novel structure–function relationship and identification of a specific inhibitor, Proteins 69 (2007) 97–111.
[26] V.S. Velezheva, P.J. Brennan, V.Y. Marshakov, D.V. Gusev, I.N. Lisichkina, A.S. Peregudov, I.N. Tchernousova, T.G. Smirnova, S.N. Andreevskaya, A.E. Medvedev, Novel pyridazino[4,3-b]indoles with dual inhibitory activity against Mycobacterium tuberculosis and monoamine oxidase, J. Med. Chem. 47 (2004) 3455–3461.
[27] http://pubchem.ncbi.nlm.nih.gov (accessed in January 2010).
[28] https://www.chemaxon.com/ (accessed in January 2010).
[29] R.W. Kennard, L.A. Stone, Computer aided design of experiments, Technometrics 1 (1969).
[30] M.C.D.S. Lourenco, M.D.L. Ferreira, M.V.N. de Souza, M.A. Peralta, T.R.A. Vasconcelos, M.d.G.M.O. Henriques, Synthesis and anti-mycobacterial activity of (E)-N′-(monosubstituted-benzylidene)isonicotinohydrazide derivatives, Eur. J. Med. Chem. 43 (2008) 1344–1347.
[31] M.J. Hearn, M.H. Cynamon, M.F. Chen, R. Coppins, J. Davis, H. Joo-On Kang, A. Noble, et al., Preparation and antitubercular activities in vitro and in vivo of novel Schiff bases of isoniazid, Eur. J. Med. Chem. 44 (2009) 4169–4178.
[32] C.H. Andrade, L. de B. Salum, M.S. Castilho, K.F. Pasqualoto, E.I. Ferreira, A.D. Andricopulo, Fragment-based and classical quantitative structure-activity relationships for a series of hydrazides as antituberculosis agents, Mol. Divers. 12 (2008) 47–59.
[33] http://www.talete.mi.it/products/dragon_description.htm (accessed in January 2010).
[34] http://www.molecular-networks.com/products/adrianacode (accessed in February 2010).
[35] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, Wiley-VCH, Weinheim, 2000.
[36] L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone, Classification and Regression Trees, Chapman & Hall/CRC, Boca Raton, FL, 1984.


[37] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2004, ISBN 3-900051-07-0, URL http://www.R-project.org (last accessed in October 2010). Fortran original by Leo Breiman and Adele Cutler, R port by Andy Liaw and Matthew Wiener (2004).
[38] I.V. Tetko, Neural network studies. 4. Introduction to associative neural network, J. Chem. Inf. Comput. Sci. 42 (2002) 717–728.
[39] I.V. Tetko, V.Yu. Tanchuk, N.P. Chentsova, S.V. Antonenko, G.I. Poda, V.P. Kukhar, A.I. Luik, HIV-1 reverse transcriptase inhibitor design using artificial neural networks, J. Med. Chem. 37 (1994) 2520–2526.
[40] T. Tollenaere, SuperSAB: fast adaptive back propagation with good scaling properties, Neural Netw. 3 (1990) 561–573.
[41] I.V. Tetko, D.J. Livingstone, A.I. Luik, Neural network studies. 1. Comparison of overfitting and overtraining, J. Chem. Inf. Comput. Sci. 35 (1995) 826–833.
[42] I.V. Tetko, A.E.P. Villa, Efficient partition of learning data sets for neural network training, Neural Netw. 10 (1997) 1361–1374.
[43] Y. LeCun, J.S. Denker, S.A. Solla, Optimal Brain Damage, in: D.S. Touretzky (Ed.), NIPS*2, Morgan Kaufmann, San Mateo, CA, 1990, pp. 598–605.
[44] Y. Chauvin, A Back-Propagation Algorithm with Optimal Use of Hidden Units, in: D.S. Touretzky (Ed.), NIPS*1, Morgan Kaufmann, San Mateo, CA, 1989, pp. 519–526.

[45] I.V. Tetko, A.E.P. Villa, D.J. Livingstone, Neural network studies. 2. Variable selection, J. Chem. Inf. Comput. Sci. 36 (1996) 794–803.
[46] V.V. Kovalishyn, I.V. Tetko, A.I. Luik, V.V. Kholodovych, A.E.P. Villa, D.J. Livingstone, Neural network studies. 3. Variable selection in the cascade-correlation learning architecture, J. Chem. Inf. Comput. Sci. 38 (1998) 651–659.
[47] R.D. Cramer III, D.E. Patterson, J.D. Bunce, Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins, J. Am. Chem. Soc. 110 (1988) 5959–5967.
[48] Q. Li, L. Lai, Prediction of potential drug targets based on simple sequence properties, BMC Bioinform. 8 (2007) 353–363.
[49] R. Guha, S.C. Schürer, Utilizing high throughput screening data for predictive toxicology models: protocols and application to MLSCN assays, J. Comput. Aided Mol. Des. 22 (2008) 367–384.
[50] I.V. Tetko, I. Sushko, A.K. Pandey, H. Zhu, A. Tropsha, E. Papa, T. Oberg, R. Todeschini, D. Fourches, A. Varnek, Applicability domain for classification problems, J. Chem. Inf. Model. 48 (2008) 1733–1746.