Chemometrics tools in QSAR/QSPR studies: A historical perspective

Saeed Yousefinejad, Bahram Hemmateenejad

PII: S0169-7439(15)00164-1
DOI: 10.1016/j.chemolab.2015.06.016
Reference: CHEMOM 3047

To appear in: Chemometrics and Intelligent Laboratory Systems

Received date: 29 May 2015
Revised date: 22 June 2015
Accepted date: 25 June 2015

Please cite this article as: Saeed Yousefinejad, Bahram Hemmateenejad, Chemometrics tools in QSAR/QSPR studies: A historical perspective, Chemometrics and Intelligent Laboratory Systems (2015), doi: 10.1016/j.chemolab.2015.06.016

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Chemometrics tools in QSAR/QSPR studies: a historical perspective

Saeed Yousefinejad, Bahram Hemmateenejad*

Chemistry Department, Shiraz University, Shiraz, Iran

______________________________________________________________________________

Abstract

QSAR/QSPR is one of the most extensive subfields of chemometrics, at least in terms of the number of publications and interested researchers. Over time, various improved and/or alternative methods have been developed for the principal steps of QSAR/QSPR, including (i) variable selection, (ii) model construction and (iii) validation. This manuscript presents a short overview of the critical and prominent developments in the chemometrics tools used in QSAR/QSPR studies.

Key words: QSAR; QSPR; variable selection; model development; validation

______________________________________________________________________________
⁎ Corresponding author. Tel.: +98 713 6460724; fax: +98 713 6460788. E-mail address: [email protected] (B. Hemmateenejad).


1. Introduction

Quantitative structure–activity relationship (QSAR) modeling is a quantitative method that seeks a model relating the chemical structural features of compounds (descriptors) to a defined biological activity. The modeling of biological activity was later extended to other physical and chemical properties, which is called quantitative structure–property relationship (QSPR) modeling. Different properties or behaviors of chemical molecules have been investigated in the field of QSPR. Some examples are quantitative structure–reactivity relationships (QSRRs) [1], quantitative structure–chromatography relationships (QSCRs) [2,3], quantitative structure–toxicity relationships (QSTRs) [4], quantitative structure–electrochemistry relationships (QSERs) [5–7] and quantitative structure–biodegradability relationships (QSBRs) [8,9].

Searching the Web of Science Core Collection using the keywords "quantitative structure-activity" or "quantitative structure-property" or "quantitative structure-electrochemistry" or "quantitative structure-toxicity" or "quantitative structure-function" or "quantitative structure-retention" or "quantitative structure-chromatography" or "quantitative structure-biodegradability" or "structure-activity correlation" resulted in a total of 11000 records. The contribution of each kind of recorded document, including research articles, proceedings papers, review articles, meeting abstracts and book chapters, is shown in Fig. 1. It is not surprising that research articles account for 82% of the records. However, it is interesting that the noted keywords returned about 670 review articles.

Fig. 2 shows the evolution of the number of records found with the above keywords as a function of publication year from 1966 to 2015. The first article using the keyword "structure-activity correlation" appeared in 1966, and this keyword and the term "quantitative structure-activity" were then used repeatedly together with other keywords. Fig. 2 shows three sudden increases in the growth of publications. The first and second are observed at the beginning of the 1990s and of the 2000s. The third sharp change occurred after 2006. However, the growth of publications has stopped since 2010.

A QSAR/QSPR study has five essential steps: collecting or designing a set of chemical or biological compounds, generating descriptors capable of reflecting the structure of the compounds, selecting relevant descriptors to include in the model, constructing a regression model, and checking the validity and stability of the suggested model. In this article, our short overview focuses on the last three steps, i.e. (i) variable selection, (ii) model construction and (iii) validation, in which chemometrics tools have been involved most significantly. However, it should be emphasized that without the numerous efforts in developing and generating various molecular descriptors [10], the growth and effectiveness of QSAR/QSPR studies would have been impossible.

2. Historical roots of the QSAR tree

A search of the literature traces the root of the huge QSAR tree to the thesis of Cros, entitled "Action de l'alcool amylique sur l'organisme" (1863, Faculty of Medicine, University of Strasbourg, France), which noted the relationship between the toxicity of primary aliphatic alcohols and their water solubility. Without doubt, the focus on the concept of "molecular structure" in the years between 1860 and 1880 led to the development of QSAR research [11]. Among the works on molecular structure, especially the "three-dimensional" concept, the contributions of Butlerov (1861–65), Wislicenus (1869–73) and Van't Hoff (1874–75) are notable [11].

Perhaps the first rudimentary QSAR study can be attributed to Brown and Fraser [12], who proposed the existence of a correlation between the molecular constitution and the biological activity of different alkaloids. In 1884, Mills elaborated on the hypothesis of structure-property correlation in his article entitled "On melting-point and boiling-point as related to chemical composition" [13]. After that, from the end of the 19th to the beginning of the 20th century, other attempts were made to clarify this hypothesis (e.g. [14,15]). Some believe that the work of Hammett on substituent effects in organic reactions (1935–1938) [16–19] had an outstanding role in the development of QSAR models. Some years later, when the first theoretical QSAR/QSPR studies based on descriptors obtained from graph theory were proposed in the middle of the 20th century [20,21], it was not expected that QSAR would become an inseparable part of molecular and drug design. After the development of other categories of structural descriptors, the world was ready to see the QSAR/QSPR revolution. On the way to this step, the work of Pauling and of Coulson on the chemical bond concept [22,23], of Sanderson on atomic electronegativity [24], and the research on electronic distribution and quantum-chemical descriptors [25–28] were very important and decisive. After that, specific QSAR/QSPR approaches were proposed step by step, and these are prominent from the historical point of view. After the suggestion of linear free energy relationships (LFER) by Hammett [19], one of the principal approaches was the linear solvation energy relationship (LSER), which was proposed in 1952–53 by Taft [29,30].

Some consider the modern period of QSAR (or the official birth date of QSAR) to have started in 1962, when Hansch et al. correlated the biological activity of plant growth regulators and chloromycetin derivatives with Hammett constants and partition coefficients [31]. Their work was the first multi-parametric QSAR model. Following this official birthday, further work by Hansch and coworkers [32,33] attracted a lot of interest to this field and caused the QSAR explosion [34].

A historical landmark in structure-activity studies was the publication of Free and Wilson in 1964 [35], which had an effective role in the development of QSAR/QSPR. The basic idea of their approach was to model a biological activity (or chemical property) by looking at the presence or absence of substituent functional groups on a common structural skeleton.

In the 1980s, the proposal and use of different categories of topological and geometrical descriptors brought the 3D geometry of molecules into QSAR/QSPR [11] and improved both the predictive and the descriptive ability of the models. Another really important stage for QSAR in the 1980s was the development of molecular descriptors based on molecular interaction fields (MIFs), which led to the well-known field of 3D QSAR. The concern of 3D QSAR is to find the interaction energies between a compound and specified chemical probes at certain points of 3D space [36,37]. Different interaction probes, such as a hydrogen atom, a water molecule and a methyl group [36], have been proposed to detect and calculate the interaction energies of a molecule in a grid space. Historically, the first approach in the 3D QSAR category was the GRID method proposed by Goodford in 1985 [38], which was then developed further by Cramer et al. in 1988 under the name comparative molecular field analysis (CoMFA) [39]. These methods work in a lattice model by aligning the molecules in order to compare them and explore their MIF information in 3D space [39]. Other important MIF-based 3D QSAR methods were introduced later, such as comparative molecular similarity indices analysis (CoMSIA) by Klebe et al. in 1994 [40], the Compass method by Jain et al. in 1994 [41], comparative molecular moment analysis (CoMMA) by Silverman and Platt in 1996 [42], Voronoi field analysis by Chuman in 1998 [43] and the VolSurf method by Cruciani et al. in 2000 [44]. Approaches for calculating 3D descriptors such as grid-weighted holistic invariant molecular (GWHIM) and GRid-INdependent (GRIND) descriptors [45] can also be considered MIF methods.


Nowadays QSPR/QSAR is accepted as a tool in molecular design for different purposes [46,47]. For example, in the current century QSAR is embedded in the pharmaceutical industry as an essential and inseparable tool for drug discovery and activity optimization during the drug development process [37,48–51]. QSAR makes it possible to eliminate chemical candidates with poor drug properties or toxic responses from further pharmaceutical development, and it also helps to predict the activity of newly designed chemical compounds [52,53].

3. What has chemometrics done for variable selection?

Variable selection methods can be historically divided into two general categories: (i) classical variable selection and (ii) variable selection by artificial intelligence algorithms. In the first category, the main focus is on linear methods, which assume a linear relationship between the independent variables and the response variable. In the artificial intelligence-based methods, descriptors are mostly mapped to a non-linear relational space, which helps to overcome some limitations of the classical methods. It is noteworthy, however, that some variable selection methods can be regarded as a combination of classical and artificial intelligence methods.

In all variable selection methods, the variables are entered into the model in an algorithmic manner and a "fitness function" or "selection criterion" determines which variables should be retained in the model and which should be eliminated. Different selection criteria have been introduced in the literature [54–56], but the choice of criterion depends on the purpose of the QSAR/QSPR model. If the goal of the model is prediction, a criterion that evaluates the prediction ability should be used, whereas models with other functions (such as parameter estimation) need criteria that focus on parameter variances. For predictive models, one of the most popular fitness functions is the prediction residual error sum of squares (PRESS), which is obtained by cross validation. In cross validation [57,58], one or more of the training data points are systematically removed at a time (leave-one-out and leave-many-out cross validation), the QSAR/QSPR model is rebuilt with the remaining data points, and this model is then used to predict the left-out samples. This leaving-out procedure is repeated over the entire data set until all the compounds have been predicted by cross validation. Then, PRESS can be computed as follows:

\[ \mathrm{PRESS} = \sum_{i} \left( y_i - \hat{y}_i \right)^2 \tag{1} \]

where $y_i$ and $\hat{y}_i$ are the experimental and computed activity/property of the ith compound, respectively. Other important functions in variable selection are the stopping criteria, and various rules can be utilized for this purpose [56].
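As a minimal sketch of Eq. (1), the following Python function computes a leave-one-out PRESS for an ordinary least-squares model. The function name `press_loo`, the use of NumPy and the intercept term are assumptions made for illustration; they are not taken from the cited works.

```python
import numpy as np

def press_loo(X, y):
    """Leave-one-out PRESS of a linear model with intercept (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    A = np.column_stack([np.ones(n), X])      # descriptor matrix plus intercept column
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i              # leave compound i out
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        y_hat_i = A[i] @ coef                 # predict the left-out compound
        press += (y[i] - y_hat_i) ** 2        # accumulate the squared prediction error
    return float(press)
```

Because every compound is predicted by a model that never saw it, this value is never smaller than the ordinary residual sum of squares of the model fitted to all compounds.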

3.2. Classical Methods

3.2.1. Forward Selection (FS)

In forward selection [59], the variables (descriptors) are entered into the QSAR/QSPR model one at a time. The model starts with no descriptors, and the first variable to enter is the one with the best value of the selection criterion, e.g. the minimum PRESS or the highest correlation with the response variable. As forward selection progresses, each selected variable is present in all further models, and new variables are added to the previous model according to the improvement in the selection criterion. This procedure continues until the stopping criterion is satisfied. In FS, once a descriptor has entered the model, it cannot be removed in subsequent iterations. It is also possible that some descriptors predict well collectively but give a poor model when any one of them is missing. Therefore, some combinations (those that do not contain a previously selected variable) are never tested during the forward process. This is the main disadvantage of FS. FS has been used in many QSAR/QSPR studies [60,61], but it is no longer a favored or common variable selection approach.
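The greedy character of FS can be illustrated with a short Python sketch. It assumes leave-one-out PRESS as the selection criterion and stops when PRESS no longer decreases; both choices, as well as the names `loo_press` and `forward_selection`, are assumptions for illustration, since the text allows several criteria and stopping rules.

```python
import numpy as np

def loo_press(X, y, cols):
    """Leave-one-out PRESS of an OLS model (with intercept) on the columns in cols."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        press += (y[i] - A[i] @ coef) ** 2
    return float(press)

def forward_selection(X, y, max_vars=None):
    """Add one descriptor at a time; a descriptor that has entered is never removed."""
    D = X.shape[1]
    max_vars = D if max_vars is None else max_vars
    selected = []
    best = loo_press(X, y, selected)          # model with intercept only
    while len(selected) < max_vars:
        scores = {j: loo_press(X, y, selected + [j])
                  for j in range(D) if j not in selected}
        j_best = min(scores, key=scores.get)  # candidate giving the lowest PRESS
        if scores[j_best] >= best:            # assumed stopping rule: no improvement
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected, best
```

Note that each step only extends the previous model, so combinations that exclude an already selected descriptor are never evaluated, which is the limitation described above.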

3.2.2. Backward Elimination (BE)

BE starts with an initial step in which all variables are included in the model, and from this point it proceeds in the opposite direction to FS. In the next steps, BE checks the variables (descriptors) one by one for deletion based on the selection criterion. In other words, the descriptor whose exclusion most improves the selection criterion (e.g. gives the largest decrease in PRESS) is the first candidate for deletion [59]. The elimination procedure stops when all of the included variables are significant or when all but one variable have been eliminated. BE usually results in over-fitted models (models with more variables than required). In addition, a disadvantage similar to that of FS limits the usefulness of BE.

3.2.3. Combined Selection/Elimination (Stepwise Selection; SS)

Stepwise selection is a combination of forward selection and backward elimination. Here, forward selection is run on the descriptors, but the included variables are reviewed at each step with an elimination algorithm. The goal of SS is to correct the disadvantage of FS. In other words, entering a new descriptor in forward selection may diminish or remove the significance of one or more previously included descriptors, and SS takes care of this point step by step with its elimination tool. However, it is worth mentioning that SS does not seem to work well when the initial pool of descriptors is very large. SS can be regarded as the most popular variable selection method in QSAR/QSPR studies (some examples can be found in [62–79]).

3.2.4. Heuristic Method (HM)

The heuristic method (HM) can also be classified as a modified version of stepwise multiple linear regression (MLR) based on collinearity control and rank correlation. It allows a rapid selection of the best model without checking all possible combinations of the initial descriptors [80]. In the first phase of HM, a pre-selection is done by eliminating (i) descriptors that are not available for every compound, (ii) descriptors with a small variation in magnitude over all compounds, (iii) descriptors with a Fisher F-value below 1.0 in the single-parameter correlation, and (iv) descriptors with t-values lower than a user-specified value, etc. In the second phase, the descriptors are ranked in descending order of their correlation coefficients in the single-parameter models. Then, the correlation of the given activity/property with (i) the top descriptor in the rank list combined with each of the remaining descriptors, (ii) the next one combined with each of the remaining descriptors, etc. is calculated. The best pair among the two-parameter correlations is selected and used for the further addition of descriptors in a similar way. Different statistical parameters can be used as fitting criteria, e.g. the Fisher F-value, the correlation coefficient and the squared standard error [80].

Similar to stepwise MLR, HM can be used in linear QSAR/QSPR modeling and can also act as an efficient tool for descriptor selection before the construction of a linear or nonlinear model [81–99].

3.2.5. Leaps-and-Bounds Method

In 1974, Furnival and Wilson suggested Leaps-and-Bounds regression, which reduces the number of operations required to find the best subset of variables [100]. This method selects a subset of variables without the need to check all possible subsets. Briefly, this is done by comparing the residual sum of squares (RSS) of two subsets of variables. For example, if set S1 contains n variables with an RSS of 445 and set S2 includes m variables with an RSS of 330, then all possible subsets of S1 can be ignored by comparing the RSS of S1 and S2. The fundamental idea of Leaps-and-Bounds regression is that removing variables from a model can never decrease its RSS, so every subset of S1 has an RSS at least as large as that of S1 and therefore greater than that of S2. This algorithm has been used in many QSAR/QSPR studies [60,101–112].
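The pruning rule can be sketched in Python as a simplified branch-and-bound search for the best subset of a fixed size k. This is written in the spirit of the bound described above and is not the Furnival and Wilson algorithm itself; the names `rss` and `best_subset`, the intercept term and the fixed subset size are assumptions for illustration.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares of an OLS model (with intercept) on the columns in cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def best_subset(X, y, k):
    """Best k-descriptor subset by RSS, skipping branches that cannot win."""
    state = {"rss": np.inf, "cols": None, "fits": 0}

    def search(cols, start):
        state["fits"] += 1
        r = rss(X, y, cols)
        # Bound: every subset of cols has RSS >= r, so if r is already no
        # better than the best k-subset found so far, skip the whole branch.
        if r >= state["rss"]:
            return
        if len(cols) == k:
            state["rss"], state["cols"] = r, cols
            return
        # Drop one descriptor at a time; positions only move forward so that
        # each subset is generated at most once.
        for pos in range(start, len(cols)):
            search(cols[:pos] + cols[pos + 1:], pos)

    search(tuple(range(X.shape[1])), 0)
    return state["cols"], state["rss"], state["fits"]
```

The third return value counts the regressions actually fitted, which can be compared with the number of subsets an exhaustive search would need.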

3.2.6. Replacement Method (RM)

The basic rule in RM is to minimize the standard deviation (S) of a model built on a set of descriptors by systematically checking the relative errors of the model's regression coefficients and iteratively replacing descriptors with a certain algorithm (described below) to reach the best set of descriptors. More conceptually, when d descriptors are sought in a pool of D descriptors to find the global minimum of S(d), d can be considered a point in a space of D!/[d!(D − d)!] points. Therefore, a total of D!/[d!(D − d)!] linear regressions would be needed for a full search of the space. Solving such a problem is very difficult with a step-by-step method like stepwise selection (SS), which starts with no descriptor in the initial step, and a full search is commonly impractical when D is large. In the article of Duchowicz et al. [113], the proposed RM approach required far fewer linear regressions while giving results similar to those of the full search. Their algorithm can be summarized in the following steps:

(1) A subset of descriptors d = (x1, x2, ..., xd) is selected at random and a linear regression is performed.

(2) One of the descriptors in the subset d, called xi, is selected and replaced in turn with each of the descriptors of the original pool (D = (x1, x2, ..., xD), D ≫ d), and the best resulting set (i.e., that with the smallest S) is kept. There are clearly d paths for this replacement, because it is possible to start by replacing any of the d descriptors in the initial model. Next, among the d variables, the one with the greatest relative error in its regression coefficient (omitting the one replaced in the previous step) is the next candidate for replacement with all of the D descriptors (except itself), and the best set is kept again.

(3) All of the remaining descriptors are replaced in the same way, omitting those replaced in previous steps.

(4) The process starts again with the descriptor having the greatest relative error in its regression coefficient, and the whole process is repeated.

(5) This process is repeated until the set of descriptors remains unchanged. The model obtained here is the best one for path i.

(6) Steps 2 to 5 are performed for all possible paths (i = 1, ..., d), and the best model is kept after comparing the resulting models. This gives the overall best model.

It is worth noting that modified versions of RM, known as the enhanced replacement method (ERM), were also introduced by Mercader et al. in 2008–2011 for use in descriptor selection [114,115].

RM and ERM have been used with satisfactory results in many QSAR/QSPR reports [116–123].
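The six steps above can be sketched in Python as follows. This is one possible reading of the verbal description, not the authors' implementation; the function names, the intercept term, the definition of the relative error as standard error divided by the absolute coefficient, and the random seed are all assumptions for illustration.

```python
import numpy as np

def fit_stats(X, y, cols):
    """OLS with intercept: returns S and the relative errors of the slope coefficients."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    s2 = resid @ resid / (n - A.shape[1])
    se = np.sqrt(np.diag(s2 * np.linalg.pinv(A.T @ A)))[1:]
    rel = se / np.maximum(np.abs(coef[1:]), 1e-12)
    return float(np.sqrt(s2)), rel

def replace_at(X, y, cols, pos):
    """Step 2: try every pool descriptor at position pos, keep the set with smallest S."""
    best_cols, best_s = list(cols), fit_stats(X, y, cols)[0]
    for j in range(X.shape[1]):
        if j in cols:
            continue
        trial = list(cols)
        trial[pos] = j
        s = fit_stats(X, y, trial)[0]
        if s < best_s:
            best_cols, best_s = trial, s
    return best_cols

def sweep(X, y, cols, first):
    """Steps 2-3: replace position `first`, then the rest by decreasing relative error."""
    cols = replace_at(X, y, cols, first)
    done = [first]
    while len(done) < len(cols):
        rel = fit_stats(X, y, cols)[1]
        rel[done] = -np.inf                   # skip positions already replaced
        pos = int(np.argmax(rel))
        cols = replace_at(X, y, cols, pos)
        done.append(pos)
    return cols

def replacement_method(X, y, d, seed=0):
    rng = np.random.default_rng(seed)
    start = [int(j) for j in rng.choice(X.shape[1], size=d, replace=False)]  # step 1
    best_cols, best_s = None, np.inf
    for first in range(d):                    # step 6: one path per starting position
        cols = sweep(X, y, start, first)
        while True:                           # steps 4-5: repeat until unchanged
            worst = int(np.argmax(fit_stats(X, y, cols)[1]))
            new = sweep(X, y, cols, worst)
            if new == cols:
                break
            cols = new
        s = fit_stats(X, y, cols)[0]
        if s < best_s:
            best_cols, best_s = cols, s
    return sorted(best_cols), best_s
```

Each replacement is accepted only if it strictly lowers S, so every path terminates after a finite number of regressions, typically far fewer than the D!/[d!(D − d)!] of a full search.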

3.2.7. Successive Projections Algorithm (SPA)

The SPA algorithm, which was originally proposed for variable selection in spectroscopic research [124], can be summarized in three distinct phases:

(i) Construction of candidate subsets of variables according to a collinearity minimization criterion. These subsets are built by a sequence of vector projection operations run on the columns of the independent-variable data matrix X (O × D), with O objects and D descriptors. It is worth mentioning that the dependent variable y (activity/property) is not used in this phase and that multi-collinearity is computed only on the X matrix. The heart of SPA is the projection process in this phase, which is explained in detail elsewhere [124–126].

(ii) In the second phase, the best subset among the candidate subsets is chosen using an evaluation criterion that indicates the prediction ability of the resulting MLR model. The PRESS of a separate test set or the PRESS of cross validation can be used for this purpose [125].

(iii) In the final, optional phase, an elimination procedure is run on the descriptors in the selected subset to determine whether any descriptors can be deleted without significant loss of prediction ability.

Because of its good potential, SPA has been used efficiently for descriptor selection in QSAR/QSPR studies, in some cases with small modifications and hyphenation [126–139].
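Phases (i) and (ii) can be sketched in Python as below. The projection step follows the commonly described form of SPA (project the remaining columns onto the orthogonal complement of the last selected column and pick the column with the largest remaining norm); the column centering, the function names and the use of a separate validation set for PRESS are assumptions for illustration, and the optional phase (iii) is omitted.

```python
import numpy as np

def spa_chain(X, start, m):
    """Phase (i): chain of m minimally collinear columns beginning at column `start`."""
    Xp = X - X.mean(axis=0)                   # centred working copy (assumption)
    chain = [start]
    for _ in range(m - 1):
        v = Xp[:, chain[-1]].copy()
        Xp = Xp - np.outer(v, v @ Xp) / (v @ v)   # remove the direction of v from all columns
        norms = np.linalg.norm(Xp, axis=0)
        norms[chain] = -1.0                   # never pick an already selected column
        chain.append(int(np.argmax(norms)))
    return chain

def validation_press(Xcal, ycal, Xval, yval, cols):
    """PRESS on a separate validation set for an MLR model built on the columns in cols."""
    A = np.column_stack([np.ones(len(ycal)), Xcal[:, cols]])
    coef, *_ = np.linalg.lstsq(A, ycal, rcond=None)
    pred = np.column_stack([np.ones(len(yval)), Xval[:, cols]]) @ coef
    return float(np.sum((yval - pred) ** 2))

def spa_select(Xcal, ycal, Xval, yval, m_max):
    """Phase (ii): evaluate every chain prefix and keep the subset with the lowest PRESS."""
    best_err, best_cols = np.inf, None
    for start in range(Xcal.shape[1]):
        chain = spa_chain(Xcal, start, m_max)
        for m in range(1, m_max + 1):
            err = validation_press(Xcal, ycal, Xval, yval, chain[:m])
            if err < best_err:
                best_err, best_cols = err, chain[:m]
    return best_err, best_cols
```

Here `m_max` should stay below the rank of the centred descriptor matrix; otherwise the projected columns vanish and the choice of the next column becomes numerically meaningless. Note that y enters only in `spa_select`, in line with phase (i) working on X alone.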

3.3. Artificial Intelligence-Based Methods

3.3.1. Genetic Algorithm (GA)

After the suggestion of GA by Holland in 1975 [140], and once computers and software had reduced the complexity of running the algorithm in the 1990s, GA was used in different areas such as finding optimum (low-energy) conformations, molecular design, etc. Many researchers have discussed GA in review articles, and two of the most important ones were written by Devillers [141,142].

GA tries to find a solution by searching the solution space with an algorithm based on natural selection in biological evolution. Initialization, selection, genetic operators and fitness evaluation (termination) are the four basic steps of GA [143].

The initial population of GA consists of the original set of chromosomes. The number of initial chromosomes varies between GA problems, but the population generally contains several hundred or even thousands of solutions and is often generated randomly.

Selection deals with the transfer of a proportion of the population to the next generation during each successive generation. This filtration is done with a fitness function, which is always problem dependent. In variable selection problems, the goal is to reach the maximum correlation coefficient or the minimum PRESS [143]. Generally, a selected variable is denoted by "1" and an unselected variable by "0" in GA.

Genetic operators are the tools that lead to a new generation ("child" solutions) from the previously selected "parents". Mutation and crossover are the most important GA operators (see Fig. 3, which shows a simple schematic of these well-known operators). The child solutions are obtained by mutation and/or crossover and typically inherit many characteristics of the parents.

Fitness evaluation, or termination of the genetic algorithm, is the last step. It can be based on different criteria, such as reaching the maximum allowed number of generations or computation time, satisfying the fitness function, reaching a plateau in the fitness function without significant improvement, or manual termination.
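The four steps can be illustrated with a small Python sketch of GA-based descriptor selection, in which each chromosome is a 0/1 mask over the descriptors. The population size, truncation selection, one-point crossover, bit-flip mutation rate, fixed number of generations and the use of negative leave-one-out PRESS as fitness are all assumptions chosen for brevity, as are the function names.

```python
import numpy as np

def fitness(mask, X, y):
    """Negative LOO PRESS of an OLS model on the selected columns (higher is better)."""
    if not mask.any():
        return -np.inf                        # an empty model is never preferred
    A = np.column_stack([np.ones(len(y)), X[:, mask]])
    H = A @ np.linalg.pinv(A)                 # hat matrix
    e = y - H @ y
    # For OLS, the leave-one-out residual of compound i equals e_i / (1 - h_ii).
    return -float(np.sum((e / (1.0 - np.diag(H))) ** 2))

def ga_select(X, y, pop_size=30, generations=40, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    pop = rng.random((pop_size, D)) < 0.5     # initialization: random 0/1 chromosomes
    for _ in range(generations):              # termination: fixed number of generations
        fit = np.array([fitness(c, X, y) for c in pop])
        order = np.argsort(fit)[::-1]
        parents = pop[order[: pop_size // 2]] # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, D)          # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            child ^= rng.random(D) < p_mut    # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents] + children)
    fit = np.array([fitness(c, X, y) for c in pop])
    best = int(np.argmax(fit))
    return pop[best], float(fit[best])
```

Keeping the parents in the next generation is a simple form of elitism, so the best chromosome found so far is never lost between generations.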

One of the first applications of GA in computer-aided molecular design (CAMD) was made by Venkatasubramanian and coworkers in the field of polymer design [144,145]. The next historically prominent applications of GA were reliable QSAR/QSPR models for finding organic compounds with a desired biodegradability [146], for designing active inhibitors of dihydrofolate reductase [147], for the design of fuel additives [148] and for modeling the structure–octane rating relationships of hydrocarbons [149]. The works of Hou et al. in 1999, which applied GAs to variable selection in structure-activity modeling, were notable in showing the power of GA in QSAR/QSPR [150,151]. In their studies, a set of cinnamamide compounds [150] and some non-nucleoside HIV-1 inhibitors [151] were investigated.

From the earliest applications until now, GA has attracted the interest of QSAR/QSPR researchers because of its high flexibility. As different linear/non-linear regression and feature extraction methods improved, hyphenated approaches that combine them with GA followed naturally. Coupling GA with multiple linear regression (GA-MLR) was the simplest version and is still applied [150–158]. On the other hand, some modifications have been suggested to enhance the ability of GA in QSAR/QSPR for specific cases, as noted in the following.

The GA-PLS method was proposed by Hasegawa et al. in 1997 for variable selection in QSAR to obtain a PLS model with high internal predictivity [159] and has been applied in different studies [154,160–170]. Another modification was GA-KPLS (kernel PLS), proposed by Jalali-Heravi and Kyani [171], which tried to enhance the ability of GA as a powerful optimization method by using the advantages of KPLS as a robust nonlinear method.


The combination of ANN with GA was also proposed to enhance the efficiency of descriptor selection and the prediction ability of the model. Among these methods, the hybrid of GA and ANN (GNN or GA-ANN) developed by So and Karplus in 1996 is notable. GNN selects a suitable set of descriptors with GA and uses them as the input of a neural network [172]. In GA-ANN, the descriptor selection starts with a population of randomly constructed models. Then, a fitness function that evaluates the predictive power of the models is used to rank these random models. Next, genetic operations (mutation, crossover) are applied to the better-ranked models in the population to generate new models that replace the worst-ranked ones, and the result is finally used as the input of the ANN.

A comparison between GA-ANN and other well-known techniques such as forward selection (FS), genetic function approximation (GFA) and generalized simulated annealing (GSA) was made by So et al. [173]. Yasri and Hartsough [174] suggested a combination of GA and ANN that differs from other GA-ANN techniques in two main points: firstly, the descriptor reduction search performed by the GA is not limited to a certain number of variables. Secondly, the optimal ANN is obtained in parallel with the descriptor selection by dynamic updating of the size of the hidden layer [174].

GA-ANN has been applied in different studies [175–183]. Applications of specific combined versions of ANN with GA can also be found in QSAR/QSPR studies. Some examples are GA-ANFIS (adaptive neuro-fuzzy inference system) [184] and BRGNN (Bayesian-regularized genetic neural networks) [185].

As another modification of GA for variable selection, the GA-guided selection method (GAS) was suggested by Cho and Hermsmeier [186]. It includes a simple encoding scheme to represent both compounds and variables simultaneously in a QSAR/QSPR model. In GAS, multiple models are constructed (each model for a subset of compounds) and an approach based on molecular similarity is used to determine which model should be applied to a certain set of test compounds [186].

Subsequently, Wegner and Zell proposed a flexible and fast descriptor selection method using a GA based on Shannon entropy cliques (SEC) and differential Shannon entropy (DSE) to measure the relevance of the selected descriptors [187]. This method, called GA-SEC, uses the SEC to build the initial population, which speeds up the evolutionary GA optimization process, keeps memory requirements very low and allows the analysis of large data sets [187].

Because of the advantages of using principal components (PCs) instead of the original descriptors, GA has also been used for PC selection in QSAR/QSPR. GA-PCR, which was originally used in spectroscopic studies [188], was used in QSAR/QSPR modeling as well [189,190]. Replacing the linear regression with non-linear modeling led to a new technique for QSAR/QSPR, called PC-GA-ANN, which was proposed by Hemmateenejad and coworkers [191,192] and has been used in some other articles [189,193].

After the introduction of the support vector machine (SVM), the combination of GA and SVM (GA-SVM) was also reported for use in QSAR/QSPR modeling [194–197].

Reddy et al. in 2010 reported a comparative study of various hybrid-GA optimization techniques for descriptor space reduction. They used MLR, artificial neural network (ANN), non-linear decision tree (DT) and correlation-based feature selection (CFS) approaches as the fitness function of GA [198].

Different applications of GA in chemistry and chemometrics, including variable selection, have been reviewed in many works [142,199,200].

3.3.2. Evolutionary Programming (EP) Method

The EP procedure was developed by Fogel and coworkers in 1990–1993 [201–203], with an earlier background from 1966 [204], initially for the travelling salesman problem (TSP). EP was used in 1994 for descriptor selection in QSAR by Luke [205] and was then employed by others [206–208]. GA and EP are two well-known evolutionary algorithms (EA), and they have been compared in some articles, for example in [209].

The EP algorithm is based on a population of organisms that generate offspring asexually. In EP, a vector of exponents ni represents the genes of each organism, and N organisms make up the initial population. A copy of each organism's genes generates a unique offspring, which is then modified by mutation [205]. Although EP might seem similar to a simple GA, the two algorithms are quite different. Regarding the reproduction operators, it should be emphasized that EP works without any crossover operator and uses only a simple swap mutation [209]. Each newly generated organism is added to the population, and the N most fit organisms survive to the next generation. This process is repeated for a certain number of generations.

The EP algorithm can be summarized in the following five steps [201,203,205]:

1. Start with a random population of N descriptors.

2. Determine the goodness or weakness of each descriptor using a fitness function.

3. Create a new predictor from each descriptor using the reproduction operators, calculate its fitness and add it to the population.

4. Rank the extended set of predictors (= 2N) from best to worst according to their fitness and remove the N worst ones from the population. This restores the population to its original size (= N) and completes one cycle.

5. Repeat steps 3 and 4 until the algorithm terminates or a user-defined number of iterations is reached.

AC CE P
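The five EP steps can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: each organism is encoded here as a 0/1 mask over the descriptors (the original work encodes genes differently), `fitness` is a hypothetical user-supplied function where larger is better, and the name `evolutionary_programming` is ours.

```python
import random


def evolutionary_programming(fitness, n_vars, pop_size=20, generations=50, seed=0):
    """Mutation-only evolutionary loop over 0/1 descriptor masks (illustrative encoding)."""
    rng = random.Random(seed)
    # Step 1: random initial population of N organisms.
    population = [[rng.randint(0, 1) for _ in range(n_vars)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Step 3: each parent yields one offspring by swap mutation (no crossover).
        offspring = []
        for parent in population:
            child = parent[:]
            i, j = rng.sample(range(n_vars), 2)
            child[i], child[j] = child[j], child[i]
            offspring.append(child)
        # Steps 2 and 4: score the extended set of 2N organisms, keep the N fittest.
        extended = population + offspring
        extended.sort(key=fitness, reverse=True)
        population = extended[:pop_size]
        history.append(fitness(population[0]))
    # Step 5 is the loop itself; return the best mask and the best fitness per generation.
    return population[0], history
```

Because parents compete with their offspring, the best fitness cannot decrease from one generation to the next. Note that a swap mutation on a 0/1 mask keeps the number of selected descriptors constant within a lineage, which is a limitation of this toy encoding.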

3.3.3. Artificial Neural Network (ANN)

ANN is a mathematical approach for describing nonlinear hypersurfaces and is probably the most well-known artificial intelligence method, whose importance in chemometrics is progressively increasing. ANNs are widely utilized in both the variable selection and model construction aspects of QSAR/QSPR [210–212] and have known advantages and limitations [213]. ANN is considered a black box. It transforms an m-variable input into an n-variable output, and in most ANN methods n is smaller than m. A simple feed-forward or feed-backward ANN consists of three principal layers, named the “input”, “hidden” and “output” layers, which are fully connected by connection weights, as shown in Fig. 4. The “learning” or “training” of an ANN is done by repeatedly passing through the input variables and adjusting the connection weights to minimize the error of the final QSAR/QSPR model or to maximize the correlation between experimental activities/properties and their predicted values [214]. Although the application of ANN in model construction (which will be reviewed in the next sections) is better known than its use in variable selection, this method has also been presented in a pruning format as a variable selection tool [215–217].

In common variable selection using ANN, the network is trained with all descriptors as the input, and backward selection is done by successively removing nodes (neurons) of the input layer. Along with the elimination of some input nodes and their connection weights, and the adjustment of the remaining weights, the overall input-output correlation learned by the ANN model is successively checked to remain approximately unchanged or to show enhancement. Again, a suitable fitness criterion is needed to remove non-significant input nodes [218,219]. It is worth mentioning that one of the disadvantages of ANN is the high probability of chance correlation and overfitting (over-training), because of the large number of adjustable parameters in ANN and its high intrinsic mathematical flexibility [214,220].

Historically, the first report on the application of ANN in the descriptor pruning step of QSAR was published by Wikel and Dow in 1993 [216]. They found that a back-propagation neural network can be an efficient and effective method to select relevant variables for QSAR studies [216]. An early example of the successful application of ANN as a descriptor selection tool was the modeling of the activity of a set of HIV-1 reverse transcriptase inhibitors [221].

Tetko et al. in 1996 investigated five pruning algorithms to eliminate non-relevant input variables and to estimate the importance of input variables in feed-forward ANN trained by the back-propagation algorithm [218]. It should be noted, however, that the algorithms they used gave similar variable estimations for simulated data examples but different ones for real QSAR data sets.

Kovalishyn et al. in 1998 proposed a variable selection algorithm based on feed-forward ANN trained by the cascade-correlation learning method [217]. In the cascade-correlation algorithm a small network is initially generated and then new nodes are added to the network dynamically until it is able to solve the problem [217]. Kovalishyn et al. showed the ability of their proposed pruning algorithm by analyzing different simulated and real QSAR data sets. In 1998 an article was published by Tetko et al. with more details about ANN as a pruning tool suitable for variable selection in different chemical data, including QSAR data sets [222]. They tried to illustrate the ability of the ANN as an accurate, fast, and consistent method useful in pharmaceutical fingerprinting.

3.3.4. Automatic Relevance Determination (ARD) Method

As noted previously, different manipulations of ANN and hyphenated methods have also been used for subset selection. One of these well-known algorithms is automatic relevance determination (ARD), which was originally developed by MacKay [223] and Neal [224].

A variable selection algorithm was proposed in 2000 by Burden et al. for use in QSAR by coupling ARD and Bayesian regularized artificial neural networks (BRANNs) [225]. It should be noted that BRANN uses a single rate of weight decay (α) for all the network weights (input, hidden and output layers), whereas in ARD the weights of the input, hidden and output layers are gathered in three separate classes, which allows the algorithm to control the weights of each layer independently. In ARD it is possible to estimate the importance of each input and to turn off the effect of irrelevant ones in a network. Some other examples of the application of ARD in QSAR/QSPR can be found in the literature [226,227].


3.3.5. Generalized Simulated Annealing (GSA) Method

To the best of our knowledge, the first successful application of SA in variable selection, named generalized simulated annealing (GSA), was reported in 1993 by Sutter and Kalivas [54]. Then, Sutter et al. in 1995 utilized GSA with some corrections for descriptor selection in QSAR [107]. After that, simulated annealing was employed in several QSAR studies for descriptor selection in combination with various linear/non-linear mathematical methods, e.g. SA-MLR [228,229], SA-PCR [229], SA-PLS [230–234], SA-kNN (k-nearest-neighbor) [235,236] and SA-ANN [61,237]. SA has also been combined with GA and ANN to obtain a two-stage evolutionary algorithm for variable selection, which showed good ability in non-linear model construction using a radial basis function neural network [238]. In this variable selection algorithm, the prediction error is optimized by a GA procedure in the first step, and then the number of variables is minimized with the aid of SA in the second step [238].

SA (as a semi-random, iterative improvement method) is a multivariate global-search optimization method based on the Metropolis Monte Carlo algorithm that was proposed based on the annealing process [239]. The principal idea in SA is based on the different configurations that a system sample takes (according to the Boltzmann distribution) while it is heated to a high temperature and then gradually cooled to a predefined temperature (e.g., room temperature). The mathematical details of GSA can be found elsewhere [54,107,240,241].

Here again, after removing a variable in each iteration step, a cost function is computed and compared to that of the previous step. Removing a non-significant variable (descriptor) should not lead to a solution worse than the previous one. But in SA there is still a probability, p, for the removal of variables (descriptors) whose elimination was not confirmed in previous iterations. Considering such a probability makes it possible to jump out of a local minimum [241,242].

In analogy with annealing, the lowest cost function (energy configuration) is obtained by lowering the temperature slowly. In GSA, the initial temperature is set using a user-defined parameter, β, and can be adjusted using an estimated global minimum of the cost function (C0), which is also a user-defined variable. Thus, the algorithm can be abstracted in the following steps:

1- Select an initial current position (a set of descriptors) randomly and compute its cost function, Cc.

2- A new random position (new set of descriptors) is then selected in the neighborhood of the current position and its cost function, Cn, is computed. Now:

- If Cn < Cc, the new position is accepted and kept as the current position and the above process is continued.

- If Cn > Cc, then the new position is accepted as the current position with a probability equal to P that originates from Boltzmann distribution theory:

$$P = \exp\left[-\beta\,\frac{C_n - C_c}{C_c - C_0}\right] \qquad (1)$$

where C0 is the estimated global minimum of the cost function.

3- The acceptance probability P should be compared to a random number p (drawn from a uniform random distribution on [0,1]). Now:

- If p ≤ P, the detrimental position is accepted as the current position.

- Otherwise, a new random position should be selected again and the process should be repeated.

As noted previously, the acceptance probability of the detrimental positions can limit the chance of falling into local minima. For this purpose, a logical selection of the control parameters is necessary. Some rules about the selection of β and C0 can be found in the literature [240,243].

4- Termination of GSA might be done using different criteria, for example:

- After evaluation of a certain number of new positions without acceptance.

- When |Cc – C0| < ε, where ε denotes a desired level of precision [240].
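The GSA steps can be sketched in Python as follows. This is a minimal illustration, not the published implementation: descriptor subsets are encoded as 0/1 masks, a neighbour is obtained by flipping one descriptor, and the acceptance probability is assumed to take the form exp[−β(Cn − Cc)/(Cc − C0)]. The function name `gsa_select` and all default parameter values are hypothetical.

```python
import math
import random


def gsa_select(cost, n_vars, beta=2.0, c0=0.0, eps=1e-9,
               max_rejects=200, max_steps=5000, seed=0):
    """Generalized simulated annealing over 0/1 descriptor masks (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: random initial position and its cost Cc.
    current = [rng.randint(0, 1) for _ in range(n_vars)]
    c_cur = cost(current)
    best, c_best = current[:], c_cur
    rejects = 0
    for _ in range(max_steps):
        # Step 4: stop near the estimated global minimum or after many rejections.
        if abs(c_cur - c0) < eps or rejects >= max_rejects:
            break
        # Step 2: neighbouring position = current subset with one descriptor flipped.
        cand = current[:]
        i = rng.randrange(n_vars)
        cand[i] = 1 - cand[i]
        c_new = cost(cand)
        if c_new < c_cur:
            accept = True
        else:
            # Step 3: accept a detrimental move with probability P (assumed form).
            p_accept = math.exp(-beta * (c_new - c_cur) / max(c_cur - c0, eps))
            accept = rng.random() <= p_accept
        if accept:
            current, c_cur, rejects = cand, c_new, 0
            if c_cur < c_best:
                best, c_best = current[:], c_cur
        else:
            rejects += 1
    return best, c_best
```

In a QSAR setting, `cost` would typically be a cross-validated prediction error of a model built on the selected descriptors; here any callable that returns a number no smaller than `c0` can be used.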

3.3.6. kNN variable selection method (kNN-QSAR)

The kNN variable selection or kNN-QSAR method was proposed in 2000 by Zheng and Tropsha [244]. Although standard kNN (which is common in pattern recognition) is a classic and non-intelligent method, the presence of SA at the core of the kNN-QSAR algorithm led us to classify kNN-QSAR among the artificial intelligence-based methods.

It is clear to specialists in the field of pattern recognition that kNN has a simple basic idea: (1) estimation of the distances between all the objects in the training set and an unknown object (xu); (2) selection of the k objects (usually an odd number) from the training set that are closest to object xu; (3) classification of object xu into the group to which most of these k objects belong. The optimum k value is chosen using an external test set or by leave-one-out cross-validation (LOO-CV).

However, the kNN-QSAR algorithm differs somewhat from kNN classification because it was developed for the subset selection problem. In other words, the kNN-QSAR method combines the kNN classification rules with a variable selection procedure, as summarized in the following:

(1) Select a subset of d descriptors randomly, where d might be a number between 1 and the total number of descriptors.

(2) Validate the model constructed from the d descriptors using a standard LOO-CV procedure (kNN cross-validation).

(3) To find the best subset of d descriptors, steps 1-2 are repeated and the corresponding Q2 values are calculated. The best subset is the one with the highest Q2 value. Changing the subset to find the optimum solution is directed by generalized simulated annealing (GSA), with Q2 continuously checked as the fitness function.

More details about the cross-validation used in step 2 and the GSA used in step 3 can be found in the original report [244].
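Step 2 of the procedure, the leave-one-out Q2 of a kNN model restricted to a descriptor subset, can be sketched as below. As an assumption made only for this illustration, the activity of the left-out molecule is predicted as the unweighted mean activity of its k nearest neighbours in the selected descriptor subspace; the function name `loo_q2` is hypothetical.

```python
import math


def loo_q2(X, y, subset, k=3):
    """Leave-one-out Q2 of a kNN predictor using only the descriptor columns in `subset`."""
    n = len(y)
    y_mean = sum(y) / n
    press = 0.0
    for i in range(n):
        xi = [X[i][c] for c in subset]
        # Distances from the left-out molecule to every other molecule.
        dists = sorted(
            (math.dist(xi, [X[j][c] for c in subset]), j)
            for j in range(n) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        # Assumed prediction rule: mean activity of the k nearest neighbours.
        y_pred = sum(y[j] for j in neighbours) / len(neighbours)
        press += (y[i] - y_pred) ** 2
    ss_total = sum((v - y_mean) ** 2 for v in y)
    return 1.0 - press / ss_total
```

A search procedure such as GSA would then call `loo_q2` repeatedly with different candidate subsets and keep the subset with the highest value.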

The kNN method has been utilized in many QSAR/QSPR reports by Tropsha and coworkers. Tropsha and Zheng applied kNN for the identification of pharmacophores and compared their kNN-QSAR algorithm with GA-PLS [245]. The kNN method was also utilized to construct a predictive QSAR model for some epipodophyllotoxins in order to develop potential anticancer agents [246]. Tropsha et al. also presented an overview of the validation of some QSPR models constructed using kNN variable selection [247]. Shen et al. used the kNN-QSAR approach to investigate the metabolic stability of drug candidates [248].

Also, Shen et al. used kNN-QSAR for data mining and the discovery of anticonvulsant agents in a database containing ca. 250,000 compounds. According to their constructed models, 22 compounds were chosen as consensus hits and some new candidates with acceptable activities were synthesized [249]. As another application to large data sets, Votano et al. utilized four techniques, including kNN-QSAR, to investigate human serum protein binding on a set of 1008 molecules with known experimental activities [250].

3.3.7 Ant Colony System (ACS) Method

Ant colony optimization (ACO) is a meta-heuristic algorithm and, similar to most artificial intelligence processes, was originally proposed for solving complicated optimization problems [251]. ACO was introduced by Dorigo et al. in 1991–1992 [251,252] and was utilized for the first time in descriptor selection of QSAR studies in 2002 by Izrailev and Agrafiotis [253]. A notable characteristic of ACO, which distinguishes it from traditional optimization algorithms, is that it produces a good sub-optimal solution in a very short computational time [254].

ACO can be explained with a travelling salesman problem (TSP). Suppose mi(t) is the number of ants in position i at time t, M is the total number of ants ($M = \sum_i m_i(t)$), and τij(t) is the intensity of the pheromone trail on connection (i,j) at time t. The initial τij(t) is denoted by τ0, which is the amount of pheromone deposited on each edge of the travel map. When an ant moves along connection (i,j), a certain amount of pheromone remains on that connection, and the pheromone vanishes gradually with time. An updating function describes the pheromone level on the selected edge [251,253,255,256].

In simple words, the algorithm works as follows for a TSP: at first, m ants are placed in a certain position and then every ant travels to another position, which is determined with a certain probability [255]. The pheromone level on each path is updated according to an updating function, and this process is run iteratively until the minimum error is obtained or the terminating conditions are satisfied [254]:

$$\tau_{ij}(t + n) = \rho\,\tau_{ij}(t) + \Delta\tau_{ij} \qquad (2)$$

In the above equation, ρ is a coefficient which represents the extent of the pheromone retained on the path ij and can vary between 0 and 1. ∆τij shows the increment of pheromone that is left on the path ij at the circle map and is defined as follows:

$$\Delta\tau_{ij} = \sum_{k} \Delta\tau_{ij}^{k} \qquad (3)$$

where $\Delta\tau_{ij}^{k}$ denotes the pheromone that the kth ant leaves on the path ij at this circle. Then, the rule can be shown as below:

$$\Delta\tau_{ij}^{k} = \begin{cases} Q/L_k & \text{if ant } k \text{ passes through connection } (i,j) \\ 0 & \text{else} \end{cases} \qquad (4)$$

In the above rule, Lk is the tour length found by the kth ant and Q is a constant.

The conventional ACO algorithm is useful for solving ordering problems such as the TSP. For a subset selection problem, as needed in the variable selection step of QSAR, some modifications have been proposed [253,255], known as modified ACO. In other words, there is no concept of path in subset selection, and conventional ACO cannot be used directly for variable selection in QSAR. Here again, a binary (0 or 1) notation is used. When an ant moves in an N-dimensional search space of N variables, its motion is restricted to 1 (selected variable) or 0 (not selected) on each dimension. So, a moving probability of 0 or 1 is used for the selection of variables by ants. The intensity of the pheromone on each dimension (variable/descriptor) can be divided into two kinds, τi1 and τi0, which represent the pheromone level of the ith descriptor taking the value 1 and 0, respectively. The updating rule used to update the pheromone level is defined by:

$$\tau_{i1}(t + n) = \rho\,\tau_{i1}(t) + \Delta\tau_{i1} \qquad (5)$$

$$\tau_{i0}(t + n) = \rho\,\tau_{i0}(t) + \Delta\tau_{i0} \qquad (6)$$

with the increments ∆τi1 and ∆τi0, respectively:

$$\Delta\tau_{i1} = \sum_{k} \Delta\tau_{i1}^{k} \qquad (7)$$

$$\Delta\tau_{i0} = \sum_{k} \Delta\tau_{i0}^{k} \qquad (8)$$

where $\Delta\tau_{i1}^{k}$ and $\Delta\tau_{i0}^{k}$ show the amount of pheromone that ant k leaves on descriptor/variable i at the circle. They are computed piecewise (Eqs. (9) and (10)) from F and FH, depending on whether the kth ant selected (for $\Delta\tau_{i1}^{k}$) or did not select (for $\Delta\tau_{i0}^{k}$) variable i only in the current iteration, only in its historical global best solution, or in both.

In the above equations, F and FH are determined using a fitness function [255]. The common fitness function in ACO for subset selection is the residual sum of squares of the model based on the selected variables [255]. Finally, m ants select variables from all N variables with a probability defined by the following equation: (11) More details can be found in the literature [255,257].

After the modification of ACO for variable selection [253,255] in quantitative structure-function relationships, this algorithm has been used in different QSAR/QSPR studies as a single or hyphenated method. For example, Izrailev and Agrafiotis used ACO in data partitioning for constructing regression tree QSAR models [258].
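The binary form of ACO for subset selection can be illustrated with the following Python sketch. It is a simplified illustration rather than the published modified ACO: each ant deposits only its current fitness (the historical-best term FH is omitted), the probability of selecting variable i is assumed to be τi1/(τi1 + τi0), and `fitness` is a hypothetical user-supplied function that must return positive values where larger is better.

```python
import random


def binary_aco(fitness, n_vars, n_ants=10, n_iter=30, rho=0.8, seed=0):
    """Simplified binary ant colony optimization for subset selection (illustrative)."""
    rng = random.Random(seed)
    tau1 = [1.0] * n_vars  # pheromone for "variable i selected"
    tau0 = [1.0] * n_vars  # pheromone for "variable i not selected"
    best_mask, best_fit = None, float("-inf")
    for _ in range(n_iter):
        d1 = [0.0] * n_vars
        d0 = [0.0] * n_vars
        for _ in range(n_ants):
            # Assumed selection probability: tau_i1 / (tau_i1 + tau_i0).
            mask = [1 if rng.random() < tau1[i] / (tau1[i] + tau0[i]) else 0
                    for i in range(n_vars)]
            f = fitness(mask)
            if f > best_fit:
                best_mask, best_fit = mask, f
            # Each ant leaves pheromone on the state (1 or 0) it chose per variable.
            for i, bit in enumerate(mask):
                if bit:
                    d1[i] += f
                else:
                    d0[i] += f
        # Evaporation (rho) plus the increments accumulated over all ants.
        for i in range(n_vars):
            tau1[i] = rho * tau1[i] + d1[i]
            tau0[i] = rho * tau0[i] + d0[i]
    return best_mask, best_fit
```

In practice the fitness would be derived from the residual sum of squares of a model built on the selected variables, transformed so that better models give larger positive values.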

Gunturi et al. in 2006 utilized ACO followed by MLR to obtain an optimal five-parameter QSPR model, from a pool of 327 molecular descriptors, for the prediction of in vitro binding data of 94 diverse drugs and drug-like molecules to human serum albumin [259].

In 2007, Shi et al. developed QSAR models for the inhibitory action of some analogues of 4-(3-bromoanilino)-6,7-dimethoxyquinazoline on epidermal growth factor receptor tyrosine kinase with the aid of the modified ACO method and compared the results with the evolutionary algorithm (EA) [260].

Goodarzi et al. obtained a quantitative structure-activity relationship model for the anti-HIV-1 activities of some 3-(3,5-dimethylbenzyl)uracil derivatives [261]. They applied ACO as the feature selection strategy to select relevant descriptors as the input of some linear and non-linear regression methods (MLR, PLS, SVM) [261].

Shamsipur et al. in 2009 proposed a new approach based on the use of external memory in ACO for descriptor selection in QSAR/QSPR [262]. In their approach, several ACO algorithms are run to obtain an external memory containing a number of elite ants, and all of these elite ants are allowed to update the pheromone levels; the external memory is then renewed using the updated pheromones. These steps are run iteratively for a defined number of iterations to reach several top solutions [262]. In another work, Shamsipur et al. studied the effect of combining the ant colony system (ACS) algorithm, as a version of ACO, with various local search strategies, including forward selection, backward elimination, forward selection together with backward elimination, backward elimination together with forward selection, genetic algorithm, and Tabu search [263]. Their work confirmed that combining various local search methods with ACS can improve the ability of the ACS algorithm.

Hemmateenejad et al. in 2011 utilized classification and regression trees (CART) along with ACS [264]. They used genetic algorithm operators (e.g., mutation and crossover) in combination with ACS to select the best model. PLS was used to build a model at each terminal node of the tree constructed by their proposed ACS-GA algorithm. The ability of the developed tree was tested by QSPR modeling of the melting points of a set of approximately 4173 compounds [264]. Some other applications of various versions of ACO in descriptor selection for QSPR/QSAR can be found in [265–270].

3.3.8 Particle Swarm Optimization (PSO)

PSO, originally proposed in 1995 by Eberhart and Kennedy [271,272], was modified for subset selection in QSAR by Agrafiotis and Cedeño in 2002 [273] and then by Shen et al. in 2004 [274]. This algorithm is based on a simulation of the random behavior of a group of birds looking for food in a region. In this method, the optimization starts with a population of random solutions and continues by searching for optima through updating the generations. In this respect, PSO is similar to other population-based intelligent algorithms like GA. But unlike GA, there are no evolution operators such as crossover and mutation in PSO. Instead, in PSO each individual flies in the N-dimensional search space with a certain velocity that determines the flying direction of the particle. Then, the particles fly through the target space by following the current optimum particle. The advantages of PSO over GA are that it has few parameters to adjust and is easier to implement [274,275].

In the search space of PSO, each single solution is a particle, and the exploration of a problem space can be modeled using a population of individuals or particles. Similar to other evolutionary calculations, the fitness of the particles is continuously evaluated and the population of individuals is updated using some updating rule to move toward better solution areas in the total space. Suppose that the ith particle is denoted by mi = (mi1, mi2, …, miD), its best previous position (the one giving the best fitness value) is represented by pi = (pi1, pi2, …, piD), the best particle among all the particles in the population is denoted by pg = (pg1, pg2, …, pgD), and the rate of position change of the ith particle, or particle velocity, is denoted by vi = (vi1, vi2, …, viD). During each iterative run, each particle is updated by considering these two best values, and the velocity and position of the particle are then updated by these rules:

$$v_{id} = v_{id} + c_1 r_1 (p_{id} - m_{id}) + c_2 r_2 (p_{gd} - m_{id}) \qquad (12)$$

$$m_{id} = m_{id} + v_{id} \qquad (13)$$

In the above equations, r1 and r2 are random numbers in the range 0–1 and c1 and c2 are learning factors, which can be constant positive values.
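The velocity and position updates of Eqs. (12) and (13) can be sketched in Python as follows. The sketch minimizes a generic `cost` function and makes several assumptions that are not part of the text: particles are initialized uniformly in [-5, 5], velocities are clamped to a hypothetical limit `v_max` to keep the swarm from diverging, and the function name `pso_minimize` is ours.

```python
import random


def pso_minimize(cost, dim, n_particles=15, n_iter=100,
                 c1=2.0, c2=2.0, v_max=1.0, seed=0):
    """Continuous particle swarm optimization following Eqs. (12)-(13) (illustrative)."""
    rng = random.Random(seed)
    m = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]  # positions m_i
    v = [[0.0] * dim for _ in range(n_particles)]                                # velocities v_i
    p = [pos[:] for pos in m]                                                    # personal bests p_i
    p_cost = [cost(pos) for pos in m]
    g = min(range(n_particles), key=lambda i: p_cost[i])
    pg, pg_cost = p[g][:], p_cost[g]                                             # global best p_g
    history = [pg_cost]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Eq. (12): pull toward the personal best and the global best.
                v[i][d] += c1 * r1 * (p[i][d] - m[i][d]) + c2 * r2 * (pg[d] - m[i][d])
                v[i][d] = max(-v_max, min(v_max, v[i][d]))  # assumed velocity clamp
                # Eq. (13): move the particle.
                m[i][d] += v[i][d]
            c = cost(m[i])
            if c < p_cost[i]:
                p[i], p_cost[i] = m[i][:], c
                if c < pg_cost:
                    pg, pg_cost = m[i][:], c
        history.append(pg_cost)
    return pg, pg_cost, history
```

The global best cost recorded in `history` can only stay the same or decrease, because the best position is replaced only when a particle finds a lower cost.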

In the modified PSO for variable selection [274], we face a binary problem in which a particle is updated in the discrete states 0 or 1. The velocity (vid) represents the probability of a bit taking the value 1 or 0. The updating rule for such a problem has been defined as:

$$\text{If } (0 < v_{id} \le a): \quad m_{id}(\text{new}) = m_{id}(\text{old}) \qquad (14)$$

$$\text{If } (a < v_{id} \le 0.5(1 + a)): \quad m_{id}(\text{new}) = p_{id} \qquad (15)$$

$$\text{If } (0.5(1 + a) < v_{id} \le 1): \quad m_{id}(\text{new}) = p_{gd} \qquad (16)$$

where a is the “static probability”, a random value in the range of 0 to 1 whose initial value is 0.5. A fitness function (based on PLS) that can be utilized to terminate the PSO algorithm is given below:

$$F = \frac{RSS_k}{\sigma_{PLS}^{2}} - (n - 2k) \qquad (17)$$

where RSSk is the residual sum of squares of the k-latent-variable QSAR model based on the selected dependent variables and n is the number of dependent variables.
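A discrete update of one particle can be sketched in Python as below. The rule coded here is an assumption made for illustration: for each bit a velocity is drawn uniformly from [0, 1); the bit is kept if the velocity is at most a, copied from the particle's own best position if it lies between a and 0.5(1 + a), and copied from the global best otherwise. The function name `update_binary_particle` is hypothetical.

```python
import random


def update_binary_particle(m_old, p_i, p_g, a=0.5, rng=random):
    """Update a 0/1 particle bit by bit from its old state, its own best and the global best."""
    new = []
    for d in range(len(m_old)):
        v = rng.random()  # assumed: velocity drawn uniformly from [0, 1)
        if v <= a:
            new.append(m_old[d])   # keep the old bit
        elif v <= 0.5 * (1 + a):
            new.append(p_i[d])     # copy the bit from the personal best
        else:
            new.append(p_g[d])     # copy the bit from the global best
    return new
```

With a = 1 the particle never changes, while a = 0 makes every bit come from either the personal or the global best with equal probability, so a controls how strongly the swarm is pulled toward known good subsets.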

PSO has shown good potential for the identification of relevant descriptors in many QSAR/QSPR models (some examples are noted in the following). In 2004, Shen et al. combined PSO with piecewise modeling (PM), called the PMPSO approach, and showed its potential by application to a QSAR study of the antagonism of angiotensin II [276]. A hyphenated method using PSO and ANN was presented by Wand et al. in 2004 [277]. In their work, binary PSO was used for feature selection to feed a back-propagation neural network as a non-linear regression method. Another combination of PSO and ANN was developed by Meissner et al. [278]. They utilized Optimized Particle Swarm Optimization (OPSO) as a version of PSO and investigated its performance on a set of five artificial fitness functions [278]. Some other applications of PSO combined with different chemometrics modeling methods have been reported in recent years, e.g. PSO-MLR [275,279], PSO-PLS [280–282], PSO-ANN [283–287], PSO-kNN-kernel regression [288], PSO-SVM [289–295], PSO-GA-SVM [296], piecewise hypersphere modeling by PSO (PHMPSO) [297,298], PSO-RT (regression trees) [299,300] and PSO-LWR (non-linear locally weighted regression) [301].

3.4. Dimension Reduction and feature extraction

In many instances, the number of calculated descriptors is very large. For example, Dragon software calculates more than 3000 descriptors per molecule [302]. In addition, in 3D QSAR methods such as CoMFA or CoMSIA, a very large number of variables is generated [36,39,40,303]. Most variable selection methods fail to select the optimum subset of descriptors from a pool of a very large number of variables, especially when there are multicollinear variables [304,305]. So, besides feature selection, feature extraction has also been utilized in different QSAR/QSPR studies [169,306–309]. In this case, the original set of descriptors is projected onto new variables in a lower dimensional space. Here, we give a very short report on the feature extraction approaches used in QSAR/QSPR studies. The purpose of feature extraction is to find the best linear/nonlinear combination of features based on a dimensionality reduction approach [310–313]. Principal component analysis (PCA) [310,314], Rényi entropy [315], hierarchical discriminant regression (HDR) [316], independent component analysis (ICA) [317], multidimensional scaling (MDS) [318], partial least squares (PLS) [319], nonlinear mapping (NLM) [320,321] and molecular maps (MOLMAP) [322–324] are the most widely applied feature extraction techniques.

A combination of feature extraction and feature selection, i.e., selection of the extracted features, has also been reported [191,309,324]. For example, the principal components extracted by PCA can be selected by GA as the input of linear or nonlinear QSAR models [166,191,192]. One major drawback of combined feature extraction and selection methods is that the extracted features are usually generated from all original descriptors and hence carry information from both informative and redundant descriptors [7]. As a solution to this limitation, for example in PCA, the use of the loadings of the selected principal components has been suggested to identify the variables with high loading values [325,326]. Also, instead of applying feature extraction methods to the whole data set, Hemmateenejad et al. suggested first partitioning the variables into different subsets and then running PCA on each subset separately. In this context, segmented principal component analysis and regression (SPCAR) [327] and clustering of variables (CLoVA) [328,329] have been suggested recently. These are combined feature extraction/selection strategies that finally result in the identification of the important variables.

4. Chemometrics tools in QSAR model development

Because of the multivariate nature of QSAR/QSPR models, different developments in multivariate calibration have been applied one after another in QSAR/QSPR [330]. In the above parts, combinations of different modeling methods with various variable selection strategies were reviewed. Here, a short explanation is presented for some important multivariate methods, which might be utilized as an independent model construction method after the variable selection step or be included in a hyphenated variable selection strategy. Most of the following methods caused an important evolution in QSAR/QSPR model development.

4.1 Multiple linear regression (MLR)

Historically, MLR is the first multivariate calibration method used for model construction. Among the advantages of MLR, its simplicity of use and the interpretability of its models have led to extensive applications in almost all branches of QSAR and QSPR studies [331–353]. In MLR, a linear relationship is established between the molecular features of a molecule, which are usually expressed as a descriptor vector x, and its target activity/property, y. The general form of MLR is as follows:

$$y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_k x_k + e \qquad (18)$$

where x1 to xk are the independent variables (molecular descriptors), b1 to bk are the MLR coefficients of descriptors x1 to xk, and b0 is the model's constant or intercept. The residual of the target activity/property that is not covered by the model is denoted by e. For a set of molecules, the matrix notation of the above equation is:

$$y = Xb + e \qquad (19)$$

The magnitudes of the model's coefficients (the normalized values) indicate the relative importance of the descriptors, and their signs indicate whether these descriptors contribute positively or negatively to the target activity/property. The least squares solution for the estimation of b is:

$$b = (X^{T}X)^{-1}X^{T}y \qquad (20)$$
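The least squares estimate of the MLR coefficients can be computed with NumPy as in the short sketch below. The function name `fit_mlr` is hypothetical, and the intercept b0 is handled by adding a column of ones to X, which is one common convention rather than the only one.

```python
import numpy as np


def fit_mlr(X, y):
    """Return [b0, b1, ..., bk] for y = b0 + b1*x1 + ... + bk*xk + e by least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the first coefficient is the intercept b0.
    X1 = np.column_stack([np.ones(len(X)), X])
    # lstsq solves the same problem as (X'X)^-1 X'y but is numerically more stable.
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return b
```

Using `np.linalg.lstsq` instead of explicitly inverting X'X gives the same coefficients for well-conditioned data and behaves more gracefully when descriptors are nearly collinear.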

The main drawback of MLR is its limitation in modeling descriptors with a high degree of collinearity, which might lead to a model with inaccurate regression coefficients [354]. In addition, the number of variables in the model should not exceed the number of molecules in the training set. From a mathematical point of view, the number of molecules should be at least equal to the number of descriptors plus 1. However, practical considerations imply that in QSAR/QSPR studies the number of molecules should be much larger than the number of variables [355,356]. A minimum ratio of 5:1 for the “number of molecules to the number of variables” has been suggested, which is known as the Topliss ratio [355].

4.2 Principal component regression (PCR)

A logical approach to overcome the problem of collinearity in an X matrix that contains a huge number of descriptors is to use the reduced space of the original descriptors obtained by projection methods such as principal component analysis (PCA). In PCR, the regression coefficient (vector b in Eq. (19)) is calculated using the score (T) and loading (P) matrices of X [357,358]:

$$b = P\,T^{+}y \qquad (21)$$

where the “+” superscript denotes the pseudo-inverse operator. The major drawback of principal component regression (PCR) is that the eigenvectors are extracted by looking only at the matrix of predictor variables, without any information about their relationship with the target (dependent) variable. In other words, the PCs extracted from the descriptor matrix are ranked by decreasing eigenvalue and are entered into the PCR model one after the other. This common process, known as eigenvalue ranking, can be a problem, especially in QSAR/QSPR studies, because it can lead to the selection of PCs that have no essentially good relationship with the activity/property. Different strategies have been proposed to overcome this problem and to enhance the performance of PCR-based models. Sutter et al. proposed a search method based on running GSA on the extracted PCs and the target variable to find PCs that give a lower PRESS after modeling [357]. Other variable selection methods such as GA have also been reported for selecting the best PCs to increase the ability of PCR to cover the variance of the data in QSAR/QSPR studies [189,190]. Hemmateenejad proposed a correlation ranking procedure in which the PCs are entered into the PCR model in order of preference for the PCs with higher correlation to the activity/property [306]. This correlation ranking has also been coupled with nonlinear methods such as ANN [305].
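PCR with the conventional eigenvalue ranking can be sketched in NumPy as follows. This is a minimal illustration under our own conventions: X and y are mean-centered, the loadings P are taken from the singular value decomposition of the centered X, and the first `n_pc` components are used. The function name `pcr_coefficients` is hypothetical.

```python
import numpy as np


def pcr_coefficients(X, y, n_pc):
    """Principal component regression: returns the intercept b0 and the coefficient vector b."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Rows of Vt are the principal axes, ordered by decreasing singular value (eigenvalue ranking).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pc].T                   # loadings of the retained PCs
    T = Xc @ P                        # scores
    b = P @ np.linalg.pinv(T) @ yc    # b = P T^+ y
    b0 = y_mean - x_mean @ b
    return b0, b
```

A correlation ranking variant would replace `Vt[:n_pc]` by the rows of `Vt` whose scores correlate most strongly with y; when all components are kept, the result coincides with ordinary least squares.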

Another strategy to enhance the prediction ability of PCR was reported by Hemmateenejad and Elyasi in 2009, called segmented principal component analysis regression (SPCAR), which was explained briefly in the previous section. SPCAR was employed in QSPR studies [327] and in QSAR of peptides as well [72,359].

Overall, the use of PCR in QSAR/QSPR is limited. MLR is preferred over PCR for its simplicity and the ease of interpretation of the generated model. On the other hand, both PLS and PCR are mathematically complex, but PLS is usually preferred for its higher prediction ability, as will be explained in the next section.

4.3. Partial least square (PLS)

PLS was first proposed by Herman Wold around 1975 for modeling complicated data sets [360]. The history of PLS can be found in the article by Geladi [361]. Partial least squares (PLS) is derived in a principal component-like manner. Again, the independent variables and the dependent variable are denoted by X and y, respectively, and latent variables t can be defined to cover the total variance of the data.

The details of the PLS algorithm are widely discussed in the literature [319,362], and different versions of PLS with some manipulations and extensions have been developed [363,364]. A brief theoretical view of this method is presented here. The scores of the X-matrix (T) are a limited number of orthogonal vectors that are linear combinations of X (the original descriptors) with coefficients generally denoted by W*:

$$T = XW^{*} \tag{22}$$

The decomposition of the X-block and Y-block into scores and loadings is a necessary step of the algorithm:

$$X = TP' + E \tag{23}$$

$$Y = UC' + G \tag{24}$$

where T and U are the scores of the X-block and Y-block, respectively, and their corresponding loadings are represented by P and C. The matrices E and G are the residual matrices of the X- and Y-blocks, which are near zero if a sufficient number of scores and loadings is used. The Y-block can also be expressed in terms of the X-scores T, with F as the residual matrix:

$$Y = TC' + F \tag{25}$$

To obtain an equation similar to MLR, Eq. (22) can be substituted into Eq. (25):

$$Y = XW^{*}C' + F \tag{26}$$

$$Y = XB + F \tag{27}$$

So B can be considered as the matrix of PLS regression coefficients:

$$B = W^{*}C' \tag{28}$$
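As a minimal sketch of how the regression coefficients arise from the scores and weights, the following pure-Python routine implements a NIPALS-style PLS for a single dependent variable. It is an illustration under stated assumptions, not the algorithm of any particular cited reference: the function names and toy data are hypothetical, no centering or scaling is applied, and the columns of W* are built with the standard recursion r_a = w_a − Σ_{j<a} (p_j′w_a) r_j.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls1_coefficients(X, y, n_components):
    """Return the PLS regression vector b = W* c for one y-variable."""
    X = [row[:] for row in X]          # work on copies; both are deflated below
    y = y[:]
    n_vars = len(X[0])
    P, R, C = [], [], []               # X-loadings, columns of W*, y-loadings
    for _ in range(n_components):
        # Weight vector: direction in X-space with maximal covariance with y.
        w = [dot([row[j] for row in X], y) for j in range(n_vars)]
        norm = dot(w, w) ** 0.5
        w = [wj / norm for wj in w]
        t = [dot(row, w) for row in X]                 # scores, t = Xw
        tt = dot(t, t)
        p = [dot([row[j] for row in X], t) / tt for j in range(n_vars)]
        c = dot(y, t) / tt                             # y-loading
        # Column of W*: expresses t directly in terms of the original X.
        r = w[:]
        for p_prev, r_prev in zip(P, R):
            k = dot(p_prev, w)
            r = [rj - k * rpj for rj, rpj in zip(r, r_prev)]
        # Deflate X and y by the part explained by this latent variable.
        X = [[xij - ti * pj for xij, pj in zip(row, p)] for row, ti in zip(X, t)]
        y = [yi - ti * c for yi, ti in zip(y, t)]
        P.append(p)
        R.append(r)
        C.append(c)
    return [sum(R[a][j] * C[a] for a in range(n_components)) for j in range(n_vars)]

# Hypothetical toy data generated exactly as y = 2*x1 - x2.
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y_toy = [2.0, -1.0, 1.0, 3.0]
b_toy = pls1_coefficients(X_toy, y_toy, n_components=2)
```

With as many latent variables as descriptors, the PLS coefficients coincide with the least-squares solution for this toy data; with fewer latent variables, a regularized approximation is obtained.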

Considering the above algorithm, it can be concluded that, in contrast with MLR, which deals with descriptors as separate entities and offsets or scales each of them independently to reach the best correlation with the target activity/property, PLS treats all the descriptors together, as a block. In summary, PLS transforms both the descriptor and target variable blocks so that their correlation is maximal [304]. To compare PLS and PCR, it should be considered that the transformation in PCR is performed only within-block (on the descriptor block), whereas the objective of the between-block (descriptor and target) transformation in PLS is to find the maximum overlap between both blocks [365].

Because of the large number of descriptors in many structure-activity/property relationships, PLS offers some benefits over MLR and PCR owing to its specific algorithm [366,367]:

(1) PLS can result in robust equations even when the number of descriptors greatly exceeds the number of chemical compounds. This is very important in MIF approaches like CoMFA [39], which extract a large number of descriptors by comparing the structures of a small number of molecules at thousands of points in 3D space.
(2) PLS obtains more accurate predictions in comparison with PCR and MLR.
(3) Because of the similarity in the structural skeleton of the compounds used in QSAR/QSPR, inter-correlation of descriptors is a common problem. In the presence of correlation between the independent variables (descriptors), PLS is able to derive more stable models.
(4) PLS is able to develop a model for more than one dependent variable simultaneously, which is important in QSAR when modeling multiple receptor assays or investigating the activity of a set of compounds against multiple microorganisms.

It is clear that, similar to PCR, the number of projected components in PLS, which are called “latent variables” (LVs), should be optimized for inclusion in the model, because using fewer LVs than a certain number causes loss of information and, conversely, increasing the number of LVs increases the risk of over-fitting [368]. Cross-validation (jack-knifing, leave-one-out, leave-many-out) is the common way to select the appropriate number of latent variables [366,369].

Another limitation of PLS is the risk of chance correlation [370,371], which leads to reporting structural variables that have no real relationship to the activity/property (Type II errors). It has been shown that the frequency of chance correlation for PLS is much lower than that for stepwise MLR [372,373], but it is maximal when the initial number of descriptors equals the number of molecules in the training set [373]. Surprisingly, and unlike stepwise MLR, the probability of chance correlation decreases indefinitely when the number of descriptors becomes much larger than the number of molecules [373].

To our knowledge, the first formal applications of PLS in QSAR were reported in 1983 and 1984 [374,375] and confirmed the advantages of PLS (such as its ability to handle both activity and multivariate structural descriptor data and its potential to treat cases with more descriptors than chemical compounds).

In 1989, Wold et al. proposed “non-linear PLS” [376], an extension of two-block PLS for cases in which the inner relation between the block scores U and T is nonlinear, which could help in many QSAR/QSPR studies. After that, different approaches were suggested to enhance or simplify the nonlinear inner relation of PLS. Some examples are NLPLS developed by Frank [377], PLS/neural networks [378–380], nonlinear PLS using a spline inner relation [381], nonlinear PLS improved by the numeric genetic algorithm (NPLSNGA) [382], GIFI-PLS by Eriksson et al. [383] and nonlinear PLS with a fuzzy inference system [384]. The application of N-way partial least squares developed by Bro [385], as a powerful robust algorithm, was another bold stage of PLS in QSAR and especially in 3D QSAR [386], where N might be three or higher. In 1996, Dunn III et al. proposed models for solving the receptor-bound geometry of a series of ligands using molecular shape analysis, N-way partial least squares (PLS) regression and 3-way factor analysis, and applied these regression methods to the prediction of the receptor-bound geometries of some trimethoprim-like dihydrofolate reductase inhibitors [387]. In 1998, Nilsson et al. used the GRID method for 3D QSAR modeling of a set of (S)-N-[(1-ethyl-2-pyrrolidinyl)methyl]-6-methoxybenzamide compounds with affinity towards the dopamine D2 receptor subtype to show the ability and validity of N-PLS [388]. Other examples are the application of N-way PLS in 3D-QSAR of a series of insecticidal neonicotinoid compounds [389] and in QSAR of peptides [390].

Because of the advantages of partial least squares, especially in the presence of a huge number of descriptors, different PLS algorithms have been used in QSAR/QSPR. There are many reports on the successful application of PLS in structure-activity/property relationships, and some examples are [235,389,391–419].

The combination of PLS with different variable selection tools has been reviewed in the above sections. More specifically, a review on variable selection methods in PLS was published in 2012 by Mehmood et al. [420].

4.4. Artificial Neural Network (ANN)

As presented in previous sections, ANN can be a powerful nonlinear regression method for model development, in addition to its role in variable selection [213]. Since the first presentation of ANN in its simplest form by McCulloch and Pitts in 1943 [421], various kinds of neural networks have been developed and new ones are designed every year, but all can be described using a similar skeleton [214]. ANN is surely one of the most well-known nonlinear methods, with many applications. Its theory and applications have been reviewed many times in books and reviews [211,213,214,422–424]. The first report on the application of ANN in QSAR was published by Aoyama et al. in 1990, from a classification point of view [210]. In subsequent works, Aoyama et al. confirmed the ability of ANN as a multivariate regression method to correlate the structural parameters (descriptors) and the target activity in QSAR [425,426]. Aoyama and Ichikawa also used ANN to obtain the correlation indices between drug activity and structural parameters [427]. In particular, feed-forward and counter-propagation networks were the two most applied methods in the field of nonlinear regression [423]. However, it is worth noting that different modifications, developments and hyphenations have been applied to ANNs to enhance their ability, and some of them have been used in model construction in QSAR/QSPR for both regression and classification purposes [428].

Livingstone and coworkers can also be considered pioneers in the use of ANN in QSAR. They proposed an ANN method for classification purposes in QSAR [429]. Livingstone and Salt developed an ANN-based method (multi-layer feed-forward) for regression in QSAR and investigated the risk of chance correlation and over-fitting [430]. Salt et al. showed the ability of ANN modeling in other QSAR studies [431].

After the advantages of feed-forward ANN became apparent (in spite of its limitations), different aspects of this method have been investigated in many reports on QSAR/QSPR since the 1990s (some examples are [108,432–440]).

Although different attempts were made to overcome the limitations of feed-forward back-propagation ANN [441], the fast locomotive of QSAR moved on to the station of other popular neural networks. As a well-known example, Bayesian Regularized Artificial Neural Networks (BRANNs) were used in 1998 to discriminate between drug-like and non-drug-like molecules [442] and a little later, in 1999, for regression in QSAR [441]. The BRANN methodology has been applied in many QSAR/QSPR studies [225,443–449].

Adaptive Neuro-Fuzzy Inference System (ANFIS) is one of the ANN methods and was applied for the first time in QSAR modeling by Loukas in 2001 [450]. He trained this hybrid algorithm using back-propagation and least-squares estimation, while the shape and number of membership functions were optimized by the subtractive clustering algorithm. In a fuzzy inference system (FIS) [451], each local behavior of the system is described by a fuzzy rule. If the FIS is considered as a feed-forward network, in which the primary inputs and computed intermediate results are passed along to calculate the final output, then the same back-propagation rules as in ANN can be applied. A network structure that performs FIS in this way is referred to as ANFIS. In ANFIS, a Sugeno-style FIS [452] with linear rule outputs is trained using hybrid learning rules. More theoretical details about ANFIS can be found in the corresponding literature [450,453]. ANFIS has shown good ability as a modeling tool in different QSARs/QSPRs [184,454–459].

4.5. Genetic Function Approximation (GFA)

One of the problems that may occur when starting with a large number of descriptors is the possibility of obtaining significantly different models by using various QSAR/QSPR algorithms. This problem raises an important question: “Does a single best model even exist, or instead is there a collection of models of the same performance quality?” [460]. In contrast with many techniques (e.g. MLR, PLS, and simple ANNs) that generate a single regression model by incremental addition or deletion of descriptors, genetic function approximation (GFA) works on a population of many candidate models rather than determining only a single final best one. GFA was developed in 1994 by Rogers and Hopfinger [460] and was discussed further later [461].

GFA is a combination of GA [140] and the multivariate adaptive regression splines (MARS) algorithm [462]. The MARS method utilizes “truncated power spline” terms in the regression process [460]. If x is the value of the original variable and t is the knot of the spline, the spline term can appear in the two forms 〈x − t〉 or 〈t − x〉. The values of the spline terms are determined as follows:

In the case of 〈t − x〉:

$$\langle t - x \rangle = \begin{cases} t - x, & x < t \\ 0, & \text{otherwise} \end{cases} \tag{29}$$

In the case of 〈x − t〉:

$$\langle x - t \rangle = \begin{cases} x - t, & x > t \\ 0, & \text{otherwise} \end{cases} \tag{30}$$
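The two truncated spline forms can be sketched in a few lines. The model below is a hypothetical illustration (the knot, coefficients and function names are assumptions) and is not a GFA or MARS implementation; it only shows how such terms introduce a change of slope at the knot.

```python
def spline_pos(x, t):
    """<x - t>: equals x - t when x > t, otherwise 0."""
    return max(0.0, x - t)

def spline_neg(x, t):
    """<t - x>: equals t - x when x < t, otherwise 0."""
    return max(0.0, t - x)

def spline_model(x, intercept, terms):
    """Evaluate intercept + sum of coef * basis(x, knot) over (coef, basis, knot) terms."""
    return intercept + sum(coef * basis(x, knot) for coef, basis, knot in terms)

# Hypothetical model: the response peaks at a descriptor value of 3.0 (the knot)
# and decreases with different slopes on either side of it.
terms = [(-1.5, spline_neg, 3.0), (-0.5, spline_pos, 3.0)]
value_below = spline_model(1.0, 5.0, terms)   # only <t - x> is active
value_at_knot = spline_model(3.0, 5.0, terms) # both terms are zero
value_above = spline_model(5.0, 5.0, terms)   # only <x - t> is active
```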

The role of the spline basis functions is to introduce nonlinearity into the constructed regression model. The MARS algorithm makes it possible to generate spline-based models with moderate numbers of independent variables, usually fewer than 20. The performance of MARS is high and can often compete with ANN [462]. The disadvantage of MARS is its high computational cost when it is run on a large number of descriptors (e.g. more than 20). In addition, because MARS builds the model incrementally (similar to what was said about stepwise regression), it may fail to find regression models containing combinations of descriptors that predict poorly individually but well as a group [460,462,463]. Therefore, GFA uses a GA step to search the MARS problem space and determine a population of regression equations that show the best fit to the training set. The GFA algorithm is well discussed in the original references [460,463]. Hopfinger et al. in 1998 used GFA in a 4D-QSAR study of three different data sets. They included conformational and alignment freedom in the construction of 3D-QSAR models for the training set by performing ensemble averaging as the 4th “dimension” [464].

Shi et al. analyzed the antitumor activity patterns of some ellipticine analogues in a QSAR study using GFA as the modeling algorithm. They suggested a procedure for improving the performance of GFA and discussed the relative advantages and disadvantages of GFA in QSAR [463]. Another successful GFA model for screening the activity of antitumor compounds was reported by Fan et al. [465].

As other examples, GFA has been applied as an effective chemometrics tool in QSAR studies of adenosine A3 receptor antagonist 1,2,4-triazolo[4,3-a]quinoxalin-1-one derivatives [466], some aryl heterocycle-based thrombin inhibitors [467], a set of cyclic cationic antimicrobial peptides derived from protegrin-1 [468], the affinity of a set of arylpiperazines toward α1 adrenoceptors [469], human protein tyrosine phosphatase 1B inhibitors [470], potent human protein tyrosine phosphatase inhibitors [471] and some thiourea derivatives as influenza virus neuraminidase inhibitors [472], and also in predicting the ecotoxicity of ionic liquids [473]. The potential of GFA has made this algorithm an interesting option in many QSAR/QSPR reports over these years [74,469,473–500].

4.6. Support-vector machine Regression (SVMR)

SVMR was developed by Vapnik [501]; however, its origin, i.e. the support vector machine (SVM) in pattern recognition, is older [502]. SVM is based on linear or nonlinear (e.g. RBF) kernels and thus can be useful for improving the correlation in data sets of a nonlinear nature [503]. In contrast to dimension reduction methods, the basic idea in SVMR is to map the data set into a higher-dimensional space (see Fig. 5). If $x_i$ is the input vector, $d_i$ the desired value and n the total number of data patterns, the central idea of SVM regression (SVR) is the approximation of the set $\{(x_i, d_i)\}_{i=1}^{n}$.

In SVR, the data $x_i$ are projected to a higher-dimensional space F using a nonlinear mapping function (Φ), which allows a linear regression to be constructed in the F space:

$$f(x) = w^{T}\Phi(x) + b \tag{31}$$

where w is a vector of coefficients, Φ(x) maps the input x to a vector in F, and b is a coefficient. The coefficients w and b are estimated by minimizing the regularized risk function, as shown in Eqs. (32) and (33):

$$R(C) = C\sum_{i=1}^{n} L_{\varepsilon}(d_i, y_i) + \frac{1}{2}\lVert w \rVert^{2} \tag{32}$$

$$L_{\varepsilon}(d, y) = \begin{cases} |d - y| - \varepsilon, & |d - y| \ge \varepsilon \\ 0, & \text{otherwise} \end{cases} \tag{33}$$

where ε is a prescribed parameter and $y_i = f(x_i)$ is the value predicted by the model.

The first term in Eq. (32), $C\sum_{i=1}^{n} L_{\varepsilon}(d_i, y_i)$, is called the “empirical error” and is determined by the ε-insensitive loss function (Eq. (33)). According to the ε-tube in the loss function, if the predicted value is outside the tube, the loss is the amount by which the absolute error exceeds the radius ε of the tube. If the predicted point is within the tube, the loss is set to zero. Minimizing the second term in Eq. (32), $\frac{1}{2}\lVert w \rVert^{2}$, which is called the regularized term, can be used to control the function capacity. For example, minimizing the regularized term leads to a function that is as flat as possible. A penalty parameter C is used in the above risk function to trade off between the training error of the model and its flatness.
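A minimal sketch of the ε-insensitive loss and the two-term risk described above is given below. It assumes the common textbook form in which the empirical term is C times the sum of the losses; the function names and numbers are hypothetical.

```python
def eps_insensitive_loss(d, y, eps):
    """Zero inside the eps-tube; |d - y| - eps when the prediction lies outside it."""
    return max(0.0, abs(d - y) - eps)

def regularized_risk(w, desired, predicted, C, eps):
    """Empirical error (C * sum of losses) plus the flatness term 0.5 * ||w||^2."""
    empirical = C * sum(eps_insensitive_loss(d, y, eps)
                        for d, y in zip(desired, predicted))
    flatness = 0.5 * sum(wj * wj for wj in w)
    return empirical + flatness

# Hypothetical values: the first prediction is inside the tube, the second is outside.
loss_inside = eps_insensitive_loss(1.0, 1.05, eps=0.1)
loss_outside = eps_insensitive_loss(2.0, 2.5, eps=0.1)
risk = regularized_risk([3.0, 4.0], [1.0, 2.0], [1.05, 2.5], C=10.0, eps=0.1)
```

A larger C puts more weight on the training errors outside the tube, whereas a smaller C favors a flatter function.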

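Introducing the slack variables “ξ” and “ξ*” (which indicate the upper and lower constraints on the outputs of the problem, respectively) allows Eq. (32) to be transformed for estimation of the coefficients w and b:

$$\text{minimize}\quad R(w, \xi, \xi^{*}) = \frac{1}{2}\lVert w \rVert^{2} + C\sum_{i=1}^{n}(\xi_i + \xi_i^{*}) \tag{34}$$

subject to

$$\begin{aligned} d_i - w^{T}\Phi(x_i) - b &\le \varepsilon + \xi_i \\ w^{T}\Phi(x_i) + b - d_i &\le \varepsilon + \xi_i^{*} \\ \xi_i,\ \xi_i^{*} &\ge 0 \end{aligned} \tag{35}$$

Now, by using the Lagrange multipliers ($\alpha_i$ and $\alpha_i^{*}$), the decision function (Eq. (31)) can be represented in a new form:

$$f(x, \alpha, \alpha^{*}) = \sum_{i=1}^{n}(\alpha_i - \alpha_i^{*})K(x, x_i) + b \tag{36}$$

where $K(x, x_i) = \Phi(x)^{T}\Phi(x_i)$ is the kernel function. The applied Lagrange multipliers satisfy the conditions $\alpha_i\alpha_i^{*} = 0$, $\alpha_i \ge 0$ and $\alpha_i^{*} \ge 0$ (for i = 1, …, n) and can be calculated from the dual form of Eq. (34) as follows:

$$\text{maximize}\quad R(\alpha, \alpha^{*}) = \sum_{i=1}^{n} d_i(\alpha_i - \alpha_i^{*}) - \varepsilon\sum_{i=1}^{n}(\alpha_i + \alpha_i^{*}) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\alpha_i - \alpha_i^{*})(\alpha_j - \alpha_j^{*})K(x_i, x_j) \tag{37}$$

subject to

$$\sum_{i=1}^{n}(\alpha_i - \alpha_i^{*}) = 0,\qquad 0 \le \alpha_i \le C,\qquad 0 \le \alpha_i^{*} \le C \tag{38}$$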

This final equation can be solved efficiently using quadratic programming techniques and some decomposition methods. The use of quadratic programming and the selection of an appropriate kernel function make it possible to analyze very large data sets by SVR [504,505]. Different functions, including linear, polynomial, sigmoid and Gaussian functions, can be used as the kernel function. Among these, the Gaussian function can map the data space into a higher-dimensional space. Such a space transformation is useful for modeling complex nonlinear data.
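As a hedged illustration of how a kernel enters the prediction step, the sketch below evaluates a Gaussian kernel and a kernel-expansion decision function of the form Σ(αᵢ − αᵢ*)K(x, xᵢ) + b. The multipliers are not obtained by solving the quadratic program here; the support vectors, coefficients, `gamma` parameterization and function names are all hypothetical.

```python
from math import exp

def gaussian_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def svr_predict(x, support_vectors, coefs, b, gamma):
    """Kernel expansion; coefs[i] stands for (alpha_i - alpha_i*)."""
    return sum(c * gaussian_kernel(x, sv, gamma)
               for c, sv in zip(coefs, support_vectors)) + b

# Hypothetical fitted quantities for a one-descriptor model.
support_vectors = [[0.0], [1.0]]
coefs = [1.0, -1.0]
prediction_mid = svr_predict([0.5], support_vectors, coefs, b=0.5, gamma=1.0)
prediction_left = svr_predict([0.0], support_vectors, coefs, b=0.5, gamma=1.0)
```

The prediction is nonlinear in the descriptor even though the expansion is linear in the kernel values, which is the practical consequence of the implicit mapping into a higher-dimensional space.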

The support vector machine is one of those methods that came to the QSAR/QSPR field after well-known and common methodologies such as PLS and ANN, but the fast growth and successful application of SVM have established it as a stable and powerful chemometrics tool. Some examples from the many reports on SVM applications in QSAR and QSPR (including the first reports) are presented below.

In 2001, SVM was first applied in QSAR for a discrimination goal [506]. SVM was compared with several common machine learning techniques, including three ANNs, a radial basis function network, and a C5.0 decision tree, for classification to predict the inhibition of dihydrofolate reductase by pyrimidines. The results showed that SVM obtained the best results [506]. The report of Czerminski et al. also confirmed the ability of SVM in classification QSAR [507]. Song et al. in 2002 used SVM to develop a predictive QSRR model, both for feature selection and for building nonlinear SVM regression models, and confirmed the validity of their suggested model using bootstrap aggregation (bagging) techniques, in which different sets of training and external test compounds were selected from the total data set [508]. Doniger et al. in 2002 used SVM to differentiate the permeability of 324 drug molecules into the central nervous system (CNS) from their structural descriptors [509]. They compared their approach with a multilayer perceptron neural network for discriminating active and non-active molecules, and SVM was shown to be the more efficient in this case [509]. The preference for SVM relative to kNN was also evaluated in the classification QSAR study of Serra et al. on structural chromosome aberrations for a diverse set of organic compounds using topological, electronic, geometrical, or polar surface area descriptors [510].

Lind and Maltseva in 2003 performed a QSPR study based on support vector regression to predict the solubility of organic compounds in three data sets with good accuracy and without any prior information about the physical phenomena underlying solubility [511].

In 2004, Yao et al. constructed SVM QSAR models to predict the toxicities of 153 phenols and the activities of a set of 85 cyclooxygenase 2 (COX-2) inhibitors [512]. They compared the SVM models with MLR and RBFNN and observed that SVM can be comparable with or superior to the two other techniques [512]. Liu et al. proposed a QSPR model based on support vector regression to correlate the structures of 35 amino acids with their isoelectric points [513]. They used GA-PLS as a robust variable selection method to select the best subset of descriptors to enter the model development step based on SVM (and also on RBFNNs as a comparison method). Their good results showed that GA-PLS could match well with SVM for QSAR/QSPR studies [513].

Some examples from the many reports published during the history of SVM (in different forms and combinations) in QSAR and different branches of QSPR are [473,510,511,514–549]. In addition, the application of SVM in QSAR/QSPR and drug design has been reviewed in some review articles [533,550].

4.7. Decision tree (DT) (recursive partitioning)

The decision tree (DT) method determines a property (the chemical’s activity in the case of QSAR) through IF-THEN rules based on selected descriptors. For example, a simple rule could be: “IF number of x-functional group > 2, THEN the compound is active”. Decision tree (DT) or recursive partitioning (RP) is a powerful method for finding relationships in large or complex data sets that may involve nonlinearities, interactions and thresholds. It is clear that any of these conditions may limit the implementation of linear regression methods. To compare the ability of DT with nonlinear methods such as kNN, ANN, and nonlinear SVM, it should be noted that these three methods achieve high prediction performance and make it possible to model multiple mechanisms of action because of their flexibility. However, kNN and ANN are not suitable for handling high-dimensional data without dimension reduction or a pre-selection step, e.g. with the aid of GA [551,552]. Nonlinear SVM is able to work well with high-dimensional data, but it is not robust when a large number of irrelevant descriptors are present, and thus a pre-selection step is needed as well.
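The IF-THEN rule style of decision trees can be made concrete with a minimal sketch of a single recursive-partitioning step. It assumes a binary activity label and uses the Gini impurity as the split criterion, which is one common choice and not necessarily the criterion of the programs cited in this section; the data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)

def best_threshold(values, labels):
    """Find t for the rule 'IF value > t THEN active' with the lowest weighted impurity."""
    n = len(values)
    candidates = sorted(set(values))
    best_t, best_score = None, float("inf")
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2.0
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Hypothetical data: count of a functional group per compound and activity (1 = active).
group_counts = [0, 1, 1, 2, 3, 3, 4, 5]
active = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_threshold(group_counts, active)
```

A full tree would apply the same search to every descriptor and then repeat it recursively within each resulting subset of compounds.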

On the other hand, DT can handle high-dimensional data and has good potential for ignoring irrelevant variables, and thus it is a good option for constructing interpretable models for mixtures of compound classes. In other words, the important aspect of DT is its ability to identify structure-activity relationships for various classes of compounds that exist in the same data set and might act differently, e.g. bind in different ways. Various implementations of DT algorithms for QSAR have been proposed, such as Formal Inference-based Recursive Modeling (FIRM) [553–555], Classification and Regression Tree (CART) [556], and C4.5 [557].

Rusinko et al. developed a computer program based on DT, called SCAM, and exemplified it by the analysis of a large structure/biological activity data set including 1650 monoamine oxidase inhibitors [558]. Their proposed method scales linearly with the number of descriptors and thus makes it possible to analyze hundreds of thousands of compounds described by thousands to millions of structural descriptors [558].

It is worth mentioning that DT usually obtains relatively low prediction accuracy, and this is the major drawback of DT, which limits its applications in subjects like virtual screening. Thus, various attempts, such as decision forest [559], ensembles of trees [560], and random forest [552], have been reported to increase the trees’ prediction accuracy. Different tree-based approaches have been used in QSAR/QSPR studies, especially for modeling large data sets. Some representative examples are FIRM [561], C4.5 [562], Decision Forest [563–566], CART [258,562,567–570], Random Forest [552,571–578,578–580] and other decision trees [564,580–587].

5. Model validation

Obviously, the main application of a mathematical relationship between the structural features and a target property/activity of a set of compounds is the prediction of the target property of a variety of compounds prior to (or even instead of) expensive laboratory synthesis or property/activity measurement. However, this depends on the reliability of the developed QSPR/QSAR model. It has been comprehensively shown in the chemometrics literature that the fit of the data to the established model (without any constraint during model construction) is not evidence of its validity and reliability [356,588,589]. The four most well-known validation tools in the history of these methods are (i) cross validation, (ii) randomization of the target variable or y-scrambling, (iii) bootstrapping and (iv) external validation using a test set of compounds. However, for some years after the beginning of QSAR/QSPR, general practitioners in this field had no serious concern about the validation of their models, which led to faulty and unstable models in many cases [588]. Nowadays, a report of a structure-activity/property mathematical model without statistical validation has no value.

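The most important statistical parameters that are commonly used to check model performance are collected here in Eqs. (39-48). The terms used in these equations are defined below:

K: Number of factors included in the model (original descriptors, PCs or LVs)
N: Number of compounds in the data set (training or external test set)
i: the index of the ith sample
ycp, yvp and ypp: Respectively, the predicted y-value in the calibration (training) set, internal validation and prediction (external test) set
ye: the experimental (observed) value of y
ȳcp, ȳvp, ȳpp and ȳe: Respectively, the averages of ycp, yvp, ypp and ye

i. Pearson correlation coefficient in calibration (training set)

$$R_{cal} = \frac{\sum_{i=1}^{N}(y_{cp,i} - \bar{y}_{cp})(y_{e,i} - \bar{y}_{e})}{\sqrt{\sum_{i=1}^{N}(y_{cp,i} - \bar{y}_{cp})^{2}\sum_{i=1}^{N}(y_{e,i} - \bar{y}_{e})^{2}}} \tag{39}$$

ii. Pearson correlation coefficient in internal validation (training set)

$$R_{cv} = \frac{\sum_{i=1}^{N}(y_{vp,i} - \bar{y}_{vp})(y_{e,i} - \bar{y}_{e})}{\sqrt{\sum_{i=1}^{N}(y_{vp,i} - \bar{y}_{vp})^{2}\sum_{i=1}^{N}(y_{e,i} - \bar{y}_{e})^{2}}} \tag{40}$$

iii. Pearson correlation coefficient in prediction (external test set)

$$R_{p} = \frac{\sum_{i=1}^{N}(y_{pp,i} - \bar{y}_{pp})(y_{e,i} - \bar{y}_{e})}{\sqrt{\sum_{i=1}^{N}(y_{pp,i} - \bar{y}_{pp})^{2}\sum_{i=1}^{N}(y_{e,i} - \bar{y}_{e})^{2}}} \tag{41}$$

iv. Correlation coefficient of multiple determination in calibration (training set)

$$R^{2}_{cal} = 1 - \frac{\sum_{i=1}^{N}(y_{e,i} - y_{cp,i})^{2}}{\sum_{i=1}^{N}(y_{e,i} - \bar{y}_{e})^{2}} \tag{42}$$

v. Adjusted correlation coefficient of multiple determination in calibration (training set)

$$R^{2}_{adj} = 1 - (1 - R^{2}_{cal})\frac{N - 1}{N - K - 1} \tag{43}$$

vi. Cross validated correlation coefficient in internal validation (training set)

$$Q^{2} = 1 - \frac{\sum_{i=1}^{N}(y_{e,i} - y_{vp,i})^{2}}{\sum_{i=1}^{N}(y_{e,i} - \bar{y}_{e})^{2}} \tag{44}$$

vii. Correlation coefficient of multiple determination in prediction (external test set)

$$R^{2}_{p} = 1 - \frac{\sum_{i=1}^{N}(y_{e,i} - y_{pp,i})^{2}}{\sum_{i=1}^{N}(y_{e,i} - \bar{y}_{e})^{2}} \tag{45}$$

viii. Root mean square error of calibration (training set)

$$RMSEC = \sqrt{\frac{\sum_{i=1}^{N}(y_{e,i} - y_{cp,i})^{2}}{N}} \tag{46}$$

ix. Root mean square error of cross validation (training set)

$$RMSECV = \sqrt{\frac{\sum_{i=1}^{N}(y_{e,i} - y_{vp,i})^{2}}{N}} \tag{47}$$

x. Root mean square error of prediction (external validation set)

$$RMSEP = \sqrt{\frac{\sum_{i=1}^{N}(y_{e,i} - y_{pp,i})^{2}}{N}} \tag{48}$$

xi. Roy metric

In addition to the above common parameters, a new metric for external validation was proposed in 2008 by Roy and Roy [369]:

$$r_m^{2} = R^{2}\left(1 - \sqrt{\left|R^{2} - R_0^{2}\right|}\right) \tag{49}$$

where $R^{2}$ and $R_0^{2}$ are the correlation coefficients of multiple determination in least squares regression with and without intercept, respectively. This metric is most common for external validation with a test set, but it seems that $r_m^{2}$ could also be calculated for calibration based on the training set and for internal validation based on cross validation [493]. When large differences are observed between the predicted and experimental y-values of the test compounds (or even the training compounds), $R_0^{2}$ (i.e. $R^{2}$ with the intercept set at zero) will be significantly different from $R^{2}$, and thus $r_m^{2}$ is a more effective parameter for judging the model’s validity [369,493,590].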
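Three of the fit statistics named in the list above (the coefficient of multiple determination, its adjusted form, and the root mean square error) can be sketched as follows. The formulas are the common textbook definitions, with the RMSE divided by N, which is an assumption about the exact form used; the data and function names are hypothetical.

```python
from math import sqrt

def r_squared(y_obs, y_pred):
    """1 - (residual sum of squares) / (total sum of squares about the observed mean)."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_compounds, n_factors):
    """Penalize R2 for the number of factors K included in the model."""
    return 1.0 - (1.0 - r2) * (n_compounds - 1) / (n_compounds - n_factors - 1)

def rmse(y_obs, y_pred):
    """Root mean square error over N compounds."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / len(y_obs))

# Hypothetical observed and calibration-predicted values for five compounds.
y_e = [1.0, 2.0, 3.0, 4.0, 5.0]
y_cp = [1.1, 1.9, 3.2, 3.8, 5.0]
r2_cal = r_squared(y_e, y_cp)
r2_adj = adjusted_r_squared(r2_cal, n_compounds=5, n_factors=1)
rmsec = rmse(y_e, y_cp)
```

The same functions give Q² and RMSECV when cross-validated predictions are passed in, and R²p and RMSEP when external test set predictions are used.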

5.1 Internal Validation by Cross Validation

Internal validation of a constructed model is usually done by leave-one-out (LOO) or leave-many-out (LMO) strategies [591], and the quality of predictions is expressed using the Q² metric. The squared correlation coefficient of the training set (R²cal) might be increased artificially by adding more descriptors, whereas Q² decreases under such over-fitting conditions [589]; so Q² is a more meaningful metric for evaluating the average predictive ability of a model.

The leave-one-out procedure is the simplest cross validation method and a minimum requirement in model validation. As was briefly noted in previous parts of this article, in LOO cross validation each compound in the training set is excluded in turn, and a new model is constructed without this compound and used to predict the target activity/property of the excluded compound. This leaving out and model building is continued until the target activity/property of all compounds in the training set has been predicted. It is clear that the Q² of a model can be lower than R²cal, but according to the literature, a large difference between R²cal and Q² (greater than 0.2-0.3) is a warning of over-fitting [592,593]. Based on the works of Golbraikh and Tropsha [594], a rule of thumb was proposed as the minimum acceptable criteria: Q² > 0.5 and R² > 0.6. More details about these criteria can be found in the original papers [588,594].
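The LOO procedure described above can be sketched for the simplest possible model, a one-descriptor least-squares line. This is an illustration only: the model, data and function names are hypothetical, and Q² is computed here with the mean of all observed values in the denominator, which is one of several conventions.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx

def r2_calibration(x, y):
    """R2 of the model fitted to, and evaluated on, the whole training set."""
    slope, intercept = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    return 1.0 - ss_res / sum((b - mean_y) ** 2 for b in y)

def loo_q2(x, y):
    """Leave each compound out once, refit, predict it, and accumulate PRESS."""
    press = 0.0
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    mean_y = sum(y) / len(y)
    return 1.0 - press / sum((b - mean_y) ** 2 for b in y)

# Hypothetical training set: one descriptor and a noisy, roughly linear property.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
r2_cal = r2_calibration(x_train, y_train)
q2 = loo_q2(x_train, y_train)
```

For a least-squares model, Q² computed in this way does not exceed R²cal, and the size of the gap between the two is what the rule of thumb above monitors.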

It is now accepted in the literature that LOO cross validation is not a sufficient criterion and that LMO cross validation is more reliable [595,596]. It has been suggested that a significant fraction of compounds should be left out in each step of LMO cross validation, e.g. leave-20%-out or leave-30%-out [597]. LMO cross validation can be done in two modes: (i) performing the procedure with the same number of factors (original descriptors, PCs or LVs) for each left-out block and (ii) optimizing the number of factors for each model independently [596]. An important point regarding LMO cross validation is its higher sensitivity to the order of compounds in the data set compared with the LOO strategy. This problem can be avoided by randomly ordering the compounds before LMO cross validation [598].
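The random ordering step before LMO cross validation can be sketched as below. The function name, the fixed seed and the block-size rounding are assumptions made for the illustration; each returned block holds the indices of the compounds left out in one LMO step.

```python
import random

def leave_many_out_blocks(n_compounds, fraction_out, seed=0):
    """Shuffle compound indices, then cut them into consecutive left-out blocks."""
    indices = list(range(n_compounds))
    random.Random(seed).shuffle(indices)      # removes dependence on the data set order
    block_size = max(1, round(n_compounds * fraction_out))
    return [indices[start:start + block_size]
            for start in range(0, n_compounds, block_size)]

# Hypothetical case: 20 compounds with 20% left out in each step.
blocks = leave_many_out_blocks(20, 0.2, seed=1)
```

Each compound appears in exactly one left-out block, so every compound is predicted once by a model that was built without it.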

5.2 External validation

Using only internal validation is not a reliable approach for evaluating the validity and predictive ability of a QSAR/QSPR model [594]. As a further check, several authors have suggested a more rigorous way to estimate the true predictive ability of a constructed model by comparing the predicted and observed activity/property of an external test set of compounds that were not used in the model development process [594,597]. Therefore, it is necessary to split the data into two separate sets, “training” and “test”. The metrics Rp, R²p, RMSEP and r²m (Eqs. 41, 45, 48 and 49) are the common statistics for checking the quality of external validation.

Two important issues in external validation are the manner of splitting the compounds into training and test series and the portion of compounds to be included in the test set. For the first issue, procedures such as random division, y distribution (low, moderate and high activity/property) and different classification procedures have been proposed [594,599–602]. A reasonable data splitting method tries to provide a structural/chemical analogy between the external (test) and training compounds from various aspects, such as activity/property distributions or structural diversity, by considering suitable constraints during data set selection [601]. Regarding the second issue, i.e. the percentage of compounds to include in the external test set, it has been recommended to use at least 20%-30% of the compounds for external validation [7,597,599].

5.3 Bootstrapping

It has been shown that the performance of QSAR models may depend on the structural diversity and activity distribution of the compounds included in the training set [603]. In bootstrapping [603,604], the data set is randomly divided several times (from a limited number of times to hundreds) into training and test sets, and the calibration and cross-validated statistics (e.g. R²bstr and LOO Q²bstr) are calculated for each of the selected training sets. The main advantage of cross validation in bootstrapping compared with common cross validation is that a compound may be excluded once, several times, or never [605]. Thus Q²bstr can be a more reliable parameter than Q² [598] and is not highly dependent on the selection of the training set from the total compounds.

5.4 Permutation test (Y-Randomization)

As noted previously, one of the limitations of QSAR/QSPR models is the probability of obtaining chance models, specifically if the number of descriptors (available for variable selection) greatly exceeds the number of compounds in the data set. On the other hand, cross-validation evaluates only the predictive power and gives no information about the statistical significance of the model [593]. A procedure that can help to evaluate the presence of chance in a model is the permutation test, which is also called Y-randomization or Y-scrambling [356,371,595,596].

In the common form of y-randomization, several runs are carried out in which the descriptor matrix X is kept fixed and only the vector y (activity/property) is randomized. Then, the maximum values of R2 and Q2 are calculated from calibration and cross-validation, respectively. These parameters are called R2rand and Q2rand (or Q2MP, the maximum in the permutation test). R2rand and Q2rand should be significantly lower than R2cal and Q2 (obtained from the original non-scrambled data) to show that the model was not constructed by chance. There are two important issues in y-scrambling: what level of R2rand and Q2rand indicates a chance model, and how many randomization runs should be carried out? Several approaches have been proposed to address the first question:

(i) A simple approach, obtained from the investigations of Eriksson and Wold, proposes a decision range for R2rand and Q2rand [606] (Table 1).

(ii) In another approach, proposed by Eriksson et al. [593,607], a plot is used in which the y-axis is the R2 or Q2 of the original and all randomized models and the x-axis is the corresponding Pearson correlation coefficient (R) between the original y and the randomized y. In this plot, one point belongs to the original model and the rest are computed from the y-scrambled models (see Figure 7 in Ref. [593]). The linear equations of Q2 vs. R and R2 vs. R can be written as:

$$Q^2 = a_Q + b_Q R \tag{50}$$

$$R^2 = a_R + b_R R \tag{51}$$

where $a_Q$ and $a_R$ are the intercepts and $b_Q$ and $b_R$ the slopes. It has been recommended that, in a model without a significant chance level, the intercept of Eq. (50) ($a_Q$) should be lower than 0.05 and the intercept of Eq. (51) ($a_R$) should be lower than 0.3 [593]. These two values are related to the intrinsic, acceptable level of chance correlation existing in the model.

Other ways have also been suggested for this purpose; for example, an approach suggested by Rücker et al. in 2007 is based on the minimum distance between the Q2 or R2 of the original model and those of all randomized ones [356].
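The permutation test and the intercept criterion of the Q2 vs. R and R2 vs. R lines can be sketched in Python as follows. This is a hedged illustration, not the implementation used in the cited works: the model is assumed to be ordinary least-squares MLR, LOO Q2 is obtained with the hat-matrix shortcut that is exact for such a model, and the absolute value of the Pearson correlation between original and permuted y is used on the x-axis (an assumption). The function names `r2_mlr`, `loo_q2_mlr` and `y_randomization` are hypothetical.

```python
import numpy as np


def r2_mlr(X, y):
    """Calibration R2 of a least-squares MLR model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


def loo_q2_mlr(X, y):
    """LOO Q2 via the hat matrix: e_loo = e / (1 - h_ii) for least squares."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)
    resid = y - H @ y
    loo = resid / (1.0 - np.diag(H))
    return 1.0 - loo @ loo / np.sum((y - y.mean()) ** 2)


def y_randomization(X, y, n_runs=100, seed=0):
    """Scramble y (X fixed), refit, and regress R2 and Q2 on |corr(y, y_perm)|."""
    rng = np.random.default_rng(seed)
    # First point: the original (non-scrambled) model, for which R = 1
    r_vals, r2_vals, q2_vals = [1.0], [r2_mlr(X, y)], [loo_q2_mlr(X, y)]
    for _ in range(n_runs):
        y_perm = rng.permutation(y)
        r_vals.append(abs(np.corrcoef(y, y_perm)[0, 1]))
        r2_vals.append(r2_mlr(X, y_perm))
        q2_vals.append(loo_q2_mlr(X, y_perm))
    # np.polyfit returns [slope, intercept] for a first-degree fit
    _, intercept_r2 = np.polyfit(r_vals, r2_vals, 1)
    _, intercept_q2 = np.polyfit(r_vals, q2_vals, 1)
    return {
        "r2": r2_vals[0],
        "q2": q2_vals[0],
        "r2_rand_max": max(r2_vals[1:]),
        "q2_rand_max": max(q2_vals[1:]),
        "intercept_r2": float(intercept_r2),
        "intercept_q2": float(intercept_q2),
    }


# Synthetic example: 40 compounds, 2 descriptors, a real underlying relation
rng_demo = np.random.default_rng(7)
X_demo = rng_demo.normal(size=(40, 2))
y_demo = 0.5 + 1.5 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.2 * rng_demo.normal(size=40)
out = y_randomization(X_demo, y_demo, n_runs=50, seed=1)
```

For a model that is not the product of chance, `r2_rand_max` and `q2_rand_max` should fall well below the original `r2` and `q2`, and the two intercepts can be compared with the 0.3 and 0.05 limits quoted in the text.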

6. Conclusion

Chemometrics has continuously updated its tools in the different steps of QSAR/QSPR over its long history. Linear methods are based on the assumption that a linear relationship exists between the independent variables and the response variable. In non-linear approaches, however, the model is free of the restriction to a linear space, and the extracted descriptors are mostly mapped to a non-linear relational space. The move from linear to non-linear methods in both descriptor selection and model construction, together with the development of hyphenated methods, has opened various ways to handle diverse and more complex data sets. However, risks such as over-fitting and chance correlation should always be considered in order to construct a reliable and predictive model.

About 50 years after the official birth of QSAR/QSPR, the field has reached maturity, as is evident from the halt in the growth of the number of related publications over the past four years. A probable reason is the maturity of the chemometrics methods themselves, which has recently been noted in the chemometrics literature [608–610]. Accordingly, the authors expect that the number of publications dealing with theoretical extensions will diminish in the future and that research in QSAR/QSPR will be directed toward extensive applications in drug and molecular design and toward new areas such as nano-QSAR [611,612].

References

[1] A. Nigam, M.T. Klein, A mechanism-oriented lumping strategy for heavy hydrocarbon pyrolysis: imposition of quantitative structure-reactivity relationships for pure components, Ind. & Eng. Chem. Res. 32 (1993) 1297–1303. doi:10.1021/ie00019a003.

[2] K. Héberger, Quantitative structure-(chromatographic) retention relationships, J. Chromatogr. A. 1158 (2007) 273–305. doi:10.1016/j.chroma.2007.03.108.

[3] R. Kaliszan, Quantitative structure-retention relationships applied to reversed-phase high-performance liquid chromatography, J. Chromatogr. A. 656 (1993) 417–435. doi:10.1016/0021-9673(93)80812-M.

[4] M.T.D. Cronin, T.W. Schultz, Structure-toxicity relationships for phenols to Tetrahymena pyriformis, Chemosphere. 32 (1996) 1453–1468. doi:10.1016/0045-6535(96)00054-9.

[5] R.J. Driebergen, E.E. Moret, L.H.M. Janssen, J.S. Blauw, J.J.M. Holthuis, S.J. Postma Kelder, et al., Electrochemistry of potentially bioreductive alkylating quinones. Part 3. Quantitative structure-electrochemistry relationships of aziridinylquinones, Anal. Chim. Acta. 257 (1992) 257–273. doi:10.1016/0003-2670(92)85179-A.

[6] P. Tömpe, G. Clementis, I. Petneházy, Z.M. Jászay, L. Toke, Quantitative structure-electrochemistry relationships of α, β-unsaturated ketones, Anal. Chim. Acta. 305 (1995) 295–303. doi:10.1016/0003-2670(94)00354-O.

[7] B. Hemmateenejad, M. Yazdani, QSPR models for half-wave reduction potential of steroids: a comparative study between feature selection and feature extraction from subsets of or entire set of descriptors, Anal. Chim. Acta. 634 (2009) 27–35. doi:10.1016/j.aca.2008.11.062.

[8] D.D. Vaishnav, R.S. Boethling, L. Babeu, Quantitative structure-biodegradability relationships for alcohols, ketones and alicyclic compounds, Chemosphere. 16 (1987) 695–703. doi:10.1016/0045-6535(87)90005-1.

[9] G.H. Lu, Y.H. Zhao, S.G. Yang, X.J. Cheng, Quantitative structure-biodegradability relationships of substituted benzenes and their biodegradability in river water, Bull. Environ. Contam. Toxicol. 69 (2002) 111–116. doi:10.1007/s00128-002-0016-7.

[10] R. Todeschini, V. Consonni, Molecular Descriptors for Chemoinformatics, 2nd ed., Wiley-VCH, Weinheim, 2009.

[11] R. Todeschini, V. Consonni, P. Gramatica, Chemometrics in QSAR, in: S.D. Brown, R. Tauler, B. Walczak (Eds.), Compr. Chemom. Chem. Biochem. Data Anal., Elsevier B.V., Amsterdam, 2009: pp. 140–141.

[12] A.C. Brown, T.R. Fraser, V. On the Connection between Chemical Constitution and Physiological Action. Part I. On the Physiological Action of the Salts of the Ammonium Bases, derived from Strychnia, Brucia, Thebaia, Codeia, Morphia, and Nicotia, Trans. R. Soc. Edinburgh. 25 (1868) 151–203. doi:10.1017/S0080456800028155.

[13] E.J. Mills, XXIII. On melting-point and boiling-point as related to chemical composition, Philos. Mag. Ser. 5. 17 (1884) 173–187. doi:10.1080/14786448408627502.

[14] H. Meyer, Zur Theorie der Alkoholnarkose, Arch. Für Exp. Pathol. Und Pharmakologie. 42 (1899) 109–118. doi:10.1007/BF01834479.

[15] J. Traube, Theorie der Osmose und Narkose, Pflüger, Arch. Für Die Gesammte Physiol. Des Menschen Und Der Thiere. 105 (1904) 541–558. doi:10.1007/BF01682827.

[16] L.P. Hammett, Reaction Rates and Indicator Acidities., Chem. Rev. 16 (1935) 67–79. doi:10.1021/cr60053a006.

[17] L.P. Hammett, Some Relations between Reaction Rates and Equilibrium Constants., Chem. Rev. 17 (1935) 125–136. doi:10.1021/cr60056a010.

[18] L.P. Hammett, The Effect of Structure upon the Reactions of Organic Compounds. Benzene Derivatives, J. Am. Chem. Soc. 59 (1937) 96–103. doi:10.1021/ja01280a022.

[19] L.P. Hammett, Linear free energy relationships in rate and equilibrium phenomena, Trans. Faraday Soc. 34 (1938) 156. doi:10.1039/tf9383400156.

[20] J.R. Platt, Influence of Neighbor Bonds on Additive Bond Properties in Paraffins, J. Chem. Phys. 15 (1947) 419. doi:10.1063/1.1746554.

[21] H. Wiener, Influence of Interatomic Forces on Paraffin Properties, J. Chem. Phys. 15 (1947) 766. doi:10.1063/1.1746328.

[22] L. Pauling, D.M. Yost, The Additivity of the Energies of Normal Covalent Bonds, Proc. Natl. Acad. Sci. U. S. A. 18 (1932) 414–416.

[23] C. Coulson, The Electronic Structure of Some Polyenes and Aromatic Molecules. VII. Bonds of Fractional Order by the Molecular Orbital Method, Proc. R. Soc. Lond. A. Math. Phys. Sci. 169 (1939) 413–428.

[24] J. Hinze, H.H. Jaffe, Electronegativity. I. Orbital Electronegativity of Neutral Atoms, J. Am. Chem. Soc. 84 (1962) 540–546. doi:10.1021/ja00863a008.

[25] K. Fukui, T. Yonezawa, C. Nagata, Theory of Substitution in Conjugated Molecules, Bull. Chem. Soc. Jpn. 27 (1954) 423–427. doi:10.1246/bcsj.27.423.

[26] R.S. Mulliken, Electronic Population Analysis on LCAO-MO Molecular Wave Functions. I, J. Chem. Phys. 23 (1955) 1833. doi:10.1063/1.1740588.

[27] R.S. Mulliken, Electronic Population Analysis on LCAO-MO Molecular Wave Functions. II. Overlap Populations, Bond Orders, and Covalent Bond Energies, J. Chem. Phys. 23 (1955) 1841. doi:10.1063/1.1740589.

[28] L.B. Kier, Molecular orbital theory in drug research, Academic Press, 1971.

[29] R.W. Taft, Polar and Steric Substituent Constants for Aliphatic and o-Benzoate Groups from Rates of Esterification and Hydrolysis of Esters 1, J. Am. Chem. Soc. 74 (1952) 3120–3128. doi:10.1021/ja01132a049.

[30] R.W. Taft, Linear Steric Energy Relationships, J. Am. Chem. Soc. 75 (1953) 4538–4539. doi:10.1021/ja01114a044.

[31] C. Hansch, P.P. Maloney, T. Fujita, R.M. Muir, Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients, Nature. 194 (1962) 178–180. doi:10.1038/194178b0.

[32] C. Hansch, R.R.M.R. Muir, T. Fujita, P.P. Maloney, F. Geiger, M. Streich, The correlation of biological activity of plant growth regulators and chloromycetin derivatives with Hammett constants and partition coefficients, J. Am. Chem. Soc. 85 (1963) 2817–2824. doi:10.1021/ja00901a033.

[33] T. Fujita, J. Iwasa, C. Hansch, A New Substituent Constant, π, Derived from Partition Coefficients, J. Am. Chem. Soc. 86 (1964) 5175–5180. doi:10.1021/ja01077a028.

[34] C. Hansch, A. Leo, Exploring QSAR. Fundamentals and Applications in Chemistry and Biology, American Chemical Society, Washington, DC, 1995.

[35] S.M. Free, J.W. Wilson, A Mathematical Contribution to Structure-Activity Studies, J. Med. Chem. 7 (1964) 395–399. doi:10.1021/jm00334a001.

[36] H. Kubinyi, 3D QSAR in Drug Design: Volume 1: Theory Methods and Applications, Kluwer Academic Publishers, Dordrecht, 2000.

[37] H. Kubinyi, G. Folkers, Y.C. Martin, 3D QSAR in Drug Design: Volume 3: Recent Advances, Kluwer Academic Publishers, New York, 2002.

[38] P.J. Goodford, A computational procedure for determining energetically favorable binding sites on biologically important macromolecules, J. Med. Chem. 28 (1985) 849–857. doi:10.1021/jm00145a002.

[39] R.D. Cramer III, D.E. Patterson, J.D. Bunce, Comparative Molecular Field Analysis (CoMFA). 1. Effect of Shape on Binding of Steroids to Carrier Proteins, J. Am. Chem. Soc. 110 (1988) 5959–5967. doi:10.1021/ja00226a005.

[40] G. Klebe, U. Abraham, T. Mietzner, Molecular similarity indices in a comparative analysis (CoMSIA) of drug molecules to correlate and predict their biological activity, J. Med. Chem. 37 (1994) 4130–4146. doi:10.1021/jm00050a010.

[41] A.N. Jain, K. Koile, D. Chapman, Compass: predicting biological activities from molecular surface properties. Performance comparisons on a steroid benchmark, J. Med. Chem. 37 (1994) 2315–2327. doi:10.1021/jm00041a010.

[42] B.D. Silverman, D.E. Platt, Comparative molecular moment analysis (CoMMA): 3D-QSAR without molecular superposition, J. Med. Chem. 39 (1996) 2129–2140. doi:10.1021/jm950589q.

[43] H. Chuman, M. Karasawa, T. Fujita, F. Kansai, QSAR A Novel Three-Dimensional QSAR Procedure: Voronoi Field Analysis, Quant. Struct.-Act. Relat. 17 (1998) 313–326. doi:10.1002/(SICI)1521-3838(199808)17:04<313::AID-QSAR313>3.0.CO;2-7.

[44] G. Cruciani, M. Pastor, W. Guba, VolSurf: A new tool for the pharmacokinetic optimization of lead compounds, in: Eur. J. Pharm. Sci., 2000. doi:10.1016/S0928-0987(00)00162-7.

[45] M. Pastor, G. Cruciani, I. McLay, S. Pickett, S. Clementi, GRid-INdependent descriptors (GRIND): A novel class of alignment-independent three-dimensional molecular descriptors, J. Med. Chem. 43 (2000) 3233–3243. doi:10.1021/jm000941m.

[46] M.T.H. Khan, Predictions of the ADMET properties of candidate drug molecules utilizing different QSAR/QSPR modelling approaches., Curr. Drug Metab. 11 (2010) 285–295. doi:10.2174/138920010791514306.

[47] F.A. Quintero, S.J. Patel, F. Muñoz, M. Sam Mannan, Review of existing QSAR/QSPR models developed for properties used in hazardous chemicals classification system, Ind. Eng. Chem. Res. 51 (2012) 16101–16115. doi:10.1021/ie301079r.

[48] H. Kubinyi, QSAR and 3D QSAR in drug design part 2: Applications and problems, Drug Discov. Today. 2 (1997) 538–546. doi:10.1016/S1359-6446(97)01084-2.

[49] A.J. Hopfinger, Practical applications of computer-aided drug design, in: P.S. Charifson (Ed.), Pract. Appl. Comput. Aided Des., Marcel-Dekker, New York, 1997: pp. 105–164.

[50] R. Perkins, H. Fang, W. Tong, W.J. Welsh, Quantitative structure-activity relationship methods: perspectives on drug discovery and toxicology, Environ. Toxicol. Chem. 22 (2003) 1666–1679. doi:10.1897/01-171.

[51] Q.-S. Du, R.-B. Huang, K.-C. Chou, Recent advances in QSAR and their applications in predicting the activities of chemical molecules, peptides and proteins for drug design, Curr. Protein Pept. Sci. 9 (2008) 248–260. doi:10.2174/138920308784534005.

[52] C.A. Lipinski, F. Lombardo, B.W. Dominy, P.J. Feeney, Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings, Adv. Drug Deliv. Rev. 64 (2012) 4–17. doi:10.1016/j.addr.2012.09.019.

[53] L.B. Salum, A.D. Andricopulo, Fragment-based QSAR: Perspectives in drug design, Mol. Divers. 13 (2009) 277–285. doi:10.1007/s11030-009-9112-5.

[54] J.M.J. Sutter, J.H.J. Kalivas, Comparison of forward selection, backward elimination, and generalized simulated annealing for variable selection, Microchem. J. 47 (1993) 60–66. doi:10.1006/mchj.1993.1012.

[55] N.R. Draper, H. Smith, E. Pownell, Applied regression analysis, Wiley, New York, 1966.

[56] S. Weisberg, Applied linear regression, John Wiley & Sons, New Jersey, 2005.

[57] J. Shao, Linear Model Selection by Cross-Validation, J. Am. Stat. Assoc. 88 (1993) 486–494. doi:10.2307/2290328.

[58] P. Gemperline, ed., Practical Guide to Chemometrics, 2nd ed., Taylor & Francis Group, Boca Raton, 2006.

[59] S.H. Unger, Consequences of the Hansch Paradigm for the Pharmaceutical Industry, in: E.J. Ariens (Ed.), Drug Des. (Vol. 9), Academic Press, New York, 1980: pp. 47–119.

[60] L. Xu, W.-J. Zhang, Comparison of different methods for variable selection, Anal. Chim. Acta. 446 (2001) 475–481. doi:10.1016/S0003-2670(01)01271-5.

[61] M. Seierstad, D.K. Agrafiotis, A QSAR model of hERG binding using a large, diverse, and internally consistent training set, Chem. Biol. Drug Des. 67 (2006) 284–296. doi:10.1111/j.1747-0285.2006.00379.x.

[62] H.J.M. Verhaar, L. Eriksson, M. Sjöström, G. Schüürmann, W. Seinen, J.L.M. Hermens, Modelling the Toxicity of Organophosphates: a Comparison of the Multiple Linear Regression and PLS Regression Methods, QSAR Comb. Sci. 13 (1994) 133–143. doi:10.1002/qsar.19940130202.

[63] A.K. Saxena, P. Prathipati, Comparison of MLR, PLS and GA-MLR in QSAR analysis, SAR QSAR Environ. Res. 14 (2003) 433–445. doi:10.1080/10629360310001624015.

[64] B. Hemmateenejad, R. Miri, M. Tabarzad, M. Jafarpour, F. Zand, Molecular modeling and QSAR analysis of the anticonvulsant activity of some N-phenyl-N′-(4-pyridinyl)-urea derivatives, J. Mol. Struct. THEOCHEM. 684 (2004) 43–49. doi:10.1016/j.theochem.2004.06.039.

[65] B. Hemmateenejad, R. Miri, N. Edraki, M. Khoshneviszadeh, A. Shafiee, Molecular modeling and QSAR analysis of some 4,5-dichloroimidazolyl-1,4-DHP-based calcium channel blockers, J. Iran. Chem. Soc. 4 (2007) 182–193. doi:10.1007/BF03245965.

[66] N. Edraki, B. Hemmateenejad, R. Miri, M. Khoshneviszade, QSAR study of phenoxypyrimidine derivatives as potent inhibitors of p38 kinase using different chemometric tools, Chem. Biol. Drug Des. 70 (2007) 530–539. doi:10.1111/j.1747-0285.2007.00597.x.

[67] S. Ray, C. Sengupta, K. Roy, QSAR modeling for lipid peroxidation inhibition potential of flavonoids using topological and structural parameters, Cent. Eur. J. Chem. 6 (2008) 267–276. doi:10.2478/s11532-008-0014-7.

[68] B. Hemmateenejad, M. Shamsipur, R. Miri, M. Elyasi, F. Foroghinia, H. Sharghi, Linear and nonlinear quantitative structure-property relationship models for solubility of some anthraquinone, anthrone and xanthone derivatives in supercritical carbon dioxide, Anal. Chim. Acta. 610 (2008) 25–34. doi:10.1016/j.aca.2008.01.011.

[69] P.P. Roy, J.T. Leonard, K. Roy, Exploring the impact of size of training sets for the development of predictive QSAR models, Chemom. Intell. Lab. Syst. 90 (2008) 31–42. doi:10.1016/j.chemolab.2007.07.004.

[70] R. Miri, K. Javidnia, B. Hemmateenejad, M. Tabarzad, M. Jafarpour, Synthesis, evaluation of pharmacological activities and quantitative structure-activity relationship studies of a novel group of bis(4-nitroaryl-1,4-dihyropyridine)., Chem. Biol. Drug Des. 73 (2009) 225–235. doi:10.1111/j.1747-0285.2008.00770.x.

[71] B. Hemmateenejad, M. Yazdani, QSPR models for half-wave reduction potential of steroids: A comparative study between feature selection and feature extraction from subsets of or entire set of descriptors, Anal. Chim. Acta. 634 (2009) 27–35. doi:10.1016/j.aca.2008.11.062.

[72] B. Hemmateenejad, M. Elyasi, A segmented principal component analysis-regression approach to quantitative structure-activity relationship modeling, Anal. Chim. Acta. 646 (2009) 30–38. doi:10.1016/j.aca.2009.05.003.

[73] K. Roy, P. Pratim Roy, Comparative chemometric modeling of cytochrome 3A4 inhibitory activity of structurally diverse compounds using stepwise MLR, FA-MLR, PLS, GFA, G/PLS and ANN techniques, Eur. J. Med. Chem. 44 (2009) 2913–2922. doi:10.1016/j.ejmech.2008.12.004.

[74] S. Kar, K. Roy, QSAR modeling of toxicity of diverse organic chemicals to Daphnia magna using 2D and 3D descriptors, J. Hazard. Mater. 177 (2010) 344–351. doi:10.1016/j.jhazmat.2009.12.038.

[75] L. Jiao, H. Li, QSPR studies on the aqueous solubility of PCDD/Fs by using artificial neural network combined with stepwise regression, Chemom. Intell. Lab. Syst. 103 (2010) 90–95. doi:10.1016/j.chemolab.2010.05.019.

[76] S. Yousefinejad, F. Honarasa, F. Abbasitabar, Z. Arianezhad, New LSER model based on solvent empirical parameters for the prediction and description of the solubility of buckminsterfullerene in various solvents, J. Solution Chem. 42 (2013) 1620–1632. doi:10.1007/s10953-013-0062-2.

[77] S. Yousefinejad, B. Hemmateenejad, A chemometrics approach to predict the dispersibility of graphene in various liquid phases using theoretical descriptors and solvent empirical parameters, Colloids Surfaces A Physicochem. Eng. Asp. 441 (2014) 766–775. doi:10.1016/j.colsurfa.2013.03.020.

[78] X. Wang, Y. Sun, L. Wu, S. Gu, R. Liu, L. Liu, et al., Quantitative structure-affinity relationship study of azo dyes for cellulose fibers by multiple linear regression and artificial neural network, Chemom. Intell. Lab. Syst. 134 (2014) 1–9. doi:10.1016/j.chemolab.2014.03.001.

[79] S. Yousefinejad, F. Honarasa, N. Saeed, Quantitative structure-retardation factor relationship of protein amino acids in different solvent mixtures for normal-phase thin-layer chromatography, J. Sep. Sci. 38 (2015) 1771–1776. doi:10.1002/jssc.201401427.

[80] A.R. Katritzky, V.S. Lobanov, M. Karelson, R. Murugan, M.P. Grendze, J.E. Toomey, Comprehensive descriptors for structural and statistical analysis. 1. Correlations between structure and physical properties of substituted pyridines, Rev. Roum. Chim. 41 (1996) 851–868.

[81] H.Z. Si, T. Wang, K.J. Zhang, Z. De Hu, B.T. Fan, QSAR study of 1,4-dihydropyridine calcium channel antagonists based on gene expression programming., Bioorg. Med. Chem. 14 (2006) 4834–4841. doi:10.1016/j.bmc.2006.03.019.

[82] F. Luan, W.P. Ma, X.Y. Zhang, H.X. Zhang, M.C. Liu, Z.D. Hu, et al., QSAR study of polychlorinated dibenzodioxins, dibenzofurans, and Biphenyls using the heuristic method and support vector machine, QSAR Comb. Sci. 25 (2006) 46–55. doi:10.1002/qsar.200530131.

[83] F. Luan, W. Ma, X. Zhang, H. Zhang, M. Liu, Z. Hu, et al., Quantitative structure-activity relationship models for prediction of sensory irritants (logRD50) of volatile organic chemicals, Chemosphere. 63 (2006) 1142–1153. doi:10.1016/j.chemosphere.2005.09.053.

[84] X. Li, F. Luan, H. Si, Z. Hu, M. Liu, Prediction of retention times for a large set of pesticides or toxicants based on support vector machine and the heuristic method, Toxicol. Lett. 175 (2007) 136–144. doi:10.1016/j.toxlet.2007.10.005.

[85] C. Zhao, H. Zhang, F. Luan, R. Zhang, M. Liu, Z. Hu, et al., QSAR method for prediction of protein-peptide binding affinity: Application to MHC class I molecule HLA-A*0201, J. Mol. Graph. Model. 26 (2007) 246–254. doi:10.1016/j.jmgm.2006.12.002.

[86] J. Li, H. Liu, X. Yao, M. Liu, Z. Hu, B. Fan, Quantitative structure-activity relationship study of acyl ureas as inhibitors of human liver glycogen phosphorylase using least squares support vector machines, Chemom. Intell. Lab. Syst. 87 (2007) 139–146. doi:10.1016/j.chemolab.2006.11.004.

[87] S. Qin, H.X. Liu, J. Wang, X.J. Yao, M.C. Liu, Z.D. Hu, et al., Quantitative Structure-Activity Relationship study on a series of novel ligands binding to central benzodiazepine receptor by using the combination of Heuristic Method and Support Vector Machines, QSAR Comb. Sci. 26 (2007) 443–451. doi:10.1002/qsar.200630059.

[88] J. Rebehmed, F. Barbault, C. Teixeira, F. Maurel, 2D and 3D QSAR studies of diarylpyrimidine HIV-1 reverse transcriptase inhibitors, J. Comput. Aided. Mol. Des. 22 (2008) 831–841. doi:10.1007/s10822-008-9217-4.

[89] T. Wang, H. Si, P. Chen, K. Zhang, X. Yao, QSAR models for the dermal penetration of polycyclic aromatic hydrocarbons based on gene expression programming, QSAR Comb. Sci. 27 (2008) 913–921. doi:10.1002/qsar.200710153.

[90] W.J. Lü, Y.L. Chen, W.P. Ma, X.Y. Zhang, F. Luan, M.C. Liu, et al., QSAR study of neuraminidase inhibitors based on heuristic method and radial basis function network, Eur. J. Med. Chem. 43 (2008) 569–576. doi:10.1016/j.ejmech.2007.04.011.

[91] K. Liu, B. Xia, W. Ma, B. Zheng, X. Zhang, B. Fan, Quantitative structure-activity relationship modeling of triaminotriazine drugs based on heuristic method, QSAR Comb. Sci. 27 (2008) 425–431. doi:10.1002/qsar.200730045.

[92] B. Xia, W. Ma, B. Zheng, X. Zhang, B. Fan, Quantitative structure-activity relationship studies of a series of non-benzodiazepine structural ligands binding to benzodiazepine receptor, Eur. J. Med. Chem. 43 (2008) 1489–1498. doi:10.1016/j.ejmech.2007.09.004.

[93] Z.G. Gong, R.S. Zhang, B.B. Xia, R.J. Hu, B.T. Fan, Study of Nematic Transition Temperatures in Themotropic Liquid Crystal Using Heuristic Method and Radial Basis Function Neural Networks and Support Vector Machine, QSAR Comb. Sci. 27 (2008) 1282–1290. doi:10.1002/qsar.200860027.

[94] Y. Yuan, R. Zhang, R. Hu, X. Ruan, Prediction of CCR5 receptor binding affinity of substituted 1-(3,3-diphenylpropyl)-piperidinyl amides and ureas based on the heuristic method, support vector machine and projection pursuit regression, Eur. J. Med. Chem. 44 (2009) 25–34. doi:10.1016/j.ejmech.2008.03.004.

[95] B. Xia, K. Liu, Z. Gong, B. Zheng, X. Zhang, B. Fan, Rapid toxicity prediction of organic chemicals to Chlorella vulgaris using quantitative structure-activity relationships methods, Ecotoxicol. Environ. Saf. 72 (2009) 787–794. doi:10.1016/j.ecoenv.2008.09.002.

[96] H. Liu, P. Han, Y. Wen, F. Luan, Y. Gao, X. Li, Quantitative structure-electrochemistry relationship for variously-substituted 9, 10-anthraquinones using both an heuristic method and a radial basis function neural network, Dye. Pigment. 84 (2010) 148–152. doi:10.1016/j.dyepig.2009.07.013.

[97] C. Guo, P. Zhou, J. Shao, X. Yang, Z. Shang, Integrating statistical and experimental protocols to model and design novel Gemini surfactants with promising critical micelle concentration and low environmental risk, Chemosphere. 84 (2011) 1608–1616. doi:10.1016/j.chemosphere.2011.05.031.

[98] P. Lu, X. Wei, R. Zhang, Y. Yuan, Z. Gong, Prediction of the binding affinities of adenosine A2A receptor antagonists based on the heuristic method and support vector machine, Med. Chem. Res. 20 (2011) 1220–1228. doi:10.1007/s00044-010-9431-1.

[99] Y. Zhao, J. Zhao, Y. Huang, Q. Zhou, X. Zhang, S. Zhang, Toxicity of ionic liquids: Database and prediction via quantitative structure-activity relationship method, J. Hazard. Mater. 278 (2014) 320–329. doi:10.1016/j.jhazmat.2014.06.018.

[100] G.M. Furnival, R.W. Wilson Jr, Regression by leaps and bounds, Technometrics. 16 (1974) 499–511. doi:10.1080/00401706.2000.10485982.

[101] B.K. Chen, C. Horváth, J.R. Bertino, Multivariate analysis and quantitative structure-activity relationships. Inhibition of dihydrofolate reductase and thymidylate synthetase by quinazolines, J. Med. Chem. 22 (1979) 483–491. doi:10.1021/jm00191a005.

[102] B.W. Clare, A novel quantum theoretic QSAR for hallucinogenic tryptamines: A major factor is the orientation of π orbital nodes, J. Mol. Struct. THEOCHEM. 712 (2004) 143–148. doi:10.1016/j.theochem.2004.08.050.

[103] L.M. Egolf, P.C. Jurs, Prediction of boiling points of organic heterocyclic compounds using regression and neural network techniques, J. Chem. Inf. Model. 33 (1993) 616–625. doi:10.1021/ci00014a015.

[104] L.M. Egolf, P.C. Jurs, Estimation of autoignition temperatures of hydrocarbons, alcohols, and esters from molecular structure, Ind. Eng. Chem. Res. 31 (1992) 1798–1807. doi:10.1021/ie00007a027.

[105] J.H. Kim, P. Gramatica, M.G. Kim, D. Kim, P.G. Tratnyek, QSAR modelling of water quality indices of alkylphenol pollutants, SAR QSAR Environ. Res. 18 (2007) 729–743. doi:10.1080/10629360701698761.

[106] Y.H. Qi, Q.Y. Zhang, L. Xu, Correlation analysis of the structures and stability constants of gadolinium(III) complexes, J. Chem. Inf. Comput. Sci. 42 (2002) 1471–1475. doi:10.1021/ci020027x.

[107] J.M. Sutter, S.L. Dixon, P.C. Jurs, Automatic Descriptor Selection for Quantitative Structure-Activity Relationships Using Generalized Simulated Annealing, J. Chem. Inf. Comput. Sci. (1995) 77–84.

[108] L. Xu, J.W. Ball, S.L. Dixon, P.C. Jurs, Quantitative structure-activity relationships for toxicity of phenols using regression analysis and computational neural networks, Environ. Toxicol. Chem. 13 (1994) 841–851. doi:10.1002/etc.5620130520.

[109] L. Xu, Y. Wu, C. Hu, H. Li, A QSAR of the toxicity of amino-benzenes and their structures, Sci. China Ser. B Chem. 43 (2000) 129–136. doi:10.1007/BF03027302.

[110] L. Xu, J.-A. Yang, Y.-P. Wu, Effective descriptions of molecular structures and the quantitative structure-activity relationship studies, J. Chem. Inf. Comput. Sci. 42 (2002) 602–606.

[111] L. Xu, Q.Y. Zhang, J. Wang, L. Dong, Extended topological indices and prediction of activities of chiral compounds, Chemom. Intell. Lab. Syst. 82 (2006) 37–43. doi:10.1016/j.chemolab.2005.05.008.

[112] Y.X. Zhou, L. Xu, Y.P. Wu, B.L. Liu, A QSAR study of the antiallergic activities of substituted benzamides and their structures, Chemom. Intell. Lab. Syst. 45 (1999) 95–100. doi:10.1016/S0169-7439(98)00092-6.

[113] J.H. Kim, P. Gramatica, M.G. Kim, D. Kim, P.G. Tratnyek, A new search algorithm for QSPR/QSAR theories: Normal boiling points of some organic molecules, SAR QSAR Environ. Res. 18 (2005) 729–743. doi:10.1016/j.cplett.2005.07.016.

[114] A.G. Mercader, P.R. Duchowicz, F.M. Fernández, E.A. Castro, Modified and enhanced replacement method for the selection of molecular descriptors in QSAR and QSPR theories, Chemom. Intell. Lab. Syst. 92 (2008) 138–144. http://www.sciencedirect.com/science/article/pii/S016974390800021X.

[115] A.G. Mercader, P.R. Duchowicz, F.M. Fernández, E.A. Castro, Advances in the replacement and enhanced replacement method in QSAR and QSPR theories, J. Chem. Inf. Model. 51 (2011) 1575–1581. doi:10.1021/ci200079b.

[116] P.R. Duchowicz, E.A. Castro, F.M. Fernández, Alternative algorithm for the search of an optimal set of descriptors in QSAR-QSPR studies, MATCH Commun. Math. Comput. Chem. 55 (2006) 179–192.

[117] P.R. Duchowicz, M. Fernández, J. Caballero, E.A. Castro, F.M. Fernández, QSAR for nonnucleoside inhibitors of HIV-1 reverse transcriptase, Bioorg. Med. Chem. 14 (2006) 5876–5889. doi:10.1016/j.bmc.2006.05.027.

[118] A. Lee, A.G. Mercader, P.R. Duchowicz, E.A. Castro, A.B. Pomilio, QSAR study of the DPPH radical scavenging activity of di(hetero)arylamines derivatives of benzo[b]thiophenes, halophenols and caffeic acid analogues, Chemom. Intell. Lab. Syst. 116 (2012) 33–40. doi:10.1016/j.chemolab.2012.03.016.

[119] A.H. Morales, P.R. Duchowicz, M.Á.C. Pérez, E.A. Castro, M.N.D.S. Cordeiro, M.P. González, Application of the replacement method as a novel variable selection strategy in QSAR. 1. Carcinogenic potential, Chemom. Intell. Lab. Syst. 81 (2006) 180–187. doi:10.1016/j.chemolab.2005.12.002.

[120] A.G. Mercader, P.R. Duchowicz, F.M. Fernández, E.A. Castro, Replacement method and enhanced replacement method versus the genetic algorithm approach for the selection of molecular descriptors in QSPR/QSAR theories, J. Chem. Inf. Model. 50 (2010) 1542–1548. doi:10.1021/ci100103r.

[121] P.R. Duchowicz, M. Goodarzi, M.A. Ocsachoque, G.P. Romanelli, E. del V Ortiz, J.C. Autino, et al., QSAR analysis on Spodoptera litura antifeedant activities for flavone derivatives, Sci. Total Environ. 408 (2009) 277–285. doi:10.1016/j.scitotenv.2009.09.041.

[122] P.R. Duchowicz, A.G. Mercader, F.M. Fernández, E.A. Castro, Prediction of aqueous toxicity for heterogeneous phenol derivatives by QSAR, Chemom. Intell. Lab. Syst. 90 (2008) 97–107. doi:10.1016/j.chemolab.2007.08.006.

[123] A.G. Mercader, P.R. Duchowicz, F.M. Fernández, E.A. Castro, D.O. Bennardi, J.C. Autino, et al., QSAR prediction of inhibition of aldose reductase for flavonoids, Bioorganic Med. Chem. 16 (2008) 7470–7476. doi:10.1016/j.bmc.2008.06.004.

[124] M.C.U. Araújo, T.C.B. Saldanha, R.K.H. Galvão, T. Yoneyama, H.C. Chame, V. Visani, The successive projections algorithm for variable selection in spectroscopic multicomponent analysis, Chemom. Intell. Lab. Syst. 57 (2001) 65–73. doi:10.1016/S0169-7439(01)00119-8.

AC CE P

TE

D

[125] R. Kawakami Harrop Galvão, M. Fernanda Pimentel, M. Cesar Ugulino Araujo, T. Yoneyama, V. Visani, Aspects of the successive projections algorithm for variable selection in multivariate calibration applied to plasma emission spectrometry, Anal. Chim. Acta. 443 (2001) 107–115. doi:10.1016/S0003-2670(01)01182-5. [126] S.F.C. Soares, A.A. Gomes, M.C.U. Araujo, A.R.G. Filho, R.K.H. Galvão, The successive projections algorithm, TrAC - Trends Anal. Chem. 42 (2013) 84–98. doi:10.1016/j.trac.2012.09.006. [127] Y. Akhlaghi, M. Kompany-Zareh, Application of radial basis function networks and successive projections algorithm in a QSAR study of anti-HIV activity for a large group of HEPT derivatives, J. Chemom. 20 (2006) 1–12. doi:10.1002/cem.971. [128] M. Kompany-Zareh, Y. Akhlaghi, Correlation weighted successive projections algorithm as a novel method for variable selection in QSAR studies: Investigation of anti-HIV activity of HEPT derivatives, J. Chemom. 21 (2007) 239–250. doi:10.1002/cem.1073. [129] R.K.H. Galvão, M.C.U. Araújo, W.D. Fragoso, E.C. Silva, G.E. José, S.F.C. Soares, et al., A variable elimination method to improve the parsimony of MLR models using the successive projections algorithm, Chemom. Intell. Lab. Syst. 92 (2008) 83–91. doi:10.1016/j.chemolab.2007.12.004. [130] M. Goodarzi, M.P. Freitas, R. Jensen, Feature selection and linear/nonlinear regression methods for the accurate prediction of glycogen synthase kinase-3?? inhibitory activities, J. Chem. Inf. Model. 49 (2009) 824–832. doi:10.1021/ci9000103.

[131] N. Goudarzi, M. Goodarzi, M.C.U. Araujo, R.K.H. Galvão, QSPR Modeling of Soil Sorption Coefficients (KOC) of Pesticides Using SPA-ANN and SPA-MLR, J. Agric. Food Chem. 57 (2009) 7153–7158. doi:10.1021/jf9008839.

[132] N. Goudarzi, M. Goodarzi, Application of successive projections algorithm (SPA) as a variable selection in a QSPR study to predict the octanol/water partition coefficients (Kow) of some halogenated organic compounds, Anal. Methods. 2 (2010) 758–764. doi:10.1039/b9ay00170k.

[133] F. Abbasitabar, V. Zare-Shahabadi, Development predictive QSAR models for artemisinin analogues by various feature selection methods: A comparative study, SAR QSAR Environ. Res. 23 (2012) 1–15. doi:10.1080/1062936X.2011.623316.

[134] J.B. Ghasemi, M. Salahinejad, M.K. Rofouei, M.H. Mousazadeh, Docking and 3D-QSAR study of stability constants of benzene derivatives as environmental pollutants with α-cyclodextrin, J. Incl. Phenom. Macrocycl. Chem. 73 (2012) 405–413. doi:10.1007/s10847-011-0078-4.

[135] J.B. Ghasemi, H. Tavakoli, Improvement of the prediction power of the CoMFA and CoMSIA models on histamine H3 antagonists by different variable selection methods, Sci. Pharm. 80 (2012) 547–566. doi:10.3797/scipharm.1204-19.

[136] J.B. Ghasemi, M. Salahinejad, M.K. Rofouei, Alignment Independent 3D-QSAR Modeling of Fullerene (C60) Solubility in Different Organic Solvents, Fullerenes, Nanotub. Carbon Nanostructures. 21 (2013) 367–380. doi:10.1080/1536383X.2011.629751.

[137] N. Goudarzi, M. Goodarzi, M. Arab Chamjangali, M.H. Fatemi, Application of a new SPA-SVM coupling method for QSPR study of electrophoretic mobilities of some organic and inorganic compounds, Chinese Chem. Lett. 24 (2013) 904–908. doi:10.1016/j.cclet.2013.06.002.

[138] M. Goodarzi, W. Saeys, M.C.U. De Araujo, R.K.H. Galvão, Y. Vander Heyden, Binary classification of chalcone derivatives with LDA or KNN based on their antileishmanial activity and molecular descriptors selected using the Successive Projections Algorithm feature-selection technique, Eur. J. Pharm. Sci. 51 (2014) 189–195. doi:10.1016/j.ejps.2013.09.019.

[139] M.K. Rofouei, M. Salahinejad, J.B. Ghasemi, An Alignment Independent 3D-QSAR Modeling of Dispersibility of Single-walled Carbon Nanotubes in Different Organic Solvents, Fullerenes, Nanotub. Carbon Nanostructures. 22 (2014) 605–617. doi:10.1080/1536383X.2012.702157.

[140] J. Holland, Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence, U Michigan Press, Oxford, England, 1975.

[141] J. Devillers, Genetic algorithms in computer-aided molecular design, in: J. Devillers (Ed.), Genet. Algorithms Mol. Model., Academic Press, New York, 1996: pp. 1–34.

[142] R. Leardi, Genetic algorithms in chemometrics and chemistry: A review, J. Chemom. 15 (2001) 559–569. doi:10.1002/cem.651.

[143] C.B. Lucasius, G. Kateman, Understanding and using genetic algorithms. Part 1. Concepts, properties and context, Chemom. Intell. Lab. Syst. 19 (1993) 1–33. doi:10.1016/0169-7439(93)80079-W.

[144] V. Venkatasubramanian, K. Chan, J.M. Caruthers, Computer-aided molecular design using genetic algorithms, Comput. Chem. Eng. 18 (1994) 833–844. doi:10.1016/0098-1354(93)E0023-3.

[145] V. Venkatasubramanian, A. Sundaram, Genetic algorithms: introduction and applications, in: Encycl. Comput. Chem., WILEY, New York, 1998: pp. 1115–1127.

[146] J. Devillers, C. Putavy, Designing biodegradable molecules from the combined use of a backpropagation neural network and a genetic algorithm, in: J. Devillers (Ed.), Genet. Algorithms Mol. Model., Academic Press, New York, 1996: pp. 303–314.

[147] F.R. Burden, B.S. Rosewarne, D.A. Winkler, Predicting maximum bioactivity by effective inversion of neural networks using genetic algorithms, Chemom. Intell. Lab. Syst. 38 (1997) 127–137. doi:10.1016/S0169-7439(97)00052-X.

[148] A. Sundaram, P. Ghosh, V. Venkatasubramanian, J. Caruthers, D. Daly, Design of fuel-additives using hybrid neural networks and evolutionary algorithms, in: Proc. Int. Conf. Found. Comput. Process Des., n.d.: pp. 478–481.

[149] R. Meusinger, R. Moros, Determination of quantitative structure-octane rating relationships of hydrocarbons by genetic algorithms, Chemom. Intell. Lab. Syst. 46 (1999) 67–78. doi:10.1016/S0169-7439(98)00148-8.

[150] T.J. Hou, J.M. Wang, N. Liao, et al., Applications of Genetic Algorithms on the Structure-Activity Relationship Analysis of Some Cinnamamides, J. Chem. Inf. Comput. Sci. 39 (1999) 775–781. doi:10.1021/ci990010n.

[151] T.J. Hou, J.M. Wang, X.J. Xu, Applications of genetic algorithms on the structure-activity correlation study of a group of non-nucleoside HIV-1 inhibitors, Chemom. Intell. Lab. Syst. 45 (1999) 303–310. doi:10.1016/S0169-7439(98)00135-X.

[152] P. Gramatica, V. Consonni, R. Todeschini, QSAR study on the tropospheric degradation of organic compounds, in: Chemosphere, 1999: pp. 1371–1378. doi:10.1016/S0045-6535(98)00539-6.

[153] H. Gao, Application of BCUT metrics and genetic algorithm in binary QSAR analysis, J. Chem. Inf. Comput. Sci. 41 (2001) 402–407. doi:10.1021/ci000306p.

[154] B. Hemmateenejad, R. Miri, M. Akhond, M. Shamsipur, QSAR study of the calcium channel antagonist activity of some recently synthesized dihydropyridine derivatives. An application of genetic algorithm for variable selection in MLR and PLS methods, Chemom. Intell. Lab. Syst. 64 (2002) 91–99. doi:10.1016/S0169-7439(02)00068-0.

[155] M.O. Taha, A.M. Qandil, D.D. Zaki, M.A. Aldamen, Ligand-based assessment of factor Xa binding site flexibility via elaborate pharmacophore exploration and genetic algorithm-based QSAR modeling, Eur. J. Med. Chem. 40 (2005) 701–727. doi:10.1016/j.ejmech.2004.10.014.

[156] F. Gharagheizi, QSPR Studies for Solubility Parameter by Means of Genetic Algorithm-Based Multivariate Linear Regression and Generalized Regression Neural Network, QSAR Comb. Sci. 27 (2008) 165–170. doi:10.1002/qsar.200630159.

[157] A. Habibi-Yangjeh, QSAR study of the 5-HT1A receptor affinities of arylpiperazines using a genetic algorithm-artificial neural network model, Monatshefte Fur Chemie. 140 (2009) 523–530. doi:10.1007/s00706-008-0084-4.

[158] L. Saghaie, M. Shahlaei, A. Fassihi, A. Madadkar-Sobhani, M.B. Gholivand, A. Pourhossein, QSAR Analysis for Some Diaryl-substituted Pyrazoles as CCR2 Inhibitors by GA-Stepwise MLR, Chem. Biol. Drug Des. 77 (2011) 75–85. doi:10.1111/j.1747-0285.2010.01053.x.

[159] K. Hasegawa, Y. Miyashita, K. Funatsu, GA strategy for variable selection in QSAR studies: GA-based PLS analysis of calcium channel antagonists, J. Chem. Inf. Comput. Sci. 37 (1997) 306–310. doi:10.1021/ci960047x.

[160] K. Hasegawa, K. Funatsu, GA strategy for variable selection in QSAR studies: GAPLS and D-optimal designs for predictive QSAR model, J. Mol. Struct. THEOCHEM. 425 (1998) 255–262. doi:10.1016/S0166-1280(97)00205-4.

[161] K. Hasegawa, K. Funatsu, Partial least squares modeling and genetic algorithm optimization in quantitative structure-activity relationships, SAR QSAR Environ. Res. 11 (2000) 189–209. doi:10.1080/10629360008033231.

[162] B.T. Hoffman, T. Kopajtic, J.L. Katz, A.H. Newman, 2D QSAR modeling and preliminary database searching for dopamine transporter inhibitors using genetic algorithm variable selection of Molconn Z descriptors, J. Med. Chem. 43 (2000) 4151–4159. doi:10.1021/jm990472s.

[163] D.B. Turner, P. Willett, Evaluation of the EVA descriptor for QSAR studies: 3. The use of a genetic algorithm to search for models with enhanced predictive properties (EVA GA), J. Comput. Aided. Mol. Des. 14 (2000) 1–21. doi:10.1023/A:1008180020974.

[164] O. Deeb, B. Hemmateenejad, A. Jaber, R. Garduno-Juarez, R. Miri, Effect of the electronic and physicochemical parameters on the carcinogenesis activity of some sulfa drugs using QSAR analysis based on genetic-MLR and genetic-PLS, Chemosphere. 67 (2007) 2122–2130. doi:10.1016/j.chemosphere.2006.12.098.

[165] S. Wanchana, F. Yamashita, M. Hashida, QSAR analysis of the inhibition of recombinant CYP 3A4 activity by structurally diverse compounds using a genetic algorithm-combined partial least squares method, Pharm. Res. 20 (2003) 1401–1408. doi:10.1023/A:1025702009611.

[166] A. Mohajeri, B. Hemmateenejad, A. Mehdipour, R. Miri, Modeling calcium channel antagonistic activity of dihydropyridine derivatives using QTMS indices analyzed by GA-PLS and PC-GA-PLS, J. Mol. Graph. Model. 26 (2008) 1057–1065. doi:10.1016/j.jmgm.2007.09.002.

[167] A. Fassihi, R. Sabet, QSAR study of p56 lck protein tyrosine kinase inhibitory activity of flavonoid derivatives using MLR and GA-PLS, Int. J. Mol. Sci. 9 (2008) 1876–1892. doi:10.3390/ijms9091876.

[168] P. Ghosh, M.C. Bagchi, QSAR modeling for quinoxaline derivatives using genetic algorithm and simulated annealing based feature selection., Curr. Med. Chem. 16 (2009) 4032–4048. doi:10.2174/092986709789352303.

[169] B. Hemmateenejad, S. Yousefinejad, A.R. Mehdipour, Novel amino acids indices based on quantum topological molecular similarity and their application to QSAR study of peptides, Amino Acids. 40 (2011) 1169–1183. doi:10.1007/s00726-010-0741-x.

[170] S. Yousefinejad, B. Hemmateenejad, A.R. Mehdipour, New autocorrelation QTMS-based descriptors for use in QSAM of peptides, J. Iran. Chem. Soc. 9 (2012) 569–577. doi:10.1007/s13738-012-0070-y.

[171] M. Jalali-Heravi, A. Kyani, Application of genetic algorithm-kernel partial least square as a novel nonlinear feature selection method: Activity of carbonic anhydrase II inhibitors, Eur. J. Med. Chem. 42 (2007) 649–659. doi:10.1016/j.ejmech.2006.12.020.

[172] S.S. So, M. Karplus, Evolutionary optimization in quantitative structure-activity relationship: an application of genetic neural networks, J. Med. Chem. 39 (1996) 1521–1530. doi:10.1021/jm9507035.

[173] S.-S. So, S.P. van Helden, V.J. van Geerestein, et al., Quantitative Structure-Activity Relationship Studies of Progesterone Receptor Binding Steroids, J. Chem. Inf. Comput. Sci. 40 (2000) 762–772. doi:10.1021/ci990130v.

[174] A. Yasri, D. Hartsough, Toward an optimal procedure for variable selection and QSAR model building., J. Chem. Inf. Comput. Sci. 41 (2001) 1218–1227. doi:10.1021/ci010291a.

[175] J. Zupan, M. Novič, Optimisation of structure representation for QSAR studies, Anal. Chim. Acta. 388 (1999) 243–250. doi:10.1016/S0003-2670(99)00079-3.

[176] Q. Lu, G. Shen, R. Yu, Genetic training of network using chaos concept: Application to QSAR studies of vibration modes of tetrahedral halides, J. Comput. Chem. 23 (2002) 1357–1365. doi:10.1002/jcc.10149.

[177] M.A. Safarpour, B. Hemmateenejad, R. Miri, M. Jamali, Quantum Chemical-QSAR Study of Some Newly Synthesized 1,4-Dihydropyridine Calcium Channel Blockers, QSAR Comb. Sci. 22 (2003) 997–1005. doi:10.1002/qsar.200330852.

[178] F. Marini, A. Roncaglioni, M. Novič, Variable selection and interpretation in structure-affinity correlation modeling of estrogen receptor binders, J. Chem. Inf. Model. 45 (2005) 1507–1519. doi:10.1021/ci0501645.

[179] M. Jalali-Heravi, M. Asadollahi-Baboli, QSAR analysis of platelet-derived growth inhibitors using GA-ANN and shuffling crossvalidation, QSAR Comb. Sci. 27 (2008) 750–757. doi:10.1002/qsar.200710138.

[180] M. Jalali-Heravi, A. Mani-Varnosfaderani, QSAR modeling of 1-(3,3-diphenylpropyl)piperidinyl amides as CCR5 modulators using multivariate adaptive regression spline and Bayesian regularized genetic neural networks, QSAR Comb. Sci. 28 (2009) 946–958. doi:10.1002/qsar.200860136.

[181] J. Wu, J. Mei, S. Wen, S. Liao, J. Chen, Y. Shen, A self-adaptive genetic algorithm-artificial neural network algorithm with leave-one-out cross validation for descriptor selection in QSAR study, J. Comput. Chem. 31 (2010) 1956–1968. doi:10.1002/jcc.21471.

[182] M. Goodarzi, M.P. Freitas, N. Ghasemi, QSAR studies of bioactivities of 1-(azacyclyl)-3-arylsulfonyl-1H-pyrrolo[2,3-b]pyridines as 5-HT6 receptor ligands using physicochemical descriptors and MLR and ANN-modeling, Eur. J. Med. Chem. 45 (2010) 3911–3915. doi:10.1016/j.ejmech.2010.05.045.

[183] P.S.F.J. Wilczyńska-Piliszek A.J., QSAR and ANN for the estimation of water solubility of 209 polychlorinated trans-azobenzenes, J. Environ. Sci. Heal. - Part A Toxic/Hazardous Subst. Environ. Eng. 47 (2012) 155–166. doi:10.1080/10934529.2012.640243.

[184] M. Shahlaei, A. Madadkar-Sobhani, L. Saghaie, A. Fassihi, Application of an expert system based on Genetic Algorithm-Adaptive Neuro-Fuzzy Inference System (GA-ANFIS) in QSAR of cathepsin K inhibitors, Expert Syst. Appl. 39 (2012) 6182–6191. doi:10.1016/j.eswa.2011.11.106.

[185] M. Fernandez, J. Caballero, L. Fernandez, A. Sarai, Genetic algorithm optimization in drug design QSAR: Bayesian-regularized genetic neural networks (BRGNN) and genetic algorithm-optimized support vectors machines (GA-SVM), Mol. Divers. 15 (2011) 269–289. doi:10.1007/s11030-010-9234-9.

[186] S.J. Cho, M.A. Hermsmeier, Genetic algorithm guided selection: Variable selection and subset selection, J. Chem. Inf. Comput. Sci. 42 (2002) 927–936. doi:10.1021/ci010247v.

[187] J.K. Wegner, A. Zell, Prediction of aqueous solubility and partition coefficient optimized by a genetic algorithm based descriptor selection method, J. Chem. Inf. Comput. Sci. 43 (2003) 1077–1084. doi:10.1021/ci034006u.

[188] U. Depczynski, V.J. Frost, K. Molt, Genetic algorithms applied to the selection of factors in principal component regression, Anal. Chim. Acta. 420 (2000) 217–227. doi:10.1016/S0003-2670(00)00893-X.

[189] B. Hemmateenejad, M. Shamsipur, Quantitative Structure–Electrochemistry Relationship Study of Some Organic Compounds Using PC–ANN and PCR, Internet Electron. J. Mol. Des. 3 (2004) 316–334.

[190] B. Hemmateenejad, M.A. Safarpour, R. Miri, N. Nesari, Toward an optimal procedure for PC-ANN model building: Prediction of the carcinogenic activity of a large set of drugs, J. Chem. Inf. Model. 45 (2005) 190–199. doi:10.1021/ci049766z.

[191] B. Hemmateenejad, M. Akhond, R. Miri, M. Shamsipur, Genetic algorithm applied to the selection of factors in principal component-artificial neural networks: Application to QSAR study of calcium channel antagonist activity of 1,4-dihydropyridines (nifedipine analogous), J. Chem. Inf. Comput. Sci. 43 (2003) 1328–1334. doi:10.1021/ci025661p.

[192] B. Hemmateenejad, M.A. Safarpour, R. Miri, F. Taghavi, Application of ab initio theory to QSAR study of 1,4-dihydropyridine-based calcium channel blockers using GA-MLR and PC-GA-ANN procedures, J. Comput. Chem. 25 (2004) 1495–1503. doi:10.1002/jcc.20066.

[193] A. Habibi-Yangjeh, E. Pourbasheer, M. Danandeh-Jenagharad, Application of principal component-genetic algorithm-artificial neural network for prediction acidity constant of various nitrogen-containing compounds in water, Monatshefte Fur Chemie. 140 (2009) 15–27. doi:10.1007/s00706-008-0049-7.

[194] S. Riahi, E. Pourbasheer, R. Dinarvand, M.R. Ganjali, P. Norouzi, Exploring QSARs for antiviral activity of 4-alkylamino-6-(2-hydroxyethyl)-2-methylthiopyrimidines by support vector machine, Chem. Biol. Drug Des. 72 (2008) 205–216. doi:10.1111/j.1747-0285.2008.00695.x.

[195] M. Goodarzi, P.R. Duchowicz, C.H. Wu, F.M. Fernández, E.A. Castro, New hybrid genetic based support vector regression as QSAR approach for analyzing flavonoids-GABA(A) complexes, J. Chem. Inf. Model. 49 (2009) 1475–1485. doi:10.1021/ci900075f.

[196] E. Pourbasheer, S. Riahi, M.R. Ganjali, P. Norouzi, Application of genetic algorithm-support vector machine (GA-SVM) for prediction of BK-channels activity, Eur. J. Med. Chem. 44 (2009) 5023–5028. doi:10.1016/j.ejmech.2009.09.006.

[197] E. Pourbasheer, R. Aalizadeh, M.R. Ganjali, P. Norouzi, QSAR study of α1β4 integrin inhibitors by GA-MLR and GA-SVM methods, Struct. Chem. 25 (2014) 355–370. doi:10.1007/s11224-013-0300-7.

[198] A.S. Reddy, S. Kumar, R. Garg, Hybrid-genetic algorithm based descriptor optimization and QSAR models for predicting the biological activity of Tipranavir analogs for HIV protease inhibition, J. Mol. Graph. Model. 28 (2010) 852–862. doi:10.1016/j.jmgm.2010.03.005.

[199] D.B. Hibbert, Genetic algorithms in chemistry, Chemom. Intell. Lab. Syst. 19 (1993) 277–293.

[200] A. Niazi, R. Leardi, Genetic algorithms in chemometrics, J. Chemom. 26 (2012) 345–351. doi:10.1002/cem.2426.

[201] D.B. Fogel, L.J. Fogel, Optimal routing of multiple autonomous underwater vehicles through evolutionary programming, Symp. Auton. Underw. Veh. Technol. (1990). doi:10.1109/AUV.1990.110436.

[202] D.B. Fogel, L.J. Fogel, V.W. Porto, Evolutionary methods for training neural networks, [1991 Proceedings] IEEE Conf. Neural Networks Ocean Eng. (1991). doi:10.1109/ICNN.1991.163368.

[203] D.B. Fogel, Applying evolutionary programming to selected traveling salesman problems, Cybern. Syst. 24 (1993) 27–36. doi:10.1080/01969729308961697.

[204] L.J. Fogel, A.J. Owens, M.J. Walsh, Artificial Intelligence Through Simulated Evolution, John Wiley & Sons, 1966.

[205] B.T. Luke, Evolutionary programming applied to the development of quantitative structure-activity relationships and quantitative structure-property relationships, J. Chem. Inf. Comput. Sci. 34 (1994) 1279–1287. doi:10.1021/ci00022a009.

[206] B.T. Luke, Comparison of three different QSAR/QSPR generation techniques, J. Mol. Struct. THEOCHEM. 468 (1999) 13–20. doi:10.1016/S0166-1280(98)00492-8.

[207] A.L. Parrill, Evolutionary and genetic methods in drug design, Drug Discov. Today. 1 (1996) 514–521. doi:10.1016/S1359-6446(96)10045-3.

[208] D. Weekes, G.B. Fogel, Evolutionary optimization, backpropagation, and data preparation issues in QSAR modeling of HIV inhibition by HEPT derivatives, BioSystems. 72 (2003) 149–158. doi:10.1016/S0303-2647(03)00140-0.

[209] R. Chiong, O. Koon, A Comparison between Genetic Algorithms and Evolutionary Programming based on Cutting Stock Problem, Eng. Lett. 14 (2007) 1–6.

[210] T. Aoyama, Y. Suzuki, H. Ichikawa, Neural networks applied to structure-activity relationships., J. Med. Chem. 33 (1990) 905–908. doi:10.1021/jm00165a004.

[211] S. Agatonovic-Kustrin, R. Beresford, Basic concepts of artificial neural network (ANN) modeling and its application in pharmaceutical research, J. Pharm. Biomed. Anal. 22 (2000) 717–727. doi:10.1016/S0731-7085(99)00272-1.

[212] D.A. Winkler, F.R. Burden, Application of neural networks to large dataset QSAR, virtual screening, and library design., in: L.B. English (Ed.), Methods Mol. Biol., Humana Press Inc, Totowa, NJ, 2002: pp. 325–367. doi:10.1385/1-59259-285-6:325.

[213] D.J. Livingstone, D.T. Manallack, I. V Tetko, Data modelling with neural networks: advantages and limitations., J. Comput. Aided. Mol. Des. 11 (1997) 135–142. doi:10.1023/A:1008074223811.

[214] J. Devillers, Neural networks in QSAR and drug design, Academic Press Limited, London, 1996.

[215] R. Reed, Pruning algorithms - a survey, IEEE Trans. Neural Networks. 4 (1993) 740–747. doi:10.1109/72.248452.

[216] J. Wikel, E. Dow, The Use of Neural Networks for Variable Selection in QSAR, Bioorg. Med. Chem. Lett. 3 (1993) 645–651. http://www.sciencedirect.com/science/article/pii/S0960894X01812464.

[217] V. V Kovalishyn, I. V Tetko, A.I. Luik, V. V Kholodovych, A.E.P. Villa, D.J. Livingstone, Neural Network Studies. 3. Variable Selection in the Cascade-Correlation Learning Architecture, J. Chem. Inf. Model. 38 (1998) 651–659. doi:10.1021/ci980325n.

[218] I.V. Tetko, A.E.P. Villa, D.J. Livingstone, Neural Network Studies. 2. Variable Selection, J. Chem. Inf. Model. 36 (1996) 794–803. doi:10.1021/ci950204c.

[219] M. Szaleniec, R. Tadeusiewicz, M. Witko, How to select an optimal neural model of chemical reactivity?, Neurocomputing. 72 (2008) 241–256. doi:10.1016/j.neucom.2008.01.003.

[220] I. V. Tetko, D.J. Livingstone, A.I. Luik, Neural network studies. 1. Comparison of overfitting and overtraining, J. Chem. Inf. Comput. Sci. 35 (1995) 826–833. doi:10.1021/ci00027a006.

[221] I. V Tetko, V.Y. Tanchuk, N.P. Chentsova, S. V Antonenko, G.I. Poda, V.P. Kukhar, et al., HIV-1 Reverse Transcriptase Inhibitor Design Using Artificial Neural Networks, J. Med. Chem. 37 (1994) 2520–2526. doi:10.1021/jm00042a005.

[222] I.V. Tetko, A.E.P. Villa, T.I. Aksenova, W.L. Zielinski, J. Brower, E.R. Collantes, et al., Application of a Pruning Algorithm To Optimize Artificial Neural Networks for Pharmaceutical Fingerprinting, J. Chem. Inf. Model. 38 (1998) 660–668. doi:10.1021/ci970439j.

[223] D.J.C. MacKay, Bayesian methods for back-propagation networks, in: E. Domany, J.L. van Hemmen, K. Schulten (Eds.), Model. Neural Networks III, Springer, New York, 1994: pp. 211–254.

[224] R.M. Neal, Bayesian Learning for Neural Networks, Springer, New York, 1996.

[225] F.R. Burden, M.G. Ford, D.C. Whitley, D.A. Winkler, Use of automatic relevance determination in QSAR studies using Bayesian neural networks, J. Chem. Inf. Comput. Sci. 40 (2000) 1423–1430. doi:10.1021/ci000450a.

[226] D.A. Winkler, F.R. Burden, Modelling blood-brain barrier partitioning using Bayesian neural nets, J. Mol. Graph. Model. 22 (2004) 499–505. doi:10.1016/j.jmgm.2004.03.010.

[227] O. Obrezanova, J.M.R. Gola, E.J. Champness, M.D. Segall, Automatic QSAR modeling of ADME properties: Blood-brain barrier penetration and aqueous solubility, J. Comput. Aided. Mol. Des. 22 (2008) 431–440. doi:10.1007/s10822-008-9193-8.

[228] M. Jung, J. Tak, Y. Lee, Y. Jung, Quantitative structure-activity relationship (QSAR) of tacrine derivatives against acetylcholinesterase (AChE) activity using variable selections, Bioorganic Med. Chem. Lett. 17 (2007) 1082–1090. doi:10.1016/j.bmcl.2006.11.022.

[229] T.S. Chitre, M.K. Kathiravan, K.G. Bothara, S. V. Bhandari, R.R. Jalnapurkar, Pharmacophore optimization and design of competitive inhibitors of thymidine monophosphate kinase through molecular modeling studies, Chem. Biol. Drug Des. 78 (2011) 826–834. doi:10.1111/j.1747-0285.2011.01200.x.

[230] P. Ghosh, M.C. Bagchi, QSAR modeling for quinoxaline derivatives using genetic algorithm and simulated annealing based feature selection, Curr. Med. Chem. 16 (2009) 4032–4048. doi:10.2174/092986709789352303.

[231] M. Shen, A. LeTiran, Y. Xiao, A. Golbraikh, H. Kohn, A. Tropsha, Quantitative structure-activity relationship analysis of functionalized amino acid anticonvulsant agents using k nearest neighbor and simulated annealing PLS methods, J. Med. Chem. 45 (2002) 2811–2823. doi:10.1021/jm010488u.

[232] N.K. Sahu, S. Shahi, M.C. Sharma, D. V. Kohli, QSAR studies on imidazopyridazine derivatives as PfPK7 inhibitors, Mol. Simul. 37 (2011) 752–765. doi:10.1080/08927022.2010.547050.

[233] M.C. Sharma, A structure-activity relationship study of imidazole-5-carboxylic acids derivatives as angiotensin II receptor antagonists combining 2D and 3D QSAR methods, Interdiscip. Sci. Comput. Life Sci. (2014). doi:10.1007/s12539-013-0062-3.

[234] M.C. Sharma, S. Sharma, K.S. Bhadoriya, Molecular modeling studies on substituted aminopyrimidines derivatives as potential antimalarial compounds, Med. Chem. Res. 24 (2014) 1272–1288. doi:10.1007/s00044-014-1199-2.

[235] C. Ng, Y. Xiao, W. Putnam, B. Lum, A. Tropsha, Quantitative structure-pharmacokinetic parameters relationships (QSPKR) analysis of antimicrobial agents in humans using simulated annealing k-nearest-neighbor and partial least-square analysis methods, J. Pharm. Sci. 93 (2004) 2535–2544. doi:10.1002/jps.20117.

[236] P. Ghosh, M.C. Bagchi, Comparative QSAR studies of nitrofuranyl amide derivatives using theoretical structural properties, Mol. Simul. 35 (2009) 1185–1200. doi:10.1080/08927020903033141.

[237] J.M. Sutter, S.L. Dixon, P.C. Jurs, Automatic Descriptor Selection for Quantitative Structure-Activity Relationships Using Generalized Simulated Annealing, J. Chem. Inf. Comput. Sci. 35 (1995) 77–84. doi:10.1021/ci00023a011.

[238] A. Alexandridis, P. Patrinos, H. Sarimveis, G. Tsekouras, A two-stage evolutionary algorithm for variable selection in the development of RBF neural network models, Chemom. Intell. Lab. Syst. 75 (2005) 149–162. doi:10.1016/j.chemolab.2004.06.004.

[239] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science. 220 (1983) 671–680. doi:10.1126/science.220.4598.671.

[240] J.H. Kalivas, Generalized simulated annealing for calibration sample selection from an existing set and orthogonalization of undesigned experiments, 5 (1991) 37–48.

[241] U. Hörchner, J.H. Kalivas, Simulated-annealing-based optimization algorithms: Fundamentals and wavelength selection applications, J. Chemom. 9 (1995) 283–308. doi:10.1002/cem.1180090404.

[242] J.H. Kalivas, Optimization using variations of simulated annealing, Chemom. Intell. Lab. Syst. 15 (1992) 1–12. http://www.sciencedirect.com/science/article/pii/016974399280022V.

[243] J.H. Kalivas, N. Roberts, J.M. Sutter, Global Optimization by Simulated Annealing with Wavelength Selection for Ultraviolet Visible Spectrophotometry, Anal. Chem. 61 (1989) 2024–2030. doi:10.1021/ac00193a006.

[244] W. Zheng, A. Tropsha, Novel variable selection quantitative structure-property relationship approach based on the k-nearest-neighbor principle, J. Chem. Inf. Comput. Sci. 40 (2000) 185–194. doi:10.1021/ci980033m.

[245] A. Tropsha, W. Zheng, Identification of the descriptor pharmacophores using variable selection QSAR: applications to database mining., Curr. Pharm. Des. 7 (2001) 599–612. doi:10.2174/1381612013397834.

[246] Z. Xiao, Y. De Xiao, J. Feng, A. Golbraikh, A. Tropsha, K.H. Lee, Antitumor agents. 213. Modeling of epipodophyllotoxin derivatives using variable selection k nearest neighbor QSAR method, J. Med. Chem. 45 (2002) 2294–2309. doi:10.1021/jm0105427.

[247] A. Tropsha, P. Gramatica, V.K. Gombar, The importance of being earnest: Validation is the absolute essential for successful application and interpretation of QSPR models, QSAR Comb. Sci. 22 (2003) 69–77. doi:10.1002/qsar.200390007.

[248] M. Shen, Y. Xiao, A. Golbraikh, V.K. Gombar, A. Tropsha, Development and validation of k-nearest-neighbor QSPR models of metabolic stability of drug candidates., J. Med. Chem. 46 (2003) 3013–3020. doi:10.1021/jm020491t.

[249] M. Shen, C. Béguin, A. Golbraikh, J.P. Stables, H. Kohn, A. Tropsha, Application of Predictive QSAR Models to Database Mining: Identification and Experimental Validation of Novel Anticonvulsant Compounds, J. Med. Chem. 47 (2004) 2356–2364. doi:10.1021/jm030584q.

[250] J.R. Votano, M. Parham, L.M. Hall, L.H. Hall, L.B. Kier, S. Oloff, et al., QSAR modeling of human serum protein binding with several modeling techniques utilizing structure-information representation, J. Med. Chem. 49 (2006) 7169–7181. doi:10.1021/jm051245v.

[251] A. Colorni, M. Dorigo, V. Maniezzo, Distributed Optimization by Ant Colonies, in: Proc., 1st Eur. Conf. Artif. Life, Elsevier, Paris, France, 1991: pp. 134–142.

[252] M. Dorigo, Optimization, Learning and Natural Algorithms, Politecnico di Milano, Italy, 1992.

[253] S. Izrailev, D.K. Agrafiotis, Variable selection for QSAR by artificial ant colony systems, SAR QSAR Environ. Res. 13 (2002) 417–423. doi:10.1080/10629360290014296.

[254] L.M. Gambardella, M. Dorigo, An Ant Colony System Hybridized with a New Local Search for the Sequential Ordering Problem, INFORMS J. Comput. 12 (2000) 237–255. doi:10.1287/ijoc.12.3.237.12636.

[255] Q. Shen, J.-H. Jiang, J.-C. Tao, G.-L. Shen, R.-Q. Yu, Modified ant colony optimization algorithm for variable selection in QSAR modeling: QSAR studies of cyclooxygenase inhibitors, J. Chem. Inf. Model. 45 (2005) 1024–1029. doi:10.1021/ci049610z.

[256] M. Shamsipur, V. Zare-Shahabadi, B. Hemmateenejad, M. Akhond, Ant colony optimisation: A powerful tool for wavelength selection, J. Chemom. 20 (2006) 146–157. doi:10.1002/cem.1002.

[257] M. Dorigo, T. Stützle, Ant Colony Optimization: Overview and Recent Advances, in: Handb. Metaheuristics, Springer US, New York, 2010: pp. 227–263. doi:10.1007/978-1-4419-1665-5.

[258] S. Izrailev, D. Agrafiotis, A Novel Method for Building Regression Tree Models for QSAR Based on Artificial Ant Colony Systems, J. Chem. Inf. Comput. Sci. 41 (2001) 176–180. doi:10.1021/ci000336s.

[259] S.B. Gunturi, R. Narayanan, A. Khandelwal, In silico ADME modelling 2: Computational models to predict human serum albumin binding affinity using ant colony systems, Bioorganic Med. Chem. 14 (2006) 4118–4129. doi:10.1016/j.bmc.2006.02.008.

[260] W. Shi, Q. Shen, W. Kong, B. Ye, QSAR analysis of tyrosine kinase inhibitor using modified ant colony optimization and multiple linear regression., Eur. J. Med. Chem. 42 (2007) 81–86. doi:10.1016/j.ejmech.2006.08.001.

[261] M. Goodarzi, M.P. Freitas, R. Jensen, Ant colony optimization as a feature selection method in the QSAR modeling of anti-HIV-1 activities of 3-(3,5-dimethylbenzyl)uracil derivatives using MLR, PLS and SVM regressions, Chemom. Intell. Lab. Syst. 98 (2009) 123–129. doi:10.1016/j.chemolab.2009.05.005.

[262] M. Shamsipur, V. Zare-Shahabadi, B. Hemmateenejad, M. Akhond, An efficient variable selection method based on the use of external memory in ant colony optimization. Application to QSAR/QSPR studies, Anal. Chim. Acta. 646 (2009) 39–46. doi:10.1016/j.aca.2009.05.005.

[263] M. Shamsipur, V. Zare-Shahabadi, B. Hemmateenejad, M. Akhond, Combination of ant colony optimization with various local search strategies. A novel method for variable selection in multivariate calibration and QSPR study, QSAR Comb. Sci. 28 (2009) 1263–1275. doi:10.1002/qsar.200960037.

[264] B. Hemmateenejad, M. Shamsipur, V. Zare-Shahabadi, M. Akhond, Building optimal regression tree by ant colony system-genetic algorithm: Application to modeling of melting points, Anal. Chim. Acta. 704 (2011) 57–62. doi:10.1016/j.aca.2011.08.010.

[265] N.M. O’Boyle, D.S. Palmer, F. Nigsch, J.B. Mitchell, Simultaneous feature selection and parameter optimisation using an artificial ant colony: case study of melting point prediction, Chem. Cent. J. 2 (2008) 21. doi:10.1186/1752-153X-2-21.

[266] M. Jalali-Heravi, M. Asadollahi-Baboli, Quantitative structure-activity relationship study of serotonin (5-HT7) receptor inhibitors using modified ant colony algorithm and adaptive neuro-fuzzy interference system (ANFIS), Eur. J. Med. Chem. 44 (2009) 1463–1470. doi:10.1016/j.ejmech.2008.09.050.

[267] Y. Pan, J.C. Jiang, R. Wang, J.J. Jiang, Predicting the net heat of combustion of organic compounds from molecular structures based on ant colony optimization, J. Loss Prev. Process Ind. 24 (2011) 85–89. doi:10.1016/j.jlp.2010.11.001.

[268] M. Atabati, K. Zarei, M. Mohsennia, Prediction of λmax of 1,4-naphthoquinone derivatives using ant colony optimization, Anal. Chim. Acta. 663 (2010) 7–10. doi:10.1016/j.aca.2010.01.024.

[269] V. Zare-Shahabadi, F. Abbasitabar, Application of ant colony optimization in development of models for prediction of Anti-HIV-1 activity of HEPT derivatives, J. Comput. Chem. 31 (2010) 2354–2362. doi:10.1002/jcc.21529.

[270] M. Bagheri, A. Golbraikh, Rank-based ant system method for non-linear QSPR analysis: QSPR studies of the solubility parameter, SAR QSAR Environ. Res. 23 (2012) 59–86. doi:10.1080/1062936X.2011.623356.

[271] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Proc. IEEE Int. Conf. Neural Networks (Perth, Aust.), Piscataway, NJ, 1995.

[272] R.C. Eberhart, J. Kennedy, A new optimizer using particle swarm theory, in: Proc. Sixth Int. Symp. Micro Mach. Hum. Sci., 1995: pp. 39–43.

[273] D.K. Agrafiotis, W. Cedeño, Feature selection for structure-activity correlation using binary particle swarms, J. Med. Chem. 45 (2002) 1098–1107. doi:10.1021/jm0104668.

[274] Q. Shen, J.H. Jiang, C.X. Jiao, G.L. Shen, R.Q. Yu, Modified particle swarm optimization algorithm for variable selection in MLR and PLS modeling: QSAR studies of antagonism of angiotensin II antagonists, Eur. J. Pharm. Sci. 22 (2004) 145–152. doi:10.1016/j.ejps.2004.03.002.

[275] J.-X. Lü, Q. Shen, J.-H. Jiang, G.-L. Shen, R.-Q. Yu, QSAR analysis of cyclooxygenase inhibitor using particle swarm optimization and multiple linear regression., J. Pharm. Biomed. Anal. 35 (2004) 679–687. doi:10.1016/j.jpba.2004.02.026.

[276] Q. Shen, J.H. Jiang, C.X. Jiao, S.Y. Huan, G.L. Shen, R.Q. Yu, Optimized partition of minimum spanning tree for piecewise modeling by particle swarm algorithm. QSAR studies of antagonism of angiotensin II antagonists, J. Chem. Inf. Comput. Sci. 44 (2004) 2027–2031. doi:10.1021/ci034292.

[277] Z. Wang, G.L. Durst, R.C. Eberhart, D.B. Boyd, Z.B. Miled, Particle swarm optimization and neural network application for QSAR, in: 18th Int. Parallel Distrib. Process. Symp. 2004. Proceedings., IEEE, 2004. doi:10.1109/IPDPS.2004.1303214.

[278] M. Meissner, M. Schmuker, G. Schneider, Optimized Particle Swarm Optimization (OPSO) and its application to artificial neural network training., BMC Bioinformatics. 7 (2006) 125. doi:10.1186/1471-2105-7-125.

[279] A. Khajeh, H. Modarress, H. Zeinoddini-Meymand, Modified particle swarm optimization method for variable selection in QSAR/QSPR studies, Struct. Chem. 24 (2012) 1401–1409. doi:10.1007/s11224-012-0165-1.

[280] L. Lin, W.Q. Lin, J.H. Jiang, G.L. Shen, R.Q. Yu, QSAR analysis of substituted bis[(acridine-4-carboxamide)propyl] methylamines using optimized block-wise variable combination by particle swarm optimization for partial least squares modeling, Eur. J. Pharm. Sci. 25 (2005) 245–254. doi:10.1016/j.ejps.2005.02.016.

[281] L. Hu, H. Wu, W. Lin, J. Jiang, R. Yu, Quantitative Structure–Activity Relationship Studies for the Binding Affinities of Imidazobenzodiazepines for the α6 Benzodiazepine Receptor Isoform Utilizing Optimized Blockwise Variable Combination by Particle Swarm Optimization for Partial Least Square, QSAR Comb. Sci. 26 (2007) 92–101. doi:10.1002/qsar.200530204.

[282] A. Khajeh, H. Modarress, H. Zeinoddini-Meymand, Application of modified particle swarm optimization as an efficient variable selection strategy in QSAR/QSPR studies, J. Chemom. 26 (2012) 598–603. doi:10.1002/cem.2482.

[283] Q.I. Shen, J.H. Jiang, C.X. Jiao, W.Q. Lin, G.L. Shen, R.Q. Yu, Hybridized particle swarm algorithm for adaptive structure training of multilayer feed-forward neural network: QSAR studies of bioactivity of organic compounds, J. Comput. Chem. 25 (2004) 1726–1735. doi:10.1002/jcc.20094.

[284] Q. Shen, W. Shi, X. Yang, B. Ye, Particle swarm algorithm trained neural network for QSAR studies of inhibitors of platelet-derived growth factor receptor phosphorylation., Eur. J. Pharm. Sci. 28 (2006) 369–376. doi:10.1016/j.ejps.2006.04.001.

[285] Y.-P. Zhou, J.-H. Jiang, W.-Q. Lin, H.-Y. Zou, H.-L. Wu, G.-L. Shen, et al., Adaptive configuring of radial basis function network by hybrid particle swarm algorithm for QSAR studies of organic compounds., J. Chem. Inf. Model. 46 (2006) 2494–2501. doi:10.1021/ci600218d.

[286] J.A. Lazzús, Prediction of flash point temperature of organic compounds using a hybrid method of group contribution + neural network + particle swarm optimization, Chinese J. Chem. Eng. 18 (2010) 817–823. doi:10.1016/S1004-9541(09)60133-6.

[287] J. Xing, R. Luo, H. Guo, Y. Li, H. Fu, T. Yang, et al., Radial basis function network-based transformation for nonlinear partial least-squares as optimized by particle swarm optimization: Application to QSAR studies, Chemom. Intell. Lab. Syst. 130 (2014) 37–44. doi:10.1016/j.chemolab.2013.10.006.

[288] W. Cedeño, D.K. Agrafiotis, Using particle swarms for the development of QSAR models based on K-nearest neighbor and kernel regression, J. Comput. Aided. Mol. Des. 17 (2003) 255–263. doi:10.1023/A:1025338411016.

[289] L. Lawtrakul, C. Prakasvudhisarn, Correlation Studies of HEPT Derivatives Using Swarm Intelligence and Support Vector Machines, Monatshefte Für Chemie - Chem. Mon. 136 (2005) 1681–1691. doi:10.1007/s00706-005-0357-0.

[290] L.-J. Tang, Y.-P. Zhou, J.-H. Jiang, H.-Y. Zou, H.-L. Wu, G.-L. Shen, et al., Radial basis function network-based transform for a nonlinear support vector machine as optimized by a particle swarm optimization algorithm with application to QSAR studies., J. Chem. Inf. Model. 47 (2007) 1438–1445. doi:10.1021/ci700047x.

[291] W.-Q. Lin, J.-H. Jiang, Y.-P. Zhou, H.-L. Wu, G.-L. Shen, R.-Q. Yu, Support vector machine based training of multilayer feedforward neural networks as optimized by particle swarm algorithm: application in QSAR studies of bioactivity of organic compounds., J. Comput. Chem. 28 (2007) 519–527. doi:10.1002/jcc.20561.

[292] W.L.W. Liu, D.Z.D. Zhang, Feature Subset Selection Based on Improved Discrete Particle Swarm and Support Vector Machine Algorithm, 2009 Int. Conf. Inf. Eng. Comput. Sci. (2009). doi:10.1109/ICIECS.2009.5362705.

[293] H. Yuan, J. Huang, C. Cao, Prediction of skin sensitization with a particle swarm optimized support vector machine, Int. J. Mol. Sci. 10 (2009) 3237–3254. doi:10.3390/ijms10073237.

[294] C. Prakasvudhisarn, P. Wolschann, L. Lawtrakul, Predicting complexation thermodynamic parameters of β-cyclodextrin with chiral guests by using swarm intelligence and support vector machines, Int. J. Mol. Sci. 10 (2009) 2107–2121. doi:10.3390/ijms10052107.

[295] J.-H. Wen, K.-J. Zhong, L.-J. Tang, J.-H. Jiang, H.-L. Wu, G.-L. Shen, et al., Adaptive variable-weighted support vector machine as optimized by particle swarm optimization algorithm with application of QSAR studies., Talanta. 84 (2011) 13–18. doi:10.1016/j.talanta.2010.11.039.

[296] X. Zhou, Z. Li, Z. Dai, X. Zou, QSAR modeling of peptide biological activity by coupling support vector machine with particle swarm optimization algorithm and genetic algorithm, J. Mol. Graph. Model. 29 (2010) 188–196. doi:10.1016/j.jmgm.2010.06.002.

[297] W.Q. Lin, J.H. Jiang, Q. Shen, H.L. Wu, G.L. Shen, R.Q. Yu, Piecewise hypersphere modeling by particle swarm optimization in QSAR studies of bioactivities of chemical compounds, J. Chem. Inf. Model. 45 (2005) 535–541. doi:10.1021/ci049642m.

[298] L. Lin, W.-Q. Lin, J.-H. Jiang, Y.-P. Zhou, G.-L. Shen, R.-Q. Yu, QSAR analysis of a series of 2-aryl(heteroaryl)-2,5-dihydropyrazolo[4,3-c]quinolin-3-(3H)-ones using piecewise hyper-sphere modeling by particle swarm optimization, Anal. Chim. Acta. 552 (2005) 42–49. doi:10.1016/j.aca.2005.07.033.

[299] Y.-P. Zhou, L.-J. Tang, J. Jiao, D.-D. Song, J.-H. Jiang, R.-Q. Yu, Modified particle swarm optimization algorithm for adaptively configuring globally optimal classification and regression trees., J. Chem. Inf. Model. 49 (2009) 1144–1153. doi:10.1021/ci800374h.

[300] R.-M. Luo, Y.-Q. Li, H.-L. Guo, Y.-P. Zhou, H. Xu, H. Gong, Adaptive configuration of radial basis function network by regression tree allied with hybrid particle swarm optimization algorithm, Chemom. Intell. Lab. Syst. (2013) -. doi:10.1016/j.chemolab.2013.02.002.

[301] M. Goodarzi, W. Saeys, O. Deeb, S. Pieters, Y. Vander Heyden, Particle swarm optimization and genetic algorithm as feature selection techniques for the QSAR modeling of imidazo[1,5-a]pyrido[3,2-e]pyrazines, inhibitors of phosphodiesterase 10 A., Chem. Biol. Drug Des. 82 (2013) 685–696. doi:10.1111/cbdd.12196.

[302] A. Mauri, V. Consonni, M. Pavan, R. Todeschini, Dragon software: An easy approach to molecular descriptor calculations, MATCH Commun. Math. Comput. Chem. 56 (2006) 237–248. http://www.mendeley.com/research/a-cognitive-approach-to-situationawareness-theory-and-application/.

[303] M. Akamatsu, Current State and Perspectives of 3D-QSAR, Curr. Top. Med. Chem. 2 (2002) 1381–1394. doi:10.2174/1568026023392887.

[304] S. Wold, A. Ruhe, H. Wold, W.J. Dunn, III, The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses, SIAM J. Sci. Stat. Comput. 5 (1984) 735–743. doi:10.1137/0905052.

[305] B. Hemmateenejad, Correlation ranking procedure for factor selection in PC-ANN modeling and application to ADMETox evaluation, Chemom. Intell. Lab. Syst. 75 (2005) 231–245. doi:10.1016/j.chemolab.2004.09.005.

[306] B. Hemmateenejad, Optimal QSAR analysis of the carcinogenic activity of drugs by correlation ranking and genetic algorithm-based PCR, J. Chemom. 18 (2004) 475–485. doi:10.1002/cem.891.

[307] M. Jalali-Heravi, P. Shahbazikhah, B. Zekavat, M.S. Ardejani, Principal Component Analysis-Ranking as a Variable Selection Method for the Simulation of 13C Nuclear Magnetic Resonance Spectra of Xanthones Using Artificial Neural Networks, QSAR Comb. Sci. 26 (2007) 764–772. doi:10.1002/qsar.200630111.

[308] M. Shamsipur, R. Ghavami, H. Sharghi, B. Hemmateenejad, Highly correlating distance/connectivity-based topological indices. 5. Accurate prediction of liquid density of organic molecules using PCR and PC-ANN, J. Mol. Graph. Model. 27 (2008) 506–511. doi:10.1016/j.jmgm.2008.09.005.

[309] B. Hemmateenejad, A. Mohajeri, Application of quantum topological molecular similarity descriptors in QSPR study of the O-methylation of substituted phenols, J. Comput. Chem. 29 (2008) 266–274. doi:10.1002/jcc.20787.

[310] E.R. Malinowski, Factor Analysis in Chemistry, Wiley-Interscience, New York, 2002.

[311] J. Nilsson, Multiway Calibration in 3D QSAR, University of Groningen, Groningen, The Netherlands, 1999.

[312] M.A. Carreira-Perpinan, Continuous latent variable models for dimensionality reduction and sequential data reconstruction, University of Sheffield, UK, 2001.

[313] K. Varmuza, Multivariate Data Analysis in Chemistry, in: J. Gasteiger (Ed.), Handb. Chemoinformatics, Vol. 3, Wiley-VCH, Weinheim, Germany, 2003: pp. 1098–1134.

[314] L. Xue, J. Godden, H. Gao, J. Bajorath, Identification of a Preferred Set of Molecular Descriptors for Compound Classification Based on Principal Component Analysis., J. Chem. Inf. Comput. Sci. 39 (1999) 699–704. http://doi.wiley.com/10.1002/chin.199940211.

[315] K. Torkkola, Feature Extraction by Non-Parametric Mutual Information, J. Mach. Learn. Res. 3 (2003) 1415–1438.

[316] J. Weng, W.-S. Hwang, Hierarchical discriminant regression, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 1277–1293. doi:10.1109/34.888712.

[317] A. Hyvärinen, E. Oja, Independent component analysis: Algorithms and applications, Neural Networks. 13 (2000) 411–430. doi:10.1016/S0893-6080(00)00026-5.

[318] D.K. Agrafiotis, D.N. Rassokhin, V.S. Lobanov, Multidimensional scaling and visualization of large molecular similarity tables, J. Comput. Chem. 22 (2001) 488–500. doi:10.1002/1096-987X(20010415)22:5<488::AID-JCC1020>3.0.CO;2-4.

[319] L. Eriksson, H. Antti, E. Holmes, E. Johansson, T. Lundstedt, J. Shockcor, et al., Partial Least Squares (PLS) in Cheminformatics, in: J. Gasteiger (Ed.), Handb. Chemoinformatics, Vol. 3, Wiley-VCH, Weinheim, Germany, 2003: pp. 1134–1166.

[320] D.K. Agrafiotis, V.S. Lobanov, Nonlinear mapping networks., J. Chem. Inf. Comput. Sci. 40 (2000) 1356–1362. doi:10.1021/ci000033y.

[321] D.N. Rassokhin, V.S. Lobanov, D.K. Agrafiotis, Nonlinear Mapping of Massive Data Sets by Fuzzy Clustering and Neural Networks, J. Comput. Chem. 22 (2001) 373–386. doi:10.1002/1096-987X(200103)22:4<373::AID-JCC1009>3.0.CO;2-8.

[322] B. Hemmateenejad, A.R. Mehdipour, P.L.A. Popelier, Quantum topological QSAR models based on the MOLMAP approach, Chem. Biol. Drug Des. 72 (2008) 551–563. doi:10.1111/j.1747-0285.2008.00731.x.

[323] R. Sabet, A. Fassihi, B. Hemmateenejad, L. Saghaei, R. Miri, M. Gholami, Computer-aided design of novel antibacterial 3-hydroxypyridine-4-ones: Application of QSAR methods based on the MOLMAP approach, J. Comput. Aided. Mol. Des. 26 (2012) 349–361. doi:10.1007/s10822-012-9561-2.

[324] B. Hemmateenejad, A.R. Mehdipour, R. Miri, M. Shamsipur, Application of MOLMAP approach for QSAR modeling of various biological activities using substituent electronic descriptors, J. Comput. Chem. 30 (2009) 2001–2009. doi:10.1002/jcc.21198.

[325] M. Khoshneviszadeh, N. Edraki, R. Miri, B. Hemmateenejad, Exploring QSAR for substituted 2-sulfonyl-phenyl-indol derivatives as potent and selective COX-2 inhibitors using different chemometrics tools, Chem. Biol. Drug Des. 72 (2008) 564–574. doi:10.1111/j.1747-0285.2008.00735.x.

[326] M. Khoshneviszadeh, N. Edraki, R. Miri, A. Foroumadi, B. Hemmateenejad, QSAR Study of 4-Aryl-4H-Chromenes as a New Series of Apoptosis Inducers Using Different Chemometric Tools, Chem. Biol. Drug Des. 79 (2012) 442–458. doi:10.1111/j.1747-0285.2011.01284.x.

[327] B. Hemmateenejad, M. Elyasi, A segmented principal component analysis-regression approach to quantitative structure-activity relationship modeling., Anal. Chim. Acta. 646 (2009) 30–8. doi:10.1016/j.aca.2009.05.003.

[328] S. Karimi, B. Hemmateenejad, Identification of discriminatory variables in proteomics data analysis by clustering of variables, Anal. Chim. Acta. 767 (2013) 35–43. doi:10.1016/j.aca.2012.12.050.

[329] B. Hemmateenejad, S. Karimi, N. Mobaraki, Clustering of variables in regression analysis: A comparative study between different algorithms, J. Chemom. 27 (2013) 306–317. doi:10.1002/cem.2513.

[330] C.W. Yap, H. Li, Z.L. Ji, Y.Z. Chen, Regression methods for developing QSAR and QSPR models to predict compounds of specific pharmacodynamic, pharmacokinetic and toxicological properties., Mini Rev. Med. Chem. 7 (2007) 1097–1107. doi:10.2174/138955707782331696.

[331] B. Lučić, N. Trinajstić, New Developments in QSPR/QSAR Modeling Based on Topological Indices, SAR QSAR Environ. Res. 7 (1997) 45–62. doi:10.1080/10629369708039124.

[332] G.S. Kapur, A. Ecker, R. Meusinger, Establishing quantitative structure-property relationships (QSPR) of diesel samples by proton-NMR & multiple linear regression (MLR) analysis, Energy and Fuels. 15 (2001) 943–948. doi:10.1021/ef010021u.

[333] C. Yin, X. Liu, W. Guo, T. Lin, X. Wang, L. Wang, Prediction and application in QSPR of aqueous solubility of sulfur-containing aromatic esters using GA-based MLR with quantum descriptors, Water Res. 36 (2002) 2975–2982. doi:10.1016/S0043-1354(01)00532-2.

[334] P. Gramatica, P. Pilutti, E. Papa, Ranking of volatile organic compounds for tropospheric degradability by oxidants: a QSPR approach., SAR QSAR Environ. Res. 13 (2002) 743–753. doi:10.1080/1062936021000043472.

[335] B. Hemmateenejad, H. Sharghi, M. Akhond, The Importance of Polarity/Polarizability Interaction on the Acidity Behavior of 9,10-Anthraquinone and 9-Anthrone Derivatives in Methanol–Water Mixed Solvents Using Target Factor Analysis and QSPR Approaches, J. Solution Chem. 32 (2003). doi:10.1023/A:1022982200712.

[336] D. Erös, G. Kéri, I. Kövesdi, C. Szántai-Kis, G. Mészáros, L. Orfi, Comparison of predictive ability of water solubility QSPR models generated by MLR, PLS and ANN methods., Mini Rev. Med. Chem. 4 (2004) 167–177. doi:10.2174/1389557043487466.

[337] A.D. Pillai, S. Rani, P.D. Rathod, F.P. Xavier, K.K. Vasu, H. Padh, et al., QSAR studies on some thiophene analogs as anti-inflammatory agents: Enhancement of activity by electronic parameters and its utilization for chemical lead optimization, Bioorganic Med. Chem. 13 (2005) 1275–1283. doi:10.1016/j.bmc.2004.11.016.

[338] F. Liu, Y. Liang, C. Cao, QSPR modeling of thermal conductivity detection response factors for diverse organic compound, Chemom. Intell. Lab. Syst. 81 (2006) 120–126. doi:10.1016/j.chemolab.2005.10.004.

[339] A. Afantitis, G. Melagraki, H. Sarimveis, P.A. Koutentis, J. Markopoulos, O. Igglessi-Markopoulou, Prediction of intrinsic viscosity in polymer-solvent combinations using a QSPR model, Polymer (Guildf). 47 (2006) 3240–3248. doi:10.1016/j.polymer.2006.02.060.

[340] B. Narasimhan, A.M. Ansari, N. Singh, V. Mourya, A.S. Dhakee, A QSAR approach for the prediction of stability of benzoglycolamide ester prodrugs., Chem. Pharm. Bull. (Tokyo). 54 (2006) 1067–1071. doi:10.1248/cpb.54.1067.

[341] J. Ghasemi, S. Saaidpour, S.D. Brown, QSPR study for estimation of acidity constants of some aromatic acids derivatives using multiple linear regression (MLR) analysis, J. Mol. Struct. THEOCHEM. 805 (2007) 27–32. doi:10.1016/j.theochem.2006.09.026.

[342] S. Riahi, M.R. Ganjali, P. Norouzi, F. Jafari, Application of GA-MLR, GA-PLS and the DFT quantum mechanical (QM) calculations for the prediction of the selectivity coefficients of a histamine-selective electrode, Sensors Actuators, B Chem. 132 (2008) 13–19. doi:10.1016/j.snb.2008.01.009.

[343] J. Ghasemi, A. Abdolmaleki, S. Asadpour, F. Shiri, Prediction of solubility of nonionic solutes in anionic micelle (SDS) using a QSPR model, QSAR Comb. Sci. 27 (2008) 338–346. doi:10.1002/qsar.200730022.

[344] A. Afantitis, G. Melagraki, H. Sarimveis, P.A. Koutentis, J. Markopoulos, O. Igglessi-Markopoulou, Development and evaluation of a QSPR model for the prediction of diamagnetic susceptibility, QSAR Comb. Sci. 27 (2008) 432–436. doi:10.1002/qsar.200730083.

[345] J.B. Ghasemi, A. Abdolmaleki, N. Mandoumi, A quantitative structure property relationship for prediction of solubilization of hazardous compounds using GA-based MLR in CTAB micellar media, J. Hazard. Mater. 161 (2009) 74–80. doi:10.1016/j.jhazmat.2008.03.089.

[346] L.M.V. Pinheiro, M.C.M.M. Ventura, M.L.C.J. Moita, Application of QSPR-MLR methodology to solvatochromic behavior of quinoline in binary solvent HBD/DMF mixtures, J. Mol. Liq. 154 (2010) 102–110. doi:10.1016/j.molliq.2010.04.013.

[347] G. Fayet, D. Jacquemin, V. Wathelet, E.A. Perpète, P. Rotureau, C. Adamo, Excited-state properties from ground-state DFT descriptors: A QSPR approach for dyes, J. Mol. Graph. Model. 28 (2010) 465–471. doi:10.1016/j.jmgm.2009.11.001.

[348] E. Papa, P. Gramatica, QSPR as a support for the EU REACH regulation and rational design of environmentally safer chemicals: PBT identification from molecular structure, Green Chem. 12 (2010) 836. doi:10.1039/b923843c.

[349] S. Ahmadi, Application of GA-MLR method in QSPR modeling of stability constants of diverse 15-crown-5 complexes with sodium cation, J. Incl. Phenom. Macrocycl. Chem. 74 (2012) 57–66. doi:10.1007/s10847-010-9881-6.

[350] M. Shariati-Rad, M. Hasani, QSPR study of charge-transfer complexes of some organic donors with p-chloranil using PLSR and MLR, J. Iran. Chem. Soc. 9 (2012) 19–25. doi:10.1007/s13738-011-0004-0.

[351] P. Gramatica, N. Chirico, E. Papa, S. Cassani, S. Kovarich, QSARINS: A new software for the development, analysis, and validation of QSAR MLR models, J. Comput. Chem. 34 (2013) 2121–2132. doi:10.1002/jcc.23361.

[352] S. Yousefinejad, B. Hemmateenejad, A chemometrics approach to predict the dispersibility of graphene in various liquid phases using theoretical descriptors and solvent empirical parameters, Colloids Surfaces A Physicochem. Eng. Asp. 441 (2014) 766–775. doi:10.1016/j.colsurfa.2013.03.020.

[353] S. Yousefinejad, F. Honarasa, H. Montaseri, Linear solvent structure-polymer solubility and solvation energy relationships to study conductive polymer/carbon nanotube composite solutions, RSC Adv. 5 (2015) 42266–42275. doi:10.1039/C5RA05930E.

[354] S. Weisberg, Applied Linear Regression, John Wiley & Sons, Inc., Hoboken, 2005. doi:10.2307/1269895.

[355] J.G. Topliss, R.J. Costello, Chance correlations in structure-activity studies using multiple regression analysis, J. Med. Chem. 15 (1972) 1066–1068. doi:10.1021/jm00280a017.

[356] C. Rücker, G. Rücker, M. Meringer, y-Randomization and its variants in QSPR/QSAR., J. Chem. Inf. Model. 47 (2007) 2345–2357. doi:10.1021/ci700157b.

[357] J.M. Sutter, J.H. Kalivas, P.M. Lang, Which principal components to utilize for principal component regression, J. Chemom. 6 (1992) 217–225. doi:10.1002/cem.1180060406.

[358] R.G. Brereton, Chemometrics: Data Analysis for the Laboratory and Chemical Plant, John Wiley & Sons, Ltd, Chichester, 2003.

[359] S. Yousefinejad, M. Bagheri, A.A. Moosavi-Movahedi, Quantitative sequence-activity modeling of antimicrobial hexapeptides using a segmented principal component strategy: an approach to describe and predict activities of peptide drugs containing L/D and unnatural residues., Amino Acids. 47 (2015) 125–134. doi:10.1007/s00726-014-1850-8.

[360] H. Wold, Soft modeling by latent variables: the nonlinear iterative partial least squares approach, Perspect. Probab. Stat. Pap. Honour MS Bartlett. (1975) 520–540.

[361] P. Geladi, Notes on the history and nature of partial least squares (PLS) modelling, J. Chemom. 2 (1988) 231–246. doi:10.1002/cem.1180020403.

[362] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: A basic tool of chemometrics, Chemom. Intell. Lab. Syst. 58 (2001) 109–130. doi:10.1016/S0169-7439(01)00155-1.

[363] M. Andersson, A comparison of nine PLS1 algorithms, J. Chemom. 23 (2009) 518–529. doi:10.1002/cem.1248.

[364] V. Esposito Vinzi, G. Russolillo, Partial least squares algorithms and methods, Wiley Interdiscip. Rev. Comput. Stat. 5 (2013) 1–19. doi:10.1002/wics.1239.

[365] P. Geladi, B.R. Kowalski, Partial least-squares regression: a tutorial, Anal. Chim. Acta. 185 (1986) 1–17. doi:10.1016/0003-2670(86)80028-9.

[366] R.D. Cramer III, Partial Least Squares (PLS): Its strengths and limitations, Perspect. Drug Discov. Des. 1 (1993) 269–278. doi:10.1007/BF02174528.

[367] A. Lorber, L.E. Wangen, B.R. Kowalski, A theoretical foundation for the PLS algorithm, J. Chemom. 1 (1987) 19–31. doi:10.1002/cem.1180010105.

[368] I.S. Helland, Some theoretical aspects of partial least squares regression, Chemom. Intell. Lab. Syst. 58 (2001) 97–107. doi:10.1016/S0169-7439(01)00154-X.

[369] P.P. Roy, K. Roy, On Some Aspects of Variable Selection for Partial Least Squares Regression Models, QSAR Comb. Sci. 27 (2008) 302–313. doi:10.1002/qsar.200710043.

[370] J.G. Topliss, R.P. Edwards, Chance factors in studies of quantitative structure-activity relationships., J. Med. Chem. 22 (1979) 1238–1244. doi:10.1021/jm00196a017.

[371] K. Baumann, Chance Correlation in Variable Subset Regression: Influence of the Objective Function, the Selection Mechanism, and Ensemble Averaging, QSAR Comb. Sci. 24 (2005) 1033–1046. doi:10.1002/qsar.200530134.

[372] J.G. Topliss, R.J. Costello, Chance correlations in structure-activity studies using multiple regression analysis, J. Med. Chem. 15 (1972) 1066–1068. doi:10.1021/jm00280a017.

[373] M. Clark, R.D. Cramer, The Probability of Chance Correlation Using Partial Least Squares (PLS), Quant Struct-Act Rel. 12 (1993) 137–145. doi:10.1002/qsar.19930120205.

[374] B. Norden, U. Edlund, J. Dan, S. Wold, Simplified C-13 NMR Parameters Related to the Carcinogenic Potency of Polycyclic Aromatic Hydrocarbons, Quant. Struct. Relationships. 2 (1983) 73–76. doi:10.1002/qsar.19830020205.

[375] W.J. Dunn, S. Wold, U. Edlund, S. Hellberg, J. Gasteiger, Multivariate structure-activity relationships between data from a battery of biological tests and an ensemble of structure descriptors: The PLS method, Quant. Struct. Relationships. 3 (1984) 131–137. doi:10.1002/qsar.19840030402.

[376] S. Wold, N. Kettaneh-Wold, B. Skagerberg, Nonlinear PLS modeling, Chemom. Intell. Lab. Syst. 7 (1989) 53–65. doi:10.1016/0169-7439(89)80111-X.

[377] I.E. Frank, A nonlinear PLS model, Chemom. Intell. Lab. Syst. 8 (1990) 109–119. doi:10.1016/0169-7439(90)80128-S.

[378] T.R. Holcomb, M. Morari, PLS/neural networks, Comput. Chem. Eng. 16 (1992) 393–411. doi:10.1016/0098-1354(92)80056-F.

[379] S.J. Qin, T.J. McAvoy, Nonlinear PLS modeling using neural networks, Comput. Chem. Eng. 16 (1992) 379–391. doi:10.1016/0098-1354(92)80055-E.

[380] Y.P. Zhou, J.H. Jiang, W.Q. Lin, L. Xu, H.L. Wu, G.L. Shen, et al., Artificial neural network-based transformation for nonlinear partial least-square regression with application to QSAR studies, Talanta. 71 (2007) 848–853. doi:10.1016/j.talanta.2006.05.058.

[381] S. Wold, Nonlinear partial least squares modelling. II. Spline inner relation, in: Chemom. Intell. Lab. Syst., 1992: pp. 71–84. doi:10.1016/0169-7439(92)80093-J.

[382] T. Li, H. Mei, P. Cong, Combining nonlinear PLS with the numeric genetic algorithm for QSAR, in: Chemom. Intell. Lab. Syst., 1999: pp. 177–184. doi:10.1016/S0169-7439(98)00102-6.

[383] L. Eriksson, E. Johansson, F. Lindgren, S. Wold, GIFI-PLS: Modeling of non-linearities and discontinuities in QSAR, Quant. Struct. Relationships. 19 (2000) 345–355. doi:10.1002/1521-3838(200010)19:4<345::AID-QSAR345>3.0.CO;2-Q.

[384] Y.H. Bang, C.K. Yoo, I.B. Lee, Nonlinear PLS modeling with fuzzy inference system, Chemom. Intell. Lab. Syst. 64 (2002) 137–155. doi:10.1016/S0169-7439(02)00084-9.

[385] R. Bro, Multiway calibration. Multilinear PLS, J. Chemom. 10 (1996) 47–61.

[386] J. Nilsson, S. DeJong, A.K. Smilde, Multiway calibration in 3D QSAR, J. Chemom. 11 (1997) 511–524. doi:10.1002/(SICI)1099-128X(199711/12)11:6<511::AID-CEM488>3.0.CO;2-W.

[387] W.J. Dunn, A.J. Hopfinger, C. Catana, C. Duraiswami, Solution of the conformation and alignment tensors for the binding of trimethoprim and its analogs to dihydrofolate reductase: 3D-quantitative structure-activity relationship study using molecular shape analysis, 3-way partial least-squares regression, an, J. Med. Chem. 39 (1996) 4825–4832. doi:10.1021/jm960491r.

[388] J. Nilsson, E.J. Homan, A.K. Smilde, C.J. Grol, H. Wikström, A multiway 3D QSAR analysis of a series of (S)-N-[(1-ethyl-2-pyrrolidinyl)methyl]-6-methoxybenzamides., J. Comput. Aided. Mol. Des. 12 (1998) 81–93. doi:10.1023/A:1007977010551.

[389] K. Hasegawa, M. Arakawa, K. Funatsu, 3D-QSAR study of insecticidal neonicotinoid compounds based on 3-way partial least squares model, Chemom. Intell. Lab. Syst. 47 (1999) 33–40. doi:10.1016/S0169-7439(98)00154-3.

[390] S. Yousefinejad, Application of chemometrics and chemoinformatics to study the interaction of nanomaterials with chemical and biological processes and to develop new structure-function relationships for peptides and drugs (Ph.D. Dissertation), Shiraz University, Shiraz, Iran, 2012.

[391] M. Cocchi, M.C. Menziani, G. Rastelli, P.G. De Benedetti, QSAR analysis in 2,4-diamino-6,7-dimethoxy quinoline derivatives - α1-adrenoceptor antagonists - using the partial least squares (PLS) method and theoretical molecular descriptors, Quant. Struct. Relationships. 9 (1990) 340–345. doi:10.1002/qsar.19900090408.

[392] T. Lotta, J. Taskinen, R. Bäckström, E. Nissinen, PLS modelling of structure-activity relationships of catechol O-methyltransferase inhibitors, J. Comput. Aided. Mol. Des. 6 (1992) 253–272. doi:10.1007/BF00123380.

[393] P. Caldirola, E. Coats, R. Mannhold, H. Van der Goot, H. Timmerman, New calmodulin antagonists of the diphenylalkylamine type. II. QSAR investigations by means of partial least square (PLS) analysis, Eur. J. Med. Chem. 28 (1993) 783–790. doi:10.1016/0223-5234(93)90113-S.

[394] B.L. Bush, R.B. Nachbar, Sample-distance partial least squares: PLS optimized for many variables, with application to CoMFA, J. Comput. Aided. Mol. Des. 7 (1993) 587–619. doi:10.1007/BF00124364.

[395] A.C. Good, S.S. So, W.G. Richards, Structure-activity relationships from molecular similarity matrices., J. Med. Chem. 36 (1993) 433–438. doi:10.1021/jm00056a002.

[396] K.H. Kim, Nonlinear dependence in comparative molecular field analysis, J. Comput. Aided. Mol. Des. 7 (1993) 71–82. doi:10.1007/BF00141576.

[397] Y.C. Martin, C.T. Lin, C. Hetti, J. DeLazzer, PLS analysis of distance matrices to detect nonlinear relationships between biological potency and molecular properties, J. Med. Chem. 38 (1995) 3009–3015. doi:10.1021/jm00016a003.

[398] K. Hasegawa, T. Kimura, Y. Miyashita, K. Funatsu, Nonlinear partial least squares modeling of phenyl alkylamines with the monoamine oxidase inhibitory activities., J. Chem. Inf. Comput. Sci. 36 (1996) 1025–1029. doi:10.1021/ci960362j.

[399] T. Kimura, Y. Miyashita, K. Funatsu, S. Sasaki, Quantitative Structure-Activity Relationships of the Synthetic Substrates for Elastase Enzyme Using Nonlinear Partial Least Squares Regression, J. Chem. Inf. Comput. Sci. 36 (1996) 185–189. doi:10.1021/ci9501103.

[400] S. Wold, N. Kettaneh, K. Tjessem, Hierarchical multiblock PLS and PC models for easier model interpretation and as an alternative to variable selection, J. Chemom. 10 (1996) 463–482. doi:10.1002/(sici)1099-128x(199609)10:5/6<463::aid-cem445>3.0.co;2-l.

[401] J.M. Luco, F.H. Ferretti, QSAR based on multiple linear regression and PLS methods for the anti-HIV activity of a large group of HEPT derivatives., J. Chem. Inf. Comput. Sci. 37 (1997) 392–401. doi:10.1021/ci960487o.

[402] T. Cserháti, A. Kósa, S. Balogh, Comparison of partial least-square method and canonical correlation analysis in a quantitative structure-retention relationship study, J. Biochem. Biophys. Methods. 36 (1998) 131–141. doi:10.1016/S0165-022X(98)00008-6.

[403] K.H. Kim, Nonlinear dependence in comparative molecular field analysis, J. Comput. Aided. Mol. Des. 7 (1993) 71–82. doi:10.1007/BF00141576.

[404] M. Shamsipur, B. Hemmateenejad, M. Akhond, H. Sharghi, Quantitative structure–property relationship study of acidity constants of some 9,10-anthraquinone derivatives using multiple linear regression and partial least-squares procedures, Talanta. 54 (2001) 1113–1120. doi:10.1016/S0039-9140(01)00374-5.

[405] J. Devillers, A. Chezeau, E. Thybaud, PLS-QSAR of the adult and developmental toxicity of chemicals to Hydra attenuata., SAR QSAR Environ. Res. 13 (2002) 705–712. doi:10.1080/1062936021000043445.

[406] T.I. Netzeva, T.W. Schultz, A.O. Aptula, M.T.D. Cronin, Partial least squares modelling of the acute toxicity of aliphatic compounds to Tetrahymena pyriformis., SAR QSAR Environ. Res. 14 (2003) 265–283. doi:10.1080/1062936032000101501.

[407] K. Tang, T. Li, Comparison of different partial least-squares methods in quantitative structure-activity relationships, Anal. Chim. Acta. 476 (2003) 85–92. doi:10.1016/S0003-2670(02)01257-6.

[408] P. Yang, J. Chen, S. Chen, X. Yuan, K.W. Schramm, A. Kettrup, QSPR models for physicochemical properties of polychlorinated diphenyl ethers, Sci. Total Environ. 305 (2003) 65–76. doi:10.1016/S0048-9697(02)00467-9.

[409] C. Catana, H. Gao, C. Orrenius, P.F.W. Stouten, Linear and nonlinear methods in modeling the aqueous solubility of organic compounds, J. Chem. Inf. Model. 45 (2005) 170–176. doi:10.1021/ci049797u.

[410] J.B. Van Der Linden, E.J. Ras, S.M. Hooijschuur, G.M. Klaus, N.T. Luchters, P. Dani, et al., Asymmetric catalytic ketone hydrogenation: Relating substrate structure and product enantiomeric excess using QSPR, in: QSAR Comb. Sci., 2005: pp. 94–98. doi:10.1002/qsar.200420060.

[411] V. Tantishaiyakul, N. Worakul, W. Wongpoowarak, Prediction of solubility parameters using partial least square regression, Int. J. Pharm. 325 (2006) 8–14. doi:10.1016/j.ijpharm.2006.06.009.

[412] S. Ajmani, S.A. Kulkarni, A dual-response partial least squares regression QSAR model and its application in design of dual activators of PPARα and PPARγ, QSAR Comb. Sci. 27 (2008) 1291–1304. doi:10.1002/qsar.200860023.

[413] S. Riahi, S. Eynollahi, M. Ganjali, Calculation of standard electrode potential and study of solvent effect on electronic parameters of anthraquinone-1-carboxylic Acid, Int. J. Electrochem. Sci. 4 (2009) 1128–1137. http://www.electrochemsci.org/papers/vol4/4081128.pdf.

[414] O. Deeb, M. Goodarzi, Predicting the solubility of pesticide compounds in water using QSPR methods, Mol. Phys. 108 (2010) 181–192. doi:10.1080/00268971003604575.

[415] S. Nandi, M.C. Bagchi, 3D-QSAR and molecular docking studies of 4-anilinoquinazoline derivatives: A rational approach to anticancer drug design, Mol. Divers. 14 (2010) 27–38. doi:10.1007/s11030-009-9137-9.

[416] S. Pirhadi, J.B. Ghasemi, 3D-QSAR analysis of human immunodeficiency virus entry-1 inhibitors by CoMFA and CoMSIA, Eur. J. Med. Chem. 45 (2010) 4897–4903. doi:10.1016/j.ejmech.2010.07.062.

[417] C. Gu, M. Goodarzi, X. Yang, Y. Bian, C. Sun, X. Jiang, Predictive insight into the relationship between AhR binding property and toxicity of polybrominated diphenyl ethers by PLS-derived QSAR, Toxicol. Lett. 208 (2012) 269–274. doi:10.1016/j.toxlet.2011.11.010.

[418] I.B. Stoyanova-Slavova, S.H. Slavov, B. Pearce, D.A. Buzatu, R.D. Beger, J.G. Wilkes, Partial least square and k-nearest neighbor algorithms for improved 3D quantitative spectral data-activity relationship consensus modeling of acute toxicity, Environ. Toxicol. Chem. 33 (2014) 1271–1282. doi:10.1002/etc.2534.

[419] B. Hemmateenejad, K. Javidnia, R. Miri, M. Elyasi, Quantitative structure-retention relationship study of analgesic drugs by application of combined data splitting-feature selection strategy and genetic algorithm-partial least square, J. Iran. Chem. Soc. 9 (2012) 53–60. doi:10.1007/s13738-011-0005-z.

[420] T. Mehmood, K.H. Liland, L. Snipen, S. Sæbø, A review of variable selection methods in Partial Least Squares Regression, Chemom. Intell. Lab. Syst. 118 (2012) 62–69. doi:10.1016/j.chemolab.2012.07.010.

[421] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5 (1943) 115–133. doi:10.1007/BF02478259.

[422] B.J. Wythoff, Backpropagation neural networks, Chemom. Intell. Lab. Syst. 18 (1993) 115–155. doi:10.1016/0169-7439(93)80052-J.

[423] F. Marini, R. Bucci, A.L. Magrì, A.D. Magrì, Artificial neural networks in chemometrics: History, examples and perspectives, Microchem. J. 88 (2008) 178–185. doi:10.1016/j.microc.2007.11.008.

[424] H.M. Cartwright, Artificial neural networks in biology and chemistry: the evolution of a new analytical tool., in: D.J. Livingstone (Ed.), Artif. Neural Networks (Methods Mol. Biol. Vol. 458), Humana Press, New Jersey, 2009: pp. 1–13. doi:10.1007/978-1-60327-101-1_1.

[425] T. Aoyama, Y. Suzuki, H. Ichikawa, Neural Networks Applied to Pharmaceutical Problems III. Neural networks applied to quantitative structure-activity relationship analysis, J. Med. Chem. 33 (1990) 2583–2590.

[426] T. Aoyama, H. Ichikawa, Neural Networks Applied to Pharmaceutical Problems IV. Basic Operating Characteristics of Neural Networks When Applied to Structure-Activity Studies, Chem. Pharm. Bull. (Tokyo). 39 (1991) 358–366.

[427] T. Aoyama, H. Ichikawa, Neural networks applied to pharmaceutical problems. V. Obtaining the correlation indices between drug activity and structural parameters using a neural network., Chem. Pharm. Bull. (Tokyo). 39 (1991) 372–378. doi:10.1248/cpb.39.372.

[428] H.J.H. Macfie, V.S. Rose, I.F. Croall, An Application of Unsupervised Neural Network Methodology Kohonen Topology-Preserving Mapping to QSAR Analysis, Quant. Struct. Relationships. 10 (1991) 6–15. doi:10.1002/qsar.19910100103.

[429] D.J. Livingstone, G. Hesketh, D. Clayworth, Novel method for the display of multivariate data using neural networks, J. Mol. Graph. 9 (1991) 115–118. doi:10.1016/0263-7855(91)85008-M.

[430] D.J. Livingstone, D.W. Salt, Regression analysis for QSAR using neural networks, Bioorg. Med. Chem. Lett. 2 (1992) 213–218. doi:10.1016/S0960-894X(01)81067-2.

[431] D.W. Salt, N. Yildiz, D.J. Livingstone, C.J. Tinsley, The use of artificial neural networks in QSAR, Pestic. Sci. 36 (1992) 161–170. doi:10.1002/ps.2780360212.

[432] S. Nakai, E. Li-Chan, Recent advances in structure and function of food proteins: QSAR approach., Crit. Rev. Food Sci. Nutr. 33 (1993) 477–499. doi:10.1080/10408399309527644.

[433] R.D. King, J.D. Hirst, M.J.E. Sternberg, New approaches to QSAR: Neural networks and machine learning, Perspect. Drug Discov. Des. 1 (1993) 279–290. doi:10.1007/BF02174529.

[434] D.T. Manallack, D.D. Ellis, D.J. Livingstone, Analysis of linear and nonlinear QSAR data using neural networks., J. Med. Chem. 37 (1994) 3758–3767. doi:10.1021/jm00048a012.

[435] M. Vracko, A Study of Structure-Carcinogenic Potency Relationship with Artificial Neural Networks. The Using of Descriptors Related to Geometrical and Electronic Structures, J. Chem. Inf. Comput. Sci. 37 (1997) 1037–1043. doi:10.1021/ci970231y.

[436] M. Jalali-Heravi, Z. Garkani-Nejad, Prediction of electrophoretic mobilities of sulfonamides in capillary zone electrophoresis using artificial neural networks, J. Chromatogr. A. 927 (2001) 211–218. doi:10.1016/S0021-9673(01)01099-8.

[437] K.J. Schaper, Free-Wilson-type analysis of non-additive substituent effects on THPB dopamine receptor affinity using artificial neural networks, Quant. Struct. Relationships. 18 (1999) 354–360. doi:10.1002/(SICI)1521-3838(199910)18:4<354::AID-QSAR354>3.0.CO;2-2.

[438] M. Jalali-Heravi, M. Asadollahi-Baboli, P. Shahbazikhah, QSAR study of heparanase inhibitors activity using artificial neural networks and Levenberg-Marquardt algorithm, Eur. J. Med. Chem. 43 (2008) 548–556. doi:10.1016/j.ejmech.2007.04.014.

[439] F. Gharagheizi, A. Eslamimanesh, A.H. Mohammadi, D. Richon, Use of artificial neural network-group contribution method to determine surface tension of pure compounds, J. Chem. Eng. Data. 56 (2011) 2587–2601. doi:10.1021/je2001045.

[440] J. Akbar, M.S. Iqbal, M.T. Chaudhary, T. Yasin, S. Massey, A QSPR study of drug release from an arabinoxylan using ab initio optimization and neural networks, Carbohydr. Polym. 88 (2012) 1348–1357. doi:10.1016/j.carbpol.2012.02.016.

[441] F.R. Burden, D.A. Winkler, Robust QSAR Models Using Bayesian Regularised Artificial Neural Networks, J. Med. Chem. 42 (1999) 3183–3187. doi:10.1021/jm980697n.

[442] Ajay, W.P. Walters, M.A. Murcko, Can we learn to distinguish between “drug-like” and “nondrug-like” molecules?, J. Med. Chem. 41 (1998) 3314–3324. doi:10.1021/jm970666c.

[443] F.R. Burden, D.A. Winkler, A quantitative structure-activity relationships model for the acute toxicity of substituted benzenes to Tetrahymena pyriformis using Bayesian-regularized neural networks., Chem. Res. Toxicol. 13 (2000) 436–440. doi:10.1021/tx9900627.

[444] D.A. Winkler, F.R. Burden, Robust QSAR Models from Novel Descriptors and Bayesian Regularised Neural Networks, Mol. Simul. 24 (2000) 243–258. doi:10.1080/08927020008022374.

[445] D.A. Winkler, F.R. Burden, Bayesian neural nets for modeling in drug discovery, Drug Discov. Today BIOSILICO. 2 (2004) 104–111. doi:10.1016/S1741-8364(04)02393-5.

[446] M. Fernández, A. Tundidor-Camba, J. Caballero, Modeling of cyclin-dependent kinase inhibition by 1H-pyrazolo[3,4-d]pyrimidine derivatives using artificial neural network ensembles., J. Chem. Inf. Model. 45 (2005) 1884–1895. doi:10.1021/ci050263i.

[447] J. Caballero, M. Fernández, Linear and nonlinear modeling of antifungal activity of some heterocyclic ring derivatives using multiple linear regression and Bayesian-regularized neural networks, J. Mol. Model. 12 (2006) 168–181. doi:10.1007/s00894-005-0014-x.

[448] J. Caballero, M. Fernández, Artificial neural networks from MATLAB in medicinal chemistry. Bayesian-regularized genetic neural networks (BRGNN): application to the prediction of the antagonistic activity against human platelet thrombin receptor (PAR-1)., Curr. Top. Med. Chem. 8 (2008) 1580–1605. doi:10.2174/156802608786786570.

[449] M. Goodarzi, T. Chen, M.P. Freitas, QSPR predictions of heat of fusion of organic compounds using Bayesian regularized artificial neural networks, Chemom. Intell. Lab. Syst. 104 (2010) 260–264. doi:10.1016/j.chemolab.2010.08.018.

[450] Y.L. Loukas, Adaptive neuro-fuzzy inference system: An instant and architecture-free predictor for improved QSAR studies, J. Med. Chem. 44 (2001) 2772–2783. doi:10.1021/jm000226c.

[451] L.A. Zadeh, Fuzzy sets, Inf. Control. 8 (1965) 338–353. doi:10.1016/S0019-9958(65)90241-X.

[452] M. Sugeno, G.T. Kang, Structure identification of fuzzy model, Fuzzy Sets Syst. 28 (1988) 15–33. doi:10.1016/0165-0114(88)90113-3.

[453] E. Buyukbingol, A. Sisman, M. Akyildiz, F.N. Alparslan, A. Adejare, Adaptive neuro-fuzzy inference system (ANFIS): A new approach to predictive modeling in QSAR applications: A study of neuro-fuzzy modeling of PCP-based NMDA receptor antagonists, Bioorganic Med. Chem. 15 (2007) 4265–4282. doi:10.1016/j.bmc.2007.03.065.

[454] M. Jalali-Heravi, M. Asadollahi-Baboli, Quantitative structure-activity relationship study of serotonin (5-HT7) receptor inhibitors using modified ant colony algorithm and adaptive neuro-fuzzy interference system (ANFIS), Eur. J. Med. Chem. 44 (2009) 1463–1470. doi:10.1016/j.ejmech.2008.09.050.

[455] M. Jalali-Heravi, M. Asadollahi-Baboli, A. Mani-Varnosfaderani, Shuffling multivariate adaptive regression splines and adaptive neuro-fuzzy inference system as tools for QSAR study of SARS inhibitors, J. Pharm. Biomed. Anal. 50 (2009) 853–860. doi:10.1016/j.jpba.2009.07.009.

[456] A. Khajeh, H. Modarress, QSPR prediction of flash point of esters by means of GFA and ANFIS, J. Hazard. Mater. 179 (2010) 715–720. doi:10.1016/j.jhazmat.2010.03.060.

[457] S. Afiuni-Zadeh, G. Azimi, A QSAR study for modeling of 8-azaadenine analogues proposed as A1 adenosine receptor antagonists using genetic algorithm coupling adaptive neuro-fuzzy inference system (ANFIS)., Anal. Sci. 26 (2010) 897–902. doi:10.2116/analsci.26.897.

[458] M. Goodarzi, M.P. Freitas, MIA-QSAR coupled to principal component analysis-adaptive neuro-fuzzy inference systems (PCA-ANFIS) for the modeling of the anti-HIV reverse transcriptase activities of TIBO derivatives, Eur. J. Med. Chem. 45 (2010) 1352–1358. doi:10.1016/j.ejmech.2009.12.028.

[459] G. Azimi, S. Afiuni-Zadeh, A. Karami, A QSAR study for modeling of thyroid receptors β1 selective ligands by application of adaptive neuro-fuzzy inference system and radial basis function, J. Chemom. 26 (2012) 135–142. doi:10.1002/cem.2421.

[460] D. Rogers, A.J. Hopfinger, Application of genetic function approximation to quantitative structure-activity relationships and quantitative structure-property relationships, J. Chem. Inf. Comput. Sci. 34 (1994) 854–866. doi:10.1021/ci00020a020.

[461] D. Rogers, Some Theory and Examples of Genetic Function Approximation with Comparison to Evolutionary Techniques, in: J. Devillers (Ed.), Genet. Algorithms Mol. Model., Academic Press, San Diego, CA, 1996: pp. 87–107.

[462] J.H. Friedman, Multivariate Adaptive Regression Splines, Ann. Stat. 19 (1991) 1–67. http://www.jstor.org/discover/10.2307/2241837?uid=2&uid=4&sid=21106670687953.

[463] L.M. Shi, Y. Fan, T.G. Myers, P.M. O’Connor, K.D. Paull, S.H. Friend, et al., Mining the NCI anticancer drug discovery databases: genetic function approximation for the QSAR study of anticancer ellipticine analogues., J. Chem. Inf. Comput. Sci. 38 (1998) 189–199. doi:10.1021/ci970085w.

[464] A.J. Hopfinger, S. Wang, J.S. Tokarski, B. Jin, M. Albuquerque, P.J. Madhav, et al., Construction of 3D-QSAR models using the 4D-QSAR analysis formalism, J. Am. Chem. Soc. 119 (1997) 10509–10524. doi:10.1021/ja9718937.

[465] Y. Fan, L.M. Shi, K.W. Kohn, Y. Pommier, J.N. Weinstein, Quantitative structure-antitumor activity relationships of camptothecin analogues: Cluster analysis and genetic algorithm-based studies, J. Med. Chem. 44 (2001) 3254–3263. doi:10.1021/jm0005151.

[466] P. Bhattacharya, K. Roy, QSAR of adenosine A3 receptor antagonist 1,2,4-triazolo[4,3-a]quinoxalin-1-one derivatives using chemometric tools, Bioorganic Med. Chem. Lett. 15 (2005) 3737–3743. doi:10.1016/j.bmcl.2005.05.051.

[467] S. Deswal, N. Roy, Quantitative structure activity relationship studies of aryl heterocycle-based thrombin inhibitors, Eur. J. Med. Chem. 41 (2006) 1339–1346. doi:10.1016/j.ejmech.2006.07.001.

[468] V. Frecer, QSAR analysis of antimicrobial and haemolytic effects of cyclic cationic antimicrobial peptides derived from protegrin-1, Bioorganic Med. Chem. 14 (2006) 6065–6074. doi:10.1016/j.bmc.2006.05.005.

[469] L. Maccari, M. Magnani, G. Strappaghetti, F. Corelli, M. Botta, F. Manetti, A genetic-function-approximation-based QSAR model for the affinity of arylpiperazines toward α1 adrenoceptors, J. Chem. Inf. Model. 46 (2006) 1466–1478. doi:10.1021/ci060031z.

[470] N. Sachan, S.S. Kadam, V.M. Kulkarni, Human protein tyrosine phosphatase 1B inhibitors: QSAR by genetic function approximation., J. Enzyme Inhib. Med. Chem. 22 (2007) 267–276, 371–373. doi:10.1080/14756360601051274.

[471] M.O. Taha, Y. Bustanji, A.G. Al-Bakri, A.M. Yousef, W.A. Zalloum, I.M. Al-Masri, et al., Discovery of new potent human protein tyrosine phosphatase inhibitors via pharmacophore and QSAR analysis followed by in silico screening, J. Mol. Graph. Model. 25 (2007) 870–884. doi:10.1016/j.jmgm.2006.08.008.

[472] P.C. Nair, M.E. Sobhia, Quantitative structure activity relationship studies on thiourea analogues as influenza virus neuraminidase inhibitors, Eur. J. Med. Chem. 43 (2008) 293–299. doi:10.1016/j.ejmech.2007.03.020.

[473] S. Ma, M. Lv, F. Deng, X. Zhang, H. Zhai, W. Lv, Predicting the ecotoxicity of ionic liquids towards Vibrio fischeri using genetic function approximation and least squares support vector machine, J. Hazard. Mater. 283 (2015) 591–598. doi:10.1016/j.jhazmat.2014.10.011.

[474] C.D.P. Klein, A.J. Hopfinger, Pharmacological activity and membrane interactions of antiarrhythmics: 4D-QSAR/QSPR analysis, Pharm. Res. 15 (1998) 303–311. doi:10.1023/A:1011983005813.

[475] M.G.B. Drew, J.A. Lumley, N.R. Price, Predicting ecotoxicology of organophosphorous insecticides: Successful parameter selection with the genetic function algorithm, Quant. Struct. Relationships. 18 (1999) 573–583. doi:10.1002/(SICI)1521-3838(199912)18:6<573::AID-QSAR573>3.0.CO;2-J.

[476] A.S. Kulkarni, A.J. Hopfinger, Membrane-interaction QSAR analysis: application to the estimation of eye irritation by organic compounds., Pharm. Res. 16 (1999) 1245–1253.

[477] V.M. Gokhale, V.M. Kulkarni, Understanding the antifungal activity of terbinafine analogues using quantitative structure-activity relationship (QSAR) models, Bioorganic Med. Chem. 8 (2000) 2487–2499. doi:10.1016/S0968-0896(00)00178-4.

[478] R.G. Karki, V.M. Kulkarni, Three-dimensional quantitative structure-activity relationship (3D-QSAR) of 3-aryloxazolidin-2-one antibacterials, Bioorganic Med. Chem. 9 (2001) 3153–3160. doi:10.1016/S0968-0896(01)00186-9.

[479] M. Iyer, R. Mishra, Y. Han, A.J. Hopfinger, Predicting blood-brain barrier partitioning of organic molecules using membrane-interaction QSAR analysis, Pharm. Res. 19 (2002) 1611–1621. doi:10.1023/A:1020792909928.

[480] M.T. Makhija, V.M. Kulkarni, QSAR of HIV-1 integrase inhibitors by genetic function approximation method., Bioorg. Med. Chem. 10 (2002) 1483–1497. doi:10.1016/S0968-0896(01)00415-1.

[481] H. Yuan, A.L. Parrill, QSAR studies of HIV-1 integrase inhibition, Bioorganic Med. Chem. 10 (2002) 4169–4183. doi:10.1016/S0968-0896(02)00332-2.

[482] J. Liu, D. Pan, Y. Tseng, A.J. Hopfinger, 4D-QSAR Analysis of a Series of Antifungal P450 Inhibitors and 3D-Pharmacophore Comparisons as a Function of Alignment, J. Chem. Inf. Comput. Sci. 43 (2003) 2170–2179. doi:10.1021/ci034142z.

[483] A. V. Raichurkar, V.M. Kulkarni, 3D–QSAR of Cyclooxygenase–2 Inhibitors by Genetic Function Approximation, Internet Electron. J. Mol. Des. 3 (2003) 242–261.

[484] P. Bhattacharya, J.T. Leonard, K. Roy, Exploring 3D-QSAR of thiazole and thiadiazole derivatives as potent and selective human adenosine A3 receptor antagonists, J. Mol. Model. 11 (2005) 516–524. doi:10.1007/s00894-005-0273-6.

[485] K. Roy, J.T. Leonard, QSAR by LFER model of cytotoxicity data of anti-HIV 5-phenyl-1-phenylamino-1H-imidazole derivatives using principal component factor analysis and genetic function approximation., Bioorg. Med. Chem. 13 (2005) 2967–2973. doi:10.1016/j.bmc.2005.02.003.

[486] J. Thomas Leonard, K. Roy, Comparative QSAR modeling of CCR5 receptor binding affinity of substituted 1-(3,3-diphenylpropyl)-piperidinyl amides and ureas, Bioorganic Med. Chem. Lett. 16 (2006) 4467–4474. doi:10.1016/j.bmcl.2006.06.031.

[487] S. Deswal, N. Roy, A novel range based QSAR study of human neuropeptide Y (NPY) Y5 receptor inhibitors, Eur. J. Med. Chem. 42 (2007) 463–470. doi:10.1016/j.ejmech.2006.09.011.

[488] P.M. Sivakumar, S.K. Geetha Babu, D. Mukesh, QSAR studies on chalcones and flavonoids as anti-tuberculosis agents using genetic function approximation (GFA) method., Chem. Pharm. Bull. (Tokyo). 55 (2007) 44–49. doi:10.1248/cpb.55.44.

[489] A.P. Zambre, A.L. Ganure, D.B. Shinde, V.M. Kulkarni, Perspective assessment of COX-1 and COX-2 selectivity of nonsteroidal anti-inflammatory drugs from clinical practice: Use of genetic function approximation, J. Chem. Inf. Model. 47 (2007) 635–643. doi:10.1021/ci6004367.

[490] K. Roy, A.S. Mandal, Development of linear and nonlinear predictive QSAR models and their external validation using molecular similarity principle for anti-HIV indolyl aryl sulfones., J. Enzyme Inhib. Med. Chem. 23 (2008) 980–995. doi:10.1080/14756360701811379.

[491] K. Roy, G. Ghosh, QSTR with extended topochemical atom (ETA) indices. 12. QSAR for the toxicity of diverse aromatic compounds to Tetrahymena pyriformis using chemometric tools, Chemosphere. 77 (2009) 999–1009. doi:10.1016/j.chemosphere.2009.07.072.

[492] K. Roy, P.P. Roy, Exploring QSAR and QAAR for inhibitors of cytochrome P450 2A6 and 2A5 enzymes using GFA and G/PLS techniques, Eur. J. Med. Chem. 44 (2009) 1941–1951. doi:10.1016/j.ejmech.2008.11.010.

[493] P.P. Roy, S. Paul, I. Mitra, K. Roy, On two novel parameters for validation of predictive QSAR models, Molecules. 14 (2009) 1660–1701. doi:10.3390/molecules14051660.

[494] P.P. Roy, K. Roy, QSAR studies of CYP2D6 inhibitor aryloxypropanolamines using 2D and 3D descriptors, Chem. Biol. Drug Des. 73 (2009) 442–455. doi:10.1111/j.1747-0285.2009.00791.x.

[495] K.A. Solomon, S. Sundararajan, V. Abirami, QSAR studies on N-aryl derivative activity towards Alzheimer’s disease, Molecules. 14 (2009) 1448–1455. doi:10.3390/molecules14041448.

[496] K.F. Khaled, Quantitative Structure and Activity Relationship Modeling Study of Corrosion Inhibitors: Genetic Function Approximation and Molecular Dynamics Simulation Methods, Comput. Stud. 6 (2011) 4077–4094.

[497] K.F. Khaled, Modeling corrosion inhibition of iron in acid medium by genetic function approximation method: A QSAR model, Corros. Sci. 53 (2011) 3457–3465. doi:10.1016/j.corsci.2011.01.035.

[498] S.M. Mousavisafavi, S.A. Mirkhani, F. Gharagheizi, J. Akbari, A predictive quantitative structure–property relationship for glass transition temperature of 1,3-dialkyl imidazolium ionic liquids, J. Therm. Anal. Calorim. (2012). doi:10.1007/s10973-012-2207-8.

[499] S. Ray, P. Pratim Roy, A QSAR Study of Biphenyl Analogues of 2-Nitroimidazo-[2,1-b][1,3]-oxazines as Antitubercular Agents Using Genetic Function Approximation, Med. Chem. (Los. Angeles). 8 (2012) 717–726. doi:10.2174/157340612801216210.

[500] S. Pramanik, K. Roy, Exploring QSTR modeling and toxicophore mapping for identification of important molecular features contributing to the chemical toxicity in Escherichia coli, Toxicol. Vitr. 28 (2014) 265–272. doi:10.1016/j.tiv.2013.11.002.

[501] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995. doi:10.1109/TNN.1997.641482.

[502] B.E. Boser, I.M. Guyon, V.N. Vapnik, A Training Algorithm for Optimal Margin Classifiers, in: Proc. 5th Annu. ACM Work. Comput. Learn. Theory, 1992: pp. 144–152. doi:10.1.1.21.3818.

[503] H. Li, Y. Liang, Q. Xu, Support vector machines and its applications in chemistry, Chemom. Intell. Lab. Syst. 95 (2009) 188–198. doi:10.1016/j.chemolab.2008.10.007.

[504] R. Collobert, S. Bengio, SVMTorch: support vector machines for large-scale regression problems, J. Mach. Learn. Res. 1 (2001) 143–160. http://www.crossref.org/deleted_DOI.html.

[505] S.-P. Liao, H.-T. Lin, C.-J. Lin, A note on the decomposition methods for support vector regression., Neural Comput. 14 (2002) 1267–1281. doi:10.1162/089976602753712936.

[506] R. Burbidge, M. Trotter, B. Buxton, S. Holden, Drug design by machine learning: Support vector machines for pharmaceutical data analysis, Comput. Chem. 26 (2001) 5–14. doi:10.1016/S0097-8485(01)00094-8.

[507] R. Czerminski, A. Yasri, D. Hartsough, Use of support vector machine in pattern classification: Application to QSAR studies, Quant. Struct. Relationships. 20 (2001) 227–240. doi:10.1002/1521-3838(200110)20:3<227::AID-QSAR227>3.0.CO;2-Y.

[508] M. Song, C.M. Breneman, J. Bi, N. Sukumar, K.P. Bennett, S. Cramer, et al., Prediction of protein retention times in anion-exchange chromatography systems using support vector regression, J. Chem. Inf. Comput. Sci. 42 (2002) 1347–1357. doi:10.1021/ci025580t.

[509] S. Doniger, T. Hofmann, J. Yeh, Predicting CNS permeability of drug molecules: comparison of neural network and support vector machine algorithms., J. Comput. Biol. 9 (2002) 849–864. doi:10.1089/10665270260518317.

[510] J.R. Serra, E.D. Thompson, P.C. Jurs, Development of binary classification of structural chromosome aberrations for a diverse set of organic compounds from molecular structure, Chem. Res. Toxicol. 16 (2003) 153–163. doi:10.1021/tx020077w.

[511] P. Lind, T. Maltseva, Support Vector Machines for the Estimation of Aqueous Solubility, J. Chem. Inf. Comput. Sci. 43 (2003) 1855–1859. doi:10.1021/ci034107s.

[512] X.J. Yao, A. Panaye, J.P. Doucet, R.S. Zhang, H.F. Chen, M.C. Liu, et al., Comparative study of QSAR/QSPR correlations using support vector machines, radial basis function neural networks, and multiple linear regression, J. Chem. Inf. Comput. Sci. 44 (2004) 1257–1266. doi:10.1021/ci049965i.

[513] H.X. Liu, R.S. Zhang, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, Prediction of the Isoelectric Point of an Amino Acid Based on GA-PLS and SVMs, J. Chem. Inf. Comput. Sci. 44 (2004) 161–167. doi:10.1021/ci034173u.

[514] C.X. Xue, R.S. Zhang, H.X. Liu, X.J. Yao, M.C. Liu, Z.D. Hu, et al., An accurate QSPR study of O-H bond dissociation energy in substituted phenols based on support vector machines, J. Chem. Inf. Comput. Sci. 44 (2004) 669–677. doi:10.1021/ci034248u.

[515] C.X. Xue, R.S. Zhang, H.X. Liu, X.J. Yao, M.C. Hu, Z.D. Hu, et al., QSAR models for the prediction of binding affinities to human serum albumin using the heuristic method and a support vector machine, J. Chem. Inf. Comput. Sci. 44 (2004) 1693–1700. doi:10.1021/ci049820b.

[516] H.X. Liu, C.X. Xue, R.S. Zhang, X.J. Yao, M.C. Liu, Z.D. Hu, et al., Quantitative prediction of logk of peptides in high-performance liquid chromatography based on molecular descriptors by using the heuristic method and support vector machine, J. Chem. Inf. Comput. Sci. 44 (2004) 1979–1986. doi:10.1021/ci049891a.

[517] H.X. Liu, R.J. Hu, R.S. Zhang, X.J. Yao, M.C. Liu, Z.D. Hu, et al., The prediction of human oral absorption for diffusion rate-limited drugs based on heuristic method and support vector machine, J. Comput. Aided. Mol. Des. 19 (2005) 33–46. doi:10.1007/s10822-005-0095-8.

[518] H.X. Liu, X.J. Yao, R.S. Zhang, M.C. Liu, Z.D. Hu, B.T. Fan, Prediction of the tissue/blood partition coefficients of organic compounds based on the molecular structure using least-squares support vector machines, J. Comput. Aided. Mol. Des. 19 (2005) 499–508. doi:10.1007/s10822-005-9003-5.

[519] H. Liu, X. Yao, R. Zhang, M. Liu, Z. Hu, B. Fan, Accurate quantitative structure-property relationship model to predict the solubility of C60 in various solvents based on a novel approach using a least-squares support vector machine., J. Phys. Chem. B. 109 (2005) 20565–20571. doi:10.1021/jp052223n.

[520] Y.P. Zhou, J.H. Jiang, W.Q. Lin, H.Y. Zou, H.L. Wu, G.L. Shen, et al., Boosting support vector regression in QSAR studies of bioactivities of chemical compounds, Eur. J. Pharm. Sci. 28 (2006) 344–353. doi:10.1016/j.ejps.2006.04.002.

[521] H.F. Chen, Quantitative predictions of gas chromatography retention indexes with support vector machines, radial basis neural networks and multiple linear regression, Anal. Chim. Acta. 609 (2008) 24–36. doi:10.1016/j.aca.2008.01.003.

[522] A. Niazi, S. Jameh-Bozorghi, D. Nori-Shargh, Prediction of toxicity of nitrobenzenes using ab initio and least squares support vector machines, J. Hazard. Mater. 151 (2008) 603–609. doi:10.1016/j.jhazmat.2007.06.030.

[523] Y. Pan, J. Jiang, R. Wang, H. Cao, Advantages of support vector machine in QSPR studies for predicting auto-ignition temperatures of organic compounds, Chemom. Intell. Lab. Syst. 92 (2008) 169–178. doi:10.1016/j.chemolab.2008.03.002.

[524] Y. Pan, J. Jiang, R. Wang, H. Cao, J. Zhao, Quantitative Structure-Property Relationship Studies for Predicting Flash Points of Organic Compounds using Support Vector Machines, QSAR Comb. Sci. 27 (2008) 1013–1019. doi:10.1002/qsar.200810009.

[525] A.R. Katritzky, Y. Ren, S.H. Slavov, M. Karelson, A comparative QSAR study of SVM and PPR in the correlation of lithium cation basicities, Collect. Czechoslov. Chem. Commun. 74 (2009) 217–241. doi:10.1135/cccc2008191.

[526] N. Goudarzi, M. Goodarzi, Prediction of the acidic dissociation constant (pKa) of some organic compounds using linear and nonlinear QSPR methods, Mol. Phys. 107 (2009) 1495–1503. doi:10.1080/00268970902950394.

[527] M.H. Fatemi, E. Baher, Quantitative structure-property relationship modelling of the degradability rate constant of alkenes by OH radicals in atmosphere., SAR QSAR Environ. Res. 20 (2009) 77–90. doi:10.1080/10629360902726700.

[528] S. Riahi, E. Pourbasheer, M.R. Ganjali, P. Norouzi, Support vector machine-based quantitative structure-activity relationship study of cholesteryl ester transfer protein inhibitors., Chem. Biol. Drug Des. 73 (2009) 558–571. doi:10.1111/j.1747-0285.2009.00800.x.

[529] R. Hu, J.-P. Doucet, M. Delamar, R. Zhang, QSAR models for 2-amino-6-arylsulfonylbenzonitriles and congeners HIV-1 reverse transcriptase inhibitors based on linear and nonlinear regression methods., Eur. J. Med. Chem. 44 (2009) 2158–2171. doi:10.1016/j.ejmech.2008.10.021.

[530] Y. Pan, J. Jiang, R. Wang, H. Cao, Y. Cui, Predicting the auto-ignition temperatures of organic compounds from molecular structure using support vector machine, J. Hazard. Mater. 164 (2009) 1242–1249. doi:10.1016/j.jhazmat.2008.09.031.

[531] M. Sun, J. Chen, H. Wei, S. Yin, Y. Yang, M. Ji, Quantitative structure-activity relationship and classification analysis of diaryl ureas against vascular endothelial growth factor receptor-2 kinase using linear and non-linear models., Chem. Biol. Drug Des. 73 (2009) 644–654. doi:10.1111/j.1747-0285.2009.00814.x.

[532] R. Darnag, E.L. Mostapha Mazouz, A. Schmitzer, D. Villemin, A. Jarid, D. Cherqaoui, Support vector machines: Development of QSAR models for predicting anti-HIV-1 activity of TIBO derivatives, Eur. J. Med. Chem. 45 (2010) 1590–1597. doi:10.1016/j.ejmech.2010.01.002.

[533] K. Hasegawa, K. Funatsu, Non-linear modeling and chemical interpretation with aid of support vector machine and regression., Curr. Comput. Aided. Drug Des. 6 (2010) 24–36. doi:10.2174/157340910790980124.

[534] M. Goodarzi, M.P. Freitas, C.H. Wu, P.R. Duchowicz, pKa modeling and prediction of a series of pH indicators through genetic algorithm-least square support vector regression, Chemom. Intell. Lab. Syst. 101 (2010) 102–109. doi:10.1016/j.chemolab.2010.02.003.

[535] Z. Cheng, Y. Zhang, W. Fu, QSAR study of carboxylic acid derivatives as HIV-1 Integrase inhibitors, Eur. J. Med. Chem. 45 (2010) 3970–3980. doi:10.1016/j.ejmech.2010.05.052.

[536] D.-S. Cao, Q.-S. Xu, Y.-Z. Liang, X. Chen, H.-D. Li, Prediction of aqueous solubility of druglike organic compounds using partial least squares, back-propagation network and support vector machine, J. Chemom. 24 (2010) 584–595. doi:10.1002/cem.1321.

[537] M.H. Fatemi, A. Heidari, M. Ghorbanzade, Prediction of aqueous solubility of drug-like compounds by using an artificial neural network and least-squares support vector machine, Bull. Chem. Soc. Jpn. 83 (2010) 1338–1345. doi:10.1246/bcsj.20100074.

[538] K.-C. Chen, C. Yu-Chian Chen, Stroke prevention by traditional Chinese medicine? A genetic algorithm, support vector machine and molecular dynamics approach, Soft Matter. 7 (2011) 4001. doi:10.1039/c0sm01548b.

[539] J. Xu, L. Wang, L. Wang, X. Shen, W. Xu, QSPR study of Setschenow constants of organic compounds using MLR, ANN, and SVM analyses, J. Comput. Chem. 32 (2011) 3241–3252. doi:10.1002/jcc.21907.

[540] X. Dong, J. Yan, D. Lu, P. Wu, J. Gao, T. Liu, et al., QSAR Models for isoindolinone-based p53-MDM2 Interaction Inhibitors Using Linear and Non-linear Statistical Methods, Chem. Biol. Drug Des. 79 (2012) 691–702. doi:10.1111/j.1747-0285.2012.01322.x.

[541] X. Yu, B. Yi, X. Wang, J. Chen, Predicting reaction rate constants of ozone with organic compounds from radical structures, Atmos. Environ. 51 (2012) 124–130. doi:10.1016/j.atmosenv.2012.01.037.

[542] J.C. Gertrudes, V.G. Maltarollo, R.A. Silva, P.R. Oliveira, K.M. Honorio, A.B.F. da Silva, Machine Learning Techniques and Drug Design, Curr. Med. Chem. 19 (2012) 4289–4297. doi:10.2174/092986712802884259.

[543] H. Chen, L. Carlsson, M. Eriksson, P. Varkonyi, U. Norinder, I. Nilsson, Beyond the scope of Free-Wilson analysis: Building interpretable QSAR models with machine learning algorithms, J. Chem. Inf. Model. 53 (2013) 1324–1336. doi:10.1021/ci4001376.

[544] J. Shi, L. Chen, W. Chen, Prediction of the heat capacity for compounds based on the conjugate gradient and support vector machine methods, J. Chemom. 27 (2013) 251–259. doi:10.1002/cem.2532.

[545] Y. Zhang, An improved QSPR method based on support vector machine applying rational sample data selection and genetic algorithm-controlled training parameters optimization, Chemom. Intell. Lab. Syst. 134 (2014) 34–46. doi:10.1016/j.chemolab.2014.03.004.

[546] M.H. Fatemi, K. Samghani, Developing a Support Vector Machine Based QSPR Model for Prediction of Atmospheric Lifetime of Some Halocarbons, Bull. Chem. Soc. Jpn. 87 (2014) 1281–1287. doi:10.1246/bcsj.20140169.

[547] S. Sepehri, S. Gharagani, L. Saghaie, M.R. Aghasadeghi, A. Fassihi, QSAR and docking studies of some 1,2,3,4-tetrahydropyrimidines: evaluation of gp41 as possible target for anti-HIV-1 activity, Med. Chem. Res. 24 (2014) 1707–1724. doi:10.1007/s00044-014-1246-z.

[548] R. Martinčič, K. Venko, Š. Župerl, M. Novič, Chemometrics approach for the prediction of structure–activity relationship for membrane transporter bilitranslocase, SAR QSAR Environ. Res. 25 (2014) 853–872. doi:10.1080/1062936X.2014.962082.

[549] R. Martinčič, I. Kuzmanovski, A. Wagner, M. Novič, Development of models for prediction of the antioxidant activity of derivatives of natural compounds, Anal. Chim. Acta. 868 (2015) 23–35. doi:10.1016/j.aca.2015.01.050.

NU

[550] J.-P. Doucet, F. Barbault, H. Xia, A. Panaye, B. Fan, Nonlinear SVM Approaches to QSPR/QSAR Studies and Drug Design, Curr. Comput. - Aided Drug Des. 3 (2007) 263– 289. doi:10.2174/157340907782799372.

MA

[551] G.W. Kauffman, P.C. Jurs, QSAR and k-Nearest Neighbor Classification Analysis of Selective Cyclooxygenase-2 Inhibitors Using Topologically-Based Numerical Descriptors, J. Chem. Inf. Comput. Sci. 41 (2001) 1553–1560. doi:10.1021/ci010073h.

TE

D

[552] V. Svetnik, A. Liaw, C. Tong, J.C. Culberson, R.P. Sheridan, B.P. Feuston, Random Forest : A Classification and Regression Tool for Compound Classification and QSAR Modeling, J. Chem. Inf. Model. 43 (2003) 1947–1958.

AC CE P

[553] D.H. Hawkins, Topics in Applied MultiVariate Analysis, Cambridge University Press, Cambridge, U.K., 1982. [554] D.M. Hawkins, FIRM: Formal inference-based recursive modeling, 1997. [555] S.J. Cho, C.F. Shen, M.A. Hermsmeier, Binary Formal Inference-Based Recursive Modeling Using Multiple Atom and Physicochemical Property Class Pair and Torsion Descriptors as Decision Criteria, J. Chem. Inf. Model. 40 (2000) 668–680. doi:10.1021/ci9908190. [556] C.J. Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, Classification and Regression Trees, Wadsworth International Group, New York, 1984. [557] J.R. Quinlan, C4.5 Programs for Machine Learning, Morgan Kaufmann Publishers, San Mate, CA, 1992. [558] a Rusinko, M.W. Farmen, C.G. Lambert, P.L. Brown, S.S. Young, Analysis of a large structure/biological activity data set using recursive partitioning., J. Chem. Inf. Comput. Sci. 39 (1999) 1017–26. http://www.ncbi.nlm.nih.gov/pubmed/10614024. [559] W. Tong, H. Hong, H. Fang, Q. Xie, R. Perkins, Decision forest: Combining the predictions of multiple independent decision tree models, J. Chem. Inf. Comput. Sci. 43 (2003) 525–531. doi:10.1021/ci020058s.

[560] T.G. Dietterich, Ensemble learning, in: M.A. Arbib (Ed.), Handb. Brain Theory Neural Networks, 2nd ed., The MIT Press, Cambridge, 2002.

[561] N. Manga, J.C. Duffy, P.H. Rowe, M.T.D. Cronin, Structure-based methods for the prediction of the dominant P450 enzyme in human drug biotransformation: consideration of CYP3A4, CYP2C9, CYP2D6, SAR QSAR Environ. Res. 16 (n.d.) 43–61. doi:10.1080/10629360412331319871.

[562] F. Hammann, H. Gutmann, U. Jecklin, A. Maunz, C. Helma, J. Drewe, Development of decision tree models for substrates, inhibitors, and inducers of p-glycoprotein, Curr. Drug Metab. 10 (2009) 339–346. doi:10.2174/138920009788499021.

[563] J.R. Votano, M. Parham, L.H. Hall, L.B. Kier, S. Oloff, A. Tropsha, et al., Three new consensus QSAR models for the prediction of Ames genotoxicity, Mutagenesis. 19 (2004) 365–377. doi:10.1093/mutage/geh043.

[564] H. Hong, W. Tong, Q. Xie, H. Fang, R. Perkins, An in silico ensemble method for lead discovery: decision forest, SAR QSAR Environ. Res. 16 (2005) 339–347. doi:10.1080/10659360500203022.

[565] S. Gupta, N. Basant, K.P. Singh, Estimating sensory irritation potency of volatile organic chemicals using QSARs based on decision tree methods for regulatory purpose, Ecotoxicology. 24 (2015) 873–886. doi:10.1007/s10646-015-1431-y.

[566] M. Chen, H. Hong, H. Fang, R. Kelly, G. Zhou, J. Borlak, et al., Quantitative structure-activity relationship models for predicting drug-induced liver injury based on FDA-approved drug labeling annotation and using a large collection of drugs, Toxicol. Sci. 136 (2013) 242–249. doi:10.1093/toxsci/kft189.

[567] R.D. King, J.D. Hirst, M.J.E. Sternberg, Comparison of Artificial Intelligence Methods for Modeling Pharmaceutical QSARs, Appl. Artif. Intell. 9 (1995) 213–233. doi:10.1080/08839519508945474.

[568] J.P.F. Bai, A. Utis, G. Crippen, H.-D. He, V. Fischer, R. Tullman, et al., Use of classification regression tree in predicting oral absorption in humans, J. Chem. Inf. Comput. Sci. 44 (2004) 2061–2069. doi:10.1021/ci040023n.

[569] B. Baert, E. Deconinck, M. Van Gele, M. Slodicka, P. Stoppie, S. Bodé, et al., Transdermal penetration behaviour of drugs: CART-clustering, QSPR and selection of model compounds, Bioorganic Med. Chem. 15 (2007) 6943–6955. doi:10.1016/j.bmc.2007.07.050.

[570] S.M. Tan, J. Jiao, X.L. Zhu, Y.P. Zhou, D.D. Song, H. Gong, et al., QSAR studies of a diverse series of antimicrobial agents against Candida albicans by classification and regression trees, Chemom. Intell. Lab. Syst. 103 (2010) 184–190. doi:10.1016/j.chemolab.2010.07.005.

[571] V. Svetnik, A. Liaw, C. Tong, T. Wang, Application of Breiman’s random forest to modeling structure-activity relationships of pharmaceutical molecules, Mult. Classif. Syst. (2004) 334–343. doi:10.1007/978-3-540-25966-4_33.

[572] W. Tong, H. Hong, Q. Xie, L. Shi, H. Fang, R. Perkins, Assessing QSAR Limitations - A Regulatory Perspective, Curr. Comput.-Aided Drug Des. 1 (2005) 195–205. doi:10.2174/1573409053585663.

[573] D.S. Palmer, N.M. O’Boyle, R.C. Glen, J.B.O. Mitchell, Random forest models to predict aqueous solubility, J. Chem. Inf. Model. 47 (2007) 150–158. doi:10.1021/ci060164k.

[574] Q.Y. Zhang, J. Aires-de-Sousa, Random forest prediction of mutagenicity from empirical physicochemical descriptors, J. Chem. Inf. Model. 47 (2007) 1–8. doi:10.1021/ci050520j.

[575] Y. Sakiyama, H. Yuki, T. Moriya, K. Hattori, M. Suzuki, K. Shimada, et al., Predicting human liver microsomal stability with machine learning techniques, J. Mol. Graph. Model. 26 (2008) 907–915. doi:10.1016/j.jmgm.2007.06.005.

[576] R. Rajappan, P.D. Shingade, R. Natarajan, V.K. Jayaraman, Quantitative structure-property relationship (QSPR) prediction of liquid viscosities of pure organic compounds employing random forest regression, Ind. Eng. Chem. Res. 48 (2009) 9708–9712. doi:10.1021/ie8018406.

[577] P.G. Polishchuk, E.N. Muratov, A.G. Artemenko, O.G. Kolumbin, N.N. Muratov, V.E. Kuz’min, Application of random forest approach to QSAR prediction of aquatic toxicity, J. Chem. Inf. Model. 49 (2009) 2481–2488. doi:10.1021/ci900203n.

[578] N.A. Kovdienko, P.G. Polishchuk, E.N. Muratov, A.G. Artemenko, V.E. Kuz’min, L. Gorb, et al., Application of random forest and multiple linear regression techniques to QSPR prediction of an aqueous solubility for military compounds, Mol. Inform. 29 (2010) 394–406. doi:10.1002/minf.201000001.

[579] I. Oprisiu, E. Varlamova, E. Muratov, A. Artemenko, G. Marcou, P. Polishchuk, et al., QSPR approach to predict nonadditive properties of mixtures. Application to bubble point temperatures of binary mixtures of liquids, Mol. Inform. 31 (2012) 491–502. doi:10.1002/minf.201200006.

[580] D. Yukihira, D. Miura, Y. Fujimura, Y. Umemura, S. Yamaguchi, S. Funatsu, et al., MALDI efficiency of metabolites quantitatively associated with their structural properties: A quantitative structure-property relationship (QSPR) approach, J. Am. Soc. Mass Spectrom. 25 (2014) 1–5. doi:10.1007/s13361-013-0772-0.

[581] F.V. Buontempo, X.Z. Wang, M. Mwense, N. Horan, A. Young, D. Osborn, Genetic programming for the induction of decision trees to model ecotoxicity data, J. Chem. Inf. Model. 45 (2005) 904–912. doi:10.1021/ci049652n.

[582] A. Corma, J.M. Serra, P. Serna, M. Moliner, Integrating high-throughput characterization into combinatorial heterogeneous catalysis: Unsupervised construction of quantitative structure/property relationship models, J. Catal. 232 (2005) 335–341. doi:10.1016/j.jcat.2005.03.019.

[583] G. Carrera, J. Aires-de-Sousa, Estimation of melting points of pyridinium bromide ionic liquids with decision trees and neural networks, Green Chem. 7 (2005) 20. doi:10.1039/b408967g.

[584] P. De Cerqueira Lima, A. Golbraikh, S. Oloff, Y. Xiao, A. Tropsha, Combinatorial QSAR modeling of P-glycoprotein substrates, J. Chem. Inf. Model. 46 (2006) 1245–1254. doi:10.1021/ci0504317.

[585] O. Ivanciuc, Machine Learning Quantitative Structure-Activity Relationships (QSAR) for Peptides Binding to the Human Amphiphysin-1 SH3 Domain, Curr. Proteomics. 6 (2009) 289–302. doi:10.2174/157016409789973725.

[586] M. Gupta, S. Gupta, H. Dureja, A.K. Madan, Superaugmented eccentric distance sum connectivity indices: Novel highly discriminating topological descriptors for QSAR/QSPR, Chem. Biol. Drug Des. 79 (2012) 38–52. doi:10.1111/j.1747-0285.2011.01264.x.

[587] M. Fernandez, T.K. Woo, C.E. Wilmer, R.Q. Snurr, Large-scale quantitative structure-property relationship (QSPR) analysis of methane storage in metal-organic frameworks, J. Phys. Chem. C. 117 (2013) 7681–7689. doi:10.1021/jp4006422.

[588] A. Tropsha, P. Gramatica, V.K. Gombar, The importance of being earnest: Validation is the absolute essential for successful application and interpretation of QSPR models, QSAR Comb. Sci. 22 (2003) 69–77. doi:10.1002/qsar.200390007.

[589] D.M. Hawkins, S.C. Basak, D. Mills, Assessing model fit by cross-validation, J. Chem. Inf. Comput. Sci. 43 (2003) 579–586. doi:10.1021/ci025626i.

[590] P.K. Ojha, I. Mitra, R.N. Das, K. Roy, Further exploring rm2 metrics for validation of QSPR models, Chemom. Intell. Lab. Syst. 107 (2011) 194–205. doi:10.1016/j.chemolab.2011.03.011.

[591] K. Baumann, Cross-validation as the objective function for variable-selection techniques, TrAC - Trends Anal. Chem. 22 (2003) 395–406. doi:10.1016/S0165-9936(03)00607-1.

[592] OECD, Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship Models, OECD Ser. Test. Assess. (2007). doi:10.1787/9789264085442-en.

[593] L. Eriksson, J. Jaworska, A.P. Worth, M.T.D. Cronin, R.M. McDowell, P. Gramatica, Methods for Reliability and Uncertainty Assessment and for Applicability Evaluations of Classification- and Regression-Based QSARs, Environ. Health Perspect. 111 (2003) 1361–1375. doi:10.1289/ehp.5758.

[594] A. Golbraikh, A. Tropsha, Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection, Mol. Divers. 5 (2000) 231–243. doi:10.1023/A:1021372108686.

[595] K. Baumann, N. Stiefl, Validation tools for variable subset regression, J. Comput. Aided. Mol. Des. 18 (2004) 549–562. doi:10.1007/s10822-004-4071-5.

[596] R.D. Clark, P.C. Fox, Statistical variation in progressive scrambling, J. Comput. Aided. Mol. Des. 18 (2004) 563–576. doi:10.1007/s10822-004-4077-z.

[597] P. Gramatica, Principles of QSAR models validation: Internal and external, QSAR Comb. Sci. 26 (2007) 694–701. doi:10.1002/qsar.200610151.

[598] R. Kiralj, M.M.C. Ferreira, Basic validation procedures for regression models in QSAR and QSPR studies: Theory and application, J. Braz. Chem. Soc. 20 (2009) 770–787. doi:10.1590/S0103-50532009000400021.

[599] J.T. Leonard, K. Roy, On Selection of Training and Test Sets for the Development of Predictive QSAR models, QSAR Comb. Sci. 25 (2006) 235–251. doi:10.1002/qsar.200510161.

[600] A. Golbraikh, M. Shen, Z. Xiao, Y. De Xiao, K.H. Lee, A. Tropsha, Rational selection of training and test sets for the development of validated QSAR models, J. Comput. Aided. Mol. Des. 17 (2003) 241–253. doi:10.1023/A:1025386326946.

[601] T.M. Martin, P. Harten, D.M. Young, E.N. Muratov, A. Golbraikh, H. Zhu, et al., Does rational selection of training and test sets improve the outcome of QSAR modeling?, J. Chem. Inf. Model. 52 (2012) 2570–2578. doi:10.1021/ci300338w.

[602] P. Gramatica, P. Pilutti, E. Papa, Validated QSAR prediction of OH tropospheric degradation of VOCs: Splitting into training-test sets and consensus modeling, J. Chem. Inf. Comput. Sci. 44 (2004) 1794–1802. doi:10.1021/ci049923u.

[603] R. Wehrens, W.E. van der Linden, Bootstrapping Principal Component Regression Models, J. Chemom. 11 (1997) 157–171. doi:10.1002/(SICI)1099-128X(199703)11:2<157::AID-CEM471>3.0.CO;2-J.

[604] R. Wehrens, H. Putter, L.M.C. Buydens, The bootstrap: A tutorial, Chemom. Intell. Lab. Syst. 54 (2000) 35–52. doi:10.1016/S0169-7439(00)00102-7.

[605] R.D. Cramer, J.D. Bunce, D.E. Patterson, I.E. Frank, Crossvalidation, Bootstrapping, and Partial Least Squares Compared with Multiple Regression in Conventional QSAR Studies, Quant. Struct. Relatsh. 7 (1988) 18–25. doi:10.1002/qsar.19880070105.

[606] S. Wold, L. Eriksson, Statistical Validation of QSAR Results, in: Chemom. Methods Mol. Des., 1995: p. 309. doi:10.1002/9783527615452.ch5.

[607] L. Eriksson, E. Johansson, N. Kettaneh-Wold, S. Wold, Multi- and Megavariate Data Analysis: Principles and Applications, Umetrics AB, Umeå, Sweden, 2001.

[608] P. Geladi, P.K. Hopke, Editorial: Is there a future for chemometrics? Are we still needed?, J. Chemom. 22 (2008) 289–290. doi:10.1002/cem.1141.

[609] R.G. Brereton, The evolution of chemometrics, Anal. Methods. 5 (2013) 3785–3789. doi:10.1039/C3AY90051G.

[610] F. Vogt, Quo vadis, chemometrics?, J. Chemom. 28 (2014) 785–788. doi:10.1002/cem.2684.

[611] T. Puzyn, B. Rasulev, A. Gajewicz, X. Hu, T.P. Dasari, A. Michalkova, et al., Using nano-QSAR to predict the cytotoxicity of metal oxide nanoparticles, Nat. Nanotechnol. 6 (2011) 175–178. doi:10.1038/nnano.2011.10.

[612] K.P. Singh, S. Gupta, Nano-QSAR modeling for predicting biological activity of diverse nanomaterials, RSC Adv. 4 (2014) 13215–13230. doi:10.1039/C4RA01274G.


Figures’ legends:

Fig. 1 Contribution of different kinds of documents recorded in Web of Science core collection on QSAR/QSPR subject (prepared on May 27, 2015). Others include note, editorial material, letter, correction and book review.

Fig. 2 Growth pattern of publications on QSAR/QSPR subject (prepared on May 27, 2015).

Fig. 3 Genetic operators: (a) one-region cross-over, (b) two-region cross-over and (c) one-point mutation.

Fig. 4 Feedforward network (a) and feedback network (b).

Fig. 5 “Dimensional superiority” in SVM.

Table 1. Decision on chance correlation based on Q2rand and R2rand

No.  Conditions                             Decision
1    Q2rand < 0.2 and R2rand < 0.2          No chance correlation
2    any Q2rand† and 0.2 < R2rand < 0.3     Negligible chance correlation
3    any Q2rand† and 0.3 < R2rand < 0.4     Tolerable chance correlation
4    any Q2rand† and R2rand > 0.4           Significant chance correlation

† Q2rand is lower than R2rand
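As a rough illustration, the decision rule of Table 1 can be written as a small function. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `chance_correlation` is hypothetical, values lying exactly on a threshold (0.2, 0.3, 0.4) are assigned to the more severe class because the table only gives strict inequalities, and the combination R2rand < 0.2 with Q2rand ≥ 0.2, which the table does not list, is reported as not covered. The footnote condition (Q2rand lower than R2rand) is taken for granted and not enforced.

```python
def chance_correlation(q2_rand: float, r2_rand: float) -> str:
    """Return the Table 1 decision for y-randomization statistics.

    q2_rand, r2_rand: Q2 and R2 obtained for models built on randomized
    responses. Boundary handling is an assumption (see lead-in).
    """
    if r2_rand < 0.2:
        if q2_rand < 0.2:
            return "No chance correlation"           # row 1
        # Assumption: Table 1 does not list this combination.
        return "Not covered by Table 1"
    if r2_rand < 0.3:
        return "Negligible chance correlation"       # row 2
    if r2_rand < 0.4:
        return "Tolerable chance correlation"        # row 3
    return "Significant chance correlation"          # row 4


if __name__ == "__main__":
    # Illustrative (made-up) statistics, one pair per row of Table 1.
    for q2, r2 in [(0.05, 0.15), (0.10, 0.25), (0.10, 0.35), (0.20, 0.55)]:
        print(f"Q2rand={q2:.2f}, R2rand={r2:.2f}: {chance_correlation(q2, r2)}")
```

Rows 2 to 4 depend only on R2rand, which is why the function checks Q2rand in the first branch alone.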

Fig. 1 Contribution of different kinds of documents recorded in Web of Science core collection on QSAR/QSPR subject (prepared on May 27, 2015)

Fig. 2 Growth pattern of publications on QSAR/QSPR subject (prepared on May 27, 2015)

[Figure 3: binary strings for panels (a) and (b), labelled Parents and Offspring, and for panel (c), labelled Before mutation and After mutation.]

Fig. 3 Genetic operators: (a) one-region cross-over, (b) two-region cross-over and (c) one-point mutation
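The operators named in Fig. 3 can be illustrated on chromosomes stored as lists of 0/1 genes (in variable selection, for example, a gene could mark a descriptor as excluded or included). The following is a minimal sketch, not the authors' code: the helper names `swap_regions` and `flip_bit` are hypothetical, and the region boundaries and mutation position are chosen here for illustration because the figure layout does not fix them. The parent strings are two of those shown in the figure.

```python
from typing import List, Sequence, Tuple

Chromosome = List[int]


def swap_regions(
    parent_a: Sequence[int],
    parent_b: Sequence[int],
    regions: Sequence[Tuple[int, int]],
) -> Tuple[Chromosome, Chromosome]:
    """Cross-over: exchange each half-open region [start, stop) between parents."""
    child_a, child_b = list(parent_a), list(parent_b)
    for start, stop in regions:
        child_a[start:stop], child_b[start:stop] = (
            child_b[start:stop],
            child_a[start:stop],
        )
    return child_a, child_b


def flip_bit(chromosome: Sequence[int], position: int) -> Chromosome:
    """One-point mutation: invert the gene at the given position."""
    mutated = list(chromosome)
    mutated[position] = 1 - mutated[position]
    return mutated


if __name__ == "__main__":
    p1 = [0, 1, 0, 1, 1, 0, 0]
    p2 = [0, 0, 0, 1, 0, 0, 1]
    # One-region cross-over: swap genes 4..6 (assumed region).
    print(swap_regions(p1, p2, [(4, 7)]))
    # Two-region cross-over: swap gene 1 and genes 4..6 (assumed regions).
    print(swap_regions(p1, p2, [(1, 2), (4, 7)]))
    # One-point mutation at gene 1 (assumed position).
    print(flip_bit(p1, 1))
```

Passing the regions as a list lets the same function cover both the one-region and the two-region case.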

[Figure 4: network diagrams for panels (a) and (b). Input layer: molecular descriptors x1, x2, x3, ..., xn; hidden layer; output layer: y (biological activity/chemical property).]

Fig. 4 Feedforward network (a) and feedback network (b)

[Figure 5: mapping Φ from the input space (axes x1, x2) to the feature space (axes x1, x2, x3).]

Fig. 5 “Dimensional superiority” in SVM.