Chemosphere 90 (2013) 300–305
Contents lists available at SciVerse ScienceDirect
Chemosphere journal homepage: www.elsevier.com/locate/chemosphere
Chemometric model for predicting retention indices of constituents of essential oils Li-Tang Qin a,b, Shu-Shen Liu b,⇑, Fu Chen b, Qian-Fen Xiao b, Qing-Sheng Wu a,⇑ a b
Department of Chemistry, Tongji University, Shanghai 200092, PR China Key Laboratory of Yangtze River Water Environment, Ministry of Education, College of Environmental Science and Engineering, Tongji University, Shanghai 200092, PR China
h i g h l i g h t s " QSRR model was built for predicting retention indices of essential oils. " The model present high internal and external prediction power. " The study meets five principles recommended by OECD for validation of QSRR model. " The model can effectively predict retention indices of the essential oils.
a r t i c l e
i n f o
Article history: Received 28 February 2012 Received in revised form 22 June 2012 Accepted 10 July 2012 Available online 4 August 2012 Keywords: Retention index QSRR Essential oils Multiple linear regression OECD principles
a b s t r a c t Quantitative structure–retention relationships (QSRRs) model was developed for predicting the gas chromatography retention indices of 169 constituents of essential oils. The ordered predictors selection algorithm was used to select three descriptors (one constitutional index and two edge adjacency indices) from 4885 descriptors. The final QSRR model (model M3) with three descriptors was internal and external validated. The leave-one-out cross-validation, leave-many-out cross-validation, bootstrapping, and yrandomization test indicated the final model is robust and have no chance correlation. The external validations indicated that the model M3 showed a good predictive power. The mechanistic interpretation of QSRR model was carried out according to the definition of descriptors. The results show that the larger molecular weight, the greater the values of retention indices. More compact structures have stronger intermolecular interactions between the components of essential oils and the capillary column. Therefore, the result meets the five principles recommended by the Organization for Economic Co-operation and Development (OECD) for validation of QSRR model, and it is expected the model can effectively predict retention indices of the essential oils. Ó 2012 Published by Elsevier Ltd.
1. Introduction Essential oils extracted from different plant have been widely used in the food industry, medicine, and fragrance industry (Noorizadeh et al., 2011). However, essential oils may exert toxic effects to humans carcinogenicity, reproductive and developmental toxicity, neurotoxicity, and acute toxicity. The constituents present in essential oils have the potential to create serious and even fatal toxic effects if ingested in overly large quantities, or used incorrectly (Riahi et al., 2009). The constituents of essential oils are important due to their qualitative and quantitative compositions that determine the characteristics of the oils (Fisher and Phillips, 2008). Qualitative and quantitative analyses of the constituents of essential oils can be carried out by means of gas chromatography (GC) and gas chroma⇑ Corresponding authors. Tel.: +86 021 65982767; fax: +86 021 65980855 (S.-S. Liu). E-mail addresses:
[email protected] (S.-S. Liu),
[email protected] (Q.-S. Wu). 0045-6535/$ - see front matter Ó 2012 Published by Elsevier Ltd. http://dx.doi.org/10.1016/j.chemosphere.2012.07.010
tography coupled to mass spectrometry (GC–MS). In the determination of composition by GC and GC–MS, a simple identification method is the comparison of the recorded spectra with those available in a standard mass spectrometry library. Following this procedure, one can carry out the identification of the constituents of essential oils. The identification, however, may not be precise enough and sometimes maybe even wrong in the cases of isomers or similar substances in GC–MS (Jalali-Heravi and EbrahimiNajafabadi, 2011). Quantitative structure–retention relationships (QSRRs) study has been extensively used for predicting the GC retention behavior due to the time-consuming. QSRR based on the experimental value of retention index provides a promising faster way for predicting retention index of essential oils using the descriptors derived solely from the molecular structure. The advantage of QSRR study over experimental methods lies in the fact that it requires only the knowledge of chemical structure. A QSRR can provide the most informative structural descriptors related to the retention index and the possible mechanism of absorption and elution (Azar
L.-T. Qin et al. / Chemosphere 90 (2013) 300–305
et al., 2011). In addition, a QSRR model can apply to predict the retention index of a new compound that belongs to the application domain of the model. Recently, several works reported QSRR studies on the retention indices of essential oils have been published. Multiple linear regression (MLR) and partial least squares (PLSs) models were built for the retention indices of 80 essential oils using the genetic algorithm (GA) to select the variables (Riahi et al., 2009). A nonlinear model based on the support vector machine was also developed for the 80 essential oils. Both linear and nonlinear models were used to predict 20 constituents in the test set. Liao et al. (2010) reported a MLR model for predicting retention indices of 106 oxygen-containing organic compounds using hydrogen-association classified molecular electronegativity-distance vector (H-MEDV) descriptors. Noorizadeh and Farmany (2010), Noorizadeh et al. (2010, 2011) built several QSRR models (GA-MLR, GA-PLS, kernel PLS, and Levenberg-Marquardt artificial neural network (L-M ANN)) for retention indices of essential oils. A GA-MLR model for the prediction of the retention indices of 32 compounds was investigated by Azar et al. (2011). The main aim of this study is to develop QSRR model for the prediction of retention indices of 169 components of essential oils. The QSRR model based on MLR method describes the relationship between the retention indices and structure descriptors obtained from Dragon 6 software. Variable selections were performed by ordered predictors selection (OPS) algorithm (Teófilo et al., 2009). The QSRR model presents high statistical quality, good predictive power, and a reasonable mechanistic interpretation related to the retention index. The reliability of the model was reviewed against the Organization for Economic Co-operation and Development (OECD) principles (OECD, 2007): (1) a defined endpoint; (2) an unambiguous algorithm; (3) a defined domain of applicability; (4) appropriate measures of goodness-of-fit, robustness and predictivity; (5) a mechanistic interpretation, if possible.
2. Materials and methods 2.1. Data set The retention indices (RI) of 169 constituents of essential oils (see Supplementary material, Table S1) were taken from Conforti et al. (2009), who studied by GC and GC–MS. QSRR models for the 116 compounds, which measured on the Innowax column (60 m 0.25 mm i.d.; 0.33 lm film thickness) in this reference (Conforti et al., 2009), had been studied by Noorizadeh et al. (2010, 2011). Analytical GC procedure for 169 compounds used in the present study was carried out on an HP-5MS capillary column (30 m 0.25 mm i.d.; 0.25 lm film thickness). GC–MS analysis was performed on an Agilent 6850 series II apparatus, fitted with a fused silica HP-1 capillary column (30 m 0.25 mm i.d.; 0.33 lm film thickness). The use of retention index is a well-defined endpoint, and it ensures the study meets the OECD principle #1 (OECD, 2007).
2.2. Geometry optimization and obtention of descriptors All structures were generated by MarvinSketch 5.9 from ChemAxon (http://www.chemaxon.com/). Optimization of molecular structures was carried out by the semi-empirical AM1 method (MOPAC program, http://openmopac.net/) using the EF optimizer until the root mean square gradient of 0.001 was obtained. The 4885 molecular descriptors of each 169 compounds were calculated by the Dragon software (version 6.0, Talete SRL, Milano, Italy).
301
2.3. Variable selection Variable selection from the original 4885 descriptors employed a stepwise procedure. (1) Excluded the descriptors with constant and near-constant values. (2) Excluded the descriptors with standard deviation less than 0.0001. (3) Excluded the descriptors with (abs) pair correlation larger than or equal to 0.90. (4) Eliminated the descriptors that present the absolute value of the Pearson correlation coefficient (|r|) with retention indices lower than 0.3. (5) The OPS algorithm (Teófilo et al., 2009) was used to select the important descriptors. The variable selection of OPS algorithm was developed by Teófilo et al. (2009). In this study, variable influence on projection (VIP) and correlation vector (Corr) were used as the informative vectors. We sorted the variables from an informative vector of VIP/Corr in descending order of magnitude, and then multivariate PLS regression models were built and evaluated using a leaveone-out (LOO) cross-validation strategy. The successive PLS regressions are carried out with increasing number of descriptors in order to find the best model. The best quality parameters contain variables with the best prediction capability, and so these are selected variables (Teófilo et al., 2009). 2.4. Training set and test set The Kennard–Stone design (Kennard and Stone, 1969; Wu et al., 1996) technique is one of the effective methods that it ensures both the training and test sets separately uniformly spaced over the whole descriptor space. The descriptors selected by OPS algorithm for the complete data set were used in the Kennard–Stone design. The whole data set was split into a training set with 85 samples and into a test set with 84 samples (see Supplementary material, Table S1). 2.5. Model building and validation The MLR method was performed to derive the best relationships between the retention indices and the selected descriptors. MLR is widely used in model building. Thus, the study meets the OECD principle #2 (OECD, 2007). The QSRR model was internally and externally validated in order to meet the OECD principle #4 (OECD, 2007). The LOO and leave-many-out (LMO) cross-validation tests were applied to the internal validation of the training set. The cross-validation correlation coefficient Q2 (Eq. (1)) was used for quantitative assessment of the model robustness.
PnTR ^i yi Þ2 ðy Q 2 ¼ 1 Pi¼1 nTR 2 i¼1 ðyi yÞ
ð1Þ
^, and y are the experimental, predicted, and average valwhere yi, y ues of the retention indices, respectively; nTR is the number of the samples in the training set. The LOO procedure has been widely used in the QSAR/QSRR study. The LMO cross-validation employs a larger test set (more than one sample) than LOO cross-validation, and can be repeated more times. In this study M = 2–28 was used. If a QSRR model has a high Q 2LOO (>0.5) in LOO validation or high average Q 2LMO (>0.5) in LMO validation, we can reasonably conclude that the obtained model is robust.
302
L.-T. Qin et al. / Chemosphere 90 (2013) 300–305
Moreover, the model was evaluated using the y-randomization test to check for the possibility of chance correlation. In this test, the dependent-variable vector (retention index) is randomly shuffled, and a new QSRR model is developed using the original independent-variable matrix (Tropsha et al., 2003). For this, the retention index of the data set was scrambled 50 iterations. It is expect that the LOO statistic of the randomized models (R2yrand and Q 2yrand ) should be poor. This approach is based on the absolute value of the Pearson correlation coefficient (|r|) between the original vector y and randomized vectors y. Two regression lines (r R2yrand and r Q 2yrand ) are drawn for randomized models. The intercepts of the the equations obtained in two regression lines should be lower than 0.3 for R2yrand and 0.05 for Q 2yrand (Eriksson et al., 2003). The statistical parameter c R2p was also calculated for the y-randomization test (Mitra et al., 2010). In an ideal case, the average value of c R2p should be equal to the value of R2 for the real model. Bootstrapping (Wehrens et al., 2000; Kiralj and Ferreira, 2009) is another approach to validation in which the complete data set is randomly split several times (50 times in this study) into training and test sets. The basic LOO statistics (R2bstr and Q 2bstr ) are calculated for the training sets. The parameters Q 2F1 (Q 2ext ) (Tropsha et al., 2003; Chirico and Gramatica, 2011), Q 2F2 (Schüürmann et al., 2008; Chirico and Gramatica, 2011), Q 2F3 (Consonni et al., 2009, 2010; Chirico and Gramatica, 2011), and r2m (Roy et al., 2009, 2011, 2012), which are shown in Eqs. (2)–(5), were used as the measures of the predictive power of QSRR model. The recommended limits of Q2 (Q 2F1 ; Q 2F2 , and Q 2F3 ) and r 2m are larger than 0.5.
PnTest ^i yi Þ2 ðy Q 2F1 ¼ 1 Pni¼1 Test 2 i¼1 ðyi yTR Þ
ð2Þ
PnTest ^i yi Þ2 ðy Q 2F2 ¼ 1 Pn i¼1 2 Test i¼1 ðyi yTest Þ
ð3Þ
PnTest ^i yi Þ2 =nTest ðy Q 2F3 ¼ 1 Pi¼1 nTR 2 i¼1 ðyi yTR Þ =nTR
ð4Þ
qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi r 2m ¼ r 2 1 r 2 r 20
ð5Þ
^i is calculated/predicted value, where yi is the experimental value, y TR and y Test are average values of training set and test set, respecy tively, nTR and nTest are the number of compounds in training set and test set, respectively, and r2 (r 20 ) is R2 (R20 ) in the Tropsha and Golbraikh method (Golbraikh and Tropsha, 2002). However, Q2 (Q 2F1 , Q 2F2 , and Q 2F3 ) and r2m may not sufficient condition to ensure that the model is really predictive. Golbraikh and Tropsha (2002) recommended that the statistics parameters for the test set should satisfy all of the following criteria: Q2 > 0.5; R2 > 0.6; 2 ðR2 R20 Þ=R2 < 0:1 or ðR2 R02 0.85 6 k 6 1.15 or 0 Þ=R < 0:1; 0.85 6 k’ 6 1.15, where R is correlation coefficient, R20 is coefficient of determination between predicted and observed activities, R02 0 is coefficient of determination between observed and predicted activities, slopes k and k0 are the regression lines through the origin. After model generation and validation, the selected descriptors are interpreted that used in the QSRR model. This interpretation ensures that the study meets the OECD principle #5 (OECD, 2007).
where xi is the descriptor row vector of the query compound; X is the n k matrix of k model descriptor values for n training set compounds. The superscript ‘‘T’’ refers to the transpose of the matrix/ vector. The control leverage h is fixed at (3k)/n, where k is the number of model parameters, and n is the number of the objects used to calculate the model. 3. Results and discussion 3.1. QSRR model The OPS algorithm was used to select the optimum descriptors for the complete data set (169 samples). Three descriptors were selected: MW (molecular weight), Eig09_AEA(bo) (eigenvalue n. 9 from augmented edge adjacency matrix weighted by bond order), and SpMaxA_EA (normalized leading eigenvalue from edge adjacency matrix). Descriptor-y scatterplots for the complete data set (see Supplementary material, Fig. S1) indicate that three descriptors show well linear relations with RI. The MLR model (called model M1) for the complete data set (Table 1) was obtained based on the selected descriptors. Using the three descriptors (MW, Eig09_AEA(bo), and SpMaxA_EA), the complete data set was split into a training set (85 samples) and into a test set (84 samples) based on Kennard-Stone design (Kennard and Stone, 1969; Wu et al., 1996). The model M2 (Table 1) was built for the training set, and was used to predict the test set. The R2 and Q 2LOO values of the model M2 (R2 = 0.901 and Q 2LOO ¼ 0:891) are slightly greater than the model M1 (R2 = 0.883 and Q 2LOO ¼ 0:876). Leave-one-out (LOO) standardized residuals analysis was checked for the models M1 and M2 (Fig. 1). The sample 74 (Fig. 1B) was identified as outlier due to the high values of standardized residuals and leverage. Despite the leverages of samples 1 and 152 (Fig. 1B) are lower than the warning leverage (0.106), its standardized residuals are relative high in both models M1 (Fig. 1A) and M2 (Fig. 1B). The Pearson correlation coefficients between MW and RI are 0.924 for the complete data set and 0.930 for the training set. The well linear relationship indicates that the higher value of MW, the higher RI value for a constituent of essential oils. Compared to the sample 2 with MW value of 96.14 and observed RI values of 906, the sample 1 has high MW value of 186.33 but only has observed RI of 901. The sample 152 has MW value of 136.26 but has high observed RI value of 1608. Therefore, the samples 1 and 74 were removed from the training set, and the sample 152 was removed from the test set. The final model (called model M3) (Table 1) was obtained based on the training set of 83 samples. The model M3 presents the R2 of 0.936 and RMSEC of 89. The sign change problem was checked according to the study of Kiralj and Ferreira (2010). The Pearson correlation coefficient for the complete (rc), training (rt), and test (re) sets, the normalized regression coefficient of three descriptors for the complete (bc) and the training (bt) sets, and F-functions (F1, F2, F3, and F4) were list in Table S2 (Supplementary material). The three descriptors in the models M2 and M3 satisfy the conditions:
sgnðr c Þ ¼ sgnðr t Þ ¼ sgnðr e Þ; sgnðr c Þ ¼ sgnðbc Þ ¼ sgnðbt Þ; jr c j > 0:3; jr t j > 0:3; jre j > 0:3;
2.6. Application domain
jbc j > 0:001; jbc j > 0:001:
The application domain of the QSRR model was defined by leverage (Golbraikh and Tropsha, 2002), hi (Eq. (6)). It ensures that the study meets the OECD principle #3 (OECD, 2007):
3.2. Internal validation
hi ¼ xTi ðX T XÞ1 xi
ði ¼ 1; . . . ; nÞ
ð6Þ
The LOO cross-validation (Q 2LOO ¼ 0:931, and RMSECVLOO = 92) was carried out for the internal validation of the model M3. The
303
L.-T. Qin et al. / Chemosphere 90 (2013) 300–305 Table 1 Results of internal and external validations of the QSRR models. Model
Equation
Modelinga
Internal validationb Q 2LOO
External validationc
M1
RI = 1004.831 (±123.009) + 3.666 (±0.425) MW + 109.955 (±31.727) Eig09_AEA(bo)-1290.247 (±276.961) SpMaxA_EA
n = 169; F = 413.414;
M2
RI = 992.710 (±161.552) + 3.329 (±0.557) MW + 174.266 (±40.153) Eig09_AEA(bo)-956.100 (±371.43) SpMaxA_EA
n = 85; F = 246.756;
Q 2LOO ¼ 0:891; RMSECVLOO = 118;
ntest = 84; RMSEP = 114;
R2 = 0.901; RMSEC = 112
Q 2LMO ¼ 0:890; RMSECVLMO = 118;
Q 2F1 (Q 2ext ) = 0.842; Q 2F2 = 0.842;
R2yrand ¼ 0:158; Q 3yrand ¼ 0:070; R2bstr ¼ 0:888; Q 2bstr = 0.876
Q 2F3 ¼ 0:887; r 2m ¼ 0:703; r 02 m ¼ 0:842;
¼ 0:876; RMSECVLOO = 114
R2 = 0.883; RMSEC = 111
Dr2m ¼ 0:139; r2m ¼ 0:773; k = 0.987; k= 1.008; R2o ¼ 0:818; R2o = 0.846; R2 = 0.846
M3
RI = 919.237 (±131.458) + 3.865 (±0.463) MW+(116.251 ± 33.969) Eig09_AEA(bo)-1033.047 (±295.664) SpMaxA_EA
n = 83; F = 387.877;
Q 2LOO ¼ 0:931; RMSECVLOO = 92;
ntest = 83; RMSEP = 94;
R2 = 0.936; RMSEC = 89
Q 2LMO ¼ 0:930; RMSECVLMO = 93;
Q 2F1 (Q 2ext ) = 0.892; Q 2F2 = 0.892;
R2yrand = 0.130; Q 3yrand ¼ 0:046; R2bstr ¼ 0:921 Q 2bstr ; = 0.913
Q 2F3 = 0.923; r 2m = 0.760; r 02 m = 0.859;
Dr2m ¼ 0:099; r2m ¼ 0:810; k = 0.984; k= 1.012; R2o ¼ 0:877; R2o = 0.899; R2 = 0.901
a n is the number of samples included in the model; F is the Fisher’s statistic; R2 is correlation coefficient of multiple determination; RMSEC is the root mean square error of calibration. b Q 2LOO is the LOO cross-validated correlation coefficient; RMSECVLOO is the root mean square error of LOO cross-validation; Q 2LMO is the average Q2 value of the LMO crossvalidation; RMSECVLMO is the average RMSECV value of LMO cross-validation; R2yrand and Q 2yrand are the maximum R2 and Q2 values of the y-randomization tests (50 runs); R2bstr and Q 2bstr are the average R2 and Q2 values of the bootstrapping (50 runs). c 2 ntest is the number of samples included in the test set; RMSEP is the root mean square error of prediction; r 2m (r 02 m and Dr m ) is the parameter proposed by Roy et al. (2009); 2 02 2 r 2m is the average value of r 2m and r 02 m ; k, k, Ro , Ro , and R are the parameters recommended by Golbraikh and Tropsha (2002).
(B) 74
2 0 -2 -4
1
1000
1500
2000
2500
(C) 152
4
74 Outlier
2 Training set Test set
0 -2 -4
1
0.05
Observed RI
0.10
0.15
Standardized residuals
152
4
Standardized residuals
Standardized residuals
(A)
4
Training set Test set
2 0 -2 -4
0.20
Leverage
0.05
0.10
0.15
Leverage
Fig. 1. LOO standardized residuals analysis for the models M1 (A), M2 (B), and M3 (C).
difference between the R2 and Q 2LOO values was 0.005 units, and it ensure that the model M3 does not present data overfitting (Kiralj and Ferreira, 2009). In the LMO cross-validation (Fig. 2A), two to 28 samples (M = 2– 28) in excluded blocks were used as the internal test set. The data matrix (y and x values) was randomized ten times, and LMO crossvalidation was repeated ten times for each LMO cross-validation (see Supplementary material, Table S3). The average Q 2LMO value (0.930) of all LMO cross-validations is nearly the same as the Q 2LOO (0.931). Bootstrapping (50 iterations) was carried out by random select 83 samples from the whole data set of 166 samples (not include samples 1, 74, and 152) (Fig. 2B). The average R2bstr of 0.921 and Q 2bstr of 0.913 (see Supplementary material, Table S4) demonstrate the final model is robust. Further test for the stability of the model was performed using y-randomization analysis (see Supplementary material, Table S5). The result (R2yrand and Q 2yrand ) obtained from the y-randomization test was displayed in Fig. 3. The intercepts are within the acceptable values of 0.3 for R2 (Fig. 3B) and 0.05 for Q2 (Fig. 3C) (Eriksson
et al., 2003). The low maximum value (R2yrand ¼ 0:130 and Q 3yrand ¼ 0:046) of the y-randomization tests, the high average value of c R2p (0.917), and high R2 of the real model (model M3) indicate that the variance of the model has no chance correlation. 3.3. External validation The model M3 was tested by using the test set with 83 samples in order to assess the actual predictive power of the QSRR model. The model M3 was employed to evaluate the retention indices of the training set and predict the test set. The results of the calculated and observed retention indices for the training set and test set were summarized in Table S1 (Supplementary material) and plotted in Fig. 4. The parameters Q 2F1 (Q 2ext ), Q 2F2 , Q 2F3 , (Consonni et al., 2010; Li and Gramatica, 2010; Chirico and Gramatica, 2011) and r 2m (Roy et al., 2009, 2011, 2012) were used to evaluate the external predictive power. The differences between R2 and Q2 (Q 2F1 , Q 2F2 , and Q 2F3 ) are lower than 0.2. This indicates the model is not suffered from
304
L.-T. Qin et al. / Chemosphere 90 (2013) 300–305
(A) 0.96
internal and external validation show the model meets the OECD principle #4.
0.95
3.4. Application domain The application domain of the model M3 was defined by leverage. The control leverage h is fixed at 0.108. Fig. 1C displays the calculated values for the training set, and the predicted values of the test set put them within the application domain. If a sample has a high absolute LOO standardized residuals (>2.5) and large leverage (>0.108), it is considered to be an outlier. As can be seen from Fig. 1C, the final model (model M3) has no outlier. The well-defined applicability domain of the model meets the OECD principle #3.
0.93
2
Q LMO
0.94
0.92 0.91 0.90 5
10
15
20
25
LMO cross-validation (M=2 - 28) 3.5. Mechanistic interpretation 0.95 0.94
0.92
2
Q bstr
0.93
0.91 0.90 0.89 0.88 0.88
0.89
0.90
0.91
0.92
0.93
0.94
0.95
2
R bstr Fig. 2. Plots of LMO (M = 2–28) cross-validation (A) and of the bootstrapping (B) for the final model (model M3). In the plot (A), each point (d) is the average value from a test in ten iterations. In the plot (B), symbol (j) refers to the model M3 and symbol (s: 50 iterations) is the bootstrapping.
overfitting (Kiralj and Ferreira, 2009). The r 2m metrics equal to 0.810. The difference (Dr2m ) between the two parameters is 0.099. The values of Q 2F1 , Q 2F2 , Q 2F3 , and r 2m are higher than the recommended limits of 0.5, the Dr2m is also lower than the cutoff value 0.2 (Roy et al., 2009, 2011). Thus, the results of the external validation indicate the high predictive power of the model. In addition, the results corresponding to the criteria proposed by Golbraikh and Tropsha (2002) for the test set are within acceptable ranges: 0.85 6 k 6 1.15 or 0.85 6 k’ 6 1.15, ðR2 R20 Þ=R2 < 0:1 2 or ðR2 R02 0 Þ=R < 0:1. These good results confirm that the model has a high predictive power for an external sample. The internal validations were carried out using LOO and LMO cross-validation and y-randomization. The test set was used to test the external validation of the model. Therefore, the results of the
(B)
1.0
(C)
1.0
1.0
0.8
0.8
0.6
0.6
0.6
R2yrand
0.8
0.4
2
Q yrand
(A)
Mechanistic interpretations of QSRR begin with the number and the nature of the molecular descriptors used in the model. We tried to interpret the mechanism based on the three descriptors used in the model M3. This ensures the study meets the OECD principle #5. Three selected descriptors are constitutional index (MW) and edge adjacency indices (Eig09_AEA(bo) and SpMaxA_EA). The methods for the calculations of these descriptors and their meaning are explained in the Dragon 6 user’s manual (version 6.0, Talete SRL, Milano, Italy). The MW is a constitutional index descriptor. Constitutional descriptor is the simplest and commonly used descriptor, reflecting the chemical composition of a compound without any information about its molecular geometry or atom connectivity. The Pearson correlation coefficient between the retention indices of whole data set (166 compounds) and MW is 0.947, while the relationship between the retention indices of the training set (83 samples) and MW is 0.956. The normalized regression coefficient of MW in model M3 equal to 0.606, which is larger than the normalized regression coefficients of Eig09_AEA(bo) (0.214) and SpMaxA_EA (0.188). Thus, MW is the main factor influencing the retention behavior of essential oils. This descriptor has a positive coefficient in the final model: the greater the MW values, the greater the values of retention indices. Eig09_AEA(bo) and SpMaxA_EA belong to the edge adjacency index. The edge adjacency index is a topological index calculated from the edge adjacency matrix of a molecule. The edge adjacency matrix is derived from the H-depleted molecular graph and encodes information about connectivity between graph edges (Todeschini and Consonni, 2009). In the final model (model M3), Eig09_AEA(bo) has a positive coefficient, while SpMaxA_EA has a negative coefficient. Eig09_AEA(bo) is the eigenvalue n. 9 from augmented edge adjacency matrix weighted by bond order. SpMaxA_EA is normalized leading eigenvalue from edge adjacency
0.2 0.0 -0.2
0.4 2
0.2
R yrand = 0.022 + 0.177 r(yrand,y)
0.2
0.4 2
0.6
R yrand
0.8
1.0
-0.2 0.0
0.4 0.2
Q 2yrand= -0.079 + 0.210r(yrand,y)
0.0
0.0 0.0
Q2 yrand
(B)
0.2
0.4
0.6
r(yrand,y)
0.8
1.0
-0.2 0.0
0.2
0.4
0.6
0.8
1.0
r(yrand,y)
Fig. 3. Plot of y-randomization test: r (yrand, y) refers to the absolute values of the Pearson correlation coefficient between the original and randomized vector y; symbol (j) refers to the model M3.
L.-T. Qin et al. / Chemosphere 90 (2013) 300–305
305
References
2400 Training set Test set
Calculatrd RI
2000
1600
1200
800 800
1200
1600
2000
2400
Observed RI Fig. 4. Plot of observed versus calculated RI resulted from the model M3.
matrix. In the model M3, SpMaxA_EA is the only descriptor contributing negatively to the retention index. It is possible to observe that the relationship between retention index and SpMaxA_EA is linear: the Pearson correlation coefficients between retention indices and the SpMaxA_EA are 0.873 for the whole data set (166 samples) and 0.870 for the training set (83 samples). SpMaxA_EA values decrease with increasing the number of bonds between non-hydrogen atoms pairs. As the numbers of bonds are related to the number of molecular graphs and non-hydrogen atoms, this may mean more compact structures and larger molecular weight of essential oils have smaller SpMaxA_EA values. The van der Waals forces of intermolecular interactions between the components of essential oils and HP-5MS capillary column are becoming more significant with increasing non-hydrogen atoms pairs. Thus, the smaller the SpMaxA_EA value, the greater the tendency of the essential oils retained in the HP-5MS capillary column. 4. Conclusion The final QSRR model based on the retention indices of 83 constituents of essential oils were developed, and the model was used to predict the retention indices of 83 compounds in the test set. Three descriptors (MW, Eig09_AEA(bo), and SpMaxA_EA) included in the final model were selected from the matrix of 4885 descriptors using the OPS algorithm in the variable selection. Smaller SpMaxA_EA values indicate the stronger intermolecular interactions (van der Waals forces) with respect to retention of essential oils. The QSRR model (model M3) presented excellent internal and external prediction power. The results of LOO and LMO cross-validation and bootstrapping show the model is robust. The performance in y-randomization test demonstrates the model presents no chance correlation. The external prediction powers were evaluated by the parameters Q 2F1 (Q 2ext ), Q 2F2 , Q 2F3 , and r 2m as well as the criteria proposed by Golbraikh and Tropsha, and the good results show the acceptable of the model M3. The model M3 meets the five principles recommended by the OECD for use of QSRR in regulatory applications. Thus, it is expected the model could be employed to predict the retention indices of new compounds whose experimental values are unknown. Acknowledgment This work was financed by the National Natural Science Foundation of China (Nos. 20977065, 51072134, and 21177097). Appendix A. Supplementary material Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.chemosphere. 2012.07.010.
Azar, P.A., Nekoei, M., Riahi, S., Ganjali, M.R., Zare, K., 2011. A quantitative structure– retention relationship for the prediction of retention indices of the essential oils of Ammoides atlantica. J. Serb. Chem. Soc. 76, 891–902. Chirico, N., Gramatica, P., 2011. Real external predictivity of QSAR models: How to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient. J. Chem. Inf. Model. 51, 2320– 2335. Conforti, F., Menichini, F., Formisano, C., Rigano, D., Senatore, F., Arnold, N.A., Piozzi, F., 2009. Comparative chemical composition, free radical-scavenging and cytotoxic properties of essential oils of six Stachys species from different regions of the Mediterranean area. Food Chem. 116, 898–905. Consonni, V., Ballabio, D., Todeschini, R., 2009. Comments on the definition of the Q2 parameter for QSAR validation. J. Chem. Inf. Model. 49, 1669–1678. Consonni, V., Ballabio, D., Todeschini, R., 2010. Evaluation of model predictive ability by external validation techniques. J. Chemom. 24, 194–201. Eriksson, L., Jaworska, J., Worth, A.P., Cronin, M.T.D., McDowell, R.M., Gramatica, P., 2003. Methods for reliability and uncertainty assessment and for applicability evaluations of classification- and regression-based QSARs. Environ. Health Perspect. 111, 1361–1375. Fisher, K., Phillips, C., 2008. Potential antimicrobial uses of essential oils in food: is citrus the answer? Trends Food Sci Technol 19, 156–164. Golbraikh, A., Tropsha, A., 2002. Beware of q2! J. Mol. Graph. Model. 20, 269– 276. Jalali-Heravi, M., Ebrahimi-Najafabadi, H., 2011. Modeling of retention behaviors of most frequent components of essential oils in polar and non-polar stationary phases. J. Sep. Sci. 34, 1538–1546. Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics 11, 137–148. Kiralj, R., Ferreira, M.M.C., 2009. Basic validation procedures for regression models in QSAR and QSPR studies: Theory and application. J. Braz. Chem. Soc. 20, 770– 787. Kiralj, R., Ferreira, M.M.C., 2010. Is your QSAR/QSPR descriptor real or trash? J. Chemom. 24, 681–693. Li, J.Z., Gramatica, P., 2010. The importance of molecular structures, endpoints’ values, and predictivity parameters in QSAR research: QSAR analysis of a series of estrogen receptor binders. Mol. Diversity 14, 687–696. Liao, L.M., Qing, D.H., Li, J.F., Lei, G.D., 2010. Structural characterization and Kovats retention index prediction for oxygen-containing organic compounds. J. Mol. Struct. 975, 389–396. Mitra, I., Saha, A., Roy, K., 2010. Exploring quantitative structure–activity relationship studies of antioxidant phenolic compounds obtained from traditional Chinese medicinal plants. Mol. Simulat. 36, 1067–1079. Noorizadeh, H., Farmany, A., 2010. QSRR models to predict retention indices of cyclic compounds of essential oils. Chromatographia 72, 563–569. Noorizadeh, H., Farmany, A., Khosravi, A., 2010. Investigation of retention behaviors of essential oils by using QSRR. J. Chin. Chem. Soc. 57, 982–991. Noorizadeh, H., Farmany, A., Noorizadeh, M., 2011. Quantitative structure–retention relationships analysis of retention index of essential oils. Quim. Nova 34, 242– 249. OECD, 2007. Guidance document on the validation of (quantitative) structure– activity relationship [(Q)SAR] models. Paris. Riahi, S., Pourbasheer, E., Ganjali, M.R., Norouzi, P., 2009. Investigation of different linear and nonlinear chemometric methods for modeling of retention index of essential oil components: concerns to support vector machine. J. Hazard. Mater. 166, 853–859. Roy, K., Ojha, P.K., Mitra, I., Das, R.N., 2011. Further exploring r 2m metrics for validation of QSPR models. Chemom. Intell. Lab. 107, 194–205. Roy, P.P., Paul, S., Mitra, I., Roy, K., 2009. On two novel parameters for validation of predictive QSAR models. Molecules 14, 1660–1701. Roy, K., Mitra, I., Kar, S., Ojha, P.K., Das, R.N., Kabir, H., 2012. Comparative studies on some metrics for external validation of QSPR models. J. Chem. Inf. Model. 52, 396–408. Schüürmann, G., Ebert, R.U., Chen, J.W., Wang, B., Kuhne, R., 2008. External validation and prediction employing the predictive squared correlation coefficient - test set activity mean vs training set activity mean. J. Chem. Inf. Model. 48, 2140–2145. Teófilo, R.F., Martins, J.P.A., Ferreira, M.M.C., 2009. Sorting variables by using informative vectors as a strategy for feature selection in multivariate regression. J. Chemom. 23, 32–48. Todeschini, R., Consonni, V., 2009. Molecular Descriptors for Chemoinformatics. Wiley-VCH, New York. Tropsha, A., Gramatica, P., Gombar, V.K., 2003. The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb. Sci. 22, 69–77. Wehrens, R., Putter, H., Buydens, L.M.C., 2000. The bootstrap: a tutorial. Chemom. Intell. Lab. 54, 35–52. Wu, W., Walczak, B., Massart, D.L., Heuerding, S., Erni, F., Last, I.R., Prebble, K.A., 1996. Artificial neural networks in classification of NIR spectral data: design of the training set. Chemom. Intell. Lab. 33, 35–46.