Automatic outlier sample detection based on regression analysis and repeated ensemble learning


Hiromasa Kaneko
Department of Applied Chemistry, School of Science and Technology, Meiji University, 1-1-1 Higashi-Mita, Tama-ku, Kawasaki, Kanagawa 214-8571, Japan
E-mail address: [email protected]

Chemometrics and Intelligent Laboratory Systems 177 (2018) 74–82. https://doi.org/10.1016/j.chemolab.2018.04.015
Received 9 September 2017; Received in revised form 28 February 2018; Accepted 17 April 2018; Available online 19 April 2018.

Keywords: Regression; Outlier samples; Ensemble learning; Predictive performance; QSPR; QSAR

Abstract

The fields of chemoinformatics and chemometrics require regression models with high prediction performance. To construct predictive regression models by appropriately detecting outlier samples, a new outlier detection and regression method based on ensemble learning is proposed. Multiple regression models are constructed and y-values are estimated based on ensemble learning. Outlier samples are then detected by comprehensively considering all regression models. Furthermore, it is possible to detect outlier samples robustly and independently by repeated calculations. By analyzing a numerical simulation dataset, two quantitative structure-activity relationship datasets, and two quantitative structure-property relationship datasets, it is confirmed that automatic outlier sample detection can be achieved, informative compounds can be selected, and the estimation performance of regression models is improved.

1. Introduction

In the fields of chemoinformatics and chemometrics, regression models between molecular descriptors quantifying chemical structures can be constructed with respect to activities, giving quantitative structure-activity relationships (QSARs) [1], and with respect to properties, giving quantitative structure-property relationships (QSPRs) [2]. Using regression models, it is possible to estimate the values of activities and properties of compounds for which these quantities have not been measured [3]. Because the activities and properties can be estimated from information on chemical structures alone, virtual screening of chemical structures [4] and chemical structure generation with the desired activities and properties [5] are also possible. Ideally, the constructed regression models should have high prediction performance. Using predictive regression models, it is possible to estimate the values of activities and properties with low errors. The workflow of constructing QSAR models with high prediction performance has been described in a previous report [6]. Any deterioration in the prediction performance of regression models may be due to underfitting, overfitting [7], the existence of unnecessary variables [8], or the existence of outlier samples [9]. Although these factors are all important, this study focuses on the existence and proper detection of outlier samples. The prediction performance of regression models is expected to be improved if the regression models are constructed after excluding the detected outlier samples.
Furthermore, as the relationship between the structure descriptors X and the property or activity y is different in outliers from that in other samples, useful information may be acquired by detecting outlier samples. Outlier sample detection is not only a means of improving the prediction performance of regression models; outlier samples themselves carry much information. Although there is a possibility of measurement errors in y, outlier samples can be a trigger for elucidating new relationships between properties or activities and chemical structures. One of the most famous outlier sample detection methods is the three-sigma method (TS) [10]. For an X- or y-variable, samples with values that are more than three times the standard deviation from the average are detected as outlier samples. Although outliers that are far from the data distribution can be detected, average values and standard deviations are largely affected by the outliers themselves. If the outlier values are high, both the average and the standard deviation will be increased by the outliers. Therefore, the Hampel filter (HF) [11], which uses the median instead of the average and the median absolute deviation instead of the standard deviation, was developed. By using the median and the median absolute deviation, it is possible to detect outliers robustly. Whereas TS and HF detect outlier samples using a single variable, multivariate outlier sample detection methods such as the convex hull [12,13], robust principal component analysis [14,15], the k-nearest neighbor algorithm [16], support vector data description [17], and the one-class support vector machine (OCSVM) [18] detect outlier samples by considering multiple variables simultaneously. OCSVM applies a support vector machine to the data domain estimation problem, and can detect samples existing in low-data-density domains from relatively few training data.

In regression analysis, it is important to quantitatively analyze the relationships between X and y to decrease the errors between actual y-values and the y-values calculated by regression models. Although the relationship between X and y must be consistent within a dataset to construct predictive regression models, outlier samples have a different relationship from that of normal samples. However, TS, HF, and multivariate methods such as OCSVM do not take the relationship between X and y into consideration in outlier sample detection. Because regression models express the relationship between X and y, they are considered reasonable for detecting outlier samples in which the relationship between X and y is different from that of the other samples. An outlier sample detection method that considers the relationship between X and y may detect outlier samples in which the errors between the actual y-values and the y-values estimated through cross-validation [19] are high. These samples exhibit a different relationship between X and y from that of the other samples, and can thus be classified as outliers. In cross-validation, however, if samples happen to have high y-errors, they are wrongly considered to be outliers. Conversely, when samples happen to have low y-errors, it is impossible to detect them as outliers. These issues are particularly evident when the number of training samples is low and the number of X-variables is high, in which case even linear regression models can be unstable and the y-errors can happen to be high or, conversely, easy to decrease. Deng et al. proposed an outlier sample detection method based on ensemble learning [20]. In sample bagging based on partial least-squares (PLS) [21], outliers having large average y-errors are defined as outliers of y, and those having a large standard deviation in their y-errors are defined as outliers of X. However, thresholds on the average and the standard deviation for detecting outliers were not given, and it is unclear whether outlier samples can be determined automatically; automatic outlier detection is impossible when there are no explicit detection rules. In addition, there remains a danger that normal samples are considered to be outlier samples and that true outlier samples are missed, because outlier samples are included in the dataset during modeling, the average and the standard deviation are affected by outliers, and the regression models can be unstable.

This study proposes a robust and automatic outlier sample detection method based on ensemble learning and regression analysis. Regression models are repeatedly constructed by changing the training samples in ensemble learning, and samples having large errors between the center of the estimated values and the actual value are judged to be outlier samples. Using the median rather than the mean as the center, and the median absolute deviation rather than the standard deviation, the influence of unstable local regression models on the detection of outlier samples is reduced. In addition, ensemble learning reduces the influence of y-errors that happen to be large or small in a local model construction, as well as the noise of the estimated values. After detecting outlier samples, ensemble learning is performed again with only the normal samples, and the y-values are estimated for all samples.
It is possible to prevent the estimated y-values from being influenced by outlier samples. Additionally, samples detected as outliers by chance can be changed to normal samples. This enables robust and automatic outlier sample detection and improved estimation performance to be achieved simultaneously. To confirm the effectiveness of the proposed method, a set of numerical simulation data, two sets of QSPR data and two sets of QSAR data are used. Compared with other outlier sample detection methods, the proposed method classifies fewer samples as outliers, thus improving the overall estimation performance in regression analysis.

2. Method

First, TS, HF, OCSVM, and cross-validation error-based outlier sample detection (CVE) are introduced as general outlier sample detection methods. The proposed outlier sample detection method based on ensemble learning is then discussed in detail.

2.1. Three-sigma method (TS)

The TS method detects outliers whose absolute values exceed three times the standard deviation from the average for a single variable. When the data distribution of a variable follows the normal distribution, the probability that a value is within three times the standard deviation of the average is 99.73%. Values outside this interval can be regarded as outliers. First, a variable x is standardized as:

z^{(k)} = \frac{x^{(k)} - \mu}{\sigma},    (1)

where x^{(k)} is the value of the kth sample and z^{(k)} is the value of the kth sample after normalization. \mu is the average of x and is given as follows:

\mu = \frac{\sum_{k=1}^{n} x^{(k)}}{n},    (2)

where n is the number of samples. \sigma is the standard deviation of x, given by:

\sigma = \sqrt{\frac{\sum_{k=1}^{n} \left( x^{(k)} - \mu \right)^2}{n - 1}}.    (3)

Samples for which the absolute value of z^{(k)} is greater than 3 are detected as outlier samples. TS must be used with care, as outliers in x influence the average and increase the standard deviation.
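For readers who want to experiment, the following is a minimal sketch of the TS criterion in Eqs. (1)-(3), assuming NumPy is available; it is an illustration only, not code from the paper.

```python
import numpy as np

def three_sigma_outliers(x):
    """Flag samples whose standardized value (Eq. (1)) has absolute value > 3."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()                # Eq. (2): average
    sigma = x.std(ddof=1)        # Eq. (3): standard deviation with n - 1 denominator
    z = (x - mu) / sigma         # Eq. (1): standardization
    return np.abs(z) > 3

# Illustrative usage with one artificially extreme value appended.
rng = np.random.default_rng(0)
x = np.append(rng.normal(size=100), 10.0)
print(np.where(three_sigma_outliers(x))[0])
```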


2.2. Hampel filter (HF)

The HF technique uses the median instead of the average and 1.4826 times the median absolute deviation instead of the standard deviation. The scale factor of 1.4826 makes the standard deviation and the median absolute deviation the same when the data distribution follows the normal distribution [22]. In HF, x is first standardized as:

z^{(k)} = \frac{x^{(k)} - \mu'}{1.4826 \sigma'},    (4)

where \mu' is the median of x and \sigma' is the median absolute deviation of x, given by:

\sigma' = \mathrm{median}\left( \left| x^{(k)} - \mu' \right| \right).    (5)

Samples for which the absolute value of z^{(k)} is greater than 3 are detected as outlier samples. As the median and the median absolute deviation are less affected by outliers than the average and standard deviation, respectively, it is possible to detect outliers robustly.
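A corresponding sketch of the HF criterion in Eqs. (4)-(5), again assuming NumPy; the robust scale 1.4826 times the median absolute deviation replaces the standard deviation.

```python
import numpy as np

def hampel_outliers(x):
    """Flag samples with |z| > 3, where z uses the median and 1.4826 * MAD (Eqs. (4)-(5))."""
    x = np.asarray(x, dtype=float)
    mu_prime = np.median(x)                          # median instead of the average
    sigma_prime = np.median(np.abs(x - mu_prime))    # Eq. (5): median absolute deviation
    z = (x - mu_prime) / (1.4826 * sigma_prime)      # Eq. (4)
    return np.abs(z) > 3

rng = np.random.default_rng(0)
x = np.append(rng.normal(size=100), 10.0)
print(np.where(hampel_outliers(x))[0])
```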

2.3. One-class support vector machine (OCSVM)

Both TS and HF are single-variable outlier detection methods, and cannot handle multiple variables simultaneously. The OCSVM method applies an SVM to the data domain estimation problem, making it possible to detect outlier samples while considering all X-variables. The basic formula of OCSVM is expressed as follows:

f(x) = \phi(x) \cdot w - b = \sum_{i=1}^{n} \alpha_i K\left( x^{(i)}, x \right) - b,    (6)

where w is a weight vector, \phi is a nonlinear function, x^{(i)} contains the X-variables of the ith sample, b is a constant, and K is a kernel function. In this study, the following Gaussian kernel is used:

K\left( x^{(i)}, x^{(j)} \right) = \exp\left( -\gamma \left\| x^{(i)} - x^{(j)} \right\|^2 \right).    (7)

In Eq. (6), \alpha_i is computed by minimizing:

\frac{1}{2} \left\| w \right\|^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - b,    (8)

subject to

\phi\left( x^{(i)} \right) \cdot w - b \geq -\xi_i, \quad \xi_i \geq 0.    (9)

\nu \in (0, 1) is interpreted as the fraction of outliers in the training data, i.e., data for which f(x^{(i)}) < 0. \gamma is determined so as to maximize the variance in the Gram matrix of the Gaussian kernel over the candidates 2^{-15}, 2^{-14}, ..., 2^{0}, and 2^{1}. Based on TS, \nu = 0.003.
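The paper does not state which OCSVM implementation was used; the sketch below uses scikit-learn's OneClassSVM as a stand-in, with gamma chosen by maximizing the variance of the Gaussian Gram matrix over the candidate grid given above and nu = 0.003.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import OneClassSVM

def select_gamma(X, exponents=range(-15, 2)):
    """Choose gamma from 2^-15, ..., 2^1 so that the Gram-matrix variance is maximized."""
    candidates = [2.0 ** e for e in exponents]
    variances = [rbf_kernel(X, gamma=g).var() for g in candidates]
    return candidates[int(np.argmax(variances))]

def ocsvm_outliers(X, nu=0.003):
    """Flag samples in low-density regions, i.e., samples with a negative decision value."""
    model = OneClassSVM(kernel="rbf", nu=nu, gamma=select_gamma(X)).fit(X)
    return model.decision_function(X) < 0

# Illustrative usage on random data with a small shifted cluster.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 5)), rng.normal(loc=8.0, size=(3, 5))])
print(np.where(ocsvm_outliers(X))[0])
```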


2.4. Cross-validation error-based outlier sample detection (CVE)

Using OCSVM, outliers can be detected by considering all X-variables. However, the relationship between the X-variables and the y-variable, which is central to regression analysis such as QSAR and QSPR, is not considered. Therefore, it is conceivable to construct a regression model between X and y and use it to detect outliers. As regression models constructed using all training samples are influenced by the outliers themselves, cross-validation should be used. Samples having high absolute errors between the measured y-values and the y-values estimated with cross-validation are detected as outlier samples. This is the basis of CVE. This study varies the number of outlier samples and examines the performance of the resulting regression models.
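A minimal sketch of CVE, assuming scikit-learn's PLSRegression with a fixed number of components and five-fold cross-validation; in practice the number of components would also be optimized, so treat this as an illustration rather than the exact procedure.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def cve_outliers(X, y, n_outliers=15, n_components=2, cv=5):
    """Flag the n_outliers samples with the largest absolute cross-validated y-errors."""
    y_cv = cross_val_predict(PLSRegression(n_components=n_components), X, y, cv=cv)
    errors = np.abs(np.asarray(y).ravel() - np.asarray(y_cv).ravel())
    mask = np.zeros(len(errors), dtype=bool)
    mask[np.argsort(errors)[-n_outliers:]] = True
    return mask
```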


2.5. Ensemble learning-based outlier sample detection (ELO)

Using cross-validation, it is possible to detect outliers defined by the relationship between X and y. However, depending on how the samples are divided in cross-validation, a true normal sample may be classified as an outlier and a true outlier sample may be classified as a normal sample. To reduce this likelihood and detect outlier samples robustly, ensemble learning-based outlier sample detection (ELO) is proposed.


The flow of the proposed method is shown in Fig. 1. First, multiple estimated y-values are prepared for each sample in the training data with bagging in the sample direction [23]. The y-values estimated with cross-validation are used for the samples selected in model construction. By repeating N cycles of model construction and estimation, N estimated y-values are obtained for each sample. The median of the N values gives the final estimated y-value, because the median is more robust against outliers than the average. When the absolute error between the actual y-value and the estimated y-value exceeds 1.4826 times the median absolute deviation, the sample is regarded as an outlier. Next, bagging is applied to all samples except the detected outliers. The y-values are then estimated for all samples, including the detected outlier samples. Again, when the absolute y-error exceeds 1.4826 times the median absolute deviation, the sample is considered to be an outlier. Thus, even when samples are determined to be outliers, if their absolute y-errors are within 1.4826 times the median absolute deviation, they are reclassified as normal samples. Robust outlier sample detection is achieved by repeating the model construction with bagging using only normal samples, the estimation of all samples, and the detection of outlier samples. The cycle ends when the detected outlier samples remain unchanged.

Fig. 1. Flow of the proposed ELO method.

The proposed method combines explicit rules for outlier sample detection, robust statistics such as the median and the median absolute deviation, consideration of the relationship between X and y, cross-validation, ensemble learning, and repetition of ensemble learning; that is, ensemble learning is repeated to reduce the danger that normal samples are considered to be outlier samples and that true outlier samples are missed. Thus, it is possible to detect outlier samples robustly and automatically while decreasing the probability of accidental errors in outlier sample detection, and to improve the estimation performance of regression models.

Although the generalized cross-validation (GCV) method, which can approximate a measure of the y-errors of leave-one-out cross-validation (LOOCV) when fitting a linear model with the least-squares method, is frequently used, GCV cannot be used in this study because each individual y-error must be handled, cross-validation is not restricted to LOOCV, and the regression analysis is not restricted to multiple linear regression with the least-squares method.
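As a rough illustration of the procedure in Fig. 1 (not the author's code), the sketch below uses scikit-learn's PLSRegression with a fixed number of components; hyperparameter optimization and the exact handling of the cross-validated estimates within each bag are simplified, and the 1.4826 * MAD threshold is applied to the absolute y-errors as one reading of the criterion described above.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

def _bagged_median_estimates(X, y, train_mask, n_models=100, n_components=2, seed=0):
    """Median of bagged y-estimates per sample; models are fitted on 'train_mask' samples only."""
    rng = np.random.default_rng(seed)
    train_idx = np.where(train_mask)[0]
    estimates = [[] for _ in range(len(y))]
    for _ in range(n_models):
        # Simplification: keep only the unique samples drawn into the bag.
        bag = np.unique(rng.choice(train_idx, size=len(train_idx), replace=True))
        pls = PLSRegression(n_components=n_components)
        # Cross-validated estimates for the samples selected for model construction.
        y_bag_cv = cross_val_predict(pls, X[bag], y[bag], cv=5).ravel()
        for i, idx in enumerate(bag):
            estimates[idx].append(y_bag_cv[i])
        # Ordinary predictions for the samples not selected in this bag.
        rest = np.setdiff1d(np.arange(len(y)), bag)
        if len(rest) > 0:
            model = PLSRegression(n_components=n_components).fit(X[bag], y[bag])
            for idx, val in zip(rest, model.predict(X[rest]).ravel()):
                estimates[idx].append(val)
    return np.array([np.median(e) for e in estimates])

def elo_outliers(X, y, n_models=100, max_iter=20):
    """Repeat bagging on the current normal samples until the detected outliers stop changing."""
    y = np.asarray(y).ravel()
    normal = np.ones(len(y), dtype=bool)
    for _ in range(max_iter):
        errors = np.abs(y - _bagged_median_estimates(X, y, normal, n_models=n_models))
        mad = np.median(np.abs(errors - np.median(errors)))
        new_normal = errors <= 1.4826 * mad
        if np.array_equal(new_normal, normal):
            break
        normal = new_normal
    return ~normal
```

Under these assumptions, elo_outliers returns a boolean mask of detected outlier samples, with n_models = 100 corresponding to the number of bagging models used in this study.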

3. Results and discussion

3.1. Numerical simulation data analysis

Numerical simulation data, in which X had ten variables, y had one variable, and the relationship between X and y was linear, were used. One variable, z, was generated as uniform random numbers. Then, y was defined as five times z, and normal random numbers were added to y. The X values were generated by adding normal random numbers to z; this was performed ten times, with different random seeds, thus generating ten X-variables. 100 samples of X and y were generated, and 5 was added to y alone for 15 of the samples to create outliers.

The outlier sample detection methods of TS, HF, OCSVM, CVE, and the proposed ELO were compared. TS and HF were applied to the y-variable, and samples with outlier y-values were regarded as outlier samples. In CVE, the number of outlier samples was set to 15, and five-fold cross-validation was used. In the proposed ELO, the number of regression models in the bagging process was set to 100. Linear PLS was used as the regression analysis method, and five-fold cross-validation was employed to optimize the hyperparameters. The determination coefficient r2 is affected by the standard deviation of y, which changes according to the outlier samples that are removed, so it does not allow an appropriate comparison of regression models constructed after removing outlier samples. Therefore, the mean absolute error (MAE) was used:

MAE = \frac{\sum_{k=1}^{n} \left| y^{(k)} - y_{CV}^{(k)} \right|}{n},    (10)

where y^{(k)} is the actual y-value of the kth sample and y_{CV}^{(k)} is the y-value of the kth sample estimated by five-fold cross-validation. As MAE is the average of the absolute y-errors, it can be compared regardless of the variation in y.

Table 1 shows the modeling results and the number of detected outlier samples when using each outlier detection method. #OS is the number of detected outlier samples and #TOS is the number of detected true outlier samples.

Table 1. MAE_CV, the number of detected outlier samples (#OS), and the number of detected true outlier samples (#TOS; the maximum is 15) for each method in PLS modeling for the numerical simulation dataset.

Method   MAE_CV   #OS   #TOS
TS       1.157    0     0
HF       1.157    0     0
OCSVM    1.130    10    2
CVE15    0.361    15    13
ELO      0.116    15    15

In TS and HF, no outlier samples could be detected. OCSVM detected ten outlier samples, but only two of them were true outlier samples. In CVE15, outlier samples could be detected, but two true outlier samples remained in the dataset, which means that outlier sample detection failed. By using the proposed ELO, the 15 true outlier samples could be detected perfectly. As a result, the MAE for ELO was the lowest among all the outlier detection methods. ELO succeeded in outlier sample detection and improved the estimation performance by appropriately removing outlier samples.

Fig. 2 presents the actual y-values and estimated y-values for the PLS models before and after the application of ELO. Fig. 2a exhibits outlier samples with large estimation errors. The estimation accuracy of these samples improved and the overall estimation performance also improved using ELO, as shown in Fig. 2b. After the application of ELO, the samples are closer to the diagonal, since all the outlier samples were successfully removed. It has been confirmed that the proposed method enables robust and automatic outlier sample detection and improves the estimation performance.

Fig. 2. Measured and estimated y-values in cross-validation for the numerical simulation dataset.

3.2. Real datasets

The following two QSPR datasets and two QSAR datasets were used in this study.

Solubility: 1290 compounds and their aqueous solubility, expressed as the logarithm of the solubility S (logS) at a temperature of 20–25 °C in moles per liter [24]. Sixteen compounds were removed because of structural duplication.

Toxicity: 1093 compounds downloaded from the Environmental Toxicity Prediction Challenge 2009 website [25]. This is an online challenge that invites researchers to estimate the toxicity of molecules against T. pyriformis, expressed as the logarithm of the 50% growth inhibitory concentration (pIGC50). Twelve compounds were removed because of structural duplication.

Melting: 4450 compounds and their melting points, as collected by Karthikeyan et al. [26] from the Molecular Diversity Preservation International (MDPI) database [27]. The melting points range from 14 to 392.5 °C.

Activity: 756 compounds with pIC50 values for the in vitro activities of P. carinii dihydrofolate reductase (DHFR) [28]. After deleting 83 compounds with indeterminate pIC50 values, the remaining 673 compounds were used.

Molecular structure descriptors for the remaining compounds were calculated using RDKit [29]. Descriptors for which the proportion of samples with the same value exceeded 0.8 were removed, along with one of each pair of descriptors for which the correlation coefficient was greater than 0.9. The ChemAxon software Marvin View [30] was used to visualize the chemical structures.
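A sketch of the descriptor pruning described above, assuming the descriptors are held in a pandas DataFrame; the 0.8 same-value proportion and 0.9 correlation thresholds follow the text, and absolute correlation is used here as one reasonable interpretation.

```python
import pandas as pd

def prune_descriptors(df: pd.DataFrame, same_value_limit=0.8, corr_limit=0.9):
    """Drop near-constant descriptors, then one descriptor of each highly correlated pair."""
    # Remove descriptors whose most frequent value covers more than 80% of the samples.
    keep = [c for c in df.columns
            if df[c].value_counts(normalize=True).iloc[0] <= same_value_limit]
    df = df[keep]
    # Remove one descriptor from every pair with |correlation| > 0.9.
    corr = df.corr().abs()
    dropped = set()
    cols = list(df.columns)
    for i, ci in enumerate(cols):
        if ci in dropped:
            continue
        for cj in cols[i + 1:]:
            if cj not in dropped and corr.loc[ci, cj] > corr_limit:
                dropped.add(cj)
    return df.drop(columns=sorted(dropped))
```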

3.3. Outlier sample detection and modeling

In each dataset, all samples were first used for modeling between X and y. Although the accuracy could be improved by variable selection, the focus of this study is sample selection. The objective was to construct regression models with high estimation performance with as few outlier samples as possible.

The outlier sample detection methods of TS, HF, OCSVM, CVE, and the proposed ELO were compared. TS and HF were applied to the y-variable, and samples with outlier y-values were regarded as outlier samples. In CVE, the number of outlier samples was successively set to 5, 15, and 30, and five-fold cross-validation was used. In the proposed ELO, the number of regression models in the bagging process was set to 100. As regression analysis methods, linear PLS and nonlinear Gaussian kernel-based support vector regression (SVR) [31] were used. LIBSVM [32] was used for optimization in SVR modeling, and five-fold cross-validation was employed to optimize the hyperparameters. Since the determination coefficient r2 is affected by the standard deviation of y, which changes according to the outlier samples that are removed, MAE (see Eq. (10)) was used. MAE can be compared regardless of the variation in y because it is the average of the absolute y-errors.

Tables 2–5 present the modeling results and the number of detected outlier samples when using each outlier detection method in each dataset. In these tables, 'Before OSD' means before outlier sample detection and #OS is the number of detected outlier samples.

Table 2. MAE_CV and the number of detected outlier samples (#OS) for each method in PLS and SVR modeling for the solubility dataset.

Method       PLS MAE   PLS #OS   SVR MAE   SVR #OS
Before OSD   0.708     0         0.413     0
TS           0.715     7         0.408     7
HF           0.705     21        0.405     21
OCSVM        0.762     34        0.408     34
CVE5         0.507     5         0.403     5
CVE15        0.589     15        0.388     15
CVE30        0.496     30        0.373     30
ELO          0.468     0         0.348     0

Table 3. MAE_CV and the number of detected outlier samples (#OS) for each method in PLS and SVR modeling for the toxicity dataset.

Method       PLS MAE   PLS #OS   SVR MAE   SVR #OS
Before OSD   0.347     0         0.286     0
TS           0.347     0         0.286     0
HF           0.350     1         0.279     1
OCSVM        0.376     32        0.287     32
CVE5         0.331     5         0.281     5
CVE15        0.384     15        0.269     15
CVE30        0.309     30        0.262     30
ELO          0.308     0         0.242     0

Table 4. MAE_CV and the number of detected outlier samples (#OS) for each method in PLS and SVR modeling for the melting dataset.

Method       PLS MAE   PLS #OS   SVR MAE   SVR #OS
Before OSD   35.3      0         32.0      0
TS           35.9      8         31.8      8
HF           35.4      2         32.2      2
OCSVM        35.8      51        31.7      51
CVE5         34.9      5         31.7      5
CVE15        34.7      15        31.3      15
CVE30        34.8      30        31.3      30
ELO          33.4      20        29.6      21

Table 5. MAE_CV and the number of detected outlier samples (#OS) for each method in PLS and SVR modeling for the activity dataset.

Method       PLS MAE   PLS #OS   SVR MAE   SVR #OS
Before OSD   1.46      0         1.21      0
TS           1.46      9         1.21      9
HF           1.45      8         1.21      8
OCSVM        1.47      29        1.21      29
CVE5         1.42      5         1.24      5
CVE15        1.39      15        1.19      15
CVE30        1.33      30        1.11      30
ELO          1.31      1         1.14      4

Using the single-variable outlier detection methods, only one sample was detected by HF for the toxicity dataset, although several outlier samples were detected in the other datasets. However, in the solubility dataset, although seven outlier samples were detected by TS, the MAE of the PLS model increased, indicating a decrease in estimation performance. In other cases, there were only slight decreases in MAE even after deleting several outlier samples. In the case of the melting dataset, the MAE decreased slightly in the SVR model constructed after deleting eight outliers with TS, but the MAE value related to HF increased. For the activity dataset, TS and HF deleted nine and eight outlier samples, respectively, but the MAE hardly improved. Outlier samples could not be adequately detected using only the y-variable. It can be concluded that meaningful outlier samples could not be deleted by TS and HF.

OCSVM detected more outlier samples than TS and HF in each dataset. In SVR modeling for the solubility dataset and the melting dataset, the MAE decreased slightly, although 34 samples (solubility dataset) and 51 samples (melting dataset) were regarded as outliers. In the other datasets, many samples were detected as outliers, but the MAE either increased or did not change. OCSVM could not select appropriate outlier samples and did not contribute to improving the estimation performance in regression analysis.

Using CVE, the MAE decreased as the number of outlier samples increased in many cases. This indicates that the estimation performance of the regression models improved using CVE. However, in PLS modeling for the toxicity dataset, for example, the MAE became high after detecting 15 outlier samples with CVE15. In addition, CVE5 decreased the estimation performance in SVR modeling for the activity dataset. The results of CVE were unstable.

Using the proposed ELO method, the MAE decreased and the estimation performance improved in all datasets for both PLS and SVR modeling. In the solubility dataset and the toxicity dataset, the MAE improved without deleting outlier samples because of the ensemble learning effect. Compared with CVE, the estimation performance of both PLS and SVR increased when using ELO. It can be concluded that robust regression models can be constructed with ensemble learning even when simple cross-validation incorrectly detects outlier samples.

In the activity dataset, the MAE can be reduced by deleting only one outlier sample in PLS modeling and only four outlier samples in SVR modeling. ELO gives the best estimation performance among the PLS models. In SVR modeling, the MAE of CVE30 is slightly lower than that of ELO. However, ELO can construct an SVR model with almost the same estimation performance as that of CVE30 by deleting only four outliers, compared with the 30 removed by CVE30. In the melting dataset, a relatively high number of outlier samples were detected (20 in PLS modeling, 21 in SVR modeling). However, the MAE given by ELO is lower than that of CVE30, in which 30 samples were deleted as outliers. Thus, the proposed ELO method constructs robust and highly predictive regression models with fewer outlier samples.

Figs. 3–6 plot the actual y-values and estimated y-values for the PLS models and the SVR models before and after the application of ELO. Figs. 3a, c, 4a, and 4c exhibit outlier samples with large estimation errors. The estimation accuracy of these samples improved and the overall estimation performance also improved using ELO, as shown in Figs. 3b, d, 4b, and 4d. After the application of ELO, the samples are closer to the diagonal for both the PLS and SVR models, although no outlier samples were detected by ELO. This confirms that the estimation performance is improved by the proposed method. In Figs. 5a, c, 6a, and 6c, the outlier samples are detected by ELO, and the other samples are close to the diagonal, which demonstrates the improved estimation performance given by considering outlier samples appropriately. There do not appear to be any outlier samples in Figs. 5 and 6 after the ELO detection process.

Fig. 3. Measured and estimated y-values in cross-validation for the solubility dataset.

Fig. 4. Measured and estimated y-values in cross-validation for the toxicity dataset.

Fig. 5. Measured and estimated y-values in cross-validation for the melting dataset.

Fig. 6. Measured and estimated y-values in cross-validation for the activity dataset.

Tables 6 and 7 present examples of outlier compounds detected by ELO and compounds similar to the outlier compounds for the melting dataset and the activity dataset, respectively. With small differences in chemical structures, the melting point and pIC50 values change dramatically, similar to an activity cliff [33]. If the databases are completely correct, such small differences will hold important information for further analysis, such as the development of new descriptors and the design of chemical structures. From the above results, it has been confirmed that the proposed method enables robust and automatic outlier sample detection and improves the estimation performance.

Table 6. Examples of outlier compounds detected by ELO and compounds similar to the outlier compounds for the melting dataset.

Table 7. Examples of outlier compounds detected by ELO and compounds similar to the outlier compounds for the activity dataset.

4. Conclusions

In this study, to perform robust and automatic outlier sample detection while considering the relationship between X and y, an outlier sample detection method based on ensemble learning was proposed. By repeating the steps of model construction and estimation, it is possible to detect outlier samples in which the relationship between X and y is different from that of other samples. Even when samples are detected as outliers, there is a possibility that these samples may be reclassified as normal samples. Thus, it is possible to reduce the probability of accidentally classifying a sample as an outlier. Ensemble learning contributes to improving the estimation performance of the final regression model. Analysis of outlier sample detection using various QSAR and QSPR datasets showed that the proposed method produces a large improvement in the estimation performance of linear and nonlinear regression models with fewer detected outliers than other outlier sample detection methods. After outlier sample detection with the proposed ELO, there do not appear to be any outlier samples in the plots of actual and estimated y-values. By checking the difference between outlier compounds and similar compounds, important information will be obtained for further analysis, such as the development of new descriptors and the design of chemical structures. When outlier samples are found, there is a possibility that outliers exist not only among samples but also among variables. Therefore, in practice, outlier sample detection must be performed simultaneously with outlier variable detection. The proposed method contributes to the improvement of the predictive ability of regression models and the construction of easy-to-understand models by detecting appropriate outlier samples.

Acknowledgment

I thank Stuart Jenkinson, PhD, from Edanz Group (www.edanzediting.com/ac) for editing a draft of this manuscript.

References

[1] A. Tropsha, Best practices for QSAR model development, validation, and exploitation, Mol. Inf. 29 (2010) 476–488.
[2] S. Sahoo, C. Adhikari, M. Kuanar, B.K. Mishra, A short review of the generation of molecular descriptors and their applications in quantitative structure property/activity relationships, Curr. Comput. Aided Drug Des. 12 (2016) 181–205.
[3] P. Filzmoser, M. Gschwandtner, V. Todorov, Review of sparse methods in regression and classification with application to chemometrics, J. Chemometr. 26 (2012) 42–51.
[4] J.J. Irwin, B.K. Shoichet, ZINC - a free database of commercially available compounds for virtual screening, J. Chem. Inf. Model. 45 (2005) 177–182.
[5] C.A. Nicolaou, J. Apostolakis, C.S. Pattichis, De novo drug design using multiobjective evolutionary graphs, J. Chem. Inf. Model. 49 (2009) 295–307.
[6] I.V. Tetko, I. Sushko, A.K. Pandey, H. Zhu, A. Tropsha, E. Papa, T. Öberg, R. Todeschini, D. Fourches, A. Varnek, Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection, J. Chem. Inf. Model. 48 (2008) 1733–1746.
[7] M.A. Babyak, What you see may not be what you get: a brief, nontechnical introduction to overfitting in regression-type models, Psychosom. Med. 66 (2004) 411–421.
[8] G. Schuurmann, R.U. Ebert, J.W. Chen, B. Wang, R. Kuhne, External validation and prediction employing the predictive squared correlation coefficient - test set activity mean vs training set activity mean, J. Chem. Inf. Model. 48 (2008) 2140–2145.
[9] G.M. Maggiora, On outliers and activity cliffs - why QSAR often disappoints, J. Chem. Inf. Model. 46 (2006) 1535.
[10] https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule (accessed August 9, 2017).
[11] R.K. Pearson, Outliers in process modeling and identification, IEEE Trans. Contr. Syst. Technol. 10 (2002) 55–63.
[12] J.A. Fernandez Pierna, F. Wahl, O.E. de Noord, D.L. Massart, Methods for outlier detection in prediction, Chemom. Intell. Lab. Syst. 63 (2002) 27–39.
[13] J.A. Fernandez Pierna, L. Jin, M. Daszykowski, F. Wahl, D.L. Massart, A methodology to detect outliers/inliers in prediction with PLS, Chemom. Intell. Lab. Syst. 68 (2003) 17–28.
[14] M. Hubert, S. Verboven, A robust PCR method for high-dimensional regressors, J. Chemometr. 17 (2003) 438–452.
[15] R.J. Pell, Multiple outlier detection for multivariate calibration using robust statistical techniques, Chemom. Intell. Lab. Syst. 52 (2000) 87–104.
[16] H. Ma, Y. Hu, H. Shi, Fault detection and identification based on the neighborhood standardized local outlier factor method, Ind. Eng. Chem. Res. 52 (2013) 2389–2402.
[17] H. Li, H. Wang, W. Fan, Multimode process fault detection based on local density ratio-weighted support vector data description, Ind. Eng. Chem. Res. 56 (2017) 2475–2491.
[18] Y. Xiao, H. Wang, W. Xu, J. Zhou, Robust one-class SVM for fault detection, Chemom. Intell. Lab. Syst. 151 (2016) 15–25.
[19] V. Consonni, D. Ballabio, R. Todeschini, Comments on the definition of the Q2 parameter for QSAR validation, J. Chem. Inf. Model. 49 (2009) 1669–1678.
[20] B.C. Deng, Y.H. Yun, Y.Z. Liang, Model population analysis in chemometrics, Chemom. Intell. Lab. Syst. 149 (2015) 166–176.
[21] S. Wold, M. Sjöström, L. Eriksson, PLS-regression: a basic tool of chemometrics, Chemom. Intell. Lab. Syst. 58 (2001) 109–130.
[22] P.J. Rousseeuw, C. Croux, Alternatives to the median absolute deviation, J. Am. Stat. Assoc. 88 (1993) 1273–1283.
[23] H. Kaneko, K. Funatsu, Applicability domain based on ensemble learning in classification and regression analyses, J. Chem. Inf. Model. 54 (2014) 2469–2482.
[24] T.J. Hou, K. Xia, W. Zhang, X.J. Xu, ADME evaluation in drug discovery. 4. Prediction of aqueous solubility based on atom contribution approach, J. Chem. Inf. Comput. Sci. 44 (2004) 266–275.
[25] http://www.cadaster.eu/node/65.html (accessed August 9, 2017).
[26] M. Karthikeyan, R.C. Glen, A. Bender, General melting point prediction based on a diverse compound dataset and artificial neural networks, J. Chem. Inf. Model. 45 (2005) 581–590.
[27] http://www.mdpi.org (accessed June 28, 2017).
[28] J.J. Sutherland, L.A. O'Brien, D.F. Weaver, Spline-fitting with a genetic algorithm: a method for developing classification structure-activity relationships, J. Chem. Inf. Comput. Sci. 43 (2003) 1906–1915.
[29] http://www.rdkit.org/ (accessed June 28, 2017).
[30] https://www.chemaxon.com/ (accessed June 28, 2017).
[31] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2006.
[32] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[33] Y. Hu, J. Bajorath, Extending the activity cliff concept: structural categorization of activity cliffs and systematic identification of different types of cliffs in the ChEMBL database, J. Chem. Inf. Model. 52 (2012) 1806–1811.