Neurocomputing 158 (2015) 127–133
Indefinite kernel ridge regression and its application on QSAR modelling

Benjamin Yee Shing Li, Lam Fat Yeung, King Tim Ko
Department of Electronic Engineering, City University of Hong Kong, Hong Kong
E-mail addresses: [email protected] (B.Y.S. Li), [email protected] (L.F. Yeung), [email protected] (K.T. Ko)
Article history: Received 24 September 2014; received in revised form 10 December 2014; accepted 28 January 2015; available online 12 February 2015. Communicated by Feiping Nie.

Keywords: Regression analysis; Indefinite kernel; Computer aided drug design; Quantitative structure-activity relationship (QSAR) modelling

Abstract: Recently, the use of indefinite kernels in machine learning has attracted considerable attention. However, most works focus on classification techniques and fewer are devoted to regression models. In this paper, to adapt indefinite kernels to the ridge regression model, an indefinite kernel ridge regression model is proposed. Instead of performing a spectral transformation on the kernel matrix, a positive semi-definite proxy kernel is constructed to approximate the indefinite kernel, relaxing the usual requirement that the kernel matrix be positive semi-definite. The sensitivity to the distance between the indefinite kernel and the proxy kernel is controlled by a parameter ρ. This approach allows one to construct regression models of response values based on the similarities of the corresponding objects, where the requirement that the similarity measure satisfy Mercer's condition can be relaxed. To illustrate the use of this algorithm, it is applied to quantitative structure-activity relationship (QSAR) modelling over 16 drug targets.
1. Introduction

The relationship between data and response values is always of great interest, as it can be used to explain the structure or mechanism of a complex or non-trivial system (for instance, the protein-ligand interactome) or, furthermore, to construct a predictive model. This analysis can be done via statistical learning techniques such as regression analysis, for instance kernel ridge regression [1], which can be considered a kernelized learning procedure based on a least-squares approach with regularization. With kernelization, a major restriction of ridge regression, namely that data points and estimated values must be linearly related, is relaxed by selecting different feature transforms or kernel functions. Due to the flexibility and simplicity of the regression model, it is widely applied in domains such as image processing, bioinformatics and cheminformatics [2-6].

In traditional practice, the kernel is constructed by applying a valid kernel function to feature vectors. Here, valid kernels refer to those satisfying Mercer's condition [7,8]. Under this setting the resulting kernels are always positive semi-definite (p.s.d.) and work well in most kernelized learning methods. However, for some kinds of data, such as structural objects like 3D conformations of molecular structures and DNA sequences, features are hard to extract. To apply kernelized methods to such
structural objects, a significant amount of effort has been devoted to the design of customized kernels such as graph kernels and string kernels [9-11]. As the design of the kernel directly affects the interpretation of similarities among data, these pre-defined kernels may not fit all applications. A pragmatic approach is to design kernels based on the knowledge of domain experts [12], in which case Mercer's condition may become a restriction or a burden on the design. To address this issue, we are motivated to extend kernel ridge regression to indefinite kernels.

Many studies on indefinite kernel methods have focused on classification techniques such as the support vector machine (SVM), and fewer are devoted to regression models. One type of approach is based on spectral transforms of the kernel matrix, including clipping [13], flipping [14], shifting [15] and squaring [12] the spectrum. The main advantages of these approaches are that they are independent of the learning model and simple to implement. However, the transformation may add noise and lose information. Another type of approach modifies the construction of the original learning model so that the p.s.d. constraint is relaxed [16-21]. Its advantage is that the entire similarity information (the indefinite kernel) is retained, and the remaining task reduces to modifying the learning model. Alternatively, some recent studies have also shown promising results using indefinite kernels in regression analysis. In [22], the support vector regression model is modified to accommodate indefinite kernels. The performance and approximation error of regression with indefinite kernels via a coefficient-regularized least-squares approach have also been studied [23]. This approach is further
improved via a kernel decomposition technique proposed in [24]. Note that when the kernel is symmetric, this decomposition is similar to the one proposed in [13].

In this paper an indefinite kernel extension of ridge regression is proposed. In our formulation, the indefinite kernel is replaced by a proxy kernel, and a penalty on the distance between the proxy kernel and the original indefinite kernel is added. This penalty is weighted by a parameter ρ, which allows the user to control the sensitivity of the algorithm: a lower value of ρ allows the algorithm to search for the proxy kernel over a larger range, and vice versa. The problem is then solved by decomposing the search space into two subspaces and iterating between two subproblems. Our analysis shows that the algorithm converges, and in the worst case it converges sub-linearly. In addition, the parameters γ² and ρ can be regarded as learning rates of the algorithm. To illustrate the performance of the proposed method, it is applied to quantitative structure-activity relationship (QSAR) modelling over 16 sets of drug targets and their corresponding drug candidates. Compared with other spectral transform methods [12-15] and a kernel principal component analysis transformation [25], the results show that the proposed method provides a significant improvement in the R² value of the QSAR model.
2. Background

The following are some basic definitions and properties of the kernel trick and of kernel ridge regression.

2.1. Kernel trick

Given a space $S$, let $\phi : S \to T$ be a feature transform which maps data points in $S$ to a Hilbert space $T$. A kernel function $K : S \times S \to \mathbb{R}$ is defined as

$$K(x, y) = \langle \phi(x), \phi(y) \rangle \qquad (1)$$

where $\langle \cdot, \cdot \rangle$ is the inner product over the space $T$. The kernel trick is the direct application of the kernel function $K(x, y)$ in the model instead of computing the transformed features $\phi(x)$ and $\phi(y)$. This trick gives the user freedom in manipulating the structure of the space or the distribution of the data points. In addition, once a kernel function $K(\cdot, \cdot)$ is defined, the feature extraction process can be avoided, which is an advantage when mining structured objects. Usually, the design of $K(x, y)$ is based on the similarity between $x$ and $y$, since $\langle \phi(x), \phi(y) \rangle$ describes the similarity between the two transformed feature vectors $\phi(x)$ and $\phi(y)$.

2.2. Kernel ridge regression

Given a collection of data pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^N$ is the $N$-dimensional feature vector and $y_i \in \mathbb{R}$ is the corresponding response, ridge regression can be represented as a regularized linear regression model

$$y = x^T \theta \qquad (2)$$

The optimal $\theta$ is obtained by minimizing the following loss function:

$$P_0: \ \min_{\theta, e} \{\|e\|_2^2 + \gamma \|\theta\|_2^2\} \quad \text{s.t.} \quad e = y - X^T \theta \qquad (3)$$

where $\gamma$ is a non-negative regularization parameter, $y = [y_1 \cdots y_N]^T$ and $X = [x_1 \cdots x_N]$. The Lagrangian of $P_0$ is

$$L(\theta, e, \beta) = e^T e + \gamma \theta^T \theta + \beta^T (y - X^T \theta - e) \qquad (4)$$

By the Karush-Kuhn-Tucker (KKT) conditions of (4), we have

$$e = \frac{\beta}{2} \qquad (5)$$

$$\theta = \frac{X \beta}{2\gamma} \qquad (6)$$

Plugging (5) and (6) back into (4) yields

$$L = -\frac{\beta^T \beta}{4} + \beta^T y - \frac{\beta^T X^T X \beta}{4\gamma} \qquad (7)$$

Then, letting $\alpha = \beta/(2\gamma)$ and defining the kernel matrix $K = X^T X$,

$$L = -\gamma^2 \alpha^T \alpha + 2\gamma \alpha^T y - \gamma \alpha^T K \alpha \qquad (8)$$

Hence we have the following dual problem:

$$P_1: \ \min_{\alpha} \{\gamma^2 \alpha^T \alpha - 2\gamma \alpha^T y + \gamma \alpha^T K \alpha\} \qquad (9)$$

Note that in general the kernel matrix $K$ is assumed to be p.s.d.; in this paper we are interested in the indefinite case.
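For a p.s.d. kernel, the minimizer of $P_1$ is the familiar closed form $\alpha = (K + \gamma I)^{-1} y$, and fitted values follow from the kernel values against the training set. The following minimal NumPy sketch is not part of the original paper; the Gaussian kernel, the function names and the toy data are illustrative assumptions, shown only to fix the baseline that the indefinite extension below generalizes.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Example p.s.d. kernel: Gaussian/RBF similarities between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(K, y, gamma=1.0):
    """Dual solution of P1 for a p.s.d. kernel matrix K: alpha = (K + gamma*I)^{-1} y."""
    n = K.shape[0]
    return np.linalg.solve(K + gamma * np.eye(n), y)

def krr_predict(K_test_train, alpha):
    """Estimated responses: y_hat = K(test, train) @ alpha."""
    return K_test_train @ alpha

# toy usage on random data
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(50, 5)), rng.normal(size=50)
K = gaussian_kernel(X_train, X_train)
alpha = krr_fit(K, y_train, gamma=1.0)
y_fit = krr_predict(K, alpha)
```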
3. Indefinite kernel ridge regression

3.1. Algorithm

In most cases the kernel matrix is constructed through a valid kernel function or kernel transform; hence it is guaranteed to be p.s.d. Sometimes, however, the kernel values are computed directly from a similarity function, which does not guarantee the p.s.d.-ness of the kernel matrix. To handle such cases, we propose an indefinite kernel extension of ridge regression. Consider the problem

$$P_2: \ \min_{\alpha, K \succeq 0} f(\alpha, K) = \min_{\alpha, K \succeq 0} \{\gamma^2 \alpha^T \alpha - 2\gamma \alpha^T y + \gamma \alpha^T K \alpha + \rho \|K - \tilde{K}\|_F^2\} \qquad (10)$$

where $\tilde{K}$ is the (indefinite) kernel matrix and $\rho \ge 0$ is a constant. $P_2$ relaxes the p.s.d. requirement through the use of a proxy kernel: the indefinite kernel matrix $\tilde{K}$ is regarded as a noisy version of the p.s.d. proxy kernel $K$. To prevent the proxy kernel $K$ from drifting too far from $\tilde{K}$, a penalty on the Frobenius distance between $K$ and $\tilde{K}$ is added to the cost function.

In (10), $f(\alpha, K)$ is not jointly convex in $\alpha$ and $K$, which implies that there is no guarantee of reaching the global optimum over $(\alpha, K)$. It is, however, convex in $\alpha$ and in $K$ separately; hence we propose to solve (10) by alternately solving the following two subproblems, $P_3$ and $P_4$, which are the restrictions of $P_2$ to the subspaces of $\alpha$ and of $K \succeq 0$ respectively:

$$P_3: \ g_1(\hat{K}) = \arg\min_{\alpha} L_1(\alpha, \hat{K}) = \arg\min_{\alpha} \{\gamma^2 \alpha^T \alpha - 2\gamma \alpha^T y + \gamma \alpha^T \hat{K} \alpha\} \qquad (11)$$

$$P_4: \ g_2(\hat{\alpha}) = \arg\min_{K \succeq 0} L_2(\hat{\alpha}, K) = \arg\min_{K \succeq 0} \{\rho \|K - \tilde{K}\|_F^2 + \gamma \hat{\alpha}^T K \hat{\alpha}\} \qquad (12)$$

Both $P_3$ and $P_4$ admit closed-form solutions. $P_3$ is solved by applying the optimality condition $\partial L_1 / \partial \alpha = 0$, which leads to

$$g_1(\hat{K}) = (\hat{K} + \gamma I)^{-1} y \qquad (13)$$

$P_4$ is in fact the projection of a specific matrix onto the p.s.d. cone,

$$P_5: \ \arg\min_{K \succeq 0} \left\| K - \left(\tilde{K} - \frac{\gamma}{2\rho} \hat{\alpha} \hat{\alpha}^T\right) \right\|_F \qquad (14)$$

Consider a symmetric matrix $X$ with eigen-decomposition $X = U \Lambda U^T$, where $U = [u_1 \cdots u_N]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$. Let $(X)_+ = \sum_{\lambda_k > 0} \lambda_k u_k u_k^T$, i.e. the positive part of $X$. The optimal solution of $P_5$ can then be written as

$$g_2(\hat{\alpha}) = \left(\tilde{K} - \frac{\gamma}{2\rho} \hat{\alpha} \hat{\alpha}^T\right)_+ \qquad (15)$$

Starting from an initial $\alpha_0$, $P_2$ can be solved via the following iterative procedure:

$$(\alpha_0) \xrightarrow{K\text{-step}} (\alpha_0, K_0) \xrightarrow{\alpha\text{-step}} (\alpha_1, K_0) \xrightarrow{K\text{-step}} (\alpha_1, K_1) \xrightarrow{\alpha\text{-step}} \cdots \qquad (16)$$

where in the $\alpha$-step $\alpha_i = g_1(K_{i-1})$ and in the $K$-step $K_i = g_2(\alpha_i)$. This procedure is summarized as Algorithm 1.
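The $K$-step in (15) only requires the positive part $(\cdot)_+$ of a symmetric matrix, which can be computed from an eigen-decomposition. Below is a small NumPy helper; it is an illustrative sketch, not the authors' code, and the name positive_part is ours.

```python
import numpy as np

def positive_part(X):
    """(X)_+ : keep only the non-negative part of the spectrum of a symmetric matrix X,
    i.e. its Frobenius-norm projection onto the p.s.d. cone."""
    lam, U = np.linalg.eigh((X + X.T) / 2.0)   # symmetrize for numerical safety
    lam = np.clip(lam, 0.0, None)              # drop negative eigenvalues
    return (U * lam) @ U.T
```

With this operator, the $K$-step of (15) is simply positive_part(K_tilde - (gamma / (2 * rho)) * np.outer(alpha, alpha)).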
Algorithm 1. Indefinite kernel ridge regression.

Data: $\gamma$, $\rho$, $\tilde{K}$, $y$
Result: $\alpha$, $K$
initialization: $\alpha \leftarrow \mathrm{random}()$;
while $f$ has not converged do
    $K \leftarrow \left(\tilde{K} - \frac{\gamma}{2\rho} \alpha \alpha^T\right)_+$;
    $\alpha \leftarrow (K + \gamma I)^{-1} y$;
    $f \leftarrow \gamma^2 \alpha^T \alpha - 2\gamma \alpha^T y + \gamma \alpha^T K \alpha + \rho \|K - \tilde{K}\|_F^2$;
end
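A direct NumPy transcription of Algorithm 1 might look as follows. This is a sketch under the paper's notation; the convergence tolerance tol, the iteration cap max_iter, the seed and the function names are illustrative assumptions not specified above.

```python
import numpy as np

def positive_part(X):
    """Projection of a symmetric matrix onto the p.s.d. cone (keep non-negative eigenvalues)."""
    lam, U = np.linalg.eigh((X + X.T) / 2.0)
    return (U * np.clip(lam, 0.0, None)) @ U.T

def objective(alpha, K, K_tilde, y, gamma, rho):
    """f(alpha, K) of eq. (10)."""
    return (gamma**2 * alpha @ alpha - 2 * gamma * alpha @ y
            + gamma * alpha @ K @ alpha
            + rho * np.linalg.norm(K - K_tilde, 'fro')**2)

def indefinite_krr(K_tilde, y, gamma=1.0, rho=100.0, tol=1e-6, max_iter=100, seed=0):
    """Alternate the K-step (15) and the alpha-step (13) until f stops decreasing."""
    n = len(y)
    rng = np.random.default_rng(seed)
    alpha = rng.normal(size=n)                  # initialization
    f_old = np.inf
    for it in range(max_iter):
        K = positive_part(K_tilde - (gamma / (2 * rho)) * np.outer(alpha, alpha))  # K-step
        alpha = np.linalg.solve(K + gamma * np.eye(n), y)                          # alpha-step
        f_new = objective(alpha, K, K_tilde, y, gamma, rho)
        if abs(f_old - f_new) < tol:
            break
        f_old = f_new
    return alpha, K, it + 1
```

One natural choice for the fitted responses, mirroring standard kernel ridge regression, is then $\hat{y} = K\alpha$ with the learned proxy kernel $K$.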
3.2. Properties and analysis

In the iterative process (16), the objective function $f(\alpha, K)$ is minimized at each step. That is,

$$f(\alpha, K_i) \ge f(\alpha_{i+1}, K_i) \quad \text{for all } \alpha \qquad (17)$$

$$f(\alpha_{i+1}, K) \ge f(\alpha_{i+1}, K_{i+1}) \quad \text{for all } K \succeq 0 \qquad (18)$$

Hence the value of the objective function $f(\alpha, K)$ is monotonically decreasing over the iterations. In addition, the objective function can be written as $\|\gamma\alpha - y\|_2^2 - \|y\|_2^2 + \gamma \alpha^T K \alpha + \rho \|K - \tilde{K}\|_F^2$. As $K \succeq 0$, we have $\gamma \alpha^T K \alpha \ge 0$, so $f(\alpha, K)$ is bounded below by $-\|y\|_2^2$. Combining these two facts, the algorithm converges.

If $f$ is strongly convex in $\alpha$ and in $K$ separately, the reduction at each step can be bounded in terms of the step size. The following two theorems establish this strong convexity.

Theorem 1. $f(\alpha, \hat{K})$ is strongly convex in $\alpha$ with modulus $2\gamma^2 + 2\gamma \lambda_{\min}(\hat{K})$.

Proof. We first consider the gradient of $f$ with respect to $\alpha$:

$$\nabla_\alpha f(\alpha) = \frac{\partial f}{\partial \alpha} = 2\gamma^2 \alpha - 2\gamma y + 2\gamma \hat{K} \alpha \qquad (19)$$

$$\nabla_\alpha f(\alpha_1) - \nabla_\alpha f(\alpha_2) = (2\gamma^2 I + 2\gamma \hat{K})(\alpha_1 - \alpha_2) \qquad (20)$$

$$(\nabla_\alpha f(\alpha_1) - \nabla_\alpha f(\alpha_2))^T (\alpha_1 - \alpha_2) = 2\gamma^2 \|\alpha_1 - \alpha_2\|_2^2 + 2\gamma (\alpha_1 - \alpha_2)^T \hat{K} (\alpha_1 - \alpha_2) \qquad (21)$$

$$(\nabla_\alpha f(\alpha_1) - \nabla_\alpha f(\alpha_2))^T (\alpha_1 - \alpha_2) \ge (2\gamma^2 + 2\gamma \lambda_{\min}(\hat{K})) \|\alpha_1 - \alpha_2\|_2^2 \qquad (22)$$

By (22), since $\gamma$ and $\lambda_{\min}(\hat{K})$ are both non-negative, $f$ is strongly convex in $\alpha$ with modulus $2\gamma^2 + 2\gamma \lambda_{\min}(\hat{K})$. □

Theorem 2. $f(\hat{\alpha}, K)$ is strongly convex in $K$ with modulus $2\rho$.

Proof. We first consider the gradient of $f$ with respect to $K$:

$$\nabla_K f(K) = \frac{\partial f}{\partial K} = \gamma \alpha \alpha^T + 2\rho K - 2\rho \tilde{K} \qquad (23)$$

$$\nabla_K f(K_1) - \nabla_K f(K_2) = 2\rho (K_1 - K_2) \qquad (24)$$

$$\mathrm{tr}\big((\nabla_K f(K_1) - \nabla_K f(K_2))^T (K_1 - K_2)\big) = 2\rho \|K_1 - K_2\|_F^2 \qquad (25)$$

By (25), since $\rho$ is non-negative, $f$ is strongly convex in $K$ with modulus $2\rho$. □

Lemma 1 (Property of strongly convex functions, Nesterov [26]). For any strongly convex function $f$ with modulus $m$ and any two points $x$, $y$,

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2} \|y - x\|_2^2 \qquad (26)$$

Lemma 2. For any p.s.d. matrices $X$ and $Y$, the trace of their product is non-negative:

$$\mathrm{tr}(XY) \ge 0 \qquad (27)$$

Proof. Let $X = L L^T$, where $L$ is the symmetric p.s.d. square root of $X$. Then

$$\mathrm{tr}(XY) = \mathrm{tr}(L^T Y L) \ge 0 \qquad (28)$$

since $L^T Y L$ is p.s.d. □

Combining the strong convexity of $f$ in $\alpha$ and in $K$ with Lemmas 1 and 2, we obtain the following two properties.

Property 3. For the $i$-th $\alpha$-step, the cost function $f$ is reduced by a non-negative value bounded below by

$$\gamma^2 \|\alpha_i - \alpha_{i-1}\|_2^2 \qquad (29)$$

Proof. As $f$ is strongly convex in $\alpha$, Lemma 1 gives

$$f(\alpha_{i-1}, K_{i-1}) - f(\alpha_i, K_{i-1}) \ge (\alpha_{i-1} - \alpha_i)^T (2\gamma^2 \alpha_i - 2\gamma y + 2\gamma K_{i-1} \alpha_i) + (\gamma^2 + \gamma \lambda_{\min}(K_{i-1})) \|\alpha_i - \alpha_{i-1}\|_2^2 \qquad (30)$$

Plugging $\alpha_i = (K_{i-1} + \gamma I)^{-1} y$ into (30), the gradient term vanishes and we have

$$f(\alpha_{i-1}, K_{i-1}) \ge f(\alpha_i, K_{i-1}) + (\gamma^2 + \gamma \lambda_{\min}(K_{i-1})) \|\alpha_i - \alpha_{i-1}\|_2^2 \ge f(\alpha_i, K_{i-1}) + \gamma^2 \|\alpha_i - \alpha_{i-1}\|_2^2 \qquad (31)$$

where $\lambda_{\min}(K_{i-1}) \ge 0$ because $K_{i-1}$ is p.s.d. □

Similarly, we can obtain an analogous relation for the $K$-step.

Property 4. For the $i$-th $K$-step, the cost function $f$ is reduced by a non-negative value bounded below by

$$\rho \|K_i - K_{i-1}\|_F^2 \qquad (32)$$

Proof. From Lemma 1, with the modulus $2\rho$ of Theorem 2, we have

$$f(\alpha_i, K_{i-1}) - f(\alpha_i, K_i) \ge \mathrm{tr}\big[(K_{i-1} - K_i)^T (\gamma \alpha_i \alpha_i^T + 2\rho K_i - 2\rho \tilde{K})\big] + \rho \|K_i - K_{i-1}\|_F^2 = 2\rho \, \mathrm{tr}\Big[(K_{i-1} - K_i)^T \Big(K_i - \tilde{K} + \frac{\gamma}{2\rho} \alpha_i \alpha_i^T\Big)\Big] + \rho \|K_i - K_{i-1}\|_F^2 \qquad (33)$$

Since $\rho \|K_i - K_{i-1}\|_F^2$ is non-negative, it remains to show that $\mathrm{tr}\big[(K_{i-1} - K_i)^T (K_i - (\tilde{K} - \frac{\gamma}{2\rho} \alpha_i \alpha_i^T))\big]$ is non-negative. Recall that $K_i$ is the p.s.d. projection of $\tilde{K} - \frac{\gamma}{2\rho} \alpha_i \alpha_i^T$; hence $K_i^T (\tilde{K} - \frac{\gamma}{2\rho} \alpha_i \alpha_i^T) = \sum_{\lambda_j > 0} \lambda_j^2 u_j u_j^T = K_i^T K_i$, where $\lambda_j$ and $u_j$ are the corresponding eigenvalues and eigenvectors of $K_i$. Therefore

$$\mathrm{tr}\Big[(K_{i-1} - K_i)^T \Big(K_i - \tilde{K} + \frac{\gamma}{2\rho} \alpha_i \alpha_i^T\Big)\Big] = \mathrm{tr}\Big[K_{i-1}^T \Big(K_i - \tilde{K} + \frac{\gamma}{2\rho} \alpha_i \alpha_i^T\Big)\Big] \qquad (34)$$

Since $K_{i-1}$ and $K_i - (\tilde{K} - \frac{\gamma}{2\rho} \alpha_i \alpha_i^T)$ are both p.s.d., by Lemma 2 the trace of their product is non-negative. Thus,

$$f(\alpha_i, K_{i-1}) - f(\alpha_i, K_i) \ge \rho \|K_i - K_{i-1}\|_F^2 \qquad (35)$$

□
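As a quick sanity check of Properties 3 and 4, one can verify numerically that each half-step decreases $f$ by at least the stated bounds. The script below is illustrative and not from the paper; the random symmetric matrix simply stands in for an indefinite similarity matrix.

```python
import numpy as np

def positive_part(X):
    lam, U = np.linalg.eigh((X + X.T) / 2.0)
    return (U * np.clip(lam, 0.0, None)) @ U.T

def f(alpha, K, K_tilde, y, gamma, rho):
    return (gamma**2 * alpha @ alpha - 2 * gamma * alpha @ y
            + gamma * alpha @ K @ alpha
            + rho * np.linalg.norm(K - K_tilde, 'fro')**2)

rng = np.random.default_rng(1)
n, gamma, rho = 30, 1.0, 100.0
A = rng.normal(size=(n, n))
K_tilde = (A + A.T) / 2.0                      # symmetric, generally indefinite
y, alpha = rng.normal(size=n), rng.normal(size=n)
K = positive_part(K_tilde - (gamma / (2 * rho)) * np.outer(alpha, alpha))

for _ in range(5):
    # alpha-step: decrease >= gamma^2 * ||alpha_i - alpha_{i-1}||^2   (Property 3)
    alpha_new = np.linalg.solve(K + gamma * np.eye(n), y)
    drop = f(alpha, K, K_tilde, y, gamma, rho) - f(alpha_new, K, K_tilde, y, gamma, rho)
    assert drop >= gamma**2 * np.sum((alpha_new - alpha)**2) - 1e-6
    alpha = alpha_new
    # K-step: decrease >= rho * ||K_i - K_{i-1}||_F^2                  (Property 4)
    K_new = positive_part(K_tilde - (gamma / (2 * rho)) * np.outer(alpha, alpha))
    drop = f(alpha, K, K_tilde, y, gamma, rho) - f(alpha, K_new, K_tilde, y, gamma, rho)
    assert drop >= rho * np.linalg.norm(K_new - K, 'fro')**2 - 1e-6
    K = K_new
```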
Properties 3 and 4 provide insight into how the parameters $\gamma$ and $\rho$ affect the search for the optimum $f^*$. As the decrease of $f$ is bounded from below by the step size at rates $\gamma^2$ and $\rho$, larger $\gamma$ and $\rho$ can lead to faster minimization. However, in that case the regularization of $\alpha$ is stronger and the proxy kernel is tied more tightly to the original kernel $\tilde{K}$; hence the choice of $\gamma$ and $\rho$ depends on the application. The following property summarizes the reduction of the cost function $f$ from the $p$-th step to the $q$-th step by combining Properties 3 and 4.

Property 5. From the $p$-th iteration to the $q$-th iteration, the cost function is reduced by a non-negative value bounded below by

$$\gamma^2 \sum_{i=p+1}^{q} \|\alpha_i - \alpha_{i-1}\|_2^2 + \rho \sum_{i=p+1}^{q} \|K_i - K_{i-1}\|_F^2 \qquad (36)$$

Proof. From Property 3 we have

$$f(\alpha_{i-1}, K_{i-1}) - f(\alpha_i, K_{i-1}) \ge \gamma^2 \|\alpha_i - \alpha_{i-1}\|_2^2 \qquad (37)$$

and from Property 4 we have

$$f(\alpha_i, K_{i-1}) - f(\alpha_i, K_i) \ge \rho \|K_i - K_{i-1}\|_F^2 \qquad (38)$$

Thus,

$$f(\alpha_p, K_p) - f(\alpha_q, K_q) \ge \gamma^2 \sum_{i=p+1}^{q} \|\alpha_i - \alpha_{i-1}\|_2^2 + \rho \sum_{i=p+1}^{q} \|K_i - K_{i-1}\|_F^2 \qquad (39)$$

for all $p < q$. □

With Property 5, the rate of convergence of the algorithm can be studied further.

Property 6. Suppose the function $f(\alpha, K)$ converges to $f^*$. Then

$$\lim_{i \to \infty} \frac{|f(\alpha_{i+1}, K_{i+1}) - f^*|}{|f(\alpha_i, K_i) - f^*|} \le 1 \qquad (40)$$

Proof. From (39) we have

$$f(\alpha_{i+1}, K_{i+1}) \le f(\alpha_i, K_i) - \gamma^2 \|\alpha_{i+1} - \alpha_i\|_2^2 - \rho \|K_{i+1} - K_i\|_F^2 \qquad (41)$$

Since $f$ is monotonically decreasing, $|f(\alpha_i, K_i) - f^*| = f(\alpha_i, K_i) - f^*$, so

$$\frac{|f(\alpha_{i+1}, K_{i+1}) - f^*|}{|f(\alpha_i, K_i) - f^*|} = \frac{f(\alpha_{i+1}, K_{i+1}) - f^*}{f(\alpha_i, K_i) - f^*} \qquad (42)$$

$$\frac{|f(\alpha_{i+1}, K_{i+1}) - f^*|}{|f(\alpha_i, K_i) - f^*|} \le \frac{f(\alpha_i, K_i) - \gamma^2 \|\alpha_{i+1} - \alpha_i\|_2^2 - \rho \|K_{i+1} - K_i\|_F^2 - f^*}{f(\alpha_i, K_i) - f^*} \qquad (43)$$

$$\frac{|f(\alpha_{i+1}, K_{i+1}) - f^*|}{|f(\alpha_i, K_i) - f^*|} \le 1 - \frac{\gamma^2 \|\alpha_{i+1} - \alpha_i\|_2^2 + \rho \|K_{i+1} - K_i\|_F^2}{f(\alpha_i, K_i) - f^*} \qquad (44)$$

$$\frac{|f(\alpha_{i+1}, K_{i+1}) - f^*|}{|f(\alpha_i, K_i) - f^*|} \le 1 \qquad (45)$$

□

From Property 6, one can see that in the worst case the algorithm may converge sub-linearly. Nevertheless, in the experiments reported below the algorithm converged within a practical number of iterations at the chosen tolerances.

4. Case studies and discussions

Here, the QSAR modelling problem is studied via indefinite kernel ridge regression, and its performance is compared with other spectral transform methods.

4.1. Quantitative structure-activity relationship (QSAR) modelling

QSAR modelling is a technique widely used in computer aided drug design to analyze the performance or toxicity of candidate drugs [27,28]. It models the dependence between the structural information of compounds and their chemical responses toward a specific target [29]. It is based on the structure-activity relationship, a fundamental principle in medicinal chemistry, which states that structurally similar compounds should behave similarly [30]. If the model is constructed from the structural similarities between compounds, the task can be treated as a kernel regression problem. However, the similarity measure may not satisfy Mercer's condition; in that case the kernel matrix is indefinite, and the problem is in fact an indefinite kernel regression problem.

In this paper we conducted QSAR modelling of the half maximal inhibitory concentration (IC50) or half maximal effective concentration (EC50) based on the 3D structural similarities of compounds. The 3D-Tanimoto score is employed as the similarity between compounds [31,32]. It is computed via flexible alignment between 3D conformations of ligands. Fig. 1 shows an example of the alignment between C20H12F2N2O5S and C19H12ClFN2O7S using Screen3D [31,32], which yields a 3D-Tanimoto score of 0.8851.

4.2. Dataset

In this experiment we employed 16 confirmatory BioAssays from the Maximum Unbiased Validation (MUV) dataset [33]. The response values and molecular structures were collected from the PubChem database [34]. Table 1 summarizes the datasets. Since the response values y span a wide range, log(y) is modelled instead of y.

Fig. 1. 3D models of (a) C20H12F2N2O5S (PubChem Compound ID: 16072278), (b) C19H12ClFN2O7S (PubChem Compound ID: 16072279) and (c) their 3D alignment via Screen3D.

4.3. Results

Table 2 summarizes the spectrum of the 3D-Tanimoto kernel for each dataset. All the kernels are indefinite, as every $\lambda_{\min}$ is negative. Furthermore, the ratio $R_{\mathrm{eig}}$ between the magnitudes of the sums of negative and positive eigenvalues is not negligible; in some datasets, such as AID:652 and AID:713, $R_{\mathrm{eig}}$ even exceeds 5%, indicating that directly using the 3D-Tanimoto scores as a kernel may suffer from a significant indefiniteness issue.
Table 1. Details of the confirmatory BioAssays used in this paper.

| BioAssay ID | Target | Type of interaction | # of ligands | Response value (y) | mean(log(y)) | std(log(y)) | max(log(y)) | min(log(y)) |
|---|---|---|---|---|---|---|---|---|
| 466 | S1PR1 | Agonist | 489 | EC50 | 2.152 | 1.2995 | 2.2349 | 4.5539 |
| 548 | PKACA | Inhibitor | 92 | IC50 | 3.2979 | 0.992 | 0.3827 | 5.88 |
| 600 | STF1 | Inhibitor | 346 | IC50 | 1.5792 | 1.4091 | 5.2923 | 4.5951 |
| 644 | ROCK2 | Inhibitor | 203 | IC50 | 2.6407 | 1.4901 | 5.7992 | 6.9167 |
| 652 | 1VRU | Inhibitor | 388 | IC50 | 1.413 | 1.2424 | 2.3645 | 3.912 |
| 689 (a) | EPHA4 | Inhibitor | 61 | N/A | N/A | N/A | N/A | N/A |
| 692 | STF1 | Agonist | 462 | EC50 | 0.8756 | 1.4278 | 1.9805 | 4.5951 |
| 712 | HSP90AA1 | Inhibitor | 105 | IC50 | 3.264 | 1.1329 | 1.9929 | 3.912 |
| 713 | ESR1 | Inhibitor | 256 | IC50 | 3.1889 | 0.7847 | 0.4463 | 3.912 |
| 733 | ESR2 | Inhibitor | 436 | IC50 | 2.9501 | 1.117 | 0.4463 | 3.912 |
| 737 | ESR1 | Potentiator | 384 | EC50 | 3.4788 | 2.3777 | 0.4587 | 9.9218 |
| 810 | FAK | Kinase inhibitor | 94 | IC50 | 2.4141 | 1.1391 | 1.0906 | 4.2485 |
| 832 | CG | Inhibitor | 208 | IC50 | 2.336 | 1.3461 | 0.4603 | 3.912 |
| 846 | FXI | Inhibitor | 93 | IC50 | 2.9398 | 1.6107 | 3.3744 | 3.912 |
| 852 | FXII | Inhibitor | 288 | IC50 | 1.1387 | 1.6643 | 4.5637 | 3.8129 |
| 858 | DAD1 | Allosteric modulator | 146 | EC50 | 11.3185 | 1.4216 | 20.092 | 6.7254 |
| 859 | CHRM1 | Allosteric inhibitor | 222 | IC80 (b) | 6.4919 | 0.0784 | 6.254 | 6.6776 |

(a) No response value is provided; this dataset is ignored.
(b) No IC50 value is provided; IC80 is used instead.
Table 2. Summary of the spectrum of the 3D-Tanimoto matrix. $\lambda_i$ denotes the $i$-th eigenvalue of the matrix, $\lambda_{\min} = \min_i \lambda_i$, $\lambda_{\max} = \max_i \lambda_i$ and $R_{\mathrm{eig}} = \left|\sum_{\lambda_i < 0} \lambda_i\right| / \sum_{\lambda_i > 0} \lambda_i$.

| BioAssay ID | λmin | λmax | Reig |
|---|---|---|---|
| 466 | -1.3834 | 144.3584 | 0.084 |
| 548 | -0.2807 | 33.1673 | 0.0058 |
| 600 | -0.7475 | 108.251 | 0.0442 |
| 644 | -0.4405 | 65.5507 | 0.0226 |
| 652 | -0.9266 | 105.3444 | 0.0542 |
| 692 | -0.1316 | 35.6528 | 0.0037 |
| 712 | -0.7043 | 82.0128 | 0.0314 |
| 713 | -0.8743 | 120.2105 | 0.0567 |
| 733 | -0.7955 | 104.3789 | 0.0477 |
| 737 | -0.1355 | 28.4339 | 0.0015 |
| 810 | -0.4405 | 65.3534 | 0.0181 |
| 832 | -0.1108 | 32.6043 | 0.0038 |
| 846 | -0.7513 | 87.4465 | 0.0383 |
| 852 | -0.2696 | 56.9353 | 0.0104 |
| 858 | -0.4007 | 64.4823 | 0.0229 |
| 859 | -0.3094 | 39.4765 | 0.0091 |
Here, four spectral transformations, clipping [13], flipping [14], shifting [15] and squaring [12], were applied to the kernel, and their performances are compared with the proposed indefinite kernel ridge regression. As a kernel matrix can always be transformed into reproduced features, we also compared our model with a model trained on KPCA-transformed features (the leading 92 features) [25].

As an example, the measured and estimated $\log(IC_{50})$ values of AID:737 are shown in Fig. 2. Although the same information (the 3D-Tanimoto kernel) is used to train all models, indefinite kernel ridge regression obtains a closer estimation than the other spectral transformation methods.

In this work, the R-squared ($R^2$) value is employed to evaluate the performance of the models [35]. It is defined as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \qquad (46)$$

where $y_i$ are the measured values, $\hat{y}_i$ are the estimated values and $\bar{y}$ is the mean of $y$. Note that $R^2 \le 1$, with $R^2 = 1$ for a perfect model; a higher $R^2$ indicates that the model explains more of the variation in the dataset, and negative values indicate a fit worse than simply predicting the mean.

Experimental results are summarized in Table 3. They show that directly using the indefinite kernel without any modification yields low $R^2$ values; the issue is most pronounced for datasets with $R_{\mathrm{eig}} > 5\%$, for instance AID:652 and AID:713. The results also show that, in handling the indefinite kernel, our method obtains relatively higher $R^2$ values in most cases, the exceptions being AID:466 and AID:858. In AID:466, the flipped kernel happens to be closer to the "true" kernel matrix than our proxy kernel. In AID:858, the original kernel (the 3D-Tanimoto values) may not effectively capture the chemical space, so our proposed method cannot improve the estimation; this is indicated by the relatively low $R^2$ values of all six models. It is worth mentioning that in AID:859, although our method obtains the highest score, the overall $R^2$ values are low, which may be due to the narrow range of the log response values in this dataset. Besides performance, Table 3 also reports the number of iterations needed for the algorithm to converge; the average is around 12 iterations, which is fast in practice.
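For reference, the four baseline spectral transformations compared above (clipping [13], flipping [14], shifting [15] and squaring [12]) and the $R^2$ score of (46) can be written compactly as follows. This is an illustrative sketch consistent with the cited descriptions of these transforms, not the authors' code.

```python
import numpy as np

def _eig(K):
    return np.linalg.eigh((K + K.T) / 2.0)

def clip_kernel(K):       # clipping: set negative eigenvalues to zero
    lam, U = _eig(K)
    return (U * np.clip(lam, 0.0, None)) @ U.T

def flip_kernel(K):       # flipping: take absolute values of the eigenvalues
    lam, U = _eig(K)
    return (U * np.abs(lam)) @ U.T

def shift_kernel(K):      # shifting: add |lambda_min| to the whole spectrum
    lam, U = _eig(K)
    shift = min(lam.min(), 0.0)
    return (U * (lam - shift)) @ U.T

def square_kernel(K):     # squaring: square the spectrum (K @ K for symmetric K)
    lam, U = _eig(K)
    return (U * lam**2) @ U.T

def r_squared(y, y_hat):  # coefficient of determination, eq. (46)
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return 1.0 - np.sum((y - y_hat)**2) / np.sum((y - y.mean())**2)
```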
5. Conclusions

In this paper the application of kernel ridge regression with indefinite kernels is discussed. An extension of kernel ridge regression is proposed which allows domain experts to design kernels specific to a particular application without being constrained by Mercer's condition. This is done by reformulating the conventional kernel ridge regression model with a proxy kernel. The augmented problem is solved by iteratively solving two subproblems; both admit closed-form solutions, so the optimization reduces to computations of a pseudo-inverse and an eigen-decomposition. The algorithm is shown to converge, and in the worst case it converges sub-linearly. To illustrate the performance and application of the algorithm, it is applied to construct QSAR models based on 16 confirmatory BioAssays from the MUV dataset. The results show that in most cases the proposed method outperforms the other spectral transformation methods in terms of the $R^2$ value; that is, using the same similarity information, our method constructs a model that better fits the data.
Fig. 2. Measured and estimated values of log(IC50) in dataset AID:737. Estimation is based on the 3D-Tanimoto kernel; the indefiniteness is handled via clipping, flipping, shifting and the indefinite kernel ridge regression proposed in this paper. Squaring and KPCA are not included in this figure as their estimates lie too far from the measured values and would skew the plot.

Table 3. Experimental results of the different methods for handling indefinite kernels in kernel ridge regression, with γ = 1 and ρ = 100. For the KPCA transformation the leading 92 features are employed, as this is the minimal size among all kernel matrices. The experiments were performed on a 3.6 GHz quad-core (i7-2600) PC. The first seven numeric columns are R-squared values.

| BioAssay ID | Original | Clipping | Flipping | Shifting | Squaring | KPCA | Our method | Iterations |
|---|---|---|---|---|---|---|---|---|
| 466 | -19.1271 | 0.3104 | 0.3951 | -53.3671 | -2.4344 | -2.4324 | 0.3820 | 13 |
| 548 | 0.4480 | 0.4483 | 0.4486 | 0.4476 | -9.8101 | -10.6576 | 0.5444 | 8 |
| 600 | 0.2952 | 0.4342 | 0.4611 | 0.2977 | -0.8104 | -0.8204 | 0.5517 | 13 |
| 644 | 0.4613 | 0.4808 | 0.4895 | 0.4614 | -2.4942 | -2.6586 | 0.6361 | 12 |
| 652 | -0.2215 | 0.4504 | 0.4908 | -0.1634 | -0.8766 | -0.8477 | 0.5698 | 13 |
| 692 | 0.5836 | 0.5837 | 0.5838 | 0.5839 | 0.2461 | 0.2118 | 0.7324 | 10 |
| 712 | 0.4792 | 0.5528 | 0.5649 | 0.4808 | -7.5360 | -7.7524 | 0.7108 | 12 |
| 713 | 0.0922 | 0.2915 | 0.3355 | 0.0948 | -15.8384 | -16.2088 | 0.4018 | 10 |
| 733 | 0.2398 | 0.3726 | 0.4096 | 0.2389 | -6.4279 | -6.6051 | 0.4976 | 12 |
| 737 | 0.5644 | 0.5650 | 0.5654 | 0.5656 | -1.4307 | -1.6029 | 0.7399 | 12 |
| 810 | 0.4667 | 0.5001 | 0.5125 | 0.4674 | -3.8203 | -4.0069 | 0.6484 | 11 |
| 832 | 0.5862 | 0.5862 | 0.5863 | 0.5849 | -2.1651 | -2.4184 | 0.7313 | 10 |
| 846 | 0.5066 | 0.5754 | 0.5970 | 0.5054 | -2.8411 | -2.7561 | 0.6480 | 14 |
| 852 | 0.4971 | 0.5008 | 0.5032 | 0.4964 | -0.0260 | 0.0360 | 0.5929 | 10 |
| 858 | 0.2457 | 0.2638 | 0.2736 | 0.2451 | -60.6345 | -63.1866 | 0.1904 | 17 |
| 859 | -26.4074 | -25.9556 | -25.6873 | -26.4380 | -6514.3475 | -6899.8458 | -23.3733 | 5 |
| Average | -2.5181 | -1.1900 | -1.1544 | -4.6562 | -414.4529 | -438.8470 | -0.9247 | 11.375 |
| Average without AID:859 | -0.9255 | 0.4611 | 0.4811 | -3.2040 | -7.7933 | -8.1137 | 0.5718 | 11.8 |
| Average running time (sec) | 0.0074 | 0.0552 | 0.0522 | 0.0533 | 0.0523 | 0.9873 | 1.0782 | |
Acknowledgment This work was supported by City University of Hong Kong (Grant no. 7003016).
References [1] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000. [2] S. An, W. Liu, S. Venkatesh, Face recognition using kernel ridge regression, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007. CVPR'07, IEEE, Minneapolis, 2007, pp. 1–7.
[3] B.V. Kumar, R. Aravind, Face hallucination using olpp and kernel ridge regression, in: 15th IEEE International Conference on Image Processing, 2008. ICIP 2008, IEEE, San Diego, 2008, pp. 353–356. [4] L. Song, J. Bedo, K.M. Borgwardt, A. Gretton, A. Smola, Gene selection via the basic family of algorithms, Bioinformatics 23 (13) (2007) i490–i498. [5] S. Giguère, M. Marchand, F. Laviolette, A. Drouin, J. Corbeil, Learning a peptideprotein binding affinity predictor with kernel ridge regression, BMC Bioinform. 14 (1) (2013) 82. [6] T. Hinkley, J. Martins, C. Chappey, M. Haddad, E. Stawiski, J.M. Whitcomb, C. J. Petropoulos, S. Bonhoeffer, A systems analysis of mutational effects in hiv-1 protease and reverse transcriptase, Nat. Genet. 43 (5) (2011) 487–489. [7] R. Courant, D. Hilbert, Methods of Mathematical Physics, Willey, New York, 1966. [8] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297. [9] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, C. Watkins, Text classification using string kernels, J. Mach. Learn. Res. 2 (2002) 419–444.
[10] K. Tsuda, T. Kin, K. Asai, Marginalized kernels for biological sequences, Bioinformatics 18 (Suppl 1) (2002) S268–S275. [11] S.V.N. Vishwanathan, N.N. Schraudolph, R. Kondor, K.M. Borgwardt, Graph kernels, J. Mach. Learn. Res. 11 (2010) 1201–1242. [12] Y. Chen, E.K. Garcia, M.R. Gupta, A. Rahimi, L. Cazzanti, Similarity-based classification: concepts and algorithms, J. Mach. Learn. Res. 10 (2009) 747–776. [13] E. Pekalska, P. Paclik, R.P. Duin, A generalized kernel approach to dissimilaritybased classification, J. Mach. Learn. Res. 2 (2002) 175–211. [14] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, K. Obermayer, Classification on pairwise proximity data, in: Advances in Neural Information Processing Systems, 1999, pp. 438–444. [15] V. Roth, J. Laub, M. Kawanabe, J.M. Buhmann, Optimal cluster preserving embedding of nonmetric proximity data, IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1540–1551. [16] C.S. Ong, X. Mary, S. Canu, A.J. Smola, Learning with non-positive kernels, in: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, Banff, 2004, p. 81. [17] J. Chen, J. Ye, Training svm with indefinite kernels, in: Proceedings of the 25th International Conference on Machine Learning, ACM, Helsinki, 2008, pp. 136–143. [18] R. Luss, A. d'Aspremont, Support vector machine classification with indefinite kernels, in: Advances in Neural Information Processing Systems, 2008, pp. 953–960. [19] S. Gu, Y. Guo, Learning svm classifiers with indefinite kernels, in: AAAI, 2012. [20] B. Haasdonk, E. Pekalska, Indefinite kernel fisher discriminant, in: ICPR, 2008, pp. 1–4. [21] E. Pekalska, B. Haasdonk, Kernel discriminant analysis for positive definite and indefinite kernels, IEEE Trans. Pattern Anal. Mach. Intell. 31 (6) (2009) 1017–1032. [22] J.-c. Zhou, D. Wang, An improved indefinite kernel machine regression algorithm with norm-r loss function, in:2011 Fourth International Conference on Information and Computing (ICIC), IEEE, Zhengzhou, 2011, pp. 142–145. [23] H. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Comput. Harmon. Anal. 30 (1) (2011) 96–109. [24] Q. Wu, Regularization networks with indefinite kernels, J. Approx. Theory 166 (2013) 1–18. [25] C. Zhang, F. Nie, S. Xiang, A general kernelization framework for learning algorithms based on kernel pca, Neurocomputing 73 (4) (2010) 959–967. [26] Y. Nesterov, I.E. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, New York, 2004. [27] H. Zhu, T.M. Martin, L. Ye, A. Sedykh, D.M. Young, A. Tropsha, Quantitative structure–activity relationship modeling of rat acute toxicity by oral exposure, Chem. Res. Toxicol. 22 (12) (2009) 1913–1921. [28] C. R Munteanu, E. Fernandez-Blanco, J.A. Seoane, P. Izquierdo-Novo, J. Angel Rodriguez-Fernandez, J. Maria Prieto-Gonzalez, J.R. Rabunal, A. Pazos, Drug discovery and design for complex diseases through qsar computational methods, Curr. Pharm. Des. 16 (24) (2010) 2640–2655. [29] J. Verma, V.M. Khedkar, E.C. Coutinho, 3d-qsar in drug design—a review, Curr. Top. Med. Chem. 10 (1) (2010) 95–115. [30] M.A. Johnson, G.M. Maggiora, Concepts and Applications of Molecular Similarity, 1990. [31] A. Kalászi, D. Szisz, G. Imre, T. Polgár, Screen3d: a novel fully flexible highthroughput shape-similarity search method, J. Chem. Inf. Model. 54 (4) (2014) 1036–1049. [32] W. Deng, A. 
Kalászi, Screen3d: a ligand-based 3d similarity search without conformational sampling, in: International Conference and Exhibition on Computer Aided Drug Design & QSAR, 2012.
[33] S.G. Rohrer, K. Baumann, Maximum unbiased validation (muv) data sets for virtual screening based on pubchem bioactivity data, J. Chem. Inf. Model. 49 (2) (2009) 169–184. [34] Y. Wang, J. Xiao, T.O. Suzek, J. Zhang, J. Wang, S.H. Bryant, Pubchem: a public information system for analyzing bioactivities of small molecules, Nucleic Acids Res. 37 (Suppl 2) (2009) W623–W633. [35] N.R. Draper, H. Smith, Applied Regression Analysis, 2nd ed., 1981.
Benjamin Yee Shing Li received the B.Eng. (Hons.) in Information Engineering from City University of Hong Kong in 2009. He is now working towards the Doctor of Philosophy degree at the same university. His research interests include data mining, machine learning, bioinformatics and cheminformatics.
Lam Fat Yeung received the B.Sc. (Hons.) in Electronic and Electrical Engineering from Portsmouth University, UK, in 1978, and the Ph.D. in Control Engineering from Imperial College of Science, Technology and Medicine in 1990. Currently he is an associate professor at City University of Hong Kong. He was the Chair of the IET Hong Kong Branch and the Chair of the Joint Chapter of the Robotics & Automation and Control Society, IEEE Hong Kong Section. His current research interests are systems biology, control systems and optimisation.
King Tim Ko received the B.Eng. (Hons.) and Ph.D. degrees from The University of Adelaide, Adelaide, Australia. He worked several years with the Telecom Australia Research Laboratories, Melbourne, Australia, before joining the Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong, in 1986, where he is currently an associate professor. His research interests are in the performance evaluation of communication networks and mobile networks.