Indefinite kernel ridge regression and its application on QSAR modelling


Neurocomputing 158 (2015) 127–133


Benjamin Yee Shing Li*, Lam Fat Yeung, King Tim Ko
Department of Electronic Engineering, City University of Hong Kong, Hong Kong
* Corresponding author. E-mail addresses: [email protected] (B.Y.S. Li), [email protected] (L.F. Yeung), [email protected] (K.T. Ko).
http://dx.doi.org/10.1016/j.neucom.2015.01.060


Abstract

Article history: Received 24 September 2014; Received in revised form 10 December 2014; Accepted 28 January 2015; Available online 12 February 2015. Communicated by Feiping Nie.

Recently, the use of indefinite kernels in machine learning has attracted considerable attention. However, most work has focused on classification techniques, and less has been devoted to regression models. In this paper, an indefinite kernel ridge regression model is proposed to adapt ridge regression to indefinite kernels. Instead of performing a spectral transformation on the kernel matrix, a positive semi-definite proxy kernel is constructed to approximate the indefinite kernel, which relaxes the usual requirement that the kernel be positive semi-definite. The sensitivity of the proxy kernel to its distance from the indefinite kernel is controlled by a parameter ρ. This approach allows one to construct regression models of response values based on the similarities of the corresponding objects, where the requirement that similarity measures satisfy Mercer's condition is relaxed. To illustrate the use of the algorithm, it is applied to quantitative structure-activity relationship (QSAR) modelling over 16 drug targets. © 2015 Elsevier B.V. All rights reserved.

Keywords: Regression analysis; Indefinite kernel; Computer aided drug design; Quantitative structure-activity relationship (QSAR) modelling

1. Introduction

The relationship between data and response values is always of great interest, as it can be used to explain the structure or mechanism of a complex or non-trivial system (for instance, the protein–ligand interactome) or, furthermore, to construct a predictive model. This analysis can be done via statistical learning techniques such as regression analysis, for instance kernel ridge regression [1], which can be considered a kernelized least-squares procedure with regularization. With kernelization, a major restriction of ridge regression, namely that data points and estimated values must be linearly related, is relaxed by selecting different feature transforms or kernel functions. Due to the flexibility and simplicity of the regression model, it is widely applied in domains such as image processing, bioinformatics, and cheminformatics [2–6].

In traditional practice, the kernel is constructed by applying a valid kernel function to feature vectors. Here, valid kernels refer to those that satisfy Mercer's condition [7,8]. In this setting, the resulting kernels are always positive semi-definite (p.s.d.) and work well in most kernelized learning methods. However, for some kinds of data, such as structural objects like 3D conformations of molecular structures and DNA sequences, features are hard to extract.


To apply kernelized methods over these structural objects, significant effort has been devoted to the design of customized kernels such as graph kernels and string kernels [9–11]. As the design of the kernel directly affects the interpretation of similarities amongst the data, these pre-defined kernels may not fit all applications. A pragmatic approach is to design kernels based on the knowledge of domain experts [12], in which case Mercer's condition may become a restriction or burden on the design of the kernel. To relieve this issue, we are motivated to extend kernel ridge regression to indefinite kernels.

Many studies on indefinite kernel methods have focused on classification techniques such as the support vector machine (SVM), and fewer are devoted to regression models. One type of approach is based on spectral transforms of the kernel matrix, including clipping [13], flipping [14], shifting [15] and squaring [12] the spectrum. The main advantages of these approaches are that they are independent of the learning model and simple to implement. However, during the transformation, noise may be added and information may be lost. Another type of approach modifies the construction of the original learning model so that the p.s.d. constraint is relaxed [16–21]. The advantage of this type of approach is that the entire similarity information (the indefinite kernel) is retained, and the remaining task reduces to the modification of the learning model. Alternatively, some recent studies have also shown promising results using indefinite kernels in regression analysis. In [22], the support vector regression model is modified to adapt to indefinite kernels. The performance and approximation error of regression with indefinite kernels via a coefficient-regularized least squares approach has also been studied [23].


This approach is further improved via a kernel decomposition technique proposed in [24]. Note that when the kernel is symmetric, this kernel decomposition is similar to the one proposed in [13].

In this paper an indefinite kernel extension of ridge regression is proposed. In our formulation, the indefinite kernel is replaced by a proxy kernel, and a penalty on the distance between the proxy kernel and the original indefinite kernel is added. This penalty is weighted by a parameter ρ, which allows users to control the sensitivity of the algorithm: a lower value of ρ allows the algorithm to search for the proxy kernel over a larger range, and vice versa. The problem is then solved by decomposing the search space into two subspaces and iteratively solving two subproblems. Our analysis shows that the algorithm converges and, in the worst case, it converges sub-linearly. In addition, the parameters ρ and γ² can be interpreted as learning rates of the algorithm. To illustrate the performance of the proposed method, it is applied to quantitative structure-activity relationship (QSAR) modelling over 16 sets of drug targets and their corresponding drug candidates. Compared with other spectral transform methods [12–15] and a kernel principal component analysis (KPCA) transformation [25], the results show that the proposed method provides a significant improvement in the R² value of the QSAR model.

2. Background

The following are some basic definitions and properties of the kernel trick and of kernel ridge regression.

2.1. Kernel trick

Given a space $S$, let $\phi : S \to T$ be a feature transform which maps data points in the space $S$ to a Hilbert space $T$. A kernel function $K : S \times S \to \mathbb{R}$ is defined as

$$K(x, y) = \langle \phi(x), \phi(y) \rangle \qquad (1)$$

where $\langle \cdot, \cdot \rangle$ is the inner product over the space $T$. The kernel trick is the direct application of the kernel function $K(x, y)$ in the model instead of computing the transformed features $\phi(x)$, $\phi(y)$. This trick gives the user freedom in manipulating the structure of the space or the distribution of the data points. In addition, once a kernel function $K(\cdot, \cdot)$ is defined, the feature extraction process can be avoided, which is an advantage when mining over structured objects. Usually, the design of $K(x, y)$ is based on the similarity between $x$ and $y$, since $\langle \phi(x), \phi(y) \rangle$ describes the similarity between the two transformed feature vectors $\phi(x)$ and $\phi(y)$.

2.2. Kernel ridge regression

Given a collection of data pairs $(x_i, y_i)$, where $x_i \in \mathbb{R}^N$ is the $N$-dimensional feature vector and $y_i \in \mathbb{R}$ is the corresponding response, ridge regression can be represented as a regularized linear regression model

$$y = x^T \theta \qquad (2)$$

The optimal $\theta$ can be obtained by minimizing the following loss function:

$$P_0: \; \min_{\theta, e} \{ \|e\|_2^2 + \gamma \|\theta\|_2^2 \} \quad \text{s.t.} \quad e = y - X^T \theta \qquad (3)$$

where $\gamma$ is a non-negative regularization term, $y = [y_1 \cdots y_N]^T$ and $X = [x_1 \cdots x_N]$. The Lagrangian of $P_0$ is

$$L(\theta, e, \beta) = e^T e + \gamma \theta^T \theta + \beta^T (y - X^T \theta - e) \qquad (4)$$

By the Karush–Kuhn–Tucker (KKT) conditions of (4), we have

$$e = \frac{\beta}{2} \qquad (5)$$

$$\theta = \frac{X \beta}{2\gamma} \qquad (6)$$

By plugging (5) and (6) back into (4), we obtain

$$L = -\frac{\beta^T \beta}{4} + \beta^T y - \frac{\beta^T X^T X \beta}{4\gamma} \qquad (7)$$

Then, by letting $\alpha = \beta / 2\gamma$ and defining the kernel matrix $K = X^T X$,

$$L = -\gamma^2 \alpha^T \alpha + 2\gamma \alpha^T y - \gamma \alpha^T K \alpha \qquad (8)$$

Hence we have the following dual problem:

$$P_1: \; \min_{\alpha} \{ \gamma^2 \alpha^T \alpha - 2\gamma \alpha^T y + \gamma \alpha^T K \alpha \} \qquad (9)$$

Note that in general the kernel matrix $K$ is assumed to be p.s.d., yet in this paper we are interested in the indefinite case.
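For a p.s.d. kernel, the dual problem $P_1$ in (9) has the closed-form minimizer $\alpha = (K + \gamma I)^{-1} y$ (the same expression reappears as the $\alpha$-step (13) of the proposed algorithm below). The following minimal NumPy sketch is our illustration rather than code from the paper; the function names are ours, and the prediction rule is the usual kernel ridge predictor $\hat{y} = K_{\mathrm{test}} \alpha$.

    import numpy as np

    def krr_fit(K, y, gamma):
        # Dual solution of P1: alpha = (K + gamma * I)^{-1} y
        n = K.shape[0]
        return np.linalg.solve(K + gamma * np.eye(n), y)

    def krr_predict(K_test, alpha):
        # K_test[i, j] = k(test point i, training point j)
        return K_test @ alpha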

3. Indefinite kernel ridge regression

3.1. Algorithm

In most cases, the kernel matrix is constructed through a valid kernel function or kernel transform, and hence it is guaranteed to be p.s.d. Sometimes, however, the kernel values are computed directly from a similarity function, which does not guarantee the p.s.d.-ness of the kernel matrix. To handle such cases, we propose an indefinite kernel extension of ridge regression. Consider the problem

$$P_2: \; \min_{\alpha, K \succeq 0} f(\alpha, K) = \min_{\alpha, K \succeq 0} \big\{ \gamma^2 \alpha^T \alpha - 2\gamma \alpha^T y + \gamma \alpha^T K \alpha + \rho \| K - \tilde{K} \|_F^2 \big\} \qquad (10)$$

where $\tilde{K}$ is the indefinite kernel matrix and $\rho \ge 0$ is a constant. $P_2$ relaxes the p.s.d. requirement through the use of a proxy kernel $K$; the indefinite kernel matrix $\tilde{K}$ is considered a noisy version of the proxy kernel. To prevent the proxy kernel $K$ from drifting too far away from $\tilde{K}$, a penalty on the Frobenius distance between $K$ and $\tilde{K}$ is added to the cost function.

In (10), $f(\alpha, K)$ is not jointly convex over $\alpha$ and $K$, which implies that there is no guarantee of reaching the global optimum of $\alpha$ and $K$. However, it is convex in $\alpha$ and in $K$ separately; hence in this work we propose to solve (10) by alternately solving the following two subproblems. $P_3$ and $P_4$ are subproblems of $P_2$ solved over the subspaces of $\alpha$ and of $K \succeq 0$ respectively:

$$P_3: \; g_1(\hat{K}) = \arg\min_{\alpha} L_1(\alpha, \hat{K}) = \arg\min_{\alpha} \big\{ \gamma^2 \alpha^T \alpha - 2\gamma \alpha^T y + \gamma \alpha^T \hat{K} \alpha \big\} \qquad (11)$$

$$P_4: \; g_2(\hat{\alpha}) = \arg\min_{K \succeq 0} L_2(\hat{\alpha}, K) = \arg\min_{K \succeq 0} \big\{ \rho \| K - \tilde{K} \|_F^2 + \gamma \hat{\alpha}^T K \hat{\alpha} \big\} \qquad (12)$$

Both $P_3$ and $P_4$ admit closed-form solutions. $P_3$ can be solved by applying the optimality condition $\partial L_1 / \partial \alpha = 0$, which leads to

$$g_1(\hat{K}) = (\hat{K} + \gamma I)^{-1} y \qquad (13)$$

$P_4$ is in fact a projection of a specific matrix onto the p.s.d. cone:

$$P_5: \; \arg\min_{K \succeq 0} \Big\| K - \Big( \tilde{K} - \frac{\gamma}{2\rho} \hat{\alpha} \hat{\alpha}^T \Big) \Big\|_F \qquad (14)$$

Consider a symmetric matrix $X$ with eigen-decomposition $X = U \Lambda U^T$, where $U = [u_1 \cdots u_N]$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_N)$. Let $(X)_+ = \sum_{\lambda_k > 0} \lambda_k u_k u_k^T$, i.e. the positive part of $X$.

The optimal solution of $P_5$ can then be written as

$$g_2(\hat{\alpha}) = \Big( \tilde{K} - \frac{\gamma}{2\rho} \hat{\alpha} \hat{\alpha}^T \Big)_+ \qquad (15)$$

Starting with an initial $\alpha_0$, $P_2$ can be solved via the following iterative procedure:

$$(\alpha_0) \;\to\; (\alpha_0, K_0) \;\to\; (\alpha_1, K_0) \;\to\; (\alpha_1, K_1) \;\to\; \cdots \qquad (16)$$

where in the $\alpha$-step $\alpha_i = g_1(K_{i-1})$ and in the $K$-step $K_i = g_2(\alpha_i)$. The procedure is summarized as Algorithm 1.

Algorithm 1. Indefinite kernel ridge regression.

    Data: $\gamma$, $\rho$, $\tilde{K}$, $y$
    Result: $\alpha$, $K$
    initialization: $\alpha \leftarrow \mathrm{random}()$;
    while $f$ has not converged do
        $K \leftarrow \big( \tilde{K} - \frac{\gamma}{2\rho} \alpha \alpha^T \big)_+$;
        $\alpha \leftarrow (K + \gamma I)^{-1} y$;
        $f \leftarrow \gamma^2 \alpha^T \alpha - 2\gamma \alpha^T y + \gamma \alpha^T K \alpha + \rho \| K - \tilde{K} \|_F^2$;
    end
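A compact NumPy sketch of Algorithm 1 is given below. It is an illustrative re-implementation, not the authors' code; the symmetrization inside the projection, the convergence tolerance and the iteration cap are our own choices.

    import numpy as np

    def psd_part(M):
        # (M)_+ : discard the negative part of the spectrum, as in Eq. (15)
        w, U = np.linalg.eigh((M + M.T) / 2)   # symmetrize for numerical safety
        return (U * np.clip(w, 0.0, None)) @ U.T

    def indefinite_krr(K_tilde, y, gamma=1.0, rho=100.0, tol=1e-6, max_iter=200):
        # Illustrative sketch of Algorithm 1 (not the authors' implementation)
        n = K_tilde.shape[0]
        alpha = np.random.randn(n)             # random initialization, as in Algorithm 1
        f_prev = np.inf
        for _ in range(max_iter):
            # K-step, Eq. (15): project K_tilde - (gamma/2rho) alpha alpha^T onto the p.s.d. cone
            K = psd_part(K_tilde - (gamma / (2.0 * rho)) * np.outer(alpha, alpha))
            # alpha-step, Eq. (13): alpha = (K + gamma I)^{-1} y
            alpha = np.linalg.solve(K + gamma * np.eye(n), y)
            # objective, Eq. (10)
            f = (gamma**2 * alpha @ alpha - 2.0 * gamma * alpha @ y
                 + gamma * alpha @ K @ alpha
                 + rho * np.linalg.norm(K - K_tilde, 'fro')**2)
            if abs(f_prev - f) < tol:
                break
            f_prev = f
        return alpha, K

Each loop iteration costs one eigen-decomposition (the $K$-step) and one linear solve (the $\alpha$-step), both closed forms from (15) and (13).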

3.2. Properties and analysis

In the iterative process (16), the objective function $f(\alpha, K)$ is minimized at each step. That is,

$$f(\alpha, K_i) \ge f(\alpha_{i+1}, K_i) \qquad (17)$$

for all $\alpha$, and

$$f(\alpha_{i+1}, K) \ge f(\alpha_{i+1}, K_{i+1}) \qquad (18)$$

for all $K \succeq 0$. Hence the value of the objective function $f(\alpha, K)$ is monotonically decreasing over the iterative process. In addition, the objective function can be written as $f(\alpha, K) = \|\gamma\alpha - y\|_2^2 - \|y\|_2^2 + \gamma\alpha^T K\alpha + \rho\|K - \tilde{K}\|_F^2$. As $K \succeq 0$, we have $\gamma\alpha^T K\alpha \ge 0$, and $f(\alpha, K)$ is bounded below by $-\|y\|_2^2$. Combining these two facts, the algorithm converges.

If $f$ is strongly convex in $\alpha$ and in $K$ respectively, the reduction at each step can be bounded below in terms of the step size, as follows.

Theorem 1. $f(\alpha, \hat{K})$ is strongly convex in $\alpha$ with modulus $2\gamma^2 + 2\gamma\lambda_{\min}(\hat{K})$.

Proof. We first consider the gradient of $f$ with respect to $\alpha$:

$$\nabla_\alpha f(\alpha) = \frac{\partial f}{\partial \alpha} = 2\gamma^2\alpha - 2\gamma y + 2\gamma\hat{K}\alpha \qquad (19)$$

$$\nabla_\alpha f(\alpha_1) - \nabla_\alpha f(\alpha_2) = (2\gamma^2 I + 2\gamma\hat{K})(\alpha_1 - \alpha_2) \qquad (20)$$

$$(\nabla_\alpha f(\alpha_1) - \nabla_\alpha f(\alpha_2))^T(\alpha_1 - \alpha_2) = 2\gamma^2\|\alpha_1 - \alpha_2\|_2^2 + 2\gamma(\alpha_1 - \alpha_2)^T\hat{K}(\alpha_1 - \alpha_2) \qquad (21)$$

$$(\nabla_\alpha f(\alpha_1) - \nabla_\alpha f(\alpha_2))^T(\alpha_1 - \alpha_2) \ge \big(2\gamma^2 + 2\gamma\lambda_{\min}(\hat{K})\big)\|\alpha_1 - \alpha_2\|_2^2 \qquad (22)$$

By (22), since $\gamma$ and $\lambda_{\min}(\hat{K})$ are both non-negative, $f$ is strongly convex in $\alpha$ with modulus $2\gamma^2 + 2\gamma\lambda_{\min}(\hat{K})$. □

Theorem 2. $f(\hat{\alpha}, K)$ is strongly convex in $K$ with modulus $2\rho$.

Proof. We first consider the gradient of $f$ with respect to $K$:

$$\nabla_K f(K) = \frac{\partial f}{\partial K} = \gamma\alpha\alpha^T + 2\rho K - 2\rho\tilde{K} \qquad (23)$$

$$\nabla_K f(K_1) - \nabla_K f(K_2) = 2\rho(K_1 - K_2) \qquad (24)$$

$$\mathrm{tr}\big((\nabla_K f(K_1) - \nabla_K f(K_2))^T(K_1 - K_2)\big) = 2\rho\|K_1 - K_2\|_F^2 \qquad (25)$$

By (25), since $\rho$ is non-negative, $f$ is strongly convex in $K$ with modulus $2\rho$. □

Lemma 1 (Property of strongly convex functions, Nesterov [26]). For any strongly convex function $f$ with modulus $m$ and any two points $x$, $y$,

$$f(y) \ge f(x) + \nabla f(x)^T(y - x) + \frac{m}{2}\|y - x\|_2^2 \qquad (26)$$

Lemma 2. For any p.s.d. matrices $X$ and $Y$, the trace of their product is non-negative:

$$\mathrm{tr}(XY) \ge 0 \qquad (27)$$

Proof. Let $X = LL^T$, where $L$ is a positive semi-definite square root of $X$. Then

$$\mathrm{tr}(XY) = \mathrm{tr}(L^T Y L) \ge 0 \qquad (28)$$

□

Combining the strong convexity of $f$ in $\alpha$ and in $K$ with Lemmas 1 and 2, we obtain the following two properties.

Property 3. For the $i$-th $\alpha$-step, the cost function $f$ is reduced by a non-negative value bounded below by

$$\gamma^2\|\alpha_i - \alpha_{i-1}\|_2^2 \qquad (29)$$

Proof. As $f$ is strongly convex in $\alpha$, by Lemma 1 we have

$$f(\alpha_{i-1}, K_{i-1}) - f(\alpha_i, K_{i-1}) \ge (\alpha_{i-1} - \alpha_i)^T(2\gamma^2\alpha_i - 2\gamma y + 2\gamma K_{i-1}\alpha_i) + \big(\gamma^2 + \gamma\lambda_{\min}(K_{i-1})\big)\|\alpha_i - \alpha_{i-1}\|_2^2 \qquad (30)$$

By plugging $\alpha_i = (K_{i-1} + \gamma I)^{-1} y$ into (30), we have

$$f(\alpha_{i-1}, K_{i-1}) \ge f(\alpha_i, K_{i-1}) + \big(\gamma^2 + \gamma\lambda_{\min}(K_{i-1})\big)\|\alpha_i - \alpha_{i-1}\|_2^2 \ge f(\alpha_i, K_{i-1}) + \gamma^2\|\alpha_i - \alpha_{i-1}\|_2^2 \qquad (31)$$

where $\lambda_{\min}(K_{i-1}) \ge 0$ since $K_{i-1}$ is p.s.d. □

Similarly, an analogous relation can be obtained for the $K$-step.

Property 4. For the $i$-th $K$-step, the cost function $f$ is reduced by a non-negative value bounded below by

$$\rho\|K_i - K_{i-1}\|_F^2 \qquad (32)$$

Proof. By Lemma 1, with the modulus $2\rho$ from Theorem 2, we have

$$f(\alpha_i, K_{i-1}) - f(\alpha_i, K_i) \ge \mathrm{tr}\big[(K_{i-1} - K_i)^T(\gamma\alpha_i\alpha_i^T + 2\rho K_i - 2\rho\tilde{K})\big] + \rho\|K_i - K_{i-1}\|_F^2 = 2\rho\,\mathrm{tr}\Big[(K_{i-1} - K_i)^T\Big(K_i - \Big(\tilde{K} - \frac{\gamma}{2\rho}\alpha_i\alpha_i^T\Big)\Big)\Big] + \rho\|K_i - K_{i-1}\|_F^2 \qquad (33)$$

Since $\rho\|K_i - K_{i-1}\|_F^2$ is non-negative, it remains to show that $\mathrm{tr}\big[(K_{i-1} - K_i)^T\big(K_i - \big(\tilde{K} - \frac{\gamma}{2\rho}\alpha_i\alpha_i^T\big)\big)\big]$ is non-negative. Recall that $K_i$ is the p.s.d. projection of $\tilde{K} - \frac{\gamma}{2\rho}\alpha_i\alpha_i^T$; hence $K_i^T\big(\tilde{K} - \frac{\gamma}{2\rho}\alpha_i\alpha_i^T\big) = \sum_{\lambda_j > 0}\lambda_j^2 u_j u_j^T = K_i^T K_i$, where $\lambda_j$ and $u_j$ are the corresponding eigenvalues and eigenvectors. We therefore have

$$\mathrm{tr}\Big[(K_{i-1} - K_i)^T\Big(K_i - \Big(\tilde{K} - \frac{\gamma}{2\rho}\alpha_i\alpha_i^T\Big)\Big)\Big] = \mathrm{tr}\Big[K_{i-1}^T\Big(K_i - \Big(\tilde{K} - \frac{\gamma}{2\rho}\alpha_i\alpha_i^T\Big)\Big)\Big] \qquad (34)$$

Since $K_{i-1}$ and $K_i - \big(\tilde{K} - \frac{\gamma}{2\rho}\alpha_i\alpha_i^T\big)$ are both p.s.d., by Lemma 2 the trace of their product is non-negative. Thus,

$$f(\alpha_i, K_{i-1}) - f(\alpha_i, K_i) \ge \rho\|K_i - K_{i-1}\|_F^2 \qquad (35)$$

□
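Properties 3 and 4 are easy to check numerically. The sketch below is ours, with a random synthetic indefinite matrix standing in for a real similarity kernel; it runs the two steps of Algorithm 1 and asserts that the decrease of $f$ at every iteration is at least $\gamma^2\|\alpha_i - \alpha_{i-1}\|_2^2 + \rho\|K_i - K_{i-1}\|_F^2$.

    import numpy as np

    def check_decrease(n=40, gamma=1.0, rho=100.0, iters=15, seed=0):
        # Numerical check of Properties 3 and 4 on synthetic data (illustrative only)
        rng = np.random.default_rng(seed)
        A = rng.standard_normal((n, n))
        K_tilde = (A + A.T) / 2                      # symmetric but indefinite in general
        y = rng.standard_normal(n)

        def psd_part(M):
            w, U = np.linalg.eigh((M + M.T) / 2)
            return (U * np.clip(w, 0.0, None)) @ U.T

        def f(a, K):
            return (gamma**2 * a @ a - 2 * gamma * a @ y + gamma * a @ K @ a
                    + rho * np.linalg.norm(K - K_tilde, 'fro')**2)

        alpha = rng.standard_normal(n)
        K = psd_part(K_tilde - gamma / (2 * rho) * np.outer(alpha, alpha))   # K_0
        for _ in range(iters):
            a_new = np.linalg.solve(K + gamma * np.eye(n), y)                # alpha-step
            K_new = psd_part(K_tilde - gamma / (2 * rho) * np.outer(a_new, a_new))  # K-step
            drop = f(alpha, K) - f(a_new, K_new)
            bound = (gamma**2 * np.sum((a_new - alpha)**2)
                     + rho * np.linalg.norm(K_new - K, 'fro')**2)
            assert drop >= bound - 1e-8              # Properties 3 and 4 combined
            alpha, K = a_new, K_new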

Properties 3 and 4 provide insight into how the parameters $\gamma$ and $\rho$ affect the search for the optimum $f^*$. As the decrease of $f$ is bounded from below by the step size with rates $\gamma^2$ and $\rho$, larger $\gamma$ and $\rho$ lead to faster minimization. Yet, in this case the regularization of $\alpha$ is stronger and the proxy kernel is tied more tightly to the original kernel $\tilde{K}$; hence the choice of $\gamma$ and $\rho$ depends on the application. The following property summarizes the reduction of the cost function $f$ from the $p$-th step to the $q$-th step by combining Properties 3 and 4.

Property 5. From the $p$-th iteration to the $q$-th iteration, the cost function is reduced by a non-negative value bounded below by

$$\gamma^2\sum_{i=p+1}^{q}\|\alpha_i - \alpha_{i-1}\|_2^2 + \rho\sum_{i=p+1}^{q}\|K_i - K_{i-1}\|_F^2 \qquad (36)$$

Proof. From Property 3 we have

$$f(\alpha_{i-1}, K_{i-1}) - f(\alpha_i, K_{i-1}) \ge \gamma^2\|\alpha_i - \alpha_{i-1}\|_2^2, \qquad (37)$$

and from Property 4 we have

$$f(\alpha_i, K_{i-1}) - f(\alpha_i, K_i) \ge \rho\|K_i - K_{i-1}\|_F^2 \qquad (38)$$

Thus,

$$f(\alpha_p, K_p) - f(\alpha_q, K_q) \ge \gamma^2\sum_{i=p+1}^{q}\|\alpha_i - \alpha_{i-1}\|_2^2 + \rho\sum_{i=p+1}^{q}\|K_i - K_{i-1}\|_F^2 \qquad (39)$$

for all $p < q$. □

With Property 5, the rate of convergence of the algorithm can be studied further.

Property 6. Suppose the function $f(\alpha, K)$ converges to $f^*$. Then

$$\lim_{i\to\infty}\frac{|f(\alpha_{i+1}, K_{i+1}) - f^*|}{|f(\alpha_i, K_i) - f^*|} \le 1 \qquad (40)$$

Proof. From (39) we have

$$f(\alpha_{i+1}, K_{i+1}) \le f(\alpha_i, K_i) - \gamma^2\|\alpha_{i+1} - \alpha_i\|_2^2 - \rho\|K_{i+1} - K_i\|_F^2 \qquad (41)$$

Since $f$ is monotonically decreasing, $|f(\alpha_i, K_i) - f^*| = f(\alpha_i, K_i) - f^*$, so

$$\frac{|f(\alpha_{i+1}, K_{i+1}) - f^*|}{|f(\alpha_i, K_i) - f^*|} = \frac{f(\alpha_{i+1}, K_{i+1}) - f^*}{f(\alpha_i, K_i) - f^*} \qquad (42)$$

$$\frac{|f(\alpha_{i+1}, K_{i+1}) - f^*|}{|f(\alpha_i, K_i) - f^*|} \le \frac{f(\alpha_i, K_i) - \gamma^2\|\alpha_{i+1} - \alpha_i\|_2^2 - \rho\|K_{i+1} - K_i\|_F^2 - f^*}{f(\alpha_i, K_i) - f^*} \qquad (43)$$

$$\frac{|f(\alpha_{i+1}, K_{i+1}) - f^*|}{|f(\alpha_i, K_i) - f^*|} \le 1 - \frac{\gamma^2\|\alpha_{i+1} - \alpha_i\|_2^2 + \rho\|K_{i+1} - K_i\|_F^2}{f(\alpha_i, K_i) - f^*} \qquad (44)$$

$$\frac{|f(\alpha_{i+1}, K_{i+1}) - f^*|}{|f(\alpha_i, K_i) - f^*|} \le 1 \qquad (45)$$

□

From Property 6, one can see that in the worst case the algorithm may converge sub-linearly. Nevertheless, in our experiments the algorithm converged within a practical number of iterations under reasonable tolerances.

4. Case studies and discussions

Here, the QSAR modelling problem is studied via indefinite kernel ridge regression, and its performance is compared with other spectral transform methods.

4.1. Quantitative structure-activity relationship (QSAR) modelling

QSAR modelling is a technique widely used in computer aided drug design to analyze the performance or toxicity level of candidate drugs [27,28]. It is a model of the dependence between the structural information of compounds and their chemical responses toward a specific target [29]. This is based on the structure-activity relationship, a fundamental principle in medicinal chemistry, which states that structurally similar compounds should also behave similarly [30]. If the model is constructed from structural similarities between compounds, the task can be considered a kernel regression problem. However, the similarity measure may not be guaranteed to satisfy Mercer's condition; in this case the kernel matrix is indefinite and the problem is in fact an indefinite kernel regression problem.

In this paper we conduct QSAR modelling of the half maximal inhibitory concentration (IC50) or half maximal effective concentration (EC50) based on the 3D structural similarities of compounds. Here, the 3D-Tanimoto score is employed as the similarity information between compounds [31,32]. It is a score computed via flexible alignment between the 3D conformations of ligands. Fig. 1 shows an example of the alignment between C20H12F2N2O5S and C19H12ClFN2O7S using Screen3D [31,32], which yields a 3D-Tanimoto score of 0.8851.

Fig. 1. 3D models of (a) C20H12F2N2O5S (PubChem Compound ID: 16072278), (b) C19H12ClFN2O7S (PubChem Compound ID: 16072279) and (c) their 3D alignment via Screen3D.

4.2. Dataset

In this experiment we employed 16 confirmatory BioAssays from the Maximum Unbiased Validation (MUV) dataset [33]. The response values and molecular structures were collected from the PubChem database [34]. Table 1 summarizes the datasets. Note that as the response values y vary widely in magnitude, log(y) is modelled instead of y.

4.3. Results

Table 2 summarizes the spectrum of the 3D-Tanimoto kernel for each dataset. According to the table, all the kernels are indefinite, as every λmin is negative. Furthermore, the ratio Reig between the sum of the negative eigenvalues and the sum of the positive eigenvalues is not negligible; in some of the datasets, such as AID:652 and AID:713, the ratio Reig is even higher than 5%.


Table 1. Details of the confirmatory BioAssays used in this paper.

BioAssay ID | Target | Type of interaction | No. of ligands | Response value (y) | mean(log(y)) | std(log(y)) | min(log(y)) | max(log(y))
466 | S1PR1 | Agonist | 489 | EC50 | 2.152 | 1.2995 | -2.2349 | 4.5539
548 | PKACA | Inhibitor | 92 | IC50 | 3.2979 | 0.992 | -0.3827 | 5.88
600 | STF1 | Inhibitor | 346 | IC50 | 1.5792 | 1.4091 | -5.2923 | 4.5951
644 | ROCK2 | Inhibitor | 203 | IC50 | 2.6407 | 1.4901 | -5.7992 | 6.9167
652 | 1VRU | Inhibitor | 388 | IC50 | 1.413 | 1.2424 | -2.3645 | 3.912
689 (a) | EPHA4 | Inhibitor | 61 | N/A | N/A | N/A | N/A | N/A
692 | STF1 | Agonist | 462 | EC50 | 0.8756 | 1.4278 | -1.9805 | 4.5951
712 | HSP90AA1 | Inhibitor | 105 | IC50 | 3.264 | 1.1329 | -1.9929 | 3.912
713 | ESR1 | Inhibitor | 256 | IC50 | 3.1889 | 0.7847 | 0.4463 | 3.912
733 | ESR2 | Inhibitor | 436 | IC50 | 2.9501 | 1.117 | 0.4463 | 3.912
737 | ESR1 | Potentiator | 384 | EC50 | 3.4788 | 2.3777 | 0.4587 | 9.9218
810 | FAK | Kinase inhibitor | 94 | IC50 | 2.4141 | 1.1391 | -1.0906 | 4.2485
832 | CG | Inhibitor | 208 | IC50 | 2.336 | 1.3461 | -0.4603 | 3.912
846 | FXI | Inhibitor | 93 | IC50 | 2.9398 | 1.6107 | -3.3744 | 3.912
852 | FXII | Inhibitor | 288 | IC50 | 1.1387 | 1.6643 | -4.5637 | 3.8129
858 | DAD1 | Allosteric modulator | 146 | EC50 | -11.3185 | 1.4216 | -20.092 | -6.7254
859 | CHRM1 | Allosteric inhibitor | 222 | IC80 (b) | 6.4919 | 0.0784 | 6.254 | 6.6776

(a) No response value is provided; this dataset is ignored.
(b) No IC50 value is provided; IC80 is used instead.

Table 2. Summary of the spectrum of the 3D-Tanimoto matrix. $\lambda_i$ denotes the $i$-th eigenvalue of the matrix, $\lambda_{\min} = \min_i \lambda_i$, $\lambda_{\max} = \max_i \lambda_i$ and $R_{\mathrm{eig}} = \big|\sum_{\lambda_i<0}\lambda_i\big| / \big|\sum_{\lambda_i>0}\lambda_i\big|$.

BioAssay ID | λmin | λmax | Reig
466 | -1.3834 | 144.3584 | 0.084
548 | -0.2807 | 33.1673 | 0.0058
600 | -0.7475 | 108.251 | 0.0442
644 | -0.4405 | 65.5507 | 0.0226
652 | -0.9266 | 105.3444 | 0.0542
692 | -0.1316 | 35.6528 | 0.0037
712 | -0.7043 | 82.0128 | 0.0314
713 | -0.8743 | 120.2105 | 0.0567
733 | -0.7955 | 104.3789 | 0.0477
737 | -0.1355 | 28.4339 | 0.0015
810 | -0.4405 | 65.3534 | 0.0181
832 | -0.1108 | 32.6043 | 0.0038
846 | -0.7513 | 87.4465 | 0.0383
852 | -0.2696 | 56.9353 | 0.0104
858 | -0.4007 | 64.4823 | 0.0229
859 | -0.3094 | 39.4765 | 0.0091
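The indefiniteness statistics reported in Table 2 can be reproduced from any symmetric similarity matrix with a few lines of NumPy. The sketch below is ours; the file name is a hypothetical placeholder for a matrix of pre-computed pairwise 3D-Tanimoto scores.

    import numpy as np

    def spectrum_summary(K):
        # K: symmetric pairwise similarity (kernel) matrix, e.g. 3D-Tanimoto scores
        lam = np.linalg.eigvalsh((K + K.T) / 2)
        r_eig = abs(lam[lam < 0].sum()) / abs(lam[lam > 0].sum())   # R_eig as defined in Table 2
        return lam.min(), lam.max(), r_eig

    # K = np.loadtxt("tanimoto_matrix.txt")   # hypothetical input file
    # print(spectrum_summary(K))              # lam_min < 0 indicates an indefinite kernel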

This reflects that directly using the 3D-Tanimoto scores as a kernel may suffer from a significant indefiniteness issue. Here, four spectral transformations, clipping [13], flipping [14], shifting [15] and squaring [12], were applied to the kernel, and their performances are compared with our proposed indefinite kernel ridge regression method. As a kernel matrix can always be transformed into reproduced features, we also compared our model with a model trained on KPCA-transformed features (the leading 92 features) [25].

As an example, the measured and estimated log(IC50) values of AID:737 are shown in Fig. 2. Although the same information (the 3D-Tanimoto kernel) is used to train every model, indefinite kernel ridge regression obtains a closer estimation than the other spectral transformation methods.

In this work, the R-squared (R²) value is employed to evaluate the performance of the models [35]. It is defined as

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \qquad (46)$$

where $y_i$ are the measured values, $\hat{y}_i$ are the estimated values and $\bar{y}$ is the mean of $y$.

Note that $R^2 \le 1$ (it can be negative for a poor fit); a higher $R^2$ indicates that the model explains more of the variation in the original dataset, and $R^2 = 1$ when the model is perfect. Experimental results are summarized in Table 3. The results show that directly using the indefinite kernel without any modification yields low $R^2$ values. This issue is significant for the datasets with $R_{\mathrm{eig}} > 5\%$, for instance AID:652 and AID:713. On the other hand, these results also show that, in handling the indefinite kernel, our method obtains relatively higher $R^2$ values in most cases, the exceptions being AID:466 and AID:858. In AID:466, compared with our proxy kernel, the flipped kernel is closer to the "true" kernel matrix. In AID:858, the original kernel (the 3D-Tanimoto values) may not be able to effectively capture the chemical space, so our proposed method fails to improve the estimation performance; this is indicated by the relatively low $R^2$ values of all the models. It is worth mentioning that for AID:859, although our method obtains the highest score, the overall $R^2$ values are low, which may be due to the narrow range of log(y) values in this dataset. Besides performance, Table 3 also shows the number of iterations needed for the algorithm to converge. The average number of iterations is around 12, which is fast in practice.
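For reference, the four spectral-transformation baselines all operate on the eigen-decomposition of the indefinite kernel. The sketch below gives one common formulation of each; it is our paraphrase of the techniques described in [12–15], not code from those papers. The transformed kernel is then used with the standard kernel ridge solver of Section 2.2, whereas the proposed method works with the indefinite kernel directly through the proxy-kernel formulation.

    import numpy as np

    def spectral_transform(K, mode):
        # Eigen-decompose the symmetric indefinite kernel and modify its spectrum
        w, U = np.linalg.eigh((K + K.T) / 2)
        if mode == "clip":        # clipping [13]: drop the negative eigenvalues
            w = np.clip(w, 0.0, None)
        elif mode == "flip":      # flipping [14]: take absolute values of the eigenvalues
            w = np.abs(w)
        elif mode == "shift":     # shifting [15]: shift the spectrum so the smallest eigenvalue is 0
            w = w - min(w.min(), 0.0)
        elif mode == "square":    # squaring [12]: square the eigenvalues
            w = w ** 2
        else:
            raise ValueError("unknown mode: " + mode)
        return (U * w) @ U.T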

5. Conclusions

In this paper the application of kernel ridge regression with indefinite kernels is discussed. An extension of kernel ridge regression is proposed which allows domain experts to design their own kernels for particular applications without being constrained by Mercer's condition. This is done by reformulating the conventional kernel ridge regression model with a proxy kernel. The augmented problem is solved by iteratively solving two subproblems. Closed-form solutions exist for both subproblems, and the optimization process reduces to computations of a pseudo-inverse and an eigen-decomposition. The algorithm is shown to converge and, in the worst case, it converges sub-linearly. To illustrate the performance and application of the algorithm, it is applied to construct QSAR models based on 16 confirmatory BioAssays from the MUV dataset. The results show that in most cases our proposed method outperforms the other spectral transformation methods in terms of the R² value; that is, using the same similarity information, our method can construct a model that better fits the data.


Fig. 2. Measured and estimated values of log(IC50) in dataset AID:737. Estimation is based on the 3D-Tanimoto kernel, with indefiniteness handled via clipping, flipping, shifting and the indefinite kernel ridge regression proposed in this paper. Squaring and KPCA are not included in this figure as their estimates are too far from the real values and would skew the plot.

Table 3. Experimental results of different methods for handling indefinite kernels in kernel ridge regression, with γ = 1, ρ = 100. For the KPCA transformation, the leading 92 features are employed, as this is the minimal size over all kernel matrices. The experiments were performed on a 3.6 GHz quad-core (i7-2600) PC. Columns Original through Our method report R-squared values; the last column reports the number of iterations the proposed algorithm needed to converge.

BioAssay ID | Original | Clipping | Flipping | Shifting | Squaring | KPCA | Our method | Iterations
466 | -19.1271 | 0.3104 | 0.3951 | -53.3671 | -2.4344 | -2.4324 | 0.3820 | 13
548 | 0.4480 | 0.4483 | 0.4486 | 0.4476 | -9.8101 | -10.6576 | 0.5444 | 8
600 | 0.2952 | 0.4342 | 0.4611 | 0.2977 | -0.8104 | -0.8204 | 0.5517 | 13
644 | 0.4613 | 0.4808 | 0.4895 | 0.4614 | -2.4942 | -2.6586 | 0.6361 | 12
652 | -0.2215 | 0.4504 | 0.4908 | -0.1634 | -0.8766 | -0.8477 | 0.5698 | 13
692 | 0.5836 | 0.5837 | 0.5838 | 0.5839 | 0.2461 | 0.2118 | 0.7324 | 10
712 | 0.4792 | 0.5528 | 0.5649 | 0.4808 | -7.5360 | -7.7524 | 0.7108 | 12
713 | 0.0922 | 0.2915 | 0.3355 | 0.0948 | -15.8384 | -16.2088 | 0.4018 | 10
733 | 0.2398 | 0.3726 | 0.4096 | 0.2389 | -6.4279 | -6.6051 | 0.4976 | 12
737 | 0.5644 | 0.5650 | 0.5654 | 0.5656 | -1.4307 | -1.6029 | 0.7399 | 12
810 | 0.4667 | 0.5001 | 0.5125 | 0.4674 | -3.8203 | -4.0069 | 0.6484 | 11
832 | 0.5862 | 0.5862 | 0.5863 | 0.5849 | -2.1651 | -2.4184 | 0.7313 | 10
846 | 0.5066 | 0.5754 | 0.5970 | 0.5054 | -2.8411 | -2.7561 | 0.6480 | 14
852 | 0.4971 | 0.5008 | 0.5032 | 0.4964 | -0.0260 | 0.0360 | 0.5929 | 10
858 | 0.2457 | 0.2638 | 0.2736 | 0.2451 | -60.6345 | -63.1866 | 0.1904 | 17
859 | -26.4074 | -25.9556 | -25.6873 | -26.4380 | -6514.3475 | -6899.8458 | -23.3733 | 5
Average | -2.5181 | -1.1900 | -1.1544 | -4.6562 | -414.4529 | -438.8470 | -0.9247 | 11.375
Average without AID:859 | -0.9255 | 0.4611 | 0.4811 | -3.2040 | -7.7933 | -8.1137 | 0.5718 | 11.8
Average running time (sec) | 0.0074 | 0.0552 | 0.0522 | 0.0533 | 0.0523 | 0.9873 | 1.0782 | —

Acknowledgment This work was supported by City University of Hong Kong (Grant no. 7003016).

References

[1] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, Cambridge, 2000.
[2] S. An, W. Liu, S. Venkatesh, Face recognition using kernel ridge regression, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), IEEE, Minneapolis, 2007, pp. 1–7.
[3] B.V. Kumar, R. Aravind, Face hallucination using OLPP and kernel ridge regression, in: 15th IEEE International Conference on Image Processing (ICIP 2008), IEEE, San Diego, 2008, pp. 353–356.
[4] L. Song, J. Bedo, K.M. Borgwardt, A. Gretton, A. Smola, Gene selection via the BAHSIC family of algorithms, Bioinformatics 23 (13) (2007) i490–i498.
[5] S. Giguère, M. Marchand, F. Laviolette, A. Drouin, J. Corbeil, Learning a peptide–protein binding affinity predictor with kernel ridge regression, BMC Bioinform. 14 (1) (2013) 82.
[6] T. Hinkley, J. Martins, C. Chappey, M. Haddad, E. Stawiski, J.M. Whitcomb, C.J. Petropoulos, S. Bonhoeffer, A systems analysis of mutational effects in HIV-1 protease and reverse transcriptase, Nat. Genet. 43 (5) (2011) 487–489.
[7] R. Courant, D. Hilbert, Methods of Mathematical Physics, Wiley, New York, 1966.
[8] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (3) (1995) 273–297.
[9] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, C. Watkins, Text classification using string kernels, J. Mach. Learn. Res. 2 (2002) 419–444.


[10] K. Tsuda, T. Kin, K. Asai, Marginalized kernels for biological sequences, Bioinformatics 18 (Suppl 1) (2002) S268–S275.
[11] S.V.N. Vishwanathan, N.N. Schraudolph, R. Kondor, K.M. Borgwardt, Graph kernels, J. Mach. Learn. Res. 11 (2010) 1201–1242.
[12] Y. Chen, E.K. Garcia, M.R. Gupta, A. Rahimi, L. Cazzanti, Similarity-based classification: concepts and algorithms, J. Mach. Learn. Res. 10 (2009) 747–776.
[13] E. Pekalska, P. Paclik, R.P. Duin, A generalized kernel approach to dissimilarity-based classification, J. Mach. Learn. Res. 2 (2002) 175–211.
[14] T. Graepel, R. Herbrich, P. Bollmann-Sdorra, K. Obermayer, Classification on pairwise proximity data, in: Advances in Neural Information Processing Systems, 1999, pp. 438–444.
[15] V. Roth, J. Laub, M. Kawanabe, J.M. Buhmann, Optimal cluster preserving embedding of nonmetric proximity data, IEEE Trans. Pattern Anal. Mach. Intell. 25 (12) (2003) 1540–1551.
[16] C.S. Ong, X. Mary, S. Canu, A.J. Smola, Learning with non-positive kernels, in: Proceedings of the Twenty-first International Conference on Machine Learning, ACM, Banff, 2004, p. 81.
[17] J. Chen, J. Ye, Training SVM with indefinite kernels, in: Proceedings of the 25th International Conference on Machine Learning, ACM, Helsinki, 2008, pp. 136–143.
[18] R. Luss, A. d'Aspremont, Support vector machine classification with indefinite kernels, in: Advances in Neural Information Processing Systems, 2008, pp. 953–960.
[19] S. Gu, Y. Guo, Learning SVM classifiers with indefinite kernels, in: AAAI, 2012.
[20] B. Haasdonk, E. Pekalska, Indefinite kernel Fisher discriminant, in: ICPR, 2008, pp. 1–4.
[21] E. Pekalska, B. Haasdonk, Kernel discriminant analysis for positive definite and indefinite kernels, IEEE Trans. Pattern Anal. Mach. Intell. 31 (6) (2009) 1017–1032.
[22] J.-C. Zhou, D. Wang, An improved indefinite kernel machine regression algorithm with norm-r loss function, in: 2011 Fourth International Conference on Information and Computing (ICIC), IEEE, Zhengzhou, 2011, pp. 142–145.
[23] H. Sun, Q. Wu, Least square regression with indefinite kernels and coefficient regularization, Appl. Comput. Harmon. Anal. 30 (1) (2011) 96–109.
[24] Q. Wu, Regularization networks with indefinite kernels, J. Approx. Theory 166 (2013) 1–18.
[25] C. Zhang, F. Nie, S. Xiang, A general kernelization framework for learning algorithms based on kernel PCA, Neurocomputing 73 (4) (2010) 959–967.
[26] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course, Springer, New York, 2004.
[27] H. Zhu, T.M. Martin, L. Ye, A. Sedykh, D.M. Young, A. Tropsha, Quantitative structure–activity relationship modeling of rat acute toxicity by oral exposure, Chem. Res. Toxicol. 22 (12) (2009) 1913–1921.
[28] C.R. Munteanu, E. Fernandez-Blanco, J.A. Seoane, P. Izquierdo-Novo, J.A. Rodriguez-Fernandez, J.M. Prieto-Gonzalez, J.R. Rabunal, A. Pazos, Drug discovery and design for complex diseases through QSAR computational methods, Curr. Pharm. Des. 16 (24) (2010) 2640–2655.
[29] J. Verma, V.M. Khedkar, E.C. Coutinho, 3D-QSAR in drug design—a review, Curr. Top. Med. Chem. 10 (1) (2010) 95–115.
[30] M.A. Johnson, G.M. Maggiora, Concepts and Applications of Molecular Similarity, 1990.
[31] A. Kalászi, D. Szisz, G. Imre, T. Polgár, Screen3D: a novel fully flexible high-throughput shape-similarity search method, J. Chem. Inf. Model. 54 (4) (2014) 1036–1049.
[32] W. Deng, A. Kalászi, Screen3D: a ligand-based 3D similarity search without conformational sampling, in: International Conference and Exhibition on Computer Aided Drug Design & QSAR, 2012.


[33] S.G. Rohrer, K. Baumann, Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data, J. Chem. Inf. Model. 49 (2) (2009) 169–184.
[34] Y. Wang, J. Xiao, T.O. Suzek, J. Zhang, J. Wang, S.H. Bryant, PubChem: a public information system for analyzing bioactivities of small molecules, Nucleic Acids Res. 37 (Suppl 2) (2009) W623–W633.
[35] N.R. Draper, H. Smith, Applied Regression Analysis, 2nd ed., Wiley, New York, 1981.

Benjamin Yee Shing Li received the B.Eng. (Hons.) in Information Engineering from City University of Hong Kong in 2009. He is now working towards the Doctor of Philosophy degree at the same university. His research interests include data mining, machine learning, bioinformatics and cheminformatics.

Lam Fat Yeung received the B.Sc. (Hons.) in Electronic and Electrical Engineering from Portsmouth University, UK, in 1978, and the Ph.D. in Control Engineering from Imperial College of Science, Technology and Medicine in 1990. Currently he is an associate professor at City University of Hong Kong. He was the Chair of the IET Hong Kong Branch and the Chair of the Joint Chapter of the Robotics & Automation and Control Society, IEEE Hong Kong Section. His current research interests are systems biology, control systems and optimisation.

King Tim Ko received the B.Eng. (Hons.) and Ph.D. degrees from The University of Adelaide, Adelaide, Australia. He worked several years with the Telecom Australia Research Laboratories, Melbourne, Australia, before joining the Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong, in 1986, where he is currently an associate professor. His research interests are in the performance evaluation of communication networks and mobile networks.