Kernel-driven similarity learning


Neurocomputing 267 (2017) 210–219


Zhao Kang a,b,∗, Chong Peng b, Qiang Cheng b

a School of Computer Science & Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan 611731, China
b Department of Computer Science, Southern Illinois University Carbondale, IL 62901, USA

Article history: Received 11 August 2016; Revised 11 March 2017; Accepted 2 June 2017; Available online 9 June 2017. Communicated by Dr. Haiqin Yang.

Keywords: Similarity measure; Nonlinear relation; Sparse representation; Kernel method; Multiple kernel learning; Clustering; Recommender systems

Abstract

Similarity measure is fundamental to many machine learning and data mining algorithms. Predefined similarity metrics are often data-dependent and sensitive to noise. Recently, the data-driven approach, which learns similarity information from the data itself, has drawn significant attention. The idea is to represent a data point by a linear combination of all (other) data points. However, it is often the case that more complex relationships beyond linear dependencies exist in the data. Based on the well-known fact that the kernel trick can capture nonlinear structure information, we extend this idea to kernel spaces. Nevertheless, such an extension raises another issue: the algorithm's performance is largely determined by the choice of kernel, which is often unknown in advance. Therefore, we further propose a multiple kernel-based learning method. By doing so, our model can learn both linear and nonlinear similarity information, and automatically choose the most suitable kernel. As a result, our model is capable of learning the complete similarity information hidden in the data set. Comprehensive experimental evaluations of our algorithms on clustering and recommender systems demonstrate their superior performance compared to other state-of-the-art methods. This performance also shows the great potential of our proposed algorithms for other possible applications. © 2017 Elsevier B.V. All rights reserved.

1. Background and motivation

Similarity measurement is an indispensable preprocessing step for a number of data analysis tasks, such as clustering, nearest neighbor classification, and graph-based semi-supervised learning. In many algorithms, the initial data are no longer needed once we obtain the similarity information between data points. Therefore, the similarity measure is crucial to the performance of many techniques. One well-known fact is that the choice of a particular similarity metric may improve an algorithm's performance on a specific data set [1]. For example, there are four open issues in the widely used Laplacian matrix construction of a graph [2]: (1) selecting the appropriate number of neighbors; (2) choosing the appropriate similarity metric to measure the affinities among sample points; (3) making algorithms robust to noise and outliers; and (4) determining the appropriate scale of the data. In practical applications, one often-adopted strategy is to try different kinds of similarity measures, such as the cosine similarity, Gaussian function, and Jaccard metric, together with different neighborhood sizes and parameters [3]. However, this approach




is time-consuming and impractical for large-scale data. Kar and Jain [4] propose a framework to measure the goodness of a similarity metric in classification tasks; nevertheless, this approach is hard to adapt to different settings. Even if one tries different similarity metrics, those predefined metrics may still capture incomplete and inaccurate relationships.

In recent years, the dimension of data has become increasingly high, and significant work has focused on discovering the potential low-dimensional structures of high-dimensional data. Some of the state-of-the-art methods are locally linear embedding (LLE) [5,6], isometric feature mapping (ISOMAP) [7], and locality preserving projection (LPP) [8]. Most of these algorithms need to construct a neighborhood adjacency graph. Traditional similarity measures often fail to consider the local environment of data points. In Fig. 1, for instance, the points near the intersection are quite close when measured by the Euclidean distance; however, they can come from different clusters, which are represented in different colors. In this case, it is unlikely that any standard similarity function will be adequate to capture the local manifold structure precisely.

Notations. In this paper, matrices and vectors are represented by upper-case letters and boldfaced lower-case letters, respectively. The ith column of X is denoted by $X_i$. The $\ell_1$-norm of a matrix A is defined as the absolute sum of its entries, i.e., $\|A\|_1 = \sum_{ij} |a_{ij}|$. The $\ell_2$-norm of a vector x is defined as $\|x\|_2 = \sqrt{x^T x}$. I denotes the identity matrix, and Tr(·) is the trace operator.


Fig. 1. Three clusters in R^3, denoted in three different colors. Although some points look close, they are from different clusters.

1.1. Introduction to sparse representation

Recently, the pairwise similarity has been learned from the data using a sparse representation, which is regarded as a data-driven approach. According to LLE [5], locally linear reconstruction of a data point by its neighboring points can capture the local manifold structure. This reconstruction can be written as:

$$X_i \approx \sum_{j \in N(i)} X_j a_{ji}, \tag{1}$$

where N(i) represents the neighborhood of $X_i$ and $a_{ji}$ is the weight for data point $X_j$. More similar sample points should receive larger weights, and the weights should be smaller for less similar points. Thus the resulting coefficient matrix A is also called a similarity matrix. In the literature, (1) is also called the self-expressive property of the data [54]. Hence the objective function can be formulated as:

$$\min_A \sum_{i=1}^{n} \Big\| X_i - \sum_{j \in N(i)} X_j a_{ji} \Big\|^2. \tag{2}$$

There are usually a number of issues affecting the learning performance of neighborhood-based approaches, such as how to choose a proper size of the neighborhood and what distance to use to measure closeness. To avoid these problematic issues related to determining neighbors, we relax the requirement that $a_{ji}$ be zero outside the neighborhood; in the meantime, as compensation, we seek a sparse solution of A. It is natural to introduce a regularizer $\|A\|_0$, which counts the number of non-zero elements in A. Recent developments in [9,10] have shown that the sparse solution can be approximately obtained by solving an $\ell_1$-minimization problem, which enjoys the advantage of being continuous and smooth. By using the $\ell_1$ heuristic, our objective function becomes

$$\min_A \|X - XA\|_F^2 + \lambda \|A\|_1, \quad \text{s.t.}\quad A \ge 0,\ \mathrm{diag}(A) = 0. \tag{3}$$

Here, we restrict the reconstruction weights A to be nonnegative for ease of interpretation, and the second constraint is used to avoid the numerically trivial solution A = I. The coefficient λ balances the contribution of the sparsity term. It is worth noting that sparse representation often leads to resilience to noise and outliers [11]. In addition, in Eq. (3) there is no scale-consistency restriction on the data points. Therefore, sparse representation helps address both the scale-inconsistency and outlier issues [2]. On the other hand, sparsity does not encourage locality; Yu et al. [12] have pointed out that locality is more essential than sparsity in some situations.

1.2. Contributions

Although sparsity has shown good performance in various applications, such as subspace recovery [13], denoising [14], and classification [11], the similarity information in the sparse model is learned in the original feature space, and Eq. (3) assumes linear relations among data points. Thus, it cannot effectively capture nonlinear relations hidden in the data. For much high-dimensional real-world data, it is often necessary and favorable to model the nonlinearity of the data [15]. For instance, face images are assumed to lie on a nonlinear submanifold [16]. Recall that nonlinear data may exhibit linearity when mapped into a higher-dimensional feature space via the kernel trick [17]. By doing so, we can use model (3) to capture the linear relations in the transformed space, and thus the nonlinear relations in the original data space.

In this paper, we first extend model (3) to kernel spaces so as to learn the underlying nonlinear relations of a given data set. By doing so, we then need to address a related problem: it is well known that the type of kernel plays an important role in the performance of kernel methods. How can we find the most appropriate kernel for a given learning task on a specific data set? Exhaustive search over a predefined pool of kernels is time-consuming if the pool size is large [18]. To handle this issue, we further propose a multiple kernel learning algorithm. Specifically, it simultaneously learns the data similarity and an appropriate linear combination of multiple input kernels. An iterative optimization procedure is developed to mutually reinforce similarity learning and consensus kernel construction. Furthermore, there is an important benefit to leveraging a consensus kernel built from multiple kernels: it has great potential to integrate complementary information from heterogeneous sources or at different scales, which in turn improves the performance over single kernel based methods [19].

After constructing our models, we develop their optimization algorithms. In particular, the optimal weights for the kernels have closed-form solutions. We then apply the proposed methods to the clustering problem and to recommender systems to demonstrate their high potential for real applications.

The rest of the paper is organized as follows. We first introduce the related work in Section 2. The single kernel-based learning method (SKLM) is described in Section 3. Section 4 provides the multiple kernel-based learning method (MKLM). Experiments on clustering and Top-N recommendation are presented in Sections 5 and 6, respectively. Finally, we provide some concluding remarks in Section 7.

2. Related work

In the literature, there are a number of works that combine sparse models with kernel methods [20–25]. Qi and Hughes [20] exploit kernel-based sparse representation in the context of compressive sensing. This idea has also been proposed for supervised learning tasks such as image classification [24–26], face recognition [25], object recognition [23], and kernel matrix approximation [21]. Unfortunately, an a priori dictionary must be known. The authors in [23–25] involve a dictionary learning step. Thiagarajan et al. [22] develop a framework for multiclass object classification, where the kernel weights are tuned according to graph-embedding principles and the associated optimization problem is non-convex and computationally expensive. Unlike previous work, we need no predefined dictionary, and the kernel weights have closed-form solutions.
Moreover, we are guaranteed to obtain the optimal solution due to the convexity of our objective function. With our multiple kernel learning algorithm, the choice of the most appropriate kernel would be


less crucial. Hence, the ease of the proposed learning process makes our method attractive for many real-world tasks. In this paper, we examine the applications of our models to unsupervised learning; specifically, we discuss our methods in clustering and recommender systems.

2.1. Clustering

Traditional clustering methods cannot deal with complex relationships among objects, and thus their performance deteriorates on complex data sets. Kernel clustering algorithms are designed to capture the nonlinear relationships inherent in many real-world data sets, and they often achieve better clustering performance than other algorithms [55]. Some representative kernel-based clustering methods are kernel k-means (KKM) [27] and spectral clustering (SC) [28]. Huang et al. [29] extend k-means to a multiple kernel setting so as to integrate multiple sources of information; besides the loss function, the method in [29] also adopts a constraint on the kernel weight distribution that differs from ours. Huang et al. [30] extend spectral clustering to the situation where multiple affinities exist and propose affinity aggregation spectral clustering (AASC). In this paper, we learn a single affinity matrix. Recently, the robust kernel k-means (RKKM) and robust multiple kernel k-means (RMKKM) methods [31] have been proposed to deal with noise and outliers. Rather than using a squared $\ell_2$-norm loss, both methods adopt the $\ell_{2,1}$-norm to measure the distances between data points and cluster centers. Both approaches show improved performance compared to the above existing methods.

2.2. Recommender systems

Recommender systems are crucially important for various online businesses. Their task is to recommend appropriate items to users according to their historical behaviors or preferences [32]. Similarity measure is the most critical component in recommender systems. Collaborative filtering, the most widely used technique in recommender systems, predicts an item's missing rating by aggregating its ratings from similar users or items. Some predefined similarity metrics, such as the cosine similarity and the Pearson correlation coefficient [33], are often adopted. As mentioned above, such predefined metrics suffer from a number of inherent drawbacks. Specifically, in recommendation problems, these drawbacks can be problematic in four aspects [34]: (1) the flat-value problem; (2) the opposite-value problem; (3) the single-value problem; (4) the cross-value problem. Recently, a sparse linear model called SLIM [35] has been utilized for recommender systems. SLIM shows impressive performance; however, it also fails to consider the complex relationships among items and users [36]. In contrast, the proposed kernel-based methods are better suited to address these problematic issues in existing recommender systems.

3. Single kernel model

In this section, we combine sparse representation with the kernel method to learn nonlinear relations hidden in the data.

3.1. Formulation

Define a (potentially nonlinear) mapping $\phi: \mathbb{R}^D \to \mathcal{H}$, which maps the data samples from the input space to a reproducing kernel Hilbert space $\mathcal{H}$ [37]. $K \in \mathbb{R}^{n \times n}$ represents a positive semi-definite kernel (Gram) matrix. The transformed data sample x from the original space to the kernel space is $\phi(x)$. For n samples in the matrix $X = [X_1, \ldots, X_n]$, the transformed data matrix is $\phi(X) = [\phi(X_1), \ldots, \phi(X_n)]$. The kernel similarity between data samples $X_i$ and $X_j$ is defined as $\langle \phi(X_i), \phi(X_j) \rangle = \phi(X_i)^T \phi(X_j) = K_{X_i, X_j}$. It is

easy to observe that all similarities can be computed exclusively using the kernel function, and we do not need to know the transformation φ. This is known as the kernel trick, and it greatly simplifies the computations in the kernel space when the kernels are precomputed. To deal with potentially nonlinear relations, we extend the sparse model to kernel spaces and propose a single kernel-based learning model (SKLM),

$$\min_A \|\phi(X) - \phi(X)A\|_F^2 + \lambda \|A\|_1 \quad \text{s.t.}\quad A \ge 0,\ \mathrm{diag}(A) = 0. \tag{4}$$

In terms of the kernel matrix K, this model can be equivalently written as

$$\min_A \mathrm{Tr}(K - 2KA + A^T K A) + \lambda \|A\|_1 \quad \text{s.t.}\quad A \ge 0,\ \mathrm{diag}(A) = 0. \tag{5}$$
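As a quick sanity check of the kernel-space objective, the following sketch evaluates the cost in (5) directly from a precomputed kernel matrix. The function name `sklm_objective` is a hypothetical helper of ours, not from the paper.

```python
import numpy as np

def sklm_objective(K, A, lam):
    """Value of model (5): Tr(K - 2KA + A^T K A) + lam * ||A||_1.
    In the kernel space this equals ||phi(X) - phi(X)A||_F^2 + lam * ||A||_1."""
    return np.trace(K - 2 * K @ A + A.T @ K @ A) + lam * np.abs(A).sum()
```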

By solving the above problem, we learn the linear sparse relations of φ(X), and thus the nonlinear relations among X. Note that (5) reduces to (3) if we adopt a linear kernel. Therefore, SKLM gives us the flexibility to explore either linear or nonlinear relations.

3.2. Optimization

Problem (5) can be efficiently solved by the Alternating Direction Method of Multipliers (ADMM) [38]. By introducing an auxiliary variable $Z \in \mathbb{R}^{n \times n}$, Eq. (5) can be reformulated as

$$\min_{A, Z} \mathrm{Tr}(K - 2KA + A^T K A) + \lambda \|Z\|_1 \quad \text{s.t.}\quad Z = A,\ A \ge 0,\ \mathrm{diag}(A) = 0. \tag{6}$$

The corresponding augmented Lagrangian function is

$$\mathcal{L}(A, Z, Y) = \mathrm{Tr}(K - 2KA + A^T K A) + \lambda \|Z\|_1 + \frac{\mu}{2}\Big\|Z - A + \frac{Y}{\mu}\Big\|_F^2, \quad \text{s.t.}\quad A \ge 0,\ \mathrm{diag}(A) = 0, \tag{7}$$

where Y is a Lagrange multiplier and μ > 0 is a penalty parameter. Then problem (7) can be solved by iteratively updating A, Z, and Y. Given the current points $A^t$, $Z^t$, $Y^t$, we update $A^{t+1}$ by

$$A^{t+1} = \arg\min_A\ \mathrm{Tr}(K - 2KA + A^T K A) + \frac{\mu}{2}\Big\|Z^t - A + \frac{Y^t}{\mu}\Big\|_F^2. \tag{8}$$

It is obvious that the objective function in (8) is a strongly convex quadratic function and it can be solved by setting the first-order derivative to zero. Thus, we obtain

$$A^{t+1} = (2K + \mu I)^{-1}\big(2K + \mu Z^t + Y^t\big). \tag{9}$$

For $Z^{t+1}$, we need to solve the following problem:

$$Z^{t+1} = \arg\min_Z\ \lambda \|Z\|_1 + \frac{\mu}{2}\Big\|Z - A^{t+1} + \frac{Y^t}{\mu}\Big\|_F^2. \tag{10}$$

Denoting $Q^{t+1} = A^{t+1} - \frac{Y^t}{\mu}$, Z can be updated component-wise following Beck and Teboulle [39]:

$$Z_{ij}^{t+1} = \max\Big\{|Q_{ij}^{t+1}| - \frac{\lambda}{\mu},\ 0\Big\} \cdot \mathrm{sign}(Q_{ij}^{t+1}). \tag{11}$$

The complete procedure is outlined in Algorithm 1. As a matter of fact, thanks to the column-wise independence property of A, problem (5) can easily be parallelized by solving it column by column.

Algorithm 1: The algorithm of SKLM.
Input: Kernel matrix K, parameters λ > 0, μ > 0.
Initialize: Random matrix Z, Y = 0.
REPEAT
  1: Obtain A through (9).
  2: A ← A − diag(diag(A)), a_ij ← max{a_ij, 0}.
  3: Update Z as (11).
  4: Z ← Z − diag(diag(Z)), z_ij ← max{z_ij, 0}.
  5: Update the Lagrange multiplier Y by Y = Y + μ(Z − A).
UNTIL stopping criterion is met.
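To make the updates concrete, here is a minimal NumPy sketch of Algorithm 1. The function names (`sklm`, `soft_threshold`), the random-data example, and the fixed iteration count used as a stopping criterion are our own illustrative choices, not part of the paper.

```python
import numpy as np

def soft_threshold(Q, tau):
    """Component-wise shrinkage operator used in the Z-update (11)."""
    return np.maximum(np.abs(Q) - tau, 0.0) * np.sign(Q)

def sklm(K, lam=1e-3, mu=0.1, n_iter=100):
    """Sketch of Algorithm 1 (SKLM) via ADMM; K is an n x n kernel matrix."""
    n = K.shape[0]
    Z = np.random.rand(n, n)
    Y = np.zeros((n, n))
    # Precompute the inverse used in the A-update, Eq. (9).
    inv_term = np.linalg.inv(2 * K + mu * np.eye(n))
    for _ in range(n_iter):
        # A-update, Eq. (9)
        A = inv_term @ (2 * K + mu * Z + Y)
        np.fill_diagonal(A, 0.0)      # enforce diag(A) = 0
        A = np.maximum(A, 0.0)        # enforce A >= 0
        # Z-update, Eq. (11), with Q = A - Y / mu
        Z = soft_threshold(A - Y / mu, lam / mu)
        np.fill_diagonal(Z, 0.0)
        Z = np.maximum(Z, 0.0)
        # Dual update of the Lagrange multiplier
        Y = Y + mu * (Z - A)
    return A

# Example: a Gaussian kernel on random data (for illustration only).
X = np.random.rand(50, 10)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / d2.max())
A = sklm(K)
```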

4. Multiple kernel model

We can learn the similarity information among data samples using model (5). However, there is still an unsolved problem: the choice of kernel. Several different forms of kernel functions are often used in the literature, and the performance depends critically on the choice of kernel. Nevertheless, the most suitable kernel is unknown in advance, and it is time-consuming to search exhaustively when the size of a user-defined pool of kernels becomes large [18]. Multiple kernel learning is suggested to alleviate this issue. It constructs a few candidate kernels and merges them to form a consensus kernel [40]. Another benefit of using multiple kernels is that they can exploit heterogeneous features of real-world data sets [19]. In this section, we present our multiple kernel-based learning method (MKLM). In addition to the similarity matrix, MKLM simultaneously learns an appropriate consensus kernel from a convex combination of several predefined kernel matrices. With a sufficient number and variety of kernels, the learned similarity matrix would be accurate, less biased, and robust to noise and outliers.

4.1. Formulation

Suppose there are a total of r different kernel functions $\{K_i\}_{i=1}^r$. Correspondingly, there would be r different kernel spaces, denoted as $\{\mathcal{H}_i\}_{i=1}^r$. An augmented Hilbert space, $\tilde{\mathcal{H}} = \bigoplus_{i=1}^r \mathcal{H}_i$, can be constructed by concatenating all kernel spaces and by using the mapping $\tilde{\phi}(x) = [\sqrt{w_1}\,\phi_1(x), \sqrt{w_2}\,\phi_2(x), \ldots, \sqrt{w_r}\,\phi_r(x)]^T$ with different weights $\sqrt{w_i}\ (w_i \ge 0)$. Then the combined kernel $K_w$ can be represented as [18]

$$K_w(x, y) = \langle \phi_w(x), \phi_w(y) \rangle = \sum_{i=1}^{r} w_i K_i(x, y). \tag{12}$$

Note that the convex combination of the positive semi-definite kernel matrices $\{K_i\}_{i=1}^r$ is still a positive semi-definite kernel matrix [31]. Thus the combined kernel still satisfies the Mercer condition. Here the values of the weights also reflect the importance of the candidate kernels. Replacing the single kernel in (5) with the consensus kernel, we obtain a new multiple kernel-based learning method (MKLM):

$$\min_{A, w} \ \mathrm{Tr}(K_w - 2K_w A + A^T K_w A) + \lambda \|A\|_1 \quad \text{s.t.}\quad A \ge 0,\ \mathrm{diag}(A) = 0,\ K_w = \sum_{i=1}^{r} w_i K_i,\ \sum_{i=1}^{r} \sqrt{w_i} = 1,\ w_i \ge 0. \tag{13}$$

In this framework, we can learn both linear and nonlinear relation information if we combine a linear kernel with other kernels. In the next subsection, we introduce an iterative algorithm to solve (13).

4.2. Optimization

We can solve (13) by alternately updating w and A while holding the other variable constant.

(1) Optimizing with respect to A when w is fixed: we can directly compute $K_w$, and the optimization problem is exactly (5). Thus, we just need to run Algorithm 1 with $K_w$ as the input kernel matrix.

(2) Optimizing with respect to w when A is fixed: solving (13) with respect to w can be rewritten as

$$\min_{w} \sum_{i=1}^{r} w_i h_i \quad \text{s.t.}\quad \sum_{i=1}^{r} \sqrt{w_i} = 1,\ w_i \ge 0, \tag{14}$$

where

$$h_i = \mathrm{Tr}(K_i - 2K_i A + A^T K_i A). \tag{15}$$

The Lagrange function of (14) is

$$J(w) = w^T h + \gamma\Big(1 - \sum_{i=1}^{r} \sqrt{w_i}\Big). \tag{16}$$

By utilizing the Karush–Kuhn–Tucker (KKT) condition $\frac{\partial J(w)}{\partial w_i} = 0$ together with the constraint $\sum_{i=1}^{r} \sqrt{w_i} = 1$, we obtain the solution of w as follows [41]:

$$w_i = \Big(h_i \sum_{j=1}^{r} \frac{1}{h_j}\Big)^{-2}. \tag{17}$$

In Algorithm 2 we provide a complete algorithm for solving problem (13).

Algorithm 2: The algorithm of MKLM.
Input: A set of kernel matrices $\{K_i\}_{i=1}^r$, parameters λ > 0, μ > 0.
Initialize: Random matrix Z, Y = 0, w_i = 1/r.
REPEAT
  1: Calculate the estimated kernel K_w according to Eq. (12).
  2: Obtain A through (9) by replacing K with K_w.
  3: A ← A − diag(diag(A)), a_ij ← max{a_ij, 0}.
  4: Update Z by (11).
  5: Z ← Z − diag(diag(Z)), z_ij ← max{z_ij, 0}.
  6: Update the Lagrange multiplier Y by Y = Y + μ(Z − A).
  7: Calculate h by (15).
  8: Update the kernel weight w using (17).
UNTIL stopping criterion is met.
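The closed-form weight update is simple enough to state in a few lines. The sketch below, with hypothetical helper names `kernel_weights` and `combined_kernel` of our own choosing, computes $h_i$ via Eq. (15), the weights via Eq. (17), and the consensus kernel via Eq. (12); it assumes all $h_i$ are strictly positive.

```python
import numpy as np

def kernel_weights(kernels, A):
    """MKLM weight update: h_i from Eq. (15), then w_i from Eq. (17).
    Assumes every h_i > 0, which holds unless a kernel is fit perfectly."""
    h = np.array([np.trace(K - 2 * K @ A + A.T @ K @ A) for K in kernels])
    return (h * np.sum(1.0 / h)) ** (-2)

def combined_kernel(kernels, w):
    """Consensus kernel K_w of Eq. (12): a weighted sum of the base kernels."""
    return sum(wi * K for wi, K in zip(w, kernels))

# One outer MKLM iteration would alternate, e.g.:
#   Kw = combined_kernel(kernels, w); A = sklm(Kw); w = kernel_weights(kernels, A)
```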

5. Application to clustering

In this section, we perform experiments to show the effectiveness of our proposed algorithms on the clustering task. We evaluate their performance on a number of real-world data sets.

5.1. Data sets

There are altogether nine benchmark data sets used in our experiments. Table 1 summarizes their statistics. Among them, six are image data sets and the other three are text corpora (http://www-users.cs.umn.edu/~han/data/tmdata.tar.gz). The six image data sets consist of four famous face databases (ORL: http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html, Yale: http://vision.ucsd.edu/content/yale-face-database, AR [42]: http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html, and JAFFE: http://www.kasrl.org/jaffe.html), a toy image database COIL20 [43] (http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php), and a binary alpha digits data set BA (http://www.cs.nyu.edu/~roweis/data.html). Specifically, COIL20 contains images of 20 objects. For each object, the images were taken five degrees apart as the object rotated on a turntable, giving 72 images per object; each image is represented by a 1024-dimensional vector. Some sample images are shown in Fig. 2. As shown in Fig. 3(a), BA consists of the digits "0" through "9" and the capital letters "A" through "Z", with 39 examples for each class. Yale, ORL, AR, and JAFFE contain face images of individuals; each image differs in facial expression or configuration: time, illumination conditions, glasses/no glasses. Fig. 3(b) shows some example images from the Yale database.



Fig. 2. Sample images of COIL20.

Fig. 3. Sample images of BA (a) and Yale (b).

Table 1. Description of the data sets.

Data set   #Instances   #Features   #Classes
YALE       165          1024        15
JAFFE      213          676         10
ORL        400          1024        40
AR         840          768         120
COIL20     1440         1024        20
BA         1404         320         36
TR11       414          6429        9
TR41       878          7454        10
TR45       690          8261        10

5.2. Evaluation metrics

To quantitatively evaluate the clustering performance, we adopt three widely used metrics: accuracy (Acc), normalized mutual information (NMI) [44], and Purity. Acc measures the extent to which each cluster contains data points from the corresponding class. Let $l_i$ and $\hat{l}_i$ be the produced cluster label and the ground-truth class label of $X_i$, respectively. Then Acc is defined by

$$\mathrm{Acc} = \frac{\sum_{i=1}^{n} \delta(\hat{l}_i, \mathrm{map}(l_i))}{n},$$

where n is the total number of samples, the delta function δ(x, y) equals one if and only if x = y and zero otherwise, and map(·) is the best permutation mapping function obtained by the Kuhn–Munkres algorithm [45].

NMI evaluates the quality of clusters. Given two index sets L and $\hat{L}$,

$$\mathrm{NMI}(L, \hat{L}) = \frac{\sum_{l \in L,\, \hat{l} \in \hat{L}} p(l, \hat{l}) \log \frac{p(l, \hat{l})}{p(l)\, p(\hat{l})}}{\max\big(H(L), H(\hat{L})\big)},$$

where p(l) and $p(\hat{l})$ represent the marginal probability distribution functions of L and $\hat{L}$, respectively, induced from the joint distribution $p(l, \hat{l})$, and H(·) denotes the entropy function. NMI takes values between 0 and 1, attaining these extremes when the two clusterings are independent and identical, respectively. Purity evaluates the extent to which each cluster is dominated by its most common category [46]. It is computed as follows:

$$\mathrm{Purity} = \frac{1}{n}\sum_{i=1}^{K} n_i P(S_i), \qquad P(S_i) = \frac{1}{n_i}\max_j\big(n_i^j\big),$$

where $n_i$ is the size of cluster $S_i$, $n_i^j$ is the number of samples in cluster $S_i$ that belong to the jth class, and K denotes the number of clusters.
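As an illustration of how these metrics can be computed, the following sketch uses SciPy's `linear_sum_assignment` for the Kuhn–Munkres mapping in Acc and implements NMI with the max-entropy normalization used above; the helper names are ours, not from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Acc: best cluster-to-class mapping via the Kuhn-Munkres (Hungarian) algorithm."""
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    # Contingency matrix: rows = predicted clusters, columns = true classes.
    C = np.array([[np.sum((y_pred == k) & (y_true == c)) for c in classes]
                  for k in clusters])
    row, col = linear_sum_assignment(-C)   # maximize the number of matched samples
    return C[row, col].sum() / len(y_true)

def purity(y_true, y_pred):
    """Purity: each cluster is credited with its most common true class."""
    total = 0
    for k in np.unique(y_pred):
        _, counts = np.unique(y_true[y_pred == k], return_counts=True)
        total += counts.max()
    return total / len(y_true)

def nmi(y_true, y_pred):
    """NMI normalized by max(H(L), H(L_hat)), as in the definition above."""
    p_true = np.array([np.mean(y_true == c) for c in np.unique(y_true)])
    p_pred = np.array([np.mean(y_pred == k) for k in np.unique(y_pred)])
    mi = 0.0
    for i, c in enumerate(np.unique(y_true)):
        for j, k in enumerate(np.unique(y_pred)):
            p_joint = np.mean((y_true == c) & (y_pred == k))
            if p_joint > 0:
                mi += p_joint * np.log(p_joint / (p_true[i] * p_pred[j]))
    h_true = -np.sum(p_true * np.log(p_true))
    h_pred = -np.sum(p_pred * np.log(p_pred))
    return mi / max(h_true, h_pred)
```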

5.3. Experiment setup

To assess the effectiveness of multiple kernel learning, we design 12 kernels. They include seven Gaussian kernels of the form $K(x, y) = \exp(-\|x - y\|_2^2 / (t\, d_{\max}^2))$, where $d_{\max}$ is the maximal distance between samples and t varies over {0.01, 0.05, 0.1, 1, 10, 50, 100}; a linear kernel $K(x, y) = x^T y$; and four polynomial kernels $K(x, y) = (a + x^T y)^b$ with a ∈ {0, 1} and b ∈ {2, 4}. Furthermore, all kernels are rescaled to [0, 1] by dividing each element by the largest pairwise squared distance.

For the single kernel methods, we run KKM, SC, RKKM, and SKLM on each kernel separately, and we report both the best and the average results over all these kernels. Moreover, we also compare with the simplex sparse representation (SSR) method [2], which is parameter-free. SSR learns similarity information using sparse representation, but it is expressed in the original feature space. For the multiple kernel methods, we run MKKM (http://imp.iis.sinica.edu.tw/IVCLab/research/Sean/mkfc/code), AASC (http://imp.iis.sinica.edu.tw/IVCLab/research/Sean/aasc/code), RMKKM (https://github.com/csliangdu/RMKKM), and MKLM on a combination of the above kernels.

For our proposed methods, after obtaining the similarity matrix, we follow the standard spectral clustering procedure; the difference between our methods and spectral clustering lies in the construction of the similarity graph. For spectral clustering methods, such as SC and AASC, we need to run k-means on the spectral embedding to obtain the clustering results. To reduce the influence of initialization, we follow the strategy suggested in [31,47]: we repeat the clustering 20 times and present the results with the best objective values. We set the number of clusters to the true number of classes for all clustering algorithms.
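For concreteness, the sketch below builds such a pool of 12 kernels with NumPy. The function name `kernel_pool` is ours, and the final max-based rescaling is only one plausible reading of the rescaling step described in the text, not the authors' exact procedure.

```python
import numpy as np

def kernel_pool(X):
    """Sketch of the 12-kernel pool described above; X has one sample per row."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    d2_max = d2.max()                                      # equals d_max squared
    kernels = []
    # Seven Gaussian kernels K(x, y) = exp(-||x - y||^2 / (t * d_max^2)).
    for t in [0.01, 0.05, 0.1, 1, 10, 50, 100]:
        kernels.append(np.exp(-d2 / (t * d2_max)))
    # One linear kernel K(x, y) = x^T y.
    lin = X @ X.T
    kernels.append(lin)
    # Four polynomial kernels K(x, y) = (a + x^T y)^b.
    for a in [0, 1]:
        for b in [2, 4]:
            kernels.append((a + lin) ** b)
    # Bring the kernels onto a comparable scale (one reading of the rescaling step).
    return [K / np.abs(K).max() for K in kernels]
```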


Table 2. Clustering results measured on benchmark data sets. Each row is a method; the columns are, from left to right: YALE, JAFFE, ORL, AR, COIL20, BA, TR11, TR41, TR45. "-b" denotes the best and "-a" the average over the 12 kernels.

(a) Accuracy
KKM-b    0.4712  0.7439  0.5353  0.3302  0.5949  0.4120  0.5191  0.5564  0.5879
KKM-a    0.3897  0.6709  0.4593  0.3089  0.5074  0.3366  0.4465  0.4634  0.4558
SC-b     0.4942  0.7488  0.5796  0.2883  0.6770  0.3107  0.5098  0.6352  0.5739
SC-a     0.4052  0.5403  0.4665  0.2222  0.4365  0.2625  0.4332  0.4480  0.4596
RKKM-b   0.4809  0.7561  0.5496  0.3343  0.6164  0.4217  0.5303  0.5676  0.5813
RKKM-a   0.3971  0.6798  0.4688  0.3120  0.5189  0.3435  0.4504  0.4680  0.4569
SSR      0.5455  0.8732  0.6900  0.6500  0.7632  0.2397  0.4106  0.6378  0.7145
SKLM-b   0.6303  0.9953  0.7367  0.7333  0.8358  0.4872  0.6775  0.7312  0.8058
SKLM-a   0.4889  0.6876  0.4129  0.6316  0.8115  0.4590  0.5298  0.4934  0.5338
MKKM     0.4570  0.7455  0.4751  0.2861  0.5482  0.4052  0.5013  0.5610  0.5846
AASC     0.4064  0.3035  0.2720  0.3323  0.3487  0.2707  0.4715  0.4590  0.5264
RMKKM    0.5218  0.8707  0.5560  0.3437  0.6665  0.4342  0.5771  0.6265  0.6400
MKLM     0.6303  1.0000  0.7525  0.7357  0.8424  0.4924  0.6836  0.7335  0.8159

(b) NMI
KKM-b    0.5134  0.8013  0.7343  0.6521  0.7405  0.5725  0.4888  0.5988  0.5787
KKM-a    0.4207  0.7148  0.6336  0.6064  0.6357  0.4649  0.3322  0.4037  0.3869
SC-b     0.5292  0.8208  0.7516  0.5837  0.8098  0.5076  0.4311  0.6133  0.4803
SC-a     0.4479  0.5935  0.6674  0.5605  0.5434  0.4009  0.3139  0.3660  0.3322
RKKM-b   0.5229  0.8347  0.7423  0.6544  0.7463  0.5782  0.4969  0.6077  0.5786
RKKM-a   0.4287  0.7401  0.6391  0.6081  0.6370  0.4691  0.3348  0.4086  0.3896
SSR      0.5726  0.9293  0.8423  0.8416  0.8689  0.3029  0.2760  0.5956  0.6782
SKLM-b   0.6160  0.9915  0.8075  0.8676  0.8834  0.6504  0.5429  0.6462  0.7684
SKLM-a   0.5017  0.6346  0.5932  0.8236  0.8526  0.5991  0.3894  0.3478  0.4049
MKKM     0.5006  0.7979  0.6886  0.5917  0.7064  0.5688  0.4456  0.5775  0.5617
AASC     0.4683  0.2722  0.4377  0.6506  0.4187  0.4234  0.3939  0.4305  0.4194
RMKKM    0.5558  0.8937  0.7483  0.6549  0.7734  0.5847  0.5608  0.6347  0.6273
MKLM     0.6354  1.0000  0.8658  0.8781  0.8942  0.6639  0.5781  0.6987  0.7685

(c) Purity
KKM-b    0.4915  0.7732  0.5803  0.3552  0.6461  0.4420  0.6757  0.7446  0.6849
KKM-a    0.4112  0.7013  0.5042  0.3364  0.5530  0.3606  0.5632  0.6000  0.5364
SC-b     0.5161  0.7683  0.6145  0.3324  0.6992  0.3450  0.5879  0.7368  0.6125
SC-a     0.4306  0.5656  0.5120  0.2599  0.4683  0.2907  0.5023  0.5645  0.5002
RKKM-b   0.4979  0.7958  0.5960  0.3587  0.6635  0.4528  0.6793  0.7499  0.6818
RKKM-a   0.4174  0.7182  0.5146  0.3388  0.5634  0.3686  0.5640  0.6021  0.5375
SSR      0.5818  0.9624  0.7650  0.6952  0.8903  0.4085  0.8502  0.7540  0.8362
SKLM-b   0.6606  0.9953  0.7852  0.7988  0.8949  0.5239  0.7036  0.7812  0.8391
SKLM-a   0.6037  0.6995  0.4392  0.7017  0.8648  0.5687  0.6548  0.7421  0.7156
MKKM     0.4752  0.7683  0.5285  0.3046  0.5895  0.4347  0.6548  0.7283  0.6914
AASC     0.4233  0.3308  0.3156  0.3498  0.3914  0.3029  0.5467  0.6205  0.5749
RMKKM    0.5364  0.8890  0.6023  0.3678  0.6995  0.4627  0.7293  0.7757  0.7520
MKLM     0.6848  1.0000  0.7925  0.8440  0.9021  0.5412  0.7101  0.7836  0.8493

5.4. Experimental results

Table 2 summarizes the clustering results for all methods on all the data sets. It can be observed that our proposed methods provide the best results in most cases. Among the other state-of-the-art methods, SSR performs best on most data sets; the gap between SSR and our methods demonstrates the advantage of the kernel approach. Comparing the best and average results, we can observe large differences on some data sets. This is consistent with the general conclusion that the performance of single kernel methods is often determined by the choice of kernel function, which also motivates the development of multiple kernel learning. Moreover, we can observe that the multiple kernel learning methods often perform close to, or better than, the best single kernel. Without performing an exhaustive search over a pool of candidate kernels, multiple kernel learning methods would therefore be very useful in practice.

5.5. Parameter analysis

Our model MKLM has just one parameter λ, which controls the sparsity of the similarity matrix A. We also introduce an auxiliary parameter μ during the optimization. Here we evaluate their impact empirically. We let λ vary in the range {1e-7, 1e-6, 1e-5, 1e-4, 1e-3} and μ in {0.01, 0.1, 1}. We repeat MKLM 10 times and report the average value. Figs. 4 and 5 show how the clustering results in terms of Acc, NMI, and Purity vary with λ and μ on the JAFFE and Yale data sets. They show that the performance of MKLM is stable over a large range of λ values.


Fig. 4. The effect of parameters λ and μ on the JAFFE data set.

Fig. 5. The effect of parameters λ and μ on the Yale data set.

Table 3. Description of data sets.

Dataset     #Users   #Items   #Transactions   rsize   csize    density (%)   Ratings
Delicious   1300     4516     17,550          13.50   3.89     0.29          {0, 1}
Lastfm      8813     6038     332,486         37.7    55.07    0.62          {0, 1}
BX          4186     7733     182,057         43.49   23.54    0.56          {0, 1}
FilmTrust   1508     2071     35,497          23.54   17.14    1.14          {0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4}
Netflix     39,884   8478     1,256,115       31.49   148.16   0.37          {1, 2, 3, 4, 5}
Yahoo       85,325   55,371   3,973,104       46.56   71.75    0.08          {1, 2, 3, 4, 5}

In the above table, "rsize" is the average number of ratings given by each user, "csize" is the average number of ratings that each item possesses, and "density" is computed as #transactions/(#users × #items).

6. Application to recommendation

In this section, we apply our proposed methods to the Top-N recommendation problem.

6.1. Data sets

Generally, there are two types of data: explicit and implicit feedback data. Ratings and review texts belong to the former group, while user logs, bookmarks, transactions, and check-ins are referred to as implicit data. We assess the effectiveness of our methods on both types of data. The statistics of our data sets are summarized in Table 3. More precisely, Delicious consists of bookmarking and tagging information (http://www.delicious.com). Lastfm records music artist listening history (http://www.last.fm). BX is extracted from the Book-Crossing data set (http://www.informatik.uni-freiburg.de/~cziegler/BX/). FilmTrust is crawled from the FilmTrust website (http://www.librec.net/datasets.html) [48]. Netflix is part of the Netflix Prize data set (http://www.netflixprize.com/). The Yahoo data set is from Yahoo!Movies user ratings (http://webscope.sandbox.yahoo.com/catalog.php?datatype=r).

6.2. Evaluation methodology and metrics

Following a similar strategy to other top-N recommendation studies, we apply a 5-fold cross-validation strategy. For each fold, we split the data set into training and test sets. For each user, one non-zero element is randomly selected and its value is set to zero; we then check its predicted value on the test set [35]. The test set is nothing but the IDs of those held-out items. After obtaining the similarity matrix A, we reconstruct the rating matrix $\hat{X}$ by $\hat{X} = XA$. For each user, we rank the predicted ratings of the unrated items, and the first N items are recommended.
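The prediction step described above amounts to one matrix product followed by a per-user ranking. The sketch below, with a hypothetical `recommend_top_n` helper and toy random data, illustrates it; it is not the authors' code.

```python
import numpy as np

def recommend_top_n(X, A, N=10):
    """Given a user-item matrix X and a learned item-item similarity A,
    score items by X_hat = X A and return the top-N unrated items per user."""
    X_hat = X @ A
    X_hat[X > 0] = -np.inf                 # never recommend items the user already rated
    return np.argsort(-X_hat, axis=1)[:, :N]   # item ids sorted by predicted score

# Example with a tiny random binary feedback matrix (illustration only).
rng = np.random.default_rng(0)
X = (rng.random((6, 12)) > 0.7).astype(float)
A = rng.random((12, 12))
np.fill_diagonal(A, 0.0)
top_n = recommend_top_n(X, A, N=3)
```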


For top-N recommendation, the ideal situation is that a short recommendation list contains the items we are interested in [49]. Thus the most direct and meaningful metrics are the hit-rate (HR) and the average reciprocal hit-rank (ARHR) [50]. They are defined as follows:

$$\mathrm{HR} = \frac{\#\mathrm{hits}}{\#\mathrm{users}}, \qquad \mathrm{ARHR} = \frac{1}{\#\mathrm{users}}\sum_{i=1}^{\#\mathrm{hits}} \frac{1}{p_i},$$

where #hits is the number of users whose held-out item is contained in the size-N recommendation list, #users is the total number of users, and $p_i$ is the position of the ith hit in the ranked size-N list. The maximum HR value is 1.0, corresponding to the case that the hidden item is always recommended; by contrast, a value of 0.0 means that the hidden items are never recommended. HR does not care where the hidden item appears in the list. ARHR measures how strongly an item is recommended by rewarding each hit with the inverse of its position.
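A compact way to compute the two metrics from per-user recommendation lists is sketched below; the function name and the toy example are ours, not from the paper.

```python
def hr_arhr(ranked_lists, held_out):
    """HR and ARHR for top-N lists: ranked_lists[u] is the ordered list of
    recommended item ids for user u, held_out[u] is that user's hidden item."""
    hits, rank_sum, n_users = 0, 0.0, len(held_out)
    for recs, target in zip(ranked_lists, held_out):
        if target in recs:
            hits += 1
            rank_sum += 1.0 / (list(recs).index(target) + 1)   # position p_i is 1-based
    return hits / n_users, rank_sum / n_users

# Example: two users, top-3 lists, one hit at position 2.
print(hr_arhr([[5, 9, 2], [1, 4, 7]], [9, 8]))   # -> (0.5, 0.25)
```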

6.3. Compared algorithms

To demonstrate how the recommendation quality can be improved by the proposed methods, we compare them with six state-of-the-art recommendation algorithms from different categories: the item neighborhood-based collaborative filtering method ItemKNN [50]; the matrix factorization-based methods PureSVD (pure singular value decomposition) [51] and weighted regularized matrix factorization (WRMF) [52]; the sparse representation-based SLIM [35]; and the ranking-oriented models BPRMF and BPRKNN [53].


Fig. 6. Performance versus different values of N.

Table 4. Comparison of top-N recommendation algorithms. For each data set, the left column is HR and the right column is ARHR.

Method    Delicious       Lastfm          BX              FilmTrust       Netflix         Yahoo
          HR     ARHR     HR     ARHR     HR     ARHR     HR     ARHR     HR     ARHR     HR     ARHR
ItemKNN   0.300  0.179    0.125  0.075    0.045  0.026    0.583  0.352    0.155  0.085    0.315  0.184
PureSVD   0.285  0.172    0.134  0.078    0.043  0.023    0.601  0.369    0.159  0.090    0.211  0.118
WRMF      0.330  0.198    0.138  0.078    0.047  0.027    0.604  0.371    0.174  0.096    0.253  0.130
BPRKNN    0.326  0.187    0.145  0.083    0.047  0.028    0.625  0.391    0.163  0.088    0.307  0.180
BPRMF     0.335  0.183    0.129  0.073    0.048  0.027    0.610  0.375    0.145  0.075    0.312  0.182
SLIM      0.343  0.213    0.141  0.082    0.050  0.029    0.628  0.397    0.174  0.099    0.322  0.188
SKLM      0.388  0.220    0.187  0.091    0.058  0.029    0.642  0.403    0.186  0.101    0.400  0.201
MKLM      0.401  0.226    0.198  0.095    0.061  0.031    0.657  0.410    0.197  0.108    0.405  0.206

6.4. Experimental results

For a simple demonstration, we apply five kernels in this experiment: four Gaussian kernels with t varying over {0.1, 1, 10, 100}, and a linear kernel. We test SKLM on the Gaussian kernels and report the best results (note that SLIM corresponds to the linear kernel). For MKLM, we consider a combination of all five kernels.

Table 4 summarizes the results for N = 10. As it shows, our methods yield very impressive results on the Delicious, Lastfm, and Yahoo data sets. On the other data sets the improvement is not as substantial; this could be improved by carefully tuning the kernel parameter t. For ARHR, our results are not strikingly better than SLIM; in fact, ARHR is less important, since we often consider only a short recommendation list. Again, MKLM is a little better than SKLM. Fig. 6 displays the HR values for different values of N (i.e., 5, 10, 15, 20, and 25). Our methods are consistently better than the other approaches. The performance of the various methods in terms of ARHR over these data sets can be plotted similarly, and the conclusions are the same. The experimental results fully demonstrate that SKLM and MKLM discover nonlinear structures that cannot be revealed by existing methods.

7. Conclusion

In this paper, we learn the similarity among sample points by integrating kernel functions into the sparse representation model. In our proposed method, the similarity is learned from the data, so it is more unbiased than similarity measured by traditional metrics. With the sparse representation, how to select an appropriate neighborhood size is no longer an issue. Therefore, the learned similarity matrix is well suited to many other algorithms, such as graph Laplacian construction and spectral clustering. Our multiple kernel learning method automatically learns an appropriate kernel from a pool of kernels. This significantly reduces the effort of kernel selection and thus makes the method more practical in real applications. Applications to clustering and recommender systems demonstrate the superiority of our methods. Besides, it is easy to extend our proposed methods to multi-view data sets.

Acknowledgment

This work is supported by the U.S. National Science Foundation under Grant IIS 1218712.

References

[1] C.A.R. de Sousa, S.O. Rezende, G.E. Batista, Influence of graph construction on semi-supervised learning, in: Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2013, pp. 160–175. [2] J. Huang, F. Nie, H. Huang, A new simplex sparse learning model to measure data similarity for clustering, in: Proceedings of the Twenty-fourth International Conference on Artificial Intelligence, AAAI Press, 2015, pp. 3569–3575. [3] Z. Kang, C. Peng, Y. Ming, Q. Cheng, Top-N recommendation on graphs, in: Proceedings of the Twenty-fifth ACM International Conference on Information and Knowledge Management, ACM, 2016.


[4] P. Kar, P. Jain, Similarity-based learning via data driven embeddings, in: Proceedings of the Advances in Neural Information Processing Systems, 2011, pp. 1998–2006. [5] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326. [6] F. Nie, D. Xu, I.W.-H. Tsang, C. Zhang, Flexible manifold embedding: a framework for semi-supervised and unsupervised dimension reduction, IEEE Trans. Image Process. 19 (7) (2010) 1921–1932. [7] J.B. Tenenbaum, V. De Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323. [8] X. Niyogi, Locality preserving projections, in: Neural Information Processing Systems, 16, MIT, 2004, p. 153. [9] D.L. Donoho, For most large under determined systems of linear equations the minimal 1-norm solution is also the sparsest solution, Commun. Pure Appl. Math. 59 (6) (2006) 797–829. [10] E.J. Candes, T. Tao, Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Trans. Inf. Theory 52 (12) (2006) 5406–5425. [11] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210–227. [12] K. Yu, T. Zhang, Y. Gong, Nonlinear learning using local coordinate coding, in: Proceedings of the Advances in Neural Information Processing Systems, 2009, pp. 2223–2231. [13] E. Elhamifar, R. Vidal, Sparse subspace clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2009, IEEE, 2009, pp. 2790–2797. [14] M. Elad, M. Aharon, Image denoising via sparse and redundant representations over learned dictionaries, IEEE Trans. Image Process. 15 (12) (2006) 3736–3745. [15] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, M. Harandi, Kernel methods on the Riemannian manifold of symmetric positive definite matrices, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 73–80. [16] P. Li, Q. Wang, W. Zuo, L. Zhang, Log-Euclidean kernels for sparse representation and dictionary learning, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1601–1608. [17] B. Schölkopf, A. Smola, K.-R. Müller, Kernel principal component analysis, in: Proceedings of the International Conference on Artificial Neural Networks, Springer, 1997, pp. 583–588. [18] H. Zeng, Y.-m. Cheung, Feature selection and kernel learning for local learning-based clustering, IEEE Trans. Pattern Anal. Mach. Intell. 33 (8) (2011) 1532–1547. [19] S. Yu, L. Tranchevent, X. Liu, W. Glanzel, J.A. Suykens, B. De Moor, Y. Moreau, Optimized data fusion for kernel k-means clustering, IEEE Trans. Pattern Anal. Mach. Intell. 34 (5) (2012) 1031–1039. [20] H. Qi, S. Hughes, Using the kernel trick in compressive sensing: accurate signal recovery from fewer measurements, in: Proceedings of the 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2011, pp. 3940–3943. [21] S. Gao, I.W.-H. Tsang, L.-T. Chia, Sparse representation with kernels, IEEE Trans. Image Process. 22 (2) (2013) 423–434. [22] J.J. Thiagarajan, K.N. Ramamurthy, A. Spanias, Multiple kernel sparse representations for supervised and unsupervised learning, IEEE Trans. Image Process. 23 (7) (2014) 2905–2915. [23] H. Van Nguyen, V.M. Patel, N.M. Nasrabadi, R. Chellappa, Design of non-linear kernel dictionaries for object recognition, IEEE Trans. Image Process. 
22 (12) (2013) 5123–5135. [24] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2010), IEEE, 2010, pp. 3360–3367. [25] S. Gao, I.W.-H. Tsang, L.-T. Chia, Kernel sparse representation for image classification and face recognition, in: Proceedings of the European Conference on Computer Vision, Springer, 2010, pp. 1–14. [26] L. Zhang, W.-D. Zhou, P.-C. Chang, J. Liu, Z. Yan, T. Wang, F.-Z. Li, Kernel sparse representation-based classifier, IEEE Trans. Signal Process. 60 (4) (2012) 1684–1695. [27] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Comput. 10 (5) (1998) 1299–1319. [28] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, Adv. Neural Inf. Process. Syst. 2 (2002) 849–856. [29] H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Multiple kernel fuzzy clustering, IEEE Trans. Fuzzy Syst. 20 (1) (2012a) 120–134. [30] H.-C. Huang, Y.-Y. Chuang, C.-S. Chen, Affinity aggregation for spectral clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), IEEE, 2012b, pp. 773–780. [31] L. Du, P. Zhou, L. Shi, H. Wang, M. Fan, W. Wang, Y.-D. Shen, Robust multiple kernel k-means using 2; 1-norm, in: Proceedings of the Twenty-fourth International Conference on Artificial Intelligence, AAAI Press, 2015, pp. 3476–3482. [32] F. Ricci, L. Rokach, B. Shapira, Introduction to recommender systems handbook, Springer, US, 2011.

[33] J.S. Breese, D. Heckerman, C. Kadie, Empirical analysis of predictive algorithms for collaborative filtering, in: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers Inc., 1998, pp. 43–52. [34] H.J. Ahn, A new similarity measure for collaborative filtering to alleviate the new user cold-starting problem, Inf. Sci. 178 (1) (2008) 37–51. [35] X. Ning, G. Karypis, SLIM: sparse linear methods for top-N recommender systems, in: Proceedings of the Eleventh IEEE International Conference on Data Mining, IEEE, 2011, pp. 497–506. [36] E. Christakopoulou, Moving beyond linearity and independence in top-N recommender systems, in: Proceedings of the Eighth ACM Conference on Recommender Systems, ACM, 2014, pp. 409–412. [37] N. Aronszajn, Theory of reproducing kernels, Trans. Am. Math. Soc. 68 (3) (1950) 337–404. [38] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (1) (2011) 1–122. [39] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imaging Sci. 2 (1) (2009) 183–202. [40] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, J. Mach. Learn. Res. 12 (Jul) (2011) 2211–2268. [41] X. Cai, F. Nie, W. Cai, H. Huang, Heterogeneous image features integration via multi-modal semi-supervised learning model, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 1737–1744. [42] A. Martinez, R. Benavente, The AR Face Database, Technical Report, 1998. [43] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report CUCS-005-96, 1996. [44] D. Cai, X. He, X. Wang, H. Bao, J. Han, Locality preserving nonnegative matrix factorization, in: Proceedings of the 2009 International Joint Conference on Artificial Intelligence (IJCAI), 2009, pp. 1010–1015. [45] S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, SIAM Rev. 43 (1) (2001) 129–159. [46] Y. Zhao, G. Karypis, Criterion Functions for Document Clustering: Experiments and Analysis, Technical Report, 2001. [47] Y. Yang, D. Xu, F. Nie, S. Yan, Y. Zhuang, Image clustering using local discriminant models and global integration, IEEE Trans. Image Process. 19 (10) (2010) 2761–2773. [48] G. Guo, J. Zhang, N. Yorke-Smith, A novel Bayesian similarity measure for recommender systems, in: Proceedings of the 2013 International Joint Conference on Artificial Intelligence (IJCAI), 2013. [49] A. Bellogin, P. Castells, I. Cantador, Precision-oriented evaluation of recommender systems: an algorithmic comparison, in: Proceedings of the Fifth ACM Conference on Recommender Systems, ACM, 2011, pp. 333–336. [50] M. Deshpande, G. Karypis, Item-based top-N recommendation algorithms, ACM Trans. Inf. Syst. (TOIS) 22 (1) (2004) 143–177. [51] P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-N recommendation tasks, in: Proceedings of the Fourth ACM Conference on Recommender Systems, ACM, 2010, pp. 39–46. [52] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: Proceedings of the IEEE International Conference on Data Mining (ICDM), IEEE, 2008, pp. 263–272. [53] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the Twenty-fifth Conference on Uncertainty in Artificial Intelligence, AUAI Press, 2009, pp. 452–461. [54] Z. Kang, C. Peng, J. Cheng, Q. Cheng, Logdet rank minimization with application to subspace clustering, Comput. Intell. Neurosci. 2015 (2015) 68. [55] Z. Kang, C. Peng, Q. Cheng, Twin learning for similarity and clustering: a unified kernel approach, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), AAAI Press, 2017, pp. 2080–2086.

Zhao Kang received his M.S. degree in physics from Sichuan University, China, in 2011. He received his Ph.D. degree in Computer Science from Southern Illinois University Carbondale, USA, in 2017. His research interests are machine learning, data mining, social networks, information retrieval, and deep learning. He has published over 20 research papers in top-tier conferences and journals, including AAAI, ICDE, CVPR, SIGKDD, ICDM, CIKM, SDM, ACM Transactions on Intelligent Systems and Technology, ACM Transactions on Knowledge Discovery from Data, Neurocomputing, and IEEE Signal Processing Letters.

Chong Peng received his B.S. degree in statistics from Qingdao University, China, in 2012. Currently, he is a Ph.D. student in computer science at Southern Illinois University Carbondale. His research interests are machine learning, data mining, and computer vision.


Qiang Cheng received the B.S. and M.S. degrees in mathematics, applied mathematics, and computer science from Peking University, China, and the Ph.D. degree from the Department of Electrical and Computer Engineering at the University of Illinois, Urbana-Champaign. Currently, he is an associate professor at the Department of Computer Science at Southern Illinois University Carbondale. He previously was an AFOSR faculty fellow at the Air Force Research Laboratory, Wright-Patterson, Ohio, and a senior researcher and senior research scientist at Siemens Medical Solutions, Siemens Corporate Research, Siemens Corp., Princeton, New Jersey. His research interests include pattern recognition, machine learning, signal and image processing, and biomedical informatics. He received various awards and privileges. He was on the organizing committee of a number of international conferences and workshops. He has a number of international patents issued or filed with the IBM T.J. Watson Research Laboratory, Yorktown Heights, Siemens Medical, Princeton, and Southern Illinois University, Carbondale.