Neurocomputing 267 (2017) 447–454
Multiple kernel clustering with corrupted kernels

Teng Li*, Yong Dou, Xinwang Liu, Yang Zhao, Qi Lv

National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, PR China

* Corresponding author. E-mail addresses: [email protected], [email protected] (T. Li), [email protected] (Y. Dou), [email protected] (X. Liu), [email protected] (Y. Zhao), [email protected] (Q. Lv).
Article history: Received 18 April 2016; Revised 14 March 2017; Accepted 11 June 2017; Available online 29 June 2017. Communicated by K. Li.

Keywords: Kernel method; Clustering; Multiple kernel clustering; Corrupted kernels
Abstract

Multiple kernel clustering (MKC) algorithms usually learn an optimal kernel from a group of pre-specified base kernels to improve the clustering performance. However, we observe that existing MKC algorithms do not handle well situations in which kernels are corrupted with noise and outliers. In this paper, we first propose a novel method to learn an optimal consensus kernel from a group of pre-specified kernel matrices, each of which can be decomposed into the optimal consensus kernel matrix and a sparse error matrix. Further, we propose a scheme to address the problem of a considerable number of corrupted kernels, where each given kernel is adaptively adjusted according to its corresponding error matrix. An inexact augmented Lagrange multiplier scheme is developed for solving the corresponding optimization problem, in which the optimal consensus kernel and the localized weight variables are jointly optimized. Extensive experiments demonstrate the effectiveness and robustness of the proposed algorithm.
1. Introduction

Clustering algorithms aim to find a meaningful grouping of samples in an unsupervised manner in the machine learning [1], computer vision [2,3] and data mining [4,5] fields. Kernel-based clustering methods, such as kernel k-means [6], have the advantage of handling non-linearly separable clusters and usually achieve improved clustering performance. The performance of these kernel-based clustering methods is highly dependent on kernel selection. In a practical scenario, we can construct various kernels because many different kernel functions with different parameters exist. Since the information carried by these pre-specified kernels is unknown in advance, it is difficult to select suitable kernels for the clustering task. Multiple kernel clustering methods, which extend the traditional single kernel method to the multiple kernel setting, have been studied actively and have shown state-of-the-art results in recent years. Traditionally, multiple kernel clustering methods learn the optimal kernel through a linear or nonlinear combination of multiple base kernels. Huang et al. [7] seek an optimal combination of affinity matrices so that it is more immune to ineffective affinities and irrelevant features. Huang et al. [8] propose a multiple kernel clustering algorithm by incorporating multiple kernels and automatically adjusting the kernel weights. In [9], the kernel weights are assigned according to the information of the
corresponding view, and a parameter is used to control the sparsity of these weights. Zhang and Hu [10] propose a localized multiple kernel clustering method dedicated to datasets with varying local distributions. Gönen and Margolin [11] combine kernels calculated on the views in a localized way to better capture sample-specific characteristics of the data. Du et al. [12] present a robust multiple kernel k-means algorithm by replacing the sum-of-squared loss with the ℓ2,1-norm. Nevertheless, in a practical scenario, data may be corrupted with noise and outliers, which results in the corresponding kernels being corrupted as well. Once these pre-specified kernels are corrupted with noise and outliers, the optimal kernel is likely to be corrupted too, which may consequently degrade the clustering performance. Moreover, in multiple kernel clustering we usually construct a large number of kernels to exploit the advantages of the kernel method as much as possible. If the original data have been corrupted by noise and outliers, the number of corrupted kernels will increase, which may worsen the performance of multiple kernel clustering. Since the optimal kernel is learned from these pre-specified kernels, an increasing number of corrupted kernels may have a fatal effect on the final optimal kernel learning. In addition, we have observed that existing multiple kernel learning methods do not sufficiently consider corruption among these kernels. This can result in learning the optimal kernel inaccurately and degrading the clustering performance. In this paper, we propose a robust multiple kernel clustering method for corrupted kernels. Compared with previous studies, our method is better at capturing the underlying structure of the data and can effectively address the situation where a considerable number of
kernels are corrupted with noise and outliers. To the best of our knowledge, this study is the first to learn the optimal consensus kernel based on the assumption that each pre-specified kernel, obtained from a different kernel function, can be decomposed into the optimal consensus kernel matrix and a sparse error matrix. In summary, we highlight the main contributions of this paper as follows:
• We propose a novel method for learning an optimal consensus kernel based on the assumption that each pre-specified kernel from a different kernel function can be decomposed into the optimal consensus kernel matrix and a sparse error matrix.
• To address the problem of a considerable number of corrupted kernels, we enable the utilization of each given kernel to be adaptively adjusted by assigning a weight variable to each error matrix.
• The optimal consensus kernel and the localized weight variables are obtained by solving a constrained nuclear-norm and ℓ1-norm minimization problem, in which each subproblem is convex and can be solved efficiently by optimizing one variable while fixing the others through an inexact augmented Lagrange multiplier scheme.
• We conduct comprehensive experiments to compare the proposed approach with existing state-of-the-art multiple kernel clustering methods on four benchmark datasets. The experimental results demonstrate the superiority of the proposed method.
The remainder of this paper is organized as follows. We introduce the proposed Multiple Kernel Clustering with Corrupted Kernels (MKCCK) algorithm in Section 2. An efficient alternate optimization algorithm is proposed in Section 3, where the details of the algorithm are also provided. The discussion is presented in Section 4. We compare the clustering performance of MKCCK and state-of-the-art multiple kernel clustering algorithms in Section 5, where the robustness study and parameter sensitivity are also presented. Finally, conclusions are drawn in Section 6.

2. Multiple kernel clustering with corrupted kernels

In this section, we present multiple kernel clustering by learning an optimal consensus kernel from the pre-specified kernels. To address the problem of corrupted kernels, we further enable the utilization of each pre-specified kernel to be adaptively adjusted.
2.1. Formulation

Given N samples, we construct m kernels K_1, K_2, ..., K_m, each an N × N matrix obtained from a different kernel formulation, with the goal of learning an optimal consensus kernel K from these pre-specified base kernels and clustering the samples into their respective groups. To capture the underlying structure of the correct information while removing from the consensus kernel the noise that may degrade the clustering performance, we consider the problem under the following conditions: (1) the consensus kernel is low-rank and tends to be block diagonal, with minimal noise; (2) noise and outliers might be introduced during data acquisition, so a fraction of the entries of these base kernels might be corrupted. Considering these conditions, we formulate the unified optimization framework as follows:

\min_{K, E_p} \mathrm{rank}(K) + \lambda \sum_{p=1}^{m} \|E_p\|, \quad \text{s.t. } \forall p,\ K_p = K + E_p,   (1)

where the norm \|\cdot\| on the error matrix E_p depends on the prior knowledge about the pattern of the corruptions, and λ is a trade-off parameter between the two terms. In a practical scenario, the ℓ1-norm \|\cdot\|_1 models random element-wise corruptions. To address this problem generally, we formulate our framework based on the ℓ1-norm \|\cdot\|_1. Equipped with the ℓ1-norm of E_p, we can reformulate the unified optimization framework as follows:

\min_{K, E_p} \mathrm{rank}(K) + \lambda \sum_{p=1}^{m} \|E_p\|_1, \quad \text{s.t. } \forall p,\ K_p = K + E_p.   (2)

The optimization problem (2) is difficult to solve because of the discrete nature of the rank function. By replacing the rank function with the nuclear norm, we convert the problem into the following convex optimization, which provides a good surrogate for problem (2):

\min_{K, E_p} \|K\|_* + \lambda \sum_{p=1}^{m} \|E_p\|_1, \quad \text{s.t. } \forall p,\ K_p = K + E_p.   (3)

As problem (3) shows, the consensus kernel K is learned from the pre-specified kernels K_p. However, we cannot guarantee that all these pre-specified kernels carry the same amount of corruption in practice. If some of the kernels are severely corrupted by noise and outliers, the norms of the corresponding error matrices become extremely large and directly dominate the objective function in problem (3). To alleviate the effect of the corrupted kernels on the consensus kernel learning and to recover the underlying structure of the consensus kernel correctly, we assign a localized weight variable α_p to each error matrix. Therefore, we formulate the unified optimization framework as

\min_{K, E_p, \alpha_p} \|K\|_* + \lambda \sum_{p=1}^{m} \alpha_p \|E_p\|_1, \quad \text{s.t. } \forall p,\ K_p = K + E_p,\ \sum_{p=1}^{m} \alpha_p^{\gamma} = 1,\ \alpha_p \ge 0,\ 0 < \gamma < 1,   (4)

where α_p ≥ 0 controls the weight of each error matrix and the constraint \sum_{p} \alpha_p^{\gamma} = 1 avoids a trivial solution. As observed in problem (4), once a pre-specified kernel is severely corrupted by noise or outliers, the corresponding α_p is assigned a small weight so as to minimize the objective function. Conversely, once a pre-specified kernel preserves a good underlying data structure, the corresponding α_p is assigned a large weight to reduce the influence induced by the other, corrupted kernels.

3. Alternate optimization

We propose to handle the introduced equality constraints through an inexact augmented Lagrangian method (inexact ALM) [13], whose augmented Lagrangian is

\mathcal{L}(K, E_p, \alpha_p, Y_p, \mu) = \|K\|_* + \lambda \sum_{p=1}^{m} \alpha_p \|E_p\|_1 + \sum_{p=1}^{m} \langle Y_p, K_p - K - E_p \rangle + \frac{\mu}{2} \sum_{p=1}^{m} \|K_p - K - E_p\|_F^2,   (5)

where μ is the penalty factor that controls the rate of convergence of the inexact ALM, Y_p is the Lagrange multiplier for the constraint K_p = K + E_p, \langle\cdot,\cdot\rangle denotes the inner product, and \|\cdot\|_F is the Frobenius norm. Problem (5) is solved by alternately optimizing each variable while keeping the others fixed; the overall procedure is summarized in Algorithm 1.

Algorithm 1. Solving problem (5) by inexact ALM.
Input: kernels K_p, p = 1, ..., m; parameters λ, γ.
Initialize: K = 0, E_p = 0, Y_p = 0, α_p = (1/m)^{1/γ} for p = 1, ..., m; μ_0 = 10^{-5}, μ_max = 10^{15}, ρ = 1.5, ε = 10^{-6}.
repeat
  Obtain the consensus kernel K by Eq. (6);
  Obtain the error matrix E_p for each p by Eq. (10);
  Compute the localized weight variables α_p by Eq. (12);
  Update the Lagrange multipliers Y_p by Eq. (16);
  Update the penalty parameter μ by Eq. (17);
  k = k + 1;
until max_p ‖K_p − K − E_p‖_F < ε
Output: the consensus kernel K.
3.1. Optimizing K with E and α fixed

When the other variables are fixed, the subproblem is

K^{(k+1)} = \arg\min_{K} \mathcal{L}(K, E_p^{(k)}, \alpha_p^{(k)}, Y_p^{(k)}, \mu^{(k)}),   (6)

which can be solved by the singular value thresholding (SVT) [14] method:

(U, \Sigma, V) = \mathrm{SVD}\left( \frac{1}{m}\sum_{p=1}^{m} K_p - \frac{1}{m}\sum_{p=1}^{m} E_p^{(k)} + \frac{1}{m\mu^{(k)}}\sum_{p=1}^{m} Y_p^{(k)} \right),   (7)

K^{(k+1)} = U\, S_{1/\mu^{(k)}}[\Sigma]\, V^{T},   (8)

where S is the shrinkage (soft-thresholding) operator used for singular value thresholding, defined as

S_{\varepsilon}(x) = \mathrm{sgn}(x)\max(|x| - \varepsilon, 0).   (9)
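To make the K-step concrete, the following is a minimal NumPy sketch of the singular value thresholding update in Eqs. (7)–(9). It is our own illustrative re-implementation, not the authors' released code; the names update_K and soft_threshold are placeholders.

```python
import numpy as np

def soft_threshold(X, eps):
    """Shrinkage operator of Eq. (9): S_eps(x) = sgn(x) * max(|x| - eps, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - eps, 0.0)

def update_K(K_list, E_list, Y_list, mu):
    """Consensus kernel update of Eqs. (7)-(8): SVT applied to the averaged residual."""
    m = len(K_list)
    M = sum(Kp - Ep + Yp / mu for Kp, Ep, Yp in zip(K_list, E_list, Y_list)) / m
    U, s, Vt = np.linalg.svd(M, full_matrices=False)        # Eq. (7)
    return U @ np.diag(soft_threshold(s, 1.0 / mu)) @ Vt    # Eq. (8)
```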
3.2. Optimizing E with K and α fixed

We update each E_p by solving the following problem:

E_p^{(k+1)} = \arg\min_{E_p} \mathcal{L}(K^{(k+1)}, E_p, \alpha_p^{(k)}, Y_p^{(k)}, \mu^{(k)}),   (10)

whose solution is given by

E_p^{(k+1)} = S_{\lambda\alpha_p^{(k)}/\mu^{(k)}}\left[ K_p - K^{(k+1)} + \frac{1}{\mu^{(k)}} Y_p^{(k)} \right].   (11)
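For reference, Eq. (11) amounts to one element-wise soft-thresholding per base kernel, as in the following sketch (again an illustrative implementation with assumed variable names, reusing the shrinkage operator of Eq. (9)).

```python
import numpy as np

def soft_threshold(X, eps):
    # Shrinkage operator of Eq. (9), applied element-wise.
    return np.sign(X) * np.maximum(np.abs(X) - eps, 0.0)

def update_E(K_list, K, Y_list, alpha, lam, mu):
    """Error matrix update of Eq. (11): one soft-thresholding per base kernel."""
    return [soft_threshold(Kp - K + Yp / mu, lam * a_p / mu)
            for Kp, Yp, a_p in zip(K_list, Y_list, alpha)]
```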
3.3. Optimizing α with K and E fixed

We update α_p by solving the following problem:

\alpha_p^{(k+1)} = \arg\min_{\alpha} \mathcal{L}(K^{(k+1)}, E_p^{(k+1)}, \alpha_p, Y_p^{(k)}, \mu^{(k)}),   (12)

that is,

\mathcal{L}(\alpha) = \sum_{p=1}^{m} \alpha_p \|E_p\|_1, \quad \text{s.t. } \sum_{p=1}^{m} \alpha_p^{\gamma} = 1,\ \alpha_p \ge 0.   (13)

By introducing a Lagrange multiplier β, the Lagrangian of Eq. (13) is

J(\alpha) = \sum_{p=1}^{m} \alpha_p \|E_p\|_1 - \beta\left( \sum_{p=1}^{m} \alpha_p^{\gamma} - 1 \right).   (14)

To obtain the optimal solution of this subproblem, we set the derivative of Eq. (14) with respect to α_p to zero and combine it with the constraint \sum_{p} \alpha_p^{\gamma} = 1, which yields the optimal α_p as

\alpha_p = \frac{\left( \|E_p^{(k+1)}\|_1 \right)^{\frac{1}{\gamma-1}}}{\left( \sum_{p=1}^{m} \left( \|E_p^{(k+1)}\|_1 \right)^{\frac{\gamma}{\gamma-1}} \right)^{\frac{1}{\gamma}}}.   (15)
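The closed-form weight update of Eq. (15) can be transcribed directly into NumPy as below. The small constant added to the ℓ1-norms is our own safeguard against division by zero and is not part of the original formulation.

```python
import numpy as np

def update_alpha(E_list, gamma, tiny=1e-12):
    """Localized weights of Eq. (15): alpha_p proportional to ||E_p||_1^{1/(gamma-1)},
    normalized so that sum_p alpha_p^gamma = 1 (with 0 < gamma < 1)."""
    e = np.array([np.abs(Ep).sum() for Ep in E_list]) + tiny   # ||E_p||_1
    num = e ** (1.0 / (gamma - 1.0))
    den = (e ** (gamma / (gamma - 1.0))).sum() ** (1.0 / gamma)
    return num / den
```

Because 1/(γ − 1) is negative for 0 < γ < 1, a kernel with a large error norm automatically receives a small weight, which matches the behaviour discussed after problem (4).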
3.4. Lagrange multiplier update

The Lagrange multipliers are updated as

Y_p^{(k+1)} = Y_p^{(k)} + \mu^{(k)}\left( K_p - K^{(k+1)} - E_p^{(k+1)} \right),   (16)

where K^{(k+1)} and E_p^{(k+1)} are the current solutions of the above subproblems at iteration (k+1).

3.5. Penalty parameter update

A simple and common scheme for selecting μ is the following [15]:

\mu^{(k+1)} = \min(\mu_{\max}, \rho\mu^{(k)}).   (17)

We found experimentally that μ_0 = 10^{-5}, ρ = 1.5 and μ_max = 10^{15} perform well.
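Putting the updates of Sections 3.1–3.5 together, Algorithm 1 can be sketched end-to-end as follows. This is a minimal, self-contained NumPy illustration under the initialization stated in Algorithm 1; it is not the authors' released implementation, and the default values of λ and γ are arbitrary placeholders.

```python
import numpy as np

def soft_threshold(X, eps):
    # Shrinkage operator of Eq. (9).
    return np.sign(X) * np.maximum(np.abs(X) - eps, 0.0)

def mkcck(K_list, lam=2.0, gamma=0.5, mu=1e-5, mu_max=1e15, rho=1.5,
          eps=1e-6, max_iter=500):
    """Learn the consensus kernel K of problem (4) from base kernels K_list by inexact ALM."""
    m, n = len(K_list), K_list[0].shape[0]
    K = np.zeros((n, n))
    E = [np.zeros((n, n)) for _ in range(m)]
    Y = [np.zeros((n, n)) for _ in range(m)]
    alpha = np.full(m, (1.0 / m) ** (1.0 / gamma))          # so that sum alpha^gamma = 1
    for _ in range(max_iter):
        # K-update, Eqs. (7)-(8)
        M = sum(Kp - Ep + Yp / mu for Kp, Ep, Yp in zip(K_list, E, Y)) / m
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        K = U @ np.diag(soft_threshold(s, 1.0 / mu)) @ Vt
        # E-update, Eq. (11)
        E = [soft_threshold(Kp - K + Yp / mu, lam * a / mu)
             for Kp, Yp, a in zip(K_list, Y, alpha)]
        # alpha-update, Eq. (15)
        e = np.array([np.abs(Ep).sum() for Ep in E]) + 1e-12
        alpha = (e ** (1.0 / (gamma - 1.0))
                 / (e ** (gamma / (gamma - 1.0))).sum() ** (1.0 / gamma))
        # Multiplier and penalty updates, Eqs. (16)-(17)
        Y = [Yp + mu * (Kp - K - Ep) for Yp, Kp, Ep in zip(Y, K_list, E)]
        mu = min(mu_max, rho * mu)
        if max(np.linalg.norm(Kp - K - Ep) for Kp, Ep in zip(K_list, E)) < eps:
            break
    return K
```

The learned consensus kernel K can then be fed to a standard kernel clustering routine such as kernel k-means to obtain the final cluster assignments.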
4. Discussion

4.1. Complexity analysis

The time complexity of constructing each kernel is O(n²), and that of an eigenvalue decomposition is O(n³). In our proposed algorithm, updating K costs O(n³) because of the eigenvalue decomposition. Updating E involves only element-wise operations, so its cost is O(n²·m), where m is the number of kernels. The cost of updating α by Eq. (12) is O(n²·m + n²). The overall complexity is therefore O((n³ + n²·m)·l), where l is the number of iterations.

4.2. Discussion of the parameter γ

The parameter γ controls the distribution of weights over the different kernels. From Eq. (15) we can see that when γ → 0, all weight variables receive equal weights. By using the parameter γ, we can control the weight distribution on the one hand and avoid a trivial weight distribution on the other hand (e.g., the solution obtained when γ → 0).

5. Experimental results

In this section, we evaluate the clustering performance of the proposed algorithm on four benchmark datasets.

5.1. Datasets description

We use four face datasets in our experiments, most of which are frequently used to evaluate the performance of different clustering methods. Table 1 summarizes the detailed information of these datasets in terms of number of samples, feature dimension and number of classes.

Table 1
Description of the datasets.

Dataset   # of instances   # of features   # of classes
YALE      165              1024            15
JAFFE     213              676             10
ORL       400              1024            40
AR        840              768             120
• YALE (http://cvc.yale.edu/projects/yalefaces/yalefaces.html): This dataset contains 32 × 32 gray-scale images of 15 subjects, with 11 images per subject, each corresponding to a different facial expression or configuration.
• ORL (http://www.uk.research.att.com/facedatabase.html): This dataset contains 10 different images of each of 40 distinct subjects. The images were taken at different times, with varying lighting, facial expressions and facial details. Each image is manually cropped and normalized to a size of 64 × 64 pixels.
• ORL (http://www.uk.research.att.com/facedatabase.html): This dataset contains 10 classes, each of which contains 40 distinct subjects. The images were taken at different times, with varying lighting, facial expressions and facial details. Each image is manually cropped and normalized to a size of 32 × 32 pixels.
• AR (http://www2.ece.ohio-state.edu/~aleix/ARdatabase.html): This dataset contains cropped images of 120 individuals, each with 7 images. Images collected from two sessions feature frontal-view faces with different facial expressions, illumination conditions, and occlusions (sunglasses and scarf). Participants may wear clothes and glasses and have different make-up and hair styles.

5.2. Compared algorithms

We compare our proposed approach with a number of state-of-the-art algorithms. In particular, we compare with:
• Single kernel methods: Since we have different kernels for clustering, we run kernel k-means (KKM) [12] and robust kernel k-means (RKKM) [12] separately on each kernel. Both the best and the average results are reported, referred to as KKM-b, KKM-a, RKKM-b and RKKM-a, respectively. Due to space limitations, the worst kernel result is not reported in this paper; it should be noted that the worst kernel result is often far below the average result.
• Equal weighted methods: The different kernels are averaged into a single kernel for clustering. As reported in [16], this seemingly simple approach often achieves near-optimal results compared with other sophisticated approaches. We report the results of the equal weighted methods, referred to as KKM-e and RKKM-e, respectively.
• MKKM: Multiple Kernel K-Means, proposed in [8], which extends kernel k-means to the multiple kernel setting.
• LMKKM: Localized Multiple Kernel K-Means, proposed by Gönen and Margolin [11], which combines kernels calculated on the views in a localized way to better capture sample-specific characteristics of the data.
• RMKKM: Robust Multiple Kernel K-Means, proposed by Du et al. [12], which improves the robustness of MKKM by replacing the sum-of-squared loss with the ℓ2,1-norm.
• MKCKM: The proposed multiple kernel clustering algorithm without the localized weight variables (as presented in problem (3)).
• MKCCK: The proposed multiple kernel clustering with corrupted kernels algorithm.

5.3. Experiment setup

Following a strategy similar to that of other multiple kernel learning approaches, we apply 12 different kernel functions as bases for multiple kernel clustering. These kernels include seven RBF kernels κ(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²)) with σ = t × D, where D is the maximum distance between samples and t varies in the range {0.01, 0.05, 0.1, 1, 10, 50, 100}; four polynomial kernels κ(x_i, x_j) = (a + x_i^T x_j)^b with a ∈ {0, 1} and b ∈ {2, 4}; and a linear kernel κ(x_i, x_j) = (x_i^T x_j)/(||x_i|| · ||x_j||). Finally, all the kernels are normalized through κ(x_i, x_j) = κ(x_i, x_j) / sqrt(κ(x_i, x_i) κ(x_j, x_j)) and then re-scaled to [0, 1]. The number of clusters is set to the true number of classes for all datasets and clustering methods. For all methods, we perform 10 replications of k-means with different initializations for performance evaluation. In our proposed algorithm, we tune λ and γ within [2^1, 2^2, ..., 2^10] and [0.1, 0.2, ..., 0.9], respectively, by grid search. For the other compared methods, we tune the parameters as suggested in their papers.
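For illustration, the 12 base kernels described above can be constructed and normalized roughly as in the following sketch. It is our own reading of the stated recipe; in particular, the min–max re-scaling to [0, 1] and the small constants guarding against division by zero are assumptions, and build_base_kernels is a hypothetical helper name.

```python
import numpy as np
from scipy.spatial.distance import cdist

def build_base_kernels(X):
    """Construct the 12 base kernels (7 RBF, 4 polynomial, 1 linear),
    normalize each one, and re-scale it to [0, 1]."""
    D2 = cdist(X, X, "sqeuclidean")           # squared pairwise distances
    D = np.sqrt(D2.max())                     # maximum pairwise distance
    kernels = [np.exp(-D2 / (2.0 * (t * D) ** 2))
               for t in (0.01, 0.05, 0.1, 1, 10, 50, 100)]
    G = X @ X.T                               # Gram matrix of inner products
    kernels += [(a + G) ** b for a in (0, 1) for b in (2, 4)]
    norms = np.sqrt(np.diag(G))
    kernels.append(G / (np.outer(norms, norms) + 1e-12))   # linear (cosine) kernel
    out = []
    for K in kernels:
        d = np.sqrt(np.abs(np.diag(K)))
        K = K / (np.outer(d, d) + 1e-12)      # kappa_ij / sqrt(kappa_ii * kappa_jj)
        K = (K - K.min()) / (K.max() - K.min() + 1e-12)    # re-scale to [0, 1]
        out.append(K)
    return out
```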
5.4. Evaluation metrics

To evaluate performance, we compare the generated clusters with the ground truth by computing the following three measures.

Accuracy (ACC): The first measure is the clustering accuracy, which discovers the one-to-one relationship between clusters and classes. Given a point x_i, let p_i and q_i be the clustering result and the ground-truth label, respectively. The ACC is defined as follows:

\mathrm{ACC} = \frac{1}{n} \sum_{i=1}^{n} \delta(q_i, \mathrm{map}(p_i)),   (18)
where n is the total number of samples, δ(x, y) is the delta function that equals 1 if x = y and 0 otherwise, and map(·) is the permutation mapping function that maps each cluster index to a true class label. The best mapping can be found using the Kuhn–Munkres algorithm. A greater clustering accuracy indicates better clustering performance.

Normalized mutual information (NMI): Another evaluation metric that we adopt is the normalized mutual information, which is widely used to determine the quality of a clustering. Let C be the set of clusters obtained from the ground truth and C* the set obtained from a clustering algorithm. Their mutual information MI(C, C*) is defined as follows:
\mathrm{MI}(C, C^*) = \sum_{c_i \in C,\, c_j \in C^*} p(c_i, c_j) \log \frac{p(c_i, c_j)}{p(c_i)\, p(c_j)},   (19)
where p(c_i) and p(c_j) are the probabilities that a data point arbitrarily selected from the dataset belongs to cluster c_i and c_j, respectively, and p(c_i, c_j) is the joint probability that the arbitrarily selected data point belongs to clusters c_i and c_j at the same time. In our experiments, we use the normalized mutual information defined as follows:
\mathrm{NMI}(C, C^*) = \frac{\mathrm{MI}(C, C^*)}{\max(H(C), H(C^*))},   (20)
where H(C) and H(C*) are the entropies of C and C*, respectively. A larger NMI indicates better clustering performance.

Purity (PUR): The last evaluation metric that we adopt is the purity, a popular measure for clustering evaluation. Let C be the set of clusters to be evaluated, L the set of categories and N the number of clustered items; purity is computed as the weighted average of the maximal precision values:
\mathrm{Purity} = \sum_{i} \frac{|C_i|}{N} \max_{j} \mathrm{Precision}(C_i, L_j),   (21)
where the precision of a cluster C_i for a given category L_j is defined as
\mathrm{Precision}(C_i, L_j) = \frac{|C_i \cap L_j|}{|C_i|}.   (22)
Again, a larger purity indicates better clustering performance.
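As an illustration, the three measures can be computed as in the sketch below, using SciPy's Hungarian (Kuhn–Munkres) solver for the label mapping in ACC and scikit-learn's NMI with 'max' normalization to match Eq. (20). The code assumes non-negative integer labels and is our own helper, not the evaluation script used for the reported results.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC of Eq. (18): best one-to-one mapping found by the Kuhn-Munkres algorithm."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((clusters.size, classes.size))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(cost)
    return -cost[row, col].sum() / y_true.size

def purity(y_true, y_pred):
    """Purity of Eqs. (21)-(22): weighted average of per-cluster maximal precision."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total = sum(np.bincount(y_true[y_pred == c]).max() for c in np.unique(y_pred))
    return total / y_true.size

def nmi(y_true, y_pred):
    """NMI of Eq. (20); the 'max' normalization corresponds to max(H(C), H(C*))."""
    return normalized_mutual_info_score(y_true, y_pred, average_method="max")
```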
5.5. Experiment results

The experimental results are presented in Tables 2–4 (the best results are highlighted in bold). To clearly exhibit the benefit of the localized weight variables, we also report the results without the localized weight variables (as presented in problem (3)), referred to as MKCKM. The results of the method with the localized weight variables are referred to as MKCCK. From these tables, we have the following observations.

Table 2
Clustering results measured by ACC (%).

Data   KKM-b  KKM-a  RKKM-b  RKKM-a  KKM-e  RKKM-e  MKKM   LMKKM  RMKKM  MKCKM  MKCCK
YALE   46.67  36.87  38.67   37.17   43.03  44.24   49.09  55.76  47.70  57.58  62.42
JAFFE  71.83  61.58  70.89   60.76   71.83  70.89   82.63  97.65  72.77  98.59  99.06
ORL    53.00  43.29  52.75   43.50   49.25  50.50   48.75  72.25  52.25  74.25  77.75
AR     34.76  32.05  35.83   32.63   34.76  34.88   29.88  81.90  34.17  82.86  85.60
Table 3
Clustering results measured by NMI (%).

Data   KKM-b  KKM-a  RKKM-b  RKKM-a  KKM-e  RKKM-e  MKKM   LMKKM  RMKKM  MKCKM  MKCCK
YALE   50.82  38.93  42.37   39.45   48.14  48.57   52.63  57.12  50.58  59.44  62.91
JAFFE  78.57  65.60  76.33   63.34   78.57  76.33   81.44  96.43  76.72  97.81  98.35
ORL    73.21  60.43  73.17   60.71   70.30  71.42   69.71  85.25  72.75  85.17  87.43
AR     67.20  61.54  67.49   61.69   67.11  67.09   61.19  91.66  66.71  91.90  93.04
Table 4
Clustering results measured by PUR (%).

Data   KKM-b  KKM-a  RKKM-b  RKKM-a  KKM-e  RKKM-e  MKKM   LMKKM  RMKKM  MKCKM  MKCCK
YALE   48.48  38.23  40.97   38.48   44.24  44.85   50.30  56.36  49.70  59.39  62.42
JAFFE  75.59  65.34  75.12   64.87   75.59  75.12   82.63  97.65  75.59  98.59  99.06
ORL    58.50  48.94  59.75   49.81   55.25  56.25   52.50  75.00  57.00  76.50  80.25
AR     36.90  34.38  37.62   34.96   36.79  37.02   32.50  83.57  37.26  83.69  86.90
Fig. 1. The clustering results with different numbers of corrupted kernels on JAFFE.
(1) The performance of single kernel methods is highly dependent on the selection of the kernel function.
(2) As a strong baseline, the equal weighted methods usually show comparable clustering performance. However, with an appropriate kernel weight learning method, multiple kernel clustering methods usually exhibit better performance than the equal weighted methods, demonstrating that multiple kernel clustering is necessary and effective.
(3) By utilizing consensus kernel learning, our proposed methods usually yield superior performance on these benchmark datasets.
(4) The clustering performance is further improved by the localized weight variables (as shown in the last two columns of Tables 2–4).
In summary, our proposed algorithm achieves the best performance on all datasets by learning an optimal consensus kernel matrix from a group of pre-specified kernel matrices and by enabling the utilization of each given kernel matrix to be adaptively adjusted through a weight variable assigned to each error matrix.
5.6. Robustness study

To evaluate the robustness of our proposed method, we conduct experiments on these datasets in which kernels are corrupted by replacing their entries with random values drawn from a uniform distribution on the interval [0, 1]; the number of corrupted kernels varies from 1 to 12, and the percentage of corrupted entries in each corrupted kernel is 90%. As shown in Figs. 1–4, MKCCK (our method with the localized weight variables) stays at the top and remains at a stable level in all subfigures when the number of corrupted kernels varies from 1 to 11, which indicates better adaptability and greater robustness in handling corrupted kernels compared with MKCKM (our method without the localized weight variables). When the number of corrupted kernels reaches 12, the discriminative information is severely destroyed, which diminishes the performance of MKCCK.
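For reference, this corruption can be simulated roughly as in the following sketch; which kernels are corrupted and the re-symmetrization at the end are our own assumptions for illustration, not details stated in the paper.

```python
import numpy as np

def corrupt_kernels(K_list, n_corrupted, ratio=0.9, seed=0):
    """Replace `ratio` of the entries of the first `n_corrupted` kernels with U(0, 1) values."""
    rng = np.random.default_rng(seed)
    out = [K.copy() for K in K_list]
    n = out[0].shape[0]
    for idx in range(min(n_corrupted, len(out))):
        mask = rng.random((n, n)) < ratio      # entries to corrupt
        noise = rng.random((n, n))             # uniform values on [0, 1)
        K = out[idx]
        K[mask] = noise[mask]
        out[idx] = (K + K.T) / 2               # re-symmetrize (our own choice)
    return out
```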
5.7. Parameter selection

There are two hyper-parameters in MKCCK, λ and γ. These two parameters are tuned over [2^1, 2^2, ..., 2^10] and [0.1, 0.2, ..., 0.9], respectively. Figs. 5–8 show how the average accuracy, normalized mutual information and purity of MKCCK vary with the parameter combination (λ, γ). λ is the trade-off term in our formulation and γ controls the effect of the localized weight variables. When γ is too small, the constraint loses its effect (as discussed in Section 4). As observed, satisfactory clustering performance is obtained when γ is larger than 0.5 and λ is not too small.
6. Conclusion

In this paper, we have proposed a robust multiple kernel clustering algorithm for corrupted kernels. Compared with existing methods, our algorithm achieves significant improvements on four benchmark datasets. Moreover, the proposed method shows clear advantages in handling corrupted kernels. In the future, we intend to integrate more prior knowledge into our formulation to find more relevant clusters.
Fig. 2. The clustering results with different numbers of corrupted kernels on JAFFE.
Fig. 3. The clustering results with different numbers of corrupted kernels on ORL.
Fig. 4. The clustering results with different numbers of corrupted kernels on AR.
Fig. 5. The clustering results on the YALE dataset w.r.t. λ and γ.
Fig. 6. The clustering results on the JAFFE dataset w.r.t. λ and γ.
Fig. 7. The clustering results on the ORL dataset w.r.t. λ and γ.
Fig. 8. The clustering results on the AR dataset w.r.t. λ and γ.
Acknowledgment

This work was supported by the National Natural Science Foundation of China (U1435219, 61125201, 61402507, 61303070).

References

[1] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (3) (1999) 264–323.
[2] J. Jolion, P. Meer, S. Bataouche, Robust clustering with applications in computer vision, IEEE Trans. Pattern Anal. Mach. Intell. 13 (8) (1991) 791–802.
[3] H. Frigui, R. Krishnapuram, A robust competitive clustering algorithm with applications in computer vision, IEEE Trans. Pattern Anal. Mach. Intell. 21 (5) (1999) 450–465.
[4] P. Berkhin, A survey of clustering data mining techniques, in: Grouping Multidimensional Data – Recent Advances in Clustering, Springer, Berlin, Heidelberg, 2006, pp. 25–71.
[5] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A.F.M. Ng, B. Liu, P.S. Yu, Z. Zhou, M. Steinbach, D.J. Hand, D. Steinberg, Top 10 algorithms in data mining, Knowl. Inf. Syst. 14 (1) (2008) 1–37.
[6] I.S. Dhillon, Y. Guan, B. Kulis, Kernel k-means: spectral clustering and normalized cuts, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 2004, pp. 551–556.
[7] H. Huang, Y. Chuang, C. Chen, Affinity aggregation for spectral clustering, in: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 773–780.
[8] H. Huang, Y. Chuang, C. Chen, Multiple kernel fuzzy clustering, IEEE Trans. Fuzzy Syst. 20 (1) (2012) 120–134.
[9] G. Tzortzis, A. Likas, Kernel-based weighted multi-view clustering, in: Proceedings of the 12th IEEE International Conference on Data Mining (ICDM 2012), Brussels, Belgium, 2012, pp. 675–684.
[10] L. Zhang, X. Hu, Locally adaptive multiple kernel clustering, Neurocomputing 137 (2014) 192–197.
[11] M. Gönen, A.A. Margolin, Localized data fusion for kernel k-means clustering with application to cancer biology, in: Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, Canada, 2014, pp. 1305–1313.
[12] L. Du, P. Zhou, L. Shi, H. Wang, M. Fan, W. Wang, Y. Shen, Robust multiple kernel k-means using L21-norm, in: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), Buenos Aires, Argentina, 2015, pp. 3476–3482.
[13] Z. Lin, M. Chen, Y. Ma, The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices, UIUC Technical Report UILU-ENG-09-2215, 2009, arXiv:1009.5055.
[14] J. Cai, E.J. Candès, Z. Shen, A singular value thresholding algorithm for matrix completion, SIAM J. Optim. 20 (4) (2010) 1956–1982.
[15] S.P. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (1) (2011) 1–122.
[16] C. Cortes, M. Mohri, A. Rostamizadeh, Learning non-linear combinations of kernels, in: Advances in Neural Information Processing Systems 22 (NIPS 2009), Vancouver, Canada, 2009, pp. 396–404.
Teng Li received the B.S. degree and the M.S. degree in computer science from the National University of Defense Technology, Changsha, China, in 2013 and 2015, respectively. He is currently working toward the Ph.D. degree at the National University of Defense Technology. His research interests include machine learning and computer vision.
Yang Zhao received the B.S. degree and the M.S. degree in computer science from the National University of Defense Technology, Changsha, China, in 2014 and 2016, respectively. He is currently working toward the Ph.D. degree at the National University of Defense Technology. His research interests include machine learning and computer vision.
Yong Dou received the B.S., M.S., and Ph.D. degrees from National University of Defense Technology in 1989, 1992, and 1995, respectively. His research interests include high performance computer architecture, high performance embedded microprocessor, reconfigurable computing, machine learning, and bioinformatics. He is a Member of the IEEE and the ACM.
Qi Lv received the B.S. degree in computer science and technology from Tsinghua University, Beijing, in 2009, and the M.S. degree in computer science and technology from the National University of Defense Technology in 2011. He is now a Ph.D. candidate at the National University of Defense Technology. His research interests include machine learning and remote sensing image processing.
Xinwang Liu received the M.S. and Ph.D. degrees from the National University of Defense Technology, China, in 2008 and 2013, respectively. Since January 2014, he has worked as a research assistant at the National Laboratory for Parallel and Distributed Processing, National University of Defense Technology, Changsha, China. His research interests focus on designing algorithms for kernel learning, feature selection and multi-view clustering.