Manifold regularized multi-view feature selection for social image annotation

Yangxi Li a, Xin Shi b, Cuilan Du a, Yang Liu a,*, Yonggang Wen c

a National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC), China
b Key Laboratory of Machine Perception, Peking University, Beijing, China
c School of Computer Engineering, Nanyang Technological University, Singapore
Article history: Received 15 March 2015; Received in revised form 4 June 2015; Accepted 13 July 2015. Communicated by Dacheng Tao.

Abstract

The features used in many social media analysis-based applications are usually of very high dimension, and feature selection offers several advantages in such high-dimensional cases. Recently, multi-task feature selection has attracted much attention, and has been shown to often outperform traditional single-task feature selection. Current multi-task feature selection methods are either supervised or unsupervised. In this paper, we address the semi-supervised multi-task feature selection problem. We first introduce manifold regularization into multi-task feature selection to utilize the limited number of labeled samples and the relatively large amount of unlabeled samples. However, the graph constructed in manifold regularization from a single feature representation (view) may be unreliable. We thus propose to construct the graph using the heterogeneous feature representations from multiple views. The proposed method, called manifold regularized multi-view feature selection (MRMVFS), can exploit the label information, label relationship, data distribution, as well as the correlation among different kinds of features simultaneously to boost the feature selection performance. All this information is integrated into a unified learning framework that estimates the feature selection matrix as well as adaptive view weights. Experimental results on three real-world image datasets, NUS-WIDE, Flickr and Animal, demonstrate the effectiveness and superiority of the proposed MRMVFS over other state-of-the-art feature selection methods.

Keywords: Feature selection; Multi-view; Manifold regularization; Image annotation
1. Introduction

Feature selection lies at the heart of many multimedia analysis-based applications, such as image search [24,25,27], automatic image annotation [10,12,13], and image classification [23,26]. In these applications, where high-dimensional features are usually utilized, a compact set of features selected from the original features helps to reduce the computational cost, save storage space, and reduce the chance of over-fitting. While traditional feature selection methods usually select features for a single task [3,7,16,28], more recently there has been a focus on joint feature selection across multiple related tasks [11,17,18,20,22]. This is because joint feature selection can exploit the task relationship to establish the importance of features, and this approach has been empirically demonstrated to be superior to feature selection on each task separately [18]. Current multi-task feature selection methods are conducted either in a supervised [17,18] or unsupervised [11,20,22] manner,
in terms of whether the label information is utilized to guide the selection of useful features. Supervised feature selection methods always require a large amount of labeled training data, and may fail to identify the relevant, discriminative features when the number of labeled samples is small. On the other hand, unsupervised feature selection methods are often unable to identify the discriminative features since the label information is ignored [21]. Since the cost of manually labeling multi-view data is high, while a large amount of unlabeled multi-view data can easily be obtained, it is desirable to develop multi-view feature selection methods that are capable of exploiting both labeled and unlabeled data. This motivates us to introduce semi-supervised learning into multi-view feature selection; we therefore focus on semi-supervised multi-task feature selection in this paper. To make use of both the limited labeled data and the abundant unlabeled data, we incorporate the idea of manifold regularization [1] into multi-task feature selection. The performance of manifold regularization relies heavily on the constructed graph Laplacian. However, the graph constructed using the features from a single
view may be unreliable. For example, different labels cannot be properly characterized by a single feature representation in image classification [15]. We thus extract different kinds of features from each image and combine, with learned weights, the graphs constructed from the different feature representations. The combined graph is able to approximate the underlying data distribution more accurately than any single graph.

The proposed method is called manifold regularized multi-view feature selection (MRMVFS). To the best of our knowledge, little progress has been made in semi-supervised multi-view feature selection. Some recent works focus on unsupervised multi-view feature selection [5,20]; we mainly differ from them in that the label information is incorporated. On the other hand, compared to works on semi-supervised feature selection [3,28], we mine the relationship information among multiple representations (views) and integrate it into the feature selection task.

The proposed MRMVFS integrates four kinds of information, i.e., label information, label relationship, data distribution, as well as the correlation among different views, to select the most representative feature components from the original multi-view features. In particular, a regression model is adopted to exploit the label information contained in the labeled samples. An l2,1-norm penalty on the feature selection matrix enforces joint feature selection across multiple labels, and thus exploits the label relationship. Meanwhile, visual similarity graphs of the different views are constructed and combined to model the geometric structure of the underlying data distribution. In addition, a set of non-negative view weights is learned to leverage the correlation among different views and establish a reliable regularization term along the data manifold that smooths the prediction function. Finally, we integrate all this information into a unified learning framework, based on which we can simultaneously estimate the feature selection matrix and the adaptive view weights. In the experiments, we apply MRMVFS to automatic image annotation on three challenging image datasets, NUS-WIDE-OBJECT [4], Flickr [8] and Animal [9], and compare it with several state-of-the-art feature selection methods. Experimental results demonstrate the effectiveness and superiority of the proposed MRMVFS.
2. Manifold regularized multi-view feature selection

In this section, we elaborate the proposed manifold regularized multi-view feature selection (MRMVFS) method in detail.

2.1. Notations

We first introduce some important notations used in the rest of this paper. A matrix is represented by a capital letter, e.g., $X$. $X_{ij}$ is the $(i,j)$th element of $X$, and $X_{i:}$ indicates the elements in the $i$th row of $X$. A bold lower-case letter $\mathbf{x}$ indicates a vector, and a plain lower-case $x$ indicates a scalar. A superscript indicates the view of the data, e.g., $X^{(v)}$ is the $v$th view of the data $X$. A subscript denotes whether the data are labeled; for example, $X_L$ is the labeled data, whereas $X_U$ is the unlabeled data. $\|X\|_F$ denotes the Frobenius norm of the matrix $X$. Specifically, for a matrix $X \in \mathbb{R}^{p \times q}$, its $\ell_{2,1}$-norm is defined as

$$\|X\|_{2,1} = \sum_{i=1}^{p} \sqrt{\sum_{j=1}^{q} X_{ij}^2}. \qquad (1)$$
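For concreteness, Eq. (1) is simply the sum of the $\ell_2$-norms of the rows. A minimal NumPy sketch (function name ours):

```python
import numpy as np

def l21_norm(X: np.ndarray) -> float:
    """l2,1-norm of X: the sum over rows of the row-wise l2-norms, Eq. (1)."""
    return float(np.sqrt((X ** 2).sum(axis=1)).sum())

# A row-sparse matrix has a small l2,1-norm: rows that are all zero contribute nothing.
W = np.array([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
print(l21_norm(W))  # 5.0 + 0.0 + 1.0 = 6.0
```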
2.2. Problem formulation

Given a set of $l$ labeled samples $D_L = \{(x_i, y_i)\}_{i=1}^{l}$ and a relatively large set of $u$ unlabeled samples $D_U = \{x_i\}_{i=l+1}^{l+u}$, we suppose each sample is represented by $m$ different views, i.e., $x_i = [x_i^{(1)}; x_i^{(2)}; \ldots; x_i^{(m)}]$. The view we refer to here is a certain kind of feature or modality. Then the feature matrix of the $v$th view can be represented as $X^{(v)} = [x_1^{(v)}, x_2^{(v)}, \ldots, x_n^{(v)}] \in \mathbb{R}^{d_v \times n}$, and the feature matrix of all the views is $X = [X^{(1)}; X^{(2)}; \ldots; X^{(m)}] \in \mathbb{R}^{d \times n}$, where $d = \sum_{v=1}^{m} d_v$ and $d_v$ is the feature dimension of the $v$th view.

To select compact and representative feature components from the raw features, we propose to integrate four kinds of information, i.e., the label information contained in the labeled data, the label relationship, the data distribution, as well as the correlation among different views of both labeled and unlabeled data, into the learning framework.

We first introduce how to integrate the label information of the labeled data into MRMVFS. Given the labeled feature matrix $X_L = [x_1, x_2, \ldots, x_l] \in \mathbb{R}^{d \times l}$ and the corresponding label matrix $Y_L = [y_1, y_2, \ldots, y_l]^T \in \mathbb{R}^{l \times c}$ with each $y_i = [y_i^1, y_i^2, \ldots, y_i^c]^T$, we can learn the $c$ prediction functions $f^p(x)$, $p = 1, \ldots, c$, by minimizing the prediction error over the labeled data in the training set:

$$\min_{\{f^p\}} \sum_{p=1}^{c} \sum_{i=1}^{l} L(f^p(x_i), y_i^p), \qquad (2)$$

where $L$ is some pre-defined convex loss, and $y_i^p = 1$ if the $p$th label is manually assigned to the $i$th sample, and $-1$ otherwise. We assume each $f^p(x)$ is a linear transformation with $f^p(x_i) = (w^p)^T x_i$. Letting $f(x) = [f^1(x), f^2(x), \ldots, f^c(x)]^T$, we have $f(x) = W^T x$, where $W = [w^1, w^2, \ldots, w^c] \in \mathbb{R}^{d \times c}$ is the transformation matrix. To make the model suitable for feature selection, an $\ell_{2,1}$-norm regularization of $W$ is added to (2) to ensure that $W$ is sparse in rows. Thus the optimization problem becomes

$$\min_{W} \sum_{i=1}^{l} L(f(x_i), y_i) + \alpha \|W\|_{2,1}. \qquad (3)$$
This is a general formulation of multi-task feature selection [18]. In this formulation, the importance of an individual feature is evaluated by simultaneously considering multiple tasks. In this way, different tasks help each other to select features assumed to be shared across tasks.

In many practical applications, the number of labeled samples $l$ is quite small, and thus the learned $W$ is often unreliable. Considering that the data samples may lie on a low-dimensional manifold embedded in a high-dimensional space, we propose to utilize the large amount of unlabeled samples to help learn $W$ under the theme of manifold regularization (MR) [1]. MR has been widely used for capturing local geometry and conducting low-dimensional embedding. In MR, the data manifold is characterized by an adjacency graph, which explores the geometric structure of the compact support of the marginal distribution. The geometry is then incorporated as an additional regularizer to ensure that the solution is smooth with respect to the data distribution. In our method, the regularization term is given by

$$\frac{1}{2}\sum_{p=1}^{c}\sum_{i,j=1}^{n}(f_i^p - f_j^p)^2 A_{ij} = \frac{1}{2}\sum_{i,j=1}^{n} A_{ij}(\mathbf{f}_i^T\mathbf{f}_i + \mathbf{f}_j^T\mathbf{f}_j - 2\mathbf{f}_i^T\mathbf{f}_j) = \mathrm{tr}(F^T(D-A)F) = \mathrm{tr}(F^T L F), \qquad (4)$$

where $F = [f(x_1), f(x_2), \ldots, f(x_n)]^T \in \mathbb{R}^{n \times c}$ contains the predictions over all the data (labeled and unlabeled). Here, $A$ is the adjacency graph constructed using all the data, and each element $A_{ij}$ indicates the similarity between samples $x_i$ and $x_j$; $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{n} A_{ij}$, and $L = D - A$ is the graph Laplacian matrix. The regularization term guarantees that if two samples are similar in the feature space, then their predictions will be close. In this way, the geometric structure of the data in the high-dimensional space is preserved and the data distribution information is well explored.
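The chain of equalities in (4) can be checked numerically. A small self-contained verification (all names ours) that the weighted pairwise squared differences equal $\mathrm{tr}(F^T L F)$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 6, 3
F = rng.standard_normal((n, c))    # predictions, one row per sample
A = rng.random((n, n))
A = (A + A.T) / 2                  # symmetric similarities
np.fill_diagonal(A, 0.0)
L = np.diag(A.sum(axis=1)) - A     # graph Laplacian L = D - A

pairwise = 0.5 * sum(A[i, j] * ((F[i] - F[j]) ** 2).sum()
                     for i in range(n) for j in range(n))
print(np.allclose(pairwise, np.trace(F.T @ L @ F)))  # True
```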
In general, there are two ways to construct the graph: the k-nearest-neighbor method and the $\epsilon$-ball method. The former is adopted in this paper. In the semi-supervised problem, the label information of the labeled data can be utilized in graph construction. For labeled data, the feature similarity is measured as

$$A_{ij} = \begin{cases} 1 & \text{if } (x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i)) \text{ and } y_i = y_j, \\ 0 & \text{otherwise,} \end{cases} \qquad (5)$$

where $x_i \in N_k(x_j)$ indicates that $x_i$ is among the $k$ nearest neighbors of $x_j$, and $y_i$ is the label of $x_i$. For unlabeled data, the feature similarity is measured as

$$A_{ij} = \begin{cases} \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right) & \text{if } x_i \in N_k(x_j) \text{ or } x_j \in N_k(x_i), \\ 0 & \text{otherwise.} \end{cases} \qquad (6)$$

The adjacency graph matrix $A$ is usually symmetrized by setting $A \leftarrow (A + A^T)/2$. By introducing the regularization term into (3), we obtain the following optimization problem:

$$\min_{W} \sum_{i=1}^{l} L(f(x_i), y_i) + \alpha \|W\|_{2,1} + \beta\, \mathrm{tr}(F^T L F). \qquad (7)$$
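The graph construction of Eqs. (5)-(6) can be sketched as follows. This is a simplified illustration, assuming a dense O(n^2) distance computation and giving mixed labeled-unlabeled neighbor pairs the heat-kernel weight of Eq. (6), one possible reading of the text; the function name and defaults are ours:

```python
import numpy as np

def build_graph_laplacian(X, labels, k=5, sigma=1.0):
    """X: n x d samples; labels: length-n integer array with -1 for unlabeled.
    Returns L = D - A with A given by Eqs. (5)-(6)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    knn = np.argsort(d2, axis=1)[:, 1:k + 1]  # skip column 0, the point itself
    A = np.zeros((n, n))
    for i in range(n):
        for j in knn[i]:
            if labels[i] >= 0 and labels[j] >= 0:
                # Eq. (5): labeled pairs are connected only if their labels agree.
                A[i, j] = 1.0 if labels[i] == labels[j] else 0.0
            else:
                # Eq. (6): heat-kernel similarity (also used here for mixed pairs).
                A[i, j] = np.exp(-d2[i, j] / (2 * sigma ** 2))
    A = (A + A.T) / 2  # symmetrize, as in the text
    D = np.diag(A.sum(axis=1))
    return D - A
```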
In this formulation, the features from multiple views are concatenated to calculate the graph Laplacian. However, such a strategy ignores the complementary property of the different views. We thus further propose to combine, with learned weights, the graph Laplacians calculated from the different views. Thus the optimization problem becomes

$$\min_{W, \lambda} \sum_{i=1}^{l} L(f(x_i), y_i) + \alpha \|W\|_{2,1} + \beta\, \mathrm{tr}\left(F^T \left(\sum_{v=1}^{m} \lambda_v L^{(v)}\right) F\right) + \gamma \|\lambda\|_2^2, \quad \text{s.t. } \sum_{v=1}^{m} \lambda_v = 1,\ \lambda_v \ge 0, \qquad (8)$$
where the regularization term $\|\lambda\|_2^2$ avoids the variable $\lambda$ overfitting to a trivial solution. By choosing the least-squares loss for $L$ and considering that $f(x) = W^T x$, we have the following formulation for the proposed manifold regularized multi-view feature selection (MRMVFS) framework:

$$\min_{W, \lambda} \|X_L^T W - Y_L\|_F^2 + \alpha \|W\|_{2,1} + \beta\, \mathrm{tr}\left(W^T X_{LU} \left(\sum_{v=1}^{m} \lambda_v L^{(v)}\right) X_{LU}^T W\right) + \gamma \|\lambda\|_2^2, \quad \text{s.t. } \sum_{v=1}^{m} \lambda_v = 1,\ \lambda_v \ge 0, \qquad (9)$$

where $X_{LU} \in \mathbb{R}^{d \times n}$ is the feature matrix of both the labeled and unlabeled data. According to the definition of the $\ell_{2,1}$-norm, many rows of $W$ will shrink to (or close to) zero. Therefore, $\tilde{x} = W^T x$ can be seen as the selected features consisting of the most representative dimensions of the raw data $x$. That is to say, we can rank the raw features according to $\|W_{i:}\|$ in descending order and preserve only the top-ranked feature components.

2.3. Optimization algorithm

Problem (9) is a nonlinearly constrained nonconvex optimization problem. We adopt alternating optimization to solve it. The variable $\lambda$ is initialized as $\lambda_v = \frac{1}{m}$, and $W$ is initialized as a random matrix. Then we fix one variable and update the other, alternately and iteratively.

2.3.1. Fix λ, optimize W

With a fixed $\lambda$, problem (9) becomes

$$\min_{W} \mathcal{L}(W) = g(W) + \alpha \|W\|_{2,1}, \qquad (10)$$

where

$$g(W) = \|X_L^T W - Y_L\|_F^2 + \beta\, \mathrm{tr}\left(W^T X_{LU} \left(\sum_{v=1}^{m} \lambda_v L^{(v)}\right) X_{LU}^T W\right).$$

By setting $\frac{\partial \mathcal{L}(W)}{\partial W} = 0$, we have

$$W = (P + \alpha Q)^{-1} X_L Y_L, \qquad (11)$$

where $P = X_L X_L^T + \beta X_{LU} \left(\sum_{v=1}^{m} \lambda_v L^{(v)}\right) X_{LU}^T$, and $Q$ is a diagonal matrix with $Q_{ii} = \frac{1}{2\|W_{i:}\|_2}$. This means that $Q$ depends on $W$. Fortunately, the solution can be obtained efficiently by repeating the following two steps until convergence:

(a) $W_{\tau+1} = (P + \alpha Q_\tau)^{-1} X_L Y_L$;
(b) update the diagonal matrix $Q$ using $W_{\tau+1}$.

In practice, $\|W_{i:}\|_2$ may be close to zero. Therefore, we can regularize it with $Q_{ii} = \frac{1}{\sqrt{W_{i:} W_{i:}^T + \epsilon}}$, where $\epsilon$ is a tiny constant.
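The two-step iteration admits a compact implementation. A minimal sketch, assuming the smoothed reweighting $Q_{ii} = 1/(2\sqrt{\|W_{i:}\|^2 + \epsilon})$ (the 1/2 factor from the unsmoothed definition is kept; it could equally be absorbed into $\alpha$); the function name, initialization, and stopping test are ours:

```python
import numpy as np

def optimize_W(X_L, Y_L, X_LU, L_comb, alpha, beta, eps=1e-8, max_iter=50, tol=1e-6):
    """Solve problem (10) for fixed lambda via steps (a)-(b).
    X_L: d x l labeled features; Y_L: l x c label matrix;
    X_LU: d x n all features; L_comb: combined Laplacian sum_v lambda_v L^(v)."""
    d, c = X_L.shape[0], Y_L.shape[1]
    P = X_L @ X_L.T + beta * X_LU @ L_comb @ X_LU.T
    W = np.random.randn(d, c)
    for _ in range(max_iter):
        # Step (b): Q_ii = 1 / (2 sqrt(||W_i:||^2 + eps)), the smoothed reweighting.
        q = 0.5 / np.sqrt((W ** 2).sum(axis=1) + eps)
        # Step (a): W = (P + alpha Q)^{-1} X_L Y_L, Eq. (11).
        W_new = np.linalg.solve(P + alpha * np.diag(q), X_L @ Y_L)
        if np.linalg.norm(W_new - W) < tol * max(np.linalg.norm(W), 1.0):
            return W_new
        W = W_new
    return W
```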
Following [17], we can prove the convergence of this iterative algorithm.

Theorem 1. Performing the two steps (a) and (b) will decrease the objective of problem (10). By repeating the two steps, $W_{\tau+1}$ will converge to the global optimum of the problem.

Proof. It can be easily verified that (11) is the solution of the following problem:

$$\arg\min_{W}\ g(W) + \alpha\, \mathrm{tr}(W^T Q W). \qquad (12)$$

Thus, in the $\tau$th iteration,

$$W_{\tau+1} = \arg\min_{W}\ g(W) + \alpha\, \mathrm{tr}(W^T Q_\tau W), \qquad (13)$$

which indicates that

$$g(W_{\tau+1}) + \alpha\, \mathrm{tr}((W_{\tau+1})^T Q_\tau W_{\tau+1}) \le g(W_\tau) + \alpha\, \mathrm{tr}((W_\tau)^T Q_\tau W_\tau). \qquad (14)$$

That is to say,

$$g(W_{\tau+1}) + \alpha \sum_i \frac{\|w_i^{\tau+1}\|_2^2}{2\|w_i^\tau\|_2} \le g(W_\tau) + \alpha \sum_i \frac{\|w_i^\tau\|_2^2}{2\|w_i^\tau\|_2}, \qquad (15)$$

where $w_i$ is the $i$th row of $W$. Using a simple strategy (simultaneously adding and subtracting a term), we have

$$g(W_{\tau+1}) + \alpha \|W_{\tau+1}\|_{2,1} - \alpha\left(\|W_{\tau+1}\|_{2,1} - \sum_i \frac{\|w_i^{\tau+1}\|_2^2}{2\|w_i^\tau\|_2}\right) \le g(W_\tau) + \alpha \|W_\tau\|_{2,1} - \alpha\left(\|W_\tau\|_{2,1} - \sum_i \frac{\|w_i^\tau\|_2^2}{2\|w_i^\tau\|_2}\right). \qquad (16)$$

According to Lemma 1 presented in [18], we have

$$\|W_{\tau+1}\|_{2,1} - \sum_i \frac{\|w_i^{\tau+1}\|_2^2}{2\|w_i^\tau\|_2} \le \|W_\tau\|_{2,1} - \sum_i \frac{\|w_i^\tau\|_2^2}{2\|w_i^\tau\|_2}. \qquad (17)$$

Thus we obtain

$$g(W_{\tau+1}) + \alpha \|W_{\tau+1}\|_{2,1} \le g(W_\tau) + \alpha \|W_\tau\|_{2,1}. \qquad (18)$$

Finally, problem (10) is convex since both $g(W)$ and $\|W\|_{2,1}$ are convex w.r.t. $W$. Thus $W_{\tau+1}$ will converge to the global optimum of the problem. □

2.3.2. Fix W, optimize λ

With a fixed $W$, problem (9) degenerates to

$$\min_{\lambda}\ h^T \lambda + \gamma \|\lambda\|_2^2, \quad \text{s.t. } \sum_{v=1}^{m} \lambda_v = 1,\ \lambda_v \ge 0, \qquad (19)$$

where $h = [h_1, \ldots, h_m]^T$ with $h_v = \beta\, \mathrm{tr}(W^T X_{LU} L^{(v)} X_{LU}^T W)$. To solve this problem, we use a coordinate descent-based algorithm.
The Lagrangian of problem (19) is

$$\mathcal{L}(\lambda, \xi) = h^T \lambda + \gamma \|\lambda\|_2^2 - \xi \left(\sum_{v} \lambda_v - 1\right). \qquad (20)$$

We randomly select two variables $\lambda_i$ and $\lambda_j$ to update, i.e.,

$$\frac{\partial \mathcal{L}(\lambda, \xi)}{\partial \lambda_i} = 0 \ \Rightarrow\ h_i + 2\gamma \lambda_i - \xi = 0, \qquad (21)$$

$$\frac{\partial \mathcal{L}(\lambda, \xi)}{\partial \lambda_j} = 0 \ \Rightarrow\ h_j + 2\gamma \lambda_j - \xi = 0. \qquad (22)$$

We can eliminate the Lagrange multiplier $\xi$ by subtracting (22) from (21), which leads to

$$2\gamma(\lambda_i - \lambda_j) + h_i - h_j = 0. \qquad (23)$$

By further considering that $\lambda_i^* + \lambda_j^* = \lambda_i + \lambda_j$, we have the following solution for updating $\lambda_i$ and $\lambda_j$:

$$\lambda_i^* = \frac{2\gamma(\lambda_i + \lambda_j) + h_j - h_i}{4\gamma}, \quad \lambda_j^* = \lambda_i + \lambda_j - \lambda_i^*. \qquad (24)$$

To satisfy the constraint $\lambda_v \ge 0$, we set

$$\begin{cases} \lambda_i^* = 0,\ \lambda_j^* = \lambda_i + \lambda_j & \text{if } 2\gamma(\lambda_i + \lambda_j) + h_j - h_i \le 0, \\ \lambda_j^* = 0,\ \lambda_i^* = \lambda_i + \lambda_j & \text{if } 2\gamma(\lambda_i + \lambda_j) + h_i - h_j \le 0. \end{cases}$$
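The projection rules above map directly to code. A minimal sketch of the pairwise coordinate-descent update (24) with the non-negativity projection; the pair-sampling schedule and names are ours:

```python
import numpy as np

def update_lambda_pair(lam, h, gamma, i, j):
    """Update (lam[i], lam[j]) by Eq. (24), keeping lam[i] + lam[j] fixed
    and applying the non-negativity projection from the text."""
    s = lam[i] + lam[j]
    if 2 * gamma * s + h[j] - h[i] <= 0:    # interior update would make lam[i] < 0
        lam[i], lam[j] = 0.0, s
    elif 2 * gamma * s + h[i] - h[j] <= 0:  # interior update would make lam[j] < 0
        lam[i], lam[j] = s, 0.0
    else:                                   # interior solution of Eq. (24)
        lam[i] = (2 * gamma * s + h[j] - h[i]) / (4 * gamma)
        lam[j] = s - lam[i]
    return lam

def sweep_lambda(lam, h, gamma, n_pairs=100, rng=None):
    """One pass of coordinate descent over randomly chosen index pairs."""
    rng = np.random.default_rng(0) if rng is None else rng
    for _ in range(n_pairs):
        i, j = rng.choice(len(lam), size=2, replace=False)
        lam = update_lambda_pair(lam, h, gamma, i, j)
    return lam
```

Each pairwise update preserves the simplex constraint exactly, since the pair's sum is held fixed and both entries stay non-negative.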
The optimization procedure is summarized in Algorithm 1. When the locally optimal solution of $W$ is obtained, we sort all the feature components according to the value of $\|W_{i:}\|$, and the top-ranked components are selected.

Algorithm 1. The optimization procedure of the proposed MRMVFS algorithm.

Input: Labeled training data $D_L^{(v)} = \{(x_i^{(v)}, y_i)\}_{i=1}^{l}$ and unlabeled training data $D_U^{(v)} = \{x_i^{(v)}\}_{i=l+1}^{l+u}$ from the different views, where $v = 1, \ldots, m$ is the view index.
Algorithm parameters: $\alpha$, $\beta$, $\gamma$, and $k$.
Output: The feature selection matrix $W$ and the view combination coefficients $\{\lambda_v\}$.
Initialization: $W$ is a random matrix, and $\lambda_v = \frac{1}{m}$, $v = 1, \ldots, m$. Set $t = 0$.
1: Iterate
2:   Update $W$ by iterating the two steps (a) and (b) presented in Section 2.3.1 until convergence;
3:   Update $\{\lambda_v\}$ via (24);
4:   $t = t + 1$.
5: Until convergence
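Putting the pieces together, the outer loop of Algorithm 1 alternates the two updates. A condensed sketch, assuming the optimize_W and sweep_lambda helpers from the earlier snippets are in scope and a list Ls holds the per-view Laplacians $L^{(v)}$:

```python
import numpy as np

def mrmvfs(X_L, Y_L, X_LU, Ls, alpha, beta, gamma, n_outer=20):
    """Alternating optimization for problem (9).
    Ls: list of the m per-view graph Laplacians L^(v) (each n x n).
    Returns W, the view weights lambda, and a feature ranking."""
    m = len(Ls)
    lam = np.full(m, 1.0 / m)  # lambda_v = 1/m, as in Algorithm 1
    for _ in range(n_outer):
        # Fix lambda, optimize W (Section 2.3.1).
        L_comb = sum(l_v * L_v for l_v, L_v in zip(lam, Ls))
        W = optimize_W(X_L, Y_L, X_LU, L_comb, alpha, beta)
        # Fix W, optimize lambda (Section 2.3.2):
        # h_v = beta * tr(W^T X_LU L^(v) X_LU^T W).
        h = np.array([beta * np.trace(W.T @ X_LU @ L_v @ X_LU.T @ W) for L_v in Ls])
        lam = sweep_lambda(lam, h, gamma)
    # Rank feature components by ||W_i:||_2 in descending order.
    ranking = np.argsort(-np.linalg.norm(W, axis=1))
    return W, lam, ranking
```

The returned ranking realizes the final selection step: the top r indices are kept as the selected feature components.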
3. Experiments

In this section, we evaluate the effectiveness of the proposed MRMVFS by applying it to automatic image annotation on three challenging image datasets: NUS-WIDE-OBJECT [4], Flickr [8] and Animals with Attributes (Animal) [9].

3.1. Datasets and settings

Example images of the NUS-WIDE-OBJECT, Flickr and Animal datasets are given in Figs. 1-3, respectively.

NUS-WIDE-OBJECT is a multi-view image dataset consisting of 31 categories and 30,000 images in total (17,927 for training and 12,073 for test). Six different kinds of features, namely a 500-D bag of visual words, a 64-D color histogram, a 144-D color auto-correlogram, a 73-D edge direction histogram, a 128-D wavelet texture, and 225-D block-wise color moments, are provided in [4] to represent each image.

The Flickr dataset contains 25,000 images collected from the social photography site Flickr.com and equally split between training and test. There are 38 labels in total, and each image in this dataset is associated with 8.94 labels on average. Three kinds of features, namely 1000-D SIFT [14] features computed on a dense multiscale grid, 512-D Gist features [19], as well as 457-D tag features in a bag-of-words vector provided in [6], are used to represent each image.

The Animal dataset contains more than 30,000 images of 50 animal classes. We randomly select 14,112 images of 20 classes in our experiments; 60% of these images are used as the training set, while the other 40% are used as the test set. Three kinds of pre-extracted features provided in [9] are used to represent each image, i.e., 500-D color histogram features, 500-D SIFT, and 252-D pyramid HOG.

In our experiments on all these datasets, we randomly select s = {5, 10, 20} labeled samples for each category from the training set to construct the labeled set; 5000 images are randomly selected as the unlabeled set, and 2000 images are used for validation. Experiments on each dataset are conducted 5 times, and the average performance, i.e., the annotation precision, is reported. In particular, the following methods are compared:
- AllFea: the baseline in which all features are normalized and concatenated.
- CSFS [3]: a convex semi-supervised multi-label feature selection algorithm. The trade-off parameter $\mu$ is set in the range $\{2^{-5}, 2^{-3}, 2^{-2}, \ldots, 2^{2}, 2^{3}, 2^{5}\}$.
- LSDF [28]: a semi-supervised feature selection algorithm. The nearest-neighbor parameter $k$ is chosen from the set {3, 5, 8, 10, 12, 15, 20}.
- MTFS [18]: a popular multi-task feature selection algorithm. The trade-off parameter $\gamma$ is set in the range $\{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$.
- L21FS [17]: an efficient and robust multi-task feature selection algorithm that utilizes the $\ell_{2,1}$-norm for both the least-squares loss and the regularization term. The trade-off parameter $\gamma$ is chosen from the set $\{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$.
- MRMVFS: the proposed feature selection method. The parameters $\alpha$, $\beta$ and $\gamma$ are tuned on the set $\{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$. The nearest-neighbor parameter $k$ is set to 5 empirically.
All the features of the different views (1134-D for NUS-WIDE-OBJECT, 1969-D for Flickr, and 1252-D for Animal) are sorted in descending order according to the value $\|W_{i:}\|_2$, $i = 1, \ldots, d$, and then the r top-ranked features are selected as the input of an SVM classifier. We apply both a linear SVM (linSVM) and a nonlinear SVM (kerSVM) to the selected features. For SVM training, we use the libSVM [2] toolbox (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and tune the penalty factor C on the set $\{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$ for the linear SVM. For the nonlinear SVM, we use RBF kernels with the bandwidth parameter $\sigma$ optimized over the set $\{2^{-8}, 2^{-7}, \ldots, 2^{3}, 2^{4}\}$.

3.2. Experimental results and analysis

The experimental results using linear and nonlinear SVM on the NUS-WIDE-OBJECT dataset are shown in Figs. 4 and 5, the results on the Flickr dataset are shown in Figs. 6 and 7, and the results on the Animal dataset are shown in Figs. 8 and 9.
Fig. 1. NUS-WIDE dataset sample images.
Fig. 2. Flickr dataset sample images.
Fig. 3. Animal dataset sample images.
Fig. 4. Annotation performance using linear SVM vs. the number of selected features on NUS-WIDE dataset.
Fig. 5. Annotation performance using nonlinear SVM vs. the number of selected features on NUS-WIDE dataset.
Fig. 6. Annotation performance using linear SVM vs. the number of selected features on Flickr dataset.
Fig. 7. Annotation performance using nonlinear SVM vs. the number of selected features on Flickr dataset.
Fig. 8. Annotation performance using linear SVM vs. the number of selected features on Animal dataset.
Fig. 9. Annotation performance using nonlinear SVM vs. the number of selected features on Animal dataset.
Annotation accuracies at selected feature dimensionalities r = {20, 50, 100, 200, 500, 800, 1000} are reported. From the results, it can be seen that:

1. the accuracies of all the compared methods improve as the number of labeled samples increases;
2. annotation using the feature subset selected by the feature selection algorithms can outperform the baseline that uses all the features for some values of r, because the feature selection methods discard redundant and noisy feature components;
3. the proposed MRMVFS is significantly superior to the other four feature selection algorithms in most cases, especially when the number of labeled samples is small. Besides, in most cases the semi-supervised methods (CSFS, LSDF, as well as the proposed MRMVFS) outperform the supervised methods (MTFS and L21FS). This indicates that the data distribution information carried by the large amount of unlabeled data is useful for the feature selection task;
4. when the number of labeled samples increases, the improvements of MRMVFS over the other methods (including the baseline) become smaller. This is because, when more labeled samples are used, the underlying data distribution is better explored, and thus the benefit of using the unlabeled data decreases.

On the NUS-WIDE-OBJECT dataset, the improvement shrinks as r increases. The reason is that the overlap between the features selected by the different algorithms grows when r is large. In particular, only 100 features selected by MRMVFS are needed to match the performance of using all the features when the number of labeled samples is 5 per category. This demonstrates the advantage of the proposed method in handling the small labeled-sample-size problem.

The performance of the compared methods on the Flickr dataset oscillates considerably. This is because the tag features used are very sparse, and the tag features of many samples are all zeros when only a limited number of dimensions is selected. The performance of the proposed MRMVFS peaks around r = 200 and is superior to the other approaches at most dimensionalities, especially when r is small (e.g., less than 200).

The results on the Animal dataset are similar to those on the other datasets: the proposed MRMVFS outperforms all the compared methods under most experimental settings. All these results demonstrate the advantage of the proposed method for feature selection.
4. Conclusion

In this paper, we propose manifold regularized multi-view feature selection (MRMVFS) for the semi-supervised setting. In our method, four kinds of vital information from the data, i.e., the label information contained in labeled samples, the label relationship, the data distribution, and the correlation among different views of both labeled and unlabeled samples, are integrated into a unified learning framework. Experiments on the social image annotation task demonstrate the superiority of the proposed method.
References

[1] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: a geometric framework for learning from labeled and unlabeled examples, J. Mach. Learn. Res. 7 (2006) 2399–2434.
[2] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 1–27.
[3] X. Chang, F. Nie, Y. Yang, H. Huang, A convex formulation for semi-supervised multi-label feature selection, in: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014.
[4] T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, Y. Zheng, NUS-WIDE: a real-world web image database from National University of Singapore, in: Proceedings of the ACM International Conference on Image and Video Retrieval, 2009.
[5] Y. Feng, J. Xiao, Y. Zhuang, X. Liu, Adaptive unsupervised multi-view feature selection for visual concept recognition, in: Proceedings of the Asian Conference on Computer Vision, 2012, pp. 343–357.
[6] M. Guillaumin, J. Verbeek, C. Schmid, Multimodal semi-supervised learning for image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 902–909.
[7] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, in: Advances in Neural Information Processing Systems, 2005, pp. 507–514.
[8] M.J. Huiskes, M.S. Lew, The MIR Flickr retrieval evaluation, in: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, 2008, pp. 39–43.
[9] C.H. Lampert, H. Nickisch, S. Harmeling, Learning to detect unseen object classes by between-class attribute transfer, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 951–958.
[10] J. Li, J.Z. Wang, Real-time computerized annotation of pictures, in: Proceedings of the ACM Multimedia, 2006, pp. 911–920.
[11] Z. Li, Y. Yang, J. Liu, X. Zhou, H. Lu, Unsupervised feature selection using nonnegative spectral analysis, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2012.
[12] W. Liu, D. Tao, Multiview Hessian regularization for image annotation, IEEE Trans. Image Process. 22 (7) (2013) 2676–2687.
[13] W. Liu, D. Tao, J. Cheng, Y. Tang, Multiview Hessian discriminative sparse coding for image annotation, Comput. Vis. Image Underst. 118 (2014) 50–60.
[14] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. 60 (2) (2004) 91–110.
[15] Y. Luo, D. Tao, C. Xu, C. Xu, H. Liu, Y. Wen, Multiview vector-valued manifold regularization for multilabel image classification, IEEE Trans. Neural Netw. Learn. Syst. 24 (5) (2013) 709–722.
[16] L.C. Molina, L. Belanche, À. Nebot, Feature selection algorithms: a survey and experimental evaluation, in: Proceedings of the International Conference on Data Mining, 2002, pp. 306–313.
[17] F. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint l2,1-norms minimization, Adv. Neural Inf. Process. Syst. 23 (2010) 1813–1821.
[18] G. Obozinski, B. Taskar, M. Jordan, Multi-task feature selection, in: Proceedings of the ICML Workshop on Structural Knowledge Transfer for Machine Learning, 2006.
[19] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vis. 42 (3) (2001) 145–175.
[20] J. Tang, X. Hu, H. Gao, H. Liu, Unsupervised feature selection for multi-view data in social media, in: Proceedings of the SIAM International Conference on Data Mining, 2013, pp. 270–278.
[21] Z. Xu, I. King, M.R. Lyu, R. Jin, Discriminative semi-supervised feature selection via manifold regularization, IEEE Trans. Neural Netw. 21 (7) (2010) 1033–1047.
[22] Y. Yang, H.T. Shen, Z. Ma, Z. Huang, X. Zhou, l2,1-norm regularized discriminative feature selection for unsupervised learning, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2011, pp. 1589–1594.
[23] J. Yu, Y. Rui, Y.Y. Tang, D. Tao, High-order distance-based multiview stochastic learning in image classification, 2014.
[24] J. Yu, Y. Rui, D. Tao, Click prediction for web image reranking using multimodal sparse coding, 2014.
[25] J. Yu, D. Tao, Modern Machine Learning Techniques and their Applications in Cartoon Animation Research, vol. 4, John Wiley & Sons, 2013.
[26] J. Yu, D. Tao, M. Wang, Adaptive hypergraph learning and its application in image classification, IEEE Trans. Image Process. 21 (7) (2012) 3262–3272.
[27] J. Yu, D. Tao, M. Wang, Y. Rui, Learning to rank using user clicks and visual features for image retrieval, 2014.
[28] J. Zhao, K. Lu, X. He, Locality sensitive semi-supervised feature selection, Neurocomputing 71 (10) (2008) 1842–1849.
Yangxi Li is a senior engineer at the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT/CC). He received his Ph.D. degree from Peking University. His research interests lie primarily in multimedia search, information retrieval, and computer vision.