J. Vis. Commun. Image R. 25 (2014) 1894–1904
A novel semi-supervised canonical correlation analysis and extensions for multi-view dimensionality reduction XiaoBo Shen, QuanSen Sun ⇑ School of Computer Science and Engineering, Nanjing University of Science & Technology, Nanjing 210094, China
Article history: Received 18 January 2014; Accepted 8 September 2014; Available online 16 September 2014.

Keywords: Canonical correlation analysis; Semi-supervised learning; Label propagation; Sparse representation; Multi-view learning; Feature extraction; Dimensionality reduction; Image recognition
Abstract

Canonical correlation analysis (CCA) is an efficient method for dimensionality reduction on two-view data. However, as an unsupervised learning method, CCA cannot utilize the partial label information available in multi-view semi-supervised scenarios. In this paper, we propose a novel two-view semi-supervised learning method, called semi-supervised canonical correlation analysis based on label propagation (LPbSCCA). LPbSCCA incorporates a new sparse representation based label propagation algorithm to infer label information for unlabeled data. Specifically, it first constructs dictionaries consisting of all labeled samples; it then obtains reconstruction coefficients of the unlabeled samples using a sparse representation technique; finally, by combining the given labels of the labeled samples, it estimates label information for the unlabeled ones. After that, it constructs soft label matrices of all samples and probabilistic within-class scatter matrices in each view. Finally, to enhance the discriminative power of the features, LPbSCCA is formulated to maximize the correlations between samples of the same class across views, while simultaneously minimizing within-class variations in the low-dimensional feature space of each view. Furthermore, we extend a general model, called LPbSMCCA, to handle data from multiple (more than two) views. Extensive experimental results on several well-known datasets demonstrate that the proposed methods achieve better recognition performance and robustness than existing related methods.

Crown Copyright © 2014 Published by Elsevier Inc. All rights reserved.
1. Introduction

In many applications of computer vision and pattern recognition, the same objects can be observed from different viewpoints or by different sensors, generating multiple distinct and heterogeneous features. For example, a video can be represented by a sound feature and an image feature; a webpage can be represented by a text feature and an image feature. Such different representations are referred to as multi-view data [1–4]. Recent studies [1,5–12] show that learning from such multi-view data often leads to better performance than learning from single-view data. However, multi-view data are usually represented as high-dimensional vectors, which cannot be analyzed directly. Thus, dimensionality reduction (DR) for multi-view data is a necessary and important preprocessing step for subsequent tasks. Until now, many multi-view DR (MvDR) methods have been proposed, including unsupervised ones [3,13,14], supervised ones [15,16], and semi-supervised ones [2,8].
⇑ Corresponding author. Fax: +86 25 84318156. E-mail addresses: [email protected] (X. Shen), [email protected] (Q. Sun). http://dx.doi.org/10.1016/j.jvcir.2014.09.004 1047-3203/Crown Copyright © 2014 Published by Elsevier Inc. All rights reserved.
The most typical approach to multi-view dimensionality reduction is canonical correlation analysis (CCA) [17,18], which was proposed by Hotelling in 1936. CCA investigates the linear correlations between two sets of features from the same patterns. It linearly projects the two sets of features into a low-dimensional feature space where they are maximally correlated. CCA and its variants have since received increasing attention [5,6,19–26]. CCA is essentially a linear subspace method, and thus fails to discover nonlinear correlations between two datasets. To remedy this problem, nonlinear extensions, e.g., kernel CCA (KCCA) [17] and locality preserving CCA (LPCCA) [21], were proposed and successfully applied to data visualization and pose estimation. Moreover, sparse extensions of CCA [25,26] were also presented for gene classification and cross-language document retrieval. The aforementioned extensions are unsupervised methods, which may not be suitable for classification tasks. To enhance the discriminative power of CCA features, several supervised variants, e.g., discriminative CCA (DCCA) [20], generalized CCA (GCCA) [19], local discrimination CCA (LDCCA) [22], and sparse representation based discriminative CCA (SPDCCA) [27], were proposed. They incorporate class label information into CCA from different angles, and all improve recognition rates to some extent. Moreover, there are also some semi-supervised extensions [4,28] of CCA.
For example, Peng and Zhang [28] proposed a semi-supervised CCA method (SemiCCA). SemiCCA incorporates must-link and cannot-link constraints on labeled samples into the objective of CCA to improve classification performance in semi-supervised scenarios.

Semi-supervised learning (SSL) [29] is an active research field in pattern recognition and machine learning. The main task of SSL algorithms is to exploit a small amount of labeled data together with a large amount of unlabeled data for training. Among SSL techniques, graph-based approaches [2,8,30–34] form one of the most active directions. SSL techniques have also been incorporated into DR, and many SSL-DR algorithms have been proposed, including single-view ones [35–38] and multi-view ones [2,8]. For example, in multi-view scenarios, Hou et al. [2] developed multiple view semi-supervised dimension reduction (MVSSDR), which learns a consensus pattern using domain knowledge in the form of pairwise constraints. Pairwise constraints based multi-view subspace learning (PC-MSL) [8] aims to learn a unified low-dimensional subspace to effectively fuse multiple features. It takes both the data distribution of labeled and unlabeled samples and user labeling information into consideration, and thus achieves good performance in scene classification. On the other hand, label propagation [39] is a popular family of graph-based SSL methods. It models the whole dataset (both labeled and unlabeled) as a graph, and then propagates label information from the labeled samples to the unlabeled ones through the constructed graph. Several label propagation algorithms have recently been proposed [40–42]. Many graph-based methods construct a k-nearest-neighborhood (KNN) graph and employ a Gaussian function to compute the graph edge weights, so some parameters (i.e., the neighborhood size k and the bandwidth σ) have to be set manually beforehand.
However, recent studies [41,43,44] show that the parameter setting and distance measurement (Euclidean distance) in this approach are very sensitive to noisy data. Recently, inspired by the success of sparse representation [43,45,46], a novel graph, the l1-graph [43,44], has been proposed, which is more robust to noise and adapts the neighborhood of each sample. By introducing this new graph [43,44], we propose in this paper a novel semi-supervised canonical correlation analysis based on label propagation (LPbSCCA). Several advantages of LPbSCCA can be highlighted. (1) LPbSCCA does not require a one-to-one correspondence between samples in different views, and thus can be applied to more general scenarios than SemiCCA [28]. (2) LPbSCCA is free of parameters thanks to the l1-graph, while previous related methods (i.e., SemiCCA [28], S2GCA [4]) have to tune regularization parameters; thus LPbSCCA is more readily applicable. (3) LPbSCCA can be viewed as a general extension of DCCA
Table 1
List of notations.

| Notation | Description |
| --- | --- |
| $X, Y$ | Two groups of feature representations in two-view scenarios |
| $w_i^{(X)}$ | Sparse reconstruction coefficient vector of $x_i$ in view X |
| $\tilde{f}_i^{(X)}$ | Label vector of $x_i$ in view X |
| $\tilde{F}^{(X)}, \tilde{F}^{(Y)}$ | Soft label matrix in view X (Y) |
| $\{X^{(i)} \in \mathbb{R}^{p_i \times N}\}_{i=1}^m$ | m groups of feature representations from N patterns |
| $X_L^{(i)}$ | l labeled samples in the ith view |
| $\tilde{F}^{(i)}$ | Soft label matrix in the ith view |
| $\tilde{A}^{(ij)}$ | Probabilistic correlation matrix between the ith and jth views |
| $\tilde{S}_w^{(i)}$ | Probabilistic within-class scatter matrix in the ith view |
| $W^{(i)}$ | Projection matrix in the ith view |
| $a^{(i)}$ | Projection direction in the ith view |
under the view of probability: LPbSCCA computes correlations through soft labels, whereas the correlations in DCCA are revealed by hard labels (i.e., 0, 1). (4) Based on LPbSCCA, we extend a general model, namely LPbSMCCA, which can simultaneously deal with data from multiple (more than two) views.

The remainder of the paper is organized as follows. In Section 2, we briefly review the basic theory of CCA and DCCA. In Section 3, LPbSCCA and its extended model LPbSMCCA are developed. In Section 4, we experimentally evaluate the proposed methods on several well-known datasets to demonstrate their effectiveness. Finally, concluding remarks and a discussion of future work are given in Section 5. For convenience, Table 1 lists some important notations used in the remainder of this paper.

2. Related work

2.1. Canonical correlation analysis

Given a set of N pair-wise samples $\{x_i, y_i\} \in \mathbb{R}^p \times \mathbb{R}^q$ (i = 1, ..., N), we define two feature representation matrices $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{p \times N}$ and $Y = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{q \times N}$. Assume that all samples are centered, i.e., $\bar{x} = \sum_{i=1}^N x_i / N = 0$ and $\bar{y} = \sum_{i=1}^N y_i / N = 0$. The aim of canonical correlation analysis (CCA) [17,18] is to find pairs of projection directions, $a \in \mathbb{R}^p$ and $b \in \mathbb{R}^q$, such that the correlations between the canonical variables $z_1 = a^T X$ and $z_2 = b^T Y$ are maximized. The projection pair is obtained by maximizing the following correlation function $\rho$:
$$\max_{a,b}\ \rho(a,b) = \frac{E\left[a^T x y^T b\right]}{\sqrt{E\left[a^T x x^T a\right]}\sqrt{E\left[b^T y y^T b\right]}} = \frac{a^T S_{xy} b}{\sqrt{a^T S_{xx} a}\sqrt{b^T S_{yy} b}}, \qquad \text{s.t. } a^T S_{xx} a = 1,\ b^T S_{yy} b = 1 \tag{1}$$

where $S_{xx} = E[xx^T] = XX^T/N \in \mathbb{R}^{p \times p}$ and $S_{yy} = E[yy^T] = YY^T/N \in \mathbb{R}^{q \times q}$ are the within-set covariance matrices of views X and Y, respectively, and $S_{xy} = E[xy^T] = XY^T/N \in \mathbb{R}^{p \times q}$ is the between-set covariance matrix between views X and Y. With the Lagrange multiplier method, this problem can be equivalently reformulated as two generalized eigenvalue problems [17]. Besides, Melzer et al. [47] also presented an equivalent SVD-based approach to solve the CCA model.

2.2. Discriminative CCA

As an unsupervised multi-view learning method, CCA focuses on the correlations between pair-wise samples from different views but ignores the class information of the samples, which limits its recognition performance. To remedy this shortcoming, Sun et al. [20] proposed discriminative CCA (DCCA) by incorporating class information into the CCA model. DCCA seeks pairs of projection directions, $a \in \mathbb{R}^p$ and $b \in \mathbb{R}^q$, that maximize the within-class correlations between different views (between-class correlations are automatically minimized; see [20] for details). The formulation of DCCA is as follows:
$$\arg\max_{a,b}\ \frac{a^T X A Y^T b}{\sqrt{a^T S_{xx} a}\sqrt{b^T S_{yy} b}} \tag{2}$$

where $S_{xx}$, $S_{yy}$ are the within-set covariance matrices of views X and Y, and $A = \operatorname{diag}(1_{N_1 \times N_1}, 1_{N_2 \times N_2}, \ldots, 1_{N_c \times N_c})$ reflects the within-class correlations between samples from views X and Y; here c is the number of classes, $N_i$ is the number of samples in the ith class with $\sum_{i=1}^c N_i = N$, and $1_{N_i \times N_i}$ is an $N_i \times N_i$ square matrix whose entries all equal 1. Similar to CCA, DCCA can also be transformed into a generalized eigenvalue problem [20]. From (2), we can see that DCCA
breaks the one-to-one correspondence restriction between samples in different views. Thus DCCA can be applied to more general scenarios than CCA.
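As noted above, both CCA and DCCA reduce to a generalized eigenvalue problem. The following is a minimal numerical sketch of the two-view CCA solution of Section 2.1 (our own illustration, not the authors' code); the small `reg` ridge term and the toy data are our additions for numerical stability and demonstration.

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, d=1, reg=1e-6):
    """Solve CCA as the generalized eigenvalue problem of Section 2.1.

    X: p x N, Y: q x N, columns are paired (centered) samples.
    Returns projection matrices Wx (p x d) and Wy (q x d).
    """
    p, N = X.shape
    q = Y.shape[0]
    Sxx = X @ X.T / N + reg * np.eye(p)   # within-set covariance (regularized)
    Syy = Y @ Y.T / N + reg * np.eye(q)
    Sxy = X @ Y.T / N                     # between-set covariance
    # Block problem: [[0, Sxy], [Syx, 0]] v = lambda [[Sxx, 0], [0, Syy]] v
    A = np.block([[np.zeros((p, p)), Sxy], [Sxy.T, np.zeros((q, q))]])
    B = np.block([[Sxx, np.zeros((p, q))], [np.zeros((q, p)), Syy]])
    vals, vecs = eigh(A, B)               # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :d]            # eigenvectors of the d largest eigenvalues
    return top[:p], top[p:]

# Toy usage: two views sharing a latent signal z.
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = np.vstack([z + 0.1 * rng.standard_normal(500), rng.standard_normal(500)])
Y = np.vstack([-z + 0.1 * rng.standard_normal(500), rng.standard_normal(500)])
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)
Wx, Wy = cca(X, Y)
r = np.corrcoef(Wx.T @ X, Wy.T @ Y)[0, 1]
print(abs(r))  # close to 1 for strongly correlated views
```

The block-matrix form makes the symmetry of the problem explicit; an equivalent SVD-based route, as in Melzer et al. [47], whitens each view and takes the SVD of the cross-covariance.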
3. Semi-supervised canonical correlation analysis based on label propagation (LPbSCCA) and its extensions

3.1. Soft label matrix estimation

The main goal of label propagation is to propagate label information from labeled samples to unlabeled ones by constructing proper graphs. The widely used k-nearest-neighborhood graph is based on pair-wise Euclidean distance, which is very sensitive to data noise. Another drawback of this approach is that the neighborhood size k must be tuned manually, and selecting the optimal value is a challenge. Recently, inspired by the success of the l1-graph [43], Lu et al. [44] proposed a sparse representation based label propagation algorithm, which is more robust to noise and adapts the neighborhood of each sample. We therefore adopt this approach to infer label information for the unlabeled samples.

Assume that two groups of features from N samples are given as $X = [x_1, \ldots, x_l, x_{l+1}, \ldots, x_N] \in \mathbb{R}^{p \times N}$ and $Y = [y_1, \ldots, y_l, y_{l+1}, \ldots, y_N] \in \mathbb{R}^{q \times N}$, and that samples from views X and Y have a one-to-one correspondence; $\{x_i, y_i\}_{i=1}^{l}$ and $\{x_i, y_i\}_{i=l+1}^{N}$ denote the labeled and unlabeled sample pairs, respectively. The first l pairs can be separately represented as $X_L = [x_1^1, \ldots, x_{l_1}^1, \ldots, x_1^c, \ldots, x_{l_c}^c]$ and $Y_L = [y_1^1, \ldots, y_{l_1}^1, \ldots, y_1^c, \ldots, y_{l_c}^c]$, where $x_i^k, y_i^k$ are the ith sample of the kth class in views X and Y, c is the number of classes, $l_k$ denotes the number of labeled samples in the kth class, and $\sum_{k=1}^c l_k = l$.

We first present the construction of the soft label matrix in view X. Inspired by sparse representation methods [43,45], we use all labeled samples $X_L$ in view X as a dictionary, so that each unlabeled sample can be linearly reconstructed from $X_L$. Specifically, given an unlabeled sample $x_i$ (i = l+1, ..., N), we solve the following optimization problem

$$w_i^{(X)} = \arg\min \left\|w_i^{(X)}\right\|_1 \quad \text{s.t. } \left\|X_L w_i^{(X)} - x_i\right\| \le \varepsilon \tag{3}$$

where $\varepsilon$ is the reconstruction error bound and $w_i^{(X)} \in \mathbb{R}^l$ is the sparse reconstruction coefficient vector of sample $x_i$. This problem can be solved efficiently by standard linear programming using many available toolboxes, such as l1-magic (www.acm.caltech.edu/l1magic), SPGL1 (www.cs.ubc.ca/labs/scl/spgl1/index.html), and SLEP (http://www.public.asu.edu/~jye02/Software/SLEP/index.htm). We define $w_i^{(X)} = [w_{i1}^{(X)}, w_{i2}^{(X)}, \ldots, w_{il}^{(X)}]$, where $w_{ij}^{(X)}$ is the reconstruction coefficient of $x_i$ on $x_j$. The normalized weights are defined as follows

$$u_{ij}^{(X)} = w_{ij}^{(X)} \Big/ \sum_{j=1}^{l} w_{ij}^{(X)} \tag{4}$$

Obviously, $0 \le u_{ij}^{(X)} \le 1$; $u_{ij}^{(X)}$ represents the constructive contribution of $x_j$ to $x_i$. In particular, if $u_{ij}^{(X)} = 1$, we infer that $x_i$ is constructed solely from $x_j$. Thus we can similarly use $u_{ij}^{(X)}$ to reconstruct the label information of $x_i$.

We define the soft label matrix in view X as $\tilde{F}^{(X)} = \left[\tilde{f}_1^{(X)T}, \tilde{f}_2^{(X)T}, \ldots, \tilde{f}_N^{(X)T}\right]^T \in \mathbb{R}^{N \times c}$, where the label vector of $x_i$ is represented as $\tilde{f}_i^{(X)} = \left[\tilde{f}_{i1}^{(X)}, \tilde{f}_{i2}^{(X)}, \ldots, \tilde{f}_{ic}^{(X)}\right] \in \mathbb{R}^c$ (i = 1, ..., N). For an unlabeled sample $x_i$ (i = l+1, ..., N), $\tilde{f}_{ik}^{(X)}$ is defined as

$$\tilde{f}_{ik}^{(X)} = \sum_{x_j \in c_k} u_{ij}^{(X)} \tag{5}$$

For a labeled sample $x_i$ (i = 1, ..., l), $\tilde{f}_{ik}^{(X)}$ is defined as

$$\tilde{f}_{ik}^{(X)} = \begin{cases} 1 & x_i \in c_k \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where k = 1, ..., c. Similarly, we can construct the soft label matrix $\tilde{F}^{(Y)}$ for view Y using the above approach; we omit the details due to space limitations.

3.2. Probabilistic within-class scatter matrix construction

Based on the two soft label matrices $\tilde{F}^{(X)}$ and $\tilde{F}^{(Y)}$, we now construct the probabilistic within-class scatter matrices $\tilde{S}_w^{(X)}, \tilde{S}_w^{(Y)}$ in views X and Y, respectively. They are defined as follows

$$\tilde{S}_w^{(X)} = \sum_{k=1}^{c} \sum_{j=1}^{N} \tilde{f}_{jk}^{(X)} \left(x_j - \tilde{m}_k^{(X)}\right)\left(x_j - \tilde{m}_k^{(X)}\right)^T \tag{7}$$

$$\tilde{S}_w^{(Y)} = \sum_{k=1}^{c} \sum_{j=1}^{N} \tilde{f}_{jk}^{(Y)} \left(y_j - \tilde{m}_k^{(Y)}\right)\left(y_j - \tilde{m}_k^{(Y)}\right)^T \tag{8}$$

where $\tilde{m}_k^{(X)} = \frac{1}{N_k^{(X)}} \sum_{i=1}^{N} \tilde{f}_{ik}^{(X)} x_i$, $\tilde{m}_k^{(Y)} = \frac{1}{N_k^{(Y)}} \sum_{i=1}^{N} \tilde{f}_{ik}^{(Y)} y_i$, $N_k^{(X)} = \sum_{i=1}^{N} \tilde{f}_{ik}^{(X)}$, and $N_k^{(Y)} = \sum_{i=1}^{N} \tilde{f}_{ik}^{(Y)}$ (k = 1, ..., c).

From the above formulations, it is clear that the probabilistic within-class scatter matrix is a general extension of the classical within-class scatter matrix under the view of probability. In particular, if each sample fully belongs to one unique class, i.e., the label vector of each sample is encoded only with {0, 1}, the probabilistic within-class scatter matrix naturally reduces to the classical one.
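To make the soft-label pipeline of Eqs. (3)-(6) concrete, here is a small sketch. Note one deliberate substitution: non-negative least squares (`scipy.optimize.nnls`) stands in for the l1-minimization of Eq. (3), since the paper's setting would use a dedicated l1 solver such as l1-magic, SPGL1, or SLEP. The function name, the uniform fallback for an all-zero coefficient vector, and the toy data are our own assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def soft_labels(XL, labels, XU, c):
    """Soft label matrix estimation, sketching Eqs. (3)-(6).

    XL: p x l labeled samples (dictionary); labels: length-l class indices in [0, c);
    XU: p x u unlabeled samples. Returns the (l+u) x c soft label matrix F.
    """
    l, u = XL.shape[1], XU.shape[1]
    F = np.zeros((l + u, c))
    F[np.arange(l), labels] = 1.0                 # Eq. (6): hard labels for labeled samples
    for i in range(u):
        w, _ = nnls(XL, XU[:, i])                 # stand-in for the l1 problem of Eq. (3)
        s = w.sum()
        # Eq. (4): normalize coefficients (uniform fallback if reconstruction is all-zero)
        uij = w / s if s > 0 else np.full(l, 1.0 / l)
        for k in range(c):
            F[l + i, k] = uij[labels == k].sum()  # Eq. (5): accumulate weights per class
    return F

# Toy usage: two well-separated classes in 2-D.
rng = np.random.default_rng(1)
XL = np.hstack([rng.normal(1, 0.1, (2, 5)), rng.normal(5, 0.1, (2, 5))])
labels = np.array([0] * 5 + [1] * 5)
XU = np.hstack([rng.normal(1, 0.1, (2, 3)), rng.normal(5, 0.1, (2, 3))])
F = soft_labels(XL, labels, XU, c=2)
print(F[10:].round(2))  # soft labels of the unlabeled samples
```

Each row of `F` is non-negative and sums to one, so the unlabeled rows can be read directly as class-membership probabilities, which is exactly what Section 3.2 needs.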
3.3. Formulation and solution of LPbSCCA

In this section, we formulate the optimization problem of LPbSCCA as follows

$$\max_{a,b}\ J_{\mathrm{LPbSCCA}}(a, b) = \frac{a^T X \tilde{A} Y^T b}{\sqrt{a^T \tilde{S}_w^{(X)} a}\sqrt{b^T \tilde{S}_w^{(Y)} b}} \tag{9}$$

where $\tilde{A} = \tilde{F}^{(X)} \tilde{F}^{(Y)T}$ is a probabilistic correlation matrix with $\tilde{A}_{ij} = \tilde{f}_i^{(X)} \tilde{f}_j^{(Y)T}$, $0 \le \tilde{A}_{ij} \le 1$, and $\tilde{S}_w^{(X)}, \tilde{S}_w^{(Y)}$ are the probabilistic within-class scatter matrices defined in Section 3.2.

It is worth highlighting the discriminative information from the following two aspects. (1) Cross-view discrimination. We maximize the correlations of same-class samples across views for discrimination, and this kind of discriminative information is encoded in $\tilde{A}$. $\tilde{A}_{ij}$ can be viewed as the probability that $x_i$ and $y_j$ belong to the same class, and thus can be regarded as a correlation measurement between these two samples. If $\tilde{A}_{ij} = 1$, which means $\tilde{f}_i^{(X)} = \tilde{f}_j^{(Y)}$, we regard $x_i$ and $y_j$ as totally correlated; if $\tilde{A}_{ij} = 0$, which means that $x_i$ and $y_j$ belong to two different classes, we consider these two samples totally uncorrelated. In particular, it is clear that $\tilde{A}$, in the view of probability, can be regarded as a general extension of A in DCCA. (2) Within-view discrimination. In each view, we meanwhile try to minimize within-class variations by introducing the probabilistic within-class scatter matrices. In this case, samples
from the same class will be kept as close as possible in the low-dimensional feature space, which helps to further enhance the discrimination of LPbSCCA.

The objective of LPbSCCA, i.e., (9), is equivalent to the following optimization problem with equality constraints

$$\max_{a,b}\ J_{\mathrm{LPbSCCA}}(a, b) = a^T X \tilde{A} Y^T b \qquad \text{s.t. } a^T \tilde{S}_w^{(X)} a = 1,\ b^T \tilde{S}_w^{(Y)} b = 1 \tag{10}$$

With the Lagrange multiplier method, (10) can be solved as follows

$$X \tilde{A} Y^T b = \lambda \tilde{S}_w^{(X)} a, \qquad Y \tilde{A}^T X^T a = \lambda \tilde{S}_w^{(Y)} b \tag{11}$$

Alternatively, we can combine (11) into a single generalized eigenvalue problem below

$$\begin{pmatrix} 0 & X \tilde{A} Y^T \\ Y \tilde{A}^T X^T & 0 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \lambda \begin{pmatrix} \tilde{S}_w^{(X)} & 0 \\ 0 & \tilde{S}_w^{(Y)} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} \tag{12}$$

Furthermore, we give two theorems about properties of the solutions of LPbSCCA.

Theorem 1. The projection matrices $W_x = [a_1, \ldots, a_d]$ and $W_y = [b_1, \ldots, b_d]$ satisfy the following property

$$a_i^T \tilde{S}_w^{(X)} a_j = b_i^T \tilde{S}_w^{(Y)} b_j = \delta_{ij}, \qquad a_i^T \tilde{C}^{(XY)} b_j = \delta_{ij} \tag{13}$$

where $\tilde{C}^{(XY)} = X \tilde{A} Y^T$ and $\delta_{ij} = 1$ if $i = j$, $0$ if $i \ne j$, for $i, j = 1, \ldots, d$.

The proof of Theorem 1 is given in Appendix A. From Theorem 1, it is clear that the projection directions a (b) are conjugately orthogonal with each other with respect to the within-class scatter matrix $\tilde{S}_w^{(X)}$ ($\tilde{S}_w^{(Y)}$) in each view.

Theorem 2. In LPbSCCA, we can obtain at most c (the number of total classes) pairs of projection directions.

The proof of Theorem 2 is given in Appendix B.

3.4. Multi-view extension

LPbSCCA only concerns two views, and thus cannot handle data from multiple (more than two) views. However, several feature representations revealing different characteristics of the same objects are prevalent in reality. How to simultaneously deal with any number of groups of features is a practical and fundamental problem [3,7]. To address this problem, we extend LPbSCCA to a general model called LPbSMCCA, which involves multiple (more than two) views simultaneously.

Assume that m ($m \ge 2$) groups of high-dimensional features from the same N patterns are given as $\{X^{(i)} \in \mathbb{R}^{p_i \times N}\}_{i=1}^m$, where $X^{(i)} = [x_1^{(i)}, x_2^{(i)}, \ldots, x_N^{(i)}]$, $x_j^{(i)} \in \mathbb{R}^{p_i}$ (j = 1, 2, ..., N), and $p_i$ denotes the dimensionality of features in the ith view. $\{x_j^{(i)}\}_{j=1}^{l}$ and $\{x_j^{(i)}\}_{j=l+1}^{N}$ are the labeled and unlabeled samples in the ith view, respectively. Then LPbSMCCA can be formulated as below

$$\max\ J_{\mathrm{LPbSMCCA}}\left(a^{(1)}, \ldots, a^{(m)}\right) = \sum_{i=1}^{m} \sum_{j \ne i} a^{(i)T} X^{(i)} \tilde{A}^{(ij)} X^{(j)T} a^{(j)} \qquad \text{s.t. } \sum_{i=1}^{m} a^{(i)T} \tilde{S}_w^{(i)} a^{(i)} = 1 \tag{14}$$

where $\tilde{A}^{(ij)} = \tilde{F}^{(i)} \tilde{F}^{(j)T}$, and $\tilde{S}_w^{(i)}$ denotes the probabilistic within-class scatter matrix in the ith view; both can be computed using the steps in Sections 3.1 and 3.2, respectively. From the above formulation, we see that LPbSCCA can be regarded as a special case of LPbSMCCA: when only two views (m = 2) are involved, LPbSMCCA naturally reduces to LPbSCCA.

Using the Lagrange multiplier method, we transform (14) into the following optimization problem

$$L\left(a^{(1)}, \ldots, a^{(m)}, \lambda\right) = \sum_{i=1}^{m} \sum_{j \ne i} a^{(i)T} X^{(i)} \tilde{A}^{(ij)} X^{(j)T} a^{(j)} - \frac{\lambda}{2}\left(\sum_{i=1}^{m} a^{(i)T} \tilde{S}_w^{(i)} a^{(i)} - 1\right) \tag{15}$$

Setting $\partial L / \partial a^{(i)} = 0$, we obtain

$$\sum_{j \ne i} X^{(i)} \tilde{A}^{(ij)} X^{(j)T} a^{(j)} - \lambda \tilde{S}_w^{(i)} a^{(i)} = 0 \tag{16}$$

or

$$\sum_{j \ne i} X^{(i)} \tilde{A}^{(ij)} X^{(j)T} a^{(j)} = \lambda \tilde{S}_w^{(i)} a^{(i)} \tag{17}$$

After some matrix transformations, (17) can be written as

$$\begin{pmatrix} 0 & X^{(1)} \tilde{A}^{(12)} X^{(2)T} & \cdots & X^{(1)} \tilde{A}^{(1m)} X^{(m)T} \\ X^{(2)} \tilde{A}^{(21)} X^{(1)T} & 0 & \cdots & X^{(2)} \tilde{A}^{(2m)} X^{(m)T} \\ \vdots & \vdots & \ddots & \vdots \\ X^{(m)} \tilde{A}^{(m1)} X^{(1)T} & X^{(m)} \tilde{A}^{(m2)} X^{(2)T} & \cdots & 0 \end{pmatrix} \begin{pmatrix} a^{(1)} \\ a^{(2)} \\ \vdots \\ a^{(m)} \end{pmatrix} = \lambda \begin{pmatrix} \tilde{S}_w^{(1)} & & & \\ & \tilde{S}_w^{(2)} & & \\ & & \ddots & \\ & & & \tilde{S}_w^{(m)} \end{pmatrix} \begin{pmatrix} a^{(1)} \\ a^{(2)} \\ \vdots \\ a^{(m)} \end{pmatrix} \tag{18}$$

Finally, the solution of the LPbSMCCA model (14) is given by the eigenvectors of the generalized eigenvalue problem (18). We select the eigenvectors $\left\{a_k = \left[a_k^{(1)T}, a_k^{(2)T}, \ldots, a_k^{(m)T}\right]^T\right\}_{k=1}^{d}$ corresponding to the first d largest eigenvalues in (18), and form the projection matrix of the ith view as $W^{(i)} = \left[a_1^{(i)}, a_2^{(i)}, \ldots, a_d^{(i)}\right] \in \mathbb{R}^{p_i \times d}$. The detailed flow of LPbSMCCA is presented in Table 2.

Table 2
The flow of LPbSMCCA.

Input: m sets of variables $X^{(i)} = \left[x_1^{(i)}, \ldots, x_l^{(i)}, x_{l+1}^{(i)}, \ldots, x_N^{(i)}\right] \in \mathbb{R}^{p_i \times N}$, i = 1, 2, ..., m.
For i = 1, 2, ..., m:
  (1) Compute the normalized weights according to (3) and (4);
  (2) Construct the soft label matrix $\tilde{F}^{(i)}$ according to (5) and (6);
  (3) Compute the probabilistic within-class scatter matrix $\tilde{S}_w^{(i)}$ according to (7);
End
(4) Compute the matrices $\tilde{A}^{(ij)} = \tilde{F}^{(i)} \tilde{F}^{(j)T}$, i, j = 1, 2, ..., m and $i \ne j$;
(5) Obtain the d projection vector sets $\left\{\left[a_j^{(1)T}, a_j^{(2)T}, \ldots, a_j^{(m)T}\right]^T\right\}_{j=1}^{d}$ by computing the eigenvectors with respect to the d largest eigenvalues of the generalized eigenvalue problem (18);
Output: m projection matrices $\left\{W^{(i)} = \left[a_1^{(i)}, a_2^{(i)}, \ldots, a_d^{(i)}\right]\right\}_{i=1}^{m}$.
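Step (5) above amounts to assembling the block matrices of (18) and solving a single generalized eigenvalue problem. The following sketch is our own illustration, not the authors' implementation; the small ridge term added to the scatter blocks, and the stand-in scatter matrices in the toy usage, are assumptions made for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def lpbsmcca(Xs, Fs, Sws, d=2):
    """Assemble and solve the generalized eigenvalue problem (18).

    Xs:  list of m view matrices X^(i), each (p_i, N)
    Fs:  list of m soft label matrices F^(i), each (N, c)
    Sws: list of m probabilistic within-class scatter matrices, each (p_i, p_i)
    Returns the m projection matrices W^(i) of shape (p_i, d).
    """
    m = len(Xs)
    ps = [X.shape[0] for X in Xs]
    P = sum(ps)
    off = np.cumsum([0] + ps)
    C = np.zeros((P, P))                  # left-hand block matrix of (18)
    B = np.zeros((P, P))                  # block-diagonal scatter matrix of (18)
    for i in range(m):
        si, ei = off[i], off[i + 1]
        B[si:ei, si:ei] = Sws[i] + 1e-6 * np.eye(ps[i])   # small ridge for stability
        for j in range(m):
            if i == j:
                continue
            A_ij = Fs[i] @ Fs[j].T        # probabilistic correlation matrix A^(ij)
            sj, ej = off[j], off[j + 1]
            C[si:ei, sj:ej] = Xs[i] @ A_ij @ Xs[j].T
    vals, vecs = eigh(C, B)               # generalized eigenvalues, ascending
    top = vecs[:, ::-1][:, :d]            # eigenvectors of the d largest eigenvalues
    return [top[off[i]:off[i + 1]] for i in range(m)]

# Toy usage with one-hot soft labels (the hard-label special case).
rng = np.random.default_rng(0)
N, c = 20, 2
F = np.zeros((N, c)); F[np.arange(N), np.arange(N) % c] = 1
Xs = [rng.standard_normal((p, N)) for p in (3, 4, 5)]
Sws = [X @ X.T / N for X in Xs]   # stand-in SPD matrices for the scatter terms
Ws = lpbsmcca(Xs, [F] * 3, Sws, d=2)
print([W.shape for W in Ws])
```

The left-hand block matrix is symmetric because $(X^{(i)} \tilde{A}^{(ij)} X^{(j)T})^T = X^{(j)} \tilde{A}^{(ji)} X^{(i)T}$, which is what allows the symmetric generalized solver to be used.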
number of nonzero entries in the reconstruction coefficients) [45]; thus the complexity of this part is $O((t^3 + l)(N - l)m)$. In the second part, solving a generalized eigenvalue problem requires $O(p^3)$ time, where $p = \sum_{i=1}^m p_i$. Thus, the total time complexity of LPbSMCCA is about $O((t^3 + l)(N - l)m + p^3)$.

After the training process of LPbSMCCA, given a new test sample $x_T = (x^{(1)T}, x^{(2)T}, \ldots, x^{(m)T})$, $x^{(i)} \in \mathbb{R}^{p_i}$, we obtain the final low-dimensional feature $z = \sum_{i=1}^m W^{(i)T} x^{(i)}$ using an effective fusion strategy [6,20,22]. Finally, z can be combined with any classifier (e.g., the 1NN classifier) for subsequent classification tasks.

4. Experimental results and analysis

To validate the effectiveness of the two proposed methods, a series of experiments is performed on popular handwritten numeral and face datasets. Specifically, we compare LPbSCCA with several CCA-related methods (i.e., CCA [6], DCCA [20], SemiCCA [28], SPDCCA [27]) on the MFD dataset [48], the CENPARMI dataset [49], and the Yale face dataset [50]. We then further apply LPbSMCCA to the AR face dataset and select several related multi-view DR methods (i.e., LDA [50], multiset CCA (MCCA) [51], GMLDA [16], MVSSDR [2], MVFDA [12], LPbSMCCA_KNN) for comparison. The compared methods are configured as follows.

(1) SemiCCA [28]: the two kinds of supervised information (must-link and cannot-link constraints) are derived from the given labels of the labeled samples in the training set.
(2) SPDCCA [27]: SPDCCA incorporates both sparse representation and discriminative information into CCA simultaneously. We obtain the optimal parameter a by searching the range [0, 1] with step 0.1.
(3) LDA [50]: LDA projection directions are learned separately in each view, and the final low-dimensional features are obtained with the same fusion strategy as LPbSMCCA.
Table 3
The six feature sets of the Multiple Feature dataset in UCI.
- Fac: 216-dimensional profile correlations feature
- Fou: 76-dimensional Fourier coefficients feature
- Kar: 64-dimensional Karhunen-Loève coefficients feature
- Mor: 6-dimensional morphological feature
- Pix: 240-dimensional pixel averages feature
- Zer: 47-dimensional Zernike moments feature
(4) MCCA [51]: MCCA is a generalized extension of CCA that can simultaneously handle several (more than two) views. We first obtain projection directions by MCCA, and then fuse the views in the same way as LPbSMCCA.
(5) GMLDA [16]: GMLDA is a multi-view extension under the Generalized Multiview Analysis (GMA) framework. Following the parameter settings in [16], we fix a = 10, l = 1, c = tr(B1)/tr(B2) in GMLDA.
(6) MVSSDR [2]: in MVSSDR, the two kinds of supervised information are obtained in the same way as for SemiCCA. The two parameters a and b are empirically set to 1 and 20, following the parameter settings in [37].
(7) LPbSMCCA_KNN: to compare with previous label propagation methods, we instead apply the existing k-nearest-neighborhood graph to infer soft labels, and name this modification LPbSMCCA_KNN. We empirically set k to 5.

In all experiments, the nearest neighbor classifier with the Euclidean metric is adopted, and random runs are repeated 10 times independently for the final classification tasks.
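The fusion-then-classify protocol used throughout the experiments ($z = \sum_i W^{(i)T} x^{(i)}$ followed by 1-NN with the Euclidean metric) can be sketched as follows; the function name and the toy data are our own illustration.

```python
import numpy as np

def fuse_and_classify(Ws, train_views, train_labels, test_views):
    """Fuse per-view projections into z = sum_i W^(i)T x^(i), then 1-NN classify."""
    Z_train = sum(W.T @ X for W, X in zip(Ws, train_views))   # (d, n_train)
    Z_test = sum(W.T @ X for W, X in zip(Ws, test_views))     # (d, n_test)
    # Squared Euclidean distance between every training/test pair
    d2 = ((Z_test[:, None, :] - Z_train[:, :, None]) ** 2).sum(axis=0)  # (n_train, n_test)
    return train_labels[np.argmin(d2, axis=0)]                # label of nearest neighbor

# Toy usage: identity projection, one 2-D view, two separated classes.
Ws = [np.eye(2)]
train = [np.array([[0.0, 0.0, 5.0, 5.0], [0.0, 1.0, 5.0, 6.0]])]
labels = np.array([0, 0, 1, 1])
test = [np.array([[0.1, 5.1], [0.2, 5.2]])]
pred = fuse_and_classify(Ws, train, labels, test)
print(pred)  # [0 1]
```

Summing the per-view projections, rather than concatenating them, keeps the fused feature at dimension d regardless of how many views are combined.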
4.1. Experiments on handwritten numerals datasets

4.1.1. Experiment on the Multiple Feature dataset in UCI
The Multiple Feature dataset (MFD) in UCI [48], concerning handwritten numerals, is adopted for this experiment. As part of the UCI machine learning repository, MFD includes 10 classes of handwritten numerals, i.e., the 10 digits from 0 to 9. Each class has 200 examples, so the sample size is 2000 in total. These digit characters are represented in terms of the six feature sets shown in Table 3. We randomly choose two sets of features as the X set and Y set, giving 15 ($C_6^2 = 15$) data combination modes. For each combination mode, we randomly choose 100 samples per class for training and use the rest for testing, i.e., 1000 training samples and 1000 testing samples. To construct a semi-supervised learning setting, we randomly label 20% of the training samples of each class, i.e., 20 labeled and 80 unlabeled training samples per class. The average maximal recognition rates of the different methods are listed in Table 4.

From Table 4, we can see that LPbSCCA is distinctly superior to the other existing methods. Specifically, over all 15 combinations, LPbSCCA achieves the best recognition rate 14 times, and CCA once. Especially in combinations 1, 3, 8, 9, 13, and 14, the maximal recognition rates of LPbSCCA exceed those of the other methods by about 10%. Moreover, the recognition rates of CCA, in several
Table 4
The average maximal recognition rates (%) of different methods on the MFD dataset.

| #  | Feature combination (X–Y) | LPbSCCA Acc. (Dim.) | CCA Acc. (Dim.) | DCCA Acc. (Dim.) | SemiCCA Acc. (Dim.) | SPDCCA Acc. (Dim.) |
| -- | ------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| 1  | Fou–Fac | 93.47 (9)  | 82.69 (14) | 79.08 (9)  | 82.41 (18) | 83.58 (12) |
| 2  | Fou–Kar | 91.22 (9)  | 83.68 (14) | 83.03 (9)  | 85.43 (16) | 84.18 (14) |
| 3  | Fou–Pix | 89.99 (9)  | 78.40 (10) | 73.30 (9)  | 75.91 (15) | 79.18 (10) |
| 4  | Fou–Zer | 82.23 (8)  | 79.92 (12) | 76.05 (8)  | 77.50 (14) | 80.14 (12) |
| 5  | Fou–Mor | 78.65 (6)  | 72.62 (5)  | 73.46 (6)  | 74.23 (6)  | 73.15 (6)  |
| 6  | Fac–Kar | 95.05 (9)  | 91.33 (16) | 83.09 (9)  | 88.67 (26) | 90.82 (18) |
| 7  | Fac–Pix | 94.25 (10) | 88.68 (16) | 73.24 (9)  | 80.81 (39) | 87.19 (25) |
| 8  | Fac–Zer | 93.94 (9)  | 79.72 (20) | 80.99 (9)  | 83.63 (23) | 80.72 (22) |
| 9  | Fac–Mor | 91.99 (6)  | 72.36 (5)  | 73.23 (6)  | 73.53 (6)  | 73.58 (5)  |
| 10 | Kar–Pix | 90.82 (9)  | 91.57 (14) | 75.61 (9)  | 84.25 (28) | 86.84 (29) |
| 11 | Kar–Zer | 91.73 (9)  | 81.95 (28) | 84.03 (9)  | 85.53 (23) | 82.68 (28) |
| 12 | Kar–Mor | 89.04 (6)  | 74.88 (6)  | 81.35 (6)  | 79.90 (6)  | 75.13 (6)  |
| 13 | Pix–Zer | 90.21 (9)  | 76.02 (20) | 74.61 (9)  | 78.20 (21) | 77.91 (18) |
| 14 | Pix–Mor | 86.99 (6)  | 70.27 (5)  | 65.21 (6)  | 65.57 (6)  | 71.12 (5)  |
| 15 | Zer–Mor | 78.72 (6)  | 70.93 (5)  | 75.84 (6)  | 75.37 (5)  | 72.49 (6)  |

Note: Bold numbers in the original denote the best recognition rate in each case. The same convention applies in the following tables, i.e., Tables 5 and 7–10.
Fig. 1. The average maximal recognition rates of different methods under different numbers of labeled samples on the MFD dataset: (a) Fou–Fac, (b) Fou–Mor, (c) Fac–Kar, (d) Fac–Zer, (e) Pix–Zer, (f) Zer–Mor.
Table 5
The average maximal recognition rates (%) and standard deviations (%) of LPbSMCCA under different numbers of views on the MFD dataset.

| m        | 2     | 3     | 4     | 5     |
| -------- | ----- | ----- | ----- | ----- |
| Accuracy | 90.39 | 93.66 | 95.33 | 95.82 |
| Std      | 8.93  | 1.93  | 1.05  | 0.44  |
feature combinations, are higher than those of SemiCCA and DCCA. The reason may be that, since only a few samples are labeled, the label information incorporated into SemiCCA and DCCA does not help much to improve recognition performance.

Then, we continue to evaluate the recognition performance of the different methods with different numbers of labeled samples. We select 6 feature combinations, and randomly label 10%, 20%, 30%, 40%, and 50% of the training samples in each class, respectively. The average maximal recognition rates of the different methods under these labeling ratios are given in Fig. 1. From Fig. 1, we see that LPbSCCA consistently achieves better performance than the other methods in all 6 combinations. As the number of labeled samples increases, SemiCCA and DCCA achieve increasing recognition performance, and consistently outperform CCA and SPDCCA beyond a certain point. CCA is insensitive to the change in labeled number due to its unsupervised nature. The results in this experiment indicate that LPbSCCA has better recognition rates and robustness than the other methods.

Furthermore, we investigate the recognition performance of LPbSMCCA under different numbers of views. We choose m (m = 2, 3, 4, 5) groups of features from 5 groups (excluding the 6-dimensional Mor feature), and randomly label 20% of the training samples of each class. Table 5 lists the average maximal recognition rates and their corresponding standard deviations of LPbSMCCA under different numbers of views. The average recognition rates of LPbSMCCA under different dimensions are also given in Fig. 2. From Table 5 and Fig. 2, we see that, with the increase of view
0.9
0.8
Recognition Rate
m
1
0.7 m=2 m=3 m=4 m=5
0.6
0.5
0.4
1
2
3
4
5
6
7
8
9
10
Dimension Fig. 2. The average recognition rates of LPbSMCCA under different numbers of views on MFD dataset.
number, the average recognition rates of LPbSMCCA gradually increase, while the corresponding standard deviations, meanwhile, decrease. In a word, since different kinds of features usually reveal complementary characteristics of the same objects, LPSMCCA can obviously achieve better and more stable recognition rates by fusing more kinds of features. 4.1.2. Experiment on CENPARMI handwritten numerals dataset The CENPARMI handwritten Arabic numerals dataset [49], prevalent in the world, is further adopted. The CENPARMI dataset contains 10 categories, i.e., 10 numbers from 0 to 9; and each class has 600 samples. In [52], some preprocessing work has been done and extracted four groups of features, as shown in Table 6. We randomly choose two sets of features as X set, and Y set. There will be 6 (C 24 ¼ 6) data combination modes. For each
1900
X. Shen, Q. Sun / J. Vis. Commun. Image R. 25 (2014) 1894–1904
Table 6
Four features on CENPARMI handwritten numerals dataset.

X(G): 256-dimensional Gabor transformation feature
X(L): 121-dimensional Legendre moment feature
X(P): 36-dimensional Pseudo-Zernike moment feature
X(Z): 30-dimensional Zernike moment feature
combination mode, we randomly choose 200 samples in each class for training and the rest for testing, i.e., 2000 training samples and 4000 testing samples. The experimental setting is the same as that in Section 4.1.1. We report the average maximal recognition rates of different methods, as listed in Table 7. From Table 7, we see that in all 6 combinations LPbSCCA obviously outperforms the other methods, and its superiority is distinct in combinations 2 and 3. SemiCCA and DCCA have nearly the same performances, and both outperform CCA and SPDCCA in most combinations. Similarly to Section 4.1.1, we select all 6 feature combinations and also randomly label 10%, 20%, 30%, 40%, and 50% of the training samples of each class, respectively. Fig. 3 shows the average maximal recognition rates of different methods with different labeled numbers. From Fig. 3, we see that LPbSCCA continues to be superior to the other methods in most cases (except the X(P)–X(Z) feature combination). With the increase of the labeled number, the recognition rates of all
methods improve, and LPbSCCA shows more stable performance than the other methods.

Moreover, similarly to Section 4.1.1, we choose m (m = 2, 3, 4) groups of features and randomly label 20% of the training samples for each class. The recognition results of LPbSMCCA under different numbers of views are summarized in Table 8 and Fig. 4. From Table 8 and Fig. 4, we see that when all 4 views are involved, LPbSMCCA obtains the best recognition rates. This experiment proves again that, as a multi-view learning method, LPbSMCCA can achieve better recognition performance by fusing different kinds of features from more views.

Table 7
The average maximal recognition rates (%) of different methods on CENPARMI dataset.

c  Feature combination  LPbSCCA          CCA              DCCA             SemiCCA          SPDCCA
   X–Y                  Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.
1  X(G)–X(L)            84.65     9      78.63     17     73.41     9      75.86     19     79.61     19
2  X(G)–X(P)            78.16     9      66.17     15     69.91     9      69.98     10     66.97     14
3  X(G)–X(Z)            77.46     9      65.72     16     69.40     9      69.76     14     66.12     15
4  X(L)–X(P)            79.76     9      69.92     21     74.75     9      74.84     15     71.12     20
5  X(L)–X(Z)            79.23     9      68.67     26     74.57     9      74.45     10     69.82     22
6  X(P)–X(Z)            62.43     9      61.77     27     61.12     9      62.07     17     61.83     27

Table 8
The average maximal recognition rates (%) and standard deviations (%) of LPbSMCCA under different numbers of views on MFD dataset.

m         2      3      4      5
Accuracy  90.39  93.66  95.33  95.82
Std       8.93   1.93   1.05   0.44

Fig. 3. The average recognition rates of different methods under different numbers of labeled samples on CENPARMI dataset: (a) X(G)–X(L), (b) X(G)–X(P), (c) X(G)–X(Z), (d) X(L)–X(P), (e) X(L)–X(Z), (f) X(P)–X(Z).

Fig. 4. The average recognition rates of LPbSMCCA under different numbers of views on CENPARMI dataset.

4.2. Experiments on face recognition

4.2.1. Experiment on Yale face dataset
The Yale face dataset [50] contains 165 face images of 15 individuals. There are 11 images per subject, taken under the following facial expressions or configurations: center-light, wearing glasses, happy, left-light, wearing no glasses, normal, right-light, sad, sleepy, surprised, and wink. The original resolution of these images is 100 × 80 pixels. The 11 sample images of one individual are shown in Fig. 5. We use Coiflet and Daubechies wavelet transforms to extract two sets of low-frequency feature vectors from each image. An important reason why we use the low-frequency sub-images from the wavelet transforms is that they contain more shape information than the high-frequency sub-images. After obtaining the two sets of feature vectors, in order to avoid the small-sample-size (SSS) problem, PCA is performed to reduce their dimensions to 40 and 40, respectively. We randomly select 6 images per class for training and the remaining 5 images for testing, i.e., 90 training samples and 75 testing samples in total. For the semi-supervised setting, we randomly label 2, 3, 4, and 5 images per class, respectively. The average maximal recognition rates of different methods with different labeled numbers are listed in Table 9. We also record the average recognition rates of different methods under different dimensions in Fig. 6. From Table 9, we can clearly see that LPbSCCA prominently outperforms the other methods, no matter how many training samples are labeled. Fig. 6 reveals that the recognition rates of LPbSCCA are consistently superior to those of the other methods under nearly all dimensions. This experiment demonstrates again that LPbSCCA has better recognition performance and robustness than the other methods, and thus is more suitable for semi-supervised scenarios.
Fig. 5. Sample images of one individual on Yale face dataset.

Table 9
The average maximal recognition rates (%) of different methods under different numbers of labeled samples on Yale face dataset.

Method    LPbSCCA          CCA              DCCA             SemiCCA          SPDCCA
          Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.
2 label   64.4      14     58.53     28     58.13     14     60.8      16     56.3      35
3 label   74.67     14     62.27     30     68.27     14     68.93     15     62.9      39
4 label   77.6      14     65.6      34     76.13     14     76.93     15     69.1      40
5 label   82.53     14     68.8      34     81.07     14     81.07     15     75.3      28

Fig. 6. The average recognition rates (%) of different methods under all dimensions on Yale face dataset using different numbers of labeled samples: (a) 2, (b) 3, (c) 4, (d) 5.

4.2.2. Experiment on AR face dataset
The AR dataset [53] contains over 4000 face images of 126 individuals (70 men and 56 women), including frontal facial images with different facial expressions, lighting conditions, and occlusions. There are 26 images for each person, taken in two sessions of 13 images each. Here we only use a subset that contains 1680 face images corresponding to 120 persons, where each person has 14 different images taken in two sessions. For computational convenience, each image is manually cropped to 50 × 45 pixels. Some sample images of one individual are shown in Fig. 7. We firstly extract three kinds of features of each image, i.e., raw pixels, local binary patterns (LBP), and wavelet features, as three sets of feature representations, and then apply PCA to reduce their dimensions to 150, 150, and 150, respectively. We randomly select 7 images in each class for training and the remaining 7 images for testing, and then randomly label 2, 4, and 6 images per class, respectively. The maximal recognition rates of LPbSMCCA and other related multi-view DR methods are summarized in Table 10. From Table 10, we see that LPbSMCCA consistently outperforms the other methods under different labeled numbers. The recognition rates of LPbSMCCA_KNN are lower than those of LPbSMCCA, which may indicate that the kNN-graph is more sensitive than the l1-graph. For the other methods, the recognition rates of MVSSDR and GMLDA are slightly lower than those of LPbSMCCA. The recognition rates of LPbSMCCA are obviously superior to those of LDA. The reason may be that LDA separately learns projection directions in each view and thus ignores the correlations across views. MCCA achieves the worst recognition performance, due to its unsupervised essence. The results in this experiment indicate that LPbSMCCA is a powerful method for multi-view learning, especially in semi-supervised scenarios.

Fig. 7. 14 sample images without occlusion of one individual on AR dataset.

Table 10
The average maximal recognition rates (%) of different methods under different numbers of labeled samples on AR face dataset.

Method          2 label          4 label          6 label
                Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.
LDA             53.64     96     74.75     63     80.91     61
MCCA            42.01     150    56.54     150    64.74     150
MVSSDR          58.14     128    75.59     129    81.91     143
GMLDA           55.69     100    75.20     59     85.60     57
LPbSMCCA_KNN    59.52     116    78.50     88     86.85     57
LPbSMCCA        60.44     94     79.60     78     87.16     60

5. Conclusion and future work

In this paper, we proposed a novel two-view semi-supervised learning method called semi-supervised canonical correlation analysis based on label propagation (LPbSCCA). LPbSCCA firstly estimates label information of the unlabeled samples using a sparse representation based label propagation algorithm; then, in order to enhance the discriminative power of features, it seeks projection directions by considering both the correlations between samples of the same class across different views and the compactness of samples of the same class within each view. Furthermore, in order to handle data from multiple (more than two) views, we extend it to a general model named LPbSMCCA. The experiments in this paper demonstrate the effectiveness of the proposed methods. As stated in Theorem 2, the number of projection directions in the proposed methods is limited by c, i.e., the number of total classes. If there are only a few classes, the recognition abilities of the proposed methods may be affected to some extent. How to break this restriction to find more projection directions will be major work ahead. In addition, the kernel extensions of our proposed methods, their theoretical properties, and their performances on nonlinear data will also be our future work.
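The first stage of the proposed pipeline infers soft labels for unlabeled samples by coding each of them over a dictionary built from the labeled samples and combining the given labels with the reconstruction coefficients. The sketch below is a minimal illustration under stated assumptions: the ℓ1 coding problem is solved with plain ISTA, and soft labels are formed by nonnegative, normalized coefficient-weighted voting. The solver and normalization details are ours, not the authors' exact formulation.

```python
import numpy as np

def propagate_labels(X_lab, y_lab, X_unlab, n_classes, lam=0.1, n_iter=200):
    """Sparse-representation label propagation sketch: code each unlabeled
    sample over the dictionary of labeled samples (lasso via ISTA), then mix
    the labeled samples' one-hot labels with the coefficients."""
    D = X_lab.T                          # dictionary: columns = labeled samples
    F = np.eye(n_classes)[y_lab]         # one-hot labels, shape (n_lab, c)
    L = np.linalg.norm(D.T @ D, 2)       # Lipschitz constant of the gradient
    soft = []
    for x in X_unlab:
        w = np.zeros(D.shape[1])
        for _ in range(n_iter):          # ISTA for 0.5||Dw - x||^2 + lam||w||_1
            z = w - D.T @ (D @ w - x) / L
            w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        w = np.maximum(w, 0.0)           # keep nonnegative votes only
        w = w / w.sum() if w.sum() > 0 else np.full_like(w, 1.0 / len(w))
        soft.append(w @ F)               # soft label = coefficient-weighted vote
    return np.array(soft)
```

The resulting soft-label matrix would then feed the probabilistic within-class scatter matrices used in the second, projection-learning stage.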
Acknowledgments

This work is supported by the National Science Foundation of China under Grant No. 61273251. The authors would like to thank the anonymous reviewers for their valuable comments and suggestions that helped greatly improve the quality of this paper.

Appendix A. The proof of Theorem 1

Proof. In order to prove Theorem 1, we firstly define some symbols: $u = (\tilde{S}_w^{(X)})^{1/2}\alpha$, $v = (\tilde{S}_w^{(Y)})^{1/2}\beta$, and $H = (\tilde{S}_w^{(X)})^{-1/2}\,\tilde{C}^{(XY)}\,(\tilde{S}_w^{(Y)})^{-1/2}$. Multiplying the left-hand sides of (11) by $(\tilde{S}_w^{(X)})^{-1/2}$ and $(\tilde{S}_w^{(Y)})^{-1/2}$, respectively, we obtain

$$\begin{cases} (\tilde{S}_w^{(X)})^{-1/2}\,\tilde{C}^{(XY)}\,(\tilde{S}_w^{(Y)})^{-1/2}(\tilde{S}_w^{(Y)})^{1/2}\beta = \lambda\,(\tilde{S}_w^{(X)})^{1/2}\alpha \\ (\tilde{S}_w^{(Y)})^{-1/2}\,\tilde{C}^{(XY)T}\,(\tilde{S}_w^{(X)})^{-1/2}(\tilde{S}_w^{(X)})^{1/2}\alpha = \lambda\,(\tilde{S}_w^{(Y)})^{1/2}\beta \end{cases} \tag{A.1}$$

or

$$Hv = \lambda u, \qquad H^{T}u = \lambda v \tag{A.2}$$

From (A.2), we see that $u$ and $v$ are left and right singular vectors of $H$, respectively. Then, by the theory of singular value decomposition [54], we have

$$\begin{cases} u_i^T u_j = \alpha_i^T\,\bigl((\tilde{S}_w^{(X)})^{1/2}\bigr)^{T}(\tilde{S}_w^{(X)})^{1/2}\,\alpha_j = \alpha_i^T \tilde{S}_w^{(X)} \alpha_j = \delta_{ij} \\ v_i^T v_j = \beta_i^T\,\bigl((\tilde{S}_w^{(Y)})^{1/2}\bigr)^{T}(\tilde{S}_w^{(Y)})^{1/2}\,\beta_j = \beta_i^T \tilde{S}_w^{(Y)} \beta_j = \delta_{ij} \\ u_i^T H v_j = \alpha_i^T\,\bigl((\tilde{S}_w^{(X)})^{1/2}\bigr)^{T}(\tilde{S}_w^{(X)})^{-1/2}\,\tilde{C}^{(XY)}\,(\tilde{S}_w^{(Y)})^{-1/2}(\tilde{S}_w^{(Y)})^{1/2}\,\beta_j = \alpha_i^T \tilde{C}^{(XY)} \beta_j = \lambda_i\delta_{ij} \end{cases} \tag{A.3}$$

Therefore, Theorem 1 is proven. □
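Numerically, the whitening-plus-SVD route of Appendix A can be checked on synthetic data. All matrices below are random stand-ins for the quantities in the proof (regularized so that the inverse square roots exist); they are not the paper's actual scatter estimates.

```python
import numpy as np

# Synthetic stand-ins for Appendix A's matrices (not the paper's estimates).
rng = np.random.default_rng(0)
p, q, N, c = 8, 6, 50, 3
X = rng.standard_normal((p, N))            # view-1 data, p x N
Y = rng.standard_normal((q, N))            # view-2 data, q x N
F = np.eye(c)[rng.integers(0, c, N)].T     # stand-in c x N label matrix
Cxy = X @ (F.T @ F) @ Y.T                  # cross-view matrix, rank <= c
Swx = X @ X.T + 1e-3 * np.eye(p)           # regularized within-class scatters
Swy = Y @ Y.T + 1e-3 * np.eye(q)

def inv_sqrt(S):
    """S^{-1/2} for a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# H = Swx^{-1/2} Cxy Swy^{-1/2}; its singular vectors yield the directions.
H = inv_sqrt(Swx) @ Cxy @ inv_sqrt(Swy)
U, s, Vt = np.linalg.svd(H)
alphas = inv_sqrt(Swx) @ U                 # alpha_i = Swx^{-1/2} u_i
betas = inv_sqrt(Swy) @ Vt.T               # beta_i  = Swy^{-1/2} v_i

# (A.3): the directions are orthonormal under the scatter metrics, and
# Theorem 2's bound rank(H) <= c limits the number of useful pairs.
assert np.allclose(alphas.T @ Swx @ alphas, np.eye(p), atol=1e-6)
assert np.allclose(betas.T @ Swy @ betas, np.eye(q), atol=1e-6)
assert np.sum(s > 1e-8) <= c
```

The last assertion illustrates Theorem 2: no matter how large p and q are, at most c singular values of H are nonzero, so at most c direction pairs carry information.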
Appendix B. The proof of Theorem 2

Proof. From the definition of $H$, $\operatorname{rank}(H) = \operatorname{rank}(\tilde{C}^{(XY)})$ holds, for $(\tilde{S}_w^{(X)})^{-1/2}$ and $(\tilde{S}_w^{(Y)})^{-1/2}$ are nonsingular matrices. Besides, we obtain the following inequality

$$\operatorname{rank}(\tilde{C}^{(XY)}) = \operatorname{rank}(X\tilde{A}Y^{T}) \le \min\{\operatorname{rank}(X), \operatorname{rank}(\tilde{A}), \operatorname{rank}(Y)\} \tag{B.1}$$

where $\operatorname{rank}(X) \le \min\{p, N\}$, $\operatorname{rank}(\tilde{A}) = \operatorname{rank}(\tilde{F}^{(X)}\tilde{F}^{(Y)T}) \le \min\{c, N\}$, and $\operatorname{rank}(Y) \le \min\{q, N\}$. In pattern recognition problems, the class number $c$ is generally less than the sample number $N$, as well as the feature dimensionalities $p$ and $q$. From the above, we have $\operatorname{rank}(H) = \operatorname{rank}(\tilde{C}^{(XY)}) \le c$. From Theorem 1, we see that $\alpha$ and $\beta$ are in one-to-one correspondence with $u$ and $v$, respectively, and the available numbers of $u$ and $v$ are limited by $\operatorname{rank}(H)$. Therefore, we infer that in LPbSCCA there are at most $c$ pairs of projection directions. □

References

[1] C. Xu, D. Tao, C. Xu, A survey on multi-view learning, arXiv preprint arXiv:1304.5634, 2013.
[2] C. Hou, C. Zhang, Y. Wu, F. Nie, Multiple view semi-supervised dimensionality reduction, Pattern Recogn. 43 (2010) 720–730.
[3] X. Shen, Q. Sun, Orthogonal multiset canonical correlation analysis based on fractional-order and its application in multiple feature extraction and recognition, Neural Process. Lett. (2014) 1–16.
[4] X. Chen, S. Chen, H. Xue, X. Zhou, A unified dimensionality reduction framework for semi-paired and semi-supervised multi-view data, Pattern Recogn. 45 (2012) 2005–2018.
[5] Y. Fu, L. Cao, G. Guo, T.S. Huang, Multiple feature fusion by subspace learning, in: Proceedings of the 2008 International Conference on Content-Based Image and Video Retrieval, ACM, 2008, pp. 127–134.
[6] Q.-S. Sun, S.-G. Zeng, Y. Liu, P.-A. Heng, D.-S. Xia, A new method of feature fusion and its application in image recognition, Pattern Recogn. 38 (2005) 2437–2448.
[7] Y.-H. Yuan, Q.-S. Sun, Q. Zhou, D.-S. Xia, A novel multiset integrated canonical correlation analysis framework and its application in feature fusion, Pattern Recogn. 44 (2011) 1031–1040.
[8] J. Yu, D. Tao, Y. Rui, J. Cheng, Pairwise constraints based multiview features fusion for scene classification, Pattern Recogn. 46 (2013) 483–496.
[9] H.-Y. Ha, F.C. Fleites, S.-C. Chen, Content-based multimedia retrieval using feature correlation clustering and fusion, Int. J. Multimed. Data Eng. Manage. (IJMDEM) 4 (2013) 46–64.
[10] W. Liu, D. Tao, Multiview Hessian regularization for image annotation, IEEE Trans. Image Process. 22 (2013) 2676–2687.
[11] W. Liu, D. Tao, J. Cheng, Y. Tang, Multiview Hessian discriminative sparse coding for image annotation, Comput. Vis. Image Und. 118 (2014) 50–60.
[12] T. Diethe, D.R. Hardoon, J. Shawe-Taylor, Multiview Fisher discriminant analysis, in: NIPS Workshop on Learning from Multiple Sources, 2008.
[13] T. Xia, D. Tao, T. Mei, Y. Zhang, Multiview spectral embedding, IEEE Trans. Syst. Man Cybern. B 40 (2010) 1438–1446.
[14] B. Long, P. Yu, Z. Zhang, A general model for multiple view unsupervised learning, in: Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, 2008, pp. 822–833.
[15] X. Shen, Q. Sun, Y. Yuan, A unified multiset canonical correlation analysis framework based on graph embedding for multiple feature extraction, Neurocomputing (2014).
[16] A. Sharma, A. Kumar, H. Daume, D.W. Jacobs, Generalized multiview analysis: a discriminative latent space, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2160–2167.
[17] D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Comput. 16 (2004) 2639–2664.
[18] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936) 321–377.
[19] Q.-S. Sun, Z.-D. Liu, P.-A. Heng, D.-S. Xia, A theorem on the generalized canonical projective vectors, Pattern Recogn. 38 (2005) 449–452.
[20] T. Sun, S. Chen, J. Yang, P. Shi, A novel method of combined feature extraction for recognition, in: Eighth IEEE International Conference on Data Mining (ICDM'08), IEEE, 2008, pp. 1043–1048.
[21] T. Sun, S. Chen, Locality preserving CCA with applications to data visualization and pose estimation, Image Vision Comput. 25 (2007) 531–543.
[22] Y. Peng, D. Zhang, J. Zhang, A new canonical correlation analysis algorithm with local discrimination, Neural Process. Lett. 31 (2010) 1–15.
[23] T.K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, IEEE Trans. Pattern Anal. 29 (2007) 1005–1018.
[24] H. Huang, H. He, Super-resolution method for face recognition using nonlinear mappings on coherent features, IEEE Trans. Neural Netw. 22 (2011) 121–130.
[25] L. Sun, S. Ji, J. Ye, Canonical correlation analysis for multilabel classification: a least-squares formulation, extensions, and analysis, IEEE Trans. Pattern Anal. 33 (2011) 194–204.
[26] D. Chu, L.-Z. Liao, M.K. Ng, X. Zhang, Sparse canonical correlation analysis: new formulation and algorithm, IEEE Trans. Pattern Anal. (2013).
[27] N. Guan, X. Zhang, Z. Luo, L. Lan, Sparse representation based discriminative canonical correlation analysis for face recognition, in: 2012 IEEE 11th International Conference on Machine Learning and Applications (ICMLA), 2012, pp. 51–56.
[28] Y. Peng, D. Zhang, Semi-supervised canonical correlation analysis algorithm, J. Softw. 19 (2008) 2822–2832.
[29] O. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, 2006.
[30] J. Yu, M. Wang, D.C. Tao, Semisupervised multiview distance metric learning for cartoon synthesis, IEEE Trans. Image Process. 21 (2012) 4636–4648.
[31] J. Yu, Y. Rui, B. Chen, Exploiting click constraints and multi-view features for image re-ranking, IEEE Trans. Multimedia 16 (2014) 159–168.
[32] J. Yu, D. Tao, M. Wang, Adaptive hypergraph learning and its application in image classification, IEEE Trans. Image Process. 21 (2012) 3262–3272.
[33] J. Yu, D. Liu, D. Tao, H.S. Seah, Complex object correspondence construction in two-dimensional animation, IEEE Trans. Image Process. 20 (2011) 3257–3269.
[34] J. Yu, D. Tao, J. Li, J. Cheng, Semantic preserving distance metric learning and applications, Inform. Sci. (2014).
[35] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: IEEE 11th International Conference on Computer Vision (ICCV 2007), 2007, pp. 1–7.
[36] M. Sugiyama, T. Ide, S. Nakajima, J. Sese, Semi-supervised local Fisher discriminant analysis for dimensionality reduction, Mach. Learn. 78 (2010) 35–61.
[37] D. Zhang, Z.-H. Zhou, S. Chen, Semi-supervised dimensionality reduction, in: SDM, 2007.
[38] Y. Song, F. Nie, C. Zhang, S. Xiang, A unified framework for semi-supervised dimensionality reduction, Pattern Recogn. 41 (2008) 2789–2799.
[39] X. Zhu, Z. Ghahramani, Learning from labeled and unlabeled data with label propagation, Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
[40] F. Wang, C. Zhang, Label propagation through linear neighborhoods, IEEE Trans. Knowl. Data Eng. 20 (2008) 55–67.
[41] J. Wang, F. Wang, C. Zhang, H.C. Shen, L. Quan, Linear neighborhood propagation and its applications, IEEE Trans. Pattern Anal. 31 (2009) 1600–1615.
[42] F. Nie, S. Xiang, Y. Jia, C. Zhang, Semi-supervised orthogonal discriminant analysis via label propagation, Pattern Recogn. 42 (2009) 2615–2627.
[43] B. Cheng, J. Yang, S. Yan, Y. Fu, T.S. Huang, Learning with ℓ1-graph for image analysis, IEEE Trans. Image Process. 19 (2010) 858–866.
[44] J. Lu, X. Zhou, Y.-P. Tan, Y. Shang, J. Zhou, Cost-sensitive semi-supervised discriminant analysis for face recognition, IEEE Trans. Inform. Forensics Secur. 7 (2012) 944–953.
[45] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. 31 (2009) 210–227.
[46] W. Liu, H. Zhang, D. Tao, Y. Wang, K. Lu, Large-scale paralleled sparse principal component analysis, arXiv e-prints, 2013.
[47] T. Melzer, M. Reiter, H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recogn. 36 (2003) 1961–1971.
[48] C. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA, vol. 55, 1998.
[49] J. Franke, L. Lam, R. Legault, C. Nadal, C. Suen, Experiments with the CENPARMI database combining different classification approaches, in: 3rd International Workshop on Frontiers in Handwriting Recognition, 1993, pp. 305–311.
[50] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. 19 (1997) 711–720.
[51] A.A. Nielsen, Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data, IEEE Trans. Image Process. 11 (2002) 293–305.
[52] Z. Hu, Z. Lou, J. Yang, K. Liu, C. Suen, Handwritten digit recognition based on multi-classifier combination, Chinese J. Comput. 22 (1999) 369–374.
[53] A.M. Martinez, The AR face database, CVC Technical Report 24, 1998.
[54] G.H. Golub, C.F. Van Loan, Matrix Computations, JHU Press, 2012.