J. Vis. Commun. Image R. 25 (2014) 1894–1904
A novel semi-supervised canonical correlation analysis and extensions for multi-view dimensionality reduction XiaoBo Shen, QuanSen Sun ⇑ School of Computer Science and Engineering, Nanjing University of Science & Technology, Nanjing 210094, China
Article history: Received 18 January 2014; Accepted 8 September 2014; Available online 16 September 2014.

Keywords: Canonical correlation analysis; Semi-supervised learning; Label propagation; Sparse representation; Multi-view learning; Feature extraction; Dimensionality reduction; Image recognition
Abstract

Canonical correlation analysis (CCA) is an efficient method for dimensionality reduction on two-view data. However, as an unsupervised learning method, CCA cannot utilize the partial label information available in multi-view semi-supervised scenarios. In this paper, we propose a novel two-view semi-supervised learning method, called semi-supervised canonical correlation analysis based on label propagation (LPbSCCA). LPbSCCA incorporates a new sparse representation based label propagation algorithm to infer label information for unlabeled data. Specifically, it first constructs dictionaries consisting of all labeled samples; it then obtains reconstruction coefficients of the unlabeled samples using a sparse representation technique; finally, by combining the given labels of the labeled samples, it estimates label information for the unlabeled ones. After that, it constructs soft label matrices of all samples and probabilistic within-class scatter matrices in each view. Finally, to enhance the discriminative power of the features, LPbSCCA is formulated to maximize the correlations between samples of the same class across views, while simultaneously minimizing within-class variations in the low-dimensional feature space of each view. Furthermore, we extend a general model, called LPbSMCCA, to handle data from multiple (more than two) views. Extensive experimental results on several well-known datasets demonstrate that the proposed methods achieve better recognition performance and robustness than existing related methods.

Crown Copyright © 2014 Published by Elsevier Inc. All rights reserved.
1. Introduction

In many applications of computer vision and pattern recognition, the same objects can be observed from different viewpoints or by different sensors, generating multiple distinct and heterogeneous features. For example, a video can be represented by a sound feature and an image feature; a webpage can be represented by a text feature and an image feature. Such different representations are referred to as multi-view data [1–4]. Recent studies [1,5–12] show that learning from such multi-view data often leads to better performance than learning from single-view data. However, multi-view data are usually represented as high-dimensional vectors, which cannot be analyzed directly. Thus, dimensionality reduction (DR) for multi-view data is a necessary and important preprocessing step for subsequent tasks. Until now, many multi-view DR (MvDR) methods have been proposed, including unsupervised ones [3,13,14], supervised ones [15,16], and semi-supervised ones [2,8].
⇑ Corresponding author. Fax: +86 25 84318156. E-mail addresses: [email protected] (X. Shen), [email protected] (Q. Sun). http://dx.doi.org/10.1016/j.jvcir.2014.09.004 1047-3203/Crown Copyright © 2014 Published by Elsevier Inc. All rights reserved.
The most typical approach to multi-view dimensionality reduction is canonical correlation analysis (CCA) [17,18], which was proposed by Hotelling in 1936. CCA investigates the linear correlations between two sets of features from the same patterns. It linearly projects the two sets of features into a low-dimensional feature space where they are maximally correlated. CCA and its variants have since received increasing attention [5,6,19–26]. CCA is essentially a linear subspace method, and thus fails to discover nonlinear correlations between two datasets. To remedy this problem, nonlinear extensions, e.g., kernel CCA (KCCA) [17] and locality preserving CCA (LPCCA) [21], were proposed and successfully applied to data visualization and pose estimation. Moreover, sparse extensions of CCA [25,26] were also presented for gene classification and cross-language document retrieval. The aforementioned extensions are unsupervised methods, which may not be suitable for classification tasks. To enhance the discriminative power of CCA features, several supervised variants, e.g., discriminative CCA (DCCA) [20], generalized CCA (GCCA) [19], local discrimination CCA (LDCCA) [22], and sparse representation based discriminative CCA (SPDCCA) [27], were proposed. They incorporate class label information into CCA from different angles, and all improve recognition rates to some extent. Moreover, there are also some semi-supervised extensions [4,28] of CCA.
For example, Peng and Zhang [28] proposed a semi-supervised CCA method (SemiCCA). SemiCCA incorporates must-link and cannot-link constraints on labeled samples into the objective of CCA to improve classification performance in semi-supervised scenarios.

Semi-supervised learning (SSL) [29] is an active research field in pattern recognition and machine learning. The main task of SSL algorithms is to exploit a small amount of labeled data together with a large amount of unlabeled data for training. Among SSL techniques, graph-based approaches [2,8,30–34] form one of the most active directions. SSL techniques have also been incorporated into DR, and many SSL-DR algorithms have been proposed, including single-view ones [35–38] and multi-view ones [2,8]. For example, in multi-view scenarios, Hou et al. [2] developed multiple view semi-supervised dimension reduction (MVSSDR), which learns a consensus pattern using domain knowledge in the form of pairwise constraints. Pairwise constraints based multi-view subspace learning (PC-MSL) [8] aims to learn a unified low-dimensional subspace to effectively fuse multiple features. It takes both the data distribution of labeled and unlabeled samples and user labeling information into consideration, and thus achieves good performance in scene classification. On the other hand, label propagation [39] is a popular family of graph-based SSL methods. It models the whole dataset (both labeled and unlabeled) as a graph, and then propagates label information from the labeled samples to the unlabeled ones through the constructed graph. Several label propagation algorithms have recently been proposed [40–42]. Many graph-based methods construct a k-nearest-neighborhood (KNN) graph and employ a Gaussian function to compute the graph edge weights, so some parameters (i.e., the neighborhood size k and the bandwidth σ) have to be set manually beforehand.
However, recent studies [41,43,44] show that the parameter setting and distance measurement (Euclidean distance) in this approach are very sensitive to noisy data. Recently, inspired by the success of sparse representation [43,45,46], a novel graph, the l1-graph [43,44], has been proposed, which is more robust to noise and adapts the neighborhood of each sample. By introducing this new graph [43,44], we propose in this paper a novel semi-supervised canonical correlation analysis based on label propagation (LPbSCCA). Several advantages of LPbSCCA can be highlighted. (1) LPbSCCA does not require a one-to-one correspondence between samples in different views, and thus can be applied to more general scenarios than SemiCCA [28]. (2) LPbSCCA is free of parameters thanks to the l1-graph, while previous related methods (i.e., SemiCCA [28], S2GCA [4]) have to tune regularization parameters; thus LPbSCCA is more readily applicable. (3) LPbSCCA can be viewed as a general extension of DCCA
Table 1
List of notations.

| Notation | Description |
| --- | --- |
| $X, Y$ | Two groups of feature representations in two-view scenarios |
| $w_i^{(X)}$ | Sparse reconstruction coefficient vector of $x_i$ in view X |
| $\tilde{f}_i^{(X)}$ | Label vector of $x_i$ in view X |
| $\tilde{F}^{(X)}, \tilde{F}^{(Y)}$ | Soft label matrix in view X (Y) |
| $\{X^{(i)} \in \mathbb{R}^{p_i \times N}\}_{i=1}^m$ | m groups of feature representations from N patterns |
| $X_L^{(i)}$ | l labeled samples in the ith view |
| $\tilde{F}^{(i)}$ | Soft label matrix in the ith view |
| $\tilde{A}^{(ij)}$ | Probabilistic correlation matrix between the ith and jth views |
| $\tilde{S}_w^{(i)}$ | Probabilistic within-class scatter matrix in the ith view |
| $W^{(i)}$ | Projection matrix in the ith view |
| $a^{(i)}$ | Projection direction in the ith view |
under the view of probability: LPbSCCA computes correlations through soft labels, whereas the correlations in DCCA are revealed by hard labels (i.e., 0, 1). (4) Based on LPbSCCA, we extend a general model, namely LPbSMCCA, which can simultaneously deal with data from multiple (more than two) views.

The remainder of the paper is organized as follows. In Section 2, we briefly review the basic theory of CCA and DCCA. In Section 3, LPbSCCA and its extended model LPbSMCCA are developed. In Section 4, we experimentally evaluate the proposed methods on several well-known datasets to demonstrate their effectiveness. Finally, concluding remarks and a discussion of future work are given in Section 5. For convenience, Table 1 lists some important notations used in the remainder of this paper.

2. Related work

2.1. Canonical correlation analysis

Given a set of N pair-wise samples $\{x_i, y_i\} \in \mathbb{R}^p \times \mathbb{R}^q$ (i = 1, ..., N), we define two feature representation matrices $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{p \times N}$ and $Y = [y_1, y_2, \ldots, y_N] \in \mathbb{R}^{q \times N}$. Assume that all samples are centered, i.e., $\bar{x} = \sum_{i=1}^N x_i / N = 0$ and $\bar{y} = \sum_{i=1}^N y_i / N = 0$. The aim of canonical correlation analysis (CCA) [17,18] is to find pairs of projection directions, $a \in \mathbb{R}^p$ and $b \in \mathbb{R}^q$, such that the correlations between the canonical variables $z_1 = a^T X$ and $z_2 = b^T Y$ are maximized. The projection pair is obtained by maximizing the following correlation function $\rho$:
$$\max_{a,b}\ \rho(a,b) = \frac{E\left[a^T x y^T b\right]}{\sqrt{E\left[a^T x x^T a\right]}\sqrt{E\left[b^T y y^T b\right]}} = \frac{a^T S_{xy} b}{\sqrt{a^T S_{xx} a}\sqrt{b^T S_{yy} b}}, \qquad \text{s.t. } a^T S_{xx} a = 1,\ b^T S_{yy} b = 1 \tag{1}$$

where $S_{xx} = E[xx^T] = XX^T/N \in \mathbb{R}^{p \times p}$ and $S_{yy} = E[yy^T] = YY^T/N \in \mathbb{R}^{q \times q}$ are the within-set covariance matrices of views X and Y, respectively, and $S_{xy} = E[xy^T] = XY^T/N \in \mathbb{R}^{p \times q}$ is the between-set covariance matrix between views X and Y. With the Lagrange multiplier method, this problem can be equivalently reformulated as two generalized eigenvalue problems [17]. Besides, Melzer et al. [47] also presented an equivalent SVD-based approach to solve the CCA model.

2.2. Discriminative CCA

As an unsupervised multi-view learning method, CCA focuses on the correlations between pair-wise samples from different views but ignores the class information of the samples, which limits its recognition performance. To remedy this shortcoming, Sun et al. [20] proposed discriminative CCA (DCCA) by incorporating class information into the CCA model. DCCA seeks pairs of projection directions, $a \in \mathbb{R}^p$ and $b \in \mathbb{R}^q$, that maximize the within-class correlations between different views (between-class correlations are automatically minimized; see [20] for details). The formulation of DCCA is as follows:
$$\arg\max_{a,b}\ \frac{a^T X A Y^T b}{\sqrt{a^T S_{xx} a}\sqrt{b^T S_{yy} b}} \tag{2}$$

where $S_{xx}$, $S_{yy}$ are the within-set covariance matrices of views X and Y, and $A = \operatorname{diag}(1_{N_1 \times N_1}, 1_{N_2 \times N_2}, \ldots, 1_{N_c \times N_c})$ reflects the within-class correlations between samples from views X and Y; here c is the number of classes, $N_i$ is the number of samples in the ith class with $\sum_{i=1}^c N_i = N$, and $1_{N_i \times N_i}$ is an $N_i \times N_i$ square matrix whose entries all equal 1. Similar to CCA, DCCA can also be transformed into a generalized eigenvalue problem [20]. From (2), we can see that DCCA
breaks the one-to-one correspondence restriction between samples in different views. Thus DCCA can be applied to more general scenarios than CCA.
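As noted above, both CCA and DCCA reduce to a generalized eigenvalue problem. The following is a minimal numerical sketch of the two-view CCA solution of Section 2.1 (our own illustration, not the authors' code); the small `reg` ridge term and the toy data are our additions for numerical stability and demonstration.

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, d=1, reg=1e-6):
    """Solve CCA as the generalized eigenvalue problem of Section 2.1.

    X: p x N, Y: q x N, columns are paired (centered) samples.
    Returns projection matrices Wx (p x d) and Wy (q x d).
    """
    p, N = X.shape
    q = Y.shape[0]
    Sxx = X @ X.T / N + reg * np.eye(p)   # within-set covariance (regularized)
    Syy = Y @ Y.T / N + reg * np.eye(q)
    Sxy = X @ Y.T / N                     # between-set covariance
    # Block problem: [[0, Sxy], [Syx, 0]] v = lambda [[Sxx, 0], [0, Syy]] v
    A = np.block([[np.zeros((p, p)), Sxy], [Sxy.T, np.zeros((q, q))]])
    B = np.block([[Sxx, np.zeros((p, q))], [np.zeros((q, p)), Syy]])
    vals, vecs = eigh(A, B)               # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :d]            # eigenvectors of the d largest eigenvalues
    return top[:p], top[p:]

# Toy usage: two views sharing a latent signal z.
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = np.vstack([z + 0.1 * rng.standard_normal(500), rng.standard_normal(500)])
Y = np.vstack([-z + 0.1 * rng.standard_normal(500), rng.standard_normal(500)])
X -= X.mean(axis=1, keepdims=True)
Y -= Y.mean(axis=1, keepdims=True)
Wx, Wy = cca(X, Y)
r = np.corrcoef(Wx.T @ X, Wy.T @ Y)[0, 1]
print(abs(r))  # close to 1 for strongly correlated views
```

The block-matrix form makes the symmetry of the problem explicit; an equivalent SVD-based route, as in Melzer et al. [47], whitens each view and takes the SVD of the cross-covariance.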
3. Semi-supervised canonical correlation analysis based on label propagation (LPbSCCA) and its extensions

3.1. Soft label matrix estimation

The main goal of label propagation is to propagate label information from labeled samples to unlabeled ones by constructing proper graphs. The widely used k-nearest-neighborhood graph is based on pair-wise Euclidean distance, which is very sensitive to data noise. Another drawback of this approach is that the neighborhood size k must be tuned manually, and selecting the optimal value is a challenge. Recently, inspired by the success of the l1-graph [43], Lu et al. [44] proposed a sparse representation based label propagation algorithm, which is more robust to noise and adapts the neighborhood of each sample. We therefore adopt this approach to infer label information for the unlabeled samples.

Assume that two groups of features from N samples are given as $X = [x_1, \ldots, x_l, x_{l+1}, \ldots, x_N] \in \mathbb{R}^{p \times N}$ and $Y = [y_1, \ldots, y_l, y_{l+1}, \ldots, y_N] \in \mathbb{R}^{q \times N}$, and that samples from views X and Y have a one-to-one correspondence; $\{x_i, y_i\}_{i=1}^{l}$ and $\{x_i, y_i\}_{i=l+1}^{N}$ denote the labeled and unlabeled sample pairs, respectively. The first l pairs can be separately represented as $X_L = [x_1^1, \ldots, x_{l_1}^1, \ldots, x_1^c, \ldots, x_{l_c}^c]$ and $Y_L = [y_1^1, \ldots, y_{l_1}^1, \ldots, y_1^c, \ldots, y_{l_c}^c]$, where $x_i^k, y_i^k$ are the ith sample of the kth class in views X and Y, c is the number of classes, $l_k$ denotes the number of labeled samples in the kth class, and $\sum_{k=1}^c l_k = l$.

We first present the construction of the soft label matrix in view X. Inspired by sparse representation methods [43,45], we use all labeled samples $X_L$ in view X as a dictionary, so that each unlabeled sample can be linearly reconstructed from $X_L$. Specifically, given an unlabeled sample $x_i$ (i = l+1, ..., N), we solve the following optimization problem

$$w_i^{(X)} = \arg\min \left\|w_i^{(X)}\right\|_1 \quad \text{s.t. } \left\|X_L w_i^{(X)} - x_i\right\| \le \varepsilon \tag{3}$$

where $\varepsilon$ is the reconstruction error bound and $w_i^{(X)} \in \mathbb{R}^l$ is the sparse reconstruction coefficient vector of sample $x_i$. This problem can be solved efficiently by standard linear programming using many available toolboxes, such as l1-magic (www.acm.caltech.edu/l1magic), SPGL1 (www.cs.ubc.ca/labs/scl/spgl1/index.html), and SLEP (http://www.public.asu.edu/~jye02/Software/SLEP/index.htm). We define $w_i^{(X)} = [w_{i1}^{(X)}, w_{i2}^{(X)}, \ldots, w_{il}^{(X)}]$, where $w_{ij}^{(X)}$ is the reconstruction coefficient of $x_i$ on $x_j$. The normalized weights are defined as follows

$$u_{ij}^{(X)} = w_{ij}^{(X)} \Big/ \sum_{j=1}^{l} w_{ij}^{(X)} \tag{4}$$

Obviously, $0 \le u_{ij}^{(X)} \le 1$; $u_{ij}^{(X)}$ represents the constructive contribution of $x_j$ to $x_i$. In particular, if $u_{ij}^{(X)} = 1$, we infer that $x_i$ is constructed solely from $x_j$. Thus we can similarly use $u_{ij}^{(X)}$ to reconstruct the label information of $x_i$.

We define the soft label matrix in view X as $\tilde{F}^{(X)} = \left[\tilde{f}_1^{(X)T}, \tilde{f}_2^{(X)T}, \ldots, \tilde{f}_N^{(X)T}\right]^T \in \mathbb{R}^{N \times c}$, where the label vector of $x_i$ is represented as $\tilde{f}_i^{(X)} = \left[\tilde{f}_{i1}^{(X)}, \tilde{f}_{i2}^{(X)}, \ldots, \tilde{f}_{ic}^{(X)}\right] \in \mathbb{R}^c$ (i = 1, ..., N). For an unlabeled sample $x_i$ (i = l+1, ..., N), $\tilde{f}_{ik}^{(X)}$ is defined as

$$\tilde{f}_{ik}^{(X)} = \sum_{x_j \in c_k} u_{ij}^{(X)} \tag{5}$$

For a labeled sample $x_i$ (i = 1, ..., l), $\tilde{f}_{ik}^{(X)}$ is defined as

$$\tilde{f}_{ik}^{(X)} = \begin{cases} 1 & x_i \in c_k \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where k = 1, ..., c. Similarly, we can construct the soft label matrix $\tilde{F}^{(Y)}$ for view Y using the above approach; we omit the details due to space limitations.

3.2. Probabilistic within-class scatter matrix construction

Based on the two soft label matrices $\tilde{F}^{(X)}$ and $\tilde{F}^{(Y)}$, we now construct the probabilistic within-class scatter matrices $\tilde{S}_w^{(X)}, \tilde{S}_w^{(Y)}$ in views X and Y, respectively. They are defined as follows

$$\tilde{S}_w^{(X)} = \sum_{k=1}^{c} \sum_{j=1}^{N} \tilde{f}_{jk}^{(X)} \left(x_j - \tilde{m}_k^{(X)}\right)\left(x_j - \tilde{m}_k^{(X)}\right)^T \tag{7}$$

$$\tilde{S}_w^{(Y)} = \sum_{k=1}^{c} \sum_{j=1}^{N} \tilde{f}_{jk}^{(Y)} \left(y_j - \tilde{m}_k^{(Y)}\right)\left(y_j - \tilde{m}_k^{(Y)}\right)^T \tag{8}$$

where $\tilde{m}_k^{(X)} = \frac{1}{N_k^{(X)}} \sum_{i=1}^{N} \tilde{f}_{ik}^{(X)} x_i$, $\tilde{m}_k^{(Y)} = \frac{1}{N_k^{(Y)}} \sum_{i=1}^{N} \tilde{f}_{ik}^{(Y)} y_i$, $N_k^{(X)} = \sum_{i=1}^{N} \tilde{f}_{ik}^{(X)}$, and $N_k^{(Y)} = \sum_{i=1}^{N} \tilde{f}_{ik}^{(Y)}$ (k = 1, ..., c).

From the above formulations, it is clear that the probabilistic within-class scatter matrix is a general extension of the classical within-class scatter matrix under the view of probability. In particular, if each sample fully belongs to one unique class, i.e., the label vector of each sample is encoded only with {0, 1}, the probabilistic within-class scatter matrix naturally reduces to the classical one.
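To make the soft-label pipeline of Eqs. (3)-(6) concrete, here is a small sketch. Note one deliberate substitution: non-negative least squares (`scipy.optimize.nnls`) stands in for the l1-minimization of Eq. (3), since the paper's setting would use a dedicated l1 solver such as l1-magic, SPGL1, or SLEP. The function name, the uniform fallback for an all-zero coefficient vector, and the toy data are our own assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def soft_labels(XL, labels, XU, c):
    """Soft label matrix estimation, sketching Eqs. (3)-(6).

    XL: p x l labeled samples (dictionary); labels: length-l class indices in [0, c);
    XU: p x u unlabeled samples. Returns the (l+u) x c soft label matrix F.
    """
    l, u = XL.shape[1], XU.shape[1]
    F = np.zeros((l + u, c))
    F[np.arange(l), labels] = 1.0                 # Eq. (6): hard labels for labeled samples
    for i in range(u):
        w, _ = nnls(XL, XU[:, i])                 # stand-in for the l1 problem of Eq. (3)
        s = w.sum()
        # Eq. (4): normalize coefficients (uniform fallback if reconstruction is all-zero)
        uij = w / s if s > 0 else np.full(l, 1.0 / l)
        for k in range(c):
            F[l + i, k] = uij[labels == k].sum()  # Eq. (5): accumulate weights per class
    return F

# Toy usage: two well-separated classes in 2-D.
rng = np.random.default_rng(1)
XL = np.hstack([rng.normal(1, 0.1, (2, 5)), rng.normal(5, 0.1, (2, 5))])
labels = np.array([0] * 5 + [1] * 5)
XU = np.hstack([rng.normal(1, 0.1, (2, 3)), rng.normal(5, 0.1, (2, 3))])
F = soft_labels(XL, labels, XU, c=2)
print(F[10:].round(2))  # soft labels of the unlabeled samples
```

Each row of `F` is non-negative and sums to one, so the unlabeled rows can be read directly as class-membership probabilities, which is exactly what Section 3.2 needs.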
3.3. Formulation and solution of LPbSCCA

In this section, we formulate the optimization problem of LPbSCCA as follows

$$\max_{a,b}\ J_{\mathrm{LPbSCCA}}(a, b) = \frac{a^T X \tilde{A} Y^T b}{\sqrt{a^T \tilde{S}_w^{(X)} a}\sqrt{b^T \tilde{S}_w^{(Y)} b}} \tag{9}$$

where $\tilde{A} = \tilde{F}^{(X)} \tilde{F}^{(Y)T}$ is a probabilistic correlation matrix with $\tilde{A}_{ij} = \tilde{f}_i^{(X)} \tilde{f}_j^{(Y)T}$, $0 \le \tilde{A}_{ij} \le 1$, and $\tilde{S}_w^{(X)}, \tilde{S}_w^{(Y)}$ are the probabilistic within-class scatter matrices defined in Section 3.2.

It is worth highlighting the discriminative information from the following two aspects. (1) Cross-view discrimination. We maximize the correlations of same-class samples across views for discrimination, and this kind of discriminative information is encoded in $\tilde{A}$. $\tilde{A}_{ij}$ can be viewed as the probability that $x_i$ and $y_j$ belong to the same class, and thus can be regarded as a correlation measurement between these two samples. If $\tilde{A}_{ij} = 1$, which means $\tilde{f}_i^{(X)} = \tilde{f}_j^{(Y)}$, we regard $x_i$ and $y_j$ as totally correlated; if $\tilde{A}_{ij} = 0$, which means that $x_i$ and $y_j$ belong to two different classes, we consider these two samples totally uncorrelated. In particular, it is clear that $\tilde{A}$, in the view of probability, can be regarded as a general extension of A in DCCA. (2) Within-view discrimination. In each view, we meanwhile try to minimize within-class variations by introducing the probabilistic within-class scatter matrices. In this case, samples
from the same class will be kept as close as possible in the low-dimensional feature space, which helps to further enhance the discrimination of LPbSCCA.

The objective of LPbSCCA, i.e., (9), is equivalent to the following optimization problem with equality constraints

$$\max_{a,b}\ J_{\mathrm{LPbSCCA}}(a, b) = a^T X \tilde{A} Y^T b \qquad \text{s.t. } a^T \tilde{S}_w^{(X)} a = 1,\ b^T \tilde{S}_w^{(Y)} b = 1 \tag{10}$$

With the Lagrange multiplier method, (10) can be solved as follows

$$X \tilde{A} Y^T b = \lambda \tilde{S}_w^{(X)} a, \qquad Y \tilde{A}^T X^T a = \lambda \tilde{S}_w^{(Y)} b \tag{11}$$

Alternatively, we can combine (11) into a single generalized eigenvalue problem below

$$\begin{pmatrix} 0 & X \tilde{A} Y^T \\ Y \tilde{A}^T X^T & 0 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \lambda \begin{pmatrix} \tilde{S}_w^{(X)} & 0 \\ 0 & \tilde{S}_w^{(Y)} \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} \tag{12}$$

Furthermore, we give two theorems about properties of the solutions of LPbSCCA.

Theorem 1. The projection matrices $W_x = [a_1, \ldots, a_d]$ and $W_y = [b_1, \ldots, b_d]$ satisfy the following property

$$a_i^T \tilde{S}_w^{(X)} a_j = b_i^T \tilde{S}_w^{(Y)} b_j = \delta_{ij}, \qquad a_i^T \tilde{C}^{(XY)} b_j = \delta_{ij} \tag{13}$$

where $\tilde{C}^{(XY)} = X \tilde{A} Y^T$ and $\delta_{ij} = 1$ if $i = j$, $0$ if $i \ne j$, for $i, j = 1, \ldots, d$.

The proof of Theorem 1 is given in Appendix A. From Theorem 1, it is clear that the projection directions a (b) are conjugately orthogonal with each other with respect to the within-class scatter matrix $\tilde{S}_w^{(X)}$ ($\tilde{S}_w^{(Y)}$) in each view.

Theorem 2. In LPbSCCA, we can obtain at most c (the number of total classes) pairs of projection directions.

The proof of Theorem 2 is given in Appendix B.

3.4. Multi-view extension

LPbSCCA only concerns two views, and thus cannot handle data from multiple (more than two) views. However, several feature representations revealing different characteristics of the same objects are prevalent in reality. How to simultaneously deal with any number of groups of features is a practical and fundamental problem [3,7]. To address this problem, we extend LPbSCCA to a general model called LPbSMCCA, which involves multiple (more than two) views simultaneously.

Assume that m ($m \ge 2$) groups of high-dimensional features from the same N patterns are given as $\{X^{(i)} \in \mathbb{R}^{p_i \times N}\}_{i=1}^m$, where $X^{(i)} = [x_1^{(i)}, x_2^{(i)}, \ldots, x_N^{(i)}]$, $x_j^{(i)} \in \mathbb{R}^{p_i}$ (j = 1, 2, ..., N), and $p_i$ denotes the dimensionality of features in the ith view. $\{x_j^{(i)}\}_{j=1}^{l}$ and $\{x_j^{(i)}\}_{j=l+1}^{N}$ are the labeled and unlabeled samples in the ith view, respectively. Then LPbSMCCA can be formulated as below

$$\max\ J_{\mathrm{LPbSMCCA}}\left(a^{(1)}, \ldots, a^{(m)}\right) = \sum_{i=1}^{m} \sum_{j \ne i} a^{(i)T} X^{(i)} \tilde{A}^{(ij)} X^{(j)T} a^{(j)} \qquad \text{s.t. } \sum_{i=1}^{m} a^{(i)T} \tilde{S}_w^{(i)} a^{(i)} = 1 \tag{14}$$

where $\tilde{A}^{(ij)} = \tilde{F}^{(i)} \tilde{F}^{(j)T}$, and $\tilde{S}_w^{(i)}$ denotes the probabilistic within-class scatter matrix in the ith view; both can be computed using the steps in Sections 3.1 and 3.2, respectively. From the above formulation, we see that LPbSCCA can be regarded as a special case of LPbSMCCA: when only two views (m = 2) are involved, LPbSMCCA naturally reduces to LPbSCCA.

Using the Lagrange multiplier method, we transform (14) into the following optimization problem

$$L\left(a^{(1)}, \ldots, a^{(m)}, \lambda\right) = \sum_{i=1}^{m} \sum_{j \ne i} a^{(i)T} X^{(i)} \tilde{A}^{(ij)} X^{(j)T} a^{(j)} - \frac{\lambda}{2}\left(\sum_{i=1}^{m} a^{(i)T} \tilde{S}_w^{(i)} a^{(i)} - 1\right) \tag{15}$$

Setting $\partial L / \partial a^{(i)} = 0$, we obtain

$$\sum_{j \ne i} X^{(i)} \tilde{A}^{(ij)} X^{(j)T} a^{(j)} - \lambda \tilde{S}_w^{(i)} a^{(i)} = 0 \tag{16}$$

or

$$\sum_{j \ne i} X^{(i)} \tilde{A}^{(ij)} X^{(j)T} a^{(j)} = \lambda \tilde{S}_w^{(i)} a^{(i)} \tag{17}$$

After some matrix transformations, (17) can be written as

$$\begin{pmatrix} 0 & X^{(1)} \tilde{A}^{(12)} X^{(2)T} & \cdots & X^{(1)} \tilde{A}^{(1m)} X^{(m)T} \\ X^{(2)} \tilde{A}^{(21)} X^{(1)T} & 0 & \cdots & X^{(2)} \tilde{A}^{(2m)} X^{(m)T} \\ \vdots & \vdots & \ddots & \vdots \\ X^{(m)} \tilde{A}^{(m1)} X^{(1)T} & X^{(m)} \tilde{A}^{(m2)} X^{(2)T} & \cdots & 0 \end{pmatrix} \begin{pmatrix} a^{(1)} \\ a^{(2)} \\ \vdots \\ a^{(m)} \end{pmatrix} = \lambda \begin{pmatrix} \tilde{S}_w^{(1)} & & & \\ & \tilde{S}_w^{(2)} & & \\ & & \ddots & \\ & & & \tilde{S}_w^{(m)} \end{pmatrix} \begin{pmatrix} a^{(1)} \\ a^{(2)} \\ \vdots \\ a^{(m)} \end{pmatrix} \tag{18}$$

Finally, the solution of the LPbSMCCA model (14) is given by the eigenvectors of the generalized eigenvalue problem (18). We select the eigenvectors $\left\{a_k = \left[a_k^{(1)T}, a_k^{(2)T}, \ldots, a_k^{(m)T}\right]^T\right\}_{k=1}^{d}$ corresponding to the first d largest eigenvalues in (18), and form the projection matrix of the ith view as $W^{(i)} = \left[a_1^{(i)}, a_2^{(i)}, \ldots, a_d^{(i)}\right] \in \mathbb{R}^{p_i \times d}$. The detailed flow of LPbSMCCA is presented in Table 2.

Table 2
The flow of LPbSMCCA.

Input: m sets of variables $X^{(i)} = \left[x_1^{(i)}, \ldots, x_l^{(i)}, x_{l+1}^{(i)}, \ldots, x_N^{(i)}\right] \in \mathbb{R}^{p_i \times N}$, i = 1, 2, ..., m.
For i = 1, 2, ..., m:
  (1) Compute the normalized weights according to (3) and (4);
  (2) Construct the soft label matrix $\tilde{F}^{(i)}$ according to (5) and (6);
  (3) Compute the probabilistic within-class scatter matrix $\tilde{S}_w^{(i)}$ according to (7);
End
(4) Compute the matrices $\tilde{A}^{(ij)} = \tilde{F}^{(i)} \tilde{F}^{(j)T}$, i, j = 1, 2, ..., m and $i \ne j$;
(5) Obtain the d projection vector sets $\left\{\left[a_j^{(1)T}, a_j^{(2)T}, \ldots, a_j^{(m)T}\right]^T\right\}_{j=1}^{d}$ by computing the eigenvectors with respect to the d largest eigenvalues of the generalized eigenvalue problem (18);
Output: m projection matrices $\left\{W^{(i)} = \left[a_1^{(i)}, a_2^{(i)}, \ldots, a_d^{(i)}\right]\right\}_{i=1}^{m}$.
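Step (5) above amounts to assembling the block matrices of (18) and solving a single generalized eigenvalue problem. The following sketch is our own illustration, not the authors' implementation; the small ridge term added to the scatter blocks, and the stand-in scatter matrices in the toy usage, are assumptions made for numerical stability.

```python
import numpy as np
from scipy.linalg import eigh

def lpbsmcca(Xs, Fs, Sws, d=2):
    """Assemble and solve the generalized eigenvalue problem (18).

    Xs:  list of m view matrices X^(i), each (p_i, N)
    Fs:  list of m soft label matrices F^(i), each (N, c)
    Sws: list of m probabilistic within-class scatter matrices, each (p_i, p_i)
    Returns the m projection matrices W^(i) of shape (p_i, d).
    """
    m = len(Xs)
    ps = [X.shape[0] for X in Xs]
    P = sum(ps)
    off = np.cumsum([0] + ps)
    C = np.zeros((P, P))                  # left-hand block matrix of (18)
    B = np.zeros((P, P))                  # block-diagonal scatter matrix of (18)
    for i in range(m):
        si, ei = off[i], off[i + 1]
        B[si:ei, si:ei] = Sws[i] + 1e-6 * np.eye(ps[i])   # small ridge for stability
        for j in range(m):
            if i == j:
                continue
            A_ij = Fs[i] @ Fs[j].T        # probabilistic correlation matrix A^(ij)
            sj, ej = off[j], off[j + 1]
            C[si:ei, sj:ej] = Xs[i] @ A_ij @ Xs[j].T
    vals, vecs = eigh(C, B)               # generalized eigenvalues, ascending
    top = vecs[:, ::-1][:, :d]            # eigenvectors of the d largest eigenvalues
    return [top[off[i]:off[i + 1]] for i in range(m)]

# Toy usage with one-hot soft labels (the hard-label special case).
rng = np.random.default_rng(0)
N, c = 20, 2
F = np.zeros((N, c)); F[np.arange(N), np.arange(N) % c] = 1
Xs = [rng.standard_normal((p, N)) for p in (3, 4, 5)]
Sws = [X @ X.T / N for X in Xs]   # stand-in SPD matrices for the scatter terms
Ws = lpbsmcca(Xs, [F] * 3, Sws, d=2)
print([W.shape for W in Ws])
```

The left-hand block matrix is symmetric because $(X^{(i)} \tilde{A}^{(ij)} X^{(j)T})^T = X^{(j)} \tilde{A}^{(ji)} X^{(i)T}$, which is what allows the symmetric generalized solver to be used.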
number of nonzero entries in the reconstruction coefficients) [45]; thus the complexity of this part is $O((t^3 + l)(N - l)m)$. In the second part, solving a generalized eigenvalue problem requires $O(p^3)$ time, where $p = \sum_{i=1}^m p_i$. Thus, the total time complexity of LPbSMCCA is about $O((t^3 + l)(N - l)m + p^3)$.

After the training process of LPbSMCCA, given a new test sample $x_T = (x^{(1)T}, x^{(2)T}, \ldots, x^{(m)T})$, $x^{(i)} \in \mathbb{R}^{p_i}$, we obtain the final low-dimensional feature $z = \sum_{i=1}^m W^{(i)T} x^{(i)}$ using an effective fusion strategy [6,20,22]. Finally, z can be combined with any classifier (e.g., the 1NN classifier) for subsequent classification tasks.

4. Experimental results and analysis

To validate the effectiveness of the two proposed methods, a series of experiments is performed on popular handwritten numeral and face datasets. Specifically, we compare LPbSCCA with several CCA-related methods (i.e., CCA [6], DCCA [20], SemiCCA [28], SPDCCA [27]) on the MFD dataset [48], the CENPARMI dataset [49], and the Yale face dataset [50]. We then further apply LPbSMCCA to the AR face dataset and select several related multi-view DR methods (i.e., LDA [50], multiset CCA (MCCA) [51], GMLDA [16], MVSSDR [2], MVFDA [12], LPbSMCCA_KNN) for comparison. The compared methods are configured as follows.

(1) SemiCCA [28]: the two kinds of supervised information (must-link and cannot-link constraints) are derived from the given labels of the labeled samples in the training set.
(2) SPDCCA [27]: SPDCCA incorporates both sparse representation and discriminative information into CCA simultaneously. We obtain the optimal parameter a by searching the range [0, 1] with step 0.1.
(3) LDA [50]: LDA projection directions are learned separately in each view, and the final low-dimensional features are obtained with the same fusion strategy as LPbSMCCA.
Table 3
The six feature sets of the Multiple Feature dataset in UCI.
- Fac: 216-dimensional profile correlations feature
- Fou: 76-dimensional Fourier coefficients feature
- Kar: 64-dimensional Karhunen-Loève coefficients feature
- Mor: 6-dimensional morphological feature
- Pix: 240-dimensional pixel averages feature
- Zer: 47-dimensional Zernike moments feature
(4) MCCA [51]: MCCA is a generalized extension of CCA that can simultaneously handle several (more than two) views. We first obtain projection directions by MCCA, and then fuse the views in the same way as LPbSMCCA.
(5) GMLDA [16]: GMLDA is a multi-view extension under the Generalized Multiview Analysis (GMA) framework. Following the parameter settings in [16], we fix a = 10, l = 1, c = tr(B1)/tr(B2) in GMLDA.
(6) MVSSDR [2]: in MVSSDR, the two kinds of supervised information are obtained in the same way as for SemiCCA. The two parameters a and b are empirically set to 1 and 20, following the parameter settings in [37].
(7) LPbSMCCA_KNN: to compare with previous label propagation methods, we instead apply the existing k-nearest-neighborhood graph to infer soft labels, and name this modification LPbSMCCA_KNN. We empirically set k to 5.

In all experiments, the nearest neighbor classifier with the Euclidean metric is adopted, and random runs are repeated 10 times independently for the final classification tasks.
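The fusion-then-classify protocol used throughout the experiments ($z = \sum_i W^{(i)T} x^{(i)}$ followed by 1-NN with the Euclidean metric) can be sketched as follows; the function name and the toy data are our own illustration.

```python
import numpy as np

def fuse_and_classify(Ws, train_views, train_labels, test_views):
    """Fuse per-view projections into z = sum_i W^(i)T x^(i), then 1-NN classify."""
    Z_train = sum(W.T @ X for W, X in zip(Ws, train_views))   # (d, n_train)
    Z_test = sum(W.T @ X for W, X in zip(Ws, test_views))     # (d, n_test)
    # Squared Euclidean distance between every training/test pair
    d2 = ((Z_test[:, None, :] - Z_train[:, :, None]) ** 2).sum(axis=0)  # (n_train, n_test)
    return train_labels[np.argmin(d2, axis=0)]                # label of nearest neighbor

# Toy usage: identity projection, one 2-D view, two separated classes.
Ws = [np.eye(2)]
train = [np.array([[0.0, 0.0, 5.0, 5.0], [0.0, 1.0, 5.0, 6.0]])]
labels = np.array([0, 0, 1, 1])
test = [np.array([[0.1, 5.1], [0.2, 5.2]])]
pred = fuse_and_classify(Ws, train, labels, test)
print(pred)  # [0 1]
```

Summing the per-view projections, rather than concatenating them, keeps the fused feature at dimension d regardless of how many views are combined.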
4.1. Experiments on handwritten numerals datasets

4.1.1. Experiment on the Multiple Feature dataset in UCI
The Multiple Feature dataset (MFD) in UCI [48], concerning handwritten numerals, is adopted for this experiment. As part of the UCI machine learning repository, MFD includes 10 classes of handwritten numerals, i.e., the 10 digits from 0 to 9. Each class has 200 examples, so the sample size is 2000 in total. These digit characters are represented in terms of the six feature sets shown in Table 3. We randomly choose two sets of features as the X set and Y set, giving 15 ($C_6^2 = 15$) data combination modes. For each combination mode, we randomly choose 100 samples per class for training and use the rest for testing, i.e., 1000 training samples and 1000 testing samples. To construct a semi-supervised learning setting, we randomly label 20% of the training samples of each class, i.e., 20 labeled and 80 unlabeled training samples per class. The average maximal recognition rates of the different methods are listed in Table 4.

From Table 4, we can see that LPbSCCA is distinctly superior to the other existing methods. Specifically, over all 15 combinations, LPbSCCA achieves the best recognition rate 14 times, and CCA once. Especially in combinations 1, 3, 8, 9, 13, and 14, the maximal recognition rates of LPbSCCA exceed those of the other methods by about 10%. Moreover, the recognition rates of CCA, in several
Table 4
The average maximal recognition rates (%) of different methods on the MFD dataset.

| #  | Feature combination (X–Y) | LPbSCCA Acc. (Dim.) | CCA Acc. (Dim.) | DCCA Acc. (Dim.) | SemiCCA Acc. (Dim.) | SPDCCA Acc. (Dim.) |
| -- | ------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| 1  | Fou–Fac | 93.47 (9)  | 82.69 (14) | 79.08 (9)  | 82.41 (18) | 83.58 (12) |
| 2  | Fou–Kar | 91.22 (9)  | 83.68 (14) | 83.03 (9)  | 85.43 (16) | 84.18 (14) |
| 3  | Fou–Pix | 89.99 (9)  | 78.40 (10) | 73.30 (9)  | 75.91 (15) | 79.18 (10) |
| 4  | Fou–Zer | 82.23 (8)  | 79.92 (12) | 76.05 (8)  | 77.50 (14) | 80.14 (12) |
| 5  | Fou–Mor | 78.65 (6)  | 72.62 (5)  | 73.46 (6)  | 74.23 (6)  | 73.15 (6)  |
| 6  | Fac–Kar | 95.05 (9)  | 91.33 (16) | 83.09 (9)  | 88.67 (26) | 90.82 (18) |
| 7  | Fac–Pix | 94.25 (10) | 88.68 (16) | 73.24 (9)  | 80.81 (39) | 87.19 (25) |
| 8  | Fac–Zer | 93.94 (9)  | 79.72 (20) | 80.99 (9)  | 83.63 (23) | 80.72 (22) |
| 9  | Fac–Mor | 91.99 (6)  | 72.36 (5)  | 73.23 (6)  | 73.53 (6)  | 73.58 (5)  |
| 10 | Kar–Pix | 90.82 (9)  | 91.57 (14) | 75.61 (9)  | 84.25 (28) | 86.84 (29) |
| 11 | Kar–Zer | 91.73 (9)  | 81.95 (28) | 84.03 (9)  | 85.53 (23) | 82.68 (28) |
| 12 | Kar–Mor | 89.04 (6)  | 74.88 (6)  | 81.35 (6)  | 79.90 (6)  | 75.13 (6)  |
| 13 | Pix–Zer | 90.21 (9)  | 76.02 (20) | 74.61 (9)  | 78.20 (21) | 77.91 (18) |
| 14 | Pix–Mor | 86.99 (6)  | 70.27 (5)  | 65.21 (6)  | 65.57 (6)  | 71.12 (5)  |
| 15 | Zer–Mor | 78.72 (6)  | 70.93 (5)  | 75.84 (6)  | 75.37 (5)  | 72.49 (6)  |

Note: Bold numbers in the original denote the best recognition rate in each case. The same convention applies in the following tables, i.e., Tables 5 and 7–10.
Fig. 1. The average maximal recognition rates of different methods under different numbers of labeled samples on the MFD dataset: (a) Fou–Fac, (b) Fou–Mor, (c) Fac–Kar, (d) Fac–Zer, (e) Pix–Zer, (f) Zer–Mor.
Table 5
The average maximal recognition rates (%) and standard deviations (%) of LPbSMCCA under different numbers of views on the MFD dataset.

| m        | 2     | 3     | 4     | 5     |
| -------- | ----- | ----- | ----- | ----- |
| Accuracy | 90.39 | 93.66 | 95.33 | 95.82 |
| Std      | 8.93  | 1.93  | 1.05  | 0.44  |
feature combinations, are higher than those of SemiCCA and DCCA. The reason may be that, since only a few samples are labeled, the label information incorporated into SemiCCA and DCCA does not help much to improve recognition performance.

Then, we continue to evaluate the recognition performance of the different methods with different numbers of labeled samples. We select 6 feature combinations, and randomly label 10%, 20%, 30%, 40%, and 50% of the training samples in each class, respectively. The average maximal recognition rates of the different methods under these labeling ratios are given in Fig. 1. From Fig. 1, we see that LPbSCCA consistently achieves better performance than the other methods in all 6 combinations. As the number of labeled samples increases, SemiCCA and DCCA achieve increasing recognition performance, and consistently outperform CCA and SPDCCA beyond a certain point. CCA is insensitive to the change in labeled number due to its unsupervised nature. The results in this experiment indicate that LPbSCCA has better recognition rates and robustness than the other methods.

Furthermore, we investigate the recognition performance of LPbSMCCA under different numbers of views. We choose m (m = 2, 3, 4, 5) groups of features from 5 groups (excluding the 6-dimensional Mor feature), and randomly label 20% of the training samples of each class. Table 5 lists the average maximal recognition rates and their corresponding standard deviations of LPbSMCCA under different numbers of views. The average recognition rates of LPbSMCCA under different dimensions are also given in Fig. 2. From Table 5 and Fig. 2, we see that, with the increase of view
0.9
0.8
Recognition Rate
m
1
0.7 m=2 m=3 m=4 m=5
0.6
0.5
0.4
1
2
3
4
5
6
7
8
9
10
Dimension Fig. 2. The average recognition rates of LPbSMCCA under different numbers of views on MFD dataset.
number, the average recognition rates of LPbSMCCA gradually increase, while the corresponding standard deviations, meanwhile, decrease. In a word, since different kinds of features usually reveal complementary characteristics of the same objects, LPSMCCA can obviously achieve better and more stable recognition rates by fusing more kinds of features. 4.1.2. Experiment on CENPARMI handwritten numerals dataset The CENPARMI handwritten Arabic numerals dataset [49], prevalent in the world, is further adopted. The CENPARMI dataset contains 10 categories, i.e., 10 numbers from 0 to 9; and each class has 600 samples. In [52], some preprocessing work has been done and extracted four groups of features, as shown in Table 6. We randomly choose two sets of features as X set, and Y set. There will be 6 (C 24 ¼ 6) data combination modes. For each
1900
X. Shen, Q. Sun / J. Vis. Commun. Image R. 25 (2014) 1894–1904
Table 6
Four features on CENPARMI handwritten numerals dataset.

X(G): 256-dimensional Gabor transformation feature
X(L): 121-dimensional Legendre moment feature
X(P): 36-dimensional Pseudo-Zernike moment feature
X(Z): 30-dimensional Zernike moment feature
combination mode, we randomly choose 200 samples in each class for training and the rest for testing, i.e., 2000 training samples and 4000 testing samples. The experimental setting is the same as that in Section 4.1.1. We report the average maximal recognition rates of different methods, as listed in Table 7. From Table 7, we see that in all 6 combinations LPbSCCA obviously outperforms the other methods, and its superiority is distinct in combinations 2 and 3. SemiCCA and DCCA have nearly the same performances, and both outperform CCA and SPDCCA in most combinations. Similarly to Section 4.1.1, we select all 6 feature combinations and also randomly label 10%, 20%, 30%, 40%, and 50% of the training samples of each class, respectively. Fig. 3 shows the average maximal recognition rates of different methods with different labeled numbers. From Fig. 3, we see that LPbSCCA continues to be superior to the other methods in most cases (except the X(P)–X(Z) feature combination). With the increase of the labeled number, the recognition rates of all
methods improve, and LPbSCCA shows more stable performance than the other methods.

Moreover, similarly to Section 4.1.1, we choose m (m = 2, 3, 4) groups of features and randomly label 20% of the training samples for each class. The recognition results of LPbSMCCA under different numbers of views are summarized in Table 8 and Fig. 4. From Table 8 and Fig. 4, we see that when all 4 views are involved, LPbSMCCA obtains the best recognition rates. This experiment proves again that, as a multi-view learning method, LPbSMCCA can achieve better recognition performance by fusing different kinds of features from more views.

Table 7
The average maximal recognition rates (%) of different methods on CENPARMI dataset.

c  Feature combination  LPbSCCA          CCA              DCCA             SemiCCA          SPDCCA
   X–Y                  Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.
1  X(G)–X(L)            84.65     9      78.63     17     73.41     9      75.86     19     79.61     19
2  X(G)–X(P)            78.16     9      66.17     15     69.91     9      69.98     10     66.97     14
3  X(G)–X(Z)            77.46     9      65.72     16     69.40     9      69.76     14     66.12     15
4  X(L)–X(P)            79.76     9      69.92     21     74.75     9      74.84     15     71.12     20
5  X(L)–X(Z)            79.23     9      68.67     26     74.57     9      74.45     10     69.82     22
6  X(P)–X(Z)            62.43     9      61.77     27     61.12     9      62.07     17     61.83     27

Table 8
The average maximal recognition rates (%) and standard deviations (%) of LPbSMCCA under different numbers of views on MFD dataset.

m         2      3      4      5
Accuracy  90.39  93.66  95.33  95.82
Std       8.93   1.93   1.05   0.44

Fig. 3. The average recognition rates of different methods under different numbers of labeled samples on CENPARMI dataset: (a) X(G)–X(L), (b) X(G)–X(P), (c) X(G)–X(Z), (d) X(L)–X(P), (e) X(L)–X(Z), (f) X(P)–X(Z).

Fig. 4. The average recognition rates of LPbSMCCA under different numbers of views on CENPARMI dataset.

4.2. Experiments on face recognition

4.2.1. Experiment on Yale face dataset
The Yale face dataset [50] contains 165 face images of 15 individuals. There are 11 images per subject, taken under the following facial expressions or configurations: center-light, wearing glasses, happy, left-light, wearing no glasses, normal, right-light, sad, sleepy, surprised, and wink. The original resolution of these images is 100 × 80 pixels. The 11 sample images of one individual are shown in Fig. 5. We use Coiflet and Daubechies wavelet transforms to extract two sets of low-frequency feature vectors from each image. An important reason why we use the low-frequency sub-images from the wavelet transforms is that they contain more shape information than the high-frequency sub-images. After obtaining the two sets of feature vectors, in order to avoid the small-sample-size (SSS) problem, PCA is performed to reduce their dimensions to 40 and 40, respectively. We randomly select 6 images per class for training and the remaining 5 images for testing, i.e., 90 training samples and 75 testing samples in total. For the semi-supervised setting, we randomly label 2, 3, 4, and 5 images per class, respectively. The average maximal recognition rates of different methods with different labeled numbers are listed in Table 9. We also record the average recognition rates of different methods under different dimensions in Fig. 6. From Table 9, we can clearly see that LPbSCCA prominently outperforms the other methods, no matter how many training samples are labeled. Fig. 6 reveals that the recognition rates of LPbSCCA are consistently superior to those of the other methods under nearly all dimensions. This experiment demonstrates again that LPbSCCA has better recognition performance and robustness than the other methods, and thus is more suitable for semi-supervised scenarios.
Fig. 5. Sample images of one individual on Yale face dataset.

Table 9
The average maximal recognition rates (%) of different methods under different numbers of labeled samples on Yale face dataset.

Method    LPbSCCA          CCA              DCCA             SemiCCA          SPDCCA
          Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.
2 label   64.4      14     58.53     28     58.13     14     60.8      16     56.3      35
3 label   74.67     14     62.27     30     68.27     14     68.93     15     62.9      39
4 label   77.6      14     65.6      34     76.13     14     76.93     15     69.1      40
5 label   82.53     14     68.8      34     81.07     14     81.07     15     75.3      28

Fig. 6. The average recognition rates (%) of different methods under all dimensions on Yale face dataset using different numbers of labeled samples: (a) 2, (b) 3, (c) 4, (d) 5.

4.2.2. Experiment on AR face dataset
The AR dataset [53] contains over 4000 face images of 126 individuals (70 men and 56 women), including frontal facial images with different facial expressions, lighting conditions, and occlusions. There are 26 images for each person, taken in two sessions of 13 images each. Here we only use a subset that contains 1680 face images corresponding to 120 persons, where each person has 14 different images taken in two sessions. For computational convenience, each image is manually cropped to 50 × 45 pixels. Some sample images of one individual are shown in Fig. 7. We firstly extract three kinds of features of each image, i.e., raw pixels, local binary patterns (LBP), and wavelet features, as three sets of feature representations, and then apply PCA to reduce their dimensions to 150, 150, and 150, respectively. We randomly select 7 images in each class for training and the remaining 7 images for testing, and then randomly label 2, 4, and 6 images per class, respectively. The maximal recognition rates of LPbSMCCA and other related multi-view DR methods are summarized in Table 10. From Table 10, we see that LPbSMCCA consistently outperforms the other methods under different labeled numbers. The recognition rates of LPbSMCCA_KNN are lower than those of LPbSMCCA, which may indicate that the kNN-graph is more sensitive than the l1-graph. For the other methods, the recognition rates of MVSSDR and GMLDA are slightly lower than those of LPbSMCCA. The recognition rates of LPbSMCCA are obviously superior to those of LDA. The reason may be that LDA separately learns projection directions in each view and thus ignores the correlations across views. MCCA achieves the worst recognition performance, due to its unsupervised essence. The results in this experiment indicate that LPbSMCCA is a powerful method for multi-view learning, especially in semi-supervised scenarios.

Fig. 7. 14 sample images without occlusion of one individual on AR dataset.

Table 10
The average maximal recognition rates (%) of different methods under different numbers of labeled samples on AR face dataset.

Method          2 label          4 label          6 label
                Accuracy  Dim.   Accuracy  Dim.   Accuracy  Dim.
LDA             53.64     96     74.75     63     80.91     61
MCCA            42.01     150    56.54     150    64.74     150
MVSSDR          58.14     128    75.59     129    81.91     143
GMLDA           55.69     100    75.20     59     85.60     57
LPbSMCCA_KNN    59.52     116    78.50     88     86.85     57
LPbSMCCA        60.44     94     79.60     78     87.16     60

5. Conclusion and future work

In this paper, we proposed a novel two-view semi-supervised learning method called semi-supervised canonical correlation analysis based on label propagation (LPbSCCA). LPbSCCA firstly estimates label information of the unlabeled samples using a sparse representation based label propagation algorithm; then, in order to enhance the discriminative power of features, it seeks projection directions by considering both the correlations between samples of the same class across different views and the compactness of samples of the same class within each view. Furthermore, in order to handle data from multiple (more than two) views, we extend it to a general model named LPbSMCCA. The experiments in this paper demonstrate the effectiveness of the proposed methods. As stated in Theorem 2, the number of projection directions in the proposed methods is limited by c, i.e., the number of total classes. If there are only a few classes, the recognition abilities of the proposed methods may be affected to some extent. How to break this restriction to find more projection directions will be major work ahead. In addition, the kernel extensions of our proposed methods, their theoretical properties, and their performances on nonlinear data will also be our future work.
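The first stage of the proposed pipeline infers soft labels for unlabeled samples by coding each of them over a dictionary built from the labeled samples and combining the given labels with the reconstruction coefficients. The sketch below is a minimal illustration under stated assumptions: the ℓ1 coding problem is solved with plain ISTA, and soft labels are formed by nonnegative, normalized coefficient-weighted voting. The solver and normalization details are ours, not the authors' exact formulation.

```python
import numpy as np

def propagate_labels(X_lab, y_lab, X_unlab, n_classes, lam=0.1, n_iter=200):
    """Sparse-representation label propagation sketch: code each unlabeled
    sample over the dictionary of labeled samples (lasso via ISTA), then mix
    the labeled samples' one-hot labels with the coefficients."""
    D = X_lab.T                          # dictionary: columns = labeled samples
    F = np.eye(n_classes)[y_lab]         # one-hot labels, shape (n_lab, c)
    L = np.linalg.norm(D.T @ D, 2)       # Lipschitz constant of the gradient
    soft = []
    for x in X_unlab:
        w = np.zeros(D.shape[1])
        for _ in range(n_iter):          # ISTA for 0.5||Dw - x||^2 + lam||w||_1
            z = w - D.T @ (D @ w - x) / L
            w = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
        w = np.maximum(w, 0.0)           # keep nonnegative votes only
        w = w / w.sum() if w.sum() > 0 else np.full_like(w, 1.0 / len(w))
        soft.append(w @ F)               # soft label = coefficient-weighted vote
    return np.array(soft)
```

The resulting soft-label matrix would then feed the probabilistic within-class scatter matrices used in the second, projection-learning stage.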
Acknowledgments

This work is supported by the National Science Foundation of China under Grant No. 61273251. The authors would like to thank the anonymous reviewers for their valuable comments and suggestions that helped greatly improve the quality of this paper.

Appendix A. The proof of Theorem 1

Proof. In order to prove Theorem 1, we firstly define some symbols: $u = (\tilde{S}_w^{(X)})^{1/2}\alpha$, $v = (\tilde{S}_w^{(Y)})^{1/2}\beta$, and $H = (\tilde{S}_w^{(X)})^{-1/2}\,\tilde{C}^{(XY)}\,(\tilde{S}_w^{(Y)})^{-1/2}$. Multiplying the left-hand sides of (11) by $(\tilde{S}_w^{(X)})^{-1/2}$ and $(\tilde{S}_w^{(Y)})^{-1/2}$, respectively, we obtain

$$\begin{cases} (\tilde{S}_w^{(X)})^{-1/2}\,\tilde{C}^{(XY)}\,(\tilde{S}_w^{(Y)})^{-1/2}(\tilde{S}_w^{(Y)})^{1/2}\beta = \lambda\,(\tilde{S}_w^{(X)})^{1/2}\alpha \\ (\tilde{S}_w^{(Y)})^{-1/2}\,\tilde{C}^{(XY)T}\,(\tilde{S}_w^{(X)})^{-1/2}(\tilde{S}_w^{(X)})^{1/2}\alpha = \lambda\,(\tilde{S}_w^{(Y)})^{1/2}\beta \end{cases} \tag{A.1}$$

or

$$Hv = \lambda u, \qquad H^{T}u = \lambda v \tag{A.2}$$

From (A.2), we see that $u$ and $v$ are left and right singular vectors of $H$, respectively. Then, by the theory of singular value decomposition [54], we have

$$\begin{cases} u_i^T u_j = \alpha_i^T\,\bigl((\tilde{S}_w^{(X)})^{1/2}\bigr)^{T}(\tilde{S}_w^{(X)})^{1/2}\,\alpha_j = \alpha_i^T \tilde{S}_w^{(X)} \alpha_j = \delta_{ij} \\ v_i^T v_j = \beta_i^T\,\bigl((\tilde{S}_w^{(Y)})^{1/2}\bigr)^{T}(\tilde{S}_w^{(Y)})^{1/2}\,\beta_j = \beta_i^T \tilde{S}_w^{(Y)} \beta_j = \delta_{ij} \\ u_i^T H v_j = \alpha_i^T\,\bigl((\tilde{S}_w^{(X)})^{1/2}\bigr)^{T}(\tilde{S}_w^{(X)})^{-1/2}\,\tilde{C}^{(XY)}\,(\tilde{S}_w^{(Y)})^{-1/2}(\tilde{S}_w^{(Y)})^{1/2}\,\beta_j = \alpha_i^T \tilde{C}^{(XY)} \beta_j = \lambda_i\delta_{ij} \end{cases} \tag{A.3}$$

Therefore, Theorem 1 is proven. □
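Numerically, the whitening-plus-SVD route of Appendix A can be checked on synthetic data. All matrices below are random stand-ins for the quantities in the proof (regularized so that the inverse square roots exist); they are not the paper's actual scatter estimates.

```python
import numpy as np

# Synthetic stand-ins for Appendix A's matrices (not the paper's estimates).
rng = np.random.default_rng(0)
p, q, N, c = 8, 6, 50, 3
X = rng.standard_normal((p, N))            # view-1 data, p x N
Y = rng.standard_normal((q, N))            # view-2 data, q x N
F = np.eye(c)[rng.integers(0, c, N)].T     # stand-in c x N label matrix
Cxy = X @ (F.T @ F) @ Y.T                  # cross-view matrix, rank <= c
Swx = X @ X.T + 1e-3 * np.eye(p)           # regularized within-class scatters
Swy = Y @ Y.T + 1e-3 * np.eye(q)

def inv_sqrt(S):
    """S^{-1/2} for a symmetric positive definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# H = Swx^{-1/2} Cxy Swy^{-1/2}; its singular vectors yield the directions.
H = inv_sqrt(Swx) @ Cxy @ inv_sqrt(Swy)
U, s, Vt = np.linalg.svd(H)
alphas = inv_sqrt(Swx) @ U                 # alpha_i = Swx^{-1/2} u_i
betas = inv_sqrt(Swy) @ Vt.T               # beta_i  = Swy^{-1/2} v_i

# (A.3): the directions are orthonormal under the scatter metrics, and
# Theorem 2's bound rank(H) <= c limits the number of useful pairs.
assert np.allclose(alphas.T @ Swx @ alphas, np.eye(p), atol=1e-6)
assert np.allclose(betas.T @ Swy @ betas, np.eye(q), atol=1e-6)
assert np.sum(s > 1e-8) <= c
```

The last assertion illustrates Theorem 2: no matter how large p and q are, at most c singular values of H are nonzero, so at most c direction pairs carry information.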
Appendix B. The proof of Theorem 2

Proof. From the definition of $H$, $\operatorname{rank}(H) = \operatorname{rank}(\tilde{C}^{(XY)})$ holds, for $(\tilde{S}_w^{(X)})^{-1/2}$ and $(\tilde{S}_w^{(Y)})^{-1/2}$ are nonsingular matrices. Besides, we obtain the following inequality

$$\operatorname{rank}(\tilde{C}^{(XY)}) = \operatorname{rank}(X\tilde{A}Y^{T}) \le \min\{\operatorname{rank}(X), \operatorname{rank}(\tilde{A}), \operatorname{rank}(Y)\} \tag{B.1}$$

where $\operatorname{rank}(X) \le \min\{p, N\}$, $\operatorname{rank}(\tilde{A}) = \operatorname{rank}(\tilde{F}^{(X)}\tilde{F}^{(Y)T}) \le \min\{c, N\}$, and $\operatorname{rank}(Y) \le \min\{q, N\}$. In pattern recognition problems, the class number $c$ is generally less than the sample number $N$, as well as the feature dimensionalities $p$ and $q$. From the above, we have $\operatorname{rank}(H) = \operatorname{rank}(\tilde{C}^{(XY)}) \le c$. From Theorem 1, we see that $\alpha$ and $\beta$ are in one-to-one correspondence with $u$ and $v$, respectively, and the available numbers of $u$ and $v$ are limited by $\operatorname{rank}(H)$. Therefore, we infer that in LPbSCCA there are at most $c$ pairs of projection directions. □

References

[1] C. Xu, D. Tao, C. Xu, A survey on multi-view learning, arXiv preprint arXiv:1304.5634, 2013.
[2] C. Hou, C. Zhang, Y. Wu, F. Nie, Multiple view semi-supervised dimensionality reduction, Pattern Recogn. 43 (2010) 720–730.
[3] X. Shen, Q. Sun, Orthogonal multiset canonical correlation analysis based on fractional-order and its application in multiple feature extraction and recognition, Neural Process. Lett. (2014) 1–16.
[4] X. Chen, S. Chen, H. Xue, X. Zhou, A unified dimensionality reduction framework for semi-paired and semi-supervised multi-view data, Pattern Recogn. 45 (2012) 2005–2018.
[5] Y. Fu, L. Cao, G. Guo, T.S. Huang, Multiple feature fusion by subspace learning, in: Proceedings of the 2008 International Conference on Content-Based Image and Video Retrieval, ACM, 2008, pp. 127–134.
[6] Q.-S. Sun, S.-G. Zeng, Y. Liu, P.-A. Heng, D.-S. Xia, A new method of feature fusion and its application in image recognition, Pattern Recogn. 38 (2005) 2437–2448.
[7] Y.-H. Yuan, Q.-S. Sun, Q. Zhou, D.-S. Xia, A novel multiset integrated canonical correlation analysis framework and its application in feature fusion, Pattern Recogn. 44 (2011) 1031–1040.
[8] J. Yu, D. Tao, Y. Rui, J. Cheng, Pairwise constraints based multiview features fusion for scene classification, Pattern Recogn. 46 (2013) 483–496.
[9] H.-Y. Ha, F.C. Fleites, S.-C. Chen, Content-based multimedia retrieval using feature correlation clustering and fusion, Int. J. Multimed. Data Eng. Manage. (IJMDEM) 4 (2013) 46–64.
[10] W. Liu, D. Tao, Multiview Hessian regularization for image annotation, IEEE Trans. Image Process. 22 (2013) 2676–2687.
[11] W. Liu, D. Tao, J. Cheng, Y. Tang, Multiview Hessian discriminative sparse coding for image annotation, Comput. Vis. Image Und. 118 (2014) 50–60.
[12] T. Diethe, D.R. Hardoon, J. Shawe-Taylor, Multiview Fisher discriminant analysis, in: NIPS Workshop on Learning from Multiple Sources, 2008.
[13] T. Xia, D. Tao, T. Mei, Y. Zhang, Multiview spectral embedding, IEEE Trans. Syst. Man Cybern. B 40 (2010) 1438–1446.
[14] B. Long, P. Yu, Z. Zhang, A general model for multiple view unsupervised learning, in: Proceedings of the 2008 SIAM International Conference on Data Mining, SIAM, 2008, pp. 822–833.
[15] X. Shen, Q. Sun, Y. Yuan, A unified multiset canonical correlation analysis framework based on graph embedding for multiple feature extraction, Neurocomputing (2014).
[16] A. Sharma, A. Kumar, H. Daume, D.W. Jacobs, Generalized multiview analysis: a discriminative latent space, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2160–2167.
[17] D.R. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Comput. 16 (2004) 2639–2664.
[18] H. Hotelling, Relations between two sets of variates, Biometrika 28 (1936) 321–377.
[19] Q.-S. Sun, Z.-D. Liu, P.-A. Heng, D.-S. Xia, A theorem on the generalized canonical projective vectors, Pattern Recogn. 38 (2005) 449–452.
[20] T. Sun, S. Chen, J. Yang, P. Shi, A novel method of combined feature extraction for recognition, in: Eighth IEEE International Conference on Data Mining (ICDM'08), IEEE, 2008, pp. 1043–1048.
[21] T. Sun, S. Chen, Locality preserving CCA with applications to data visualization and pose estimation, Image Vision Comput. 25 (2007) 531–543.
[22] Y. Peng, D. Zhang, J. Zhang, A new canonical correlation analysis algorithm with local discrimination, Neural Process. Lett. 31 (2010) 1–15.
[23] T.K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, IEEE Trans. Pattern Anal. 29 (2007) 1005–1018.
[24] H. Huang, H. He, Super-resolution method for face recognition using nonlinear mappings on coherent features, IEEE Trans. Neural Netw. 22 (2011) 121–130.
[25] L. Sun, S. Ji, J. Ye, Canonical correlation analysis for multilabel classification: a least-squares formulation, extensions, and analysis, IEEE Trans. Pattern Anal. 33 (2011) 194–204.
[26] D. Chu, L.-Z. Liao, M.K. Ng, X. Zhang, Sparse canonical correlation analysis: new formulation and algorithm, IEEE Trans. Pattern Anal. (2013).
[27] N. Guan, X. Zhang, Z. Luo, L. Lan, Sparse representation based discriminative canonical correlation analysis for face recognition, in: 2012 IEEE 11th International Conference on Machine Learning and Applications (ICMLA), 2012, pp. 51–56.
[28] Y. Peng, D. Zhang, Semi-supervised canonical correlation analysis algorithm, J. Softw. 19 (2008) 2822–2832.
[29] O. Chapelle, B. Schölkopf, A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, 2006.
[30] J. Yu, M. Wang, D.C. Tao, Semisupervised multiview distance metric learning for cartoon synthesis, IEEE Trans. Image Process. 21 (2012) 4636–4648.
[31] J. Yu, Y. Rui, B. Chen, Exploiting click constraints and multi-view features for image re-ranking, IEEE Trans. Multimedia 16 (2014) 159–168.
[32] J. Yu, D. Tao, M. Wang, Adaptive hypergraph learning and its application in image classification, IEEE Trans. Image Process. 21 (2012) 3262–3272.
[33] J. Yu, D. Liu, D. Tao, H.S. Seah, Complex object correspondence construction in two-dimensional animation, IEEE Trans. Image Process. 20 (2011) 3257–3269.
[34] J. Yu, D. Tao, J. Li, J. Cheng, Semantic preserving distance metric learning and applications, Inform. Sci. (2014).
[35] D. Cai, X. He, J. Han, Semi-supervised discriminant analysis, in: IEEE 11th International Conference on Computer Vision (ICCV 2007), 2007, pp. 1–7.
[36] M. Sugiyama, T. Ide, S. Nakajima, J. Sese, Semi-supervised local Fisher discriminant analysis for dimensionality reduction, Mach. Learn. 78 (2010) 35–61.
[37] D. Zhang, Z.-H. Zhou, S. Chen, Semi-supervised dimensionality reduction, in: SDM, 2007.
[38] Y. Song, F. Nie, C. Zhang, S. Xiang, A unified framework for semi-supervised dimensionality reduction, Pattern Recogn. 41 (2008) 2789–2799.
[39] X. Zhu, Z. Ghahramani, Learning from labeled and unlabeled data with label propagation, Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
[40] F. Wang, C. Zhang, Label propagation through linear neighborhoods, IEEE Trans. Knowl. Data Eng. 20 (2008) 55–67.
[41] J. Wang, F. Wang, C. Zhang, H.C. Shen, L. Quan, Linear neighborhood propagation and its applications, IEEE Trans. Pattern Anal. 31 (2009) 1600–1615.
[42] F. Nie, S. Xiang, Y. Jia, C. Zhang, Semi-supervised orthogonal discriminant analysis via label propagation, Pattern Recogn. 42 (2009) 2615–2627.
[43] B. Cheng, J. Yang, S. Yan, Y. Fu, T.S. Huang, Learning with ℓ1-graph for image analysis, IEEE Trans. Image Process. 19 (2010) 858–866.
[44] J. Lu, X. Zhou, Y.-P. Tan, Y. Shang, J. Zhou, Cost-sensitive semi-supervised discriminant analysis for face recognition, IEEE Trans. Inform. Forensics Secur. 7 (2012) 944–953.
[45] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. 31 (2009) 210–227.
[46] W. Liu, H. Zhang, D. Tao, Y. Wang, K. Lu, Large-scale paralleled sparse principal component analysis, arXiv e-prints, 2013.
[47] T. Melzer, M. Reiter, H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recogn. 36 (2003) 1961–1971.
[48] C. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA, vol. 55, 1998.
[49] J. Franke, L. Lam, R. Legault, C. Nadal, C. Suen, Experiments with the CENPARMI database combining different classification approaches, in: 3rd International Workshop on Frontiers in Handwriting Recognition, 1993, pp. 305–311.
[50] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. 19 (1997) 711–720.
[51] A.A. Nielsen, Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data, IEEE Trans. Image Process. 11 (2002) 293–305.
[52] Z. Hu, Z. Lou, J. Yang, K. Liu, C. Suen, Handwritten digit recognition based on multi-classifier combination, Chinese J. Comput. 22 (1999) 369–374.
[53] A.M. Martinez, The AR face database, CVC Technical Report 24, 1998.
[54] G.H. Golub, C.F. Van Loan, Matrix Computations, JHU Press, 2012.