Fractional-order embedding canonical correlation analysis and its applications to multi-view dimensionality reduction and recognition


Yun-Hao Yuan a,b,*, Quan-Sen Sun b, Hong-Wei Ge a

a School of Internet of Things, Jiangnan University, No. 1800, Li Hu Street, Wuxi 214122, China
b Department of Computer Science, Nanjing University of Science & Technology, No. 200, Xiao Ling Wei Street, Nanjing 210094, China

Article info

Abstract

Article history: Received 9 December 2012 Received in revised form 21 July 2013 Accepted 16 September 2013 Available online 2 October 2013

Due to noise disturbance and the limited number of training samples, the within-set and between-set sample covariance matrices in canonical correlation analysis (CCA) usually deviate from the true ones. In this paper, we re-estimate the within-set and between-set covariance matrices to reduce the negative effect of this deviation. Specifically, we use the idea of fractional order to respectively correct the eigenvalues and singular values in the corresponding sample covariance matrices, and then construct fractional-order within-set and between-set scatter matrices which can markedly alleviate the problem of the deviation. On this basis, a new approach is proposed to reduce the dimensionality of multi-view data for classification tasks, called fractional-order embedding canonical correlation analysis (FECCA). The proposed method is evaluated on various handwritten numeral, face, and object recognition problems. Extensive experimental results on the CENPARMI, UCI, AT&T, AR, and COIL-20 databases show that FECCA is very effective and clearly outperforms the existing joint dimensionality reduction or feature extraction methods in terms of classification accuracy. Moreover, its improvements in recognition rates are statistically significant in most cases at the 0.05 significance level. © 2013 Elsevier Ltd. All rights reserved.

Keywords: Pattern recognition; Canonical correlation analysis; Feature extraction; Dimensionality reduction; Multi-view learning

1. Introduction

In many scientific applications in pattern recognition, computer vision, and data visualization, the same objects are usually represented in multiple high-dimensional feature spaces. For example, a speaker can be represented by audio and visual features in speaker identification [1–3]; genes can be represented by the genetic activity feature and the text information feature [4] in bioinformatics; the same geographical region has multi-spectral or multi-temporal remote sensing data, which can be naturally viewed as multiple high-dimensional representations from different feature spaces [5]; handwritten numerals can be represented by Zernike moment features and Fourier coefficients of character shapes [6]; and so on. Obviously, multiple feature representations derived from the same objects always reflect different characteristics or views of the input data. This kind of data, called multi-view data [7–9], usually contains more useful information and knowledge than a single feature representation. Therefore, multi-view analysis has recently attracted increasing attention.

* Corresponding author at: School of Internet of Things, Jiangnan University, No. 1800, Li Hu Street, Wuxi 214122, China. Tel.: +86 25 84315142; fax: +86 25 84318156. E-mail addresses: [email protected], [email protected] (Y.-H. Yuan), [email protected] (Q.-S. Sun), [email protected] (H.-W. Ge).


Generally, multi-view data are high-dimensional in many real-world applications, and this leads to several problems. First, high dimensionality greatly increases the space requirement for data storage in various learning tasks (e.g., classification and clustering). Second, the speed of learning algorithms depends on the dimensionality of the data [10]; that is, when the dimension of the data is very high, the computational cost becomes much more expensive. Finally, learning tasks on multi-view data that are analytically or computationally manageable in low-dimensional spaces may become completely intractable in multiple high-dimensional feature spaces [11,12]. Thus, multi-view dimensionality reduction (MDR) is a very necessary and fundamental problem in numerous scientific applications. Different from traditional dimensionality reduction, the main task of MDR is to simultaneously find multiple meaningful low-dimensional representations of multi-view high-dimensional data such that the inherent data structures of all views are revealed. Multi-view dimensionality reduction can be applied for data preprocessing and can facilitate other learning tasks such as multi-view classification, clustering and regression [13–15].

For multi-view data, Hou et al. [8] have pointed out that traditional dimensionality reduction methods, such as principal component analysis (PCA) [16,17], linear discriminant analysis (LDA) [18,19], locality preserving projections (LPP) [20,21], locally linear embedding (LLE) [22], and so on, are not suitable for reducing the dimensionality of multi-view data. This is because these


methods are mainly designed for a single view. If they are used to reduce the dimensionality of data formed by simply concatenating the feature representations of all views together, these traditional methods will treat all representations in the same way and unavoidably ignore their diversities. Obviously, this is not conducive to MDR and subsequent pattern classification tasks.

At present, there have been a number of MDR methods [7,8,23–34]. Among these methods, canonical correlation analysis (CCA) [35] is the most extensively used approach. CCA investigates the linear correlations between two sets of random variables. It can linearly project two sets of random variables into a lower-dimensional space in which they are maximally correlated. In image recognition tasks, Sun et al. [23] employed CCA to simultaneously reduce the dimensionality of two sets of feature vectors (i.e., two views) to obtain two low-dimensional representations, which are then fused to form effective discriminant feature vectors for classification. From the viewpoint of regression, Foster et al. [24] performed CCA to derive low-dimensional embeddings of two-view data and computed the regression function based on these embeddings. Since CCA cannot be performed in the well-known small sample size (SSS) problem [19,36], where the dimension of the feature vectors is larger than the number of training samples, regularized CCA (RCCA) [28,37] has been presented for two-view dimensionality reduction. Owing to the lack of discriminant information, Sun et al. [32] and Kim et al. [33] proposed, respectively, generalized CCA (GCCA) and discriminant canonical correlations (DCC) for different applications; the features extracted by these two methods are more powerful for handwritten numeral recognition and image set matching. Moreover, since CCA is essentially a linear subspace learning method, it fails to discover the nonlinear relationships in two-view data. In order to solve this problem, Melzer et al. [34] presented kernel CCA (KCCA), where two non-linear transformations of the input data are performed implicitly using kernel methods. From another nonlinear point of view, the locality idea has been incorporated into CCA, and locality preserving CCA (LPCCA) [30] has been proposed to discover the local manifold structure of each view for data visualization and pose estimation. In addition, sparse CCA (SCCA) [38,39] and partial least squares (PLS) [31] have also been presented.

However, in CCA and other related methods (e.g., GCCA and KCCA), since the within-set (i.e., within-view) and between-set (i.e., between-view) population covariance matrices are not known beforehand in practical applications, their estimates have to be obtained from a training sample space. This leads to the problem that the sample covariance matrices usually deviate from the true ones when the training samples have noise disturbances, or when the number of training samples is of the same order of magnitude as the dimension of the feature vectors or even less [40,41]. Meanwhile, the research in [40] has shown that even though sample covariance matrices are unbiased estimates of the true ones, their eigenvalues are biased estimates of the eigenvalues of the true covariance matrices. Using these biased estimates, it is very difficult for CCA to yield good projections for pattern classification tasks.
Clearly, this bias has a negative effect on the learning of CCA for multi-view dimensionality reduction.

In this paper, in order to reduce the negative effect of the bias, the within-set and between-set covariance matrices are re-estimated by sample spectrum modeling. Specifically, we employ the idea of fractional order to respectively correct the eigenvalues and singular values in the corresponding within-set and between-set sample covariance matrices, and then construct fractional-order within-set and between-set scatter matrices which can alleviate the problem of the biased estimates caused by noise and the limited number of training samples. On this basis, we present a new approach for multi-view dimensionality reduction, called fractional-order embedding canonical correlation analysis (FECCA), which is shown to be able to obtain powerful pair-wise

projections for classification purposes in many real-world applications. Extensive visual recognition experiments, compared with existing joint feature extraction methods, demonstrate the effectiveness of the proposed method.

The remainder of this paper is organized as follows: Section 2 briefly reviews the related background work, including CCA, RCCA, LPCCA, and PLS. Section 3 constructs fractional-order within-set and between-set scatter matrices and subsequently presents the FECCA method. Section 4 deals with the singularity problem of fractional-order within-set scatter matrices. Section 5 provides the experimental results and analysis of FECCA and the performance comparisons with CCA, RCCA, LPCCA, and PLS. Finally, Section 6 gives the conclusions.

2. Background and related work

2.1. Canonical correlation analysis

Given two random zero-mean vectors x ∈ R^p and y ∈ R^q, a pair of projection directions, u ∈ R^p and v ∈ R^q, is computed such that the correlation coefficient of the canonical variates x̂ = u^T x and ŷ = v^T y, i.e.,

\rho(u, v) = \frac{\hat{E}(\hat{x}\hat{y})}{\sqrt{\hat{E}(\hat{x}^2)\,\hat{E}(\hat{y}^2)}} = \frac{u^T S_{xy} v}{\sqrt{u^T S_{xx} u \cdot v^T S_{yy} v}},    (1)

is maximized, where Ê(f) denotes the empirical expectation of the function f, S_xx and S_yy are, respectively, the within-set covariance matrices of the random vectors x and y, and S_xy is the between-set covariance matrix between x and y. Here, we assume that S_xx and S_yy are nonsingular. In (1), it is evident that the canonical correlation coefficient ρ is invariant to arbitrary scaling of u and v. Therefore, CCA normalizes the projection directions u and v by u^T S_xx u = 1 and v^T S_yy v = 1. Under these normalization constraints, CCA can be formulated equivalently as the following optimization problem:

\max_{u,v} \; \rho(u, v) = \frac{u^T S_{xy} v}{\sqrt{u^T S_{xx} u \cdot v^T S_{yy} v}}
s.t. \; u^T S_{xx} u = 1, \quad v^T S_{yy} v = 1.    (2)

Solving the optimization problem in (2) by the Lagrange multiplier method, the solution of CCA reduces to the solution of the following two generalized eigenvalue equations [42]:

S_{xy} S_{yy}^{-1} S_{yx} u = \lambda S_{xx} u, \qquad S_{yx} S_{xx}^{-1} S_{xy} v = \lambda S_{yy} v,    (3)

where S_yx = S_xy^T, the eigenvalue λ is equal to ρ², and u and v are the corresponding eigenvectors. Under conjugate orthogonality constraints, CCA can further produce multiple pairs of projection directions {(u_i, v_i)}_{i=1}^d, which consist of the first d pairs of eigenvectors corresponding to the d largest eigenvalues of the generalized eigenvalue problems in (3) [42], where d ≤ min(p, q).
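For concreteness, the generalized eigenvalue problems in (3) can be solved directly with standard linear-algebra routines. The following NumPy/SciPy sketch is ours (not part of the original paper); it assumes, as in (1), that the within-set covariance matrices are nonsingular, and the variable names are our own.

```python
import numpy as np
from scipy.linalg import eigh

def cca(X, Y, d):
    """Plain CCA: X is p x n, Y is q x n, columns are paired samples.
    Returns U (p x d) and V (q x d) solving the problems in Eq. (3)."""
    n = X.shape[1]
    X = X - X.mean(axis=1, keepdims=True)      # center each view
    Y = Y - Y.mean(axis=1, keepdims=True)
    Sxx, Syy, Sxy = X @ X.T / n, Y @ Y.T / n, X @ Y.T / n
    # S_xy S_yy^{-1} S_yx u = lambda S_xx u  (symmetric-definite pair)
    A = Sxy @ np.linalg.solve(Syy, Sxy.T)
    lam, U = eigh(A, Sxx)                      # eigenvalues ascending, u^T Sxx u = 1
    idx = np.argsort(lam)[::-1][:d]
    U, lam_d = U[:, idx], np.maximum(lam[idx], 1e-12)
    # v follows from u (cf. Eq. (3)): v = S_yy^{-1} S_yx u / sqrt(lambda)
    V = np.linalg.solve(Syy, Sxy.T @ U) / np.sqrt(lam_d)
    return U, V
```

The returned columns satisfy the CCA normalization u_i^T S_xx u_i = v_i^T S_yy v_i = 1, with canonical correlations √λ_i.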

2.2. Regularized canonical correlation analysis

In the SSS problem, the within-set covariance matrices S_xx and S_yy are always singular and their inverses are no longer available. To solve this problem and prevent overfitting, RCCA can be used in practice. Specifically, RCCA computes two projection directions u ∈ R^p and v ∈ R^q by the following optimization problem:

\max_{u,v} \; \rho_R(u, v) = \frac{u^T S_{xy} v}{\sqrt{u^T (S_{xx} + \lambda_x I_x) u \cdot v^T (S_{yy} + \lambda_y I_y) v}}
s.t. \; u^T (S_{xx} + \lambda_x I_x) u = 1, \quad v^T (S_{yy} + \lambda_y I_y) v = 1,    (4)

where λ_x and λ_y are two regularization parameters with λ_x > 0 and λ_y > 0, and I_x ∈ R^{p×p} and I_y ∈ R^{q×q} are two identity matrices. Solving the optimization problem in (4) via the Lagrange multiplier technique, we can obtain the following two generalized eigenvalue problems:

S_{xy} (S_{yy} + \lambda_y I_y)^{-1} S_{yx} u = \lambda (S_{xx} + \lambda_x I_x) u,
S_{yx} (S_{xx} + \lambda_x I_x)^{-1} S_{xy} v = \lambda (S_{yy} + \lambda_y I_y) v,    (5)

where λ is the eigenvalue corresponding to the eigenvector u or v. Similarly to CCA, RCCA can also yield d pairs of projection directions {(u_i, v_i)}_{i=1}^d according to the two generalized eigen-equations in (5). Moreover, the regularization parameters λ_x and λ_y are generally chosen by cross-validation [28,29] in real-world applications.

2.3. Locality preserving canonical correlation analysis

LPCCA is a locally linear dimensionality reduction method, which can capture the nonlinear information of data by forcing neighboring points in the original feature space to remain close in the transformed lower-dimensional subspace. Concretely, given n pairs of samples {(x_i, y_i)}_{i=1}^n with x_i ∈ R^p and y_i ∈ R^q, the projection directions u ∈ R^p and v ∈ R^q are calculated by the following optimization problem:

\max_{u,v} \; u^T \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij,x} (x_i - x_j) \, s_{ij,y} (y_i - y_j)^T v
s.t. \; u^T \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij,x}^2 (x_i - x_j)(x_i - x_j)^T u = 1,
\quad v^T \sum_{i=1}^{n} \sum_{j=1}^{n} s_{ij,y}^2 (y_i - y_j)(y_i - y_j)^T v = 1,    (6)

where s_{ij,x} denotes the similarity degree of x_i and x_j belonging to a local neighbor set, and s_{ij,y} has the analogous meaning to s_{ij,x}, i, j = 1, 2, …, n. Let us denote X = (x_1, x_2, …, x_n), Y = (y_1, y_2, …, y_n), S_x = (s_{ij,x}) ∈ R^{n×n} and S_y = (s_{ij,y}) ∈ R^{n×n}; then the optimization problem in (6) can be reformulated as

\max_{u,v} \; u^T X (D_{xy} - S_x \circ S_y) Y^T v
s.t. \; u^T X (D_{xx} - S_x \circ S_x) X^T u = 1, \quad v^T Y (D_{yy} - S_y \circ S_y) Y^T v = 1,    (7)

where the operator ∘ denotes (A ∘ B)_{ij} = a_{ij} b_{ij} for matrices A = (a_{ij}) ∈ R^{n×n} and B = (b_{ij}) ∈ R^{n×n}, and D_xx, D_yy and D_xy are diagonal matrices of the same size n × n whose ith diagonal elements are, respectively, the sum of all elements in the ith row of the matrices S_x ∘ S_x, S_y ∘ S_y and S_x ∘ S_y. By the Lagrange multiplier technique, the optimization problem in (7) can be transformed into the following generalized eigenvalue problem:

\begin{pmatrix} 0 & X G_{xy} Y^T \\ Y G_{xy}^T X^T & 0 \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} = \lambda \begin{pmatrix} X G_{xx} X^T & 0 \\ 0 & Y G_{yy} Y^T \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix},    (8)

where G_xy = D_xy − S_x ∘ S_y, G_xx = D_xx − S_x ∘ S_x and G_yy = D_yy − S_y ∘ S_y.
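The block eigenproblem in (8) can be assembled mechanically once the similarity matrices are fixed. The sketch below is ours and only illustrative: it assumes a binary k-nearest-neighbor similarity for s_{ij} (the paper leaves the similarity choice open and only tunes the neighborhood size K), and it returns unscaled directions.

```python
import numpy as np
from scipy.linalg import eig
from scipy.spatial.distance import cdist

def lpcca(X, Y, k=5, d=2):
    """Rough LPCCA sketch per Eqs. (7)-(8). X: p x n, Y: q x n, columns paired.
    Directions only; the unit-variance scaling in Eq. (7) is omitted."""
    n = X.shape[1]

    def knn_sim(Z):                               # assumed 0/1 k-NN similarity
        D = cdist(Z.T, Z.T)
        S = np.zeros((n, n))
        nn = np.argsort(D, axis=1)[:, 1:k + 1]    # k nearest neighbors (excl. self)
        S[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
        return np.maximum(S, S.T)                 # symmetrize

    Sx, Sy = knn_sim(X), knn_sim(Y)
    def G(A, B):                                  # D_ab - A∘B, cf. Eq. (7)
        H = A * B
        return np.diag(H.sum(axis=1)) - H
    Gxx, Gyy, Gxy = G(Sx, Sx), G(Sy, Sy), G(Sx, Sy)

    p, q = X.shape[0], Y.shape[0]
    L = np.zeros((p + q, p + q))                  # left-hand block matrix of Eq. (8)
    L[:p, p:] = X @ Gxy @ Y.T
    L[p:, :p] = L[:p, p:].T
    R = np.zeros((p + q, p + q))                  # right-hand block-diagonal matrix
    R[:p, :p] = X @ Gxx @ X.T
    R[p:, p:] = Y @ Gyy @ Y.T
    R += 1e-8 * np.trace(R) / (p + q) * np.eye(p + q)   # small ridge for stability
    lam, W = eig(L, R)
    W = W[:, np.argsort(-lam.real)[:d]].real
    return W[:p], W[p:]                           # U (p x d), V (q x d)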

2.4. Partial least squares

The PLS method [31,43] is closely related to CCA. Given two random high-dimensional vectors x ∈ R^p and y ∈ R^q, PLS aims at computing two linear projection directions, u ∈ R^p and v ∈ R^q, such that the covariance between the projections u^T x and v^T y,

J_{PLS}(u, v) = \mathrm{cov}(u^T x, v^T y) = u^T S_{xy} v,    (9)

is maximized under orthonormal constraints, where cov(f_1, f_2) represents the covariance of functions f_1 and f_2. In PLS, the first pair of projection directions, u_1 and v_1, is obtained by maximizing (9) with the constraints u^T u = 1 and v^T v = 1. On this basis, the kth (2 ≤ k ≤ r) pair of projection directions, u_k and v_k, is calculated by continually maximizing (9) with the following constraints:

u^T u = 1, \quad v^T v = 1, \quad u^T u_i = 0, \quad v^T v_i = 0, \quad i = 1, 2, \ldots, k - 1,    (10)

where r = rank(S_xy). Further, PLS solves the following two eigenvalue problems [44]:

S_{xy} S_{yx} u = \lambda u, \qquad S_{yx} S_{xy} v = \lambda v,    (11)

where λ is the eigenvalue corresponding to the eigenvector u or v and represents the covariance. In contrast to CCA, it can be seen from (11) that within-set covariance matrix inversion is no longer needed in PLS. Clearly, this is beneficial to the dimensionality reduction of high-dimensional data in the SSS problem.
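Since (11) involves only S_xy, the PLS directions are simply the leading left and right singular vectors of the between-set covariance matrix. A short sketch (ours, not from the paper) follows.

```python
import numpy as np

def pls_directions(X, Y, d):
    """PLS projection directions via Eq. (11): the top-d left/right
    singular vectors of S_xy. X: p x n, Y: q x n, columns paired."""
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxy = Xc @ Yc.T / n
    U, s, Vt = np.linalg.svd(Sxy, full_matrices=False)
    return U[:, :d], Vt[:d].T   # unit-norm, mutually orthogonal directions
```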

3. Fractional-order embedding canonical correlation analysis (FECCA)

In recent years, fractional-order techniques have been used in texture enhancement [45], image registration [46], face representation [47], and so on. Pu et al. [45] proposed a fractional differential-based method for multiscale texture enhancement. This method has been demonstrated to have a better capability of nonlinearly enhancing complex texture details than traditional integral-based algorithms. In image registration, Pan et al. [46] presented an adaptable-multilayer fractional Fourier transform algorithm, which is more efficient than the pseudopolar-based fast Fourier transform methods. Moreover, Liu et al. [47] proposed a new intermediate representation method for face recognition, called fractional-order singular value decomposition representation (FSVDR). The motivation of FSVDR is that each face image can be decomposed into a composition of a set of base images generated by singular value decomposition, and the top singular values are sensitive to large facial variations, e.g., expression, lighting, and occlusion. To alleviate facial variations in each image, Liu et al. employed a fractional function to deflate the singular values corresponding to facial variations. Experimental results show that FSVDR is an excellent intermediate representation for face recognition.

In this section, we introduce the concept of fractional order into CCA and give the characterizations of fractional-order within-set and between-set scatter matrices. On this basis, we build the FECCA method for multi-view dimensionality reduction.

3.1. Motivation

As discussed in Section 1, the estimates of the within-set and between-set covariance matrices usually deviate from the true ones in real-world applications due to noise disturbance and the finite number of training samples. In this case, the within-set and between-set sample covariance matrices can no longer be considered a good approximation of the true covariance matrices. For visual illustration, we consider 500 (p+q)-dimensional simulated samples following a multivariate Gaussian distribution, split into two groups of data X and Y (i.e., two views) with dimensions p and q, respectively, where p = 1000 and q = 1500. Fig. 1 shows the comparisons of within-set and between-set sample covariance


matrices and the true covariance matrices with respect to the ordered eigenvalues or singular values. As observed in Fig. 1, it is evident that the spectra (i.e., eigenvalues or singular values) of the within-set and between-set sample covariance matrices differ greatly from those of the true covariance matrices. This deviation has recently been demonstrated to have a negative effect on classification systems [49,50]. Likewise, based on these biased estimates, CCA cannot obtain good pair-wise projections for MDR and subsequent classification tasks. Motivated by the above situation, we re-estimate the within-set and between-set covariance matrices by correcting the eigenvalues and singular values in the corresponding sample covariance matrices.

Fig. 1. Comparisons of within-set and between-set sample covariance matrices and the true covariance matrices with respect to the corresponding ordered eigenvalues or singular values, where TCM denotes the true covariance matrix and SCM denotes the sample covariance matrix. Note that (a), (b), and (c) show the leading spectra in the corresponding covariance matrices.

3.2. Construction of fractional-order within-set and between-set scatter matrices In this subsection, we establish fractional-order within-set and between-set scatter matrices that can alleviate the problem of the deviation.

3.2.1. Fractional-order within-set scatter matrices

Assume two different high-dimensional representations of the same n patterns are given as {X_i}_{i=1}^2, where X_i = (x_{i1}, x_{i2}, …, x_{in}) with x_{ij} ∈ R^{m_i} for j = 1, 2, …, n denoting the jth sample in representation X_i, and m_i denotes the dimensionality of the samples in representation X_i. Let us suppose {x_{ij}}_{j=1}^n have been centered, i.e., \sum_{j=1}^{n} x_{ij} = 0.

The fractional-order within-set scatter matrix can be characterized by fractional-order eigenvalues of the within-set sample covariance matrix. Specifically, let S_ii ∈ R^{m_i × m_i} represent the within-set sample covariance matrix of X_i, i.e., S_ii = (1/n) X_i X_i^T. Then, it is evident that S_ii is a symmetric and nonnegative definite matrix. For the matrix S_ii, we can always obtain its eigenvalue decomposition (EVD) [48] as

S_{ii} = Q_i \Lambda_i Q_i^T, \qquad \Lambda_i = \mathrm{diag}(\lambda_{i1}, \lambda_{i2}, \ldots, \lambda_{ir_i}),    (12)

where Q_i = (q_{i1}, q_{i2}, …, q_{ir_i}) is the eigenvector matrix of S_ii, λ_{i1} ≥ λ_{i2} ≥ … ≥ λ_{ir_i} are the descending-order eigenvalues corresponding to the eigenvectors, and r_i = rank(S_ii). According to (12), we are now able to define the fractional-order within-set scatter matrix, as follows.

Definition 1. Suppose α_i is a fraction satisfying 0 ≤ α_i ≤ 1. Then, the matrix S_ii^{α_i} is referred to as the fractional-order within-set scatter matrix if and only if

S_{ii}^{\alpha_i} = Q_i \Lambda_i^{\alpha_i} Q_i^T,    (13)

where \Lambda_i^{\alpha_i} = \mathrm{diag}(\lambda_{i1}^{\alpha_i}, \lambda_{i2}^{\alpha_i}, \ldots, \lambda_{ir_i}^{\alpha_i}), Q_i and r_i are defined in (12), and i = 1, 2.

From Definition 1, we can see that when the fractional-order parameter α_i satisfies α_i = 1, the decomposition in (13) reduces to the EVD corresponding to (12). In other words, the EVD can be viewed as a special case of (13). According to Definition 1, it is easy to draw the following properties:

Property 1. S_ii^{α_i} is a symmetric and nonnegative definite matrix, i.e., S_ii^{α_i} = (S_ii^{α_i})^T, and ξ_i^T S_ii^{α_i} ξ_i ≥ 0 for any non-zero vector ξ_i ∈ R^{m_i}.

Property 2. The rank of S_ii^{α_i} is equal to that of S_ii, i.e., rank(S_ii^{α_i}) = rank(S_ii).

Property 3. All nonzero eigenvalues of S_ii^{α_i} are {λ_{ij}^{α_i}}_{j=1}^{r_i}. Furthermore, S_ii^{α_i} and S_ii have the same orthonormal transformation matrix, which consists of the eigenvectors of S_ii.

Property 4. When α_i > 0, there exists an invertible mapping between S_ii^{α_i} and S_ii. When α_i = 0, there only exists a single mapping from S_ii to S_ii^{α_i}.

Property 5. S_ii^{α_i} and S_ii satisfy the equation S_{ii}^{\alpha_i} = S_{ii} S_{ii}^{\alpha_i - 1}.

Property 6. ||S_{ii}^{\alpha_i} - S_{ii}||_F^2 = \sum_{j=1}^{r_i} (\lambda_{ij}^{\alpha_i} - \lambda_{ij})^2, where ||A||_F denotes the Frobenius norm of the matrix A.

Properties 1–3 show that some basic characteristics of the within-set sample covariance matrix S_ii are preserved in the fractional-order within-set scatter matrix S_ii^{α_i}. Properties 4 and 5 further indicate the close relationships between S_ii^{α_i} and S_ii. Property 6 reveals that the distance between S_ii^{α_i} and S_ii can be controlled by adjusting the fractional-order parameter α_i. That is, the closer α_i is to one, the smaller their distance is.

3.2.2. Fractional-order between-set scatter matrix

The fractional-order between-set scatter matrix can be characterized by fractional-order singular values of the between-set sample covariance matrix. Let S_12 ∈ R^{m_1 × m_2} be the between-set sample covariance matrix of X_1 and X_2, i.e., S_12 = (1/n) X_1 X_2^T. Then, we have the thin singular value decomposition (SVD) [48] as

S_{12} = P \Lambda Q^T, \qquad \Lambda = \mathrm{diag}(s_1, s_2, \ldots, s_r),    (14)

where P = (p_1, p_2, …, p_r) and Q = (q_1, q_2, …, q_r) are, respectively, the left and right singular vector matrices of S_12, s_1 ≥ s_2 ≥ … ≥ s_r are the descending-order singular values of S_12, and r = rank(S_12). Now, the fractional-order between-set scatter matrix can be defined in terms of (14).

Definition 2. Suppose β is a fraction satisfying 0 ≤ β ≤ 1. Then, the matrix S_12^β is referred to as the fractional-order between-set scatter matrix if and only if

S_{12}^{\beta} = P \Lambda^{\beta} Q^T,    (15)

where \Lambda^{\beta} = \mathrm{diag}(s_1^{\beta}, s_2^{\beta}, \ldots, s_r^{\beta}), and P, Q, and r are defined in (14).

In Definition 2, it is obvious that the decomposition in (15) subsumes the thin SVD in (14) as its special case when β = 1. Since the fractional-order between-set scatter matrix S_12^β has similar properties to the fractional-order within-set scatter matrices, here we ignore them.

Fig. 2. Comparisons of within-set and between-set sample covariance matrices, fractional-order within-set and between-set scatter matrices, and the true covariance matrices with respect to the corresponding ordered eigenvalues or singular values, where TCM denotes the true covariance matrix, SCM denotes the sample covariance matrix, and FSM represents the fractional-order scatter matrices. Note that (a), (b), and (c) show the leading spectra in the corresponding covariance matrices, and the fractional-order parameters are 0.9 in (a), (b), and (c), respectively.
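To make Definitions 1 and 2 concrete, the following NumPy sketch (ours; function names are hypothetical) builds S_11^{α_1}, S_22^{α_2} and S_12^β from centered data matrices of the two views.

```python
import numpy as np

def frac_within_scatter(Xi, alpha):
    """Fractional-order within-set scatter matrix, Eq. (13).
    Xi: m_i x n centered data matrix; assumes 0 < alpha <= 1."""
    n = Xi.shape[1]
    Sii = Xi @ Xi.T / n
    lam, Q = np.linalg.eigh(Sii)              # eigenvalues ascending
    lam = np.clip(lam, 0.0, None)             # guard tiny negative round-off
    return (Q * lam**alpha) @ Q.T             # Q diag(lam^alpha) Q^T

def frac_between_scatter(X1, X2, beta):
    """Fractional-order between-set scatter matrix, Eq. (15)."""
    n = X1.shape[1]
    S12 = X1 @ X2.T / n
    P, s, Qt = np.linalg.svd(S12, full_matrices=False)
    return (P * s**beta) @ Qt                 # P diag(s^beta) Q^T
```

Setting alpha = beta = 1 recovers the plain sample covariance matrices, as noted after Definitions 1 and 2.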

3.2.3. Effectiveness test of fractional-order parameters

To validate the effectiveness of the idea of fractional order, we employ the same simulated data as those used in Section 3.1. The within-set and between-set sample covariance matrices and the fractional-order within-set and between-set scatter matrices are compared to the true covariance matrices with respect to the sorted eigenvalues or singular values, as shown in Fig. 2. From the simulated results in Fig. 2, we can clearly find that the idea of fractional order is very effective for correcting the eigenvalues and singular values in the corresponding within-set and between-set sample covariance matrices. By adjusting the fractional-order parameters, the spectra of the fractional-order within-set and between-set scatter matrices can be made as close as possible to the true spectra, while the sample spectra deviate significantly from the true ones due to the finite number of training samples. Thus, we will employ the fractional-order within-set and between-set scatter matrices to improve the CCA method for MDR.

3.3. Model and solution of FECCA

After establishing fractional-order within-set and between-set scatter matrices, we make use of them to build a new optimization model for MDR. For convenience, the proposed method is called fractional-order embedding canonical correlation analysis (FECCA). Specifically, we can obtain the projection directions u ∈ R^{m_1} and v ∈ R^{m_2} by maximizing the following criterion:¹

J(u, v) = \frac{u^T S_{12}^{\beta} v}{(u^T S_{11}^{\alpha_1} u \cdot v^T S_{22}^{\alpha_2} v)^{1/2}},    (16)

where α_1, α_2, and β are fractional-order parameters. Obviously, the criterion in (16) subsumes the CCA criterion in (1) as its special case when the three fractional-order parameters satisfy α_1 = α_2 = β = 1. Since the criterion in (16) is invariant to arbitrary scaling of u and v, i.e., J(u, v) = J(au, bv) for any nonzero a, b ∈ R, let us set u^T S_11^{α_1} u = 1 and v^T S_22^{α_2} v = 1 without loss of generality. Now, the problem of maximizing the criterion in (16) can be equivalently transformed into the following optimization problem:

\max_{u,v} \; J(u, v) = \frac{u^T S_{12}^{\beta} v}{(u^T S_{11}^{\alpha_1} u \cdot v^T S_{22}^{\alpha_2} v)^{1/2}}
s.t. \; u^T S_{11}^{\alpha_1} u = 1, \quad v^T S_{22}^{\alpha_2} v = 1.    (17)

Using the method of Lagrange multipliers, we now form the Lagrangian function

L(u, v, \gamma_1, \gamma_2) = u^T S_{12}^{\beta} v - \frac{\gamma_1}{2}(u^T S_{11}^{\alpha_1} u - 1) - \frac{\gamma_2}{2}(v^T S_{22}^{\alpha_2} v - 1),    (18)

with γ_1 and γ_2 as the Lagrange multipliers. Letting ∂L/∂u = 0 and ∂L/∂v = 0, we have

S_{12}^{\beta} v - \gamma_1 S_{11}^{\alpha_1} u = 0, \qquad S_{21}^{\beta} u - \gamma_2 S_{22}^{\alpha_2} v = 0,    (19)

where S_21^β = (S_12^β)^T. According to the two constraints in (17), it is easy to get γ_1 = γ_2 = u^T S_12^β v. Here, let us assume the fractional-order within-set scatter matrices S_11^{α_1} and S_22^{α_2} are nonsingular. Then, the two equations in (19) can be rewritten after some algebraic transformations as

(S_{11}^{\alpha_1})^{-1} S_{12}^{\beta} (S_{22}^{\alpha_2})^{-1} S_{21}^{\beta} u = \eta u, \qquad (S_{22}^{\alpha_2})^{-1} S_{21}^{\beta} (S_{11}^{\alpha_1})^{-1} S_{12}^{\beta} v = \eta v,    (20)

where η = γ_1² = γ_2² = (u^T S_12^β v)². It shows that the solutions of the optimization problem in (17) are the eigenvectors of the two eigen-equations in (20) corresponding to the eigenvalue η.

¹ Note that the idea of fractional order can be applied easily to CCA-related methods, such as GCCA and KCCA.


Similarly to CCA, we can obtain d pairs of projection directions {(u_i, v_i)}_{i=1}^d of FECCA by means of (20), which consist of the first d pairs of eigenvectors corresponding to the first d largest eigenvalues η_1 ≥ η_2 ≥ … ≥ η_d of the two eigen-equations in (20), where d ≤ rank(S_12^β). After getting d pairs of projection directions, we can derive an important Theorem 1, as follows:

Theorem 1. Let U = (u_1, u_2, …, u_d) ∈ R^{m_1 × d} and V = (v_1, v_2, …, v_d) ∈ R^{m_2 × d} be two projection matrices. Then, U and V must satisfy

u_i^T S_{11}^{\alpha_1} u_j = \delta_{ij}, \qquad v_i^T S_{22}^{\alpha_2} v_j = \delta_{ij}, \qquad u_i^T S_{12}^{\beta} v_j = \sqrt{\eta_i}\, \delta_{ij},    (21)

where η_i is the ith largest eigenvalue defined in (20), and δ_ij = 1 if i = j and 0 if i ≠ j, for i, j = 1, 2, …, d.

Proof. Since S_11^{α_1} and S_22^{α_2} are symmetric and positive definite, the Cholesky decompositions S_11^{α_1} = R_1^T R_1 and S_22^{α_2} = R_2^T R_2 must exist. Let us set ũ = R_1 u, ṽ = R_2 v, and H = R_1^{-T} S_12^β R_2^{-1}. Then, the two eigen-equations in (20) can be converted equivalently into

H H^T \tilde{u} = \eta \tilde{u}, \qquad H^T H \tilde{v} = \eta \tilde{v}.

It shows that ũ and ṽ are just, respectively, the left and right singular vectors of the matrix H corresponding to the singular value √η. Therefore, it is easy to obtain u_i^T S_11^{α_1} u_j = ũ_i^T R_1^{-T} S_11^{α_1} R_1^{-1} ũ_j = ũ_i^T ũ_j = δ_ij and v_i^T S_22^{α_2} v_j = ṽ_i^T R_2^{-T} S_22^{α_2} R_2^{-1} ṽ_j = ṽ_i^T ṽ_j = δ_ij. According to (19), we have S_12^β v = √η S_11^{α_1} u. Hence, u_i^T S_12^β v_j = √η_j u_i^T S_11^{α_1} u_j = √η_i δ_ij; that is, Theorem 1 is proven. □

3.4. Algorithmic description of FECCA

In summary of the foregoing description, the FECCA algorithm is given as follows:

Step 1: For two given representations X_1 and X_2, compute the within-set and between-set sample covariance matrices, i.e., S_11, S_22 and S_12.
Step 2: Perform the EVD in (12) for S_11 and S_22, and the SVD in (14) for S_12. Then, calculate the fractional-order within-set scatter matrices S_11^{α_1} and S_22^{α_2} using (13), and the fractional-order between-set scatter matrix S_12^β using (15) for given nonnegative fractional-order parameters α_1, α_2 and β.
Step 3: According to (20), calculate the eigenvectors {u_i}_{i=1}^d and {v_i}_{i=1}^d corresponding to the first d largest eigenvalues η_1 ≥ η_2 ≥ … ≥ η_d, where d ≤ rank(S_12^β).
Step 4: Form two projection matrices U = (u_1, u_2, …, u_d) and V = (v_1, v_2, …, v_d) using the d pairs of projection directions {(u_i, v_i)}_{i=1}^d. Then, perform dimensionality reduction of the high-dimensional data in the two different representations by U^T X_1 and V^T X_2.

It should be noted that we can obtain the relationships between u and v according to (19), as follows:

u = \eta^{-1/2} (S_{11}^{\alpha_1})^{-1} S_{12}^{\beta} v, \qquad v = \eta^{-1/2} (S_{22}^{\alpha_2})^{-1} S_{21}^{\beta} u.    (22)

From (22), we can observe that once the projection direction u (or v) is obtained, the other direction v (or u) can also be acquired through their mapping connections. Therefore, we only need to perform the eigenvalue decomposition of the matrix with the smaller size in (20) for projection directions in real-world applications. Evidently, the computational cost can be reduced largely. In addition, since FECCA involves three fractional-order parameters α_1, α_2 and β, how to choose these three parameters is a first problem. Here, we use the cross-validation method to determine the proper parameters α_1, α_2 and β which yield the best recognition performance, just as in the RCCA algorithm [28,29]. Furthermore, the computational complexity of FECCA is the same as that of CCA for given fractional-order parameters α_1, α_2 and β, i.e., O(m³), where m = max(m_1, m_2).

In the FECCA algorithm, for any given sample x^T = (x_1^T, x_2^T) with x_1 ∈ R^{m_1} and x_2 ∈ R^{m_2}, since we are always able to obtain two d-dimensional correlation feature vectors by U^T x_1 and V^T x_2, it is necessary to fuse them for subsequent classification tasks. Here, we suggest a simple and effective fusion strategy based on the weighted sum of correlation feature vectors, as follows:

z = \begin{pmatrix} \theta_1 U \\ \theta_2 V \end{pmatrix}^T \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = W^T x,    (23)

where W^T = (θ_1 U^T, θ_2 V^T) with 0 ≤ θ_1, θ_2 ≤ 1 as the fusion coefficients. These two coefficients determine the weights of the above-mentioned two d-dimensional correlation feature vectors in classification. After feature fusion, the formed feature vector z is used to represent the sample x for recognition purposes.
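The following sketch (ours, not the authors' code) implements Steps 1–4. It follows the SVD route used in the proof of Theorem 1 rather than forming the nonsymmetric products in (20) explicitly, and it assumes the frac_within_scatter / frac_between_scatter helpers sketched in Section 3.2; the small ridge tau is an assumption added for numerical safety.

```python
import numpy as np

def fecca(X1, X2, d, a1, a2, beta, tau=1e-8):
    """FECCA projection matrices U (m1 x d), V (m2 x d), cf. Eqs. (13)-(21).
    X1, X2: centered data matrices (m1 x n and m2 x n), columns paired."""
    S1 = frac_within_scatter(X1, a1)          # S_11^{alpha_1}
    S2 = frac_within_scatter(X2, a2)          # S_22^{alpha_2}
    S12 = frac_between_scatter(X1, X2, beta)  # S_12^{beta}
    # a small ridge keeps the Cholesky factors well defined (Section 4 uses
    # tau_i = 0.001 * trace(S_ii^{alpha_i}) when the matrices are singular)
    R1 = np.linalg.cholesky(S1 + tau * np.eye(S1.shape[0])).T
    R2 = np.linalg.cholesky(S2 + tau * np.eye(S2.shape[0])).T
    # H = R1^{-T} S_12^{beta} R2^{-1}; its singular vectors give u~, v~
    H = np.linalg.solve(R1.T, np.linalg.solve(R2.T, S12.T).T)
    P, s, Qt = np.linalg.svd(H, full_matrices=False)
    U = np.linalg.solve(R1, P[:, :d])         # u = R1^{-1} u~
    V = np.linalg.solve(R2, Qt[:d].T)         # v = R2^{-1} v~
    return U, V                               # approximately satisfy Eq. (21)
```

Dimensionality reduction of the two views is then U.T @ X1 and V.T @ X2, as in Step 4.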

4. Singularity processing of fractional-order within-set scatter matrices

In Section 3.3, in order to perform FECCA, we assumed the fractional-order within-set scatter matrices are nonsingular. However, when they are singular in real-world applications, their inverses are no longer available. To solve this problem, we introduce two regularization terms, following the idea in [42], into the criterion function in (16), as follows:

J_R(u, v) = \frac{u^T S_{12}^{\beta} v}{\sqrt{(u^T S_{11}^{\alpha_1} u + \tau_1 ||u||^2) \cdot (v^T S_{22}^{\alpha_2} v + \tau_2 ||v||^2)}}
         = \frac{u^T S_{12}^{\beta} v}{\sqrt{(u^T S_{11}^{\alpha_1} u + \tau_1 u^T u) \cdot (v^T S_{22}^{\alpha_2} v + \tau_2 v^T v)}}
         = \frac{u^T S_{12}^{\beta} v}{\sqrt{u^T (S_{11}^{\alpha_1} + \tau_1 I_{m_1}) u \cdot v^T (S_{22}^{\alpha_2} + \tau_2 I_{m_2}) v}},    (24)

where τ_i is a small positive real number, and I_{m_i} is the identity matrix of size m_i × m_i, i = 1, 2. In (24), we observe that S_ii^{α_i} + τ_i I_{m_i} must be nonsingular since the following Theorem 2 holds:

Theorem 2. Let τ_i > 0 be any real number, and set \tilde{S}_{ii}^{\alpha_i} = S_{ii}^{\alpha_i} + \tau_i I_{m_i}. Then, rank(\tilde{S}_{ii}^{\alpha_i}) = m_i, where i = 1, 2.

From (24), it can be seen that the new regularized criterion function is not affected by rescaling of u or v. Hence, the optimization problem maximizing J_R(u, v) is subject to

u^T (S_{11}^{\alpha_1} + \tau_1 I_{m_1}) u = 1, \qquad v^T (S_{22}^{\alpha_2} + \tau_2 I_{m_2}) v = 1.    (25)

Following the same approach as in Section 3.3, where we assumed that S_11^{α_1} and S_22^{α_2} are invertible, the regularized FECCA solves the following two standard eigenvalue problems:

(S_{11}^{\alpha_1} + \tau_1 I_{m_1})^{-1} S_{12}^{\beta} (S_{22}^{\alpha_2} + \tau_2 I_{m_2})^{-1} S_{21}^{\beta} u = \eta u,
(S_{22}^{\alpha_2} + \tau_2 I_{m_2})^{-1} S_{21}^{\beta} (S_{11}^{\alpha_1} + \tau_1 I_{m_1})^{-1} S_{12}^{\beta} v = \eta v.    (26)

In real-world applications, if the fractional-order within-set scatter matrices are singular, or we need to introduce a control on the flexibility of FECCA in some particular applications, then the regularized FECCA can be employed. In this paper, we choose the regularization parameter τ_i as τ_i = 0.001 tr(S_ii^{α_i}) when S_ii^{α_i} is not invertible, where tr(A) denotes the trace of the matrix A, and i = 1, 2.
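As a usage illustration (ours, not from the paper), the fused feature of Eq. (23) with θ_1 = θ_2 = 1, combined with the cosine-distance nearest-neighbor classifier used throughout Section 5, can be sketched as follows; fecca refers to the sketch in Section 3.4.

```python
import numpy as np

def fuse(U, V, X1, X2, theta1=1.0, theta2=1.0):
    """Weighted-sum fusion of Eq. (23): z = theta1*U^T x1 + theta2*V^T x2."""
    return theta1 * (U.T @ X1) + theta2 * (V.T @ X2)   # d x n feature matrix

def cosine_nn_predict(Z_train, y_train, Z_test):
    """1-NN with cosine distance on fused features (columns are samples)."""
    A = Z_train / (np.linalg.norm(Z_train, axis=0, keepdims=True) + 1e-12)
    B = Z_test / (np.linalg.norm(Z_test, axis=0, keepdims=True) + 1e-12)
    return y_train[np.argmax(A.T @ B, axis=0)]

# e.g.: U, V = fecca(X1_tr, X2_tr, d=30, a1=0.9, a2=0.9, beta=0.9)
#       y_pred = cosine_nn_predict(fuse(U, V, X1_tr, X2_tr), y_tr,
#                                  fuse(U, V, X1_te, X2_te))
```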

5. Experimental results and analysis

In this section, the performance of the proposed FECCA method is evaluated by a series of experiments on handwritten numeral, face and object image databases. In order to further reveal the effectiveness of FECCA, we compare it with CCA, RCCA, LPCCA, and PLS. In all experiments, the fractional-order parameters α_1, α_2 and β in FECCA are chosen from the set {0.1, 0.2, …, 1}. The regularization parameters λ_x and λ_y in RCCA are selected from the set {10^{-5}, 10^{-4}, …, 10}. The K-nearest-neighborhood parameter K in LPCCA is chosen from the set {1, 2, …, l-1}, where l represents the number of training samples per class. We use cross-validation to determine the parameters with optimal recognition performance for RCCA, LPCCA and FECCA; specifically, we perform two-fold cross-validation on each training set to choose the optimal parameters. In the feature-fusion phase, the coefficients θ_1 and θ_2 in (23) are set as θ_1 = θ_2 = 1 for each method in all recognition experiments.

5.1. Handwritten numeral recognition

In this section, the main objective of the experiments is to examine the performance of the proposed FECCA method in handwritten numeral recognition.

5.1.1. Experiment on the CENPARMI handwritten numeral database

The widely used CENPARMI handwritten numeral database [51,52] contains 6000 samples of 10 numeral classes (i.e., the 10 digits from 0 to 9). There are 600 samples in each class. The first 400 samples per class are chosen for training, while the remaining 200 samples are used for testing. Thus, the total numbers of training and testing samples are, respectively, 4000 and 2000. In [53], Hu et al. did some preprocessing work and extracted four sets of features, as shown in Table 1, and our experiment is performed using these features. From Table 1, there are a total of six different pairwise feature combinations. In our experiment, CCA, RCCA, LPCCA, PLS and FECCA are used, respectively, to perform joint feature extraction from each feature combination.

Table 1. Four sets of features on the CENPARMI database. X(G): 256-dimension Gabor transformation feature; X(L): 121-dimension Legendre moment feature; X(P): 36-dimension pseudo-Zernike moment feature; X(Z): 30-dimension Zernike moment feature.

Table 2. The maximal recognition rates (percent) of CCA, RCCA, LPCCA, PLS and FECCA and the corresponding dimensions (in parentheses) on the CENPARMI database.

Pairwise features   CCA          RCCA         LPCCA        PLS          FECCA
X(G)–X(L)           82.30 (46)   92.30 (33)   79.15 (76)   90.40 (33)   93.40 (37)
X(G)–X(P)           72.90 (32)   87.40 (34)   76.70 (36)   83.95 (36)   88.25 (30)
X(L)–X(P)           75.30 (31)   89.60 (25)   74.55 (34)   86.85 (33)   90.60 (31)
X(Z)–X(G)           75.45 (30)   88.30 (27)   76.70 (30)   83.55 (29)   88.90 (29)
X(Z)–X(L)           73.20 (30)   88.60 (29)   72.65 (30)   85.45 (30)   90.15 (26)
X(Z)–X(P)           57.85 (30)   70.45 (30)   66.80 (30)   73.90 (30)   74.55 (30)

Fig. 3. The recognition rates of CCA, RCCA, LPCCA, PLS and FECCA versus the dimensions on feature combination X(G) and X(L) from the CENPARMI database.

Table 3. Six sets of features of handwritten numerals in MFD. Pix: 240-dimension pixel averages feature in 2 × 3 windows; Fac: 216-dimension profile correlations feature; Fou: 76-dimension Fourier coefficients of the character shapes feature; Kar: 64-dimension Karhunen–Loève coefficients feature; Zer: 47-dimension Zernike moments feature; Mor: 6-dimension morphological feature.

After feature extraction, the fusion strategy in (23) is employed for recognition tasks. Finally, classification is performed using the nearest neighbor (NN) classifier with the cosine distance metric. Table 2 shows the maximal recognition rates of each method and their corresponding dimensions. Taking the recognition results on the two-set features X(G) and X(L) as a representative, Fig. 3 reveals the variation of the recognition rates of each method as the dimension increases. From Table 2, we can see that FECCA clearly outperforms CCA, RCCA, LPCCA and PLS on all feature combinations. Especially on the X(G) and X(L) combination, the maximum recognition rate of FECCA is 93.4%, exceeding that of CCA by 11.1%, that of RCCA by 1.1%, that of LPCCA by 14.25%, and that of PLS by 3%. This indicates that the features extracted by FECCA are more discriminative than those extracted by CCA, RCCA, LPCCA and PLS. From Fig. 3, we can find two main points. First, the difference in recognition rates between RCCA and FECCA is not obvious when the dimension varies from 13 to 30. Second, FECCA becomes robust and consistently outperforms CCA, RCCA, LPCCA, and PLS when the dimension is over 30.

5.1.2. Experiment on the multiple feature dataset in UCI

To further examine the performance of FECCA, we use the well-known multiple feature dataset (MFD) [6] of handwritten numerals in UCI. MFD includes 10 classes of handwritten numerals, i.e., the 10 digits from 0 to 9. These digit characters are represented in terms of the following six feature sets, as shown in Table 3. There are 2000 samples in all, and 200 samples per class are available in the form of 30 × 48 binary images. According to the six sets of features shown in Table 3, we are able to obtain fifteen different pairwise feature combinations in total. In the experiment on each feature combination, the first 100 samples per class are chosen for training, while the remaining 100 samples are

5.1.2. Experiment on multiple feature dataset in UCI To further examine the performance of FECCA, we use the wellknown multiple feature dataset (MFD) [6] about handwritten numerals in UCI. MFD includes 10 classes of handwritten numerals, i.e. 10 numbers from 0 to 9. These digit characters are represented in terms of the following six feature sets, as shown in Table 3. There are 2000 samples in all and 200 samples per class are available in the form of 30  48 binary images. According to the six sets of features shown in Table 3, we are able to get fifteen different pairwise feature combinations in total. In the experiment on each feature combination, the first 100 samples per class are chosen for training, while the remaining 100 samples are


used for testing. Therefore, the total number of training samples is 1000, and the total number of testing samples is 1000. CCA, RCCA, LPCCA, PLS and FECCA are used for joint feature extraction. After feature fusion, a nearest neighbor classifier with cosine distance is employed for classification. The maximal recognition rates of each method and the corresponding dimensions are shown in Table 4. Taking the recognition results on the feature combination Fac and Fou as a representative, the recognition rates of each method versus the dimensions are given in Fig. 4. From Table 4, we can clearly observe that the recognition results of FECCA are significantly better than those of CCA, LPCCA and PLS on the fifteen different feature combinations. Particularly on the feature combination Fou and Kar, the recognition rate of FECCA reaches 98.2% using only a 46-dimensional feature vector; that is, out of 1000 testing samples, only eighteen are misclassified by FECCA. Compared with RCCA, although FECCA achieves only a small improvement in recognition rates on the feature combinations Pix–Fac, Fac–Kar, Fou–Zer, and Kar–Zer, the recognition results are improved more noticeably in the other cases, particularly on the combinations involving the feature Mor. This reveals that when the dimensions of the two-set features are greatly unbalanced, FECCA can provide more reliable and accurate recognition results in contrast with the other methods. These conclusions demonstrate again that FECCA is more effective and robust than CCA, RCCA, LPCCA and PLS in handwritten numeral recognition.

Table 4. The maximal recognition rates (percent) of CCA, RCCA, LPCCA, PLS and FECCA and the corresponding dimensions (in parentheses) on MFD.

Pairwise features   CCA         RCCA        LPCCA        PLS         FECCA
Pix–Fac             92.7 (62)   97.9 (33)   93.8 (203)   97.9 (43)   98.1 (32)
Pix–Fou             83.7 (66)   97.7 (27)   95.7 (76)    97.5 (30)   98.1 (29)
Pix–Kar             94.2 (38)   97.1 (36)   93.3 (61)    95.5 (64)   97.4 (43)
Pix–Zer             80.3 (46)   97.4 (23)   90.3 (47)    96.3 (40)   97.7 (46)
Pix–Mor             48.9 (6)    86.6 (6)    78.5 (6)     88.8 (6)    88.9 (6)
Fac–Fou             91.5 (43)   94.9 (39)   82.5 (40)    93.6 (30)   95.6 (37)
Fac–Kar             96.6 (36)   97.4 (47)   94.8 (54)    96.8 (27)   97.6 (47)
Fac–Zer             94.8 (26)   95.0 (40)   93.5 (41)    91.1 (42)   96.8 (47)
Fac–Mor             78.1 (6)    77.7 (6)    84.9 (6)     81.3 (6)    85.3 (6)
Fou–Kar             90.2 (60)   97.8 (44)   96.2 (63)    97.8 (49)   98.2 (46)
Fou–Zer             81.3 (41)   85.5 (44)   80.1 (42)    85.5 (43)   85.6 (32)
Fou–Mor             57.0 (6)    74.4 (6)    65.0 (6)     72.9 (6)    77.5 (6)
Kar–Zer             93.3 (45)   97.2 (43)   94.5 (47)    96.4 (35)   97.3 (32)
Kar–Mor             65.2 (6)    84.6 (6)    77.9 (6)     81.3 (6)    87.4 (6)
Zer–Mor             57.4 (6)    68.0 (6)    68.8 (6)     64.8 (6)    71.5 (6)

Fig. 4 shows that FECCA consistently performs better than the other methods when the dimension is larger than 10. Moreover, the recognition results of FECCA are very stable as the dimension increases.

5.2. Face recognition

In order to verify the effectiveness of FECCA in face recognition, two experiments on face image databases have been carried out. In our experiments, the original face images are considered as the first sets of features, denoted as Ori. Since the orthonormal wavelet transform can retain important information of the original face images, and the low-frequency sub-images contain more shape information than the high-frequency sub-images [23,54], we perform a two-level Daubechies wavelet decomposition to obtain the low-frequency sub-images, which are regarded as the second sets of features, denoted as Dau. Since face recognition is typically a small sample size (SSS) problem, we borrow the idea in [19,55] and use PCA to reduce the dimensions of the two input spaces Ori and Dau such that the singularity problem is avoided in the transformed spaces.

5.2.1. Experiment on the AT&T database

The AT&T database² consists of 400 face images from 40 individuals. For each individual, there are 10 different grayscale images with a resolution of 92 × 112. For some individuals, the images were taken at different times. The lighting, facial expressions (open or closed eyes, smiling or not smiling) and facial details (glasses or no glasses) also vary. The images were taken with a tolerance for some tilting and rotation of the face of up to 20°, and have some variation in scale of up to about 10%. Ten images of one person are shown in Fig. 5. In this experiment, five images per individual are randomly chosen for training, while the remaining five images are used for testing. Therefore, the total number of training samples is 200, and the total number of testing samples is 200. To avoid the singularity problem, we use PCA to reduce the dimensions of the data points in the feature sets Ori and Dau to 150 and 150, respectively, corresponding to more than 97% and 99% of the data energy. Ten independent tests are performed and the average recognition rates are computed to examine the performance of CCA, RCCA, LPCCA, PLS and FECCA. The cosine NN classifier is employed for recognition. Table 5 shows the maximal average recognition rates across ten runs of each method and their corresponding standard deviations and dimensions. Fig. 6 reveals the recognition rates of each method versus the variation of the dimensions. As we can see from Table 5, FECCA is clearly superior to CCA, RCCA, LPCCA, and PLS. The recognition rate of FECCA exceeds that of CCA by 13.3%, that of RCCA by 3.9%, that of LPCCA by 7.65% and that of PLS by 6.4%. RCCA achieves the second best recognition result. LPCCA and PLS obtain comparable results, and CCA achieves the worst recognition result. Fig. 6 indicates that FECCA consistently outperforms CCA, RCCA, LPCCA and PLS, regardless of the variation of the dimension.

5.2.2. Experiment on the AR database

The AR database [56] contains over 4000 color face images of 126 individuals (70 men and 56 women). These images are frontal views of faces with different facial expressions, lighting conditions and occlusions. The pictures of most people were taken in two time periods (i.e., two phases). Each phase contains 13 color images per individual, and 120 persons participated in both time periods.
In our experiments, the images of 120 individuals are selected from both phases and one person has 14 images, which are full facial images (i.e. not occluded). We crop these face images and

Fig. 4. The recognition rates of CCA, RCCA, LPCCA, PLS and FECCA versus the dimensions on feature combination Fac and Fou from MFD.

² http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.


Fig. 5. Ten images of one person in the AT&T face database.

Table 5. The maximal average recognition rates (percent) with their corresponding standard deviations and optimal dimensions of CCA, RCCA, LPCCA, PLS and FECCA across ten runs on the AT&T database.

Method   CCA            RCCA           LPCCA          PLS            FECCA
Acc.     83.05 ± 3.01   92.45 ± 2.05   88.70 ± 1.67   89.95 ± 3.18   96.35 ± 0.97
Dim.     150            98             148            93             97

Fig. 6. The recognition rates of CCA, RCCA, LPCCA, PLS and FECCA versus the variation of the dimensions on the AT&T database.

Fig. 7. Fourteen gray images of one person in the AR face database.

resize them to 50 × 40 pixels. Then, the cropped color images are converted to grayscale images. Fig. 7 shows the cropped gray images of one person.

In our experiments, l images (l = 3, 5, 7, and 9, respectively) per individual are randomly selected to form the training set, and the remaining 14 − l images are taken as the testing set. For each given l,


PCA is used to reduce the dimensions of the two-set features Ori and Dau to 150 and 150, respectively, corresponding to more than 94% and 99% of training data energy. Ten independent classification tests are performed to examine the performance of CCA, RCCA, LPCCA, PLS and FECCA. Then, the average recognition rates are computed for each method. The nearest neighbor classifier with cosine distance is employed for classification tasks. The average recognition results across ten runs of each method are listed in Table 6. Taking the recognition results with three training samples per individual as a

Table 6. The maximal average recognition rates (percent) with their corresponding standard deviations and optimal dimensions of CCA, RCCA, LPCCA, PLS and FECCA across ten runs on the AR database. Entries are Acc. ± std. (Dim.).

Method   3 Train.             5 Train.              7 Train.             9 Train.
CCA      86.86 ± 1.81 (147)   92.20 ± 1.35 (149)    94.95 ± 1.11 (150)   96.65 ± 0.78 (139)
RCCA     80.82 ± 3.07 (150)   88.76 ± 1.99 (150)    93.81 ± 1.41 (146)   95.05 ± 1.27 (150)
LPCCA    81.09 ± 1.44 (150)   89.46 ± 0.99 (150)    94.50 ± 1.34 (150)   96.97 ± 0.91 (148)
PLS      62.98 ± 8.94 (149)   71.46 ± 10.71 (145)   76.93 ± 7.18 (150)   82.57 ± 9.69 (148)
FECCA    88.93 ± 1.69 (147)   93.82 ± 1.50 (129)    97.21 ± 0.89 (134)   98.40 ± 0.24 (106)

representative, the recognition rates of each method versus the dimensions are given in Fig. 8. As shown in Table 6, our proposed FECCA method again prominently outperforms CCA, RCCA, LPCCA, and PLS, no matter how many training samples per individual are used. Furthermore, the dimensions corresponding to the best recognition results of FECCA decrease as the number of training samples increases. On this dataset, CCA also performs better on the whole than RCCA, LPCCA, and PLS. The PLS method performs the worst in all cases. Fig. 8 reveals that the recognition rates of FECCA are stable and consistently superior to those of the other methods when the dimension is over 100.

5.3. Object recognition

The COIL-20 database [57] contains 1440 grayscale images of 20 objects (72 images per object) under various poses. These objects have a wide variety of complex geometric, appearance and reflectance characteristics. They are rotated through 360° against a black background and imaged at intervals of 5°. The size of each object image is 128 × 128 pixels. Fig. 9 shows twenty different object images from the COIL-20 database. In this experiment, 30 images per class (object) are randomly chosen to form the training set, while the remaining 42 images are regarded as the testing set. Thus, the total numbers of training and testing samples are, respectively, 600 and 840. Similarly to the previous experiments described in Section 5.2, we follow the same method to form the two sets of features Ori and Dau, and perform PCA to reduce their dimensions to 150 and 150, respectively. After this, CCA, RCCA, LPCCA, PLS and FECCA are used for joint feature extraction. For each method, ten independent tests are carried out under the nearest neighbor classifier with cosine distance, and the average recognition rates are computed. The recognition results of each method are summarized in Table 7. The recognition rates of each method versus the dimensions are given in Fig. 10. From Table 7, we can find that FECCA performs better than CCA, RCCA, LPCCA, and PLS. This demonstrates again that the features extracted by FECCA have better discriminant ability in contrast with those extracted by the other methods. RCCA and PLS perform comparably to our method and better than CCA and LPCCA. LPCCA performs the worst of all methods, and its dimension corresponding to the best recognition rate is the largest. Fig. 10 shows that the difference in recognition rates between RCCA and FECCA is not very obvious as the dimension increases. Despite this, FECCA overall outperforms CCA, RCCA, LPCCA, and PLS as the dimension increases.

5.4. Influence of fractional-order parameters on FECCA performance

Fig. 8. The recognition rates of CCA, RCCA, LPCCA, PLS and FECCA with three training samples per individual versus the dimensions on the AR database.

Compared with CCA, FECCA has three fractional-order parameters α1 , α2 and β. In this subsection, we evaluate how FECCA performs with different values of the parameters. The datasets used for this test are, respectively, the AT&T face database and the COIL-20 object database. On the two databases, we adopt the same experimental settings as those used in Sections 5.2.1 and 5.3.
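A sensitivity study of this kind can be scripted directly. The sketch below is ours and mirrors the protocol behind Figs. 11 and 12 (one parameter fixed, the other two varied over the grid {0.1, …, 1}); fecca, fuse and cosine_nn_predict are the sketches given in earlier sections.

```python
import numpy as np
from itertools import product

def grid_recognition_rates(X1_tr, X2_tr, y_tr, X1_te, X2_te, y_te,
                           d=30, beta=0.1):
    """Recognition rate over a grid of (alpha1, alpha2) with beta fixed."""
    grid = np.arange(0.1, 1.01, 0.1)
    rates = np.zeros((len(grid), len(grid)))
    for (i, a1), (j, a2) in product(enumerate(grid), enumerate(grid)):
        U, V = fecca(X1_tr, X2_tr, d, a1, a2, beta)
        pred = cosine_nn_predict(fuse(U, V, X1_tr, X2_tr), y_tr,
                                 fuse(U, V, X1_te, X2_te))
        rates[i, j] = np.mean(pred == y_te)
    return rates
```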

Fig. 9. Twenty object images from the COIL-20 database. For each object, only one image is shown.


Figs. 11 and 12 show the average recognition rates as a function of two of these three parameters on the AT&T and COIL-20 databases. From Figs. 11 and 12, we can clearly see that our proposed FECCA is not sensitive on the whole to these fractional-order

Table 7. The maximal average recognition rates (percent) with their corresponding standard deviations and optimal dimensions of CCA, RCCA, LPCCA, PLS and FECCA across ten runs on the COIL-20 database.

Method   CCA            RCCA           LPCCA          PLS            FECCA
Acc.     94.81 ± 0.73   97.86 ± 0.57   94.70 ± 0.93   97.12 ± 0.73   98.17 ± 0.42
Dim.     67             50             149            43             66

Fig. 10. The recognition rates of CCA, RCCA, LPCCA, PLS and FECCA versus the variation of the dimensions on the COIL-20 database.


parameters on the AT&T and COIL-20 databases. More specifically, on the AT&T face database, Fig. 11a shows that the performance of FECCA decreases monotonically over a small range with the increase of the parameters α_1 and α_2, while Fig. 11b and c indicate that FECCA is stable with respect to the other two pairs of fractional-order parameters, i.e., α_1 and β, and α_2 and β. On the COIL-20 object database, although Fig. 12a–c reveal that FECCA has some small fluctuations, it is overall stable with respect to these fractional-order parameters.

5.5. Computational efficiency comparison In this section, we discuss the computational cost of our proposed FECCA algorithm in comparison to CCA, RCCA, LPCCA, and PLS. In brief, if we use big O notation [58] to express the time complexity of one algorithm, the FECCA algorithm has the same complexity as CCA for given fractional-order parameters. However,

Fig. 13. The average training time of CCA, RCCA, LPCCA, PLS, and FECCA across ten runs on different datasets (CPU: Pentium (R) Dual-Core 2.93 GHz, RAM: 2 GB, in MATLAB).

Fig. 11. The average recognition rates versus fractional-order parameters α1, α2, and β on the AT&T face database. Note that in (a–c), when two fractional-order parameters vary, another is fixed as 0.1.

Fig. 12. The average recognition rates versus fractional-order parameters α1, α2, and β on the COIL-20 object database. Note that in (a), (b), and (c), when two fractional-order parameters vary, another is fixed as 0.1.


differences statistically significant? In this section, we evaluate the experimental results using a paired t-test with significance level 0.05. Tables 8–10 summarize the results of the paired t-test. In each table, when the digit is “1”, A4B means that the recognition rates of the method A are significantly higher than those of the method B, and when the digit is “0”, this means that the method A does not significantly outperform the method B for the given confidence level. From Tables 8 and 9, we can find that, on most cases, our proposed FECCA method significantly outperforms other methods at the 5% significance level in handwritten numeral recognition. In face and object recognition, Table 10 shows that the FECCA's improvements for recognition rates are statistically significant on all cases. These results demonstrate again that FECCA is a powerful technique for joint feature extraction and classification tasks.



Table 8
The paired t-test results of the recognition rates reported in Table 2.

Feature pair   FECCA > CCA   FECCA > RCCA   FECCA > LPCCA   FECCA > PLS
X(G)–X(L)           1              1               1              1
X(G)–X(P)           1              1               1              1
X(L)–X(P)           1              1               1              1
X(Z)–X(G)           1              0               1              1
X(Z)–X(L)           1              1               1              1
X(Z)–X(P)           1              1               1              0



Table 9
The paired t-test results of the recognition rates reported in Table 4.

Feature pair   FECCA > CCA   FECCA > RCCA   FECCA > LPCCA   FECCA > PLS
Pix–Fac             1              1               1              1
Pix–Fou             1              1               1              1
Pix–Kar             1              1               1              1
Pix–Zer             1              0               1              1
Pix–Mor             1              0               1              0
Fac–Fou             1              1               1              1
Fac–Kar             1              1               1              1
Fac–Zer             1              1               1              1
Fac–Mor             1              1               0              0
Fou–Kar             1              1               1              1
Fou–Zer             1              1               1              1
Fou–Mor             1              0               0              0
Kar–Zer             1              0               1              1
Kar–Mor             1              0               0              1
Zer–Mor             1              1               0              1

Table 10
The paired t-test results of the recognition rates reported in Tables 5–7.

Dataset    FECCA > CCA   FECCA > RCCA   FECCA > LPCCA   FECCA > PLS
AT&T            1              1               1              1
AR_3            1              1               1              1
AR_5            1              1               1              1
AR_7            1              1               1              1
AR_9            1              1               1              1
COIL-20         1              1               1              1

Note: AR_# denotes the case where # images per class are randomly chosen for training.



5.7. Further experimental analysis

When training samples are scarce or disturbed by noise, sample covariance matrices usually deviate from the true ones. In this case, CCA cannot yield good low-dimensional features for recognition purposes, whereas the proposed FECCA method can handle these challenges well. To elaborate on how FECCA tackles noise and the shortage of training samples, let us first define the deviated degree between the sample and true covariance matrices by

s(Ŝ, S) = ‖Ŝ − S‖_F / ‖S‖_F,    (27)

where S denotes the true covariance matrix, Ŝ denotes the corresponding sample covariance matrix, and ‖·‖_F denotes the Frobenius norm. Obviously, if Ŝ is a good approximation of S, the deviated degree in (27) is very small; otherwise, it is large. Now, let us design a simulated experiment to show how FECCA tackles this bias. In our experiment, we employ the multivariate normal distribution to generate two sets of simulated data, one of which lacks sufficient training samples while the other is disturbed by noise. Note that the noise data also follow a multivariate normal distribution. Fig. 14 shows the deviated degree of sample covariance matrices and fractional-order scatter matrices versus the dimension.

Fig. 14. The deviated degree of sample covariance matrices and fractional-order scatter matrices versus the dimensions, where SCM and FCM respectively denote the sample covariance matrix and the fractional-order scatter matrix, and the number of training samples is fixed at one hundred as the dimension varies. In (a), training samples are lacking and not disturbed by noise. In (b), training samples have noise disturbance.

From Fig. 14a and b, we can see that fractional-order scatter matrices can obviously alleviate the bias of sample covariance matrices, particularly when the dimension is large. In other words, the fractional-order within-set and between-set scatter matrices in FECCA remain a good approximation of the true covariance matrices. This indicates that the proposed FECCA can effectively tackle the deviation problem by adjusting its fractional-order parameters when training samples are scarce or disturbed by noise.
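A minimal sketch of this kind of simulation is given below. It assumes the fractional-order scatter matrix is obtained by raising the eigenvalues of the sample covariance to a fractional power α (the construction used in FECCA may include additional normalization), takes the true covariance to be the identity for simplicity, and measures the deviated degree of Eq. (27).

```python
# Sketch of the small-sample simulation: compare the deviated degree of the
# sample covariance matrix (SCM) with that of a fractional-order scatter
# matrix (FCM) obtained by powering the SCM's eigenvalues by alpha.
import numpy as np

def deviated_degree(S_hat, S):
    """s(S_hat, S) = ||S_hat - S||_F / ||S||_F, Eq. (27)."""
    return np.linalg.norm(S_hat - S, "fro") / np.linalg.norm(S, "fro")

def fractional_scatter(S_hat, alpha):
    """Raise the (non-negative) eigenvalues of a symmetric matrix to the power alpha."""
    w, V = np.linalg.eigh(S_hat)
    return V @ np.diag(np.clip(w, 0.0, None) ** alpha) @ V.T

rng = np.random.default_rng(0)
n, alpha = 100, 0.5                       # 100 training samples, arbitrary alpha
for d in (50, 100, 200, 400):             # dimension varies, sample size fixed
    S_true = np.eye(d)                    # assumed true covariance (identity, for simplicity)
    X = rng.multivariate_normal(np.zeros(d), S_true, size=n)
    S_hat = np.cov(X, rowvar=False, bias=True)
    S_frac = fractional_scatter(S_hat, alpha)
    print(d, round(deviated_degree(S_hat, S_true), 3),
             round(deviated_degree(S_frac, S_true), 3))
```

Under these assumptions, powering the eigenvalues pulls them back toward one, so the fractional-order matrix deviates less from the true covariance than the raw sample covariance, mirroring the trend reported in Fig. 14.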

6. Conclusion

In this paper, we have proposed a new approach for multi-view dimensionality reduction, called fractional-order embedding canonical correlation analysis (FECCA), which is based on fractional-order within-set and between-set scatter matrices. These scatter matrices can be viewed as re-estimates of the within-set and between-set covariance matrices and alleviate the significant bias of sample covariance matrices caused by noise and the limited number of training samples. The proposed FECCA method has been evaluated on various handwritten numeral, face, and object recognition problems. A series of experimental results show that, in most cases, FECCA significantly outperforms CCA, RCCA, LPCCA, and PLS at the 5% significance level.

In FECCA, three fractional-order parameters play important roles. Although our empirical tests show that the performance of FECCA is not very sensitive to the variation of these parameters, it is worthwhile to investigate a principled criterion for choosing optimal fractional-order parameters instead of relying on cross-validation. In addition, FECCA is essentially an unsupervised multi-view dimensionality reduction method; from the classification point of view, label information should be introduced into the proposed method when it is available. In kernel-based learning, due to the implicit high-dimensional nonlinear mapping determined by the kernel, many typical "large sample size" problems in the input space become small sample size (SSS) problems in the feature space, where the covariance matrix estimates are generally biased. Consequently, it is necessary to introduce the fractional-order idea into KCCA. Moreover, in this paper we only consider two-view dimensionality reduction using FECCA; it remains unclear how to extend FECCA to handle multi-view (at least three views) dimensionality reduction. We are currently exploring these problems in theory and practice.

Conflict of interest statement

None declared.

Acknowledgments

This work is partially supported by the National Science Foundation of China under Grant no. 61273251. It is also partially supported by the Scientific Research and Innovation Project Fund for Graduate Students of Jiangsu Provincial Higher Education Institutions under Grant no. CXZZ110260, and the Excellent Doctoral Graduate Education Fund of Nanjing University of Science & Technology (NUST). We would like to thank the editor and the anonymous reviewers for their constructive comments, which significantly improved the presentation of this paper.

References

[1] M. Sargin, Y. Yemez, E. Erzin, A. Tekalp, Audio-visual synchronization and fusion using canonical correlation analysis, IEEE Transactions on Multimedia 9 (7) (2007) 1396–1403.
[2] M. Slaney, M. Covell, Facesync: a linear operator for measuring synchronization of video facial images and audio tracks, in: Proceedings of Neural Information Processing Systems, 2000, pp. 814–820.
[3] C.C. Chibelushi, F. Deravi, J.S.D. Mason, A review of speech-based bimodal recognition, IEEE Transactions on Multimedia 4 (1) (2002) 23–37.
[4] P. Glenisson, J. Mathys, B.D. Moor, Meta-clustering of gene expression data and literature-based information, ACM SIGKDD Explorations Newsletter 5 (2) (2003) 101–112.
[5] A.A. Nielsen, Multiset canonical correlations analysis and multispectral, truly multitemporal remote sensing data, IEEE Transactions on Image Processing 11 (3) (2002) 293–305.
[6] A.K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (1) (2000) 4–37.
[7] Bo Long, Philip S. Yu, Zhongfei (Mark) Zhang, A general model for multiple view unsupervised learning, in: Proceedings of the 2008 SIAM International Conference on Data Mining, 2008, pp. 822–833.
[8] C. Hou, C. Zhang, Y. Wu, F. Nie, Multiple view semi-supervised dimensionality reduction, Pattern Recognition 43 (3) (2010) 720–730.
[9] Zhang Jintao, Jun Huan, Inductive multi-task learning with multiple view data, in: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 543–551.
[10] F. Camastra, A. Vinciarelli, Estimating the intrinsic dimension of data with a fractal-based method, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (10) (2002) 1404–1407.
[11] Jidong Zhao, Ke Lu, Xiaofei He, Locality sensitive semi-supervised feature selection, Neurocomputing 71 (2008) 1842–1849.
[12] Xiaofei He, Ming Ji, Chiyuan Zhang, Hujun Bao, A variance minimization criterion to feature selection using laplacian regularization, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (10) (2011) 2013–2025.



[13] S. Bickel, T. Scheffer, Multi-view clustering, in: Proceedings of the 4th IEEE International Conference on Data Mining, 2004, pp. 19–26.
[14] K. Chaudhuri, S.M. Kakade, K. Livescu, K. Sridharan, Multi-view clustering via canonical correlation analysis, in: Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 129–136.
[15] S.M. Kakade, D.P. Foster, Multi-view regression via canonical correlation analysis, in: Proceedings of the 20th Annual Conference on Learning Theory, 2007, pp. 82–96.
[16] I.T. Jolliffe, Principal Component Analysis, second ed., Springer-Verlag, New York, 2002.
[17] M. Turk, A. Pentland, Eigenfaces for recognition, Journal of Cognitive Neuroscience 3 (1) (1991) 71–86.
[18] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, Boston, 1990.
[19] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 711–720.
[20] X. He, P. Niyogi, Locality preserving projections, in: Proceedings of the 16th Conference on Neural Information Processing Systems, 2003, pp. 585–591.
[21] X. He, S. Yan, Y. Hu, P. Niyogi, H.J. Zhang, Face recognition using laplacianfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (3) (2005) 328–340.
[22] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[23] Q.S. Sun, S.G. Zeng, Y. Liu, P.A. Heng, D.S. Xia, A new method of feature fusion and its application in image recognition, Pattern Recognition 38 (12) (2005) 2437–2448.
[24] D.P. Foster, S.M. Kakade, T. Zhang, Multi-view Dimensionality Reduction via Canonical Correlation Analysis, Technical Report TTI-TR-2008-4, 2008.
[25] Y.H. Yuan, Q.S. Sun, Q. Zhou, D.S. Xia, A novel multiset integrated canonical correlation analysis framework and its application in feature fusion, Pattern Recognition 44 (5) (2011) 1031–1040.
[26] X.H. Chen, S.C. Chen, H. Xue, X.D. Zhou, A unified dimensionality reduction framework for semi-paired and semi-supervised multi-view data, Pattern Recognition 45 (5) (2012) 2005–2018.
[27] A. Kimura, H. Kameoka, M. Sugiyama, SemiCCA: efficient semi-supervised learning of canonical correlations, in: Proceedings of the 20th International Conference on Pattern Recognition, Istanbul, Turkey, 2010.
[28] L. Sun, S.W. Ji, J.P. Ye, Canonical correlation analysis for multilabel classification: a least-squares formulation, extensions, and analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (1) (2011) 194–200.
[29] I. Gonzalez, S. Dejean, P. Martin, A. Baccini, CCA: an R package to extend canonical correlation analysis, Journal of Statistical Software 23 (12) (2008) 1–14.
[30] T.K. Sun, S.C. Chen, Locality preserving CCA with applications to data visualization and pose estimation, Image and Vision Computing 25 (5) (2007) 531–543.
[31] Q.S. Sun, Z. Jin, P.A. Heng, D.S. Xia, A novel feature fusion method based on partial least squares regression, in: Proceedings of the Third International Conference on Advances in Pattern Recognition, 2005, pp. 268–277.
[32] Q.S. Sun, Z.D. Liu, P.A. Heng, D.S. Xia, A theorem on the generalized canonical projective vectors, Pattern Recognition 38 (3) (2005) 449–452.
[33] T.K. Kim, J. Kittler, R. Cipolla, Discriminative learning and recognition of image set classes using canonical correlations, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6) (2007) 1005–1018.
[34] T. Melzer, M. Reiter, H. Bischof, Appearance models based on kernel canonical correlation analysis, Pattern Recognition 36 (9) (2003) 1961–1971.
[35] H. Hotelling, Relations between two sets of variates, Biometrika 28 (3/4) (1936) 321–377.

[36] Lu Juwei, K.N. Plataniotis, A.N. Venetsanopoulos, Regularized discriminant analysis for the small sample size problem in face recognition, Pattern Recognition Letters 24 (16) (2003) 3079–3087.
[37] S.E. Leurgans, R.A. Moyeed, B.W. Silverman, Canonical correlation analysis when the data are curves, Journal of the Royal Statistical Society B 55 (3) (1993) 725–740.
[38] A. Lykou, J. Whittaker, Sparse CCA using lasso with positivity constraints, Computational Statistics and Data Analysis 54 (2) (2010) 3144–3157.
[39] D.R. Hardoon, J. Shawe-Taylor, Sparse canonical correlation analysis, Machine Learning 83 (3) (2011) 331–353.
[40] A. Hendrikse, L. Spreeuwers, R. Veldhuis, A bootstrap approach to eigenvalue correction, in: Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, 2009, pp. 818–823.
[41] O. Ledoit, M. Wolf, A well-conditioned estimator for large-dimensional covariance matrices, Journal of Multivariate Analysis 88 (2) (2004) 365–411.
[42] D. Hardoon, S. Szedmak, J. Shawe-Taylor, Canonical correlation analysis: an overview with application to learning methods, Neural Computation 16 (12) (2004) 2639–2664.
[43] M. Barker, W. Rayens, Partial least squares for discrimination, Journal of Chemometrics 17 (3) (2003) 166–173.
[44] Quan-sen Sun, Research on Feature Extraction and Image Recognition Based on Correlation Projection Analysis (Ph.D. dissertation), Nanjing University of Science and Technology, 2006.
[45] Yi-Fei Pu, Ji-Liu Zhou, Xiao Yuan, Fractional differential mask: a fractional differential-based approach for multiscale texture enhancement, IEEE Transactions on Image Processing 19 (2) (2010) 491–511.
[46] W. Pan, K. Qin, Y. Chen, An adaptable-multilayer fractional Fourier transform approach for image registration, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (3) (2009) 400–413.
[47] J. Liu, S. Chen, X. Tan, Fractional order singular value decomposition representation for face recognition, Pattern Recognition 41 (1) (2008) 378–395.
[48] G.H. Golub, C.F. Van Loan, Matrix Computations, third ed., The Johns Hopkins University Press, Baltimore, London, 1996.
[49] A. Hendrikse, R. Veldhuis, L. Spreeuwers, Eigenvalue correction results in face recognition, in: Proceedings of the 29th Symposium on Information Theory in the Benelux, 2008, pp. 27–35.
[50] P. Xu, G.N. Brock, R.S. Parrish, Modified linear discriminant analysis approaches for classification of high-dimensional microarray data, Computational Statistics and Data Analysis 53 (5) (2009) 1674–1687.
[51] Z. Lou, K. Liu, J.Y. Yang, C.Y. Suen, Rejection criteria and pairwise discrimination of handwritten numerals based on structural features, Pattern Analysis and Applications 2 (3) (1992) 228–238.
[52] J. Yang, A.F. Frangi, J.Y. Yang, D. Zhang, Z. Jin, KPCA plus LDA: a complete kernel fisher discriminant framework for feature extraction and recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2) (2005) 230–244.
[53] Z.S. Hu, Z. Lou, J.Y. Yang, K. Liu, J.Y. Sun, Handwritten digit recognition based on multi-classifier combination, Chinese Journal of Computers 22 (4) (1999) 369–374.
[54] Yun-Hao Yuan, Quan-Sen Sun, Discriminative learning of multiset integrated canonical correlation analysis for feature fusion, in: Proceedings of the 15th International Conference on Information Fusion, 2012, pp. 882–887.
[55] J. Yang, L. Zhang, J.Y. Yang, D. Zhang, From classifiers to discriminators: a nearest neighbor rule induced discriminant analysis, Pattern Recognition 44 (7) (2011) 1387–1402.
[56] A.M. Martinez, R. Benavente, The AR Face Database, CVC Technical Report #24, June 1998.
[57] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Technical Report CUCS-005-96, February 1996.
[58] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, second ed., MIT Press and McGraw-Hill, 2001.

Yun-Hao Yuan received the M.Eng. degree in Computer Science and Technology from Yangzhou University. He is currently working toward the Ph.D. degree in Pattern Recognition and Intelligence Systems at Nanjing University of Science and Technology (NUST). He has published more than 10 scientific papers and is a member of the International Society of Information Fusion (ISIF). His research interests include pattern recognition, image processing, computer vision, and information fusion.

Quan-Sen Sun received the Ph.D. degree in Pattern Recognition and Intelligence Systems from Nanjing University of Science and Technology (NUST), China, in 2006. He is a professor in the Department of Computer Science at NUST. He visited the Department of Computer Science and Engineering, The Chinese University of Hong Kong, in 2004 and 2005. His current interests include pattern recognition, image processing, remote sensing information systems, and medical image analysis.

Hong-Wei Ge received the M.S. degree in Computer Science and Technology from Nanjing University of Aeronautics and Astronautics, China, in 1992, and Ph.D. degree in Pattern Recognition and Intelligence System from Jiangnan University, China, in 2008. Currently, he is a professor in the School of Internet of Things, Jiangnan University. He has published more than 60 scientific papers. His research interests include neuro-fuzzy systems, image processing, machine learning, and pattern recognition.