Probabilistic learning of similarity measures for tensor PCA


Pattern Recognition Letters 33 (2012) 1364–1372


Kwanyong Lee (Korea National Open University, Seoul, Republic of Korea)
Hyeyoung Park (corresponding author, [email protected]; Kyungpook National University, Daegu, Republic of Korea)

Article history: Received 21 December 2010; available online 4 April 2012. Communicated by G. Moser.

Keywords: Tensor; Principal component analysis; Similarity measure; Probabilistic learning

Abstract

In order to extract low-dimensional features from image data, matrix-based subspace methods such as 2DPCA and tensor PCA have recently been proposed. Since these methods extract features from 2D image matrices rather than 1D vectors, they can preserve useful information in the image matrices, and better classification performance can be expected from the matrix features. To maximize the advantages of matrix features, it is also important to use an appropriate similarity measure between two feature matrices. This paper proposes a method for learning similarity measures for feature matrices that utilizes the distribution properties of a given data set and its class membership. Through computational experiments with facial image data, we confirm that the similarity measure obtained by the proposed method gives better classification performance than conventional similarity measures for matrix data.

1. Introduction

Principal component analysis (PCA), a well-known linear subspace method, has been widely used for extracting statistically meaningful features in pattern recognition and computer vision. Especially for image classification, PCA gives efficient low-dimensional features that guarantee minimum reconstruction error in the squared-error sense, and it has shown successful results in various classification problems. Since the remarkable success in face recognition reported by Turk and Pentland (1991), there has been active work on applying PCA to numerous applications by combining appropriate variations of classical PCA with diverse classification methods.

When applying classical PCA to image data, each image matrix needs to be transformed into a vector. However, this may cause some loss of information in the original matrix representation. To solve this problem, there have been a number of works on extracting features directly from image matrices rather than 1D vectors. Yang and Yang (2002) and Yang et al. (2004) proposed 2DPCA (IMPCA), which derives eigenvectors from image covariance matrices constructed directly from the original image matrices. A block-based PCA, which breaks an image into several blocks, has also been developed (Vidal-Naquet and Ullman, 2003; Ullman et al., 2001). Wang et al. (2005) showed that 2DPCA is a specific case of block-based PCA that uses column vectors as blocks. Though these matrix-based feature extraction methods have shown interesting results on various image databases, they have the limitation of performing only one-directional compression.

To overcome this drawback, several works taking bidirectional compression of the image matrix have been proposed. The bidirectional PCA (BDPCA) proposed by Zuo et al. (2006), the (2D)²PCA proposed by Zhang and Zhou (2005), and the G2D-PCA proposed by Hong et al. (2005) use two linear projection matrices that take row-based blocks and column-based blocks into account at the same time. More recently, many variations of these matrix-based PCA methods have been developed. Hu et al. (2010) proposed multi-oriented 2DPCA, which adds rotated variations of the original image for 2DPCA learning. Zeng and Feng (2011) developed the two-directional variation of 2DPCA by transforming the block structure of the image matrix. Chen et al. (2009) combined 2DPCA and sub-pattern PCA to extend it to color information. Yang et al. (2010) proposed the Laplacian bidirectional PCA (LBDPCA), an extension of BDPCA to a non-Euclidean space defined by a Laplacian scatter matrix with similarity weights.

As a further extension of matrix-based feature extraction, He et al. (2005) and Cai et al. (2005) proposed the tensor subspace method, which can deal with higher-order tensor data. When nth-order tensor data are given, tensor PCA, one of the tensor subspace methods, finds an n-linear projection that captures most of the original tensorial input variation. The multilinear PCA (MPCA) proposed by Lu et al. (2008, 2011) and the generalized N-dimensional PCA (GND-PCA) proposed by Xu and Chen (2009) take the same approach. In this paper, we focus on tensor PCA for 2D images (i.e., second-order tensors), which can be regarded as BDPCA and (2D)²PCA. A more precise description of tensor PCA is given in Section 2.

When we apply these feature extraction methods to pattern classification, it is also important to find a good measure for evaluating similarities between the extracted features.


In the case of matrix-based PCA, the extracted features are also given in matrix form. However, most conventional studies on pattern classification using matrix features have used the vector-based Euclidean distance, which may still cause loss of spatial locality information. To solve this problem, there have been several studies on using matrix norms as distance measures. Zuo et al. (2006) proposed an assembled matrix distance (AMD) metric for 2DPCA and compared it with other popular matrix-norm distance measures. Meng and Zhang (2007) proposed to use a volume measure of a matrix for 2DPCA-based image recognition. More recently, Zhao and Xu (2011) proposed a p–q distance, which is a generalization of the AMD metric. Though these works have shown that classification performance can be improved by using an appropriate metric for matrix features, it is still not clear what kind of metric is appropriate for a given task.

In this paper, we propose a probabilistic learning method for finding an appropriate similarity measure for given matrix features and classification tasks. Through Bayesian analysis of the differences between features within the same class, we estimate a probabilistic distance between features. Since the features are given as matrices, the probability density function also needs to be defined for matrix random variables. Based on previous works on learning a similarity metric for vector features (Moghaddam et al., 1999; Lee and Park, 2003), this paper proposes an extended version for matrix features.

2. Tensor PCA

To make the present paper self-contained, we briefly describe tensor PCA based on the general framework for tensor subspace analysis described in Cai et al. (2005). Let $X \in \mathbb{R}^{n_1 \times n_2}$ denote an image of size $n_1 \times n_2$. $X$ can be considered as a second-order tensor in the tensor space $\mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$. Let $(u_1, \ldots, u_{n_1})$ and $(v_1, \ldots, v_{n_2})$ be sets of orthonormal basis vectors of $\mathbb{R}^{n_1}$ and $\mathbb{R}^{n_2}$, respectively. Then an image $X$ can be represented using the basis vectors as



$$X = \sum_{ij} y_{ij}\, u_i v_j^T, \tag{1}$$

where $\{u_i v_j^T\}$ forms a basis of the tensor space $\mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$ and $y_{ij} = u_i^T X v_j$ is a bilinear projection coefficient onto the basis vectors $u_i$ and $v_j$. By selecting a smaller number of vectors $(u_1, \ldots, u_{l_1})$ and $(v_1, \ldots, v_{l_2})$ ($l_1 < n_1$, $l_2 < n_2$), we can define a tensor subspace $U \otimes V$ of $\mathbb{R}^{l_1} \otimes \mathbb{R}^{l_2}$, where $U$ and $V$ are subspaces of $\mathbb{R}^{n_1}$ and $\mathbb{R}^{n_2}$ spanned by $\{u_i\}_{i=1}^{l_1}$ and $\{v_j\}_{j=1}^{l_2}$, respectively. Tensor subspace analysis is to find appropriate basis vectors $\{u_i\}_{i=1}^{l_1}$ and $\{v_j\}_{j=1}^{l_2}$ for a given purpose. Using the basis vectors, it is possible to define two linear transformation matrices $U = [u_1, \ldots, u_{l_1}]$ and $V = [v_1, \ldots, v_{l_2}]$ by which a low-dimensional feature matrix $Y = [y_{ij}]$ in $\mathbb{R}^{l_1} \otimes \mathbb{R}^{l_2}$ is obtained from $X$. Given a set of data points $\{X_1, \ldots, X_N\}$ in $\mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$, we try to find two transformation matrices $U$ of size $n_1 \times l_1$ and $V$ of size $n_2 \times l_2$, which map a data point $X_i$ to a feature $Y_i$ in $\mathbb{R}^{l_1} \otimes \mathbb{R}^{l_2}$ as

$$Y_i = U^T X_i V. \tag{2}$$

A specific purpose of the subspace analysis can be defined as an objective function $J(U, V)$. Like PCA, tensor PCA tries to find a tensor subspace of maximal variance so that the reconstruction error can be minimized. The objective function $J(U, V)$ of tensor PCA can be described by

$$\max_{U,V} J(U, V) = \max_{U,V} \sum_{i=1}^{N} \| Y_i - M_Y \|^2, \tag{3}$$

where

$$M_Y = \frac{1}{N} \sum_{i=1}^{N} Y_i = \frac{1}{N} \sum_{i=1}^{N} U^T X_i V. \tag{4}$$

Using the definition of the Frobenius norm of a matrix, $\|A\|^2 = \mathrm{tr}(AA^T) = \mathrm{tr}(A^T A)$, we obtain

$$J(U, V) = \mathrm{tr}(U^T S^X U) = \mathrm{tr}(V^T S_X V), \tag{5}$$

where

$$S_X = \frac{1}{N} \sum_{i=1}^{N} (X_i - M_X)^T (X_i - M_X), \tag{6}$$

$$S^X = \frac{1}{N} \sum_{i=1}^{N} (X_i - M_X)(X_i - M_X)^T, \tag{7}$$

and $M_X = \frac{1}{N} \sum_i X_i$. Here, the assumption of orthonormal bases ($U^T U = I$ and $V^T V = I$) is taken. The optimal projections $U$ and $V$ should be composed of the largest eigenvectors of the covariance matrices $S^X$ and $S_X$, respectively.

The well-known 2DPCA and line-based PCA can be considered specific versions of tensor PCA. In the case of 2DPCA, the transformation matrix $V$ is fixed as the identity matrix and only $U$ is optimized with respect to $J(U, I)$. In the case of line-based PCA, $U$ is fixed as the identity matrix and $V$ is optimized with respect to $J(I, V)$. Though previous works on similarity measures for tensor PCA mainly dealt with 2DPCA, we apply the general tensor PCA to extract matrix features of size $l_1 \times l_2$ ($1 \le l_1 \le n_1$, $1 \le l_2 \le n_2$).
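To make the procedure concrete, the following sketch (our NumPy illustration, not code from the paper; all function and variable names are our own) builds the projections from the leading eigenvectors of the covariance matrices in Eqs. (6) and (7) and extracts the bilinear features of Eq. (2).

```python
import numpy as np

def tensor_pca(X, l1, l2):
    """Tensor PCA for second-order tensors (a sketch of Eqs. (5)-(7)).
    X: array of shape (N, n1, n2) holding N image matrices."""
    M = X.mean(axis=0)                                 # mean image M_X
    Xc = X - M                                         # centered images
    S_row = np.einsum('kij,klj->il', Xc, Xc) / len(X)  # S^X, Eq. (7): n1 x n1
    S_col = np.einsum('kji,kjl->il', Xc, Xc) / len(X)  # S_X, Eq. (6): n2 x n2
    _, EU = np.linalg.eigh(S_row)                      # eigenvalues ascending
    _, EV = np.linalg.eigh(S_col)
    U = EU[:, ::-1][:, :l1]                            # top-l1 eigenvectors of S^X
    V = EV[:, ::-1][:, :l2]                            # top-l2 eigenvectors of S_X
    return U, V

def extract_features(X, U, V):
    """Bilinear projection Y_i = U^T X_i V, Eq. (2)."""
    return np.einsum('ji,kjl,lm->kim', U, X, V)
```

Setting $V = I$ here recovers 2DPCA and setting $U = I$ recovers line-based PCA, matching the special cases noted above.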

3. Matrix norms for similarity measure

To achieve better performance in classification tasks, a number of similarity measures for matrix features have been studied. For two matrix features $A = (a_{ij})$ and $B = (b_{ij})$ of size $l_1 \times l_2$, Yang and Yang (2002) used the Frobenius distance measure (Golub and Van Loan, 1996), defined as

$$d_F(A, B) = \sqrt{\sum_{i=1}^{l_1} \sum_{j=1}^{l_2} (a_{ij} - b_{ij})^2}, \tag{8}$$

which is just the Euclidean distance of the vectorized matrix features. Yang et al. (2004) proposed another distance measure, called the Yang distance, defined as

$$d_Y(A, B) = \sum_{j=1}^{l_2} \sqrt{\sum_{i=1}^{l_1} (a_{ij} - b_{ij})^2}. \tag{9}$$

Zuo et al. (2006) proposed an assembled version of these two measures, defined as

$$d_{AMD}^{p}(A, B) = \left( \sum_{j=1}^{l_2} \left( \sum_{i=1}^{l_1} (a_{ij} - b_{ij})^2 \right)^{(1/2)p} \right)^{1/p}, \tag{10}$$

where $p$ is a user-defined parameter. This metric was generalized by Zhao and Xu (2011) with an additional parameter $q$; the generalized AMD measure is defined as

$$d_{gAMD}^{p,q}(A, B) = \left( \sum_{j=1}^{l_2} \left( \sum_{i=1}^{l_1} (a_{ij} - b_{ij})^q \right)^{p/q} \right)^{1/p}. \tag{11}$$

Meng and Zhang (2007) also proposed to use a volume measure of a matrix, defined as

$$d_V(A, B) = \mathrm{Vol}(A - B) = \sqrt{\det\left( (A - B)^T (A - B) \right)}. \tag{12}$$
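For reference, the five measures of Eqs. (8)–(12) can be sketched as follows (a NumPy illustration under our own naming; the parameter defaults are the values used later in the experiments of Section 5, and the use of abs() in the generalized AMD is our reading for non-even $q$).

```python
import numpy as np

def d_frobenius(A, B):                    # Eq. (8)
    return float(np.sqrt(np.sum((A - B) ** 2)))

def d_yang(A, B):                         # Eq. (9): sum of column-wise 2-norms
    return float(np.sum(np.sqrt(np.sum((A - B) ** 2, axis=0))))

def d_amd(A, B, p=0.125):                 # Eq. (10): assembled matrix distance
    col_sq = np.sum((A - B) ** 2, axis=0)
    return float(np.sum(col_sq ** (0.5 * p)) ** (1.0 / p))

def d_gamd(A, B, p=0.25, q=2.0):          # Eq. (11): generalized AMD
    col_q = np.sum(np.abs(A - B) ** q, axis=0)
    return float(np.sum(col_q ** (p / q)) ** (1.0 / p))

def d_volume(A, B):                       # Eq. (12): volume of the difference
    D = A - B
    return float(np.sqrt(np.linalg.det(D.T @ D)))
```

Note how the measures nest: $d_{gAMD}$ with $q = 2$ reduces to $d_{AMD}$, while $d_{AMD}$ with $p = 2$ gives $d_F$ and with $p = 1$ gives $d_Y$.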


Though the works that introduced these measures have shown that they can achieve some improvement in a number of experiments on image data sets, no theoretical explanation or justification has been given for the differences and advantages of one measure over another. In addition, these matrix norms do not consider the properties of the data distribution. In the vector case, the simple Euclidean distance may give misleading results when the variances of the elements differ significantly. To address this, a few distance measures use information from a given data set: the Mahalanobis distance and the normalized Euclidean distance use variance information, while the probabilistic similarity measure (Moghaddam et al., 1999) and the statistical similarity measure (Lee and Park, 2003) use intraclass variance to define a new similarity measure. This paper extends the statistical similarity measure of Lee and Park (2003) to a matrix version.

4. Probabilistic learning of similarity measure

The similarity between two matrix features $Y_i$ and $Y_j$ can be measured by the probability that the two features belong to the same class. To estimate this probability, we start by defining a difference matrix between two features as

$$D_{ij} = Y_i - Y_j. \tag{13}$$

Using the difference matrix, we can define two groups of difference features: the intraclass group $\Omega_I = \{ D_{ij} \mid D_{ij} = Y_i - Y_j;\; C(Y_i) = C(Y_j) \}$ and the extraclass group $\Omega_E = \{ D_{ij} \mid D_{ij} = Y_i - Y_j;\; C(Y_i) \neq C(Y_j) \}$, where $C(Y_i)$ denotes the class membership of $Y_i$. We then define the similarity between two features $Y_i$ and $Y_j$ as the probability that their difference $D_{ij}$ belongs to the class $\Omega_I$:

$$S(Y_i, Y_j) = P(D_{ij} \in \Omega_I) = P(\Omega_I \mid D_{ij}). \tag{14}$$

The probability $P(\Omega_I \mid D_{ij})$ can be calculated using Bayes' theorem as

$$P(\Omega_I \mid D) = \frac{P(D \mid \Omega_I)\, P(\Omega_I)}{P(D)} \propto P(D \mid \Omega_I). \tag{15}$$

In this paper, we ignore the terms $P(\Omega_I)$ and $P(D)$ for computational efficiency and estimate the conditional probability $P(D \mid \Omega_I)$. To estimate the density $P(D \mid \Omega_I)$, we adopt maximum likelihood estimation with a Gaussian model. The previous works by Moghaddam et al. (1999) and Lee and Park (2003) took this probabilistic approach for vector data and showed its efficiency. In the present paper, we need to estimate a probability density for matrix data. Under the Gaussian assumption, the conditional distribution is modeled as a matrix Gaussian distribution:

$$G(D \mid M_D, S^D, S_D) = \frac{\exp\left\{ -\frac{1}{2}\, \mathrm{tr}\left[ S_D^{-1} (D - M_D)^T (S^D)^{-1} (D - M_D) \right] \right\}}{(2\pi)^{l_1 l_2 / 2}\, |S^D|^{l_2 / 2}\, |S_D|^{l_1 / 2}}, \tag{16}$$

where the sample mean $M_D$ and the sample covariance matrices $S_D$ and $S^D$ can be estimated from the intraclass data set $\Omega_I$ as follows:

$$M_D = E[D] \approx \frac{1}{|\Omega_I|} \sum_{D_{ij} \in \Omega_I} D_{ij}, \tag{17}$$

$$S_D = E[(D - M_D)^T (D - M_D)] \approx \frac{1}{|\Omega_I|} \sum_{D_{ij} \in \Omega_I} (D_{ij} - M_D)^T (D_{ij} - M_D), \tag{18}$$

$$S^D = E[(D - M_D)(D - M_D)^T] \approx \frac{1}{|\Omega_I|} \sum_{D_{ij} \in \Omega_I} (D_{ij} - M_D)(D_{ij} - M_D)^T. \tag{19}$$

The similarity of the two features $S(Y_i, Y_j)$ can then be measured by $G((Y_i - Y_j) \mid M_D, S^D, S_D)$. However, evaluating this Gaussian density requires the inversion of the two matrices $S^D$ and $S_D$, which gives rise to a high computational cost. To resolve this, we apply a whitening process to the set $\Omega_I$. Through the singular value decomposition of $S^D$ and $S_D$, we define the whitening transformation of $D$ as

$$Z = U_D^T D V_D, \tag{20}$$

where $U_D$ and $V_D$ are the eigenvector matrices of $S^D$ and $S_D$, respectively. Then the probability density of $Z$ in $\Omega_I$, $p(Z \mid \Omega_I)$, is given by a Gaussian distribution that can be written as

$$G(Z \mid M_Z, S^Z, S_Z) = \frac{\exp\left\{ -\frac{1}{2}\, \mathrm{tr}\left[ S_Z^{-1} (Z - M_Z)^T (S^Z)^{-1} (Z - M_Z) \right] \right\}}{(2\pi)^{l_1 l_2 / 2}\, |S^Z|^{l_2 / 2}\, |S_Z|^{l_1 / 2}}, \tag{21}$$

where the mean and covariance matrices are estimated by

$$M_Z = E[U_D^T D V_D] = U_D^T M_D V_D, \tag{22}$$

$$S_Z = E[(Z - M_Z)^T (Z - M_Z)] \tag{23}$$

$$= E[V_D^T (D - M_D)^T U_D\, U_D^T (D - M_D) V_D] \simeq V_D^T S_D V_D = \Lambda_V, \tag{24}$$

$$S^Z = E[(Z - M_Z)(Z - M_Z)^T] \tag{25}$$

$$= E[U_D^T (D - M_D) V_D\, V_D^T (D - M_D)^T U_D] \simeq U_D^T S^D U_D = \Lambda_U. \tag{26}$$

We should note here that $\Lambda_V$ and $\Lambda_U$ are diagonal matrices composed of the eigenvalues of $S_D$ and $S^D$, respectively. By taking the exponential term from Eq. (21), we define a distance measure between two matrix features $Y_i$ and $Y_j$ in a simple form as

$$d_p(Y_i, Y_j) = \mathrm{tr}\left[ \Lambda_V^{-1} V_D^T (Y_i - Y_j)^T U_D \Lambda_U^{-1} U_D^T (Y_i - Y_j) V_D \right]. \tag{27}$$
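A minimal sketch of the whole learning step may help here (our own NumPy code with hypothetical names, not the authors' implementation). It estimates $M_D$, $S^D$, $S_D$ from Eqs. (17)–(19), obtains $U_D$, $V_D$, $\Lambda_U$, $\Lambda_V$ by eigendecomposition, and evaluates the distance of Eq. (27), which with $Z = U_D^T (Y_i - Y_j) V_D$ reduces to $\sum_{ij} Z_{ij}^2 / (\lambda_{U,i}\, \lambda_{V,j})$.

```python
import numpy as np

def fit_probabilistic_measure(Y, labels):
    """Learn the parameters of the proposed distance, Eq. (27).
    Y: matrix features of shape (N, l1, l2); labels: class memberships."""
    N = len(Y)
    # intraclass difference set Omega_I, Eq. (13), in both orderings
    D = np.stack([Y[i] - Y[j] for i in range(N) for j in range(N)
                  if i != j and labels[i] == labels[j]])
    M_D = D.mean(axis=0)                                # Eq. (17)
    Dc = D - M_D
    S_row = np.einsum('kij,klj->il', Dc, Dc) / len(D)   # S^D, Eq. (19)
    S_col = np.einsum('kji,kjl->il', Dc, Dc) / len(D)   # S_D, Eq. (18)
    lam_U, U_D = np.linalg.eigh(S_row)                  # Lambda_U and U_D
    lam_V, V_D = np.linalg.eigh(S_col)                  # Lambda_V and V_D
    return U_D, V_D, lam_U, lam_V

def d_proposed(Yi, Yj, U_D, V_D, lam_U, lam_V, eps=1e-12):
    """Distance of Eq. (27). eps is our own safeguard against near-zero
    eigenvalues; it is not discussed in the paper."""
    Z = U_D.T @ (Yi - Yj) @ V_D                         # whitening, Eq. (20)
    return float(np.sum(Z ** 2 / np.outer(lam_U + eps, lam_V + eps)))
```

The eigendecomposition of the two intraclass covariance matrices plays the role of the SVD-based whitening in Eq. (20).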

The main difference between the proposed distance measure and the conventional matrix norms discussed in the previous section is that the parameters of the proposed measure ($\Lambda_U$, $\Lambda_V$, $U_D$, and $V_D$) are obtained through a training process on a given data set. We should also note that the learned parameters depend not only on the input data but also on the class membership, because we utilize the set $\Omega_I$, which depends on the class membership of each data point. Through the learning process, we can obtain a distance measure that is optimized for the distribution properties of the given data set as well as for the classification purpose.

5. Experimental results

5.1. Classification using facial databases

In order to confirm the efficiency of the proposed distance measure, we compared it with conventional distance measures based on matrix norms. In this section, we used two popular facial image databases: the FERET (Face Recognition Technology) database (available at http://www.itl.nist.gov/iad/humanid/feret) and the PICS (Psychological Image Collection at Stirling) database (available at http://pics.psych.stir.ac.uk/). Figs. 1 and 2 show examples of the images.

In the case of the FERET database, we obtained 450 images from 50 subjects; nine images of each subject were taken with horizontal pose variations. Using the FERET database, we conducted two types of classification tasks: face recognition and pose recognition. For face recognition, the left (+60°), right (−60°), and frontal (0°) images of each subject were used for training, and the remaining 300 images were used for testing. For pose recognition, images from 25 subjects were used for training, and the remaining 225 images from the other 25 subjects were used for testing.


Table 1
Comparison results between the vector-based approaches and the tensor-based approaches. The bold values represent the best result in each task.

| Methods                | FERET: Face recog. | FERET: Pose recog. | PICS: Face recog. | PICS: Exp. recog. |
|------------------------|--------------------|--------------------|-------------------|-------------------|
| PCA + Euclidean        | 97.0 (117)         | 36.4 (65)          | 70.0 (48)         | 35.6 (65)         |
| PCA + Probabilistic    | 92.3 (92)          | 38.7 (51)          | 69.4 (45)         | 42.4 (76)         |
| Tensor PCA + Frobenius | 99.0 (14×3)        | 36.4 (21×4)        | 78.3 (9×12)       | 37.8 (15×1)       |
| Tensor PCA + Proposed  | 99.7 (15×1)        | 48.0 (7×11)        | **88.4 (10×5)**   | **56.6 (25×1)**   |
| LBDPCA + Frobenius     | 99.0 (14×3)        | 36.0 (20×4)        | 78.3 (9×12)       | 37.8 (15×1)       |
| LBDPCA + Proposed      | **100.0 (20×1)**   | **48.4 (3×15)**    | **88.4 (12×5)**   | 56.1 (25×1)       |

Fig. 1. Examples from the FERET database with pose variation.

Fig. 2. Examples from the PICS database with various facial expressions.

The image size is 70 × 50 pixels; the raw input dimension is 3500. In the case of the PICS database, we obtained 276 images from 69 persons; four images with different expressions were taken of each person. For the PICS database, we also conducted two types of classification tasks: face recognition and expression recognition. For face recognition, we used three images of each person for training, and the remaining image with a neutral expression was used for testing. For expression recognition, 20 images of each facial expression were used for training and the remaining 49 images were used for testing. The size of each image is 90 × 80 pixels; the dimension of the raw input data is 7200.

For classification using the proposed measure, we first applied tensor PCA to the image data set $\{X_1, X_2, \ldots, X_N\}$ to obtain low-dimensional matrix features $\{Y_1, Y_2, \ldots, Y_N\}$. To obtain the proposed distance measure, we compose the set $\Omega_I$ according to the class membership and estimate the matrices $\Lambda_U$, $\Lambda_V$, $U_D$, and $V_D$. Here, we should note that the set $\Omega_I$ depends on the class membership, and thus the obtained measure depends on the classification task as well as on the data set. For example, the measure trained for pose recognition using the FERET data is different from that trained for face recognition using the same FERET data. This is one of the main differences between the proposed method and the conventional measures, which do not require a learning process and are thus always the same regardless of the data set. Once the distance measure is obtained, classification is performed by the K-nearest neighbor method with K = 1, as sketched below.
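Since the classifier itself is just 1-NN under a pluggable distance, a short sketch suffices (our own helper, not from the paper; any of the distance functions above can be passed in).

```python
import numpy as np

def knn_classify(Y_train, labels_train, Y_test, dist):
    """1-nearest-neighbor classification (K = 1, as in the experiments)
    with an arbitrary matrix distance function dist(A, B)."""
    predictions = []
    for Yq in Y_test:
        distances = [dist(Yq, Yt) for Yt in Y_train]
        predictions.append(labels_train[int(np.argmin(distances))])
    return predictions
```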

Since the proposed measure is an extension of the previous probabilistic measure for vector data (Lee and Park, 2003; Moghaddam et al., 1999), we first compared the tensor-based method with the conventional vector-based method. Table 1 shows the results of six different combinations of feature extraction methods and distance measures. As the vector-based methods, we applied the original PCA to obtain low-dimensional feature vectors and classified them using two different measures: the Euclidean distance and the probabilistic measure for vector data. As the tensor-based approaches, we applied tensor PCA and LBDPCA to obtain feature matrices and classified them using two different measures: the Frobenius norm ($d_F$), which is the matrix version of the Euclidean distance, and the proposed probabilistic measure for matrix data. We tried all possible feature dimensionalities and chose the one giving the best classification rate for each method; the parentheses in each cell of Table 1 show the dimensionality of the obtained features.

From the results in Table 1, we can see that the tensor-based approaches generally give better results than the vector-based approaches. The two tensor-based feature extraction methods, tensor PCA and LBDPCA, showed little difference in recognition performance. Though the probabilistic measure for vector data can give better results than the Euclidean distance, the performance can still be improved by using tensor-based features and measures.

We also compared the proposed probabilistic measure with the conventional measures for matrix data mentioned in Section 3. For the measure $d_{AMD}^{p}$, we set the parameter p = 0.125 according to the suggestion in Zuo et al. (2006). For the measure $d_{gAMD}^{p,q}$, we tried two different settings, (p, q) = (0.25, 2) and (0.5, 0.5), which were chosen in Zhao and Xu (2011). Since we use tensor PCA for feature extraction, we need to select the number of basis vectors (the dimensionality of the features) in both the row and column directions. We tried all possible numbers of basis vectors to obtain the best performance; Fig. 3 shows the best classification rates on the test data and the corresponding feature dimensionalities.

For the face recognition task on the FERET data, all measures generally give good results, but the proposed method gives the best. For the pose recognition task on the same FERET data, the proposed method again gives the best results, although the performance of all methods is generally low. One remarkable point is that the measure $d_F$ shows a good result for face recognition on the FERET data but the worst result for pose recognition. This shows the limitation of conventional matrix measures, which lack learning ability. For the face recognition task on the PICS data, the measure $d_V$ shows the best performance, and the proposed method shows slightly lower rates. This may be due to the limited number of training data: since the proposed method needs a learning process to obtain a reliable measure, this can be a weak point when the number of training data is not enough to represent the distributional properties. Nevertheless, the strength of the proposed method appears clearly in the last task, expression recognition. Though the result of $d_V$ is much lower than in the face recognition task on the same data set, the proposed method gives remarkably superior results compared to the other methods, owing to its learning ability.


Fig. 3. Classification rates with different distance measures: results on the FERET database (upper row) and results on the PICS database (bottom row).

In summary, while the conventional methods show unstable results across tasks and data sets, the proposed method shows stably good results compared to the other methods.

When we use low-dimensional features obtained by projection, we need to consider how to select the dimensionality of the features. In the case of matrix features, this is more delicate because there are two parameters for the bidirectional projections: the row dimensionality and the column dimensionality. Though we reported the best results with the optimal dimensionality of the feature matrix in Table 1 and Fig. 3, we should discuss how to select these parameters more closely. Though there have been several works on choosing the dimensionality (Renald and Bourennane, 2008, 2009; Lu et al., 2008), it is still an open problem.

In our experiments, we first investigated the dependency of the classification performance on the dimensionality. Fig. 4 shows the changes in classification rates according to the row dimensionality (left column of the figure) and the column dimensionality (right column), respectively. At each specific value of the row dimensionality, we plotted the maximum classification rate over all possible column dimensionalities for the graphs in the left column, and vice versa for those in the right column. The vertical line in each graph marks the best classification rate and the corresponding dimensionality. As shown in the figure, it is difficult to find a consistent tendency with respect to the dimensionality. It is especially interesting that increasing the column dimensionality degrades the classification performance in the case of face recognition on the FERET data, whereas for pose recognition, a larger column dimensionality gives better performance. For face recognition, the variations in the column vectors due to the pose factor may adversely affect subject identification, which causes the degradation of performance as the column dimension increases. Conversely, the variation in the column vectors of the same data gives essential information for pose recognition, which leads to better pose recognition rates with larger column dimensionality. This tendency appears more clearly in the proposed distance and the generalized AMD distance than in other measures such as $d_F$.

Based on these investigation results, we can anticipate that a selection criterion based on reconstruction error, which is widely used in practical applications of PCA, may not be appropriate for matrix-based PCA. Nevertheless, we checked the change in performance according to the change in reconstruction error. When we have two projection matrices for the row and column directions, we can choose the dimensionality that achieves a pre-defined reconstruction error. The reconstruction error $e(l)$ for dimensionality $l$ is calculated as

$$e(l) = 1 - \frac{\sum_{i=1}^{l} \lambda_i}{\sum_{i=1}^{n} \lambda_i}, \tag{28}$$

where $\lambda_i$ is the $i$th eigenvalue associated with the projection matrix $U$ or $V$.
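For concreteness, a minimal sketch of this criterion (our code; the eigenvalues are assumed sorted in descending order) selects the smallest dimensionality whose error of Eq. (28) does not exceed a target value:

```python
import numpy as np

def dims_for_error(eigvals, eps):
    """Smallest l with reconstruction error e(l) <= eps, Eq. (28).
    eigvals: eigenvalues of S^X (for l1) or S_X (for l2), descending."""
    ratio = np.cumsum(eigvals) / np.sum(eigvals)   # equals 1 - e(l) for l = 1..n
    return int(np.searchsorted(ratio, 1.0 - eps) + 1)

# e.g. l1 = dims_for_error(lam_row, 0.05); l2 = dims_for_error(lam_col, 0.05)
```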

We checked the performance at three values of $e$ (15%, 10%, 5%) for the four recognition tasks. In addition, we also tried a statistical model selection criterion, the Akaike information criterion (AIC), which has been used for PCA (Valle et al., 1999) and tensor PCA (Renald and Bourennane, 2008). The change in performance according to the selected dimensionality is shown in Fig. 5; the dimensionality obtained by each selection criterion is shown under the horizontal axis of each graph. Note that the values selected by these criteria do not depend on the distance measures, whereas the best values found in the experiments (Fig. 3) differ among the measures.

As we can expect from Fig. 4, Fig. 5 shows that a smaller reconstruction error gives lower classification performance. This tendency is most apparent in the proposed measure. In addition, AIC also cannot give good performance compared to the best case; in particular, AIC causes severe degradation of performance on the FERET database for a few distance measures, including the proposed one. From this investigation, we can say that it is difficult to find a proper dimensionality using only the information from the projection matrices, because the performance strongly depends on the data set as well as on the classification method. One practical solution is to use a validation set to find the optimal value through experiments. We show an example of using a validation set in the next section.


Fig. 4. Changes in classification rates according to the change of row dimensionality (left column) and column dimensionality (right column).

5.2. Face recognition using a large database

In order to check the validity of the proposed method in real applications, we conducted face recognition experiments on the CMU Multi-PIE database (Gross et al., 2010). The Multi-PIE database was built as a substantial extension of the CMU-PIE database, a well-known benchmark database for face recognition. Multi-PIE was collected from 337 subjects in four different recording sessions with diverse variations in pose, illumination, and expression.

Overall, the database has more than 750,000 images with 20 illumination variations, 15 pose variations, and six expression variations. In the experiments, we used the images of the 121 subjects appearing in all four sessions. For each original image shown in Fig. 6, we automatically localized the facial area using a public face detection algorithm and resized it to a 32 × 32 matrix. Though the original images have color information, we converted them to gray-scale images. Since the automatic face detection algorithm often fails to detect faces with large pose and illumination variations, we could use only the part of the whole data set that passed the detection process successfully. An original image and some sample images obtained through the pre-processing are shown in Fig. 6.


Fig. 5. Selection methods for the dimensionality of the feature matrix.

Fig. 6. An original image and samples of pre-processed images of Multi-PIE database.

For the experiments, we divided the pre-processed data set from the 121 subjects into two groups: a training set obtained from 10 subjects, and a test set from the remaining 111 subjects. The training set was used for finding the projection matrices $U$ and $V$ by tensor PCA, and also for estimating the parameters $\Lambda_U$, $\Lambda_V$, $U_D$, and $V_D$ of the proposed distance measure.

In addition, we also used the training set for estimating the optimal dimensionality of the matrix features through a validation test. Using the 8691 images given after pre-processing, we trained tensor PCA to find the projection matrices $U$ and $V$, and then calculated the matrix feature set by projecting the training images bidirectionally with $U$ and $V$. The training feature set was then used to train the proposed distance measure.

We first composed the set $\Omega_I$ by pairing samples from the same subjects. For computational efficiency, we systematically chose the pairs of samples instead of using all possible pairs. Noting that distance calculations are mainly performed between a newly given probe image and a gallery image, which is usually recorded under normal conditions, we composed a gallery set of 72 images recorded in one session with a neutral expression and slight pose variations (0°, L15°, R15°). Among the remaining images, we chose 702 probe images recorded in three different sessions with expression, pose, and illumination variations, as shown in Fig. 6. By pairing each gallery image with a probe image of the same identity, we collected 5058 pairs to estimate the matrices $\Lambda_U$, $\Lambda_V$, $U_D$, and $V_D$ for the proposed measure. The gallery and probe sets were also used for estimating the optimal dimensionality of the matrix features through validation experiments. The estimated optimal dimensionality for each measure, which is noted in Fig. 7, was used in the test stage.

In the test stage, we conducted the face recognition task for the remaining 111 subjects. We first composed the gallery set and four probe sets. The gallery set consisted of 847 images recorded in the first session with a neutral expression, including up to five pose variations and two illumination variations. The four probe sets were composed of images recorded in the three different sessions. Set 1 has 1538 images with only small pose variations (0°, L15°, R15°) and two illumination variations. Set 2 has 4648 images with small pose variations as well as expression and illumination variations. Set 3 has 2513 images with five pose variations and illumination variations. Finally, Set 4 has 7237 samples with pose, expression, and illumination variations, as shown in Fig. 6.


Fig. 7. Classification results on four different probe sets of the Multi-PIE database.

In evaluating the performance, we calculated the cumulative match rate (CMR) of the nearest neighbor method as a function of the rank. The results on the four different probe sets are shown in Fig. 7, where we compare the seven distance measures used in Section 5.1. From the figure, we can see that the basic matrix norm ($d_F$) shows the worst performance, while the proposed measure and $d_{gAMD}$ with (p = 0.5, q = 0.5) show similarly good performance. In the case of Set 2, with expression variations, the proposed measure gives slightly degraded performance. As shown in Fig. 6, every expression except neutral appears in a specific session, and thus the lack of training samples for expression variations may explain the lower performance of the proposed measure. However, in the case of Set 3, with pose variations, and Set 4, with diverse variations, the proposed measure shows the best performance. From these results, we can say that the proposed method can be successfully applied to practical image pattern classification tasks provided that a sufficient number of training samples is available.

6. Conclusions

In this paper, we proposed a probabilistic learning method for a distance measure between two matrix data. Owing to its learning ability, the proposed method can provide a distance measure that is more appropriate for the given data set as well as for the given task. Though we assumed a matrix Gaussian distribution in the estimation of $p(D \mid \Omega_I)$, a more sophisticated probabilistic model could be adopted. In addition, we should note that the current work deals only with second-order tensors for 2D images. Since tensor PCA and related methods such as GND-PCA and MPCA deal with higher-order tensors, the extension of the proposed measure to higher-order tensors would be a challenging topic for further research.

Acknowledgments This research was partially supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2011-0003671). This research was partially supported by the Converging Research Center Program funded by the Ministry of Education, Science and Technology (2011K000659).

References

Cai, D., He, X., Han, J., 2005. Subspace learning based on tensor analysis. Tech. Rep. UIUCDCS-R-2005-2572, Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL, USA.
Chen, S., Sun, Y., Yin, B., 2009. A novel hybrid approach based on sub-pattern technique and extended 2DPCA for color face recognition. In: 11th IEEE Internat. Symposium on Multimedia, pp. 630–634.
Gross, R., Matthews, I., Cohn, J., Kanade, T., Baker, S., 2010. Multi-PIE. In: Proc. Eighth IEEE Internat. Conf. on Automatic Face and Gesture Recognition, vol. 28(5), pp. 807–813.
He, X., Cai, D., Niyogi, P., 2005. Tensor subspace analysis. In: Adv. Neural Inform. Process. Systems.
Hong, H., Li, X.C., Wang, L., Teoh, E.K., 2005. Generalized 2D principal component analysis. In: Proc. Internat. Joint Conf. on Neural Networks, pp. 113–118.
Hu, X., Yu, W., Yao, J., 2010. Multi-oriented 2DPCA for face recognition with one training face image per person. J. Comput. Inform. Systems 6 (5), 1563–1570.
Lee, K., Park, H., 2003. A new similarity measure based on intraclass statistics for biometric systems. ETRI J. 25 (5), 401–406.
Lu, H., Plataniotis, K.N., Venetsanopoulos, A.N., 2008. MPCA: Multilinear principal component analysis of tensor objects. IEEE Trans. Neural Networks 19 (1), 18–39.
Lu, H., Plataniotis, K.N., Venetsanopoulos, A.N., 2011. A survey of multilinear subspace learning for tensor data. Pattern Recognition 44, 1540–1551.
Meng, J., Zhang, W., 2007. Volume measure in 2DPCA-based face recognition. Pattern Recognition Lett. 28, 1203–1208.
Moghaddam, B., Jebara, T., Pentland, A., 1999. Bayesian modeling of facial similarity. In: Adv. Neural Inform. Process. Systems 11.
Renald, N., Bourennane, S., 2008. Improvement of target detection methods by multiway filtering. IEEE Trans. Geosci. Remote Sens. 46 (8), 2407–2417.
Renald, N., Bourennane, S., 2009. Dimensionality reduction based on tensor modelling for classification methods. IEEE Trans. Geosci. Remote Sens. 47 (4), 1123–1131.


Turk, M., Pentland, A.P., 1991. Face recognition using eigenfaces. In: Proc. CVPR'91, pp. 586–591.
Ullman, S., Sali, E., Vidal-Naquet, M., 2001. A fragment-based approach to object representation and classification. In: Proc. IWVF 2001, pp. 85–100.
Valle, S., Li, W., Qin, S.J., 1999. Selection of the number of principal components: the variance of the reconstruction error criterion with a comparison to other methods. Ind. Eng. Chem. Res. 38, 4389–4401.
Vidal-Naquet, M., Ullman, S., 2003. Object recognition with informative features and linear classification. In: Proc. ICCV 2003, vol. 1, pp. 281–288.
Wang, L., Wang, X., Zhang, X., Feng, J., 2005. The equivalence of two-dimensional PCA to line-based PCA. Pattern Recognition Lett. 26, 57–60.
Xu, R., Chen, Y.-W., 2009. Generalized N-dimensional principal component analysis (GND-PCA) and its application on construction of statistical appearance models for medical volumes with fewer samples. Neurocomputing 72, 2276–2287.
Yang, J., Yang, J.Y., 2002. From image vector to matrix: A straightforward image projection technique – IMPCA vs. PCA. Pattern Recognition 35, 1997–1999.
Yang, J., Zhang, D., Frangi, A.F., Yang, J., 2004. Two-dimensional PCA: A new approach to appearance-based face representation and recognition. IEEE Trans. PAMI 26 (1), 131–137.
Yang, W., Sun, C., Zhang, L., Ricanek, K., 2010. Laplacian bidirectional PCA for face recognition. Neurocomputing 74, 487–493.
Zeng, Y., Feng, D., 2011. The face recognition method of the two-directional variation of 2DPCA. Internat. J. Digital Content Tech. Appl. 5 (2), 216–223.
Zhang, D., Zhou, Z., 2005. (2D)²PCA: 2-directional 2-dimensional PCA for efficient face representation and recognition. Neurocomputing 69, 224–231.
Zhao, L., Xu, X., 2011. Distance function for face recognition based on 2DPCA. In: The 2nd Internat. Conf. on Multimedia Technology (ICMT'11), pp. 5814–5817.
Zuo, W., Zhang, D., Yang, J., Wang, K., 2006. BDPCA plus LDA: A novel fast feature extraction technique for face recognition. IEEE Trans. Systems Man Cybernet. – Part B 36 (4), 946–952.