Pairwise constraints based multiview features fusion for scene classification


Pattern Recognition 46 (2013) 483–496


Jun Yu a, Dacheng Tao b,*, Yong Rui c, Jun Cheng d,e

a Computer Science Department, Xiamen University, Xiamen 361005, China
b Centre for Quantum Computation & Intelligent Systems and Faculty of Engineering & Information Technology, University of Technology, Sydney, 235 Jones Street, Ultimo, NSW 2007, Australia
c Microsoft Research Asia, Beijing, China
d Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
e The Chinese University of Hong Kong, Shatin, Hong Kong, China


Abstract

Article history: Received 9 December 2011; Received in revised form 2 August 2012; Accepted 6 August 2012; Available online 16 August 2012.

Recently, we have witnessed a surge of interest in learning a low-dimensional subspace for scene classification. Existing methods do not perform well because they do not consider scenes' multiple features from different views in low-dimensional subspace construction. In this paper, we describe scene images with a group of features and explore their complementary characteristics. We consider the problem of multiview dimensionality reduction by learning a unified low-dimensional subspace to effectively fuse these features. The proposed method takes both intraclass and interclass geometries into consideration; as a result, discriminability is effectively preserved because the method takes into account neighboring samples that have different labels. Due to the semantic gap, the fusion of multiview features alone still cannot achieve excellent scene classification performance in real applications. Therefore, a user labeling procedure is introduced into our approach. Initially, a query image is provided by the user, and a group of images is retrieved by a search engine. After that, the user labels some images in the retrieved set as relevant or irrelevant to the query. Must-links are constructed between the relevant images, and cannot-links are built between the relevant and the irrelevant images. Finally, an alternating optimization procedure is adopted to integrate the complementary nature of the different views with the user labeling information, and a novel multiview dimensionality reduction method for scene classification is developed. Experiments are conducted on real-world datasets of natural scenes and indoor scenes, and the results demonstrate that the proposed method achieves the best performance in scene classification. In addition, the proposed method can be applied to other classification problems. Experimental results of shape classification on Caltech 256 suggest the effectiveness of our method. © 2012 Elsevier Ltd. All rights reserved.

Keywords: Scene classification; Fusion; Multiview dimensionality reduction; User labeling information

1. Introduction

Scene classification is a critical task in many practical applications, e.g., video analysis [50], video surveillance [34], content-based image retrieval [38] and robotics path planning [42]. In general, scene classification is difficult because of the wide variety of scenes potentially to be classified. Specifically, lighting variations, differences in viewing angle and background changes make efficient learning and robust classification difficult. Recently, several approaches have been proposed, and we can categorize them into low-level visual feature-based methods, local feature-based methods and local-global feature-based methods. In low-level visual feature-based methods, scene images are described by visual information [7], e.g., color, texture and shape [45,46]. Representative descriptors are the HSV color histogram (HSV)

* Corresponding author. Tel.: +61 2 9514 1829. E-mail address: [email protected] (D. Tao).

0031-3203/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.patcog.2012.08.006

[37], color coherence vectors [32], multiresolution simultaneous autoregressive models [28], wavelet-based decompositions [27] and the edge direction histogram [22]. These visual features perform well with respect to scaling, rotation, perspective and occlusion [37]. However, they are sensitive to small geometric and photometric distortions. In this case, local feature-based methods were developed to describe scene images using interest points obtained by detectors and characterized by descriptors. The bag-of-features model is frequently used to describe a scene image. Two components, detectors and descriptors, are utilized to obtain local features. Detectors include the Harris corner detector [15], the scale invariant feature transform (SIFT) [26], the Harris-Laplace detector [29], etc. To describe the detected interest points or regions, SIFT [26], steerable filters [11], Gabor functions [8], the Varma-Zisserman model [44] and the gradient location and orientation histogram [30] are utilized. Because local feature-based methods neglect spatial information, local-global feature-based methods were introduced. Local-global feature-based methods adopt both the global spatial information and the local descriptors of interest points to


describe scene images and accomplish robust classification. For instance, Lazebnik et al. [23] proposed spatial pyramid matching, which partitions an image into increasingly fine grids and computes histograms of local features inside each grid cell. Wang et al. [48] proposed an effective coding method called Locality-constrained Linear Coding (LLC), which utilizes locality constraints to project each descriptor into its local coordinate system; the projected coordinates are integrated by max pooling to generate the final representation. Their experimental results show improvements over the bag-of-words model. In real applications of scene classification, the descriptors introduced above are often very high dimensional, so direct manipulation of these descriptors is computationally expensive and yields less-than-optimal results. This is the so-called "curse of dimensionality" [3]. To solve this problem, a straightforward way is to project the high-dimensional data into a low-dimensional subspace through dimensionality reduction algorithms. The most popular conventional algorithms for dimensionality reduction are Principal Component Analysis (PCA) [20] and Linear Discriminant Analysis (LDA) [10]. PCA computes low-dimensional representations of high-dimensional samples that maximize the total scatter. PCA is not well suited to classification problems since it is unsupervised and does not utilize prior information about class identities. Unlike PCA, LDA is supervised. It constructs projection directions that maximize the trace of the between-class scatter matrix while at the same time minimizing the trace of the within-class scatter matrix. However, both are conventional algorithms that see only the global Euclidean structure and cannot discover the nonlinear structure hidden in high-dimensional data.
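To make the PCA/LDA contrast concrete, the following is a minimal NumPy sketch of both projections. The synthetic data stand in for high-dimensional scene descriptors, the dimensions are illustrative, and the small ridge added to the within-class scatter matrix is our addition for numerical stability; none of this is the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for high-dimensional scene descriptors: 3 classes, 20-D.
X = np.vstack([rng.normal(loc=c, size=(30, 20)) for c in (0.0, 2.0, 4.0)])
y = np.repeat([0, 1, 2], 30)

# --- PCA: directions of maximal total scatter (unsupervised) ---
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_pca = Xc @ Vt[:2].T                       # top-2 principal components

# --- LDA: maximize between-class vs. within-class scatter (supervised) ---
mu = X.mean(axis=0)
Sb = np.zeros((20, 20)); Sw = np.zeros((20, 20))
for c in np.unique(y):
    Xk = X[y == c]; mk = Xk.mean(axis=0)
    Sb += len(Xk) * np.outer(mk - mu, mk - mu)   # between-class scatter
    Sw += (Xk - mk).T @ (Xk - mk)                # within-class scatter
# Eigenvectors of Sw^{-1} Sb give the discriminant directions.
evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(20), Sb))
order = np.argsort(evals.real)[::-1]
X_lda = X @ evecs.real[:, order[:2]]        # top-2 discriminant components

print(X_pca.shape, X_lda.shape)  # (90, 2) (90, 2)
```

Note how LDA needs the labels `y` while PCA never sees them, which is exactly why PCA can discard class-discriminative directions.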
Recently, a number of manifold learning algorithms [13,14,39] have been developed, which promise to be useful for analyzing high-dimensional data that lie on or near a submanifold of the observation space. Representative methods include Locally Linear Embedding (LLE) [35], ISOMAP [40], Laplacian Eigenmaps (LE) [4], and Local Tangent Space Alignment (LTSA) [58]. However, these methods are only applicable to scene classification in a single view, while in real-world applications, multiple descriptors of scene images (global visual features and local features) are complementary to each other. Here, a view refers to a kind of feature that describes a certain property of the scene image. A great deal of effort has been devoted to multiview learning [49,51,60,62]. One possible approach is to concatenate the feature vectors from different modalities into a new vector. However, this concatenation is not meaningful because of the specific properties of each view. For instance, the bag-of-words feature [24] is a histogram in which the value of a specific bin counts occurrences of a specific word, while a wavelet feature [21] is composed of the responses of wavelet filters. It makes no sense to directly concatenate these two types of features into a long vector because they have different physical meanings and statistical properties, e.g., different means and variances. The concatenation method ignores the diversity of the multiple features and thus cannot effectively exploit the complementary properties of different views. Long [25] proposed a distributed spectral subspace learning (DSSL) method, which constructs a subspace for each view separately. Based on the obtained subspaces, a common low-dimensional subspace is then constructed to be close to each of them. DSSL can choose a different subspace learning method for each view. However, the original multiview data are invisible to the final learning process, so DSSL cannot fully explore the complementary nature of the different views.
In addition, DSSL has a high computational cost because it runs a subspace learning algorithm for each view independently. In this paper, we consider the problem of multiview dimensionality reduction by learning a low-dimensional subspace to

effectively explore the complementary characteristics of scene images' multiple features. The obtained low-dimensional subspace should be better than a low-dimensional subspace learned from any single view of the scene images. To enhance classification performance, the proposed algorithm takes both intraclass and interclass geometries into consideration. Discriminability is effectively preserved because the algorithm takes into account neighboring samples with different labels. On the other hand, semi-supervised learning [1,36,52,53,54,57,61] has succeeded in many applications due to its utilization of user input in the form of labels or pairwise constraints [2,41,55,56]. Due to the semantic gap, learning from multiview features still cannot achieve excellent performance in real applications of scene classification. The semantic gap can be efficiently reduced by adopting the user's labeling information. Therefore, a labeling procedure is introduced in our approach. Specifically, the user is asked to provide a query. A group of images is retrieved by a search engine, and the user labels some images in the retrieved set as relevant or irrelevant to the query. The must-links [52] are constructed between the relevant images, and the cannot-links [52] are built between the relevant and the irrelevant images. Therefore, to take advantage of both the multiview features and the user labeling information, we adopt alternating optimization [6] to integrate the complementary nature of multiview features with pairwise constraints, and develop a novel multiview dimensionality reduction method for scene classification. Experiments are conducted on two real-world scene datasets: natural scenes [23] and indoor scenes [33]. The results suggest the effectiveness of the proposed method in scene classification. In addition, the proposed method can also be applied to other image classification problems.
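The constraint-construction step described above can be sketched as follows. The function and its inputs (relevance labels the user assigns to a retrieved set) are illustrative assumptions, and, following the objective later defined in Section 4.1, cannot-links here pair each relevant image with each irrelevant one.

```python
from itertools import combinations, product

def build_constraints(relevant, irrelevant):
    """Build pairwise constraints from user relevance feedback.
    Must-links: all pairs of relevant images.
    Cannot-links: each relevant image paired with each irrelevant image
    (the pairing used by the Section 4.1 objective)."""
    must = list(combinations(relevant, 2))
    cannot = list(product(relevant, irrelevant))
    return must, cannot

# Toy usage: the user marked images 0, 1, 2 relevant and 7, 9 irrelevant.
must, cannot = build_constraints([0, 1, 2], [7, 9])
print(len(must), len(cannot))  # 3 6
```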
The experimental results of shape classification on the Caltech 256 dataset [12] suggest the effectiveness of our method. The rest of this paper is organized as follows. Section 2 provides a brief review of related work. The complementary nature of multiview features in scene images is analyzed in Section 3. Section 4 describes the novel multiview dimensionality reduction method. Section 5 presents experiments on scene classification on real-world datasets. Finally, Section 6 concludes the paper.

2. Related work

In this section, we first provide a brief review of traditional dimensionality reduction algorithms, which are all designed for single-view data. We then provide a short review of the utilization of pairwise constraints.

2.1. Dimensionality reduction

The task of dimensionality reduction is to build a low-dimensional subspace $Y = [y_1, \dots, y_N] \in \mathbb{R}^{d \times N}$ for high-dimensional observations $X = [x_1, \dots, x_N] \in \mathbb{R}^{m \times N}$. In general, dimensionality reduction algorithms are categorized into two classes: linear methods (e.g., PCA [20] and LDA [10]) and nonlinear methods (e.g., LLE [35], ISOMAP [40], LE [4], Hessian Eigenmaps (HLLE) [9] and Local Tangent Space Alignment (LTSA) [58]). They perform well for single-view data but cannot deal with multiview data directly. Unfortunately, all of these methods suffer from the out-of-sample problem [5]. He et al. [16] address the out-of-sample problem by applying a linearization procedure to construct explicit maps for new samples. Examples of this approach include Locality Preserving Projections (LPP) [16], a linearization of LE, and Neighborhood Preserving Embedding (NPE) [17], a linearization of LLE.


[Fig. 1 appears here; its graphical content is not reproducible as text. Recoverable content: six sample indoor-scene images (I1-I6); their RGB (4096-D), EDH (72-D), Gist (512-D), Sift (4608-D) and LLC (4000-D) feature histograms; and the 6 x 6 pairwise distance matrices computed on each of the five features ("Distance on RGB", "Distance on EDH", "Distance on Gist", "Distance on Sift", "Distance on LLC"), all normalized to [0,1].]
Fig. 1. Complementary characteristics of multiview features in representing images from the indoor scenes dataset [33]. (a) Sample images from the indoor scenes dataset; (b)-(f) the RGB, EDH, Gist, Sift and LLC features; (g)-(k) distances calculated on RGB, EDH, Gist, Sift and LLC.


2.2. Distributed spectral subspace learning for multiview learning

Multiview dimensionality reduction is a relatively new topic. DSSL was first proposed in [25] to solve this problem. Next, we briefly summarize DSSL. Given multiview data with $N$ objects having $t$ views, a set of matrices $X = \{X^{(i)} \in \mathbb{R}^{m_i \times N}\}_{i=1}^{t}$ is constructed. Each representation $X^{(i)}$ is a feature matrix from view $i$. In DSSL, it is assumed that the low-dimensional subspace of each view $X^{(i)}$ is already known, i.e., $Y = \{Y^{(i)} \in \mathbb{R}^{d_i \times N}\}_{i=1}^{t}$, with $d_i < m_i$ $(1 \le i \le t)$. DSSL concentrates on how to obtain a low-dimensional subspace $Y^0 \in \mathbb{R}^{d \times N}$ based on $Y$. The objective function of DSSL can be formulated as

$$\arg\min_{Y^0,\,B} \sum_{i=1}^{t} \left\| Y^{(i)} - (B^{(i)})^T Y^0 \right\|_F^2 \quad \text{s.t. } Y^0 (Y^0)^T = I, \qquad (1)$$

where $\|\cdot\|_F$ is the matrix Frobenius norm and $B = \{B^{(i)} \in \mathbb{R}^{d \times d_i}\}_{i=1}^{t}$ is a set of mapping matrices. The globally optimal solution to DSSL is obtained by the eigendecomposition of the matrix $CC^T$, where $C = [Y^{(1)}, \dots, Y^{(t)}]$.

2.3. Pairwise constraints

Unlike typical supervised learning, where each training sample is annotated with its class label, label information in the form of pairwise constraints can be formulated as: (1) must-links, which state that the given pair are semantically similar and should be close together in the low-dimensional subspace; and (2) cannot-links, which indicate that the given points are semantically dissimilar and should not be near each other in the low-dimensional subspace. The data points $X = [x_1, \dots, x_N] \in \mathbb{R}^{m \times N}$ can be projected into the low-dimensional subspace $Y = [y_1, \dots, y_N] \in \mathbb{R}^{d \times N}$ through the linear projection [16] $W^T X = Y$, where the projection matrix $W \in \mathbb{R}^{m \times d}$ can be obtained from the pairwise constraints. Xing et al. [52] utilized the pairwise constraints by minimizing the distance between data points in the same classes under the constraint that data points from different classes are well separated. Relevant Components Analysis (RCA) [18] learns a global linear transformation from the must-links. Furthermore, Hoi et al. [19] proposed Discriminative Component Analysis (DCA) to improve RCA by exploiting cannot-link constraints. Specifically, the projection matrix $W$ in DCA is learned from two groups of pairwise constraints: must-links $P = \{(x_i, x_j)\}$, where $x_i$ and $x_j$ are in the same class, and cannot-links $D = \{(x_i, x_j)\}$, where $x_i$ and $x_j$ are in different classes. DCA adopts the ratio of determinants as the objective function to learn $W$:

$$\arg\max_{W} \frac{\left| W^T S_b W \right|}{\left| W^T S_w W \right|}, \qquad (2)$$

where $S_b$ and $S_w$ are matrices calculated from the point pairs in the cannot-links and in the must-links, respectively:

$$S_b = \sum_{(x_i, x_j) \in D} (x_i - x_j)(x_i - x_j)^T, \qquad (3)$$

$$S_w = \sum_{(x_i, x_j) \in P} (x_i - x_j)(x_i - x_j)^T. \qquad (4)$$

In [19], the transformation matrix $W$ is calculated via the eigendecomposition of $S_w^{-1} S_b$.
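A minimal NumPy sketch of the DCA procedure of Eqs. (2)-(4) follows. The small ridge added to $S_w$ for invertibility and the function interface are our assumptions for illustration, not part of [19].

```python
import numpy as np

def dca_projection(X, must_links, cannot_links, d, eps=1e-6):
    """Sketch of DCA (Eqs. (2)-(4)): S_b from cannot-link pairs, S_w from
    must-link pairs, W from the top eigenvectors of S_w^{-1} S_b.
    X: (m, N) data matrix; links: lists of (i, j) column-index pairs.
    The eps ridge on S_w is our addition for numerical stability."""
    m = X.shape[0]
    Sb = np.zeros((m, m)); Sw = np.zeros((m, m))
    for i, j in cannot_links:                  # Eq. (3)
        diff = X[:, i] - X[:, j]
        Sb += np.outer(diff, diff)
    for i, j in must_links:                    # Eq. (4)
        diff = X[:, i] - X[:, j]
        Sw += np.outer(diff, diff)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + eps * np.eye(m), Sb))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:d]]            # W in R^{m x d}

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 8))                   # toy data: 8 samples, 10-D
W = dca_projection(X, must_links=[(0, 1), (2, 3)],
                   cannot_links=[(0, 4), (1, 5)], d=2)
Y = W.T @ X                                    # low-dimensional subspace
print(W.shape, Y.shape)  # (10, 2) (2, 8)
```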

3. Complementary characteristics of multiview features

In this section, we analyze the complementary characteristics of multiview features in representing scene images. To achieve this, we adopt sample images from the indoor scenes dataset [33]. The details of this dataset are given in Section 5. In Fig. 1, each image is represented by multiview features including: an RGB color histogram encoded in a 4096-dimensional vector, a 72-dimensional Edge Direction Histogram (EDH), a 512-dimensional Gist descriptor [31], a 4608-dimensional SIFT descriptor [26] and a 4000-dimensional LLC descriptor [48]. The distance matrices calculated on the multiview features are presented, and all distances are normalized to the range [0,1]. In Fig. 1(a), it is obvious that scene images I1 and I2 are from different classes and have different semantic content. However, as reported in Fig. 1(g), their Euclidean distance on RGB is 0.05, which cannot distinguish I1 from I2. This means that the dissimilarity between these two images cannot be measured accurately by their Euclidean distance on RGB. Using the EDH, Gist, Sift and LLC features, I1 and I2 are completely different, and the distances on these four features, as shown in Fig. 1(h), (i), (j) and (k), are 0.25, 0.16, 0.36 and 0.32, respectively. This means the EDH, Gist, Sift and LLC features play more important roles than RGB in measuring the dissimilarity between I1 and I2. More cases of the complementary characteristics of the multiview features can be found in Fig. 1. Section 4 presents a novel method to effectively

Fig. 2. Workflow of Pairwise Constraint based multiview subspace learning (PC-MSL) for scene classification.


and efficiently explore the complementary characteristics of multiview features, and integrate the pairwise constraints to bridge the semantic gap.

4. Multiview low-dimensional subspace learning with pairwise constraints

In this section, we describe the framework of pairwise constraints based multiview subspace learning (PC-MSL) for scene classification. The workflow of PC-MSL is presented in Fig. 2. First, multiview features are extracted to represent the scene images. Some queries are provided and coarse retrieval results are obtained based on the multiview features. Afterward, in accordance with the retrieval results, the user labels some images as relevant or irrelevant to the query images. The relevant images form the must-links in a positive set, and the irrelevant images form the cannot-links in a negative set. The Patch Alignment Framework (PAF) [59] is adopted to build the alignment matrix of the labeled data. Subsequently, PC-MSL constructs local patches for every image in each view. Based on these patches, the global coordinate alignment trick [59] can be applied to obtain the low-dimensional subspace for each view. We derive an iterative algorithm using alternating optimization [6] to obtain a group of appropriate weights for each


view, which can integrate them with the alignment matrix of the labeled data into an optimal subspace through the linearization procedure [16]. The projection matrix $W$ is obtained as the output. Finally, new image samples are projected into the low-dimensional space through $Y = W^T X$, and scene classification is conducted in the low-dimensional subspace. To clearly explain the technical details of the proposed PC-MSL, we present the important notations used in the rest of the paper. Capital letters, e.g., $X$, represent the image database. Lower-case letters, e.g., $x$, represent a specific image, and $x_i$ is the $i$th image of $X$. The superscript $(i)$, e.g., $X^{(i)}$ and $x^{(i)}$, denotes the features of images from the $i$th view. More important notations are listed in Table 1.
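The patch-construction step in the workflow above ($k_1$ same-class and $k_2$ different-class nearest neighbors per sample, detailed in Section 4.2) might be sketched as follows; the function name and toy data are our illustrative assumptions.

```python
import numpy as np

def build_patch(X, labels, j, k1, k2):
    """Sketch of the Section 4.2 patch construction for sample j in one view:
    k1 nearest neighbors from the same class, k2 from different classes.
    X: (N, m) matrix, one row per sample."""
    d = np.linalg.norm(X - X[j], axis=1)
    same = np.where((labels == labels[j]) & (np.arange(len(X)) != j))[0]
    diff = np.where(labels != labels[j])[0]
    same_nn = same[np.argsort(d[same])][:k1]   # nearest same-class neighbors
    diff_nn = diff[np.argsort(d[diff])][:k2]   # nearest different-class neighbors
    idx = np.concatenate(([j], same_nn, diff_nn))  # index set F_j
    return X[idx], idx

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))                   # 10 samples, 5-D toy view
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
patch, idx = build_patch(X, labels, j=0, k1=2, k2=3)
print(patch.shape)  # (6, 5)
```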

4.1. User labeling information

To integrate high-level semantics from the user labeling information, we first need to know which images are semantically relevant and which are irrelevant. Therefore, a labeling process is conducted to obtain the information necessary to construct the user-driven semantic space. Based on the labeled images, discriminative information is utilized to capture the labeling information. By studying the distribution of the images, we observe that the relevant images are similar to each other, while each irrelevant image is irrelevant in its own way. Therefore, the

Table 1
Important notations used in this paper.

$X = \{X^{(i)} = [x_1^{(i)}, \dots, x_N^{(i)}] \in \mathbb{R}^{m_i \times N}\}_{i=1}^{t}$ : a multiview dataset with $N$ images and $t$ views
$Y = [y_1, \dots, y_N] \in \mathbb{R}^{d \times N}$ : the low-dimensional subspace for the original dataset $X$
$W$ : the linear projection matrix used to achieve $Y = W^T X$
$S^+$: $(x_i, x_j) \in S^+$ : the relationship between relevant images (must-links)
$S^-$: $(x_i, x_j) \in S^-$ : the relationship between irrelevant images (cannot-links)
$X^{S^+} = [x_1^{S^+}, \dots, x_{N^+}^{S^+}]$ : the positive image set with $N^+$ images
$X^{S^-} = [x_1^{S^-}, \dots, x_{N^-}^{S^-}]$ : the negative image set with $N^-$ images
$X_j^{(i)}$ : the local patch for the datum $x_j^{(i)}$
$Y_j^{(i)}$ : the corresponding output for each patch $X_j^{(i)}$
$L_j^{(i)}$ : encodes the local geometry and the discriminative information for each patch
$L^P$ : the alignment matrix constructed from the pairwise constraints
$k_1$ : number of neighbors selected from data in the same class as $x_j^{(i)}$
$k_2$ : number of neighbors selected from data in different classes from $x_j^{(i)}$
$k_s$ : number of neighbors selected from the unlabeled data
$\beta = [\beta_1, \dots, \beta_t]$ : nonnegative weights imposed on the subspace learning of the different views

Fig. 3. Utilization of the user labeling information through pairwise constraints. In our approach, it is required that the relevant images in the set of must-links are as close as possible and the irrelevant images in the set of cannot-links are as far away as possible in the obtained low-dimensional subspace. In this figure, the sign "+" indicates images from the set of must-links and the sign "-" indicates images from the set of cannot-links. The blue solid lines represent the must-links, while the red dashed lines represent the cannot-links. (a) Image distribution in the original feature space; (b) image distribution in the obtained low-dimensional subspace. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


pairwise constraints, including must-links [52] and cannot-links [52], are utilized to learn the discriminative information. Fig. 3 shows a set of must-links ($S^+$: $(x_i, x_j) \in S^+$) representing the relationship between relevant scene images and a set of cannot-links ($S^-$: $(x_i, x_k) \in S^-$) describing the relationship between irrelevant scene images. To optimally preserve the relationship between the relevant images and separate the relevant images from the irrelevant ones, the image pairs in must-links are required to be as close as possible and the image pairs in cannot-links need to be as far apart as possible in the obtained low-dimensional subspace. The objective function can be defined as:

$$\min\left( \frac{1}{(N^+)^2} \sum_{x_i \in S^+} \sum_{x_j \in S^+} \| y_i - y_j \|^2 \;-\; \lambda \frac{1}{N^+ N^-} \sum_{x_i \in S^+} \sum_{x_k \in S^-} \| y_i - y_k \|^2 \right), \qquad (5)$$

where $N^+$ is the number of images in the set $S^+$, $N^-$ is the number of images in the set $S^-$, and $\lambda$ is a trade-off parameter controlling the effects of the two parts. The objective function in Eq. (5) can be rewritten as:

$$\frac{1}{(N^+)^2} \sum_{x_i \in S^+} \sum_{x_j \in S^+} \| y_i - y_j \|^2 - \lambda \frac{1}{N^+ N^-} \sum_{x_i \in S^+} \sum_{x_k \in S^-} \| y_i - y_k \|^2 = \frac{1}{(N^+)^2} \sum_{x_i \in S^+} \left( \sum_{x_j \in S^+} \| y_i - y_j \|^2 - \lambda' \sum_{x_k \in S^-} \| y_i - y_k \|^2 \right) = \frac{1}{(N^+)^2} \sum_{x_i \in S^+} b_i, \qquad (6)$$

where $\lambda' = \lambda N^+ / N^-$ and $b_i = \sum_{x_j \in S^+} \| y_i - y_j \|^2 - \lambda' \sum_{x_k \in S^-} \| y_i - y_k \|^2$. Therefore, the minimization of $b_i$ means that each relevant image $x_i \in S^+$ is expected to be close to all the other relevant images but far from all irrelevant ones.

The objective function in Eq. (6) can be solved by the Patch Alignment Framework (PAF) [59], which integrates popular dimensionality reduction algorithms, e.g., LLE, LE, ISOMAP and LPP, into a general framework. According to PAF, a local patch is built for each image in the set of must-links $S^+$, and the related matrix $L_i^{S^+}$ is derived from $b_i$. Hence, a coefficient vector

$$u_i = [\underbrace{1, \dots, 1}_{N^+}, \underbrace{-\lambda', \dots, -\lambda'}_{N^-}]^T$$

can be built. The corresponding low-dimensional representation is defined as $Y_i = [y_i, y_1^{S^+}, \dots, y_{N^+}^{S^+}, y_1^{S^-}, \dots, y_{N^-}^{S^-}]$, where $y_i^{S^+}$ corresponds to the $i$th image in the set of must-links $S^+$ and $y_i^{S^-}$ to the $i$th image in the set of cannot-links $S^-$. Therefore, $b_i$ can be reformulated as

$$b_i = \sum_{x_j \in S^+} \| y_i - y_j \|^2 - \lambda' \sum_{x_k \in S^-} \| y_i - y_k \|^2 = \sum_{k=1}^{N^+ + N^-} (u_i)_k \, \| (Y_i)_1 - (Y_i)_{k+1} \|^2 = \mathrm{tr}\left( Y_i \begin{bmatrix} e_{N_L}^T \\ -I_{N_L} \end{bmatrix} \mathrm{diag}(u_i) \begin{bmatrix} e_{N_L} & -I_{N_L} \end{bmatrix} Y_i^T \right) = \mathrm{tr}\left( Y_i L_i Y_i^T \right), \qquad (7)$$

where $(Y_i)_j$ is the $j$th column of $Y_i$, $N_L = N^+ + N^-$, $e_{N_L} = [1, \dots, 1]^T \in \mathbb{R}^{N_L}$, and $L_i = \begin{bmatrix} e_{N_L}^T \\ -I_{N_L} \end{bmatrix} \mathrm{diag}(u_i) \begin{bmatrix} e_{N_L} & -I_{N_L} \end{bmatrix}$. Here, each patch $Y_i$ has its own coordinate system, and all $Y_i$ can be aligned into a consistent coordinate system via selection matrices. Each patch $Y_i$ is selected from the global coordinate $Y$ as $Y_i = Y S_i$, where the selection matrix $S_i \in \mathbb{R}^{N \times (N_L + 1)}$ is defined as [59]

$$(S_i)_{pq} = \begin{cases} 1, & \text{if } p = F_i(q), \\ 0, & \text{else}, \end{cases} \qquad (8)$$

where $F_i = [i, i_1, \dots, i_{N_L}]$ is the index vector for the samples in $Y_i$. More details about the patch alignment framework can be found in [59]. Summing all local patches together, we obtain

$$\sum_{x_i \in S^+} \mathrm{tr}\left( Y_i L_i Y_i^T \right) = \sum_{x_i \in S^+} \mathrm{tr}\left( Y S_i L_i S_i^T Y^T \right) = \mathrm{tr}\left( Y \left( \sum_{x_i \in S^+} S_i L_i S_i^T \right) Y^T \right) = \mathrm{tr}\left( Y L^P Y^T \right), \qquad (9)$$

where $L^P = \sum_{x_i \in S^+} S_i L_i S_i^T$. Substituting Eq. (9) into Eq. (5), we have

$$\min\left( \frac{1}{(N^+)^2} \sum_{x_i \in S^+} \sum_{x_j \in S^+} \| y_i - y_j \|^2 - \lambda \frac{1}{N^+ N^-} \sum_{x_i \in S^+} \sum_{x_k \in S^-} \| y_i - y_k \|^2 \right) = \min \frac{1}{(N^+)^2} \mathrm{tr}\left( Y L^P Y^T \right). \qquad (10)$$
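As a rough numerical sketch of Eqs. (6)-(9), the following builds $L^P$ by accumulating the per-patch PAF matrices $S_i L_i S_i^T$ over the relevant images. The self-pair is omitted from each patch (its distance term vanishes, and omitting it avoids duplicate indices), and the function interface is our assumption for illustration.

```python
import numpy as np

def alignment_matrix(N, S_pos, S_neg, lam):
    """Sketch of Eqs. (6)-(9): accumulate the per-patch matrices
    S_i L_i S_i^T over relevant images into L^P.
    N: total image count; S_pos/S_neg: index lists; lam: lambda."""
    Np, Nn = len(S_pos), len(S_neg)
    lam_p = lam * Np / Nn                      # lambda' = lambda * N+ / N-
    LP = np.zeros((N, N))
    for i in S_pos:
        others = [j for j in S_pos if j != i]  # self-pair omitted (zero term)
        u = np.array([1.0] * len(others) + [-lam_p] * Nn)  # coefficient vector
        idx = [i] + others + list(S_neg)       # patch index set F_i
        # PAF patch matrix: [[sum(u), -u^T], [-u, diag(u)]]
        Li = np.block([[u.sum(), -u[None, :]], [-u[:, None], np.diag(u)]])
        LP[np.ix_(idx, idx)] += Li             # implicit S_i L_i S_i^T
    return LP

LP = alignment_matrix(N=8, S_pos=[0, 1, 2], S_neg=[6, 7], lam=0.5)
print(LP.shape)  # (8, 8)
```

A useful sanity check on this construction: every row of each patch matrix sums to zero, so the rows of $L^P$ also sum to zero.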

4.2. Multiview analysis for data distribution

The local patch in our method is built from a given labeled datum and its neighbors, which include data from both the same class and different classes. On each patch, an objective function is designed to preserve the local discriminative information. For a given datum $x_j^{(i)} \in X$, according to the class label information, the other data can be divided into two groups: data in the same class as $x_j^{(i)}$ and data from different classes. The $k_1$ nearest neighbors selected from data in the same class as $x_j^{(i)}$ are termed neighbor data from an identical class: $x_{j_1}^{(i)}, \dots, x_{j_{k_1}}^{(i)}$; the $k_2$ nearest neighbors selected from data in different classes are termed neighbor data from different classes: $x_{j_1}^{(i)}, \dots, x_{j_{k_2}}^{(i)}$. The local patch for the datum $x_j^{(i)}$ is constructed by putting these together: $X_j^{(i)} = [x_j^{(i)}, x_{j_1}^{(i)}, \dots, x_{j_{k_1}}^{(i)}, x_{j_1}^{(i)}, \dots, x_{j_{k_2}}^{(i)}]$.

For each patch $X_j^{(i)}$, the corresponding output in the subspace is $Y_j^{(i)} = [y_j^{(i)}, y_{j_1}^{(i)}, \dots, y_{j_{k_1}}^{(i)}, y_{j_1}^{(i)}, \dots, y_{j_{k_2}}^{(i)}]$. We expect the distances between the given datum and its neighbors from the identical class to be as small as possible, so we have:

$$\arg\min_{y_j^{(i)}} \sum_{h=1}^{k_1} \| y_j^{(i)} - y_{j_h}^{(i)} \|^2. \qquad (11)$$

At the same time, we expect the distances between the given datum and its neighbors from different classes to be as large as possible, so we have:

$$\arg\max_{y_j^{(i)}} \sum_{p=1}^{k_2} \| y_j^{(i)} - y_{j_p}^{(i)} \|^2. \qquad (12)$$

Since the patch constructed from the local neighborhood can be approximately estimated using Euclidean distance, the patch optimization can be formulated as the linear combination:

$$\arg\min_{y_j^{(i)}} \left( \sum_{h=1}^{k_1} \| y_j^{(i)} - y_{j_h}^{(i)} \|^2 - \lambda \sum_{p=1}^{k_2} \| y_j^{(i)} - y_{j_p}^{(i)} \|^2 \right), \qquad (13)$$


where $\lambda$ is a scaling factor in $[0, 1]$ that unifies the different measures of the within-class distance and the between-class distance. The coefficient vector can be defined as

$$u_j^{(i)} = [\underbrace{1, \dots, 1}_{k_1}, \underbrace{-\lambda, \dots, -\lambda}_{k_2}]^T. \qquad (14)$$

Therefore, Eq. (13) can be rewritten as

$$\arg\min_{y_j^{(i)}} \left( \sum_{h=1}^{k_1} \| y_j^{(i)} - y_{j_h}^{(i)} \|^2 - \lambda \sum_{p=1}^{k_2} \| y_j^{(i)} - y_{j_p}^{(i)} \|^2 \right) = \arg\min_{y_j^{(i)}} \sum_{h=1}^{k_1+k_2} (u_j^{(i)})_h \, \| y_{F_j\{1\}}^{(i)} - y_{F_j\{h+1\}}^{(i)} \|^2 = \arg\min_{Y_j^{(i)}} \mathrm{tr}\left( Y_j^{(i)} \begin{bmatrix} e_{k_1+k_2}^T \\ -I_{k_1+k_2} \end{bmatrix} \mathrm{diag}(u_j^{(i)}) \begin{bmatrix} e_{k_1+k_2} & -I_{k_1+k_2} \end{bmatrix} Y_j^{(i)T} \right) = \arg\min_{Y_j^{(i)}} \mathrm{tr}\left( Y_j^{(i)} L_j^{(i)} Y_j^{(i)T} \right), \qquad (15)$$

where $F_j = \{j, j_1, \dots, j_{k_1}, j_1, \dots, j_{k_2}\}$ is the index set for the $j$th patch; $e_{k_1+k_2} = [1, \dots, 1]^T \in \mathbb{R}^{k_1+k_2}$; $I_{k_1+k_2}$ is the $(k_1+k_2) \times (k_1+k_2)$ identity matrix; $\mathrm{diag}(\cdot)$ is the diagonalization operator; and $L_j^{(i)}$ encodes both the local geometry and the discriminative information:

$$L_j^{(i)} = \begin{bmatrix} \sum_{h=1}^{k_1+k_2} (u_j^{(i)})_h & -(u_j^{(i)})^T \\ -u_j^{(i)} & \mathrm{diag}(u_j^{(i)}) \end{bmatrix}. \qquad (16)$$

By using the whole alignment trick of Eq. (9), the alignment matrix $L^{(i)}$ for the $i$th view can be calculated.

Recent research in semi-supervised learning [61] shows that unlabeled data may be beneficial for exploring the local geometry of the data distribution in the original high-dimensional space. Hence, unlabeled data are incorporated into the local patch construction to improve the performance of subspace learning. The unlabeled data are appended to the original labeled data of the $i$th view as $X^{(i)} = [x_1^{(i)}, \dots, x_{N_l}^{(i)}, x_{N_l+1}^{(i)}, \dots, x_{N_l+N_u}^{(i)}]$, where the first $N_l$ data are labeled and the remaining $N_u$ are unlabeled. The patch optimization for each labeled datum is given by Eq. (15). Unlabeled data are exploited to enhance the local geometry. For each unlabeled datum $x_j^{(i)}$, $j = N_l+1, \dots, N_l+N_u$, its $k_s$ nearest neighbors $x_{j_1}^{(i)}, \dots, x_{j_{k_s}}^{(i)}$ are selected from all data, both labeled and unlabeled. $X_j^{(i)} = [x_j^{(i)}, x_{j_1}^{(i)}, \dots, x_{j_{k_s}}^{(i)}]$ represents the $j$th patch, and the associated index set is recorded in $F_j^U = \{j, j_1, \dots, j_{k_s}\}$. To preserve the local geometry of the $j$th patch, nearby data should stay nearby in the low-dimensional subspace, i.e., $y_j^{(i)} \in \mathbb{R}^d$ should be close to $y_{j_1}^{(i)}, \dots, y_{j_{k_s}}^{(i)}$:

$$\arg\min_{y_j^{(i)}} \sum_{h=1}^{k_s} \| y_j^{(i)} - y_{j_h}^{(i)} \|^2 = \arg\min_{Y_j^{(i)}} \mathrm{tr}\left( Y_j^{(i)} L_j^{U(i)} Y_j^{(i)T} \right), \qquad (17)$$

where $e_{k_s} = [1, \dots, 1]^T \in \mathbb{R}^{k_s}$, $I_{k_s}$ is the $k_s \times k_s$ identity matrix, and $L_j^{U(i)} = \begin{bmatrix} k_s & -e_{k_s}^T \\ -e_{k_s} & I_{k_s} \end{bmatrix}$.

Since complementary characteristics exist among the multiple views, the views contribute differently to the final low-dimensional subspace. To well explore the complementary properties of the different views, a set of nonnegative weights $\beta = [\beta_1, \dots, \beta_t]$ is imposed on the local patch optimizations of the different views independently. A larger $\beta_i$ indicates that $X_j^{(i)}$ plays a more important role in learning the low-dimensional representation $Y_j^{(i)}$. Summing over all views, the optimization for the labeled $j$th patch is

$$\arg\min_{\{Y_j^{(i)}\}_{i=1}^{t},\, \beta} \sum_{i=1}^{t} \beta_i \, \mathrm{tr}\left( Y_j^{(i)} L_j^{(i)} Y_j^{(i)T} \right), \qquad (18)$$

and the optimization for the unlabeled patch is

$$\arg\min_{\{Y_j^{(i)}\}_{i=1}^{t},\, \beta} \sum_{i=1}^{t} \beta_i \, \mathrm{tr}\left( Y_j^{(i)} L_j^{U(i)} Y_j^{(i)T} \right). \qquad (19)$$

Through the whole alignment stage, we obtain

$$\arg\min \sum_{i=1}^{t} \beta_i \left( \sum_{j=1}^{N_l} \min_{Y_j^{(i)}} \mathrm{tr}\left( Y_j^{(i)} L_j^{(i)} Y_j^{(i)T} \right) + \alpha \sum_{j=N_l+1}^{N_l+N_u} \min_{Y_j^{(i)}} \mathrm{tr}\left( Y_j^{(i)} L_j^{U(i)} Y_j^{(i)T} \right) \right) = \arg\min_{Y} \sum_{i=1}^{t} \beta_i \, \mathrm{tr}\left( Y \left( \sum_{j=1}^{N_l} S_j^{L(i)} L_j^{(i)} (S_j^{L(i)})^T + \alpha \sum_{j=N_l+1}^{N_l+N_u} S_j^{U(i)} L_j^{U(i)} (S_j^{U(i)})^T \right) Y^T \right) = \arg\min_{Y} \sum_{i=1}^{t} \beta_i \, \mathrm{tr}\left( Y L^{(i)} Y^T \right), \qquad (20)$$

where $\alpha$ is a control parameter; $S_j^{L(i)} \in \mathbb{R}^{(N_l+N_u) \times (k_1+k_2+1)}$ and $S_j^{U(i)} \in \mathbb{R}^{(N_l+N_u) \times (k_s+1)}$ are the selection matrices defined as in Eq. (8); and $L^{(i)} \in \mathbb{R}^{(N_l+N_u) \times (N_l+N_u)}$ is the alignment matrix. The constraint $YY^T = I$ is imposed on Eq. (20) to uniquely determine the low-dimensional subspace $Y$:

$$\arg\min_{Y,\, \beta} \sum_{i=1}^{t} \beta_i \, \mathrm{tr}\left( Y L^{(i)} Y^T \right) \quad \text{s.t. } YY^T = I; \; \sum_{i=1}^{t} \beta_i = 1, \; \beta_i \ge 0. \qquad (21)$$

4.3. Labeling information integration

The labeling information contained in $L^P$ (see Eq. (10)) can be treated as a novel view. By integrating the labeling information of Eq. (10) with the data distribution of Eq. (21), the objective function for multiview low-dimensional subspace learning can be reformulated as:
# !   eTks   ðiÞ ðiÞT , Yj eks Iks YðjiÞT ¼ arg mintr YðjiÞ LU j ðiÞ Iks Y j

ð17Þ

t X







bi tr YLðiÞ YT þ Ztr YLP YT



i¼1

s:t:YYT ¼ I;

t X

bi ¼ 1, bi Z 0

ð22Þ

i¼1

where Z is the controlling parameter. Since LP is constructed from the view of pairwise constraint, it can be rewritten as Lðt þ 1Þ . Thus,


J. Yu et al. / Pattern Recognition 46 (2013) 483–496

Thus, Eq. (22) can be reformulated as

$$\arg\min_{Y,\beta}\;\sum_{i=1}^{t+1}\beta_i\,\mathrm{tr}\!\left(YL^{(i)}Y^{T}\right)\quad\mathrm{s.t.}\;YY^{T}=I;\;\sum_{i=1}^{t+1}\beta_i=1,\;\beta_i\ge 0.\quad(23)$$

The solution of Eq. (23) is $\beta_k=1$ for the view with the minimum $\mathrm{tr}(YL^{(k)}Y^{T})$ and $\beta_i=0$ otherwise, so only one view is finally selected and the performance equals that of the best single view. This does not meet our objective of exploring the complementary property of multiple views to obtain a better embedding than any single view can provide. We avoid this degeneracy by adopting the technique of [47]: set $\beta_i\leftarrow\beta_i^{r}$ with $r>1$. With this trick, $\sum_{i=1}^{t+1}\beta_i^{r}$ achieves its minimum when $\beta_i=1/(t+1)$ subject to $\sum_{i=1}^{t+1}\beta_i=1$, $\beta_i>0$. The new objective function is

$$\arg\min_{Y,\beta}\;\sum_{i=1}^{t+1}\beta_i^{r}\,\mathrm{tr}\!\left(YL^{(i)}Y^{T}\right)\quad\mathrm{s.t.}\;YY^{T}=I;\;\sum_{i=1}^{t+1}\beta_i=1,\;\beta_i\ge 0.\quad(24)$$

We derive an iterative algorithm based on alternating optimization [6], which updates $Y$ and $\beta$ in an alternating way.

First, we fix $Y$ to update $\beta$. By adopting a Lagrange multiplier $\Omega$ for the constraint $\sum_{i=1}^{t+1}\beta_i=1$, the Lagrange function is

$$L(\beta,\Omega)=\sum_{i=1}^{t+1}\beta_i^{r}\,\mathrm{tr}\!\left(YL^{(i)}Y^{T}\right)-\Omega\!\left(\sum_{i=1}^{t+1}\beta_i-1\right).\quad(25)$$

Setting the derivatives of $L(\beta,\Omega)$ with respect to $\beta_i$ and $\Omega$ to zero, we obtain

$$\begin{cases}\dfrac{\partial L(\beta,\Omega)}{\partial\beta_i}=r\beta_i^{r-1}\,\mathrm{tr}\!\left(YL^{(i)}Y^{T}\right)-\Omega=0, & i=1,\ldots,t+1,\\[2mm] \dfrac{\partial L(\beta,\Omega)}{\partial\Omega}=\sum_{i=1}^{t+1}\beta_i-1=0.\end{cases}\quad(26)$$

Hence, $\beta_i$ can be calculated as

$$\beta_i=\left(1/\mathrm{tr}\!\left(YL^{(i)}Y^{T}\right)\right)^{1/(r-1)}\Big/\sum_{i=1}^{t+1}\left(1/\mathrm{tr}\!\left(YL^{(i)}Y^{T}\right)\right)^{1/(r-1)}.\quad(27)$$

The alignment matrix $L^{(i)}$ is positive semidefinite, so $\beta_i>0$ holds naturally. With $Y$ fixed, Eq. (27) gives the globally optimal $\beta$. Eq. (27) also shows how $r$ controls $\beta_i$: as $r\to\infty$, the different $\beta_i$ become close to each other; as $r\to 1$, only the $\beta_i$ corresponding to the minimum $\mathrm{tr}(YL^{(i)}Y^{T})$ over the views equals 1, and the others are 0. The choice of $r$ should therefore be based on the complementary characteristics of the views: rich complementarity prefers a large $r$; otherwise, $r$ should be small. We discuss the effect of $r$ in the experiments.

Second, we fix $\beta$ to update $Y$. The optimization problem is then equivalent to

$$\arg\min_{Y}\;\mathrm{tr}\!\left(YLY^{T}\right)\quad\mathrm{s.t.}\;YY^{T}=I,\quad(28)$$

where $L=\sum_{i=1}^{t+1}\beta_i^{r}L^{(i)}$. Because every $L^{(i)}$ is symmetric, $L$ is symmetric, and the optimal $Y$ is given by the eigenvectors associated with the $d$ smallest eigenvalues of $L$.

Based on the above derivations, we build an alternating optimization procedure, described in Fig. 4, to obtain a local optimal solution of PC-MSL. The algorithm converges because the objective function in Eq. (24) decreases as the number of iterations increases: with $\beta$ fixed, the optimal $Y$ reduces the value of the objective function, and with $Y$ fixed, the optimal $\beta$ reduces it as well.

The computational complexity of PC-MSL consists of two parts. The first part is the construction of the alignment

Fig. 4. Descriptions of the pairwise constraints based multiview subspace learning (PC-MSL).
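The alternating procedure of Fig. 4 can be summarized in a short numpy sketch, assuming the per-view alignment matrices $L^{(i)}$ (with the pairwise-constraint Laplacian $L^{P}$ passed as the $(t{+}1)$th one) have already been constructed; function and variable names are ours, not the authors' implementation:

```python
import numpy as np

def pc_msl(L_views, d, r=2.0, n_iter=5):
    """Alternating optimization for PC-MSL (sketch).

    L_views : list of (n, n) symmetric positive semidefinite
              alignment matrices, one per view; the pairwise-
              constraint Laplacian L^P is passed as the last one.
    d       : target subspace dimension.
    r       : weight-smoothing exponent, r > 1.
    Returns (Y, beta) with Y of shape (d, n), rows orthonormal.
    """
    n = L_views[0].shape[0]
    t1 = len(L_views)                        # t + 1 views in total
    beta = np.full(t1, 1.0 / t1)             # uniform initial weights
    Y = np.eye(n)[:d]                        # any row-orthonormal init
    for _ in range(n_iter):
        # fix Y, update beta in closed form            (Eq. 27)
        traces = np.array([np.trace(Y @ Li @ Y.T) for Li in L_views])
        inv = (1.0 / np.maximum(traces, 1e-12)) ** (1.0 / (r - 1.0))
        beta = inv / inv.sum()
        # fix beta, update Y: eigenvectors of the d smallest
        # eigenvalues of L = sum_i beta_i^r L^(i)      (Eq. 28)
        L = sum(b ** r * Li for b, Li in zip(beta, L_views))
        w, V = np.linalg.eigh(L)             # ascending eigenvalues
        Y = V[:, :d].T
    return Y, beta
```

Each iteration applies the closed-form weight update of Eq. (27) and then the eigenvector update of Eq. (28), so the objective of Eq. (24) does not increase across iterations.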


Fig. 5. Sample images from three image datasets used in the experiments: (a) sample images from natural scenes dataset (b) sample images from indoor scenes dataset (c) sample images from Caltech 256 dataset.

matrices for the different views, i.e., the computation of $L^{(i)}$. From Eq. (21), the complexity of this part is $O\!\left(\sum_{i=1}^{t+1}m_i\times N^{2}\right)$, where $N$ represents the total number of samples in the dataset. The second part is the alternating optimization: the update of $Y$ requires the eigenvalue decomposition of an $N\times N$ matrix, which is $O(N^{3})$, and the update of $\beta$ has time complexity $O((t+d+1)\times N^{2})$. Therefore, the overall time complexity of PC-MSL is $O\!\left(\left(N^{3}+(t+d+1)\times N^{2}\right)\times T+\sum_{i=1}^{t+1}m_i\times N^{2}\right)$, where $T$ is the number of training iterations; in practice, $T$ is always less than five. PC-MSL aims to obtain an optimal low-dimensional subspace for the database. In real applications, new data may be registered into the dataset, and learning a new low-dimensional subspace from scratch is time consuming and scales with the number of data points. In this situation, it is popular to use a linearization procedure.
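The linearization just mentioned can be realized, for example, in the spirit of locality preserving projections [16]: assume the embedding is approximately linear, $Y\approx A^{T}X$, and solve for the projection $A$ once, so that newly registered data are embedded without relearning the subspace. A minimal sketch under this assumption (the ridge regularization and the names are ours):

```python
import numpy as np

def linearize(X, Y, reg=1e-6):
    """Fit a linear projection A with Y ~= A^T X.

    X : (m, n) training data, one sample per column.
    Y : (d, n) learned low-dimensional coordinates.
    Returns A of shape (m, d).
    """
    m = X.shape[0]
    # ridge-regularized normal equations: (X X^T + reg I) A = X Y^T
    A = np.linalg.solve(X @ X.T + reg * np.eye(m), X @ Y.T)
    return A

def embed(A, X_new):
    """Project unseen samples into the learned subspace."""
    return A.T @ X_new
```

New samples are then mapped by a single matrix product instead of rerunning the eigendecomposition.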

5. Experimental results and discussions

To demonstrate the effectiveness of the proposed approach in scene classification, we conduct experiments on two datasets: a dataset of 15 natural scene categories [23] and a dataset of indoor scenes [33]. Besides, an object image dataset, Caltech256 [12], is utilized to demonstrate the generality of PC-MSL in

Table 2
Details of the three datasets adopted in the experiments.

Datasets          Sample size    Number of classes
Natural scenes    3000           20
Indoor scenes     1500           15
Caltech 256       2000           20

classification. Some sample images from the three image datasets are shown in Fig. 5. We compare the performance of the proposed Pairwise Constraint based Multiview Subspace Learning (PC-MSL) with different methods; the details are presented in Section 5.2.

5.1. Dataset descriptions

In order to evaluate the classification performance on scene images, we first conduct experiments on the dataset of natural scenes. The dataset used in our work contains 3000 images that belong to 15 natural scene categories [23]: bedroom, CALsuburb, industrial, kitchen, livingroom, MITcoast, MITforest, MIThighway, MITinsidecity, MITmountain, MITopencountry, MITstreet, MITtallbuilding, PARoffice, and store. We adopt five different features to describe the scenes, including Color Histogram (CH), Edge Direction Histogram (EDH), SIFT [26], Gist [31] and Locality-constrained Linear Coding (LLC) [48].



The subset of indoor scene images is selected from the indoor scenes dataset [33]. The subset used in our work contains 1500 images, categorized into 15 indoor scene groups: airport_inside, artstudio, auditorium, bakery, bar, bathroom, bedroom, bookstore, bowling, buffet, casino, children_room,


church_inside, classroom and cloister. The five different features describing the images are CH, EDH, SIFT, Gist and LLC. The Caltech 256-2000 is a subset of the Caltech256 dataset [12]. It contains 2000 images from 20 categories: chimp, bear, bowling-ball, light-house, cactus, iguana, toad, cannon,


Fig. 6. Performance evaluations on classification accuracy (%) of PCMSL, MSL, FCSL, DSSL, SSL-LE, SSL-PCA, SSL-LDA and SVM with different sampling percentage for training. (a) Results of natural scenes dataset; (b) results of indoor scenes dataset; (c) results of Caltech 256 dataset.


Fig. 7. Performance evaluations on classification accuracy (%) of PCMSL, MSL, FCSL, DSSL, SSL-LE, SSL-PCA, SSL-LDA and SVM with different dimensions. (a) Results of natural scenes dataset; (b) results of indoor scenes dataset; (c) results of Caltech 256 dataset.


chess-board, lightning, goat, pci-card, dog, raccoon, cormorant, hammock, elk, necktie and self-propelled-lawn-mower. We also adopt the five descriptors of CH, EDH, SIFT, GIST and LLC to represent the images. The details of these three datasets are presented in Table 2.

5.2. Experimental configurations

In the experiments, we randomly choose the labeled samples. Since the sizes of the three datasets vary widely, it is inappropriate to fix one number of labeled samples for all datasets; instead, we choose a percentage of the samples of each dataset as labeled data. For instance, if a dataset has 1000 samples and the sampling percentage is fixed to 10%, then 100 labeled samples are selected as the training data. Through the linearization procedure [16], we obtain the projection matrix that maps the test data into the low-dimensional subspace. Finally, a Support Vector Machine (SVM) [43] is applied in the low-dimensional subspace. For all classification approaches, we independently repeat the experiments 20 times with randomly selected training samples and report the averaged results with standard deviations. We compare the performance of eight methods:

1. Pairwise Constraint based Multiview Subspace Learning (PC-MSL). The parameters λ and α in Eqs. (14) and (20) are tuned by 5-fold cross-validation. The numbers of nearest neighbors k1 (same class) and k2 (different classes) are both set to 5. The numbers of must-links in the positive set and of cannot-links in the negative set are both set to 10.
2. Multiview Subspace Learning (MSL). The algorithm presented in Fig. 4, conducted without the pairwise constraints; we directly name it MSL. The parameters λ and α are tuned by 5-fold cross-validation, and k1 and k2 are both set to 5.
3. Feature Concatenation based Subspace Learning (FCSL). The multiview features CH, EDH, SIFT, Gist and LLC are concatenated into one long vector, and Laplacian Eigenmaps (LE) is used to obtain the low-dimensional subspace for the training data; the k of the k-nearest-neighbor graph in LE is fixed at 5. The linearization procedure produces the projection matrix, which is applied to the test data.
4. Distributed Spectral Subspace Learning (DSSL). Given multiview data, DSSL builds a low-dimensional subspace for each view separately and then constructs a common low-dimensional subspace close to each view-specific subspace; the solution can be found in Eq. (1). In our work, DSSL adopts LE as the subspace learning method for each view, with k fixed at 5.
5. Single-view Subspace Learning with LE (SSL-LE). LE is adopted in each view (CH, EDH, SIFT, Gist and LLC) to construct the low-dimensional subspace. The average performance over the views (Aver-SSL-LE) is reported.
6. Single-view Subspace Learning with PCA (SSL-PCA). PCA is used in each view to build the low-dimensional subspace. The average performance (Aver-SSL-PCA) is reported.
7. Single-view Subspace Learning with LDA (SSL-LDA). LDA is used in each view to build the low-dimensional subspace. The average performance (Aver-SSL-LDA) is reported.
8. Support Vector Machine (SVM). To provide a baseline, SVM is applied in the original feature spaces. In each view, SVM solves the multiclass problem, with the radius parameter and the regularization weight tuned to their optimal values. The average performance (Aver-SVM) is reported.

5.3. Experimental results

Fig. 6 presents the classification results on the three datasets. In this experiment, the sampling percentage for training varies over [10%, 20%, 30%, 40%, 50%]. The proposed PC-MSL performs better than MSL in all cases; although the margin is small in some cases, it is consistent, which indicates that integrating the labeling information of pairwise constraints with multiview features is beneficial. We also compare PC-MSL with the other six methods: Fig. 6 demonstrates that PC-MSL achieves the best results in all cases, which validates the effectiveness of the proposed method. Moreover, the performances of PC-MSL and MSL are clearly better than those of DSSL and FCSL, indicating that the adopted alternating optimization is effective in exploring the complementary characteristics of multiview features. In addition, Fig. 7 compares the performance of the different methods with the dimension varied over [10, 20, 30, 40, 50]. PC-MSL performs best in all cases. The results also show that the classification accuracy degrades as the dimension increases, with the best result always appearing when the dimension is reduced to 10; dimension reduction is therefore effective in improving the performance.
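The evaluation protocol of Section 5.2 (random sampling percentages, repeated runs, classification in the reduced subspace) can be sketched as follows; a 1-nearest-neighbor classifier stands in for the SVM to keep the sketch dependency-free, and `reduce_fn` is a placeholder for any of the compared subspace learners:

```python
import numpy as np

def evaluate(X, labels, reduce_fn, percents=(0.1, 0.3, 0.5),
             repeats=20, seed=0):
    """Mean/std accuracy versus training percentage (sketch).

    X : (m, n) data, one sample per column.
    reduce_fn(X_train, y_train) -> (m, d) projection matrix used
    on both splits; 1-NN replaces the SVM of the paper here.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    results = {}
    for p in percents:
        accs = []
        for _ in range(repeats):
            idx = rng.permutation(n)
            n_tr = max(1, int(p * n))
            tr, te = idx[:n_tr], idx[n_tr:]
            A = reduce_fn(X[:, tr], labels[tr])
            Ztr, Zte = A.T @ X[:, tr], A.T @ X[:, te]
            # 1-NN in the low-dimensional subspace
            diff = Zte[:, :, None] - Ztr[:, None, :]   # (d, n_te, n_tr)
            d2 = (diff ** 2).sum(axis=0)               # (n_te, n_tr)
            pred = labels[tr][d2.argmin(axis=1)]
            accs.append(float((pred == labels[te]).mean()))
        results[p] = (float(np.mean(accs)), float(np.std(accs)))
    return results
```

The returned dictionary maps each sampling percentage to (mean accuracy, standard deviation) over the repetitions, mirroring how the curves in Figs. 6 and 7 are produced.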


Fig. 8. Sensitivity evaluations of k1 and k2 in the datasets of natural scenes, indoor scenes and Caltech 256. (a) Results of k1, which increases in the range [1,3,5,7,9]; (b) results of k2, which increases in the range [5,10,20,40,80].



Fig. 9. Performance evaluations of parameters α, λ and group number on the datasets of natural scenes, indoor scenes and Caltech 256. (a) Results of α, which increases in the range [0.001, 0.01, 0.1, 1, 10, 100, 1000]; (b) results of λ, which increases in the range [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]; (c) results of group number, which differs in the range [5, 10, 20, 40, 80].

Fig. 8 presents the performance of PC-MSL with varied k1 and k2. In Fig. 8(a), k1 varies over [1, 3, 5, 7, 9] while k2 is fixed. On all three datasets, the classification accuracy rises as k1 increases from 1 to 5 and changes little as k1 increases from 5 to 9; the optimal k1 is therefore 5. Conversely, in Fig. 8(b), k2 varies over [5, 10, 20, 40, 80] while k1 is fixed. The classification accuracy of PC-MSL keeps decreasing as k2 grows from 5 to 80, which means the optimal k2 is also 5.

We also test the sensitivity of the parameters α, λ and the group number in PC-MSL. We first vary α over [0.001, 0.01, 0.1, 1, 10, 100, 1000]; the results are shown in Fig. 9(a). PC-MSL performs stably as α increases from 0.001 to 10, but its performance degrades severely as α is enlarged from 10 to 1000. We then vary λ over [0.1, 0.2, ..., 1.0]; the results are recorded in Fig. 9(b). The performance of PC-MSL increases as λ is enlarged from 0.1 to 0.8 and keeps stable as λ increases from 0.8 to 1. These results indicate that the optimal values of α and λ are 1 and 0.8, respectively. Furthermore, we test the effect of the group number; the results are presented in Fig. 9(c). The performance of PC-MSL slightly increases as the group number expands from 5 to 40 and keeps stable as it increases from 40 to 80.
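The k1 (intraclass) and k2 (interclass) neighborhoods whose sensitivity is evaluated in Fig. 8 are selected per sample as described in Section 4.2. A minimal sketch assuming Euclidean distances (the function name is ours):

```python
import numpy as np

def patch_indices(X, labels, j, k1, k2):
    """Return the j-th patch's index set F_j = [j, k1 same-class
    nearest neighbors, k2 different-class nearest neighbors],
    as used to build the patch matrix L_j^(i).

    X : (m, n) data of one view, one sample per column (floats).
    """
    d = ((X - X[:, [j]]) ** 2).sum(axis=0)   # squared distances to x_j
    d[j] = np.inf                            # exclude the sample itself
    same = np.flatnonzero(labels == labels[j])
    diff = np.flatnonzero(labels != labels[j])
    same_nn = same[np.argsort(d[same])[:k1]]  # k1 intraclass neighbors
    diff_nn = diff[np.argsort(d[diff])[:k2]]  # k2 interclass neighbors
    return np.concatenate(([j], same_nn, diff_nn))
```

The sensitivity study above varies k1 and k2 passed to this selection step while the rest of the pipeline is held fixed.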

6. Conclusions

This paper presents a novel pairwise constraints based multiview subspace learning (PC-MSL) method for scene classification. We first represent scene images by a group of multiview features and explore the complementary characteristics of these features. A novel method is proposed to solve the problem of multiview dimensionality reduction by learning a unified low-dimensional subspace that effectively fuses the multiview features. The proposed method takes both the intraclass and the interclass geometry into consideration; discriminability is effectively preserved because neighboring samples with different labels are taken into account. Due to the semantic gap, the fusion of multiview features alone still cannot achieve excellent scene classification performance in real applications. Hence, a user labeling procedure is introduced in our approach. Specifically, a query image is provided by the user, and a group of images is retrieved by a search engine. The user then labels some images in the retrieved set as relevant or irrelevant to the query; must-links are constructed between the relevant images, and cannot-links are built between the irrelevant images. Finally, we adopt alternating optimization to integrate the complementary nature of the different views with the user labeling information, yielding a novel multiview dimensionality reduction method for scene classification. Experiments are


conducted on the real-world datasets of natural scenes and indoor scenes, and the results demonstrate the effectiveness of the proposed method. The proposed method can also be applied to other classification problems: the experimental results of classification on the Caltech 256 dataset further suggest the effectiveness and generality of our method.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61100104), the Natural Science Foundation of Fujian Province of China (No. 2012J01287), CAS and Locality Cooperation Projects (ZNGZ-2011–012), the Guangdong-CAS Strategic Cooperation Program (No. 2010A090100016), the National Defense Basic Scientific Research Program of China, the Specialized Research Fund for the Doctoral Program of Higher Education of China (No. 20110121110020) and the National Defense Science and Technology Key Laboratory Foundation.

References [1] M. Baghshah, S. Shouraki, Metric learning for semi-supervised clustering using pairwise constraints and the geometrical structure of data, Intelligent Data Analysis 13 (6) (2009) 887–899. [2] M. Baghshah, S. Shouraki, Non-linear metric learning using pairwise similarity and dissimilarity constraints and the geometrical structure of data, Pattern Recognition 43 (8) (2010) 2982–2992. [3] R Bellman, Adaptive Control Processes: A Guided Tour, Princeton Univ. Press, Princeton, NJ, 1961. [4] M. Belkin, P. Niyogiy, Laplacian Eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (6) (2003) 1373–1396. [5] Y. Bengio, J. F. Paiement, P. Vincent, Out-of-sample extensions for LLE, isomap, MDS, Eigenmaps, and spectral clustering. In: Advances in Neural Information Processing Systems, 2003, pp. 177–184. [6] J. C. Bezdek, R. J. Hathaway, Some notes on alternating optimization. In: Proceedings of the AFSS International Conference on Fuzzy Systems, 2002, pp. 288–300. [7] M. Boutell, C. Brown, J. Luo, Review of the State of the Art in Semantic Scene Classification, Univ. Rochester, Rochester, NY, 2002. [8] J. Daugman, Two-dimensional spectral analysis of cortical receptive field profile, Vision Research 20 (1980) 847–856. [9] D.L. Donoho, C. Grimes, Hessian Eigenmaps: new locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences 100 (10) (2003) 5591–5596. [10] R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179–188. [11] W. Freeman, E. Adelson, The design and use of steerable filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (9) (1991) 891–906. [12] G. Griffin, A. Holub, P. Perona, The Caltech-256, Caltech Technical Report. [13] N. Guan, D. Tao, Z. Luo, B. Yuan, NeNMF: an optimal gradient method for nonnegative matrix factorization, IEEE Transactions on Signal Processing 60 (6) (2012) 2882–2898. [14] N. Guan, D. 
Tao, Z. Luo, B. Yuan, Online nonnegative matrix factorization with robust stochastic approximation, IEEE Transactions on Neural Networks and Learning Systems 23 (7) (2012) 1087–1099. [15] C. Harris, M. Stephens, A combined corner and edge detector. In: Proceedings of the Alvey Vision Conference, 1988, pp. 147–151. [16] X. He, P. Niyogi, Locality preserving projections. In: Proceedings of the of Neural Information Processing Systems, Whistler, BC, Canada, 2003. [17] X. He, D. Cai, S. Yan, H. Zhang, Neighborhood preserving embedding. In: Proceedings of the IEEE International Conference on Computer Visualization, 2005, pp. 1208–1213. [18] A.B. Hillel, T. Hertz, N. Shental, D. Weinshall, Learning distance functions using equivalence relations. In: International Conference on Computer Learning, 2003, pp. 11–18. [19] C. Hoi, W. Liu, M. Lyu, W. Ma, Learning distance metrics with contextual constraints for image retrieval. In: Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2006, pp. 2072–2078. [20] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology 24 (1933) 417–441. [21] K. Huang, S. Aviyente, Wavelet feature selection for image classification, IEEE Transactions on Image Processing 17 (9) (2008) 1709–1720. [22] A. Jain, A. Vailaya, Image retrieval using color and shape, Pattern Recognition 29 (8) (1996) 1233–1244.

495

[23] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE International Confernce on Computer Vision and Pattern Recognition, 2006, pp. 2169–2178. [24] F. Li, R. Fergus, A. Torralba, Recognizing and learning object categories. In: Proceedings of the of Computer Vision and Pattern Recognition, Short Course, 2007. [25] B. Long, P. S. Yu, Z. Zhang, A general model for multiple view unsupervised learning. In: Proceedings of the 8th SIAM International Conference on Data Mining, 2008, pp. 822–833. [26] D. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110. [27] B. Manjunath, W.Y. Ma, Texture features for browsing and retrieval of image data, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (8) (1996) 837–842. [28] J. Mao, A.K. Jain, Texture classification and segmentation using multiresolution simultaneous autoregressive models, Pattern Recognition 25 (2) (1992) 173–188. [29] K. Mikolajczyk, C. Schmid, Scale & affine invariant interest point detectors, International Journal of Computer Vision 60 (1) (2004) 63–86. [30] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (10) (2005) 1615–1630. [31] A. Oliva, A. Torralba, Modeling the shape of the scene: a holistic representation of the spatial envelope, International Journal of Computer Vision 42 (3) (2001) 145–175. [32] G. Pass, R. Zabih, J. Miller, Comparing images using color coherence vectors. In: Proceedings of the ACM International Conference on Multimedia, 1996, pp. 65–73. [33] A. Quattoni, A Torralba, Recognizing indoor scenes. In: Proceedings of the Computer Vision and Pattern Recognition, 2009, pp. 413–420. [34] P. Remagnino, A. Shihab, G. 
Jones, Distributed intelligence for multi-camera visual surveillance, Pattern Recognition 37 (4) (2004) 675–689. [35] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (22) (2000) 2323–2326. [36] M. Shiga, H. Mamitsuka, Efficient semi-supervised learning on locally informative multiple graphs, Pattern Recognition 45 (3) (2012) 1035–1049. [37] M. Swain, D. Ballard, Color indexing, International Journal of Computer Vision 7 (1) (1991) 11–32. [38] D. Tao, X. Tang, X. Li, X. Wu, Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (7) (2006) 1088–1099. [39] D. Tao, X. Li, X. Wu, S. Maybank, Geometric mean for subspace selection, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 260–274. [40] J. Tenenbaum, V. Silva, J. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (22) (2000) 2319–2323. [41] X. Tian, D. Tao, Y. Rui, Sparse transfer learning for interactive video search reranking, ACM Transactions on Multimedia Computing, Communications and Applications 8 (3) (2012), Article no. 26. [42] A. Ude, R. Dillmann, Vision-based robot path planning, Advances in Robot Kinematics and Computational Geometry (1994) 505–512. [43] V.N. Vapnik, Statistical Learning Theory, Wiley-Interscience, 1998. [44] M. Varma, A. Zisserman, A statistical approach to texture classification from single images, International Journal of Computer Vision 62 (1–2) (2005) 61–81. [45] D. Vizireanu, Generalizations of binary morphological shape decomposition, Journal of Electronic Imaging 16 (1) (2007) 1–6. [46] D. Vizireanu, S. Halunga, G. Marghescu, Morphological skeleton decomposition interframe interpolation method, Journal of Electronic Imaging 19 (2) (2010) 1–3. [47] M. Wang, X. S. Hua, X. Yuan, Y. Song, L. R. 
Dai, Optimizing multigraph learning: towards a unified video annotation scheme. In: Proceedings of the ACM Multimedia, 2007, pp. 862–870. [48] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification. In: Proceedings of the of Computer Vision and Pattern Recognition, 2010, pp. 3360–3367. [49] T. Xia, D. Tao, T. Mei, Y. Zhang, Multiview spectral embedding, IEEE Transactions on Systems, Man and Cybernetics, Part B 40 (6) (2010) 1438–1446. [50] T. Xiang, S. Gong, Activity based surveillance video content modeling, Pattern Recognition 41 (7) (2008) 2309–2326. [51] B. Xie, Y. Mu, D. Tao, K. Huang, m-SNE: Multiview Stochastic Neighbor Embedding, IEEE Transactions on Systems, Man and Cybernetics, Part B 41 (4) (2011) 1088–1096. [52] E.P. Xing, A.Y. Ng, M.I. Jordan, S. Russell, Distance metric learning with application to clustering with side information. In: Proceedings of the Neural Information Processing Systems, 2003, pp. 505–512. [53] J. Yu, D. Liu, D. Tao, H. Seah, Complex object correspondence construction in 2D animation, IEEE Transactions on Image Processing 20 (11) (2011) 3257–3269. [54] J. Yu, D. Tao, M. Wang, Adaptive hypergraph learning and its application in image classification, IEEE Transactions on Image Processing 21 (7) (2012) 3262–3272.


[55] J. Yu, M. Wang, D. Tao, Semi-supervised multiview distance metric learning for cartoon synthesis, IEEE Transactions on Image Processing (2012), http://dx.doi. org/10.1109/TIP.2012.2207395. [56] J. Yu, D. Liu, D. Tao, On combining multiple features for cartoon character retrieval and clip synthesis, IEEE Transactions on Systems, Man and Cybernetics, Part B (2012), http://dx.doi.org/10.1109/TSMCB. 2012.2192108. [57] G. Yu, G. Zhang, C. Domeniconi, Z. Yu, J. You, Semi-supervised classification based on random subspace dimensionality reduction, Pattern Recognition 45 (3) (2012) 1119–1135. [58] Z. Zhang, H. Zha, Principal manifolds and nonlinear dimension reduction via local tangent space alignment, SIAM Journal of Scientific Computing 26 (1) (2005) 313–338.

[59] T. Zhang, D. Tao, X. Li, J. Yang, Patch alignment for dimensionality reduction, IEEE Transactions on Knowledge and Data Engineering 21 (9) (2009) 1299–1313. [60] Z. Zhao, H. Liu, Multi-source feature selection via geometry dependent covariance analysis. In: Proceedings of the JMLR Workshop Conference, 2008, pp. 36–47. [61] X. Zhu, A. Goldberg, R. Brachman, T. Dietterich, Introduction to SemiSupervised Learning, Morgan and Claypool, 2009. [62] A. Zien, C. S. Ong, Multiclass multiple kernel learning. In: Proceedings of the 24th ICML, Corvallis, OR, 2007, pp. 1191–1198.