
Regularized discriminant embedding for visual descriptor learning

Kye-Hyeon Kim a, Rui Cai b, Lei Zhang b, Seungjin Choi a,*

a Department of Computer Science and Engineering, Pohang University of Science and Technology (POSTECH), 77 Cheongam-ro, Nam-gu, Pohang 790-784, Republic of Korea
b Microsoft Research Asia, No. 5 Dan Ling Street, Haidian District, Beijing 100080, China

Neurocomputing 149 (2015) 1048–1057

Article history: Received 1 August 2013; received in revised form 26 May 2014; accepted 21 July 2014; available online 4 August 2014. Communicated by Deng Cai.

Abstract

Visual descriptor learning seeks a projection that embeds local descriptors (e.g., SIFT descriptors) into a new Euclidean space where pairs of matching descriptors (positive pairs) are better separated from pairs of non-matching descriptors (negative pairs). The original descriptors often confuse positive pairs with negative pairs, because images differ in viewpoint, resolution, noise, and illumination: local points labeled "non-matching" can yield descriptors that are close together (irrelevant-near), and local points labeled "matching" can yield descriptors that are far apart (relevant-far). In this paper, we formulate the embedding as a regularized discriminant analysis that emphasizes relevant-far pairs and irrelevant-near pairs to better separate negative pairs from positive pairs. We then extend our method to nonlinear mappings by employing recent work on explicit kernel maps. Experiments on object retrieval for landmark buildings in Oxford and Paris demonstrate the high performance of our method compared to existing methods.

Keywords: Discriminant analysis; Kernel method; Local descriptor; Object retrieval

1. Introduction

Comparing images by matching their local interest points¹ [1–3] is a fundamental preliminary task in many multimedia applications and computer vision problems, including object recognition [4,5], near-duplicate media detection [6,7], image retrieval [8–10], and scene alignment [11]. Local descriptors are extracted from a small image patch around each local interest point, and two local points belonging to different images are matched if their local descriptors are close enough in the feature space. The scale-invariant feature transform (SIFT) [3] is a well-known method for extracting interest points and their local descriptors from a given image. SIFT features are robust to minor appearance changes engendered in a local image patch by varying environmental conditions (e.g., viewpoint, illumination, noise, and resolution).

However, the robustness of SIFT features is limited to small changes only. The appearance near a local point can vary widely due to significant changes in environmental conditions, leading to a large variance of the SIFT features extracted from the point. Such

* Corresponding author. Tel.: +82 54 279 2259; fax: +82 54 279 2299. E-mail addresses: [email protected] (K.-H. Kim), [email protected] (R. Cai), [email protected] (L. Zhang), [email protected] (S. Choi).
¹ There are many synonyms for these, including local points, interest points, and physical points. In this paper, we use each synonym according to context; all have the same meaning.

instability of local features is a great difficulty for image matching and its applications. Another issue is distinctiveness: since a local descriptor represents very limited information about a local point, two local descriptors having different contexts can be rather close in the feature space when their local image patches look similar. Such ambiguity is also a major limitation of local features.

Extensive research has been conducted to overcome these limitations. One line of research is to develop more robust and distinctive local features, including PCA-SIFT [12], multi-step feature extraction [13], the Walsh–Hadamard transform [14], and kernel descriptors [15]. However, compared to SIFT, those customized features are rather complicated to compute and not widely proven in their general performance or applicability. Another line of research is descriptor learning [16–19], which consists in learning a projection that maps given local features (e.g., SIFT features) to a new feature space where matching descriptors are closer to each other and non-matching descriptors are farther from each other. To this end, two categories of training data are required:

1. Relevant descriptors (or matching descriptors), which belong to the same class and thus should be closer to each other for better robustness to intra-class changes.
2. Irrelevant descriptors (or non-matching descriptors), which belong to different classes and thus should be farther from each other for more inter-class discriminative power in local description.


Descriptors that belong to the same class are extracted from the same local point in various images taken under different environmental conditions. Compared to customized features, descriptor learning can easily be incorporated into any existing local features to improve their intra-class robustness and inter-class distinctiveness. We address descriptor learning also because of this wide applicability.

In this paper, we present a novel learning strategy to further improve the performance gain of descriptor learning. First, we show that the pairwise distance between local descriptors in the original feature space can be a strong clue for determining which kinds of training data are essential for descriptor learning. We define four categories of local descriptor pairs according to the pairwise distance and relevance of each pair:

1. Relevant-Near (Rel-Near): A relevant pair lying quite close together in the original feature space. We define a pair as Rel-Near if both descriptors belong to the same class and one descriptor is among the k nearest neighbors (k NNs, k = 5) of the other, considering all descriptors in that class. Because such a pair is already well matched, it is not very worthwhile as training data for improving the matching performance of local features.

2. Relevant-Far (Rel-Far): A relevant pair, but not close enough. We define Rel-Far pairs as all pairs of relevant descriptors except for Rel-Near pairs. For a Rel-Far pair, the two descriptors are extracted from the same local point but have significant differences in their feature values due to varying environmental conditions. Thus, Rel-Far pairs are important for training in order to improve the robustness of local features against intra-class variations.

3. Irrelevant-Near (Irr-Near): An irrelevant pair, but close enough to be easily mistaken for matching descriptors. We define an Irr-Near pair as a local descriptor and each of its k NN descriptors among all irrelevant descriptors. The small pairwise distance implies that Irr-Near pairs mostly lie near a boundary between different classes. Thus, Irr-Near pairs are important for training in order to improve the inter-class distinctiveness of local features.

4. Irrelevant-Far (Irr-Far): An irrelevant pair far apart. We define Irr-Far pairs as all pairs of irrelevant descriptors except for Irr-Near pairs. Irr-Far pairs contain the overall scattering information between classes, but most of the pairs are already

well separated in the original feature space, so they are not very important as training data.

Fig. 1(a) shows the distribution of the pairwise distances in the SIFT feature space, where 2 × 10^4 pairs are randomly chosen in each category from among 5 × 10^5 SIFT descriptors. According to their distances, Rel-Near and Irr-Far pairs are already well separated in the SIFT space. By contrast, a significant overlap exists between the Rel-Far and Irr-Near distributions (≈30% of their area), i.e., many Rel-Far pairs lie farther apart than Irr-Near pairs in the SIFT space. Thus, the success of descriptor learning depends strongly on differentiating between Rel-Far and Irr-Near pairs.

To further emphasize Rel-Far and Irr-Near pairs during learning, we propose a regularized learning framework in which each category of training pairs is weighted differently in the cost. We seek a linear projection T that maximizes the ratio of the variance of non-matching differences to that of matching differences:

J(T) = (β_IN Σ_{(i,j)∈P_IN} d_ij(T) + β_IF Σ_{(i,j)∈P_IF} d_ij(T)) / (β_RN Σ_{(i,j)∈P_RN} d_ij(T) + β_RF Σ_{(i,j)∈P_RF} d_ij(T)),   (1)

where d_ij(T) denotes the squared distance ||T(x_i − x_j)||², and P_RN, P_RF, P_IN, P_IF denote the training sets belonging to Rel-Near, Rel-Far, Irr-Near, and Irr-Far, respectively. We introduce four regularization constants β_RN, β_RF, β_IN, β_IF to control the importance of each category appropriately. In [16], β_RN = β_RF = β_IN = β_IF = 1, i.e., all pairs are equally important regardless of their pairwise distances. We propose setting the regularization constants as follows:

1. β_RN ≪ β_RF, to enhance the contribution of Rel-Far pairs;
2. β_IN ≫ β_IF, to enhance the contribution of Irr-Near pairs.

In this way, our method separates Rel-Far and Irr-Near pairs more clearly in the projected feature space (Fig. 1(b)).

Fig. 1. (a) Distribution of Euclidean distances in the SIFT feature space for the four categories of local descriptor pairs: Rel-Near, Rel-Far, Irr-Near, and Irr-Far; Err (Rel vs. Irr) = 15.58%, Err (RFar vs. INear) = 29.92%. Err (Rel vs. Irr) measures the proportion of the overlapping region between {Rel-Near, Rel-Far} and {Irr-Near, Irr-Far}, while Err (RFar vs. INear) measures the overlap between Rel-Far and Irr-Near. (b) Distance distributions and Bayes optimal errors obtained by our learning method; Err (Rel vs. Irr) = 8.34%, Err (RFar vs. INear) = 16.30%. Compared to the original SIFT space, relevant and irrelevant pairs, and in particular Rel-Far and Irr-Near pairs, are more separable.

We also extend our method to nonlinear learning. By adapting recent work on kernelization with explicit feature maps [20], our nonlinear learning method has more discriminative power while imposing the same computational cost as its linear counterpart. In image retrieval experiments on landmark buildings in Oxford and Paris, our selective learning scheme outperformed existing descriptor learning methods and achieved a considerable

performance improvement even for unseen images of landmark buildings.
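To make Eq. (1) concrete, the following is a minimal numpy sketch (our own illustration, not the authors' code; all names are ours) that evaluates the weighted criterion for a candidate projection T, given the four pair sets:

```python
import numpy as np

def objective_J(T, X, pairs, betas):
    """Evaluate Eq. (1): weighted scatter of irrelevant pairs over
    weighted scatter of relevant pairs, under the projection T.

    T:     (r, d) projection matrix
    X:     (N, d) array of local descriptors
    pairs: dict mapping 'RN', 'RF', 'IN', 'IF' to (m, 2) index arrays
    betas: dict of regularization constants for the same keys
    """
    def weighted_sum(key):
        ij = pairs[key]
        if len(ij) == 0:
            return 0.0
        diff = X[ij[:, 0]] - X[ij[:, 1]]           # x_i - x_j for each pair
        d = np.sum((diff @ T.T) ** 2, axis=1)      # d_ij(T) = ||T(x_i - x_j)||^2
        return betas[key] * d.sum()

    num = weighted_sum('IN') + weighted_sum('IF')  # irrelevant pairs: push apart
    den = weighted_sum('RN') + weighted_sum('RF')  # relevant pairs: pull together
    return num / den

# e.g., the proposed emphasis: ignore Rel-Near, down-weight Irr-Far
# betas = {'RN': 0.0, 'RF': 1.0, 'IN': 1.0, 'IF': 1.0 / n_data}
```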

2. Related work

In this section, we summarize previous work that significantly motivated ours. Our regularized learning framework (Eq. (1)) generalizes the previous methods as well as our own: the relative weights of the four categories of training pairs are set differently in each method, as shown in Table 1.

Table 1
Differences between learning methods in terms of the regularization constants.

Method                 Rel-Near vs. Rel-Far    Irr-Near vs. Irr-Far
Hua et al. [16]        β_RN = β_RF             β_IN = β_IF
Philbin et al. [18]    β_RN ≫ β_RF             β_IN ≫ β_IF
Our method             β_RN ≪ β_RF             β_IN ≫ β_IF

2.1. Linear discriminant embedding

Hua et al. [16] proposed linear discriminant embedding (LDE), which learns a linear transformation T such that the variance of relevant pairs is minimized while that of irrelevant pairs is maximized. The ground-truth pairs are extracted from 3D correspondences [21] between multiple training images of the same scene. Given the relevant and irrelevant training pairs, we further divide these into Near and Far matches according to closeness, and then focus on the two challenging cases, the Rel-Far and Irr-Near pairs. Fig. 2 shows some examples where each triplet consists of a local descriptor (center), its Rel-Far descriptor (left), and its

Irr-Near descriptor (right). For all examples, the Rel-Far pairs lie farther apart in the SIFT space than the Irr-Near pairs, due to different resolutions, illuminations, and viewpoints. In most cases, LDE fails to map Irr-Near pairs farther away than Rel-Far pairs, whereas our method successfully maps all those pairs.

Fig. 2. Six examples where each local descriptor (center in each triplet) has a closer Irr-Near descriptor (right) than Rel-Far descriptor (left) in the SIFT space. To show the structural context of a local descriptor, we show a wider patch in which a square marks the actual local region. We show the distance values computed in (1) the original SIFT space, (2) a space obtained by kernelized LDE [16], and (3) a space obtained by our method, where we want the distance to the Rel-Far descriptor (left-hand value) to be less than ("<") the distance to the Irr-Near descriptor (right-hand value). Our method successfully maps all Irr-Near descriptors farther away than the corresponding Rel-Far descriptors.

2.2. Learning from near and far matching pairs

Philbin et al. [18] proposed collecting training pairs via a homography [22] between two images of the same scene. Since any two images related by a homography are usually almost identical, all the collected training pairs are either Rel-Near or Irr-Near. Some randomly chosen Irr-Far pairs are also used, where the balance between the Irr-Near and Irr-Far costs is controlled by regularization constants. This work motivates us to consider the closeness between training pairs, not only their relevance. However, that work did not consider Rel-Far pairs and focused on distinguishing between Rel-Near and Irr-Near pairs. Since there is not much ambiguity between Rel-Near and Irr-Near pairs in terms of their pairwise distances (Fig. 1(a)), the performance gain from learning would be limited. Additionally, many Rel-Far pairs could be pushed farther apart in order to push Irr-Near pairs farther apart, which may reduce the robustness against large intra-class variations.

2.3. Nonlinear descriptor learning

For nonlinear learning, scalability becomes a major issue, since a large number of training data are required (e.g., 3 × 10^6 pairs [18]) for a meaningful improvement in real applications of descriptor learning. For this reason, recent work has favored multi-layer networks, including convolutional neural networks (CNNs) [17], deep belief networks (DBNs) [18], and hierarchical kernel machines [19], since the semi-parametric and

hierarchical structures provide better scalability than other nonlinear methods (e.g., kernel methods). However, it is well known that multi-layer networks suffer from local minima, slow convergence, over-fitting, and tedious parameter tuning. Our method adopts explicit feature maps [20] that provide an efficient kernelization by obtaining an explicit kernel space for the original feature space. In experiments, our kernel method not only exhibited much better scalability and generalization ability compared to DBNs but also was free from the known problems of multi-layer networks.

3. Proposed method

In this section, we give the details of our regularized learning framework and suggest an appropriate regularization setting for emphasizing Rel-Far and Irr-Near training pairs. We also provide some examples in which Rel-Far and Irr-Near pairs are helpful. Finally, we extend our method to nonlinear learning.

3.1. Weighted discriminant embedding

The optimal projection for Eq. (1), denoted by T* = [t_1, …, t_r]^T, can be found by solving a generalized eigenvalue problem,

(β_IN S_IN + β_IF S_IF) t = λ (β_RN S_RN + β_RF S_RF) t,   (2)

where S_RN, S_RF, S_IN, S_IF denote the scatter matrices of the Rel-Near, Rel-Far, Irr-Near, and Irr-Far pairs, respectively:

S_RN = Σ_{(i,j)∈P_RN} (x_i − x_j)(x_i − x_j)^T,
S_RF = Σ_{(i,j)∈P_RF} (x_i − x_j)(x_i − x_j)^T,
S_IN = Σ_{(i,j)∈P_IN} (x_i − x_j)(x_i − x_j)^T,
S_IF = Σ_{(i,j)∈P_IF} (x_i − x_j)(x_i − x_j)^T.   (3)

Suppose that N ground-truth training data are given as {(x_i, y_i)}_{i=1}^N, where x_i ∈ R^d is a feature vector and y_i ∈ {1, …, C} is a ground-truth class label. The precise definition of each category of training pairs is as follows:

(i, j) ∈ P_RN  if x_j is a k NN of x_i among {x_j′ | y_j′ = y_i},
(i, j) ∈ P_RF  if y_i = y_j and (i, j) ∉ P_RN,
(i, j) ∈ P_IN  if x_j is a k NN of x_i among {x_j′ | y_j′ ≠ y_i},
(i, j) ∈ P_IF  if y_i ≠ y_j and (i, j) ∉ P_IN.   (4)

Since finding the k NNs is rather time-consuming, we efficiently group the pairs by their pairwise distances in the original space:

(i, j) ∈ P_RN  if y_i = y_j and ||x_i − x_j|| ≤ θ_R,
(i, j) ∈ P_RF  if y_i = y_j and ||x_i − x_j|| > θ_R,
(i, j) ∈ P_IN  if y_i ≠ y_j and ||x_i − x_j|| ≤ θ_I,
(i, j) ∈ P_IF  if y_i ≠ y_j and ||x_i − x_j|| > θ_I,   (5)

where we set the thresholds in our experiments to θ_R = 150 and θ_I = 400, as determined from the distance distribution of each set (Fig. 1(a)).

From Eq. (4), it can easily be seen that the number of Far pairs is significantly larger than the number of Near pairs:

1. Rel-Near vs. Rel-Far: In class c, each descriptor x_i has k Rel-Near descriptors and N_c − 1 − k Rel-Far descriptors, where N_c denotes the number of descriptors in class c. Thus, there are kN_c Rel-Near and N_c(N_c − 1 − k) Rel-Far pairs in class c. Even for large-scale training data (e.g., N = 5 × 10^5 in our experiments), N_c does not exceed several hundred. Thus, denoting the upper bound of the class size by M, there are approximately O(kMC) Rel-Near pairs and O(M²C) Rel-Far pairs over all C classes. Consequently, there are M/k (several tens of) times more Rel-Far pairs than Rel-Near pairs.

2. Irr-Near vs. Irr-Far: Each x_i belonging to class c has k Irr-Near descriptors and N − N_c − k Irr-Far descriptors; thus, there are O(kN) Irr-Near and N(N − N_c − k) = O(N²) Irr-Far pairs over all N data. Therefore, there are N/k (≈10^5) times more Irr-Far pairs than Irr-Near pairs.

Such an imbalance of sizes between the Near and Far sets entails a strong bias toward Rel-Far pairs and an extremely strong bias toward Irr-Far pairs when β_RN = β_RF = β_IN = β_IF = 1. That is, LDE [16] focuses on distinguishing between Rel-Far and Irr-Far pairs. Since there is not much ambiguity between the Rel-Far and Irr-Far distributions in the original SIFT space (Fig. 1(a)), the performance gain from learning would be limited. Moreover, reducing the variance of Rel-Far pairs could make Irr-Near pairs closer, which may reduce the discriminative power near class boundaries.

Consequently, we modify the relative importance so that learning focuses on Rel-Far and Irr-Near pairs. We empirically found that the highest performance is achieved with β_RN = 0 (i.e., completely ignoring Rel-Near pairs) and β_IF = 1/N (i.e., balancing the importance of the O(kN) Irr-Near pairs and the O(N²) Irr-Far pairs). In [18], the balancing of Irr-Near and Irr-Far pairs was also addressed, by using a subset of Irr-Far pairs as small as the set of Irr-Near pairs.

Algorithm 1 gives the detailed procedure of our regularized discriminant embedding (RDE) method. After learning, the optimal projection T transforms given local features (e.g., SIFT features) into our RDE space, as prescribed in Algorithm 2.

Algorithm 1. Regularized discriminant embedding.
Require: Training data {(x_i, y_i)}_{i=1}^N; projected dimension r
1: Find P_RN, P_RF, P_IN, P_IF (Eq. (5))
2: Compute S_RN, S_RF, S_IN, S_IF (Eq. (3))
3: Set β_RN = 0, β_RF = 1, β_IN = 1, β_IF = 1/N
4: Solve the generalized eigenvalue problem (Eq. (2)) to compute the eigenvectors t_1, …, t_r of the r largest eigenvalues λ_1, …, λ_r
Ensure: RDE mapping T = [t_1, …, t_r]^T ∈ R^{r×d}

Algorithm 2. Obtaining learned descriptors by RDE.
Require: A given descriptor x = [x_1, …, x_d]^T; mapping T
Ensure: A mapped descriptor z = [z_1, …, z_r]^T = Tx
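A compact numpy/scipy sketch of Algorithms 1 and 2 follows (our own rendering, not the authors' code; variable names are ours). For clarity it enumerates all pairs via the thresholds of Eq. (5); at N = 5 × 10^5 one would subsample pairs, as the paper does for Irr-Far:

```python
import numpy as np
from scipy.linalg import eigh

def learn_rde(X, y, r, theta_R=150.0, theta_I=400.0):
    """Algorithm 1: learn the RDE projection from descriptors X (N, d)
    with class labels y (N,); returns the (r, d) mapping T."""
    N, d = X.shape
    S = {k: np.zeros((d, d)) for k in ('RN', 'RF', 'IN', 'IF')}
    for i in range(N):                               # group pairs by Eq. (5)
        diff = X[i] - X[i + 1:]
        dist = np.linalg.norm(diff, axis=1)
        same = (y[i] == y[i + 1:])
        for key, mask in (('RN', same & (dist <= theta_R)),
                          ('RF', same & (dist > theta_R)),
                          ('IN', ~same & (dist <= theta_I)),
                          ('IF', ~same & (dist > theta_I))):
            D = diff[mask]
            S[key] += D.T @ D                        # scatter matrices, Eq. (3)
    beta = {'RN': 0.0, 'RF': 1.0, 'IN': 1.0, 'IF': 1.0 / N}
    A = beta['IN'] * S['IN'] + beta['IF'] * S['IF']  # irrelevant side of Eq. (2)
    B = beta['RN'] * S['RN'] + beta['RF'] * S['RF']  # relevant side of Eq. (2)
    w, V = eigh(A, B + 1e-8 * np.eye(d))             # generalized eigenproblem;
    order = np.argsort(w)[::-1][:r]                  # small ridge keeps B definite
    return V[:, order].T                             # T = [t_1, ..., t_r]^T

def apply_rde(T, X):
    """Algorithm 2: map descriptors into the learned RDE space."""
    return X @ T.T
```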

3.2. Toy examples

Fig. 3 presents 2D examples to illustrate the different behaviors of focusing on Rel-Near, Rel-Far, Irr-Near, and Irr-Far pairs. We show that (1) focusing on Rel-Far pairs is good for capturing the global intra-class distribution, and (2) focusing on Irr-Near pairs is good for improving the discriminative power near the inter-class boundary. Thus, our method can be expected to achieve performance improvements over the existing learning methods.

Fig. 3(a) shows the difference between focusing on Rel-Near pairs and focusing on Rel-Far pairs when the global class distribution cannot be captured by the local scatter. In this example, the global scatter is diagonal, but each class consists of three distinct local regions in which there are no meaningful directions of scattering, so Rel-Near pairs cannot capture the global class distribution. Thus, the projections obtained by "RN,IN" and "RN,IF" are biased toward the difference between irrelevant pairs (i.e., the y-axis), which is not desirable in this example.


Fig. 3. Two behavioral examples of four different scatters. Given 2D data points in two classes (red circles and blue triangles), we plot four 1D projections ("RN,IF", "RF,IF", "RN,IN", and "RF,IN") learned from different categories of training pairs.

Fig. 3(b) shows the difference between focusing on Irr-Near pairs and focusing on Irr-Far pairs near the class boundary. Since Irr-Near pairs lie near the boundary, focusing on them can capture the true class boundary (x = 0). By contrast, Irr-Far pairs try to maximize the overall variance between the two classes, so "RF,IF" and "RN,IF" cannot yield separable projections.

Both examples suggest that our method ("RF,IN") obtains appropriate projections, whereas Philbin's work [18] ("RN,IN") is limited by a lack of Rel-Far information in capturing the global intra-class distribution, and Hua's LDE [16] ("RF,IF") is limited by a lack of Irr-Near information in capturing the inter-class boundary.

3.3. Nonlinear descriptor learning

Due to the nonlinearity of the changes in local appearance under varying environmental conditions (e.g., viewpoint and illumination), a desirable subspace for local descriptors of real-world images is usually nonlinear [18].

We extend our method by kernelization [23]. Given a non-Euclidean similarity metric K(x, x′) between two input data x and x′ in the original feature space, kernel methods implicitly provide a nonlinear mapping Φ(x) from the input space into a high-dimensional feature space. The non-Euclidean metric becomes the linear inner product in the mapped space, i.e.,

K(x, x′) = Φ(x)^T Φ(x′).   (6)

In short, a kernelized method is equivalent to its linear counterpart operating on the kernel space. The major problem of kernel methods is their poor scalability with respect to the size of the training dataset: given N training samples, we require O(N²) space to maintain the kernelized scatter matrices and O(N³) time to solve the generalized eigenvalue problem. Since descriptor learning usually involves a large number of local descriptors (e.g., from 5 × 10^5 [16] to 3 × 10^6 [18] training pairs), it is infeasible to use standard kernel methods. This is why multi-layer networks were preferred in previous work on descriptor learning [17–19]. Because of their semi-parametric and hierarchical structures, multi-layer networks are capable of learning with millions of training data. However, multi-layer networks suffer from many difficulties, such as local minima, slow convergence, over-fitting, and tedious parameter tuning. The overall learning time is rather long due to the poor convergence,

and the generalization ability is limited by over-fitting and sensitivity to parameter tuning. For example, in [18], a DBN trained with Oxford building images was applied to the retrieval of Paris building images and could not achieve a significant performance improvement: the mean average precision scores of raw SIFT and the DBN were 0.655 and 0.678, respectively, on the Paris images, vs. 0.613 and 0.662, respectively, on the Oxford images. These results suggest that the DBN was somewhat over-fitted to the training data.

In this work, we kernelize our method by adopting recent work on explicit feature maps [20], which compute an approximation of the explicit kernel space Φ(x). Explicit feature maps are applicable only if the given non-Euclidean metric is (1) additive and (2) homogeneous. The first condition is expressed as

K(x, x′) = Σ_{t=1}^{d} k(x_t, x_t′),  where x = [x_1, …, x_d]^T,   (7)

for a 1D metric k(x, x′), and the second condition is expressed as c k(x, x′) = k(cx, cx′) for all c > 0. Well-known metrics to which explicit feature maps apply include the Bhattacharyya coefficient (or Hellinger kernel), the chi-square kernel, and the intersection kernel [20].

Since the 1D metric is also decomposable, as k(x, x′) = ϕ(x)^T ϕ(x′), explicit feature maps approximate ϕ(x) by a discrete Fourier transform (DFT) with a sampling rate n. That is, ϕ(x) is approximated by the 2n + 1 samples of a DFT,² denoted by ϕ̂(x) ∈ R^{2n+1}, and thus each k(x_t, x_t′) in Eq. (7) becomes approximately ϕ̂(x_t)^T ϕ̂(x_t′). Therefore, Eq. (7) can be rewritten as

K(x, x′) ≈ Σ_{t=1}^{d} ϕ̂(x_t)^T ϕ̂(x_t′)
         = [ϕ̂(x_1); …; ϕ̂(x_d)]^T [ϕ̂(x_1′); …; ϕ̂(x_d′)]
         = Φ̂(x)^T Φ̂(x′).   (8)

That is, explicit feature maps generate a (2n + 1)d-dimensional approximate kernel space, Φ̂(x) ∈ R^{(2n+1)d}. Mapping each x ∈ R^d into Φ̂(x) requires only (2n + 1)d DFT computations. Thus, millions of data can be handled, as the overall time and space for mapping all N data are O(ndN), i.e., linear with respect to the training dataset size. Algorithms 3 and 4 summarize our kernel RDE (KRDE) learning method.

² The equations for computing ϕ̂(x) are given in [20] for several 1D metrics. The software is also available at http://www.vlfeat.org/download.html and http://www.vlfeat.org/mdoc/VL_HOMKERMAP.html.

Algorithm 3. Kernel regularized discriminant embedding.
Require: Training data {(x_i, y_i)}_i; projected dimension r; 1D kernel k(x, x′); sampling rate n
1: Find P_RN, P_RF, P_IN, P_IF (Eq. (5))
2: Compute Φ̂(x_i) ∈ R^{(2n+1)d} (Eq. (8)) for all i
3: Compute S_RN, S_RF, S_IN, S_IF by summing the terms (Φ̂(x_i) − Φ̂(x_j))(Φ̂(x_i) − Φ̂(x_j))^T over all pairs in each set P_RN, P_RF, P_IN, P_IF
4: Set β_RN = 0, β_RF = 1, β_IN = 1, β_IF = 1/N
5: Solve the generalized eigenvalue problem (Eq. (2)) to compute the eigenvectors t_1, …, t_r of the r largest eigenvalues λ_1, …, λ_r
Ensure: KRDE mapping T = [t_1, …, t_r]^T ∈ R^{r×(2n+1)d}

Algorithm 4. Obtaining learned descriptors by KRDE.
Require: A descriptor x = [x_1, …, x_d]^T; KRDE mapping T; 1D kernel k(x, x′); sampling rate n
Ensure: A mapped descriptor z = T Φ̂(x) ∈ R^r
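As a concrete illustration of Eq. (8), here is a numpy sketch of the explicit feature map for the chi-square kernel, following the closed form given in [20] (our own code, not the paper's; the sampling period L is a tunable constant that [20] and VLFeat's VL_HOMKERMAP choose automatically, and the value 0.5 here is only an assumption):

```python
import numpy as np

def chi2_feature_map(X, n=1, L=0.5):
    """Approximate explicit feature map for the additive chi-square
    kernel K(x, x') = sum_t 2 x_t x'_t / (x_t + x'_t), following [20].
    X: (N, d) with nonnegative entries (e.g., L1-normalized SIFT
    histograms). Returns (N, (2n+1)*d) features whose inner products
    approximate K."""
    kappa = lambda lam: 1.0 / np.cosh(np.pi * lam)   # spectrum of the chi2 signature
    N, d = X.shape
    logX = np.log(np.where(X > 0, X, 1.0))           # safe log; zero entries map to 0 anyway
    feats = [np.sqrt(L * kappa(0.0) * X)]            # j = 0 sample of the DFT
    for j in range(1, n + 1):                        # j = 1..n: paired cos/sin samples
        amp = np.sqrt(2.0 * L * kappa(j * L) * X)
        feats.append(amp * np.cos(j * L * logX))
        feats.append(amp * np.sin(j * L * logX))
    # interleave so each input dimension yields 2n+1 consecutive outputs
    return np.stack(feats, axis=2).reshape(N, -1)
```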

Additive homogeneous kernels are known to be proper similarity metrics between two normalized histograms [20,24]. Since a SIFT descriptor is represented by a 128D normalized histogram, such metrics reflect well the nonlinearity of SIFT features. Vedaldi and Zisserman [20] reported that (1) the chi-square kernel obtains the best performance, and (2) the sampling rate n = 3 yields a nearly perfect approximation of the chi-square kernel.
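A quick sanity check (ours, not from the paper) compares exact chi-square kernel values against inner products of the mapped features; with n = 3 the two should agree closely, consistent with the report in [20]:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((2, 128))
X /= X.sum(axis=1, keepdims=True)                  # two mock normalized histograms

exact = np.sum(2.0 * X[0] * X[1] / (X[0] + X[1]))  # additive chi-square kernel
Phi = chi2_feature_map(X, n=3)                     # map from the sketch above
approx = float(Phi[0] @ Phi[1])
print(exact, approx)                               # expected to agree closely
```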

4. Experiments on images of landmark buildings

In this section, we show the superior performance of our learning method on images of landmark buildings.

Training database: In our experiments, we collected N = 5 × 10^5 SIFT descriptors of 5000 local interest points from images of 11 landmark buildings (see the top row of Fig. 4 for examples), where each local point corresponds to 100 SIFT descriptors, i.e., there are 5000 classes and 100 data points per class. We chose the k = 5 nearest neighbors for collecting Rel-Near and Irr-Near pairs. Table 2 summarizes the size of our training dataset. Note that for Hua's LDE [16], 5 × 10^5 SIFT pairs were collected from only two scenes (the Trevi Fountain and Yosemite's Half Dome); we collected a much larger training dataset from more scenes in order to cover a wide variety of local interest points for landmark buildings. Since there are too many Irr-Far pairs, we randomly sampled as many Irr-Far pairs as Irr-Near pairs and then multiplied the scatter matrix S_IF by N/k ≈ 10^5.

Test database: For the evaluation on unseen images, we randomly sampled 8 × 10^4 instances of Rel-Near, Rel-Far, Irr-Near, and Irr-Far


pairs (2 × 10^4 of each) from images of another 11 landmark buildings (see the bottom row of Fig. 4 for examples), which are totally unseen in our training database.

Table 2
Training dataset information.

Item                                 Size
Total number of descriptors N       500 K
Number of classes C                 5 K
Class size M                        100
Neighborhood size k                 5
Number of Rel-Near pairs |P_RN|     ≈1.2 M
Number of Rel-Far pairs |P_RF|      ≈23 M
Number of Irr-Near pairs |P_IN|     ≈1.2 M
Number of Irr-Far pairs |P_IF|      ≈125 B

Fig. 4. Sample images of landmark buildings: images of 11 buildings for learning (top) and 11 more for testing (bottom).

Performance measure: We measured the Bayes optimal error for those 8 × 10^4 test pairs, i.e., the overlapping region between the distributions of relevant and irrelevant data, as shown in Fig. 1:

1. Err (Rel vs. Irr) is the overlap between the distance distributions of {Rel-Near, Rel-Far} and {Irr-Near, Irr-Far};
2. Err (RFar vs. INear) is the overlap between Rel-Far and Irr-Near pairs, which are the more important pairs for the robustness and distinctiveness of local descriptors.

We also evaluated the retrieval performance for images of buildings in Oxford and Paris [18] and compared our method to recent descriptor learning methods.

Summary of performance: Table 3 summarizes the best performance obtained by each method, where the learned space dimension is r = 64. In addition to the previous work of Hua et al. [16] and Philbin et al. [18], we measured the performance of (1) some classic dimensionality reduction methods (PCA, LDA, and LPP [25]) and (2) linear/nonlinear SVMs on raw SIFT. PCA and LPP performed worse than raw SIFT, since they lose some of the information in raw SIFT without any guidance from labels. LPP was even poorer than PCA, since it prevents Irr-Near pairs from moving far away from each other. For the SVMs, we used the element-wise absolute difference of two descriptors as input features, i.e., |x_i − x_i′| between two local descriptors x = [x_1, …, x_d]^T and x′ = [x_1′, …, x_d′]^T. Compared to the descriptor learning methods, the performance improvement by SVMs was limited. For the nonlinear versions of our method and the SVMs, we tested four kernel functions to which explicit feature mapping is applicable [20]: the Hellinger kernel, the chi-square kernel, the Jensen–Shannon (JS) kernel, and the intersection kernel. Table 3 shows that these kernels obtained similar performance, except for the Hellinger kernel, whose intrinsic dimensionality is the same as the input dimension.

4.1. Performance improvement by regularization

We compared various regularization settings, including the settings in previous work [16,18] and our proposed settings. Relative to the baseline (i.e., β_RN = β_RF = β_IN = β_IF = 1), we propose reductions in β_RN and β_IF so that Rel-Far and Irr-Near pairs are more emphasized in learning. Setting β_RF = β_IN = 1, we measured the performance under variations within the ranges 0 ≤ β_RN ≤ 1 and 0 ≤ β_IF ≤ 1. Fig. 5 shows the following results:

1. Varying β_RN: According to Table 2, there are ≈20 times more Rel-Far pairs than Rel-Near pairs, i.e., setting β_RN = β_RF is already biased toward Rel-Far pairs. Thus, the effect of reducing β_RN on the performance is unnoticeable, although reducing β_RN yields slightly less error in most cases.

2. Varying β_IF: According to Table 2, setting β_IF = 1/N makes the relative importance of Irr-Near pairs k times greater than that of Irr-Far pairs. In the other cases, there are extremely strong biases toward either Irr-Near (β_IF = 0, …, 1/N²) or Irr-Far (β_IF = 1/√N, …, 1). Fig. 5 shows that balancing Irr-Near pairs (near-boundary scatter) and Irr-Far pairs (overall inter-class scatter) yields the highest performance.

Table 3
Bayes optimal error (%) of various methods.

Method                 Model       Dimension   Rel vs. Irr   RFar vs. INear
SIFT                   Raw data    128         15.58         29.92

Supervised classifiers
SVM                    Linear      128         14.12         26.35
SVM (Hellinger)        Nonlinear   128         13.74         26.54
SVM (chi-square)       Nonlinear   128         12.71         24.18
SVM (JS)               Nonlinear   128         12.83         24.69
SVM (intersection)     Nonlinear   128         12.19         24.04

Dimension reduction
PCA                    Linear      64          16.65         32.31
LDA                    Linear      64          11.02         20.91
LPP [25]               Linear      64          17.92         36.49

Descriptor learning
Hua et al. [16]        Linear      64          10.97         20.85
Philbin et al. [18]    Nonlinear   64          11.53         22.47

Our method
RDE                    Linear      64          11.53         23.51
KRDE (Hellinger)       Nonlinear   64          9.95          19.78
KRDE (chi-square)      Nonlinear   64          8.18          16.01
KRDE (JS)              Nonlinear   64          8.34          16.50
KRDE (intersection)    Nonlinear   64          8.26          16.09
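For reference, a short sketch (ours; the input names are hypothetical) of how the overlap-based error reported in Tables 3–5 can be estimated from two samples of pairwise distances:

```python
import numpy as np

def overlap_error(dist_a, dist_b, bins=100):
    """Estimate the overlap between two distance distributions, the
    quantity reported as the Bayes optimal error in the tables.
    dist_a, dist_b: 1-D arrays of pairwise distances."""
    lo = min(dist_a.min(), dist_b.min())
    hi = max(dist_a.max(), dist_b.max())
    edges = np.linspace(lo, hi, bins + 1)
    pa, _ = np.histogram(dist_a, bins=edges)
    pb, _ = np.histogram(dist_b, bins=edges)
    pa = pa / pa.sum()                         # normalize to probabilities
    pb = pb / pb.sum()
    return np.minimum(pa, pb).sum()            # overlapping proportion

# e.g., Err(RFar vs INear) on mapped descriptors (names hypothetical):
# err = overlap_error(dists_relfar, dists_irrnear)
```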

To measure the performance under each setting, we used our KRDE (Algorithms 3 and 4) with the chi-square kernel and n = 3, and we set the learned space dimension to r = 64. Fig. 1(b) shows the distance distribution obtained by our KRDE with the proposed setting. Table 4 compares our best setting with the settings in previous work [16,18]. The results are consistent with our insights and clearly show that our regularized framework achieves a significant improvement in performance.

Table 4
Various regularizations and their Bayes optimal errors (%).

Method                β_RN   β_RF   β_IN   β_IF   Rel vs. Irr   RFar vs. INear
Hua et al. [16]       1      1      1      1      10.97         20.85
Philbin et al. [18]   1      0      1      1/N    11.53         22.47
Our method            0      1      1      1/N    8.34          16.30

Fig. 5. Bayes optimal error Err (RFar vs. INear) under β_RF = β_IN = 1 with varying β_RN and β_IF.

4.2. Performance improvement by kernelization

We next measured the performance improvement due to kernelization. We tested KRDE with various sampling rates in the range 0 ≤ n ≤ 3 to measure the trade-off between the approximation quality and the learning time. In all cases, we obtained a 64-dimensional (r = 64) learned space. Table 5 lists the results.

Table 5
Bayes optimal error (%) of raw SIFT, our linear RDE, and kernel RDE for various sampling rates n. The difference between test and training errors is also shown, with the learning time for each method.

Measure                         SIFT    RDE     n = 0   n = 1   n = 2   n = 3
Training err, Rel vs. Irr       15.73   11.34   9.27    7.41    7.34    7.29
Training err, RFar vs. INear    29.10   22.56   16.91   14.04   13.92   13.84
Test err, Rel vs. Irr           15.58   11.53   9.70    8.18    8.23    8.34
Test err, RFar vs. INear        29.92   23.51   18.54   16.01   16.16   16.30
Err difference, Rel vs. Irr     −0.15   0.19    0.43    0.77    0.89    1.05
Err difference, RFar vs. INear  0.72    0.95    1.63    1.97    2.24    2.46
Time (s)                        –       33      58      119     251     438

Benefit of nonlinear learning: As we had expected, KRDE clearly outperformed its linear counterpart on both training and test data.

Generalization ability: Table 5 shows that slightly better performance was achieved on test data for n = 1 than for larger n. Since larger n yields a higher-dimensional kernel space (128D for n = 0, 384D for n = 1, 640D for n = 2, and 896D for n = 3), the nonlinearity of KRDE becomes lower as n decreases. Moreover, the difference between the errors on test and training data ("Err difference" in Table 5) shows that less nonlinearity leads to better generalization ability. For n = 0 and the linear case (RDE), however, the overall performance was poor because the nonlinearity was limited too much. Consequently, the results imply that n = 1 yields the best trade-off between nonlinearity and generalization ability while providing sufficient approximation quality for descriptor learning. For all n ≥ 1, however, KRDE still achieved reasonable generalization ability.

Learning time: Table 5 shows that a lower sampling rate n reduces the learning time of our KRDE, since explicit feature maps scale linearly with the kernel space dimension (2n + 1)d. Our method is extremely fast, taking only a few minutes (on a 2.93 GHz Intel(R) Core(TM) i7-870 CPU) for learning with our 5 × 10^5 training data. In [18], Philbin's nonlinear DBN was trained with 3 × 10^6 training pairs for 50 iterations; on a modern CPU, it took 35 min per iteration, i.e., the total learning time was approximately 1 day.

Table 6
Image retrieval performance (mAP) for landmark buildings in Oxford and Paris.

Dataset   Method   Model       Dimension   mAP score
Oxford    SIFT     Raw data    128         0.611
          LDE      Linear      32          0.585
          KDE      Nonlinear   64          0.656
          DBN      Nonlinear   32          0.644
                               64          0.662
          KRDE     Nonlinear   32          0.654
                               64          0.678
Paris     SIFT     Raw data    128         0.649
          KDE      Nonlinear   64          0.673
          DBN      Nonlinear   32          0.669
                               64          0.678
          KRDE     Nonlinear   32          0.674
                               64          0.700

Fig. 6. Some retrieval results for images of buildings in Oxford and Paris. A given query image is shown in the first column with its mAP scores. Six relevant images are shown with their rankings as determined by SIFT, kernelized LDE (KDE), and KRDE. Our KRDE features are more robust to large changes in viewpoint and illumination.

4.3. Image retrieval for unseen buildings

We applied our method to image retrieval and compared its performance to that of two recent methods: (1) Hua's LDE [16], along with its kernelized version (KDE), and (2) Philbin's DBN [18]. For the LDE and DBN, we referred directly to the performance reported in [18]. For the KDE and our KRDE, we subjected our 5 × 10^5 training data (introduced at the beginning of this section with Fig. 4) to Algorithms 3 and 4 with the proper regularization settings (β_RN = β_RF = β_IN = β_IF = 1 for the KDE, or β_RN = 0, β_RF = β_IN = 1, and β_IF = 1/N for the KRDE). The kernelization was

performed through explicit feature maps with n = 1.

Databases: We used datasets of buildings in Oxford (5069 images)³ and Paris (6349 images)⁴. Each dataset consists of 11 landmark buildings in a city and 55 queries (five images for each building) with ground-truth information.

Method: We used the bag-of-words model for retrieval. After mapping all local descriptors by each method, 5 × 10^5 visual words were generated using k-means clustering of those descriptors. Each image was then represented as a vector in 5 × 10^5 dimensions, where the i-th entry of the vector is the number of local descriptors belonging to the i-th visual word. An approximate k NN search⁵ [26] was used for efficient clustering and mapping (see [18] for details). A sketch of this pipeline appears below.

Performance measure: We measured the mean average precision (mAP) score by averaging the average precision (the area under the precision-recall curve) over all queries.

Overall performance: Table 6 shows a comparison of the results. Note that the KRDE and KDE were trained with our 5 × 10^5 data, in which none of the buildings is in Oxford or Paris. By contrast, the LDE and DBN were trained with images of Oxford, so the mAP scores of the LDE and DBN for Oxford may exaggerate their true performance. Even though it learned from totally different images of landmark buildings, our KRDE significantly improved the image retrieval performance for both databases, with the same numbers of learned features (32 and 64) and the same size of visual vocabulary (5 × 10^5).

Some matching examples: Fig. 6 shows some examples of retrieval results in terms of mAP, including easy cases (Hertford College, Notre Dame, and Les Invalides), ordinary cases (Ashmolean Museum), and difficult cases (Magdalen College and Moulin Rouge). Also shown are the matching performances of the raw SIFT features, the KDE features, and our KRDE features. The results of the DBN can be inferred from those of the KDE, since the DBN and the KDE achieved comparable mAP scores. For images having environmental conditions similar to those of the query images (second column in Fig. 6), the SIFT features also performed well and were comparable to our KRDE features. However, as the changes in viewpoint or illumination increased, the gap in matching performance between the SIFT features and the KRDE features became ever larger; under large changes, the gap between the KDE features and the KRDE features also became fairly large. Our KRDE features found more actual correspondences between images of the same building, so those images were ranked higher than with the SIFT and KDE features. This improvement in matching performance should be helpful for many multimedia applications based on local descriptor matching.

³ http://www.robots.ox.ac.uk/~vgg/data/oxbuildings/
⁴ http://www.robots.ox.ac.uk/~vgg/data/parisbuildings/
⁵ http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN/
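A condensed sketch of the bag-of-words step described above (our own simplification: scikit-learn's MiniBatchKMeans stands in for the approximate k NN clustering of [26], and the vocabulary is shrunk for illustration):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def bow_histograms(descs_per_image, T, n_words=1000, seed=0):
    """Map each image's local descriptors with a learned projection T
    (for KRDE, pass descriptors already lifted by the explicit feature
    map), cluster them into visual words, and return one histogram per
    image."""
    mapped = [D @ T.T for D in descs_per_image]        # learned descriptors
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed)
    km.fit(np.vstack(mapped))                          # build the vocabulary
    hists = [np.bincount(km.predict(M), minlength=n_words)
             for M in mapped]
    return np.asarray(hists, dtype=float)              # (n_images, n_words)
```

Retrieval then ranks database images by the similarity of their histograms to the query image's histogram.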

5. Conclusions

We have investigated local descriptor learning and provided some new insights. Considering the pairwise distance between given pairs of training data, we identified two essential categories of local descriptor pairs for descriptor learning: Relevant-Far and Irrelevant-Near. We proposed a regularized framework whose criterion is fitted to emphasize these two categories of training pairs. Thanks to explicit feature maps, we derived a kernel method for nonlinear learning, which requires only a few minutes for learning with millions of SIFT descriptors, and has

good generalization ability. On image retrieval tasks, our method outperformed existing descriptor learning methods as well as the original SIFT features, since correspondences between images of the same building were better captured by our method. The results suggest that our method is widely applicable to multimedia applications based on local feature representations and matching.

In image classification or retrieval, local descriptors are usually represented by the bag-of-words model, where a visual word corresponds to a representative of each cluster. In this paper, we used conventional k-means clustering in the image retrieval experiments. As future work, we will further improve the performance by using recent work (e.g., bilevel visual words coding [27]) that improves the discriminative power of visual words.

Improving the computation time and memory usage of SIFT is another important research topic. Several recent works [28–30] proposed binary local descriptors, which are much more efficient than SIFT descriptors for storage, comparison, and extraction, while still preserving useful characteristics of SIFT. Strecha et al. [31] proposed a descriptor learning method for binary features, but it suffers from the difficulty of separating Irr-Near pairs from relevant pairs in the feature space, as LDE does. We also plan to extend our descriptor learning method to a binary version.

Acknowledgments

This work was supported by the NIPA-MSRA Creative IT/SW Research Project, the National Research Foundation (NRF) of Korea (NRF-2013R1A2A2A01067464), and the IT R&D Program of MSIP/IITP (14-824-09-014, Machine Learning Center).

References

[1] C. Harris, M. Stephens, A combined corner and edge detector, in: Proceedings of the Alvey Vision Conference, 1988, pp. 147–151.
[2] L.J. Van Gool, T. Moons, D. Ungureanu, Affine/photometric invariants for planar intensity patterns, in: Proceedings of the European Conference on Computer Vision (ECCV), 1996, pp. 642–651.
[3] D.G. Lowe, Object recognition from local scale-invariant features, in: Proceedings of the International Conference on Computer Vision (ICCV), 1999, pp. 1150–1157.
[4] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2003, pp. 264–271.
[5] A.Y.-S. Chia, S. Rahardja, D. Rajan, M.K.H. Leung, Structural descriptors for category level object detection, IEEE Trans. Multimed. 11 (8) (2009) 1407–1421.
[6] Y. Ke, R. Sukthankar, L. Huston, Efficient near-duplicate detection and sub-image retrieval, in: Proceedings of the ACM International Conference on Multimedia (MM), 2004, pp. 869–876.
[7] W.-L. Zhao, C.-W. Ngo, H.-K. Tan, X. Wu, Near-duplicate keyframe identification with interest point matching and pattern learning, IEEE Trans. Multimed. 9 (5) (2007) 1037–1048.
[8] K. Mikolajczyk, C. Schmid, Indexing based on scale invariant interest points, in: Proceedings of the International Conference on Computer Vision (ICCV), 2001, pp. 525–531.
[9] J. Philbin, O. Chum, M. Isard, J. Sivic, A. Zisserman, Object retrieval with large vocabularies and fast spatial matching, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
[10] Y.-H. Kuo, W.-H. Cheng, H.-T. Lin, W.H. Hsu, Unsupervised semantic feature discovery for image object retrieval and tag refinement, IEEE Trans. Multimed. 14 (4) (2012) 1079–1090.
[11] C. Liu, J. Yuen, A. Torralba, J. Sivic, W.T. Freeman, SIFT flow: dense correspondence across different scenes, in: Proceedings of the European Conference on Computer Vision (ECCV), 2008, pp. 28–42.
[12] Y. Ke, R. Sukthankar, PCA-SIFT: a more distinctive representation for local image descriptors, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2004, pp. 506–513.
[13] S. Winder, M. Brown, Learning local image descriptors, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.


[14] G. Zhao, L. Chen, G. Chen, J. Yuan, KPB-SIFT: a compact local feature descriptor, in: Proceedings of the ACM International Conference on Multimedia (MM), 2010, pp. 1175–1178.
[15] L. Bo, X. Ren, D. Fox, Kernel descriptors for visual recognition, in: Advances in Neural Information Processing Systems (NIPS), 2010, pp. 244–252.
[16] G. Hua, M. Brown, S. Winder, Discriminant embedding for local image descriptors, in: Proceedings of the International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
[17] K. Kavukcuoglu, M. Ranzato, R. Fergus, Y. LeCun, Learning invariant features through topographic filter maps, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 1605–1612.
[18] J. Philbin, M. Isard, J. Sivic, A. Zisserman, Descriptor learning for efficient retrieval, in: Proceedings of the European Conference on Computer Vision (ECCV), 2010, pp. 677–691.
[19] L. Bo, K. Lai, X. Ren, Object recognition with hierarchical kernel descriptors, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1729–1736.
[20] A. Vedaldi, A. Zisserman, Efficient additive kernels via explicit feature maps, in: Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 3539–3546.
[21] N. Snavely, S.M. Seitz, R. Szeliski, Photo tourism: exploring photo collections in 3D, in: Proceedings of the ACM SIGGRAPH, 2006, pp. 835–846.
[22] M.A. Fischler, R.C. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM 24 (6) (1981) 381–395.
[23] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[24] O. Chapelle, P. Haffner, V.N. Vapnik, Support vector machines for histogram-based image classification, IEEE Trans. Neural Netw. 10 (5) (1999) 1055–1064.
[25] X. He, P. Niyogi, Locality preserving projections, in: S. Thrun, L.K. Saul, B. Schölkopf (Eds.), Advances in Neural Information Processing Systems (NIPS), vol. 16, MIT Press, 2004. http://papers.nips.cc/paper/2359-locality-preserving-projections.pdf.
[26] M. Muja, D.G. Lowe, Fast approximate nearest neighbors with automatic algorithm configuration, in: Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP), 2009, pp. 331–340.
[27] J. Zhang, C. Wu, D. Cai, J. Zhu, Bilevel visual words coding for image classification, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2013, pp. 1883–1888.
[28] M. Calonder, V. Lepetit, M. Özuysal, T. Trzcinski, C. Strecha, P. Fua, BRIEF: computing a local binary descriptor very fast, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1281–1298.
[29] M. Ambai, Y. Yoshida, CARD: compact and real-time descriptors, in: Proceedings of the International Conference on Computer Vision (ICCV), 2011, pp. 97–104.
[30] C. Wu, J. Zhu, J. Zhang, C. Chen, D. Cai, A convolutional treelets binary feature approach to fast keypoint recognition, in: Proceedings of the European Conference on Computer Vision (ECCV), 2013, pp. 368–382.
[31] C. Strecha, A.M. Bronstein, M.M. Bronstein, P. Fua, LDAHash: improved matching with smaller descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 34 (1) (2012) 66–78.

Kye-Hyeon Kim is a Ph.D. student at Pohang University of Science and Technology, Pohang, Korea. He received the B.S. degree in computer science and engineering from Pohang University of Science and Technology in 2005. His research interests include machine learning and its applications to computer vision and information retrieval.

Rui Cai is a Lead Researcher at Microsoft Research Asia. He received the B.E. and Ph.D. degrees in computer science from Tsinghua University, Beijing, China, in 2001 and 2006, respectively. His research interests include web search and data mining, machine learning, pattern recognition, computer vision, multimedia content analysis, and signal processing. He is a member of the Association for Computing Machinery (ACM) and the Institute of Electrical and Electronics Engineers (IEEE).

Lei Zhang received his Ph.D. degree in computer science from Tsinghua University in 2001. He is a Principal Development Lead at Microsoft. He is interested in research problems in image search, Internet vision, and information retrieval, and holds 20 U.S. patents for his innovations in these fields. Dr. Zhang is a senior member of the Association for Computing Machinery (ACM). He has also served as a program area chair or committee member for many international conferences in multimedia, computer vision, and information retrieval.

Seungjin Choi received the B.S. and M.S. degrees in electrical engineering from Seoul National University, Seoul, Korea, in 1987 and 1989, respectively, and the Ph.D. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, USA, in 1996. He was with the Laboratory for Artificial Brain Systems, RIKEN, Japan, in 1997, and was an Assistant Professor in the School of Electrical and Electronics Engineering, Chungbuk National University from 1997 to 2000. Since 2001, he has been a Professor of Computer Science at Pohang University of Science and Technology, Pohang, Korea. His primary research interests include machine learning and probabilistic models with their applications to data mining, computer vision, and brain-computer interface.