Unseen object categorization using multiple visual cues
B. Ramesh, C. Xiang
Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117576
Abstract
In this paper, we propose an object categorization framework that extracts different visual cues to tackle the problem of categorizing previously unseen objects under various viewpoints. Specifically, we decompose the input image into three visual cues: structure, texture and shape. Local features are then extracted using the log-polar transform to achieve scale and rotation invariance. The local descriptors obtained from the different visual cues are fused using the bag-of-words representation, with two key contributions: (1) a keypoint detection scheme based on variational calculus is proposed for selecting sampling locations; (2) a codebook optimization scheme based on discrete entropy is proposed to choose the optimal codewords and at the same time increase the overall performance. We tested the proposed framework on the ETH-80 dataset using the leave-one-object-out protocol, which specifically assesses the categorization of previously unseen objects under various viewpoints. On this popular dataset, the proposed object categorization system achieves a large improvement in classification performance over state-of-the-art methods.
Keywords: Log-polar transform; Object classification; Structure-texture decomposition; Shape extraction; Bag-of-words model; ETH-80 dataset
1. Introduction

Object recognition has been a central task for the computer vision community since the early days of using computers to identify handwritten characters [1]. Through these fruitful decades of increasing machine intelligence, we have taken huge strides in solving specific tasks, such as classification systems for automated assembly line inspection [2], hand-written character recognition in mail sorting machines [3], and bill counting and inspection in automated teller machines [4], to name a few. Despite these successful applications, computers have made little progress in generalizing object appearance, even under moderately controlled sensing environments. On the other hand, humans can effortlessly categorize hundreds of objects present in highly complex scenarios. This success in pattern recognition is naturally due to the visual cortex's effective utilization of appearance and shape cues, which help in forming distinctive groupings of visual stimuli with different perceivable characteristics [5]. Therefore, we believe a cue-based approach to object categorization is central to real progress toward intelligent systems, and this paper aims to take a step in that direction.

A cue-based approach to object classification is important for generalization to unseen objects. However, this aspect has been rarely studied due to the nature of the training and testing protocols used for several object datasets. While the practice of using a random training and testing split avoids the bias of having a fixed training set, it leads to difficulties in objectively assessing whether the training images yield a
visual world model that can generalize to unseen objects of a known object category. Another significant obstacle for rigorously evaluating both appearance and shape based methods is the widespread use of databases without segmentation ground truth for the object categorization task. In this paper, the above problems are addressed by adopting the rigorous leave-one-object-out cross validation protocol on the ETH-80 dataset [6], which provides segmentation ground truth for each object. Even so, unlike previous works on this dataset, we do not make use of the ground truth for classification. In contrast, we propose to extract the shape cue by thresholding [7] the output of a salient object detection model [8]. We only make use of the ground truth for postmortem analysis of the extracted shape cue, which helps in quantifying the performance of different thresholding methods [7,9–15]. While shape cue provides important clues about the identity and functional properties of the object, several other visual cues assist, both humans and computers alike, in identifying objects from two-dimensional images. Some examples are depth, motion, texture, color, and 3D pose. Nevertheless, object recognition research in its budding years was primarily concerned with 3D shape representation [16,17]. In the late 1980s, the theory of recognition-by-components [18] proposed a powerful set of regularizing constraints using shape primitives for object recognition. It proposed that humans made use of easily detectable perceptual properties (curvature, collinearity, symmetry, parallelism and cotermination) that are invariant to orientation changes, distortion, and occlusion. Nevertheless, this theory has not
been used successfully in natural images due to the representational gap between low-level features and the abstract nature of model components. Subsequent two decades of research in object recognition moved away from 3D geometry to appearance-based identification systems, which opened up new horizons in recognizing natural images [19]. Although appearance based approaches have taken the forefront of object categorization research [20], contour based object categorization in natural images has been of increasing interest lately [21], with the help of advances in contour detection [22]. This paper aims to take a further step by encoding the object shape and appearance cues in a unified bag-of-features framework using log-polar transform [23]. In particular, we propose a novel fusion of grayscale appearance cues, such as texture and structure, and binary shape information. The input image is decomposed into the structure and texture parts using the Rudin–Osher–Fatemi method [24], and local features are extracted using the log-polar transform on select keypoint locations. As for shape extraction, we employ a state-of-the-art salient object detection model [8] and binarize the saliency map using the classical Otsu's thresholding method [7]. The local shape features are also extracted using logpolar transform at the shape boundaries, following the method proposed in [25]. Finally, each set of the extracted local features (structure, texture, and shape) are fused using the bag-of-words model. For combining features from different cues, a natural choice for the classification framework is the bag-of-words model, which has become the standard image classification pipeline due to its simplicity and high performance on various datasets [26–28]. The principal idea behind the bag-of-words model is to extract several local features from an image and then identify the likely object from which those features were extracted. The design considerations for each step of the bag-ofwords model are discussed below. Firstly, most works extract local features using a uniform, dense grid of keypoints and report better performance compared to keypoint detection methods [29]. In this work, a novel keypoint detection scheme is used to select the sampling locations and better performance is obtained compared to the dense keypoint strategy. Secondly, heuristically designed local descriptors [30,31] fail to achieve scale and rotation invariant properties theoretically. Therefore, one of the focuses of this paper is to employ a local descriptor with a sound mathematical basis for scale and rotation invariance. In this regard, we employ the log-polar transform [23] for obtaining the local descriptor at each keypoint. Thirdly, the majority of the works quantize the extracted local descriptors using a visual vocabulary or codebook, whose size is chosen in a trial-and-error manner, as noted in [25]. Therefore, we address the issue of choosing the optimal codewords, which aims to reduce the codebook size and simultaneously improve classification performance. In summary, the key contributions of the paper are as follows.
1. An object categorization framework is proposed to efficiently encode appearance and shape cues using the log-polar transform, with very high performance for classifying unseen objects under various viewpoints.
2. A novel keypoint detection method is proposed to select sampling locations using an image denoising method based on variational calculus.
3. An entropy-based codebook optimization scheme is proposed to choose the optimal codewords and simultaneously improve the classification performance.

The rest of the paper is organized as follows. We review the related works in Section 2. Then, the details of our proposed methods are introduced in Section 3. Next, the proposed framework is evaluated on the ETH-80 dataset and the experimental results are presented in Section 4. Finally, we conclude this paper in Section 5.

2. Related work
In this section, we review the related works in each domain of the proposed methods in this paper. Multi-cue object representation: Several works extract different local descriptors, and treat them as different cues in their object recognition framework. For instance, Khan et al. [32] combined shape cues obtained from SIFT descriptors and color cues obtained from the histogram of sRGB values for object classification. Similarly, Leibe [33] ambitiously combined multiple interest point detectors and multiple descriptors for detecting objects in an image. Likewise, Vedaldi [34] combined dense SIFT, self-similarity descriptors, and geometric blur features with multiple kernel learning to obtain the final image representation. A similar effort was made by [35–37] to combine multiple feature channels for image classification/saliency detection. Different from the above works, a handful of attempts have been made in the past with the aim of encoding multiple cues by designing a novel image processing method. Ref. [38] combined texture cues obtained from texture-layout filters [39] and contour fragments [40] obtained from sets of edges matched to the image using the oriented chamfer distance. In the same vein, Kumar et al. [41] combined outline contour and the enclosed texture in pictorial structures for object detection. Likewise, some works [21,40,42] obtain local contour fragments to encode shape information from grayscale images. This paper aims to take a further step by encoding the object shape, which is different from encoding the contours of the image, extracted using a salient object detection algorithm [8]. The proposed framework also uses grayscale texture and structure in a unified bag-of-features framework using log-polar transform. Feature descriptor: Most works adopt the popular SIFT descriptor [31] or shape context [43] for extracting local features. However, other ways of extracting local descriptors, such as filtered responses, normalized pixel values, etc., are also in practice. Normally, dense SIFT descriptors extracted without scale selection are widely reported to give good performance [44], but without the guarantee of the invariant properties. Our work aims to achieve scale and rotation invariance, and in this regard, is most similar to [45], which has used the classic logpolar transform to achieve scale invariance without scale selection for grayscale images. Ref. [45] presents scale invariant descriptors (SIDs) that use a logarithmic sampling on band-pass filtered images. As a result of the non-uniform scale of spatial sampling, centered at each pixel of the image, the authors showed that it is possible to obtain feature vectors that are scale and rotation invariant, by transforming the corresponding log-polar sampled amplitude, orientation and phase maps into the Fourier domain. In contrast to [45], this paper decomposes the grayscale image into the structure and texture filtered images and the resulting cue values are encoded using the log-polar transform at select locations. In addition, we sample the shape boundaries of the extracted binary shape image using the log-polar transform followed by obtaining its Fourier transform modulus. Keypoint detection: Existing works have adopted two main strategies for selecting keypoints: (1) the simple but counter-intuitive strategy of densely sampling the entire image regardless of object boundaries [44] and (2) the more principled approach of designing sophisticated scale-and-affine invariant keypoint detectors [46]. 
Our work takes a different approach for selecting keypoints, based on the assumption that a keypoint only needs to be visually salient with respect to its neighbors, and it need not possess invariant properties. Therefore, dealing with noise is a crucial aspect of such a strategy. In this regard, the most related work is in the image denoising literature, which has a multitude of algorithms reviewed extensively in [47]. Gaussian smoothing, anisotropic smoothing (mean curvature motion), total variation minimization, and the neighborhood filters are examples of image denoising methods. Inspired by the success of variational methods on state-of-the-art optical flow benchmark datasets [48,49],
we choose the Rudin–Osher–Fatemi (ROF) model [24] to perform image denoising. In fact, the optical flow literature has a different interpretation of the ROF model, that is, the denoised image is termed as structure and the residue is treated as texture. Thus, in our work, the output of the ROF algorithm is efficiently used for keypoint detection, and also for obtaining grayscale structure and texture cues. Codebook optimization: Lastly, we review the related works for the problem of codebook optimization in the bag-of-words model. The dictionary used for vector quantization usually consists of several codewords that are both unnecessary and detrimental to the classification performance. Hence, many works have aimed to optimize the visual dictionary by merging codewords [50,51] or choosing the best codebook based on global codebook measures [25,52]. Some works also consider pruning a very large codebook using criteria like likelihood ratio [53], entropy-based minimum description length [54], etc. Similar to [54], we select the codewords from a clustering evaluation perspective. Nevertheless, we consider individual entropy values of the clusters, whereas [54,25] considered the overall entropy for the training set. In other words, our primary motivation is to discard clusters with a very high entropy and retain the useful ones from a classification point-of-view. In particular, high entropy clusters have members from almost all object categories, and therefore, they are potentially confusing when creating the histogram representation. However, moderately high entropy clusters may still be useful for classification, in case of shared features between different categories. To balance these two ideals, we use cross-validation to determine the usefulness of a cluster, and thus achieve reduction in codebook size and also a performance boost. In the section that follows, the details of the proposed cue-based object categorization framework are given.
3. Cue-based object categorization

We adopt the bag-of-words framework consisting of four main stages: keypoint detection, feature extraction, vector quantization, and classification (Fig. 1). From the input image, we extract three cue images representing the structure, texture, and shape of the object. For the grayscale appearance cues (structure and texture), keypoint detection is done using the Rudin–Osher–Fatemi image denoising method [24]; for the extracted binary image, the keypoints are simply the boundary points of the shape. Feature extraction involves sampling the cue images at the keypoints using the log-polar transform. The sets of descriptors from each cue of the training images are collectively used to obtain a codebook; in this case, three codebooks will be generated using the training set. The quantization step is then the histogram representation of each cue image, using the respective codebooks generated in the previous step. Subsequently, the histograms of the training images are formed by a late fusion step, i.e., the histograms obtained for all the cues are concatenated to form the final representation of each image. Finally, the histograms of the training images are used to train an SVM classifier. During testing, the codebook construction step is bypassed, and a test image is simply represented using the codebooks and classified using the SVM. The procedure to obtain the object cues from an input image is given below.

Fig. 2 illustrates the details of the proposed cue-based feature extraction step. On the one hand, the input image is decomposed into the structure and texture cues using the Rudin–Osher–Fatemi method. On the other hand, saliency detection is performed on the input image to obtain a saliency map, which is further binarized using the Otsu method to obtain the shape representation. Using the log-polar transform, the keypoints obtained from the structure image (also known as the denoised image) are used to locally sample the grayscale appearance cues (structure and texture), while the binary shape is sampled locally at its boundaries. Our main contributions are the object categorization framework that efficiently encodes appearance and shape cues using the log-polar transform (Section 3.2), a novel keypoint detection method described in Section 3.1, and the entropy-based codebook optimization scheme described in Section 3.3.

3.1. Keypoint detection

We define keypoints as locations in the image with a distinctive appearance with respect to their neighboring pixels. Therefore, to deal with noise, we first use the Rudin–Osher–Fatemi model to obtain the denoised image, and then use the Canny edge detector to obtain the keypoints. The ROF denoising model is based on the principle that images with excessive and possibly spurious details have high total variation; in other words, the integral of the gradient modulus of the signal will be high. Accordingly, by reducing the total variation of the image, subject to being a close match to the original image, unwanted details can be removed whilst preserving important ones, such as edges. For the input grayscale image $v(\mathbf{x}): \Omega \subset \mathbb{R}^2 \to \mathbb{R}$, the denoised image $u(\mathbf{x})$ is given as the solution of
$$\min_{u} \int_{\Omega} \left\{ \frac{1}{2\theta}(u - v)^2 + |\nabla u| \right\} dx, \qquad (1)$$

where θ is a small constant, such that u is a close approximation of v. To solve Eq. (1), an efficient iterative scheme that is globally convergent was proposed in [55]. The solution is based on gradient descent and subsequent re-projection using the dual-ROF model. Since this algorithm is a core component of our framework, we reproduce the relevant results below.

Proposition 1. The solution of Eq. (1) is given by

$$u = v + \theta\, \mathrm{div}\, \mathbf{p}, \qquad (2)$$

where the dual variable $\mathbf{p} = [p_1, p_2]$ is iteratively defined as

$$\tilde{\mathbf{p}}^{\,n+1} = \mathbf{p}^{n} + \frac{\tau}{\theta} \nabla\!\left(v + \theta\, \mathrm{div}\, \mathbf{p}^{n}\right), \qquad (3)$$

$$\mathbf{p}^{n+1} = \frac{\tilde{\mathbf{p}}^{\,n+1}}{\max\{1, |\tilde{\mathbf{p}}^{\,n+1}|\}}, \qquad (4)$$

where $\mathbf{p}^0 = 0$ and the time step $\tau \le 1/4$.
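The iteration in Eqs. (2)-(4) is straightforward to implement. The following is a minimal NumPy sketch of Proposition 1 using standard forward-difference gradients and backward-difference divergences; the value of θ and the fixed iteration count are illustrative assumptions, since the paper only states that θ is a small constant.

    import numpy as np

    def _grad(u):
        # forward differences with Neumann boundary conditions
        gx = np.zeros_like(u)
        gy = np.zeros_like(u)
        gx[:-1, :] = u[1:, :] - u[:-1, :]
        gy[:, :-1] = u[:, 1:] - u[:, :-1]
        return gx, gy

    def _div(px, py):
        # backward differences, the adjoint of the forward-difference gradient
        dx = np.zeros_like(px)
        dy = np.zeros_like(py)
        dx[0, :] = px[0, :]
        dx[1:-1, :] = px[1:-1, :] - px[:-2, :]
        dx[-1, :] = -px[-2, :]
        dy[:, 0] = py[:, 0]
        dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]
        dy[:, -1] = -py[:, -2]
        return dx + dy

    def rof_denoise(v, theta=0.125, tau=0.25, n_iter=100):
        """Dual-ROF iteration (Proposition 1); returns the structure image u."""
        v = v.astype(float)
        px = np.zeros_like(v)
        py = np.zeros_like(v)
        for _ in range(n_iter):
            u = v + theta * _div(px, py)                 # Eq. (2)
            gx, gy = _grad(u)
            tx = px + (tau / theta) * gx                 # Eq. (3)
            ty = py + (tau / theta) * gy
            mag = np.maximum(1.0, np.sqrt(tx**2 + ty**2))
            px, py = tx / mag, ty / mag                  # Eq. (4): reproject onto |p| <= 1
        return v + theta * _div(px, py)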
Fig. 1. Block diagram of the bag-of-words classification framework.
Fig. 2. Outline of the feature extraction steps for the proposed cue-based object representation.
The denoising method described above has advantages over simple techniques such as Gaussian smoothing or median filtering, which reduce noise but at the same time smooth away edges to a greater or lesser extent. In contrast, total variation denoising is remarkably effective, even at low signal-to-noise ratios, at preserving edges while removing noise in flat regions (see the structure image in Fig. 2). To extract the textural part, the difference between the original image and the denoised image is computed as v(x) − αu(x), where the blending factor α is set to 0.95 as in [49]. Similarly, the time step τ was set to 1/4. The Canny edges extracted from the structural part are used as keypoints for both the structure and texture appearance cues. After obtaining the keypoints, the next step is to extract the local features at these locations for the appearance cues and at the boundaries of the object shape cue.
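As a concrete illustration of this step, the sketch below derives the texture residue and the keypoint set from a structure image u (assumed to have been computed with the ROF sketch above); scikit-image's Canny detector is used as a stand-in for the edge detector, and the helper name and interface are ours, not the paper's.

    import numpy as np
    from skimage.feature import canny   # any Canny implementation can be substituted

    def appearance_cues_and_keypoints(v, u, alpha=0.95):
        """Texture residue and keypoints from the input image v and structure image u.
        alpha = 0.95 follows the text; Canny parameters are left at their defaults."""
        texture = v - alpha * u                  # textural part, v(x) - alpha * u(x)
        edges = canny(u)                         # boolean edge map of the structure image
        keypoints = np.argwhere(edges)           # (row, col) sampling locations
        return texture, keypoints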
3.2. Feature extraction

For each boundary point in the extracted binary shape, log-polar sampling is followed by computing its Fourier transform modulus. Subsequently, the local descriptor is obtained by converting the two-dimensional Fourier transform output into a vector and performing normalization using the Euclidean norm. For sampling the appearance cues using the log-polar transform (LPT), directly obtaining the Fourier transform modulus is not suitable, because noise, small appearance changes and non-uniform lighting conditions can severely affect the invariant properties. An alternative is to use bandpass filters at multiple scales and extract only the high-energy Fourier transform components, as done by Kokkinos and Yuille [45]. In contrast, we use the ROF denoising method and encode the resulting grayscale cue values using the log-polar transform. Next, we present the log-polar transform that is used for encoding the appearance cues and the binary shape cue.

3.2.1. Log-polar transform

By observing the non-uniform distribution of cones in the primate fovea, as shown in Fig. 3(a), a logarithmic relationship for information around the fovea structure can be established [56].
Fig. 3. Biologically inspired log-polar mapping.
The convention of the log-polar parameters in [57] has been adopted here. The radii of the smallest and the largest ring are represented as r_min and r_max, respectively. The logarithmic scaling is defined as ρ = log r. The samples of the LPT lie at the intersections between rings and wedges, and thus the size of the log-polar image is n_r by n_w, where n_r and n_w are the number of rings and wedges, respectively. In general, the intersections occur at arbitrary locations in the image, and therefore bilinear interpolation is used to find the image intensity at these locations. Due to this non-equidistant polar sampling, scale and rotation changes in the Cartesian image correspond to horizontal and vertical shifts in the log-polar domain, respectively. Note that the Fourier transform modulus of two images related by pure translation is the same. Consequently, two log-polar images of similar shapes, which have scale and rotation variations, are expected to have "similar" Fourier transform magnitudes. This concept is illustrated in Fig. 4 for the simple case of a circle to facilitate easy visual comparison in the frequency domain. After eliminating the translation differences in the log-polar space, it is easy to see that circles of different radii have nearly identical features in the frequency domain. Naturally, this invariant property is also applicable to the case of local information. However, for the grayscale case, noise, appearance changes and lighting conditions can have adverse effects on the invariant property, and thus the Fourier domain features are only used for the shape cue. In other words, the grayscale intensity values of the input image cannot be encoded directly. Therefore, the input image is decomposed into structure and texture filtered images, and the resulting cue values are encoded using the log-polar transform for the structure and texture cues. For the appearance cues (structure and texture), we set r_min = 1 and r_max = 7, which would roughly encode a 14 by 14 patch around each keypoint.
The log-polar transform [23] simulates the foveal mechanism of the human vision system by considering an exponential sampling of the Cartesian image. In other words, there is dense sampling near the center of the log-polar grid and coarse sampling in the periphery (see Fig. 3(b)). Let us define the mapping from the Cartesian coordinates of the image, (x, y), to the LPT coordinates, (ρ, θ), as follows:
$$x' = r\cos\theta, \qquad y' = r\sin\theta, \qquad (5)$$

where (r, θ) are polar coordinates defined with $(x_c, y_c)$ as the center of the transform and $(x', y') = (x - x_c, y - y_c)$, that is,

$$r = \sqrt{(x')^2 + (y')^2}. \qquad (6)$$

The angle θ is required to be in the range [0, 2π), but arctan is defined only for (−π/2, π/2). Therefore, the angles are computed depending on the quadrant as shown below:

$$\theta = \begin{cases} \arctan(y'/x') & \text{if } x' > 0 \\ \arctan(y'/x') + \pi & \text{if } x' < 0 \\ \pi/2 & \text{if } y' > 0,\ x' = 0 \\ 3\pi/2 & \text{if } y' < 0,\ x' = 0 \\ \text{undefined} & \text{if } y' = 0,\ x' = 0 \end{cases} \qquad (7)$$

The above operation produces output in the range (−π/2, 3π/2], which can be mapped to [0, 2π) by adding 2π to negative values.
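To make the sampling concrete, the following NumPy sketch builds the log-polar grid of Eqs. (5)-(7) around a keypoint, interpolates the image bilinearly at the grid points, and (for the shape cue) takes the Fourier transform modulus. Because the grid is generated through the inverse mapping, the quadrant cases of Eq. (7) are handled implicitly by the cosine and sine terms. The defaults correspond to the appearance-cue settings given above (r_min = 1, r_max = 7, 7 rings, 12 wedges); the function name and interface are illustrative.

    import numpy as np

    def log_polar_descriptor(img, center, r_min=1.0, r_max=7.0, n_r=7, n_w=12,
                             use_fourier=False):
        """84-dimensional (7 x 12) log-polar descriptor around center = (x_c, y_c).
        Set use_fourier=True (and r_max=40) for the binary shape cue."""
        h, w = img.shape
        xc, yc = center
        rho = np.linspace(np.log(r_min), np.log(r_max), n_r)   # uniform in rho = log r
        ang = np.linspace(0.0, 2.0 * np.pi, n_w, endpoint=False)
        r = np.exp(rho)[:, None]
        x = np.clip(xc + r * np.cos(ang)[None, :], 0, w - 1.001)
        y = np.clip(yc + r * np.sin(ang)[None, :], 0, h - 1.001)
        # bilinear interpolation at the (generally non-integer) sample locations
        x0, y0 = x.astype(int), y.astype(int)
        fx, fy = x - x0, y - y0
        lp = (img[y0, x0] * (1 - fx) * (1 - fy) + img[y0, x0 + 1] * fx * (1 - fy)
              + img[y0 + 1, x0] * (1 - fx) * fy + img[y0 + 1, x0 + 1] * fx * fy)
        if use_fourier:
            # scale/rotation become shifts in (rho, theta); |FFT| removes the shifts
            lp = np.abs(np.fft.fft2(lp))
        d = lp.ravel().astype(float)
        return d / (np.linalg.norm(d) + 1e-12)   # L2-normalised local descriptor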
$$p_{ij} = n_{ij} / n_i. \qquad (12)$$
Those clusters with a low entropy will have members from only a few object categories, and would play a crucial role in obtaining a discriminative image representation after vector quantization. On the other hand, clusters with high entropy have members from many object categories, which makes them suspect candidates for providing a useful image representation. However, some clusters with moderately high entropy may prove useful if the categories share similar features. To account for this case, we use cross-validation to select the clusters within a range of the rescaled entropy values while removing potentially disadvantageous and redundant clusters. The rescaling of entropy values is given as
$$E_i^{new} = (b - a) \times \frac{E_i - m}{M - m}, \qquad (13)$$
where the values of a=0 and b=1 are used to rescale the entropy values between [0, 1], and m and M represent the minimum and maximum entropy values out of all the clusters in the codebook, respectively. Thus, a threshold is varied between 0 and 1 to obtain the set of clusters that give the best performance during cross-validation. For all the three cue codebooks, the same threshold is used to obtain the best cross-validation result, as the individual codebook representations are concatenated to form the final image representation.

3.4. Vector quantization and classification

For each object cue, a training/testing image is quantized into Ki histogram bins (i = 1, 2, and 3 for structure, texture and shape respectively), i.e., the local features extracted from a cue image are individually matched to the nearest visual word of the respective codebook using Euclidean distance and the frequency of each word creates the Ki-dimensional histogram representation. Besides the global bag-of-words features, a 2×2 image grid is used to capture mid-level spatial information. Each of the four histograms from the 2×2 grid is normalized separately and concatenated together. In turn, the 4Ki-dimensional vector using the 2×2 grid is concatenated with the Ki-dimensional vector to form the 5Ki-dimensional histogram representation for each cue. Finally, the individual cue representations are concatenated to form the image representation of dimension (5K1 + 5K2 + 5K3). The classifier used is the SVM implementation of VLFeat in their bag-of-words application [58].
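A compact sketch of the codeword selection rule of Section 3.3 is given below, assuming the training descriptors have already been assigned to their nearest cluster. The function name, its interface and the handling of empty clusters are our assumptions; t = 0.9 is the cross-validated threshold reported in Section 4.2.

    import numpy as np

    def select_codewords(assignments, labels, n_clusters, n_classes, t=0.9):
        """Entropy-based codebook pruning: keep clusters whose rescaled entropy
        (Eq. (13)) is at most the cross-validated threshold t."""
        E = np.zeros(n_clusters)
        for k in range(n_clusters):
            members = labels[assignments == k]
            if members.size == 0:
                E[k] = np.inf                 # empty cluster: always discarded
                continue
            p = np.bincount(members, minlength=n_classes) / members.size   # Eq. (12)
            p = p[p > 0]
            E[k] = -np.sum(p * np.log2(p))                                  # Eq. (11)
        finite = np.isfinite(E)
        m, M = E[finite].min(), E[finite].max()
        E_new = np.where(finite, (E - m) / (M - m + 1e-12), np.inf)         # Eq. (13), a=0, b=1
        return np.flatnonzero(E_new <= t)     # indices of the retained codewords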
Fig. 4. (a) and (b). Log-polar transform applied to the image by centering on the shape, followed by computing the Fourier transform modulus. Scale change in the Cartesian space corresponds to a horizontal shift in the log-polar space, which can be eliminated by computing the Fourier transform modulus to obtain an invariant descriptor.
For the binary shape, we set r_min = 1 and r_max = 40, following the recommendations in [25]. Furthermore, the number of rings (n_r) is set to 7 and the number of wedges (n_w) is set to 12 for both the binary shape and the appearance cues. This results in an 84-dimensional local descriptor for both the binary shape and the appearance cue images.

3.3. Codebook optimization

The descriptors obtained from the training images are collectively used to obtain a codebook for each cue, using VLFeat's [58] Approximate Nearest Neighbor (ANN) K-means algorithm for faster optimization. The codebook is simply the collection of the cluster centroids obtained using K-means. Let P = {p_1, p_2, …, p_C} represent the probability distribution of the training descriptors belonging to C shape categories. Then the information conveyed by this distribution, the entropy of P, is given by
4. Experiments and discussion

We tested our object classification system on the ETH-80 dataset, which was introduced by Leibe and Schiele [6], to specifically tackle the problem of unseen object classification. The ETH-80 image dataset consists of 80 objects categorized into 8 classes, namely apple, pear, tomato, cow, dog, horse, cup and car (see Fig. 5). Each object is captured under 41 different viewpoints, and the testing protocol is to classify each object under all viewing angles while the rest of the objects are considered for training. Thus, classification is done a total of 80 times to cover all the images in the dataset. Our experiments were carried out on HP Xeon two-socket 2.66 GHz quad-core 64-bit Linux clusters with a 72 GB memory limit.
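For clarity, the leave-one-object-out protocol can be summarized by the following sketch, where train_fn and predict_fn are placeholders for the SVM training and prediction described in Section 3.4; the interface is illustrative and not part of the original implementation.

    import numpy as np

    def leave_one_object_out(histograms, class_ids, object_ids, train_fn, predict_fn):
        """All 41 views of one object are held out for testing and the remaining
        79 objects are used for training; repeated for each of the 80 objects."""
        accuracies = []
        for obj in np.unique(object_ids):
            test = object_ids == obj
            model = train_fn(histograms[~test], class_ids[~test])
            pred = predict_fn(model, histograms[test])
            accuracies.append(np.mean(pred == class_ids[test]))
        return float(np.mean(accuracies))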
$$\mathrm{Info}(P) = -\sum_{j=1}^{C} p_j \log_2 p_j, \qquad (9)$$

$$p_j = N_j / N, \qquad (10)$$
where $N_j$ is the number of data points belonging to class j and $N = N_1 + N_2 + \cdots + N_C$ is the total number of data points. After partitioning the data into K clusters, the entropy of each cluster, $E_i$, is given by
4.1. Classification results on the ETH-80 dataset

We compare the performance of the proposed method to many earlier works on ETH-80 in Table 1. Our cue-based bag-of-words approach outperforms the previous state-of-the-art method [21] by a large margin. In our experiments, we did not make use of the segmentation ground truth available in the dataset, whereas [21] is a shape classification framework that makes use of all ground truth shapes to report the result.
$$E_i = -\sum_{j=1}^{C} p_{ij} \log_2 p_{ij}, \qquad i = 1, 2, \ldots, K, \qquad (11)$$
where $p_{ij}$ is the ratio of the number of samples of class j in cluster i ($n_{ij}$) to the total number of samples in cluster i ($n_i$), as given in Eq. (12).
Fig. 5. Sample images from all the 80 objects in the ETH-80 dataset.

Table 1
Comparison of classification accuracy of the proposed method with previous works (%).

Method | Feature | Accuracy
Color histogram [6] | Color | 64.86
PCA gray [6] | Appearance | 82.99
PCA masks [6] | Appearance | 83.41
SC & DP [6] | Ground truth shape | 86.40
IDSC & DP [59] | Ground truth shape | 88.11
IDSC & Morphology [60] | Ground truth shape | 88.04
Height function [61] | Ground truth shape | 88.72
Robust symbolic [62] | Ground truth shape | 90.28
Kernel-edit [63] | Ground truth shape | 91.33
BCF [21] | Ground truth shape | 91.49
Proposed method | Appearance + extracted shape | 97.13
Table 2
Performance of individual object cues in comparison with the proposed method (%).

Alg. | Grayscale | Structure | Texture | Str-Tex | Shape | Proposed
Acc. | 84.97 | 86.80 | 88.87 | 92.53 | 92.25 | 97.13

Table 3
Confusion matrix (%) for the shape cue on the ETH-80 dataset.

       | Apple | Car | Cow | Cup | Dog | Horse | Pear | Tomato
Apple  | 84.39 | 0 | 0 | 0.24 | 0 | 0 | 0 | 15.37
Car    | 0 | 97.80 | 0.49 | 0 | 0.49 | 1.22 | 0 | 0
Cow    | 0 | 4.39 | 86.34 | 0.49 | 3.17 | 5.61 | 0 | 0
Cup    | 0.24 | 0 | 0 | 99.52 | 0 | 0 | 0 | 0.24
Dog    | 0 | 0.49 | 3.41 | 0 | 93.17 | 2.93 | 0 | 0
Horse  | 0 | 0.49 | 3.17 | 0 | 3.66 | 92.68 | 0 | 0
Pear   | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0
Tomato | 15.85 | 0 | 0 | 0 | 0 | 0 | 0 | 84.15

Table 4
Confusion matrix (%) for the structure-texture cue on the ETH-80 database.

       | Apple | Car | Cow | Cup | Dog | Horse | Pear | Tomato
Apple  | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Car    | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0
Cow    | 0 | 0.49 | 85.61 | 0 | 1.71 | 11.95 | 0 | 0.24
Cup    | 0 | 3.17 | 0.49 | 94.63 | 1.22 | 0.49 | 0 | 0
Dog    | 0 | 0.24 | 3.90 | 0 | 89.02 | 6.84 | 0 | 0
Horse  | 0 | 0 | 19.27 | 0 | 7.80 | 72.44 | 0.49 | 0
Pear   | 0.74 | 0 | 0.24 | 0 | 0 | 0 | 99.02 | 0
Tomato | 0.49 | 0 | 0 | 0 | 0 | 0 | 0 | 99.51

On the other hand, some earlier works like [6] only use the color or gray level information to report results on the ETH-80 dataset. Note that irrespective of the information used by the previous methods, all of them follow the leave-one-object-out protocol, and hence, comparison of the final accuracy is valid. Next, we show the individual performance of the structure, texture and shape cues, and also demonstrate the necessity of multiple cues for high classification performance. Table 2 shows the performance of the individual object cues in comparison to the performance of the proposed multiple cue representation. Clearly, using the original grayscale pixel values results in inferior performance when compared to the usage of any individual object cue. Moreover, only when the structure and texture cues are combined is the performance as good as using the shape cue,
which reaffirms the observations in the literature regarding the importance of shape cues for object recognition. Overall, the proposed method of combining all three cues leads to a very significant improvement in classifying unseen objects from a known category. Tables 3 and 4 show the confusion matrices of the classification systems that use the shape cue and the combined structure-texture cues, respectively. It is clearly evident that without the grayscale appearance cues, distinguishing the round shapes of tomato and apple is difficult (see the first two shapes in the last row of Fig. 6), whereas classifying the animal shapes is difficult without the shape information. Naturally, a classification system that can leverage the benefits of both appearance and shape cues will perform much better than those using them separately, as shown in Table 5. The wrong instances of classification mainly arose from the confusion among the four-legged animals (cow, dog and horse). In the case of cow, the 'sitting cow' proved to be the hardest to classify (see Fig. 5: fifth from the left in the cow row), since the legs are missing and all the other cows in the training set are in a standing position. Similarly, much confusion between horse and dog was observed due to their structural similarities. Fig. 6 shows sample shapes extracted using the salient object detection algorithm [8]. Although the extracted shapes are imperfect, we have achieved high classification performance since the proposed framework uses local features. The quality of the extracted shapes depends upon the saliency algorithm and the thresholding method. Since the saliency algorithm employed [8] is one of the state-of-the-art solutions (as evaluated by Cheng et al. [64]), we only focused on improving the shapes using different thresholding methods. The quality of the extracted shape can be quantified using precision and recall. The two quantities are defined as: (a) Precision: the percentage area of the extracted shape that overlaps with the ground truth. (b) Recall: the percentage area of the ground truth that is contained within the extracted shape. Mathematically,
Table 6
Evaluation of the quality of the extracted shapes using different thresholding methods.

Thresholding algorithm | Precision | Recall | F-score
Otsu [7] | 99.67 | 93.22 | 96.34
Kittler and Illingworth [9] | 95.40 | 95.39 | 95.83
Rosenfeld and La Torre [10] | 52.78 | 97.68 | 68.51
Kapur, Sahoo and Wong [11] | 86.36 | 95.23 | 90.58
Prewitt and Mendelsohn [12] | 99.63 | 92.88 | 96.14
Prewitt and Mendelsohn2 [12] | 99.61 | 92.77 | 96.07
Glasbey [13] | 96.16 | 95.18 | 95.67
Doyle [14] | 46.43 | 97.99 | 63.92
Tsai [15] | 99.33 | 92.77 | 95.15
$$P = \frac{|R_s \cap R_g|}{|R_s|}, \qquad (14)$$

$$R = \frac{|R_s \cap R_g|}{|R_g|}, \qquad (15)$$
where $R_s$ is the extracted shape and $R_g$ is the ground truth shape. Subsequently, average precision and average recall are calculated by obtaining the arithmetic average of the precisions and recalls obtained for all the shapes. Table 6 lists the average precision, average recall and F-score (the harmonic mean of precision and recall, given by Eq. (16)) of the shapes extracted using different thresholding algorithms.

Fig. 6. Sample shapes extracted using the salient object detection algorithm proposed in [8].
$$F_{score} = \frac{2PR}{P + R}. \qquad (16)$$

Notable thresholding algorithms that have a high F-score are the entropy based method developed by Kittler and Illingworth [9], Prewitt and Mendelsohn [12]'s analysis of cell images, and the popular Tsai's moment preserving thresholding method [15]. Out of all the thresholding algorithms developed from the early 1960s to the 1990s, the popular Otsu's method performs the best in terms of F-score, which has been used in this work for obtaining the shape representation from the output of salient object detection. In the next subsection, we present the results of the codebook optimization procedure.

Table 5
Confusion matrix (%) for the best result of our system on the ETH-80 database.

       | Apple | Car | Cow | Cup | Dog | Horse | Pear | Tomato
Apple  | 100 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Car    | 0 | 100 | 0 | 0 | 0 | 0 | 0 | 0
Cow    | 0 | 1.46 | 94.15 | 0 | 1.22 | 3.17 | 0 | 0
Cup    | 0 | 0 | 0 | 97.80 | 0 | 0 | 0 | 2.20
Dog    | 0 | 0 | 1.46 | 0 | 94.63 | 3.91 | 0 | 0
Horse  | 0 | 0 | 4.88 | 0 | 4.14 | 90.98 | 0 | 0
Pear   | 0 | 0 | 0 | 0 | 0 | 0 | 100 | 0
Tomato | 0.49 | 0 | 0 | 0 | 0 | 0 | 0 | 99.51
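A minimal sketch of the shape-quality measures in Eqs. (14)-(16) is given below, assuming binary masks of equal size with non-empty regions; the function name is ours.

    import numpy as np

    def shape_quality(extracted, ground_truth):
        """Precision, recall and F-score of a binary shape mask against the ground truth."""
        rs = extracted.astype(bool)
        rg = ground_truth.astype(bool)
        inter = np.logical_and(rs, rg).sum()
        precision = inter / rs.sum()                          # Eq. (14)
        recall = inter / rg.sum()                             # Eq. (15)
        f_score = 2 * precision * recall / (precision + recall)   # Eq. (16)
        return precision, recall, f_score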
4.2. Results of codebook optimization

The algorithm described in Section 3.3 is used to optimize the codebooks by removing those clusters with a very high entropy, while at the same time retaining clusters with a moderately high entropy that are useful for classification. Note that clustering algorithms like K-means produce different sets of clusters during each run, because of the random initialization of the cluster centers. Moreover, even for slightly different sets of data points, the clustering by K-means can produce very different results. This phenomenon is illustrated in Fig. 7 for different cases of the leave-one-object-out cross validation on the ETH-80 dataset. Fig. 7(a) and (b) show instances where a threshold below 0.50 is ideal, whereas Fig. 7(d) and (e) show cases where a threshold above 0.50 is optimal. Additionally, in some cases like Fig. 7(c) and (f), the best classification can be obtained for both low and high thresholds. Therefore, in order to choose a robust threshold, cross-validation should be adopted to obtain the general trend. Fig. 8 shows the cross-validation accuracy as the threshold is varied from 0 to 1. As expected, the accuracy is lowest when only the "pure" clusters (low entropy) are retained. The accuracy increases as more clusters with members from different object categories (higher entropy) are introduced. Eventually, the accuracy saturates for threshold values close to 1, which corresponds to using all the clusters for classification. However, when using all the clusters, the cross-validation accuracy is lower than when retaining clusters with 90% to 95% of the entropy energy. We note that this observation is similar to a popular way of choosing the number of principal components for dimensionality reduction: taking the first k eigenvectors that capture 95% of the total variance.
Fig. 7. Individual cases of codebook optimization. The title of each graph shows the object that was left out for testing.
The best accuracy reported in this paper (97.1314%) was obtained with a threshold of 0.90, which corresponds to selecting, on average, [2419, 2158, 2686] words for the structure, texture and shape codebooks, respectively. The original codebook size of 3000 for each cue yielded a sub-optimal accuracy of 96.2543% at a higher computational cost. After the codebook optimization procedure, a total of 1737 codewords were discarded. A similar trend was observed for other codebook sizes of 1000 per cue (96.32% with optimization vs. 96.01% without) and 6000 per cue (96.06% with optimization vs. 95.97% without). Note that for higher codebook sizes like 6000, the running times with optimization are much lower (4× faster). Therefore, there are two important advantages of optimizing the codebook.

1. Since there is a significant reduction in the codebook size, the computational cost of vector quantization is significantly lowered. For the best accuracy reported above, the running time using the optimized codebooks was 2.2 times lower than the running time using the original codebooks.
2. There is a slight increase in classification accuracy because clusters with very high entropy, which potentially consist of noisy local features, are removed. Compared to using the original codebooks, the optimized codebooks obtained a 1% increase in accuracy on average over ten trials.

In the next subsection, we demonstrate the effectiveness of the proposed keypoint detection scheme.

4.3. Dense sampling vs. keypoint detection

As far as selecting keypoints for the bag-of-words representation is concerned, the most successful method has been the dense sampling strategy. In this method, keypoints are placed all over the image without any explicit keypoint detection. It has been found that dense sampling gives equal or better classification rates than sophisticated multi-scale interest point operators [29]. This behavior is explained by observing that the number of keypoints is the most important factor governing the performance of the bag-of-words model: it has been widely reported in the literature that a larger number of keypoints leads to better performance, and keypoint detectors naturally provide fewer sampling locations than a dense/random sampling strategy. However, there are a few scenarios in which dense sampling may not be suitable, namely when there is no background context information available (plain background), or when the object is small in comparison to the size of the picture. In both these cases, dense sampling is better avoided. This proposition is confirmed in our experiments on the ETH-80 dataset, wherein the background is uniform for all the objects and some objects like car occupy less than 20% of the total number of pixels in the image. We implement two commonly used step sizes (two and four) for the dense sampling method, where the step size indicates the spacing between two keypoints in the x- and y-directions (Fig. 9). The ROF keypoint detection scheme proposed in this paper is replaced by the dense sampling strategy for the grayscale appearance cues, while the binary shapes are still sampled at their boundaries. In comparison to the result of the proposed approach (97.13%), the classification results of the dense sampling approach are 92.9573% and 87.4695% for step sizes of two and four, respectively. Clearly, dense sampling produces sub-optimal results compared to a keypoint detection scheme when there is no background context information and when the object size is small. Therefore, we conclude that keypoint detection is the better choice for the proposed object classification system, as demonstrated above.
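For completeness, the dense-sampling baseline used in this comparison can be generated as follows; the grid offset convention is an assumption, as the paper only specifies the step size.

    import numpy as np

    def dense_grid(height, width, step=2):
        """Dense sampling baseline of Section 4.3: keypoints on a regular grid with
        the given step (two or four pixels in our experiments)."""
        ys, xs = np.mgrid[step // 2:height:step, step // 2:width:step]
        return np.stack([ys.ravel(), xs.ravel()], axis=1)   # (row, col) locations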
Fig. 8. Cross-validation accuracy for various entropy thresholds.
5. Conclusion

In this paper, we proposed a cue-based object classification framework to tackle the problem of categorizing previously unseen objects under various viewpoints. Firstly, we proposed the use of the log-polar transform for extracting invariant local descriptors belonging to multiple object cues, namely structure, texture and shape. As part of this system, a novel scheme was devised to obtain these multiple object cues from the input image. Secondly, we proposed an efficient keypoint detection scheme using image denoising and demonstrated that it outperforms dense grid sampling by a large margin. Additionally, we proposed a codebook optimization scheme that can improve the classification accuracy while significantly reducing the codebook size. We tested our system on the popular ETH-80 object dataset and demonstrated very high classification performance compared to the state-of-the-art methods that make use of the ground truth shape information. Our future work aims to extend the proposed framework to classify color images by incorporating a separate color cue channel, which remains to be fully investigated.
Fig. 9. Two common settings for dense sampling.
Acknowledgments

We would like to thank M. S. Muthukaruppan for engaging in extensive discussion and providing valuable comments. The authors
[33] B. Leibe, K. Mikolajczyk, B. Schiele, Segmentation based multi-cue integration for object detection., in: British Machine Vision Conference, Springer Berlin Heidelberg, 2006, pp. 1169–1178. [34] A. Vedaldi, V. Gulshan, M. Varma, A. Zisserman, Multiple kernels for object detection, in: IEEE International Conference on Computer Vision, IEEE Computer Society, 2009, pp. 606–613. [35] P. Gehler, S. Nowozin, On feature combination for multiclass object classification, in: 12th International Conference on Computer Vision, IEEE, 2009, pp. 221–228. [36] K. Wang, L. Lin, J. Lu, C. Li, K. Shi, PISA: Pixelwise image saliency by aggregating complementary appearance contrast measures with edge-preserving coherence, IEEE Trans. Image Process. 24 (2015) 3019–3033. [37] F. Lu, X. Yang, W. Lin, R. Zhang, S. Yu, Image classification with multiple feature channels, Opt. Eng. 50 (2011) 57210–57219. [38] J. Shotton, A. Blake, R. Cipolla, Efficiently combining contour and texture cues for object recognition., in: British Machine Vision Conference, Springer Berlin Heidelberg, 2008, pp. 1–10. [39] J. Shotton, J. Winn, C. Rother, A. Criminisi, Textonboost for image understanding: multi-class object recognition and segmentation by jointly modeling texture, layout, and context, Int. J. Comput. Vis. (IJCV) 81 (2009) 2–23. [40] J. Shotton, A. Blake, R. Cipolla, Multiscale categorical object recognition using contour fragments, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008) 1270–1281. [41] P.M. Kumar, P. Torr, A. Zisserman, Extending pictorial structures for object recognition, in: British Machine Vision Conference, Springer Berlin Heidelberg, 2004, pp. 789–798. [42] A. Bosch, A. Zisserman, X. Munoz, Representing shape with a spatial pyramid kernel, in: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, ACM, 2007, pp. 401–408. [43] S. Belongie, J. Malik, J. Puzicha, Shape matching and object recognition using shape contexts, IEEE Trans. Pattern Anal. Mach. Intell. 24 (2002) 509–522. [44] K. Chatfield, V. Lempitsky, A. Vedaldi, A. Zisserman, The devil is in the details: an evaluation of recent feature encoding methods, in: Proceedings of the British Machine Vision Conference, Springer Berlin Heidelberg, 2011, pp. 76.1–76.12. [45] I. Kokkinos, A. Yuille, Scale invariance without scale selection, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 2008, pp. 1 –8. [46] K. Mikolajczyk, C. Schmid, Scale & affine invariant interest point detectors, Int. J. Comput. Vis. 60 (2004) 63–86. [47] A. Buades, B. Coll, J. Morel, A review of image denoising algorithms, with a new one, Multiscale Model. Simul. 4 (2005) 490–530. [48] D. Sun, S. Roth, M. Black, Secrets of optical flow estimation and their principles, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, 2010, pp. 2432–2439. [49] A. Wedel, T. Pock, C. Zach, H. Bischof, D. Cremers, An improved algorithm for tv-l1 optical flow, in: Statistical and Geometrical Approaches to Visual Motion Analysis, Lecture Notes in Computer Science, vol. 5604, Springer Berlin Heidelberg, 2009, pp. 23–45. [50] J. Winn, A. Criminisi, T. Minka, Object categorization by learned universal visual dictionary, in: Tenth IEEE International Conference on Computer Vision, vol. 2, IEEE, 2005, pp. 1800 –1807. [51] A. Ribes, S. Ji, A. Ramisa, R.L. 
de Mántaras, Self-supervised clustering for codebook construction: An application to object localization, in: Proceedings of the 14th International Conference of the Catalan Association for Artificial Intelligence, IOS Press, 2011, pp. 208–217. [52] S. Kim, I.S. Kweon, Object categorization robust to surface markings using entropyguided codebook, in: Workshop on Applications of Computer Vision, IEEE, 2007, pp. 22–22. [53] L. Wang, L. Zhou, C. Shen, A fast algorithm for creating a compact and discriminative visual codebook, in: European Conference on Computer Vision, Lecture Notes in Computer Science, vol. 5305, Springer Berlin Heidelberg, 2008, pp. 719–732. [54] S. Kim, I.S. Kweon, Simultaneous classification and visual word selection using entropy-based minimum description length, in: International Conference on Pattern Recognition, vol. 1, IEEE, 2006, pp. 650–653. [55] A. Chambolle, Total variation minimization and a class of binary MRF models, in: Energy Minimization Methods in Computer Vision and Pattern Recognition, Lecture Notes in Computer Science, vol. 3757, Springer Berlin Heidelberg, 2005, pp. 136–152. [56] E.L. Schwartz, Spatial mapping in the primate sensory projection: analytic structure and relevance to perception, Biol. Cybern. 25 (1977) 181–194. [57] D. Young, Straight lines and circles in the log-polar image, in: British Machine Vision Conference, Springer Berlin Heidelberg, 2000, pp. 426–435. [58] A. Vedaldi, B. Fulkerson, Vlfeat: An open and portable library of computer vision algorithms, 2008. [59] H. Ling, D. Jacobs, Shape classification using the inner-distance, IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007) 286–299. [60] R.-X. Hu, W. Jia, Y. Zhao, J. Gui, Perceptually motivated morphological strategies for shape retrieval, Pattern Recognit. 45 (2012) 3222–3230. [61] J. Wang, X. Bai, X. You, W. Liu, L.J. Latecki, Shape matching and classification using height functions, Pattern Recognit. Lett. 33 (2012) 134–143. [62] M. Daliri, V. Torre, Robust symbolic representation for shape recognition and retrieval, Pattern Recognit. 41 (2008) 1782–1798. [63] M.R. Daliri, V. Torre, Shape recognition based on kernel-edit distance, Comput. Vis. Image Understand. 114 (2010) 1097–1103. [64] M.-M. Cheng, N.J. Mitra, X. Huang, P.H.S. Torr, S.-M. Hu, Global contrast based salient region detection, IEEE Pattern Anal. Mach. Intell. 37 (2015) 569–582.
also gratefully acknowledge Dr. Cheong Loong Fah and Dr. Yan Shuicheng for their insightful comments that have led to substantial improvements of the paper. Last but not least, the authors appreciate the inputs of Dr. Lu Huchuan at various stages of this work. References [1] L. Roberts, Pattern recognition with an adaptive network, in: Proceedings of the IRE International Convention Record, IEEE, 1960, pp. 66–70. [2] E.N. Malamas, E.G. Petrakis, M. Zervakis, L. Petit, J.-D. Legat, A survey on industrial vision systems, applications and tools, Image Vis. Comput. 21 (2003) 171–188. [3] S. Mori, H. Nishida, H. Yamada, Optical Character Recognition, John Wiley & Sons, Inc., 1999. [4] M. Ejiri, Machine vision in early days: Japan's pioneering contributions, in: Asian Conference on Computer Vision, Lecture Notes in Computer Science, vol. 4843, Springer Berlin Heidelberg, 2007, pp. 35–53. [5] N.K. Logothetis, D.L. Sheinberg, Visual object recognition, Annu. Rev. Neurosci. 19 (1996) 577–621. [6] B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, IEEE Computer Society, 2003, pp. 409–415. [7] N. Otsu, A threshold selection method from gray-level histograms, Automatica 11 (1975) 23–27. [8] C. Yang, L. Zhang, H. Lu, Graph-regularized saliency detection with convex-hullbased center prior, IEEE Signal Process. Lett. 20 (2013) 637–640. [9] J. Kittler, J. Illingworth, Minimum error thresholding, Pattern Recognit. 19 (1986) 41–47. [10] A. Rosenfeld, P. de la Torre, Histogram concavity analysis as an aid in threshold selection, IEEE Trans. Syst. Man Cybern. 13 (1983) 231–235. [11] J. Kapur, P. Sahoo, A. Wong, A new method for gray-level picture thresholding using the entropy of the histogram, Comput. Vis. Graph. Image Process. 29 (1985) 273–285. [12] J.M.S. Prewitt, M.L. Mendelsohn, The analysis of cell images, Ann. NY Acad. Sci. 128 (1966) 1035–1053. [13] C. Glasbey, An analysis of histogram-based thresholding algorithms, CVGIP: Graphical Models and Image Processing 55 (1993) 532–537. [14] W. Doyle, Operations useful for similarity-invariant pattern recognition, J. Assoc. Comput. Mach. 9 (1962) 259–267. [15] W.-H. Tsai, Moment-preserving thresolding: a new approach, Comput. Vis. Graph. Image Process. 29 (1985) 377–393. [16] J. Crowley, A.C. Parker, A representation for shape based on peaks and ridges in the difference of low-pass transform, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 156–170. [17] J. Crowley, A.C. Sanderson, Multiple resolution representation and probabilistic matching of 2-D gray-scale shape, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 113–121. [18] I. Biederman, Recognition-by-components: a theory of human image understanding, Psychol. Rev. 94 (1987) 115. [19] T. Tuytelaars, K. Mikolajczyk, Local invariant feature detectors: a survey, Found. Trends Comput. Graph. Vis. 3 (2008) 177–280. [20] A. Andreopoulos, J.K. Tsotsos, 50 years of object recognition: Directions forward, Comput. Vis. Image Understand. 117 (2013) 827–891. [21] X. Wang, B. Feng, X. Bai, W. Liu, L.J. Latecki, Bag of contour fragments for robust shape classification, Pattern Recognit. 47 (2014) 2116–2125. [22] P. Arbelaez, M. Maire, C. Fowlkes, J. Malik, Contour detection and hierarchical image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2011) 898–916. [23] R.A. Messner, H.H. 
Szu, An image processing architecture for real time generation of scale and rotation invariant patterns, Comput. Vis. Graph. Image Process. 31 (1985) 50–66. [24] L.I. Rudin, S. Osher, E. Fatemi, Nonlinear total variation based noise removal algorithms, Physica D: Nonlinear Phenomena 60 (1992) 259–268. [25] B. Ramesh, C. Xiang, T.H. Lee, Shape classification using invariant features and contextual information in the bag-of-words model, Pattern Recognit. 48 (2015) 894–906. [26] G. Csurka, C.R. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: In Workshop on Statistical Learning in Computer Vision, ECCV, Springer Berlin Heidelberg, 2004, pp. 1–22. [27] S. Agarwal, A. Awan, D. Roth, Learning to detect objects in images via a sparse, part-based representation, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 1475–1490. [28] T. Leung, J. Malik, Representing and recognizing the visual appearance of materials using three-dimensional textons, Int. J. Comput. Vis. 43 (2001) 29–44. [29] E. Nowak, F. Jurie, B. Triggs, Sampling strategies for bag-of-features image classification, in: European Conference on Computer Vision, Lecture Notes in Computer Science, vol. 3954, Springer Berlin Heidelberg, 2006, pp. 490–503. [30] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition, IEEE Computer Society, 2005, pp. 886– 893. [31] D. Lowe, Object recognition from local scale-invariant features, in: International Conference on Computer Vision, vol. 2, IEEE, 1999, pp. 1150–1157. [32] F.S. Khan, J. Weijer, A.D. Bagdanov, M. Vanrell, Portmanteau vocabularies for multi-cue image representation, in: Advances in Neural Information Processing Systems, Curran Associates Inc., 2011, pp. 1323–1331.
Bharath Ramesh received the B.E. degree in electrical and electronics engineering from Anna University of India in 2009; M.Sc. and Ph.D. degrees in electrical engineering from the National University of Singapore in 2011 and 2015 respectively, working at the Control and Simulation Laboratory on Image Classification using Invariant Features. Bharath's main research interests include pattern recognition and computer vision. At present, his research is centred on event-based cameras for autonomous robot navigation.
Cheng Xiang received the B.S. degree in mechanical engineering from Fudan University, China in 1991; M.S. degree in mechanical engineering from the Institute of Mechanics, Chinese Academy of Sciences in 1994; and M.S. and Ph.D. degrees in electrical engineering from Yale University in 1995 and 2000, respectively. He is an Associate Professor in the Department of Electrical and Computer Engineering at the National University of Singapore. His research interests include computational intelligence, adaptive systems and pattern recognition.