An improved distance-based relevance feedback strategy for image retrieval


Image and Vision Computing 31 (2013) 704–713


Miguel Arevalillo-Herráez, Francesc J. Ferri
Department of Computer Science, University of Valencia, Avda. de la Universidad s/n, 46100 Burjasot, Spain

Article history: Received 18 December 2012; Received in revised form 13 May 2013; Accepted 18 July 2013
Keywords: CBIR; Image retrieval; Relevance feedback; Nearest neighbor

Abstract

Most CBIR (content-based image retrieval) systems use relevance feedback as a mechanism to improve retrieval results. NN (nearest neighbor) approaches provide an efficient method to compute relevance scores, using estimated densities of relevant and non-relevant samples in a particular feature space. In this paper, particularities of the CBIR problem are exploited to propose an improved relevance feedback algorithm based on the NN approach. The resulting method has been tested in a number of different situations and compared to the standard NN approach and other existing relevance feedback mechanisms. Experimental results show significant improvements in most cases. © 2013 Elsevier B.V. All rights reserved.

☆ This paper has been recommended for acceptance by Nicu Sebe.
⁎ Corresponding author. Tel.: +34 96 3543962. E-mail addresses: [email protected] (M. Arevalillo-Herráez), [email protected] (F.J. Ferri).
http://dx.doi.org/10.1016/j.imavis.2013.07.004

1. Introduction

Content-based image retrieval (CBIR) refers to the application of techniques to retrieve digital images from large databases by analyzing the actual content of the image rather than the metadata associated with it. In general, a CBIR system represents each image in the repository as a set of features (usually related to color, texture and shape), and uses a set of distance functions defined over this feature space to estimate similarity between pictures. A query can be understood as the intention of a user to retrieve a certain kind of images, and it is usually materialized as one or more sample pictures. The goal of a CBIR system is to retrieve a set of images that is best suited to the user's intention. Obviously, the potential results of such a system will strongly depend not only on the particular features of the representation space but also on the implicit or explicit distance functions used to measure similarity between pictures [1–3].

This way of assessing similarity comes along with the implicit assumption that image resemblance is related to a distance defined over a particular feature space. This leads to the so-called semantic gap between the semantics induced from the low-level features and the high-level, meaningful user interpretation of the image. To reduce this gap, relevance feedback has been adopted by most recent CBIR systems [4]. When relevance feedback is used, the search is considered an iterative process in which the original query is refined

interactively, to progressively obtain a more accurate result. At each iteration, the system retrieves a series of images according to a predefined similarity measure, and requires user interaction to mark the relevant and non-relevant retrievals. This data is used to modify some system parameters and produce a new set of results, repeating the process until a sufficiently satisfactory result is obtained. In this context, the relationship between any image in the database and the user's desire is usually expressed in terms of a relevance value. This value is aimed at directly reflecting the interest that the user may have in the image and is to be refined at each iteration.

Most relevance feedback algorithms use the user's selection to search for global properties which are shared by the relevant samples available at each iteration [4]. From a Pattern Recognition viewpoint, this can be seen as obtaining an appropriate estimate of the probability of (subjective) relevance. Many different approaches exist to model and progressively refine these estimates. But relevance feedback faces a small sample problem, and such models cannot be reliably established because of the semantic gap. In this context, nonparametric distance-based methods using neighbors are particularly appealing [5–8]. The aim of these methods is to assess the relevance of a given image by using distances to relevant and non-relevant neighbors. In particular, an image is considered more relevant the smaller its distance to the nearest relevant image is in comparison with the distance to its nearest non-relevant image.

In the present work, all these considerations about distance-based CBIR approaches are taken into account to derive a novel way of reliably estimating relevance from distances. The algorithm is then evaluated exhaustively on three databases and in a variety of contexts, including both query by example and refinement of a textual search.

The paper is organized as follows.
The next section explains the model used, outlines the assumptions made, presents the naming conventions used throughout the article, and describes the plain nearest neighbor approach. Section 6 discusses several key facts about the way relevance is estimated and introduces a novel alternative. In the experimental section the proposed algorithm is compared against the original one [5,6], some extensions [7,8], and other representative relevance feedback methods. Finally, the main conclusions of the proposed approach are outlined along with some work in progress.

2. Related work

Relevance feedback in CBIR has been an active research topic for the last two decades. In general, the performance of CBIR algorithms depends critically on the (dis)similarity measure used to rank the images in the repository. This measure is commonly built or adapted at each iteration, using the information made available by the user. In this section, we summarize some previous work on relevance feedback, with the intention of contextualizing the method presented in this paper. For a more comprehensive survey, the reader is referred to recent reviews on the topic [1,9].

The first approaches aimed at progressively adapting the similarity measure and/or moving the query point so that it becomes closer to positive samples and farther from negative ones. Query point movement and axis re-weighting methods fall within this category of techniques [10–12]. In general, these approaches model the query as a point in a (possibly deformed) feature space, and retrieve results according to their distance to the query. A major advantage of these techniques is that they are relatively fast and scale reasonably well with the size of the repository. On the negative side, they usually ignore dependencies between features [13], treat features globally [14], and are only effective if the query concept occupies a convex region in the feature space.

A different way to approach the definition of an adequate similarity function is from a pattern recognition perspective. Relevance feedback can be considered a classical machine learning problem, in which the user feedback is used as input to a learning algorithm that addresses the classification of images as relevant or non-relevant [9]. This opens the scope for the application of a large diversity of popular algorithms in this context. For example, labeled samples can be used to build a projection into a subspace of lower dimensionality where relevant samples appear close to each other [15,16]; or to learn a Mahalanobis metric based on pairwise (dis)similarity constraints [17] or on small subsets of points that are known to belong to the same class [18]. Closely related, support vector machine (SVM) methods attempt to find the hyperplane which achieves a maximum separation between the two classes [19–22]. One-class, two-class and other extensions have been adapted to overcome some of the inherent limitations of standard SVMs, e.g., imbalanced training sets and a high computational burden [9]. A major problem with some of these methods is their high sensitivity to the parameters required for fine-tuning the algorithms [7]. Other related approaches include the use of neural networks, e.g., radial basis function (RBF) networks [23], self-organizing maps (SOM) [14], fuzzy sets [24] or regression methods [13]. Despite the many efforts in this direction, many of these algorithms suffer from the small sample size problem, caused by the relatively scarce information provided at each relevance feedback iteration.

Other strategies include Bayesian approaches. In this case, posterior probability distributions are estimated according to the data gathered through the relevance feedback process. In particular, the probability densities of the relevant and non-relevant classes are usually approximated by using different types of estimators [25–29]. Then, the probability of being relevant is used as a similarity measure, as in the case of using a soft classifier. In this way, relevance values are implicitly modeled as a probability distribution, rather than as a single point in the feature space. Nearest neighbor methods can be classified in this category, and are used in this context to estimate the posterior probabilities of the relevant and non-relevant classes. In addition, they are compatible with other distance metric learning approaches and can also be used as a framework to determine relevance by using other distances learned from the user feedback. These methods have previously been applied in the CBIR field [5–8], showing a comparative performance superior to other Bayesian frameworks [5] and classical SVM techniques [7]. In this paper, we build on some of these previous works by proposing a series of strategies to face some fundamental shortcomings of NN-based approaches.

3. Interactive content-based image retrieval

3.1. Query representation

Assume that there is a repository of images X = {x1, …, xm} conveniently represented in a metric feature space F whose associated distance measure is d : F × F → ℝ≥0. This representation space is assumed to be the D-dimensional space ℝ^D in this work, as in many other closely related works. In some cases, and especially in the image retrieval context, the representation space may comprise multiple low-level descriptors (e.g., color, texture or shape), and the distance d is constructed as a combination of simpler distance measures over each descriptor [1].

When a particular user is interested in retrieving images from X, his/her intention can be thought of as a semantic concept which can be more or less objectively specified (e.g., images of bicycles, domestic animals, etc.). Regardless of the scope and specificity of this semantic concept, it can be modeled in the feature space as a probability function over the corresponding repository, P(relevant|x), which can be extended to the whole representation space.

This probability of relevance over the space can be conveniently simplified and tackled. For example, single point query approaches assume that this probability function can be appropriately represented by a single (ideal) point c ∈ F, possibly along with a convenient axis or feature re-weighting [30]. This can be seen as equivalent to considering uncorrelated Gaussian distribution functions. This approach can be extended to use a set of representative points or mixtures of Gaussians instead [31]. Single point methods use a distance measure to rank images. Multiple point methods usually combine distances (or rankings) to each representative in a set of (ideal) points C. In any case, all methods end up using a particular ranking as the final tool to retrieve images, regardless of the way they model the query in the feature space. Fig. 1 graphically illustrates this situation.

3.2. Relevance feedback

In this paper we assume the most usual case in relevance feedback, in which the user marks or labels some images as relevant or non-relevant. In this setup, the available information or user feedback is given by the

Fig. 1. An illustrative example of query representations. User intentions (semantic space) get translated into particular regions in the representation space where relevant images (marked as +) are more likely.


set of images from Q ⊂ X already seen by the user and marked either as relevant (positive), Q+, or as non-relevant (negative), Q−. Both disjoint subsets, Q+ and Q−, can be seen as samples corresponding to the distribution functions that determine P(relevant|x), as in a typical two-class classification setting. This information is then used to produce a particular ranking over the database.

If the set of images shown to the user for labeling were large and representative enough and the user were infallible, we would be facing a standard learning problem that could be tackled with many different methodologies. But the sets involved are very small and poorly representative. Moreover, the user labeling may in practice contain several types of errors. Hence, the learning problem involved is very hard to solve effectively, not only because of small sample size effects and unrepresentativeness, but also because of the strong dependences introduced by the way in which new evidence is progressively taken into account.

4. Relevance estimation through nearest neighbors

Nearest neighbor (NN) methods have been extensively used in the context of learning, vision and pattern recognition due to their well-known, convenient and well-studied practical and asymptotic behavior [32,33]. In general, and regardless of whether they are used for classification or estimation, they show a very consistent behavior ranging from the large-scale to the small sample size case. NN approaches base probability estimation on the ratio of the number of neighbors of a certain kind to the volume of the hypersphere that contains them. This is known to be a reasonable estimate of the corresponding probability distributions [34], and has previously been used in the context of image retrieval as a way to compute scores by using the relative distance of an image to its nearest relevant and non-relevant neighbors [9].
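The iterative process described above — rank the repository, show the top-w unseen images, collect the user's marks into Q+ and Q−, and re-rank — can be sketched as follows. This is an illustrative sketch, not the authors' implementation; `score` and `get_labels` are hypothetical placeholders for a relevance estimator and for the (possibly simulated) user.

```python
def relevance_feedback_search(database, score, get_labels, w=20, iterations=8):
    """Generic relevance feedback loop (illustrative sketch).

    score(x, q_pos, q_neg) -> relevance estimate for image x;
    get_labels(shown)      -> iterable of (image, is_relevant) user marks.
    """
    q_pos, q_neg = [], []  # Q+ and Q-: images marked relevant / non-relevant
    for _ in range(iterations):
        # Rank the whole repository by the current relevance estimate.
        ranked = sorted(database, key=lambda x: score(x, q_pos, q_neg), reverse=True)
        # Present the top-w images the user has not judged yet.
        shown = [x for x in ranked if x not in q_pos and x not in q_neg][:w]
        if not shown:
            break
        for x, is_relevant in get_labels(shown):
            (q_pos if is_relevant else q_neg).append(x)
    return q_pos, q_neg
```

With a toy one-dimensional "repository" and a simulated user who marks values below 5 as relevant, two iterations are enough for the loop to separate the two classes.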
In [7], a single nearest neighbor approximation was proposed to estimate the density of the relevant and non-relevant classes as inversely proportional to the volume of the corresponding 1-neighborhoods, VR(dR(x)) and VN(dN(x)), where the subscripts refer to the nearest relevant (R) and non-relevant (N) neighbors, respectively, and dR and dN are the corresponding distances to each neighbor from x. From the separate estimates, and obviating some constant terms and the exponent in the volume formulae, the following expression can be arrived at [7]:

P(relevant|x) = dN(x) / (dR(x) + dN(x))    (1)

For convenience, the ratio of distances in the above formula can be substituted by other equivalent ones from the point of view of ranking relevant images. The most used one is dN(x)/dR(x). In a recent work, this ratio has been smoothed by introducing a moderating term that decreases with the distance to the closest relevant image. The modified ratio, dN(x)/dR(x)², has been shown to improve the basic one in some cases [8].

In all NN methods, and regardless of the particular expression used to compute the scores (or the corresponding probability estimate), this ratio is calculated for each image in the database. Then, the results are used to rank the images and the top w (window size) images are presented to the user for evaluation.

When multiple descriptors are available (e.g., color, texture or shape), both distance and score combination approaches are compatible with the NN methods. When distance combination methods are used, the distance function d is constructed as a combination of simple distance measures over each descriptor space [1], and scores are computed according to the corresponding ratio expression (e.g., dN(x)/dR(x)). Score combination methods would use this ratio expression in each representation space and then combine the results to produce a new score which gathers the contributions of the multiple representations. A common approach for both cases is the use of a linear combination of normalized distances/scores, in which weights reflect the relevance of each feature representation to the user's need [7]. The following expression is used in [5] to normalize the scores obtained for each descriptor:

relevance_NN(x) = 1 − exp(−dN(x)/dR(x))    (2)

where dR and dN represent the distances to the nearest relevant and non-relevant sample, respectively.

5. Particularities in nearest neighbor estimates in image retrieval

NN-based relevance feedback gives surprisingly good results in practice in comparison to other state-of-the-art techniques [7]. Nevertheless, and apart from other improvements related to more meaningful or robust representations (e.g., using dissimilarity spaces) or to hybridization techniques (e.g., with Bayesian relevance feedback), there is still room for improvement in the NN approach related to relevance feedback itself. In this section, several issues in the application of the non-parametric NN density estimation to the CBIR problem are carefully analyzed.

5.1. Unbalanced number of samples

A first problem, initially reported in [7], is caused by the relative sizes of the Q+ and Q− sets. In general, the number of relevant items is by far smaller than the number of non-relevant ones, even in the surroundings of the elements in C. This typically causes the number of elements in Q+ to be lower than in Q−. When a relevant selection is surrounded by non-relevant ones, the above rankings produce high values in a very

Fig. 2. Probability of relevance as in Eq. (2) (left) and the same using a moderating term [8] (right). There is one relevant image (+) surrounded by three non-relevant ones (x). Using the basic ratio, images far away in the top left direction get ranked too high.
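The scores discussed above — the probability ratio of Eq. (1), the normalized NN score of Eq. (2) and the smoothed variant of Eq. (3) — are straightforward to compute once the distances dR(x) and dN(x) to the nearest relevant and non-relevant labeled images are known. A minimal sketch, assuming those distances are precomputed and strictly positive:

```python
import math

def prob_relevance(d_r, d_n):
    """Eq. (1): P(relevant|x) = dN / (dR + dN)."""
    return d_n / (d_r + d_n)

def nn_relevance(d_r, d_n):
    """Eq. (2): basic normalized NN score, 1 - exp(-dN/dR)."""
    return 1.0 - math.exp(-d_n / d_r)

def snn_relevance(d_r, d_n):
    """Eq. (3): smoothed NN score, 1 - exp(-dN/dR^2).

    The extra dR factor in the denominator rewards images close to a
    relevant selection and penalizes those whose nearest relevant
    image is far away."""
    return 1.0 - math.exp(-d_n / d_r ** 2)
```

For the configuration of Fig. 2, the smoothing is visible numerically: at equal dN, an image with large dR scores lower under Eq. (3) than under Eq. (2), while one with small dR scores higher.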


small closed region around it. But among the images outside this region, the top ranked ones are those which are far from both relevant and non-relevant samples. This undesirable effect is illustrated in Fig. 2 (left). The chances of this type of situation increase with the relevance feedback iterations, as areas around positive selections tend to be explored more in depth.

This problem has been treated in [7,8]. In [7], the score of the conventional NN technique is combined with another obtained by using a Bayesian query shifting (BQS) approach [27]. A linear combination is used, and weights are computed dynamically at each iteration by considering the number of relevant and non-relevant samples. This leads to a stabilized score that combines an exploitation term (NN) with an exploration term (BQS). In [8], a conveniently smoothed NN estimate (SNN) was defined by proposing an alternative formulation of the approach that increased the importance of Q+. In particular, Eq. (2) was modified by introducing a moderating term that rewards pictures which are close to positive selections and penalizes others which are far away. To this end, the following expression was used to compute the relevance scores:

relevance_SNN(x) = 1 − exp(−dN(x)/dR(x)²)    (3)

The effect of introducing this modification in the plain NN approach is illustrated in Fig. 2 (right).

5.2. Variable density

Differently populated regions in the feature space cause relevance estimates to work with significantly different sample densities [8]. As the ratio of distances is defined in a global way, densely populated regions with high relevance values will tend to dominate the ranking. This problem is caused both by the possibly uneven distribution of images in the repository, X, and by the complex relationship between perceptual similarity and the distance used to find neighbors, which may in turn differ across regions of the feature space. That is, the probability of relevance may scale differently with distance in different regions.

A way to overcome this problem consists of building independent rankings for each relevant selection and combining them to produce a more consistent final ordering [8]. As a result, a set of r local searches (one per relevant selection q_i+ in the set Q+) is carried out. Each of these searches is performed by using Eq. (3), but considering the picture q_i+ as the only relevant sample. This results in a set of r independent rankings R = {R1 ⋯ Rr}, one for each local search. Finally, each picture is assigned a score which is inversely proportional to its best ranking position in the set of rankings R. This technique makes the approach robust against areas of different density.
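The per-selection ranking scheme of [8] can be sketched as follows. This is an illustrative reconstruction, not the authors' code: `dist` stands for the combined distance, `score_fn` for any of the ratio scores above (a function of dR and dN), and each image's final position is its best rank over the r local searches.

```python
def combined_local_ranking(database, q_pos, q_neg, dist, score_fn):
    """One local search per relevant selection q_i+ (treated as the only
    relevant sample); each image keeps its best (lowest) local rank, and
    the final ordering sorts by that best rank.

    Assumes q_pos and q_neg are non-empty."""
    best_rank = {x: len(database) for x in database}
    for q in q_pos:
        local = sorted(
            database,
            key=lambda x: score_fn(dist(x, q), min(dist(x, n) for n in q_neg)),
            reverse=True,  # higher score = earlier position in the local ranking
        )
        for position, x in enumerate(local):
            best_rank[x] = min(best_rank[x], position)
    return sorted(database, key=lambda x: best_rank[x])
```

Because each local search sees a single relevant sample, a dense cluster of relevant selections cannot dominate the global ordering: every q_i+ contributes its own neighborhood to the top of the final ranking.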

5.3. Small sample size

Another problem that was identified in [7] is related to the reliability of the nearest neighbor density estimation under a small sample size. This problem is intrinsic to the task at hand, and it is shared with most methods based on machine learning. The use of the k-th distance instead of the first distance used in Eq. (2) could potentially contribute to alleviating this problem [6]. However, the number of relevant samples in the usual case is not sufficient to allow for a reliable probability estimate by using a k-nearest neighbor approach.

5.4. Locality of labeled samples

Another major limitation is imposed by the fact that NN density estimation, like most other non-parametric density estimation methods, relies on a labeled set of random samples. This principle is violated in the CBIR problem, in which user judgments (the labeled samples) concentrate on regions of a (probably deformed) feature space. This bias is introduced by the relevance feedback mechanism itself, which does not sample the feature space randomly but rather according to the probability of finding relevant pictures. This yields considerably different sampling densities across the feature space. Hence, the reliability of the scores obtained for each sample also varies across the feature space. On the one hand, for images which lie far from populated regions, the samples provide little information about the density function. On the other hand, density estimates for images lying within populated regions are more accurate. Despite the importance and implications of this problem, to the best of our knowledge, it has not even been mentioned in the CBIR literature. In the next section we extend the standard NN method to compensate for this effect.

6. Improving reliability in nearest neighbor estimates on unevenly populated spaces

Common relevance feedback mechanisms are based on processing user feedback on a set of images Q ⊂ X. In most CBIR systems, the elements in Q are the first w elements in the ranking produced at the previous iteration, and these were computed as those with the highest probability P(relevant|x) of being relevant to the user query. This process leads to a severely biased estimate, and the reliability of the scores obtained varies dramatically across the feature space.

Fig. 3. Estimates of reliability (left) and relevance (right) for the same illustrative example as in Fig. 2.


Let us assume that there is a new random variable, reliable, which can take a value in the set {true, false}, depends on x, and represents whether the posterior probability of relevance, P(relevant|x), is actually certain. The corresponding estimate of relevance at a given point x will be trusted if reliable is true. But if reliable is false, then the only information about relevance is given by the prior P(relevant), regardless of x. If information about reliability is available and assuming independence, it is possible to obtain a corrected probability of relevance at x as

P′(relevant|x) = P(reliable|x) · P(relevant|x) + (1 − P(reliable|x)) · P(relevant)    (4)

The use of P(reliable|x) allows facing both the small sample size and the locality of labeled samples problems described above. The definition of such a probability function is domain dependent, but it is certainly related to the density of samples around the estimation point x. One way to approximate it in a CBIR context is to define it as a function f : ℝ≥0 → [0,1] that maps the distance from x to its nearest sample to an estimate of the probability P(reliable|x).

6.1. Implementing reliable estimates using distances

The application of Eq. (4) assumes that P(relevant|x), P(reliable|x) and the prior P(relevant) are known. The first of these terms was already given in Eq. (1). Although other estimates of the probability P(reliable|x) are possible, one simple estimate when distances are already normalized in the range [0,1] is given by

P(reliable|x) ≈ f(min(dR(x), dN(x))) = 1 − min(dR(x), dN(x))    (5)

This distance-based approximation is of a similar nature to that of Eq. (1). The first elements in the ranking determined by Eq. (4) are the ones in which P(reliable|x) and P(relevant|x) are simultaneously close to one. In addition, the second term on the right hand side of Eq. (4) tends to zero as P(reliable|x) tends to one. This makes it possible to neglect this term without significantly affecting the first elements in the final ranking produced. By doing this and considering Eqs. (1) and (5), Eq. (4) can be written as:

P′(relevant|x) ≈ (1 − min(dR(x), dN(x))) · dN(x) / (dR(x) + dN(x))    (6)
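Under the same assumption of distances normalized to [0,1], the corrected score of Eq. (6) is a one-line combination of the reliability term of Eq. (5) and the ratio of Eq. (1). A minimal sketch:

```python
def corrected_relevance(d_r, d_n):
    """Eq. (6): reliability-weighted NN relevance.

    Assumes d_r and d_n are normalized to [0,1] and not both zero.
    reliability (Eq. 5): 1 - min(dR, dN)
    relevance   (Eq. 1): dN / (dR + dN)
    """
    reliability = 1.0 - min(d_r, d_n)
    return reliability * d_n / (d_r + d_n)
```

An image close to a relevant sample (small dR) keeps a score near the plain ratio, while an image far from every labeled sample has its score shrunk toward zero, which is the space-varying smoothing effect shown in Fig. 3.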

The effect of the proposed estimate is graphically illustrated in Fig. 3. By comparing this with the estimates in Fig. 2, it is possible to see that the new proposal leads to a smoothing effect similar to the one proposed in [8]. Nevertheless, the fundamental difference lies in the fact that the amount of smoothing varies across the representation space through the use of the reliability estimate.

7. Empirical evaluation

A number of comparative experiments using all nearest neighbor-based approaches considered in this work have been carried out. The original NN approach [5] without any other independent extensions has been taken as a starting point. This has allowed us to appropriately evaluate the impact of the improvements and extensions proposed in this work on several performance measures. In particular, the results obtained with the proposed method and with the following NN-based methods are reported: a) the basic nearest neighbor approach (NN) as in [5]; b) the composite NN approach using Bayesian query shifting (NN + BQS) [27]; and c) the NN approach enhanced with a smoothed estimator (SNN) as in [8]. In addition, other relevance feedback algorithms representative of the approaches described in Section 2 have been considered, namely: a) a simple query point movement (QPM) implementation to be used as a baseline; b) the SVM method in [19]; c) the probabilistic framework presented in [28]; d) the self-organizing map (SOM) method introduced in [14]; and

e) an approach based on fuzzy sets [24].

An exhaustive experimentation has been carried out using three different databases that have been previously used in similar studies. A summary of the details of each database is given in Table 1.

Table 1
Details of the three databases used in the experiments.

Name   Size    Dimension  Categories
Small  1508    50         29
Art    5476    104        63
Corel  30,000  89         71

Small: a small repository which was intentionally assembled for testing, using some images obtained from the Web and others taken by the authors.1 The 1508 pictures it contains are classified as belonging to 29 different themes such as flowers, horses, paintings, skies, textures, ceramic tiles, buildings, clouds, trees, etc. In this case, the 50-dimensional feature vectors include a 10 × 3 HS color histogram and texture information in the form of two granulometric cumulative distribution functions [35].

Art: has been manually compiled from the commercial collection "Art Explosion", distributed by the company Nova Development. This collection is composed of 5476 images organized in 63 categories [1]. The 10 × 3 HS color histogram and six texture features have been computed for each picture in this database, namely Gabor convolution energies [36], gray level co-occurrence matrices [37], Gaussian random Markov fields [38], the coefficients of fitting the granulometry distribution with a B-spline basis [39], and two versions of the spatial size distribution [40], one using a horizontal segment and another using a vertical segment [35]. The total number of features used is 104.

Corel: a subset of the Corel database used in [5]. It is composed of 30,000 images which were manually classified into 71 categories (Table 1).
The 89-dimensional representation uses the descriptors in the KDD-UCI2 repository, namely: a nine-component vector with the mean, standard deviation and skewness of hue, saturation and value in the HSV color space; a 16-component vector with the co-occurrence in horizontal, vertical and the two diagonal directions; a 32-component vector with the 4 × 2 HS color histograms for each of the sub-images resulting after one horizontal and one vertical split; and a 32-component vector with the HS histogram for the entire image.

All algorithms have been implemented according to the principles described in the original publications. In the case of the SVM approach, Gaussian kernels have been used and the parameters for the soft margin strength and class imbalance have been appropriately tuned. The SOM-based approach uses SOM sizes of 16 × 16, 32 × 32 and 64 × 64 for the Small, Art and Corel repositories, respectively. In all nearest neighbor approaches, distances between features have been estimated using the histogram intersection [41] on the color histograms and the Euclidean distance for the other descriptors, and they have been combined as suggested in the original publication [5]. In particular, the distances for the different descriptors have been normalized in the range [0,1] and then summed prior to applying the formulas in Section 5. Other options, such as using a weighted linear combination of the scores or using Gaussian normalization [42], have also been tried, but the results obtained were similar and, more importantly, they did not affect the relative merits of the proposals in this work.
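The combination scheme just described — per-descriptor distances normalized to [0,1] and then summed — can be sketched as below. This is an illustrative reconstruction (equal weights, min–max normalization over the candidate set), not the authors' exact code.

```python
def combine_descriptor_distances(per_descriptor):
    """Normalize each descriptor's distance list to [0,1] (min-max over the
    candidates) and sum the normalized values into one combined distance.

    per_descriptor: dict mapping a descriptor name (e.g. 'color', 'texture')
    to the list of that descriptor's distances from the query to each image.
    """
    n = len(next(iter(per_descriptor.values())))
    combined = [0.0] * n
    for dists in per_descriptor.values():
        lo, hi = min(dists), max(dists)
        span = (hi - lo) or 1.0  # a constant descriptor contributes 0 everywhere
        for i, d in enumerate(dists):
            combined[i] += (d - lo) / span
    return combined
```

The same skeleton accommodates the weighted variant mentioned above by multiplying each normalized term by a per-descriptor weight before summing.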

1 Available at http://www.uv.es/arevalil/dbImages/.
2 Available at http://kdd.ics.uci.edu/databases/CorelFeatures.


In the experiments carried out and reported, a setup similar to the one in [5,7] has been employed. The available categories have been used as concepts, and user judgments about similarity have been simulated by considering that only pictures under the same category represent the same concept. In order to systematically characterize the pros and cons of each method, two different kinds of queries have been simulated. On the one hand, a query by example has been considered by randomly choosing a picture from a random category. With this image as the unique positive example, that is, |Q+| = 1 and Q− = ∅, a given number of pictures are retrieved from the database according to the plain distance to the only picture in Q+. On the other hand, semantic queries, aimed at simulating content-based searches initiated by a textual query, have also been considered. These have been implemented by randomly choosing several non-clustered positive and negative examples in the following way. First, a category is chosen at random. Then a small number of relevant images, k (between 3 and 5, also chosen at random), are randomly taken from the database. Accordingly, another w − k non-relevant images are selected in the same way. Both kinds of queries use the general relevance feedback algorithm to proceed from their initial configuration. In the first case, the algorithm has only very local positive information about the concept sought. In the second, the information about relevant and non-relevant images is potentially spread throughout the whole representation space.

The following measures have been used to compare the performance of the algorithms considered: 1. Precision. The proportion of relevant pictures amongst the ones retrieved. 2. Recall. The proportion of relevant pictures retrieved with regard to the relevant ones in the whole database. 3. Average precision. When precision is plotted as a function of recall, the average value of precision over the different recall values.
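The initialization of a simulated semantic query can be sketched as follows. This is a minimal illustration under stated assumptions: `labels` (a mapping from image ids to category names) and the function name are hypothetical, not part of the original system.

```python
import random

def init_semantic_query(labels, w=20):
    """Simulate a semantic query: pick a random category, take k relevant
    examples (k drawn from [3, 5]) and w - k non-relevant ones at random."""
    category = random.choice(sorted(set(labels.values())))
    relevant = [i for i, c in labels.items() if c == category]
    non_relevant = [i for i, c in labels.items() if c != category]
    k = random.randint(3, 5)           # number of positive examples
    q_pos = random.sample(relevant, k)
    q_neg = random.sample(non_relevant, w - k)
    return category, q_pos, q_neg
```

A query by example is the degenerate case of the same interface with a single positive example and no negatives.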


Fig. 4. Retrieval precision (up), recall (middle) and averaged precision (down) measured on the first 20 retrieved images for 500 random queries by example (left) and semantic queries (right) on the small database.


The first two measures have been used to evaluate the performance of the techniques on the top positions of the ranking, as this is generally the main interest in CBIR. Both have been measured on a typical window size w = 20 (the number of images provided for judgment at each iteration). To measure recall, relevant images ranked among the top w at each iteration are marked as already recovered and are not considered as candidate pictures in subsequent iterations. For precision, images recovered at one iteration are still considered as candidate images for the next. Average precision is used to evaluate the entire ranking produced by each method. Although this may be of less interest from a CBIR perspective, it is an adequate measure to explore the potential of the proposed approach in other contexts. To obtain more reliable data, each technique has been evaluated with 500 random searches on each repository, using the same categories and initial picture order for all algorithms. The presented results are averaged values in all cases.

The performance results using the smallest of the databases considered are shown in Fig. 4 for the two kinds of queries that have been considered: query by example (left) and semantic queries (right). In this figure and the following ones, the NN method is not shown, both for clarity and because it was always equal to or worse than all of its variants. The fuzzy approach has also been removed from the plots, as it was always equal to or worse than the SOM approach. As can be seen in Fig. 4, the proposed method is among the best behaved according to all performance measures. Nevertheless, the composite method NN + BQS is significantly the best one for this database when queries start from very local information. On the contrary, when queries start with more spread information, this method gives significantly worse results in the first iterations (especially when measuring precision). Also remarkable is the poor performance of the SVM method in the first iterations, which can be attributed to the difficulty of training an appropriate relevance model when data is scarce. Accordingly, the SVM method is the one with the greatest relative improvement as iterations increase, especially when measuring precision and average precision on semantic queries.
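The three measures can be computed from a ranked list as in the following sketch. The helper names are hypothetical; `relevant` is assumed to be the set of image ids in the query's category.

```python
def precision_at_w(ranking, relevant, w=20):
    """Fraction of relevant images among the top-w positions of a ranking."""
    top = ranking[:w]
    return sum(1 for img in top if img in relevant) / w

def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant image
    appears, over all relevant images (summarizes the full ranking)."""
    hits, total = 0, 0.0
    for rank, img in enumerate(ranking, start=1):
        if img in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

For example, for the ranking [1, 2, 3, 4] with relevant set {1, 3}, the precision values at the relevant ranks are 1/1 and 2/3, giving an average precision of (1 + 2/3)/2 ≈ 0.833.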


Fig. 5. Retrieval precision (up), recall (middle) and averaged precision (down) measured on the first 20 retrieved images for 500 random queries by example (left) and semantic queries (right) on the Art database.


Fig. 5 shows the performance results for both types of queries using the Art database. This is a medium-size and relatively well-behaved repository, according to the absolute values of the performance measures shown in the figure. From these, very similar conclusions can be drawn with regard to the behavior of the proposed method with respect to most of its competitors. Contrary to the previous database, the NN + BQS method ranks among the worst for both types of queries and according to all performance measures. Also, the SVM method exhibits a very similar behavior, but on this occasion its relative improvement is so dramatic that the method gives clearly and significantly the best precision and, especially, average precision results in the last iterations. The large difference in average precision is not totally unexpected, as this measure evaluates the whole ranking and not only the top positions, and the SVM is by far the most powerful classifier when enough training data is available. A detail worth remarking is that the SNN method, which shares part of the rationale of the proposed algorithm, gives slightly better precision and average precision results in the case of queries by example on this database.

The results on the largest of the databases considered in this work are shown in Fig. 6. As can be seen, very similar results to those in the previous databases are obtained. The proposed method and SNN give very similar results in precision and average precision. In this database, differences are more significant in recall: the proposed method is significantly the best, and the difference increases with iterations. On the contrary, SNN performance gets worse with iterations. When measuring average precision along the whole ranking, the probabilistic approach gives the best results when performing semantic queries. The SVM approach follows a very similar behavior to that in the previous databases and ranks among the best at the last iterations. Apart from its behavior with regard to average precision, the probabilistic approach also improves on the results of the proposed approach, but only in the last iterations.


Fig. 6. Retrieval precision (up), recall (middle) and averaged precision (down) measured on the first 20 retrieved images for 500 random queries by example (left) and semantic queries (right) on the Corel database.


The performance measures shown in the previous figures are averaged over a particular set of 500 random queries of each type. These measures have very high variability, because queries differ widely depending on the particular class and on the starting information given to the algorithms. To effectively rank the different algorithms according to their general performance, and to measure the statistical significance of this ranking, each performance measure has been computed separately for each single query at each particular iteration, for all three databases considered. Measures for iterations 1 and 2 (where variability is even higher) have been excluded from this study. With all these measurements, a multiple comparison Friedman test [43] has been performed for each measure and kind of query. The resulting average ranks are shown in Table 2 for all methods, including the ones not shown in the previous figures. The average rank corresponding to the best method for each performance measure on each kind of query is shown in bold. Shaded cells indicate that there is no statistically significant difference between the corresponding method and the proposed one according to a post-hoc Holm test at significance level α = 0.05. The adjusted p-values from this test, corresponding to the comparison of each method against the proposed one, are shown in brackets in the same table. According to this statistical test, only the proposed method performs best across all measures and kinds of queries. The previously proposed smoothed estimate, SNN, also gives very similar results, with the exception of recall; with regard to recall, the proposed method is very significantly the best according to the p-values shown. It is also worth mentioning that the SVM method is as good as the best with regard to average precision, and that the probabilistic approach is as good as the best with regard to precision and average precision on semantic queries.
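The average ranks and the Friedman test above can be reproduced with standard tools. The following is a sketch under stated assumptions (SciPy is used for the test statistic; the function name and the orientation of the score matrix are choices of this example, and the Holm post-hoc step is omitted).

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_average_ranks(scores):
    """scores: (n_queries, n_methods) matrix of a performance measure,
    higher is better. Returns per-method average ranks (1 = best) and
    the p-value of the Friedman test across methods."""
    # rank methods within each query; negate so higher scores get rank 1
    ranks = np.apply_along_axis(rankdata, 1, -scores)
    avg_ranks = ranks.mean(axis=0)
    # each column of `scores` is one method's measurements over the queries
    stat, p = friedmanchisquare(*scores.T)
    return avg_ranks, p
```

A small p-value indicates that at least one method differs significantly; a post-hoc test such as Holm's would then adjust the pairwise p-values against the best-ranked method.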

Table 2. Average ranks corresponding to the multiple comparison Friedman tests for different performance measures and query types. Best results are shown in bold. Adjusted p-values corresponding to a Holm post-hoc test comparing each method to the proposed one are shown in brackets; the ones corresponding to rejected null hypotheses at significance level α = 0.05 are shaded. [Table rows: Proposed, SNN, SVM, Probabilistic, NN + BQS, NN, SOM, QPM and Fuzzy; columns: precision, recall and average precision, for queries by example and for semantic queries.]

8. Concluding remarks

A comparative study considering several different proposals to improve the basic nearest neighbor approach for CBIR has been carried out in this paper. In particular, the ideas behind the approach, along with their benefits and disadvantages, have been analyzed, and an improved score using a reliability estimate has been proposed. According to the experimentation carried out and the corresponding performance measures, the proposed approach has several significant advantages with regard to the other nearest neighbor based alternatives, and also outperforms a series of representative relevance feedback approaches.

Even though the proposed method continues a series of improvements on the plain NN method and arrives at very similar results to other improved methods, the way in which the estimate has been derived makes it more robust and easier to integrate with other approaches. For example, the previous SNN method needed the combination of two strategies (a moderating term in the score and a local search for images) to beat the NN method. With the proposal in this work, we are able to improve on this by using only one better and more reliable way of estimating relevance, avoiding the computational burden associated with building a ranking per relevant selection. Related work on hybridizing these distance-based estimates with other ways of estimating relevance is now under development. Even though it is difficult to improve further on the results presented here, we aim at being able to adapt the way relevance is estimated to different kinds of situations.

Acknowledgments

We would like to thank Dr. G. Giacinto for his help in the evaluation of this paper, facilitating the manually performed classification of the 30,000-image repository. We would also like to thank Dr. M. Ortega for providing the thumbnails of these images. This work has been partially funded by FEDER and the Spanish Government through projects TIN2011-29221-C03-02 and TIN2009-14205-C04-03, and Consolider Ingenio 2010 CSD07-00018.

References

[1] R. Datta, D. Joshi, J. Li, J.Z. Wang, Image retrieval: ideas, influences, and trends of the new age, ACM Computing Surveys 40 (2) (2008) 1–60.
[2] M.S. Lew, N. Sebe, C. Djeraba, R. Jain, Content-based multimedia information retrieval: state of the art and challenges, ACM Transactions on Multimedia Computing, Communications and Applications 2 (1) (2006) 1–19.
[3] A. Smeulders, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (12) (2000) 1349–1379.
[4] X. Zhou, T. Huang, Relevance feedback for image retrieval: a comprehensive review, Multimedia Systems 8 (6) (2003) 536–544.
[5] G. Giacinto, F. Roli, Nearest-prototype relevance feedback for content based image retrieval, in: Proceedings of the 17th International Conference on Pattern Recognition (ICPR), vol. 2, IEEE Computer Society, Washington, DC, USA, 2004, pp. 989–992.
[6] G. Giacinto, F. Roli, Instance-based relevance feedback for image retrieval, in: L.K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in Neural Information Processing Systems, vol. 17, MIT Press, Cambridge, MA, 2005, pp. 489–496.
[7] G. Giacinto, A nearest-neighbor approach to relevance feedback in content based image retrieval, in: Proceedings of the 6th ACM International Conference on Image and Video Retrieval (CIVR'07), ACM Press, Amsterdam, The Netherlands, 2007, pp. 456–463.
[8] M. Arevalillo-Herráez, F.J. Ferri, Interactive image retrieval using smoothed nearest neighbor estimates, in: Joint IAPR International Workshops on Structural and Syntactic Pattern Recognition and Statistical Techniques in Pattern Recognition (SSPR/SPR), 2010, pp. 708–717.
[9] B. Thomee, M.S. Lew, Interactive search in image retrieval: a survey, International Journal of Multimedia Information Retrieval 1 (2) (2012) 71–86.
[10] Y. Ishikawa, R. Subramanya, C. Faloutsos, Mindreader: querying databases through multiple examples, in: Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), New York, USA, 1998, pp. 433–438.
[11] Y. Rui, S. Huang, M. Ortega, S. Mehrotra, Relevance feedback: a power tool for interactive content-based image retrieval, IEEE Transactions on Circuits and Systems for Video Technology 8 (5) (1998) 644–655.
[12] G. Ciocca, R. Schettini, Content-based similarity retrieval of trademarks using relevance feedback, Pattern Recognition 34 (8) (2001) 1639–1655.
[13] T. León, P. Zuccarello, G. Ayala, E. de Ves, J. Domingo, Applying logistic regression to relevance feedback in image retrieval systems, Pattern Recognition 40 (10) (2007) 2621–2632.
[14] J. Laaksonen, M. Koskela, E. Oja, PicSOM: self-organizing image retrieval with MPEG-7 content descriptors, IEEE Transactions on Neural Networks 13 (4) (2002) 841–853.
[15] X.S. Zhou, T. Huang, Small sample learning during multimedia retrieval using BiasMap, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2001, pp. 11–17.
[16] X. He, D. Cai, J. Han, Learning a maximum margin subspace for image retrieval, IEEE Transactions on Knowledge and Data Engineering 20 (2) (2008) 189–201.
[17] H. Chang, D.-Y. Yeung, Kernel-based distance metric learning for content-based image retrieval, Image and Vision Computing 25 (5) (2007) 695–703.
[18] S.C.H. Hoi, W. Liu, M.R. Lyu, W.-Y. Ma, Learning distance metrics with contextual constraints for image retrieval, in: Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '06), vol. 2, 2006, pp. 2072–2078.
[19] S. Tong, E. Chang, Support vector machine active learning for image retrieval, in: ACM Multimedia Conference, ACM Press, Ottawa, Canada, 2001, pp. 107–118.
[20] Y. Chen, X.S. Zhou, T.S. Huang, One-class SVM for learning in image retrieval, in: Proceedings of the IEEE International Conference on Image Processing, 2001, pp. 34–37.
[21] I. Gondra, R. Heisterkamp, J. Peng, Improving image retrieval performance by inter-query learning with one-class support vector machines, Neural Computing and Applications 13 (2) (2004) 130–139.
[22] D. Tao, X. Tang, X. Li, X. Wu, Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (7) (2006) 1088–1099.
[23] P. Muneesawang, L. Guan, An interactive approach for CBIR using a network of radial basis functions, IEEE Transactions on Multimedia 6 (5) (2004) 703–716.
[24] M. Arevalillo-Herráez, M. Zacarés, X. Benavent, E. de Ves, A relevance feedback CBIR algorithm based on fuzzy sets, Signal Processing: Image Communication 23 (7) (2008) 490–504.
[25] Z. Su, H. Zhang, S. Li, S. Ma, Relevance feedback in content-based image retrieval: Bayesian framework, feature subspaces, and progressive learning, IEEE Transactions on Image Processing 12 (8) (2003) 924–937.
[26] E. de Ves, J. Domingo, G. Ayala, P. Zuccarello, A novel Bayesian framework for relevance feedback in image content-based retrieval systems, Pattern Recognition 39 (2006) 1622–1632.
[27] G. Giacinto, F. Roli, Bayesian relevance feedback for content-based image retrieval, Pattern Recognition 37 (7) (2004) 1499–1508.
[28] M. Arevalillo-Herráez, F.J. Ferri, J. Domingo, A naive relevance feedback model for content-based image retrieval using multiple similarity measures, Pattern Recognition 43 (3) (2010) 619–629.
[29] T. Amin, M. Zeytinoglu, L. Guan, Application of Laplacian mixture model to image and video retrieval, IEEE Transactions on Multimedia 9 (7) (2007) 1416–1429.
[30] G. Ciocca, R. Schettini, A relevance feedback mechanism for content-based image retrieval, Information Processing and Management 35 (1) (1999) 605–632.
[31] J. Urban, J.M. Jose, Evidence combination for multi-point query learning in content-based image retrieval, in: Proceedings of the IEEE Sixth International Symposium on Multimedia Software Engineering (ISMSE '04), IEEE Computer Society, Washington, DC, USA, 2004, pp. 583–586.
[32] B. Dasarathy (Ed.), Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, California, 1991.
[33] G. Shakhnarovich, T. Darrell, P. Indyk, Nearest-Neighbor Methods in Learning and Vision: Theory and Practice (Neural Information Processing), The MIT Press, 2006.
[34] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd edition, Wiley-Interscience, 2000.
[35] P. Soille, Morphological Image Analysis: Principles and Applications, Springer-Verlag, Berlin, 2003.
[36] G. Smith, I. Burns, Measuring texture classification algorithms, Pattern Recognition Letters 18 (14) (1997) 1495–1501.
[37] R.W. Conners, M.M. Trivedi, C.A. Harlow, Segmentation of a high-resolution urban scene using texture operators, Computer Vision, Graphics, and Image Processing 25 (3) (1984) 273–310.
[38] R. Chellappa, S. Chatterjee, Classification of textures using Gaussian Markov random fields, IEEE Transactions on Acoustics, Speech and Signal Processing 33 (1985) 959–963.
[39] Y. Chen, E. Dougherty, Gray-scale morphological granulometric texture classification, Optical Engineering 33 (8) (1994) 2713–2722.
[40] G. Ayala, J. Domingo, Spatial size distributions: applications to shape and texture analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (12) (2001) 1430–1442.
[41] M.J. Swain, D.H. Ballard, Color indexing, International Journal of Computer Vision 7 (1) (1991) 11–32.
[42] Q. Iqbal, J. Aggarwal, Combining structure, color and texture for image retrieval: a performance evaluation, in: 16th International Conference on Pattern Recognition (ICPR), Quebec City, QC, Canada, 2002, pp. 438–443.
[43] S. García, F. Herrera, An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons, Journal of Machine Learning Research 9 (2008) 2677–2694.