A unified image retrieval framework on local visual and semantic concept-based feature spaces

Md. Mahmudur Rahman a,*, Prabir Bhattacharya b, Bipin C. Desai a

a Dept. of Computer Science & Software Engineering, Concordia University, Canada
b Concordia Institute for Information Systems Engineering, Concordia University, Canada

Article history: Received 5 August 2008; Accepted 4 June 2009; Available online 10 June 2009

Keywords: Content-based image retrieval; Learning methods; Classification; Self-organizing map; Support vector machine; Relevance feedback; Similarity fusion

Abstract

This paper presents a learning-based unified image retrieval framework to represent images in local visual and semantic concept-based feature spaces. In this framework, a visual concept vocabulary (codebook) is automatically constructed by utilizing a self-organizing map (SOM), and statistical models are built for local semantic concepts using a probabilistic multi-class support vector machine (SVM). Based on these constructions, the images are represented in correlation and spatial relationship-enhanced concept feature spaces by exploiting the topology preserving local neighborhood structure of the codebook, local concept correlation statistics, and spatial relationships in individual encoded images. Finally, the features are unified by a dynamically weighted linear combination of similarity matching scheme based on the relevance feedback information. The feature weights are calculated by considering both the precision and the rank order information of the top retrieved relevant images of each representation, which adapts itself to individual searches to produce effective results. The experimental results on a photographic database of natural scenes and a bio-medical database of different imaging modalities and body parts demonstrate the effectiveness of the proposed framework.

© 2009 Elsevier Inc. All rights reserved.

1. Introduction

In recent years, there has been an exponential growth of image data due to the acceptance and wider use of digital imaging. The low storage cost, availability of digital devices, and high bandwidth communication facilities have accelerated this growth. Generating huge amounts of images is pointless without organization and search capabilities. Thus, there is a compelling need for developing innovative tools for managing, retrieving, and visualizing images from large collections. Many applications, such as digital libraries, image search engines, medical decision support systems, and teaching applications, require effective and efficient image retrieval techniques [1]. The first generation of image retrieval systems, developed in the late 1970s, was mainly linked to text retrieval. In those systems, manually assigned keywords were used as indexing terms to describe the content of the images and the querying items. However, due to the time consuming aspect and the subjective nature of assigned keywords, content-based image retrieval (CBIR) systems evolved in the early 1990s [1,2]. In CBIR, access to information is performed at a perceptual level based on automatically
extracted low-level features (e.g., color, texture, shape, etc.) without interpreting any direct semantic meaning. In general, the similarity comparison between the query and target images is performed either globally on the entire image or locally on automatically derived segmented regions [3–5]. However, global features fail to capture enough semantic information due to their limited descriptive power. Although there is a strong correlation between segmented regions and real world objects, accurate automatic segmentation for object detection in general domain images is still an unsolved problem [5]. In summary, these CBIR systems can be distinguished as either global or region-specific in nature based on their feature representation and the employed similarity matching functions. Even after almost two decades of intensive research, CBIR systems still lag behind the best text-based search engines of today, such as Google, Yahoo, and AltaVista [6]. The main problem here is the extent of the mismatch between the user's requirements as high-level concepts and the low-level representation of images; this is the well known "semantic gap" problem [1]. To narrow the semantic gap, we have witnessed new trends over the last few years in the form of semantic image classification, annotation, and interactive retrieval by utilizing various off-line and on-line learning-based methods [5–8,14,16–18]. The automatic classification of images greatly enhances the performance of CBIR systems by filtering out irrelevant classes. Despite the fact that it is currently impossible to reliably recognize objects in images of the
general-purpose domain, a few research prototypes have emerged which successfully classify textured and non-textured images, natural photographs and artificial graphics [5], indoor and outdoor images [7], and further classify images as city (man-made) or landscape (natural objects) at a global level [8]. On the other hand, some classification approaches try to automatically annotate local regions in images without providing a global level description [9,10]. For example, Town and Sinclair [9] use an annotation approach where non-overlapping image regions are classified into visual categories of outdoor scenes by neural networks. Motivated from a machine translation perspective, the approach in [10] tries to translate local image regions to corresponding words, where the joint distribution of text descriptions and image contents is learned from images as well as from their categories. The ongoing success of keyword-based retrieval systems also motivated researchers to explore analogous techniques in the CBIR domain [11,13,14]. However, instead of simply using manually annotated keyword-based searches, these recent approaches automatically extract visual terms (such as different predominant color or texture patches) or different semantic patches (such as water, sand, sky, cloud, etc. in natural photographic images) from the images, in a manner close to the keyword-based extraction from text documents, by relying on unsupervised clustering or supervised learning-based classification techniques [11–15]. For example, a framework to automatically generate visual terms (called "keyblocks") is proposed in [11] by applying vector quantization or clustering techniques. It represents images close to the keyword-based representation with a correlation-enhanced feature model based on ideas of n-grams and bi-grams borrowed from the text retrieval domain [23]. In [12], a compact and sparse representation of images is proposed based on the utilization of a region codebook generated by a clustering technique on the feature space of segmented image regions. For the reliable identification of image elements, the work in [13] manually identifies visual patches as keywords from sample images. Then, every image in the database is compared against those identified patches to detect the specific visual keywords of the image. A supervised classification-based approach is presented in [14], in which an ensemble of binary classifiers is trained to estimate the global category membership of images and later applied to individual images to generate multiple soft labels for keyword-based search. A semantic modeling approach is investigated in [15] for a small collection of images, based on binary SVM and k-NN classification of semantic patches of local image regions; that information is utilized to represent images in the form of a concept occurrence vector for later retrieval. However, the main limitation of the majority of the concept (keyword)-based approaches is that the quality of matching or correspondence (e.g., entire image to global concepts and image region to local concepts) depends on the details of the domain knowledge and is not always exact. There are usually several concepts with almost as good a match as the one detected for a particular image region or for an entire image. Considering only the best matching concepts in the image annotation or encoding process does not provide sufficient information about their correlations. Hence, the correlations between the concepts need to be exploited for an effective image annotation and representation.

In addition to improving the semantic representation power of images with various off-line learning schemes, another effective approach is to incorporate the user's semantic perceptions interactively in the retrieval loop. This on-line interactive and supervised learning approach, commonly known as relevance feedback [16], prompts the user for feedback on the retrieval results and then uses that information for subsequent retrievals with the goal of increasing the retrieval performance. Though the idea of relevance feedback was borrowed from the text retrieval domain, the ease
with which the relevance of an image can be evaluated has accelerated its development for image retrieval since the early work in [17]. In this vein, a number of techniques have been proposed, such as query point movement, feature re-weighting, and active learning [16,18,19]. For example, in MARS (Multimedia Analysis and Retrieval System) [18], feature weights are determined in inverse proportion to their variances across the set of retrieved images marked as relevant by the user. Similarly, the ImageRover system [27] utilizes positive feedback to select the appropriate distance measures and to adjust their weights by calculating the relevance weights as the inverse of the mean scores between the pair-wise comparisons of all images in the relevant set. The majority of the relevance feedback approaches in CBIR estimate the ideal query parameters from the low-level image features [16,18,19,27]. However, due to the limited descriptive power of the low-level features for representing the user's high-level perceptions, relevance feedback has proved to be a tedious and time consuming process for the users in many cases. Often, a lot of images need to be inspected and different levels of relevancy (such as "fully relevant", "not relevant", and "somewhat relevant") need to be provided to the system for an effective estimation of the ideal query parameters [18,27].

Due to the limitations of both the low-level and concept-level feature representations, and motivated by the learning paradigm, this paper presents a unified image retrieval approach that exploits both supervised and unsupervised learning methodologies in a common framework. The major contributions of the proposed framework are fourfold. First, we investigate how a self-organizing map (SOM)-based [20] clustering and a probabilistic support vector machine (SVM)-based [22] classification technique can be effectively utilized to represent images in local visual and semantic concept-based feature spaces. Second, we present a correlation-enhanced visual concept-based feature representation scheme by exploiting the local neighborhood structure of the SOM map, and a spatially enhanced feature representation by considering the local concept structure. Third, we present a local semantic concept-based feature and similarity matching scheme by exploiting the concept correlations based on their confidence scores in individual images. Fourth, we propose a relevance feedback-based similarity fusion technique under the assumption that visual and semantic concept-based features might be complementary in nature. In this approach, feature level weights are updated at each iteration by considering both the precision and the rank order information of relevant images in the individual result lists based on the relevance feedback information. As a result, the final rank-based retrieval is obtained through an adaptive and linear weighted combination of overall similarity, fusing both visual and semantic concept-level similarities.

The architecture of the proposed framework is shown in Fig. 1. As can be seen from this figure, the SOM and SVM learning are conducted on the training images to generate a codebook and an SVM model file for the construction of the visual and semantic concepts. Based on these constructions, several correlation-enhanced features are extracted from the database images and stored in a logical database of the concept index.
The training of the SOM and the SVM and the later feature extraction process for the database images are conducted off-line, whereas the feature extraction of the query image and the relevance feedback process for dynamic weight updating are conducted on-line, as shown in the figure.

Fig. 1. Overview of the framework architecture.

The rest of the paper is organized as follows. Section 2 presents the visual concept-based feature extraction and representation schemes by utilizing the self-organizing map (SOM). Section 3 presents the feature representations and distance matching based on the semantic concepts at the local image level by utilizing the SVM classification. The relevance feedback-based similarity fusion scheme is described in Section 4. The experiments and the analysis
of the results are presented in Sections 5 and 6, respectively, and finally Section 7 provides the conclusion.

2. Image representation based on visual concepts

A major component of the framework is to represent images with automatically generated local visual concepts by utilizing the self-organizing map (SOM). Here, the visual concepts depict the perceptually distinguishable color or texture patches in local image regions, which might not have any clear semantic interpretation. For example, a predominant yellow color patch can be present either in an image of the sun or in a sunflower image. However, when images are represented with the frequency of occurrence of such local visual concepts, this has proved to be effective in retrieval compared to low-level color and texture feature vectors in many cases [11,13]. There are three main steps to be considered before representing images in such a feature space: the generation of a set of visual concepts from local image regions; the construction of a codebook of prototype concepts analogous to a dictionary of keywords; and the encoding of the images with the concept indices of the codebook [11]. Based on the encoding scheme, the images can be represented as a feature vector where each dimension of the vector corresponds to the frequency of a concept in the codebook, which might be calculated by considering various local and global statistics of the collection [23]. To generate a set of visual concepts, we consider a fixed decomposition approach based on the work reported in [24], where sample images from a training set are equally partitioned into a number of non-overlapping smaller blocks. To represent each block as a feature vector, the mean and standard deviation of each channel in the HSV color space are extracted as the color feature, and second order moments (such as energy, maximum probability, entropy, contrast, and inverse difference moment) are extracted from the gray level co-occurrence matrix (GLCM) [25] as the texture feature.
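The block-level feature just described can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the 8-level gray quantization, the single horizontal co-occurrence offset, the 16 × 16 default grid, and the assumption that the HSV and grayscale blocks arrive as NumPy arrays are all choices made here for brevity.

```python
import numpy as np

def glcm_moments(gray_block, levels=8):
    """Second-order moments (energy, max. probability, entropy, contrast,
    inverse difference moment) from a gray-level co-occurrence matrix."""
    q = np.floor(gray_block.astype(float) / 256.0 * levels).astype(int)
    q = np.clip(q, 0, levels - 1)
    glcm = np.zeros((levels, levels))
    # co-occurrence counts for a horizontal (right-neighbour) offset
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        glcm[a, b] += 1
    p = glcm / max(glcm.sum(), 1.0)
    i, j = np.indices(p.shape)
    energy = (p ** 2).sum()
    max_prob = p.max()
    entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()
    contrast = ((i - j) ** 2 * p).sum()
    idm = (p / (1.0 + (i - j) ** 2)).sum()
    return [energy, max_prob, entropy, contrast, idm]

def block_feature(hsv_block, gray_block):
    """Colour (mean/std of H, S, V) plus texture moments for one image block."""
    color = np.concatenate([hsv_block.reshape(-1, 3).mean(axis=0),
                            hsv_block.reshape(-1, 3).std(axis=0)])
    return np.concatenate([color, glcm_moments(gray_block)])

def decompose(image_hsv, image_gray, grid=16):
    """Fixed grid partition of an image into grid x grid block feature vectors."""
    h, w = image_gray.shape
    bh, bw = h // grid, w // grid
    feats = []
    for r in range(grid):
        for c in range(grid):
            feats.append(block_feature(
                image_hsv[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw],
                image_gray[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]))
    return np.array(feats)   # shape: (grid * grid, 11)
```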

2.1. Codebook generation by self-organizing map (SOM)

To generate a codebook of prototype vectors (e.g., concept vectors) from the above features, we utilize SOM-based clustering [20]. The SOM is basically an unsupervised and competitive learning algorithm, which finds an optimal set of prototypes based on a grid of artificial neurons whose weights are adapted to match the input vectors in a training set [20]. It has been successfully utilized for indexing and browsing by projecting the low-level input features onto the two-dimensional grid of the SOM map [21,28,29]. The basic structure of a SOM consists of two layers: an input layer and a competitive output layer, as shown in Fig. 2.

Fig. 2. Structure of the SOM.

The input layer consists of a set of input node vectors. The output layer consists of a set of N neurons C = {c_1, ..., c_j, ..., c_N} organized into either a one- or two-dimensional lattice structure, where each neuron c_j is associated with a d-dimensional weight vector c_j = [c_{j1} c_{j2} ... c_{jd}]^T.
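A minimal NumPy sketch of the competitive SOM training loop that produces such a codebook is shown below. The exponential decay schedules, random initialization, and Gaussian neighborhood function are common default choices assumed here; the paper does not specify them. `block_features` stands for the stacked block vectors of the previous sketch.

```python
import numpy as np

def train_som(data, rows=20, cols=20, iters=20000, lr0=0.5, sigma0=None, seed=0):
    """Minimal SOM: returns a (rows, cols, d) codebook of prototype (concept) vectors."""
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    sigma0 = sigma0 or max(rows, cols) / 2.0
    weights = rng.random((rows * cols, d))
    # grid coordinates of every output neuron, used by the neighbourhood function
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        lr = lr0 * np.exp(-t / iters)              # decaying learning rate
        sigma = sigma0 * np.exp(-t / iters)        # shrinking neighbourhood radius
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))   # best matching unit
        dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-dist2 / (2.0 * sigma ** 2))    # Gaussian topological neighbourhood
        weights += lr * h[:, None] * (x - weights)
    return weights.reshape(rows, cols, d)

# codebook = train_som(block_features, rows=20, cols=20)   # e.g. a 400-unit codebook
```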

During the training phase, the set of input vectors is presented to the map multiple times and the weight vectors stored in the map units are modified to match the distribution and topological ordering of the feature vector space. The learning process continues for a maximum number of iterations or until the weight vectors are stabilized. After the learning process, the SOM generated output map can be effectively utilized as a codebook of visual concepts for image encoding and representation, where each output neuron c_j acts as a visual concept prototype with the associated weight vector c_j as the concept vector of the codebook.

2.2. Image encoding and feature representation

To encode an image with the visual concept indices, it is also decomposed into an even grid-based partition, where the color and texture moment-based features are extracted from each block or region. Let an image be represented by a region set R = {r_1, ..., r_i, ..., r_n}, where x_{r_i} ∈ R^d is the d-dimensional feature vector of a region r_i. Now, for the feature vector x_{r_i}, the nearest concept prototype in the codebook (e.g., output node in the map) c_k, 1 ≤ k ≤ N, is identified based on a weighted Euclidean distance measure. After this encoding process, an image I_j is represented as a vector F_j^{V-concept} = [f_{1j}, ..., f_{ij}, ..., f_{Nj}]^T, where each dimension corresponds to a concept index in the codebook. The element f_{ij} represents the frequency of occurrence of c_i appearing in I_j. This feature vector is closely related to the histogram model (HM)-based feature in [11], which mainly captures the coarse distribution of the visual concepts, analogous to a global color histogram. However, this representation is very sensitive to quantization errors. Two concepts will be considered totally different if they fall into two different bins, even though they might be very similar or correlated to each other. Another drawback is that the images are represented without considering the relative positions or ordering of the concepts; hence it lacks spatial relationships. Due to these limitations, we propose two feature representation approaches in the following sections that exploit the correlations and spatial relationships between the visual concepts.
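A sketch of the encoding step and the resulting F^{V-concept} histogram follows. Plain (unweighted) Euclidean distance is used here for brevity, whereas the paper specifies a weighted Euclidean measure; the codebook shape follows the earlier SOM sketch.

```python
import numpy as np

def encode_image(region_feats, codebook):
    """Map each region vector to its nearest codebook prototype (concept index)."""
    rows, cols, d = codebook.shape
    protos = codebook.reshape(-1, d)
    # plain Euclidean distance here; the paper uses a weighted variant
    dists = ((region_feats[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)              # one concept index per region

def concept_histogram(indices, n_concepts):
    """F^{V-concept}: frequency of each visual concept in the encoded image."""
    return np.bincount(indices, minlength=n_concepts).astype(float)

# indices = encode_image(decompose(img_hsv, img_gray), codebook)
# f_v_concept = concept_histogram(indices, n_concepts=400)
```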

2.3. Feature representation by membership values in local neighborhood

Because of its topology preserving structure, the distance between two neurons (e.g., concept prototypes) on the SOM map (codebook) indicates the degree of similarity of the input vectors represented by the neurons. We exploit this topology preserving property to generate a correlation-enhanced feature representation. The basic idea is based on the observation that there are usually several neurons on the map with almost as good a match as the best matching neuron for an input vector. When we consider only the best matching output neuron to encode images, it does not provide enough information about the correlations between the neurons, and thereby between the concepts. This is also investigated for data visualization in [29], where a ranked centroid projection (RCP) algorithm is proposed to project the input vectors onto the output map based on their membership degrees to the prototype vectors instead of using a crisp value. It adopts a ranking scheme based on the distance computation of the input vectors to all the output neurons, which might require substantial computational effort [29]. Due to this limitation, we propose to compute the ranking and membership values based on a local neighborhood structure on the map and to utilize that information in the feature representation. In this structure, similar visual concepts are organized closely in a local neighborhood. For example, the right side of Fig. 3 shows the local neighborhood structure in a two-dimensional SOM map, where each output node is visualized as a square block on the grid. The square block with the red line in the middle denotes a particular output node c_m on the grid; the square blocks within the dotted white rectangle are positioned in the first level neighborhood (we call it 1-LN) and the blocks between the blue and white rectangles are positioned at the second level (e.g., 2-LN) local neighborhood of c_m, based on their coordinate positions. As the level increases (it can go up to a maximum neighborhood level M-LN), the number of neighboring nodes of a best matching node increases as well. For example, there are 8 neighbors in 1-LN and 16 neighbors in 2-LN for the output node c_m as shown in Fig. 3. The membership degree of an input vector to a specific neighbor node is then defined based on the level of neighborhood and the rank of closeness between the input vector and the weight vector associated with that output node. To determine the rank of closeness, K_c, c ∈ {1, ..., M}, best neurons are selected, where K_c is the total number of output neurons which includes the best matching neuron and all other neighbor neurons in level c-LN and any levels before it (e.g., 9 (1 + 8) neurons for 1-LN and 25 (1 + 8 + 16) neurons for 2-LN). To determine the membership values, the K_c neurons are sorted in descending order of similarity values to the input node (e.g., the best matching neuron will always be at the first position). Now the membership degrees of an input vector to the ordered neurons are determined as the set of values
{K_c/N, (K_c − 1)/N, ..., 1/N}, where N = \sum_{i=0}^{K_c - 1} (K_c - i) is applied to normalize the degree of membership [29]. Therefore, the first neuron (e.g., the best matching neuron) has a normalized membership value of \mu_1 = K_c/N and the last one in the ordered set has a value of \mu_{K_c} = 1/N. Finally, these membership values of the output neurons are distributed to their corresponding indices (bins) in the feature histogram to take the correlation factor into account. To summarize, the steps involved in the feature representation process are as follows (a code sketch follows at the end of this subsection):

Step 1: Initialize the feature histogram of an image I_j as ⟨w_{1j}, ..., w_{ij}, ..., w_{Nj}⟩ where each bin (index) is empty, e.g., w_{ij} = 0.
Step 2: Decompose the image I_j into a region set R = {r_1, ..., r_i, ..., r_n}, where x_{r_i} ∈ R^d is the feature vector of a region r_i.
Step 3: For each x_{r_i}, 1 ≤ i ≤ n, find the corresponding best matching output node c_m, 1 ≤ m ≤ N, in the map (codebook).
Step 4: Consider the K_c output neurons for c_m in the local neighborhood up to level c-LN and sort them based on the similarity values of the corresponding weight vectors to the input vector x_{r_i}.
Step 5: Calculate the membership values of the sorted output neurons as {\mu_1 = K_c/N, \mu_2 = (K_c − 1)/N, ..., \mu_{K_c} = 1/N}.
Step 6: For each c_l, 1 ≤ l ≤ K_c, with a membership value \mu_l, update the corresponding element w_{lj} at the index position or bin number l as w_{lj} += \mu_l.
Step 7: Continue steps 3–6 for each region input vector of the region set R.
Step 8: Finally, obtain the correlation-enhanced cumulative histogram or feature vector as F_j^{V-correlation} = [w_{1j}, ..., w_{ij}, ..., w_{Nj}]^T.

Fig. 3 shows an example of the above feature generation process. For a particular region r_i of the partitioned image on the left, the process first finds the best matching neuron c_m in the codebook. Next, it considers the K_2 output neurons by taking the local neighborhood level up to 2-LN and calculates the membership values as discussed in the above algorithm. Based on the membership values of the K_2 neurons, the corresponding index values in the feature histogram are incremented. Due to space limitations, Fig. 3 shows only a few connections, where for the best matching neuron c_m and two other neurons c_k and c_j in 1-LN and 2-LN, respectively, the corresponding indices are incremented with the membership values \mu_m, \mu_k, and \mu_j, respectively. Instead of incrementing only the index (bin) of the best matching unit by a value of one, this approach distributes the membership values of the correlated neurons (concepts) to the appropriate indices of the histogram. Hence, it can reduce the effect of quantization errors by adding related visual concepts, weighted by their membership values, to the feature representation.

Fig. 3. Procedure for the correlation-enhanced feature representation.
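A sketch of Steps 1–8 follows. The c-LN neighborhood is taken here as all neurons within Chebyshev distance c of the best matching unit on the grid, which reproduces the 1 + 8 and 1 + 8 + 16 counts mentioned above; near the map border the neighborhood is simply truncated, a detail the paper does not discuss, and each neighbor's membership is accumulated into that neuron's own histogram bin, as in the Fig. 3 example.

```python
import numpy as np

def correlation_enhanced_histogram(region_feats, codebook, level=2):
    """F^{V-correlation}: distribute membership values over the best matching
    neuron and its c-LN grid neighbours (Steps 1-8 above)."""
    rows, cols, d = codebook.shape
    protos = codebook.reshape(-1, d)
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    hist = np.zeros(rows * cols)
    for x in region_feats:
        dists = ((protos - x) ** 2).sum(axis=1)
        bmu = int(dists.argmin())
        # neighbours within 'level' rings around the BMU (Chebyshev distance)
        cheb = np.abs(coords - coords[bmu]).max(axis=1)
        members = np.where(cheb <= level)[0]            # K_c neurons incl. the BMU
        order = members[np.argsort(dists[members])]     # sorted by closeness
        K = len(order)
        N = K * (K + 1) // 2                            # sum_{i=0}^{K-1} (K - i)
        mu = (K - np.arange(K)) / N                     # K/N, (K-1)/N, ..., 1/N
        hist[order] += mu                               # accumulate per concept bin
    return hist
```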

2.4. Feature representation by exploiting spatial relationships

The visual concept vector F^{V-concept} also cannot distinguish between two images in which a given concept is present in identical amounts but where the structure of the groups of regions having that concept is different. Due to this limitation, we present a feature representation scheme that captures both the visual concept frequency, similar to a color histogram, and information about the local spatial relationships of the concepts. This representation is closely related to the MPEG-7 color structure descriptor (CSD) [26], and we therefore call it the visual concept structure descriptor (VCSD). Specifically, it is a histogram where each bin counts the number of times a visual concept is present in a windowed neighborhood determined by a small square structuring element as this window progresses over the rows and columns of the two-dimensional encoded image. It enables us to distinguish, for example, between an image in which the concepts are distributed uniformly and an image in which the same concepts occur in the same proportions but are located in distinct blocks. The feature extraction method embeds concept structure information into the feature vector by taking into account all the concepts in a structuring element (4 × 4 blocks in this case) that slides over the image, instead of considering each block separately. For each unique concept index (based on the codebook) of an encoded image that falls inside the structuring element, the corresponding index in the feature histogram is incremented only once. Unlike the original CSD [26], we need not measure the spatial extent of the structuring element, since the images are equally partitioned into blocks and, after encoding, all of them contain the same total number of concepts. The accumulation process is illustrated in Fig. 4. For a particular position of the structuring element on the encoded image, three different indices m, j, and k occur with different frequencies. As a result, their corresponding indices (bins) of the histogram are each incremented only once by one, as shown in the middle of Fig. 4. The VCSD of an encoded image I_j is represented as a vector F_j^{VCSD} = [f̂_{1j}, ..., f̂_{ij}, ..., f̂_{Nj}]^T, where each dimension corresponds to a visual concept index. The element f̂_{ij} represents the number of structuring elements in the image containing one or more blocks with the concept c_i, normalized by the number of locations of the structuring element so that it lies in the range [0, 1]. The origin of the structuring element is defined by its top-left sample, and the locations of the structuring element over which the elements of the vector are accumulated are defined by the position of the block (e.g., the smallest unit in the encoded image) that contains the visual concept index. Hence, this feature representation expresses the local concept structure in an image by visiting all the locations in the encoded image with the structuring element.

Fig. 4. Procedure for the generation of VCSD.
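A sketch of the VCSD accumulation with a 4 × 4 structuring element sliding over the grid of concept indices produced by the earlier encoding step (reshaped to two dimensions). Visiting only interior window positions, as done here, is one reasonable reading of the description above, not a detail taken from the paper.

```python
import numpy as np

def vcsd(encoded, n_concepts, win=4):
    """Visual concept structure descriptor: for each position of a win x win
    structuring element, count every distinct concept index once."""
    grid_r, grid_c = encoded.shape        # e.g. a 16 x 16 grid of concept indices
    hist = np.zeros(n_concepts)
    positions = 0
    for r in range(grid_r - win + 1):
        for c in range(grid_c - win + 1):
            window = encoded[r:r + win, c:c + win]
            hist[np.unique(window)] += 1.0      # each distinct concept counted once
            positions += 1
    return hist / positions                      # normalised to [0, 1]

# f_vcsd = vcsd(indices.reshape(16, 16), n_concepts=400)
```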

3. Image retrieval based on semantic concepts

In many image domains, various semantic concepts at the local region level are readily available. For example, in a collection of natural scenery images, we can easily identify specific local patches (such as water, sand, grass, sky, snow, etc.) that are semantically distinguishable from each other [15]. We can incorporate supervised learning techniques in the form of image classification to exploit these local concepts or domain knowledge for image retrieval. In this context, the classifier creates statistical models from the training data, where an instance (e.g., a local concept) in the training set is represented by a feature vector and carries a category specific label [30].

3.1. Multi-class SVM

We represent images in a local semantic concept-based space by utilizing a probabilistic multi-class SVM classifier [31]. In its basic formulation, the SVM performs classification between two classes by constructing a decision surface between samples of the two classes, maximizing the margin between them. The SVM classification function [22] is given by

f(x) = \mathrm{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right)    (1)

where x ∈ R^d is an input vector, x_i is a training sample vector with label y_i ∈ {+1, −1}, b is a bias, and K is a kernel function which maps the vectors into a higher dimensional space through the non-linear mapping φ : R^d → R^l, where l > d or l could even be infinite.
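For illustration, a classifier of this form can be trained with scikit-learn's SVC, which wraps LIBSVM (the package the authors report using in Section 5); treating it as a drop-in stand-in is an assumption of this sketch, and the C and gamma values shown are simply the RBF settings reported later in Table 1 for the photographic collection. Setting probability=True anticipates the per-class posterior estimates used in the next subsections.

```python
from sklearn.svm import SVC

# X: block feature vectors, y: manually assigned local concept labels
def train_concept_svm(X, y, C=200.0, gamma=0.05):
    """One-against-one multi-class SVM with an RBF kernel; probability=True
    enables the posterior estimates used as concept confidence scores."""
    clf = SVC(kernel="rbf", C=C, gamma=gamma,
              decision_function_shape="ovo", probability=True)
    clf.fit(X, y)
    return clf

# p = train_concept_svm(X_train, y_train).predict_proba(x_region.reshape(1, -1))
```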

The SVM was originally designed for binary classification problems. A number of methods have been proposed for extending the SVM to multi-class problems, essentially by solving many two-class problems and combining their predictions in various ways to separate the mutually exclusive classes [33]. We utilize a multi-class classification method that combines all pair-wise comparisons of binary SVM classifiers, known as one-against-one or pair-wise coupling (PWC) [31]. During the testing of a feature vector x, each classifier votes for one class, and the winning class is the one with the largest number of accumulated votes.

3.2. Local semantic concept-based feature representation

The main step of this approach is the training of the multi-class SVMs for model generation of local semantic concepts. For this, we need to construct a training set of local semantic patches from individual image regions. To generate a set of local semantic concepts, the ideal approach is to find the individual objects in an image which convey the image semantics. However, associating labels with segmented image regions remains a challenging problem, as accurate segmentation is the major bottleneck and could produce regions that do not correspond to any particular concept. Hence, to compute the local regions from a set of training images, a similar grid-based approach is used to divide the entire image space into an even grid of 8 × 8 local regions; this basically generates 64 non-overlapping sub-images. In order to perform the classifier training based on the local concept categories, a subset of these local regions is annotated manually with a particular concept label in a mutually exclusive way. The selected local image regions are represented by a combination of color and texture moment-based features as described in Section 2. For the SVM training, the initial input to the classifier system is the feature vector set along with the manually assigned concept labels of the vectors. To represent the database images with semantic concept-based vectors, each image goes through the same grid-based partition. Let an image be represented by a region set R = {r_1, ..., r_i, ..., r_n}, where n = 64 (8 × 8) is the total number of regions (blocks). Now, for the combined color and texture feature vector x_{r_i} of each region r_i, the class or concept probabilities are determined by the prediction of the multi-class SVM when applying Eq. (2) as

p_k(r_i) = P(y = k \mid x_{r_i}), \quad k = 1, \ldots, L    (2)

for an L number of local semantic concepts. A region r_i belongs to the concept class m, 1 ≤ m ≤ L, that is determined by

m = \arg\max_k \, [p_k(r_i)]    (3)

that is, the label of the concept class with the maximum probability score. Hence, the region r_i is annotated with the label m, and the entire image is thus represented as a one-dimensional index linked to the concept or localized semantic labels assigned to each region. Based on this information, an image I_j is represented as a vector in the concept space as

F_j^{S-concept} = [w_{1j}, \ldots, w_{ij}, \ldots, w_{Lj}]^T    (4)

where each element w_{ij} corresponds to the weight of a concept label i, 1 ≤ i ≤ L, which is expressed as the product of the local and global weights as w_{ij} = L_{ij} · G_i, following the tf-idf weighting of the vector space model in the text retrieval domain [23]. The local weight is denoted as L_{ij} = log(f_{ij}) + 1, where f_{ij} is the frequency of occurrence of label i in I_j. The global weight is denoted as G_i = log(M/M_i) + 1, where M_i is the number of images in which concept label i is found and M is the total number of images in the entire collection. A global weight indicates the overall importance of the visual keyword across the entire image collection, whereas a local weight is applied to each element indicating the relative importance of the keyword within its vector [23]. This representation is closely related to the "concept occurrence vector (COV)" in [15], where only the frequency of occurrence of the concept labels is considered. However, as mentioned earlier, one of the drawbacks of the above representation is that it does not consider the correlations and spatial relationships between the concepts in the feature space. The probability or confidence score of each semantic concept also forms an L-dimensional vector for each r_i as x_{r_i}^{l-concept} = [p_1(r_i), p_2(r_i), \ldots, p_L(r_i)]^T. Based on this information, we propose to represent the images in a correlation-enhanced way by estimating the mean vector and the covariance matrix of the region vectors in R. It is assumed that the feature distribution in the region set R follows the multivariate Gaussian distribution. Under this assumption, images are characterized with the first and second order statistical parameters in the form of the mean \mu and the covariance matrix \Sigma.
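A sketch of the two semantic representations: the tf-idf style vector of Eq. (4) and the (mu, Sigma) pair of F^{S-correlation}. The label and probability inputs are assumed to come from the SVM predictions above; `doc_freq` holds the per-concept document counts M_i computed over the whole collection.

```python
import numpy as np

def s_concept_vector(labels, L, doc_freq, M):
    """F^{S-concept} (Eq. (4)): tf-idf style weights over the L concept labels
    assigned to the regions of one image."""
    f = np.bincount(labels, minlength=L).astype(float)   # label frequencies f_ij
    local = np.zeros(L)
    nz = f > 0
    local[nz] = np.log(f[nz]) + 1.0                      # local weight log(f_ij) + 1
    global_w = np.log(M / np.maximum(doc_freq, 1.0)) + 1.0   # global weight log(M/M_i) + 1
    return local * global_w

def s_correlation(prob_matrix):
    """F^{S-correlation}: mean vector and covariance matrix of the per-region
    concept probability vectors (rows of prob_matrix, shape n x L)."""
    mu = prob_matrix.mean(axis=0)
    sigma = np.cov(prob_matrix, rowvar=False)    # 1/(n-1) estimate, as in the text
    return mu, sigma
```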

Hence, the mean vector is calculated by taking the averages in each dimension of the region vectors in the region set R as \mu = \frac{1}{n} \sum_{i=1}^{n} x_{r_i}^{l-concept} = \langle \mu_1, \mu_2, \ldots, \mu_L \rangle, and the covariance matrix is estimated as \Sigma = \frac{1}{n-1} \sum_{i=1}^{n} (x_{r_i}^{l-concept} - \mu)(x_{r_i}^{l-concept} - \mu)^T. Together with \mu and \Sigma, we call this feature representation F^{S-correlation}. By averaging out the probability scores in the mean vector and considering the cross-correlations in the off-diagonal elements of the (L × L) matrix \Sigma, this representation provides a more accurate measurement of the local semantic concept occurrences in the entire image.

3.3. Statistical distance matching

Now, to compare a query image I_q and a database image I_j based on this representation, a statistical distance measure, namely the Bhattacharya distance [30], is applied as



D^{S-correlation}(I_q, I_j) = \frac{1}{8} (\mu_q - \mu_j)^T \left[ \frac{\Sigma_q + \Sigma_j}{2} \right]^{-1} (\mu_q - \mu_j) + \frac{1}{2} \ln \frac{\left| \frac{\Sigma_q + \Sigma_j}{2} \right|}{\sqrt{|\Sigma_q| \, |\Sigma_j|}}    (5)

where \mu_q and \mu_j are the mean vectors in the local concept space and \Sigma_q and \Sigma_j are the covariance matrices of I_q and I_j, respectively. Eq. (5) is composed of two terms, the first being the distance between the feature vectors, while the second gives the class separability due to the difference between the covariance matrices. The above metric performs better (as shown in the experimental section) than the cosine distance in the tf-idf weighted feature space F^{S-concept}, as it captures the variations or correlations of the local concepts as a covariance matrix in the distance function.
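Eq. (5) in code. The small diagonal regularization `eps` is an addition made here so that near-singular covariance matrices (which can arise when few distinct concepts appear in an image) do not break the inverse and the determinants; it is not part of the paper's formulation.

```python
import numpy as np

def bhattacharyya(mu_q, sigma_q, mu_j, sigma_j, eps=1e-6):
    """Eq. (5): Bhattacharyya distance between the Gaussian concept models of a
    query and a database image."""
    L = len(mu_q)
    avg = (sigma_q + sigma_j) / 2.0 + eps * np.eye(L)
    diff = mu_q - mu_j
    term1 = diff @ np.linalg.solve(avg, diff) / 8.0
    _, logdet_avg = np.linalg.slogdet(avg)
    _, logdet_q = np.linalg.slogdet(sigma_q + eps * np.eye(L))
    _, logdet_j = np.linalg.slogdet(sigma_j + eps * np.eye(L))
    term2 = 0.5 * (logdet_avg - 0.5 * (logdet_q + logdet_j))
    return term1 + term2
```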

On the other hand, to consider spatial relationships between the local semantic concepts in the feature representation, a feature extraction process similar to the one described in Section 2.4 for the VCSD is performed. This representation scheme is called here the semantic concept structure descriptor (SCSD).

4. Similarity fusion based on relevance feedback

It is difficult to find a unique feature representation or distance function that compares images accurately for all types of queries. In other words, each feature representation along with its distance measure might be complementary in nature and will have its own limitations. In information retrieval, data fusion is a technique for combining the outputs of more than one representation or retrieval strategy that has proved to be effective or better than the individual representation strategies in many cases [34]. One of the most commonly used approaches in data fusion is the linear combination of similarity scores. In this model, the similarity between a query image I_q and a database image I_j is defined as
S(I_q, I_j) = \sum_f \alpha_f S_f(I_q, I_j)    (6)

where the \alpha_f are weights for the different similarity matching functions, subject to 0 ≤ \alpha_f ≤ 1 and \sum_f \alpha_f = 1. The effectiveness of the linear combination depends mainly on the choice of the weights \alpha_f. Motivated by the data fusion and relevance feedback paradigms, we propose a dynamic weight updating method in a linear combination scheme by considering both the precision and the rank order information of the top K retrieved images. Before any fusion, the distance scores of each representation are normalized and converted to similarity scores in the range [0, 1] as S(I_q, I_j) = 1 - \frac{D(I_q, I_j) - \min(D(I_q, I_j))}{\max(D(I_q, I_j)) - \min(D(I_q, I_j))}, where \min(\cdot) and \max(\cdot) are the minimum and maximum distance scores. Generally, a similarity score is the converse of a distance score, so when the similarity score is one (i.e., exactly similar), the distance score is zero and vice versa. In this approach, an equal emphasis is initially given through their weights to all the features along with their similarity matching functions. However, the weights and the query vector are changed or updated dynamically during the subsequent iterations by incorporating the feedback information from the previous round. After the initial retrieval result with a linear combination of equal weights \alpha_f on the individual similarity matching functions, a user needs to provide feedback about the relevant images among the top K returned images. The effectiveness of each feature space F^f, f ∈ {V-concept, V-correlation, VCSD, S-concept, S-correlation, SCSD}, with its associated similarity measure for the same query image is determined by considering the top K images returned. The performance is measured by using the formula

E(F^f) = \frac{\sum_{i=1}^{K} \mathrm{Rank}(i)}{K/2} \cdot P(K)    (7)

where Rank(i) = 0 if the image at rank position i is not relevant based on the user's feedback and Rank(i) = (K − i)/(K − 1) for the relevant images. Hence, the function Rank(i) decreases monotonically from one (if the image at rank position 1 is relevant) down to zero (e.g., for a relevant image at rank position K). On the other hand, P(K) = R_K/K is the precision at top K, where R_K is the number of relevant images in the top K retrieved results. Eq. (7) is basically the product of two factors, rank order and precision. The rank order factor takes into account the positions of the relevant images in the retrieval set, whereas the precision is a measure of the retrieval accuracy regardless of position. Generally, the rank order factor is heavily biased toward the positions in the ranked list over the total number of relevant images, and the precision value totally ignores the rank order of the images. To balance both criteria, we use a performance measure that is the product of the rank order factor and the precision. If there is more overlap between the relevant images of a particular retrieval set and the one from which a user provides the feedback, then the performance score will be higher. Both terms on the right side of Eq. (7) will be 1 if all the top K returned images are considered relevant. The raw performance scores obtained by the above procedure are then normalized by the total score as \hat{\alpha}_f = \hat{E}(F^f) = \frac{E(F^f)}{\sum_f E(F^f)} to yield numbers in [0, 1], where \sum_f \hat{E}(F^f) = 1. For the next iteration of retrieval, these normalized scores are utilized as the weights for the respective features in the linear combination of similarity measures as

S(I_q, I_j) = \sum_f \hat{\alpha}_f S_f(I_q, I_j) = \sum_f \hat{\alpha}_f S_f(F^f_q, F^f_j)    (8)

where \sum_f \hat{\alpha}_f = 1 and F^f_q is the mean query vector based on the relevant images for a feature f. Hence, the steps involved in the weight updating process are as follows (a code sketch follows the list):

Step 1: Initially, consider the top K images by applying the similarity fusion S(I_q, I_j) = \sum_f \alpha_f S_f(I_q, I_j) based on equal weighting.
Step 2: Obtain the user's feedback about the relevant images among the top K images.
Step 3: Calculate the new query vector F^f_q as the mean vector of the relevant images.
Step 4: For each ranked list based on an individual similarity matching, also consider the top K images and measure the effectiveness E(F^f) by utilizing Eq. (7).
Step 5: Normalize the effectiveness or weight scores to be in the range [0, 1].
Step 6: Utilize the normalized scores as updated weights in the similarity function of Eq. (8) for the final retrieval.
Step 7: Continue steps 2–6 until no changes are noticed or the system converges.
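A sketch of the weight updating of Eqs. (7) and (8). Here `ranked_lists` maps each feature name to its ranked list of image ids for the current query, `relevant_ids` is the set marked relevant by the user, and `sim_scores` holds the normalized per-feature similarities for one database image; these container conventions are assumptions of the sketch, not of the paper.

```python
def effectiveness(ranked_ids, relevant_ids, K=30):
    """Eq. (7): product of a rank-order factor and precision over the top K
    results of one feature's ranked list."""
    top = list(ranked_ids)[:K]
    rank_score = sum((K - i) / (K - 1.0)
                     for i, img in enumerate(top, start=1) if img in relevant_ids)
    rank_factor = rank_score / (K / 2.0)
    precision = sum(img in relevant_ids for img in top) / float(K)
    return rank_factor * precision

def updated_weights(ranked_lists, relevant_ids, K=30):
    """Normalised per-feature weights alpha_hat_f for the next fusion round."""
    raw = {f: effectiveness(lst, relevant_ids, K) for f, lst in ranked_lists.items()}
    total = sum(raw.values()) or 1.0        # guard against an empty relevant set
    return {f: v / total for f, v in raw.items()}

def fused_similarity(sim_scores, weights):
    """Adaptive linear combination S(I_q, I_j) = sum_f alpha_hat_f * S_f(I_q, I_j)."""
    return sum(weights[f] * s for f, s in sim_scores.items())
```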


5. Experiments

To evaluate the effectiveness of the proposed approaches, exhaustive experiments were performed on two different image collections.

5.1. Datasets

We have chosen two collections for the testing because they provide complementary properties in terms of the inter- and intra-class variances of the image features. Testing our proposed approaches on both sets provides a fair evaluation of their performance. The first collection contains 6000 images of 18 disjoint semantic categories from the IAPR TC-12 Benchmark [35] and from COREL. This collection contains photos taken at locations around the world and comprises a varying cross-section of still natural images. The categories include people such as groups and individual close-up photos, natural landscapes such as mountains and beaches, man-made objects such as cities and old architecture, animals such as birds and kangaroos, sporting events such as football and tennis, and indoor images such as the inside of a hotel room and a church or cathedral; these are organized as a taxonomy of classes as shown in Fig. 5. The second collection contains 5000 bio-medical images of 26 manually assigned disjoint categories, which is a subset of a larger data set of over 50,000 images from four distinct collections of ImageCLEFmed [36]. In this collection, the images are classified into three levels: at the first level, images are categorized according to the imaging modalities (e.g., X-ray, CT, MRI, etc.); at the next level, each of the modalities is further classified according to the examined body parts (e.g., head, chest, knee, etc.); and finally it is further classified by orientation (e.g., frontal, coronal, sagittal, etc.) or distinct visual observation (e.g., ultrasound with gallstones, CT images with nodules), as shown
in Fig. 6. The disjoint categories are selected only from the leaf nodes (shown in gray) to create the ground-truth data sets for both collections.

Fig. 5. Classification structure of the natural images.

Fig. 6. Classification structure of the bio-medical images.

5.2. Experimental setup

The selection of a training image set that represents all the categories in the collection well is critical for both the visual and the semantic concept generation. The training sets used to generate the codebook for the visual concepts and the SVM model file for the semantic concepts consist of 10% of the images from each collection with all the categories in equal proportions, resulting in a total of 600 images for the photographic collection and 500 images for the medical collection. The remaining images (e.g., 90% of the collections) were used as ground-truth for all other testing and evaluation purposes. To construct the codebook of visual concepts, the training images are first equally partitioned into three different grids of 8 × 8, 12 × 12, and 16 × 16 to generate 64, 144, and 256 sub-regions for each image, respectively. After the feature extraction process, for each partition scheme, the SOM is trained to generate a two-dimensional output map or codebook of four different sizes: 256 (e.g., 16 × 16 units), 400 (e.g., 20 × 20 units), 900 (e.g., 30 × 30 units), and 1600 (e.g., 40 × 40 units). For visual concept-based retrieval, the testing is therefore conducted with 12 (3 different partitions times 4 codebook sizes) different configurations. After the codebook construction process, all the images in the test sets are encoded with the concept indices as described previously, and the features are generated correspondingly following the procedures in Section 2. The local semantic concepts are generated from an even grid of 8 × 8 local regions of the training images, and a subset of these regions, which conform to at least 80% of a particular concept with some other overlapping concepts, were used as training data. We manually
assigned 20 local semantic concept classes (such as sea water, lake water, blue sky, cloudy sky, sand, rock, snow, grass, yellow sun, dark background or night, red and pink flowers, and so on) for the photographic collection. The concepts are selected in such a way that images in the same category share many common concepts (such as rock, sky, grass, snow, etc.) and images from different categories also contain common concepts (such as blue sky in all sub-categories under the landscape category) in different proportions. For the medical image collection, the local semantic concepts are selected as the ones that exhibit some meaning to physicians with distinct visual appearances. We defined 25 concept classes from the local regions (such as X-ray of finger, lung, and bone, red tissue, white teeth, normal and abnormal skin, MRI of brain and pelvis, microscopic images of blue, pink, purple, and brown colors, gray and color ultrasound, black and white backgrounds, and so on) for the medical collection. Finally, around 2400 and 2200 local regions were generated, and each region was annotated with a single concept class out of the 20 and 25 classes for the photographic and medical collections, respectively, for the SVM training at the local level. For the SVM training, we utilized both the radial basis function (RBF) and the polynomial kernels. There are two tunable parameters when using RBF kernels: C and γ. It is not known beforehand which values of C and γ are best for the classification problem at hand. Hence, a 10-fold cross-validation (CV) is conducted: pairs of (C, γ) are tried and the one with the best CV accuracy is picked. We also experimented with the polynomial kernel of degrees 1 and 2 with C = 100. However, the best accuracies are achieved by the radial basis kernel in both collections, as shown in Table 1. Hence, after finding the best values of the parameters C and γ of the RBF kernel, they are utilized for the final training to generate the SVM model file for the later prediction on the test sets. We utilized the LIBSVM software package [37] for the implementation of the multi-class SVM classifiers.

For a quantitative evaluation of the retrieval results, we selected all the images in the test collections (e.g., the remaining 90% of the images of both collections) as query images and used query-by-example as the search method, where the query is specified by providing an example image to the system. A retrieved image is considered to be a correct match if it is in the same category (based on the ground truth) as the query image. Precision (the percentage of retrieved images that are also relevant) and recall (the percentage of relevant images that are retrieved) are used as the basic evaluation measures of retrieval performance [23]. The average precision and recall are calculated over all the queries to generate the precision–recall curves in different settings. We first evaluated whether the proposed concept-based feature representation and similarity matching schemes perform better (based on the precision–recall curves) than the commonly used low-level feature representations, and whether enhancing the feature spaces by exploiting the correlations and spatial relationships between the concepts brings any improvement. We also evaluated whether the proposed relevance feedback-based similarity fusion outperforms the fusion scheme based on equal weighting and whether it performs better than the best individual feature representation in terms of the precision–recall curve.
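A small helper for the query-by-example evaluation just described; `ground_truth` maps image ids to their category labels, and the cutoff values are illustrative.

```python
def precision_recall(ranked_ids, category, ground_truth, cutoffs=(20, 50, 100)):
    """Precision and recall at several cutoffs for one query; an image is a
    correct match when it shares the query's ground-truth category."""
    relevant_total = max(sum(1 for cat in ground_truth.values() if cat == category), 1)
    results = {}
    for k in cutoffs:
        hits = sum(1 for img in ranked_ids[:k] if ground_truth.get(img) == category)
        results[k] = (hits / float(k), hits / float(relevant_total))
    return results   # {cutoff: (precision, recall)}
```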

Table 1
CV accuracy of local semantic concept classification.

Training set   Kernel       C     γ      Degree   Accuracy (%)
Natural        RBF          200   0.05   –        77.10
Natural        Polynomial   100   –      1        72.22
Natural        Polynomial   100   –      2        72.09
Medical        RBF          200   0.02   –        81.01
Medical        Polynomial   100   –      1        78.76
Medical        Polynomial   100   –      2        78.76
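The 10-fold cross-validation over (C, γ) pairs can be sketched with scikit-learn's GridSearchCV; the parameter grid below is illustrative (it is not reported in the paper) but includes the values that Table 1 lists as best.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# X, y: manually labelled local-region features and concept labels (training set)
def select_rbf_parameters(X, y):
    """10-fold cross-validation over (C, gamma) pairs for the RBF kernel."""
    grid = {"C": [1, 10, 100, 200, 1000], "gamma": [0.01, 0.02, 0.05, 0.1]}
    search = GridSearchCV(SVC(kernel="rbf"), grid, cv=10)
    search.fit(X, y)
    return search.best_params_, search.best_score_
```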

Moreover, we compared the proposed relevance feedback-based method to the method of the ImageRover system [27], since it also utilizes a weight updating approach in a linear combination of similarity matching based on positive feedback from the users.

6. Results

Fig. 7 presents the precision at rank 20 (P20) of the visual concept-based retrieval (e.g., F^{V-concept}) with the three image partitions and the four codebook sizes for both collections. P20 might be an effective measure, since most online image retrieval engines such as Google, Yahoo, and AltaVista display 20 images by default. From Fig. 7, it is clear that a larger codebook size leads to higher precision for both collections, and the 16 × 16 grid partition achieved better precision for all the codebook sizes. However, more storage and computation are required for a larger codebook. Hence, we choose the codebook size of 400 (e.g., the first turning point of the precision curves) for both collections and consider the 16 × 16 partition scheme for the rest of the feature enhancement and evaluation purposes.

Fig. 7. Precision for different sizes of the codebook and image partition.

Fig. 8 presents the precision–recall curves of the proposed feature representation schemes in the visual and semantic concept-based feature spaces as well as of the representations based on the MPEG-7 color structure descriptor (CSD), with a 128-bin quantization of the HMMD color space, and the edge histogram descriptor (EHD) [26]. The CSD represents the local color structure in images and the EHD represents the spatial distribution of edges as a global shape feature [26]. For the feature F^{V-correlation}, a neighborhood level up to 2-LN was considered. Except for the feature F^{S-correlation}, an L1-norm based distance measure is applied for all other feature representations. By analyzing Fig. 8(a) and (b), we make several important observations. First of all, the proposed concept-based feature representations (e.g., visual and semantic) are most of the time better than the low-level (e.g., CSD and EHD) features in terms of the precision at each recall level for both collections. Only the CSD performed slightly better than the features in the visual concept space in the photographic collection, due to the prominence of color images, and the EHD performed better in the medical collection, due to the prominence of edge structures in many gray level images. However, the performances of the semantic concept-based features are significantly better in both collections when compared to any other feature representation. The good performance is expected, as features of these kinds are more semantically oriented and exploit the domain knowledge of the collections at the local level. Another major observation is that the performance always improves, either in the visual or in the semantic concept space, when the features are enhanced (e.g., V-correlation, VCSD, S-correlation, and SCSD). This certainly shows that there is always enough correlation between the local concepts, which is well exploited in the proposed feature representation schemes.

Fig. 8. Precision–recall curves.

Fig. 9 shows the precision–recall curves for the similarity fusion-based approaches. Here, the proposed relevance feedback-based similarity fusion approach is compared with the equal weighting and the ImageRover approaches. The approach of the ImageRover system is compared since it also relies only on positive feedback. The main idea of the ImageRover system is also to provide more weight to the features that are consistent across the relevant set chosen by the user. However, its feature weights are calculated as the inverse of the mean scores between the pair-wise comparisons of all images in the relevant set.

Fig. 9. Precision–recall curves for the fusion-based retrieval.
From Fig. 9(a) and (b), it is first of all noticeable that, due to the complementary nature of the feature spaces, the combination of different features in similarity matching is more effective than the best individual feature (e.g., S-correlation) in both collections. For the relevance feedback-based similarity fusion approaches, we simulated the user's feedback by considering the top K = 30 most similar images as relevant to the query image and considered only one iteration of feedback. From Fig. 9(a) and (b), we can observe that the proposed adaptive similarity matching scheme has better precision at each recall level compared to the equal weighting and the ImageRover-based fusion schemes. Especially, it performed significantly better in the medical collection, and there is also a distinct improvement in the photographic collection. One of the main reasons for the improved performance of the proposed method is that the combined precision and rank order information might provide a more accurate effectiveness measure compared to measuring effectiveness by considering only the variances of the mean scores. For some feature representations, semantically relevant images may still have large differences in distance scores among themselves, and considering the inverse of the mean score might not be an effective criterion in this case. To test the efficiency in terms of the number K = {10, 20, 30, 50, 100} of images that need to be judged, the performances of the relevance feedback-based approaches are compared based on P20. From Fig. 10(a) and (b), it is observed that the proposed method only requires the top 30 images to be considered to achieve the optimal precision
For the ImageRover approach, in contrast, the precision increases almost linearly with K, so the precision keeps rising as K grows. This is a drawback for an interactive system, as the user needs to judge more images to provide enough feedback information to reach an optimal precision.

To test the convergence rate in terms of the number of iterations, we also considered five iterations of feedback and compared the performances with P20, as shown in Fig. 11. As expected, the proposed approach achieves better precision than the ImageRover approach, while the equal-weighting approach, by construction, yields constant precision at every iteration. From Fig. 11(a) and (b), we can observe that both relevance feedback methods share a common trend: more iterations of feedback yield higher accuracy, and the performances are consistent from one iteration to the next. However, the proposed method converges faster than the ImageRover approach, reaching close to its best precision after only one iteration, as shown in Fig. 11 for both collections.

For a qualitative evaluation of the proposed relevance feedback-based similarity fusion technique, Figs. 12–14 show snapshots of the retrieval results, in which the retrieved images are ranked and displayed in descending order of their similarity scores from the top left to the bottom right (the 15 most similar images), with the image in the top left corner being the query image.
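The convergence experiment above can be mimicked in a few lines. The sketch reuses the hypothetical helpers precision_at_k and precision_rank_weights from the earlier snippets and assumes one reading of the simulation protocol, namely that the relevant images among the current top K act as the user's positive feedback; the paper's exact protocol may differ.

```python
import numpy as np

def simulate_feedback_iterations(scores_per_feature, image_ids, relevant_ids,
                                 iterations=5, k=30):
    """Start from equal weights, then after each iteration re-estimate the feature
    weights from the top-k of the current fused ranking. Returns P20 per iteration."""
    n_feat = len(scores_per_feature)                  # list of 1-D similarity arrays, one per feature
    weights = np.full(n_feat, 1.0 / n_feat)
    per_feature_rankings = [[image_ids[i] for i in np.argsort(-s)]
                            for s in scores_per_feature]
    p20_history = []
    for _ in range(iterations + 1):                   # iteration 0 = before any feedback
        fused = np.dot(weights, np.vstack(scores_per_feature))
        ranked = [image_ids[i] for i in np.argsort(-fused)]
        p20_history.append(precision_at_k(ranked, relevant_ids, k=20))
        feedback = set(ranked[:k]) & relevant_ids     # simulated positive feedback
        weights = precision_rank_weights(per_feature_rankings, feedback, k=k)
    return p20_history
```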

0.55

Fusion−RF ImageRover

0.87 0.86

0.545 0.85 0.54 0.84

0.535 0.53

10

20

30

50

100

0.83

10

20

30

40

50

60

70

80

90

100

Recall

Recall

Fig. 10. The effect of the number of K.

Fig. 11. The effect of the number of iterations (P20 versus number of feedback iterations for Fusion-Equal weight, Fusion-RF, and ImageRover in both collections).

Fig. 12. Retrieval result based on the S-correlation feature.

Fig. 13. Retrieval result based on the equal weighting in similarity fusion.

Fig. 14. Retrieval result based on the dynamic weighting in similarity fusion.

In this example, the query image belongs to the "Sunset" category of the photographic collection. From Fig. 12, we observe that only nine of the returned images belong to the same category as the query image when retrieval is based on the S-correlation feature space. For the similarity fusion based on equal weighting and on the proposed dynamic weight updating (with one iteration only), the system returns 12 and 14 images, respectively, from the same category as the query image, as shown in Figs. 13 and 14. These results show that the fusion approaches provide better precision than the best individual representation for this particular query, and that even within one iteration of feedback the precision increases significantly at a lower recall level compared to the equal-weighting scheme, which is also verified in Fig. 11.

7. Conclusion

In this paper, a unified image retrieval framework is presented based on the utilization of several learning methodologies. In this framework, images are represented in local visual and semantic concept spaces, towards semantic-based image retrieval, by utilizing SOM clustering and multi-class SVM classification. Moreover, the feature spaces are enhanced by exploiting the topology-preserving SOM map and the concept-concept correlations and spatial relationships in individual images. We tested and evaluated the performance on a photographic and a medical image collection; the results are promising compared both to the low-level MPEG-7 feature descriptors and to the concept-based feature representations that do not consider the correlations and spatial relationships. We also proposed an adaptive similarity fusion technique that considers the precision and rank-order information of the top retrieved images. The proposed technique showed improved performance in terms of effectiveness and efficiency compared to the technique used in the ImageRover system. In general, the fusion-based approaches perform better than retrieval with any individual feature representation due to the complementary nature of the feature spaces. Moreover, by incorporating the feedback information into the retrieval loop, the performance is further enhanced, as shown in the results section. In the future, we will investigate incorporating other learning methodologies (such as boosting) and integrating the textual modality into the proposed framework.

Acknowledgments

This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC), IDEAS, Canada Research Chair, and SRTC Concordia University grants. We thank the CLEF [35,36] organizers for making the database available for the experiments, and C.C. Chang and C.J. Lin for the LIBSVM software tool [37] that was utilized for the SVM-related experiments.

References

[1] A. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 1349–1380.
[2] Y. Rui, T.S. Huang, S.F. Chang, Image retrieval: current techniques, promising directions and open issues, J. Vis. Comm. Image Rep. 10 (1999) 39–62.
[3] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, The QBIC project: querying images by content using color, texture, and shape, in: Proceedings of SPIE'93, Storage and Retrieval for Image and Video Databases, pp. 173–187.
[4] C. Carson, S. Belongie, H. Greenspan, J. Malik, Blobworld: image segmentation using expectation–maximization and its application to image querying, IEEE Trans. Pattern Anal. Mach. Intell. 24 (8) (2002) 1026–1038.
[5] J.Z. Wang, J. Li, G. Wiederhold, SIMPLIcity: Semantics-Sensitive Integrated Matching for Picture LIbraries, IEEE Trans. Pattern Anal. Mach. Intell. 23 (9) (2001) 947–963.
[6] P.E. John, Towards intelligent image retrieval, Pattern Recognit. 35 (2002) 3–14.
[7] M. Szummer, R.W. Picard, Indoor–outdoor image classification, in: Proceedings of the IEEE International Workshop on Content-based Access of Image and Video Databases, 1998, pp. 42–51.
[8] A. Vailaya, M. Figueiredo, A. Jain, H.J. Zhang, Image classification for content-based indexing, IEEE Trans. Image Process. 10 (1) (2001) 117–130.
[9] C. Town, D. Sinclair, Content-based image retrieval using semantic visual categories, Technical Report 2000.14, AT&T Research, Cambridge, 2000.

[10] P. Duygulu, K. Barnard, N. Freitas, D. Forsyth, Object recognition as machine translation: learning a lexicon for a fixed image vocabulary, in: Proceedings of the Seventh European Conference on Computer Vision, 2002, pp. 97–112.
[11] L. Zhu, A. Zhang, A. Rao, R. Srihari, Keyblock: an approach for content-based image retrieval, in: Proceedings of ACM Multimedia, 2000, pp. 157–166.
[12] F. Jing, M. Li, H.J. Zhang, B. Zhang, An efficient and effective region-based image retrieval framework, IEEE Trans. Image Process. 13 (2004) 699–709.
[13] J.H. Lim, Explicit query formulation with visual keywords, in: Proceedings of the Eighth ACM International Conference on Multimedia, 2000, pp. 407–412.
[14] E. Chang, G. Kingshy, G. Sychay, W. Gang, CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines, IEEE Trans. Circuits Syst. Video Technol. 13 (2003) 26–38.
[15] J. Vogel, B. Schiele, Semantic modeling of natural scenes for content-based image retrieval, Int. J. Comput. Vis. 72 (2) (2007) 133–157.
[16] X.S. Zhou, T.S. Huang, Relevance feedback for image retrieval: a comprehensive review, Multimedia Syst. 8 (6) (2003) 536–544.
[17] R.W. Picard, T.P. Minka, M. Szummer, Modeling user subjectivity in image libraries, in: Proceedings of the IEEE International Conference on Image Processing, 1996, pp. 777–780.
[18] Y. Rui, T.S. Huang, Relevance feedback: a power tool for interactive content-based image retrieval, IEEE Trans. Circuits Syst. Video Technol. 8 (5) (1999) 644–655.
[19] S. Tong, E. Chang, Support vector machine active learning for image retrieval, in: Proceedings of the Ninth ACM International Conference on Multimedia, 2001, pp. 107–118.
[20] T. Kohonen, Self-Organizing Maps, second ed., Springer-Verlag, Heidelberg, 1997.
[21] J. Laaksonen, M. Koskela, E. Oja, PicSOM: self-organizing image retrieval with MPEG-7 content descriptors, IEEE Trans. Neural Netw. 13 (4) (2002) 841–853.
[22] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[23] R.B. Yates, B.R. Neto, Modern Information Retrieval, Addison Wesley, 1999.
[24] M.M. Rahman, B.C. Desai, P. Bhattacharya, Visual keyword-based image retrieval using correlation-enhanced latent semantic indexing, similarity matching & query expansion in inverted index, in: Proceedings of the International Database Engineering & Applications Symposium (IDEAS'06), 2006, pp. 201–208.
[25] R.M. Haralick, K. Shanmugam, I. Dinstein, Textural features for image classification, IEEE Trans. Syst. Man Cybern. 3 (1973) 610–621.
[26] S.F. Chang, T. Sikora, A. Puri, Overview of the MPEG-7 standard, IEEE Trans. Circuits Syst. Video Technol. 11 (2001) 688–695.
[27] S. Sclaroff, M.L. Cascia, S. Sethi, L. Taycher, Unifying textual and visual cues for content-based image retrieval on the world wide web, Comput. Vision Image Understand. 75 (1999) 86–98.
[28] J. Vesanto, SOM-based data visualization methods, Intell. Data Anal. 3 (2) (1999) 111–126.
[29] G.G. Yen, W. Zheng, Ranked centroid projection: a data visualization approach for self-organizing maps, in: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN'05), 2005, pp. 1587–1592.
[30] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, 1990.
[31] T.F. Wu, C.J. Lin, R.C. Weng, Probability estimates for multi-class classification by pairwise coupling, J. Mach. Learn. Res. 5 (2004) 975–1005.
[33] C.W. Hsu, C.J. Lin, A comparison of methods for multi-class support vector machines, IEEE Trans. Neural Netw. 13 (2) (2002) 415–425.
[34] C.C. Vogt, G.W. Cottrell, Fusion via a linear combination of scores, Inf. Retrieval 1 (1999) 151–173.
[35] M. Grubinger, P. Clough, H. Müller, T. Deselaers, The IAPR TC-12 benchmark: a new evaluation resource for visual information systems, in: International Workshop OntoImage 2006, Language Resources for Content-Based Image Retrieval, 2006, pp. 13–23.
[36] H. Müller, T. Deselaers, T.M. Lehmann, P. Clough, E. Kim, W. Hersh, Overview of the ImageCLEFmed 2006 medical retrieval and annotation tasks, in: Seventh Workshop of the Cross-Language Evaluation Forum (CLEF 2006), LNCS, vol. 4730, 2007, pp. 595–608.
[37] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, software available online, 2001.