JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION
Vol. 9, No. 4, December, pp. 287–299, 1998 ARTICLE NO. VC980398
Query Expansion by Text and Image Features in Image Retrieval

Zhou Hong and Chan Syin
School of Applied Science, Nanyang Technological University, Singapore 639798

and

Kok F. Lai
Information Technology Institute, Singapore 117685

Received January 13, 1998; accepted August 28, 1998
We present a two-pass image retrieval system in which retrieval techniques for text and image documents are combined in a novel approach. In the first pass, the text-based initial query is matched against the text captions of the images in the database to obtain the initial retrieved set. In the second pass, text and image features obtained from this initial retrieved set are used to expand the initial query. Additional images from the database are then retrieved based on the expanded query. The image features that we have used are color histograms, DC coefficients from the discrete cosine transform, and two texture features: the multiresolution simultaneous autoregressive model and the local binary pattern. These are low-level statistical image features that can be easily computed. Extensive experiments have been performed on 1019 color pictures of mixed variety, with captions, relevance judgments, and queries supplied by a national archives agency. Objective precision-recall results have been obtained with various combinations of text and image features. The results show that the image features do not perform well when used on their own. However, when image features are used in query expansion, they increase the average precision more significantly than text annotations. Moreover, these findings are valid at all precision levels and are not sensitive to the image feature acquisition parameters. © 1998 Academic Press

1. INTRODUCTION

With collections of digital images growing at a rapid rate, providing easy access to image databases has become a significant service. Unlike text retrieval [1], which has already been used widely and successfully, image retrieval is a relatively new research area. A considerable number of image retrieval systems [2–7] rely solely on text annotations found in titles, authors, captions, and descriptive labels. In these systems, retrieval is based on computing the overlap of the query terms with the text annotations of the images. As these annotations are typically generated manually by human beings, they provide compact, important [8], though sometimes biased and incomplete, descriptions of the visual content. Such text-based systems can therefore leverage mature text retrieval techniques to produce satisfactory results. Nevertheless, it has been reported [9] that users querying an image collection tend to be much more specific in their requests and information needs than when querying a text database. Moreover, most text annotations tend to be short. These imply that a simple term-by-term match between query terms and annotation terms may not be effective in some applications. Several solutions, such as semantic distances [4] and concept spaces [7], have been proposed to tackle this problem.

In contrast, image retrieval systems that are based purely on image features are still in the preliminary stages of development. As image processing and computer vision techniques that provide domain-independent recognition of image content are still in their infancy, these systems rely mainly on low-level image features. These include color, composition, texture, structure [10–12], shape [13], motion parameters [14], Gabor wavelets [15], and DCT coefficients [16]. Being poor descriptors of perceptual information and semantic similarity, these low-level features have exhibited limited effectiveness in general retrieval tasks (such as "find images containing computers and printers"). In fact, researchers from the computer vision community have shown [17] that in the absence of a priori knowledge or an imposed model, most low-level vision tasks are ill-posed problems. A given application domain, however, offers stronger constraints in feature selection to optimize performance. This approach has been adopted in retrieving maps [18], fingerprints [19], human faces [20], and satellite images [21].
Another problem posed by these low-level image features involves query formulation. As these features cannot infer semantic similarity, they do not constitute an effective "visual language" for querying. Consequently, some systems [10, 11] adopt a less intuitive approach, requiring users to manually assign weights to color, texture, and other features when querying. Such methods could possibly exploit a good "visual thesaurus" [22] to enrich the representational power of their low-level features, but such thesauri are not yet widely tested or available. Alternatively, some systems [13, 19] require the user to input a sample image for querying. In this case, the retrieved images may have little semantic similarity with the input image.

It is generally believed that the combination of text annotations and image features holds the greatest promise in image retrieval. Several attempts in this direction include using image texture to generate text annotations [23] and combining text captions with image understanding modules to identify human faces [24].

This paper presents our experiments on combining text annotations and several low-level image features in image retrieval. We adopt a novel two-pass retrieval approach. In the first pass, users specify their information needs using simple text-based queries and the system retrieves an initial list of candidate images. In the second pass, without requiring further user input, the system uses the text annotations and the color and texture features of these candidate images to expand the initial query and further refine the retrieval. The advantage of this approach is that it allows the user to interact with the retrieval system in a natural and intuitive manner, yet at the same time it retains the ability to exploit image features.

We have built an image collection of 1019 pictures with the assistance of a national archives agency. During the acquisition process, the archivists assigned category headings (e.g., buildings, transportation, events) and short descriptive captions to the pictures. We further engaged the archivists' assistance to provide keywords which can be used to query each of these categories. Using the category headings as relevance judgments and the text captions as indexed terms, we have designed and implemented a series of experiments to study the effects of query expansion using text and image features. All observations are based on an objective precision-recall measure rather than on subjective human judgment of the retrieval performance. This catalogued image collection has been made available to other researchers (the interested reader may make enquiries at
[email protected].). This collection comprises a mixed variety of images and is therefore not domain-specific. The short text captions of the images are likely to facilitate experiments in query expansion, particularly with image features. The associated relevance judgments are able to provide an objective assessment of the retrieval performance of similar systems. This paper is organized as follows: Section 2 contains
a brief discussion of the query expansion and relevance feedback methods employed in text retrieval. Section 3 describes the image features used in this study. Section 4 describes the experimental set-up and the image collection. Sections 5 and 6 present the results, discussion, and conclusions.

2. QUERY EXPANSION AND RELEVANCE FEEDBACK
Work in query expansion has largely been motivated by the difficulty of restricting a text query to contain only those terms that are found in the text databases being queried [25]. To ease the user's burden in query term selection, the system can be designed to automatically expand the query by adding related terms to those supplied by the user.

Query expansion can be performed in a feed-forward manner: terms are added a priori, without first looking at the initial retrieved results. This can be achieved either by using lexical aids (thesauruses and controlled vocabularies) such as WordNet [26] or by exploiting statistical relations among terms in a text collection. The results of applying feed-forward query expansion using lexical aids have been mixed. In [27], a large-scale experiment involving the TREC (Text Retrieval Conference) data [28] shows that query expansion using WordNet makes little difference to retrieval effectiveness if the original queries are relatively complete descriptions of the information needs. Less well-developed queries can, however, be significantly improved by hand-chosen concepts. Similar experiments involving a captioned image database have also shown some promise [4]. It has also been reported that query expansion using statistical relations determined a priori from text collections has had little success in improving retrieval effectiveness when used apart from relevance data [29, 30]. Serious limitations of using term co-occurrence data in query expansion were also highlighted [31]. Recent experiments using domain-specific concepts and their weighted co-occurrence relationships in a restricted engineering domain have, however, reported significantly higher concept recall [7].

The results of employing query expansion using relevance judgments have been more uniform. In this case, the relevance of the documents used in query expansion is known, and term weights can be assigned appropriately. All top performers in the routing task in TREC-4 [32–37] employed some form of query expansion from relevant documents and reported improvements over their initial results. Similarly, in the ad hoc retrieval task, where relevance judgments are not available, most top-performing systems [32–36, 38] expanded their initial queries using terms found in the top-ranked retrieved documents
and reported overall improvements after expansion. This strategy is based on the assumption that there are more relevant documents than nonrelevant ones in the top-ranked list, which can be invalid in some cases.

3. IMAGE FEATURES FOR QUERY EXPANSION
The primary objective of the experiments is to investigate the effectiveness of using image features in query expansion as compared to text annotations. We have chosen four different image features covering color, grayscale, and texture for our experiments. These low-level image features have been chosen mainly because they are relatively easy to compute and can be easily included in query expansion. They have also been used widely in other image retrieval systems such as QBIC [10] and Photobook [12]. Such features do play an important role in the high-level perception of visual scenes. For example, regions such as "water," "grass," and "sky" can be distinguished with the use of proper combinations of textures and colors [23]. We believe that if these simple image features are shown to be able to enhance retrieval performance through query expansion, the additional effort of including more sophisticated image features will be justifiable.

The four image features selected are the color histogram, DC coefficients, the multiresolution simultaneous autoregressive (MRSAR) model, and the local binary pattern. Color histograms have been used successfully in other image retrieval systems [39]. DC coefficients are useful when we consider the retrieval of images that have been compressed using the JPEG standard [16]. MRSAR has been used in Photobook because it has reportedly worked better than eigenvectors for the retrieval of images from a texture database. The local binary pattern is chosen in our study primarily as a low-cost alternative to MRSAR. These four image features are described in the following subsections.

3.1. Color Histograms

Color histograms are obtained by summing the number of pixels having similar values in the RGB (red, green, blue) components. Let ĥ_r(i) be the number of pixels with R value equal to i. Similarly, let ĥ_g(i) and ĥ_b(i) be the corresponding numbers for the G and B components. Here i = 0, 1, 2, ..., N_h − 1, where N_h is the total number of quantization steps and is equal to 256 in an 8-bit/component display system. We can reduce the length of the histogram features by using only the most significant bits of the RGB component values. For example, to obtain N_h = 2^p, we use only the p most significant bits. This allows us to study the effect of using 4, 8, 16, 32, and 256 quantization levels, which correspond to p = 2, 3, 4, 5, and 8, respectively. To reduce the histogram's sensitivity to differences in the pixels' brightness levels, we perform simple Gaussian
smoothing on the raw histogram. The Gaussian kernel used is [0.2236, 0.5477, 0.2236]. Thus a smoothed histogram is given by

h(i) = 0.2236 · ĥ(i − 1) + 0.5477 · ĥ(i) + 0.2236 · ĥ(i + 1).    (1)

To handle the boundary conditions, we simply let ĥ(−1) = ĥ(0) and ĥ(N_h) = ĥ(N_h − 1). The overall histogram feature vector is thus given by

h = [h_r(0), h_r(1), ..., h_r(N_h − 1), h_g(0), ..., h_g(N_h − 1), h_b(0), ..., h_b(N_h − 1)]^T.    (2)
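As a concrete illustration of this feature, the following minimal Python sketch (ours, for exposition only; the function name and the use of NumPy are incidental choices, not part of the system described here) quantizes each RGB component to its p most significant bits, forms the three raw histograms, and applies the smoothing of Eq. (1) with the boundary rule above:

```python
import numpy as np

def smoothed_color_histogram(rgb, p=3):
    """Sketch of the histogram feature h of Eq. (2).

    rgb: H x W x 3 array of 8-bit R, G, B values.
    p:   number of most significant bits kept per component (N_h = 2**p bins).
    """
    n_bins = 2 ** p
    kernel = np.array([0.2236, 0.5477, 0.2236])
    channels = []
    for c in range(3):  # R, G, B in turn
        bins = rgb[..., c].astype(np.uint8) >> (8 - p)            # keep p most significant bits
        raw = np.bincount(bins.ravel(), minlength=n_bins).astype(np.float64)
        # Replicate the end bins so that h(-1) = h(0) and h(N_h) = h(N_h - 1).
        padded = np.concatenate([raw[:1], raw, raw[-1:]])
        channels.append(np.convolve(padded, kernel, mode="valid"))  # Eq. (1)
    return np.concatenate(channels)                                 # length 3 * N_h
```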
3.2. DC Coefficients from the Discrete Cosine Transform

Image retrieval based on JPEG (Joint Photographic Experts Group) or MPEG (Moving Picture Experts Group) compressed data has been studied by many researchers [16, 40–42], where 8 × 8 blocks of discrete cosine transform (DCT) coefficients are commonly used as the image features. In our experiments, we use the zero-frequency (or DC) coefficient of the luminance component as the query expansion feature. The DC coefficient, F(0, 0), represents the average pixel value of each block and can be calculated as

F(0, 0) = (1/8) Σ_{i=0}^{7} Σ_{j=0}^{7} f(i, j),    (3)

where f(i, j) is the pixel luminance in an 8 × 8 block of pixels. Let N_c be the total number of DCT blocks in an image. Because of the large resolution (768 × 512) of our images, there are too many 8 × 8 blocks, i.e., N_c = 96 × 64 = 6144 blocks. Grouping 16 such blocks (4 in the horizontal direction and 4 in the vertical direction) and averaging their DC coefficients, we obtain a feature vector c of length N_c = 24 × 16 = 384,

c = [c(1, 1), c(1, 2), ..., c(2, 1), ..., c(24, 16)],    (4)
where c represents the average DC coefficient value in a group of 16 DCT blocks. We also use N_c = 12 × 8 = 96 and 6 × 4 = 24 in our study.
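A minimal sketch of this grouping step (an illustration only; it assumes access to uncompressed luminance values rather than an actual JPEG bitstream, and image dimensions that are multiples of 8 times the grouping factor):

```python
import numpy as np

def dc_feature(luma, group=4):
    """Sketch of the feature vector c of Eq. (4).

    luma:  2-D array of luminance values (e.g., 768 x 512).
    group: DC coefficients are averaged over group x group blocks (4 gives N_c = 384).
    """
    H, W = luma.shape
    blocks = luma.astype(np.float64).reshape(H // 8, 8, W // 8, 8)
    # DC coefficient of an 8x8 block: F(0,0) = (1/8) * sum of its 64 pixels = 8 * block mean.
    dc = 8.0 * blocks.mean(axis=(1, 3))
    grouped = dc.reshape(dc.shape[0] // group, group,
                         dc.shape[1] // group, group).mean(axis=(1, 3))
    return grouped.ravel()
```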
3.3. Multiresolution Simultaneous Autoregressive Models

The multiresolution simultaneous autoregressive (MRSAR) model [43] is a texture feature used by Photobook [12] to characterize spatial interactions among neighboring pixels. It computes texture features for a second-order simultaneous autoregressive model (on the grayscale component) over three scales (resolution levels 2, 3, 4) on overlapping 25 × 25 subwindows. Let g(i) be the luminance value of the pixel at site i = (i_x, i_y) in an image. The SAR model can be expressed as

g(i) = e + Σ_{r ∈ f} θ(r) g(i + r) + ε(i),    (5)

where f is the set of neighbors of the pixel at site i. In the above equation, ε(i) is an independent Gaussian random variable with zero mean and variance σ². The standard deviation, σ, has a direct relationship to the visually perceived granularity of the neighborhood. The θ(r) are the model parameters characterizing the dependence of a pixel on its eight neighbors. Because the model is symmetric, i.e., θ(r) = θ(−r), θ(r) represents four parameters. The term e is the bias, which depends on the mean gray value of the image and is not used in our experiments. All five model parameters, θ(r) and σ, can be estimated from a given window using the least-squares error (LSE) technique. The multiresolution SAR model can be written as

g(i) = e_l + Σ_{r ∈ f_l} θ_l(r) g(i + r) + ε_l(i),    (6)
where l = 0, 1, ..., L − 1 is the resolution level and f_l denotes the corresponding set of neighboring pixels. In our experiments, the 25 × 25 overlapping windows are shifted two pixels in both the horizontal and vertical directions to estimate the model parameters. Because each resolution contributes five parameters, the features from all the models, {θ_l(r), σ_l | r ∈ f_l, l = 2, 3, 4}, are appended to form a feature vector of length 15 at pixel i. The query expansion vector s is taken as the mean of these feature vectors over all pixels.

MRSAR is a very expensive image feature to compute. Using the software package supplied with Photobook, it takes 9–10 min to compute the MRSAR feature for each image on a 167-MHz Sun Ultra SPARC workstation. This is approximately 1000 times longer than the time needed by any of the other three features.
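The per-window estimation can be sketched as follows. This is a simplified, single-window, single-resolution illustration of the least-squares fit of Eq. (5) (the multiresolution pyramid, the two-pixel window shifting, and the averaging over all pixels are omitted), not the Photobook implementation:

```python
import numpy as np

def sar_parameters(window):
    """Least-squares estimate of the four symmetric theta(r) and sigma for one window.

    window: 2-D grayscale array (e.g., a 25 x 25 subwindow).
    Returns a length-5 vector; three resolutions would give the length-15 MRSAR feature.
    """
    g = window.astype(np.float64)
    H, W = g.shape
    # Symmetric neighbor pairs (r, -r): horizontal, vertical, and the two diagonals.
    pairs = [((0, 1), (0, -1)), ((1, 0), (-1, 0)),
             ((1, 1), (-1, -1)), ((1, -1), (-1, 1))]
    target = g[1:H - 1, 1:W - 1].ravel()
    columns = []
    for (dy1, dx1), (dy2, dx2) in pairs:
        shifted = (g[1 + dy1:H - 1 + dy1, 1 + dx1:W - 1 + dx1]
                   + g[1 + dy2:H - 1 + dy2, 1 + dx2:W - 1 + dx2])
        columns.append(shifted.ravel())
    columns.append(np.ones_like(target))          # constant column estimates the bias e
    A = np.column_stack(columns)
    coeffs, _, _, _ = np.linalg.lstsq(A, target, rcond=None)
    theta = coeffs[:4]
    sigma = (target - A @ coeffs).std()           # residual standard deviation
    return np.concatenate([theta, [sigma]])
```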
3.4. Local Binary Pattern

The local binary pattern (LBP) is a simple texture feature originally proposed by Wang and He [44]. We have selected this simple feature as a low-cost alternative to the computationally expensive MRSAR feature. In this method, each of the eight connected neighbors in a 3 × 3 window is compared with the center pixel. The comparison result for each center pixel is an 8-bit binary number. Based on these 8-bit binary numbers, a histogram of N_b = 256 bins is constructed and used as the LBP feature vector, b. For simplicity, the boundary pixels are not considered. We also use a simplified version of the LBP feature in which only four neighboring pixels (north, south, east, and west) are considered; in this case the histogram consists of only N_b = 16 bins. The representation and calculation of the LBP feature are similar to those of the color histogram, but this feature is computed only for the luminance component.
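A minimal Python sketch of the 256-bin variant (ours; the exact comparison rule, here neighbor ≥ center, and the bit ordering are illustrative choices rather than a statement of the reference implementation):

```python
import numpy as np

def lbp_histogram(gray):
    """Sketch of the 256-bin LBP feature vector b for a grayscale image.

    Boundary pixels are skipped, matching the simplification described above.
    """
    g = gray.astype(np.int32)
    H, W = g.shape
    center = g[1:H - 1, 1:W - 1]
    # Eight connected neighbors, each contributing one bit of the pattern code.
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(shifts):
        neighbor = g[1 + dy:H - 1 + dy, 1 + dx:W - 1 + dx]
        codes |= (neighbor >= center).astype(np.int32) << bit
    return np.bincount(codes.ravel(), minlength=256).astype(np.float64)
```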
4. EXPERIMENTAL SET-UP

4.1. Image Collection

The image collection consists of 1019 pictures supplied by the National Archives of Singapore. The image resolution is either 512 × 768 or 768 × 512. Each image has been assigned to one or more categories by archivists during the acquisition process. A total of 51 categories exist in the image collection, and these category headings formed the basis of our relevance judgments. Each image is also accompanied by a short text caption of approximately 70 to 300 bytes supplied by the archivists. After discarding stopwords, there are 1161 unique index terms in the collection; 51 sets of keywords, which constitute the initial queries to the system, have also been supplied by the archivists. More details about this image collection are given in the Appendix.

4.2. Object Representation and Query Expansion

Each image is represented as a composite vector that consists of two normalized feature vectors,

d = [t/|t|, x/|x|]^T.    (7)

Here t is the text feature vector,

t = [f(1), f(2), ..., f(N_t)]^T,    (8)

where f(j) is a function of the term frequency of term j, and N_t is the total number of unique text terms in the collection, equal to 1161 in the image collection used. By the same token, x is the image feature vector; x will be substituted by h, c, s, or b depending on the feature used. This arrangement is very suitable for our experiments, as we measure the similarity between a document vector and the query vector by the cosine of the angle between the two vectors. Different weights can also be easily assigned to the various normalized text and image feature vectors if necessary. The initial query q0 is similarly represented as a vector,
q0 = [t0/|t0|, 0]^T,    (9)

since all initial queries are text-based. The expanded query is given by

q = q0 + α f_m / |f_m|,    (10)

where f_m is the feedback vector constructed from the top m images retrieved by the initial query and α is a weighting factor. The feedback vector for query expansion with the text feature only is defined as

f_m = Σ_{i=1}^{m} t_i,    (11)

where t_i is the text feature vector of the image ranked i. Similarly, we use f_m = Σ_{i=1}^{m} x_i to expand using image features only, and f_m = Σ_{i=1}^{m} d_i to expand using the composite features.
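The mechanics of Eqs. (7), (10), and (11), together with the dot-product ranking used in the next section, can be sketched as follows (an illustration only; the function names and the use of NumPy are ours, not part of the system described here):

```python
import numpy as np

def composite_vector(t, x):
    """Composite document vector d of Eq. (7): normalized text part stacked
    on a normalized image-feature part."""
    return np.concatenate([t / np.linalg.norm(t), x / np.linalg.norm(x)])

def expand_query(q0, top_m_vectors, alpha=1.0):
    """Expanded query of Eq. (10): q = q0 + alpha * f_m / |f_m|, where f_m is
    the sum (Eq. (11)) of the vectors of the top m images from the first pass."""
    f_m = np.sum(top_m_vectors, axis=0)
    norm = np.linalg.norm(f_m)
    return q0 + alpha * f_m / norm if norm > 0 else q0.copy()

def retrieve(query, documents):
    """Rank documents by dot-product similarity, keeping only strictly positive scores."""
    scores = documents @ query                     # one similarity per document row
    ranked = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in ranked if scores[i] > 0]
```

In a two-pass run, the first pass ranks documents against q0, the vectors of the top m retrieved images feed expand_query, and the second pass ranks against the expanded query.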
4.3. Evaluation

By varying the importance of the expansion terms through α, the number of images used for expansion, m, and the features used in expansion, f_m, we performed a series of experiments to study the effectiveness of text-based features and image features in query expansion. The similarity between an image d and a query q is given by

s = d · q.    (12)

For each query, images are ranked in descending order of similarity. To mimic a true retrieval system, only images with similarity greater than zero are ranked and evaluated. This implies that only images with at least one overlap in the feature space are retrieved. When used on their own, the low-level image features that we have chosen are of limited use in inferring semantic similarity. For example, the color histogram feature would suggest that a field full of yellow flowers is semantically similar to a beach covered with yellow umbrellas. However, when combined with the text feature, we believe that these features help to retrieve images that not only appear similar in color and texture, but also share similar textual descriptions.

We adopt the standard evaluation mechanism used in the TREC evaluations for document retrieval [28]. This includes recall, a measure of the ability of a system to present all relevant items, and precision, a measure of the ability of the system to present only relevant items. The formulae are given by

recall = (number of relevant items retrieved) / (number of relevant items in the collection)    (13)

and

precision = (number of relevant items retrieved) / (total number of items retrieved).    (14)

Finally, the term average precision refers to the 11-point average of the precision values at the recall levels 0.0, 0.1, 0.2, ..., 1.0.
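The 11-point average precision can be computed from a ranked list as sketched below; the TREC-style interpolation (precision at recall level r taken as the maximum precision at any achieved recall ≥ r) is our assumption, since the interpolation convention is not spelled out above:

```python
def eleven_point_average_precision(ranked_ids, relevant_ids):
    """11-point average precision for one query.

    ranked_ids:   document ids in descending order of similarity (only scores > 0).
    relevant_ids: collection of ids judged relevant to the query.
    """
    relevant_ids = set(relevant_ids)
    total_relevant = len(relevant_ids)
    hits = 0
    points = []                                   # (recall, precision) after each relevant hit
    for rank, doc in enumerate(ranked_ids, start=1):
        if doc in relevant_ids:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    levels = [k / 10 for k in range(11)]          # 0.0, 0.1, ..., 1.0
    values = []
    for level in levels:
        attained = [p for r, p in points if r >= level]
        values.append(max(attained) if attained else 0.0)
    return sum(values) / len(values)
```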
5. RESULTS AND DISCUSSIONS

5.1. Baseline Experiments

We first perform a baseline experiment using only the initial text query and no expansion. At each of the 11 recall levels, the mean precision over the 51 queries is computed. These mean precision values, together with the 11-point average precision over the 51 initial queries, are given in Table 1.

TABLE 1
Precision-Recall Values with No Query Expansion

Recall          Mean precision
0.00            0.567543
0.10            0.450892
0.20            0.433713
0.30            0.365712
0.40            0.340227
0.50            0.324323
0.60            0.280546
0.70            0.261319
0.80            0.203282
0.90            0.120171
1.00            0.066677
Avg. precision  0.310400

It is also of interest to compare the figures in Table 1 with the average precision for random guessing. For example, if a user is allowed to pick images at random, what is the likelihood of picking one that matches his query? This depends on the total number of images in the collection and the number of images that are relevant to his query. There are altogether 1019 images in our collection, and the largest number of images relevant to a single query is 128 (this corresponds to category 7 in the Appendix). Thus the average precision for this query under random guessing is around 0.13, which is well below the average precision of 0.31 in Table 1.
TABLE 2
Average Precisions Using Mean Feature Vectors of Relevant Images

Feature type    Text        Histogram   DC          MRSAR       LBP
Avg. precision  0.851940    0.203158    0.224745    0.171518    0.192284
For each of the 51 queries, we then compute the mean of the various feature vectors using only the images that are relevant to the query. Thus we obtain five different query vectors, each being the mean obtained from a different feature (t, h, c, s, b). Using each of these query vectors as an initial query with no query expansion, the average precision values are obtained and are shown in Table 2. Table 2 clearly shows that text captions are excellent in identifying the relevant images. It also suggests that all the image features, including the very expensive MRSAR, are poor discriminators in general image retrieval tasks when used exclusively.

5.2. Experiments on Query Expansion

In these experiments, we perform a two-pass retrieval by first retrieving an initial list of candidate images using the text queries. These candidate images are ranked in descending order of similarity. We expand the query using the top m images and the weight α as shown in Eq. (10). In other words, we try to retrieve more images that are similar to the top m images. This is based on the assumption that there are more relevant images than nonrelevant ones among the top m candidate images; in such cases, query expansion is likely to retrieve more relevant images.

FIG. 1. Query expansion using text and image features by varying the number of images used in feedback (α = 1.0).

5.2.1. Effects of Varying the Number of Feedback Images. Figure 1 shows the results where α is fixed at 1.0 and m is varied. Figure 1a shows the results obtained when the expansion is performed without using the initial query vector, i.e., q0 is omitted in Eq. (10). This is similar in concept to the experiment that produced Table 2, except that here we assume the top m retrieved images are relevant. For query expansion using the text feature, the results are generally in agreement with other similar experiments in text retrieval; i.e., results improve with more images used for feedback, although there is a diminishing rate of return. Also, there appears to be little consequence of using more than 20 images for expansion. When 50 images are used to expand the queries, the average precision is improved by about 48% as compared to using one image. The graph also shows that the four image features perform poorly when used by themselves in query expansion, which is in agreement with the results in Table 2.

Figure 1b shows an interesting result when text or image
features are combined with the initial query during expansion. All image features performed substantially better than the text annotations. This shows that while the low-level image features perform poorly when used by themselves, they improve average precision more than text annotations when used in query expansion. Figure 1c shows that the findings in Fig. 1b remain valid even if the average precisions are computed in the high precision region only (recall = 0.0, 0.1, 0.2). We had suspected that the improvements yielded by the image features were due to their causing more images to be retrieved, without contributing significantly to the ranking of the top images. If this were true, the improvements in average precision would come mainly from the high recall region, where a typical user has little interest, and there would be minimal improvements at high precision. However, Fig. 1c shows that at high precision levels the image features also perform better than text when used for query expansion. Therefore, query expansion using image features does help to improve the ranking of the top images. Figure 1d shows the results when both text and image features are used in query expansion. Compared with Fig. 1b, it shows that using additional text captions in query expansion did not improve the results.

FIG. 2. Query expansion using text and image features by varying α.

5.2.2. Effects of Varying the Weight of the Expansion Vector. We also investigate the effects when m is fixed at 10 and α is varied. Figure 2a shows that when the values of α are very small (α < 0.5), there is very little performance difference among the features. When the expansion vector is as important as the initial query (α = 1.0), image features perform better than text when used in feedback. With larger values of α, however, the performance of the image features deteriorates rapidly. This is almost equivalent to using an image feature as the primary retrieval vector, and earlier results have already shown that this performs very poorly. Therefore, the initial text queries capture the user's information needs effectively and should not be discarded in the process of query expansion. With larger α values, the two texture features, MRSAR and LBP, appear to yield better performance than the color and grayscale features. Figure 2b confirms these observations at high precision levels.

Figures 2c and 2d show the performance of the composite feedback features (text and image) under different weighting conditions. There appears to be a weak maximum at low values of α (α < 1.0), and all composite features produce similar performance. The apparent superiority of MRSAR and LBP over the other image features seen in Figs. 2a and 2b is not observed, because with larger values of α the contribution of the image features to the performance is small, and the text feature is the main contributor.

5.2.3. Effects of Varying the Acquisition Parameters of Image Features. We have performed similar experiments
by varying various parameters in the acquisition of the image features. These include varying the quantization levels of the color histograms (N_h = 4, 16, 32, 256), the number of DCT blocks (N_c = 96, 384), and the number of total bins (N_b = 256). We found that the results are similar to those reported in Figs. 1 and 2, which were obtained with N_h = 8, N_c = 24, and N_b = 16. Tables 3 to 5 show the results of query expansion with image features at different acquisition parameters of those image features, where α = 1.0 and m = 10. It can be seen from Tables 3 to 5 that, except for the case N_h = 256, where the quantization steps are so fine that the average precision is slightly worse than for smaller N_h, changes in quantization levels have little effect on average precision. Thus, we can conclude that all our previous observations are not sensitive to changes in the quantization levels of the low-level image features. This suggests that in selecting the quantization levels, one can place more importance on issues such as the sparseness of the feature space and retrieval efficiency, with the confidence that the effect on recall-precision will be minimal. Our empirical results enable us to recommend the
following minimum values for the various acquisition parameters: N_h = 4, N_c = 24, and N_b = 16.

5.2.4. Expansion with Multiple Image Features. From the previous experiments, we know there is no significant difference among the average precision improvements produced by the various image features. We conclude that this collection is not sensitive to the image features used. We proceed further to investigate the effect of using multiple image features. This experiment uses three different combinations of two image features for expansion: color histogram and DC coefficient, color histogram and LBP feature, and DC coefficient and MRSAR. Now the x in Eq. (7) consists of two image features. Taking the color histogram and the DC coefficient, for example,
x = [h/|h|, c/|c|]^T.    (15)
For comparison, we draw graphs similar to those shown previously. Figure 3a shows the results where α is fixed
at 1.0 and m is varied, while Fig. 3b shows the results when m is fixed at 10 and α is varied. It can be seen from Fig. 3a that there is no obvious difference between the average precision of single features and that of multiple features as the number of feedback images varies. From Fig. 3b, we observe that when the values of α are small (α < 1.0), there is very little performance difference among the features. Similarly to the results in Fig. 2b, with larger α values the two texture features, MRSAR and LBP, appear to yield better performance than the color histogram and DC coefficient features; the performance of the combined image features lies between those of the individual image features. Comparisons in the high precision region, or in conjunction with text annotations, also yielded similar results.
We conclude that the behavior of multiple image features is similar to that of a single image feature. In summary, experiments with the four image features, whether single or multiple, indicate very little variation in performance. Therefore, in query expansion with low-level image features, we should use simple image features and focus on other issues such as the number of relevant images and the weight of the expansion vector.

FIG. 3. Query expansion using multiple image features: (a) α = 1.0, (b) m = 10.

TABLE 3
Query Expansion Using Color Histogram by Varying Acquisition Parameters of Color Histograms (α = 1.0, m = 10)

N_h             4           8           16          32          256
Avg. precision  0.389982    0.386260    0.378170    0.375146    0.364957

TABLE 4
Query Expansion Using DC Coefficient by Varying the Number of DCT Blocks (α = 1.0, m = 10)

N_c             24          96          384
Avg. precision  0.396136    0.388510    0.385814

TABLE 5
Query Expansion Using LBP Texture by Varying the Number of Total Bins (α = 1.0, m = 10)

N_b             16          256
Avg. precision  0.388981    0.389409

6. CONCLUSION

We have presented an image retrieval system that combines text-based queries and image features in a novel
approach. Using this system, a series of experiments on query expansion using text, color, grayscale, and texture features has been carried out. The following are some important findings:

• By themselves, low-level image features cannot infer semantic similarity and are therefore not suitable for use as initial queries. This is evident from the poor results when the initial text-based queries are discarded.

• A text-based initial query is a simple yet powerful tool for identifying the relevant image documents.

• Using image features for query expansion improves the results more significantly than using text annotations. The improvements in average precision over the baseline results are about 27% and 8%, respectively. Although we expected query expansion with image features to improve the average precision, it is a pleasant surprise to observe its superiority over the text annotations. We can only speculate that because the text feature has already been utilized in the first pass, its contribution in the second pass is considerably less. The text captions in our image collection are usually very short, which might also have limited the effectiveness of the text feature in query expansion. We hope to verify the validity of these speculations in further research.

• Combining image features and text in query expansion (together with the initial query vector) yields better results than using text only. However, there is no significant
difference when compared to using image features only in query expansion. This implies that image features play a more important role than text annotations in query expansion.

• The above observations are valid at high precision levels, which affect the ranking of the top image documents, as well as at all recall levels.

• The more expensive image feature (MRSAR) has not produced substantial improvements in performance over the other, simpler image features. It seems that this collection, which contains a mixed variety of images, is not sensitive to the low-level image features used.

• The performance is not sensitive to the acquisition parameters of the image features.

These findings have been obtained using an image collection of 1019 pictures of natural scenes, buildings, people, and events, accompanied by short descriptive captions which yield a total of 1161 unique indexed terms and 51 queries. The results are based on an objective precision-recall measure which can be verified. Certainly we do not claim that they will generalize to all types of image collection. Nevertheless, we believe the above findings are significant enough to warrant further investigation.

APPENDIX: DESCRIPTIONS OF IMAGE COLLECTION
Image Categories

A total of 51 categories are listed here. The number of images in each category is presented in brackets.

People
1. Portraits/individuals [4]
2. Group photos [9]
3. Chinese operas [30]
4. Occupations/trades [90]
Buildings
5. Government buildings [27]
6. Private buildings [75]
7. Religious buildings (churches, mosques, temples) [128]
8. Cemeteries [11]
9. Schools/universities/institutions of learning [29]
10. Housing estates (public and private housing estates) [36]
11. Markets [32]
12. Monuments [18]
13. Clan association buildings [2]
14. Theaters/cinemas [19]
15. Villages/kampongs [112]
16. Clubs/societies [12]
17. Hotels [12]
18. Banks [1]
19. Bungalows [2]
Scenes
20. Street scenes [49]
21. Waterfronts/Singapore river/beaches [28]
22. Aerial views [13]
23. Islands/offshore islands [18]
24. Harbors/wharfs/docks/slipways [10]
25. Bridges [10]
Events
26. Cultural (concerts, dramas, operas, films, arts festivals) [70]
27. Educational (schools, talks, lectures, students) [4]
28. Parades (national day parade, uniformed groups, and others) [2]
29. Social (clubs, societies, openings, farewells, celebrations) [2]
30. Sports [5]
31. Religious/rituals [13]
32. Medical [5]
33. Disasters (e.g., fire, flood) [1]
34. Campaigns [1]
35. Exhibitions/carnivals (expositions, trade fairs, etc.) [1]
36. Conferences/seminars/workshops/meetings/training courses/press [1]

Communications
37. Air [6]
38. Rail [7]
39. Road (including road works) [47]
40. Sea (ocean going vessels, local craft) [9]
Organizations
41. Social, voluntary, and uniformed groups [4]

Forestry/agriculture/fisheries
42. Plantations/farms [39]
43. Nature reserves/parks [13]
44. Fisheries [10]

Industry
45. Shipping [2]
46. Construction [4]
47. Cottage industry [34]
48. Metal mining [4]
Documents
49. Pre-1945 [7]

Others
50. Public utilities (gas, water, electricity) [1]
51. Symbols/logos/seals/badges [8]

Sample Images and Captions

FIG. 4. Sample images from database. Courtesy of National Archives of Singapore.

Figure 4 gives some sample images from various categories, where (a) is under category 1, (b) is under category 5, (c) is under category 25, (d) is under category 26, (e) is under category 40, and (f) is under category 42. The corresponding captions are as follows:
(a) Kg Sarhad—Malay lady cooking for Hari Raya Haji Festival.
(b) MRTC headquarters.
(c) Benjamin Sheares bridge.
(d) Glove puppet show—puppets on stage.
(e) Princess Mahsuri at sea.
(f) Pulau Ubin orchid farm.

Initial Queries

Some samples of the 51 initial queries are as follows:

1. Portrait, individual.
2. Photo.
3. Chinese, opera.
4. Occupation, trade.
5. Government, building.
6. Private, building.
7. Religious, building, church, mosque, temple.
8. Cemetery.
9. School, university, institution, learn.
10. House, estate.

REFERENCES
1. G. Salton, Automatic Text Processing, Addison–Wesley, Reading, MA, 1989.
2. N. C. Rowe and E. J. Guglielmo, Exploiting captions in retrieval of multimedia, Inform. Process. Management 29(4), 1993, 453–461.
3. S. Flank, P. Martin, A. Balogh, and J. Rothey, PhotoFile: A digital library for image retrieval, in Proceedings, International Conference on Multimedia Computing and Systems, Washington, D.C., 1995, pp. 292–295.
4. A. F. Smeaton and I. Quigley, Experiments on using semantic distances between words in image caption retrieval, in Proceedings, 19th ACM SIGIR Conference on R&D in Information Retrieval, Zurich, 1996, pp. 174–180.
5. R. Mohan, Text-based search of TV news stories, in Proceedings, Multimedia Storage and Archiving Systems, Boston, 1996, SPIE Proceedings, Vol. 2916, pp. 2–13.
6. H. Zhou, S. Chan, and K. F. Lai, Multilingual information retrieval system, in Proceedings, Multimedia Storage and Archiving Systems, Boston, 1996, SPIE Proceedings, Vol. 2916, pp. 33–45.
7. H. Chen, B. Schatz, T. Ng, J. Martinez, A. Kirchhoff, and C. Lim, A parallel computing approach to creating engineering concept spaces for semantic retrieval: The Illinois digital library initiative project, IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 1996, 771–782.
8. W. B. Croft, Text and multimedia, presented at the MIRO Workshop, Glasgow, Scotland, 1995.
9. P. G. B. Enser, Query analysis in a visual information retrieval context, J. Document Text Management 1(1), 1993.
10. W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin, The QBIC project: Querying images by content using color, texture and shape, in SPIE Proceedings on Storage and Retrieval of Image and Video Databases, San Jose, 1993, pp. 173–181.
11. Virage Engine, 1995, available at http://www.virage.com/.
12. A. Pentland, R. W. Picard, and S. Sclaroff, Photobook: Tools for content-based manipulation of image databases, Int. J. Comput. Vision 18(3), 1996, 233–254.
13. K. Han and S. Myaeng, Image organization and retrieval with automatically constructed feature vectors, in Proceedings, 19th ACM SIGIR Conference on R&D in Information Retrieval, Zurich, 1996, pp. 157–165.
14. H. S. Sawhney and S. Ayer, Compact representations of videos through dominant and multiple motion estimation, IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 1996, 814–830.
15. B. S. Manjunath and W. Y. Ma, Texture features for browsing and retrieval of image data, IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 1996, 837–842.
16. M. Shneier and M. Abdel-Mottaleb, Exploiting the JPEG compression scheme for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 1996, 849–853.
17. T. Poggio and V. Torre, Ill-posed problems and regularization analysis in early vision, in Proceedings, AARPA Image Understanding Workshop, 1984, pp. 257–263.
18. H. Samet and A. Soffer, MARCO: Map retrieval by content, IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 1996, 783–798.
19. N. K. Ratha, K. Karu, S. Chen, and A. K. Jain, A real-time matching system for large fingerprint databases, IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 1996, 799–813.
20. D. L. Swets and J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 1996, 831–836.
21. G. Healey and A. Jain, Retrieving multispectral satellite images using physics-based invariant representations, IEEE Trans. Pattern Anal. Mach. Intell. 18(8), 1996, 842–848.
22. R. W. Picard, Toward a visual thesaurus, presented at the MIRO Workshop, Glasgow, Scotland, 1995.
23. R. W. Picard and T. P. Minka, Vision texture for annotation, Multimedia Systems 3, 1995, 3–14.
24. R. K. Srihari, Automatic indexing and content-based retrieval of captioned images, IEEE Computer, Sept. 1995, 49–56.
25. D. C. Blair and M. E. Maron, Full-text information retrieval: Further analysis and clarification, Inform. Process. Management 26(3), 1990, 437–447.
26. G. Miller, WordNet: An on-line lexical database, Int. J. Lexicography 3(4), 1990.
27. E. M. Voorhees, Query expansion using lexical-semantic relations, in Proceedings, 17th ACM SIGIR Conference on R&D in Information Retrieval, Dublin, 1994, pp. 61–69.
28. D. K. Harman, The first text retrieval conference (TREC-1), Inform. Process. Management 29(4), 1993, 411–414.
29. A. F. Smeaton and C. J. van Rijsbergen, The retrieval effects of query expansion on a feedback document retrieval system, Comput. J. 26, 1983, 239–246.
30. C. T. Yu, C. Buckley, and G. Salton, A generalized term dependency model in information retrieval, Inform. Technol. Res. Dev. 2, 1983, 129–154.
31. H. J. Peat and P. Willett, The limitations of term co-occurrence data for query expansion in document retrieval systems, J. Amer. Soc. Inform. Sci. 42(5), 1991, 378–383.
32. J. Allan, L. Ballesteros, J. P. Callan, W. B. Croft, and Z. Lu, Recent experiments with INQUERY, in The Fourth Text REtrieval Conference (TREC-4), Washington, D.C., 1995, pp. 49–64.
33. S. E. Robertson, W. Walker, M. M. Beaulieu, M. Gatford, and A. Payne, Okapi at TREC-4, in The Fourth Text REtrieval Conference (TREC-4), Washington, D.C., 1995, pp. 73–96.
34. K. L. Kwok and L. Grunfeld, TREC-4 ad-hoc, routing retrieval and filtering experiments using PIRCS, in The Fourth Text REtrieval Conference (TREC-4), Washington, D.C., 1995, pp. 145–152.
35. M. Hearst, J. Pedersen, P. Pirolli, H. Schutze, G. Grefenstette, and D. Hull, Xerox site report: Four TREC-4 tracks, in The Fourth Text REtrieval Conference (TREC-4), Washington, D.C., 1995, pp. 97–120.
36. C. Buckley, A. Singhal, M. Mitra, and G. Salton, New retrieval approaches using SMART: TREC-4, in The Fourth Text REtrieval Conference (TREC-4), Washington, D.C., 1995, pp. 25–48.
37. T. Strzalkowski and J. P. Carbello, Natural language information retrieval: TREC-4 report, in The Fourth Text REtrieval Conference (TREC-4), Washington, D.C., 1995, pp. 245–258.
38. E. M. Voorhees, Siemens TREC-4 report: Further experiments with database merging, in The Fourth Text REtrieval Conference (TREC-4), Washington, D.C., 1995, pp. 121–130.
39. Y. Gong, H. C. Chua, and X. Guo, Image indexing and retrieval based on color histograms, in MMM'95: The International Conference on Multimedia Modeling, Singapore, November 14–17, 1995, pp. 115–126.
40. F. Arman, A. Hsu, and M. Chiu, Image processing on compressed data for large video databases, in ACM Multimedia 93, 1993, pp. 267–272.
41. V. Kobla, D. Doermann, and K. Lin, Archiving, indexing, and retrieval of video in the compressed domain, in Multimedia Storage and Archiving Systems, Boston, 1996, SPIE Proceedings, Vol. 2916, pp. 78–89.
42. W. Wan and C.-C. J. Kuo, Image retrieval based on JPEG compressed data, in Multimedia Storage and Archiving Systems, Boston, 1996, SPIE Proceedings, Vol. 2916, pp. 104–115.
43. J. Mao and A. K. Jain, Texture classification and segmentation using multiresolution simultaneous autoregressive models, Pattern Recognition 25(2), 1992, 173–188.
44. L. Wang and D. He, Texture classification using texture spectrum, Pattern Recognition 23(8), 1990, 905–910.
45. T. Ojala, M. Pietikainen, and D. Harwood, Performance evaluation of texture measures with classification based on Kullback discrimination of distributions, in Proceedings, International Conference on Pattern Recognition, 1994, Vol. 1, pp. 582–585.
KOK-FUNG LAI graduated with the B.Eng. (Elect.), first class honours, from the National University of Singapore in 1988 and received the Ph.D. in electrical engineering from the University of Wisconsin at Madison in 1994. He worked at AT&T Consumer Products (Singapore), the Hong Kong Sino Software Research Center, and the University of Wisconsin before returning to the Information Technology Institute (ITI) of Singapore in 1994. While at ITI and later at Kent Ridge Digital Labs (KRDL), he led an engineering team to develop and deploy technology and products globally. These include projects with Alta Vista Internet Service, Schering–Plough Research Institute, Alam Teknokrat (Malaysia), Singapore Press Holdings, and various government agencies in Singapore. Dr. Lai is the winner of the 1998 Tan Kah Kee Young Inventors' Awards, the KRDL Excellence Award, and the ITI Technical Innovation Award. He is currently Chief Executive Officer of NetBeacon Technology, a start-up company specializing in providing tools and solutions for vertical business portals.
ZHOU HONG received her B.Eng. and M.Eng. in 1990 and 1994, both from the Department of Computer Science at Tsinghua University, China, and received her M.A.Sc. from Nanyang Technological University, Singapore in 1998. Her research areas include information retrieval, image processing, image compression, multilingual, multimedia, and internet applications. She is currently working at a consulting company in the United States.
CHAN SYIN received her B.Eng. with first class honours in electrical engineering from the National University of Singapore in 1987, and her Ph.D. in computer science from the University of Kent, United Kingdom, in 1993. She is currently a lecturer with the School of Applied Science at Nanyang Technological University, Singapore. Her research interests include image coding and multimedia information retrieval.