Signal Processing: Image Communication 28 (2013) 624–641
Context-aware mobile image annotation for media search and sharing

Zhen Li a, Kim-Hui Yap a,*, Kiat-Wee Tan b

a School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798, Singapore
b School of Information Systems, Singapore Management University, Singapore

Article history: Received 5 July 2012; Accepted 1 January 2013; Available online 7 February 2013

Abstract
In recent years, rapid advances in media technology including acquisition, processing and distribution have led to the proliferation of many mobile applications. Amongst them, one of the emerging applications is mobile-based image annotation, which uses camera phones to capture images with system-suggested tags before uploading them to media sharing portals. This procedure can offer information to mobile users and also facilitate the retrieval and sharing of the images for Web users. However, the context information that can be acquired from mobile devices is underutilized in many existing mobile image annotation systems. In this paper, we propose a new mobile image annotation system that utilizes content analysis, context analysis and their integration to annotate images acquired from mobile devices. Specifically, three types of context (location, user interaction and the Web) are considered in the tagging process. An image dataset of the Nanyang Technological University (NTU) campus has been constructed, and a prototype mobile image tag suggestion system has been developed. The experimental results show that the proposed system performs well in terms of both effectiveness and efficiency on the NTU dataset, and shows good potential for domain-specific mobile image annotation for image sharing. © 2013 Elsevier B.V. All rights reserved.
Keywords: Media search and sharing; Mobile devices; Tag suggestion; Content and context integration
1. Introduction

In recent years we have witnessed a phenomenal boom in the usage of mobile devices. Most mobile phones in use nowadays have a built-in camera and are known as camera phones. Camera phone users often upload the captured media (photos, videos, etc.) to various media sharing websites such as Facebook.com, Flickr.com and Renren.com; for example, about one-third of Facebook posting is done through mobile devices [1]. The growing usage of mobile camera phones is becoming an important source of Web media content, while current technology also allows us to browse most of the Web media using mobile devices. As such we can foresee the upcoming convergence of Web media and mobile
* Corresponding author. E-mail address: [email protected] (K.-H. Yap).
http://dx.doi.org/10.1016/j.image.2013.01.003
media in the near future. Consequently, an emerging technology for Web media search and sharing is centred on mobile image annotation, which uses camera phones to capture and annotate images before uploading them to media sharing portals. At present, Web media normally do not have semantic tags; as a result, media can only be efficiently shared amongst small circles of friends. For more efficient Web media search and sharing, users will benefit greatly from annotating images with meaningful tags, as this will make text-based search of media on the Internet possible. However, it is rather labour intensive to manually annotate all the existing Web media. One way to address media search and sharing is to annotate the Web media that are newly captured and uploaded by users. Annotating newly captured mobile images with semantic tags manually can be cumbersome, as the keyboard of a mobile device is usually tiny and mobile users are usually on the move. At times, the mobile users may not even be sure about
what tags are suitable to annotate the photos. It is therefore advantageous to use computerized annotation methods instead of manual annotation for mobile images.

In the last decade, various image annotation schemes have been proposed, which can be broadly divided into two categories: non-mobile based systems and mobile based systems. In non-mobile based systems, a number of computerized image annotation methods have been developed [2–12], which focus on the development of new algorithms for content-based image classification. The early works [2–4] involve relatively simple models and mainly focus on datasets with few categories. Later on, more sophisticated models including variants of [3] have been developed [5–12] to map image features to annotations by supervised or unsupervised learning. These methods essentially use probabilistic models for semantic image annotation, which is posed as a classification problem where each class is defined as the group of database images labelled with a common semantic label. Although some state-of-the-art non-mobile image annotation systems have achieved good performance at the cost of high computational complexity, they underutilize the unique features of mobile devices, i.e. the context information that can be acquired on camera phones, such as location information (e.g. captured by the Global Positioning System (GPS) or WiFi localization), direction information (recorded by a digital compass), time information, domain information (e.g. interaction between the user and the information server) and Web information, all of which are available to enhance the analysis and processing.

In view of this, researchers have started to redirect their focus to analyzing mobile images using context information acquired from mobile devices [13–15]. In [13], a new photo is assigned a label by propagating the labels of the photos taken within the same location. ZoneTag [14] is a mobile phone application that facilitates mobile users to upload newly captured images to Flickr and then obtain suggested tags, which include the location tags and the tags previously used in the nearby area. In [15], context information including GPS, time and event is considered for mobile image annotation and retrieval, and the final relevance feedback is obtained by weighting the similarities of the context. Their annotation procedure is semi-automatic and needs manual input. In these works, only the context information is considered, but the content of mobile images is not analyzed. Without content analysis, the annotation results will still be suboptimal.

In order to further enhance the annotation performance, it is therefore imperative to combine the context information with content analysis in mobile systems. In [16], the content of mobile images is analyzed based on local descriptors, and user interaction is considered as a source of context information. In [17–20], the GPS location information is utilized to assist in content-based mobile image annotation. With the aid of the location information, the challenge in differentiating similar images that are captured in different areas can be reduced substantially. In these systems, the GPS location is utilized to filter the results of content analysis. In [17], GPS information is used to index the geo-referenced object database and restrict object recognition to the vicinity of a local area. In
[18], GPS information is first used to narrow down the search to a small candidate list, and then visual features are used to analyze content. In [19], all photos outside a certain radius of the query image are removed from the collection, leaving the nearby images to be ranked. In the authors' previous work on landmark image annotation [20], the context information (location information alone or a combination of location and direction information) is first adopted to shortlist the landmark candidates, and then content analysis is performed to determine which landmark the captured image belongs to. In these methods, the content analysis is essentially filtered by a pre-defined area centred at the logged GPS location of the query image.

In these mobile context-based annotation systems, the location information is mainly used to restrict the tag propagation process within a local region centred at the GPS location of the mobile device. However, this "hard propagation" strategy is insufficient since the GPS error of the captured image tends to be very high in dense built-up areas. If the GPS error of the query image is too large, the database images taken at the erroneous GPS location will be chosen for tag propagation, and improper tags will be assigned to the query image. In addition, the tags are propagated and assigned to the query images without considering the different tag distributions at different places. For example, the tag "tree" is suitable to annotate images taken at various places, while the tag "Hall 7" is only applicable to images taken in a specific area. Therefore, it is difficult to determine the search range centred at the logged GPS location.

In this work, a mobile image tag suggestion system that integrates content analysis and context analysis is developed. With the aid of the proposed image tag suggestion system, the camera phone user can capture an image with his/her mobile phone, and then provide interactive feedback on the tags suggested by the system. The mobile user can upload the annotated image to media sharing portals. General Web users can then perform text-based retrieval of the images shared by camera phone users through search engines such as Google and Yahoo. There are several key contributions of this work: (1) We have developed a new system prototype for mobile image annotation that integrates content information with context information to enhance the annotation performance. The developed system includes the hardware implementation with a server–client architecture, the mobile image dataset, the software platform and the mobile-user interaction interface on the mobile phone. (2) We develop an integration method that fuses the content-based tags and the context-based tags. It is shown that the integration has fast computation speed and exhibits good performance when compared with using content or context analysis alone.

2. Proposed mobile prototype for image tagging and sharing

The overview of the proposed mobile scene image annotation system is given in Fig. 1. It consists of the following components: content analysis, context analysis, integration of content and context analysis, and interactive annotation. In content analysis, an image classification
Fig. 1. Overview of the proposed mobile image annotation system: content analysis and context analysis produce content-based and context-based tags respectively, which are fused by tag integration into the suggested tags (e.g. outdoor, garden, landscape, pavilion, grass, Yunnan garden, sky); the annotated image is then uploaded to media sharing portals.
scheme of low computational cost is adopted, and an image annotation scheme based on the classification is then developed. In context analysis, a new annotation approach is proposed to annotate newly captured geotagged mobile images based on the geotagged mobile images with semantic tags provided by the mobile user community. Both the location information and the user interaction are considered. The content-based and context-based tags are then fused by the proposed integration method, in which Web information is incorporated. Finally, based on the Android mobile platform, we provide the user with a graphical user interface where the interactive annotation is performed. The users can easily choose the system-suggested tags that they prefer, and then upload the image and the chosen tags to the media portals, which will in turn take effect in future tag suggestion. If the mobile user prefers the system to annotate the images automatically, he/she can directly upload the image with the suggested tags to the media portals. The system prototype is implemented in both hardware and software. For evaluation of the system performance in a specific domain, we have also constructed a mobile image dataset in NTU.

2.1. System prototype development

The implementation of the system prototype is achieved by addressing the following key issues: (i) limited resources of the mobile devices, (ii) deployment strategy, and (iii) retrieving, handling and processing data. A practical client–server architecture has been implemented for the interactive mobile image tagging system, based on the following hardware/software specifications: Intel Core (TM) 2 Quad processor (2.25 GHz), 4 GB RAM, MS Windows XP operating system, and Java with Matlab R2008a for algorithm implementation.

2.1.1. Limited resources of mobile devices

Both mobile and non-mobile/PC-based image annotation require extraction of content-based image features and training of classifiers using these features. However, the inherent challenges and limitations of mobile devices should be considered when designing a mobile image annotation system. Although mobile devices have been equipped with more powerful processors and larger memory, they are still limited in the handling and processing of large datasets.
Mobile devices differ from desktop environments in several aspects: (i) tradeoff between tagging accuracy and the fast-response-time requirement: the fast-response-time requirement of mobile image tagging places a significant constraint on what types of content and context analysis techniques can be employed in real systems while maintaining a high tagging accuracy; and (ii) battery energy will remain the key bottleneck for mobile devices in the near future [21]. As such, if the mobile image tagging process is performed on the mobile devices, the tagging efficiency cannot be guaranteed and the battery will run out of power in a short time. With this consideration, it is advisable to adopt a client–server architecture, in which the mobile images are transmitted through the telecommunication networks to the PC-based server for the content analysis and the context analysis, and the suggested tags for the mobile image are then transmitted back to the mobile clients. In doing so, the processing time can be reduced by shifting the computational load onto the server, which is equipped with more computational power, without being constrained by the processing capacity or the battery power of mobile devices.

The system is implemented using the Representational State Transfer (REST) Web Services architecture as illustrated in Fig. 2. This REST-based approach is adopted in view of the established standards both in industry as well as within the development framework, such as software development kits and established application programming interfaces. Using REST web services with the Hypertext Transfer Protocol (HTTP) over the existing Internet stack, the web services host the processing methods on the backend server. During operation, the client (service requester) invokes a remote processing method (request) on the server (service provider).

2.1.2. Deployment strategy

The high-level deployment strategy is shown in Fig. 3: the system is deployed across the wireless channel in cellular networks (3G/2G) as well as Wi-Fi (802.11a/b/g). This wireless coverage in turn provides the mobile client a path through the wired backend, via the Internet backbone, to the backend server. As the service uses the Internet stack, it can be deployed over various wired and wireless channels that support it (e.g. WiMAX). However, this also creates a network bottleneck, as the wireless channel is still the slowest section in the path, limited by the available base stations, the distance away from them and the number of users attached to them.
Fig. 2. Web services architecture of the proposed system.
Fig. 3. Illustration of high-level deployment of the system prototype.
2.1.3. Retrieving, handling and processing data

Using the client–server architecture, the process of retrieving, handling and processing the data can also be viewed within the same context. On the client side, the process is concerned with retrieving and handling the data to and from the server. For the server, the operation is concerned with processing the data as well as exchanging the data with the client. The illustration of data processing on the client side is shown in Fig. 4. On the client side, the Google Android platform (using the HTC Dream handset) is selected, which provides a comprehensive platform capable of capturing digital images and obtaining context information, with 3G and WiFi connectivity for communication with the server. The information collected from the Google Android platform includes (i) the digital image from the camera on the phone and
(ii) location information provided by the built-in GPS module in longitude/latitude. Once the above data are collected, they are processed such that they can be handled easily by the server across the wireless channel. The returned tag list is processed and presented via the graphical user interface, which allows further user interaction such as tagging the captured image with the result set in terms of tagging keywords and description, building the user image album, and uploading it to social networking sites such as Flickr. The illustration of data processing on the server side is given in Fig. 5. On the server side, the system runs an Apache Tomcat application server that houses the main processing logic handling the data coming in from or going out to the client. The processing logic integrates the MATLAB algorithms together with a relational database using MySQL.
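To make the client–server exchange concrete, the following is a minimal Python sketch of the kind of request a client could issue to the tag-suggestion web service. The endpoint URL, parameter names and JSON layout are illustrative assumptions, not the actual interface of the NTU prototype (whose client is an Android/Java application).

```python
import requests  # third-party HTTP client library

# Hypothetical REST endpoint; the real prototype's URL and field names are not given in the paper.
SERVER_URL = "http://example-server:8080/tagsuggest"

def request_tag_suggestions(image_path, latitude, longitude):
    """POST a captured image plus its GPS fix and return the server's suggested tag list."""
    with open(image_path, "rb") as f:
        files = {"image": f}
        data = {"lat": str(latitude), "lon": str(longitude)}
        # The server runs content analysis, context analysis and tag fusion,
        # then replies with the ranked tags (assumed here to be JSON).
        response = requests.post(SERVER_URL, files=files, data=data, timeout=30)
    response.raise_for_status()
    return response.json().get("suggested_tags", [])

if __name__ == "__main__":
    # Example call with an (illustrative) GPS fix on the NTU campus.
    print(request_tag_suggestions("photo.jpg", 1.3483, 103.6831))
```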
Fig. 4. Illustration of data processing on the client side.
Fig. 5. Illustration of data processing on the server side.
2.2. Graphical user interface

Within the whole annotation procedure, the steps of the mobile user's operations on the mobile platform are illustrated in Fig. 6. The mobile device works as the client. It captures images and sends each image together with any context information (e.g. location) through the network connection to the server. The server performs content and context analysis of the captured scene image, and then fuses the two lists of tags to get the final suggested tags. It then sends the suggested tags back to the mobile user. When the mobile user wants to annotate an image, he/she selects the "tag suggestion" mode, captures the image, and simply presses the "suggest tags" button. After several seconds, the mobile user receives the suggested tags. The mobile user can simply select tags in the checkboxes to indicate whether the tags are suitable or not. When the user confirms the proper tags, he/she can upload the photo along with these tags to the Web for sharing.

2.3. Dataset construction

For the purpose of validating the effectiveness of the proposed schemes for mobile image annotation, we require a dataset of labelled and geotagged photos. Therefore, we created a mobile image dataset consisting of 3916 images in 25 main concepts of activities/scenes/landmarks from the campus of Nanyang Technological University (NTU). We chose the NTU campus for collecting mobile images mainly because it facilitates: (i) collecting a large number of images in a confined geographic area for validation of the system in a domain-specific application and (ii) evaluating the system performance in terms of recognition accuracy
and processing time. To construct the dataset, we provided GPS mobile cameras (Dopod D810, O2 Atom Life and GloFiish X800) for taking images on the NTU campus, including scene images, landmarks and activities. The volunteers were instructed to take photos around the NTU campus, at their leisure, as if they were travelling around the campus with their own mobile cameras. The images were captured using camera phones with different camera settings (contrast, saturation, color balance, and resolution), from different viewpoints (near, far, side, front, left, right, full, part, etc.), at different times (morning and afternoon), and under different weather (sunny and cloudy) and illumination conditions. The descriptions of the 25 concepts are listed in Table 1, and some sample images from the 25 concepts are shown in Fig. 7. After the photo taking, we collected 3166 images as the training set and 750 images as the testing set.

After establishing the geotagged mobile image dataset, we asked 20 volunteers to provide all the possible tags for every photo in the training and testing sets. By merging very similar tags (such as "student" and "students"), we obtained an ensemble of 1095 tags for mobile image annotation on the NTU campus. The tags cover most of the concepts on the NTU campus, e.g. "student", "canteen", "hall", "building", and "lecture theater". After collecting the tag ensemble, we asked 16 volunteers to annotate the images in both the training set and the testing set with ground-truth tags. The tagging process includes two levels: image-wise tagging and category-wise tagging. The image-wise tags are used for training and evaluation of the context-based image annotation, and the category-wise tags are used in the content-based annotation, which is based on image categorization. In the image-wise tagging, for each image, the volunteers were required to provide about 6–8 related tags out of the tag ensemble. Furthermore, we also asked the volunteers to assign a confidence score w to each tag to indicate how well the tag describes the image. Some examples of images with the ground-truth tags along with the confidence scores are shown in Fig. 8. For tags closely related to the image (e.g. "blue water" and "people"), the confidence score is set to 2; for related tags that are not the main concepts of the image (e.g. "greenery") or for which the volunteer is not sure (e.g. "relaxing"), the confidence score is set to 1; the remaining tags of the tag ensemble are irrelevant and their confidence score is set to 0. The mobile images along with their related (ground-truth) tags and corresponding confidence scores are stored in the NTU mobile image dataset. In the category-wise tagging, the volunteers were required to provide about six common tags to describe each category, and we finally obtained 140 category tags for content-based annotation.

3. Context analysis for mobile image annotation

Amongst all the context information available on mobile devices, the most important and popular one is the location information captured by GPS (outdoor) and WiFi (indoor). More and more mobile phones are now equipped with GPS and WiFi, including the Apple iPhone 4S,
Fig. 6. Mobile platform for interactive tag suggestion.
HTC Droid Incredible and Nokia N97. It is reported that advances in GPS chipset development will allow embedding of GPS in every mobile device by 2013 [22]. Hence, location information should be considered for mobile image annotation and sharing. In this section, we propose a tagging approach that suggests tags for newly captured mobile images based on the mobile images that have been both geotagged and annotated by previous mobile users. In the proposed method, the tags are associated with the locations using Gaussian Mixture Models (GMMs). In the GMM modelling procedure, the relevance between tags and mobile images provided by the mobile user community is incorporated, so the obtained GMMs can
capture the mobile users’ intentions when capturing the mobile images. Due to the continuous nature of the GMM model, every tag will be assigned a non-zero score for the query image even if the GPS location drifts with a large error. As such, the context-based tags with relatively low scores can be potentially chosen as top suggestion tags when integrating with content-based tags. In contrast, in the existing mobile image systems which pre-filter content analysis by a certain area of locations, some correct tags by content analysis may not pass the pre-filtering. In addition, another type of context information, i.e. interaction between mobile users and the tagging system, is also incorporated in the tagging process, so the mobile
user community’s intentions will facilitate the tag suggestion in future tagging.
3.1. Proposed context-based annotation scheme

Let the set of tags for all the mobile images in the dataset be T = {T_1, T_2, ..., T_N}, and denote the set of the images that contain tag T_k in their annotations by I(T_k). We can then collect all the location coordinates in I(T_k), and denote the set of coordinates as G(T_k), with each element represented as a two-dimensional vector.

Table 1. Descriptions of concepts in the NTU mobile image dataset (number, concept, description).
1. Academic indoor room: Indoor view of academic buildings
2. Animal: Small animals in campus
3. Auditorium indoor: Indoor view of auditorium
4. Badminton: Activities in badminton hall
5. Basketball: Activities in basketball court
6. Bicycle: Bicycle and cycling
7. Bus stop: View in bus stop
8. Canteen: Indoor view of canteen
9. Car: Vehicles and driving
10. Food: Foods and eating
11. Football: Activities in soccer field
12. Garden view: Garden view and sceneries
13. General activity: Crowd activities
14. Gym: Activities in gymnasium
15. Hall outside: Outdoor view of hall/hostel
16. Hall room: Indoor view of hall/hostel
17. Lakeside view: Lakeside view
18. Library: Library and studying
19. Medical centre: Activities in medical centre
20. Portrait: Students and visitors
21. Road view: Road view
22. Sunrise and sunset: Sunrise and sunset
23. Swimming pool: Activities in swimming pool
24. Table tennis: Activities in table tennis hall
25. Tennis: Activities in tennis court

A generative modelling method is adopted here for associating tags with the image taken at a given location, in which the observed location x ∈ G(T_k) is characterized by the tag-conditional density f_k(x) for each tag k = 1, 2, ..., N. Let P_k denote the prior probability of the occurrence of tag k. According to the Bayes rule, the posterior probability P(k|x) that an arbitrary observation x corresponds to tag k is

$$P(k\,|\,x) = \frac{P_k f_k(x)}{\sum_{i=1}^{N} P_i f_i(x)} \qquad (1)$$

The overall tagging process in a Bayesian decision framework is illustrated in Fig. 9. For each tag in the tag ensemble, we construct a probabilistic model based on a GMM separately, and then use the Bayes rule to get the posterior probability. In this way, the computational cost for learning the tag distributions is linearly proportional to the number of tags.
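As an illustration, the Bayes classification rule in (1) can be sketched as follows in Python, assuming the per-tag location densities f_k have already been fitted (here represented by scikit-learn GaussianMixture objects as a stand-in for the GMM modelling described in Section 3.2):

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # stand-in for the per-tag GMMs of Section 3.2

def tag_posteriors(location, tag_gmms, tag_priors):
    """Eq. (1): posterior P(k|x) over all tags for one GPS observation x.

    location   : (2,) array-like, e.g. [longitude, latitude]
    tag_gmms   : dict tag -> fitted GaussianMixture modelling the density f_k
    tag_priors : dict tag -> prior probability P_k of the tag
    """
    x = np.asarray(location, dtype=float).reshape(1, -1)
    scores = {}
    for tag, gmm in tag_gmms.items():
        density = float(np.exp(gmm.score_samples(x))[0])   # f_k(x)
        scores[tag] = tag_priors[tag] * density            # numerator P_k f_k(x)
    total = sum(scores.values()) or 1e-12                  # denominator of eq. (1)
    return {tag: s / total for tag, s in scores.items()}

# Usage: rank tags for a query image by its logged GPS location.
# posterior = tag_posteriors(query_gps, tag_gmms, tag_priors)
# top_tags = sorted(posterior, key=posterior.get, reverse=True)[:10]
```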
Fig. 7. Sample images from the NTU Scene-25 dataset.

Fig. 8. An example in the NTU mobile image dataset with ground-truth tags and confidence scores w: a swimming pool image tagged blue water (2), boy (2), open-air (2), people (2), pool (2), swimmers (2), greenery (1), relaxing (1), sport (1) and game (1).

Fig. 9. Proposed context-based mobile image tagging scheme: for each tag, the GPS locations of its training images are extracted and a GMM is fitted; the resulting per-tag models, together with the tag priors, are combined through the Bayes classification rule.

3.2. Combining tag distribution modelling with user interaction

After the location coordinates collection, for each tag T_k, k = 1, 2, ..., N, we have locations x ∈ G(T_k); then the probability density of a GMM with K components can be
represented as

$$f(x) = \sum_{j=1}^{K} \pi_j \, \phi_{\Sigma_j}(x - u_j) \qquad (2)$$

$$\phi_{\Sigma_j}(x - u_j) = \frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x - u_j)^{T} \Sigma_j^{-1} (x - u_j)\right) \qquad (3)$$

where the $\pi_j$ are mixing coefficients, which are non-negative and sum to one, and $\phi_{\Sigma_j}(x - u_j)$ denotes the multivariate normal density with mean vector $u_j$ and covariance matrix $\Sigma_j$. The fitting of the parameters $\pi_j$, $u_j$ and $\Sigma_j$ is carried out by maximizing the likelihood of the parameters to the training data x ∈ G(T_k). Here we use expectation maximization (EM, [23]) to maximize the log-likelihood as follows:

$$\theta^{*} = \arg\max_{\theta \in \Theta} p(X|\theta) = \arg\max_{\theta \in \Theta} \sum_{x \in G(T_k)} \log\left(\sum_{j=1}^{K} \pi_j \, \phi_{\Sigma_j}(x - u_j)\right) \qquad (4)$$

where $\theta = \{\pi_j, u_j, \Sigma_j\}_{j=1}^{K}$. After the initialization, the EM first performs the E-step, which estimates the posterior probability of a data point x belonging to the j-th component in the t-th iteration by the Bayes rule:

$$s_j^{(t)}(x) = \frac{\pi_j^{(t)} \phi_{\Sigma_j}(x - u_j)}{\sum_{i=1}^{K} \pi_i^{(t)} \phi_{\Sigma_i}(x - u_i)} \qquad (5)$$

However, in the context of this application, we regard that the posterior probability not only depends on the Gaussian function, but also depends on the relevance between the tag being modelled and the location x where the corresponding image was taken. This consideration aims to incorporate the mobile users' intentions into the context-based annotation procedure. The relevance value is estimated by the three-level confidence score w provided by the mobile user community as described in Section 2.3. A higher relevance value indicates that the tag matches better with the corresponding location. For example, at a specific location surrounded by several different scenes/landmarks, if the majority of mobile users regard it as a good position to take photos of one scene/landmark, the tags provided by the mobile user interactions will concentrate on some specific tags for this scene/landmark, and the relevance value between these tags and the location will tend to be high. So, when a new mobile image is taken around the location, it is better to annotate it using the tags with high relevance values. The relevance value can be periodically updated from the mobile user interaction. In order to ensure that the centre of each GMM component of locations takes the relevance value into consideration, we define the posterior probabilities by incorporating the relevance values as follows:

$$s_j^{*}(x) = s_j(x)\, w(x, T_k) \qquad (6)$$

where $w(x, T_k)$ is the relevance value between tag $T_k$ and the image taken at the location x. In this way, the centre of each component will be closer to the locations that have higher confidence scores for $T_k$, and thus the conditional probability of the tag at such locations will tend to be higher. By considering the mobile user intentions, the parameters of the GMM are updated as follows:

$$\pi_j^{(t+1)} = \frac{1}{N} \sum_{x \in G(T_k)} s_j^{*(t)}(x) \qquad (7)$$

$$u_j^{(t+1)} = \frac{\sum_{x \in G(T_k)} s_j^{*(t)}(x)\, x}{\sum_{x \in G(T_k)} s_j^{*(t)}(x)} \qquad (8)$$

$$\Sigma_j^{(t+1)} = \frac{\sum_{x \in G(T_k)} s_j^{*(t)}(x)\, (x - u_j)(x - u_j)^{T}}{\sum_{x \in G(T_k)} s_j^{*(t)}(x)} \qquad (9)$$

The E-step and M-step are performed iteratively until the log-likelihood $p(X|\theta)$ is maximized, resulting in the optimum parameters $\theta^{*} = \{\pi_j, u_j, \Sigma_j\}_{j=1}^{K}$. Note that the GMM is a universal approximator that can approximate any continuous density to arbitrary accuracy with a sufficiently large K. However, if K is too large, there will be only a few samples assigned to each Gaussian component, so the centre and the covariance calculated may be inaccurate or even singular. In view of this, we need to impose some constraint by regularization of the covariance matrix as follows:

$$\Sigma_j^{*} = \alpha \Sigma_j + (1 - \alpha)\, \lambda I \qquad (10)$$

where $\lambda$ is a small regularization factor, I is the identity matrix and $\alpha$ is a variable within the range [0,1] that makes a tradeoff between data fidelity and the regularization functional. Note that the regularization has its practical meaning because the GPS positioning has an intrinsic random error, so the variances cannot be arbitrary. We use a 10-fold cross-validation to determine the optimal parameters.
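A minimal Python sketch of the user-interaction-weighted EM of (5)–(9) with the covariance regularization of (10) is given below for one tag's two-dimensional GPS locations. It is a simplified illustration (random initialization, a fixed number of iterations, regularization applied at every M-step) rather than the authors' MATLAB implementation.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Multivariate normal density of eq. (3), evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    inv = np.linalg.inv(cov)
    expo = -0.5 * np.einsum("ij,jk,ik->i", diff, inv, diff)
    norm = np.sqrt(((2.0 * np.pi) ** d) * np.linalg.det(cov))
    return np.exp(expo) / norm

def fit_weighted_gmm(X, w, K, alpha=0.8, lam=0.1, n_iter=50, seed=0):
    """Fit a K-component GMM to the locations X (N x 2) of one tag T_k.

    w holds the confidence scores w(x, T_k) from user interaction; they re-weight
    the E-step responsibilities as in eq. (6). alpha and lam implement the
    covariance regularization of eq. (10).
    """
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    N, d = X.shape
    rng = np.random.default_rng(seed)
    means = X[rng.choice(N, size=K, replace=False)].copy()   # crude initialization
    covs = np.array([np.cov(X.T) + lam * np.eye(d)] * K)
    pis = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step, eq. (5): component responsibilities for each location
        dens = np.stack([pis[j] * gaussian_pdf(X, means[j], covs[j]) for j in range(K)], axis=1)
        resp = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-12)
        resp = resp * w[:, None]                              # eq. (6): s*_j(x) = s_j(x) w(x, T_k)
        # M-step, eqs. (7)-(9)
        Nj = resp.sum(axis=0)
        pis = Nj / N                                          # eq. (7)
        for j in range(K):
            means[j] = (resp[:, j:j + 1] * X).sum(axis=0) / max(Nj[j], 1e-12)   # eq. (8)
            diff = X - means[j]
            cov = (resp[:, j] * diff.T) @ diff / max(Nj[j], 1e-12)              # eq. (9)
            covs[j] = alpha * cov + (1.0 - alpha) * lam * np.eye(d)             # eq. (10)
    return pis, means, covs
```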
Fig. 10. Tag distributions and Fitted GMMs: (a) building, (b) road, (c) Nanyang lake and (d) swimming pool.
Several examples of the original locations versus the contour map of the fitted GMM are shown in Fig. 10. From Fig. 10, we can see that, by using the user interaction, the GMMs fit the discrete distribution for each tag well. Some GMMs cover a large range of the NTU campus, e.g. 'building' and 'road', while some other GMMs concentrate at several specific locations, e.g. 'Nanyang lake' (there is only one such place) and 'swimming' (there are two such places).

4. Content analysis for mobile image annotation

4.1. Content-based mobile image categorization

In recent years, many automatic techniques for assigning semantic labels to images based on content analysis have been proposed; this is essentially a scene categorization problem. According to the survey in [24], dense SIFT extracted from multi-scale patches in the image is one of the most promising features for scene image recognition. However, the large number of dense SIFT descriptors increases the computational cost greatly, which is not acceptable under the fast-response-time requirement of mobile image annotation.
In view of this, a bag-of-words (BoW)-based approach with spatial pyramid matching [25] based on dense SIFT is adopted here for its low computational cost and relatively high recognition accuracy. It partitions the image into increasingly fine sub-cells and computes the histogram of local features inside each sub-cell at each image pyramid level. These histogram vectors from all sub-cells are then concatenated. Finally, the discriminative models are trained using a spatial pyramid kernel (SPK)-based support vector machine (SVM). Since the SPK-BoW scheme can produce state-of-the-art scene categorization results at low computational cost as well as low dimensionality of the image signatures, we adopt this scheme in the content-based mobile image analysis. We use the identical settings of feature extraction and classifier parameters as in [25], and we use a 3-layer image decomposition.

4.2. Proposed content-based annotation scheme

Based on the SPK-BoW categorization, only the category labels for testing images can be produced. In order to suggest tags for the testing images, instead of directly
classifying the SPK-BoW histograms, we develop an annotation scheme based on the histograms. Let the set of distinct tags for the M concepts be T = {T_1, T_2, ..., T_K}. To annotate an image, its signature H (SPK-BoW histogram) is calculated first. Then, for each category, we regard the image signatures in the category as the positive samples and the others as the negative samples. Based on the SVM classifiers, we have a score for the image to be associated with each category, denoted as s_m(H), m = 1, 2, ..., M. Here we normalize s_m(H) with a sigmoid function to ensure that it belongs to (0,1):

$$p_m(H) = \frac{1}{1 + e^{-a\, s_m(H)}} \qquad (11)$$

where a is a scaling parameter that is set to 10 empirically. We denote the prior probabilities of the tags in the m-th category by $\omega_m = \{\omega_{m1}, \omega_{m2}, \ldots, \omega_{mK}\}$. The prior probability of tag T_i in the m-th concept is computed by counting the frequency with which it appears:

$$\omega_{mi} = \frac{|T_i|_m}{\sum_{m=1}^{M} |T_i|_m} \qquad (12)$$

where $|T_i|_m$ is the number of occurrences of tag T_i as a tag for the m-th category. Then the probability for tag T_i to be used for annotating the image is formulated as follows:

$$q(T_i\,|\,H) = \sum_{m=1}^{M} p_m(H)\, \omega_{mi} \qquad (13)$$

which is the sum of the priors of T_i in each category weighted by the image score p_m(H). We then sort q = {q(T_1|H), ..., q(T_K|H)} in descending order and select the top tags as the content-based suggestion tags.
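A small Python sketch of (11)–(13) is given below, assuming the per-category SVM decision scores s_m(H) and the tag-occurrence counts are already available; the function and variable names are illustrative only.

```python
import numpy as np

def estimate_tag_priors(counts):
    """Eq. (12): omega_{mi} = |T_i|_m / sum_m |T_i|_m from an (M, K) count matrix,
    where counts[m, i] is how often tag T_i is used as a tag for category m."""
    counts = np.asarray(counts, dtype=float)
    return counts / np.maximum(counts.sum(axis=0, keepdims=True), 1e-12)

def content_based_tag_scores(svm_scores, tag_priors, a=10.0):
    """Eqs. (11) and (13): turn per-category SVM scores into per-tag probabilities.

    svm_scores : (M,) array of s_m(H), one SVM decision value per category
    tag_priors : (M, K) array of omega_{mi} from estimate_tag_priors
    a          : sigmoid scaling parameter (set to 10 in the paper)
    """
    p = 1.0 / (1.0 + np.exp(-a * np.asarray(svm_scores, dtype=float)))  # eq. (11)
    q = p @ np.asarray(tag_priors, dtype=float)                          # eq. (13)
    ranking = np.argsort(q)[::-1]                                        # descending tag order
    return q, ranking
```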
5. Proposed tagging by content and context integration

Mobile image tagging via context or content analysis alone is less than ideal. If content analysis is used alone, the annotation performance is not satisfactory due to the diverse conditions that can affect the quality of the image; if we use context information alone, the GPS location obtained using mobile devices may drift a lot in dense built-up areas, which will deteriorate the performance of mobile scene image tagging. Therefore, it is desirable to integrate content and context information to achieve better performance. In this section, we develop a method for integrating content-based tags and context-based tags using information fusion theory.

Dempster's rule of combination [26] is utilized in this work to combine the different sources. It is considered to be a more flexible and general approach than traditional probability theory, and it is able to deal with some ignorance of the system. At present, there are very few works using Dempster's rule in the information retrieval community. One typical work that uses Dempster's rule for image annotation is developed in [27]. In Jin's work, each blob in the image is assigned a tag and then the image is assigned several tags. Different metrics of tag similarity are used to remove the irrelevant tags. Finally, the different sets of tags for the image are fused using Dempster's rule. Different from Jin's work, before using the information fusion, we need to do some preprocessing to make sure that the lengths of the different sources are the same. To address this issue, we incorporate another type of context information, i.e. the tag relationship information provided by the Web user community, during the process of mobile content and mobile context integration. Moreover, the process of the integration is specially designed to greatly accelerate the online annotation procedure on the server.

We consider image tagging by combining two sources: a list of N_t scores for all the tags in the context-based annotation and a list of N_c scores for all the category tags in the content-based annotation, where the N_c category tags are a subset of the N_t tags used in the context-based annotation. The tag lists are illustrated in Fig. 11, where m_1, m_2 and S are the content-based score list, the context-based score list, and the final fused score list, respectively, and A_i, i = 1, 2, ..., N_t, are the tags from the whole set.

5.1. Proposed web-community-aware tag fusion

To integrate the two lists of scores, we have to ensure that the sizes of the two score lists are the same before using information fusion theory. However, in this application, the context-based tags provided by the mobile users are much more numerous than the content-based tags that describe the image categories, i.e. N_t = 1095 and N_c = 140. So, we have to expand the size of the content-based score list to be the same as that of the context-based score list. The difficulty lies in that we are not sure about the scores m_1(A_i) for i = N_c + 1, ..., N_t. To address this issue, we derive the scores of A_i, i = N_c + 1, ..., N_t, by using the scores that are available, i.e. m_1(A_i), i = 1, 2, ..., N_c, along with the conditional probabilities between words as follows:

$$m_1(A_j) = \frac{1}{N_c} \sum_{i=1}^{N_c} m_1(A_i)\, p(A_j\,|\,A_i) \qquad (14)$$

where j = N_c + 1, ..., N_t, and p(A_j|A_i) is the conditional probability of tag A_j given tag A_i.
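A minimal sketch of the expansion in (14) is shown below, assuming the content-based scores m_1 for the N_c category tags and a matrix of conditional probabilities p(A_j|A_i) (estimated from Web co-occurrence statistics, eq. (15) below) are available; names and shapes are illustrative assumptions.

```python
import numpy as np

def expand_content_scores(m1_category, cond_prob):
    """Eq. (14): extend the N_c content-based scores to all N_t tags.

    m1_category : (N_c,) scores m_1(A_i) for the category tags
    cond_prob   : (N_c, N_t) matrix with cond_prob[i, j] = p(A_j | A_i)
    Assumes the first N_c of the N_t tags are the category tags themselves,
    whose known scores are kept unchanged.
    """
    m1_category = np.asarray(m1_category, dtype=float)
    nc = m1_category.shape[0]
    expanded = (m1_category @ np.asarray(cond_prob, dtype=float)) / nc  # (1/N_c) sum_i m_1(A_i) p(A_j|A_i)
    expanded[:nc] = m1_category
    return expanded
```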
Fig. 11. Illustration of the tagging process based on content and context integration: the content-based score list (expanded by estimated tag co-occurrence) and the context-based score list are fused by Dempster's combination into a sorted list, from which the top tags are taken.
To acquire the unknown conditional probabilities, we utilize another type of context information, i.e. the Web community. We assume that the relation between two tags for the query image is roughly the same as that provided by the Web user community in a specific domain. Here we define the conditional probability between two tags for the query image as the normalized co-occurrence as follows:

$$p(A_j\,|\,A_i) = \frac{|A_i \cap A_j \cap O|}{|A_i \cap O|} \qquad (15)$$

where $A_i \cap A_j$ is the set of results returned by searching tag A_i and tag A_j together on the Web, $|\cdot|$ is the number of elements of a set, and O is the phrase "NTU Singapore" for confining the search of tags to the specific domain. The Web search engine we use is the Yahoo API, and we have developed software for automatically calculating the co-occurrences without interruption by the website. For efficiency, the conditional probabilities between any two tags amongst the tag ensemble are stored on the mobile annotation system server once they are calculated from the Yahoo API, and can be updated regularly.

After the expansion of the content-based score list, the basic probability assignment (BPA) function is used here to take into account all the available evidence. It is defined as a mapping S from the power set $2^{\Omega}$ of a finite set $\Omega = \{A_1, A_2, \ldots, A_{N_t}\}$ to [0,1] such that, for any $T \in 2^{\Omega}$, we have

$$\sum_{T \in 2^{\Omega}} S(T) = 1, \quad S(T) \geq 0 \qquad (16)$$

where the power set $2^{\Omega}$ comprises the exhaustive set of mutually exclusive elements

$$2^{\Omega} = \{\{A_1\}, \{A_2\}, \ldots, \{A_{N_t}\}, \{A_1, A_2\}, \{A_2, A_3\}, \ldots, \{A_{N_t-1}, A_{N_t}\}, \ldots, \{A_1, A_2, \ldots, A_{N_t}\}, \emptyset\} \qquad (17)$$

where $\emptyset$ is the empty set, $S(\emptyset) = 0$ under a closed-world assumption, and there are in total $2^{N_t}$ elements in $2^{\Omega}$. Dempster's rule for combining K sources is

$$S(T) = \frac{\sum_{T_1, T_2, \ldots, T_K \in 2^{\Omega},\; \cap_{i=1}^{K} T_i = T} \left(S_1(T_1) \cdots S_K(T_K)\right)}{\sum_{T_1, T_2, \ldots, T_K \in 2^{\Omega},\; \cap_{i=1}^{K} T_i \neq \emptyset} \left(S_1(T_1) \cdots S_K(T_K)\right)} \qquad (18)$$

and the joint BPA for two sources in this application can be reformulated as follows:

$$S(T) = \sum_{T_1, T_2 \in 2^{\Omega},\; T_1 \cap T_2 = T} \frac{S_1(T_1)\, S_2(T_2)}{1 - M} \qquad (19)$$

where $M = \sum_{T_1 \cap T_2 = \emptyset} S_1(T_1)\, S_2(T_2)$ is a measure of the amount of conflict between the two BPA sets.

5.2. Proposed computational cost reduction

In order to satisfy the fast-response-time requirement of integration-based mobile image annotation, instead of directly using Dempster's rule of combination, we modify the online procedure. Originally, the BPAs need to be estimated on as many as $2^{N_t}$ elements of the power set $2^{\Omega}$, and the computational complexity is as high as $O(2^{N_t})$, which is not affordable in real mobile image annotation systems. Here we reduce the original power set $2^{\Omega}$ to a much smaller subset

$$P = \{\{A_1\}, \{A_2\}, \ldots, \{A_{N_t}\}, \{A_1, A_2\}, \{A_2, A_3\}, \ldots, \{A_{N_t-1}, A_{N_t}\}, F\} \qquad (20)$$

where F is a subset of $2^{\Omega}$ which contains the elements with more than two tags. The reduced set P has $(N_t)^2 + 1$ elements, so the computational complexity is $O(N_t^2)$. The computational cost is reduced substantially since $N_t = 1095$ here. The joint BPA can thus be simplified as follows:

$$S(A) = \sum_{T_1, T_2 \in P,\; T_1 \cap T_2 = A} \frac{S_1(T_1)\, S_2(T_2)}{1 - M} \qquad (21)$$

where $M = \sum_{T_1 \cap T_2 = \emptyset} S_1(T_1)\, S_2(T_2)$, and the BPAs on each element of P for each source can be formulated as follows:

$$S_k(\{A_i\}) = S_k(\{A_i, A_i\}) = m_k(\{A_i\})\, m_k(\{A_i\}), \quad A_i \in \Omega, \; k = 1, 2 \qquad (22)$$

$$S_k(\{A_i, A_j\}) = m_k(\{A_i\})\, m_k(\{A_j\}), \quad A_i \in \Omega, \; k = 1, 2 \qquad (23)$$

and S(F), which denotes the ignorance on the power set, is set to 0.1 empirically. For any $T_k \in P$, k = 1, 2, we can estimate $S_1(T_k)$ and $S_2(T_k)$ according to (22) and (23), and then calculate $S(\{A_i\})$ for every $A_i \in \Omega$ according to (21). Finally, we sort $S(\{A_i\})$, $A_i \in \Omega$, in descending order and take the ranked top tags as the suggestion tags.
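The reduced combination can be sketched as follows in Python. The two BPAs are built from the content score list m_1 (after the expansion in (14)) and the context score list m_2; for brevity only the singleton masses of (22) are materialised (the pair elements of P follow the same product pattern), the ignorance mass of 0.1 is placed on the whole frame as is usual for Dempster's rule, and the normalisation of the singleton masses is an illustrative choice rather than the paper's exact procedure.

```python
from itertools import product

def dempster_combine(bpa1, bpa2):
    """Two-source Dempster combination with conflict normalization, eqs. (19)/(21).
    bpa1, bpa2 map frozensets of tags (the focal elements, e.g. drawn from P in (20))
    to their basic probability masses."""
    combined, conflict = {}, 0.0
    for (t1, s1), (t2, s2) in product(bpa1.items(), bpa2.items()):
        inter = t1 & t2
        if inter:
            combined[inter] = combined.get(inter, 0.0) + s1 * s2
        else:
            conflict += s1 * s2                       # mass M falling on the empty set
    return {t: v / (1.0 - conflict) for t, v in combined.items()}

def singleton_bpa(scores, frame, ignorance=0.1):
    """BPA built from a tag score list m_k in the spirit of eqs. (22)-(23).
    Only the singleton masses S_k({A_i}) = m_k(A_i)^2 are materialized; the pair
    elements of P follow the same pattern and are omitted for brevity. The
    remaining mass (0.1 in the paper) is placed on the whole frame as ignorance,
    and the singleton masses are normalized to sum to 1 - ignorance (an
    illustrative choice, not spelled out in the paper)."""
    total = sum(s * s for s in scores.values()) or 1e-12
    bpa = {frozenset([t]): (1.0 - ignorance) * s * s / total for t, s in scores.items()}
    bpa[frozenset(frame)] = ignorance
    return bpa

# Usage sketch: m1 = expanded content scores, m2 = context scores (dicts tag -> score).
# frame = set(m1) | set(m2)
# fused = dempster_combine(singleton_bpa(m1, frame), singleton_bpa(m2, frame))
# top = sorted((t for t in fused if len(t) == 1), key=lambda t: -fused[t])[:10]
```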
6. Experimental results and discussions

6.1. Experimental settings

For evaluating the tagging performance, we implemented various methods for mobile image annotation on the NTU mobile image dataset. The methods include the proposed tagging based on SPK-BoW [25], the proposed tagging based on the BoW baseline [24], the proposed context-based tagging without user interaction, the proposed context-based tagging with user interaction, location-prefilter tagging based on SPK-BoW [20], and the proposed tagging by content and context integration. For a fair comparison, the feature extraction step, the SVM kernel, the SVM regularization parameter and the image annotation scheme are kept the same for all content-based tagging methods. In the content-based tagging methods, we use the "one-versus-all" approach for multi-class classification. The SVM regularization parameter is fixed to 1000 since we do not aim to tune this parameter. For matching of histograms between query images and database images, we adopt the intersection kernel [28] for its high efficiency and parameter-free characteristic:

$$G(H_q, H_d) = \sum_{i=1}^{D} \min\left(H_q(i), H_d(i)\right) \qquad (24)$$

where H_q and H_d denote the histograms of the query image and a database image, and D is the dimension of the image signatures.

The tagging performance is evaluated in terms of precision and recall. The precision and recall are evaluated in an objective way: for each testing image, the suggested tags generated by the system are compared
with the ground-truth tags in the dataset. Denote the set of tags suggested by the annotation system by S and the set of tags in the ground-truth annotation by T; then precision is defined as

$$\mathrm{Precision} = \frac{|S \cap T|}{|S|} \qquad (25)$$

and recall is defined as

$$\mathrm{Recall} = \frac{|S \cap T|}{|T|} \qquad (26)$$

where $|\cdot|$ denotes the number of elements in the set.
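The evaluation in (25) and (26) reduces to simple set operations; a minimal sketch:

```python
def precision_recall(suggested, ground_truth):
    """Eqs. (25)-(26): per-image precision and recall of the suggested tag set."""
    s, t = set(suggested), set(ground_truth)
    overlap = len(s & t)
    precision = overlap / len(s) if s else 0.0
    recall = overlap / len(t) if t else 0.0
    return precision, recall

# Example: if 6 of 10 suggested tags match an image annotated with 8 ground-truth tags,
# precision = 0.6 and recall = 0.75.
```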
To determine a suitable number of GMM components K in the proposed context-based tagging with user interaction, we evaluate the precision of the top suggested tag for different K from 1 to 300, without the regularization in (10), as shown in Fig. 12. We can notice that, when K increases beyond 200, the precision does not increase much while the computational cost keeps increasing. So, we set K = 200 as the optimal setting.

There are two parameters for the regularization of the covariance matrix in (10), α and λ, which need to be optimized. Here we use 10-fold cross-validation to determine the optimal parameters. We split the training set into 10 roughly equal-sized parts. For each setting of the parameters α and λ, we fit the GMMs by EM using nine parts, and calculate the classification accuracy on the remaining part as the validation set. We repeat this procedure, using every part as the validation set in each of the 10 runs. Finally we get an average of the 10 precisions corresponding to that setting of α and λ. We compute the precision at a grid of points on the (α, λ) plane: (α, λ) ∈ [0.5, 1.0] × [0, 0.2]. The performance of the proposed context-based annotation for different values of α and λ with K = 200 is shown in Fig. 13. We can notice that when α = 0.8 and λ = 0.1 the precision reaches its peak, and we use these as the optimal setting. Note that when α = 1.0 and λ = 0, the original covariance matrix is used and the fitted GMMs tend to overfit. In a practical mobile application, the GPS location of the mobile phone may have large errors, and the tag cannot be propagated to the query image correctly based on the overfitted GMM. In contrast, with the optimal setting α = 0.8 and λ = 0.1, the GMM is more robust to the GPS error, and thus produces higher annotation results.

Fig. 12. Precision values with different numbers of GMM components.

Fig. 13. Precision values of the top suggested tag with different α and λ.

6.2. User-intention-aware location-based annotation results

Based on the Bayes classification rule, we can get a descending order of the tags for each image. Some examples of tag suggestion are shown in Fig. 14, where the tags that match well with the ground-truth tags in the dataset are underlined.

Fig. 14. Suggested tags in context-based tagging for some testing images.
We can notice that the tags related to the location as well as to the location-related activity are generally well suggested by the context-based annotation in Fig. 14(a)–(f). Note that in Fig. 14(f), even though the photo is heavily blurred, which makes content analysis nearly impossible, the context-based annotation is still satisfactory. In Fig. 14(g)–(i), we list some cases where mobile images are not well annotated by the context-based tagging. In Fig. 14(g), since the "lecture theater" is on a floor above the "medical centre" and they share similar locations, the suggested tags are related to both "medical centre" and "lecture theater". In Fig. 14(h), the ground-truth place is "auditorium"; however, its occurrence is lower than that of "SBS canteen", since mobile users are more interested in "SBS canteen" and the prior probability of the tag "SBS canteen" is much higher than that of "auditorium". These problems may be alleviated by considering content analysis. In Fig. 14(i), the image shows an empty hall, but the context-based annotation still produces tags such as "boy", "player" and "sportsman" because these activities frequently appear at this place.
6.3. Content-based annotation results

Examples of content-based mobile image annotation for three testing images are shown in Fig. 15, where the left images are the mobile images and the right boxes indicate the top 10 tags along with their normalized scores, i.e. the scores for all tags in the tag ensemble sum to one. The suggested tags that match the ground-truth tags in the dataset are underlined. From Fig. 15, we can note that the tags suggested by the system are generally correct, especially for the top few tags. However, in the annotated tags of the example "Hall room", we can notice that some wrong tags are produced, e.g. "library" and "laundry room", which can be easily removed by using location information.

Fig. 15. Content-based mobile image annotation tags: top-10 tags with normalized scores for three testing images (a basketball court, a hall room and a canteen scene).

Fig. 16. Integration of content-based tagging and context-based tagging: (a) testing images, (b) content-based tags, (c) context-based tags and (d) integration-based tags.

6.4. Integration-based annotation results

Based on the proposed integration method for fusing content-based tags and context-based tags, we expect to
get better results than content analysis or context analysis alone. Some examples are shown in Fig. 16. For the first image, the main concept of the photo is a person, and it was taken at a bus stop. By content analysis, the tags are closely related to the concept of "portrait", while by context analysis the tags are more related to the concept of "bus stop". So, the final fused tags are generally correct, including the concepts related to both "portrait" and "bus stop". For the second image, its content is typical of "classroom", so it can be correctly annotated by content analysis. However, by context analysis, the tagging accuracy is not very high since its location is very similar to that of "medical centre". For the third image, content analysis fails due to various changes of the photo conditions. At the moment of the photo taking, it was slightly raining, resulting in color distortion and image blur. In addition, the image was taken from an unusual camera perspective. Both factors make it difficult to annotate the image using content analysis alone. However, this photo can be easily annotated by context analysis using the logged GPS location. From Fig. 16, we can note that the integration generally produces better results than the content-based and context-based suggestion tags.

6.5. Objective performance comparisons

Fig. 17 gives the performance comparison for the testing images in terms of precision and recall, respectively. From
Fig. 17(a), we can see that for the context analysis, the tagging precision is higher than the BoW baseline; however, the curve intersects with that of SPK-BoW. Furthermore, the context analysis with user interaction shows a consistent improvement of about 1% over context analysis without user interaction. Comparing content analysis and context analysis in Fig. 17(a), content analysis is better for the first tag, and context analysis is better from roughly the third tag onwards for the testing images. However, this threshold varies from image to image and is hard to determine. As seen from Fig. 17(a), integration of the content-based tags and context-based tags consistently produces better precision than the maximum performance of either content alone or context alone, and the improvement can reach as much as 8% when the number of suggested tags is 3. The SPK-BoW tagging based on location prefiltering [20] consistently outperforms SPK-BoW tagging alone, but still cannot outperform the proposed integration tagging. When the number of suggested tags is small, the difference between the two content and context fusion methods is not large, but the difference becomes obvious when the number of suggested tags increases. In terms of recall, from Fig. 17(b), we can notice that the trend of performance of all the methods is the same as in Fig. 17(a). The content and context integration also consistently outperforms all the methods under comparison, and the SPK-BoW tagging based on location prefiltering
cannot outperform the proposed location tagging for all numbers of suggested tags. We can notice that, by integrating the context information with the content information, the tagging performance achieves improvements over content analysis alone of up to about 16% in precision and 15% in recall
when the number of suggested tags is 15. The recall– precision curve is reproduced in Fig. 18. We can notice that the proposed content and context integration method consistently and significantly outperform other state-of-the-art methods. 6.6. Subjective performance comparisons
1 BoW tagging SPK−BoW tagging Proposed context tagging without user interaction Proposed context tagging with user interaction Location−prefilter tagging Proposed tag fusion
0.9
precision
0.8 0.7 0.6 0.5 0.4 0.3
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
In order to evaluate the system performance in a more realistic scenario, we asked 20 volunteers to subjectively evaluate the precision of the tags generated by the system. By using the interactive mobile platform as shown in Fig. 6, the volunteers choose the relevant tags out of all the suggested tags, and ignore the irrelevant tags. The ratio of the tags chosen by each volunteer for each image is recorded, and for each volunteer the precision is the average value of the ratios over all testing images. The average precision is the average value of the ratios per volunteer. Fig. 19 gives the average precision for different number of suggested tags. The standard deviation is also depicted by assuming a normal distribution of the precisions
Fig. 19. Subjective performance comparison of various methods in terms of precision.
Fig. 17. Objective performance comparison of various methods in terms of precision (a) and recall (b).
Fig. 18. Objective performance comparison of various methods in terms of precision and recall.
Fig. 20. Annotation time for different numbers of GMM components.
We can notice that the content and context integration consistently outperforms using content or context alone by about 10% in precision, and the standard deviation is less than 2%.

6.7. Computational time evaluations

In the developed system, the content-based, context-based and fusion-based mobile image annotation schemes are all designed to meet the fast-response-time requirement. To evaluate the computational time in a realistic scenario, we use the implemented mobile platform to perform the actual annotation procedure on all the testing images. Since the communication time between the mobile device and the annotation server depends on the network and is independent of the developed system, we only consider the processing time on the server. The time for content analysis is approximately 0.90 s. For the proposed context tagging, the main cost factor is the number K of GMM components. The computational cost of the proposed context tagging for K from 1 to 300 is shown in Fig. 20. We can notice that the computational cost is approximately linearly proportional to K, and the time cost is about 0.75 s when K = 200, which is comparable to that of the content analysis (0.90 s). We accept this relatively high computational cost in order to achieve a higher tagging performance than the traditional "location-prefilter tagging". Moreover, since the proposed context tagging method is independent of the content tagging, content analysis and context analysis can be carried out on two different processors in parallel. The resulting sets of suggested tags are then integrated by the proposed tag fusion method, which needs about 0.13 s. Thus the overall tagging procedure takes only about 1.0 s, which is suitable for mobile phone applications.

7. Conclusions and future work

This paper presents a new and effective framework for mobile image annotation. With the developed mobile image annotation system, mobile images can be uploaded to media sharing portals along with their semantic tags, which facilitates the retrieval and sharing of media on the Web. In particular, we developed content analysis and context analysis methods that are both effective and computationally efficient for mobile image tag suggestion, as well as a method for fusing content and context to make full use of all available sources. According to the experimental results, the integration-based annotation performs well as long as either content analysis or context analysis performs well. In addition, the proposed integrated content and context method is designed to have a very low computational cost for this fast-response application. The experimental results have shown that the proposed mobile image tagging framework performs well on the NTU campus, and we expect it to be applicable to other domain-specific image annotation applications. Future work may include integrating other types of context information into the content analysis, such as camera direction (acquired by a digital compass) and time.
References

[1] http://danzarrella.com/new-data-on-mobile-facebook-posting.html.
[2] S. Fountain, T. Tan, G. Sullivan, Content-based rotation-invariant image annotation, in: IET Colloquium on Intelligent Image Databases, 1996, pp. 6/1.
[3] P. Duygulu, K. Barnard, J. De Freitas, D. Forsyth, Object recognition as machine translation: learning a lexicon for a fixed image vocabulary, in: European Conference on Computer Vision, 2006, pp. 349–354.
[4] E. Chang, K. Goh, G. Sychay, G. Wu, CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines, IEEE Transactions on Circuits and Systems for Video Technology 13 (1) (2003) 26–38.
[5] J. Lu, S. Ma, M. Zhang, Automatic image annotation based-on model space, in: Proceedings of IEEE International Conference on Natural Language Processing and Knowledge Engineering, 2005, pp. 455–460.
[6] H. Frigui, J. Caudill, Unsupervised image segmentation and annotation for content-based image retrieval, in: IEEE International Conference on Fuzzy Systems, 2006, pp. 72–77.
[7] R. Datta, W. Ge, J. Li, J. Wang, Toward bridging the annotation-retrieval gap in image search by a generative modeling approach, in: Proceedings of the 14th Annual ACM International Conference on Multimedia, 2006, pp. 977–986.
[8] L. Yu, Y. Liu, T. Zhang, Using example-based machine translation method for automatic image annotation, in: The Sixth World Congress on Intelligent Control and Automation, vol. 2, pp. 9809–9812.
[9] C. Wang, F. Jing, L. Zhang, H. Zhang, Content-based image annotation refinement, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[10] G. Carneiro, A. Chan, P. Moreno, N. Vasconcelos, Supervised learning of semantic classes for image annotation and retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (3) (2007) 394–410.
[11] B. Sun, W. Ng, D. Yeung, J. Wang, Localized generalization error based active learning for image annotation, in: IEEE International Conference on Systems, Man and Cybernetics, 2008, pp. 60–65.
[12] J. Li, J. Wang, Real-time computerized annotation of pictures, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (6) (2008) 985–1002.
[13] M. Naaman, A. Paepcke, H. Garcia-Molina, From where to what: metadata sharing for digital photographs with geographic coordinates, in: On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, 2003, pp. 196–217.
[14] S. Ahern, M. Davis, D. Eckles, S. King, M. Naaman, R. Nair, M. Spasojevic, J. Yang, Zonetag: designing context-aware mobile media capture to increase participation, in: Workshop on Pervasive Image Capture and Sharing, Ubicomp, 2006.
[15] S. Xia, X. Gong, W. Wang, Y. Tian, X. Yang, J. Ma, Context-aware image annotation and retrieval on mobile device, in: Second International Conference on Multimedia and Information Technology, vol. 1, 2010, pp. 111–114.
[16] Y. Li, J. Lim, Outdoor place recognition using compact local descriptors and multiple queries with user verification, in: Proceedings of the 15th International Conference on ACM Multimedia, 2007, pp. 549–552.
[17] G. Fritz, C. Seifert, L. Paletta, A mobile vision system for urban detection with informative local descriptors, in: IEEE International Conference on Computer Vision Systems, 2006, pp. 30.
[18] C. Tsai, A. Qamra, E. Chang, Y. Wang, Extent: inferring image metadata from context and content, in: IEEE International Conference on Multimedia and Expo, 2005, pp. 1270–1273.
[19] N. O'Hare, C. Gurrin, G. Jones, A. Smeaton, Combination of content analysis and context features for digital photograph retrieval, in: The Second European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, 2005, pp. 323–328.
[20] K. Yap, T. Chen, Z. Li, K. Wu, A comparative study of mobile-based landmark recognition techniques, IEEE Intelligent Systems 25 (1) (2010) 48–57.
[21] N. Ravi, J. Scott, L. Han, L. Iftode, Context-aware battery management for mobile phones, in: Sixth Annual IEEE International Conference on Pervasive Computing and Communications (PerCom), 2008, pp. 224–233.
[22] http://www.windowsfordevices.com/news/ns7065484804.html.
[23] Z. Yu, et al., FEMA: a fast expectation maximization algorithm based on grid and PCA, in: IEEE International Conference on Multimedia and Expo, 2006, pp. 1913–1916.
[24] J. Zhang, M. Marszalek, S. Lazebnik, C. Schmid, Local features and kernels for classification of texture and object categories: a comprehensive study, in: IEEE Conference on Computer Vision and Pattern Recognition Workshop, 2006, pp. 13.
[25] S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, 2006, pp. 2169–2178.
[26] K. Sentz, S. Ferson, Combination of evidence in Dempster–Shafer theory, Sandia National Laboratories, 2002.
[27] Y. Jin, L. Khan, L. Wang, M. Awad, Image annotations by combining multiple evidence & wordnet, in: Proceedings of the 13th Annual ACM International Conference on Multimedia, ACM, 2005, pp. 706–715.
[28] M. Swain, D. Ballard, Color indexing, International Journal of Computer Vision 7 (1) (1991) 11–32.