Procedia Computer Science 92 (2016) 148–154
2nd International Conference on Intelligent Computing, Communication & Convergence (ICCC-2016), Srikanta Patnaik, Editor in Chief. Conference organized by Interscience Institute of Management and Technology, Bhubaneswar, Odisha, India.
Ontology-based annotation for semantic multimedia retrieval

Lakshmi Tulasi R.1*, Srinivasa Rao M.2, Usha K.3, R. H. Goudar4
1 Dept. of CSE, QIS Institute of Technology, Ongole, India
2 School of IT, JNTU, Hyderabad, India
3, 4 Centre for PG Studies, VTU, Belagavi, India
Abstract
Educational video lectures, CCTV surveillance footage, transport monitoring, and similar sources have greatly increased the volume of multimedia video content. To make large video databases practical, video data must be indexed automatically so that relevant material can be searched and retrieved. An annotation is a markup reference attached to video data to improve its accessibility, and video annotation is used to examine the massive quantity of multimedia data held in repositories. It extracts the significant information present in a video and attaches that metadata to the video, which benefits retrieval, browsing, analysis, searching, comparison, and categorization, accelerates retrieval, eases access, and reduces the human time and effort needed to study videos. The proposed system provides effortless access to video content and decreases the time required to access and evaluate a video. Ontology-based video annotation helps the user obtain semantic information from a video, which is essential for finding the required data within it.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the Organizing Committee of ICCC 2016.
Keywords: Video Annotation; Ontology; Multimedia; Feature Extraction;
* Corresponding author: Lakshmi Tulasi R. Tel.: 94408 23573; E-mail address: [email protected]
1877-0509 © 2016 Published by Elsevier B.V. doi:10.1016/j.procs.2016.07.339
1. Introduction

In recent years, most people have come to depend on the Internet because it offers multimedia content such as text, audio, animation, video, and images. This content changes over time and requires large databases for storage and retrieval. A more effective and robust approach is therefore needed to retrieve the required information from the Internet. Information retrieval is such an approach: it allows a user to store and retrieve the information they need. The main goal of an information retrieval system is to determine, from a list of data sources, which documents are most relevant to the user's query. User requirements, however, are not limited to text; they also include multimedia content. Classical information retrieval handles only textual information, so a multimedia information retrieval method is needed.

2. Problem Statement, Motivation and Objectives

Multimedia information retrieval (MIR) is a research area of computer science that supports the extraction of semantic data from multimedia sources. These sources include directly observable data such as audio, image, and video, and indirectly observable data such as text and bio-signals. The proposed system provides effortless access to video content and decreases the time needed to access and evaluate a video. Ontology-based video annotation helps the user obtain semantic information from a video, which is essential for finding the required data within it.

Searching for the required video information in a large multimedia database is a complex task. A video consists of a collection of frames, and analyzing every frame is tedious and error-prone. Analyzing only the essential information lets the user understand the whole content of a video without inspecting it in full. Ontology-based video annotation is a technique that permits the user to obtain the information present in a video easily. The objectives are:

• To reduce the time required to analyze a video.
• To improve the retrieval performance of video retrieval systems.
• To ease the semantic retrieval of video data from a huge database through ontology-based annotation.

3. System Architecture

The system architecture is shown in Fig. 1. The ontology-based video annotation system consists of two phases: a training phase and a testing phase.

3.1 Training Phase

• An image in any common format, collected from the Internet, is given as input to the system.
• SIFT features are extracted from the input image using the VLFeat toolbox.
• After the SIFT features, feature extraction continues with the HoG method.
• The HoG features, together with an RGB histogram, are extracted to train the classifier and identify the objects, and are stored in a database.

3.2 Testing Phase

• A video in .AVI format is given as input to the system.
• A video consists of a large collection of frames. The user does not need all of them, so the non-redundant frames that carry useful information are treated as key frames; they are extracted using an entropy method.
• SIFT features are extracted from each key frame.
• After the SIFT features, feature extraction continues with the HoG method. A sketch of this shared feature-extraction step is given below.
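As an illustration of this step, the following sketch builds a combined SIFT + HoG + RGB-histogram feature vector for one image or key frame. It is a minimal Python approximation using OpenCV and scikit-image rather than the VLFeat toolbox used by the authors; the resize dimensions, histogram bin counts, and the averaging of SIFT descriptors into a single vector are assumptions made only for illustration.

```python
# Hypothetical sketch of the SIFT + HoG + RGB-histogram feature step
# (the paper uses the VLFeat toolbox; parameter choices here are assumptions).
import cv2
import numpy as np
from skimage.feature import hog

def extract_features(image_bgr, size=(128, 128)):
    """Return one fixed-length feature vector for a single image or key frame."""
    img = cv2.resize(image_bgr, size)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # SIFT keypoints and 128-D descriptors; averaged into one vector here.
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray, None)
    sift_vec = desc.mean(axis=0) if desc is not None else np.zeros(128)

    # HoG descriptor computed over small connected cells of the image.
    hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                  cells_per_block=(2, 2), feature_vector=True)

    # RGB colour histogram (16 bins per channel), normalised to sum to 1.
    hist = [cv2.calcHist([img], [c], None, [16], [0, 256]).flatten()
            for c in range(3)]
    rgb_vec = np.concatenate(hist)
    rgb_vec /= (rgb_vec.sum() + 1e-9)

    return np.concatenate([sift_vec, hog_vec, rgb_vec]).astype(np.float32)
```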
The key-frame features collected in the testing phase are then compared with the features of the training images stored in the database during the training phase. The result is the annotated video key frame, produced on the basis of the ontology, together with the semantic statistics of the video key frames. A minimal sketch of this matching step follows.
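The matching can be viewed as a nearest-neighbour comparison between a key-frame feature vector and the labelled vectors stored during training. The sketch below is a minimal illustration with made-up labels, vectors, and a Euclidean distance; it is not the authors' implementation.

```python
# Minimal nearest-neighbour match of a key-frame feature vector against the
# knowledge base built in the training phase (names and distance metric are
# illustrative assumptions).
import numpy as np

def annotate_key_frame(frame_vec, knowledge_base):
    """knowledge_base: list of (annotation_label, feature_vector) pairs."""
    dists = [(np.linalg.norm(frame_vec - vec), label)
             for label, vec in knowledge_base]
    best_dist, best_label = min(dists)
    return best_label, best_dist

# Example with three stored training vectors and one key-frame vector.
kb = [("airplane", np.array([0.9, 0.1, 0.0])),
      ("car",      np.array([0.1, 0.8, 0.1])),
      ("face",     np.array([0.0, 0.2, 0.9]))]
label, dist = annotate_key_frame(np.array([0.85, 0.15, 0.05]), kb)
print(label, round(dist, 3))   # -> airplane
```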
[Fig. 1 block diagram. Training phase: Training Images → SIFT Features → Feature Extraction → Knowledge Base. Testing phase: Video → Key Frame Extraction → SIFT Features → Feature Extraction → Match against Knowledge Base → Result (Annotated Video Key Frame).]
Fig. 1: Architecture of the Ontology-based Video Annotation System

4. Methodology

4.1 Prewitt Operator Method

In image processing, the Prewitt operator is widely used because it detects edges at every pixel of an image. An edge marks a change in image intensity, and the Prewitt operator estimates both the magnitude and the direction of that change by approximating the image gradient with a pair of 3×3 convolution kernels. It therefore gives, at the key points of the image where the intensity changes, an edge response that can be treated as a vector with a magnitude and a direction, which makes it a simple and reliable method. A generic sketch of this computation is given after Section 4.2.

4.2 Scale-Invariant Feature Transform (SIFT)

SIFT features are small, selected regions of an image and include a descriptor of the image content. SIFT has two parts: a detector and a descriptor. The detector finds key points located on the image, while the descriptor computes a description of each key point. A key point is characterized by its coordinates (x, y), its scale, and its orientation, and SIFT feature values do not change when the image is rotated or scaled. The descriptor represents each key point as a three-dimensional histogram of how the gradients appear around it.
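As a generic illustration of the Prewitt operator of Section 4.1 (not the authors' code), the following sketch convolves a grayscale image with the standard Prewitt kernels and derives the gradient magnitude and direction; the input file name is a placeholder.

```python
# Prewitt gradient sketch: horizontal/vertical kernels, magnitude and direction.
import cv2
import numpy as np

kx = np.array([[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]], dtype=np.float32)   # responds to horizontal changes
ky = kx.T                                        # responds to vertical changes

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
gx = cv2.filter2D(gray, -1, kx)
gy = cv2.filter2D(gray, -1, ky)

magnitude = np.sqrt(gx ** 2 + gy ** 2)   # edge strength at every pixel
direction = np.arctan2(gy, gx)           # edge direction in radians
```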
4.3 Histogram of Oriented Gradients (HoG)

HoG features are used to determine which type of object appears in an image. The identification is based on counting the gradient directions in particular portions of the image. The goal of the HoG descriptor is to capture the presence, location, and shape of an object through the direction of its edges. The image is divided into small connected regions called cells, and a HoG feature is extracted from each of these cells.

4.4 Nearest Neighbour Search Method

Nearest neighbour search is a searching method used in image processing. It determines which type of object is present in an image by finding the closest stored example and reports the result as either the positive class or the negative class.

5. Main GUI

The main GUI of the project consists of four modules:

• Train Classifier: this module takes a set of images fetched from the database for training. SIFT features are extracted first, and then HoG features along with RGB histograms are taken from each input image (a compact sketch of the resulting knowledge base is given below).
• Make Video Annotation: this module takes a set of videos for testing, extracts key frames from the given input video, compares the video frames stored in the database with the key frames of the video, and displays the annotated video key frames together with their semantic information.
• Performance Result: this module calculates the accuracy of the system by fetching various videos from the database.
• Exit: this module allows the user to exit the system.

GUI: Image Training Module

• In this GUI, the user browses an image from the database for training, selects the type of image for annotation, and then extracts the features from the input image along with their values.
• The Next button selects the next image from the database.
• The Back button returns to the main GUI.

Fig. 2 shows a snapshot of image training.
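A compact sketch of how the Train Classifier module's knowledge base could be assembled is shown below. The directory layout, the output file name, and the use of an RGB histogram alone (the full system also stores SIFT and HoG features) are assumptions made for illustration only.

```python
# Illustrative knowledge-base construction for the Train Classifier module.
# Assumed layout: training_images/<annotation_label>/<image files>.
import os
import cv2
import numpy as np

def rgb_histogram(image_bgr, bins=16):
    hist = [cv2.calcHist([image_bgr], [c], None, [bins], [0, 256]).flatten()
            for c in range(3)]
    vec = np.concatenate(hist)
    return vec / (vec.sum() + 1e-9)

knowledge_base = []                      # list of (label, feature_vector) pairs
root = "training_images"
for label in sorted(os.listdir(root)):
    for name in sorted(os.listdir(os.path.join(root, label))):
        img = cv2.imread(os.path.join(root, label, name))
        if img is not None:
            knowledge_base.append((label, rgb_histogram(img)))

# Persist labels and features so the testing phase can match against them.
np.savez("knowledge_base.npz",
         labels=[l for l, _ in knowledge_base],
         features=np.stack([v for _, v in knowledge_base]))
```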
Fig. 3 shows an example in which the user selects an airplane image from the database for training, sets the annotation type to airplane, and finally extracts the features of the airplane image along with their values.
Fig. 2: Snapshot showing image training
Fig. 3: Snapshot showing airplane image with feature extraction
Play Video

• In this GUI, the user selects a video from the database and clicks the Play Video button. The module displays the number of frames present in the input video along with the time required to play it. The number of frames is simply the number of images in the video, and the time is measured as the difference between the end time and the start time of playback (a small sketch of how these values might be obtained follows).
• The Clear button clears the previously playing video.
• The Back button returns to the main GUI.
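The frame count and play time shown by this module could be obtained as in the sketch below; computing the nominal duration from the frame rate is a simplification of the start/end-time measurement described above, and the file name is only an example taken from the test set.

```python
# Sketch: frame count and nominal play time of an input .avi video.
import cv2

cap = cv2.VideoCapture("test4.avi")              # example file name
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
fps = cap.get(cv2.CAP_PROP_FPS)
play_time = frame_count / fps if fps > 0 else 0.0
cap.release()

print(frame_count, round(play_time, 4))
```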
Fig. 4: Snapshot showing multimedia retrieval before and after key frame extraction
• When the user presses the Extract Key Frames button, the key frames are displayed along with their number and the time required to play the input video. When the user clicks the Make Annotation button, the video frames from the database are matched against the key frames of the query video, and the annotated video key frames are displayed as the result (an entropy-based key-frame extraction sketch is given below).
• This GUI includes an example in which the user selects the test4.avi video from the database. It shows that test4.avi contains 116 frames and that the time required to run the video is 4.7344 seconds. When the user clicks Extract Key Frames, 6 key frames are extracted and the time required to run them is 0.5 seconds.
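Key-frame extraction based on frame entropy (Section 3.2) can be sketched as follows; the grayscale-histogram entropy and the fixed difference threshold are illustrative assumptions rather than the authors' exact criterion.

```python
# Sketch: entropy-based key-frame selection from a video. A frame is kept as a
# key frame when its grey-level entropy differs from the previous key frame by
# more than an (assumed) threshold.
import cv2
import numpy as np

def frame_entropy(gray):
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).flatten()
    p = hist / (hist.sum() + 1e-9)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def extract_key_frames(path, threshold=0.25):
    cap = cv2.VideoCapture(path)
    key_frames, last_entropy = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        e = frame_entropy(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))
        if last_entropy is None or abs(e - last_entropy) > threshold:
            key_frames.append(frame)
            last_entropy = e
    cap.release()
    return key_frames

# key_frames = extract_key_frames("test4.avi")   # e.g. a handful of the 116 frames
```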
6. Analysis

This section covers the analysis of the experimental results. It includes graphs showing precision and recall for different annotated video key frames, based on the following counts:

True Positives (TP): positive results that are declared positive.
False Positives (FP): negative results that are declared positive.
Total Positives: the total number of positives.

Precision and recall are computed using the following equations:

Precision = TP / (TP + FP)
Recall = TP / Total Positives
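The snippet below simply restates these measures, together with the accuracy definition used with Table 1; the precision/recall counts in the example are hypothetical, while the accuracy call reuses the totals reported in Table 1.

```python
# Direct restatement of the evaluation measures used in this section.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, total_positives):
    return tp / total_positives

def accuracy(correctly_annotated, actual_frames):
    return correctly_annotated / actual_frames * 100

print(accuracy(47, 53))                    # Table 1 totals -> 88.679...%
print(precision(15, 1), recall(15, 16))    # hypothetical counts for illustration
```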
Fig. 5: Snapshot showing graphs for different images

Table 1: Comparison between before and after key frame extraction, along with correctly annotated frames

Video Name   Before key frame extraction    After key frame extraction     Actual    Correctly
             Frames    Play time (s)        Frames    Play time (s)        Frames    Annotated Frames
Test1.avi    51        2.1563               9         0.48                 9         7
Test2.avi    63        2.6719               12        0.609                12        11
Test3.avi    84        3.4844               16        0.79                 16        15
Test4.avi    116       4.6719               6         0.375                6         5
Test5.avi    213       8.4375               10        0.593                10        9
Total                                                                      53        47

Accuracy = (Correctly Annotated Frames) / (Actual Frames) × 100 = 47/53 × 100 = 88.67925 %
7. Conclusion

The video annotation technique helps the user to access the required video data easily from a large database. Using an ontology-based video annotation system, videos can be searched more quickly and effectively, and analyzing key frames consumes far less time than analyzing an entire video. Accuracy is improved by the key frame extraction process, which uses entropy together with the Prewitt operator and nearest-neighbour methods. The proposed method generates annotations of video key frames based on an ontology, which supports easy retrieval of videos from a huge database. The experimental results show that the proposed system achieves an accuracy of approximately 88.7%.

References
1) B. Vrusias, D. Makris, J. P. Renno, N. Newbold, K. Ahmad, G. Jones, "A Framework for Ontology Enriched Semantic Annotation of CCTV Video", Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'07), IEEE, June 2007.
2) J. W. Jeong, H. K. Hong, D. H. Lee, "Ontology-Based Automatic Video Annotation Technique in Smart TV Environment", IEEE Transactions on Consumer Electronics, Vol. 57, No. 4, pp. 1830-1836, 2011.
3) K. Khurana, M. B. Chandak, "Key Frame Extraction Methodology for Video Annotation", International Journal of Computer Engineering and Technology, Vol. 4, Issue 2, pp. 221-228, March-April 2013.
4) L. Ballan, M. Bertini, A. Del Bimbo, G. Serra, "Video Annotation and Retrieval Using Ontologies and Rule Learning", IEEE Multimedia, Vol. 17, Issue 4, 2010.
5) Lei Bo, Jing Yuanping, Ding Jinlei, "Target Localization Based on SIFT Algorithm", Springer-Verlag Berlin Heidelberg, pp. 1-7, 2011.
6) Liping Ren, Zhiyi Qu, Weiqin Niu, Chaoxin Niu, Yanqiu Cao, "Key Frame Extraction Based on Information Entropy and Edge Matching Rate", 2nd International Conference on Future Computer and Communication, Vol. 3, pp. 91-94, 2010.
7) N. Magesh, P. Thangaraj, "Semantic Image Retrieval Based on Ontology and SPARQL Query", International Journal of Computer Applications (IJCA) - ICACT, No. 1, pp. 12-16, August 2011.
8) P. M. Kamde, Sankirti Shiravale, S. P. Algur, "Entropy Supported Video Indexing for Content Based Video Retrieval", International Journal of Computer Applications (ISSN 0975-8887), Vol. 62, No. 17, January 2013.
9) Shih-Wei Sun, Yu-Chiang Frank Wang, Yao-Ling Hung, Chia-Ling Chang, Kuan-Chieh Chen, Shih-Sian Cheng, Hsin-Min Wang, Hong-Yuan Mark Liao, "Automatic Annotation of Web Videos", IEEE, 2011.