Journal Pre-proofs
Weakly-supervised Large-scale Image Modeling for Sport Scenes and its applications
Congsheng Lu, Feng Zhai
PII: S1047-3203(19)30339-6
DOI: https://doi.org/10.1016/j.jvcir.2019.102718
Reference: YJVCI 102718
To appear in: J. Vis. Commun. Image R.
Received Date: 13 September 2019
Revised Date: 14 November 2019
Accepted Date: 15 November 2019
Please cite this article as: C. Lu, F. Zhai, Weakly-supervised Large-scale Image Modeling for Sport Scenes and its applications, J. Vis. Commun. Image R. (2019), doi: https://doi.org/10.1016/j.jvcir.2019.102718
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier Inc.
Weakly-supervised Large-scale Image Modeling for Sport Scenes and its applications
Congsheng Lu1*, Feng Zhai2
1. Haian Senior School of Jiangsu Province, Nantong 226600, China
2. College of Physical Education, China University of Mining and Technology, Xuzhou 221000, China
The corresponding author email: [email protected]
Abstract: Image modeling of sport scenes plays an important role in sport image classification and analysis. Traditional algorithms for sport image modeling require carefully hand-crafted features, which makes them hard to apply in practice, especially with the emergence of massive-scale data. Weakly-supervised learning algorithms have proven effective in modeling data with image-level labels only. Thus, in this paper, we propose a weakly-supervised learning based method for sport image modeling that does not use bounding box annotations and can serve various sport image applications. More specifically, we first collect large-scale sport images from existing datasets and the Internet, and we annotate them with image-level labels. Subsequently, we leverage a region proposal generation algorithm to select discriminative regions that effectively represent the category of each image. Each region is fed into a pre-trained CNN architecture to extract a deep representation. Afterwards, we design an improved multiple discriminant analysis (MDA) algorithm to project these datapoints into a subspace in which different sport categories are easier to distinguish. Comprehensive experiments show the effectiveness and robustness of our proposed method.
Keywords: Sport scene modeling; weakly-supervised learning; MDA.
1. Introduction
With the development of modern Internet technology, large-scale multimedia data has emerged. Especially in the sport domain, as the volume of data grows, it becomes more and more important to model multimedia data effectively [1-4]. Existing algorithms for modeling multimedia data mainly rely on hand-crafted low-level features such as objects' shape and contour.
These algorithms can also be regarded as global-information-based image modeling, which requires the target objects in the classified image to be aligned and therefore places strict requirements on the input samples. However, these methods cannot reflect the human visual system (HVS) because of the semantic gap. To solve this problem, some researchers proposed various mid-level feature based algorithms for image modeling [5-7]. These mid-level features share some common attributes: richer semantic information, more stable expression, and stronger discrimination. Mid-level feature based algorithms have shown satisfactory performance in image processing tasks such as image recognition and classification. However, most of them require a pre-trained object detector to recognize candidate regions in images. Thus, the performance of these algorithms relies on the effectiveness of these external detectors.
In recent years, convolutional neural network (CNN) based algorithms have achieved impressive performance [8-10]. CNN-based algorithms can automatically extract deep representations of training data. A typical CNN architecture consists of convolutional layers, pooling layers and fully-connected layers, where the convolutional layers extract features of images and the pooling layers reduce the dimension of the features. The fully-connected layers integrate the extracted features into a feature vector. However, supervised learning algorithms require large-scale manual labels and therefore cannot be widely applied in modeling massive multimedia data. In recent years, weakly supervised learning algorithms, which only require image-level labels, have shown effective performance [11-15]. Thus, in this paper, we propose a weakly supervised learning based method for sport image modeling, which can be widely applied in various applications such as sport image classification, recognition and retrieval. The main contributions of our proposed method can be summarized as follows:
1) We propose a novel method for large-scale sport data modeling, which only uses image-level labels instead of accurate bounding box annotations.
2) We design an improved MDA to project the extracted deep representation to a subspace, which can effectively distinguish different image categories.
2. Related work
Our work is related to two research topics: weakly-supervised learning and multiple discriminant analysis. Compared with supervised learning algorithms, weakly-supervised learning only requires image-level labels, and it has achieved competitive performance. Weakly-supervised learning can be widely applied in computer vision and image processing. Ren et al. [16] presented a weakly supervised object localization algorithm based on multiple instance learning (MIL). Besides, the authors designed a bag splitting algorithm to continually remove negative samples from positive samples.
A pre-trained CNN architecture was used to extract deep features from candidate object proposals. Zhu et al. [17] proposed to learn a cross-domain dictionary with a weakly-supervised scheme for visual recognition. Zhang et al. [18] proposed a human fixation prediction method based on weakly-supervised learning, where image-level labels were used to improve the accuracy of predicting human fixations. The method aimed to discover discriminative objects in images, which may attract more human attention than other regions. Wang et al. [19] proposed a large-scale weakly supervised object localization method based on latent category learning (LCL), which only requires image-level labels. Galleguillos et al. [20] leveraged stable segmentations to conduct weakly supervised object localization. A multiple instance learning (MIL) algorithm was used to train an effective classifier from ambiguous labels.
Multiple discriminant analysis (MDA), the multi-class generalization of linear discriminant analysis (LDA), aims to find the projection direction that can distinguish data points [21-23]. Different from principal component analysis (PCA), LDA aims to make the projected points of each category as close as possible, and the distance between the centers of different categories as large as possible. PCA, in contrast, is an unsupervised dimension reduction technique that does not take the category labels of the samples into account. LDA can therefore effectively distinguish different categories of data points. However, the traditional LDA method only considers the global geometry of the datapoints, while their local geometric structure is not exploited. Thus, we leverage an improved LDA method proposed by Cai et al. [24] to better model datapoints.
3. Proposed method
In this paper, we propose a weakly supervised learning based method for modeling multimedia data of sport scenes. The pipeline of our proposed method is shown in Figure 1. We first conduct data collection and image-level annotation. Our data is collected from existing datasets and the Internet.
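To make the discriminant-analysis objective recalled in Section 2 concrete, the following minimal sketch computes the classical two-class Fisher direction, obtained by multiplying the inverse within-class scatter matrix with the difference of the class means, on toy 2-D points. It is only an illustration under stated assumptions, not the implementation used in the experiments: the data, the function names (`fisher_direction`, `project`) and the restriction to two classes in two dimensions are hypothetical, and the locality-preserving graph of [24] is not modeled.

```python
# Two-class Fisher discriminant on toy 2-D points (hypothetical data).
# Classical LDA only; the locality-sensitive variant of [24] is NOT implemented.

def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter(points, m):
    """Sum of outer products (x - m)(x - m)^t, returned as a 2x2 matrix."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """Return w = S_W^{-1} (m_a - m_b) for two classes of 2-D points."""
    ma, mb = mean(class_a), mean(class_b)
    sa, sb = scatter(class_a, ma), scatter(class_b, mb)
    # Within-class scatter is the sum of the per-class scatter matrices.
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(points, w):
    """Project each 2-D point onto the direction w (scalar per point)."""
    return [p[0] * w[0] + p[1] * w[1] for p in points]

# Hypothetical toy features standing in for two sport categories.
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 5.0)]

w = fisher_direction(class_a, class_b)
proj_a, proj_b = project(class_a, w), project(class_b, w)
print("direction:", w)
print("class a projections:", proj_a)
print("class b projections:", proj_b)
```

On this toy data the two sets of projected values do not overlap, which is the behavior the discriminant criterion is designed to encourage. With more than two categories, the single direction is replaced by several projection vectors.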
Figure 1. The pipeline of our proposed method.
3.1 Region proposals and representations
Without any prior information about a given sport image, choosing representative regions of the image is a challenging task. In general, there are $N^4$ candidate regions in an image of size $N \times N$. Only a few of them can represent the entire image, while the other regions can be ignored. For example, for golf, the golf ball and golf club are significant for recognizing the sport, while grass or sky are less important since they may also appear in soccer images. Thus, how to choose these representative regions is a key component. There are various candidate region generation algorithms [25-27]. These algorithms can automatically generate several candidate regions that may contain informative objects. The method proposed in [27] achieved the highest mean average best overlap (MABO) on the VOC2007 dataset. Considering the good performance of the region generation method proposed in [27], we leverage this algorithm in our implementation. To further refine the selected regions, regions smaller than a size threshold are removed since they cannot capture any objects. Regions with an abnormal aspect ratio are also removed. Based on the extracted regions, we aim to extract a deep representation of each region. In recent years, deep learning based algorithms have achieved impressive performance. With a deep neural network (DNN), these algorithms can extract highly representative deep features of images. In our implementation, we adopt the same architecture as [28]. The CNN architecture we use is pre-trained on the ILSVRC2011 dataset, thus it can effectively recognize different objects. The last fully-connected layer is replaced with a pooling layer, which can integrate the extracted
features into a feature vector. In the training step, each extracted region is first resized to 224 × 224.
3.2 Improved multiple discriminant analysis
The traditional multiple discriminant analysis (MDA) algorithm aims to find a projection that effectively separates datapoints with different labels. Traditional MDA only considers the global geometric structure of the datapoints, while the local geometric structure of datapoints with the same label is not exploited. Thus, we leverage an improved MDA algorithm to project our extracted feature vectors into a subspace in which different sport images can be effectively distinguished. Suppose there are $c$ sport image categories. Given a datapoint $\mathbf{x}$, we first construct a nearest neighbor graph $G$, where the within-class scatter matrix $\mathbf{S}_W$ is formulated as:
$$\mathbf{S}_W = \sum_{i=1}^{c'} \mathbf{S}_i \tag{1}$$
where $c'$ denotes the number of categories of the graph $G$ and $\mathbf{S}_i$ denotes the within-class scatter matrix of the $i$-th category. Here $\mathbf{S}_i$ is defined as follows:
$$\mathbf{S}_i = \sum_{\mathbf{x} \in D_i} (\mathbf{x} - \mathbf{m}_G)(\mathbf{x} - \mathbf{m}_G)^t \tag{2}$$
where $D_i$ is the set of datapoints of the $i$-th category and $\mathbf{m}_G$ denotes the average value of datapoints in graph $G$. Our goal is to maximize the distance between classes and minimize the distance within classes. The total scatter matrix $\mathbf{S}_T$ of the graph $G$ is formulated as:
$$\mathbf{S}_T = \sum_{\mathbf{x}} (\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^t \tag{3}$$
where $\mathbf{m}$ denotes the mean of all datapoints. Considering the between-class scatter matrix $\mathbf{S}_B$ of the graph $G$, $\mathbf{S}_T$ is reorganized as $\mathbf{S}_T = \mathbf{S}_W + \mathbf{S}_B$. After projecting the original samples $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$ with a projection matrix $\mathbf{W}$, new samples $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_n$ are obtained. We define the within-class scatter matrix $\tilde{\mathbf{S}}_W$ and the between-class scatter matrix $\tilde{\mathbf{S}}_B$ of the new samples. The goal is to maximize the criterion function
$$J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_B|}{|\tilde{\mathbf{S}}_W|} = \frac{|\mathbf{W}^t \mathbf{S}_B \mathbf{W}|}{|\mathbf{W}^t \mathbf{S}_W \mathbf{W}|} \tag{4}$$
The column vectors of the optimal matrix $\mathbf{W}$ are the eigenvectors corresponding to the largest eigenvalues of the following equation: $\mathbf{S}_B \mathbf{w}_i = \lambda_i \mathbf{S}_W \mathbf{w}_i$. Based on the improved MDA, we can support various applications, such as image classification and image retrieval.
4. Experiment and analysis
In our experiment, we first collect a sport image dataset from existing datasets and the Internet, and we manually annotate it with image-level labels. More specifically, our dataset consists of 90000 images collected from 9 sport types: basketball, soccer, volleyball, golf, athletics, baseball, table tennis, rowing, and equestrian. The snapshot
of our collected dataset is shown in Figure 2. The details of our dataset are shown in Table 1. In order to verify the effectiveness of our method, our experiment consists of two components: sport image classification and retrieval.
Table 1. The details of our collected dataset
Sport types     #Num. of training samples   #Num. of testing samples
Basketball      8000                        2000
Soccer          7500                        2500
Volleyball      7000                        3000
Golf            7500                        2500
Athletics       8000                        2000
Baseball        6500                        3500
Table Tennis    7550                        2450
Rowing          7500                        2500
Equestrian      8000                        2000
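As a quick consistency check on Table 1, the short sketch below sums the per-class counts. The numbers are copied from the table; the variable names are only illustrative.

```python
# Per-class sample counts copied from Table 1: (training, testing).
table1 = {
    "Basketball": (8000, 2000),
    "Soccer": (7500, 2500),
    "Volleyball": (7000, 3000),
    "Golf": (7500, 2500),
    "Athletics": (8000, 2000),
    "Baseball": (6500, 3500),
    "Table Tennis": (7550, 2450),
    "Rowing": (7500, 2500),
    "Equestrian": (8000, 2000),
}

# Column totals and per-class totals.
total_train = sum(train for train, _ in table1.values())
total_test = sum(test for _, test in table1.values())
per_class_totals = {name: train + test for name, (train, test) in table1.items()}

print("training samples:", total_train)
print("testing samples:", total_test)
print("all samples:", total_train + total_test)
print("per-class totals:", per_class_totals)
```

The counts add up to 67550 training and 22450 testing images, i.e. the 90000 images stated in the text, with 10000 images per sport type.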
Figure 2. The snapshot of our collected dataset.
4.1 Sport image classification
In this subsection, we conduct an image classification experiment to test the performance of our proposed method. More specifically, we first use our training set to learn the improved MDA projection. Then, we train an SVM classifier on the features projected by the MDA model. Subsequently, we test the method on our testing set. For a comparative study, we compare our method with several well-known algorithms, including a low-level feature based SVM, the k-nearest neighbor (KNN) algorithm, the methods proposed by Schuldt et al. [29], Dollar et al. [30] and Niebles et al. [31], and a CNN algorithm [32]. The comparative result is shown in Table 2. As we can see, our proposed method achieves the best performance in classifying sport images. Notably, our algorithm can effectively distinguish images with different labels in the projected subspace.
Table 2. The comparison result tested on our dataset using different algorithms (each experiment is repeated 10 times and we take the average value as our final result)
Methods   basketball   soccer   volleyball   golf
SVM       0.5872       0.4852   0.6038       0.5291
KNN       0.6274       0.5963   0.7184       0.5483
Schuldt   0.7295       0.7649   0.7290       0.6092
Dollar    0.7048       0.7802   0.8105       0.6805
Niebles   0.8240       0.8047   0.8264       0.7748
CNN       0.8703       0.8360   0.8693       0.8690
Ours      0.9028       0.9274   0.9195       0.9245

Methods   athletics   baseball   table tennis   rowing
SVM       0.5573      0.5194     0.6037         0.5585
KNN       0.6014      0.5969     0.6763         0.6528
Schuldt   0.7407      0.6087     0.7719         0.7410
Dollar    0.7695      0.6833     0.7930         0.7804
Niebles   0.8380      0.7146     0.8593         0.8766
CNN       0.8907      0.8873     0.8701         0.9375
Ours      0.9365      0.9027     0.9478         0.9528

Methods   equestrian
SVM       0.5825
KNN       0.6926
Schuldt   0.7194
Dollar    0.7753
Niebles   0.8635
CNN       0.9529
Ours      0.9611
4.2 Sport image retrieval
In this subsection, we conduct an image retrieval experiment on our collected dataset. We compare our method with the bag-of-words (BoW) algorithm and DF.FC. The comparison result is shown in Table 3. As we can see, our algorithm achieves the best performance in image retrieval. Figure 3 shows the accuracy and standard deviation of image retrieval.
Table 3. The comparison result of image retrieval (each experiment is repeated 10 times and we take the average value as our final result)
Methods    Accuracy
BoW.1200   0.7842
BoW.4800   0.7903
BoW.1K     0.8265
BoW.10K    0.8517
DF.FC1     0.8603
DF.FC2     0.8879
DF.FC3     0.8730
Ours       0.9173
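The reported accuracies are averages over 10 repetitions, and a standard deviation is reported alongside them. A minimal sketch of that aggregation step is given below. The ten accuracy values are hypothetical placeholders, not measurements from our experiments, and the use of the sample standard deviation (rather than the population one) is an assumption, since the text does not say which was used.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) of repeated-run accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)

# Hypothetical accuracies of 10 repeated runs (placeholder values only).
runs = [0.80, 0.82, 0.81, 0.79, 0.83, 0.80, 0.81, 0.82, 0.80, 0.82]

mean_acc, std_acc = summarize_runs(runs)
print(f"mean accuracy = {mean_acc:.4f}, std = {std_acc:.4f}")
```

Replacing `statistics.stdev` with `statistics.pstdev` would give the population standard deviation instead.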
Figure 3. The accuracy and standard deviation of our experiment.
5. Conclusion
In this paper, we propose a sport image modeling method based on weakly-supervised learning, which does not require bounding box annotations and can be applied to various sport image applications. More specifically, we first collect large-scale sport images from existing datasets and the Internet and annotate them with image-level labels. Then, we use a region proposal generation algorithm to select discriminative regions that effectively represent the image category. Each region is fed into a pre-trained CNN architecture to extract a deep representation. Finally, we design an improved multiple discriminant analysis (MDA) algorithm that projects these datapoints into a subspace in which different sport categories are easier to distinguish.
References
[1] Hanjalic, A. (2005). Adaptive extraction of highlights from a sport video based on excitement modeling. IEEE Trans Multimedia, 7(6), 1114-1122.
[2] Wu, J., Bo, Z., Hua, X. S., & Zhang, J. (2006). A semi-supervised incremental learning framework for sports video view classification. International Multi-media Modelling Conference.
[3] Bertini, M., Bimbo, A. D., & Nunziati, W. (2004). Highlights modeling and detection in sports videos. Pattern Analysis & Applications, 7(4), 411-421.
[4] Wang, F., Ma, Y. F., Zhang, H. J., & Li, J. T. (2005). A Generic Framework for Semantic Sports Video Analysis Using Dynamic Bayesian Networks. International Multimedia Modelling Conference.
[5] Hu, J. F., Hu, J. F., Lu, X., & Zheng, W. S. (2017). Multi-task mid-level feature learning for micro-expression recognition. Pattern Recognition, 66(C), 44-52.
[6] Lin, S., Li, H., Li, C. T., & Kot, A. C. (2018). Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification.
[7] Silos, E. D. L. C., Diaz, I. G., & Maria, E. D. D. (2014). Mid-level feature set for specific event and anomaly detection in crowded scenes. IEEE International Conference on Image Processing.
[8] Sss, K., Ayush, K., & Babu, R. V. (2017). Deepfix: a fully convolutional neural network for predicting human eye fixations. IEEE Transactions on Image Processing, 26(9), 4446-4456.
[9] Li, S., Liu, Z. Q., & Chan, A. B. (2015). Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. International Journal of Computer Vision, 113(1), 19-36.
[10] He, S., Lau, R. W. H., Liu, W., Zhe, H., & Yang, Q. (2015). Supercnn: a superpixelwise convolutional neural network for salient object detection. International Journal of Computer Vision, 115(3), 330-344.
[11] Yazdian-Dehkordi, M., & Azimifar, Z. (2016). Adaptive visual target detection and tracking using weakly supervised incremental appearance learning and RGM-PHD tracker. Journal of Visual Communication & Image Representation, 37(C), 14-24.
[12] Redondo-Cabrera, C., Baptista-Ríos, M., & López-Sastre, R. J. (2018). Learning to exploit the prior network knowledge for weakly-supervised semantic segmentation. IEEE Transactions on Image Processing, PP (99), 1-1.
[13] Wang, C., Huang, K., Ren, W., Zhang, J., & Maybank, S. (2015). Large-scale weakly supervised object localization via latent category learning. IEEE Transactions on Image Processing, 24(4), 1371-1385.
[14] Yan, C., Yang, X., Zhong, B., Zhang, H., & Lin, C. (2016). Network in network based weakly supervised learning for visual tracking. Journal of Visual Communication & Image Representation, 37(C), 3-13.
[15] Siva, P., & Xiang, T. (2011). Weakly supervised object detector learning with model drift detection. IEEE International Conference on Computer Vision.
[16] Ren, W., Huang, K., Tao, D., & Tan, T. (2016). Weakly supervised large scale object localization with multiple instance learning and bag splitting. IEEE T-PAMI, 38(2).
[17] Zhu, F., & Shao, L. (2014). Weakly-supervised cross-domain dictionary learning for visual recognition. International Journal of Computer Vision, 109(1-2), 42-59.
[18] Zhang, L., Li, X., Nie, L., Yang, Y., Xia, Y., & Xia, Y. (2016). Weakly supervised human fixations prediction. IEEE Transactions on Cybernetics, 46(1), 258-269.
[19] Wang, C., Huang, K., Ren, W., Zhang, J., & Maybank, S. (2015). Large-scale weakly supervised object localization via latent category learning. IEEE Transactions on Image Processing, 24(4), 1371-1385.
[20] Galleguillos, C., Babenko, B., Rabinovich, A., & Belongie, S. (2008). Weakly Supervised Object Localization with Stable Segmentations. European Conference on Computer Vision.
[21] Lin, W., Shen, C., & Hengel, A. V. D. (2016). Deep linear discriminant analysis on fisher networks: a hybrid architecture for person re-identification. Pattern Recognition, 65(Complete), 238-250.
[22] Lu, G. F., & Wang, Y. (2012). Feature extraction using a fast null space based linear discriminant analysis algorithm. Information Sciences, 193(11), 72-80.
[23] Siddiqi, M. H., Ali, R., Khan, A. M., Park, Y. T., & Lee, S. (2015). Human facial expression recognition using stepwise linear discriminant analysis and hidden conditional random fields. IEEE Transactions on Image Processing, 24(4), 1386-1398.
[24] Deng, C., He, X., Zhou, K., Han, J., & Bao, H. (2007). Locality sensitive discriminant analysis.
[25] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, "Multiple kernels for object detection," in Proc. Int. Conf. Comput. Vis., 2009, pp. 606-613.
[26] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2189-2202, Nov. 2012.
[27] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, "Selective search for object recognition," Int. J. Comput. Vis., vol. 104, pp. 154-171, 2013.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. Conf., 2012, pp. 1097-1105.
[29] Schüldt, C., Laptev, I., and Caputo, B. Recognizing human actions: A local SVM approach. In ICPR, pp. 32-36, 2004.
[30] Dollár, P., Rabaud, V., Cottrell, G., and Belongie, S. Behavior recognition via sparse spatio-temporal features. In ICCV VS-PETS, pp. 65-72, 2005.
[31] Niebles, J. C., Wang, H., and Fei-Fei, L. Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision, 79(3):299-318, 2008.
[32] Ji, S., Xu, W., Yang, M., & Yu, K. (2012). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221-231.
There is no conflict of interest