Perceptually learning multi-view sparse representation for scene categorization


Accepted Manuscript

Weibin Yin, Dongsheng Xu, Zheng Wang, Zhijun Zhao, Chao Chen, Yiyang Yao

PII: S1047-3203(19)30002-1
DOI: https://doi.org/10.1016/j.jvcir.2019.01.002
Reference: YJVCI 2420

To appear in: J. Vis. Commun. Image R.

Received Date: 21 September 2018
Revised Date: 10 November 2018
Accepted Date: 1 January 2019

Please cite this article as: W. Yin, D. Xu, Z. Wang, Z. Zhao, C. Chen, Y. Yao, Perceptually Learning Multi-view Sparse Representation for Scene Categorization, J. Vis. Commun. Image R. (2019), doi: https://doi.org/10.1016/j.jvcir.2019.01.002

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Perceptually Learning Multi-view Sparse Representation for Scene Categorization

Weibin Yin, State Grid Jiaxing Electric Power Supply Company, [email protected]
Dongsheng Xu, State Grid Jiaxing Electric Power Supply Company, [email protected]
Zheng Wang, State Grid Jiaxing Electric Power Supply Company, [email protected]
Zhijun Zhao, State Grid Jiaxing Electric Power Supply Company, [email protected]
Chao Chen, State Grid Jiaxing Electric Power Supply Company, [email protected]
Yiyang Yao (corresponding author), State Grid Zhejiang Electric Power Co., Ltd. Information & Telecommunication Company, [email protected]

Abstract

Utilizing multi-channel visual features to characterize scenery images is standard practice in state-of-the-art scene recognition systems. However, how to encode human visual perception for scenery image modeling and how to optimally combine visual features from multiple views remain tough challenges. In this paper, we propose a perceptual multi-view sparse learning (PMSL) framework to distinguish sceneries from different categories. Specifically, we first project regions from each scenery into the so-called perceptual space, which is established by combining human gaze behavior, color, and texture. Afterward, a novel PMSL is developed which fuses the above three visual cues into a sparse representation. PMSL can handle absent channel visual features, which frequently occur in practical circumstances. Finally, the sparse representation of each scenery image is incorporated into an image kernel, which is further fed into a kernel SVM for scene categorization. Comprehensive experimental results on popular data sets demonstrate the superiority of our method over well-known shallow/deep recognition models.

1. Introduction

Scene classification is a key technology in many kinds of real-world artificial intelligence systems. For example, in robot path planning [1], it is important to ensure that the


robot can calculate the shortest path between pairs of positions intelligently. This requires the robot to dynamically analyze different landscape types, such as topological landscape and terrain level. In video monitoring, advanced methods learn scene-related contextual features [2], [3], such as road direction and topology, in order to improve video-based human/vehicle tracking; traffic accidents are likely to occur at road intersections but much less likely on straight segments, so an intelligent video monitoring system benefits from effectively identifying different scene types. In the literature, researchers have introduced several scene categorization systems based on multi-view feature integration. Generally, multi-view feature fusion techniques can be divided into two groups: multi-cue integration and modality-identification-based integration. Multi-cue integration treats each type of feature as a modality, and a series of such methods have been proposed. In [4], each modality corresponds to a sub-classifier which produces a recognition accuracy based on the feature of that modality, and the sub-classifier with the highest recognition rate is treated as the final decision of the system. Kittler et al. [5] presented a theoretical framework which constructs the result by integrating the decisions from all the sub-classifiers. Score-level feature integration [6] converts the multimodal feature vectors into a set of class labels predicted by the sub-classifiers, and the final decision is derived by integrating the labels assigned by each sub-classifier through a combining strategy, such as the product rule or the sum rule. In [7],

each modality is represented as a graph, and the graphs corresponding to different modalities are integrated into a single one by an adaptation scheme. The decision tree [8] is another multimodal feature integration method wherein each modality is represented by a node; the decision at each node is first made by an accumulation scheme, and these decisions are then multiplied by coefficients which reflect the reliability of the corresponding modality in the recognition process. Further, in [9], features from each modality are represented by an independent graph, and different learning tasks are formulated from the constraints in every graph as well as the supervision information. To exploit the statistical properties of different types of multimodal features, multi-view spectral embedding [10] was presented, wherein different types of features are grouped in a physically meaningful way. In summary, all the above multi-cue integration methods alleviate the curse of dimensionality significantly. However, grouping each type of feature into a modality is heuristic, which is not always feasible and prevents these methods from being widely used. In order to overcome this limitation of multi-cue integration, modality-identification-based integration has been proposed recently. Feature-level integration concatenates different types of multimodal features into one single vector, and linear discriminant analysis (LDA) is employed to further increase the discriminative ability of this vector. However, feature-level integration has two limitations: 1) it does not explicitly consider the relationships between features, and 2) the number of features obtained from LDA cannot exceed C - 1, where C is the number of categories. Aiming to obtain a better combination of multimodal features, Wu et al. [11] rearrange features in different modalities by introducing a modality identification step, and the different modalities of features are integrated in the kernel space by a support vector machine (SVM). Although better recognition accuracy is observed with this technique, the binary correlation assumed between different multimodal features is not consistent with real conditions, e.g., an object described by multiple cameras. In order to alleviate the aforementioned problems, we propose a novel image kernel based on multi-view feature fusion. Our image kernel accurately captures human gaze shifting paths, based on which multiple visual features are integrated into a sparse representation to describe each scenery image. In particular, given a set of scene images associated with their tags, we first form color, texture, and gaze-behavior channels. Noticeably, the gaze behavior is obtained by sequentially connecting actively selected visually salient regions. Afterward, a sparse model is designed which optimally fuses multimodal features from partially-labeled photos. The sparse model can be efficiently solved using an iterative

parameter optimization algorithm. Finally, the sparse representations from all the scenery images are integrated into an image kernel, which is subsequently fed into a multi-class support vector machine (SVM) for scene classification.

2. Related Work

Our work is closely related to sparse representation and multi-view learning, and can be analyzed from these two perspectives. In recent years, many algorithms for sparse representation and multi-view learning have been proposed in the literature. Zhang et al. [12] proposed a heterogeneous tensor decomposition model based on the Lagrangian algorithm; low-rank regularization constraints improve its robustness, and when applied to subspace clustering it shows good convergence. Jing et al. [13], in order to improve image memorability prediction, proposed an external-source framework for multi-view transfer learning and used an iterative algorithm to solve the optimization problem of the joint framework. Jing et al. [14] proposed a multi-view learning framework specifically for predicting and improving the popularity of micro-videos on social media sites; the framework is a low-rank embedded regression model with the ability to prevent over-fitting. Jing et al. [15] proposed a joint low-rank and sparse regression model to improve the predictive power of image memorability; it can be extended to a multi-view framework to eliminate heterogeneity, and the framework is robust. Cheng et al. [16] proposed a context-aware music recommendation system that enhances the perception of the user's location. Cheng et al. [17] proposed a new aspect-aware recommendation model, A3NCF, which aims to learn different preference vectors and projected features of users; the model shows good generalization performance compared with systems of the same type. Cheng et al. [18], in order to address the problem of poor recommendations to users, proposed a review-text model with rating prediction, which focuses on perceiving latent factors to rank the text and weight its importance, and which performs markedly better in this respect than other approaches. Dhillon et al. [19] proposed a new multi-view learning algorithm which uses a fast low-rank spectral method to model the contextual semantics of words and uses them as feature representations for natural language processing classifiers; the algorithm is very fast and converges well. Xu et al. [20] proposed a multi-view learning method based on integrated analysis to represent a large amount of valuable information in the latent intact space of the encoded data; the complementarity in the algorithm increases the stability and generalization ability across multiple views, and newly developed techniques have been adopted to further optimize it.

Wang et al. [21], to address the limitation of processing a single information source with a multi-view learning algorithm, developed a new algorithm that re-presents the original data as multiple matrices; it fuses the Universum example with their previously developed Multi-MHKS technique to form a more flexible multi-view learning algorithm, UMultiV-MHKS, which not only improves classification ability but also generalizes better than competing algorithms.

3. Our Proposed Approach

In total, our proposed scene categorization model involves three components: multi-view feature extraction, a novel sparse learning algorithm, and multi-class SVM training, which are introduced in the following.

3.1. Multi-view Feature Extraction

Gaze Shifting Path (GSP) Construction: Research in both cognitive science and psychology [22], [23] has shown that, before perceiving an entire scenery, humans tend to perceive objects first, i.e., they select the attended object locations. Thereafter, the human vision system processes only parts of each scenery in detail, while leaving the others nearly unprocessed. By leveraging such experience, we encode human gaze behavior into our perceptual scene categorization framework. As the human eye typically attends to foreground objects, we first deploy an objectness measure to obtain hundreds of object patches. It is widely acknowledged that a successfully designed objectness measure should exhibit the following four advantages: 1) achieving highly accurate object detection, as any false detection cannot be recovered later; 2) producing a succinct set of object patches, which improves the subsequent GSP construction; 3) being highly efficient in order to be flexibly embedded into a variety of applications; and 4) exhibiting an outstanding generalization ability to unknown object categories, so that the model is adaptive to different image sets. To satisfy these requirements, we employ the so-called BING feature proposed by Cheng et al. [24] as the objectness measure. More specifically, the BING feature first resizes each image window to a small fixed size (8 x 8) and subsequently utilizes the binarized norm of its gradients as the descriptor. It achieves high object detection accuracy while maintaining an extraordinarily fast speed. In practice, the number of object patches output from [24] is intolerably large. By leveraging the active viewing mechanism of biological vision, a novel active learning algorithm is utilized to select a few visually salient object patches for modeling human gaze behavior. Theoretically, a successful active learning algorithm should reflect the underlying data structure. Based on the local geometry preservation property of object patches inside each scenery image, each object patch can be linearly reconstructed from its spatially neighboring ones, and the optimal reconstruction coefficients are calculated by:

$\min_{\mathbf{W}} \sum_{i=1}^{M} \big\| \mathbf{x}_i - \sum_{j \in \mathcal{N}(i)} w_{ij}\,\mathbf{x}_j \big\|_2^2$,   (1)

where $\mathbf{x}_i$ represents the feature of the i-th object patch, $w_{ij}$ denotes the contribution of the j-th object patch to the reconstruction of the i-th one, $M$ denotes the total number of object patches within a scenery, and $\mathcal{N}(i)$ contains the spatial neighbors of object patch $\mathbf{x}_i$. Eq. (1) is solved using a greedy search algorithm, i.e., the representative object patches are discovered sequentially, which is similar to human perception: in the real world, humans first fixate on the most visually/semantically salient region, then shift their gaze to the second most salient one, and so on. In this way, we sequentially link the actively selected object patches into a GSP to describe human visual perception. Notably, GSPs are 2-dimensional features which cannot be easily fed into a conventional classifier. In our method, we use the deep model [25] proposed by Zhang et al. to obtain a deep representation for each GSP.
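For concreteness, the following is a minimal sketch (not the authors' implementation) of how the reconstruction coefficients in Eq. (1) and the subsequent selection of salient patches could be realized. It uses a plain per-patch least-squares solve plus an importance ranking in place of the paper's greedy search; all names (`reconstruction_coefficients`, `greedy_gaze_path`, `neighbor_idx`) are illustrative assumptions.

```python
import numpy as np

def reconstruction_coefficients(patches, neighbor_idx):
    """Least-squares estimate of the weights w_ij in Eq. (1): each patch feature
    x_i (a row of `patches`) is approximated by a linear combination of its
    spatially neighboring patches."""
    M = patches.shape[0]
    W = np.zeros((M, M))
    for i in range(M):
        nbrs = neighbor_idx[i]                 # indices of the spatial neighbors of patch i
        A = patches[nbrs].T                    # columns are neighbor feature vectors
        w, *_ = np.linalg.lstsq(A, patches[i], rcond=None)
        W[i, nbrs] = w
    return W

def greedy_gaze_path(patches, neighbor_idx, k=5):
    """Pick the k patches that contribute most to reconstructing the others and
    order them from most to least salient, as a stand-in for the greedy search."""
    W = reconstruction_coefficients(patches, neighbor_idx)
    importance = np.abs(W).sum(axis=0)         # total reconstruction contribution of each patch
    return list(np.argsort(importance)[::-1][:k])
```

In practice the neighbor sets would come from the spatial layout of the BING object patches within the image.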

Color & Textural Feature Extraction: We use both color and texture information to characterize each atomic region, because color and texture are complementary in measuring region properties and both can be extracted efficiently. We detail the extraction of these two types of image features as follows. 1) Color feature: we use color moments to describe the color distribution of each scenery image; color moments are widely used for image representation in classification and content-based image retrieval (CBIR). 2) Texture feature: we use the well-known histogram of oriented gradients (HOG) feature to model the texture information of each scenery image. HOG has the advantage of being invariant to local geometric and photometric transformations, e.g., translations or rotations, and is therefore widely used in object detection and tracking.
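As an illustration of the color and texture channels described above, the sketch below extracts 9-dimensional color moments and a HOG descriptor with scikit-image; the specific parameter choices (three moments per RGB channel, 9 orientations, 8 x 8 cells) are common defaults and are assumptions rather than the exact settings used in the paper.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog

def color_moments(rgb):
    """First three color moments (mean, std, skewness) per RGB channel -> 9-dim vector."""
    pixels = rgb.reshape(-1, 3).astype(np.float64)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])

def texture_hog(rgb):
    """HOG descriptor computed on the grayscale image (9 orientations, 8x8 cells)."""
    gray = rgb2gray(rgb)
    return hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def region_features(rgb):
    """Color and texture description of one region, one vector per channel."""
    return color_moments(rgb), texture_hog(rgb)
```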

3.2. Multi-view Sparse Learning

Without loss of generality, we formulate our proposed multi-view sparse learning in a two-view setting. Consider N scene images represented by dual-modality features $\{(\mathbf{x}_i^A, \mathbf{x}_i^B)\}_{i=1}^{N}$, where $\mathbf{x}_i^A \in \mathbb{R}^{d_A}$ denotes the $d_A$-dimensional feature of the i-th scene image from the first modality and $\mathbf{x}_i^B \in \mathbb{R}^{d_B}$ denotes the $d_B$-dimensional feature of the i-th scene image from the second modality. Notably, in typical settings, $d_A \neq d_B$. In practice, a few channel features are usually absent during multi-view learning. In such a partial-modality environment, the scene image set is represented by $\mathcal{X} = \mathcal{X}_C \cup \mathcal{X}_A \cup \mathcal{X}_B$, where $\mathcal{X}_C$ denotes the scene images common to both views, and $\mathcal{X}_A$ and $\mathcal{X}_B$ denote the scene images that exist only in view A and view B, respectively. The number of scene images present in both views is denoted by $N_C$, and the numbers present only in view A and view B are denoted by $N_A$ and $N_B$, respectively. The features of the common scene images in modalities A and B are represented by $\mathbf{X}_C^A$ and $\mathbf{X}_C^B$, respectively, and those of the view-only images by $\widetilde{\mathbf{X}}^A$ and $\widetilde{\mathbf{X}}^B$. We then represent the scene image features in each view by $\mathbf{X}^A = [\mathbf{X}_C^A, \widetilde{\mathbf{X}}^A]$ and $\mathbf{X}^B = [\mathbf{X}_C^B, \widetilde{\mathbf{X}}^B]$, respectively. Based on the above notation, given N scene images with possibly missing features, the objective of our multi-view sparse learning is to calculate the unified sparse codes associated with the corresponding pairwise dictionaries $\mathbf{D}^A$ and $\mathbf{D}^B$, that is,

$\min_{\mathbf{D}^A,\mathbf{D}^B,\mathbf{S}_C,\mathbf{S}^A,\mathbf{S}^B} \ \|\mathbf{X}_C^A-\mathbf{D}^A\mathbf{S}_C\|_F^2+\|\mathbf{X}_C^B-\mathbf{D}^B\mathbf{S}_C\|_F^2+\|\widetilde{\mathbf{X}}^A-\mathbf{D}^A\mathbf{S}^A\|_F^2+\|\widetilde{\mathbf{X}}^B-\mathbf{D}^B\mathbf{S}^B\|_F^2+\lambda\big(\|\mathbf{S}_C\|_1+\|\mathbf{S}^A\|_1+\|\mathbf{S}^B\|_1\big)$

s.t. $\|\mathbf{d}_k^A\|_2 \le 1$, $\|\mathbf{d}_k^B\|_2 \le 1$, $\forall k$,   (2)

where $\mathbf{S}_C$ denotes the joint sparse codes for the common scene images, and $\mathbf{S}^A$ and $\mathbf{S}^B$ denote the sparse codes for the images present only in view A and view B, respectively. Note that sharing the codes $\mathbf{S}_C$ ensures that the codes generated from the different views are consistent. It is worth emphasizing that the above objective function (2) is solved using an alternating iterative algorithm: we first fix the dictionaries and update the sparse codes, and then fix the sparse codes and update the dictionaries. These two operations are conducted iteratively until convergence.
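The alternating scheme can be sketched as follows, assuming the reconstructed form of Eq. (2) above. The sparse coding step is delegated to scikit-learn's Lasso, the dictionary update is a least-squares refit with unit-norm atoms, and code consistency across views is enforced by coding the common images against the stacked dictionaries. This is a simplified stand-in for the paper's solver, not the exact algorithm; all names and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_codes(D, X, lam):
    """Column-wise sparse coding: min_s ||x - D s||^2 + lam*||s||_1 for each column x of X."""
    coder = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    return np.column_stack([coder.fit(D, x).coef_ for x in X.T])

def update_dictionary(X, S):
    """Least-squares dictionary refit for X ~ D S, followed by unit-norm column scaling."""
    D = np.linalg.lstsq(S.T, X.T, rcond=None)[0].T
    return D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-12)

def partial_two_view_sparse_learning(Xc_A, Xc_B, Xo_A, Xo_B,
                                     k=64, lam=0.1, n_iter=15, seed=0):
    """Alternating sketch of Eq. (2): fix dictionaries and update codes, then fix codes
    and update dictionaries. Common images are coded against the stacked dictionary so
    both views share one code matrix Sc; view-only images get their own codes."""
    rng = np.random.default_rng(seed)
    dA, dB = Xc_A.shape[0], Xc_B.shape[0]
    DA = rng.standard_normal((dA, k)); DA /= np.linalg.norm(DA, axis=0)
    DB = rng.standard_normal((dB, k)); DB /= np.linalg.norm(DB, axis=0)
    for _ in range(n_iter):
        # code update with dictionaries fixed
        Sc = sparse_codes(np.vstack([DA, DB]), np.vstack([Xc_A, Xc_B]), lam)  # shared codes
        SA = sparse_codes(DA, Xo_A, lam)   # codes for images observed only in view A
        SB = sparse_codes(DB, Xo_B, lam)   # codes for images observed only in view B
        # dictionary update with codes fixed
        DA = update_dictionary(np.hstack([Xc_A, Xo_A]), np.hstack([Sc, SA]))
        DB = update_dictionary(np.hstack([Xc_B, Xo_B]), np.hstack([Sc, SB]))
    return DA, DB, Sc, SA, SB
```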

3.3. Image Kernel Learning for SVM

As aforementioned, each scene image can be represented by its sparse codes. In our approach, we integrate these sparse codes into an image kernel for scene categorization. The kernel is calculated based on the distances between pairwise scene images. Given a scene image, we convert its sparse codes into a vector $\mathbf{v} = [v_1, \ldots, v_N]$, wherein each element is defined as:

$v_j = \frac{1}{Z}\exp\big(-\mathrm{dist}(\mathbf{s}, \mathbf{s}_j)\big)$,   (3)

where $\mathrm{dist}(\cdot,\cdot)$ denotes the Euclidean distance between pairwise vectors, $\mathbf{s}$ represents the sparse codes learned from the scene image, N denotes the number of training scene images, and $Z$ is a normalization factor. Based on the feature vector calculated above, we train a multi-class SVM [26]. Given R different scenery categories, we actually train R(R-1)/2 binary SVM classifiers. For training scene images from the p-th and q-th categories, we train a binary SVM classifier as:

$\min_{\mathbf{w}_{pq},\,b_{pq},\,\boldsymbol{\xi}} \ \frac{1}{2}\|\mathbf{w}_{pq}\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \ \ y_i(\mathbf{w}_{pq}^{\top}\mathbf{v}_i + b_{pq}) \ge 1-\xi_i, \ \xi_i \ge 0$,   (4)

where $\mathbf{v}_i$ represents the feature vector corresponding to the i-th training scene image; $y_i$ denotes the class label (+1 for the p-th category and -1 for the q-th category) of the i-th training scene image; $\mathbf{w}_{pq}$ denotes the hyperplane separating scene images in the p-th category from those in the q-th category; $C$ trades the machine complexity off against the number of nonseparable scene images; and $n$ is the number of training scene images from either the p-th or the q-th category.
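A minimal sketch of the image-kernel feature of Eq. (3) and the subsequent multi-class SVM is given below; the unit-sum normalization and the use of scikit-learn's one-vs-one SVC are assumptions standing in for the paper's own kernel machine, and the variable names are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def kernel_feature(codes, train_codes):
    """Eq. (3)-style vector: element j is a normalized, exponentiated negative
    Euclidean distance between this image's sparse codes and training image j's codes."""
    d = np.linalg.norm(train_codes - codes, axis=1)   # distances to all N training images
    v = np.exp(-d)
    return v / v.sum()                                # normalization factor Z

def train_and_predict(S_train, y_train, S_test):
    """S_train: (N, k) sparse codes of training images; y_train: their labels;
    S_test: sparse codes of test images. Returns predicted scene categories."""
    V_train = np.vstack([kernel_feature(s, S_train) for s in S_train])
    V_test = np.vstack([kernel_feature(s, S_train) for s in S_test])
    clf = SVC(kernel="linear", decision_function_shape="ovo")  # one-vs-one: R(R-1)/2 classifiers
    clf.fit(V_train, y_train)
    return clf.predict(V_test)
```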

4. Experimental Results and Analysis

To validate the performance of the proposed PMSL-based multimodal feature integration, we carry out recognition tasks on several popular datasets. All three experiments run on a system equipped with an Intel i7-6700K CPU and 24 GB RAM. Our approach is implemented on the Matlab 2013b platform.

4.1. Comparison with Feature Fusion Algorithms

Experiment 1: Corel [27] contains 7,700 images from 77 categories, which makes it a large and heterogeneous image dataset. Some samples from Corel are given in Figure 1. In this experiment, two types of features, a 64-dimensional color histogram and a 64-dimensional color texture moment, are extracted from each image. The dataset is equally split into two sets, one for training and the other for testing.

Figure 1. Image samples in Corel

First of all, we compare our algorithm with three well-known dimensionality reduction algorithms, i.e., LDA, kernel LDA (with a Gaussian kernel), and Locality Sensitive Discriminant Analysis (LSDA) [26]. Besides, to demonstrate the advantage of our algorithm over other feature integration algorithms, we further compare it with four well-known feature fusion methods, i.e., feature-level integration, match-score-level integration (with the product, sum, max, and min rules), super kernel integration (with linear and Gaussian kernels), and adaptive cue integration (with the Frobenius norm and KL-divergence). Toward a fair comparison, the multimodal features with no integration are also evaluated. In this experiment, a linear SVM [28] is chosen as the classifier; the parameter of the Gaussian kernel is determined by 5-fold cross validation; the number of nearest neighbours in LSDA is tuned from 0 to 10. We randomly split the dataset into two equal-size parts, one for training and the other for testing. This process is repeated 10 times,

and the average recognition accuracy of the 9 algorithms is presented in Figure 2. As shown in Figure 2, our algorithm achieves the best recognition accuracy on average. It is noticeable that the recognition accuracies of LDA and LSDA are low, which demonstrates their weakness in exploiting the relationships between features. Besides, the worst recognition result is obtained by score-level feature integration under the min rule, which is consistent with the experimental results in [26]. We observe that, among the 77 categories, the best recognition accuracy on xx categories is achieved by our algorithm, which further demonstrates the robustness of our algorithm.

Figure 2. Average recognition accuracy of the 9 compared feature integration algorithms (SI(P), SI(S), SI(Max), and SI(Min) denote score-level integration with the product, sum, maximum, and minimum rules, respectively; AI(F) and AI(KL) denote adaptive feature integration with the Frobenius norm and KL-divergence, respectively; SI(L) and SI(G) denote super kernel integration with linear and Gaussian kernels, respectively.)
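The repeated random-split protocol used above can be summarized by the following sketch; the stratified 50/50 split and the linear SVM stand in for the exact setup and are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def repeated_holdout_accuracy(X, y, n_repeats=10, seed=0):
    """Average accuracy over repeated 50/50 random splits, as in Experiment 1."""
    accs = []
    for r in range(n_repeats):
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5,
                                              stratify=y, random_state=seed + r)
        clf = LinearSVC().fit(Xtr, ytr)
        accs.append(clf.score(Xte, yte))
    return float(np.mean(accs))
```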

Figure 3. Sample human body masks in ViHASi

Figure 4. Average recognition accuracy of the 9 compared feature integration algorithms

Experiment 2: ViHASi [29] is a synthetic dataset for human action recognition based on action silhouettes. It contains 20 action classes performed by 11 actors and captured from 12 to 20 synchronized perspective camera views, each frame corresponding to a human body mask, i.e., a 640x480-pixel binary image as presented in Figure 3. To represent the actions in each camera view, we treat the corresponding human body mask as a 640x480-dimensional feature vector and employ PCA to reduce it to a 100-dimensional feature vector. In this experiment, each action is described by three different views, and thus a 300-dimensional feature vector is obtained to represent each action; the whole dataset is represented by 50,960 feature vectors. We use the same experimental settings as in Experiment 1 and report the recognition accuracy of the 9 algorithms in Figure 4. As seen, our algorithm achieves the best average recognition accuracy again.
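The per-view PCA reduction and concatenation described above can be sketched as follows; fitting an independent PCA per camera view and the function name are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

def masks_to_multiview_features(view_masks, n_components=100):
    """Each view's 640x480 binary masks are flattened and PCA-reduced to 100 dims;
    the per-view vectors are then concatenated (3 views -> 300-dim action descriptor)."""
    reduced = []
    for masks in view_masks:                       # masks: array of shape (n_samples, 640, 480)
        flat = masks.reshape(len(masks), -1).astype(np.float64)
        reduced.append(PCA(n_components=n_components).fit_transform(flat))
    return np.hstack(reduced)                      # shape (n_samples, 100 * n_views)
```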

4.2. Comparison with other Scene Classification Models

We compare our proposed method with a series of popular shallow recognition models, including 1) the conventional spatial pyramid matching kernel (SPM) and its three variants, LLC-SPM [30], SC-SPM [31], and OB-SPM [32], and 2) super-vector image encoding [33]. Due to the promising performance of deep architectures, it is also necessary to compare our method with a set of deep recognition models, i.e., the ImageNet-based CNN [34], the RCNN [35], the meta-object-based CNN (M-CNN) [36], the deep-mining-based CNN (DM-CNN) [37], and the spatial pyramid pooling-based CNN (SPP-CNN) [38]. We experiment on two well-known scene image sets: SUN397 [39] and Places 205 [40]. As shown by the experimental results reported in Table I, our method outperforms its competitors remarkably.

Table I. Average precision of the above shallow/deep recognition models on the two scene image sets

             LLC-SPM   SC-SPM   OB-SPM   SV      IM-CNN   RCNN    M-CNN   DM-CNN   SPP-CNN   Ours
SUN397       48.5%     43.4%    42.1%    39.6%   44.3%    41.1%   42.4%   43.7%    45.7%     52.4%
Places 205   43.3%     38.7%    39.6%    31.2%   40.6%    38.9%   39.6%   37.8%    40.6%     49.4%

5. Conclusions

In this paper, we propose an effective image kernel for scene categorization, which optimally encodes the human gaze shifting process. Given a collection of scene images,

we first extract multiple visual features to describe each scene image, wherein the human gaze behavior is represented by an active learning paradigm. Then, a multi-view sparse learning algorithm is designed to integrate these multi-view features. Finally, the learned sparse representations are integrated into an image kernel for scene categorization. Extensive empirical results demonstrate the effectiveness of our method.

Conflict of interest

There is no conflict of interest.

6. References

[1] Ude, A., & Dillmann, R. (1994). Vision-based robot path planning. In Advances in Robot Kinematics and Computational Geometry (pp. 505-512). Springer, Dordrecht.
[2] Yuan, C., Wu, B., Li, X., Hu, W., Maybank, S., & Wang, F. (2016). Fusing R features and local features with context-aware kernels for action recognition. International Journal of Computer Vision, 118(2), 151-171.
[3] Hu, W., Ding, X., Li, B., Wang, J., Gao, Y., Wang, F., & Maybank, S. (2016). Multi-perspective cost-sensitive context-aware multi-instance sparse coding and its application to sensitive video recognition. IEEE Transactions on Multimedia, 18(1), 76-89.
[4] Woods, K., Kegelmeyer, W. P., & Bowyer, K. (1997). Combination of multiple classifiers using local accuracy estimates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 405-410.
[5] Kittler, J., Hatef, M., Duin, R. P., & Matas, J. (1998). On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 226-239.
[6] Zhou, X., & Bhanu, B. (2006). Integrating face and gait for human recognition. In CVPR Workshops 2006 (pp. 55-55). IEEE.
[7] Sun, Z. (2003). Adaptation for multiple cue integration. In CVPR 2003 (Vol. 1, pp. I-I). IEEE.
[8] Nilsback, M. E., & Caputo, B. (2004). Cue integration through discriminative accumulation. In CVPR 2004 (Vol. 2, pp. II-II). IEEE.
[9] Tong, H., He, J., Li, M., Zhang, C., & Ma, W. Y. (2005). Graph based multi-modality learning. In Proceedings of the 13th Annual ACM International Conference on Multimedia (pp. 862-871). ACM.
[10] Zhou, X., & Bhanu, B. (2008). Feature fusion of side face and gait for video-based human identification. Pattern Recognition, 41(3), 778-795.
[11] Wu, Y., Chang, E. Y., Chang, K. C. C., & Smith, J. R. (2004). Optimal multimodal fusion for multimedia data analysis. In Proceedings of the 12th Annual ACM International Conference on Multimedia (pp. 572-579). ACM.
[12] Zhang, J., Li, X., Jing, P., Liu, J., & Su, Y. (2018). Low-rank regularized heterogeneous tensor decomposition for subspace clustering. IEEE Signal Processing Letters, 25(3), 333-337.
[13] Jing, P., Su, Y., Nie, L., & Gu, H. (2017). Predicting image memorability through adaptive transfer learning from external sources. IEEE Transactions on Multimedia, 19(5), 1050-1062.
[14] Jing, P., Su, Y., Nie, L., Bai, X., Liu, J., & Wang, M. (2017). Low-rank multi-view embedding learning for micro-video popularity prediction. IEEE Transactions on Knowledge and Data Engineering.
[15] Jing, P., Su, Y., Nie, L., Gu, H., Liu, J., & Wang, M. (2018). A framework of joint low-rank and sparse regression for image memorability prediction. IEEE Transactions on Circuits and Systems for Video Technology.
[16] Cheng, Z., & Shen, J. (2016). On effective location-aware music recommendation. ACM Transactions on Information Systems (TOIS), 34(2), 13.
[17] Cheng, Z., Ding, Y., He, X., Zhu, L., Song, X., & Kankanhalli, M. S. (2018). A^3NCF: An adaptive aspect attention model for rating prediction. In IJCAI (pp. 3748-3754).
[18] Cheng, Z., Ding, Y., Zhu, L., & Kankanhalli, M. (2018). Aspect-aware latent factor model: Rating prediction with ratings and reviews. arXiv preprint arXiv:1802.07938.
[19] Dhillon, P., Foster, D. P., & Ungar, L. H. (2011). Multi-view learning of word embeddings via CCA. In Advances in Neural Information Processing Systems (pp. 199-207).
[20] Xu, C., Tao, D., & Xu, C. (2015). Multi-view intact space learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(12), 2531-2544.
[21] Wang, Z., Zhu, Y., Liu, W., Chen, Z., & Gao, D. (2014). Multi-view learning with Universum. Knowledge-Based Systems, 70, 376-391.
[22] Wolfe, J. M., & Horowitz, T. S. (2004). What attributes guide the deployment of visual attention and how do they do it? Nature Reviews Neuroscience, 5(6), 495.
[23] Bruce, N. D., & Tsotsos, J. K. (2009). Saliency, attention, and visual search: An information theoretic approach. Journal of Vision, 9(3), 5-5.
[24] Cheng, M. M., Zhang, Z., Lin, W. Y., & Torr, P. (2014). BING: Binarized normed gradients for objectness estimation at 300fps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3286-3293).
[25] Zhang, L., Liu, M., Chen, L., Qiu, L., Zhang, C., Hu, Y., & Zimmermann, R. (2018). Online modeling of esthetic communities using deep perception graph analytics. IEEE Transactions on Multimedia, 20(6), 1462-1474.
[26] Cai, D., He, X., Zhou, K., Han, J., & Bao, H. (2007). Locality sensitive discriminant analysis. In IJCAI 2007 (pp. 1713-1726).
[27] http://corel.digitalriver.com.
[28] Chang, Y. W., & Lin, C. J. (2008). Feature ranking using linear SVM. In Causation and Prediction Challenge (pp. 53-64).
[29] Ragheb, H., Velastin, S., Remagnino, P., & Ellis, T. (2008). ViHASi: Virtual human action silhouette data for the performance evaluation of silhouette-based action recognition methods. In Second ACM/IEEE International Conference on Distributed Smart Cameras (ICDSC 2008) (pp. 1-10). IEEE.
[30] Harchaoui, Z., & Bach, F. (2007). Image classification with segmentation graph kernels. In CVPR 2007 (pp. 1-8). IEEE.
[31] Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. In CVPR 2010 (pp. 3360-3367). IEEE.
[32] Yang, J., Yu, K., Gong, Y., & Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In CVPR 2009 (pp. 1794-1801). IEEE.
[33] Zhou, X., Yu, K., Zhang, T., & Huang, T. S. (2010). Image classification using super-vector coding of local image descriptors. In European Conference on Computer Vision (pp. 141-154). Springer, Berlin, Heidelberg.
[34] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097-1105).
[35] Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580-587).
[36] Wu, R., Wang, B., Wang, W., & Yu, Y. (2015). Harvesting discriminative meta objects with deep CNN features for scene classification. In Proceedings of the IEEE International Conference on Computer Vision (pp. 1287-1295).
[37] Li, Y., Liu, L., Shen, C., & van den Hengel, A. (2015). Mid-level deep pattern mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 971-980).
[38] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9), 1904-1916.
[39] Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In CVPR 2010 (pp. 3485-3492). IEEE.
[40] Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., & Oliva, A. (2014). Learning deep features for scene recognition using Places database. In Advances in Neural Information Processing Systems (pp. 487-495).