Detection based object labeling of 3D point cloud for indoor scenes


Neurocomputing 174 (2016) 1101–1106


Wei Liu a,b,1, Shaozi Li b,∗, Donglin Cao b, Songzhi Su b, Rongrong Ji b

a Department of Information Engineering, East China Jiaotong University, China
b Department of Cognitive Science, Xiamen University, China
∗ Corresponding author. E-mail address: [email protected] (S. Li).
1 This work was done when the author was a Ph.D. candidate at Xiamen University.

Article history: Received 21 November 2014; received in revised form 2 October 2015; accepted 2 October 2015; available online 19 October 2015. Communicated by Ke Lu.

Abstract

While much exciting progress is being made in the 3D reconstruction of scenes, object labeling of 3D point clouds for indoor scenes remains a challenging issue. How should we exploit the reference images of a 3D scene to aid scene parsing? In this paper, we propose a framework for 3D indoor scene labeling based on object detection in the RGB-D frames of the 3D scene. First, the point cloud is segmented into homogeneous segments. Then, we utilize object detectors to assign class probabilities to pixels in every RGB-D frame. After that, the class probabilities are projected onto the segments. Finally, we perform accurate inference on an MRF model over the homogeneous segments, in combination with geometry cues, to output the labels. Experiments on the challenging RGB-D Object Dataset demonstrate that our detection-based approach produces accurate labelings and improves the robustness of small object detection for indoor scenes.

Keywords: Point cloud; Labeling; Object detection; RGB-D

1. Introduction

With the popularity of Kinect sensors and advanced 3D reconstruction techniques, we are facing an increasing amount of 3D point clouds. Consequently, the demand for scene understanding is growing [6,7]. Understanding 3D scenes is a fundamental problem in perception and robotics, as knowledge of the environment is a preliminary step for subsequent tasks such as route planning, augmented reality and mobile visual search [1,2].

In the literature, a significant amount of work has been done on semantic labeling of pixels or regions in 2D images. In spite of this, semantic labeling of 3D point clouds remains an open problem. Its solution, however, can bring a breakthrough in a wide variety of computer vision and robotics research, with great potential in human-computer interaction, 3D object indexing and retrieval, and object manipulation in robotics [4], as well as exciting applications such as self-driving vehicles and semantic-aware augmented reality.

There is a great deal of literature on 3D scene labeling in indoor environments [5,16]. Many of these methods operate directly on a 3D point cloud, as point clouds contain very important shape and spatial information, which makes it possible to model context and spatial relationships between objects. However, one source of information that these methods leave unexplored is the reference images. Existing works [4,3] have shown that multi-view images can be used to recognize objects in a 3D scene with a higher degree of accuracy.

In this paper, we focus on detection-based object labeling of 3D scenes. Specifically, we tackle the problem of object labeling in a 3D scene with the help of object detection results from 2D reference images covering parts of the scene. Our goal is to transfer such reliable 2D labeling results into 3D to enhance the inference accuracy. Note that the proposed approach works in the scenario where a single point cloud is merged from multiple images. First, the point cloud is segmented into homogeneous segments. Then, we utilize object detectors to assign class probabilities to pixels in every RGB-D frame. After that, the class probabilities are projected onto the segments. Finally, we perform accurate inference on an MRF model over the homogeneous segments, in combination with geometry cues, to output the labels.

The remainder of this paper is organized as follows. Section 2 surveys related work. The details of the proposed method are presented in Section 3. Experimental results and comparisons are provided in Section 4. Finally, we conclude the paper in Section 5.

2. Related work


With the popularity of Kinect cameras and 3D laser scanners, there has recently been an ever-increasing research focus on semantic labeling of 3D point clouds [8,9,5]. The application scenarios are


either indoor or outdoor, with corresponding tasks such as labeling tables, beds, desks and computers, or trees, cars, roads and pedestrians. So far, most robotic scene understanding work has focused on 3D outdoor scenes, with related applications such as mapping and autonomous driving [10–13]. The task is to label point clouds obtained from 3D laser scanner data into a few coarse geometric classes (e.g., ground, buildings, trees, vegetation, etc.) [8,9,14,15]. Indoor scenes, in comparison, cover a wider range of objects, scene layouts, and scales, and are therefore more challenging. A major challenge arises from the fact that most indoor scenes are cluttered and objects occlude each other [16,17]. There is also some recent work on labeling indoor scenes [18,19,4,5,20]. While these methods achieve impressive results on large architectural elements (e.g., walls, chairs), the performance is still far from satisfactory for small objects such as torches and cups.

With the emerging 3D reconstruction techniques such as PTAM [21,22], DTAM [23], and KinectFusion [24,25], it is feasible to robustly merge RGB-D videos of a scene into a consistent and detailed 3D point cloud [26]. To label the 3D scene in this scenario, one potential solution is to work directly with the merged point cloud, which has achieved high performance for outdoor scenes, where laser data is classified and modeled into a few coarse geometric classes [8,9,27]. Recent research [5,20] demonstrates the feasibility of labeling indoor point clouds in this way. Other solutions conduct segmentation and/or detection on individual frames, whose outputs are then merged into the point cloud and undergo a joint optimization procedure [12,4]. Most of these approaches label individual 3D laser points or voxels using features describing local shape and appearance, and the joint optimization is typically accomplished using spatial and temporal smoothing via graphical model inference.

The closest work to ours is presented in [3]. The authors utilized sliding window detectors trained from object views to assign class probabilities to pixels in every RGB-D frame. Then, these probabilities are projected into the reconstructed 3D scene and integrated using a voxel representation. Finally, inference is performed using a Markov Random Field (MRF) model over the voxels, in combination with cues from view-based detection and 3D shape. The downside of their detection-based approach is the degradation of object boundaries, leading to misclassification of objects.

To maintain accurate object boundaries, we present a detection-based scheme in combination with clustering of 3D points for labeling indoor scenes. We first introduce a normal- and color-based segmentation algorithm to oversegment the reconstructed 3D scene into homogeneous segments. Then, we utilize sliding window detectors trained from object views to assign class probabilities to pixels in every RGB-D frame. After that, the class probabilities are projected onto the segments. Finally, we perform accurate inference on an MRF model over the homogeneous segments, in combination with geometry cues, to output the labels (semantic segments) of the point cloud. We term the proposed

approach DetSMRF, which refers to 3D indoor scene labeling based on detection, segmentation and MRF inference. DetSMRF can be considered a generalization of [3]. The main difference between the proposed framework and [3] is that the nodes in our MRF model are homogeneous segments rather than voxels, which is key to preserving object boundaries. Moreover, it is worthwhile to note that the geometry cue integrated into the MRF model is simply the difference of normals between two nearby segments, which does not require any Kinect pose parameters, in contrast to [1]. Experiments on the challenging RGB-D Object Dataset demonstrate that our DetSMRF approach produces accurate labelings and improves the robustness of object detection for indoor scenes.

3. DetSMRF

We now describe the proposed DetSMRF model, as shown in Fig. 1. The 3D point clouds we consider are captured from a set of RGB-D video frames. To reconstruct a 3D scene, RGB-D mapping [26] is employed to globally align and merge each frame with the scene under a rigid transform. The goal of this paper is to label small everyday objects of interest, which may comprise only a small part of the whole 3D point cloud.

DetSMRF is defined over a graph $G = (V, E)$, where $V$ are vertices representing homogeneous segments obtained by oversegmentation of the 3D point cloud, and $E$ are edges connecting adjacent vertices. Each segment is associated with a label $y_v \in \{1, \dots, C, c_B\}$, where $\{1, \dots, C\}$ are the object classes of interest and $c_B$ is the background class. The joint distribution of segments is modeled via an MRF with pairwise interactions. The optimal labeling of the scene minimizes the following energy function defined over the field of homogeneous segments:

$E(y_1, \dots, y_{|V|}) = \sum_{v \in V} \Psi_v(y_v) + \lambda \sum_{\{i,j\} \in P} \Phi_{i,j}(y_i, y_j)$,  (1)

where $P$ is the set of all pairs of neighboring segments. The data term $\Psi_v$ models the observation cost for individual segments, and the pairwise term $\Phi_{i,j}$ measures interactions between adjacent segments. $\lambda$ is a smoothing constant that penalizes adjacent elements which do not share the same label.

3.1. Oversegmentation of 3D point cloud

Given the 3D point cloud merged by RGB-D mapping from an RGB-D video, we first reduce the complexity from millions of points on the 3D surface to a few thousand segments. That is, we want to group the points of the 3D point cloud into perceptually meaningful segments that conform to object boundaries. We assume that the boundaries between objects are characterized by discontinuities in color and surface orientation. The grouping is done by oversegmentation.

Fig. 1. Flow chart of our method.


To this end, a normal- and color-based region growing segmentation algorithm is proposed, which is shown in Algorithm 1.

Algorithm 1. Segmentation of 3D point cloud.

Input: point cloud {P}, point normals {N}, point curvatures {c}, neighbour finding function N(·), curvature threshold c_th, angle threshold θ_th, minimum cluster size threshold q_th, color threshold d_th
Output: region list {R}

Initialization: {R} ← ∅, available points list {A} ← {1, …, |P|}
while {A} is not empty do
    Current region {R_c} ← ∅; current seeds {S_c} ← ∅
    P_min ← point with minimum curvature in {A}
    {S_c} ← {S_c} ∪ P_min; {R_c} ← {R_c} ∪ P_min; {A} ← {A} \ P_min
    for each seed point s in {S_c} do
        Find the nearest neighbours of the current seed point: {B_c} ← N(s)
        for j = 0 to size({B_c}) do
            Current neighbour point P_j ← B_c{j}
            if {A} contains P_j, the color difference between s and P_j is less than d_th, and cos⁻¹(|⟨N{s}, N{P_j}⟩|) < θ_th then
                {R_c} ← {R_c} ∪ P_j; {A} ← {A} \ P_j
                if c{P_j} < c_th then
                    {S_c} ← {S_c} ∪ P_j
                end if
            end if
        end for
    end for
    Add the current region to the global segment list: {R} ← {R} ∪ {R_c}
end while
for each R_c in {R} do
    if |R_c| < q_th then
        merge R_c with the closest neighbouring cluster
    end if
end for

To reduce salt-and-pepper noise and the number of segments, we filter out points that are spatially sparse and grow regions starting from the flattest areas. Concretely, we filter out points with fewer than 40 neighbouring points within a ball of 1 cm radius. The remaining points are then sorted by their curvature value, because region growing starts from the point with the minimum curvature, which lies in the flattest area. The algorithm picks the point with the minimum curvature value and starts growing a region from it, repeating until there are no unlabeled points left in the cloud. This process proceeds as follows:

• The picked point with the minimum curvature value is added to a set called seeds.
• For every seed point, the algorithm finds its neighbouring points.
  ○ The normal difference and the color difference are tested between every neighbour and the current seed point. If both differences are below their pre-defined thresholds, the neighbour is added to the current region.
  ○ Every accepted neighbour is then tested for its curvature value. If the curvature value is below the threshold, the point is added to the seeds.
  ○ The current seed is removed from the seeds.
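For concreteness, the following is a minimal Python sketch of this normal- and color-based region growing, written against the description above and Algorithm 1. The NumPy data layout, the neighbour-lookup callable, and the curvature threshold value c_th are illustrative assumptions rather than details taken from the paper.

import numpy as np

def region_growing(normals, curvatures, colors, neighbors,
                   theta_th=np.deg2rad(7.0), d_th=30.0, c_th=0.05, q_th=30):
    """Group points into homogeneous segments by normal/color similarity.

    normals, colors: (n, 3) arrays; curvatures: (n,) array.
    neighbors(i) returns indices of points near point i (e.g. from a k-d tree).
    theta_th, d_th, q_th follow Algorithm 1; c_th is a placeholder value.
    """
    n = len(curvatures)
    labels = -np.ones(n, dtype=int)              # -1 = unassigned
    available = set(range(n))
    regions = []

    while available:
        # Start a new region from the flattest remaining point.
        seed = min(available, key=lambda i: curvatures[i])
        region, seeds = [seed], [seed]
        available.discard(seed)
        while seeds:
            s = seeds.pop()
            for j in neighbors(s):
                if j not in available:
                    continue
                color_diff = np.linalg.norm(colors[s] - colors[j])
                angle = np.arccos(min(abs(np.dot(normals[s], normals[j])), 1.0))
                if color_diff < d_th and angle < theta_th:
                    region.append(j)
                    available.discard(j)
                    if curvatures[j] < c_th:     # smooth point: let it keep growing
                        seeds.append(j)
        regions.append(region)

    # Clusters smaller than q_th would be merged into their closest
    # neighbouring cluster here (omitted for brevity).
    for rid, region in enumerate(regions):
        labels[region] = rid
    return labels, regions

The point-density filter and the color-based cluster merging described in the text are applied before and after this loop, respectively.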


To further reduce salt-and-pepper noise, we attempt to merge neighbouring clusters with similar colors: two neighbouring clusters whose average colors differ only slightly are merged together, and a second merging step then takes place. The pseudocode of the oversegmentation is presented in Algorithm 1. In our experiments, for all 3D scenes, we set $\theta_{th} = 7^{\circ}$, $d_{th} = 30$, and $q_{th} = 30$.

3.2. Detection based probability maps

The data term $\Psi_v(y_v)$ in Eq. (1) is computed using responses from object detectors that have been independently trained on a database of objects and background images. The detectors are run on individual RGB-D frames, and their responses are used to estimate an object class probability for every pixel in every frame. For scenes reconstructed from Kinect videos, each point corresponds to a pixel in some RGB-D frame, and the system can remember this one-to-one mapping. A score map pyramid is used to compute a score map over all pixels (and hence over all 3D points). For a 3D point $x$, let $f_c^h(x)$ be a variant of HOG features [28] extracted over both the RGB and depth images to capture the appearance and shape of each view of an object, for object detector $c$ at scale $h$. When computing $f_c^h(x)$, the system looks up the corresponding pixel and computes the features from the source RGB-D frame [3]. The score map of a linear SVM object detector $c$ is

$s_c(x) = \max_h \{\omega_c^T f_c^h(x) + b_c\}$,  (2)

where $\omega_c$ and $b_c$ are the weight vector and bias of the object detector $c$, respectively. One detector for each object class is trained from training data. To convert these linear score maps into probability maps that define the probability of point $x$ belonging to class $c$, Platt scaling [29] is employed:

$p(c \mid x) = \dfrac{1}{1 + \exp\{u\, s_c(x) + v\}}$,  (3)

where $u$ and $v$ are parameters of the sigmoid, which are found by minimizing the negative log likelihood of a training or validation set. This is the probability obtained from a binary classifier between class $c$ and the background. A point is classified as the background class if it does not lie on any of the objects. Since the detectors for all objects are trained independently using a database of objects and background images, when there are $C$ foreground classes it is reasonable to define the probability of point $x$ belonging to the background as [3]

$p(c_B \mid x) = \alpha \min_{1 \le c \le C} \{1 - p(c \mid x)\}$,  (4)

where $\alpha$ controls the looseness of the upper bound on the probability of $x$ being classified as background. We set $\alpha = 0.1$ in all of our experiments by cross validation. We generate the likelihood of segment $v$ by multiplying the likelihoods of its constituent points and normalizing the result using the geometric mean:

$p(y_v \mid \Omega_v) = \left\{ \prod_{x \in \Omega_v} p(y_v \mid x) \right\}^{1/|\Omega_v|}$,  (5)

where $y_v \in \{1, \dots, C, c_B\}$ and $p(y_v \mid x)$ is looked up from the corresponding probability map. The data term in Eq. (1) is the negative log likelihood:

$\Psi_v(y_v) = -\ln p(y_v \mid \Omega_v) = -\dfrac{1}{|\Omega_v|} \sum_{x \in \Omega_v} \ln p(y_v \mid x)$.  (6)

Eqs. (5) and (6) are common tricks in the fields of detection and labeling. It is worth noting that $\Omega_v$ denotes a homogeneous segment in our work.
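To make Eqs. (2)–(6) concrete, the short Python sketch below turns per-point detector scores into the unary term $\Psi_v(y_v)$ of a single segment. The array shapes, the fitted Platt parameters (u, v) and the default α are illustrative placeholders; only the formulas follow the text above.

import numpy as np

def unary_term(scores, platt_params, alpha=0.1):
    """Compute Psi_v(y) for one homogeneous segment, following Eqs. (2)-(6).

    scores:       (n_points, C) array of max-over-scales SVM scores s_c(x)
    platt_params: list of C (u, v) sigmoid parameters from Platt scaling
    Returns a length-(C+1) vector of negative log likelihoods (last = background).
    """
    n_points, C = scores.shape

    # Eq. (3): Platt scaling maps raw SVM scores to class probabilities.
    probs = np.empty((n_points, C))
    for c, (u, v) in enumerate(platt_params):
        probs[:, c] = 1.0 / (1.0 + np.exp(u * scores[:, c] + v))

    # Eq. (4): background probability as a scaled upper bound.
    p_bg = alpha * np.min(1.0 - probs, axis=1, keepdims=True)
    probs = np.hstack([probs, p_bg])                    # (n_points, C+1)

    # Eqs. (5)-(6): geometric mean over the segment, then negative log.
    log_probs = np.log(np.clip(probs, 1e-12, None))
    return -log_probs.mean(axis=0)

In practice the per-point scores themselves would come from sliding-window HOG detectors run on the RGB and depth channels, as described above.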


3.3. Label consistency

The pairwise terms in Eq. (1) penalize labelings of pairs of neighbouring segments that do not share the same label. The simplest pairwise interaction is the Potts model [30]:

$I\{y_i \neq y_j\}$,  (7)

where $I\{\cdot\}$ is an indicator function that takes the value 1 if its argument is true and 0 otherwise. It is intuitive that label changes tend to occur at sharp edges. Therefore, it is reasonable to penalize assigning different labels to two neighbouring segments with similar normals. We incorporate this 3D geometric information into our model by defining

$\Phi_{i,j}(y_i, y_j) = I\{y_i \neq y_j\} \cdot I\{\langle n_i, n_j \rangle > \eta\}$,  (8)

where $n_i$ and $n_j$ are the normals of segments $i$ and $j$, respectively, $\langle \cdot, \cdot \rangle$ denotes the inner product, and $\eta$ is a constant. The data term $\Psi_v(y_v)$ and the pairwise term $\Phi_{i,j}(y_i, y_j)$ together define a multi-class associative MRF, whose energy can be efficiently minimized using graph cuts [30]. In this paper, we set $\lambda = 0.5$ and $\eta = 0.8$.
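Putting Eqs. (1), (6) and (8) together, the sketch below evaluates the full energy for a candidate labeling of the segments. The paper minimizes this energy with graph cuts [30]; the graph construction is omitted here, and the adjacency list and array names are assumptions made for illustration.

import numpy as np

def mrf_energy(labels, unaries, seg_normals, edges, lam=0.5, eta=0.8):
    """Evaluate Eq. (1) for a candidate labeling of the segments.

    labels:      (|V|,) integer label per segment
    unaries:     (|V|, C+1) data terms Psi_v(y) from the detectors
    seg_normals: (|V|, 3) unit normal per segment
    edges:       iterable of (i, j) pairs of adjacent segments (the set P)
    """
    # Data term: sum of Psi_v(y_v) over all segments.
    data = unaries[np.arange(len(labels)), labels].sum()

    # Pairwise term (Eq. (8)): penalize a label change between segments
    # whose normals are nearly parallel (inner product above eta).
    smooth = 0.0
    for i, j in edges:
        if labels[i] != labels[j] and np.dot(seg_normals[i], seg_normals[j]) > eta:
            smooth += 1.0
    return data + lam * smooth

The associative energy itself is minimized with graph cuts; the evaluator above is only meant to make the roles of λ and η explicit.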

4. Experiments

We experimentally evaluate our method primarily on the challenging RGB-D Object Dataset [31], which includes everyday objects captured individually on a turntable. We show that our scheme achieves competitive performance compared to the state-of-the-art method in terms of labeling accuracy. All parameter settings are given in the previous sections of this paper.

Table 1
Details of the categories for the 8 3D scenes.

Category      # of points
Bowl          332,523
Cereal Box    1,142,927
Soda Can      167,350
Cap           395,782
Coffee Mug    177,219
Background    21,528,740
Total         23,744,343

Table 2
Experimental evaluation of Micro and Macro F-scores.

Approach      Micro F-score   Macro F-score
DetOnly       87.27           90.33
Det3DMRF      96.41           88.66
DetSOnly      93.75           93.40
DetS3DMRF     98.40           94.54

Data set and ground truth labeling: The RGB-D Object Dataset includes 250,000 segmented RGB-D images of 300 objects in 51 categories. It also contains eight video sequences of office and kitchen environments. The object detectors are trained with the turntable data as positive examples. During training, the only videos we use from the RGB-D Scenes Dataset are the background videos, for hard negative example mining [3]. Evaluation is done on the 8 video sequences in the RGB-D Scenes Dataset. The task is to detect and label objects in five categories (bowl, cap, cereal box, coffee mug, and soda can) and distinguish them from everything else, which is labeled as background. The overall details of the 8 3D scenes are listed in Table 1, which shows that background is by far the dominant category and that the category distribution is highly unbalanced.

Comparison baselines: We evaluate our method (DetS3DMRF) against the state of the art, namely: (a) only the data term of [3] (DetOnly); (b) the full approach presented in [3] (Det3DMRF); and (c) only the data term of our method (DetSOnly).

Evaluation criteria: The primary goal of this paper is labeling, which is to assign a category label to every point in a 3D scene. We evaluate our approach in two ways. (1) Micro and Macro F-scores on the accompanying RGB-D Object Dataset: the traditional F-score is the harmonic mean of precision and recall:

$F = \dfrac{2pr}{p + r}$,  (9)

where $p$ and $r$ are precision and recall, respectively. The Micro F-score is computed from the overall precision and recall of the scene labeling, while the Macro F-score is the average of the F-scores of the five object categories, each computed separately [3]. (2) Per-category and overall precisions/recalls: the overall precision/recall is a macro-average across categories.

Micro and Macro F-score analysis: Table 2 compares the Micro and Macro F-scores of the above algorithms. We observe that our method achieves improvements of 1.99% and 5.98% over [3] in Micro and Macro F-score, respectively. In particular, the data term of the proposed model alone gives a 6.48% and 3.07% boost over the data term of [3], with respect to Micro and Macro F-score respectively. This allows our model to achieve competitive performance even with the simple unary term, compared to [3].

Per-category and overall precision/recall analysis: The per-category and overall precisions and recalls can be found in Figs. 3 and 4, respectively. Note that precision and recall are identical if a label has to be assigned to every segment, but Det3DMRF is allowed to reject points, hence leading to different values for precision and recall. Compared to [3], our full model achieves improvements of 5.54% in overall precision and 6.22% in overall recall. Figs. 3 and 4 also reveal the superior performance of our approach in labeling small objects such as coffee mug and soda can, compared to Det3DMRF.
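As a concrete reading of Eq. (9) and the two averaging schemes, the sketch below computes the Micro F-score from pooled per-category counts and the Macro F-score as the mean of per-category F-scores. The count arrays are hypothetical inputs, not the paper's data.

def f_score(p, r):
    """Harmonic mean of precision p and recall r, Eq. (9)."""
    return 2.0 * p * r / (p + r) if (p + r) > 0 else 0.0

def micro_macro_f(tp, fp, fn):
    """tp, fp, fn: per-category true positive, false positive and false
    negative point counts (assumed non-zero where divided)."""
    # Micro: pool the counts over all categories, then compute one F-score.
    P = sum(tp) / (sum(tp) + sum(fp))
    R = sum(tp) / (sum(tp) + sum(fn))
    micro = f_score(P, R)
    # Macro: average the per-category F-scores.
    per_cat = [f_score(t / (t + p), t / (t + n)) for t, p, n in zip(tp, fp, fn)]
    macro = sum(per_cat) / len(per_cat)
    return micro, macro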

Fig. 2. Top: the 8 3D scenes to label; middle: the ground truths of the 3D scenes; bottom: the labeling results of our model. Best viewed in color. By zooming in, the images can be clearly inspected in the PDF file. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)


Fig. 3. Per-category and overall precisions for the proposed scheme and the compared methods.

Fig. 4. Per-category and overall recalls for the proposed scheme and the compared methods.

Visualization of the labeling results: In Fig. 2, we show all 8 scenes labeled by our method. Objects are colored by their category label: bowl is red, cap is green, cereal box is blue, coffee mug is yellow, soda can is cyan, and background is gray. Note that by zooming in, the images can be clearly inspected in the PDF file. The labeling results demonstrate the robustness of our method for small object detection in indoor scenes.

5. Conclusion

In this paper, we propose a detection-based scheme for labeling 3D indoor scenes. Experimental results show that our approach achieves better accuracy than the state-of-the-art method on the challenging RGB-D Object Dataset. To some extent, our work also demonstrates the importance of the unary term in the MRF model.

Acknowledgements

This work is supported by the Natural Science Foundation of China (No. 61373076), the National Outstanding Youth Science Foundation of China (No. 61422210), and the Special Fund for Earthquake Research in the Public Interest (No. 201508025).

References




[1] T. Guan, Y. He, J. Gao, J. Yang, J. Yu, On-device mobile visual location recognition by integrating vision and inertial sensors, IEEE Trans. Multimed. (2013) 1688–1699.
[2] T. Guan, Y.F. He, L.Y. Duan, J.Q. Yu, Efficient BOF generation and compression for on-device mobile visual location recognition, IEEE Multimed. (2014) 32–41.
[3] K. Lai, L. Bo, X. Ren, D. Fox, Detection-based object labeling in 3D scenes, In: ICRA, 2012, pp. 1330–1337.
[4] Y. Wang, R. Ji, S.-F. Chang, Label propagation from ImageNet to 3D point clouds, In: CVPR, 2013, pp. 3135–3142.
[5] H.S. Koppula, A. Anand, T. Joachims, A. Saxena, Semantic labeling of 3D point clouds for indoor scenes, In: NIPS, 2011, pp. 4–9.
[6] Y. Gao, M. Wang, Z. Zha, Q. Tian, Q. Dai, N. Zhang, Less is more: efficient 3D object retrieval with query view selection, IEEE Trans. Multimed. 11 (5) (2011) 1007–1018.
[7] Y. Gao, J. Tang, R. Hong, S. Yan, Q. Dai, N. Zhang, T.-S. Chua, Camera constraint-free view-based 3D object retrieval, IEEE Trans. Image Process. 21 (4) (2012) 2269–2281.
[8] D. Munoz, J.A. Bagnell, N. Vandapel, M. Hebert, Contextual classification with functional max-margin Markov networks, In: CVPR, 2009, pp. 975–982.
[9] D. Munoz, N. Vandapel, M. Hebert, Onboard contextual classification of 3-D point clouds with learned high-order Markov random fields, In: ICRA, 2009.
[10] D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, A. Ng, Discriminative learning of Markov random fields for segmentation of 3D scan data, In: CVPR, 2005, pp. 169–176.
[11] B. Douillard, D. Fox, F. Ramos, H. Durrant-Whyte, Classification and semantic mapping of urban environments, Int. J. Robot. Res. 30 (1) (2011) 5–32.
[12] K. Lai, D. Fox, Object recognition in 3D point clouds using web data and domain adaptation, Int. J. Robot. Res. 29 (8) (2010) 1019–1037.
[13] I. Posner, M. Cummins, P. Newman, A generative framework for fast urban labeling using spatial and temporal context, Auton. Robots 26 (2–3) (2009) 153–170.
[14] A. Golovinskiy, V.G. Kim, T. Funkhouser, Shape-based recognition of 3D point clouds in urban environments, In: CVPR, 2009, pp. 2154–2161.
[15] R. Shapovalov, A. Velizhev, O. Barinova, Non-associative Markov networks for 3D point cloud classification, Photogramm. Comput. Vis. Image Anal. 38 (3A) (2010) 103–108.
[16] H. Wang, S. Gould, D. Koller, Discriminative learning with latent variables for cluttered indoor scene understanding, Commun. ACM 56 (4) (2013) 92–99.
[17] W. Choi, Y.-W. Chao, C. Pantofaru, S. Savarese, Understanding indoor scenes using 3D geometric phrases, In: CVPR, 2013, pp. 33–40.
[18] R. Triebel, R. Schmidt, Ó.M. Mozos, W. Burgard, Instance-based AMN classification for improved object recognition in 2D and 3D laser range data, In: IJCAI, 2007, pp. 2225–2230.
[19] G.D. Hager, B. Wegbreit, Scene parsing using a prior world model, Int. J. Robot. Res. 30 (12) (2011) 1477–1507.
[20] O. Kähler, I. Reid, Efficient 3D scene labeling using fields of trees, In: ICCV, 2013, pp. 3064–3071.
[21] G. Klein, D. Murray, Parallel tracking and mapping for small AR workspaces, In: ISMAR, 2007, pp. 225–234.
[22] G. Klein, D. Murray, Parallel tracking and mapping on a camera phone, In: ISMAR, 2009, pp. 83–86.
[23] R.A. Newcombe, S.J. Lovegrove, A.J. Davison, DTAM: dense tracking and mapping in real-time, In: ICCV, 2011, pp. 2320–2327.


[24] R.A. Newcombe, A.J. Davison, S. Izadi, P. Kohli, O. Hilliges, J. Shotton, D. Molyneaux, S. Hodges, D. Kim, A. Fitzgibbon, KinectFusion: real-time dense surface mapping and tracking, In: ISMAR, 2011, pp. 127–136.
[25] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al., KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera, In: ACM UIST, 2011, pp. 559–568.
[26] P. Henry, M. Krainin, E. Herbst, X. Ren, D. Fox, RGB-D mapping: using depth cameras for dense 3D modeling of indoor environments, Exp. Robot. (2014) 477–491.
[27] X. Xiong, D. Munoz, J.A. Bagnell, M. Hebert, 3-D scene analysis via sequenced predictions over points and regions, In: ICRA, 2011, pp. 2609–2616.
[28] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, In: CVPR, 2008, pp. 1–8.
[29] J. Platt, et al., Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, Adv. Large Margin Classif. 10 (3) (1999) 61–74.
[30] Y. Boykov, O. Veksler, R. Zabih, Fast approximate energy minimization via graph cuts, IEEE Trans. Pattern Anal. Mach. Intell. 23 (11) (2001) 1222–1239.
[31] K. Lai, L. Bo, X. Ren, D. Fox, A large-scale hierarchical multi-view RGB-D object dataset, In: ICRA, 2011, pp. 1817–1824.

Wei Liu received the B.S. degree in Information and Computing Science in 2009 from Nanchang University, Jiangxi, China, and the M.S. degree in Applied Mathematics from Jimei University, Fujian, China, in 2012. He is currently working towards his Ph.D. degree at Xiamen University. His research interests include machine learning, hyperspectral remote sensing image analysis, and scene understanding.

ShaoZi Li received the B.S. degree from the Computer Science Department, Hunan University, in 1983, the M.S. degree from the Institute of System Engineering, Xi'an Jiaotong University, in 1988, and the Ph.D. degree from the College of Computer Science, National University of Defense Technology, in 2009. He currently serves as a Professor and the Chair of the School of Information Science and Technology of Xiamen University, the Vice Director of the Fujian Key Lab of the Brain-like Intelligent System, and concurrently the Vice Director and General Secretary of the Council of the Fujian Artificial Intelligence Society. His research interests cover artificial intelligence and its applications, moving object detection and recognition, machine learning, computer vision, natural language processing, multimedia information retrieval, network multimedia and CSCW technology, among others.

Donglin Cao received the B.S. degree from Xiamen University, China, in 2000, the M.S. degree from Xiamen University, China, in 2003, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, China, in 2009. His research interests cover artificial intelligence and its applications, computer vision, natural language processing and machine learning.

Song-Zhi Su received the B.S. degree in Computer Science and Technology from Shandong University, China, in 2005, and the M.S. and Ph.D. degrees in Computer Science in 2008 and 2011, both from Xiamen University, Fujian, China. He joined the faculty of Xiamen University as an assistant professor in 2011. His research interests include pedestrian detection, object detection and recognition, RGB-D based human action recognition, and image/video retrieval.

Rongrong Ji serves as a Professor at Xiamen University, where he directs the Intelligent Multimedia Technology Laboratory (http://www.imt.xmu.edu.cn) and serves as a Dean Assistant in the School of Information Science and Engineering. He was a Postdoc research fellow in the Department of Electrical Engineering, Columbia University, from 2010 to 2013, working with Professor Shih-Fu Chang. He obtained his Ph.D. degree in Computer Science from Harbin Institute of Technology, graduating with a Best Thesis Award at HIT. He was a visiting student at the University of Texas at San Antonio working with Professor Qi Tian, a research assistant at Peking University working with Professor Wen Gao in 2010, and a research intern at Microsoft Research Asia working with Dr. Xing Xie from 2007 to 2008. He is the author of over 40 tier-1 journal and conference papers, including IJCV, TIP, TMM, ICCV, CVPR, IJCAI, AAAI, and ACM Multimedia. His research interests include image and video search, content understanding, mobile visual search, and social multimedia analytics. Dr. Ji is the recipient of the Best Paper Award at ACM Multimedia 2011 and a Microsoft Fellowship in 2007. He is a guest editor for IEEE Multimedia Magazine, Neurocomputing, and ACM Multimedia Systems Journal. He has been a special session chair of MMM 2014, VCIP 2013, MMM 2013 and PCM 2012, will be a program chair of ICIMCS 2016, and was Local Arrangement Chair of MMSP 2015. He serves as a reviewer for IEEE TPAMI, IJCV, TIP, TMM, CSVT, TSMC A/B/C, and IEEE Signal Processing Magazine, among others. He is on the program committees of over 10 top conferences, including CVPR 2013, ICCV 2013, ECCV 2012, and ACM Multimedia 2013 and 2010.