
Adaptive Structured Sub-blocks Tracking

LIU Jing-Wen, SUN Wei-Ping and XIA Tao
School of Computer Science and Technology, Huazhong University of Science and Technology, Wuhan, China
{M201272672, wpsun, xiatao}@hust.edu.cn

Abstract—Visual object tracking algorithms based on mid-level appearance cues have been widely studied for their effective representation of non-rigid appearance variation and partial occlusion. Sub-blocks are often adopted as local features in mid-level tracking algorithms, yet how to select representative sub-blocks that reveal the spatial structure of the object while retaining the flexibility to model non-rigid deformation has not been adequately addressed. Exploiting the discrimination, uniqueness and historical prediction accuracy of a target's sub-blocks, we propose a local feature selection method that combines a rough initial sub-block selection with a refined sub-block/sample-particle bi-directional selection under the particle filter tracking framework. A quantitative evaluation is conducted on ten sequences. Experimental results show the robustness of the proposed algorithm in handling non-rigid deformation and partial occlusion.

Keywords—visual object tracking; structured sub-blocks; particle filter

I. INTRODUCTION

Visual object tracking is a fundamental task in computer vision, applied in a wide range of domains such as intelligent surveillance, human-computer interaction and video compression. Generally a visual object tracker comprises four modules: object initialization, appearance modeling, motion estimation and object localization, among which robust adaptive appearance modeling plays a particularly important role [1].

Recently, tracking algorithms based on mid-level appearance cues have been widely studied. Compared with global visual representations, local visual representations are less sensitive to non-rigid appearance variation and partial occlusion. Among these trackers, Adam et al. [2] design an appearance model composed of several image fragments, where the fragments are obtained in a predefined way without selection and every fragment votes on the possible positions and scales of the target. Kwon et al. [3] propose an appearance model comprising multiple local patches and the topology between those patches; the model requires no specific object model and is more flexible. Researchers have also sought novel ways to choose the local features most informative for the tracking task, selecting them for specific purposes. Nejhum et al. [4] use multiple rectangular blocks to model the constantly changing foreground shape. Yang et al. [5] propose an attentional tracking method (AVT) based on attentional patches, in which early-selection and late-selection processes gradually obtain more discriminative local patches. F. Yang et al. [6] build a discriminative appearance model based on segmentation from the perspective of mid-level cues, adopting superpixels to capture structural information. Based on multiple instance learning (MIL), K. Zhang et al. [7] propose an online feature selection algorithm that uses prior information from instance labels. K. Moo Yi et al. [8] describe the object with feature points, select local features by motion saliency and descriptor saliency, and suppress the degradation caused by inaccurate initialization. These feature selection algorithms focus on information obtained from historical frames, whereas in some applications the appearance in the current frame can provide more accurate and effective information for local features.

Motivated by this observation, we propose a novel local feature selection method that takes the current sampling state into account under the particle filter tracking framework, and we implement an adaptive structured sub-blocks tracking (ASST) algorithm. The large set of particles naturally generated in each frame by a particle filter tracker can serve not only as target candidates but also as an important reference for selecting local features, because the "quality" of the samples in the current frame reflects, to some degree, how good the tracking result of the previous frame was. Moreover, when calculating the similarity between each sample and the target model, only part of the local features need be considered, since some local features in the samples may be disturbed by noise or occlusion. Our tracker therefore employs a rough initial sub-block selection followed by a refined sub-block/sample-particle bi-directional selection. This method selects more representative sub-blocks that reveal the current spatial structure of the tracking target and model non-rigid deformation flexibly. Several measures, introduced below, support these selection processes.

II. STRUCTURED SUB-BLOCKS TRACKING FRAMEWORK

Given the position of the target in one frame, a tracker should be able to track the target in subsequent frames. Tracking based on sub-blocks can be viewed as matching based on similarity measurement. To narrow the gap between low-level features and the high-level semantics of the images, the local features (i.e. sub-blocks) and the sample particles that take part in the final decision should be chosen carefully. We model the decision-making in tracking as local feature selection in the current frame.

Fig. 1 illustrates the basic pipeline of our tracker. The target is first divided into overlapped sub-blocks, some of which are added to a candidate sub-block set through an initial selection process; two measures, discrimination and coverage, are taken into consideration during this selection. Sub-blocks in the candidate set are then ordered by historical prediction accuracy. When a new frame (frame t in Fig. 1) arrives, a two-stage selection is performed, one stage on sub-blocks and the other on sample particles. In the first stage, sub-blocks in the candidate set are re-chosen according to their "quality" distribution among the sample particles, and the local features most representative of the target's current spatial structure are chosen as the templates of frame t. In the second stage, "good" particles are filtered to join the final decision-making. In this refined bi-directional selection, the discrimination and uniqueness of sub-blocks and the confidence of particles serve as measures that help adapt to non-rigid deformation flexibly. After the "good" local features and "good" particles are obtained, localization in frame t is conducted by a simple similarity measurement: the particle with the highest similarity to the templates of frame t is output as the tracking result of frame t.

Footnote: This article is supported by the Natural Science Foundation of China (No. 61300140), the Natural Science Foundation of Hubei Province (No. 2015CFB565) and the Fundamental Research Funds for the Central Universities, HUST (No. 2014QN008).

Fig. 1. Pipeline of our tracking algorithm ASST

As mentioned above, several measures are adopted in our tracker; we list them below.

(1) Coverage ratio

The coverage ratio of sub-blocks $p_i$ and $p_j$ is defined as

$$CR(p_i, p_j) = \frac{Area(p_i \cap p_j)}{Area(p_i \cup p_j)} \qquad (1)$$

where $Area(p_n)$ is the area of the rectangle of sub-block $p_n$.
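For concreteness, here is a minimal sketch of Eq. (1) in Python, assuming each rectangle is given as an (x, y, w, h) tuple; the function name and rectangle format are our own choices, not fixed by the paper.

```python
def coverage_ratio(a, b):
    """Coverage ratio CR(p_i, p_j) of two rectangles a, b = (x, y, w, h):
    intersection area over union area (Eq. 1)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0
```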

(2) Discrimination and uniqueness

Discrimination, measured by the maximum margin between positive and negative training samples, shows how well a sub-block can distinguish the target from the background. Uniqueness indicates whether a sub-block is unique to the target object, that is, whether the sub-block can distinguish the tracking target from other similar objects. Let $D_i^t$ and $U_i^t$ be the discrimination and uniqueness of sub-block $i$ ($i \in \{1, 2, \ldots, N\}$) at time $t$. Denote by $X_i^t$ and $\bar{X}_i^t$ the positive and negative training samples of sub-block $i$ at time $t$, and by $X_{i,j}^t$ the $j$th sample particle at time $t$. $D_i^t$ and $U_i^t$ are defined as follows:

$$D_i^t = f(X_i^{t-1}, \bar{X}_i^{t-1}, X_{i,j}^t) = \max_j \left( P(X_i^{t-1}, X_{i,j}^t) - P(\bar{X}_i^{t-1}, X_{i,j}^t) \right) \qquad (2)$$

$$U_i^t = 1 - \frac{1}{M} \sum_{j=1}^{M} g(X_i^{t-1}, \bar{X}_i^{t-1}, X_{i,j}^t, \varepsilon) \qquad (3)$$

where $j = 1, 2, \ldots, M$, $M$ is the number of sampling particles in particle filter tracking, $P(\cdot)$ is the histogram intersection coefficient, and $g(\cdot)$ is defined as

$$g(X_i^{t-1}, \bar{X}_i^{t-1}, X_{i,j}^t, \varepsilon) = \begin{cases} 1, & P(X_i^{t-1}, X_{i,j}^t) - P(\bar{X}_i^{t-1}, X_{i,j}^t) \geq \varepsilon \\ 0, & P(X_i^{t-1}, X_{i,j}^t) - P(\bar{X}_i^{t-1}, X_{i,j}^t) < \varepsilon \end{cases} \qquad (4)$$

where $\varepsilon = D_i^t - \delta$ and $\delta$ is a parameter measuring the similarity between confusing particles (sampling particles whose appearance is similar to the target object) and the optimum particle. The larger the number of confusing particles, the more sampling particles have similar appearance within the sub-block area, which indicates that this sub-block will be unhelpful in subsequent localization.
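To make Eqs. (2)-(4) concrete, here is a small sketch assuming each sub-block region is described by an L1-normalized histogram and $P(\cdot,\cdot)$ is the histogram intersection; the default value of delta is a placeholder, not a value from the paper.

```python
import numpy as np

def hist_intersection(h1, h2):
    """Histogram intersection coefficient P(.,.) of two L1-normalized histograms."""
    return float(np.minimum(h1, h2).sum())

def discrimination_and_uniqueness(pos, neg, samples, delta=0.1):
    """Eqs. (2)-(4): pos/neg are the sub-block's positive/negative template
    histograms from frame t-1; samples is an (M, bins) array holding the
    sub-block region cut from each of the M particles at frame t.
    delta is the margin slack (placeholder default)."""
    margins = np.array([hist_intersection(pos, s) - hist_intersection(neg, s)
                        for s in samples])
    d = margins.max()                       # Eq. (2): best margin over particles
    eps = d - delta                         # threshold for "confusing" particles
    confusing = int((margins >= eps).sum()) # Eq. (4): g(.) summed over particles
    u = 1.0 - confusing / len(samples)      # Eq. (3): uniqueness
    return d, u
```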

(3) Historical prediction accuracy

In the two-stage selection, sub-blocks are processed one by one to obtain the template for frame t. For convenience we order the sub-blocks by their historical prediction accuracy before the selection, and only the sub-blocks in the candidate set with higher historical prediction accuracy are taken into account. The historical prediction accuracy $HP_i$ indicates the historical discrimination of sub-block $i$ in the previous frames and is defined as

$$HP_i = \frac{1 - P(X_i, \bar{X}_i)}{\sum_{i=1}^{N} \left( 1 - P(X_i, \bar{X}_i) \right)}, \quad i = 1, 2, \ldots, N \qquad (5)$$

The initial historical prediction accuracy is $HP_i^0 = D_i^0 / \sum_{i=1}^{N} D_i^0$, $i = 1, 2, \ldots, N$, where $D_i^0 = 1 - P(X_i^0, \bar{X}_i^0)$.
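Eq. (5) is a simple normalization; a minimal sketch, assuming p_pos_neg[i] holds the histogram intersection $P(X_i, \bar{X}_i)$ for sub-block $i$:

```python
import numpy as np

def historical_prediction_accuracy(p_pos_neg):
    """Eq. (5): normalize (1 - P(X_i, Xbar_i)) over all N sub-blocks,
    so the HP_i values sum to 1."""
    d = 1.0 - np.asarray(p_pos_neg, dtype=float)
    return d / d.sum()
```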

(4) Confidence of sample particles

In particle filter tracking, many sample particles are generated in each frame. The confidence of the $j$th sample particle at time $t$ indicates the probability that particle $j$ is the target, and it is defined as

$$c_j^t = \sum_{i=1}^{n} c_{i,j}^t = \sum_{i=1}^{n} \left( P(X_i^{t-1}, X_{i,j}^t) - P(\bar{X}_i^{t-1}, X_{i,j}^t) \right) \qquad (6)$$
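A minimal sketch of Eq. (6), assuming the particle has been cut into the n candidate sub-block regions and each region is described by an L1-normalized histogram; all names here are our own.

```python
import numpy as np

def particle_confidence(pos_list, neg_list, particle_blocks):
    """Eq. (6): particle confidence as the sum over sub-blocks of the
    margin between positive- and negative-template histogram intersections."""
    return sum(float(np.minimum(p, x).sum()) - float(np.minimum(n, x).sum())
               for p, n, x in zip(pos_list, neg_list, particle_blocks))
```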

III. TRACKING PROCEDURE

A. Initialization

The target labeled in the initial frame is first roughly segmented into overlapped sub-blocks using a sliding window. The size of a sub-block is positively correlated with that of the target; in general the correlation ratio is set between 0.25 and 0.50. Each sub-block is then resized to 32 × 32 pixels and described by a HoG vector. An initial selection with a greedy strategy, shown in Algorithm 1, is then conducted to choose discriminative sub-blocks and build a candidate set. Sub-blocks in the candidate set are ordered according to their historical prediction accuracy and denoted by $p_1, p_2, p_3, \ldots, p_N$.

Algorithm 1 Initial selection
Input: initial frame
1: sliding-window segmentation to obtain sub-blocks
2: generate positive and negative training samples for each sub-block using the method in [7]
3: calculate discrimination and sort sub-blocks in descending order: $p_1, p_2, \ldots$
4: initialize the candidate set $C = \{p_1\}$, $i = 2$
5: while $|C| < N$ do
6:   if $CR(p_i, C) < \theta$ then
7:     $C = C \cup \{p_i\}$
8:   end if
9:   $i = i + 1$
10: end while
Output: candidate set $C$
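A possible Python rendering of Algorithm 1, reusing coverage_ratio() from the sketch after Eq. (1). Because the comparison operators were lost in the extracted pseudocode, we assume a sub-block is accepted when its coverage ratio with every block already in C stays below a threshold theta; N and theta are parameters whose values are not fixed here.

```python
def initial_selection(sorted_blocks, N, theta):
    """Greedy Algorithm 1 (our reading): sorted_blocks are rectangles
    already sorted by descending discrimination; keep low-overlap blocks
    until the candidate set holds N of them."""
    candidates = [sorted_blocks[0]]
    for block in sorted_blocks[1:]:
        if len(candidates) >= N:
            break
        # accept only if the block overlaps every chosen block sparsely
        if all(coverage_ratio(block, c) < theta for c in candidates):
            candidates.append(block)
    return candidates
```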

B. The adaptive bi-directional selection

The refined selection process is divided into two stages, one for sub-blocks and the other for sampling particles, as illustrated in Fig. 2. When a new frame comes, we sample particles around the last tracking result within a search radius $r$. In the first stage, we segment each particle into sub-blocks according to the coordinates of the sub-blocks in the candidate set, obtaining many samples of each sub-block (for example, sub-block $i$ in Fig. 2). Then $c_{i,j}^t$ is calculated for each particle, and sub-block $i$'s samples with high confidence are picked out. We then calculate the discrimination and uniqueness of these samples and denote $B = \{p_i : D_i^t \geq \tau_1, U_i^t \geq \tau_2\}$, where the thresholds $\tau_1$ and $\tau_2$ are initialized automatically in the training frames:

$$\tau_1 = \lambda_1 \frac{1}{K} \sum_{k=1}^{K} f(X_i^{k-1}, \bar{X}_i^{k-1}, X_i^k), \quad \tau_2 = 1 - \lambda_2, \quad \lambda_1, \lambda_2 \in [0, 1].$$

The term $\frac{1}{K} \sum_{k=1}^{K} f(X_i^{k-1}, \bar{X}_i^{k-1}, X_i^k)$ is the average discrimination of sub-block $i$ over $K$ training frames. Parameter $\lambda_1$ decides the maximum decrease of discrimination tolerated when deformation or scale change occurs, i.e. the minimal credibility a sub-block must retain to take part in template construction. Parameter $\lambda_2$ determines the maximum fraction of confusing particles tolerated. In the following experiments we set $\lambda_1 = 0.3$ and $\lambda_2 = 0.2$. The set $B$ is the collection of sub-blocks after the first stage.

In the second stage, sample particles are ordered by their confidence $c_j^t$ and the particles with lower confidence are discarded directly. The particle set is then updated and the two-stage selection is repeated. After several iterations, representative sub-blocks and reliable particles are obtained.
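The iteration just described can be summarized in a short sketch. This is a high-level illustration under our own naming: score_block is assumed to return $(D_i^t, U_i^t)$ for a sub-block against the current particle set, particle_confidence is assumed to implement Eq. (6), and keep_ratio and n_iters are assumed parameters the paper does not specify.

```python
def bidirectional_selection(blocks, particles, tau1, tau2,
                            score_block, particle_confidence,
                            keep_ratio=0.5, n_iters=3):
    """Sketch of the two-stage bi-directional selection (our reading)."""
    for _ in range(n_iters):
        # stage 1: keep sub-blocks whose discrimination and uniqueness,
        # measured against the current particle set, clear the thresholds
        kept = []
        for b in blocks:
            d, u = score_block(b, particles)
            if d >= tau1 and u >= tau2:
                kept.append(b)
        blocks = kept
        # stage 2: keep only the highest-confidence particles
        particles = sorted(particles,
                           key=lambda p: particle_confidence(blocks, p),
                           reverse=True)[:max(1, int(keep_ratio * len(particles)))]
    return blocks, particles
```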

Fig. 2. The selection of sub-blocks and sampling particles

C. Template construction and localization

After the initial selection process we obtain the candidate set of $N$ sub-blocks $\{p_1, p_2, p_3, \ldots, p_N\}$. Sub-block $i$ is represented by its center coordinates, width and height, i.e. $p_i = \{px_i, py_i, pw, ph\}$; sub-blocks in one sequence are assumed to be of the same size. When a new frame $t$ comes, we activate the two-stage selection, and $n$ representative sub-blocks from the candidate set are picked out to build the tracking template $\{p_{t_1}, p_{t_2}, \ldots, p_{t_n}\}$.

Localization in frame $t$ is then conducted by Euclidean similarity measurement: the particle with the maximum similarity to the templates of frame $t$ is output as the tracking result of frame $t$.

D. Updating

Sub-blocks in the candidate set belonging to $B_{update} = \{p_i : P(X_i^t, X_i^{t-1}) \geq \tau_3, \; P(X_i^t, \bar{X}_i^{t-1}) \leq \tau_4\}$ are updated. The former inequality keeps the template continuous under appearance variation, while the latter prevents occluded areas from being updated. With $\alpha$ the learning rate, an accumulative update method is adopted:

$$P_i^t = \alpha P_i^{t-1} + (1 - \alpha) P_i, \qquad (7)$$

$$\bar{P}_i^t = \alpha \bar{P}_i^{t-1} + (1 - \alpha) \bar{P}_i. \qquad (8)$$

Then we update the historical prediction accuracy of each sub-block with the same rule:

$$HP_i^t = \alpha HP_i^{t-1} + (1 - \alpha) HP_i. \qquad (9)$$
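Eqs. (7)-(9) share the same exponential-moving-average form; a minimal sketch, where the default alpha is a placeholder rather than the paper's value:

```python
import numpy as np

def ema_update(old, new, alpha=0.95):
    """Accumulative update of Eqs. (7)-(9): blend the previous template
    (or HP value) with the new observation using learning rate alpha."""
    return alpha * np.asarray(old, dtype=float) + (1.0 - alpha) * np.asarray(new, dtype=float)

# Applied per updated sub-block i to the positive template, the negative
# template, and the historical prediction accuracy:
#   P_pos[i] = ema_update(P_pos[i], P_pos_new[i])   # Eq. (7)
#   P_neg[i] = ema_update(P_neg[i], P_neg_new[i])   # Eq. (8)
#   HP[i]    = ema_update(HP[i],    HP_new[i])      # Eq. (9)
```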

IV. EXPERIMENT

To verify the performance of the proposed ASST, experiments are conducted on ten challenging video sequences against five classic tracking algorithms: Compressive Tracking (CT) [9], Multiple Instance Learning tracking (MIL) [10], the Multi-Task sparse learning Tracker (MTT) [11], Frag-track (Frag) [2] and Online Discriminative Feature Selection (ODFS) [7].

A. Tracking results

The ten test sequences are Faceocc [2], Dollar [10], David [12], David3, Deer, Car4, girl, Dudek, Singer1 and Singer2 [1]. We adopt the center location error for quantitative comparison, defined as the Euclidean distance between the tracking result and the ground truth (GT). Table I lists the average center location error; error plots for all sequences are shown in Figs. 3-12. The proposed ASST obtains optimal or suboptimal error on most sequences.

Heavy occlusion. In the Faceocc sequence, CT begins to deviate from the first occlusion on, and all algorithms except Frag and ASST suffer different degrees of offset when heavy occlusion occurs, which is obvious at frame 760. In the David3 sequence, MTT and CT lose the target after frame 90 when the target crosses the tree, while MIL, ODFS and Frag shift because of appearance deformation when the target turns around. ASST keeps the target, although its scale estimate is larger than the target.

Fast motion and blur. As shown in Fig. 14 for the Deer sequence, Frag loses the target at frame 5. ASST always keeps the most discriminative region (the nose and mouth region) for localization, while the other algorithms miss the target in several frames.

Variation of pose. Targets undergo large changes of pose and scale in David (frames 130, 180 and 218), Dudek (frames 80, 680, 730 and 943) and Singer2 (frames 14, 35, 185, 280 and 330). In the David sequence, MIL and MTT drift when the object is blurred at frame 30, and Frag loses the target when its appearance changes severely at frame 130. MIL, MTT and ODFS perform well on the Dudek sequence with its long-term variations, but in the Singer2 sequence ODFS keeps only part of the target and tracks unsteadily, while Frag shifts and loses the target at frame 103. ASST performs well on David and Singer2. In the Dudek sequence, ASST's bounding box enlarges with the variation, including more background at frame 388; however it does not miss the target. The reason may be that the voting sub-blocks always lie in the target region, which is more stable than the background area.

Partial deformation. The object in the Dollar sequence sustains local deformation at frame 60. CT and ODFS shift from the target; so does Frag, but it corrects quickly. MTT and Frag track the wrong object when a similar object appears at frame 130, and CT and ODFS are also offset; only MIL and ASST complete the overall tracking task well. The target in the girl sequence also experiences severe deformation, and all algorithms shift; our algorithm captures stable local features and obtains the suboptimal center location error.

Cluttered background, scale and large illumination changes. The Singer1 and Car4 sequences both undergo large scale change and illumination variation. In the Singer1 sequence, MIL shifts because of the illumination at frame 80 and quickly loses the target; the other algorithms do not miss the target but are offset. The illumination change in Car4 results in tracking drift at frame 247 for all algorithms except MTT and ASST.

TABLE I. THE AVERAGE CENTER LOCATION ERROR (IN PIXELS)

Sequence   CT       MIL      MTT      ODFS     Frag     ASST
Faceocc    18.65    27.14    31.39    27.72    5.75     11.89
David      13.19    22.98    26.73    10.68    74.20    12.48
dollar     19.55    5.91     69.74    17.09    56.91    6.57
David3     89.84    32.51    61.91    12.59    59.40    12.52
Deer       126.06   17.77    20.33    35.01    74.26    8.74
Car4       81.16    54.21    9.83     73.72    135.24   12.01
girl       37.11    32.34    30.89    37.30    26.57    26.84
Dudek      26.90    12.63    10.55    19.56    63.02    22.84
Singer1    8.80     39.82    9.32     6.24     6.58     7.78
Singer2    49.29    112.57   133.41   18.12    21.07    14.78

B. Validation of the sub-blocks' adaptive selection

Fig. 18 shows the effectiveness of the proposed sub-block selection strategy: blue boxes are the sub-blocks picked out after the two-stage selection and the red box is the output of our tracker. Large occlusion arises in the Faceocc sequence from frame 93 on, and ASST avoids the blocked area; even at frame 539 with heavy occlusion, our selection mechanism still adapts well. It is worth mentioning that the template sub-blocks do not always lie in the most discriminative area (such as the eye region at frame 588); this can be explained by the historical prediction accuracy, which prevents frequently changing regions from being selected. Similar situations occur in the Dudek and Singer1 sequences. Thanks to our selection criteria, confusing regions are kept away from localization: as shown at frame 1126 of the Dudek sequence and frame 236 of the Singer1 sequence, the voting sub-blocks still focus on the target area and make effective decisions for localization. In the Dollar sequence, local deformation occurs at frame 51, after which the voting sub-blocks quickly jump from the initial area to the non-deformed zone and the templates update gradually. When a similar target appears at frame 136, ASST captures the unique sub-blocks to avoid localization ambiguity.

The structured sub-blocks tracker still drifts when heavy deformation occurs, and the algorithm is not suitable for tracking small objects. Moreover, discrimination and uniqueness are calculated for every sub-block, so the computational burden is large.

Fig. 3. Center location error of Faceocc sequence

Fig. 4. Center location error of David sequence

Fig. 5. Center location error of Dollar sequence

Fig. 6. Center location error of David3 sequence

Fig. 7. Center location error of Deer sequence

Fig. 8. Center location error of Car4 sequence

Fig. 9. Center location error of girl sequence

Fig. 10. Center location error of Dudek sequence

Fig. 11. Center location error of Singer1 sequence

Fig. 12. Center location error of Singer2 sequence

Fig. 13. Some results for heavy occlusion (Faceocc, David3)

Fig. 14. Some results for fast motion and blur (Deer)

Fig. 15. Some results for variation of pose (David, Dudek, Singer2)

Fig. 16. Some results for partial deformation (Dollar, girl)

Fig. 17. Some results for cluttered background, scale and large illumination changes (Singer1, Car4)

Fig. 18. Adaptive selection of structured sub-blocks during tracking (Faceocc, Dudek, Singer1, Dollar)

V. CONCLUSION

Based on the particle filter tracking framework, we developed a sub-block based tracker with a local feature selection method that considers the current sampling state. After a rough initial selection and a refined bi-directional selection, more representative sub-blocks can be picked out; they reveal the current spatial structure of the tracking target and retain the flexibility to model non-rigid deformation. In the future, we will improve our tracker by considering the scale change of the target and extend the "hard" sub-block selection to a "soft" selection by adopting a voting mechanism based on an adaptive-weighted appearance model.

REFERENCES

[1] X. Li, W. Hu, C. Shen, Z. Zhang, A. Dick, A. van den Hengel, A Survey of Appearance Models in Visual Object Tracking, ACM Trans. Intelligent Systems and Technology 4 (4) (2013).
[2] A. Adam, E. Rivlin, I. Shimshoni, Robust Fragments-based Tracking using the Integral Histogram, in: CVPR, 2006, pp. 798-805.
[3] J. Kwon, K. Mu Lee, Highly Nonrigid Object Tracking via Patch-Based Dynamic Appearance Modeling, IEEE Trans. Pattern Analysis and Machine Intelligence 35 (10) (2013) pp. 2427-2441.
[4] S. M. S. Nejhum, J. Ho, M.-H. Yang, Visual Tracking with Histograms and Articulating Blocks, in: CVPR, 2008.
[5] M. Yang, J. Yuan, Y. Wu, Spatial Selection for Attentional Visual Tracking, in: CVPR, 2007.
[6] F. Yang, H.-C. Lu, M.-H. Yang, Robust Superpixel Tracking, IEEE Trans. Image Processing 23 (4) (2014) pp. 1639-1651.
[7] K. Zhang, L. Zhang, M.-H. Yang, Real-time Object Tracking via Online Discriminative Feature Selection, IEEE Trans. Image Processing 22 (12) (2013) pp. 4664-4677.
[8] K. Moo Yi, H. Jeong, B. Heo, H. J. Chang, J. Y. Choi, Initialization-Insensitive Visual Tracking Through Voting with Salient Local Features, in: ICCV, 2013, pp. 2912-2919.
[9] K. Zhang, L. Zhang, M.-H. Yang, Real-Time Compressive Tracking, in: ECCV, 2012, pp. 864-877.
[10] B. Babenko, M.-H. Yang, S. Belongie, Robust Object Tracking with Online Multiple Instance Learning, IEEE Trans. Pattern Analysis and Machine Intelligence 33 (8) (2011) pp. 1619-1632.
[11] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust Visual Tracking via Multi-Task Sparse Learning, in: CVPR, 2012, pp. 2042-2049.
[12] D. A. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental Learning for Robust Visual Tracking, International Journal of Computer Vision 77 (1-3) (2008) pp. 125-141.

LIU Jing-Wen is an M.S. candidate in the School of Computer Science & Technology at Huazhong University of Science and Technology. Her research interests include visual object tracking, image and video signal processing, and visual surveillance.

SUN Wei-Ping is an associate professor in the School of Computer Science & Technology at Huazhong University of Science and Technology. Her research interests include computer vision and machine learning.

XIA Tao is a lecturer in the School of Computer Science & Technology at Huazhong University of Science and Technology. His research interests include mobile computing, and image and video coding.
