Optik - International Journal for Light and Electron Optics 194 (2019) 163124
Optik journal homepage: www.elsevier.com/locate/ijleo
Original research article
Multi-person tracking algorithm based on data association
Yi Zhang, Yongliang Shen⁎, Qiuyu Zhao
College of Electronic Engineering, Heilongjiang University, Harbin 150080, China
ARTICLE INFO
Keywords: Multi-person tracking; Data association; Kalman filtering; Hungarian algorithm
ABSTRACT
To address the diversity of human poses, appearance similarity, and occlusion in real-time road traffic scenarios, this paper proposes a multi-person tracking algorithm based on the tracking-by-detection framework, which uses deep appearance features and motion features of pedestrians to associate tracked objects. We use the YOLOv3 algorithm to detect pedestrian targets in the input video sequence, and the detected targets are tracked and predicted using the Kalman filter algorithm. The deep appearance and motion information of the targets is combined with the Hungarian algorithm to match the prediction and detection results, so as to achieve tracking of multiple pedestrian targets.
1. Introduction

Accurate identification and real-time tracking of pedestrian targets in road scenes is a difficult problem in driverless vehicle technology, and a reliable tracking algorithm will promote the development of driverless vehicles. In recent years, with the continuous development of target detection technology [1–3], tracking-by-detection has become the mainstream approach to tracking. The traditional data association algorithms are mainly the Joint Probabilistic Data Association Filter (JPDAF) [4,5] and Multiple Hypothesis Tracking (MHT) [6,7], both of which perform data association on frame-by-frame images. The JPDAF algorithm associates the detection results of the current frame with the existing tracking targets by probabilistic similarity, and weights all the detection results to match each tracking target, thereby generating a motion trajectory. As the number of tracking targets and detected targets increases, the amount of computation of the JPDAF algorithm grows explosively. Unlike JPDAF, MHT preserves the predicted trajectories of all tracking targets and propagates them forward, calculates the probability of every predicted trajectory, and selects the trajectory with the highest probability as the trajectory of the tracking target. When the tracking targets are occluded to some degree, the computation of the MHT algorithm also increases exponentially. A driverless car places high demands on the real-time performance of the pedestrian tracking algorithm, while the actual road background is complex, the number of pedestrian targets is large, and there are various disturbances and occlusions. The computational complexity of these two algorithms therefore makes them unsuitable for driverless cars. The SORT [8] algorithm is a simpler tracking algorithm based on the tracking-by-detection framework.
It uses the Kalman filter algorithm [9] to predict the motion of each tracking target, calculates the intersection-over-union (IOU) distance between the predicted bounding box of each tracking target and each detection, and then uses the Hungarian method [10] to find the best association. Because this method uses only the IOU distance as the assignment cost matrix, its tracking speed is very fast and exceeds that of most current tracking methods. However, once a target is occluded for a long time, the IOU-based association fails and the target is lost. In this paper, we make some improvements on the basis of the SORT algorithm. We adopt the most advanced YOLOv3
⁎ Corresponding author. E-mail address: [email protected] (Y. Shen).
https://doi.org/10.1016/j.ijleo.2019.163124 Received 30 April 2019; Accepted 18 July 2019 0030-4026/ © 2019 Elsevier GmbH. All rights reserved.
Fig. 1. (a) Tracking algorithm flow chart and (b) data association part flow chart.
[11] algorithm as the detector and combine it with the deep appearance features of the target for data association, to overcome the target loss caused by long-term occlusion.
2. Algorithm implementation

The algorithm proposed in this paper uses the YOLOv3 detector on the input video sequence and initializes the trackers from the detection results of the first frame. A Kalman filter then predicts the motion of all tracking targets frame by frame, the IOU distance of the targets between the two frames is calculated, and the Hungarian method is used to obtain the best association; both the successfully matched and the unmatched targets are recorded. For the unmatched targets, deep appearance features are extracted and the Hungarian method is applied again to obtain the association result, so that target loss due to long-term occlusion is avoided to a certain extent while a high frame rate is maintained. The flow chart of the whole tracking algorithm is shown in Fig. 1(a). The specific flow of the data association part is shown in Fig. 1(b).
2.1. Detector selection

The method in this paper is based on the tracking-by-detection framework, which passes the detection results of the current frame and the previous frame to the tracker and then performs data association to track the targets. The quality of the detector therefore strongly affects the subsequent tracking. In the original SORT paper, changing the detector improved tracking performance by up to 18.9%. The application scenario of the tracking algorithm in this paper is the real-time road scene, which means that the most accurate detector is not necessarily the best choice: the trade-off between the accuracy and the speed of the detector is critical for real-time road tracking. We compared several advanced detection algorithms of recent years, such as the deep residual network (ResNet) [12], the SSD algorithm [13], and the YOLO algorithms [11,14,15], and chose YOLOv3 as the detector in this paper. The YOLO algorithm is best known for its detection speed, which is several times higher than that of other existing detection algorithms. YOLOv3 borrows ideas from ResNet and SSD on the basis of YOLO, which further improves the detection precision. On the ImageNet dataset, the authors of YOLOv3 [11] found that this network performs very well: compared with ResNet-101 and ResNet-152, it achieves similar accuracy while being faster and having fewer network layers.

2.2. State estimation and trajectory processing
Each pedestrian object is characterized at a given moment by an 8-dimensional state $(u, v, \gamma, h, \dot{x}, \dot{y}, \dot{\gamma}, \dot{h})$, where $(u, v)$ is the center position of the bounding box output by the detector, $\gamma$ is the aspect ratio, $h$ is the height, and $(\dot{x}, \dot{y}, \dot{\gamma}, \dot{h})$ is the corresponding velocity information in image coordinates. The prediction results are updated using a standard Kalman filter with a constant-velocity motion model and a linear observation model, with $(u, v, \gamma, h)$ as the observed variables [16]. For each tracking target, a counter records the number of frames since the target was last successfully matched. The counter is incremented each time the Kalman filter predicts the target and is reset to 0 when the prediction is successfully associated with a detection. If the association keeps failing, the counter keeps increasing; once it exceeds a preset maximum threshold, the target is considered to have left the tracking range and its trajectory is terminated. When performing target association, the IOU distance is used first, and the location information is updated for the successfully associated targets. For the targets that are not associated, deep appearance features are extracted and associated with the existing trajectories, and the location information and the deep appearance features are then updated for the successfully associated targets. A target that still cannot be associated is considered a new tracking target, and a new tracker is generated for it.

2.3. Data association

To associate the prediction results of the Kalman filter with the current detection results, we use the Hungarian algorithm to solve the assignment problem. In this paper, we combine the deep appearance information and the IOU distance to complete the data association and update the motion trajectory of each target. Before detections are assigned to the trajectories of existing tracking targets, the position of each previous-frame target in the current frame is predicted by Kalman filtering.
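The constant-velocity prediction and the per-target miss counter described in Section 2.2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: only the state mean is propagated (covariance propagation and the Kalman update step are omitted), the frame interval is assumed to be one time unit, and the names `Track`, `MAX_AGE`, `predict`, `mark_matched`, and `is_terminated` are hypothetical, as is the value chosen for the maximum threshold.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical preset maximum number of consecutive unmatched frames.
MAX_AGE = 30


@dataclass
class Track:
    # State mean: (u, v, gamma, h) followed by the four matching velocities.
    state: List[float]
    # Number of frames since the last successful association.
    misses: int = 0

    def predict(self) -> None:
        """Constant-velocity prediction of the state mean for one frame."""
        for k in range(4):
            # Each observed component advances by its velocity (dt = 1 frame).
            self.state[k] += self.state[k + 4]
        # The counter grows on every prediction.
        self.misses += 1

    def mark_matched(self) -> None:
        """Reset the counter after a successful association.

        The Kalman update of the state itself is left out of this sketch.
        """
        self.misses = 0

    def is_terminated(self) -> bool:
        """True once the counter exceeds the preset maximum threshold."""
        return self.misses > MAX_AGE


track = Track(state=[100.0, 50.0, 0.5, 80.0, 2.0, -1.0, 0.0, 0.5])
track.predict()
print(track.state[:4])  # predicted (u, v, gamma, h): [102.0, 49.0, 0.5, 80.5]
```

A full filter would also carry a covariance matrix and correct the state with the associated detection; the sketch keeps only the parts the text describes explicitly.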
The IOU distance between the predicted position and each detection result in the current frame is taken as the assignment cost matrix, and the Hungarian algorithm is used to calculate the optimal assignment. A detection whose assignment falls below the specified IOU threshold is marked as a possible new target, and whether it really is a new target is determined in a further step. Using the IOU distance handles short-term occlusion of the target, because when the target is occluded, it is the occluder that is detected. When the occluder is close to the target in size, the IOU distance tends to be large and the occluder is associated with the target, so the association can be restored very quickly. When the target is occluded for a long time, however, using the IOU distance alone to assign trajectories causes a large number of targets to be lost and then treated as new tracking targets. To this end, we introduce deep appearance feature matching.

Consider the data association problem of the $t$-th frame: we obtain a set of detection bounding boxes $S_t$ from the target detector, and the target trajectories $T_{t-1}$ of the $(t-1)$-th frame are known. First, with the IOU distance as the assignment cost matrix, the Hungarian method assigns the targets whose IOU distance is greater than the set threshold to the different target trajectories, which gives the trajectories $T_t^{iou}$ of a part of the targets in the $t$-th frame. Targets below the set threshold are marked as possible new targets $S_t^{p-new}$. Deep appearance features are then extracted for these targets, and $f_i$ denotes the deep appearance feature of the $i$-th target in the set $S_t^{p-new}$. At the same time, the deep appearance feature set $R_j = \{f_k^{(j)}\}_{k=1}^{3}$ of the first three frames of track $j$ is extracted and retained.
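The first, IOU-based association stage described above can be sketched as below. Several things here are assumptions rather than details from the paper: boxes are taken to be `(x1, y1, x2, y2)` tuples, the text's "IOU distance" is read as the overlap ratio (larger is better) so the cost is `1 - IOU`, the threshold value `iou_min=0.3` is arbitrary, and an exhaustive search over permutations stands in for the Hungarian method. The exhaustive search returns a minimum-cost assignment for small inputs but does not scale; the function names `iou` and `iou_stage` are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_stage(predicted, detections, iou_min=0.3):
    """Match predicted track boxes to detections by IOU.

    Returns (matches, candidates): matches is a list of
    (track_index, detection_index) pairs whose IOU is at least iou_min;
    candidates lists the detections left over as possible new targets.
    """
    n_trk, n_det = len(predicted), len(detections)
    n = max(n_trk, n_det)
    # Square cost matrix of 1 - IOU; padded rows and columns cost 1 (no overlap).
    cost = [[1.0] * n for _ in range(n)]
    for i in range(n_trk):
        for j in range(n_det):
            cost[i][j] = 1.0 - iou(predicted[i], detections[j])
    # Exhaustive minimum-cost assignment, a stand-in for the Hungarian method.
    best = min(permutations(range(n)),
               key=lambda perm: sum(cost[i][perm[i]] for i in range(n)))
    matches = [(i, best[i]) for i in range(n_trk)
               if best[i] < n_det and 1.0 - cost[i][best[i]] >= iou_min]
    matched = {j for _, j in matches}
    candidates = [j for j in range(n_det) if j not in matched]
    return matches, candidates


predicted_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
detected_boxes = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]
print(iou_stage(predicted_boxes, detected_boxes))  # ([(0, 1), (1, 0)], [2])
```

In this example the third detection overlaps no predicted box, so it is passed on as a possible new target for the appearance-based stage.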
The minimum cosine distance $d(i, j)$ between the $i$-th target in the possible new target set $S_t^{p-new}$ and trajectory $j$ is taken as the appearance matching cost, and the Hungarian method is applied to it:

$$d(i, j) = \min\{\, 1 - f_i^{T} f_k^{(j)} \mid f_k^{(j)} \in R_j \,\} \tag{1}$$
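Eq. (1) can be illustrated with a short sketch. It assumes the feature vectors are nonzero and normalizes them to unit length, so that $1 - f_i^{T} f_k^{(j)}$ is a cosine distance; the paper does not state this normalization explicitly. The helper names `normalize`, `min_cosine_distance`, and `appearance_cost_matrix` are hypothetical, and the two-dimensional features are toy values chosen only to make the arithmetic easy to follow.

```python
import math


def normalize(f):
    """Scale a nonzero feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in f))
    return [x / norm for x in f]


def min_cosine_distance(f_i, gallery):
    """Eq. (1): smallest 1 - f_i^T f_k over the features stored for one track."""
    f_i = normalize(f_i)
    return min(1.0 - sum(a * b for a, b in zip(f_i, normalize(f_k)))
               for f_k in gallery)


def appearance_cost_matrix(candidate_features, galleries):
    """Rows: possible new targets; columns: existing trajectories."""
    return [[min_cosine_distance(f, g) for g in galleries]
            for f in candidate_features]


# One candidate target and two trajectories with stored features (toy values).
gallery_a = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
gallery_b = [[-1.0, 0.0]]
costs = appearance_cost_matrix([[0.6, 0.8]], [gallery_a, gallery_b])
print(costs)  # approximately [[0.04, 1.6]]
```

The resulting matrix would then be handed to the assignment step in the same way as the IOU cost matrix, so that a candidate is attached to the trajectory whose stored appearance it resembles most.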
The minimum cosine distance takes the appearance information into account, and it is particularly effective for recovering the trajectory of a target after long-term occlusion. In this way, the targets assigned from the possible new target set $S_t^{p-new}$ are added to the existing target trajectories, and their position information and appearance information are updated. The remaining targets in the
Table 1 Experimental platform configuration.

Names                      Related configuration
Operating system           Windows
CPU / GHz                  Intel Core i7-8700K, 3.7
RAM / GB                   16
GPU / GB                   NVIDIA GeForce GTX 1070, 8
GPU acceleration library   CUDA 10.0, cuDNN 7.4
possible new target set $S_t^{p-new}$ are marked as new targets $S_t^{new}$, and a new tracker is generated for each of them. Repeating this cycle completes the tracking of all detected targets.

3. Experiment

The improved multi-target tracking system of this paper is tested on the public MOT16 data set [17]. The data set contains surveillance scenes captured with both static and moving cameras, which makes it suitable for multi-target tracking analysis. The configuration of the experimental platform is shown in Table 1.

3.1. Experimental results and analysis

To measure the performance of the algorithm, we use the evaluation metrics proposed by Bernardin et al. [18]. We compared our method with some current mainstream tracking methods on MOT16. The experimental results are shown in Table 2. The results show that the method proposed in this paper greatly improves tracking accuracy and reduces lost trajectories compared with the traditional methods. This is because we use deep learning to extract the deep appearance features of the tracking targets and combine them with the motion features for data association, which greatly improves the tracking accuracy. The table also shows that the IDSw index of the traditional methods is much lower than that of SORT and our method. This is because JPDAF and MHT compute over the whole video stream: they use not only the information of the current frame but also that of future video frames. This leads to more accurate matching of the targets during tracking, but it also makes the traditional methods computationally complex, which harms real-time performance and makes them unsuitable for real-time scene tracking. Compared with the SORT method, the improved method proposed in this paper achieves a modest improvement in accuracy (MOTA of 60.5 versus 59.8 in Table 2).
More importantly, because it considers the deep appearance features of the targets, the method of this paper greatly reduces the number of identity switches of the tracking targets (IDSw of 1129 versus 1423 for SORT). The tracking speed of the SORT method is very fast. Our method is slower because of the deep appearance feature matching, but it still reaches 30 frames per second in the real-time test scenario, which meets the real-time requirement. To test the effectiveness of the tracking algorithm more intuitively, we captured a set of real-time road videos with a vehicle-mounted camera at a driving speed of 40 km/h and selected one video stream to test the tracking effect, which is shown in Fig. 2. The results show that the algorithm has good tracking performance: it accurately matches targets to trajectories and quickly restores the original trajectory when a lost target reappears.

4. Conclusion

Based on the SORT tracking algorithm, we select the state-of-the-art YOLOv3 algorithm as the target detector. In the data association stage, a two-stage data association scheme is proposed: targets are first assigned using the IOU distance, and the deep appearance features of the unassigned targets are then extracted for a second round of data association. The experimental results show that the proposed method works well and can basically meet the needs of real-time road scene tracking. However, the algorithm still has certain shortcomings. The tracking speed of the SORT method can reach about 200 frames per second, whereas our method uses a neural network to extract deep appearance features for data association, which takes a long time. The frame rate in the experiment reached only about 30 frames per second, which is a significant limitation at the speeds of unmanned vehicles and needs further research.

Table 2 Comparison of various algorithm test results.
Method       MOTA   MOTP   MT     ML     IDSw
JPDA_m       26.2   37.4   4.1    67.5   365
MHT_DAM      45.8   54.2   16.2   43.2   590
SORT         59.8   79.6   25.4   22.7   1423
Our method   60.5   79.3   30.2   19.6   1129
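As a reading aid for the MOTA column, the CLEAR MOT metrics [18] are commonly summarized by the expression below. The symbol names are chosen here for illustration and do not appear in the paper: $\mathrm{FN}_t$, $\mathrm{FP}_t$, and $\mathrm{IDSW}_t$ are the numbers of missed targets, false positives, and identity switches in frame $t$, and $\mathrm{GT}_t$ is the number of ground-truth objects in that frame.

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}
```

Because identity switches enter the numerator alongside misses and false positives, a method can lower its IDSw count noticeably while its MOTA changes only slightly, which is the pattern seen between SORT and the proposed method.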
Fig. 2. Experimental results: (a) frame 186, (b) frame 194, (c) frame 217, (d) frame 605, (e) frame 615, (f) frame 626.
Acknowledgements

I would like to express my gratitude to Professor Shen Yongliang for the great assistance given while I completed this paper. I would also like to thank my friends and classmates, who gave me many useful materials and provided enthusiastic help with typesetting and writing the paper.

References

[1] R. Girshick, Fast R-CNN, Proceedings of the IEEE International Conference on Computer Vision, (2015), pp. 1440–1448.
[2] D. Erhan, C. Szegedy, A. Toshev, et al., Scalable object detection using deep neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2014), pp. 2147–2154.
[3] C.Y. Fu, W. Liu, A. Ranga, et al., DSSD: deconvolutional single shot detector, arXiv preprint arXiv:1701.06659 (2017).
[4] T.E. Fortmann, Y. Bar-Shalom, M. Scheffe, Sonar tracking of multiple targets using joint probabilistic data association, IEEE J. Ocean. Eng. 8 (3) (1983) 173–184.
[5] S.H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, I. Reid, Joint probabilistic data association revisited, ICCV, (2015), pp. 3047–3055.
[6] D.B. Reid, An algorithm for tracking multiple targets, IEEE Trans. Autom. Control 24 (6) (1979) 843–854.
[7] C. Kim, F. Li, A. Ciptadi, J.M. Rehg, Multiple hypothesis tracking revisited, ICCV, (2015), pp. 4696–4704.
[8] A. Bewley, G. Zongyuan, F. Ramos, B. Upcroft, Simple online and realtime tracking, ICIP, (2016), pp. 3464–3468.
[9] R. Kalman, A new approach to linear filtering and prediction problems, J. Basic Eng. 82 (Series D) (1960) 35–45.
[10] H.W. Kuhn, The Hungarian method for the assignment problem, Nav. Res. Logist. Q. 2 (1955) 83–97.
[11] J. Redmon, A. Farhadi, YOLOv3: an incremental improvement, arXiv preprint arXiv:1804.02767 (2018).
[12] K. He, X. Zhang, S. Ren, et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 770–778.
[13] W. Liu, D. Anguelov, D. Erhan, et al., SSD: single shot multibox detector, European Conference on Computer Vision, (2016), pp. 21–37.
[14] J. Redmon, S. Divvala, R. Girshick, et al., You only look once: unified, real-time object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), pp. 779–788.
[15] J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger, arXiv preprint (2017).
[16] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, K. Schindler, MOT16: a benchmark for multi-object tracking, arXiv preprint arXiv:1603.00831 (2016).
[17] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, K. Schindler, MOT16: a benchmark for multi-object tracking, CoRR (2016).
[18] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, EURASIP J. Image Video Process. (1) (2008) 1–10.