Pattern Recognition 96 (2019) 106967
Novel event analysis for human-machine collaborative underwater exploration

Yang Cong a,b,∗, Baojie Fan d, Dongdong Hou a,b,c, Huijie Fan a,b, Kaizhou Liu a,b, Jiebo Luo e

a State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, China
b Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, China
c University of Chinese Academy of Sciences, China
d College of Automation, Nanjing University of Posts and Telecommunications, China
e Department of Computer Science, University of Rochester, USA
Article history: Received 27 November 2018; Revised 27 March 2019; Accepted 11 July 2019; Available online 19 July 2019

Keywords: Underwater; Underwater robot; Visual summarization; Visual saliency; Visual tracking; Robot vision; Video analysis; Novel event; Deep sea
Abstract: One of the main tasks of a deep sea submersible is human-machine collaborative scientific exploration, e.g., humans drive the submersible and monitor the cameras around it to observe new fish species or strange topography in a tedious way. In this paper, by defining novel marine animals or any extreme events as novel events, we design a new deep sea novel visual event analysis framework to improve both the efficiency and the accuracy of human-machine collaboration. Specifically, our visual framework covers more diverse functions than most state-of-the-art methods, including novel event detection, tracking and summarization. Due to the power and computation resource limitations of the submersible, we design an efficient deep learning based visual saliency method for novel event detection and propose an online object tracking strategy as well. All the experiments rely on the Chinese Jiaolong, a manned deep sea submersible, which mounts several pan-tilt-zoom (PTZ) cameras and static cameras. We build a new novel deep sea event dataset, and the results justify that our human-machine collaborative visual observation framework can automatically detect, track and summarize novel deep sea events.
1. Introduction

Underwater robots [1] are widely applied to marine scientific exploration. For example, the manned Chinese deep sea submersible [2,3], Jiaolong, is adopted for human-machine collaborative observation of deep sea events and collection of deep sea samples, e.g., marine salvage, deep sea rescue, mineral/oil resource exploitation, deep sea biological research, deep sea gene acquisition, etc. Scientists can reach the deep sea via the submersible, where they can observe the underwater world, map the topography, and use various tools such as the robot arm to grab biological samples from the seabed. Generally, there
This work is supported by the National Natural Science Foundation under Grants 61722311, U1613214, 61821005 and 61533015, the CAS-Youth Innovation Promotion Association Scholarship (2012163) and the Liaoning Revitalization Talents Program (XLYC1807053).
∗ Corresponding author.
E-mail addresses: [email protected] (Y. Cong), [email protected] (B. Fan), [email protected] (D. Hou), [email protected] (H. Fan), [email protected] (K. Liu), [email protected] (J. Luo).
https://doi.org/10.1016/j.patcog.2019.106967 0031-3203/© 2019 Elsevier Ltd. All rights reserved.
are several cameras mounted around the submersible, and all the monitoring work is done by human operators manually, e.g., online watching and analyzing the videos on the screen. Therefore, how to improve the efficiency of human-machine collaboration for deep sea exploration and accomplish more tasks under the energy and computation limitations is a key problem; solving it could reduce human labor as well as the errors or missed important events caused by subjective or objective factors. In this paper, by treating novel events as any unknown marine animals, interesting objects or moving particles, we focus on semi-automatic novel event analysis for the Chinese Jiaolong [2,3]. Since several cameras are mounted around Jiaolong and monitored online by human crews manually, our mission is to help the onboard crews reduce the intensity of their work and make the work more efficient and accurate accordingly. Some previous works have focused on the problem of underwater visual analysis; however, most of them intend to handle a single task, e.g., marine creature detection and tracking [4]. For example, Zhou and Clark [5] propose to track underwater fish via a monocular camera. Clark et al. [6] and Forney et al. [7] follow tagged leopard sharks using an autonomous underwater vehicle (AUV).
Lin et al. [8] intend to track marine life via a multi-AUV platform. Yim et al. [9] use a remotely operated underwater vehicle (ROV) to track a shallow water nocturnal squid. Hsiao and Chen [10] propose a sparse representation method for fish tracking. Chuang et al. [11] use low frame rate stereo videos with low-contrast quality to track fish. Chuang et al. [12] design a multiple kernel tracker to track deformable fish on a moving platform. Gebali et al. [13] detect visual saliency from underwater video. Chuang et al. [14] recognize underwater fish species depending on supervised and unsupervised features [15]. Ravanbakhsh et al. [16] detect underwater fish via shape-based level sets. Instead of focusing on the detection and tracking of marine creatures as most state-of-the-art methods do, we design a general framework for novel deep sea event visual analysis, including novel event detection, novel event tracking and novel event summarization. Several cameras are mounted around "Jiaolong": for the PTZ cameras, our framework first uses visual saliency to detect novel events automatically, initializes the template of the corresponding event, and tracks it with our designed online tracker. Moreover, in order to achieve human-in-the-loop control, the tracking process can be re-initialized by human crews manually at any time to improve the accuracy. For the static cameras, we adopt our previous group sparsity based video summarization to extract key frames for an efficient overview. For a fair comparison, we collect and build a new deep sea novel event dataset to verify the effectiveness of our visual framework. Generally, there are three main contributions as follows:
i We propose a new problem, i.e., novel deep sea event analysis for deep sea scientific exploration, to reduce the intensity of the work and also improve the efficiency. To the best of our knowledge, this is the first work to analyze novel deep sea events using multimedia technologies, including novel deep sea event detection, tracking and summarization simultaneously.
ii Due to the energy/power resource limitation, we propose a general low cost visual analysis framework, including visual saliency detection via simple structured deep learning, especially our new online novel event tracker to overcome non-rigid deformation, and novel event summarization by our efficient key frame extraction.
iii We collect original videos from several sea tests of the Chinese Jiaolong and build a new deep sea video dataset divided into three subdatasets with manual annotations. This new deep sea event analysis dataset will be released soon.

2. Related works

By treating any unknown marine creatures, interesting objects or moving particles as novel deep sea events, in this paper we focus on novel deep sea event analysis, including novel deep sea event detection, tracking and summarization. For novel event detection, most previous works focus on fish or underwater creature detection [17], which is crucial for evaluating sea fish behaviors and studying the behavior of underwater creatures. For example, Leow et al. [18] intend to identify copepods using a neural network, Chuang et al. [19] propose to recognize underwater fish via unsupervised feature learning, Huang et al. [20] design a hierarchical classifier method for live fish recognition, and Spampinato et al. [21] propose a sea fish classification framework to help marine biologists understand fish behavior. Most of the above methods actually identify known creatures.
However, the intention of a deep sea submersible is scientific exploration, i.e., we may find new kinds of creatures or new topographies. Therefore, we cannot have enough prior knowledge about the deep sea environment or collect enough samples to
train a classifier. In this paper, we treat novel event detection as a saliency detection issue. Saliency detection [22–25] aims to distinguish arbitrary image changes, e.g., visual saliency for robot perception [26], visual saliency for object segmentation [22], spectral residual based visual saliency [27], context-aware visual saliency [23], deep learning based visual saliency [24], non-local deep feature based visual saliency [25], visual attention based saliency detection for rapid scene analysis [28], and graph-based visual saliency [29]. In this paper, we intend to detect novel events in an efficient way.

For novel event tracking, there are many prominent works on object tracking [30], e.g., kernelized correlation filters (KCF) [31], DSST [32], CN [33], SAMF [34], the sparsity based collaborative model (SCM) [35], Struck [36], tracking-learning-detection (TLD) [37], online metric learning trackers [38,39] and deep learning based tracking [40]. Most previous underwater works mainly concern underwater creature tracking, e.g., fish tracking [41,42]. For instance, Rui et al. [43] perform fish trajectory tracking via a dynamic model hypothesis using an Extreme Learning Machine (ELM). Zhou et al. [44] adopt a Gabor filter to accomplish fish tracking. Trackers relying on a fixed stereo camera suffer from poor motion continuity and cannot handle object tracking against a dynamic background. To address this problem, Chuang et al. [12] design a deformable part model (DPM) based multi-fish tracker for moving platforms; however, it is difficult to overcome all the problems posed by the uneven illumination and ubiquitous noise of the underwater scenario. Since deep sea fish tracking is an online learning process in which fish frequently overlap with each other, suffer serious self-occlusion and change shape with deformable appearance, we design a new efficient online tracking method.

For novel event summarization, there are few works focusing on visual summarization for the deep sea environment. For example, Gebali et al. [13] intend to detect interesting deep sea events via a video abstraction method that overcomes the imaging issues of small and slowly moving fish. Sooknanan et al. [45] enhance and summarize Nephrops habitats from underwater videos. Actually, visual summarization is a hot topic in the multimedia domain, which intends to extract key frames or video skims from long video sequences to achieve knowledge condensation and knowledge search, e.g., egocentric video summarization [46], story-driven video summarization [47], video summarization depending on large-scale web image priors [48], video summarization from consumer video [49], video summarization via group sparsity [50–52], multi-view video summarization [53] and deep learning based video summarization [54]. In comparison with most state-of-the-art methods that only concern a single task, e.g., marine fish tracking, our semi-automatic novel deep sea event analysis framework contains much more diverse functions, including novel deep sea event detection, tracking and summarization, which intends to reduce the work intensity of the onboard crews and improve the work accuracy and efficiency accordingly.

3. Overview of the platform "Jiaolong"

In this paper, we use the Chinese "Jiaolong" [2] as the testing platform as shown in Fig.
1, where the maximum dive depth is more than 7,000 meters with no more than 3 human crew members (1 pilot and 2 scientists), the total weight of "Jiaolong" in air is 22 tons, and the power comes from silver-zinc batteries. The main performance parameters of Jiaolong are shown in Table 1, covering the propulsion system, automatic control system, observation system, acoustic system, underwater communication system, self-navigation system and work operation tools. To overcome various disturbances and improve the robustness of automatic control, e.g., the asynchronous problem
Table 1. The major performance parameters of the Chinese manned deep submersible "Jiaolong".

Index                 Parameter
Size                  8.2 m × 3.2 m × 3.2 m
Speed                 1 kn, max 2.5 kn
Num of Crews          3 in total
Weight                22 tons in air
Propulsion System     1 front propeller, 2 middle propellers, and 4 tail propellers
Control System        automatic orientation, depth, position, emergency manned control
Observation System    8 LED lights, 8 cameras, imaging sonar, side sweep sonar
Acoustic System       7 collision sonars, depth sonar, ultrashort/long baseline sonars
Communication System  VHF, underwater acoustic telephone / communication machine
Navigation System     GPS, motion sensor, depth meter, Doppler meter
Work Tools            7-DOF robot arm, 5-DOF robot arm, hydrothermal sampler, sampling basket, drilling, etc.
Fig. 2. The sensing framework of the Chinese “Jiaolong”.
Fig. 1. The Chinese manned deep sea submersible “Jiaolong”.
between the control period and the measurement period, time-varying system parameters, and various uncertainties of the closed loop, both the adaptive unscented Kalman filter and fuzzy control theory are adopted to achieve robust control and self-navigation. Therefore, "Jiaolong" can achieve precise localization, automatic navigation control, complex information monitoring of the manned cabin, surface real-time monitoring, virtual reality with semi-physical digital simulation and black-box data analysis; it can also move flexibly under automatic control of depth, heading, 3-DOF altitude, velocity and dynamic 3-DOF positioning. Jiaolong is a hybrid deep sea platform for marine research, which is mainly used for deep sea observation and deep sea sample collection, e.g., new species of marine creatures or new minerals. One 7-DOF and one 5-DOF hydraulic manipulator arms are mounted in front of "Jiaolong" to collect various marine samples, and it is also equipped with a hydrothermal sampler, sampling basket, drilling tools, etc. As shown in Fig. 2, the observation system contains several fixed cameras, HD cameras, PTZ cameras, LED lights, an ICCD, and an imaging sonar mounted in the front of and around "Jiaolong", which can be used for deep sea creature observation, self-security surveillance, visual navigation operation and sample collection. In this paper, we focus on novel deep sea event analysis using the videos collected from the observation system. Since the videos of Jiaolong are currently monitored and analyzed by human crews manually, our intention in this paper is to assist the human operators to reduce the intensity of onboard work and improve the work efficiency accordingly.
4. Our framework for novel deep sea event visual analysis

We now present our human-machine collaborative framework for novel deep sea event visual analysis. We consider novel events as any unknown or interesting objects or events, for example, new deep sea plants, new marine creatures (lobster, crab, fish), or minerals (e.g., manganese nodules). Several PTZ and static cameras are mounted around the submersible, and our purpose is to help the onboard scientists reduce their work load so that they do not have to continuously watch the monitors all the time; i.e., ours could be considered a warning system that reduces the probability of false alarms and misses caused by various human subjective factors. The general framework is demonstrated in Fig. 3, which mainly includes three components: 1) novel deep sea event detection, which is activated during the sailing stage and detects various novel events via visual saliency; 2) novel deep sea event tracking, which is initialized based on the results of visual saliency; moreover, the tracking process can be re-initialized by human crews manually to keep the human in the loop; 3) novel deep sea event summarization, which is activated during the diving, sailing and floating stages and is adopted for a quick overview of the large stored video data and for novel event warning with the generated key frames. Details are given in the following subsections.

4.1. Novel deep sea event detection

Generally, for the scientific exploration of novel events, we do not have sufficient prior knowledge about the deep sea environment. Therefore, it is hard to collect enough samples for object or event detection, e.g., detecting various underwater creatures via deep learning. Another strategy for object detection is motion detection,
4
Y. Cong, B. Fan and D. Hou et al. / Pattern Recognition 96 (2019) 106967
Fig. 3. The general visual framework for analyzing novel deep sea events including detection, tracking and summarization.
Fig. 4. The framework of our deep sea novel event detection model.
where the basic assumption is that objects with rapid motion should be novel events. However, all cameras are fixed on the testing platform "Jiaolong" rather than being anchored to the seabed, so motion based object detection cannot work well either. Therefore, we treat novel deep sea event detection as a visual saliency detection issue [22–25,55,56]. Visual saliency models have been broadly used for object tracking, outlier detection, etc. Most earlier saliency detection methods are motivated by the human visual cognition system and compute saliency maps using various hand-crafted features. However, without prior knowledge, these methods cannot generate satisfactory results in all cases. Recently, fully convolutional neural networks (FCNs) have been used for visual saliency detection. Due to the computation resource limitation and motivated by [24], we design a visual saliency method based on simple structured deep learning, whose framework is shown in Fig. 4. Our model is a top-down network with 5 layers, where we use short connections within the HED structure over the skip-layer structure for deep supervision, and fuse the corresponding side outputs with weights accordingly. For testing, non-maximum suppression is adopted to generate bounding boxes from the saliency map, where the events or objects inside a bounding box are assumed to be novel events, and the score of a potential novel event is computed by averaging the saliency map within the corresponding bounding box. A tuning threshold is preset manually to detect novel events; therefore, the greater the score of an event, the more likely it is a novel deep sea event.
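As a rough illustration of this detection step, the sketch below turns a normalized saliency map into scored candidate boxes. It groups salient pixels into connected blobs instead of running the paper's non-maximum suppression, and the function name and threshold values are illustrative assumptions rather than the authors' settings; the saliency network itself is omitted.

```python
import numpy as np
from scipy import ndimage

def detect_novel_events(saliency_map, bin_thresh=0.5, score_thresh=0.3):
    """Turn a saliency map in [0, 1] into scored candidate boxes (x0, y0, x1, y1, score)."""
    # Binarize the saliency map and group salient pixels into connected blobs.
    mask = saliency_map >= bin_thresh
    labels, _ = ndimage.label(mask)
    candidates = []
    for sl in ndimage.find_objects(labels):
        if sl is None:
            continue
        y0, y1 = sl[0].start, sl[0].stop
        x0, x1 = sl[1].start, sl[1].stop
        # Score a candidate by the average saliency inside its bounding box.
        score = float(saliency_map[y0:y1, x0:x1].mean())
        if score >= score_thresh:
            candidates.append((x0, y0, x1, y1, score))
    # The greater the score, the more likely the box contains a novel event.
    return sorted(candidates, key=lambda b: b[-1], reverse=True)
```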
4.2. Novel deep sea event tracking

The tracker template is initialized by the novel event detection. Deep sea event tracking suffers from additional challenges, e.g., the non-rigid deformation of marine creatures, uneven illumination, and back-scatter of the water; to achieve robust tracking, we treat novel deep sea event tracking as an online learning issue and design a new online learning based tracker to handle these challenges. Moreover, the onboard power budget is limited, so low computational complexity is also a significant factor that must be considered carefully.

The developed online tracking algorithm works within the particle filter tracking framework. Given one particle, we denote its feature as $X$, combining HOG and color features to represent the particle. The matrix $X$ is the linear combination of the dictionary templates $Z$ plus an error matrix $E$, which accounts for the occluded or disturbed content in $X$:

$$X = DZ + E = AW, \quad A = [D, I], \quad W = [Z; E],$$

where $X = [X_{0,0}, \ldots, X_{m,n}, \ldots, X_{M-1,N-1}]$ ($M, N$ are the size of the particles) and $D$ is defined as $D = [A_1, \ldots, A_k, \ldots, A_K]$. Here, $A_k$ contains all the circular shifts of the $k$-th base sample $A^k_{m,n}$, $k = 1, \ldots, K$, where $m \in [0, \ldots, M-1]$, $n \in [0, \ldots, N-1]$, and $A_k = [a_k^{0,0}, \ldots, a_k^{M-1,N-1}]$ with $K$ template bases. Therefore, each $A_k$ is circulant and $A$ is block-wise circulant. The particle correlation filtering model solves the following tracking objective function:

$$\min_{W} \; \|X - AW\|_F^2 + \gamma_1 \|W\|_F^2, \qquad (1)$$

where $\gamma_1$ denotes a weight parameter.
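Eq. (1) is a ridge regression problem with the closed-form solution $W = (A^T A + \gamma_1 I)^{-1} A^T X$. A minimal dense sketch is given below; it ignores the circulant structure of $A$ that a practical correlation filter would exploit in the Fourier domain, and the function name is only illustrative.

```python
import numpy as np

def solve_ridge_filter(A, X, gamma1=1e-3):
    """Closed-form solution of Eq. (1): W = (A^T A + gamma1 * I)^{-1} A^T X."""
    d = A.shape[1]
    # Dense solve for illustration; the circulant structure would allow an FFT-based solve.
    return np.linalg.solve(A.T @ A + gamma1 * np.eye(d), A.T @ X)
```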
In order to increase the accuracy of the discriminative filter during the learning stage, we add contextual information to the above objective. The context patches are sampled around the object of interest and can be considered as hard negative samples, so that the learned filter $W$ produces a large response for the target patch and a nearly zero response for the context patches. We enhance Eq. (1) by formulating the context patches as a new regularizer:

$$\min_{W} \; \|X - AW\|_F^2 + \gamma_1 \|W\|_F^2 + \gamma_2 \sum_{i=1}^{p} \|B_i W\|_F^2, \qquad (2)$$

where $B_i$ is the circulant matrix of the context patch $b_i$, $p$ denotes the number of context patches, and $\gamma_2$ is a weight parameter. The objective function in Eq. (2) can be reformulated by stacking the context image patches below the target image patch to form a new matrix $C$:

$$\min_{W} \; \|X_R - CW\|_F^2 + \gamma_1 \|W\|_F^2, \qquad (3)$$

where $X_R = [X, 0, \ldots, 0]^T$ and $C = [A, \sqrt{\gamma_2} B_1, \ldots, \sqrt{\gamma_2} B_p]$.
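The stacking that links Eq. (2) to Eq. (3) can be written in a few lines; the sketch below assumes every context matrix $B_i$ has the same number of columns as $A$ and compatible row counts, and the helper name is hypothetical.

```python
import numpy as np

def stack_target_and_context(A, X, context_mats, gamma2=0.01):
    """Build C and X_R of Eq. (3) so that Eq. (2) reduces to a single ridge problem."""
    # Context matrices are scaled by sqrt(gamma2) and stacked below the target data.
    C = np.vstack([A] + [np.sqrt(gamma2) * B for B in context_mats])
    # The desired response for every context patch is zero.
    XR = np.vstack([X] + [np.zeros((B.shape[0], X.shape[1])) for B in context_mats])
    return C, XR
```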
In order to achieve a more robust correlation filter tracker, we exploit the anisotropy of the response and develop a robust correlation filter tracking algorithm with an elastic net loss function:

$$\min_{W} \; L(X_R - CW) + \gamma_1 \|W\|_F^2, \qquad (4)$$

where $L(\cdot)$ denotes the elastic net loss function. Note that the reformulated model is convex with respect to both $W$ and $E$, and an iterative algorithm is required to approximate the solution. To solve the model in Eq. (4), we rewrite it as

$$\min_{W,E} \; \|CW - X_R + E\|_F^2 + \gamma_1 \|W\|_F^2 + \delta L(E), \qquad (5)$$

where $\delta$ is a tuning parameter. Eq. (5) can be separated into two subproblems, i.e.,

$$\min_{W} \; \|CW - X_R + E\|_F^2 + \gamma_1 \|W\|_F^2, \qquad (6)$$

$$\min_{E} \; \|CW - X_R + E\|_F^2 + \delta L(E). \qquad (7)$$

Eq. (5) can be optimized by solving these two subproblems alternately until the objective function value of the model converges. Eq. (6) can be efficiently solved in the least squares manner. The optimal solution of $E$ in Eq. (7) is
$$E = \sigma\left(\frac{\delta}{4 + 2\delta}, \; \frac{2\,\mathcal{F}^{-1}(X_R - \Theta \odot \Phi)}{2 + \delta}\right), \qquad (8)$$

where $\sigma$ is the shrinkage operator defined as $\sigma(u, v) = \mathrm{sign}(v)\max(0, |v| - u)$, $\mathcal{F}^{-1}$ represents the inverse Fourier transform, and $\odot$ refers to element-wise multiplication. $\Theta$ is the dual conjugate of $W$ with $W = \sum_i \theta_i \varphi_i$, where $\varphi_i$ is the feature-space projector, and $\Phi$ denotes the kernel matrix with each element $\Phi_{i,j} = \varphi_i^T \varphi_j$.
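A rough sketch of the alternating scheme behind Eqs. (5)–(7) is given below, under simplifying assumptions: the W-step is solved densely by ridge regression instead of in the Fourier domain, and the elastic net proximal step of Eq. (8) is approximated by plain element-wise soft shrinkage of the residual. Function names and defaults are illustrative, not the authors' implementation.

```python
import numpy as np

def soft_shrink(x, t):
    """Element-wise soft shrinkage: sign(x) * max(|x| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def solve_tracking_filter(C, XR, gamma1=1e-3, delta=1e-3, n_iter=10):
    """Alternate between the W-subproblem (Eq. (6)) and the E-subproblem (Eq. (7))."""
    d = C.shape[1]
    E = np.zeros_like(XR)
    W = np.zeros((d, XR.shape[1]))
    for _ in range(n_iter):
        # W-step: ridge least squares on the residual X_R - E.
        W = np.linalg.solve(C.T @ C + gamma1 * np.eye(d), C.T @ (XR - E))
        # E-step: shrink the residual X_R - CW as a simple stand-in for Eq. (8).
        E = soft_shrink(XR - C @ W, delta / 2.0)
    return W, E
```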
4.3. Novel deep sea event summarization

We summarize the novel deep sea visual events by treating the problem as a key frame extraction issue [49,52]. Even if we record and store all the video data as a backup, monitoring it is time consuming and tedious. Our intention is to recommend meaningful key frames to the human crews in order to reduce their work burden and to improve the efficiency and accuracy affected by various subjective factors. Our previous model, video summarization based on group sparse dictionary selection [50], is adopted here, where the key idea is to use group sparsity via the $\ell_{2,1}$ norm to obtain a sparser set of key frames with a global optimum. The novel deep sea event summarization model can be formulated as

$$\min_{S} : \; f(S) = \frac{\lambda}{2}\|P - PS\|_F^2 + \frac{(1-\lambda)}{2}\|S\|_{2,1}, \qquad (9)$$

where $P \in \mathbb{R}^{d \times n}$ is the feature pool extracted from the original video shot; $S \in \mathbb{R}^{n \times n}$ is the group sparse coefficient matrix to be pursued; and $\lambda \in [0, 1]$ is a pre-set tuning parameter. The first term evaluates the reconstruction error of recovering the whole feature pool with the selected dictionary. The second term of Eq. (9) enforces the sparsity of the dictionary selection via the $\ell_{2,1}$ norm, which induces a sparse solution of $S$, where the rows of $S$ with $\|S_{i\cdot}\|_2 \neq 0$ indicate the selected dictionary features. Our model attains a global optimum with a convergence rate of $O(1/T^2)$, compared with $O(1/\sqrt{T})$ for the traditional sub-gradient descent method ($T$ denotes the iteration number). We run our novel event summarization model on each video channel independently: we first extract the features of each frame, segment the long video into small video shots online, and then build the feature pool $P$ from the features of each shot. Finally, the key frames are summarized from each video shot by our group sparse model in Eq. (9), and the key frames from all video shots are collected and recommended to the human crews accordingly.
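A minimal sketch of Eq. (9) under stated assumptions: a plain (non-accelerated) proximal gradient solver, so it does not reproduce the accelerated $O(1/T^2)$ rate mentioned above, and the function names, step size choice and defaults are illustrative.

```python
import numpy as np

def row_shrink(S, t):
    """Proximal operator of t * ||S||_{2,1}: shrink the l2 norm of every row by t."""
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    return S * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

def select_key_frames(P, lam=0.5, n_iter=200):
    """Minimize f(S) of Eq. (9) by proximal gradient descent and return key frame indices."""
    n = P.shape[1]
    S = np.zeros((n, n))
    step = 1.0 / (lam * np.linalg.norm(P, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = lam * P.T @ (P @ S - P)                 # gradient of the smooth term
        S = row_shrink(S - step * grad, step * (1.0 - lam) / 2.0)
    # Frames whose rows of S have non-zero norm are the selected key frames.
    key_frames = np.where(np.linalg.norm(S, axis=1) > 1e-8)[0]
    return key_frames, S
```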
Table 2. The summary of our novel deep sea video dataset.

ID   Name                        #Clips   #Frames
1    Novel Event Detection       11       5174
2    Novel Event Tracking        19       9830
3    Novel Event Summarization   2        11,376
5. Comparisons and experiments

We first collect and annotate a new dataset for novel deep sea event visual analysis, and then present various experiments and comparisons to verify the effectiveness of our method accordingly.
Fig. 5. Demo images of our novel deep sea event video dataset.
5.1. Novel deep sea video dataset

All original videos are collected from the Chinese Jiaolong during several real sea tests, where each sea test lasts more than 10 h including diving, sailing and floating as shown in Fig. 3. The sea tests were conducted between August 2009 and October 2009, May 2010 and July 2010, July 2011 and August 2011, and June 2012 and July 2012 [57]. Several biologists were asked to collect and annotate the novel deep sea events, where the novel events include new underwater creatures, novel moving particles, etc. Some demo images are shown in Fig. 5 and the summary of the dataset is given in Table 2. We then categorize it into three individual subdatasets depending on each specific task:
Fig. 6. Demo figures of the novel deep sea event detection by comparing ours (bottom row) with both FASA (second row) and NLDF (third row).
Table 3. The statistical IOU results of ours compared with the state-of-the-arts. The best one is marked in bold.

ID    FASA     NLDF     Ours
1     0.2779   0.2147   0.2302
2     0.2696   0.2136   0.2285
3     0.2602   0.2080   0.2246
4     0.2038   0.1969   0.2076
5     0.2030   0.2007   0.2102
6     0.2091   0.2072   0.2129
7     0.2134   0.2122   0.2109
8     0.2178   0.2155   0.2133
9     0.2173   0.2149   0.2131
10    0.2118   0.2271   0.2235
11    0.2089   0.2266   0.2227
Avg   0.2106   0.2126   0.2143
i The Novel Deep Sea Event Detection Subdataset: There are 11 video clips in total, with lengths varying from 50 to 2868 frames. Various challenges are contained in these videos, e.g., pose variation, abrupt light changes, object rotation, large scale changes, abrupt particle motion and mutual occlusion. The resolution of the original frames is 768 × 576. Bounding boxes are manually annotated by human annotators every five frames. We randomly select 581 RGB images from these videos, manually annotate them, and then separate them into training and testing sets with a 7:3 ratio (a minimal split sketch is given after this list).
ii The Novel Deep Sea Event Tracking Subdataset: There are 19 video clips in total, each containing several hundred frames. The novel events of each video clip are annotated manually and the ground truth is recorded as a bounding box described by the (x, y) coordinates of its top left and bottom right corners for evaluation.
iii The Novel Deep Sea Event Summarization Subdataset: There are 2 videos in total, where each video contains various novel deep sea events. Several humans are invited to label the key frames separately, and we fuse their results to generate the ground truth. Our algorithm is required to select key frames from the corresponding videos for comparison.
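The 7:3 split mentioned in item (i) can be reproduced with a few lines; the helper below is a hypothetical sketch, since the actual split procedure used by the authors, including its random seed, is not specified.

```python
import random

def split_dataset(frame_ids, train_ratio=0.7, seed=0):
    """Randomly split annotated frames into training and testing sets (e.g., 7:3)."""
    ids = list(frame_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# Example: 581 annotated images -> roughly 406 for training and 175 for testing.
train_ids, test_ids = split_dataset(range(581))
```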
5.2. Novel deep sea event detection results

We use our simple structured deep learning based visual saliency model to detect novel deep sea events, where we first estimate the saliency map by defining the pixel value as the probability of being novel. Then the non-maximum suppression method is used to mark each novel event with a bounding box depending on the saliency map. In this subsection, we compare against two salient object detection methods because they are also efficient: 1. FASA [58]: Fast, Accurate, and Size-Aware Salient Object Detection. 2. NLDF [25]: Non-Local Deep Features for Salient Object Detection. The Intersection-over-Union (IOU) is adopted as the evaluation criterion:

$$\mathrm{IOU} = \frac{|\mathrm{Detection\ Result} \cap \mathrm{GroundTruth}|}{|\mathrm{Detection\ Result} \cup \mathrm{GroundTruth}|}.$$
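For concreteness, the IOU of two axis-aligned boxes can be computed as below; the box format (x0, y0, x1, y1) and the function name are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```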
The statistical results are shown in Table 3, where we can see that the average IOU of ours is 0.2143, greater than both FASA and NLDF. The demo results are shown in Fig. 6, and it is obvious that the novel fish and lobster are found, and some other tiny animals are also recognized. The tracking template can be initialized from these results for further novel deep sea object tracking. Specifically, the hyper-parameters are set as: learning rate (1e-8), weight decay (0.0005), momentum (0.9), and loss weight for each side output (1). Our fusion layer weights are all initialized to 0.1667 in the training phase.

5.3. Novel deep sea event tracking results

We evaluate our online tracking algorithm on the tracking benchmark [59] and on our deep sea scenarios in this subsection, respectively. We use the same parameters and initialization for all the sequences; specifically, we set $\gamma_1 = 0.001$, $\gamma_2 = 0.01$, $\delta = 0.001$. For a fair comparison, two typical evaluation criteria are adopted here. 1) The precision of tracking, defined as the percentage of frames with location errors less than a preset threshold, where the location error is measured by the Euclidean distance between the tracked target and the human labeled ground truth. 2) The success rate of tracking, evaluated as the percentage of frames with overlap rates greater than a preset tuning parameter. The overlap rate is defined following the PASCAL challenge object detection score as $\mathrm{area}(\mathrm{ROI}_T \cap \mathrm{ROI}_G) / \mathrm{area}(\mathrm{ROI}_T \cup \mathrm{ROI}_G)$, where ROI indicates the bounding box, and $T$, $G$ represent the current tracking result and the labelled ground truth, respectively. For the comparison on the tracking benchmark dataset [59], we perform the One-Pass Evaluation (OPE) using our online tracker on the public benchmark, and adopt the online toolbox [59] to compute the evaluation plots. Although there are 29 different object trackers in the benchmark dataset [59], we only compare ours with seven state-of-the-art trackers, including KCF [31], DSST [32], CN [33], SAMF [34], SCM [35], Struck [36] and TLD [37]. Specifically, we adopt the source codes and data of these benchmark trackers from the original authors without tuning any parameters. The precision and success rate OPE curves are plotted in Fig. 7.
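The two criteria can be turned into the usual OPE curves as sketched below; the threshold grids (0–50 pixels for precision, 0–1 for overlap) follow the common benchmark convention and are assumptions here, as are the function names.

```python
import numpy as np

def precision_curve(center_errors, thresholds=np.arange(0, 51)):
    """Fraction of frames whose center location error is within each pixel threshold."""
    errors = np.asarray(center_errors, dtype=float)
    return np.array([(errors <= t).mean() for t in thresholds])

def success_curve(overlap_rates, thresholds=np.linspace(0.0, 1.0, 21)):
    """Fraction of frames whose overlap rate exceeds each overlap threshold."""
    overlaps = np.asarray(overlap_rates, dtype=float)
    return np.array([(overlaps > t).mean() for t in thresholds])
```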
Fig. 7. Comparison of the precision and success rate of our online tracker with the state-of-the-arts in [59] on the benchmark datasets.

Table 4. Score of the precision plot comparing ours with the state-of-the-arts on 11 attributes. The top three results are annotated in bold, italic and bold-italic, respectively.

Attribute   Our     SCM [35]   Struck [36]   TLD [37]   KCF [31]   DSST [32]   SAMF [34]   CN [33]
MB          0.748   0.339      0.551         0.518      0.650      0.547       0.650       0.550
FM          0.763   0.333      0.604         0.551      0.602      0.517       0.663       0.480
OV          0.734   0.429      0.539         0.576      0.650      0.515       0.709       0.434
DEF         0.816   0.586      0.521         0.512      0.740      0.660       0.796       0.620
BC          0.800   0.578      0.585         0.428      0.753      0.694       0.708       0.642
IV          0.789   0.594      0.558         0.537      0.728      0.735       0.727       0.587
SV          0.798   0.672      0.639         0.606      0.679      0.730       0.723       0.598
OCC         0.820   0.640      0.564         0.563      0.749      0.716       0.840       0.629
LR          0.501   0.305      0.545         0.381      0.396      0.497       0.458       0.405
IPR         0.798   0.597      0.617         0.584      0.725      0.766       0.690       0.675
OPR         0.834   0.618      0.597         0.596      0.729      0.733       0.763       0.652

Table 5. Score of the success rate plot comparing ours with the state-of-the-arts on 11 attributes. The top three results are annotated in bold, italic and bold-italic, respectively.

Attribute   Our     SCM [35]   Struck [36]   TLD [37]   KCF [31]   DSST [32]   SAMF [34]   CN [33]
MB          0.572   0.298      0.433         0.404      0.497      0.464       0.519       0.410
FM          0.575   0.296      0.462         0.417      0.460      0.435       0.515       0.373
OV          0.599   0.361      0.459         0.457      0.551      0.459       0.611       0.410
DEF         0.586   0.448      0.393         0.378      0.534      0.510       0.622       0.438
BC          0.578   0.450      0.458         0.345      0.535      0.517       0.526       0.453
IV          0.556   0.473      0.428         0.399      0.494      0.563       0.534       0.417
SV          0.521   0.518      0.425         0.421      0.427      0.541       0.516       0.384
OCC         0.563   0.487      0.413         0.402      0.514      0.534       0.621       0.428
LR          0.367   0.279      0.372         0.309      0.312      0.409       0.361       0.311
IPR         0.551   0.458      0.444         0.416      0.497      0.560       0.509       0.469
OPR         0.580   0.470      0.432         0.420      0.496      0.535       0.555       0.443
The statistical tracking results are summarized in Tables 4 and 5 using 11 different attributes. We can conclude that our proposed online tracker achieves more favorable performance than the other state-of-the-art trackers on the tracking benchmark. From these results, we can see that our online tracker outperforms the state-of-the-arts. For the comparison on our novel deep sea event tracking dataset, we select 3 trackers with low computation cost, i.e., KCF [31], TLD [37] and Staple [60]. The corresponding result curves measured by both center error and overlap rate are demonstrated in Fig. 9. By adopting the center position error (CP) as the evaluation criterion, our online learning based tracker achieves a CP of 12.37 and performs better than KCF, Staple and TLD (whose CPs are 48.47, 62.43 and 23.16, respectively). By adopting the overlap rate (OR) criterion, our method, with an OR of 0.59, also outperforms KCF, Staple and TLD (whose ORs are 0.42, 0.32 and 0.57, respectively). The demo results are shown in Fig. 8, where the deep sea fish are affected by various disturbances, such as large scale changes and deformation, heavy occlusion, and bad illumination. It is obvious that our proposed online tracker can robustly keep tracking the deep sea object, whereas some of the other tracking methods lose the target and drift rapidly.
Fig. 8. Sample results of the novel deep sea event tracking.
Fig. 9. The results of both the center position error (CP) and overlap rate (OR) for novel deep sea event tracking.
5.4. Novel deep sea event summarization results

We summarize the novel deep sea events in this subsection. Some results are demonstrated in Fig. 10, where the selected key frames of novel deep sea events are extracted from 5 static cameras. We can observe that the novel / unusual / interesting deep sea events are summarized, for example, various deep sea fish approaching the Jiaolong submersible from far and near, gathering various deep sea mineral samples, or inserting the logo flag using the onboard robot arms. With these extracted key frames, the work intensity of the onboard scientists can be relieved, as they no longer need to continuously monitor the screens all the time; moreover, these results can act as a warning system to reduce the probability of false alarms or misses caused by various objective or subjective factors.

5.5. Comparison of time consumption

In this subsection, we compare the time consumption of our method with other state-of-the-art methods. For novel deep sea event detection, we first compare the runtime of our model with FASA [58] and NLDF [25], where the results are shown in Table 6. Ours is more efficient than the state-of-the-arts. For object tracking, the average running time is about 56.32 fps on the tracking benchmark with 50 particle filters. The platform is equipped with a single NVIDIA TITAN Xp GPU and a 4.0 GHz Intel processor.
Fig. 10. The results of novel deep sea event summarization from the multi-camera system.
Fig. 11. Comparison of the tuning parameters γ1, γ2 and δ of our online tracker.
Table 6. The runtime of ours compared with the state-of-the-arts. The best one is marked in bold.

Method        FASA    NLDF    Ours
Runtime (s)   0.341   0.042   0.029
5.6. Comparison of the tuning parameters γ1, γ2 and δ

In this subsection, we compare the tuning parameters γ1, γ2 and δ of our online tracker. As shown in Fig. 11, when these parameters are tuned within a relatively wide range, the performance of our model does not change abruptly. Therefore, we can conclude that our online tracking model is robust in practice.

6. Conclusions and future plans

We introduce a new problem of analyzing novel deep sea visual events by considering various deep sea creatures or any interesting / unknown events as novel events. To the best of our knowledge, ours is an early work to adopt a general human-machine collaborative framework for automatic deep sea extreme event analysis, which mainly includes novel event detection, tracking and summarization simultaneously. We then design a semi-automatic visual framework to reduce the work intensity of the onboard crews and also improve the work efficiency for deep sea observation, with three components included, i.e., novel event detection, novel event tracking and novel event summarization, respectively. Due to the power consumption limitation, both the computational complexity and the online learning ability are considered carefully. A novel deep sea video dataset is also gathered and labeled based on Jiaolong, the Chinese deep sea submersible. Various experiments and evaluations verify the effectiveness of the proposed framework. Some future work plans and ideas are as follows:

• The deep sea environment is an unknown world for humans; new technologies and new intelligent platforms are urgently needed to explore it. Therefore, many new problems could be defined for research scientists in the future.
• Deep sea exploration is a high-cost task and the collection of deep sea videos is hard. Therefore, the size of the video dataset is still limited, and more video data will be collected, annotated and added to our video dataset. Moreover, we plan to release this dataset for research purposes soon.
• Although we try to design visual algorithms with lower computational complexity that consume less power, all the experiments and evaluations are conducted on the collected video dataset. In the future, we plan to deploy our novel event visual analysis framework on the Jiaolong platform online, which will actually assist the human scientists in reducing their work intensity and improving the work efficiency.
• Our proposed problem is a general one for deep sea / underwater observation, so our proposed framework could be extended to other platforms with similar tasks as well, such as remotely operated vehicles (ROV) and autonomous underwater vehicles (AUV).
References

[1] L. Kang, L. Wu, Y. Wei, S. Lao, Y.-H. Yang, Two-view underwater 3d reconstruction for cameras with unknown poses under flat refractive interfaces, Pattern Recognit. 69 (2017) 251–269.
[2] K. Liu, P. Zhu, Y. Zhao, S. Cui, X. Wang, Research on the control system of the human occupied vehicle "Jiaolong", Chin. Sci. Bull. 58 (S2) (2014) 40–48.
[3] L. Feng, Z. Huaiyang, W. Chunsheng, L. Xiangyang, H. Zhen, C. Cunben, Chinese Jiaolong's first scientific cruise in 2013, in: IEEE OCEANS, IEEE, 2014, pp. 1–8.
[4] A. Plotnik, S. Rock, Hybrid estimation using perceptional information: robotic tracking of deep ocean animals, IEEE J. Ocean. Eng. 36 (2011) 298–315.
[5] J. Zhou, C.M. Clark, Autonomous fish tracking by ROV using monocular camera, in: The 3rd Canadian Conference on Computer and Robot Vision (CRV'06), IEEE, 2006, pp. 68–68.
[6] C.M. Clark, C. Forney, E. Manii, D. Shinzaki, C. Gage, M. Farris, C.G. Lowe, M. Moline, Tracking and following a tagged leopard shark with an autonomous underwater vehicle, J. Field Robot. 30 (3) (2013) 309–322.
[7] C. Forney, E. Manii, M. Farris, M.A. Moline, C.G. Lowe, C.M. Clark, Tracking of a tagged leopard shark with an AUV: sensor calibration and state estimation, in: ICRA, IEEE, 2012, pp. 5315–5321.
[8] Y. Lin, J. Hsiung, R. Piersall, C. White, C.G. Lowe, C.M. Clark, A multi-autonomous underwater vehicle system for autonomous tracking of marine life, J. Field Robot. 34 (4) (2017) 757–774.
[9] S. Yim, C.M. Clark, T. Peters, V. Prodanov, P. Fidopiastis, ROV-based tracking of a shallow water nocturnal squid, in: Oceans - San Diego, 2013, IEEE, 2013, pp. 1–8.
[10] Y.-H. Hsiao, C.-C. Chen, A sparse sample collection and representation method using re-weighting and dynamically updating OMP for fish tracking, in: IEEE ICIP, 2016, pp. 3494–3497.
[11] M.-C. Chuang, J.-N. Hwang, K. Williams, R. Towler, Tracking live fish from low-contrast and low-frame-rate stereo videos, IEEE Trans. Circ. Syst. Video Technol. 25 (1) (2015) 167–179.
[12] M.C. Chuang, J.N. Hwang, J.H. Ye, S.C. Huang, Underwater fish tracking for moving cameras based on deformable multiple kernels, IEEE Trans. Syst. Man Cybern. Syst. 47 (9) (2017) 2467–2477.
[13] A. Gebali, A.B. Albu, M. Hoeberechts, Detection of salient events in large datasets of underwater video, IEEE, 2012.
[14] M.-C. Chuang, J.-N. Hwang, K. Williams, Supervised and unsupervised feature extraction methods for underwater fish species recognition, in: Computer Vision for Analysis of Underwater Imagery (CVAUI), 2014 ICPR Workshop on, IEEE, 2014, pp. 33–40.
[15] Y. Zheng, B. Jeon, L. Sun, J. Zhang, H. Zhang, Student t-hidden Markov model for unsupervised learning using localized feature selection, IEEE Trans. Circ. Syst. Video Technol. 28 (10) (2018) 2586–2598.
[16] M. Ravanbakhsh, M. Shortis, F. Shaifat, A.S. Mian, E. Harvey, J. Seager, An application of shape-based level sets to fish detection in underwater images, GSR, 2014.
[17] M. Mehrnejad, A.B. Albu, D. Capson, M. Hoeberechts, Detection of stationary animals in deep-sea video, in: Oceans - San Diego, 2013, pp. 1–5.
[18] L.K. Leow, L.-L. Chew, V.C. Chong, S.K. Dhillon, Automated identification of copepods using digital image processing and artificial neural network, BMC Bioinform. 16 (18) (2015) S4.
[19] M.-C. Chuang, J.-N. Hwang, K. Williams, A feature learning and object recognition framework for underwater fish images, IEEE Trans. Image Process. 25 (4) (2016) 1862–1872.
[20] P.X. Huang, B.J. Boom, R.B. Fisher, Hierarchical classification with reject option for live fish recognition, Mach. Vis. Appl. 26 (1) (2015) 89–102.
[21] C. Spampinato, D. Giordano, R. Di Salvo, Y.-H.J. Chen-Burger, R.B. Fisher, G. Nadarajan, Automatic fish classification for underwater species behavior understanding, in: ACM International Workshop on Analysis and Retrieval of Tracked Events and Motion in Imagery Streams, ACM, 2010, pp. 45–50.
[22] Y. Li, X. Hou, C. Koch, J.M. Rehg, A.L. Yuille, The secrets of salient object segmentation, in: IEEE CVPR, 2014, pp. 280–287.
[23] S. Goferman, L. Zelnik-Manor, A. Tal, Context-aware saliency detection, IEEE Trans. Pattern Anal. Mach. Intell. 34 (10) (2012) 1915–1926.
[24] Q. Hou, M.-M. Cheng, X. Hu, A. Borji, Z. Tu, P. Torr, Deeply supervised salient object detection with short connections, in: CVPR, IEEE, 2017, pp. 5300–5309.
[25] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S. Li, P.-M. Jodoin, Non-local deep features for salient object detection, in: CVPR, 2017.
[26] Y. Yu, J. Gu, G.K. Mann, R.G. Gosine, Development and evaluation of object-based visual attention for automatic perception of robots, IEEE Trans. Autom. Sci. Eng. 10 (2) (2013) 365–379.
[27] X. Hou, L. Zhang, Saliency detection: a spectral residual approach, in: CVPR, IEEE, 2007, pp. 1–8.
[28] L. Itti, C. Koch, E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Trans. Pattern Anal. Mach. Intell. 20 (11) (1998) 1254–1259.
[29] J. Harel, C. Koch, P. Perona, Graph-based visual saliency, in: NIPS, 2007, pp. 545–552.
[30] P. Liu, C. Liu, W. Zhao, X. Tang, Multi-level context-adaptive correlation tracking, Pattern Recognit. 87 (2019) 216–225.
[31] J.F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Trans. Pattern Anal. Mach. Intell. 37 (3) (2015) 583–596.
[32] M. Danelljan, G. Häger, F. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: BMVC, BMVA Press, 2014, pp. 1–11.
[33] M. Danelljan, F.S. Khan, M. Felsberg, J. van de Weijer, Adaptive color attributes for real-time visual tracking, in: CVPR, IEEE, 2014, pp. 1090–1097.
[34] Y. Li, J. Zhu, A scale adaptive kernel correlation filter tracker with feature integration, in: ECCV, 2014, pp. 1–12.
[35] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based collaborative model, in: CVPR, 2012, pp. 1–8.
[36] S. Hare, A. Saffari, P.H.S. Torr, Struck: structured output tracking with kernels, in: ICCV, 2011, pp. 1–8.
[37] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, IEEE Trans. Pattern Anal. Mach. Intell. 34 (7) (2012) 1409–1422.
[38] Y. Cong, B. Fan, J. Liu, J. Luo, H. Yu, Speeded up low-rank online metric learning for object tracking, IEEE Trans. Circ. Syst. Video Technol. 25 (6) (2015) 922–934.
[39] W. Liu, D. Xu, I.W. Tsang, W. Zhang, Metric learning for multi-output tasks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2) (2019) 408–422.
[40] L. Bertinetto, J. Valmadre, J.F. Henriques, A. Vedaldi, P.H. Torr, Fully-convolutional siamese networks for object tracking, in: European Conference on Computer Vision, Springer, 2016, pp. 850–865.
[41] A. Attanasi, A. Cavagna, L. Del Castello, I. Giardina, GReTA - a novel global and recursive tracking algorithm in three dimensions, IEEE Trans. Pattern Anal. Mach. Intell. 37 (12) (2015).
[42] Z. Wu, T.H. Kunz, M. Betke, Efficient track linking methods for track graphs using network-flow and set-cover techniques, in: IEEE CVPR, 2011, pp. 1185–1192.
[43] N. Rui, B. He, B. Zheng, M.V. Heeswijk, Q. Yu, Y. Miche, A. Lendasse, Extreme learning machine towards dynamic model hypothesis in fish ethology research, Neurocomputing 128 (5) (2014) 273–284.
[44] J. Zhou, C.M. Clark, Autonomous fish tracking by ROV using monocular camera, in: CRV, 2006, pp. 68–68.
[45] K. Sooknanan, Enhancement, Summarization and Analysis of Underwater Videos of Nephrops Habitats, Ph.D. thesis, Citeseer, 2014.
[46] Y.J. Lee, J. Ghosh, K. Grauman, Discovering important people and objects for egocentric video summarization, in: CVPR, 2012.
[47] Z. Lu, K. Grauman, Story-driven summarization for egocentric video, in: CVPR, IEEE, 2013, pp. 2714–2721.
[48] A. Khosla, R. Hamid, C.-J. Lin, N. Sundaresan, Large-scale video summarization using web-image priors, in: CVPR, IEEE, 2013, pp. 2698–2705.
[49] J. Luo, C. Papin, K. Costello, Towards extracting semantically meaningful key frames from personal video clips: from humans to computers, IEEE Trans. Circ. Syst. Video Technol. 19 (2) (2009) 289–301.
[50] Y. Cong, J. Yuan, J. Luo, Towards scalable summarization of consumer videos via sparse dictionary selection, IEEE Trans. Multimed. 14 (1) (2012) 66–75.
[51] S. Wang, Y. Cong, J. Cao, Y. Yang, Y. Tang, H. Zhao, H. Yu, Scalable gastroscopic video summarization via similar-inhibition dictionary selection, Artif. Intell. Med. 66 (2016) 1–13.
[52] Y. Cong, J. Liu, G. Sun, Q. You, Y. Li, J. Luo, Adaptive greedy dictionary selection for web media summarization, IEEE Trans. Image Process. 26 (1) (2017) 185–195.
[53] J. Meng, S. Wang, H. Wang, J. Yuan, Y.-P. Tan, Video summarization via multi-view representative selection, IEEE Trans. Image Process. 27 (5) (2018) 2134–2145.
[54] K. Kumar, D.D. Shrimankar, Deep event learning boost-up approach: DELTA, Multimed. Tools Appl. (2018) 1–21.
[55] D. Zhang, D. Meng, J. Han, Co-saliency detection via a self-paced multiple-instance learning framework, IEEE Trans. Pattern Anal. Mach. Intell. 39 (5) (2017) 865–878.
[56] G. Cheng, P. Zhou, J. Han, Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images, IEEE Trans. Geosci. Remote Sens. 54 (12) (2016) 7405–7415.
[57] Y. Cong, B. Fan, K. Liu, H. Fan, Unusual event analysis for deep sea submersible, in: International Conference on Advanced Robotics and Mechatronics (ICARM), IEEE, 2017, pp. 529–534.
[58] G. Yildirim, S. Süsstrunk, FASA: fast, accurate, and size-aware salient object detection, in: Asian Conference on Computer Vision, 2014, pp. 514–528.
[59] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: a benchmark, in: CVPR, IEEE, 2013, pp. 2411–2418.
[60] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, P.H. Torr, Staple: complementary learners for real-time tracking, in: CVPR, 2016, pp. 1401–1409.
Yang Cong is a full professor at the Chinese Academy of Sciences. He received the B.Sc. degree from Northeast University in 2004, and the Ph.D. degree from the State Key Laboratory of Robotics, Chinese Academy of Sciences in 2009. He was a Research Fellow at the National University of Singapore (NUS) and Nanyang Technological University (NTU) from 2009 to 2011, respectively, and a visiting scholar at the University of Rochester. He has served on the editorial board of the Journal of Multimedia. His current research interests include image processing, computer vision, machine learning, multimedia, medical imaging, data mining and robot navigation. He has authored over 60 technical papers. He is a senior member of the IEEE.

Baojie Fan received the B.S. degree in automation from Qufu Normal University, Qufu, China, in 2006; the M.S. degree in automation from Northwest University, Xi'an, China, in 2008; and the Ph.D. degree in pattern recognition and intelligent systems from the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Beijing, China. His research interests include UAV vision systems, space robots, object tracking, and pattern recognition.
Dongdong Hou is currently a Ph.D. candidate at the State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, University of Chinese Academy of Sciences. She received the B.S. degree from Hebei University of Technology, China, in 2014. Her current research interests include abnormal event detection, dictionary selection, and sparse representation.
Huijie Fan received the B.S. degree in automation from the University of Science and Technology of China, P. R. China, in 2007 and the doctoral degree in pattern recognition and intelligent systems from the University of Chinese Academy of Sciences, P. R. China, in 2014. She is a research associate at the Shenyang Institute of Automation, Chinese Academy of Sciences. Her research interests include medical image processing and machine learning.
Kaizhou Liu received his Ph.D. degree in Mechatronic Engineering from the University of Chinese Academy of Sciences in 2007. Since 2004, he has been with the Shenyang Institute of Automation, Chinese Academy of Sciences, where he is currently a professor. He has published more than 80 journal and conference papers. His research interests include modeling and simulation, path planning and obstacle avoidance, autonomous navigation, and virtual reality for unmanned/manned underwater vehicles.
Jiebo Luo joined the Department of Computer Science at the University of Rochester in 2011 after a prolific career of 15+ years with Kodak Research. His research spans computer vision, machine learning, data mining, social media, and biomedical informatics. He has authored 300+ technical papers and 90+ US patents. He has served as the program chair of ACM Multimedia 2010, IEEE CVPR 2012, ACM ICMR 2016, and IEEE ICIP 2017, as well as on the editorial boards of the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), IEEE Transactions on Multimedia (TMM), IEEE Transactions on Circuits and Systems for Video Technology (TCSVT), Pattern Recognition, Machine Vision and Applications (MVA), and ACM Transactions on Intelligent Systems and Technology (TIST). He is a Fellow of the SPIE, IEEE, and IAPR.