Neurocomputing 332 (2019) 406–416
Multi-video summarization with query-dependent weighted archetypal analysis

Zhong Ji a, Yuanyuan Zhang a, Yanwei Pang a,∗, Xuelong Li b, Jing Pan b,c

a School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
b School of Computer Science and Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi'an 710072, China
c School of Electronic Engineering, Tianjin University of Technology and Education, Tianjin 300072, China
Article info

Article history: Received 7 February 2018; Revised 10 December 2018; Accepted 18 December 2018; Available online 29 December 2018. Communicated by Dr. Yu Jiang.

Keywords: Multi-video summarization; Weighted archetypal analysis; Multi-modal graph; Keyframe extraction
Abstract

Given the tremendous growth of web videos, video summarization is becoming increasingly important for improving users' browsing experience. Since most existing methods focus on generating an informative summarization from a single video and often fail to produce satisfying results for multiple videos, we propose an unsupervised framework for summarizing a set of topic-related videos. We develop a Multi-Video Summarization via Multi-modal Weighted Archetypal Analysis (MVS-MWAA) method to extract a concise summarization that is both representative and informative. To make the summarization query-dependent, we design a multi-modal graph to guide the generation of the weight in WAA, which we call the query-dependent weight. Specifically, the multi-modal graph fuses information from video frames, tags, and query-dependent web images. Furthermore, we present a Ranking from Bottom to Top (RBT) approach to make the summarization easy to follow. Extensive experimental results demonstrate that our approach clearly outperforms the state-of-the-art methods. © 2019 Elsevier B.V. All rights reserved.
1. Introduction

With the development of multimedia technology and the popularity of handheld devices, there is an urgent need for efficient techniques to index and manage the increasing volume of unstructured videos. For example, people often want to capture the main story of one or several videos, especially news videos, as quickly as possible. As one of the promising techniques, video summarization [1–4] aims at condensing a long video or many short videos into a compact form [5,6], and has drawn much attention in recent years. Video summarization can be static or dynamic. Typically, a static summarization is formed from a number of keyframes, while a dynamic one is composed of a succession of video clips. In this paper, we focus on static summarization. In addition, according to the number of videos to be summarized, video summarization can be categorized into Single-Video Summarization (SVS) and Multi-Video Summarization (MVS). Most existing methods focus on SVS, whose purpose is to condense a long video into a compact form [7–10]. Recently, with the popularity of online news videos and personal videos, MVS has received increasing attention [11–13]. MVS
∗ Corresponding author.
E-mail addresses: [email protected] (Z. Ji), [email protected] (Y. Zhang), [email protected] (Y. Pang), [email protected] (X. Li), [email protected] (J. Pan).
https://doi.org/10.1016/j.neucom.2018.12.038
0925-2312/© 2019 Elsevier B.V. All rights reserved.
in this paper refers to query-dependent summarization, whose aim is to condense a large number of retrieved videos into a concise summarization. It enables users to quickly browse and comprehend the main idea of the massive set of videos returned for the same query, and is thus able to appeal to more potential users. Generally, MVS is more challenging than SVS mainly for the following three reasons. (1) Since the videos are retrieved for the same query, they usually have high content redundancy. (2) The videos contain plenty of irrelevant content, which demands that MVS be query-aware to narrow the search intention gap. (3) An SVS result is usually displayed in the chronological order of the original video, whereas MVS needs to handle and analyze a large number of short videos of a few minutes each. Accordingly, it is difficult to set an easy-to-understand presentation order, since the keyframes come from different videos. In recent years, important progress has been made in MVS research. However, generating a summarization from a series of topic-related videos is still a challenging problem. Some studies summarize videos of specific genres by utilizing genre-specific information [14,15]. For example, [14] proposes to apply meta-data sensor information related to geographical areas to summarize multiple sensor-rich topic-related videos. However, the reliance on genre-specific information in turn restricts this type of method to a narrower range of applications. Therefore, some recent attempts try to exploit the user search intents
Fig. 1. Flowchart of the proposed MVS-MWAA framework.
to narrow the search intention gap for query-dependent summarization. For instance, Wang et al. [16] propose a method for event-driven web video summarization via tag localization and key-shot mining, where the searched images are used to estimate their similarities to the keyframe of each shot during key-shot identification. In [17], the authors propose a multi-task deep visual-semantic embedding model, where query-dependent video thumbnails are generated from both visual and side information (e.g., title, description, and query). Besides, Yao et al. [18] apply a supervised learning method to video summarization. They propose a novel pairwise deep ranking model to learn the relationship between highlight and non-highlight video segments, and develop a two-stream network that represents video segments by complementary information on the appearance of video frames and the temporal dynamics across frames for video highlight detection. The summarization is then extracted according to the highlight scores produced by the trained highlight detection model. However, how to efficiently reflect the search intent in MVS is still an open and challenging problem.
Recently, we have seen a proliferation of the Archetypal Analysis (AA) algorithm in different fields, such as economics [19], pattern recognition [20], document summarization [21], and computer vision [22]. It represents each individual in a dataset as a mixture of individuals of pure type, or archetypes [23]. Recently, it has also been used in video summarization [24]. Specifically, the authors propose a novel Co-Archetypal Analysis (CAA) algorithm, which learns canonical visual concepts shared between a video and web images by finding a joint-factorial representation of the two datasets. The frame-level importance is measured based on the learned factorial representation of the video and then combined into shot-level scores, by which a summarization of fixed length is generated. In this paper, we present an alternative way to apply AA to MVS. We propose a query-dependent MVS method based on a weighted AA algorithm. Different from the idea of CAA in [24], we explore the web images retrieved by the same query, together with the tags around the videos, as query-dependent information that guides the AA algorithm toward a query-dependent summarization. Fig. 1 depicts the framework of this paper.
The main contributions of this paper lie in the following three aspects: (1) A novel MVS method with Weighted Archetypal Analysis (WAA) is proposed, which is called Multi-Video Summarization via Multi-modal Weighted AA (MVS-MWAA). (2) To make the summarization query-dependent, we design a multi-modal graph to guide the generation of the weight in WAA, which we call the query-dependent weight. Specifically, the multi-modal graph algorithm exploits not only the video data, but also its tags and query-dependent web images. (3) To make the summarization logical and readable, a novel Ranking from Bottom to Top (RBT) method is developed.

The rest of this paper is organized as follows. Section 2 reviews the related work. The concept of WAA is briefly introduced in Section 3. Section 4 describes the details of the proposed MVS-MWAA method. Section 5 introduces the proposed RBT method for MVS presentation. Experiments are then presented and analyzed in Section 6. Section 7 concludes the paper.

2. Related work

Recently, MVS has attracted increasing attention and great progress has been made. Existing work can be roughly divided into three categories: graph based approaches, multi-modal fusion based approaches, and decomposition based approaches.

Graph based approaches. The graph model is well suited to exploring the relationships among a large number of video frames. For example, Yeo et al. [25] explore a complete multipartite graph to model the semantic relationships between subsequences extracted from multiple videos. Then, an absorbing Markov chain is used to generate a subset of frames containing co-activity from each video. In [26], both visual and textual information are structured into a complex graph, on which the co-clustering of frames and keywords is performed to take advantage of frame-keyword relations. Then, the summarization is constructed from representative keyframes and keywords mined from the clusters with higher importance scores.
By using additional web images, Kim et al. [27] build a similarity graph between images and video frames, on which the video summarization is generated by diversity ranking. In [28], the authors cast the MVS task as a problem of finding dominant sets in a hypergraph, and also apply web images to ensure that the summarization satisfies the user intent.

Multi-modal fusion based approaches. A surge of studies fuse multi-modal information for MVS. For example, in [29], a common textual-visual semantic embedding is utilized to measure the distance between frames and textual queries, leading to a significant performance gain over using the similarity to textual queries alone. The summarization is then created using a submodular mixture of objectives. Li et al. [15] propose a method called Balanced AV-MMR to summarize multiple videos using audio and visual information while striking a good balance between the two. However, it may not meet users' expectations because it neglects the query intent.

Decomposition based approaches. Non-negative Matrix Factorization (NMF) [30], sparse coding [31], and AA algorithms can all be viewed as special cases of matrix factorization. These algorithms have been employed in MVS. For example, motivated by the idea of NMF, Chu et al. [32] propose a video co-summarization method exploiting visual co-occurrence across multiple videos, and introduce a Maximal Biclique Finding (MBF) method that is optimized to find sparsely co-occurring patterns across videos obtained by a topic keyword. Recently, Ji et al. [33] introduce a query-aware method in a sparse coding framework, where the web images and the video frames are explored
to reconstruct a meaningful summarization. Panda et al. [34] also develop a sparse optimization framework to jointly summarize a set of videos by exploring the complementarity within the videos; it incorporates an "interestingness" prior into the sparse representative selection and introduces a diversity regularizer into the optimization framework. Zhang et al. [35] propose a context-aware video summarization (CAVS) framework based on sparse coding to find the most informative video portions, where information about individual local motion regions and the interactions between these motion regions is applied to capture the important video portions. Furthermore, Song et al. [24] observe that images related to the title can serve as a proxy for important visual concepts of the main topic; they therefore introduce Title-based Video Summarization (TVSum), applying their CAA method to learn the canonical visual concepts shared between video and web images.

The proposed MVS-MWAA method can be considered a combination of the above three categories. On the one hand, it fuses information from video frames, web images, and tags via a multi-modal graph, so it can be viewed as a graph based and multi-modal fusion based approach. On the other hand, since it exploits the AA algorithm to generate the query-dependent summarization, it can also be viewed as a decomposition based approach.

3. A brief review of weighted archetypal analysis

Archetypal analysis (AA) [36] represents each individual in a dataset as a mixture of individuals of pure, not necessarily observed, types or archetypes. The archetypes themselves are limited to being mixtures of the individuals in the dataset and lie on the dataset boundary. Generally, the AA model can be regarded as a technique fusing the ideas of clustering and low-rank approximation, combining the advantages of clustering with the flexibility of matrix factorization.
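As a concrete illustration of the factorization idea described above, the following is a minimal numerical sketch of AA via projected gradient descent. It is not the solver used in the paper (which alternates convex least squares problems, as detailed below); the function names and the learning-rate scheme are our own choices.

```python
import numpy as np

def simplex_project(v):
    # Euclidean projection onto the probability simplex {w : w >= 0, sum(w) = 1}
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def archetypal_analysis(X, d, iters=200, lr=0.01, seed=0):
    # Factorize X (n x m) as X ~= P @ (C.T @ X), with rows of P and
    # columns of C constrained to the simplex, via projected gradient steps.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    P = np.apply_along_axis(simplex_project, 1, rng.random((n, d)))
    C = np.apply_along_axis(simplex_project, 0, rng.random((n, d)))
    for _ in range(iters):
        Y = C.T @ X                # current archetypes (d x m)
        R = P @ Y - X              # reconstruction residual
        P = np.apply_along_axis(simplex_project, 1, P - lr * R @ Y.T)
        R = P @ (C.T @ X) - X
        C = np.apply_along_axis(simplex_project, 0, C - lr * X @ R.T @ P)
    return P, C
```

The simplex projections keep each data point a convex combination of archetypes and each archetype a convex combination of data points, which is exactly what the constraints of the archetypal problem demand.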
AA is naturally considered an unsupervised learning technique, defined as follows. Consider a multivariate dataset X = {x_1, x_2, ..., x_n} ∈ R^{n×m} with n observations and m dimensions. Given an archetype number d (d ≪ n), the archetypal problem is to factorize the matrix X into two coefficient matrices P ∈ R^{n×d} and C ∈ R^{n×d}, as defined by Eq. (1):

X ≈ PC^T X.  (1)

More precisely, the AA algorithm first initializes the matrices P and C, and then updates them to minimize the Residual Sum of Squares (RSS) in Eq. (2) until the RSS converges to a small value or the maximum number of iterations is reached:

RSS(k) = ||X − PY^T||^2 with Y = X^T C,
s.t. Σ_{j=1}^{d} P_ij = 1, P_ij ≥ 0, i = 1, 2, ..., n,
     Σ_{i=1}^{n} C_ij = 1, C_ij ≥ 0, j = 1, 2, ..., d,  (2)

where ||·|| denotes the Euclidean matrix norm and k is the number of iterations. As for the constraints in Eq. (2), Σ_{j=1}^{d} P_ij = 1 in conjunction with P_ij ≥ 0 enforces each data instance to be approximated by a convex combination of the archetypes in the feature matrix Y = X^T C, and each archetype is enforced to be a meaningful combination of data instances via the constraints Σ_{i=1}^{n} C_ij = 1 and C_ij ≥ 0.

To solve the approximation problem in Eq. (2), the constrained optimization of the two sets of coefficients P_ij and C_ij is required. They can be estimated by alternately finding the best P for given
archetypes Y and the best archetypes Y for given C; at each step the convex least squares problems are solved until the RSS is successfully reduced. However, the AA problem defined in Eq. (2) assumes that all data instances have the same weight; thus each data instance, and hence each residual, contributes to the minimization of Eq. (2) equally. Unfortunately, matrices representing the similarity graph of candidate keyframes can often be very complex, and AA has difficulty representing additional information such as the importance of and correlations among data instances. To reflect the distinctiveness of data instances and then optimize the problem in Eq. (2), Weighted Archetypal Analysis (WAA) was first proposed in [37]. Recall that X is an n × m matrix, and suppose that W is a corresponding n × n square matrix of weights; the weighted archetypal problem can then be written as the minimization of:
RSS(k) = ||W(X − PY^T)||^2 with Y = X^T C,
s.t. Σ_{j=1}^{d} P_ij = 1, P_ij ≥ 0, i = 1, 2, ..., n,
     Σ_{i=1}^{n} C_ij = 1, C_ij ≥ 0, j = 1, 2, ..., d.  (3)

Then, the problem can be rewritten as:

min_{P,C} RSS(k) = ||X̃ − PỸ^T||^2 with X̃ = WX and Ỹ = X̃^T C.  (4)

4. The proposed MVS-MWAA framework

The proposed MVS-MWAA method employs the candidate keyframes, web images, and the tags around each video to extract meaningful segments from multiple videos. The framework of MVS-MWAA is illustrated in Fig. 1. It consists of four components: the multi-modal graph construction module, the query-dependent WAA module, the summarization generation module, and the final summarization presentation module with the RBT algorithm. Algorithm 1 outlines the procedure of the proposed MVS-MWAA approach. The technical details are as follows.

4.1. Multi-modal graph construction

Assume there are l videos retrieved for the same query. We first preprocess them to obtain a set of candidate keyframes. Let X = {x_1, x_2, ..., x_n} ∈ R^{n×m} represent the visual features of these frames, where n is the number of candidate keyframes and m is the feature dimensionality. Then we construct a similarity graph G = (X, E, W), where X denotes the candidate keyframes, E denotes the edges between any two candidate keyframes, and W represents the weights of the edge connections. To obtain W, we first calculate the visual similarity between candidate keyframes with Eq. (5):

W_v(x_i, x_j) = sim(x_i, x_j) / Σ_{x_k ∈ X, k ≠ i} sim(x_i, x_k),  (5)

where sim(·,·) denotes the similarity between candidate keyframes x_i and x_j; in this paper, we use the cosine similarity. Further, we distinguish intra-video and inter-video frame relations, because candidate keyframes from different videos may carry additional, globally informative content compared with those from the same video. To reflect the different impacts of intra-video and inter-video relations, a weight matrix using tag information is designed as follows:

W_t(x_i, x_j) = { 1,                          if v(x_i) = v(x_j);
                { 1 + sim(v(x_i), v(x_j)),    if v(x_i) ≠ v(x_j),  (6)
sim(v(x_i), v(x_j)) = (t_{v(x_i)} • t_{v(x_j)}) / (|t_{v(x_i)}| × |t_{v(x_j)}|),  (7)
where "•" denotes the dot product and "×" denotes the product of the vector magnitudes. v(x_i) denotes the video that contains the candidate keyframe x_i, and t_{v(x_i)} indicates the tag information around v(x_i), such as titles and descriptions. sim(·,·) here denotes the similarity between two videos. The equality v(x_i) = v(x_j) indicates that the frames x_i and x_j come from the same video. Eq. (6) ensures that the weights of inter-video candidate keyframe edges are larger than those of intra-video ones.

In addition, there is a tremendous number of web images uploaded by users on the Internet. The images retrieved by the same query reflect the users' intent to some extent, and can therefore be used as query-dependent prior information. To this end, we take advantage of the web images to make the summarization query-dependent. Specifically, let Z = {z_1, z_2, ..., z_k} ∈ R^{k×m} denote the visual features of the web images retrieved by the query, where k denotes the number of web images and m is the feature dimensionality. We calculate the average similarity between each candidate keyframe and the web images, and take this similarity as the importance criterion for each keyframe, as shown in Eq. (8).
W_I(x_i, Z) = Σ_{j=1}^{k} sim(x_i, z_j) / |Z|,  (8)

where z_j is the jth web image, sim(·,·) denotes the cosine similarity between x_i and z_j, and |Z| is the size of the web image set. The query-dependent weight matrix W in the multi-modal graph is then:

W = W_v W_t W_I,  (9)

where the three weight matrices are combined multiplicatively.

Algorithm 1 MVS-MWAA approach.
Input: Candidate keyframes X = {x_1, x_2, ..., x_n}, web image set Z = {z_1, z_2, ..., z_k}, tag information set T = {t_1, t_2, ..., t_l}, and the parameter for selecting archetypes T_p.
Output: Keyframe set S = {s_1, s_2, ..., s_p}.
1: Take the candidate keyframes as vertices to construct a multi-modal graph G = (X, E, W) and generate W with Eq. (9).
2: Calculate the input matrix X̃ = WX.
3: Perform WAA on the matrix X̃ and estimate the factorization matrices P and C in Eq. (4).
4: Calculate the significance S_i of each archetype according to Eq. (10).
5: Sort the archetypes in decreasing order and select the top T_p archetypes as the final archetype set.
6: Choose the top two candidate keyframes from each archetype in the final archetype set as keyframes.
7: Remove redundant frames according to their cosine similarities.

4.2. Query-dependent WAA

This section introduces how the above multi-modal graph guides the query-dependent weight in WAA. In query-dependent summarization, the graph model captures both the relationships among and the importance of the candidate keyframes; in this sense, WAA is an enhanced version of the graph-based model. Given the number of archetypes d, we perform the WAA algorithm to estimate the factorization matrices P and C in Eq. (4) with
the query-dependent weight matrix in Eq. (9). In this way, all candidate keyframes are clustered into several sparse archetypes, from which the keyframes will be selected in the next step. The selected keyframes are then represented as a convex combination of these archetypal keyframes.

4.3. Video summarization generation

Different archetypes reflect the different importance of the corresponding content. Generally, archetypes with higher scores carry more important content, which should be selected into the summarization. Specifically, the matrix C obtained in Section 4.2 encodes the archetypes, and its entries denote the archetypal membership scores. To improve the representativeness of the summarization content, we propose a keyframe ranking algorithm. Given the decomposed matrix C, we calculate the significance score S_i of each archetype according to:

S_i = Σ_{j=1}^{T} (C^T X)_ij.  (10)
Then, the archetypes are sorted in decreasing order. To remove noisy archetypes, we select the top T_p archetypes as the final archetype set, and further choose the top two candidate keyframes in each of them as keyframes. In addition, to remove redundancy and enhance diversity, the keyframes are compared pairwise with cosine similarity. If the similarity is higher than 0.8, the two keyframes are considered sufficiently similar and one of them is removed from the summarization set. Finally, the remaining keyframes are arranged to generate the final summarization.

MVS-MWAA satisfies all the MVS criteria of conciseness, representativeness, and informativeness. Specifically, conciseness requires minimizing information redundancy. Since AA can be considered a type of clustering method, the selected keyframes can be viewed as coming from different clusters, which ensures their low redundancy; the final redundancy-removal step further strengthens the conciseness property. Representativeness indicates that the summarization should contain comprehensive video content for better understanding; the fact that the keyframes are extracted from significant archetypes ensures this criterion. Informativeness means that the most important and relevant information is preferred in the summarization; the utilization of the tag information around each video and of the web images makes the final summarization query-dependent, which ensures that it satisfies this criterion.
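The weight construction of Eqs. (5)–(9) and the significance score of Eq. (10) can be sketched as follows. This is an illustrative reimplementation, not the authors' code; in particular, the element-wise combination in `query_dependent_weight` is an assumption, since the exact operator in Eq. (9) is not fully specified here.

```python
import numpy as np

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def visual_weight(X):
    # Eq. (5): cosine similarity between candidate keyframes, normalised
    # per row over all other keyframes (the k != i in the denominator)
    n = len(X)
    S = np.array([[cos_sim(X[i], X[j]) for j in range(n)] for i in range(n)])
    np.fill_diagonal(S, 0.0)
    return S / S.sum(axis=1, keepdims=True)

def tag_weight(video_of, tag_vecs):
    # Eq. (6): 1 for intra-video pairs, 1 + tag similarity (Eq. (7)) for
    # inter-video pairs, so inter-video edges get the larger weights
    n = len(video_of)
    Wt = np.ones((n, n))
    for i in range(n):
        for j in range(n):
            if video_of[i] != video_of[j]:
                Wt[i, j] = 1.0 + cos_sim(tag_vecs[video_of[i]], tag_vecs[video_of[j]])
    return Wt

def image_weight(X, Z):
    # Eq. (8): mean similarity of each keyframe to the query's web images
    return np.array([np.mean([cos_sim(x, z) for z in Z]) for x in X])

def query_dependent_weight(X, video_of, tag_vecs, Z):
    # Eq. (9): combine the three cues (element-wise product is assumed)
    return visual_weight(X) * tag_weight(video_of, tag_vecs) * image_weight(X, Z)[:, None]

def archetype_significance(C, X):
    # Eq. (10): S_i = sum_j (C^T X)_ij, used to rank the archetypes
    return (C.T @ X).sum(axis=1)
```

Here `video_of` maps each candidate keyframe to its source video and `tag_vecs` holds one tag feature vector per video; both names are ours.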
5. MVS presentation

A good summarization is expected to have high logicality and readability so that users can understand it easily. In SVS, the generated keyframes are logically presented following the video play order. However, a summarization obtained from multiple videos does not have such a chronological order, so it is difficult to give users a satisfactory presentation. Therefore, we develop a Ranking from Bottom to Top (RBT) method to provide a user-friendly summarization presentation based on the chronology-correlation and topic-closeness of keyframes.

Firstly, we define a notation a ≺ b to represent that keyframe a ranks ahead of b, and denote a segment A = (a_1 ≺ a_2 ≺ ··· ≺ a_m) as a series of ordered keyframes. We use A ≺ B to indicate that segment A precedes segment B. Then, with each segment initialized with a single keyframe, we select the two segments with the strongest association and concatenate them into one segment. The ranking result is obtained by repeating the concatenation process until all keyframes are arranged in one segment. The aforementioned association is measured from the two aspects of chronology-correlation and topic-closeness, as follows.

Specifically, to deal with the challenge of inferring temporal relations across a diverse set of multiple videos, we define the strength of association for ordering segments A and B, measured first by the chronology-correlation. The chronology-correlation of keyframes is determined by their upload times and their order within the same video. The determination process is defined as follows:

f_chro(A ≺ B) = { 1,    if T(a_m) < T(b_1);
                { 1,    if [V(a_m) = V(b_1)] ∧ [N(a_m) < N(b_1)];
                { 0.5,  if [T(a_m) = T(b_1)] ∧ [V(a_m) ≠ V(b_1)];
                { 0,    otherwise,  (11)

where a_m is the last keyframe in segment A and b_1 is the first keyframe in segment B. T(a_m) is the upload time of keyframe a_m. V(a_m) = V(b_1) denotes that a_m and b_1 come from the same video, and N(a_m) < N(b_1) denotes that a_m appears earlier than b_1 in that video. f_chro(A ≺ B) expresses the degree to which segment B chronologically follows segment A.

In addition, a set of videos showing a particular event usually contains several sub-topics. For instance, a series of videos related to "Britain's Prince William Wedding" typically contains sub-topics such as the wedding itinerary, William's speech, security measures, and keepsakes. Grouping keyframes by sub-topic can improve the readability and logicality of the MVS summarization. Generally speaking, videos are mainly presented visually, and visual features can better characterize video content and topic information. Thus, the topic-closeness criterion measures the association of two segments based on their visual similarities, defined as follows:

f_topic(A ≺ B) = (1/|B|) Σ_{b∈B} max_{a∈A} sim(a, b),  (12)

where f_topic(A ≺ B) measures how well A precedes B under the topic-closeness criterion, and sim(a, b) denotes the cosine similarity between keyframes a and b. For each keyframe b ∈ B, max_{a∈A} sim(a, b) yields the similarity between b and the keyframe a ∈ A that is most similar to it; the summation and averaging normalize the result. We then combine the chronology-correlation and the topic-closeness to measure the correlation of placing segment A in front of segment B, denoted f_cor(A ≺ B):

f_cor(A ≺ B) = f_chro(A ≺ B) + f_topic(A ≺ B).  (13)
Algorithm 2 gives the detailed process of the proposed RBT algorithm.
Algorithm 2 The proposed RBT algorithm.
Input: Keyframe set S = {s_1, s_2, ..., s_p}.
Output: Sorted keyframe set S∗ = {s∗_1, s∗_2, ..., s∗_p}.
1: Create a segment p_i containing only s_i for each keyframe s_i in S.
2: Find the two segments p_A and p_B in S with the maximum strength of association: (p_A, p_B) ← argmax_{p_i, p_j ∈ S, p_i ≠ p_j} f_cor(p_i ≺ p_j), where f_cor(p_i ≺ p_j) is calculated by Eq. (13).
3: Append the keyframes in segment p_B to the end of p_A and remove p_B from S.
4: Repeat steps 2 and 3 until a single segment is left in S.
5: Return the sorted keyframe set S∗ = {s∗_1, s∗_2, ..., s∗_p}.
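The RBT procedure and Eqs. (11)–(13) can be sketched as follows. Keyframes are represented as dictionaries with hypothetical keys `t` (upload time), `v` (source video), `n` (in-video order), and `x` (visual feature); this is an illustrative reimplementation, not the authors' code.

```python
import numpy as np
from itertools import permutations

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def f_chro(A, B):
    # Eq. (11): chronology-correlation between the last keyframe of A
    # and the first keyframe of B
    a, b = A[-1], B[0]
    if a["t"] < b["t"]:
        return 1.0
    if a["v"] == b["v"] and a["n"] < b["n"]:
        return 1.0
    if a["t"] == b["t"] and a["v"] != b["v"]:
        return 0.5
    return 0.0

def f_topic(A, B):
    # Eq. (12): for every keyframe in B, its best visual match in A, averaged
    return sum(max(cos_sim(a["x"], b["x"]) for a in A) for b in B) / len(B)

def rbt(keyframes):
    # Greedily concatenate the two segments with the strongest association
    # f_cor = f_chro + f_topic (Eq. (13)) until one ordered segment remains
    segs = [[k] for k in keyframes]
    while len(segs) > 1:
        i, j = max(permutations(range(len(segs)), 2),
                   key=lambda p: f_chro(segs[p[0]], segs[p[1]])
                               + f_topic(segs[p[0]], segs[p[1]]))
        segs[i] = segs[i] + segs[j]
        del segs[j]
    return segs[0]
```

Because each merge appends an entire segment, previously established orderings inside a segment are never broken, matching steps 2–4 of Algorithm 2.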
Table 1 Description of the MVS1K dataset.

Query ID | Query | #Video | Duration (seconds) | #Web Image
1 | Britain's Prince William wedding 2011 | 90 | 10018 | 324
2 | Prince's dead 2016 | 104 | 13759 | 142
3 | NASA discovers Earth-like planet | 100 | 14816 | 226
4 | American government shut down 2013 | 82 | 10898 | 177
5 | Malaysia Airline MH370 | 109 | 10468 | 435
6 | FIFA corruption scandal 2015 | 90 | 9973 | 177
7 | Obama re-election 2012 | 85 | 10939 | 207
8 | Alphago vs Lee Sedol | 84 | 8025 | 118
9 | Kobe Bryant retirement | 109 | 14933 | 221
10 | Paris terror attacks | 83 | 9687 | 651
Total | - | 936 | 113516 | 2678
Fig. 2. Example illustration of the MVS1K dataset.
6. Experimental results and analysis
6.1. Experimental settings

Most of the existing MVS datasets are either publicly unavailable or small in scale. To the best of our knowledge, the recently introduced MVS1K dataset [33] is the largest publicly available annotated dataset. It has 936 videos from 10 queries, with a total duration of 113,516 seconds. Table 1 shows its details, and Fig. 2 provides an illustration. We use the same settings as in [33]. In particular, the visual feature is a 4352D vector composed of a 4096D VGGNet-19 CNN feature [38] and a 256D HSV color histogram feature. The textual features are 100D word2vec and TF-IDF. All videos are preprocessed with the shot detection method in [39], and the middle frame of each shot is selected to constitute the candidate keyframe set. Moreover, we also evaluate our approach on the TVSum dataset, whose web images were downloaded by ourselves since it provides only videos without web images.

The number of archetypes d is an important parameter. A simple way is to assign it a fixed number; however, this cannot reflect the diversity of each query. To choose a query-specific archetype number, we set d = 0.1 × L, where L is the number of candidate keyframes for each query. In addition, we set T_p = 60% in the fifth step of Algorithm 1.

6.2. Experimental results on MVS1K

6.2.1. Objective experimental results

We evaluate the objective performance by comparing the automatic summarizations generated by different methods with the manually labeled ground truth. Specifically, we first calculate the Euclidean distance between each generated keyframe and each ground-truth keyframe one by one. Two keyframes are considered matched if the normalized distance is smaller than a predefined threshold of 0.6, and matched keyframes are excluded from subsequent comparison rounds. The popular metrics of precision (P), recall (R), and F-score (F) are defined as follows:

P = N_matched / N_AS,  (14)

R = N_matched / N_US,  (15)

F = 2 × P × R / (P + R),  (16)
where N_matched, N_AS, and N_US denote the numbers of matched keyframes, automatically generated keyframes, and ground-truth keyframes, respectively. The final performance is the average over all annotators' ground truths. Since F balances P and R, it is usually used as the main evaluation criterion in the field of video summarization.

We compare our approach with six related approaches. (1) k-means [40]: it clusters the candidate keyframes and then extracts the frames nearest to the cluster centers as keyframes. (2) Dominant Set Clustering (DSC) [41]: a graph based method, where the candidate keyframes are used to build a graph and the dominant set algorithm is applied to cluster the keyframes; the video summarization is composed of the representative frames from the dominant sets. (3) Minimum Sparse Reconstruction (MSR) [4]: a decomposition based approach, which views the final keyframes as basis vectors and adopts a sparsity constraint to formulate video summarization as a minimum sparse reconstruction problem. (4) Video-MMR [15]: it extends the concept of maximal marginal relevance [42] to reward relevant keyframes and penalize redundant keyframes, relying on the visual features of candidate keyframes. (5) Query-Aware Sparse Coding (QUASC) [33]: also a decomposition based approach; it is a query-dependent method based on sparse coding, in which the web images obtained by the query are used as preference information to reveal the query intent. (6) Title-based Video Summarization (TVSum) [24]: closely related to our approach, with AA as its main ingredient. Specifically, it proposes a novel
Table 2
Objective performance comparison of different approaches on the MVS1K dataset. P denotes precision, R denotes recall, F denotes F-score, and #KF denotes the number of keyframes. Columns are query IDs 1-10 and the average. The best performance for each column for each criterion is in bold in the original.

P:
Method           | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | Average
k-means [40]     | 0.589 | 0.667 | 0.602 | 0.338 | 0.492 | 0.559 | 0.771 | 0.410 | 0.410 | 0.652 | 0.549
DSC [41]         | 0.637 | 0.543 | 0.566 | 0.622 | 0.471 | 0.467 | 0.577 | 0.652 | 0.555 | 0.707 | 0.580
MSR [4]          | 0.484 | 0.461 | 0.386 | 0.417 | 0.417 | 0.378 | 0.490 | 0.333 | 0.410 | 0.571 | 0.435
Video-MMR [15]   | 0.724 | 0.607 | 0.625 | 0.590 | 0.505 | 0.581 | 0.675 | 0.611 | 0.654 | 0.667 | 0.624
QUASC [33]       | 0.659 | 0.579 | 0.798 | 0.555 | 0.667 | 0.616 | 0.665 | 0.505 | 0.672 | 0.728 | 0.644
TVSum [24]       | 0.380 | 0.700 | 0.360 | 0.530 | 0.638 | 0.435 | 0.456 | 0.520 | 0.520 | 0.725 | 0.526
MVS-MWAA (ours)  | 0.714 | 0.600 | 0.668 | 0.596 | 0.679 | 0.568 | 0.704 | 0.708 | 0.660 | 0.596 | 0.649

R:
Method           | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | Average
k-means [40]     | 0.563 | 0.471 | 0.538 | 0.334 | 0.426 | 0.496 | 0.563 | 0.211 | 0.361 | 0.227 | 0.419
DSC [41]         | 0.530 | 0.418 | 0.308 | 0.463 | 0.358 | 0.524 | 0.496 | 0.386 | 0.505 | 0.353 | 0.434
MSR [4]          | 0.460 | 0.340 | 0.355 | 0.411 | 0.377 | 0.334 | 0.365 | 0.180 | 0.361 | 0.193 | 0.338
Video-MMR [15]   | 0.705 | 0.615 | 0.450 | 0.443 | 0.351 | 0.405 | 0.614 | 0.326 | 0.568 | 0.315 | 0.479
QUASC [33]       | 0.430 | 0.460 | 0.267 | 0.586 | 0.458 | 0.477 | 0.586 | 0.388 | 0.751 | 0.493 | 0.490
TVSum [24]       | 0.376 | 0.428 | 0.274 | 0.507 | 0.362 | 0.410 | 0.286 | 0.368 | 0.580 | 0.413 | 0.400
MVS-MWAA (ours)  | 0.695 | 0.621 | 0.471 | 0.447 | 0.481 | 0.396 | 0.629 | 0.366 | 0.568 | 0.278 | 0.495

F:
Method           | 1     | 2     | 3     | 4     | 5     | 6     | 7     | 8     | 9     | 10    | Average
k-means [40]     | 0.576 | 0.552 | 0.568 | 0.336 | 0.457 | 0.525 | 0.651 | 0.278 | 0.384 | 0.337 | 0.466
DSC [41]         | 0.578 | 0.472 | 0.399 | 0.530 | 0.407 | 0.494 | 0.533 | 0.485 | 0.529 | 0.471 | 0.490
MSR [4]          | 0.472 | 0.391 | 0.370 | 0.414 | 0.396 | 0.355 | 0.418 | 0.234 | 0.384 | 0.288 | 0.372
Video-MMR [15]   | 0.715 | 0.611 | 0.523 | 0.506 | 0.414 | 0.478 | 0.643 | 0.425 | 0.608 | 0.427 | 0.535
QUASC [33]       | 0.520 | 0.513 | 0.400 | 0.570 | 0.513 | 0.538 | 0.623 | 0.439 | 0.709 | 0.588 | 0.544
TVSum [24]       | 0.378 | 0.530 | 0.311 | 0.519 | 0.461 | 0.423 | 0.351 | 0.431 | 0.548 | 0.527 | 0.450
MVS-MWAA (ours)  | 0.705 | 0.610 | 0.553 | 0.511 | 0.563 | 0.466 | 0.664 | 0.483 | 0.611 | 0.379 | 0.555

#KF:
Method           | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | Average
k-means [40]     | 48 | 51 | 59 | 51 | 63 | 47 | 48 | 36 | 39 | 28 | 47.0
DSC [41]         | 42 | 47 | 34 | 39 | 52 | 46 | 55 | 41 | 41 | 41 | 43.8
MSR [4]          | 48 | 51 | 59 | 51 | 63 | 47 | 48 | 36 | 39 | 28 | 47.0
Video-MMR [15]   | 49 | 75 | 46 | 39 | 49 | 37 | 60 | 36 | 39 | 39 | 46.9
QUASC [33]       | 33 | 57 | 21 | 55 | 48 | 41 | 59 | 52 | 51 | 56 | 47.3
TVSum [24]       | 50 | 50 | 45 | 50 | 40 | 50 | 40 | 50 | 50 | 40 | 46.5
MVS-MWAA (ours)  | 49 | 75 | 46 | 39 | 49 | 37 | 60 | 36 | 39 | 39 | 46.9
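The gains discussed in the text are absolute differences of the average F-scores in Table 2; they can be checked directly from the table's last column:

```python
# Average F-scores from Table 2 (F block, "Average" column)
f_avg = {
    "k-means": 0.466, "DSC": 0.490, "MSR": 0.372,
    "Video-MMR": 0.535, "QUASC": 0.544, "TVSum": 0.450,
    "MVS-MWAA": 0.555,
}

# Gain of MVS-MWAA over each baseline, in absolute F-score points
gains = {m: round(f_avg["MVS-MWAA"] - v, 3)
         for m, v in f_avg.items() if m != "MVS-MWAA"}
# gains == {"k-means": 0.089, "DSC": 0.065, "MSR": 0.183,
#           "Video-MMR": 0.02, "QUASC": 0.011, "TVSum": 0.105}
```

These match the 8.9%, 6.5%, 18.3%, 2%, 1.1%, and 10.5% improvements reported below.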
method based on AA to learn canonical visual concepts shared between videos and web images retrieved by the video title, by finding a joint-factorial representation of the two datasets. Moreover, QUASC [33] and TVSum [24] can also be viewed as multi-modal fusion based approaches, where different modalities are applied.

The comparative results are provided in Table 2. We can observe that MVS-MWAA outperforms the others on all three metrics. Since the F-score balances P and R, it is usually used to evaluate the overall performance of video summarization. In terms of F-score, MVS-MWAA outperforms k-means [40], DSC [41], MSR [4], Video-MMR [15], QUASC [33], and TVSum [24] by 8.9%, 6.5%, 18.3%, 2%, 1.1%, and 10.5%, respectively. This clearly demonstrates the effectiveness of the proposed approach. Generally, the length of an automatic summarization also influences P and R. It can be observed that MVS-MWAA, k-means, MSR, Video-MMR, and QUASC select similar numbers of keyframes. However, the precision of MVS-MWAA exceeds those of k-means, MSR, Video-MMR, and QUASC by 10%, 21.4%, 2.5%, and 0.5%, respectively. Although MVS-MWAA selects slightly more keyframes than DSC and TVSum, its precision is still higher than theirs. In terms of recall, MVS-MWAA achieves the best result under a similar average number of keyframes, which reveals its ability to select more keyframes matching the ground truth.

Fig. 3 shows the summarizations generated by the different methods for the query "Malaysia Airline MH370". It confirms that k-means and DSC contain high redundancy and more unimportant keyframes. This is because they cluster visually similar frames but neglect the semantic content; as a result, neither achieves a good balance between representativeness and redundancy.
As for TVSum, although it exploits prior knowledge from web-searched images, the large amount of irrelevant or weakly relevant content in large-scale web videos introduces noise that this approach ignores. Thus, there
are several redundant keyframes. MSR and Video-MMR have less redundancy, but some unimportant keyframes remain, mainly because visually dissimilar frames can bring in diverse yet unimportant information. Moreover, QUASC and our MVS-MWAA contain less unimportant information thanks to the web images, which enhance query-dependent summarization. In addition, the better performance of MVS-MWAA over QUASC further proves the effectiveness of our framework, especially the utilization of query-dependent WAA.

6.2.2. Subjective evaluation
We further evaluate the proposed MVS-MWAA approach through subjective user studies performed in the lab. Five participants (2 males and 3 females), all familiar with the content of the video set, are invited to score the summarizations generated by the six approaches between 1 (poor) and 10 (good) according to their satisfaction. The evaluation results are presented in Fig. 4(a). We can observe that MVS-MWAA outperforms the others in the subjective evaluation, which matches the fact that our proposed method presents the video content well and meets users' preferences on most queries. Fig. 4(b) illustrates the user preferences to analyze statistical reliability: the users show similar preferences among these methods and no serious bias across the queries.

Furthermore, to evaluate the effectiveness of our proposed RBT method, we compare four presentation structures, denoted RBT, RBT-Topic, RBT-Chro, and NULL. Specifically, RBT-Topic applies only the topic-closeness criterion, RBT-Chro applies only the chronology-correlation criterion, and NULL uses no presentation structure. We invite the same five participants to score the four structures and summarize the results in Table 3.
We can observe that RBT scores higher than the other structures, which indicates the user-friendliness of the RBT presentation. This is because a logical presentation structure helps users understand the query
Table 3
Scores for the presentation with four different presentation structures.

Query ID | RBT  | RBT-Topic | RBT-Chro | NULL
1        | 8.0  | 6.1       | 7.0      | 5.4
2        | 7.7  | 5.4       | 5.9      | 4.7
3        | 7.0  | 8.0       | 7.3      | 5.8
4        | 7.9  | 7.3       | 5.1      | 5.0
5        | 7.5  | 6.4       | 6.0      | 5.6
6        | 8.6  | 5.8       | 6.6      | 6.9
7        | 7.4  | 6.3       | 5.6      | 6.0
8        | 7.8  | 7.2       | 5.8      | 6.3
9        | 7.6  | 6.4       | 6.8      | 4.5
10       | 7.3  | 6.6       | 6.3      | 6.0
Average  | 7.68 | 6.55      | 6.24     | 5.62
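The per-structure averages in Table 3 are plain means over the ten queries, which can be verified directly:

```python
# Per-query user scores from Table 3 (queries 1-10)
scores = {
    "RBT":       [8.0, 7.7, 7.0, 7.9, 7.5, 8.6, 7.4, 7.8, 7.6, 7.3],
    "RBT-Topic": [6.1, 5.4, 8.0, 7.3, 6.4, 5.8, 6.3, 7.2, 6.4, 6.6],
    "RBT-Chro":  [7.0, 5.9, 7.3, 5.1, 6.0, 6.6, 5.6, 5.8, 6.8, 6.3],
    "NULL":      [5.4, 4.7, 5.8, 5.0, 5.6, 6.9, 6.0, 6.3, 4.5, 6.0],
}

# Mean score per presentation structure, rounded to two decimals
averages = {k: round(sum(v) / len(v), 2) for k, v in scores.items()}
# averages == {"RBT": 7.68, "RBT-Topic": 6.55, "RBT-Chro": 6.24, "NULL": 5.62}
```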
Table 4
Ablation study of the multi-modal graph on summarization.

  | AA   | VAA   | IAA   | TAA   | MWAA
P | 0.34 | 0.638 | 0.657 | 0.619 | 0.649
R | 0.59 | 0.481 | 0.475 | 0.451 | 0.495
F | 0.42 | 0.536 | 0.540 | 0.512 | 0.555
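All variants in Table 4 build on the same archetypal decomposition X ≈ X·B·A, where columns of B and A lie on the probability simplex. The following is a toy sketch of the plain AA baseline only, assuming random features in place of the paper's VGG descriptors; the simplex step is a crude clip-and-renormalize heuristic, not the authors' solver.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((16, 200))   # toy data: 16-d features x 200 candidate keyframes
k = 5                       # number of archetypes

def to_simplex(M):
    """Crude simplex step: clip to nonnegative, renormalize columns to sum to 1."""
    M = np.clip(M, 1e-12, None)
    return M / M.sum(axis=0, keepdims=True)

B = to_simplex(rng.random((X.shape[1], k)))   # archetypes Z = X @ B (convex combos of frames)
A = to_simplex(rng.random((k, X.shape[1])))   # reconstruction X ~= Z @ A

step = 1e-3
for _ in range(200):            # alternating projected-gradient steps on ||X - X B A||_F^2
    Z = X @ B
    R = Z @ A - X               # residual
    A = to_simplex(A - step * (Z.T @ R))          # gradient w.r.t. A is 2 Z^T R
    B = to_simplex(B - step * (X.T @ R @ A.T))    # gradient w.r.t. B is 2 X^T R A^T

# Frames with the highest archetypal membership (columns of A) are keyframe candidates
membership = A.max(axis=0)
```

The weighted variants (VAA/IAA/TAA/MWAA) would additionally scale the residual by a query-dependent weight derived from the multi-modal graph.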
event more easily, especially for MVS. In addition, RBT-Chro performs worse than RBT-Topic, which demonstrates that the topic-closeness criterion is better than the chronology-correlation one. The reason may lie in the fact that RBT-Chro arranges the summarization by upload time, which leads to poor visual coherence.

6.2.3. Ablation study of the multi-modal graph
In this section, we compare the summarization results obtained with different modalities. We run MVS-MWAA in five different configurations, denoted AA, VAA, IAA, TAA, and MWAA. AA is the simplest one, applying archetypal analysis without any additional weight. VAA operates only on the candidate keyframes with their visual similarity matrix. IAA uses the joint matrix of the candidate keyframes and the web images that reflect the query intent. TAA uses the joint matrix of the candidate keyframes and their corresponding tag information, taking visual and textual information into account simultaneously. MWAA, as described in Section 4, is our proposed approach. Table 4 compares these variants. We can observe that IAA is better than VAA, which proves that the utilization of web images is helpful. However, TAA is inferior to VAA. This may be because the tags around videos are noisy and incomplete and thus do not necessarily reflect the true visual content. Finally, MWAA achieves the best performance, which demonstrates the complementary properties of the multi-modal information, i.e., visual features, web images, and tags.

6.3. Experimental results on TVSum
Fig. 3. Summarizations for the query "Malaysia Airline MH370" by k-means [40], DSC [41], MSR [4], Video-MMR [15], QUASC [33], TVSum [24], and our proposed MVS-MWAA, respectively. A red border indicates an unimportant keyframe, and a yellow border denotes a redundant keyframe.
We also evaluate the MWAA approach on the publicly available Title-based Video Summarization (TVSum) [24] dataset. It contains 50 videos collected from YouTube in 10 categories, such as "Grooming an Animal" and "Making Sandwich", which are taken as queries. Every category has 5 videos of 2-10 min in length, so this dataset can also be used for MVS. Since the number of videos per category is small, we obtain fewer candidate keyframes. To better capture the diversity of every video, we set the archetype number d = 0.15 * L and Tp = 70% in the fifth step of Algorithm 1 on the TVSum dataset. Table 5 provides the results against the six comparative methods. Since the TVSum approach does not provide precision and recall results, we only report its F-score. We can observe that the proposed MWAA method performs better than the comparative approaches.
Fig. 4. User study results: (a) Subjective evaluation by seven methods over 10 queries, and (b) User preference for six summarization methods.
Table 5
Objective performance comparison on the TVSum dataset. P denotes precision, R denotes recall, F denotes F-score. Columns are the ten query categories and the average. The best performance for each column for each criterion is in bold in the original. TVSum [24] reports only F.

P:
Method           | VT   | VU   | GA   | MS   | PK   | PR   | FM   | BK   | BT   | DS   | Average
MSR [4]          | 0.59 | 0.61 | 0.67 | 0.63 | 0.72 | 0.51 | 0.55 | 0.75 | 0.65 | 0.78 | 0.64
Clustering [40]  | 0.53 | 0.68 | 0.75 | 0.76 | 0.60 | 0.75 | 0.69 | 0.80 | 0.51 | 0.54 | 0.66
DSC [41]         | 0.54 | 0.69 | 0.77 | 0.68 | 0.74 | 0.81 | 0.86 | 0.75 | 0.77 | 0.94 | 0.75
Video-MMR [15]   | 0.65 | 0.63 | 0.83 | 0.79 | 0.75 | 0.51 | 0.51 | 0.65 | 0.67 | 0.82 | 0.68
QUASC [33]       | 0.70 | 0.60 | 0.83 | 0.90 | 0.60 | 0.88 | 0.74 | 0.75 | 0.64 | 0.82 | 0.75
MVS-MWAA (ours)  | 0.68 | 0.58 | 0.83 | 0.92 | 0.48 | 0.85 | 0.75 | 0.75 | 0.75 | 0.90 | 0.75

R:
Method           | VT   | VU   | GA   | MS   | PK   | PR   | FM   | BK   | BT   | DS   | Average
MSR [4]          | 0.25 | 0.34 | 0.28 | 0.22 | 0.43 | 0.23 | 0.23 | 0.47 | 0.28 | 0.38 | 0.31
Clustering [40]  | 0.48 | 0.57 | 0.19 | 0.70 | 0.08 | 0.43 | 0.55 | 0.12 | 0.53 | 0.38 | 0.40
DSC [41]         | 0.25 | 0.36 | 0.34 | 0.31 | 0.35 | 0.37 | 0.35 | 0.44 | 0.34 | 0.39 | 0.35
Video-MMR [15]   | 0.58 | 0.53 | 0.21 | 0.73 | 0.10 | 0.29 | 0.41 | 0.10 | 0.70 | 0.58 | 0.42
QUASC [33]       | 0.63 | 0.51 | 0.21 | 0.84 | 0.08 | 0.51 | 0.60 | 0.11 | 0.67 | 0.58 | 0.47
MVS-MWAA (ours)  | 0.50 | 0.47 | 0.56 | 0.59 | 0.62 | 0.60 | 0.62 | 0.37 | 0.33 | 0.51 | 0.52

F:
Method           | VT   | VU   | GA   | MS   | PK   | PR   | FM   | BK   | BT   | DS   | Average
MSR [4]          | 0.36 | 0.43 | 0.39 | 0.33 | 0.54 | 0.32 | 0.32 | 0.57 | 0.40 | 0.51 | 0.42
Clustering [40]  | 0.50 | 0.62 | 0.30 | 0.73 | 0.14 | 0.55 | 0.61 | 0.22 | 0.52 | 0.45 | 0.46
DSC [41]         | 0.34 | 0.48 | 0.47 | 0.43 | 0.47 | 0.50 | 0.50 | 0.56 | 0.47 | 0.56 | 0.48
Video-MMR [15]   | 0.61 | 0.58 | 0.33 | 0.76 | 0.17 | 0.37 | 0.45 | 0.18 | 0.69 | 0.68 | 0.48
TVSum [24]       | 0.52 | 0.55 | 0.41 | 0.58 | 0.44 | 0.53 | 0.51 | 0.47 | 0.49 | 0.48 | 0.50
QUASC [33]       | 0.66 | 0.55 | 0.33 | 0.87 | 0.14 | 0.65 | 0.66 | 0.20 | 0.66 | 0.68 | 0.54
MVS-MWAA (ours)  | 0.58 | 0.52 | 0.67 | 0.72 | 0.54 | 0.70 | 0.68 | 0.50 | 0.46 | 0.65 | 0.60
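The average F-score of MVS-MWAA in Table 5 can be checked as the mean of its per-query scores:

```python
# Per-query F-scores of MVS-MWAA on TVSum (Table 5, F block, "ours" row)
f_ours = [0.58, 0.52, 0.67, 0.72, 0.54, 0.70, 0.68, 0.50, 0.46, 0.65]

# Mean over the ten query categories, matching the reported average of 0.60
avg = round(sum(f_ours) / len(f_ours), 2)
```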
Moreover, it performs best on F-score for the queries "GA", "PK", "PR", and "FM".

7. Conclusions

This paper proposes a query-dependent MVS-MWAA approach that meets the MVS criteria of representativeness, conciseness, and informativeness. In this unsupervised framework, we jointly use the information of video frames, searched web images, and tags to explore the relationships among the candidate keyframes with a multi-modal graph. Then, to generate a representative and concise summarization, we exploit query-dependent WAA to cluster all candidate keyframes into archetypes with distinct significance. The keyframes with higher archetypal membership are selected to generate the summarization. Finally, we present an RBT method to make the summarization easy to understand. Results on MVS1K, the largest publicly available MVS dataset, and the popular TVSum dataset show that our MVS-MWAA approach is effective.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grants 61472273, 61632018, and 61771329, the National Basic Research Program of China (Grant No. 2014CB340400), and Nokia.

References

[1] X. Li, B. Zhao, X. Lu, A general framework for edited video and raw video summarization, IEEE Trans. Image Process. 26 (8) (2017) 3652-3664.
[2] K. Zhang, W.-L. Chao, F. Sha, K. Grauman, Video summarization with long short-term memory, in: Proceedings of the European Conference on Computer Vision, 2016, pp. 766-782.
[3] J. Li, T. Yao, Q. Ling, T. Mei, Detecting shot boundary with sparse coding for video summarization, Neurocomputing 266 (2017) 66-78.
[4] S. Mei, G. Guan, Z. Wang, S. Wan, M. He, D.D. Feng, Video summarization via minimum sparse reconstruction, Pattern Recognit. 48 (2) (2015) 522-533.
[5] H. Sun, Y. Pang, GlanceNets - efficient convolutional neural networks with adaptive hard example mining, Sci. China Inf. Sci. 61 (10) (2018) 109101.
[6] Y. Pang, J. Cao, X. Li, Cascade learning by optimally partitioning, IEEE Trans. Cybern. 47 (12) (2016) 4148-4161.
[7] X. Song, L. Sun, J. Lei, D. Tao, G. Yuan, M. Song, Event-based large scale surveillance video summarization, Neurocomputing 187 (2016) 66-74.
[8] M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool, Creating summaries from user videos, in: Proceedings of the European Conference on Computer Vision, 2014, pp. 505-520.
[9] Y. He, C. Gao, N. Sang, Z. Qu, J. Han, Graph coloring based surveillance video synopsis, Neurocomputing 225 (2016) 64-79.
[10] R. Kannan, G. Ghinea, S. Swaminathan, What do you wish to see? A summarization system for movies based on user preferences, Inf. Process. Manag. 51 (3) (2015) 286-305.
[11] W. Zhang, C. Liu, Z. Wang, G. Li, Q. Huang, W. Gao, Web video thumbnail recommendation with content-aware analysis and query-sensitive matching, Multimed. Tools Appl. 73 (1) (2014) 547-571.
[12] L. Nie, R. Hong, L. Zhang, Y. Xia, D. Tao, N. Sebe, Perceptual attributes optimization for multivideo summarization, IEEE Trans. Cybern. 46 (12) (2016) 2991-3003.
[13] H. Li, L. Yi, B. Liu, Y. Wang, Localizing relevant frames in web videos using topic model and relevance filtering, Mach. Vis. Appl. 25 (7) (2014) 1661-1670.
[14] Y. Zhang, G. Wang, B. Seo, R. Zimmermann, Multi-video summary and skim generation of sensor-rich videos in geo-space, in: Proceedings of the ACM SIGMM Conference on Multimedia Systems, 2012, pp. 53-64.
[15] Y. Li, B. Merialdo, Multimedia maximal marginal relevance for multi-video summarization, Multimed. Tools Appl. 75 (1) (2016) 199-220.
[16] M. Wang, R. Hong, G. Li, Z.-J. Zha, S. Yan, T.-S. Chua, Event driven web video summarization by tag localization and key-shot identification, IEEE Trans. Multimed. 14 (4) (2012) 975-985.
[17] W. Liu, T. Mei, Y. Zhang, C. Che, J. Luo, Multi-task deep visual-semantic embedding for video thumbnail selection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3707-3715.
[18] T. Yao, T. Mei, Y. Rui, Highlight detection with pairwise deep ranking for first-person video summarization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 982-990.
[19] G.C. Porzio, G. Ragozini, D. Vistocco, On the use of archetypes as benchmarks, Appl. Stoch. Models Bus. Ind. 24 (5) (2008) 419-437.
[20] C. Bauckhage, C. Thurau, Making archetypal analysis practical, in: DAGM Symposium on Pattern Recognition, Springer, Berlin, Heidelberg, 2009, pp. 272-281.
[21] E. Canhasi, I. Kononenko, Weighted archetypal analysis of the multi-element graph for query-focused multi-document summarization, Expert Syst. Appl. 41 (2) (2014) 535-543.
[22] M. Mørup, L.K. Hansen, Archetypal analysis for machine learning and data mining, Neurocomputing 80 (2012) 54-63.
[23] S. Seth, M.J. Eugster, Archetypal analysis for nominal observations, IEEE Trans. Pattern Anal. Mach. Intell. 38 (5) (2016) 849-861.
[24] Y. Song, J. Vallmitjana, A. Stent, A. Jaimes, TVSum: summarizing web videos using titles, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5179-5187.
[25] D. Yeo, B. Han, J.H. Han, Unsupervised co-activity detection from multiple videos using absorbing Markov chain, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2016, pp. 3662-3668.
[26] J. Shao, D. Jiang, M. Wang, H. Chen, L. Yao, Multi-video summarization using complex graph clustering and mining, Comput. Sci. Inf. Syst. 7 (1) (2010) 85-98.
[27] G. Kim, L. Sigal, E.P. Xing, Joint summarization of large-scale collections of web images and videos for storyline reconstruction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4225-4232.
[28] Z. Ji, Y. Zhang, Y. Pang, X. Li, Hypergraph dominant set based multi-video summarization, Signal Process. 148 (2018) 114-123.
[29] A.B. Vasudevan, M. Gygli, A. Volokitin, L. Van Gool, Query-adaptive video summarization via quality-aware relevance estimation, in: Proceedings of the ACM Multimedia Conference, 2017, pp. 582-590.
[30] P. Paatero, U. Tapper, Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values, Environmetrics 5 (2) (1994) 111-126.
[31] B.A. Olshausen, D.J. Field, Emergence of simple-cell receptive field properties by learning a sparse code for natural images, Nature 381 (6583) (1996) 607-609.
[32] W.-S. Chu, Y. Song, A. Jaimes, Video co-summarization: video summarization by visual co-occurrence, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3584-3592.
[33] Z. Ji, Y. Ma, Y. Pang, X. Li, Query-aware sparse coding for multi-video summarization, arXiv:1707.04021 (2017).
[34] R. Panda, N.C. Mithun, A. Roy-Chowdhury, Diversity-aware multi-video summarization, IEEE Trans. Image Process. 26 (10) (2017) 4712.
[35] S. Zhang, Y. Zhu, A.K. Roy-Chowdhury, Context-aware surveillance video summarization, IEEE Trans. Image Process. 25 (11) (2016) 5469-5478.
[36] A. Cutler, L. Breiman, Archetypal analysis, Technometrics 36 (4) (1994) 338-347.
[37] M.J. Eugster, F. Leisch, Weighted and robust archetypal analysis, Comput. Stat. Data Anal. 55 (3) (2011) 1215-1225.
[38] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2014, pp. 1-14.
[39] J. Yuan, H. Wang, L. Xiao, W. Zheng, J. Li, F. Lin, B. Zhang, A formal study of shot boundary detection, IEEE Trans. Circuits Syst. Video Technol. 17 (2) (2007) 168-186.
[40] S.E.F. De Avila, A.P.B. Lopes, A. da Luz, A. de Albuquerque Araújo, VSUMM: a mechanism designed to produce static video summaries and a novel evaluation method, Pattern Recognit. Lett. 32 (1) (2011) 56-68.
[41] D. Besiris, A. Makedonas, G. Economou, S. Fotopoulos, Combining graph connectivity & dominant set clustering for video summarization, Multimed. Tools Appl. 44 (2) (2009) 161-186.
[42] J. Carbonell, J. Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries, in: Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 335-336.

Zhong Ji received the Ph.D. degree in signal and information processing from Tianjin University, Tianjin, China, in 2008. He is currently an Associate Professor with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. His current research interests include multimedia understanding, computer vision, and deep learning. He has published more than 50 scientific papers.
Yuanyuan Zhang is a Master student in the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. Her research interests include video summarization and computer vision.
Yanwei Pang received the Ph.D. degree in electronic engineering from the University of Science and Technology of China, Hefei, China, in 2004. He is currently a Professor with the School of Electrical and Information Engineering, Tianjin University, Tianjin, China. His current research interests include object detection and recognition, vision in bad weather, and image processing. He has published more than 100 scientific papers.
Xuelong Li is a full professor with the School of Computer Science and the Center for OPTical IMagery Analysis and Learning (OPTIMAL), Northwestern Polytechnical University, Xi’an 710072, China.
Jing Pan received her B.S. degree in Mechanical Engineering from the North China Institute of Technology (now North University of China), Taiyuan, China, in 2002, and her M.S. degree in Precision Instrument and Mechanism from the University of Science and Technology of China, Hefei, China, in 2007. She is currently an Associate Professor with the School of Electronic Engineering, Tianjin University of Technology and Education, Tianjin, China. Meanwhile, she is pursuing her Ph.D. degree at Tianjin University, China. Her research interests include computer vision and pattern recognition.