Summarizing video sequence using a graph-based hierarchical approach

PII: S0925-2312(15)01229-1
DOI: http://dx.doi.org/10.1016/j.neucom.2015.08.057
Reference: NEUCOM15989

To appear in: Neurocomputing
Received date: 24 April 2015
Revised date: 5 August 2015
Accepted date: 16 August 2015

Cite this article as: Luciana do Santos Belo, Carlos Antônio Caetano Jr., Zenilton Kleber Gonçalves do Patrocínio Jr. and Silvio Jamil Ferzoli Guimarães, Summarizing video sequence using a graph-based hierarchical approach, Neurocomputing, http://dx.doi.org/10.1016/j.neucom.2015.08.057

Summarizing video sequence using a graph-based hierarchical approach

Luciana do Santos Belo, Carlos Antônio Caetano Jr., Zenilton Kleber Gonçalves do Patrocínio Jr., Silvio Jamil Ferzoli Guimarães∗

Audio-visual Information Processing Laboratory, Computer Science Department, Pontifícia Universidade Católica de Minas Gerais

Abstract

Video summarization is a simplification of video content for compacting the video information. The video summarization problem can be transformed into a clustering problem, in which some frames are selected to saliently represent the video content. In this work, we use a graph-based hierarchical clustering method for computing a video summary. The proposed approach, called HSUMM, adopts a hierarchical clustering method to generate a weight map from the frame similarity graph, from which the clusters (or connected components of the graph) can easily be inferred. Moreover, this strategy allows the application of a similarity measure between clusters during graph partitioning, instead of considering only the similarity between isolated frames. We also provide a unified framework for video summarization based on minimum spanning trees and weight maps, in which HSUMM can be seen as an instance that uses a minimum spanning tree of frames and a weight map based on hierarchical observation scales computed over that tree. Furthermore, a new evaluation measure, called Covering, is proposed to assess the diversity of opinions among users when they produce a summary for the same video. During tests, different strategies for the identification of summary size and for the selection of keyframes were analyzed. Experimental results provide a quantitative and qualitative comparison between the new approach and other popular algorithms from the literature, showing that the new algorithm is robust. Concerning quality measures, HSUMM outperforms the compared methods in terms of F-measure regardless of the visual feature used.

Keywords: Graph-based hierarchical video summarization, Covering, Global descriptors, Observation scales

∗Corresponding author. Email addresses: [email protected] (Luciana do Santos Belo), [email protected] (Carlos Antônio Caetano Jr.), [email protected] (Zenilton Kleber Gonçalves do Patrocínio Jr.), [email protected] (Silvio Jamil Ferzoli Guimarães)

Preprint submitted to Neurocomputing, August 26, 2015

1. Introduction

The increasing number of video files has made the task of searching for a specific content very expensive, because it is necessary to index the video information. Usually, there are two approaches to cope with the indexing problem: (i) manual annotation; or (ii) automatic annotation. The former is expensive and subjective, since it depends on experts to perform the annotation. The latter is objective and directly related to the visual content; however, it depends on the features used. The cost of finding a specific content in a video depends on the index size; thus, instead of considering all video content, one can summarize it in order to reduce the search space. In the literature, there are many approaches to simplify the video content [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. Thus, video summarization is a simplification of video content in order to reduce the amount of data without losing information. The video summarization problem can be transformed into a clustering problem, in which some frames are selected to saliently represent the video content, as illustrated in Fig. 1. In [5], the authors proposed an approach to cope with the video summarization problem in which the clustering is achieved by a K-means algorithm, but it is necessary to know the number of clusters a priori.

In [8], the use of a graph-theoretic FCM algorithm for video summarization was proposed; however, the graph creation is directly related to the number of centers. In [6], the authors used a Delaunay triangulation to automatically identify the frame clusters; however, this approach is expensive and produces very compressed summaries. VISTO [7] is based on the extraction of low-level color features of video frames and on a modification of the Furthest-Point-First algorithm to cluster the frames. This approach is fast, but the summaries are big. In [12], a normalized cut algorithm is employed to globally and optimally partition a video into clusters. The authors in [14] have presented a keyframe-based interface for browsing video (which is represented by a keyframe hierarchy) on different devices. Their algorithm uses a non-temporal clustering method for locating a single keyframe and two new temporal clustering methods (backtrack-balanced and split-balanced) when a video clip is searched. Moreover, it is able to adapt both to the user's task and to the repetitiveness of the video. Recently, some new approaches for video summarization using constraint satisfaction programming [10] and prior information [11] were proposed. In [13], a quasi real-time video summarization method based on dictionary learning was proposed, in which a video summary is generated by combining segments that cannot be sparsely reconstructed using the learned dictionary. In [15], the authors have proposed a Bag-of-Importance model in order to effectively quantify the importance of individual local features in a transformed sparse space, in which both the inter-frame and intra-frame properties of features are exploited and quantified for video summarization. In [16], the authors have proposed a multi-task deep visual-semantic embedding model to directly compute the similarity between the query and video thumbnails by mapping them into a common latent semantic space.

Figure 1: Steps for video summarization.

In [9], the authors use a graph-theoretic divisive clustering algorithm based on the construction of a Minimum Spanning Tree (MST) to select video frames without segmenting the video into shots or scenes, eliminating preprocessing steps. It is important to note that, according to [17], the MST approach for clustering is hierarchical and, thanks to this property, it is easy to compute a video summary for a specified number of keyframes. Following the same strategy, in this work the video summarization problem is also transformed into a graph-based clustering problem in which a cluster (or connected component of the graph), computed

from a graph partition, represents a set of similar frames. Similar to [9], the proposed method – HSUMM – eliminates the preprocessing steps by using an MST. However, instead of computing the graph partition directly over the MST, the proposed approach adopts a hierarchical clustering method to generate a weight map from the MST, from which the clusters can easily be inferred. Moreover, the use of a graph-based hierarchical clustering method allows HSUMM to apply a similarity measure between clusters (i.e., sets of frames) during graph partitioning, instead of considering only the similarity between isolated frames. Distinct strategies could be used in HSUMM to select keyframes that saliently represent the video content. In this work, two different procedures (considering the graph weights or not) are analyzed. Finally, a new evaluation measure, called Covering, is also proposed. This measure assesses the diversity of opinions among users when they produce a summary for the same video, which is not taken into account when average values of quality measures are used, such as in [5].

The use of an MST and the use of observation scales computed on the MST were first introduced in our previous works [9] and [18], respectively. Moreover, in [18] – hereafter called HSUMM1 – only the unweighted center selection procedure was evaluated. The main differences of this work in comparison with the previous ones are highlighted in the following:

• An automatic approach for determining the number of frame clusters. Unlike our previous work [9] – hereafter called AGM1/AGM2 – in which the number of frame clusters is a parameter, an automatic approach is proposed for determining the number of frame clusters based on the homogeneity of each cluster.

• A new keyframe selection based on a weighted graph. This strategy takes into account the dissimilarity values, which are completely ignored in HSUMM1 [18].

• New formulations. In order to clarify some explanations and to create a general framework, we revisit some definitions and formulations presented in HSUMM1 [18].

• Generalization of graph-based hierarchical video summarization. In this work, we present a unified framework for video summarization based on MSTs and weight maps, which will facilitate the adoption and analysis of other strategies for building weight maps in the future.

The paper is organized as follows. Section 2 describes the clustering problem using minimum spanning trees and also defines many concepts used in our work. Section 3 describes our methodology to solve the video summarization problem, while Section 4 presents a new evaluation measure that can be used to compare video summarization methods. Section 5 describes the performed experiments together with a comparative analysis between our approach and the other methods. Finally, we give some conclusions in Section 6.

2. Some fundamental concepts

This section introduces the main concepts used in this work. In order to facilitate the comprehension of our method, the main symbols used throughout the text are summarized in Table 1.

2.1. Similarity graph

Let A ⊂ N², A = {0, . . . , W−1} × {0, . . . , H−1}, where W and H are the width and height of each frame, respectively, and let T ⊂ N, T = {0, . . . , N−1}, in which N is the number of frames of a video. A frame f is a

Table 1: Summary of Symbols and Definitions

Symbol             Definition
f                  video frame
f(x, y)            color value at pixel location (x, y)
VN = (ft)t∈T       video (or sequence) of N frames
D(ft1, ft2)        distance measure between frames at locations t1 and t2
FS(ft1, ft2, δ)    frame similarity between frames at locations t1 and t2 considering threshold δ
Gf = (V, E)        frame graph, in which V is the set of nodes representing video frames and E is the set of edges between every pair of nodes
Gδ = (V, Eδ)       subgraph of Gf, in which Eδ ⊆ E is the set of edges between nodes representing similar frames according to the threshold δ
Gs = (Gδ, w)       frame similarity graph, in which w is the weight value (similarity measure) associated with the edges of Gs
Tf = (T, wT)       tree of frames, in which T = (V, Eδ∗) is a connected acyclic subgraph of Gs and wT associates a weight value to each tree edge
T∗f = (T∗, w∗T)    minimum spanning tree of frames
TM = (T∗, wM)      tree map, in which wM is the weight map used to calculate graph partitions
Si                 center of the frame cluster
Ci                 frame cluster
C                  set of frame clusters
hscale(e)          hierarchical observation scale calculated for edge e
F(e)               equilibrium measure function for edge e
CUSa               accuracy rate
CUSe               error rate
COV                Covering
mA                 number of matching keyframes from the automatic summary
m̄A                 number of non-matching keyframes from the automatic summary
nU                 number of keyframes from the user summary

function from A to R³, where for each spatial position (x, y) in A, f(x, y) represents the color value at pixel location (x, y). A video VN, in domain A × T, can be seen as a sequence of frames f. It can be described by VN = (ft)t∈T, where N is the number of frames contained in the video. A frame is usually described in terms of global descriptors, such as a color histogram or GIST [19], or by using mid-level representations, such as bag of features (BoF) or BossaNova [20].

Definition 1 (Frame graph – Gf). Let VN be a video with N frames. A frame graph Gf = (V, E) is a (complete) undirected graph. Each node vt ∈ V represents a frame ft ∈ VN, t ∈ T. There is an edge e = (vt1, vt2) ∈ E between every two nodes vt1 and vt2 associated with video frames at locations t1 and t2, respectively.

Since the size of a frame graph could be too large in terms of number of edges, the adoption of a similarity measure between frames may help to reduce the total number of edges. Thus, similarity between frames can be assessed as follows. Furthermore, instead of linking all pairs of vertices, a frame graph constrained in time could be envisaged.

Definition 2 (Frame similarity). Let ft1 and ft2 be two video frames at locations t1 and t2, respectively. Two frames are similar if a distance measure D(ft1, ft2) between them is smaller than or equal to a specified threshold δ. The frame similarity is defined as

FS(ft1, ft2, δ) = true, if D(ft1, ft2) ≤ δ; false, otherwise.    (1)
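As a concrete illustration, the thresholded similarity of Definition 2 and Eq. (1) can be sketched in a few lines of Python; the function names are ours and the descriptors are plain lists, not the paper's implementation:

```python
import math

def frame_distance(d1, d2):
    # D(ft1, ft2): L2 norm between two frame descriptors
    # (e.g., color histograms or GIST vectors).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))

def frame_similarity(d1, d2, delta):
    # FS(ft1, ft2, delta) from Eq. (1): true iff D(ft1, ft2) <= delta.
    return frame_distance(d1, d2) <= delta
```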

There are several choices for the distance measure D(·, ·) between two frames, e.g., histogram/frame difference, histogram intersection, difference of histogram means, among others. For the global descriptor GIST, the representations are usually compared using the L2 norm. In this work, we use the L2 norm for comparing two visual descriptors. After selecting a distance measure, it is possible to construct a frame similarity graph based on a frame graph Gf, a distance measure D(·, ·) and a threshold δ, as follows.

Definition 3 (Frame similarity graph – Gs). Let VN be a video with N frames and Gf = (V, E) be a frame graph generated from VN. A frame similarity graph Gs = (Gδ, w) is an undirected weighted graph in which Gδ = (V, Eδ) is a subgraph of Gf, i.e., Eδ ⊆ E, and the function w associates a weight value to each edge e ∈ Eδ, i.e., w : Eδ → R. There is an edge e ∈ Eδ between two nodes vt1 and vt2 if the associated frames ft1 and ft2 are similar:

Eδ = { (vt1, vt2) | vt1 ∈ V, vt2 ∈ V, FS(ft1, ft2, δ) = true }.    (2)
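Building Eδ as in Eq. (2), together with the weight function w(e) = D(ft1, ft2), can be sketched as follows (helper and parameter names are hypothetical):

```python
from itertools import combinations

def build_frame_similarity_graph(descriptors, delta, dist):
    # Gs as an edge dictionary: an edge (t1, t2) exists only when
    # FS(ft1, ft2, delta) holds, and its weight is w(e) = D(ft1, ft2).
    edges = {}
    for t1, t2 in combinations(range(len(descriptors)), 2):
        d = dist(descriptors[t1], descriptors[t2])
        if d <= delta:
            edges[(t1, t2)] = d
    return edges
```

With a toy scalar dissimilarity, only sufficiently close frames get linked, which is exactly the edge-reduction effect discussed above.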

The weight function w could be set to the same distance measure used to establish similarity between two frames, i.e., w(e) = w((vt1, vt2)) = D(ft1, ft2), or to any other function that can help in assessing the visual content of the frames in order to produce a video summary of good quality. Fig. 2(a) illustrates a frame similarity graph of a real video in which only the frames 1, 501, 1001, 1501, 2001, 2501 and 3001 are


Figure 2: Examples of video summarization [9]: (a) frame similarity graph – Gs ; (b) minimum spanning tree of frames – T∗f ; (c) removal of edges with weights ≥ 72; (d) removal of edges with weights ≥ 49.

sampled. The similarity measure used is the histogram intersection in HSV color space, normalized in the range [0, 100]. In order to perform the video summarization without considering a video partitioning (or segmentation) step, the use of a divisive clustering algorithm based on the construction of an MST [21] was proposed in [9]. Thus, we define the minimum spanning tree of frames as a graph structure that preserves the video content and the relationship between video scenes.

Definition 4 (Tree of frames – Tf). Let Gs = (Gδ, w) be a frame similarity graph, in which Gδ = (V, Eδ). An edge-weighted tree of frames Tf = (T, wT), in which T = (V, Eδ∗), is a connected acyclic subgraph of Gs, i.e., Eδ∗ ⊆ Eδ, and the function wT associates a weight value to each edge e ∈ Eδ∗, i.e., wT : Eδ∗ → R (usually wT is set to the same value of w, i.e., wT(e) = w(e), ∀e ∈ Eδ∗). The weight of Tf – w(Tf) – is equal to the sum of the weights of all edges belonging to Eδ∗, i.e., w(Tf) = ∑e∈Eδ∗ wT(e).

Definition 5 (Minimum spanning tree of frames – T∗f). Let Gs = (Gδ, w) be a frame similarity graph. The minimum spanning tree of frames T∗f = (T∗, w∗T) is a tree of frames whose weight is minimal.

Fig. 2(b) illustrates a minimum spanning tree of frames for the frame similarity graph presented in Fig. 2(a). According to [22], a k-clustering divides the elements into k non-empty groups, in which the

insertion of an element into a group depends on a distance measure between this element and the elements already in the group. In order to compute the video summarization, a k-clustering divides the video sequence into k video scenes by eliminating k − 1 edges from Tf, e.g., removing the k − 1 largest edges from T∗f. The edge removal operation, when applied to a tree, produces two connected components. Here, each connected component is called a frame cluster. Fig. 2(c) illustrates the removal of all edges from the T∗f of Fig. 2(b) with weight greater than or equal to 72, i.e., the removal of just one edge, generating two connected components; in Fig. 2(d), all edges with weight greater than or equal to 49 are removed from T∗f (i.e., two edges), producing three connected components.

Definition 6 (Set of frame clusters – C). Let Tf = (T, wT) be an edge-weighted tree of frames, in which T = (V, Eδ∗). Let C = {C1, C2, . . . , Ck} be the set of k connected components (or the set of k frame clusters)

formed by deleting the k − 1 largest edges of Tf, in which Ci = (Vi, Eδi), such that Eδi ⊆ Eδ∗, ∪i=1,...,k Vi = V, and Vi ∩ Vj = ∅, ∀i, j = 1, . . . , k, i ≠ j.

There are different ways of computing the tree of frames Tf which is used in generating a set of frame clusters C, e.g., the minimum spanning tree of frames T∗f as used in [9]. In this work, we generalize this issue by computing a tree map based on a weight map, which can be adopted during the calculation of the

set of frame clusters. The construction of this weight map may follow the hierarchical approach described in [23] for partitioning a graph using observation scales. More details about the generation of the tree map and its weight map can be found in Subsection 3.1, where two distinct mappings are fully described.

The number of clusters and, consequently, the number of video scenes is directly related to the number of edge removal operations when a tree is considered. Also, the process of computing the frame clusters is hierarchical in the sense that each edge removal divides a cluster into two different groups, preserving the inclusion property. It is important to note that two clusters will never be merged after the application of the edge removal operation. However, the saliency of a frame cluster may depend on its size, since components with a small number of frames may represent noise. For example, a black frame or a flashlight frame is probably very dissimilar from all other frames and, consequently, its adjacent edges will be among the largest weight edges of Tf. Thus, the frame cluster produced by the removal of those edges may contain a very small number of frames, and it can be ignored in subsequent analysis. Finally, in order to represent the content of a frame cluster, the video frame which is most similar to the other frames belonging to the same cluster must be chosen. For that, the center of a frame cluster is defined as follows.

Definition 7 (Center of the frame cluster – Si). Let Ci ∈ C be a frame cluster. The center Si of the frame cluster Ci = (Vi, Eδi) is the frame fti, related to a node vti ∈ Vi, which is most similar to the other frames belonging to the same cluster Ci; it is used as a representative of the cluster content.

Usually, the center of the frame cluster may be computed by two different procedures: (i) unweighted center selection; or (ii) weighted center selection. In the former, the center is the frame which is the most central, in terms of number of frames, with respect to the other ones. In the latter, the most central frame in terms of the (dis)similarity measure is selected. This issue will be discussed in more detail in Subsection 3.2.

2.2. Visual descriptors

In this work, we study the behaviour of two different descriptors for describing the video frames when they are used for video summarization: the GIST descriptor [19] and the mid-level representation

BossaNova [20]. Both descriptors can be considered global ones, since they represent the information of the whole image.

2.2.1. GIST

The main idea of the GIST descriptor [19] is to develop a low-dimensional representation of the image which does not require any type of segmentation. The authors propose a set of perceptual dimensions (naturalness, openness, roughness, expansion, ruggedness) that represent the dominant spatial structure of an image. They show that these dimensions may be reliably estimated using spectral and coarsely localized information. The image is divided into a 4-by-4 grid for which orientation histograms are extracted. According to [24], this descriptor is similar in spirit to the SIFT descriptor [25], which is a local visual descriptor. Most of the works using the GIST descriptor resize the image in a preliminary stage, producing a small square image typically ranging from 32 to 128 pixels. Following [24], we adopt a size of 32 × 32 pixels, and visual frames are rescaled to that size irrespective of their aspect ratio. After resizing, the color GIST descriptor is calculated for each visual frame using an implementation¹ that takes as input a square image and produces a 960-dimensional vector.

2.2.2. BossaNova

BossaNova is a mid-level image representation which offers a more information-preserving pooling operation based on a distance-to-codeword distribution. The BossaNova approach follows the Bag-of-Words (BoW) formalism (coding/pooling), but it proposes an image representation which keeps more information than BoW during the pooling step, since it estimates the distribution of the descriptors around each codeword by computing a histogram of distances between the descriptors found in the image and those in the codebook. In BossaNova coding [20], the authors proposed a soft-assignment strategy considering only the k-nearest codewords for coding a local descriptor. This representation was successfully applied in the context of visual recognition, where BossaNova significantly outperforms BoW. Furthermore, by using a simple histogram of distances to capture the relevant information, the method remains very flexible and keeps the representation compact. For those reasons, we chose the BossaNova approach as the mid-level feature to be used in the experiments. More details can be found in [20].

3. Video summarization

In this work, the video summarization problem is transformed into a graph-based clustering problem in which a cluster (or connected component of the graph), computed from a graph partition, represents a set of similar frames. Fig. 3 illustrates the proposed method for video summarization, called HSUMM. The method depends basically on: (i) the visual feature selected for representing a frame; (ii) the dissimilarity measure used for comparing two frames; (iii) the computation of frame clusters respecting some properties; and (iv) the keyframe selection for saliently representing the video content. While the first two are common to several clustering methods in the literature, the others represent the novelty of this kind of approach, and they will be discussed in the following. For instance, we consider both GIST and BossaNova as visual

descriptors, and the L2 norm as the distance (dissimilarity) between two frames.

3.1. Computing frame clusters

There are different ways of computing frame clusters which may be used in generating a set of frame clusters C. Thanks to the adoption of the minimum spanning tree of frames T∗f, the proposed framework eliminates the preprocessing needed to compute the number of clusters, and also eliminates the video segmentation step.

¹Available at http://lear.inrialpes.fr/src/lear_gist-1.1.tgz

Figure 3: Outline of our method: the video is transformed into a frame sequence (step 0); the frame sequence, by using its visual representation, is transformed into a frame similarity graph (step 1); the weight map is computed from the video graph (step 2); the size filtering is applied to the hierarchy of the weight map (step 3); the graph partitions are computed (step 4); the keyframes are selected (step 5); and, finally, the keyframes are presented (step 6).

Thus, the minimum spanning tree of frames T∗f is computed from the frame similarity graph Gs that was generated from the video sequence.
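The two steps just described — computing T∗f from the similarity graph and splitting it into k frame clusters by deleting the k − 1 largest edges (Definition 6) — can be sketched as follows. This is a plain Kruskal construction with union-find, not the authors' implementation; names are ours:

```python
def mst_of_frames(n, edges):
    # Kruskal's algorithm: edges is {(u, v): weight} over n frames;
    # returns the MST as a list of (u, v, weight) triples.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    mst = []
    for (u, v), w in sorted(edges.items(), key=lambda e: e[1]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((u, v, w))
    return mst

def k_clusters(n, mst, k):
    # Keep all MST edges except the k-1 largest, then collect the
    # resulting connected components (the frame clusters of Def. 6).
    kept = sorted(mst, key=lambda e: e[2])[: len(mst) - (k - 1)]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for u, v, _ in kept:
        parent[find(u)] = find(v)
    groups = {}
    for t in range(n):
        groups.setdefault(find(t), []).append(t)
    return sorted(groups.values())
```

For four frames where frames {0, 1} and {2, 3} are mutually close, cutting the single largest MST edge (k = 2) recovers exactly those two clusters.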

Afterwards, a weight map wM is computed from T∗f = (T∗, w∗T). In this sense, the method proposed in [9] can easily be inserted into our framework, since its hierarchy of partitions is computed directly over T∗f; in this case, the weight map can be calculated by applying an identity function to the edge weights of T∗f, i.e., wM(e) = w∗T(e), ∀e ∈ Eδ∗. For the method proposed in [18], the weight map represents the hierarchical observation scales computed over T∗f, i.e., wM(e) = hscale(e), ∀e ∈ Eδ∗. In that case, the construction of the weight map follows the hierarchical approach described in [23] for partitioning a graph using observation scales. Finally, an (edge-weighted) tree map TM is generated through the combination of T∗ and wM to represent a hierarchy, i.e., TM = (T∗, wM).

In a general way, the identification of frame clusters is divided into three main steps: (i) computing a weight map; (ii) filtering the tree map TM according to the size of the components in terms of number of frames; and (iii) computing graph partitions (i.e., frame clusters).

3.1.1. Computing a weight map

The computation of a weight map wM is done over T∗f, in which the edge weights are mapped to new values through a specified mapping function (e.g., the identity map). Thanks to these new values of edge weights wM, the graph partitions shall have some desired properties (discussed later). In this work, we explore two different mappings for edge weights: (i) the identity function, i.e., wM(e) = w∗T(e), ∀e ∈ Eδ∗; and (ii) the function that computes hierarchical observation scales, i.e., wM(e) = hscale(e), ∀e ∈ Eδ∗. In both cases, we compute a weight map wM from which the desired hierarchy can easily be inferred.

Revisiting the minimum spanning tree approach. In our previous work [9], we proposed a method for video summarization in which the graph partition is made directly over T∗f by removing its edges, and a filtering step is applied at the end in order to eliminate some outliers. In order to fit that method into the framework presented here, we propose a variation in which the order of these two steps is inverted. Fig. 4 shows an example of the difference between the result obtained by the method of [9] and the proposed variation. Moreover, in order to better formalize the method described in [9], we also propose the adoption of a weight map which, in this case, is calculated by applying an identity function to the edge weights of T∗f.

The observation scale approach. Thanks to the method proposed in [23], one can compute the hierarchical observation scales using the method called cp-HOScale (or simply HOScale), in which the adjacent regions that are evaluated depend on the order of the merging in the fusion tree (or simply, the order of the merging

between connected components on the minimum spanning tree of the original graph). Generally speaking, a new weight map wM is created from T∗f in which each edge weight corresponds to the scale at which the two adjacent regions connected by this edge are correctly merged, i.e., there are no two sub-regions of these regions that should be merged before these regions. For computing the new weight map, we consider the criterion for region merging proposed in [26], which measures the evidence for a boundary between two regions by comparing two quantities based on: (i) intensity differences across the boundary; and (ii) intensity differences between neighboring pixels within each region. More precisely, in order to know whether two regions must be merged, two measures are considered. The internal difference Int(X) of a region X is the highest edge weight among all the edges linking two vertices of X in T∗f. The difference Diff(X,Y) between two neighboring regions X and Y is the smallest edge weight among all the edges that link X to Y. Then,

two regions X and Y are merged when:

Diff(X,Y) ≤ min( Int(X) + λ/|X|, Int(Y) + λ/|Y| )    (3)

in which λ is a parameter used to prevent the merging of large regions (i.e., larger λ forces smaller regions to be merged).
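The merging test of Eq. (3) is direct to express in code; in a real implementation Int(·) and the region sizes would be maintained incrementally, but here they are passed in as precomputed values (a sketch, not the authors' code):

```python
def should_merge(diff_xy, int_x, size_x, int_y, size_y, lam):
    # Eq. (3): merge X and Y when Diff(X,Y) is below both
    # relaxed internal differences Int(.) + lambda/|.|.
    return diff_xy <= min(int_x + lam / size_x, int_y + lam / size_y)
```

Note how the λ/|X| term shrinks as a region grows, so large regions are merged only on genuine similarity evidence.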


Figure 4: Example of graph partitions and size filtering: (a) frame similarity graph – Gs; (b) minimum spanning tree – T∗f; (c) tree map TM generated using an identity function; (d) 3-clustering computed from TM by using the method in [9] before the removal of all clusters with fewer than 2 elements; (e) 3-clustering computed from TM by using our revisited approach with a minimum cluster size equal to 2; (f) dendrogram based on TM; and (g) dendrogram computed after size filtering with minimum cluster size equal to 2.

The merging criterion defined by Eq. (3) depends on the scale λ at which the regions X and Y are observed. More precisely, let us consider the (observation) scale SY(X) of X relative to Y as a measure based on the difference between X and Y, on the internal difference of X, and on the size of X:

SY(X) = [Diff(X,Y) − Int(X)] × |X|.    (4)

Then, the scale S(X,Y) is simply defined as:

S(X,Y) = max{SY(X), SX(Y)}.    (5)

Finally, thanks to this notion of scale, Eq. (3) can be written as:

λ ≥ S(X,Y).    (6)
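Eqs. (4)–(6) then amount to the following computation (a sketch under the assumption that the region statistics Diff, Int and |·| are precomputed; names are ours):

```python
def observation_scale(diff_xy, int_x, size_x, int_y, size_y):
    # S(X,Y) = max(S_Y(X), S_X(Y)), where
    # S_Y(X) = [Diff(X,Y) - Int(X)] * |X|   (Eqs. 4 and 5).
    return max((diff_xy - int_x) * size_x, (diff_xy - int_y) * size_y)

def merged_at_scale(lam, diff_xy, int_x, size_x, int_y, size_y):
    # Eq. (6): X and Y are merged at scale lambda when lambda >= S(X,Y),
    # which is algebraically equivalent to the criterion of Eq. (3).
    return lam >= observation_scale(diff_xy, int_x, size_x, int_y, size_y)
```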

The core of HOScale [23] is the identification of the smallest scale value hscale(e), ∀e ∈ Eδ∗, that can be used to merge the largest region with another one while guaranteeing that the internal differences of these merged regions are larger than the value calculated for hscale(e). In fact, instead of computing the hierarchy of partitions, a weight map wM is constructed using the notion of scale presented in Eq. (3), from which the desired hierarchy can easily be inferred, e.g., by removing those edges whose weight is greater than the desired scale.

3.1.2. Size filtering

An important step of our methodology is the post-processing of the tree map TM in order to guarantee that the cluster size, in terms of number of frames, is greater than or equal to a threshold. Considering that the hierarchy of partitions represented by a tree map can be visualized as a dendrogram, it is simple to identify the level of merging, since it is related to the weight of the edge linking the regions. Without loss of generality, we consider that the fusion tree is binary. The size filtering can then be easily implemented by using a post-order traversal. Thus, for each internal node, if the number of leaves (of the subtree rooted at that node) is smaller than the size threshold, the weight associated with the edge (that is related to that internal node) in the weight map is set to zero. This strategy produces a new hierarchy of partitions and guarantees that the partitions obtained after edge removal have at least the minimum permitted size. For instance, Fig. 4 shows an example of size filtering applied to TM, in which Fig. 4(f) illustrates the dendrogram computed from the TM shown in Fig. 4(c), and Fig. 4(g) illustrates the dendrogram obtained for the new hierarchy generated by size filtering with a minimum cluster size equal to 2.

3.1.3. Computing graph partitions

Considering that the tree map TM = (T∗, wM) has been computed and filtered, the hierarchical clustering method is easily obtained, since the tree map is just a specialized tree of frames. The edge removal operation, when applied to a tree, produces two connected components. It is important to note that the clusters (represented by connected components) can be obtained by deleting edges from the tree map until stability. The concept of stability is related to two possibilities: (i) the number of desired clusters; and (ii) the frame similarity inside a cluster. While the former is related to the compression factor of the video summary, the latter establishes the homogeneity of the clusters. In both cases, only one keyframe per cluster (i.e., a single center of each frame cluster) is selected.

Desired number of clusters. The simplest way to partition a graph is by removing edges. Here, in order to compute the video summary, a k-clustering divides the video sequence into k video scenes; consequently, it is necessary to eliminate k − 1 edges from the tree map TM. The edges chosen for removal are the ones with the highest values in the weight map. For instance, considering that Fig. 4(c) represents a tree map TM obtained by applying an identity function, then Fig. 4(d) and Fig. 4(e) show 3-clusterings obtained by the graph partition of the hierarchies illustrated as the dendrograms in Fig. 4(f) and Fig. 4(g), respectively.

Automatic size identification. Instead of considering a specified number of clusters, we also propose a strategy for identifying the moment when stability is reached during the process of edge removal. The edges

of TM are analyzed in non-increasing order of their values according to the weight map wM. Consider an edge e that belongs to a connected component Ci = (Vi, Eδi), i.e., e ∈ Eδi. Thus, the edge e is removed only when its weight in wM is greater than or equal to an equilibrium measure function F(e), i.e., wM(e) ≥ F(e). In this work, the equilibrium function used is

F(e) = avg(Ci) + σ × sd(Ci)   (7)

in which Ci is the connected component that contains the edge e, and avg(Ci) and sd(Ci) are the average and standard deviation of all edge weights of the component Ci, i.e.,

avg(Ci) = (1/ni) ∑e∈Eδi wM(e)   (8)

sd(Ci) = [(1/ni) ∑e∈Eδi (wM(e) − avg(Ci))²]^(1/2)   (9)

in which ni = |Eδi| and, finally, σ is a parameter related to the allowed variability, which can be seen as a relaxation parameter. During tests, we have empirically set σ to 0.1. It is important to note that it is meaningless to apply this second strategy for partitioning the graph when hierarchical observation scales were used for computing the weight map, since those weights already represent scales.
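The stopping rule of Eqs. (7)–(9) can be illustrated in a few lines; since Eqs. (8)–(9) divide by ni, the population mean and standard deviation apply. This is a simplified sketch with names of our choosing, keeping each component as a flat list of edge weights:

```python
import statistics

def should_remove(edge_weight, component_weights, sigma=0.1):
    # F(e) = avg(Ci) + sigma * sd(Ci)   (Eq. 7)
    avg = statistics.fmean(component_weights)   # Eq. 8
    sd = statistics.pstdev(component_weights)   # Eq. 9 (population sd)
    return edge_weight >= avg + sigma * sd
```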

Fig. 5 illustrates some results obtained by HSUMM. It is important to observe the hierarchical property of the generated video summary. When the number of clusters is increased by one, the centers of the frame clusters remain the same except for one or two changes, at most. One change is always expected, since the recently created cluster should always have a new center. But the existence of another change in the set of centers will depend on: (i) the criterion used for center selection; and (ii) the homogeneity of the cluster content. Fig. 5(a) presents 2 frame clusters obtained by HSUMM. When the number of frame clusters is set to 3 – see Fig. 5(b) – HSUMM maintains the first cluster without any changes and divides the second one into two new clusters with two new centers. In Fig. 5(c), HSUMM obtains a set of 4 frame clusters which has the same first two clusters as Fig. 5(b). The third cluster of Fig. 5(b) is divided into two new clusters, but only the recently created cluster has a new center, i.e., the third cluster maintains the same keyframe as center since it


Figure 5: Examples of video summarizations obtained by HSUMM for the video "News.mpg", composed of 5 (five) shots.

is still a good representative of the cluster content. Finally, in Fig. 5(d), the 5 clusters obtained by HSUMM are shown – the first, second, and fourth clusters of Fig. 5(c) remain unchanged, but the third cluster is divided into two new clusters (with two new centers).

3.2. Selecting keyframes

As described in Section 2.1, a frame cluster should represent a set of similar frames and, to saliently represent the cluster, it is necessary to identify a frame whose distance to the other frames in the same cluster is as small as possible. Usually, the center of the frame cluster has this property. This center, hereafter called the keyframe, can be computed by two different procedures: (i) unweighted center selection; or (ii) weighted center selection. In the former, the center is the frame which is the most central, in terms of number of frames (hops), with respect to the other ones. In the latter, the most central frame in terms of the (dis)similarity measure is selected. In both cases, the idea is to apply a thinning process (considering or not the edge weights) in order to identify the center of each component. It is worth noting that each frame cluster is also represented by a minimum spanning tree, which facilitates the application of the thinning operator. As mentioned before, in HSUMM1 [18], only the unweighted center selection procedure was evaluated. Here, we also explore the adoption of the weighted center selection procedure – hereafter called HSUMM2.
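The unweighted center selection can be illustrated by the classic tree-center thinning: iteratively peel the leaves of the cluster's spanning tree until at most two frames remain. A minimal sketch with hypothetical names (a weighted variant would peel according to accumulated edge weights instead of hop counts):

```python
from collections import defaultdict

def tree_center(edges):
    """edges: (u, v) pairs of a cluster's spanning tree;
    returns the one or two most central frames."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(adj)
    while len(remaining) > 2:
        # remove the current layer of leaves (degree-1 frames)
        leaves = [u for u in remaining if len(adj[u]) == 1]
        for u in leaves:
            remaining.discard(u)
            for v in adj[u]:
                adj[v].discard(u)
            adj[u].clear()
    return remaining
```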

4. A new evaluation measure

The process of evaluating video summaries is not a simple task, since there is no well-known groundtruth and the quality of a summary is highly subjective. In this way, some initiatives have been proposed, like the Comparison of User Summaries (CUS) [5], in which the video summary is built manually by a number of users from the sampled frames, in order to assess and compare automatically generated video summaries. In [7], the summaries are compared to Open Video2 summaries (called OV) and are classified into 4 classes: (i) same keyframes; (ii) fewer keyframes; (iii) more keyframes; and (iv) mismatched keyframes. However, this classification depends on a single groundtruth for the video summary, and it may not be a good measure due to the subjectivity of the evaluation task related to video summaries.

To solve these problems, it is necessary to consider that the groundtruth is composed by various users. The most common measure to evaluate a visual system is the average of precision and recall computed for each user; however, these present some distortions due to the adoption of the average between results from different users. In [5], the automatic summaries are compared to users' summaries to compute two measures: (i) accuracy rate – CUSa; and (ii) error rate – CUSe. These measures can be defined as follows:

CUSa = mA / nU   and   CUSe = m̄A / nU,

in which mA is the number of matching keyframes from the automatic summary (AS), m̄A is the number of non-matching keyframes from AS, and nU is the number of keyframes from the user summary (U). Moreover, the users' summaries are the reference summaries and can be considered as the groundtruth. According to [5], the goals of that approach are: (1) to reduce the subjectivity of the evaluation task; (2) to quantify the summary quality; and (3) to allow comparisons among different techniques to be done quickly.

Despite the fact that these measures are interesting, they do not properly evaluate the diversity presented by the users' summaries. Moreover, due to the averaging of the measures computed for each user, some distortions may be produced. Actually, CUSa does not evaluate the diversity of user opinions well. For example, let A and B be two users with the following summaries for the same video, UA = {X, Y} and UB = {M, N, O, P, Q, R, S, T, U, V}, in which each character represents a distinct video frame. Suppose now that three methods compute the following summaries: AS1 = {X, Y}, AS2 = {M, N, O, P, Q, R, S, T, U, V} and AS3 = {X, M, N, O, P, Q}. Even though these three summaries are completely different, they present the same accuracy rate – CUSa = 0.5.

2 http://www.open-video.org



Figure 6: Evaluation of video summarization for video 41: (a) summary of user 1; (b) summary of user 2; and (c) an artificial video summary with CUSa = 0.565 and COV = 0.47.

In order to cope with this kind of problem, we propose a new measure, called Covering – COV – which is based on the frames of the users' summaries to evaluate the covering of the users' summaries by an automatic summary. This measure assesses the diversity of opinions without ignoring the agreement of opinions among users. In other words, CUSa assesses the average value of the ratio, calculated for each user, between the automatic summary and the summary made by that user, while COV assesses the ratio of the automatic summary which is similar to all user summaries. Thus, Covering can be defined as follows.

Definition 8 (Covering) Let AS be the automatic summary computed by a summarization method. Let US be a set of n user summaries. The Covering – COV – of the users' summaries by an automatic summary is defined as

COV = ∑U∈US |M(AS, U)| / ∑U∈US |U|   (10)

in which M(A, B) is the maximum matching between two sets of elements A and B, and |·| is the cardinality of a set.

For example, the COV values for the three automatic summaries presented before – AS1, AS2 and AS3 – are 2/12, 10/12 and 6/12, respectively. Fig. 6 shows an artificial example of summaries for presenting both CUSa and COV. Fig. 7 shows a real example for the summary of video 27, which presents the five users' summaries, the result of HSUMM2BN and the result of VSUMM1. The fifth frame of the result of


Figure 7: Evaluation of video summarization for video 27: (a-e) summaries of users 1-5; (f) an automatic video summary obtained by HSUMM2BN with CUSe = 0.07, CUSa = 0.83 and COV = 0.77; and (g) an automatic video summary obtained by VSUMM1 with CUSe = 0.76, CUSa = 0.96 and COV = 0.93.


HSUMM2BN is not present in the results of VSUMM1 (while frames number 1, 4, 5, 7, 8, 9 and 12 from the VSUMM1 result are missing in the result of HSUMM2BN). As one can see, in this example, the values for CUSa are greater than the values of COV for both automatic summaries. Furthermore, the CUSe value of the VSUMM1 result is much higher than that of HSUMM2BN. The high values for CUSa should be expected, since it represents the average between the matching rates for each user summary; thus, a high CUSa value could be obtained by an automatic summary that agrees with a large portion of each user summary without having to agree with all of them at the same time. This could also lead to a high CUSe value (see the result of VSUMM1 in Fig. 7(g)). On the other hand, the only way for an automatic summary to obtain a large COV value is to generate a video summary that is a super-set of all users' summaries, i.e., it should cover the diversity of opinions among all users.

Finally, consider the following example to illustrate the difference between CUSa and COV. Suppose that we have X users' summaries: X − 1 summaries with length equal to 1 and with the number of matching frames also equal to 1, and just 1 summary with length equal to X − 1 and no matching frames. In that case,

CUSa = (1/X) × (1/1 + 1/1 + ··· + 1/1 + 0/(X−1)) = (X−1)/X,

while, if there are no frames in common among the summaries,

COV = (1 + 1 + ··· + 1 + 0) / (1 + 1 + ··· + 1 + (X−1)) = (X−1)/(2X−2) = 0.5.

For large values of X, the CUSa value is close to 1.0, while COV is always the same. Now, suppose that we have 1 summary with length equal to X − 1 and with the number of matching frames also equal to X − 1, and X − 1 summaries with length equal to 1 and no matching frames. Then, for large values of X, if there are no common frames among the summaries, CUSa is close to zero while the COV value is 0.5.
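The contrast between the two measures can be sketched in a few lines. This is an illustrative simplification that models summaries as sets of exactly matching frames – the paper actually matches frames with the similarity FS and threshold δ, and uses a maximum matching – so the matching here reduces to a set intersection:

```python
def cusa(auto, users):
    # CUSa: average, over users, of the fraction of each user's
    # summary that the automatic summary matches.
    return sum(len(auto & u) / len(u) for u in users) / len(users)

def cov(auto, users):
    # COV (Eq. 10): total matched keyframes over the total size
    # of all user summaries.
    return sum(len(auto & u) for u in users) / sum(len(u) for u in users)

# The example from the text: UA = {X, Y}, UB = {M, ..., V}.
UA = {"X", "Y"}
UB = set("MNOPQRSTUV")
AS1, AS2, AS3 = {"X", "Y"}, set("MNOPQRSTUV"), {"X", "M", "N", "O", "P", "Q"}
# All three summaries share CUSa = 0.5, but COV separates them:
# 2/12, 10/12 and 6/12, respectively.
```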

5. Experiments

In order to compare HSUMM with other methods, the same dataset adopted in [7], composed of 50 videos in different genres (documentary, lectures, ephemeral, historical, educational), is used. With respect to the assessment, we consider the same measures proposed by [5] in terms of Comparison of User Summaries (CUS), in which the video summary is built manually by a number of users from sampled frames. In fact, the summaries are evaluated by using CUSa and CUSe calculated from 250 user summaries. These summaries were manually created by 50 users, each one dealing with only 5 videos, i.e., each video has 5 video summaries created by 5 different users.

Unfortunately, using only these two measures, it is not possible to decide which approach is the best method for computing a video summary. In order to solve this issue, a trade-off between the well-known measures in information retrieval, precision and recall, is calculated in terms of their harmonic mean, called F-measure. Precision is the fraction of detections that are true positives rather than false positives, while recall is the fraction of true positives that are detected rather than missed. Furthermore, we also assess the methods using the new evaluation metric, called Covering, which measures the diversity of opinions among all users. In this work, HSUMM is compared to VSUMM1 [5], VSUMM2 [5], AGM1 [9], AGM2 [9], DT [6], VISTO [7] and Open Video summaries (called OV). Furthermore, we also analyze different strategies for creating the video summary by using the method AGM1 [9], in which the frame clusters are homogeneous according to some statistical properties. Note that the parameters for each method are tuned according to the original works.

First, we study the behavior of the simple minimum spanning tree approach AGM1 [9] by using a manual strategy to stop the removal of edges. Fig. 8 presents the quantitative measures computed for several adopted strategies when compared to the original method AGM1. For the matching between an automatic summary and a user-made one, the frame similarity FS(ft1, ft2, δ) is adopted, using L2 as the distance (dissimilarity) between two frames and varying the δ value according to the set {0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9}. In AGM1, the size filtering is made after the edge removal. Thus, we study both the real influence of applying the size filtering before the edge removal and the automatic size identification. As can be seen in Fig. 8, the simple test made with the HSUMM1 method using a weight map based on an identity function and automatic size identification — called HSUMM1-I-A-S10 — outperforms AGM1 in terms of F-measure. Several variations were tested and, in order to facilitate their understanding, some symbols are used to define them: the symbols I, A, M, and Sn represent, respectively, the adoption of a weight map based on an identity function, automatic size identification, manual size identification, and subsampling of one frame for each n frames.

Afterwards, we analyze the adoption of a weight map based on hierarchical observation scales. As described in Section 3, HSUMM depends on: (i) a visual feature; (ii) a dissimilarity measure; (iii) a minimum size for each cluster; and (iv) a number of clusters. Concerning the visual feature, the behavior of HSUMM1 with two global features (a weight map based on observation scales and an unweighted center selection procedure) is analyzed: (i) using GIST [19] – hereafter called HSUMM1G; and (ii) using BossaNova [20] with SIFT – hereafter called HSUMM1BN. We also present the results for the behavior of HSUMM2 with two global features (a weight map based on observation scales and a weighted center selection procedure) – hereafter called HSUMM2G and HSUMM2BN, accordingly. Note that a mid-level representation, like BossaNova, can be considered as a global descriptor. The L2 norm is used for computing the distance between two frames, and the minimum size of each cluster is set to 10 frames. As before, for the matching between an

[Figure 8 plot: Covering (y-axis) versus CUSe (x-axis); legend with best F-measures: AGM1 (F=0.808), AGM1-W (F=0.812), HSUMM2-I-A-S10 (F=0.834), HSUMM2-I-A-S15 (F=0.832), HSUMM1-I-A-S10 (F=0.833), HSUMM1-I-A-S15 (F=0.835), HSUMM2-I-M-S10 (F=0.838), HSUMM1-I-M-S10 (F=0.838).]

Figure 8: A comparison between several strategies used for computing frame cluster homogeneity when compared to the original method AGM1. The comparison is made in terms of CUSa, CUSe and COV. For each strategy, the best F-measure is shown. The ideal method must have CUSa (or COV) equal to 1, and CUSe equal to zero. Several variations regarding AGM1 [9] are shown. The first one is AGM1-W, in which the keyframe is based on a weighted center selection, different from the original one. The other variations are based on our framework, in which the weight map is the identity function, with two different possibilities for center selection: the unweighted center selection procedure – HSUMM1; and the weighted center selection procedure – HSUMM2. The meanings of the symbols I, A, M, and Sn – used for defining each variation – are: identity function as weight map, automatic size identification, manual size identification, and subsampling of one frame for each n frames, respectively.

automatic summary and a user-made one, the frame similarity FS(ft1, ft2, δ) is adopted, using L2 as the distance (dissimilarity) between two frames and varying the δ value according to the set {0.2, 0.3, 0.4, 0.5, 0.6, 0.8, 0.9}. Finally, for computing video summaries using any of those variations of HSUMM, its hierarchical properties are taken into account in order to easily compute summaries with different sizes for the same video. Thus, for each video, video summaries with different sizes are assessed in order to maximize the F-measure. This kind of strategy is well known and adopted in hierarchical image segmentation for identifying the best set of parameters for each image to be segmented and, to the best of our knowledge, this is the first time a video summarization method is evaluated using this strategy.

Fig. 9 presents the quantitative measures computed for several methods. The ideal method must have CUSa (or COV) equal to 1, and CUSe equal to zero. As one can see, all variations of HSUMM outperform the compared methods regardless of the visual feature used (see the F-measure values). Moreover, CUSa and COV for HSUMM1G are a little better than for HSUMM1BN; and CUSa for VSUMM1 is better than for all variations of HSUMM, although its CUSe is much higher. Furthermore, it is worth mentioning that the COV value for all methods is smaller than the CUSa value for the same method. This behavior could be related to the elimination of the averaging step when computing COV.

In Fig. 10, Fig. 11 and Fig. 12, static video summaries computed by several methods are illustrated. For comparing those results to the user summaries, the frame similarity FS(ft1, ft2, δ) is adopted with L2 as the distance (dissimilarity) between two frames and δ set to 0.5. For each method, the F-measure value is shown. Considering these results, HSUMM1G outperforms the compared methods. One should notice that, in some cases, there are redundant frames, which contribute to higher values of CUSe and, consequently, lower F-measure values (see, for example, Fig. 10(a) and 10(d)).
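The F-measure used to rank the summaries above is the harmonic mean of precision and recall; a toy sketch, assuming the match counts between an automatic and a user summary are already available:

```python
def f_measure(n_matched, n_auto, n_user):
    # precision: fraction of automatic keyframes that match;
    # recall: fraction of user keyframes that are recovered.
    precision = n_matched / n_auto
    recall = n_matched / n_user
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```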


6. Conclusion and Further Works

Video summarization is a simplification of video content for compacting the video information, in which a small number of frames must be selected to saliently represent its content. To address video summarization, we proposed the use of a graph-based hierarchical clustering method, called HSUMM, which allows the application of a similarity measure between clusters during graph partition, instead of considering only the similarity between isolated frames. Moreover, we provide a unified framework for video summarization based on MST and weight maps, in which HSUMM can be seen as an instance that uses a minimum spanning tree of frames and a weight map based on hierarchical observation scales. The minimum spanning tree of frames is generated from a frame similarity graph that represents the video content and the relationship between video scenes; and the hierarchical method for partitioning the frame similarity graph using observation scales produces a hierarchy of partitions in which the frame clusters can easily be inferred. Another important contribution of this work is the definition of a new evaluation measure, called Covering, that assesses the diversity of opinions among users when they produce a summary for the same video.

[Figure 9 plot: Covering (y-axis) versus CUSe (x-axis); legend with best F-measures: VSUMM1 (F=0.889), AGM1 (F=0.808), VSUMM2 (F=0.853), AGM2 (F=0.827), HSUMM1BN (F=0.906), HSUMM2BN (F=0.906), HSUMM1G (F=0.906), HSUMM2G (F=0.906), OV (F=0.797), DT (F=0.800), VISTO (F=0.831).]

Figure 9: A comparison between several video summarization methods in terms of CUSa, CUSe and COV. For each method, the best F-measure is shown. The ideal method must have CUSa (or COV) equal to 1, and CUSe equal to zero.

During tests, manual and automatic strategies for the identification of the summary size were explored, along with two different mappings for edge weights and two distinct procedures for the selection of keyframes – considering or not the edge weights, i.e., the (dis)similarity measure between frames that belong to the same frame cluster. Furthermore, we used a new strategy borrowed from hierarchical image segmentation to evaluate the video summaries generated by HSUMM.

Experimental results provide quantitative and qualitative comparisons between the new approach and other popular algorithms from the literature, showing that the new algorithm is robust. Concerning quality measures, HSUMM outperforms the compared methods in terms of F-measure, regardless of the visual feature used. In order to improve and better understand our results, further works involve multimodal analysis, the inclusion of new features, and automatic scale identification. Additionally, we will also investigate the use of n-grams, which may help us avoid errors in the identification of keyframes.

Acknowledgments

The authors are grateful to PUC Minas (Pontifícia Universidade Católica de Minas Gerais), MIC BH (Microsoft Innovation Center Belo Horizonte), CNPq, Capes and FAPEMIG for the partial financial support of this work. The authors also thank the anonymous reviewers for their valuable comments and suggestions.

References

[1] Y.-P. Tan, H. Lu, Video scene clustering by graph partitioning, Electronics Letters 39 (11) (2003) 841–842. doi:10.1049/el:20030536.
[2] M. Yeung, B.-L. Yeo, Video visualization for compact presentation and fast browsing of pictorial content, IEEE Transactions on Circuits and Systems for Video Technology 7 (5) (1997) 771–785. doi:10.1109/76.633496.
[3] S.-H. Zhu, Y.-C. Liu, Automatic video partition for high-level search, International Journal of Computer Sciences and Engineering Systems 2 (3) (2008) 163–172.
[4] Y. Rui, T. Huang, S. Mehrotra, Browsing and retrieving video content in a unified framework, in: IEEE Second Workshop on Multimedia Signal Processing, 1998, pp. 9–14. doi:10.1109/MMSP.1998.738905.
[5] S. E. F. de Avila, A. P. B. Lopes, A. da Luz Jr., A. de Albuquerque Araújo, VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method, Pattern Recognition Letters 32 (1) (2011) 56–68. doi:10.1016/j.patrec.2010.08.004.
[6] P. Mundur, Y. Rao, Y. Yesha, Keyframe-based video summarization using Delaunay clustering, International Journal on Digital Libraries 6 (2) (2006) 219–232. doi:10.1007/s00799-005-0129-9.
[7] M. Furini, F. Geraci, M. Montangero, M. Pellegrini, VISTO: visual storyboard for web video browsing, in: Proceedings of the 6th ACM International Conference on Image and Video Retrieval – CIVR 2007, ACM, 2007, pp. 635–642. doi:10.1145/1282280.1282370.
[8] D. Besiris, F. Fotopoulou, G. Economou, S. Fotopoulos, Video summarization by a graph-theoretic FCM based algorithm, in: 15th International Conference on Systems, Signals and Image Processing – IWSSIP 2008, 2008, pp. 511–514. doi:10.1109/IWSSIP.2008.4604478.
[9] S. J. F. Guimarães, W. Gomes, A static video summarization method based on hierarchical clustering, in: 15th Iberoamerican Congress on Pattern Recognition – CIARP 2010, 2010, pp. 46–54. doi:10.1007/978-3-642-16687-7_11.
[10] S.-A. Berrani, H. Boukadida, P. Gros, Constraint satisfaction programming for video summarization, in: 2013 IEEE International Symposium on Multimedia – ISM 2013, 2013, pp. 195–202. doi:10.1109/ISM.2013.38.
[11] A. Khosla, R. Hamid, C.-J. Lin, N. Sundaresan, Large-scale video summarization using web-image priors, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition – CVPR 2013, Portland, USA, 2013. doi:10.1109/CVPR.2013.348.
[12] C. Ngo, Y. Ma, H. Zhang, Automatic video summarization by graph modeling, in: 9th IEEE International Conference on Computer Vision – ICCV 2003, 2003, pp. 104–109. doi:10.1109/ICCV.2003.1238320.
[13] B. Zhao, E. P. Xing, Quasi real-time summarization for consumer videos, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition – CVPR 2014, 2014, pp. 2513–2520. doi:10.1109/CVPR.2014.322.
[14] A. Girgensohn, F. Shipman, L. Wilcox, Adaptive clustering and interactive visualizations to support the selection of video clips, in: Proceedings of the 1st ACM International Conference on Multimedia Retrieval – ICMR '11, ACM, New York, NY, USA, 2011, pp. 34:1–34:8. doi:10.1145/1991996.1992030.
[15] S. Lu, Z. Wang, T. Mei, G. Guan, D. Feng, A bag-of-importance model with locality-constrained coding based feature learning for video summarization, IEEE Transactions on Multimedia 16 (6) (2014) 1497–1509. doi:10.1109/TMM.2014.2319778.
[16] W. Liu, T. Mei, Y. Zhang, C. Che, J. Luo, Multi-task deep visual-semantic embedding for video thumbnail selection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition – CVPR 2015, 2015, pp. 3707–3715.
[17] A. K. Jain, M. N. Murty, P. J. Flynn, Data clustering: A review, ACM Computing Surveys 31 (3) (1999) 264–323. doi:10.1145/331499.331504.
[18] L. Belo, C. Caetano, Z. Patrocínio, S. Guimarães, Graph-based hierarchical video summarization using global descriptors, in: 26th IEEE International Conference on Tools with Artificial Intelligence – ICTAI 2014, 2014, pp. 822–829. doi:10.1109/ICTAI.2014.127.
[19] A. Oliva, A. Torralba, Modeling the shape of the scene: A holistic representation of the spatial envelope, International Journal of Computer Vision 42 (3) (2001) 145–175. doi:10.1023/A:1011139631724.
[20] S. Avila, N. Thome, M. Cord, E. Valle, A. de A. Araújo, Pooling in image representation: the visual codeword point of view, Computer Vision and Image Understanding 117 (5) (2013) 453–465. doi:10.1016/j.cviu.2012.09.007.
[21] C. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Transactions on Computers C-20 (1) (1971) 68–86. doi:10.1109/T-C.1971.223083.
[22] J. Kleinberg, É. Tardos, Algorithm Design, Addison-Wesley, 2006.
[23] K. Jacques, A. de Albuquerque Araújo, Z. K. Patrocínio Jr., J. Cousty, Y. Kenmochi, L. Najman, S. J. F. Guimarães, Hierarchical video segmentation using an observation scale, in: 26th Conference on Graphics, Patterns and Images – SIBGRAPI 2013, Arequipa, Peru, 2013, pp. 320–327. doi:10.1109/SIBGRAPI.2013.51.
[24] M. Douze, H. Jégou, H. Sandhawalia, L. Amsaleg, C. Schmid, Evaluation of GIST descriptors for web-scale image search, in: Proceedings of the ACM International Conference on Image and Video Retrieval – CIVR '09, ACM, New York, NY, USA, 2009, pp. 19:1–19:8. doi:10.1145/1646396.1646421.
[25] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110. doi:10.1023/B:VISI.0000029664.99615.94.
[26] P. F. Felzenszwalb, D. P. Huttenlocher, Efficient graph-based image segmentation, International Journal of Computer Vision 59 (2) (2004) 167–181. doi:10.1023/B:VISI.0000022288.19776.77.

Figure 10: Examples of static video summaries for video 37 using several methods; for each result, the F-measure is shown between parentheses: (a) AGM1 [9] (F=0.91); (b) AGM2 [9] (F=1.00); (c) OV (F=0.89); (d) VISTO [7] (F=0.63); (e) DT [6] (F=0.89); (f) VSUMM1 [5] (F=1.00); (g) VSUMM2 [5] (F=0.75); (h) HSUMM1G (F=1.00); (i) HSUMM2G (F=0.80); (j) HSUMM1BN (F=1.00); (k) HSUMM2BN (F=0.80).


Figure 11: Examples of static video summaries for video 42 using several methods; for each result, the F-measure is shown between parentheses: (a) AGM1 [9] (F=0.78); (b) AGM2 [9] (F=0.86); (c) OV (F=0.74); (d) VISTO [7] (F=0.82); (e) DT [6] (F=0.76); (f) VSUMM1 [5] (F=0.82); (g) VSUMM2 [5] (F=0.52); (h) HSUMM1G (F=0.93); (i) HSUMM2G (F=0.86); (j) HSUMM1BN (F=0.86); (k) HSUMM2BN (F=0.86).


Figure 12: Examples of static video summaries for video 53 using several methods; for each result, the F-measure is shown between parentheses: (a) AGM1 [9] (F=0.87); (b) AGM2 [9] (F=0.88); (c) OV (F=0.79); (d) VISTO [7] (F=0.83); (e) DT [6] (F=0.83); (f) VSUMM1 [5] (F=0.74); (g) VSUMM2 [5] (F=0.83); (h) HSUMM1G (F=0.88); (i) HSUMM2G (F=0.88); (j) HSUMM1BN (F=0.86); (k) HSUMM2BN (F=0.88).
