
Feature Extraction Algorithm of Audio and Video based on Clustering in Sports Video Analysis

Jinming Xing 1,a and Xiaofeng Li 2,b,*

1 School of Physical Education, Northeast Normal University, Changchun 130024, China
2 School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
a E-mail: [email protected], b E-mail: [email protected]
* Corresponding author: Xiaofeng Li

Abstract: In order to extract audio and visual mid-level features from sports videos, this paper uses supervised audio classification and unsupervised scene clustering to satisfy versatility requirements. Sound in sports video is relatively stable across broadcasts, so supervised audio classification generalizes well within a competition; when the method is extended to other events, such as diving or baseball, it still requires only a small number of annotations. For video, scene appearance differs heavily between broadcasts, so unsupervised scene clustering is used to achieve versatility. This paper proposes a new and effective scene clustering algorithm that automatically determines its stopping point without prior knowledge. The experimental results show that although the number of clusters differs from video to video, the final scene classification results are consistent with people's expectations.

Key words: Sports Video Analysis; Audio and Video Classification; Scene Clustering Algorithm

1. Introduction

Sound and vision are two important media in sports videos. Low-level audio-visual features are usually used for high-level semantic analysis [1-2], but low-level features are neither intuitive nor consistent with human perception. One feasible approach is to extract mid-level features. Different from low-level features, mid-level features support the analysis of high-level events at the level of semantic concepts, acting as a bridge between low-level features and high-level semantics.

The extraction of mid-level features in [3-5] is not general-purpose because supervised methods are adopted. Even within the same competition event, information such as field color and player appearance varies, which makes preparing training data troublesome. In the baseball batter detection algorithm of [6], 240 pictures with large variations of field, stand and audience were employed as training data. Moreover, different competition items require different mid-level features, such as the goalmouth scene in football, the batter scene in baseball and the bent-over hitting scene in golf [7-10]. The shot classification algorithm of [11] pre-defines the shot types and other field knowledge for specific competition events. All of this means that extracting visual mid-level features requires considerable extra manual work and is not universal [12-13]. Our strategy is therefore to apply an unsupervised scene clustering algorithm that clusters the visual content of a video automatically. In contrast, the extraction of audio mid-level features is relatively general, because within the same event, and even across different events, the sounds change little: the sound of the bat hitting the ball is similar in any baseball game, and cheering is similar in all competitions; hence a supervised method can be used [14].

This paper describes in turn the extraction of audio and visual mid-level features with supervised audio classification and unsupervised scene clustering [15-18]. Based on an understanding of the audio and visual characteristics of racket sports, supervised audio classification and unsupervised scene clustering are used to extract mid-level audio-visual features. Mid-level features are characterized by low complexity and intuitiveness. Unsupervised scene clustering is in addition versatile: it clusters purely according to the visual similarity of the video itself, and no additional models need to be updated for different matches. Supervised audio classification also has good versatility owing to the stability of the sounds themselves; when it is extended to other events, for example diving or baseball, only a small number of annotations are required.

2. Related work

At present, a large number of research results have appeared in the field of sports video analysis. From a macro perspective, the main goal of sports video semantic analysis is to narrow the "semantic gap" and the "application gap" [19-22]. The "semantic gap" is the serious discrepancy between low-level media features and high-level semantics. The "application gap" is that current means of analysis and understanding are far from meeting the growing needs of users [31-34]. For the "semantic gap", the basic idea of sports video analysis is to establish the relationship between features and semantics through domain knowledge models [23-25]. Since a video sequence is composed of multimodal information streams, such as audio and video streams, the extraction of low-level and mid-level features has tried to cover as many features as possible on all of these streams. Among the multimodal fusion methods studied so far, Ioannis Mademlis et al. classified the methods according to their attributes and combined these classifications to explain the advantages and disadvantages of the various approaches [26]. According to three salient attributes of multimodal fusion, the methods are classified by feature category, classification method and processing relationship, where the feature category refers to low-level or mid-level features. Hamid Izadinia et al. fused audio, text and motion features to extract highlights, and Nitta et al. used image and text features to automatically generate semantic annotations for sports videos [27]; these are all low-level features. The problem is that low-level features can only provide limited clues to events, so researchers have also extracted mid-level features such as lines, players and audio keywords. [28] extracts some common audio and video mid-level features and then applies specific rules for tennis, basketball, football, baseball and golf to extract related events. In sports video analysis, only sporadic work adds perceptual attributes to events. K. Anoop et al. selected motion characterization to describe the rhythm and intensity of movement [29]; audio characterization described the exciting cheers and applause, and a linear model was then used to generate an excitement curve. Nikolaos Sarafianos et al. introduced a psychology model that enumerates commonly used perceptual features of audio, video and editing techniques, which are then combined linearly to understand the perceived content of the video [30]. This paper tries to start from a non-linear model and to combine it with effective evaluation criteria to further improve this line of work.

3. Supervised audio classification method

Audio includes voice, music and diverse background noise. In sports videos, audio generally consists of narration, audience sounds and competition-related noises. Some sounds hint at the occurrence of specific events and are closely connected with the players, commentators, referees and spectators; for example, cheering hints at exciting events, and the batting sound appears when the ball is hit. Sound carries rich semantics, so it plays an important role in sports video analysis. For racket sports we classify five types of sound: batting, cheering, silence, wonderful commentary and common commentary, of which batting is game-specific and the others are common sounds. In fact, the common sounds apply to broadcast videos of all other sports as well, because they come from the audience and the commentators. Generally speaking, in broadcast sports video the commentators introduce the players, report the score of a player or team, or analyze the state of the game; similarly, the audience cheers, applauds and shouts at the exciting moments of the event. Observing racket sports, we found that the audience expresses its excitement with louder cheers at scoring, catch-up, game point or match point, and at those moments the commentators cannot help making excited sounds as well. To discriminate moments with different emotions, we therefore define the above four common sounds. The reason for defining the batting sound is even more direct: during a rally the commentators and audience rarely make sounds, and batting occurs only in the course of the game. These sounds play an important role in the subsequent audio-video integration and highlight ranking. This section describes the classification of the five sounds: first, nine general types of low-level audio features, 55 dimensions in total, are extracted; seventeen effective dimensions are then chosen with a forward search algorithm evaluated by cross-validated multi-class support vector machines; a multi-class support vector machine is trained to classify the audio keywords; finally, experimental results are given.

3.1. Feature extraction and selection

For audio classification the most basic concept is the short-time audio frame, the counterpart of the image frame in video and the fundamental feature extraction unit: the time-domain, frequency-domain and time-frequency features of an audio signal are all obtained from short-time frames. There is also another kind of feature, the audio clip feature; compared with the frame feature, only the length of the extraction unit differs. The audio clip feature reflects the fact that any audio semantic lasts a certain amount of time; cheering and commentary, for example, both last for seconds. To account for the short-time stationarity of audio and the non-stationary nature of the signal as a whole, frame features are usually extracted first, and their statistics over the clip, such as mean value and variance, are then used as clip features. Both frame features and clip features are regarded as low-level features in audio classification. Here we use clip features, but since they are merely statistics of frame features, we first introduce the following audio frame features.

1. Zero-crossing rate: the number of times, within one short-time frame, that the discrete sample values change from positive to negative or from negative to positive; it approximately reflects the average frequency of the signal in the frame. For the m-th frame of the audio signal stream x, it is given by Equation (1):

Z_m = \frac{1}{2} \sum_{n} \left| \mathrm{sign}[x(n)] - \mathrm{sign}[x(n-1)] \right| w(m-n)    (1)

2. Short-time average energy: the average energy of the discrete samples within one short-time audio frame, as shown in Equation (2):

E_m = \frac{1}{N} \sum_{n} \left[ x(n) w(m-n) \right]^2    (2)

3. Tone: the fundamental of the audio signal; it is an important parameter for the analysis and synthesis of speech and music. Strictly speaking, only voiced speech and harmonic music have a definite tone, but we still use tone as a low-level feature to describe the fundamental of any audio signal. The human tone frequency range is 50–450 Hz, while that of music is much wider. Estimating the tone of an audio signal is not easy, and no fully robust and reliable method exists; depending on the precision and complexity constraints, either time-domain or frequency-domain methods are applied. Roughly speaking, the time-domain method obtains the tone from the short-time autocorrelation function, while the frequency-domain method determines it from the periodic structure of the magnitude spectrum after the Fourier transform of the short-time frame.

4. Luminance: the luminance is the centroid of the spectrum, as shown in Equation (3):

\omega_c = \int_0^{\omega_0} \omega \, |F(\omega)|^2 \, d\omega \Big/ \int_0^{\omega_0} |F(\omega)|^2 \, d\omega    (3)

5. Bandwidth: the bandwidth is the square root of the energy-weighted average of the squared difference between the frequency and the frequency centroid, as shown in Equation (4):

B = \sqrt{ \int_0^{\omega_0} (\omega - \omega_c)^2 \, |F(\omega)|^2 \, d\omega \Big/ \int_0^{\omega_0} |F(\omega)|^2 \, d\omega }    (4)

6. Frequency change: the average of the spectral differences between adjacent audio frames within one second; this feature is itself an example of an audio clip feature. It is shown in Equation (5):

SF = \frac{1}{(N-1)(K-1)} \sum_{n=1}^{N-1} \sum_{k=1}^{K-1} \left[ \log(A(n,k) + \delta) - \log(A(n-1,k) + \delta) \right]^2    (5)

where A(n,k) is the spectral magnitude of the k-th frequency bin in the n-th frame and δ is a small positive constant.

7. Sub-band energy: the energy within a certain frequency range. We divide the frequency range into four intervals [0, ω_0/8], [ω_0/8, ω_0/4], [ω_0/4, ω_0/2], [ω_0/2, ω_0]; the energy of sub-band j is given by Equation (6):

P_j = \log \left( \int_{L_j}^{H_j} | F(\omega) |^2 \, d\omega \right)    (6)

where L_j and H_j are the lower and upper bounds of sub-band j.
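To make the frame-level definitions concrete, the following is a minimal NumPy sketch (our own illustration, not the authors' code) that computes the zero-crossing rate of Equation (1) and the short-time energy of Equation (2) and then aggregates them into clip-level statistics; the frame length, hop size and rectangular window are illustrative assumptions.

import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    # Split a 1-D signal into overlapping short-time frames (rectangular window assumed).
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop: i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_rate(frames):
    # Eq. (1): half the summed absolute sign differences, i.e. the number of sign changes per frame.
    return 0.5 * np.abs(np.diff(np.sign(frames), axis=1)).sum(axis=1)

def short_time_energy(frames):
    # Eq. (2): mean squared amplitude of each frame.
    return (frames ** 2).mean(axis=1)

def clip_statistics(frame_feature):
    # Clip-level statistics of a frame-level feature (here only mean and variance).
    return np.array([frame_feature.mean(), frame_feature.var()])

# Usage on one audio clip (placeholder signal standing in for 1 s of 16 kHz audio):
rng = np.random.default_rng(0)
clip = rng.standard_normal(16000)
frames = frame_signal(clip)
clip_feature = np.concatenate([clip_statistics(zero_crossing_rate(frames)),
                               clip_statistics(short_time_energy(frames))])
print(clip_feature.shape)  # (4,) -- mean/variance of ZCR and of energy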

After extracting as many audio frame features as possible, we compute their mean value, variance, low ratio, high ratio and second-order variance as features of the audio clip. Although all of these features can describe the audio signal, some have better descriptive ability than others, so feature selection is needed. On one hand it reduces the time for feature extraction and classification; on the other hand, when training samples are limited, a small feature space improves the generalization ability of the classifier. Common approaches to dimensionality reduction include Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA). Here we use a forward search algorithm instead: since the extracted features already carry mid-level meaning, no feature transform is needed. The features selected by forward search are certainly sub-optimal and some correlations remain, but the method keeps the classification simple. During forward search, candidate features are evaluated by the cross-validated classification accuracy of a multi-class SVM. The algorithm is described as follows; a code sketch is given at the end of this subsection.

Algorithm 1. Feature extraction and selection
Input: the full feature set F, partitioned into the selected set Fs and the unselected set Fu.
Output: the selected feature subset Fs.
1. Initialize Fs = ∅ and Fu = F.
2. Mark every feature in Fu as untested.
3. Select an untested feature f from Fu and mark it as tested.
4. Put f into Fs to form a temporary feature set Fs'.
5. Evaluate the classification performance of Fs'.
6. If untested features remain in Fu, go to step 3.
7. Among all tested features, find the feature f* that gives the highest classification accuracy and move f* from Fu to Fs.
8. If Fu is not empty, go to step 2; otherwise feature selection ends and Fs is returned.

Figure 1 gives the classification accuracy curve during feature selection. With an increasing number of features the classification accuracy rises, but above 17 dimensions it begins to drop; we therefore use 17-dimensional features for classification.

Figure 1. The process of feature selection of the forward search algorithm
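As an illustration of Algorithm 1, the following Python sketch (an assumption of ours, not the authors' implementation) performs forward feature selection evaluated by cross-validated multi-class SVM accuracy using scikit-learn; the feature matrix X (n_clips x 55) and labels y are placeholders.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_select(X, y, n_features=17, cv=5):
    # Greedy forward search: in each round, add the single feature whose addition
    # gives the highest cross-validated SVM accuracy (cf. steps 2-8 of Algorithm 1).
    selected, remaining, history = [], list(range(X.shape[1])), []
    while remaining and len(selected) < n_features:
        scores = []
        for f in remaining:                              # steps 3-6: test every unselected feature
            acc = cross_val_score(SVC(kernel="rbf"), X[:, selected + [f]], y, cv=cv).mean()
            scores.append((acc, f))
        best_acc, best_f = max(scores)                   # step 7: keep the best one
        selected.append(best_f)
        remaining.remove(best_f)
        history.append(best_acc)
    return selected, history

# Usage sketch: selected, history = forward_select(X, y)
# 'history' traces the accuracy curve of Figure 1; its peak motivates keeping 17 dimensions.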

3.2. Training and classification

We choose the support vector machine (SVM) as the audio classifier. Unlike classifiers such as artificial neural networks or decision trees, an SVM is easy to train and needs fewer training samples. More importantly, an SVM can obtain a classification model with good generalization even from a rough training set, because it minimizes the structural risk. The selected features are described in the last column of Table 1.

Table 1. Extracted and selected audio clip features. Each audio frame feature (zero-crossing rate, short-time average energy, tone, luminance, bandwidth, frequency change and the four sub-band energies) contributes the clip-level statistics mean value, variance, low ratio, high ratio and second-order variance, numbered consecutively as feature dimensions; the last column lists the 17 dimensions retained by forward search (for example, dimensions 1, 2 and 4 of the zero-crossing rate statistics and dimensions 6, 8 and 9 of the short-time average energy statistics).

SVM is a machine learning method proposed by Vapnik that has achieved good results on many classical classification problems. Traditional statistical classification methods train the model by empirical risk minimization, without a mathematical guarantee of generalization; SVM instead uses the structural risk minimization principle to search for the optimal hyperplane separating two classes. Most traditional statistical pattern recognition defines the empirical risk according to the law of large numbers, as shown in Equation (7):

R_{emp}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i, \alpha))    (7)

Here y_i is the class label of sample x_i, α is the parameter of the decision function f, and L is the loss. Since R_emp(α) is determined by the training samples (empirical data), it is called the empirical risk. In early statistical pattern recognition one always tried to make R_emp(α) small, but it was later found that a small training error does not guarantee good prediction, so the concept of structural risk minimization was introduced. Consider a set of indicator functions f(x, α) and a set Z containing n training samples. If the function set can realize N_Z different classifications of this sample set, the random entropy of the function set on this sample set is defined as shown in Equation (8):

H(Z) = \ln N_Z    (8)

The random entropy directly reflects the classification ability of the function set. Taking the expectation of the random entropy over all possible sample sets of size n gives the entropy of the indicator function set, the so-called VC entropy, which is a measure of the classification ability of the function set. On this basis, Vapnik also gives an intuitive definition of the VC dimension h: the VC dimension of a function set is the size of the largest sample set that can be shattered by functions of the set. For two-class classification, Vapnik proves that the empirical risk and the actual risk satisfy, with probability at least 1 − η, the bound shown in Equation (9):

R(\alpha) \le R_{emp}(\alpha) + \sqrt{ \frac{ h\left( \ln(2n/h) + 1 \right) - \ln(\eta/4) }{ n } }    (9)

In other words, under empirical risk minimization the actual risk is bounded by the sum of two parts, which can be written as shown in Equation (10):

R(\alpha) \le R_{emp}(\alpha) + \Phi\left( \frac{n}{h} \right)    (10)

where the second term Φ(n/h), the confidence interval, depends on the ratio of the number of training samples to the VC dimension.

Accordingly, in actual classification, the classifier that minimizes the sum of the empirical risk and the confidence interval is chosen, so that the structural risk of the classification is minimized.

Based on this theoretical analysis, researchers developed a classification method that minimizes the structural risk, namely the support vector machine (SVM). The SVM method starts from the optimal classification surface in the linearly separable case, as shown in Figure 2.

Figure 2. Optimal classification surface in the linearly separable case
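For reference, the optimal classification surface in the linearly separable case is the solution of the standard maximum-margin problem (a textbook formulation, not reproduced from this paper):

\min_{w, b} \ \frac{1}{2} \| w \|^2 \quad \text{s.t.} \quad y_i (w^{\top} x_i + b) \ge 1, \ i = 1, \dots, n

Its solution maximizes the margin 2/||w|| between the two classes, and the training samples with y_i (w^T x_i + b) = 1 are the support vectors.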

3.3. The experimental results

For SVM computation there is plenty of software, such as LIBSVM, mySVM and SVMlight. LIBSVM is a simple, fast and efficient package for SVM pattern recognition and regression designed by Professor Chih-Jen Lin at National Taiwan University; it provides not only compiled executables for Windows but also the source code, which makes it easy to improve, modify and port to other operating systems. The software solves classification (C-SVC, Nu-SVC), regression (Epsilon-SVR, Nu-SVR) and distribution estimation (one-class SVM) problems, and supports multi-class pattern recognition with the one-against-one algorithm. We therefore choose LIBSVM. The experimental steps are as follows (a code sketch of these steps is given below):
1. Convert the data into the format required by the SVM software.
2. Scale (normalize) the data.
3. Select the RBF kernel function.
4. Use cross-validation to find the best parameters C and γ.
5. Train on the entire training set with the chosen C and γ.
6. Test.
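The sketch below illustrates experimental steps 1–6 with scikit-learn's SVC, which is built on LIBSVM; the parameter grid, split ratio and variable names are placeholder assumptions of ours rather than the authors' settings.

from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_audio_classifier(X, y):
    # Steps 2-5: scale the features, use the RBF kernel, grid-search C and gamma
    # with cross-validation, then refit on the whole training set.
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {"svc__C": [1, 10, 100, 1000], "svc__gamma": [1e-3, 1e-2, 1e-1, 1.0]}
    search = GridSearchCV(pipeline, grid, cv=5)
    search.fit(X, y)                       # refit=True retrains the best model on all data
    return search.best_estimator_, search.best_params_

# Usage sketch, with X the (n_clips, 17) selected-feature matrix and y the five sound labels:
# X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)
# model, params = train_audio_classifier(X_tr, y_tr)
# print(params, model.score(X_te, y_te))   # step 6: test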

3.4. Discussion and experiments

In our work we use LIBSVM. The experiments are conducted on randomly chosen broadcast video segments, eight of tennis (Table 2) and eight of table tennis from the 2012 Olympic Games (Table 3). Since the experimental results are similar and space is limited, we do not list every video segment; a few segments are selected to convey the overall picture. Each video is identified by the letters in Tables 2 and 3, and results are reported for a few selected segments unless otherwise noted.

Table 2. Tennis data for the experiment

Video        a      b      c       d       e       f       g      h
Total time   8:50   7:54   13:56   10:03   24:27   23:23   8:30   25:45

Table 3. Table tennis data for the experiment

Video        A      B      C       D       E       F       G       H
Total time   3:22   9:03   7:12    8:10    25:30   34:20   10:09   33:23

The tennis data a and b and the table tennis data A and B are used for feature selection and for training the SVM classification models; the training data are summarized in Table 4. The test results, illustrated here by the tennis videos c and d and the table tennis videos C and D, are listed in Table 5 and Table 6. If the test set contains M positive examples and N negative examples of a sound class, and m positive and n negative examples of them are classified correctly, recall and precision are defined by Equations (11) and (12):

Recall = m / M    (11)

Precision = m / (m + N - n)    (12)

Table 4. Training data for audio classification

Video   Total training data (s)   Batting sound   Cheers sound   Mute   Wonderful commentary   Common commentary
a+b     590                       128             40             270    25                     125
A+B     885                       156             170            77     110                    370

Table 5. Audio classification results on the tennis test videos (columns 3-7 are the test results in seconds)

Sound category         Total test data (s)   Batting   Cheers   Mute   Wonderful   Common   Recall (%)   Precision (%)
Batting sound          304                   224       1        38     0           41       73.6         73.7
Cheers sound           120                   15        87       1      11          6        73.7         82.1
Mute                   260                   14        0        239    0           7        93.4         70.3
Wonderful commentary   122                   14        15       1      60          32       50.0         80.0
Common commentary      556                   54        3        61     4           432      78.0         84.0

Table 6. Audio classification results on the table tennis test videos (columns 3-7 are the test results in seconds)

Sound category         Total test data (s)   Batting   Cheers   Mute   Wonderful   Common   Recall (%)   Precision (%)
Batting sound          62                    50        5        7      0           0        83.3         65.8
Cheers sound           190                   18        147      10     11          4        77.0         87.0
Mute                   270                   7         6        216    2           39       80.6         80.6
Wonderful commentary   150                   0         14       3      107         24       71.8         76.4
Common commentary      260                   0         0        30     20          210      80.2         75.8

As shown in Table 5, the batting sound in tennis matches is easily confused with common commentary, because tennis players often shout as they hit the ball. As shown in Table 6, the overall audio classification performance is not ideal, since the various sounds in broadcast video are often mixed together, which greatly increases the difficulty of audio classification. For our application, however, the classification results are already good enough: they can be used for the subsequent time-domain voting strategy, and fusion with the video information will further improve the final classification performance.

4. Unsupervised scene clustering

Sports take place at fixed venues and follow a fairly well-defined temporal structure. Moreover, sports videos are captured by a fixed number of cameras around the field, so specific scenes (views) occur periodically throughout the video. For example, a table tennis video has three main scenes: the long view of the field, the close-up of player A and the close-up of player B, as shown in Figure 3. Extracting these scenes as mid-level features of the video benefits the high-level analysis of sports video content.

Figure 3. Main scenes of a table tennis video: (a) long view of the field, (b) close-up of player A, (c) close-up of player B

These scenes are composed of shots with similar visual content, and their visual distribution does not vary much within the same video segment. On the other hand, different sessions of a competition may differ considerably in field and player information, which causes large differences between video segments. Applying an unsupervised scene clustering method therefore ensures the universality of the video mid-level features. We propose a new scene classification method that runs completely unsupervised, without any prior knowledge. The method obtains the scene classes by repeatedly merging shots with similar content and stopping once an appropriate merging-stop rule is satisfied. The merging-stop rule is defined on a J value obtained from the Fisher discriminant function; the merging process together with the automatic stop rule determines the best stopping point and makes the final scene classification consistent with people's expectations. We call the method the J-value-based scene classification method.

4.1. Calculation of scene similarity

In the proposed method the shot is the basic processing unit, and the normalized HSV (hue-saturation-value) color histograms of each shot's key frames are used as features to measure the scene similarity between shots. First, the sports video is split into N shots, S = {s_1, s_2, ..., s_N}. To simplify the calculation, five key frames are chosen from each shot to represent its content, K^i = {k_1^i, k_2^i, k_3^i, k_4^i, k_5^i}, and the HSV color space is uniformly quantized into 16(H)×4(S)×4(V) = 256 color levels. The distance between two shots or scenes represents their scene similarity: the smaller the distance, the more alike they are. The distance between shots s_i and s_j is calculated by Equations (13) and (14):

SD(s_i, s_j) = \frac{1}{2} \left[ \min_{m,n} \{ d(k_m^i, k_n^j) \} + \widehat{\min}_{m,n} \{ d(k_m^i, k_n^j) \} \right]    (13)

d(k_m^i, k_n^j) = \sum_{b=1}^{B} \left| H_m^i(b) - H_n^j(b) \right| / B    (14)

where \min and \widehat{\min} denote the minimum and the second-smallest distance over all key-frame pairs of shots s_i and s_j, d(k_m^i, k_n^j) is the HSV color histogram distance between key frames k_m^i and k_n^j, H_m^i is the normalized 256-bin HSV color histogram of the m-th key frame of shot i, and B = 256 is the total number of color quantization levels.
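A minimal NumPy sketch of the shot distance of Equations (13) and (14) follows (our own illustration); it assumes key frames are already converted to HSV with channels scaled to [0, 1) and uses the 16x4x4 quantization described above.

import itertools
import numpy as np

def hsv_histogram(frame_hsv):
    # Normalized 16(H) x 4(S) x 4(V) = 256-bin histogram of one key frame.
    h = np.minimum((frame_hsv[..., 0] * 16).astype(int), 15)
    s = np.minimum((frame_hsv[..., 1] * 4).astype(int), 3)
    v = np.minimum((frame_hsv[..., 2] * 4).astype(int), 3)
    hist = np.bincount((h * 16 + s * 4 + v).ravel(), minlength=256).astype(float)
    return hist / hist.sum()

def keyframe_distance(hist_a, hist_b):
    # Eq. (14): average absolute bin difference of two 256-bin histograms.
    return np.abs(hist_a - hist_b).mean()

def shot_distance(keyframes_i, keyframes_j):
    # Eq. (13): mean of the smallest and second-smallest key-frame distances
    # over all key-frame pairs of the two shots (five key frames per shot).
    d = sorted(keyframe_distance(a, b)
               for a, b in itertools.product(keyframes_i, keyframes_j))
    return 0.5 * (d[0] + d[1])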

4.2. Scene classification

At the start of scene classification, each shot is regarded as a scene containing a single shot; Equations (13) and (14) are used to calculate the pairwise distances between these initial scenes S^c = {s_1^c, s_2^c, ..., s_N^c}, which are then repeatedly merged into scene classes. In each iteration, the two scenes with the smallest distance, i.e. the highest similarity, are merged into one scene class. The process repeats until the merging-stop rule is met, at which point merging stops and the final scene classification is obtained. Before giving the merging-stop rule, we define the J value. The definition is based on the Fisher discriminant function: the J value measures the total scene-class divergence during merging, namely the ratio of the intra-class divergence (intra-class distance) to the total divergence. When K_l scene classes remain in the merging process, the J value is defined as shown in Equation (15):

J_l = \frac{ \sum_{c=1}^{K_l} J_w^c }{ J_t } = \frac{ \sum_{c=1}^{K_l} \sum_{i=1}^{N_c} \left\| s_i^c - s_{mean}^c \right\| }{ \sum_{i=1}^{N} \left\| s_i - s_{mean} \right\| }    (15)

where N_c is the number of shots in scene class c, s_{mean}^c is the mean feature of class c, and s_{mean} is the mean feature over all N shots.

When scene classification starts, every initial scene class contains only a single shot, so the intra-class divergence is 0 and J_l = 0.0. As scenes are merged into scene classes, shots are added and the intra-class divergence increases, so J_l grows; when all initial scenes have been merged into one scene class, the intra-class divergence is largest and J_l reaches its maximum of 1.0. A smaller J_l indicates higher similarity among the shots within each scene class. Ideally one would like J_l to be small while the number of scene classes is also not too large, but in practice J_l grows as the number of scenes decreases. To compromise between J_l and the number of scenes, we choose the point where J_l · K_l is smallest as the optimal merging-stop point; this is the scene merging-stop rule. The relationship is shown in Figure 4.

Figure 4. Relationship between J_l and K_l in a table tennis video

4.3. Algorithm description

The J-value-based scene classification method is described as follows:
1. Divide the input sports video into a shot sequence S = {s_1, s_2, ..., s_N}, where N is the total number of shots.

2. Choose five key frames from each shot to represent its content, K^i = {k_1^i, k_2^i, k_3^i, k_4^i, k_5^i}.
3. Uniformly quantize the HSV color space into 16(H)×4(S)×4(V) = 256 color levels.
4. Regard each shot as a single-shot scene, S^c = {s_1^c, s_2^c, ..., s_N^c}, and calculate the pairwise distances (scene similarities) with Equations (13) and (14).
5. Merge the two most similar scenes into a new scene S_l^c = (s_i^c, s_j^c):
   (1) represent the content of the new scene by the weighted HSV color histogram of the two merged scenes;
   (2) re-compute the distances between the new scene S_l^c and the other scenes;
   (3) use Equation (15) to calculate J_l, and compute the value of J_l · K_l.
6. Repeat step 5 until all scenes are merged and record the point at which J_l · K_l reaches its minimum; then rerun the merging from the start up to this stop point, stop merging, and take the result as the scene classification.
7. Sort the scene classes by their number of shots in descending order and output the sequence. The scene classification is then complete.
A code sketch of the merging loop is given below.
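The following sketch of the merging loop of Section 4.3 is our own illustration, not the authors' code. It simplifies step 5(1) by representing each scene with the mean histogram of its member shots, uses an L1 norm in Equation (15), and reads the merging-stop rule as the minimum of J_l · K_l over non-degenerate merge levels; these are assumptions of ours.

import numpy as np

def j_value(features, labels):
    # Eq. (15): total within-class scatter divided by the total scatter (L1 norm assumed).
    total = np.abs(features - features.mean(axis=0)).sum()
    within = sum(np.abs(features[labels == c] - features[labels == c].mean(axis=0)).sum()
                 for c in np.unique(labels))
    return within / total

def cluster_scenes(shot_features):
    # shot_features: (N, 256) array, e.g. one mean key-frame histogram per shot.
    labels = np.arange(len(shot_features))        # step 4: every shot is its own scene
    history = []                                  # (J_l * K_l, labels) at each merge level
    while len(np.unique(labels)) > 1:
        classes = np.unique(labels)
        centroids = np.stack([shot_features[labels == c].mean(axis=0) for c in classes])
        # step 5: merge the two closest scene classes (mean absolute histogram distance)
        d, a, b = min(((np.abs(centroids[i] - centroids[j]).mean(), classes[i], classes[j])
                       for i in range(len(classes)) for j in range(i + 1, len(classes))),
                      key=lambda t: t[0])
        labels[labels == b] = a
        k = len(np.unique(labels))
        jl = j_value(shot_features, labels)
        if jl > 0:                                # skip degenerate levels where J_l is still 0
            history.append((jl * k, labels.copy()))
    # step 6: keep the scene labels at the level where J_l * K_l is smallest
    return min(history, key=lambda t: t[0])[1] if history else labels

In practice shot_features would be built from the key-frame histograms of Section 4.1.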

4.4. Discussion and experiments

The most important task of unsupervised scene clustering is to cluster shots correctly according to the similarity of their visual content, and the number of clusters determines the classification performance. The experiment attempts to validate the practicality of the proposed J-value-based stopping strategy. Tables 7 and 8 list the number of clusters at the stopping point for each video.

Table 7. Number of scene clusters in the tennis videos

Video               c    d    e    f    g    h
Clustering number   16   13   45   38   28   52

Table 8. Number of scene clusters in the table tennis videos

Video               C    D    E    F    G    H
Clustering number   11   14   26   31   26   35

In Tables 7 and 8, the number of clusters varies considerably, from 11 at the smallest to 52 at the largest. This is because broadcast videos contain some advertisements, and the shooting style differs between video types and even between videos of the same type, depending on the camera operators; these factors make the number of clusters quite unstable. We also find that the number of clusters in the tennis videos is generally larger than in the table tennis videos, because tennis broadcasts use somewhat more cameras than table tennis broadcasts and therefore contain more scenes. Although the number of clusters varies greatly, the clustering quality is good, which shows that the algorithm can find a suitable stopping point according to the content of the video itself. Figure 5 shows some of the clustering results.

Figure 5. The result of clustering of tennis and table tennis scenes

5. Conclusions

This paper has described the extraction of audio and visual mid-level features. Compared with analyzing content directly from low-level audio-visual features, the mid-level-feature-based solution is more intuitive and makes it easier to apply rule-based methods afterwards. In supervised audio classification there are two options, one based on audio frame features and the other on audio clip features; since in sports video a given kind of sound lasts for some time, we chose the statistics of audio frame features as audio clip features. Features were selected by a forward search algorithm rather than PCA or LDA, because the extracted features already carry mid-level meaning and need no feature transform. For the classifier we used a multi-class SVM, for its good generalization performance and small training-sample requirement. These choices rest on a reasonable theoretical basis, and the classification performance of about 80% achieved for the five kinds of sound in noisy broadcast videos confirms that they were sensible. In unsupervised scene classification, previous methods need prior knowledge to determine the number of clusters and the initial cluster centers, whereas the proposed J-value-based method determines the merging-stop point automatically. In future work, both the supervised audio classification and the unsupervised scene clustering need improvement; the more accurately the audio mid-level features are extracted, the more reliable and robust the whole framework will be.

Abbreviations
Principal Components Analysis (PCA)
Linear Discriminant Analysis (LDA)
Support Vector Machine (SVM)

Declarations
Ethical Approval and Consent to participate: Approved.
Consent for publication: Approved.
Availability of supporting data: We can provide the data.

Competing interests
There are no potential competing interests in our paper. All authors have seen the manuscript and approved its submission to your journal. We confirm that the content of the manuscript has not been published or submitted for publication elsewhere.

Author's contributions
All authors took part in the discussion of the work described in this paper. These authors contributed equally to this work and should be considered co-first authors.

Acknowledgements
This work was supported by the Postdoctoral Foundation of China (2017M610852), the Key Program of the Social Science Foundation of Jilin Province (2016A5), and the Ministry of Education Science and Technology Development Center Industry-University Research Innovation Fund (2018A01002).

References
1) Ling Lu, Hu Zhang. Content Based Audio Classification and Segmentation by Using Support Vector Machines. Multimedia Systems, 2003, 8(3): 482–492.
2) Arnau Raventós, Raúl Quijada, Luis Torres. The importance of audio descriptors in automatic soccer highlights generation. 2014 IEEE 11th International Multi-Conference on Systems, Signals & Devices (SSD14), 2014: 1–6.
3) Jianwu Liu, Yingying Zhu, Na Song. Integration of Audio Features in the Game Field Landlord Color Clustering Algorithm. Journal of Putian University, 2011, 17(5): 55–59.
4) Rachida Hannane, Abdessamad Elboushaki, Karim Afdel. MSKVS: Adaptive mean shift-based keyframe extraction for video summarization and a new objective verification approach. Journal of Visual Communication and Image Representation, 2018, 55(2): 179–200.
5) Sanjay K. Kuanar, Rameswar Panda, Ananda S. Chowdhury. Video key frame extraction through dynamic Delaunay clustering with a structural constraint. Journal of Visual Communication and Image Representation, 2013, 24(7): 1212–1227.
6) Hrishikesh Bhaumik, Siddhartha Bhattacharyya, Susanta Chakraborty. A vague set approach for identifying shot transition in videos using multiple feature amalgamation. Applied Soft Computing, 2019, 75(20): 633–651.
7) Fereshteh Falah Chamasemani, Lilly Suriani Affendey, Norwati Mustapha, Fatimah Khalid. Video abstraction using density-based clustering algorithm. The Visual Computer, 2018, 34(10): 1299–1314.
8) Ji-Xiang Du, Chuan-Min Zhai, Yi-Lan Guo, Philip Chen Chun Lung. Recognizing complex events in real movies by combining audio and video features. Neurocomputing, 2014, 137(5): 89–95.
9) Zu Xiong, Ricael Radhakrishnan. Generation of Sports Highlights Using Motion Activity in Combination With A Common Audio Feature Extraction Framework. Proc. IEEE Int. Conf. Image Processing, 2013: 230–239.
10) Payam Oskouie, Sara Alipour, Amir-Masoud Eftekhari-Moghadam. Multimodal feature extraction and fusion for semantic mining of soccer video: a survey. Artificial Intelligence Review, 2014, 42(4): 173–210.
11) Zu Liu, Jing Huang. Classification of TV Programs Based on Audio Information Using Hidden Markov Model. IEEE Signal Processing Society Workshop on Multimedia Signal Processing, 2008: 27–32.
12) Markos Mentzelopoulos, Alexandra Psarrou, Anastassia. Active Foreground Region Extraction and Tracking for Sports Video Annotation. Neural Processing Letters, 2013, 37(1): 33–46.
13) Micael Kolekar, Palaniappan K. Semantic Event Detection and Classification in Cricket Video Sequence. 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, 2008: 382–389.
14) Xiaoxi, Zhou Lu. An adaptive clustering feature extraction algorithm for speech recognition based on K-means and normalized intra-class variance. Journal of Tsinghua University (Natural Science Edition), 2017, 57(08): 857–861.
15) Dawn, Wu Chen. Image feature extraction based on improved clustering algorithm. Information Communication, 2017(03): 20–21.

16) Tianyuan, Wang Hongtao. Research on image edge feature extraction based on quantum kernel clustering algorithm. Journal of Metrology, 2016, 37 (06): 582-586. 17) Wang Ruirong, Yu Xiaoqing, Zhu Guangming, Wang Min. ECG feature extraction based on wavelet transform and K-means clustering algorithm. Aerospace Medicine and Medical Engineering, 2016, 29 (05): 368-371. 18) Jiang Lijun, Xiong Zhiyong, Li Zhelin. Rough clustering algorithm for thread detection based on sub-pixel feature points extraction. Computer Engineering and Science, 2009, 31 (07): 43-45 19) Manish Shrivastava ; Princy Matlani.A smoke detection algorithm based on K-means segmentation. 2016 International Conference on Audio, Language and Image Processing (ICALIP),2016: 301 – 305 20) Aren Jansen ; Jort F. Gemmeke ; Daniel P. W. Ellis ; Xiaofeng Liu ; Wade Lawrence ; Dylan Freedman.Large-scale audio event discovery in one million YouTube videos.2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),2017: 786 – 790 21) Shiqi Tang ; Min Zhi.Summary generation method based on audio feature.2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS).2015: 619 – 623 22) Gabriel Sargent ; Pierre Hanna ; Henri Nicolas ; Frédéric Bimbot.Exploring the Complementarity of Audio-Visual Structural Regularities for the Classification of Videos into TV-Program Collections.2015 IEEE International Symposium on Multimedia (ISM),2015: 620 – 623 23) Payam Oskouie,Sara Alipour,Amir-Masoud Eftekhari-Moghadam.Multimodal feature extraction and fusion for semantic mining of soccer video: a survey. Artificial Intelligence Review,2014, 42(2) , 173–210 24) Fereshteh Falah Chamasemani,Lilly Suriani Affendey,Norwati Mustapha,Fatimah Khalid.Video abstraction using density-based clustering algorithm. The Visual Computer, 2018, 34,(10), 1299– 1314 25) S. Shanthi Therese,Chelpa Lingam,A linear visual assessment tendency based clustering with power normalized cepstral coefficients for audio signal recognition system. Journal of Ambient Intelligence and Humanized Computing,2017, 1–14 26) Ioannis Mademlis ; Anastasios Tefas ; Nikos Nikolaidis ; Ioannis Pitas.Multimodal Stereoscopic Movie Summarization Conforming to Narrative Characteristics.IEEE Transactions on Image Processing,2016,25 (12): 5828 – 5840 27) Salloum, R. , Ren, Y. , & Kuo, C. C. J. . (2017). Image splicing localization using a multi-task fully convolutional network (mfcn). Journal of Visual Communication and Image Representation, 51. 28) Hamid Izadinia ; Imran Saleemi ; Mubarak Shah. Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects.IEEE Transactions on Multimedia,2013 15 , (2): 378 – 390 29) Elie El Khoury,Christine Sénac,Philippe Joly,Audiovisual diarization of people in video content ,Multimedia Tools and Applications, 2014, Volume 68, Issue 3, pp 747–775 30) Rachmadi, R. F. , Uchimura, K. , Koutaki, G. , & Ogata, K. . (2018). Single image vehicle classification using pseudo long short-term memory classifier. Journal of Visual Communication and Image Representation. 31) K. Anoop,Manjary P. Gangan,V. L. Lajish.Mathematical Morphology and Region Clustering Based Text Information Extraction from Malayalam News Videos.Advances in Signal Processing and Intelligent Recognition Systems, 2015 , 431-442 32) Acharya, A. , & Meher, S. . (2017). Efficient fuzzy composite predictive scheme for effectual 2-d upsampling of images for multimedia applications. 
Journal of Visual Communication and Image Representation, 44, 156-186. 33) Nikolaos Sarafianos,Theodoros Giannakopoulos,Sergios Petridis.Audio-visual speaker diarization using fisher linear semi-discriminant analysis. Multimedia Tools and Applications,2016, 75,(1) , 115–130 34) Kossyk, I. , & Márton, Zoltán-Csaba . (2019). Discriminative regularization of the latent manifold of variational auto-encoders. Journal of Visual Communication and Image Representation, 61, 121-129.

Figures
Figure 1. The process of feature selection of the forward search algorithm
Figure 2. Optimal classification surface in the linearly separable case
Figure 3. Main scenes of a table tennis video
Figure 4. Relationship between J_l and K_l in a table tennis video
Figure 5. The result of clustering of tennis and table tennis scenes

Author details

Jinming Xing is currently a professor and master's supervisor in the School of Physical Education at Northeast Normal University. His research interests are mainly physical education and training, sports management and sports engineering. He has published several research papers in scholarly journals in these areas and has contributed to several books.

Xiaofeng Li received his Ph.D. degree from Beijing Institute of Technology, where he is a professor. His research interests include data mining, intelligent healthcare, intelligent transportation, big data, social computing and sports engineering. He is a member of ACM, a member of IEEE, and a senior member of CCF. He has published more than 50 academic papers at home and abroad, more than 20 of which have been indexed by SCI and EI.

There is no conflict of interest.