Video semantic analysis based kernel locality-sensitive discriminative sparse representation


Expert Systems With Applications 119 (2019) 429–440


Video semantic analysis based kernel Locality-Sensitive Discriminative Sparse Representation

Ben-Bright Benuwa a,b, Yongzhao Zhan a,∗, Augustine Monneyine a, Benjamin Ghansah b, Ernest K. Ansah b

a School of Computer Science and Communication Engineering, Jiangsu University, 301 Xuefu Road, Jingkou District, Zhenjiang 212013, Jiangsu, China
b School of Computer Science, Data Link Institute, P.O. Box 2481, Tema, Ghana

Article info

Article history:
Received 27 July 2018
Revised 17 October 2018
Accepted 8 November 2018
Available online 9 November 2018

Keywords:
Kernel sparsity
Sparse representation
Locality information
Dictionary learning
Video semantic analysis
Group sparsity

Abstract

Kernel based locality-sensitive sparse representation is currently an active research topic in artificial intelligence, pattern recognition, signal processing and multimedia applications. In this paper, a new kernel based approach, named Video Semantic Analysis based Kernel Locality-Sensitive Discriminative Sparse Representation (KLSDSR), is proposed to improve video semantic analysis for military intelligence systems and video surveillance. The proposed algorithm learns more discriminative sparse representation (SR) coefficients based on group sparsity for video semantic analysis by incorporating both sparsity and locality sensitivity in the kernel feature space, mapping the SR features into a high dimensional space. Furthermore, an optimal dictionary is generated to compute the SR of video features, aimed at good preservation of the locality structure among video semantic features and an improvement in computational cost. Experimental results on video semantic concepts show that the proposed method significantly improves the discrimination of SR features, gives promising classification results, and outperforms state-of-the-art baseline methods. © 2018 Elsevier Ltd. All rights reserved.

1. Introduction

Video Semantic Analysis (VSA) is gaining a lot of interest in the research community and is an imperative task in video surveillance and military intelligence systems. Moreover, the explosion of videos resulting from the proliferation of social media, coupled with astronomical advancements in technology (such as the internet), has necessitated the creation of various applications including, but not limited to, semantic video retrieval and indexing (Haseyama, Ogawa, & Yagi, 2013), semantic sports event analysis (Shih, 2017), and video event detection (Abbasnejad, Sridharan, Denman, Fookes, & Lucey, 2017). VSA is a crucial aspect of video data retrieval, structuring and indexing in multimedia applications. Furthermore, the classification performance of a video semantic concept detection system is heavily dependent on an efficient and appropriate feature representation (Abidine, Fergani, Fergani, & Oussalah, 2016; Nweke, Ying, Al-Garadi, & Alo, 2018).



∗ Corresponding author.
E-mail addresses: [email protected] (B.-B. Benuwa), [email protected] (Y. Zhan), [email protected] (B. Ghansah).
https://doi.org/10.1016/j.eswa.2018.11.016
0957-4174/© 2018 Elsevier Ltd. All rights reserved.

The key concept of VSA is the exploitation of an effective mapping between low level visual features and high level semantic concepts, to efficiently extract the high level semantic concepts from video data by semantic information handling (Xu, Ma, Zhang, & Yang, 2005). However, current feature analysis processes are challenged by the issues of video data complexity, dimensionality reduction, discriminative ability and the presence of noisy information in the structure of the video data, particularly when the size of the dictionary gets large with a relatively small sample size (Nweke et al., 2018). The increase in datasets is due to advances in capturing devices, display methods and transmission technologies, contributing to the astronomical growth of videos and the related analytical tools for video retrieval services (Song et al., 2018). The growth of online videos also makes it imperative to develop strategies for analyzing video semantic content, considering the complexity and nonlinear nature of video semantic data. As a result, many sparse representation based approaches have been proposed by scholars to address these challenging issues with video data for video semantic analysis (Bai, Li, & Zhou, 2015; Wang, Wang, Xiao, Wang, & Zhang, 2012). The foremost objective of SR is to represent a test image over the formed dictionary as a sparse linear combination of all the training samples. The label of the test sample is then obtained by finding the class with the minimum residual between the test sample and its reconstruction from the training samples of the same class. Moreover, the discriminative capability of SR is heavily dependent on the quality of the dictionary for good classification results. Consequently, many studies have been carried out to improve the theory and expand the implementations of SR in its application areas. For instance, Gabor features were adopted for learning an occlusion dictionary for SR with low rank subspace projection (Yang, Xi-Sheng, & Biao-Zhun, 2017). An extension of the conventional SRC techniques was implemented in (Deng, Hu, & Guo, 2012) to exploit the inter-class relations of variant dictionaries and deal with the issue of under-sampled dictionaries for face recognition. An SR technique was developed in (Wang, Li, & Liao, 2013) to resolve the challenges with image alignment, illumination and occlusion concurrently for face recognition. Recently, a virtual extended dictionary based SRC was introduced in (X. Song, Shao, Yang, & Wu, 2017) to deal with the data ambiguity problems faced in building an optimized and comprehensive dictionary. However, the problem with SR based techniques is that they cannot produce the same coding results when the input features are from the same category, despite having good classification results. Moreover, the elementary approach to building the dictionary is to use all the training samples, which may make the dictionary huge and is subsequently uncomplimentary to the sparse solver (Wright, Yang, Ganesh, Sastry, & Ma, 2009).
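As an illustration of the SRC decision rule described above (coding a test sample over the training dictionary, then classifying by minimum class-wise reconstruction residual), the following minimal sketch uses a ridge-regularized least-squares coding step as a stand-in for a true l1 sparse solver; the function name, data shapes and the coding step are illustrative assumptions, not the exact solvers used in the papers cited.

```python
import numpy as np

def src_classify(y, D, labels, lam=0.01):
    """Classify y by minimum class-wise reconstruction residual.

    D: (m, n) dictionary whose columns are training samples,
    labels: (n,) class label of each column. A ridge-regularized
    coding step stands in for a true l1 sparse solver here."""
    # Code y over the whole dictionary (regularized least squares).
    x = np.linalg.solve(D.T @ D + lam * np.eye(D.shape[1]), D.T @ y)
    residuals = {}
    for c in np.unique(labels):
        xc = np.where(labels == c, x, 0.0)   # keep only class-c coefficients
        residuals[c] = np.linalg.norm(y - D @ xc)
    return min(residuals, key=residuals.get)
```

A test sample close to the span of one class's columns yields a small residual for that class only, which is the decision rule formalized later in Eqs. (21)–(23).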
To deal with this challenge, and also to effectively exploit the hidden information in the structure of the training data essential for classification, group based SR techniques were proposed in (Sun, Liu, Tang, & Tao, 2014; Xu, Sun, Quan, & Zheng, 2015; Zha et al., 2018). In these prior studies, a dictionary is learned by integrating the discriminative assets of the reconstruction error, the representation error and the classification error terms into the objective function. This makes the dictionary more discerning for effective classification results. Nonetheless, these techniques are insufficient for large datasets and do not consider the complex nonlinear structure of video semantic data, which is essential for good classification results. To address the aforementioned nonlinearity and video data complexity issues, Kernel based SR (KSRC) methods were proposed in (Dumitrescu & Irofti, 2018; Gao, Tsang, & Chia, 2013; Wang, Wang, Liu, & Zhang, 2017; Wu, Li, Xu, Chen, & Yao, 2016; Zhang, Zhou, & Li, 2015) to handle data samples with nonlinear structure. These kernel based techniques initially map the original data into a high dimensional feature space and then learn the SR from the results obtained in the kernel feature space. However, these techniques do not consider the locality structure of data, although they have been successfully applied to image classification. In order to take advantage of data locality coupled with KSRC and group sparsity, a video semantic detection approach based on Locality-sensitive Discriminant SR and Weighted KNN (LSDSR-WKNN) was recently proposed in (Zhan, Liu, Gou, & Wang, 2016a). Despite achieving better category discrimination on the SR of video semantic concepts, the LSDSR-WKNN technique failed to fully exploit the potential information of the samples. An extension of KSRC, locality-sensitive KSRC (LSKSRC), was proposed by (Zhang & Zhao, 2014).
This approach incorporates locality sensitivity into the structure of KSRC in the kernel feature space, instead of the original feature space, for face recognition. More recently, LSKSRC was extended by further exploiting group sparsity, data locality and the kernel trick in a joint SR method called Kernelized Locality Sensitive Group Sparsity Representation (KLSGSRC) for face recognition (Tan, Sun, Chan, Lei, & Shao, 2017). However, the KLSGSRC technique does not guarantee a discriminative representation and also fails to fully explore the potential information of the samples, despite achieving good recognition results.

Based on the aforementioned challenges, this paper seeks to enhance the power of discrimination by exploiting the structure and the nonlinear information embedded in both training and test samples for video semantic concepts. Having adopted ordered-samples clustering centered on artificial immune theory (Yongzhao, Manrong, & Jia, 2012) to extract the static image frames from each original video, we propose a Video Semantic Analysis based Kernel Locality-Sensitive Discriminative Sparse Representation (KLSDSR) technique, an extension of our earlier Group Sparse based Locality-Sensitive Dictionary Learning (GSLSDL) approach (Benuwa et al., 2018) to the kernel scheme, which is more efficient and has superior numerical stability. The proposed KLSDSR algorithm enables the adjustment of the dictionary adaptively, using the differences between the reconstructed dictionary atoms and the training samples with the locality adaptor in the kernel feature space. Moreover, the implementation of the locality adaptor in the kernel space during the sparse coding stage increases the efficiency of the algorithm, owing to the locality and similarity of the dictionary atoms being mapped into the high dimensional feature space. The dictionary information can also be fully exploited by the proposed algorithm with the introduction of a discriminant loss function based on group sparsity. This is to obtain more discriminative information and enhance the classification ability of sparse representation features in the kernel feature space. The dictionaries learned can represent video semantics more realistically, and thus give a better representation of the samples belonging to the sparse coefficients.
The main contributions of this paper are enumerated below:

1. A discriminative loss function is introduced into the structure of kernel locality-sensitive sparse representation. This results in optimized sparse coefficients by encoding the sparse codes of video samples from the same category in the kernel space. This enables higher recognition performance by exploring the nonlinear discriminative information embedded in video semantic data in the kernel space to enhance the discriminative classification of SR features.
2. A kernelized locality-sensitive SR model based on group sparsity is developed for video semantic detection that encodes the original video features into a high dimensional feature space. Features are thus sparsely encoded for video semantic detection with better preservation of the dictionary. This model therefore yields an improvement in the accuracy of video concept detection.

The rest of the paper is organized into the following sections: Section 2 presents a review of related works, Section 3 discusses the proposed algorithm, and experimental results are presented in Section 4. Finally, Section 5 outlines the main conclusions and recommendations.

2. Related areas

In this section, some related works on sparse representation, kernels and data locality, such as KSRC, LSDL and KLSSR, are briefly discussed.

2.1. Kernel sparse representation

Kernel Sparse Representation (KSR) (Zhang & Zhao, 2014; Zhang et al., 2015) projects the SR features and basis into a high-dimensional kernel space. Generally, sparse coding aims at finding the sparsest solution under a given basis function while simultaneously reducing the reconstruction error. More precisely, let A ∈ R^{m×N} represent the training set and D ∈ R^{m×K} denote the sparse learning dictionary. Represent the sparse representation matrix of A as X ∈ R^{K×N} over D, and denote a test sample as y ∈ R^{m×1}. The sparse representation of the original sample is denoted as the sparse linear combination over the over-complete dictionary. The objective function of representation is as stated in Eq. (1) below:

min_{D,x} ||A − Dx||_F² + λ||x||₁, subject to: ||d_m||₂ ≤ 1, (1)

where D = [d₁, d₂, …, d_k]. The first term in Eq. (1) is the reconstruction error, the second term controls the sparsity of the coefficient vector x, and λ is the regularization parameter; empirically, a larger λ results in a correspondingly sparser solution. Kernel techniques are utilized in finding the SR and the similarity of nonlinear features, since SR does not fully consider the nonlinear structure of samples. The kernel functions alter the distribution of samples after they have been mapped into a high-dimensional feature space. Under an appropriate kernel function projection, the data will have a better linear representation in the high-dimensional feature space. The samples are more accurately represented by similar training samples, which are the non-zero values of the SR for the samples corresponding to the same training sample, making the data more separable in the high-dimensional feature space. This is because the SR of the sample contains more discriminative information. Now consider the feature mapping function ∅: R^d → R^F, due to the existence of nonlinear kernel mappings k(·,·), with d ≪ F:
min_{D,x} ||∅(a) − Dx||₂² + λ||x||₁, subject to: k(d_i, d_j) ≤ 1, (2)

where x is the sparse representation coefficient in the kernel space and D = [∅(d₁), ∅(d₂), …, ∅(d_k)] is the mapped dictionary (codebook); KSR seeks the sparsest solution for a mapped feature under the basis in the high-dimensional space. There are four common kinds of kernel functions: the linear kernel, polynomial kernel, Gaussian kernel and sigmoid kernel. As a kernel extension of SR, KSR enhances the generalization ability and robustness of the algorithm by solving for sparse representation coefficients in the kernel space.
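To make the kernel trick in Eq. (2) concrete, the following minimal sketch solves the kernelized l1 coding problem with ISTA under a Gaussian kernel: the objective expands to k(a,a) + xᵀKx − 2k_aᵀx + λ||x||₁, so only kernel evaluations are needed and ∅ is never formed explicitly. The function names, the kernel choice and the ISTA solver are illustrative assumptions, not the cited papers' implementations.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise Gaussian kernel values k(x_i, y_j).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kernel_sparse_code(a, D, lam=0.1, sigma=1.0, n_iter=200):
    """ISTA for min_x k(a,a) + x^T K x - 2 k_a^T x + lam * ||x||_1,
    a kernelized form of Eq. (2); D holds dictionary atoms as rows."""
    K = gaussian_kernel(D, D, sigma)              # Gram matrix of the atoms
    k_a = gaussian_kernel(D, a[None, :], sigma).ravel()
    x = np.zeros(D.shape[0])
    step = 1.0 / (2 * np.linalg.norm(K, 2))       # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = 2 * (K @ x - k_a)                  # gradient of the smooth part
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0)  # soft threshold
    return x
```

Coding an input that coincides with one dictionary atom concentrates the solution on that atom, which is the behavior the kernelized objective is designed to produce.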

2.2. Locality-sensitive dictionary learning

Test samples in SR are mostly characterized by dictionary atoms that may actually not be neighboring them for data reconstruction. This may render SR inappropriate for preserving data locality, and hence could lead to poor recognition results. In the Locality-Sensitive Dictionary Learning (LSDL) method, locality constraints were introduced into both the DL and sparse coding phases, so that the test samples are well represented by neighboring dictionary atoms. To be precise, assume that there exists a test sample y, A = [a₁, a₂, …, a_n] representing n training samples and a dictionary D = [D₁, D₂, …, D_M] ∈ R^{d×M} consisting of M atoms; then the coding coefficients of A over D are denoted by X = [x₁, x₂, …, x_L] ∈ R^{M×L}, with L denoting the number of video shots. The optimized function for LSDL is stated as follows:

min_{D,X} ||A − DX||_F² + λ Σ_{i=1}^N ||p_i ⊙ x_i||₂², s.t. 1ᵀx_i = 1, ∀i = 1, …, N, (3)

where p_i ∈ R^{K×1} is the locality adaptor, A denotes the training samples, D is the dictionary, x_i is the sparse coefficient vector belonging to class i and λ is the constraint parameter that controls the locality constraint. A discriminative loss function using class information to enhance SR based classification is designed in this study, since class information is not considered by the LSDL technique. This is done in order to improve the accuracy of video analysis. It has been reported in (Wang et al., 2010) that LLC gives promising classification results if the resultant sparse coefficients are utilized as features for testing and training. Several locality preserving techniques, such as those discussed in (Liu, Yu, Yang, Lu, & Zou, 2015; Wei, Chao, Yeh, & Wang, 2013a, 2013b), preserve the local structure of data in dictionary learning; these yield a closed-form solution during the learning process, and the integration of the penalty term into the structure of the DL has been successfully implemented for locality DL. In (Lee, Wang, Mathulaprangsan, Zhao, & Wang, 2016), a dual layer locality preserving technique based on KSVD was proposed for object recognition. Despite the successes achieved by these approaches in improving the discriminability of the learned dictionary, they are challenged by the issue of dimensionality reduction and fail to fully exploit the discriminative information essential for classification. Besides, the discrimination ability of the learned dictionary is key to effective classification results and aims at classifying input query samples correctly. This has motivated us to propose a kernel locality-sensitive dictionary centered on group sparse coding of the sparse coefficients for the purpose of enhancing the power of discrimination for video semantic analysis in the kernel feature space.

2.3. Kernel locality-sensitive sparse representation

Kernel Sparse Representation (KSR) techniques, as highlighted in (Zhang & Zhao, 2014; Zhang et al., 2015), fail to capture the locality structure of data, which is essential for effective classification results. Besides, it has been pointed out in (C. P. Wei et al., 2013a, 2013b; Sun, Chan, & Qu, 2015; Zhan, Liu, Gou, & Wang, 2016b) that data locality is more essential than sparsity. The non-linear associations between video features are exploited by measuring data similarity in the kernel space (Sun et al., 2015; Zhang & Zhao, 2014). This is achieved by enforcing data locality in the kernel feature space. Assume that there exists a test sample y and a dictionary D mapped by a nonlinear kernel mapping function ∅ into ∅(y) and D = [d₁, d₂, …, d_k] → D = [∅(d₁), ∅(d₂), …, ∅(d_k)], respectively. Then the similarity between the test sample and neighboring samples is well kept by the technique while seeking SR coefficients, with the integration of both sparsity and data locality. The Kernel Locality Sensitive Sparse Representation (KLSSR) (Zhang & Zhao, 2014) technique is formulated by enforcing data locality as stated in Eq. (4) below:

min_x ||∅(y) − Dx||₂² + λ||P ⊙ x||₂², (4)

where the symbol ⊙ denotes element-wise multiplication, P is the locality adaptor used to realize the kernel distance between a test sample ∅(y) and each column of D, all the training features in the kernel feature space are represented by D = [∅(d₁), ∅(d₂), …, ∅(d_k)] = [∅(A₁), ∅(A₂), …, ∅(A_k)], and λ is the regularization parameter.

3. Proposed method

The proposed KLSDSR algorithm is detailed in this section by kernelizing the locality-sensitive adaptor. The kernelized adaptor is integrated, together with a discriminative loss function centered on kernel SR, into the objective function of the locality-sensitive discriminative SR method. The proposed method can achieve an optimal dictionary and further enhance the discriminability of the sparse codes. This enables the realization of the SR of non-linear features. Non-linear features might not achieve the same coding results from their representation coefficients alone. An assumption is therefore made that samples from the same category should be encoded as similar sparse coefficients in SR based data detection, after mapping the samples to high-dimensional feature spaces. This is done with the objective of enhancing the power of discrimination in SR based classifiers. The sample is more accurately represented by a similar training sample; that is, the non-zero values of the SR of the sample will correspond more to the same training sample by enforcing data locality in the kernel feature space. This yields more discriminant information in the video data. Therefore, the video data features are mapped into a high dimensional feature space to improve the SR coefficients of the video. Assume that A = [a₁, a₂, …, a_n] = [A₁, A₂, …, A_k] ∈ R^{d×n} → A = [∅(a₁), ∅(a₂), …, ∅(a_k)] denotes the n training samples (when the kernel is applied) from k classes, where the column vector a_i is sample i (i = 1, …, n) and the submatrix A_j consists of the column vectors from class j (j = 1, …, k). If there are M atoms in the dictionary D = [D₁, D₂, …, D_M] ∈ R^{d×M} → D = [∅(d₁), ∅(d₂), …, ∅(d_M)], and the dictionary D is mapped into the kernel space as an over-complete dictionary for SR with M ≤ n, then the coding coefficients of A over D can be denoted by X = [x₁, x₂, …, x_L] ∈ R^{M×L}. It is envisaged that there exist nonlinear kernel mappings for the training samples that enhance the accuracy of video semantic concept classification.
The dictionary D is obtained by mapping the dictionary features into the high dimensional feature space as a result of the nonlinear kernel mappings for the training samples:

∅: R^d → R^F, (5)

where F represents the training features in the high dimensional space and d denotes the number of dictionary features, with d ≪ F. The objective function of the proposed KLSDSR is then:
min_X ||∅(A_i^j) − DX||_F² + λ Σ_{i=1}^N (||P_i^j ⊙ x_i^j||₂² + G(x_i^j)), s.t. 1ᵀx_i^j = 1, ∀i = 1, …, N, (6)

where x_i^j is the sparse coefficient vector associated with the ith class, A ∈ R^{d×n} is the training sample matrix mapped into the high dimensional feature space with the function ∅(·), D ∈ R^{d×M} is the dictionary mapped into the kernel space, G(x_i^j) is the proposed discriminant loss function as defined in Eq. (7), the shift-invariance constraint 1ᵀx_i^j = 1 enforces the coding result of x to remain the same even when the origin of the data coordinates is shifted, as indicated in (Yu, Zhang, & Gong, 2009), λ is the regularization parameter controlling the reconstruction error and the sparsity, and K = P ∈ R^{K×1} for measuring the kernel distance can be substituted into Eq. (6). The kth element of the locality adaptor P_i^j is given by P_i^j = |∅(a_i) − ∅(d_i^j)| in the kernel space, and the symbol ⊙ represents element-wise multiplication. Considering that sample features from the same category should have similar sparse codes, we propose a discriminant loss function based on sparse coefficients for the purpose of enhancing the power of discrimination of input video data in sparse representation. Based on the Fisher criterion, the discriminant loss function is designed by minimizing the within-class scatter of the sparse codes and simultaneously maximizing the between-class scatter of the sparse codes. Consequently, video samples from the same category are compacted and the ones from different categories are separated. The discriminant loss therefore utilizes group structure information for the training and testing video data, and thus measures their similarity in the kernel feature space to better explore the nonlinear relationship amongst video features, as explained in Eq. (7) below:

G(x_i^j) = λ₂ ||(1 − 1/N_j) x_i^j − (1/N_j) Σ_{l ∈ (1,N_j), l ≠ i} x_l^j||₂² + η||x_i^j||₂², (7)

where ||(1 − 1/N_j) x_i^j − (1/N_j) Σ_{l ∈ (1,N_j), l ≠ i} x_l^j||₂² + ||x_i^j||₂² is the within-class similarity term enforcing group sparsity, λ₂ is the discriminative weighting constraint, N_j is the number of samples of the representation coefficients x_i^j belonging to class j, and N is the number of training samples. Group sparsity enforces the representation of test samples by training samples from as few features as possible, by dividing the dictionary into groups with each group formed from training samples of the same category. By this, classification is enhanced with a representation that uses a minimum number of nonzero groups. The term ||x_i^j||₂², combined with ||X||₂², could make Eq. (7) more stable based on the theorem of (Zou & Hastie, 2005). With η set to 1 for simplicity, Eq. (6) is reformulated as



min_{D,X} ||∅(A_i^j) − DX||_F² + λ₁ Σ_{i=1}^N ||P_i^j ⊙ x_i^j||₂² + λ₂ (||(1 − 1/N_j) x_i^j − (1/N_j) Σ_{l ∈ (1,N_j), l ≠ i} x_l^j||₂² + ||x_i^j||₂²), s.t. 1ᵀx_i^j = 1, ∀i = 1, …, N. (8)

The proposed method as stated in Eq. (8) is implemented by enforcing data locality in the high dimensional feature space, where the locality adaptor is used to measure the kernel distance between the test sample A and each column of D. Note that D = [D₁, D₂, …, D_M] represents all the training samples in the feature space. The vector P, the dissimilarity vector in Eq. (8), is implemented to suppress the corresponding weight and also penalizes the distance between the test sample and each training sample in the feature space. Furthermore, it should be noted that the resulting coefficients in our KLSDSR formulation may not be fully sparse with regard to the l2-norm, but they are seen as sparse because the representation solutions have only a few significant values, with most being zero. The test samples and their neighboring training samples in the feature space are encoded when the problem of Eq. (7) is minimized, and the resulting coefficients X are still sparse. This is because, as P_i^j gets large, x_i^j shrinks to zero. Therefore, most coefficients become zero, with just a few having significant values. The proposed KLSDSR approach integrates both data locality structure and sparsity in obtaining sparse coefficients in the high dimensional feature space, hence the capability of learning discriminative SR coefficients for classification. A kernelized exponential locality adaptor is implemented in our proposed method, as explained below:

P_i^j = ∅(exp(d_k(a_i, a_j) / σ)), (9)

where σ is a constant and d_k(a_i, a_j) is the kernel Euclidean distance induced by the kernel k, defined as







d_k(a_i, a_j) = √⟨∅(a_i) − ∅(a_j), ∅(a_i) − ∅(a_j)⟩ = √(k(a_i, a_i) − 2k(a_i, a_j) + k(a_j, a_j)). (10)

P_i^j increases exponentially with an increase in d_k(a_i, a_j), hence yielding a huge P_i^j, and in turn a small coefficient, when the samples a_i and a_j are far apart.

3.1. Optimization

In this section, the training sample KSR optimization and the closed-form solution of the objective function in Eq. (8) for our KLSDSR algorithm are presented analytically. An alternating optimization is implemented iteratively, adopting the theories of (Cai, Zuo, Zhang, Feng, & Wang, 2014; Feng, Yang, Zhang, Liu, & Zhang, 2013; Jiang, Lin, & Davis, 2011; M. Yang, Zhang, Feng, & Zhang, 2011). The objective function of Eq. (8) is reduced to the sparse coding problem of Eq. (11) to obtain the sparse coding coefficient vector x_i by updating the coefficient matrix X:

X = min_X ||∅(A_i^j) − DX||_F² + λ₁ Σ_{i=1}^N ||P_i^j ⊙ x_i^j||₂² + λ₂ Σ_{i=1}^N G(x_i^j). (11)

The X in Eq. (11) can be addressed on a class basis. Thus the x_i^j corresponding to class i can be derived as:

x_i^j = min_{x_i^j} ||∅(A_i^j) − Dx_i^j||₂² + λ₁ Σ_{i=1}^N ||P_i^j ⊙ x_i^j||₂² + λ₂ (||(1 − 1/N_j) x_i^j − (1/N_j) Σ_{l ∈ (1,N_j), l ≠ i} x_l^j||₂² + ||x_i^j||₂²), s.t. 1ᵀx_i^j = 1. (12)

Suppose the objective function of Eq. (12) is represented by g(x_i^j). Taking the first derivative of Eq. (12) with respect to the sparse coefficient vector x_i^j results in Eq. (13), based on the deductions below:

dg(x_i^j)/dx_i^j = d/dx_i^j (||∅(A_i^j) − Dx_i^j||₂² + λ₁||P_i^j ⊙ x_i^j||₂² + G(x_i^j))
= d/dx_i^j (κ(a_i^j, a_i^j) + (x_i^j)ᵀK x_i^j − 2κ(·, a_i^j)ᵀx_i^j + λ₁(x_i^j)ᵀdiag(P_i^j)² x_i^j + λ₂(||(1 − 1/N_j) x_i^j − (1/N_j) Σ_{l ∈ (1,N_j), l ≠ i} x_l^j||₂² + ||x_i^j||₂²))
= 2(K + λ₁ diag(P_i^j)²) x_i^j − 2κ(·, a_i^j) + 2λ₂(Y x_i^j + I x_i^j), (13)

where K = DᵀD is the kernel Gram matrix, κ(·, a_i^j) = Dᵀ∅(a_i^j), diag(P_i^j) is a diagonal matrix whose nonzero elements are the squares of the entries of P_i^j, and, with the remaining codes held fixed, Y denotes the linear operator of the within-class term, i.e. Y x_i^j = (1 − 1/N_j)((1 − 1/N_j) x_i^j − (1/N_j) Σ_{l ∈ (1,N_j), l ≠ i} x_l^j).

To obtain the analytical solution of Eq. (12), we set dg(x_i^j)/dx_i^j = 0; this results in Eq. (14):

2(K + λ₁ diag(P_i^j)²) x_i^j − 2κ(·, a_i^j) + 2λ₂(Y + I) x_i^j = 0, (14)

which gives us

x_i^j = (K + λ₁ diag(P_i^j)² + λ₂(Y + I))⁻¹ κ(·, a_i^j). (15)

By solving the problem of Eq. (15) iteratively, the sparse coefficient vectors for the training samples in the kernel space are obtained with x_i (1 ≤ i ≤ N).

3.2. Kernel locality-sensitive SR based classification for VSA

By solving Eq. (15), the analytical solution of Eq. (11) can be obtained directly in the kernel space and the reconstruction errors of the test samples from the dictionary are calculated. The SR based on the locality-sensitive constraint has a more discriminative analytical solution than the representation based on the l1-norm; the solution speed is also much faster, since the algorithm is much less complex. For a given test sample y ∈ R^m, ∅: R^m → H represents a kernel mapping that maps the test and training samples from the original feature space R^m into a high dimensional kernel feature space H. The number of training samples close to ∅(y) is then determined based on the Euclidean distance between ∅(y) and each training sample ∅(a_i). The test sample y and its kernel sparse representation coefficient vector x on the dictionary D can be obtained as given below:

x = arg min_x ||∅(y) − Dx||₂² + λ₁||P ⊙ x||₂², (16)

where D = [D₁, D₂, …, D_J] is the optimal dictionary for J classification categories, λ₁ is a scalar constant that weighs the locality-sensitive adaptor, ⊙ represents element-wise multiplication, and P = [P₁, P₂, …, P_K]ᵀ with every entry represented as P_k = ||∅(y) − ∅(d_k)||₂, which determines the Euclidean distance between the test sample y and the reconstructed dictionary atom. We let T(y) represent Eq. (16) as stated in Eq. (17) to obtain the solution of Eq. (16); the test sample can finally be classified after obtaining D:

T(y) = min_x ||∅(y) − Dx||₂² + λ₁||P ⊙ x||₂². (17)

Eq. (17) is easily simplified with first order derivatives, and the analytical solution x̃ is obtained on a class basis as

dT(x)/dx = d/dx (||∅(y) − Dx||₂² + λ₁||P ⊙ x||₂²)
= d/dx (κ(y, y) + xᵀKx − 2κ(·, y)ᵀx + λ₁ xᵀdiag(P)² x)
= 2(K + λ₁ diag(P)²) x − 2κ(·, y), (18)

where K = DᵀD and κ(·, y) = Dᵀ∅(y). The kernel sparse coding coefficient x is then obtained by normalizing x̃, which is determined by setting Eq. (18) to zero, dT(x)/dx = 2(K + λ₁ diag(P)²) x − 2κ(·, y) = 0; therefore,

x̃ = (K + λ₁ diag(P)²)⁻¹ κ(·, y), (19)

x = x̃ / (1ᵀx̃). (20)

Using the analysis discussed in (Harandi & Salzmann, 2015), the kernel sparse coding coefficient associated with the test sample y is determined, after which the residual is calculated according to

434
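As a concrete illustration, the closed-form coding of Eqs. (16)–(20) and the residual-based decision of Eqs. (21)–(23) can be sketched in NumPy as below. This is a minimal sketch assuming a Gaussian kernel; the function and parameter names (`klsdsr_classify`, `sigma`, etc.) are illustrative, not from the paper:

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # κ(a, b) = exp(-||a − b||² / (2σ²))
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def klsdsr_classify(D, labels, y, lam1=0.01, sigma=1.0):
    """Code test sample y over dictionary D (columns = atoms) and classify it.

    Implements Eq. (19) x̃ = (K + λ1·diag(P)²)⁻¹ κ(·, y), the normalization
    of Eq. (20), and the class-wise residuals of Eqs. (21)–(23), all
    evaluated in the kernel feature space (kernel trick, no explicit ∅).
    """
    M = D.shape[1]
    labels = np.asarray(labels)
    # Kernel Gram matrix K of the dictionary atoms and kernel vector κ(·, y).
    K = np.array([[gaussian_kernel(D[:, i], D[:, j], sigma)
                   for j in range(M)] for i in range(M)])
    kappa = np.array([gaussian_kernel(D[:, k], y, sigma) for k in range(M)])
    kyy = gaussian_kernel(y, y, sigma)
    # Locality adaptor: P_k = ||∅(y) − ∅(d_k)||² = κ(y,y) − 2κ(y,d_k) + κ(d_k,d_k).
    P = kyy - 2.0 * kappa + np.diag(K)
    x_tilde = np.linalg.solve(K + lam1 * np.diag(P ** 2), kappa)  # Eq. (19)
    x = x_tilde / x_tilde.sum()                                   # Eq. (20)
    residuals = {}
    for j in np.unique(labels):
        xj = np.where(labels == j, x, 0.0)                        # Eq. (22)
        # ||∅(y) − D x^j||² expanded with the kernel trick, Eq. (21).
        residuals[j] = kyy - 2.0 * kappa @ xj + xj @ K @ xj
    return min(residuals, key=residuals.get), x                   # Eq. (23)
```

A test vector lying close to an atom of a given class receives a coding concentrated on that atom (the locality adaptor barely penalizes it), so the residual of that class is smallest and the sample is assigned accordingly.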

B.-B. Benuwa, Y. Zhan and A. Monneyine et al. / Expert Systems With Applications 119 (2019) 429–440

Supposing there are J classes in the video semantic concept set, each class part of the dictionary reconstructs the test sample and the residual is computed as

r_j(y) = ||∅(y) − D x^j||²₂   for j = 1, 2, ..., J,   (21)

where x^j = [x₁ʲ, x₂ʲ, ..., x_Mʲ]ᵀ is the new sparse coding coefficient on the jth class and

xᵢʲ = { xᵢ,  aᵢ ∈ class j;
        0,   otherwise.   (22)

In this way, only the sparse coefficients associated with the training samples of class j are retained when the test sample y is reconstructed by the class-j part of the dictionary. The class whose dictionary columns are most closely associated with the test sample y yields the smallest reconstruction error, and the category with the minimum error is taken as the semantic concept category of the test video sample y. The test sample y is finally classified into the class that has the minimum residual, as formulated below:

class(y) = arg min_j r_j(y).   (23)

The proposed KLSDSR approach is summarized as follows. The approach integrates kernelized locality-sensitive and group sparsity constraints into the structure of KSR. The group sparsity constraint utilizes the group structure information of the training and testing video data and measures their similarity in the kernel feature space, so as to better explore the nonlinear relationships among video features. The locality-sensitive constraint enforces the encoding of input features through their neighboring dictionary atoms in the kernel space, besides satisfying the SR constraint. The exponential operator ensures that the coefficient associated with atom dk diminishes to zero as the kernel-space distance Pk grows. Each dictionary atom dk is obtained by the linear reconstruction of adjacent atoms, which maintains the similarity between dictionaries of the same kind; this serves as the basis on which similar input features produce similar representation solutions. Furthermore, the group sparsity term ensures that test samples are represented by training samples from fewer groups in the kernel space, enforcing discrimination by minimizing the number of nonzero reconstruction vectors for better classification results. The reconstruction error is computed and the dictionary updated until the error falls below the threshold. The sample features are obtained from the video samples as described in Section 4.1, and the dictionary D is determined from the training samples. The dictionary is obtained by mapping the dictionary features into a high dimensional feature space through the nonlinear kernel mapping for the training samples (i.e. Rd → RF), where F is the dimension of the high dimensional feature space and d is the dimension of the dictionary features, with d << F. The sparse learning dictionary is therefore mapped into the high dimensional feature space as D = [d1, d2, ..., dM] → D = [∅(d1), ∅(d2), ..., ∅(dM)].

As an extension of the GSLSDL (Benuwa et al., 2018) method, the proposed KLSDSR method seeks to obtain discriminative nonlinear information of all training samples in the high dimensional feature space. Therefore, the GSLSDL method is adopted to initialize the sparse coefficients of all the training samples in the kernel feature space for higher recognition performance. The classification procedure is carried out by firstly generating the dictionary D. Secondly, sparse coding is carried out to generate the sparse representation coefficients for classification, with an optimal X obtained by kernel locality-sensitive sparse coding until a maximum number of iterations is reached or the reconstruction error falls below a threshold value. Lastly, the minimum reconstruction error is determined and the test sample is classified into the class with the minimum reconstruction residual. The classification procedure of the proposed KLSDSR is summarized in Algorithm 1.

4. Experimental results and analysis

This section details the experimental results and analysis to demonstrate the effectiveness of the proposed method.

4.1. Video shot preprocessing

In the experiments, ordered-samples clustering based on artificial immune (Yongzhao et al., 2012) was adopted as the video key-frame extraction approach to extract static image frames from each original video. Afterwards, features consisting of the 5-dimensional radial Tchebichef moments (Mukundan, 2005), the 6-dimensional gray level co-occurrence matrix (GLCM) (Haralick & Shanmugam, 1973), the 81-dimensional HSV color histogram, and the 30-dimensional multi-scale local binary pattern (LBP) (Ojala, Pietikainen, & Maenpaa, 2002) were extracted from each key-frame. The details of these features can be found in (Zhan et al., 2016a). Fig. 1 shows some key-frames from the three video datasets that were used.

4.2. Database selection and experiment evaluation

Experiments were performed on three video datasets to evaluate the effectiveness and efficiency of our proposed method, as detailed in this section. Video samples from these datasets are shown in Fig. 1. Using the classification accuracy rate on video semantic concepts as the evaluation metric, we analyze and compare the performance of our proposed algorithm with the most closely related algorithms, namely KSRC, LSKSRC (S. Zhang & Zhao, 2014), KLSGSRC (Tan et al., 2017), LSDSR (Zhan et al., 2016a), LSDL (Wei et al., 2013a, 2013b) and GSLSDL (Benuwa et al., 2018). Out of the many dataset collections used for video categorization and classification, three were selected for our experimental setup and evaluation. The public datasets used are the TRECVID 2012 video dataset (TRECVID Video Dataset, 2012), the Open Video (OV) dataset (Open Video Dataset, 2006), and the YouTube video dataset (YouTube Video Dataset, 2011).
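The 122-dimensional key-frame descriptor described in Section 4.1 can be sketched as a simple concatenation of the four feature vectors. This is a minimal sketch; the extractor outputs are assumed to be precomputed, and the function name is illustrative:

```python
import numpy as np

def keyframe_descriptor(tchebichef, glcm, hsv_hist, lbp):
    """Concatenate the four per-key-frame features into one descriptor.

    Dimensions follow Section 4.1: 5 (radial Tchebichef moments),
    6 (GLCM statistics), 81 (HSV color histogram), 30 (multi-scale LBP),
    giving a 122-dimensional vector per key-frame.
    """
    parts = [np.asarray(tchebichef), np.asarray(glcm),
             np.asarray(hsv_hist), np.asarray(lbp)]
    expected = (5, 6, 81, 30)
    for p, n in zip(parts, expected):
        assert p.shape == (n,), f"expected length {n}, got {p.shape}"
    return np.concatenate(parts)  # 122-dimensional descriptor
```

Stacking one such descriptor per key-frame column-wise yields the 122 × N sample matrices (and the 122 × M dictionaries) used throughout Section 4.5.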
The TRECVID 2012 video dataset has airplane, baby, building, car, dog, flower, instrumental, mountain, scene and speech as its video semantic concepts, with each class containing 60 data samples; 50 of the samples were randomly selected as training samples and the remaining as test samples. The dataset has evolved over the years until 2017; however, there is no substantial difference between the contents and structure of these datasets. The YouTube video dataset comprises basketball, biking, diving, golf, horse, soccer, swing, tennis, trampoline, volleyball and walking as semantic concepts, with each class containing 70 data samples, out of which 60 were randomly selected as training samples and the rest as test samples. The semantic concepts of the OV dataset are parachute, aircraft, road, sea, rocket, satellite, star and face; each class consists of 70 samples, of which 60 are randomly selected as training samples and the rest as testing samples. The proposed approach uses the exponential locality adaptor based on the Gaussian kernel. All the experimental results were obtained by twentyfold cross validation, in which the training samples are selected randomly and the remaining used as testing samples for the various video datasets.

Fig. 1. Some key-frames from the video shots: (a) key-frames from the OV video set; (b) key-frames from the TRECVID 2012 video set; (c) key-frames from the YouTube video set.

4.3. Parameter selection

In the formulation of our objective function in Eq. (8), the proposed KLSDSR uses λ1, a positive weighting parameter for the locality-sensitive constraint, and λ2 for the discriminative weighting constraint. They are both important adjustment parameters contributing to the effective performance of the proposed KLSDSR technique. The values of λ1 and λ2 were tuned in the range 0.01 to 1 with a step size of 0.01, by holding one parameter fixed and varying the other. The parameter value that gave the best recognition result was selected for the experiment; thus the results of the proposed KLSDSR change as one parameter is varied with the other fixed. The relevant parameters were chosen based on the testing results on the various video datasets, with each experiment set by twentyfold cross validation, as follows: for the TRECVID 2012 video dataset, λ1 = 0.01, λ2 = 0.03; for the OV video dataset, λ1 = 0.01, λ2 = 0.05; for the YouTube dataset, λ1 = 0.1, λ2 = 0.01. The chosen parameters were then used to carry out the experiments on the respective datasets. It is worth noting that different parameter values were obtained for the respective datasets because of their different underlying structure. Based on the values stated above, experiments were conducted to validate the performance of our proposed algorithm, with the findings explained below.

4.4. Analysis of the proposed KLSDSR approach

In the proposed KLSDSR technique, the kernelized locality-sensitive adaptor is utilized to progressively regulate the dictionary learning procedure by enforcing the capture of local discriminative nonlinear information of the semantic video. Furthermore, a discriminative loss function based on group sparsity is incorporated into the structure of the proposed KLSDSR to enhance the discrimination ability of SR. In addition, the proposed algorithm yields an optimized dictionary that further enhances the discriminative classification of SR. We chose the samples of the baby and scene video semantic concepts from the TRECVID 2012 video dataset in an experiment to determine the discrimination ability of the proposed KLSDSR method, KSRC and KLSGSRC. The experiment was carried out with the same training sets for the three algorithms, to optimize the dictionaries and obtain the sparse coefficients of the testing samples. The test samples are then reconstructed by the optimized dictionaries and projected onto two-dimensional subspaces by PCA. The two-dimensional subspaces of KSRC, KLSGSRC and the proposed KLSDSR method with their reconstructed test samples are shown in the sample distribution graphs of Fig. 2(a), (b) and (c), respectively. There is no clear distinction among the inter-classes for KSRC after reconstruction of the test samples with the optimized dictionary, with only slight segregation of the test samples from the same class, as shown in Fig. 2. Even though the graph of KLSGSRC shows strong aggregation within the intra-class and considerable dispersion between the inter-classes, the proposed KLSDSR has clearer boundaries when reconstructing the test samples with the optimized dictionary. Moreover, the test samples from the same class are compact while those from different classes are well separated for the proposed KLSDSR technique. We can therefore conclude that the proposed KLSDSR obtains more effective discrimination information due to the introduction of a discriminant loss function based on group sparsity into the structure of the locality-sensitive sparse representation.

4.5. Classification results of different dictionary atoms

The SR of the dictionary varies as the number of dictionary atoms changes. For each class of the OV dataset we chose


Fig. 2. Reconstructed Samples from two classes by KSRC, KLSGSRC and KLSDSR.

Table 1
Classification results on different dictionary atoms of TRECVID 2012.

Number of dictionary atoms   10     20     30     40     50
Accuracy rate (%)            74.40  80.20  81.70  89.20  85.30

10, 20, 30, 40, 50 and 60 video features to represent the initial dictionary. The respective sizes of these dictionaries are 122 × 100, 122 × 200, 122 × 300, 122 × 400, 122 × 500 and 122 × 600, covering the ten classes of the OV video dataset. It can be seen from Table 2 that the proposed KLSDSR technique gives optimum performance for semantic analysis when the size of the dictionary is 122 × 500, yielding the highest accuracy of semantic analysis. Therefore, a dictionary size of 122 × 500 is most appropriate for VSA on OV, as indicated in Table 2.

For the classification performance of the proposed KLSDSR algorithm on video semantic detection with varying numbers of video features, we chose 10, 20, 30, 40 and 50 video features to constitute the initial dictionaries, as indicated in Table 1. The corresponding sizes of these dictionaries are 122 × 100, 122 × 200, 122 × 300, 122 × 400 and 122 × 500, covering the ten classes of the TRECVID 2012 video dataset. The proposed KLSDSR algorithm records its highest semantic analysis accuracy when the size of the dictionary is 122 × 400; a dictionary size of 122 × 400 is therefore chosen for semantic analysis on the TRECVID 2012 video dataset.

Finally, for the YouTube video dataset, we chose 10, 20, 30, 40, 50 and 60 video features, as indicated in Table 3, to constitute the initial dictionaries, since the dataset has 70 data samples per class. The corresponding sizes of these dictionaries are 122 × 110, 122 × 220, 122 × 330, 122 × 440, 122 × 550 and 122 × 660, covering the eleven classes of the YouTube video dataset. The proposed KLSDSR algorithm records its highest semantic analysis accuracy when the size of the dictionary is 122 × 550; a dictionary size of 122 × 550 is therefore appropriately chosen for semantic analysis on the YouTube video dataset.

4.6. Recognition results of different approaches

The recognition rates of the various VSA techniques on the three video datasets are shown in Table 4. It can be seen from the experimental results in Table 4 that the proposed KLSDSR approach obtained the highest recognition rates among the compared recognition approaches on all three datasets. Fig. 3 presents the average recognition rates of the various video semantic detection methods for each class of the OV video dataset. It is evident that the proposed KLSDSR method outperforms all the baseline approaches on each feature category, as shown in Table 4. The recognition rate of the KLSDSR approach on the OV dataset outperformed KSRC by 11.1%, LSKSRC by 9.8%, LSDL by 11.3%, KLSGSRC by 4.6%, LSDSR by 4.7% and our previous GSLSDL approach by 3.5%, confirming the effectiveness of the proposed KLSDSR method. The accuracies of the KLSDSR technique for the ten video semantic concepts of OV are 100%, 85%, 94%, 92%, 99%, 91%, 84%, 76%, 91% and 100%, respectively. Furthermore, it can be seen from Fig. 3 that the KLSDSR technique has the highest recognition precision in seven (7) of the categories compared with the existing methods. The KLSDSR approach can therefore effectively improve the accuracy of OV video semantic feature detection. Fig. 4 presents the average recognition rates of the various video semantic detection methods for each class of the TRECVID 2012 video dataset. This clearly indicates that the proposed method is the best among all the comparative approaches on each category, as highlighted in Table 4. The recognition rate of the proposed method on the TRECVID 2012 dataset outperformed KSRC by 9.4%, LSKSRC by 7.5%, LSDL by 10.5%, KLSGSRC by 4.2%, LSDSR by 3.3% and our previous GSLSDL approach by 2.4%, confirming the effectiveness of the proposed KLSDSR method on TRECVID 2012.
The accuracies of the proposed


Table 2
Classification results on different dictionary atoms of the OV dataset.

Number of dictionary atoms   10     20     30     40     50     60
Accuracy (%)                 72.10  78.60  83.70  83.20  91.20  87.50

Table 3
Classification results on different dictionary atoms of YouTube.

Number of dictionary atoms   10     20     30     40     50     60
Accuracy (%)                 75.53  80.90  83.47  86.45  90.17  87.27

Fig. 3. Recognition rate of OV videos with different algorithms.

Table 4
The recognition rates (%) of different video semantic analysis algorithms on the three video sets.

Comparative method   OV     TRECVID 2012   YouTube
KSRC                 80.10  79.80          81.00
LSDL                 79.90  78.70          80.45
LSDSR                86.50  85.90          85.81
LSKSRC               81.40  81.70          83.72
KLSGSRC              86.60  85.00          86.09
GSLSDL               87.70  86.80          86.27
KLSDSR               91.20  89.20          90.17
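As a quick arithmetic check, the YouTube improvement margins quoted in Section 4.6 can be recomputed directly from the Table 4 accuracies:

```python
# Table 4 accuracies (%) on the YouTube video dataset.
youtube = {"KSRC": 81.00, "LSDL": 80.45, "LSDSR": 85.81, "LSKSRC": 83.72,
           "KLSGSRC": 86.09, "GSLSDL": 86.27, "KLSDSR": 90.17}

# Margin of KLSDSR over each baseline, in percentage points.
margins = {m: round(youtube["KLSDSR"] - acc, 2)
           for m, acc in youtube.items() if m != "KLSDSR"}
print(margins)  # KSRC: 9.17, LSDL: 9.72, LSDSR: 4.36, LSKSRC: 6.45, KLSGSRC: 4.08, GSLSDL: 3.9
```

These match the percentage-point gains reported in the text for the YouTube dataset.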

KLSDSR technique for the ten video semantic concepts are 90%, 95%, 93%, 82%, 78%, 92%, 80%, 87%, 100% and 95%, respectively. As indicated in Fig. 4, the KLSDSR technique has the highest recognition precision in six (6) of the categories compared with the existing methods. The KLSDSR approach can therefore effectively improve the accuracy of TRECVID 2012 video semantic feature detection. Fig. 5 shows the average recognition rates of the various video semantic detection methods for each class of the YouTube video dataset. This clearly indicates that the proposed method is the best among all the comparative approaches on each category, as highlighted in Table 4. The recognition rate of the proposed method on the YouTube dataset outperformed KSRC by 9.17%, LSKSRC by 6.45%, LSDL by 9.72%, KLSGSRC by 4.08%, LSDSR by 4.36% and our previous GSLSDL approach by 3.9%, confirming the effectiveness of the proposed KLSDSR method on the YouTube dataset. The accuracies of the proposed KLSDSR technique for the eleven video semantic concepts are 90.90%, 94.55%, 93.63%, 86.36%, 78.18%, 91.81%, 81.81%, 88.18%, 99.09%, 91.91% and 95.45%, respectively. As indicated in Fig. 5, the KLSDSR technique has the highest recognition precision in five (5) of the categories compared with the existing methods. The KLSDSR approach can therefore effectively improve the accuracy of YouTube video semantic feature detection. The superior performance of the proposed KLSDSR approach over prior studies results from the integration of the kernel, the locality-sensitive adaptor and a loss function based on group sparse coding of the sparse coefficients into the structure of the representation solution. This enhances the power of discrimination for classification and thus improves the accuracy of video semantic analysis.

Fig. 4. Recognition rate of TRECVID 2012 videos with different algorithms.

Fig. 5. Recognition rate of YouTube videos with different algorithms.

4.7. Computational time

The computational times of the KLSGSRC, GSLSDL and proposed KLSDSR algorithms for VSA were measured and compared to evaluate the computational efficiency of the proposed KLSDSR approach. The selected approaches are all based on an integration of data locality and sparsity, and were considered in determining the per-key-frame computational time of classification for video semantic detection on the three datasets. Both the kernel sparse coding and the classification processes were included when measuring the computational time of the sparse coefficient operations. Although the proposed KLSDSR achieves the best recognition results, its computational time is somewhat higher than that of the GSLSDL algorithm, due to the computational intensity of kernel methods in the sparsity coefficient optimization process. However, the proposed KLSDSR has a lower computation time than KLSGSRC. Furthermore, the GSLSDL uses the same class to reconstruct dictionary atoms from other dictionaries in the sparse coefficient optimization process. The local adaptors are then constructed with the reconstructed dictionary and the training atoms to optimize the representation coefficient. A discriminative loss function based on group sparsity is also incorporated to optimize the dictionary for efficient classification results. In the KLSDSR algorithm, the training samples and the dictionary are projected into a high dimensional feature space, with optimization of the sparse coefficient and the local adaptor in

Table 5
Computation time in seconds for average recognition on the OV, TRECVID 2012 and YouTube video datasets.

Comparative method   OV    TRECVID   YouTube
KLSGSRC              2.22  2.10      2.50
GSLSDL               1.47  1.25      1.61
KLSDSR               1.79  1.84      1.93

Algorithm 1 Video semantic classification based on kernel locality-sensitive SR.

Input: Training set A = [A1, A2, ..., AJ], where Aj ∈ R^(d×n) contains the samples of class j; the test sample set y; and the regularization parameters λ1, λ2.
Method:
(a) Initialize the coding coefficient matrix X and i (i ← 1), and the initial dictionary D.
(b) Map the original data into the kernel feature space using the kernel function.
(c) Calculate the vector P between each training sample and a test sample in the kernel space by implementing the exponential locality adaptor with Eq. (9) to reconstruct the dictionaries.
(d) Get the kernel sparse representation coefficients of the training samples using Eq. (15).
(e) Get the kernel sparse representation coefficients of the test samples using Eqs. (19) and (20).
(f) Calculate the residuals in the kernel feature space with the samples closely related to the jth class using Eq. (21).
(g) Determine the class label of the test sample and return its category label using Eq. (23).
Output: The class label of the given test sample according to (g), i.e. class(y) = arg min_j r_j(y).
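Step (d) of Algorithm 1, the closed-form training-sample coding of Eq. (15), can be sketched as below. This is a minimal sketch: `Y` is assumed to be the precomputed matrix form of the within-class mean-deviation term of Eq. (12), and all names are illustrative:

```python
import numpy as np

def train_coefficient(K, kappa_i, P_i, Y, lam1=0.01, lam2=0.03):
    """Eq. (15): x = (K + λ1·diag(P_i)² + λ2(Y + I))⁻¹ κ(·, a_i),
    followed by the normalization enforcing the constraint 1ᵀx = 1.

    K       : M × M kernel Gram matrix of the dictionary atoms.
    kappa_i : kernel vector κ(·, a_i) of training sample a_i.
    P_i     : kernel-space locality adaptor vector for a_i.
    Y       : matrix form of the discriminative group-sparsity term (assumed given).
    """
    M = K.shape[0]
    A = K + lam1 * np.diag(P_i ** 2) + lam2 * (Y + np.eye(M))
    x = np.linalg.solve(A, kappa_i)
    return x / x.sum()
```

One such solve per training sample yields the coefficient matrix X used in the subsequent dictionary update.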

high dimensional space. The kernel functions are well suited to handling nonlinear structure, because the atoms of the dictionary classes become more similar when projected into a high dimensional feature space. For simplicity, 20 video features were selected for each video dataset to justify our claims. All the experiments were run on the same platform with an Intel(R) Core(TM) i5 3.7 GHz CPU and 8.0 GB RAM, using MATLAB R2016a. Table 5 reports the computational time on the three video datasets.
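A per-key-frame timing harness of the kind used for Table 5 can be sketched as follows (the experiments themselves were run in MATLAB; this Python sketch with illustrative names only shows the measurement pattern):

```python
import time

def average_recognition_time(classify, samples, repeats=20):
    """Average wall-clock seconds to sparse-code and classify one key frame.

    `classify` is any callable mapping a feature vector to a class label
    (e.g. the full KLSDSR coding-plus-classification pipeline); `samples`
    is a list of key-frame feature vectors; `repeats` mirrors the
    twentyfold evaluation protocol.
    """
    start = time.perf_counter()
    for _ in range(repeats):
        for s in samples:
            classify(s)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(samples))
```

Averaging over repeated runs smooths out scheduler noise, which matters when the per-frame differences between methods are only fractions of a second, as in Table 5.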

5. Conclusion and recommendation for future works

In this paper, a Kernel Locality-Sensitive Discriminative Sparse Representation (KLSDSR) algorithm for video semantic analysis is proposed. Besides considering data locality in the kernel space, the proposed KLSDSR method also takes into account the group structure information of the trained dictionary for video semantic detection. Furthermore, a loss function based on sparse coefficients is introduced into the structure of locality-sensitive discriminative SR to obtain an optimized dictionary. This enables efficient utilization and capture of the nonlinear information embedded in the structure of both the training and test video data, for more discriminative SR features and improved video semantic classification. Based on the experimental results obtained on the three video datasets, the proposed KLSDSR algorithm for video semantic analysis achieves better performance and is more effective than the other state-of-the-art approaches. Despite the superior results demonstrated by the proposed KLSDSR on the TRECVID, YouTube and OV video datasets, there is still a need to reduce the execution time and further improve the power of discrimination; hence, in future work we plan to extend our kernel scheme to a multiple-kernel technique with joint sparsity and a virtual dictionary.


Conflict of interest

The authors declare that there are no conflicts of interest whatsoever.

CRediT authorship contribution statement

Ben-Bright Benuwa: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing - original draft. Yongzhao Zhan: Conceptualization, Data curation, Formal analysis, Funding acquisition, Project administration, Resources, Supervision, Visualization, Writing - review & editing. Augustine Monneyine: Investigation, Software, Validation, Visualization. Benjamin Ghansah: Formal analysis, Investigation, Software, Writing - review & editing. Ernest K. Ansah: Formal analysis, Project administration, Resources, Visualization.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61672268, 61502208 and 61170126) and the Primary Research & Development Plan of Jiangsu Province of China (Grant No. BE2015137).

References

Abbasnejad, I., Sridharan, S., Denman, S., Fookes, C., & Lucey, S. (2017). Joint max margin and semantic features for continuous event detection in complex scenes. arXiv:1706.04122.
Abidine, B. M. H., Fergani, L., Fergani, B., & Oussalah, M. (2016). The joint use of sequence features combination and modified weighted SVM for improving daily activity recognition. Pattern Analysis & Applications, 21(1), 1–20.
Bai, T., Li, Y. F., & Zhou, X. (2015). Learning local appearances with sparse representation for robust and fast visual tracking. IEEE Transactions on Cybernetics, 45(4), 663–675.
Benuwa, B. B., Zhan, Y., Liu, J. Q., Gou, J., Ghansah, B., & Ansah, E. K. (2018). Group sparse based locality-sensitive dictionary learning for video semantic analysis. Multimedia Tools & Applications, 1–24.
Cai, S., Zuo, W., Zhang, L., Feng, X., & Wang, P. (2014). Support vector guided dictionary learning. Paper presented at the European Conference on Computer Vision.
Deng, W., Hu, J., & Guo, J. (2012). Extended SRC: Undersampled face recognition via intraclass variant dictionary. IEEE Transactions on Pattern Analysis & Machine Intelligence, 34(9), 1864.
Dumitrescu, B., & Irofti, P. (2018). Kernel dictionary learning. In Dictionary learning algorithms and applications. Cham: Springer.
Feng, Z., Yang, M., Zhang, L., Liu, Y., & Zhang, D. (2013). Joint discriminative dimensionality reduction and dictionary learning for face recognition. Pattern Recognition, 46(8), 2134–2143.
Gao, S., Tsang, I. W., & Chia, L. T. (2013). Sparse representation with kernels. IEEE Transactions on Image Processing, 22(2), 423–434.
Haralick, R. M., & Shanmugam, K. (1973). Textural features for image classification. IEEE Transactions on Systems, Man, and Cybernetics, 3(6), 610–621.
Harandi, M., & Salzmann, M. (2015). Riemannian coding and dictionary learning: Kernels to the rescue. Paper presented at the Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Haseyama, M., Ogawa, T., & Yagi, N. (2013). A review of video retrieval based on image and video semantic understanding. ITE Transactions on Media Technology and Applications, 1(1), 2–9.
Jiang, Z., Lin, Z., & Davis, L. S. (2011). Learning a discriminative dictionary for sparse coding via label consistent K-SVD. Paper presented at the Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.
Lee, Y.-S., Wang, C.-Y., Mathulaprangsan, S., Zhao, J.-H., & Wang, J.-C. (2016). Locality-preserving K-SVD based joint dictionary and classifier learning for object recognition. Paper presented at the Proceedings of the 2016 ACM on Multimedia Conference.
Liu, W., Yu, Z., Yang, M., Lu, L., & Zou, Y. (2015). Joint kernel dictionary and classifier learning for sparse coding via locality preserving K-SVD. Paper presented at the IEEE International Conference on Multimedia and Expo.
Mukundan, R. (2005). Radial Tchebichef invariants for pattern recognition. Paper presented at TENCON 2005, IEEE Region 10.
Nweke, H. F., Ying, W. T., Al-Garadi, M. A., & Alo, U. R. (2018). Deep learning algorithms for human activity recognition using mobile and wearable sensor networks: State of the art and research challenges. Expert Systems with Applications, 105, 233–261.
Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7), 971–987.
Open Video Dataset. (2006). http://open-video.org/index.php.
Shih, H.-C. (2017). A survey on content-aware video analysis for sports. IEEE Transactions on Circuits and Systems for Video Technology.


Song, J., Zhang, H., Li, X., Gao, L., Wang, M., & Hong, R. (2018). Self-supervised video hashing with hierarchical binary auto-encoder. IEEE Transactions on Image Processing, 27, 3210–3221.
Song, X., Shao, C., Yang, X., & Wu, X. (2017). Sparse representation-based classification using generalized weighted extended dictionary. Soft Computing, 21(15), 4335–4348.
Sun, X., Chan, W., & Qu, L. (2015). Robust face recognition with locality-sensitive sparsity and group sparsity constraints. Tianjin, China: Springer International Publishing.
Sun, Y., Liu, Q., Tang, J., & Tao, D. (2014). Learning discriminative dictionary for group sparse representation. IEEE Transactions on Image Processing, 23(9), 3816–3828.
Tan, S., Sun, X., Chan, W., Lei, Q., & Shao, L. (2017). Robust face recognition with kernelized locality-sensitive group sparsity representation. IEEE Transactions on Image Processing, 26(10), 4661–4668.
TRECVID Video Dataset. (2012). http://www-nlpir.nist.gov/projects/tv2012/tv2012.html.
Wang, B., Li, W., & Liao, Q. (2013). Illumination variation dictionary designing for single-sample face recognition via sparse representation. Paper presented at the International Conference on Multimedia Modeling.
Wang, B., Wang, Y., Xiao, W., Wang, W., & Zhang, M. (2012). Human action recognition based on discriminative sparse coding video representation. Robot, 34(6), 745.
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., & Gong, Y. (2010). Locality-constrained linear coding for image classification. Paper presented at Computer Vision and Pattern Recognition.
Wang, Z., Wang, Y., Liu, H., & Zhang, H. (2017). Structured kernel dictionary learning with correlation constraint for object recognition. IEEE Transactions on Image Processing.
Wei, C.-P., Chao, Y.-W., Yeh, Y.-R., & Wang, Y.-C. F. (2013a). Locality-sensitive dictionary learning for sparse representation based classification. Pattern Recognition, 46(5), 1277–1287.
Wei, C. P., Chao, Y. W., Yeh, Y. R., & Wang, Y. C. F. (2013b). Locality-sensitive dictionary learning for sparse representation based classification. Pattern Recognition, 46(5), 1277–1287.
Wright, J., Yang, A. Y., Ganesh, A., Sastry, S. S., & Ma, Y. (2009). Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(2), 210–227.
Wu, X., Li, Q., Xu, L., Chen, K., & Yao, L. (2016). Multi-feature kernel discriminant dictionary learning for face recognition. Pattern Recognition, 66(C), 404–411.
Xu, G., Ma, Y. F., Zhang, H. J., & Yang, S. Q. (2005). An HMM-based framework for video semantic analysis. IEEE Transactions on Circuits & Systems for Video Technology, 15(11), 1422–1433.
Xu, Y., Sun, Y., Quan, Y., & Zheng, B. (2015). Discriminative structured dictionary learning with hierarchical group sparsity. Computer Vision and Image Understanding, 136, 59–68.
Yang, F. F., Xi-Sheng, W. U., & Biao-Zhun, G. U. (2017). A face recognition algorithm based on low-rank subspace projection and Gabor feature via sparse representation. Computer Engineering & Science, 24, 460–480.
Yang, M., Zhang, L., Feng, X., & Zhang, D. (2011). Fisher discrimination dictionary learning for sparse representation. Paper presented at Computer Vision (ICCV), 2011 IEEE International Conference on.
Yongzhao, Z., Manrong, W., & Jia, K. (2012). Video keyframe extraction using ordered samples clustering based on artificial immune. Journal of Jiangsu University (Natural Science Edition), 2, 017.
YouTube Video Dataset. (2011). http://crcv.ucf.edu/data/UCF_YouTube_Action.php.
Yu, K., Zhang, T., & Gong, Y. (2009). Nonlinear learning using local coordinate coding. Paper presented at Advances in Neural Information Processing Systems.
Zha, Z., Zhang, X., Wang, Q., Bai, Y., Tang, L., & Yuan, X. (2018). Group sparsity residual with non-local samples for image denoising.
Zhan, Y., Liu, J., Gou, J., & Wang, M. (2016a). A video semantic detection method based on locality-sensitive discriminant sparse representation and weighted KNN. Journal of Visual Communication and Image Representation, 41, 65–73.
Zhan, Y., Liu, J., Gou, J., & Wang, M. (2016b). A video semantic detection method based on locality-sensitive discriminant sparse representation and weighted KNN. Journal of Visual Communication & Image Representation, 41.
Zhang, L., Zhou, W. D., & Li, F. Z. (2015). Kernel sparse representation-based classifier ensemble for face recognition. Multimedia Tools & Applications, 74(1), 123–137.
Zhang, S., & Zhao, X. (2014). Locality-sensitive kernel sparse representation classification for face recognition. Journal of Visual Communication & Image Representation, 25(8), 1878–1885.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
Zhang, L., Zhou, W. D., & Li, F. Z. (2015). Kernel sparse representation-based classifier ensemble for face recognition. Multimedia Tools & Applications, 74(1), 123–137. Zhang, S., & Zhao, X. (2014). Locality-sensitive kernel sparse representation classification for face recognition. Journal of Visual Communication & Image Representation, 25(8), 1878–1885. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.