Gesture recognition for human–machine interaction in table tennis video based on deep semantic understanding


PII: S0923-5965(19)30707-6
DOI: https://doi.org/10.1016/j.image.2019.115688
Reference: IMAGE 115688

To appear in: Signal Processing: Image Communication

Received date: 15 July 2019
Revised date: 21 October 2019
Accepted date: 3 November 2019

Please cite this article as: S. Xu, L. Liang and C. Ji, Gesture recognition for human-machine interaction in table tennis video based on deep semantic understanding, Signal Processing: Image Communication (2019), doi: https://doi.org/10.1016/j.image.2019.115688.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier B.V.


Gesture Recognition for Human-machine Interaction in Table Tennis Video Based on Deep Semantic Understanding

in dynamic scenes, the non-rigid motion of objects, objects' self-occlusion, and the occlusion between objects. Since human motion analysis has broad application prospects and potential economic value in advanced human-computer interaction (HCI), security monitoring, video conferencing, medical diagnosis, and content-based image storage and retrieval, research in this area is active and has attracted vast interest from researchers and industry worldwide. Deep learning of video semantic features has become a research hotspot in video analysis, especially in motion recognition. In recent years, with the development of the Internet and multimedia devices, network multimedia video data has grown exponentially, and high-level semantic concept extraction and recognition for unconstrained video has broad application prospects. At the same time, traditional video classification methods, such as keyword matching based on tag text [1], need to be improved to cope with massive Internet video data and the complexity of video content. Low-level feature extraction is the basis of video semantic analysis. The extracted low-level features can objectively express certain properties of the video surveillance object and, to a certain extent, reflect the behavioral semantics of the video object. In this paper, we design a video semantic feature learning method for the gesture recognition task that integrates image topological sparse encoding and dynamic time warping. We introduce topology constraints similar to TICA, while considering the feature-space arrangement of hidden-layer neurons combined with the analysis of structured sparse associations [2]. We take the topological relationship of planar neuron nodes as the topological structure.
The constraint term forms a topological sparse auto-encoder (TSAE) for neural network pre-training [3], which increases regularization in the parameter learning process and learns a video image feature representation consistent with the image's topological information. We divide video feature learning into two phases: 1) semi-supervised video image feature learning, where a new topological sparse encoder is constructed to pre-train the neural network parameters of each layer so that the feature representation of the video image reflects the topological information of the image; 2) supervised optimization of video sequence features, where a fully connected layer of video feature learning is constructed, and the


Abstract—The analysis of moving objects in videos, especially the recognition of human motions and gestures, is attracting increasing emphasis in computer vision area. However, most existing video analysis methods do not take into account the effect of video semantic information. The topological information of the video image plays an important role in describing the association relationship of the image content, which will help to improve the discriminability of the video feature expression. Based on the above considerations, we propose a video semantic feature learning method that integrates image topological sparse coding with dynamic time warping algorithm to improve the gesture recognition in videos. This method divides video feature learning into two phases: semi-supervised video image feature learning and supervised optimization of video sequence features. Next, a distance weighting based dynamic time warping algorithm and K-nearest neighbor algorithm is leveraged to recognize gestures. We conduct comparative experiments on table tennis video dataset. The experimental results show that the proposed method is more discriminative to the expression of video features and can effectively improve the recognition rate of gestures in sports video. Index Terms—video semantic learning, gesture recognition, human-machine interaction, table tennis, topological information, dynamic time warping


*Corresponding author

I. INTRODUCTION

With the advent of the era of big data, a large number of videos are produced every day, and videos play an increasingly important role in people's daily lives. This has been accompanied by continuous breakthroughs and innovation in computer vision technology. Video data is not limited to Internet applications: with the ongoing construction of safe and smart cities, video surveillance systems have been widely deployed in people's work and life, creating massive video data. The analysis of moving objects has been a frontier direction in computer vision in recent years. It consists of detecting, identifying, tracking, and understanding objects, and describing target behavior from image sequences containing those objects; it belongs to the category of image analysis and understanding. From a technical point of view, the research content of object motion analysis is quite broad, drawing on pattern recognition, image processing, computer vision, artificial intelligence, etc. However, there are still many challenges in the motion analysis of objects in videos, including the rapid segmentation of motion


Shuping Xu1*, Lixin Liang1, Chengbin Ji2
1 North China Electric Power University, Beijing, China
2 Beijing College of Politics and Law, Beijing, China
1 [email protected], 1 [email protected], 2 [email protected]


DTW algorithm, the recognition rate is still low; also, it only uses the motion trajectory of the palm as an input feature, and can only recognize simple gestures, while being less robust to complex gestures.


B. Video Semantic Learning
Most existing methods are based on global features of the original video frames (color, edge detection, Gabor, etc.), or first obtain local features (SIFT, MoSIFT, etc.) [7-8], then apply BoW or other methods [9] to convert the local features into global feature descriptions, and finally feed them to a classifier. These traditional methods inevitably rely on hand-crafted features. Deep learning transforms the feature representation of a sample into a new feature space by layer-by-layer feature transformation [10], which is more convenient for classification and prediction. Deep learning techniques have enabled successful applications in computer vision, speech recognition, and natural language processing [11]. In the field of video semantic analysis based on deep learning, Wu et al. [12] proposed a multi-linear principal component analysis network (MPCANet) to learn high-level semantic features of video for target classification. Liu et al. [13] proposed over-complete independent component analysis (OICA) to learn video spatiotemporal features for motion recognition. Gammulle et al. [14] proposed a human motion recognition method based on a convolutional neural network (CNN) and long short-term memory (LSTM). Research shows that deep learning methods play an important role in improving the accuracy of video semantic analysis. Compared with image recognition tasks such as handwriting recognition, video content is more complex since the target may undergo variations including rotation, scaling, and translation. Therefore, the feature extractor used in video semantic detection needs to be robust to such variations in order to extract more invariant characterizations. Ng et al.
[15] pointed out that the optic neurons on the retina have adjacent similarity, that is, the activation of the current neuron affects the degree of activation of peripheral neurons, and such neighborhood correlation can help to form ordered features in feature learning. Hyvarinen et al. [16] obtained topographic independent component analysis (TICA), which ensures strong correlation of neighboring components by adding topological constraints to the independent component analysis (ICA) model; they verified that topographic characteristics are beneficial for image recognition problems. Similar experiments [17-18] show that this topological correlation yields better invariance to object rotation, scaling, and translation. Previous convolutional neural networks for video image feature learning [19-22] mainly focused on the design of the network structure, did not exploit the topological correlation information of adjacent neuron nodes, and the convolution kernels in the same layer lack correlation.
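For context, the local-to-global BoW conversion mentioned above amounts to nearest-codeword quantization followed by a histogram; the codebook below is a toy stand-in, not a trained one:

```python
import numpy as np

def bow_histogram(local_descriptors, codebook):
    """Quantize local descriptors (e.g. SIFT vectors) against a learned
    codebook and return an L1-normalized bag-of-words histogram, i.e. a
    single global feature for the frame."""
    # distance from every descriptor to every codeword
    d = np.linalg.norm(local_descriptors[:, None, :] - codebook[None, :, :],
                       axis=2)
    words = d.argmin(axis=1)  # hard assignment to the nearest codeword
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy usage: 10 eight-dimensional descriptors, a 4-word codebook.
rng = np.random.default_rng(0)
desc = rng.normal(size=(10, 8))
cb = rng.normal(size=(4, 8))
h = bow_histogram(desc, cb)
```

The resulting histogram is the fixed-length global descriptor that is then passed to the classifier.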


video sequence key frame features are integrated. Then, logistic regression constraints are established, and network parameters are fine-tuned to obtain more discriminative recognition results. Next, a distance weighting based dynamic time warping algorithm and the K-nearest neighbor algorithm are applied to recognize gestures based on a skeleton sequence model. In table tennis competitions, sports experts need to accurately locate and recognize the action postures of table tennis players in order to improve athletic performance. However, in past table tennis competitions, the definition of a player's action posture has been vague, which easily leads to large discrimination errors. Posture recognition methods for table tennis players are an effective way to solve this problem and have attracted the attention of many experts and scholars. Therefore, in this paper, we choose table tennis videos as the research target for the proposed gesture recognition algorithm. The major contributions of this paper are three-fold: 1) Considering the correlation of image edges and the neighborhood structure of neurons, we add a new topological information constraint to form a topological sparse encoder in the semi-supervised learning of video image features, which is used to pre-train the weight factors of the neural network so that the video image features learned by the deep network carry topological order information. 2) In the fully connected layers of the video feature learning framework, a set of logistic regression constraints is built in which the key frame features of the video sequence are integrated. By fine-tuning the network parameters, we obtain more discriminative video sequence features. 3) Based on the extracted video feature sequences, a distance weighting based dynamic time warping (DW-DTW) algorithm and K-nearest neighbor (KNN) are combined to improve the accuracy of gesture recognition.

II. RELATED WORK


A. Gesture Recognition
With the rapid development of computer technology, human-computer interaction (HCI) has gradually shifted from computer-centered to human-centered. As a common mode of interpersonal communication, gestures are natural, intuitive, and easy to understand. Therefore, gestures have become a main method of the new generation of human-computer interaction and are widely used in remote control, virtual reality, medical diagnosis, and other fields [4]. Shotton et al. [5] proposed a method for recognizing human poses and predicting the positions of skeleton nodes from a single-frame Kinect depth image. By initializing on each image frame, it avoids the long re-initialization and weak robustness of other methods; it achieves high prediction accuracy and strong real-time performance and can be applied in complex human-computer interaction environments. Among existing gesture recognition methods, [6] uses a vector containing 15 skeleton nodes as the feature, and then uses Feature Weighting-Dynamic Time Warping (FW-DTW) to perform gesture boundary segmentation and recognition. In a test on 5 gestures, the highest recognition rate was 76%. There are still some shortcomings in these two methods: although the method proposed in [6] is better than the classic

C. Sparse Auto-Encoding
As an unsupervised training model, the sparse auto-encoder (SAE) automatically learns a nonlinear mapping to extract the characteristics of the input data. The cost function of the SAE model is




$$J(\mathbf{W}, \mathbf{b}_e, \mathbf{b}_d) = \frac{1}{2N}\sum_{i=1}^{N}\left\|\hat{\mathbf{X}}_i - \mathbf{X}_i\right\|^2 + \lambda\sum \mathbf{W}^2 + \beta\sum_{j} \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) \tag{1}$$

$$\mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) = \rho\ln\frac{\rho}{\hat{\rho}_j} + (1-\rho)\ln\frac{1-\rho}{1-\hat{\rho}_j} \tag{2}$$
III. VIDEO SEMANTIC FEATURE LEARNING


$$\hat{\rho}_j = \frac{1}{N}\sum_{k=1}^{N} Y_j^{(k)} \tag{3}$$

In (1) and (2), $\mathbf{W}, \mathbf{b}_e, \mathbf{b}_d$ are the weight factors of the hidden neurons in the sparse encoding network. The first term on the right of (1) denotes the model reconstruction error on the training samples, where $\hat{\mathbf{X}}_i = g(\mathbf{W}\mathbf{Y}_i + \mathbf{b}_d)$ and $\mathbf{Y}_i = f(\mathbf{W}\mathbf{X}_i + \mathbf{b}_e)$; $f(\cdot)$ is the encoding function, $g(\cdot)$ the decoding function, and $N$ is the sample number. The third term on the right of (1) is the KL sparsity penalty that forces the model to learn a sparse representation of the target data; the KL term is selected for robustness [23]. $\rho$ is the sparsity parameter, and $\hat{\rho}_j$ is the average

activation value of the j-th hidden neuron over the N samples. By learning a sparse expression, the accuracy of the classification task can be effectively improved and the feature expression becomes easier to interpret. Ng et al. [15] introduced topological sparse coding to bring some kind of "order" to the learned model. Kavukcuoglu et al. [24] realized similar topological feature filter mapping by weighted grouping, and confirmed that adding topological constraints yields image features that are invariant to rotation, scaling, and translation, so that the learned feature representation reflects image topology information. The above methods generate neighboring groups of the same size for each feature node when constructing the topology association. However, when used for video image feature learning, the discontinuity of the video image boundary is not considered.
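The cost in (1)-(3) can be written down directly in NumPy; the tied encoder/decoder weights and sigmoid activations here are illustrative assumptions, not necessarily the paper's exact architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sae_cost(W, b_e, b_d, X, lam=1e-3, beta=3.0, rho=0.05):
    """Sparse auto-encoder cost of Eqs. (1)-(3): reconstruction error,
    weight decay, and KL sparsity penalty. X is (N, d), W is (hidden, d);
    tied weights and sigmoid units are assumptions for this sketch."""
    N = X.shape[0]
    Y = sigmoid(X @ W.T + b_e)     # encoder activations, Y = f(WX + b_e)
    X_hat = sigmoid(Y @ W + b_d)   # decoder reconstruction
    recon = ((X_hat - X) ** 2).sum() / (2 * N)
    decay = lam * (W ** 2).sum()
    rho_hat = Y.mean(axis=0)       # average activation per neuron, Eq. (3)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))).sum()
    return float(recon + decay + beta * kl)

# Toy usage on random data.
rng = np.random.default_rng(0)
X = rng.random((20, 6))
W = rng.normal(scale=0.1, size=(8, 6))
cost = sae_cost(W, np.zeros(8), np.zeros(6), X)
```

Pre-training minimizes this cost over W, b_e, b_d before the weights are transplanted into the CNN.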


Figure 1. Video image feature learning based on topological sparse pre-trained CNN.

At the full connection layer (FC1), the parameters are fine-tuned with a Softmax activation to learn the characteristics of the video image. The image features of the video key frames are learned by the Softmax-optimized fully connected layer (FC2) to obtain the video segment features. Finally, the video segment features are sent to the SVM for video semantic concept modeling, in which the number of network layers and the settings of its parameters are determined experimentally.


The overall feature learning framework based on topological sparse coding pre-trained CNN proposed in this paper is shown in Figure 1. The model learning is divided into two stages: a semi-supervised video image feature learning phase, and a supervised video segment feature optimization phase. For each input video frame image, the unsupervised topological sparse pre-trained neural network learns video image features carrying topological order information.

Figure 2. An illustration of the topological sparse auto-encoder.

The parameters of the convolution kernel can be obtained by unsupervised sparse auto-encoder pre-training. Features learned by the traditional SAE do not consider the relevance of features extracted by adjacent neuron nodes. For image data, the value of a

pixel at a certain point is always closely related to the surrounding pixel values. In the visual neural network, optic neurons have adjacent similarity; thus, when the pixel value of the current position is input to the neural network, if the current neuron is



association of the video image boundary. The new-TSAE is shown in Figure 3 (b). Consider hidden neuron nodes arranged as a $\sqrt{row} \times \sqrt{row}$ matrix, and consider the position of each neuron node in the matrix. We do not form neighbor relationship groups across the matrix boundary, thereby defining topological adjacency regions of different sizes: only the neighbor relationships of neurons within the plane are considered, without connecting the upper and lower boundary nodes, thus forming a topological association with unconnected boundaries. To balance the influence of groups with different numbers of adjacent nodes, we use the inverse of the number of nodes in a group as the factor of the group topology constraint. The target cost function of the new-TSAE model with the group topology constraint is

$$J_{\text{TSAE}}(\mathbf{W}, \mathbf{b}_e, \mathbf{b}_d) = J(\mathbf{W}, \mathbf{b}_e, \mathbf{b}_d) + \gamma \sum_{k} \sum_{g} \frac{1}{n_g} \sqrt{\sum_{j \in G_g} \big(Y_j^{(k)}\big)^2} \tag{4}$$

In (4), the second term on the right is the extended topological constraint term. $\gamma$ is a hyper-parameter that defines the importance of the topological term in the overall objective function, $G_g$ is the $g$-th topological group with $n_g$ nodes, and $Y_j^{(k)}$ is the activation of neuron $j$ for the $k$-th sample.

IV. DW-DTW BASED GESTURE RECOGNITION


activated, then nearby neurons should also have similar activation states. Therefore, the similarity of the states of the surrounding structure can be fully considered in the learning of the neural network by forming topological groupings as constraints. The advantage of this constraint is that it drives the learned video image features toward some sort of topological order. In order to make the extracted image features exhibit such a topological order, that is, adjacent neurons activated to similar states, the activation values of adjacent neurons are added to the sparse coding as a vector-norm constraint term, establishing the topological sparse coding. As shown in Figure 2, for the green neuron, the topologically associated neurons are the red neurons. The hidden-layer neuron nodes of the TSAE model are arranged in rows. For convenience, we group hidden layer neuron nodes based on adjacency relationships: each hidden layer node forms a grouping with its neighboring nodes. Considering a certain neuron as the center, we expect that in the two-dimensional matrix arrangement grid, the neurons in its adjacent area behave similarly, where a neighboring group is formed by a square window over the adjacent area. The first row and first column form one group, the first row and second column form another group, and the groups partially overlap as the window slides over the grid matrix. Combining the grouping relationships of the grouping matrix, the L2 penalty on the hidden-layer state of each group can be added to the sparse coding model of (1) to form the topological constraint of feature similarity between adjacent nodes.
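The overlapping square grouping and its group-wise L2 penalty described above can be sketched as follows (the 3 × 3 window size is an illustrative assumption):

```python
import numpy as np

def sliding_groups(side, win=3):
    """All win x win windows, sliding by one step over a side x side grid
    of hidden units; adjacent groups partially overlap, as described above.
    Returns a list of index arrays into the flattened hidden layer."""
    groups = []
    for r in range(side - win + 1):
        for c in range(side - win + 1):
            idx = [(r + dr) * side + (c + dc)
                   for dr in range(win) for dc in range(win)]
            groups.append(np.array(idx))
    return groups

def group_l2(Y, groups):
    """Topological constraint added to the sparse coding cost of (1):
    sum over samples and groups of the L2 norm of each group's hidden
    activations Y (shape: samples x hidden units)."""
    return float(sum(np.sqrt((Y[:, g] ** 2).sum(axis=1)).sum()
                     for g in groups))

# Toy usage: a 4 x 4 grid of hidden units and two samples.
gs = sliding_groups(4, 3)
penalty = group_l2(np.ones((2, 16)), gs)
```

Because the windows overlap, minimizing this penalty encourages neighboring units to share activation patterns.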


When gestures in the video are recognized based on the video features extracted above, since the durations and speeds of different gestures differ, the lengths of the extracted motion trajectories $D_i = d_{i1}, d_{i2}, \ldots, d_{iT}$ may also differ, so gesture recognition can be treated as a time series classification problem. Dynamic Time Warping (DTW) [25] uses dynamic programming to find the optimal matching of two sequences, thus defining a distance metric between them, which handles the unequal lengths of time series well. DTW is widely used in sign language recognition, signature recognition, information retrieval, and other research fields. Another widely used algorithm is the Hidden Markov Model (HMM). The study in [26] specifically evaluates these two commonly used algorithms in gesture recognition, pointing out the advantages of DTW over HMM: (1) when the parameters are optimal, the recognition rate of DTW is higher; (2) DTW requires fewer samples for an equal recognition rate; (3) the recognition time of DTW is proportional to the length of the sequence, which is acceptable in most applications. The DTW algorithm can calculate the DTW distance between the training sample and the test sample for each node separately. The motion state of each node differs across gestures, so the DTW distance of each node contributes differently to the final classification result. If all DTW distances are simply averaged, most gestures can be distinguished, but the recognition rate needs improvement. The method in [27] increases the recognition rate to some extent by assigning different weights to each feature in the feature vector. This paper improves the weighting method of [27] and proposes a new distance-weighted dynamic time warping algorithm (DW-DTW).
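For reference, the classic DTW distance that DW-DTW builds on can be implemented with the standard dynamic program (a sketch; the per-frame distance is assumed Euclidean):

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two sequences of possibly different lengths,
    computed by dynamic programming; frames may be scalars or vectors."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # frame distance
            # extend the cheapest of the three allowed warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```

For example, `dtw_distance([0, 0, 1, 1], [0, 1])` is 0: the shorter sequence is warped to match the longer one exactly.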
For the training sample $R_g = R_1^g, R_2^g, \ldots, R_N^g$ of gesture $g$, assume that it consists of the time series of $N$ nodes, where the total distance moved by node $i$ is defined by

$$D_i = \sum_{t=2}^{F} d\big(r_i^t, r_i^{t-1}\big) \tag{5}$$

Figure 3. Different neighborhoods for the two kinds of TSAE.
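The boundary-truncated neighborhoods of Figure 3 (b), together with the 1/n_g weighting used in Eq. (4), can be sketched as follows (the window size 3 and the exact normalization are assumptions of mine):

```python
import numpy as np

def boundary_groups(side, win=3):
    """One group per hidden unit on a side x side grid, truncated at the
    borders (no wrap-around), so border groups are smaller -- the
    boundary-unconnected neighborhoods of the proposed new-TSAE."""
    half = win // 2
    groups = []
    for r in range(side):
        for c in range(side):
            idx = [rr * side + cc
                   for rr in range(max(0, r - half), min(side, r + half + 1))
                   for cc in range(max(0, c - half), min(side, c + half + 1))]
            groups.append(np.array(idx))
    return groups

def topo_penalty(Y, groups, gamma=0.003):
    """Second term of Eq. (4): for each sample and group, the L2 norm of
    the group's activations, weighted by 1/n_g (the group size), so that
    the smaller border groups are balanced against interior ones."""
    total = 0.0
    for g in groups:
        total += np.sqrt((Y[:, g] ** 2).sum(axis=1)).sum() / len(g)
    return gamma * total

# Toy usage: a 3 x 3 grid; corner groups have 4 nodes, edges 6, center 9.
gs = boundary_groups(3)
p = topo_penalty(np.ones((1, 9)), gs, gamma=1.0)
```

Unlike the ring topology of Figure 3 (a), no group here wraps across an image edge.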


Andrew Ng et al. [15] curved the neuron arrangement plane into a ring topology to obtain topological sparse coding with connected boundaries. This method gives each neuron node a neighboring area of the same size, as shown in Figure 3 (a). Since the upper and lower edges and the left and right edges of a video image are not continuous spaces, we propose to take the topological association relationship of the planar neuron nodes as the topology constraint, forming topological sparse coding with unconnected boundaries. We call this new topological sparse auto-encoding new-TSAE. It eliminates the association constraint of the original topology across the boundary, and can avoid the interference of video feature learning across the feature


$$w_i^g = \frac{D_i^{\beta}}{\sum_{j=1}^{N} D_j^{\beta}} \tag{6}$$

Therefore, the weight of a node will be zero if it stays still in gesture $g$. The final DTW distance between the training and testing samples is

$$\delta_g = \sum_{i=1}^{N} w_i^g\, \delta_i \tag{7}$$

where $w_i^g$ is the weight of node $i$ in training sample $g$ and $\delta_i$ is the DTW distance of node $i$. Assume that $D(\beta)$ is the inter-class variance of all training samples under the weighting parameter $\beta$; following [32], define the discriminant

$$R(\beta) = D(\beta) \tag{8}$$

The optimal parameter $\beta^*$ is the value that maximizes $R(\beta)$:

$$\beta^* = \arg\max_{\beta} R(\beta) \tag{9}$$
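A small sketch of the DW-DTW combination; the movement-proportional weight with exponent β is my reading of Eqs. (5)-(9) as extracted, not necessarily the authors' exact formula:

```python
import numpy as np

def node_movement(seq):
    """Eq. (5): total distance moved by one skeleton node over a gesture,
    summing displacements between consecutive frames."""
    seq = np.asarray(seq, dtype=float)
    if seq.ndim == 1:
        seq = seq[:, None]
    return float(np.linalg.norm(np.diff(seq, axis=0), axis=1).sum())

def dw_dtw_score(node_dtw_dists, movements, beta=1.0):
    """Eqs. (6)-(7): combine per-node DTW distances with weights
    proportional to each node's movement in the training gesture, so a
    still node gets weight 0. The exponent beta is the tunable parameter
    of Eqs. (8)-(9). Assumes at least one moving node."""
    m = np.asarray(movements, dtype=float) ** beta
    w = m / m.sum()
    return float(w @ np.asarray(node_dtw_dists, dtype=float))

# Toy 1-NN decision over two hypothetical training gestures:
scores = {"serve": dw_dtw_score([0.2, 1.5], [10.0, 0.0]),
          "push": dw_dtw_score([0.9, 0.3], [4.0, 6.0])}
pred = min(scores, key=scores.get)
```

A 1-NN (or KNN) classifier then assigns the test gesture the label of the training sample(s) with the smallest weighted distance.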

V. EXPERIMENTS

Figure 4. Performance comparison on sparse and weight parameter selection of different SAE methods.

In this paper, we choose table tennis videos as the research target to test the proposed gesture recognition algorithm. The training and testing videos are collected by the MotionXtra HG-LE in the REDLAKE series. The basic performance parameters of this model are: resolution 752×1128 pixels, 30-bit color; shooting frequency 1000 fps, up to 100,000 fps, adjustable; transmission via 100/1000-Mbps Ethernet remote control. The videos can be quickly downloaded to local and remote computers. We selected six categories from the video clips: serve, short push, throw, prepare, slide, and hit. For data balance, 30 video clips were selected for each category to form the entire data set. The experimental environment is an i7 processor and a GTX 780 Ti graphics card, based on CUDA 6.5 and Theano 0.7. For each run, 120 samples are randomly selected as the test set, and the rest form the training set. The number of video key frames is set to 3. The experiment follows the hyper-parameter tuning guidelines in [28]. The hyper-parameters are: β=4; the unsupervised learning rate is 1e-2, the unsupervised batch size is 256, and the number of iterations is 3000; in supervised optimization, the learning rate is 1e-4, the batch size is 30, and the number of iterations is 5000.

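For reference, the training configuration above can be collected into one structure (the field names are my own shorthand; the values are those stated in the text):

```python
# Hyper-parameters reported for the table tennis experiments; field
# names are illustrative shorthand, values are from the text above.
CONFIG = {
    "kl_beta": 4,                 # sparsity penalty weight in Eq. (1)
    "unsup_learning_rate": 1e-2,  # unsupervised pre-training
    "unsup_batch_size": 256,
    "unsup_iterations": 3000,
    "sup_learning_rate": 1e-4,    # supervised optimization
    "sup_batch_size": 30,
    "sup_iterations": 5000,
    "key_frames": 3,              # key frames per video clip
    "test_samples_per_run": 120,  # random test split per run
}
```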

In (5), $F$ denotes the total frame number of gesture $g$, which is also the sequence length of each node, and $d(\cdot,\cdot)$ denotes the distance between the node's positions in adjacent frames. The weight of node $i$ in gesture $g$ is then defined in (6).


Combining the effects of different weight penalty coefficients and topological term weights on the experimental accuracy, the final topology weight parameter is selected as 0.003.


Considering the influence of the sparsity parameter $p$, the weight penalty coefficient $\lambda$, and the topological penalty weight $\gamma$ on model classification accuracy, this paper optimizes topology-free sparse coding, boundary-connected topological sparse coding, and boundary-unconnected topological sparse coding, using each pre-trained CNN to learn video features. We use the highest accuracy of video gesture recognition to determine the choice of the relevant hyper-parameters. The weight penalty parameter is initially fixed to 0.001. The selection of the sparsity parameter $p$ affects the learning of the features and the classification results using the learned features. As shown in Figure 4 (a), the recognition accuracy is very low when the sparsity parameter is set small, and the optimum is obtained when the sparsity parameter is 0.25. Therefore, the sparsity parameter for the experimental data set is set to 0.25. The weight penalty selection is shown in Figure 4 (b): increasing the weight penalty coefficient can speed up the decay of the filter weights, but may lead to an excessive weight penalty. The γ selection experiment is shown in Figure 5.


Figure 5. Performance comparison of TSAE methods on γ selection.



edge in a certain direction, peripheral neurons respond to directions that deviate slightly from it, enabling the model to learn more ordered features. However, with boundary-connected topological sparse coding, the neurons on the upper and lower and left and right borders have similar response weights. The boundary-unconnected topological sparse coding proposed in this paper also has the neighboring-similarity characteristic, but eliminates the similar weights between the upper/lower and left/right border neurons; that is, the filter weights do not carry this similarity constraint across the upper and lower edges and the left and right edges of the feature space. In fact, the upper and lower edges and the left and right edges of a video image are not continuous spaces. Therefore, using the proposed topological sparse coding to pre-train the CNN for learning video image characteristics is more consistent with the nature of video images.


With the above experimental parameters, this paper applies the non-topological sparse coding, boundary-connected topological sparse coding, and boundary-unconnected topological sparse coding models to unlabeled image blocks randomly sampled from the video library for unsupervised learning. The visualization of the filter weights of the three pre-training methods is shown in Figure 6, which shows the filter weights corresponding to the neurons of the first CNN layer learned from 7 × 3 × 3 RGB image blocks of video frames. From the visualization of the filter weights corresponding to 400 neurons in Figure 6, it can be seen that the neurons in the non-topological sparse coding model respond only to sparse information in the data and appear in an unordered form. With topological sparse coding, by adding topological constraints, the features learned by the sparse encoder exhibit peripheral similarity, and the weight visualization shows a spiral gradual trend; that is, if the current neuron responds to the

Figure 6. Visualization of filters learned by different pre-trained models.

and video features are learned with a CNN. HMM-FNN is a gesture recognition algorithm combining a Hidden Markov Model with a fuzzy neural network. SSC denotes a supervised-learning based image representation algorithm, and SPM is a multi-layer pyramid kernel for image matching. All video features are analyzed by SVM modeling and semantic concept classification. The experiment compares the recognition results of the SIFT-BOW, LBP-Hist, SAE-CNN, old-TSAE-CNN, and new-TSAE-CNN methods on the six gesture recognition tasks, as shown in Figure 7. The new-TSAE proposed in this paper achieves better accuracy than the other methods for most gestures.

Figure 7. Performance of different methods on 6 gesture categories.


In this paper, a number of different feature extraction and deep learning methods are selected for a 10-fold cross-validation experiment. SIFT-BOW extracts the SIFT operator separately for the key frame sequence, and then converts it into a global feature by the BOW method [29]; LBP-Hist performs LBP feature extraction on the key frames first, and then adopts a histogram to convert them into global features [30]; SAE-CNN uses SAE for CNN pre-training, where the CNN learns video features; old-TSAE-CNN uses a CNN pre-trained with boundary-connected topological sparse coding; and new-TSAE-CNN is a CNN pre-trained with boundary-unconnected topological sparse coding,

Table 1. Test accuracy of different gesture recognition methods on the table tennis video dataset.

Methods         Accuracy / %
SIFT-BOW        51.46
LBP-Hist        52.34
MPCANet         62.27
OICA            63.51
SAE-CNN         71.33
old-TSAE-CNN    80.18
HMM-FNN         61.75
Shape context   57.26
SSC             77.80
SPM             73.36
our method      82.07


LSTM is more reasonable at the video sequence level. In the future, we plan to combine LSTM and other deep learning methods with the proposed method to learn the complex video sequence feature expression, and further improve the video semantic understanding and gesture recognition performance.


Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (No. JB2017MS067).

REFERENCES

[1] Ye, Guangnan, et al. "Large-scale video hashing via structure learning." Proceedings of the IEEE International Conference on Computer Vision, 2013.
[2] Du, Lei, et al. "Structured sparse canonical correlation analysis for brain imaging genetics: an improved GraphNet method." Bioinformatics 32.10 (2016): 1544-1551.
[3] Erhan, Dumitru, et al. "Why does unsupervised pre-training help deep learning?" Journal of Machine Learning Research 11 (2010): 625-660.
[4] Mitra, Sushmita, and Tinku Acharya. "Gesture recognition: A survey." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 37.3 (2007): 311-324.
[5] Shotton, Jamie, et al. "Real-time human pose recognition in parts from single depth images." CVPR, 2011.
[6] Reyes, Miguel, Gabriel Dominguez, and Sergio Escalera. "Feature weighting in dynamic time warping for gesture recognition in depth data." 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 2011.
[7] Haseyama, Miki, Takahiro Ogawa, and Nobuyuki Yagi. "A review of video retrieval based on image and video semantic understanding." ITE Transactions on Media Technology and Applications 1.1 (2013): 2-9.
[8] Han, Yahong, et al. "Semisupervised feature selection via spline regression for video semantic recognition." IEEE Transactions on Neural Networks and Learning Systems 26.2 (2014): 252-264.
[9] Wang, Miao, et al. "Data-driven image analysis and editing: A survey." Journal of Computer-Aided Design and Computer Graphics 27.11 (2015): 2015-2024.
[10] Yu, Kai, et al. "Deep learning: yesterday, today, and tomorrow." Journal of Computer Research and Development 50.9 (2013): 1799-1804.
[11] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507.
[12] Wu, Jiasong, et al. "Multilinear principal component analysis network for tensor object classification." IEEE Access 5 (2017): 3322-3331.
[13] Liu, Zhikang, Ye Tian, and Zilei Wang. "Stacked overcomplete independent component analysis for action recognition." Asian Conference on Computer Vision. Springer, Cham, 2016.
[14] Gammulle, Harshala, et al. "Two stream LSTM: A deep fusion framework for human action recognition." 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2017.
[15] Ng, Andrew, et al. "UFLDL Tutorial." http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial, 2012.
[16] Hyvarinen, Aapo, Patrik Hoyer, and Mika Inki. "Topographic ICA as a model of V1 receptive fields." Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), Vol. 4. IEEE, 2000.
[17] Ngiam, Jiquan, et al. "Tiled convolutional neural networks." Advances in Neural Information Processing Systems, 2010.
[18] Goh, Hanlin, et al. "Learning invariant color features with sparse topographic restricted Boltzmann machines." 2011 18th IEEE International Conference on Image Processing. IEEE, 2011.


At the same time, Table 1 lists the average recognition accuracy of each method, including the MPCANet and OICA methods. The proposed method outperforms the traditional feature extraction methods overall, which verifies that, compared with the traditional SIFT and LBP feature extraction models, the CNN model itself has better generalization ability; a pre-trained CNN can extract features with good generalization characteristics [31]. old-TSAE and SAE differ only in the pre-training loss function, and the average accuracy of old-TSAE pre-training is higher than that of non-topological SAE pre-training. The reason is that, with the topology association constraint, the convolutional neural network can extract information together with its surrounding topology from the video frames, gain invariance to rotation and scaling of the video image, and enrich the feature representation, which helps improve the accuracy of video semantic concept detection. The topology sparse coding pre-training method with disconnected boundaries (new-TSAE) proposed in this paper further raises the recognition rate by 1.9% over old-TSAE. The underlying reason is that it respects the discontinuity of the topology at the image edges and removes the spurious topological links between the top and bottom borders of the video frame present in the original topology. The new topological constraint not only preserves invariance to rotation and scaling of the video object but also eliminates the interference introduced by cross-boundary feature associations in the original topology.
Consistent with the fact that a video frame has no cross-boundary associations, the convolutional neural network can thus extract video image features that express the topology information more reasonably and better match the structure of the video frames. The results also show that the recognition method proposed in this paper outperforms the deep feature learning methods MPCANet and OICA.

VI. CONCLUSION


The topological information of a video frame enriches the expression of its features. In this paper, a CNN pre-trained with boundary-disjoint topology sparse coding is proposed; by learning video image features in a semi-supervised manner, it expresses the topology information more reasonably. Next, the features of the key segments of the video are reconstructed, and supervised logistic regression is used to optimize the learned video features, yielding a feature representation of the video segment that reflects its spatiotemporal characteristics. On this basis, a dynamic time warping algorithm is used to compute the DTW distance of each node, and the per-node DTW distances are averaged with weights to obtain the final recognition result. The experimental results show that the proposed method enables the convolutional neural network to express the topology information in video frames, which helps improve the accuracy of video semantic understanding. Recently, some studies have combined CNN and LSTM to learn video features and obtained superior video semantic analysis performance, because the LSTM captures the temporal dependencies between frames that a frame-level semantic representation alone cannot express.
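The per-node DTW step described above can be sketched as follows. This is a minimal illustration of the classic dynamic time warping recurrence [25] followed by a weighted average over nodes, not the authors' implementation; the sequence inputs and the node weights are assumed to be given.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic time warping distance between two 1-D sequences."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # extend the cheapest of the three admissible warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def weighted_node_distance(query_nodes, template_nodes, weights):
    """Weighted average of per-node DTW distances, giving the final
    matching score between a query gesture and a template."""
    dists = [dtw_distance(q, t) for q, t in zip(query_nodes, template_nodes)]
    return float(np.average(dists, weights=weights))
```

The final label would then be assigned to the template with the smallest weighted distance, e.g. via nearest-neighbor search over the template set.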


Lixin Liang was born in Tianjin, China, in 1962. He holds a master's degree and now works at North China Electric Power University. His research interests include sports teaching and training, social sports, and big data analysis. E-mail: [email protected]


[19] Karpathy, Andrej, et al. "Large-scale video classification with convolutional neural networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014.
[20] Ji, Shuiwang, et al. "3D convolutional neural networks for human action recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence 35.1 (2012): 221-231.
[21] Donahue, Jeffrey, et al. "Long-term recurrent convolutional networks for visual recognition and description." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
[22] Jiang, Yu-Gang, et al. "Exploiting feature and class relationships in video categorization with regularized deep neural networks." IEEE Transactions on Pattern Analysis and Machine Intelligence 40.2 (2017): 352-364.
[23] Jiang, Nan, et al. "An empirical analysis of different sparse penalties for autoencoder in unsupervised feature learning." 2015 International Joint Conference on Neural Networks (IJCNN). IEEE, 2015.
[24] Kavukcuoglu, Koray, Rob Fergus, and Yann LeCun. "Learning invariant features through topographic filter maps." 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
[25] Berndt, Donald J., and James Clifford. "Using dynamic time warping to find patterns in time series." KDD Workshop. Vol. 10. No. 16. 1994.
[26] Carmona, Josep Maria, and Joan Climent. "A performance evaluation of HMM and DTW for gesture recognition." Iberoamerican Congress on Pattern Recognition. Springer, Berlin, Heidelberg, 2012.
[27] Reyes, Miguel, Gabriel Dominguez, and Sergio Escalera. "Feature weighting in dynamic time warping for gesture recognition in depth data." 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE, 2011.
[28] Bengio, Yoshua. "Practical recommendations for gradient-based training of deep architectures." Neural Networks: Tricks of the Trade. Springer, Berlin, Heidelberg, 2012. 437-478.
[29] Venkatesan, Ragav, et al. "Classification of diabetic retinopathy images using multi-class multiple-instance learning based on color correlogram features." 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE, 2012.
[30] Chan, Chi Ho, and Josef Kittler. "Sparse representation of (multiscale) histograms for face recognition robust to registration and illumination problems." 2010 IEEE International Conference on Image Processing. IEEE, 2010.
[31] Li, Jun, et al. "Sparseness analysis in the pretraining of deep neural networks." IEEE Transactions on Neural Networks and Learning Systems 28.6 (2016): 1425-1438.
[32] McLachlan, Geoffrey. Discriminant Analysis and Statistical Pattern Recognition. Vol. 544. John Wiley & Sons, 2004.

Shuping Xu was born in Shijiazhuang, Hebei Province, China, in 1978. She received her master's degree from Hebei Normal University and is now an associate professor in the Sports Teaching Department of North China Electric Power University. Her research interests include sports teaching and training, social sports, and big data analysis. E-mail: [email protected]

Chengbin Ji was born in Hebei Province, China, in 1978. He holds a master's degree and now works at Beijing College of Politics and Law. His research interests include intercultural communication and foreign language education. E-mail: [email protected]


1) We add a new topological information constraint to form a topological sparse encoder in the semi-supervised learning of video image features.
2) By fine-tuning the network parameters, we obtain more discriminative video sequence features.
3) Based on the extracted video feature sequences, a distance-weighting-based dynamic time warping (DW-DTW) algorithm is combined with K-nearest neighbors (KNN) to improve the accuracy of gesture recognition.
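The boundary-disjoint topological constraint in contribution 1 can be illustrated with a small sketch. The snippet below, a simplified assumption-laden illustration rather than the paper's actual model, builds a binary grouping matrix over a grid of hidden units and applies a standard group-sparse (topographic) penalty to their activations; the key point is that setting wrap=False removes the links between opposite grid borders, mirroring the boundary-disjoint topology.

```python
import numpy as np

def grouping_matrix(rows, cols, wrap=False):
    """Binary grouping matrix V for a topographic sparsity penalty.
    Each hidden unit sits on a rows x cols grid; group k pools unit k and
    its 4-neighbours. With wrap=True the grid is a torus (a topology that
    links opposite borders); wrap=False drops those cross-boundary links,
    analogous to the boundary-disjoint topology described above."""
    n = rows * cols
    V = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            k = r * cols + c
            V[k, k] = 1.0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if wrap:
                    rr, cc = rr % rows, cc % cols
                elif not (0 <= rr < rows and 0 <= cc < cols):
                    continue  # no link across the grid boundary
                V[k, rr * cols + cc] = 1.0
    return V

def topographic_penalty(V, h, eps=1e-8):
    """Group-sparse penalty sum_k sqrt(V_k . h^2) on activations h."""
    return float(np.sum(np.sqrt(V @ (h ** 2) + eps)))
```

Added to an autoencoder's reconstruction loss, this penalty encourages neighbouring units to activate together while keeping opposite image borders independent.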


The authors declare that there is no conflict of interest.