Physica A 532 (2019) 121812
Gait analysis and recognition prediction of the human skeleton based on migration learning
Chao Sun, Chao Wang∗, Weike Lai
Department of Information Engineering, Guangzhou Huashang Vocational College, Guangzhou 511300, China
Highlights
• Transfer learning and the Inception-V3 neural network are used as the methods.
• The HMDB large human motion database and the UCF sports actions video action data serve as the main datasets.
• The application predicts whether a person in a picture or video is running or walking.
Article info
Article history: Received 29 January 2019; Received in revised form 23 April 2019; Available online 21 June 2019.
Keywords: Transfer learning; Neural network; Gait; Human pose estimation
Abstract
Gait recognition is a hot topic in computing. Different gaits have different characteristics. This paper predicts whether a person in an image or video is running or walking by capturing the behavior of the person in that image or video. We identify the gait by adopting the transfer learning method and the Inception-V3 neural network. The HMDB large human motion database and the UCF sports actions video action data are used as the main datasets. The resulting model predicts whether the persons in a picture or video are making a running or walking motion. Results show a significant increase in detection performance compared to existing algorithms when transfer learning neural networks adapted for mobile use are employed.
1. Introduction

Human action recognition is a hotspot in the field of computer vision and is widely used in security, human–machine intelligent interaction, virtual reality, etc. The keypoints of the human skeleton play an important role in capturing and judging a person's posture and actions, so obtaining the positions of the keypoints of the human skeleton is of great importance for behavior recognition and human posture detection.

Gait, as the name suggests, is the posture that the body presents while in motion: its footsteps. Gait recognition can detect people's identities by classifying their posture through their walking pattern. It has many other applications in identity verification, abnormal behavior detection and medical treatment. At present, most research in this area is based on analysis in simple environments; effective application in complex environments requires further study. Different gaits have different characteristics: the relative positions of the keypoints of the human skeleton and the angles of the joints differ. We can therefore use the keypoints of the human skeleton to describe gait characteristics and apply them to the identification and prediction of human actions.

In daily life, if someone suddenly runs into the road, there is a high probability that an accident will occur. In this situation, if a surveillance camera can detect such abnormal behavior early and give a warning, it will be of very positive significance for maintaining public security.
∗ Corresponding author. E-mail address: [email protected] (C. Wang).
Fig. 1. Application function flow chart.
The main purpose of this paper is to capture the movements of subjects in an image, to accurately and quickly predict the coordinates of the keypoints of the human skeleton, to classify the human gait, and to predict whether the persons in the image are running or walking [1–8].

The main function of the application is as follows: when a picture or video is opened, the loaded model predicts the positions of the human skeleton keypoints in the image or video and then displays the predicted keypoints on the image [9,10]. The picture is a local image file, while the video can either be a local video file or real-time video from the computer camera [11]. When a person in a picture or video shows a running gait, the person is framed. The functional design flow of the application is shown in Fig. 1.

In this paper, the trained model predicts and distinguishes whether someone in the image is making a running or walking motion [12,13]. The HMDB large human motion database and the UCF_sports_actions video action datasets are mainly used. These two datasets contain video data for a variety of motion classes, from which the running and walking data are selected. During training, the image data are first converted into black-and-white images containing only the position information of the keypoints of the human skeleton, and training then proceeds on these images [14–16]. For the model, Inception-V3 is chosen and the transfer learning method is used to obtain the results. The advantage of transfer learning is that it makes the new model learn more efficiently: it uses less training data to get more accurate results and improves the efficiency of research and development [17].

2. Related work

2.1. Human motion recognition and gait analysis

Human motion recognition, which rose to prominence in the 1990s, is an important topic in the field of computer vision [18]. Existing methods are mainly divided into three major categories: template matching, probability statistics and local semantics. In today's society, human motion capture and recognition technology is used in many fields [19,20]. In human–computer interaction, the Kinect motion capture system for Microsoft's Xbox console, released in 2010, was a great success; in security monitoring, IBM, Intel, GE, Huawei and other large enterprises have developed intelligent video surveillance systems capable of detecting suspicious persons and abnormal events [21]. Much commercial software has been published for video content retrieval, such as the VideoQ system developed by Columbia University and IBM's QBIC system [22].
Gait refers to the body posture that humans present as they move their limbs (e.g., walking or running). Gait is affected by a variety of factors [23]. By analyzing a person's gait, one can evaluate their walking ability. Gait analysis is also often used in medical diagnosis, physical training, biometric identification, comparative biomechanics and other fields [24].

2.2. Transfer learning

In deep learning, transfer learning is a very efficient method. In recent years, more researchers have turned their attention to machine learning, and transfer learning is currently one of its most actively studied directions [25,26]. Transfer learning applies the knowledge of a pre-trained model to a new learning task: it uses the relevance between the learning goal and existing knowledge, migrating knowledge from the original model to the new learning target to solve the new problem [27]. Transfer learning has two advantages. First, the initial performance of the model is higher than when starting from scratch; moreover, training is faster, and the model improves more quickly during training than one trained from scratch. In the case of limited training data, using transfer learning can lead to more accurate results than not using it [28,29]. According to the content of the transfer, transfer learning is divided into four categories [30,31]: instance, feature, parameter and relational-knowledge transfer learning [32]. At present, transfer learning has many applications in emotion classification, Chinese–English translation, image classification and so on [33]. Most instance transfer learning methods use importance weighting to set up the learning problem. In general, we want to learn the best parameters θ∗ of the model by minimizing the expected risk [34],
\[ \theta^{*} = \arg\min_{\theta}\ \mathbb{E}_{(x,y)\in P}\,[\ell(x,y,\theta)] \tag{1} \]

In the transfer learning setting, we hope to learn the best model for the target domain by minimizing the expected risk,

\[ \theta^{*} = \arg\min_{\theta} \sum_{(x,y)\in D_{T}} P(D_{T})\,\ell(x,y,\theta) \tag{2} \]

However, since the labeled data of the target domain are not observed in the training data, we must learn the model from the source domain data. If P(D_S) = P(D_T), then we can simply learn a model usable in the target domain by solving the following optimization problem,

\[ \theta^{*} = \arg\min_{\theta} \sum_{(x,y)\in D_{S}} P(D_{S})\,\ell(x,y,\theta) \tag{3} \]

If P(D_S) ≠ P(D_T), we need to modify the above optimization problem to learn a model with high generalization ability for the target domain, as follows:

\[ \theta^{*} = \arg\min_{\theta} \sum_{(x,y)\in D_{S}} \frac{P(D_{T})}{P(D_{S})}\,P(D_{S})\,\ell(x,y,\theta) \tag{4} \]

Therefore, we add a different penalty value to each instance \((x_{S_i}, y_{S_i})\) with the corresponding weight \(P_{T}(x_{T_i}, y_{T_i})/P_{S}(x_{S_i}, y_{S_i})\), learning an exact model for the target domain. Since the difference between P(D_S) and P(D_T) is caused by P(X_S) and P(X_T),

\[ \frac{P_{T}(x_{T_i}, y_{T_i})}{P_{S}(x_{S_i}, y_{S_i})} = \frac{P(x_{S_i})}{P(x_{T_i})} \tag{5} \]

If we can estimate \(P(x_{S_i})/P(x_{T_i})\) for each instance, we can solve the transductive transfer learning problem.
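The ratio \(P(x_{S_i})/P(x_{T_i})\) must be estimated from data. One standard density-ratio trick — our illustration, not necessarily the procedure used by the instance-transfer methods cited above — trains a logistic-regression domain classifier to separate source from target samples and converts its probabilities into per-instance weights. A minimal sketch (all names are ours):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def instance_weights(X_source, X_target):
        # Label source samples 0 and target samples 1, then fit a domain
        # classifier; p(target|x) / p(source|x) approximates the importance
        # weight P_T(x) / P_S(x) that multiplies each source-instance loss.
        X = np.vstack([X_source, X_target])
        d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
        clf = LogisticRegression(max_iter=1000).fit(X, d)
        p_target = clf.predict_proba(X_source)[:, 1]
        return p_target / np.clip(1.0 - p_target, 1e-12, None)

These weights then scale the per-instance loss ℓ(x, y, θ) when the model is trained on the source data, as in Eq. (4).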
3. Our method

3.1. Algorithm and dataset

We deduce the equation of the minimal graph on Riemannian manifolds and the curvature of its level sets; in this section we set out to prove Theorems 3.1 and 3.3 [35–39].

Lemma 3.1. Let u be a smooth function defined on a Riemannian manifold M^n with constant sectional curvature, and denote by R_{ijkl} the curvature tensor of M^n; then

\[ u_{ijk} = u_{kij} + u_{l} R_{lijk} \tag{6} \]

and

\[ u_{ijkl} = u_{klij} - u_{\varepsilon j} R_{\varepsilon kli} - u_{i\varepsilon} R_{\varepsilon klj} + u_{\varepsilon l} R_{\varepsilon ijk} + u_{k\varepsilon} R_{\varepsilon ijl} \tag{7} \]
Table 1
The keypoints available in the dataset that were used during training (COCO human skeleton keypoint dataset).

Keypoint category    Z. Cao et al.    Our implementation
nose                 ✓                ✓
left_eye             ✓                ✓
right_eye            ✓                ✓
left_ear             ✓                ✓
right_ear            ✓                ✓
left_shoulder        ✓                ✓
right_shoulder       ✓                ✓
left_elbow           ✓                ✓
right_elbow          ✓                ✓
left_wrist           ✓                ✓
right_wrist          ✓                ✓
left_hip             ✓
right_hip            ✓
left_knee            ✓
right_knee           ✓
left_ankle           ✓
right_ankle          ✓
Proposition 3.2. Let u: M^n → R be a smooth function defined on a Riemannian manifold M^n. Consider the graph of u, denoted by Σ_u = F(M^n), where F: M^n → M^n × R is defined by F(p) = (p, u(p)) and M^n × R is equipped with the canonical product Riemannian metric. Then the mean curvature of Σ_u is

\[ H = -\operatorname{div}\!\left( \frac{\nabla u}{\sqrt{1 + |\nabla u|^{2}}} \right) \tag{8} \]

where div is the divergence operator on M^n.

Theorem 3.3. Let Ω ⊆ M² be a smooth bounded connected domain and u ∈ C⁴(Ω) ∩ C²(Ω̄) be the solution to the maximal space-like graph equation

\[ \operatorname{div}\!\left( \frac{\nabla u}{\sqrt{1 - |\nabla u|^{2}}} \right) = 0 \tag{9} \]

Assume |∇u| ≠ 0 in Ω and that the level lines of u are all convex with respect to the normal ∇u; then the curvature of the level lines of u must be identically zero on Ω or positive on Ω. A key computation in the proof is

\[
\begin{aligned}
\sum_{i,j=1}^{2} a_{ij}\phi_{ij}
&\sim -u_{2}^{2}\,(a_{11}u_{1111} + a_{22}u_{2211})
= -u_{2}^{2} \sum_{i,j=1}^{2} a_{ij}u_{ij11} \\
&= -u_{2}^{2}\Bigl[\sum_{i,j=1}^{2} a_{ij}u_{ij}\Bigr]_{11}
+ u_{2}^{2} \sum_{i,j=1}^{2} \bigl(a_{ij,11}u_{ij} + 2a_{ij,1}u_{ij1}\bigr) \\
&= u_{2}^{2} \sum_{i,j=1}^{2} \bigl[\bigl(1 - |\nabla u|^{2}\bigr)\delta_{ij} + u_{i}u_{j}\bigr]_{11}\, u_{ij}
+ 2u_{2}^{2} \sum_{i,j=1}^{2} \bigl[\bigl(1 - |\nabla u|^{2}\bigr)\delta_{ij} + u_{i}u_{j}\bigr]_{1}\, u_{ij1} \\
&\triangleq \mathrm{I} + \mathrm{II}
\end{aligned}
\tag{10}
\]
For the prediction of human gait movements, two datasets are used: the HMDB large human motion database and UCF_sports_actions. Both are video datasets of human actions, with videos taken from news and movies, including clips of human motion in a variety of complex scenes and from different angles [26]. From these, the running and walking classes are selected for gait classification; after screening and processing, the final gait classification dataset contains approximately 11,000 images [27,40].

We used the COCO keypoints 2017 dataset for training. This dataset was chosen because it is the largest pose estimation dataset available (about 120,000 labeled images) and is backed by Microsoft. Seventeen unique keypoint categories are available in this dataset; those we decided to exclude, owing to the specific application of monitoring passengers of cars, are shown in Table 1.

We further experimented with performance on the human pose estimation task using the COCO2017 keypoint dataset [41]. It contains more than 200k images and 250k person instances labeled with keypoints, of which 150k instances are publicly available for training and validation. We used only the COCO2017 train split (57K images and 150K person instances) and evaluated performance on the COCO2017 val set. We used the same implementation as in [42], with a single input size of 256 × 192 during training and testing. Results are shown in Table 2. All reported baseline results are similar to, or even better than, the ones reported in [42].
Table 2
COCO2017 val keypoint results. AP_kp denotes keypoint AP; AR_kp denotes keypoint AR.

Method               Backbone     Pre-train computation   AP_kp   AP50_kp   AP75_kp   APM_kp   APL_kp   AR_kp   AR50_kp   AR75_kp   ARM_kp   ARL_kp
SimpleBaseline [42]  ResNet-50    100%                    70.4    91.4      78.2      67.7     74.4     73.5    92.1      80.5      70.4     78.3
SimpleBaseline [42]  ResNet-50    75%                     70.7    91.5      78.2      67.9     75.1     74.0    92.6      80.7      70.7     78.9
SimpleBaseline [42]  ResNet-50    86%                     70.5    91.5      78.1      67.5     74.9     73.7    92.3      80.3      70.3     78.7
SimpleBaseline [42]  ResNet-101   100%                    72.0    91.5      79.4      69.2     76.4     75.3    93.0      82.0      72.0     80.2
SimpleBaseline [42]  ResNet-101   75%                     72.4    91.5      80.3      69.4     76.8     75.6    92.8      82.4      72.4     80.5
SimpleBaseline [42]  ResNet-101   86%                     72.6    92.5      80.3      69.6     77.0     75.7    93.1      82.4      72.4     80.7
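Keypoint AP/AR figures of this kind are conventionally produced with the COCO evaluation API. A minimal sketch, assuming the detections have been written to a results file in the standard COCO keypoint format (file paths are illustrative):

    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    # Ground truth and detections for the COCO2017 val split.
    coco_gt = COCO('annotations/person_keypoints_val2017.json')
    coco_dt = coco_gt.loadRes('keypoint_predictions.json')

    # iouType='keypoints' selects OKS-based evaluation, which reports
    # AP, AP50, AP75, APM, APL and the corresponding AR metrics.
    evaluator = COCOeval(coco_gt, coco_dt, iouType='keypoints')
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()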
3.2. TensorFlow Lite

TensorFlow Lite provides a set of core operators — including quantized and floating-point operators — many of which have been adapted for mobile platforms. They can be used to create and run custom models, and developers can also write their own custom operators and use them in their models [43]. It defines a new model file format based on FlatBuffers, together with an on-device interpreter whose kernels are optimized for faster execution on mobile devices, and the TensorFlow converter, which converts a TensorFlow-trained model to the TensorFlow Lite format. The runtime is small: TensorFlow Lite is less than 300 KB when all supported operators are linked, and less than 200 KB when using only the operators required to support Inception-V3 and MobileNet [35]. Pre-tested models are guaranteed to work out of the box: Inception-V3, a popular model for detecting the main objects present in an image, and MobileNets, a family of mobile-first computer vision models designed to maximize accuracy while respecting the limited resources of on-device or embedded applications [36]. MobileNets are parametric, small, low-latency, low-power models that meet the resource constraints of various use cases; they can be used for classification, detection, embedding and segmentation. The MobileNet models are smaller than Inception-V3 but have lower precision, and the quantized version of a MobileNet model runs faster than the non-quantized (floating-point) version on the CPU [37].

3.3. Framework

For gait classification, the transfer learning method is used to quickly train a model for the task from the feature extraction ability of the original model. The Inception-V3 model is itself an image classifier, and gait classification in this application can also be regarded as image classification, so the original 1000-way classifier can be replaced by the classifier required by the research task. For an artificial neural network, simply enlarging the network results in complex computation, too many parameters and easy overfitting. To deal with this problem, the depth and width of the network must be increased while keeping the number of parameters small; the Inception module was developed for this purpose [38]. An important improvement of Inception-V3 over the first two Inception versions is the introduction of factorization: a large two-dimensional convolution is decomposed into two smaller one-dimensional convolutions. This saves many parameters and, while speeding up computation, adds non-linearity that makes the model more expressive [39].

Inception-V3 regularizes the model through label smoothing. For an input x, the model computes the probability of class k as

\[ p(k \mid x) = \frac{\exp(z_{k})}{\sum_{i=1}^{K} \exp(z_{i})} \tag{11} \]

Assuming the true distribution is q(k), the cross entropy loss function is

\[ \ell = -\sum_{k=1}^{K} \log(p(k))\, q(k) \tag{12} \]

Minimizing the cross entropy is equivalent to maximizing the likelihood. Differentiating the cross entropy with respect to the logits gives

\[ \frac{\partial \ell}{\partial z_{k}} = p(k) - q(k) \tag{13} \]

In this way, we can smooth the label slightly,

\[ q'(k \mid x) = (1 - \varepsilon)\,\delta_{k,y} + \varepsilon\, u(k) \tag{14} \]

If u(k) is a uniform distribution, then

\[ q'(k) = (1 - \varepsilon)\,\delta_{k,y} + \frac{\varepsilon}{K} \tag{15} \]
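As a concrete illustration of Eqs. (11)–(15), the following NumPy sketch computes the softmax probabilities, the uniformly smoothed targets and the resulting cross entropy (the function names and the example values of ε and K are ours):

    import numpy as np

    def softmax(z):
        # Eq. (11): p(k|x) = exp(z_k) / sum_i exp(z_i), shifted for stability.
        e = np.exp(z - z.max())
        return e / e.sum()

    def smoothed_targets(y, num_classes, eps=0.1):
        # Eq. (15): q'(k) = (1 - eps) * delta_{k,y} + eps / K.
        q = np.full(num_classes, eps / num_classes)
        q[y] += 1.0 - eps
        return q

    z = np.array([2.0, 0.5, -1.0])      # example logits, K = 3
    p = softmax(z)
    q = smoothed_targets(y=0, num_classes=3)
    loss = -np.sum(q * np.log(p))       # Eq. (12) with the smoothed q
    print(loss)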
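Returning to the deployment path of Section 3.2, converting a trained model to the TensorFlow Lite flat-buffer format might look as follows — a sketch assuming the TensorFlow 1.14+/2.x converter API and an illustrative SavedModel directory:

    import tensorflow as tf

    # Convert a trained SavedModel to the .tflite flat-buffer format.
    converter = tf.lite.TFLiteConverter.from_saved_model('exported_gait_model')
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable default quantization
    tflite_model = converter.convert()

    with open('gait_model.tflite', 'wb') as f:
        f.write(tflite_model)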
Fig. 2. Inception-V3 neural network diagram.
Fig. 3. A video converted into pictures.
In the transfer learning stage, the original weights of the Inception-V3 model are used to extract features from the new dataset, and a new classification layer is added at the end of the neural network to perform the classification. The final model then needs only a few iterations to achieve high precision. Fig. 2 illustrates the structure of the Inception-V3 artificial neural network [44].

3.4. Model training

The Inception-V3 model is used to classify images for training. First, each frame of every video in the dataset is converted into a picture [45]. Using the os.walk() method of the os module, all the video files in the specified folder can be traversed; OpenCV then reads each video frame by frame, saving every frame as a picture [46,47]. This step is implemented by the Getpicture() method; Fig. 3 shows a video converted into pictures.

After the original pictures are obtained, each color picture is transformed into a black-and-white picture containing only the keypoints of the human skeleton and their connecting lines, using the keypoint prediction model; these feature images form the dataset for gait classification [48,49]. This step is implemented by the Save_keypoint() method, which draws the black-and-white images with the human skeleton keypoints shown in Fig. 4.
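A minimal sketch of what the Getpicture() routine described above might look like; the directory layout, file extensions and naming scheme are our assumptions:

    import os
    import cv2

    def get_picture(video_dir, out_dir):
        # Traverse every video under video_dir with os.walk(), read it
        # frame by frame with OpenCV, and save each frame as a JPEG.
        os.makedirs(out_dir, exist_ok=True)
        for root, _dirs, files in os.walk(video_dir):
            for name in files:
                if not name.lower().endswith(('.avi', '.mp4')):
                    continue
                cap = cv2.VideoCapture(os.path.join(root, name))
                index = 0
                while True:
                    ok, frame = cap.read()
                    if not ok:
                        break
                    stem = os.path.splitext(name)[0]
                    cv2.imwrite(os.path.join(out_dir, '%s_%05d.jpg' % (stem, index)), frame)
                    index += 1
                cap.release()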
Fig. 4. The original image converted into a feature map of the keypoints of the human skeleton.
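Putting Sections 3.3 and 3.4 together, the retraining step described next can be expressed as follows — a minimal tf.keras sketch of our own (not the authors' exact retraining script), in which Inception-V3 acts as a frozen feature extractor and a new two-class softmax head replaces the original 1000-way classifier:

    from tensorflow.keras.applications import InceptionV3
    from tensorflow.keras import layers, models

    # Pre-trained Inception-V3 without its 1000-way top layer; global
    # average pooling yields a fixed-length feature vector per image.
    base = InceptionV3(weights='imagenet', include_top=False, pooling='avg')
    base.trainable = False  # keep the transferred weights fixed

    # New classification layer for the two gait classes: running, walking.
    outputs = layers.Dense(2, activation='softmax')(base.output)
    model = models.Model(base.input, outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    # The processed skeleton images are then fed in, split 8:1:1 into
    # training, validation and test sets as described below.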
After the dataset is processed, the Inception-V3 neural network is modified: a new classification layer is added at the end for the new gait classification task [50]. The model is then read from the local folder, and the images in the processed dataset are randomly divided into training, validation and test sets in the proportion 8:1:1. The new neural network input and parameter settings, such as the learning rate, the number of iterations and the batch size, are then defined. These parameters affect the training process [51]; different initial parameters give different training results, so appropriate parameter values must be selected. Once this is done, formal training can begin. During training, the accuracy of the current model is computed every 100 iterations; at the end, the results on the test set are printed and the final model is saved.

4. Result of the experiment

4.1. Experiment

The originally expected basic functions have been realized: the application can predict the positions of the human skeleton keypoints in an image or video and perform simple gait classification. The gait classification dataset contains 10,962 images, of which the test set is 10%, about 1100 pictures. The model was trained for 20,000 iterations; as Fig. 5 shows, the final model reaches an accuracy of 88.1% on the test set. This is still relatively low for practical application, and further study is required to improve the methods and results.

The application interface is mainly practical rather than aesthetic: it contains only the basic buttons required for the functions, with simple and clear button tips. Choosing to open a picture, video or camera to forecast loads the model, predicts the keypoints of the human skeleton in the image and judges the gait type of the persons in it; because the models are large, the prediction process takes some time. If the figure in the forecast is walking, only the keypoints of the human skeleton are shown (Fig. 6); if the person in the image is running, the keypoints of the human skeleton, the connecting lines and a box around the figure are shown (Fig. 7).

4.2. Deficiencies

The number of gait categories is small. Currently only two gaits are classified — running and walking — but in real life people's gaits are far more varied than these two, so when the persons in the image perform other actions, the prediction easily misjudges them. The only way to solve this problem is to collect more data for different gait categories and increase the amount of training data.

The program also takes a long time to run, mainly for the following reasons: many libraries need to be imported; the models are large, so loading them is slow; and two models must be loaded in total, requiring more time for each prediction. To save time, the program currently loads the first model when the initial interface is opened, which takes about one minute; after that, the application does not load this model again, and predicting the human skeleton keypoints in a picture takes about 5 s.
Fig. 5. Model predictive accuracy.
Fig. 6. Walking effect drawing.
Testing showed that, within the same Python program, there is no way to use TensorFlow to load the two models at the same time without conflicts. We therefore use the os.popen() method to create a new console command and run a second Python program to predict the gait type. The disadvantage of this is that the gait classification model is loaded once for each person's gait prediction, so the more people in the image, the longer it takes. In our tests, predicting one person takes about 15 s; the results are shown in Table 3. This problem cannot be solved at present, and reducing the running time is the focus for improving this part of the application.
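A minimal sketch of this workaround; predict_gait.py is a hypothetical second script that loads the gait classification model, classifies the given image and prints the label to stdout:

    import os

    def predict_gait(image_path):
        # Run the gait classifier in a separate Python process so the two
        # TensorFlow models never share one interpreter.
        pipe = os.popen('python predict_gait.py "%s"' % image_path)
        label = pipe.read().strip()
        pipe.close()
        return label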
Fig. 7. Running effect.

Table 3
Running cost.

Function                                                 Operating procedure                                                                                             Time
Open the initial interface                               Import libraries + open the initial interface + load the model that predicts the keypoints of the human skeleton   1 min
Predict the keypoints of a human skeleton in a picture   Predict the keypoints of the human skeleton                                                                     5 s
Predict the gait type of a person in the picture         Load the gait classification model + predict the gait category                                                  15 s
5. Conclusion

The focus of this study is to use deep learning to find the locations of the keypoints of the human skeleton in an image and to use these keypoints to determine the person's current gait category, thereby realizing the objective of abnormal behavior warning. The application achieves its expected function of judging the gait category. It nevertheless has shortcomings, of which four are the main problems: first, prediction is relatively slow; second, the accuracy of the prediction needs to be improved; third, too few gait categories are predicted; fourth, the interface design is rough and unlikely to attract users. These shortcomings mean that the current application cannot yet meet the requirements of actual use, and it needs continuous improvement. At present, in the field of machine learning, predicting the keypoints of the human skeleton and identifying the gait from those keypoints is still a relatively new topic; further research is needed to obtain better results.

Lastly, we have successfully implemented the proposed tracking framework on an Nvidia Jetson TX2 embedded platform, with a run time of 2.6 fps. Unfortunately, our dynamic hand gesture classifier has too many parameters to fit onto the embedded system. However, we can include the static hand gesture recognition module and the action recognition module on top of the framework. This reduces the run time to 2.4 fps, but it is still considered a successful real-time implementation.

Acknowledgments

Guangzhou Science and Technology Plan Project (No. 201903010103); the project was supported by the ''13th Five-Year Plan'' for the development of philosophy and social sciences in Guangzhou, China (No. 2018GZYB36), and by the Science Foundation of the Guangdong Provincial Communications Department, China (grant number 2015-02-064).
References
[1] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE (1998).
[2] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012.
[3] M.A. Fischler, R. Elschlager, The representation and matching of pictorial structures, IEEE Trans. Comput. (1973).
[4] P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multi-scale, deformable part model, in: CVPR, 2008.
[5] M. Andriluka, S. Roth, B. Schiele, Pictorial structures revisited: People detection and articulated pose estimation, in: CVPR, 2009.
[6] M. Eichner, V. Ferrari, Better appearance models for pictorial structures, in: BMVC, 2009.
[7] B. Sapp, C. Jordan, B. Taskar, Adaptive pose priors for pictorial structures, in: CVPR, 2010.
[8] Y. Yang, D. Ramanan, Articulated pose estimation with flexible mixtures of parts, in: CVPR, 2011.
[9] M. Dantone, J. Gall, C. Leistner, L.V. Gool, Human pose estimation using body parts dependent joint regressors, in: CVPR, 2013.
[10] S. Johnson, M. Everingham, Learning effective human pose estimation from inaccurate annotation, in: CVPR, 2011.
[11] L. Pishchulin, M. Andriluka, P. Gehler, B. Schiele, Poselet conditioned pictorial structures, in: CVPR, 2013.
[12] B. Sapp, B. Taskar, MODEC: Multimodal decomposable models for human pose estimation, in: CVPR, 2013.
[13] G. Gkioxari, P. Arbelaez, L. Bourdev, J. Malik, Articulated pose estimation using discriminative armlet classifiers, in: CVPR, 2013.
[14] S.J. Krotosky, M.M. Trivedi, Occupant posture analysis using reflectance and stereo images for smart airbag deployment, in: Proceedings of the IEEE Intelligent Vehicles Symposium, Parma, Italy, 2004, pp. 698–703.
[15] T.B. Moeslund, A. Hilton, V. Krüger, A survey of advances in vision-based human motion capture and analysis, Comput. Vis. Image Underst. 104 (2006) 90–126.
[16] R. Poppe, Vision-based human motion analysis: An overview, Comput. Vis. Image Underst. 108 (2007) 4–18.
[17] Y. Li, Z. Sun, Vision-based human pose estimation for pervasive computing, in: Proceedings of the ACM Workshop on Ambient Media Computing, Beijing, China, 2009, pp. 49–56.
[18] Z. Liu, J. Zhu, J. Bu, C. Chen, A survey of human pose estimation: The body parts parsing based methods, J. Vis. Commun. Image Represent. 32 (2015) 10–19.
[19] V. Lepetit, P. Fua, Monocular model-based 3D tracking of rigid objects: A survey, Found. Trends Comput. Graph. Vis. 1 (2005) 1–89.
[20] X. Perez-Sala, S. Escalera, C. Angulo, J. Gonzalez, A survey on model based approaches for 2D and 3D visual human pose recovery, Sensors 14 (2014) 4189–4210.
[21] W. Hu, T. Tan, L. Wang, S. Maybank, A survey on visual surveillance of object motion and behaviors, IEEE Trans. Syst. Man Cybern. C 34 (2004) 334–352.
[22] V. Lepetit, P. Fua, Monocular model-based 3D tracking of rigid objects, Found. Trends Comput. Graph. Vis. 1 (2005) 1–89.
[23] A. Yao, J. Gall, G. Fanelli, L.J. Van Gool, Does human action recognition benefit from pose estimation?, in: Proceedings of the British Machine Vision Conference, Dundee, UK, 2011, pp. 67.1–67.11.
[24] A. Yao, J. Gall, L. Van Gool, Coupled action recognition and pose estimation from multiple views, Int. J. Comput. Vis. 100 (2012) 16–37.
[25] X. Nie, C. Xiong, S.C. Zhu, Joint action recognition and pose estimation from video, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 1293–1301.
[26] W. Gong, J. Gonzalez, F.X. Roca, Human action recognition based on estimated weak poses, EURASIP J. Adv. Signal Process. 2012 (2012) 1–14.
[27] R. Poppe, A survey on vision-based human action recognition, Image Vis. Comput. 28 (2010) 976–990.
[28] L. Chen, H. Wei, J. Ferryman, A survey of human motion analysis using depth imagery, Pattern Recognit. Lett. 34 (2013) 1995–2006.
[29] A. Newell, K. Yang, J. Deng, Stacked hourglass networks for human pose estimation, CoRR abs/1603.06937, 2016, http://arxiv.org/abs/1603.06937.
[30] C.-J. Chou, J.-T. Chien, H.-T. Chen, Self adversarial training for human pose estimation, CoRR abs/1707.02439, 2017, http://arxiv.org/abs/1707.02439.
[31] W. Yang, S. Li, W. Ouyang, H. Li, X. Wang, Learning feature pyramids for human pose estimation, CoRR abs/1708.01101, 2017, http://arxiv.org/abs/1708.01101.
[32] L. Ke, M.-C. Chang, H. Qi, S. Lyu, Multi-scale structure-aware network for human pose estimation, CoRR abs/1803.09894, 2018, http://arxiv.org/abs/1803.09894.
[33] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: Efficient convolutional neural networks for mobile vision applications, CoRR abs/1704.04861, 2017, arXiv:1704.04861.
[34] Q. Xu, Z. Wang, F. Wang, J. Li, Thermal comfort research on human CT data modeling, Multimedia Tools Appl. 77 (2018) 6311.
[35] Q. Xu, M. Li, M. Li, S. Liu, Energy spectrum CT image detection based dimensionality reduction with phase congruency, J. Med. Syst. 42 (3) (2018) 42–49.
[36] P. Wang, L. Zhang, Some geometrical properties of convex level sets of minimal graph on 2-dimensional Riemannian manifolds, Nonlinear Anal. Theory Methods Appl. 130 (1) (2016) 1–13.
[37] P. Wang, D. Zhang, Convexity of level sets of minimal graph on space form with nonnegative curvature, J. Differential Equations 262 (2017) 5534–5564.
[38] P.H. Wang, X. Liu, Z.H. Liu, The convexity of the level sets of maximal strictly space-like hypersurfaces defined on 2-dimensional space forms, Nonlinear Anal. 174 (2018) 79–103.
[39] P.H. Wang, H.M. Qiu, Z.H. Liu, Some geometrical properties of minimal graph on space forms with nonpositive curvature, Houston J. Math. 44 (2) (2018) 545–570.
[40] D. Weinland, R. Ronfard, E. Boyer, A survey of vision-based methods for action representation, segmentation and recognition, Comput. Vis. Image Underst. 115 (2011) 224–241.
[41] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft COCO: Common objects in context, in: ECCV, 2014.
[42] B. Xiao, H. Wu, Y. Wei, Simple baselines for human pose estimation and tracking, in: ECCV, 2018.
[43] Q. Xu, M. Li, A new cluster computing technique for social media data analysis, Cluster Comput.
[44] A. Kobren, N. Monath, A. Krishnamurthy, A. McCallum, An online hierarchical algorithm for extreme clustering, 2017, arXiv preprint arXiv:1704.01858.
[45] J. Zhu, S. He, J. Liu, P. He, Q. Xie, Z. Zheng, M.R. Lyu, Tools and benchmarks for automated log parsing, 2018, arXiv preprint arXiv:1811.03509.
[46] P. He, J. Zhu, S. He, J. Li, M.R. Lyu, An evaluation study on log parsing and its use in log mining, in: 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), IEEE, 2016, pp. 654–661.
[47] P. He, J. Zhu, S. He, J. Li, M.R. Lyu, Towards automated log parsing for large-scale log data analysis, IEEE Trans. Dependable Secure Comput. 15 (6) (2018) 931–944.
[48] K.A. Heller, Z. Ghahramani, Bayesian hierarchical clustering, in: Proceedings of the 22nd International Conference on Machine Learning, ACM, 2005, pp. 297–304.
[49] S. Liu, M. Yu, M. Li, Q. Xu, The research of virtual face based on deep convolutional generative adversarial networks using TensorFlow, Physica A 521 (2019) 667–680.
[50] Q. Xu, M. Li, Learning to rank with relational graph and pointwise constraint for cross-modal retrieval, Soft Comput. (2018).
[51] S. Liu, M. Li, M. Li, Q. Xu, Research of animals' image semantic segmentation based on deep learning, Concurr. Comput. Pract. Exper. (2018) e4892.