
Highlights

• An efficient deep learning-based person identification method for visual biometrics.
• A hierarchical descriptive-geometric 3D gait feature extraction scheme.
• A compact-size DCNN with multiple stacks of asymmetric convolutional filters.
• Superior identification rates compared with state-of-the-art approaches.
• Comparable performance with several modern CNNs.


Learning 3D Spatiotemporal Gait Feature by Convolutional Network for Person Identification

Thien Huynh-The(a,b,*), Cam-Hao Hua(c), Nguyen Anh Tu(d), Dong-Seong Kim(a,b,*)

a ICT Convergence Research Center, Kumoh National Institute of Technology, 61 Daehak-ro, Gumi-si, Gyeongsangbuk-do 39177, Republic of Korea
b Department of IT Convergence Engineering, Kumoh National Institute of Technology, 61 Daehak-ro, Gumi-si, Gyeongsangbuk-do 39177, Republic of Korea
c Department of Computer Science & Engineering, Kyung Hee University (Global Campus), 1732 Deokyoungdae-ro, Giheung-gu, Yongin-si, Gyeonggi-do 446-701, Republic of Korea
d School of Science and Technology, Nazarbayev University, 53 Kabanbay Batyr Ave, Nur-Sultan 010000, Republic of Kazakhstan

* Corresponding author. Email addresses: [email protected] (Thien Huynh-The), [email protected] (Cam-Hao Hua), [email protected] (Nguyen Anh Tu), [email protected] (Dong-Seong Kim)

Abstract

For person identification in non-interactive biometric systems, gait recognition has recently been encouraged in the literature and in industrial applications instead of face recognition. Although numerous advanced methods that learn object appearance with conventional machine learning models have been discussed in the last decade, most of them are strongly sensitive to scene background motion. In this research, we address the drawbacks of existing works by comprehensively studying gait information from 3D human skeleton data with a deep learning-based identifier. To capture the static gait information in the spatial dimension, we first extract the geometric gait features of joint distance and orientation. The dynamic gait information is then obtained by calculating temporal descriptive features, namely the mean and standard deviation of the geometric features. Accordingly, the full gait information of an individual is finally learned via a compact Deep Convolutional Neural Network which is explicitly designed with multiple stacks of asymmetric

convolutional filters to fully gain the spatial correlation of in-frame body joints and the temporal relation of frame-wise postures at multiple scales. Based on experimental results evaluated on four benchmark 3D gait datasets commonly used for person identification, including UPCV Gait, UPCV Gait K2, KS20 VisLab Multi-View Kinect Skeleton, and SDUGait, the proposed method presents superior performance over several state-of-the-art approaches while maintaining a low computational cost.

Keywords: 3D gait recognition, person identification, deep convolutional neural network, spatiotemporal gait information.

1. Introduction

In the last decade, person identification has been an active topic that has attracted massive attention from the research and industrial communities, with various practical applications embedded in intelligent surveillance systems [1], including terrorist monitoring, crime prevention, access control, and other on-demand security services. Fundamentally, biometric-based human identification systems can be categorized into two principal groups, i.e., physiological characteristics (e.g., fingerprint, iris, face) and behavioral information (e.g., voice, movement, gait) [2]. Some biometric identifiers require direct interaction between the subject and the data acquisition device (e.g., a fingerprint is analyzed for individual identification through the action of touching an optical sensor), while others can operate automatically from a distance. In popular surveillance systems, the identification process needs to be done without the subject's notice [3]; therefore, visual approaches like face and gait recognition are regularly considered for deployment in public regions (e.g., shopping malls and subway stations) [4]. In particular, gait-based person identification also possesses the advantages of requiring no subject cooperation and being robust against cyberattacks in comparison with other biometric identification technologies [5]. As illustrated in Fig. 1, gait is basically defined as the walking posture of a person, which is portrayed by the regular movement and varying motions of the upper and lower limbs. Consequently, gait recognition refers to a biometric identification technique that is able to identify a person based on the information of walking posture. Compared with face identification, gait recognition mitigates the adverse effects of illumination, object occlusion, and poor visibility caused by the long distance from camera to subject [6].

Figure 1: Illustration of a complete cycle of the human gait with eight events defined as follows: (a) initial contact, (b) opposite toe off, (c) heel rise, (d) opposite initial contact, (e) toe off, (f) feet adjacent, (g) tibia vertical, and (h) next initial contact.

According to gait analysis, much meaningful biomedical information (such as gender, age, and race) can be acquired thoroughly. Additionally, several skeleton-related diseases can be diagnosed earlier through gait monitoring and examination. These preeminent outcomes have enabled gait-based person identification to be broadly applicable in the domains of forensics, security, and robotics [7, 8].

Current gait-based person identification methods can be classified into two categories, i.e., appearance-based and human pose-based approaches. The former studies visual gait features extracted directly from image sequences of walking. Particularly, silhouette features of the human body are mainly exploited to learn the gait information. As a baseline algorithm widely used in visual-based surveillance systems thanks to its simplicity and low latency, Gait Energy Image (GEI) [9] calculates the mean of silhouette images as a gait feature for learning a model attached with Bayesian classification. However, this technique is sensitive to the quality of the silhouette extracted from a scene [10] by conventional object detection schemes (whose performance is strongly influenced by background noise such as dynamic background motion, shadow, and illumination changes [11, 12, 13]). Several advanced GEI-based approaches have been introduced to improve silhouette quality by modeling a standard human gait and also taking advantage of depth images. Besides that, other practical issues like varying viewpoints and walking speeds have been handled in the literature. Meanwhile, the latter strategy focuses on learning human pose and skeleton data for gait recognition, in which 2D/3D posture information is either obtained by human pose estimation algorithms or directly captured from depth camera


Figure 2: Overview of the proposed gait-based person identification method using 3D geometric feature and its statistics followed by a convolutional neural network.

devices [14, 15, 16]. Most traditional 3D gait recognition methods offer two key contributions: feature description (wherein several handcrafted features such as walking speed, gait cycle, joint rotation, and angle are extracted from skeleton data to describe personal gaits) and model learning (wherein various classification algorithms are developed to learn an identification model from different gaits). Unfortunately, those handcrafted features are unable to fully characterize the body joint correlations in the space-time domain, especially when coupled with conventional machine learning (ML) methods, which accordingly yields mediocre performance.

In order to address the aforementioned drawbacks of the existing work in terms of effectively learning gait features, this study proposes a deep learning-based person identification method which proficiently analyzes 3D human pose in the spatiotemporal domain. Specifically, from the raw 3D human skeleton data, we calculate the geometric characteristics of each frame as the spatial feature (including joint Euclidean distance and joint orientation) and the descriptive statistics of the geometric properties over multiple frames as the temporal feature (including mean and standard deviation) to exhaustively describe the movement of human limbs. A specific Deep Convolutional Neural Network (DCNN) with asymmetric convolutional filters is subsequently designed as a person identifier to apprehend the intra-class relations, inter-class correlations, and cross-class associations of those geometric features.

Following the multiple stacks of convolutional and pooling layers, high-level informative features represented at various scales can be comprehensively acquired. In brief, the main contributions of this research are summarized as follows:

• A hierarchical geometric feature extraction scheme is studied to capture the dynamics of human pose in the spatial and temporal domains.
• An efficient DCNN containing multiple stacks of asymmetric convolutional filters is introduced to thoroughly learn the deep correlations at multi-level feature maps.
• The proposed method attains superior recognition performance over state-of-the-art approaches on four public 3D gait-based person identification datasets.

The rest of this paper is arranged as follows. State-of-the-art gait-based person identification approaches are reviewed in Section II. The proposed 3D person identification method is comprehensively presented in Section III, wherein the spatiotemporal statistics of geometric features are exhaustively handled by the proposed DCNN. The experiments and conclusion are finally given in Sections IV and V, respectively.

2. Related Work

2.1. Appearance-based Approaches

In the last decade, the traditional appearance-based gait recognition approach has been intensively studied for person identification in many works in the fields of computer vision and pattern recognition. Several methods have exploited GEI as the baseline feature for explaining subject appearance and movement in the time domain. To effectively handle the challenge of viewpoint variation, advanced feature description and transformation techniques, for instance, Frame Different Frieze Pattern [17], the Geometric View Transformation Model (GVTM) [18], and Area Average Distance (AAD) [19], have been developed to enhance the robustness of gait recognition under cross-view scenarios. In [18], a subject-independent warping field is trained with a freestyle distortion framework which allows transforming geometric gait features from two different viewpoints to an intermediate viewpoint for subject registration in the training stage. Compared with several existing

approaches, the GVTM substantially handles the corruption of gait features thanks to spatial proximity prevention. Meanwhile, Wang and Yan [19] introduce a novel static-gait feature based on the average distance for dealing with various view-angle conditions. The extracted feature vectors are then fed as the individual input of each Hidden Markov Model (HMM) for gait model learning. In another research, Yu et al. [20] take into account the two issues of multi-view conversion and multi-model learning by developing an Artificial Neural Network (ANN)-based auto-encoder model for invariant gait extraction. In order to be deployed in most surveillance systems, this approach needs to be extended to many other viewing directions (e.g., a looking-down view) instead of the current frontal-to-side view. Additionally, discriminative representations of visual gait features are subsequently learned by various machine learning models, from traditional classification methods [6, 21] to modern deep learning (DL) techniques [22, 23]. Chen et al. [6] train the selective gait features obtained from a Latent Conditional Random Field model with an attribute-constrained Support Vector Machine (SVM) to improve multi-gait recognition accuracy. Recently, the Convolutional Neural Network (a.k.a. CNN or ConvNet), a narrow class of Deep Neural Networks (DNNs) which has presented impressive performance in several computer vision tasks (e.g., image classification and semantic segmentation), has been taken into account for gait-based person identification. For example, the gait information in [22] is analyzed as multi-scale feature maps when feeding the GEI representation of human locomotion into a deep network, of which the architecture has two multi-stage CNNs specifically designed for gait verification and person identification. In [23], Babaee et al. address the problem of gait cycle corruption with a deep learning-based approach which can work effectively with only some available frames. Particularly, a CNN is developed with several auto-encoders working independently for the purpose of repairing a broken GEI to achieve a full gait cycle. It is realized that most appearance-based approaches are highly sensitive to environment illumination, varying viewpoints, and object occlusion because the identification model is learned from local visual features of GEI [24].

2.2. Human Pose-based Approaches

With the advancement of depth sensor technology and human pose estimation algorithms [25], pose-based person identification [26] and action recognition [27] have attracted more attention in the literature thanks to the benefits of high accuracy and easy implementation. Different from the

appearance-based approaches, where the gait information is learned from an image sequence, most human pose-based person identification techniques process the 3D coordinates of body joints collected from depth sensors. Hence, human movement can be depicted by various kinds of geometric features, such as joint-joint angle [28, 29, 30] and distance [31, 32, 33, 34, 35], besides the primitive anthropometric feature [36]. For example, Kastaniotis et al. [28] calculated the Euler angle features of some typically selected skeletal joints. These trajectory features are then projected into a highly discriminative dimension to construct dissimilarity vectors before being modeled via sparse representation. In another study, Guan et al. [30] measured the angle constructed by the global center point of all body joints and two local center points of some body joints. The set of angle features is refined with some common feature selection techniques (e.g., variance, Chi-square test, mutual information) to eliminate outliers before being passed to an ensemble learner. The limitations of this approach are the high complexity and sensitive performance of ensemble learning, wherein the final class is synthesized from many individual classifiers following the majority voting mechanism for decision fusion. Besides the angle feature, Khamsemanan et al. [33] computed the Euclidean distances of the full 20-joint skeleton at two consecutive frames to measure the pose dynamics. To solve the multi-viewpoint problem of gait recognition for person identification, the authors further introduced the Center-of-Body (CoB) relative coordinate system for skeleton representation. The method is robust under some common scenarios (e.g., fixed-direction and freestyle walking); however, the correlation between static and dynamic features in the spatial and temporal dimensions is missing. This leads to confusion in person identification under short-term observation, caused by the walking posture similarity of two different individuals. Additionally, some post-processing schemes can be applied to strengthen feature variability and discrimination, for example, transforming features to a Hilbert space using Riemannian and Euclidean kernels [29] and feature-level fusion (a.k.a. early fusion) [32]. The 3D geometric features, which are extracted to capture temporal body movement over multiple frames, are then learned by various classification methods ranging from conventional techniques [37, 32], such as the Radial Basis Function network, Dynamic Bayesian Networks, Decision Tree (DT), Random Forest (RF), and k-Nearest Neighbor (k-NN), to modern deep learning [38]. In the study [31], Rahman et al. adopted a k-NN classifier on the mean and variance of geometric distance features in a gait cycle; however, this approach suffers from within-class gait cycle

variation. Since the individual gait cycles of different people are dissimilar (i.e., a gait cycle typically depends on many factors such as gender, age, and emotion), incorrect cycle determination can result in identification uncertainty. In the recent work [34], Arseev et al. studied two recognition tasks: (i) gait recognition using k-NN and (ii) person identification by transfer learning from the pre-trained VGG-16 model. For gait recognition, ensemble learning with a large number of k-NN classifiers is adopted to deal with the high-dimensional feature vector extracted from the Kinect skeleton data. Meanwhile, the identification task is handled via appearance learning accomplished by a CNN model. The approach can potentially boost the performance of the person identification task thanks to an accurate gait recognition model, but this sequential processing flow takes so much computational resource that it is nearly impossible to deploy in real-time systems. Since most of the existing DL works for gait-based person identification study subject appearance [39], DL for 3D gait information remains an active research topic due to the lack of a well-organized data manipulation approach. Compared with the appearance information obtained from color images/videos, the 3D walking skeleton carries more gait characteristic information that is useful for individual identification, especially if learned by modern DL techniques such as the Recurrent Neural Network [40] and CNN.

3. The Proposed Approach

In this section, from the 3D human skeleton information either collected by an RGB-D camera or estimated by a human pose algorithm, we first calculate the hierarchical spatiotemporal pose features, including the geometric features of joint-joint distance and orientation (considered as low-level features in the proposed method) followed by the descriptive statistical features in terms of mean and standard deviation (represented as high-level features). Next, the extracted features are arranged into a high-dimensional matrix, which is then fed into a DCNN developed for the task of person identification. Following the feedforward procedure in the proposed deep network, the descriptive statistics of geometric features are exhaustively enriched for the substantial discrimination of target identities. Correspondingly, the overall processing flow of the proposed 3D gait-based person identification method is demonstrated in Fig. 2.


3.1. Spatiotemporal Pose Feature Extraction

In general, 3D human skeleton information can either be obtained from RGB videos with the involvement of several impressive human pose estimation algorithms or be directly acquired by a depth camera like the Microsoft Kinect sensor. Given a full set of skeletal joints S = {p_{i=1:n}}, where n is the number of body joints, each joint is defined as a point with coordinates (x, y, z) in the 3D space R^3. Different from some existing approaches, where the distance of two certain points is computed in 3D space to represent human appearance, we calculate the Euclidean distance in the three planes Oyz, Ozx, and Oxy, corresponding to the planes x = 0, y = 0, and z = 0, respectively. In more detail, the distance values of two joints i and j are calculated as follows:

$$
\delta_x(i,j) = \sqrt{(y_j - y_i)^2 + (z_j - z_i)^2},\quad
\delta_y(i,j) = \sqrt{(x_j - x_i)^2 + (z_j - z_i)^2},\quad
\delta_z(i,j) = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2}. \tag{1}
$$

The distance feature of two arbitrary joints is formed as d = [δ_x δ_y δ_z]. Another important geometric feature widely used to explain human gait is joint orientation, which is defined by the angle between the joint-joint vector and an original axis. Concretely, the angles between the joint-joint vector \vec{ji} and the three axes (i.e., the horizontal one \vec{Ox} = ⟨1, 0, 0⟩ on the plane z = 0, the vertical one \vec{Oy} = ⟨0, 1, 0⟩ on the plane x = 0, and the depth one \vec{Oz} = ⟨0, 0, 1⟩ on the plane y = 0) are defined as follows:

$$
\varphi_x\!\left(\vec{ji},\vec{Oy}\right) = \cos^{-1}\frac{\vec{ji}\cdot\vec{Oy}}{\|\vec{ji}\|\,\|\vec{Oy}\|},\quad
\varphi_y\!\left(\vec{ji},\vec{Oz}\right) = \cos^{-1}\frac{\vec{ji}\cdot\vec{Oz}}{\|\vec{ji}\|\,\|\vec{Oz}\|},\quad
\varphi_z\!\left(\vec{ji},\vec{Ox}\right) = \cos^{-1}\frac{\vec{ji}\cdot\vec{Ox}}{\|\vec{ji}\|\,\|\vec{Ox}\|}, \tag{2}
$$

where (·) refers to the dot product of two vectors and \|\vec{ji}\| is the length (a.k.a. magnitude) of the vector \vec{ji} = ⟨x_i − x_j, y_i − y_j, z_i − z_j⟩. The orientation feature of two arbitrary joints is then exposed in the form g = [φ_x φ_y φ_z]. With the n-joint skeleton S in the t-th frame, we retrieve n(n − 1) distance and orientation features in total and arrange them into a lower-level feature matrix, denoted F_lower(S), as follows:

$$
F^{t}_{lower}(S) = \begin{bmatrix} d_1 & d_2 & d_3 & \dots & d_k \\ g_1 & g_2 & g_3 & \dots & g_k \end{bmatrix}, \tag{3}
$$

where k = n(n − 1)/2. It is worth noting that F_lower has the size of 2 × k × 3. At this point, a person could theoretically be identified through human posture by passing F_lower to a classifier for frame-wise recognition; however, this is a practical challenge when facing the posture variants of diverse subjects. Obviously, the posture information in one frame is not enough to provide accurate identification. Hence, monitoring a person over multiple frames of a predefined duration is an effective solution for realistic identification systems, wherein the personal gait can be thoroughly analyzed in the temporal dimension. Let S = {S^t} be the sequence of N skeletons of a person in the duration t. Accordingly, the set of geometric features accumulated through multiple frames is expressed as follows:

$$
F = \left\{ F^{t}_{lower}(S) \right\}. \tag{4}
$$
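For clarity, a minimal NumPy sketch of the per-frame computation in Eqs. (1)-(3) is given below. It is an illustrative reading of the formulas rather than the authors' released implementation, and the helper name lower_level_features is ours.

```python
import numpy as np

def lower_level_features(joints):
    """Per-frame geometric features from an (n, 3) array of joint coordinates.

    Returns F_lower with shape (2, k, 3), where k = n*(n-1)/2:
    row 0 holds the plane-projected distances d = [dx, dy, dz],
    row 1 holds the joint-vector/axis angles   g = [phi_x, phi_y, phi_z].
    """
    n = joints.shape[0]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    F_lower = np.zeros((2, len(pairs), 3))
    axes = np.eye(3)                          # Ox, Oy, Oz unit vectors
    for idx, (i, j) in enumerate(pairs):
        diff = joints[j] - joints[i]          # (xj-xi, yj-yi, zj-zi)
        # Eq. (1): Euclidean distances in the planes x = 0, y = 0, z = 0
        F_lower[0, idx] = [
            np.hypot(diff[1], diff[2]),       # delta_x on Oyz
            np.hypot(diff[0], diff[2]),       # delta_y on Ozx
            np.hypot(diff[0], diff[1]),       # delta_z on Oxy
        ]
        # Eq. (2): angles between the joint-joint vector ji and Oy, Oz, Ox
        v = joints[i] - joints[j]             # vector ji
        norm = np.linalg.norm(v) + 1e-12      # guard against zero length
        F_lower[1, idx] = [
            np.arccos(np.clip(v @ axes[1] / norm, -1.0, 1.0)),  # phi_x
            np.arccos(np.clip(v @ axes[2] / norm, -1.0, 1.0)),  # phi_y
            np.arccos(np.clip(v @ axes[0] / norm, -1.0, 1.0)),  # phi_z
        ]
    return F_lower
```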

Note that the feature set F can be organized as a 4-D matrix with the size of (2 × k × 3) × N. In order to coordinate the time-series data in the geometric feature matrices, two descriptive statistical features, i.e., mean and standard deviation, are utilized for the construction of the so-called high-level features. The statistical features for both the distance and orientation characteristics are calculated as follows:

$$
\mu_{\delta_{\{x,y,z\}}} = \frac{1}{N}\sum_{t=1}^{N}\delta^{t}_{\{x,y,z\}}, \qquad
\mu_{\varphi_{\{x,y,z\}}} = \frac{1}{N}\sum_{t=1}^{N}\varphi^{t}_{\{x,y,z\}},
$$
$$
\sigma_{\delta_{\{x,y,z\}}} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(\delta^{t}_{\{x,y,z\}} - \mu_{\delta_{\{x,y,z\}}}\right)^{2}}, \qquad
\sigma_{\varphi_{\{x,y,z\}}} = \sqrt{\frac{1}{N}\sum_{t=1}^{N}\left(\varphi^{t}_{\{x,y,z\}} - \mu_{\varphi_{\{x,y,z\}}}\right)^{2}}, \tag{5}
$$

where μ_δ, μ_φ, σ_δ, and σ_φ refer to the mean and standard deviation values of the distance and orientation features, respectively. All statistical values are arranged into an upper-level feature matrix, denoted F_upper(S), as follows:

$$
F_{upper}(S) = \begin{bmatrix}
m_{d_1} & m_{d_2} & m_{d_3} & \dots & m_{d_k} \\
m_{g_1} & m_{g_2} & m_{g_3} & \dots & m_{g_k} \\
s_{d_1} & s_{d_2} & s_{d_3} & \dots & s_{d_k} \\
s_{g_1} & s_{g_2} & s_{g_3} & \dots & s_{g_k}
\end{bmatrix}, \tag{6}
$$

where the elements m_d = [μ_δ{x,y,z}], m_g = [μ_φ{x,y,z}], s_d = [σ_δ{x,y,z}], and s_g = [σ_φ{x,y,z}] are depth vectors. To this end, a final feature matrix F_upper with the size of 4 × k × 3 is established from a corresponding skeleton sequence. Regarding the 20-joint and 25-joint skeletal patterns recorded by Kinect v1 and v2, the feature volume is 4 × 190 × 3 and 4 × 300 × 3, respectively.
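A corresponding sketch of Eqs. (5)-(6), reusing the illustrative lower_level_features helper above, could look as follows; note that np.std defaults to the population deviation (1/N), which matches Eq. (5).

```python
import numpy as np

def upper_level_features(sequence):
    """Sequence-level descriptive statistics, Eqs. (5)-(6).

    sequence: array of shape (N, 2, k, 3) stacking per-frame F_lower matrices.
    Returns F_upper of shape (4, k, 3): [mean_d; mean_g; std_d; std_g].
    """
    mu = sequence.mean(axis=0)        # (2, k, 3): rows m_d and m_g
    sigma = sequence.std(axis=0)      # (2, k, 3): rows s_d and s_g
    return np.concatenate([mu, sigma], axis=0)

# Example: a 30-frame sequence of 20-joint skeletons -> F_upper of 4 x 190 x 3
frames = np.stack([lower_level_features(np.random.rand(20, 3))
                   for _ in range(30)])
print(upper_level_features(frames).shape)     # (4, 190, 3)
```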

3.2. Person Identification Model Learning

For each skeleton sequence, the gait information of a subject is represented by the descriptive statistics of geometric features, which are capable of characterizing the spatial body joint correlations in the temporal domain. Based on these informative features, a person with a corresponding identity can then be recognized by a classification model. In this section, a compact CNN is developed to comprehensively learn the underlying representation of spatiotemporal pose features for the task of person identification. Adopting the transfer learning technique for several modern CNNs (such as VGG-16, GoogleNet, ResNet, and DenseNet) can lead to overfitting while occupying expensive computational resources.

Figure 3: Architecture of the DCNN designed for the proposed method.

As shown in Fig. 3, the CNN is designed with nine stacks of convolutional, batch normalization, and Rectified Linear Unit (ReLU) layers. By normalizing the neural responses and gradients propagating over the network, the batch normalization layers ease the optimization procedure during the training stage. In addition, using batch normalization layers between the convolutional and ReLU layers reduces the sensitivity to the network's parameter initialization and accelerates training convergence. Spatial pooling is also adopted, with four max-pooling layers and one average-pooling layer placed on top of some convolutional layers as specified in Fig. 3. Subsequently, one fully connected layer followed by a softmax layer is utilized to finalize the acquired features for the target classification. Compared with fine-tuning a pre-trained modern network (e.g., VGG-16, GoogleNet, and Inception-v3), wherein the input data has to be scaled to a pre-specified size, the proposed network is designed to fit the resolution of F_upper appropriately. To capture the intra-class relations (i.e., between features of different pairs of joints within one class) and further reduce the number of parameters, we utilize asymmetric convolutional filters with sizes of 1 × 7, 1 × 5, and 1 × 3. Particularly, in the first stack, 64 feature maps are obtained by filtering the input with 64 1 × 7 kernels. We configure 128 kernels of size 1 × 5 × 64 and 1 × 5 × 128 for the second and third stacks, respectively. The next two stacks subsequently employ 256 1 × 3 × 128 and 256 1 × 3 × 256 kernels. Meanwhile, to acquire the inter-class correlations (i.e., between-class features of a pair of joints) and cross-class associations (i.e., between-class features of different pairs of joints), we adopt 3 × 1 and 3 × 3 filters in the last four stacks. In this way, high-level features represented at multiple scales are extracted effectively. Next, the number of hidden units in the fully connected layer is set to the number of subjects in the benchmark dataset. In summary, the total number of parameters in our DCNN is approximately 8.6M. Details of the network configuration are summarized in Table 1. Note that the output volume of each layer depends on the size of the network's input, which is the feature matrix F_upper. For instance, with a sequence of 20-joint skeletons, where F_upper has the size of 4 × 190 × 3, the output feature map of the last convolutional layer accordingly has the size of 2 × 12 × 1024. With the network structure determined, the training settings are as follows: the designed DCNN is trained for 60 epochs using the stochastic gradient descent with momentum optimizer with an initial learning rate of 0.01 (dropped by 90% after every 20 epochs), and the mini-batch size used in each training iteration is set to 32.
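To make the design concrete, a minimal PyTorch sketch of the nine-stack network is given below; the per-layer hyper-parameters follow Table 1 below, while the class name STCNN and its arguments are ours. This is an illustrative reconstruction, not the authors' code.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, k, pad):
    """One stack: convolution + batch normalization + ReLU."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=k, padding=pad),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class STCNN(nn.Module):
    """Sketch of the nine-stack ST-CNN (hyper-parameters from Table 1)."""
    def __init__(self, num_pairs=190, num_subjects=30):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 64, (1, 7), (0, 0)),
            nn.MaxPool2d(2, stride=(1, 2)),
            conv_block(64, 128, (1, 5), (0, 2)),
            conv_block(128, 128, (1, 5), (0, 2)),
            nn.MaxPool2d(2, stride=(1, 2)),
            conv_block(128, 256, (1, 3), (0, 1)),
            conv_block(256, 256, (1, 3), (0, 1)),
            nn.MaxPool2d(2, stride=(2, 2), padding=(1, 1)),
            conv_block(256, 512, (3, 1), (1, 0)),
            conv_block(512, 512, (3, 1), (1, 0)),
            nn.MaxPool2d(2, stride=(2, 2), padding=(1, 0)),
            conv_block(512, 512, (3, 3), (1, 1)),
            conv_block(512, 1024, (3, 3), (1, 1)),
        )
        ell = 12 if num_pairs == 190 else 18    # Table 1 note: 12 / 18
        self.pool = nn.AvgPool2d((2, ell))
        self.classifier = nn.Linear(1024, num_subjects)

    def forward(self, x):                       # x: (batch, 3, 4, num_pairs)
        x = self.pool(self.features(x))
        return self.classifier(x.flatten(1))    # logits; softmax is in the loss
```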

Table 1: Details of Network Configuration

No.  Layer            Filters  Filter Size  Stride  Padding
1    Input            -        -            -       -
2    Conv             64       1 × 7        (1, 1)  (0, 0)
3    MaxPool          -        2 × 2        (1, 2)  (0, 0)
4    Conv             128      1 × 5        (1, 1)  (0, 2)
5    Conv             128      1 × 5        (1, 1)  (0, 2)
6    MaxPool          -        2 × 2        (1, 2)  (0, 0)
7    Conv             256      1 × 3        (1, 1)  (0, 1)
8    Conv             256      1 × 3        (1, 1)  (0, 1)
9    MaxPool          -        2 × 2        (2, 2)  (1, 1)
10   Conv             512      3 × 1        (1, 1)  (1, 0)
11   Conv             512      3 × 1        (1, 1)  (1, 0)
12   MaxPool          -        2 × 2        (2, 2)  (1, 0)
13   Conv             512      3 × 3        (1, 1)  (1, 1)
14   Conv             1024     3 × 3        (1, 1)  (1, 1)
15   AvgPool          -        2 × ℓ        (1, 1)  (0, 0)
16   Fully Connected  -        -            -       -
17   Softmax          -        -            -       -

Note: ℓ = 12 and ℓ = 18 for the 20-joint and 25-joint skeletal patterns, respectively.
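Under the training settings stated above, the optimization loop can be sketched as follows, reusing the illustrative STCNN class from the previous sketch; the momentum value of 0.9 and the placeholder loader are our assumptions, since the text only specifies SGD with momentum, 60 epochs, mini-batches of 32, and a 90% learning-rate drop every 20 epochs. The assertion reproduces the 2 × 12 × 1024 output volume quoted for the last convolutional layer.

```python
import torch

model = STCNN(num_pairs=190, num_subjects=30)          # 20-joint skeletons
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Drop the learning rate by 90% (x0.1) after every 20 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()                # applies log-softmax

# Sanity check on the output volume quoted in the text: 2 x 12 x 1024.
x = torch.randn(2, 3, 4, 190)                          # two F_upper samples
assert model.features(x).shape == (2, 1024, 2, 12)

# Placeholder mini-batches of (F_upper, label) pairs with batch size 32.
loader = [(torch.randn(32, 3, 4, 190), torch.randint(0, 30, (32,)))]

for epoch in range(60):                                # 60 training epochs
    for features, labels in loader:
        optimizer.zero_grad()
        criterion(model(features), labels).backward()
        optimizer.step()
    scheduler.step()
```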

In brief, the processing flow of the proposed 3D gait-based person identification is portrayed as follows:

1. Given a segmented sequence S consisting of N skeletons of a subject, wherein a full skeleton S has n body joints.
2. Spatiotemporal pose feature extraction:
• For each skeleton, calculate the distance and orientation features d = [δ_{x,y,z}] and g = [φ_{x,y,z}] from the joints p_{i=1:n}.
• Arrange the geometric features into a lower-level feature matrix F_lower = [d_{i=1:k}; g_{i=1:k}] following (3).
• For the multiple frames in a sequence, compute the mean and standard deviation of the geometric features: m_d = [μ_δ{x,y,z}], m_g = [μ_φ{x,y,z}], s_d = [σ_δ{x,y,z}], and s_g = [σ_φ{x,y,z}].
• Structure the descriptive statistics into an upper-level feature matrix F_upper = [m_{d,i=1:k}; m_{g,i=1:k}; s_{d,i=1:k}; s_{g,i=1:k}] following (6).
3. Person identification model learning:
• Design a DCNN specified by several asymmetric convolutional kernels (the network configuration is detailed in Table 1).
• Input the upper-level feature matrix (6) to the deep network for learning the person identification model.
• After the training stage, the recognition model is obtained as an identifier.
4. For a new skeleton sequence of gait collected by a depth camera or obtained by a pose estimation algorithm, calculate F_lower as geometric features and F_upper as descriptive statistics, then pass F_upper to the trained CNN to infer the corresponding identity label (see the sketch below).
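As a compact illustration of step 4, inference over a new sequence can be sketched by chaining the earlier helpers; all names are ours, not from a released codebase.

```python
import numpy as np
import torch

def identify(skeleton_sequence, model):
    """Predict an identity label from an (N, n, 3) skeleton sequence."""
    frames = np.stack([lower_level_features(s) for s in skeleton_sequence])
    f_upper = upper_level_features(frames)               # (4, k, 3)
    # Channels-first tensor of shape (1, 3, 4, k) expected by the sketch CNN.
    x = torch.from_numpy(f_upper).permute(2, 0, 1).unsqueeze(0).float()
    with torch.no_grad():
        return model(x).argmax(dim=1).item()
```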

4. Experimental Results and Discussion

In this section, the proposed 3D gait-based person identification method is evaluated on four benchmark datasets: UPCV Gait [28], UPCV Gait K2 [29], KS20 VisLab Multi-View Kinect Skeleton [36], and SDUGait [37]. Besides several state-of-the-art approaches in the field of 3D person identification, we also build some baselines for performance comparison, wherein conventional classification methods, namely DT, k-NN, SVM, RF, and Naive Bayes (NB), are utilized to learn from the spatiotemporal human pose features proposed in this work.

4.1. Datasets

UPCV Gait: This dataset, recorded by a Kinect sensor v1, serves the task of identity recognition; it contains 150 3D skeleton sequences in total, representing the walking activity of 30 participants (15 males and 15 females). Each subject is asked to walk five times in a straight line at normal speed. Each skeleton is formed by 20 body joints estimated by the Microsoft Kinect SDK. As an evaluation protocol, three samples of each subject are randomly selected for training while the remainder is used for testing.


UPCV Gait K2: This dataset is collected using a Kinect sensor v2, which brings some feature upgrades (e.g., up to 25 joints per skeleton are estimated, more accurately than with the first version), for the task of pose-based person identification, where 30 individuals (17 males and 13 females) are guided to walk in a nearly straight direction. The entire dataset has 300 sequences (each subject performs 10 walks) captured at a frame rate of 30 fps. For the data collection setup, the Kinect is placed 1.2 meters above the ground at a zero-degree angle relative to the walking path. Following the recommendation of the original work [29], we conduct the 10-fold cross-validation procedure on this dataset.

KS20 VisLab Multi-View Kinect Skeleton: This skeleton dataset is acquired using a Kinect sensor v2 at different viewpoints (including left lateral at 0°, left diagonal at 30°, frontal at 90°, right diagonal at 130°, and right lateral at 180°) for long-term person identification. In total, there are 300 sequences gathered from 20 subjects, each of whom performs three walks for every viewpoint. On this dataset, we randomly select 2 sequences per person in a particular viewpoint for training and use the remaining ones for testing.

SDUGait: This dataset presents the walking activity of 52 subjects (28 males and 24 females). Each subject is covered by 20 image sequences with 6 predefined walking directions (i.e., 0, 90, 135, 180, 225, and 270 degrees, wherein the front direction of the camera is specified as 0 degrees) and at least 2 arbitrary directions. In total, there are 1040 videos in the SDUGait dataset. Compared with the others, this dataset is quite challenging due to the rich variety of age and height of the subjects and the variation of walking directions. The dataset is evaluated with the following protocols: SDU-front (the four 180° sequences of each subject are randomly divided into two subsets, with two samples for training and the remainder for testing), SDU-side (three among the six side-direction sequences of 90° and 270° are used for training and the remainder for benchmarking), and SDU-arb (wherein the 16 predefined directional walking sequences are utilized for learning the model and the four remaining sequences of arbitrary direction are used for validating it).

4.2. Experimental Setup

For the pose feature extraction, we set N = 30 (corresponding to a 1-second sliding window of data segmentation at a frame rate of 30 fps) as the number of frames in each skeleton sequence S.
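A minimal sketch of this fixed-length windowing is given below; the stride of 6 frames anticipates the 80% overlap described in the next paragraph, and the helper name is ours.

```python
import numpy as np

def segment_sequence(skeletons, window=30, overlap=0.8):
    """Partition a (T, n, 3) skeleton sample into fixed-length sequences.

    window=30 matches the 1-second window at 30 fps; overlap=0.8 gives a
    stride of 6 frames between consecutive windows.
    """
    stride = max(1, int(round(window * (1.0 - overlap))))
    return [skeletons[s:s + window]
            for s in range(0, len(skeletons) - window + 1, stride)]

# Example: a 120-frame walk yields 16 overlapping 30-frame sequences.
print(len(segment_sequence(np.zeros((120, 20, 3)))))   # 16
```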

To enrich the learning capability of the identification model through diverse gait patterns, we partition a skeleton sample into multiple sequences with an overlapping rate of 80%. Remarkably, the gait cycle is not considered in this work because the cycle lengths of different people, and even of the same person, are essentially dissimilar. Therefore, learning a large number of gait patterns (specified by numerous fixed-length skeleton sequences with a high overlapping ratio) can thoroughly exploit the parameter-based memorizing capability of deep learning. Briefly, three principal experiments are given as follows:

• In the first experiment, we compare the identification accuracy of the proposed method with several baselines and state-of-the-art approaches on the four aforementioned datasets.
• The second experiment focuses on analyzing the performance of the proposed method under various settings of the hyper-parameters used for gait representation.
• The last experiment evaluates the complexity of the proposed network in competition with several modern existing CNNs.

Notably, we report the average result of 20 runs for the randomly-split experiments (UPCV Gait, KS20 VisLab Multi-View Kinect Skeleton, and SDUGait) and of 10 runs for the 10-fold cross-validation experiments (UPCV Gait K2).

4.3. Results and Discussion

Regarding the three experiments, the numerical results are given along with some discussion to deeply analyze the strengths and also the drawbacks of our proposed approach.

4.3.1. Method Comparison

For the two UPCV datasets of 3D gait-based individual identification, we compare the proposed method, denoted Fupper + ConvNet (ST-CNN), with several state-of-the-art approaches, such as Covariance Dissimilarity [26], Sparse Representation-based Classification (SRC) in Dissimilarity Space [27], Kernel Matrix Euclidean [29], Kernel Matrix Riemannian [29], Euclidean-Riemannian Fusing [29], and ST-Large Margin Nearest Neighbor [35], of which the quantitative results are reported in Table 2.

Figure 4: Confusion matrices of person identification results on three datasets: (a) UPCV Gait, (b) UPCV Gait K2, and (c) KS20 VisLab Multi-View Kinect Skeleton.

Furthermore, some baselines of the traditional approach (combinations of the descriptive statistics of geometric features with various traditional classification techniques) and of the DL approach (convolutional networks, i.e., ConvNets [22, 23]) are additionally given. It is observed that Fupper + ConvNet shows the best recognition rate, i.e., 98.86% and 99.65% on UPCV Gait and UPCV Gait K2, respectively. The corresponding confusion matrices are plotted in Figs. 4a and 4b. Despite learning from the same feature set Fupper, our deep network (ST-CNN) outperforms DT, k-NN, SVM, RF, and NB in the task of person recognition thanks to the power of the sequential stacks of asymmetric convolutional kernels in capturing high-level features of gait motion via multi-scale representation.

Table 2: Method Comparison on UPCV Gait and UPCV Gait K2 (Accuracy, %)

Method                                   UPCV Gait  UPCV Gait K2
Fupper + DT                              77.35      89.31
Fupper + k-NN                            97.03      97.59
Fupper + SVM                             97.48      98.00
Fupper + RF                              96.11      96.97
Fupper + NB                              76.89      92.31
Covariance Dissimilarity [26]            89.60      90.17
SRC in Dissimilarity Space [27]          94.50      94.63
Kernel Matrix Euclidean [29]             94.67      96.17
Kernel Matrix Riemannian [29]            92.00      93.33
Euclidean-Riemannian Fusing [29]         95.67      97.05
ST-Large Margin Nearest Neighbor [35]    98.10      NA
Fupper + ConvNet (1FC) [23]              98.17      99.19
Fupper + ConvNet (2FC) [22]              98.50      99.33
Fupper + ConvNet (ST-CNN)                98.86      99.65

Note: Fupper denotes the descriptive statistics of geometric features.

Meanwhile, DT and NB introduce the worst recognition rates (lower than ST-CNN by about 21.5% and 10.3% in the UPCV Gait and UPCV Gait K2 experiments, respectively) due to the overfitting issue of the single tree and the insubstantial feature correlation learning of the probabilistic model. As an upgraded version of DT, RF is more robust than a single tree since more random trees are aggregated to efficiently generalize the classification model. Among the conventional techniques in this experiment, both k-NN and SVM identify persons with very high accuracy (up to 98.00%). With respect to Covariance Dissimilarity [26], where a subject is identified by comparing the covariance metrics of skeleton trajectories between the training and testing sets, Fupper + ConvNet improves significantly, by approximately 9.3%. This covariance-based approach is limited by a low bias, which makes it difficult for the model to deal with new skeleton data. SRC in Dissimilarity Space [27], which learns the Euler angles of primitive skeletons in a projected coordinate system using the Sparse Representation-based Classification in Dissimilarity Space scheme, outperforms Covariance Dissimilarity by around 5.0%. However, the Euler angle is not informative enough to intensively describe human gait, which causes some mis-identifications in practical medium and large datasets. In another work, Kastaniotis et al. [29] fuse the gait information from divergent geometric features in the Hilbert space using the Euclidean

and Riemannian kernels for the computation of the feature histogram and covariance matrix, respectively. The sparse representation of the fused mutual information is then learned by an SRC model to heighten the inter-subject variability and enhance discrimination. Accordingly, this research is investigated following three strategies (i.e., Euclidean, Riemannian, and the combination of the two kernels), in which the Euclidean-based strategy identifies a person better than the Riemannian one by nearly 2.7% and 2.9% for UPCV Gait and UPCV Gait K2, respectively. In the case of combining the two kernels for higher-level feature representation, the recognition rate is improved by 1-3.72%. Despite exploiting two kernels simultaneously, this approach is still held back by Euler angles for the aforementioned reason. ST-Large Margin Nearest Neighbor [35] reaches 98.10% by learning Riemannian geometric features in the spatial and temporal domains via an advanced k-nearest neighbor algorithm. Compared with the two DL-based approaches, wherein ConvNets are developed with several stacks of convolutional layers followed by one fully connected layer (1FC) [23] or two fully connected layers (2FC) [22] for the identity recognition task, ST-CNN recognizes more accurately with a slightly higher rate. Concretely, the proposed network reaches higher accuracy by 0.69% and 0.46% for UPCV Gait and UPCV Gait K2, respectively. Yao et al. [22] deployed different convolutional kernels of 11 × 11, 5 × 5, and 3 × 3; meanwhile, Babaee et al. [23] adopted a single 4 × 4 kernel throughout the network. With three convolutional layers for deep feature extraction, both ConvNets easily achieve notable performance in terms of identification accuracy when compared with the conventional ML approaches. Besides the classification techniques for learning different personal gait models, the geometric features extracted from 3D skeleton sequences are significantly influential in person identification. The proposed approach, which comprehensively makes use of higher-level descriptive statistical features derived from their lower-level geometric counterparts, in combination with a modern deep network, has demonstrated superiority in terms of identification accuracy.

The identification results of the proposed method and other existing approaches on KS20 VisLab Multi-View Kinect Skeleton are shown in Table 3. Compared with the UPCV datasets, this dataset is more challenging due to the multi-view configuration for skeleton data acquisition. In detail, our Fupper + ConvNet achieves an identification rate of 87.63% (with the corresponding confusion matrix graphed in Fig. 4c), which is the second-best result, lower than Context-Aware Score-level Fusion [36] and Fupper + SVM by approximately 1%.

Table 3: Method Comparison on KS20 VisLab Multi-View Kinect Skeleton

Method                                     Accuracy (%)
Fupper + DT                                56.19
Fupper + k-NN                              77.49
Fupper + SVM                               88.67
Fupper + RF                                86.25
Fupper + NB                                59.79
Context-Unaware Score-level Fusion [36]    79.33
Context-Aware Score-level Fusion [36]      88.67
Fupper + ConvNet (1FC) [23]                86.59
Fupper + ConvNet (2FC) [22]                87.42
Fupper + ConvNet (ST-CNN)                  87.63

Note: Fupper denotes the descriptive statistics of geometric features.

Table 4: Method Comparison on SDUGait (Accuracy, %)

Method                         SDU-front  SDU-side  SDU-arb  Average
Fupper + DT                    84.42      70.90     48.37    67.90
Fupper + k-NN                  92.40      81.44     56.27    76.71
Fupper + SVM                   95.19      86.92     71.56    84.56
Fupper + RF                    95.10      87.56     67.02    83.23
Fupper + NB                    86.73      78.69     61.44    75.62
Matching-level Fusion [32]     93.84      83.91     52.06    76.60
Two-stage LM [24]              98.08      87.82     72.60    86.17
CNN + LSTM [38]                90.14      91.67     82.51    88.11
Fupper + ConvNet (1FC) [23]    94.66      88.21     75.50    85.32
Fupper + ConvNet (2FC) [22]    96.25      88.33     76.90    87.16
Fupper + ConvNet (ST-CNN)      98.17      92.98     81.97    91.04

Note: Fupper denotes the descriptive statistics of geometric features.

Adopting the DT and NB classification algorithms on the final feature vector Fupper is not recommended due to their poor performance (below 60%). Following the one-versus-one coding design (i.e., constructed from many L2-norm binary SVM models), the multiclass SVM yields the best accuracy on this dataset thanks to its generalization capability, which prevents the model from overfitting. Obviously, the performance of deep learning on small-sized datasets is fundamentally not an advantage compared with conventional techniques like SVM. In [36], Nambiar et al. present a score-level fusion with the supplementary information of viewpoint context (remarked as Context-Aware), wherein multiple context-specific classifiers

are trained individually with refined anthropometric-gait features and then fused based on binary-weighting classification scores, which reaches the highest rate of 88.67%. In the other strategy, without the viewpoint information (i.e., Context-Unaware), the identification rate degrades significantly, by 9.34%. Because the viewpoint information of the gait sequence is not used in our experiments, the accuracy of Fupper with the traditional ML and innovative ConvNet models is worse than that of Context-Aware Score-level Fusion by at least 1.03% (when compared with ST-CNN). In other words, [36] improves the overall accuracy of the identification system by utilizing the viewpoint information as an extra useful context besides the skeletal contexts, in which the viewpoint-wise models are fused at the classifier level. Meanwhile, in the proposed approach, all samples are passed into a single model for learning without the involvement of viewpoint context. It is worth noting that ST-CNN surpasses the two ConvNets [22, 23] on this dataset, although the gap in recognition rate is small.

The recognition results on the SDUGait dataset (regarding the three evaluation protocols comprising SDU-front, SDU-side, and SDU-arb) are reported in Table 4. Being more challenging than KS20 VisLab Multi-View Kinect Skeleton, this dataset has a larger number of subjects for identification besides arbitrary walking directions. In Table 4, ST-CNN is consistently better than several traditional ML-based models and the two recent ConvNets [22, 23] when using the same feature set Fupper. For example, ST-CNN outperforms SVM (the most efficient ML algorithm in the test) and the ConvNet [22] (which consists of two fully connected layers) by averages of 6.48% and 3.88%, respectively. Among the ML algorithms considered in the experiments, DT reports the worst accuracy, significantly worse than the proposed ConvNet model, by 33.61% under the SDU-arb protocol, because of the poor intra-class correlation. In [32], the identity is recognized based on a matching-level fusion of static and dynamic features, wherein the fusion is performed at the score level with automatically determined k-NN classifier weights. Inspired by a frame-level matching mechanism, the two-stage linear matching (LM) method proposed by Choi et al. [24] calculates a discriminative score (including similarity and margin) between the input skeleton and registered skeletons to protect the matching performance from scratchy patterns. Following the SDU-front protocol, two-stage LM reaches over 98% accuracy, but it shows performance degradation under the SDU-arb evaluation configuration. As a shortcoming, the frame-wise matching patterns are highly sensitive to the 3D viewpoint variations in the

Table 5: Method Performance Analysis on Different Feature Categories (Accuracy, %)

Feature in Use                        Description        UPCV   UPCV K2  KS20   SDU
Mean of geometric                     {md, mg}           97.03  99.32    86.60  89.55
Std of geometric                      {sd, sg}           93.82  98.26    85.40  88.91
Mean and std of distance              {md, sd}           97.25  98.10    86.08  90.62
Mean and std of orientation           {mg, sg}           95.42  98.97    84.02  89.90
Descriptive statistics of geometric   {md, mg, sd, sg}   98.86  99.65    87.63  91.04

Table 6: Recognition Rates on Various Existing CNN Models (Accuracy, %)

Network             UPCV   UPCV K2  KS20   SDU
VGG-16              97.49  98.83    85.12  89.90
GoogleNet           98.05  99.19    87.33  90.76
Inception-v3        98.67  99.75    87.90  92.18
ResNet101           98.86  99.47    88.05  91.87
ConvNet (ST-CNN)    98.86  99.65    87.63  91.04

scenario of free walking with arbitrary directions. By fusing the deep features of a CNN and a Long Short-Term Memory (LSTM) network to concurrently learn the visual information from the skeleton GEI image and the temporal correlation from dynamic features, CNN + LSTM [38] achieves a notable average accuracy of 88.11%, the runner-up score (lower than Fupper + ConvNet (ST-CNN) by 2.93%). Despite surpassing the proposed model by 0.54% in the SDU-arb test, CNN + LSTM is strongly defeated in the SDU-front configuration, by approximately 8.00%, due to the redundant information encoded by the CNN from the skeleton GEI image.

4.3.2. Performance Sensitivity

In the second experiment, we analyze the influence of the spatiotemporal features and the deep network in use on the overall person identification rate. The recognition results of Fupper + ConvNet using four kinds of features, including the mean {md, mg} and standard deviation {sd, sg} of the geometric features, and those of the distance feature {md, sd} and orientation feature {mg, sg}, on the four evaluation datasets are delivered in Table 5. From the achieved performance, the mean feature is generally superior to the standard deviation (e.g., higher by 3.21% in the UPCV Gait test). In addition,

the orientation feature is mostly worse than its distance counterpart (for instance, lower by 0.72% and 2.06% in the experiments with SDUGait and KS20 VisLab Multi-View Kinect Skeleton, respectively). Meanwhile, utilizing the full feature set (i.e., all the descriptive statistics of geometric features) for gait motion explanation boosts the overall identification rate by 0.33-5.04%, wherein the correlations of the same and different features (i.e., distance vs. orientation and mean vs. standard deviation) of the same and different pairs of joints are fully revealed by ST-CNN.

The proposed approach is further evaluated against various modern CNNs (including VGG-16, GoogleNet, Inception-v3, and ResNet101), typically designed for the image classification task, to verify the performance of the designed ST-CNN. Note that these pre-trained CNNs are adapted to the person identification task via the transfer learning technique. The average results of multiple runs on the four datasets are summarized in Table 6. Generally, there is a small gap in identification rate between our ST-CNN and the other modern CNNs. ST-CNN achieves the highest accuracy in the UPCV test and is surpassed by Inception-v3 (by less than 0.3%) and ResNet101 (by less than 0.42%) in the UPCV K2 and KS20 tests, respectively. Despite its compact size, ST-CNN with asymmetric convolutional filters allows learning the spatial correlation of in-frame body joints and the temporal relation of frame-wise postures. As a principal limitation, some modern CNNs suffer from overfitting during the training stage because a large model (i.e., one with a huge number of parameters) is learned on a small dataset.

4.3.3. Complexity Analysis

In the last experiment, we benchmark the network complexity via the training-time metric on a system equipped with an NVIDIA GeForce GTX 1080 Ti. The results measured on the UPCV Gait K2 dataset are plotted in Fig. 5, in which our ST-CNN obtains the second-best processing time (slightly longer than GoogleNet, by around 5 minutes, and much shorter than VGG-16, Inception-v3, and ResNet101). It is worth noting that all networks are benchmarked with the same configuration (e.g., trained in 60 epochs with a mini-batch size of 32). The proposed approach takes nearly 76 ms to recognize every 30 frames (including the pose feature calculation and the identity prediction by ST-CNN), which is well suited to many general vision systems (wherein cameras operate at 30 fps). Obviously, the design of a compact-size ST-CNN significantly reduces not only the computational

Figure 5: Complexity comparison of deep networks via the training time metric.

cost but also the memory consumption, which shortens the training time of ST-CNN compared with several modern CNNs like Inception-v3 and ResNet101.

5. Conclusion

In this paper, we have studied an efficient 3D gait-based person identification approach by analyzing human pose dynamics in the spatial and temporal domains using a purpose-designed deep learning model. Particularly, the descriptive statistical features of a gait sequence are computed by accumulating the geometric distance and orientation features through multiple frames. The identification model is then learned by a DCNN constructed from multiple stacks of asymmetric convolutional filters, which is capable of extracting intra-class relations, inter-class correlations, and cross-class associations simultaneously from the representation of multi-scale feature maps. According to the experimental results, the proposed method achieves remarkable identification performance on the UPCV Gait (98.86%), UPCV Gait K2 (99.65%), and KS20 VisLab Multi-View Kinect Skeleton (87.63%) datasets and mostly outperforms the state-of-the-art approaches. Our future work will focus on improving the accuracy of the identification system by studying more advanced 3D pose features, as well as upgrading the DCNN model without increasing the computational cost, for deployment in visual-based surveillance and biometric systems.


Author contributions

Thien Huynh-The: elaborated the methodology; conceptualized the technical content; performed the experiments and analysis; wrote the paper.
Cam-Hao Hua: elaborated the methodology; conceptualized the technical content; wrote the paper.
Nguyen Anh Tu: conceptualized the technical content; contributed data or analysis tools; wrote the paper.
Dong-Seong Kim: contributed data or analysis tools; performed the experiments and analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This research was financially supported by the National Research Foundation of Korea (NRF) through the Creativity Challenge Research-based Project (2019R1I1A1A01063781), and in part by the Priority Research Centers Program through the NRF funded by the Ministry of Education, Science and Technology (2018R1A6A1A03024003).

References

[1] H. Liu, S. Chen, N. Kubota, Intelligent video systems and analytics: A survey, IEEE Transactions on Industrial Informatics 9 (3) (2013) 1222-1233.

[2] J. P. Singh, S. Jain, S. Arora, U. P. Singh, Vision-based gait recognition: A survey, IEEE Access 6 (2018) 70497-70527.
[3] H. Yao, S. Zhang, R. Hong, Y. Zhang, C. Xu, Q. Tian, Deep representation learning with part loss for person re-identification, IEEE Transactions on Image Processing 28 (6) (2019) 2860-2871.
[4] Z. Liu, Z. Zhang, Q. Wu, Y. Wang, Enhancing person re-identification by integrating gait biometric, Neurocomputing 168 (2015) 1144-1156.
[5] Z. Wu, Y. Huang, L. Wang, X. Wang, T. Tan, A comprehensive study on cross-view gait based human identification with deep cnns, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (2) (2017) 209-226.
[6] X. Chen, J. Weng, W. Lu, J. Xu, Multi-gait recognition based on attribute discovery, IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (7) (2018) 1697-1710.
[7] Z. Zeng, Z. Li, D. Cheng, H. Zhang, K. Zhan, Y. Yang, Two-stream multirate recurrent neural network for video-based pedestrian re-identification, IEEE Transactions on Industrial Informatics 14 (7) (2018) 3179-3186.
[8] W. Chi, J. Wang, M. Q.-H. Meng, A gait recognition method for human following in service robots, IEEE Transactions on Systems, Man, and Cybernetics: Systems 48 (9) (2018) 1429-1440.
[9] D. S. Matovski, M. S. Nixon, J. N. Carter, Gait recognition, in: Computer Vision, A Reference Guide, 2014, pp. 309-318.
[10] C. Hua, T. Huynh-The, S. Lee, Convolutional networks with bracket-style decoder for semantic scene segmentation, in: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2018, pp. 2980-2985.
[11] T. Huynh-The, O. Banos, S. Lee, B. H. Kang, E. Kim, T. Le-Tien, NIC: A robust background extraction algorithm for foreground detection in dynamic scenes, IEEE Transactions on Circuits and Systems for Video Technology 27 (7) (2017) 1478-1490.

[12] N. A. Tu, T. Huynh-The, K. U. Khan, Y. Lee, ML-HDP: A hierarchical bayesian nonparametric model for recognizing human actions in video, IEEE Transactions on Circuits and Systems for Video Technology 29 (3) (2019) 800-814.
[13] G. Wang, J. Lai, P. Huang, X. Xie, Spatial-temporal person re-identification, (2019) 8933-8940.
[14] T. Huynh-The, B.-V. Le, S. Lee, Y. Yoon, Interactive activity recognition using pose-based spatiotemporal relation features and four-level pachinko allocation model, Information Sciences 369 (2016) 317-333.
[15] T. Huynh-The, C.-H. Hua, D. Kim, Encoding pose features to images with data augmentation for 3d action recognition, IEEE Transactions on Industrial Informatics (2019) 1-1.
[16] T. Huynh-The, C.-H. Hua, T.-T. Ngo, D.-S. Kim, Image representation of pose-transition feature for 3d skeleton-based action recognition, Information Sciences 513 (2020) 112-126. doi:https://doi.org/10.1016/j.ins.2019.10.047.
[17] M. Shinzaki, Y. Iwashita, R. Kurazume, K. Ogawara, Gait-based person identification method using shadow biometrics for robustness to changes in the walking direction, in: 2015 IEEE Winter Conference on Applications of Computer Vision, 2015, pp. 670-677.
[18] H. El-Alfy, C. Xu, Y. Makihara, D. Muramatsu, Y. Yagi, A geometric view transformation model using free-form deformation for cross-view gait recognition, in: 2017 4th IAPR Asian Conference on Pattern Recognition (ACPR), 2017, pp. 929-934.
[19] X. Wang, W. Q. Yan, Cross-view gait recognition through ensemble learning, Neural Computing and Applications (2019) 1-13.
[20] S. Yu, H. Chen, Q. Wang, L. Shen, Y. Huang, Invariant feature extraction for gait recognition using only one uniform model, Neurocomputing 239 (2017) 81-93.
[21] W. Zeng, C. Wang, View-invariant gait recognition via deterministic learning, Neurocomputing 175 (2016) 324-335.

[22] L. Yao, W. Kusakunniran, Q. Wu, J. Zhang, Z. Tang, Robust cnn-based gait verification and identification using skeleton gait energy image, in: Digital Image Computing: Techniques and Applications (DICTA), 2018, pp. 1–7. [23] M. Babaee, L. Li, G. Rigoll, Person identification from partial gait cycle using fully convolutional neural networks, Neurocomputing 338 (2019) 116 – 125. [24] S. Choi, J. Kim, W. Kim, C. Kim, Skeleton-based gait recognition via robust frame-level matching, IEEE Transactions on Information Forensics and Security 14 (10) (2019) 2577–2592. [25] A. Borgia, Y. Hua, E. Kodirov, N. M. Robertson, Gan-based pose-aware regulation for video-based person re-identification (2019) 1175–1184. [26] M. S. N. Kumar, R. V. Babu, Human gait recognition using depth camera: a covariance based approach, in: Eighth Indian Conference on Computer Vision, Graphics and Image Processing, 2012. [27] I. Theodorakopoulos, D. Kastaniotis, G. Economou, S. Fotopoulos, Pose-based human action recognition via sparse representation in dissimilarity space, Journal of Visual Communication and Image Representation 25 (1) (2014) 12 – 23. [28] D. Kastaniotis, I. Theodorakopoulos, C. Theoharatos, G. Economou, S. Fotopoulos, A framework for gait-based recognition using kinect, Pattern Recognition Letters 68 (2015) 327 – 335. [29] D. Kastaniotis, I. Theodorakopoulos, G. Economou, S. Fotopoulos, Gait based recognition via fusing information from euclidean and riemannian manifolds, Pattern Recognition Letters 84 (2016) 245 – 251. [30] G. Guan, T. Yang, W. Liu, Gait recognition with skeleton information by using ensemble learning, in: 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), 2017, pp. 1–7. [31] M. W. Rahman, M. L. Gavrilova, Kinect gait skeletal joint feature-based person identification, in: 2017 IEEE 16th International Conference on 29

Cognitive Informatics Cognitive Computing (ICCI*CC), 2017, pp. 423– 430. [32] J. Sun, Y. Wang, J. Li, W. Wan, D. Cheng, H. Zhang, View-invariant gait recognition based on kinect skeleton feature, Multimedia Tools Appl. 77 (19) (2018) 24909–24935. [33] N. Khamsemanan, C. Nattee, N. Jianwattanapaisarn, Human identification from freestyle walks using posture-based gait feature, IEEE Transactions on Information Forensics and Security 13 (1) (2018) 119–128. [34] S. Arseev, A. Konushin, V. Liutov, Human recognition by appearance and gait, Programming and Computer Software 44 (2018) 258–265. [35] Y. Su, Z. Feng, M. Xing, Spatio-temporal large margin nearest neighbor (st-lmnn) based on riemannian features for individual identification, in: 2018 IEEE International Conference on Multimedia and Expo (ICME), 2018, pp. 1–6. [36] A. Nambiar, A. Bernardino, J. C. Nascimento, A. Fred, Context-aware person re-identification in the wild via fusion of gait and anthropometric features, in: 12th IEEE International Conference on Automatic Face Gesture Recognition (FG 2017), 2017, pp. 973–980. [37] Y. Wang, J. Sun, J. Li, D. Zhao, Gait recognition based on 3d skeleton joints captured by kinect, in: 2016 IEEE International Conference on Image Processing (ICIP), 2016, pp. 3151–3155. [38] Y. Liu, X. Jiang, T. Sun, K. Xu, 3d gait recognition based on a cnnlstm network with the fusion of skegei and da features, in: 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2019, pp. 1–8. [39] C. Carley, E. Ristani, C. Tomasi, Person re-identification from gait using an autocorrelation network, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019, pp. 1–9. [40] H. Wang, L. Wang, Learning content and style: Joint action recognition and person identification from human skeletons, Pattern Recognition 81 (2018) 23 – 35. 30

THIEN HUYNH-THE received the B.S. degree in electronics and telecommunication engineering and the M.Sc. degree in electronics engineering from Ho Chi Minh City University of Technology and Education, Vietnam, in 2013. He received the Ph.D. degree in computer science and engineering from Kyung Hee University (KHU), Republic of Korea, in 2018, where he was awarded the Superior Thesis Prize. He is currently a Post-Doctoral Research Fellow with the ICT Convergence Research Center at Kumoh National Institute of Technology, Republic of Korea. His current research interests include digital image processing, computer vision, and machine learning.

CAM-HAO HUA received the B.S. degree in Electrical and Electronic Engineering from Bach Khoa University, Ho Chi Minh City, Vietnam, in 2016. He is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering, Kyung Hee University, Gyeonggi, Republic of Korea. His research interests are image processing, computer vision, and deep learning.


NGUYEN ANH TU received the B.S. degree in Electrical and Electronics Engineering from Ho Chi Minh City University of Technology, Vietnam, in 2010, and the Ph.D. degree in Computer Science and Engineering from Kyung Hee University, Republic of Korea, in 2018. He is currently an Assistant Professor in the Department of Computer Science, School of Engineering and Digital Sciences, Nazarbayev University. Prior to joining Nazarbayev University, he was a postdoctoral fellow at the Data Knowledge Engineering Lab, Kyung Hee University. His current research interests include computer vision, machine learning, image retrieval, and big data processing.

DONG-SEONG KIM received his Ph.D. degree in Electrical and Computer Engineering from Seoul National University, Seoul, Korea, in 2003. From 1994 to 2003, he worked as a full-time researcher at ERC-ACI, Seoul National University, Seoul, Korea. From March 2003 to February 2005, he was a postdoctoral researcher at the Wireless Network Laboratory, School of Electrical and Computer Engineering, Cornell University, NY. From 2007 to 2009, he was a visiting professor with the Department of Computer Science, University of California, Davis, CA. He is currently the director of the KIT Convergence Research Institute and the ICT Convergence Research Center (ITRC and NRF advanced research center program) supported by the Korean government at Kumoh National Institute of Technology. He is a senior member of IEEE and ACM. His current main research interests are real-time IoT and smart platforms, industrial wireless control networks, and networked embedded systems.
