Pose Depth Volume extraction from RGB-D streams for frontal gait recognition

Pratik Chattopadhyay, Aditi Roy, Shamik Sural, Jayanta Mukhopadhyay

School of Information Technology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
Article history: Available online 4 March 2013

Keywords: Frontal gait recognition; Microsoft Kinect; Depth registered silhouette; Pose Depth Volume; RGB-D stream; Silhouette; Voxel volume; Key pose
Abstract

We explore the applicability of Kinect RGB-D streams in recognizing gait patterns of individuals. Gait energy volume (GEV) is a recently proposed feature that performs gait recognition in the frontal view using only depth image frames from Kinect. Since depth frames from Kinect are inherently noisy, the corresponding silhouette shapes are inaccurate, often merging with the background. We register the depth and RGB frames from Kinect to obtain smooth silhouette shapes along with depth information. A partial volume reconstruction of the frontal surface of each silhouette is done and a novel feature termed Pose Depth Volume (PDV) is derived from this volumetric model. Recognition performance of the proposed approach has been tested on a data set captured using Microsoft Kinect in an indoor environment. Experimental results clearly demonstrate the effectiveness of the approach in comparison with other existing methods.

© 2013 Elsevier Inc. All rights reserved.
1. Introduction

Gait has been studied extensively as a biometric feature in recent years. An advantage of gait recognition is that, unlike other biometric methods such as fingerprint recognition, iris scanning and face recognition, the gait of a subject can be recognized from a distance without active participation of the subject. This is because detailed texture information is not required in gait recognition. The main aim of gait recognition is to capture the position variation of human limbs during walking, and this can be done using binary silhouettes extracted from images that need not be of very high quality. Over the years, while model based and appearance based gait recognition have received significant attention from researchers, appearance based gait recognition entirely from the frontal view has not been given much focus. The use of depth cameras in gait recognition is also quite rare. In this paper, we concentrate on gait recognition from the frontal view (frontal gait recognition) only. An advantage of frontal gait recognition is that walking videos captured from this viewpoint do not suffer from the self-occlusion due to hand swings which prevails in the fronto-parallel view. Also, since the camera is positioned right in front of a walking person, videos can be captured in narrow corridor-like situations as well. However, a disadvantage of frontal
gait recognition is that binary silhouettes extracted from RGB video frames cannot represent which limb (left/right) of a walking person is nearer to the camera and which one is behind. Thus, pose ambiguity cannot be adequately resolved, leading to incorrect gait recognition. This information deficiency is not present in depth images, where depth values indicate whether the right limb is forward and the left limb is backward or the other way round. Variation of depth in limb positions together with variation of shape is an important element of frontal gait recognition. Recently developed depth cameras like Kinect [1,2] can efficiently capture the depth variation in different human body parts while walking. But the depth video frames so obtained are quite noisy, as a result of which the extracted object silhouettes are often not clean. In contrast, silhouettes extracted from the RGB video frames are much cleaner, but their shape variation over a gait cycle is not enough for the extraction of useful gait features. In order to capture both shape and depth information in a single frame, we combine information from both the RGB and the depth video streams from Kinect to derive a new gait feature. Each silhouette from the depth frame of Kinect is projected into the RGB frame coordinates using a standard registration procedure, thereby forming a silhouette in transformed space which we term a depth registered silhouette. Previously, registration of Kinect depth and RGB frames has been used for 3D reconstruction using depth videos captured from multiple views of an object [4]. However, to the best of our knowledge, no gait recognition method exists which fuses both RGB and depth information for deriving gait features. It may be noted that there is no publicly available frontal gait database with both color and depth video frames of walking persons recorded simultaneously.
So, we have built a new database using Microsoft Kinect by capturing walking sequences of 30 individuals. The proposed gait feature is termed Pose Depth Volume (PDV). It is derived from a partial volumetric reconstruction of each depth registered silhouette. First, a certain number of depth key poses are estimated from a set of training samples, and each frame of an entire walking sequence of a subject is classified into an appropriate depth key pose. A PDV is constructed corresponding to each such pose by averaging the voxel volumes of all the frames which belong to that pose. Thus, the number of PDVs of each subject is the same as the number of depth key poses. Each voxel in a PDV indicates the number of times an object voxel occurs in that position for that particular depth key pose within a complete gait cycle. A classifier is trained with gait cycles of subjects in the training data set, and a different gait cycle is used for testing the accuracy of recognition.

The rest of the paper is organized as follows. Section 2 introduces the Kinect RGB-D camera and the basic functionality of its different parts. A brief background study on gait is also included in this section. Section 3 illustrates the sequence of steps followed in deriving our proposed gait feature. Positioning of the Kinect camera and construction of the data set, together with experimental results, are presented in Section 4. Finally, Section 5 concludes the paper and points out future scope of work.

2. Background

2.1. Basics of RGB-D Kinect camera

RGB-D cameras [2,3] are useful for providing depth and color information of an object simultaneously. Kinect, developed by Microsoft, is one such camera [1]. It captures depth information through its infrared projector and sensor. The infrared laser emitted by Kinect draws a structured pattern on the object surface. The infrared camera senses the depth from this pattern using a technology based on the structured light principle. Apart from the infrared projector and sensor, Kinect also has a color video camera and an array of microphones. The color camera returns RGB frames while the microphone array helps in audio capture. A detailed description of Kinect functionality is given in [1]. For our application, we use the RGB and depth video streams from Kinect for deriving gait features.
2.2. Gait recognition literature review

Gait recognition approaches are broadly categorized as model based [8,17], appearance based [10–12,14–16] and spatiotemporal based [13]. Model based methods try to capture the kinematics of human motion. Although these approaches are invariant to viewpoint and scaling, the requirements of high computational overhead and very good quality silhouettes have limited their use for practical purposes. In contrast, appearance based methods directly extract useful features from binary silhouettes without building any model, and the complexity of implementation of appearance based approaches is also much less compared to model based techniques. Very high quality binary silhouettes are not required in appearance based methods and the computational burden is also less, making them suitable for practical use. Spatiotemporal based gait recognition considers spatial as well as temporal domain information.

Temporal template based gait features are a type of appearance based gait recognition that has been used more in recent times due to its robustness against random noise and simplicity of implementation. It started with the development of two features, namely motion energy image (MEI) and motion history image (MHI) [7]. Later, gait energy image (GEI), another appearance based method, was proposed by Han and Bhanu [10], which compresses one entire gait cycle into a single image by averaging all the silhouettes over the gait cycle. Each pixel in a GEI has a gray level value indicating the number of times an object pixel occurs in that position over an entire gait cycle. But, as a result of averaging, two pixels in a GEI may have the same gray levels even if the object pixels in these positions do not occur simultaneously. Thus, GEI captures the intrinsic dynamic characteristics of walking with less resolution. Enhanced gait energy image (EGEI) [14], frame difference energy image (FDEI) [15] and active energy image (AEI) [16] were proposed to address this problem of loss of kinematic information and also to make the feature more robust against different clothing conditions, object carrying conditions, etc. Dynamic gait motion features can be detected with greater accuracy with pose energy image (PEI) [20]. PEI has been shown to achieve better performance, at the cost of greater execution time, than GEI and its variants. In PEI, rather than averaging silhouettes over an entire gait cycle, a fixed number of key poses are extracted from the gait cycle and silhouette averaging is done over all silhouettes that belong to each key pose. Basically, these key poses are representatives of the entire gait cycle and are derived in a manner such that significant variations are present between any two key poses. Each pixel gray level in a PEI indicates the number of times an object pixel occurs in that position for that particular pose.

Most of the previous approaches used either the side view of walking or stereo views from multiple angles. More emphasis was given to the fronto-parallel view rather than the frontal view, since a fronto-parallel walking sequence carries the most information about the gait of an individual [6]. Few attempts have been made towards gait recognition solely from the frontal view [18,19]. An approach for model based frontal gait recognition using a spherical space model is described in [19]. But the only appearance based gait recognition from the frontal view developed so far is gait energy volume (GEV) [18]. GEV first projects the pixels of each depth image frame of an entire gait cycle into world coordinates to create a 3D voxel volume and then computes the average of all the voxel volumes obtained over a gait cycle. Since averaging is done over all the frames in a gait cycle, the ability to recognize dynamic variation in walking is affected to a great extent in GEV. Again, as silhouette shapes obtained from depth frames are noisy, directly projecting the pixels of depth frames into 3D loses substantial information about the actual shape of the object.

3. Gait recognition using Pose Depth Volume
In this section, we describe a new feature called Pose Depth Volume (PDV). The applicability of the feature is tested on videos captured by Microsoft Kinect. Instead of using depth videos directly as done in the case of GEV, we combine RGB and depth information from Kinect to obtain better silhouettes along with depth information. To capture the intrinsic dynamics of gait better than GEV, we divide an entire gait cycle into a number of depth key poses. Averaging of voxel volumes is done over all the frames belonging to a particular depth key pose. The steps followed in deriving the proposed feature are described in detail in the following subsections.

3.1. Combining information from Kinect color and depth video frames

During the recording phase, subjects walk in front of a static background. RGB video frames and corresponding depth video frames are recorded at a fixed rate. The tilt angle of the Kinect and the height at which it is placed are suitably adjusted so that at least one full gait cycle of a walking individual is always captured by both the depth and the RGB cameras of Kinect. An example RGB frame and its corresponding depth frame are shown in Fig. 1(a) and (b), respectively.
Fig. 1. (a) RGB video frame (b) Depth video frame (c) Binary silhouette extracted from RGB video frame (d) Binary silhouette extracted from depth video frame (e) Depth registered silhouette.
Object extraction from Kinect RGB and depth frames is done using background subtraction. Since the Kinect camera works in an indoor environment, unlike outdoor video capture, there is not much change in the illumination levels of different frames. For each Kinect RGB frame, a self-organizing feature map based approach to background subtraction [21] is adopted for extracting the object silhouette from the background. It makes use of the HSV (Hue, Saturation, Value) color space and also efficiently handles the shadow elimination problem. Since the only moving object in front of the static background is the walking person, the background subtraction procedure generates quite accurate and clean silhouettes. The object extracted from the RGB frame of Fig. 1(a) is shown in Fig. 1(c) after applying the above mentioned background subtraction procedure. The silhouettes extracted from the RGB frames do not have any depth information. On the other hand, silhouettes extracted from the depth frames are usually noisy with broken contours, as shown in Fig. 1(d). This is due to the fact that, while there is a significant depth difference between the background and the frontal surface of a walking subject, the lower portions of the limbs have almost the same depth values as the background (i.e., the floor). Further, differential absorption of the infrared rays by different body parts results in a certain degree of noise in the depth frames. We, therefore, combine both RGB and depth video frames in order to
preserve and effectively fuse the information available from the two. The goal is to obtain a frame with a clean silhouette having the same shape as that of the silhouette extracted from the RGB video frame, with the depth value of each pixel within the silhouette derived from the corresponding depth video frame. This can be done after registering each depth frame with its corresponding RGB frame. Registration is needed since there is a misalignment between the corresponding RGB and depth frames captured by Kinect [24]. We use a registration procedure similar to that described in [23] to align each frame from the depth camera with its corresponding frame from the RGB camera. The sequence of operations applied on a pixel in a depth frame to register it with the RGB frame coordinates is given in Algorithm 1. The registration procedure is briefly described next. The raw depth data returned by Kinect is of 16 bits, of which the first 11 bits represent the depth of an object point from the camera [23,24]. The actual distance D(x_d, y_d) (in meters) of a pixel (x_d, y_d) is obtained from the depth value d(x_d, y_d) of the pixel as shown in Step 1 of Algorithm 1. The values of the constants a_1 and a_2 as originally mentioned in [23] are quite appropriate for any Kinect and have been used in other Kinect calibration work as well [24]. After the distance computation is completed, the pixel (x_d, y_d) is projected to the point P_d = (Px_d, Py_d, Pz_d) in the depth camera coordinate system in Steps 2-4. The constants f_x, f_y, c_x and c_y mentioned in Algorithm 1
are intrinsic camera parameters. Suffixes d and c denote parameters for the depth and RGB cameras, respectively. Step 5 transforms the 3D point P_d = (Px_d, Py_d, Pz_d) to a point P_c = (Px_c, Py_c, Pz_c) in the RGB camera coordinate system using the rotation matrix R and translation matrix T. Steps 6 and 7 project the 3D point P_c = (Px_c, Py_c, Pz_c) so obtained to the RGB frame coordinate system (x_c, y_c) in order to get a registered depth frame in RGB frame coordinates. The silhouette obtained after registering the depth frame of Fig. 1(b) with the silhouette of Fig. 1(c) is shown as a gray level image in Fig. 1(e).

Algorithm 1. Depth and RGB frame registration

Input: Pixel (x_d, y_d) of the segmented silhouette from the depth frame I_d, depth value d(x_d, y_d) at pixel (x_d, y_d), depth camera intrinsics (fx_d, fy_d, cx_d, cy_d), color camera intrinsics (fx_c, fy_c, cx_c, cy_c), rotation matrix R, translation matrix T estimated during stereo calibration, and floating point constants a_1 and a_2.
Output: Pixel (x_c, y_c) mapped to color frame coordinates.
Begin:
1. D(x_d, y_d) <- 1.0 / (a_1 * d(x_d, y_d) + a_2)
2. Px_d <- (x_d - cx_d) * D(x_d, y_d) / fx_d
3. Py_d <- (y_d - cy_d) * D(x_d, y_d) / fy_d
4. Pz_d <- D(x_d, y_d)
5. P_c <- R * P_d + T
6. x_c <- Px_c * fx_c / Pz_c + cx_c
7. y_c <- Py_c * fy_c / Pz_c + cy_c
End
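A minimal Python sketch of Algorithm 1 is given below for illustration. The intrinsic parameters, the stereo extrinsics R and T, and the constants a_1 and a_2 are assumed to come from a prior calibration such as [23]; the numeric values shown here are placeholders, not the calibration used in this work.

```python
import numpy as np

# Placeholder calibration values; in practice these come from stereo
# calibration of the depth and RGB cameras (see [23,24]).
A1, A2 = -0.00307, 3.33                                # depth-to-metres constants (assumed)
FX_D, FY_D, CX_D, CY_D = 594.2, 591.0, 339.3, 242.7     # depth intrinsics (assumed)
FX_C, FY_C, CX_C, CY_C = 529.2, 525.6, 329.0, 267.5     # RGB intrinsics (assumed)
R = np.eye(3)                                          # depth-to-RGB rotation (assumed)
T = np.array([0.025, 0.0, 0.0])                        # depth-to-RGB translation in metres (assumed)

def register_pixel(xd, yd, d):
    """Map one depth-frame pixel (xd, yd) with raw 11-bit depth value d
    to RGB frame coordinates, following Steps 1-7 of Algorithm 1."""
    D = 1.0 / (A1 * d + A2)                 # Step 1: raw depth -> distance in metres
    Pd = np.array([(xd - CX_D) * D / FX_D,  # Step 2
                   (yd - CY_D) * D / FY_D,  # Step 3
                   D])                      # Step 4
    Pc = R @ Pd + T                         # Step 5: depth camera -> RGB camera frame
    xc = Pc[0] * FX_C / Pc[2] + CX_C        # Step 6: perspective projection to RGB frame
    yc = Pc[1] * FY_C / Pc[2] + CY_C        # Step 7
    return xc, yc
```

Applying this mapping to every silhouette pixel of a depth frame yields the depth registered silhouette used in the remainder of the pipeline.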
It can be observed from the figure that the registered depth frame obtained after applying the registration procedure of Algorithm 1 still contains some noise in a few regions. These regions are filled using morphological reconstruction [22]. Since only the depth values of the neighboring pixels are utilized in filling the noisy pixels, the depth value of a noisy pixel obtained after morphological reconstruction is sufficiently accurate. Finally, a median filter with a window size of 3 × 3 is applied to remove noise on the boundary of the silhouette.

Since the frontal view of walking is recorded for each subject, as a person moves towards the Kinect, the height and width of the silhouette in the RGB and depth frame sequences increase. Thus, the height and width of the depth registered silhouette also increase gradually as the distance of each point on the frontal surface of the silhouette from the Kinect decreases. Constructing temporal template based gait features from silhouette sequences with varying height and width leads to inaccuracy, since these methods deal with averaging of silhouette sequences over time [10,20]. Unless the silhouettes are normalized, the averaging process would not capture the desired information. We next describe how shape normalization of depth registered silhouettes is done as the subject moves towards the RGB-D camera. Let the height and width of the subject in the depth registered silhouette be denoted by h and w, respectively. The coordinates of the centroid of the silhouette are (x_c, y_c), where x_c = h/2 and y_c = w/2. Each pixel within the silhouette is projected to three dimensional space using the raw depth value of that pixel. The effect of varying depth due to distance from the Kinect is neutralized by applying a translation to each projected pixel along the depth dimension, so as to effectively bring the pixel corresponding to the centroid of each such silhouette to a fixed distance from the Kinect. Thus, if D(x_d, y_d) is the distance of the object point corresponding to the pixel (x_d, y_d) on the depth registered silhouette, then the translation to be applied along the depth axis on the three dimensional point corresponding to (x_d, y_d) is given by
    D'(x_d, y_d) = D(x_d, y_d) - D(x_c, y_c) + L_1,                                   (1)

where D(x_c, y_c) is the distance of the object point corresponding to the centroid of the bounding box and L_1 is a fixed distance from the Kinect. This generates a distance normalized depth registered 3D silhouette, which is resized to attain a fixed height H and a fixed width W.
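The hole filling, boundary smoothing and distance normalization steps described above can be sketched in Python as follows. This is a simplified illustration, assuming the registered depth silhouette is available as a 2D array of metric distances with zeros for background; scipy's hole filling and median filtering stand in for the morphological reconstruction of [22].

```python
import numpy as np
from scipy import ndimage

def clean_and_normalize(depth_sil, L1=2.5):
    """depth_sil: 2D array of distances in metres (0 = background).
    Returns a hole-filled, median-smoothed silhouette whose centroid
    depth has been shifted to the fixed distance L1, as in Eq. (1)."""
    mask = depth_sil > 0
    # Fill small interior holes (crude stand-in for morphological reconstruction [22]).
    filled_mask = ndimage.binary_fill_holes(mask)
    holes = filled_mask & ~mask
    neighbour_med = ndimage.median_filter(depth_sil, size=3)
    depth_sil = np.where(holes, neighbour_med, depth_sil)
    # 3 x 3 median filter to smooth the silhouette boundary.
    depth_sil = ndimage.median_filter(depth_sil, size=3)
    mask = depth_sil > 0
    # Distance normalization: D' = D - D(centroid) + L1 (Eq. (1)).
    rows, cols = np.nonzero(mask)
    xc, yc = int(rows.mean()), int(cols.mean())   # centroid of the silhouette
    shift = L1 - depth_sil[xc, yc]
    normalized = np.where(mask, depth_sil + shift, 0.0)
    # The silhouette would subsequently be resized to a fixed H x W (not shown).
    return normalized
```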
3.2. Eigen space projection and depth registered silhouette classification

In the proposed approach, we aim to represent the entire gait cycle using a fixed number of key poses [20]. The basic motivation behind dividing a gait cycle into key poses is to capture the dynamic characteristics of gait at a finer level of granularity. Further, any noise or distortion in a few frames does not affect the entire gait feature. Instead, the impact is restricted to only those poses to which the affected frames belong. The rest of the features remain unaffected. It may be noted that in [20], key pose estimation is done using binary silhouette sequences extracted from training sequences in the fronto-parallel view. In contrast, we estimate key poses using depth registered silhouettes and denote them as depth key poses. Since depth information is not present in binary silhouettes, it can be asserted that depth key poses extracted from depth registered silhouettes are more accurate than those from binary silhouettes, as the former contain information about the relative distance between different body parts.

It may be argued that the depth registered silhouette sequences obtained after the registration procedure described in Section 3.1 could be directly used to find the depth key poses. But then, the input vectors used to perform clustering would have very large dimensions, thereby substantially slowing down the whole procedure. So, the size of the original data set is reduced using principal component analysis (PCA) [25] to generate depth eigen silhouettes. Constrained K-Means clustering is next applied on the depth eigen silhouettes to determine the depth key poses. The number of clusters (K) used in K-Means clustering is determined by plotting a rate distortion curve [9]. The rate distortion curve obtained using the set of gait cycles present in the training data set is shown in Fig. 2. It is observed that the distortion coefficient with respect to the number of clusters significantly reduces after K = 6 and becomes almost constant after K = 12. The K centroids obtained after completion of the clustering procedure represent the K depth key poses.
Fig. 2. Rate distortion curve (distortion coefficient vs. number of clusters).
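The eigen space projection and key pose estimation just described can be sketched as follows. This is a minimal illustration assuming each depth registered silhouette has been flattened into a row vector; for brevity it uses scikit-learn's ordinary K-Means rather than the constrained variant whose ordering restriction is discussed in the next paragraph, and the distortion values it returns are what the rate distortion curve of Fig. 2 plots.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def estimate_depth_key_poses(silhouettes, n_components=30, k_range=range(2, 19)):
    """silhouettes: (n_frames, H*W) array of flattened depth registered
    silhouettes from the training gait cycles.
    Returns the fitted PCA model, a dict of KMeans models indexed by K,
    and the distortion value for each K (for choosing K from the curve)."""
    pca = PCA(n_components=n_components)
    eigen_sils = pca.fit_transform(silhouettes)        # depth eigen silhouettes

    distortions, models = {}, {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(eigen_sils)
        distortions[k] = km.inertia_ / len(eigen_sils)  # mean within-cluster distortion
        models[k] = km

    # In the paper K is read off the knee of the distortion curve (K = 8 is used);
    # here the curve is simply returned and the caller chooses K.
    return pca, models, distortions
```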
In the cluster assignment step of K-Means clustering, the transition of a particular eigen silhouette vector between clusters is constrained by imposing the condition that reassignment from the ith cluster is valid only to the (i + 1)th or the (i - 1)th cluster. In addition to this constraint, the order of transitions between the frames of a complete walking sequence is also checked, to ensure that the index of the cluster assigned to the previous frame is not higher than the index of the cluster assigned to the current frame. If a frame X_j is assigned to the ith cluster while the previous frame X_{j-1} is assigned to the (i + 1)th cluster, the cluster indices are reassigned so that a proper ordering of frames is maintained.

Once the depth key poses are estimated, each depth registered silhouette is classified following a graph-based path search algorithm based on dynamic programming [5]. The problem of assigning a depth key pose is formulated as finding the most likely path in a directed acyclic graph. The formulation is such that if the number of depth key poses is K and the ith frame is assigned to depth key pose j, then if j is less than K, the (i + 1)th frame can be assigned only to depth key pose j or depth key pose j + 1. But if j is equal to K, the (i + 1)th frame must be assigned either to depth key pose j or to the first depth key pose. Assigning the (i + 1)th frame to the first depth key pose indicates completion of a gait cycle at the ith frame. If the frames of an input sequence do not get assigned to every depth key pose at least once, it is concluded that the sequence does not contain a full gait cycle and gait recognition cannot be done. The constrained K-Means clustering and the classification of silhouettes into depth key poses, along with their algorithmic representation, are given in more detail in [20]. The depth key poses obtained after completion of the clustering procedure for K = 8 are shown in Fig. 3. A block diagram of the procedure used for estimating
depth key poses and classifying a test depth registered silhouette sequence is shown in Fig. 4.

3.3. Pose Depth Volume

This section describes the detailed methodology for extracting the proposed gait feature from the depth key poses and assigning a class label to each depth registered silhouette. Since only a single Kinect camera has been used in this work, a full 3D volume reconstruction is not possible, as only one view (the frontal view) of each walking subject is captured. Instead, a partial volume reconstruction of the frontal surface of an object is possible using the depth data stream provided by Kinect. We utilize this depth information to create a 3D voxel volume. Features extracted directly from the depth registered silhouette do not correctly capture the swing of human limbs at different phases of a gait cycle. For example, the frames of a gait cycle belonging to one key pose are not all alike. Often there is a depth difference at the same point of observation on a walking subject in two different frames even if both of them belong to the same pose. Unless the gait features are constructed in 3D, this type of variation cannot be accurately modeled. The method for voxel volume construction is described in detail in Section 3.3.1 and the construction of the PDV feature is presented in Section 3.3.2. A block diagram for the construction of PDV and human recognition using PDV is presented in Fig. 5.

3.3.1. Voxel volume construction

Let the size of each depth registered silhouette be W × H. A binary voxel volume is created for each such depth registered silhouette. A voxel volume V is defined to be a cuboid having a
Fig. 3. Eight depth key poses obtained after K-Means clustering.
Fig. 4. Flow chart for finding depth key poses and classification of depth registered silhouette.
Fig. 5. Flow chart for human recognition using Pose Depth Volume as gait feature.
dimension W × H × L_max, where L_max is the maximum depth value along the Z axis. We first initialize the binary voxel volume to have a value of zero for each voxel. Given an input depth registered silhouette image, each pixel on the silhouette (excluding the background) can be projected to a point in world coordinates. Let us consider a particular pixel (X, Y) of the depth registered silhouette image I. Suppose the depth value at (X, Y) is d. Let D' be the distance of the object point corresponding to (X, Y). The value of D' can be obtained from the depth d by applying Step 1 of Algorithm 1. Rather than projecting each pixel to world coordinates, the projection from depth value to three dimensional voxel coordinates is done in such a way that the object point corresponding to a depth value of 0 is mapped to z = 0 and the object point corresponding to the maximum depth value is mapped to z = L_max. Once the 3D point (X, Y, D') in the voxel volume is obtained, we assign a fixed binary value to this voxel and fill all the voxels from (X, Y, D') to (X, Y, L_max) along the Z axis with the same value. Hence, according to our construction, if (x_d, y_d) is an object pixel in the depth registered silhouette, then

    0 <= D'(x_d, y_d) <= L_max,                                                        (2)

and

    V(x_d, y_d, z) = 1,   ∀ z ∈ [D'(x_d, y_d), L_max].                                 (3)

If (x_d, y_d) is a background pixel,

    V(x_d, y_d, z) = 0,   ∀ z ∈ [1, L_max].                                            (4)
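A small numpy sketch of this construction is given below, under the assumption that the normalized silhouette is an H × W array of quantized depths in [0, L_max], with background pixels marked by a negative value.

```python
import numpy as np

L_MAX = 2047  # maximum value of an 11-bit Kinect depth sample

def build_voxel_volume(depth_sil):
    """depth_sil: (H, W) array of quantized depths in [0, L_MAX],
    with background pixels set to -1.
    Returns a binary volume V of shape (H, W, L_MAX + 1) in which, for
    every object pixel, all voxels from its depth D' up to L_MAX are
    set to 1, as in Eqs. (2)-(4)."""
    H, W = depth_sil.shape
    V = np.zeros((H, W, L_MAX + 1), dtype=np.uint8)
    ys, xs = np.nonzero(depth_sil >= 0)      # object pixels only
    for y, x in zip(ys, xs):
        d = int(depth_sil[y, x])
        V[y, x, d:] = 1                      # fill (X, Y, D') .. (X, Y, L_MAX)
    return V
```

As a practical note, the depth axis would typically be re-quantized to far fewer than 2048 levels so that the resulting volumes, and the PDV features built from them, remain of manageable size.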
For the 11 bit depth representation, which is the standard in Kinect, the value of L_max is 2047.

3.3.2. Feature construction

In Section 3.2, we have described the procedure used for classifying an input depth registered silhouette. Given a test sequence, the depth key pose to which each frame of the sequence belongs is, therefore, known. Let frames F_1, F_2, ..., F_p of an input silhouette sequence map to the kth depth key pose S_k. Let V_1, V_2, ..., V_p be the aligned 3D voxel volumes corresponding to frames F_1, F_2, ..., F_p. The mean of all the voxel volumes belonging to the kth depth key pose is termed the PDV of the kth depth key pose and is denoted by PDV_k. Mathematically,

    PDV_k = (1/p) Σ_{j=1}^{p} V_j.                                                     (5)

To extract the proposed gait feature, the voxels in a PDV are raster scanned and arranged as a column vector. Similar arrangements of voxels into column vectors are done for all the PDVs of all the C subjects and the column vectors thus obtained are concatenated. Let each column vector C_ik be of size WHL × 1, where C_ik denotes the column vector corresponding to the kth PDV (PDV_k) of the ith subject, i = 1, 2, ..., C and k = 1, 2, ..., K. The total size of the matrix thus becomes WHL × KC = R × N (say), where R = WHL is the number of rows and N = KC is the total number of training images, which is again equal to the number of columns of the resulting matrix. As the matrix is quite large, PCA is applied for dimension reduction. Since all the eigen vectors provided by PCA are not of enough significance, we choose only the e eigen vectors (e < R) corresponding to the largest e eigen values, so as to retain more than 90% of the variance with the selected eigen vectors. Now, if m is the mean of the column vectors, then

    m = (1/N) Σ_{i=1}^{C} Σ_{k=1}^{K} C_ik.                                            (6)

The covariance matrix is determined as

    Cov = (1/N) Σ_{i=1}^{C} Σ_{k=1}^{K} (C_ik - m)(C_ik - m)^T.                        (7)

Eigen vectors [eig_1, eig_2, ..., eig_e] of Cov are computed corresponding to the e largest eigen values. Each column of the reduced e dimensional feature vector Y is calculated using Eq. (8):

    Y_ik = [eig_1, eig_2, ..., eig_e]^T (C_ik - m),   i = 1, 2, ..., C,  k = 1, 2, ..., K.   (8)

This reduced set of PDV features is used for final classification. Fig. 6 shows an input silhouette sequence and the depth key pose to which each frame belongs. A frame with caption S_k in the figure indicates that the frame belongs to depth key pose k. Fig. 7 shows the Pose Depth Volumes obtained corresponding to each depth key pose of the input silhouette sequence shown in Fig. 6.

3.4. Human recognition using PDV

The PCA transformed feature is of dimension e × N. In the resulting feature set, the first K columns give the PCA transformed feature for the K depth key poses of the first subject, the next K columns give the PCA transformed feature for the K depth key poses of the second subject, and so on. Thus, there are K column vectors corresponding to every subject, each having a size e × 1, resulting in a feature vector of size e × K. Given an input test depth registered silhouette sequence, the same operations as described in Sections 3.2, 3.3.1 and 3.3.2 are applied to classify the frames into K depth key poses and derive the reduced PDV from it. To classify the reduced silhouette vector corresponding to the input test sample, we compute the Euclidean distance of this test vector from each of the training vectors. The test vector is assigned to the class with the minimum Euclidean distance. Let Y_i1, Y_i2, ..., Y_iK denote the normalized e × K dimensional training vector for the ith subject and Z_1, Z_2, ..., Z_K be the normalized input test vector of the same dimension of an unknown subject. The likelihood of a test vector belonging to the class of the ith subject is determined using Eq. (9). A lower value of distance with respect to a particular class implies a greater likelihood of belonging to that class.

    distance(Y_i, Z) = Σ_{j=1}^{K} Σ_{r=1}^{e} [Y_ij(r) - Z_j(r)]^2,   ∀ i = 1, 2, ..., C.   (9)

The input test vector is assigned to class c if

    distance(Y_c, Z) < distance(Y_i, Z),   ∀ i = 1, 2, ..., C,  i ≠ c.                       (10)
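A compact sketch of Eqs. (5)-(10) is given below, using scikit-learn's PCA in place of the explicit covariance computation of Eqs. (6)-(8). The voxel volumes are assumed to come from a routine such as the build_voxel_volume sketch above, and the gallery is assumed to hold one set of K PDVs per training subject.

```python
import numpy as np
from sklearn.decomposition import PCA

def compute_pdvs(voxel_volumes, pose_labels, K):
    """Average the voxel volumes of all frames assigned to each depth
    key pose (Eq. (5)); returns them flattened, one row per pose."""
    pdvs = []
    for k in range(K):
        vols = [v for v, p in zip(voxel_volumes, pose_labels) if p == k]
        pdvs.append(np.mean(vols, axis=0).ravel())   # assumes every pose has >= 1 frame
    return np.stack(pdvs)                            # shape (K, W*H*L)

def train_gallery(subject_pdvs, var_ratio=0.90):
    """subject_pdvs: list of (K, W*H*L) arrays, one per training subject.
    Fits PCA retaining ~90% of the variance (Eqs. (6)-(8)) and returns
    the projected gallery of shape (C, K, e)."""
    X = np.vstack(subject_pdvs)                      # (C*K, W*H*L)
    pca = PCA(n_components=var_ratio).fit(X)
    C, K = len(subject_pdvs), subject_pdvs[0].shape[0]
    gallery = pca.transform(X).reshape(C, K, -1)
    return pca, gallery

def classify(test_pdvs, pca, gallery):
    """Project the K test PDVs and assign the class with the minimum
    summed squared Euclidean distance (Eqs. (9)-(10))."""
    Z = pca.transform(test_pdvs)                     # (K, e)
    dists = ((gallery - Z[None, :, :]) ** 2).sum(axis=(1, 2))
    return int(np.argmin(dists))
```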
4. Experimental results

In this section, we present results from an extensive set of experiments carried out using the proposed Pose Depth Volume (PDV) feature. Experiments have been conducted in the context of biometric identification, in which the features of a test subject are matched against a gallery of previously captured and annotated feature sets. The proposed gait recognition algorithm has been implemented in Matlab 7.12.0 (R2010a) on a 2.50 GHz Intel Core i5 processor having 4 GB RAM.

4.1. Data set and testing protocol description

Gait recognition algorithms in the existing literature have used data sets that were collected with the help of color video cameras. Two notable examples are the USF HumanID database [27], which deals with only the fronto-parallel view of walking (left or right) in different scenarios, like walking with/without shoes, walking on grass/concrete, etc., and the CMU MoBo database [26], which contains videos of
Fig. 6. Mapping of depth registered frames to key poses for a complete gait cycle. Sk represents the kth depth key pose.
subjects walking on a treadmill from multiple views. Others include SOTON [28], CASIA [29], TUM-IITKGP [30], etc. Since no gait data set containing both RGB and depth video frames of human walking from the frontal view is publicly available, we have prepared a new data set to test the effectiveness of the proposed approach. The camera setup and testing protocol are described in detail next.

The Kinect camera is positioned at a certain height from the ground with the Kinect pointing downwards at an angle of 27°, as is typically done in surveillance systems. This positioning of the camera always allows us to capture at least one full gait cycle of a walking individual with both the depth and RGB cameras. Fig. 8 shows the setup for collecting the data sets. With reference to this figure, the Kinect is placed at a height AB = 2 m from the ground in front of a pathway IJFE. 30 persons are allowed to walk from EF to KL roughly following the line GHM, so that the full depth variation in the limbs can be captured by the Kinect (G, H and M are the midpoints of EF, CD and KL, respectively). CD is the limit up to which the full body of a walking person is visible to the Kinect. For our data collection, FJ = EI = 5.0 m and DJ = CI = 1.0 m. The SimpleOpenni wrapper and the Processing tool [31] are used for collecting the data.

In order to study the effectiveness of the proposed approach and the variation of its performance with different parameters, we adopt the following protocol. There are a total of 30 different subjects. For each subject, there are four data sets – three (denoted as data sets C_A, C_A′ and C_A″) in which any given subject wears the same dress in each of the recordings (although different subjects wear different clothing), and one data set in which the subject wears a different dress (denoted as data set C_B). The recordings are also made at different frame rates, namely 10, 15, 20, 25 and 30 frames per second (fps). One of the gait cycles for each subject is used for training and the others are used for testing. To determine the impact of the number of depth key poses on the recognition accuracy, the number of clusters K is varied from 5 to 15 in steps of one. Unless otherwise mentioned, we report recognition accuracy averaged over all the 30 subjects in the test data sets.
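A minimal sketch of this evaluation protocol is shown below: one labeled gallery cycle per subject, the remaining cycles used as probes, and rank-1 accuracy averaged over all probes. The feature vectors and the distance measure are assumed to be those of the PDV pipeline sketched in Section 3.

```python
import numpy as np

def rank1_accuracy(gallery_feats, gallery_labels, probe_feats, probe_labels):
    """gallery_feats / probe_feats: arrays of flattened PDV feature vectors,
    gallery_labels / probe_labels: corresponding subject identities.
    Returns the fraction of probes whose nearest gallery entry
    (squared Euclidean distance, Eq. (9)) carries the correct identity."""
    correct = 0
    for feat, true_id in zip(probe_feats, probe_labels):
        d = ((gallery_feats - feat) ** 2).sum(axis=1)
        if gallery_labels[int(np.argmin(d))] == true_id:
            correct += 1
    return correct / len(probe_labels)
```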
4.2. Results

We first study the impact of the number of key poses on the recognition accuracy. A lower number of key poses would degrade the performance, while a larger number of key poses adversely affects the processing speed. We vary the number of key poses from 5 to 15 and plot the accuracy of recognition in Fig. 9. It is observed that the recognition rate increases between 5 and 8 key poses and then saturates between 8 and 12 key poses. Thus, a value of 8 for the number of key poses is considered to be a balance between recognition accuracy and speed of processing.

Although the tools provided by software vendors for Kinect data capture can record 30 fps videos, we study the impact of variation in frame rate on the recognition accuracy. It is observed from Fig. 10 that there is a significant dependence of the gait recognition system on frame rate. At 10–15 fps, the accuracy is almost 10% less than that obtained at 30 fps. This is because at lower frame rates a lot of dynamic information is lost, as it is not fully captured in the individual poses. It should be noted here that as the frame rate is decreased, the number of key poses is also accordingly reduced. It is not meaningful to keep the number of key poses larger than the number of frames that make up one gait cycle at a given frame rate of video recording. Based on the above two observations, for the rest of the experiments, we keep the frame rate at 30 fps and the number of key poses at 8.

We next study the impact of change in the clothing of subjects on the accuracy of recognition. We first train the system with PDVs extracted from the data set C_A and test with PDVs from data sets C_A″ and C_B. We next train the system with PDVs extracted from the data set C_B and test with PDVs from the data set C_A″. The results are reported in Table 1. It is observed that there is no significant impact on the accuracy of recognition due to change in clothing. Thus, the PDV feature is robust against this type of variation.

We further study the variation in recognition accuracy with the size of the training set. We follow a different protocol in this set of experiments. Data set C_A is first used for training and C_A″ is used for testing. Data set C_A′ is next used for training and C_A″ is used for
Fig. 7. Eight Pose Depth Volumes of the input sequence shown in Fig. 6 corresponding to eight depth key poses shown in Fig. 3.
Fig. 8. Experimental setup with Kinect camera.
testing. Finally, both the data sets C_A and C_A′ are used for training and C_A″ is used for testing. The results are shown in Table 2. It is observed that there is a small improvement in recognition accuracy as the size of the training data is increased.

Finally, we make a comparative study of the proposed PDV based gait recognition approach with other existing methods, namely GEI and GEV, where both the training and testing videos are captured at 30 fps under similar clothing conditions. It is observed from Table 3 that PDV using depth registered silhouettes outperforms both GEI and GEV. The improved recognition performance of PDV is certainly due to its capability of extracting gait features by efficiently combining both RGB and depth information from Kinect. GEI using binary silhouettes extracted only from the depth frames of Kinect shows very poor performance, as shape information is not preserved while extracting object silhouettes from depth image frames. GEV is the first feature that uses Kinect depth frames for gait recognition, but since it makes no attempt to combine shape along with depth information, its performance suffers due to the noisy depth frames, as seen from Table 3. We have also tested the discrimination capability of GEV using depth registered silhouettes and obtained better performance than that of the originally proposed GEV with depth images only, thereby verifying the need for RGB-D data fusion for gait recognition
Fig. 9. Effect of change in number of key poses on the recognition accuracy of PDV (average classification accuracy (%) vs. number of depth key poses).
Fig. 10. Effect of change in frame rate on the recognition accuracy of PDV (average classification accuracy (%) vs. frame rate in fps).
Table 1
Variation of classification accuracy of PDV with clothing.

Training data set    Testing data set    Recognition accuracy (%)
C_A                  C_A″                73.33
C_A                  C_B                 73.33
C_B                  C_A″                78.33

Table 2
Variation of classification accuracy of PDV with change in training data set size.

Training data set    Testing data set    Recognition accuracy (%)
C_A                  C_A″                73.33
C_A′                 C_A″                76.67
C_A + C_A′           C_A″                80.00

Table 3
Performance comparison of different gait recognition algorithms.

Gait feature                                                   Recognition accuracy (%)
GEI using binary silhouettes extracted from depth frames       33.33
GEI using binary silhouettes extracted from RGB frames         71.67
GEV using depth silhouette                                     46.67
GEV using depth registered silhouette                          62.67
PDV using depth silhouette                                     53.33
PDV using depth registered silhouette                          80.00
from the frontal view. PDV with depth silhouettes extracted directly from the depth video frames of Kinect shows better performance than traditional GEV, due to its inherent ability to capture dynamic variation better than GEV. However, this performance is still not good enough for it to be considered a usable feature for gait recognition.
A substantial improvement in performance is observed as compared to traditional GEV, when both RGB and depth information is combined in deriving the proposed PDV feature using depth registered silhouettes. From the table, it is seen that GEI computed with binary silhouettes extracted from only RGB video frames has a performance higher than GEV. Yet it is less than that of PDV. The performance of PDV is expected to surpass that of GEI by a higher margin if a larger data set is used, as both shape and depth information is embedded in PDV. From the results, it can be concluded that combined RGB-D information from Kinect helps in improving gait recognition accuracy from frontal view. Depth data alone is not sufficient to derive good gait features from this view. An effective combination of shape and depth information is needed, which has been incorporated in PDV.
5. Conclusions

In this paper, we have combined both the depth and the color streams from Kinect by registering Kinect depth frames to map with the corresponding color frames. Next, we have introduced a novel feature called Pose Depth Volume, obtained by averaging the voxel volumes of the frames belonging to the same pose. The proposed feature considers both shape and depth variations of the walking sequence of an individual over each depth key pose of a gait cycle. Experiments carried out on a data set comprising 30 subjects clearly demonstrate the effectiveness of the approach. It can be asserted from the results that depth and color information fused together helps in better gait recognition from the frontal view than using only the depth information from Kinect. It is felt that even better recognition could be achieved if full volume reconstruction of each silhouette were possible. However, this requires installation of multiple cameras. Future work involves testing our feature with a larger number of subjects and the use of parallel processing to speed up the whole procedure. Here, we have only focused on gait as the biometric feature. The accuracy of recognition is expected to improve if a combination of different biometric methods in a multi-modal setup is introduced. Other than the shape representation used in this paper, combining the skeleton tracking functionality of the Kinect SDK along with shape will be a direction for future research.

Acknowledgments

This work is partially funded by project Grant No. 22(0554)/11/EMR-II sponsored by the Council of Scientific and Industrial Research, Govt. of India. The authors thank the anonymous reviewers for their constructive suggestions.

References

[1] Z. Zhang, Microsoft Kinect sensor and its effect, IEEE Multimedia 19 (2) (2012) 4–10.
[2] Kinect for Windows.
[3] Sony Depth Sensing Camera: Sony Patents Kinect-Like 3D Depth-Sensing Camera for Play Station Consoles. 02/20/sony-patents-kinect-like-3d-depth-sensing-camera-kinect-forplaystation-consoles>.
[4] R.A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A.J. Davison, P. Kohli, J. Shotton, S. Hodges, A.W. Fitzgibbon, KinectFusion: real-time dense surface mapping and tracking, in: Tenth IEEE International Symposium on Mixed and Augmented Reality, 2011, pp. 127–136.
[5] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (2) (1989) 257–286.
[6] N.V. Boulgouris, D. Hatzinakos, K.N. Plataniotis, Gait recognition: a challenging signal processing technology for biometrics identification, IEEE Signal Processing Magazine 22 (6) (2005) 78–90.
[7] A.F. Bobick, J.W. Davis, The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (3) (2001) 257–267.
[8] M.S. Nixon, J.N. Carter, C. Yam, Automated person recognition by walking and running via model-based approaches, Pattern Recognition 37 (5) (2004) 1057–1072.
[9] A. Kale, A. Sundaresan, A.N. Rajagopalan, N.P. Cuntoor, A.K. Roy Chowdhury, V. Kruger, R. Chellappa, Identification of humans using gait, IEEE Transactions on Image Processing 13 (2004) 1163–1173.
[10] J. Han, B. Bhanu, Individual recognition using gait energy image, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2) (2006) 316–322.
[11] J. Liu, N. Zheng, Gait history image: a novel temporal template for gait recognition, in: Proceedings of the IEEE Conference on Multimedia and Expo, 2007, pp. 663–666.
[12] Q. Ma, S. Wang, D. Nie, J. Qiu, Recognizing humans based on gait moment image, in: Eighth ACIS International Conference on SNPD, 2007, pp. 606–610.
[13] S. Lee, Y. Liu, R. Collins, Shape variation-based frieze pattern for robust gait recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[14] X. Yang, Y. Zhou, T. Zhang, G. Shu, J. Yang, Gait recognition based on dynamic region analysis, Signal Processing 88 (9) (2008) 2350–2356.
[15] C. Chen, J. Liang, H. Zhao, H. Hu, J. Tian, Frame difference energy image for gait recognition with incomplete silhouettes, Pattern Recognition Letters 30 (11) (2009) 977–984.
[16] E. Zhang, Y. Zhao, W. Xiong, Active energy image plus 2DLPP for gait recognition, Signal Processing 90 (7) (2010) 2295–2302.
[17] G. Ariyanto, M.S. Nixon, Model-based 3D gait biometrics, in: International Joint Conference on Biometrics, 2011, pp. 11–13.
[18] S. Sivapalan, D. Chen, S. Denman, S. Sridharan, C. Fookes, Gait energy volumes and frontal gait recognition using depth images, in: International Joint Conference on Biometrics, 2011, pp. 1–6.
[19] J. Ryu, S. Kamata, Front view gait recognition using spherical space model with human point clouds, in: Eighteenth IEEE International Conference on Image Processing, 2011, pp. 3209–3212.
[20] A. Roy, S. Sural, J. Mukherjee, Gait recognition using pose kinematics and pose energy image, Signal Processing 92 (3) (2012) 780–792.
[21] L. Maddalena, A. Petrosino, A self-organizing approach to background subtraction for visual surveillance applications, IEEE Transactions on Image Processing 17 (7) (2008) 1168–1177.
[22] L. Vincent, Morphological grayscale reconstruction in image analysis: applications and efficient algorithms, IEEE Transactions on Image Processing 2 (2) (1993) 176–201.
[23] Kinect Camera Calibration.
[24] Getting Started with Kinect and Processing.
[25] A Tutorial on Principal Components Analysis.
[26] R. Gross, J. Shi, The CMU Motion of Body (MoBo) Database, Tech. Report CMU-RI-TR-01-18, Robotics Institute, Carnegie Mellon University, 2001.
[27] S. Sarkar, P.J. Phillips, Z. Liu, I. Robledo-Vega, P. Grother, K.W. Bowyer, The human ID gait challenge problem: data sets, performance, and analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2) (2005) 162–177.
[28] SOTON data set.
[29] CASIA data set.
[30] TUM-IITKGP data set.
[31] Simple-Openni.
Report CMURI-TR-01-18, Robotics Institute, Carnegie Mellon University, 2001. S. Sarkar, P.J. Phillips, Z. Liu, I. Robledo-Vega, P. Grother, K.W. Bowyer, The human ID gait challenge problem: data sets, performance, and analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2) (2005) 162– 177. SOTON data set. . CASIA data set. . TUM-IITKGP data set. . Simple-Openni: .