Pattern Recognition 44 (2011) 988–995
Gait recognition based on improved dynamic Bayesian networks

Changhong Chen (a,*), Jimin Liang (b), Xiuchang Zhu (a)

(a) Image Processing and Image Communication Lab, College of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
(b) Life Sciences Research Center, School of Life Sciences and Technology, Xidian University, Xi'an 710071, China
Article history: Received 10 May 2010; Received in revised form 19 October 2010; Accepted 26 October 2010

Abstract
In this paper, we propose an improved two-level dynamic Bayesian network, the layered time series model (LTSM), which aims to overcome the limitations hindering the application of the available dynamic Bayesian networks, the hidden Markov model (HMM) and the dynamic texture (DT) model, to gait recognition. In the first level, a gait silhouette or feature cycle is divided into several temporally adjacent clusters. Each cluster is modeled by a DT or logistic DT (LDT). In the second level, an HMM is built to describe the relationship among the DTs/LDTs. Besides LTSM, the LDT is a second improved dynamic Bayesian network presented in this paper; it describes binary image sequences and introduces logistic principal component analysis (PCA) to learn its parameters. We demonstrate the validity of LTSM with experiments on both the CMU Mobo gait database and the CASIA gait database (dataset B), and that of LDT on the CMU Mobo gait database. Experimental results show the superiority of the improved dynamic Bayesian networks.
© 2010 Elsevier Ltd. All rights reserved.
Keywords: Gait recognition; Improved dynamic Bayesian networks; Layered time series model; Logistic dynamic texture model; Hidden Markov model
1. Introduction

Gait recognition aims to identify people by the way they walk. In comparison with other biometrics, the gait pattern has the advantages of being unobtrusive, difficult to conceal and effective at a distance. However, a gait recognition algorithm has to deal with image sequences instead of a single image: both the spatial information of the gait image and the temporal transformation of the gait sequence are important. Methods characterizing the spatial and temporal information fall into two categories: information fusion and dynamic Bayesian networks.

Information fusion technology offers a promising way to combine the spatial and temporal information into a superior classification system. Wang et al. [1] derived the dynamic information of gait by using a condensation framework to track the walker and to recover joint-angle trajectories of the lower limbs. The static body information is derived from temporal pose changes of the segmented moving silhouettes, which are represented as an associated sequence of complex vector configurations and are then analyzed using the Procrustes shape analysis method to obtain a compact appearance representation. Both the static and dynamic cues may be used independently for recognition, and they are also fused at the decision level using different combinations of rules to
* Corresponding author. Tel.: +86 25 83492402; fax: +86 25 83492420. E-mail address: [email protected] (C. Chen).
doi:10.1016/j.patcog.2010.10.021
improve identification and verification performance. Lam et al. [2] presented two gait feature representation methods, the motion silhouette contour templates (MSCTs) and static silhouette templates (SSTs), and performed decision-level fusion by summing the similarity scores. Bazin et al. [3] examined the fusion of one dynamic feature and two static features in a probabilistic framework. The dynamic signature is derived from the bulk motion and shape characteristics of the subject. The first static signature is a vector derived from an average silhouette; the second is obtained from a block-based silhouette averaging method. They proposed a process for determining the probabilistic match scores using intra- and inter-class variance models together with the Bayesian rule. Veres et al. [4] proposed a two-stage data fusion rule to fuse the dynamic feature and the first static signature mentioned in Ref. [3]. In summary, feature fusion is a good choice for combining spatial and temporal information, but it is challenging to extract a representative dynamic feature, especially when the gait silhouettes are of low quality.

Dynamic Bayesian networks do not need to extract dynamic features; they embody the temporal transformation in the model parameters. The most commonly used dynamic Bayesian networks are descendants of either the hidden Markov models (HMM) or stochastic linear dynamical systems, which are also known as state-space models (SSM) [5]. An HMM represents information about the past of a sequence through a single discrete random variable, the hidden state. Kale et al. [6] and Sundaresan et al. [7] introduced HMM to gait recognition, considering two different image features: the
width of the outer contour of a binary silhouette and the entire binary silhouette itself. They proposed an indirect approach and a direct approach to train the HMM. The indirect approach forms a feature vector using a frame-to-exemplar distance (FED), which captures a subject's shape and motion under the assumption that the camera is sufficiently distant for the moving subject to be considered planar; the information in the FED vector sequences is captured using an HMM. The direct approach uses the feature vector directly (as opposed to computing the FED) for training an HMM, and the observation probability is estimated using the distance between the exemplars and the image features. In Ref. [8], the direct approach was further developed: the width vector of the outer contour is used as the feature, and an adaptive method is developed to calculate the observation probability of the HMM. While the performance of HMM is excellent, some problems remain to be addressed, such as how to determine the hidden states. Liu and Sarkar [9] proposed a population hidden Markov model (pHMM) for silhouette reconstruction and cleaning. The pHMM helps to map a frame in any given sequence to a stance, and an appearance-based Eigen-Stance model is used to reconstruct the computed silhouette in the frame. This method can reconstruct silhouettes that are visually appealing and robust to viewpoint variation, but the loss of characteristics specific to a single subject reduces the recognition performance.

SSM represents information about the past through a real-valued hidden state vector. The dynamic texture (DT) model [10], a linear SSM, is an efficient method for modeling dynamic image sequences. DT learns its parameters through a closed-form solution and commonly uses principal component analysis (PCA) to obtain the observation parameters.
PCA assumes a Gaussian distribution over a set of observations, while studies have shown that a natural way to model binary data is the Bernoulli distribution [11,12]. Therefore, it is not suitable to use DT directly to model gait sequences. Several studies [13–15] applied DT to model extracted gait features. Mazzaro et al. [13] measured the angles of the shoulder, elbow, hip and knee joints and used the angle vector to obtain a nominal model. Bissacco et al. [14] introduced a representation based on projection features, which encode the distances of points on the silhouette from lines passing through its center of mass; a linear non-Gaussian model was constructed with these features. The autoregressive and moving average (ARMA) model in Ref. [15] is also a DT, constructed on the tangent-space projections of a shape sequence extracted from the binary images by uniform sampling or uniform arc-length sampling. Bissacco and Soatto [16] proposed a hybrid autoregressive model of human motion and novel algorithms to estimate switches and model parameters; a set of joint-angle trajectories on the skeletal model was extracted to train the hybrid model for dynamic discrimination. Two problems exist in the aforementioned applications. First, these models are good at describing a motion process, but they do not capture detailed information accurately; therefore, these methods were validated only on activity recognition over a small database or on classifying gait styles. Second, the gait sequence is a non-linear process, while DT and its extensions are linear models. Chan and Vasconcelos [17–19] further developed the DT method. They improved the modeling capability of DT by introducing kernel PCA to learn a non-linear observation function [17]. In Ref. [18], the mixture of DTs was studied; it is a statistical model for an ensemble of video sequences sampled from a finite collection of visual processes, each of which is a DT.
The mixture of DTs is shown to be a suitable representation for both appearance and dynamics. A novel video representation, the layered DT, is further discussed in Ref. [19]; it represents a video as a collection of stochastic layers of different appearance and dynamics, each layer modeled as a temporal texture sampled from a different linear dynamic system. These extensions are good at describing dynamic
textures, but they are not suitable for gait sequences. Furthermore, complicated models such as those of Refs. [18,19] are not needed for binary gait sequences. In this paper, an improved dynamic Bayesian network, the logistic DT (LDT), is proposed to model the binary image sequence directly. It introduces logistic PCA to learn the observation function; logistic PCA assumes a Bernoulli distribution over a set of observations and processes the pixels of 1 and 0 separately. This model avoids the loss of useful information caused by feature extraction. It is evaluated using the CMU Mobo gait database and shown to be more suitable for describing a binary image sequence than DT, although it is still a linear model.

In order to tackle the limitations hindering the application of HMM and DT/LDT to gait recognition, the major contribution of this paper is another improved dynamic Bayesian network, the layered time series model (LTSM). An LTSM is a two-level model that combines HMM and DT/LDT. The first level has multiple DTs/LDTs: a gait silhouette or feature cycle is segmented into several temporally adjacent clusters, which can be considered linear processes, and each cluster is modeled by a DT/LDT. The second level is an HMM, which describes the statistical distribution of the different DTs/LDTs. The DTs/LDTs are treated as the hidden states of the HMM, and the observation probability is a function of the distance between the observation and the synthesized observation of the DT/LDT. An LTSM overcomes the non-linear process representation problem using piecewise linear DTs/LDTs, and then applies an HMM to describe the transitions among the DTs/LDTs. Its validity is evaluated on both the CMU Mobo gait database [20] and the CASIA gait database (dataset B) [21]; the experimental results show its superiority over HMM and DT.

The remainder of this paper is structured as follows. Section 2 describes the LDT in detail. The construction of the LTSM is described in Section 3.
In Section 4, we evaluate LDT and LTSM using the CMU Mobo gait database and CASIA gait database (dataset B). We discuss the results and conclude this paper in Section 5.
2. Logistic dynamic texture (LDT) model

In this section, we briefly introduce DT and then propose an improved dynamic Bayesian network, LDT, to model the binary image sequence.

2.1. DT

DT is a generative stochastic model that treats the video sequence as a sample from the output of a linear SSM. It was first proposed by Doretto et al. [10] and was a successful application of the linear dynamical system to video processing. Its underlying assumption is that the individual images are realizations of the output of a dynamical system driven by an independent and identically distributed (IID) process. The visual components and underlying dynamics are represented as two stochastic processes, as shown in Fig. 1. The visual component $y_t \in \mathbb{R}^D$ is a function of the current state
Fig. 1. Topology structure of dynamic texture model (DT).
vector $x_t \in \mathbb{R}^L$ with some observation noise, and the dynamics are represented as a time-evolving state process. The model is given by

$x_t = A x_{t-1} + v_t,$
$y_t = C x_t + w_t,$    (1)

where $A \in \mathbb{R}^{L \times L}$ is the state-transition matrix and $C \in \mathbb{R}^{D \times L}$ is the observation matrix. The input and observation noises obey Gaussian distributions, given by $v_t \sim N(0, Q)$ and $w_t \sim N(0, r I_D)$, respectively. The choice of the matrices $A, C, Q, r$ is not unique, and a closed-form solution was proposed in Ref. [10]. The closed-form solution commonly uses PCA to obtain the observation function: the columns of $C$ are the principal components of the image sequence, and the state vector $x_t$ is a set of PCA coefficients. Strictly speaking, this solution is suboptimal, but it is proven to be asymptotically efficient. Most natural texture sequences can be successfully represented by this model.

2.2. LDT

DT is simple and efficient in capturing moving scenes that exhibit certain stationary properties in time, such as sea waves, smoke, foliage and whirlwinds. It is successful in modeling texture sequences, but its performance in modeling binary image sequences is disappointing. The major reason is that PCA assumes a Gaussian distribution over a set of observations, while a natural way to model binary images is the Bernoulli distribution. When applying DT to describe a gait sequence, feature extraction is therefore necessary. However, the extracted features are vectors with few texture characteristics, and the lost information may deteriorate the representation accuracy of the binary image sequence. In order to model the binary image sequence directly and precisely, we propose the LDT. LDT introduces logistic PCA [24] to learn the observation function. Logistic PCA assumes a Bernoulli distribution over a set of observations and processes the pixels of 1 and 0 separately; this property makes LDT more reliable for binary sequence representation than DT. The model assumes that the individual binary images are realizations of the output of the model driven by an IID process. In this way, it can also be represented by Fig. 1 and Eq. (1).
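As a concrete illustration of the closed-form DT solution of Section 2.1 (PCA for the observation matrix, least squares for the state transition), the following sketch fits a DT to a $D \times N$ frame matrix. It is our reading of Ref. [10], not the authors' code, and the noise estimators are the usual residual-based ones:

```python
import numpy as np

def fit_dt(Y, L):
    """Closed-form dynamic texture fit (a sketch of the method of Ref. [10]).
    Y: D x N matrix, one column per frame; L: state dimension."""
    D, N = Y.shape
    # PCA via SVD: the columns of C are the top-L principal directions
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :L]                         # D x L observation matrix
    X = np.diag(S[:L]) @ Vt[:L, :]       # L x N state sequence (PCA coefficients)
    # Least-squares state transition: A maps X[:, :-1] onto X[:, 1:]
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    # Residual-based noise estimates
    res = X[:, 1:] - A @ X[:, :-1]
    Q = res @ res.T / (N - 1)
    r = np.mean((Y - C @ X) ** 2)
    return A, C, Q, r, X
```

For binary silhouettes the paper replaces the PCA step with logistic PCA, as described next in Section 2.2.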
Training an LDT involves two steps.

Step 1: Conduct logistic PCA on the binary image vector sequence $Y = [Y_1, Y_2, \ldots, Y_N]$. The objective of this step is to find the coefficient matrix and the basis vectors. Let $Y_{nd}$ denote an element of the $N \times D$ binary matrix $Y$, and let $U \in \mathbb{R}^{N \times L}$ and $V \in \mathbb{R}^{L \times D}$ denote the coefficient matrix and the basis vector matrix, respectively, where $L$ is the dimension of the latent space. Let $\Theta_{nd}$ be the log-odds of the binary random variable $Y_{nd}$, defined as

$\Theta_{nd} = \sum_l U_{nl} V_{ld} + \Delta_d,$    (2)

where $\Delta$ is a $D$-dimensional bias vector. The matrices $U$ and $V$ can be obtained by maximizing the log-likelihood function

$L(U, V) = \sum_{n,d} \log P(Y_{nd} \mid \Theta_{nd}) = \sum_{n,d} [\, Y_{nd} \log \sigma(\Theta_{nd}) + (1 - Y_{nd}) \log \sigma(-\Theta_{nd}) \,],$    (3)

where $\sigma(\Theta) = 1/(1 + \exp(-\Theta))$. The initial values of $U$ and $V$ are given stochastically. The bias vector in Eq. (2) is neglected in the training procedure to reduce the complexity. The matrices $U$ and $V$ are updated by the alternating least squares (ALS) updates as follows, which guarantee an increase in the log-likelihood.
For simplicity, an intermediate quantity $T$ is introduced:

$T_{nd} = \dfrac{\tanh(\Theta_{nd}/2)}{\Theta_{nd}}.$    (4)
When updating $U$, we fix $V$ and neglect $\Delta$. A simple update rule for the $n$th row of the matrix $U$ is obtained by solving the $L \times L$ set of linear equations

$\sum_{l'} \left( \sum_d T_{nd} V_{ld} V_{l'd} \right) U_{nl'} = \sum_d (2Y_{nd} - 1) V_{ld}.$    (5)
The matrix $V$ is updated in a similar manner while keeping $U$ fixed. After a few iterations, we obtain suitable $U$ and $V$. For more details on the derivation, refer to Ref. [24].

Step 2: Estimate the LDT parameters $A, C, Q, r$ from the decomposition. The coefficient matrix $U$ is regarded as the estimate of the observation matrix, such that $C = U$. The basis vector matrix $V$ is taken as the state vector estimate, such that $[x_1, \ldots, x_N] = V$. The other parameters can be obtained according to Eq. (1). When the input noise is ignored, the state-transition matrix $A$ is calculated as

$A = [x_2, \ldots, x_N][x_1, \ldots, x_{N-1}]^{-1}.$    (6)

The initial value of the state, $x_0$, is assigned as $x_1$. Ignoring the observation noise, the estimated observation $\hat{E}_t$ is

$\hat{E}_t = C x_t.$    (7)
Although the binary gait images do not obey a Gaussian distribution, $v_t$ and $w_t$ can still be treated as Gaussian noises because they are stochastic. The parameter $r$ of the observation noise is obtained as

$r = \dfrac{1}{DN} \sum_{t=1}^{N} \| Y_t - \hat{E}_t \|^2.$    (8)

The input noise variance $Q$ is estimated by

$Q = \dfrac{1}{N-1} \sum_{t=1}^{N-1} (x_t - A x_{t-1})(x_t - A x_{t-1})^T.$    (9)
Among the estimated parameters $\{A, C, Q, r\}$, $A$ and $C$ play more important roles than the other two. In some circumstances the noises can be neglected, and the parameters $Q, r$ are set to zero.
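Steps 1 and 2 can be sketched as below. We pass the data with pixels as rows and frames as columns so that $C = U$ and $[x_1, \ldots, x_N] = V$ are dimensionally consistent; the small ridge added to the ALS solves is a numerical stabilizer of ours, not part of the paper:

```python
import numpy as np

def logistic_pca(Y, L, n_iters=15, seed=0):
    """Logistic PCA by alternating least squares (Eqs. (2)-(5), after Ref. [24]).
    Y: binary matrix; the bias vector is neglected, as in the paper."""
    rng = np.random.default_rng(seed)
    N, D = Y.shape
    U = 0.1 * rng.standard_normal((N, L))    # coefficient matrix
    V = 0.1 * rng.standard_normal((L, D))    # basis vector matrix
    S = 2.0 * Y - 1.0                        # the (2Y - 1) term of Eq. (5)
    ridge = 1e-6                             # our numerical stabilizer, not in the paper
    for _ in range(n_iters):
        Theta = U @ V
        T = np.where(np.abs(Theta) < 1e-8, 0.5, np.tanh(Theta / 2.0) / Theta)  # Eq. (4)
        for n in range(N):                   # Eq. (5): an L x L solve per row of U
            M = (V * T[n]) @ V.T
            U[n] = np.linalg.solve(M + ridge * np.eye(L), V @ S[n])
        Theta = U @ V
        T = np.where(np.abs(Theta) < 1e-8, 0.5, np.tanh(Theta / 2.0) / Theta)
        for d in range(D):                   # symmetric update for each column of V
            M = (U.T * T[:, d]) @ U
            V[:, d] = np.linalg.solve(M + ridge * np.eye(L), U.T @ S[:, d])
    return U, V

def fit_ldt(frames, L, **kw):
    """Step 2: estimate A, C, Q, r. frames: D x N, one column per binary frame,
    so that C = U (D x L) and [x_1, ..., x_N] = V hold as stated in the paper."""
    U, V = logistic_pca(frames, L, **kw)
    C, X = U, V
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])     # Eq. (6)
    E = C @ X                                    # Eq. (7): log-odds observations
    r = np.mean((frames - E) ** 2)               # Eq. (8)
    res = X[:, 1:] - A @ X[:, :-1]
    Q = res @ res.T / (X.shape[1] - 1)           # Eq. (9)
    return A, C, Q, r, X
```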
3. Layered time series model (LTSM)

Both HMM and DT/LDT have limitations hindering their application to gait recognition. One of the key problems in HMM-based gait recognition is how to define the hidden states. In Ref. [6], the gait feature cycle is divided into clusters, and a state is selected by minimizing the sum of the distances between the state and the features of the corresponding cluster. In Ref. [8], the average of the features in each cluster is taken as a hidden state. The states obtained from these methods are gray-level gait images. These methods cannot guarantee that a feature has a closer relationship with its corresponding state than with other states; in other words, they cannot obtain optimal states for the HMM. The major limitation of DT and its extension LDT is their linearity: a gait sequence is a non-linear process that cannot be described perfectly by a linear model. In order to overcome the aforementioned problems, we propose the improved dynamic Bayesian network LTSM to model the dynamic image sequence, which is a natural combination of DT/LDT and HMM. The main idea of LTSM is inspired by Li et al. [25], who proposed a two-level statistical model for motion synthesis, using a linear dynamic system to capture the local linear dynamics and a transition matrix to model
the global non-linear dynamics. Their model was demonstrated by synthesizing many sequences of visually compelling dance motion; however, it is not suitable for recognition. In this paper, we also use multiple linear systems to model complex non-linear dynamics, but the second level of the proposed model not only describes the distribution of the linear systems, it also captures the relationship between the linear systems and the observations. A similar layered model was introduced in Ref. [26]. Our work differs from Ref. [26] in three respects. First, we regard the person's silhouette as a whole instead of decomposing the moving person into blobs as in Ref. [26]. Second, a DT/LDT is built on the frames of the same cluster in our work, while the dynamical systems in Ref. [26] are built on the blobs of all frames. Third, we calculate the emission probabilities of the HMM according to the dynamical system of the corresponding cluster, whereas Bregler [26] represented the emission probabilities by the dynamical system itself. The proposed LTSM is illustrated in Fig. 2. It overcomes the non-linear process representation problem using piecewise linear DTs/LDTs and then applies an HMM to describe the transitions among the DTs/LDTs. LTSM is easy to apply because it makes use of existing inference algorithms, such as the Baum–Welch algorithm. Most of the complexity of the proposed model lies in the training process; once the LTSMs of the training database are established, online recognition is realized. There are three steps to build this model.

Step 1: Build a DT/LDT for each gait feature or silhouette cluster. Given a gait feature or silhouette cycle $Y = \{y_1, y_2, \ldots, y_M\}$, where $M$ is the number of frames in the gait cycle, the cycle is segmented into temporally adjacent clusters of approximately equal frame number. The cluster number $N$ is chosen by minimizing the average distortion, which is a function of $N$ and is computed by the Bayesian information criterion [6].
A DT or an LDT is trained to describe each cluster. LDT is built directly on the binary silhouette cluster, while DT is built on the extracted frieze or wavelet feature cluster.

Step 2: Synthesize gait observations from the trained DTs/LDTs. With the parameters $S_i = \{A_i, C_i, Q_i, r_i\}$ of the $i$th DT/LDT, we can synthesize a set of gait observations. Given the initial state $x_0$ and a random noise vector $M \in \mathbb{R}^L$, the $t$th state estimate $\hat{x}_t^i$ of cluster $L_i$ can be synthesized as

$\hat{x}_t^i = A_i \hat{x}_{t-1}^i + Q_i M.$    (10)

Let $P \in \mathbb{R}^D$ be a random vector and $I_D \in \mathbb{R}^{D \times D}$ an identity matrix; the $t$th synthesized observation $\hat{R}_t^i$ of model $S_i$ is

$\hat{R}_t^i = C_i \hat{x}_t^i + r_i I_D P.$    (11)
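Eqs. (10) and (11) can be sketched as follows. The paper does not specify the distributions of the noise vectors $M$ and $P$, so standard normal draws are assumed here; the 0.8 binarization threshold mentioned below is applied when binary output is requested:

```python
import numpy as np

def synthesize(A, C, Q, r, x0, T, binary=False, seed=0):
    """Synthesize states and observations of one DT/LDT (Eqs. (10)-(11)).
    M and P are assumed i.i.d. standard normal (the paper leaves this open)."""
    rng = np.random.default_rng(seed)
    Ldim, D = A.shape[0], C.shape[0]
    x = np.asarray(x0, dtype=float)
    X, R = [], []
    for _ in range(T):
        x = A @ x + Q @ rng.standard_normal(Ldim)   # Eq. (10)
        y = C @ x + r * rng.standard_normal(D)      # Eq. (11): r * I_D * P
        if binary:
            y = (y > 0.8).astype(float)             # LDT silhouettes: threshold 0.8
        X.append(x)
        R.append(y)
    return np.stack(X, axis=1), np.stack(R, axis=1)
```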
Because the observations of LDT are gait silhouettes, $\hat{R}_t^i$ should be further denoised to binary values by setting a threshold of 0.8.

Step 3: Train an HMM $\lambda = \{A, B, \pi\}$ based on the DTs/LDTs and the synthesized observations. The trained DTs/LDTs are regarded as the hidden states. The parameters $A$, $B$, $\pi$ represent the transition probabilities, observation probabilities and initial probabilities, respectively. The parameters must first be initialized. The transition probability matrix $A$ is initialized so that each state transitions with equal probability to itself and to its next state, and the last state can turn back to the first one. The initial state is taken to be the first one, whose initial probability $\pi_1$ is assigned to be 1; the other initial probabilities are set to zero. Among the three parameters, the observation probability $B$ plays the most important role and should not be initialized casually. In this paper, $B = \{b_i(y_m)\}$ is calculated as a function of the distance between the observation and the synthesized observation of the DT/LDT. Suppose $y_m$ is the $t$th frame of cluster $L_i$; then $b_i(y_m)$ can be represented as [8]

$b_i(y_m) = \alpha \, d_i \, e^{-d_i D(y_m, \hat{R}_t^i)},$    (12)

where $D(y_m, \hat{R}_t^i)$ represents the Euclidean distance between the observation $y_m$ and the synthesized vector $\hat{R}_t^i$, $\alpha$ is a constant less than 1, and $d_i$ is an important parameter defined as

$d_i = \dfrac{T_i}{\sum_{y_m \in L_i} D(y_m, \hat{R}_t^i)},$    (13)
where $T_i$ is the number of frames in cluster $L_i$. Taking model $S_i$ as the centroid, $d_i$ reflects the compactness of cluster $L_i$. During the training process, the Viterbi algorithm is employed to find the most likely sequence of hidden states, from which the observation probability $B$ can be updated; the Baum–Welch algorithm is used to update the parameters $A$ and $\pi$. For further details, refer to Ref. [6]. Finally, the proposed two-level hybrid model LTSM can be represented as $\{S_1, S_2, \ldots, S_N, \lambda\}$, the combination of the parameters of the DTs/LDTs and the HMM.

Given a probe sequence, it is first divided into $N$ temporally adjacent clusters. The observation probability of the probe sequence is a function of the distance between the observation and the vector synthesized by the DT/LDT of the training sequence. Even though the size of the probe sequence may differ from that of the training sequence, the synthesized sequence can be kept the same size as the former. The other parameters are kept identical to those of the training sequence. The similarity between the probe sequence and the training sequence is obtained from the Viterbi algorithm.
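The observation probability of Eqs. (12) and (13) can be sketched as follows; the value of the constant $\alpha < 1$ is not given in the paper, so 0.5 here is an assumption:

```python
import numpy as np

def cluster_intensity(frames, synth):
    """d_i of Eq. (13): T_i over the total distance of cluster L_i's frames
    to the synthesized observations of its model S_i (columns are frames)."""
    dists = np.linalg.norm(frames - synth, axis=0)
    return frames.shape[1] / dists.sum()

def observation_prob(y, r_hat, d_i, alpha=0.5):
    """b_i(y_m) of Eq. (12): decays exponentially with the Euclidean distance
    between the observation y and the synthesized vector r_hat."""
    return alpha * d_i * np.exp(-d_i * np.linalg.norm(y - r_hat))
```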
4. Numerical experiments
Fig. 2. The proposed layered time series model (LTSM).
We demonstrate the validity of the proposed LTSM on two datasets, the CMU Mobo gait database [20] and the CASIA gait database (dataset B) [21], and that of LDT on the CMU Mobo gait database. The percentage of times that the correct match for an individual appears among the top $n$ similarity scores is referred to as the cumulative match score (CMS). The performance results are shown as CMS curves, which plot rank $n$ versus the percentage of correct identification; rank $n$ means the correct match is among the top $n$ reported similarity scores. In order to clearly and comprehensively illustrate the superiority of the proposed methods over other algorithms, we further evaluated them using McNemar's test [27]. McNemar's test is a first-order check on the statistical significance of an observed difference in recognition performance. From the experimental results of algorithms A and B, we obtain the number of experiments in which both A and B succeed, $N_{ss}$; A succeeds but B fails, $N_{sf}$; A fails but B succeeds, $N_{fs}$; and both A and B fail, $N_{ff}$. When $N_{sf} + N_{fs} \ge 40$, the $Z$ value is calculated as

$Z = \dfrac{|N_{sf} - N_{fs}| - 1}{\sqrt{N_{sf} + N_{fs}}}.$    (14)
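The $Z$ statistic of Eq. (14) and the CMS curve defined above can be computed as follows (a sketch; the array layout of the similarity matrix is our assumption):

```python
import numpy as np

def mcnemar_z(n_sf, n_fs):
    """Eq. (14); the paper applies it when n_sf + n_fs >= 40."""
    return (abs(n_sf - n_fs) - 1) / np.sqrt(n_sf + n_fs)

def cms_curve(similarity, true_ids, gallery_ids):
    """Cumulative match scores: fraction of probes whose correct match lies
    among the top-n similarity scores, for n = 1 .. gallery size."""
    sim = np.asarray(similarity, dtype=float)   # probes x gallery, larger = more similar
    gallery = np.asarray(gallery_ids)
    order = np.argsort(-sim, axis=1)            # best match first
    ranks = np.array([
        int(np.where(gallery[order[p]] == true_ids[p])[0][0]) + 1
        for p in range(sim.shape[0])
    ])
    return np.array([np.mean(ranks <= n) for n in range(1, sim.shape[1] + 1)])
```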
If algorithms A and B give very similar results, $Z$ tends to zero; as their difference increases, $Z$ increases. Confidence limits can be obtained from the $Z$ value through a standard table. When $Z$ equals 3.9, the confidence reaches 100%. Before showing the recognition performance, we discuss the preprocessing.

4.1. Preprocessing

The gait silhouettes always contain spurious pixels, holes inside the moving subject and other anomalies after background subtraction. First, we employed mathematical morphological operations, such as erosion and dilation, to fill holes and remove small noisy areas. To eliminate the size difference caused by the varying distance between the subject and the camera, the silhouettes are centered and adjusted to the same height. The aforementioned preprocessing is adequate for silhouettes of high quality, such as those in the CMU Mobo gait database, as shown in Fig. 3. However, many preprocessed silhouettes in the CASIA gait database (dataset B), such as those in Fig. 4, still need further denoising to suppress the influence of silhouette incompleteness. In our previous work, we proposed a gait representation called the frame difference energy image (FDEI) to solve this problem [22]. FDEI is the addition of the frame difference and the gait energy image [28]. The FDEI $F(x,y,t)$ of frame $t$ is calculated as

$F(x,y,t) = B(x,y,t) - B(x,y,t-1) + G(x,y),$    (15)

where $B(x,y,t)$ is the $t$th frame silhouette and $G(x,y)$ is the gait energy image of the gait cycle. Fig. 5 shows some examples of incomplete silhouettes, their corresponding previous frames and their FDEI representations. It can be seen that FDEI can suppress the silhouette distortion caused by incompleteness. We employed the FDEI representation in place of the binary gait silhouettes of the CASIA gait database (dataset B) in the following recognition experiments.

In order to describe the gait sequence efficiently, two features, the frieze feature and the wavelet feature, are used in the following experiments. The frieze feature is obtained by stacking row projections of the gait image; that is, for a binary gait image $b(x,y,t)$ indexed spatially by pixel location $(x,y)$ and temporally by time $t$, the frieze feature is $F_R(y,t) = \sum_x b(x,y,t)$. The two-dimensional wavelet transform is applied to the silhouettes using the Haar wavelet basis, and the wavelet coefficients of the approximation sub-image, which holds the most useful information, are chosen as the wavelet feature.

4.2. Recognition on the CMU Mobo gait database

The CMU Mobo gait database consists of sequences from 25 subjects walking on a treadmill positioned in the middle of a room. Each subject is recorded performing four different types of walking: slow walk, fast walk, slow walk holding a ball, and walking on an inclined plane. Each recorded sequence is 11 s long at about 30 frames per second. Slow and fast walk sequences captured by the frontal-view camera were adopted in the following experiments. Two experiments were performed, set up as follows:

(a) S vs. F: training on slow walk sequences and testing on fast walk sequences.
(b) F vs. S: training on fast walk sequences and testing on slow walk sequences.
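The FDEI of Eq. (15) and the frieze feature of Section 4.1 can be sketched as follows; how the paper handles the first frame of a cycle (which has no previous frame) is not stated, so adding $G$ alone there is our assumption:

```python
import numpy as np

def frieze_feature(b):
    """Frieze feature F_R(y, t) = sum_x b(x, y, t); b is an X x Y x T volume."""
    return b.sum(axis=0)

def fdei(B, G):
    """FDEI of Eq. (15): F(x,y,t) = B(x,y,t) - B(x,y,t-1) + G(x,y).
    B: X x Y x T binary silhouette cycle; G: X x Y gait energy image.
    The first frame has no predecessor; using B + G there is our assumption."""
    F = np.empty(B.shape, dtype=float)
    F[..., 0] = B[..., 0] + G
    F[..., 1:] = B[..., 1:] - B[..., :-1] + G[..., None]
    return F
```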
Fig. 3. Examples of human silhouettes and their preprocessed silhouettes from the CMU Mobo gait database. (a) Raw silhouettes and (b) preprocessed silhouettes.
Fig. 4. Examples of human silhouettes and their preprocessed silhouettes from the CASIA gait database (dataset B). (a) Raw silhouettes and (b) preprocessed silhouettes.
Fig. 5. Some incomplete silhouettes and their FDEI. (a) Incomplete silhouettes; (b) the frames preceding the silhouettes in (a); (c) FDEI of the silhouettes in (a).
4.2.1. LDT

We directly applied LDT to gait recognition and employed the Martin distance [23] to calculate the distance between two models. We compared the results of LDT with DT [10] and kernel DT (KDT) [17]; KDT is another extension of DT that uses kernel PCA to obtain the observation function. Some abbreviations are used: DT(B), DT(f) and DT(w) denote DTs constructed on binary images, frieze features and wavelet features, respectively. The CMS curves of the algorithms are shown in Fig. 6. Four cycles were used for training and two cycles for testing. The proposed LDT had the best performance, and its rank 1 recognition rates reached 80% in both experiments, which reveals its robustness to walking speed. Compared with KDT, LDT averaged a 14% improvement in the rank 1 recognition rate and climbed faster as the rank increased. Because KDT deals with the image sequence in a high-dimensional space, its experimental results were quite different from those of the other algorithms, and its correct identification rate in experiment F vs. S was 12% higher than in experiment S vs. F. While the curves of DT(w) and DT(f) crossed, DT(w) performed better overall; DT(w) outperformed KDT in the first experiment, but the situation was reversed in the second, so the performances of DT(w) and KDT were comparable. DT(B)
obtained the worst performance, with only a 30% recognition rate at rank 1, which demonstrates that DT is not appropriate for describing binary image sequences directly.

McNemar's test gives the statistical significance of LDT relative to the other algorithms. Two cycles were chosen for training and two cycles for testing. Both the slow walk and the fast walk had 7 cycles; therefore, $C_7^2 \times 25 \times 2 = 1050$ experiments were executed, and their McNemar's test results are shown in Table 1. DT(w) had an 86.2% confidence of superiority over KDT. DT(f) greatly improved on DT(B), but was inferior to KDT with a 64.1% confidence. The wavelet feature performed better than the frieze feature for DT. The proposed LDT performed absolutely better than the other two algorithms.

4.2.2. LTSM

The recognition performance of the LTSM was compared with HMM, pHMM [9] and LDT. We use LTSM(f), LTSM(w) and LTSM(L) to denote the LTSMs combining HMM with DT(f), DT(w) and LDT, respectively; HMM(f) and HMM(w) are HMMs built on the frieze feature and the wavelet feature. Fig. 7 shows the CMS curves of LTSM, HMM, pHMM and LDT. LTSM(L) had the best performance, with no person incorrectly recognized. LTSM(w) also performed excellently, averaging 4% better than LTSM(f). LTSM(f), HMM(w) and HMM(f) had the same recognition rate at rank 1, but their different climb velocities made LTSM(f) the best and HMM(f) the worst of the three. The pHMM performed better than LDT, but about 6–8% lower than HMM. The rank 1 recognition rate of LDT was at least 10% lower than those of the other algorithms. In other words, the proposed LTSM algorithm outperformed HMM and LDT. When the first level was modeled by LDT, it was better than when modeled by DT. The wavelet feature performed better than the frieze feature overall. These algorithms are further compared in Table 2 using McNemar's test. Two cycles were chosen for training and two cycles for testing; again 1050 experiments were executed.
LTSM(L) had an 87.3% confidence of being superior to LTSM(w) and was decisively better than the other algorithms. The wavelet feature performed better than the frieze feature when LTSM was used, but had a 78.5% confidence of being inferior to the frieze feature when HMM was employed. Whether DTs or LDTs were constructed in the first level, the proposed LTSM improved the recognition performance; the improvement of LTSM(L) was more obvious, which further verified the validity of LDT. HMM outperformed pHMM with a large Z value. LDT had the worst performance, which was partly due to its initial state values being neglected by the Martin distance.

4.3. Recognition of the CASIA gait database (dataset B)
The CASIA gait database (dataset B) contained 124 subjects (93 males and 31 females) captured from 11 view angles. There were six normal walking sequences for each subject per view. There were 11 experiments carried out for all view angles of the database. Because many silhouettes in this database were incomplete after background subtraction, we employed the FDEI representation [22] before the models (such as HMM, DT and LTSM) were built in the experiments; for further details refer to Section 4.1.
Fig. 6. The CMS curves of DT, kernel DT (KDT) and logistic DT (LDT). DT(B), DT(f) and DT(w) represent DTs built on binary silhouettes, frieze features and wavelet features, respectively. (a) S vs. F and (b) F vs. S.

Table 1
Performance comparison of LDT, DT and KDT.

                     Nss    Nsf    Nfs    Nff    Z-value    Confidence
LDT vs. DT(w)        650    151     46    203     5.575     100%
DT(w) vs. KDT        665     31     22    332     1.099     86.2%
KDT vs. DT(f)        625     62     57    306     0.367     64.1%
DT(f) vs. DT(B)      341    341      5    363    18.01      100%
that of DT, whose average recognition rates were below 60%. The recognition rate of the proposed LTSM improved on that of HMM by 1.8% when using the wavelet feature and by 3.6% when using the frieze feature. The wavelet feature performed better than the frieze feature for HMM, DT and LTSM. Whichever feature was used, the proposed LTSM had higher average rates than the other algorithms. The average CMS curves of the 11 experiments are illustrated in Fig. 8. The performance of LTSM(w) was the best at all ranks. HMM(w) performed better than LTSM(f) at the first three ranks, but LTSM(f) climbed faster. HMM(f) had a performance similar to that of pHMM, but worse than that of HMM(w). Although the CMS curves of DT climbed the fastest, their rank 1 recognition rates were too low for their rank 10 recognition rates to exceed 95%. While the quality of the database was not satisfactory, the proposed LTSM still showed excellent performance with the help of FDEI. The algorithms were also evaluated using McNemar's test. Two sequences were chosen for training and two sequences for testing, giving a total of C(6,2) × 124 × 11 = 20,460 experiments. The results of McNemar's test are shown in Table 4 and further testify to the validity of the proposed LTSM. LTSM(w) performed the best. LTSM(f) had a 56.4% confidence of being better than HMM(w). The superiorities of HMM(w) over pHMM, pHMM over HMM(f), HMM(f) over DT(w) and DT(w) over DT(f) are decisive, with high Z values.
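The CMS curves compared above plot, for each rank k, the fraction of probe sequences whose true identity appears among the k nearest gallery models. A minimal sketch, assuming a precomputed probe-by-gallery distance matrix (e.g. Martin distances between the learned models) and identity labels, and assuming every probe identity is present in the gallery:

```python
import numpy as np

def cms_curve(dist, gallery_ids, probe_ids, max_rank=15):
    """Cumulative match score: fraction of probes whose true identity
    appears within the top-k closest gallery entries, for k = 1..max_rank.
    Assumes closed-set recognition (each probe id occurs in the gallery)."""
    order = np.argsort(dist, axis=1)            # gallery indices, nearest first
    ranked = np.asarray(gallery_ids)[order]     # identities in rank order
    hits = ranked == np.asarray(probe_ids)[:, None]
    first_hit = hits.argmax(axis=1)             # 0-based rank of the true match
    return [np.mean(first_hit < k) for k in range(1, max_rank + 1)]
```

The rank 1 value is the usual correct classification rate; the curve's "climb velocity" is how quickly these fractions approach 1 as k grows.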
Fig. 7. The CMS curves of HMM, population HMM (pHMM), LDT and LTSM. HMM(f) and HMM(w) represent HMMs built on frieze features and wavelet features, respectively. LTSM(f), LTSM(w) and LTSM(L) represent LTSMs combining HMM with DT(f), DT(w) and LDT, respectively. (a) S vs. F and (b) F vs. S.

Table 3
Performance comparison of LTSM, HMM and DT (correct classification rates, %).

Angle   pHMM    Frieze feature            Wavelet feature
                DT      HMM     LTSM      DT      HMM     LTSM
0       94.4    59.7    95.2    97.6      62.9    100     100
18      86.3    54.8    84.7    91.9      56.5    100     96.8
36      85.9    49.2    84.7    86.3      54.8    100     92.7
54      88.7    54.8    93.5    94.4      57.8    93.4    96.8
72      89.5    54.8    90.3    99.2      58.9    91.1    97.6
90      90.3    58.1    91.1    95.2      61.3    90.3    95.2
108     85.9    49.2    86.3    91.9      57.8    90.3    96.8
126     85.5    54.8    84.7    84.7      56.5    86.3    91.9
144     89.9    53.2    85.5    93.5      57.2    91.9    95.2
162     90.7    57.8    88.7    85.5      59.7    91.9    89.5
180     93.4    58.1    95.2    99.2      61.3    97.6    100
Avg.    89.2    55.0    89.1    92.7      58.6    93.9    95.7
Table 2
Performance comparison of LTSM, HMM, pHMM and LDT.

                        Nss    Nsf    Nfs    Nff    Z-value    Confidence
LTSM(L) vs. LTSM(w)     980     12      7     51     1.147     87.3%
LTSM(w) vs. LTSM(f)     968     19     13     50     1.237     89.2%
LTSM(f) vs. HMM(f)      948     33     24     45     1.060     85.5%
HMM(f) vs. HMM(w)       949     23     17     61     0.791     78.5%
HMM(w) vs. pHMM         850    116     45     39     5.52      100%
pHMM vs. LDT            800     95     29    123     5.84      100%
FDEI represented the binary silhouettes as gray-level images. Because LDT could not be applied to gray-level images, LDT and LTSM(L) were not used on this database. We report the results of LTSM(w), LTSM(f), HMM(w), HMM(f), DT(w) and DT(f). Four sequences were used for training and two sequences for testing. The correct classification rates of these algorithms are shown in Table 3. HMM(w) had the best performance at angles 0, 18, 36 and 162. LTSM(f) attained the highest recognition rates, 99.2% and 95.2%, at angles 72 and 90, respectively. LTSM(w) performed best at the other angles, apart from 18, 36, 72 and 162. The performance of pHMM was similar to that of HMM(f) and much better than
Fig. 8. The CMS curves of DT, HMM, pHMM and LTSM using the frieze feature and the wavelet feature.
Table 4
Performance comparison of LTSM, HMM, pHMM and DT.

                        Nss       Nsf     Nfs     Nff     Z-value    Confidence
LTSM(w) vs. LTSM(f)     18,960     775     613     112      5.979    100%
LTSM(f) vs. HMM(w)      18,783     690     683     204      0.162    56.4%
HMM(w) vs. pHMM         18,723     743     369     525     11.18     100%
pHMM vs. HMM(f)         18,620     472     208    1060     10.08     100%
HMM(f) vs. DT(w)        13,297    5531     517    1015     64.47     100%
DT(w) vs. DT(f)         10,825    2989    1315    5231     25.52     100%
5. Discussion and conclusions

This paper proposed two improved dynamic Bayesian networks for gait recognition, LDT and LTSM. LDT is proposed to model binary image sequences directly and avoids the information loss caused by feature extraction. Experimental results showed that LDT had better recognition performance than DT and proved to be a useful extension. Although LDT and DT were not competitive at recognition compared with the other algorithms, they performed well in describing the motion process. LTSM aims to tackle the obstacles hindering the application of HMM and DT/LDT to gait recognition. It overcomes the non-linear process representation problem using piecewise linear DTs/LDTs and applies HMM to describe the transitions among the DTs/LDTs. LTSM needs no new inference algorithms and is easily constructed; once the LTSMs of the training database are established, recognition can be performed online. Experimental results verified the validity of the proposed model. Although we only applied this model to gait recognition, it can also be applied to other fields, such as biological sequence analysis, activity recognition and handwriting recognition.
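The two-level construction summarized above can be outlined in a few lines. This is an illustrative sketch only: `fit_dt` and `fit_hmm` are hypothetical stand-ins for whatever DT/LDT parameter learning (e.g. via PCA or logistic PCA) and HMM training are plugged in, and the split into equal-length temporally adjacent clusters is a simplification.

```python
import numpy as np

def fit_ltsm(cycle, n_clusters, fit_dt, fit_hmm):
    """Two-level LTSM sketch: split one gait cycle (frames x features) into
    temporally adjacent clusters, fit one DT/LDT per cluster (first level),
    then fit an HMM over the cluster-index sequence (second level)."""
    segments = np.array_split(cycle, n_clusters)      # temporally adjacent clusters
    dts = [fit_dt(seg) for seg in segments]           # first level: one DT/LDT each
    labels = np.repeat(np.arange(n_clusters),
                       [len(s) for s in segments])    # frame -> cluster index
    hmm = fit_hmm(labels)                             # second level: transitions among DTs
    return dts, hmm
```

Because each level reuses standard DT and HMM learning, no new inference machinery is required, which is the point made above.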
Acknowledgements

This work was partially supported by the NSF of Jiangsu Province (BK2010523), NSFC (60902083, 60872154 and 81090272), NBRPC (2006CB705700 and 2011CB707702), and the Significant software and integrated circuit special of Jiangsu Province (2007-161). The authors would like to thank the Journal Manager and the anonymous reviewers for their constructive comments. They would also like to thank Karen von Deneen for her help in improving the manuscript.

References

[1] L. Wang, H. Ning, T. Tan, W. Hu, Fusion of static and dynamic body biometrics for gait recognition, IEEE Transactions on Circuits and Systems for Video Technology 15 (2004) 149–158.
[2] T. Lam, R. Lee, D. Zhang, Human gait recognition by the fusion of motion and static spatio-temporal templates, Pattern Recognition 40 (2007) 2563–2573.
[3] A.I. Bazin, L. Middleton, M.S. Nixon, Probabilistic fusion of gait features for biometric verification, in: Proceedings of the International Conference on Information Fusion, vol. 2, 2005, pp. 1211–1217.
[4] G.V. Veres, M.S. Nixon, L. Middleton, J.N. Carter, Fusion of dynamic and static features for gait recognition over time, in: Proceedings of the International Conference on Information Fusion, vol. 2, 2005, pp. 1204–1210.
[5] J.D. Hamilton, State-space models, Handbook of Econometrics 4 (1994) 3039–3080.
[6] A. Kale, N. Cuntoor, R. Chellappa, A framework for activity-specific human identification, in: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 4, 2002, pp. 3660–3663.
[7] A. Sundaresan, A. Roy-Chowdhury, R. Chellappa, A hidden Markov model based framework for recognition of humans from gait sequences, in: Proceedings of IEEE International Conference on Image Processing, 2003, pp. 143–150.
[8] C. Chen, J. Liang, H. Zhao, H. Hu, Gait recognition using hidden Markov model, in: Proceedings of the 2nd International Conference on Natural Computation, part I, 2006, pp. 399–407.
[9] Z. Liu, S. Sarkar, Improved gait recognition by gait dynamics normalization, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006) 863–876.
[10] G. Doretto, A. Chiuso, S. Soatto, Y.N. Wu, Dynamic textures, International Journal of Computer Vision 51 (2003) 91–109.
[11] A. Juan, E. Vidal, Bernoulli mixture models for binary images, in: Proceedings of the 17th International Conference on Pattern Recognition, part III, 2004, pp. 367–370.
[12] Z. Zivkovic, J. Verbeek, Transformation invariant component analysis for binary images, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2006, pp. 254–259.
[13] C. Mazzaro, M. Sznaier, O. Camps, S. Soatto, A. Bissacco, A model (in)validation approach to gait recognition, in: Proceedings of the International Symposium on 3D Data Processing Visualization and Transmission, 2002, pp. 700–703.
[14] A. Bissacco, A. Chiuso, S. Soatto, Classification and recognition of dynamical models: the role of phase, independent components, kernels and optimal transport, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007) 1958–1972.
[15] A. Veeraraghavan, A.K. Roy-Chowdhury, R. Chellappa, Matching shape sequences in video with applications in human movement analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 1896–1908.
[16] A. Bissacco, S. Soatto, Hybrid dynamical models of human motion for the recognition of human gaits, International Journal of Computer Vision 85 (2009) 101–114.
[17] A.B. Chan, N. Vasconcelos, Classifying video with kernel dynamic textures, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–6.
[18] A.B. Chan, N. Vasconcelos, Modeling, clustering, and segmenting video with mixtures of dynamic textures, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2008) 909–926.
[19] A.B. Chan, N. Vasconcelos, Layered dynamic textures, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2009) 1862–1879.
[20] R. Gross, J. Shi, The CMU Motion of Body (MoBo) Database, Technical Report CMU-RI-TR-01-18, Robotics Institute, Carnegie Mellon University, 2001.
[21] CASIA gait database, available online: http://www.cbsr.ia.ac.cn/english/Gait%20Databases.asp.
[22] C. Chen, J. Liang, H. Zhao, H. Hu, J. Tian, Frame difference energy image for gait recognition with incomplete silhouettes, Pattern Recognition Letters 30 (2009) 977–984.
[23] K.D. Cock, B.D. Moor, Subspace angles between linear stochastic models, in: Proceedings of IEEE Conference on Decision and Control, 2000, pp. 1561–1566.
[24] A.I. Schein, A. Popescul, L. Ungar, D.M. Pennock, A generalized linear model for principal component analysis of binary data, in: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003, pp. 110–117.
[25] Y. Li, T. Wang, H.Y. Shum, Motion texture: a two-level statistical model for character motion synthesis, in: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques, 2002, pp. 465–472.
[26] C. Bregler, Learning and recognizing human dynamics in video sequences, in: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1997, pp. 568–574.
[27] A.F. Clark, C. Clark, Performance characterization in computer vision: a tutorial, available online: http://peipa.essex.ac.uk/benchmark/tutorials/essex/tutorial.pdf.
[28] J. Han, B. Bhanu, Individual recognition using gait energy image, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (2006) 316–322.
Changhong Chen received her Ph.D. degree in electronic engineering from Xidian University, Xi'an, China, in 2009. After graduation, she joined the College of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China. Her current research interests include pattern recognition, image processing and video analysis.

Jimin Liang received his B.S., M.S. and Ph.D. degrees from Xidian University, Xi'an, China, in 1992, 1995 and 2000, respectively, all in electronic engineering. In 1995, he joined Xidian University, where he is currently a Professor in the Life Sciences Research Center, School of Life Sciences and Technology. During 2002, he was a Research Associate Professor in the Department of Electrical and Computer Engineering, University of Tennessee, Knoxville. His current research interests span several areas of image processing and analysis.

Xiuchang Zhu received his B.S. and M.S. degrees from Nanjing University of Posts and Telecommunications in 1982 and 1987, respectively. He has been working at Nanjing University of Posts and Telecommunications since 1987. At present, he is a Professor and the director of the Jiangsu Key Laboratory of Image Processing and Image Communications. His current research interests focus on multimedia information, especially the collection, processing, transmission and display of image and video.