Pattern Recognition Letters 33 (2012) 476–484
Feature fusion for 3D hand gesture recognition by learning a shared hidden space

Jun Cheng a,*, Can Xie a,b, Wei Bian a,b,c, Dacheng Tao c

a Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
b The Chinese University of Hong Kong
c Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, Broadway NSW 2007, Australia

* Corresponding author. E-mail addresses: [email protected] (J. Cheng), [email protected] (C. Xie), [email protected] (W. Bian), [email protected] (D. Tao).
Article history: Available online 29 December 2010
Keywords: Hand gesture recognition; Feature fusion; Shared hidden space; Discrete cosine transform
Abstract: Hand gesture recognition has been intensively applied in various human–computer interaction (HCI) systems. Different hand gesture recognition methods have been developed based on particular features, e.g., gesture trajectories and acceleration signals. However, it has been noticed that the limitations of either feature can lead to flaws in an HCI system. In this paper, to overcome these limitations while combining the merits of both features, we propose a novel feature fusion approach for 3D hand gesture recognition. In our approach, gesture trajectories are represented by the numbers of intersections with randomly generated line segments on their 2D principal planes, and acceleration signals are represented by the coefficients of the discrete cosine transform (DCT). Then, a hidden space shared by the two features is learned by penalized maximum likelihood estimation (MLE). An iterative algorithm, composed of two steps per iteration, is derived for this penalized MLE, in which the first step solves a standard least squares problem and the second step solves a Sylvester equation. We tested our hand gesture recognition approach on different hand gesture sets. The results confirm the effectiveness of the feature fusion method.
Crown Copyright © 2010 Published by Elsevier B.V. All rights reserved.
1. Introduction

In recent years, with the rapid development of information technology, increasing attention has been paid to the field of human–computer interaction (HCI) (Ho et al., 2005; Hong et al., 2000; Wu and Huang, 1999). In addition, the growing demands from various applications, such as virtual games and robot control, have also boosted the development of HCI (Kang et al., 2004; Stiefelhagen et al., 2007; Yang et al., 2007). Nowadays, people are searching for novel techniques for efficient and comfortable interaction between humans and computers (Ho et al., 2005; Hong et al., 2000). Indeed, the traditional interaction methods, e.g., using a mouse or keyboard, are inherently restricted in speed and space, and lack an immersive sense within the simulated environment (Pavlovic et al., 1997). In contrast, gesture recognition (Mitra and Acharya, 2007; Wu and Huang, 1999), a newly developed HCI technique, can enrich the personal experience during interaction, and thus plays an important role in modern HCI systems, from virtual sports games and remote control of virtual scenes to home intelligence applications (Kang et al., 2004; Mitra and Acharya, 2007; Zhang et al., 2009). In the early stage, gesture recognition systems, such as glove-based gestural interfaces, required the user to wear cumbersome devices and carry a load of cables to connect to a computer (Mitra
and Acharya, 2007). Thus, these HCI techniques inherently prevent natural interaction between the user and the computer. Recently, the advent of vision-based and accelerometer-based gesture recognition techniques has eliminated the above limitations to a great extent, as no contacting devices are required. The vision-based technique (Lee et al., 2007; Liu and Li, 2004) generally adopts one or two stereo cameras or specialized cameras, such as time-of-flight cameras, to approximate a 3D representation of the observed scene. Then pattern matching methods are employed to recognize the gestures. A number of vision-based gesture recognition systems have been developed in the past years, e.g., hand gesture recognition by hidden Markov models (HMM) (Bernardin et al., 2005; Lee and Kim, 1999; Lv and Nevatia, 2006; Yoon et al., 2001) and DTW (Darrell et al., 1996), the spatiotemporal matching algorithm for gesture recognition (Alon et al., 2009), and the gesture control system and video game system developed in (Kang et al., 2004) and (Ramamoorthy et al., 2003). The accelerometer-based technique (Fitz-Walter et al., 2008; Wu et al., 2009) utilizes acceleration signals, sampled from a three-axis accelerometer, to represent gestures. Acceleration-based gesture recognition has made great progress thanks to the tremendous development of Micro Electro Mechanical Systems (MEMS) technology (Mäntyjärvi et al., 2004). For instance, the Wii Remote (Schlömer et al., 2008) employs an accelerometer to track the user's movements, and has been widely applied, especially in virtual games. However, it has also been noticed that both the vision-based and the acceleration-based gesture recognition techniques have their own limitations (Wu et al., 2009). Since vision-based gesture
recognition acquires the 3D trajectory of a gesture through stereo cameras, it is prone to interference from varying lighting conditions and cluttered backgrounds. Moreover, the sample rate of ordinary stereo cameras is not high enough for capturing rapid movements, while high-quality stereo cameras are rather expensive. In contrast, acceleration-based gesture recognition does not utilize the 3D trajectory but the acceleration signals of gestures, and thus is unaffected by lighting conditions and backgrounds. Another advantage of the acceleration-based method is that the sample rate of accelerometers is generally adequate for high-speed gestures. However, a major drawback of accelerometers is that they cannot collect the position and orientation information of a gesture. Thus, a certain degree of realism is lost by acceleration-based gesture recognition.

An effective way to overcome the above limitations of the vision-based and the acceleration-based gesture recognition techniques, while preserving their merits, is to find a proper fusion of the features used in them. To this end, we propose a fusion approach for the two kinds of features by learning a shared hidden space. First, we introduce new methods for the representation of the visual and acceleration features, i.e., by counting the intersection points between the projected gesture trajectories and randomly generated line segments, and by using the discrete cosine transform (DCT) of the acceleration signals. Then, we formulate the learning of the shared hidden space as a penalized maximum likelihood estimation (MLE), in which the likelihood term contains the information of the two features while the penalization takes account of the supervised information. The penalized MLE is solved by an iterative algorithm: in each iteration, the first step involves standard least squares problems, and the second step solves a Sylvester equation. To test the effectiveness of the proposed hand gesture recognition approach, and especially the feature fusion method, we perform experiments on two hand gesture sets, i.e., a 10-digit set and an eight-letter set. Both experiments show that feature fusion significantly improves the recognition performance.

The rest of this paper is organized as follows. In Section 2, we briefly introduce the feature fusion based hand gesture recognition system. In Section 3, we show the representation of both visual and acceleration features. Section 4 presents the feature fusion method, i.e., the penalized MLE formulation and the iterative algorithm. Experiments on two hand gesture sets are reported in Section 5. Section 6 concludes this paper.
2. System overview

In this paper, we propose a continuous hand gesture recognition method, which combines the vision-based and acceleration-based techniques to recognize gestures robustly. Some work has been done on gesture recognition with the fusion of multiple sensors, but those methods recognize the gestures individually and then combine the individual recognition outputs of the two modalities, which does not make full use of the information in each (Huang and Li, 2008). Our system includes three modules: the visual image processing module, the acceleration signal module and the fusion recognition module. The visual image processing module captures the 3D trajectories, the acceleration signal module collects the 3D acceleration signals, and the fusion recognition module combines the two information sources to recognize the gestures in real time. All of this processing runs on an ordinary personal computer. Fig. 1 shows the diagram of the entire system, which uses both 3D trajectory signals and 3D acceleration signals.

In our hand gesture recognition system, the motion capture platform includes two infrared cameras and a handhold. The handhold has an embedded accelerometer that detects the 3D acceleration of the movement and transmits it to the PC via the Bluetooth
protocol. The sampling rate of the accelerometer is 80 samples per second. Retro-reflective markers are stuck to the surface of the handhold. The markers are made of a material called ScotchLite, whose reflectance to light is thousands of times higher than that of everyday materials. The two cameras capture infrared images synchronously. We regulate the illumination of the LEDs so that the marker is highlighted and the background is kept as dark as possible. To extract the marker area in the infrared images, we simply use a threshold-based segmentation method. We then rectify the distortion of the 2D position after calculating the center of gravity of the marker. Once the position of the marker is obtained, a shape-based matching algorithm is adopted, and the epipolar constraint is used when matching the object in both images. After the correspondence of the marker is established, its 3D position can be easily calculated by triangulation. If there is more than one candidate object, we filter out the noise by a velocity constraint. Finally, the output of our motion capture platform is the 3D position of the handhold in the camera coordinate system. The range of the visual angle is 30°, and the cameras capture images of up to 640 × 480 pixels at 30 fps. The user holds the handhold and performs the predefined gestures, and our system recognizes the gestures automatically in real time.

The visual image processing module reconstructs the 3D position of the handhold from the synchronous IR images captured by the two infrared cameras. At the same time, the acceleration signal module acquires the 3D acceleration signals. The fusion recognition module receives the 3D trajectories and 3D acceleration signals synchronously, and a gesture spotting algorithm segments potential gestures synchronously. Gesture spotting is the task of finding the start and end boundaries of potential gestures in the continuous signals. In this paper, we have to segment potential gestures from the two kinds of signals, respectively; a threshold-based spotting algorithm is applied to both sequences of signals. The signals of a gesture generally experience three phases during the whole movement: preparation, stroke and retraction. Thus, a gesture is considered to start at a high speed, continuously change direction during a period, and end in an almost steady position. Our threshold-based spotting algorithm segments the gestures on the 3D trajectory signals as follows. First, we obtain 3D velocity signals from the 3D trajectories by differencing. When the velocity value exceeds a predefined starting threshold TH_BEGIN, we begin to record the signal data, and when we detect that several consecutive values are lower than the ending threshold TH_END, a legitimate gesture is obtained; the 3D trajectory between the start point and the end point is deemed to be associated with a tentative gesture. The same method is employed on the acceleration signals to realize segmentation synchronously. Generally, a hand gesture lasts at least 0.4 s, and thus the discrete time length of its acceleration signals is more than 30 and the length of its 3D trajectory signals is more than 10. Visual features and acceleration features are extracted from the potential-gesture 3D trajectories and 3D acceleration signals, respectively; the next section describes them in detail.
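To make the spotting rule concrete, the following is a minimal sketch in Python/NumPy of the threshold-based segmentation described above. The function name, the default minimum-length and settle-count values, and the synthetic example data are illustrative assumptions, not the exact implementation used in our system.

```python
import numpy as np

def spot_gestures(speed, th_begin, th_end, min_len=10, end_count=5):
    """Threshold-based gesture spotting on a 1D speed (or acceleration
    magnitude) sequence: start recording when the signal exceeds th_begin,
    and close the segment once several consecutive samples stay below th_end."""
    segments, start, low_run = [], None, 0
    for t, v in enumerate(speed):
        if start is None:
            if v > th_begin:              # potential gesture begins
                start, low_run = t, 0
        else:
            low_run = low_run + 1 if v < th_end else 0
            if low_run >= end_count:      # signal has settled: gesture ends
                if t - start >= min_len:
                    segments.append((start, t))
                start, low_run = None, 0
    return segments

# Example: speed magnitude obtained by differencing 3D trajectory samples.
traj = np.cumsum(np.random.randn(500, 3), axis=0)       # fake 3D positions
speed = np.linalg.norm(np.diff(traj, axis=0), axis=1)    # frame-to-frame speed
print(spot_gestures(speed, th_begin=2.0, th_end=1.0))
```

In practice, TH_BEGIN and TH_END would be tuned separately for the velocity and acceleration magnitudes of the recorded signals.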
3. Feature representation

In this section, we discuss the representation methods for the 3D gesture trajectories and the three-axis acceleration signals.

3.1. Random visual feature

A number of methods have been developed to represent the 3D trajectories of hand gestures, such as the location, orientation and velocity features of gestures (Yoon et al., 2001). A major advantage of our random visual feature over these methods is its robustness.
Fig. 1. The diagram of the feature-fusion based 3D hand gesture recognition system.
Fig. 2. Feature extraction: (a) Example of random visual feature representation. (b) Example of three-axis acceleration signals. (c) DCTs of the acceleration signals.
First, we assume that, during a short period, the 3D trajectory of each hand gesture is mainly contained in a 2D principal plane. This assumption meets the requirements of most HCI systems and video games. Then, we represent each trajectory by counting the intersection points between it and a set of randomly generated line segments in its 2D principal plane.

Denote by T ∈ R^{n×3} a 3D gesture trajectory, where n is the number of points sampled from the trajectory, and the three columns of T store the (x, y, z) coordinates of the points in the coordinate system defined by the cameras. Principal component analysis (PCA) (Jolliffe, 2002) is applied to find the 2D principal plane of T. Specifically, we first centralize T by subtracting its mean coordinates, and then perform the singular value decomposition

T = Σ_{i=1}^{3} λ_i u_i v_i^*,   (1)

where * is the transposition operator, λ_i is the ith singular value of T, and u_i and v_i are the corresponding left and right singular vectors, respectively. The projected trajectory T̃ ∈ R^{n×2}, onto the 2D principal plane, is given by

T̃ = [λ_1 u_1, λ_2 u_2].   (2)

It is worth emphasizing that, by projection onto the 2D principal plane, T̃ does not depend on the original coordinate system. Thus, the feature representation to be shown later is invariant to the rotation of the hand gesture. To make it scale-invariant, we further rescale T̃ to lie within a 25 × 25 square.

To represent the obtained 2D trajectory, we use a randomized method. First, we randomly generate a set of line segments in the 2D principal plane, denoted by {l_1, l_2, ..., l_{d_1}}, for which the two endpoints are sampled according to the uniform distribution on the left–right and top–bottom pairs of sides of the rescaled square. Then, we use each line segment to define one dimension of our random visual feature, i.e.,

f_i(T) = #(T̃ ∩ l_i),   i = 1, 2, ..., d_1,   (3)

where ∩ gives the intersection points and # counts their number, and the d_1-dimensional random visual feature is given by

f_v = [f_1, f_2, ..., f_{d_1}].   (4)

Finally, we normalize f_v by f_v / ‖f_v‖_2, so that it has unit length. Fig. 2(a) gives an illustration of the proposed feature extraction method, using the digit-8 hand gesture as an example. Intuitively, given an adequate set of random line segments, the visual information contained in the gesture trajectory can be sufficiently extracted.
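A minimal NumPy sketch of the random visual feature is given below. The segment-generation convention (one endpoint drawn uniformly on a left/right side, the other on a top/bottom side) and the intersection-counting helper are our own illustrative reading of the description above, and the function names are hypothetical.

```python
import numpy as np

def _segments_intersect(p1, p2, p3, p4):
    """Proper segment-segment intersection via orientation (cross-product) tests."""
    d = lambda a, b, c: (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    d1, d2 = d(p3, p4, p1), d(p3, p4, p2)
    d3, d4 = d(p1, p2, p3), d(p1, p2, p4)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def random_visual_feature(T, d1=30, size=25.0, rng=None):
    """T: (n, 3) trajectory in camera coordinates -> d1-dimensional feature."""
    rng = np.random.default_rng(rng)
    Tc = T - T.mean(axis=0)                          # centralize, Eq. (1)
    U, s, _ = np.linalg.svd(Tc, full_matrices=False)
    P = U[:, :2] * s[:2]                             # projection, Eq. (2)
    P = (P - P.min(axis=0)) / (np.ptp(P, axis=0).max() + 1e-12) * size  # 25x25 square
    # Random segments: one endpoint on the left/right sides, the other on top/bottom.
    a = np.stack([np.where(rng.random(d1) < 0.5, 0.0, size),
                  rng.uniform(0, size, d1)], axis=1)
    b = np.stack([rng.uniform(0, size, d1),
                  np.where(rng.random(d1) < 0.5, 0.0, size)], axis=1)
    f = np.zeros(d1)
    for i in range(d1):                              # Eq. (3): count intersections
        f[i] = sum(_segments_intersect(P[j], P[j + 1], a[i], b[i])
                   for j in range(len(P) - 1))
    return f / (np.linalg.norm(f) + 1e-12)           # unit length, Eq. (4)
```

With d1 = 30 this produces the 30-dimensional visual feature used later in the experiments.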
3.2. Acceleration feature by DCT

The DCT (Ahmed et al., 1974) is a mature and widely applied technique in science and engineering, e.g., in the numerical solution of partial differential equations and in audio/video compression (Rao and Yip, 1990). Here, we introduce the DCT to obtain a compact representation of the three-axis acceleration signals of a hand gesture. Since the acceleration signals of natural hand gestures are sufficiently smooth, it is reasonable to assume that most of the energy of the acceleration signals is located in the low frequency components. Thus, by selecting an adequate number of low frequency DCT coefficients, we can expect nearly negligible information loss. Given a discrete time signal s(n), n = 0, 1, ..., N−1, its DCT is defined by

X(m) = Σ_{n=0}^{N−1} s(n) cos( (π/N) (n + 1/2) m ),   m = 0, 1, ..., N−1,   (5)
where X(m) is the coefficient of the mth frequency component. We perform the DCT on each of the three-axis acceleration signals, obtaining {X_1(m), X_2(m), X_3(m)}, and then the first k coefficients of each of the three DCTs are used to compose our acceleration feature

f_a = [X_1(1), ..., X_1(k), X_2(1), ..., X_2(k), X_3(1), ..., X_3(k)].   (6)

To determine the parameter k, we use an energy principle. Empirically, it is found that by setting k = 10, 90% of the energy of the acceleration signals of most hand gestures can be preserved. Letting d_2 = 3k, we have f_a ∈ R^{d_2}. Finally, normalization is performed by f_a / ‖f_a‖_2, so as to obtain a feature of unit length. Fig. 2(b) and (c) show an example of the three-axis acceleration signals and their DCTs. From the plots, one can see that the acceleration signals are indeed sufficiently smooth, and the low-frequency components contain the major part of the total energy.
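A short sketch of the acceleration feature of Eq. (6) follows. The DCT of Eq. (5) is implemented directly in NumPy (the simple O(N²) form is enough for the short signals involved), and whether the zero-frequency coefficient counts among the "first k" is a detail we simply assume here.

```python
import numpy as np

def dct_eq5(s):
    """DCT of Eq. (5): X(m) = sum_{n=0}^{N-1} s(n) cos(pi/N * (n + 1/2) * m)."""
    N = len(s)
    n = np.arange(N)                  # time index
    m = np.arange(N)[:, None]         # frequency index (one row per m)
    return (np.cos(np.pi / N * (n + 0.5) * m) * s).sum(axis=1)

def acceleration_feature(acc, k=10):
    """acc: (N, 3) three-axis acceleration signals -> 3k-dimensional feature
    built from the first k DCT coefficients of each axis, as in Eq. (6)."""
    f = np.concatenate([dct_eq5(acc[:, axis])[:k] for axis in range(3)])
    return f / (np.linalg.norm(f) + 1e-12)   # unit length, as in the text
```

With k = 10 this yields the 30-dimensional acceleration feature used in the experiments.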
Fig. 3. Probabilistic graphical representation of the hidden features.

Fig. 4. Examples of digit-gestures.

Fig. 5. Gesture spotting of the acceleration signals: the dashed lines mark the segmented gestures, and the solid lines mark intermediate movements. There are three gestures in the figure.

4. Feature fusion

Following the methods developed above, we obtain two kinds of features to represent a 3D hand gesture, i.e., f_v ∈ R^{d_1} and f_a ∈ R^{d_2}. The fact that both features describe the same physical phenomenon suggests that there should be a hidden feature that summarizes the whole information contained in these two. We regard the finding of this hidden feature as a feature fusion process and formulate the idea as the problem of learning a shared hidden space H. Below, we present the detailed formulation and an iterative algorithm.

4.1. MLE of hidden features

Denote the hidden feature in H by f ∈ R^d, and assume that the two obtained features f_v and f_a are generated through two distinct linear transformations A ∈ R^{d_1×d} and B ∈ R^{d_2×d}, i.e.,

f_v = A f + e_v   (7)

and

f_a = B f + e_a,   (8)

where e_v and e_a are independent noises from the Gaussian distribution N(0, σI). Thus, f_v is distributed as N(A f, σI) and f_a as N(B f, σI). Note that we use the same variance for the noise in the two features. This does not lead to any problem, because we mainly focus on the estimation of A, B and f, which is not affected by the variance of the noise. Fig. 3 gives a probabilistic graphical representation of the above formulation. Given a training data set of n gestures, the two sets of features can be obtained from the 3D trajectories and the three-axis acceleration signals, i.e., F_v = [f_v^(1), f_v^(2), ..., f_v^(n)] and F_a = [f_a^(1), f_a^(2), ..., f_a^(n)]. Then, the log likelihood for the hidden features

F = [f^(1), f^(2), ..., f^(n)],   (9)

as well as the linear transformations A and B, is given by

L(F, A, B) = −(1/(2σ²)) Σ_{i=1}^{n} ‖f_v^(i) − A f^(i)‖²_2 − (1/(2σ²)) Σ_{i=1}^{n} ‖f_a^(i) − B f^(i)‖²_2 + Const.   (10)
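To make the generative model (7)–(10) concrete, the short sketch below draws synthetic features according to (7) and (8) and evaluates the data-dependent part of the log likelihood (10). The dimensions, noise level and random dictionaries are arbitrary illustrative values, not quantities from our system.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d1, d2, n, sigma = 20, 30, 30, 100, 0.1

F = rng.standard_normal((d, n))          # hidden features f^(i), stored as columns
A = rng.standard_normal((d1, d))         # linear map for visual features
B = rng.standard_normal((d2, d))         # linear map for acceleration features
Fv = A @ F + sigma * rng.standard_normal((d1, n))   # Eq. (7), column-wise
Fa = B @ F + sigma * rng.standard_normal((d2, n))   # Eq. (8), column-wise

def log_likelihood(F, A, B, Fv, Fa, sigma):
    """Data-dependent part of Eq. (10) (the additive constant is dropped)."""
    return -(np.linalg.norm(Fv - A @ F, 'fro') ** 2
             + np.linalg.norm(Fa - B @ F, 'fro') ** 2) / (2 * sigma ** 2)

print(log_likelihood(F, A, B, Fv, Fa, sigma))
```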
The maximization of (10) for obtaining (F, A, B) can be equivalently formulated as the minimization

min_{(F,A,B)} ‖F_v − AF‖²_F + ‖F_a − BF‖²_F,   (11)

where ‖·‖_F is the Frobenius norm of a matrix. We remark that (11) is closely related to the problem of dictionary learning (Kreutz-Delgado et al., 2003; Li and Liu, 2009). However, (11) is an extension of that problem, i.e., it is a joint dictionary learning problem, in which A and B are the respective dictionaries for the visual and the acceleration features while F contains the shared common coefficients.
4.2. Penalization by supervised information

Above, we have formulated the estimation of (F, A, B) as the minimization (11). However, one limitation of (11) is that it is unable to utilize supervised information. To this end, we introduce a penalization on the hidden features F so that close gestures with the same labels are more likely to keep their relative distances to each other in the hidden space H. Suppose the labels of the n gestures in the training data set are {y_1, y_2, ..., y_n}, with y_i ∈ {1, 2, ..., c}, where c is the number of classes in the hand gesture set. For the jth class, we define the penalty as

R_j(F) = Σ_{y_i = y_{i'} = j} w(f^(i), f^(i')) ‖f^(i) − f^(i')‖²,   (12)

where n_j denotes the number of gestures of the jth class in the training data set, and the weighting function w(f^(i), f^(i')) is given by

w(f^(i), f^(i')) = 1, if ‖f_v^(i) − f_v^(i')‖ + ‖f_a^(i) − f_a^(i')‖ < δ ∧ y_i = y_{i'};   0, if ‖f_v^(i) − f_v^(i')‖ + ‖f_a^(i) − f_a^(i')‖ > δ ∨ y_i ≠ y_{i'},   (13)

where the parameter δ specifies the closeness between gestures from the same class, and "∧" and "∨" are the "and" and "or" operators, respectively.
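As an illustration of how the supervision enters, the sketch below builds the weight matrix W of Eq. (13) and the per-class penalty R_j(F) of Eq. (12); the value of δ and the plain double loop are illustrative choices made for clarity only.

```python
import numpy as np

def weight_matrix(Fv, Fa, y, delta):
    """W[i, j] = w(f^(i), f^(j)) as in Eq. (13): 1 if the two gestures share a
    label and their combined feature distance is below delta, 0 otherwise."""
    n = Fv.shape[1]
    y = np.asarray(y)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dist = (np.linalg.norm(Fv[:, i] - Fv[:, j])
                    + np.linalg.norm(Fa[:, i] - Fa[:, j]))
            if dist < delta and y[i] == y[j]:
                W[i, j] = 1.0
    return W

def class_penalty(F, W, y, j):
    """R_j(F) of Eq. (12) for class j, using the precomputed weights W."""
    idx = np.where(np.asarray(y) == j)[0]
    return sum(W[i, k] * np.linalg.norm(F[:, i] - F[:, k]) ** 2
               for i in idx for k in idx)
```

The matrix W built here is reused below when the penalty is rewritten in matrix form.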
Fig. 6. Averaged features over 120 samples in the digit-gesture set: (a) the random visual feature of digit "0", (b) the random visual feature of digit "5", (c) the acceleration feature of digit "0" and (d) the acceleration feature of digit "5". The solid line shows the mean feature, while the error bars show the variance of each feature dimension.
For all the c classes, the entire penalty is given by

R(F) = Σ_{j=1}^{c} R_j(F).   (14)

Denote by W ∈ R^{n×n} the matrix with elements W_{i,i'} = w(f^(i), f^(i')); then it is easy to verify that

R(F) = trace(F L Fᵀ),   (15)

where L = D − W and D_{i,i} = Σ_{i'=1}^{n} W_{i,i'}. Finally, combining (15) with (11), we obtain the penalized MLE of (F, A, B) as

min_{(F,A,B)} ‖F_v − AF‖²_F + ‖F_a − BF‖²_F + η trace(F L Fᵀ),   (16)

where η is a parameter that controls the weight of the penalty.

4.3. Iterative algorithm

To solve (16), we utilize an iterative algorithm that alternately optimizes (A, B) and F. Fixing F, the objective function of A is a sum of squared errors,

J(A) = ‖F_v − AF‖²_F.   (17)

Thus, it is clear that the optimal A that minimizes (17) is given by

A = F_v Fᵀ (F Fᵀ)⁻¹.   (18)

Similarly, the optimal B is given by

B = F_a Fᵀ (F Fᵀ)⁻¹.   (19)

Note that F ∈ R^{d×n}, where d is much smaller than n in our problem, and thus the inverses in (18) and (19) can be assumed to always exist. Given (A, B), the objective function of F is

J(F) = ‖F_v − AF‖²_F + ‖F_a − BF‖²_F + η trace(F L Fᵀ).   (20)

Taking the derivative with respect to F and setting it to zero, we obtain

∂J/∂F = (Aᵀ A + Bᵀ B) F + η F L − (Aᵀ F_v + Bᵀ F_a) = 0.   (21)
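For illustration, the alternating procedure of (17)–(21) can be sketched in Python with NumPy/SciPy as below. Eq. (21) is rearranged into the standard form M F + F N = Q with M = AᵀA + BᵀB, N = ηL and Q = AᵀF_v + BᵀF_a and handed to scipy.linalg.solve_sylvester; the random initialization, the small ridge term that keeps F Fᵀ invertible, and the convergence test are our own assumptions rather than details taken from the paper.

```python
import numpy as np
from scipy.linalg import solve_sylvester

def fuse_features(Fv, Fa, L, d, eta=1.0, n_iter=50, tol=1e-6, seed=0):
    """Alternating minimization of Eq. (16).

    Fv: (d1, n) visual features, Fa: (d2, n) acceleration features,
    L: (n, n) graph Laplacian D - W, d: dimension of the hidden space."""
    rng = np.random.default_rng(seed)
    n = Fv.shape[1]
    F = rng.standard_normal((d, n))          # random initialization (our choice)
    ridge = 1e-8 * np.eye(d)                 # keeps F F^T numerically invertible
    prev = np.inf
    for _ in range(n_iter):
        G = F @ F.T + ridge
        A = Fv @ F.T @ np.linalg.inv(G)      # Eq. (18)
        B = Fa @ F.T @ np.linalg.inv(G)      # Eq. (19)
        # Eq. (21): (A^T A + B^T B) F + eta * F L = A^T Fv + B^T Fa
        M = A.T @ A + B.T @ B
        Q = A.T @ Fv + B.T @ Fa
        F = solve_sylvester(M, eta * L, Q)
        obj = (np.linalg.norm(Fv - A @ F, 'fro') ** 2
               + np.linalg.norm(Fa - B @ F, 'fro') ** 2
               + eta * np.trace(F @ L @ F.T))          # Eq. (16)
        if abs(prev - obj) < tol * max(1.0, abs(prev)):
            break
        prev = obj
    return F, A, B
```

Given the W from the earlier sketch, the Laplacian is simply L = np.diag(W.sum(axis=1)) - W, and the returned F provides the fused hidden features of the training gestures.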
Fig. 7. Performance evaluation by 5-fold cross validation: (a) 3-NN classification, (b) 5-NN classification and (c) SVM classification. Each method is represented by a box with whiskers, where "Vis" stands for the visual feature, "Acc" stands for the acceleration feature, "Vis-PCA" stands for the 20-dimensional visual feature, "Acc-PCA" stands for the 20-dimensional acceleration feature, "Com-PCA" stands for the 40-dimensional combination feature, "Fus-20" stands for the 20-dimensional fusion feature, and similarly for "Fus-30" and "Fus-40". The boxplot has lines at the lower quartile, median, and upper quartile values of the classification error rates over 20 independent experiments, and error rates more than 1.5 times the interquartile range from the ends of the box are regarded as whisker outliers.
Fig. 8. Performance evaluation by 10-fold cross validation: (a) 3-NN classification, (b) 5-NN classification and (c) SVM classification. Each method is represented by a box with whiskers, where "Vis" stands for the visual feature, "Acc" stands for the acceleration feature, "Vis-PCA" stands for the 20-dimensional visual feature, "Acc-PCA" stands for the 20-dimensional acceleration feature, "Com-PCA" stands for the 40-dimensional combination feature, "Fus-20" stands for the 20-dimensional fusion feature, and similarly for "Fus-30" and "Fus-40". The boxplot has lines at the lower quartile, median, and upper quartile values of the classification error rates over 20 independent experiments, and error rates more than 1.5 times the interquartile range from the ends of the box are regarded as whisker outliers.
Table 1. Confusion matrix of the recognition performance on the digit-gesture dataset under 5-fold cross validation. Rows correspond to ground-truth labels and columns correspond to predicted class labels. SVM classification with the hidden fusion feature is used.
        '0'    '1'    '2'    '3'    '4'    '5'    '6'    '7'    '8'    '9'
'0'     0.95   0      0      0.025  0      0      0.025  0      0      0
'1'     0.033  0.938  0      0      0      0      0      0      0      0.029
'2'     0      0      0.95   0.025  0      0      0      0      0.025  0
'3'     0      0      0.05   0.95   0      0      0      0      0      0
'4'     0      0      0      0      1      0      0      0      0      0
'5'     0      0      0      0.022  0      0.978  0      0      0      0
'6'     0      0      0      0      0.117  0      0.858  0.025  0      0
'7'     0      0.051  0      0      0      0      0.051  0.85   0      0.048
'8'     0      0      0      0      0      0      0      0      1      0
'9'     0      0      0      0      0.12   0      0      0      0.04   0.84
Eq. (21) is a Sylvester equation (Bartels and Stewart, 1972), which is well known in control theory. In particular, there are off-the-shelf solvers for the Sylvester equation, e.g., the dlyap function of Matlab. We iteratively update (A, B) by using (18) and (19), and F by solving (21), until the objective in (16) converges. Such an alternating method only provides a local optimum; however, empirical results show that it is sufficient for our hand gesture recognition problem. The procedure of our feature fusion is summarized in Algorithm 1.
Table 2. Confusion matrix of the recognition performance on the digit-gesture dataset under 10-fold cross validation. Rows correspond to ground-truth labels and columns correspond to predicted class labels. SVM classification with the hidden fusion feature is used.
        '0'    '1'    '2'    '3'    '4'    '5'    '6'    '7'    '8'    '9'
'0'     1      0      0      0      0      0      0      0      0      0
'1'     0      1      0      0      0      0      0      0      0      0
'2'     0      0      1      0      0      0      0      0      0      0
'3'     0      0      0      1      0      0      0      0      0      0
'4'     0      0      0      0      1      0      0      0      0      0
'5'     0      0      0      0      0      1      0      0      0      0
'6'     0      0      0      0      0      0      0.967  0      0.033  0
'7'     0      0      0      0      0      0      0      1      0      0
'8'     0      0      0      0      0      0      0      0      1      0
'9'     0      0      0      0      0.067  0      0.066  0      0      0.867
Algorithm 1. Feature fusion algorithm
Input: visual features f_v ∈ R^{d_1}, acceleration features f_a ∈ R^{d_2} and the penalty weight parameter η.
Output: the shared hidden space H.
Step 1: calculate A and B by (18) and (19).
Step 2: using the result of Step 1, calculate F by solving the Sylvester equation (21).
Step 3: iterate Step 1 and Step 2 to update A, B and F until the objective (16) converges.
Step 4: the hidden features F are the result in the hidden space H.

5. Experiments

To evaluate the performance of the proposed 3D hand gesture recognition approach, we designed the following series of experiments. We used a hand gesture set that contains ten digit-gestures, from 0 to 9. For each digit-gesture, 120 samples were recorded, with 10 samples performed by each of 12 individuals. Fig. 4 shows the 3D trajectories of digits "3" and "6".

5.1. Segmentation

In this part, we open two threads to collect the two kinds of data synchronously. The visual image processing module processes the images captured by the IR cameras to obtain the 3D trajectories, and the acceleration module obtains the 3D acceleration. After acquiring the data, we perform gesture spotting, i.e., finding the start and end boundary points of a legitimate gesture. An automatic spotting algorithm is employed to realize gesture spotting on the two data sources. For the 3D trajectories, we apply a difference operation to obtain velocity values. Because the velocity values are very small, approximately zero, during the transition period, we can use this property to segment potential gestures from the continuous 3D trajectories. At the same time, another thread performs the same segmentation on the 3D acceleration signals. Fig. 5 shows an example of gesture spotting on acceleration signals. After both threads obtain a potential gesture, we extract the features respectively, as described in detail in the next subsection.

5.2. Feature extraction

We employ the random visual feature method to represent trajectories. PCA is adopted to find the 2D principal projection plane of the 3D trajectories. Then, we represent each trajectory by counting the intersection points between it and a set of randomly generated line segments in its 2D principal plane. We perform the DCT to extract acceleration features to represent gestures. Because of the "energy compaction" property, most of the energy is concentrated in the low-frequency (first ten) DCT coefficients. In our experiments, we combine the first ten low-frequency DCT coefficients of the three-axis signals into a 30-dimensional acceleration feature to represent gestures. Fig. 6 shows the averaged random visual and acceleration features and their variances for digit-gestures "0" and "5". According to these plots, the averaged features exhibit relatively discriminative information, though the variances are considerably large in some dimensions and can thus obscure the discrimination between different gestures. The proposed feature fusion method is able to address this problem by finding a hidden space that is more discriminative for the different gestures. The recognition results obtained with the feature fusion method are reported in the next subsection.

5.3. Recognition
To evaluate the recognition performance, 5-fold and 10-fold cross validation are employed, where one of the subsets is used as the test set and the remaining subsets are put together to form the training set. We use the proposed fusion method to learn a shared hidden feature from the random visual feature and the DCT acceleration feature, both of which have 30 dimensions, and then perform recognition on the learned feature. To illustrate the improvement obtained by the fusion method, we also performed gesture recognition on the random visual feature space and the acceleration feature space for comparison. To further test the effectiveness of our algorithm, we additionally apply PCA to the random visual feature space and the acceleration feature space to reduce their dimensionality to 20, and we also combine the two kinds of features and use PCA to obtain a 40-dimensional feature to represent the gestures. We vary the dimensionality of the fusion feature over 20, 30 and 40, and SVM and k-NN classifiers with k equal to 3 and 5 are utilized to perform gesture recognition in the learned feature space.

Figs. 7 and 8 show the averaged performances under 5-fold and 10-fold cross validation, respectively. Fig. 7(a)–(c) shows the recognition performances with the 3-NN, 5-NN and SVM classifiers. We can see that the recognition rates with a single feature are relatively low for all classifiers, i.e., SVM and k-NN, with the random visual feature giving a recognition rate of only about 0.8 and the acceleration feature about 0.85. The recognition rate is improved by PCA dimensionality reduction, but only to a limited extent. Significant improvements in the gesture recognition rate are obtained with the fusion feature compared with the single features, the SVM classifier in particular giving a recognition rate of about 0.95. The confusion matrices for 5-fold and 10-fold cross validation are shown in Tables 1 and 2, respectively, where the hidden fusion features are employed and the SVM classifier is used for classification. Most of the gestures are recognized well, but some misclassifications exist, e.g., gesture "9" is misclassified as "4", and "7" is misclassified as "2" and "9", because of their similarity to some degree. More complicated gestures have high recognition rates.

6. Conclusion

Gesture recognition is currently a very active interaction technique in the human–computer interaction field. Such a technique can be used in various applications, such as virtual games, robot control and other kinds of electronic applications. In this paper, we propose a method that combines vision-based and acceleration-based sensors to recognize gestures. The random visual feature and the
DCT acceleration feature are extracted from the 3D trajectories and the 3D acceleration signals, respectively, to represent the gestures. Penalized maximum likelihood estimation is employed to find a hidden space in which to represent the gestures. The experimental results demonstrate the effectiveness of our algorithm: the recognition rate is improved significantly. Furthermore, the 3D trajectory information enhances the sense of reality when the system is used to play games or virtual sports. However, our gesture set is very simple, and these gestures may not be directly useful in virtual games, so designing a comfortable and natural gesture set that satisfies the needs of virtual games may become another active topic in the gesture recognition field.

Acknowledgments

Special thanks to the Key Laboratory of Robotics and Intelligent System of Guangdong Province (2009A060800016), the Knowledge Innovation Program of the Chinese Academy of Sciences (KGCX2-YW-156, KGCX2-YW-154), the National Natural Science Foundation of China (60806050), the Shenzhen Technology Project (JC200903160416A), the Shenzhen 100 Talent Project and the Shenzhen Nanshan Technology Project (2009016).

References

Ahmed, N., Natarajan, T., Rao, K.R., 1974. Discrete cosine transform. IEEE Trans. Comput. 23 (1).
Alon, J., Athitsos, V., Yuan, Q., Sclaroff, S., 2009. A unified framework for gesture recognition and spatiotemporal gesture segmentation. IEEE Trans. Pattern Anal. Machine Intell. 31, 1685–1699.
Bartels, R.H., Stewart, G.W., 1972. Solution of the matrix equation AX + XB = C [F4]. Commun. ACM 15 (9), 820–826.
Bernardin, K., Ogawara, K., Ikeuchi, K., Dillmann, R., 2005. A sensor fusion approach for recognizing continuous human grasping sequences using hidden Markov models. IEEE Trans. Robotics 21.
Darrell, T., Essa, I., Pentland, A., 1996. Task-specific gesture analysis in real-time using interpolated views. IEEE Trans. Pattern Anal. Machine Intell. 18.
Fitz-Walter, Z., Jones, S., Tjondronegoro, D., 2008. Detecting gesture force peaks for intuitive interaction. In: Proc. 5th Australasian Conf. on Interactive Entertainment, p. 391.
Ho, M., Yanada, Y., Umetani, Y., 2005. An adaptive visual attentive tracker for human communicational behaviors using HMM-based TD learning with new state distinction capability. IEEE Trans. Robot. 21.
Hong, P., Turk, M., Huang, T., 2000. Gesture modeling and recognition using finite state machines. In: Fourth IEEE Internat. Conf. on Automatic Face and Gesture Recognition (FG'00), pp. 410–415.
Huang, W., Li, C., 2008. Gesture stroke recognition using computer vision and linear accelerometer. In: IEEE Internat. Conf. on Automatic Face and Gesture Recognition, pp. 1–6.
Jolliffe, I., 2002. Principal Component Analysis, second ed. Springer Series in Statistics. Springer, NY.
Kang, H., Lee, C.W., Jung, K., 2004. Recognition-based gesture spotting in video games. Pattern Recognition Lett. 25, 1701–1714.
Kreutz-Delgado, K., Murray, J.F., Rao, B.D., Engan, K., Lee, T.-W., Sejnowski, T.J., 2003. Dictionary learning algorithms for sparse representation. Neural Comput. 15 (2), 349–396.
Lee, H.K., Kim, J.H., 1999. An HMM-based threshold model approach for gesture recognition. IEEE Trans. Pattern Anal. Machine Intell. 21 (10), 961–973.
Lee, K., Kim, J., Hong, K., 2007. An implementation of multi-modal game interface based on PDAs. In: Fifth Internat. Conf. on Software Engineering Research, Management and Applications, pp. 759–768.
Li, H., Liu, F., 2009. Image denoising via sparse and redundant representations over learned dictionaries in wavelet domain. In: ICIG '09: Proc. 2009 5th Internat. Conf. on Image and Graphics. IEEE Computer Society, pp. 754–758.
Liu, Y., Li, Y., 2004. A robust hand tracking and gesture recognition method for wearable visual interfaces and its applications. In: Proc. 3rd Internat. Conf. on Image and Graphics, pp. 472–475.
Lv, F., Nevatia, R., 2006. Recognition and segmentation of 3-D human action using HMM and multi-class AdaBoost. In: Proc. 9th Eur. Conf. on Computer Vision, vol. 3954, pp. 359–372.
Mäntyjärvi, J., Kela, J., Korpipää, P., Kallio, S., 2004. Enabling fast and effortless customisation in accelerometer based gesture interaction. In: MUM '04: Proc. 3rd Internat. Conf. on Mobile and Ubiquitous Multimedia, vol. 83, pp. 25–31.
Mitra, S., Acharya, T., 2007. Gesture recognition: A survey. Systems Man Cybernet. Part C: Appl. Rev. 37 (3), 311–324.
Pavlovic, V., Sharma, R., Huang, T.S., 1997. Visual interpretation of hand gestures for human–computer interaction: A review. IEEE Trans. Pattern Anal. Machine Intell. 19 (7), 677–695.
Ramamoorthy, A., Vaswani, N., Chaudhury, S., Banerjee, S., 2003. Recognition of dynamic hand gestures. Pattern Recognition 36, 2069–2081.
Rao, K.R., Yip, P., 1990. Discrete Cosine Transform: Algorithms, Advantages, Applications. Academic Press Professional, Inc., San Diego, CA, USA.
Schlömer, T., Poppinga, B., Henze, N., Boll, S., 2008. Gesture recognition with a Wii controller. In: Internat. Conf. on Tangible and Embedded Interaction, pp. 11–14.
Stiefelhagen, R., Ekenel, H.K., Fugen, C., Gieselmann, P., Holzapfel, H., Kraft, F., Nickel, K., Voit, M., Waibel, A., 2007. Enabling multimodal human–robot interaction for the Karlsruhe humanoid robot. IEEE Trans. Robot. 23 (5), 840–851.
Wu, J., Pan, G., Zhang, D., Qi, G., Li, S., 2009. Gesture recognition with a 3-D accelerometer. In: Proc. 6th Internat. Conf. on Ubiquitous Intelligence and Computing, vol. 5585, pp. 25–38.
Wu, Y., Huang, T., 1999. Vision-based gesture recognition: A review. In: Proc. Internat. Gesture Workshop on Gesture-Based Communication in Human–Computer Interaction, vol. 1379, pp. 103–115.
Yang, H., Park, A.-Y., Lee, S., 2007. Gesture spotting and recognition for human–robot interaction. IEEE Trans. Robot. 23, 256–270.
Yoon, H.-S., Soh, J., Bae, Y.J., Yang, H.S., 2001. Hand gesture recognition using combined features of location, angle and velocity. Pattern Recognition 34, 1491–1501.
Zhang, X., Chen, X., Wang, W., Yang, J., Lantz, V., Wang, K., 2009. Hand gesture recognition and virtual game control based on 3D accelerometer and EMG sensors. In: Proc. 13th Internat. Conf. on Intelligent User Interfaces, pp. 401–406.