Procedia Computer Science 58 (2015) 194-201
Second International Symposium on Computer Vision and the Internet (VisionNet'15)
Stereo Vision-based Hand Gesture Recognition under 3D Environment

Mohidul Alam Laskar, Amlan Jyoti Das, Anjan Kumar Talukdar, Kandarpa Kumar Sarma*
Dept. of Electronics and Communication Engineering, Gauhati University, Assam-781014
Dept. of Electronics and Communication Technology, Gauhati University, Assam-781014
Abstract

Human hand gestures, as a natural way of communicating with computers, are becoming an emerging field of research due to various applications such as sign language recognition, human-computer interaction, gaming and virtual reality. Hand gesture recognition in a 2D environment has limitations, as the information about the third dimension (z-axis) is missed out. Hand gesture recognition in a 3D environment is therefore a growing field of research. In this paper, we propose a novel technique to detect hand gestures with forward and backward movement towards and away from the camera respectively. Our technique is based on stereo vision; we use the movement of a disparity map-based centroid and the change of its intensity as features to recognize the gesture, with a conditional random field (CRF) as classifier. Stereo calibration and rectification are done to obtain the rectified images, and correspondence gives the depth map. We tested our proposed method on Arabic numerals (0-9) and it worked efficiently, with an average recognition rate of 88%.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of the Second International Symposium on Computer Vision and the Internet (VisionNet'15).
Keywords: Stereo vision; Conditional random fields (CRF); Stereo calibration; Stereo matching; Disparity map.
* Corresponding author. Tel.: +0-000-000-0000; fax: +0-000-000-0000. E-mail address: [email protected]
1877-0509. doi: 10.1016/j.procs.2015.08.053
1. Introduction

Vision-based hand gesture recognition is a challenging task but has wide prospects, primarily because of the broad range of hand configurations and aspects [1]. Object identification, tracking and classification, especially of hands, have been investigated extensively by the research community, and the area has a wide range of emerging applications with commercial importance. Further, hand gestures offer a very fitting and natural way of interacting with a machine or a computer. A hand gesture is a spatio-temporal pattern [2] and can be static, dynamic or both [3]. The challenges encountered with 2D recognition have encouraged researchers to study 3D gesture recognition, where the depth information can also be extracted. There are mainly two techniques by which a 3D gesture can be recognized: the Kinect, which uses range vision [1], and stereoscopic vision. The Kinect is a combination of an RGB and an IR sensor [4], whereas two RGB sensors are used in a stereoscopic camera. Applications that could benefit from stereo vision-based hand gesture recognition include pervasive computing, augmented reality and computer gaming; stereo vision techniques can also be used in industrial automation and surveillance systems. In our approach, we use a stereo camera for 3D hand gesture recognition as it is inexpensive and easily available. The hand gestures of front and back movement as well as Arabic numerals (0-9) are taken for implementation.

Hand gesture recognition has been studied extensively and implemented by many researchers in the past decade. Dense stereo algorithms, where the disparity at every image point needs to be calculated, are discussed in [5][6]. As given in [5], a typical stereo algorithm for computing a disparity map has four steps: matching cost computation, cost aggregation in a support region, optimization and computation of disparity, followed by disparity refinement. The approaches adopted in these algorithms are either local or global. Local algorithms, which are window based, employ correlation between the left and right images, where the disparity is computed by a similarity measurement of pixels, i.e. the lowest aggregate matching cost within a window [7]. Global algorithms [8] obtain the disparity map by minimizing a global cost function defined in terms of a smoothness constraint and the image data. Local algorithms are computationally more efficient and robust than global algorithms. A prerequisite is to detect the hand in the scene. In [9], color-based segmentation in HSV color space is utilized, and detailed color-based skin segmentation is discussed in [10]. In [11][12] the face is detected using Haar cascade classifiers. Stereo vision techniques such as stereo calibration, stereo rectification and stereo correspondence are discussed thoroughly in [13][12]. The disparity computation based on a stereo matching algorithm is reported in [14] and is provided in OpenCV. In [15], implementation and evaluation of gesture recognition based on discrete Hidden Markov Models is given. A Conditional Random Field (CRF) implementation for Indian Sign Language recognition is presented in [16], and sign language spotting using CRF is discussed in [17].

2. Proposed Method

The typical stereo vision technique uses two cameras whose relative position is known, and finds the differences between the images observed by the cameras in a manner similar to how the human vision mechanism functions.
Computers do the task of stereo imaging by finding the correspondence between points viewed by one camera and the same points as seen by the other camera. With a known baseline separation between the cameras and the correspondences, the 3D locations of the points can be computed. A brief methodology of the proposed method is shown in Fig. 1.
Fig. 1. Block diagram of proposed method.
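For reference, the triangulation step described above reduces to a standard relation (a textbook result [12], not a contribution of this paper): with focal length $f$ (in pixels), baseline $B$ and disparity $d$, the depth of a point is

$Z = \frac{f B}{d}$

For illustrative (assumed) values $f = 500$ pixels, $B = 6$ cm and $d = 20$ pixels, this gives $Z = (500 \times 0.06)/20 = 1.5$ m; larger disparities correspond to points closer to the cameras.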
2.1. Stereo Calibration

Stereo calibration computes the geometrical relationship between the two cameras. It is a one-time process and generates a set of parameters which are applied to correct distortion in images due to the lens and camera. The mathematical derivation is available in [12][18]. The rotation matrix $R$ and translation vector $T$ can be derived from the equations:

$R = R_r (R_l)^T$  (1)

$T = T_r - R T_l$  (2)

where $R_l$ and $R_r$ are the rotation matrices while $T_l$ and $T_r$ are the translation vectors of the left and right cameras respectively.

2.2. Stereo Rectification

Stereo rectification deals with correction of the individual images so that they appear as if they were taken by two cameras with row-aligned image planes. Two popular algorithms to carry out stereo rectification are Hartley's algorithm and Bouguet's algorithm. Bouguet's algorithm is preferred over Hartley's algorithm as it minimizes the reprojection error because of its greater accuracy [12].

2.3. Stereo Correspondence

Stereo correspondence yields a disparity map. The disparities are differences in the x-coordinates of the same feature on the image planes viewed by the left and right cameras. Based on the horizontal distance between the locations of a point in the two images and a set of predefined constants, a disparity image is generated. Once the physical coordinates of the cameras or the sizes of objects in the scene are known, depth measurements can be derived from the triangulated disparity measures. The block matching algorithm was developed by Kurt Konolige [14]. It was preferred over the graph cut algorithm in view of the fact that it is faster. Strongly matching (high-texture) points between the two images are found. Block matching is a local algorithm which utilizes small "sum of absolute differences" (SAD) windows to locate matching points between the left and right rectified stereo images. The SAD window cost $C(x, y, \delta)$ is mathematically represented as [18]:

$C(x, y, \delta) = \sum_{y=0}^{w_h - 1} \sum_{x=0}^{w_w - 1} \left| I_R(x, y) - I_L(x + \delta, y) \right|$  (3)

where $w_h$ is the window height, $w_w$ is the window width, $\delta$ is the differential (disparity) value, and $I_L(x + \delta, y)$ and $I_R(x, y)$ are the image intensities corresponding to the left and right images.
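To make Equation (3) concrete, the following minimal C++ sketch evaluates the SAD cost for one pixel and one candidate disparity on a rectified grayscale pair. The function name and the assumption that the caller keeps the window inside the image bounds are our own illustrative choices, not the authors' code:

    #include <cstdlib>
    #include <opencv2/core/core.hpp>

    // SAD cost of Equation (3): compare a wh x ww window anchored at (x, y) in the
    // right image against the same window shifted by disparity d in the left image.
    // Both images are 8-bit grayscale and rectified; bounds checking is the caller's job.
    int sadCost(const cv::Mat& left, const cv::Mat& right,
                int x, int y, int d, int wh, int ww)
    {
        int cost = 0;
        for (int dy = 0; dy < wh; ++dy)
            for (int dx = 0; dx < ww; ++dx)
                cost += std::abs(right.at<uchar>(y + dy, x + dx) -
                                 left.at<uchar>(y + dy, x + dx + d));
        return cost; // the disparity with the lowest cost wins within the search range
    }

A full block matcher simply repeats this over all pixels and all candidate disparities, keeping the minimum-cost d at each pixel.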
2.4. Feature Extraction

Feature extraction is a vital stage in our system, and selecting good features to distinguish the hand gesture path plays a major role in the system performance. Location, orientation and velocity are the three basic features. Previous research has shown that the orientation feature gives the best results; therefore, we implement it as the main feature in our system. The concept of moments is very pertinent to continuous gesture trajectory systems. A moment is a gross characteristic of a contour computed by integrating or summing over all of the pixels of the contour. Generally, the $(p, q)$-th moment of a contour is given by [12]:

$m_{p,q} = \sum_{x,y} I(x, y)\, x^p y^q$  (4)
where $I(x, y)$ is the image pixel value at the pixel location $(x, y)$, $p$ is the order of $x$ and $q$ is the order of $y$. From the above equation, we can obtain the centroid $(\bar{x}, \bar{y})$ of the hand/palm, given as:

$(\bar{x}, \bar{y}) = \left( \frac{m_{1,0}}{m_{0,0}}, \frac{m_{0,1}}{m_{0,0}} \right)$  (5)

where $m_{1,0}$ and $m_{0,1}$ are the first-order moments of $x$ and $y$, respectively, and $m_{0,0}$ is the zero-order moment.
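A sketch of Equations (4)-(5) in the paper's OpenCV setting follows; the function and mask names are illustrative assumptions:

    #include <opencv2/imgproc/imgproc.hpp>

    // Centroid of the segmented hand region via image moments, cf. Equations (4)-(5):
    // xbar = m10 / m00, ybar = m01 / m00.
    cv::Point2d handCentroid(const cv::Mat& handMask) // 8-bit binary hand mask
    {
        cv::Moments m = cv::moments(handMask, /*binaryImage=*/true);
        if (m.m00 == 0.0)
            return cv::Point2d(-1.0, -1.0); // no hand pixels in this frame
        return cv::Point2d(m.m10 / m.m00, m.m01 / m.m00);
    }

Collecting this centroid for every frame yields the gesture trajectory used below.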
A gesture path is a spatio-temporal pattern consisting of the centroid points of successive frames. The orientation is therefore determined between two consecutive points of the hand gesture path [2]:
$\theta_t = \arctan\left( \frac{y_{t+1} - y_t}{x_{t+1} - x_t} \right), \quad t = 1, 2, 3, \ldots, T - 1$  (6)
where $T$ represents the length of the gesture path. The orientation $\theta_t$ is divided by 30° in order to quantize the value from 1 to 12, as shown in Fig. 2. Thereby the discrete vector is determined, which is then used as input to the classifier.
Fig. 2. Discrete vector quantization range.
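A small C++ sketch of the quantization in Equation (6) and Fig. 2 is given below; we use atan2 rather than a bare arctangent so that all four quadrants map correctly, which is an implementation choice on our part:

    #include <algorithm>
    #include <cmath>
    #include <opencv2/core/core.hpp>

    // Orientation between two consecutive centroids, quantized into the
    // twelve 30-degree codewords (1..12) of Fig. 2.
    int orientationCode(const cv::Point2d& p0, const cv::Point2d& p1)
    {
        double theta = std::atan2(p1.y - p0.y, p1.x - p0.x); // range (-pi, pi]
        if (theta < 0.0)
            theta += 2.0 * CV_PI;                            // map to [0, 2*pi)
        int code = static_cast<int>(theta / (CV_PI / 6.0)) + 1; // 30-degree bins
        return std::min(code, 12);                           // clamp the theta == 2*pi edge
    }

Applying this to each consecutive centroid pair produces the discrete feature vector fed to the classifier.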
2.5. Conditional Random Field (CRF)

The CRF framework implements conditional probability methods for labeling and segmentation of sequential data. For a given observation sequence, a CRF utilizes a single exponential distribution to model all labels [16]. In the CRF framework, an observation sequence $o = o_1, o_2, o_3, \ldots, o_n$ is associated with the label sequence $w = w_1, w_2, w_3, \ldots, w_n$. The probability of the label sequence $w$ given $o$ is calculated by using a log-linear exponential model. The conditional probability is given by [17]:

$p(w \mid o) = \frac{1}{z(o)} \exp\left( \sum_{j=1}^{p} \alpha_j t_j(w_{i-1}, w_i, o, i) + \sum_{k=1}^{q} \beta_k s_k(w_i, o, i) \right)$  (7)

where $t_j(w_{i-1}, w_i, o, i)$ is a transition feature function of the observation sequence $o$ at positions $i$ and $i-1$, and $s_k(w_i, o, i)$ is a state feature function of the observation sequence $o$ at position $i$. $p$ and $q$ are the total numbers of transition and state feature functions, respectively. $\alpha_j$ and $\beta_k$ are the exponent weights, which are estimated from training data by using the maximum entropy criterion, and $z(o)$ is a normalization factor ensuring that $\sum_w p(w \mid o) = 1$. The normalization factor is given by:

$z(o) = \sum_{w} \exp\left( \sum_{j=1}^{p} \alpha_j t_j(w_{i-1}, w_i, o, i) + \sum_{k=1}^{q} \beta_k s_k(w_i, o, i) \right)$  (8)
3. Implementation

3.1. Pre-processing

The input video is fed from an inexpensive stereo web camera at a frame rate of 10 frames per second and a width × height resolution of 320 × 240. The operator's face is detected with the Viola-Jones algorithm [11] (built into OpenCV) so that it can be excluded, as we only need the hand part, which is our region of interest. Color-based segmentation is used: the frames are converted to HSV space and thresholded to get the binary hand mask. This is followed by morphological closing and opening operations to filter out the noise present and to fill up the holes, thereby giving a precise segmented hand mask. Simultaneously, the RGB frames from the stereo camera are converted to gray frames. The hand mask is then multiplied (logical AND operation) with the gray frames to get an image where only the gray values of the hand are present.
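A condensed C++ sketch of this pre-processing chain is given below; the HSV skin thresholds and the 5 × 5 kernel are illustrative assumptions, since the paper does not report its exact values, and the Viola-Jones face rejection step is omitted for brevity:

    #include <opencv2/imgproc/imgproc.hpp>

    // HSV skin segmentation, morphological clean-up, and masking of the gray frame.
    cv::Mat segmentHand(const cv::Mat& frameBgr)
    {
        cv::Mat hsv, mask, gray, handGray;
        cv::cvtColor(frameBgr, hsv, cv::COLOR_BGR2HSV);
        cv::inRange(hsv, cv::Scalar(0, 48, 80), cv::Scalar(20, 255, 255), mask); // assumed skin range
        cv::Mat k = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
        cv::morphologyEx(mask, mask, cv::MORPH_CLOSE, k); // fill holes in the hand blob
        cv::morphologyEx(mask, mask, cv::MORPH_OPEN, k);  // remove small noise specks
        cv::cvtColor(frameBgr, gray, cv::COLOR_BGR2GRAY);
        gray.copyTo(handGray, mask); // logical AND: keep gray values only on the hand
        return handGray;
    }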
3.2. Stereo Matching

The frames obtained from the pre-processing steps are not rectified; rectification is therefore necessary, as it removes the vertical disparity present and makes both image frames row-aligned. Images of a 9 × 6 chessboard are taken to compute the camera calibration: a total of 30 different pairs of images are captured with the chessboard held at different inclinations. For each pair, the rotation and translation matrices are calculated using the cvStereoCalibrate() function, and the generated matrices are then approximated into one (R, T) pair with minimum error over the chessboard corners in the left and right camera views. The cvStereoRectify() function, corresponding to Bouguet's algorithm, is used, and its output is fed to cvInitUndistortRectifyMap(), which completes the stereo rectification of the left and right images. These calibration matrices are saved as .xml files for further use, as calibration is a one-time operation. Stereo matching is done by the block matching technique to get the disparity map of the hand. The function cvFindStereoCorrespondenceBM() computes the disparity map. The block matching parameters used in experimentation are SADWindowSize=19, preFilterSize=19, minDisparity=0, numberOfDisparities=128, textureThreshold=12 and uniquenessRatio=3.

3.3. Post-processing

The post-processing deals with refinement of the disparity map. It is first normalized and then smoothed by a 5 × 5 median filter and also by a Gaussian filter. Further, a morphological grayscale closing operation is performed with a rectangular structuring element of size 11 × 11 to make the disparity map more continuous. A sketch of this pipeline is given after this section.

3.4. Gesture Recognition

Front and back gesture recognition: The output of the post-processing steps is a disparity map of the hand. The front and back gesture is detected by taking account of the change of the average pixel value of the disparity map (as shown in Fig. 5(e)) with front and back movements.

Arabic numerals recognition: The position and centroid of the hand disparity are found using the moment calculation. The gesture trajectory path is obtained by calculating the centroids of all successive frames and then joining them with lines. The orientation feature is extracted from the trajectory path as given in Equation (6). Vector quantization is performed, with the orientation features quantized to discrete vectors of 1 to 12 as shown in Fig. 2. The implementation block diagram is given in Fig. 3. The CRF classifier is trained using a one-dimensional feature vector for all gestures, followed by their recognition. In this paper we have taken the gestures of Arabic numerals (0-9) for classification and recognition using the CRF++ toolkit.
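The rectification and matching steps of Sections 3.2-3.3 map directly onto OpenCV. The paper uses the legacy C interface (cvStereoRectify(), cvInitUndistortRectifyMap(), cvFindStereoCorrespondenceBM()); the sketch below reproduces the same pipeline and the reported block-matching parameters with the modern C++ API, so the function name and the omitted calibration loading are our assumptions:

    #include <opencv2/calib3d/calib3d.hpp>
    #include <opencv2/imgproc/imgproc.hpp>

    // Rectify a grayscale stereo pair and compute the refined hand disparity map.
    // Intrinsics (M1, D1, M2, D2) and extrinsics (R, T) come from the saved
    // one-time stereo calibration (the .xml files in the paper).
    cv::Mat handDisparity(const cv::Mat& grayL, const cv::Mat& grayR,
                          const cv::Mat& M1, const cv::Mat& D1,
                          const cv::Mat& M2, const cv::Mat& D2,
                          const cv::Mat& R, const cv::Mat& T)
    {
        cv::Size sz = grayL.size();
        cv::Mat R1, R2, P1, P2, Q, m1L, m2L, m1R, m2R, rectL, rectR;
        cv::stereoRectify(M1, D1, M2, D2, sz, R, T, R1, R2, P1, P2, Q); // Bouguet's method
        cv::initUndistortRectifyMap(M1, D1, R1, P1, sz, CV_16SC2, m1L, m2L);
        cv::initUndistortRectifyMap(M2, D2, R2, P2, sz, CV_16SC2, m1R, m2R);
        cv::remap(grayL, rectL, m1L, m2L, cv::INTER_LINEAR);
        cv::remap(grayR, rectR, m1R, m2R, cv::INTER_LINEAR);

        // Block matching with the parameters reported in Section 3.2.
        cv::Ptr<cv::StereoBM> bm = cv::StereoBM::create(128, 19); // numberOfDisparities, SADWindowSize
        bm->setPreFilterSize(19);
        bm->setMinDisparity(0);
        bm->setTextureThreshold(12);
        bm->setUniquenessRatio(3);
        cv::Mat disp16, disp8;
        bm->compute(rectL, rectR, disp16);                        // 16-bit fixed-point disparities
        cv::normalize(disp16, disp8, 0, 255, cv::NORM_MINMAX, CV_8U);

        // Refinement as in Section 3.3: median and Gaussian smoothing, then closing.
        cv::medianBlur(disp8, disp8, 5);
        cv::GaussianBlur(disp8, disp8, cv::Size(5, 5), 0);
        cv::morphologyEx(disp8, disp8, cv::MORPH_CLOSE,
                         cv::getStructuringElement(cv::MORPH_RECT, cv::Size(11, 11)));
        return disp8;
    }

In practice the rectification maps would be computed once and reused for every frame; they are rebuilt here only to keep the sketch self-contained.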
Fig. 3. Block diagram of feature extraction and classification.
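As an illustration of how the quantized codeword sequences feed into CRF++ (the file contents below are invented examples, not the authors' data): each line of a training file holds the observation column followed by the gesture label, with a blank line separating sequences. A training file (train.data) might read:

    3   eight
    4   eight
    6   eight

    1   two
    12  two

A minimal template file declares unigram features over the previous, current and next codeword:

    U00:%x[-1,0]
    U01:%x[0,0]
    U02:%x[1,0]
    B

Training and labeling are then done with crf_learn template train.data model and crf_test -m model test.data.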
4. Results

The input frames from the left and right cameras are taken as RGB images, shown in Fig. 5(a) and Fig. 5(b) respectively. These images are not row-aligned; thus calibration and rectification are done to make them row-aligned. The rectified left and right chessboard images are shown in Fig. 4. Fig. 5(c) shows the color-based thresholded image, which corresponds to a hand mask. The resulting image of the AND operation, which corresponds to a gray skin tone of the hand, is shown in Fig. 5(d). The gesture trajectory path for the gesture "eight" is given in Fig. 5(f).
Fig. 4. Rectified left and right chessboard images.
Fig. 5. (a) RGB image of left camera; (b) RGB image of right camera; (c) Segmented hand; (d) Skin tone gray image; (e) Disparity map of hand; (f) Gesture trajectory path of "eight".
The change in intensity value with the distance between the hand and the camera for both the "front" gesture and the "back" gesture, for 10 samples, is shown in Fig. 6(a) and Fig. 7(a), with the average signatures shown in Fig. 6(b) and Fig. 7(b) respectively. The recognition rate (RR) is the ratio of the number of correctly recognized gestures to the number of tested gestures and is given by:
$RR = \frac{RG}{TG} \times 100$  (9)
where RG is the total number of recognized gestures and TG is the total number of test gestures. For recognition of the Arabic numerals we have taken 15 video pairs of each isolated gesture for training the system. Testing is done one by one for all 10 gestures with 15 different pairs per gesture, so our database consists of a total of 300 video pairs. The entire software is implemented using the OpenCV 2.0 library in C/C++ in Visual Studio 2010 on a PC with an Intel Pentium B970 2.30 GHz CPU and 4 GB RAM running Windows 7 Ultimate. The stereo video frames are captured using a Minoru 3D webcam.
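As a quick check against Table 1, a gesture recognized correctly in 14 of its 15 test samples gives $RR = \frac{14}{15} \times 100 = 93.33\%$, and averaging the ten per-gesture rates in the table yields the overall 88% reported in the abstract.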
Fig. 6. (a) Sample signature for "front" gesture; (b) Average signature for "front" gesture.
Fig. 7. (a) Sample signature for "back" gesture; (b) Average signature for "back" gesture.
The result of the testing for recognition is given in Table 1.

Table 1. Recognition rate of different gestures.

Input Gesture   Test Samples   Correct Classification   Misclassification   Recognition Rate (%)
One             15             14                       1                   93.33
Two             15             13                       2                   86.67
Three           15             13                       2                   86.67
Four            15             13                       2                   86.67
Five            15             14                       1                   93.33
Six             15             13                       2                   86.67
Seven           15             12                       3                   80.00
Eight           15             14                       1                   93.33
Nine            15             13                       2                   86.67
Zero            15             13                       2                   86.67
5. Conclusion

In this paper we have proposed a stereo vision-based hand gesture recognition method for a 3D environment, covering front and back gestures as well as Arabic numerals (0-9). Color-based segmentation is implemented to segment the hand from the background, and stereo vision techniques are applied to obtain the disparity map. The centroid of the hand is found from the disparity map, the orientation feature is extracted from the trajectory of the centroid, and the gestures are trained and recognized using a CRF classifier. The front and back moving gestures are detected from the change of the average intensity of the disparity map. With the results obtained so far, it can be concluded that the proposed method is robust and can be used in a gesture recognition system.

Acknowledgements

The authors are thankful to the Ministry of Communication and Information Technology, Govt. of India for facilitating the research.

References

1. Zhang C, Yang X, Tian Y. Histogram of 3D Facets: A Characteristic Descriptor for Hand Gesture Recognition. 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). p. 1-8. Shanghai, China. April 2013.
2. Elmezain M, Hamadi A A, Michaelis B. Improving Hand Gesture Recognition Using 3D Combined Features. Second International Conference on Machine Vision. p. 128-132. Dubai, UAE. Dec. 2009.
3. Just A, Bernier O, Marcel S. Recognition of Isolated Complex Mono- and Bi-Manual 3D Hand Gestures. Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition. p. 571-576. May 2004.
4. Han J, Shao L, Xu D, Shotton J. Enhanced Computer Vision with Microsoft Kinect Sensor: A Review. IEEE Transactions on Cybernetics. vol. 43. issue 5. p. 1318-1334. Oct. 2013.
5. Scharstein D, Szeliski R. A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms. International Journal of Computer Vision. vol. 47. issue 1-3. p. 7-42. April 2002.
6. Tombari F, Mattoccia S, Stefano L D. Classification and evaluation of cost aggregation methods for stereo correspondence. IEEE Conference on Computer Vision and Pattern Recognition. p. 1-8. Anchorage, AK. June 2008.
7. Adhyapak S A, Kehtarnavaz N, Nadin M. Stereo matching via selective multiple windows. Journal of Electronic Imaging. vol. 16. issue 1. p. 013012 1-14. Jan. 2007.
8. Kolmogorov V, Zabih R. Computing Visual Correspondence with Occlusions via Graph Cuts. Proceedings of the Eighth IEEE International Conference on Computer Vision. vol. 2. p. 508-515. 2001.
9. Rautaray S S, Agrawal A. Vision based hand gesture recognition for human computer interaction: a survey. Artificial Intelligence Review. Springer Netherlands. Nov. 2012.
10. Phung S L, Bouzerdoum A, Chai D. Skin Segmentation Using Color Pixel Classification: Analysis and Comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence. vol. 27. issue 1. p. 148-154. Jan. 2005.
11. Viola P, Jones M. Rapid object detection using a boosted cascade of simple features. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. vol. 1. p. 511-518. 2001.
12. Bradski G, Kaehler A. Learning OpenCV: Computer Vision with the OpenCV Library. O'Reilly Media Inc. 1005 Gravenstein Highway North, Sebastopol, CA 95472; 2008.
13. Cyganek B, Siebert J P. An Introduction to 3D Computer Vision Techniques and Algorithms. John Wiley & Sons Ltd. New Jersey; 2009.
14. Konolige K. Projected Texture Stereo. IEEE International Conference on Robotics and Automation (ICRA). p. 148-155. Anchorage, AK. May 2010.
15. Dennemont Y, Bouyer G, Otmane S, Mallem M. A Discrete Hidden Markov Models Recognition Module for Temporal Series: Application to Real-Time 3D Hand Gestures. 3rd International Conference on Image Processing Theory, Tools and Applications (IPTA). p. 299-304. Istanbul. Oct. 2012.
16. Choudhury A, Talukdar A K, Sarma K K. A Conditional Random Field based Indian Sign Language Recognition System under Complex Background. Fourth International Conference on Communication Systems and Network Technologies (CSNT). p. 900-904. Bhopal. April 2014.
17. Yang H D, Sclaroff S, Lee S W. Sign Language Spotting with a Threshold Model Based on Conditional Random Fields. IEEE Transactions on Pattern Analysis and Machine Intelligence. vol. 31. issue 7. p. 1264-1277. July 2009.
18. Shenoy V, Bongale H P, Roy V, David S. Stereovision based 3D Hand Gesture Recognition for Pervasive Computing Applications. 9th International Conference on Information, Communications and Signal Processing (ICICS). p. 1-5. Tainan. Dec. 2013.