Journal of Network and Computer Applications 33 (2010) 447–466
Real-time hands, face and facial features detection and tracking: Application to cognitive rehabilitation tests monitoring

D. González-Ortega, F.J. Díaz-Pernas, M. Martínez-Zarzuela, M. Antón-Rodríguez, J.F. Díez-Higuera, D. Boto-Giralda

Department of Signal Theory, Communications and Telematics Engineering, Telecommunications Engineering School, University of Valladolid, Campus Miguel Delibes, Valladolid 47011, Spain
Article history: Received 25 April 2009; received in revised form 6 November 2009; accepted 4 February 2010

Abstract
In this paper, a marker-free computer vision system for cognitive rehabilitation tests monitoring is presented. The system monitors and analyzes the correct and incorrect realization of a set of psychomotricity exercises in which a hand has to touch a facial feature. The monitoring requires the detection and tracking of different human body parts. Detection of the face, eyes, nose, and hands is achieved with a set of classifiers built independently based on the AdaBoost algorithm. Comparisons with other detection approaches, regarding performance and applicability to the monitoring system, are presented. Face and hands tracking is accomplished through the CAMShift algorithm with independent and adaptive two-dimensional histograms of the chromaticity components of the TSL color space for the pixels inside these three regions. The TSL color space was selected after a study of five color spaces regarding skin color characterization. The system is easily implemented with a consumer-grade computer and a camera, with unconstrained background and illumination, and runs at more than 23 frames per second. The system was tested and achieved a successful monitoring percentage of 97.62%. The automation of human body part motion monitoring, its analysis in relation to the psychomotricity exercise indicated to the patient, and the storage of the results of a set of exercises free the rehabilitation experts from such demanding tasks. The vision-based system is potentially applicable to other human–computer interface tasks with minor changes. © 2010 Elsevier Ltd. All rights reserved.
Keywords: Human–computer interaction; Cognitive rehabilitation; Human body parts detection and tracking; AdaBoost; CAMShift; TSL color space
1. Introduction

Recent intensive research in the human–computer interaction field has made possible the development of universally accessible systems that can be used by the physically or cognitively handicapped (Obrenovic et al., 2007; Turk and Kölsch, 2004). Particularly, computer vision-based human–computer interfaces (HCI) have been proposed (Magee et al., 2008; Morris and Chauhan, 2006; Varona et al., 2008) so that the handicapped can use computers. Rehabilitation is a long and arduous process and needs clinician specialists and appealing tools to be facilitated. Although several clinical decision-support systems include computer vision techniques to help health professionals (Musen et al., 2001), and cognitive rehabilitation is one of the most important means of dealing with alterations in cognitive processes, there are few medical centers with cognitive rehabilitation professionals and programs. The small number of experienced clinicians and the demanding temporal availability are the main reasons for this
scarcity, which is aggravated by the intensity and regularity that cognitive rehabilitation therapies need to be successful (Zhou and Hu, 2008). For these reasons, the motivation for developing systems that let patients with cognitive handicaps improve their lives is strong. We focused our efforts on developing a system to monitor cognitive rehabilitation tests and help patients in their rehabilitation. Few computer vision approaches to help patients with the rehabilitation of physical or cognitive handicaps have been presented. Daly and Wolpaw (2008) stated the potential of brain–computer interface (BCI) technology to help improve the quality of life and to restore functions for people with severe motor disabilities. Lin et al. (2004) demonstrated the effectiveness of an eye-tracking device to play a computer game in the rehabilitation of eye movement dysfunction. da Costa and de Carvalho (2004) and Edmans et al. (2006) showed positive results regarding the use of a virtual reality device for stroke and cognitive rehabilitation, respectively. Rand et al. (2004) studied the rehabilitation potential of the low-cost PS2 video game console and a camera. Despite its limitations regarding motion monitoring and recording, it was considered a valuable intervention tool during the rehabilitation of stroke patients and those with other
neurological disorders. Zhou and Hu (2008) reviewed human motion tracking systems for rehabilitation, considering six issues to assess them: cost, size, weight, function, operation, and automation. Among these systems, marker-free visual systems were highlighted because they can achieve reduced restriction, robust performance, and low cost. The main contribution of this work is the development of an unprecedented marker-free computer vision system for cognitive rehabilitation. The system was conceived to be included in GRADIOR (Franco-Martín et al., 2000), a cognitive rehabilitation platform developed by the INTRAS Foundation (INTRAS Foundation, 2009) and supported by the Spanish Ministries of Education and Science and Innovation. GRADIOR includes approximately 15,000 rehabilitation exercises of different cognitive functions, such as attention, perception, memory, calculation, and language, and was assessed satisfactorily by rehabilitation experts and users (INTRAS Foundation, 2009). Our vision-based system monitors and analyzes cognitive tests consisting of a set of exercises in which a hand, either the right or the left, has to touch the head or one facial feature among the right eye, left eye, nose, right ear, and left ear. The system achieves real-time operation in an uncontrolled environment through the combination of detection and adaptive tracking of multiple human body parts: face, facial features, and hands. Two other contributions of this work, necessary to develop the system, are the implementation of independent real-time hand and facial feature detectors and their validation for integration in the system, and the real-time adaptive and independent tracking of the face and hands based on the chromaticity components of the TSL color space, selected after a comparative study of five color spaces for skin color characterization. The rest of the paper is organized as follows. In Section 2, we present the state of the art regarding skin color characterization and human body part detection and tracking to frame the system. In Section 3, the skin color modeling, the developed facial feature detection modules, and the face and hands tracking modules, necessary to implement the vision-based system, are described. The proposed system is presented in Section 4, including its environment, limitations, monitoring of the psychomotricity exercises, and performance evaluation. Finally, Section 5 draws the conclusions about the system.
2. Related work

In this section, we present the state of the art regarding the tasks that the proposed system has to accomplish. First, skin color characterization is described, as many computer vision systems based on human body parts include skin color filtering as a preprocessing stage. Then, different human body part detection and tracking approaches are presented. Finally, as the proposed system has to monitor psychomotricity exercises in which a hand has to touch a facial feature, two approaches in the literature that address the occlusion between a hand and the face are explained.

2.1. Skin color characterization

Color is a low-level feature that is highly discriminative, computationally fast, and robust to geometric changes, and can be applied to the characterization of human body parts. Many studies evaluating color spaces for skin detection have been carried out (Kakumanu et al., 2007; Phung et al., 2005). Color can be decomposed into three different components: one luminosity and two chromaticity components. Although skin color can change notably from one person to another, or even within the same person, due to
factors such as a suntan, a blush, etc., several studies have shown that skin colors have a certain invariance in their chromaticity components, even across people of different ethnic groups (Fu Jie Huang and Tsuhan Chen, 2000; Hunke and Waibel, 1994; Yang and Ahuja, 1998). Other factors, such as lighting or skin tone, affect mainly the luminosity component. If a large number of images are taken and their skin regions are analyzed, skin colors are concentrated in a well-defined area of the histogram of the chromaticity components of these colors. The distribution of skin colors can be modeled by a Gaussian distribution (Hunke and Waibel, 1994; Yang and Ahuja, 1998). However, illumination affects skin color drastically (Marszalec et al., 2000). Other factors causing skin color changes in images, although smaller than those due to illumination, are differences in the spectral sensitivities of cameras, interreflections, and the parameters and settings of the camera. Such factors have motivated the use of adaptive color models in human body part tracking. For example, Bradski (1998) used the hue channel of the HSV color space as the feature to track the face.
2.2. Face, facial features, and hand detection

Face detection approaches can be classified into two different groups: feature-based and appearance-based. Feature-based approaches search for invariant features present in faces regardless of the pose, viewpoint, or illumination conditions. Appearance-based approaches capture the facial appearance from a set of training images (Yang et al., 2002). Most face detection algorithms are appearance-based (Li and Jain, 2005). Schneiderman and Kanade (2000) proposed a face and non-face classifier through statistics of products of histograms computed from face and non-face examples with AdaBoost learning. Rowley et al. (1998) presented a neural-based classifier trained using preprocessed and normalized face and non-face subwindows. Both approaches achieve a high detection rate but process no more than one frame per second (fps), thus being far from real-time requirements. Viola and Jones (2004) built a fast, robust face detection system using AdaBoost learning (Freund and Schapire, 1996, 1997) to construct a two-class nonlinear classifier. Their system was the first real-time frontal view face detector, processing 15 fps. The important properties of AdaBoost learning (Li and Jain, 2005) made us select it as the basis of our approach. Our system uses an AdaBoost-based classifier to detect faces. Besides face detection, the system needs to know the location of the facial features to monitor the psychomotricity exercises. Facial feature detection, such as eye and nose detection, has usually been accomplished as a step applied to subsequent face detection or to face alignment prior to face recognition. Eye detection with passive illumination has been achieved through image gradients (Kothari and Mitchell, 1996), projection functions (Zhou and Geng, 2004), and templates (Kawaguchi et al., 2000), although heuristics and postprocessing are usually necessary to remove false detections. Moreover, these features are sensitive to image noise. Our application needs to detect the eyes while a hand occludes the face. Song et al. (2006) and Wang and Ji (2007) proposed eye detection methods that achieve successful detection rates of 96.8% and 99% in face images, respectively. Although the two methods have satisfactory results, neither of them is applicable to our system because both assume that the eyes are present in the face and try to locate them accurately. Our system needs an eye detector that detects the eyes in the face but does not report an eye when it is occluded by a hand. Asteriadis et al. (2009) proposed a method for eye detection and eye center localization through the distance vector field of the face. The
distance vector field is extracted from the edge map by assigning to every facial image pixel a vector pointing to the closest edge pixel. Although the method achieved an overall detection rate of 96%, it requires a non-occluded face, whereas our system needs to detect the eyes while a hand occludes the face. Nose detection methods are mainly based on characteristic points such as the nostrils and the tip of the nose. Xu et al. (2006) proposed a robust detector based on 3D facial data that achieves a correct detection rate of 99.3%. This nose detector is not applicable to our system for the same reasons as the eye detectors proposed in Song et al. (2006) and Wang and Ji (2007). Moreover, it takes 0.4 s per image, far from the real-time requirements of our system. Although the eyes and nose carry less structural information than the face, some detection methods use a two-class classifier applied after a previous face detection (Fasel et al., 2005; Wilson and Fernandez, 2006; Yong et al., 2004), both detections based on the AdaBoost algorithm. The system presented in this paper uses AdaBoost-based classifiers to detect the eyes and nose inside the region of interest (ROI) containing a face. Due to its high variability, hand detection is much more difficult than face and facial feature detection. Actually, to our knowledge, a pose-invariant hand detection system that does not use temporal information has not been presented yet, although it could play a major role in an HCI. Many hand detection methods use skin color (Jintae Lee and Kunii, 1995) or geometric features such as lines and contours (Kuch and Huang, 1995). Detection based on AdaBoost has been proposed, although several detectors for different restricted positions are needed and detection rates are usually below 75% (Barczak and Dadgostar, 2005). In our system, just as for the facial features, hand detection is achieved with AdaBoost-based classifiers, so an explanation of the AdaBoost algorithm is presented below.

2.3. AdaBoost algorithm for object detection

AdaBoost belongs to the family of Boosting algorithms, which are machine learning algorithms that combine weak classifiers, only slightly better than random guessing in a two-class classification problem, to form a single strong classifier with better accuracy. A two-class classifier can be formed through a set of positive and negative samples, a set of features, and the AdaBoost algorithm. Each weak classifier is built using a thresholded simple feature. The final strong classifier is a weighted linear combination of a set of weak classifiers, where the weights are inversely related to the training error (Viola and Jones, 2004). Viola and Jones (2001) proposed 4 basic types of scalar features for building face/non-face classifiers that were extended by Lienhart et al. (2003) to deal with more varied objects. These features are known as Haar wavelet-like features due to their resemblance to the Haar transform and to early features of the human visual system such as center–surround and directional responses. Such features are placed in an image subwindow and a scalar is calculated by summing up the pixels in the white region and subtracting those in the dark region of each feature. These scalar numbers form an overcomplete feature set for the intrinsically low-dimensional face pattern. These features are efficient because face detectors can be constructed based on them and they can be computed rapidly using the integral image technique (Viola and Jones, 2001).
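To make the combination concrete, the following sketch (our own simplified rendition in Python, not the authors' implementation; the feature callables and their responses are assumed to be computed elsewhere) shows how thresholded Haar-like feature responses are combined into a boosted strong classifier in the Viola–Jones style:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class WeakClassifier:
    feature: Callable[[object], float]  # returns the scalar value of one Haar-like feature
    threshold: float
    polarity: int                       # +1 or -1, orients the inequality
    alpha: float                        # weight, larger for weak classifiers with lower training error

    def predict(self, window) -> int:
        # 1 (object) if the thresholded feature fires, 0 (non-object) otherwise
        return 1 if self.polarity * self.feature(window) < self.polarity * self.threshold else 0

def strong_classify(window, weak_classifiers: List[WeakClassifier]) -> bool:
    """Weighted linear combination of weak decisions, as in Viola and Jones (2004)."""
    score = sum(wc.alpha * wc.predict(window) for wc in weak_classifiers)
    # The usual decision rule compares the score with half the total weight.
    return score >= 0.5 * sum(wc.alpha for wc in weak_classifiers)
```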
The integral image II(x, y) at location (x, y) contains the sum of the pixels of the grayscale image I(x, y) above and to the left of (x, y). It is defined in Viola and Jones (2001) and expressed as

II(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y')    (1)
Using II(x, y), any rectangular sum can be computed in four array references, thus leading to enormous savings in calculating features at varying locations and scales. To make a translation and scale invariant object search with a detector, an image has to be sampled in subwindows varying in position and size, from the minimum size used in the training stage to the image size, with a constant scaling factor, e.g. 1.1, between adjacent subwindow dimensions. Each subwindow enters the detector. Taking into account the fundamental rarity of the face class in relation to the background class among all the sampled subwindows in an image, Viola and Jones (2004) proposed a cascade of strong classifiers to make the face/non-face classification in grayscale images. A boosted strong classifier effectively eliminates a large portion of non-face subwindows while maintaining a high detection rate. Nonetheless, a single strong classifier may not meet the requirement of an extremely low false alarm rate. The cascade of strong classifiers arbitrates among several classifiers using the AND operation. This way, a subwindow enters the next strong classifier for further classification only if it has passed all the previous strong classifiers as the face pattern. This strategy can significantly speed up the detection and reduce false alarms, with a small sacrifice in the detection rate. Given a cascade of k strong classifiers with detection rate d_i and false positive rate f_i each, the overall detection rate D and false detection rate F of the cascade are expressed in Eq. (2). Given concrete goals for D and F, target rates can be determined for each stage in the cascade. Each classifier in the cascade is trained using bootstrapped non-face examples that passed through the previously trained cascade.

D = \prod_{i=1}^{k} d_i \quad \text{and} \quad F = \prod_{i=1}^{k} f_i    (2)
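A minimal NumPy sketch of the integral image of Eq. (1) and the four-array-reference rectangle sum described above (function names are ours):

```python
import numpy as np

def integral_image(img):
    """II(x, y): sum of all pixels above and to the left of (x, y), inclusive (Eq. (1)).
    A zero row and column are prepended so rectangle sums need no boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle with top-left corner (x, y), using four array references."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]
```

For example, per Eq. (2), a cascade of 20 stages each with d_i = 0.995 and f_i = 0.5 would yield D ≈ 0.995^20 ≈ 0.90 and F = 0.5^20 ≈ 9.5 × 10⁻⁷; these are illustrative numbers, not the rates of the detectors trained in this work.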
We used the Viola and Jones approach, i.e. a cascade of strong classifiers built with the AdaBoost algorithm and Haar-like features, to achieve face, facial features, and hands detection. The Haar-like features selected for building the classifiers were not only the 4 basic types of scalar features proposed by Viola and Jones (2001) but also the rotated features proposed by Lienhart et al. (2003). These rotated features adapt to the detection of more varied objects better than the basic features alone.

2.4. Face and hand tracking

Tracking of non-rigid human body parts such as the face and hands can be mainly classified into 2D and 3D methods (Shan et al., 2007). Regarding hands, although 2D tracking can only track global motion, many existing applications are based on 2D approaches because they are more computationally efficient for real-time tasks (Shan et al., 2007). Among them, the particle filter (Isard and Blake, 1998), optical flow, and Mean shift (Comaniciu and Meer, 2002; Yizong Cheng, 1995) have been widely used. The particle filter (Isard and Blake, 1998) is a technique for implementing a recursive Bayesian filter by Monte Carlo simulations. In the particle filter, the required posterior density is approximated by a weighted particle set \{s_t^{(n)}, \pi_t^{(n)}\}_{n=1}^{N} at each time t. Each particle s_t^{(n)} represents one hypothetical state of the object. To capture the variations in state-space, a large number of particles is necessary, which leads to a high computational cost, far from real-time requirements. Optical flow tracking is based on the assumption that the brightness at every point of an object does not change in time (Horn and Schunck, 1981). The pyramidal implementation of Lucas–Kanade optical flow tracking is convenient for real-time tracking as it requires little computation (Lucas and Kanade, 1981). The most suitable points to fulfill object tracking need to lie at image positions with strong intensity changes in two directions. For example,
these points could be the nostrils or pupils to achieve face tracking. However, in our application the hands can occlude the nose or the eyes. This occlusion would cause the loss of these optical flow tracking points. Mean shift is a non-parametric density gradient estimation approach to local mode seeking that has been adopted as an efficient technique for object tracking (Comaniciu et al., 2003). To fulfill object tracking, the Mean shift algorithm needs a probability distribution image, which can be determined using any method that associates each pixel with its probability of belonging to the object. A common method is histogram backprojection, introduced by Swain and Ballard (1991), which uses one or several representative features of the object. An initial histogram is obtained with the selected features from the image subwindow where the object is. For the next frame in the image sequence, a histogram backprojection image is created. The backprojection of an image in a histogram is a primitive operation that associates each pixel in the image with the histogram bin that includes the value of the selected pixel features. The backprojection of an image generates a probability distribution image where the value of each pixel depends on the probability that the pixel belongs to the object modeled by the histogram. To track an object, the Mean shift algorithm sets up an initial location and size of the search window W, which is usually the location of the object in the previous frame. The center of mass (x_c, y_c) within the search window W in the backprojection image is calculated using Eq. (3), where M_{ab} is the (a+b)th moment as defined in Eq. (4). In Eq. (4), P(I_{xy} | O_f) represents the probability of a pixel (x, y) within W being part of the object characterized by the selected features, i.e. the value of the pixel in the backprojection image. The search window is re-centered at the center of mass, and the center of mass is recalculated in the new search window, until the variation of the center of mass in both the x and y coordinates is less than a fixed number of pixels.

x_c = M_{10}/M_{00} \quad \text{and} \quad y_c = M_{01}/M_{00}    (3)

M_{ab}(W) = \sum_{x \in W} \sum_{y \in W} x^a y^b P(I_{xy} \mid O_f)    (4)
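As an illustration of Eqs. (3) and (4), a simple NumPy sketch of the Mean shift window update on a backprojection image could look as follows (a simplified version with a rectangular, non-adaptive window; the function name and convergence threshold are ours):

```python
import numpy as np

def mean_shift(backproj, window, max_iter=20, eps=1.0):
    """Iterate Eqs. (3) and (4) on a backprojection image until the window center converges.

    backproj : 2D array, each value is P(I_xy | O_f) for the tracked object
    window   : (x, y, w, h) initial search window, assumed to lie inside the image
    """
    x, y, w, h = window
    for _ in range(max_iter):
        patch = backproj[y:y + h, x:x + w].astype(np.float64)
        m00 = patch.sum()                      # M00
        if m00 == 0:
            break
        ys, xs = np.mgrid[0:h, 0:w]
        xc = (xs * patch).sum() / m00          # M10 / M00
        yc = (ys * patch).sum() / m00          # M01 / M00
        # Shift the window so that it is centered on the new center of mass.
        new_x = int(round(x + xc - w / 2))
        new_y = int(round(y + yc - h / 2))
        if abs(new_x - x) < eps and abs(new_y - y) < eps:
            break
        # Clamp to the image so the next slice stays valid.
        x = max(0, min(new_x, backproj.shape[1] - w))
        y = max(0, min(new_y, backproj.shape[0] - h))
    return x, y, w, h
```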
The CAMShift algorithm (Bradski, 1998) is an adaptation of Mean shift that uses continuously adaptive probability distributions, i.e. distributions that are recomputed for each frame, and adapts the search window size. Like Mean shift, CAMShift works on features of the tracked object and is robust to temporal variations of these features (Bradski, 1998). Using the CAMShift algorithm and a proper model to characterize skin color can lead to robust real-time human body part tracking that overcomes the temporal skin color variations. We have studied five color spaces to select the best color features for CAMShift-based human body part tracking. The proposed system accomplishes independent CAMShift-based tracking of the face, right hand, and left hand with the two chromaticity components of the TSL color space (Terrillon et al., 1998), selected from experimental results.

2.5. Occlusion between hand and face

To our knowledge, no system to monitor the psychomotricity exercises that will be presented in Section 4.1 has been proposed in the literature. On the other hand, some methods addressing the occlusion between a hand and the face have been presented. This occlusion has to be addressed by our system. Holden et al. (2005) proposed a vision-based Australian sign language recognition system that deals with this occlusion by detecting the hand contour. It assumes that the face is static. The detection of the hand contour is addressed by a combination of motion cues and
the snake algorithm (Williams and Shah, 1991), which is an energy minimization technique. The system achieved a high recognition rate but with constrained occlusions between the hand and the face, resulting from touching an ear or the chin with a hand. Its performance would degrade with larger occlusions, such as those resulting from touching an eye or the nose with a hand. Smith et al. (2007) proposed a method to deal with larger occlusions between a hand and the face through the image force field, which is able to measure the regional structure changes in the occluded region during occlusion, while the regional structure remains relatively constant elsewhere in the image. This method takes 15 s per frame, very far from real-time requirements. Our system achieves high performance in psychomotricity test monitoring with unconstrained occlusions between a hand and the face through the combination of AdaBoost-based human body part detection and CAMShift-based tracking.
3. Our approach

The proposed tests monitoring system needed the study of skin color in different color spaces. This study led to the selection of the best features to achieve human body part tracking. Moreover, a LUT (Look-Up Table) was created to perform skin color filtering. The system also needed the development of face, eyes, nose, and hand detection modules and of face and hand tracking modules.

3.1. Skin color modeling: skin color LUT

With the aim of characterizing skin color, we studied five color spaces widely used in image processing (Kakumanu et al., 2007; Phung et al., 2005): normalized rgb, YIQ, IHS, CIELUV, and TSL. The first four spaces are widely known and used. In the TSL color space (Terrillon et al., 1998), a color is specified in terms of tint (T), saturation (S), and luminance (L) values. TSL has been selected as the best color space to extract skin color from complex backgrounds (Chen and Liu, 2003) because it has the advantage of extracting a given color robustly while minimizing the influence of illumination. Eqs. (5), (6), and (7) are used to obtain the T, S, and L components, respectively, in the TSL space, where r' = r − 1/3 and g' = g − 1/3, with r = R/(R+G+B) and g = G/(R+G+B) the components of the normalized rgb color model. The values of T, S, and L are normalized in the range [0,1]. For R = G = B (achromatic colors), T = 5/8 and S = 0 are taken.

T = \frac{1}{2\pi} \arctan\!\left(\frac{r'}{g'}\right) + \frac{1}{2}    (5)

S = \sqrt{\frac{9}{5}\left(r'^2 + g'^2\right)}    (6)

L = 0.299R + 0.587G + 0.114B    (7)
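A direct transcription of Eqs. (5)–(7) for a single RGB triple might look as follows (a sketch; the function name is ours, and atan2 is used so that the g' ≤ 0 case is handled without an explicit quadrant correction):

```python
import math

def rgb_to_tsl(R, G, B):
    """Convert one RGB triple (0-255 each) to TSL values normalized to [0, 1] (Eqs. 5-7)."""
    L = (0.299 * R + 0.587 * G + 0.114 * B) / 255.0          # Eq. (7), normalized to [0, 1]
    if R == G == B:                                           # achromatic colors
        return 5.0 / 8.0, 0.0, L
    total = float(R + G + B)
    r_p = R / total - 1.0 / 3.0                               # r' = r - 1/3
    g_p = G / total - 1.0 / 3.0                               # g' = g - 1/3
    T = math.atan2(r_p, g_p) / (2.0 * math.pi) + 0.5          # Eq. (5)
    S = math.sqrt(9.0 / 5.0 * (r_p ** 2 + g_p ** 2))          # Eq. (6)
    return T, S, L
```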
To compare the color spaces, 575 images were selected from three different sources: the Faces96 database (Collection of facial images: Faces96, 2007), the Internet, and our private database. Concerning the choice of skin regions in every image, areas which contained cheeks, forehead, nose, chin, neck, and hands were selected. These areas were formed mainly by skin pixels although some pixels may belong to hair, eyes, teeth, and other non-skin components such as glasses. Chromaticity components of each pixel in the selected skin regions in the five color spaces were extracted. Fig. 1 shows the histogram of the two chromaticity components in the color spaces for all the pixels in the skin regions selected in the images. The histograms show that skin colors are grouped in a rather compact region of the chromaticity components space, although IHS and TSL distributions are not
Fig. 1. Histogram of the chromaticity components of the skin pixels selected from 575 images: (a) normalized rgb; (b)YIQ; (c) IHS; (d) TSL; and (e) CIELUV.
Fig. 2. Images after applying the skin color filters corresponding to the five color spaces: (a) original image; (b) normalized rgb; (c) YIQ; (d) IHS; (e) TSL; and (f) CIELUV.
mainly around a point but around the line S = 0, because both are cylindrical color spaces. The next step was the estimation of the skin color distributions as bivariate unimodal Gaussian distributions. We used the minimum covariance determinant (MCD) estimator, proposed by Rousseeuw (1985). This method searches among the observations for the subset whose covariance matrix has minimum determinant, and extracts its mean and covariance matrix. The MCD method has the ability to discard outliers (non-skin pixels in the skin regions) and is very efficient regarding robust estimation. Moreover, the MCD method can deal with many observations in reasonable time using the algorithm proposed in Rousseeuw and Van Driessen (1999). Once the normal distributions are estimated, a threshold is fixed for each distribution. A pixel is classified as skin if the Mahalanobis distance (McLachlan, 2004) of its color to the mean vector of the modeled distribution is lower than a threshold defined through the critical value at α of a χ² distribution with two degrees of freedom, where α is the threshold value of the filter, because the squared Mahalanobis distance of an observation to the mean vector of a normal distribution follows a χ² distribution. After the estimation of a normal distribution for each studied color space, we evaluated their behavior with threshold values of 0.95 and 0.99, with 243 different images covering a wide range of
people, backgrounds, and illuminations from the Faces96 image database, our private database, and different web pages. Fig. 2 shows in each row an image and the resulting binary images after applying the filters created with the normalized rgb, YIQ, IHS, TSL, and CIELUV color spaces and a threshold of 0.95. An image is considered to be filtered successfully if the true positive rate (TPR) is greater than 75% and the false positive rate (FPR) is lower than 10%, comparing the output of the skin color filtering with the manually segmented ground truth skin color regions over all the image pixels. These two thresholds were selected as the tradeoff between the highest possible TPR and the lowest possible FPR to extract the skin color regions in images. The results are summarized in Table 1. The highest percentage of successfully filtered images is 86.32%, achieved using the TSL color space and a threshold of 0.99. A LUT was created to assign one of two outputs, skin or non-skin color, to every color in the RGB color space. As 8 bits are used to represent each color coordinate, there are 2^24 = 16,777,216 colors. The skin color output was assigned to the colors whose pair of TS values has a Mahalanobis distance to the mean of the calculated normal distribution lower than the threshold. In the LUT, the number of colors with the skin color tag is 559,640. Fig. 3 shows the 2D histogram of the TSL chromaticity components of the colors to which the LUT assigns the skin color output. Fig. 4 shows four images with the ground truth skin regions in the
second row and the LUT filtering in the third row. In the figure caption, the TPR and FPR for each image are shown. Only the image in Fig. 4(b) has a TPR greater than 75% and an FPR lower than 10%, thus satisfying the criterion.
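The following sketch illustrates how such a LUT could be built, assuming scikit-learn's MinCovDet as the MCD estimator and SciPy's chi2 for the critical value; the function and variable names are ours, and the achromatic special case of Eq. (5) is omitted for brevity:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

def ts_of_rgb(rgb):
    """Vectorized T and S chromaticity (Eqs. 5-6) for an (N, 3) array of RGB values."""
    rgb = rgb.astype(np.float64)
    total = rgb.sum(axis=1)
    total[total == 0] = 1.0                                   # avoid division by zero for black
    r_p = rgb[:, 0] / total - 1.0 / 3.0
    g_p = rgb[:, 1] / total - 1.0 / 3.0
    T = np.arctan2(r_p, g_p) / (2 * np.pi) + 0.5
    S = np.sqrt(9.0 / 5.0 * (r_p ** 2 + g_p ** 2))
    return np.column_stack([T, S])

def build_skin_lut(skin_rgb_samples, alpha=0.99):
    """Robust bivariate Gaussian skin model (MCD) and chi-square thresholded RGB LUT."""
    mcd = MinCovDet().fit(ts_of_rgb(skin_rgb_samples))        # robust mean and covariance of skin TS values
    threshold = chi2.ppf(alpha, df=2)                         # critical value, 2 degrees of freedom
    # Enumerate all 2^24 RGB colors (memory-heavy, but done once offline) and keep
    # those whose squared Mahalanobis distance to the skin model is below the threshold.
    levels = np.arange(256, dtype=np.uint8)
    all_rgb = np.stack(np.meshgrid(levels, levels, levels, indexing="ij"), -1).reshape(-1, 3)
    d2 = mcd.mahalanobis(ts_of_rgb(all_rgb))                  # squared Mahalanobis distances
    return (d2 < threshold).reshape(256, 256, 256)            # lut[R, G, B] -> True for skin color
```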
Table 1. Percentage of successful skin color filtering of test images with different color spaces and thresholds.

Color space        0.95 threshold    0.99 threshold
Normalized rgb     78.19             77.48
YIQ                56.06             53.27
IHS                69.73             81.54
TSL                80.34             86.32
CIELUV             73.84             65.11

Fig. 3. Histogram of skin colors in LUT.

3.2. Face, facial features, and hands detection modules

Independent classifiers to detect facial features and hands in grayscale images were created, each with a cascade of strong
classifiers obtained by thresholding on scalar features selected from an overcomplete set of Haar wavelet-like features using the AdaBoost algorithm. In the framework of the Intel Open Source Computer Vision Library (OpenCV) (Open Computer Vision Library, 2009), Lienhart et al. (2003) developed a frontal face detector with the AdaBoost algorithm based on Haar-like features. We evaluated the face detector with 210 images collected from the Faces96 and BioID (The BioID face database, 2009) databases and from our private database, with frontal and near frontal faces. It achieved a true detection rate of 0.985 and a number of false positives per image of 0.181. A frontal face in a grayscale image carries a lot of structural information: it has symmetry, the eyes, nose, and mouth have constrained geometric relationships, the eye regions are darker than the neighboring regions, and the nose wings and nostrils are characterized by their position and brightness. The eyes and nose also carry relevant information individually, although poorer than the whole face. However, as our system needs to detect the facial features (their presence, or their absence when a hand occludes them) only in an image subwindow where the face was previously tracked, we used this information to create the facial feature detectors, thus overcoming the mentioned limitation. To create the facial feature detectors, the negative images were mainly selected from facial regions that did not contain, either totally or partially, the facial feature to detect, and from hand regions. This way, the detectors are optimally integrated in the system. Apart from the positive and negative images, other factors influence the training stage of the detectors, such as the particular AdaBoost algorithm (Discrete, Real or Gentle), the Haar-like features (only basic, or basic and rotated), the minimum true detection rate and the maximum false positive rate selected for each strong classifier (stage) of the cascade, the minimum width and height of the regions to detect, the target true detection rate of the cascade, and the false positive rate of the cascade. In the training stage, the process of selecting features for the first stage finishes when the true detection rate is equal to or greater than the selected minimum true detection rate and the false positive rate is equal to or smaller than the selected maximum false positive rate. Then, another stage is added with the same method. The addition of stages finishes when the overall detection rate D of the cascade
Fig. 4. Four test images with ground truth skin regions and LUT filter.
and the false detection rate F of the cascade, expressed in Eq. 2 as a function of the detection rate and the false positive rate of each stage, are achieved. To train the pair of eyes, eye, and nose detectors, 2115, 2756, and 1477 positive images and 2425, 2327, and 1785 negative images were used respectively, selected from frontal face images from our private database and from the Faces96 and BioID databases. The Gentle AdaBoost algorithm and basic and rotated Haar-like features were selected to create the detectors after the performance comparison of different detectors. The parameters of the chosen detectors are:
- Pair of eyes: minimum object width and height of 24 × 9 and 20 stages.
- Eye: minimum object width and height of 14 × 9 and 18 stages.
- Nose: minimum object width and height of 14 × 9 and 20 stages.
Fig. 5 shows in each row a training image used to create the pair of eyes, eye, and nose detectors, respectively, together with two of the features selected for each detector. For the evaluation of the detectors, the same 210 images used to evaluate the face detector, which are different from the training set, were selected. Fig. 6 depicts the FROC curves (true positive rate vs. average number of false positives per image) of the three detectors for the 210 images, built by varying the similarity criterion between a detection and the ground truth facial features to consider a true detection. Fig. 6(a) and Fig. 6(b) show the FROC curves for the pair of eyes detector and the eye detector, respectively. Fig. 6(c) shows the FROC curves for two nose detectors, the best built with the basic Haar-like features and the best built with the basic and rotated Haar-like features. The latter detector gives better results. Fig. 6(d) shows the FROC curves for three nose detectors, each one built with a particular
Fig. 5. Training images for the pair of eyes, eye, and nose detectors and two selected features for each detector.
Fig. 6. FROC curves for: (a) pair of eyes detector; (b) eye detector; (c) nose detectors with basic and basic and rotated Haar-like features; and (d) nose detectors with different variants of the AdaBoost algorithm.
AdaBoost algorithm: Discrete (DAB), Real (RAB), and Gentle (GAB). The difference among them lies in the way they reassign the weights in each iteration of the algorithm (Lienhart et al., 2003). The DAB and RAB algorithms perform badly, both having many false positives per image together with a lower true detection rate than the GAB algorithm. The selected nose detector was built with the basic and rotated Haar-like features and the GAB algorithm, as were the rest of the selected pair of eyes, eye, and hand detectors. Table 2 shows the values of the chosen true positive rate and average number of false positives per image of the detectors. The results of false positives per image presented in Fig. 6 and Table 2 were obtained by searching for the facial features in the entire test images and not only in the face regions. Fasel et al. (2005) dealt with face and eye detection in arbitrary images and pointed out the tradeoff, in the building of an eye detector, between robust operation without constraints and accurate eye localization. To implement robust and precise facial feature detection in the system, our approach is to minimize false alarms using the information available to the system before the facial feature detection. As the face location in frontal or near frontal position is known, the facial feature search is constrained to a subwindow inside the face region based on anthropometric measures. The pair of eyes is searched for in the upper face subregion with width and height equal to the face width. The individual eyes are searched for in the upper face subregion with height equal to the face width and width equal to 3/4 of the face width, each eye in the corresponding side of the face. The nose is searched for in the face subregion with height equal to the face width and width equal to half the face width. The dimensions of the search subwindows have been adjusted with the test images and video sequences so that facial feature detection is correct within the face regions given by the face tracking and robust to different people and to the environments where the video sequences are captured. False alarms and detection time are greatly decreased with these constraints. With all the constraints mentioned above, the true positive rates of the detectors remain constant but the average number of false positives per image decreases to 0.032, 0.147, and 0.076 for the pair of eyes, eye, and nose detector, respectively.
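As an illustration of constraining facial feature detection to subwindows of a previously located face, the following OpenCV sketch uses the library's bundled frontal face and eye cascades as stand-ins for the authors' custom detectors; the ROI fractions are illustrative, not the exact anthropometric values described above:

```python
import cv2

# Stand-in cascades shipped with OpenCV; the authors trained their own pair-of-eyes,
# eye, and nose cascades, which are not publicly available.
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_face_and_eyes(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=3)
    results = []
    for (x, y, w, h) in faces:
        # Constrain the eye search to an upper face subwindow, in the spirit of the
        # anthropometric constraints described in the text.
        roi = gray[y:y + h // 2, x:x + w]
        eyes = eye_cascade.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=3)
        # Map eye coordinates back to the full frame.
        eyes = [(x + ex, y + ey, ew, eh) for (ex, ey, ew, eh) in eyes]
        results.append(((x, y, w, h), eyes))
    return results
```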
Table 2. True positive rate and false positives per image for the pair of eyes, eye, and nose detectors.

                             Pair of eyes    Eye      Nose
True positive rate (%)       98.36           93.44    91.47
False positives per image    0.262           4.049    1.124
Fig. 7 shows the results of applying the facial feature detectors to images from the Faces96 database with the already mentioned size constraints. No false alarms appeared. There are two images where a facial feature is not detected: the nose in Fig. 7(c) (first row) and the left eye in Fig. 7(b) (second row). The building of hand detectors is more complex due to the great variability of the hand. Besides, the positive training images have to be rectangular. This fact is not significant for the eye and nose positive training images. However, the rectangular image trimming used to select a hand region includes a significant number of background pixels, varying with the hand posture, which influences the positive training images and can make training more difficult. On the other hand, we tried to overcome hand variability both by building detectors using hand images in different poses and by building specific detectors for each hand posture, with poor results in both cases. Finally, right- and left-hand detection was accomplished only in frontal position, with 3192 and 3217 positive training images and 2256 and 2278 negative training images from our private database and the Static Hand Posture Database (Marcel, 1999), respectively. The right-hand detector reached 22 stages and the left-hand detector 19 stages. Fig. 8 shows a right-hand and a left-hand training image and three features selected for each detector. The hand detectors were tested with 140 images different from the training images, some from our private database and others from the Static Hand Posture Database. Table 3 and Fig. 9 show the true positive rates and the average number of false positives per image depending on the number of stages of the cascade selected for the detection. For the selected right- and left-hand detectors, we compared the results of using all the stages and different subsets of the first stages. Our system does not have to detect the hands in the video sequence constantly. After an initial detection, the hands are located by tracking. In this scenario, a low false alarm rate is more important than a high true detection rate, and the 22-stage right-hand detector and the 19-stage left-hand detector are the detectors selected for the system. Schmugge et al. (2007) presented an extensive comparative study of five color spaces, with or without the luminosity component, and two color models. They found that the hand images easiest to detect were indoor images with the hand in an open frontal position, as in our approach. The average detection rates for indoor hand images were 37% and 41% for complex and simple backgrounds, respectively. The detection rates of our classifiers greatly outperform both results. To reduce the number of false positives, the skin color filter explained in Section 3.1 is used: to validate a detection given by the hand classifiers, the fraction of region pixels classified as skin by the Look-Up Table (LUT) has
Fig. 7. Facial feature detection in images from the Faces96 database.
to be bigger than a threshold. The evaluation of different percentage thresholds, in 51 steps, was made with the 140 images of the test set. The highest decrease in the number of false positives was obtained with the percentage value of 60%, with a decrease in the true detection rate of less than 5%. False positives per image were reduced to 0.082 and 0.115 for the right- and left-hand detector, respectively. In turn, the true detection rates were also reduced, to 70.04% and 84.38% for the right- and left-hand detector, respectively. The reductions in the true detection rates are acceptable for our system because it needs to detect the hands just once in a video sequence. The non-detection of the hands in a set of consecutive frames, assuming a correct detection in a later frame, only implies a delay in the shift from the initialization mode to the monitoring analysis mode. In this mode, the user is told the psychomotricity exercise to perform. In the experimental results, the time needed in the worst case from the beginning of the
initialization mode to the shift to the monitoring analysis mode was smaller than 3 s. In contrast, a false positive in a hand detector would cause incorrect operation, as the system would apply the tracking algorithm to a non-hand object. Fig. 10 shows the result of the right-hand detector with images from the Static Hand Posture Database. The last three false negatives are caused by a non-frontal hand position.

3.3. Face and hand tracking modules

The system needs to track three human body parts independently: the face, the right hand, and the left hand. As explained in Section 2.4, the tracking is accomplished through the computation of 2D histograms of the three regions. The chromaticity components of the TSL color space are selected to compute the histograms, based on the study in Section 3.1. A different histogram is used to track each region, with the aim of adapting to the independent variation of the face, right-hand, and left-hand colors due to the changing illumination as they move, even though the three are skin regions. To obtain the histogram of each region, first the AdaBoost detector is applied to the image. The detected rectangle contains the region, although not all its pixels belong to the region. Then, a subwindow centered at the detected rectangle center and with 2/3 of the width and height of the rectangle is selected to discard pixels not belonging to the region. The 2D histogram of 64 × 64 bins is computed with the pixels inside this subwindow. A function c: R² → {1, …, m} is defined that associates to each pixel x_i the index c(x_i) of the corresponding histogram bin. The resulting non-weighted histogram is computed as shown in Eq. (8), where u = 1, …, m indexes the histogram bins, x_i, i = 1, …, n, are the pixels inside the subwindow used to compute the histogram, and δ[x] is the function whose value is zero everywhere except at x = 0, where it is 1.

\hat{q}_u = \sum_{i=1}^{n} k(x_i)\,\delta[c(x_i) - u]    (8)
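A sketch of the kernel-weighted T–S histogram of Eq. (8) is given below; it anticipates the kernel k(x_i) and the factor f_LUT defined later in this section (Eq. (9)), approximates d1/d2 by the normalized elliptical radius, and uses an axis-aligned ellipse for simplicity (the paper fits a rotated ellipse); all names are ours:

```python
import numpy as np

def weighted_ts_histogram(ts_pixels, centers, ellipse, lut_is_skin, bins=64):
    """Kernel-weighted 2D T-S histogram of a tracked region (Eqs. 8 and 9).

    ts_pixels   : (N, 2) array of T and S values, normalized to [0, 1]
    centers     : (N, 2) pixel coordinates, used to compute the kernel weights
    ellipse     : (cx, cy, a, b) fitted ellipse center and semi-axes
    lut_is_skin : (N,) boolean output of the skin color LUT for each pixel
    """
    cx, cy, a, b = ellipse
    # For an ellipse, d1/d2 equals the normalized elliptical radius of each pixel,
    # i.e. the ratio of its distance to the center over the distance from the center
    # to the ellipse boundary along the same direction.
    ratio = np.sqrt(((centers[:, 0] - cx) / a) ** 2 + ((centers[:, 1] - cy) / b) ** 2)
    ratio = np.clip(ratio, 0.0, 1.0)
    f_lut = np.where(lut_is_skin, 1.0, 0.5)       # Eq. (9): 1 for LUT skin pixels, 0.5 otherwise
    k = f_lut * (1.0 - ratio)                     # k(x_i) = f_LUT * (1 - d1/d2)
    hist, _, _ = np.histogram2d(ts_pixels[:, 0], ts_pixels[:, 1],
                                bins=bins, range=[[0, 1], [0, 1]], weights=k)
    return hist
```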
Fig. 8. Right- and left-hand training images and features selected for the detectors.
Table 3. True positive rate and false positives per image depending on the number of stages selected by the right- and left-hand detectors.

Right hand                   11 stages    15 stages    18 stages    22 stages
True positive rate (%)       98.30        94.91        88.13        72.88
False positives per image    8.728        1.983        0.949        0.389

Left hand                    11 stages    15 stages    17 stages    19 stages
True positive rate (%)       95.91        93.87        89.79        87.75
False positives per image    8.081        1.979        1.061        0.693
k(x_i) is the profile of a decreasing kernel used to assign larger weights to pixels closer to the region center: as a pixel gets closer to the region boundary, the likelihood that the pixel is an outlier, i.e. that it does not have TS values representative of the region to track, increases. A more robust tracking of faster objects in changing and unconstrained environments is achieved by assigning less weight to the pixels with a higher likelihood of being outliers. The kernel is not applied to the histogram computed after the initial detection but to the following updates of the histogram during tracking. k(x_i) is calculated using Eq. (9), where d_1 is the distance from the pixel x_i to the ellipse center obtained from the ellipse fitting of the tracking region, and d_2 is the distance from the ellipse center to the intersection point of the ellipse and the line that joins the ellipse center with the pixel x_i. Fig. 11 shows an image with the ellipse fitting of the face tracking, the
Fig. 9. FROC curves depending on the number of stages selected by: (a) right-hand detector and (b) left-hand detector.
Fig. 10. Output of the right-hand detector.
Fig. 11. Face tracking region and a pixel p to calculate the histogram: (a) input image; (b) backprojection image; and (c) skin color LUT filtering image.
backprojection, and the skin color LUT filtering images. The distances d_1 and d_2 used to calculate the kernel for the pixel p are shown in Fig. 11(a). The pixel p in Fig. 11(a) has a bright color, similar to a large portion of the background due to the illumination, and its LUT output is skin color, so that if its corresponding kernel weight were large, its contribution to the histogram could make the tracking region gradually include the background. However, the kernel weight of pixel p is low because it is close to the region boundary and therefore d_1 is similar to d_2, so its contribution to the histogram is small. After many experiments, f_LUT in Eq. (9) is fixed to 1 when the chrominance values of the pixel x_i are included in the skin color LUT and to 0.5 otherwise. This parameter gives robustness to the tracking under dramatic changes in illumination.

k(x_i) = f_{LUT}\left(1 - \frac{d_1}{d_2}\right)    (9)

The non-weighted histogram is not adequate because it may contain background values, so a weighted histogram is calculated by dividing the region histogram by the background histogram. The background histogram is different for each region to track. It is computed from the entire image after removing the other tracked regions. For instance, the right- and left-hand regions, but not the face region, would be removed to obtain the background histogram for the face. Besides, the kernel k(x_i) used to weight the pixels of the face region in their contribution to the background histogram for the face is the same as the one used to calculate the face histogram. The division between the face histogram and the background histogram is the weighting between the height of each histogram bin and the height of the same bin in the background histogram. Therefore, colors that appear both in the tracking region and in the background are penalized in the division, thus increasing the
contrast between the region and its background. With the weighted histogram, histogram backprojection is applied to obtain the 2D likelihood distribution image from the frames of the video sequence. For each tracked region, with the backprojection image and the region location in the previous frame, the CAMShift algorithm is applied to fulfill tracking. Fig. 12 depicts the face and hand tracking in a video sequence. The image on top of Fig. 12 shows the face and hands after the ellipse fitting of the tracking regions. From these regions, the 2D weighted histograms of the chromaticity components are used for face, right-, and left-hand tracking by dividing the region histograms by the region background histograms, as explained above. The T and S components are normalized in the range [0,1] in the graphs. The three background histograms are very similar but not identical, since the region of interest of each histogram is included in its computation, unlike the other two regions. After the division by the respective background histogram, the TS histogram for region tracking is more focused, as some bins in the region histogram were penalized for being also present in the background. Backprojection images are calculated with the next frame of the video sequence, the skin color LUT filtering, and the respective TS histogram. With these backprojection images and the region locations in the previous frame, the CAMShift algorithm is applied to obtain the face, right-, and left-hand rectangular tracking regions. Then, ellipse fitting is applied to these rectangular tracking regions in the backprojection images to obtain the elliptical tracking regions drawn in Fig. 12. These regions are used to update the histograms for tracking and repeat the process. Fig. 13 shows the face and hands tracking in frames nos. 1, 2, 18, 42, 44, 45, 61, and 103 of a video sequence in which the skin
Fig. 12. Adaptive tracking of the face and hands in two frames of a video sequence with histograms and backprojection images.
color LUT filters not only the human body parts but also regions with skin-like color, such as some clothes and the door. Each row of Fig. 13 presents the processing results for the same frame. These are, from left to right, the detected (rectangular) or tracked (elliptical) regions, the skin color LUT filtering, and the backprojection images obtained with the face, right-hand, and left-hand histograms, respectively. The face, right-, and left-hand detections are accomplished in different frames of the video: rows nos. 1, 4, and 5, respectively. Although a large part of the background has skin-like colors, these are correctly penalized in the histogram for tracking. It can be observed that the background skin-like colors are greatly reduced in the backprojection images
so that tracking is properly achieved. The backprojection image with the right-hand histogram shown in row no. 5 only has a small number of non-black pixels because the left hand had not been detected yet and therefore the histogram for right-hand tracking penalized a large part of its bins, as they were also present in the left hand, which belonged to the background at that time. Once the regions are detected and tracked, their backprojection images show that the particular regions have increasingly larger grayscale values in all their pixels, in contrast to the rest of the image, thus leading to stable region tracking in the video sequence regardless of changes in size, appearance, and illumination of the face and hands.
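For illustration, a simplified OpenCV tracking loop with a T–S histogram and CAMShift is sketched below; it omits the kernel weighting, the background-histogram division, the skin color LUT factor, and the per-frame histogram update described above, and its function names are ours:

```python
import cv2
import numpy as np

def bgr_to_ts(frame_bgr):
    """Two-channel T-S image (Eqs. 5-6), scaled to 8 bits for the OpenCV histogram calls."""
    b, g, r = cv2.split(frame_bgr.astype(np.float32))
    total = r + g + b + 1e-6
    r_p, g_p = r / total - 1.0 / 3.0, g / total - 1.0 / 3.0
    T = np.arctan2(r_p, g_p) / (2 * np.pi) + 0.5
    S = np.sqrt(9.0 / 5.0 * (r_p ** 2 + g_p ** 2))
    return np.dstack([np.clip(T * 255, 0, 255), np.clip(S * 255, 0, 255)]).astype(np.uint8)

def track_region(frames, init_window, bins=64):
    """Track one skin region over a frame sequence with a TS histogram and CAMShift.

    frames      : iterable of BGR frames (the first one contains the detection)
    init_window : (x, y, w, h) rectangle given by the AdaBoost detector
    """
    term_crit = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    frames = iter(frames)
    ts = bgr_to_ts(next(frames))
    x, y, w, h = init_window
    roi = ts[y + h // 6: y + 5 * h // 6, x + w // 6: x + 5 * w // 6]  # central 2/3 of the detection
    hist = cv2.calcHist([roi], [0, 1], None, [bins, bins], [0, 256, 0, 256])
    cv2.normalize(hist, hist, 0, 255, cv2.NORM_MINMAX)
    window = init_window
    for frame in frames:
        ts = bgr_to_ts(frame)
        backproj = cv2.calcBackProject([ts], [0, 1], hist, [0, 256, 0, 256], scale=1)
        rot_rect, window = cv2.CamShift(backproj, window, term_crit)
        yield rot_rect                      # rotated rectangle, a proxy for the fitted ellipse
```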
Fig. 13. Face, right-hand, and left-hand CAMShift-based tracking in a video sequence: (a) original frame with the detected (rectangular) and tracked (elliptical) regions; (b) skin color LUT filtering; (c) backprojection image with the face histogram; (d) backprojection image with the right-hand histogram; and (e) backprojection image with the left-hand histogram.
Fig. 14 shows the tracking in frames nos. 1, 2, 3, 5, 6, and 8 of a video sequence in which there are very bright areas in the skin regions. The positions of the hands and face with respect to the illumination source cause some skin pixels to be so bright that the LUT assigns non-skin color to them, although the LUT was created with chromaticity components and thus does not depend on the luminosity component. In each row, the processing results for the same frame are presented in the same order as in Fig. 13. The face, right-hand, and left-hand detections are accomplished in the same frame (first row). In the following frames, the hand motions make skin pixels in the middle of the hand regions brighter, so they are not assigned skin color by the LUT. However, CAMShift-based tracking with the adaptive and weighted histograms achieves correct and stable tracking of the face and hands.
4. Proposed system

The proposed system monitors psychomotricity exercises through face detection and tracking, eye and nose detection inside the face region, ear location, and left- and right-hand detection and tracking. Two stages are included in the system: the face and hands detection and tracking stage and the tests monitoring stage, as seen in Fig. 15. From these stages, the system has two working modes: the initialization mode and the monitoring analysis mode. In the initialization mode, the system performs AdaBoost-based, translation and scale invariant detection of the face, right hand, and left hand in frontal position. Once the three regions are detected, the system tracks them based on the CAMShift algorithm, with translation, scale, and rotation invariance. The
Fig. 14. Face, right-hand, and left-hand CAMShift-based tracking in a video sequence: (a) original frame with the detected (rectangular) and tracked (elliptical) regions; (b) skin color LUT filtering; (c) backprojection image with the face histogram; (d) backprojection image with the right-hand histogram; and (e) backprojection image with the left-hand histogram.
system needs to have detected and to be tracking the three regions to shift from the initialization mode to the monitoring analysis mode. In the monitoring analysis mode, having the face and hands tracked, the system indicates to the user the psychomotricity exercise to perform and, if needed, begins to detect the eyes or nose or to locate the ears involved in the exercise. The eyes and nose are detected with the AdaBoost-based detectors, which provide their presence or absence as well as their position and size. Ear location is determined using face and head anthropometric measures (Farkas, 1994). It is not possible to accomplish ear detection in a frontal face position, as the ears can be partially or totally occluded by the hair. Finally, the monitoring analysis module processes the positions of the human body parts to assess the realization of the exercise.
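As a rough illustration of how the monitoring analysis module's per-frame decision could be organized, the following sketch checks whether a tracked hand region overlaps the target facial feature region and whether the correct hand was used; the data structures, names, and decision rule are hypothetical simplifications, not the authors' actual logic (which also exploits the absence of an occluded feature's detection):

```python
from dataclasses import dataclass

@dataclass
class Region:
    x: int       # top-left corner of the bounding box of a detected or tracked region
    y: int
    w: int
    h: int

    def intersects(self, other: "Region") -> bool:
        return not (self.x + self.w < other.x or other.x + other.w < self.x or
                    self.y + self.h < other.y or other.y + other.h < self.y)

def monitor_frame(exercise, hands, features):
    """Assess one frame of an exercise such as 'touch the nose with the right hand'.

    exercise : (hand_name, feature_name) pair, e.g. ("right_hand", "nose")
    hands    : dict mapping hand name to its tracked Region
    features : dict mapping facial feature name to its detected Region (absent if occluded)
    Returns "success", "wrong_hand", or "ongoing".
    """
    hand_name, feature_name = exercise
    target = features.get(feature_name)
    if target is None:
        return "ongoing"                  # feature occluded or not yet detected
    for name, hand in hands.items():
        if hand.intersects(target):
            return "success" if name == hand_name else "wrong_hand"
    return "ongoing"
```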
4.1. System environment

The system runs on a standard PC with a webcam connected to it. The user sits in front of the camera, at approximately 1 m from it, with unrestricted environment and illumination conditions. Firstly, the system is in the initialization mode. In this mode, the user has to place the face and both hands in frontal position so that the AdaBoost-based detectors can detect them. Once each region, either the face, the right hand, or the left hand, has been detected, CAMShift-based tracking is applied to it. As soon as the three regions are tracked, the
system shifts from the initialization mode to the monitoring analysis mode. At the beginning of the monitoring analysis mode, the user is shown, through an acoustic message and a pop-up window, the exercise to perform out of a set of 12. The psicomotricity exercises are (a compact encoding of this set is sketched after the list):

1. Touch the right eye with the right hand.
2. Touch the left eye with the right hand.
3. Touch the right eye with the left hand.
4. Touch the left eye with the left hand.
5. Touch the nose with the right hand.
6. Touch the nose with the left hand.
7. Touch the right ear with the right hand.
8. Touch the left ear with the left hand.
9. Touch the head with the right hand.
10. Touch the head with the left hand.
11. Raise the right hand.
12. Raise the left hand.
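One compact way to encode this set in the monitoring software is a small table mapping each exercise to the hand and target it involves, as in the following sketch; the type and field names are illustrative choices, and only the twelve exercises themselves come from the list above.

```cpp
#include <array>

// Hand and target involved in each of the 12 exercises listed above.
// The encoding itself is an illustrative choice; kExercises[i - 1]
// corresponds to exercise number i.
enum class Hand { Right, Left };
enum class Target { RightEye, LeftEye, Nose, RightEar, LeftEar, Head, Raised };

struct Exercise { Hand hand; Target target; };

constexpr std::array<Exercise, 12> kExercises = {{
    {Hand::Right, Target::RightEye}, {Hand::Right, Target::LeftEye},
    {Hand::Left,  Target::RightEye}, {Hand::Left,  Target::LeftEye},
    {Hand::Right, Target::Nose},     {Hand::Left,  Target::Nose},
    {Hand::Right, Target::RightEar}, {Hand::Left,  Target::LeftEar},
    {Hand::Right, Target::Head},     {Hand::Left,  Target::Head},
    {Hand::Right, Target::Raised},   {Hand::Left,  Target::Raised}
}};
```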
A particular exercise may require not only face and hands tracking but also eye and nose detection or ear location, which the pair-of-eyes, eyes, and nose detection and ears location module performs from the face position. Moreover, the pair of eyes and both individual eyes are detected in exercises 1–4, and both ears are located in exercises 7 and 8, so that the monitoring analysis
Fig. 15. Overall diagram of the monitoring system.
can distinguish the incorrect realization of an exercise due to the use of the opposite hand, eye, or ear. This information is of medical interest for the assessment of left–right recognition.

From the beginning of the exercise, the monitoring analysis module processes the detection and location of the body parts involved in it in each frame of the sequence, giving rise to the successful end of the exercise (with the corresponding reaction time), the incorrect end of the exercise (with the corresponding failure time), or the expiration of the time given to perform the exercise. After the end of the exercise, the system is ready to monitor another one. The time limit to perform an exercise is set by the rehabilitation expert to suit the particular user and the rehabilitation stage. Fig. 16 shows a user who has just begun exercise 6 after being informed of it by a pop-up window and an acoustic message.

The system has some limitations in its two working modes. In the initialization mode, both the face and the two hands have to be placed in frontal position so that the AdaBoost-based detectors can detect them. After their detection, the face and hands tracking in the monitoring analysis mode is correct regardless of their posture. However, the face must be in frontal position in the exercises in which a hand has to touch a facial feature, because facial feature detection requires it. The face can be tilted up to 15° in plane and up to 20° out of plane; the face and facial feature detectors keep working properly within these ranges. The system does not process 3D information in the monitoring analysis mode because it works with one camera. Consequently, if a hand were between the camera and a facial feature, it could occlude the facial feature in the video sequence even though the hand had not touched the face. The rehabilitation experts stated that this limitation is acceptable because the most significant information of the psicomotricity exercises for cognitive rehabilitation lies in the fact that the user moves the corresponding hand to the facial feature, regardless of whether the hand actually touches it.
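The outcome of each monitored exercise can be summarized in a small record such as the one sketched below and appended to a results file for later review by the rehabilitation expert. The field names and the plain-text format are assumptions of the sketch; only the recorded quantities (exercise, outcome, and the associated time) follow the description above.

```cpp
#include <fstream>
#include <string>

// Outcome of one monitored exercise, as stored for later offline review.
// Field names and the plain-text format are assumptions of this sketch.
enum class Outcome { Correct, Incorrect, TimedOut };

struct ExerciseResult {
    int exerciseId;   // 1..12, as listed in Section 4.1
    Outcome outcome;  // correct end, incorrect end, or time limit reached
    double seconds;   // reaction time, failure time, or the time limit
};

void appendResult(const std::string& path, const ExerciseResult& r) {
    std::ofstream out(path, std::ios::app);
    out << r.exerciseId << ' ' << static_cast<int>(r.outcome) << ' '
        << r.seconds << '\n';
}
```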
4.2. Psicomotricity tests monitoring

The monitoring system has two modes: the initialization mode and the monitoring analysis mode. The initialization mode comprises the phases needed for the initial detection and subsequent tracking of the face and hands, as seen in Fig. 17. The detection and tracking modules for the face, right hand, and left hand in Fig. 17 perform similar processing. Fig. 18 shows the region detection and tracking module, which is valid for the three regions. For each region, the detector is first applied to the frame if the region was not detected previously; if the region is not detected, the module's processing for that region finishes. If the region was detected in a previous frame, the backprojection image is obtained with the weighted histogram of the region, CAMShift-based tracking is then performed, and the T–S histogram of the region is updated from the tracked region. After the detection and tracking module has been applied to the three regions in the initialization mode, each region-background histogram is calculated and the weighted histogram for region tracking is obtained by dividing the region histogram by the region-background histogram. The locations of the three regions must be known before computing the background histograms because each region-background histogram includes the pixels of its own region and surrounding background but excludes the pixels of the other two regions. Finally, the face, right hand, and left hand must all be tracked for the system to shift to the monitoring analysis mode.

In the monitoring analysis mode, as seen in Fig. 19, the system indicates to the user the exercise to perform. From that moment, the system processes the video frames to continue face and hands tracking. If either hand touches the face, the CAMShift-based tracking regions of the two parts intersect and the system enters the occlusion state. In this state, the histogram of the face and hand is the union of the histograms used for face and hand tracking before the occlusion. The histogram is kept fixed during the occlusion state to make the tracking stable, as the ellipse fitted to the face and occluding hand may include many non-skin pixels.
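A minimal sketch of this weighted histogram computation follows: the two-dimensional T–S histogram of a region is divided, bin by bin, by the histogram of the region together with its background, with the other two tracked regions excluded through a mask built by the caller. The bin counts, the small epsilon that avoids division by zero, and the final normalization are assumptions of the sketch.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>

// Weighted T-S histogram for one region: the region histogram divided, bin by
// bin, by the histogram of the region plus its surrounding background. The
// mask passed by the caller selects that area while excluding the other two
// tracked regions. Bin counts, epsilon and normalization are assumptions.
cv::Mat weightedHistogram(const cv::Mat& ts, const cv::Rect& region,
                          const cv::Mat& regionAndBackgroundMask) {
    const int channels[] = {0, 1};
    const int histSize[] = {256, 256};
    float range[] = {0.f, 256.f};
    const float* ranges[] = {range, range};

    cv::Mat regionHist, regionBgHist;
    cv::Mat regionImg = ts(region);
    cv::calcHist(&regionImg, 1, channels, cv::Mat(), regionHist, 2, histSize, ranges);
    cv::calcHist(&ts, 1, channels, regionAndBackgroundMask, regionBgHist, 2, histSize, ranges);

    cv::Mat denom = regionBgHist + 1e-6f;     // avoid division by zero in empty bins
    cv::Mat weighted;
    cv::divide(regionHist, denom, weighted);  // emphasizes colors rare in the background
    cv::normalize(weighted, weighted, 0, 255, cv::NORM_MINMAX);
    return weighted;
}
```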
Fig. 16. System environment.
In the occlusion state, contour extraction is applied to a binary image obtained by thresholding the pixels with gray levels greater than zero in the common backprojection image of the following frame. If there are two regions inside the contours with areas between 70% and 130% of the areas of the face and the occluding hand before the occlusion, face detection is applied to the original image and the face and hand regions are obtained from these two contours. Then, ellipse fitting is applied to the rectangular regions of the common backprojection image that circumscribe the face and hand contours. If the two elliptical regions are not in contact, the system leaves the occlusion state and independent histograms for the face and the hand are updated from these elliptical regions. Eye and nose detection is applied only if the exercise involves these facial features and a hand occludes the face, because only in this situation can a hand occlude the facial feature involved in an exercise. After the exercise monitoring ends, the system has to leave the occlusion state so that the user can be indicated another exercise to perform. If the system is in the monitoring analysis mode and loses the tracking of one hand, because the hand leaves the field of view of the camera, or an occlusion between the two hands occurs, the system shifts back to the initialization mode.

Fig. 20 shows, in each row, frames from the monitoring of a different psicomotricity exercise, from the moment the user is indicated the exercise to its correct realization. Each frame is accompanied by its number in the video sequence, counted from the beginning of the monitoring of the exercise. Human body regions are shown in different colors: green for the face and the pair of eyes, red for the left hand, nose, and left ear, and blue for the right hand, right eye, and right ear. The psicomotricity exercises monitored in the rows are nos. 2, 5, 7, and 10, as listed in Section 4.1. The last frame in each row corresponds to the correct realization of the exercise.
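The test for leaving the occlusion state can be sketched as follows: threshold the common backprojection image, extract the external contours, and look for two blobs whose areas fall within 70–130% of the face and hand areas measured just before the occlusion. The function name and the exact OpenCV calls are assumptions of the sketch; the area bounds and the overall structure of the test follow the description above.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgproc.hpp>
#include <vector>

// Test whether the occlusion state may end: threshold the common
// backprojection image, extract external contours, and look for two blobs
// whose areas are within 70-130% of the face and hand areas measured
// before the occlusion started.
bool occlusionMayEnd(const cv::Mat& commonBackproj,
                     double faceAreaBefore, double handAreaBefore,
                     std::vector<std::vector<cv::Point>>& blobs) {
    cv::Mat bin;
    cv::threshold(commonBackproj, bin, 0, 255, cv::THRESH_BINARY);  // keep every non-zero pixel
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(bin, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);

    auto matches = [](double area, double ref) { return area > 0.7 * ref && area < 1.3 * ref; };
    blobs.clear();
    for (const std::vector<cv::Point>& c : contours) {
        const double area = cv::contourArea(c);
        if (matches(area, faceAreaBefore) || matches(area, handAreaBefore))
            blobs.push_back(c);
    }
    // With two plausible blobs, the caller re-runs face detection, fits an
    // ellipse to each blob, and leaves the occlusion state only if the two
    // ellipses are no longer in contact.
    return blobs.size() == 2;
}
```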
In the monitoring of exercise 7, shown in the third row of Fig. 20, the centers of the right-hand region in consecutive frames are joined, so that the right-hand trajectory from the beginning of the exercise to the frame in which the right hand touches the right ear can be observed. This information, together with the size and orientation of the ellipses fitted to the face and hand tracking regions and the time associated with each frame of every monitored exercise, is saved in a file for further offline analysis by the rehabilitation experts. The correct realization of a monitored exercise requires the following output in five consecutive frames of the monitoring analysis stage (a sketch of the check for exercises 1–4 follows the list):
- The detection of one eye and the non-detection of the other eye and of the pair of eyes while one hand is in contact with the face (exercises 1–4). If the hand in contact with the face and the non-detected eye correspond to those of the exercise, it is realized correctly; otherwise, incorrectly.
- The non-detection of the nose while one hand is in contact with the face (exercises 5 and 6).
- The contact of one hand with one ear (exercises 7 and 8).
- The contact of one hand with the head (exercises 9 and 10).
- The location of one hand above the face (exercises 11 and 12).
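For exercises 1–4, the per-frame evidence and the five-consecutive-frames rule can be combined as in the sketch below. The class and field names are illustrative, and the sketch simplifies the rule by not requiring the same eye to be covered in all five frames; only the consecutive-frame count and the hand/eye consistency check follow the description above.

```cpp
// Per-frame evidence used for exercises 1-4: while a hand touches the face,
// exactly one eye must be detected and the other eye and the eye pair must
// not be detected. Names are illustrative.
struct EyeFrameEvidence {
    bool handOnFace;
    bool leftEyeDetected, rightEyeDetected, eyePairDetected;
    bool handIsLeft;  // which hand is in contact with the face
};

class EyeExerciseChecker {
public:
    // Encodes the exercise, e.g. exercise 3 (touch the right eye with the
    // left hand): requiredHandIsLeft = true, coveredEyeIsLeft = false.
    EyeExerciseChecker(bool requiredHandIsLeft, bool coveredEyeIsLeft)
        : handLeft_(requiredHandIsLeft), eyeLeft_(coveredEyeIsLeft) {}

    // Returns +1 once the exercise is completed correctly, -1 if it is
    // completed with the wrong hand or eye, and 0 while still undecided.
    int update(const EyeFrameEvidence& e) {
        const bool oneEyeCovered = e.handOnFace && !e.eyePairDetected &&
                                   (e.leftEyeDetected != e.rightEyeDetected);
        if (!oneEyeCovered) { streak_ = 0; return 0; }
        if (++streak_ < 5) return 0;                 // five consecutive frames required
        const bool coveredIsLeft = !e.leftEyeDetected;
        return (e.handIsLeft == handLeft_ && coveredIsLeft == eyeLeft_) ? +1 : -1;
    }

private:
    bool handLeft_, eyeLeft_;
    int streak_ = 0;
};
```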
Fig. 21 shows the consecutive monitoring of exercises 3 and 5. In Fig. 21(a), the user has just been informed of the exercise. In Fig. 21(c), the system enters the occlusion state because the face is in contact with the left hand. In the occlusion state, the face and the left hand share a common histogram for tracking; as a consequence, the face and left-hand tracking regions converge, and in Fig. 21(f) they coincide. In Fig. 21(j), the correct realization of exercise 3 is achieved after five consecutive frames with detection of the left eye and non-detection of the right eye and of the pair of eyes while the left hand is in contact with the face. Next, the system stops eye detection and has to leave the occlusion state.
Fig. 18. Diagram of region detection and tracking module in the initialization mode.
Fig. 17. Diagram of the initialization mode.
The occlusion state is left in Fig. 21(o), so that a new exercise (no. 5) can be indicated. In Fig. 21(q) the user has just been informed of exercise 5, and in Fig. 21(x) it is correctly finished.

4.3. Performance evaluation

The system was implemented in Visual C++ and tested on consumer-grade PCs with different cameras, such as the Logitech QuickCam Zoom, Philips SPC900NC, and Prosilica EC750C, at a resolution of 320 × 240 pixels and an average frame rate of 30 fps. The system was tested with 14 users in the research laboratory and at INTRAS Foundation, with unconstrained background and illumination. After the necessary training regarding the system working requirements, such as the initial frontal position of the hands and the presence of the hands throughout the video sequence, all the users found the system easy to use and motivating. Table 4 shows the performance results of the system. The 14 users performed three tests, each comprising the 12 psicomotricity exercises. Failures in exercises involving the eyes and the nose were caused by the users' tendency to turn the head while the hand moves towards the facial feature, so that in the occlusion state the face was far from frontal and the facial feature detection failed. Failures in exercises 8, 9, and
12 were caused by the initial incorrect hand detection. The overall successful monitoring percentage was 97.62%. The rehabilitation experts at INTRAS Foundation considered the system performance adequate for integrating the system into the GRADIOR platform. AdaBoost-based detection times for each classifier, without constraints on the search region or on the minimum and maximum sizes of the human body parts, are shown in Table 5. These processing times were measured on a video sequence of 320 × 240 pixels with a Pentium 4 processor at 2.8 GHz and 1 GB of RAM. The third column of Table 5 gives the detection times when the search region and size constraints explained in Section 3.2 are applied. Detection times with constraints do not apply to the face and hands because, after the initial detection, these regions are tracked with the CAMShift algorithm. CAMShift-based region tracking is applied not to the entire frame but to a rectangular region centered on the region tracked in the previous frame and enlarged by 40 pixels up, down, left, and right. These motion restrictions between consecutive frames were suitable for the 14 tested users; furthermore, CAMShift-based tracking is sped up because it operates on a frame region instead of the entire frame. With these restrictions, the average processing time of the adaptive CAMShift-based tracking is 9 ms per region. The overall average processing time per frame in the monitoring of the exercises with the highest computational load (exercises 1–4) is 43 ms, corresponding to the sum of the face and hands tracking times and the pair-of-eyes, right-eye, and left-eye detection times, which yields a frame rate of more than 23 fps. This rate is sufficient for the correct monitoring of the 12 psicomotricity exercises.
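The restricted search window mentioned above amounts to enlarging the previously tracked rectangle by 40 pixels on every side before the next CAMShift iteration, as in the following sketch; clipping the enlarged rectangle to the frame borders is the only detail added here as an assumption.

```cpp
#include <opencv2/core.hpp>

// Search window for the next CAMShift iteration: the rectangle tracked in the
// previous frame enlarged by 40 pixels on every side and clipped to the frame.
cv::Rect nextSearchWindow(const cv::Rect& previous, const cv::Size& frameSize,
                          int margin = 40) {
    const cv::Rect expanded(previous.x - margin, previous.y - margin,
                            previous.width + 2 * margin, previous.height + 2 * margin);
    return expanded & cv::Rect(0, 0, frameSize.width, frameSize.height);  // intersection clips it
}
```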
5. Conclusions

In this paper, a real-time computer vision-aided rehabilitation system applicable to the monitoring of psicomotricity exercises is presented. Monitoring is achieved through the initial detection and later tracking of the face, the right hand, and the left hand and
Fig. 19. Diagram of the monitoring analysis mode.
the detection of facial features such as the pair of eyes, the individual eyes, and the nose. The system determines the correct or incorrect realization of the exercise and the time required for the user to
carry it out. It can be deployed on a consumer-grade computer with an inexpensive Universal Serial Bus camera. The system performs human body parts motion monitoring, its analysis in
Fig. 20. Monitoring of four psicomotricity exercises.
Fig. 21. Monitoring of two psicomotricity exercises.
Table 4
Performance of the monitoring system.

Exercise                        No. 1  No. 2  No. 3  No. 4  No. 5  No. 6  No. 7  No. 8  No. 9  No. 10  No. 11  No. 12  Total
No. of monitorings               42     42     42     42     42     42     42     42     42     42      42      42     504
No. of successful monitorings    41     39     40     42     41     40     42     41     41     42      42      41     492 (97.62%)
No. of erroneous monitorings      1      3      2      0      1      2      0      1      1      0       0       1      12 (2.38%)
Table 5
Detection times of AdaBoost-based classifiers without and with constraints.

Human body part   Detection time without constraints (ms)   Detection time with constraints (ms)
Face                              65                                        –
Pair of eyes                      30                                        4
Eye                               52                                        6
Nose                              48                                        5
Right hand                        24                                        –
Left hand                         28                                        –
relation to the psicomotricity exercise indicated to the user, and the storage of the result of the realization of a set of exercises. The automation of these tasks frees the rehabilitation experts from performing them. The system works with one camera, which does not need calibration. The implementation with one camera is easier and faster than a 3D approach with two cameras. In contrast, the system does not process depth information, so a 2D occlusion of the face and a hand does not always imply a 3D occlusion. Moreover, occlusion between the two hands is not dealt with. These two limitations give rise to system working requirements that the rehabilitation experts and the users who tested the system did not
find problematic. The system is robust to different working environments and illumination conditions, tracking the human body parts in motion despite the significant color changes caused by their position relative to the illumination sources. The system was evaluated with 14 users, achieving a successful monitoring percentage of 97.62%. This performance is adequate for the integration of the system into a multimodal cognitive rehabilitation platform.

Detectors of different human body parts (the pair of eyes, the individual eyes, the nose, the right hand, and the left hand), based on the AdaBoost algorithm with Haar-like features, have been developed to achieve detection and false-alarm rates that allow their integration into the monitoring system. Constraints on the search region in the frame and on the minimum and maximum sizes of the searched body parts minimize the processing time, thus meeting real-time requirements. A detailed study of five color spaces was carried out to select the one that best characterizes skin color. The TSL color space, which achieved the best results, was used to create a LUT from characteristic values of its chromaticity components (T and S), so that every color receives a skin or non-skin label. With the chromaticity components of the TSL color space, CAMShift-based tracking has been implemented to provide adaptive (based on histograms updated in every frame), weighted (taking the background into account), and independent tracking of the face, the right hand, and the left hand. The computer vision system works at frame rates above 23 fps and represents a framework that is potentially applicable to other HCI tasks with minor changes.
Acknowledgments

This work was partially supported by the Spanish Ministry of Science and Innovation under Project TIN2007-67236 and by the Spanish Ministry of Industry, Tourism and Commerce under Project FIT-350305-2007-21. We would like to thank the people at INTRAS Foundation for their contribution and advice on the clinical focus and requirements of the computer vision system, so that it is applicable in cognitive rehabilitation and integrable in the GRADIOR platform.

References

Asteriadis S, Nikolaidis N, Pitas I. Facial feature detection using distance vector fields. Pattern Recognition 2009;42(7):1388–98.
Barczak ALC, Dadgostar F. Real-time hand tracking using a set of cooperative classifiers based on Haar-like features. Research Letters in the Information and Mathematical Sciences 2005;7:29–42.
Bradski GR. Real time face and object tracking as a component of a perceptual user interface. In: Proceedings of the fourth IEEE workshop on applications of computer vision, 1998. p. 214–9.
Chen D, Liu Z. A novel approach to detect and correct highlighted face region in color image. In: Proceedings of the IEEE conference on advanced video and signal based surveillance, 2003. p. 7–12.
Collection of facial images: Faces96. ⟨http://cswww.essex.ac.uk/mv/allfaces/faces96.html⟩. 2007.
Comaniciu D, Meer P. Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002;24(5):603–19.
Comaniciu D, Ramesh V, Meer P. Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 2003;25(5):564–75.
da Costa RMEM, de Carvalho LAV. The acceptance of virtual reality devices for cognitive rehabilitation: a report of positive results with schizophrenia. Computer Methods and Programs in Biomedicine 2004;73(3):173–82.
Daly JJ, Wolpaw JR. Brain–computer interfaces in neurological rehabilitation. The Lancet Neurology 2008;7(11):1032–43.
Edmans JA, Gladman JRF, Cobb S, Sunderland A, Pridmore T, Hilton D, et al. Validity of a virtual environment for stroke rehabilitation. Stroke 2006;37(11):2770–5.
Farkas LG. Anthropometry of the head and face. Raven Press; 1994.
Fasel I, Fortenberry B, Movellan J. A generative framework for real time object detection and classification. Computer Vision and Image Understanding 2005;98(1):182–210.
Franco-Martín MA, Orihuela-Villameriel T, Buanco-Aguado Y. Programa GRADIOR. Programa de evaluación y rehabilitación cognitiva por ordenador. Valladolid: Edintrans; 2000.
Freund Y, Schapire RE. Experiments with a new boosting algorithm. In: Proceedings of the 13th international conference on machine learning, 1996. p. 148–56.
Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 1997;55(1):119–39.
Fu Jie Huang, Tsuhan Chen. Tracking of multiple faces for human–computer interfaces and virtual environments. In: Proceedings of the IEEE international conference on multimedia and expo, vol. 3, 2000. p. 1563–6.
Holden E-J, Lee G, Owens R. Australian sign language recognition. Machine Vision and Applications 2005;16(5):312–20.
Horn BKP, Schunk BG. Determining optical flow. Artificial Intelligence 1981;17:185–203.
Hunke M, Waibel A. Face locating and tracking for human–computer interaction. In: Proceedings of the 28th Asilomar conference on signals, systems and computers, vol. 2, 1994. p. 1277–81.
INTRAS Foundation. ⟨www.intras.es⟩. 2009.
Isard M, Blake A. CONDENSATION—conditional density propagation for visual tracking. International Journal of Computer Vision 1998;29(1):5–28.
Jintae Lee, Kunii TL. Model-based analysis of hand posture. IEEE Computer Graphics and Applications 1995;15(5):77–86.
Kakumanu P, Makrogiannis S, Bourbakis N. A survey of skin-color modeling and detection methods. Pattern Recognition 2007;40(3):1106–22.
Kawaguchi T, Hidaka D, Rizon M. Detection of eyes from human faces by Hough transform and separability filter. In: Proceedings of the international conference on image processing, vol. 1, 2000. p. 49–52.
Kothari R, Mitchell JL. Detection of eye locations in unconstrained visual images. In: Proceedings of the international conference on image processing, vol. 3, 1996. p. 519–22.
Kuch JJ, Huang TS. Vision based hand modeling and tracking for virtual teleconferencing and telecollaboration. In: Proceedings of the fifth international conference on computer vision, 1995. p. 666–71.
Li SZ, Jain AK. Handbook of face recognition. New York: Springer; 2005.
Lienhart R, Kuranov A, Pisarevsky V. Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In: Proceedings of the DAGM-symposium, 2003. p. 297–304.
Lin C, Huan C, Chan C, Yeh M, Chiu C-C. Design of a computer game using an eye-tracking device for eye's activity rehabilitation. Optics and Lasers in Engineering 2004;42(1):91–108.
Lucas BD, Kanade T. An iterative image registration technique with an application to stereo vision. In: Proceedings of the imaging understanding workshop, 1981. p. 121–30.
Magee JJ, Betke M, Gips J, Scott MR, Waber BN. A human–computer interface using symmetry between eyes to detect gaze direction. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 2008;38(6):1248–61.
Marcel S. Hand posture recognition in a body–face centered space. In: Proceedings of the CHI '99 extended abstracts on human factors in computing systems, 1999. p. 302–3.
Marszalec E, Martinkauppi B, Soriano M, Pietikäinen M. A physics-based face database for color research. Journal of Electronic Imaging 2000;9(1):32–8.
McLachlan GJ. Discriminant analysis and statistical pattern recognition. New York: Wiley; 2004.
Morris T, Chauhan V. Facial feature tracking for cursor control. Journal of Network and Computer Applications 2006;29(1):62–80.
Musen MA, Shahar Y, Shortliffe EH. Clinical decision-support systems. In: Shortliffe EH, Perreault LE, Wiederhold G, Fagan LM, editors. Medical informatics: computer applications in health care and biomedicine. New York: Springer; 2001. p. 573–609.
Obrenovic Z, Abascal J, Starcevic D. Universal accessibility as a multimodal design issue. Communications of the ACM 2007;50(5):83–8.
Open Computer Vision Library. ⟨http://sourceforge.net/projects/opencvlibrary⟩. 2009.
Phung SL, Bouzerdoum A, Chai D. Skin segmentation using color pixel classification: analysis and comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 2005;27(1):148–54.
Rand D, Kizony R, Weiss PL. Virtual reality rehabilitation for all: Vivid GX versus Sony PlayStation II EyeToy. In: Proceedings of the fifth international conference on disability, virtual reality and associated technologies, 2004. p. 87–94.
Rousseeuw PJ. Multivariate estimation with high breakdown point. In: Grossmann W, Pflug G, Wertz W, editors. Mathematical statistics and applications. Dordrecht: Reidel Publishing; 1985. p. 283–97.
Rousseeuw PJ, Van Driessen K. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999;41:212–23.
Rowley HA, Baluja S, Kanade T. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 1998;20(1):23–38.
Schmugge SJ, Zaffar MA, Tsap LV, Shin MC. Task-based evaluation of skin detection for communication and perceptual interfaces. Journal of Visual Communication and Image Representation 2007;18(6):487–95.
Schneiderman H, Kanade T. A statistical method for 3D object detection applied to faces and cars. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol. 1, 2000. p. 746–51.
Shan C, Tan T, Wei Y. Real-time hand tracking using a mean shift embedded particle filter. Pattern Recognition 2007;40(7):1958–70.
Smith P, da Vitoria Lobo N, Mubarak S. Resolving hand over face occlusion. Image and Vision Computing 2007;25(9):1432–48.
Song J, Chi Z, Liu J. A robust eye detection method using combined binary edge and intensity information. Pattern Recognition 2006;39(6):1110–25.
Swain MJ, Ballard DH. Color indexing. International Journal of Computer Vision 1991;7(1):11–32.
Terrillon J-C, David M, Akamatsu S. Automatic detection of human faces in natural scene images by use of a skin color model and of invariant moments. In: Proceedings of the third IEEE international conference on automatic face and gesture recognition, 1998. p. 112–7.
The BioID face database. ⟨http://www.bioid.com/downloads/facedb⟩. 2009.
Turk M, Kölsch M. Perceptual interfaces. In: Medioni G, Kang SB, editors. Emerging topics in computer vision. Prentice-Hall; 2004. p. 456–520.
Varona J, Manresa-Yee C, Perales FJ. Hands-free vision-based interface for computer accessibility. Journal of Network and Computer Applications 2008;31(4):357–74.
Viola P, Jones M. Rapid object detection using a boosted cascade of simple features. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, vol. 1, 2001. p. 511–8.
Viola P, Jones MJ. Robust real-time face detection. International Journal of Computer Vision 2004;57(2):137–54.
Wang P, Ji Q. Multi-view face and eye detection using discriminant features. Computer Vision and Image Understanding 2007;105(2):99–111.
Williams DJ, Shah M. A fast algorithm for active contours and curvature estimation. CVGIP: Image Understanding 1991;55(1):14–26.
Wilson PI, Fernandez J. Facial feature detection using Haar classifiers. Journal of Computing Sciences in Colleges 2006;21(4):127–33.
Xu C, Tan T, Wang Y, Quan L. Combining local features for robust nose location in 3D facial data. Pattern Recognition Letters 2006;27(13):1487–94.
Yang M, Ahuja N. Detecting human faces in color images. In: Proceedings of the international conference on image processing, vol. 1, 1998. p. 127–30.
Yang M, Kriegman DJ, Ahuja N. Detecting faces in images: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002;24(1):34–58.
Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 1995;17(8):790–9.
Yong M, Xiaoqing D, Zhenger W, Ning W. Robust precise eye location under probabilistic framework. In: Proceedings of the sixth IEEE international conference on automatic face and gesture recognition, 2004. p. 339–44.
Zhou H, Hu H. Human motion tracking for rehabilitation—a survey. Biomedical Signal Processing and Control 2008;3(1):1–18.
Zhou Z, Geng X. Projection functions for eye detection. Pattern Recognition 2004;37(5):1049–56.