J. Vis. Commun. Image R. 19 (2008) 382–391
Efficient multiple faces tracking based on Relevance Vector Machine and Boosting learning

Shuhan Shen *, Yuncai Liu

Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai 200240, China
Article info

Article history: Received 8 June 2007; Accepted 18 June 2008; Available online 28 June 2008

Keywords: Face tracking; Face detection; Multiple faces tracking; Real-time tracking; Probabilistic algorithms; Relevance Vector Machine; Boosting; AdaBoost
Abstract

A multiple faces tracking system is presented based on Relevance Vector Machine (RVM) and Boosting learning. In this system, a face detector based on Boosting learning detects faces in the first frame, and the face motion model and color model are created. The face motion model consists of a set of RVMs that learn the relationship between the motion of the face and its appearance, and the face color model is the 2D histogram of the face region in CrCb color space. In the tracking process, different tracking methods (RVM tracking, local search, giving up tracking) are used according to the different states of the faces, and the states are changed according to the tracking results. When the full image search condition is satisfied, a full image search is started in order to find newly appearing faces and formerly occluded faces. In the full image search and local search, the similarity matrix is introduced to help match faces efficiently. Experimental results demonstrate that this system can (a) automatically find newly appearing faces; (b) recover from occlusion, for example, when faces are occluded by others and reappear, or leave the scene and return; and (c) run with high computational efficiency, at about 20 frames/s.

© 2008 Elsevier Inc. All rights reserved.
1. Introduction

Visual tracking in image sequences is a major research topic in computer vision, with many potential applications such as intelligent visual surveillance, human–computer interaction, and video coding. Tracking can be considered equivalent to establishing coherent relations of image features between frames with respect to position, velocity, shape, texture, color, etc. [1]. In face tracking research, the common approaches are skin-color based tracking [2–4], shape based tracking [5–8], feature based tracking [9–11] and machine learning based tracking [12–15].

Skin-color based methods are popular in face tracking and usually have a low computational cost. Schwerdt and Crowley [2] present a system based on a skin-color model for tracking a single face and controlling the camera by panning, tilting, and zooming to keep the face centered. del Solar et al. [3] use background subtraction and skin detection to track multiple faces. Lerdsudwichai et al. [4] use a non-parametric distribution to represent the colors of the face region and the mean shift algorithm to track multiple faces. However, when the illumination changes drastically or the background contains skin-like colors, skin-color based methods easily fail to track the face.
This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 60675017, and the National 973 Key Basic Research Program of China under Grant 2006CB303103.

* Corresponding author. Fax: +86 21 34204340. E-mail addresses: [email protected] (S. Shen), [email protected] (Y. Liu).
Shape based methods are not influenced by background color and illumination changes. Birchfield [5] models the face as an ellipse whose position and size are continually updated by a local search using the intensity gradient around the ellipse. Pardas and Sayrol [6] use active contours to track a single face with large displacements of the contour. Bing et al. [7] use the gradient angular difference to constrain the Snake contour to the face region, and propose a motion compensation method to accelerate face contour tracking. However, shape based methods are fragile in highly cluttered backgrounds. To increase robustness to background and lighting, Lee and Kim [8] integrate two condensation trackers that use skin color and facial shape as observation measures, but this method can only track one face and requires much more computation than a skin-color based or shape based tracker alone.

Feature based methods use extracted facial features as the cues for tracking. Maurer and von der Malsburg [9] use Gabor filters as the visual features to track faces. Liang et al. [10] build a 3D face model based on 2D facial features and estimate the locations of the 2D facial features for the next frame using Kalman filters. Von Duhn et al. [11] propose a three-view based video tracking and model creation algorithm based on the active appearance model and a generic facial feature model. However, these methods are often computationally expensive and hard to implement in real-time applications.

Machine learning based methods have attracted more and more researchers in recent years. Active appearance models proposed by Cootes et al. [12] pioneered this approach in the 2D case,
and they have since been extended to 3D by Matthews and Baker [13]. However, the fitting process of AAM occasionally fails and still does not run in real time. Avidan [14] presents Support Vector Tracking (SVT), which integrates the Support Vector Machine (SVM) classifier into an optic-flow-based tracker; tracking is achieved by maximizing the SVM classification score. However, this method cannot handle occlusions and is only useful for single-object tracking. Williams et al. [15] use the Relevance Vector Machine (RVM) to build a displacement expert that directly estimates displacement from the target region. This algorithm is computationally efficient and very robust, but it can only track one target.

This paper extends the work of Williams et al. [15] and presents a fully automatic multiple faces tracking system based on RVM and Boosting learning. In our system, each face is modeled by a motion model and a color model. The motion model is obtained by training a set of RVMs to learn the relationship between the motion of the face and its appearance in the image, and the color model is obtained by creating the 2D histogram in the CrCb color space. In the tracking process, different tracking methods are used according to the face states, and the states are changed according to the tracking results. When the full image search condition (often a certain time interval) is satisfied, a full image search is started in order to find newly appearing faces and formerly occluded faces. Experimental results demonstrate the following properties of the system: (1) it automatically finds newly appearing faces; (2) it can recover from occlusion, for example, when faces are occluded by others and reappear or leave the scene and return; (3) it is computationally efficient, running at about 20 frames/s.

The rest of the paper is organized as follows. Section 2 gives the background knowledge of the system. Section 3 gives an overview of our multiple faces tracking system. Sections 4–6 describe the details of the system. Section 7 contains the complete algorithm. Section 8 shows the experimental results. Finally, the conclusion follows in Section 9.

2. Background

In this section, we describe the background knowledge of our system. First we give a short review of the Relevance Vector Machine. Then we discuss Williams et al.'s work [15] on single face tracking based on RVM, which is the foundation of our work.

2.1. Relevance Vector Machine

The Relevance Vector Machine (RVM) proposed by Tipping is a model of identical functional form to the popular Support Vector Machine (SVM) [16]. The learning of the RVM is carried out under the Bayesian framework, and it yields a probabilistic output. The output of the RVM is:
y(x; W) = Σ_{i=1}^{N} w_i k(x, z_i) + w_0 = W^T K    (1)
where x is the input vector, {z_1, ..., z_N} are the training examples, W = [w_0, ..., w_N]^T is the weights vector, and K = [1, k(x, z_1), ..., k(x, z_N)]^T is the kernel function vector. In the RVM, there is no need for Mercer kernels and no error/margin trade-off parameter [17]. When the RVM is used for regression under the Bayesian framework, an independent zero-mean Gaussian prior is specified over each weight w_i, w_i ~ N(0, α_i^{-1}), so the distribution over the weights vector W is also Gaussian:
p(W|α) = Π_{i=0}^{N} p(w_i|α_i) = Π_{i=0}^{N} N(0, α_i^{-1})    (2)
where α = [α_0, ..., α_N]^T is a vector of N + 1 hyper-parameters that should be estimated in the learning process. The training targets are assumed to be sampled with additive noise:
t_i = y(z_i; W) + ε_i    (3)
where the ε_i are independent zero-mean Gaussian noise terms, ε_i ~ N(0, σ²), and σ² is another hyper-parameter that should be estimated in the learning process. Obviously, the distribution over the targets vector T = [t_1, ..., t_N]^T is also Gaussian:
p(T|W, σ²) = Π_{i=1}^{N} p(t_i|W, σ²)    (4)
The posterior distribution over the weights is thus given by:
p(W|T, α, σ²) = p(T|W, σ²) p(W|α) / ∫ p(T|W, σ²) p(W|α) dW = N(μ, Σ)    (5)
where μ = σ^{-2} Σ Φ^T T, Σ = (σ^{-2} Φ^T Φ + A)^{-1}, A = diag(α_0, ..., α_N), Φ = [φ(z_1), ..., φ(z_N)]^T, and φ(z_i) = [1, k(z_i, z_1), ..., k(z_i, z_N)]^T. For prediction, W can then be fixed to μ. The value of μ depends on the hyper-parameters α and σ², which are obtained by maximizing the marginal likelihood p(T|α, σ²). In this paper, we use the fast marginal likelihood maximization method presented in [18,19]. For the full details of the RVM and the fast maximization method, see Refs. [16,18,19].

2.2. Single face tracking based on RVM [15]

Williams et al. [15] proposed a single face tracking framework using sparse probabilistic regression by RVMs, with temporal fusion for high efficiency and robustness. The core idea of this work is to train a set of RVMs such that they learn the relationship between images and motion. For training the RVMs, one seed image I_seed containing the labeled face region λ is used. The training set is generated by moving the face region λ in the seed image I_seed using random Euclidean displacements sampled from a uniform distribution. Construction of the training set and training of the RVMs take only a few seconds on a desktop PC.

In the tracking process, owing to the principal advantage of the RVM over the SVM, namely that the RVM does not just estimate the change in state but generates an output probability distribution, the probabilistic regression output of the RVMs can be treated as a Gaussian innovation and incorporated into a Kalman–Bucy filter, which is very efficient. The tracking result is validated by a classifier, and an absence of verification triggers a search in which a face detector is used to find the face in the whole image. The overall algorithm of [15] can be summarized as follows:
1: Generate training sets using the seed image I_seed, and train the RVMs;
2: mode ← EXPERT;
3: I ← new image from the capture device;
4: If mode = SEARCH, go to step 8;
5: Track the face using RVM regression incorporated with a Kalman filter;
6: Test the tracking result with the validator. If the test is passed, let mode ← EXPERT, otherwise let mode ← SEARCH;
7: Go to step 3;
8: Search the whole image using the face detector. If a face is found, let mode ← EXPERT, otherwise let mode ← SEARCH;
9: Go to step 3;
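For concreteness, the following is a small numerical sketch (ours, in Python/NumPy, not code from [15] or [16]) of the RVM computations used above: the weight posterior of Eq. (5) and the point prediction of Eq. (1), assuming the hyper-parameters α and σ² have already been estimated by marginal likelihood maximization [18,19]. For the kernel we use, purely for illustration, the Gaussian kernel that Section 8.1 later adopts.

```python
import numpy as np

def gaussian_kernel(a, b, s=0.8):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 M s^2)), M = input dimension."""
    M = a.shape[-1]
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * M * s ** 2))

def rvm_posterior(Z, t, alpha, sigma2):
    """Weight posterior N(mu, Sigma) of Eq. (5), given training inputs Z (N x M),
    targets t (N,), and already-estimated hyper-parameters alpha (N+1,) and sigma2."""
    N = Z.shape[0]
    K = gaussian_kernel(Z[:, None, :], Z[None, :, :])          # N x N kernel matrix
    Phi = np.hstack([np.ones((N, 1)), K])                      # Phi[i] = phi(z_i)^T
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))
    mu = Sigma @ Phi.T @ t / sigma2
    return mu, Sigma

def rvm_predict(x, Z, mu, Sigma, sigma2):
    """Point prediction of Eq. (1) with W fixed to mu, plus the predictive variance."""
    K = np.concatenate([[1.0], gaussian_kernel(x[None, :], Z)])  # [1, k(x,z_1), ..., k(x,z_N)]
    return mu @ K, K @ Sigma @ K + sigma2
```

In practice many α_i grow very large during hyper-parameter estimation and the corresponding basis functions are pruned, which is what makes the resulting regressor sparse and fast to evaluate.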
Fig. 1. Block diagram of the multiple faces tracking system (Video In, Face Tracking and Full Image Search blocks, the Face Models and Face States databases, and the triggers K1 and K2).
This algorithm is very effective and robust for tracking a single face, but it is not straightforward to generalize it to multiple faces, for the following reasons: (1) the RVMs are trained only once, for one face, beforehand. This is possible when tracking only one face, but for tracking multiple faces the ability to find new faces and train new RVMs online is necessary; (2) in the tracking process, if the face detector finds multiple faces, there is no mechanism to distinguish the previously tracked face from the other faces; (3) multiple faces tracking may suffer from both self-occlusion and inter-occlusion, which also makes this algorithm hard to generalize directly. In order to solve these problems, we extend Williams et al.'s work [15] and propose a fully automatic multiple faces tracking system.
3. System overview

The block diagram of our system is shown in Fig. 1. At the start of the system, K1 in Fig. 1 triggers the Full Image Search block, in which a face detector is used to detect faces in the input image. Once faces are detected, their models are learned and recorded in the Face Models database, and their initial states are recorded in the Face States database. Then K1 triggers the Face Tracking block to start tracking.

The Face Models database records the models of the tracked faces; each model consists of a motion model and a color model. The Face States database records the states of the tracked faces. The state of a face can be one of the following three states: RVM, BOOST and OCCLUDED. The RVM state means that the face was tracked correctly using the RVM tracking method in the current frame, and the RVM tracking method will be used in the next frame for this face. The BOOST state means that RVM tracking failed in the current frame and the local search method will be used in the next frame. The OCCLUDED state means that the face was occluded in the current frame and the system will not track the face in the next frame. The state of a face is changed according to the tracking result, as shown in Fig. 2.

In the tracking process, if the full image search condition is satisfied, K2 in Fig. 1 triggers the Full Image Search block, in which a face detector is used to detect faces in the current frame. If faces in state OCCLUDED or BOOST are found in the full image search, their states are changed to RVM according to Fig. 2. If new faces are found in the full image search, the system learns their models, records the models in the Face Models database, and records their initial states in the Face States database.

In our system, a face detector is needed whenever a local search or full image search is carried out. Here we employ a face detector based on Boosting learning in these situations. Viola et al. [20,21] proposed the face detector using AdaBoost, which is a very efficient method to detect faces in still images. This method selects a small number of critical visual features from a larger set and yields extremely efficient classifiers. Furthermore, it combines increasingly more complex classifiers in a "cascade", which allows background regions of the image to be quickly discarded while spending more computation on promising object-like regions. Lienhart and Maydt [22] introduce a novel set of rotated Haar-like features under the AdaBoost paradigm in order to detect rotated faces. Li et al. [23] present a multi-view face detector based on FloatBoost learning, which requires fewer weak classifiers than AdaBoost and has lower error rates in both training and testing. In this paper, we employ the face detector based on AdaBoost in both local search and full image search.
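The per-face state logic described above and summarized in Fig. 2 can be pictured as a small state machine. The following sketch (ours; the names and the exact flag set are assumptions, not the paper's code) encodes the transitions driven by the RVM-tracking, local-search and full-image-search outcomes.

```python
from enum import Enum

class FaceState(Enum):
    RVM = 1        # tracked correctly; use RVM tracking in the next frame
    BOOST = 2      # RVM tracking failed; use local search in the next frame
    OCCLUDED = 3   # face occluded; wait for a full image search

def next_state(state, rvm_ok=False, found_in_local=False,
               found_in_full=False, boost_timeout=False):
    """One step of the transition diagram of Fig. 2."""
    if state is FaceState.RVM:
        return FaceState.RVM if rvm_ok else FaceState.BOOST
    if state is FaceState.BOOST:
        if found_in_local or found_in_full:
            return FaceState.RVM
        return FaceState.OCCLUDED if boost_timeout else FaceState.BOOST
    # OCCLUDED: only a full image search can bring the face back
    return FaceState.RVM if found_in_full else FaceState.OCCLUDED
```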
Fig. 2. Face state transition diagram (transitions among the RVM, BOOST and OCCLUDED states, driven by RVM tracking success or failure, local/full image search results, and the duration spent in BOOST).
4. Face model

In our system, a face model is learned whenever a new face is detected. The face model consists of a motion model and a color model. The motion model is obtained by training a set of RVMs, and the color model is obtained by creating the 2D histogram in CrCb color space.
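As a rough illustration only (the field names are ours, not the paper's), the per-face record kept in the Face Models database can be pictured as follows; the two halves correspond to Sections 4.1 and 4.2.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FaceModel:
    """Illustrative per-face record for the Face Models database."""
    # Motion model (Section 4.1): one RVM per motion component
    Z: np.ndarray        # training example vectors, needed to evaluate the kernel vector K
    W_H: np.ndarray      # RVM weights for horizontal translation
    W_V: np.ndarray      # RVM weights for vertical translation
    W_S: np.ndarray      # RVM weights for scale change
    Sigma_H: np.ndarray  # posterior weight covariances of the three RVMs
    Sigma_V: np.ndarray
    Sigma_S: np.ndarray
    sigma2_H: float      # estimated noise variances of the three RVMs
    sigma2_V: float
    sigma2_S: float
    # Color model (Section 4.2)
    H_M: np.ndarray      # 2D histogram over (Cr, Cb)
```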
4.1. Face motion model

The face motion model consists of a set of RVMs, and each RVM is trained to learn the relationship between one motion of the face and its appearance in the image [15]. In this paper, the motion model has three RVMs, y_H(x; W_H), y_V(x; W_V) and y_S(x; W_S), corresponding to horizontal translation t_H, vertical translation t_V and scale change t_S, respectively. The ranges of the three motions are t_H ∈ [−Δ_H, Δ_H], t_V ∈ [−Δ_V, Δ_V] and t_S ∈ [−Δ_S, Δ_S]. The training set for each RVM consists of N training examples, generated by moving the face region λ in the image I using random motions t_H, t_V and t_S, respectively. After the training sets are obtained, the parameters of the three RVMs can be learned: {W_H, Σ_H}, {W_V, Σ_V} and {W_S, Σ_S}, where W_H, W_V, W_S are the weights vectors of the three RVMs and Σ_H, Σ_V, Σ_S are their covariances. Algorithm 1 shows the learning algorithm of the face motion model.

Algorithm 1. Learn face motion model
0. (INPUT)
   (1) Current image, I;
   (2) Face region, λ;
   (3) Number of training examples, N;
   (4) Motion ranges, Δ_H, Δ_V and Δ_S;
1. (GENERATE TRAINING SETS)
   for i = 1 to N do
     (1) Generate random motions t_H^i, t_V^i and t_S^i uniformly within their motion ranges;
     (2) Move λ horizontally, vertically and change its scale in I using t_H^i, t_V^i and t_S^i, respectively, then sample λ from I into a patch u;
     (3) Change u into a gray-scale image and perform histogram equalization;
     (4) Raster scan u into a vector z_i;
   end for
2. (TRAINING) Train three RVMs to learn the relationship between the motion and the appearance, i.e., obtain W_H, Σ_H; W_V, Σ_V; W_S, Σ_S by training on {t_H^i, z_i}, {t_V^i, z_i} and {t_S^i, z_i}, respectively. The training algorithm for the RVM is described in [16];
3. (OUTPUT) The face motion model: W_H, Σ_H; W_V, Σ_V; W_S, Σ_S.

4.2. Face color model

The color model records the skin distribution of the face region. To reduce the color sensitivity to lighting variations, we use the lighting compensation method of Refs. [24,4]. This method uses a "reference white" to normalize the color appearance: pixels with the top 5 percent of luma (non-linear gamma-corrected luminance) values in the image are regarded as the reference white, provided the number of such pixels is sufficiently large. The R, G and B components of the face are then adjusted so that the average gray value of these reference-white pixels is linearly scaled to 255. The image is not changed if a sufficient number of reference-white pixels is not detected [24].
After lighting compensation, the image is converted into the YCrCb color space and the 2D color histogram in CrCb space is used as the face color model. The learning algorithm of the face color model is shown in Algorithm 2.

Algorithm 2. Learn face color model
0. (INPUT)
   (1) Current image, I;
   (2) Face region, λ;
1. (LIGHTING COMPENSATION) Sample λ from I into a patch u, and perform lighting compensation;
2. (2D HISTOGRAM) Convert u into YCrCb space and calculate the 2D histogram H_M in CrCb color space;
3. (OUTPUT) The face color model, H_M.
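The following is a minimal sketch of Algorithm 2 in Python/NumPy (ours, not the authors' code). It assumes RGB input, a standard BT.601 RGB-to-YCrCb conversion, an arbitrary choice of 64 histogram bins, and a guessed value for "sufficiently many" reference-white pixels; the normalized histogram correlation used later in Algorithm 4 is included for completeness.

```python
import numpy as np

def lighting_compensation(img, top=0.05):
    """'Reference-white' compensation [24]: scale R, G, B so that the average value of the
    top 5% luma pixels becomes 255 (image left unchanged if too few such pixels)."""
    img = img.astype(np.float64)
    # BT.601 luma, used here as a stand-in for the paper's non-linear luma
    luma = 0.299 * img[..., 0] + 0.587 * img[..., 1] + 0.114 * img[..., 2]
    ref = luma >= np.quantile(luma, 1.0 - top)
    if ref.sum() < 100:                      # "sufficiently large" is not specified; 100 is a guess
        return img
    scale = 255.0 / img[ref].mean()
    return np.clip(img * scale, 0, 255)

def crcb_histogram(img, bins=64):
    """2D color histogram over (Cr, Cb) after an RGB-to-YCrCb (BT.601) conversion."""
    R, G, B = img[..., 0], img[..., 1], img[..., 2]
    Cr = 0.5 * R - 0.4187 * G - 0.0813 * B + 128.0
    Cb = -0.1687 * R - 0.3313 * G + 0.5 * B + 128.0
    H, _, _ = np.histogram2d(Cr.ravel(), Cb.ravel(), bins=bins, range=[[0, 256], [0, 256]])
    return H

def histogram_similarity(Hu, Hm):
    """Normalized correlation between two histograms (the similarity s of Algorithm 4)."""
    a, b = Hu - Hu.mean(), Hm - Hm.mean()
    return float((a * b).sum() / np.sqrt((a ** 2).sum() * (b ** 2).sum()))
```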
5. Face tracking

In the tracking process, our system uses the motion model to track a face if the state of the face is RVM, and the tracking result is then validated. If the validator gives a positive response, the state of the face remains RVM; otherwise the negative response changes the state to BOOST and triggers a local search in the next frame.

5.1. Face tracking based on motion model

Denote the region of the face at time instance t by λ_t. If the state of the face is RVM, the position of the face is predicted based on the face motion model. The face region λ_{t−1} at time instance t − 1 is used as the initial region at time instance t, and the displacements of λ_{t−1} from the true region λ_t are estimated as:
t_H = W_H^T K,   t_V = W_V^T K,   t_S = W_S^T K    (6)
where K = [1, k(x, z_1), ..., k(x, z_N)]^T is the kernel function vector as shown in Eq. (1). The variances of the displacements are:
s_H = K^T Σ_H K + σ_H²,   s_V = K^T Σ_V K + σ_V²,   s_S = K^T Σ_S K + σ_S²    (7)
Let T = [t_H, t_V, t_S]^T; then T is Gaussian distributed as N(XK, S), where X = [W_H, W_V, W_S]^T and S = diag(s_H, s_V, s_S). Since T gives the displacements of λ_{t−1} from λ_t, λ_t can be calculated using a Kalman filter [15], and the Kalman state formulation is:
λ_t = Fλ_{t−1} + m,   m ~ N(0, Q)    (8)
where F and Q can be learned from a hand-labeled motion sequence. The face tracking based on the motion model is shown in Algorithm 3.

Algorithm 3. Face tracking based on motion model (RVM tracking)
0. (INPUT)
   (1) Current image, I;
   (2) Face region at time instance t − 1, λ_{t−1};
   (3) Face region covariance at time instance t − 1, Λ_{t−1};
1. (PREDICTION)
   λ_t ← Fλ_{t−1};
   Λ_t ← FΛ_{t−1}F^T + Q;
2. (RVM REGRESSION)
   (1) Sample λ_t from I into a patch u;
   (2) Change u into a gray-scale image and perform histogram equalization;
   (3) Raster scan u into a vector x;
   (4) Calculate the displacement T and covariance S using Eqs. (6) and (7);
3. (KALMAN GAIN)
   G ← Λ_t[Λ_t + S]^{−1};
4. (CORRECTION)
   λ_t ← λ_t + GT;
   Λ_t ← Λ_t − GΛ_t;
5. (OUTPUT) Face region at time instance t and its covariance: λ_t, Λ_t.
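A condensed sketch of Algorithm 3 (our reading of Eqs. (6)–(8), not the authors' code) is given below. It assumes the face region is parameterized as a 3-vector of horizontal position, vertical position and scale, and that the caller supplies a sample_patch helper performing steps 2(1)–2(3); the Gaussian kernel of Section 8.1 is used for concreteness.

```python
import numpy as np

def rvm_tracking_step(lam, Lam, F, Q, image, Z, models, sample_patch, s=0.8):
    """One step of Algorithm 3 (sketch).  lam: face region as [x, y, scale] (assumed
    parameterization); Lam: its covariance; Z: RVM training example vectors (N x M);
    models: dict mapping 'H', 'V', 'S' to (W, Sigma, sigma2) for each motion RVM;
    sample_patch(image, lam): gray-scale, equalize and rasterize the region into a vector."""
    # 1. Prediction (Eq. (8))
    lam = F @ lam
    Lam = F @ Lam @ F.T + Q
    # Patch at the predicted region and its kernel vector K = [1, k(x, z_1), ..., k(x, z_N)]
    x = sample_patch(image, lam)
    M = Z.shape[1]
    K = np.concatenate([[1.0], np.exp(-np.sum((Z - x) ** 2, axis=1) / (2.0 * M * s ** 2))])
    # 2. RVM regression: displacement mean and variance per motion (Eqs. (6) and (7))
    T, S = [], []
    for key in ("H", "V", "S"):
        W, Sigma, sigma2 = models[key]
        T.append(W @ K)
        S.append(K @ Sigma @ K + sigma2)
    T, S = np.array(T), np.diag(S)
    # 3. Kalman gain and 4. correction
    G = Lam @ np.linalg.inv(Lam + S)
    return lam + G @ T, Lam - G @ Lam
```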
5.2. Tracking result validation

The tracking result λ_t generated by Algorithm 3 should be validated, and the face state is reset according to the validation result. For a successful tracking, on the one hand the tracking result should contain a face; on the other hand the face should be the same as the one tracked in the last frame. Here we use two validators to test λ_t. The first validator uses the face detector based on AdaBoost to find a face in λ_t, and gives a positive response if a face is found. The second validator uses the face color model to measure the similarity between λ_t and the face color model H_M, and gives a positive response if the similarity is bigger than a threshold θ₁. The tracking result is considered to be successful if both validators give positive responses. The tracking result validation is shown in Algorithm 4.

Algorithm 4. Tracking result validation
0. (INPUT)
   (1) Current image, I;
   (2) Tracking result, λ_t;
   (3) Similarity threshold, θ₁;
1. (VALIDATE USING FACE DETECTOR)
   (1) Detect face in λ_t using the face detector based on AdaBoost;
   (2) If λ_t contains a face, test1 ← PASS, otherwise test1 ← not PASS;
2. (VALIDATE USING COLOR MODEL)
   (1) Sample λ_t from I into a patch u, and perform lighting compensation for u;
   (2) Convert u into YCrCb color space and calculate the 2D histogram H_u in CrCb color space;
   (3) Calculate the similarity, s, between H_u and the face color model H_M:

   s = [ Σ_m Σ_n (H_u(m, n) − H̄_u)(H_M(m, n) − H̄_M) ] / sqrt( [ Σ_m Σ_n (H_u(m, n) − H̄_u)² ] · [ Σ_m Σ_n (H_M(m, n) − H̄_M)² ] )

   where H̄_u and H̄_M are the mean values of H_u and H_M, respectively;
   (4) If s > θ₁, test2 ← PASS, otherwise test2 ← not PASS;
3. (OUTPUT)
   if test1 = test2 = PASS then Tracking successful; else Tracking failed. end if

6. Face matching

In our system, a full image search is started to find new faces and formerly occluded faces when the full image search condition is satisfied, and a local search is started at each time instance for every face whose state is BOOST. The face detector based on AdaBoost is employed in both kinds of search. It may find multiple faces in the search process, so we need to match the detected faces to the stored faces in the face models database. In this section, we first discuss how to determine the local search area in the local search process. Then we introduce the similarity matrix, which is used to help match faces efficiently, and finally we give the face matching algorithms based on the similarity matrix.

6.1. Local search area

Denote the face region at time instance t by λ_t, and assume the validation result is a tracking failure, so the state of the face is changed from RVM to BOOST according to Fig. 2. At time instance t + 1, the system searches for the face in an extended area around λ_t. Here we define a new variable, lost_num, which is the number of frames the face state has stayed in BOOST. Denote the width and height of λ_t by w and h, respectively, and denote the local search area of the face by λ_search. For a face in state BOOST, the center of λ_search is the same as that of λ_t, and its size is W × H, calculated as:

W = (lost_num + 1) · w,   H = (lost_num + 1) · h    (9)

where W and H are the width and height of λ_search, respectively. Fig. 3 shows the local search area of a face in three consecutive frames, where the big rectangle is the image, the small rectangle is λ_t, and the dashed rectangle is λ_search. The text at the top of each image shows the transition of the face state in the current frame (for example, RVM → BOOST means that the face state changed from RVM to BOOST in the current frame).

6.2. Similarity matrix

The face detector may find multiple faces in a full image search or local search, and we should consider three situations when matching the detected faces to the stored faces in the face models database.
Fig. 3. Local search area.
(1) For a face in state RVM, a detected face matches it if they have the same region;
(2) For a face in state BOOST, only the detected faces in its local search area may match it;
(3) For a face in state OCCLUDED, all detected faces may match it.

To match faces efficiently under the three constraints above, we introduce the similarity matrix. The similarity matrix, C, is an m × n matrix, where m is the number of detected faces and n is the number of stored faces in the face models database. The element c_ij of C is the similarity between the ith detected face and the jth stored face. Algorithm 5 shows the calculation of the similarity matrix C.

Algorithm 5. Calculation of the similarity matrix C
0. (INPUT)
   (1) Number of detected faces, m;
   (2) Detected faces, f_detect^1, ..., f_detect^m;
   (3) Number of stored faces, n;
   (4) Stored faces, f_store^1, ..., f_store^n;
1. (CALCULATE SIMILARITY MATRIX)
   for i = 1 to m
     for j = 1 to n
       if the state of f_store^j is RVM & f_detect^i and f_store^j have the same region (over 75% overlap) then
         the elements of C in row i and column j are all set to 0, and c_ij = 1;
       else if the state of f_store^j is BOOST & f_detect^i is not in the local search area of f_store^j then
         c_ij = 0;
       else
         calculate the similarity, s, between f_store^j and f_detect^i using the face color model (see Algorithm 4), and let c_ij = s;
       end if
     end for
   end for
2. (OUTPUT) Similarity matrix, C.

Fig. 4 shows two similarity matrices calculated using Algorithm 5. In each matrix, the detected faces are shown on the left, and the stored faces in the face models database are shown at the top with their states.

6.3. Face matching using similarity matrix

The similarity matrix C reflects the similarity between detected faces and stored faces under the constraints of the face states. In a full image search, we use C to find whether faces in state OCCLUDED have reappeared and whether new faces have appeared in the scene. In a local search, we use C to find whether the face in state BOOST can be found in its local search area. Apparently, the maximum element in each row of C is a matching candidate. Assume c_ij^max is the biggest element in the ith row of C, and let θ₂ be a threshold. If c_ij^max > θ₂, we consider the ith detected face and the jth stored face to be matched, and the state of the jth stored face is changed to RVM according to Fig. 2. In a full image search, if c_ij^max ≤ θ₂, none of the stored faces match the ith detected face, so it is a new face in the scene, and the system learns its model and records it in the face models database. Algorithms 6 and 7 are the face matching algorithms in full image search and local search, respectively.
Fig. 4. Similarity matrix (two examples computed with Algorithm 5; detected faces as rows, stored faces with their states as columns).
Algorithm 6. Face matching in full image search
0. (INPUT)
   (1) Number of detected faces, m;
   (2) Detected faces, f_detect^1, ..., f_detect^m;
   (3) Number of stored faces, n;
   (4) Stored faces, f_store^1, ..., f_store^n;
   (5) Similarity matrix, C;
   (6) Similarity threshold, θ₂;
1. (FACE MATCHING)
   for i = 1 to m do
     c_ij^max ← the max value in the ith row of C;
     if c_ij^max > θ₂ then
       f_detect^i and f_store^j are matched, and the state of f_store^j is changed to RVM;
     else
       f_detect^i is a new face; learn its face model using Algorithms 1 and 2. The state of the new face is initialized as RVM.
     end if
   end for

Algorithm 7. Face matching in local search
0. (INPUT)
   (1) Number of detected faces, m;
   (2) Detected faces, f_detect^1, ..., f_detect^m;
   (3) The stored face, f_boost, whose state is BOOST and which triggered this local search;
   (4) Similarity matrix, C;
   (5) Similarity threshold, θ₂;
1. (MATCH f_boost)
   c_ij^max ← the max value in the column of C corresponding to f_boost;
   if c_ij^max > θ₂ then
     f_boost and f_detect^i are matched, and the state of f_boost is changed to RVM;
   end if
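A compact sketch of Algorithms 5 and 6 is given below (ours, not the authors' code). The overlap test, the search-area test and the color similarity are passed in as helper callables, and each stored face is assumed to carry a state attribute; all of these are assumptions made for illustration.

```python
import numpy as np

def similarity_matrix(detected, stored, overlap, in_search_area, color_similarity):
    """Algorithm 5 (sketch): C[i, j] is the state-constrained similarity between
    detected face i and stored face j."""
    C = np.zeros((len(detected), len(stored)))
    for i, d in enumerate(detected):
        for j, f in enumerate(stored):
            if f.state == 'RVM' and overlap(d, f) > 0.75:
                C[i, :] = 0.0          # same region: treat this pair as a certain match,
                C[:, j] = 0.0          # ruling out every other pairing in row i / column j
                C[i, j] = 1.0
            elif f.state == 'BOOST' and not in_search_area(d, f):
                C[i, j] = 0.0
            else:
                C[i, j] = color_similarity(d, f)   # histogram correlation of Algorithm 4
    return C

def match_full_image_search(detected, stored, C, theta2, learn_new_face):
    """Algorithm 6 (sketch): match each detected face to its best stored face,
    or register it as a new face when no similarity exceeds theta2."""
    for i, d in enumerate(detected):
        j = int(np.argmax(C[i]))
        if C[i, j] > theta2:
            stored[j].state = 'RVM'                # matched: resume RVM tracking
        else:
            stored.append(learn_new_face(d))       # new face: learn and store its model
```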
7. Complete algorithm

We now give the complete algorithm of our system, shown in Algorithm 8. The system requires the following parameters: the kernel function k(·,·) of the RVM (Algorithm 1); the number of training examples N (Algorithm 1); the motion ranges Δ_H, Δ_V and Δ_S (Algorithm 1); the similarity threshold θ₁ in tracking result validation (Algorithm 4); the similarity threshold θ₂ in face matching based on the similarity matrix (Algorithms 6 and 7); and the duration threshold θ₃ and the full image search condition (Algorithm 8).
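For reference, this parameter set can be collected as in the sketch below (field names are ours; the values are those that Section 8.1 reports for the experiments).

```python
from dataclasses import dataclass

@dataclass
class TrackerParams:
    """Parameters required by the complete algorithm (experimental values of Section 8.1)."""
    kernel_width: float = 0.8        # s in the Gaussian kernel k(.,.) of Eq. (10)
    n_train: int = 50                # number of training examples N (Algorithm 1)
    delta_H: float = 10.0            # motion ranges (Algorithm 1)
    delta_V: float = 10.0
    delta_S: float = 0.1
    theta1: float = 0.7              # similarity threshold for tracking validation (Algorithm 4)
    theta2: float = 0.7              # similarity threshold for face matching (Algorithms 6 and 7)
    theta3: int = 6                  # duration threshold for BOOST -> OCCLUDED (Algorithm 8)
    full_search_interval: int = 15   # full image search condition: every 15 frames
```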
Algorithm 8. Complete algorithm
0. (INPUT)
   (1) Duration threshold, θ₃;
   (2) Full image search condition;
1. (INITIALIZATION)
   (1) I ← the first frame;
   (2) Detect faces in I using the face detector based on AdaBoost;
   (3) If no face is detected, I ← next frame, and go to (2);
   (4) n ← number of detected faces;
   (5) Learn the motion model and color model of the detected faces using Algorithms 1 and 2;
   (6) The states of all the detected faces are initialized as RVM;
2. (MULTIPLE FACES TRACKING)
   loop
     I ← next frame;
     for i = 1 to n do
       if the state of the ith face is RVM then
         (1) Track face i using Algorithm 3;
         (2) Validate the tracking result using Algorithm 4;
         if tracking successful then
           keep the state of face i as RVM;
         else
           change the state of face i to BOOST; lost_num ← 1;
         end if
       else if the state of the ith face is BOOST then
         (1) Determine the local search area as described in Section 6.1;
         (2) Detect faces in the local search area using the face detector based on AdaBoost and calculate the similarity matrix using Algorithm 5;
         (3) Match face i and make the state transition using Algorithm 7;
         if face i is not matched then
           lost_num ← lost_num + 1;
         end if
         if lost_num > θ₃ then
           change the state of face i to OCCLUDED;
         end if
       end if
     end for
     if the full image search condition is satisfied then
       (1) Detect faces in I using the face detector based on AdaBoost and calculate the similarity matrix using Algorithm 5;
       (2) Match faces, make state transitions and learn new face models using Algorithm 6.
     end if
   end loop

8. Experimental results

To demonstrate the effectiveness of our system, four image sequences captured by a web camera are used. These sequences all have a resolution of 320 × 240, 24-bit color, with 510, 123, 98 and 115 frames, respectively. All the algorithms are run on a 2.4 GHz PC without code optimization.

8.1. Parameter settings

In the subsequent experiments, the Gaussian kernel function is used in the RVM:
k(z_i, z_j) = exp( −‖z_i − z_j‖² / (2M s²) )    (10)
where M is the input dimensionality and s controls the width of k(·,·); here we set s = 0.8. The other parameters are set as follows: the motion ranges Δ_H = 10, Δ_V = 10, Δ_S = 0.1; the number of training examples N = 50; the similarity thresholds θ₁ = θ₂ = 0.7; the duration threshold θ₃ = 6; and the full image search condition is every 15 frames.

8.2. Results

Fig. 5 shows some tracking results for image sequence 1. In the results, the faces are shown in rectangles, and the number above each rectangle is the face ID. The text at the top of the image shows the state transition (for example, RVM → BOOST means the face state changed from RVM to BOOST in the current frame). In image sequence 1, there were two persons (Face 1 and Face 2) at the beginning, and then a third person (Face 3) came into the scene and was subsequently occluded by Face 1 and Face 2. Face 3 then left the scene for a while and returned, after which he was occluded by Face 1 and Face 2 again. The results show that Face 3 was detected as a new face when he came into the scene, and was tracked correctly after he was occluded by others.

Fig. 6 shows the state of Face 3 at each frame of image sequence 1. We can see clearly from Fig. 6 that Face 3 was not in the scene at the beginning; at frame 80 Face 3 came into the scene and its state became RVM. The state of Face 3 became BOOST at frame 92 because he began to be occluded by Face 2. At frame 98 the state of Face 3 became OCCLUDED because the duration of its BOOST state exceeded θ₃ = 6, and Face 3 was not tracked by the system from then on. At frame 125, Face 3 was found in a full image search, so its state was changed from OCCLUDED back to RVM. From then on, Face 3 was occluded by others several more times, which can also be seen in Fig. 6.

Fig. 7 shows some tracking results for image sequence 2. At first, there were two persons (Face 1 and Face 2) in the scene. Face 3 came into the scene at frame 46, and then Face 1 and Face 3 were occluded by Face 2 and reappeared later. The results show that the system can find the new face and track the faces correctly through the occlusion.

Figs. 8 and 9 show some tracking results for image sequences 3 and 4, respectively. These two sequences include scale changes of faces, new faces coming into the scene and face occlusions. The results show that the system tracks the faces correctly in these situations.

Table 1 gives the tracking accuracy for the four image sequences. The ground truth is obtained by labeling each frame manually. At each frame, the tracking is considered successful if the matching error between the tracking result and the ground truth is less than 4 pixels. The tracking errors in our system include detection errors and matching errors. Detection errors occur when the face detector based on Boosting learning detects wrong faces in the full image search. Matching errors occur when the face orientation or the lighting condition changes drastically.

Table 2 compares the efficiency of our system with tracking based on Boosting alone. In the tracking based on Boosting alone, the face detector based on AdaBoost is used to detect faces in every frame, and face matching based on the similarity matrix is applied subsequently. Table 2 shows that our system runs about twice as fast as tracking with AdaBoost alone.
Fig. 5. Some tracking results of image sequence 1 (from left to right, top to bottom: frame 1, 80, 98, 125, 185, 215, 293, 350, 432).
Fig. 6. The sequence of states for Face 3 in image sequence 1 (state: no Face 3 / RVM / BOOST / OCCLUDED, versus frame number).
9. Conclusions

This paper presents a multiple faces tracking system based on the Relevance Vector Machine and Boosting learning. At the start of the system, a face detector based on AdaBoost is used to detect faces and the face models are created. The face model in our system includes a motion model and a color model: the motion model consists of a set of RVMs, and the color model is the 2D histogram in CrCb color space. In the tracking process, different tracking methods are used according to the different states of the faces. A full
image search is started at regular time intervals in order to find newly appearing faces and occluded faces. In the full image search and local search, the similarity matrix is introduced to help match faces efficiently. Experimental results demonstrate that our system can detect new faces automatically and handle face occlusion and face scale changes, and that it can process the image sequence at about 20 frames/s.

In the future we plan to integrate the face detector with the skin color model in order to reduce detection errors, and to use a more robust lighting compensation method in order to reduce matching errors.
Fig. 7. Some tracking results of image sequence 2 (frame 1, 46, 64, 68, 73, 80).
Fig. 8. Some tracking results of image sequence 3 (frame 1, 24, 53, 61, 69, 78).
Fig. 9. Some tracking results of image sequence 4 (frame 1, 16, 21, 37, 68, 79).
Table 1
Tracking accuracy

Image sequence    Total frames    Successful frames    Successful rate (%)
Sequence 1        510             507                  99.4
Sequence 2        123             118                  95.9
Sequence 3        98              98                   100
Sequence 4        115             111                  96.5
Table 2
Algorithm speed

Image sequence    This paper (frames/s)    AdaBoost alone (frames/s)
Sequence 1        20.6                     10.3
Sequence 2        21.2                     11.4
Sequence 3        19.8                     10.4
Sequence 4        22.5                     10.9
Average           21.0                     10.6
Our further work also includes extending our system to track other kinds of objects (pedestrians, cars, etc.).

References

[1] W. Hu, T. Tan, L. Wang, S. Maybank, A survey on visual surveillance of object motion and behaviors, IEEE Transactions on Systems, Man, and Cybernetics, Part C 34 (3) (2004) 334–351.
[2] K. Schwerdt, J.L. Crowley, Robust face tracking using color, in: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Washington, DC, USA, 2000, pp. 90–95.
[3] J.R. del Solar, A. Shats, R. Verschae, Real-time tracking of multiple persons, in: Proceedings of the 12th International Conference on Image Analysis and Processing, 2003.
[4] C. Lerdsudwichai, M. Abdel-Mottaleb, A.-N. Ansari, Tracking multiple people with recovery from partial and total occlusion, Pattern Recognition 38 (7) (2005) 1059–1070.
[5] S. Birchfield, Elliptical head tracking using intensity gradients and color histograms, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Santa Barbara, CA, USA, 1998, pp. 232–237.
[6] M. Pardas, E. Sayrol, A new approach to tracking with active contours, in: Proceedings of the IEEE International Conference on Image Processing, Vancouver, BC, USA, 2000, pp. 259–262.
[7] X. Bing, Y. Wei, C. Charoensak, Face contour tracking in video using active contour model, in: Proceedings of the IEEE International Conference on Image Processing, Singapore, 2004, pp. 1021–1024.
[8] H.-S. Lee, D. Kim, Robust face tracking by integration of two separate trackers: skin color and facial shape, Pattern Recognition 40 (11) (2007) 3225–3235.
[9] T. Maurer, C. von der Malsburg, Tracking and learning graphs and pose on image sequences of faces, in: Proceedings of the International Conference on Automatic Face and Gesture Recognition, 1996.
[10] R. Liang, C. Chen, J. Bu, Real-time facial features tracker with motion estimation and feedback, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2003.
[11] S. Von Duhn, L. Yin, M.J. Ko, T. Hung, Multiple-view face tracking for modeling and analysis based on non-cooperative video imagery, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 2007.
[12] T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 681–685.
[13] I. Matthews, S. Baker, Active appearance models revisited, International Journal of Computer Vision 60 (2) (2004) 135–164.
[14] S. Avidan, Support vector tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (8) (2004) 1064–1072.
[15] O. Williams, A. Blake, R. Cipolla, Sparse Bayesian learning for efficient visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1292–1304.
[16] M.E. Tipping, Sparse Bayesian learning and the relevance vector machine, Journal of Machine Learning Research 1 (2001) 211–244.
[17] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998) 121–167.
[18] A. Faul, M. Tipping, Analysis of sparse Bayesian learning, Advances in Neural Information Processing Systems 14 (2002).
[19] M. Tipping, A. Faul, Fast marginal likelihood maximisation for sparse Bayesian models, in: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003.
[20] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001.
[21] P. Viola, M. Jones, D. Snow, Detecting pedestrians using patterns of motion and appearance, in: Proceedings of the IEEE International Conference on Computer Vision, 2003.
[22] R. Lienhart, J. Maydt, An extended set of Haar-like features for rapid object detection, in: Proceedings of the IEEE International Conference on Image Processing, 2002.
[23] S.Z. Li, Z. Zhang, FloatBoost learning and statistical face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (9) (2004) 1112–1123.
[24] R.-L. Hsu, M. Abdel-Mottaleb, A.K. Jain, Face detection in color images, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5) (2002) 696–706.
391
[7] X. Bing, Y. Wei, C. Charocnsak, Face contour tracking in video using active contour model, in: Proceedings of the IEEE International Conference on Image Processing, Singapore, 2004, pp. 1021–1024. [8] H.-S. Lee, D. Kim, Robust face tracking by integration of two separate trackers: skin color and facial shape, Pattern Recognition 40 (11) (2007) 3225–3235. [9] T. Maurer, C. von der Malsburg, Tracking and learning graphs and pose on image sequences of faces, in: Proceedings of the International Conference on Automatic Face and Gesture Recognition, 1996. [10] R. Liang, C. Chen, J. Bu, Real-time facial features tracker with motion estimation and feedback, in: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 2003. [11] S.V. Duhn, L. Yin, M.J. Ko, T. Hung, Multiple-view face tracking for modeling and analysis based on non-cooperative video imagery, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 2007. [12] T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models, IEEE Transactions on Pattern Analysis and Machine Intelligence 23 (6) (2001) 681–685. [13] I. Matthews, S. Baker, Active appearance models revisited, International Journal of Computer Vision 60 (2) (2004) 135–164. [14] S. Avidan, Support vector tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (8) (2004) 1064–1072. [15] O. Williams, A. Blake, R. Cipolla, Sparse bayesian learning for efficient visual tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (8) (2005) 1292–1304. [16] M.E. Tipping, Sparse bayesian learning and the relevance vector machine, Journal of Machine Learning Research 1 (2001) 211–244. [17] C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery 2 (2) (1998) 121–167. [18] A. Faul, M. Tipping, Analysis of sparse bayesian learning, Advances in Neural Information Processing Systems 14 (2002). [19] M. Tipping, A. Faul, Fast marginal likelihood maximisation for sparse bayesian models, in: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003. [20] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2001. [21] P. Viola, M. Jones, D. Snow, Detecting pedestrians using patterns of motion and appearance, in: Proceedings of the IEEE International Conference on Computer Vision, 2003. [22] R. Lienhart, J. Maydt, An extended set of haar-like features for rapid object detection, in: Proceedings of the IEEE International Conference on Image Processing, 2002. [23] S.Z. Li, Z. Zhang, Floatboost learning and statistical face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (9) (2004) 1112–1123. [24] R.-L. Hsu, M. Abdel-Mottaleb, A.K. Jain, Face detection in color images, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5) (2002) 696– 706.