Signal Processing: Image Communication 28 (2013) 1100–1113
Combining texture and stereo disparity cues for real-time face detection

Feijun Jiang (a), Mika Fischer (b), Hazım Kemal Ekenel (b,c), Bertram E. Shi (d,*)

(a) Department of Data Application, ICBU, Alibaba Group, China
(b) Institute of Anthropomatics, Karlsruhe Institute of Technology, Germany
(c) Faculty of Computer and Informatics, Istanbul Technical University, Turkey
(d) Department of ECE and Division of Biomedical Engineering, Hong Kong University of Science and Technology, Hong Kong

* Corresponding author. Tel.: +852 23587079. E-mail: [email protected] (B.E. Shi).
Article history: Received 29 August 2012; received in revised form 20 July 2013; accepted 21 July 2013; available online 6 August 2013.

Abstract

Intuitively, integrating information from multiple visual cues, such as texture, stereo disparity, and image motion, should improve performance on perceptual tasks, such as object detection. On the other hand, the additional effort required to extract and represent information from additional cues may increase computational complexity. In this work, we show that using a biologically inspired integrated representation of texture and stereo disparity information for a multi-view face detection task leads not only to improved detection performance, but also to reduced computational complexity. Disparity information enables us to filter out 90% of image locations as being less likely to contain faces. Performance is improved because the filtering rejects 32% of the false detections made by a similar monocular detector at the same recall rate. Despite the additional computation required to compute disparity information, our binocular detector takes only 42 ms to process a pair of 640 × 480 images, 35% of the time required by the monocular detector. We also show that this integrated detector is computationally more efficient than a detector with similar performance where texture and stereo information are processed separately. © 2013 Elsevier B.V. All rights reserved.

Keywords: Multi-view face detection; Stereo vision; Disparity energy model; Gabor filter
1. Introduction

Fast and accurate object detection is still a challenging task in computer vision, but appears to be quite easy and robust for biological systems. One possible way to improve algorithms for object detection is to incorporate models of the signal processing performed by biological systems. Object detection in biological systems efficiently integrates multiple cues, such as stereo disparity, texture and motion. For example, the primary visual cortex contains neurons that are jointly tuned to respond maximally to inputs with a particular combination of cues, e.g. bars at
a particular orientation and disparity. This early integration allows the extraction and representation of information from multiple cues to share processing stages, potentially reducing computational complexity. In contrast, the typical engineering approach merges the outputs of separate modules for each cue, each developed independently. In this work, we examine the performance of a biologically inspired image representation that integrates texture and stereo disparity cues in an algorithm for multi-view face detection.

There have been extensive research efforts on face detection, most of which have focused on frontal face detection. One of the most popular face detectors is the Viola–Jones face detector, which uses features computed using Haar wavelets and boosting for classification [1]. Boosting has also been used in conjunction with other features, such as the modified
census transform (MCT), which provides improved illumination invariance [2]. Frontal face detectors can be extended to multi-view face detectors in several ways: a parallel structure consisting of a bank of many detectors trained for different poses [3], a hybrid structure where a pose estimator first predicts the pose of the face and then the detector covering the corresponding pose makes the decision [4,5], a hierarchical structure where coarse detectors covering a wide pose range are cascaded with fine detectors covering narrower pose ranges [6,7], or a unified manifold method [8] that learns to project faces onto a face manifold and non-faces away from the manifold.

Our multi-view face detector uses a parallel structure that consists of five detectors designed for different pose ranges. The detectors are trained using boosting applied to biologically inspired features that are constructed from the outputs of Gabor filters. The spatial receptive fields of monocular simple cells in the primary visual cortex (V1) can be modeled by Gabor functions tuned to different orientations and scales [9]. Gabor filters are also important components of computational models of V1 neurons tuned to disparity [10] and motion [11], as well as of biologically inspired models for object recognition [12]. From the engineering point of view, features based on the outputs of Gabor filters have been found to outperform other features for texture classification [13]. A recent survey [14] on face detection proposed that complex features might improve performance significantly.

It has previously been demonstrated that it is possible to use Gabor features for face detection. Huang et al. [15] proposed a face detector that uses a polynomial neural network to classify Gabor features computed at four orientations. Kwolek [16] proposed a convolutional neural network face detector using Gabor filter outputs from two scales and two orientations. However, the performance of these two detectors did not match that of state-of-the-art face detectors, such as those based on MCT or Haar-like features [1,2]. Nonetheless, there is evidence that Gabor based features contain useful information not captured by Haar features. Combining features based on Gabor filters with a set of Haar-like features and training using boosting has been shown to improve performance over using Haar-like features alone [17,18]. Recently, Serre et al. [12] demonstrated state-of-the-art object recognition performance using features based on multi-scale Gabor features.

Here, we describe a Gabor feature based face detector whose performance exceeds that of state-of-the-art face detectors, such as [1,2,19], on standard databases, such as the CMU/MIT database [20] and the Face Detection Dataset and Benchmark (FDDB) [21]. In order to detect faces at different sizes and locations, the face detector adopts a sliding window approach applied at all levels of an image pyramid. Each 18 × 18 pixel sub-window is classified as containing a face or not. Detections undergo a further clustering stage to remove multiple detections of the same face in different sub-windows. Each pixel within the sub-window is labeled with one of 625 discrete feature labels, obtained by discretizing the normalized outputs of Gabor filters at four orientations. Each window is classified using a cascade of classifiers of increasing complexity. The use of discrete features enables us to implement the classifier at each stage using a computationally efficient table lookup.
Scanning an image pyramid at all locations and scales is quite time consuming. We use stereo disparity information to ignore locations that are unlikely to contain faces. This use of stereo information has been incorporated into other object detection frameworks, such as pedestrian detection [22]. A preliminary study of the use of depth information in face detection by Burgin et al. has shown that it can significantly reduce computation time [23]. However, this study does not give details about the detector performance or how the performance changes with the incorporation of stereo data. While filtering can reduce the number of image locations scanned, there is also the danger that faces that would have been found by the face detector may be discarded, resulting in poorer detection performance. In addition, unlike the work described here, the stereo disparity and face detection modules used were independent, with no effort to leverage computation performed by one for use in the other.

The use of Gabor-based features for face detection has the advantage that the outputs of the Gabor filters can be re-used for the stereo disparity computations. Here, we use the disparity energy model, which is commonly used to model the responses of disparity selective complex cells in the primary visual cortex [10], to encode stereo disparity information. The disparity energy model uses Gabor filters to model the spatial receptive fields that weight input from the left and right eyes. The outputs of the monocular Gabor filters are combined linearly and followed by squaring to give detectors tuned to respond maximally when the input lies at a certain disparity. The tuned disparity can be adjusted by changing phase or position shifts between the Gabor filter outputs of the left and right eyes. Different scales are used to detect faces of different sizes. Since the size of the face in the image changes primarily with depth, each scale picks up faces at different depths. We use the outputs of the stereo disparity detectors to screen out image locations at each scale where the input does not have the expected disparity.

Our tests on a database of over 900 stereo images reveal a double benefit to the use of stereo information. First, it reduces computation time. Although the amount of input image data is doubled in comparison with a face detector based on monocular input, the extra computation required for disparity detection is (1) reduced by the reuse of the Gabor filter outputs and (2) more than offset by the savings obtained by reducing the number of windows that must be scanned by the face detector. The disparity detectors screen out nearly 90% of the windows in the pyramid. The computation time required by the stereo face detector is 65% smaller than that required by the monocular detector. The computations required by the Gabor filters can be significantly accelerated using Graphics Processing Units (GPUs) [24]. Our detector runs at 24 fps on 640 × 480 pixel images on a PC with an i5 2.66 GHz CPU and an Nvidia GTX 465 graphics card. Second, it improves detection performance. The stereo face detector possesses a better precision–recall curve than the monocular detector, suggesting that the disparity detectors rarely screen out true faces and effectively reject non-face regions that are erroneously detected by the monocular detector.
Fig. 1. The system diagram of the detectors (blocks: left/right input images, 4-orientation Gabor filters, normalization, discretization, disparity detector or stereo matching, scan with face classifier, clustering, detections). The switch selects between the monocular detector, the binocular detector and the detector using stereo matching.
Fig. 1 shows a block diagram of the operations applied by the face detectors described in this paper. The operations required by the face detector are shown in the top row. The operation of these blocks is detailed in Section 2. For the monocular face detector, the switch attached to the scan block is set to "1", indicating that every sub-window is passed to the classifier. For the stereo face detector based on disparity energy detectors, the switch is set to the middle position, indicating that some sub-windows may be screened out by the disparity detectors. The operation of the three blocks used to compute the disparity detector is described in Section 3. Experimental results presented in Section 4 characterize the detectors' performance and computational efficiency on the publicly available CMU/MIT and FDDB databases of monocular images and on the stereo dataset [25] that we collected. We also compare the performance of the stereo face detector using disparity energy detectors with a similar stereo face detector, where the switch is set to the bottom position, indicating that windows are screened using a conventional stereo estimator based on the sum of squared differences.
2. Multi-view face detector based on discretized Gabor features

The multi-view face detector adopts a sliding window approach on an image pyramid. The image pyramid is constructed by resizing the image with a geometric series of scaling factors with common ratio 1.2. Faces of different sizes are detected at different levels of the pyramid. At each level of the pyramid, each 18 × 18 pixel window is classified as either containing a face or not. The operations required for face detection are shown as the blocks in the top row of Fig. 1. This section details the operations within each block.

2.1. Gabor filtering

The Gabor filter is tuned to respond maximally when the local input texture has a certain orientation. Its impulse response is a 2D Gaussian modulated by a 2D complex exponential

g(\vec{x} \mid \Sigma, \vec{\Omega}) = e^{-\frac{1}{2} \vec{x}^T \Sigma^{-1} \vec{x}} \left[ e^{j \vec{\Omega}^T \vec{x}} - e^{-\|\vec{\Omega}\|^2 \sigma^2 / 2} \right]   (1)

where \vec{x} indexes pixel location, Σ is the covariance matrix of the 2D Gaussian, and \vec{\Omega} = [Ω_x  Ω_y]^T determines the spatial frequency and orientation tuning of the filter. Here, we use four Gabor filters. All share the same center spatial frequency Ω = \sqrt{Ω_x^2 + Ω_y^2} = 2π/3 and diagonal Gaussian covariance matrix Σ with diagonal entries σ² = (1.6/‖\vec{Ω}‖)². They differ in their tuned orientation, which is determined by the spatial frequency vector \vec{Ω} in Eq. (1): γ = arctan(Ω_x/Ω_y) = 0, π/4, π/2 and 3π/4 radians. Horizontal orientations correspond to γ = 0. The additional term in the square brackets removes the DC component in the real part of the Gabor filter. We denote the complex valued output of each Gabor filter at pixel \vec{x} by G_γ(\vec{x}), where γ indicates orientation. The real part of G_γ(\vec{x}) is the response of a filter whose impulse response is a cosine modulated by a Gaussian. Intuitively, the real part responds maximally at the center of oriented bars with width inversely proportional to the spatial frequency tuning of the filter. The imaginary part of G_γ(\vec{x}) is the response of a filter whose impulse response is a sine modulated by a Gaussian. Intuitively, the imaginary part responds maximally to step edge transitions.

2.2. Normalization

The challenge of feature extraction is to find features that preserve information required for detection while rejecting variations due to environmental changes, such as changes in illumination. The outputs of Gabor filters are less sensitive to illumination changes than image intensity [26], since they remove the DC component in the image and respond only to oriented edges and bars. However, because the filters are linear, their response magnitudes increase with contrast: the intensity difference between the bar and background or across the edge. To improve contrast and illumination invariance, we normalize the Gabor filter outputs by dividing the complex Gabor responses by the square root of the sum of the Gabor energies from the four orientations. Denoting the normalized Gabor response by \bar{G}_γ(\vec{x}), we have

\bar{G}_\gamma(\vec{x}) = \frac{G_\gamma(\vec{x})}{\sqrt{\sum_{\eta = 0, \pi/4, \pi/2, 3\pi/4} \|G_\eta(\vec{x})\|^2}}   (2)
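To make the filter bank concrete, the following NumPy sketch implements Eqs. (1) and (2). It is a minimal illustration under stated assumptions, not the authors' code: the 5 × 5 kernel size is taken from the implementation details in Section 4.4, the DC-correction constant is written in its standard form e^{−‖Ω‖²σ²/2}, and all function names are ours.

```python
import numpy as np
from scipy.signal import convolve2d

OMEGA = 2 * np.pi / 3                # shared radial center frequency ||Omega||
SIGMA = 1.6 / OMEGA                  # Gaussian width: sigma^2 = (1.6/||Omega||)^2
ORIENTATIONS = (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)

def gabor_kernel(gamma, size=5):
    """Complex Gabor kernel of Eq. (1). The constant subtracted inside the
    brackets removes the DC component of the real (cosine) part."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Frequency vector at orientation gamma; gamma = 0 gives a purely
    # vertical frequency, i.e. a filter tuned to horizontal bars.
    wx, wy = OMEGA * np.sin(gamma), OMEGA * np.cos(gamma)
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * SIGMA ** 2))
    carrier = np.exp(1j * (wx * x + wy * y)) - np.exp(-(OMEGA * SIGMA) ** 2 / 2)
    return envelope * carrier

def gabor_responses(image):
    """Raw complex responses G_gamma(x), one 2-D map per orientation."""
    return np.stack([convolve2d(image, gabor_kernel(g), mode='same')
                     for g in ORIENTATIONS])

def normalize(G, beta=0.0):
    """Eq. (2) (and Eq. (6) of Section 3 when beta > 0): divide each response
    by the root of the Gabor energies summed over the four orientations."""
    energy = np.sqrt((np.abs(G) ** 2).sum(axis=0, keepdims=True))
    return G / np.maximum(energy, max(beta, 1e-12))
```

Because the normalization divides by the energy pooled over orientations, the resulting features depend only on the relative distribution of energy among the orientations, which is exactly what the experiment below measures.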
The normalized responses indicate the relative distribution of energy among the four orientations. To quantify the effect of normalization, we studied the relative changes in the raw and normalized Gabor filter outputs at face pixels as the illumination condition changes, using the Multi-PIE dataset [27]. This dataset contains frontal face images of 337 subjects taken under 20 different lighting conditions, for example with flash, without flash, or with the flash in a certain direction. For each subject, these 20 images are taken within 0.7 s, to eliminate variations in facial expression. Fig. 2(a) shows two images from this dataset of the same subject under different lighting conditions (without flash and with flash). The second row displays the relative distances of the 8 dimensional vector of Gabor filter outputs at each pixel between these two face images. The relative distance is defined as the Euclidean distance between the two vectors divided by the sum of the lengths of the two vectors. It is apparent that the relative distances between normalized Gabor features are much smaller than the relative distances between the raw Gabor features. To quantify this effect statistically, Fig. 2(b) shows histograms of the relative distances between the 8 dimensional vector of Gabor filter outputs at each pixel under different lighting conditions and the vector of Gabor filter outputs at the corresponding pixel under the frontal flash lighting condition. The histograms accumulate differences computed over all pixels in the 18 × 18 pixel region of the face, all lighting conditions, and all subjects. The distribution of relative distances of the normalized Gabor features (mean = 0.3281) is concentrated towards much smaller values than the distribution of relative distances of the raw Gabor features (mean = 0.4085).

2.3. Discretization

Face detection requires simple and efficient classifiers, since the detector must classify hundreds of thousands of sub-windows for each frame. Classifying continuous valued features is computationally complex, since each classifier typically requires a number of multiplication operations equal to the feature dimensionality. On the other hand, if feature values are discrete, classification can be done using an efficient table lookup [2]. However, the danger of quantization is a potential loss of information. One of the most common ways to perform quantization is vector quantization (VQ) using k-means clustering to find the codebook vectors. As described in Section 4, classifiers using normalized Gabor filter outputs discretized using VQ result in state-of-the-art performance. Unfortunately, VQ encoding is computationally expensive, since we must compare the Gabor feature with each of the k codebook vectors. Based on knowledge about the structure of the feature space gleaned from examining the results of VQ based classifiers trained by k-means clustering, we developed a discrete encoding of the normalized Gabor responses that can be computed more efficiently. At each pixel, each of the four complex valued normalized Gabor filter outputs \bar{G}_γ is mapped to one of five indices, depending upon which of the five regions of the complex plane shown in Fig. 3 the filter output falls into. Thus, the four normalized Gabor filter values at each pixel are mapped to one of 5⁴ = 625 possible discrete feature values. We refer to this encoding as the radial phase encoding. Filter outputs falling into the central circular region with radius 0.5 correspond to orientations whose Gabor filter energy is smaller than the mean energy among the four orientations. These are less significant. The more significant filter outputs are quantized into one of the four outer regions, corresponding to different phases. The phase indicates the edge type, which is necessary for face detection. Phases of 0 and π correspond to cosine modulations in the Gabor function, and thus indicate light or dark bars at the given orientation.
Fig. 2. (a) Top row: the face images under natural light and frontal flash. Bottom row: the relative differences for the normalized Gabor features (left) and the raw Gabor features (right). White indicates 1; black indicates 0. (b) The histogram of the relative distances between the feature vectors of the same subject under different lighting conditions.
Fig. 3. The boundaries of the radial phase encoding in the complex plane.
Phases of π/2 and −π/2 correspond to sine modulations in the Gabor function, and thus indicate step edges with different polarity. By quantizing the output into four regions centered around these values, we ignore small position shifts of the edges.
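A minimal sketch of the radial phase encoding follows, assuming the five regions of Fig. 3: the low-energy disc of radius 0.5 maps to index 0, and the four outer regions, centered on phases 0, π/2, π and −π/2, map to indices 1–4. The particular index assignment is our choice; only consistency matters for the lookup tables.

```python
import numpy as np

def radial_phase_index(z):
    """Map one normalized complex Gabor response to an index in {0,...,4}:
    0 for the low-energy disc |z| < 0.5, and 1-4 for the four outer regions
    centered on phases 0, pi/2, pi and -pi/2 (Fig. 3)."""
    if abs(z) < 0.5:
        return 0
    # Round the phase to the nearest multiple of pi/2; -pi and +pi coincide.
    return 1 + int(np.round(np.angle(z) / (np.pi / 2))) % 4

def radial_phase_label(responses):
    """Combine the four per-orientation indices at one pixel into a single
    discrete feature label in {0, ..., 5**4 - 1 = 624}."""
    label = 0
    for z in responses:            # orientations 0, pi/4, pi/2, 3*pi/4
        label = 5 * label + radial_phase_index(z)
    return label
```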
2.4. Sub-window classifier

Each 18 × 18 pixel window is classified as containing a face using a bank of five detectors operating in parallel. Each detector is designed for one of the following poses: frontal, +45° and −45° in-plane rotations, and left and right profiles. Detections of the same face made by multiple detectors or in multiple windows are merged by clustering. Since each individual detector can detect faces in a range of poses around the nominal pose, we find these five pose angles are enough to cover most of the cases encountered when the camera is at about head level.

Each of the five detectors has a similar cascade structure [28]. The detectors for frontal and in-plane rotated faces use a four stage cascaded classifier, where the first three stages are trained using AdaBoost and the final stage is a winnowing stage [29]. Since profile faces contain fewer distinct features than frontal faces, the profile face detectors require twice as many cascaded stages. All eight stages are trained by boosting. On average, these additional stages increase the required computation to three times that of the frontal and in-plane rotation detectors. Stages are trained sequentially. Each stage is trained only on samples which pass through all the previous stages. The threshold for comparison at each stage is set so that the true positive detection rate is 99.5% on the validation set.

For the boosting stages, the classification at each stage is determined by the weighted sum of the outputs of a number of weak classifiers. Each weak classifier makes its decision based on the feature at one pixel within the sub-window. The choice of the weak classifiers and their weighting coefficients is determined by boosting. Since the feature values are discrete, each weak classifier can be implemented by a lookup table, where the address in the table is given by the feature value and the value in the table is either 0 or the weighting coefficient, depending on the classification result (non-face or face). As described in [2], multiple classifiers at the same sub-window location can be combined into one lookup table simply by adding the values in the lookup tables of the individual classifiers. The first two stages, which should screen locations quickly, are limited to combining classifiers from fewer sub-window locations: 20 and 80 for the frontal face detector, and 80 and 160 for the profile face detector.

Once we have trained the frontal face detector, we can immediately obtain detectors for faces at +45° and −45° in-plane rotations without additional training by applying the weak classifiers of the frontal face detector at rotated sub-window locations. For example, if the first stage of the frontal face classifier combines outputs of weak classifiers based on the feature vectors at five sub-window locations, the first stage of the 45° in-plane rotated face detector combines outputs of weak classifiers based on the same five sub-window locations, only rotated by 45°. Non-integer pixel locations after rotation are rounded to the nearest integer pixel location. The lookup tables of the weak classifiers in the rotated detector are the same as the lookup tables of the corresponding weak classifiers in the frontal face detector. However, the discrete feature value used to address the lookup table must be computed differently to take the rotation into account, by replacing the normalized Gabor filter output at each orientation by the normalized Gabor filter output at a rotated orientation. For example, to compute the discrete feature value for the 45° in-plane rotated face detector, we replace (\bar{G}_0, \bar{G}_{π/4}, \bar{G}_{π/2}, \bar{G}_{3π/4}) by (\bar{G}^*_{3π/4}, \bar{G}_0, \bar{G}_{π/4}, \bar{G}_{π/2}), where the superscript * indicates complex conjugation. Face detectors tuned to left and right profile faces were trained using data from a profile face database [20].
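The following sketch shows how a cascade built from such lookup tables might be evaluated. The data layout, one 625-entry table of summed weak-classifier weights per sub-window location, follows the description above, but the names and structure are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def stage_score(labels, tables, locations):
    """Score of one boosting stage on an 18x18 array of discrete feature
    labels. `tables[i]` is a 625-entry array holding, for sub-window
    location i, the summed weights of all weak classifiers trained there
    (zero entries encode a 'non-face' vote)."""
    return sum(tables[i][labels[r, c]] for i, (r, c) in enumerate(locations))

def classify_window(labels, stages):
    """Run the cascade; each stage is a (tables, locations, threshold)
    triple, with the threshold set for a 99.5% true positive rate on the
    validation set. Reject as soon as one stage score falls short."""
    for tables, locations, threshold in stages:
        if stage_score(labels, tables, locations) < threshold:
            return False           # classified as non-face
    return True                    # passed every stage: face
```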
3. Binocular face detector

Applying the face detector at every possible window in the image pyramid is time consuming, due to the large number of image windows that must be evaluated. Stereo disparity information can potentially speed up face detection by screening out image windows that are unlikely to contain faces.

Different levels of the image pyramid are used to detect faces of different sizes. Human faces are roughly the same size, even during development. The mean forehead width of males increases by 41% from year 1 (83.3 mm) to year 18 (117.5 mm). Most of the change occurs during the early years: the mean forehead width of males changes by only 14% from year 6 (104.4 mm) to year 18. Standard deviations within each year of age are in the range of 5 mm. The difference between males and females is typically on the order of 3–4 mm [30]. Thus, the primary determinant of face size in the image is depth, or distance from the camera. Different scales detect faces at different depths.

In a stereo camera system, depth determines the disparity between corresponding points in the left and right camera images. Given that human faces have similar physical sizes and assuming the optical axes of the two cameras are parallel, it turns out that the faces that appear at the size selected by one scale of the pyramid appear at the same pixel disparity. Using the perspective camera projection model, a face with width W_a mm located at a distance Z mm from the camera will have size in pixels given by

W_i = \frac{f}{Z} W_a   (3)

where f denotes the focal length in pixels. Different scales can be modeled by changing the focal length f. An object at depth Z will appear at disparity D in pixels given by

D = \frac{f}{Z} B   (4)

where B is the stereo baseline in millimeters. Combining Eqs. (3) and (4), we obtain the following relationship:

D = \frac{B}{W_a} W_i   (5)
Because this equation is independent of focal length, the disparity is independent of scale. Thus, the stereo disparity at which we expect faces to appear is the same at each scale. Our system uses the Bumblebee 2 stereo vision camera system from Point Grey Research, which has a stereo baseline of B = 120 mm. Substituting W_a = 117.5 mm and W_i = 18 pixels into Eq. (5), we find that faces will appear around an expected disparity of 18 pixels at each scale. Since scales differ by a factor of 1.2, the range of expected disparities covered by each scale runs from 18/√1.2 to 18·√1.2 pixels (16.4–19.7 pixels).

Typically, stereo image analysis starts with disparity estimation, generating an estimate of the disparity between each pixel in one image and its corresponding pixel in the other image. Algorithms for disparity estimation can be categorized as either local or global. Local methods seek the best matching pixels by considering only information from windows around candidate pixels in the left and right images. For example, the sum of squared differences method evaluates the quality of a match between two pixels by computing the sum of squared differences between pixels in identically sized windows from the left and right images. Global methods add contextual information from segmentation [31] or smoothness constraints [32] to facilitate disparity estimation. Local methods are much faster than global methods. For example, using GPU acceleration, the sum of squared differences method can run at up to 30 fps and cover a large disparity range [33], while the fastest published implementation of a global method is 10 fps [34]. On the other hand, the disparity estimation accuracy of global methods is greater than that of local methods on the standard Middlebury dataset [35].

However, in order to screen windows for face detection, we do not require disparity estimates, but rather only a determination as to whether the disparities inside each window fall inside the expected range of 16.4–19.7 pixels. In other words, we need only consider the simpler problem of disparity detection. In particular, we seek to construct a detector at each pixel whose response is large if the disparity at that pixel is within the expected range. In constructing this detector, we would also like to re-use
the results of computations done for the face detector. It turns out that the visual cortex contains neurons, called disparity selective complex cells, that are "ideally suited as disparity detectors" [10]. These neurons are commonly modeled using the disparity energy model, which uses monocular Gabor filters as an initial processing step. Our implementation of disparity detectors is based upon the disparity energy model as described in [10], with two additional normalization stages. The disparity energy combines the outputs of normalized Gabor filters applied to the left and the right images. The normalized output is given by

\bar{G}_\gamma(\vec{x}) = \frac{G_\gamma(\vec{x})}{\max\left( \sqrt{\sum_{\eta = 0, \pi/4, \pi/2, 3\pi/4} \|G_\eta(\vec{x})\|^2}, \; \beta \right)}   (6)
where G_γ(\vec{x}) is the raw Gabor output. The original disparity energy model did not include this monocular normalization step, but rather combined raw Gabor filter outputs. The inclusion of normalization was proposed by Fleet et al. to account for cross-orientation suppression and the weak dependency on interocular contrast differences observed in visual cortical neurons [36]. In our experiments, we have found that monocular normalization improves rejection of horizontal false matches and leads to more localized responses. The floor on the energy given by β avoids the amplification of small responses likely due to noise. The disparity energy combines left and right camera normalized Gabor outputs with position and phase shifts:

E_\gamma(\vec{x}, \Delta\psi) = \left\| \bar{G}^L_\gamma(\vec{x}) + \bar{G}^R_\gamma(\vec{x} + \vec{d}) \, e^{j \Delta\psi} \right\|^2   (7)
where the superscripts indicate left and right eye outputs, \vec{d} = [D \; 0]^T introduces a horizontal position shift of D = 18 pixels, and Δψ is a phase shift. The disparity energy is tuned to respond maximally when the input is near a preferred disparity [37]

D_{pref} = D + \frac{\Delta\psi}{\Omega \sin(\gamma)}   (8)
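As a concrete illustration, here is a NumPy sketch of Eqs. (7) and (8) for one orientation, using the normalized outputs of Eq. (6). It is a simplified assumption-laden sketch: border handling is naive, and the horizontal orientation (γ = 0, for which Eq. (8) is undefined and which the pooling of Eq. (9) below excludes) is not treated.

```python
import numpy as np

D = 18                             # horizontal position shift, in pixels
OMEGA = 2 * np.pi / 3

def disparity_energy(GL, GR, dpsi):
    """Eq. (7) for one orientation: GL and GR are 2-D complex maps of
    normalized left/right Gabor responses (Eq. (6)). np.roll wraps at the
    image border; a real implementation would pad instead."""
    GR_shifted = np.roll(GR, -D, axis=1)        # G_R(x + d), d = [D 0]^T
    return np.abs(GL + GR_shifted * np.exp(1j * dpsi)) ** 2

def preferred_disparity(gamma, dpsi):
    """Eq. (8); the tuned disparity for orientation gamma and phase
    shift dpsi, valid for gamma != 0."""
    return D + dpsi / (OMEGA * np.sin(gamma))
```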
We fix the position shift D to center the range of preferred disparities around 18 pixels and vary the phase shift between ±π to move the preferred disparity within a small range around this point. The size of this range depends upon the orientation γ. Vertical orientations (γ = π/2) have the smallest range. Since we choose Ω = 2π/3, the range covered by the vertical orientations is 18 ± 1.5, which is approximately equal to the range covered by one scale (16.4–19.7 pixels). Diagonal orientations cover a larger range. Disparity energy responses at the same phase shift and orientation are pooled spatially by a Gaussian filter whose variance is twice that of the Gaussian in the Gabor kernel. Past work has shown that this improves disparity estimation [38]. We then combine responses across the three orientations to obtain a single disparity detector at each pixel tuned to a disparity D + Δd:

E(\vec{x}, \Delta d) = \sum_{\gamma = \pi/4, \pi/2, 3\pi/4} E_\gamma(\vec{x}, \; \Delta d \, \Omega \sin(\gamma))   (9)
Fig. 4. (a) The image pyramid of the left eye image. Two faces are detected at the second level. The red bounding boxes represent the detector windows. (b) The disparity responses U(\vec{x}) at each level of the pyramid. The intensity represents the magnitude, with darker colors indicating larger magnitude. (c) The dark regions represent the pixels of sub-windows that pass the binocular detector and are sent to the face detector for classification.
where Δd ranges between ±1.5. We then apply a binocular normalization

\bar{E}(\vec{x}, \Delta d) = \frac{E(\vec{x}, \Delta d)}{S + \alpha} - 1   (10)

where

S(\vec{x}) = \sum_{\gamma = \pi/4, \pi/2, 3\pi/4} \left( \|\bar{G}^L_\gamma(\vec{x})\|^2 + \|\bar{G}^R_\gamma(\vec{x})\|^2 \right)   (11)
is the average disparity energy across all phase shifts and orientations. Past work has shown that binocular normalization yields a better indicator that the disparity is near the preferred disparity [38]. The parameter α prevents amplification of small responses. Subtracting one places the average response at zero. The final disparity detector at each pixel is obtained by summing the positive half wave rectified responses over the disparity range Δd between ±1.5 pixels

U(\vec{x}) = \sum_{\Delta d} \left| \bar{E}(\vec{x}, \Delta d) \right|_+   (12)

where |·|₊ indicates half wave rectification.
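Putting Eqs. (9)–(12) together with the window screening rule described in the next paragraph, a compact sketch follows. The choice of seven Δd samples across ±1.5 pixels is our assumption (the paper does not state the sampling), and θ and λ take the values chosen later in Section 4.3.

```python
import numpy as np

ALPHA, BETA = 0.66, 10.0           # normalization constants (Section 4.3)
THETA, LAMBDA = 0.032, 114         # screening thresholds (Section 4.3)

def disparity_detector(E, S):
    """Eqs. (9), (10) and (12). `E[k, g]` is the spatially pooled disparity
    energy map for disparity offset dd_k (here 7 samples spanning +/-1.5 px)
    and orientation g in {pi/4, pi/2, 3pi/4}; `S` is the summed monocular
    energy of Eq. (11), broadcast over the disparity axis."""
    E_pooled = E.sum(axis=1)                      # Eq. (9): sum orientations
    E_norm = E_pooled / (S + ALPHA) - 1.0         # Eq. (10): binocular norm.
    return np.maximum(E_norm, 0.0).sum(axis=0)    # Eq. (12): rectify and sum

def window_passes(U_window):
    """Screening rule: pass an 18x18 window to the face detector only if
    more than LAMBDA of its pixels have detector output above THETA."""
    return np.count_nonzero(U_window > THETA) > LAMBDA
```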
To determine which windows to apply the face detector to, we count the number of pixels in the window for which U(\vec{x}) exceeds a threshold θ. Only those windows for which the count exceeds a threshold λ are fed to the face detector. Otherwise, the window is classified as non-face. Fig. 4 illustrates the binocular detection process. Left and right images are rectified so that corresponding pixels lie on the same scan line.

4. Experimental results

4.1. Databases

In order to train the frontal face detector, we use cropped frontal faces from [39] as positive training samples. This
database contains 25,000 face images cropped and resized to 20 × 20 pixels. It includes faces with slight variations in pose angle and under different illumination conditions. To avoid boundary effects, our detector considers only the center 18 × 18 pixels. To train the profile face detector, we use cropped profile faces from the CMU/VASC dataset [20] as positive training samples. The negative samples are taken from a dataset of more than 25,000 background images obtained from multiple sources: the CalTech background database [40], online background images, and background images from holiday photos. When training the first stage, the negative samples are chosen randomly. For the later stages, only negative samples that pass through all of the previous stages are used. In all cases, 70% of the data is used to train the classifiers. The remaining 30% is used as a validation set to determine the threshold of each stage.

For k-means clustering, we trained the codebook vectors using 11,338,056 normalized Gabor vectors at all orientations. Half were taken from the center 18 × 18 pixel sub-windows in the frontal face detector training set. The other half were taken from pixels in the background image training set.

To test the frontal face detector, we used images from the CMU/MIT dataset [20] and the Face Detection Dataset and Benchmark (FDDB) [21]. The FDDB dataset was also used to test the multi-view face detector. The CMU/MIT dataset contains 511 faces in 130 images. The FDDB dataset contains 2845 images. To measure detection accuracy, we counted as true positives those detections where the area of overlap between the bounding box generated by the face detector and the ground truth face region exceeds 50%. We compare the performance of our detectors with the results of the state-of-the-art detectors of Mikolajczyk et al. [19] and Viola–Jones [1], which are published on the FDDB website. These detectors were trained using different training databases than ours. We also compare with the results of our own experiments using an MCT based
detector that uses the same multi-stage classifier architecture and is trained with the same training data [2].

To train and test the binocular face detector, we collected a database of 992 stereo image pairs containing 1658 faces using a Bumblebee2 Stereo Vision System from Point Grey Research. The Bumblebee2 system contains two cameras whose optical axes are approximately parallel and offset by a 120 mm baseline. The images contain faces in both indoor and outdoor environments. One-third of the images were collected by mounting the stereo camera on a moving mobile robot in seven different locations. The remainder were collected from a stationary camera mounted on a tripod in 10 different locations. We used 360 image pairs for training and the remaining 632 image pairs for testing. Training and test images were taken from different locations to avoid background similarity. For the binocular detector, the face detector is applied to the left camera image. The right camera image is used only for the disparity computation. Thus, face locations were hand annotated in the left camera images only. This database is available online [25].

4.2. Monocular detector performance

In order to determine the number of orientations necessary for face detection, we built VQ based detectors with four orientations and with eight orientations. The eight orientations are γ = 0, π/8, 2π/8, 3π/8, 4π/8, 5π/8, 6π/8, 7π/8. Fig. 5 shows the ROC curves of the different detectors on the FDDB set. The performance of the eight orientation detector is almost identical to that of the four orientation detector using the same size of VQ codebook. This demonstrates that four orientation Gabor features already provide sufficient texture information for face detection. The figure also shows that the VQ based quantization outperforms the benchmark face detectors of Mikolajczyk et al. [19] and Viola–Jones [1].

Our experiments with the VQ based detector gave us confidence in using its results to develop the radial phase discrete encoding. Fig. 6 shows an example of this analysis with a VQ codebook size of 128. This figure shows the
Fig. 5. The ROC curves of frontal face detectors with different numbers of VQ centers and Gabor orientations on the FDDB set [21].
lookup table for classifiers from the first boosting stage graphically, as a set of points on the complex plane. Each point corresponds to one codebook vector. Its location is determined by the vector components corresponding to the horizontal orientation. The shape (triangle/square) encodes the sign (positive/negative) of the lookup table weight. The intensity encodes the weight magnitude. As we can see, the squares and triangles can be roughly separated by subsets of the green lines.

Fig. 7 compares the ROC curves of frontal face detectors based on the radial phase coding and on vector quantization with 512 centers with the results of the Mikolajczyk et al. [19], Viola–Jones [1], and MCT [2] detectors on the FDDB dataset [21]. The radial phase encoder and the VQ based detector achieve the best performance. At a true positive rate of 70%, the number of false detections for VQ, radial phase, MCT [2], Mikolajczyk et al. [19] and Viola–Jones [1] is 123, 136, 151, 31,903 and 4155, respectively. This demonstrates that an appropriately chosen discretization based on Gabor filter outputs can provide a more effective representation of image structure than those used by other state-of-the-art face detectors.

Table 1 compares the results of our detectors on the CMU/MIT dataset with the results of the MCT [2], Viola–Jones [1], and Huang et al.'s Gabor-based [15] face detectors. Our results on this dataset are comparable to the results of the MCT detector and better than those of the other two.

Fig. 8 shows the ROC curve of our monocular multi-view face detector on the FDDB dataset [21]. Our detector is able to detect 79.4% of the faces while making on average only one false detection in every five images. The performance of the multi-view face detector is significantly better than that of the frontal view detector. We also built an MCT multi-view face detector with the same poses and training samples for comparison, denoted as "MCT 5-view". Since the MCT features represent binary patterns within a 3 × 3 neighborhood, we cannot directly derive the detectors for +45° and −45° in-plane rotations from the frontal face detector. In the MCT multi-view face detector, we instead rotate the input image by −45° and +45°, respectively, and then run the frontal face detector on the rotated images to detect faces with +45° and −45° in-plane rotations. The performance of the radial phase detector is slightly better than that of the MCT detector. Considering that the MCT detector has to rotate the input image twice, the radial phase detector is also more efficient. We also plot the results of the frontal face detectors of Mikolajczyk et al. [19] and Viola–Jones [1] as baselines.

Fig. 13(a) shows some examples of detected faces, including frontal faces, profile faces and in-plane rotated faces. Fig. 13(b) shows some examples of faces that the detector fails to detect. These include occluded faces, blurred faces and faces whose poses are not covered by the detector bank.

In order to compare with previously proposed multi-view face detectors, we also conducted an experiment on the CMU rotated face dataset [4]. In our detector, we derived the detectors that detect in-plane rotated faces from the frontal face detector. Here we generate eight detectors, all derived from the frontal face detector, to detect faces with in-plane rotations from 0° to 315°, each detector covering a 45° range.
Fig. 6. Graphical representation of the lookup table for one classifier for the horizontal orientation in the first boosting stage. Color intensities of triangles/squares represent positive/negative weights. The green lines show the boundaries used for the radial phase encoding.
Table 1. Comparison of frontal face detector performance on the CMU/MIT set.

Frontal face detector      True positives   False positives
VQ, 512 centers            444              6
Radial phase               444              17
MCT baseline               430              7
Gabor detector in [15]     421              57
Viola–Jones in [1]         448              31

Fig. 7. The ROC curves of different frontal face detectors on the FDDB set [21].
Fig. 9 shows the ROC curve of our detector, denoted as "radial phase rotated", together with the curves of previous detectors. The proposed method is better than the previously proposed detectors [4,5,8], which verifies the effectiveness of the in-plane rotated face detectors. However, the performance does not match that of the detector proposed by Huang et al. [6]. We conjecture that their improved performance comes in part from the increased number of rotated views (12 vs. 8) they use, as well as from the tuning of the detector features specifically for face detection through the use of sparse granular features.

Fig. 8. The ROC curves of the 5-view face detector on the FDDB set [21].

Fig. 9. The ROC curves of the 8-view in-plane rotated face detector on the CMU rotated set [4].

4.3. Binocular detector performance

The performance of the stereo disparity based filter described in Section 3 depends upon four parameters: the offset α used for binocular normalization, the floor β used for monocular normalization, the threshold θ applied to the
pixel-based disparity detector, and the threshold λ on the number of pixels in the window at which the disparity detector output exceeds θ.

Increasing α reduces false matches in regions containing strong horizontal texture by reducing the value of E in these regions. However, if α is too large, E decreases for other regions as well. We chose a desirable value for α using a separately collected dataset of 81 stereo image pairs collected using the Bumblebee2 stereo vision system. Each image pair contained one face in a cluttered indoor laboratory environment. The face pose and depth varied between images. We obtained a set of 81 positive samples by hand annotating an 18 × 18 window containing the face at the appropriate level of the pyramid. We obtained a set of 324 negative samples by choosing four 18 × 18 windows containing background from each pair. For each value of α, we set β = 0 and computed the value of U(\vec{x}) at every pixel in these windows. We summed the values of U(\vec{x}) inside these windows and compared the sum with a threshold to generate a precision and recall rate. We adjusted the threshold so that the recall rate was 95%, and measured the resulting precision. Precision was maximized for α = 0.66.

Fixing α = 0.66, we chose the remaining parameters using the stereo face image database described in Section 4.1. For each set of parameters, we computed a single scalar measure of binocular detector performance by first generating a precision–recall curve, and then finding the largest F-measure on that curve. The F-measure is given by

F = 2 \frac{PR}{P + R}   (13)

where P indicates precision and R indicates recall [41].

Increasing β increases rejection of false matches in areas with little texture by decreasing the monocular normalized Gabor output, which decreases the detector output U(\vec{x}). However, if β is too big, true matches in low contrast regions may also be rejected. To choose the value of β, we swept, for each candidate value of β, the value of λ from 0 to 290 while setting θ = 0.00039λ. Fig. 10 shows the curves generated for different values of β between 6 and 14. For λ = 0, 100% of the windows are scanned by the face detector (right-hand side of the figure). Under this condition, the detector performance is the same as that of the monocular detector (horizontal line). As λ increases, fewer and fewer windows are scanned. Eventually, almost no windows are scanned and performance drops (left-hand side of the figure). As β increases from 6 to 10, detector performance increases in the region where about 10% of the windows are scanned. In fact, performance in this region can exceed that of the monocular detector, indicating that the regions screened out are mainly false positives of the monocular detector. We choose β = 10, which leads to the best performance when the percentage of windows applied to the face detector is less than 10%.

Having chosen α and β, we use a grid search to choose the parameters θ and λ. Fig. 11 shows a contour diagram of the best F-measure in this 2D parameter space. Given a set of parameter choices with nearly equal performance, we prefer larger values of θ and λ, which result in fewer windows being scanned. Based on these considerations, we chose θ = 0.032 and λ = 114.
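A sketch of this selection procedure follows; `evaluate` stands in for running the binocular detector over the training pairs, and the grid resolution is our assumption (the paper gives only the search ranges of Fig. 11).

```python
import numpy as np

def f_measure(precision, recall):
    """Eq. (13)."""
    return 2 * precision * recall / (precision + recall)

def best_parameters(evaluate):
    """Grid search over the region shown in Fig. 11. `evaluate(theta, lam)`
    is assumed to run the binocular detector on the training pairs and
    return the largest F-measure along its precision-recall curve."""
    grid = [(theta, lam)
            for theta in np.linspace(0.02, 0.067, 10)
            for lam in range(50, 151, 10)]
    return max(grid, key=lambda params: evaluate(*params))
```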
Fig. 10. The performance and efficiency trade-off for different β as λ changes from 0 to 290 while θ = 0.00039λ. The horizontal line represents the performance of the monocular detector.

Fig. 12 compares the precision–recall curve of the binocular detector and the monocular detector. The
precision–recall curve for the binocular detector is higher than that of the monocular detector, indicating that using stereo information improves detector performance. For example, at a recall rate of 74.5%, the monocular detector makes 97 false detections, whereas the binocular detector makes only 66. Fig. 14 shows examples of false detections by the monocular detector that are screened out by the binocular detector. The binocular detector screens out false detections in regions that contain different 3D surfaces and detections in regions where the depth is not equal to the expected depth implied by the scale at which the face is detected.

4.4. Computational efficiency

We implemented the detectors using a personal computer equipped with an i5 2.67 GHz CPU and an NVidia GTX-465 graphics card. Input frame sizes were 640 × 480 pixels. The monocular detectors ran at 18 scales. The binocular detectors ran at only 14 scales. The four coarsest scales of the monocular detector were not used, since the images were not big enough to exhibit the implied disparity.
Fig. 11. The best F-measure contour on the training set as θ changes from 0.02 to 0.067 and λ changes from 50 to 150.
Fig. 12. The precision–recall curves on the testing set. The best F-measures for the binocular, stereo matching based and monocular detectors are 0.8273, 0.8276 and 0.8166, respectively.
Fig. 13. (a) Examples of face detections by the multi-view face detector in FDDB. Green bounding boxes are true positives. (b) Examples of faces that are missed by the detector are annotated as red ellipses.
The pyramids were implemented using texture mapping on the graphics card. The graphics card was also used to accelerate computation of the Gabor filter outputs, vector quantization and radial phase encoding. Gabor filtering was implemented as a linear convolution using 5 × 5 pixel masks.

The primary advantage of the radial phase encoding is its computational simplicity. Its performance is comparable to that of VQ. Table 2 compares the time required to process one frame by the monocular face detectors based on VQ and radial phase encoding. The most time consuming steps are building the pyramid and scanning the pyramid. Clustering to merge multiple detections is fast because there are relatively few detections compared to the number of windows. The VQ based detectors are significantly slower than the radial phase encoding based detector due to the time required to build the pyramid, which includes the discretization. Computing the closest codebook vector to the input requires k comparisons, where k is the number of codebook vectors, so the time to build the pyramid scales linearly with k.
Fig. 14. (a) Examples of face detections by the binocular detector. Green bounding boxes are true positives. (b) Examples of face detections by the monocular detector. Red bounding boxes are false positives.
Table 2. Comparison of the computational efficiency of the frontal face detectors (ms).

Frontal face detector   Build pyramid   Scan pyramid   Clustering   Whole detector
VQ, 512 centers         433.042         13.076         0.052        446.17
VQ, 128 centers         110.866         12.501         0.011        123.378
Radial phase            3.731           14.347         0.095        18.173
Table 3. Comparison of the computational efficiency of the multi-view detectors (ms).

Multi-view detector   Build pyramid   Stereo analysis   Scan pyramid   Clustering   Whole detector
Binocular             3.401           11.893            26.718         0.029        42.041
Monocular             3.401           0                 116.271        0.098        119.77
Stereo matching       3.401           38.87             22.345         0.025        64.643
Table 3 compares the time required to process one frame by the binocular and monocular detectors. The time listed for building the pyramid includes the time for
computing the radial phase encoding at all levels of the pyramid for the left camera only. The time listed for stereo analysis includes building the pyramid for the right eye
and combining information from the left and right eyes. Despite the extra time required for stereo analysis, the total time the binocular detector requires to process one frame is shorter than that required by the monocular detector. Most (89.2%) of the windows that would have been passed to the face detector are screened out by the disparity detector. The computation time required for scanning by the binocular detector is 23.0% of the time required by the monocular detector. The time does not scale linearly with the number of windows passed to the face detector, because the windows that remain to be scanned are the more challenging ones. For the monocular detector, only 3.37% of the windows must be scanned by more than one classifier stage. For the binocular detector, this percentage increases to 5.73%.
4.5. Comparison with local stereo matching

One of the key advantages of our approach is that the time required for stereo processing is reduced because the face detector and the stereo disparity detectors are based on the same Gabor filter outputs. This source of efficiency is inspired by the shared feature representation found in biological systems. For a concrete measure of the advantages of this approach, we compared the performance and processing time of our binocular detector with a binocular detector where the stereo matching stage was developed independently of the face detector. This system is represented in Fig. 1, with the switch turned to the lowest position.

We use an implementation of the sum of squared differences (SSD) algorithm for disparity estimation [33]. The disparity of each pixel in the left image is estimated by finding the pixel in the same scan line of the right image where the SSD between the 5 × 5 pixel windows around the two pixels achieves its minimum value. We used 5 × 5 pixel windows to compute the SSD to match the Gabor kernel size. The candidate disparities ranged from 0 to 189 pixels in 1 pixel steps to cover the same disparity range as the disparity pyramid model. The disparity map was calculated at the original image resolution, and then resized to match the pyramid. For each window at each scale, we counted the number of pixels where the estimated disparity fell within the disparity range expected by the scale. If the number of pixels exceeded λ, the window was fed to the monocular face detector.

Fig. 12 shows that the performance of this stereo matching based detector is very similar to that of our binocular detector. However, Table 3 indicates that the time to process one frame is about 54% longer. While the scanning of the pyramid is faster, because only 7.02% of windows must be processed by the face detector, the time for stereo analysis is more than three times longer. In addition to the shared feature representation, our system also benefits from the highly parallel structure of the detector, since exactly the same operations must be applied at each pixel and at every scale. This facilitates parallelization on the GPU. Nonetheless, the time required by this stereo matching detector is still shorter than that required by the monocular detector.
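For reference, here is a brute-force NumPy sketch of the SSD baseline described above. The published version [33] is GPU-accelerated; this naive loop over the 190 candidate disparities is only meant to make the matching rule explicit, and the function name is ours.

```python
import numpy as np
from scipy.signal import convolve2d

def ssd_disparity(left, right, window=5, max_disp=189):
    """Brute-force local matching: for each left-image pixel, choose the
    disparity (0-189 px, 1 px steps) minimizing the SSD between
    window x window patches on the same scan line."""
    h, w = left.shape
    box = np.ones((window, window))
    best_cost = np.full((h, w), np.inf)
    disparity = np.zeros((h, w), dtype=int)
    for d in range(max_disp + 1):
        # Squared difference against the right image shifted by d pixels;
        # columns with no valid match keep a prohibitively large cost.
        diff = np.full((h, w), 1e12)
        diff[:, d:] = (left[:, d:].astype(float) - right[:, :w - d]) ** 2
        cost = convolve2d(diff, box, mode='same')   # sum over local window
        better = cost < best_cost
        best_cost[better] = cost[better]
        disparity[better] = d
    return disparity
```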
5. Conclusion

In this paper, we proposed a multi-view face detector based on Gabor features and the disparity energy model. It makes use of biologically plausible Gabor filters to extract discrete radial phase features. As tested on the standard FDDB dataset, these discrete features outperform other popular features that have been used for face detection.

We also demonstrated that integrating stereo improves both the computational efficiency and the performance of face detection. While the improved performance might not be surprising, given that we exploit more information than is available to a monocular face detector, the improved computational efficiency is surprising, given that we must process twice as much input image data. The key reason for the improved computational efficiency is the rapid filtering of regions unlikely to contain faces based on the input disparity. The computation time of the binocular detector is only 35% of that of the original monocular detector.

We also showed that significant gains in computational efficiency can be achieved by designing the face detection and stereo analysis modules around the same set of image features. Inspired by a similar strategy found in biological systems, we have used an image representation based on the output of Gabor filters. Given the widespread use of Gabor features, this initial feature extraction stage may also be useful for other tasks, such as face recognition, pose estimation and motion analysis. Thus, this work is a promising step towards the development of a general multi-task vision system exhibiting the same efficiency and robustness as biological vision systems.

However, as noted in Section 4.2, the performance of our multi-view detector does not match that of the detector of Huang et al. [6], where features are customized for the face detection task. Based on our results, we expect that the performance of the detector of Huang et al. [6] would also be improved by the use of disparity information. However, it is not clear that the sparse granular features they learned for face detection would be appropriate for extracting disparity information. Thus, an interesting area for future work is the development of algorithms that can effectively improve task-specific performance while maintaining the efficiency of a more general purpose representation capable of representing multiple cues.
Acknowledgments

This work was supported by the Germany/Hong Kong Joint Research Scheme sponsored by the Research Grants Council of Hong Kong and the German Academic Exchange Service (Reference No. G HK014/09), by the Concept for the Future of Karlsruhe Institute of Technology within the framework of the German Excellence Initiative, by the General Research Fund sponsored by the Research Grants Council of Hong Kong (Reference No. 619111), and by the Istanbul Technical University Research Fund (Reference No. 36123).

References

[1] P. Viola, M. Jones, Robust real-time face detection, International Journal of Computer Vision 57 (2) (2004) 137–154.
[2] C. Küblbeck, A. Ernst, Face detection and tracking in video sequences using the modified census transformation, Image and Vision Computing 24 (6) (2006) 564–572.
[3] B. Wu, H. Ai, C. Huang, S. Lao, Fast rotation invariant multi-view face detection based on real adaboost, in: Sixth IEEE International Conference on Automatic Face and Gesture Recognition, 2004, pp. 79–84.
[4] H. Rowley, S. Baluja, T. Kanade, Rotation invariant neural network-based face detection, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 1998, pp. 38–44.
[5] M. Jones, P. Viola, Fast Multi-view Face Detection, Technical Report TR2003-96, Mitsubishi Electric Research Laboratories, 2003.
[6] C. Huang, H. Ai, Y. Li, S. Lao, High-performance rotation invariant multiview face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (4) (2007) 671–686.
[7] S. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, H. Shum, Statistical learning of multi-view face detection, in: European Conference on Computer Vision, Citeseer, 2002.
[8] M. Osadchy, Y. Cun, M. Miller, Synergistic face detection and pose estimation with energy-based models, The Journal of Machine Learning Research 8 (2007) 1197–1215.
[9] J. Daugman, Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters, Journal of the Optical Society of America A: Optics and Image Science 2 (1985) 1160–1169.
[10] I. Ohzawa, G. DeAngelis, R. Freeman, Stereoscopic depth discrimination in the visual cortex: neurons ideally suited as disparity detectors, Science 249 (4972) (1990) 1037.
[11] E. Adelson, J. Bergen, Spatiotemporal energy models for the perception of motion, Journal of the Optical Society of America A: Optics and Image Science 2 (2) (1985) 284–299.
[12] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, T. Poggio, Robust object recognition with cortex-like mechanisms, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (3) (2007) 411–426.
[13] B. Manjunath, W. Ma, Texture features for browsing and retrieval of image data, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (8) (1996) 837–842.
[14] C. Zhang, Z. Zhang, A Survey of Recent Advances in Face Detection, Technical Report, Microsoft Research, June 2010. URL: 〈http://research.microsoft.com/pubs/132077/facedetsurvey.pdf〉.
[15] L. Huang, A. Shimizu, H. Kobatake, Robust face detection using Gabor filter features, Pattern Recognition Letters 26 (11) (2005) 1641–1649.
[16] B. Kwolek, Face detection using convolutional neural networks and Gabor filters, in: W. Duch, J. Kacprzyk, E. Oja, S. Zadrożny (Eds.), Artificial Neural Networks: Biological Inspirations, Springer, 2005, pp. 551–556.
[17] L. Xiaohua, K. Lam, S. Lansun, Z. Jiliu, Face detection using simplified Gabor features and hierarchical regions in a cascade of classifiers, Pattern Recognition Letters 30 (8) (2009) 717–728.
[18] J. Chen, S. Shan, P. Yang, S. Yan, X. Chen, W. Gao, Novel face detection method based on Gabor features, in: S. Li, Z. Sun, T. Tan, S. Pankanti, G. Chollet, D. Zhang (Eds.), Advances in Biometric Person Authentication, Springer, 2004, pp. 90–99.
[19] K. Mikolajczyk, C. Schmid, A. Zisserman, Human detection based on a probabilistic assembly of robust part detectors, in: European Conference on Computer Vision, vol. 3021, Springer, 2004, pp. 69–82.
[20] H. Schneiderman, T. Kanade, A statistical method for 3D object detection applied to faces and cars, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, IEEE, 2000, pp. 746–751.
[21] V. Jain, E. Learned-Miller, FDDB: A Benchmark for Face Detection in Unconstrained Settings, Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
[22] M. Bajracharya, B. Moghaddam, A. Howard, S. Brennan, L. Matthies, Results from a real-time stereo-based pedestrian detection system on a moving vehicle, in: IEEE ICRA Workshop on People Detection and Tracking, 2009.
[23] W. Burgin, C. Pantofaru, W. Smart, Using depth information to improve face detection, in: Proceedings of the 6th International Conference on Human–Robot Interaction, ACM, 2011, pp. 119–120.
[24] X. Wang, B. Shi, GPU implementation of fast Gabor filters, in: Proceedings of 2010 IEEE International Symposium on Circuits and Systems, IEEE, 2010, pp. 373–376.
[25] F. Jiang, Stereo images with labeled faces in the wild, June 2011. URL: 〈http://dl.dropbox.com/u/27439811/WholeSet.zip〉.
[26] X. Wang, X. Tang, Bayesian face recognition using Gabor features, in: Proceedings of the 2003 ACM SIGMM Workshop on Biometrics Methods and Applications, ACM, 2003, pp. 70–73.
[27] R. Gross, I. Matthews, J. Cohn, T. Kanade, S. Baker, Multi-PIE, Image and Vision Computing 28 (5) (2010) 807–813.
[28] F. Jiang, B. Shi, M. Fischer, H. Ekenel, Effective discretization of Gabor features for real-time face detection, in: 18th IEEE International Conference on Image Processing, 2011, pp. 2057–2060.
[29] M. Yang, D. Roth, N. Ahuja, A SNoW-based face detector, in: Advances in Neural Information Processing Systems, vol. 12, MIT Press, 2000, pp. 855–861.
[30] L. Farkas, J. Posnick, T. Hreczko, Anthropometric growth study of the head, The Cleft Palate-Craniofacial Journal 29 (4) (1992) 303–308.
[31] A. Klaus, M. Sormann, K. Karner, Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure, in: 18th International Conference on Pattern Recognition, vol. 3, IEEE, 2006, pp. 15–18.
[32] Z. Wang, Z. Zheng, A region based stereo matching algorithm using cooperative optimization, in: IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2008, pp. 1–8.
[33] J. Stam, Stereo Imaging with CUDA, January 2008. URL: 〈http://openvidia.sourceforge.net/index.php/OpenVIDIA〉.
[34] X. Mei, C. Cui, X. Sun, M. Zhou, Q. Wang, H. Wang, On building an accurate stereo matching system on graphics hardware, in: ICCV Workshop on GPU in Computer Vision Applications, 2011.
[35] D. Scharstein, R. Szeliski, A taxonomy and evaluation of dense two-frame stereo correspondence algorithms, International Journal of Computer Vision 47 (1) (2002) 7–42.
[36] D. Fleet, D. Heeger, H. Wagner, Modelling binocular neurons in the primary visual cortex, in: M. Jenkin, L. Harris (Eds.), Computational and Psychophysical Mechanisms of Visual Coding, Cambridge University Press, 1997, pp. 103–130.
[37] N. Qian, Computing stereo disparity and motion with known binocular cell properties, Neural Computation 6 (3) (1994) 390–404.
[38] E. Tsang, B. Shi, Normalization enables robust validation of disparity estimates from neural populations, Neural Computation 20 (10) (2008) 2464–2490.
[39] J. Chen, R. Wang, S. Yan, S. Shan, X. Chen, W. Gao, Enhancing human face detection by resampling examples through manifolds, IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 37 (6) (2007) 1017–1028.
[40] M. Weber, Caltech background image dataset, October 2003. URL: 〈http://www.vision.caltech.edu/html-files/archive.html〉.
[41] C. Van Rijsbergen, Information Retrieval, Butterworths, 1979.