Region-based face detection

Pattern Recognition 35 (2002) 2095–2107
www.elsevier.com/locate/patcog

Olugbenga Ayinde (a), Yee-Hong Yang (b,*)

(a) SMART Technologies Inc., Calgary, Alberta, Canada T2R 1K9
(b) Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2E8

Received 12 September 2000; received in revised form 20 September 2001

Abstract

Face detection is a challenging task, and several approaches have been proposed for it. Some approaches are only good for one face per image, while others can detect multiple faces in an image at a greater price in terms of training. In this paper, we present an approach that can be used for single or multiple face detection in simple or cluttered scenes. Faces of different sizes located in any part of an image can be detected using this approach. Three test sets are used to evaluate the system. The system has a detection rate of 100% on test set A, containing 200 good-quality images (200 faces) with simple backgrounds. Test set B contains 23 images (149 faces) with cluttered backgrounds and a mixture of high- and low-quality images; a detection rate of 66.4% is obtained on this set. Test set C is a selection of 22 high-quality images (54 faces) from different sources, including the World Wide Web, and a detection rate of 90.7% is obtained. © 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Correlation; Face detection; Region-based recognition; Symmetry; Training

1. Introduction

1.1. Importance of face detection

Face detection is the process of recognizing a pattern in an image as a face pattern and subsequently segmenting it from the rest of the image. Human face detection is a challenging exercise because the face can have varying pose, size, skin color, and facial expression. In addition, other factors, such as the wearing of glasses, the presence or absence of hair, and occlusion, can make the appearance of the face unpredictable. Other effects, such as varying lighting conditions and the complexity of the scene containing the face, can make the generalization of face detection algorithms difficult.
Automatic face detection is the first major step in an automatic face recognition system. Coarsely speaking, the success of a recognition technique depends, to an extent, on the detection scheme used to extract the face area from an image. For instance, consider a situation in which the person contained in the image in Fig. 1 is to be identified. If there is no way to accurately locate and extract the face area from the rest of the image, it becomes difficult to identify the person.
 The work was done while the authors were at the Department of Computer Science, University of Saskatchewan, Saskatoon, Saskatchewan, Canada S7N 5A9. ∗ Corresponding author. Tel.: +1-780-492-3059; fax: +1-780-492-1071. E-mail address: [email protected] (Y.-H. Yang).

Fig. 1. Image containing a face.

1.2. Problem definition

Many face detection techniques are only good for detecting and extracting one face from an image.

Others can handle multiple face detection, but require too many training images before a good percentage of the faces in the test images can be detected. A face detection technique that can handle single or multiple faces per image with just a few training images is needed. In this paper, we present a technique, called the region-based face detection technique, which uses few positive and negative training images and a pre-defined model for the face pattern. The technique can handle single or multiple face detection.

The rest of this paper is organized as follows: Section 2 reviews previous research on automatic face detection, Section 3 presents the region-based face detection technique, and Section 4 presents the conclusions and future work.

2. Previous research on face detection

Although every normal face has similar features (two eyes, a nose and a mouth), faces have different shapes, skin colors, expressions, and facial hair (mustache or beard). All these variations make face detection a challenging task.

2.1. Face detection techniques

Numerous approaches have been used for automatic face detection in simple and complex scenes. Each approach uses a set of parameters to represent the face. Some of the common facial models used in the literature are based on facial texture, statistical patterns, facial feature extraction, skin color, template matching, and facial and/or non-facial examples. Depending on the closeness of a pattern under consideration to the defined model, the pattern is taken to be a face or is regarded as a non-face pattern.

Template-matching approaches use models of face patterns stored in the form of templates representing the whole face or facial features. Fixed [1] or deformable [2,3] templates are used for the detection of local facial sub-features that have approximately fixed appearance. Correlation of a test pattern with the face template involves computing a measure of disparity between the face pattern and the test pattern; a threshold is set for the degree of disparity that can be associated with a face.

Feature extraction approaches involve the extraction of landmark features of the face. The global information is then used to compare the geometry connecting the extracted facial features with the expected geometry of a face [2,4–6].

features with the expected geometry of a face [2,4 – 6]. Feature extraction is usually done using edge detection and linking. These approaches are applicable to images having simple backgrounds and one face per image. Extraction of the face area from a head-and-shoulder image is a common way of applying such approaches. For images with multiple faces and cluttered backgrounds, these methods are unreliable since the facial features may not easily be detected independently. Example-based approaches use both positive and negative examples of face patterns to train the face detection system. Negative examples are usually obtained from some false detection results produced by the system at the initial stages of training [7–9]. The region-based face detection approach presented in this paper is also an example-based approach. The di/erence between the region-based method and earlier example-based approaches is that the face model used can capture common face patterns using few training images (29 positive and 124 negative examples). This is in contrast to existing techniques that require many training examples (usually thousands of positive and negative examples) before a reliable face model can be obtained. Other approaches are based on the use of facial texture [10], color information [11–14], and statistical patterns [15,16]. In addition, there are techniques based on the use of information extracted from log-polar images [17,18], and shape information [19]. 2.2. Problems with existing face detection approaches Template-matching, feature extraction and most of the other approaches are particularly good for detecting single or few faces from an image. Example-based approaches are generally good, and have been used for single or multiple face detection from an image. Despite the advantages, example-based approaches require huge number of positive and negative training examples, thereby making the training process time consuming and challenging. This is one of the major aims of our region-based technique—single or multiple face detection without using many training images (examples). 3. Region-based face detection Face detection involves locating and extracting face(s) from an image. To detect a face from an image, the whole image is scanned at di/erent scales for face patterns. The initial size of the window used to scan the image is N × N pixels (N is chosen to be 20 in this implementation, but other values can be used). Hence, the minimum size of a face that can be detected is N × N pixels. The minimum window size that is to be chosen based on the minimum size of a face is expected to have in the particular image set on which the technique is to be used. The system scans for faces at one-pixel increments horizontally and vertically.


Whenever a face pattern is detected, the size and location of the current search window are noted. The window is then gradually enlarged to scan the image at different scales until either the size of the image has been exceeded or the scan window has reached its maximum size. The number of iterations, which is specified by the user, determines the maximum size of the scan window.

Let the minimum size of face that can be detected be N × N. At every enlargement step, the window size is increased by a factor F horizontally and vertically (F is chosen to be 1.2 in this implementation, but the user can use other enlargement factors). The value of F must be greater than 1, but setting it too high will cause some faces to be missed. The image is searched at different scales as the enlargement process progresses. Note that square windows are used for searching, so an enlargement factor of F multiplies both the height and the width of the previous window by F, and the image is searched using windows of sizes N × N, NF × NF, NF² × NF², NF³ × NF³, etc. Another way of handling this is to shrink the whole image by F horizontally and vertically at each step and search for N × N windows. Either way, the time taken to search the image decreases as the search-window enlargement (or image shrinking) process progresses, because the number of windows to examine decreases as the window size gets larger (or as the image gets smaller if the shrinking method is used).

The user may also want to modify the way enlargement is done. For instance, the user can add a constant value k to the height and width of the window at each step (i.e. N × N, (N + k) × (N + k), (N + 2k) × (N + 2k), etc.), or use windows whose height and width are integral multiples of the original (i.e. N × N, 2N × 2N, 3N × 3N, etc.). However, the change in size from one window to the next may then be too significant, and some faces can be missed, which justifies the geometric enlargement chosen for this implementation.

Each window that is to be analyzed for the presence or absence of a face pattern has to be normalized. Three normalization steps are carried out on all test windows: resizing to N × N pixels, histogram equalization, and masking with a binary circular mask.

3.1. Resizing

The portion of the image in each search window is automatically resized to N × N pixels, irrespective of its original size. The resizing algorithm is based on linear interpolation from the original (M × M)-pixel window to an (N × N)-pixel window.
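A minimal sketch of such a resizing step follows, assuming separable linear (bilinear) interpolation over a square gray-level window; the function name and array conventions are ours, not the paper's.

```python
import numpy as np

def resize_bilinear(window, n=20):
    """Resize a square gray-level window to n x n by linear interpolation."""
    m = window.shape[0]
    w = window.astype(float)
    pos = np.linspace(0.0, m - 1.0, n)   # sample positions in the source
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, m - 1)
    t = pos - lo                         # fractional part for blending
    # Interpolate along rows first, then along columns.
    rows = w[lo, :] * (1.0 - t)[:, None] + w[hi, :] * t[:, None]
    return rows[:, lo] * (1.0 - t)[None, :] + rows[:, hi] * t[None, :]
```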

3.2. Histogram equalization

Histogram equalization compensates for changes in illumination brightness and differences in camera input gains. It enhances the image by increasing the dynamic range of the pixels [20]. Fig. 2 shows the result of carrying out histogram equalization on a facial image.

Fig. 2. (a) Original image. (b) Image after histogram equalization.

3.3. Masking

To eliminate the pixels that are most likely affected by background information, the (N × N)-pixel binary circular mask shown in Fig. 3 is applied to each resized face image. This ensures that no unwanted background information is included in the face pattern when the measure of symmetry, one of the parameters used for the detection of a face, is computed. Note that the mask is used only for the symmetry measure, not for computing the other parameters.

Fig. 3. (a) Binary mask. (b) Original face image. (c) Masked face image.
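A compact sketch of the remaining two normalization steps is given below (hypothetical Python; names are ours). It assumes 8-bit gray levels and a centered disc mask, and it reuses resize_bilinear from the sketch in Section 3.1; per the paper, the mask is applied only when the symmetry measure is computed.

```python
import numpy as np

def equalize_histogram(window):
    """Stretch an 8-bit window's gray levels over the full 0..255 range."""
    flat = window.astype(np.uint8).ravel()
    hist = np.bincount(flat, minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    scale = max(flat.size - cdf_min, 1)
    lut = np.clip(np.round((cdf - cdf_min) / scale * 255.0), 0, 255)
    return lut.astype(np.uint8)[flat].reshape(window.shape)

def circular_mask(n=20):
    """Boolean disc that keeps the central region and drops the corners."""
    y, x = np.ogrid[:n, :n]
    c = (n - 1) / 2.0
    return (x - c) ** 2 + (y - c) ** 2 <= (n / 2.0) ** 2

def normalize_window(window, n=20):
    """Resize then equalize; the circular mask is applied later,
    and only for the symmetry measure (Section 3.3)."""
    return equalize_histogram(resize_bilinear(window, n))
```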

3.4. Training

The region-based technique is an example-based technique requiring both positive and negative training examples. The purpose of the positive and negative examples is to show the system examples of common face patterns and of non-face patterns that can easily be detected as a face. The face patterns cover the region from the eyebrows to the mouth, since that portion can be detected reliably in most commonly found images; other regions of the face are susceptible to occlusion by long hair, a mustache, or a beard. A few selected images containing face patterns and other patterns are also required to determine thresholds for the parameters used.

The positive training samples are carefully chosen to span as much of the face space as possible. To achieve this, the faces are taken from different sources, and they represent the common shapes and facial feature configurations that can represent many faces. Our technique can easily be adapted to detecting other patterns of faces by re-training the system with a few positive and negative examples of the variations expected across the face class. However, in doing this, one set of parameters, the one describing the darkness of the eye and cheek regions, has to be left out, and some selected images are again needed for the determination of thresholds.

Faces collected from different sources are used for training, selected to capture common variations in face pattern. Training faces are resized and histogram equalized, and the resulting images are used as positive training examples for the system. The positive training set includes upright faces and slightly tilted faces that capture the shapes and patterns of commonly found faces. In addition, some negative training examples are used to reduce the number of non-face patterns detected as faces. See Fig. 4 for examples. The negative training examples are obtained from some of the false detection results produced by the system during preliminary runs; the same normalization and enhancement steps described above are carried out on patterns wrongly detected as faces. Sung and Poggio present the advantage of using both positive and negative training examples over using positive training examples alone [9].

Fig. 4. (a) Face patterns and (b) non-face patterns.

For the experiments presented in this paper, 29 positive and 124 negative training examples are used. The results obtained are very encouraging despite the small number of images required for training. This is intriguing considering the vast number of training images required by many example-based techniques to achieve similar results. For instance, over 16,000 positive and 9000 negative training examples are used in Ref. [8], and over 4000 positive and over 6000 negative examples are used in Ref. [9].

Fig. 5. Face pattern showing all six regions.

Five conditions have to be satisfied before the content of a search window can be classified as a face:

• The parameter rpos, indicating the correlation between the search window pattern and the positive training patterns, must be greater than or equal to an experimentally determined threshold tpos.
• The parameter rneg, indicating the correlation between the search window pattern and the negative training patterns, must be less than an experimentally determined threshold tneg.
• The parameter for the symmetry measure, S, must be less than or equal to an experimentally determined threshold T.
• rpos must be greater than rneg.
• Some regions of the pattern (corresponding to the eyes and eyebrows) must be darker than other regions (corresponding to the cheeks), as observed on most commonly found faces.

3.5. Computation of parameters

3.5.1. Correlation values

rpos is the maximum linear correlation between the search window pattern and the positive training examples, and rneg is the maximum linear correlation between the search window pattern and the negative training examples. The linear correlation between two vectors X = {x_i, i = 1, ..., n} and Y = {y_i, i = 1, ..., n} with means \bar{X} and \bar{Y}, respectively, is given by

r(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{X})^2 \, \sum_{i=1}^{n} (y_i - \bar{Y})^2}}.

The value of r lies between −1 and 1: r = 1 implies total positive correlation, r = −1 implies total negative correlation, and r = 0 means X and Y are uncorrelated.
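As a sketch, the correlation parameters can be computed directly from this formula (hypothetical Python; function names are ours):

```python
import numpy as np

def linear_correlation(x, y):
    """Linear (Pearson) correlation between two equal-length patterns."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc) / np.sqrt(float(xc @ xc) * float(yc @ yc))

def max_correlation(window, examples):
    """rpos (or rneg): the maximum correlation of the window pattern
    with the positive (or negative) training examples."""
    return max(linear_correlation(window, ex) for ex in examples)
```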

3.5.2. Symmetry measure

The symmetry measure S is determined using a simple technique based on the sums of gray levels in the six parts into which each face is divided. Fig. 5 shows a block diagram representing the face. Six values L1, R1, L2, R2, L3, R3, representing the sums of gray levels in the six regions, are computed. L1, L2 and L3 constitute the left half of the pattern, while R1, R2 and R3 constitute the right half. Before computing the sums, the circular mask shown in Fig. 3 is applied to the pattern in the window; this eliminates unnecessary background and hair information that can greatly affect the symmetry measure.

Three ratios L1/R1 (or R1/L1), L2/R2 (or R2/L2) and L3/R3 (or R3/L3) are computed, in each case placing the greater of the two values (Ln or Rn, n = 1, 2, or 3) in the numerator. Using the greater value as the numerator avoids assuming that either the left half or the right half of the face is the darker one; since it is unpredictable which half of a face is darker, such a limiting assumption is best avoided. Each of the three ratios is therefore greater than or equal to 1.0. The system does not consider patterns in which any of the three ratios falls outside the range 1.0 to T (a threshold determined during initial test runs) as candidates for further processing. The threshold used in this implementation is 1.25, but it can be adjusted by the user.

Since symmetry is one of the criteria used for classification, the technique is best suited to faces that are frontal or nearly frontal with minimal rotation effects. However, it can be extended to handle significant two-dimensional rotation, although this is beyond the scope of this paper. Rotation can be handled by rotating the portion of the image under consideration by different angles (for example at 10°/−10° intervals starting from 0°) and trying to match the results with the training images. Doing this can result in the detection of faces with significant rotation effects, or even inverted faces, but there is a computational price to pay.

3.5.3. Expected face pattern

The pattern defined for the face uses the assumption that the upper part of the window, corresponding to the eyes and eyebrows, is darker than the region corresponding to the cheeks (see Fig. 6). In other words, region 1 is expected to be darker than regions 2 and 3. Initially, a further assumption, that the mouth region (region 4) is darker than the cheeks region, was also used, but a number of faces, such as the one shown in Fig. 7, violate it. Hence, this additional assumption is dropped.

Fig. 6. Dark and light regions of a face pattern.

Fig. 7. Image with a lighter mouth region than the cheeks region.
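The following sketch computes the symmetry ratios and the eye/cheek darkness test (hypothetical Python). The exact geometry of the six regions in Fig. 5 and of regions 1–3 in Fig. 6 is not fully specified in the text, so the equal-thirds band layout used here is our assumption.

```python
import numpy as np

def symmetry_ratios(window, mask):
    """The three left/right gray-level-sum ratios, each >= 1.0."""
    masked = np.where(mask, window.astype(float), 0.0)
    half = masked.shape[1] // 2
    ratios = []
    for band in np.array_split(masked, 3, axis=0):   # assumed band layout
        left, right = band[:, :half].sum(), band[:, half:].sum()
        hi, lo = max(left, right), min(left, right)
        ratios.append(hi / max(lo, 1.0))             # greater value on top
    return ratios

def is_symmetric(window, mask, t=1.25):
    """Reject the pattern if any ratio falls outside [1.0, T]."""
    return all(r <= t for r in symmetry_ratios(window, mask))

def satisfies_face_pattern(window):
    """Eye/eyebrow band expected darker (lower mean gray) than cheeks."""
    bands = np.array_split(window.astype(float), 3, axis=0)
    return bands[0].mean() < bands[1].mean()
```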

3.6. Determination of thresholds

Initial thresholds for the adjustable parameters are set arbitrarily for the preliminary runs of the system. The system is then used to detect faces in some selected images. Based on the correct detections and false detections, the threshold values are gradually changed until the observed results are considerably good. For instance, setting tpos to 0.0 and tneg to 1.0 will cause many windows that are not faces to be labeled as faces. Gradually increasing tpos and reducing tneg eliminates some of the false detections while retaining the detected faces, until optimal values of tpos and tneg are obtained. Threshold selection is a trade-off between reducing the number of false detections and missing some faces: setting tpos too low and tneg too high will cause many false detections, while setting tpos too high and tneg too low will cause many faces to be missed.

The threshold T for the three ratios discussed in Section 3.5.2, L1/R1 (or R1/L1), L2/R2 (or R2/L2) and L3/R3 (or R3/L3), is also determined from the selected images used for threshold determination. The minimum value the threshold can take is 1.0, at which only patterns that are perfectly symmetrical are detected as faces. Setting the threshold to a high value initially (2.0 in this implementation, but any other value chosen by the user can be used) results in the detection of some patterns that are not faces. Gradually lowering the value reduces such incorrect detections until an acceptable value is reached that works for a range of patterns. If the value is set lower than necessary, many face patterns are missed. The thresholds are set to values such that the face patterns in the selected images are correctly detected and non-face patterns are not. Once the thresholds have been determined, the values are kept for use in subsequent experimental runs.

3.7. Merging and elimination of some windows

It is common to have multiple detections of the same face at different scales and with small horizontal and/or vertical displacements. This makes it necessary to merge windows that can be counted as enclosing the same face. The decision criterion determines whether or not there is a substantial overlap between two or more windows. The position of every window that has been detected to contain a face is compared with the positions of all the other detected windows. If two or more windows have top left corners that are close to one another and the scale difference is not too significant, they are merged. Merging two or more windows means averaging their sizes and starting coordinates; the resultant effect of the merging process is the removal of the original windows and their replacement with the average.


Fig. 8. Merging of detected windows.

Fig. 9. Window elimination.

The process is repeated until all the windows marked for this stage of processing have been used. Fig. 8 presents examples of situations requiring the merging of detected windows.

Merging windows detected for the same face at different scales requires two constraints. First, the starting (x, y) locations of the two windows should be at most a distance D from each other (D is chosen to be 20 in this implementation, but other values can be used). Values of D that are too low will lead to multiple detections of the same face at different scales, which may not be desirable for some applications (such as face recognition). Unnecessarily large values of D will allow windows detected for different faces to be merged as belonging to the same face; this shifts both windows and reduces the detection accuracy. The value of D can be adjusted depending on how close together the faces in the image are expected to be.

The other constraint for merging is that the ratio of the sizes of the two windows should lie between R1 and R2, where size refers to the width or height of the window (square windows are used in this system). For this implementation, R1 and R2 are chosen to be 0.4823 and 2.0736, respectively, though the user can change these values. Two windows are not expected to be more than four enlargement steps apart if they are to be merged, so the maximum size ratio is R2 = F⁴ (with the enlargement factor F = 1.2 used in this implementation, R2 = 2.0736) and the minimum ratio is R1 = 1/R2 (0.4823 for this implementation). These values are based on the observation that most of the detections enclosing a face occur at small displacements from one another, and the scale differences are not usually too significant.

On the other hand, some windows overlap other windows without satisfying the conditions above. One way to handle this situation is to eliminate the windows that are considered insignificant. The assumption used in this paper is that two faces in an image do not usually overlap. This assumption may not always hold, since there are practical situations with overlapping faces in an image, but cases requiring window elimination are not common. An example of face detection involving window elimination is shown in Fig. 9.
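A sketch of the merging test and the averaging step just described (hypothetical Python; the tuple representation (x, y, size) is our assumption):

```python
import numpy as np

def can_merge(w1, w2, d=20, factor=1.2):
    """True if two detections (x, y, size) satisfy both merging constraints."""
    (x1, y1, s1), (x2, y2, s2) = w1, w2
    r2 = factor ** 4                    # at most four enlargement steps apart
    close = abs(x1 - x2) <= d and abs(y1 - y2) <= d
    ratio = s1 / s2
    return close and (1.0 / r2) <= ratio <= r2

def merge(windows):
    """Replace a group of mergeable windows by their average."""
    return tuple(np.asarray(windows, dtype=float).mean(axis=0))

# Example: two detections of the same face at neighbouring scales.
# can_merge((10, 12, 20), (12, 10, 24)) -> True
# merge([(10, 12, 20), (12, 10, 24)])   -> (11.0, 11.0, 22.0)
```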

The significance of a window is determined from its placement and size compared with the other overlapping windows. If the top left corner of the window is far from the corresponding corners of the other overlapping windows and/or the window is much smaller than the other overlapping windows, it is discarded and excluded from further computations. Specifically, if the top left corner of a small window is within a distance of D pixels both horizontally and vertically from the top left corner of a much larger overlapping window and the ratio of their sizes is less than R1, the smaller window is eliminated. If the distance between the top left corners of two overlapping windows is greater than D pixels either horizontally or vertically and the ratio of their sizes is less than R3 = 1/F² (i.e. the smaller window is more than two enlargement steps away), the smaller window is also eliminated.

The window merging and elimination process still allows detection results to contain overlapping windows; however, all the overlapping windows in the final result must be significant with respect to the criteria above. The merging process is designed to average windows that do not have significant displacements from one another, and different sets of windows can be merged to give overlapping windows in the final output. The elimination process is aimed only at discarding windows that would distort the detection result if averaged with other overlapping windows.

3.8. Summary of algorithms

The algorithms used for face detection, threshold determination, and window merging/elimination are summarized below.

Algorithm used for face detection

Step 1: Determine thresholds for the adjustable parameters: the maximum linear correlations with positive and negative training examples and the symmetry measure. Set the initial window size to N × N pixels.
Step 2: Using the current window size, scan the whole image starting from the top left corner, sliding the window at one-pixel increments horizontally and vertically until the image has been covered.
Step 3: At each location, normalize the window pattern and compute the maximum linear correlations with the positive and negative examples, the symmetry measure, and the parameter that indicates whether or not the expected face pattern is satisfied.


Step 4: If all the conditions in Step 3 fall within acceptable thresholds, mark the window location for use in Step 6.
Step 5: Enlarge the window size by a factor of F and repeat Steps 2–4 until either the specified number of iterations has been completed or the size of the image (either width or height) has been exceeded.
Step 6: Merge and/or eliminate windows if necessary.

Algorithm used for threshold determination

Step 1: Set tpos to 0.0, tneg to 1.0, and the symmetry parameter to T.
Step 2: Using the current value of tpos as the only threshold, detect patterns whose correlation values with the positive examples are greater than or equal to tpos.
Step 3: Gradually increase tpos and repeat Step 2 until all the faces in the selected images for threshold determination are correctly detected and the fewest possible false detections are obtained. The optimal value of tpos is noted.
Step 4: Using the current value of tneg as an additional threshold alongside the optimal value of tpos, detect patterns whose correlation values with the negative examples are less than tneg.
Step 5: Gradually decrease tneg and repeat Step 4 until most of the non-face patterns originally detected as faces in the selected images are eliminated. The optimal values of tpos and tneg are noted.
Step 6: Using the current value of the symmetry parameter as the threshold, together with the optimal values of tpos and tneg, detect patterns whose symmetry parameters have values between 1.00 and the current threshold T.
Step 7: Gradually decrease the symmetry threshold and repeat Step 6 until most of the non-face patterns originally detected as faces in the selected images are eliminated. The optimal values of tpos, tneg and the symmetry parameter are noted.

Algorithm used for window merging and elimination

Step 1: Recall the starting coordinates and sizes of all the windows that have been detected as faces.
Step 2: If the top left corner of a window is within a distance of D pixels both horizontally and vertically from the top left corner of another overlapping window and the ratio of their sizes is less than R1, eliminate the smaller window.
Step 3: If the distance between the top left corners of two overlapping windows is greater than D pixels either horizontally or vertically and the ratio of their sizes is less than R3, eliminate the smaller window.
Step 4: Repeat Steps 2 and 3 until all the windows have been processed.
Step 5: If the top left corners of two or more windows are within a distance of D pixels both horizontally and vertically from each other and the ratio of their sizes lies between R1 and R2, average the starting coordinates and sizes of the windows.

Step 6: Remove the overlapping windows that have been averaged and replace them with the averaged window.

3.9. Experiments

Three sets of experiments are used to evaluate the system. Test set A is a subset of the face database collected by the Computer Vision Lab at Tsinghua University, China. There are 200 good-quality images in this test set. The images have simple backgrounds and one face per image, and the faces have different sizes and locations within the images. Our system gives an impressive result on this test set: the detection rate achieved is 100% with two false detections. Some of the detection results obtained are shown in Fig. 10.

Fig. 10. Detection results of test set A.

Test set B is a set of 23 images containing 149 faces from the CMU database. The images are grouped under the test-low directory and were originally provided by Kah-Kay Sung and Tomaso Poggio at the MIT AI Lab. Some of the images have good quality and good lighting and some do not. Test set C is a collection of 22 images containing 54 faces randomly selected from other CMU test sets and from the World Wide Web; the images in test set C are good-quality images with good lighting conditions. Many of the faces in test sets B and C are real human faces and a few are hand-drawn sketches of faces.

The detection rate obtained for test set B is 66.4% with 97 false detections. Our system has a detection rate of 90.7% on test set C with 5 false detections. Figs. 11 and 12 present some of the results obtained for test sets B and C, respectively.


Fig. 11. Detection results of test set B.

Fig. 12. Detection results of test set C.

3.9.1. Analysis of experimental results

Selected test results that deviate from the expected are discussed here. Some of the gray-scale patterns that are falsely detected may not necessarily look like faces, but after carrying out histogram equalization on the window, many of the patterns are similar to faces. Some examples are presented in Fig. 13.

Fig. 13. False detection examples showing (a) the original gray-scale images, and (b) the corresponding histogram-equalized images.

In addition, it is important to note that most of the faces missed have strong illumination shadows or some hair covering part of the face, and some have low image quality. Examples are shown in Fig. 14. By easing off some of the conditions for face detection, such as the symmetry measure and the defined face pattern, some of these faces could be detected; however, easing off the conditions may imply having more false detections.

Fig. 14. Examples of faces missed by the system.

3.10. Timing considerations

The face detection system is implemented in C++. The system takes a long time to process large images, though for smaller images the processing time is not so long. The window-sliding technique, which scans the pattern at different locations in the image at one-pixel increments horizontally and vertically, is the major reason for the long processing time. All the timing measurements presented were made on a Gateway (E-5400) dual Pentium III 733 MHz system with 256 MB RAM running Windows 2000.

Tables 1–4 present the time taken to process images of different sizes; in each table, the indicated number of enlargement iterations is used. The number of enlargement iterations determines the maximum size of face that can be detected by the system, and the user sets this value at runtime. The initial size of the search window is N × N pixels, and at each subsequent iteration the window size is enlarged by a factor of F, both horizontally and vertically.

Table 1. Processing times for images using 2 enlargement iterations

Size of image (pixels)   Number of processed windows   Time taken (s)
60 × 60                  4139                          20
120 × 90                 19,529                        100
340 × 350                311,009                       1500

Table 2. Processing times for images using 3 enlargement iterations

Size of image (pixels)   Number of processed windows   Time taken (s)
60 × 60                  4923                          24
120 × 90                 24,633                        124
340 × 350                408,953                       2100

Table 3. Processing times for images using 4 enlargement iterations

Size of image (pixels)   Number of processed windows   Time taken (s)
60 × 60                  5407                          27
120 × 90                 28,897                        145
340 × 350                503,177                       2580

Table 4. Processing times for images using 5 enlargement iterations

Size of image (pixels)   Number of processed windows   Time taken (s)
60 × 60                  5632                          29
120 × 90                 32,272                        165
340 × 350                593,152                       3000

The number of iterations also determines the processing time. Larger values for the number of iterations increase the processing time, but not in a linear fashion, as shown in Fig. 15. As the iterations progress, the window size gets bigger and the number of windows to be examined becomes smaller, so each later iteration takes less time than the earlier ones. If the faces to be detected are not expected to be too big (such as in Fig. 12(a)), there is no need to set the number of enlargements too large, since that only adds to the computation time. All the test images used in the paper are processed with 3 enlargement steps, with the exception of a few images, such as those in Figs. 12(a) and (i), which are processed with 7 enlargement steps.

Fig. 15. Processing time versus number of iterations for (a) 60 × 60, (b) 120 × 90, and (c) 340 × 350-pixel images.
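The window counts in Tables 1–4 can be reproduced by summing the number of window positions at each scale, with the window size truncated to an integer after every enlargement. A sketch follows (hypothetical Python; the counting loop is our reconstruction, but it reproduces the table entries exactly):

```python
def windows_examined(width, height, n=20, factor=1.2, iterations=3):
    """Count the search windows examined over all scales."""
    total, size = 0, n
    for _ in range(iterations + 1):
        if size > width or size > height:
            break
        total += (width - size + 1) * (height - size + 1)
        size = int(size * factor)      # sizes 20, 24, 28, 33, 39, 46, ...
    return total

# windows_examined(60, 60, iterations=2)    -> 4139    (Table 1)
# windows_examined(120, 90, iterations=3)   -> 24633   (Table 2)
# windows_examined(340, 350, iterations=5)  -> 593152  (Table 4)
```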


3.11. Comparison with other systems

Our system slides a window across the image horizontally and vertically to determine whether or not a face is present at each location in the image. Rowley et al. use a neural network-based approach [8], which involves training the system with both positive and negative patterns. Each image is likewise examined by sliding a window across the image.


Arbitration among multiple networks reduces the number of false detections (false positives) in the final decision. Their system, like ours, can detect single or multiple faces in simple or complex backgrounds. An example-based learning approach proposed by Sung and Poggio [9] also uses the sliding-window approach.


The final decision about the presence of a face is determined by computing the distance of the window pattern from six face clusters and six non-face clusters generated using the k-means clustering method. Their system can also detect single or multiple faces in simple or cluttered scenes. Both of the example-based systems mentioned above [8,9] also use windows of different sizes to determine whether or not a face is present at a location in an image.


However, as mentioned earlier, these two systems use a huge number of training examples compared with our system. Rowley et al. [8] use over 16,000 positive and 9000 negative training examples to obtain a 76.9–92.5% detection rate on a test set containing 507 frontal faces in 130 images; these images include the 23 images used as test set B in our experiments. Sung and Poggio [9] use over 4000 positive and over 6000 negative examples to obtain a 79.9% detection rate on a test set containing 149 frontal faces in 23 images (the same set we use as test set B). Our system uses only 29 positive and 124 negative examples to obtain a detection rate of 66.4% on test set B.

The major reason a large number of training examples is not required is that the correlation technique used only requires a degree of similarity between the training and test patterns; the other parameters used in addition to the correlation parameters compensate for the misclassifications that might otherwise result. The test pattern does not have to look exactly like the training pattern, only the portions of the pattern corresponding to the eyes, cheeks, nose, and mouth need to be similar. To capture the different shapes and orientations of faces, examples of such patterns need to be included among the few training images used. Since only a few face patterns are required to represent the general shapes, orientations, and expressions that common face patterns have, it is not necessary to use many training images. This is in contrast to the six 283-dimensional full-covariance Gaussian clusters used to approximate the face distribution in Ref. [9]; even with over 4000 training images, the clusters could not be accurately approximated. The neural network used as the face filter in Ref. [8] likewise requires thousands of training images before the parameters required to accurately describe a face can be determined.

The main problem with this kind of window-sliding approach is the processing time, but a way to minimize this problem is discussed in Ref. [8]. A faster version of our system decreases detection time by a factor of 4.0 or more, depending on the value chosen for the horizontal and vertical increments. Originally, the system detects faces by sliding the scan window at 1-pixel increments horizontally and vertically; the faster version slides windows at 2-pixel increments. The speed improvement has a drawback, however: some faces may be missed if the window does not start at a particular location or nearby.

Other methods are good for the detection of single faces from scenes, as observed from the experimental results presented in the corresponding papers; the methods proposed by Kervrann et al. [16], Yoo and Oh [14] and Yin and Basu [2] are examples. The major advantage of such methods is their short processing time, which makes them useful for real-time face tracking.


Fig. 16. (a) Original image. (b) Face detected with the first training set. (c) Face detected with the second training set.

3.12. Summary

The region-based face detection technique presented here is able to detect faces located at any position in an image. Multiple faces can be detected in the same image at different scales. The system requires only a few training images to achieve a considerably high detection rate, and faces can be detected irrespective of how cluttered the scene is. Most of the faces missed come from low-quality images or have part of the face occluded by shadow or facial hair. A preprocessing step that compensates for the low quality of some of the images may improve the detection rate. Another major improvement required is to optimize the system with respect to computation time.

4. Conclusions and future work

An example-based face detection technique is presented in this paper. The technique can be used to detect single or multiple faces in images having simple or complex backgrounds. A safe conclusion is that any face detected by the system is similar in pattern to at least one of the training images. To change the pattern detected as a face, the system has to be retrained with new examples, and some of the parameters may need adjustment based on the new training examples. Fig. 16 shows two sets of detection results obtained using two different sets of positive training images: Fig. 16(b) is the result produced when the training images cover only the region from the eyebrows to the mouth, while Fig. 16(c) is produced when the training images include the region from the forehead to the mouth. Since the pattern detected using this technique can be varied, the technique can be applied to the detection of other patterns that are not necessarily faces; in other words, other objects can be detected using this technique.

The region-based technique is only applicable to frontal images; other ways of capturing profile face models have to be incorporated to make the approach useful for profile face detection. The technique is fairly sensitive to noise, sharp illumination shadows, and unpredictable facial hair effects. Experimental results show that the face detection system can detect most of the faces in good-quality images, but some of the faces in low-quality images are missed as a result of random noise effects.


Directions for future work include using an appropriate technique to de-emphasize random noise in low-quality images before proceeding to the other normalization stages of the face detection system. Another direction is to train the detection system with profile views of faces and to obtain a model that can describe the profile views. In addition, many optimizations can be made to make the system much faster, for example by using Intel MMX technology (on systems with Intel processors). MMX technology is based on a single-instruction, multiple-data (SIMD) technique that allows a single instruction to operate on multiple pieces of data. This means that some aspects of the processing can be done in parallel, which can speed up computer vision and image processing systems significantly. Compiler optimization options and optimization of the correlation technique can also lead to significant improvements.

Acknowledgements

The authors would like to acknowledge financial support from NSERC and the University of Saskatchewan.

References

[1] R. Brunelli, T. Poggio, Face recognition: features versus templates, IEEE Trans. Pattern Anal. Mach. Intell. 15 (10) (1993) 1042–1052.
[2] L. Yin, A. Basu, Integrating active face tracking with model based coding, Pattern Recognition Lett. 20 (1999) 651–657.
[3] A. Yuille, P. Hallinan, D. Cohen, Feature extraction from faces using deformable templates, Int. J. Comput. Vision 8 (2) (1992) 99–111.
[4] A. Jacquin, A. Eleftheriadis, Automatic location tracking of faces and facial features in video sequences, in: M. Bichsel (Ed.), International Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, June 1995, pp. 142–147.
[5] V. Govindaraju, Locating human faces in photographs, Int. J. Comput. Vision 19 (2) (1996) 129–146.
[6] H. Graf, T. Chen, E. Petajan, E. Cosatto, Locating faces and facial parts, in: M. Bichsel (Ed.), International Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, June 1995, pp. 41–46.
[7] T. Poggio, D. Beymer, Learning networks for face analysis and synthesis, in: M. Bichsel (Ed.), International Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, June 1995, pp. 160–165.
[8] H. Rowley, S. Baluja, T. Kanade, Neural network-based face detection, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1) (1998) 23–38.
[9] K. Sung, T. Poggio, Example-based learning for view-based human face detection, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1) (1998) 39–51.
[10] Y. Dai, Y. Nakano, Extraction of facial images from complex background using color information and SGLD matrices, in: M. Bichsel (Ed.), International Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, June 1995, pp. 238–242.
[11] S. Kim, N. Kim, S. Ahn, H. Kim, Object oriented face detection using range and color information, IEEE Conference on Automatic Face and Gesture Recognition, IEEE Computer Society, Nara, Japan, April 1998, pp. 76–81.
[12] S. McKenna, S. Gong, Y. Raja, Modelling facial color and identity with Gaussian mixtures, Pattern Recognition 31 (12) (1998) 1883–1892.
[13] J.-G. Wang, E. Sung, Frontal-view face detection and facial feature extraction using color and morphological operations, Pattern Recognition Lett. 20 (1999) 1053–1068.
[14] T.-W. Yoo, I.-S. Oh, A fast algorithm for tracking human faces based on chromatic histograms, Pattern Recognition Lett. 20 (1999) 967–978.
[15] A. Colmenarez, T. Huang, Face detection with information-based maximum discrimination, IEEE Proceedings Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, Puerto Rico, June 1997, pp. 782–787.
[16] C. Kervrann, F. Davoine, P. Perez, R. Forchheimer, C. Labit, Generalized likelihood ratio-based face detection and extraction of mouth features, Pattern Recognition Lett. 18 (1997) 899–912.
[17] K. Hotta, T. Kurita, T. Mishima, Scale invariant face detection method using higher-order local autocorrelation features extracted from log-polar image, IEEE Conference on Automatic Face and Gesture Recognition, IEEE Computer Society, Nara, Japan, April 1998, pp. 70–75.
[18] F. Jurie, A new log-polar mapping for space variant imaging. Application to face detection and tracking, Pattern Recognition 32 (5) (1999) 865–875.
[19] A. Tankus, Y. Yeshurun, N. Intrator, Face detection by direct convexity estimation, Pattern Recognition Lett. 18 (1997) 913–922.
[20] R. Gonzalez, R. Woods, Digital Image Processing, Addison-Wesley, Reading, MA, 1993.

About the Author—OLUGBENGA AYINDE received his B.Sc. in Computer Engineering from Obafemi Awolowo University, Nigeria, in 1995, and his M.Sc. in Computer Science from the University of Saskatchewan, Canada, in 2000. He is currently working as a software developer with SMART Technologies Inc., Calgary, Canada. His research interests include computer vision, computer graphics, and image processing.

About the Author—YEE-HONG YANG received the B.Sc. degree (with first class honours) in Physics from the University of Hong Kong, the M.S. degree in Physics from Simon Fraser University in Canada, and the M.S.E.E. and Ph.D. degrees in Electrical Engineering from the University of Pittsburgh. From 1980 to 1983, Dr. Yang was a research scientist in the Computer Engineering Center of the Mellon Institute, a division of Carnegie-Mellon University, where he was involved in the Very High Speed Integrated Circuit (VHSIC) program. In 1983, he joined the Department of Computational Science (now the Department of Computer Science) at the University of Saskatchewan, where he has held the rank of full professor since 1991. He was the founding Director of the Computer Vision and Graphics Lab. Since July 1, 2001, he has been a full professor in the Department of Computing Science, University of Alberta.


From July 1, 1989 to June 30, 1990, he spent his sabbatical at the McGill Research Center for Intelligent Machines at McGill University in Montreal, Quebec, Canada. Since July 1, 1996, he has collaborated with Alias Wavefront on projects related to both computer vision and computer graphics. Dr. Yang's research interests include computer graphics, image-based rendering, modeling of physical phenomena, animation, image processing, computer vision, motion analysis, and biomedical image processing. He is a senior member of the IEEE and a member of the ACM. In addition to his refereeing activities, he has served on the Editorial Board of Pattern Recognition since 1991. In 1998, he served as program co-chair of Vision Interface 98. He has published over 70 papers in journals and conference proceedings.