Detecting human faces in color images


Image and Vision Computing 18 (1999) 63–75 www.elsevier.com/locate/imavis

J. Cai, A. Goshtasby*

Department of Computer Science and Engineering, Wright State University, 303 Russ Engineering Center, Dayton, OH 45435, USA

Received 3 November 1998; received in revised form 2 February 1999; accepted 8 February 1999

* Corresponding author. Tel.: +1-937-775-5170; fax: +1-937-775-5133. E-mail address: [email protected] (A. Goshtasby)

Abstract

A method for detecting human faces in color images is described that first separates skin regions from nonskin regions and then locates faces within the skin regions. A chroma chart containing the likelihoods of different colors representing the skin is prepared through a training process. Using the chroma chart, a color image is transformed into a gray scale image in such a way that the gray value at a pixel shows the likelihood of the pixel representing the skin. The obtained gray scale image is then segmented into skin and nonskin regions, and model faces representing front- and side-view faces are used in a template-matching process to detect faces within the skin regions. The false-positive and false-negative errors of the proposed face-detection method on color images of size 300 × 220 containing four or fewer faces are 0.04 or less. © 1999 Elsevier Science B.V. All rights reserved.

Keywords: Chroma chart; Skin-color region; Color image; Gray scale image

1. Introduction and background

Indexing and retrieval of video images containing human activities require automatic detection of humans and, in particular, the human face in the images. Techniques that recognize faces or analyze facial expressions also require knowledge about the locations of faces in images. In this article a new method for detecting faces in color images is presented.

The majority of existing face-detection methods use image gray values to detect faces [1–11], despite the fact that most images obtained today are in color. Only a small number of methods use color information to detect faces [12–15]. Face-detection methods that rely on image gray values use predefined image features to either train a system or carry out model matching. A method developed by Rowley et al. [6] tests for the existence of faces of different sizes at each image position using a neural network. Although the system is designed to detect front-view faces, the network filters can be trained to detect side-view faces also. Instead of using neural networks, Colmenarez and Huang [1] used a probabilistic visual learning algorithm to detect faces in images. In a system developed by Sung and Poggio [7], first a model face is generated using an example-based

learning method. Then, at each image location, a feature vector is computed between the local image and the model. Finally, a trained classifier is used to verify the existence of a face at that location.

Many methods detect faces by feature matching. A method developed by Yow and Cipolla [11] uses second-derivative Gaussian filters, elongated at an aspect ratio of 3:1, to locate horizontal bar-like features in images. From the bar-like features, elongated facial features such as the eyes and the mouth are located, and by grouping the features and matching their relations with those of a facial model, instances of faces are found. By changing the size of the Gaussian filter, this method can detect faces of different sizes in an image. A similar method is described by Cootes and Taylor [3], which uses statistical features instead of bar-like features and detects faces by matching relations between detected features and those of a model face. A method developed by Leung et al. [5] uses a graph-matching method to find the likely faces from among detected facial features. First, graphs representing candidate faces are generated from detected facial features, and then by a random graph-matching algorithm, true faces are located from among the candidate faces. A method that treats a facial pattern like a texture pattern and uses texture features to detect faces is described by Dai et al. [16]. In this method, texture features based on a space gray level dependence (SGLD) matrix [17] are computed. By comparing texture features of a model face to those of windows in an image, potential faces are located. Then,


Fig. 1. (a) All observable colors of luminance 100. The rectangle contains all observed skin colors of luminance 100. (b) The rectangular area in (a) is enlarged. (c) 2300 skin samples from 80 images. (d) Skin likelihoods obtained by centering a Gaussian of standard deviation 13 at the samples in (c) and normalizing the sum of the Gaussians to [0, 1]. (e) Likelihoods of different chromas representing the skin of different races. (f)–(h) Likelihoods of chromas representing the skins of persons of African, Asian, and European origins, respectively. Isovalued contours showing the likelihoods of 0.8, 0.6, 0.4, and 0.2 are shown in (e)–(h).

using model faces and windows of different sizes, faces of different sizes are detected. A method that uses only image edges to detect faces is described by Govindaraju [4]. In this method, second-derivative edges are determined, and using models of human face curves, faces are located within the edges. A small number of techniques use color information to detect faces. These techniques first select likely image regions for faces and then detect faces in selected regions using facial patterns. Dai and Nakano [13] isolated and kept the orange-like region in the YIQ color space as the skin region and eliminated the remaining regions. They then used texture features in gray scale images [16] to identify faces in skin regions. Chen et al. [12] prepared a color chart in the HSV color space that represented possible skin colors. Then they used the color chart to identify image regions that represented the skin. Using three face templates, they located faces in skin regions through a fuzzy pattern-matching algorithm. Face templates of different sizes were used to detect different-sized faces in images. Sobottka and Pitas [15] detected skin regions in images using the hue and saturation and then selected regions that were elliptic as faces. Miyake et al. [14] used a chart of skin colors in the Luv color space to detect skin regions in an image. They then used the round regions from among obtained regions as faces. Schiele and Waibel [18] used the red and green

components of the skin color in a large number of images to construct a histogram. The histogram was then used to compute the probability that a particular combination of red and green belonged to the skin, and the probabilities were used to classify image pixels into skin and nonskin regions. Crowley and Berard [2] used this method to track faces in video images.

In the method proposed here, information about skin colors is used first to find likely image regions where a face may exist. Then information about facial patterns is used to determine instances of faces within such regions. The characteristics of the proposed face-detection method are:

• Computations are performed in the 1976 CIE Lab color space. Colors in the CIE Lab space are perceptually more uniformly spaced than colors in RGB or HSV spaces [19], enabling use of a fixed color distance in decision making over a wide range of colors. Another perceptually uniform color space is the CIE Luv space, which could also be used for this purpose. We chose CIE Lab over CIE Luv because we had more experience with CIE Lab [8].

• Only chroma (the a and b components in the CIE Lab color space) is used to separate skin from nonskin regions. Because of the smooth and curved shape of faces, the reflected light intensity varies considerably across a


face; therefore, the luminance component of the color cannot be a reliable measure for detecting facial regions. Chroma across a face, however, remains relatively unchanged and can be used to detect regions that represent faces. After detecting potential facial regions, image luminance is used to determine whether they represent faces or not.

• Rather than classifying pixels into skin and nonskin regions, we assign a weight to each pixel, showing the likelihood of the pixel belonging to the skin. The weights are obtained from a chroma chart that is prepared through a training process.

• Using the chroma chart, a color image is transformed into a gray scale image, with the gray value at a pixel showing the likelihood of the pixel belonging to the skin. This gray scale image is then segmented into skin and nonskin regions.

• The proposed method is based on a bottom-up approach; therefore, it can find faces of different sizes and orientations in an image. This is in contrast to top-down methods that search for faces of prespecified sizes and orientations in an image.

The algorithm for detecting human faces in a color image can be summarized as follows:

1. Prepare a chroma chart from the ab components (in the 1976 CIE Lab color space) of colors in a large number of images, in such a way that the value at entry (a, b) in the chart represents the likelihood that chroma (a, b) represents the skin.
2. Transform a given color image into a gray scale image in such a way that the gray value at a pixel shows the likelihood of the pixel representing the skin.
3. Separate skin regions from nonskin regions in the obtained gray scale image by image segmentation.
4. Detect facial features in the obtained skin regions and hypothesize faces.
5. Verify the existence of the hypothesized faces by template matching.

These steps are described in more detail next.

2. Preparing the chroma chart

Color charts have been used in face detection with considerable success [12,14,15]. Past color charts, however, have been binary: colors have been grouped into either skin or nonskin. In our model, any color can represent the skin, but with a different likelihood. Our color chart has two components, representing the a and b values in the 1976 CIE Lab color space [19]. Therefore, we refer to our color chart as the chroma chart.

As mentioned earlier, we do not use the luminance component of the color because luminance may vary considerably across a person's face. Chroma, however, remains relatively unchanged and can be reliably used to separate skin from


surrounding nonskin regions. Chroma has previously been effective in segmenting color images [20].

The colors representing all possible chromas with luminance 100 are shown in Fig. 1(a). Because we cannot show colors in the CIE Lab space using only their a and b values, in this figure L = 100 was chosen to show all colors obtained by setting the chroma to all possible values. In the images we had, a luminance of 100 was observed in skin regions more often than other luminances. Only a small portion of the chromas represented the skin. The rectangular area shown in Fig. 1(a) contains all chromas that we observed from 2300 skin samples. This rectangular area is enlarged and shown again in Fig. 1(b). The colors in Fig. 1(b) contain all chromas that we have observed in all our skin samples. Skin colors having the same chroma will be treated similarly even though they may have different luminances. The image array shown in Fig. 1(b) is of size 500 × 700 and represents values of a and b in the ranges [-10, 40] and [-10, 60], respectively.

Our chroma chart is obtained by mapping the ab values in the CIE Lab color space to discrete array entries. Entries of the chart are initially set to 0. When color samples are taken from skin areas in images, the chromas of the colors are mapped to the array entries, and the entries are incremented by 1. After processing 2300 skin samples from 80 images, we obtained the array shown in Fig. 1(c). This image is the same size as the image shown in Fig. 1(b). The skin samples were obtained from images found on the Internet. The images, therefore, were not obtained with the same camera or under the same lighting conditions. We included only those images in our study that we judged by visual examination to have been obtained under white lighting. Images obtained under nonwhite lighting change true skin colors and were excluded from our study. To reduce the effect of noise in the color samples, the average color of a 3 × 3 window centered at a pixel, rather than the color at that pixel, was used. To obtain the average color, first the average red, average green, and average blue in the 3 × 3 window were determined. The average red, green, and blue values were transformed into Lab color coordinates, and the ab components of the color were used to identify the entry in the chroma chart. Then the entry of the chart was changed to 1.

As the number of samples increases, the chart becomes denser. Skin colors fill only a part of the chart. Since adjacent entries in the chart show very similar colors, it is inconceivable that while some entries in a small neighborhood belong to the skin, the other entries in the same neighborhood do not. As more skin samples are used, more entries in the chart become filled. Therefore, there is a need to fill in the small gaps in the chroma chart when a sufficiently large number of skin samples is not available.

The small gaps in the chroma chart can be filled in a variety of ways. We notice, however, that simply filling in the gaps is not sufficient. There are certain areas of the chroma chart that represent skin with a higher likelihood


Fig. 2. (a) A color image containing a face. (b) The skin-likelihood image. (c) Smoothing of (b) with a Gaussian of standard deviation 9 pixels. (d) Local intensity peaks of (c) marked in (b). (e) A skin region obtained by segmenting the skin-likelihood image using one of the intensity peaks. Small regions within a skin region are hypothesized as facial features and are used to locate a face.

than the other areas, as evidenced by the density of the samples. There are entries that are not likely to belong to the skin because only sparse samples are obtained. Some of these samples could be due to noise or reflections from the skin's specularities. We need a chroma chart that contains likelihoods proportional to the local densities of skin samples. To obtain such a chroma chart, we center a 2-D Gaussian of height 1 at each sample point and record at each chart entry the sum of the obtained Gaussians. Therefore, if N skin samples {(a_i, b_i) : i = 1, ..., N} are available, we let the likelihood that the chroma at entry (a, b) represents the skin be proportional to

$$T(a,b) = \sum_{i=1}^{N} \exp\left\{ -\left[ (a - a_i)^2 + (b - b_i)^2 \right] / 2\sigma^2 \right\} \qquad (1)$$

Initially, the value of T(a, b) is saved at chart entry (a, b). When values at all entries are computed, each entry is divided by the largest value in the chart to obtain values between 0 and 1. Function T produces higher values for areas containing a large number of samples. As a Gaussian changes rather slowly near its center and falls rather sharply about one standard deviation away from its center, the sum of the Gaussians will fill in the gaps in neighborhoods where a high density of samples exists, and will quickly approach zero in areas where no samples are available. Areas containing sparse samples will receive small values, showing small likelihoods of representing the skin. Too small a standard deviation will produce a chroma

chart whose likelihoods vary sharply between adjacent samples, while too large a standard deviation may assign high likelihoods to areas where only sparse samples or no samples are available. Parameter σ depends on the size of the chroma chart. For the chroma chart of size 500 × 700 and the density of the points we had, we visually found that σ between 11.0 and 19.0 was appropriate. The chroma chart obtained with σ = 13.0 is shown in Fig. 1(d). Intensities in this chart show the skin likelihoods: the brighter an entry, the higher its likelihood to represent the skin. The likelihoods from 0.0 to 1.0 are linearly mapped to intensities between 0 and 255.

Skin samples from persons of African as well as Asian and European origins were used to obtain Fig. 1(d). The intensities in this image show the likelihoods of colors in Fig. 1(b) representing the skin. Contours representing likelihoods of 0.8, 0.6, 0.4, and 0.2 obtained from Fig. 1(d) are overlaid on Fig. 1(b) and shown in Fig. 1(e) to depict the likelihoods of different chromas representing the skin. Preparing chroma charts using samples from persons of African, Asian, and European origins separately, we obtained Fig. 1(f), (g) and (h), respectively. Although skin chromas of different races have some overlap, there are certain chromas that belong to only certain races. Therefore, if the objective is to locate faces of a particular race in an image, one should create a chroma chart that uses samples from only that race. This not only increases the face-detection speed but also the face-detection accuracy. If a chroma chart contains information about the skin colors of various races, the probability of missing faces will be reduced, but this will


cause more image pixels to be classified as skin, increasing the probability of detecting false faces.
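The chart-building step described above (Eq. (1)) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the chart size, the a and b ranges, and σ = 13 follow the values quoted in the text, scikit-image's rgb2lab is assumed for the RGB-to-CIE-Lab conversion, and the function names are ours. The last function anticipates Section 3: it applies the finished chart to a whole image to produce the skin-likelihood image.

```python
# Sketch of the chroma-chart construction of Section 2 (Eq. (1)) and of the
# skin-likelihood lookup used in Section 3.  Illustrative only; chart size,
# a/b ranges, and sigma are the values quoted in the text.
import numpy as np
from skimage.color import rgb2lab   # RGB -> CIE Lab conversion

A_RANGE = (-10.0, 40.0)             # a values covered by the 500 chart rows
B_RANGE = (-10.0, 60.0)             # b values covered by the 700 chart columns
CHART_SHAPE = (500, 700)
SIGMA = 13.0                        # standard deviation of the 2-D Gaussians

def ab_to_entry(a, b):
    """Map a chroma (a, b) to the nearest (row, col) entry of the chart."""
    row = int(round((a - A_RANGE[0]) / (A_RANGE[1] - A_RANGE[0]) * (CHART_SHAPE[0] - 1)))
    col = int(round((b - B_RANGE[0]) / (B_RANGE[1] - B_RANGE[0]) * (CHART_SHAPE[1] - 1)))
    return row, col

def build_chroma_chart(skin_patches):
    """skin_patches: small RGB arrays (values in [0, 1]) cut from hand-marked
    skin areas.  Returns the chart with likelihoods normalized to [0, 1]."""
    rows, cols = np.mgrid[0:CHART_SHAPE[0], 0:CHART_SHAPE[1]]
    chart = np.zeros(CHART_SHAPE, dtype=float)
    for patch in skin_patches:
        mean_rgb = patch.reshape(-1, 3).mean(axis=0)          # 3x3 averaging, as in the text
        L, a, b = rgb2lab(mean_rgb[np.newaxis, np.newaxis, :])[0, 0]
        r0, c0 = ab_to_entry(a, b)
        # Eq. (1): add a unit-height Gaussian centred on the sample's chroma
        chart += np.exp(-((rows - r0) ** 2 + (cols - c0) ** 2) / (2.0 * SIGMA ** 2))
    return chart / chart.max()

def skin_likelihood(rgb_image, chart):
    """Transform an RGB image (values in [0, 1]) into the skin-likelihood image."""
    lab = rgb2lab(rgb_image)
    out = np.zeros(rgb_image.shape[:2], dtype=float)
    for (y, x), a in np.ndenumerate(lab[..., 1]):
        r, c = ab_to_entry(a, lab[y, x, 2])
        if 0 <= r < CHART_SHAPE[0] and 0 <= c < CHART_SHAPE[1]:
            out[y, x] = chart[r, c]
    return out
```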

3. Detecting skin-color regions

To identify skin-color regions in an image, we first transform the color image into a skin-likelihood image. This involves transforming the RGB values at a pixel (x, y) to (L, a, b) values, reading the likelihood from entry (a, b) in the chroma chart, and saving it at location (x, y) in the image. When this process is repeated for all pixels in the image, an image is obtained whose gray values represent the likelihoods of pixels belonging to the skin. Using the color image of Fig. 2(a) and the chroma chart of Fig. 1(d), we obtain the skin-likelihood image of Fig. 2(b). As can be observed, regions having the color of the skin in Fig. 2(a) are transformed into bright regions in Fig. 2(b), while regions not having the color of the skin are transformed into dark regions. The brighter a pixel, the higher the likelihood that the color at the pixel represents the skin. However, we should note that although the process detects regions having the color of the skin, the obtained regions may not necessarily represent the skin; detected regions simply have the color of the skin. The process can, though, reliably identify regions that do not have the color of the skin, and such regions can be safely discarded before searching for faces. The image of Fig. 2(b) shows the likelihoods of pixels in Fig. 2(a) having the color of the skin.

Since skin regions appear as bright spots in a skin-likelihood image, it seems that we should be able to locate the skin-color regions in an image by a thresholding process. Since persons with different skin colors may be present in an image and different skin colors have different skin likelihoods, it may be necessary to use more than one threshold value to detect skin regions that belong to different persons in the image. We determine the threshold values for locating skin regions in an image from the following observation. When the threshold value is high, an obtained skin region is small, and as the threshold value is reduced, the skin region becomes larger. The size of a skin region may remain relatively stable under a wide range of threshold values, and at a certain threshold value, it may merge with a neighboring region, increasing in size considerably. Therefore, if we plot the increase in size of a skin region as a function of the decrease in threshold value, we will obtain a curve that gradually decreases up to a point and then increases sharply. The threshold value at which the minimum increase in region size is observed, while uniformly decreasing the threshold value, will be taken as the optimal threshold value. The optimal threshold value, therefore, will find a region whose size is most stable under variations of the threshold value.

To implement this idea, local intensity peaks in the skin-likelihood image are first determined. To avoid detection of


noisy peaks, the likelihood image is smoothed with a wide Gaussian, typically of standard deviation 8.0 or 9.0 (see Fig. 2(c)). Local peak values are then located by shifting a 3 × 3 window over the smoothed image and determining window positions where the value at the center of the window is larger than the values at other entries in the window. Our likelihood images are represented by floating point numbers; therefore, the possibility of obtaining the same value more than once in a window is extremely rare. In this manner, local peaks can be determined uniquely. The locations of the peaks will be used as seed points to determine the skin regions. Local intensity peaks of Fig. 2(c) are shown by '+' marks in Fig. 2(d).

For each obtained peak, the likelihood image is thresholded from 1.0 down to 0.0 in steps of 0.1, and at each threshold value, the size of the region containing the peak is determined. As the threshold value is decremented, the change in region size is recorded, and the two consecutive threshold values producing the least change in region size are used to determine the optimal threshold value. If threshold values t and t + 0.1 result in the least change in size of a region, the optimal threshold value for segmenting the region is set to t + 0.05. Applying the thresholding process to one of the peaks in Fig. 2(d), we obtain the region shown in Fig. 2(e). Once a region is obtained from a peak, all other peaks in the region are removed, as they all produce the same region. The thresholding decrement of 0.1 is arbitrary and can be reduced to increase the segmentation accuracy. As the thresholding decrement is reduced, region boundaries become more clearly defined, but the process becomes slower. We found that decrements of 0.1 are sufficient to detect faces in the images we had. At the optimal threshold value, since the region size is most stable under variations of the threshold value, a smaller decrement in the threshold value will not noticeably change the segmentation result.

The following steps summarize the iterative thresholding process.

1. Smooth the skin-likelihood image with a rather wide Gaussian and find local intensity peaks in the smoothed image.
2. For each peak:
2.1. Start from threshold value 0.9 and decrement the threshold value in steps of 0.1 until 0.0 is reached. At each threshold value, determine the region containing the peak in the original likelihood image (the likelihood image before it is smoothed) and find the difference in its size at this and the previous threshold values. If the current threshold value is 0.9, the previous threshold value is 1.0 and the region size at threshold value 1.0 is zero.
2.2. Find two consecutive threshold values t and t + 0.1 such that the change in size of the region containing the peak is minimum. Then, set the optimal threshold value to t + 0.05.


Fig. 3. (a) Average of sixteen front-view male and female faces wearing no glasses and having no facial hair. (b) Average of sixteen side-view faces of males and females wearing no glasses and having no facial hair. (c) The model face used in template matching to detect the front-view faces. This template was cut out of image (a), with its left and right borders at the center of the left and right ears of the averaged face. (d) The interior portion of (b) was cut out and used as the model to detect left-view faces. (e) Model (d) was reflected about the vertical axis to obtain this model, which is used to detect the right-view faces in images.

3. Remove all peaks that are contained in the region just obtained. If no more peaks remain, stop. Otherwise, take another one of the remaining peaks and go to step 2.1.

Some detected regions may be very small and unlikely to represent faces. Such regions are removed before looking for faces. These regions often correspond to noisy regions in the scene having the color of the skin. Not all detected regions contain faces. Some regions represent the hands and other exposed parts of the body, while some regions may correspond to objects with colors similar to those of the skin. Next, we will show how facial features can be used to locate faces within skin regions.

4. Detecting faces in skin regions

The main objective in detecting skin regions in an image is to reduce the search space for the faces. Naturally, faces are in skin regions, and there is no point looking for them in

nonskin regions. Facial features that do not have the color of the skin, such as the eyes or the mouth, appear as small regions within skin regions. These features can be detected and used to generate hypotheses about the faces, and their presence can be verified by template matching. Since features such as the eyes or the mouth are elongated, we will use their axes of minimum inertia (their major axes) [21] to hypothesize the faces. We will transform a model face to overlay a skin region in such a way that the centers of the eyes and the mouth in the image and the model coincide. Then we will determine the match rating between the model and the image. If the match rating is above a threshold value, we will confirm the hypothesis that a face exists in the region. Otherwise, we will reject the hypothesis.

The front-view model used to verify the presence of a face in a region is shown in Fig. 3(c). This model was obtained by averaging intensities of 16 front-view faces of men and women without facial hair or glasses (see Fig. 4). These images were selected from the Yale University and University of Bern face databases. To obtain the front-view model, first the distance between the eye centers in an image was determined. Let us denote this distance by A. Then the distance between the center of the mouth and the point midway between the eye centers was determined. Let us denote this distance by B. Then the ratio of the two distances (r = A/B) was determined. This ratio r was computed for all faces and averaged. Let us denote the obtained average by r̄. Among the faces used to determine r̄, the one whose ratio r was closest to r̄ was identified. In Fig. 4, face 'a' was the face identified in this manner. All other images were then resampled by an affine transformation so that the centers of the eyes and the mouths of the faces coincided with those of face 'a'. The overlaid faces obtained in this manner are shown in Fig. 3(a). Next, the interior portion of the overlaid faces, as shown in Fig. 3(c), was manually extracted and used as the front-view model of the face. The left and right borders of the image of Fig. 3(c) are positioned at the centers of the left and right ears of the image of Fig. 3(a).

A side-view model of the face was also prepared by geometrically transforming 16 images from the Yale University and the University of Bern face databases in such a way that the centers of the eyes, the mouths, and

Fig. 4. The sixteen faces used to obtain the front-view model of the face.


Fig. 5. Different cases to be considered when matching a front-view model face to a skin region containing zero, one, or two features. (a) Skin region contains no features. (b)–(e) Skin region contains one feature. (b) Shows the case in which the detected feature does not correspond to an eye or the mouth. (f)–(l) are the cases in which two features are detected.

the ears in the images coincided with those of the face whose ratio of the eye–ear and eye–mouth distances was closest to the average ratio of the same measures in the 16 side-view faces. The averaged side-view face is shown in Fig. 3(b). The portion of the face that was common to all images was manually extracted and used as a left-view model (see Fig. 3(d)). The left and right borders in this model correspond to the tip of the nose and the center of the ear in Fig. 3(b). This left-view model was reflected about the vertical axis to obtain the right-view model shown in Fig. 3(e). The model of Fig. 3(c) will be used to verify the presence of front-view faces, and the models of Fig. 3(d) and (e) will be used to verify the presence of side-view faces in images. The centers of the eyes, ears, mouth, and the tip of the nose that were used in the model faces were selected manually. The reason for cutting out the central portion of the images in Fig. 3(a) or 3(b) is to exclude from template matching portions of a scene that do not belong to the face, or portions of the face that vary from one individual to another.

Since facial patterns can be characterized by image gray values, we will use the gray scale models of Fig. 3(c)–(e) to search for faces in the luminance component of color images. After scaling and reorienting a model to overlay a region, we compute the cross-correlation coefficient between the model and the image and use it as the match rating between the model and the image. When computing the cross-correlation coefficient, only nonzero pixels in the face models of Fig. 3(c)–(e) are used. This is to reduce the effect of scene contents other than the face on the match rating. Since cross-correlation coefficients vary between -1.0 and 1.0, our match ratings also vary between -1.0 and 1.0. When the match rating is above a threshold value (typically 0.5), we accept the hypothesis that a face exists in a region. Otherwise, we reject the hypothesis.

A detected skin region may contain zero, one, two, three, or more than three nonskin regions. Fig. 5 shows the cases where two or fewer nonskin regions are detected in a skin region. We will refer to a nonskin region within a skin region as a facial feature.
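A minimal sketch of this match rating is given below, assuming the model has already been scaled and rotated onto the candidate region so that the two arrays have identical shape; the function name is ours, not the authors'.

```python
# Sketch of the match rating of Section 4: the correlation coefficient
# between a gray scale face model (already scaled and rotated) and the
# luminance of the candidate region, restricted to nonzero model pixels.
import numpy as np

def match_rating(model, luminance):
    """model and luminance: 2-D float arrays of identical shape.
    Returns a value in [-1, 1]; a face is accepted when the rating
    exceeds a threshold (typically 0.5 in the paper)."""
    mask = model > 0                       # ignore the blank parts of the template
    m = model[mask].astype(float)
    v = luminance[mask].astype(float)
    m -= m.mean()
    v -= v.mean()
    denom = np.sqrt((m * m).sum() * (v * v).sum())
    return float((m * v).sum() / denom) if denom > 0 else 0.0
```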

If no facial feature is detected in a skin region, the width of the region is determined and a face model of the same width is matched to it. To determine the width of a region, the major axis of the region is determined and a line normal to the major axis is shifted along the major axis from one end to the other. The intersections of the line with the region borders are determined, and the distance between the intersections for each line position is found. Then the maximum of the distances is considered the width of the region. The intersection of the major axis of the region with the line that has the maximum width is considered the center of the region. Estimating region width and region center in this manner is motivated by the observation that in a front-view image, the distance between the ears is usually the widest skin cover in a face region, and in a side-view image, the distance between the ear and the tip of the nose is usually the widest skin cover in a face region. The center of our front-face model is obtained by connecting the ear centers and finding the point midway between the two. The center of a side-view model is obtained by connecting the center of the ear to the tip of the nose and finding the midpoint between the two. When no facial features are obtained in a region, the matching process involves scaling a model face to the same width as the region, placing the center of the scaled model on the center of the region, reorienting the model so that its major axis lies on the major axis of the region, and determining the match rating between the image and the model. If the match rating is above a prespecified threshold value, we conclude that a face exists in the region.

If only one feature is detected in a region, it could be one of the eyes, the mouth, or a noisy region. Therefore, as shown in Fig. 5(b)–(e), there are four cases to be considered. Matching is performed by scaling a model so that it has the same width as that of the region, and its orientation is the same as that of the region. In addition, in case (c), the front model is resampled so that its left eye center falls on the feature center. In case (d), the front model is resampled so that its right eye center lies on the feature center, and in case (e), the front model is resampled so that its mouth

center lies on the feature center. Then the correlation between the model and the image is determined. The choice of (c), (d), or (e) is based on the position of the feature center with respect to the major axis and the center of the skin region. In cases (c)–(e), the major axis of the feature should be approximately normal to the major axis of the region (±15°). Otherwise, the feature is considered noise, as shown in case (b) in Fig. 5. In such a situation, it is assumed that no valid feature is found in the region, and template matching is performed as in case (a) in Fig. 5.

We use similar rules to verify the presence of faces in skin regions containing two, three, or more than three features. If a region contains two features, cases (f)–(l) shown in Fig. 5 are considered. Among the detected facial features, one (cases f–h) or both (case i) features could be noisy features. When noisy features are removed, one of cases (a) or (c)–(e) is obtained. Such cases are pursued the same as before. If three or more features are obtained in a region, all combinations of three features at a time are tested, and the case producing the highest match rating is selected to find the pose of the face in the image.

Fig. 5 shows the cases where two or fewer features in a front-view image of a face are obtained. Similar cases are considered when side-view models of faces are used in the matching. Therefore, in addition to matching the front-view model of a face to a region, the side-view models are matched to the region, and the model producing the highest match rating is taken to determine the pose of the face. If the highest match rating is above a threshold value, it is concluded that a face exists in the region; otherwise, it is concluded that a face does not exist in the region. As the match-rating threshold value is increased, more correct faces are missed, while as the threshold value is decreased, more false faces are detected. The threshold value should be chosen taking into consideration the cost of missing a face and the cost of wrongly detecting a face. If the two costs are the same, the threshold value should be chosen such that, on the average, the number of missed faces and the number of incorrectly detected faces are the same.

5. Results and discussion

Fig. 6. (a) Best-match pose of the front-view model in the luminance component of the image of Fig. 2(a). (b) The detected face is enclosed in a square. The center of the square is at the tip of the nose of the model when matching the image, and the orientation of the square shows the orientation of the detected face.

Fig. 6(a) shows the best-match pose of the front-view model of Fig. 3(c) in the luminance component of the image of Fig. 2(a). The cross-correlation coefficient in this matching is 0.57. Our program identifies a detected face by placing a square around it. The center of the square is at the tip of the nose of the model, and the two sides of the square are parallel to the major axis of the face model when matching the image (see Fig. 6(b)). The brightness of the square is proportional to the match rating between the model and the image.

Examples of facial images of different sizes, orientations, and races are shown in Figs. 7–9. Note that although most persons have light skin, persons with dark skin also appear in these images. Fig. 7 shows images containing a single face, Fig. 8 shows images containing two to four faces, and Fig. 9 shows images containing more than four faces. In the upper-right corner of Fig. 7, a face was mistakenly detected among the fingers of the person because the pattern of the fingers resembled that of a face. In the upper-left corner of Fig. 8, a face was again detected by mistake among a person's fingers. The face of the man in the middle-left image in Fig. 8 was missed because the segmentation process combined it with the coat of the person, and the features corresponding to the eyes and the mouth of the person were not detected. Therefore, template matching failed to correctly match a face model to the region and thus missed the face. The face of the rightmost person in the right image in Fig. 9 was missed because facial features were missed, and the blonde hair caused the entire head to be segmented into a skin region. The model that was matched to the region produced a match rating that was below 0.5.

When faces are very small, facial features are often missed, and if a person is wearing a shirt or a coat that has the color of the skin, the face and the clothes are detected as one region. When a model face is fitted to it, a low match rating is obtained. The person standing at the extreme left in the left image of Fig. 9 has a shirt that has the skin color, which was merged with the face region. The person standing at the extreme right in the left image in Fig. 9 is in front of a building that has the color of the skin. Again, the face merged with the building, producing one big region. As the facial features were too small to be detected, template matching failed to detect these faces.

The sizes of some of the detected faces were underestimated. Such faces were neither front-view nor side-view faces, but rather something between the front and side views. When a face is slightly rotated, both eyes are visible, but the distance between them becomes shorter and the skin cover representing the face becomes narrower. Therefore, when carrying out template matching, a model


Fig. 7. Sample single-face images. (a) and (b) show matching results and detected faces, respectively.


Fig. 8. Sample images containing two to four faces. (a) Template-matching results. (b) Detected faces.


Fig. 9. Sample images containing more than four faces. (a) Template-matching results. (b) Detected faces.

face is scaled down more than necessary to match the image. Although a match rating above 0.5 may be obtained, showing the existence of a face in the region, the estimated size of the face will be smaller than it should be. If there is a need to estimate the sizes of faces more accurately, then in addition to models representing the front and side views, the views in between them should also be used in template matching. It is interesting to note that although some mistakes were made in determining the scales of faces, the orientations of faces were determined accurately. The observed error in the orientation of detected faces never reached fifteen degrees in any of our experiments.

The face-detection accuracy of the described method is summarized in Table 1 for the match-rating threshold of 0.5. As the number of faces in an image increases, the error in face detection increases. This is because the sizes of the faces become smaller. Facial features inside small regions are often missed or, when found, their orientations cannot be accurately determined. Our program was able to correctly detect faces in almost all images containing a single face. Only a small number of faces in images containing two to four faces were missed. The table entries were computed using the following formulas:

$$\text{False Positive} = \frac{\text{Number of Incorrectly Detected Faces}}{\text{Total Number of Actual Faces}} \qquad (2)$$

$$\text{False Negative} = \frac{\text{Number of Missed Faces}}{\text{Total Number of Actual Faces}} \qquad (3)$$

Table 1
False positive and false negative probabilities of the proposed face-detection algorithm in images containing a single face, two to four faces, and more than four faces

Number of faces in image    False positive    False negative
1                           0.03              0.02
2–4                         0.04              0.04
>4                          0.05              0.15
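As a quick illustration of Eqs. (2) and (3), both error measures are ratios over the total number of actual faces in the test set; the counts in the example below are made up, not taken from the paper.

```python
# Error measures of Eqs. (2) and (3); both are normalized by the total
# number of actual faces in the test images.  The example counts are
# hypothetical.
def error_rates(incorrectly_detected, missed, total_actual_faces):
    false_positive = incorrectly_detected / total_actual_faces
    false_negative = missed / total_actual_faces
    return false_positive, false_negative

# e.g. 3 spurious detections and 2 missed faces among 100 actual faces
print(error_rates(3, 2, 100))   # -> (0.03, 0.02)
```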

The faces that were missed by the proposed algorithm were due to the following reasons: (1) They were too small, or were broken into small pieces due to inhomogeneous scene lighting. (2) Two faces were so close to each other that they merged into one, or the faces were merged with hands or the background. (3) The faces showed


neither the front view nor the side view of persons, but rather something between the front and side views. Therefore, template matching produced low match ratings, discarding the correct faces.

The results shown in Table 1 represent errors in processing 453 color images: 143 containing a single face, 235 containing two to four faces, and 75 containing more than 4 faces. Skin regions in 80 of the images were used to obtain the chroma chart of Fig. 1(d). Image sizes varied from 64 × 80 to 720 × 480. The average image size was 292 × 220.

The accuracy of the current method can be improved further. The false-positive rate can be reduced by carrying out further tests on a detected face. For example, if a detected face is part of a larger skin region, we may want to verify the presence of ears or hair in the image and, if we do not find them, reject the region as a face. The false-negative rate can be reduced by using a larger number of model faces that represent different views of a face.

The speed of the proposed face-detection method depends on the size as well as the number of faces in an image. The computation time needed for image smoothing and image segmentation increases linearly with image size. Face hypothesis and verification, however, increases linearly with the number of skin regions in an image. The average times needed to detect faces in images containing a single face, two to four faces, and more than four faces were 1, 2.5, and 6 min, respectively, on a Sun Ultra 30 workstation. Most of the processing time is spent on matching the face models to different configurations of the subregions obtained within the skin regions. As the number of subregions n increases, the search time to find the best-match face within a region increases as n³. In the images we had, we obtained skin regions containing from 0 to 33 nonskin regions (subregions). The average number of subregions in a skin region was 7. Some of the subregion configurations could be immediately rejected as faces using constraints between the eyes and the mouth in a human face. Other configurations, however, required template matching to determine their likelihoods of representing the face.
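The n³ growth mentioned above comes from testing every combination of three candidate features inside a skin region. The sketch below illustrates that enumeration only; hypothesize_pose and match_rating are placeholders standing in for the alignment and correlation steps of Section 4, and are our assumptions rather than the authors' code.

```python
# Illustrative sketch of the combinatorial hypothesis search: every
# combination of three facial-feature candidates in a skin region is
# turned into a face hypothesis, scored against each model, and the
# best-scoring hypothesis is kept.  The helper functions are assumed.
from itertools import combinations

def best_face_hypothesis(features, region, models, hypothesize_pose, match_rating):
    """features: candidate facial features (e.g. centers and major axes)
    found inside a skin region; models: front- and side-view templates."""
    best_pose, best_rating = None, -1.0
    for triple in combinations(features, 3):      # O(n^3) triples for n features
        for model in models:
            pose = hypothesize_pose(triple, region, model)
            if pose is None:                      # triple violates eye/mouth constraints
                continue
            rating = match_rating(pose, model)
            if rating > best_rating:
                best_pose, best_rating = pose, rating
    return best_pose, best_rating                 # accept if the rating exceeds the threshold
```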

6. Summary and conclusions

Indexing and retrieval of images containing human activities require the ability to detect humans and, in particular, human faces in images. This article presented a new method for detecting human faces in color images, which first determines the skin-color regions and then locates faces within those regions. A chroma chart containing the likelihoods of different colors representing the skin was prepared through a learning process. The chroma chart was then used to distinguish skin regions from nonskin regions. The contributions of this article are: (1) transforming a color image into a gray scale image, with the gray value at a pixel representing the likelihood of the pixel representing the skin; (2) designing an adaptive thresholding method to

segment a gray scale image into regions of different intensities; and (3) developing a rule-based system to locate faces in images. The skin-likelihood image proposed in this article can be used to guide other face-detection algorithms and increase their speed by avoiding exhaustive searches. The adaptive threshold-selection method described for isolation of skin regions in an image can, in general, be used to segment gray scale images into regions of different gray values.
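As an illustration of this threshold-selection procedure, the sketch below segments one region per intensity peak of a skin-likelihood image, lowering the threshold from 0.9 to 0.0 in steps of 0.1 and keeping the threshold t + 0.05 at which the region containing the peak grows the least, as described in Section 3. It is a minimal sketch under stated assumptions (SciPy connected-component labelling, peaks already found on the smoothed image), not the authors' implementation.

```python
# Sketch of the adaptive threshold selection of Section 3.  For each local
# intensity peak of the smoothed skin-likelihood image, the threshold is
# lowered from 0.9 to 0.0 in steps of 0.1, the size of the connected region
# containing the peak is tracked in the original (unsmoothed) likelihood
# image, and the threshold giving the smallest growth is kept (t + 0.05).
import numpy as np
from scipy.ndimage import label

def region_containing(likelihood, threshold, peak):
    """Connected component of {likelihood >= threshold} containing peak (row, col)."""
    labels, _ = label(likelihood >= threshold)
    lab = labels[peak]
    return labels == lab if lab != 0 else np.zeros_like(labels, dtype=bool)

def optimal_threshold(likelihood, peak):
    """Return t + 0.05, where t is the lower of the two consecutive thresholds
    between which the region containing the peak grows the least."""
    prev_size, best_t, best_growth = 0, 0.9, np.inf   # region size at threshold 1.0 is zero
    for t in np.arange(0.9, -0.05, -0.1):             # 0.9, 0.8, ..., 0.0
        size = int(region_containing(likelihood, t, peak).sum())
        growth = size - prev_size
        if growth < best_growth:
            best_growth, best_t = growth, t
        prev_size = size
    return best_t + 0.05

def segment_skin_regions(likelihood, peaks):
    """Grow one region per remaining peak; peaks falling inside an already
    segmented region are removed, as in step 3 of the procedure."""
    regions, remaining = [], list(peaks)
    while remaining:
        peak = remaining.pop(0)
        region = region_containing(likelihood, optimal_threshold(likelihood, peak), peak)
        regions.append(region)
        remaining = [p for p in remaining if not region[p]]
    return regions
```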

Acknowledgements

The authors would like to thank David Kriegman and Peter Belhumeur of Yale University and Bernard Achermann of the University of Bern for making their face databases available to them. The support of the National Science Foundation under grant IRI-9529045 is also greatly appreciated.

References

[1] A.J. Colmenarez, T.S. Huang, Face detection with information-based maximum discrimination, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, pp. 782–787.
[2] J.L. Crowley, F. Berard, Multi-modal tracking of faces for video communications, Proceedings of Computer Vision and Pattern Recognition, Puerto Rico, June 1997, pp. 640–645.
[3] T.F. Cootes, C.J. Taylor, Locating faces using statistical feature detectors, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, 1996, pp. 204–209.
[4] V. Govindaraju, Locating human faces in photographs, International Journal of Computer Vision 19 (2) (1996) 129–146.
[5] T.K. Leung, M.C. Burl, P. Perona, Finding faces in cluttered scenes using random labeled graph matching, Proceedings of the Fifth International Conference on Computer Vision, 1995, pp. 637–644.
[6] H.A. Rowley, S. Baluja, T. Kanade, Neural network-based face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1) (1998) 23–38.
[7] K.-K. Sung, T. Poggio, Example-based learning for view-based human face detection, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1) (1998) 39–51.
[8] L. Xu, M. Jackowski, A. Goshtasby, D. Roseman, S. Bines, C. Yu, A. Dhawan, A. Huntley, Segmentation of skin cancer images, Image and Vision Computing 17 (1) (1999).
[9] K.C. Yow, R. Cipolla, Finding initial estimation of human face location, Proceedings of the Second Asian Conference on Computer Vision, Singapore 3 (1995) 514–518.
[10] K.C. Yow, R. Cipolla, A probabilistic framework for perceptual grouping of features for human face detection, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, Vermont, USA, 1996.
[11] K.C. Yow, R. Cipolla, Feature-based human face detection, Image and Vision Computing 15 (1997) 713–735.
[12] Q. Chen, H. Wu, M. Yachida, Face detection by fuzzy pattern matching, Proceedings of the Fifth International Conference on Computer Vision, 1995, pp. 591–596.
[13] Y. Dai, Y. Nakano, Face-texture model based on SGLD and its applications in face detection in a color scene, Pattern Recognition 29 (6) (1996) 1007–1017.
[14] Y. Miyake, H. Saitoh, H. Yaguchi, N. Tsukada, Facial pattern detection and color correction from television picture and newspaper printing, Journal of Imaging Technology 16 (5) (1990) 165–169.
[15] K. Sobottka, I. Pitas, Segmentation and tracking of faces in color images, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, 1996, pp. 236–241.
[16] Y. Dai, Y. Nakano, H. Miyao, Extraction of facial images from a complex background using SGLD matrices, Proceedings of the Twelfth International Conference on Pattern Recognition 1 (1994) 137–141.
[17] R.M. Haralick, Texture features for image classification, IEEE Transactions on Systems, Man and Cybernetics 3 (6) (1973) 610–621.
[18] B. Schiele, A. Waibel, Gaze tracking based on face-color, presented at the International Workshop on Face and Gesture Recognition, Zurich, July 1995 (unpublished proceedings).
[19] J.K. Kasson, W. Plouffe, An analysis of selected computer interchange color spaces, ACM Transactions on Graphics 11 (4) (1992) 373–405.
[20] Y. Gong, M. Sakauchi, Detection of regions matching specified chromatic features, Computer Vision and Image Understanding 61 (2) (1995) 263–269.
[21] R. Jain, R. Kasturi, B.G. Schunck, Machine Vision, McGraw-Hill, New York, 1995, pp. 327–635.