Expert Systems with Applications 33 (2007) 86–95
www.elsevier.com/locate/eswa

Semantic-based facial expression recognition using analytical hierarchy process

Shyi-Chyi Cheng a,*, Ming-Yao Chen b, Hong-Yi Chang b, Tzu-Chuan Chou c

a Department of Computer Science, National Taiwan Ocean University, Taiwan
b Department of Computer and Communication Engineering, National Kaohsiung First University of Science and Technology, Taiwan
c Department of Information Management, National Taiwan University of Science and Technology, Taiwan

* Corresponding author. Tel.: +886 2 24622192; fax: +886 2 24623249. E-mail address: [email protected] (S.-C. Cheng).
doi:10.1016/j.eswa.2006.04.019
Abstract

In this paper we present an automatic facial expression recognition system that utilizes a semantic-based learning algorithm built on the analytical hierarchy process (AHP). Automatic facial expression recognition methods are similar in that they first extract low-level features from images or video, these features are then fed into a classification system, and the outcome is one of the preselected emotion categories. Although the effectiveness of low-level features in automatic facial expression recognition systems has been widely studied, their success is shadowed by the innate discrepancy between machine and human perception of an image. The gap between low-level visual features and high-level semantics should be bridged in a proper way in order to construct a seamless automatic facial expression system that satisfies user perception. For this purpose, we use the AHP to provide a systematic way to evaluate the fitness of a semantic description for interpreting the emotion of a face image. A semantic-based learning algorithm is also proposed to adapt the weights of low-level visual features for automatic facial expression recognition. The weights are chosen such that the discrepancy between the facial expression recognition results obtained from low-level features and those obtained from the high-level semantic description is small. In the recognition phase, only the low-level features are used to classify the emotion of an input face image. The proposed semantic learning scheme thus provides a way to bridge the gap between the high-level semantic concept and the low-level features for automatic facial expression recognition. Experimental results show that the performance of the proposed method is excellent when compared with that of traditional facial expression recognition methods.
© 2006 Elsevier Ltd. All rights reserved.

Keywords: Facial expression recognition; Low-level visual feature; High-level semantic concept; Analytical hierarchy process; Semantic learning
1. Introduction

Most current human–computer interaction (HCI) techniques rely on modalities such as key presses, mouse movement, or speech input. These HCI techniques do not provide natural, human-to-human-like communication; the information about emotions and the mental state of a person contained in human faces is usually ignored. Thanks to the advances of artificial intelligence techniques in the past decades, it is now possible to enable communication with computers in a natural way,
similar to everyday interaction between people, using an automatic facial expression recognition system (Fasel & Luettin, 2003). Since the early 1970s, Ekman and Friesen (Ekman & Friesen, 1978) have performed extensive studies of human facial expressions and defined six basic emotions (happiness, sadness, fear, disgust, surprise, and anger), each of which corresponds to a unique facial expression. They also defined the Facial Action Coding System (FACS), a system that provides a systematic way to analyze facial expressions through standardized coding of changes in facial motion. FACS consists of 46 Action Units (AUs) which describe basic facial movements. Ekman's work inspired many researchers to analyze facial features
using image and video processing. By tracking facial features and measuring the amount of facial movement, these methods attempt to categorize different facial expressions. Based on these basic expressions or a subset of them, Suwa, Sugie, and Fujimora (1978) and Mase and Pentland (1991) performed early work on automatic facial expression analysis. Detailed reviews of much of the recent work on facial expression analysis can be found in Fasel and Luettin (2003) and Pantic and Rothkrantz (2000). All these methods are similar in that they first extract some features from images or video, these features are then used as inputs into a classification system, and the outcome is one of the preselected emotion categories. They differ mainly in the features extracted and in the classifiers used to label an input face image.

Facial features used for automatic facial expression analysis can be obtained using image processing techniques. In general, the dimensionality of the low-level visual features used to describe a facial expression is high. Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), the Discrete Cosine Transform (DCT), the Wavelet Transform, etc., are commonly used techniques for data reduction and feature extraction (Calder, Burton, Miller, & Young, 2001; Draper, Baek, Bartlett, & Beveridge, 2003; Jeng, Yao, Han, Chern, & Liu, 1993; Lyons, Budynek, & Akamatsu, 1999; Martinez & Kak, 2001; Saxena, Anand, & Mukerjee, 2004). Such visual features contain the most discriminative information and provide more reliable training of classification systems. It is important to normalize the values that correspond to facial feature changes using the facial features extracted from the person's neutral face in order to construct a person-independent automatic facial expression recognition system. FACS has been used to describe visual features in facial expression recognition systems (Tian, Kanade, & Cohn, 2001). Furthermore, the low-level facial features known as Facial Animation Parameters (FAPs), supported by the MPEG-4 standard, are also widely used in automatic facial expression recognition (Aleksic & Katsaggelos, 2004; Donato, Hager, Bartlett, Ekman, & Sejnowski, 1999; Essa & Pentland, 1997; Pardas & Bonafonte, 2002). Fig. 1 shows the FAPs that contain significant information about facial expressions, controlling eyebrow (group 4) and mouth (group 8) movement (Text for ISO/IEC FDIS 14496-2 Visual, 1998). In recent work, the approaches to automatic facial expression recognition can be classified into three categories
Fig. 1. Outer-lip and eyebrow FAPs (Tian et al., 2001).
(Fasel & Luettin, 2003). In the image-based approach, the whole face image, or images of parts of the face, are processed in order to obtain visual features. The weightings of different parts of the face should differ in order to improve performance; for example, nose movement obviously carries less information about facial expressions than eyebrow and mouth movement, so its weighting should be decreased to improve recognition accuracy. In the deformation-extraction approach, facial expression recognition is conducted through the deformation information of each part of the face. Models used to extract deformation information include the Active Shape Model and the Point Distribution Model; the common process for these models is to estimate the motion vectors of a set of feature points, which are then used to recognize facial expressions. The disadvantages of this approach are that (1) the feature points are usually sensitive to noise (e.g., changes in lighting conditions) and hence unstable, and (2) the computational complexity of motion estimation is high. In the geometric-analysis approach, the shape and position of each part of the face are used to represent the face for expression classification and recognition.

Facial expression recognition is performed by a classifier, which often consists of models of pattern distribution coupled to a decision procedure. A wide range of classifiers, covering parametric as well as non-parametric techniques, has been applied to the automatic facial expression recognition problem (Fasel & Luettin, 2003; Pantic & Rothkrantz, 2000). Neural networks (Tian et al., 2001), hidden Markov models (Aleksic & Katsaggelos, 2004; Pardas & Bonafonte, 2002), k-nearest neighbor classifiers (Bourel, Chibelushi, & Low, 2002), etc. are commonly used to perform the classification.

Although the rapid advance of face image processing techniques, such as face detection and face recognition, provides a good starting point for facial expression analysis, the semantic gap between low-level visual features and high-level user perception remains a challenge in constructing an effective automatic facial expression recognition system. Facial features exhibit a high degree of variability due to a number of factors, such as differences across people (arising from age, illness, gender, or race, for example), growth or shaving of beards or facial hair, make-up, blending of several expressions, and superposition of speech-related facial deformation onto affective deformation (Bourel et al., 2002). Low-level visual features are also unstable under varying imaging conditions. It is therefore very important to introduce semantic knowledge into automatic facial expression recognition systems in order to improve the recognition rate. However, research into automatic facial expression recognition systems capable of adapting their knowledge periodically or continuously has not received much attention. Incorporating adaptation into the recognition framework is a feasible approach to improving the robustness of the system under adverse conditions.
In this paper we present an automatic facial expression recognition system that utilizes a semantic-based learning algorithm built on the analytical hierarchy process (AHP) (Min, 1994; Saaty, 1980). In general, human emotions are hard to represent using only low-level visual features due to the lack of facial image understanding models. Although the effectiveness of low-level features in automatic facial expression recognition systems has been widely studied, their success is shadowed by the innate discrepancy between machine and human perception of an image. The gap between low-level visual features and high-level semantics should be bridged in a proper way in order to construct a seamless automatic facial expression system that satisfies user perception. For this purpose, we use the AHP to provide a systematic way to evaluate the fitness of a semantic description for interpreting the emotion of a face image. A semantic-based learning algorithm is also proposed to adapt the weights of low-level visual features for automatic facial expression recognition. The weights are chosen such that the discrepancy between the facial expression recognition results obtained from low-level features and those obtained from the high-level semantic description is small. In the recognition phase, only the low-level features are used to classify the emotion of an input face image. The proposed semantic learning scheme thus provides a way to bridge the gap between the high-level semantic concept and the low-level features for automatic facial expression recognition. Experimental results show that the performance of the proposed method is excellent when compared with that of traditional facial expression recognition methods.

The remainder of this paper is organized as follows. Section 2 describes the proposed semantic-based facial representation using AHP in detail. The adaptation scheme for choosing the weights of low-level visual features by utilizing semantic clustering results is presented in Section 3. Section 4 reports experimental results, and conclusions are given in Section 5.

2. Semantic-based face representation using analytic hierarchy process

AHP, proposed by Saaty (1980), provides a systematic way to solve multi-criteria preference problems involving qualitative data and has been widely applied to a great diversity of areas (Cheng, Chou, Yang, & Chang, 2005; Lai, Trueblood, & Wong, 1999; Min, 1994). Pairwise comparisons are used in this decision-making process to form a reciprocal matrix by transforming qualitative data into crisp ratios, which makes the process simple and easy to handle. The reciprocal matrix is then solved by a weight-finding method to determine the criteria importance and alternative performance. The rationale for choosing AHP, despite criticism of its rigidity, is that the problem of assigning semantic descriptions to the objects of an image can be formulated as a multi-criteria preference problem. As shown in Fig. 2, the two face images should be classified
Fig. 2. Two face images of the "happiness" emotion with different low-level visual features.
as ‘‘happiness’’ emotion using human assessment, however, the outer-lip movement for the two face images is much different. Semantic knowledge plays an important role in an automatic facial expression recognition system such that the system fairly meets user perception. It is shown in our previous work that the AHP provides a good way to evaluate the fitness of a semantic description used to represent an image object (Cheng et al., 2005). 2.1. A brief review of AHP The process of AHP includes three stages of problemsolving: decomposition, comparative judgments, and synthesis of priority. The decomposition stage aims at the construction of a hierarchical network to represent a decision problem, with the top level representing overall objectives and the lower levels representing criteria, sub-criteria, and alternatives. With comparative judgments, users are requested to set up a comparison matrix at each hierarchy by comparing pairs of criteria or sub-criteria. A scale of values ranging from 1 (indifference) to 9 (extreme preference) is used to express the users preference. Finally, in the synthesis of priority stage, each comparison matrix is then solved by an eigenvector method for determining the criteria importance and alternative performance. The following list provides a brief summary of all steps involved in AHP applications: 1. Specify a concept hierarchy of interrelated decision criteria to form the decision hierarchy. 2. For each hierarchy, collect input data by performing a pairwise comparison of the decision criteria. 3. Estimate the relative weightings of decision criteria by using an eigenvector method. 4. Aggregate the relative weights up the hierarchy to obtain a composite weight which represents the relative importance of each alternative according to the decision-maker’s assessment. One major advantage of AHP is that it is applicable to the problem of group decision-making. In group decision setting, each participant is required to set up the preference
of each alternative by following the AHP method and all the views of the participants are used to obtain an average weighting of each alternative.
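As a concrete illustration of steps 2 and 3, the sketch below (not part of the original paper; it assumes a NumPy environment and hypothetical judgment values) derives priority weights from a reciprocal pairwise comparison matrix using the row geometric-mean approximation of the principal eigenvector, which is also the rule the authors adopt later in Eqs. (3) and (4).

```python
import numpy as np

def ahp_weights(M):
    """Priority weights from a reciprocal pairwise comparison matrix M.

    Approximates the principal eigenvector by the geometric mean of each row,
    then normalizes the result so that the weights sum to 1.
    """
    M = np.asarray(M, dtype=float)
    r = np.prod(M, axis=1) ** (1.0 / M.shape[0])  # geometric mean of each row
    return r / r.sum()

# Hypothetical Level 1 comparison of "Neutral", "Happiness", and "Surprise":
# e.g., Happiness is strongly preferred (5) over Neutral for a given image.
M1 = [[1.0, 1/5, 1/2],
      [5.0, 1.0, 3.0],
      [2.0, 1/3, 1.0]]
print(ahp_weights(M1))  # approximately [0.12, 0.65, 0.23]
```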
‘‘sadness’’ for the image object in Fig. 2(b) according to the authors’ opinion. 2.3. Semantic-based facial expression representation
2.2. Semantic facial expression representation using AHP We view a face image as a compound object containing multiple component objects which are then described by several semantic descriptions according to a three-level concept hierarchy. The concept hierarchy, shown in Fig. 3, is used to assign the semantics to a facial expression for an input face image. According to the hierarchy, the method for involving the semantic facial expression recognition to a face database by AHP is proposed in this study. For the sake of illustration convenience, the classification hierarchy is abbreviated as FEC hierarchy. There are seven subjects in the top level of FEC hierarchy. Each top-level subject corresponding to a facial expression category is then divided into several sub-subjects corresponding to the parts of the face image controlling the human emotion, and each sub-subject is again decomposed into several Level 3 subjects corresponding to the facial animation parameters in MPEG-7 are used to describe a facial expression. A path from the root to each leaf node forms a semantic description, and multiple semantic descriptions are possible to interpret a facial object according to different aspects of user notion. A question arises naturally: is the weight of each path code of an image object equivalent? The answer to the problem is of course no. Some semantic descriptions are obviously more important than others for a specific image object. For example, the semantic description ‘‘happiness’’ is more important than the code with the semantic description
Assume the path codes of the semantic classification hierarchy are numbered from 1 to n. Given a face image I, the content of the image is represented by a semantic vector defined as

I = (s_1, s_2, \ldots, s_n), \qquad \sum_{i=1}^{n} s_i = 1,          (1)

where s_i denotes the weighting of the ith path code. Although the value of n is large, in any vector representing an image the vast majority of the components will be zero, because the number of objects perceived in an image is generally small. Assigning weights to the path codes in a semantic vector is a complex process. Weights could be assigned automatically using object recognition techniques; however, this problem is far from being solved. Instead, in this paper, weights are assigned using the analytical hierarchy process. Note that the numerical character of a weight limits the possibility of assigning it directly through human assessment. One major advantage of using AHP to assign weights to the path codes is that users are only required to set the relative importance of several pairs of semantic descriptions, and the values of the weights are then calculated automatically. The judgment of the importance of one semantic description over another can be made subjectively and converted into a numerical value using a scale of 1–9, where 1 denotes equal importance and 9 denotes the highest degree of favoritism. Table 1 lists the possible judgments and their representative numerical values. The numerical values representing the judgments of the pairwise comparisons are arranged to form a reciprocal matrix for further calculations; the main diagonal of the matrix is always 1. Users are required to adopt a top-down approach in their pairwise comparisons. Given an image, the first step of the classification process using AHP is to choose the large classification codes and evaluate their relative importance by performing pairwise comparisons. For example, Fig. 4(a), containing a face image, is the target of classification.
[Fig. 3 layout: Level 1 – Neutral, Happiness, Sadness, Anger, Fear, Surprise, Disgust; Level 2 – Eyebrows, Eyes, Mouth; Level 3 – for the eyebrows: higher/lower eyebrows, the left higher than the right, the right higher than the left, inner sides far/close/higher/lower, others; for the eyes: wide open, narrow, close, close the left and open the right, close the right and open the left, others; for the mouth: higher/lower right corner, higher/lower left corner, higher/lower upper lip, higher/lower lower lip, circular shape, narrow mouth, lopsided mouth, other shape.]
Fig. 3. The concept hierarchy of the facial expression for interpreting an input face image.
Table 1
Pairwise comparison judgments between semantic descriptions A and B

Judgment                                                Value
A is equally preferred to B                                 1
A is equally to moderately preferred over B                 2
A is moderately preferred over B                            3
A is moderately to strongly preferred over B                4
A is strongly preferred to B                                5
A is strongly to very strongly preferred over B             6
A is very strongly preferred over B                         7
A is very strongly to extremely preferred over B            8
A is extremely preferred to B                               9
Fig. 4. An example image to be classified: (a) the face image; (b) the corresponding reciprocal matrix with respect to (a) for calculating the local weightings of Level 1 semantic descriptions in interpreting the expressions of the image.
In this case, the image can be classified into three Level 1 expression categories: "Neutral" (N), "Happiness" (H), and "Surprise" (S). Fig. 4(b) is the corresponding Level 1 reciprocal matrix M_1 for judging the relative importance of the three semantic descriptions. The entries of M_1 can be denoted as
M_1 = \begin{array}{c|ccc}
        & N       & H       & S       \\ \hline
      N & w_N/w_N & w_N/w_H & w_N/w_S \\
      H & w_H/w_N & w_H/w_H & w_H/w_S \\
      S & w_S/w_N & w_S/w_H & w_S/w_S
      \end{array}          (2)
where w_N, w_H, and w_S are the relative importance values (defined in Table 1) for the three semantic descriptions N, H, and S, respectively. The Level 1 weightings of the three semantic descriptions are then obtained from M_1. Without loss of generality, let l, m, and n be the number of Level 1 semantic descriptions, the number of Level 2 semantic descriptions for each Level 1 description, and the number of Level 3 semantic descriptions for each Level 2 description, respectively. For each row of the Level 1 reciprocal matrix M_1, we can define a weighting measurement as

r_i^1 = \sqrt[l]{ \frac{a_i^1}{a_1^1} \cdot \frac{a_i^1}{a_2^1} \cdots \frac{a_i^1}{a_l^1} }, \qquad i = 1, \ldots, l,          (3)

where a_i^1 is the relative importance value of the ith Level 1 semantic description. The Level 1 weightings are then determined by

w_i^1 = r_i^1 \Big/ \sum_{j=1}^{l} r_j^1, \qquad i = 1, \ldots, l.          (4)

Similarly, we can compute the Level 2 weightings w_{i,j}^2, j = 1, \ldots, m, for the ith Level 1 semantic description and the Level 3 weightings w_{i,j,k}^3, k = 1, \ldots, n, for the ith Level 1 and jth Level 2 semantic descriptions. Finally, the entry p of the semantic vector defined in Eq. (1) is computed as

w_{p=(i-1)l+(j-1)m+k} = w_i^1 \, w_{i,j}^2 \, w_{i,j,k}^3.          (5)

Note that the number of reciprocal matrices for the image is l · m · n, which is actually equal to the number of path codes used to describe the face image. It would be too cumbersome to classify an image using AHP if the value of l · m · n were very large. Fortunately, this problem does not occur because most face images do not need a large number of path codes to describe them; most need at most 4–10 path codes according to our experience. Obviously, most of the weightings corresponding to semantic descriptions in the semantic vector are zero.
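To make the composition of Eqs. (3)–(5) concrete, the following sketch (not from the paper; shapes and values are hypothetical) multiplies the per-level weights into semantic-vector entries. Note that it flattens the entries in row-major order, i.e. p = (i-1)mn + (j-1)n + k, rather than the exact index expression printed in Eq. (5); since each level's weights sum to 1 over its children, the resulting vector satisfies the sum-to-one constraint of Eq. (1).

```python
import numpy as np

def semantic_vector(w1, w2, w3):
    """Compose semantic-vector entries from per-level AHP weights (cf. Eq. (5)).

    w1: (l,) Level 1 weights; w2: (l, m) Level 2 weights for each Level 1 subject;
    w3: (l, m, n) Level 3 weights for each (Level 1, Level 2) pair.
    """
    s = w1[:, None, None] * w2[:, :, None] * w3  # s[i, j, k] = w1_i * w2_ij * w3_ijk
    return s.reshape(-1)                         # one entry per path code

# Hypothetical toy hierarchy with l = 2, m = 2, n = 2:
w1 = np.array([0.7, 0.3])
w2 = np.array([[0.6, 0.4],
               [0.5, 0.5]])
w3 = np.full((2, 2, 2), 0.5)
s = semantic_vector(w1, w2, w3)
print(s, s.sum())  # the entries sum to 1, as required by Eq. (1)
```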
3. Proposed semantic-based automatic facial expression recognition

The proposed automatic facial expression recognition system operates in two phases. In the learning phase, a training database is used to learn the structure of the classifier from semantic vectors. The semantic vectors of the training samples, obtained from AHP, are first clustered in order to choose proper weightings for the extracted low-level visual features; these weightings are used to compute the similarity between two face images in the recognition phase. Fig. 5 shows the block diagram of the proposed method, which is discussed in detail below.

[Fig. 5 layout: learning phase – face images, low-level feature extraction, semantic vector extraction using AHP, semantic vector clustering, k-NN searching using low-level features and using semantic vectors, weighting adaptation for low-level features, yielding the semantic knowledge; recognition phase – input face images, low-level feature extraction, k-NN searching using low-level features, facial expression recognition.]
Fig. 5. Block diagram of the proposed semantic-based automatic facial expression recognition system.
3.1. Low-level visual feature extraction
As mentioned above, the concept hierarchy shown in Fig. 3 plays the role of bridging the gap between high-level user perception and low-level visual features in the proposed facial expression recognition system. Given an input face image, the possibility that the image should be classified into a Level 3 subject of the concept hierarchy can be measured using a set of low-level visual features. For example, one can judge whether the eyebrows of an input face image are higher than those of the corresponding neutral face image by measuring the change in eyebrow position from the neutral image to the input image. In the feature extraction stage, 14 characteristic points are first detected in a face image, and then the relative feature distances (a–l, n) among these points, shown in Fig. 6, are calculated. Note that the sizes of two output images of a camera for the same face differ if two different focal lengths are used; hence, the feature distances must be normalized in order to eliminate the effects of camera operations. The distance n between the inner corners of the eyes is used to normalize the feature distances, because the inner corners of the eyes are relatively stable to detect using image processing techniques. The normalized feature distances a'–l' are computed by

a' = a/n,  b' = b/n,  c' = c/n,  d' = d/n,  e' = e/n,  f' = f/n,
g' = g/n,  h' = h/n,  i' = i/n,  j' = j/n,  k' = k/n,  l' = l/n.          (6)

Fig. 6. The extracted low-level visual features: (a) the characteristic points in the face image; (b) the distances among the characteristic points for describing the muscle activities.

Finally, the 12 normalized feature distances, used as low-level visual features, are further subtracted from the corresponding normalized feature distances of the common base image of neutral expression.
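A minimal sketch of this normalization step (not from the paper; array and parameter names are hypothetical):

```python
import numpy as np

def expression_features(raw_distances, eye_inner_distance, neutral_features):
    """Normalize the 12 raw feature distances a..l by the inner-eye-corner
    distance n (Eq. (6)) and subtract the person's neutral-face baseline."""
    normalized = np.asarray(raw_distances, dtype=float) / eye_inner_distance
    return normalized - np.asarray(neutral_features, dtype=float)
```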
3.2. Semantic clustering for facial expressions

As mentioned above, the semantic information of each training face image is represented as a semantic vector. However, the entries of semantic vectors are mostly zero, so the dimensionality of the semantic vectors should be reduced in a proper way in order to compact the semantic information. In this work, we use the widely used K-means algorithm to cluster the semantic vectors of the training database into K semantic clusters, each of which carries different semantic information. The value of K would be 7 (corresponding to the "neutral", "happiness", "sadness", "fear", "anger", "surprise", and "disgust" expression categories) for automatic facial expression recognition if the number of sample faces, covering all types of facial expressions, is large enough. On the other hand, the value of K could be less than 7 to reduce the effect of the small-sample-size problem, in which, for small training sets, the within-class scatter matrix becomes singular and its inverse does not exist. The semantic distance d(I_A, I_B) between two face images with semantic vectors I_A = (s_1^{(A)}, s_2^{(A)}, \ldots, s_N^{(A)}) and I_B = (s_1^{(B)}, s_2^{(B)}, \ldots, s_N^{(B)}) is defined as

d(I_A, I_B) = \sum_{i=1}^{N} \left[ s_i^{(A)} (1 - s_i^{(B)}) + s_i^{(B)} (1 - s_i^{(A)}) \right],          (7)

where N is the total number of semantic descriptions. The term s_i^{(A)}(1 - s_i^{(B)}) + s_i^{(B)}(1 - s_i^{(A)}) is actually the probability of objects I_A and I_B disagreeing with each other on the ith semantic description. For the sake of easy reference, the semantic clustering using the K-means algorithm is described below:

Step 1: For each semantic cluster S_k, k = 1, \ldots, K, a random initial semantic vector is chosen as the cluster representative \bar{I}_k.
Step 2: For every semantic vector I, the distance between I and \bar{I}_k, k = 1, \ldots, K, is evaluated. If d(I, \bar{I}_i) < d(I, \bar{I}_k) for all k ≠ i, then I is assigned to cluster S_i.
Step 3: According to the new classification, \bar{I}_k is recalculated. If M_k elements are assigned to S_k, then

\bar{I}_k = \frac{1}{M_k} \sum_{m=1}^{M_k} I_m,          (8)

where I_m, m = 1, \ldots, M_k, are the semantic vectors belonging to cluster S_k.
Step 4: If the new \bar{I}_k is equal to the old one, stop; otherwise go to Step 2.
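A minimal sketch of Steps 1–4 with the disagreement distance of Eq. (7) (not from the paper; it assumes NumPy and hypothetical function names):

```python
import numpy as np

def semantic_distance(a, b):
    """Probability of two semantic vectors disagreeing (Eq. (7))."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sum(a * (1.0 - b) + b * (1.0 - a))

def semantic_kmeans(vectors, K, max_iter=100, seed=0):
    """Cluster semantic vectors into K semantic clusters (Steps 1-4, Eq. (8))."""
    rng = np.random.default_rng(seed)
    X = np.asarray(vectors, float)
    centers = X[rng.choice(len(X), size=K, replace=False)]  # Step 1: random representatives
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2: assign each vector to its closest representative under d(., .)
        dists = np.array([[semantic_distance(x, c) for c in centers] for x in X])
        new_labels = dists.argmin(axis=1)
        # Step 3: recompute each representative as the mean of its members (Eq. (8))
        centers = np.array([X[new_labels == k].mean(axis=0) if np.any(new_labels == k)
                            else centers[k] for k in range(K)])
        if np.array_equal(new_labels, labels):               # Step 4: stop when stable
            break
        labels = new_labels
    return labels, centers
```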
3.3. Weighting adaptation for low-level visual features
Once the semantic clusters are obtained, the 12 low-level visual features (cf. Fig. 6) are extracted from the content of each database image. Given a query face image, users should not be required to set the weighting of each feature type in order to recognize semantically relevant expressions.
Unfortunately, the low-level features and the high-level semantic concepts do not have an intuitive relationship, and hence the problem of setting the weightings is not trivial; this limits the recognition accuracy of the automatic facial expression recognition system. In this paper, we propose a method for automatically determining the weightings of the low-level visual features for each semantic cluster. A winnow-like, mistake-driven learning algorithm is used to learn the discriminant functions g(x) that define the decision boundaries of the semantic clusters in terms of low-level visual features. The paradigm followed in the literature for learning from labeled and unlabeled data is based on inducing classifiers from the semantic vectors of the training samples. The induced classifiers are then used to supervise classifiers defined in terms of low-level visual features, such that the classification results obtained with low-level visual features agree with those obtained with high-level semantic vectors.

Let F_q = (f_1^{(q)}, f_2^{(q)}, \ldots, f_m^{(q)}) be the low-level visual features of an input image q. Given a semantic cluster S_i containing n images, the n images can be ranked with respect to q in terms of semantic information using Eq. (7), giving the ordered set S_i = (I_1, I_2, \ldots, I_n). The proposed weighting adaptation algorithm aims at choosing weightings for the low-level visual features such that the same ordering S_i can be obtained when answering q in terms of low-level visual features. More concretely, we define a cost function J(\vec{a}) to be minimized as

J(\vec{a}) = \sum_{j=1}^{n} \left| t_j^{(S)} - t_j^{(L)} \right|,          (9)

where \vec{a} is the weighting vector for the low-level visual features and t_j^{(S)} and t_j^{(L)} are the ranks of the jth image with respect to a query image in terms of high-level semantic vectors and low-level visual features, respectively. In addition, the distance between q and an image I in S_i is defined as

D(q, I) = \sum_{j=1}^{m} a_j \left( f_j^{(q)} - f_j^{(I)} \right)^2.          (10)

The proposed learning algorithm uses a set of weak learners, each working on a single feature at a time. The value of a_k corresponding to the kth feature should decrease if J_k(a_k) > J(\vec{a}), where J_k(a_k) is the cost function using only the kth feature of q. The proposed learning algorithm is briefly described as follows:

Algorithm. Weighting adaptation
Input: a semantic cluster S_i containing n images and the number of iterations T.
Output: a weighting vector \vec{a} for S_i.
Method:
(1) Initialize the weights a_k^{(0)} = 1/m, k = 1, \ldots, m.
(2) Do for t = 1, \ldots, T:
(3) For each image q in S_i do
  (3.1) Answer q and rank the n images in S_i using (7).
  (3.2) Answer q and rank the n images in S_i using (10), and compute the value of J(\vec{a}) using (9).
  (3.3) Do for k = 1, \ldots, m:
    (3.3.1) Answer q and rank the n images in S_i using the kth low-level feature only, and compute the value of J_k(a_k^{(t)}).
    (3.3.2) Update the weight factor a_k^{(t)} as follows:

a_k^{(t+1)} = \begin{cases} a_k^{(t)} / c & \text{if } J_k(a_k^{(t)}) > J(\vec{a}^{(t)}), \\ c \, a_k^{(t)} & \text{if } J_k(a_k^{(t)}) < J(\vec{a}^{(t)}), \end{cases}          (11)

    where c is the regulation factor; in this implementation, c = 1.05.
  (3.4) Normalize the weights so that they form a distribution,

a_k^{(t+1)} = a_k^{(t)} \Big/ \sum_{j=1}^{m} a_j^{(t)}.          (12)

For each semantic cluster, we perform the weighting adaptation algorithm to set the values of the weights for the low-level visual features; hence, the weighting vectors differ among different semantic clusters. For each semantic cluster, the system learns the decision boundary in the low-level feature space supervised by the decision boundary in the high-level semantic space. The goal of our scheme is to adaptively learn boundaries that filter the images for the later stage of facial expression recognition.
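The following sketch (not part of the paper; it assumes NumPy arrays and hypothetical helper names) illustrates one way the adaptation loop of Eqs. (9)–(12) could be realized:

```python
import numpy as np

def rank_order(query, items, dist):
    """Rank position of each item when the items are sorted by distance to the query."""
    d = np.array([dist(query, x) for x in items])
    order = d.argsort()
    ranks = np.empty(len(items), dtype=int)
    ranks[order] = np.arange(len(items))
    return ranks

def adapt_weights(features, semantics, T=20, c=1.05):
    """Weighting adaptation for one semantic cluster (a sketch of Eqs. (9)-(12)).

    features: (n, m) low-level feature vectors of the cluster's images;
    semantics: (n, N) their semantic vectors.  Returns the weight vector a.
    """
    n, m = features.shape
    a = np.full(m, 1.0 / m)                                  # step (1): uniform weights
    sem = lambda p, x: np.sum(p * (1 - x) + x * (1 - p))     # semantic distance, Eq. (7)
    for _ in range(T):                                       # step (2)
        for q in range(n):                                   # step (3): each image as a query
            t_S = rank_order(semantics[q], semantics, sem)   # (3.1) ranks by semantics
            wdist = lambda p, x: np.sum(a * (p - x) ** 2)    # weighted distance, Eq. (10)
            t_L = rank_order(features[q], features, wdist)   # (3.2) ranks by low-level features
            J = np.abs(t_S - t_L).sum()                      # cost, Eq. (9)
            for k in range(m):                               # (3.3) one weak learner per feature
                kdist = lambda p, x, k=k: (p[k] - x[k]) ** 2
                J_k = np.abs(t_S - rank_order(features[q], features, kdist)).sum()
                if J_k > J:
                    a[k] /= c                                # Eq. (11): demote the kth feature
                elif J_k < J:
                    a[k] *= c                                # Eq. (11): promote the kth feature
            a = a / a.sum()                                  # (3.4) renormalize, Eq. (12)
    return a
```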
3.4. Facial expression recognition constrained by the boundary

For an input face image, after the weighting vector of the low-level visual features for each semantic cluster has been learned, the database images with large weighted Euclidean distances (see Eq. (10)) from the input image are filtered out. More concretely, a k-nearest-neighbor (k-NN) search in terms of low-level visual features is performed to find the top k nearest neighbors of the input face image in the training database. The top k similar database images are used to decide the final expression category for the input image according to their semantic information. Note that each training face image belongs to a specific semantic cluster; hence, the weighting vector of the training image must be retrieved in advance in order to compute the weighted Euclidean distance between the training image and the input image. According to the concept hierarchy shown in Fig. 3, a semantic vector consists of seven sub-vectors, each of 27 dimensions. The probabilities of the facial expressions on the basis of a semantic vector \vec{s} = (s_1, s_2, \ldots, s_{189}) can be obtained from

p_{neutral} = \sum_{i=1}^{27} s_i, \quad p_{happiness} = \sum_{i=28}^{54} s_i, \quad p_{sadness} = \sum_{i=55}^{81} s_i, \quad p_{anger} = \sum_{i=82}^{108} s_i,
p_{fear} = \sum_{i=109}^{135} s_i, \quad p_{surprise} = \sum_{i=136}^{162} s_i, \quad p_{disgust} = \sum_{i=163}^{189} s_i.          (13)

Given an input face image, we can classify the input image into the facial expression category with the largest probability value according to its semantic vector. The proposed automatic facial expression recognition algorithm is briefly described as follows:

Algorithm. The proposed recognition strategy
Input: a training database TD and an input face image q.
Output: the expression category of the input image.
Method:
(1) Perform the feature extraction process to obtain the low-level visual features F_q for q.
(2) Perform a k-NN search using F_q to find the top k similar database images and form a candidate set C.
(3) Compute the interpolated semantic vector \bar{s}_q using C as follows:

\bar{s}_q = \sum_{j=1}^{|C|} a_j \vec{s}_j,          (14)

where \vec{s}_j is the semantic vector of the jth image I_j in C and a_j is the weight of \vec{s}_j. The value of a_j is obtained by

a_j = \left( 1 - \frac{D(q, I_j)}{D_{\max}} \right) \Big/ \sum_{i=1,\ldots,|C|} \left( 1 - \frac{D(q, I_i)}{D_{\max}} \right),          (15)

where D_{\max} = \max[ D(q, I_j), j = 1, \ldots, |C| ].
(4) According to \bar{s}_q, compute the probability values of all the expression categories using (13).
(5) Output the expression category with the largest probability value as the facial expression category of q.

Obviously, the recognition rate of the proposed method depends on the number of nearest neighbors used to interpolate the semantic vector of the input image q. The underlying idea of the approach is to use a voting scheme to reduce the effect of outliers caused by using low-level visual features to retrieve similar images from the database. In this system, five nearest neighbors are enough to promote the robustness of the proposed method according to our experimental results.
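A compact sketch of the recognition strategy (not from the paper; array names, shapes, and the guard for identical distances are assumptions):

```python
import numpy as np

CATEGORIES = ["neutral", "happiness", "sadness", "anger", "fear", "surprise", "disgust"]

def recognize(query_features, db_features, db_semantics, db_weights, k=5):
    """Recognition strategy sketch following Eqs. (13)-(15).

    db_features: (n, 12) low-level features of the training images;
    db_semantics: (n, 189) their semantic vectors;
    db_weights: (n, 12) the learned weight vector of each image's semantic cluster.
    """
    # Steps (1)-(2): weighted Euclidean distances (Eq. (10)) and the top-k candidate set C.
    d = np.sum(db_weights * (db_features - query_features) ** 2, axis=1)
    C = np.argsort(d)[:k]
    # Step (3): interpolate the semantic vector of the query (Eqs. (14) and (15)).
    d_max = d[C].max()
    a = (1.0 - d[C] / d_max) if d_max > 0 else np.ones(len(C))
    a = a / a.sum() if a.sum() > 0 else np.full(len(C), 1.0 / len(C))  # guard: all equally far
    s_q = (a[:, None] * db_semantics[C]).sum(axis=0)
    # Step (4): category probabilities from the seven 27-dimensional sub-vectors (Eq. (13)).
    probs = s_q.reshape(7, 27).sum(axis=1)
    # Step (5): return the category with the largest probability.
    return CATEGORIES[int(probs.argmax())]
```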
4. Experimental results

In order to evaluate the proposed approach, a series of experiments was conducted on an Intel Pentium IV 1.5 GHz PC using the JAFFE (Japanese Female Facial Expression) face database, including ten persons and five types of facial expression (Lyons et al., 1999). For each person there are on average 10 face images. The images in the database are divided into two parts: one is the training database and the other is the test database. Each image in the training database is first analyzed by the AHP for testing the semantic learning approach. Test images are randomly extracted from the test database. Furthermore, some face images captured directly with a CCD camera are used as input images to test the proposed recognition system.

In general, given an input face image, the expression category that the input image belongs to might differ across human assessors. This makes it difficult to build the "ground truth" for evaluating the recognition performance of a system. The JAFFE database provides semantic information, assessed by a group of experts, for each image in it. In order to test the effectiveness of analyzing facial expressions using AHP, Table 2 compares, for each image in the training database, the semantic vectors with the annotations provided by the JAFFE database. The labeling results of the two methods are very similar to each other. Accordingly, the semantic knowledge of facial expressions built by AHP is trustworthy. In addition, AHP provides a systematic way to generate the semantic information for a face image rather than labeling the image by intuition, which is not a trivial job even for an expert.

The weighting adaptation approach plays an important role in improving the recognition performance of the proposed automatic facial expression recognition system. Tables 3 and 4 show the confusion matrices for the system without and with the proposed weighting adaptation algorithm, respectively. The recognition rate of the proposed method is improved from 67.6% to 85.2%. An interesting result can be seen from the experimental results: the induced semantic knowledge using AHP cannot improve
Table 2
Confusion matrix for labeling face images using AHP versus the direct assignment provided by the JAFFE database (Lyons et al., 1999)

             Neutral   Happiness   Anger   Sadness   Surprise
Neutral      29/30     0           0       1         0
Happiness    1         32/34       0       0         1
Anger        0         0           26/30   3         1
Sadness      2         2           1       27/32     0
Surprise     0         1           1       0         28/30
Table 3
Confusion matrix for the system without the weighting adaptation algorithm (rows: actual category; columns: recognized category)

             Neutral   Happiness   Anger   Sadness   Surprise   Recognition rate (%)
Neutral      22/24     0           0       2         0          92
Happiness    3         11/17       0       0         3          65
Anger        1         0           5/9     3         0          55
Sadness      7         0           7       10/25     1          40
Surprise     1         2           0       0         19/22      86
Average                                                         67.6
Table 4
Confusion matrix for the proposed system using the weighting adaptation algorithm (rows: actual category; columns: recognized category)

             Neutral   Happiness   Anger   Sadness   Surprise   Recognition rate (%)
Neutral      20/24     1           0       3         0          83
Happiness    0         16/17       0       0         1          94
Anger        0         0           7/9     2         0          77
Sadness      3         0           1       19/25     2          76
Surprise     0         1           0       0         21/22      96
Average                                                         85.2
Table 5
Confusion matrix for the system using the multi-layer perceptron (Tian et al., 2001) (rows: actual category; columns: recognized category)

             Neutral   Happiness   Anger   Sadness   Surprise   Recognition rate (%)
Neutral      17/24     2           0       5         0          71
Happiness    5         11/17       0       0         1          65
Anger        0         0           5/9     4         0          56
Sadness      7         1           4       13/25     0          52
Surprise     4         1           0       2         15/22      68
Average                                                         62.4
Fig. 7. The user interface of the proposed method.
the recognition performance for input images that belong to the "neutral" expression category. Actually, many test face images contain multiple expressions, especially when the test image is labeled as the "neutral" expression. However, interpreting an input image in terms of multiple expressions is exactly what our approach adopts.
In order to further verify the effectiveness of the proposed method, an automatic facial expression recognition system using a neural network technique, i.e., the multi-layer perceptron, was also simulated for comparison (Tian et al., 2001). Table 5 shows the recognition rates of the neural network approach. Accordingly, the proposed method outperforms the neural network approach. Fig. 7 shows a recognition example using the user interface of the proposed system.

5. Conclusion

In this paper, we have presented an automatic facial expression recognition system that utilizes semantic knowledge through AHP. The introduction of semantic knowledge obtained by human assessment bridges the gap between the low-level visual features and the high-level semantic concept. In conclusion, the contributions of the proposed approach are as follows: (1) a framework to quantize the qualitative data of user perception using AHP is developed to describe facial expressions semantically; (2) a semantic learning scheme utilizing the proposed weighting adaptation algorithm is implemented; (3) a semantic-based automatic facial expression recognition system is developed. Experimental results show the effectiveness of bridging the low-level visual features and the high-level user perception by adaptively tuning the weights of low-level visual features. The deficiencies of the proposed approach are that (1) the size of the training database is not large; the small-sample-size problem in machine learning should be explored in detail in order to further improve the performance of the system; and (2) it is expected that the use of additional visual information about facial expressions would further improve recognition performance. Furthermore, semantic-based and soft-computing methods based on image-based facial features can be combined to construct a system with machine intelligence.

Acknowledgement

This work has been supported in part by the National Science Council, Taiwan, Grants NSC 93-2213-E-327-002 and NSC 94-2213-E-327-010.

References

Aleksic, P. S., & Katsaggelos, A. K. (2004). Automatic facial expression recognition using facial animation parameters and multi-stream HMMs. In Proceedings of the 8th IEEE international conference on automatic face and gesture recognition.
Bourel, F., Chibelushi, C. C., & Low, A. A. (2002). Robust facial expression recognition using a state-based model of spatially-localised facial dynamics. In Proceedings of the fifth IEEE international conference on automatic face and gesture recognition (pp. 106–111).
Calder, A. J., Burton, A. M., Miller, P., & Young, A. W. (2001). A principal component analysis of facial expressions. Vision Research, 41, 1179–1208.
Cheng, S.-C., Chou, T.-C., Yang, C.-L., & Chang, H.-Y. (2005). A semantic learning for content-based image retrieval using analytical hierarchy process. Expert Systems with Applications, 28, 495–505.
Donato, G., Hager, S., Bartlett, C., Ekman, P., & Sejnowski, J. (1999). Classifying facial actions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(10), 974–989.
Draper, B. A., Baek, K., Bartlett, M. S., & Beveridge, J. R. (2003). Recognizing faces with PCA and ICA. Computer Vision and Image Understanding, 91, 115–137.
Ekman, P., & Friesen, W. (1978). Facial action coding system. Palo Alto, CA: Consulting Psychologists Press.
Essa, I., & Pentland, A. (1997). Coding, analysis, interpretation and recognition of facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7), 757–763.
Fasel, B., & Luettin, J. (2003). Automatic facial expression analysis: A survey. Pattern Recognition, 36, 259–275.
Jeng, S. H., Yao, H. Y. M., Han, C. C., Chern, M. Y., & Liu, Y. T. (1993). Facial feature detection using geometrical face model: An efficient approach. Pattern Recognition, 31(3), 273–282.
Lai, V. S., Trueblood, R. P., & Wong, B. K. (1999). Software selection: A case study of the application of the analytical hierarchy process to the selection of a multimedia authoring system. Information and Management, 36, 221–232.
Lyons, M. J., Budynek, J., & Akamatsu, S. (1999). Automatic classification of single facial images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(12), 1357–1362.
Martinez, A. M., & Kak, A. C. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 228–233.
Mase, K., & Pentland, A. (1991). Recognition of facial expression from optical flow. Transactions of IEICE, E74(10), 3474–3483.
Min, H. (1994). Location analysis of international consolidation terminals using the analytical hierarchy process. Journal of Business Logistics, 15(2), 25–44.
Pantic, M., & Rothkrantz, L. J. M. (2000). Automatic analysis of facial expressions: The state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1424–1445.
Pardas, M., & Bonafonte, A. (2002). Facial animation parameters extraction and expression recognition using Hidden Markov Models. Signal Processing: Image Communication, 17, 675–688.
Saaty, T. L. (1980). The analytic hierarchy process. New York: McGraw-Hill.
Saxena, A., Anand, A., & Mukerjee, A. (2004). Robust facial expression recognition using spatially localized geometric model. International Conference on Systematics, 12(15), 124–129.
Suwa, M., Sugie, N., & Fujimora, K. (1978). A preliminary note on pattern recognition of human emotional expression. In Proceedings of the 4th international joint conference on pattern recognition (pp. 408–410).
Text for ISO/IEC FDIS 14496-2 Visual (1998). ISO/IEC JTC1/SC29/WG11 N2502, November.
Tian, Y.-L., Kanade, T., & Cohn, J. F. (2001). Recognizing action units for facial expression analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2), 97–115.