Journal of Visual Languages and Computing 18 (2007) 440–453
Face alive icon

Xin Li a, Chieh-Chih Chang b, Shi-Kuo Chang a

a Department of Computer Science, University of Pittsburgh, 210 South Bouquet St., RM 6508, Pittsburgh, PA 15213, USA
b Industrial Technology Research Institute, Taiwan
Received 3 November 2006; received in revised form 23 January 2007; accepted 8 February 2007
Abstract

In this paper, we propose a methodology to synthesize facial expressions from photographs for devices with limited processing power, network bandwidth and display area, referred to as the "LLL" environment. The facial images are reduced to small-sized face alive icons (FAIs). Expressions are decomposed into expression-unrelated facial features and expression-related expressional features. As a result, common features can be identified and reused across expressions using a discrete model constructed from statistical analysis of a training data set. Semantic synthesis rules are introduced to reveal the inner relations among expressions. Verified by an experimental prototype system and a usability study, the approach can produce acceptable facial expression images using far less computing, network and storage resources than traditional approaches.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Facial expression synthesis; Face alive icons; Emoticons; Iconic visual language
1. Introduction

Facial expression is one of the primary means of human communication, and it is sometimes even more expressive than words. With the increasing popularity of communication tools such as email and short messages, more and more people communicate by various means without seeing each other. However, facial expressions are still greatly needed in most of these conversations.
Popular instant messaging service (IMS) tools such as Yahoo Messenger [1] and MSN Messenger [2] use cartoon faces to present facial expressions such as happiness, sadness, fear, anger and surprise. These cartoon icons have become so popular that they are also encoded as combinations of ASCII characters, such as ":-)" for happiness and ":-(" for sadness, which are widely used in emails and short messages. The popularity of cartoon face icons and character combinations demonstrates the great need for facial expressions in numerous applications. However, neither of them can provide expressions as natural and vivid as facial images of a real human.

Our motivation to synthesize realistic facial expressions originally came from the Chronobot Virtual Classroom (CVC) system [3]. The CVC system simulates a classroom over the Internet, providing distance learners and teachers with a communication environment as convenient as a traditional face-to-face classroom. The system can be used on personal computers (PCs), personal digital assistants (PDAs) and cell phones. As shown in Fig. 1, the classroom is visualized using emotive icons: user emotions such as happiness and sadness are presented using cartoon faces. Fig. 1(a) and (b) show the visual classroom on a PC and on portable devices (i.e. cell phone and PDA), respectively. Using the push buttons at the top left (Fig. 1(a)) or the items on the popup menu (Fig. 1(b)), users can express their emotions by changing the icons. To convey identity as well as expression, it would be much better to use realistic images as the emotive icons. However, none of the expression synthesis approaches we surveyed fits our needs.

In general, real human facial images are not used for expressions in many IMS systems for three main reasons:

1. Acquisition: they are hard to acquire. Besides the privacy issue, it costs users too much effort to take a photograph for each expression.
2. Transference: they are too big to transfer. Users may use portable devices with limited processing power and network bandwidth.
3. Display: they are too large to display. Portable devices such as cell phones and PDAs have only small display areas.
Fig. 1. The Student Client of the CVC System: (a) interface on PC and (b) interface on portable device.
To address the acquisition problem, several approaches have been described to synthesize expressions from photographs [4–8], in which various facial expressions can be generated from a single image. Although these approaches have succeeded in many applications, the proposed algorithms, as far as we have studied, are all designed for high-resolution images and depend mainly on computationally intensive operations, so they cannot be applied directly on portable devices with limited processing capability and network bandwidth.

In this paper, an approach to synthesize facial expression images from photographs is proposed for portable devices with limited bandwidth, limited processing power and limited display areas, which we call the "LLL" environment. The facial images are reduced to face alive icons: small realistic images, usually 64 x 64 pixels or smaller, which are suitable for most portable devices yet still large enough to represent expressions. Based on a single facial image, the face is decomposed into expression-unrelated features (facial features) and expression-related features (expressional features). Through principal component analysis (PCA) on the training data set, a discrete model of the expressional features is constructed. The face alive icons are synthesized by combining standard states of the expressional features in the model. The icons can be generated on terminal devices using simple combination rules and operations. More importantly, all information needed to generate these icons is packed into the face icon profile (FIP), which is very small. These properties make the approach suitable for portable devices such as cell phones and PDAs. A short report of this study was presented in [9].

The rest of this paper is organized as follows. Related research is briefly reviewed in Section 2. Our approach is compared with the related work and overviewed in Section 3. Section 4 describes the decomposition rules for expressions. The discrete model of expressional features, built through statistical analysis on training data, is explained in Section 5. The synthesis rules and expression encoding are described in Section 6. In Section 7, several icons synthesized using the proposed method are presented, and an experiment is described to quantitatively evaluate the results. Finally, future research is discussed in Section 8.

2. Related research

A number of approaches have been developed to synthesize facial expressions from photographs [4–8]. Depending on the underlying techniques, they can generally be categorized into two groups: warping-based approaches and morphing-based approaches. Noh and Neumann [5] proposed a typical warping-based approach that clones an expression from one person to another using vertex motion vectors. The main drawback of this kind of approach is that it only considers shape changes and ignores other factors, such as textures and illumination, that are also important. Morphing-based approaches such as [6,7] usually use a large collection of sample expressions, and new expressions are produced by morphing between these samples. Although this method can generate photo-realistic expressions, it does not work for a new person who has no similar samples in the collection. A hybrid method was proposed by Gralewski et al.
in [10], in which new sequences of facial expressions are synthesized from existing video footage using principal component analysis (PCA) and an auto-regressive process (ARP). Through PCA, the video frames are projected from image space into a much lower-dimensional space, which
can significantly facilitate the synthesis process. Similarly, we use PCA as our major analysis tool. However, we do not apply it to the entire face as Gralewski et al. did, but only to the expression-related features. In other words, we first decompose the facial images into features, and then apply PCA only to the features that relate to expressions.

The work most closely related to ours is a hybrid approach proposed by Wang and Ahuja [8], who decompose face images along three dimensions, person, expression and feature, using high-order singular value decomposition (HOSVD). Expression synthesis is then done by tensor-based transformation in the separate dimensions based on the training data. It is relatively simple and can produce acceptable results even for new persons. However, as in all the approaches mentioned above, both the decomposition and the synthesis algorithms are computationally intensive, and neither is suitable for portable devices in the "LLL" environment.

Since the facial model is very important in the audio-visual world, the Moving Picture Experts Group (MPEG) [11] has been working on facial animation for years. Beginning with MPEG-4, a standard object-based 3-D facial model was introduced to synthesize facial image sequences [12,13]. Similar to our approach, the facial model is encoded into several independent features. For example, MPEG-4 specifies three types of facial data [12]: (1) facial animation parameters (FAP), which animate a facial model; (2) facial definition parameters (FDP), which configure a facial model; and (3) the FAP interpolation table (FIT), which defines interpolation rules for the decoder. Based on these data, high-quality synthetic facial animations can be produced at the receiver. However, the MPEG facial technology mainly targets stream data, for which many optimizations have been made. Compared with our approach, both face encoding and decoding in MPEG involve a high degree of complexity, and neither can be done on portable computing devices without special hardware support.

A compromise solution for the "LLL" environment is to perform the complex expression synthesis on high-performance servers and then distribute the output expression images to low-performance terminal devices, so that the terminal devices only need to display these images. Although this method removes the processing workload from the terminal devices, the heavy burden on network transfer and device storage remains, which could become a severe problem if a large number of expression images are needed. From this point of view, our approach is unique because we take expressional features as the basic processing units instead of whole facial images. By combining different states of the expressional features on the terminal devices, the face alive icons can be generated using much less network and storage resource.
3. Overview of our approach

Our approach contains two distinct steps (Table 1): expression decomposition and icon synthesis. Expression decomposition involves computationally intensive operations, so it is performed on high-performance servers. Icon synthesis, on the other hand, is a lightweight process that can be done on portable computing devices. The connection between the two steps is the user's facial icon profile (FIP), which includes the facial images and the expressional features. The FIP is distributed from the high-performance servers to the portable terminal devices. Since the FIP is much smaller than a bundle of expression images, our approach saves a great amount of network bandwidth and device storage resources.
Table 1
Two steps of our approach

Step                               Host                Input                 Output
Step 1: Expression decomposition   Servers             Photograph            Facial icon profile
Step 2: Icon synthesis             Terminal devices    Facial icon profile   Face alive icons
Fig. 2. The process diagram of the expression decomposition.
In the expression decomposition step, the facial icon profile is generated from a photograph provided by the user. The process is illustrated in Fig. 2. A facial photograph is input together with feature points such as the eye sockets and lip corners. The facial image is extracted by the face extractor and passed to the expression decomposer, where it is decomposed along two dimensions: facial features and expressional features. The facial features are copied directly into the facial icon profile because they remain unchanged across expressions. A discrete model of the expressional features is created in the expressional feature generator, which is described in the following sections. As a result, the facial icon profile contains the facial features as well as the expressional features described by the discrete model. Its compact size makes it easy to distribute to terminal devices over the network.

The following step is icon synthesis on the terminal devices. The process is illustrated in Fig. 3. Expressions are encoded into uniformly formatted codes that define the states of the expressional features. A synthesis algorithm combines specific states of the expressional features, following a generation grammar, to produce expressions. The only operation involved is overlaying the appropriate expressional features onto the facial features, as sketched below. Because this kind of overlay is a very simple, lightweight operation, the process can be done on most portable devices with limited computing capability.
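The overlay itself amounts to pasting small feature patches back onto the base face at known positions. A minimal sketch of such a terminal-side step is given below; the function names, the rectangular patch placement and the use of NumPy are our assumptions, not the paper's implementation.

```python
import numpy as np


def overlay_patch(base: np.ndarray, patch: np.ndarray, top: int, left: int) -> np.ndarray:
    """Paste a feature patch (e.g. a mouth state) onto a copy of the base face icon."""
    icon = base.copy()
    h, w = patch.shape[:2]
    icon[top:top + h, left:left + w] = patch
    return icon


def synthesize_icon(base: np.ndarray, selected_states: dict) -> np.ndarray:
    """Overlay the selected state patch of every expressional feature onto the base face.

    selected_states maps a feature name to (patch, (top, left)) -- the standard-state
    image and its placement, both assumed to come from the facial icon profile.
    """
    icon = base
    for patch, (top, left) in selected_states.values():
        icon = overlay_patch(icon, patch, top, left)
    return icon
```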
Fig. 3. The process diagram of the icon synthesis.
Fig. 4. The seven basic expressions in the JAFFE database.
4. Expression decomposition

Six basic facial expressions are widely recognized by psychologists: happiness, sadness, surprise, anger, disgust and fear [14]. For convenience, we consider the neutral expression as a seventh. To investigate the composition of these typical expressions, a facial image collection, the Japanese Female Facial Expressions (JAFFE) database [15], is used as our training data set; it contains 210 images of 10 Japanese female subjects. Every image in the database is labeled with one major expression according to quantitative evaluations. Fig. 4 shows the seven basic expressions of two actresses in the database.

Two facts emerge from observing the different expression images in the JAFFE database:

1. If the face is divided into several independent areas, such as the eye area, the nose area and so on, some of them clearly contribute to expressions much more than others. For example, across different expressions the nose and ear areas usually remain unchanged, while the eye and mouth areas vary a lot. In other words, some facial parts are much more expressive than others.
2. The expressive areas may appear similar in different expressions. As shown in Fig. 4, the eyes in the sadness and disgust expressions are almost the same; the mouths in the anger and sadness expressions are also similar. In other words, different expressions may
share the same appearance of some facial parts; consequently, those parts can be reused across expressions.

Based on these two facts, we decompose a face alive icon (FAI) into two dimensions:

FAI := FF + EF,     (1)
where FF, the facial features, are the parts of the FAI that do not change across expressions, such as the hair, ears and nose; and EF, the expressional features, are the parts of the FAI that change across expressions, such as the eyes and mouth.

It is worth noting that a real human face is an extremely complex geometric form, and almost every part of it is active; for example, the human face models used in Pixar's Toy Story each have several thousand control points [16]. In our approach, however, we try to capture the most expressive parts of the face and distinguish them from the relatively inactive parts. In practice, this classification depends on the quality of the expression images required by users. To produce high-quality, photo-realistic expressions, one can use as many expressional features as possible. For the face alive icons in the "LLL" environment, on the contrary, the basic requirement is to deliver expressions as well as identity. For this reason, only the most active facial parts are taken as the expressional features, namely the mouth and the eyes:

EF := <eyes> + <mouth>.     (2)
For some expressions, such as winking, the two eyes may have different appearances, so the eyes are further decomposed into the left eye and the right eye:

<eyes> := <left eye> + <right eye>.     (3)
To summarize (1), (2) and (3), a grammar for the composition rules of an FAI is constructed:

FAI := FF + <left eye> + <right eye> + <mouth>.     (4)
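To illustrate grammar (4), the sketch below represents an FAI symbolically as the base facial features plus one named state per expressional feature. The enum values anticipate the standard states derived in Section 5, and the class and field names are our own assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class EyeState(Enum):
    NORMAL = "normal"
    UP = "up"
    WIDE_OPEN = "wide-open"
    DOWN = "down"


class MouthState(Enum):
    NORMAL = "normal"
    DOWN_CLOSE = "down-close"
    UP_OPEN = "up-open"
    DOWN_OPEN = "down-open"


@dataclass
class FaceAliveIcon:
    """FAI := FF + <left eye> + <right eye> + <mouth>, in symbolic form."""
    facial_features: str   # identifier of the expression-unrelated base face (FF)
    left_eye: EyeState
    right_eye: EyeState
    mouth: MouthState


# Example: the 'happiness' composition used later in Section 6 (rule (13))
happy = FaceAliveIcon("user_base_face", EyeState.UP, EyeState.UP, MouthState.UP_OPEN)
```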
5. The discrete model of expressional features

To reduce the complexity of FAI synthesis, we decompose the icon into expressional features and facial features. In general, however, these features are still very complex, because they can take thousands of subtle appearances even for a single person. Here we propose an approach to build a discrete model through PCA on the training data: the distribution of the expressional features is investigated, and a reasonable number of standard states is defined.

Every expressional feature, i.e. an eye or mouth area, is described by landmark points. As shown in Fig. 5, 18 landmark points are defined to model the eye area. The landmark points are located based on reference points input by users. The reference points are key landmark points such as the eye sockets and lip corners; the rest of the landmark points can be detected automatically using feature templates [17]. Our discrete model is built similarly to the flexible model of Lanitis et al. [18], which has achieved great success in many face-modeling applications [19]. Both models use PCA as the main statistical method to capture the main variance of the data; however, besides providing model parameters, our model also "normalizes" the features into appropriate standard states. What follows is a detailed description of our model.
Fig. 5. The eye area with 18 landmark points.
Assume there are n sample data items in the training set, with m variables describing the landmark points of each item. The ith data item X_i (i = 1, 2, ..., n) is

X_i = (x_{i,1}, x_{i,2}, ..., x_{i,m}),     (5)
where x_{i,k} is the kth variable of the ith data item; it can be either a coordinate or a grayscale value of a landmark point. In our method we actually use the coordinates. Applying PCA to the training data, the data item X_i can be approximated as

X_i = X̄ + P b,     (6)
in which X̄ is the average of the training examples, P = (eig_1, eig_2, ..., eig_s), where eig_i (i = 1, 2, ..., s; s <= m) is a unit eigenvector of the covariance matrix of the deviations, and b is a vector of eigenvector weights referred to as the model parameters. By modifying b, new instances of the model can be generated. Solving (6) for b gives

b = P^T (X_i - X̄),     (7)

where P^T P = I. Usually the number of eigenvectors needed to describe most of the variability within the training set is much smaller than the original number of landmark variables, i.e. s << m, so through this method a feature can be described by the vector b using far fewer parameters. A distance function D is defined on two features represented by b_1 and b_2:

D(b_1, b_2) = |b_1 - b_2|.     (8)
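A compact way to realize Eqs. (5)-(8) is to stack the landmark coordinate vectors and keep only the leading eigenvectors of their covariance. The sketch below does this with NumPy; the choice of s (the number of retained components) and the variable names are our own assumptions.

```python
import numpy as np


def fit_landmark_pca(X: np.ndarray, s: int):
    """X: n x m matrix of landmark coordinate vectors (one training item per row).

    Returns the mean vector X_bar and the m x s matrix P of unit eigenvectors
    corresponding to the s largest eigenvalues of the covariance of the deviations.
    """
    X_bar = X.mean(axis=0)
    cov = np.cov(X - X_bar, rowvar=False)   # covariance of the deviations
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    P = eigvecs[:, ::-1][:, :s]             # keep the s leading unit eigenvectors
    return X_bar, P


def model_parameters(X_i: np.ndarray, X_bar: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Eq. (7): b = P^T (X_i - X_bar)."""
    return P.T @ (X_i - X_bar)


def distance(b1: np.ndarray, b2: np.ndarray) -> float:
    """Eq. (8): D(b1, b2) = |b1 - b2|, interpreted here as the Euclidean norm."""
    return float(np.linalg.norm(b1 - b2))
```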
Assume there are p expressions e_1, e_2, ..., e_p. The set of expressions E is

E = {e_1, e_2, ..., e_p}.     (9)
Grouping the features by expression, the vector b of a feature for each expression e_i can be represented by its average b̄_{e_i}. So we have a basic form of the discrete model S of a feature over the expression set E:

S = {b̄_{e_1}, b̄_{e_2}, ..., b̄_{e_p}},     (10)

in which b̄_{e_i} is called a standard state of the feature.

As discussed in Section 4, a feature may have similar states in different expressions. The algorithm illustrated in Fig. 6 is therefore proposed to merge similar states in the model S. Assuming that the expressional features are distributed uniformly in terms of the distance defined in (8), a threshold d is simply derived as the average distance among all the features.
Fig. 6. The algorithm to merge items in the discrete model S.
Fig. 7. States of the right eye: (a) b1: normal (b) b2: up (c) b3: wide-open (d) b4: down.
Every pair of states whose distance is below the threshold is merged. Because of these merges, we obtain far fewer standard states for the expressional features. Furthermore, a unique semantic name is given to each standard state according to its appearance. Using these semantic names, synthesis rules can be generated from grammar (4); this process is discussed in detail in the next section.

For example, using the algorithm described in Fig. 6, the standard states of the right eye area are merged from seven down to four. The new landmark coordinates of these standard states are then computed, and the images are warped using thin-plate splines [20] from the closest existing states (not the neutral state). As shown in Fig. 7, semantic names such as normal, up, wide-open and down are assigned to them according to their appearances. Similarly, there are four standard states for the left eye. The mouth area also has four standard states: normal, down-close, up-open and down-open, shown in Fig. 8.
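A minimal sketch of this merging step is given below, following the description above (and Fig. 6): take the average pairwise distance as the threshold, then merge any pair of states closer than that. The greedy merge order and the averaging of merged vectors are our own assumptions about details the figure leaves open.

```python
from itertools import combinations
from typing import Dict

import numpy as np


def merge_standard_states(states: Dict[str, np.ndarray]) -> Dict[str, np.ndarray]:
    """states maps an expression name to the mean parameter vector (b-bar) of one feature.

    The threshold d is the average pairwise distance among all states; any pair of
    states closer than d is merged (here: replaced by the mean of the two vectors).
    """
    states = dict(states)
    dists = {(a, b): float(np.linalg.norm(states[a] - states[b]))
             for a, b in combinations(states, 2)}
    d = sum(dists.values()) / len(dists)   # threshold: average distance

    merged = True
    while merged:
        merged = False
        for a, b in combinations(list(states), 2):
            if np.linalg.norm(states[a] - states[b]) < d:
                states[f"{a}+{b}"] = (states.pop(a) + states.pop(b)) / 2.0
                merged = True
                break                      # restart the scan after each merge
    return states
```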
6. FAI synthesis and encoding

Based on the proposed discrete model, a face alive icon can be synthesized by mapping an expression to a combination of standard states of the expressional features. What follows is a formal description of this synthesis process.
Fig. 8. The states of the mouth: (a) b1: normal (b) b2: down-close (c) b3: up-open (d) b4: down-open.
Table 2
Average distance D between training data and the standard states

              NEU     HAP     SAD     SUR     ANG     DIS     FEA

Left eye
  Normal      0.27    1.94    1.65    1.42    3.09    3.32    2.34
  Up          2.47    0.45    3.24    2.06    2.21    2.90    2.97
  Wide open   3.00    2.89    2.73    0.75    1.05    2.47    1.30
  Down        2.29    3.45    0.96    2.75    4.27    1.23    4.08

Right eye
  Normal      0.33    1.87    1.34    2.36    3.28    3.39    2.35
  Up          2.28    0.53    3.57    2.11    2.24    2.99    3.00
  Wide open   2.77    3.08    2.88    0.89    1.11    2.43    1.23
  Down        2.06    3.25    0.83    3.07    3.77    1.21    3.92

Mouth
  Normal      0.18    2.35    2.27    3.07    4.01    1.50    2.25
  Down-close  2.87    3.45    0.43    2.23    2.25    3.56    1.06
  Up-open     3.94    0.94    3.77    0.54    3.17    2.76    3.86
  Down-open   3.25    1.95    2.06    2.09    1.23    2.47    3.02

NEU: neutral; HAP: happiness; SAD: sadness; SUR: surprise; ANG: anger; DIS: disgust; FEA: fear.
Assume the expressions are decomposed into N expressional features f_1, f_2, ..., f_N. For each feature f_i (i = 1, ..., N) there are M_i standard states, represented as a set of vectors O_i:

O_i = {b_{i,1}, b_{i,2}, ..., b_{i,M_i}}.     (11)

The synthesis rule derived from grammar (4) can then be described by a function R from the set of expressions E in (9) to O_1 x O_2 x ... x O_N:

∀e ∈ E,  R(e) = (b_{1,k_1}, b_{2,k_2}, ..., b_{N,k_N}),     (12)

in which b_{i,k_i} ∈ O_i. Given a set of training images for an expression e, the synthesis rule R(e) can be determined by statistical analysis of the distance function D in (8) between the data items and the standard states. Table 2 lists the average distances between the training data and the standard states. For each expression, the standard state with the minimum average distance is taken for the synthesis rule, following grammar (4).
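In code, deriving the rules amounts to taking, for each expression and each feature, the state with the smallest average distance in Table 2. A minimal sketch follows, with the distance table hard-coded from Table 2 for the mouth only (the other features work the same way):

```python
import numpy as np

EXPRESSIONS = ["NEU", "HAP", "SAD", "SUR", "ANG", "DIS", "FEA"]
MOUTH_STATES = ["normal", "down-close", "up-open", "down-open"]

# Average distances D between training data and the mouth standard states (Table 2);
# rows follow MOUTH_STATES, columns follow EXPRESSIONS.
MOUTH_DISTANCES = np.array([
    [0.18, 2.35, 2.27, 3.07, 4.01, 1.50, 2.25],   # normal
    [2.87, 3.45, 0.43, 2.23, 2.25, 3.56, 1.06],   # down-close
    [3.94, 0.94, 3.77, 0.54, 3.17, 2.76, 3.86],   # up-open
    [3.25, 1.95, 2.06, 2.09, 1.23, 2.47, 3.02],   # down-open
])


def derive_rules(distances: np.ndarray, states: list) -> dict:
    """For every expression, pick the state with the minimum average distance (Eq. (12))."""
    return {expr: states[int(np.argmin(distances[:, j]))]
            for j, expr in enumerate(EXPRESSIONS)}


print(derive_rules(MOUTH_DISTANCES, MOUTH_STATES))
# -> {'NEU': 'normal', 'HAP': 'up-open', 'SAD': 'down-close', 'SUR': 'up-open',
#     'ANG': 'down-open', 'DIS': 'normal', 'FEA': 'down-close'}
```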
Table 3
The synthesis rules

       Left eye     Right eye    Mouth
NEU    Normal       Normal       Normal
HAP    Up           Up           Up-open
SAD    Down         Down         Down-close
SUR    Wide-open    Wide-open    Up-open
ANG    Wide-open    Wide-open    Down-open
DIS    Down         Down         Normal
FEA    Wide-open    Wide-open    Down-close
For example:

HAP := FF + [Left Eye: up] + [Right Eye: up] + [Mouth: up-open].     (13)
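As described at the end of this section, each feature with M_i standard states needs only ⌈log2 M_i⌉ bits, so a rule such as (13) can be shipped as a 6-bit code. A minimal sketch of such an encoder/decoder follows; the particular state orderings (and therefore the bit patterns) are our own assumptions.

```python
from math import ceil, log2

# Assumed orderings of the standard states; the paper fixes the states but not the codes.
FEATURES = {
    "left_eye": ["normal", "up", "wide-open", "down"],
    "right_eye": ["normal", "up", "wide-open", "down"],
    "mouth": ["normal", "down-close", "up-open", "down-open"],
}


def encode(rule: dict) -> int:
    """Pack one state index per feature into a single integer (2 bits per feature here)."""
    code = 0
    for name, states in FEATURES.items():
        bits = ceil(log2(len(states)))
        code = (code << bits) | states.index(rule[name])
    return code


def decode(code: int) -> dict:
    """Inverse of encode(): recover the state name of every feature."""
    rule = {}
    for name, states in reversed(list(FEATURES.items())):
        bits = ceil(log2(len(states)))
        rule[name] = states[code & ((1 << bits) - 1)]
        code >>= bits
    return rule


happiness = {"left_eye": "up", "right_eye": "up", "mouth": "up-open"}   # rule (13)
assert decode(encode(happiness)) == happiness                          # 6 bits in total
```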
The rules for the seven basic expressions are listed in Table 3. It is worth noting that synthesis rules can be obtained either from statistical analysis, as in Table 2, or by intuitive observation of sample expression images. For example, the following non-standard expressions were created by intuition:

WINKING := FF + [Left Eye: down] + [Right Eye: wide-open] + [Mouth: up-open],

SMILE := FF + [Left Eye: normal] + [Right Eye: normal] + [Mouth: up-open].

As discussed in (12), for N expressional features f_1, f_2, ..., f_N, with M_i (i = 1, 2, ..., N) standard states for feature f_i, encoding f_i requires ⌈log2 M_i⌉ bits. Therefore, encoding an expression with the synthesis rules requires Σ_i ⌈log2 M_i⌉ bits in total. In our approach, a 2-bit code (00, 01, 10, 11) is needed for the states of each eye and the mouth, so each expression takes 6 bits in total; technically, the system can support 2^6 = 64 potential expressions.

7. Experimental system and result analysis

An experimental prototype system has been implemented to verify the usability of the proposed methodology. Fig. 9 shows some preliminary results for three persons who have no similar images in the training set. Although the results are not fully optimized, since we use only simple image-warping techniques [20] to generate the expressional features, the method proves able to produce acceptable facial icons for different expressions.

In the traditional expression synthesis methodologies [4–8], facial expressions are treated as independent, and each of them is stored and transmitted as an image file. Our approach, by contrast, exploits the inherent relations among expressions and reuses expressional features across them; as a result, more expressions can be generated using less storage and network resources. Fig. 10 compares the sizes of the files that must be transmitted and stored to display K (K = 5, 7, 9, ...) expressions in our approach and in the traditional approaches. For a communication session involving 20 users, only about 2.5 MB of FIPs needs to be distributed in our approach, which potentially supports up to 64 expressions. With the other approaches, at least 20 MB of image files is needed to support 32 expressions.
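As a rough check of these figures (our own arithmetic, derived only from the numbers quoted above):

```python
users = 20
fip_total_mb = 2.5       # total FIP traffic reported for 20 users
image_total_mb = 20.0    # total image traffic reported for 20 users, 32 expressions
expressions = 32

per_user_fip_kb = fip_total_mb * 1024 / users        # ~128 KB per user profile
per_user_image_kb = image_total_mb * 1024 / users    # ~1024 KB per user
per_image_kb = per_user_image_kb / expressions       # ~32 KB per expression image

print(per_user_fip_kb, per_user_image_kb, per_image_kb)   # 128.0 1024.0 32.0
```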
Fig. 9. Results of the facial expression synthesis.
Fig. 10. Size of user expression profile vs. number of supported expressions.
An experiment was conducted to quantitatively evaluate the results of our method. A number of FAIs (68 per expression, 408 in total) were generated using the proposed techniques. Eight evaluators participated, manually labeling a large subset of these icons with the perceived expressions. The labeling accuracy for each expression is shown in Fig. 11; the majority of the evaluations gave the correct answers.

8. Discussion and future research

In this paper, we propose an approach to synthesize facial expression images in the "LLL" environment. Unlike other methodologies, facial images are decomposed into expressional features, and common features are reused across expressions using a discrete model built through statistical analysis on a training data set. The approach has been verified by an experimental prototype system and proves able to generate acceptable results using much less network and storage resources.

It is worth noting that the proposed method is not limited to small facial icons; it could also be used for large, high-quality expression images. As mentioned in Sections 3
Fig. 11. The accuracy of the manual labeling.
and 4, the quality of the output images can be adjusted through the number of expressional features N, the number of landmark points L and the number of standard states M. By increasing these numbers, the quality of the output images can be raised; at the same time, however, the size of the facial icon profile (FIP) also increases. When N, L and M are large enough, the FIP becomes bigger than the total of the separate expression images, and the approach degenerates to the traditional ones. Finding the optimal trade-off between output quality and FIP size is an important and interesting problem, and it will be a subject of our future research.

It is also important to point out that the proposed methodology can be extended to new expressions. Just as a human being can create a new expression by imitating other faces, our system must be fed training data to produce new expressions. Based on the training samples, a new expression can be decomposed in the same way, the discrete model can be updated, and a new synthesis rule can be derived. As a result, the system described here is an open system that can accept new expressions. Further experiments along this line are another subject of our future research.

Acknowledgements

Our project [3] is supported in part by the Industrial Technology Research Institute (ITRI) and the Institute for Information Industry (III) of Taiwan. We would like to thank Dr. Lyons [15], who kindly provided the JAFFE database, Jui-Hsin Huang for his work on the implementation of the experimental prototype system, and Dr. Carl Kuo for his valuable comments. In particular, we thank the graduate students of the CS2310 class (Spring Term, 2006) in the Department of Computer Science, University of Pittsburgh, for serving as evaluators in our experiments.

References

[1] Yahoo Messenger, <http://messenger.yahoo.com/>.
[2] MSN Messenger, <http://messenger.msn.com>.
[3] S.K. Chang, A chronobot for time and knowledge exchange, in: Tenth International Conference on Distributed Multimedia Systems, San Francisco Bay, CA, September 2004, pp. 3–6.
[4] Z. Liu, Y. Shan, Z. Zhang, Expressive expression mapping with ratio images, in: SIGGRAPH '01: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001, pp. 271–276.
[5] J.Y. Noh, U. Neumann, Expression cloning, in: SIGGRAPH '01: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001, pp. 277–288.
[6] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, D.H. Salesin, Synthesizing realistic facial expressions from photographs, in: SIGGRAPH '98: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, 1998, pp. 75–84.
[7] S.M. Seitz, C.R. Dyer, View morphing, in: SIGGRAPH '96: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, 1996, pp. 21–30.
[8] H. Wang, N. Ahuja, Facial expression decomposition, in: Proceedings of the Ninth IEEE International Conference on Computer Vision, October 2003, pp. 958–965.
[9] X. Li, C.-C. Chang, S.K. Chang, Face alive icons, in: Proceedings of the Seventeenth International Conference on Software Engineering and Knowledge Engineering, Taipei, Taiwan, July 2005.
[10] L. Gralewski, N. Campbell, B. Thomas, C. Dalton, D. Gibson, Statistical synthesis of facial expressions for the portrayal of emotion, in: Proceedings of the Second International Conference on Computer Graphics and Interactive Techniques in Australasia and South East Asia, June 2004.
[11] MPEG, <http://www.chiariglione.org/mpeg/>.
[12] G.A. Abrantes, F. Pereira, MPEG-4 facial animation technology: survey, implementation, and results, IEEE Transactions on Circuits and Systems for Video Technology 9 (2) (1999) 290–305.
[13] J. Ostermann, Animation of synthetic faces in MPEG-4, in: Computer Animation, Philadelphia, PA, June 1998, pp. 49–55.
[14] P. Ekman, Facial expressions of emotion: an old controversy and new findings, Philosophical Transactions of the Royal Society B 335 (1992) 63–69.
[15] M.J. Lyons, S. Akamatsu, M. Kamachi, J. Gyoba, Coding facial expressions with Gabor wavelets, in: Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 200–205.
[16] E. Ostby, Pixar Animation Studios, personal communication, January 1997.
[17] R. Brunelli, T. Poggio, Face recognition: features versus templates, IEEE Transactions on Pattern Analysis and Machine Intelligence 15 (10) (1993) 1042–1052.
[18] A. Lanitis, C.J. Taylor, T.F. Cootes, Automatic interpretation and coding of face images using flexible models, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (7) (1997) 743–756.
[19] Y. Du, X. Lin, Mapping emotional status to facial expressions, in: Proceedings of the 16th International Conference on Pattern Recognition, August 2002, pp. 524–527.
[20] F.L. Bookstein, Principal warps: thin-plate splines and the decomposition of deformations, IEEE Transactions on Pattern Analysis and Machine Intelligence 11 (6) (1989) 567–585.