Data-driven facial animation via semi-supervised local patch alignment

Pattern Recognition 57 (2016) 1–20

Jian Zhang a, Jun Yu b,⁎, Jane You c, Dapeng Tao d, Na Li a, Jun Cheng e

a School of Science and Technology, Zhejiang International Studies University, Hangzhou 310012, China
b Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
c Department of Computing, The Hong Kong Polytechnic University, Hong Kong, China
d School of Information Science and Engineering, Yunnan University, Kunming 650091, China
e Key Laboratory of Human-Machine Intelligence-Synergy Systems, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

Article history: Received 17 August 2015; received in revised form 23 February 2016; accepted 25 February 2016; available online 11 March 2016.

Abstract

This paper reports a novel data-driven facial animation technique which drives a neutral source face to obtain an expressive target face using a semi-supervised local patch alignment framework. We define the local patch and assume that there exists a linear transformation between a patch of the target face and the intrinsic embedding of the corresponding patch of the source face. Based on this assumption, we compute the intrinsic embeddings of the source patches and align these embeddings to form the result. During the course of alignment, we use a set of motion data as a shape regularizer to impel the result to approach the unknown target face. The intrinsic embedding can be computed through both locally linear embedding and local tangent space alignment. Experimental results indicate that the proposed framework obtains decent face driving results, and quantitative and qualitative evaluations demonstrate its superiority to existing methods.

© 2016 Elsevier Ltd. All rights reserved.

Keywords: Facial animation; Manifold; Local patch; Linear transformation; Global alignment

☆ This research is supported by the National Natural Science Foundation of China (No. 61303143, No. 61472110, No. 61572486, No. 6140051238), the Zhejiang Provincial Natural Science Foundation of China (No. LQ14F020003, No. LR15F020002), the Scientific Research Fund of Zhejiang Provincial Education Department (No. Y201326609), the Guangdong Natural Science Funds (No. 2014A030310252), the Shenzhen Technology Project (No. JCYJ20140901003939001, No. JCYJ20140417113430736, No. JSGG20140703092631382, No. JSGG20141015153303491) and the State Scholarship Fund of China (No. 201508330643).
⁎ Corresponding author. E-mail address: [email protected] (J. Yu).
http://dx.doi.org/10.1016/j.patcog.2016.02.021 © 2016 Elsevier Ltd. All rights reserved.

1. Introduction

Recent years have witnessed a surge of digital films and digital games in which CG techniques bring striking visual experiences to ordinary people. One of the most representative CG techniques is computer facial animation, which has created impressive virtual characters in films and attracted attention from both engineers and academic researchers. Generally, facial animation covers a wide variety of techniques, but this paper focuses on data-driven facial animation, that is, driving a 3D face model to demonstrate a certain facial expression or motion through synthesized or captured motion data. At any time instant, the motion data are usually a number of facial feature points that depict the appearance of the facial expression or motion at that time. For convenience, in this paper the face to be driven is called the source face, and the face with a certain expression is called the target face.

Given a set of motion data, there are various ways to generate facial expressions. Frequently used methods include, but are not limited to, shape interpolation [1–4], face muscle models [5,6], the facial action coding system (FACS) [7], the MPEG-4 face model [8] and scattered data interpolation [9,10]. Each method has its inherent weakness. The blend shape method needs a group of key shapes of the target face; building them is quite labor intensive, and the facial appearance of the target face is restricted by the key shapes. A face muscle model associates the facial motion with a set of muscle vectors which control the deformation of the face model; the positions of the muscle vectors as well as the muscle spring constants are difficult to determine. FACS and MPEG-4 face models use motion data to adjust some predefined face model parameters to generate expressions, but mapping the motion data to the model parameters is very complicated for high-resolution face models. These methods lack flexibility and adaptability. Perhaps the most flexible and adaptive way is to drive the source face directly using the motion data. For this reason, scattered data interpolation has been widely used in facial animation. But


traditionally, this kind of method rests upon the Euclidean distance metric between a set of central points and all the model vertices, which sometimes degrades the driving result. Some topology preserving methods, such as the Laplacian method [11], avoid the problem by preserving the local face structure at each vertex during face driving. Though the result of the Laplacian method is influenced by the independent transformations imposed on each model vertex, it inspires us to regard face driving as a topology preserving learning problem.

As the most typical kind of topology preserving method, manifold learning has been studied for decades. Manifold learning algorithms were originally designed for dimensionality reduction of high dimensional data. Due to their topology preserving characteristics, manifold learning methods have been widely used as preprocessing for various problems such as pattern recognition, data visualization, and image understanding. In other scenarios, manifold learning seldom finds usage, which might be caused by the deep-seated assumption that manifold learning is equivalent to dimensionality reduction. Some researchers are working to change this situation by using manifold methods in new applications such as face hallucination [12] and expression synthesis [13]. However, to the best of our knowledge, manifold methods have not been applied to driving human face models.

Generally speaking, a 3D face model is defined by a set of 3D vertices and the interconnections between the vertices. The interconnections manifest the topology characteristics of the face model. Hence, it is reasonable to assume that the face model is in fact a set of 3-dimensional data points embedded in a low dimensional manifold, and we can try to tackle the face driving problem from a manifold perspective. Meanwhile, data-driven facial animation is a semi-supervised learning problem, where we need to derive the unknown target face based on the source face and a number of known motion data. These motion data can be seen as labels of some designated vertices.

Motivated by these concerns, in this paper we first clarify the internal relations between manifold learning and face driving, and then propose a semi-supervised manifold framework for data-driven facial animation. This framework is built on the manifold assumption that each local neighborhood, or local patch, of the whole data set presents linear characteristics and that the learning result can be obtained by aligning the embeddings of the local patches; it is therefore called semi-supervised local patch alignment. Specifically, we divide the source face into overlapping local patches according to the topology structure of the face, and compute the intrinsic embedding of each patch based on a certain manifold constraint. Then we impose a local linear transformation on the embedding of each patch to form the global result. During this course, we assign new 3D coordinates to a small number of facial vertices selected from the source face, and construct a shape constraint using these new coordinates for face driving. The new coordinates reflect the visual appearance of the target face, so the learning result can be very close to the 3D vertices of the target face in Euclidean space. In other words, the learning result of the proposed framework is an approximation of the unknown target face.
In this paper, we implement this framework based on two representative manifold learning methods, i.e., locally linear embedding (LLE) [14] and local tangent space alignment (LTSA) [15]. It is worthwhile to highlight several aspects of the proposed framework as follows: (1) To the best of our knowledge, it is the first attempt to achieve facial animation through a manifold-based method, and it provides solid evidence of the feasibility of applying manifold methods to face driving. (2) The framework is open: at least two manifold algorithms, LLE and LTSA, can be put into the framework for face driving.

Besides, the local patch is defined based on the geometry of the face model to capture the local topology, so the framework is adaptive to highly nonlinear face surfaces. (3) This is a semi-supervised learning framework in which a number of facial vertices are selected and assigned new coordinates as labeled data. Thus the problem can be converted into a simple least squares problem instead of an eigenvalue problem, which impels the learning result to approach the unknown target face.

The rest of this paper is organized as follows: in Section 2, we review the related works; in Section 3, we describe the proposed framework in detail; Section 4 shows the experimental results, and the conclusion is presented in Section 5.

2. Related works

2.1. Data-driven facial animation

Data-driven facial animation can be realized by a variety of methods. Blend shape methods generate the target face by linearly combining multiple key shapes of the target face according to the given motion data. These key shapes usually represent several typical facial motions or expressions. The combination coefficients can be obtained by directly projecting the motion data onto the linear space spanned by the corresponding key shapes of the source face [16,4], or learned from the motion data by radial basis function regression [1] or by solving a Poisson problem [3]. Blend shape methods depend too much on the key shapes, which are built manually by animators. This is very labor intensive work, and blend shape methods are not good at presenting expression details.

Expression coding techniques allow animators to create facial expressions according to some predefined parameters, where each parameter controls the motion of a specific local facial region. The most popular expression coding systems are the facial action coding system (FACS) [7] and the MPEG-4 facial animation models [8]. The weakness of expression coding techniques is that they deal only with obvious facial changes and ignore the subtle changes that might be difficult to distinguish. Moreover, for complex models with high resolution, it is quite difficult to correspond the motion data to appropriate parameters.

Given motion data, another way to generate facial expressions is to use a muscle vector model [5,6]. The muscle vectors deform a face model through simulated muscle motions, with each vector exerting influence in a particular direction and with a particular magnitude. However, the displacements of the motion data must first be transferred to the muscle vectors, and placing muscle vectors in anatomically correct positions can be a daunting task. The process involves manual trial and error with no guarantee of efficient or optimal placement, and incorrect placement results in undesirable animation of the face.

Perhaps the most appropriate way to generate facial motion is to drive the face model through the motion data directly, since both the global motion and the local deformation of the face can be recorded in this way if a good driving algorithm is used, and no tedious manual work is needed. Scattered data interpolation has been used by some representative methods for face driving [9,10]. Pighin et al. developed a system to generate impressive facial animations, deforming the face using a set of known facial feature points through scattered data interpolation [10]. But the interpolation rested upon the Euclidean distance metric between the feature points and the other vertices on the face model, which may lead to unnatural facial deformation in some local facial regions with complex surface and geometry. In [17], the authors


improved the interpolation by substituting the geodesic distance between two vertices for the Euclidean distance between them, but the computation of the geodesic distance was too time consuming. The Laplacian method [11] proposed several years ago casts light on the face driving problem. It changes a source model into a target model in a semi-supervised way, using a set of known motion data or feature points as labels, with the constraint that the target model shares a similar local topology structure at each vertex with the source model. The constraint does not consider the rotations that exist in the deformation of local structures, and this may cause unwanted deformation results. To this end, Sorkine [11] managed to solve for a rigid transformation at each vertex to encode the rotation for correct deformation. But the transformation was imposed independently on each vertex, rather than on each local structure, which may bring about extra potential deformation problems. Despite this weakness, the Laplacian method has found wide usage in shape editing [11], facial animation [17], and even image warping [18]. This is a clue that motivates us to consider the face driving problem as a topology preserving learning problem.

2.2. Topology preserving learning

The most representative group of topology preserving learning methods is manifold learning, which has been studied for decades with the aim of dimensionality reduction. Various manifold learning algorithms have been proposed, such as ISOMAP [19], locally linear embedding (LLE) [14], Laplacian eigenmap (LE) [20], locality preserving projections (LPP) [21] and local tangent space alignment (LTSA) [15]. Despite their distinct objective functions, they share the similarity of preserving some topological characteristic of the high dimensional data set when calculating the low dimensional embedding. Due to this topology preserving ability, manifold learning has been widely used as preprocessing for pattern classification. In recent years, manifold learning has been combined with other loss functions to develop new algorithms [22,23], enhanced to supervised versions [24,25], and summarized as several unified manifold frameworks [26,27]. However, these improvements still confine manifold learning to dimensionality reduction.

In recent years, a trend has emerged of using manifold learning to solve new problems other than dimensionality reduction. Zhang et al. achieved facial expression synthesis [13] and face hallucination [12], respectively, by modeling the local geometry of the sample data through manifold learning. In [28], the authors proposed an example based approach to inferring color from image texture, where the nonlinear structure of the sample data distribution was learnt by a manifold method. Ref. [29] presented a manifold-based stroke correspondence construction approach, which generated in-between frames from a set of key frames for animation. In addition, semi-supervised manifold learning has received considerable attention due to its adaptability to problems that do not require dimensionality reduction but do require preserving the topology of the data set while utilizing prior information. In [30], Zhang et al. proposed a semi-supervised method for video object tracking. They assumed that the known video frames and the unknown object positions in the frames have similar manifold structures, such that they could estimate the object positions from the video frames based on a small number of labeled positions.
Some topology preserving learning methods, like the Laplacian methods [11,18], are also semi-supervised methods, but the topology is not preserved in a manifold way. According to the work of [11], the local topology at each data point is represented by the local linear reconstruction error, so the shape of the local structure cannot be well preserved during the learning process. Therefore the topology preserving capability of the Laplacian method is weaker than that of manifold learning methods. Despite this weakness, the Laplacian method can be successfully used in 3D model


deformation, and this enlightens us to study the face driving problem in a similar semi-supervised way, but with a manifold constraint.

3. Face driving through semi-supervised local patch alignment

3.1. The basic concern

The basic concern of our method is to capture the topology of the source face through manifold learning and convey the topology to the target face. An apparent fact is that the target face and the source face have very similar geometric characteristics in most regions of the face. To ensure the geometric similarities, the target face usually shares the same topology as the source face, except that some vertices of the target face have different coordinates from those of the source face. This indicates that the topology of the source face should be preserved during the course of face driving. Manifold methods are well known for their topology preserving ability, which conveys the topology of the original data set to its intrinsic embedding. This means that the intrinsic embeddings of the source face and the target face share very similar topology structure. To verify this point, we project two source faces and the corresponding target faces, with two distinct expressions for each source face, into manifold space. These faces and their intrinsic embeddings are shown in Fig. 1, which indicates that the source face and the target face have very similar intrinsic embeddings.

Thus, we can assume that there exists some quantitative correlation between the intrinsic embeddings of the source face and those of the target face. Owing to the complexity of the data geometry, this quantitative correlation must be highly nonlinear. For the sake of simplicity, it is reasonable to approximate the nonlinear correlation using a group of local linear transformations. To this end, for each vertex, we define a local patch structure [26] that is similar to the neighborhood structure of manifold learning algorithms, and the local linear transformations are imposed on these local patches, in which the correlations between vertices can be assumed to be linear. Without loss of generality, we can define a local patch at a vertex $x_i$ as $X_i = [x_i, x_{i1}, x_{i2}, \dots, x_{ik_i}]$, which is composed of $x_i$ and its $k_i$ neighbors $X_{inb} = [x_{i1}, x_{i2}, \dots, x_{ik_i}]$. The neighbors are vertices connected to $x_i$ by edges defined during the course of face model construction. Fig. 2 shows a local patch where the central green vertex is $x_i$ and the surrounding bluish vertices are its neighbors.

Suppose $X_i^s$ is the local patch at vertex $x_i^s$ of the source face and $X_i^t$ is the corresponding local patch of the target face. Let $\hat{X}_i^s$ and $\hat{X}_i^t$ be the intrinsic embeddings of $X_i^s$ and $X_i^t$. We assume that there exists a local linear transformation $D_i$ between $\hat{X}_i^s$ and $\hat{X}_i^t$ such that $\hat{X}_i^t = D_i \hat{X}_i^s$. Then we wish to convey the topology of the source face to the target face based on the local linear transformations imposed on the patches. The objective of the local linear transformation is closely related to some locality preserving manifold learning algorithms. If the intrinsic embeddings are computed by LLE, we have

$x_i^s = X_{inb}^s w_i,$  (1)

$\hat{x}_i^s = \hat{X}_{inb}^s w_i$  (2)

where $w_i$ are the reconstruction coefficients [14]. By (1), we obtain

$w_i = (X_{inb}^{sT} X_{inb}^s)^{-1} X_{inb}^{sT} x_i^s.$  (3)

Substituting (3) into (2), we have


Fig. 1. Source faces of two different persons (face1, face2), their respective target faces with two different expressions (target1, target2) and the corresponding intrinsic embeddings.

Fig. 2. The local patch at each vertex. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

$\hat{x}_i^s = \hat{X}_{inb}^s w_i = \hat{X}_{inb}^s (X_{inb}^{sT} X_{inb}^s)^{-1} X_{inb}^{sT} x_i^s = E_i x_i^s$  (4)

Fig. 3. Local patch alignment for face driving.

where $E_i = \hat{X}_{inb}^s (X_{inb}^{sT} X_{inb}^s)^{-1} X_{inb}^{sT}$. Similarly, we can obtain

$\hat{x}_i^t = G_i x_i^t$  (5)

where $G_i = \hat{X}_{inb}^t (X_{inb}^{tT} X_{inb}^t)^{-1} X_{inb}^{tT}$. The assumption $\hat{X}_i^t = D_i \hat{X}_i^s$ implies

$\hat{x}_i^t = D_i \hat{x}_i^s.$  (6)

Substituting (5) into (6), we obtain $G_i x_i^t = D_i \hat{x}_i^s$, which can be rewritten as $x_i^t = T_i \hat{x}_i^s$ where $T_i = G_i^{\dagger} D_i$ and $G_i^{\dagger}$ is the Moore–Penrose generalized inverse of $G_i$. Because a local patch of a face model roughly forms a plane, we believe its intrinsic embedding is a linear structure and the local transformation $T_i$ is also suitable for the other vertices $\{\hat{x}_{ij}^s \mid j = 1, \dots, k_i\}$ in the same patch as $\hat{x}_i^s$. Therefore we have

$X_i^t = T_i \hat{X}_i^s.$  (7)
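Before turning to LTSA, the following small numerical sketch (our illustration, not code from the paper) builds the matrix $E_i$ of Eq. (4) for a toy LLE patch. The 3 × 6 neighbor matrix, the 2D embedding and the pseudo-inverse guarding against a singular Gram matrix are all our assumptions:

```python
import numpy as np

# Sketch: construct E_i of Eq. (4), which maps the 3D centre vertex of a
# patch to its intrinsic-embedding coordinates under the LLE assumption.
def lle_patch_map(X_nb, X_nb_hat):
    """X_nb: 3 x k neighbours of vertex x_i; X_nb_hat: d x k embedded
    neighbours. Returns E_i (d x 3)."""
    G = X_nb.T @ X_nb                    # k x k Gram matrix X_nb^T X_nb
    # E_i = X̂_nb (X_nb^T X_nb)^{-1} X_nb^T; pinv handles the singular case k > 3
    return X_nb_hat @ np.linalg.pinv(G) @ X_nb.T

rng = np.random.default_rng(0)
X_nb = rng.standard_normal((3, 6))       # toy patch with k_i = 6 neighbours
X_nb_hat = rng.standard_normal((2, 6))   # toy 2D intrinsic embedding
E_i = lle_patch_map(X_nb, X_nb_hat)
x_i = X_nb @ np.ones(6) / 6              # a vertex reconstructible from neighbours
print(E_i @ x_i)                          # x̂_i = E_i x_i as in Eq. (4)
```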

If the intrinsic embeddings are computed by LTSA, we have

$\hat{x}_i^s \approx Q_i^T x_i^s,$  (8)

$\hat{x}_i^t \approx P_i^T x_i^t$  (9)

where $Q_i$ and $P_i$ are orthogonal basis matrices of the local patches [15]. Similarly, we derive $\hat{x}_i^t = D_i \hat{x}_i^s$ from the assumption $\hat{X}_i^t = D_i \hat{X}_i^s$, and then we have $x_i^t \approx T_i \hat{x}_i^s$ where $T_i = P_i^{T\dagger} D_i$. Different from LLE, $T_i$ here is computed based on the orthogonal basis matrix of the local tangent space, and thus is strictly applicable to the other vertices $\{\hat{x}_{ij}^s \mid j = 1, \dots, k_i\}$ in the same patch as $\hat{x}_i^s$. Therefore we can also obtain (7).

Eq. (7) indicates that a local patch of the target face (local target patch) can be derived by linearly transforming the intrinsic embedding of the corresponding local patch of the source face (local source patch). Like manifold learning, the function of the local linear transformations is to align the intrinsic embedding of a local source patch to a target patch. This technique is consistent with the patch alignment method [26], but is not used for dimensionality reduction. The idea of our local patch alignment technique is illustrated in Fig. 3.

In order to drive the alignment result to approach the unknown target face, we use motion data as the labels of a set of selected


Fig. 4. The framework of the proposed semi-supervised local patch alignment for face driving.


Fig. 5. Semi-supervised local patch alignment with different β using LLE patch.

facial vertices to form a shape constraint. The shape constraint makes the learning result a close approximation of the unknown target face in Euclidean space. In this way, we obtain a semi-supervised local patch alignment method for face driving. The flowchart of its overall framework is depicted in Fig. 4.

3.2. Semi-supervised local patch alignment

In this section, we describe the two steps of semi-supervised local patch alignment, i.e., computing the intrinsic embedding of the local

source patch and aligning the intrinsic embedding to the target patch. Then we introduce motion data as labels to form a semi-supervised problem and present its solution.

3.2.1. Intrinsic embedding computation of the local source patch

For an arbitrary vertex $x_i^s$ chosen from the source face, we construct its local patch as $X_i^s = [x_i^s, x_{i1}^s, x_{i2}^s, \dots, x_{ik_i}^s]$. Let $\hat{X}_i^s = [\hat{x}_i^s, \hat{x}_{i1}^s, \hat{x}_{i2}^s, \dots, \hat{x}_{ik_i}^s]$ be the intrinsic embedding of $X_i^s$. It is known that $\hat{X}_i^s$ can


Fig. 6. Semi-supervised local patch alignment with different β using LTS patch.

be computed through the following framework:

$\min_{\hat{X}_i^s} \operatorname{trace}(\hat{X}_i^s L_i \hat{X}_i^{sT})$  (10)

where $L_i \in \mathbb{R}^{(k_i+1) \times (k_i+1)}$ captures the local topology of this patch and varies with different manifold learning methods.

3.2.2. Semi-supervised alignment to the target patch

We align $\hat{X}_i^s$ to the target patch $X_i^t$ through the local linear transformation $T_i$; the alignment is represented as $T_i \hat{X}_i^s = X_i^t$, which is similar to LTSA [15]. Therefore the optimal $T_i$ can be obtained by

$\min_{c_i, R_i} \| x_i^t - (c_i + R_i \hat{x}_i^s) \|^2 + \sum_{j=1}^{k_i} \| x_{ij}^t - (c_i + R_i \hat{x}_{ij}^s) \|^2 = \| X_i^t - (c_i e^T + R_i \hat{X}_i^s) \|_F^2 = \| X_i^t \Phi_i \|_F^2$  (11)

where $T_i$ is decomposed into $R_i$ and $c_i$, and $\Phi_i$ is the orthogonal projection whose null space is spanned by the columns of $[e, \hat{X}_i^{sT}]$. Accordingly, the projections of $\Phi_i$ onto $[e, \hat{X}_i^{sT}]$ are close to zero.
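The role of $\Phi_i$ can be checked numerically. The sketch below (our illustration, with toy sizes) forms the orthogonal projector whose null space is spanned by the columns of $[e, \hat{X}_i^{sT}]$ and verifies that it annihilates both $e$ and the embedded coordinates:

```python
import numpy as np

# Sketch: the residual projector Phi_i of Eq. (11). B stacks the all-ones
# vector and the transposed embedding; Phi_i projects onto the orthogonal
# complement of span(B), so Phi_i B = 0.
def residual_projector(X_hat):
    """X_hat: d x (k+1) intrinsic embedding of a patch."""
    k1 = X_hat.shape[1]
    B = np.hstack([np.ones((k1, 1)), X_hat.T])   # (k+1) x (d+1)
    P = B @ np.linalg.pinv(B)                    # projector onto span(B)
    return np.eye(k1) - P                        # Phi_i, null space span(B)

rng = np.random.default_rng(1)
X_hat = rng.standard_normal((2, 7))
Phi = residual_projector(X_hat)
print(np.allclose(Phi @ np.ones(7), 0))          # e lies in the null space
print(np.allclose(Phi @ X_hat.T, 0))             # so do the embedded coordinates
```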

Comparing (10) with (11), we immediately find that $\Phi_i$ is equivalent to $L_i$. Note that the coordinates of the vertices in $X_i^t$ are global, which means that $X_i^t$ are vertices selected from the target face $X^t$ with $N$ vertices. Thus we can seek to compute $X^t$ by minimizing the following objective function:

$\min_{X^t} \sum_{i=1}^{N} \| X_i^t \Phi_i \|_F^2 = \sum_{i=1}^{N} \| X_i^t L_i \|_F^2 = \operatorname{trace}(X^t L X^{tT})$  (12)

where

$L = \sum_{i=1}^{N} S_i L_i S_i^T$  (13)

is the alignment matrix with $S_i \in \mathbb{R}^{N \times (k_i+1)}$ the 0–1 selection matrix such that $X_i^t = X^t S_i$. $L$ can be obtained through the iterative procedure

$L(b_i) \leftarrow L(b_i) + L_i, \quad (i = 1, \dots, N)$  (14)

with the initial $L = 0$. $L(b_i)$ is the submatrix of $L$ comprising the rows and columns of $L$ selected via each vertex's patch index set $b_i$.
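A minimal sketch of the accumulation in Eqs. (13) and (14) follows (our illustration, assuming numpy and patch index sets stored as integer arrays):

```python
import numpy as np

# Sketch: build the global alignment matrix L by adding each per-patch
# matrix L_i into the rows/columns given by its index set b_i
# (centre vertex followed by its k_i neighbours), as in Eq. (14).
def assemble_alignment_matrix(patch_indices, patch_matrices, n_vertices):
    L = np.zeros((n_vertices, n_vertices))
    for b, Li in zip(patch_indices, patch_matrices):
        L[np.ix_(b, b)] += Li            # L(b_i) <- L(b_i) + L_i
    return L
```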

With a certain normalization constraint on $X^t$, the problem

$\min_{X^t} \operatorname{trace}(X^t L X^{tT})$  (15)

is indeed a typical eigenvalue problem that can be easily solved through existing methods. In face driving, however, the solution of problem (15) should be close to the unknown target face in 3D Euclidean space. For this purpose, we pick some vertices from the source face and use a set of motion data as the labels of the corresponding target vertices to construct a shape constraint. The shape constraint functions as a penalty term that drives the solution toward the target face. Let $X_l = [x_{l1}, x_{l2}, \dots, x_{lm}]$ be the labels of $m$ vertices $X_l^t = [x_{l1}^t, x_{l2}^t, \dots, x_{lm}^t]$ chosen from the unknown target face $X^t$; the optimization problem can then be represented as

$\min_{X^t} \operatorname{trace}(X^t L X^{tT}) + \beta \| X_l^t - X_l \|_F^2$  (16)

where $\beta$ is a regularization parameter. For simplicity, we rearrange the positions of the face vertices from the very beginning (before intrinsic embedding computation) to reform $X^t$ as $X^t = [X_l^t \; X_u^t]$, where $X_l^t$ are the labeled vertices and $X_u^t$ are the unlabeled ones, such that the first term of (16) can be


Fig. 7. Semi-supervised local patch alignment with different β using LLE patch and LTS patch.


Fig. 8. Driving results of sample 2 in Fig. 7 with different β's using LLE patch.

transformed to

$\operatorname{trace}\left( [X_l^t \; X_u^t] \begin{bmatrix} L_{ll} & L_{lu} \\ L_{lu}^T & L_{uu} \end{bmatrix} \begin{bmatrix} X_l^{tT} \\ X_u^{tT} \end{bmatrix} \right)$  (17)

where $L_{ll}$ is a symmetric matrix of size $m \times m$ corresponding to the labeled vertices, and the symmetric matrix $L_{uu}$ is of size $(N-m) \times (N-m)$, corresponding to the unlabeled ones. Thus problem (16) can be represented as

$\min_{X^t} \operatorname{trace}\left( [X_l^t \; X_u^t] \begin{bmatrix} L_{ll} & L_{lu} \\ L_{lu}^T & L_{uu} \end{bmatrix} \begin{bmatrix} X_l^{tT} \\ X_u^{tT} \end{bmatrix} \right) + \beta \| X_l^t - X_l \|_F^2.$  (18)

The objective function of (18) is quadratic. Under weak assumptions, it can be shown that this function has a symmetric positive definite Hessian matrix; therefore, its minimizer can be computed by solving the following linear system of equations:

$\begin{bmatrix} L_{ll} + \beta I & L_{lu} \\ L_{lu}^T & L_{uu} \end{bmatrix} \begin{bmatrix} X_l^{tT} \\ X_u^{tT} \end{bmatrix} = \begin{bmatrix} \beta X_l^T \\ 0 \end{bmatrix}.$  (19)
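As a concrete sketch (ours, not the authors' code), the system (19) can be solved with a standard dense solver; we assume the vertices have already been rearranged so that the m labeled ones come first, and that faces are stored as 3 × N arrays:

```python
import numpy as np

# Sketch: solve the linear system of Eq. (19) for the target face, given the
# alignment matrix L, the labels X_l (3 x m) and the regularisation weight beta.
def solve_target(L, X_l, beta=10.0):
    m = X_l.shape[1]
    A = L.copy()
    A[:m, :m] += beta * np.eye(m)        # L_ll + beta*I in the top-left block
    rhs = np.zeros((L.shape[0], 3))
    rhs[:m, :] = beta * X_l.T            # right-hand side [beta*X_l^T; 0]
    Xt = np.linalg.solve(A, rhs)         # X^{tT}: stacked vertex coordinates
    return Xt.T                          # 3 x N target face, cf. Eq. (20)
```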

Consequently, we can represent the closed form solution of the face driving problem as

$X^t = [\beta X_l \;\; 0] \begin{bmatrix} L_{ll} + \beta I & L_{lu} \\ L_{lu}^T & L_{uu} \end{bmatrix}^{-1}.$  (20)

3.3. Implementation of the proposed framework

3.3.1. Implementation through LLE

LLE captures the topology of the local patch at a given vertex by reusing the reconstruction coefficients of the vertex from its neighbors. Let $x_i^s$ be a vertex of the source face with local patch $X_i^s = [x_i^s, x_{i1}^s, x_{i2}^s, \dots, x_{ik_i}^s]$; then $x_i^s$ can be linearly reconstructed from $x_{i1}^s, x_{i2}^s, \dots, x_{ik_i}^s$ as

$x_i^s = x_{i1}^s w_{i1} + x_{i2}^s w_{i2} + \dots + x_{ik_i}^s w_{ik_i} + \varepsilon_i$  (21)

where the $w_{ij}$ $(j = 1, \dots, k_i)$ are reconstruction coefficients and $\varepsilon_i$ is the error. LLE seeks to compute the $w_{ij}$ by minimizing the reconstruction error:

$\operatorname*{argmin}_{w_i} \Big\| x_i^s - \sum_{j=1}^{k_i} x_{ij}^s w_{ij} \Big\|^2.$  (22)


Fig. 9. Semi-supervised local patch alignment with randomly selected feature vertices.

Fig. 10. Comparison between fine manual feature vertex selection and random feature vertex selection.


Fig. 11. Comparison between fine manual feature vertex selection and random feature vertex selection.


Fig. 12. The true target face and the results of face driving using LLE patch and LTS patch.

Fig. 13. The side view of the true target face and the results using LLE patch and LTS patch.

With the sum-to-one constraint $\sum_{j=1}^{k_i} w_{ij} = 1$, $w_i$ can be computed as

$w_{ij} = \dfrac{\sum_{t=1}^{k_i} M_{jt}^{-1}}{\sum_{p=1}^{k_i} \sum_{q=1}^{k_i} M_{pq}^{-1}}$  (23)

where $M_{jt} = (x_i^s - x_{ij}^s)^T (x_i^s - x_{it}^s)$ and $M_{jt}^{-1}$ denotes the $(j,t)$ entry of $M^{-1}$.

LLE reuses the coefficients to reconstruct $\hat{x}_i^s$ from $\hat{x}_{i1}^s, \hat{x}_{i2}^s, \dots, \hat{x}_{ik_i}^s$ in the intrinsic subspace. To this end, the intrinsic embedding $\hat{X}_i^s$ of the local patch $X_i^s$ can be obtained by optimizing the following problem:

$\operatorname*{argmin}_{\hat{X}_i^s} \Big\| \hat{x}_i^s - \sum_{j=1}^{k_i} \hat{x}_{ij}^s w_{ij} \Big\|^2 = \operatorname*{argmin}_{\hat{X}_i^s} \operatorname{trace}\Big( \hat{X}_i^s \begin{bmatrix} 1 \\ -w_i \end{bmatrix} \begin{bmatrix} 1 & -w_i^T \end{bmatrix} \hat{X}_i^{sT} \Big) = \operatorname*{argmin}_{\hat{X}_i^s} \operatorname{trace}(\hat{X}_i^s L_i \hat{X}_i^{sT})$  (24)

where

$L_i = \begin{bmatrix} 1 \\ -w_i \end{bmatrix} \begin{bmatrix} 1 & -w_i^T \end{bmatrix} = \begin{bmatrix} 1 & -w_i^T \\ -w_i & w_i w_i^T \end{bmatrix}.$

With $L_i$, (14), (20) and a set of motion data, we can derive the target face with the proposed framework.
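A compact sketch of the LLE patch computation of Eqs. (23) and (24) is given below (our illustration; the 3 × (k_i+1) patch layout and the pseudo-inverse of the locally singular matrix M are our choices):

```python
import numpy as np

# Sketch: the LLE patch matrix L_i of Eq. (24) from the closed-form
# weights of Eq. (23).
def lle_patch_Li(X):
    """X: 3 x (k_i+1) local patch, column 0 being the centre vertex x_i."""
    x, Xnb = X[:, :1], X[:, 1:]
    D = x - Xnb                          # columns x_i - x_ij
    M = D.T @ D                          # M_jt = (x_i - x_ij)^T (x_i - x_it)
    Minv = np.linalg.pinv(M)             # pinv: M is singular when k_i > 3
    w = Minv.sum(axis=1) / Minv.sum()    # Eq. (23); the weights sum to one
    v = np.concatenate(([1.0], -w))      # [1; -w_i]
    return np.outer(v, v)                # L_i = [1; -w_i][1, -w_i^T], Eq. (24)
```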

3.3.2. Implementation through LTSA

To preserve the local topology of a patch $X_i^s$ at a vertex $x_i^s$, LTSA computes the local linear approximation of the vertices in $X_i^s$ using the local tangent space [15]. The objective function for each patch is

$\operatorname*{argmin}_{Q_i, \Theta_i} \| X_i^s H_{k_i+1} - Q_i \Theta_i \|^2$  (25)

where $H_{k_i+1} = I_{k_i+1} - e_{k_i+1} e_{k_i+1}^T/(k_i+1)$ with $e_{k_i+1} = [1, \dots, 1]^T \in \mathbb{R}^{k_i+1}$ and $I_{k_i+1}$ an identity matrix, $Q_i$ is an orthogonal basis matrix of the tangent space, and $\Theta_i = [\theta_i, \theta_{i1}, \theta_{i2}, \dots, \theta_{ik_i}]$ are the local coordinates corresponding to $Q_i$. The optimal $Q_i$ is given by the matrix of the $d$ left singular vectors of $X_i^s H_{k_i+1}$ corresponding to its $d$ largest singular values, and the optimal tangent coordinates $\Theta_i$ are computed as

$\Theta_i = Q_i^T X_i^s H_{k_i+1}.$  (26)


Fig. 14. The RMS error of face driving with different number of extra labels (feature vertices).

LTSA assumes that there exists an affine transformation matrix which transforms the local tangent coordinates $\Theta_i$ to the intrinsic embedding $\hat{X}_i^s$. Therefore we have

$\hat{X}_i^s H_{k_i+1} = A_i \Theta_i + E_i$  (27)

where $A_i$ is the affine transformation matrix and $E_i$ is the error. To preserve as much of the local topology in the intrinsic space as possible, LTSA finds $\hat{X}_i^s$ and $A_i$ by minimizing the error term $E_i$:

$\operatorname*{argmin}_{\hat{X}_i^s, A_i} \| \hat{X}_i^s H_{k_i+1} - A_i \Theta_i \|^2.$  (28)

It can be derived from (28) that the optimal affine transformation matrix has the form $A_i = \hat{X}_i^s H_{k_i+1} \Theta_i^{\dagger}$, so (28) can be rewritten as

$\operatorname*{argmin}_{\hat{X}_i^s} \| \hat{X}_i^s H_{k_i+1} (I_{k_i+1} - \Theta_i^{\dagger} \Theta_i) \|^2.$  (29)

Suppose $U_i$ is the matrix of the $d$ right singular vectors of $X_i^s H_{k_i+1}$ corresponding to its $d$ largest singular values; then (29) can be converted to

$\operatorname*{argmin}_{\hat{X}_i^s} \| \hat{X}_i^s H_{k_i+1} (I_{k_i+1} - U_i U_i^T) \|^2 = \operatorname*{argmin}_{\hat{X}_i^s} \operatorname{trace}\big( \hat{X}_i^s H_{k_i+1} (I_{k_i+1} - U_i U_i^T)(I_{k_i+1} - U_i U_i^T)^T H_{k_i+1} \hat{X}_i^{sT} \big) = \operatorname*{argmin}_{\hat{X}_i^s} \operatorname{trace}(\hat{X}_i^s L_i \hat{X}_i^{sT})$  (30)

where $L_i = H_{k_i+1} (I_{k_i+1} - U_i U_i^T)(I_{k_i+1} - U_i U_i^T)^T H_{k_i+1}$. With $L_i$, (14), (20) and a set of motion data, we can get the target face using the proposed framework.

Fig. 15. The RMS error of the dense face driving with growing number of extra labels (feature vertices).

3.3.3. Algorithm of the proposed framework

The algorithm implementing the semi-supervised local patch alignment for face driving (16) can be summarized as Algorithm 1.

Algorithm 1. The algorithm of implementing the semi-supervised local patch alignment for face driving.

Input: the source face Xs and a set of motion data Xl;
Output: the target face Xt;
1: Extract the local source patches $X_i^s$ according to the model's topology structure defined during the course of model construction;
2: Compute the $L_i$ that encodes the local topology of each patch $X_i^s$ into its intrinsic embedding $\hat{X}_i^s$ by optimizing (10). Note that $L_i$ can be computed based on either the LLE patch or the LTS patch;
3: Set $L = 0$;
4: for all the local patches $X_i^s$ do
5: Update $L$ according to (14);
6: end for
7: Obtain the closed form solution of the target face Xt based on $L$ and Xl through (20).

Fig. 16. The driving result of sparse face with 80 feature vertices.
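Combining the earlier sketches (assemble_alignment_matrix, solve_target and lle_patch_Li), Algorithm 1 can be sketched end-to-end as follows. This is our hedged illustration, not the authors' implementation; lts_patch_Li here builds the LTS patch matrix of Eq. (30):

```python
import numpy as np

# Sketch: per-patch L_i of Eq. (30) via the LTS patch.
def lts_patch_Li(X, d=2):
    """X: 3 x (k_i+1) local patch; d: intrinsic dimension."""
    k1 = X.shape[1]
    H = np.eye(k1) - np.ones((k1, k1)) / k1       # centering matrix H_{k_i+1}
    _, _, Vt = np.linalg.svd(X @ H, full_matrices=False)
    U = Vt[:d].T                                  # d right singular vectors of X H
    W = H @ (np.eye(k1) - U @ U.T)
    return W @ W.T                                # L_i of Eq. (30)

# Sketch of Algorithm 1: build each L_i, accumulate L, then solve Eq. (20).
def drive_face(Xs, patches, X_l, beta=10.0, patch_Li=lts_patch_Li):
    """Xs: 3 x N source face; patches: list of index sets b_i (centre first);
    X_l: 3 x m motion-data labels for the first m (rearranged) vertices."""
    Lis = [patch_Li(Xs[:, b]) for b in patches]               # steps 1-2
    L = assemble_alignment_matrix(patches, Lis, Xs.shape[1])  # steps 3-6
    return solve_target(L, X_l, beta)                         # step 7, Eq. (20)
```

Passing lle_patch_Li instead of lts_patch_Li selects the LLE variant, mirroring the note in step 2 of Algorithm 1.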

4. Experiments

In order to implement the proposed semi-supervised local patch alignment (SSLPA) method efficiently, two key issues should be clarified. First, the shape constraint constructed via the motion data plays a very important role in generating the target face, so how to choose its regularization parameter β is crucial to our method. Second, how to determine the quantity and the geometric positions of the feature points which represent the motion data is also very important. In this section, we first present some experimental results about these key issues of the proposed method. Then we compare our method with several existing methods both qualitatively and quantitatively to demonstrate its superiority.

We concentrate on driving a face model through given motion data, instead of facial motion cloning or retargeting [9], so we suppose the given motion data perfectly suit the 3D face model. Building a specific personalized 3D face for the human actor from whom the motion data are captured is very labor intensive; therefore, synthesized 3D faces and motion data are used in the experiments. To this end, we generate three kinds of human head

models, with 620, 3220 and 6174 vertices, respectively. The face models with 620 and 3220 vertices are manually built by one of our team members, and the face models with 6174 vertices are created through the FaceGen Modeller software. Each kind of model includes both source faces without expression and several target faces with certain expressions for each source face. We then choose some feature vertices from the ground truth target face as the motion data at a certain time instant, and drive the source face through these selected vertices. Consequently, we can evaluate the face driving result by comparing both the visual appearance and the driving error with the true target face. The error is computed as the root mean square (RMS) error $\sqrt{\| X^t - X \|_F^2 / N}$, where $X^t$ is the estimated target face and $X$ is the ground truth target face. In the remainder of this section, these three kinds of face models are called the sparse face, the mediate face and the dense face, respectively. Since we can generate a large number of dense faces with little manual work, most of the quantitative tests are done on the dense faces.

4.1. The influence of the regularization parameter

In order to measure the influence of the regularization parameter β on the driving result, we fix a number of vertices from the true target face as motion data or feature points, and change β to observe the result. For the three kinds of models, the vertices are


Fig. 17. The driving result of mediate face with 210 feature vertices.

selected from similar regions of the face so as to reflect the appearance of the facial expression, but they vary in number: we select 60, 120 and 150 vertices from the sparse, mediate and dense face, respectively. The source face is driven using these motion data based on two SSLPA algorithms which differ in the intrinsic embedding computation: one algorithm uses the LLE (locally linear embedding) local patch, and the other uses the LTS (local tangent space) local patch. The driving results with respect to different β's are shown in Figs. 5 and 6. It is observed that the synthesized target faces look more similar to the true target faces as β increases from 0.1 to 100 by a factor of 10 each time. This is because β controls the similarity of the driving result to the true target face: a larger β leads to higher similarity and vice versa. Also, we notice that when β is larger than 10, the driving results show little difference from the true target faces. Therefore β = 10 might be an appropriate value for face driving. For a quantitative test, we randomly create 10 pairs of dense faces, with each pair including a source and a target face, drive the source faces using a fixed number of motion data selected from the target faces, and then compute the RMS error between the driving results and the true target faces. The RMS error with respect to different β's is plotted in Fig. 7, which indicates that when β ≥ 1 the RMS error tends to be stable. Considering both the visual effect and the RMS error, we use β = 10 in the following experiments.
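For reference, the RMS criterion used throughout this section can be written as a one-line function (our sketch, assuming faces stored as 3 × N arrays):

```python
import numpy as np

# Sketch: RMS error sqrt(||X_est - X_true||_F^2 / N) between the estimated
# and ground-truth target faces.
def rms_error(X_est, X_true):
    return np.sqrt(np.sum((X_est - X_true) ** 2) / X_est.shape[1])
```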

Still, one point needs to be clarified. It can be observed from Fig. 7 that the RMS error of sample 2 with β = 0.1 is smaller than the RMS error with other values of β when using the LLE patch to implement our method. This seems to contradict our conclusion that β = 10 yields better results. However, when we pick out the driving results of sample 2 with different β's, we immediately find that the driving result with β = 0.1 differs more from the true target face than the results with the other β's, especially in the regions of the eyes and cheeks, as shown in Fig. 8. The reason is that β actually adjusts the deviation of the feature points on the driving result (the synthesized target face) from those on the true target face, whereas the RMS error plotted in Fig. 7 is computed over the whole face, which includes not only important facial features but also some regions with little personal characteristic, like the ears and neck. These regions are not very important to facial animation, but they do influence the computation of the RMS error. Our objective function ensures that the important features of the driving result fit the features of the true target face well. In most cases, when the facial features of the driving result are close to those of the true target face, the RMS error between the driving result and the true target face is also small; this indicates that it is rational to compute the RMS error over the whole face. Only in a few cases does good matching of the features fail to lead to a small RMS error between the driving result and the true target face, because the computation of the RMS error is disturbed by some unimportant regions that do not reflect much personal characteristic. To verify this analysis, we use the previously selected feature points of sample 2 to recompute the RMS error, which is 0.64937, 0.13408, 0.017087 and 0.0017679 with β = 0.1, 1, 10 and 100, respectively. Consequently, the visual appearance of the driving result with β = 0.1 is not as good as that with β = 10 or β = 100. Note that we do not suggest using a very large β, even though it can bring about a small error between the facial features of the driving result and the true target face, because we believe this might cause some over-fitting problem; this is why the RMS error of some samples with β = 10 is smaller than that with β = 100 in Fig. 7. Based on the above discussion, we choose β = 10 for our method.


Fig. 18. The driving result of dense face with 320 feature vertices.

personal characteristics. To verify the analysis, we use previously selected feature points from sample 2 to recompute the RMS error, and the RMS error is 0.64937, 0.13408, 0.017087, 0.0017679 with β ¼ 0:1, β ¼ 1, β ¼ 10, β ¼ 100 respectively. Consequently, the visual appearance of the driving result with β ¼ 0:1 is not as good as that with β ¼ 10 and β ¼ 100. Note that we do not suggest to use a very big β though it can bring about small error between facial features of the driving result and the true target face because we believe this might cause some over-fitting problem. This is why the RMS error of some samples with β ¼ 10 is smaller than that with β ¼ 100 in Fig. 7. Based on the above discussions, we choose β ¼ 10 for our method. 4.2. Dependency of the framework on the selection of facial feature vertices The learning result of the framework is dependent on the selection of the feature vertices, but how to determine a set of feature vertices that will be given labels for optimal semisupervised learning remains a very difficult problem of machine learning. We have proved that manually selection is feasible, but from what region of the face should the feature vertices be chosen, how many feature vertices should be chosen, and is there any other method for feature vertex selection? There is no conclusion for these questions so far. In this section, we compare three

strategies for feature vertex selection: random selection, manual selection, and the combination of both.

Random selection is perhaps the simplest approach to feature vertex selection. Because the vertices are treated without distinction, those vertices that best reflect the appearance of the facial motion might be ignored, so we believe random selection would not lead to good results if the number of selected vertices is small. To verify this speculation, we randomly pick a number of vertices from the true target faces of the sparse, mediate and dense face, respectively, and drive the corresponding source faces using these selected vertices. The result is depicted in Fig. 9, where the first row represents the face driving result using the LLE local patch and the second row represents the result using the LTS local patch. For the sparse face, we repeat the driving three times, with 30, 80 and 120 feature vertices each time. Similar repetitions are conducted for the mediate face and the dense face, but with different numbers of feature vertices: for the mediate face the numbers are 80, 120 and 150, and for the dense face they are 120, 150 and 200. The common observation is that the result of face driving becomes better as the number of feature vertices grows. Fig. 9 shows that 120, 150 and 200 feature vertices are enough to generate fine results for the sparse, mediate and dense face, respectively. Also, it can be observed that even with 120, 150 and 200 feature vertices there remains some difference in appearance between the results using the LLE patch and the LTS patch. We believe


Fig. 19. The driving results of more dense faces. (a) The source face, (b) LLE patch, (c) LTS patch, (d) the ground truth.


the difference tends to diminish if more feature vertices are used, but the number of feature vertices cannot increase without restraint. We expect to use as few feature vertices as possible to achieve a good driving result. For this purpose, we ask an experienced animator in our team to carefully select some feature vertices from facial regions like the eyes, mouth, cheeks and chin as motion data that best represent the appearance of the facial motion. Specifically, we choose 30, 60 and 120 feature vertices from the sparse, mediate and dense target face, respectively, and drive the corresponding source faces through the selected vertices. As expected, the driving results are much better than those of random feature vertex selection. The driving results of the sparse face and the mediate face are shown in Fig. 10, where the left part demonstrates the results of random selection with 30 and 80 feature vertices for each type of face, and the right part demonstrates the results of fine manual selection. Manual selection of feature vertices has an obvious advantage over random selection in that representative vertices closely related to the facial motion can be adopted; a lack of representative vertices can lead to apparent and unwanted distortions in some important regions. For example, the mouth regions of the driving results distort incorrectly as shown


Fig. 20. Comparison of the RMS error generated by different methods where "rbf" means radial basis function interpolation, "lap" means Laplacian method, "lle patch" represents our driving method using LLE patch, and "lts patch" represents our driving method using LTS patch.

in the left part of Fig. 10. Meanwhile, manual selection is expected to achieve a smaller RMS error than random selection when using the same or even a smaller number of feature vertices. We select a sparse face, a mediate face and a dense face as


Fig. 21. Comparison of the visual appearance of the results generated by different methods where “rbf” means radial basis function interpolation, “lap” means Laplacian method, “lle” represents our driving method using LLE patch, and “lts” represents our driving method using LTS patch.


Fig. 22. The comparison of the local details generated by different methods.

source faces, and pick out a number of feature vertices from the corresponding true target faces to execute our method. The RMS errors of both random feature selection and manual feature selection are plotted in Fig. 11, where the left part corresponds to the LLE patch and the right part to the LTS patch, and the number of features used for each execution is shown on top of each bar. It is easy to see that manual feature vertex selection yields a smaller error than random feature vertex selection when using the same number of features. In particular, when driving the mediate face, the RMS error of manual feature selection with 60 features is even smaller than the RMS error of random feature selection with 80 features; that is why we use 60 manually selected features rather than 80. Note that for random feature selection the driving error of each execution differs slightly, so we run our algorithms 10 times and plot the best result in Fig. 11. We do not think it necessary to reduce the number of features further in manual selection, because choosing a few dozen or more features is not a heavy burden for the animator, and the corresponding real motion data in the temporal domain can easily be obtained through motion capture devices.

We have noticed from Fig. 9 that with random feature vertex selection the result of using the LLE patch is a little different from the result of using the LTS patch. This is also true for manual feature selection. We use 120 carefully selected feature vertices as motion data to drive a dense face; the results of using the LLE patch and the LTS patch are demonstrated in Fig. 12, where the central image represents the true target face, the left image the face driving result using the LLE patch, and the right image the result using the LTS patch. Fig. 12 indicates that, based on a set of carefully selected feature vertices, the LLE patch captures more of the local facial details caused by facial motion around regions such as the lips and cheeks, but preserves fewer global geometric characteristics such as the shape of the head and the contour of the chin. On the contrary, the LTS patch preserves more global geometric characteristics, but lacks some local facial details. This can also be observed from the side view shown in Fig. 13, where the left figure represents the driving result using the LLE patch, the central figure the true target face, and the right figure the result using the LTS patch.

The difference between the driving results using the LLE patch and the LTS patch is closely related to the objective functions of the two methods. Given a current vertex, the two methods construct its local patch in the same way. However, the LLE patch uses the local reconstruction coefficients (23) to capture the local facial topology, while the LTS patch uses the tangent space coordinates (26) to represent it. The reconstruction coefficients are computed in the 3D Euclidean space and conveyed to the local intrinsic space, whereas the local tangent coordinates are in fact the coordinates of a local principal component analysis (PCA) aiming to maximize the variance between vertices; therefore the LLE patch is more suitable for local topology preservation than the LTS patch. On the other hand, our assumption (7) is strictly true for the LTS patch, because the local linear transformation $T_i$ is computed based on

the orthogonal basis matrix of the tangent space. But for the LLE patch, we use the local linear transformation computed through the current vertex to approximate the linear transformation that should be imposed on the whole patch. Thus the LTS patch is more suitable for capturing the global geometric characteristics than the LLE patch.

In order to capture both the local topology and the global geometric characteristics well simultaneously, more feature vertices are preferable. However, choosing a large number of vertices by hand is neither practical nor necessary. We therefore propose to combine the manual and random methods for feature vertex selection. To this end, we keep the 30, 60 and 120 feature vertices chosen from the sparse, mediate and dense target faces, respectively, and append an extra group of randomly selected vertices from each kind of target face to construct the motion data. The RMS errors between the driving results and the true target faces with different numbers of extra labels are plotted in Fig. 14. As expected, the general trend of the RMS error is to decrease as the number of extra feature vertices grows. Basically, Fig. 14 shows a slow descent of the RMS error without large fluctuations, which indicates the efficacy of the extra feature vertices.

We notice that the RMS error of dense face driving using both the LLE patch and the LTS patch in Fig. 14 drops suddenly when the number of extra labels is roughly 540, and rises again when the number increases to about 580. This seems unusual, but we think that when the dips in the RMS error curves occur, the randomly selected features happen to reflect the representative facial characteristics of the face motion; in this case, these features have the same function as those carefully selected by hand, and we consequently obtain a smaller RMS error than with features that do not reflect the representative facial characteristics. On the contrary, if the randomly selected features happen to lie in regions that do not reflect representative facial characteristics (such as the ears and neck), the RMS error may increase, and there would be a hump in the RMS error curve. We think this phenomenon is normal, since we can find some humps and similar dips in the RMS error curves of sparse and mediate face driving in Fig. 14; the difference is that the dips and humps of sparse and mediate face driving are not as apparent as the dip of dense face driving. For example, a hump in the RMS error curve of sparse face driving using the LLE patch occurs when the number of extra features is between 42 and 47, and a dip in the RMS error curve of mediate face driving using the LLE patch or the LTS patch occurs when the number of extra features is roughly between 230 and 245.

Note that the experiments in Fig. 14 are based on one sparse face, one mediate face and one dense face, so Fig. 14 does not show statistical results. To ascertain a more general trend of the RMS error curve with a growing number of extra labels, we randomly generate 50 different dense source faces and corresponding dense target faces with certain expressions, and repeat the experiment of Fig. 14 50 times. The average RMS error of the dense face driving with different numbers of extra labels is plotted in


Fig. 23. More results of local patch alignment algorithm, where “lle” represents face driving using LLE patch, and “lts” represents face driving using LTS patch.


Fig. 15, where we can see that the RMS error drops more smoothly with the growing number of extra labels.

It can be learned from Figs. 14 and 15 that when the number of extra vertices exceeds a certain value, the RMS error tends to be stable; for the sparse, mediate and dense face, this value can be set to 50, 150 and 200, respectively. Note that the extra vertices are chosen randomly from the whole target head, so this does not add any burden for the animator. In some real world scenarios, the total number of feature vertices we need may exceed what is available from the motion capture devices; in that case, we can use the real motion data to drive these feature vertices through some existing method and then deform the whole face using these feature vertices. This strategy is somewhat like the repetitive interpolation method adopted in [10]. Consequently, we use 80 feature vertices, including 30 manually selected vertices and 50 randomly selected ones, for sparse face driving in the remaining experiments. For mediate face driving, 210 feature vertices, including 60 manually selected and 150 randomly selected ones, are used. For dense face driving, 320 feature vertices, including 120 manually selected and 200 randomly selected ones, are used. The face driving results using the LLE patch and the LTS patch, as well as the true target faces, are shown in Figs. 16–18, where each figure corresponds to one type of face. It can be observed that with these feature vertices we obtain decent results which are very close to the ground truth target faces. Additional results for dense faces with 120 manually selected vertices and 200 randomly selected ones are shown in Fig. 19, where column (a) shows the source face, column (b) shows the results using the LLE patch, column (c) shows the results using the LTS patch and column (d) shows the ground truth target faces.

4.3. Comparison with other methods

Radial basis function (RBF) interpolation [10] and the Laplacian method [11] have been successfully used in face driving. To demonstrate the advantage of the proposed method, we compare it with these two methods. We build 50 pairs of dense sample faces, with each pair including a source face without expression and a target face with a certain facial expression. Then we pick 320 feature vertices from the true target faces, following the strategy introduced in Section 4.2, to drive the corresponding source faces through the different methods. We compute the RMS error between the driving results and the true target faces so as to quantitatively evaluate these methods. The RMS error is shown in Fig. 20, where the horizontal axis represents the sample ID and the vertical axis shows the RMS error. Fig. 20 indicates that for these 50 arbitrarily generated faces, RBF interpolation has a larger error than the other three methods, among which the Laplacian method shares a similar error with face driving based on the LLE patch, while face driving based on the LTS patch has the smallest error. The mean errors of the four methods are 1.6630, 1.1567, 1.1540 and 1.0783, respectively. The reason is that RBF interpolation depends on the Euclidean distance between the feature vertices and all the vertices of the face, and the Euclidean distance cannot correctly capture the local topology of the face model in many cases. For example, when a feature vertex on the upper lip is closer to the current vertex, whose new position needs to be determined, than any feature vertex on the lower lip, the new position will be influenced more by the feature vertex on the upper lip.
If the current vertex happens to lie on the lower lip, the driving result will be incorrect. This is especially obvious for the mouth-open expressions demonstrated in Fig. 21; the sketch below makes this distance-based weighting concrete.
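To make the distance dependence concrete, the following is a minimal sketch of RBF-based face driving, assuming a Gaussian kernel and NumPy arrays for the vertex data; the function name `rbf_drive` and the kernel width `sigma` are our own illustrative choices, and the sketch illustrates the general scheme rather than the exact formulation of [10].

```python
import numpy as np

def rbf_drive(source_verts, feat_idx, feat_targets, sigma=0.05):
    """Drive a neutral face by RBF interpolation of feature displacements.

    source_verts : (n, 3) neutral-face vertex positions
    feat_idx     : (m,) indices of the feature vertices
    feat_targets : (m, 3) target positions of the feature vertices
    sigma        : Gaussian kernel width (hypothetical default)
    """
    centers = source_verts[feat_idx]          # (m, 3) kernel centers
    disp = feat_targets - centers             # known feature displacements

    def gauss(a, b):
        # Pairwise Gaussian kernel values based on Euclidean distance.
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    # Solve K w = disp so the interpolant reproduces the feature motion.
    K = gauss(centers, centers)               # (m, m)
    w = np.linalg.solve(K + 1e-8 * np.eye(len(centers)), disp)

    # Every vertex moves by a distance-weighted sum of the kernels.
    return source_verts + gauss(source_verts, centers) @ w
```

Because `gauss(source_verts, centers)` weights every feature vertex purely by Euclidean distance, a lower-lip vertex that is geodesically far from, but spatially close to, an upper-lip feature vertex inherits the wrong motion, which is exactly the artifact visible in Fig. 21.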

The error of the Laplacian method and of face driving based on the LLE and LTS patches is smaller because these three methods rely on preserving local topology rather than on computing global Euclidean distances.

Theoretically, the local patch alignment method, with either the LLE patch or the LTS patch, is superior to the Laplacian method in face driving. In the original Laplacian method, the local topology is represented only by the reconstruction error of each vertex from its neighboring vertices. This reconstruction error can be seen as a vector whose direction is determined by the position of the current vertex relative to its neighbors, so the local topology can be translated but neither rotated nor scaled. A common remedy is to compute a local linear transformation for each vertex that encodes rotation and scale variation [11]. However, the linear transformation at each vertex is computed independently, so adjacent vertices may differ considerably in their transformations, which is implausible in many cases. In addition, the objective function cannot ensure that the driving result obtained with these transformations is optimal in any sense. In contrast, the local patch alignment method computes a single local linear transformation for all the vertices in a patch, where patches overlap to cover the whole face model, and uses this transformation to align the embedding of each local patch with the corresponding part of the globally unique face driving result. In this sense, the driving result of the local patch alignment method is optimal, and the objective function encodes translation, rotation and scaling of the local topology. The optimality of the global alignment has been elaborated in [15]. Because a linear transformation is imposed on a whole patch instead of a single vertex, the global alignment avoids the flaws of the Laplacian method caused by independent per-vertex transformations. For LLE patch alignment specifically, we use the local linear transformation computed at the current vertex to approximate the transformation that should be imposed on the whole patch, so the global alignment is not entirely accurate; for this reason, the RMS error of the driving result using the LLE patch is sometimes larger than that of the Laplacian method. Nevertheless, the LLE patch remains well suited to capturing local geometric characteristics, and owing to the combination of local topology capture and global alignment, LLE patch alignment is very good at preserving the geometry around the feature vertices during face driving. Consequently, for some faces, even though LLE patch alignment yields an error similar to that of the Laplacian method, its driving result is visually closer to the true target face (Fig. 22). We consider this particularly important for facial animation. A schematic sketch of the per-patch alignment idea is given below.
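To illustrate how per-patch topology capture and a global alignment interact, the sketch below implements a simplified LLE-style variant under stated assumptions: each patch consists of a vertex and its neighbors, the per-patch linear transformation is not solved for explicitly, and the feature vertices enter as a soft quadratic penalty with weight `lam`. All names (`lle_weights`, `drive_face`, `lam`) are hypothetical; this is a schematic of the general idea, not the authors' implementation.

```python
import numpy as np

def lle_weights(patch_center, neighbors, reg=1e-3):
    """Weights reconstructing a vertex from its patch neighbors (LLE step)."""
    Z = neighbors - patch_center                 # (k, 3) centered neighbors
    G = Z @ Z.T                                  # local Gram matrix
    G += reg * np.trace(G) * np.eye(len(Z))      # condition G for stability
    w = np.linalg.solve(G, np.ones(len(Z)))
    return w / w.sum()                           # weights sum to one

def drive_face(source_verts, nbr_idx, feat_idx, feat_targets, lam=10.0):
    """Solve for target vertices that preserve each patch's topology while
    matching the known feature-vertex positions in the least-squares sense.

    nbr_idx : list of integer index arrays, one neighborhood per vertex.
    (Dense matrices are used here for clarity; a real mesh would use
    sparse matrices.)
    """
    n = len(source_verts)
    W = np.zeros((n, n))
    for i, nbrs in enumerate(nbr_idx):
        W[i, nbrs] = lle_weights(source_verts[i], source_verts[nbrs])

    IW = np.eye(n) - W
    M = IW.T @ IW                                # global alignment matrix

    # Soft constraints: pull the feature vertices toward their targets.
    C = np.zeros((n, n))
    C[feat_idx, feat_idx] = lam
    rhs = np.zeros((n, 3))
    rhs[feat_idx] = lam * feat_targets

    return np.linalg.solve(M + C, rhs)           # (n, 3) driven face
```

The key point is that the alignment matrix M couples all patches into a single linear system, so the solution is globally optimal for the chosen objective instead of being assembled from independently estimated per-vertex transformations.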
Fig. 21 demonstrates six groups of faces; each group contains the source face, the true target face and the faces generated by the different methods. We observe obviously incorrect deformation in the mouth region when RBF interpolation is adopted, while the driving results of the other three methods are better. Although the results of the Laplacian method and the local patch alignment method look alike in Fig. 21, they differ in many local details. In Fig. 22, the hollow spaces between the upper and lower lips have different contours, and the result of local patch alignment is closer to the ground truth; in addition, the face generated by the Laplacian method exhibits an unnatural wrinkle at the mouth corner. More results of the local patch alignment method are shown in Fig. 23, where we build 12 pairs of dense source faces and corresponding target faces covering different races, genders, expressions and ages, and demonstrate the driving results of the LLE patch and LTS patch alignment algorithms together with the true target faces. These fine results indicate that the semi-supervised local patch alignment algorithm has good adaptability.
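For completeness, the RMS error used in the quantitative comparison above can be computed as follows; this is a minimal sketch assuming the driven face and the ground-truth face share the same vertex order, since the paper does not spell out the formula.

```python
import numpy as np

def rms_error(driven, target):
    """Root-mean-square vertex position error between two faces.

    driven, target : (n, 3) arrays with corresponding vertex order.
    """
    return np.sqrt(np.mean(np.sum((driven - target) ** 2, axis=1)))
```

Evaluating this once per sample yields the 50 per-sample values plotted in Fig. 20.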


5. Conclusion

We report a semi-supervised local patch alignment framework for data-driven facial animation. Although the objective function of this framework is similar to those of some semi-supervised locality-preserving manifold methods, the framework is developed not for dimensionality reduction but for face driving. We divide the given face into local patches, compute the intrinsic embeddings of the patches, and align these embeddings to obtain the face driving result. In this process, a set of motion data regularizes the alignment so that the final result closely approximates the unknown expressive face. To construct the optimal motion data, we adopt a combined approach of manual designation and random selection. Experimental results demonstrate that the proposed framework has clear advantages over existing methods and is well suited to face driving. However, this paper assumes that the motion data fit the given face model perfectly, so we did not use real motion data captured from a human actor to achieve face driving. Doing so requires a process of motion retargeting or motion cloning, and we plan to address this problem and drive faces with real motion data in our future work.

Conflict of interest

No conflict of interest.

References

[1] Z. Deng, P. Chiang, P. Fox, U. Neumann, Animating blendshape faces by cross-mapping motion capture data, In: Proceedings of the 2006 ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, 2006, pp. 43–48.
[2] E. Chuang, C. Bregler, Performance driven facial animation using blendshape interpolation, Computer Science Technical Report CS-TR-2002-02, Stanford University, 2002.
[3] T. Weise, S. Bouaziz, H. Li, M. Pauly, Realtime performance-based facial animation, ACM Trans. Graph. 30 (4) (2011) 77–85.
[4] Q. Zhang, Z. Liu, B. Guo, D. Terzopoulos, H. Shum, Geometry-driven photorealistic facial expression synthesis, IEEE Trans. Vis. Comput. Graph. 12 (1) (2006) 48–60.
[5] K. Waters, A muscle model for animating three-dimensional facial expression, In: Proceedings of the ACM SIGGRAPH, 1987, pp. 17–24.
[6] Y. Zhang, E. Prakash, E. Sung, A new physical model with multilayer architecture for facial expression animation using dynamic adaptive mesh, IEEE Trans. Vis. Comput. Graph. 10 (3) (2004) 339–352.
[7] J. Hamm, C.G. Kohler, R.C. Gur, R. Verma, Automated facial action coding system for dynamic analysis of facial expressions in neuropsychiatric disorders, J. Neurosci. Methods 200 (2) (2011) 237–256.


[8] W. Gao, Y. Chen, R. Wang, S. Shan, D. Jiang, Learning and synthesizing MPEG-4 compatible 3-D face animation from video sequence, IEEE Trans. Circuits Syst. Video Technol. 13 (11) (2003) 1119–1128.
[9] M. Fratarcangeli, M. Schaerf, R. Forchheimer, Facial motion cloning with radial basis functions in MPEG-4 FBA, Graph. Models 69 (2007) 106–118.
[10] F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, D.H. Salesin, Synthesizing realistic facial expressions from photographs, In: Proceedings of the ACM SIGGRAPH, 1998, pp. 75–84.
[11] O. Sorkine, D. Cohen-Or, Y. Lipman, M. Alexa, C. Rössl, H.-P. Seidel, Laplacian surface editing, In: Proceedings of the 2004 EUROGRAPHICS/ACM SIGGRAPH Symposium on Geometry, 2004, pp. 175–184.
[12] Y. Zhuang, J. Zhang, F. Wu, Hallucinating faces: LPH super-resolution and neighbor reconstruction for residue compensation, Pattern Recognit. 40 (11) (2007) 3178–3194.
[13] J. Zhang, Y. Zhuang, F. Wu, Video-based facial expression hallucination: a two-level hierarchical framework, In: Proceedings of Advanced Concepts for Intelligent Vision Systems 2006, 2006, pp. 513–521.
[14] S. Roweis, L. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (5500) (2000) 2323–2326.
[15] Z. Zhang, H. Zha, Principal manifolds and nonlinear dimensionality reduction via tangent space alignment, SIAM J. Sci. Comput. 26 (1) (2004) 313–338.
[16] J. Chai, J. Xiao, J.K. Hodgins, Vision-based control of 3D facial animation, In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, 2003, pp. 193–206.
[17] X. Wan, X. Jin, Data-driven facial expression synthesis via Laplacian deformation, Multimed. Tools Appl. 58 (1) (2012) 109–123.
[18] M. Eitz, O. Sorkine, M. Alexa, Sketch based image deformation, In: Proceedings of the 2007 Vision, Modeling, and Visualization Conference, 2007, pp. 135–142.
[19] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
[20] C. Chen, L. Zhang, J. Bu, C. Wang, W. Chen, Constrained Laplacian eigenmap for dimensionality reduction, Neurocomputing 73 (4–6) (2010) 951–958.
[21] X. He, P. Niyogi, Locality preserving projections, In: Advances in Neural Information Processing Systems 16 (NIPS), 2004, pp. 153–160.
[22] N. Guan, D. Tao, Z. Luo, B. Yuan, Manifold regularized discriminative nonnegative matrix factorization with fast gradient descent, IEEE Trans. Image Process. 20 (7) (2011) 2030–2048.
[23] L. Zhang, Q. Zhang, L. Zhang, D. Tao, X. Huang, B. Du, Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding, Pattern Recognit. 48 (10) (2015) 3102–3112.
[24] Y. Mu, D. Tao, Biologically inspired feature manifold for gait recognition, Neurocomputing 73 (4–6) (2010) 895–902.
[25] X. Yang, H. Fu, H. Zha, J. Barlow, Semi-supervised nonlinear dimensionality reduction, In: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 1065–1072.
[26] T. Zhang, D. Tao, X. Li, J. Yang, Patch alignment for dimensionality reduction, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1299–1313.
[27] F. Nie, D. Xu, I. Tsang, C. Zhang, Flexible manifold embedding: a framework for semi-supervised and unsupervised dimension reduction, IEEE Trans. Image Process. 19 (7) (2010) 1921–1932.
[28] J. Li, W. Bian, D. Tao, C. Zhang, Learning colours from textures by sparse manifold embedding, Signal Process. 93 (6) (2013) 1485–1495.
[29] D. Liu, Q. Chen, J. Yu, H. Gu, D. Tao, H.S. Seah, Stroke correspondence construction using manifold learning, Comput. Graph. Forum 30 (8) (2011) 2194–2207.
[30] Z. Zhang, H. Zha, M. Zhang, Spectral methods for semi-supervised manifold learning, In: Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–6.

Jian Zhang received his B.E. degree and M.E. degree from Shandong University of Science and Technology in 2000 and 2003 respectively, and his Ph.D. degree from Zhejiang University in 2008. He is an associate professor in the School of Science and Technology of Zhejiang International Studies University. Before this, he worked in the Department of Mathematics of Zhejiang University from 2009 to 2011. Currently, he is visiting the School of Computing Science of Simon Fraser University for six months' research. His research interests include computer animation, multimedia processing and machine learning. He serves as a reviewer for several prestigious journals.

Jun Yu received his B.Eng. and Ph.D. from Zhejiang University, Zhejiang, China. He is currently a Professor with the School of Computer Science and Technology, Hangzhou Dianzi University. He was an Associate Professor with the School of Information Science and Technology, Xiamen University. From 2009 to 2011, he worked at Nanyang Technological University, Singapore. From 2012 to 2013, he was a visiting researcher at Microsoft Research Asia (MSRA). His research interests include multimedia analysis, machine learning and image processing. He has authored and co-authored more than 50 scientific articles. He has (co-)chaired several special sessions, invited sessions and workshops, and has served as a program committee member or reviewer for top conferences and prestigious journals. He is a Professional Member of the IEEE, ACM and CCF.

Jane You obtained her B.Eng. in Electronic Engineering from Xi'an Jiaotong University in 1986 and her Ph.D. in Computer Science from La Trobe University, Australia, in 1992. She was a lecturer at the University of South Australia and a senior lecturer at Griffith University from 1993 to 2002. Currently she is a professor at the Hong Kong Polytechnic University. Her research interests include image processing, pattern recognition, medical imaging, biometrics computing, multimedia systems and data mining.

Dapeng Tao received the B.E. degree from Northwestern Polytechnical University and the Ph.D. degree from South China University of Technology. He is currently a Full Professor with the School of Information Science and Engineering, Yunnan University, Kunming, China. He has authored and co-authored more than 30 scientific articles and has served as a reviewer for more than 10 international journals, including IEEE TNNLS, IEEE TMM, IEEE CSVT, IEEE SPL and Information Sciences. His research interests include machine learning, computer vision and robotics.

Na Li received her B.E. degree and Ph.D. degree from Zhejiang University in 2001 and 2008 respectively. She worked in the College of Computer Science and Technology of Zhejiang University as a post-doctoral fellow from 2008 to 2011. Now she is an assistant professor in the School of Science and Technology of Zhejiang International Studies University. Her research interests include multimedia analysis and image processing.

Jun Cheng received the B.Eng. and M.Eng. degrees from the University of Science and Technology of China, in 1999 and 2002, and the Ph.D. degree from the Chinese University of Hong Kong in 2006. He is currently with the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, as a Professor and the Director of the Laboratory for Human Machine Control. His current research interests include computer vision, robotics, machine intelligence, and control.