3D face modeling based on structure optimization and surface reconstruction with B-Spline


Neurocomputing 179 (2016) 228–237


Weilong Peng (a), Chao Xu (b,*), Zhiyong Feng (a)

(a) School of Computer Science and Technology, Tianjin University, China
(b) School of Computer Software, Tianjin University, China

Article info

Abstract

Article history: Received 30 March 2015; received in revised form 31 October 2015; accepted 28 November 2015; available online 29 December 2015. Communicated by Qingshan Liu.

Reconstructing a 3D face model from photos in the wild is difficult: camera calibration is usually necessary, and the images typically must come from video sequences. In this paper, a face reconstruction model with structure optimization is proposed to build a 3D face surface that preserves individual geometric and physical features, directly from wild face images and without camera calibration. Low rank and B-Spline are employed to estimate the aligned 2D structure, to calculate the depth information with SSIM, and to reconstruct the 3D face surface from control points and their spatial transformation. Furthermore, the LFW and Bosphorus datasets, as well as young-to-aged samples, are introduced to verify the proposed approach, and the experimental results demonstrate its feasibility and effectiveness even under different poses, expressions and age variation. © 2015 Elsevier B.V. All rights reserved.

Keywords: 3D face reconstruction; low rank; B-spline face; structure optimization; SSIM

1. Introduction

Face reconstruction is usually based on a scenario in which the subject is fixed and scanned with special equipment under lab conditions; other techniques are based on video sequences or multi-view photographs. A reconstructed face model is of great help for many face-related tasks, such as recognition [19], animation [4], and tracking [33-35]. However, a more challenging task is to reconstruct the face structure from photos in the wild, with unknown camera calibration. Wild means that the images are captured under different environments, so they vary in expression, illumination, pose, and even age. Affected by these variations, the shape of a person's face is difficult to define.

Currently, there exist two main kinds of approaches for creating a 3D face model. The first uses a special 3D scanner to capture the shape of the face, e.g. [1]. The second reconstructs the face from 2D images, such as video sequences [2] or multi-view photographs, e.g. [3]. Several related techniques are introduced below.

Most state-of-the-art techniques [4,5,1] achieve high-quality face reconstruction using 3D scanning under highly controlled lab conditions, with special equipment such as laser, stereo and structured lighting. Multi-view stereo approaches [5-7] rely on synchronized data from multiple high-resolution cameras with

Corresponding author. E-mail address: [email protected] (C. Xu).

http://dx.doi.org/10.1016/j.neucom.2015.11.090 0925-2312/© 2015 Elsevier B.V. All rights reserved.

known camera calibration. Structured-light [4] and light-stage [8] based reconstructions need multiple synchronized and calibrated lights. However, it is inconvenient to deploy such devices for wide application, so reconstruction without calibration is a significant research direction.

Single-view methods: Kemelmacher et al. [9] use a single image for reconstruction by the shape-from-shading (SFS) approach [10], but they also rely on a template face as a prior. Suwajanakorn et al. [11] reconstruct highly detailed 3D shape using dense 3D flow and SFS for each video frame, but they need an individual model. Because the single-view problem is ill-posed [12], recovering shape via the reflection model has no guaranteed unique solution, and SFS remains an intractable problem [13]; the template in [9] and the individual model in [11] are applied to yield good-looking results.

Sometimes multiple photos are more reliable. Structure-from-motion (SFM) uses multiple frames of image sequences to recover the 3D shape of an object [14,15], estimating the rigid 3D structure of feature points from 2D observations [16,17]. An incremental SFM approach is proposed in [18] to build a 3D generic face model for the non-rigid face. Although it generalizes well with respect to expression and identity by incorporating prior 3D shape information, incremental SFM is not applicable to multi-view wild images.

The most related work is the spatial-transformation approach [19] and reconstruction based on multiple wild photos [20]. Similarity transformation is used in [19] to optimize the 3D face structure with a set of face images under different poses. During the procedure, a frontal constraint is necessary to yield a good result,


but ultimately it cannot generate a dense model. Compared with bundle adjustment [14], our method fits the facial-structure problem well, although both are essentially based on minimizing reprojection error: in our problem the face is non-rigid and varies across wild images, whereas bundle adjustment suits large-scale rigid-object reconstruction. Our method also differs from the 3DMM [21-23], which learns a basis-face representation from a 3D database, because it optimizes directly without learning. Importantly, we solve the frontal optimization by low rank, reducing the difficulty of structure optimization, and also generate the spline surface of the face.

Kemelmacher [20] assumes that there exists a subset of photos with locally consistent shading for each local region, and builds a face model by local optimization. Its good reconstruction relies on the constraint that the image set must be "identified" as one person by some algorithm, to ensure global consistency of the local areas; it becomes invalid without identification when the photos vary in age, pose, and so on. Roth et al. [24] perform reconstruction by landmark-driven mesh deformation based on photometric normals. It assumes the face surface is C0-continuous at every vertex of the mesh model, so surface smoothness and true shape cannot be approximated and guaranteed to arbitrary precision. In contrast, we focus on optimizing the 3D structure according to the 2D face-structure cues in the photos, and try to obtain a consistent C2 shape of the whole face via optimization. Significantly, the C2 face is built by optimization based on low rank and B-spline, an innovative combination of a statistical method and a continuous geometric model.

In our work, we first obtain the 3D face structure by leveraging structure cues from multiple images with low rank and SSIM-based depth optimization. Then the face is reconstructed based on the 3D structure transformation and B-spline. The main contributions of this paper are: (1) 3D face structure optimization: frontal face structure optimization is cast as a sparse and low-rank decomposition, and depth estimation as nonlinear programming under multi-substructure constraints; (2) face surface construction with the B-spline control grid deformed under the guidance of the 3D structure transformation; (3) a 3D reconstruction solution based on wild photos, without calibration.


The remainder of the paper is organized as follows. Section 2 presents our method in detail. Section 3 shows the experimental results and related discussions.

2. Methodology

In detail, we first optimize the objective 3D structure from a reference one; second, the B-spline control grid of the reference surface is deformed according to the structure transformation from reference to objective; finally, the objective surface is constructed from the deformed control points. This procedure is illustrated in Fig. 1. In particular, the 3D face structure follows the CANDIDE 3D model [25] for coding the face structure. Since it is difficult to optimize the 3D structure directly from 2D images, we re-arrange the optimization as: (1) frontal 2D structure optimization by matrix rank minimization in image coordinates, generating the x and y components of the 3D structure; and (2) depth optimization, generating the z component for the frontal 2D structure scaled into 3D coordinates. All 3D structures are unified to a standard one in which the distance between the two eye centers is 50 units (see the left of Fig. 1). The optimized 3D face structure can then guide the construction of the B-spline face surface.

2.1. Frontal face structure optimization by matrix rank minimization

In this section, we discuss why rank is a natural measurement of face-structure similarity and how to abstract the frontal face structure. The 2D structure optimization is then formulated to obtain a set of transformations by minimizing the rank of the images.

Modeling assumption: Given $N$ face images of an individual, denote the $np$ facial points representing the 2D structure of image $i$ as
$$S_i = \begin{bmatrix} u_{i,1} & u_{i,2} & \cdots & u_{i,np} \\ v_{i,1} & v_{i,2} & \cdots & v_{i,np} \end{bmatrix},$$
where $i \in \{1, 2, \ldots, N\}$. These facial points can be marked manually or by a detection algorithm [26] whose performance is close to that of a human. In fact, these 2D structures are projections of the same 3D facial structure $T$ from 3D space, with
$$T = \begin{bmatrix} x_1 & x_2 & \cdots & x_{np} \\ y_1 & y_2 & \cdots & y_{np} \\ z_1 & z_2 & \cdots & z_{np} \end{bmatrix}.$$
The differences among these structures mainly result from face

Fig. 1. Framework: 3D face reconstruction based on 3D structure optimization and B-Spline. The left is the definition of the 3D structure including 40 points: the bottom shows that the structure topology has point O1(-25, 0, 0) and point O2(25, 0, 0) locating the eye centers in 3D space, and it looks like a frontal 2D face structure in the direction of the normal (0, 0, 1); the top shows the point positions on the face.


Fig. 2. Low rank and sparse decomposition: optimizing the frontal 2D face structure for a batch of misaligned face images.
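The low-rank-plus-sparse decomposition sketched in Fig. 2 can be illustrated with a minimal robust-PCA routine. This is an inexact augmented-Lagrangian sketch of the convex program that RASL solves once the transformations are held fixed; the parameter choices are illustrative and not the authors' implementation.

```python
import numpy as np

def shrink(X, t):
    """Soft-thresholding: proximal operator of the l1 norm."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def svd_shrink(X, t):
    """Singular-value thresholding: proximal operator of the nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ (np.maximum(s - t, 0.0)[:, None] * Vt)

def rpca(D, lam=None, n_iter=200):
    """Approximately solve  min ||A||_* + lam * ||E||_1  s.t.  D = A + E.
    In the alignment setting, D holds the warped, vectorized images."""
    m, n = D.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))
    Y = np.zeros_like(D)                       # Lagrange multipliers
    E = np.zeros_like(D)                       # sparse error term
    mu, rho = 1.25 / np.linalg.norm(D, 2), 1.1
    for _ in range(n_iter):
        A = svd_shrink(D - E + Y / mu, 1.0 / mu)   # low-rank update
        E = shrink(D - A + Y / mu, lam / mu)       # sparse update
        Y = Y + mu * (D - A - E)                   # dual ascent
        mu *= rho
    return A, E
```

The alternation between singular-value thresholding and soft-thresholding mirrors the two proximal operators of the nuclear norm and the l1 norm; the full RASL algorithm additionally linearizes and updates the transformations between such inner solves.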

misalignment, an inherent problem of the image acquisition process, since the relative position of the camera with respect to the face is seldom fixed across multiple images. The process can be formulated approximately as $S_i(u,v) = s \cdot (R \cdot T)_{(x,y)}$. The ideal aligned 2D structure is $S_{ideal} = s \cdot T_{(x,y)}$, which is difficult to obtain. Since the 3D structure of the face is unknown, we assume that the frontal 2D structure is the alignment restricted to the image plane. The alignments are then modeled as domain deformations. More precisely, if $\langle I_i, S_i \rangle$ denotes a misaligned image and its structure, and $\langle I_0, S_0 \rangle$ the well-aligned counterpart, there exists a 2D affine transformation such that
$$I_0(u,v) = (I_i \circ \tau_i)(u,v) = I_i(\tau_i(u,v)), \tag{1}$$
and
$$S_0(u,v) = \tau_i(S_i(u,v)). \tag{2}$$

In fact, the affine transformation in Eq. (1) is an approximation of the 3D projection. It is unconvincing to determine the frontal structure from the transformation of a single image; with enough images, the frontal 2D structure can be obtained more convincingly from the mean of the transformed structures via low rank.

Modeling the optimization of the frontal 2D structure: We formulate the optimization problem as follows. For $N$ input images of the same object, misaligned with respect to each other (see Fig. 2), there exist 2D affine transformations $\tau_1, \tau_2, \ldots, \tau_N$ such that the transformed images $I_1 \circ \tau_1, I_2 \circ \tau_2, \ldots, I_N \circ \tau_N$ are well aligned at the pixel level, and the transformed structures $\tau_1(S_1), \tau_2(S_2), \ldots, \tau_N(S_N)$ have similar 2D structures. Equivalently, the matrix $D \circ \tau = [\mathrm{vec}(I_1 \circ \tau_1), \mathrm{vec}(I_2 \circ \tau_2), \ldots, \mathrm{vec}(I_N \circ \tau_N)] \in \mathbb{R}^{d \times N}$ is low rank, where $D = [\mathrm{vec}(I_1), \mathrm{vec}(I_2), \ldots, \mathrm{vec}(I_N)]$ and $\tau = \{\tau_1, \tau_2, \ldots, \tau_N\}$. Therefore, the batch image alignment can be solved by the following optimization problem:
$$\min_{A,\tau}\ \mathrm{rank}(A), \quad \text{s.t.}\ D \circ \tau = A. \tag{3}$$

Through this optimization, each column of $A$ represents the image data of a well-aligned face, and the 2D structure shared by the faces is the mean of the $N$ well-aligned 2D structures:
$$\hat S_0 = \frac{1}{N} \sum_{i=1}^{N} \tau_i(S_i(u,v)). \tag{4}$$

This is the frontal 2D structure of images $I_1, I_2, \ldots, I_N$.

Practical solution: As there exist corruptions such as occlusion and pose diversity, let $e_i$ represent the error corresponding to $I_i$, so that the images $\{I_i \circ \tau_i - e_i\}_{i=1:N}$ are well aligned. Formula (3) can then be modified as:
$$\min_{A,\tau,E}\ \mathrm{rank}(A) + \gamma \lVert E \rVert_0, \quad \text{s.t.}\ D \circ \tau = A + E. \tag{5}$$

This optimization is also referred to as the Robust Alignment by Sparse and Low-rank decomposition (RASL) problem [27]. Its convex relaxation is:
$$\min_{A,\tau,E}\ \lVert A \rVert_* + \lambda \lVert E \rVert_1, \quad \text{s.t.}\ \lVert D \circ \tau - A - E \rVert_F < \epsilon, \tag{6}$$

where $\mathrm{rank}(\cdot)$ is replaced with the nuclear norm $\lVert A \rVert_* = \sum_{i=1}^{\min(m,n)} \sigma_i(A)$, the $\ell^0$-norm $\lVert E \rVert_0$ with the $\ell^1$-norm $\lVert E \rVert_1 = \sum_{i,j} \lvert E_{ij} \rvert$, $\lambda$ is a weighting parameter, and $\epsilon > 0$ is the noise level. An iterative convex programming solution is given in RASL [27]. Note that while RASL can align face images with respect to each other, it does NOT guarantee alignment to a frontal view; in practice, left-right flipped versions of the images are also fed to the model to generate the frontal structure $\hat S_0$.

After alignment, the x and y components of the 3D face structure $T$ can be computed by scaling and projecting the frontal 2D structure $\hat S_0$ onto the camera projection plane. Computing the z component is therefore treated as a depth-estimation problem.

2.2. Depth estimation for 3D structure in frontal view

With the frontal 2D structure known, depth estimation is needed to obtain the 3D structure. The depth values are initialized with those of a reference 3D structure, which can come from an existing 3D face database, e.g. KinectFaceDB [28]. In particular, the references are unified by similarity transformation so that the distance between the two eye centers is 50 units and their midpoint lies at the coordinate origin. For $S \in \{S_1, S_2, \ldots, S_N, \hat S_0\}$, each point $[u_j, v_j]^T$ in $S$ is re-centered as
$$\begin{bmatrix} u'_j \\ v'_j \end{bmatrix} = \begin{bmatrix} u_j \\ v_j \end{bmatrix} - \frac{1}{2}\left(\begin{bmatrix} u_1 \\ v_1 \end{bmatrix} + \begin{bmatrix} u_2 \\ v_2 \end{bmatrix}\right), \quad j = 1, 2, \ldots, np,$$
where $[u_1, v_1]^T$ and $[u_2, v_2]^T$ are the initial low-rank estimates of the two eye locations; this places the 2D midpoint of the two eyes at the coordinate origin.
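As a toy illustration (pure numpy with synthetic landmarks, not the authors' code), the frontal structure of Eq. (4) is simply the mean of the affine-transformed 2D structures, which is then re-centered at the midpoint of the two eye locations:

```python
import numpy as np

def apply_affine(S, tau):
    """Apply a 2D affine transform tau (2x3 matrix) to a 2 x np structure S."""
    return tau[:, :2] @ S + tau[:, 2:]

def frontal_structure(structures, taus):
    """Eq. (4): the frontal structure is the mean of the transformed structures."""
    return np.mean([apply_affine(S, t) for S, t in zip(structures, taus)], axis=0)

def recenter_at_eye_midpoint(S):
    """Shift S so the midpoint of its first two points (the eyes) is the origin."""
    return S - 0.5 * (S[:, [0]] + S[:, [1]])

# toy example: two translated copies of a 3-point structure, undone by the taus
S = np.array([[0.0, 50.0, 25.0],
              [0.0,  0.0, 30.0]])
tau1 = np.array([[1.0, 0.0, -2.0], [0.0, 1.0, -2.0]])   # undoes a +2 shift
tau2 = np.array([[1.0, 0.0,  2.0], [0.0, 1.0,  2.0]])   # undoes a -2 shift
S_hat = frontal_structure([S + 2.0, S - 2.0], [tau1, tau2])
S_centered = recenter_at_eye_midpoint(S_hat)
```

With a 50-unit inter-eye distance, the re-centered eye points land at (-25, 0) and (25, 0), matching the standard structure in Fig. 1.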


Multi-substructure based structural similarity constraint: The depth estimation is based on the locations of facial points in images, and a globally optimal solution relies on enough images to guarantee completeness of the effective information. However, Formula (8) has too many variables to reach a global optimum when $N$ is relatively small. We therefore propose a multi-substructure based structural similarity constraint that brings in a prior from the reference face to guarantee an approximately globally optimal solution. Given a reference 3D structure, the geometric vectorization $V_{ref}$ represents the lengths of the edges connecting the 40 feature points shown in Fig. 1; correspondingly, $V_T$ represents that of the structure $T$ being optimized. A similarity index between $V_{ref}$ and $V_T$ can thus be used as a constraint term in the optimization function; in fact, $V_{ref}$ is closely similar to the optimal $V_T$. The optimization is reformulated with a similarity-constrained objective $g$, given in Eq. (9).

Fig. 3. Sub-structures in the 3D structure for SSIM (G1: the whole face; G2: eye structure; G3: nose structure; G4: mouth structure; G5: cheek structure; G6: contour structure; G7: cross-organ structure).
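A minimal numpy sketch of the multi-substructure similarity used in this constraint follows; the sub-structure index sets are hypothetical placeholders for the groups G1-G7 of Fig. 3, and the stabilizing constants are illustrative:

```python
import numpy as np

C1, C2 = 1e-4, 9e-4  # small stabilizing constants (illustrative values)

def ssim(x, y):
    """Structural similarity of two 1-D signals (edge-length vectors here)."""
    ux, uy = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cxy = ((x - ux) * (y - uy)).mean()
    return ((2 * ux * uy + C1) * (2 * cxy + C2)) / \
           ((ux**2 + uy**2 + C1) * (vx + vy + C2))

def multi_substructure_sim(v_ref, v_t, groups):
    """Eq. (10): product over sub-structures of (SSIM + 1)/2, in (0, 1]."""
    sim = 1.0
    for idx in groups:
        sim *= (ssim(v_ref[idx], v_t[idx]) + 1.0) / 2.0
    return sim

# hypothetical sub-structure index sets over a 12-edge length vector
groups = [np.arange(12), np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
v_ref = np.linspace(10.0, 21.0, 12)
```

The $(SSIM + 1)/2$ mapping keeps each factor non-negative, so a single weak sub-structure pulls the product down without making it negative.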

The x and y components of $T$ can then be obtained by
$$T(x,y) = \hat S_0 \cdot \frac{50}{\sqrt{(\hat u_1 - \hat u_2)^2 + (\hat v_1 - \hat v_2)^2}},$$
where $(\hat u_1, \hat v_1)$ and $(\hat u_2, \hat v_2)$ are the two eye centers in $\hat S_0$.

The distance between a non-frontal-view 2D face structure $S$ and its 3D structure $T$ can be written as the reprojection error:
$$d^2 = \lVert S - s \cdot R_{1:2,:} \cdot T \rVert_2^2 = \sum_{j=1}^{np} \left\lVert \begin{bmatrix} u'_j \\ v'_j \end{bmatrix} - s \cdot R_{1:2,:} \cdot \begin{bmatrix} x_j \\ y_j \\ z_j \end{bmatrix} \right\rVert_2^2, \tag{7}$$
where $s$ and $R$ are the scale parameter and rotation matrix, respectively. The parameters $\theta, s, \hat n_x, \hat n_y, \hat n_z$ and the depth values $z_j,\ j = 1, 2, \ldots, np$ can then be obtained by minimizing the distance: $\min_{\theta, s, \hat n_x, \hat n_y, \hat n_z} d$. If more than one non-frontal-view face image is available, e.g. $N$ images, the depth values can be estimated by minimizing $\sum_{i=1}^{N} d_i$, the sum of the distances between the non-frontal-view 2D structures $S_i$ and the corresponding 2D projections of the 3D structure $T$:
$$\min_{\Theta, \hat N_x, \hat N_y, \hat N_z, S, Z} f(\Theta, \hat N_x, \hat N_y, \hat N_z, S, Z) = \min_{\Theta, \hat N_x, \hat N_y, \hat N_z, S, Z} \sum_{i=1}^{N} \sum_{j=1}^{np} \left\lVert \begin{bmatrix} u'_{i,j} \\ v'_{i,j} \end{bmatrix} - s_i \cdot R^i_{1:2,:} \cdot \begin{bmatrix} x_j \\ y_j \\ z_j \end{bmatrix} \right\rVert_2^2, \tag{8}$$
where $\Theta = \{\theta_1, \ldots, \theta_N\}$, $\hat N_x = \{\hat n_{x1}, \ldots, \hat n_{xN}\}$, $\hat N_y = \{\hat n_{y1}, \ldots, \hat n_{yN}\}$, $\hat N_z = \{\hat n_{z1}, \ldots, \hat n_{zN}\}$, $S = \{s_1, \ldots, s_N\}$, and $Z = \{z_1, \ldots, z_{np}\}$. $R^i$ is a rotation matrix defined by $R^i = I + \sin\theta_i \cdot M^i + (1 - \cos\theta_i)(M^i)^2$, where $I$ is the identity matrix, $\theta_i$ is the rotation angle around the rotation axis $\langle \hat n_{xi}, \hat n_{yi}, \hat n_{zi} \rangle$, and
$$M^i = \begin{bmatrix} 0 & -\hat n_{zi} & \hat n_{yi} \\ \hat n_{zi} & 0 & -\hat n_{xi} \\ -\hat n_{yi} & \hat n_{xi} & 0 \end{bmatrix}.$$
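The rotation parameterization and reprojection error of Eqs. (7)-(8) can be sketched in numpy as follows (an illustrative implementation, not the authors' code):

```python
import numpy as np

def rotation(theta, axis):
    """Rodrigues' formula R = I + sin(theta) * M + (1 - cos(theta)) * M^2,
    where M is the cross-product matrix of the (unit) rotation axis."""
    nx, ny, nz = np.asarray(axis, dtype=float) / np.linalg.norm(axis)
    M = np.array([[0.0, -nz,  ny],
                  [ nz, 0.0, -nx],
                  [-ny,  nx, 0.0]])
    return np.eye(3) + np.sin(theta) * M + (1.0 - np.cos(theta)) * (M @ M)

def reprojection_error(S, s, R, T):
    """Eq. (7): squared distance between a 2-D structure S (2 x np) and the
    scaled projection of the 3-D structure T (3 x np) by the top two rows of R."""
    return np.sum((S - s * (R[:2, :] @ T)) ** 2)
```

Minimizing this error over the pose and depth variables is exactly what Eq. (8) sums over the $N$ images.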

The similarity-constrained objective is
$$g(\Theta, \hat N_x, \hat N_y, \hat N_z, S, Z) = \frac{\sqrt{\frac{1}{N \cdot np}\, f(\Theta, \hat N_x, \hat N_y, \hat N_z, S, Z)}}{\mathrm{SIM}(V_{ref}, V_T)}, \tag{9}$$
which is minimized over $(\Theta, \hat N_x, \hat N_y, \hat N_z, S, Z)$,

where the root-mean-square term in the numerator represents the mean error offset caused by the incompletely matched projection from 3D to 2D, and the similarity term in the denominator has value range $(0, 1]$; the division balances the two terms. We define the constraint term by the Structural Similarity index (SSIM), an index measuring the structural similarity of two signals [29]. The SSIM of two signals $x$ and $y$ is
$$\mathrm{SSIM}(x, y) = \frac{(2 u_x u_y + C_1)(2 \sigma_{xy} + C_2)}{(u_x^2 + u_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$

with means $u_x$, $u_y$, standard deviations $\sigma_x$, $\sigma_y$, covariance $\sigma_{xy}$, and constants $C_1$ and $C_2$. The SSIM index is suitable for 3D-structure similarity. However, applying SSIM to the whole face structure alone is not guaranteed to reach the global optimum, since it may ignore local similarity; in other words, a weak global similarity cannot guarantee local similarity. In practice, the similarity term should combine multiple sub-structures of the whole model: G1, G2, G3, G4, G5, G6 and G7 cover the whole face, eye structure, nose structure, mouth structure, cheek structure, contour structure, and cross-organ structure, respectively, as shown in Fig. 3. The multi-substructure similarity is integrated as
$$\mathrm{SIM}(V_{ref}, V_T) = \frac{\prod_{i=1}^{K} \left(\mathrm{SSIM}(V_{ref}^{G_i}, V_T^{G_i}) + 1\right)}{2^K}, \tag{10}$$

where $\{G_i\}_{i=1:K}$ are the defined sub-structures, and $V_{ref}^{G_i}$ and $V_T^{G_i}$ are the vectorizations of the reference and objective structures restricted to $G_i$, respectively.

Optimization strategy of depth estimation: The optimization in Formula (9) has many variables and is a large-scale nonlinear programming problem. When $N$ is relatively small, it can be solved by an interior-point algorithm [30]. To further simplify the optimization, we suggest optimizing the poses $\{\Theta, \hat N_x, \hat N_y, \hat N_z, S\}$ and the depth $\{Z\}$ separately and iteratively. Such an iterative scheme is proved to be convergent and effective in [19], which established a similar case alternating between two terms. As described in Algorithm 1, the depth values are initialized from the given reference model to estimate the pose; at the $k$-th iteration, the pose parameters $\{\Theta^k, \hat N_x^k, \hat N_y^k, \hat N_z^k, S^k\}$ are obtained by solving the nonlinear optimization in Step 1, and then the pose can be further applied to

optimize the depth $\{Z^k\}$ in Step 2. The steps are repeated until the absolute difference $\lvert v^k - v^{k-1} \rvert$ of the objective function value between two iterations falls below a small value $\epsilon$; we set $\epsilon = 0.01$ in this paper.

Algorithm 1. Depth estimation and optimization with the multi-substructure based structural similarity constraint.

Input: $(\Theta^0, \hat N_x^0, \hat N_y^0, \hat N_z^0, S^0) \in \mathbb{R}^{N \times 1} \times \mathbb{R}^{N \times 1} \times \mathbb{R}^{N \times 1} \times \mathbb{R}^{N \times 1} \times \mathbb{R}^{N \times 1}$, $Z^0 \in \mathbb{R}^{np \times 1}$. Initially set $(\Theta^0, \hat N_x^0, \hat N_y^0, \hat N_z^0, S^0) = (0, 0, 0, 1, 1)$, $Z^0 = Z_{ref}$, $v^0 = 0$, $k = 1$, where $Z_{ref}$ is the depth of the reference model and $v^k$ is the objective function value at the $k$-th iteration.
1: while not converged do
2:   Step 1: solve the nonlinear optimization for the pose estimation $\{\Theta^k, \hat N_x^k, \hat N_y^k, \hat N_z^k, S^k\}$:
     $$\min_{\Theta, \hat N_x, \hat N_y, \hat N_z, S} g(\Theta, \hat N_x, \hat N_y, \hat N_z, S, Z^{k-1}),$$
     where the variables $(\Theta, \hat N_x, \hat N_y, \hat N_z, S)$ start at the seed point $(\Theta^{k-1}, \hat N_x^{k-1}, \hat N_y^{k-1}, \hat N_z^{k-1}, S^{k-1})$.
3:   Step 2: solve the nonlinear optimization for the depth estimation $\{Z^k\}$:
     $$\min_{Z} g(\Theta^k, \hat N_x^k, \hat N_y^k, \hat N_z^k, S^k, Z),$$
     where the variable $Z$ starts at the seed point $Z^{k-1}$.
4:   Step 3: compute the minimum value of the objective function:
     $$v^k = g(\Theta^k, \hat N_x^k, \hat N_y^k, \hat N_z^k, S^k, Z^k).$$
     If $\lvert v^k - v^{k-1} \rvert < \epsilon$, convergence is reached; else set $k = k + 1$.
5: end while
Output: the pose $\{\Theta^*, \hat N_x^*, \hat N_y^*, \hat N_z^*, S^*\}$ and the depth $\{Z^*\}$.

2.3. Face reconstruction based on B-spline

Human faces are remarkably similar in global properties, including size, aspect ratio, and location of the main features; only the details vary across individuals, genders and expressions. To exploit this similarity, a B-spline is used to reconstruct the individual face surface: its basis functions and control grid describe the surface curvature and normal at each point, filling in the gaps left by the sparse individual 3D structure through deformation of the control points.

B-spline face modeling: We model the face as a uniform B-spline surface $F$ of degree $k \times l$, with control points $B = \{b_{ij}\}_{m \times n}$ and knot vectors $U = \{u_i\}_{i=1}^{m+k}$, $V = \{v_j\}_{j=1}^{n+l}$ splitting the $uv$ parameter plane in the $u$ and $v$ directions. Then
$$F(u, v) = \sum_{i=1}^{m} \sum_{j=1}^{n} R_{i,j}(u, v)\, b_{i,j}, \tag{11}$$
with $R_{i,j}(u, v) = N_{i,k}(u) \cdot N_{j,l}(v)$, and spline basis functions
$$N_{i,1}(u) = \begin{cases} 1 & u_i \le u < u_{i+1}, \\ 0 & \text{otherwise}, \end{cases} \qquad N_{i,k}(u) = \frac{(u - u_i)\, N_{i,k-1}(u)}{u_{i+k-1} - u_i} + \frac{(u_{i+k} - u)\, N_{i+1,k-1}(u)}{u_{i+k} - u_{i+1}}.$$

The control grid of a B-spline surface offers flexible control of the surface shape, and each control point has only a local effect. A facial point $F(u, v)$ on the 3D model is calculated as
$$F(u, v) = r_{u,v} \cdot [B_x \mid B_y \mid B_z],$$
where $r_{u,v}$, $B_x$, $B_y$ and $B_z$ are vectorizations of the coefficient matrix $R$ and the control points $B$, respectively; in particular, $r_{u,v}$ is a $1 \times mn$ vector, and $B_x$, $B_y$ and $B_z$ are $mn \times 1$ vectors. For the simplest B-spline surface of order $4 \times 4$ defined by a $4 \times 4$ control grid (Fig. 4), the coefficient vector $r_{u,v}$ is very sparse, with only 16 nonzero entries; four corner points and eight edge points determine the position and shape, and the remaining four middle points determine the convexity.

Fig. 4. The simplest B-spline surface of degree 4 x 4 on a face, controlled by a 4 x 4 grid B.

Owing to the affine invariance of the B-spline shape and its good expressiveness, the facial-point transformation $Tr_f$ can be used to guide the control-point transformation $Tr_b$ to reconstruct the face surface approximately. During this process, the spline basis functions recover the surface information missing from the 3D structure. With only the estimated facial points it is difficult to derive a control grid that generates exactly those points, but using $Tr_f$ as the grid transformation guarantees that the estimated facial points are approximately close to the real facial points on the spline surface. (The boundary effect of the constructed spline surface is ignored in this derivation.) We thus obtain the grid deformation, represented by the 3D structure transformation, as
$$\hat{Tr}_b = \arg\min \lVert T^1 - T^0 \circ Tr_f \rVert_2^2. \tag{12}$$

The reference control points $B^0$ can be inversely calculated from the reference surface points $F^0$, assuming the boundary conditions of the surface are supplied and complete. The objective surface is then obtained as
$$[F^1_x \mid F^1_y \mid F^1_z] = R \cdot ([B^0_x \mid B^0_y \mid B^0_z] \circ \hat{Tr}_b), \tag{13}$$
where $R$ is the coefficient matrix of the control points, each row of which stores a vector $r_{u,v}$ at a position $(u, v)$, and whose number of rows equals the number of surface points to be generated. In fact, it is unnecessary to consider the boundary of the surface, because the final face model is clipped to remove it.

To sum up, the method includes two steps: 3D structure optimization from images, and B-spline face generation based on control-point optimization. The 3D structure is optimized precisely via the combination of depth estimation and the low-rank method, and representing the face with a B-spline of degree 4 x 4 yields a continuous model of the face shape.
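A minimal sketch of the B-spline machinery of Eq. (11) is given below, using the Cox-de Boor recursion with the usual 0/0 := 0 convention; the knot vectors and grid values are illustrative, not the paper's face data:

```python
import numpy as np

def bspline_basis(i, k, u, knots):
    """Cox-de Boor recursion for N_{i,k}(u) (order k, degree k-1).
    Intervals are half-open, so u at the right endpoint needs special casing."""
    if k == 1:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = 0.0
    if knots[i + k - 1] > knots[i]:
        left = (u - knots[i]) / (knots[i + k - 1] - knots[i]) \
               * bspline_basis(i, k - 1, u, knots)
    right = 0.0
    if knots[i + k] > knots[i + 1]:
        right = (knots[i + k] - u) / (knots[i + k] - knots[i + 1]) \
                * bspline_basis(i + 1, k - 1, u, knots)
    return left + right

def surface_point(u, v, B, k, l, U, V):
    """Eq. (11): F(u,v) = sum_i sum_j N_{i,k}(u) N_{j,l}(v) b_{ij} for an
    m x n grid of 3-D control points B (shape m x n x 3)."""
    m, n = B.shape[:2]
    F = np.zeros(3)
    for i in range(m):
        Nu = bspline_basis(i, k, u, U)
        if Nu == 0.0:
            continue
        for j in range(n):
            F += Nu * bspline_basis(j, l, v, V) * B[i, j]
    return F
```

On a clamped knot vector the order-4 basis functions reduce to the cubic Bernstein polynomials and sum to one, which is why a constant control grid reproduces itself: the surface point is an affine combination of the control points, the property behind the affine invariance used for the grid deformation above.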


Fig. 5. Examples in KinectFaceDB reference set: the unified 3D data points with labeled facial points (the second row) corresponding to the face images (the first row).

Fig. 6. Sample data and evaluation: the upper is sample “bs000” with 12 types; the left bottom is a comparison between reconstruction and neutral ground truth; and the right bottom is the mean value and the standard deviation of reconstruction error.

3. Experiment

In the experiments, the preprocessing of the images and the reference 3D models is described first. Second, we carry out evaluation and comparison on the Bosphorus database and a synthetic model. Third, we show experiments with real unconstrained data: (1) images taken from the Labeled Faces in the Wild (LFW) database, and (2) image collections of HM Queen Elizabeth II covering young to old.

3.1. Preprocessing of images and reference 3D data

To obtain an effective collection of face photos, we detect all the images using the Viola-Jones face detector [31], and extract the photos recognized by the face detector for each person. Each photo is then fed into a facial-point detector [26] to find the 40 facial points shown in Fig. 1. The images are aligned to a 120 x 100 canonical window for optimizing the frontal structure with the RASL algorithm [27].

A reference 3D dataset, e.g. KinectFaceDB [28], is necessary to build the unknown surface of a person. Here we take 104 neutral scans from KinectFaceDB as the reference set and label the facial points according to the RGB images. We then collect the 3D vertex information of the facial part, unify the point positions according to the eye locations shown in Fig. 1, and sample the surface points with PCA. The final 3D points of the reference data are shown in Fig. 5.
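For the alignment step, a landmark-based similarity transform into the canonical window can be sketched as follows; the canonical eye coordinates inside the 120 x 100 window are illustrative assumptions, not values from the paper:

```python
import numpy as np

# canonical eye centers inside the 120 x 100 window (illustrative coordinates)
CANON_LEFT = np.array([35.0, 45.0])
CANON_RIGHT = np.array([85.0, 45.0])

def similarity_to_canonical(left_eye, right_eye):
    """2x3 similarity transform (rotation + uniform scale + translation)
    mapping the detected eye centers onto the canonical positions."""
    src = right_eye - left_eye
    dst = CANON_RIGHT - CANON_LEFT
    # complex-number trick: the map z -> a*z + b with a = dst / src
    a = complex(*dst) / complex(*src)
    A = np.array([[a.real, -a.imag],
                  [a.imag,  a.real]])
    t = CANON_LEFT - A @ left_eye
    return np.hstack([A, t[:, None]])
```

The resulting 2x3 matrix can be applied to every landmark (and, with an image-warping routine, to the pixels) to produce the canonically aligned inputs that the rank-minimization step expects.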


3.2. Evaluation on the Bosphorus database

To evaluate the proposed method, we reconstruct faces from the Bosphorus 3D database [32]: 11 persons are collected, each with 12 types of photos. Their B-spline faces are constructed as example data, and the 3D results can be seen in Fig. 6. The 12 photo types cover various factors including pose, expression, and occlusion. Each photo corresponds to an original ground-truth 3D scan, so it is convenient to test the effectiveness of the method. In the experiment, the face structure and surface are evaluated against the neutral data as the baseline.

The evaluation results are shown via similarity, correlation, and error in Table 1. The reconstructed 3D face structures approach the ground truth with a mean SSIM of 0.9312, and the correlation coefficient of the structure depth reaches about 0.9034, meaning that the results fit the real face well in 3D structure. The surface error is measured by $\lvert z_{rec} - z_{true} \rvert / 50$ per point on the surface, where 50 unit lengths is the distance between the two eye centers. The SSIM and Corr_coeff values of sample "bs013" are relatively weak, but its mean error is lower than that of most others, so it is also a good reconstruction. Table 1 shows a mean error of about 0.1203, whose mean and standard-deviation distribution maps are shown at the bottom right of Fig. 6, illustrating that the error mainly occurs at the chin, the two sides of the nose bridge, and the eyebrows. Two inferences can be drawn: (1) the number of pose types is not enough for a better reconstruction, as can be seen in the LFW experiment in the next section; (2) these areas have a high rate of curvature change, and the error could be decreased by adding local facial points for further optimization.

3.3. Comparison with other methods

Our method is compared with [28] on the synthetic data of [4]. We conduct reconstructions in two sessions. The first session contains

non-frontal, neutral face images: 45 images collected under various light directions and poses. The second session contains non-frontal, non-neutral face images: 70 images of 7 expressions collected under various light directions and poses. Example images and expressions can be seen in Fig. 7. We compute the mean and standard deviation of the error between the reconstructions and the neutral ground-truth model following the experiment in [28]; the error is measured by $100 \cdot \lvert z_{rec} - z_{gt} \rvert / z_{gt}$ per point on the surface. We get a reconstruction error of 0.90 ± 1.16% from neutral non-frontal images, and 0.85 ± 1.16% from non-neutral non-frontal images (Table 2). By contrast, the results in [28] show a reconstruction error of 0.76 ± 0.48% from neutral frontal images and 0.93 ± 0.68% when non-frontal neutral images are added; moreover, as the percentage of non-neutral images increases, the mean error grows. This indicates that our method is less sensitive to pose, whereas the method in [28] is more sensitive to pose and expression. The only weakness is that our result has a larger standard deviation, due to the error at the chin, the two sides of the nose bridge, and the eyebrows.

3.4. Photos in the wild

We test our method on eight persons from the LFW dataset with different poses and expressions (Ariel Sharon, Arnold Schwarzenegger, Colin Powell, Donald Rumsfeld, George W. Bush, Gerhard Schroeder, Gloria Macapagal Arroyo and Hugo Chavez). 35 images are used for each reconstruction, and Fig. 8(1) and (2) shows the results. The models are rendered with a mouth-closed photo of each individual, as such photos are more likely to have a neutral expression. The models look good from any viewpoint, confirming that the reconstructions express the individual face structure well and fit well with pose-varied photos.

Table 1
Mean and standard deviation evaluation on structure and surface: SSIM of structure, correlation coefficient of facial-point depth, and surface reconstruction error.

personID    SSIM(structure)    Corr_coeff(Depth)    Error(surface)
bs000       0.9696             0.9483               0.1418
bs005       0.9251             0.9197               0.1454
bs012       0.9585             0.9694               0.1381
bs013       0.8672             0.7071               0.0888
bs014       0.9239             0.9201               0.0838
bs015       0.9306             0.9454               0.1344
bs016       0.9379             0.9086               0.1226
bs017       0.9399             0.9256               0.1531
bs024       0.8823             0.9359               0.0984
bs026       0.9494             0.7883               0.0763
bs028       0.9692             0.9697               0.1404
μ ± σ       0.9312 ± 0.0327    0.9034 ± 0.0815      0.1203 ± 0.0380

Fig. 7. Example data: Row 1 shows non-frontal images of the neutral model; Row 2 shows the 7 expressions under various poses and lighting.


Table 2
Comparison of results between our method and [28].

Image set                              Image num    Per.NonNeutral(%)    Per.NonFrontal    Mean Rec Error ± std(%)
Neutral + Frontal [28]                 –            0                    0%                0.76 ± 0.48
Neutral + Frontal + NonFrontal [28]    –            0                    –                 0.93 ± 0.68
Neutral + NonNeutral + Frontal [28]    –            30                   0%                ~3 ± 1.8
                                       –            35                   0%                ~3.6 ± 2
                                       –            42                   0%                ~6.9 ± 2.9
                                       –            48                   0%                ~8.9 ± 5.9
Neutral + NonFrontal (ours)            45           0                    100%              0.90 ± 1.16
NonNeutral + NonFrontal (ours)         70           80                   100%              0.85 ± 1.16

Per.NonNeutral is short for percentage of non-neutral images; Per.NonFrontal is short for percentage of non-frontal images.

Fig. 8. Reconstruction results: (1) for “Hugo_Chavez”, “George_W_Bush” and “Arnold_Schwarzenegger” in different views; (2) for “Ariel_Sharon”: surfaces with different textures in corresponding views; (3) comparison with different numbers of images in three viewpoints.

Fig. 9. Experiment on LFW: (a) Data in the alignment schema: original, aligned and low rank data, as well as the mean images and structures. (b) Reconstruction examples without and with the MsSSIM constraint: texture images and their results; observe that MsSSIM enhances the lifelikeness of local shapes, e.g. the nose.

We also compare reconstructions obtained from different numbers of photos in Fig. 8(3). In general, the result approaches a stable, good appearance as the number of images increases. In detail, the reconstruction still resembles a face even with a single image, but the upper part around the eyes has a poor structure; the model is further optimized as more images are added, and the improvement stabilizes once the number exceeds eight. Notably, the shape of the face is unrealistic when four photos are used, because the facial poses lack diversity. In fact, a good reconstruction relies to a great extent on photos that contain various poses. Thus, the method has a great advantage for reconstruction from wild photos.

Frontal Structure in Alignment: The frontal structure optimization by low rank for Colin Powell is shown in Fig. 9(a). The original images contain various faces that differ in pose, expression,


Fig. 10. Face reconstructions for the aging Queen Elizabeth II: infancy, juvenile, youth, mid-life, quinquagenarian, and aged, in four viewpoints: frontal, profile, bottom, local.

and occlusion. Even then, they share similar faces in the low rank data, as shown in the third row. Moreover, the average of the aligned faces is much sharper than that of the unaligned ones, closely approximating the frontal face. The frontal structure can then be obtained from the average structure of the aligned images by Eq. (4). Effectiveness of SSIM: To test the performance of the Multi-substructure based Structural Similarity (MsSSIM) constraint, we give a visual comparison between reconstructions with and without MsSSIM. Because the shape information in wild images is limited, a thorough B-spline face optimization cannot always be ensured; e.g. the second example in Fig. 9(b) does not produce a protuberant nose without MsSSIM. By contrast, the result with the MsSSIM constraint is more realistic. Therefore, MsSSIM is able to enhance the performance of the optimization on both the whole and the local structures.
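For reference, the underlying structural similarity index [29] between two aligned signals x and y has the closed form SSIM(x, y) = ((2 μx μy + C1)(2 σxy + C2)) / ((μx² + μy² + C1)(σx² + σy² + C2)); the paper's MsSSIM constraint applies such a measure per facial substructure. Below is a minimal sketch of the base index only (the stabilizer constants c1, c2 and the sample data are illustrative, and the substructure split is not reproduced here):

```python
import numpy as np

def ssim_index(x, y, c1=1e-4, c2=9e-4):
    """Structural similarity [29] between two aligned patches or point sets.
    c1 and c2 are small constants that stabilize the divisions."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()  # covariance term sigma_xy
    return (((2.0 * mx * my + c1) * (2.0 * cov + c2))
            / ((mx * mx + my * my + c1) * (vx + vy + c2)))

patch = np.array([0.2, 0.4, 0.6, 0.8])
s_same = ssim_index(patch, patch)              # identical patches score 1
s_diff = ssim_index(patch, patch[::-1] * 0.5)  # dissimilar patches score lower
```

Because the index combines luminance, contrast, and structure in one ratio, it rewards matching local shape rather than pointwise equality, which is the property the MsSSIM constraint exploits.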

3.5. Young-to-aged Queen Elizabeth II

To verify that our method can capture face structure well from images, we test it on a collection of 45 images of Elizabeth II covering her life from young to aged. We reconstruct 3D faces for six different age periods according to Elizabeth's age, using multiple photos per period: infancy (4), juvenile (6), youth (5), mid-life (8), quinquagenarian (7), and aged (7). The results in Fig. 10 clearly show convincing reconstructions. It can be seen from the bottom viewpoint that the young face tends to be smooth while the old one tends to be rough, and that the face grows wider and flatter with age; this agrees with medical observations. The work is interesting and significant: age variation is known to be a great challenge in 2D face recognition, and the key is to capture the law of shape change brought by the age factor using the depth information in a 3D model. The structure-optimization based model is convenient for describing this law of structure change, and it can be used to improve recognition under age variation, which we will pursue in future work.

4. Conclusion

In this paper, we present a method for 3D face reconstruction that leverages the structure and spline cues in a collection of wild photos. The method works well, particularly for the challenging case of capturing the key structure changes across poses, ages, etc. A key contribution is the combination of low rank and B-spline, which leverages the 3D structure from photos "in the wild" and enables a B-spline face surface that preserves individual structure and physical features. In the future, there are a number of potential improvements. For example, our reconstruction is still not metrically correct, and its optimization relies on a reference B-spline model that cannot handle the case of a mouth-open face. There is also much to build on top of the current work, such as high-detail modeling, and face recognition and verification based on individual models. In particular, even the mouth-open case can be described by a combination of multiple B-spline shapes.

Acknowledgments This work was partially supported by the National Natural Science Foundation of China (No. 61304262).

Appendix A. Supplementary material Supplementary data associated with this paper can be found in the online version at http://dx.doi.org/10.1016/j.neucom.2015.11.090.

References

[1] T. Weise, B. Leibe, L. Van Gool, Fast 3d scanning with automatic motion compensation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR'07), IEEE, Minneapolis, 2007, pp. 1–8.
[2] A.K. Roy-Chowdhury, R. Chellappa, Statistical bias in 3-d reconstruction from a monocular video, IEEE Trans. Image Process. 14 (8) (2005) 1057–1062.
[3] H.-S. Koo, K.-M. Lam, Recovering the 3d shape and poses of face images based on the similarity transform, Pattern Recognit. Lett. 29 (6) (2008) 712–723.
[4] L. Zhang, N. Snavely, B. Curless, S.M. Seitz, Spacetime faces: high-resolution capture for modeling and animation, in: Data-Driven 3D Facial Animation, Springer, London, 2007, pp. 248–276.
[5] T. Beeler, B. Bickel, P. Beardsley, B. Sumner, M. Gross, High-quality single-shot capture of facial geometry, ACM Trans. Graph. (TOG) 29 (4) (2010) 40.
[6] T. Beeler, F. Hahn, D. Bradley, B. Bickel, P. Beardsley, C. Gotsman, R.W. Sumner, M. Gross, High-quality passive facial performance capture using anchor frames, ACM Trans. Graph. (TOG) 30 (2011) 75.
[7] C. Wu, C. Stoll, L. Valgaerts, C. Theobalt, On-set performance capture of multiple actors with a stereo camera, ACM Trans. Graph. (TOG) 32 (6) (2013) 161.
[8] A. Ghosh, G. Fyffe, B. Tunwattanapong, J. Busch, X. Yu, P. Debevec, Multiview face capture using polarized spherical gradient illumination, ACM Trans. Graph. (TOG) 30 (6) (2011) 129.
[9] I. Kemelmacher-Shlizerman, R. Basri, 3d face reconstruction from a single image using a single reference face shape, IEEE Trans. Pattern Anal. Mach. Intell. 33 (2) (2011) 394–405.
[10] A. Thelen, S. Frey, S. Hirsch, P. Hering, Improvements in shape-from-focus for holographic reconstructions with regard to focus operators, neighborhood-size, and height value interpolation, IEEE Trans. Image Process. 18 (2009) 151–157, http://dx.doi.org/10.1109/TIP.2008.2007049.
[11] S. Suwajanakorn, I. Kemelmacher-Shlizerman, S.M. Seitz, Total moving face reconstruction, in: Computer Vision – ECCV 2014, Springer, Zurich, 2014, pp. 796–812.
[12] E. Prados, O. Faugeras, Shape from shading: a well-posed problem?, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, IEEE, San Diego, 2005, pp. 870–877.
[13] R. Zhang, P.-S. Tsai, J.E. Cryer, M. Shah, Shape-from-shading: a survey, IEEE Trans. Pattern Anal. Mach. Intell. 21 (8) (1999) 690–706.
[14] S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S.M. Seitz, R. Szeliski, Building Rome in a day, Commun. ACM 54 (10) (2011) 105–112.
[15] N. Snavely, S.M. Seitz, R. Szeliski, Photo tourism: exploring photo collections in 3d, ACM Trans. Graph. (TOG) 25 (3) (2006) 835–846.
[16] C. Tomasi, T. Kanade, Shape and motion from image streams under orthography: a factorization method, Int. J. Comput. Vis. 9 (2) (1992) 137–154.
[17] J. Fortuna, A.M. Martinez, Rigid structure from motion from a blind source separation perspective, Int. J. Comput. Vis. 88 (3) (2010) 404–424.
[18] J. Gonzalez-Mora, F. De la Torre, N. Guil, E.L. Zapata, Learning a generic 3d face model from 2d image databases using incremental structure-from-motion, Image Vis. Comput. 28 (7) (2010) 1117–1129.
[19] Z.-L. Sun, K.-M. Lam, Q.-W. Gao, Depth estimation of face images using the nonlinear least-squares model, IEEE Trans. Image Process. 22 (1) (2013) 17–30.
[20] I. Kemelmacher-Shlizerman, S.M. Seitz, Face reconstruction in the wild, in: 2011 IEEE International Conference on Computer Vision (ICCV), IEEE, Barcelona, 2011, pp. 1746–1753.
[21] V. Blanz, A. Mehl, T. Vetter, H.P. Seidel, A statistical method for robust 3d surface reconstruction from sparse data, in: Proceedings of the 2nd International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT 2004), 2004, pp. 293–300.
[22] V. Blanz, T. Vetter, A morphable model for the synthesis of 3d faces, in: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, 1999, pp. 187–194.
[23] J. Heo, M. Savvides, In between 3D active appearance models and 3D morphable models, in: Computer Vision and Pattern Recognition, 2009, http://dx.doi.org/10.1109/CVPRW.2009.5204300.
[24] J. Roth, Y. Tong, X. Liu, Unconstrained 3d face reconstruction, Trans. Graph. 33 (4) (2014) 43.
[25] J. Ahlberg, CANDIDE-3 – An Updated Parameterised Face, Springer, Berlin, Heidelberg, 2001.
[26] X.P. Burgos-Artizzu, P. Perona, P. Dollár, Robust face landmark estimation under occlusion, in: IEEE International Conference on Computer Vision (ICCV), IEEE, Sydney, 2013, pp. 1513–1520.
[27] Y. Peng, A. Ganesh, J. Wright, W. Xu, Y. Ma, RASL: robust alignment by sparse and low-rank decomposition for linearly correlated images, IEEE Trans. Pattern Anal. Mach. Intell. 34 (11) (2012) 2233–2246.
[28] R. Min, N. Kose, J.-L. Dugelay, KinectFaceDB: a Kinect database for face recognition, IEEE Trans. Syst. Man Cybern. Syst. 44 (11) (2014) 1534–1548.
[29] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
[30] R.H. Byrd, M.E. Hribar, J. Nocedal, An interior point algorithm for large-scale nonlinear programming, SIAM J. Optim. 9 (4) (1999) 877–900.
[31] P. Viola, M.J. Jones, Robust real-time face detection, Int. J. Comput. Vis. 57 (2) (2004) 137–154.
[32] A. Savran, N. Alyüz, H. Dibeklioğlu, O. Çeliktutan, B. Gökberk, B. Sankur, L. Akarun, Bosphorus database for 3d face analysis, in: Biometrics and Identity Management, Springer, Roskilde, 2008, pp. 47–56.
[33] K. Zhang, L. Zhang, M.H. Yang, Fast compressive tracking, IEEE Trans. Pattern Anal. Mach. Intell. 36 (10) (2014) 2002–2015.
[34] H. Song, Robust visual tracking via online informative feature selection, Electron. Lett. 50 (25) (2014) 1931–1933.
[35] C. Xu, W. Tao, Z. Meng, Z. Feng, Robust visual tracking via online multiple instance learning with Fisher information, Pattern Recognit. 48 (12) (2015) 3917–3926.

Weilong Peng received his Master's degree from Tianjin University, Tianjin, China, in 2013. He is now a Ph.D. candidate at Tianjin University. His research interests include face recognition, machine learning, and 3D face reconstruction.

Chao Xu received his Ph.D. from the School of Computer Science and Technology, Tianjin University. He is currently an associate professor at Tianjin University. His research interests lie in pattern recognition, affective computing, and knowledge management.

Zhiyong Feng received his Ph.D. from Tianjin University. He is currently a professor at Tianjin University. His research interests lie in artificial intelligence, knowledge engineering, and services computing.