State-of-the-art of 3D facial reconstruction methods for face recognition based on a single 2D training image per person



Pattern Recognition Letters 30 (2009) 908–913


Pattern Recognition Letters journal homepage: www.elsevier.com/locate/patrec

Martin D. Levine *, Yingfeng (Chris) Yu

Center for Intelligent Machines and Dept. of Elec. and Computer Eng., McGill University, Montreal, Quebec, Canada H3A 2A7

Article info

Article history: Received 21 December 2007. Received in revised form 13 March 2009. Available online 29 March 2009. Communicated by G. Sanniti di Baja.

Keywords: State-of-the-art; 3D reconstruction; Face recognition; Single 2D training image; 3D morphable model

Abstract

3D facial reconstruction systems attempt to reconstruct 3D facial models of individuals from their 2D photographic images or video sequences. Currently published face recognition systems, which exhibit well-known deficiencies, are largely based on 2D facial images, although 3D image capture systems can better encapsulate the 3D geometry of the human face. Accordingly, face recognition research is gradually shifting from the legacy 2D domain to the more sophisticated 2D to 3D or 2D/3D hybrid domain. Currently there exist four methods for 3D facial reconstruction. These are: the Stochastic Newton Optimization method (SNO) [Blanz, V., Vetter, T., 1999. A morphable model for the synthesis of 3D faces. In: Proc. 26th Annu. Conf. on Computer Graphics and Interactive Techniques, SIGGRAPH. pp. 187–194; Blanz, V., Vetter, T., 2003. Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Machine Intell. 25(9), 1063–1074; Blanz, V., 2001. Automatische Rekonstruction der Dreidimensionalen Form von Gesichtern aus einem Einzelbild. Ph.D. Thesis, Universitat Tubingen, Germany], the inverse compositional image alignment algorithm (ICIA) [Romdhani, S., Vetter, T., 2003. Efficient, robust and accurate fitting of a 3D morphable model. In: IEEE Int. Conf. on Computer Vision, vol. 2, no. 1. pp. 59–66], the linear shape and texture fitting algorithm (LiST) [Romdhani, S., Blanz, V., Vetter, T., 2002. Face identification by fitting a 3D morphable model using linear shape and texture error functions. In: Proc. ECCV, vol. 4. pp. 3–19], and the shape alignment and interpolation method correction (SAIMC) [Jiang, D., Hu, Y., Yan, S., Zhang, L., Zhang, H., Gao, W., 2005. Efficient 3D reconstruction for face recognition. Pattern Recogn. 38(6), 787–798]. The first three, SNO, ICIA + 3DMM, and LiST, can be classified as "analysis-by-synthesis" techniques, and SAIMC can be separately classified as a "3D supported 2D model".

In this paper, we introduce, discuss and analyze the difference between these two frameworks. We begin by presenting the 3D morphable model (3DMM; Blanz and Vetter, 1999), which forms the foundation of all four of the reconstruction techniques described here. This is followed by a review of the basic "analysis-by-synthesis" framework and a comparison of the three methods that employ this approach. We next review the "3D supported 2D model" framework and introduce the SAIMC method, comparing it to the other three. The characteristics of all four methods are summarized in a table that should facilitate further research on this topic. © 2009 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +1 514 398 7348; fax: +1 514 398 7115 (M.D. Levine).
doi:10.1016/j.patrec.2009.03.011

1. Introduction

3D facial reconstruction systems attempt to reconstruct 3D facial models of individuals from their 2D photographic images or video sequences. Research in the areas of computer graphics and machine vision has given rise to a number of concepts that aid in facial reconstruction. Certain of these, for instance "Structure from Motion (SFM)" (Bregler et al., 2000; Pollefeys, 1999; Zhao and Chellappa, 2000), "Shape from Contours" (Brady and Yuille, 1983; Himaanshu et al., 2004) and "Shape from Silhouette" (Matusik et al., 2000; Moghaddam et al., 2003), have found use in commercial facial reconstruction applications. There exist several excellent survey papers addressing these and other theories, implementations, and applications of currently available facial reconstruction methods (Bowyer et al., 2004; Zhou and Chellappa, 2005). All of these methods require several 2D images or video frames per person to reconstruct the 3D version of the person's face. However, it is not always practically feasible to obtain several 2D images per person.

In this paper, we focus exclusively on the existing state-of-the-art of 3D facial reconstruction using only a single frontal 2D image as the training input for each person. In particular, we address only those 3D approaches that have been applied to face recognition and which have been tested on face databases. Currently available face recognition systems, which are largely based on 2D facial images, suffer from low reliability due to their sensitivity to lighting conditions, facial expressions and changes
in head pose. The inadequate performance of these systems should come as no surprise, as these 2D-based systems ignore the fact that the human face is three-dimensional and therefore needs to be described by a 3D model. In recent years, several advancements in the fields of computer vision and computer graphics have helped researchers acquire highly accurate 3D scanning data (generally a 2.5D range image) and thus better capture the 3D geometry of the human face. Accordingly, face recognition research has shifted moderately from the legacy 2D domain to the more sophisticated 2D to 3D or 2D/3D hybrid domain.

Although 3D face recognition is theoretically superior in performance to 2D face recognition, not only with regard to varying head pose but also to illumination changes, the high cost and limited applicability of 3D sensing devices (range sensors) have restricted progress. Furthermore, 3D face recognition technologies tend to be constrained by the use of presently available legacy databases, which are primarily of a 2D nature. Thus, 3D facial reconstruction could play an important role in bridging the gap between existing 2D face databases and state-of-the-art 3D face recognition systems. The benefits are shared: the construction of 3D facial data from 2D images may help 3D face recognition systems, and the synthesized 2D images generated from existing 2D images using a 3D model may enhance the capabilities of existing 2D face recognition systems.

To the best of our knowledge, there exist at present four methods for facial reconstruction. These are:

1. Stochastic Newton Optimization method (SNO), proposed by Blanz and Vetter (1999, 2003) and Blanz (2001).
2. Inverse compositional image alignment (ICIA) applied to 3D morphable models (3DMM) (Romdhani and Vetter, 2003). ICIA was initially introduced by Baker and Matthews (2001) and was later adapted to the 3DMM by Romdhani and Vetter (2003).
3. Linear shape and texture fitting (LiST), introduced by Romdhani et al. (2002).
4. Shape alignment and interpolation method correction (SAIMC), introduced by Jiang et al. (2005) to retrieve a 3D facial model from a single frontal 2D image.[1]

[1] Note that this material also appears in Hu et al. (2004), which was published by the same research group but using different notation.

The first three methods, SNO, ICIA + 3DMM, and LiST, were classified as "analysis-by-synthesis" techniques by their authors. However, in the recent survey paper on 3D face recognition methods by Scheenstra et al. (2005), the first three algorithms are renamed "template matching approaches" and SAIMC is separately classified as being a "3D supported 2D model" approach. In this paper, we use the nomenclature "analysis-by-synthesis" to refer to SNO, ICIA + 3DMM, and LiST and "3D supported 2D model" to refer to the approach of Jiang et al. We discuss the differences between these two frameworks later in this paper.

Although "analysis-by-synthesis" and "3D supported 2D model" are significantly different from each other, they are both based on the 3D graphics model called the 3D morphable model (3DMM). In this paper, we begin in Sections 2–4 by introducing the 3DMM (Blanz and Vetter, 1999), which forms the foundation of all four of the reconstruction techniques described here. Sections 3 and 4 deal specifically with shape and texture correspondence, respectively, which form the backbone of this approach. Following this, in Section 5, we review the basic framework of "analysis-by-synthesis" (Blanz and Vetter, 1999, 2003; Blanz, 2001; Romdhani and Vetter, 2003; Romdhani et al., 2002) and the three methods that employ this technique, and compare them in Section 6. Then in Section 7, we review the "3D supported 2D model" (Blanz and
Vetter, 1999) framework and introduce the SAIMC method. Generally, due to the lack of literature available concerning the two frameworks, there tends to be some confusion regarding the implementation of each. Finally, in Section 8, all four methods are summarized and conclusions are drawn.

2. Face modeling: 3D morphable model (3DMM)

A 3D human facial model is represented by its shape and texture: the shape is captured as vertices in three dimensions and the texture of the face describes the color information. The 3D morphable model (3DMM) infrastructure, developed by Blanz and Vetter (1999) and used in state-of-the-art facial reconstruction systems, decomposes any 3D human facial model into a convex linear combination of shape and texture vectors spanning a set of exemplars which describe a realistic human face. This combination is fully controlled by shape and texture parameters, called $\alpha$ and $\beta$, which act as weights. We begin by reviewing several important concepts of the 3DMM.

In order to obtain their morphable model, Blanz and Vetter scanned the heads of 100 young adults with a Cyberware 3D laser scanner, thereby providing a set of exemplar faces. Each laser-scanned head was saved in a 75,972-vertex, 150,958-facet format.[2] A basic assumption of the work of Blanz and Vetter is that any 3D facial surface can be realistically modeled using a convex combination of the shape and texture vectors of the set of exemplar faces, where the shape and texture vectors are given as follows:

\[
S_i = \left(x_{i,1}, y_{i,1}, z_{i,1}, \ldots, x_{i,75972}, y_{i,75972}, z_{i,75972}\right)^T \in \mathbb{R}^{227{,}916 \times 1},
\]
\[
T_i = \left(R_{i,1}, G_{i,1}, B_{i,1}, \ldots, R_{i,75972}, G_{i,75972}, B_{i,75972}\right)^T \in \mathbb{R}^{227{,}916 \times 1},
\]

where $S_i$ and $T_i$ have dimension 227,916 (= 3 × 75,972). Blanz and Vetter performed principal component analysis on the 100 laser-scanned heads and processed shape and texture information separately. Information from each of these exemplar heads was saved as shape and texture vectors $(S_i, T_i)$, where $i = 1, \ldots, 100$. Before continuing, we define several important quantities based on the observation that exactly the same vertices and facets define all heads. Let

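\[
S_0 = \frac{1}{100}\sum_{i=1}^{100} S_i, \qquad T_0 = \frac{1}{100}\sum_{i=1}^{100} T_i, \tag{2.1}
\]
\[
S = [S_1, S_2, \ldots, S_{100}] \in \mathbb{R}^{(3 \cdot 75972) \times 100}, \qquad T = [T_1, T_2, \ldots, T_{100}] \in \mathbb{R}^{(3 \cdot 75972) \times 100},
\]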

where (S0,T0) represents the mean face of the 100 exemplar faces and is referred to as the Generic Mean Face (GMF). The covariance matrices of shape and texture are defined as

\[
C_S = \sum_{i=1}^{100} (S_i - S_0)(S_i - S_0)^T, \qquad C_T = \sum_{i=1}^{100} (T_i - T_0)(T_i - T_0)^T. \tag{2.2}
\]

The eigenvectors and eigenvalues of the covariance matrices are written as

\[
E_S = [e^s_1, e^s_2, \ldots, e^s_{100}], \qquad e_S = [\lambda^s_1, \ldots, \lambda^s_{100}], \tag{2.3}
\]
\[
E_T = [e^t_1, e^t_2, \ldots, e^t_{100}], \qquad e_T = [\lambda^t_1, \ldots, \lambda^t_{100}]. \tag{2.4}
\]

[2] Generally, 3D laser-scanned faces are obtained using a varied number of vertices and type of triangular connectivity depending on the characteristic features of the individual face. However, in order to facilitate their particular scheme, Blanz and Vetter chose to use the Optical Flow Method to force every face to have exactly 75,972 vertices sharing exactly 150,958 identical facets.


Fig. 1. The shape correspondence concept of the 3D morphable model (3DMM). This figure illustrates that the indices of the vertices in the 3DMM are assigned on the basis of anthropomorphic markers on the 3D face model and not on the basis of the 3D coordinates x, y and z that represent the position of the marker, which will vary from one person's scanned head to another. In the above, for example, S0 is the shape of the Generic Mean Face (GMF) and Si is an arbitrary 3D head in the 3DMM database. The anthropomorphic marker representing the tip of the nose is assigned a specific index, say k = 37,814. Therefore, the tip of the nose for every scanned head will always be indexed as k = 37,814, even though the x, y and z coordinates that represent the position of the tip of the nose on the face will change.

The eigenvectors (eigenfaces) $e^s_i$ and $e^t_i$ and the eigenvalues $\lambda^s_i$ and $\lambda^t_i$ of the covariance matrices $C_S$ and $C_T$ are such that

\[
C_S e^s_i = \lambda^s_i e^s_i, \qquad C_T e^t_i = \lambda^t_i e^t_i. \tag{2.5}
\]
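The quantities defined in Eqs. (2.1), (2.2) and (2.5) can be illustrated numerically. The following is a minimal sketch on toy data, not the authors' implementation: the function name `pca_exemplars` and the toy sizes are assumptions made for illustration, and the eigenvectors are obtained from a thin SVD of the centred exemplar matrix rather than by forming the 227,916 × 227,916 covariance matrix explicitly.

```python
import numpy as np

def pca_exemplars(X):
    """X: (d, m) matrix whose columns are exemplar vectors (S_1..S_m or T_1..T_m).

    Returns the mean vector, the eigenvectors of
    C = sum_i (x_i - x0)(x_i - x0)^T as columns, and the matching
    eigenvalues in descending order.
    """
    x0 = X.mean(axis=1, keepdims=True)       # mean face, as in Eq. (2.1)
    A = X - x0                               # centred exemplars, shape (d, m)
    # With the thin SVD A = U diag(s) V^T we get C U = A A^T U = U diag(s**2),
    # so the columns of U are eigenvectors of C with eigenvalues s**2.
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    return x0[:, 0], U, s ** 2

rng = np.random.default_rng(0)
d, m = 30, 10                                # toy sizes, not 227,916 x 100
S = rng.normal(size=(d, m))                  # random stand-in for scanned shapes
S0, E_S, lam_S = pca_exemplars(S)

# Check the eigenvalue relation of Eq. (2.5) on the toy data.
C_S = (S - S0[:, None]) @ (S - S0[:, None]).T
residual = np.abs(C_S @ E_S - E_S * lam_S).max()
```

Because the exemplars are centred, the smallest eigenvalue is numerically zero, which is why at most m − 1 directions carry information.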

The most important observation and conclusion that can be drawn from the 3DMM after PCA modeling is that every one of the 100 laser-scanned exemplar heads can be decomposed into the form:

\[
S_j = S_0 + \sum_{i=1}^{100} \frac{\alpha_i}{\lambda_{s,i}}\, e^s_i, \qquad T_j = T_0 + \sum_{i=1}^{100} \frac{\beta_i}{\lambda_{t,i}}\, e^t_i, \tag{2.6}
\]

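where $e^s_i$ and $e^t_i$ represent the i-th eigenvectors of the covariance matrices $C_S$ and $C_T$, respectively. Moreover, following Blanz and Vetter's work, any arbitrary human facial 3D model can be constructed by varying the parameters $(\alpha_k, \beta_k)$, $k = 1, \ldots, 100$, such that $\alpha$ and $\beta$ are modeled by the normal probability density functions:

\[
p_S(\alpha) \sim \exp\left(-\frac{1}{2}\sum_{k=1}^{100}\left(\frac{\alpha_k}{\lambda^s_k}\right)^2\right) \quad \text{and} \quad p_T(\beta) \sim \exp\left(-\frac{1}{2}\sum_{k=1}^{100}\left(\frac{\beta_k}{\lambda^t_k}\right)^2\right). \tag{2.7}
\]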
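As an illustration of how such a prior ranks candidate parameter vectors, here is a minimal sketch. It assumes the Gaussian form $p_S(\alpha) \sim \exp(-\frac{1}{2}\sum_k (\alpha_k/\lambda^s_k)^2)$ for Eq. (2.7); the function name `log_prior` and the three-component toy values are hypothetical and are not taken from the paper.

```python
import numpy as np

def log_prior(coeffs, lam):
    """Unnormalised log-density: -0.5 * sum_k (coeff_k / lam_k)**2."""
    coeffs = np.asarray(coeffs, dtype=float)
    lam = np.asarray(lam, dtype=float)
    return -0.5 * np.sum((coeffs / lam) ** 2)

lam_s = np.array([4.0, 2.0, 1.0])        # hypothetical lambda^s_k values
alpha_mean = np.zeros(3)                 # all-zero coefficients: the mean face
alpha_far = np.array([4.0, 2.0, 1.0])    # one lambda away along every axis

lp_mean = log_prior(alpha_mean, lam_s)   # highest possible value
lp_far = log_prior(alpha_far, lam_s)     # strictly lower than lp_mean
```

Under this reading the mean face is the most probable model, and a fitting algorithm can add the negative log-prior to its image error as a regularization term.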

The authors conclude that $\alpha$ and $\beta$ can perform as model descriptors, which can sufficiently represent a specific 3D human facial model. Accordingly, the objective of the 3D facial reconstruction problem is now redefined from one of retrieving 3D shape and texture to one of searching for the best possible fitted parameters, $\alpha$ and $\beta$, such that the rendered morphable model optimally matches the single 2D input image presented to the reconstruction system.[3]

[3] See Section 5.1 and Table 1 for details.

3. Shape correspondence

At this point we emphasize the significance of the indices assigned to the shape and texture vectors. These run from 1 to 75,972 and are saved as an ordered set, with each index specifically corresponding to one morphological point on the 3D face. The ordered indices are assigned (linked) to a specific anthropomorphic marker on the face and not to the 3D coordinates of the feature on the 3D model. This concept, which will be referred to as "shape correspondence" throughout this paper, forms an integral part of 3DMM. Blanz and Vetter (1999) obtained this
3D-to-3D correspondence for the 75,972 vertices by using optical flow and the cylindrical projection method. Fig. 1 illustrates this concept.

4. Texture correspondence

To fully appreciate the texture correspondence concept, it is first necessary to present a texture information retrieval procedure used in the area of computer graphics. Clearly, a 3D model must always contain both 3D shape and texture information. However, accurate 3D texture information can also be retained in a 2D image format by using the so-called UV map[4] as shown in Fig. 2. UV mapping is a process which can roughly be described by the coordinate mapping function $g : (u, v) \in \mathbb{R}^2 \to (x, y, z) \in \mathbb{R}^3$ that warps a 2D image containing texture information onto a 3D mesh. Thus, the opposite 2D flattening process $f$ is the inverse of the mapping to $\mathbb{R}^3$ and satisfies the relationship $f = g^{-1} : (x, y, z) \in \mathbb{R}^3 \to (u, v) \in \mathbb{R}^2$ with respect to $g$. The pixel information $I(u, v)$ in UV space is saved in the RGB color format using chromaticity such that $I(u, v) = \{r, g, b\}$. Hence, we can summarize the relationship between the UV mapping, the flattening process and the RGB color space in the following manner:

\[
I(u, v) = I\big(f_u(x, y, z), f_v(x, y, z)\big) = I\big(g_u^{-1}(x, y, z), g_v^{-1}(x, y, z)\big) = \{r, g, b\}, \tag{2.8}
\]
\[
I(x, y, z) = I\big(g_x(u, v), g_y(u, v), g_z(u, v)\big) = I\big(f_x^{-1}(u, v), f_y^{-1}(u, v), f_z^{-1}(u, v)\big) = \{r, g, b\}. \tag{2.9}
\]

The subscripts in the above equations denote the component of the mapping along the indicated coordinate. To achieve flattening from the 3D shape to the corresponding 2D embedded space, Blanz and Vetter used cylindrical projection (Blanz and Vetter, 1999). The outcome of this flattening process is stored as a file called textmap.obj. In general, we would expect that flattening would result in UV maps that vary with different geometries of the 3D face. However, when applied to 3DMM, which we saw exhibits the property of shape correspondence, the UV map is independent of the 3D geometry. Given the k-th vertex $V_k = [x_k \; y_k \; z_k]$ of an arbitrary 3D shape $S_i$, the flattening process $f$, which is a function of index k (and not of x, y and z), has a unique output $(u_k, v_k)$ in UV space (ignoring the specific values of $x_k$, $y_k$ and $z_k$). Therefore, in the case of 3DMM, the flattening process $f$ is simplified to being a look-up table saved in textmap.obj. The key point here is that once we know the index of a vertex, the specific 3D geometric coordinates of that vertex do not contribute to the vertex's coordinates in UV space. The index is sufficient to retrieve the corresponding UV coordinates from textmap.obj. The texture (color) information of the vertex is saved at the location of its corresponding UV coordinates. An illustration of the concept of 3D-shape-to-2D-embedded-space correspondence (texture correspondence) is given in Fig. 2.

[4] The axes of the 2D UV map are called U and V.

Fig. 2. Three examples of a UV map belonging to three different individuals. The three individuals possess different 3D shapes according to each individual's different facial appearance. However, we can see that these shape differences do not affect the geometry observed in their UV maps. The only differences for the three different individuals in the UV maps are the pixel intensities (RGB color values). When these UV maps are transformed into 3D space by combining texture information with the shape information at each indexed vertex, the faces of the three persons will appear significantly different.
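The look-up-table view of the flattening process can be sketched as follows. This is only an illustration under stated assumptions: `uv_table` is a hypothetical in-memory stand-in for the per-index table (the actual layout of textmap.obj is not described here), the UV coordinates are taken to be integer pixel positions, and the image is indexed as row = v, column = u.

```python
import numpy as np

# Hypothetical per-index table: row k holds (u_k, v_k) in pixels.
# Toy size of 5 vertices; the 3DMM uses 75,972.
uv_table = np.array([[0, 0], [1, 0], [2, 1], [0, 2], [2, 2]])

def vertex_colors(uv_map, uv_table):
    """Return the (r, g, b) stored for every vertex index k.

    uv_map   : (H, W, 3) texture image of one individual
    uv_table : (n, 2) integer (u, v) per vertex index, shared by all faces
    """
    u, v = uv_table[:, 0], uv_table[:, 1]
    return uv_map[v, u]                      # row = v, column = u

rng = np.random.default_rng(1)
face_a = rng.integers(0, 256, size=(3, 3, 3))   # two individuals, same UV layout
face_b = rng.integers(0, 256, size=(3, 3, 3))
colors_a = vertex_colors(face_a, uv_table)      # shape (5, 3)
colors_b = vertex_colors(face_b, uv_table)
```

Note that the 3D coordinates of the vertices never enter the computation: the same table serves every individual and only the pixel values differ.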

5. Analysis-by-synthesis framework

To date we are aware of three "analysis-by-synthesis" methods: Stochastic Newton Optimization (SNO) (Blanz and Vetter, 1999, 2003; Blanz, 2001), inverse compositional image alignment with 3DMM (ICIA + 3DMM) (Romdhani and Vetter, 2003) and linear shape and texture fitting (LiST) (Romdhani et al., 2002). These require as input a single arbitrary image of a face at a specific pose to reconstruct the corresponding synthesized 3D face using 3DMM. The estimated $\alpha$ and $\beta$ parameters that control the 3D shape and texture of the requisite 3D facial model are returned as an output of the reconstruction process. In a nutshell, "synthesis" refers to creating a virtual 3D face by estimating the needed parameters that completely describe the 3D model of the face and "analysis" refers to solving the overall face recognition problem using the information from the "synthesis" step. In this section, we briefly discuss the three methods that follow the "analysis-by-synthesis" framework.

5.1. Stochastic Newton Optimization method (SNO)

The Stochastic Newton Optimization method (see Appendix B in Jiang et al., 2005 for details) is the first published fitting algorithm for facial reconstruction. It optimizes not only the shape and texture parameters, alpha and beta, but also 22 other rendering parameters: pose angles (3 parameters), 3D translation (3 parameters), focal length (1 parameter), ambient light intensities (3 parameters), directed light intensities (3 parameters), the angles of the directed light (2 parameters), color contrast (1 parameter), and gains and offsets of color channels (6 parameters). Its main drawback is low efficiency: it is reported to require around 4.5 min to perform parameter estimation on a 2 GHz Pentium IV (Romdhani et al., 2004).
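The size of the SNO search space can be tallied with a short sketch. It assumes that the 3D translation has three components, which is what makes the listed groups add up to the stated 22 rendering parameters, and that alpha and beta have 100 coefficients each as in Section 2; the dictionary keys are descriptive labels, not identifiers from any implementation.

```python
# Rendering parameters optimised by SNO in addition to alpha and beta.
# "3D translation" is taken to have three components (an assumption).
rendering_params = {
    "pose angles": 3,
    "3D translation": 3,
    "focal length": 1,
    "ambient light intensities": 3,
    "directed light intensities": 3,
    "directed light angles": 2,
    "color contrast": 1,
    "color channel gains and offsets": 6,
}

n_rendering = sum(rendering_params.values())   # rendering parameters only
n_model = 100 + 100                            # alpha and beta coefficients
n_total = n_rendering + n_model                # unknowns per input image
```

With these assumptions each fit estimates 222 unknowns from a single image, which is consistent with the long running time reported for SNO.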

Table 1. Comparison of the four existing facial reconstruction methods based on a single facial image. SNO, ICIA + 3DMM and LiST belong to the analysis-by-synthesis framework; SAIMC belongs to the 3D supported 2D models framework.

| Property | SNO | ICIA + 3DMM | LiST | SAIMC |
|---|---|---|---|---|
| Input image | Arbitrary view | Arbitrary view | Arbitrary view | Frontal view |
| Initialization | 7–8 feature points | 7–8 feature points | 7–8 feature points | 87 feature points |
| Shape reconstruction: gradient descent techniques | Stochastic version of Newton's method | Levenberg–Marquardt approximation | Levenberg–Marquardt approximation | No |
| Shape reconstruction: shape alignment | Every pixel is involved | Every pixel is involved | Every pixel is involved | Only 87 feature points are involved |
| Shape reconstruction: shape correction step | No | No | No | Yes, completed by Kriging interpolation |
| Shape reconstruction: do the alpha parameters control the properties of the whole reconstructed shape? | Yes | Yes | Yes | No |
| Texture recovery: methodology | Find optimal texture beta | Find optimal texture beta | Find optimal texture beta | Not clear since no method is provided |
| Texture recovery: relationship to shape reconstruction | Computed simultaneously | Computed simultaneously | Computed simultaneously | Computed separately |
| Texture recovery: facial texture | Synthesized | Synthesized | Synthesized | Actual |
| Efficiency | 4.5 min | 30 s | 54 s | 10 s |
| Quantitative analysis of reconstruction accuracy | No | No | No | No |
| Face recognition | Based on alpha, beta | Based on alpha, beta | Based on alpha, beta | Based on synthesized 2D images |

5.2. Inverse compositional image alignment with 3DMM (ICIA + 3DMM)

The inverse compositional image alignment (ICIA) algorithm, introduced by Baker and Matthews (2001), is an efficient 2D image alignment method based on the Lucas–Kanade matching algorithm (Baker et al., 2002), one of the most widely used techniques in computer vision for such applications as optical flow, tracking, mosaic construction, medical image registration and face coding. As originally published, ICIA was only capable of fitting 2D images. However, Romdhani and Vetter (2003) have extended ICIA to fit the 3D morphable model (3DMM; Blanz and Vetter, 1999) using the correspondence between the input image and the UV-map image of the 3DMM (called the reference frame in Romdhani and Vetter, 2003). It is reported that ICIA + 3DMM completes the fitting process within an average of 30 s (Romdhani et al., 2004) on the same machine as SNO.

Romdhani et al. (2005) identified two drawbacks of ICIA + 3DMM. Firstly, the adaptation of ICIA to 3DMM causes ICIA to lose its original efficiency, although the accuracy is improved. The efficiency and accuracy of this algorithm are discussed in Romdhani et al. (2004, 2005). Secondly, the algorithm is not able to handle directed light sources because it does not permit shading as SNO does.

5.3. Linear shape and texture fitting algorithm (LiST)

This algorithm is similar to ICIA + 3DMM and has been reported to be five times faster than the SNO algorithm.[5] We are not aware of any studies that compare ICIA + 3DMM and LiST.

[5] As mentioned above, it was reported that SNO requires 4.5 min on average to complete the fitting process. Based on the assumption that LiST works 5 times faster than SNO, LiST will require about 54 s (= 4.5 min × 60 s/min / 5). Accordingly, ICIA + 3DMM is the most efficient of the three "analysis-by-synthesis" approaches.

The efficient and accurate estimation obtained by LiST is mainly based on the unique correspondence between the UV map (model reference frame) and the input image. The assumption made here is that the correspondence achieved by using a 2D optical flow algorithm (Bergen and Hingorani, 1990) can be modeled by "a bilinear relationship between the 2D vertices projection (sic) and the shape (alpha) and rigid parameters" (Romdhani et al., 2005). The texture and illumination parameters are also recovered using this correspondence. LiST models the rigid parameters slightly differently from SNO: it uses a weak-perspective projection, while SNO uses a normal perspective projection. Romdhani et al. (2005) claim that the correspondence concept, which dominates LiST, sacrifices the accuracy of illumination and texture recovery because no shading information is used in LiST, in contrast to SNO. This is one of the drawbacks of the
LiST algorithm. Moreover, Gill and Levine (2005) attempted to implement the LiST algorithm but the synthesized images they obtained were unsatisfactory. They claim that one of the factors causing this may be the accuracy of the correspondence obtained by using optical flow. Essentially, LiST has the same drawbacks as ICIA + 3DMM.

6. Comparison of the three "analysis-by-synthesis" methods

With reference to SNO, ICIA + 3DMM, and LiST, we note that these three algorithms share the following similarities:

1. There are no specific constraints on the input image and the input face can be at an arbitrary pose angle. Five or more manual feature points are needed for algorithm initialization.
2. The idea is to fit existing 3D morphable models to 2D images by finding the optimal alpha (shape parameters) and beta (texture parameters) of the 3D morphable model (3DMM) plus the relevant shape transformation parameters (referred to as the "rigid parameters" by Blanz and Vetter, 1999): rotation matrix, scale, focal length of the camera, translation vector and so on.
3. A gradient descent algorithm is employed for minimizing the non-linear objective function. Furthermore, during the iterative fitting process, the updates of the alpha and beta parameters are highly correlated and synchronous.
4. The reconstructed 3D model is entirely determined by the estimated parameters: alpha (shape parameters), beta (texture parameters) and the relevant shape transformation parameters. Alpha and beta are assumed to be sufficient for face recognition purposes. Assume $\alpha_i \in \mathbb{R}^{100}$ and $\beta_i \in \mathbb{R}^{100}$ represent the reconstructed 3DMM shape and texture parameters of image $i$, and similarly $\alpha_j$ and $\beta_j$ for image $j$. We further define $\vec{c}_i = [\alpha_i, \beta_i] \in \mathbb{R}^{200}$ and $\vec{c}_j = [\alpha_j, \beta_j] \in \mathbb{R}^{200}$. The recognition decision is simply based on the similarity score between $\vec{c}_i$ and $\vec{c}_j$.
In other words, since the alpha and beta parameters completely and uniquely describe the 3D facial model, there is no need to reconstruct the face and then use the reconstructed 3D model for face recognition. The parameters can be used directly.

Blanz and Vetter referred to the above methodology as "analysis-by-synthesis". Modeling using 3DMM accomplishes the "synthesis" by finding the required parameters ($\alpha$ and $\beta$), and recognition based on these parameters takes care of the "analysis" task.

7. 3D supported 2D models

Jiang et al. (2005) discuss SAIMC, which is also based on the 3D morphable model, but its reconstruction approach is somewhat different from that of SNO, ICIA + 3DMM and LiST. The differences are as follows:

1. SAIMC restricts the input image to a single 2D frontal facial image and does not permit arbitrary facial poses. It requires 84 feature points to initialize the 3D reconstruction algorithm.
2. Shape and texture parameter updates are completely separate. SAIMC still contains a step to evaluate the alpha parameters $\alpha \in \mathbb{R}^{100}$ but assumes that the reconstructed shape is not accurate enough in the x–y plane and so performs an additional interpolation correction step (e.g., Kriging interpolation). The outcome of this correction step is that the estimated alpha parameters are modified to represent the final reconstructed shape and the significance of the earlier estimated alpha parameters is lost. (This is due to the fact that the correction step modifies the (x, y) values of the shape obtained from alpha.) As a result
we cannot use alpha for the recognition decision as is done in SNO, ICIA + 3DMM and LiST. For texture retrieval, there is no specific information provided in the paper. Jiang et al. simply mention that "the 2D image is projected orthogonally to the 3D geometry to generate the texture".
3. In SAIMC, the estimated alpha parameters do not represent the final reconstructed shape and no beta parameters are returned. This is why the "analysis" framework of SNO, ICIA + 3DMM and LiST does not hold for SAIMC.
4. For 2D face recognition, SAIMC uses reconstructed 3D face models to synthesize 2D images by projecting the 3D model onto the 2D plane. The synthesized images are also used as training data.

8. Conclusions

Although both major frameworks involve shape parameter estimation, the procedures are significantly different. To begin with, the "3D supported 2D models" framework requires several feature points on the input image (Jiang et al. use 87). The step that achieves shape parameter estimation, referred to as "shape alignment" in the "3D supported 2D model" approach, fits the 3D morphable model (3DMM) to these feature points only, while non-feature points are not considered by the "shape alignment" procedure. This is appreciably different from the "analysis-by-synthesis" framework, which updates shape parameters globally (all pixels in the 2D image are taken into consideration by the fitting procedure). This is why the "3D supported 2D models" framework does not require a gradient descent technique, the most time-consuming step, to minimize a non-linear objective function as do SNO, ICIA + 3DMM, and LiST. But the ensuing superior efficiency of "3D supported 2D models" is more or less achieved by sacrificing the accuracy of shape reconstruction.

In SAIMC, a shape alignment is used to reconstruct the shape based on the manually selected 84 feature points. Non-feature points do not contribute to the actual reconstruction but the correction does improve the accuracy of the (x, y) coordinates of the overall estimated shape. Unfortunately, the z (depth) information, which is dominated by the alpha parameters obtained in the earlier "shape alignment" process, cannot be fixed. This is the main predicament associated with the "correction". Moreover, the relevant shape transformation parameters (e.g., rotation matrix, scale, focal length of the camera, and translation vector) are tainted in the "3D supported 2D models" case because only one frontal 2D image is used as the input.

As was explained above, the "analysis-by-synthesis" and "3D supported 2D models" frameworks are almost completely different from each other except for the fact that both utilize a 3D morphable model (3DMM). 3DMM is the only intersecting aspect of these two reconstruction frameworks.

The efficiency of the algorithms is estimated as:

- SNO: 4.5 min.
- ICIA + 3DMM: on average 30 s.
- LiST: 54 s (rough estimate).
- SAIMC: 10 s and "fifteen times faster than LiST".
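The arithmetic behind these estimates can be reproduced with a short sketch. It uses only the figures quoted in this paper (4.5 min for SNO, a factor of five for LiST, 30 s for ICIA + 3DMM and 10 s for SAIMC); the variable names are illustrative, and since the figures come from different reports the ratios should be read as indicative only.

```python
# Timings quoted in the text, converted to seconds.
sno_s = 4.5 * 60            # SNO: 4.5 min on a 2 GHz Pentium IV
list_s = sno_s / 5          # LiST: assumed five times faster than SNO

times_s = {
    "SNO": sno_s,
    "ICIA + 3DMM": 30.0,    # reported average
    "LiST": list_s,         # rough estimate derived above
    "SAIMC": 10.0,
}

# Speed-up of each method relative to SNO.
speedup_vs_sno = {name: sno_s / t for name, t in times_s.items()}
fastest = min(times_s, key=times_s.get)
```

The derived LiST figure depends entirely on the assumed factor of five, so it is an estimate rather than a measurement.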

Table 1 summarizes the qualitative differences between the four methods. Finally, and importantly, Table 1 shows that no quantitative or comparative analysis of the reconstruction accuracy for
any of the four methods has appeared in the literature. This remains to be done in the future.

Acknowledgements

The authors would like to acknowledge the financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC).

References

Baker, S., Matthews, I., 2001. Equivalence and efficiency of image alignment algorithms. In: IEEE Comput. Soc. Conf. on Computer Vision and Pattern Recognition (CVPR'01), vol. 1. pp. 1090–1097.
Baker, S., Gross, R., Matthews, I., 2002. Lucas–Kanade 20 years on: A unifying framework: Part 1. Technical Report CMU-RI-TR-02-16. Robotics Institute, Carnegie Mellon University.
Bergen, J.R., Hingorani, R., 1990. Hierarchical motion-based frame rate conversion. Technical Report. David Sarnoff Research Center, Princeton, NJ.
Blanz, V., 2001. Automatische Rekonstruction der Dreidimensionalen Form von Gesichtern aus einem Einzelbild. Ph.D. Thesis, Universitat Tubingen, Germany.
Blanz, V., Vetter, T., 1999. A morphable model for the synthesis of 3D faces. In: Proc. 26th Annu. Conf. on Computer Graphics and Interactive Techniques, SIGGRAPH. pp. 187–194.
Blanz, V., Vetter, T., 2003. Face recognition based on fitting a 3D morphable model. IEEE Trans. Pattern Anal. Machine Intell. 25 (9), 1063–1074.
Bowyer, K.W., Chang, K., Flynn, P., 2004. A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition. Comput. Vision Image Und. 101 (1), 1–15.
Brady, M., Yuille, A.L., 1983. An extremum principle for shape from contour. IEEE Trans. Pattern Anal. Machine Intell. 6 (3), 288–301.
Bregler, C., Hertzmann, A., Biermann, H., 2000. Recovering non-rigid 3D shape from image streams. In: Proc. IEEE Comput. Soc. Conf. on Computer Vision and Pattern Recognition, vol. 2. pp. 690–696.
Gill, G.S., Levine, M.D., 2005. Searching for the holy grail: A completely automated 3D morphable model. Technical Report, March 15, 2005. Department of Electrical & Computer Engineering & Center for Intelligent Machines, McGill University, Montreal, Canada.
Himaanshu, G., RoyChowdhury, A.K., Chellappa, R., 2004. Contour-based 3D face modeling from a monocular video. In: British Machine Vision Conference, BMVC04, September 7–9. Kingston University, London.
Hu, Y., Jiang, D., Yan, S., Zhang, L., Zhang, H., 2004. Automatic 3D reconstruction for face recognition. In: Proc. 6th IEEE Int. Conf. on Automatic Face and Gesture Recognition. pp. 843–848.
Jiang, D., Hu, Y., Yan, S., Zhang, L., Zhang, H., Gao, W., 2005. Efficient 3D reconstruction for face recognition. Pattern Recogn. 38 (6), 787–798.
Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L., 2000. Image-based visual hulls. In: Proc. Int. Conf. on Computer Graphics and Interactive Techniques, SIGGRAPH, 2000. pp. 369–374.
Moghaddam, B., Lee, J.H., Pfister, H., Machiraju, R., 2003. Model-based 3D face capture with shape-from-silhouettes. In: IEEE Int. Workshop on Analysis and Modeling of Faces and Gestures (AMFG), Nice, France. pp. 20–27.
Pollefeys, M., 1999. Metric 3D Surface Reconstruction from Uncalibrated Image Sequences. Ph.D. Thesis, Katholieke Universiteit Leuven.
Romdhani, S., Vetter, T., 2003. Efficient, robust and accurate fitting of a 3D morphable model. In: IEEE Int. Conf. on Computer Vision, vol. 2, no. 1. pp. 59–66.
Romdhani, S., Blanz, V., Vetter, T., 2002. Face identification by fitting a 3D morphable model using linear shape and texture error functions. In: Proc. ECCV, vol. 4. pp. 3–19.
Romdhani, S., Blanz, V., Basso, C., Vetter, T., 2004. Morphable Models of Faces. In: Li, S.Z., Jain, A.K. (Eds.), Handbook of Face Recognition. Springer, New York. p. 395.
Romdhani, S., Pierrard, J.S., Vetter, T., 2005. 3D morphable face model, a unified approach for analysis and synthesis of image. In: Zhao, W., Chellappa, R. (Eds.), Face Processing: Advanced Modeling and Methods. Elsevier, p. 768.
Scheenstra, A., Ruifrok, A., Veltkamp, R., 2005. A survey of 3D face recognition methods. In: Fifth Int. Conf. on Audio- and Video-Based Biometric Person Authentication. Rye Brook, New York.
Zhao, W.Y., Chellappa, R., 2000. SFS based view synthesis for robust face recognition. In: Proc. IEEE Int. Automatic Face and Gesture Recognition. pp. 285–292.
Zhou, S., Chellappa, R., 2005. Beyond a single still image: Face recognition from multiple still images and videos. Face Processing: Advanced Modeling and Methods. Academic Press Inc, New York. p. 547.