The THU multi-view face database for videoconferences and baseline evaluations


To appear in: Neurocomputing. Received 25 September 2015; revised 9 March 2016; accepted 28 March 2016. DOI: http://dx.doi.org/10.1016/j.neucom.2016.03.055

Xiaoming Tao (a), Linhao Dong (b,*), Yang Li (a), Jianhua Lu (a,b)

(a) Department of Electronic Engineering, Tsinghua University, Beijing, 100084, China
(b) School of Aerospace Engineering, Tsinghua University, Beijing, 100084, China

Abstract

In this paper, we present a face video database and its acquisition. The database contains 31,500 video clips of 100 individuals from 20 countries. The primary purpose of building this database is to serve as a standardized set of test video sequences for research related to videoconferencing, such as gaze correction and model-based face reconstruction. Specifically, each subject was filmed by 9 groups of synchronized webcams under 7 illumination conditions and was requested to complete a series of designated actions. Thus, face variations including lip shape, occlusion, illumination, pose, and expression are well represented in each video clip. Compared to existing databases, the proposed THU face database provides multi-view video sequences with strict temporal synchronization, which enables evaluations of current and future gaze-correction methods for conversational video communication. In addition, we discuss an evaluation protocol based on our database for gaze correction, under which three well-known methods were tested. Experimental results show that, under this protocol, performance comparisons can be obtained numerically in terms of peak signal-to-noise ratio (PSNR), demonstrating the strengths and weaknesses of these methods under different circumstances.

Keywords: Face video database, videoconferences, gaze-correction, multi-view, evaluation protocol.

This work was supported by the National Basic Research Project of China (973) (2013CB329006), the National Natural Science Foundation of China (NSFC, 61101071, 61471220, 61021001), and the Tsinghua University Initiative Scientific Research Program (20131089365). Part of this work was published in the proceedings of IEEE ICIP'15, Québec City, Canada.

* Corresponding author.
Email addresses: [email protected] (Xiaoming Tao), [email protected] (Linhao Dong), [email protected] (Yang Li), [email protected] (Jianhua Lu).

1. Introduction

1.1. Background and Motivation

Recently, video communication over the Internet has attracted much attention from the public. Its unique features make it irreplaceable in many scenarios, such as teleconferencing [1], telemedicine [2], and video calling. Related designs of multicast protocols, rate allocation, and household applications have also been well investigated [3, 4, 5], providing accessible solutions for home-based video communication. However, when it comes to user experience, current video communication applications often fall short. For instance, one of the most glaring incongruities in personal video calling is that natural eye contact cannot be maintained, since it is impossible to stare at the camera and the conversation window simultaneously.

To address this problem, several schemes [6, 7, 8, 9] were proposed to correct gaze direction for users at both ends. They are generally categorized into image-based ([6, 7]) and model-based ([8, 9, 10]) schemes. For image-based schemes, calibrated stereo cameras are usually applied to capture images from two separate views; utilizing binocular vision processing, a virtual central view is synthesized, and the gaze direction is thereby corrected. Differing from image-based schemes, model-based ones typically require a depth map from either a stereo camera set [8] or a depth camera [9] to build a 3D face model. In [10], the authors applied a generic 3D head mesh model to fit the features extracted from individual face images. Using morphing techniques, the face model can be rotated to the correct direction, while texture mapping transfers the texture patches of the original image onto the direction-corrected face model. These schemes, however, could only be evaluated qualitatively and subjectively, each in isolation, by observing the output synthetic sequences. There was no unified criterion with which the schemes could be compared quantitatively. Therefore, a specialized database is desirable in order to evaluate the performance of these gaze-correction methods.

1.2. Face Database and Its Design

Table 1: List of some face databases

Database | Number of subjects | Total images | Features | References
M2VTSDB (Release 1.00), Université Catholique de Louvain, Belgium | 37 | 185 | large pose changes, speaking subjects, eye glasses, time change | [11]
Yale Face Database, Yale University, USA | 15 | 165 | expressions, eye glasses, lighting | [12]
AR Face Database, Ohio State University (previously at Purdue University), USA | 126 | 4,000 | frontal pose, expression, illumination, occlusions, eye glasses, scarves | [13]
The Sheffield Face Database (previously the UMIST Face Database), University of Sheffield, UK | 20 | 564 | pose, gender, ethnic backgrounds, gray scale | [14]
XM2VTSDB, University of Surrey, UK | 295 | 1,180 videos | rotating head, speaking subjects, 3D models, high quality images | [15]
Cohn-Kanade AU-Coded Facial Expression Database, University of Pittsburgh, USA | 100 | 500 sequences | dynamic sequences of facial expressions | [16]
FERET Database (Color), National Institute of Standards and Technology, USA | 1,199 | 14,126 | color images, changes in appearance through time, controlled pose variation, facial expression | [17]
University of Oulu Physics-Based Face Database, University of Oulu, Finland | 125 | >2,000 | highly varied illumination (various color), eye glasses | [18]
Yale Face Database B, Yale University, USA | 10 | 4,050 | pose, illumination | [19]
PIE Database, Carnegie Mellon University, USA | 68 | 41,368 | very large database, pose, illumination, expression | [20]
MIT-CBCL Face Recognition Database, Massachusetts Institute of Technology, USA | 10 | >2,000 | synthetic images from 3D models, illumination, pose, background | [21]
VALID Database, University College Dublin, Ireland | 106 | 530 | highly variable office conditions | [22]
CAS-PEAL Face Database, Chinese Academy of Sciences, China | 1,040 | 99,594 | very large, expressions, accessories, lighting, simultaneous capture of multiple poses, Chinese | [23]
COX Face Database, Chinese Academy of Sciences, China | 1,000 | 1,000 + 3,000 video clips | low resolution videos, variant light conditions, surveillance-like scenario | [24]
Multi-PIE, Carnegie Mellon University, USA | 337 | 750,000 | very large database, pose, illumination, expression, high resolution frontal images | [25]
MUCT Landmark Database, University of Cape Town, South Africa | 624 | 3,755 | expressions, ethnic backgrounds, manual landmarks | [26]
NVIE Database, University of Science and Technology of China | >100 | >3,000 | visible and infrared imaging, spontaneous and posed expressions, landmarks | [27]

In the field of automatic face analysis and perception, significant progress has been made in several sub-domains such as facial expression analysis, face tracking, and 3D face modeling [28, 29, 30]. From a recognition viewpoint, these schemes usually involve learning-based solutions that require training datasets; their performance is often directly related to the size and the quality of the available training sets. Therefore, diversity and comprehensiveness are important factors for a database. Since the mid-1990s, several institutions have begun building face databases in various forms, including images [11]-[21],[23]-[27], videos [11],[15],[16],[22],[24], and audio-visual sequences [11],[15],[22].

Among these databases, early works including M2VTSDB [11], the Yale Face Database [12], and the AR Face Database [13] pioneered diverse methodologies in design. In M2VTSDB (Release 1.00), synchronized video and speech data as well as image sequences were recorded for research on identification strategies. A variety of facial variations, such as hairstyles and accessories, were presented. Also, 3D information can be obtained from the head movements in the video sequences. As an extended version of M2VTSDB, XM2VTSDB [15] collected data using digital devices and a larger sample space for more reliable and robust training. However, the illumination conditions in both of these databases are constant and ideal; therefore the impact of illumination on face identification cannot be effectively evaluated. In the Yale Face Database and the AR Face Database, various controlled illuminations were introduced, such that the impact of illumination conditions on the same face can be well studied. Both the Yale Face Database and the AR Face Database, nevertheless, present only frontal face images, precluding the ability to test the success rate of face recognition algorithms on profile faces.

It has been reported that the variations between images of the same face caused by illumination and viewing direction are likely larger than those caused by a change in face identity [31]. Hence, face images in databases like the Yale Face Database B [19], the PIE Database [20], Multi-PIE [25], and the CAS-PEAL Face Database [23] were taken under several designated poses and controlled illuminations. In the Yale Face Database B, a geodesic dome consisting of 64 strobes was constructed to offer 45 illumination conditions for training purposes; meanwhile, images of the same face were taken from 9 viewing directions to provide the possibility of building 3D models. However, facial expression was not considered as one of the variations. In both the PIE and Multi-PIE Databases, pose, illumination, and expression were variables for each face; the geometric and color calibration information for each camera was recorded as well. Specifically, more deliberate facial expressions were contained in Multi-PIE. With acquisition methods similar to PIE, the CAS-PEAL Face Database provides a large sample space of Chinese people as a supplement for a specific ethnicity. Considering the challenges of face recognition in surveillance scenarios, the COX Face Database [24] offers a large video-based database of 1,000 subjects acquired by 3 camcorders under uncontrolled light conditions with low spatial resolution to simulate a practical surveillance environment. It can also serve as a diverse training set with different views of faces in front of complex backgrounds. Table 1 lists some popular face databases with their respective features.

While the aforementioned databases have been successfully applied in several fields, they were not constructed specifically for videoconference or conversational video communication scenarios; in particular, none offers synchronized videos with multiple views. Aiming to provide a video database for videoconferences that supplements the currently available corpus, we have collected 31,500 raw videos from 100 volunteers, with a total size of around 5 TB. Our database is specially designed for conversational videos; accordingly, the scenes in the video clips simulate real-life conversation scenarios. To be specific, each subject was requested to complete a series of designated actions that may appear during a video communication, including speaking, head movements, drinking from a cup, reading, and performing one facial expression; through these actions, variations of lip shape, occlusion, pose, and expression were exhibited. Moreover, each subject was filmed under 7 different controlled illumination conditions. Multiple cameras were arranged in groups for video acquisition. For each clip, 5 out of the 13 available cameras were used for simultaneous recording. Such simultaneous multi-view recording enables us to propose an evaluation protocol for gaze-correction methods, with which three typical methods, including two image-based methods [6, 7] and one model-based method [8], were evaluated as examples to give baseline performances. The evaluation results show quantitative comparisons among these methods under different illuminations and action scenarios.

The remainder of this paper is organized as follows. The hardware setup is described in Section 2, where the arrayed camera system and illumination arrangements are explained. Post-processing, including geometric and color calibration for the cameras, is described in Section 3. In Section 4, the design and structure of the database are introduced with examples. The proposed evaluation protocol and the baseline experiment results are presented in Section 5, and the availability of the database is described in Section 6. Finally, we draw conclusions in Section 7.

2. Hardware Setup

This section describes the hardware setup for video acquisition in our studio. In Subsection 2.1, the layout and function of the webcam array is introduced. In Subsection 2.2, the detailed arrangements of the various illumination conditions are described.

2.1. Multi-View Camera Array

To record multi-view videos of one face, 13 Logitech C310 HD webcams are used, each of which is able to capture 1280x720 HD image sequences. A rectangular camera rack carrying all the cameras was custom designed to mimic the dimensions of a regular 16:9 19-inch monitor. As shown in Fig. 1, the 4 cameras labeled U, R, D, and L are called acquisition cameras, since one or more of them can be physically present in a conversational video communication system and actually used by gaze-correction methods; the cameras labeled C1 to C9 (sequenced in zig-zag order) are called reference cameras, since they do not actually appear on screen in a real system and are used to capture the target view which the gaze-correction methods must attempt to synthesize. They are placed across the interior of the rack to reflect the fact that the actual messaging window may not necessarily appear at the center of the monitor.

During each round of the video collection, only 1 out of the 9 reference cameras is active, while cameras U, R, D, and L are always recording. In other words, 5 cameras record videos simultaneously. Since one computer can support at most 3 webcams recording 1280x720 video at 25 fps at the same time, due to the bandwidth limit of the USB port, these 5 cameras are controlled by two inter-connected computers (3+2, respectively) in strict temporal synchronization. To handle any unexpected difference in frame rate among the cameras, two time-stamp files in .yml format are generated after each recording, in which the absolute moment of each frame is written. Using these stamps, the frames from the 5 cameras can be trimmed accordingly. The time stamps are stored along with the video clips. While recording, the subject is asked to perform the same sequence of actions under each illumination condition and for each reference camera, for a total of 63 repetitions. An automatic control program with a simple graphical user interface was developed to systematically cycle through all illumination conditions and reference cameras.
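As a rough illustration of how such per-frame timestamps might be used to align the five streams, the sketch below assumes each camera's .yml has already been parsed into a plain list of absolute per-frame timestamps in seconds; the actual file layout and units in the database may differ, and the camera names and numbers here are only placeholders.

```python
# Hypothetical alignment of five camera streams using per-frame timestamps.
# Assumes each camera's time-stamp file has been parsed into a sorted list of
# absolute timestamps (seconds), one per recorded frame.
from bisect import bisect_left, bisect_right

def trim_to_common_window(stamps_per_cam):
    """Return, for each camera, the frame indices whose timestamps fall inside
    the time window covered by all cameras."""
    start = max(s[0] for s in stamps_per_cam.values())   # latest start time
    end = min(s[-1] for s in stamps_per_cam.values())    # earliest end time
    trimmed = {}
    for cam, stamps in stamps_per_cam.items():
        lo = bisect_left(stamps, start)    # first frame at or after `start`
        hi = bisect_right(stamps, end)     # one past the last frame <= `end`
        trimmed[cam] = list(range(lo, hi))
    return trimmed

# Example with made-up timestamps for cameras U, D, L, R and reference C5.
stamps = {
    "U":  [0.00, 0.04, 0.08, 0.12, 0.16],
    "D":  [0.01, 0.05, 0.09, 0.13, 0.17],
    "L":  [0.02, 0.06, 0.10, 0.14, 0.18],
    "R":  [0.00, 0.04, 0.08, 0.12, 0.16],
    "C5": [0.03, 0.07, 0.11, 0.15, 0.19],
}
print(trim_to_common_window(stamps))
```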

Figure 1: The multi-view camera array for the video acquisition; C1-C9 label the reference cameras, while U, R, L, and D represent the actual working cameras in gaze-correction methods.


Figure 2: The layout of the studio. The cameras are placed as in Fig. 1, where W (west), S (south), and E (east) represent the relative positions of the three LED lights.

Figure 3: The GUI to control camera groups and illumination conditions.

2.2. Illumination Combinations

In the Yale Face Database B [19], significant variations of one face caused by different illumination conditions can be observed. Likewise, to better mimic real videoconference scenarios, we use 3 Pixel Sonnon DL-913 LED flood lights to provide variable illumination. The LED lights are located in the studio as shown in Fig. 2, providing illumination that varies in both direction and intensity, controlled by an automatic system. During recording, 7 working combinations of these lights are offered. If the on-off statuses of the 3 lights (W, S, and E, as shown in Fig. 2) are coded with three bits ("WSE"), the 7 illumination conditions can be represented as binary numbers ranging from 001 to 111, where the digits "0" and "1" represent "off" and "on", respectively. Fig. 3 gives a brief view of the GUI of our automatic control program. The "West", "South", and "East" options in the middle of the dialog box are switches for the LED lights, and the buttons labeled "C1" to "C9" are switches for the reference cameras.
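Read this way, the seven light codes expand as in the trivial sketch below (000, with all lights off, is not used in the database):

```python
# Expand the 7 "WSE" illumination codes used in the database (000 is unused).
LIGHTS = ("West", "South", "East")
for n in range(1, 8):
    code = format(n, "03b")                       # e.g. 1 -> "001"
    on = [light for light, bit in zip(LIGHTS, code) if bit == "1"]
    print(code, "->", " + ".join(on), "on")
```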

3. Post-Processing

3.1. Color Calibration for Multiple Cameras

Figure 4: Quadratic curve fitting for the R, G, and B channels of the gray bar, with the fitted equations (R: y = 5.5e-05 x^2 + 0.077 x + 42; G: y = 5.5e-05 x^2 + 0.078 x + 47; B: y = 6e-05 x^2 + 0.073 x + 45).

Since the CMOS sensors embedded in the cameras are not identical, the colors of the captured images exhibit inconsistent characteristics across different cameras, requiring calibration. We apply the method proposed in [32], using all 13 cameras to capture the same printed gradient bar (shown in Fig. 6) simultaneously under the same illumination condition for color reference. An arbitrary camera is selected as the reference towards which the color characteristics of the other cameras are adjusted. Here we illustrate the calibration procedure for two cameras as an example.


Figure 5: The relative error of color calibration between two images in the R, G, and B channels. The red and green lines present the error before and after color correction, respectively. The relative error after correction is much smaller than the original one.

Figure 6: We use the gray bar as the calibration reference under the same illumination. The color values of the gray bar increase linearly from 0 to 255 along the horizontal direction.

First, after the captured images from both cameras were rectified, the intensity values of each column were averaged to produce two row vectors of decreasing intensities. Then, polynomial fitting is used to obtain six curves, one for each color channel in each image; a high-quality fit can usually be obtained with the least-squares algorithm. The fitting results for an example camera are shown in Fig. 4. Finally, a function mapping the intensity values within a given channel captured by the camera under calibration onto those of the corresponding channel of the reference camera can be found by first solving for the root x of the quadratic equation

y_1 = b_1 x^2 + b_2 x + b_3,   (1)

where y_1 is the pixel value recorded by the camera under calibration; the root is given by

x = \frac{-b_2 + \sqrt{b_2^2 - 4 b_1 (b_3 - y_1)}}{2 b_1},   (2)

where b_1, b_2, and b_3 are obtained from the fitting. Substituting x back into the fitted curve of the reference image gives the corresponding intensity value for the pixel, as shown in (3):

y_2 = a_1 x^2 + a_2 x + a_3,   (3)

where y_2 is the pixel value corresponding to x. In this way, the image colors of all cameras can be calibrated. To verify the accuracy of the color correction, we calculate the relative intensity difference of each pixel, before and after color calibration, as

\varepsilon = \frac{|p_a - p_r|}{p_r} \times 100\%,   (4)

where p_a denotes the pixel intensity value captured by the camera under calibration within a given color channel, either before or after correction, and p_r denotes the corresponding pixel intensity value captured by the reference camera. Some example results are shown in Fig. 5.
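A compact sketch of this per-channel mapping is given below. It assumes the quadratic coefficients (b1, b2, b3) and (a1, a2, a3) have already been obtained by least-squares fitting as described above, and simply applies Eqs. (1)-(4) to one channel; the coefficient values in the toy example are borrowed from the fits quoted in Fig. 4 and are not a calibrated camera pair.

```python
import numpy as np

def map_channel(pixels, b, a):
    """Map one color channel from the camera under calibration to the
    reference camera via Eqs. (1)-(3).
    b = (b1, b2, b3): fitted curve of the camera under calibration,
    a = (a1, a2, a3): fitted curve of the reference camera."""
    b1, b2, b3 = b
    a1, a2, a3 = a
    y1 = pixels.astype(np.float64)
    # Eq. (2): invert y1 = b1*x^2 + b2*x + b3 for the positive root x.
    x = (-b2 + np.sqrt(b2 ** 2 - 4.0 * b1 * (b3 - y1))) / (2.0 * b1)
    # Eq. (3): evaluate the reference camera's curve at x.
    y2 = a1 * x ** 2 + a2 * x + a3
    return np.clip(y2, 0, 255).astype(np.uint8)

def relative_error(p_a, p_r):
    """Eq. (4): per-pixel relative intensity difference in percent."""
    return np.abs(p_a.astype(np.float64) - p_r) / np.maximum(p_r, 1e-6) * 100.0

# Toy example with assumed coefficients (R-channel fits from Fig. 4).
b = (5.5e-05, 0.077, 42.0)   # camera under calibration (assumed)
a = (5.5e-05, 0.078, 47.0)   # reference camera (assumed)
channel = np.array([[60, 120], [180, 240]], dtype=np.uint8)
print(map_channel(channel, b, a))
```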

3.2. Geometric Calibration

Figure 7: (a) Corner extraction on the chessboard; 10 different angles were chosen to capture 13 images from the 13 cameras. (b) The calibrated cameras located in the world coordinate frame.

In order to accurately identify the spatial relationships among all the recording cameras, which is a prerequisite for many stereo vision algorithms, geometric camera calibration is required; this involves estimating each camera's perspective projection matrix and distortion parameters, as represented by its extrinsic and intrinsic parameters. The method that we use is from [33], where all cameras capture pictures of the same chessboard, as shown in Fig. 7, at different angles and locations; from these measurements, the intrinsic parameters of the cameras can be computed. To enable the application of different camera calibration methods, all the original chessboard images are stored along with the corresponding video sequences.
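As an illustration of how the stored chessboard images can be reused, the hedged OpenCV sketch below estimates the intrinsics and distortion of one camera from its chessboard views. The board size, square size, and file pattern are placeholders rather than values from the paper, and [33] refers to the equivalent Matlab toolbox rather than this exact code.

```python
import glob
import cv2
import numpy as np

# Placeholder chessboard geometry and file pattern (not taken from the paper).
PATTERN = (9, 6)      # inner corners per row and column
SQUARE_MM = 25.0      # square size in millimetres

# 3D coordinates of the corners in the board's own frame (Z = 0 plane).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points = [], []
for path in glob.glob("chessboard_C5_*.png"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        continue
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if not found:
        continue
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_points.append(objp)
    img_points.append(corners)

assert obj_points, "no usable chessboard views were found"

# Intrinsic matrix K, distortion coefficients, and per-view extrinsics.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection RMS:", rms)

# A frame from the same camera could then be undistorted before evaluation
# (cf. step 2 of the protocol in Section 5.1):
# undistorted = cv2.undistort(frame, K, dist)
```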

4. Database Structure

At the first level, the video sequences in the database are divided according to gender. Videos of male and female figures are named with the letters "M" and "F" at the beginning of the name string, respectively. Each figure has an identifier number that goes from "001" to "0XX". As mentioned in the previous section, the reference webcam used for a specific sequence is named "C1" to "C9". Likewise, the combinations of LED lights are named with binary numbers from "001" to "111". These identifiers are concatenated to form the name string of a video. For instance, a video clip named "M001C1001 U" means: "this video is taken by camera U for male volunteer 001, with reference camera C1; the west and south lights are off and the east light is on". The folder structure is illustrated in Fig. 8 as a tree map.
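A small helper like the following can decode such name strings. It is only a sketch based on the naming convention described above: the separator before the recording-camera label is assumed (the text prints a space, a release might use an underscore), and the set of possible trailing camera labels is an assumption.

```python
import re

LIGHTS = ("West", "South", "East")

def parse_clip_name(name):
    """Decode a clip name such as 'M001C1001 U' following the convention
    described above. The separator ('_' or ' ') is assumed."""
    m = re.fullmatch(r"([MF])(\d{3})(C[1-9])([01]{3})[_ ](U|D|L|R|C[1-9])", name)
    if m is None:
        raise ValueError("unrecognized clip name: " + name)
    gender, subject, ref_cam, lights, view = m.groups()
    return {
        "gender": "male" if gender == "M" else "female",
        "subject_id": subject,
        "reference_camera": ref_cam,
        # 'WSE' bits: '1' means the corresponding LED light is on.
        "lights_on": [n for n, bit in zip(LIGHTS, lights) if bit == "1"],
        "recording_camera": view,
    }

print(parse_clip_name("M001C1001 U"))
# -> male subject 001, reference camera C1, only the East light on,
#    clip recorded by camera U.
```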

When taking the videos, each volunteer was requested to complete a set of designated actions, with a random facial expression at the end, under each combination of illumination and camera group. During each recording, the actions were captured by 1 of the 9 reference cameras along with the 4 acquisition cameras simultaneously. Note that the videos taken from the reference cameras function as the benchmark for gaze-correction algorithms. Each video clip lasts around 40 seconds; all videos are stored in raw YUV format (5 TB in total), with around 40 GB per subject. Some features of the database are summarized in Table 2.

Figure 8: Folder structure of THU multi-view face database.

Table 2: A short summary of the THU multi-view face database

Number of subjects | 100 (65 males, 35 females)
Origin of subjects | China (70), Pakistan (3), Azerbaijan (2), Germany (2), Kazakhstan (2), Myanmar (2), Russia (2), Singapore (2), South Africa (2), UK (2), USA (2), Albania (1), Colombia (1), Costa Rica (1), Finland (1), Kyrgyzstan (1), The Netherlands (1), Panama (1), Turkey (1), Ukraine (1)
Ethnic backgrounds | Chinese (74), Central Asian (3), Caucasian (3), South Asian (4), South East Asian (2), White European (9), Black African (2), White American (1), Latin American (2)
Variables | Actions, illuminations, facial expressions, multiple views
Total video clips | 31,500 (6,300 groups)

We illustrate the contents of the videos with thumbnails from several different cases. In Fig. 9, the thumbnails were extracted at the same instant in each of five videos (C5 plus the four acquisition cameras) under illumination 100; the views of the subject from five different viewpoints are shown. With the same subject as in Fig. 9, the thumbnails in Fig. 10 show videos from camera C5 with the same action under 7 different illuminations.


Figure 9: Thumbnails of multiple views from one camera group under the same illumination (001), where the reference camera is C5; panels (a)-(e) show the views from cameras C5, D, L, R, and U, respectively.

It can be observed that the shadows cast under the different illumination conditions have a visible impact on the appearance of the face as well as the background.


Figure 10: Thumbnails from C5 with the same action under 7 illumination conditions.


Figure 11: Thumbnails of different actions from one video under the same illumination, taken from the same camera.

In Fig. 11, the thumbnails are extracted at several different instants within one video clip in order to exhibit the various actions performed; here the illumination is 100 and the camera is C5. The actions are arranged in six parts: head-turning, speaking, holding a mug, reading a book, taking off glasses (where applicable), and performing a facial expression. Head-turning follows the sequence left, right, up, and down, during which all views of the face can be seen. The volunteers were requested to select one of the sentences from the TIMIT Acoustic-Phonetic Database [34] to achieve good coverage of different lip shapes; these sentences are stored with the corresponding videos as .txt files in the database. Holding a mug and reading a book provide some occlusion of the face, as such actions may appear during a conversational video. At the end of the actions, each volunteer was requested to improvise a facial expression, which may include happiness, anger, surprise, and so on.

evaluations on the proposed database; results for three baseline gaze correction algorithms were given following the protocol’s specifications, showing distinctive performance characteristics from which interesting conclusions can be drawn. 5.1. Evaluation Protocol

Figure 12: Workflow of image-based gaze-correction algorithms BM and DP.

Broadly speaking, the evaluation of a gaze-correction algorithm on the pro230

posed database includes the following steps: 1. Running the algorithm to be evaluated on the appropriate preprocessed input video sequences to synthesize a virtual view according to the camera

16

Figure 13: Workflow of the model-based gaze-correction algorithm MBGC.

calibration data, such that the virtual view conforms to that of the reference camera used. Two distinct sets of experiments can be performed, 235

focusing on assessing the performance of the gaze correction algorithm for the following two cases: • varying illumination conditions, where for a fixed action sequence, any combination of illumination conditions (e.g. 001, 101, 111, etc.) can be used; and

240

• varying occlusion, where for a fixed illumination condition, different action sequences can be selected featuring varying degrees of occlusion. 2. Remove lens distortion from the sequence captured by the reference camera, as performed in [35].

245

3. Extract the regions-of-interest (ROIs) according to the capabilities of the algorithm (e.g. face region, head-shoulder region, whole image, etc.), followed by the computation of an image similarity metric. 4. Characterize the performance of the algorithm by examining its cumulative quality curve obtained from performing the above steps on a set of

250

sequences. It is worth noting that not all gaze-correction algorithms have the flexibility to generate arbitrary virtual camera views according to the actual reference camera placements. Many assume the synthesis of a “center view” from two “side views”. In this case, off-center reference cameras cannot be used (effec-

255

tively, the algorithm cannot be applied to the case where the conversational 17

video is windowed and placed away from the center of the screen); furthermore, due to inevitable manufacture inaccuracies and disturbances, we cannot guarantee that the center reference camera is at the exact center of the up/down or left/right camera pairs. This mis-alignment will add noise to the similarity 260

metric and pollute the evaluation results. In order to reduce such influences and to ensure fair comparisons, for such algorithms, the homography between the input images and the reference image can be pre-computed by imaging a reference grid; according to the obtained homography, each input image can be warped into the same perspective as the reference image, such that the resultant

265

synthetic image and the reference image are in the same position (see Fig. 14).

(a)

(b)

(c)

(d)

(e)

Figure 14: The rectified images from one camera group. (a) View from the camera U; (b) View from the camera D; (c) View from the camera L; (d) View from the camera C5; (e) View from the camera R

In the interest of clear, well-rounded characterization and easy comparison, we propose the use of the cumulative quality curve (CQC) to study the performance of algorithms on our database. If the quality of gaze-correction, as measured by a similarity metric averaged over a given test video sequence, is 270

taken as a random variable q, then CQC is simply 1 − CDF(q), where CDF(·) stands for the cumulative distribution function of a random variable, plotted on a reversed x-axis. Intuitively, a point (x, y) on this curve signifies that (100×y)percent of the synthetic images given by the algorithm under evaluation achieved a quality greater or equal to x.

275

5.2. Baseline Experimental Results Three gaze-correction algorithms were chosen for baseline evaluations and comparison, covering both image-based and model-based methods. Note that

18

videos with reference camera C5 were chosen in the experiments as examples. Image-based algorithms. Most image-based algorithms share a common frame280

work consisting of image rectification, disparity computation, occluding decision, and view synthesis (as shown in Fig. 12) [7]. Following the conventions of most existing works, left and right cameras were used to compute a horizontal disparity; therefore, horizontal rectification of the left and right images is necessary to ensure epipolar alignment. We

285

evaluate two representative methods of disparity computation: a) the modified H. Hirschmuller algorithm (semi-global block matching) [36], which we refer to as BM; b) the improved Dynamic Programming algorithm proposed by A. Criminisi, et al. [6], which we refer to as DP. A model-based algorithm. An additional model-based gaze-correction algorithm

290

[8] (as shown in Fig. 13), which we refer to as MBGC, was evaluated. It utilizes face tracking and image segmentation techniques to extract the headand-shoulder region from the input images, and performs landmark-based correspondence between the up and down cameras. While this results in improved face-region synthesis quality due to the increased correspondence accuracy of

295

the tracked features, the quality of the synthesized background often suffers. Single webcam-based algorithm in [10] is not evaluated here as it is similar to one in [8] except for the requirement of the pre-captured frontal face image for each individual. In our experiments, for simplicity, we chose to use the peak signal-to-noise

300

ratio (PSNR) as our similarity metric. We perform the evaluations with respect to two independently varying factors: illumination and occlusion. Illumination. We tested two ROIs: a) the face region, which is the convex hull of all facial landmarks (see Fig. 15), and b) the segmented head-and-shoulder region (see Fig. 16). Since the MBGC algorithm only applies to the situation

305

where frontal face is visible, we limited our evaluation set to samples that meet this requirement so that all three algorithms can be compared. For compari-

19

(a) MBGC

(b) DP

(c) BM

(d) Reference

(e) MBGC

(f) DP

(g) BM

(h) Reference

Figure 15: Comparison of synthetic views among the three methods (from left to right: MBGC, DP, BM, and the reference). (a) - (d) show the original synthetic images; (e) - (h) show the segmented face region.


Figure 16: Comparison of synthetic views among the three methods (from left to right: MBGC, DP, BM, and the reference). (a) - (d) show the original synthetic images; (e) - (h) show the head-and-shoulder segmentation.

For comparison, we chose four illumination conditions: 001, 011, 101, and 111. For each face, we calculated the PSNR between the synthetic and the reference images under each illumination.


Figure 17: (a) - (c): CQCs in terms of PSNR for BM, DP, and MBGC under 4 illumination conditions. (d) - (f): Comparisons of CQCs of BM, DP, and MBGC under illumination 001, 011, and 111. ROI is face-only.

The face-only CQCs resulting from our experiments are shown in Fig. 17. We can see that for the face-only ROI, the PSNRs under 111 lighting are always better than those under 001, especially for the DP algorithm, where the difference is most pronounced. This illustrates that all three algorithms are negatively affected by uneven illumination. Among the three algorithms, MBGC is least affected by illumination changes due to its model-based nature: the model-fitting step uses illumination-invariant features, so the extracted correspondences are more robust to lighting variations. The image-based algorithms are more sensitive in this regard. Also, the MBGC algorithm has an advantage in the face-only comparison since it uses an explicit face model to compute the set of point and line matches for view synthesis.

Fig. 18 shows the head-and-shoulder CQCs obtained from our experiments. We can see that the influence of illumination is the same as in the face-only experiment, and that the MBGC algorithm does not strictly outperform the other algorithms; we can see from its CQC that, compared to DP, it produced fewer samples with very high quality but also much fewer samples with low quality.


Figure 18: (a) - (c): CQCs in terms of PSNR for BM, DP, and MBGC under 4 illumination conditions. (d) - (f): Comparisons of CQCs of BM, DP, and MBGC under illumination 001, 011, and 111. ROI is head-and-shoulder.


Figure 19: Comparison of synthesized view between BM and DP, where the action “drinking” is under test.

Therefore, we can conclude that while MBGC may not be as accurate as DP under ideal conditions, it is more robust.

Occlusion. In order to examine the influence of occluding motions on synthesis, we compared the performance of BM and DP using the frontal sequences from the previous experiment, as well as sequences with the motion of "figures holding a cup as if they are drinking". Since this motion causes the cup to occlude the face, the MBGC algorithm is no longer applicable (see Fig. 19). Since we cannot locate the face region, automatic head-and-shoulder segmentation also becomes difficult; therefore we used the whole image as the ROI when comparing the algorithms.

From the CQCs (see Fig. 20) we can see that the influence of illumination on the results is similar to the first experiment, i.e. uniform illumination gives the best results. The effect of occlusion is better examined by fixing the illumination at 111 and comparing the performance of the algorithms between the frontal sequences and the "drinking" sequences (see Fig. 21). We see that while DP always gives better performance than BM, it is not particularly more robust against occlusions, especially in this case where the cup is at a significantly higher parallax than the face.


Figure 20: CQCs in terms of PSNR for BM and DP, with 2 different illuminations (001 and 111). ROI is the whole image.

6. Availability of the Database

Free samples of 5 volunteers (F029, F033, F035, F038, and M052) can be downloaded at www.facedbv.com. Since the total size of the database is too large to be stored on the web host, please contact the corresponding author if you would like to order it to be delivered on a physical medium.



Figure 21: CQCs in terms of PSNR for BM and DP, for “Frontal” and “Drink” actions, under 111 illumination. ROI is the whole image.

7. Conclusion

In this paper we have presented a multi-view face video database and an evaluation protocol for gaze-correction methods, including image-based [6][7] and model-based [8] ones. These methods have been evaluated in terms of PSNR against the reference sequences acquired by the reference cameras. Our evaluation is based on a large collection of image sequences, calibrated in both geometry and color, together with an evaluation procedure. Different performances can be observed among these methods, and the impact of illumination and occlusion on the synthesized results can be seen as well: MBGC is more robust against variations of illumination, while DP outperforms the other two when there is noticeable occlusion in front of the face. In addition to gaze-correction algorithms, other face analysis tasks such as gaze estimation, face tracking, the cross-race effect, stereo matching, and structural similarity may also be evaluated with the proposed database.

Acknowledgment

During the preparation of this database, several members of our group provided valuable assistance. Specifically, Tianyou Zhou and Weihua Bao helped design the GUI and the camera control system; Jichuan Lu helped implement the camera color calibration; and undergraduate students Zizhuo Zhang and Kai Gu helped design the flood-light control program. We highly appreciate their excellent work. We would also like to thank Wenjie Ye, Shengming Yu, and Jingwen Cheng (Xidian University, China) for their help in conducting the experiments. Last but not least, we express our deepest gratitude to all the volunteers who participated in the video collection.

References

[1] N. Atzpadin, P. Kauff, O. Schreer, Stereo Analysis by Hybrid Recursive Matching for Real-Time Immersive Video Conferencing, IEEE Trans. Circuits Syst. Video Technol. 14 (2004) 321–334. doi:10.1109/TCSVT.2004.823391.
[2] K. Hung, Y.-T. Zhang, Implementation of a WAP-Based Telemedicine System for Patient Monitoring, IEEE Trans. Inf. Technol. Biomed. 7 (2003) 101–107. doi:10.1109/TITB.2003.811870.
[3] C. Luo, W. Wang, J. Tang, J. Sun, J. Li, A Multiparty Videoconferencing System Over an Application-Level Multicast Protocol, IEEE Trans. Multimedia 9 (2007) 1621–1632. doi:10.1109/TMM.2007.907467.
[4] M. Ponec, S. Sengupta, M. Chen, J. Li, P. A. Chou, Optimizing Multi-Rate Peer-to-Peer Video Conferencing Applications, IEEE Trans. Multimedia 13 (2011) 856–868. doi:10.1109/TMM.2011.2161759.
[5] J. Jansen, P. Cesar, D. C. A. Bulterman, T. Stevens, I. Kegel, J. Issing, Enabling Composition-Based Video-Conferencing for the Home, IEEE Trans. Multimedia 13 (2011) 869–881. doi:10.1109/TMM.2011.2159369.
[6] A. Criminisi, J. Shotton, A. Blake, P. H. S. Torr, Gaze Manipulation for One-to-one Teleconferencing, Proc. IEEE ICCV'03 1 (2003) 191–198. doi:10.1109/ICCV.2003.1238340.
[7] S.-B. Lee, I.-Y. Shin, Y.-S. Ho, Gaze-corrected View Generation Using Stereo Camera System for Immersive Videoconferencing, IEEE Trans. Consum. Electron. 57 (2011) 1033–1040. doi:10.1109/TCE.2011.6018852.
[8] R. Yang, Z. Zhang, Eye Gaze Correction with Stereovision for Video-Teleconferencing, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 956–960. doi:10.1109/TPAMI.2004.27.
[9] C. Kuster, T. Popa, J.-C. Bazin, C. Gotsman, M. Gross, Gaze Correction for Home Video Conferencing, ACM Trans. Graph. (Proc. of ACM SIGGRAPH ASIA) 31 (2012) 174:1–174:6. doi:10.1145/2366145.2366193.
[10] D. Giger, J.-C. Bazin, C. Kuster, T. Popa, M. Gross, Gaze Correction with a Single Webcam, Proc. IEEE ICME'14 (2014) 1–6. doi:10.1109/ICME.2014.6890306.
[11] S. Pigeon, L. Vandendorpe, The M2VTS Multimodal Face Database (Release 1.00), Proc. AVBPA'97 1206 (1997) 403–409. doi:10.1007/BFb0016021.
[12] P. N. Belhumeur, J. P. Hespanha, D. J. Kriegman, Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projections, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 711–720. doi:10.1109/34.598228.
[13] A. M. Martínez, R. Benavente, The AR Face Database, Technical Report #24, Computer Vision Center (CVC).
[14] D. B. Graham, N. M. Allinson, Characterizing Virtual Eigensignatures for General Purpose Face Recognition, Face Recognition: From Theory to Applications, NATO ASI Series F, Computer and Systems Sciences 163 (1998) 446–456.
[15] K. Messer, J. Matas, J. Kittler, J. Luettin, XM2VTSDB: The Extended M2VTS Database, Proc. AVBPA'99 (1999) 72–77.
[16] T. Kanade, J. F. Cohn, Y. Tian, Comprehensive Database for Facial Expression Analysis, Proc. IEEE FG'00 (2000) 484–490. doi:10.1109/AFGR.2000.840611.
[17] P. J. Phillips, H. Moon, S. A. Rizvi, P. J. Rauss, The FERET Evaluation Methodology for Face-Recognition Algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 1090–1104. doi:10.1109/34.879790.
[18] E. Marszalec, B. Martinkauppi, M. Soriano, M. Pietikäinen, A Physics-Based Face Database for Color Research, J. Electron. Imaging 9 (2000) 32–38. doi:10.1117/1.482722.
[19] A. S. Georghiades, P. N. Belhumeur, D. Kriegman, From Few to Many: Generative Models for Recognition under Variable Pose and Illumination, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001) 643–660. doi:10.1109/34.927464.
[20] T. Sim, S. Baker, M. Bsat, The CMU Pose, Illumination, and Expression Database, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 1615–1618. doi:10.1109/TPAMI.2003.1251154.
[21] B. Weyrauch, J. Huang, B. Heisele, V. Blanz, Component-based Face Recognition with 3D Morphable Models, Proc. IEEE CVPRW'04 (2004) 85–85. doi:10.1109/CVPR.2004.41.
[22] N. A. Fox, B. A. O'Mullane, R. B. Reilly, VALID: A New Practical Audio-Visual Database, and Comparative Results, Proc. AVBPA'05 3546 (2005) 777–786. doi:10.1007/11527923_81.
[23] W. Gao, B. Cao, S. Shan, D. Zhou, X. Zhang, D. Zhao, The CAS-PEAL Large-Scale Chinese Face Database and Baseline Evaluations, IEEE Trans. Syst., Man, Cybern. A 38 (2008) 149–161. doi:10.1109/TSMCA.2007.909557.
[24] Z. Huang, S. Shan, R. Wang, H. Zhang, S. Lao, A. Kuerban, X. Chen, A Benchmark and Comparative Study of Video-Based Face Recognition on COX Face Database, IEEE Trans. Image Process. 24 (2015) 5967–5981. doi:10.1109/TIP.2015.2493448.
[25] R. Gross, I. Matthews, J. Cohn, T. Kanade, S. Baker, Multi-PIE, Image and Vision Computing 28 (2010) 807–813. doi:10.1016/j.imavis.2009.08.002.
[26] S. Milborrow, J. Morkel, F. Nicolls, The MUCT Landmarked Face Database, Proc. Twenty-First Annual Symposium of the Pattern Recognition Association of South Africa (PRASA'10).
[27] S. Wang, Z. Liu, S. Lv, Y. Lv, G. Wu, P. Peng, F. Chen, X. Wang, A Natural Visible and Infrared Facial Expression Database for Expression Recognition and Emotion Inference, IEEE Trans. Multimedia 12 (2010) 682–691. doi:10.1109/TMM.2010.2060716.
[28] M. Yeasin, B. Bullot, R. Sharma, Recognition of Facial Expressions and Measurement of Levels of Interest from Video, IEEE Trans. Multimedia 8 (2006) 500–508. doi:10.1109/TMM.2006.870737.
[29] P. Nair, A. Cavallaro, 3-D Face Detection, Landmark Localization, and Registration Using a Point Distribution Model, IEEE Trans. Multimedia 11 (2009) 611–623. doi:10.1109/TMM.2009.2017629.
[30] M. Barnard, P. Koniusz, W. Wang, J. Kittler, S. M. Naqvi, J. Chambers, Robust Multi-Speaker Tracking via Dictionary Learning and Identity Modeling, IEEE Trans. Multimedia 16 (2014) 864–880. doi:10.1109/TMM.2014.2301977.
[31] Y. Moses, Y. Adini, S. Ullman, Face Recognition: The Problem of Compensating for Changes in Illumination Direction, Proc. ECCV'94 800 (1994) 286–296. doi:10.1007/3-540-57956-7_33.
[32] J. Jung, Y. Ho, Color Correction Method Using Gray Gradient Bar for Multi-View Camera System, Proc. IWAIT'09 (2009) MP.C4(1–6).
[33] J.-Y. Bouguet, Camera Calibration Toolbox for Matlab. URL http://www.vision.caltech.edu/bouguetj/calib_doc/
[34] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, V. Zue, TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1, Philadelphia: Linguistic Data Consortium.
[35] Y.-S. Kang, Y.-S. Ho, Geometrical Compensation for Multi-view Video in Multiple Camera Array, Proc. IEEE ELMAR'08 (2008) 83–86.
[36] H. Hirschmuller, Stereo Processing by Semiglobal Matching and Mutual Information, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008) 328–341. doi:10.1109/TPAMI.2007.1166.