The THU Multi-View Face Database for Videoconferences and Baseline Evaluations

Xiaoming Tao (a), Linhao Dong (b,∗), Yang Li (a), Jianhua Lu (a,b)

(a) Department of Electronic Engineering, Tsinghua University, Beijing, 100084 China
(b) School of Aerospace Engineering, Tsinghua University, Beijing, 100084 China
Abstract

In this paper, we present a face video database and its acquisition. The database contains 31,500 video clips of 100 individuals from 20 countries. The primary purpose of building this database is to provide standardized test video sequences for research related to videoconferences, such as gaze correction and model-based face reconstruction. To be specific, each subject was filmed by 9 groups of synchronized webcams under 7 illumination conditions and was requested to complete a series of designated actions; thus, face variations including lip shape, occlusion, illumination, pose, and expression are well represented in each video clip. Compared with existing databases, the proposed THU face database provides multi-view video sequences with strict temporal synchronization, which enables evaluations of current and future gaze-correction methods for conversational video communications. In addition, we discuss an evaluation protocol based on our database for gaze correction, under which three well-known methods were tested. Experimental results show that, under this evaluation protocol, performance comparisons can be obtained numerically in terms of peak signal-to-noise ratio (PSNR), demonstrating the strengths and weaknesses of these methods under different circumstances.

Keywords: Face video database, videoconferences, gaze-correction, multi-view, evaluation protocol.

This work was supported by the National Basic Research Project of China (973) (2013CB329006), the National Natural Science Foundation of China (NSFC, 61101071, 61471220, 61021001), and the Tsinghua University Initiative Scientific Research Program (20131089365). Part of this work was published in the proceedings of IEEE ICIP'15, Québec City, Canada.

∗ Corresponding author.
Email addresses: [email protected] (Xiaoming Tao), [email protected] (Linhao Dong), [email protected] (Yang Li), [email protected] (Jianhua Lu)
1. Introduction

1.1. Background and Motivation

Recently, video communication over the Internet has attracted much attention from the public. Its unique features make it irreplaceable in many scenarios, such as teleconferencing [1], telemedicine [2], and video calling. Related designs for multicast protocols, rate allocation, and household applications have also been well investigated [3, 4, 5], providing accessible solutions for home-based video communications. However, when it comes to user experience, current video communication applications often fall short of satisfaction. For instance, one of the most glaring incongruities in personal video calling is that natural eye contact cannot be maintained, since it is impossible to stare at the camera and the conversation window simultaneously.

To address this problem, several schemes [6, 7, 8, 9] have been proposed to correct the gaze direction for users at both ends. They are generally categorized into image-based ([6, 7]) and model-based ([8, 9, 10]) schemes. For image-based schemes, calibrated stereo cameras are usually applied to capture images from two separate views; using binocular vision processing, a virtual central view is synthesized, and the gaze direction is thereby corrected. Differing from image-based schemes, model-based ones typically require a depth map from either a stereo camera set [8] or a depth camera [9] to build a 3D face model. In [10], the authors applied a generic 3D head mesh model to fit the features extracted from individual face images. By using morphing techniques, the face model can be rotated to the correct direction, and texture mapping is performed to transfer the texture patches in the original image onto the direction-corrected face model. These schemes, however, could only be evaluated qualitatively and subjectively, each in isolation, by observing the output synthetic sequences. There has been no unified criterion with which the schemes can be compared quantitatively. Therefore, a specialized database is desirable in order to evaluate the performance of these gaze-correction methods.
1.2. Face Database and Its Design

Table 1: List of some face databases

| Database | Number of subjects | Total images | Features | References |
|---|---|---|---|---|
| M2VTSDB (Release 1.00), Université Catholique de Louvain, Belgium | 37 | 185 | large pose changes, speaking subjects, eye glasses, time change | [11] |
| Yale Face Database, Yale University, USA | 15 | 165 | expressions, eye glasses, lighting | [12] |
| AR Face Database, Ohio State University (previously at Purdue University), USA | 126 | 4,000 | frontal pose, expression, illumination, occlusions, eye glasses, scarves | [13] |
| The Sheffield Face Database (previously UMIST Face Database), University of Sheffield, UK | 20 | 564 | pose, gender, ethnic backgrounds, gray scale | [14] |
| XM2VTSDB, University of Surrey, UK | 295 | 1,180 videos | rotating head, speaking subjects, 3D models, high quality images | [15] |
| Cohn-Kanade AU-Coded Facial Expression Database, University of Pittsburgh, USA | 100 | 500 sequences | dynamic sequences of facial expressions | [16] |
| FERET Database, Color, National Institute of Standards and Technology, USA | 1,199 | 14,126 | color images, changes in appearance through time, controlled pose variation, facial expression | [17] |
| University of Oulu Physics-Based Face Database, University of Oulu, Finland | 125 | >2,000 | highly varied illumination (various colors), eye glasses | [18] |
| Yale Face Database B, Yale University, USA | 10 | 4,050 | pose, illumination | [19] |
| PIE Database, Carnegie Mellon University, USA | 68 | 41,368 | very large database, pose, illumination, expression | [20] |
| MIT-CBCL Face Recognition Database, Massachusetts Institute of Technology, USA | 10 | >2,000 | synthetic images from 3D models, illumination, pose, background | [21] |
| VALID Database, University College Dublin, Ireland | 106 | 530 | highly variable office conditions | [22] |
| CAS-PEAL Face Database, Chinese Academy of Sciences, China | 1,040 | 99,594 | very large, expressions, accessories, lighting, simultaneous capture of multiple poses, Chinese subjects | [23] |
| COX Face Database, Chinese Academy of Sciences, China | 1,000 | 1,000 + 3,000 clips | low resolution videos, variant light conditions, surveillance-like scenario | [24] |
| Multi-PIE, Carnegie Mellon University, USA | 337 | 750,000 | very large database, pose, illumination, expression, high resolution frontal images | [25] |
| MUCT Landmark Database, University of Cape Town, South Africa | 624 | 3,755 | expressions, ethnic backgrounds, manual landmarks | [26] |
| NVIE Database, University of Science and Technology of China | >100 | >3,000 | visible and infrared imaging, spontaneous and posed expressions, landmarks | [27] |
In the field of automatic face analysis and perception, significant progress has been made in several sub-domains such as facial expression analysis, face tracking, and 3D face modeling [28, 29, 30]. From a recognition viewpoint, these schemes usually involve learning-based solutions that require training datasets; their performance is often directly related to the size and the quality of the available training sets. Therefore, diversity and comprehensiveness are important factors for a database. Since the mid-1990s, several institutions have been building face databases in various forms, including images [11]-[21],[23]-[27], videos [11],[15],[16],[22],[24] and audio-visual sequences [11],[15],[22].

Among these databases, early works including M2VTSDB [11], the Yale Face Database [12], and the AR Face Database [13] pioneered diverse design methodologies. In M2VTSDB (Release 1.00), synchronized video and speech data as well as image sequences were recorded for research on identification strategies. A variety of facial variations, such as hairstyles and accessories, were presented; in addition, 3D information can be obtained from the head movements in the video sequences. As an extended version of M2VTSDB, XM2VTSDB [15] collected data using digital devices and a larger sample space for more reliable and robust training. However, the illumination conditions in both of these databases are constant and ideal; therefore, the impact of illumination on face identification cannot be effectively evaluated. In the Yale Face Database and the AR Face Database, various controlled illuminations were introduced, such that the impact of illumination conditions on the same face can be well studied. Both the Yale Face Database and the AR Face Database, nevertheless, contain only frontal face images, precluding the ability to test the success rate of face recognition algorithms on profile faces.

It has been reported that the variations between images of the same face caused by illumination and viewing direction are likely larger than those caused by a change in face identity [31]. Hence, face images in databases like the Yale Face Database B [19], the PIE Database [20], Multi-PIE [25], and the CAS-PEAL Face Database [23] were taken under several designated poses and controlled illuminations. In the Yale Face Database B, a geodesic dome consisting of 64 strobes was constructed to offer 45 illumination conditions for training purposes; meanwhile, images of the same face were taken from 9 viewing directions to allow the possibility of building 3D models. However, facial expression was not considered as one of the variations. In both the PIE and Multi-PIE Databases, pose, illumination and expression were variables for each face; the geometric and color calibration information for each camera was recorded as well. In particular, more deliberate facial expressions were included in Multi-PIE. With acquisition methods similar to PIE, the CAS-PEAL Face Database provides a large sample space of Chinese people as a supplement for a specific ethnicity. Considering the challenges of face recognition in surveillance scenarios, the COX Face Database [24] offers a large video-based database of 1,000 subjects acquired by 3 camcorders under uncontrolled light conditions with low spatial resolution to simulate practical surveillance environments. It can also serve as a diverse training set with different views of faces in front of complex backgrounds. Table 1 lists some popular face databases with their respective features.

While the aforementioned databases have been successfully applied in several fields, they were not constructed specifically for videoconference or conversational video communication scenarios; in particular, none offers synchronized videos with multiple views. Aiming to provide a video database for videoconferences to supplement the currently available corpus, we have collected 31,500 raw videos from 100 volunteers, with a total size of around 5 TB. Our database is specially designed for conversational video; thus, the scenes in the video clips simulate real-life conversation scenarios. To be specific, each subject was requested to complete a series of designated actions that may appear during a video communication, including speaking, head movements, drinking from a cup, reading, and performing one facial expression; through these actions, variations of lip shape, occlusion, pose, and expression are exhibited. Moreover, each subject was filmed under 7 different controlled illumination conditions. Multiple cameras were arranged in groups for video acquisition; for each clip, 5 out of the 13 available cameras were used for simultaneous recording. Such simultaneous multi-view recording enables us to propose an evaluation protocol for gaze-correction methods, with which three typical methods, including two image-based methods [6, 7] and one model-based method [8], were evaluated as examples to give baseline performances. The evaluation results show quantitative comparisons among these methods under different illuminations and action scenarios.
The remainder of this paper is organized as follows. The hardware setup is described in Section 2, where the arrayed camera system and illumination arrangements are explained. Post-processing, including geometric and color calibration for the cameras, is described in Section 3. In Section 4, the design and structure of the database are introduced with examples. The proposed evaluation protocol and the baseline experimental results are presented in Section 5. Section 6 describes the availability of the database, and Section 7 concludes the paper.
2. Hardware Setup

This section describes the hardware setup for video acquisition in our studio. In Subsection 2.1, the layout and function of the webcam array are introduced. In Subsection 2.2, the detailed arrangements of the various illumination conditions are described.

2.1. Multi-View Camera Array

To record multi-view videos for one face, 13 Logitech C310 HD webcams are used, each of which is able to capture 1280x720 HD image sequences. A rectangular camera rack carrying all the cameras was custom designed to mimic the dimensions of a regular 16:9 19-inch monitor. As shown in Fig. 1, the 4 cameras labeled U, R, D, and L are called acquisition cameras, since one or more of them can be physically present in a conversational video communication system and actually used by gaze-correction methods; the cameras labeled C1 to C9 (sequenced in zig-zag order) are called reference cameras, since they do not actually appear on screen in a real system and are used to capture the target view which the gaze-correction methods must attempt to synthesize. They are placed across the interior of the rack to reflect the fact that the actual messaging window may not necessarily appear at the center of the monitor.

During each round of the video collection, only 1 out of the 9 reference cameras is active, while cameras U, R, D, and L are always recording. In other words, 5 cameras record videos simultaneously. Since one computer can support at most 3 webcams recording 1280x720 video at 25 fps at the same time, due to the bandwidth limit of the USB port, these 5 cameras are controlled by two inter-connected computers (3+2, respectively) in strict temporal synchronization. In case unexpected frame-rate differences occur among the cameras, two time stamps are written to a .yml file after each recording, in which the absolute moment of each frame is recorded; using these stamps, frames from the 5 cameras can be trimmed accordingly. The time stamps are stored along with the video clips. While recording, the subject is asked to perform the same sequence of actions under each illumination condition and for each reference camera, for a total of 63 repetitions. An automatic control program with a simple graphical user interface was developed to systematically cycle through all illumination conditions and reference cameras.
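As an illustration of how such per-frame time stamps can be used to trim the five streams, the sketch below pairs frames by nearest capture time. The per-camera lists of capture times are an assumed representation of the stored .yml stamps, not their actual schema.

```python
# Minimal sketch of timestamp-based frame alignment across cameras.
# The per-camera list of per-frame capture times (in seconds) is an assumed
# representation of the .yml stamps; the database's actual format may differ.
import bisect

def align_frames(ref_times, cam_times, tolerance=0.02):
    """For each reference-camera frame time, find the index of the closest
    frame of another camera; drop pairs whose gap exceeds `tolerance` (s)."""
    pairs = []
    for ref_idx, t in enumerate(ref_times):
        j = bisect.bisect_left(cam_times, t)
        candidates = [k for k in (j - 1, j) if 0 <= k < len(cam_times)]
        best = min(candidates, key=lambda k: abs(cam_times[k] - t))
        if abs(cam_times[best] - t) <= tolerance:
            pairs.append((ref_idx, best))
    return pairs

# Example: 25 fps nominal, second camera started 10 ms late.
ref = [i / 25.0 for i in range(100)]
cam = [0.010 + i / 25.0 for i in range(100)]
print(align_frames(ref, cam)[:3])   # [(0, 0), (1, 1), (2, 2)]
```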
Figure 1: The multi-view camera array for the video acquisition; C1-C9 label the reference cameras, while U, R, L, and D represent the actual working cameras in gaze-correction methods.
Figure 2: The layout of the studio. The cameras are placed as in Fig. 1; W (west), S (south) and E (east) represent the relative positions of the three LED lights.
Figure 3: The GUI to control camera groups and illumination conditions
2.2. Illumination Combinations

In the Yale Face Database B [19], significant variations in one face caused by different illumination conditions can be observed. Likewise, to better mimic real videoconference scenarios, we use 3 Pixel Sonnon DL-913 LED flood lights to provide variable illumination. The LED lights are located in the studio as shown in Fig. 2, providing illumination varying in both direction and intensity, controlled by an automatic system. During recording, 7 working combinations of these lights are offered. If the on-off statuses of the 3 lights (W, S, and E as shown in Fig. 2) are coded with three bits ("WSE"), the 7 illuminations can be represented as binary numbers ranging from 001 to 111, where the digits "0" and "1" represent "off" and "on", respectively. Fig. 3 gives a brief view of the GUI of our automatic control program. The "West", "South", and "East" options in the middle of the dialog box are switches for the LED lights, and the buttons labeled "C1" to "C9" are switches for the reference cameras.
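For reference, a minimal sketch of decoding the 3-bit "WSE" code described above; the function name is introduced here only for illustration.

```python
# Minimal sketch: decode a 3-bit "WSE" illumination code (e.g. "101") into
# the on/off state of the West, South, and East LED lights.
def decode_lights(code: str) -> dict:
    if len(code) != 3 or any(c not in "01" for c in code):
        raise ValueError("illumination code must be 3 binary digits, 001-111")
    return {light: bit == "1" for light, bit in zip(("W", "S", "E"), code)}

print(decode_lights("101"))  # {'W': True, 'S': False, 'E': True}
```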
3. Post-Processing

3.1. Color Calibration for Multiple Cameras
Figure 4: Quadratic curve fitting and the fitted equations for the R, G, and B channels of the gray bar; the fitted curves are approximately y = 5.5e-05 x^2 + 0.077 x + 42 (R), y = 5.5e-05 x^2 + 0.078 x + 47 (G), and y = 6e-05 x^2 + 0.073 x + 45 (B).
Since the CMOS sensors embedded in the cameras are not identical, the colors of the captured images exhibit inconsistent characteristics across different cameras, requiring calibration. We apply the method proposed in [32]: all 13 cameras capture the same printed gradient bar (shown in Fig. 6) simultaneously under the same illumination condition for color reference. An arbitrary camera is selected as the reference, towards which the color characteristics of the other cameras are adjusted. Here we illustrate the calibration procedure for two cameras as an example. First, after the captured images from both cameras are rectified, the intensity values of each column are averaged to produce two row vectors of decreasing intensities.
Figure 5: The relative error of color calibration between two images for the R, G, and B channels. The red and green lines show the error before and after color correction, respectively; the relative error after correction is much smaller than the original one.

Figure 6: The gray bar used as the calibration reference under the same illumination. The color values of the gray bar increase linearly from 0 to 255 along the horizontal direction.
Then, polynomial fitting is used to obtain six curves, one for each color channel in each image. A high-quality fit can usually be obtained using the least-squares algorithm. The fitting results for an example camera are shown in Fig. 4. Finally, a function mapping the intensity values within a given channel captured by the camera under calibration onto those in the corresponding channel of the reference camera can be found by first solving for the root $x$ of the quadratic equation

$$ y_1 = b_1 x^2 + b_2 x + b_3, \qquad (1) $$

where $y_1$ is the pixel value at $x$; the root is given by

$$ x = \frac{-b_2 + \sqrt{b_2^2 - 4 b_1 (b_3 - y_1)}}{2 b_1}, \qquad (2) $$

where $b_1$, $b_2$, and $b_3$ are obtained from the fitting. Substituting $x$ back into the fitted curve of the reference image yields the corresponding intensity value for the pixel, as shown in (3):

$$ y_2 = a_1 x^2 + a_2 x + a_3, \qquad (3) $$

where $y_2$ is the pixel value corresponding to $x$. By this method, the image colors of all the cameras can be calibrated accordingly. To verify the accuracy of the color correction, we calculate the relative intensity difference of each pixel, before and after color calibration, as in (4):

$$ \varepsilon = \frac{|p_a - p_r|}{p_r} \times 100\%, \qquad (4) $$

where $p_a$ denotes the pixel intensity value captured by the camera under calibration within a given color channel, either before or after correction, and $p_r$ denotes the corresponding pixel intensity value as captured by the reference camera. Some example results are shown in Fig. 5.
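The following is a minimal sketch of the per-channel mapping defined by Eqs. (1)-(4), assuming the column-averaged gray-bar profiles of the two cameras are available as one-dimensional arrays; it is not the exact implementation of [32], and the synthetic profiles at the end are purely illustrative.

```python
# Sketch of the per-channel color mapping of Eqs. (1)-(4): fit quadratics to
# the column-averaged gray-bar profiles of the camera under calibration (b)
# and the reference camera (a), then map pixel values from b to a.
import numpy as np

def fit_quadratic(profile):
    """Fit mean column intensity vs. column index with a 2nd-order polynomial.
    Returns coefficients (c2, c1, c0) for c2*x^2 + c1*x + c0."""
    x = np.arange(len(profile))
    return np.polyfit(x, profile, 2)

def map_intensity(y1, b_coef, a_coef):
    """Map an intensity y1 from the camera under calibration to the reference
    camera: solve Eq. (2) for x, then evaluate Eq. (3)."""
    b1, b2, b3 = b_coef
    disc = b2 ** 2 - 4.0 * b1 * (b3 - y1)
    x = (-b2 + np.sqrt(np.maximum(disc, 0.0))) / (2.0 * b1)
    return np.polyval(a_coef, x)

def relative_error(p_a, p_r):
    """Relative intensity difference of Eq. (4), in percent."""
    return np.abs(p_a - p_r) / p_r * 100.0

# Example with synthetic column profiles (one color channel).
cols = np.arange(800)
ref_profile = 6e-5 * cols**2 + 0.073 * cols + 45      # reference camera
cal_profile = 5.5e-5 * cols**2 + 0.077 * cols + 42    # camera to calibrate
a_coef = fit_quadratic(ref_profile)
b_coef = fit_quadratic(cal_profile)
corrected = map_intensity(cal_profile, b_coef, a_coef)
print(relative_error(corrected, ref_profile).max())   # close to 0 for this toy case
```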
3.2. Geometric Calibration
Figure 7: (a) shows the corner extraction of the chessboard. We have chosen 10 different angles to capture 13 images from 13 cameras; (b) shows the calibrated cameras located in the world coordinates.
In order to accurately identify the spatial relationships among all the recording cameras, which is a prerequisite for many stereo vision algorithms, geometric camera calibration is required; it involves the estimation of each camera's perspective projection matrix and distortion parameters, as encoded by its extrinsic and intrinsic parameters. The method that we use is from [33], where all cameras capture pictures of the same chessboard, as shown in Fig. 7, at different angles and locations; from these measurements, the intrinsic parameters of the cameras can be computed. To enable the application of different camera calibration methods, all the original chessboard images are stored along with the corresponding video sequences.
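As a rough OpenCV analogue of the chessboard-based workflow above (the authors use the MATLAB toolbox of [33]), the sketch below estimates a camera's intrinsic matrix and distortion coefficients; the board dimensions, square size, and file-name pattern are illustrative assumptions.

```python
# Sketch of chessboard-based intrinsic calibration with OpenCV. The pattern
# size, square size, and image path pattern are assumptions for illustration.
import glob
import cv2
import numpy as np

PATTERN = (9, 6)          # inner corners per row/column (assumed)
SQUARE_SIZE = 25.0        # square edge length in mm (assumed)

# 3-D coordinates of the chessboard corners in the board frame (Z = 0).
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
image_size = None
for path in glob.glob("chessboard/C5_*.png"):        # hypothetical file names
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if not found:
        continue
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    obj_points.append(objp)
    img_points.append(corners)
    image_size = gray.shape[::-1]

# Intrinsic matrix, distortion coefficients, and per-view extrinsics.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, image_size, None, None)
print("RMS reprojection error:", rms)
```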
4. Database Structure

At the first level, the video sequences in the database are divided according to gender. Videos of male and female subjects are named with the letters "M" and "F" at the beginning of the name string, respectively. Each subject has an identifier number that runs from "001" to "0XX". As mentioned in the previous section, the reference webcam used for a specific sequence is named from "C1" to "C9". Likewise, the combinations of LED lights are named with binary numbers from "001" to "111". These identifiers are concatenated to form the name string of a video. For instance, a video clip named "M001C1001_U" means: "this video is taken by camera U for male volunteer 001 along with the reference camera C1; the west and south lights are off and the east light is on". The folder structure is illustrated in Fig. 8 as a tree map.

When taking videos, each volunteer is requested to complete a set of designated actions, with a random facial expression at the end, under each combination of illumination and camera group. During each recording, the actions are captured by 1 of the 9 reference cameras along with the 4 acquisition cameras simultaneously. Note that the videos taken by the reference cameras serve as the benchmark for gaze-correction algorithms. Each video clip lasts around 40 seconds; all videos are stored in the raw YUV format (5 TB in total), around 40 GB for each subject. Some features of the database are summarized in Table 2.
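As an illustration of the naming convention described above, the following sketch parses a clip name into its fields; the underscore separator and the exact field widths are assumptions based on the example name.

```python
# Sketch of parsing the clip-naming convention, assuming names of the form
# "<M|F><subject><C1..C9><WSE code>_<camera>", e.g. "M001C1001_U".
import re

NAME_RE = re.compile(r"^(?P<gender>[MF])(?P<subject>\d{3})"
                     r"(?P<ref_cam>C[1-9])(?P<lights>[01]{3})"
                     r"_(?P<acq_cam>[URDL]|C[1-9])$")

def parse_clip_name(name: str) -> dict:
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected clip name: {name}")
    return m.groupdict()

print(parse_clip_name("M001C1001_U"))
# {'gender': 'M', 'subject': '001', 'ref_cam': 'C1', 'lights': '001', 'acq_cam': 'U'}
```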
We explain the contents of the videos by showing some thumbnails for different cases.
Figure 8: Folder structure of THU multi-view face database.
Table 2: A short summary of the THU multi-view face database

| Number of subjects | 100 (65 males, 35 females) |
| Origin of subjects | China (70), Pakistan (3), Azerbaijan (2), Germany (2), Kazakhstan (2), Myanmar (2), Russia (2), Singapore (2), South Africa (2), UK (2), USA (2), Albania (1), Colombia (1), Costa Rica (1), Finland (1), Kyrgyzstan (1), The Netherlands (1), Panama (1), Turkey (1), Ukraine (1) |
| Ethnic backgrounds | Chinese (74), Central Asian (3), Caucasian (3), South Asian (4), South East Asian (2), White European (9), Black African (2), White American (1), Latin American (2) |
| Variables | Actions, illuminations, facial expressions, multiple views |
| Total video clips | 31,500 (6,300 groups) |
In Fig. 9, the thumbnails were extracted at the same instant from each of the five videos (C5 plus the four acquisition cameras) under illumination 100; the views of the subject from five different viewpoints are shown. With the same subject as in Fig. 9, the thumbnails in Fig. 10 demonstrate the videos from camera C5 with the same action under the 7 different illuminations. It can be observed that the shadows cast under the different illumination conditions have a visible impact on the appearance of the face as well as the background.
Figure 9: Thumbnails of multiple views from one camera group under the same illumination (001), where the reference camera is C5; panels (a)-(e) show the views from cameras C5, D, L, R, and U.
Figure 10: Thumbnails from C5 with the same action under the 7 illumination conditions; panels (a)-(g) correspond to illuminations 001, 010, 011, 100, 101, 110, and 111.
Figure 11: Thumbnails of different actions from one video under the same illumination, taken from the same camera.
In Fig. 11, the thumbnails are extracted from several different instants within one video clip in order to exhibit the various actions taken, where the illumination is 100 and the camera is C5. These actions are arranged in six parts consisting of head-turning, speaking, holding a mug, reading a book, taking off glasses (where applicable), and performing a facial expression. Head-turning follows the sequence left, right, up, and down, during which all views of a face can be seen. The volunteers were requested to select one of the sentences from the TIMIT Acoustic-Phonetic Database [34] to achieve a good coverage of different lip shapes; these sentences are stored with the corresponding videos as .txt files in the database. Holding a mug and reading a book provide some occlusion of the face, as such actions may appear during a conversational video. At the end of the actions, each volunteer was requested to improvise a facial expression, which may include happiness, anger, surprise, and so on.
5. Evaluation Protocol and Baseline Results

In order to facilitate effective comparisons among different gaze-correction algorithms, we provide an evaluation protocol as a guideline for conducting evaluations on the proposed database; results for three baseline gaze-correction algorithms are given following the protocol's specifications, showing distinctive performance characteristics from which interesting conclusions can be drawn.

5.1. Evaluation Protocol
Figure 12: Workflow of image-based gaze-correction algorithms BM and DP.
Figure 13: Workflow of the model-based gaze-correction algorithm MBGC.

Broadly speaking, the evaluation of a gaze-correction algorithm on the proposed database includes the following steps (a sketch of the resulting evaluation loop is given after this list):

1. Run the algorithm to be evaluated on the appropriate preprocessed input video sequences to synthesize a virtual view according to the camera calibration data, such that the virtual view conforms to that of the reference camera used. Two distinct sets of experiments can be performed, focusing on assessing the performance of the gaze-correction algorithm in the following two cases:
   • varying illumination conditions, where for a fixed action sequence, any combination of illumination conditions (e.g., 001, 101, 111) can be used; and
   • varying occlusion, where for a fixed illumination condition, different action sequences can be selected featuring varying degrees of occlusion.
2. Remove lens distortion from the sequence captured by the reference camera, as performed in [35].
3. Extract the regions-of-interest (ROIs) according to the capabilities of the algorithm (e.g., face region, head-shoulder region, whole image), followed by the computation of an image similarity metric.
4. Characterize the performance of the algorithm by examining its cumulative quality curve obtained from performing the above steps on a set of sequences.
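The sketch below outlines the per-sequence evaluation loop implied by steps 1-4; every helper it takes (the gaze-correction algorithm, the undistortion of [35], the ROI extraction, and the similarity metric) is a hypothetical placeholder rather than a specific implementation.

```python
# Sketch of the evaluation loop implied by steps 1-4. All helpers passed in
# (run_gaze_correction, undistort, extract_roi, similarity) are hypothetical
# placeholders for the algorithm under test and the protocol's processing steps.
import numpy as np

def evaluate_sequence(input_frames, reference_frames, calib,
                      run_gaze_correction, undistort, extract_roi, similarity):
    """Return the per-sequence average quality q used to build the CQC."""
    scores = []
    for inputs, ref in zip(input_frames, reference_frames):
        synth = run_gaze_correction(inputs, calib)   # step 1: synthesize virtual view
        ref = undistort(ref, calib)                  # step 2: undistort reference frame
        score = similarity(extract_roi(synth), extract_roi(ref))  # step 3
        scores.append(score)
    return float(np.mean(scores))

def evaluate_database(sequences, **helpers):
    """Step 4: collect per-sequence qualities over a set of sequences."""
    return np.array([evaluate_sequence(*seq, **helpers) for seq in sequences])
```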
It is worth noting that not all gaze-correction algorithms have the flexibility to generate arbitrary virtual camera views according to the actual reference camera placements. Many assume the synthesis of a "center view" from two "side views". In this case, off-center reference cameras cannot be used (effectively, the algorithm cannot be applied to the case where the conversational video is windowed and placed away from the center of the screen); furthermore, due to inevitable manufacturing inaccuracies and disturbances, we cannot guarantee that the center reference camera is at the exact center of the up/down or left/right camera pairs. This misalignment will add noise to the similarity metric and pollute the evaluation results. In order to reduce such influences and to ensure fair comparisons, for such algorithms the homography between the input images and the reference image can be pre-computed by imaging a reference grid; according to the obtained homography, each input image can be warped into the same perspective as the reference image, such that the resultant synthetic image and the reference image are in the same position (see Fig. 14).
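A minimal sketch of this homography-based pre-alignment, assuming corresponding grid points in the input and reference views have already been located:

```python
# Sketch of the homography pre-alignment: estimate a homography from grid
# points seen by an input camera to the same points seen by the reference
# camera, then warp the input image into the reference perspective.
import cv2
import numpy as np

def warp_to_reference(input_img, grid_pts_input, grid_pts_reference):
    """grid_pts_*: Nx2 arrays of corresponding grid corners (N >= 4)."""
    H, _ = cv2.findHomography(
        np.asarray(grid_pts_input, np.float32),
        np.asarray(grid_pts_reference, np.float32),
        method=cv2.RANSAC, ransacReprojThreshold=3.0)
    h, w = input_img.shape[:2]
    return cv2.warpPerspective(input_img, H, (w, h))
```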
Figure 14: The rectified images from one camera group. (a) View from camera U; (b) view from camera D; (c) view from camera L; (d) view from camera C5; (e) view from camera R.
In the interest of a clear, well-rounded characterization and easy comparison, we propose the use of the cumulative quality curve (CQC) to study the performance of algorithms on our database. If the quality of gaze correction, as measured by a similarity metric averaged over a given test video sequence, is taken as a random variable q, then the CQC is simply 1 − CDF(q), where CDF(·) stands for the cumulative distribution function of a random variable, plotted on a reversed x-axis. Intuitively, a point (x, y) on this curve signifies that (100 × y) percent of the synthetic images given by the algorithm under evaluation achieved a quality greater than or equal to x.
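A minimal sketch of how a CQC can be computed from a set of per-sequence quality values (PSNR in dB is used here only as an example metric):

```python
# Minimal sketch of the cumulative quality curve: given per-sequence average
# qualities q, CQC(x) = 1 - CDF(x) is the fraction of sequences with quality
# greater than or equal to x, plotted against a reversed x-axis.
import numpy as np

def cqc(qualities, thresholds=None):
    q = np.asarray(qualities, dtype=float)
    if thresholds is None:
        thresholds = np.sort(q)[::-1]          # reversed x-axis
    frac = [(q >= x).mean() for x in thresholds]
    return np.asarray(thresholds), np.asarray(frac)

x, y = cqc([38.2, 41.5, 36.9, 40.1, 39.3])
# pairs of (threshold, fraction): (41.5, 0.2), (40.1, 0.4), ..., (36.9, 1.0)
print(list(zip(x, y)))
```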
5.2. Baseline Experimental Results

Three gaze-correction algorithms were chosen for baseline evaluations and comparison, covering both image-based and model-based methods. Note that videos with reference camera C5 were chosen in the experiments as examples.

Image-based algorithms. Most image-based algorithms share a common framework consisting of image rectification, disparity computation, occlusion handling, and view synthesis (as shown in Fig. 12) [7]. Following the conventions of most existing works, the left and right cameras were used to compute a horizontal disparity; therefore, horizontal rectification of the left and right images is necessary to ensure epipolar alignment. We evaluate two representative methods of disparity computation: a) the modified H. Hirschmuller algorithm (semi-global block matching) [36], which we refer to as BM; and b) the improved dynamic programming algorithm proposed by A. Criminisi et al. [6], which we refer to as DP.
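For orientation, the sketch below shows the disparity-computation stage using OpenCV's semi-global matcher, which implements Hirschmuller's method [36]; the parameter values are illustrative and are not the configuration used for the BM baseline.

```python
# Sketch of the disparity-computation stage with OpenCV's semi-global matcher
# (an implementation of Hirschmuller's method [36]); parameters are illustrative.
import cv2

def horizontal_disparity(left_gray, right_gray, num_disparities=128, block_size=5):
    sgbm = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=num_disparities,       # must be a multiple of 16
        blockSize=block_size,
        P1=8 * block_size * block_size,       # smoothness penalties
        P2=32 * block_size * block_size,
        uniquenessRatio=10,
        speckleWindowSize=100,
        speckleRange=2)
    # StereoSGBM returns fixed-point disparities scaled by 16.
    return sgbm.compute(left_gray, right_gray).astype("float32") / 16.0
```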
A model-based algorithm. An additional model-based gaze-correction algorithm [8] (as shown in Fig. 13), which we refer to as MBGC, was evaluated. It utilizes face tracking and image segmentation techniques to extract the head-and-shoulder region from the input images, and performs landmark-based correspondence between the up and down cameras. While this results in improved face-region synthesis quality due to the increased correspondence accuracy of the tracked features, the quality of the synthesized background often suffers. The single-webcam-based algorithm in [10] is not evaluated here, as it is similar to the one in [8] except for its requirement of a pre-captured frontal face image for each individual.

In our experiments, for simplicity, we chose to use the peak signal-to-noise ratio (PSNR) as our similarity metric. We perform the evaluations with respect to two independently varying factors: illumination and occlusion.
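For completeness, the PSNR is computed over the chosen ROI; the paper does not spell out the formula, so the conventional definition for 8-bit images is assumed here:

$$ \mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{MSE}}, \qquad \mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \bigl( I_{\mathrm{syn}}(i) - I_{\mathrm{ref}}(i) \bigr)^2, $$

where $I_{\mathrm{syn}}$ and $I_{\mathrm{ref}}$ denote the synthesized and reference ROI pixels and $N$ is the number of pixels in the ROI.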
Illumination. We tested two ROIs: a) the face region, which is the convex hull of all facial landmarks (see Fig. 15), and b) the segmented head-and-shoulder region (see Fig. 16). Since the MBGC algorithm only applies to situations where the frontal face is visible, we limited our evaluation set to samples that meet this requirement so that all three algorithms can be compared.
Figure 15: Comparison of synthetic views among the three methods (MBGC, DP, BM) and the reference. (a)-(d) show the original synthetic images; (e)-(h) show the segmented face region.
Figure 16: Comparison of synthetic views among the three methods (MBGC, DP, BM) and the reference. (a)-(d) show the original synthetic images; (e)-(h) show the head-and-shoulder segmentation.
For comparison, we chose four illumination conditions: 001, 011, 101, and 111. For each face, we calculated the PSNR between the synthetic and the reference images under each illumination. The face-only CQCs resulting from our experiments are shown in Fig. 17.
Figure 17: (a) - (c): CQCs in terms of PSNR for BM, DP, and MBGC under 4 illumination conditions. (d) - (f): Comparisons of CQCs of BM, DP, and MBGC under illumination 001, 011, and 111. ROI is face-only.
We can see that for the face-only ROI, the PSNRs under 111 lighting are always better than those under 001, especially for the DP algorithm, where the difference is most pronounced. This nicely illustrates that all three algorithms are negatively affected by uneven illumination. Among the three algorithms, MBGC is the least affected by illumination changes due to its model-based nature: the model-fitting step uses illumination-invariant features, so the correspondences thus extracted are more robust to lighting variations. The image-based algorithms are more sensitive in this regard. Also, the MBGC algorithm is advantaged in the face-only comparison, since it uses an explicit face model to compute the set of point and line matches for view synthesis.

Fig. 18 shows the head-and-shoulder CQCs obtained from our experiments. We can see that the influence of illumination is the same as in the face-only experiment, and that the MBGC algorithm does not strictly outperform the other algorithms; we can see from its CQC that, compared to DP, it produced fewer samples with very high quality and much fewer samples with low quality.
Figure 18: (a) - (c): CQCs in terms of PSNR for BM, DP, and MBGC under 4 illumination conditions. (d) - (f): Comparisons of CQCs of BM, DP, and MBGC under illuminations 001, 011, and 111. ROI is head-and-shoulder.
Figure 19: Comparison of synthesized views between BM (a) and DP (b) with the reference (c), where the action "drinking" is under test.
Therefore, we can conclude that while MBGC may not be as accurate as DP in ideal conditions, it is more robust.

Occlusion. In order to examine the influence of occluding motions on synthesis, we compared the performance of BM and DP using the frontal sequences from the previous experiment, as well as sequences with the motion of "subjects holding a cup as if they are drinking". Since this motion causes the cup to occlude the face, the MBGC algorithm is no longer applicable (see Fig. 19). Since we cannot locate the face region, automatic head-shoulder segmentation also becomes difficult; therefore, we used the whole image as the ROI when comparing the algorithms.

From the CQCs (see Fig. 20) we can see that the influence of illumination on the results is similar to the first experiment, i.e., uniform illumination gives the best results. The effect of occlusion is better examined by fixing the illumination at 111 and comparing the performance of the algorithms between the frontal sequences and the "drinking" sequences (see Fig. 21). We see that while DP always gives better performance than BM, it is not particularly more robust against occlusions, especially in this case where the cup is at a significantly higher parallax than the face.
Figure 20: CQCs in terms of PSNR for BM and DP, with 2 different illuminations. ROI is the whole image.
6. Availability of the Database

Free samples of 5 volunteers (F029, F033, F035, F038 and M052) can be downloaded at www.facedbv.com; since the total size of the database is too large to be stored on the web host, please contact the corresponding author if you would like to order it to be delivered on a physical medium.
Figure 21: CQCs in terms of PSNR for BM and DP, for “Frontal” and “Drink” actions, under 111 illumination. ROI is the whole image.
7. Conclusion

In this paper we have presented a multi-view face video database and an evaluation protocol for gaze-correction methods, including image-based [6, 7] and model-based [8] ones. These methods have been evaluated in terms of PSNR against the reference sequences acquired from the reference cameras. Our evaluation is based on a large collection of geometrically and color-calibrated video sequences together with a specified evaluation procedure. Different performances can be observed among these methods, and the impact of illumination and occlusion on the synthesized results can be quantified as well. It can be seen that MBGC is more robust against variations of illumination, while DP outperforms the other two methods when there is noticeable occlusion in front of the face. In addition to gaze-correction algorithms, other face analysis tasks such as gaze estimation, face tracking, cross-race effects, stereo matching, and structural similarity may also be evaluated with the proposed database.
Acknowledgment

During the preparation of this database, several members of our group provided valuable assistance. Specifically, Tianyou Zhou and Weihua Bao helped design the GUI and the camera control system; Jichuan Lu helped implement the camera color calibration; and undergraduate students Zizhuo Zhang and Kai Gu helped design the flood light control program. We highly appreciate their excellent work.

We would also like to thank Wenjie Ye, Shengming Yu and Jingwen Cheng (Xidian University, China) for their help in conducting the experiments. Last but not least, we express our deepest gratitude to all volunteers who participated in the video collection.
References

[1] N. Atzpadin, P. Kauff, O. Schreer, Stereo Analysis by Hybrid Recursive Matching for Real-Time Immersive Video Conferencing, IEEE Trans. Circuits Syst. Video Technol. 14 (2004) 321–334. doi:10.1109/TCSVT.2004.823391.
[2] K. Hung, Y.-T. Zhang, Implementation of a WAP-Based Telemedicine System for Patient Monitoring, IEEE Trans. Inf. Technol. Biomed. 7 (2003) 101–107. doi:10.1109/TITB.2003.811870.
[3] C. Luo, W. Wang, J. Tang, J. Sun, J. Li, A Multiparty Videoconferencing System Over an Application-Level Multicast Protocol, IEEE Trans. Multimedia 9 (2007) 1621–1632. doi:10.1109/TMM.2007.907467.
[4] M. Ponec, S. Sengupta, M. Chen, J. Li, P. A. Chou, Optimizing Multi-Rate Peer-to-Peer Video Conferencing Applications, IEEE Trans. Multimedia 13 (2011) 856–868. doi:10.1109/TMM.2011.2161759.
[5] J. Jansen, P. Cesar, D. C. A. Bulterman, T. Stevens, I. Kegel, J. Issing, Enabling Composition-Based Video-Conferencing for the Home, IEEE Trans. Multimedia 13 (2011) 869–881. doi:10.1109/TMM.2011.2159369.
[6] A. Criminisi, J. Shotton, A. Blake, P. H. S. Torr, Gaze Manipulation for One-to-one Teleconferencing, Proc. IEEE ICCV'03 1 (2003) 191–198. doi:10.1109/ICCV.2003.1238340.
[7] S.-B. Lee, I.-Y. Shin, Y.-S. Ho, Gaze-corrected View Generation Using Stereo Camera System for Immersive Videoconferencing, IEEE Trans. Consum. Electron. 57 (2011) 1033–1040. doi:10.1109/TCE.2011.6018852.
[8] R. Yang, Z. Zhang, Eye Gaze Correction with Stereovision for Video-Teleconferencing, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004) 956–960. doi:10.1109/TPAMI.2004.27.
[9] C. Kuster, T. Popa, J. C. Bazin, C. Gotsman, M. Gross, Gaze Correction for Home Video Conferencing, ACM Trans. Graph. (Proc. of ACM SIGGRAPH ASIA) 31 (2012) 174:1–174:6. doi:10.1145/2366145.2366193.
[10] D. Giger, J.-C. Bazin, C. Kuster, T. Popa, M. Gross, Gaze Correction with a Single Webcam, Proc. IEEE ICME'14 (2014) 1–6. doi:10.1109/ICME.2014.6890306.
[11] S. Pigeon, L. Vandendorpe, The M2VTS Multimodal Face Database (Release 1.00), Proc. AVBPA'97 1206 (1997) 403–409. doi:10.1007/BFb0016021.
[12] P. N. Belhumeur, J. P. Hespanha, D. J. Kriegman, Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projections, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 711–720. doi:10.1109/34.598228.
[13] A. M. Martínez, R. Benavente, The AR Face Database, Technical Report #24, Computer Vision Center (CVC).
[14] D. B. Graham, N. M. Allinson, Characterizing Virtual Eigensignatures for General Purpose Face Recognition, Face Recognition: From Theory to Applications, NATO ASI Series F, Computer and Systems Sciences 163 (1998) 446–456.
[15] K. Messer, J. Matas, J. Kittler, J. Luettin, XM2VTSDB: the Extended M2VTS Database, Proc. AVBPA'99 (1999) 72–77.
[16] T. Kanade, J. F. Cohn, Y. Tian, Comprehensive Database for Facial Expression Analysis, Proc. IEEE FG'00 (2000) 484–490. doi:10.1109/AFGR.2000.840611.
[17] P. J. Phillips, H. Moon, S. A. Rizvi, P. J. Rauss, The FERET Evaluation Methodology for Face-Recognition Algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 1090–1104. doi:10.1109/34.879790.
[18] E. Marszalec, B. Martinkauppi, M. Soriano, M. Pietikäinen, A Physics-Based Face Database for Color Research, J. Electron. Imaging 9 (2000) 32–38. doi:10.1117/1.482722.
[19] A. S. Georghiades, P. N. Belhumeur, D. Kriegman, From Few to Many: Generative Models for Recognition under Variable Pose and Illumination, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001) 643–660. doi:10.1109/34.927464.
[20] T. Sim, S. Baker, M. Bsat, The CMU Pose, Illumination, and Expression Database, IEEE Trans. Pattern Anal. Mach. Intell. 25 (2003) 1615–1618. doi:10.1109/TPAMI.2003.1251154.
[21] B. Weyrauch, J. Huang, B. Heisele, V. Blanz, Component-based Face Recognition with 3D Morphable Models, Proc. IEEE CVPRW'04 (2004) 85–85. doi:10.1109/CVPR.2004.41.
[22] N. A. Fox, B. A. O'Mullane, R. B. Reilly, VALID: A New Practical Audio-Visual Database, and Comparative Results, Proc. AVBPA'05 3546 (2005) 777–786. doi:10.1007/11527923_81.
[23] W. Gao, B. Cao, S. Shan, D. Zhou, X. Zhang, D. Zhao, The CAS-PEAL Large-Scale Chinese Face Database and Baseline Evaluations, IEEE Trans. Syst., Man, Cybern. A 38 (2008) 149–161. doi:10.1109/TSMCA.2007.909557.
[24] Z. Huang, S. Shan, R. Wang, H. Zhang, S. Lao, A. Kuerban, X. Chen, A Benchmark and Comparative Study of Video-Based Face Recognition on COX Face Database, IEEE Trans. Image Process. 24 (2015) 5967–5981. doi:10.1109/TIP.2015.2493448.
[25] R. Gross, I. Matthews, J. Cohn, T. Kanade, S. Baker, Multi-PIE, Image and Vision Computing 28 (2010) 807–813. doi:10.1016/j.imavis.2009.08.002.
[26] S. Milborrow, J. Morkel, F. Nicolls, The MUCT Landmarked Face Database, Proc. Twenty-First Annual Symposium of the Pattern Recognition Association of South Africa (PRASA'10).
[27] S. Wang, Z. Liu, S. Lv, Y. Lv, G. Wu, P. Peng, F. Chen, X. Wang, A Natural Visible and Infrared Facial Expression Database for Expression Recognition and Emotion Inference, IEEE Trans. Multimedia 12 (2010) 682–691. doi:10.1109/TMM.2010.2060716.
[28] M. Yeasin, B. Bullot, R. Sharma, Recognition of Facial Expressions and Measurement of Levels of Interest from Video, IEEE Trans. Multimedia 8 (2006) 500–508. doi:10.1109/TMM.2006.870737.
[29] P. Nair, A. Cavallaro, 3-D Face Detection, Landmark Localization, and Registration Using a Point Distribution Model, IEEE Trans. Multimedia 11 (2009) 611–623. doi:10.1109/TMM.2009.2017629.
[30] M. Barnard, P. Koniusz, W. Wang, J. Kittler, S. M. Naqvi, J. Chambers, Robust Multi-Speaker Tracking via Dictionary Learning and Identity Modeling, IEEE Trans. Multimedia 16 (2014) 864–880. doi:10.1109/TMM.2014.2301977.
[31] Y. Moses, Y. Adini, S. Ullman, Face Recognition: The Problem of Compensating for Changes in Illumination Direction, Proc. ECCV'94 800 (1994) 286–296. doi:10.1007/3-540-57956-7_33.
[32] J. Jung, Y. Ho, Color Correction Method Using Gray Gradient Bar for Multi-View Camera System, Proc. IWAIT'09 (2009) MP.C4(1–6).
[33] J.-Y. Bouguet, Camera Calibration Toolbox for Matlab. URL http://www.vision.caltech.edu/bouguetj/calib_doc/
[34] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, V. Zue, TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1, Philadelphia: Linguistic Data Consortium.
[35] Y.-S. Kang, Y.-S. Ho, Geometrical Compensation for Multi-view Video in Multiple Camera Array, Proc. IEEE ELMAR'08 (2008) 83–86.
[36] H. Hirschmuller, Stereo Processing by Semiglobal Matching and Mutual Information, IEEE Trans. Pattern Anal. Mach. Intell. 30 (2008) 328–341. doi:10.1109/TPAMI.2007.1166.