Eye-gaze estimation under various head positions and iris states




Expert Systems with Applications 42 (2015) 510–518



Review

Reza Jafari, Djemel Ziou

Département d'informatique, Université de Sherbrooke, Sherbrooke, QC J1K 2R1, Canada

Article history: Available online 22 August 2014
Keywords: Gaze estimation; Bayesian logistic regression; Kinect

Abstract

This paper describes a method for eye-gaze estimation under normal head movement. In this method, head position and orientation are acquired from Kinect depth data and eye direction is obtained from high-resolution images. We propose a Bayesian multinomial logistic regression based on a variational approximation to construct the gaze mapping function and to verify the iris state. Our method removes the restrictions on head movement, eye closure and light source that are common drawbacks of most conventional techniques. The efficiency of the proposed method is validated by a performance evaluation on multiple people at different distances and poses with respect to the camera, under various eye states.

© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Eye-gaze estimation deals with estimating the line of sight of a person. The eyes are the windows to the soul (William Shakespeare); therefore, one of the logical steps towards understanding human behavior and motivation should involve the study of eye-gaze tracking. Eye-gaze mapping is important for many applications, such as pedophilia assessment, training, and marketing. Indeed, it has been shown that, in the presence of children, the gaze map of a pedophile is different from that of a person who is not (Renaud et al., 2009). During the training of surgeons, the eye-gaze is recorded and analysed for a more reliable assessment of surgical skill (Law, Atkins, Kirkpatrick, & Lomax, 2004). In marketing, eye-gaze is used to determine which features of a product attract the buyer's attention (Khushaba et al., 2013).

In this paper, we present a method for eye-gaze estimation that robustly detects the location and orientation of a person's head from the depth data obtained by the Kinect. There are two cameras embedded in the Kinect, one operating in the visible spectrum (RGB) and the other in the infrared (IR). Unfortunately, the resolution of the Kinect RGB camera is too low to obtain iris images. Hence, another camera is used simultaneously with the IR Kinect camera to acquire high-resolution eye images. In order to estimate eye-gaze, we use the iris center and the reference point provided by the IR camera. However, the iris center and the reference point vary significantly with head position and orientation. This prompts us to consider head orientation and location in the gaze mapping model.


To do so, the head location and orientation in 3D space are calculated from the Kinect. Then, the resulting measures are used in an eye-gaze mapping function. Since a gaze mapping function cannot be assumed beforehand, variational Bayesian multinomial logistic regression (VBMLR) is used as a model to estimate it. Most gaze estimators work only when the eye is open. However, blinking is a necessity for humans. Therefore, in order to estimate the iris center we need to detect the state of the eyes (i.e. whether they are open or closed). For this purpose, we introduce a method that uses the Histogram of Oriented Gradients (HOG) and VBMLR to detect whether the eyes are open or closed. Preliminary results of the gaze estimation model were reported in Jafari and Ziou (2012a, 2012b). In the early version of this work we used a PTZ camera, and we did not provide a detailed description of the proposed model, such as the iris state verification and the comparison between different mapping functions.

Our paper is organized as follows. We start by summarizing related work in Section 2. In Section 3, the suggested eye-gaze scheme is explained. Section 4 presents the experimental results and Section 5 the conclusion.

2. Related work

Many traditional techniques for eye-gaze estimation require equipment to be placed in physical contact with the user, such as contact lenses, electrodes, and head-mounted devices. The resulting situation can be massively inconvenient and uncomfortable for the user. Due to recent advances in computer and video camera technology, eye-gaze estimation based on digital video analysis has been widely investigated (Hansen & Ji, 2010).


Since it does not need anything attached to the user, video technology opens the most encouraging direction for building a remote eye-gaze tracker. Remote systems can be classified into two categories (Guestrin & Eizenman, 2006; Miyake, Haruta, & Horihata, 2005; Miyake et al., 2009; Morimoto & Mimica, 2005; Villanueva et al., 2009): interpolation-based gaze estimation and 3D model-based gaze estimation. Interpolation methods use general-purpose equations, such as linear or quadratic polynomials, to map the image data to gaze coordinates. 3D model-based techniques, on the other hand, directly compute the gaze direction based on a geometric model of the eye. All of these methods require a calibration procedure, in which the user is asked to look at certain points on the screen, to compute some parameters. Moreover, most of these techniques use a reference point to estimate the gaze direction. The reference point can be generated, for example, by a paper marker stuck to the face (Miyake et al., 2005; Miyake, Asakawa, Yoshida, Imamura, & Zhang, 2009). Unfortunately, such methods are cumbersome for users. Corneal reflection, or glint, is another well-known way to create a reference point. The glint is generated by an active light source on the cornea surface, and the vector from the glint to the center of the iris describes the gaze direction (Guestrin & Eizenman, 2006; Morimoto & Mimica, 2005; Villanueva et al., 2009). These approaches have problems with changes in lighting conditions, the reflection of light sources on glasses, an awkward calibration process and a limitation on the distance between the user and the camera. Moreover, the small head motions tolerated by such devices have a considerable influence on their accuracy; therefore experiments are usually done using a chin rest to restrict head motion, which greatly reduces the user's comfort.

Usually, a person moves the head to a comfortable position before turning the eyes. Therefore, the 3D head pose needs to be modeled and integrated within the gaze estimation algorithm. In the past years, 3D human pose detection and tracking from depth images acquired with Time-of-Flight (ToF) cameras has been investigated (Diraco, Leone, & Siciliano, 2013; Almansa-Valverde, Castillo, & Fernández-Caballero, 2012; Zhu, Dariush, & Fujimura, 2008). Unlike 2D intensity images, depth images are robust to color and illumination changes. Unfortunately, existing commercial ToF cameras such as the Swiss Ranger SR4000 (C.C.S. d'Electronique, 2009) and PMD Tech (Video Sensor, 2009) are quite expensive and low resolution. Fortunately, Microsoft has launched the Kinect, which is cheap and easy to use. Recently, EyeCharm has used the Microsoft Kinect for eye tracking (Eye tracking gets kickstarted, 2013). Since the resolution of the Kinect RGB camera is too low to obtain eye images and EyeCharm uses the Kinect RGB camera for eye detection, this system has a limitation on the distance between the Kinect and the user (about 80 cm). Unlike the systems mentioned above, Tobii (Be first to get, 2013) and GazePoint (Products, 2014) offer low-cost eye trackers on the market. Unfortunately, these systems do not consider head movements in the mapping function and therefore restrict changes in head position and orientation.

In this study we propose a novel method to overcome the common drawbacks that most existing gaze tracking systems share. First, by using the Kinect we avoid the physical markers and corneal reflections that are very sensitive to distance and lighting conditions.
Second, with the web camera and a 3D reference point, we overcome the head movement and distance limitations, so that the user can move and rotate his/her head freely in front of the camera. Third, since blinking is a physiological necessity for humans, our gaze estimator verifies the eye state in order to detect whether the iris is visible or not. Fourth, we propose a discriminative Bayesian formalism for the estimation of the eye-gaze mapping, which eliminates the individual calibration procedure of classical methods. Finally, the proposed method uses low-cost devices (a web camera and the Kinect) for eye-gaze estimation.


3. Proposed method

The architecture of the proposed eye-gaze system includes a Kinect sensor and a Logitech HD web camera acquiring the same scene. The Logitech web camera is a low-cost camera that pans and tilts to automatically track the user's face in the visible spectrum with high-resolution images. The Microsoft Kinect is also a low-cost peripheral, used as a game controller with the Xbox 360 game system. The basic principle of the Kinect's depth calculation is stereo matching, which requires one image captured by an infrared camera and another captured in the visible spectrum of the scene. In addition to these cameras, we use the head tracker system of the Kinect, which continuously computes the head location and orientation from the depth data. Note that the location and orientation are expressed in terms of the Kinect reconstructed geometry in Cartesian space, as shown in Fig. 1. This system uses random regression forests to estimate the 3D head pose in real time (Fanelli, Weise, Gall, & Gool, 2011). It basically learns a mapping between simple depth features and real-valued parameters such as the 3D head position and rotation angles. Fig. 2 shows the location and orientation of a head estimated using the Kinect. Since the RGB image of the Kinect is of low resolution, another camera is concurrently used to detect the observer's irises.

Our eye-gaze estimation method is based on the relative displacement of the irises with respect to a reference point. The reference point can be extracted via corneal reflection, the eye corners or a physical mark on the face. Unfortunately, these methods need a specialized hardware device or individual calibration, or are not user friendly (Hansen & Ji, 2010). In contrast, the reference point can be simply measured via the depth information of the head, as we explain in the next section. As shown in Fig. 3, the displacement of the irises is quantified by the RM vector, where R and M are the reference point and the midpoint between the right and left iris centers, respectively. The RM vector is expressed by two components, d_x and d_y, measured along the horizontal and vertical axes. Thus eye-gaze can be estimated by comparing the d_x and d_y components with predetermined threshold values obtained while a person looks at different target points. Here, we assume that the person is stationary in a specific location; in Section 3.1 we extend our method to deal with face movement and rotation. The reference point is calculated from the Kinect depth images, as described in Section 3.4. We verify the iris state in order to detect whether the iris is visible or not, and then the iris center is calculated; these two steps are described in Sections 3.2 and 3.3, respectively.
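As an illustration of this displacement-based rule, the following minimal Python sketch (illustrative only, not the system's actual implementation; function names and threshold values are hypothetical) computes d_x and d_y from the two iris centers and the reference point and maps them to a coarse gaze region for a stationary head.

```python
import numpy as np

def gaze_displacement(left_iris, right_iris, reference):
    """Compute the RM vector components (d_x, d_y): displacement of the
    midpoint M of the two iris centers from the reference point R
    (all coordinates in image pixels)."""
    m = (np.asarray(left_iris, float) + np.asarray(right_iris, float)) / 2.0
    d = m - np.asarray(reference, float)
    return d[0], d[1]  # d_x, d_y

def coarse_gaze_region(d_x, d_y, tx=6.0, ty=4.0):
    """Threshold rule for a stationary head: split the gaze space into a
    3 x 3 grid (left/center/right and up/center/down).  The thresholds
    tx, ty are hypothetical pixel values fixed beforehand."""
    col = 0 if d_x < -tx else (2 if d_x > tx else 1)
    row = 0 if d_y < -ty else (2 if d_y > ty else 1)
    return row * 3 + col  # region index in 0..8

# Example with hypothetical iris centers and reference point (pixels).
dx, dy = gaze_displacement((312, 240), (372, 242), (345, 236))
print(dx, dy, coarse_gaze_region(dx, dy))
```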

Fig. 1. Kinect coordinate system.



Fig. 2. 3D head tracker for estimating location and orientation of a person’s head.

Fig. 3. Displacement of irises.

3.1. Gaze estimation

The user's gaze point can be accurately estimated based on the extracted RM vector when the user does not move his/her head significantly. However, if the head changes its position and orientation, the eye-gaze method will fail to estimate the target point because the vector RM (Fig. 3) changes. To overcome this problem, in addition to the iris displacement d_x and d_y, six other parameters T_x, T_y, T_z, R_x, R_y and R_z, which account for head position and orientation, should be considered for eye-gaze estimation. The parameters [T_x, T_y, T_z] and [R_x, R_y, R_z] are the 3D head position and head orientation in the Kinect coordinate system, respectively. Given the eight parameters affecting gaze, we now need to determine the mapping function that maps these parameters to the actual gaze.

(Hansen & Ji, 2010), therefore we propose a gaze mapping scheme based on a supervised classifier. See Fig. 4 for an overview of our gaze estimation system. In this study, one’s gaze space is subdivided into l different target regions. We introduce different classes with varying head position/orientation and eye direction for each target region. Let us consider the set of complete data D ¼ fðX0 ; y0 Þ; . . . ; ðXl ; yl Þg, where Xi is the set of gaze features of the ith class. We assume that Xi is generated from known probability density functions (pdfs) qi ðxÞ where x is the feature vector of Xi . The vector y is equal to ½y0 ; . . . ; yl T , where yi ¼ 1 if x belong to class i. We need to estimate the statistical model enabling the discrimination between classes. Experiments in previous work show that the performance of the variational Bayesian multinomial logistic regression (VBLR) outperforms many well established discrimination based algorithms such as the Support Vector Machine (SVM), Relevance Vector Machine (RVM), Bayesian Logistic Regression Model (BLRM), Informative Vector Machine (IVM), and Logistic Regression Model (LRM) (Ksantini, Ziou, Colin, & Dubeau, 2008). We propose to extend the VBLR to the multinomial (i.e, multi classes) case and call it variational Bayesian multinomial logistic regression (VBMLR) (Ziou & Jafari, 2014). The multinomial logistic regression consists of estimating the statistical model h ¼ ðh1 ; . . . ; hl Þ that discriminates each class from the baseline class (Krishnapuram, Carin, Figueiredo, & Hartemink, 2005). The probability of the binary variables is given by:

P(y_i = 1 \mid \theta_i, x) = \frac{\exp(\theta_i^T x)}{1 + \sum_{j=1}^{l} \exp(\theta_j^T x)} \qquad (1)

Fig. 4. Gaze estimation system overview.
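For concreteness, Eq. (1) can be evaluated as in the following Python sketch (illustrative only; `theta` stacks the l class parameter vectors and class 0 is the implicit baseline), with the maximum score subtracted for numerical stability.

```python
import numpy as np

def class_probabilities(theta, x):
    """Multinomial logistic probabilities of Eq. (1).
    theta : (l, d) array, one parameter vector per non-baseline class.
    x     : (d,) feature vector.
    Returns (l + 1,) probabilities: baseline class first, then classes 1..l."""
    scores = theta @ x                      # theta_i^T x, i = 1..l
    m = scores.max()                        # shift for numerical stability
    expo = np.exp(scores - m)
    denom = np.exp(-m) + expo.sum()         # shifted form of 1 + sum_i exp(.)
    return np.concatenate(([np.exp(-m) / denom], expo / denom))

theta = np.array([[0.2, -0.1], [0.4, 0.3]])   # l = 2 classes, d = 2 features
print(class_probabilities(theta, np.array([1.0, 2.0])))
```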


The maximum likelihood is a popular estimator of the parameter vector θ_i. However, collinearity, separability, the existence of many zero explanatory variables, and overfitting are its drawbacks. A Bayesian formulation of logistic regression was proposed in Ksantini et al. (2008) to prevent such weaknesses, in which a variational approximation of the posterior is provided. The parameter vector θ is composed of the parameters of the approximated posterior. However, this model has to be extended to deal with the regression of several classes. Fortunately, the multinomial logistic regression can be implemented by individual logistic regressions comparing each class with a baseline class (Begg & Gray, 1984). Let p(θ_i) be a Gaussian prior with mean μ and covariance Σ; we need to find the parameter vector θ_i maximizing the posterior probability p(θ_i | y_0 = 0, y_i = 1), given by:

\max_{\theta_i} p(\theta_i \mid y_0 = 0, y_i = 1) \propto p(\theta_i) \sum_{x \in X_i} \prod_{k \in \{0, i\}} p(y_k = k \mid x, \theta_i)\, q_k(x) \qquad (2)

where p(θ_i) is the prior. However, in the case of multidimensional data, the estimation of θ_i fails because of insufficient numerical precision when computing the exponential function in Eq. (2). Fortunately, a variational approximation based on Jensen's inequality can be used to approximate the posterior:

P(\theta_i \mid y_0 = 0, y_i = 1) \propto p(\theta_i) \prod_{k \in \{0, i\}} F(\xi_k)\, e^{\left(E_{q_k}(H_k) - \xi_k\right)/2 - \varphi(\xi_k)\left(E_{q_k}(H_k^2) - \xi_k^2\right)} \qquad (3)

where H_k = (2y_k − 1)θ_i^T x_k, E_{q_k}(·) denotes the expectation with respect to q_k, φ(ξ_k) = tanh(ξ_k/2)/(4ξ_k), and ξ_k is a variational parameter. The approximation of the posterior above is a Gaussian with posterior mean μ_i^post and posterior covariance Σ_i^post given by:

\Sigma_i^{post} = \left( \Sigma^{-1} + 2 \sum_{k \in \{0, i\}} \varphi(\xi_k)\, E_{q_k}(x_k x_k^T) \right)^{-1} \qquad (4)

\mu_i^{post} = \Sigma_i^{post} \left( \Sigma^{-1} \mu + \sum_{k \in \{0, i\}} (y_k - 0.5)\, E_{q_k}(x_k) \right) \qquad (5)

\xi_k^2 = E_{q_k}\!\left(x_k^T \Sigma_i^{post} x_k\right) + (\mu_i^{post})^T E_{q_k}(x_k x_k^T)\, \mu_i^{post}, \quad k \in \{0, i\} \qquad (6)

The parameter vector θ_i is the posterior mean vector μ_i^post. The mathematical derivation of the above formulae and other details can be found in Ksantini et al. (2008).
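The updates in Eqs. (4)-(6) can be iterated until convergence. The following sketch is an illustrative paraphrase of these update rules (not the exact implementation); it assumes the class-conditional expectations E_{q_k}(x) and E_{q_k}(x x^T) have already been estimated from the training samples of the baseline class (k = 0, y_0 = 0) and of class i (k = i, y_i = 1).

```python
import numpy as np

def phi(xi):
    """phi(xi) = tanh(xi / 2) / (4 * xi); xi is kept away from 0 because the
    expression has the finite limit 1/8 there."""
    xi = max(abs(float(xi)), 1e-8)
    return np.tanh(xi / 2.0) / (4.0 * xi)

def vb_logistic_updates(mu0, Sigma0, Ex, Exx, y, n_iter=50):
    """Iterate Eqs. (4)-(6) for one binary regression (class i vs. baseline).
    mu0, Sigma0 : prior mean (d,) and covariance (d, d).
    Ex   : dict k -> E_{q_k}(x),        shape (d,)
    Exx  : dict k -> E_{q_k}(x x^T),    shape (d, d)
    y    : dict k -> label (0 for the baseline class, 1 for class i)."""
    Sigma0_inv = np.linalg.inv(Sigma0)
    xi = {k: 1.0 for k in Ex}                      # variational parameters
    for _ in range(n_iter):
        # Eq. (4): posterior covariance
        prec = Sigma0_inv + 2.0 * sum(phi(xi[k]) * Exx[k] for k in Ex)
        Sigma_post = np.linalg.inv(prec)
        # Eq. (5): posterior mean
        mu_post = Sigma_post @ (Sigma0_inv @ mu0 +
                                sum((y[k] - 0.5) * Ex[k] for k in Ex))
        # Eq. (6): update the variational parameters
        for k in Ex:
            xi[k] = np.sqrt(np.trace(Sigma_post @ Exx[k]) +
                            mu_post @ Exx[k] @ mu_post)
    return mu_post, Sigma_post   # mu_post plays the role of theta_i
```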

3.2. Iris verification

As previously stated, the iris is used to estimate the eye-gaze. However, if the eye is closed, the iris is not visible (Fig. 5) and the gaze cannot be estimated. Therefore, the iris state should be detected and tracked for each frame; in fact, eye-gaze is estimated only for the frames in which an iris has been detected. In this section we propose a method for detecting the iris state. We use the Histogram of Oriented Gradients (HOG) to obtain the iris features and the VBMLR as a classifier. Hence, a dual-state iris model is introduced to detect the two iris states, "iris" or "non-iris".

Fig. 5. Closed eyes.

Note that the VBMLR is used both for mapping gaze features to gaze space and for the detection of the iris. Histograms of orientations and the statistics derived from them have proven to be effective image representations in various tasks such as object detection and recognition (Dalal & Triggs, 2005; Kobayashi & Akinori, 2008). The main idea is that the distribution of intensity gradients describes local shape well even if the location of the edge is not used. HOG is implemented by dividing the image into small cells, and for each cell a local histogram of gradient directions is constructed. In the algorithm proposed in Dalal and Triggs (2005), cell sizes are 8 × 8 pixels and each cell is described by a 9-bin (i.e. 360°/9) histogram of oriented gradients. Each group of 4 (2 × 2) cells is integrated into one block, so that each block is represented by a 4 × 9 feature vector. The combined histogram entries form the HOG representation of the given image. During iris center extraction, which is described in Section 3.3, the iris region is cropped automatically via the center of the iris candidate and its radius. In the next step, a feature vector is extracted from the cropped iris candidate based on HOG (Dalal & Triggs, 2005). The feature vector is then provided to the trained VBMLR for classification (see Section 4.1), which classifies the input vector into the iris class or the non-iris class.
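As an illustration of this feature extraction step, the following sketch relies on scikit-image's `hog` function rather than a dedicated implementation, and assumes the cropped iris candidate is a grayscale patch of roughly 30 × 30 pixels (Section 4.1); the fixed resize step is an added assumption so that every candidate yields a descriptor of constant length.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def iris_hog_features(patch):
    """HOG descriptor of a cropped grayscale iris candidate.  The patch is
    resized to a fixed size so that every candidate yields a descriptor of
    the same length, then the descriptor is normalized to [0, 1]."""
    patch = resize(patch, (32, 32), anti_aliasing=True)
    feat = hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')
    rng = feat.max() - feat.min()
    return (feat - feat.min()) / rng if rng > 0 else feat

# Example with a random patch standing in for a 30 x 30 cropped candidate.
print(iris_hog_features(np.random.rand(30, 30)).shape)
```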

Given an iris image candidate and the associated feature vector x_T, it is considered an iris if:

P(y_{iris} = 1, y_{non\text{-}iris} = 0 \mid \theta, x_T) > \tau \qquad (7)

It is non-iris otherwise. A k-fold cross-validation is run during the learning phase, and the threshold τ is set to the value which maximizes the tradeoff between false positives and false negatives.
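A simplified version of this threshold selection can be sketched as follows (illustrative only; it sweeps candidate thresholds over validation folds of pre-computed classifier scores P(iris | x) and assumes each fold contains samples of both classes).

```python
import numpy as np

def select_threshold(scores, labels, k=5, candidates=np.linspace(0.1, 0.9, 17)):
    """Pick the threshold tau maximizing the mean success rate
    SR = (TP rate + TN rate) / 2 over k validation folds.
    scores : P(iris | x) returned by the trained classifier.
    labels : 1 for iris samples, 0 for non-iris samples."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    folds = np.array_split(np.random.permutation(len(scores)), k)
    best_tau, best_sr = 0.5, -1.0
    for tau in candidates:
        fold_sr = []
        for fold in folds:
            s, y = scores[fold], labels[fold]
            pred = (s > tau).astype(int)
            tp_rate = (pred[y == 1] == 1).mean()
            tn_rate = (pred[y == 0] == 0).mean()
            fold_sr.append((tp_rate + tn_rate) / 2.0)
        if np.mean(fold_sr) > best_sr:
            best_sr, best_tau = np.mean(fold_sr), tau
    return best_tau

# Example with random scores standing in for classifier outputs.
labels = np.repeat([1, 0], 100)
scores = np.clip(labels * 0.4 + np.random.rand(200) * 0.6, 0, 1)
print(select_threshold(scores, labels))
```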

3.3. Iris center extraction

Recall that our eye-gaze estimation requires the iris center. To find it, we first need to detect the face of the user. To do so, we apply the Viola–Jones object detection framework (Viola & Jones, 2004), in which Haar-like features, which originate from Haar wavelets, are used. More specifically, there are three kinds of Haar-like features. The value of a two-rectangle feature is the difference between the sums of the pixels within two rectangular regions; the regions have the same size and shape and are horizontally or vertically adjacent (see Fig. 6). A three-rectangle feature computes the sum within two outside rectangles subtracted from the sum in a center rectangle. Finally, a four-rectangle feature computes the difference between diagonal pairs of rectangles. Rectangle features are computed using the integral image (Viola & Jones, 2004). The integral image at any point (x, y) contains the sum of the pixels above and to the left of (x, y), such that:

P(x, y) = \sum_{x' \le x,\ y' \le y} p(x', y') \qquad (8)

where P(x, y) is the integral image and p(x, y) is the original image.
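Eq. (8) is what makes rectangle features cheap to evaluate: once the integral image is available, the sum over any rectangle takes four lookups, so a two-rectangle feature costs eight. A short illustrative sketch:

```python
import numpy as np

def integral_image(img):
    """P(x, y) of Eq. (8): sum of all pixels above and to the left of (x, y)."""
    return np.cumsum(np.cumsum(np.asarray(img, float), axis=0), axis=1)

def rect_sum(P, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four integral-image lookups."""
    Pp = np.pad(P, ((1, 0), (1, 0)))          # guard row/column of zeros
    bottom, right = top + height, left + width
    return Pp[bottom, right] - Pp[top, right] - Pp[bottom, left] + Pp[top, left]

def two_rectangle_feature(P, top, left, height, width):
    """Difference between two horizontally adjacent rectangles of equal size."""
    return (rect_sum(P, top, left, height, width)
            - rect_sum(P, top, left + width, height, width))

img = np.random.rand(24, 24)
P = integral_image(img)
print(two_rectangle_feature(P, 4, 4, 8, 6))
```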

The presence of a Haar-like feature is determined by comparing the feature value with a threshold found in the training phase. In order to avoid the low-resolution iris images provided by the Kinect, a separate camera is used to obtain high-resolution eye images. After face detection, the eye position can be roughly estimated and therefore restricted (Castrillon, Deniz, Guerra, & Hernandez, 2007). A Viola–Jones based eye detector is then applied to detect the eyes; note that the eyebrows are also included in the target during training. After eye detection, we calculate the iris center using an iris detection technique based on (Daugman, 1993; Camus & Wildes, 2002). The iris is located at the maximum of the output of a first-order integro-differential operator. This operator is applied, with respect to increasing radius r, to the normalized contour integral of I(x, y) along a circular arc ds of radius r and center coordinates (x_0, y_0):

\max_{(r, x_0, y_0)} \left| G_\sigma(r) * \frac{\partial}{\partial r} \oint_{r, x_0, y_0} \frac{I(x, y)}{2 \pi r}\, ds \right| \qquad (9)

The symbol * denotes convolution and G_σ(r) is a Gaussian of scale σ. The complete operator behaves as a circular edge detector, blurred at a scale set by σ, searching over increasing radius and at successively finer scales of analysis through the spatial parameters (x_0, y_0, r), i.e. the origin coordinates and radius defining the path of contour integration. Fig. 7 shows iris detection in the right and left eyes.

Fig. 7. Iris detection using integro-differential operator.
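A discrete approximation of the operator in Eq. (9) can be sketched as follows (illustrative only, not the implementation used in the system): for each candidate center, the mean intensity along circles of increasing radius is computed, its derivative with respect to the radius is smoothed with a Gaussian of scale σ, and the candidate maximizing the response magnitude is kept.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def circle_mean(img, x0, y0, r, n=64):
    """Mean image intensity sampled along a circle of radius r at (x0, y0)."""
    t = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    xs = np.clip((x0 + r * np.cos(t)).astype(int), 0, img.shape[1] - 1)
    ys = np.clip((y0 + r * np.sin(t)).astype(int), 0, img.shape[0] - 1)
    return img[ys, xs].mean()

def locate_iris(img, centers, radii, sigma=0.5):
    """Return the (x0, y0, r) maximizing the Gaussian-smoothed radial
    derivative of the circular mean intensity, a discrete form of Eq. (9)."""
    best, best_val = None, -np.inf
    for (x0, y0) in centers:
        means = np.array([circle_mean(img, x0, y0, r) for r in radii])
        resp = np.abs(gaussian_filter1d(np.gradient(means, radii), sigma))
        k = int(np.argmax(resp))
        if resp[k] > best_val:
            best_val, best = resp[k], (x0, y0, radii[k])
    return best

# Hypothetical usage on a grayscale eye image `eye` (2D numpy array):
# candidates = [(x, y) for x in range(20, 60, 4) for y in range(20, 60, 4)]
# print(locate_iris(eye, candidates, np.arange(5.0, 20.0, 1.0)))
```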



3.4. Reference point extraction and tracking

As mentioned before, we need a reference point in order to estimate the iris displacement. In this study, we use the head pose as the reference point in the 3D space of the Kinect view. The head pose is continuously estimated in real time by the head tracker system of the Kinect from the depth data (Fanelli et al., 2011). The estimates are robust to occlusions and their accuracy is high. Unlike the physical markers (or corneal reflections) used in traditional techniques, the head pose is not troublesome for the user. Given the head pose R_Kinect = (X_Kinect, Y_Kinect, Z_Kinect)^T, we need to estimate its position R_webcam(3D) = (X_webcam, Y_webcam, Z_webcam)^T in the web camera space. The relationship between these two points is given by:

R_{webcam(3D)} = M \cdot R_{Kinect} + T \qquad (10)

where M and T are the rotation and translation matrices between the two cameras, respectively. The geometric relations of the cameras in our experiments are M = I and T = (5, 10, 25)^T cm. The 2D position of the reference point R, R_webcam(2D) = (x_webcam, y_webcam)^T, in the image acquired by the web camera is obtained by the perspective projection of R_webcam(3D):

R_{webcam(2D)} = K_{webcam} \cdot R_{webcam(3D)} \qquad (11)

where K_webcam is the 3 × 3 intrinsic matrix of the web camera, such that:

K_{webcam} = \begin{pmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{pmatrix}

Fig. 6. Example rectangle features in Viola–Jones object detection (Viola & Jones, 2004). (a) Two-rectangle feature (vertical), (b) two-rectangle feature (horizontal), (c) four-rectangle feature, and (d) three-rectangle feature.

The parameters f_x = f·m_x and f_y = f·m_y represent the focal length f of the camera in terms of pixels, where m_x and m_y are the numbers of pixels per unit distance in each direction. According to the technical specifications reported by the manufacturer, f_x and f_y are equal to 1327 and 1334, respectively, for the web camera. The parameters u_0 and v_0 represent the image center.

Let us recall that the user is not assumed to be stationary. Movements of the head and eyes may lead to the acquisition of images that are not useful by both the web camera and the Kinect. For example, by moving away from the web camera, the images of the eyes acquired by the web camera become blurred and smaller (Deschenes, Ziou, & Fuchs, 2004). It follows that, in order to track a moving user, the direction of the web camera is adjusted automatically.
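Combining Eqs. (10) and (11) with the calibration reported above (M = I, T = (5, 10, 25)^T cm, f_x = 1327, f_y = 1334) gives the pixel position of the reference point in the web camera image. The sketch below is illustrative only; in particular, the values of u_0 and v_0 are assumed here to be the center of a 1920 × 1080 frame, which is not stated above.

```python
import numpy as np

M = np.eye(3)                          # rotation between Kinect and web camera
T = np.array([5.0, 10.0, 25.0])        # translation in cm, as reported above
K = np.array([[1327.0,    0.0, 960.0],     # f_x and u_0 (u_0, v_0 assumed)
              [   0.0, 1334.0, 540.0],     # f_y and v_0
              [   0.0,    0.0,   1.0]])

def project_reference_point(R_kinect):
    """Eq. (10): rigid transform to web-camera space, then Eq. (11):
    perspective projection onto the web-camera image plane (pixels)."""
    R_web3d = M @ np.asarray(R_kinect, float) + T
    p = K @ R_web3d
    return p[:2] / p[2]                # homogeneous -> pixel coordinates

print(project_reference_point([10.0, -5.0, 90.0]))   # hypothetical head position (cm)
```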


Fig. 8. The effect of standard deviation on iris detection.


4. Experimental results

In this section, we present our results on the iris verification and eye-gaze estimation techniques. We validate them by a performance evaluation for different users and for different distances between the user and the cameras. Moreover, we compare the VBMLR against the Support Vector Machine (SVM) with different kernels.

4.1. Performance of iris verification technique

The iris state is continuously verified frame by frame in the proposed method. In fact, eye-gaze is estimated for the frames in which the iris is correctly detected. Our method detects the iris based on Eq. (9). In our experiments, the convolution kernel is a Gaussian with standard deviation σ = 0.5. Fig. 8 presents the effect of the standard deviation (i.e. of the convolution) on the iris detection.

In order to validate our method, the iris state verification approach is tested on a dataset of faces that we acquired during the experimentation. This dataset consists of 3600 frontal face images acquired from 60 different males and females of European, African, and Asian ethnicity, some of them wearing glasses, as shown in Fig. 9. They were recorded in an indoor environment. During recording, the camera was positioned in front of the subjects and provided a full face view. Generally, the size of a cropped iris region is about 30 × 30 pixels. The cropped image data are processed using a 9-bin Histogram of Oriented Gradients and normalized to the [0, 1] range before training. The iris training images are divided into two sets: an iris set and a non-iris set. The iris image set contains different gazes, different degrees of opening, different face poses, and faces with and without glasses. In the second set, the cropped images are non-iris images such as eyebrows, eyelashes, and eye corners. Figs. 10 and 11 contain examples of iris and non-iris images in the training sets, respectively.


After finishing the above step, we apply the VBMLR classifier to verify the iris state. There is one class of iris images and one class of non-iris images. The training set includes 1000 iris and 1000 non-iris images selected randomly from the collection (i.e. 1300 images per class). For the test, the remaining 300 iris images and 300 non-iris images are used. Note that 5-fold cross-validation led us to set the threshold τ in Eq. (7) to 0.5. For the sake of comparison, we used the same experimental protocol with an SVM classifier (Chang & Lin, 2011) using different kernels. From Table 1, we can see that the best accuracy, 95.50%, is obtained with the VBMLR classifier. Accuracy is measured in terms of true positives TP, true negatives TN, and the success rate SR = (TP + TN)/2. True positives (respectively, negatives) are the images that have been identified as iris (respectively, non-iris) knowing that they are iris (respectively, non-iris).

4.2. Performance of gaze estimation technique

Our method is validated by the following experiment. The system consists of a screen (board), the Kinect, the HD web camera and a PC. In this paper, we have used a Logitech web camera which can pan 189° and tilt 102°. It also has a digital zoom and motorized face tracking with a high-resolution 2-megapixel sensor at approximately 30 fps. The Microsoft Kinect can be modified to obtain, simultaneously at 30 Hz, a 640 × 480 pixel monochrome intensity-coded depth map and a 640 × 480 RGB video stream. The screen is subdivided into nine predefined positions of dimension 19 × 17.5 cm² that the subject can see. The Kinect is placed behind the screen. Fig. 12 illustrates the system setup. The features used for learning vary with the different head positions and orientations and eye directions. Therefore, the input vector to the VBMLR is:

v = [d_x, d_y, T_x, T_y, T_z, R_x, R_y, R_z] \qquad (12)

where [d_x, d_y] is the iris displacement, which is estimated using the iris center and the reference point. The features [T_x, T_y, T_z] and [R_x, R_y, R_z] are the 3D head position and orientation, respectively, which are estimated using random regression forests (Fanelli et al., 2011).
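Assembling the input vector of Eq. (12) is then a simple concatenation of the iris displacement and the Kinect head pose; in the illustrative sketch below, the variable values are hypothetical and `gaze_classifier` stands for the trained VBMLR model.

```python
import numpy as np

def gaze_feature_vector(d_x, d_y, head_position, head_rotation):
    """Eq. (12): v = [d_x, d_y, T_x, T_y, T_z, R_x, R_y, R_z]."""
    return np.concatenate(([d_x, d_y], head_position, head_rotation))

# Hypothetical frame: iris displacement in pixels, head pose from the Kinect.
v = gaze_feature_vector(4.2, -1.7,
                        head_position=[12.0, -3.5, 95.0],   # T_x, T_y, T_z
                        head_rotation=[2.0, -8.5, 1.2])     # R_x, R_y, R_z
# region = gaze_classifier.predict(v)   # one of the nine screen positions
print(v)
```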

Fig. 9. Face dataset.



Fig. 10. Iris training samples.

Fig. 11. Non-iris training samples.

Table 1. Iris verification accuracy results (%).

Classifier    VBMLR    SVM (RBF)    SVM (Linear)    SVM (Polynomial)
Accuracy      95.50    92.25        93.00           90.25

The nine predefined positions are gazed at by twenty different subjects, males and females of different ethnicities, some of them wearing glasses. In the training procedure, we asked the users to gaze at positions 1, 2, 3, ..., 9 in turn on the screen (see Fig. 12). For each screen position, a large amount of training data under different eye directions as well as head positions and orientations is collected for training the VBMLR. For each predefined position, 90 input samples are collected with the twenty different subjects. Therefore, 1800 samples composed of the input gaze feature vector v and its corresponding gaze position are collected for training. Of these 1800 samples, 1400 are used to build the training set, while the other 400 are set aside for testing. During the training procedure, users can move their heads 30 cm along the horizontal direction and 25 cm along the vertical direction, with different head orientations. Moreover, the distance to the Kinect ranges from 70 cm to 130 cm. Note that, thanks to the sampling of the gaze space, users can move their eyes and heads slowly, quickly, or even in saccades. Therefore, in the testing phase the user can move his/her eyes (or head) freely in front of the camera. After training, the VBMLR can classify a given input vector into one of the nine gaze positions. To verify the accuracy of our proposed method, some comparative experiments were conducted and the scores are shown in the following tables. Table 2 presents the percentage of accuracy of our system under different head positions and orientations. The gaze angular error is given by:

\mathrm{Mean\ angular\ error} = \arctan\left(\frac{ME}{\bar{d}}\right) \qquad (13)

Fig. 12. The system setup.
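The classification accuracies reported below can be converted into an angular error using Eq. (13), together with the mean displacement ME defined in Eq. (14) further below; the following sketch uses illustrative values only.

```python
import numpy as np

def mean_angular_error(estimated, true, screen_distance):
    """Eq. (14): ME is the mean Euclidean distance between estimated and true
    gaze positions (same unit as screen_distance); Eq. (13) converts it to an
    angle, arctan(ME / d), returned here in degrees."""
    estimated, true = np.asarray(estimated, float), np.asarray(true, float)
    me = np.linalg.norm(estimated - true, axis=1).mean()
    return np.degrees(np.arctan(me / screen_distance))

# Illustrative values: on-screen gaze positions in cm, viewer at 100 cm.
est = [[10.0, 5.0], [28.0, 22.0], [47.0, 9.0]]
gt  = [[ 2.0, 3.0], [19.0, 17.0], [38.0, 17.0]]
print(mean_angular_error(est, gt, 100.0))
```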


Table 2. Percentage of accuracy for 400 frames.

Position    1     2     3     4     5     6     7     8     9   Accuracy (%)
1         381    10     1     0     1     7     0     0     0   95.25
2           8   379     9     2     1     1     0     0     0   94.75
3           1     9   382     6     2     0     0     0     0   95.50
4           0     0     3   388     7     0     0     0     2   97.00
5           0     1     0     4   392     3     0     0     0   98.00
6           2     0     0     2     6   387     3     0     0   96.75
7           0     0     0     0     2     6   383     8     1   95.75
8           0     0     0     0     4     1     7   380     8   95.00
9           0     0     0     8     1     0     1    12   378   94.50

Table 3. Percentage of accuracy.

Classifier    VBMLR    SVM (RBF)    SVM (Linear)    SVM (Polynomial)
Accuracy      95.83    94.44        90.05           92.86

Table 4. Percentage of accuracy at different distances.

Distance (cm)    VBMLR    SVM (RBF)    SVM (Linear)    SVM (Polynomial)
70               98.75    97.25        92.50           95.50
80               97.25    97.25        91.75           94.00
90               96.25    94.00        90.00           92.25
100              94.75    93.00        89.25           91.75

where \bar{d} is the distance from the screen and ME is computed as:

ME = \frac{1}{n} \sum_{i=1}^{n} |C_e - C_t| \qquad (14)

where, for n gaze samples, C_e and C_t are the estimated and the true gaze positions, respectively. The mean angular error for the head movement space mentioned above is 7.9 degrees.

In Table 3, we compare the VBMLR against the Support Vector Machine (SVM) with different kernels. Note that head movement is allowed but the distance to the Kinect must be kept constant. We used the same training and test sets for the SVM classifier. According to this table, the best accuracy, 95.83%, is obtained with the VBMLR classifier.

Table 4 lists the effect of the distance to the Kinect on the gaze accuracy. The user was positioned at four different locations. In each location, the user was asked to gaze at the nine predefined positions across the screen. In this experiment, head movement is allowed but the distance to the Kinect must be kept constant. Note that the decrease in accuracy with distance is coherent with a previous finding described in Macknojia, Chávez-Aragón, Payeur, and Laganière (2012).

There are some limitations of the proposed model due to the use of the Kinect. First, the distance to the Kinect should be between 70 cm and 130 cm because of the capturing ability of the Kinect and of the head tracker system. Second, because the sun is a source of IR light, the accuracy of the Kinect can be reduced if it is used outdoors.

5. Conclusion

In this paper, we proposed a method for estimating the eye-gaze irrespective of head direction and movement, using a Kinect sensor and an HD web camera. In order to estimate eye-gaze, we used the iris center and the reference point provided by the IR camera. Since the iris center and the reference point vary significantly with head pose, we considered head orientation and location in the gaze mapping model. We introduced variational Bayesian multinomial logistic regression as a general gaze mapping function. Our method performs well regardless of whether the iris is visible or not.

This has been achieved by combining a Bayesian classifier and the Histogram of Oriented Gradients (HOG). Moreover, the method has the advantage of not requiring a special light source and of allowing natural head direction and movement under various iris states while still producing adequately accurate gaze estimates. Experimental results show that the accuracies of eye-gaze estimation and iris verification are 96% and 95.50%, respectively, under different head movements and iris states. However, when the user moves away from the camera, the gaze accuracy decreases. Comparing the VBMLR against the SVM, a traditional classifier, for eye-gaze estimation and iris verification demonstrated that the proposed mapping function achieves better accuracy. In future work, we propose to focus more on the screen (board) by studying more thoroughly the effects of its size and depth. To this end, the use of some cues from the color image to enhance the depth estimation by the Kinect can be envisaged.

References

Almansa-Valverde, S., Castillo, J. C., & Fernández-Caballero, A. (2012). Mobile robot map building from time-of-flight camera. Expert Systems with Applications, 39(10), 8835–8843.
Be first to get a Tobii EyeX Dev Kit. (2013). Retrieved from .
Begg, C. B., & Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1), 11–18.
Camus, T. A., & Wildes, R. (2002). Reliable and fast eye finding in close-up images. In Proceedings of the IEEE international conference on pattern recognition (pp. 389–394).
Castrillon, M., Deniz, O., Guerra, C., & Hernandez, M. (2007). ENCARA2: Real-time detection of multiple faces at different resolutions in video streams. Journal of Visual Communication and Image Representation, 18, 130–140.
C.C.S. d'Electronique SA. (2009). Retrieved from .
Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–27.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In International conference on computer vision and pattern recognition (pp. 886–893).
Daugman, J. G. (1993). High confidence visual recognition of persons by a test of statistical independence. IEEE Transactions on PAMI, 15(11), 1148–1160.
Deschenes, F., Ziou, D., & Fuchs, F. (2004). An unified approach for a simultaneous and cooperative estimation of defocus blur and spatial shifts. Image and Vision Computing, 22(1), 35–57.
Diraco, G., Leone, A., & Siciliano, P. (2013). Human posture recognition with a time-of-flight 3D sensor for in-home applications. Expert Systems with Applications, 40(2), 744–751.
Eye tracking gets kickstarted: 4titoo launches the NUIA EyeCharm for Kinect. (2013). Retrieved from .
Fanelli, G., Weise, T., Gall, J., & Gool, L. V. (2011). Real time head pose estimation from consumer depth cameras. In Proceedings of the 33rd international conference on pattern recognition (pp. 101–110).
Guestrin, E. D., & Eizenman, M. (2006). General theory of remote gaze estimation using the pupil center and corneal reflections. IEEE Transactions on Biomedical Engineering, 53(6), 1124–1133.
Hansen, D., & Ji, Q. (2010). In the eye of the beholder: A survey of models for eyes and gaze. IEEE Transactions on PAMI, 32(3), 478–500.
Jafari, R., & Ziou, D. (2012a). Gaze estimation using Kinect/PTZ camera. In IEEE international symposium on robotic and sensors environments (pp. 13–18).
Jafari, R., & Ziou, D. (2012b). Eye-gaze estimation using Kinect. In 11th Asian conference on computer vision, demo section.
Khushaba, R. N., Wise, C., Kodagoda, S., Louviere, J., Kahn, B. E., & Townsend, C. (2013). Consumer neuroscience: Assessing the brain response to marketing stimuli using electroencephalogram (EEG) and eye tracking. Expert Systems with Applications, 40(9), 3803–3812.
Kobayashi, T., & Akinori, H. (2008). Selection of histograms of oriented gradients features for pedestrian detection. In International conference on neural information processing (pp. 598–607).
Krishnapuram, B., Carin, L., Figueiredo, M. A. T., & Hartemink, A. J. (2005). Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on PAMI, 27(6), 957–968.
Ksantini, R., Ziou, D., Colin, B., & Dubeau, F. (2008). A weighted pseudo-metric discriminatory power improvement using a variational method based Bayesian logistic regression model. IEEE Transactions on PAMI, 30(2), 256–266.
Law, B., Atkins, M. S., Kirkpatrick, A. E., & Lomax, A. J. (2004). Eye gaze patterns differentiate novice and experts in a virtual laparoscopic surgery training environment. In Proceedings of the symposium on eye tracking research and applications (pp. 41–48).
Macknojia, R., Chávez-Aragón, A., Payeur, P., & Laganière, R. (2012). Experimental characterization of two generations of Kinect's depth sensors. In IEEE international symposium on robotic and sensors environments (ROSE) (pp. 150–155).
Miyake, T., Asakawa, T., Yoshida, T., Imamura, T., & Zhang, Z. (2009). Detection of view direction with a single camera and its application using eye gaze. In 35th annual conference of IEEE industrial electronics (pp. 2037–2043).
Miyake, T., Haruta, S., & Horihata, S. (2005). Eye-gaze estimation by using features irrespective of face direction. Journal of Systems and Computers in Japan, 36(3), 18–23.
Morimoto, C. H., & Mimica, M. R. M. (2005). Eye gaze tracking techniques for interactive applications. Computer Vision and Image Understanding, 98(1), 4–24.
Products. (2014). Retrieved from .
Renaud, P., Chartier, S., Rouleau, J., Proulx, J., Decarie, J., Trottier, D., et al. (2009). Gaze behavior nonlinear dynamics assessed in virtual immersion as a diagnostic index of sexual deviancy: Preliminary results. Journal of Virtual Reality and Broadcasting, 6(3).
Video Sensor Array with Active SBI. (2009). Retrieved from .
Villanueva, A., Daunys, G., Hansen, D. W., Bohme, M., Cabeza, R., Meyer, A., et al. (2009). A geometric approach to remote eye tracking. Communication by Gaze Interaction, 8(4), 241–257.
Viola, P., & Jones, M. J. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2), 137–154.
Zhu, Y., Dariush, B., & Fujimura, K. (2008). Controlled human pose estimation from depth image streams. In Proc. CVPR workshop on TOF computer vision (pp. 1–8).
Ziou, D., & Jafari, R. (2014). Efficient steganalysis of images: Learning is good for anticipation. Pattern Analysis and Applications, 17(2), 279–289.