DeepSafeDrive: A Grammar-aware Driver Parsing Approach to Driver Behavioral Situational Awareness (DB-SAW)

T. Hoang Ngan Le∗, ChenChen Zhu, Yutong Zheng, Khoa Luu, Marios Savvides

Department of Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA. Phone: 412-980-8939
{thihoanl, zcckernel, yutongzh, kluu}@andrew.cmu.edu, [email protected]
Abstract

This paper presents a Grammar-aware Driver Parsing (GDP) algorithm, with deep features, to provide a novel driver behavior situational awareness system (DB-SAW). A deep model is first trained to extract highly discriminative features of the driver. Then, a grammatical structure on the deep features is defined to be used as prior knowledge for semi-supervised proposal candidate generation. The Region with Convolutional Neural Networks (R-CNN) method is ultimately utilized to precisely segment parts of the driver. The proposed method not only aims to automatically find parts of the driver in challenging “drivers in the wild” databases, i.e. the standardized Strategic Highway Research Program (SHRP-2) and the challenging Vision for Intelligent Vehicles and Applications (VIVA), but is also able to investigate seat belt usage and the position of the driver’s hands (on a phone vs. on a steering wheel). We conduct experiments on various applications and compare our GDP method against other state-of-the-art detection and segmentation approaches, i.e. SDS [1], CRF-RNN [2], DJTL [3], and R-CNN [4], on the SHRP-2 and VIVA databases.

Keywords: Driver Parsing, Pictorial Structure, Deep Features, Region with Convolutional Neural Networks (R-CNN), Structure based N-cuts
∗ Corresponding author
Email address: [email protected] (T. Hoang Ngan Le, ChenChen Zhu, Yutong Zheng, Khoa Luu, Marios Savvides)
Figure 1: Driver parsing on the SHRP-2 database [5]: (A) input images, (B) seat belt segmentation results using SDS [1], (C) probability maps for the seat belt using our proposed GDP, (D) seat belt segmentation results of the GDP.
1. Introduction

Driver safety is one of the major concerns in today’s world. With the number of vehicles on the road increasing on a daily basis, the ability to use computer vision and machine learning algorithms to automatically assess a driver’s vigilance is extremely important. One of the biggest challenges in this topic is that drivers in videos are usually recorded under weak lighting, at low resolution and with poor illumination control across day and night modes. A person drives through various terrain, e.g. open roads, under trees, etc., causing constant and often stark variations in illumination conditions. Furthermore, in the automotive field, lower-resolution cameras and lower-power embedded processors, which are preferred due to cost constraints, create additional challenges.

In order to meet the goals of assessing driver safety, we propose a fully automatic system that is able to (1) simultaneously detect and segment the seat belt of a driver to see if the driver is wearing it (as shown in Fig. 1(D)); (2) analyze the upper parts (body) of the driver usually recorded from cameras, including head, body and seat belt, in order to detect if the driver is looking forward and keeping their eyes on the road; (3) determine if the driver is tired or starting to fall asleep, distracted due to a conversation, or using a hand-held device, i.e. a cell phone; and (4) detect whether the driver’s hands are on the steering wheel.
Motivation: The recently popular Simultaneous Detection and Segmentation (SDS) method [1] uses Convolutional Neural Networks (CNNs) to classify category-independent region proposals and aims to detect all instances of a category in an image. Its experiments showed that the method achieves state-of-the-art results in object detection and semantic segmentation. However, SDS lacks the grammatical structure of objects and thus produces erroneous detections and segmentations, as shown in Fig. 1(B), where the seat belt is the object of interest, and in Fig. 2, where the hands on the phone are the objects of interest. The segmentation results from SDS share similar shapes with the objects of interest, but their positions are incorrect. In addition, since the “drivers in the wild” videos are usually of low resolution and poorly illuminated, SDS is unable to generate accurate proposal candidates.

In our proposed system, the global relative structure of the parts of the driver (i.e. head, body, seat belt) and the local relative structure (i.e. eyes, nose, mouth) are first modeled using the Pictorial Structures (PS) approach [6]. The detected deep probability map of the driver is then used as prior knowledge for proposal candidate generation to achieve accurate detection and segmentation. The flowchart of our proposed system is shown in Fig. 3. Unlike previous PS methods [6], [7], which use Histogram of Oriented Gradients (HOG) or Scale-Invariant Feature Transform (SIFT) features, the proposed PS employs deep features learned from our trained deep model described in Section 3.1. These deep features allow our proposed PS approach to achieve higher detection results than previous PS methods [6], [7].

The contributions of our work can be summarized as follows:

• We incorporate a grammatical structure, which encodes the relationships between parts, into Convolutional Neural Networks (CNNs).
• We propose a fast and effective partition method which utilizes the prior knowledge of the deep probability map to define within-subgraph and between-subgraph constraints.
• The deep features are capable of representing both appearance and shape information.

• To the best of our knowledge, this is the first time an automatic system to support Driver Behavioral Situational Awareness (DB-SAW) has been presented.
Particularly, it is the first system that finds the parts of the driver, i.e. head, body, seat belt, hands, eyes, mouth and nose. The detection and segmentation results of our system can be used for numerous tasks, e.g. head detection and pose estimation, hands-on-wheel verification, hands-on-phone evaluation, etc.

2. Related Work

This section reviews previous work on driver activity analysis. Since driver hand detection is a part of this work, we also review recent studies in hand detection.

The multimodal vision method [8] was presented to characterize driver activity based on head, eye and hand cues. The cues fused from these three inputs using hierarchical Support Vector Machines (SVMs) enrich the description of the driver’s state, allowing for evaluation of driver performance captured in on-road settings. However, this method, with a linear-kernel SVM for detection, focuses more on analyzing the activities of the driver correlated among these three cues. It does not emphasize the accuracy of hand detection of drivers in challenging conditions, e.g. shadow, low resolution, phone usage, etc. Ohn-Bar et al. [9] introduced a vision-based system that employs a combined RGB and depth descriptor in order to classify hand gestures. The method employed various modifications of HOG features with the combination of both RGB and depth images to achieve a high classification accuracy. However, in the context of this work, it is impossible to obtain both RGB and depth images in cars, since these videos are usually recorded at low resolution under poor illumination. Mittal et al. [10] presented a two-stage approach to detect hands in unconstrained images. Three complementary detectors are employed to propose hand bounding boxes. These proposal regions are then used as inputs to train a classifier to compute a final confidence score. In their method, context-based and skin-based proposals with a sliding-window shape-based detector are used to increase recall. However, these skin-based features cannot contribute to our presented problem, since all videos are recorded under low illumination with gray-scale pixels. Meanwhile, the methods proposed in [11], [12], [13] for hand tracking and analysis are only applicable to high-resolution depth images. They are therefore unusable on the types of videos used in this work.
Figure 2: Examples of incorrect hand segmentation using SDS [1]. These results are scored confidently and share similar shapes with actual hands in the given images.
Trulls et al. [14] presented a method to combine bottom-up segmentation, in the form of SLIC superpixels, with Deformable Part Models (DPMs). In this method, a large pool of SLIC superpixels, instead of HOG features, is used to build soft segmentation masks. These masks are then used to construct enhanced, background-invariant features to train DPMs. Rothrock et al. [15] proposed a compositional and-or graph grammar method to model human pose estimation. In their method, a global image segmentation is used as a reference distribution to compute part-based appearance features. In other words, the segmentation results are used to support object detection in their method.

3. Grammar-aware Driver Parsing (GDP)

This section presents our Grammar-aware Driver Parsing approach, which uses awareness of the grammatical structure of the driver to detect and segment parts of the driver. The proposed algorithm uses the grammatical structures as guidance and refinement for faster searching, more precise locating and more accurate segmentation.

Our proposed algorithm first uses a GPU-based Caffe framework [16] to train a Deep Convolutional Neural Network (DCNN) model for the objects of interest, namely, face, torso and seat belt. The DCNN model is first used to extract deep features, instead of HOG features, to build a probability map that defines the grammatical structure for a driver. The probability map, which measures the confidence score of a pixel belonging to an object of interest, is used for two purposes. First, the map is used as initial seeds for Semi-Supervised Normalized Cuts (SSNC) [17] for object proposal generation. Second, the map is employed to refine the results in the case of many high-confidence outcomes. The Region with Convolutional Neural Networks (R-CNN) method [4]
Figure 3: The flowchart of our proposed Grammar-aware Driver Parsing (GDP) approach with deep features for DB-SAW: (A) input videos/images from the SHRP-2 database [5], (B) deep probability maps using grammatical structures of the driver, (C) proposal candidate generation for parts of the driver, (D) deep feature selection using our deep learning model, (E) Driver Behavioral Situational Awareness analysis.
is then employed to extract the features from both the bounding box of the region and from the region foreground at different scales. An SVM classifier is trained on top of these extracted deep features to assign a score for each class to each candidate. The final candidates are decided by incorporating the defined PS into the prediction map produced by the scored candidates.

3.1. Deep Feature Extraction for Parts of the Driver

The DCNN in our system contains four convolution layers with max-pooling to extract hierarchical features from human faces, followed by two fully-connected layers and a softmax output layer assigning the estimated class. One dropout layer is placed right after the first fully-connected layer, with a dropout ratio of 0.7. Training data are cropped to a size of 144 × 144 based on face landmarking points. Each training sample is further cropped randomly into 128 × 128 pieces to be the input of the network. During training, the maximum iteration is 2,000,000.
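The paper trains this network with Caffe [16]; the following PyTorch-style sketch only illustrates one plausible layout of the described architecture (four convolution layers with max-pooling, two fully-connected layers, dropout of 0.7 after the first one, and a softmax output). The channel widths, kernel sizes and hidden dimension are assumptions, since the paper does not list them; the number of classes follows the CASIA-WebFace subject count mentioned in Section 4.1.

import torch
import torch.nn as nn

class DriverPartCNN(nn.Module):
    """Illustrative 4-conv / 2-FC network for 128x128 gray-scale face crops."""
    def __init__(self, num_classes=10575):  # assumption: one class per CASIA-WebFace subject
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),    # 128 -> 64
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 16 -> 8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, 512), nn.ReLU(),
            nn.Dropout(p=0.7),                 # dropout right after the first FC layer
            nn.Linear(512, num_classes),       # softmax is applied inside the loss below
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Random 128x128 crops taken from 144x144 face-aligned training images.
model = DriverPartCNN()
loss_fn = nn.CrossEntropyLoss()                # log-softmax + negative log-likelihood
x = torch.randn(8, 1, 128, 128)
logits = model(x)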
There are not enough images with ground-truth labels in the highway database to train the deep network, which usually requires hundreds of thousands of training images. Therefore, the DCNN is trained on a large-scale face database. Then, we employ the discriminative capability of this trained deep model to extract deep features for parts of the driver. This technique of learning and transferring mid-level feature representations was successfully used in [18]. We use the CASIA-WebFace dataset [19] as the training set and Labeled Faces in the Wild (LFW) [20] as the testing set. Our deep face model achieves an accuracy of 99.15% on 3000 randomly picked pairs of LFW images.

3.2. Grammatical Structures for Parts of the Driver

The structure of a driver is decomposed into a set of parts modeled as Γ = {γ_k}_{k=1}^{K}, where γ_k = (x_k, y_k, ϑ_k, s_k) denotes the position (x_k, y_k), orientation ϑ_k and scale s_k of part k of the driver, respectively. There are three main components (K = 3) in this model, i.e. head, body and seat belt. In addition, the facial regions in the head are modeled as sub-parts, i.e. two eyes, nose and mouth, as shown in Fig. 4. It should be noted that the seat belt can be missing in this model. The parts of the arms are not defined in this structure since they are often not visible in the recorded videos and images. Instead, the hands of the drivers are modeled and segmented in two other tasks, i.e. hands-on-wheel and hands-on-phone analysis.

Given an image D, the posterior of the parts of the driver Γ is modeled as p(Γ|D) ∝ p(Γ)p(D|Γ). In the Pictorial Structure approach, p(Γ) = p(γ_1) \prod_{(i,j)∈E} p(γ_i|γ_j) corresponds to a kinematic tree prior modeling the parts as a directed acyclic graph (DAG) with the set of edges E, the root node γ_1, and pairs of jointed parts p(γ_i|γ_j). Meanwhile, the model likelihood given a particular configuration of the parts of the driver is defined as p(D|Γ) = \prod_{k=1}^{K} p(d_k|γ_k). Finally, the posterior can be computed as follows,

p(Γ|D) ∝ p(γ_1) \prod_{(i,j)∈E} p(γ_i|γ_j) \prod_{k=1}^{K} p(d_k|γ_k)    (1)

Figure 4: Grammatical structure of parts of the driver defined in our approach

Both these distributions, i.e. p(Γ) and p(D|Γ), are learned using the annotated driver training set described in Section 4.1. Samples from this model exhibit images with low
resolution and a large variety of illumination. Moreover, instead of the commonly used HOG features, deep features extracted from our trained model (presented in Section 3.1) are utilized to obtain highly discriminative information to represent parts of the driver. An example of the grammatical structure consisting of head, body and seat belt using the HOG feature and the proposed deep feature is shown in Fig. 5.
Figure 5: Comparison of grammatical structure (head (first column) - body (second column) - seat belt (third column)) using the HOG feature (first row) against our designed deep features (second row).
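A minimal sketch of how the posterior in Eq. (1) can be evaluated for a small tree of parts (head → body → seat belt); the Gaussian pairwise priors, the uniform root prior and the unary appearance scores stand in for the distributions that the paper learns from the annotated training set, so all numeric parameters here are illustrative assumptions.

import numpy as np
from itertools import product

# Candidate configurations gamma_k = (x, y) for each part, e.g. peaks of the deep probability maps.
candidates = {
    "head":     [(60, 40), (64, 44)],
    "body":     [(60, 110), (70, 120)],
    "seatbelt": [(80, 130), (90, 140)],
}
edges = [("head", "body"), ("body", "seatbelt")]     # kinematic tree rooted at the head

def log_pairwise(parent_pos, child_pos, mean_offset, sigma=20.0):
    """log p(gamma_child | gamma_parent): Gaussian on the relative offset (assumed form)."""
    d = np.subtract(child_pos, parent_pos) - mean_offset
    return -0.5 * float(np.dot(d, d)) / sigma ** 2

def log_unary(part, pos, prob_maps):
    """log p(d_k | gamma_k): log of the deep probability map at the part location."""
    return float(np.log(prob_maps[part][pos] + 1e-8))

def map_configuration(prob_maps, mean_offsets):
    """Brute-force MAP over the (small) candidate sets; exact DP on the tree is used in practice."""
    best, best_score = None, -np.inf
    for picks in product(*(candidates[p] for p in candidates)):
        cfg = dict(zip(candidates, picks))
        # p(gamma_1) is taken as uniform over the head candidates (assumption), so it is dropped.
        score = sum(log_unary(p, cfg[p], prob_maps) for p in candidates)      # product of p(d_k | gamma_k)
        for (pa, ch) in edges:
            score += log_pairwise(cfg[pa], cfg[ch], mean_offsets[(pa, ch)])   # product of p(gamma_i | gamma_j)
        if score > best_score:
            best, best_score = cfg, score
    return best, best_score

# Toy usage: flat probability maps of size 200x200 and assumed mean offsets between parts.
maps = {p: np.full((200, 200), 0.5) for p in candidates}
offsets = {("head", "body"): np.array([0, 70]), ("body", "seatbelt"): np.array([15, 25])}
cfg, score = map_configuration(maps, offsets)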
3.3. Proposal Generation with Grammar Prior

To accurately generate a set of proposal candidates for parts of the driver, we improve Multiscale Combinatorial Grouping (MCG) [21] by incorporating the grammatical structure defined in Section 3.2. In MCG, a segmentation hierarchy is defined as a family of partitions such that regions from coarse levels are unions of regions from fine levels. Each level in the hierarchy is assigned a real-valued index. Furthermore, an ultrametric contour map (UCM) is used to unify the problems of contour detection and hierarchical image segmentation. The partitions are generated by thresholding the UCM at each level. One of the most important contributions of [21] is a fast Normalized Cuts algorithm that preserves full performance for contour detection with low memory requirements and provides a 20× speed-up. In [21] and other related works, i.e. [22], [23], [24], Normalized Cuts (Ncuts) [25] has been used as a key globalization mechanism of recent high-performance contour detectors for spectral graph partitioning and scene labeling. However, it ignores articulated geometry knowledge when working on structured objects, e.g. the driver in the DB-SAW system. Furthermore, it is hard to obtain accurate results on images captured at low resolution and under poor illumination when using Ncuts alone. In this section, we introduce an effective way of using Ncuts by employing the deep probability map defined in Section 3.2 to generate proposal candidates for parts of the driver. The grammatical structure is used to determine whether vertices are within-subgraph (in the same subgraph) (C_out) or between-subgraph (in different subgraphs) (C_in).

Consider an undirected weighted graph G = {V, E} to be split into two disjoint groups A and B, where V = A ∪ B and A ∩ B = ∅. The cut is then defined as in Eqn. (2),

cut(A, B) = \sum_{i ∈ A, j ∈ B} w_{ij}    (2)
where V = {v_1, v_2, ..., v_N} is the set of vertices, and w_{ij} denotes a weight from the weighted adjacency matrix of G. One of the most popular ways to balance the partitioning is Normalized Cuts [25], defined as follows,

Ncut(A, B) = \frac{cut(A, B)}{assoc(A, V)} + \frac{cut(A, B)}{assoc(B, V)}    (3)
where assoc(A, V) is the total connection from all vertices in A to all vertices in V. Using the Rayleigh quotient and relaxing to real values, Shi et al. [26] show that minimizing Eqn. (3) is equivalent to solving the following generalized eigenvector problem,

(D − W) y = λ D y    (4)
where D is the diagonal weighted degree matrix. Recently, Chew et al. [17] proposed a soft version of both must-link and cannot-link constraints to vary the desired influence of specific constraints on the grouping process. However, this method requires prior knowledge from a user to specify the must-link and cannot-link regions. Expanding on this work, our proposed method employs a simpler way to automatically obtain this required prior knowledge using grammatical structures. Let Γ be the grammatical structure, i.e. the deep probability map defined in Section 3.2; each pixel p_i is assigned a score γ_i giving the likelihood that the pixel belongs to the foreground (parts of the driver) or the background. The higher the value of γ_i, the more likely the pixel p_i belongs to the foreground. Conversely, the lower the value of γ_i, the more likely the pixel p_i belongs to the background.

Let inF be the group of all pixels p_i with the top (highest) scores and let inB be the group of all pixels p_i with the bottom (lowest) scores. The constraints C_in(i, j) apply to vertices between subgraphs, namely p_i ∈ inF and p_j ∈ inB, whereas C_out(i, j) apply to vertices within a subgraph, namely {p_i, p_j} ⊂ inF or {p_i, p_j} ⊂ inB.
Following the work of [17] and using an indicator vector x, the constraints are rewritten as

C_in(i, j) = \frac{(x_i − x_j)^2}{4}    (5)

C_out(i, j) = \frac{(x_i + x_j)^2}{4}    (6)
Generally, C_in and C_out are expressed as x^T U^T T U x and x^T \bar{U}^T \bar{T} \bar{U} x, respectively, where U and \bar{U} are defined row-wise so that each row contains a half in column i and a half in column j for the vertex pairs in C_in and C_out, respectively. T and \bar{T} are the diagonal matrices containing the weights. The cut with the constraints C_in and C_out is then defined as follows:

cut_C(A, B) = cut(A, B) + C_in + C_out = cut(A, B) + x^T U^T T U x + x^T \bar{U}^T \bar{T} \bar{U} x    (7)
and the problem in Eqn. (3) becomes

Ncut_C = \frac{cut_C(A, B)}{assoc(A, V)} + \frac{cut_C(A, B)}{assoc(B, V)}    (8)
According to [24], [17], minimizing Ncut_C in Eqn. (8) is equivalent to solving the following eigenproblem,

M^T P_i (D − W + U^T T U + \bar{U}^T \bar{T} \bar{U}) P_i M h = λ M^T P_i D^{1/2} (I − q q^T) D^{1/2} P_i M h    (9)
where q is the unit vector in the direction of D^{1/2} 1, h is an (n − 1)-vector, and x = Bh. Herein, B is chosen as an n × (n − 1) matrix whose columns form a basis for the subspace orthogonal to the vector \bar{U}^T \bar{T} \bar{U} 1. M is the n × (n − 1) matrix given by

M = [ −P_1 p    P_2 p    P_3 p    ⋯    P_n p ]    (10)
where P_i is the n × n permutation matrix that swaps row 1 and row i.
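A minimal sketch of this grammar-seeded bipartition, under simplifying assumptions: it builds a 4-connected pixel-affinity graph, picks the inF/inB seed groups automatically from the probability map (the quantile thresholds and penalty weight are assumptions, since the paper does not specify them), adds quadratic constraint penalties in the spirit of Eqns. (5)-(7) directly to the graph Laplacian, and solves a plain generalized eigenproblem as in Eqn. (4) rather than the projected form of Eqn. (9).

import numpy as np
from scipy.linalg import eigh

def grammar_constrained_ncut(img, prob_map, top_q=0.95, bot_q=0.05, penalty=10.0):
    """Bipartition a small gray-scale patch with Ncuts seeded by a deep probability map."""
    h, w = img.shape
    n = h * w
    idx = np.arange(n).reshape(h, w)

    # 4-connected affinity graph weighted by intensity similarity.
    W = np.zeros((n, n))
    sigma = 0.1
    for (di, dj) in [(0, 1), (1, 0)]:
        a = idx[: h - di, : w - dj].ravel()
        b = idx[di:, dj:].ravel()
        wgt = np.exp(-((img.ravel()[a] - img.ravel()[b]) ** 2) / sigma)
        W[a, b] = wgt
        W[b, a] = wgt

    # Automatic seeds from the grammar/deep probability map (Section 3.2).
    p = prob_map.ravel()
    inF = np.where(p >= np.quantile(p, top_q))[0]    # confident foreground pixels
    inB = np.where(p <= np.quantile(p, bot_q))[0]    # confident background pixels

    # Constraint penalties on the relaxed indicator vector x:
    # within-subgraph seed pairs are encouraged to share a sign,
    # between-subgraph seed pairs are encouraged to take opposite signs.
    C = np.zeros((n, n))
    for group in (inF, inB):
        for i in group[:20]:
            for j in group[:20]:
                if i != j:
                    C[i, j] -= penalty               # cross term of a (x_i - x_j)^2 penalty
    for i in inF[:20]:
        for j in inB[:20]:
            C[i, j] += penalty                       # cross term of a (x_i + x_j)^2 penalty
            C[j, i] += penalty

    D = np.diag(W.sum(axis=1))
    L = D - W + C                                    # constrained Laplacian, cf. Eqn. (7)
    # Generalized eigenproblem (D - W + constraints) y = lambda * D y, cf. Eqn. (4).
    vals, vecs = eigh(L, D + 1e-6 * np.eye(n))
    fiedler = vecs[:, 1]                             # second smallest eigenvector
    return (fiedler > np.median(fiedler)).reshape(h, w)

# Toy usage on a random 16x16 patch with a synthetic probability map.
rng = np.random.default_rng(0)
img = rng.random((16, 16))
prob = np.zeros((16, 16)); prob[4:12, 4:12] = 1.0
mask = grammar_constrained_ncut(img, prob)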
3.4. Parts of the Driver Segmentation via R-CNN

Our proposed approach first employs the proposal generation technique presented above on the input image to extract N candidate regions x_i, i = 1, ..., N. In the experiments, N is chosen as 1,000. Then, deep features are extracted using R-CNN with the deep model defined in Section 3.1. There are two types of representations of a region, i.e. the bounding box of the region with only the foreground, x_i^F, and the one that includes the background, x_i^B. The first representation x_i^F, i.e. the segmented region, is used to learn the information of the parts of the driver. Meanwhile, the second type x_i^B, i.e. the bounding box, aims at learning the relative information between the parts and their neighbours. In this way, the learned features include both the parts of the driver and common backgrounds. This approach helps our proposed system be robust against the various challenging conditions presented in Section 1.
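A minimal sketch of how the two region representations might be constructed from a proposal mask, assuming 8-bit gray-scale frames; the 10% context padding and the mean-value fill for the masked-out background are illustrative choices, not values taken from the paper.

import numpy as np

def region_representations(frame, mask, pad_ratio=0.10):
    """Return (x_F, x_B) crops for one proposal.

    frame: HxW uint8 gray-scale image, mask: HxW boolean proposal mask.
    x_F: tight bounding-box crop with the background suppressed (foreground only).
    x_B: slightly enlarged bounding-box crop that keeps the surrounding context.
    """
    ys, xs = np.where(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1

    # x_F: crop the tight box and replace non-mask pixels by the image mean.
    x_f = frame[y0:y1, x0:x1].copy()
    x_f[~mask[y0:y1, x0:x1]] = int(frame.mean())

    # x_B: enlarge the box by pad_ratio on each side to include neighbouring context.
    ph, pw = int(pad_ratio * (y1 - y0)), int(pad_ratio * (x1 - x0))
    y0b, y1b = max(0, y0 - ph), min(frame.shape[0], y1 + ph)
    x0b, x1b = max(0, x0 - pw), min(frame.shape[1], x1 + pw)
    x_b = frame[y0b:y1b, x0b:x1b].copy()
    return x_f, x_b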
Figure 6: The first and the third rows are different subjects captured in various challenging environmental conditions. The subjects in the first row wear a seat belt and the ones in the third row do not. The fifth row shows a driver at different poses and with different emotions. Their corresponding segmentation results are given in the second, the fourth and the sixth rows, respectively.
The two representations are fused as inputs into the CNN to extract deep features. In the later step, SVMs are developed to verify the regions of the parts of the driver. Finally, a region refinement technique is used in the post-processing step to refine the segmented regions. Some illustrations of our proposed parts-of-the-driver detection and segmentation using the grammar-aware method are shown in Fig. 6. The first and third rows are different subjects captured in various environmental conditions, where the subjects in the first row are wearing a seat belt and the subjects in the third row are not. Notably, the seat belts appear in different lengths and against various backgrounds (of the outfit). The fifth row is a driver shown in different poses and with different emotions. Their corresponding segmentation results are given in the second, fourth and sixth rows, respectively.

4. Experimental Results

Subsection 4.1 briefly reviews the main features of the databases used in our evaluations. Subsection 4.2 presents our training and evaluation steps for the DCNN used to extract deep features for the later experiments. Then, in the next three subsections, we evaluate our proposed method on various tasks. Subsection 4.3 presents our experiments on parsing parts of the driver on the “drivers in the wild” database. Subsection 4.4 presents the problem of hands-on-wheel detection. Finally, Subsection 4.5 presents our experiment on the problem of hands-on-phone verification.

4.1. Databases used in this Work

The databases used in the experiments in this paper consist of a “drivers in the wild” database, i.e. the Strategic Highway Research Program (SHRP-2) database [5], collected by the Virginia Tech Transportation Institute (VTTI) [27], and the hand database from the Vision for Intelligent Vehicles and Applications (VIVA) Challenge [28]. By using these databases with numerous challenging factors, we aim to show the robustness and efficiency of our proposed method. In addition, the large-scale CASIA-WebFace [19] and LFW [20] datasets are also used to train and evaluate our deep CNN model.

SHRP-2 Database: This database is collected by VTTI in order to evaluate the capability of safe-driving systems. In this collection, the platform was a 2001 Saab 9-3 equipped with two proprietary Data Acquisition Systems (DAS). The recordings comprise four channels of video: forward view, face view (resolution of 356 ×
240), lap and hand view, and rearward view, recorded at 15 frames per second and compressed into a single quad video. These SHRP2 face view videos are used in our experiments.
VIVA Hand Database: The dataset consists of 2D bounding boxes around the hands of drivers and passengers from 54 videos collected in naturalistic driving settings with illumination variation, large hand movements, and common occlusion. There are 7 possible viewpoints, including the first-person view. Some of the data was captured in test beds, while some was kindly provided by YouTube. In the challenging evaluation protocol, the standard evaluation set consists of 5,500 training and 5,500 testing images. Those images also include the ground truth with bounding boxes around hands as well as labels for hands on wheels. In our work, we further manually annotate the hand regions of the 5,500 training images.

CASIA-WebFace Dataset: This is a large-scale face database containing 494,414 images of 10,575 subjects. The image quality in this database is similar to that of the Labeled Faces in the Wild (LFW) database [20].

4.2. Deep CNN Training

The deep CNN features (presented in Section 3.1) are used as features for the parts of the driver. In order to handle this task, the Caffe framework [16], a rapid deep learning implementation using CUDA C++ for Graphics Processing Unit (GPU) computation, is used to train the system. It is employed to extract features from candidate regions. Given a candidate region, two 4096-dimensional feature vectors are extracted by feeding the two types of region images into the AlexNet, and the two vectors are concatenated to form the final feature vector. AlexNet has 5 convolution layers and 2 fully-connected layers. All the activation functions are Rectified Linear Units (ReLU). Max pooling with a stride of 2 is applied at the first, second and fifth convolution layers. The two types of region images are the cropped box and the region foreground. They are both warped to a fixed 227 × 227 pixel size.
In our experiment, the maximum number of training iterations is 2,000,000. We use the CASIA-WebFace dataset as the training set and LFW for testing. In the face verification experiments, our model, trained on 10,000 subjects, achieves the same state-of-the-art result as [29], with an accuracy of 99.15% on 3000 randomly picked pairs of LFW images.
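The authors extract these descriptors with their own Caffe-trained model; the sketch below uses torchvision's AlexNet purely as a stand-in to illustrate the two-crop, concatenated 4096-dimensional feature described above. The choice of stopping after the second fully-connected layer and the gray-scale channel replication are assumptions.

import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

alexnet = models.alexnet(weights=None)            # stand-in; the paper trains its own Caffe model
alexnet.eval()

def deep_feature(crop_f, crop_b):
    """Concatenate two 4096-d descriptors: foreground-only crop and box crop."""
    feats = []
    for crop in (crop_f, crop_b):
        x = TF.to_tensor(crop)                    # HxW(xC) uint8 -> CxHxW float in [0, 1]
        if x.shape[0] == 1:                       # replicate gray-scale to 3 channels
            x = x.repeat(3, 1, 1)
        x = TF.resize(x, [227, 227]).unsqueeze(0) # warp to the fixed input size
        with torch.no_grad():
            f = alexnet.features(x)
            f = alexnet.avgpool(f).flatten(1)
            f = alexnet.classifier[:5](f)         # stop after the second FC layer (4096-d)
        feats.append(f.squeeze(0))
    return torch.cat(feats)                       # 8192-d vector fed to the per-class SVMs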
4.3. Driver Parsing

Since this is the first time the DB-SAW problem is presented, the metrics and evaluation protocols on the SHRP-2 database must be defined. We evaluate the proposed driver parsing system on driver (person) detection and segmentation and on the segmentation of the parts of the driver. Our system is trained on one video and benchmarked on 10 other videos randomly picked from the SHRP-2 database.

4.3.1. Driver detection and segmentation
Figure 7: Performance of driver detection and segmentation: CRF-RNN [2] (first column), DJTL [3] (second column), SDS [1] (third column), R-CNN [4] (fourth column), our GDP for person detection (fifth column) and our GDP for person segmentation (sixth column).
We benchmark our system against the previous state-of-the-art methods SDS [1], CRF-RNN [2], Deep Joint Task Learning for Generic Object Extraction (DJTL) [3] and R-CNN [4] with two metrics, Precision (Pre) and Recall (Rec). For the SDS system, we retrain the models using the same training dataset that is used in our system, whereas we can only use the pre-trained models already built in CRF-RNN and DJTL (Wang's system), since the training code of these systems is not available. Among all objects already defined and trained in CRF-RNN, DJTL and R-CNN, person (driver) is the only object that can be measured in the driving-safety scenario. We benchmark our
Table 1: Precision (Pre) and Recall (Rec) of driver (person) segmentation performance on the SHRP-2 database

Methods   DJTL [3]   CRF-RNN [2]   SDS [1]   GDP
Pre       42.7%      60.3%         53.9%     84.1%
Rec       51.3%      73.7%         80.6%     94.4%
system against [2], [3], [1] on both person detection and person segmentation, whereas we compare against R-CNN on person detection only. The performance on driver detection and segmentation of previous works against our GDP system is illustrated in Fig. 7. Table 1 reports the Pre and Rec measures for the driver segmentation performance of CRF-RNN, DJTL, SDS and our GDP system. The performance on driver detection of CRF-RNN, DJTL, SDS, R-CNN and our GDP system is given in Fig. 8 as a Precision-Recall curve. From Table 1 and the examples shown in Fig. 7, we can see that using the grammar prior helps to improve the performance of both detection and segmentation. For driver detection, our proposed GDP obtains accuracy quite similar to CRF-RNN and outperforms the others (SDS, DJTL, R-CNN) for Recall < 0.7. For Recall larger than 0.7, the Precision of CRF-RNN drops dramatically while ours decreases only slightly. For Recall > 0.9, the performances of SDS and of our approach are comparable.

4.3.2. Parts of the driver segmentation

In the context of driving safety, details about the parts of the driver, such as facial components (eyes, nose, mouth), face, hands, seat belt and body, are more critical. Among the state-of-the-art deep learning-based segmentation methods SDS, CRF-RNN and DJTL, SDS is the only segmentation system that allows for retraining. In the experiment on segmenting parts of the driver, we retrain SDS on the same data that is used to train our system. We benchmark with common metrics on pixel accuracy and region intersection over union (IU) for segmentation (a minimal sketch of computing these metrics is given below, after Table 2):

Pixel accuracy: P_Acc = \sum_k n_{kk} / \sum_k t_k

Mean accuracy: M_Acc = (1/N) \sum_k n_{kk} / t_k

Mean IU: M_IU = (1/N) \sum_k n_{kk} / (t_k + \sum_{j ≠ k} n_{jk})

Frequency Weighted IU: F_IU = (\sum_l t_l)^{-1} \sum_k t_k n_{kk} / (t_k + \sum_{j ≠ k} n_{jk})
Figure 8: Precision-Recall curves of [2], [3], [1], [4], and our approach GDP (-) for driver detection.
where n_{kl} is the number of pixels of class k predicted as class l, N is the number of classes, and t_k is the total number of pixels of class k. Table 2 reports the performance of SDS and our GDP on the four metrics, whereas Fig. 9 shows some examples of driver parsing (head, body, seat belt) from SDS and from our method. It is clear that SDS is a productive segmentation system and achieves good performance on PASCAL VOC2012; however, its segmentation accuracy is quite low on poor-quality images with inadequate texture, such as the ones in the SHRP-2 database. In many cases, SDS confidently segments a region which has a shape similar to that of the object of interest; however, the segmentation is erroneous because no grammatical structure is used. Furthermore, using the grammar prior to modify the traditional Ncuts increases the accuracy of proposal candidate generation. This produces better segmentation results,
Table 2: Parts of the driver segmentation results of SDS [1] and our GDP approach on the SHRP-2 database

Method    P_Acc    M_Acc    M_IU     F_IU
SDS [1]   69.9%    62.8%    50.8%    57.5%
GDP       88.7%    89.4%    84.2%    84.8%
especially the contours of the objects of interest (head, body, seat belt).
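A minimal sketch of how these four segmentation metrics can be computed from a per-pixel confusion matrix; the toy label encoding (0 = background, 1 = head, 2 = seat belt) is an assumption for illustration only.

import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Compute P_Acc, M_Acc, M_IU and F_IU from per-pixel label maps."""
    # n[k, l]: number of pixels of class k predicted as class l.
    n = np.zeros((num_classes, num_classes), dtype=np.int64)
    for k in range(num_classes):
        for l in range(num_classes):
            n[k, l] = np.sum((gt == k) & (pred == l))
    t = n.sum(axis=1)                         # t_k: total pixels of class k
    diag = np.diag(n)                         # n_kk: correctly labelled pixels
    false_pos = n.sum(axis=0) - diag          # sum over j != k of n_jk
    p_acc = diag.sum() / t.sum()
    m_acc = np.mean(diag / np.maximum(t, 1))
    iu = diag / np.maximum(t + false_pos, 1)  # per-class intersection over union
    m_iu = np.mean(iu)
    f_iu = np.sum(t * iu) / t.sum()
    return p_acc, m_acc, m_iu, f_iu

# Toy usage with 3 classes on a 4x4 label map.
gt   = np.array([[0, 0, 1, 1], [0, 2, 2, 1], [0, 2, 2, 0], [0, 0, 0, 0]])
pred = np.array([[0, 0, 1, 1], [0, 2, 1, 1], [0, 2, 2, 0], [0, 0, 0, 0]])
print(segmentation_metrics(pred, gt, 3))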
Figure 9: Some examples of driver parsing (face, body, seat belt). The first row shows images from different videos. The second row is the ground truth. The third row shows segmentation results from SDS [1]. The fourth row is the performance of our proposed GDP approach.
4.4. Hands on Wheel Detection
Figure 10: Our proposed hands-on-wheel detection on the VIVA dataset. (A): input image with hands detected and segmented, with the cropped window shown as dashed yellow boxes; (B): deep feature extraction using our trained model; (C): hand-on-steering-wheel detection.
Table 3: Performance of hand-on-wheel detection

Features       EER      AUC      Accuracy
Raw Pixel      30.8%    69.5%    77.0%
HOG            46.7%    59.0%    64.4%
Deep feature   28.8%    78.4%    82.1%
In this experiment, given an input image with the hand(s) already detected, the proposed algorithm verifies whether the hands of the driver are on the steering wheel or not. Fig. 10 illustrates the process of investigating the hands-on-wheel problem. In this framework, the hands of a driver within a given image are first detected, segmented and then cropped. The detected hand region is a window around the hand that is bigger than the actual hand size, as depicted by the dashed yellow boxes in Fig. 10. A bigger window around the hand helps to embed more information about its neighbourhood, including the steering wheel. Then, deep features based on our deep CNN model are extracted from the extended cropped boxes and used as inputs to train a binary SVM classifier (a minimal sketch of this verification pipeline is given after Table 4). The experiment is conducted on the VIVA Hand Database presented in Section 4.1. To evaluate the performance of our approach, we divide VIVA's training dataset into two subsets of 3000 images for training and 2500 images for testing. Three common metrics are used, i.e. the Equal Error Rate (EER), the Area Under the ROC Curve (AUC), and the classification accuracy. The performances of different features (raw pixel, HOG, deep features) with the SVM classifier are listed in Table 3.

4.5. Hands on Phone Detection

The hands-on-phone experiment follows the framework presented in Section 4.4. To evaluate the performance of our approach, we use the same dataset presented in [30]. The training dataset consists of 489 positive samples and 1479 negative samples, and the testing dataset consists of 3757 positive samples and 9288 negative samples from the SHRP-2 database. Three common metrics are used, i.e. the Equal Error Rate (EER), the Area Under the ROC Curve (AUC), and the classification accuracy. The performances of different features (raw pixel, HOG, deep features) with the SVM classifier are listed in Table 4.
Table 4: Performance of hand-on-phone detection

Features       EER      AUC      Accuracy
Raw Pixel      15.2%    91.7%    78.8%
HOG            10.5%    94.9%    84.2%
Deep feature   10.8%    97.6%    90.0%
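A minimal sketch of the hands-on-wheel / hands-on-phone verification pipeline described in Section 4.4, assuming hand bounding boxes are already available; the window enlargement factor, the feature extractor and the SVM settings are illustrative assumptions rather than the authors' exact configuration.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def extended_crop(frame, box, scale=1.5):
    """Crop an enlarged window around a detected hand box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    hw, hh = scale * (x1 - x0) / 2.0, scale * (y1 - y0) / 2.0   # keep steering-wheel/phone context
    x0n, x1n = int(max(0, cx - hw)), int(min(frame.shape[1], cx + hw))
    y0n, y1n = int(max(0, cy - hh)), int(min(frame.shape[0], cy + hh))
    return frame[y0n:y1n, x0n:x1n]

# deep_feature() is assumed to return a fixed-length descriptor for a crop,
# e.g. the concatenated CNN features sketched in Section 4.2.
def train_and_eval(train_crops, train_labels, test_crops, test_labels, deep_feature):
    X_tr = np.stack([deep_feature(c) for c in train_crops])
    X_te = np.stack([deep_feature(c) for c in test_crops])
    clf = SVC(kernel="linear", probability=True).fit(X_tr, train_labels)
    scores = clf.predict_proba(X_te)[:, 1]
    auc = roc_auc_score(test_labels, scores)    # AUC as reported in Tables 3 and 4
    acc = clf.score(X_te, test_labels)          # classification accuracy
    return auc, acc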
5. Conclusion

This paper has presented a Grammar-aware Driver Parsing approach with our trained deep model to solve the problem of driver behavioral situational awareness. The proposed approach first designs a deep model to extract highly discriminative features of the parts of the driver. A Pictorial Structure is employed to build the grammatical structure of these parts. The deep probability maps are then used as prior knowledge for a semi-supervised segmentation method to generate high-accuracy proposal candidates. Finally, R-CNN is utilized to produce the final parts of the driver. The proposed method outperforms other state-of-the-art detection and segmentation methods, i.e. SDS, CRF-RNN, DJTL and R-CNN, on images with low resolution and poor illumination in the SHRP-2 database. In addition to driver parsing, two extensions of the system, i.e. hands-on-phone detection and steering wheel mannerisms, have also been presented in this work on the VIVA and SHRP-2 databases.

References

[1] B. Hariharan, P. Arbeláez, R. Girshick, J. Malik, Simultaneous detection and segmentation, in: ECCV, 2014, pp. 297–312.

[2] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P. H. S. Torr, Conditional random fields as recurrent neural networks, in: ICCV, 2015.
[3] X. Wang, L. Zhang, L. Lin, Z. Liang, W. Zuo, Deep joint task learning for generic object extraction, in: NIPS, 2014, pp. 523–531.

[4] R. Girshick, J. Donahue, T. Darrell, J. Malik, Region-based convolutional networks for accurate object detection and segmentation, IEEE TPAMI 99 (2015) 1–16.

[5] The National Academies of Sciences, Engineering, and Medicine, The Second Strategic Highway Research Program (2006-2015) (SHRP-2), http://www.trb.org/StrategicHighwayResearchProgram2SHRP2/Blank2.aspx.

[6] M. Andriluka, S. Roth, B. Schiele, Pictorial structures revisited: People detection and articulated pose estimation, in: CVPR, 2009, pp. 1014–1021.

[7] L. Pishchulin, M. Andriluka, P. Gehler, B. Schiele, Poselet conditioned pictorial structures, in: CVPR, IEEE, 2013, pp. 588–595.

[8] E. Ohn-Bar, S. Martin, A. Tawari, M. M. Trivedi, Head, eye, and hand patterns for driver activity recognition, in: ICPR, 2014, pp. 660–665.

[9] E. Ohn-Bar, M. M. Trivedi, Hand gesture recognition in real time for automotive interfaces: A multimodal vision-based approach and evaluations, IEEE Transactions on ITS 15 (6) (2014) 2368–2377.

[10] A. Mittal, A. Zisserman, P. H. S. Torr, Hand detection using multiple proposals, in: British Machine Vision Conf., 2011, pp. 1–11.

[11] X. Sun, Y. Wei, S. Liang, X. Tang, J. Sun, Cascaded hand pose regression, in: CVPR, 2015, pp. 824–832.

[12] C. Qian, X. Sun, Y. Wei, X. Tang, J. Sun, Realtime and robust hand tracking from depth, in: CVPR, 2015, pp. 1106–1113.

[13] S. Sridhar, F. Mueller, A. Oulasvirta, C. Theobalt, Fast and robust hand tracking using detection-guided optimization, in: CVPR, 2015, pp. 3213–3221.
[14] E. Trulls, S. Tsogkas, I. Kokkinos, A. Sanfeliu, F. Moreno-Noguer, Segmentation-aware deformable part models, in: CVPR, 2014, pp. 168–175.

[15] B. Rothrock, S. Park, S. C. Zhu, Integrating grammar and segmentation for human pose estimation, in: CVPR, 2013, pp. 3214–3221.

[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast feature embedding, in: ACM Intl. Conf. on Multimedia, 2014, pp. 675–678.

[17] S. E. Chew, N. D. Cahill, Semi-supervised normalized cuts for image segmentation, in: ICCV, 2015.

[18] M. Oquab, L. Bottou, I. Laptev, J. Sivic, Learning and transferring mid-level image representations using convolutional neural networks, in: CVPR, 2014, pp. 1717–1724.

[19] CASIA-WebFace database, http://www.cbsr.ia.ac.cn/english/CASIA-WebFace-Database.html.

[20] G. B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled faces in the wild: A database for studying face recognition in unconstrained environments, Tech. Rep. 07-49, University of Massachusetts, Amherst (October 2007).

[21] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, J. Malik, Multiscale combinatorial grouping, in: CVPR, 2014, pp. 328–335.

[22] S. X. Yu, J. Shi, Segmentation given partial grouping constraints, IEEE TPAMI 26 (2) (2004) 173–183.

[23] P. Arbeláez, M. Maire, C. Fowlkes, J. Malik, Contour detection and hierarchical image segmentation, IEEE TPAMI 33 (5) (2011) 898–916.

[24] A. Eriksson, C. Olsson, F. Kahl, Normalized cuts revisited: A reformulation for segmentation with linear grouping constraints, Journal of Mathematical Imaging and Vision 39 (1) (2011) 45–61.
[25] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE TPAMI 22 (8) (2000) 888–905.

[26] J. Shi, J. Malik, Normalized cuts and image segmentation, in: CVPR, 1997, pp. 888–905.

[27] Virginia Tech Transportation Institute (VTTI), http://www.vtti.vt.edu/.

[28] N. Das, E. Ohn-Bar, M. M. Trivedi, On performance evaluation of driver hand detection algorithms: Challenges, dataset, and metrics, in: Conf. on ITS, 2015.

[29] Y. Sun, Y. Chen, X. Wang, X. Tang, Deep learning face representation by joint identification-verification, in: NIPS, 2014, pp. 1988–1996.

[30] K. Seshadri, F. J. Xu, D. K. Pal, M. Savvides, C. P. Thor, Driver cell phone usage detection on Strategic Highway Research Program (SHRP2) face view videos, in: CVVT Workshop, CVPR, 2015.