Chapter 5
Audio-visual learning for body-worn cameras
Andrea Cavallaro∗, Alessio Brutti†
∗Centre for Intelligent Sensing, Queen Mary University London, London, UK
†Center for Information Technology, Fondazione Bruno Kessler, Trento, Italy

CONTENTS
5.1 Introduction
5.2 Multi-modal classification
5.3 Cross-modal adaptation
5.4 Audio-visual reidentification
5.5 Reidentification dataset
5.6 Reidentification results
5.7 Closing remarks
References
5.1 INTRODUCTION

A wearable camera is an audio-visual recording device that captures data from an ego-centric perspective. The camera can be worn on different parts of the body, such as the chest, a wrist, or the head (e.g. on a cap, on a helmet, or embedded in eyewear). Body-worn cameras support applications such as life-logging and assisted living [30,50], security and safety [3,48], object and action recognition [28], physical activity classification of the wearer [21,27,38], and interactive augmented reality [33]. In particular, recognizing objects or people from an ego-centric perspective is an attractive capability for tasks such as diarization [46], summarization [17,30], interaction recognition [22], and person re-identification [9,23].

However, despite two pioneering multi-modal works presented already in 1999 and 2000 [14,13] and the availability of at least one microphone in wearable cameras [7], video-only solutions are still predominant even in applications that could benefit from multi-modal first-person observations [17,21–23,27,28,30,38]. While a few exceptions exist, multi-modal approaches combine audio or video with other motion-related sensors [25,45,47].
■ FIGURE 5.1 Sample challenging frames from a wearable camera.
Wearable cameras can be used in a wide variety of scenarios and require new processing methods that adapt to different operational conditions, e.g. indoors vs. outdoors, well-lit vs. poorly lit environments, sparse vs. crowded scenes. In this chapter, we focus on video and audio challenges.

Video processing challenges are related to the (unconventional) motion of a body-worn camera, which leads to continuous and considerable pose changes of the objects being captured. Moreover, objects of interest are often partially occluded or outside the field of view. In addition, the rapid movements of body-worn cameras degrade the quality of the captured video through motion blur and varying lighting conditions. Changes in lighting produce under- or over-exposed frames, especially when the camera is directed towards or away from a light source (see Fig. 5.1).

Audio processing challenges arise from the use of compact cases with (typically) low-quality microphones co-located with the imager. Moreover, plastic shields protecting the camera (e.g. GoPro Hero) may cover the microphones, attenuating and low-pass filtering the incoming sound waves. The variety of usage conditions also includes changing and unfavorable acoustics and background noise, as well as interfering sound sources. Finally, wind noise and noise induced by movements of the microphone itself considerably reduce the signal quality.

These challenges prevent the use of off-the-shelf audio-visual processing algorithms and require specifically designed methods to effectively exploit the complementarity of the information provided by the auditory and visual sensors. In addition to these specific challenges, audio-visual processing for wearable cameras has to deal with more traditional issues such as asynchronism between video and audio streams and different temporal sampling rates, which may increase uncertainty. Finally, an additional challenge for machine learning algorithms is the lack of large training datasets covering most operational conditions.

This chapter is organized as follows. Section 5.2 overviews state-of-the-art methods for audio-visual classification.
Section 5.3 presents multi-modal model adaptation as a strategy to exploit the complementarity of audio and visual information. Section 5.4 introduces person reidentification as a use case for model adaptation, Section 5.5 describes a body-worn camera dataset captured under four different conditions, and Section 5.6 reports experimental results on this dataset. Finally, Section 5.7 summarizes the contributions of the chapter.
5.2 MULTI-MODAL CLASSIFICATION

The use of multiple modalities from wearable cameras may considerably help to address the challenges presented in the previous section. We first focus on how audio and visual signals can be fused effectively in a machine learning context. Then we discuss how this knowledge can help adapt to the varying conditions under which body-worn cameras operate.

Audio-visual recognition can adopt late, early, mid or hybrid fusion strategies [9]. These four strategies for multi-modal fusion are illustrated in Fig. 5.2, and their main principles are discussed below.
■ FIGURE 5.2 Four strategies for multi-modal classification.
Audio and video are processed independently in late fusion methods (Fig. 5.2A), which combine mono-modal scores (or decisions). These methods are generally efficient and modular, and other modalities or sub-systems can be easily added. The reliability of each modality may be used to weight the late fusion combination [18,24,32]. Reliability measures may relate to the signals (e.g. SNR), to the models [24], or to the recognition rate of each classifier, by learning appropriate weights [32]. Decision selection is also a late fusion strategy that typically uses the reliability or the discriminative power of each expert [11,19,34,40]. Examples include articulated hard-coded decision cascades driven by reliability [19] and adaptive weights based on estimating the model mismatch from score distributions [40]. Assuming that a further training stage is possible, the combination of the scores may be based on a classification stage that stacks the scores and treats them as feature vectors. Classification methods used in these cases include Gaussian Mixture Models (GMMs), Support Vector Machines (SVMs) and Multi-Layer Perceptrons (MLPs) [2,39]. Reliability measures can also be included to improve classification [1,6]. The main drawback of late fusion is that errors in one modality are not recovered (or are difficult to recover) in the classification stage.

Features are processed independently also by mid-fusion methods (Fig. 5.2B), which however merge the modalities in the classification stage. These methods may employ a Multi-Stream HMM [31,52] or Dynamic Bayesian Networks [36]. In general, a weighted sum of the log-likelihoods is adopted (e.g. [31]), which is equivalent to late fusion when a time-independent classifier is used instead of Hidden Markov Models (HMMs). The stream with the highest posterior can be used in combination with an HMM for audio-visual speech recognition [41]. When the sampling rates of the modalities are different and misalignments between the modalities occur, asynchronous HMM combinations are needed [52].

Audio and video features are combined into a joint feature set before processing in early fusion methods [13,44] (Fig. 5.2C). For example, stacking audio and video features for an HMM classifier outperforms processing the two modalities independently [35]. Early fusion exploits the correlation between feature sets but might lead to dimensionality problems. The dimensionality of the stacked feature vector can be reduced through Principal Component Analysis (PCA) and Independent Component Analysis (ICA) to remove redundancy across modalities [44].

Finally, hybrid methods perform fusion in at least two stages of their pipeline (Fig. 5.2D). In Co-EM [8], as an example, models are iteratively updated in the maximization step by exchanging labels between multiple views of the same dataset and minimizing the disagreement between modalities. Examples of Co-EM in speech recognition and multi-modal interaction are referred to as co-training [29] and co-adaptation [12].
Co-training can be used for traffic analysis with multiple cameras [29]. The goal is to generate a larger training set for batch adaptation through unsupervised labeling of unseen training data. This labeling is performed based on the agreement between weak mono-modal classifiers trained on small labeled datasets. Co-adaptation for audio-visual speech recognition and gesture recognition jointly adapts audio and visual models using unseen unlabeled data of the new application domain by maximizing their agreement [12].

A limitation of most fusion methods is that a modality that deteriorates over time, due to an increasing model mismatch, is generally dropped without using mechanisms to limit its degradation. A solution to this problem is cross-modal model adaptation, which can be combined with the fusion strategies discussed above.
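To make the reliability-weighted late-fusion strategy of Fig. 5.2A concrete, the following minimal Python sketch combines the per-class scores of two mono-modal classifiers with reliability weights. The weight values and the toy scores are illustrative placeholders, not those of any specific system discussed in this chapter.

```python
import numpy as np

def late_fusion(scores_audio, scores_video, w_audio=0.5, w_video=0.5):
    """Reliability-weighted late fusion of mono-modal scores.

    scores_audio, scores_video: arrays of shape (S,) with per-class
    posteriors (or normalized scores) from the two mono-modal classifiers.
    w_audio, w_video: reliability weights (e.g. derived from the SNR or
    from the recognition rate of each classifier); placeholder values here.
    """
    fused = w_audio * np.asarray(scores_audio) + w_video * np.asarray(scores_video)
    return int(np.argmax(fused)), fused

# Toy example with S = 3 enrolled identities.
p_audio = np.array([0.2, 0.7, 0.1])   # audio favors identity 1
p_video = np.array([0.6, 0.3, 0.1])   # video favors identity 0
identity, fused = late_fusion(p_audio, p_video, w_audio=0.3, w_video=0.7)
print(identity, fused)
```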
5.3 CROSS-MODAL ADAPTATION

Two major challenges for classification tasks with data from wearable cameras are rapidly varying capturing conditions and limited training data. To address these challenges, the complementarity of multiple modalities may be exploited to prevent building poor models and to improve classification accuracy. In this section we discuss cross-modal adaptation of models learned with deep learning approaches and, in particular, we use on-line person reidentification as an application example. An additional challenge to address here is the discontinuous availability of visual or audio information, for example when a person goes outside the field of view or becomes silent.

Audio-visual person reidentification methods typically use late fusion and combine classification scores (or decisions) produced by two independent mono-modal systems (see Section 5.2). Let $x_t^i$ be an observation (feature vector) from modality $i$ and $t = 1, \dots, T^i$ the temporal index. Let $\mathbf{x}^i = \{x_1^i, x_2^i, \dots, x_{T^i}^i\}$ be the $T^i$ observations extracted from a segment to be classified. Let the identity of an enrolled target be denoted by one of the $S$ labels $s \in \mathcal{S}$. Let $p(y_t = s \mid x_t^i)$ be the probability that $x_t^i$ is generated by the target with identity $y_t = s$.

When a new person appears in front of the camera, a model for modality $i$ is acquired during an unconstrained enrollment stage (which might involve some supervision from the user) and then employed to reidentify that person in future interactions, when the target is again visible and/or audible by the body-worn camera. The target identity estimated with modality $i$, $\hat{s}_t^i$, can be computed as
$$\hat{s}_t^i = \arg\max_{s \in \mathcal{S}} p(y_t = s \mid x_t^i). \tag{5.1}$$
When modeled via deep learning architectures, $p(y_t \mid x_t^i)$ can be expressed as $p(y_t = s \mid x_t^i; \Theta^i)$, where $\Theta^i$ are the model parameters (weights $w^i$ and biases $b^i$) of the neural network. The model parameters are estimated by minimizing the cross-entropy loss, $\mathcal{L}(\mathbf{x}^i \mid \Theta^i)$, on the training set:

$$\mathcal{L}(\mathbf{x}^i \mid \Theta^i) = -\sum_{t=1}^{T^i} \sum_{s=1}^{S} p(y_t = s \mid x_t^i) \log p(y_t = s \mid x_t^i, \Theta^i) + \lambda \lVert w^i \rVert_2^2, \tag{5.2}$$
where $\lambda \lVert w^i \rVert_2^2$ is an L2 regularization term on the network weights. In multi-class classification problems, a hard alignment with the training labels $s_t^*$ is often used, which models the posterior as $p(y_t \mid x_t^i) = \delta(y_t = s_t^*)$. Therefore, Eq. (5.2) can be rewritten as a Negative Log-Likelihood (NLL):

$$\mathcal{L}(\mathbf{x}^i \mid \Theta^i) = -\sum_{t=1}^{T^i} \log p(y_t = s_t^* \mid x_t^i, \Theta^i) + \lambda \lVert w^i \rVert_2^2. \tag{5.3}$$
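As an illustration of Eqs. (5.2)–(5.3), the sketch below defines a modality-dependent MLP with the Keras API and trains it with a hard-label cross-entropy (NLL) loss and L2 weight regularization. The layer sizes, optimizer settings and dummy data are assumptions for illustration only; they are not the configuration used in this chapter.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

S = 13          # number of enrolled identities (as in the QM-GoPro dataset)
D = 4096        # input feature dimension (e.g. VGG-face fc7 vectors)
LAMBDA = 1e-4   # hypothetical L2 weight (the lambda of Eq. (5.2))

# Modality-dependent MLP producing p(y_t = s | x_t, Theta); the hidden size
# and optimizer are illustrative choices, not those of the chapter.
model = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(D,),
                 kernel_regularizer=regularizers.l2(LAMBDA)),
    layers.Dense(S, activation="softmax",
                 kernel_regularizer=regularizers.l2(LAMBDA)),
])

# Sparse categorical cross-entropy with hard labels s_t^* is the NLL of
# Eq. (5.3); the l2 kernel regularizers add the lambda * ||w||^2 term.
model.compile(optimizer=keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy enrollment data standing in for the T^i training observations.
x_train = np.random.randn(300, D).astype("float32")
y_train = np.random.randint(0, S, size=300)
model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=0)
```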
The number of available samples is usually insufficient to train models that account for temporal changes in the environment and in the target appearance. To handle this variability, the models could be continuously adapted over time in an unsupervised manner. However, adapting models is challenging and requires the integration of multi-modal information in a multi-modal model adaptation strategy.

Let $\Theta_t^i$ and $\Theta_t^j$ be the time-dependent model parameters at time $t$ for modality $i$ and $j$, respectively. As soon as new observations are available, we adapt these model parameters in an unsupervised fashion by exploiting the complementarity of the modalities:

$$\Theta_t^i \leftarrow f\!\left(\Theta_{t-1}^i,\, x_t^i,\, p(y_t \mid x_t^j, \Theta_t^j)\right), \tag{5.4}$$

where $f(\cdot)$ is a model transformation (adaptation) function and $x_t^i$ ($x_t^j$) is the observation at time $t$ from modality $i$ ($j$). Model adaptation can be achieved by retraining the network with small learning rates when (supervised) labels are available. If (supervised) labels are unavailable, estimated unsupervised labels $\hat{s}_t^j$ provided by the other modality can be used.
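A minimal sketch of one such adaptation step (Eq. (5.4)) is given below: the posteriors of modality $j$ provide an unsupervised label that is used to briefly retrain the network of modality $i$ with a small learning rate. The function name, the use of a single segment-level pseudo-label and the hyper-parameter values are assumptions made for illustration, not the exact procedure of Section 5.4.

```python
import numpy as np
from tensorflow import keras

def cross_modal_adapt(model_i, x_i, posteriors_j, learning_rate=1e-5, epochs=1):
    """One unsupervised adaptation step for modality i (sketch of Eq. (5.4)).

    model_i:      Keras classifier for modality i (e.g. the MLP above).
    x_i:          observations x_t^i for the current segment, shape (T_i, D_i).
    posteriors_j: posteriors p(y_t | x_t^j, Theta^j) from modality j,
                  shape (T_j, S).
    The small learning rate limits how far the adapted model can drift.
    """
    # Segment-level unsupervised label hat{s}^j from the other modality,
    # broadcast to all T^i observations of modality i (the two modalities
    # may have different sampling rates, so T_i != T_j in general).
    pseudo_labels = np.full(len(x_i), np.argmax(posteriors_j.mean(axis=0)))

    # Re-compile with a small learning rate and retrain briefly.
    model_i.compile(optimizer=keras.optimizers.SGD(learning_rate),
                    loss="sparse_categorical_crossentropy")
    model_i.fit(x_i, pseudo_labels, epochs=epochs, batch_size=16, verbose=0)
    return model_i
```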
When the labels are provided by the other modality, the loss function is defined as

$$\mathcal{L}(\Theta^i \mid \mathbf{x}^i) = -\sum_{t=1}^{T^i} \log p(y_t = \hat{s}_t^j \mid x_t^i) + \lambda \lVert w^i \rVert_2^2. \tag{5.5}$$
To avoid over-fitting the unlabeled adaptation data, regularization can be used, for example with a term based on the Kullback–Leibler Divergence (KLD), as in unsupervised DNN adaptation for speech recognition [20,51]. In this case the target distribution, $\tilde{p}(y_t \mid x_t^i)$, is reformulated as a linear combination of the actual distribution, $p(y_t \mid x_t^i)$, and the original distribution, $p_0(y_t \mid x_t^i)$, i.e. the output of the network prior to adaptation:

$$\tilde{p}(y_t \mid x_t^i) = (1 - \rho^i)\, p(y_t \mid x_t^i) + \rho^i\, p_0(y_t \mid x_t^i), \tag{5.6}$$
where the hyper-parameter $\rho^i$ controls the amount of regularization and depends, for example, on the reliability of the unsupervised labels (see Section 5.4). By replacing $p(y_t \mid x_t^i)$ in Eq. (5.2) with $\tilde{p}(y_t \mid x_t^i)$ from Eq. (5.6), the loss function becomes

$$\tilde{\mathcal{L}}(\Theta^i \mid \mathbf{x}^i) = -\sum_{t=1}^{T^i} \sum_{s=1}^{S} \left[ (1 - \rho^i)\, p(y_t = s \mid x_t^i) + \rho^i\, p_0(y_t = s \mid x_t^i) \right] \log p(y_t = s \mid x_t^i, \Theta^i) + \lambda \lVert w^i \rVert_2^2, \tag{5.7}$$
which can be re-arranged as

$$\tilde{\mathcal{L}}(\Theta^i \mid \mathbf{x}^i) = -(1 - \rho^i) \sum_{t=1}^{T^i} \sum_{s=1}^{S} p(y_t = s \mid x_t^i) \log p(y_t = s \mid x_t^i, \Theta^i) - \rho^i \sum_{t=1}^{T^i} \sum_{s=1}^{S} p_0(y_t = s \mid x_t^i) \log p(y_t = s \mid x_t^i, \Theta^i) + \lambda \lVert w^i \rVert_2^2. \tag{5.8}$$
Note that the first and the third terms on the right-hand side of Eq. (5.8) correspond to the right-hand side of Eq. (5.2). Therefore the KLD-regularized loss function is [51]

$$\tilde{\mathcal{L}}(\Theta^i \mid \mathbf{x}^i) = (1 - \rho^i)\, \mathcal{L}(\Theta^i \mid \mathbf{x}^i) - \rho^i \sum_{t=1}^{T^i} \sum_{s=1}^{S} p_0(y_t = s \mid x_t^i) \log p(y_t = s \mid x_t^i, \Theta^i), \tag{5.9}$$
■ FIGURE 5.3 Example of audio-visual cross-modal adaptation strategy.
where hard alignment is not used for $p_0(y_t \mid x_t^i)$ since it has to replicate the output of the original network. The amount of adaptation of the model can be controlled by resorting to the original model when the quality of the observations or the operational conditions do not allow an effective model adaptation. In particular, when $\rho^i = 0$ the model is adapted towards the new distribution, whereas when $\rho^i = 1$ the network is constrained to resemble the original distribution, prior to adaptation.
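The following NumPy sketch illustrates, on toy distributions, how the KLD-regularized target of Eq. (5.6) is built and checks numerically that the blended-target loss of Eq. (5.7) coincides with the re-arranged form of Eq. (5.9); the L2 term is omitted for brevity and all numbers are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
T, S = 4, 3          # toy segment length and number of identities
rho = 0.3            # regularization weight (in practice from Eq. (5.11))

def rand_dist(shape):
    """Random normalized distributions over the S identities."""
    p = rng.random(shape)
    return p / p.sum(axis=-1, keepdims=True)

p_target = rand_dist((T, S))   # "actual" target distribution p(y_t | x_t)
p_orig   = rand_dist((T, S))   # output of the network before adaptation, p_0
p_model  = rand_dist((T, S))   # current network output p(y_t | x_t, Theta)

# Eq. (5.6): KLD-regularized target distribution.
p_tilde = (1.0 - rho) * p_target + rho * p_orig

# Eq. (5.7): cross-entropy between the blended target and the current model.
loss_57 = -np.sum(p_tilde * np.log(p_model))

# Eq. (5.9): (1 - rho) * original loss - rho * sum p_0 log p.
loss_52 = -np.sum(p_target * np.log(p_model))
loss_59 = (1.0 - rho) * loss_52 - rho * np.sum(p_orig * np.log(p_model))

assert np.isclose(loss_57, loss_59)   # the two formulations agree
print(loss_57, loss_59)
```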
5.4 AUDIO-VISUAL REIDENTIFICATION

While both mono-modal and multi-modal approaches for person reidentification are moderately successful in controlled scenarios [5,26,48], their use on data from body-worn cameras requires relaxing the constraints on the scene, the target position or appearance, and the speech signals.

Fig. 5.3 shows a specific instantiation for person reidentification of the cross-modal model adaptation approach presented in the previous section: the classification results of two mono-modal Multi-Layer Perceptrons (MLPs) undergo late score fusion. This method extends our previous work [10] with more advanced deep features for the video modality. We detect faces with an MXNet implementation in Python of light CNN [49] (https://github.com/tornadomeet/mxnet-face) and we extract VGG-face features using the Oxford models and tools available for Keras [37]; VGG-face implements the same architecture as VGG16 [43] but is trained to recognize faces. A 4096-dimensional vector is extracted for each face by considering the 7th layer (i.e. "fc7"). Audio features are based on the Total Variability (TV) paradigm [15,16]: we generate 30-dimensional feature vectors by concatenating 15 Mel-Frequency Cepstral Coefficients
(MFCC) with their first derivatives (energy is not used). We extract the coefficients from 20 ms windows, with a 10 ms step. Starting from the MFCCs and using a TV feature extractor trained on the out-of-domain clean Italian APASCI dataset [4], we extract a 200-dimensional i-vector [15,16] for each utterance.

Audio and video deep features are classified frame by frame (i.e. for each face and each i-vector) using a modality-dependent MLP. The goal is to identify the person present in each segment $k$ (we assume that a single person is present in each segment) using $T^i$ feature vectors for modality $i$ and $T^j$ feature vectors for modality $j$. We select the identity with the maximum mean posterior over each segment $k$:

$$\hat{s}_k^i = \arg\max_{s \in \mathcal{S}} \frac{1}{T^i} \sum_{t=1}^{T^i} p(y_t = s \mid x_t^i), \tag{5.10}$$

and similarly for modality $j$. The KLD parameter, $\rho_k^i$, is segment-dependent and varies with the reliability of the observations. This parameter is a function of the average score of the estimated label in that segment:

$$\rho_k^i = g^i\!\left[ \frac{1}{T^j} \sum_{t=1}^{T^j} p(y_t = \hat{s}_k^j \mid x_t^j) \right]. \tag{5.11}$$
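A minimal sketch of Eqs. (5.10)–(5.11) follows. Section 5.6 only states that $g^V[\cdot]$ is the identity function and $g^A[\cdot]$ a sigmoid, so the sigmoid parameters used below are hypothetical placeholders.

```python
import numpy as np

def segment_identity(posteriors):
    """Eq. (5.10): identity with the maximum mean posterior over a segment.

    posteriors: array of shape (T, S) with frame-level posteriors
    p(y_t = s | x_t) for one modality within segment k.
    """
    mean_post = posteriors.mean(axis=0)
    return int(np.argmax(mean_post)), mean_post

def rho_from_other_modality(posteriors_j, s_hat_j, g=lambda a: a):
    """Eq. (5.11): segment-dependent KLD weight rho_k^i.

    posteriors_j: frame-level posteriors of the *other* modality j, (T_j, S).
    s_hat_j:      segment-level label estimated by modality j via Eq. (5.10).
    g:            reliability mapping g^i (identity for video, sigmoid for
                  audio in this chapter; parameters below are guesses).
    """
    avg_score = posteriors_j[:, s_hat_j].mean()
    return g(avg_score)

def g_audio(a, slope=10.0, midpoint=0.5):
    """Hypothetical sigmoid reliability mapping for the audio modality."""
    return 1.0 / (1.0 + np.exp(-slope * (a - midpoint)))

# Toy usage: video posteriors drive the adaptation of the audio model.
video_post = np.random.dirichlet(np.ones(13), size=125)   # T^V = 125 frames
s_hat_v, _ = segment_identity(video_post)
rho_audio = rho_from_other_modality(video_post, s_hat_v, g=g_audio)
print(s_hat_v, rho_audio)
```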
Finally, score fusion is implemented as a weighted average of the scores of the two modalities [9].
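For reference, the first stage of the audio front-end described earlier in this section (15 MFCCs plus their first derivatives, computed over 20 ms windows with a 10 ms step) could be obtained, for example, with librosa as sketched below. The sampling rate and the handling of the energy/0th coefficient are assumptions, and the subsequent 200-dimensional i-vector extraction is not reproduced; this is not the chapter's exact front-end.

```python
import librosa
import numpy as np

def mfcc_delta_features(wav_path, sr=16000, n_mfcc=15):
    """30-dimensional MFCC + delta features (20 ms windows, 10 ms hop).

    Only the frame-level feature stage is sketched; the TV/i-vector
    extractor applied on top of these frames is not reproduced here.
    """
    y, sr = librosa.load(wav_path, sr=sr)   # resampling is an assumption
    n_fft = int(0.02 * sr)                  # 20 ms analysis window
    hop = int(0.01 * sr)                    # 10 ms step
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    delta = librosa.feature.delta(mfcc)     # first derivatives
    return np.vstack([mfcc, delta]).T       # shape: (frames, 30)

# features = mfcc_delta_features("segment_k.wav")
```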
5.5 REIDENTIFICATION DATASET

The QM-GoPro dataset contains 52 clips of 13 subjects talking for 1 minute to a person wearing a chest-mounted GoPro camera (the dataset is available at http://www.eecs.qmul.ac.uk/~andrea/adaptation.html and is described in [9]). The dataset includes four conditions:

■ C1 (indoors): in a lecture room;
■ C2 (indoors): different clothes and possibly a different room;
■ C3 (outdoors): same clothes as in C1, quiet location;
■ C4 (outdoors): same clothes as in C2, noisy location (near a busy road).
The video resolution is 1920 × 1080 pixels, at 25 frames per second. Audio streams are sampled at 48 kHz, 16 bits. Speakers are up to a few meters away from the camera, so the reidentification task is in distant-talking conditions.
A noticeable illumination mismatch is present between the indoor and the outdoor recordings. Conditions C1 and C2 are challenging for video processing as the illumination changes considerably when the speaker moves, e.g. towards or away from the windows. Moreover, the person is often occluded by furniture, which challenges the person detector. The speaker-to-camera distance varies considerably: from so close that only a body part is visible to so far that the face is barely visible. The illumination conditions and the speaker-to-camera distance in C3 and C4 are instead rather constant.

The camera microphone is partially covered by a plastic shield, resulting in a strong low-pass effect that limits the spectral content (even if the audio is sampled at 48 kHz). The acoustics change considerably between the favorable indoor conditions (C1 and C2) and the more challenging outdoor conditions (C3 and C4). Although C3 is relatively quiet, a variety of interfering noise sources are present (e.g. wind blowing into the microphone, people passing by), which affect speaker reidentification performance. In addition, the lack of reverberation introduces a further mismatch with respect to C1 and C2. Finally, C4 has strong background noise, mostly due to road traffic, which at times makes the voice of the speaker inaudible.
5.6 REIDENTIFICATION RESULTS

In this section we discuss results on the QM-GoPro dataset (Section 5.5) using the reidentification pipeline of Section 5.4. We compare four strategies:

■ intra (intra-modal): adaptation based on labels from the same modality;
■ cross (cross-modal): adaptation based on labels from the other modality;
■ KLD: KLD regularization added to cross;
■ base (baseline): no adaptation.
We test both matched conditions, when the target models are obtained under the same condition Cn as the test, and mismatched conditions, when the target models are trained under condition Cn and tested under condition Cm, with m ≠ n. Each clip is split into 5-second segments, excluding the first 8 and the last 5 s. Each segment ($k = 1, \dots, K$) consists of $T^V = 125$ frames (or VGG feature vectors) and $T^A = 1$ i-vector. For training, the first 3 segments of each target under each condition are used. Moreover, $g^V[\cdot]$ is the identity function and $g^A[\cdot]$ is a sigmoid function (see [10] for details).

Fig. 5.4 shows the results under matched conditions. When matched models are available and the video conditions are favorable (C3 and C4), the use of deep features, in combination with a deep neural classifier, results in 100% accuracy (Fig. 5.4B). Similarly, i-vectors offer high performance indoors,
while their accuracy deteriorates as the noise increases, due to the mismatch in the i-vector extractor. Cross-modal adaptation improves the performance of badly performing models but degrades the performance of accurate video models (i.e. in C3 and C4). The KLD-based regularization improves poor models while preserving the performance of the good ones. Note that KLD limits the potential impact of cross-modal adaptation for audio, probably because it underestimates the reliability of the video models.

Fig. 5.4C reports the results after late score fusion. Note that this fusion strategy is not optimal for the system discussed here, as it was developed in [9] to optimize the Equal Error Rate (EER) on a different classifier. Note also that if an effective fusion is used, the overall accuracy is high and the benefits of model adaptation are less evident with respect to the baseline system, especially if the two modalities are always available (as in the majority of the cases in the QM-GoPro dataset).
■ FIGURE 5.4 Accuracy under matched conditions (QM-GoPro dataset).
■ FIGURE 5.5 Accuracy under “mismatched-C3” conditions (QM-GoPro dataset).
Nevertheless, the results with the KLD adaptation are equivalent or superior to those of the baseline. A minor deterioration is observed in C4.

Figs. 5.5 and 5.6 show the classification accuracy under mismatched conditions: “mismatch-C3” (models are trained in C3) and “mismatch-C1” (models are trained in C1). As the behaviors of the methods are similar in the two cases, we focus the analysis on “mismatch-C1” only. In Fig. 5.6A the performance decreases from C1 to C4 as the amount of mismatch increases due to noise (outdoors). Note that the mismatch affects not only the speaker models but also the i-vector extractor. Similar trends are observed in Fig. 5.5A: note, however, that despite some noise in the training set, C4 is still worse than C1 and C2. In terms of adaptation, the same behavior
is observed here as in the matched case. Cross reduces the model mismatch and KLD, in some cases, limits the potential gain. The effect of the model mismatch is less evident for the video modality, as shown in Fig. 5.6B. In C3 and C4 the accuracy is higher than in C1, despite a large mismatch in lighting conditions. This is related to the robustness of the VGG features, combined with the fact that in C3 and C4 faces are visible without occlusions or over-exposure. This is also observed in Fig. 5.5B, where models trained in C3 are suitable also for C4 (similar lighting conditions), but performance deteriorates in C1 and C2 to the same level as “mismatch-C1”. Occlusions and over-exposure are the main reasons for the low performance indoors. Also for the video modality, the KLD-regularized adaptation improves the performance over the baselines and avoids the model deterioration introduced by cross adaptation in C3 and C4.
■ FIGURE 5.6 Accuracy under “mismatched-C1” conditions (QM-GoPro dataset).
Finally, after score fusion, the same trends are observed: KLD improves over the baselines and does not affect the good models.
5.7 CLOSING REMARKS

In this chapter we discussed the main challenges in learning models from data captured by body-worn cameras. In particular, we exploited the complementarity of observations generated by different sensing modalities [42], namely audio and video, for cross-modal adaptation in person reidentification. We also showed how model adaptation can be effective for reidentification as well as for improving the performance of each mono-modal model. This is a desirable feature for body-worn cameras as the amount of training (and enrollment) data is generally limited. Importantly, the proposed method is also applicable when each modality is used for a different task.
REFERENCES

[1] M.R. Alam, M. Bennamoun, R. Togneri, F. Sohel, A confidence-based late fusion framework for audio-visual biometric identification, Pattern Recognit. Lett. 52 (2015) 65–71.
[2] A. Albiol, L. Torres, E.J. Delp, Fully automatic face recognition system using a combined audio-visual approach, IEE Proc., Vis. Image Signal Process. 152 (3) (June 2005) 318–326.
[3] P.S. Aleksic, A.K. Katsaggelos, Audio-visual biometrics, Proc. IEEE 94 (11) (Nov. 2006) 2025–2044.
[4] B. Angelini, F. Brugnara, D. Falavigna, D. Giuliani, R. Gretter, M. Omologo, Speaker independent continuous speech recognition using an acoustic-phonetic Italian corpus, in: Proc. of the Int. Conf. on Spoken Language Processing, 1994, pp. 1391–1394.
[5] A. Bedagkar-Gala, S.K. Shah, A survey of approaches and trends in person reidentification, Image Vis. Comput. 32 (4) (2014) 270–286.
[6] S. Bengio, C. Marcel, S. Marcel, J. Mariéthoz, Confidence measures for multimodal identity verification, Inf. Fusion 3 (4) (2002) 267–276.
[7] A. Betancourt, P. Morerio, C.S. Regazzoni, M. Rauterberg, The evolution of first person vision methods: a survey, IEEE Trans. Circuits Syst. Video Technol. 25 (5) (May 2015) 744–760.
[8] S. Bickel, T. Scheffer, Estimation of mixture models using Co-EM, in: J. Gama, R. Camacho, P.B. Brazdil, A.M. Jorge, L. Torgo (Eds.), Machine Learning: ECML 2005, Springer, Berlin, Heidelberg, 2005, pp. 35–46.
[9] A. Brutti, A. Cavallaro, Online cross-modal adaptation for audio-visual person identification with wearable cameras, IEEE Trans. Human-Mach. Syst. 47 (1) (Feb. 2017) 40–51.
[10] A. Brutti, A. Cavallaro, Unsupervised cross-modal deep-model adaptation for audio-visual re-identification with wearable cameras, in: Proc. of Int. Conf. on Computer Vision Workshop, Oct. 2017.
[11] U.V. Chaudhari, G.N. Ramaswamy, G. Potamianos, C. Neti, Audio-visual speaker recognition using time-varying stream reliability prediction, in: Proc. of IEEE Int. Conf. on Audio, Speech and Signal Processing, vol. 5, Apr. 2003, pp. 712–715.
[12] C.M. Christoudias, K. Saenko, L. Morency, T. Darrell, Co-adaptation of audio-visual speech and gesture classifiers, in: Proc. of the Int. Conf. on Multimodal Interfaces, Nov. 2006, pp. 84–91.
[13] B. Clarkson, K. Mase, A. Pentland, Recognizing user context via wearable sensors, in: Proc. of the Int. Symposium on Wearable Computers, Oct. 2000, pp. 69–75.
[14] B. Clarkson, A. Pentland, Unsupervised clustering of ambulatory audio and video, in: Proc. of IEEE Int. Conf. on Audio, Speech and Signal Processing, Mar. 1999, pp. 3037–3040.
[15] N. Dehak, R. Dehak, P. Kenny, N. Brümmer, P. Ouellet, P. Dumouchel, Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification, in: Proc. of Interspeech, Sep. 2009, pp. 1559–1562.
[16] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification, IEEE Trans. Audio Speech Lang. Process. 19 (4) (May 2011) 788–798.
[17] A.G. del Molino, C. Tan, J.H. Lim, A.H. Tan, Summarization of egocentric videos: a comprehensive survey, IEEE Trans. Human-Mach. Syst. 47 (1) (Feb. 2017) 65–76.
[18] H.K. Ekenel, M. Fischer, Q. Jin, R. Stiefelhagen, Multi-modal person identification in a smart environment, in: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, Jun. 2007, pp. 1–8.
[19] E. Erzin, Y. Yemez, A.M. Tekalp, Multimodal speaker identification using an adaptive classifier cascade based on modality reliability, IEEE Trans. Multimed. 7 (5) (Oct. 2005) 840–852.
[20] D. Falavigna, M. Matassoni, S. Jalalvand, M. Negri, M. Turchi, DNN adaptation by automatic quality estimation of ASR hypotheses, Comput. Speech Lang. 46 (Nov. 2017) 585–604.
[21] A. Fathi, A. Farhadi, J.M. Rehg, Understanding egocentric activities, in: Proc. of Int. Conf. on Computer Vision, Nov. 2011, pp. 407–414.
[22] A. Fathi, J.K. Hodgins, J.M. Rehg, Social interactions: a first-person perspective, in: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2012, pp. 1226–1233.
[23] F. Fergnani, S. Alletto, G. Serra, J.D. Mira, R. Cucchiara, Body part based reidentification from an egocentric perspective, in: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition Workshops, June 2016, pp. 355–360.
[24] N.A. Fox, R. Gross, J.F. Cohn, R.B. Reilly, Robust biometric person identification using automatic classifier fusion of speech, mouth, and face experts, IEEE Trans. Multimed. 9 (4) (June 2007) 701–714.
[25] N. Kern, B. Schiele, H. Junker, P. Lukowicz, G. Troster, Wearable sensing to annotate meeting recordings, in: Proc. of the Int. Symposium on Wearable Computers, June 2002, pp. 186–193.
[26] T. Kinnunen, H. Li, An overview of text-independent speaker recognition: from features to supervectors, Speech Commun. 52 (1) (Jan. 2010) 12–40.
[27] K.M. Kitani, T. Okabe, Y. Sato, A. Sugimoto, Fast unsupervised ego-action learning for first-person sports videos, in: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, Nov. 2011, pp. 3241–3248.
[28] Y.J. Lee, J. Ghosh, K. Grauman, Discovering important people and objects for egocentric video summarization, in: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2012, pp. 1346–1353.
[29] A. Levin, P. Viola, Y. Freund, Unsupervised improvement of visual detectors using co-training, in: Proc. of Int. Conf. on Computer Vision, vol. 1, Oct. 2003, pp. 626–633.
[30] Z. Lu, K. Grauman, Story-driven summarization for egocentric video, in: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2013, pp. 2714–2721.
[31] S. Lucey, T. Chen, S. Sridharan, V. Chandran, Integration strategies for audio-visual speech processing: applied to text-dependent speaker recognition, IEEE Trans. Multimed. 7 (3) (June 2005).
[32] J. Luque, R. Morros, A. Garde, J. Anguita, M. Farrus, D. Macho, F. Marqués, C. Martínez, V. Vilaplana, J. Hernando, Audio, video and multimodal person identification in a smart room, in: Rainer Stiefelhagen, John Garofolo (Eds.), Multimodal Technologies for Perception of Humans: First International Evaluation Workshop on Classification of Events, Activities and Relationships, CLEAR 2006, 2007.
[33] Z. Lv, A. Halawani, S. Feng, S. Réhman, H. Li, Touch-less interactive augmented reality game on vision-based wearable device, Pers. Ubiquitous Comput. 19 (3) (July 2015) 551–567.
[34] C. McCool, S. Marcel, A. Hadid, M. Pietikainen, P. Matejka, J. Cernocky, N. Poh, J. Kittler, A. Larcher, C. Levy, D. Matrouf, J. Bonastre, P. Tresadern, T. Cootes, Bi-modal person recognition on a mobile phone: using mobile phone data, in: Proc. of ICME Workshop on Hot Topics in Mobile Multimedia, July 2012, pp. 635–640.
[35] I. McCowan, S. Bengio, D. Gatica-Perez, G. Lathoud, F. Monay, D. Moore, P. Wellner, H. Bourlard, Modeling human interaction in meetings, in: Proc. of IEEE Int. Conf. on Audio, Speech and Signal Processing, vol. 4, Apr. 2003, pp. 748–751.
[36] A. Noulas, G. Englebienne, B.J.A. Krose, Multimodal speaker diarization, IEEE Trans. Pattern Anal. Mach. Intell. 34 (1) (Jan. 2012).
[37] O.M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in: British Machine Vision Conference, 2015.
[38] H. Pirsiavash, D. Ramanan, Detecting activities of daily living in first-person camera views, in: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition, June 2012, pp. 2847–2854.
[39] C. Sanderson, S. Bengio, H. Bourlard, J. Mariethoz, R. Collobert, M.F. BenZeghiba, F. Cardinaux, S. Marcel, Speech face based biometric authentication at IDIAP, in: Proc. of the Int. Conf. on Multimedia and Expo, vol. 3, July 2003, pp. 1–4.
[40] C. Sanderson, K.K. Paliwal, Noise compensation in a person verification system using face and multiple speech features, Pattern Recognit. 36 (2) (2003) 293–302.
[41] R. Seymour, J. Ming, S.D. Stewart, A new posterior based audio-visual integration method for robust speech recognition, in: Proc. of Interspeech, Sep. 2005, pp. 1229–1232.
[42] S.T. Shivappa, M.M. Trivedi, B.D. Rao, Audiovisual information fusion in human computer interfaces and intelligent environments: a survey, Proc. IEEE 98 (10) (Oct. 2010) 1692–1715.
[43] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR, abs/1409.1556, 2014.
[44] P. Smaragdis, M. Casey, Audio/visual independent components, in: Proc. of the Int. Symposium on Independent Component Analysis and Blind Source Separation, Apr. 2003, pp. 709–714.
[45] S. Song, V. Chandrasekhar, B. Mandal, L. Li, J.H. Lim, G.S. Babu, P.P. San, N.M. Cheung, Multimodal multi-stream deep learning for egocentric activity recognition, in: Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition Workshops, June 2016, pp. 378–385.
[46] R. Stiefelhagen, K. Bernardin, R. Bowers, R. Rose, M. Michel, J. Garofolo, The CLEAR 2007 evaluation, in: Rainer Stiefelhagen, Rachel Bowers, Jonathan Fiscus (Eds.), Multimodal Technologies for Perception of Humans, in: Lect. Notes Comput. Sci., Springer, Berlin, Heidelberg, 2008, chapter 1.
[47] A. Subramanya, A. Raj, Recognizing activities and spatial context using wearable sensors, in: Proc. of the Conf. on Uncertainty in Artificial Intelligence, July 2006.
[48] R. Vezzani, D. Baltieri, R. Cucchiara, People reidentification in surveillance and forensics: a survey, ACM Comput. Surv. 46 (2) (Dec. 2013) 29:1–29:37.
[49] X. Wu, R. He, Z. Sun, A lightened CNN for deep face representation, arXiv preprint arXiv:1511.02683, 2015.
[50] Y. Yan, E. Ricci, G. Liu, N. Sebe, Egocentric daily activity recognition via multitask clustering, IEEE Trans. Image Process. 24 (10) (2015) 2984–2995.
[51] D. Yu, K. Yao, H. Su, G. Li, F. Seide, KL-Divergence regularized deep neural network adaptation for improved large vocabulary speech recognition, in: Proc. of IEEE Int. Conf. on Audio, Speech and Signal Processing, May 2013, pp. 7893–7897.
[52] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, Modeling individual and group actions in meetings with layered HMMs, IEEE Trans. Multimed. 8 (3) (June 2006).