Chapter 7
Basic Concepts of Multimodal Analysis

Mihai Gurban and Jean-Philippe Thiran
Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

7.1. Defining Multimodality
7.2. Advantages of Multimodal Analysis
7.3. Conclusion
References
7.1 DEFINING MULTIMODALITY

The word multimodal is used by researchers in different fields and often with different meanings. One of its most common uses is in the field of human–computer (or man–machine) interaction (HCI). Here, a modality is a natural way of interaction: speech, vision, facial expressions, handwriting, gestures or even head and body movements. Using several such modalities can lead to multimodal speaker tracking systems, multimodal person identification systems, multimodal speech recognisers or, more generally, multimodal interfaces. Such interfaces aim to facilitate HCI, augmenting or even replacing the traditional keyboard and mouse.

Let us go into more detail about these particular examples. Multimodal speaker detection, tracking or localisation consists of identifying the active speaker in an audio-video sequence which
contains several speakers, based on the correlation between the audio and the movement in the video [1, 2]. In the case of a system with several cameras, the speakers might not all be in the same image, but the underlying idea remains the same. The gain of multimodality over using, for example, video alone is that correlations with the audio can help discriminate between the person who is actually speaking and someone who might only be murmuring something inaudible. It is also possible to find the active speaker using audio only, with the help of microphone arrays, but this makes the hardware set-up more complex and expensive. More details on this application can be found in Chapter 9.
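To make the idea of audio-visual correlation more concrete, here is a minimal sketch of a toy active-speaker selector, not the method of [1] or [2]: it assumes that a per-frame audio energy envelope and a per-frame mouth-motion measure for each candidate (both hypothetical, extracted elsewhere) are already frame-aligned, and it simply picks the candidate whose motion correlates best with the audio.

```python
import numpy as np

def active_speaker(audio_energy, mouth_motion):
    """Pick the candidate whose mouth motion correlates best with the audio.

    audio_energy : 1-D array, per-frame audio energy envelope.
    mouth_motion : 2-D array (n_candidates, n_frames), per-frame motion
                   energy in each candidate's mouth region (hypothetical
                   features, assumed already extracted and frame-aligned).
    """
    scores = []
    for motion in mouth_motion:
        # Pearson correlation between the audio envelope and this
        # candidate's mouth movement over the analysis window.
        r = np.corrcoef(audio_energy, motion)[0, 1]
        scores.append(r)
    return int(np.argmax(scores)), scores

# Toy usage with synthetic data: the second candidate "moves" in sync
# with the audio, the first one does not.
rng = np.random.default_rng(0)
audio = np.abs(rng.normal(size=200))
motions = np.vstack([rng.normal(size=200),                 # silent person
                     audio + 0.3 * rng.normal(size=200)])  # active speaker
print(active_speaker(audio, motions)[0])  # expected: 1
```

A real system would of course work with noisy, imperfect features and more robust measures of audio-visual synchrony, but the principle of ranking candidates by their agreement with the audio is the same.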
For multimodal speech recognition, or audio-visual speech recognition, video information from the speaker's lips is used to augment the audio stream and improve speech recognition accuracy [3]. This is motivated by human speech perception, as it has been shown that humans subconsciously use both audio and visual information when understanding speech [4]. There are speech sounds which are very similar in the audio modality but easy to discriminate visually. Using both modalities significantly improves automatic speech recognition results, especially when the audio is corrupted by noise. Chapter 9 gives a detailed treatment of how the multimodal integration is done in this case.

A multimodal biometrics system [5] establishes the identity of a person from a list of candidates previously enrolled in the system, based on not one but several modalities, which can be taken from a long list: face images, audio speech, visual speech or lip-reading, fingerprints, iris images, retinal images, handwriting, gait and so on. The use of more than one modality makes the results more reliable, decreasing the number of errors. It should be noted here how heterogeneous these modalities can be.

There are still other examples in the domain of HCI. Multimodal annotation systems are annotation systems typically used on audio-visual sequences, to add metadata describing the speech, gestures, actions and even emotions of the individuals present in these sequences (see Chapter 12). Indeed, multimodal emotion recognition [6] is a very active area of research, typically using the speech, facial expressions, poses and gestures of a person to assess his or her emotional state. The degree of interest of a person can also be assessed from multimodal data, as shown in Chapter 15.

However, all these applications are taken from just one limited domain, HCI. There are many other fields where different modalities are used. For psychologists, sensory modalities represent the human senses (sight, hearing, touch and so on). This is not very different from the previous interpretation. However, other researchers can use the word modality in a completely different way. For example, linked to the concept of medium as a physical device, modalities are the ways to use such media [7]. The pen device is a medium, whereas the actions associated with it, such as drawing, pointing, writing or gestures, are all modalities.

In signal processing applications, however, the modalities are signals originating from the same physical source [8] or phenomenon, but captured through different means. In this case, the number of modalities may not be the same as the number of information streams used by the application. For example, the video modality can offer information about the identity of the speaker (through face analysis), his or her emotions (through facial gesture analysis) or even about what he or she is saying (through lip-reading). This separation of information sources in the context of multimodal analysis can be called information fission.

Generally, the word multimodal is associated with the input of information. However, there are cases where the output of the computer is considered multimodal, as in multimodal speech synthesis, the augmentation of synthesised speech with animated talking heads (see Chapter 13 for more details). In this context, distributing the information from the internal states of the system to the different output modalities is done through multimodal fission.

In the context of medical image registration, a modality can be any of a large array of imaging techniques, ranging from anatomical, such as X-ray, magnetic resonance imaging or ultrasound, to functional, such as functional magnetic resonance imaging or positron emission tomography [9]. Multimodal registration is the process of bringing images from several such modalities into spatial alignment. The same term is used in remote sensing, where the modalities are images
from different spectral bands [10] (visible, infrared or microwave, for example). For both medical image analysis and remote sensing, the gain from using images from multiple modalities comes from their complementarity: more information will be present in a combined multimodal image than in any of the monomodal ones, provided they have been properly aligned. This alignment, in turn, is based on information that is common to all the modalities used, that is, on redundant information.
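To give a rough idea of how such alignment can exploit redundant information, the sketch below computes the mutual information between the intensities of two images from a joint histogram; mutual information is one of the similarity measures commonly maximised during multimodal registration, but the histogram settings and the toy data here are arbitrary choices, not those of the surveys cited above.

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Mutual information between the intensities of two aligned images.

    Higher values mean that knowing the intensity in one modality tells
    us more about the intensity in the other; registration methods can
    therefore search for the spatial transform that maximises it.
    """
    hist, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p_ab = hist / hist.sum()                 # joint intensity distribution
    p_a = p_ab.sum(axis=1, keepdims=True)    # marginal of image A
    p_b = p_ab.sum(axis=0, keepdims=True)    # marginal of image B
    nz = p_ab > 0                            # avoid log(0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

# Toy check: an image is highly informative about itself and far less
# informative about unrelated noise.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64)).astype(float)
noise = rng.integers(0, 256, size=(64, 64)).astype(float)
print(mutual_information(img, img) > mutual_information(img, noise))  # True
```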
7.2 ADVANTAGES OF MULTIMODAL ANALYSIS

Although there is such high variability in multimodal signals, general processing methods exist which can be applied to many problems involving such signals, and research into such methods is ongoing, as there are important advantages to using multimodal information. The first advantage is the complementarity of the information contained in those signals, that is, each signal brings extra information which can be extracted and used, leading to a better understanding of the phenomenon under study. The second is that multimodal systems are more reliable than monomodal ones: if one modality becomes corrupted by noise, the system can still rely on the other modalities to, at least partially, achieve its purpose. Multimodal analysis then consists in exploiting this complementarity and redundancy of multiple modalities to analyse a physical phenomenon.

To offer an example of the complementarity of information, in multimodal medical image analysis, information that is missing in one modality may be clearly visible in another. The same is true for satellite imaging. Here, by missing information, we mean information that exists in the physical reality but was not captured through the particular modality used. By using several different modalities, we can get closer to the underlying phenomenon, which might not be possible to capture with just one type of sensor. In the domain of human–computer interfaces, combining speech recognition with gesture analysis can lead to new and powerful means of interaction. For example, in a 'put-that-there' interface [11], visual objects can be manipulated on the screen or in a virtual environment with voice commands coupled with pointing. Here, the modalities are naturally complementary because the action is only specified by voice, while the object and the place are inferred from the gestures.
Another example is audio-visual speech recognition. Indeed, some speech sounds are easily confusable in audio but easy to distinguish in video, and vice versa. In noisy conditions, humans themselves lip-read subconsciously, which shows that there is useful information in the movement of the lips, information that is missing from the corrupted audio.

For the second advantage, the improved reliability of multimodal systems, consider the case of a three-modal biometric identification system based on face, voice and fingerprint identification. When the voice acquisition is not optimal, because of loud ambient noise or because the subject has a bad cold, the system might still be able to identify the person correctly by relying on the face and fingerprint modalities. Audio-visual speech recognition can also be given as an example: if for some reason the video becomes unavailable, the system should seamlessly revert to audio-only speech recognition. As can be seen, the extra processing implied by using several different signals is well worth the effort, as both accuracy and reliability can be increased in this way.
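As a toy illustration of this graceful degradation, the following sketch performs score-level fusion in a hypothetical three-modal biometric system; the matcher scores and the per-modality quality estimates are made up, and how such quality estimates are obtained in practice is application-dependent.

```python
import numpy as np

def fuse_scores(scores, qualities):
    """Weighted score-level fusion across biometric modalities.

    scores    : dict  modality -> array of match scores, one per enrolled
                candidate, assumed already normalised to [0, 1].
    qualities : dict  modality -> reliability estimate in [0, 1]
                (e.g. low for noisy audio or a poor fingerprint capture).
    Returns the index of the best candidate and the fused scores.
    """
    weights = np.array([qualities[m] for m in scores])
    weights = weights / weights.sum()            # renormalise the weights
    stacked = np.vstack([scores[m] for m in scores])
    fused = weights @ stacked                    # weighted sum per candidate
    return int(np.argmax(fused)), fused

# Toy usage: three enrolled candidates; the voice matcher is confused by
# noise (and its quality estimate is low), but face and fingerprint agree.
scores = {
    'face':        np.array([0.20, 0.85, 0.30]),
    'voice':       np.array([0.70, 0.40, 0.65]),   # unreliable today
    'fingerprint': np.array([0.15, 0.90, 0.25]),
}
qualities = {'face': 0.9, 'voice': 0.2, 'fingerprint': 0.95}
print(fuse_scores(scores, qualities)[0])  # expected: candidate 1
```

The point of the sketch is only that down-weighting a degraded modality lets the remaining ones carry the decision, which is the fallback behaviour described above.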
However, multimodal signal processing also has its challenges, of which we will present two of the most important. The first is the extraction of relevant features from each modality, features that contain all the needed information while remaining compact. The second is the fusion of this information, ideally in such a way that variations in the reliability of each modality stream affect the final result as little as possible. We will now analyse in detail why each of these challenges is important.

The first challenge is the extraction of features which are at the same time relevant and compact. By relevant features, we mean features that contain the information required to solve the underlying problem; the term relevance is thus always tied to the context of the problem. To offer a simple example, for both speech recognition and speaker identification, some features need to be extracted from the audio signal, but their purpose is different. Features extracted for speech recognition should contain as much information as possible about what is being said and as little as possible about who is speaking. Thus, some hypothetical ideal features that would respect this requirement would be very relevant for the speech recognition task but irrelevant for speaker identification. For speaker identification, the situation is reversed: the features should contain as much information as possible about who is speaking and nothing about what was said. Obviously, we would like the features to contain all the relevant information in the signal and, at the same time, to retain as little superfluous information as possible.

The second requirement that we impose on the features is that they should be compact, that is, the feature vector should have a low dimensionality. This is needed because of the curse of dimensionality, a term describing the fact that the number of equally distributed samples needed to cover a volume of space grows exponentially with its dimensionality. For classification, accurate models can only be built when an adequate number of samples is available, and that number grows exponentially with the input dimensionality. Since only a limited amount of training data is available, a low input dimensionality is desirable to obtain optimal classification performance.

Dimensionality reduction techniques work by eliminating the redundancy in the data, that is, the information that is common to several modalities or, inside a modality, common to several components of the feature vector. The quest for compact features needs, however, to be balanced by the need for reliability, that is, the capability of the system to withstand adverse conditions like noise. Indeed, multimodal systems can tolerate some degree of noise when there is redundancy between the modalities: if one of the modalities is corrupted by noise, the same information may be found in another modality. In conclusion, reducing redundancy between the features of a modality is a good idea; however, reducing redundancy between modalities needs to be done without compromising the reliability of the system. Therefore, for the system to attain optimal performance, the features need to be relevant and compact.
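As one concrete example of a dimensionality reduction technique, the sketch below applies principal component analysis (computed via a singular value decomposition) to a matrix of redundant feature vectors; PCA is just one of many possible choices, and the synthetic data here only serve to show the idea.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project feature vectors onto their first principal components.

    features     : array of shape (n_samples, n_dims).
    n_components : target dimensionality (n_components < n_dims).
    Returns the reduced features, the mean and the projection basis, so
    that new samples can later be projected with (x - mean) @ basis.
    """
    mean = features.mean(axis=0)
    centred = features - mean
    # Right singular vectors of the centred data are the principal axes.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    basis = vt[:n_components].T
    return centred @ basis, mean, basis

# Toy usage: 100-dimensional feature vectors whose components are highly
# redundant (driven by only a few latent factors) compress well to 5 dims.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))                  # 5 underlying factors
mixing = rng.normal(size=(5, 100))
features = latent @ mixing + 0.01 * rng.normal(size=(500, 100))
reduced, mean, basis = pca_reduce(features, n_components=5)
print(reduced.shape)  # (500, 5)
```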
The second challenge that we mentioned is multimodal integration, or fusion, which is the method of combining the information coming from the different modalities. As the signals involved may not have the same data rate, range or dimensionality, combining them is not straightforward. The goal is to gather the useful information from all the modalities in such a way that the end result is superior to what each modality could provide individually.

Multimodal fusion can be done at different levels. The simplest kind of fusion operates at the lowest level, the signal level, by concatenating the signals coming from each modality; obviously, for this, the signals have to be synchronised. High-level decision fusion operates directly on decisions taken on the basis of each modality, and this does not require synchrony. There can also be an intermediate level of fusion, where some initial information is extracted from each modality, synchronously fused into a single stream and then processed further. A more detailed analysis is given in Chapter 8.

The multimodal integration method has an important impact on the reliability of the system. Ideally, the fusion method should allow variations in the importance given to the data stream from each modality, because the quality of these streams may vary in time. If one modality becomes corrupted by noise, the system should adapt to rely more on the other modalities, at least until the corrupted modality returns to its previous level of quality. This type of adaptation is, however, not possible with simple low-level fusion but only with decision fusion.

The levels of fusion can easily be illustrated on the example of audio-visual speech recognition. Concatenating the audio and visual features into a single audio-visual data stream represents the lowest level of fusion. Recognising phonemes and visemes (visual speech units) in each modality and then fusing these speech units into multimodal word models represents intermediate-level fusion. Finally, recognising the words individually in each modality, based on monomodal word models, and then fusing only the n-best lists means that fusion is done at the highest level.

Synchrony between the signals is also an important aspect. In some applications, such as audio-visual speech recognition, the signals are naturally synchronous, as phonemes and visemes are emitted simultaneously. In other applications, the signals are asynchronous, as in the case of simultaneous text and speech recognition. Here, the user is not constrained to pronounce the words at the same time as they are written, so the alignment needs to be estimated through the analysis. Obviously, low-level fusion is not feasible in this case.
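Returning to the audio-visual speech recognition example, the sketch below contrasts the lowest level of fusion (feature concatenation) with the highest (combining monomodal word scores); the feature dimensions, word scores and stream weight are hypothetical placeholders, and a real system would use proper statistical models such as hidden Markov models.

```python
import numpy as np

# --- Low-level (feature) fusion: concatenate synchronised feature vectors.
def feature_fusion(audio_feats, video_feats):
    """Concatenate frame-synchronous audio and visual feature vectors.

    Both arrays must have the same number of frames (synchrony is required
    at this level), with shapes (n_frames, d_audio) and (n_frames, d_video);
    the result feeds a single recogniser.
    """
    return np.concatenate([audio_feats, video_feats], axis=1)

# --- High-level (decision) fusion: combine per-modality word scores.
def decision_fusion(audio_scores, video_scores, audio_weight=0.7):
    """Combine monomodal word log-scores with a relative stream weight.

    audio_scores, video_scores : dict word -> log-score from each monomodal
    recogniser (hypothetical n-best lists). The weight can be lowered at
    run time when the audio is noisy, without retraining anything.
    """
    words = set(audio_scores) & set(video_scores)
    fused = {w: audio_weight * audio_scores[w]
                + (1.0 - audio_weight) * video_scores[w] for w in words}
    return max(fused, key=fused.get)

# Toy usage of the low level: 13 audio and 10 visual features per frame.
audio_feats = np.zeros((100, 13))   # placeholder audio features
video_feats = np.zeros((100, 10))   # placeholder lip-shape features
print(feature_fusion(audio_feats, video_feats).shape)  # (100, 23)

# Toy usage of the high level: the audio barely separates 'mat' from
# 'gnat' (acoustically similar nasals), but the lips disambiguate them.
audio_nbest = {'mat': -1.0, 'gnat': -1.1, 'bat': -3.0}
video_nbest = {'mat': -0.5, 'gnat': -2.5, 'bat': -0.7}
print(decision_fusion(audio_nbest, video_nbest, audio_weight=0.5))  # 'mat'
```

In the decision-level version, lowering the audio weight when the audio stream degrades is exactly the kind of adaptation discussed above, which simple feature concatenation cannot offer.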
7.3 CONCLUSION

We have presented several examples of multimodal applications, showing how, by using the complementarity of different modalities,
we can improve classification results, enhance the user experience and increase overall system reliability. We have shown the advantages of multimodal analysis, but also the main challenges in the field. The next chapter, Chapter 8, gives a much more detailed overview of multimodal fusion methods and of the levels at which fusion can be done.
REFERENCES

1. H. Nock, G. Iyengar, C. Neti, Speaker localisation using audio-visual synchrony: an empirical study, in: Proceedings of the International Conference on Image and Video Retrieval, 2003, pp. 565–570.
2. M. Gurban, J. Thiran, Multimodal speaker localization in a probabilistic framework, in: Proceedings of the 14th European Signal Processing Conference (EUSIPCO), 2006.
3. G. Potamianos, C. Neti, J. Luettin, I. Matthews, Audio-visual automatic speech recognition: an overview, in: G. Bailly, E. Vatikiotis-Bateson, P. Perrier (Eds.), Issues in Audio-Visual Speech Processing, MIT Press, 2004.
4. H. McGurk, J. MacDonald, Hearing lips and seeing voices, Nature 264 (1976) 746–748.
5. A. Ross, A.K. Jain, Multimodal biometrics: an overview, in: Proceedings of the 12th European Signal Processing Conference (EUSIPCO), 2004, pp. 1221–1224.
6. N. Sebe, I. Cohen, T. Huang, Multimodal emotion recognition, in: C. Chen, P. Wang (Eds.), Handbook of Pattern Recognition and Computer Vision, World Scientific, 2005.
7. C. Benoit, J.C. Martin, C. Pelachaud, L. Schomaker, B. Suhm, Audio-visual and multimodal speech systems, in: D. Gibbon (Ed.), Handbook of Standards and Resources for Spoken Language Systems – Supplement Volume, Mouton de Gruyter, Berlin, 2000.
8. T. Butz, J. Thiran, From error probability to information theoretic (multi-modal) signal processing, Signal Process. 85 (2005) 875–902.
9. J.B.A. Maintz, M.A. Viergever, A survey of medical image registration, Med. Image Anal. 2 (1) (1998) 1–36.
10. B. Zitova, J. Flusser, Image registration methods: a survey, Image Vision Comput. 21 (2003) 977–1000.
11. R.A. Bolt, "Put-that-there": voice and gesture at the graphics interface, in: SIGGRAPH '80: Proceedings of the 7th Annual Conference on Computer Graphics and Interactive Techniques, vol. 14, ACM Press, 1980, pp. 262–270.