
Chapter 13

We are less free than how we think: Regular patterns in nonverbal communication✩

Alessandro Vinciarelli∗, Anna Esposito†, Mohammad Tayarani∗, Giorgio Roffo∗, Filomena Scibelli†, Francesco Perrone∗, Dong-Bach Vo∗

∗University of Glasgow, School of Computing Science, Glasgow, UK
†Università degli Studi della Campania L. Vanvitelli, Dipartimento di Psicologia, Caserta, Italy

✩ This work was supported by the Engineering and Physical Sciences Research Council (EPSRC) through the projects "School Attachment Monitor" (EP/M025055/1) and "Socially Competent Robots" (EP/N035305/1).

CONTENTS

13.1 Introduction
13.2 On spotting cues: how many and when
13.2.1 The cues
13.2.2 Methodology
13.2.3 Results
13.3 On following turns: who talks with whom
13.3.1 Conflict
13.3.2 Methodology
13.3.3 Results
13.4 On speech dancing: who imitates whom
13.4.1 Methodology
13.4.2 Results
13.5 Conclusions
References

13.1 INTRODUCTION

Everyday life behavior in naturalistic settings is typically defined as "spontaneous", an adjective that, according to the most common English dictionaries, accounts for the lack of planning, rules or any other inhibitions or constraints:
"done in a natural, often sudden way, without any planning or without being forced" (Cambridge Dictionary), "Spontaneous acts are not planned or arranged, but are done because someone suddenly wants to do them" (Collins English Dictionary), "proceeding from natural feeling or native tendency without external constraint" (Merriam-Webster), "Performed or occurring as a result of a sudden impulse or inclination and without premeditation or external stimulus" (Oxford Dictionary), etc. On the other hand, social psychology has shown that everyday behavior tends to follow principles and laws that result in stable behavioral patterns that can be not only observed, but also analyzed and measured in statistical terms. Thus, at least in the case of everyday behavior, the actual meaning of the adjective "spontaneous" seems to be the tendency of people to effortlessly adopt and display, typically outside conscious awareness, recognizable behavioral patterns.

There are at least two reasons behind the presence of stable behavioral patterns. The first is that behavior—in particular its nonverbal component (facial expressions, gestures, vocalizations, etc.)—contributes to communication by adding layers of meaning to the words being exchanged. In other words, people who accompany the same sentence (e.g., "I am happy") with different behaviors (e.g., frowning and smiling) are likely to mean something different. This requires the adoption of stable behavioral patterns because communication can take place effectively only when there is coherence between the meaning that someone tries to convey and the signals adopted to convey it [37]. The second important reason behind the existence of stable behavioral patterns is that no smooth interaction is possible if the behavior of its participants is not, at least to a certain extent, predictable [30]. For example, during a conversation people who talk expect others to listen and, when this does not happen, people tend to adopt repairing mechanisms that bring the interactants back to a situation where one person speaks and the others listen (unless there is an ongoing conflict and people compete for the floor, or one or more interactants have no intention to be involved in the conversation).

The considerations above suggest that human behavior, while being a complex and rich phenomenon, is not random, and it is possible to detect order in it. Such an observation plays a crucial role in any discipline that tries to make sense of human behavior, including social psychology, anthropology, sociology and, more recently, technological domains aimed at seamless interaction between people and machines such as Human–Robot Interaction [13], Affective Computing [21], Computational Paralinguistics [28] or Social Signal Processing [32]. In particular, the possibility to use human behavior as physical, machine-detectable evidence of social and psychological phenomena—if behavior can be perceived through the senses and interpreted unconsciously, it can be detected via sensors—can help to make machines socially intelligent, i.e., capable of understanding the social landscape in the same way as people do [33].

The rest of this chapter aims at showing, through several experimental examples, what the expression "stable behavioral pattern" means exactly. In particular, the chapter aims at providing the reader with a few basic techniques that can help the detection of behavioral patterns in data, with the focus being on conversations as these are the primary site of human sociality. Overall, the chapter shows that human behavior is far less free than it might look, but this, far from being a constraint or an inhibition, is the very basis for social interaction. Section 13.2 shows how the occurrence of certain nonverbal behavioral cues accounts for important social dimensions such as gender and role; Section 13.3 shows that the sequence of speakers in a conflictual conversation provides information on who agrees with whom; Section 13.4 shows how people tend to imitate the behavioral patterns of others; and the final Section 13.5 draws some conclusions.

13.2 ON SPOTTING CUES: HOW MANY AND WHEN

One of the key principles of all disciplines based on behavior observation (ethology, social psychology, etc.) can be formulated as follows: "[...] the circumstances in which an activity is performed and those in which it never occurs [provide] clues as to what the behavior pattern might be for (its function)" [19]. On such a basis, this section shows that one of the simplest possible observations about behavior—how many times a given cue takes place and when—can provide information about stable behavioral patterns and, most importantly, about the social and psychological phenomena underlying the patterns. The observations are made over the SSPNet Mobile Corpus [22], a collection of 60 phone calls between 120 unacquainted individuals (708 min and 24 s in total). The conversations revolve around the Winter Survival Task [15], a scenario that requires two people to identify, in a list of 12 objects, the items most likely to increase the chances of survival for the passengers of a plane that has crashed in a polar area. The main reason to use such a scenario is that it allows unacquainted individuals to start and sustain a conversation without the need to find a topic. Furthermore, since the people involved in the experiments are rarely experts in survival techniques—only 1 out of 120 subjects in the case of the SSPNet Mobile Corpus—the interaction tends to be driven by actual social and psychological factors rather than by differences in competence and knowledge about the topic of conversation.

■ FIGURE 13.1 The upper chart shows the number of occurrences for each cue, while the lower chart shows the fraction of the total corpus time every cue accounts for.

In the case of the SSPNet Mobile Corpus, the conversations take place via phone and, hence, the participants can use only paralinguistics, i.e., vocal nonverbal behavioral cues. Given that many cues (e.g., laughter or pauses) require one to stop speaking, the amount of time spent in nonverbal communication gives a measure of how important such a phenomenon is in communication (see below).

13.2.1 The cues

The Corpus has been annotated in terms of five major nonverbal cues, namely laughter, fillers, back-channel, silence and overlapping speech (or interruptions). The motivation behind such a choice is that these cues are among the most common (see below for the statistics) and most widely investigated in the literature [17,25]. The upper chart of Fig. 13.1 provides the number of times each of the cues above has been observed in the corpus, while the lower chart shows how much time each of the cues accounts for in total. Overall, the total number of occurrences is 16,235, corresponding to 23.5% of the total time in the corpus. Despite being under time pressure—the Winter Survival Task must be completed as quickly as possible—the participants invest roughly one quarter of their time in nonverbal communication rather than in speaking. This is a clear indication of how important the cues mentioned above are.

Laughter is one of the first nonverbal behavioral cues to have been addressed in scientific terms. The first studies date back to the seminal work by Darwin about the expression of emotions [9] and the behavior of children [10]. In more recent times, laughter has been investigated in social psychology and cognitive sciences, and one of its most common definitions is as follows: "a common, species-typical human vocal act and auditory signal that is important in social discourse" [24]. According to the literature (see [23] for a survey), the most common patterns associated with laughter are that women laugh more than men, that people tend to laugh more when they listen than when they talk and that, in general, people laugh at the end of sentences. Furthermore, recent work has shown that laughter tends to take place in correspondence of topic changes [5]. Fig. 13.1 shows that the SSPNet Mobile Corpus includes 1805 laughter occurrences for a total duration of 1114.8 s (2.6% of the total Corpus length), corresponding to one occurrence every 23.5 s, on average.

When people involved in a conversation want to hold the floor, but do not know what to say next, they tend to use expressions like "ehm" or "uhm," called fillers, to signal the intention to keep speaking. The reason behind the name is that they "are characteristically associated with planning problems [...] planned for, formulated, and produced as parts of utterances just as any word is" [8]. Fig. 13.1 shows that the 120 subjects of the SSPNet Mobile Corpus utter 3912 fillers that correspond to a total length of 1815.9 s (4.2% of the Corpus time). The average interval between one filler and the next is 10.9 s.

The symmetric cue is back-channel, i.e., a short utterance like "ah-ah" or "yeah" that is "produced by one participant in a conversation while the other is talking" [36]. In general, the goal of back-channel is to signal attention and agreement to others while implicitly saying that they can continue to speak. The total number of back-channel episodes in the Corpus is 1015 (one episode every 41.9 s, on average), for a total of 407.1 s (0.9% of the Corpus time).

The time intervals during which both participants of a conversation were not speaking have been labeled as silence. Such a definition does not take into account the multiple functions of such a cue—6091 occurrences for a total of 4670.6 s (10.9% of the corpus length)—but it still allows one to test whether there is any relationship between the occurrences of silence and social or psychological phenomena of interest. The same applies to overlapping speech, i.e., the time intervals during which both people involved in the same conversation speak at the same time. Such a phenomenon can take place for different reasons and with different functions, but it tends to be an exception rather than the norm [27]. For this reason, it is important to count the occurrences of such a cue, which number 3412 for a total of 2000.5 s (4.7% of the corpus time).

13.2.2 Methodology

Each of the cues annotated in a corpus can be thought of as a triple $(c_i, t_i, d_i)$, where $i = 1, \ldots, N$ ($N$ is the total number of cues that have been counted), $c_i \in C = \{C_1, \ldots, C_L\}$ is one of the $L$ cues that have been counted (in the case of the SSPNet Mobile Corpus, the cues are laughter, filler, back-channel, silence and overlapping speech), $t_i$ is the time at which cue $i$ begins, and $d_i$ is its duration. Such a notation allows one to define the number $N_c$ of occurrences of a given cue $c$ as follows:

$$N_c = \big|\{(c_i, t_i, d_i) : c_i = c\}\big|, \qquad (13.1)$$

where $|\cdot|$ denotes the cardinality of a set.
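To make the notation concrete, here is a minimal Python sketch that stores a hypothetical annotation track as a list of $(c_i, t_i, d_i)$ triples and computes $N_c$; the cue labels and times are invented for illustration.

```python
from collections import Counter

# Hypothetical annotation track: triples (c_i, t_i, d_i) with cue label,
# start time and duration in seconds.
annotations = [
    ("laughter", 12.4, 0.8),
    ("filler", 15.1, 0.3),
    ("silence", 16.0, 1.1),
    ("laughter", 40.2, 0.6),
]

# Eq. (13.1): N_c, the number of occurrences of each cue c.
n_c = Counter(c for c, t, d in annotations)
print(n_c["laughter"])  # -> 2
```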

The corpus can be segmented into time intervals according to the value of a variable $V$ that corresponds to a factor of interest for the analysis of the data. For example, in the case of gender, the variable can take two values—$v_1$ and $v_2$—that correspond to female and male, respectively. In this way, it is possible to count the number of times a cue $c$ takes place when $v = v_k$ as follows:

$$N_c^{(k)} = \big|\{(c_i, t_i, d_i) : c_i = c,\ [t_i, t_i + d_i] \in V_k\}\big|, \qquad (13.2)$$

where $V_k$ is the set of all corpus segments in which $v = v_k$. The notations above allow one to define the following $\chi^2$ variable:

$$\chi^2 = \sum_{k=1}^{K} \frac{\big(N_c^{(k)} - N_c E_k\big)^2}{N_c E_k}, \qquad (13.3)$$

where $E_k$ is the fraction of the total corpus length in which the value of $V$ is $v_k$, so that $N_c E_k$ is the number of occurrences expected in the absence of any effect. The particularity of the $\chi^2$ variable is that its probability density function is known in the case there is no relationship between the cue $c$ and the variable $V$. In other words, the probability density function is known when the differences in $N_c^{(k)}$ do not depend on the interplay between the cue and the factor underlying the variable $V$, but on simple statistical fluctuations. This makes it possible to estimate the probability of obtaining a value of $\chi^2$ at least as high as the one estimated with Eq. (13.3) when the occurrence of a cue does not depend on the factor underlying $V$. When such a probability is lower than a value $\alpha$, then it is possible to say that there is an effect with confidence level $\alpha$, i.e., that there is a relationship between the cue and the factor underlying $V$ and the probability that such an observation is a false positive—meaning that it is the result of chance—is lower than $\alpha$. In general, the literature accepts an effect when the probability, called the p-value, is lower than 0.05 or 0.01.

When the approach above is used several times with confidence level $\alpha$, the probability that none of the observed effects is the result of chance is $p = (1 - \alpha)^M$, where $M$ is the number of statistical inferences that someone makes out of the data (or the total number of p-values that are estimated). For this reason, it is common to apply a correction, i.e., to modify the value of $\alpha$ to ensure that the risk of false positives is eliminated or, at least, limited to the point that possible false positives do not change the conclusions that can be made out of the observed effects. The correction most commonly applied is the Bonferroni one. Following such an approach, an effect is considered to be significant at level $\alpha$ only when the p-value is lower than $\alpha/M$. Such a correction aims at eliminating all false positives, but this happens at the cost of eliminating many true positives as well. For this reason, most recent approaches apply the False Discovery Rate (FDR) correction [3]. This correction aims at ensuring that false positives, if any, are sufficiently few not to change the conclusions that can be made out of the observed effects. In the case of the FDR, the p-values are ordered from the smallest to the largest and the one that ranks $k$th is accepted if it is lower than $k\alpha/M$.
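The whole recipe, computing Eq. (13.3) for one cue and then correcting a family of p-values, can be sketched in a few lines of Python; the counts, time fractions and p-values below are hypothetical, and scipy's chisquare is used for the test itself.

```python
from scipy.stats import chisquare

# Hypothetical counts of one cue across the K = 2 values of V (e.g. gender),
# and E_k, the fraction of total corpus time spent in each value.
observed = [1100, 705]            # N_c^(k) for k = 1, 2
fractions = [0.52, 0.48]          # E_k
n_c = sum(observed)
expected = [n_c * e for e in fractions]

chi2, p = chisquare(observed, f_exp=expected)   # Eq. (13.3) and its p-value

# Corrections over M tests (hypothetical p-values), at level alpha = 0.05.
pvals = [0.001, 0.012, 0.030, 0.200, 0.048]
M, alpha = len(pvals), 0.05

bonferroni = [q < alpha / M for q in pvals]

# Benjamini-Hochberg FDR: sort the p-values in ascending order, find the
# largest rank k with p_(k) <= k * alpha / M, and accept all ranks up to k.
order = sorted(range(M), key=lambda i: pvals[i])
kmax = max((r for r, i in enumerate(order, 1) if pvals[i] <= r * alpha / M),
           default=0)
fdr = [False] * M
for r, i in enumerate(order, 1):
    fdr[i] = r <= kmax
```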

13.2.3 Results

The methodology described earlier has been applied to the SSPNet Mobile Corpus using gender and role as factors. In the case of gender, the variable $V$ can take two values, female and male, and the results of the analysis are illustrated in Fig. 13.2. The chart shows that female subjects tend to display laughter, back-channel and overlapping speech more frequently than male ones (meaning that they more frequently start speaking while someone else is speaking). In other words, the distribution of such cues is not random, but changes according to the gender of the speaker, thus showing that there are different behavioral patterns for female and male subjects (at least in the corpus under analysis).

■ FIGURE 13.2 The upper chart shows the difference between male and female subjects, while the lower one shows the difference between callers and receivers. The single and double stars show effects significant at 5% and 1%, respectively (after Bonferroni correction).


The literature provides confirmation and explanations for these patterns. In particular, the tendency of women to laugh more has been observed in a wide spectrum of contexts [23], as has the tendency of men to adopt higher-status behaviors like limited laughter [18] and back-channel [14]. For what concerns the tendency of female subjects to start overlapping speech more frequently than male ones, it can be explained as a stereotype-threat behavior, i.e., an attempt to contradict a possible stereotype about women being less assertive during a negotiation (the scenario adopted in the experiments actually requires the two subjects to negotiate a common solution to the Winter Survival Task) [29].

In the case of role, the variable $V$ takes two possible values (caller and receiver) and Fig. 13.2 shows the patterns that have been observed. The callers tend to display fillers more frequently than expected and, vice versa, they tend to initiate overlapping speech less frequently than would be expected in the absence of effects. Such a pattern is compatible with a status difference, i.e., with the tendency of the receivers to behave as if they were above the callers in terms of social verticality and the tendency of the latter to behave as if they were hierarchically inferior [20,26]. As a confirmation of these behavioral patterns, the receivers tend to win the negotiations involved in the Winter Survival Task significantly more frequently than the callers [34].

Overall, the observations above show that people actually display detectable behavioral patterns that can be identified with a methodology as simple as counting the cues and showing, through a statistical significance test, whether they tend to occur more or less frequently according to the value of an independent variable $V$ expected to account for an underlying social or psychological phenomenon (gender and role in the case of this section). In other words, the observations of this section show that nonverbal behavior, especially when it comes to cues that are displayed spontaneously and result from the interaction with others, is not random but tends to follow behavioral patterns.

13.3 ON FOLLOWING TURNS: WHO TALKS WITH WHOM

One of the main tenets of conversation analysis is that people tend to speak one at a time and, in general, overlapping speech and silence tend to result from errors in turn-taking—the body of practices underlying the switch from one speaker to the other—more than from the actual intentions of the interacting speakers: "Talk by more than one person at a time in the same conversation is one of the two major departures that occur from what appears to be a basic design feature of conversation, [...] namely 'one at a time' (the other departure is silence, i.e. fewer than one at a time)" [27]. This seems to contradict the observation of Section 13.2 that overlapping speech and silence jointly account for roughly 15% of the total time in a corpus of conversations. However, the data of Section 13.2 shows that the average length of a silence or overlapping speech segment is below one second. This shows that, however frequent, such episodes tend to be short and people actually tend to stay in a situation where one person speaks and the others listen.

Following up on the above, it is possible to think of a conversation as a sequence of triples $(s_i, t_i, \Delta t_i)$, where $i = 1, \ldots, N$ ($N$ is the total number of turns in a conversation), $s_i \in A = \{a_1, \ldots, a_G\}$ is one of the $G$ speakers involved in the conversation, $t_i$ is the time at which turn $i$ starts and $\Delta t_i$ is the duration of the turn. The rest of this section shows how such basic information can be used to detect the structure of a conflict.

13.3.1 Conflict

The literature proposes many definitions of conflict and each of them captures different aspects of the phenomenon (see the contributions in [12] for a wide array of theoretical perspectives and approaches). However, there is a point that all researchers investigating conflict appear to agree upon, namely that the phenomenon takes place whenever multiple parties try to achieve incompatible goals, meaning that one of the parties can achieve its goals only if the others do not (or at least such is the perception of the situation): "conflict is a process in which one party perceives that its interests are being opposed or negatively affected by another party" [35], "[conflict takes place] to the extent that the attainment of the goal by one party precludes its attainment by the other" [16], "Conflict is perceived [...] as the perceived incompatibilities by parties of the views, wishes, and desires that each holds" [2], etc.

Goals cannot be accessed directly, but can only be inferred from the observable behavior, both verbal and nonverbal, of an individual. For this reason, the automatic analysis of conflict consists mainly of detecting the physical traces that conflict leaves in behavior (what people do differently or peculiarly when they are involved in conflict) and using them to predict whether there is conflict or what the characteristics of an ongoing conflict are. In the particular case of turns, the main physical trace of conflict is the tendency of people to react immediately to interlocutors they disagree with [4]. This means that speaker adjacency statistics should provide indications on who agrees or disagrees with whom and, hence, on the presence of possible groups that oppose one another.

13.3.2 Methodology

The beginning of this section shows that a conversation can be thought of, with good approximation, as a sequence of triples $(s_i, t_i, \Delta t_i)$, with each of these corresponding to a turn. This section shows that the sequence of the speakers—meaning who speaks before and after whom—far from being random, provides information about the composition of the groups, if any, that oppose one another during a discussion. The main reason why this is possible is that people involved in conversations—and more in general in social interactions—tend to adopt preference structures, i.e., they tend to behave in a certain way because any alternative way of behaving would be interpreted in the wrong way. In the case of conflict, the preference structure is that people who do not react immediately to someone they disagree with will be considered to agree or to have no arguments at their disposal.

In formal terms, this means that each speaker involved in a conversation can be attributed one of three possible labels, namely $g_1$ (the speaker belongs to group 1), $g_2$ (the speaker belongs to group 2) or $g_0$ (the speaker is neutral). Since every speaker is always assigned the same label, the sequence of the speakers can be mapped into a sequence of symbols $X = (x_1, \ldots, x_N)$, where $x_i \in \{g_0, g_1, g_2\}$ and $x_i = f(s_i)$—$f(s_i)$ is a function that maps a speaker into one of the three possible labels. The simplest probabilistic model of the sequence $X = (x_1, \ldots, x_N)$ is the Markov chain:

$$p(X) = p(x_1) \prod_{k=2}^{N} p(x_k \mid x_{k-1}), \qquad (13.4)$$

where $p(x_1)$ is the probability of starting with label $x_1$ and $p(x_k \mid x_{k-1})$ is the probability of a transition from $x_{k-1}$ to $x_k$.
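A minimal Python sketch of this model is shown below; the uniform start distribution and the add-eps smoothing of unseen transitions are assumptions of the sketch, not choices documented in the chapter.

```python
import numpy as np

LABELS = ["g0", "g1", "g2"]

def estimate_model(x, eps=1e-3):
    """Maximum-likelihood estimates of the start and transition probabilities
    from a label sequence, with add-eps smoothing for unseen transitions
    (smoothing and the uniform start are assumptions of this sketch)."""
    counts = {(a, b): eps for a in LABELS for b in LABELS}
    for prev, cur in zip(x, x[1:]):
        counts[(prev, cur)] += 1.0
    trans = {(a, b): counts[(a, b)] / sum(counts[(a, c)] for c in LABELS)
             for a in LABELS for b in LABELS}
    start = {a: 1.0 / len(LABELS) for a in LABELS}
    return start, trans

def log_p_sequence(x, start, trans):
    """Log of Eq. (13.4): p(X) = p(x_1) * prod_k p(x_k | x_{k-1})."""
    logp = np.log(start[x[0]])
    for prev, cur in zip(x, x[1:]):
        logp += np.log(trans[(prev, cur)])
    return logp
```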

13.3.3 Results

Such a model has been adopted to analyze the conflict in the political debates of the Canal 9 Corpus, a collection of 45 political debates involving a total of 190 persons speaking for 27 h and 56 min [31]. The main reason for using political debates is that these are built around conflict. In other words, following the definition at the beginning of this section, if one of the participants achieves its goals, the others do not. Furthermore, the political debates have a structure that allows one to make simplifying assumptions, namely that the number of participants is always the same (5 in the debates of the Corpus), that the number of members per group is always the same (2 in the debates of the Corpus) and that there is always a neutral moderator (1 person in the debates of the Corpus). In this way, the number of functions $f(s)$ mapping the speakers to the labels is limited to $L = 15$, i.e., there are only 15 ways to assign the labels to the speakers when taking into account the assumptions above. Given that each of the 15 functions gives rise to a different sequence $X_j$ of labels, it is possible to identify the function that produces the most probable sequence:

$$X^* = \operatorname*{arg\,max}_{j = 1, \ldots, L} p(X_j). \qquad (13.5)$$

The function that generates $X^*$ is the one most likely to group the debate participants in the correct way, i.e., to correctly identify the composition of the groups and the moderator. The experiments performed over the Canal 9 Corpus with the approach above show that the grouping is correct in 64.5% of the cases, thus confirming that the sequence of the speakers is not random, but follows detectable behavioral patterns that are stable enough to allow the development of automatic analysis approaches.
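The search of Eq. (13.5) is small enough to be done by brute force. The sketch below enumerates the 15 label assignments under the assumptions above and scores each mapped sequence, reusing estimate_model and log_p_sequence from the previous sketch; the turn sequence is invented for illustration.

```python
def assignments(speakers):
    """The 15 functions f mapping five speakers to {g0, g1, g2}: one moderator
    (g0) plus two unordered groups of two. Swapping g1 and g2 gives the same
    partition, so one non-moderator is anchored to g1: 5 x 3 = 15 functions."""
    for moderator in speakers:
        rest = [s for s in speakers if s != moderator]
        anchor, others = rest[0], rest[1:]
        for partner in others:
            group1 = {anchor, partner}
            yield {moderator: "g0",
                   **{s: ("g1" if s in group1 else "g2") for s in rest}}

speakers = ["s1", "s2", "s3", "s4", "s5"]
labelings = list(assignments(speakers))
print(len(labelings))  # -> 15

# Eq. (13.5): map the turn sequence through every f, score the resulting label
# sequence with the Markov chain of the previous sketch, keep the best one.
turns = ["s1", "s3", "s1", "s2", "s4", "s2", "s5", "s3"]  # hypothetical turns

def score(f):
    x = [f[s] for s in turns]
    return log_p_sequence(x, *estimate_model(x))

best = max(labelings, key=score)
```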

13.4 ON SPEECH DANCING: WHO IMITATES WHOM

Human sciences show that people change their way of communicating according to the way others communicate with them. The phenomenon takes place in different ways and, correspondingly, takes different names in the literature. The unconscious tendency to imitate others is typically referred to as mimicry [7], the adoption of similar timing patterns in behavior is typically called synchrony [11] and, finally, the tendency to enhance similarity or dissimilarity in observable behavior typically goes under the name of adaptation [6]. This section focuses on one of the many facets of adaptation, i.e., the tendency to change the acoustic properties of speech in relation to the way others speak in a given conversation. In particular, this work investigates the interplay between the acoustic adaptation patterns that people display in a conversation and their performance in addressing a collaborative task. The problem is important because it can provide insights about the role of adaptation in collaborative problem solving. Furthermore, it can give indications on how to improve the efficiency of people who address a task together.

13.4.1 Methodology

The words of a conversation can be mapped into sequences $X = (\vec{x}_1, \ldots, \vec{x}_T)$ of observation vectors, i.e., vectors where every component is a physical measurement extracted from the speech signal. The reason why $X$ includes multiple vectors is that these are extracted at regular time steps from short signal segments. The motivation behind such a representation is that it allows the application of statistical modeling approaches.

The goal of the methodology presented in this section is to measure the tendency of $A$ and $B$, the two speakers involved in a conversation, to become more or less similar over time. If $p(X|\theta_A)$ and $p(X|\theta_B)$ are probability distributions trained over spoken material of $A$ and $B$, respectively, then it is possible to estimate the following likelihood ratio for every word $w_k^{(A)}$ in a conversation, where the $A$ index means that the word has been uttered by $A$:

$$d_k(A, B) = \log \frac{p(X_k^{(A)} \mid \theta_B)}{p(X_k^{(A)} \mid \theta_A)}. \qquad (13.6)$$

The expression of $d_k(B, A)$ can be obtained by switching $A$ and $B$ in the equation above and, in general, $d_k(A, B) \neq d_k(B, A)$. If $d_k(A, B) > 0$, it means that $p(X|\theta_B)$ explains $w_k^{(A)}$ better than $p(X|\theta_A)$. Vice versa, if $d_k(A, B) < 0$, it means that it is the model of speaker $A$ that explains the data better than the model of speaker $B$. Similar considerations apply to $d_k(B, A)$.
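A hedged Python sketch of Eq. (13.6) follows; it uses scikit-learn Gaussian mixtures as stand-ins for the speaker models $\theta_A$ and $\theta_B$ (the chapter's actual models are the HMMs introduced below) and random placeholder features instead of real speech vectors.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder feature matrices (n_frames x n_features) for the two speakers;
# in the chapter these would be vectors extracted from the speech signal.
rng = np.random.default_rng(0)
feats_A = rng.standard_normal((5000, 12))
feats_B = rng.standard_normal((5000, 12)) + 0.5

theta_A = GaussianMixture(n_components=10, random_state=0).fit(feats_A)
theta_B = GaussianMixture(n_components=10, random_state=0).fit(feats_B)

def d_k(word_frames):
    """Eq. (13.6) for one word uttered by A:
    log p(X_k | theta_B) - log p(X_k | theta_A).
    GaussianMixture.score() returns the mean log-likelihood per frame, so it
    is multiplied by the number of frames to get the total log-likelihood."""
    n = len(word_frames)
    return n * (theta_B.score(word_frames) - theta_A.score(word_frames))
```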

In the experiments of this chapter, $p(X|\theta)$ is estimated with hidden Markov models (the speaker index is omitted to make the notation lighter), a probability density function defined over joint sequences of states and feature vectors $(X, S)$, where $X = (\vec{x}_1, \ldots, \vec{x}_T)$ and $S = (s_1, \ldots, s_T)$ with every $s_i$ belonging to a finite set $V = \{v_1, \ldots, v_L\}$:

$$p(X, S \mid \theta) = \pi_{s_1} b_{s_1}(\vec{x}_1) \cdot \prod_{k=2}^{T} a_{s_{k-1} s_k} b_{s_k}(\vec{x}_k), \qquad (13.7)$$

where $\pi_{s_1}$ is the probability of $s_1$ being the first state of the sequence, $a_{s_{k-1} s_k}$ is the probability of a transition between $s_{k-1}$ and $s_k$, $b_{s_k}(\vec{x})$ is the probability of observing a vector $\vec{x}$ when the state is $s_k$, and $T$ is the total number of vectors in sequence $X$. In the experiments of this work, the value of $p(X|\theta)$ is approximated as follows:

$$p(X \mid \theta) \approx \max_{S \in \mathcal{S}_T} p(X, S \mid \theta), \qquad (13.8)$$

where $\mathcal{S}_T$ is the set of all possible state sequences of length $T$.
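The max in Eq. (13.8) is the classic Viterbi score. The following sketch computes it in the log domain for an HMM given as matrices; the matrix-based interface is an assumption of the sketch.

```python
import numpy as np

def viterbi_log_max(log_pi, log_a, log_b):
    """Log of Eq. (13.8): max over all state sequences S of p(X, S | theta),
    computed with the Viterbi recursion in the log domain.

    log_pi : (L,) initial state log-probabilities
    log_a  : (L, L) transition log-probabilities, log_a[i, j] = log a_{ij}
    log_b  : (T, L) emission log-probabilities, log_b[t, j] = log b_j(x_t)
    """
    delta = log_pi + log_b[0]            # best log-score ending in each state
    for t in range(1, len(log_b)):
        delta = np.max(delta[:, None] + log_a, axis=0) + log_b[t]
    return float(np.max(delta))          # log max_S p(X, S | theta)
```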

Given that the features adopted in the experiments are continuous, the emission probability functions $b_{v_k}(\vec{x})$ are mixtures of Gaussians, i.e., weighted sums of multivariate normal distributions:

$$b_{v_k}(\vec{x}) = \sum_{l=1}^{G} \alpha_l^{(k)} \, \mathcal{N}(\vec{x} \mid \vec{\mu}_l^{(k)}, \Sigma_l^{(k)}), \qquad (13.9)$$

where $\sum_{l=1}^{G} \alpha_l^{(k)} = 1$, $\vec{\mu}_l^{(k)}$ is the mean of Gaussian $l$ in the mixture of state $v_k$ and $\Sigma_l^{(k)}$ is the covariance matrix of the same Gaussian.

One of the main physical evidences of adaptation in human–human communication is that the behavior of people changes over time: "A necessary although not sufficient condition for establishing adaptation is showing that at least one person's behaviors change over time" [6]. In the experiments of this work, the behaviors of interest are the acoustic properties of speech, represented in terms of feature vectors extracted from speech samples of $A$ and $B$. If $t_k^{(A)}$ is the time at which word $w_k^{(A)}$ starts, then it is possible to estimate the Spearman correlation coefficient $\rho(A, B)$ between $d_k(A, B)$ and $t_k^{(A)}$:

$$\rho(A, B) = 1 - \frac{6 \sum_{j=1}^{N_A} r_j^2}{N_A \cdot (N_A^2 - 1)}, \qquad (13.10)$$

where $r_j$ is the difference between, on the one hand, the rank of $d_j(A, B)$ across all observed values of $d_k(A, B)$ and, on the other hand, the rank of $t_j^{(A)}$ across all observed values of $t_k^{(A)}$. The value of $\rho(B, A)$ can be obtained by simply switching $A$ and $B$ in the equations above. The motivation behind the choice of the Spearman coefficient is that it is more robust to the presence of outliers than other measurements of correlation like, e.g., the Pearson correlation coefficient. In fact, the Spearman coefficient does not take into account the values of the variables, but their ranking among the observed values. In this way, it is not possible for one outlier to change the value of the coefficient to a significant extent.

When $\rho(A, B)$ is positive and statistically significant, it means that $A$ tends to become more similar to $B$ in terms of the distribution of the features that are extracted from speech. This seems to correspond to the "exhibition of similarity between two interactants, regardless of etiology, intentionality or partner influence" [6], a phenomenon called matching. Conversely, when $\rho(A, B)$ is negative and statistically significant, it means that $A$ tends to become increasingly more different—in terms of feature distribution—from $B$ over time. This appears to correspond to the "exhibition of dissimilar behavior [...] regardless of etiology, intent or partner behavior" [6], a phenomenon called complementarity. If $\rho(A, B)$ is not statistically significant, it means that no matching or complementarity takes place or, if they do, they are too weak to be observed.
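In code, the test for one speaker reduces to a call to scipy's spearmanr, which implements the rank correlation of Eq. (13.10) together with its significance test; a sketch, with the symbols used in the text as return values:

```python
from scipy.stats import spearmanr

def adaptation_direction(onsets, ratios, alpha=0.05):
    """Correlate the word onset times t_k with d_k(A, B) and return the
    symbol used in the text: '+' for matching (positive, significant),
    '-' for complementarity (negative, significant), '=' otherwise."""
    rho, p = spearmanr(onsets, ratios)
    if p >= alpha:
        return "="
    return "+" if rho > 0 else "-"
```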


Table 13.1 The adaptation patterns that a pair of speakers A and B can display. Symbols +, − and = stand for matching, complementarity and lack of observable effects, respectively. Acronyms CO, DI, CM and IN stand for convergence, divergence, compensation and independence, respectively.

A+B+ (CO)   A+B= (CO)   A+B− (CM)
A=B+ (CO)   A=B= (IN)   A=B− (DI)
A−B+ (CM)   A−B= (DI)   A−B− (DI)

All the considerations made above about $\rho(A, B)$ apply to $\rho(B, A)$ as well. Following the definitions above, the combination of $\rho(A, B)$ and $\rho(B, A)$ provides a representation of the adaptation pattern, if any, that a dyad of interacting speakers displays. Table 13.1 shows all possible combinations, where the symbols "+", "−" and "=" account for matching (the correlation is positive and statistically significant), complementarity (the correlation is negative and statistically significant), or lack of observable effects (the correlation is not statistically significant), respectively. In every cell, the symbol on the left is $\rho(A, B)$ while the other one is $\rho(B, A)$. The combinations result in the following adaptation patterns, mapped in code after this list:

■ Convergence (CO): the correlation is positive and statistically significant for at least one of the two speakers and no negative and statistically significant correlations are observed (cells "++", "+=" and "=+").
■ Divergence (DI): the correlation is negative and statistically significant for at least one of the two speakers and no positive and statistically significant correlations are observed (cells "−−", "−=" and "=−").
■ Compensation (CM): the correlation is statistically significant for both speakers, but with opposite signs (cells "+−", "−+").
■ Independence (IN): both correlations are not statistically significant and adaptation patterns, if any, are too weak to be detected (cell "==").
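The mapping from the two per-speaker symbols to the dyad-level pattern can be written directly from Table 13.1; a small sketch:

```python
def dyad_pattern(sym_ab, sym_ba):
    """Combine the symbols for rho(A, B) and rho(B, A) ('+', '-' or '=')
    into the dyad-level pattern of Table 13.1."""
    symbols = {sym_ab, sym_ba}
    if symbols == {"+", "-"}:
        return "CM"   # compensation: both significant, opposite signs
    if "+" in symbols:
        return "CO"   # convergence: at least one '+', no '-'
    if "-" in symbols:
        return "DI"   # divergence: at least one '-', no '+'
    return "IN"       # independence: no significant correlation

assert dyad_pattern("+", "=") == "CO" and dyad_pattern("-", "-") == "DI"
```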

The experiments of this chapter aim not only at detecting the adaptation patterns above, but also at showing whether they have any relationship with the performance of the subjects, i.e., whether certain patterns tend to appear more or less frequently when the subjects need more time to address a task.

13.4.2 Results

The experiments of this work have been performed over a collection of 12 dyadic conversations between fully unacquainted individuals (24 in total). All subjects are female and were born and raised in Glasgow (United Kingdom). The conversations revolve around the Diapix UK Task [1], a scenario commonly adopted in human–human communication studies. The task is based on 12 pairs of almost identical pictures—referred to as the sub-tasks hereafter—that differ only in a few minor details (e.g., the t-shirt of a given person has two different colors in the two versions of the same picture). The 12 sub-tasks can be split into three groups, each including four pairs of pictures, corresponding to the different scenes being portrayed, namely the Beach (B), the Farm (F) and the Street (S). The number of differences to be spotted is the same for all sub-tasks and the subjects involved in the experiments are asked to complete the task as quickly as possible.

As a result of the scenario, each of the 12 conversations of the corpus can be split into 12 time intervals corresponding to the sub-tasks. Thus, the data includes 12 × 12 = 144 conversation segments during which a dyad addresses a particular sub-task, i.e., spots the differences for a particular pair of pictures. The segments serve as analysis units for the experiments and, on average, they contain 1502.5 words, for a total of 216,368 words in the whole corpus.

During the conversations, the two subjects addressing the task together sit back-to-back and are separated by an opaque curtain. This ensures that the two members of the same dyad can communicate via speech, but not through other channels like, e.g., facial expressions or gestures. Furthermore, the setup ensures that each member of a dyad can see one of the pictures belonging to a Diapix sub-task, but not the other. Thus, the subjects can complete the sub-tasks only by actively engaging in conversation with their partners. As a result, all subjects have uttered at least 40.1% of the words in the conversations they were involved in. Moreover, the deviation with respect to a uniform distribution (meaning that the speakers utter the same number of words) is statistically significant in only one case.

Since the subjects are asked to spot the differences between two pictures as quickly as possible, the amount of time needed to address a sub-task can be used as a measure of performance: the less time it takes to complete a sub-task, the higher the performance of a dyad. The lower plot of Fig. 13.3 shows the average amount of time required to address each of the sub-tasks (the error bars correspond to the standard deviations). Some of these require, on average, more time to be completed, but the differences are not statistically significant according to a t-test. Thus, the performance of the dyads is, on average, the same over all sub-tasks and none of these appears to be more challenging than the others to a statistically significant extent.

■ FIGURE 13.3 The scatter plots show the correlations that are statistically significant (the size of the bubble is proportional to the time needed to complete the corresponding sub-task); blue bubbles correspond to complementarity patterns, red bubbles to divergence patterns, and yellow bubbles to convergence patterns. The charts show the average duration of the tasks where a certain pattern is observed. The error bar is the value of σ/√n, where σ is the standard deviation and n is the number of sub-tasks in which the pattern is observed.

The experiments have been performed with HMMs that have only one state and adopt Mixtures of Gaussians (MoGs) as emission probability functions:

$$p(X \mid \theta) = \pi_{s_1} b_{s_1}(\vec{x}_1) \cdot \prod_{k=2}^{T} a_{s_{k-1} s_k} b_{s_k}(\vec{x}_k) = \prod_{k=1}^{T} b(\vec{x}_k), \qquad (13.11)$$

where $b(\vec{x}_k)$ is the MoG (used as emission probability function). The equation above holds because $\pi_{s_1} = 1$ (there is only one state and its probability of being the first state is consequently 1) and $a_{s_{k-1} s_k} = a = 1$ (the only possible transition is between the only available state and itself). MoGs are word independent because the same model is used for all words uttered by a given speaker. Furthermore, MoGs are time independent because the value of $p(X|\theta)$ depends on the vectors included in $X$, but not on their order. The main goal behind the use of MoGs is to verify whether adaptation takes place irrespective of the words being uttered and of possible temporal patterns in acoustic properties. This is important because it can show whether the patterns detected by the approach are just a side effect of lexical alignment—the tendency of participants in the same conversation to use the same words—or the result of actual accommodation taking place at the acoustic level.
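A possible instantiation of this single-state setup in Python is sketched below; librosa (with its default framing parameters) is an assumption for the 12-dimensional cepstral features, and scikit-learn's GaussianMixture fits the MoG with Expectation-Maximization.

```python
import librosa
from sklearn.mixture import GaussianMixture

def train_speaker_mog(wav_path, n_mfcc=12, n_components=10):
    """Train the single-state model of Eq. (13.11), i.e. a plain MoG, on
    12-dimensional cepstral vectors extracted from one speaker's audio.
    librosa and its default framing parameters are assumptions of this
    sketch; the chapter does not specify the feature-extraction toolchain."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    return GaussianMixture(n_components=n_components).fit(mfcc.T)
```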

In the experiments of this work, the number $G$ of Gaussians in the mixtures is 10. This parameter has been set a priori and no other values have been tested. The choice of $G$ results from two conflicting requirements. On the one hand, $G$ needs to be large to approximate as closely as possible the actual distribution of the data. On the other hand, the larger $G$ is, the larger the number of parameters and, hence, the larger the amount of training material needed. A different MoG has been trained for each speaker of the corpus. The training has been performed using the Expectation–Maximization algorithm and all the words uttered by a given speaker have been used as training material. The motivation behind this choice is that the goal of the experiments is not to detect adaptation in unseen data, but to analyze the use of adaptation in a corpus of interactions. In other words, the assumption behind the approach proposed in this work is that the whole material to be analyzed is available at the moment of training.

Fig. 13.3 shows the results for different ways of mapping speech into vectors, called MFCC, PLP and LPC, respectively (the dimensionality of the feature vectors is 12 in all three cases). The bubble plots show the accommodation patterns detected in individual Diapix UK sub-tasks. The size of the bubbles is proportional to the amount of time needed to complete the sub-task. Whenever a bubble is missing, it means that no statistically significant correlations have been observed (the detected pattern is independence). Statistically significant values of $\rho(A, B)$ and $\rho(B, A)$ can be observed over the whole range of conversation lengths observed. This shows that the detected patterns are not the mere effect of conversation length, but the result of the way the subjects speak. Overall, the number of $\rho(A, B)$ and $\rho(B, A)$ values that are statistically significant with $p < 0.05$ (according to a t-test) is 116 in the case of MFCC, 114 in the case of PLP and 127 in the case of LPC. According to a binomial test, the probability of getting such a result by chance is lower than $10^{-6}$ in all cases. This seems to further confirm that the observations result from patterns actually detected in the data.

The use of different feature extraction processes leads to the detection of adaptation patterns in different sub-tasks. The agreement between two feature extraction processes can be measured with the percentage of cases in which the detected pattern is the same for a given sub-task. In the case of MFCC and PLP, the percentage is 83.3%, while it is 51.4% and 52.8% for the comparisons MFCC vs LPC and PLP vs LPC, respectively. In other words, while MFCC and PLP tend to agree with each other, LPC tends to disagree with both the other feature extraction approaches. One possible explanation is that MFCC and PLP account for perceptual information while LPC does not. Hence, LPC allows one to detect adaptation patterns in correspondence of acoustic properties that the other feature sets do not capture and vice versa.

Following the indications of communication studies [6], the investigation of the interplay between detected adaptation patterns and other observable aspects of the interaction can provide indirect confirmation that the approach captures actual communicative phenomena. For this reason, the charts in the lower part of Fig. 13.3 show the average amount of time required to complete the sub-tasks in correspondence of different detected adaptation patterns. In the case of MFCC, there is a statistically significant difference between the length of the sub-tasks in which the detected pattern is convergence and those in which it is divergence or independence ($p < 0.001$ in both cases according to a t-test). The same can be observed for PLP ($p < 0.001$ according to a t-test), but, in this case, there are other statistically significant differences as well: in particular, between divergence and complementarity ($p < 0.05$ according to a t-test) and between independence and complementarity ($p < 0.05$ according to a t-test).

Overall, the findings above show that the outcomes of the approach do not distribute randomly but tend to be associated with the difficulty that the subjects experience when they address a given sub-task, at least in the case of MFCC and PLP. In fact, the subjects are asked to complete the tasks as quickly as they can and, hence, they take more time when they encounter more difficulties. The relationship between the occurrence of certain adaptation patterns and the amount of time needed to complete a sub-task provides indirect confirmation that the approach captures actual adaptation patterns. The lack of effects for LPC further confirms such a hypothesis. In fact, MFCC and PLP account for speech properties that the subjects can perceive and, hence, react to by displaying adaptation. This is not the case for LPC, which tends to disagree with the other feature extraction processes (see above) and probably accounts for information that the subjects do not manage to use when they display adaptation patterns.

13.5 CONCLUSIONS

This chapter has shown that human behavior is not random, even when it is spontaneous and not scripted, but follows principles and laws that result in regular and detectable behavioral patterns. The exact meaning of such an expression changes according to the case, but it typically corresponds to behavioral cues that change their observable and statistical properties according to underlying social and psychological phenomena. The example of Section 13.2 shows that the very number of times people display a given cue—e.g., laughter or fillers—depends on two major social characteristics, namely gender and role. In the case of Section 13.3, the focus is on turns and their sequence. The example shows that the very sequence of speakers—who talks with whom—provides information about conflict in political debates. Finally, the experiments of Section 13.4 show that people collaborating on a task make their speaking style more or less similar to the speaking style of their interlocutors and, furthermore, such a phenomenon interacts with the amount of time required to complete a task. In all cases, the existence of the patterns is the basis for the development of automatic analysis approaches.

REFERENCES

[1] R. Baker, V. Hazan, DiapixUK: task materials for the elicitation of multiple spontaneous speech dialogs, Behav. Res. Methods 43 (3) (2011) 761–770.
[2] C. Bell, F. Song, Emotions in the conflict process: an application of the cognitive appraisal model of emotions to conflict management, Int. J. Confl. Manage. 16 (1) (2005) 30–54.
[3] Y. Benjamini, Y. Hochberg, Controlling the False Discovery Rate: a practical and powerful approach to multiple testing, J. R. Stat. Soc. B (1995) 289–300.
[4] J. Bilmes, The concept of preference in conversation analysis, Lang. Soc. 17 (2) (1988) 161–181.
[5] F. Bonin, N. Campbell, C. Vogel, Time for laughter, Knowl.-Based Syst. 71 (2014) 15–24.
[6] J.K. Burgoon, L.A. Stern, L. Dillman, Interpersonal Adaptation, Cambridge University Press, 1995.
[7] T.L. Chartrand, A.N. Dalton, Mimicry: its ubiquity, importance and functionality, in: Oxford Handbook of Human Action, vol. 2, 2009, pp. 458–483.
[8] H.H. Clark, J.E. Fox Tree, Using "uh" and "um" in spontaneous speaking, Cognition 84 (1) (2002) 73–111.
[9] C. Darwin, The Expression of Emotion in Man and Animals, John Murray, 1872.
[10] C. Darwin, A biographical sketch of an infant, Mind 7 (1) (1877) 285–294.
[11] E. Delaherche, M. Chetouani, M. Mahdhaoui, C. Saint-Georges, S. Viaux, D. Cohen, Interpersonal synchrony: a survey of evaluation methods across disciplines, IEEE Trans. Affect. Comput. 3 (3) (2012) 349–365.
[12] F. D'Errico, I. Poggi, A. Vinciarelli, L. Vincze (Eds.), Conflict and Multimodal Communication, Springer Verlag, 2015.
[13] T. Fong, I. Nourbakhsh, K. Dautenhahn, A survey of socially interactive robots, Robot. Auton. Syst. 42 (3) (2003) 143–166.
[14] J.A. Hall, E.J. Coats, L. Smith LeBeau, Nonverbal behavior and the vertical dimension of social relations: a meta-analysis, Psychol. Bull. 131 (6) (2005) 898–924.
[15] M. Joshi, E.B. Davis, R. Kathuria, C.K. Weidner, Experiential learning process: exploring teaching and learning of strategic management framework through the winter survival exercise, J. Manag. Educ. 29 (5) (2005) 672–695.
[16] C.M. Judd, Cognitive effects of attitude conflict resolution, J. Confl. Resolut. 22 (3) (1978) 483–498.
[17] M.L. Knapp, J.A. Hall, Nonverbal Communication in Human Interaction, Harcourt Brace College Publishers, 1972.
[18] A. Leffler, D.L. Gillespie, J.C. Conaty, The effects of status differentiation on nonverbal behavior, Soc. Psychol. Q. 45 (3) (1982) 153–161.
[19] P. Martin, P. Bateson, Measuring Behaviour, Cambridge University Press, 2007.
[20] J.A. Oldmeadow, M.J. Platow, M. Foddy, D. Anderson, Self-categorization, status, and social influence, Soc. Psychol. Q. 66 (2) (2003) 138–152.
[21] R.W. Picard, Affective Computing, MIT Press, 2000.
[22] A. Polychroniou, H. Salamin, A. Vinciarelli, The SSPNet Mobile Corpus: social signal processing over mobile phones, in: Proceedings of the Language Resources and Evaluation Conference, 2014, pp. 1492–1498.
[23] R.R. Provine, Laughter punctuates speech: linguistic, social and gender context of laughter, Ethology 95 (4) (1993) 291–298.
[24] R.R. Provine, Y.L. Yong, Laughter: a stereotyped human vocalization, Ethology 89 (2) (1991) 115–124.
[25] V.P. Richmond, J.C. McCroskey, Nonverbal Behaviors in Interpersonal Relations, Allyn and Bacon, 1995.
[26] V.P. Richmond, J.C. McCroskey, S.K. Payne, Nonverbal Behavior in Interpersonal Relations, Prentice Hall, 1991.
[27] E.A. Schegloff, Overlapping talk and the organization of turn-taking for conversation, Lang. Soc. 29 (1) (2000) 1–63.
[28] B. Schuller, A. Batliner, Computational Paralinguistics: Emotion, Affect and Personality in Speech and Language Processing, John Wiley & Sons, 2013.
[29] L.L. Thompson, J. Wang, B.C. Gunia, Negotiation, Annu. Rev. Psychol. 61 (2010) 491–515.
[30] H.L. Tischler, Introduction to Sociology, Harcourt Brace College Publishers, 1990.
[31] A. Vinciarelli, A. Dielmann, S. Favre, H. Salamin, Canal9: a database of political debates for analysis of social interactions, in: Proceedings of the International Conference on Affective Computing and Intelligent Interaction and Workshops, 2009, pp. 1–4.
[32] A. Vinciarelli, M. Pantic, H. Bourlard, Social signal processing: survey of an emerging domain, Image Vis. Comput. 27 (12) (2009) 1743–1759.
[33] A. Vinciarelli, M. Pantic, D. Heylen, C. Pelachaud, I. Poggi, F. D'Errico, M. Schroeder, Bridging the gap between social animal and unsocial machine: a survey of social signal processing, IEEE Trans. Affect. Comput. 3 (1) (2012) 69–87.
[34] A. Vinciarelli, H. Salamin, A. Polychroniou, Negotiating over mobile phones: calling or being called can make the difference, Cogn. Comput. 6 (4) (2014) 677–688.
[35] J.A. Wall, R. Roberts Callister, Conflict and its management, J. Manag. 21 (3) (1995) 515–558.
[36] N. Ward, W. Tsukahara, Prosodic features which cue back-channel responses in English and Japanese, J. Pragmat. 32 (8) (2000) 1177–1207.
[37] T. Wharton, Pragmatics and Nonverbal Communication, Cambridge University Press, 2009.