Reconsidering the Voice Principle with Non-native Language Speakers
Robert O. Davis a,∗, Joseph Vincent b, Taejung Park c
a Hankuk University of Foreign Studies, English Linguistics and Language Technology, South Korea
b Hankuk University of Foreign Studies, English for International Conferences and Communication, South Korea
c Kyonggi University, College of Liberal Arts and Interdisciplinary Studies, South Korea
ARTICLE INFO
Keywords: Pedagogical agent; Computer voice; Human voice; Agent persona; Prosody

ABSTRACT
Researchers have suggested that pedagogical agents speaking with a human voice increase social perception and enable deeper learning compared to agents speaking with computer-generated voice. However, recent research (Craig & Schroeder, 2017) found that modern computer voice was as effective as human voice on certain social measures and could outperform human voice on particular learning outcomes. This research examined whether two human voice conditions (strong-prosodic and weak-prosodic) produced consistent results when compared against modern computer voice and against each other on social perception and retention measures with non-native speakers. The human weak-prosodic voice was rated significantly higher on four of seven scale items compared to modern computer voice. However, no significant differences were found in the retention of information. These results show that non-native speakers prefer human voice with fewer prosodic elements, and that the factors behind voice are more complicated than simply categorizing it as either human or computer.
1. Introduction

It is widely believed that pedagogical agents (PAs) speaking with a human voice provide users with a better learning experience than PAs speaking with computer-generated voices. Compared to computer-generated voice, natural human voice has intrinsic features of prosody that convey a significant amount of lexical, semantic, syntactic, and discourse information (Akker & Cutler, 2003; Cutler, Dahan, & Van Donselaar, 1997). These prosodic features can be very difficult for non-native speakers to comprehend, whereas they are easy for native speakers to understand (Vanlancker-Sidtis, 2003). With PA design principles such as social agency theory and the voice principle used to support such voice claims, human voice has been advanced as the superior choice when designing a PA. However, such claims rest on a limited amount of research comparing the two voice conditions in experimental settings. Therefore, with the continuous advancement of technology, claims of voice benefits need to be re-examined and routinely reassessed, especially with regard to different populations such as non-native speakers.

One design principle that supports the use of human voice is social agency theory (Mayer, Sobko, & Mautone, 2003). Social agency theory suggests that PA social cues within multimedia presentations activate the user's conversation schema. If the PA is viewed as a social actor, then users will apply the same social rules found in human-to-human communication. Mayer (2014, p. 345) suggests that three of the most important social cues for PA design are conversational language, human voice, and human-like gestures. These components create a social partnership between the user and the PA that encourages the user to exert more effort
∗ Corresponding author. College of English, Hankuk University of Foreign Studies, 107 Imun-ro, Imun-dong, Dongdaemun-gu, Seoul, South Korea. E-mail address: [email protected] (R.O. Davis).
https://doi.org/10.1016/j.compedu.2019.103605
Received 9 January 2019; Received in revised form 14 June 2019; Accepted 15 June 2019; Available online 18 June 2019
0360-1315/ © 2019 Elsevier Ltd. All rights reserved.
during the learning process (Mayer, 2017). However, social agency is more complicated than conversational language, human voice, and human-like gestures. The image of the PA signals to the user that someone is present, which requires a social stance (Nam, Shu, & Chung, 2008). This introduces other potential variables to social agency, such as gender, age, ethnicity, visual appeal, and dynamism (Van der Meij, Van der Meij, & Harmsen, 2015). Thus, human voice within social agency theory is one component of a larger complex system that contributes to social perception and learning with PAs.

Even though voice is seen as a priming factor within social agency theory, the voice principle (Atkinson, Mayer, & Merrill, 2005) specifically suggests that participants learn better from human voice than from computer-synthesized voice. Mayer (2017) examined five experiments over three studies (Atkinson et al., 2005; Mayer & DaPra, 2012; Mayer et al., 2003) that directly compared human voice and computer-synthesized voice, and found that people learned better when presented with human voice (d = 0.74). A more detailed examination of the experiments comparing learning outcomes between human voice and computer-synthesized voice shows that participants in the human voice conditions had significantly higher retention scores (Mayer et al., 2003, Expt. 2) and significantly higher near transfer and far transfer scores (Atkinson et al., 2005, Expts. 1 & 2). However, it must be noted that the technology used to create the computer-synthesized voice in these experiments is vastly different from the text-to-speech technology currently available (Craig & Schroeder, 2017).

In later experiments with advancing technology, Mayer and DaPra (2012) compared human voice and computer-synthesized voice with the additional variable of embodiment (low versus high), which was operationalized through human-like gestures, lip synchronization, facial expression, and eye and body movements (Basori & Ali, 2013; Mayer & DaPra, 2012; Ochs, Niewiadomski, & Pelachaud, 2015). In a pure comparison of voice conditions, transfer and retention scores were not significantly different. However, the level of embodiment combined with human voice significantly affected the transfer of knowledge. As for embodiment combined with machine voice, no significant differences were found between the conditions. The authors suggest that embodiment helps participants learn more deeply, but negative social cues like machine voice compromise the potential benefits.

Recently, Craig and Schroeder (2017) revisited the issue of voice and accounted for the advancements of technology. In their experiment, the authors compared human voice against two forms of computer-generated voice: modern computer voice and classic computer voice. The modern computer voice was created with the Neospeech voice engine, which integrates today's advanced text-to-speech methods to sound more natural, while the classic computer voice used the Microsoft speech engine, which mirrored the capabilities of text-to-speech software of the early 2000s. Results from the learning outcome measures showed that while retention did not differ significantly across the conditions, transfer of learning did. Participants in the modern computer voice condition scored significantly higher than those in the human voice (d = 0.54) and classic computer voice conditions (d = 0.41).
The authors propose that voice may not be as important for learning now as it was in the past, and that modern text-to-speech software performs as well as a human recording. In this way, speech production technology has advanced enough to perform as well as, and in some instances better than, the human voice. The purpose of this study is to evaluate how human voice with prosody and human voice without prosody compare to modern computer voice in measured outcomes (cognitive load, agent persona, and retention) with non-native speakers of English.

1.1. Agent Persona and Voice

Multimedia environment researchers have routinely been concerned with social perception and whether the persona of the agent is beneficial to the learning process. One of the earliest foundations for agent persona comes from media equation theory (Reeves & Nass, 1996), which suggests humans will apply social rules to media and perceive media agents as real people. However, a specific meaning of "agent persona" has been difficult to establish given the variety of definitions found in the literature (Schroeder & Adesope, 2014). Persona has been described as the ability to influence the perception of a system (Lester et al., 1997), as an agent that is real and authentic (van Mulken, Andre, & Muller, 1998), as engaging, human-like, credible, and facilitative of learning (Baylor & Ryu, 2003), or as anthropomorphized due to life-like features such as facial expressions, body movements, and gestures (Woo, 2008). Although persona might be vaguely defined, research has found that PAs can simulate instructional roles such as motivator, mentor, and expert (Baylor & Kim, 2005). Participants also expect the PA to have a personality (Kim, Baylor, & Shen, 2007) and visually attend to PAs as they would to a human when holding a conversation. However, the concept of agent persona has produced mixed results in the literature. Heidig and Clarebout (2011), in a comprehensive review, attribute these mixed results to the highly complex nature of researching PAs, both in how PAs need to be designed and in the conditions under which they are effective.

Even though evaluating agent persona is a complex process when examining learning outcomes, motivation, and other concepts such as engagement, isolating individual variables such as voice has made it easier to understand the elemental constructs of persona. Early experiments found evidence that human participants applied social rules to computer-based voices, which indicates that voice strongly boosts the perception of social presence (Nass & Steuer, 1993). Thus, social perception of the PA can help researchers evaluate the persona of the agent. However, measuring agent persona in relation to voice has been labeled differently across experiments. Three experiments (Mayer et al., 2003, Expt. 2; Atkinson et al., 2005, Expts. 1 & 2) measured agent persona as speaker rating, which rated human voice significantly higher than computer-synthesized voice, with effect sizes of d = 1.45, d = 0.76, and d = 0.83 respectively. Later voice comparison studies measured agent persona with the agent persona instrument (API; Baylor & Ryu, 2003) or the translated Korean agent persona instrument (KAPI; Ryu, 2012). The API and KAPI both measure agent facilitation, credibility, human-likeness, and engagement. Using the API, Craig and Schroeder (2017) compared agent persona across PAs using human voice, modern synthesized voice, and classic computer voice.
Both human voice and modern synthesized voice scored significantly higher than classic computer voice in facilitation and credibility, but human voice scored significantly higher than modern synthesized voice and classic computer voice in the categories of human-likeness and engagement. These results indicate that human voice provides better persona than classic computer voice in all measures, but modern synthesized voice may not be as human-like or
engaging as human voice, though there is no measurable difference in facilitation and credibility. Also, Ryu and Fengfeng (2018) conducted experiments with the KAPI on human voice and the same modern computer voice software (Neospeech) used in the Craig and Schroeder (2017) study. Across two experiments comparing voice type with image, and voice type with screen/presentation size, the results were similar. Human voice was rated significantly higher than modern computer voice in human-likeness and engagement, but the other subscales failed to reach significance. These recent studies indicate that modern computer voice is similar to human voice in some respects, but fails to match the human-likeness and engagement that human voice creates in most experiments.

1.2. Cognitive Load and Voice

In addition to agent persona, researchers are concerned with the mental processing demands placed upon the cognitive architecture of the learner during the learning process. Since the intrinsic features of prosody in the human voice convey lexical, semantic, syntactic, and discourse information (Akker & Cutler, 2003; Cutler, Dahan, & Van Donselaar, 1997), prosodic features can cause comprehension difficulty for non-native speakers, while native speakers have no such difficulties (Vanlancker-Sidtis, 2003). Thus, the benefits and hindrances of voice processing and comprehension can be viewed through the different aspects of cognitive load theory.

Cognitive load theory differentiates processing demands as intrinsic, extraneous, and germane (Paas, Tuovinen, Tabbers, & Van Gerven, 2003). Intrinsic cognitive load is the inherent amount of processing required for an individual topic (Van Merriënboer & Sweller, 2005). Extraneous cognitive load wastes the individual's effort and time on irrelevant or insignificant cognitive processing that does not contribute to schema construction or automation (Paas, Renkl, & Sweller, 2004), while germane cognitive load benefits cognitive processing by accessing knowledge that has already been acquired and automatized (Kalyuga, 2011). However, the interaction between cognitive load and mental processing is complicated: processing in working memory is limited, but intrinsic and extraneous cognitive load are additive, which means their combined demand is fluid depending on certain variables (Leppink, van Gog, Paas, & Sweller, 2015). Since intrinsic cognitive load is connected with the inherent difficulty of the topic, a learner's prior knowledge of the content can dramatically influence how much intrinsic cognitive load is present in working memory. In other words, a multimedia presentation on statistics would, as a whole, cause less intrinsic cognitive load for participants who majored in math education than for participants who majored in history. Likewise, the design of multimedia presentations can increase or decrease extraneous cognitive load within the participant. One such design strategy is the temporal contiguity principle, which holds that narration and graphical information should be presented simultaneously rather than separately (Mayer, 2017). Separating voice and graphical representation causes undue mental processing, which increases the processing demand on a limited information system like working memory.
Therefore, components linked to intrinsic cognitive load and extraneous cognitive load need to be carefully considered so that working memory is free to process the information being presented.

The element of voice is a key feature in theories examining cognitive load within multimedia presentations. The most prominent theory, the cognitive theory of multimedia learning (CTML), views learning as an interaction between people and computer-based environments using words and pictures. CTML assumes that (1) people use verbal and visual channels to process information; (2) each channel has a limited capacity to process information; and (3) the process of learning is cognitively demanding (Mayer & Moreno, 2003). If participants have to read on-screen text and view graphical representations simultaneously, then the visual channel is likely to become overloaded by the amount of information that needs to be processed. Therefore, to avoid overloading particular channels, design strategies such as using narration instead of text with graphical representations have the potential to distribute the processing needed to understand the content over two channels instead of one. Overall, there are twelve principles related to CTML that help multimedia instructional designers create content that does not cognitively overload the participant (see Mayer, 2009, for a review).

While the literature comparing human voice and computer-synthesized voice is limited, direct comparisons of human and computer voice in relation to cognitive load are scarcer still. Only two experiments (Atkinson et al., 2005, Expts. 1 & 2) attempted to measure cognitive load between voice types, and neither found a significant difference. However, those experiments measured cognitive load as difficulty, which did not specify whether the measure was capturing intrinsic or extraneous cognitive load. This separation is important since extraneous cognitive load may burden the working memory and learning ability of listeners with irrelevant information that makes them feel bored or annoyed while listening to a synthesized voice (Wouters, Paas, & van Merriënboer, 2008). Therefore, more precise measures of the different types of cognitive load are needed to understand how different components of the PA cognitively impact the participant in the multimedia environment.

1.3. Voice prosody and listening comprehension with non-native speakers

Although spoken language centers on the words that are articulated, one key feature of the human voice is its ability to alter meaning through prosody. Prosody comprises pitch, tempo, stress, intonation, melody, loudness, accent, and pause (Kent, 1997; Ross, 2000), and it helps listeners comprehend the intention of the speaker. Whenever listeners recognize the utterances of others, they are processing prosodic cues (Cutler, Dahan, & Van Donselaar, 1997). For example, stress can add several different meanings to the sentence, "I didn't tell you to do that." If the speaker says, "I didn't tell YOU to do that," the information being communicated suggests someone else was responsible to do something. However, if the speaker says, "I didn't tell you to do THAT," the stress at the end signals the listener was to do something, but not what the listener did. For native speakers of the language, these nuances of prosody are easily recognized and understood during discourse.
However, meaning communicated through prosody is not as easy for non-native speakers to comprehend because their listening process is different from that of a native speaker. One of the striking differences between native speakers and non-native speakers is how the language is processed. Non-native listeners process prosodic information for semantic structure less efficiently than native listeners (Akker & Cutler, 2003). Also, native speakers tend to use a top-down approach that focuses on meaning, whereas non-native speakers utilize a bottom-up approach focusing on words, which commands more cognitive resources (Osada, 2001). In addition to the less efficient bottom-up approach, issues such as limited vocabulary, unclear pronunciation, rate of speech, and the ability to segment the speech can complicate the listening process (Goh, 2000; Hasan, 2000). Therefore, non-native speakers' ability to map the prosodic features of spoken language to the semantics of communicative intent is less efficient compared to the abilities of native speakers (Akker & Cutler, 2003). In order to map these prosodic features, Lindfield, Wingfield, and Goodglass (1999) suggest that individuals must have the communicative prosodic information in their lexicon and possess the ability to access these features if they want to produce or recognize prosodic characteristics of language. Because of this, more advanced users of a foreign language may have more cognitive resources available to benefit from prosodic elements than less advanced language learners, who need to dedicate vast amounts of cognitive resources to processing the language being heard. That is to say, these prosodic elements play a role as indicators of cognitive load in spoken language. Researchers have found that many prosodic changes are associated with heavy cognitive load on non-native listeners (Akker & Cutler, 2003), while gestures may help to reduce second language learners' cognitive load (McCafferty, 2004).

1.4. Research questions and hypotheses

The research questions for this experiment sought to answer:

RQ1. To what extent does agent voice affect intrinsic cognitive load, extraneous cognitive load, and germane cognitive load with non-native speakers? (cognitive load)

H1a. Extraneous cognitive load will increase in the modern computer voice condition compared to the two human voice conditions. While Wouters and colleagues (2008) suggested that computer-synthesized voice has the ability to increase extraneous cognitive load, Veletsianos (2012) provided qualitative evidence that participants found computer-synthesized voice to be "obnoxious," "distracting," and "hard to listen to at times" (p. 280).

H1b. Germane cognitive load will increase in the human weak-prosodic voice condition due to less efficient prosodic feature mapping and potentially decreased access to prosodic features of the language by non-native learners (Akker & Cutler, 2003; Lindfield et al., 1999).

RQ2. In what ways does voice affect non-native speakers' perceptions of the agent's persona? (perception of persona)

H2a. Human weak-prosodic voice will positively influence the agent persona subscale of facilitation against the human strong-prosodic voice due to the bottom-up approach non-native speakers use while listening (Osada, 2001).
H2b. Only human strong-prosodic voice will positively influence the agent persona subscales of credibility, human-likeness, and engagement, since prosodic features are expected and previous research has found human voice outperforms machine-generated voice in these subscales (Mayer & DaPra, 2012; Craig & Schroeder, 2017; Ryu & Fengfeng, 2018).

RQ3. To what extent does voice type influence non-native speakers' retention of information? (retention of information)

H3. Voice will not be a significant factor in the retention of information (Mayer & DaPra, 2012; Craig & Schroeder, 2017).

RQ4. How do the two different human voices compare against each other and against the modern computer voice in cognitive load, agent persona, and retention measurements?

H4a. The human voice conditions will show no significant differences when compared against each other.

2. Methods

2.1. Design and participants

A between-subjects experimental design was used to study whether PA voice alters perceptual and learning outcomes in computer-based environments with non-native speakers of English. To manipulate the independent variable of voice, participants were randomly assigned to one of three conditions: human strong-prosodic voice, human weak-prosodic voice, or modern computer voice. The dependent variables were participants' perceived cognitive load, evaluation of agent persona, and retention of information measured through multiple-choice questions. Since foreign language learners are not as efficient at mapping the prosody of language (Akker & Cutler, 2003), the participants were 172 undergraduate students at a foreign language university in Seoul, South Korea. All students were taking English major courses and either majored or double majored in English-focused programs. The age range of the sample was 18 to 32 years (M = 22.56, SD = 2.64), with 87 participants identifying as male and 85 as female. The research session lasted about 25 min in total.
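The random assignment described above can be reproduced with a short script. The sketch below is a minimal illustration with a hypothetical seed and participant IDs; it is not the authors' actual procedure, which is not reported.

```python
from collections import Counter
import numpy as np

# Minimal sketch of random assignment of 172 participants to the three voice conditions.
# The seed and participant IDs are hypothetical assumptions for illustration only.
conditions = ["human_strong_prosodic", "human_weak_prosodic", "modern_computer"]
participants = np.arange(1, 173)            # 172 undergraduate participants

rng = np.random.default_rng(seed=2019)
shuffled = rng.permutation(participants)    # randomize the order of participants
assignment = {pid: conditions[i % 3] for i, pid in enumerate(shuffled)}

# Group sizes come out roughly equal; the reported sample had 58, 58, and 56 per condition.
print(Counter(assignment.values()))
```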
2.2. Learning materials

2.2.1. Narrative script
The 486-word script was developed by Schroeder and Adesope (2013) and was subsequently used in follow-up research involving pedagogical agents (Schroeder, Romine, & Craig, 2017; Schroeder & Adesope, 2015). The script briefly explains instructional design concepts for teaching with technology, such as cognitive load theory, the split attention principle, and the modality principle. The Flesch-Kincaid grade level for this script is 10.4. The whole script was presented to the participants as one continuous lesson.

2.2.2. Pedagogical agent and multimedia design
The overall presentation for this experiment was designed to eliminate extraneous variables that could confound the results on the measured learning outcome. The PA was designed to use lip sync, eye gaze, facial expressions, body sway, head movements, and conversational gestures to appear human-like. Since representational gestures have the ability to convey information to the listener, the PA in all conditions used conversational gestures to give the appearance of being human, but the gestures lacked any semantic information that might assist with comprehension. In addition, the multimedia context was designed to minimize strategies that could assist participants with retaining information. The background was solid purple to avoid contextualizing the information presented, and low verbal redundancy strategies, such as displaying keywords on-screen in coordination with the speech, were excluded, since low verbal redundancy has been shown to have a large effect (g = 0.99) on learning outcomes (Adesope & Nesbit, 2012).

The PAs and the multimedia environment were created in iClone 7.1™. Appearance, conversational gestures, eye gaze, facial expressions, lip synchronization, and body sway were designed within the timeline editor. The only difference in length between the three conditions was the human strong-prosodic voice condition, which ran 3 min and 24 s compared to 3 min for the human weak-prosodic voice condition and the modern computer voice condition. The length difference is due to the prosodic elements added to the speech. Outside of length, and adjusting the timing of conversational gestures in the human strong-prosodic voice condition so they remained evenly distributed across the video, voice type was the only difference between the conditions.

2.2.3. Voice conditions
Two American male professional voice actors were hired through Fiverr to read the 486-word script. The voice actor in the human strong-prosodic voice condition was provided a special script that instructed him to add prosodic elements and pauses (see Appendix A for the human strong-prosodic voice script). The voice actor in the human weak-prosodic condition was instructed to maintain an even pitch, minimize pausing, and avoid adding prosodic elements such as stress and intonation as much as possible, in order to mimic the capabilities of modern computer voice, which has a lower range of vocal elements. For the modern computer voice condition, the American male voice of James was recorded with ReadSpeaker's text-to-speech software.

To analyze the comparability of the voices, the voice tracks were analyzed in the freely downloadable Praat software. Voice analysis of the human strong-prosodic voice (125 Hz) and the human weak-prosodic voice (126 Hz) found the pitch to be indistinguishable to the human ear. Likewise, intensity, or loudness, was indistinguishable at 70 dB and 68 dB respectively.
The modern computer-generated voice pitch was lower at 109 Hz, but loudness was comparable at 72 dB. Although male voices can differ in pitch, some studies report the average human male voice to range from 100 Hz (Hancock, Colton, & Douglas, 2014) to 118 Hz (Brown, Morris, Hollien, & Howell, 1991) for males between 20 and 35 years of age. However, male vocal pitch can range from 90 Hz to 165 Hz (Hollien & Jackson, 1973), so the voice pitch ranges are comparable to the expectations of a male voice, and the intensities of the voices show negligible differences.

The differences between the voice conditions can be illustrated by the way the voice tracks emphasized "cognitive load theory" in the script (Appendix A). In Fig. 1, the top portion of each graph represents the pitch of the voice with the words spoken, while the bottom of the graph provides the intensity of the voice. The human strong-prosodic voice (top left) shows several prosodic strategies while emphasizing "cognitive load theory." First, there are breaks in the pitch (top) line that indicate the use of pausing. Second, the intensity line (below) shows more visible variation with peaks and valleys. Third, there is greater stress at the beginning and end of the words. The human weak-prosodic voice (top right) makes no use of pausing, its intensity line has less substantial change between peaks and valleys, and its stress pattern is not equally distributed. The modern computer voice condition provides minor pauses, but they are not well placed, and its intensity and stress patterns have similar issues to the human weak-prosodic voice condition. Likewise, the technical data in Table 1 provide insight into the prosody of the conditions. The human strong-prosodic voice condition maintained a higher average pitch (157.8 Hz) than the human weak-prosodic voice condition (131 Hz) and the modern computer voice condition (109.2 Hz). In addition, the maximum intensity across the conditions was similar, but the minimum intensity in the human strong-prosodic voice condition was vastly lower at 20.7 dB, compared to 42.8 dB and 42.4 dB in the human weak-prosodic voice and modern computer voice conditions respectively. See Fig. 1 and Table 1 for details.
Fig. 1. Graphs of pitch (top) and intensity (bottom) for the human strong-prosodic voice (top left), human weak-prosodic voice (top right), and the modern computer voice (bottom) for the words "cognitive load theory."

Table 1
Average, maximum, and minimum scores for pitch and intensity in voice.

Voice                    Duration (s)   Pitch Avg.   Pitch Max   Pitch Min.   Intensity Avg.   Intensity Max   Intensity Min.
Strong-prosodic Voice    1.886          157.8 Hz     249.8 Hz    76.5 Hz      66 dB            79.2 dB         20.7 dB
Weak-prosodic Voice      1.342          131 Hz       249.2 Hz    68.3 Hz      65 dB            77.3 dB         42.8 dB
Modern Computer Voice    1.302          109.2 Hz     222.3 Hz    69.4 Hz      67.9 dB          76.4 dB         42.4 dB
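An acoustic comparison of the kind summarized in Fig. 1 and Table 1 can be approximated with Praat's Python interface. The sketch below is a minimal example using the parselmouth library with default analysis settings, which may differ from the exact Praat settings used in this study; the file names are hypothetical.

```python
import numpy as np
import parselmouth  # Python interface to Praat (praat-parselmouth)

def describe_voice(path: str) -> dict:
    """Return average/max/min pitch (Hz) and intensity (dB) for one voice track."""
    sound = parselmouth.Sound(path)

    pitch = sound.to_pitch()                   # default time step and pitch floor/ceiling
    f0 = pitch.selected_array["frequency"]     # 0 Hz where Praat judged the frame unvoiced
    f0 = f0[f0 > 0]

    intensity = sound.to_intensity()
    db = intensity.values.flatten()

    return {
        "pitch_avg_hz": f0.mean(), "pitch_max_hz": f0.max(), "pitch_min_hz": f0.min(),
        "intensity_avg_db": db.mean(), "intensity_max_db": db.max(), "intensity_min_db": db.min(),
    }

# Hypothetical file names for the three narration tracks
for track in ["strong_prosodic.wav", "weak_prosodic.wav", "modern_computer.wav"]:
    print(track, describe_voice(track))
```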
2.2.4. Instruments
Five different instruments were used to collect data during this experiment: a demographic survey, a prior knowledge test, a cognitive load survey, the Korean agent persona instrument (KAPI), and a multiple-choice retention test. The demographic survey recorded information such as age, university year classification, gender, and native language. This information was used to assess whether the sample characteristics significantly affected outcome measures.

The pretest measure consisted of three open-ended questions asking participants to list or define cognitive load, the split attention principle, and the modality principle (Schroeder & Adesope, 2013; Schroeder & Adesope, 2015; Schroeder, 2017). Participants were awarded one point each for correctly defining the split attention principle and the modality principle. For cognitive load theory, participants were awarded one point for correctly naming and one point for properly describing each of working memory, long-term memory, schema, germane cognitive load, intrinsic cognitive load, and extraneous cognitive load. The highest possible score on the pretest measure was 14 points. Two raters independently assessed the pretest answers. The Cohen's kappa interrater reliability score was κ = 0.66, which is considered substantial (Cohen, 1960). Disagreements were reconciled with the help of a third party.

The cognitive load survey consisted of ten items in total, with three intrinsic cognitive load questions (α = 0.84), three extraneous cognitive load questions (α = 0.70), and four germane cognitive load questions (α = 0.90; Leppink, Paas, Van Gog, Van der Vleuten, & Van Merriënboer, 2014). Each question used a ten-point scale with "extremely disagree" on the far left and "extremely agree" on the far right. The questions for intrinsic cognitive load and extraneous cognitive load are worded negatively, so higher scores indicate more difficulty; the germane cognitive load questions are worded positively, so higher scores indicate greater ease. The cognitive load survey was administered after watching the video.

The Korean agent persona instrument (KAPI; Ryu, 2012; Davis & Antonenko, 2017) was used to collect information on the persona of the PA. The KAPI is translated from the agent persona instrument (API; Ryu & Baylor, 2005). In all, there were twenty questions, with five questions in each of four subscales measuring the agent's ability to facilitate learning (α = 0.82), to establish credibility (α = 0.76), to appear human-like (α = 0.80), and to be engaging (α = 0.85). Each question was rated on a 5-point Likert scale with "strongly disagree" on the far left (1) and "strongly agree" on the far right (5). Ratings from each question were combined to produce a total score for each subscale; therefore, the lowest possible score on each subscale was 5 points and the highest possible score was 25 points. Participants rated the persona of the PA after finishing the cognitive load survey.

The post-test consisted of 20 retention questions used by Schroeder and Adesope (2013, 2015). Each correct answer was awarded one point. The internal consistency reliability of the retention test was α = 0.65.
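For the reliability figures reported above, the sketch below shows one way to compute Cohen's kappa for the two pretest raters and Cronbach's alpha for a subscale. It is an illustrative example with made-up ratings, not the authors' analysis script; scikit-learn's cohen_kappa_score is assumed to be available.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical pretest scores (0-14 scale) assigned by the two independent raters.
rater_a = np.array([0, 2, 1, 0, 3, 0, 1, 0])
rater_b = np.array([0, 2, 0, 0, 3, 0, 1, 1])
print("Cohen's kappa:", cohen_kappa_score(rater_a, rater_b))

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a participants x items response matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses to the three extraneous cognitive load items (10-point scale).
extraneous_items = np.array([[3, 4, 2],
                             [7, 6, 8],
                             [5, 5, 4],
                             [2, 3, 2],
                             [9, 8, 9]])
print("Cronbach's alpha:", cronbach_alpha(extraneous_items))
```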
Table 2
Cognitive load assessment scores, M (SD).

Voice                    Intrinsic       Extraneous      Germane
Human Strong-prosodic    16.40 (5.14)    10.34 (4.95)    23.81 (7.80)
Human Weak-prosodic      14.45 (6.16)    10.24 (5.23)    26.22 (6.90)
Modern Computer          15.63 (5.64)    11.75 (5.64)    22.95 (7.24)
Table 3
Agent persona assessment scores, M (SD).

Voice                    Facilitation    Credible        Human-like      Engaging
Human Strong-prosodic    14.76 (4.74)    17.67 (3.91)    12.07 (4.30)    14.10 (4.37)
Human Weak-prosodic      15.29 (4.62)    18.83 (3.65)    12.67 (4.93)    13.93 (4.63)
Modern Computer          14.29 (4.77)    17.04 (3.64)    10.36 (3.73)    11.45 (4.38)
3. Results

3.1. Prior knowledge
Before using inferential statistics, a Levene's test was performed to evaluate whether the data met the assumption of homogeneity of variance. The results indicated no evidence of heterogeneity, F(2, 169) = 2.00, p = 0.138. Thus, the data were deemed appropriate for inferential statistics. See Table 4 for prior knowledge and retention scores.

4. Cognitive load
To analyze how different PA voice types affect cognitive load in computer-based environments, ANOVAs were performed on the cognitive load constructs of intrinsic cognitive load, extraneous cognitive load, and germane cognitive load. Intrinsic cognitive load and extraneous cognitive load were not significant, with F(2, 169) = 1.843, p = 0.162 and F(2, 169) = 1.445, p = 0.239 respectively. However, germane cognitive load was significant, F(2, 169) = 3.076, p = 0.049. A Tukey's HSD post hoc test indicated a significant difference at the p < 0.05 level between the weak-prosodic voice condition (M = 26.22, SD = 6.90) and the modern computer voice condition (M = 22.95, SD = 7.24), with an effect size of d = 0.46. No significant differences were found in the other comparisons. Table 2 provides the means and standard deviations for the cognitive load measure.

4.1. Korean agent persona instrument
A MANOVA was performed to test whether voice significantly influenced the social perception of the PA on the four subscales of facilitation, credibility, human-likeness, and engagement. There was a statistically significant difference between the voice conditions, F(8, 332) = 3.215, p = 0.002; Wilks' λ = 0.861, partial η2 = 0.072. There was no significant difference between voices on the facilitation subscale, F(2, 169) = 0.652, p = 0.522. However, voice was significant for credibility (F(2, 169) = 3.430, p = 0.035), human-likeness (F(2, 169) = 4.315, p = 0.015), and engagement (F(2, 169) = 6.290, p = 0.002). A Tukey's HSD post hoc test indicated the weak-prosodic voice condition was rated significantly higher (p < 0.05) than the modern computer voice condition on the subscales of credibility (d = 0.50), human-likeness (d = 0.53), and engagement (d = 0.55). The strong-prosodic voice condition was rated significantly higher (p < 0.05) than the modern computer voice condition on the engagement subscale (d = 0.61). No other significant differences were found in the comparisons. See Table 3 for the means and standard deviations of the KAPI.
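The analyses reported in this section (Levene's test, one-way ANOVAs, Tukey's HSD follow-ups, and Cohen's d effect sizes) can be reproduced in outline with SciPy and statsmodels. The sketch below uses simulated germane cognitive load scores that merely approximate the reported group means and standard deviations; it is not the authors' analysis code.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
# Simulated germane cognitive load scores, roughly matching the reported means/SDs and group sizes.
strong = rng.normal(23.8, 7.8, 58)
weak = rng.normal(26.2, 6.9, 58)
computer = rng.normal(23.0, 7.2, 56)

# Homogeneity of variance and one-way ANOVA across the three voice conditions.
print(stats.levene(strong, weak, computer))
print(stats.f_oneway(strong, weak, computer))

# Tukey's HSD pairwise follow-ups.
scores = np.concatenate([strong, weak, computer])
groups = ["strong"] * len(strong) + ["weak"] * len(weak) + ["computer"] * len(computer)
print(pairwise_tukeyhsd(endog=scores, groups=groups, alpha=0.05))

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d using a pooled standard deviation."""
    pooled = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                     / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled

print("d (weak vs. computer):", cohens_d(weak, computer))
```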
Table 4
Prior knowledge and retention scores, M (SD).

Voice                    N     Prior Knowledge   Retention
Human Strong-prosodic    58    0.00 (0.00)       11.03 (3.80)
Human Weak-prosodic      58    0.03 (0.184)      10.59 (3.52)
Modern Computer          56    0.00 (0.00)       9.98 (3.14)
4.2. Retention scores
Since there were no significant differences between any of the conditions in prior knowledge, inferential statistics were used to test the retention of information. An ANOVA conducted on the retention scores found no significant differences between the groups, F(2, 169) = 1.294, p = 0.277. See Table 4 for prior knowledge and retention test scores. A follow-up ANCOVA using prior knowledge as a covariate also found no significant differences.

5. Discussion

This experiment examined the voice principle (Atkinson et al., 2005) in relation to non-native speakers' assessment of cognitive load, evaluation of agent persona, and retention of information with human strong-prosodic voice, human weak-prosodic voice, and modern computer voice. The findings from this research show that understanding the voice principle and its relation to non-native speakers might be more complicated than previously thought. Some patterns were consistent with previous results suggesting that modern computer voice is much improved, but the effects of human voice are more complicated when accounting for components of the voice and the population to which it is presented.

RQ1. Cognitive Load and Voice

The cognitive load measures indicated no significant differences in intrinsic cognitive load (F(2, 169) = 1.843, p = 0.162) or extraneous cognitive load (F(2, 169) = 1.445, p = 0.239) between any of the voice conditions. These results do not support the hypothesis (H1a) that modern computer voice would increase extraneous cognitive load. However, the human weak-prosodic voice scored significantly higher on germane cognitive load than modern computer voice. This supports our hypothesis (H1b) that fewer prosodic features would increase germane cognitive load for non-native speakers when compared against modern computer voice (d = 0.46) and human strong-prosodic voice (d = 0.33). No other significant differences were found between the conditions.

These results do not conform to some of the potential explanations for the benefits of increased germane cognitive load. According to Leppink et al. (2015), lower intrinsic and extraneous cognitive load leave working memory (germane cognitive load) freer to relate new information to long-term memory. While the human weak-prosodic voice (M = 26.22, SD = 6.90) did score significantly higher on germane cognitive load than the modern computer voice (M = 22.95, SD = 7.24), and higher than the human strong-prosodic voice (M = 23.81, SD = 7.80), the increased germane cognitive load did not necessarily manifest in higher retention. The significantly higher ratings of the human weak-prosodic voice over the modern computer voice produced only a minimal learning increase (d = 0.18), and despite higher germane cognitive load ratings, the human weak-prosodic voice condition actually performed slightly worse on retention (d = −0.12) than the human strong-prosodic voice condition. Therefore, it is difficult to conclude with confidence that higher germane load produces better learning outcomes through access to more working memory. Another possible view is that germane cognitive load is purely related to participant characteristics, whereas intrinsic and extraneous cognitive load depend on both the material and participant characteristics (Sweller, 2010). In Sweller's assessment, when levels of motivation are equal, increased intrinsic cognitive load will lead to increased germane cognitive load, and higher extraneous cognitive load will decrease germane cognitive load. Yet, the absence of significant differences in intrinsic and extraneous cognitive load makes this less likely.
However, this might be true if motivation were equal across participants, as suggested by Sweller (2010). Although motivation was not directly assessed in this study, engagement was measured in the agent persona survey. On that measure, participants in the human strong-prosodic voice condition and the human weak-prosodic voice condition rated the PA as significantly more engaging than those in the modern computer voice condition. The results from the engagement subscale are similar to the germane cognitive load assessments, with human weak-prosodic voice scoring significantly higher than modern computer voice. When accounting for the sample population being non-native speakers, whose listening comprehension may be assisted by the lack of vocal features, the higher germane cognitive load ratings might reflect engagement in the immediate environment under the human weak-prosodic condition. Although it is outside the scope of this study, researchers should examine voice type and cognitive load with immediate and delayed assessments to evaluate whether increased working memory is more beneficial to the long-term retention of knowledge.

Overall, voice appears to have minimal impact on cognitive load with non-native speakers. Outside of germane cognitive load, voice does not significantly increase or decrease the inherent difficulty of the subject matter, nor does it overwhelm the listener as a distraction. The assumption that computer voice could increase extraneous cognitive load (Wouters et al., 2008) is not supported in this research with modern computer voice. Although mixed results have been found with the higher germane cognitive load ratings, this does not necessarily undermine the role that working memory has in relation to germane cognitive load. It is possible that voice is a
minor influencer, and other multimedia strategies such as verbal redundancy, visual aids, and gestures would make more significant impacts; or it could be the type of learning outcome being measured. In any case, researchers should conduct studies to examine whether germane cognitive load contributes more to working memory and learning outcomes (Leppink et al., 2015), or whether germane cognitive load is more related to participant characteristics where motivation is equal (Sweller, 2010).

RQ2. Agent Persona and Voice

Results from the KAPI survey found significance on three of the four subscales. Contrary to the stated hypothesis (H2b) that human strong-prosodic voice would increase credibility, human-likeness, and engagement, it was the human weak-prosodic voice that was rated significantly higher than modern computer voice on the subscales of credibility, human-likeness, and engagement. The human strong-prosodic voice condition was rated significantly higher than the modern computer voice only on the engagement subscale. In addition, the hypothesis (H2a) that human weak-prosodic voice would positively influence the agent persona subscale of facilitation, due to the bottom-up listening approach non-native speakers use, was not supported; the ratings were higher but failed to reach significance.

The finding that human voice rates significantly higher than modern computer voice on the human-likeness and engagement subscales has been fairly consistent in recent studies using the API or KAPI to compare human voice with modern computer speech engines (Craig & Schroeder, 2017; Ryu & Fengfeng, 2018). Including the two human voice conditions in this study, four of the five human voice conditions (Craig & Schroeder, 2017; Ryu & Fengfeng, 2018, Exps. 1 & 2) were rated significantly higher than modern computer voice in terms of human-likeness, and five out of five human voice conditions were rated significantly higher on the subscale of engagement. One of the main issues plaguing computer voice has been the lack of prosodic features (stress, pitch, intonation) and emotion, which can be immediately recognized by participants. Qualitative data collected by Veletsianos (2012) examining perceptions of PA behavior found disdain for the computer voice used by the agent. Participants said the computer voice was "annoying," "obnoxious," and "distracting," and emphasized "The pitch and tone didn't really change so I started to doze off a little bit," and "she (the agent) put the wrong emphasis on certain parts of the words … I found this sort of hard to listen to at times" (Veletsianos, 2012, p. 280). Computer voice has routinely lacked features such as emotional tone, naturalness, and pitch variation (Stern, 2008).

However, the condition that was significant on three of the four subscales was the human weak-prosodic voice, which lacked prosody much like the modern computer voice condition. It is possible this finding is a result of the sample population of non-native speakers, who tend to process information in a more bottom-up fashion (Osada, 2001) and are less efficient than native speakers at processing prosodic features of the language (Akker & Cutler, 2003). Further research should examine whether human weak-prosodic voice increases agent persona with foreign language users because the lack of prosodic elements might allow easier comprehension. Another component of the increased persona might be the rhythm that is inherent to human speech.
Qualitative data from Veletsianos (2012) cite misplaced computer voice emphasis as an issue, which could contribute to the naturalness problems noted by Stern (2008). Although researchers are making progress toward computer voice that sounds more natural (Jannati, Sayadiyan, & Razi, 2018), modern text-to-speech software still falls short of replicating the natural flow and sound of a human voice, even when other prosodic elements are not present. Ultimately, human voice with weak prosodic features might be more beneficial to non-native speakers due to the diverse listening strategies employed by this population. As far as agent persona is concerned with non-native speakers, a human voice is not inherently better; it is the elements within the voice that influence perception of the agent. At the same time, modern computer voice has made technological advances that close the disparity between it and the human voice.

RQ3. Retention and Voice

Even though human strong-prosodic voice (M = 11.03, SD = 3.80) and human weak-prosodic voice (M = 10.59, SD = 3.52) scored higher than modern computer voice (M = 9.98, SD = 3.14) on the retention of information, the differences were not statistically significant. These results support our hypothesis (H3) and are similar to other studies that compared human and computer voice on learning retention (Mayer & DaPra, 2012; Craig & Schroeder, 2017). Regarding voice type and learning outcomes with PAs, Mayer and DaPra (2012) suggest that researchers should measure different learning outcomes at the same time, since a single learning outcome measure might be misleading. As they note, if their study had only measured retention, there would have been no evidence of deeper processing, which is associated with transfer.

However, researchers should also put retention scores in the context of a larger body of evidence when discussing learning outcome results. Davis (2018) conducted a meta-analysis of PA gesturing on the learning outcomes of retention and near transfer. Overall, the effect size for near transfer was g = 0.40, while retention was g = 0.28. Both effect sizes are considered small, with transfer on the higher end and retention on the lower end, which could be attributed to differences between the PAs and the experimental designs. However, every experiment that tested for retention also assessed transfer, so retention and transfer could be directly compared across the same PAs and experimental designs. The effect size for retention remained the same (g = 0.28), but transfer increased to a medium effect size (g = 0.50) when accounting for the same conditions. For this reason, it is possible that significant differences are discovered more easily with certain outcome measures than others. Therefore, researchers may want to assess whether PAs are more beneficial to certain types of knowledge, and design multimedia environments based on the type of learning the lesson is trying to achieve.

Because the participants in this research were non-native speakers learning about concepts with technical vocabulary related to the educational technology field, the researchers eliminated the transfer test to avoid overwhelming the participants. Although the retention results matched those of previous research, future research with non-native speakers should include transfer,
since there are inconsistent findings: Mayer and DaPra (2012) found that human voice significantly increased transfer, while Craig and Schroeder (2017) found the opposite, with modern computer voice significantly increasing transfer.

RQ4. Comparing Human Voices and Modern Computer Voice

All previous voice comparison studies have used a single human voice to compare against computer voice(s). The voice principle suggests that participants learn better from human voice than from computer voice (Atkinson et al., 2005), but no study has examined the comparability of multiple human voice conditions. A portion of this study was therefore dedicated to comparing two different human voices against each other and against a computer voice, to discover whether the human factor itself is the prominent component or whether the human voice requires particular internal elements to be significant. Comparing the human strong-prosodic voice against the human weak-prosodic voice revealed no significant differences in cognitive load, agent persona, or retention. These results are in line with our hypothesis (H4a) that no significant differences would be found between the two human voice conditions. However, it would be faulty to suggest there was no difference between the two human voices. The human weak-prosodic voice was rated significantly higher than the modern computer voice on germane cognitive load, agent credibility, human-likeness, and engagement. This is in stark contrast to the human strong-prosodic condition, which showed a significant difference only in engagement when compared against the modern computer voice condition. Thus, these findings suggest that the voice principle's assumption that people learn better from human voice than from computer voice is too narrow a view to account for the complex elements of voice. The shortcomings of computer voice are well documented, but the essential elements of human voice in multimedia environments for instructional and learning purposes have not been thoroughly examined. For this reason, future research needs to examine which elements of human voice are beneficial to perception and the learning process, and which elements relegate the human voice to performing no better than the modern text-to-speech options available.

5.1. Limitations

There are several limitations to this study. The first is that the multimedia video was only 3 min to 3 min and 24 s in length; learning in this situation is in stark contrast to a class lasting fifty to one hundred fifty minutes. With the technical terms used within the content and the short intervention, the non-native students may have struggled to fully comprehend the content. Second, although the participants were non-native speakers, they were English majors or double majors, so their command of the English language may not be representative of all other non-native English-speaking populations. Therefore, caution should be taken when applying these results to other contexts or individuals. Lastly, this research relied on a simple learning task. If the attributes of the listening task were diversified, the effects on cognitive load as well as persona perception would likely be more varied.

6. Conclusion

This study sought to explore topics regarding the voice principle with non-native speaking populations.
The first objective was to replicate earlier findings that modern computer voice is comparable to human voice in terms of agent persona and retention of knowledge. For persona, comparability depended on which human voice the modern computer voice was measured against. The human strong-prosodic voice scored significantly higher on only one of the four persona subscales, while the human weak-prosodic voice scored significantly higher on three of the four subscales against the modern computer voice condition. These mixed results could be interpreted as showing that modern computer voice is becoming comparable to human voice; but in our opinion, they indicate that non-native populations require special consideration because of the comprehension strategies they employ during listening. As for learning outcomes, this experiment replicated previous results suggesting that voice has no significant benefit with regard to retention; further research needs to examine whether the retention results hold with delayed assessments and whether other types of learning outcomes are influenced by voice.

In addition to comparing human voice conditions with a modern computer voice condition, this research examined different human voice conditions against each other to understand whether the human element produced consistent results. The two voices were not consistent, as the human weak-prosodic voice condition outperformed the human strong-prosodic voice condition when each was compared against modern computer voice. In all, these findings suggest that PAs speaking with a human voice have the potential to increase germane cognitive load and social perception; however, if the vocal elements are not carefully delivered in accordance with the needs and characteristics of the population, human voice may be no more effective than modern computer voice. Further research on this is recommended.

These findings were constrained to the auditory mode of human and computer-generated voice. Because non-native learners acquire languages through verbal and nonverbal (i.e., visual and emotional) cues, and because computer content and pedagogical agents are becoming more dynamic, anthropomorphic, and multimodal (Gilakjani, Ismail, & Ahmadi, 2011), it is impractical to focus only on auditory content with recorded speech. Instead, future research should investigate how combinations of human and computer-generated modalities in language teaching and learning materials affect non-native speakers' engagement, learning, and cognitive load, and how those multimodal elements interrelate.
Appendix A Strategy Key Bold = increased intonation and stress “^” = hard stress “___” = pause I heard you were interested in how to teach more effectively with technology. I don't blame you,___ technology can be fun to use__ and if used effectively,___ it can be a ^great ^teaching ^tool. __ There are quite a few things you should keep in mind__ to keep your teaching with technology simple__ and effective. __ One thing that is particularly important to understand__ is ^cognitive ^load ^theory. __Basically the theory says that there are two types of memory,__ the working memory__ and the long-term memory. __The working memory is pretty small__ and can only handle small amounts of information at a time,__ but the long-term memory_ is ^nearly ^limitless. __The long-term memory contains everything you've learned,__ and is organized by things called ^schema.__ The brain is able to function because the working memory takes in new information_ and finds an appropriate ^schema within the long term memory to store it in.__ If there isn't one,__ it makes a new schema for the information. __ Cognitive load theory also describes three types of mental load.__ Extraneous cognitive load _ is the ^extra ^strain put on the brain_ which doesn't relate to the learning material itself.__ For example, poorly designed presentations can cause extraneous cognitive load. __ Intrinsic cognitive load__ is due to the nature of the learning material._ While it can be different for each student,_ it's the mental load due to learning the specific information at hand.__ It is important to note that the intrinsic cognitive load and the extraneous cognitive load are ^additive,__ and too much cognitive load_ can interfere with the learning process.__ Finally,_ ^germane cognitive load_ is due to the construction of new knowledge structures,__ or schema.__ In other words, it is the cognitive load due to the process of learning.__ When you design your presentations,__ you need to think about the split-attention principle __and the modality principle.__ The split-attention principle states that if you make the student pay attention to more than one thing at once,__ it might interfere with their learning.__ For example, if you put a diagram on the screen__ and have the text that goes along with it off to the side,_ or on a separate sheet entirely,__ it may interfere with learning._ To get around this,_ it's suggested that you present the information in an ^integrated ^format._ In other words, _the diagram_ and the supporting information_ should all be presented together in one location.__ On the other hand,__ the modality principle_ suggests that some information should be presented visually,__ while some information is presented through narration._ This allows the student to bring in the maximum amount of information possible,_ using both their eyes_ and ears._ For example,_ if you were teaching students about ecosystems,__ you could show a picture of an ecosystem,__ and then describe the ecosystem to the students verbally. __ As you can see,__ using multimedia effectively isn't really that hard,__ it just takes some planning. Good luck on your next teaching assignment! References Adesope, O. O., & Nesbit, J. C. (2012). Verbal redundancy in multimedia learning environments: A meta-analysis. Journal of Educational Psychology, 104(1), 250–263. Akker, E., & Cutler, A. (2003). Prosodic cues to semantic structure in native and nonnative listening. Bilingualism: Language and Cognition, 6(2), 81–96. Atkinson, R. K., Mayer, R. 
References

Adesope, O. O., & Nesbit, J. C. (2012). Verbal redundancy in multimedia learning environments: A meta-analysis. Journal of Educational Psychology, 104(1), 250–263.
Akker, E., & Cutler, A. (2003). Prosodic cues to semantic structure in native and nonnative listening. Bilingualism: Language and Cognition, 6(2), 81–96.
Atkinson, R. K., Mayer, R. E., & Merrill, M. M. (2005). Fostering social agency in multimedia learning: Examining the impact of an animated agent's voice. Contemporary Educational Psychology, 30(1), 117–139.
Basori, A. H., & Ali, I. R. (2013). Emotion expression of avatar through eye behaviors, lip synchronization and MPEG4 in virtual reality based on Xface toolkit: Present and future. Procedia - Social and Behavioral Sciences, 97, 700–706.
Baylor, A. L., & Kim, Y. (2005). Simulating instructional roles through pedagogical agents. International Journal of Artificial Intelligence in Education, 15(1), 95.
Baylor, A. L., & Ryu, J. (2003). The effects of image and animation in enhancing pedagogical agent persona. Journal of Educational Computing Research, 28(4), 373–394.
Brown, W. S., Jr., Morris, R. J., Hollien, H., & Howell, E. (1991). Speaking fundamental frequency characteristics as a function of age and professional singing. Journal of Voice, 5(4), 310–315.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Craig, S. D., & Schroeder, N. L. (2017). Reconsidering the voice effect when learning from a virtual human. Computers & Education, 114, 193–205.
Cutler, A., Dahan, D., & Van Donselaar, W. (1997). Prosody in the comprehension of spoken language: A literature review. Language and Speech, 40(2), 141–201.
Davis, R. O. (2018). The impact of pedagogical agent gesturing in multimedia learning environments: A meta-analysis. Educational Research Review, 24, 193–209.
Davis, R. O., & Antonenko, P. (2017). Effects of pedagogical agent gestures on social acceptance and learning: Virtual real relationships in an elementary foreign language classroom. Journal of Interactive Learning Research, 28(4), 459–480.
Gilakjani, A. P., Ismail, H. N., & Ahmadi, S. M. (2011). The effect of multimodal learning models on language teaching and learning. Theory and Practice in Language Studies, 1(10), 1321–1327.
Goh, C. C. (2000). A cognitive perspective on language learners' listening comprehension problems. System, 28(1), 55–75.
Hancock, A., Colton, L., & Douglas, F. (2014). Intonation and gender perception: Applications for transgender speakers. Journal of Voice, 28(2), 203–209.
Hasan, A. S. (2000). Learners' perceptions of listening comprehension problems. Language, Culture and Curriculum, 13(2), 137–153.
Heidig, S., & Clarebout, G. (2011). Do pedagogical agents make a difference to student motivation and learning? Educational Research Review, 6(1), 27–54.
Hollien, H., & Jackson, B. (1973). Normative data on the speaking fundamental frequency characteristics of young adult males. Journal of Phonetics, 1(2), 117–120.
Jannati, M. J., Sayadiyan, A., & Razi, A. (2018). Speech naturalness improvement via ε-closed extended vector sets in voice conversion systems. Multidimensional Systems and Signal Processing, 29(1), 385–403.
Kalyuga, S. (2011). Cognitive load theory: How many types of load does it really need? Educational Psychology Review, 23(1), 1–19.
Kent, R. D. (1997). Gestural phonology: Basic concepts and applications in speech-language pathology. In The new phonologies: Developments in clinical linguistics (pp. 247–268).
Kim, Y., Baylor, A. L., & Shen, E. (2007). Pedagogical agents as learning companions: The impact of agent emotion and gender. Journal of Computer Assisted Learning, 23(3), 220–234.
Leppink, J., Paas, F., Van Gog, T., Van der Vleuten, C. P., & Van Merriënboer, J. J. (2014). Effects of pairs of problems and examples on task performance and different types of cognitive load. Learning and Instruction, 30, 32–42.
Leppink, J., van Gog, T., Paas, F., & Sweller, J. (2015). Cognitive load theory: Researching and planning teaching to maximise learning. Researching Medical Education, 207.
Lester, J. C., Converse, S. A., Kahler, S. E., Barlow, S. T., Stone, B. A., & Bhogal, R. S. (1997, March). The persona effect: Affective impact of animated pedagogical agents. Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems (pp. 359–366). ACM.
Lindfield, K. C., Wingfield, A., & Goodglass, H. (1999). The contribution of prosody to spoken word recognition. Applied Psycholinguistics, 20(3), 395–405.
Mayer, R. E. (2009). Multimedia learning (2nd ed.). New York: Cambridge University Press.
Mayer, R. E. (2014). Principles based on social cues in multimedia learning: Personalization, voice, image, and embodiment principles. In The Cambridge handbook of multimedia learning (2nd ed.). New York: Cambridge University Press.
Mayer, R. E. (2017). Using multimedia for e-learning. Journal of Computer Assisted Learning, 33(5), 403–423.
Mayer, R. E., & DaPra, C. S. (2012). An embodiment effect in computer-based learning with animated pedagogical agents. Journal of Experimental Psychology: Applied, 18(3), 239.
Mayer, R. E., & Moreno, R. (2003). Nine ways to reduce cognitive load in multimedia learning. Educational Psychologist, 38(1), 43–52.
Mayer, R. E., Sobko, K., & Mautone, P. D. (2003). Social cues in multimedia learning: Role of speaker's voice. Journal of Educational Psychology, 95(2), 419.
McCafferty, S. G. (2004). Space for cognition: Gesture and second language learning. International Journal of Applied Linguistics, 14(1), 148–165.
Nam, C. S., Shu, J., & Chung, D. (2008). The roles of sensory modalities in collaborative virtual environments (CVEs). Computers in Human Behavior, 24(4), 1404–1417.
Nass, C., & Steuer, J. (1993). Voices, boxes, and sources of messages. Human Communication Research, 19(4), 504–527.
Ochs, M., Niewiadomski, R., & Pelachaud, C. (2015). Facial expressions of emotions for virtual characters. In R. A. Calvo, S. K. D'Mello, J. Gratch, & A. Kappas (Eds.), The Oxford handbook of affective computing (pp. 261–272). Oxford University Press.
Osada, N. (2001). What strategy do less proficient learners employ in listening comprehension?: A reappraisal of bottom-up and top-down processing. Journal of Pan-Pacific Association of Applied Linguistics, 5(1), 73–90.
Paas, F., Renkl, A., & Sweller, J. (2004). Cognitive load theory: Instructional implications of the interaction between information structures and cognitive architecture. Instructional Science, 32(1), 1–8.
Paas, F., Tuovinen, J. E., Tabbers, H., & Van Gerven, P. W. (2003). Cognitive load measurement as a means to advance cognitive load theory. Educational Psychologist, 38(1), 63–71.
Reeves, B., & Nass, C. (1996). The media equation: How people treat computers, television, and new media like real people and places. New York: Cambridge University Press.
Ross, E. D. (2000). Affective prosody and the aprosodias. In Principles of behavioral and cognitive neurology (pp. 316–331). New York: Oxford University Press.
Ryu, J. (2012). The effect of image realism and learner's expertise on persona effect of pedagogical agent. Science of Emotion and Sensibility, 15(1), 47–56.
Ryu, J., & Baylor, A. L. (2005). The psychometric structure of pedagogical agent persona. Technology, Instruction, Cognition and Learning, 2(4), 291.
Ryu, J., & Fengfeng, K. E. (2018). Increasing persona effects: Does it matter the voice and appearance of animated pedagogical agent. Educational Technology International, 19(1), 61–92.
Schroeder, N. L. (2017). The influence of a pedagogical agent on learners' cognitive load. Journal of Educational Technology & Society, 20(4), 138–147.
Schroeder, N. L., & Adesope, O. O. (2013). How does a contextually-relevant peer pedagogical agent in a learner-attenuated system-paced learning environment affect cognitive and affective outcomes? Journal of Teaching and Learning with Technology, 2(2), 114–133.
Schroeder, N. L., & Adesope, O. O. (2014). A systematic review of pedagogical agents' persona, motivation, and cognitive load implications for learners. Journal of Research on Technology in Education, 46(3), 229–251.
Schroeder, N. L., & Adesope, O. O. (2015). Impacts of pedagogical agent gender in an accessible learning environment. Journal of Educational Technology & Society, 18(4), 401.
Schroeder, N. L., Romine, W. L., & Craig, S. D. (2017). Measuring pedagogical agent persona and the influence of agent persona on learning. Computers & Education, 109, 176–186.
Stern, S. E. (2008). Computer-synthesized speech and perceptions of the social influence of disabled users. Journal of Language and Social Psychology, 27(3), 254–265.
Sweller, J. (2010). Element interactivity and intrinsic, extraneous, and germane cognitive load. Educational Psychology Review, 22(2), 123–138.
Van Merriënboer, J. J., & Sweller, J. (2005). Cognitive load theory and complex learning: Recent developments and future directions. Educational Psychology Review, 17(2), 147–177.
Van Mulken, S., André, E., & Müller, J. (1998). The persona effect: How substantial is it? People and Computers XIII: Proceedings of HCI 98, 53–66.
Vanlancker-Sidtis, D. (2003). Auditory recognition of idioms by native and nonnative speakers of English: It takes one to know one. Applied Psycholinguistics, 24(1), 45–57.
Van der Meij, H., Van der Meij, J., & Harmsen, R. (2015). Animated pedagogical agents effects on enhancing student motivation and learning in a science inquiry learning environment. Educational Technology Research & Development, 63(3), 381–403.
Veletsianos, G. (2012). How do learners respond to pedagogical agents that deliver social-oriented non-task messages? Impact on student learning, perceptions, and experiences. Computers in Human Behavior, 28(1), 275–283.
Woo, H. L. (2008). Designing multimedia learning environments using animated pedagogical agents: Factors and issues. Journal of Computer Assisted Learning, 25(3), 203–218.
Wouters, P., Paas, F., & van Merriënboer, J. J. (2008). How to optimize learning from animated models: A review of guidelines based on cognitive load. Review of Educational Research, 78(3), 645–675.

Robert O. Davis is an assistant professor in the Department of English Linguistics and Language Technology at Hankuk University of Foreign Studies. His current research interests involve pedagogical agent gesturing, social acceptance of virtual characters, interaction with computer-based environments, and virtual reality in the foreign language classroom.

Joseph Vincent is a professor in the Department of English for International Conferences and Communication at Hankuk University of Foreign Studies. His current research interests involve mixed media learning and teaching.

Taejung Park is a research professor in the Education Advancement Centre at Hankuk University of Foreign Studies in Seoul, South Korea.
Her current research interests focus on instructional design, MOOCs, AR/VR/MR, software education, flipped learning, and future school.