A Mandarin edutainment system integrated virtual learning environments


Available online at www.sciencedirect.com

Speech Communication 55 (2013) 71–83 www.elsevier.com/locate/specom

A Mandarin edutainment system integrated virtual learning environments ☆

Yue Ming a,*, Qiuqi Ruan a, Guodong Gao b

a Institute of Information Science, Beijing JiaoTong University, Beijing 100044, PR China

b Beijing Traffic Control Technology CO. Ltd., Beijing 100044, PR China

Received 3 July 2010; received in revised form 1 June 2012; accepted 28 June 2012. Available online 10 July 2012

Abstract

In this paper, a novel Mandarin edutainment system is developed for learning Mandarin in immersive, interactive Virtual Learning Environments (VLE). Our system comprises two main parts: speech technology support and virtual 3D game design. First, 3D face recognition technology is introduced to distinguish individual learners and provide personalized learning services based on their characteristics. Then, a Mandarin pronunciation recognition and assessment scheme is constructed using state-of-the-art speech processing technology. Because Mandarin rhythm differs markedly from that of Western languages, we integrate prosodic parameters into the recognition and evaluation model to highlight Mandarin characteristics and improve evaluation performance. To promote the engagement of foreign learners, we embed our technology framework into a Virtual Reality (VR) game environment. The character design reflects traditional Chinese culture, and the plots balance pronunciation learning against learners’ interest while providing scoring feedback at the same time. In the experimental design, we first test the correlation of recognition results and machine scores with the different errors and human scores. Then, we evaluate the usability, likeability, and knowledgeability of the whole VLE system. We divide the learners into three categories according to their Mandarin levels, and they provide feedback via a questionnaire. The results show that our system can effectively promote foreign learners’ engagement and improve their Mandarin level.

© 2012 Elsevier B.V. All rights reserved.

Keywords: Mandarin learning; Pronunciation evaluation; Virtual Reality (VR); Edutainment; Virtual learning environment (VLE); 3D face recognition

☆ This work is supported by National Natural Science Foundation (60973060), the Research Fund for the Doctoral Program (20080004001), the Beijing Program (YB20081000401), and the Fundamental Research Funds for the Central Universities (2009YJS025). The Informedia digital video understanding lab at Carnegie Mellon University provides portions of the experimental materials and environments. The authors would also like to thank the Associate Editor and the anonymous reviewers for their helpful comments.

* Corresponding author. Tel.: +86 10 51682936. E-mail address: [email protected] (Y. Ming).

0167-6393/$ - see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.specom.2012.06.007

1. Introduction

Computer-assisted language learning (CALL) (Amory et al., 1998; Conati and Zhou, 2002; Kearney, 2004) is a continuously developing topic. As communication among countries has become increasingly frequent, more and more people urgently need to master one or more foreign languages. Mandarin, as one of the most widely spoken languages, is receiving greater attention. With the rapid development of speech processing technology, automatic pronunciation recognition and evaluation have become well suited to learners studying the language by themselves. However, although traditional speech-based classes have laid a solid foundation for identifying learners’ levels and correcting mispronunciation, learners easily become bored by extended study without any interaction. To better simulate real-life communication and provide personalized services, a virtual learning environment (VLE) is merged with the speech technology support, and 3D face recognition engages learners in live communication according to individual needs. Immersive, interactive 3D virtual games can be designed to promote high levels of motivation by


Y. Ming et al. / Speech Communication 55 (2013) 71–83

providing “free form” storylines based on the learners’ ages, native languages, and cultural backgrounds. In our system, the learner takes the role of a commander once he or she has been identified by 3D face recognition technology. The learner controls the gestures and behaviors of virtual characters that he or she builds according to personal preference, and can also communicate with virtual humans via interactive dialogs and actions.

In this paper, a complete virtual Mandarin learning environment is constructed, composed of real-time pronunciation processing and interactive scene design. First, pronunciation recognition and assessment is an indispensable component of real-time language learning. Over the last several decades, considerable research has been devoted to this area, providing a solid technical foundation for our system. The SRI speech group (Franco et al., 1997; Neumeyer et al., 2000; Franco et al., 2000; Neumeyer et al., 1996) first proposed a scheme that can verify the overall quality of a learner’s pronunciation. The speech groups of Cambridge University and the AI lab of MIT (Witt and Young, 2000; Witt, 1999) carried out joint research on CALL systems. Their work can effectively identify pronunciation errors in Western languages, with evaluation based on phone-level rather than syllable-level pronunciation. The VICK system (Cucchiarini et al., 1998; Cucchiarini et al., 1997), constructed by the University of Nijmegen, extended the evaluation to the prosodic level. They summarized the relationship between human scoring and prosodic information, including fluency, segmental quality, and stress. Their results showed that prosody is crucial for smooth pronunciation and communication. Research from Tokyo University and Kyoto University (Raux and Kawahara, 2002; Tsubota et al., 2004) analyzed the relative importance of different phonemes during language learning. The different kinds of pronunciation errors were also investigated in terms of proficiency. More recently, structural representation has been introduced to aid pronunciation assessment; it can effectively reflect the structure and high-level language semantics of speech by non-native learners (Minematsu, 2004; Asakawa et al., 2005). de Wet et al. (2009) explored a scheme for large-scale oral pronunciation assessment. Their research focused on proficiency and listening comprehension for fairly advanced students of English as a second language. However, these systems, which are based on Western languages, cannot be used directly for Mandarin learning because Mandarin differs markedly from those languages. Our broad investigation of the pronunciation differences between Mandarin and Western alphabetic languages indicates that prosodic and tone features play an important role in accurate recognition and evaluation. In our system, we incorporate a prosodic model into the traditional speech recognition framework. In the experimental section, we analyze in detail the detection and evaluation of Mandarin pronunciation in a real-time audio-interaction system.

Our surveys show that simple audio-interaction for Mandarin learning lacks intelligence and interest. For example, learners using a traditional learning system cannot perceive a real interactive environment and take corresponding actions according to their understanding of the language. From the ecological perspective (Hodges, 2009), language originates from the interaction between people and the environments in which they live. In ecolinguistics, an agent’s actions reflect the interrelational transactions between learners and their simulated environments, and help in understanding the culture and policies that sustain a learner’s behavior. Speech technology alone provides only a reduced view of the notion of emergence in terms of precision, rigor and success. The ecological approach (Van Lier, 2000) asserts that the perception and personal behaviors of the learners, and particularly the interaction in which the learners participate, are critical to understanding the concepts behind Mandarin utterances. Immersing learners in an interactive environment can effectively help them acquire language skills while interacting within it. There have already been a number of research efforts in language education that aim to promote learner motivation (Young, 2004; Young et al., 2000). As one prominent example, Lewis Johnson et al. at Alelo, Inc. (Johnson et al., xxxx) have embraced a game-based method for language learning for many years. In their system, they construct a real-life community by integrating scenario dialogs and cultural common sense, and their technological and pedagogical innovations have proved quick and economical. However, for Mandarin, they have not developed an immersive, interactive VLE that effectively reflects Mandarin characteristics and cultural background.

In this paper, we focus on building a complete Mandarin edutainment system by incorporating Mandarin pronunciation recognition and evaluation, and immersing the student in interactive virtual learning games involving Chinese history and culture. The rest of this paper is organized as follows. In Section 2, we outline our system and summarize its main contributions. A real-time Mandarin recognition and pronunciation evaluation scheme is described in Section 3. In Section 4, we discuss the construction of the VLE and present our Mandarin educational virtual game. In Section 5, we evaluate the performance of our system in terms of the accuracy of pronunciation recognition and assessment and the usability, likeability, and knowledgeability of our VLE game design. Finally, we conclude in Section 6.

2. System framework and main contributions

According to previous research (Amory et al., 1998; Conati and Zhou, 2002; Kearney, 2004; Hodges, 2009; Van Lier, 2000; Young, 2004; Young et al., 2000), we focus on the major challenges affecting real-time interactive


Mandarin pronunciation learning and present a new system framework in a Virtual Reality (VR) game environment. (Virtual Reality is a term for computer-simulated environments that create a lifelike experience, simulating physical presence in places in the real world as well as in imaginary worlds, with sound delivered through speakers (Hodges, 2009).) First, 3D face recognition is used to identify individual learners and provide personalized learning materials according to their preferences (Ming and Ruan, 2012). The Mandarin learning system then consists of four important elements: the speech recognition interface, pronunciation evaluation, the virtual game environment, and the plot development, as shown in Fig. 1. In the past few decades, considerable effort has been devoted to speech recognition, and this technology has become quite mature for standard Mandarin. The speech recognition interface identifies what the foreign learner says and displays the recognition results on the screen for reference. Based on the distinctive characteristics of Mandarin, the pronunciation model used in our proposed system is combined with prosodic parameters. The confidence of pronunciation detection, as the evaluation standard, is then converted into a score for the foreign speaker’s Mandarin pronunciation. The results in the experimental section show that the machine scores of pronunciation evaluation in our system are quite close to the human scores. In our system, the learners are exposed to a VR game environment built around a young man’s Eastern adventure. At the beginning of the scenario, each learner can design an intelligent virtual character according to personal preference. The character is placed in the virtual environment and reflects real-time communication based on the learner’s Mandarin pronunciation.
Once the character appears, a magic box is prepared for it, which can take it to an island of the learner’s choice. There is a guardian angel ruling the island, and the angel has a conversation with the character in Mandarin. The score of the pronunciation evaluation contributes to the plot development. The system processes the learner’s Mandarin utterance, comparing the recognition results with the standard answers. If the answer spoken by the learner is completely illogical in relation to the angel’s question, the angel will repeat the question, and the system will set the output score to 0 for this attempt. Otherwise, if the answer is reasonable, the system will assign different scenarios based on the evaluation score. A series of exciting Eastern adventures then begins, combining Mandarin pronunciation and cultural background in a VR game environment. For beginners, the game is relatively easy, and winning is quite simple. As Mandarin pronunciation improves, the difficulty level gradually increases. The pronunciation score threshold is also updated according to the learner’s progress, and corresponding rewards are provided. If no learning progress is being made, the punishment system begins to work. If consecutive errors, including low pronunciation scores and unreasonable answers, exceed five, a severe punishment sends the virtual character back to the beginning of the game or to the last checkpoint. If learners speak and behave well, a set of incentive mechanisms provides special magic cards for reducing the difficulty of the conversation, skipping questions, or even entering more interesting scenes. Both the punishment and incentive strategies are used to increase playability and fun in our Mandarin learning edutainment system.

Fig. 1. The flow chart of our proposed Mandarin edutainment system.

In our framework, a novel scheme is proposed to improve the Mandarin pronunciation level in a VR game environment. The main contributions of our scheme can be summarized as follows:

1. Accuracy: Extensive investigation of the differences between Mandarin and Western languages shows that prosodic features are the cornerstone of good pronunciation, especially for the four Mandarin tones. A syllable takes its meaning from both its sound and its tone, which is difficult for foreign learners.
In our system, we combine the prosodic model with the classical speech processing framework, which provides promising results for Mandarin recognition and evaluation. Detailed technical analysis is given in the following section, and the experimental results also verify this point. Once learners have mastered tones and rhythm, everything else falls into place and great progress can be made.

2. Usability: In our system, learners can communicate in real time simply by speech and simple manipulation. Speech recognition and evaluation are optimized to understand utterances spoken by native and non-native speakers and to provide feedback to the system. Tutorial feedback influences learners’ pronunciation and facilitates the assignment of course content. Interactive feedback that is encouraging and sensitive to the learner’s self-esteem leads to a better learning state and atmosphere than simply telling the learner that his/her responses are right or wrong.


3. Likeability: Increasing the learner’s motivation and engagement is an essential component of learning progress. A VLE with appropriate game design can effectively enhance the intelligence and stimulate the learner’s understanding based on special events or scenarios. Furthermore, learners can be completely immersed in a happy mood by playing the virtual game. Likeability helps raise learning interest, explore new concepts, and build an immersive, imaginative virtual learning space.

4. Knowledgeability: Since Chinese culture plays an important role in the learning of Mandarin, all virtual game scenarios contain Chinese history and culture. The storylines are based on an Eastern adventure. A variety of characters come from classical Chinese mythology, and the scenario design is based on famous scenery. During the learning process, learners not only practice their Mandarin pronunciation, but also enhance their understanding of Chinese culture.
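The plot rules described in this section (a score of 0 and a repeated question for an illogical answer, and a return to the last checkpoint once consecutive errors exceed five) can be sketched as a small state update. This is only an illustrative sketch, not the authors’ implementation: the names `LearnerState` and `handle_answer`, the 0 to 100 score scale, the starting threshold of 60, and the rule that a reasonable answer advances the plot when its score reaches the threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_ERRORS = 5  # the text resets the character once errors exceed five


@dataclass
class LearnerState:
    threshold: float = 60.0      # hypothetical pass score on an assumed 0-100 scale
    consecutive_errors: int = 0  # low scores and unreasonable answers in a row
    checkpoint: int = 0          # index of the last checkpoint reached
    scene: int = 0               # index of the current scene


def handle_answer(state: LearnerState, is_reasonable: bool, score: float) -> str:
    """Update the learner state for one spoken answer and return the event name."""
    if not is_reasonable:
        score = 0.0  # illogical answers are scored 0 and the question is repeated

    if is_reasonable and score >= state.threshold:
        state.consecutive_errors = 0
        state.scene += 1
        return "advance"

    state.consecutive_errors += 1
    if state.consecutive_errors > MAX_CONSECUTIVE_ERRORS:
        # severe punishment: send the character back to the last checkpoint
        state.consecutive_errors = 0
        state.scene = state.checkpoint
        return "reset_to_checkpoint"
    return "repeat_question"
```

Rewards such as the magic cards and the gradual update of the threshold are left out to keep the sketch short; they would hook into the `"advance"` branch.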

3. Real-time Mandarin speech recognition and pronunciation evaluation

In this section, we focus on the technology supporting Mandarin speech recognition and pronunciation evaluation. The flowchart of speech processing is illustrated in Fig. 2. Detailed analysis is given in the following subsections.

Fig. 2. The framework of our speech recognition and pronunciation evaluation system.

3.1. Mandarin speech recognition with prosodic modeling

Maximum a posteriori (MAP) estimation is introduced to the classical speech recognition, and the word sequence with maximum posterior probability is taken as the recognition result (Huang and Lee, 2006; Rabiner and Juang, 1993):

$$W^{*} = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(W)\,P(X \mid W) \tag{1}$$

where $W = \{w_1, w_2, \ldots, w_n\}$ is the word sequence and $w_j$ denotes the $j$th lexical word. $X$ is the sequence of acoustic feature vectors, composed of 12 MFCCs (Mel-Frequency Cepstral Coefficients), a log energy, and their corresponding delta and double-delta values. $P(W)$ is the prior distribution given by a language model, and $P(X \mid W)$ is calculated using an HMM (Hidden Markov Model) acoustic model. In speech processing, a Hidden Markov Model treats the pronunciation sequence as a statistical process, modeled as a Markov sequence with unobserved states (Rabiner and Juang, 1993). Mandarin has distinctive tonal characteristics, and accurate prosodic pronunciation is closely related to speaking ability. A set of prosodic feature vectors $F = \{f_1, f_2, \ldots, f_n\}$, derived from pitch, duration and energy, has been applied to speech recognition (Zhang et al., 2008; Huang and Lee, 2006). Eq. (1) can then be extended to the following formulation:

$$W^{*} = \arg\max_{W} P(W \mid X, F) = \arg\max_{W} P(W)\,P(X, F \mid W) \tag{2}$$

Here, we assume that the acoustic and prosodic features are mutually independent given the word sequence $W$. For each lexical word, the prosodic feature $f_j$ is assumed to be independent of the neighboring words $w_{j-1}$ and $w_{j+1}$ and to depend only on the current word $w_j$. With simple algebra, we get the following equation (Huang and Lee, 2006):

$$W^{*} \cong \arg\max_{W} P(W)\,P(X \mid W) \prod_{j=1}^{n} P(f_j \mid w_j) \tag{3}$$

Then, we introduce a two-pass decoding process with a rescoring stage for effective recognition (Huang and Lee, 2006). In the first pass, a word graph of appropriate size is built based on the traditional acoustic and language models. Then, during the second pass, every word arc is rescored by integrating the prosodic model features $P(f_j \mid w_j)$. The rescoring equation can be derived from the decision tree and maximum likelihood theories:

$$S(W) = \lambda_X P(X \mid W) + \lambda_W P(W) + \lambda_F P(F \mid W) \tag{4}$$

where $S(W)$ is the final score, and $\lambda_X$, $\lambda_W$, $\lambda_F$ are the weights for the likelihoods of the acoustic model, language model, and prosodic model, respectively. The probability $P(f_j \mid w_j)$ reflects intonation, rhythm, and stress, which makes it suitable for assessing the pronunciation of speakers whose native language is a Western language. Mandarin uses four tones to distinguish the meanings of words: high level, rising, falling-rising, and falling. Most foreign learners cannot accurately differentiate the tones from each other. For example, “yue (falling)” is quite simple to pronounce for native Mandarin speakers, but most foreign learners have difficulty pronouncing it correctly. Thus, prosodic recognition and detection play an important role in correcting pronunciation errors. We analyze Mandarin pronunciation at two levels, the syllable level and the prosodic level, which allows errors made by foreign learners of Chinese to be identified effectively.
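As a rough illustration of the second decoding pass, the sketch below rescores candidate word sequences with the weighted sum of Eq. (4), taking the prosodic term as the per-word product from Eq. (3). Several things here are assumptions rather than details from the paper: the `Hypothesis` class, the use of a flat candidate list in place of a word graph, the default weights of 1.0, and all numeric values. The sketch follows Eq. (4) as printed; practical decoders usually combine log-likelihoods instead of raw probabilities.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hypothesis:
    words: List[str]
    acoustic: float             # P(X|W) from the first pass
    language: float             # P(W) from the first pass
    prosodic: Sequence[float]   # P(f_j|w_j), one value per word


def rescore(h: Hypothesis, lam_x: float = 1.0, lam_w: float = 1.0,
            lam_f: float = 1.0) -> float:
    """Weighted sum of acoustic, language, and prosodic scores as in Eq. (4)."""
    # Per-word independence assumption of Eq. (3): P(F|W) = prod_j P(f_j|w_j)
    p_f = math.prod(h.prosodic)
    return lam_x * h.acoustic + lam_w * h.language + lam_f * p_f


def second_pass(hypotheses: List[Hypothesis], **weights: float) -> Hypothesis:
    """Return the candidate with the highest rescored value."""
    return max(hypotheses, key=lambda h: rescore(h, **weights))


# Two candidates that differ only in tone; the prosodic term decides between them.
falling = Hypothesis(["yue4"], acoustic=0.5, language=0.2, prosodic=[0.9])
rising = Hypothesis(["yue2"], acoustic=0.5, language=0.2, prosodic=[0.3])
best = second_pass([falling, rising])
```

With identical acoustic and language scores, the candidate whose tone matches the prosodic model wins, which is the effect the prosodic rescoring is meant to have for tonal confusions.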

3.2. Mandarin pronunciation detection

In this subsection, we introduce the confidence measure used to identify whether a pronunciation is correct (Williams and Renals, 1999). A confidence measure quantifies how well the model matches the pronunciation units, where the values indicate the similarity across the whole utterance. Here, a 2 × 2 confusion matrix of the confidence values is used to estimate the unconditional error rate (UER) in Eq. (5):

$$\hat{P}(\mathrm{error}) = \frac{N(\mathrm{reject}(H_0), \mathrm{true}(H_0)) + N(\mathrm{accept}(H_0), \mathrm{false}(H_0))}{N(H)} \tag{5}$$

where $\hat{P}$ is the estimated error probability and $N(H)$ denotes the total number of hypotheses tested. The conditional error rate (CER) can then be divided into two types:

$$\hat{P}(\mathrm{type\ I\ error}) = \hat{P}(\mathrm{reject}(H_0) \mid \mathrm{true}(H_0)) = \frac{N(\mathrm{reject}(H_0), \mathrm{true}(H_0))}{N(\mathrm{true}(H_0))} \tag{6}$$

$$\hat{P}(\mathrm{type\ II\ error}) = \hat{P}(\mathrm{accept}(H_0) \mid \mathrm{false}(H_0)) = \frac{N(\mathrm{accept}(H_0), \mathrm{false}(H_0))}{N(\mathrm{false}(H_0))} \tag{7}$$

Here, posterior probability is chosen as our confidence measure (Williams and Renals, 1999), which can be estimated with the following acceptor acoustic model, where $q_k$ denotes the $k$th state sequence:

$$PP(q_k) = \sum_{n=n_s}^{n_e} \log p(q_k \mid X_1^{n}) \tag{8}$$

Duration normalization is used to balance utterances of different lengths:

$$nPP(q_k) = \frac{1}{D}\,PP(q_k) \tag{9}$$

where $n_s$ is the start time of the utterance, $n_e$ is the end time, and $D$ is the duration used for normalization. The separability of the two utterances can be estimated by the symmetric Kullback–Leibler distance (KLD), which sums the divergence between the utterances evaluated in both directions (Williams and Renals, 1999):

$$d_{KL2} = -\sum_{m=1}^{M} p(nPP(x_{mj}) \mid \mathrm{true}(H_0)) \log \frac{p(nPP(x_{mj}) \mid \mathrm{false}(H_0))}{p(nPP(x_{mj}) \mid \mathrm{true}(H_0))} - \sum_{m=1}^{M} p(nPP(x_{mj}) \mid \mathrm{false}(H_0)) \log \frac{p(nPP(x_{mj}) \mid \mathrm{true}(H_0))}{p(nPP(x_{mj}) \mid \mathrm{false}(H_0))} \tag{10}$$

where $nPP(x_{mj})$ denotes the confidence value for the $m$th word decoding. Once the value of the KLD is larger than the preset threshold, the corresponding utterance is considered a mispronunciation.

3.3. Mandarin scoring based on the acoustic and prosodic parameters

The machine score must be reasonable and well correlated with the human scores. Given the distinctive characteristics of Mandarin, information from both the acoustic model and prosody must be considered in scoring.

3.3.1. HMM log-likelihood scores

The HMM can effectively describe the acoustic likelihoods of spectral observations as scores (Neumeyer et al., 2000). The log-likelihood of each phonetic segment is estimated by summing over its short-time frames:

$$l_i = \sum_{t=s_i}^{s_{i+1}-1} \log\big(p(s_t \mid s_{t-1})\,p(x_t \mid s_t)\big) \tag{11}$$

where $s_i$ is the start time of the $i$th phonetic segment, $x_t$ is the observed spectral vector, and $s_t$ denotes the HMM state at time $t$. The different lengths of the utterances have a severe influence on the log-likelihood scores. Normalization is introduced to balance the effect of duration and compensate for shorter utterances. The local average HMM log-likelihood score $L$ is calculated as follows:

$$L = \frac{1}{N}\sum_{i=1}^{N} \frac{l_i}{d_i} \tag{12}$$

Here $d_i$ is the duration of the $i$th segment and $N$ is the number of segments.
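The duration normalization of Eqs. (11) and (12) can be illustrated with a short sketch. It assumes, for the sake of the example, that each phonetic segment is given as two equal-length lists of per-frame transition probabilities $p(s_t \mid s_{t-1})$ and emission probabilities $p(x_t \mid s_t)$ along an already aligned state path, and that $d_i$ is the number of frames in the segment; the function names and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[Sequence[float], Sequence[float]]  # (transition probs, emission probs)


def segment_log_likelihood(trans: Sequence[float], emit: Sequence[float]) -> float:
    """Eq. (11): sum over the frames of log(p(s_t|s_{t-1}) * p(x_t|s_t))."""
    return sum(math.log(a * b) for a, b in zip(trans, emit))


def local_average_score(segments: List[Segment]) -> float:
    """Eq. (12): average over segments of the log-likelihood divided by duration."""
    total = 0.0
    for trans, emit in segments:
        d_i = len(trans)  # assumed: duration measured in frames
        total += segment_log_likelihood(trans, emit) / d_i
    return total / len(segments)


# A two-frame segment and a one-frame segment with the same per-frame likelihood.
segments = [([0.5, 0.5], [0.5, 0.5]), ([0.25], [1.0])]
score = local_average_score(segments)
```

Both segments contribute the same per-frame value, so the longer segment does not dominate the score, which is the purpose of dividing each $l_i$ by $d_i$.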

3.3.2. Magnitude scores

The magnitude score is a key indicator of prosodic information and is closely related to the pronunciation assessment. It is computed by averaging over the short-time speech frames:

$$\mathrm{aveMag}(n) = \frac{1}{M}\sum_{m=0}^{M-1} |S_n(m)|, \quad n = 0, 1, \ldots, N-1 \tag{13}$$


where $S_n(m)$ denotes the magnitude of the $n$th frame. In our system, interpolation is used to normalize the different lengths of feature vectors, and linear scaling is introduced to compensate for environmental diversity. Euclidean distance is used to evaluate the similarity between the standard and input utterances.

3.3.3. Segment duration scores

The duration parameter can be used to evaluate the rate of speech (ROS) of different individuals (Neumeyer et al., 2000). The segment duration score is calculated from the Viterbi phonetic alignment. The value is obtained by averaging, over all segments, the log-probability of the normalized duration of the $i$th segment:

$$D = \frac{1}{N}\sum_{i=1}^{N} \log\big(p(f(d_i) \mid q_i)\big) \tag{14}$$

where $f(d_i) = d_i \cdot ROS_S$ is the normalized duration, $ROS_S$ is the rate-of-speech estimate for the particular speaker $S$, and $q_i$ is the phone corresponding to the $i$th segment.

3.3.4. Timing scores

Empirical studies show that non-native speakers cannot speak as quickly as natives, and speaking rate can effectively reflect fluency and tends to be used as a scoring indicator (Neumeyer et al., 2000). The Euclidean distance based on DTW, denoted as $\mathrm{dist}$, evaluates the similarity between the input speech sequence and the standard utterance, and is converted into a score as follows:

$$\mathrm{score}_{pho} = \frac{100}{1 + a\,(\mathrm{dist})^{b}}, \quad a, b > 0 \tag{15}$$

Then, we calculate the weighted score based on the four types of parameters:

$$\mathrm{score}_{sen} = w_1\,\mathrm{score}_{fea_1} + w_2\,\mathrm{score}_{fea_2} + w_3\,\mathrm{score}_{fea_3} + w_4\,\mathrm{score}_{fea_4} \tag{16}$$

where $fea_1$, $fea_2$, $fea_3$, and $fea_4$ denote the HMM log-likelihood, magnitude, segment duration, and timing scores, respectively. By simple downhill search, we can estimate the values of the weights. Finally, we integrate the machine scoring scheme with the recognition and detection system. The system shows the recognition results of the input and provides the pronunciation scores on the screen simultaneously, as in Fig. 3. The detailed analysis of performance is described in the experimental section.

Fig. 3. An example of Mandarin speech recognition and pronunciation evaluation system.

4. Virtual 3D game design

In this section, we describe how to construct the VLE system and design the Mandarin edutainment game.

4.1. Construction of VLE system

To pursue full color, a wide field of view, and realistic display, Virtools (Virvou and Katsionis, 2008; Pan et al., 2006), a development and deployment platform, is introduced into our system to construct an immersive 3D game scenario. The interactive 3D content creation is shown in Fig. 4, and supports a variety of 3D file storage formats. Thus, the tools are well suited to our applications and make it easier to create and edit virtual environments and 3D objects. The construction of the Mandarin edutainment system can be divided into three components as follows and as shown in Fig. 4.

Fig. 4. The interactive 3D content creation by Virtools.

1. 3D Layout (top left): The layout window shows the project being edited in real time and can be used to create, select, and manipulate the different 3D Entities. The navigation tools can drive the motion of 3D objects.

2. Building Blocks and Data Resource (top right): The Building Blocks Resource is a group of classes containing useful functions, especially the behavior Building Blocks (BBs). The data resource is the storage area for all the media files, including 2D sprites, 3D entities, 3D sprites, Behavior Graphs, Characters, Materials, Audios, and Videos. The user can easily create and manipulate a Data Resource or add a media file to it.

3. Schematic (bottom): In this area, the scripts can be used to view, edit, and debug the scenarios and the behaviors of characters. The schematic is simply a visual representation of a special behavior attached to a behavior character.

Virtools allows users to create rich, interactive scripts in a VR environment and manipulate them in real time; it not only allows BBs to be dragged and dropped into the Schematic, but also provides a way to create special BBs (http://a2.media.3ds.com/products/3dvia/3dvia-virtools/; Li et al., 2007). We create our Mandarin edutainment system with the Virtools SDK on the VC++ 6.0 platform, and integrate the speech recognition and pronunciation evaluation technology. We use Virtools to build our VLE system, which puts learners in a happy mood by letting them interact with the virtual environment. The VLE system can effectively engage learners’ interest in certain behaviors or pronunciations, especially those for which traditional methods have encountered obstacles or difficulties.

4.2. The Mandarin edutainment VR game design

Virtual Reality (VR) is a technological breakthrough for game design. The research and application of VR technology can effectively improve learning efficiency and arouse interest in language learning compared with existing instructive strategies. Our edutainment curriculum consists of 15 modules with 11 units focusing on Mandarin word pronunciation, and each module lasts approximately two hours with audio-interaction. The main 3D scenario of our Mandarin edutainment game is a desert island, surrounded by a river, swamp, sky and so on.
The major roles include the virtual human created by the learners based on their personal preferences, four great classical roles of Chinese mythology (Monkey King, Monk Sha, Kwan-yin, and Bodhisatta), and a guardian angel. The learner can navigate through the scenarios by simple voice commands and achieve the aims of pronunciation exercise and exploration step by step. At the beginning of the game plot, the learners encounter the guardian angel, gods, or spirits at each game checkpoint. At this point, pronunciation practice is based on the questions raised by the angel, and answering them is the only way to access the next scene. Each answer is given a score by the speech recognition and pronunciation evaluation module embedded in the system. When the score of a valid answer is higher than the preset threshold, the next question or scene becomes available. We show an example in Fig. 5. In this case, the angel asks the virtual human “What is the scene behind me?” The learner answers “he liu” (meaning “river” in English) by audio-interaction with a satisfactory pronunciation. Then, our system shows the recognition result and machine score at the bottom of the screen, as illustrated in Fig. 5. Otherwise, the learner needs to practice the same pronunciation again. If the learner gives an unreasonable answer, the angel may repeat the question, set the output score to 0, and give hints to make the problem easier.

Fig. 5. Correct answer interface with satisfactory speech content and pronunciation score.

As the plot develops, the game incorporates many adventure elements (Virvou and Katsionis, 2008). The game is won by improving Mandarin utterances and ultimately finding the four great classical roles of Chinese mythology. To achieve these aims, the learners have to pass all the hurdles guarded by the angels and accumulate a specific number of points. For example, the virtual human created by the learner encounters a swampy farmland in the virtual world, as shown in Fig. 5. The guardian angel presents a problem related to Chinese culture, or to the objects or landscape surrounding the character. The virtual human is required to answer in fluent Mandarin, supplied by the learner through audio-interaction. Then, the system matches the learner’s pronunciation with the standard utterance and gives a score based on the technology discussed in Section 3. If the score reaches or exceeds the predefined threshold, the learner receives award points for this hurdle, and the angel allows him/her to access the next hurdle, which has harder questions but more points and leads toward the ultimate goal. Besides the special scenes, the learners may also meet objects that they can click on or manipulate. These objects appear randomly and may give hints or raise barriers as the learners go through the hurdles. For example, if the learners have difficulty with particular Mandarin pronunciations, hints can provide words with similar or identical pronunciation, or provide scene and meaning tips to facilitate their Mandarin learning.
The hints described above, however, are not immediately available to the learners: they must first accumulate sufficient points in the previous steps to open the door of another scene. Hence, the learners have to remember as much of the pronunciation and Mandarin knowledge as they can, so that they earn more points when they meet the objects containing hints or barriers. From an instructional standpoint, these objects can efficiently motivate the students to memorize the important parts of the language they are learning. As part of the adventure game, special bonuses play an indispensable role in progressing through the game and defeating enemies. The bonuses may include weapons and related services. In our educational system, a learner who achieves 10 consecutive correct pronunciations has an opportunity to obtain a key for a guarded door. If the learner provides the correct Mandarin pronunciation and a reasonable answer to the questions raised by the keepers, he/she can enter the door with the key and pick up the bonuses. The size of the bonus depends on the difficulty level of the questions. Penalties are also accrued if the learner's pronunciation makes no progress during a certain amount of time. During learning, the learners can be completely immersed in the virtual world, using only the mouse and audio interaction. Our system also provides a 2D map to facilitate the learners' navigation. As the learner progresses through certain learning units, he/she will have the opportunity to see the desert island panorama. The higher the unit, the closer the learner is to winning the game. When the learners solve all the problems in Mandarin pronunciation, a new character takes them to the next scene, as demonstrated in Fig. 6. Based on the above design, our humanized system can foster and motivate the interest of foreign learners to a large extent.

Fig. 6. Kwan-yin appeared.

From the perspective of VLE, our edutainment system not only enriches the form of language learning, but also provides a well-rounded, multi-angle view of the teaching materials, which enriches the learning experience and helps learners analyze pronunciation characteristics and grasp new knowledge about Chinese culture. All kinds of learners can share a public virtual learning space to discuss their understanding of, and obstacles with, certain events or pronunciations.

Compared with current instructional strategies without any VR game, our system shows an obvious improvement in the learners' communication skills. We evaluate the performance of our system from four aspects in the following section.

5. Experiments

In this section, we evaluate the performance of our proposed Mandarin edutainment system based on four aspects: the accuracy of speech recognition and pronunciation evaluation, usability and likeability with and without the VLE system, and knowledgeability from the perspective of linguistics and culture. Our edutainment course is composed of 11 units, each with a certain number of Chinese phrases for practicing Mandarin pronunciation. The system records the pronunciation scores in real time. The learners are divided into three levels according to their pronunciation scores: novice, intermediate, and experienced. To quantify the system performance, a questionnaire is used to record the corresponding experimental results.

5.1. Accuracy of speech recognition and pronunciation evaluation

In this subsection, we examine the accuracy of our speech processing technology in terms of recognition results and evaluation scores. First, the recognition errors can be classified into disparateness, substitution, and insertion. Disparateness means that a whole phrase is recognized completely incorrectly. Substitution means that only a fraction of a phrase is recognized correctly, and insertion means that the recognition system inserts new words which are not contained in the original pronunciation spoken by the learner. In Table 1, we list the error rates for different ranges of the machine scores. The results indicate that the output of our system is consistently close to the ground truth, especially for disparateness. The substitution and insertion errors are closely related to the learners' language background. In future research, we will develop a personalized speech recognition system based on the learners' language habits. From Table 1, we can conclude that the machine scores are related to the recognition results. As the scores increase, the disparateness errors correspondingly decrease. In the range of 0.6–0.9, parts of phrases may contain mistakes. Once the score reaches 0.9 or above, the errors decrease significantly.

Table 1
The relation between the error rate and machine scores.

Machine scores    Disparateness    Substitution    Insertion
0–0.60            54.17%           5.87%           4.15%
0.60–0.90         12.34%           11.18%          9.34%
0.90–1            0                5.72%           3.15%
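As a rough illustration of how a table such as Table 1 could be tabulated, the sketch below bins recognized phrases by machine score and computes the rate of each error type per bin. The paper does not describe its own tabulation procedure, so the input format, the assumption that each phrase carries at most one error label, the function names, and the sample records are all hypothetical; only the bin edges are taken from Table 1.

```python
from collections import Counter

# Bin edges follow Table 1; the last bin also includes a score of exactly 1.0.
BINS = [(0.0, 0.60), (0.60, 0.90), (0.90, 1.0)]
ERROR_TYPES = ("disparateness", "substitution", "insertion")


def bin_index(score):
    """Return the index of the Table 1 score range that contains `score`."""
    for i, (low, high) in enumerate(BINS):
        if low <= score < high or (i == len(BINS) - 1 and score == high):
            return i
    raise ValueError(f"score {score} is outside [0, 1]")


def error_rates_by_score(records):
    """records: iterable of (machine_score, error_type or None).

    Returns one dict per bin that maps each error type to its rate,
    i.e. phrases with that error divided by all phrases in the bin.
    """
    totals = [0] * len(BINS)
    errors = [Counter() for _ in BINS]
    for score, error in records:
        i = bin_index(score)
        totals[i] += 1
        if error is not None:
            errors[i][error] += 1
    return [
        {e: (errors[i][e] / totals[i] if totals[i] else 0.0) for e in ERROR_TYPES}
        for i in range(len(BINS))
    ]


# Hypothetical records, not data from the experiments.
sample = [(0.3, "disparateness"), (0.5, None), (0.7, "substitution"), (0.95, None)]
for (low, high), rates in zip(BINS, error_rates_by_score(sample)):
    print(f"{low:.2f}-{high:.2f}", rates)
```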

Table 2
The correlation of the assessment results between the machine scores and human scores.

Phones    Syllables    Words    Phrases
0.75      0.78         0.83     0.81
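The correlations in Table 2 are obtained with Eq. (17), which has the form of the standard Pearson correlation coefficient between paired machine and human scores. A minimal sketch of that computation is given here; the function name `pearson_correlation` and the sample scores are hypothetical and are not data from the experiments.

```python
import math


def pearson_correlation(a, b):
    """Pearson correlation between two equally long score sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Denominator terms: sums of squared deviations.
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical machine scores (0 to 1) and human scores (1 to 5) for five phrases.
machine = [0.35, 0.50, 0.62, 0.80, 0.93]
human = [2.0, 2.5, 3.0, 4.5, 4.0]
print(round(pearson_correlation(machine, human), 3))
```

Because the coefficient is invariant to the scale of either score, machine scores in [0, 1] can be compared directly with human ratings on a different scale.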

Second, we evaluate the correlation between the machine scores and human scores, which is an important indicator for verifying the performance of the pronunciation assessment. We randomly choose $N$ phrases pronounced by different learners. The correlation between the machine scores $(A_1, \ldots, A_N)$ and human scores $(B_1, \ldots, B_N)$ can be calculated by the following equation:

$$\mathrm{Corr}_{A,B} = \frac{\sum_{i=1}^{N} (A_i - \bar{A})(B_i - \bar{B})}{\sqrt{\sum_{i=1}^{N} (A_i - \bar{A})^2 \sum_{i=1}^{N} (B_i - \bar{B})^2}} \tag{17}$$

where $\bar{A} = \frac{1}{N}\sum_{i=1}^{N} A_i$ and $\bar{B} = \frac{1}{N}\sum_{i=1}^{N} B_i$. In Table 2, we list the correlation at different levels of Mandarin pronunciation, including phonemes, syllables, words, and phrases. The results show that our evaluation scheme obtains a satisfactory correlation with the human scores, especially at the word level. It can be seen that combining the prosodic parameters for pronunciation evaluation can effectively reflect the characteristics of Mandarin and improve the evaluation performance. In addition, more evaluation factors should be taken into consideration to further improve the performance, for example, the learners' language background, cultural habits, and Mandarin linguistic properties.

5.2. Usability

In this subsection, we test the usability of our proposed Mandarin edutainment system. The evaluation is based on pronunciation improvement compared with traditional 2D learning systems that lack a VLE and a virtual 3D learning game. For the learners' Mandarin accent, we focus on fluency, comprehensibility, grammar, and vocabulary. In the case of VLE game scenarios, three interactive features need to be considered: learner interface acquaintance, VR navigation, and VLE distractions (Virvou and Katsionis, 2008).

Table 3
Accuracy improvement based on the different level learners.

Accuracy            Novice (%)    Intermediate (%)    Experienced (%)
Traditional ones    59.17         42.41               34.19
Proposed ones       71.45         59.12               43.29

First, we test the pronunciation improvement of our system compared with a similar application without any 3D VR game. The evaluation indicator is the rate of accuracy improvement. We define accuracy as follows:

$$\text{accuracy improvement} = \frac{(N - D_{\mathrm{after}} - S_{\mathrm{after}} - I_{\mathrm{after}})/N}{(N - D_{\mathrm{before}} - S_{\mathrm{before}} - I_{\mathrm{before}})/N} \tag{18}$$

where $D_{\mathrm{before}}$, $S_{\mathrm{before}}$, $I_{\mathrm{before}}$ and $D_{\mathrm{after}}$, $S_{\mathrm{after}}$, $I_{\mathrm{after}}$ are the total numbers of disparateness, substitution, and insertion errors before and after using our learning system, respectively, and $N$ is the total number of labels in the whole course. We show the results in Table 3 for the three types of learners. As shown in Table 3, our proposed system achieves a significant improvement in the learners' accent skills across all levels, especially for novice learners. Traditional 2D education software with a simple user interface uses hypertext to display domain theory and pronunciation results through forms, dialogue boxes, buttons, drop-down menus, etc., which easily bores the learners after a short period of study. Empirical study shows that the VR environment and animated speaking agents can effectively simulate the interaction between persons and their environment. The significant pronunciation improvement of novice learners in Table 3 demonstrates the excellent performance of our proposed system compared with the traditional 2D systems. From the ecological perspective, our 3D VR system reveals a greater potential for learning. 3D face recognition technology can identify the learners for personalized learning services (Ming and Ruan, 2012), and 3D motion analysis can adjust the learning materials in real time to improve the learning efficiency (Ming et al., 2012). In our Mandarin courses, we design relatively simple interactive practices between the learner-specified agents and the VR environment. As the learners' skills and interactive ability progress, the proposed system continuously improves the interactive VR scenarios and provides increasingly difficult and suitable learning materials to satisfy the personalized demands of the learners.
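The accuracy measure of Eq. (18) can be sketched in a few lines, assuming the equation is read as the ratio of the accuracy after using the system to the accuracy before. The function names and the error counts below are hypothetical and are not taken from the experiments.

```python
def accuracy(n_labels, disparateness, substitution, insertion):
    """Fraction of labels that remain after subtracting the three error counts."""
    if n_labels <= 0:
        raise ValueError("n_labels must be positive")
    return (n_labels - disparateness - substitution - insertion) / n_labels


def accuracy_improvement(n_labels, before, after):
    """Accuracy after divided by accuracy before (assumed reading of Eq. 18).

    `before` and `after` are (D, S, I) tuples of error counts.
    """
    acc_before = accuracy(n_labels, *before)
    acc_after = accuracy(n_labels, *after)
    if acc_before == 0:
        raise ValueError("accuracy before is zero; the ratio is undefined")
    return acc_after / acc_before


# Hypothetical counts (D, S, I) for a course with 200 labels.
before = (60, 25, 15)  # accuracy before = 100 / 200
after = (20, 10, 10)   # accuracy after  = 160 / 200
print(accuracy_improvement(200, before, after))
```

A value above 1 under this reading means that fewer recognition errors occur after training than before.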
The Mandarin skill improvement in our system relies heavily on the ever-refining perception cycle, which distinguishes it from the traditional 2D system that merely provides the pronunciation result and evaluation scores. The accuracy ratings correlate highly with fluency, comprehensibility, grammar, and vocabulary. Combining the immersive, interactive, and imaginative advantages of our system, every study unit and game is designed to match the personalized preferences of the learners and incorporates numerous innovations spanning interactive simulations, an intelligent navigation system, speech processing, and so on. The learners tend to reach extensive levels of engagement and practice by driving the virtual humans to exhibit behaviors that are appropriate to the Chinese culture, which is quite suitable for language learning and pronunciation improvement.

Next, we evaluate the usability of our VR Mandarin educational game based on the interactive features for our language instruction goals. Learner interface acquaintance refers to the skill of operating the VLE system for learners of different levels (Virvou and Katsionis, 2008). VR navigation is concerned with the learner's behavior and measures the time spent looking for the correct way. The last indicator is VLE distraction, which relates to the learner's attention, i.e., whether the learner focuses on educational goals or on amusement. In our system, we introduce some assistant features to help the learners, such as an inventory, a map, and hints. We record the time spent on these features, and the improvement in the second hour of use compared with the first hour, to evaluate the learner's skill in navigating the interface. The more time is wasted, the lower the usability. Some of the collected data are displayed in Table 4. It is obvious that more experience leads to less time spent seeking assistance, and therefore experienced learners tend to use fewer assisting features than the other two types of learners. The data show that experienced learners can easily find their way without frequently using the assisting features. In addition, in the second hour spent in the system, novice learners improve significantly compared with the first hour once they are better acquainted with the interface, which indicates that our system is quite easy to grasp.

Table 4
The performance of user interface acquaintance. Groups 1, 2, and 3 give the percentage of use of the inventory, map, and hints, respectively, over the whole learning process. WMT 1, 2, and 3 denote the time wasted by not using the inventory, map, and hints, respectively. TT 1, 2, and 3 denote the total time for each learner without using the inventory, map, and hints, respectively. Improved 1, 2, and 3 denote the improvement of Mandarin skills with the inventory, map, and hints, respectively, comparing the second hour of playing with the first hour.

People        Novices      Intermediate    Experienced
Group 1       65.17%       42.47%          12.32%
WMT 1         75 s         52 s            32 s
TT 1          7.97 min     5.36 min        2.18 min
Improved 1    47.57%       23.62%          11.15%
Group 2       54.18%       32.52%          21.84%
WMT 2         72 s         59 s            32 s
TT 2          12.41 min    8.53 min        3.82 min
Improved 2    45.36%       9.78%           5.34%
Group 3       48.31%       23.15%          11.08%
WMT 3         254 s        148 s           107 s
TT 3          7.25 min     5.02 min        3.46 min
Improved 3    19.72%       11.75%          5.23%

Second, the performance of VR navigation is measured in two ways: losing the way and inability to move (Virvou and Katsionis, 2008). The first one describes the case that

the learners may be stuck in difficulties of movement around the virtual environments. The second one is that the learners may encounter obstructions which prevent forward movement, such as virtual objects and walls. The wasted time, used as the measurement of VR navigational problems, is shown in Table 5. We can see that the experienced learners have fewer difficulties with navigation than the novice learners, which leaves them more time to read the related Mandarin theories and consolidate their pronunciation. As the learners advance in the game, the time spent on these problems clearly decreases for the novice and intermediate learners, who can then get more benefit from the educational content of the VR game. The improvements reflect the power of our system to guide the learners out of problematic circumstances.

Table 5
The performance of VR navigation. Groups 1 and 2 give the number of occurrences of losing the way and inability to move, respectively. WMT 1 and 2 denote the wasted time in the two cases, TT 1 and 2 denote the total time in the two cases, and Improved 1 and 2 denote the improvement of Mandarin skills in the two cases.

People        Novices      Intermediate    Experienced
Group 1       132 times    79 times        25 times
WMT 1         41 s         22 s            15 s
TT 1          6.35 min     2.74 min        1.51 min
Improved 1    25 times     14 times        5 times
Group 2       207 times    152 times       97 times
WMT 2         35 s         24 s            13 s
TT 2          8.76 min     6.54 min        2.35 min
Improved 2    22 times     14 times        6 times

The ultimate goal of our system is to help learners make great progress in Mandarin pronunciation. However, the VR environment might distract the learner from acquiring Mandarin. In this part, we test how to balance education and entertainment to form edutainment. Statistical information on VLE distraction problems is recorded during two hours of learning time. These data give the average number of distractions for each learner and the total delay time due to distraction. As shown in Table 6, the experienced learners have the fewest distractions (3.5 on average) and the intermediate learners have the most (11.5), while the novice learners incur the shortest total delay. One cause of distraction is that the learners may be attracted by the VR scenes, such as the movements of animated agents, virtual objects, and storytelling, resulting in absent-mindedness. In the future, we will add more personalized options for customizing the virtual environment components to reduce distractions.

Table 6
The performance of VLE distraction. DOT denotes the number of occurrences of distraction and TT denotes the total delay time due to distraction.

Times    Novice        Intermediate    Experienced
DOT      6.5 times     11.5 times      3.5 times
TT       4.57 times    7.18 times      5.93 times

Table 7
Time spent on the different language learning systems.

Time (min)       Novice    Intermediate    Experienced
Non-game         125       367             593
Non-education    325       402             617
Ours             546       737             813

5.3. Likeability

Increasing the learners' motivation and engagement is an important goal for the Mandarin edutainment system. The likeability of our system is measured by two comparative studies (Virvou and Katsionis, 2008): the first compares it with non-game language learning software, and the second compares it with non-education VR game software. The assessment is based on the amount of time learners spend on each application, which shows their preferences. We provide the different applications to all learners for about 900 min and record how much time they spend on each application. The results in Table 7 show a significant preference for our proposed system by all learners in comparison with the other applications, especially among the experienced learners. Overall, the results demonstrate that our system has achieved the aim of being more attractive and motivating than the non-game educational application, and more instructive and knowledgeable than the non-education game application. Novice learners encounter some difficulties getting started, such as Mandarin pronunciation and manipulation of the system, which leaves them slightly bored and irritable. However, as their pronunciation skills improve, learners are increasingly attracted to our system and more frequently immersed in the Mandarin learning environment. After about two hours of using our system, the learners generally reported that the VLE system was more intelligent and enhanced their understanding of certain concepts or pronunciations, especially in areas for which the traditional methods have proven inappropriate or difficult. In combination with knowledge of Chinese culture, this can effectively enrich their ability to analyze problems from the perspective of history and culture. A sharable virtual learning space can motivate the learners to be more interactive and can make their learning more engaging and adventurous.
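To make the time-based comparison in Table 7 easier to read across learner levels, the recorded minutes can be converted into each application's share of the total time per level. The paper itself reports only the raw minutes, so the share is an illustrative derived quantity, and the function name `time_shares` is hypothetical; the minutes are copied from Table 7.

```python
# Minutes spent on each application, copied from Table 7.
TABLE_7 = {
    "Novice": {"Non-game": 125, "Non-education": 325, "Ours": 546},
    "Intermediate": {"Non-game": 367, "Non-education": 402, "Ours": 737},
    "Experienced": {"Non-game": 593, "Non-education": 617, "Ours": 813},
}


def time_shares(minutes_by_level):
    """For each learner level, return each application's share of total time."""
    shares = {}
    for level, minutes in minutes_by_level.items():
        total = sum(minutes.values())
        shares[level] = {app: m / total for app, m in minutes.items()}
    return shares


for level, share in time_shares(TABLE_7).items():
    print(level, {app: round(s, 3) for app, s in share.items()})
```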

5.4. Knowledgeability

In this subsection, we provide the results of a questionnaire on how different Mandarin language elements and cultural knowledge influence the learners during the learning process. A series of questions about our Mandarin edutainment system is posed to collect the views of learners at different levels. The major questions about knowledgeability asked in the questionnaire are as follows:


1. Which pronunciation is more motivating? Consonant, vowels, or tones?
2. Which has the most influence on your fluency? Rhythm, intonation, or stress?
3. Which lesson content do you prefer? Society, history, or culture?
4. Which unit in our system is more appealing? Words, phrases, or sentences?
5. Where do you think the knowledge will be helpful? In class, at home, or in the community?
6. Do you have any other suggestions for our Mandarin edutainment system?

The answers to the questionnaire provided a lot of useful data for evaluating the knowledgeability of our proposed system. With respect to Mandarin pronunciation, we mainly evaluate how motivating the different pronunciation elements (consonants, vowels, and tones) are for the learners. The results of the questionnaire are shown in Table 8. Based on the survey of the different aspects of the Mandarin learning system, we can further analyze the preferences of the learners and alter the course assignment to facilitate more efficient learning. Our immersive VR game simulations are based on interactive social communication involving spoken dialogs and cultural protocols, and are built upon ecological psychology and technological support. Interactive dialog can boost the learners' communication and social skills in their communities and successfully turn isolated Mandarin pronunciation into fluent verbal and non-verbal communication. Cultural protocols involve cultural knowledge, historical background, and artistic accomplishments, which are critical for communicating successfully with native speakers and acquiring a deep understanding of idioms.

Table 8
Percentage of learners in each of three learner categories who chose each topic specified in the left-most column when filling out the questionnaire.

Topic           Novices (%)    Intermediate (%)    Experienced (%)
Consonant       43             31                  26
Vowels          35             23                  56
Tones           23             31                  42
Rhythm          31             42                  29
Intonation      36             25                  47
Stress          52             34                  29
Society         57             32                  46
History         21             25                  42
Culture         26             19                  35
Words           59             43                  32
Phrases         46             54                  48
Sentences       25             32                  54
In class        52             57                  41
At home         61             43                  37
In community    49             45                  62

A large number of constructive comments were also collected by analyzing the personal suggestions of the learners. First, in order to enhance motivation more effectively, most of the learners indicated that the VLE system would be better if it involved more VR objects, more cultural background, and more adventure schemes like those in commercial games. Second, some learners pointed out that it was not challenging enough and that they wanted it to contain more Chinese idioms, adages, and poems in the dialogs. These suggestions came, to a large extent, from the experienced learners, who want to know more about Mandarin and its native use. Additionally, some learners said that they would like to take the courses at home during their leisure time. Following these comments, we will continue to develop our system to be more comprehensive, more entertaining, and more flexible. The results of the evaluations discussed above show that the learners are indeed fascinated by the idea of a VR game for learning Mandarin, and they are certainly more enthusiastic about our VLE system than about traditional non-game learning software, especially learners with poor academic performance. A vital question is how knowledgeability for foreign learners can effectively be incorporated into the system's usability and likeability. Learners prefer systems which are less difficult to manipulate and have interesting and challenging game scenarios. The length of time learners spend on each unit is also important: an appropriate length will improve learning efficiency.

6. Conclusion

In this paper, we provide an overview of a technique for a Mandarin edutainment system. First, 3D face recognition and emotion analysis technologies are introduced to identify the learner and adjust the learning materials according to the learner's personal preferences and emotional state. Then, we build a Mandarin pronunciation recognition and assessment system combined with prosodic parameters.
Second, the VLE Mandarin learning system is constructed by incorporating VR games, in which highly motivating scenarios effectively stimulate interest during the learning process. Incorporating Chinese cultural and historical knowledge makes the storyline more fascinating. The experiment is based on four aspects: accuracy, usability, likeability, and knowledgeability. The results demonstrate that our system has distinctive advantages compared with traditional non-game learning software. The open, shared, and interactive properties can enhance the learners' communication ability and create a real-life social community that responds to the learners' speech, intent, gestures, and behaviors. Although our system evaluates the common Mandarin learning issues well, it may produce incorrect results in the presence of unclear articulation and confusing VR scenarios. These are the main obstacles encountered in our existing system. In the future, we will continue to develop our system. First, we need to review the latest speech processing technologies and perform a detailed analysis of the Mandarin characteristics to improve the performance of the speech support module. Second, we will update the current VLE game system and make it easier to manipulate. We will develop more characters and storylines, and adventures will be introduced to the proposed system, which can help the learners to compare their Mandarin pronunciation to that of native speakers. Our ultimate goal is to help the learners master Mandarin more quickly in a satisfying and engaging environment.

References

Amory, A., Naicker, K., Vincent, J., Adams, C., 1998. Computer games as a learning resource. In: Proceedings World Conference on Educational Multimedia, Hypermedia and Telecommunications, pp. 50–55.
Asakawa, S., Minematsu, N., Isei-Jaakkola, T., Hirose, K., 2005. Structural representation of the non-native pronunciation. In: Proceedings of EuroSpeech, pp. 165–168.
Conati, C., Zhou, Xiaoming, 2002. Modeling students emotions from cognitive appraisal in educational games. In: Proceedings The 6th International Conference on Intelligent Tutoring Systems, pp. 944–954.
Cucchiarini, C., Strik, H., Boves, L., 1997. Automatic evaluation of Dutch pronunciation by using speech recognition technology. In: Proceedings of IEEE Workshop ASRU, Santa Barbara, pp. 622–625.
Cucchiarini, C., Wet, F.D., Strik, H., Boves, L., 1998. Assessment of Dutch pronunciation by means of automatic speech recognition technology. In: Proceedings of ICSLP, pp. 1738–1741.
de Wet, F., Van der Walt, C., Niesler, T.R., 2009. Automatic assessment of oral language proficiency and listening comprehension. Speech Communication 51, 864–874.
Franco, H.L., Neumeyer, L., Kim, Y., Ronen, O., 1997. Automatic pronunciation scoring for language instruction. In: Proceedings IEEE International Conference on Acoustic, Speech, and Signal Processing, pp. 1465–1469.
Franco, H.L., Neumeyer, L., Digalakis, V., Ronen, O., 2000. Combination of machine scores for automatic grading of pronunciation quality. Speech Communication 30, 121–130.
Hodges, B., 2009. Ecological pragmatics: values, dialogical arrays, complexity and caring. Pragmatics and Cognition 17, 628–652.
http://a2.media.3ds.com/products/3dvia/3dvia-virtools/.
Huang, Jui-Ting, Lee, Lin-shan, 2006. Improved large vocabulary Mandarin speech recognition using prosodic features. Speech Prosody.
Johnson, Lewis et al. at Alelo, Inc, http://www.alelo.com/index.html.
Kearney, P., 2004. Engaging young minds - using computer game programming to enhance learning. In: Proceeding World Conference on Educational Multimedia, Hypermedia and Telecommunications, pp. 3915–3920.
Li, Xunxiang, Chen, Dingfang, Wang, Le, Li, Anding, 2007. A Development Framework for Virtools-Based DVR Driving System. Springer, pp. 188–196.
Minematsu, N., 2004. Pronunciation assessment based upon the compatibility between a learners pronunciation structure and the target languages lexical structure. In: Proceedings of ICSLP, pp. 1317–1320.
Ming, Yue, Ruan, Qiuqi, 2012. Robust sparse bounding sphere for 3D face recognition. Image and Vision Computing, in press.
Ming, Yue, Ruan, Qiuqi, Hauptmann, Alex, 2012. Activity recognition from kinect with 3D local spatio-temporal features. In: IEEE International Conference on Multimedia and Expo (ICME 2012).
Neumeyer, L., Franco, H., Digalakis, V., Weintraub, M., 2000. Automatic scoring of pronunciation quality. Speech Communication 30, 83–93.
Neumeyer, L., Franco, H.L., Weintraub, M., Price, P., 1996. Automatic text-independent pronunciation scoring of foreign language student speech. In: Proceedings of ICSLP, pp. 217–220.
Pan, Zhigeng, Cheok, Adrian David, Yang, Hongwei, Zhu, Jiejie, Shi, Jiaoying, 2006. Virtual reality and mixed reality for virtual learning environments. Computers and Graphics 30, 20–28.
Rabiner, L., Juang, B.H., 1993. Fundamentals of Speech Recognition. Prentice Hall PTR, Upper Saddle River, New Jersey.
Raux, A., Kawahara, T., 2002. Automatic intelligibility assessment and diagnosis of critical pronunciation errors for computer-assisted pronunciation learning. In: Proceedings of ICSLP, pp. 737–740.
Tsubota, Y., Kawahara, T., Dantsuji, M., 2004. Practical use of English pronunciation system for Japanese students in the CALL class. In: Proceedings of ICSLP, pp. 849–852.
Van Lier, V., 2000. From input to affordance: socio-interactive learning from an ecological perspective. In: J.P. Lantolf (Ed.), Sociocultural theory and second language learning, Oxford University Press, Oxford, pp. 245–259.
Virvou, M., Katsionis, G., 2008. On the usability and likeability of virtual reality games for education: the case of VR-ENGAGE. Computers and Education 50, 154–178.
Williams, Gethin, Renals, Steve, 1999. Confidence measures from local posterior probability estimates. Computer Speech and Language 13, 395–411.
Witt, S.M., 1999. Use of speech recognition in computer-assisted language learning, PhD Dissertation.
Witt, S.M., Young, S.J., 2000. Phone-level pronunciation scoring and assessment for interactive language learning. Speech Communication 30, 95–108.
Young, M.F., 2004. An ecological psychology of instructional design: learning and thinking by perceiving-acting systems. In: D.H. Jonassen (Ed.), Handbook of Research for Educational Communications and Technology, second ed. Mahwah, NJ: Erlbaum.
Young, M.F., Barab, S., Garrett, S., 2000. Agent as detector: an ecological psychology perspective on learning by perceiving-acting systems. In: D.H. Jonassen, S.M. Land (Eds.), Theoretical foundations of learning environments, Mahwah, NJ: Erlbaum, pp. 147–172.
Zhang, Yan-Bin, Chu, Min, Huang, Chao, Liang, Man-Gui, 2008. Detection tone errors in continuous Mandarin speech. In: Proceedings IEEE International Conference on Acoustic, Speech, and Signal Processing, pp. 5065–5068.