Procedia Computer Science 89 (2016) 632–639. doi:10.1016/j.procs.2016.06.026
Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016)
Development of Voice Activated Ground Control Station

D. K. Rahul^a,*, S. Veena^b, H. Lokesha^b, S. Vinay^c, B. Pramod Kumar^c, C. M. Ananda^b and Vinod B. Durdi^a

^a Dayananda Sagar Institute of Technology, Bengaluru, Karnataka, India 560 078
^b CSIR-National Aerospace Laboratories, Bengaluru, Karnataka, India 560 017
^c Karnataka Regional Engineering College, Suratkal, Karnataka, India 575 025
Abstract

This paper chronicles the development of an Automatic Speech Recognition (ASR) system that can be integrated into the Ground Control Station (GCS) of MAVs to achieve voice activation. The first part of the paper highlights the nature of aerospace speech signals and hence the issues to be considered while designing a voice activated aerospace application. The second part describes the development and integration of an ASR capability into the GCS.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of the Twelfth International Multi-Conference on Information Processing-2016 (IMCIP-2016).

Keywords:
Automatic Speech Recognition; Ground Control Station; Hidden Markov Model; Micro Air Vehicle.
1. Introduction

Speech is a natural and intuitive means of communication and hence an efficient alternative for information management. Recent initiatives try to exploit its potential in the aerospace domain for command and control operations; among the cited advantages are freeing the operator's hands and conveying messages efficiently1. Effective interpretation of speech by a machine is the key to the success of speech based applications. Significant advances in hardware, together with the development of intelligent search algorithms, have made high levels of recognition performance achievable, making ASR applicable to many critical applications2. Aerospace applications demand a highly reliable ASR system that works in adverse and stressful conditions, remains transparent to the user, and understands its users accurately without requiring them to tailor their individual speech and vocabulary to suit the system; this makes system development a challenging task3. A realistic ASR implies the capability of a machine to understand a spoken utterance on any subject by all speakers in different environments, which is quite challenging. Success is determined by the capability of the algorithm set employed to capture the inter- and intra-speaker variability, the nature of the utterance (continuous speech versus isolated words), the vocabulary size, and the complexity of the language under the different environmental conditions in which recognition is performed4. This paper describes the development of ASR algorithms for a voice controlled Ground Control Station application.

* Corresponding author. Tel.: +91 9538953933.
E-mail address: [email protected]
Nomenclature

MAV – Micro Air Vehicle
GCS – Ground Control Station
ASR – Automatic Speech Recognition
ATC – Air Traffic Controller
SAD – Speech Activity Detection
KWS – Keyword Spotting
MFCC – Mel Frequency Cepstral Coefficients
HMM – Hidden Markov Model
C_h, δ^(1), δ^(2) – Cepstral, delta and acceleration coefficients
π, A, B – Initial state distribution vector, state transition probability matrix and continuous observation probability density function matrix of an HMM
C# – C sharp programming language
2. Application of Speech Technology in Aerospace

The applications of ASR in aviation are evolving rapidly; examples include voice activated GCS1,3, training of military (or civilian) Air Traffic Controllers (ATC)5 and voice controlled cockpits6. Even though ASR applications are quite popular in other domains, developing them for aerospace is challenging owing to stringent requirements. To achieve an effective speech based system design, it is important to understand the nature of the speech used in aerospace: it is highly structured, and a vocabulary of about 200 words is on average sufficient, so the difficulty involved in recognition is relatively low7. DARPA8 has classified the speech processing algorithms required for aerospace applications under Speech Activity Detection (SAD), Language Identification (LID), Keyword Spotting (KWS) and Speaker Identification (SID). However, adverse conditions such as high noise, high stress, g-force, pressure breathing, wind and gust degrade the performance of speech recognition systems7. The other main factors to be considered are real-time performance, recognition rate and robustness. Javier et al.9 have thrown light on various aspects to be considered in the development of a robust speech interface for aerial vehicles. Mark et al.1 present evaluation results for speech control of UAVs and conclude that speech input was superior to manual operation; however, they emphasize the need for robust speech recognition.

3. Components of Speech Processing System

Figure 1 shows the block diagram of an ASR system. A brief description is given to explain the functionality of each sub-block.

Fig. 1. Blocks of an Automatic Speech Recognition System.

The main purpose of the front end is efficient feature extraction from the voiced part of the speech signal. Speech endpoint detection and silence removal are adopted for dimensionality reduction, which makes the system computationally more efficient4,7. Mel Frequency Cepstral Coefficients (MFCCs), Perceptual Linear Predictive coefficients (PLPs), and their first and second-order temporal differences are the commonly used speech features. Feature extraction is greatly affected by noise, and an efficient solution to this is still an open problem. Speech endpoint detection, which automatically detects a period of target human speech in a continuously observed signal, is one of the most important requirements, but how to efficiently distinguish speech samples from non-speech samples in the presence of background noise is still under research. Poor pronunciation, speech variability, limited vocabulary size, and similar phonetic transcriptions of words are further issues influencing the accuracy of feature extraction10.

In the pattern recognition stage, speech feature vectors are represented as models. During recognition, the feature vector of the uttered speech is modelled and the best match from the stored models is selected, as shown in Fig. 2. In an ASR application, two models are generally used: an acoustic model and a language model. The acoustic model represents acoustic properties, phonetic properties, microphone and environmental variability, as well as gender and dialectal differences among speakers. Acoustic models compute P(O|W), the probability of observing the acoustic evidence O when the speaker utters W. The model has to capture the variability in the speech signal. The most effective acoustic modelling is based on a structure referred to as the Hidden Markov Model (HMM). Other models include Artificial Neural Networks (ANNs) trained by back-propagating error derivatives and Deep Neural Networks (DNNs) that contain many layers of non-linear hidden units and a very large output layer.

Fig. 2. Block Schematic of an Acoustic Model.

The language model contains the syntax, semantics and pragmatics knowledge for the intended recognition task. This model complements the acoustic knowledge with information on the most probable word sequences9. There are several techniques for language modelling: grammar-based language modelling and statistical language modelling (N-gram). Statistical language modelling consists of computing the probability of one word given the N − 1 previous words. For example, a 3-gram model consists of the probabilities of every word preceded by any combination of two words.
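To make the counting idea behind an N-gram model concrete, the following minimal C# sketch estimates 3-gram probabilities from relative frequencies, P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2). The class name and toy corpus are our own illustration, not part of any system described in this paper, and no smoothing for unseen histories is applied.

```csharp
using System;
using System.Collections.Generic;

// Minimal count-based 3-gram language model (illustrative sketch only).
class TrigramModel
{
    private readonly Dictionary<string, int> trigramCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> bigramCounts = new Dictionary<string, int>();

    public void Train(IEnumerable<string[]> sentences)
    {
        foreach (var words in sentences)
        {
            for (int i = 0; i + 2 < words.Length; i++)
            {
                string bigram = words[i] + " " + words[i + 1];
                string trigram = bigram + " " + words[i + 2];
                bigramCounts[bigram] = bigramCounts.GetValueOrDefault(bigram) + 1;
                trigramCounts[trigram] = trigramCounts.GetValueOrDefault(trigram) + 1;
            }
        }
    }

    // P(next | w1 w2) estimated as count(w1 w2 next) / count(w1 w2); 0 if unseen.
    public double Probability(string w1, string w2, string next)
    {
        string bigram = w1 + " " + w2;
        if (!bigramCounts.TryGetValue(bigram, out int denom)) return 0.0;
        return (double)trigramCounts.GetValueOrDefault(bigram + " " + next) / denom;
    }

    static void Main()
    {
        var lm = new TrigramModel();
        lm.Train(new[] { new[] { "open", "flight", "plan" },
                         new[] { "open", "flight", "data" } });
        Console.WriteLine(lm.Probability("open", "flight", "plan")); // 0.5
    }
}
```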
4. Case Study – Development of ASR System for Controlling of Micro Air Vehicles

This section describes the development of an ASR for controlling the MAV. The MAV is significantly smaller than a UAV and is used for surveillance and tracking missions. MAVs are controlled remotely through a portable GCS, which is typically a laptop. The GCS is used for configuring the MAV hardware, gathering MAV status information and sending control commands11. To achieve all this, the GCS software has multiple menu pages operated by keyboard and/or mouse, which can become quite cumbersome at the mission site; hence the interest in speech-based MAV control1,12. Voice commands are also very useful when swiftness of control is required in emergency situations such as collision avoidance or tracking. Mallesh et al.13 integrated the built-in ASR of Microsoft Windows into Mission Planner, an open source GCS used to control MAVs. However, that ASR is not designed for outdoor environments, which results in performance degradation. There is therefore a need to build a rugged ASR system that meets the outdoor operational requirements.

Mission Planner is a widely used GCS and its commands fall into the category of isolated or connected words. Hence a good acoustic model is sufficient to achieve satisfactory results, and a language model may not be necessary. The model can also be speaker independent, since voice commands may be issued by any user. HMM modelling has proven to be an effective approach to such isolated word recognition problems. Kavita et al.14 developed an HMM based ASR system on the MATLAB platform and compared the results with the HTK toolkit, an open source HMM tool; the performance of their ASR system compared well with the standard toolkit. However, the standard HMM based ASR algorithm used in that paper is not well adapted to the GCS environment. Further, MATLAB code cannot be directly interfaced to the GCS software, so code in a GCS-compatible programming language is necessary. This paper addresses these issues and attempts to realize a robust ASR integrated into Mission Planner.

4.1 ASR system design

Figure 3 shows the block diagram of the proposed ASR system for the GCS. It is proposed to develop the HMM models in MATLAB and port them to Mission Planner for recognition. The testing/recognition module is developed as an additional feature in Mission Planner using the C sharp language. This approach exploits the powerful features of MATLAB, such as its graphic capability and analysis tools, to develop the HMM models.

Fig. 3. Block Schematic of the Proposed ASR System.

In Mission Planner, an object to obtain audio samples from the microphone was added to the code developed by Mallesh et al.13. This appears as a button at the top right corner of the GCS menu. Once this button is clicked, the other menu tabs become inactive and the GCS accepts commands only through speech, as shown in Fig. 4.

Fig. 4. Mission Planner Front End with Speech Mode Activation Feature.

The details of the ASR module are given here.

• Training phase
At this stage of development, the focus was to cover different phonetic sounds from an acoustic modelling perspective rather than to target all menu items. Once a suitable modelling methodology is established, more word models can be generated.
Therefore, 12 words were carefully selected to cover possible short words, connected words and unvoiced sounds. For training, 5 utterances were recorded per word per speaker, and utterances from 10 different speakers were used; hence the training database consists of 600 utterances. At the front end, Voice Activity Detection (VAD) based on error energy is included15. Feature extraction consists of the extraction of MFCCs; to enhance the robustness of the system, the delta and acceleration coefficients are used along with the MFCC features12. The equations for the computation of the MFCCs and the corresponding delta and acceleration coefficients are given below:

C_h(m, n) = \sum_{k=0}^{N-1} \alpha_k \log(F_{\mathrm{mel}}(k)) \cos\!\left(\frac{\pi (2n+1) k}{2N}\right), \quad n = 0, 1, 2, \ldots, N-1    (1)

\delta^{(1)} = \frac{\sum_{p=-P}^{P} \left(C_h(n; m+p) - C_h(n; m)\right) p}{\sum_{p=-P}^{P} p^2}    (2)

\delta^{(2)} = 2\,\frac{\left(\sum_{p=-P}^{P} p^2\right) \sum_{p=-P}^{P} C_h(n; m+p) - (2P+1) \sum_{p=-P}^{P} p^2\, C_h(n; m+p)}{\left(\sum_{p=-P}^{P} p^2\right)^2 - (2P+1) \sum_{p=-P}^{P} p^4}    (3)
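Equations (2) and (3) can be transcribed almost directly into code. The C# sketch below is our own illustration, not the authors' implementation: cepstra[m][n] holds cepstral coefficient n of frame m, P is the regression window half-width, and clamping the frame index at the utterance boundaries is our own choice for handling the edges.

```csharp
using System;

// Illustrative transcription of Eqs. (2) and (3): first-order (delta) and
// second-order (acceleration) regression coefficients over a +/-P frame window.
static class DeltaFeatures
{
    // Eq. (2): delta coefficient for frame m, cepstral index n.
    public static double Delta1(double[][] cepstra, int m, int n, int P)
    {
        double num = 0, den = 0;
        for (int p = -P; p <= P; p++)
        {
            num += p * (C(cepstra, m + p, n) - C(cepstra, m, n));
            den += p * p;
        }
        return num / den;
    }

    // Eq. (3): acceleration coefficient for frame m, cepstral index n.
    public static double Delta2(double[][] cepstra, int m, int n, int P)
    {
        double s2 = 0, s4 = 0, sc = 0, s2c = 0;
        for (int p = -P; p <= P; p++)
        {
            double c = C(cepstra, m + p, n);
            s2 += p * p;            // sum of p^2
            s4 += Math.Pow(p, 4);   // sum of p^4
            sc += c;                // sum of C_h(n; m+p)
            s2c += p * p * c;       // sum of p^2 * C_h(n; m+p)
        }
        int w = 2 * P + 1;          // window length
        return 2 * (s2 * sc - w * s2c) / (s2 * s2 - w * s4);
    }

    // Clamp the frame index so the window is defined at the utterance edges.
    private static double C(double[][] cepstra, int m, int n)
    {
        m = Math.Max(0, Math.Min(cepstra.Length - 1, m));
        return cepstra[m][n];
    }
}
```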
Further, to make the system robust to noise, an augmented cepstral mean normalization algorithm, which provides an effective solution to the mismatch between training and testing environments, is used16.

• Modelling stage
As mentioned earlier, HMM modelling is used, in which the probability distributions π, A and B are written in the compact form λ = (π, A, B), where
π = initial state distribution vector,
A = state transition probability matrix,
B = continuous observation probability density function matrix.
For speech, an HMM with transitions from each state to the next state and to itself is considered. Accordingly, A is initialized with equal probability for each such transition. The initial state distribution vector is initialized with probability one for the first state, as assumed in speech recognition theory. The elements of matrix B contain the MFCC, delta and acceleration coefficients weighted by a multivariate single Gaussian distribution17. The system is solved as given in the paper by Rabiner18. At the end of this step, the best possible model is generated for each word in the database.

• Testing phase
The testing phase uses a modified Viterbi algorithm that generates the B matrix for the test utterance. This B matrix, together with the A and π matrices of each model, is used to generate scores using the Alternative Viterbi algorithm17. The standard Viterbi algorithm could also be used, but it involves multiplying probability values, which may underflow to zero; the Alternative Viterbi algorithm replaces the multiplications with additions by taking logarithms. The test utterance is assigned to the HMM model with the highest score.
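As an illustration of this scoring step, the sketch below shows a left-to-right HMM initialization and a log-domain Viterbi score in C#. It is our own simplified paraphrase, not the system's code: logB would come from the single-Gaussian densities over the MFCC, delta and acceleration features, and the word model with the highest returned score wins.

```csharp
using System;

// Log-domain ("Alternative") Viterbi scoring of one utterance against one
// left-to-right HMM. Illustrative sketch under stated assumptions:
//   logPi[i]   = log initial state probability (all mass on state 0),
//   logA[i][j] = log transition probability (self-loop or next state only),
//   logB[t][j] = log observation likelihood of frame t in state j.
static class HmmScoring
{
    // Left-to-right topology: equal probability for self-loop and next-state
    // transitions; the initial distribution puts probability 1 on state 0.
    public static (double[] logPi, double[][] logA) InitLeftToRight(int N)
    {
        var logPi = new double[N];
        for (int i = 0; i < N; i++)
            logPi[i] = i == 0 ? 0.0 : double.NegativeInfinity;
        var logA = new double[N][];
        for (int i = 0; i < N; i++)
        {
            logA[i] = new double[N];
            for (int j = 0; j < N; j++) logA[i][j] = double.NegativeInfinity;
            if (i + 1 < N)
            {
                logA[i][i] = Math.Log(0.5);       // stay in the same state
                logA[i][i + 1] = Math.Log(0.5);   // move to the next state
            }
            else logA[i][i] = 0.0;                // last state loops forever
        }
        return (logPi, logA);
    }

    public static double LogViterbi(double[] logPi, double[][] logA, double[][] logB)
    {
        int T = logB.Length, N = logPi.Length;
        var v = new double[N];
        for (int j = 0; j < N; j++) v[j] = logPi[j] + logB[0][j];
        for (int t = 1; t < T; t++)
        {
            var next = new double[N];
            for (int j = 0; j < N; j++)
            {
                double best = double.NegativeInfinity;
                for (int i = 0; i < N; i++)
                    best = Math.Max(best, v[i] + logA[i][j]); // additions replace multiplications
                next[j] = best + logB[t][j];
            }
            v = next;
        }
        double score = double.NegativeInfinity;
        for (int j = 0; j < N; j++) score = Math.Max(score, v[j]);
        return score; // compare across word models; highest score wins
    }
}
```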
To meet the real-time requirements, the ASR algorithms were developed in C#15. A frame processing technique is adopted for acquiring and processing real-time speech data, as shown in Fig. 5. The flowchart explains the internal operation of the ASR in the GCS; the major part of the processing and recognition deals with the HMM.

Fig. 5. Flowchart of the Real-Time ASR for Mission Planner.

To improve recognition accuracy, the end points must be detected accurately; this is achieved by incorporating the system presented by Dimitrios15. The system distinguishes between speech and silence by evaluating the energy and the Zero Crossing Rate (ZCR) of the input signal, as shown in Fig. 5. The continuously recorded speech data streams are divided into frames of 5 ms, equivalent to 80 samples. The initial 100 frames are collected; of these, the first 80 frames are discarded and the remaining 20 frames are used for threshold computation. The 20 frames of 80 samples are reformed into 19 frames of 160 samples each, with a 50% overlap between adjacent frames. The energy and ZCR thresholds are computed separately from these 19 frames, and they are used for the classification of later frames into speech and silence frames. The energy thresholds are calculated by the following relations:

STE_{UT} = 12\,(m_{STE} + std_{STE})    (4)

STE_{LT} = 3\,(m_{STE} + std_{STE})    (5)

STE_{IT} = STE_{LT} + \frac{1}{8}\,(STE_{UT} - STE_{LT})    (6)
where STE is the short-term energy, STE_{UT} the STE upper threshold, STE_{LT} the STE lower threshold, STE_{IT} the STE intermediate threshold, and m_{STE} and std_{STE} the mean and standard deviation of the STE, respectively. The ZCR threshold is calculated by the following relation:

ZCR_T = \min\{35,\, (mean_{ZCR} + std_{ZCR})\}    (7)

where ZCR_T is the ZCR threshold, and mean_{ZCR} and std_{ZCR} are the mean and standard deviation of the ZCR, respectively.
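A compact sketch of this threshold calibration, under our reading of Eqs. (4)–(7), is given below in C#. It is illustrative code, not the deployed module: calib is assumed to hold the 20 retained frames of 80 samples as one 1600-sample buffer, which is re-framed into the 19 half-overlapped frames of 160 samples described above.

```csharp
using System;
using System.Linq;

// Illustrative calibration of the STE and ZCR thresholds of Eqs. (4)-(7).
static class EndpointThresholds
{
    public static (double upper, double lower, double inter, double zcrT)
        Calibrate(short[] calib /* 20 frames x 80 samples = 1600 samples */)
    {
        const int frames = 19, len = 160, hop = 80; // 50% overlap re-framing
        var ste = new double[frames];
        var zcr = new double[frames];
        for (int f = 0; f < frames; f++)
        {
            for (int i = 0; i < len; i++)
            {
                double s = calib[f * hop + i];
                ste[f] += s * s;                                   // short-term energy
                if (i > 0 && (s >= 0) != (calib[f * hop + i - 1] >= 0))
                    zcr[f]++;                                      // zero crossings
            }
        }
        double mSte = ste.Average();
        double sdSte = Math.Sqrt(ste.Select(x => (x - mSte) * (x - mSte)).Average());
        double mZcr = zcr.Average();
        double sdZcr = Math.Sqrt(zcr.Select(x => (x - mZcr) * (x - mZcr)).Average());

        double upper = 12 * (mSte + sdSte);             // Eq. (4)
        double lower = 3 * (mSte + sdSte);              // Eq. (5)
        double inter = lower + (upper - lower) / 8.0;   // Eq. (6)
        double zcrT  = Math.Min(35, mZcr + sdZcr);      // Eq. (7)
        return (upper, lower, inter, zcrT);
    }
}
```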
After the thresholds have been calculated, the energy of each subsequent frame is compared with the energy threshold. A First-In First-Out (FIFO) buffer of length 7 stores the ZCR values of the 7 frames preceding the current frame. End-point detection proceeds as follows (a simplified sketch of this logic is given below):

• Initial end-point detection: When the energy of a frame exceeds the energy threshold, the ZCR values in the FIFO buffer are compared with the ZCR threshold. If more than three ZCR values in the buffer are greater than the ZCR threshold, the point at which the energy exceeded the energy threshold is taken as the start of speech, i.e. the initial end point.
• Final end-point detection: After the initial end point is detected, the energy comparison continues. When the energy drops below the energy threshold, that point is taken as the final end point, provided the energy of the next 20 frames also stays below the energy threshold. The vocabulary contains both isolated and connected words. For isolated words, once the energy has dropped below the threshold it will not rise again, so that point must be taken as the final end point. Connected words, by contrast, have the form speech-silence-speech, so the energy rises above the threshold again within the next 20 frames; in that case the silent frames are also saved, and the energy comparison and final end-point detection continue.

The data between the end points are stored and taken up for processing and recognition of the spoken word.
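These end-pointing rules can be summarised in a small state machine. The C# sketch below is our own paraphrase with hypothetical names, simplified to a single energy threshold for brevity; it is not the integrated Mission Planner module.

```csharp
using System;
using System.Collections.Generic;

// Simplified end-point detector following the rules above: speech starts when
// a frame's energy exceeds the energy threshold and more than three of the
// last seven ZCR values exceed the ZCR threshold; speech ends when the energy
// stays below the threshold for 20 consecutive frames. Illustrative sketch.
class EndpointDetector
{
    private readonly Queue<double> zcrFifo = new Queue<double>(); // last 7 ZCR values
    private int silentRun;   // consecutive low-energy frames after speech started
    private bool inSpeech;

    public enum Event { None, SpeechStart, SpeechEnd }

    public Event Process(double energy, double zcr, double energyT, double zcrT)
    {
        zcrFifo.Enqueue(zcr);
        if (zcrFifo.Count > 7) zcrFifo.Dequeue();

        if (!inSpeech)
        {
            if (energy > energyT)
            {
                int hits = 0;
                foreach (double z in zcrFifo) if (z > zcrT) hits++;
                if (hits > 3) { inSpeech = true; silentRun = 0; return Event.SpeechStart; }
            }
            return Event.None;
        }

        // In speech: a dip below the threshold only ends the utterance if it
        // lasts 20 frames, so the silence inside connected words is retained.
        if (energy < energyT)
        {
            if (++silentRun >= 20) { inSpeech = false; return Event.SpeechEnd; }
        }
        else silentRun = 0;
        return Event.None;
    }
}
```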
The integrated ASR capability was tested in the laboratory for its effectiveness; the accuracy of the recognized commands is given in Table 1. The laboratory performance was found to be quite satisfactory, and the model was able to perform in the presence of speech corrupted by laboratory-simulated noise. However, the system's performance in a real scenario has yet to be evaluated and is planned for the near future. This application can be made more powerful by adding voice commands to control the MAV in emergency situations; this requires an additional dialog box in the Mission Planner software to send voice-interpreted information such as waypoint changes, speed changes, etc. The application would also become much more meaningful if extended to a natural speech mode. This is desirable because it overcomes the latency and usability issues normally found with discrete-word, speaker-dependent recognition systems, and it requires the incorporation of a language model and grammar in addition to the present acoustic model. ANNs and deep learning networks are promising fields for achieving this objective. Future work needs to focus on achieving seamless speech control.

Table 1. Recognition Accuracy.

Command            % Recognition Accuracy
Actions            100
Config             100
Data Flash Logs    100
Flight Data        100
Flight Plan        98
Initial Setup      100
Messages           96
Scripts            100
Servo              100
Simulation         100
Terminal           98
Telemetry Logs     98
5. Conclusions

The initial part of the paper throws light on speech based applications in the aerospace scenario and the challenges that need to be met to realize them. The later part describes the development of an ASR based GCS for MAVs. The paper touches upon the algorithm development and its integration into a typical GCS, the Mission Planner. It concludes with the laboratory evaluation results of the ASR and discusses the measures to be adopted to make the system suitable for the actual application scenario.

References

[1] Mark Draper et al., Manual Versus Speech Input for Unmanned Aerial Vehicle Control Station Operation, Proceedings of the Human Factors & Ergonomics Society's 47th Annual Meeting, pp. 109–113, October (2003).
[2] Timothy R. Anderson, The Technology of Speech-based Control, Alternative Control Technologies: Human Factors Issues, 2-1, (1998).
[3] Clifford J. Weinstein, Opportunities for Advanced Speech Processing in Military Computer-based Systems, Proceedings of the IEEE, vol. 79(11), pp. 1626–1641, (1991).
[4] Hesham Tolba and Douglas O'Shaughnessy, Speech Recognition by Intelligent Machines, IEEE Canadian Review, vol. 38, pp. 20–23, (2001).
[5] J. M. Pardo et al., Automatic Understanding of ATC Speech: Study of Prospectives and Field Experiments for Several Controller Positions, IEEE Transactions on Aerospace and Electronic Systems, vol. 47(4), pp. 2709–2730, (2011).
[6] Christine Englund, Speech Recognition in the JAS 39 Gripen Aircraft – Adaptation to Speech at Different G-loads, Master Degree Thesis, 57, (2004).
[7] Cary R. Spitzer, Chapter 8, Digital Avionics Handbook, CRC Press, 978-1-4200-3687-9, (2001).
[8] Robust Automatic Transcription of Speech (RATS), Information Processing Techniques Office (IPTO), Defense Advanced Research Projects Agency (DARPA), DARPA-BAA-10-34, (2012).
[9] Javier Ferreiros, Rubén San-Segundo, Roberto Barra and Víctor Pérez, Increasing Robustness, Reliability and Ergonomics in Speech Interfaces for Aerial Control Systems, Aerospace Science and Technology, vol. 13, pp. 423–430, (2009).
[10] Tomyslav Sledevič et al., Evaluation of Features Extraction Algorithms for a Real-time Isolated Word Recognition System, International Journal of Electronics, 7(12), (2013).
[11] G. Natarajan, Ground Control Stations for Unmanned Air Vehicles, Defence Science Journal, vol. 51(3), (2001).
[12] Kshitiz Kumar, Chanwoo Kim and Richard M. Stern, Delta-spectral Cepstral Coefficients for Robust Speech Recognition, 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, (2011).
[13] S. Mallesh Babu, H. Lokesha, S. Veena and Jayantkumar A. Rathod, Controlling the Micro Air Vehicle through Voice Instructions, International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), vol. 6, no. 4, pp. 21–27, April (2015).
[14] S. Kavitha, S. Veena and R. Kumaraswamy, Development of Automatic Speech Recognition System for Voice Activated Ground Control System, International Conference on Trends in Automation, Communication and Computing Technologies, Bangalore, December 21–22, (2015).
[15] Dimitrios S. Koliousis, Real-time Speech Recognition System for Robotic Control Applications Using an Ear Microphone, Thesis, Naval Postgraduate School, Monterey, California, (2007).
[16] Alex Acero and Xuedong Huang, Augmented Cepstral Normalization for Robust Speech Recognition, Proceedings of the IEEE Automatic Speech Recognition Workshop, pp. 146–147, (1995).
[17] Mikael Nilsson and Marcus Ejnarsson, Speech Recognition using Hidden Markov Model, Master of Science Thesis in Electrical Engineering, Department of Telecommunications and Signal Processing, Blekinge Institute of Technology, Ronneby, March (2002).
[18] L. R. Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, February (1989).