Speech Control of Measurement Devices


14th IFAC Conference on Programmable Devices and Embedded Systems, October 5-7, 2016, Brno, Czech Republic
Available online at www.sciencedirect.com


IFAC-PapersOnLine 49-25 (2016) 013–018

Jiří Špale*, Cedric Schweizer*

*Furtwangen University, Faculty of Computer Science, Furtwangen, Germany
(e-mail: {spale, schweize}@hs-furtwangen.de)

Abstract: The technology of speech recognition has undergone a rapid development in recent years. Local and server-based, commercial and open source solutions are available; their detection rates and success vary. The subject of this paper is a comparative study of some selected APIs, libraries and SDKs in combination with Qt, with the aim to develop an Android or iOS app that exchanges data with measuring instruments via Bluetooth. The chosen solution was realized.

© 2016, IFAC (International Federation of Automatic Control) Hosting by Elsevier Ltd. All rights reserved.

Keywords: Speech Recognition, Text-to-Speech (TTS), Pocketsphinx, Nuance NDEV, Speech API, OpenEars, JSGF Grammar, Android, iOS, Qt

1. INTRODUCTION

Testo AG is a global manufacturer of measuring instruments for the detection of various physical quantities. These devices are used in gas analysis, heating and refrigeration equipment, calibration and pharmaceutical technology, optical and thermal imaging inspection, to name just a few applications. It happens that the technicians who use these devices have to place a measuring probe or adjust control elements and at the same time need to navigate through the menus of the measuring instruments. Since this is often difficult, testo developed a family of devices which can exchange data with a smartphone or tablet via Bluetooth. The corresponding app testo Smart Probes is available for Android and iOS and can be downloaded from Google Play or iTunes. A voice control of this app would further facilitate the work or even help to reduce the number of on-site employees.

In order to ensure compatibility with other operations performed by testo developments, it was a requirement to use Qt. The subject of the project described in this article, which testo entrusted the Faculty of Computer Science of Furtwangen University with, was the design, analysis and comparison of possible solutions for voice control of an app that should expand the existing testo Smart Probes app. The selection of the optimum variant should be made under both technical and economic aspects. The app should support at least the following functions:

• Control the app by voice
• Speech to Text - dictate using speech recognition
• Input in multiple languages
• Text to Speech - read out the measured values
• Supported operating systems: Android & iOS
• If possible, no license fees for commercial use
• Find the optimal ratio between performance and accuracy
• If possible, the app should work without an Internet connection.

As additional criteria, the power consumption should be taken into account, measured and tested. The user should be informed about the success or failure of the speech recognition by appropriate sounds. The theoretical solution had to be verified by developing a prototype.

2. SPEECH PROCESSING AND RECOGNITION OVERVIEW

In automatic speech recognition (ASR), there are multiple factors which influence the size of the app, the accuracy and the speed of detection. For device control, speaker-independent systems with a very limited vocabulary of isolated words or unchanging short word sequences are needed.

Figures 1 and 2 show the general block diagram of speech recognition.

Fig. 1. Pre-processing

By filtering (Fig. 1), disturbing noises should be suppressed. The term "Cepstrum" means the spectrum of a function in the frequency domain. The dimension of the independent variable of the Cepstrum - "quefrency" - is the equivalent of the dimension of the variable from which the original spectrum was formed, in this case the discrete time. In the Cepstrum, the amplitude of the signal is reduced, therefore harmonic components of the signal become more apparent. This allows conclusions about whether the vocal cords or the vocal tract were involved in the generation of the signal. For an even more detailed consideration, the square of the spectrum's magnitude is filtered by a filter from the Mel-filter bank (Pfister 2008), usually before the logarithm is applied. Instead of the inverse Fourier transform (IFT, FFT^-1), the discrete cosine transform (DCT) is often used.



This is possible because, for real-valued input, the real part of the DFT is a kind of DCT. The reason why the DCT is preferred is that its output is approximately decorrelated. Decorrelated features can be modelled efficiently as a Gaussian distribution with a diagonal covariance matrix. For more details, see e.g. (Pfister 2008).


The result of the pre-processing are speech vectors in which the cepstral coefficients and one energy coefficient - the Mel-Frequency Cepstral Coefficients (MFCC) - are stored.
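As an illustration of the steps in Fig. 1, the following minimal C++ sketch derives MFCCs from one pre-computed power-spectrum frame by applying a triangular Mel filter bank, the logarithm and the DCT-II. The filter-bank layout, the coefficient counts and all helper names are common defaults chosen for illustration, not the parameters or code used by testo or Pocketsphinx.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helpers; only the processing chain of Fig. 1 is taken from the text.
static double hzToMel(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }
static double melToHz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

std::vector<double> mfccFromPowerSpectrum(const std::vector<double>& power,  // |FFT|^2, bins 0..Nyquist
                                          double sampleRate,
                                          int numFilters = 26,
                                          int numCoeffs  = 13)
{
    const double pi = std::acos(-1.0);
    const int numBins = static_cast<int>(power.size());

    // Centre positions of the triangular filters, equally spaced on the Mel scale.
    std::vector<double> centre(numFilters + 2);
    const double melMax = hzToMel(sampleRate / 2.0);
    for (int i = 0; i < numFilters + 2; ++i)
        centre[i] = melToHz(melMax * i / (numFilters + 1)) / (sampleRate / 2.0) * (numBins - 1);

    // Mel filter bank applied to the squared magnitude, followed by the logarithm.
    std::vector<double> logMel(numFilters);
    for (int f = 0; f < numFilters; ++f) {
        double energy = 0.0;
        for (int b = 0; b < numBins; ++b) {
            double w = 0.0;
            if (b >= centre[f] && b <= centre[f + 1])
                w = (b - centre[f]) / (centre[f + 1] - centre[f]);
            else if (b > centre[f + 1] && b <= centre[f + 2])
                w = (centre[f + 2] - b) / (centre[f + 2] - centre[f + 1]);
            energy += w * power[b];
        }
        logMel[f] = std::log(energy + 1e-10);
    }

    // DCT-II decorrelates the log-Mel energies (see the discussion of the DCT above).
    std::vector<double> mfcc(numCoeffs);
    for (int k = 0; k < numCoeffs; ++k) {
        double acc = 0.0;
        for (int f = 0; f < numFilters; ++f)
            acc += logMel[f] * std::cos(pi * k * (f + 0.5) / numFilters);
        mfcc[k] = acc;
    }
    return mfcc;
}
```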

• Nuance NDEV API, a cloud-based HTTP API by Nuance Mobile. On the one hand, the required offline functionality would not be fulfilled and the success of speech recognition would depend on the quality of the Internet connection. A further consideration would be the issue of data security. On the other hand, the development, implementation and maintenance effort would be minimal, and no language packs would have to be installed. The app would be platform-independent and could be kept very small. A major disadvantage is the absence of keyword spotting: looking for a hotkey among all recognized words is certainly much more resource-hungry.



• Native implementation - with the standard tools of the considered operating systems. On Android, the feature Okay Google Everywhere was tested. The speech recognition starts with the hotkey "OK Google". Currently it is not possible to define own hotkeys, since these are defined at the CPU level; only the Original Equipment Manufacturers have access there. Furthermore, the speech recognition has to be started e.g. by a button click. However, the missing possibility to define own hotkeys to start "permanent listening" is the only restriction. The implementation of speech recognition follows the design pattern Inversion of Control: the button click invokes a function in which the listening is started, a suitable language model is selected, possibly a graphical prompt is issued, and a request code is passed by which, later in the callback, the initiator of the callback can be identified. The callback function is started after completion of the speech input. If the initiator of the callback really is the request for speech recognition, all recognized words, sorted by their matching probability, are loaded into a list. The most likely result stands at index 0.

Fig. 2. Phoneme-based decoder

Almost all modern speech recognition systems use the phoneme-based process for recognition (Fig. 2). The task of the acoustic models is to determine all possible combinations of phonemes matching the input signal by means of Hidden Markov Models (HMM). Using the dictionary, the matching combinations are compared with the words of the word pool; the selection is limited to this pool. The grammar assigns a function to each word using grammatical rules. Part of the language model are statistics that define the probability that a combination of e.g. 3 words (trigram, N-gram) may occur, i.e. conditional probabilities of a word given its predecessors. Finally, the combination with the highest probability is selected.

3. PROJECT APPROACH

3.1 Key technologies compared

The following technologies were contemplated:



• CMU Sphinx, a very common open-source framework written in C, developed at Carnegie Mellon University. The processing occurs offline. Extension modules for a large number of languages are available. The framework is under a BSD license - there is no restriction against commercial use or redistribution. For quality speech models, a very good detection rate was found (Harvey et al. 2010). However, a disadvantage is the implementation and maintenance effort, as well as the high effort for creating new language and acoustic models. Some smartphones had problems with recognizing many different words (> 1000) or long sentences (> 3 words).

3.2 Decision-making

For the decision, 13 criteria were defined:
1. Implementation and linking effort [function adaptation to target platform; integration of API or library in the program; setting up the SDK]
2. Advanced functions [Voice Activity Detection, Wake-up Recognition, Noise Reduction, Text-to-Speech, …]
3. Ongoing license costs
4. One-time license costs
5. Terms of a license
6. Quality / Performance of recognition by dictation
7. Quality / Performance of recognition of isolated words on voice control

• Pocketsphinx, a variant of CMU Sphinx optimized for mobile devices (Huggins-Daines 2015). It can be configured to use fixed-point or floating-point arithmetic. Floating-point arithmetic brings benefits when the processor architecture supports floating-point operations in hardware. Older or low-end mobile devices use ARM processors which have no hardware support for floating-point operations, or, if they do, the floating-point arithmetic is much slower than the fixed-point one. For this reason, fixed-point arithmetic is offered by Pocketsphinx; compare with the spectral analysis described in (Spale 2009). To calculate with rational numbers, signed 32-bit integers with a radix point at bit 16 (Q15.16 format) are used instead of floating-point numbers (Sorensen 1987). On the architecture level it is necessary to pay attention to the calculations in time-critical code sections: if possible, these should run using processor registers and the number of memory accesses should be minimized (Huggins-Daines et al. 2006).
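As a small illustration of the Q15.16 format mentioned above (1 sign bit, 15 integer bits, 16 fractional bits), the following C++ sketch shows conversion and multiplication; the helper names are hypothetical and are not taken from the Pocketsphinx sources.

```cpp
#include <cstdint>
#include <cmath>

using q15_16 = int32_t;                       // 1 sign bit, 15 integer bits, 16 fractional bits

inline q15_16 toFixed(double x)   { return static_cast<q15_16>(std::lround(x * 65536.0)); }
inline double toDouble(q15_16 x)  { return static_cast<double>(x) / 65536.0; }

// Addition works like ordinary integer addition; multiplication needs a 64-bit
// intermediate result and a shift back by the 16 fractional bits.
inline q15_16 qAdd(q15_16 a, q15_16 b) { return a + b; }
inline q15_16 qMul(q15_16 a, q15_16 b)
{
    return static_cast<q15_16>((static_cast<int64_t>(a) * static_cast<int64_t>(b)) >> 16);
}

// Example: toDouble(qMul(toFixed(1.5), toFixed(2.25))) yields 3.375.
```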


Table 1. Decision Matrix

8. Memory consumption
9. Android compatibility
10. iOS compatibility
11. Qt compatibility
12. Offline functionality
13. Documentation & Support


The criteria were evaluated on a scale from 0 (worst) to 10 (best), weighted by a factor from 1 (low importance) to 4 (high importance). The results of the comparison are shown in Table 1. Consequently, Qt & Pocketsphinx was selected. This was the only approach giving full offline functionality in addition to multi-platform compatibility and allowing the required integration into the existing Qt projects. The quality of speech recognition and the functional diversity are sufficient.
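The weighted evaluation behind Table 1 amounts to a simple weighted sum; the function below is an illustrative sketch of this scoring, with a vector-based interface and names that are assumptions rather than project code.

```cpp
#include <cstddef>
#include <vector>

// Weighted sum over the 13 criteria: score in 0 (worst) .. 10 (best), weight in 1 .. 4.
int weightedScore(const std::vector<int>& scores, const std::vector<int>& weights)
{
    int total = 0;
    for (std::size_t i = 0; i < scores.size() && i < weights.size(); ++i)
        total += scores[i] * weights[i];
    return total;
}
// The candidate with the highest total (here Qt & Pocketsphinx) is selected.
```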

The front-end of the project is an app that makes it possible to issue voice commands and read out measured values using TTS. In the frame of this project, a demo app (Fig. 5) was created which should later be merged with the existing testo Smart Probes app. The simplified class diagram is shown in Fig. 4. After starting, the app connects to the appropriate instrument via Bluetooth and displays the measured values on the screen. The app listens continuously until the hotkey "My Testo" is detected. After this hotkey, one or more further keywords are expected. Depending on the word recognized, actions like printing, saving or displaying the measured values graphically are provided. The "My Testo - Options" command opens additional submenus, where the language to be recognized, the sensitivity and other settings can be adjusted. One of the submenus displays the help page that contains the list of all voice commands. Most commands can also be executed via button touch; an exception is the command "My Testo - Reading", which causes the measured values to be read out via TTS. The "My Testo" command alone brings the app back to the start state.

3.3 Software Architecture

In accordance with the customer requirements, the project is developed as a Qt project. Qt compiles the project as a native C++ library, links it with the Pocketsphinx library and provides it with a small, generic Java app. The compilation is done for the target processor architecture. For this reason, the JDK, the Android SDK and the Android NDK must be installed on the development machine in addition to Qt. The resulting Java Android app loads at run-time the binary libraries which contain both the Qt components and the project's own code. As explained below, the text-to-speech component (TTS) was added to the project in the form of a native Java class. The described project combines several technologies and uses multiple programming concepts (Fig. 3).


In order to build the GUI, the declarative QML language was used. The controller is responsible for appropriate responses to the events (signals) from the GUI, like actuations of a button, and to voice commands from the voice processing. These responses are produced by event handlers (slots). Furthermore, the controller sends signals to the voice processing, for example to change the sensitivity value of the speech recognition. For the presentation of the plots, the QCustomPlot library is used. CustomPlotItem is a concrete implementation of the graph, which accesses the QCustomPlot library. Device simulates the measurement data and thus represents the measuring device.
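A minimal sketch of the wiring between the voice-processing wrapper and the controller described here and in Section 3.3.3 is shown below; the class and signal names follow the text, but the exact interfaces are assumptions.

```cpp
#include <QObject>
#include <QString>
#include <QDebug>

// Wrapper side (Section 3.3.3): emits recognized commands, accepts a sensitivity value.
class SpeechRecognition : public QObject {
    Q_OBJECT
signals:
    void commandRecognized(const QString &command);
public slots:
    void setSensitivity(double kwsThreshold) { m_threshold = kwsThreshold; }
private:
    double m_threshold = 1e-20;               // hypothetical default
};

// Controller side: reacts to GUI events and voice commands, forwards settings.
class Controller : public QObject {
    Q_OBJECT
public slots:
    void onCommandRecognized(const QString &command) { qDebug() << "command:" << command; }
signals:
    void sensitivityChanged(double kwsThreshold);
};

// Queued connections keep the cross-thread communication safe:
//   QObject::connect(&speech, &SpeechRecognition::commandRecognized,
//                    &controller, &Controller::onCommandRecognized, Qt::QueuedConnection);
//   QObject::connect(&controller, &Controller::sensitivityChanged,
//                    &speech, &SpeechRecognition::setSensitivity, Qt::QueuedConnection);
```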


3.3.1 Pocketsphinx Library Compilation

The Pocketsphinx library is available in the form of C source code. It must be compiled and included as a binary library in a Qt project for mobile devices. There are three possibilities to achieve this:
1. Creation of a native library with Pocketsphinx Android. Pocketsphinx Android is a project of the Sphinx/Pocketsphinx/SphinxBase family. With this tool, native libraries can be created for the selected processor type (ARM, ARMv7-A, x86, MIPS) using an Apache Ant script. These libraries and the corresponding header files can be copied and included into a Qt project.

Fig. 3. Architecture overview

2. Creation of an extra Qt project and building the libraries with it. These can then be used in the respective target Qt project.
3. Integration of the source code into each target Qt project. No external run-time library is necessary.
After weighing up the pros and cons, the choice fell on the first method. For the 3rd method, the fact that the Pocketsphinx source code needs to be added to every target Qt project represents a disadvantage. Likewise, the dependency on Qt spoke against the 2nd and 3rd methods. With the 1st method, a build script that always works with the current Pocketsphinx version is obtained. So that the linker can find the compiled Pocketsphinx library and the library is automatically delivered with the app, appropriate entries must be written in the Qt project file.

Fig. 4. Simplified Class Diagram

3.3.2 Audio recording

The audio recording is adapted via QIODevice, a general interface class for I/O operations in Qt. The class QAudioInput is responsible for the physical audio recording. This class has an internal buffer; once the buffer is filled with audio data, the content is transferred to the linked QIODevice. In audio recording, performance is very important (low memory consumption at high speed). Furthermore, the transfer and the writing of the audio data must be thread-safe. The recording should be able to run uninterrupted and for an unlimited time. In our case, however, the audio data are needed only once - after being transferred, the audio data are deleted. Therefore, a separate class AudioBuffer was created that inherits from QIODevice. With this class, the audio data are not stored but packed into a QByteArray and sent via a signal to all interested objects. This lightweight design provides low resource consumption and minimal redundancy whilst being robust and secure in use.
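The AudioBuffer class described above could look roughly as follows; this is a sketch under the assumptions stated in the text (write-only QIODevice, data forwarded as a QByteArray via a signal), not the original implementation, and the audio format values are assumptions.

```cpp
#include <QAudioFormat>
#include <QAudioInput>
#include <QByteArray>
#include <QIODevice>

// Write-only QIODevice: QAudioInput pushes recorded samples into writeData(),
// which forwards them immediately instead of storing them.
class AudioBuffer : public QIODevice {
    Q_OBJECT
public:
    explicit AudioBuffer(QObject *parent = nullptr) : QIODevice(parent) {
        open(QIODevice::WriteOnly);
    }
signals:
    void audioAvailable(const QByteArray &chunk);        // consumed e.g. by the recognizer thread
protected:
    qint64 readData(char *, qint64) override { return 0; }       // nothing to read back
    qint64 writeData(const char *data, qint64 len) override {
        emit audioAvailable(QByteArray(data, static_cast<int>(len)));
        return len;                                              // everything consumed, nothing kept
    }
};

// Typical setup (format values assumed; Pocketsphinx models commonly expect 16 kHz, 16-bit mono):
//   QAudioFormat fmt;
//   fmt.setSampleRate(16000); fmt.setChannelCount(1); fmt.setSampleSize(16);
//   fmt.setCodec("audio/pcm"); fmt.setSampleType(QAudioFormat::SignedInt);
//   QAudioInput input(fmt);
//   input.start(&buffer);      // 'buffer' is an AudioBuffer instance
```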

Fig. 5. testo Voice Control App and its Speech Commands




1. Keyword spotting (hotkey "My Testo") followed by recognition based on the language model. The project-specific language model was created with the web service cited above. With this service, a model file (*.lm) and a dictionary file (*.dic) are created. A simplified Nassi-Shneiderman diagram (NSD) is shown in Fig. 6.

3.3.3 Speech Recognition Wrapper

The SpeechRecognition class represents the wrapper class for voice recognition. It internally uses further classes, is based on the design pattern Visitor, and consistently uses QThreads. Communication between the wrapper and the controller takes place in both directions only by means of signals and slots. Through signals and slots, queued access is guaranteed. The wrapper only provides the voice recognition; reacting to the identified strings and making case decisions is a task for the controller.

3.3.4 Speech Recognition Modes

Pocketsphinx supports three types of models that describe the language to be recognized - keyword lists, grammars and statistical language models.

a) Keyword spotting: This mode is dedicated to the recognition of only a few different words. Because of this specialization the algorithm is highly optimized, leading to a very low resource consumption. In this mode, a hotkey or a keyword list to look for can be specified. The advantage is that a keyword spotting threshold (kws) can be defined for each keyword or keyphrase, so that the hotkey can be detected in continuous speech. All other modes will try to detect words from the grammar even if the speaker uses words which are not in it. The threshold sets the balance between missed detections and false alarms; threshold values between 10^-5 and 10^-50 are recommended.

b) Grammars describe a very simple type of language for command and control, and they are usually written by hand or generated automatically within the code. Grammars usually do not have probabilities for word sequences, but some elements might be weighted. Grammars can be created with the Java Speech Grammar Format (JSGF).

Fig. 6. NSD Keyword spotting & language model

2. Keyword spotting (hotkey "My Testo") followed by recognition based on a project-specific grammar created in JSGF format (Fig. 7).

JSGF is a Backus-Naur Form (BNF) style, platform-independent and vendor-independent textual representation of grammars for use in speech recognition. It is used by the Java Speech API (JSAPI). JSGF is derived from the styles and conventions of the Java programming language with additions from conventional grammar representations.

#JSGF V1.0;
grammar testoVoiceControl;
public = | | ;
= [ MENU ] | ;
= | ;
= LANGUAGE ;
= ENGLISH | GERMAN;
= | ;
= SENSITIVITY [ ];
= ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE | TEN;
= RECOGNITION MODE [ ];
= DIRECT SPEECH | STEP BY STEP;
= HELP;

c) Language model: Statistical language models can also describe more complex language. They contain probabilities of the words and word combinations. Those probabilities are estimated from an adequate data sample. Language models can be built with the CMU tool CMUCLMTK using ARPA model training, by other toolkits which generate ARPA text files (e.g. IRSTLM, MITLM, SRILM) or, in the case of a very simple English word pool, using the web service www.speech.cs.cmu.edu/tools/lmtool-new.html.

= SAVE | PRINT | GRAPHIC | START;
= STOP READING | START READING | CHECK RECORDING QUALITY;

Fig. 7. JSGF grammar file

In our project, the two alternative scenarios described above were implemented.
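Both scenarios map directly onto named "searches" of the Pocketsphinx C API, as sketched below; the model paths, search names and the threshold value are assumptions for illustration and not the project's actual configuration.

```cpp
#include <pocketsphinx.h>

// Create a decoder and register one search per mode (hotkey, language model, JSGF grammar).
static ps_decoder_t *createDecoder()
{
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
                                   "-hmm",  "model/en-us/en-us",       // acoustic model (assumed path)
                                   "-dict", "testo.dic",               // dictionary from the lmtool service
                                   "-kws_threshold", "1e-20",          // keyword spotting threshold
                                   NULL);
    ps_decoder_t *ps = ps_init(config);

    ps_set_keyphrase(ps, "wakeup", "my testo");                        // hotkey spotting
    ps_set_lm_file(ps, "commands_lm", "testo.lm");                     // scenario 1: language model
    ps_set_jsgf_file(ps, "commands_jsgf", "testoVoiceControl.jsgf");   // scenario 2: JSGF grammar

    ps_set_search(ps, "wakeup");                                       // start by listening for the hotkey
    ps_start_utt(ps);
    return ps;
}

// Fed with every audio chunk delivered by the AudioBuffer signal (Section 3.3.2).
static void processChunk(ps_decoder_t *ps, const int16 *samples, size_t n)
{
    ps_process_raw(ps, samples, n, FALSE, FALSE);
    int32 score;
    if (ps_get_hyp(ps, &score) != NULL) {      // hotkey detected in continuous speech
        ps_end_utt(ps);
        ps_set_search(ps, "commands_jsgf");    // or "commands_lm" for scenario 1
        ps_start_utt(ps);
    }
}
```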


3.3.5 Text-to-Speech (AndroidClass.java)


The Pocketsphinx library only deals with speech recognition. For speech synthesis - the generation of artificial speech - Qt offers two solutions: the QtSpeech library and the Qt Java Native Interface (QtJNI). Both technologies access the native code of the respective smartphone OS and provide a uniform interface for it.

Table 2. Workload in different scenarios

QtSpeech is a cross-platform library based on Qt. Here, the native code of the respective platform is invoked (Windows SAPI, Linux Festival, and the like); therefore QtSpeech provides a uniform interface. It is released under the LGPL license and can therefore also be used for commercial projects. The library is available as source code and must be compiled for the target device.

4. CONCLUSION AND PERSPECTIVE

Qt offers the module Qt Android Extras, the classes of which use QtJNI. Within it, the class QAndroidJniObject offers the method callStaticMethod, by means of which a static Java method can be executed at run-time. In this way, a Java class can be written that extends QtActivity and implements OnInitListener. This Java class uses the native Android TTS module. The words or sentences to be read out are prepared with a method of the class Controller. The transfer takes place via the QAndroidJniObject.
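A sketch of the C++ side of this hand-over is shown below: a static method of the Java class (AndroidClass.java) is called through QtJNI. The package name and the method signature are assumptions for illustration.

```cpp
#include <QAndroidJniObject>
#include <QString>

// Forward text to the Android TTS engine via a static method of the Java helper class.
void speakText(const QString &text)
{
    QAndroidJniObject javaText = QAndroidJniObject::fromString(text);
    QAndroidJniObject::callStaticMethod<void>(
        "com/example/testo/AndroidClass",   // hypothetical fully qualified class name
        "speak",                            // hypothetical static method using Android TTS
        "(Ljava/lang/String;)V",            // JNI signature: takes a String, returns void
        javaText.object<jstring>());
}
```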

The project demonstrated that voice control of the instruments is an interesting and future-oriented option. It was shown that an integration of the native Pocketsphinx library into Qt multi-platform applications is possible and that a modern smartphone has enough resources for reasonable local speech processing. Voice control has been verified for Android; the developments for iOS still lie ahead. The detection rate was acceptable for a plurality of speakers and in an office environment. For speakers with a special intonation or accent, or in a highly noisy environment, the detection rate was slightly less satisfactory. Here, the use of a system offering training promises more stable results. The current developments of the CMU Sphinx Group go in this direction: development of new acoustic model trainers and implementation of speaker adaptation, e.g. Maximum Likelihood Linear Regression (MLLR).

3.3.6 Feedback from the app

In order to inform the user whether his voice command was understood, three acoustic signals have been implemented:

• gentle, friendly tone for a recognized keyword
• friendly tone if a command was detected
• aggressive tone for an error

The other potential feedback possibilities like visual effects (color change, flashing) or vibrations were discarded.
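One possible way to play the three feedback tones from a Qt app is QSoundEffect; the sound file names below are placeholders, and the actual app may use a different mechanism.

```cpp
#include <QSoundEffect>
#include <QUrl>

enum class Feedback { KeywordRecognized, CommandDetected, Error };

// Plays one of the three acoustic signals described above (sound files are placeholders).
void playFeedback(Feedback kind)
{
    static QSoundEffect effect;
    switch (kind) {
    case Feedback::KeywordRecognized: effect.setSource(QUrl("qrc:/sounds/keyword.wav")); break;
    case Feedback::CommandDetected:   effect.setSource(QUrl("qrc:/sounds/command.wav")); break;
    case Feedback::Error:             effect.setSource(QUrl("qrc:/sounds/error.wav"));   break;
    }
    effect.play();
}
```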

REFERENCES

Harvey, A.P., McCrindle, R.J., Lundqvist, K., Parslow, P. (2010). Automatic speech recognition for assistive technology devices, pp. 273-282, 8th Intl Conf. on Disability, Virtual Reality & Associated Technologies, Valparaíso, Chile.
Huggins-Daines, D. (2015). Pocketsphinx API Documentation, http://cmusphinx.sourceforge.net/doc/pocketsphinx/, Carnegie Mellon University, Pittsburgh, PA.
Huggins-Daines, D., Kumar, M., Chan, A., Black, A.W., Ravishankar, M., Rudnicky, A.I. (2006). Pocketsphinx: A Free, Real-Time Continuous Speech Recognition System for Hand-Held Devices, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toulouse.
Pfister, B., Kaufmann, T. (2008). Sprachverarbeitung, Springer-Verlag, Berlin Heidelberg.
Sorensen, H.V., Jones, D.L., Heideman, M.T., Burrus, C.S. (1987). Real-valued fast Fourier transform algorithms, pp. 849-863, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 6.
Spale, J. (2009). Spectral monitoring with netX 500, pp. 245-248, IEEE Conference, Pilsen.

3.3.7 Documentation

The documentation of the project was created with Doxygen.

3.3.8 Tests

An Alcatel 997D (2012) was used as the testing smartphone. WLAN, Bluetooth, 3G and standby were disabled and no other apps were running. The keyword spotting threshold was set to 10^-20. We tested the keyword recognition rate, the processing time required for it and the battery consumption. The results of the workload tests are shown in Table 2. It can be seen that the actual detection has only a minimal resource consumption as long as no keyword is detected. The results, however, can be influenced by the fact that the measuring thread was running with a lower priority alongside the plot and the start screen. Recording the sound and processing the buffer are the most complex operations. The battery consumption was measured in a long-term test with the testo Voice Control app active. According to the Android battery indicator, the testo Voice Control app accounts for approximately 3% of the total battery consumption. The battery power was consumed after 14 hours and 15 minutes. Thus, the client's requirement of 6 hours of endurance was widely exceeded.