Available online at www.sciencedirect.com
ScienceDirect
www.elsevier.com/locate/procedia
Procedia Computer Science 116 (2017) 87–98
2nd International Conference on Computer Science and Computational Intelligence 2017, ICCSCI 2017, 13-14 October 2017, Bali, Indonesia
Implementation of Blind Speech Separation for Intelligent Humanoid Robot using DUET Method
Alexander A S Gunawan a,*, Albert Stevelino b, Heri Ngarianto b, Widodo Budiharto b, Rini Wongso b
a Mathematics Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
b Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
Abstract

Nowadays, there are many efforts in building intelligent humanoid robots and adding advanced abilities such as Blind Speech Separation (BSS). BSS is the problem of separating several speech signals in the real world from a mono or stereo audio record. In this research, we implement a BSS system using the DUET algorithm, which allows separating any number of sources using only stereo (two) mixtures. The DUET (Degenerate Unmixing Estimation Technique) algorithm replaces our previous FastICA (Fast Independent Component Analysis) method, which only succeeded in simulation but failed in the implementation. The main problem of FastICA is that it assumes instantaneous mixing without time delay in the recording process. To deal with audio records in the presence of inevitable time delays, it has to be replaced with the DUET algorithm in order to separate well in real time. Finally, the DUET algorithm is implemented in a humanoid robot which is developed using a Raspberry Pi and equipped with a RaspPi Cam to detect the human face. Furthermore, the Cirrus Logic Audio Card is stacked on the Raspberry Pi in order to record stereo audio. In our experiments, there are three controlled variables to evaluate the algorithm performance: distance, number of sources, and subject's name. The robot records stereo audio for four seconds after a face is detected by the system. The recording is then separated by the DUET algorithm, producing two source estimations with an average computation time of 1.8 seconds. With the Google Speech API, the recognition accuracy of separated speech varies between 40%-70%.

© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the 2nd International Conference on Computer Science and Computational Intelligence 2017.

Keywords: intelligent humanoid robot; DUET; blind speech separation; speech recognition
* Corresponding author. Tel.: +628175001010
E-mail address: [email protected]
1877-0509 © 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the 2nd International Conference on Computer Science and Computational Intelligence 2017.
10.1016/j.procs.2017.10.014
1. Introduction

Service robots can be defined as robots that sense, think, and act to perform services for the well-being of humans [1]. The main advantage of service robots is that they can work nonstop with consistent quality. Currently, service robots are developed for various applications, including edutainment, guidance and office work, inspection and surveillance, etc. For suitable service, human interaction capability is a must; hence face and voice detection and recognition have become the main features of service robots. Research in detecting faces and voices has been well developed around the world, but it is still a problem for service robots to separate several simultaneous voice inputs. As a result, robots are not capable of giving appropriate feedback or corresponding output in a natural environment. In fact, voice sources do not come from just one direction, yet humans can separate and recognize the sources very well. Furthermore, voice sources are also mixed with background noise in a natural environment, so we have to focus auditory attention on a particular source while filtering out the background noise. This phenomenon is called the cocktail party effect [2]. The objective here is to isolate the speech of one particular speaker from all the other sounds made by the rest of the party. The class of techniques for solving this problem, which need no prior knowledge about the mixing ratios of the various sources, is called Blind Speech Separation (BSS) [3], and our service robots need to implement such a technique. In our initial attempt [4], we employed FastICA (Fast Independent Component Analysis) to solve the BSS problem. FastICA is a decomposition method capable of decomposing mixed signals into additive subcomponents in real time. In our previous experiment, the FastICA method only succeeded in simulation but failed in the implementation because of inevitable time delays.
There are several improvements of FastICA; for example, Takatani et al. [5] proposed a blind signal decomposition algorithm based on the Single-Input Multiple-Output (SIMO) acoustic signal model using an extended ICA algorithm, called SIMO-ICA. The separated signals of SIMO-ICA can maintain the spatial qualities of each speech source compared to the conventional method. Nevertheless, all extensions of the ICA algorithm suffer from its fundamental assumption: the source signals must be independent of each other. This assumption requires the mixed signals to be instantaneous mixtures of sources without time delay. This requirement is impossible to fulfill in the implementation stage because there are inevitable time delays during stereo input recording. ICA fails whenever any delay is present between the sources [6]. Therefore, to deal with mixtures of sources containing delay, the Degenerate Unmixing Estimation Technique (DUET) was developed [7][8]. Rickard [8] states that two or more sources can be well separated from a mixture of sources containing small delays with the DUET algorithm. The underlying framework of DUET is a clustering method similar to the van Hulle method [9]. The main difference is that DUET uses time–frequency representations, whereas the van Hulle method employed the space representation and tried to estimate the mixing matrix like the ICA approach. Furthermore, DUET can blindly separate an arbitrary number of sources given just two input mixtures [8]. In the DUET experimental results [7], the algorithm works very well for synthetic mixtures, similar to the ICA method. Furthermore, DUET also works well on real mixtures of speech recorded in an anechoic room. Anechoic means there are no echoes, that is, reflections of sound that arrive with a delay after the direct sound. Because DUET is based on the assumption of anechoic mixing, the quality of the demixing is reduced in echoic room experiments.
The experimental results also show that DUET can estimate arbitrary sources using just two mixtures and works optimally up to three source estimations; with more than three sources, the error increases significantly. Based on these experiments [7], we use DUET to estimate arbitrary sources in a humanoid robot intended to serve as a service robot. The robot is designed like a human, based on the Raspberry Pi platform, and equipped with a RaspPi Cam to detect human faces. Furthermore, the Cirrus Logic Audio Card is stacked on the Raspberry Pi in order to record stereo audio from two microphones. Our experiment design is similar to the research of Baeck et al. [10] in order to evaluate the algorithm performance in a real environment. The research was conducted to answer the following problem: how well does the humanoid robot estimate the speech sources based on the DUET algorithm and convert those sources into text? Hence, the purposes of this research are: (i) to develop a humanoid robot capable of estimating each source and converting it to text, and (ii) to evaluate the text conversion accuracy.
The remainder of this paper is organized as follows: first we discuss the DUET algorithm in section 2, followed by its implementation in our humanoid robot, called RAPIRO, in section 3. In section 4, we report the experimental results of blind speech separation based on the robotic implementation. Finally, we summarize our work with notes on future research in section 5.

2. Degenerate Unmixing Estimation Technique (DUET)

Blind speech separation (BSS) is the separation of speech signals from a mono or stereo audio record, with little information about the speech signals or the mixing process. This problem is in general very underdetermined because there are more sources than mixtures. Nevertheless, some solutions can be derived under restricted assumptions. Our initial attempt [4] employed FastICA (Fast Independent Component Analysis) to solve the BSS problem. FastICA assumes that the signals are statistically independent, thus it is only valid for instantaneous mixtures of sources. FastICA fails in the implementation because there are inevitable time delays during the recording process. To deal with mixtures of sources involving delay, DUET (Degenerate Unmixing Estimation Technique) was developed. DUET provides very good results when a small delay is involved in the mixing of sources. Although DUET relies on several assumptions: anechoic mixing, W-disjoint orthogonality (WDO) of the sources, closely spaced microphones for recording, and local stationarity [8], it can reconstruct the component sources from just two mixtures [7]. DUET uses a binary mask to separate the mixtures into their component sources based on the WDO assumption [8]. To solve the BSS problem, the implementation of DUET uses the stereo audio record from the Cirrus Logic Audio Card, which is stacked on the Raspberry Pi. We can write the stereo input as follows:

$x(t) = [x_1(t), x_2(t)]^T$

The objective of the BSS problem is to estimate the $N$ source signals, that is:

$s(t) = [s_1(t), \ldots, s_N(t)]^T$
In the following subsections, we discuss the assumptions of DUET which lead to the main solution.

2.1. Anechoic Mixing
Suppose $N$ source signals, $s_j(t)$, $j = 1, \ldots, N$, are received at a pair of microphones. Without loss of generality, it can be assumed that the attenuation and delay parameters of the first mixture are absorbed into the definition of the sources. The two anechoic mixtures representing the stereo input can be stated as:

$x_1(t) = \sum_{j=1}^{N} s_j(t)$

$x_2(t) = \sum_{j=1}^{N} a_j s_j(t - \delta_j)$    (1)

where $N$ is the number of sources, $a_j$ is the relative attenuation factor between the sources and the sensors, and $\delta_j$ is the arrival delay between the sensors.

2.2. W-Disjoint Orthogonality (WDO)

Two functions $s_j(t)$ and $s_k(t)$ are called W-disjoint orthogonal (WDO) if the windowed Fourier transforms of $s_j(t)$ and $s_k(t)$ are disjoint. The windowed Fourier transform of $s_j(t)$ can be defined as:

$\hat{s}_j(\tau, \omega) := F^W[s_j](\tau, \omega) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} W(t - \tau)\, s_j(t)\, e^{-i\omega t}\, dt$    (2)

where $W(t)$ is the windowing function. The WDO assumption can be stated briefly as:
$\hat{s}_j(\tau, \omega)\, \hat{s}_k(\tau, \omega) = 0, \quad \forall \tau, \omega,\; \forall j \neq k$

If the windowing function $W(t) \equiv 1$, then $\hat{s}_j(\tau, \omega)$ becomes the Fourier transform of $s_j(t)$, which can be denoted as $\hat{s}_j(\omega)$. In that case, the WDO assumption can be written as:

$\hat{s}_j(\omega)\, \hat{s}_k(\omega) = 0, \quad \forall \omega,\; \forall j \neq k$

The WDO assumption is important to DUET because it allows separating the mixture into its component sources using a binary mask, which can be constructed as follows:

$M_j(\tau, \omega) := \begin{cases} 1 & \hat{s}_j(\tau, \omega) \neq 0 \\ 0 & \text{otherwise} \end{cases}$

The binary mask $M_j$ separates $\hat{s}_j$ from the mixture by:

$\hat{s}_j(\tau, \omega) = M_j(\tau, \omega)\, \hat{x}_1(\tau, \omega), \quad \forall \tau, \omega$

2.3. Local Stationarity
The Fourier transform of a delayed function is:

$s_j(t - \delta) \leftrightarrow e^{-i\omega\delta}\, \hat{s}_j(\omega)$

Therefore, the windowed Fourier transform of a delayed function, which can be derived from equation (2), is:

$F^W[s_j(\cdot - \delta)](\tau, \omega) = e^{-i\omega\delta}\, F^W[s_j(\cdot)](\tau, \omega)$

The local stationarity assumption is formally written as follows:

$F^W[s_j(\cdot - \delta)](\tau, \omega) = e^{-i\omega\delta}\, F^W[s_j(\cdot)](\tau, \omega), \quad \forall \delta, |\delta| \leq \Delta$

where $\Delta$ is the maximum time difference in the mixing model.

2.4. Closely Spaced Microphones

In subsection 2.3, we exploited the local stationarity assumption to turn the time delay into a multiplicative factor in time–frequency. The multiplicative factor $e^{-i\omega\delta}$ only uniquely specifies $\delta$ if $|\omega\delta| < \pi$; the other cases are ambiguous due to phase-wrap. Thus, we have to require that:

$|\omega \delta_j| < \pi, \quad \forall \omega, \forall j$

This is guaranteed when the microphones are placed close to each other.

2.5. Separating the Sources

Using the anechoic mixing and local stationarity assumptions, we can write the mixing equation (1) in the time–frequency domain as:

$\begin{bmatrix} \hat{x}_1(\tau, \omega) \\ \hat{x}_2(\tau, \omega) \end{bmatrix} = \begin{bmatrix} 1 & \ldots & 1 \\ a_1 e^{-i\omega\delta_1} & \ldots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} \hat{s}_1(\tau, \omega) \\ \vdots \\ \hat{s}_N(\tau, \omega) \end{bmatrix}$

Furthermore, assuming that at most one source is active at every $(\tau, \omega)$ based on the WDO assumption, the mixing process can be described as:

$\begin{bmatrix} \hat{x}_1(\tau, \omega) \\ \hat{x}_2(\tau, \omega) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} \hat{s}_j(\tau, \omega) \quad \text{for each } (\tau, \omega), \text{ for some } j$

In the above equation, the active index $j$ depends on $(\tau, \omega)$. The fundamental observation of DUET is that the ratio of the time–frequency representations depends only on the mixing parameters:

$\frac{\hat{x}_2(\tau, \omega)}{\hat{x}_1(\tau, \omega)} = a_j e^{-i\omega\delta_j}, \quad \forall (\tau, \omega) \in \Omega_j$

where $\Omega_j := \{(\tau, \omega) : \hat{s}_j(\tau, \omega) \neq 0\}$. In conclusion, the mixing parameters can be calculated with:
$\tilde{a}(\tau, \omega) := \left| \frac{\hat{x}_2(\tau, \omega)}{\hat{x}_1(\tau, \omega)} \right|$

$\tilde{\delta}(\tau, \omega) := -\frac{1}{\omega} \angle \frac{\hat{x}_2(\tau, \omega)}{\hat{x}_1(\tau, \omega)}$

Under the closely spaced microphones assumption, the delay estimation should be accurate because the local attenuation estimator $\tilde{a}(\tau, \omega)$ and the local delay estimator $\tilde{\delta}(\tau, \omega)$ can only take on the values of the actual mixing parameters. Finally, we can demix the mixture via binary masking if we can determine the indicator function of each source. For a given active index $j$, the indicator function can be written as:

$M_j(\tau, \omega) := \begin{cases} 1 & (\tilde{a}(\tau, \omega), \tilde{\delta}(\tau, \omega)) = (a_j, \delta_j) \\ 0 & \text{otherwise} \end{cases}$

After applying clustering techniques to the attenuation and delay estimates, the number of clusters gives the number of sources, and the cluster centres are the optimal estimates of the attenuation and delay associated with each source. By constructing a suitable indicator function $\tilde{M}_j$, we can estimate the source signals with:

$\hat{\tilde{s}}_j(\tau, \omega) = \tilde{M}_j(\tau, \omega) \left( \frac{\hat{x}_1(\tau, \omega) + \tilde{a}_j e^{i\tilde{\delta}_j \omega}\, \hat{x}_2(\tau, \omega)}{1 + \tilde{a}_j^2} \right)$

In the end, we reconstruct the sources from the time–frequency representations by converting back into the time domain. A summary of the DUET algorithm can be found in the literature [8].
3. Implementation in Humanoid Robot

A humanoid robot is a robot built to resemble the human body. In general, a humanoid robot has a torso, a head, two arms, and two legs, though a simpler version may model only part of the body. RAPIRO is a small and affordable robot kit designed to work with a Raspberry Pi. It comes with an Arduino-compatible servo controller. Figure 1 shows the fully-assembled RAPIRO and its microcontroller board:
Fig. 1. (a) RAPIRO and (b) its microcontroller board (http://www.rapiro.com)
In order for RAPIRO to become an intelligent humanoid robot, it needs to be extended with a microprocessor board, the Raspberry Pi; in our research we used the Raspberry Pi 2 Model B. Furthermore, RAPIRO needs a visual input to detect faces, recognize faces, and recognize objects. For this purpose, the camera module is installed on the Raspberry Pi. Finally, to receive stereo audio input, the external Cirrus Logic Audio Card is stacked on the Raspberry Pi, because the Raspberry Pi has no port ready for stereo input. The Raspberry Pi installation, together with its camera module and the additional audio card, can be seen in Figure 2.
Fig. 2. (a) Raspberry Pi 2 Model B, (b) camera module, and (c) Cirrus Logic Audio Card
After RAPIRO has been assembled, the next step is to develop the system that performs speech separation using the DUET method. Our objective is for RAPIRO to separate the sources of sounds that arrive simultaneously and provide feedback in the form of a confirmation. The system is developed using Python (2.7) and several of its libraries: nussl (0.1.3), SpeechRecognition (3.4.6), PyAudio (0.2.9), and gTTS (1.1.6). The last step is to install the system on RAPIRO and test the results of speech separation using the DUET algorithm. The conversion of speech to text is done by means of the microphones and the Google Speech API. The steps of system usage (see Figure 3) are:
1. The blue LED light in RAPIRO's eyes indicates that the system is in the ready state.
2. The user approaches the robot within camera range, so that the face is detected by the system.
3. The LED light turns green, indicating the face has been detected; the robot greets the user and then asks the user for his needs. In our experiment, the robot asks for the name of the person the user wants to meet.
4. When the LED light is red, the robot records the conversation for four seconds. In that duration, the user says who the person to meet is.
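The recording and recognition steps above can be sketched as follows. This is a minimal illustration, not the robot's actual code: the deinterleaving helper is ours, the PyAudio device parameters are assumptions about the Cirrus Logic card, and `id-ID` is our guess at the language code passed to the Google Speech API through the SpeechRecognition library.

```python
import numpy as np

RATE, SECONDS, CHUNK = 16000, 4, 1024  # 16 kHz, 4-second recording (section 4)


def deinterleave(frames, dtype=np.int16):
    """Split raw interleaved stereo PCM bytes (L, R, L, R, ...) into two
    float arrays, one per microphone, ready for the DUET separation step."""
    data = np.frombuffer(frames, dtype=dtype)
    return data[0::2].astype(np.float64), data[1::2].astype(np.float64)


def record_stereo(seconds=SECONDS, rate=RATE):
    """Record stereo audio (runs on the robot; needs PyAudio and the card)."""
    import pyaudio  # imported lazily: only present on the Raspberry Pi
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=2, rate=rate,
                     input=True, frames_per_buffer=CHUNK)
    frames = b''.join(stream.read(CHUNK)
                      for _ in range(rate * seconds // CHUNK))
    stream.stop_stream(); stream.close(); pa.terminate()
    return frames


def transcribe(wav_path, language='id-ID'):
    """Send one separated source to the Google Speech API (needs network)."""
    import speech_recognition as sr
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    return recognizer.recognize_google(audio, language=language)
```

On the robot, the flow would be: `record_stereo` during the red-LED phase, `deinterleave` into the two microphone channels, DUET separation, then `transcribe` each separated source written to a WAV file.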
Fig. 3. (1) Robot in ready state, (2) face detection, (3) greeting state, and (4) recording state
4. Experiment Results

In order to evaluate the proposed method, experiments were carried out using our humanoid robot RAPIRO. The experiments are implemented through eight scenarios based on the combination of three parameters: (i) Distance between microphones (m) and speech sources (s). In our experiments, the distance is divided into two categories: the far and near settings, shown in Figure 4.

Fig. 4. Setting for microphones (m) and speech sources (s): (a) far setting, with the sources 60 cm from the microphones and 25 cm to each side; (b) near setting, with the sources 15 cm from the microphones and 10 cm to each side
(ii) Number of speech sources. The experiments used at most two speech sources; for two speech sources, one male voice and one female voice were used. (iii) Type of speech. In the experiments, the user mentions the name of the person that he wants to meet. There are two speech variations here: the person's name only, and the name with a greeting such as Bapak, Ibu, Saudara, or Saudari. Each scenario is repeated in ten experiments. The evaluation is done by calculating the accuracy of the separated speeches in text form by means of the Google speech-to-text API: the accuracy is the number of correctly transcribed speeches divided by the number of experiments. The restrictions in our experiments are as follows:
• Formal Bahasa Indonesia is used.
• Speech and intonation are spoken clearly.
• No noise is assumed.
• 16 kHz sampling rate and 4 seconds recording duration.
• The hop size of the Short-Time Fourier Transform windowing function is 1024.
• Only four names are available to be detected.
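As a quick sanity check of these settings, the time–frequency resolution implied by a 1024-sample window at 16 kHz can be computed with SciPy (this is our reading of the restriction above; the paper's own code uses the nussl library):

```python
import numpy as np
from scipy.signal import stft

FS, DURATION, NPERSEG = 16000, 4, 1024   # restrictions listed above

x = np.zeros(FS * DURATION)              # stand-in for one recorded channel
f, t, X = stft(x, fs=FS, nperseg=NPERSEG)

bin_width_hz = f[1] - f[0]               # 16000 / 1024 = 15.625 Hz per bin
hop_seconds = t[1] - t[0]                # default hop nperseg/2 -> 32 ms
```

So each 4-second recording yields a time–frequency grid of 513 one-sided frequency bins (15.625 Hz wide) advancing in 32 ms steps, which is the grid on which the DUET masks of section 2 are built.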
The following figure (Figure 5) is an example of speech separation in our experiment.
Fig. 5. Example of speech separation result
Results of our experiments for the eight scenarios are recorded in the following tables.

Scenario 1: use Far setting, two speech sources, and speaking only the person name

Table 1. Result of first scenario
ID | Input 1    | Input 2    | Output 1  | Output 2 | Time (second)
1  | Albert     | Eko        | Albert    | Eko      | 1.81869297
2  | Calvindoro | Andi       | Sindoro   | Andi     | 1.692121077
3  | Alexander  | Charles    | Alexander | Charles  | 1.78068881
4  | Widodo     | Indra      | Widodo    | Indra    | 1.724117088
5  | Budi       | Widodo     | Budi      | Widodo   | 1.735452938
6  | Charles    | Charles    | X         | Charles  | 1.746630955
7  | Daniel     | Daniel     | Daniel    | Daniel   | 1.768029022
8  | Eko        | Albert     | Iko       | Albert   | 1.953551102
9  | Andi       | Calvindoro | X         | X        | 1.870830822
10 | Indra      | Budi       | X         | Drag     | 1.709706116
Average DUET Computation Time: 1.77998209
The accuracy of the first scenario is 65%.
Scenario 2: use Far setting, two speech sources, and speaking the name with greetings

Table 2. Result of second scenario
ID | Input 1              | Input 2              | Output 1             | Output 2             | Time (second)
1  | Saudara Albert       | Saudara Eko          | saudara Albert       | saudara Eko          | 1.832210827
2  | Saudara Calvindoro   | Bapak Andi           | X                    | X                    | 2.645637083
3  | Bapak Alexander      | Saudara Charles      | Charles              | Bapak Alexander      | 1.778987932
4  | Bertemu Bapak Widodo | Saudara Indra        | Contoh               | X                    | 1.841015863
5  | Saudara Budi         | Bertemu Bapak Widodo | saudara Budi         | bertemu Bapak Widodo | 1.727251101
6  | Saudara Charles      | Saudara Charles      | saudara Serius       | X                    | 1.794992018
7  | Saudara Daniel       | Saudara Daniel       | saudara Daniel       | X                    | 1.782523918
8  | Saudara Eko          | Saudara Albert       | X                    | saudara Eko          | 1.742737103
9  | Bapak Andi           | Saudara Calvindoro   | saudara Calvindoro   | bapak Andi           | 1.782528925
10 | Saudara Indra        | Saudara Budi         | saudara Indra        | saudara Budi         | 1.824796963
Average DUET Computation Time: 1.875268173
The accuracy of the second scenario is 60%.
Scenario 3: use Near setting, two speech sources, and speaking only the person name

Table 3. Result of third scenario
ID | Input 1    | Input 2    | Output 1 | Output 2    | Time (second)
1  | Albert     | Eko        | Eko      | X           | 1.832210827
2  | Calvindoro | Andi       | Andi     | Sindoro     | 2.645637083
3  | Alexander  | Charles    | Charles  | Sumber      | 1.778987932
4  | Widodo     | Indra      | Indra    | X           | 1.841015863
5  | Budi       | Widodo     | Widodo   | Budi        | 1.727251101
6  | Charles    | Charles    | X        | X           | 1.794992018
7  | Daniel     | Daniel     | Daniel   | X           | 1.782523918
8  | Eko        | Albert     | Albert   | X           | 1.742737103
9  | Andi       | Calvindoro | Dendi    | Alfin Nurul | 1.782528925
10 | Indra      | Budi       | X        | X           | 1.824796963
Average DUET Computation Time: 1.875268173
The accuracy of the third scenario is 40%.
Scenario 4: use Near setting, two speech sources, and speaking the name with greetings

Table 4. Result of fourth scenario
ID | Input 1              | Input 2              | Output 1             | Output 2      | Time (second)
1  | Saudara Albert       | Saudara Eko          | saudara Albert       | saudara Eko   | 1.87324791
2  | Saudara Calvindoro   | Bapak Andi           | pak Andi             | hotel sindoro | 1.815362024
3  | Bapak Alexander      | Saudara Charles      | saudara Charles      | X             | 1.793608952
4  | Bertemu Bapak Widodo | Saudara Indra        | X                    | Bapak Widodo  | 1.775599051
5  | Saudara Budi         | Bertemu Bapak Widodo | bertemu Bapak Widodo | saudara Budi  | 1.908370066
6  | Saudara Charles      | Saudara Charles      | saudara harus        | X             | 1.893497992
7  | Saudara Daniel       | Saudara Daniel       | saudara Daniel       | X             | 1.859664965
8  | Saudara Eko          | Saudara Albert       | saudara Albert       | saudara Eko   | 1.927908945
9  | Bapak Andi           | Saudara Calvindoro   | saudara Calvindoro   | Andi          | 1.934961843
10 | Saudara Indra        | Saudara Budi         | saudara indra        | saudara Budi  | 1.788260031
Average DUET Computation Time: 1.857048178

The accuracy of the fourth scenario is 70%.
Scenario 5: use Far setting, one speech source, and speaking only the person name

Table 5. Result of fifth scenario
ID | Input 1    | Output 1   | Output 2  | Time (second)
1  | Albert     | X          | Albert    | 1.793345022
2  | Calvindoro | Calvindoro | X         | 1.912111092
3  | Alexander  | X          | Alexander | 1.69664793
4  | Widodo     | Widodo     | X         | 1.751409817
5  | Budi       | Budi       | X         | 1.754013109
6  | Charles    | X          | Charles   | 1.760180998
7  | Daniel     | Dunia      | X         | 1.866503048
8  | Eko        | X          | Eko       | 1.818609047
9  | Andi       | X          | Andi      | 1.734113979
10 | Indra      | Indra      | X         | 1.781472015
Average DUET Computation Time: 1.786840606
The accuracy of the fifth scenario is 90%.
Scenario 6: use Far setting, one speech source, and speaking the name with greetings

Table 6. Result of sixth scenario
ID | Input 1              | Output 1             | Output 2       | Time (second)
1  | Saudara Albert       | X                    | saudara Albert | 1.720284987
2  | Saudara Calvindoro   | saudara Calvindoro   | X              | 1.868231821
3  | Bapak Alexander      | Bapak Alexander      | X              | 1.860559988
4  | Bertemu Bapak Widodo | bertemu Bapak Widodo | X              | 1.742484856
5  | Saudara Budi         | saudara Budi         | X              | 1.938750076
6  | Saudara Charles      | saudara Charles      | X              | 1.879053879
7  | Saudara Daniel       | saudara Daniel       | X              | 1.771276999
8  | Saudara Eko          | saudara Eko          | X              | 1.814157057
9  | Bapak Andi           | saudara Andi         | X              | 1.813770103
10 | Saudara Indra        | saudara Indra        | X              | 1.794739056
Average DUET Computation Time: 1.820330882
The accuracy of the sixth scenario is 100%.
Scenario 7: use Near setting, one speech source, and speaking only the person name

Table 7. Result of seventh scenario
ID | Input 1    | Output 1 | Output 2  | Time (second)
1  | Albert     | X        | Albert    | 1.797313976
2  | Calvindoro | X        | Sindoro   | 1.781812
3  | Alexander  | X        | Alexander | 1.764442968
4  | Widodo     | X        | Widodo    | 1.840017128
5  | Budi       | X        | Budi      | 1.76946311
6  | Charles    | X        | Charles   | 1.959310818
7  | Daniel     | X        | Daniel    | 1.664107847
8  | Eko        | X        | Eko       | 1.664107847
9  | Andi       | X        | Andi      | 1.801404047
10 | Indra      | Indra    | X         | 1.792418051
Average DUET Computation Time: 1.783439779
The accuracy of the seventh scenario is 90%.
Scenario 8: use Near setting, one speech source, and speaking the name with greetings

Table 8. Result of eighth scenario
ID | Input 1              | Output 1             | Output 2           | Time (second)
1  | Saudara Albert       | X                    | saudara Albert     | 1.806759882
2  | Saudara Calvindoro   | X                    | saudara Calvindoro | 1.697113085
3  | Bapak Alexander      | X                    | Bapak Alexander    | 1.795109797
4  | Bertemu Bapak Widodo | bertemu Bapak Widodo | X                  | 1.890004921
5  | Saudara Budi         | X                    | saudara Budi       | 1.821046162
6  | Saudara Charles      | X                    | saudara Charles    | 1.807483959
7  | Saudara Daniel       | X                    | X                  | 1.813742924
8  | Saudara Eko          | X                    | saudara Eko        | 1.749439049
9  | Bapak Andi           | X                    | Bapak Andi         | 1.788173008
10 | Saudara Indra        | X                    | Saudara Indra      | 1.683855104
Average DUET Computation Time: 1.785272789
The accuracy of the eighth scenario is 90%.
5. Conclusion

In this paper, we have successfully implemented a blind speech separation system for our humanoid robot using the DUET algorithm. It is also shown that the DUET algorithm can handle the inevitable time delays during the recording process. Based on the experimental results across the eight scenarios, there are three main conclusions: (i) From scenarios 5 to 8, we can conclude that DUET gives excellent results, with 90%-100% accuracy, when there is only one speech source. These results show that the source estimation is not strongly affected by the distance and speech variation parameters. (ii) From scenarios 1 to 4, it can be concluded that with two sources the accuracy of DUET is influenced by the distance and speech variation parameters. In scenarios 1 and 2, the distance parameter reduces the accuracy to just 60%-65%; in these scenarios, the Far setting affects the recording process, where the voice may be interfered with by other noises, and this phenomenon appears to decrease accuracy in scenario 2. In scenarios 3 and 4, the speech variation parameter greatly affects the accuracy: in scenario 4, where the Google API can anticipate the subject's name based on the common greeting words, the accuracy reaches 70%, whereas in scenario 3, where the Google API has to estimate the subject's name without any clue, the accuracy drops to 40%. (iii) On the Raspberry Pi 2, the average DUET computation time is about 1.8 seconds for the 4-second speech signals captured in each recording.
References

1. Teresa Z. History of service robots. In: Ceccarelli M, editor. Service Robots and Robotics: Design and Application. IGI Global; 2012. p. 1-14.
2. Bronkhorst AW. The cocktail-party problem revisited: early processing and selection of multi-talker speech. Attention, Perception & Psychophysics. 2015; 77(5): p. 1465-1487.
3. Yu X, Hu D, Xu J. Blind Source Separation: Theory and Applications. 1st ed. Wiley; 2014.
4. Budiharto W, Gunawan AAS. Blind speech separation system for humanoid robot with FastICA for audio filtering and separation. In: First International Workshop on Pattern Recognition; 2016; Tokyo. p. 1001113-1-1001113-4.
5. Takatani T, Nishikawa T, Saruwatari H, Shikano K. SIMO-model-based independent component analysis for high-fidelity blind separation of acoustic signals. In: 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003); 2003; Nara, Japan. p. 993-998.
6. Singh A, Anand RS. Overview of the performance of FastICA and DUET for blind source separation of multiple speakers. In: National Conference on Recent Advances in Electronics & Computer Engineering (RAECE); 2015; Roorkee, India. p. 296-300.
7. Yılmaz O, Rickard S. Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing. 2004; 52(7): p. 1830-1847.
8. Rickard S. The DUET blind source separation algorithm. In: Makino S, Sawada H, Lee TW, editors. Blind Speech Separation. Springer Netherlands; 2007. p. 217-241.
9. Van Hulle MM. Clustering approach to square and non-square blind source separation. In: IEEE Neural Networks for Signal Processing IX; 1999; Madison, USA.
10. Baeck M, Zolzer U. Real-time implementation of a source separation algorithm. In: 6th International Conference on Digital Audio Effects; 2003; London. p. 29-34.