Implementation of Blind Speech Separation for Intelligent Humanoid Robot using DUET Method


Procedia Computer Science 116 (2017) 87–98

2nd International Conference on Computer Science and Computational Intelligence 2017, ICCSCI 2017, 13-14 October 2017, Bali, Indonesia

Implementation of Blind Speech Separation for Intelligent Humanoid Robot using DUET Method

Alexander A S Gunawan a,*, Albert Stevelino b, Heri Ngarianto b, Widodo Budiharto b, Rini Wongso b

a Mathematics Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia
b Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia

Abstract

Nowadays, there are many efforts in building intelligent humanoid robots and adding advanced abilities such as Blind Speech Separation (BSS). BSS is the problem of separating several speech signals in the real world from a mono or stereo audio record. In this research, we implement a BSS system using the DUET algorithm, which allows separating any number of sources using only stereo (two) mixtures. The DUET (Degenerate Unmixing Estimation Technique) algorithm replaces our previous FastICA (Fast Independent Component Analysis) method, which only succeeded in simulation but failed in the implementation. The main problem of FastICA is that it assumes instantaneous mixing without time delay in the recording process. To deal with audio records in the presence of inevitable time delays, it has to be replaced with the DUET algorithm, which separates well in real time. Finally, the DUET algorithm is implemented in a humanoid robot which is developed using a Raspberry Pi and equipped with a RaspPi Cam to detect human faces. Furthermore, the Cirrus Logic Audio Card is stacked on the Raspberry Pi in order to record stereo audio. In our experiments, there are three controlled variables to evaluate algorithm performance: distance, number of sources, and subject's name. The robot records stereo audio for four seconds after a face is detected by the system. The recording is then separated by the DUET algorithm, producing two source estimations with an average computation time of 1.8 seconds. With the Google Speech API, the recognition accuracy of the separated speech varies between 40% and 70%.

© 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the 2nd International Conference on Computer Science and Computational Intelligence 2017.

Keywords: intelligent humanoid robot; DUET; blind speech separation; speech recognition

* Corresponding author. Tel.: +628175001010
E-mail address: [email protected]

1877-0509 © 2017 The Authors. Published by Elsevier B.V.
Peer-review under responsibility of the scientific committee of the 2nd International Conference on Computer Science and Computational Intelligence 2017.
DOI: 10.1016/j.procs.2017.10.014


1. Introduction

Service robots can be defined as robots that sense, think, and act to perform services for the well-being of humans [1]. Service robots have a main advantage in that they can work nonstop with consistent quality. Currently, service robots are developed for various applications, including edutainment, guidance and office work, inspection and surveillance, etc. For a suitable service, human interaction capability is a must, so face and voice detection and recognition are the main features of service robots. Research in detecting faces and voices has been developed well over the world, but it is still a problem for service robots to separate several simultaneous voice inputs. As a result, robots are not capable of giving feedback or corresponding output appropriately in a natural environment. Actually, the voice sources do not come from just one direction, yet we know that humans can separate and recognize the sources very well. Furthermore, the voice sources are also mixed with background noise in a natural environment, so we have to focus auditory attention on a particular source while filtering out the background noise. This phenomenon is called the cocktail party effect [2]. The objective here is to isolate the speech of one particular speaker from all the other sounds made by the rest of the party. The family of techniques for solving this problem, which need no prior knowledge about the mixing ratios of the various sources, is called Blind Speech Separation (BSS) [3], and our service robots need to implement such a technique.

In our initial attempt [4], we employed FastICA (Fast Independent Component Analysis) for solving the BSS problem. FastICA is a decomposition method capable of decomposing mixed signals into additive subcomponents in real time. In our previous experiment, the FastICA method only succeeded in simulation but failed in the implementation because of inevitable time delays.
There are several improvements of FastICA; for example, Takatani et al. [5] proposed a blind signal decomposition algorithm based on a Single-Input Multiple-Output (SIMO) acoustic signal model using the extended ICA algorithm, called SIMO-ICA. The separated signal of SIMO-ICA can maintain the spatial qualities of each speech source compared to the conventional method. Nevertheless, all extensions of the ICA algorithm suffer from its fundamental assumption: the source signals must be independent of each other. This assumption means the mixed signals must be instantaneous mixtures of the sources, without time delay. This requirement is impossible to fulfill at the implementation stage because there are inevitable time delays during stereo input recording; ICA fails whenever any delay is present between the sources [6]. Therefore, to deal with mixtures of sources containing delay, the Degenerate Unmixing Estimation Technique (DUET) was developed [7][8]. Rickard [8] states that two or more sources can be well separated from a mixture of sources containing small delays with the DUET algorithm. The underlying framework of DUET is a clustering method similar to the van Hulle method [9]. The main difference is that DUET uses time–frequency representations, whereas the van Hulle method employed the space representation and tried to estimate the mixing matrix like the ICA approach. Furthermore, DUET can blindly separate an arbitrary number of sources given just two input mixtures [8]. In the DUET experimental results [7], the algorithm works very well for synthetic mixtures, similar to the ICA method. Furthermore, DUET also works well on real mixtures of speech recorded in an anechoic room. Anechoic means there are no echoes, that is, reflections of sound that arrive with a delay after the direct sound. Because DUET is based on the assumption of anechoic mixing, the quality of the demixing is reduced in echoic room experiments.
The experimental results also show that DUET can estimate arbitrary sources using just two mixtures and works optimally for up to three source estimations; with more than three sources, the error increases significantly. Based on these experiments [7], we use DUET to estimate arbitrary sources in a humanoid robot intended for use as a service robot. The robot is designed like a human, based on the Raspberry Pi platform, and equipped with a RaspPi Cam to detect human faces. Furthermore, the Cirrus Logic Audio Card is stacked on the Raspberry Pi in order to record stereo audio from two microphones. Our experiment design is similar to the research of Baeck et al. [10] in order to evaluate the algorithm performance in a real environment. The research was conducted to answer the following problem: how well can a humanoid robot estimate the speech sources based on the DUET algorithm and convert those sources into text? Hence, the purposes of this research are to (i) develop a humanoid robot capable of estimating each source and converting it to text, and (ii) evaluate the text conversion accuracy.




The remainder of this paper is organized as follows: first we discuss the DUET algorithm in Section 2, followed by its implementation in our humanoid robot, called RAPIRO, in Section 3. In Section 4, we report the experimental results of blind speech separation based on the robotic implementation. Finally, we summarize our work with notes on future research in Section 5.

2. Degenerate Unmixing Estimation Technique (DUET)

Blind speech separation (BSS) is the separation of speech signals from a mono or stereo audio record, with little information about the speech signals or the mixing process. This problem is in general very underdetermined because there are more sources than mixtures. Nevertheless, some solutions can be derived under restricted assumptions. Our initial attempt [4] employed FastICA (Fast Independent Component Analysis) for solving the BSS problem. FastICA assumes that signals are statistically independent, so it is only valid for instantaneous mixtures of sources; FastICA fails in the implementation because there are inevitable time delays during the recording process. To deal with mixtures of sources involving delay, DUET (Degenerate Unmixing Estimation Technique) was developed. DUET provides very good results when only a small delay is involved during the mixing of the sources. Although DUET relies on several assumptions (anechoic mixing, W-disjoint orthogonality (WDO) of the sources, closely spaced microphones for recording, and local stationarity [8]), we can reconstruct the component sources from just two mixtures [7]. DUET uses a binary mask to separate the mixtures into their component sources based on the WDO assumption [8]. To solve the BSS problem, the implementation of DUET uses the stereo audio record from the Cirrus Logic Audio Card, which is stacked on the Raspberry Pi. We can write the stereo input as follows:

x(t) = [x_1(t), x_2(t)]^T

The objective of the BSS problem is to estimate the N source signals, that is:

s(t) = [s_1(t), \ldots, s_N(t)]^T
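As a concrete illustration, the stereo input x(t) can be synthesized with a small NumPy sketch (a toy example for testing, not the paper's code): the first channel is the plain sum of the sources, and the second applies the per-source attenuations and delays of the anechoic model discussed in Section 2.1. Integer-sample delays are used here for simplicity; real inter-microphone delays are fractional.

```python
import numpy as np

def synthesize_stereo(sources, attens, delays):
    """Build x(t) = [x1(t), x2(t)]^T under the anechoic model:
    x1 is the plain sum of the sources, while x2 applies a per-source
    attenuation a_j and an integer-sample delay delta_j."""
    n = max(len(s) for s in sources) + max(delays)
    x1, x2 = np.zeros(n), np.zeros(n)
    for s, a, d in zip(sources, attens, delays):
        x1[: len(s)] += s                 # microphone 1: plain sum
        x2[d : d + len(s)] += a * s       # microphone 2: attenuated, delayed
    return x1, x2

fs = 16000                                # 16 kHz, as in the experiments
t = np.arange(fs) / fs                    # one second of signal
s1 = np.sin(2 * np.pi * 440 * t)
s2 = np.sin(2 * np.pi * 220 * t)
x1, x2 = synthesize_stereo([s1, s2], attens=[0.9, 0.8], delays=[3, 5])
```

Such synthetic mixtures are useful for checking a separation routine before moving to live microphone input.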

In the following subsections, we discuss the assumptions of DUET which lead to the main solution.

2.1. Anechoic Mixing

Suppose N source signals s_j(t), j = 1, ..., N, are received at a pair of microphones. Without loss of generality, it can be assumed that the attenuation and delay parameters of the first mixture are included in the definition of the sources. The two anechoic mixtures representing the stereo input can be stated as:

x_1(t) = \sum_{j=1}^{N} s_j(t), \qquad x_2(t) = \sum_{j=1}^{N} a_j s_j(t - \delta_j)    (1)

where N is the number of sources, a_j is the relative attenuation factor between the sources and the sensors, and \delta_j is the arrival delay between the sensors.

2.2. W-Disjoint Orthogonality (WDO)

Two functions s_j(t) and s_k(t) are called W-disjoint orthogonal (WDO) if the windowed Fourier transforms of s_j(t) and s_k(t) are disjoint. The windowed Fourier transform of s_j(t) is defined as:

\hat{s}_j(\tau, \omega) := F^W[s_j](\tau, \omega) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} W(t - \tau) s_j(t) e^{-i\omega t} \, dt    (2)

where W(t) is the windowing function. The WDO assumption can be stated briefly as:


\hat{s}_j(\tau, \omega) \, \hat{s}_k(\tau, \omega) = 0, \qquad \forall \tau, \omega, \; \forall j \neq k

If the windowing function W(t) ≡ 1, then \hat{s}_j(\tau, \omega) becomes the Fourier transform of s_j(t), denoted \hat{s}_j(\omega), and the WDO assumption can be written as:

\hat{s}_j(\omega) \hat{s}_k(\omega) = 0, \qquad \forall \omega, \; \forall j \neq k

The WDO assumption is important to DUET because it allows the mixture to be separated into its component sources using a binary mask, constructed as follows:

M_j(\tau, \omega) := \begin{cases} 1 & \hat{s}_j(\tau, \omega) \neq 0 \\ 0 & \text{otherwise} \end{cases}

The binary mask M_j separates \hat{s}_j from the mixture by:

\hat{s}_j(\tau, \omega) = M_j(\tau, \omega) \hat{x}_1(\tau, \omega), \qquad \forall \tau, \omega

2.3. Local Stationarity

The Fourier transform of a delayed function is:

s_j(t - \delta) \;\leftrightarrow\; e^{-i\omega\delta} \hat{s}_j(\omega)

Therefore, the windowed Fourier transform of a delayed function, which can be derived from equation (2), is:

F^W[s_j(\cdot - \delta)](\tau, \omega) = e^{-i\omega\delta} F^W[s_j(\cdot)](\tau, \omega)

The local stationarity assumption is formally written as follows:

F^W[s_j(\cdot - \delta)](\tau, \omega) = e^{-i\omega\delta} F^W[s_j(\cdot)](\tau, \omega), \qquad \forall \delta, \; |\delta| \leq \Delta

where \Delta is the maximum time difference in the mixing model.

2.4. Closely Spaced Microphones

In subsection 2.3, we exploited the local stationarity assumption to turn the time delay into a multiplicative factor in the time–frequency domain. The factor e^{-i\omega\delta} uniquely specifies \delta only if |\omega\delta| < \pi; the other cases are ambiguous due to phase wrap. Thus, we have to require that:

|\omega \delta_j| < \pi, \qquad \forall \omega, \; \forall j

This is guaranteed when the microphones are placed close to each other.

2.5. Separating the Sources

Using the anechoic mixing and local stationarity assumptions, we can write the mixing equation (1) in the time–frequency domain as:

\begin{bmatrix} \hat{x}_1(\tau, \omega) \\ \hat{x}_2(\tau, \omega) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} \hat{s}_1(\tau, \omega) \\ \vdots \\ \hat{s}_N(\tau, \omega) \end{bmatrix}

Furthermore, adopting the WDO assumption that at most one source is active at every (\tau, \omega), the mixing process can be described as:

\begin{bmatrix} \hat{x}_1(\tau, \omega) \\ \hat{x}_2(\tau, \omega) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} \hat{s}_j(\tau, \omega) \qquad \text{for each } (\tau, \omega), \text{ for some } j

In the above equation, the active index j depends on (\tau, \omega). The fundamental observation of DUET is that the ratio of the time–frequency representations of the mixtures depends only on the mixing parameters:

\frac{\hat{x}_2(\tau, \omega)}{\hat{x}_1(\tau, \omega)} = a_j e^{-i\omega\delta_j}, \qquad \forall (\tau, \omega) \in \Omega_j, \quad \text{where } \Omega_j := \{(\tau, \omega) : \hat{s}_j(\tau, \omega) \neq 0\}

In conclusion, the mixing parameters can be calculated with:




\tilde{a}(\tau, \omega) := \left| \frac{\hat{x}_2(\tau, \omega)}{\hat{x}_1(\tau, \omega)} \right|, \qquad \hat{\delta}(\tau, \omega) := -\frac{1}{\omega} \angle \frac{\hat{x}_2(\tau, \omega)}{\hat{x}_1(\tau, \omega)}

Under the closely spaced microphones assumption, the delay estimation is accurate because the local attenuation estimator \tilde{a}(\tau, \omega) and the local delay estimator \hat{\delta}(\tau, \omega) can only take on the values of the actual mixing parameters. Finally, we can demix the mixture via binary masking if we can determine the indicator function of each source. For a given active index j, the indicator function can be written as:

M_j(\tau, \omega) := \begin{cases} 1 & (\tilde{a}(\tau, \omega), \tilde{\delta}(\tau, \omega)) = (a_j, \delta_j) \\ 0 & \text{otherwise} \end{cases}

After applying clustering techniques to the attenuation and delay estimates, the number of clusters gives the number of sources, and the cluster centres are the optimal estimates of the attenuation and delay associated with each source. By constructing a suitable indicator function \tilde{M}_j, we can estimate the source signals with:

\tilde{\hat{s}}_j(\tau, \omega) = \tilde{M}_j(\tau, \omega) \left( \frac{\hat{x}_1(\tau, \omega) + \tilde{a}_j e^{i\tilde{\delta}_j \omega} \hat{x}_2(\tau, \omega)}{1 + \tilde{a}_j^2} \right)

In the end, we reconstruct the sources from their time–frequency representations by converting back into the time domain. A summary of the DUET algorithm can be found in the literature [8].
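The estimate-cluster-mask pipeline above can be sketched in a few lines of Python. The following is an illustrative two-source DUET-style demixer under simplifying assumptions (it clusters only the delay estimates with a crude power-weighted two-means loop, and masks the first mixture rather than combining both channels); it is not the paper's implementation:

```python
import numpy as np
from scipy.signal import stft, istft

def duet_two_sources(x1, x2, fs, nperseg=1024):
    """DUET-style sketch for exactly two sources: estimate the local
    delay from the STFT ratio x2_hat/x1_hat, split the time-frequency
    points into two clusters, and demix x1 by binary masking."""
    f, _, X1 = stft(x1, fs=fs, nperseg=nperseg)
    _, _, X2 = stft(x2, fs=fs, nperseg=nperseg)

    eps = 1e-12
    R = (X2 + eps) / (X1 + eps)                  # fundamental DUET ratio
    omega = 2 * np.pi * f[:, None]               # rad/s, per frequency row
    with np.errstate(divide="ignore", invalid="ignore"):
        delta = -np.angle(R) / omega             # local delay estimates (s)
    delta[0, :] = 0.0                            # omega = 0 is undefined

    d = delta.ravel()
    w = (np.abs(X1) ** 2).ravel()                # power weights
    centres = np.percentile(d[w > w.mean()], [25, 75])  # crude init
    for _ in range(20):                          # power-weighted two-means
        labels = np.abs(d[:, None] - centres[None, :]).argmin(axis=1)
        for k in (0, 1):
            if w[labels == k].sum() > 0:
                centres[k] = np.average(d[labels == k], weights=w[labels == k])

    masks = labels.reshape(delta.shape)[None, :, :] == np.arange(2)[:, None, None]
    estimates = [istft(m * X1, fs=fs, nperseg=nperseg)[1] for m in masks]
    return estimates, centres

# Toy usage: two tones with distinct integer-sample delays at microphone 2.
fs = 16000
t = np.arange(2 * fs) / fs
s1, s2 = np.sin(2 * np.pi * 600 * t), 0.7 * np.sin(2 * np.pi * 300 * t)
d1, d2 = 2, 10                                   # delays in samples
x1 = s1 + s2
x2 = 0.9 * np.roll(s1, d1) + 0.8 * np.roll(s2, d2)
estimates, centres = duet_two_sources(x1, x2, fs)
```

A full DUET implementation would cluster the joint (attenuation, delay) histogram and combine both channels in the reconstruction, as in the formula above and in [8].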

3. Implementation in Humanoid Robot

A humanoid robot is a robot built to resemble the human body. In general, a humanoid robot has a torso, a head, two arms, and two legs, though simpler versions may model only part of the body. RAPIRO is a small and affordable robot kit designed to work with a Raspberry Pi. It comes with an Arduino-compatible servo controller. Figure 1 shows the fully-assembled RAPIRO and its microcontroller board:

Fig. 1. (a) RAPIRO and (b) its microcontroller board (http://www.rapiro.com)

For RAPIRO to become an intelligent humanoid robot, it needs to be extended with a microprocessor board, the Raspberry Pi; our research used the Raspberry Pi 2 Model B. Furthermore, RAPIRO needs a visual input to detect faces, recognize faces, and recognize objects, so a camera module is installed on the Raspberry Pi. Finally, to receive stereo audio input, the external Cirrus Logic Audio Card is stacked on the Raspberry Pi, because the Raspberry Pi has no built-in stereo input port. The Raspberry Pi installation together with its camera module and the additional audio card can be seen in Figure 2.

Fig. 2. (a) Raspberry Pi 2 Model B, (b) camera module, and (c) Cirrus Logic Audio Card

After RAPIRO has been assembled, the next step is to develop the system that performs speech separation using the DUET method. Our objective is for RAPIRO to separate the sources of sounds that arrive simultaneously and provide feedback in the form of a confirmation. The system is developed using Python (2.7) and several of its libraries: nussl (0.1.3), SpeechRecognition (3.4.6), PyAudio (0.2.9), and gTTS (1.1.6). The last step is to install the system on RAPIRO and test the results of speech separation using the DUET algorithm. Speech is converted to text by means of the microphones and the Google Speech API. The steps of system usage (see Figure 3) are:
1. The blue LED light in RAPIRO's eyes indicates that the system is in the ready state.
2. The user approaches the robot within camera range, so that the face is detected by the system.
3. The LED light turns green, indicating the face has been detected; the robot greets the user, and RAPIRO then asks the user for his needs. In our experiment, the robot asks for the name of the person the user wants to meet.
4. When the LED light is red, the robot records the conversation for four seconds. In that duration, the user says who the person to meet is.




Fig. 3. (1) Robot in ready state, (2) face detection, (3) greeting state, and (4) recording state

4. Experiment Results

In order to evaluate the proposed method, experiments were conducted using our humanoid robot RAPIRO. The experiments were implemented through eight scenarios based on the combination of three parameters:
(i) Distance between the microphones (m) and the speech sources (s). In our experiments, the distance is divided into two categories, a Far setting and a Near setting, shown in Figure 4.

Fig. 4. Setting for microphones (m) and speech sources (s): (a) Far setting, with microphones m1 and m2 spaced 60 cm apart and sources s1 and s2 each 25 cm away; (b) Near setting, with microphones spaced 15 cm apart and sources each 10 cm away

(ii) Number of speech sources. The experiments used at most two speech sources; for two speech sources, one male voice and one female voice were used.
(iii) Type of speech. In the experiments, the user mentions the name of the person he wants to meet. There are two speech variations here: the person's name only, and the name with a greeting such as Bapak, Ibu, Saudara, or Saudari.

Each scenario is repeated in ten experiments. The experiments are evaluated by calculating the accuracy of the separated speeches in text form by means of the Google speech-to-text API; the accuracy is measured by comparing the number of correctly transcribed speeches with the number of experiments. The restrictions in our experiments are as follows:


• Formal Bahasa Indonesia is used.
• Speech and intonation are spoken clearly.
• No noise is assumed.
• 16 kHz sampling rate and 4 seconds recording duration.
• The hop size of the Short-Time Fourier Transform windowing function is 1024.
• Only four names are available to be detected.
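One plausible reading of the accuracy metric described above (the fraction of utterances whose transcription matches the spoken input) can be sketched as follows; the data are illustrative, shaped like the scenario tables, with None standing in for a failed recognition (X):

```python
def scenario_accuracy(expected, transcribed):
    """Fraction of utterances whose transcription matches the spoken
    input (case-insensitive); None marks a failed recognition (X)."""
    matches = sum(
        t is not None and t.strip().lower() == e.strip().lower()
        for e, t in zip(expected, transcribed)
    )
    return matches / len(expected)

# Illustrative inputs/outputs in the shape of the scenario tables.
expected = ["Albert", "Eko", "Calvindoro", "Andi"]
transcribed = ["Albert", "Eko", "Sindoro", None]
accuracy = scenario_accuracy(expected, transcribed)  # 0.5 here
```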

The following figure (Figure 5) is an example of speech separation in our experiment.

Fig. 5. Example of speech separation result

Results of our experiments for the eight scenarios are recorded in the following tables.

Scenario 1: Far setting, two speech sources, speaking only the person's name

Table 1. Result of first scenario

ID | Input 1    | Input 2    | Output 1  | Output 2 | Time (second)
1  | Albert     | Eko        | Albert    | Eko      | 1.81869297
2  | Calvindoro | Andi       | Sindoro   | Andi     | 1.692121077
3  | Alexander  | Charles    | Alexander | Charles  | 1.78068881
4  | Widodo     | Indra      | Widodo    | Indra    | 1.724117088
5  | Budi       | Widodo     | Budi      | Widodo   | 1.735452938
6  | Charles    | Charles    | X         | Charles  | 1.746630955
7  | Daniel     | Daniel     | Daniel    | Daniel   | 1.768029022
8  | Eko        | Albert     | Iko       | Albert   | 1.953551102
9  | Andi       | Calvindoro | X         | X        | 1.870830822
10 | Indra      | Budi       | X         | Drag     | 1.709706116
Average DUET Computation Time: 1.77998209

The accuracy of the first scenario is 65%.

Scenario 2: Far setting, two speech sources, speaking the name with greetings

Table 2. Result of second scenario

ID | Input 1              | Input 2              | Output 1             | Output 2             | Time (second)
1  | Saudara Albert       | Saudara Eko          | saudara Albert       | saudara Eko          | 1.832210827
2  | Saudara Calvindoro   | Bapak Andi           | X                    | X                    | 2.645637083
3  | Bapak Alexander      | Saudara Charles      | Charles              | Bapak Alexander      | 1.778987932
4  | Bertemu Bapak Widodo | Saudara Indra        | Contoh               | X                    | 1.841015863
5  | Saudara Budi         | Bertemu Bapak Widodo | saudara Budi         | bertemu Bapak Widodo | 1.727251101
6  | Saudara Charles      | Saudara Charles      | saudara Serius       | X                    | 1.794992018
7  | Saudara Daniel       | Saudara Daniel       | saudara Daniel       | X                    | 1.782523918
8  | Saudara Eko          | Saudara Albert       | X                    | saudara Eko          | 1.742737103
9  | Bapak Andi           | Saudara Calvindoro   | saudara Calvindoro   | bapak Andi           | 1.782528925
10 | Saudara Indra        | Saudara Budi         | saudara Indra        | saudara Budi         | 1.824796963
Average DUET Computation Time: 1.875268173

The accuracy of the second scenario is 60%.

Scenario 3: Near setting, two speech sources, speaking only the person's name

Table 3. Result of third scenario

ID | Input 1    | Input 2    | Output 1 | Output 2    | Time (second)
1  | Albert     | Eko        | Eko      | X           | 1.832210827
2  | Calvindoro | Andi       | Andi     | Sindoro     | 2.645637083
3  | Alexander  | Charles    | Charles  | Sumber      | 1.778987932
4  | Widodo     | Indra      | Indra    | X           | 1.841015863
5  | Budi       | Widodo     | Widodo   | Budi        | 1.727251101
6  | Charles    | Charles    | X        | X           | 1.794992018
7  | Daniel     | Daniel     | Daniel   | X           | 1.782523918
8  | Eko        | Albert     | Albert   | X           | 1.742737103
9  | Andi       | Calvindoro | Dendi    | Alfin Nurul | 1.782528925
10 | Indra      | Budi       | X        | X           | 1.824796963
Average DUET Computation Time: 1.875268173

The accuracy of the third scenario is 40%.

Scenario 4: Near setting, two speech sources, speaking the name with greetings

Table 4. Result of fourth scenario

ID | Input 1              | Input 2              | Output 1             | Output 2      | Time (second)
1  | Saudara Albert       | Saudara Eko          | saudara Albert       | saudara Eko   | 1.87324791
2  | Saudara Calvindoro   | Bapak Andi           | pak Andi             | hotel sindoro | 1.815362024
3  | Bapak Alexander      | Saudara Charles      | saudara Charles      | X             | 1.793608952
4  | Bertemu Bapak Widodo | Saudara Indra        | X                    | Bapak Widodo  | 1.775599051
5  | Saudara Budi         | Bertemu Bapak Widodo | bertemu Bapak Widodo | saudara Budi  | 1.908370066
6  | Saudara Charles      | Saudara Charles      | saudara harus        | X             | 1.893497992
7  | Saudara Daniel       | Saudara Daniel       | saudara Daniel       | X             | 1.859664965
8  | Saudara Eko          | Saudara Albert       | saudara Albert       | saudara Eko   | 1.927908945
9  | Bapak Andi           | Saudara Calvindoro   | saudara Calvindoro   | Andi          | 1.934961843
10 | Saudara Indra        | Saudara Budi         | saudara indra        | saudara Budi  | 1.788260031
Average DUET Computation Time: 1.857048178

The accuracy of the fourth scenario is 70%.


Scenario 5: Far setting, one speech source, speaking only the person's name

Table 5. Result of fifth scenario

ID | Input 1    | Output 1   | Output 2  | Time (second)
1  | Albert     | X          | Albert    | 1.793345022
2  | Calvindoro | Calvindoro | X         | 1.912111092
3  | Alexander  | X          | Alexander | 1.69664793
4  | Widodo     | Widodo     | X         | 1.751409817
5  | Budi       | Budi       | X         | 1.754013109
6  | Charles    | X          | Charles   | 1.760180998
7  | Daniel     | Dunia      | X         | 1.866503048
8  | Eko        | X          | Eko       | 1.818609047
9  | Andi       | X          | Andi      | 1.734113979
10 | Indra      | Indra      | X         | 1.781472015
Average DUET Computation Time: 1.786840606

The accuracy of the fifth scenario is 90%.

Scenario 6: Far setting, one speech source, speaking the name with greetings

Table 6. Result of sixth scenario

ID | Input 1              | Output 1             | Output 2       | Time (second)
1  | Saudara Albert       | X                    | saudara Albert | 1.720284987
2  | Saudara Calvindoro   | saudara Calvindoro   | X              | 1.868231821
3  | Bapak Alexander      | Bapak Alexander      | X              | 1.860559988
4  | Bertemu Bapak Widodo | bertemu Bapak Widodo | X              | 1.742484856
5  | Saudara Budi         | saudara Budi         | X              | 1.938750076
6  | Saudara Charles      | saudara Charles      | X              | 1.879053879
7  | Saudara Daniel       | saudara Daniel       | X              | 1.771276999
8  | Saudara Eko          | saudara Eko          | X              | 1.814157057
9  | Bapak Andi           | saudara Andi         | X              | 1.813770103
10 | Saudara Indra        | saudara Indra        | X              | 1.794739056
Average DUET Computation Time: 1.820330882

The accuracy of the sixth scenario is 100%.

Scenario 7: Near setting, one speech source, speaking only the person's name

Table 7. Result of seventh scenario

ID | Input 1    | Output 1 | Output 2  | Time (second)
1  | Albert     | X        | Albert    | 1.797313976
2  | Calvindoro | X        | Sindoro   | 1.781812
3  | Alexander  | X        | Alexander | 1.764442968
4  | Widodo     | X        | Widodo    | 1.840017128
5  | Budi       | X        | Budi      | 1.76946311
6  | Charles    | X        | Charles   | 1.959310818
7  | Daniel     | X        | Daniel    | 1.664107847
8  | Eko        | X        | Eko       | 1.664107847
9  | Andi       | X        | Andi      | 1.801404047
10 | Indra      | Indra    | X         | 1.792418051
Average DUET Computation Time: 1.783439779

The accuracy of the seventh scenario is 90%.

Scenario 8: Near setting, one speech source, speaking the name with greetings

Table 8. Result of eighth scenario

ID | Input 1              | Output 1             | Output 2           | Time (second)
1  | Saudara Albert       | X                    | saudara Albert     | 1.806759882
2  | Saudara Calvindoro   | X                    | saudara Calvindoro | 1.697113085
3  | Bapak Alexander      | X                    | Bapak Alexander    | 1.795109797
4  | Bertemu Bapak Widodo | bertemu Bapak Widodo | X                  | 1.890004921
5  | Saudara Budi         | X                    | saudara Budi       | 1.821046162
6  | Saudara Charles      | X                    | saudara Charles    | 1.807483959
7  | Saudara Daniel       | X                    | X                  | 1.813742924
8  | Saudara Eko          | X                    | saudara Eko        | 1.749439049
9  | Bapak Andi           | X                    | Bapak Andi         | 1.788173008
10 | Saudara Indra        | X                    | Saudara Indra      | 1.683855104
Average DUET Computation Time: 1.785272789

The accuracy of the eighth scenario is 90%.

5. Conclusion

In this paper, we have successfully implemented a blind speech separation system for our humanoid robot using the DUET algorithm. It is also shown that the DUET algorithm can handle the inevitable time delays during the recording process. Based on the experimental results across the eight scenarios, there are three main conclusions:
(i) From scenarios 5 to 8, we can conclude that DUET gives excellent results, with 90%-100% accuracy, when there is only one speech source. The results show that the source estimation is not strongly affected by the distance and speech variation parameters.
(ii) From scenarios 1 to 4, it can be concluded that the accuracy of DUET is influenced by the distance and speech variation parameters. In scenarios 1 and 2, the distance parameter reduces the accuracy of the results to just 60%-65%: the Far setting affects the recording process, where the voice may be interfered with by other noises, and this phenomenon appears to decrease the accuracy in scenario 2. In scenarios 3 and 4, the speech variation parameter greatly affects the accuracy. In scenario 4, where the Google API can anticipate the subject's name based on the common greeting words, the accuracy reaches 70%; in scenario 3, where the Google API has to estimate the subject's name without any clue, the accuracy drops to 40%.
(iii) Based on the Raspberry Pi 2 performance, the average DUET computation time is about 1.8 seconds for the 4-second speech signals in the recording process.


References

[1] Teresa Z. History of Service Robots. In: Ceccarelli M, editor. Service Robots and Robotics: Design and Application. IGI Global; 2012. p. 1-14.
[2] Bronkhorst AW. The cocktail-party problem revisited: early processing and selection of multi-talker speech. Attention, Perception & Psychophysics. 2015;77(5):1465-1487.
[3] Yu X, Hu D, Xu J. Blind Source Separation: Theory and Applications. 1st ed. Wiley; 2014.
[4] Budiharto W, Gunawan AAS. Blind speech separation system for humanoid robot with FastICA for audio filtering and separation. In: First International Workshop on Pattern Recognition; 2016; Tokyo. p. 1001113-4.
[5] Takatani T, Nishikawa T, Saruwatari H, Shikano K. SIMO-model-based independent component analysis for high-fidelity blind separation of acoustic signals. In: 4th International Symposium on Independent Component Analysis and Blind Signal Separation (ICA2003); 2003; Nara, Japan. p. 993-998.
[6] Singh A, Anand RS. Overview of the performance of FastICA and DUET for blind source separation of multiple speakers. In: National Conference on Recent Advances in Electronics & Computer Engineering (RAECE); 2015; Roorkee, India. p. 296-300.
[7] Yılmaz Ö, Rickard S. Blind separation of speech mixtures via time–frequency masking. IEEE Transactions on Signal Processing. 2004;52(7):1830-1847.
[8] Rickard S. The DUET blind source separation algorithm. In: Makino S, Lee TW, Sawada H, editors. Blind Speech Separation. Springer Netherlands; 2007. p. 217-241.
[9] Van Hulle MM. Clustering approach to square and non-square blind source separation. In: IEEE Neural Networks for Signal Processing IX; 1999; Madison, USA.
[10] Baeck M, Zölzer U. Real-time implementation of a source separation algorithm. In: 6th International Conference on Digital Audio Effects (DAFx-03); 2003; London. p. 29-34.