Procedia Computer Science 131 (2018) 1223–1228
8th International Congress of Information and Communication Technology (ICICT-2018)
A cascaded voice biometric system
Roaya Salhalden A. Abdalrahman, Bülent Bolat*, Nihan Kahraman
Yildiz Technical University, Electronics and Comms. Eng. Dpt., Istanbul 34220 Turkey
Abstract
This paper presents a voice biometric system which uses both text-dependent and text-independent speaker and speech recognition methods. The Mel Frequency Cepstral Coefficients (MFCC) and pitch period are used as features, and the decision is made by using the Euclidean distance metric. This cascaded procedure reduces the False Positive Rate and increases the security of the biometric recognition system. It is shown that the cascaded system performs voice recognition better in terms of accuracy, safety and difficulty of penetration. The efficiency of the identification system reaches approximately 91.2%.
© 2018 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/). Selection and peer-review under responsibility of the scientific committee of the 8th International Congress of Information and Communication Technology.
Keywords: Speaker recognition, mel frequency cepstral coefficients, pitch period.
1. Main text
Because it identifies individuals by their personal traits rather than by the keywords or tokens of conventional security systems, biometric technology has become increasingly popular. The earliest methods of biometric identification rely on physical features such as fingerprint, eye and face scans. However, a person can also be recognized by behavioral features. The simplest and cheapest technology in this class can be realized via voice recognition systems. Biometric voice recognition technology concentrates on either speech or speaker recognition, especially for secure access to cell phones, ATMs and personal computers. In this paper, we present an implementation of both speech and speaker recognition for secure access to a personal computer. The proposed recognition algorithm is developed in the MATLAB environment. It first authenticates a person's identity by his or her voice pattern and then checks a predefined personal keyphrase.
The proposed system can be considered a multi-biometric system. Generally, multi-biometric systems require more than one sensor, which makes the whole system expensive. However, in our proposed system both biometrics use the same sensor without additional hardware.

* Corresponding author. Tel.: +90-212-383-5907; fax: +90-212-383-5702. E-mail address: [email protected]
ISSN 1877-0509. DOI: 10.1016/j.procs.2018.04.334
Speaker recognition is an easy and cheap solution for biometric identification problems, especially for secure login to a PC. A simple microphone, which is a standard device on most up-to-date laptops, may act as the biometric capture device and avoids the need for additional hardware. However, biometric identification by voice is often not preferred because of its relatively low accuracy compared with state-of-the-art methods such as fingerprint and iris recognition [1]. In this work, a two-step identification scheme is applied to address this. In the first step, a text-independent speaker recognition method is applied to identify the subject; the subject is then verified by a speech recognition module (Figure 1). The pitch period is a useful and simple feature for the speech recognition problem with reasonable accuracy [2], and the mel-frequency cepstral coefficients (MFCC) are the most widely used features for speech and speaker recognition. However, the MFCC are very sensitive to noise [3], and noise suppression can enhance the performance [4].
Fig. 1. The proposed speaker identification method
The organization of this paper is as follows. Section 2 introduces the feature extraction methods. Section 3 details the proposed method, and Section 4 concludes the paper.

2. Feature Extraction
Pitch period and MFCC are well-known features that have been used successfully in speech recognition problems [2]. The cepstrum is a homomorphic transformation applied to signals that consist of two or more convolved components. The speech production mechanism can be modeled as a time-variant all-pole filter driven by an excitation signal. In the time domain, the speech signal is obtained by convolving the excitation with the filter; in the cepstral domain, however, the convolution becomes a summation, and the filter generally occupies the lower part of the cepstral domain. Hence, the filter parameters, which carry the vocal information, can be determined easily by using a low-pass cepstral filter.
The term "mel frequency" refers to the mel scale. The mel scale is a perceptual tonal scale in which equal steps correspond to perceptually equal distances between tones. In the literature there are various linear-to-mel frequency transformation schemes. In this work, the mel scale is calculated as follows, where f is the linear frequency in Hz:

$$\mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \tag{1}$$
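As an illustration of Eq. 1, the following minimal Python sketch converts between linear frequency and the mel scale. It assumes the logarithm in Eq. 1 is base 10, which is the usual convention for the constant 2595. The helper `mel_spaced_centers`, the choice of 20 filters and the 0 to 5000 Hz range (the Nyquist band for 10 kHz sampling) are illustrative assumptions and are not taken from the implementation described in this paper.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Eq. 1: map a linear frequency in Hz to the mel scale (log base 10 assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Algebraic inverse of Eq. 1."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_spaced_centers(f_low_hz: float, f_high_hz: float, n_filters: int) -> list:
    """Hypothetical helper: centre frequencies (Hz) equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]


# Assumed example: 20 filters between 0 Hz and 5000 Hz.
centers = mel_spaced_centers(0.0, 5000.0, 20)
```

Because the spacing is uniform in mel rather than in Hz, the centre frequencies are packed more densely at low frequencies, which is the perceptual behaviour the mel scale is meant to capture.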
The computational cost of calculating MFCC is rather high; hence, various fast algorithms exist. Figure 2 shows the block diagram of the MFCC calculation process used in this work.
Similarly, the pitch period is a distinctive feature of a person and can be used for biometric identification. The voiced part of speech is considered quasi-periodic. The period of this quasi-periodic signal is called the pitch period, and its reciprocal is the fundamental (pitch) frequency. A well-known method for pitch period detection is based on the autocorrelation function (Eq. 2):

$$r_t(\tau) = \sum_{j=1}^{W-\tau} x_j \, x_{j+\tau} \tag{2}$$

where x_j is the j-th sample of the speech signal x(n), τ is the lag, W is the length of the analysis window and the subscript t indicates that this is a time-domain calculation. If the speech signal x(n) is periodic with T0, the autocorrelation function must also be periodic with T0. Then, the pitch period of x(n) is determined by the difference between two consecutive local maxima of its autocorrelation function [5]. However, since the autocorrelation function is calculated over a limited time span, it contains errors which grow with the time index. Furthermore, the speech signal has noise-like parts and is not wide-sense stationary. Hence, the autocorrelation method is not a good solution for pitch period estimation.
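The autocorrelation approach of Eq. 2 can be sketched in a few lines of Python. This is a simplified illustration and not the implementation used in this work: it assumes a window that shrinks with the lag, and instead of measuring the spacing between consecutive local maxima it picks the single highest peak inside an assumed lag range, which gives the same period for a clean periodic signal. The function names, the synthetic 125 Hz tone and the lag limits are all assumptions.

```python
import math


def autocorrelation(x, tau):
    """Eq. 2 with 0-based indexing: sum of x[j] * x[j + tau] over the overlap."""
    return sum(x[j] * x[j + tau] for j in range(len(x) - tau))


def pitch_period_autocorr(x, min_lag, max_lag):
    """Return the lag (in samples) with the largest autocorrelation in the range."""
    return max(range(min_lag, max_lag + 1), key=lambda tau: autocorrelation(x, tau))


# Synthetic test tone (assumed values): 125 Hz sampled at 10 kHz,
# so one period spans exactly 80 samples.
fs = 10_000
f0 = 125.0
x = [math.sin(2.0 * math.pi * f0 * n / fs) for n in range(400)]

period = pitch_period_autocorr(x, min_lag=20, max_lag=150)
estimated_f0 = fs / period
```

On real speech the peak is far less clean than in this synthetic case, which is the weakness the paragraph above points out.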
Fig. 2. Calculation of MFCCs
An alternative and fast solution to the pitch period detection problem is as follows. The difference function indicates the instantaneous pitch period:

$$d_t(\tau) = \sum_{j=1}^{W-\tau} \left(x_j - x_{j+\tau}\right)^2 \tag{3}$$
where d_t(τ) is the pitch detector, W is the length of the window and the subscript t indicates that this is a time-domain calculation. The pitch period is detected by tracking the consecutive minima of the detector function. A computationally efficient pitch detection method based on this pitch detector is proposed in [6]. In this work, the fast detector proposed in [6] was used.

3. Proposed Method
In this paper, a speaker identification system which uses both text-dependent and text-independent speaker recognition methods is proposed. The proposed system first determines the speaker by using a text-independent recognition method. Once the identity of the user is determined, the system expects the user to utter a predefined passphrase. If both the passphrase and the user ID match, the system accepts the user. Combining these methods is intended to reduce the false acceptance ratio of the system.

3.1. Speaker Recognition
Figure 3 summarizes the speaker recognition module. Once the user's voice is captured, it is divided into frames of 100 ms length to calculate the MFCC. MFCC are known to be sensitive to noise; hence a fast noise suppression approach, namely a noise gate, is utilized at this point. Since the signal-to-noise ratio (SNR) is lower in the unvoiced parts of the speech, these parts of the speech signal are detected and cropped by using short-time energy. If a frame's short-time energy is lower than a predefined threshold, the frame is considered noise and is removed. Figure 4 shows an original input signal and its noise-suppressed version. After the noise suppression, 20th-order MFCCs are calculated by using the mel scale of Eq. 1. In the speaker recognition phase, two users are enrolled in the system, using 50 utterances each. The MFCCs are used as biometric templates. Once an unknown utterance is applied to the system, the MFCCs of this utterance are calculated. If the Euclidean distance between these and one of the users' templates is smaller than a threshold, the system assigns the ID of that user to the unknown utterance.
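The noise gate and the distance-based decision described above could be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the paper: short-time energy is taken as the mean squared amplitude of a frame, frames do not overlap, each user is represented by a single template vector, and the threshold values, feature vectors and user names are invented for the example.

```python
import math


def frame_signal(x, frame_len):
    """Split x into non-overlapping frames; a trailing partial frame is dropped."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, frame_len)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame (assumed energy definition)."""
    return sum(s * s for s in frame) / len(frame)


def noise_gate(x, frame_len, energy_threshold):
    """Keep only the frames whose short-time energy reaches the threshold."""
    kept = []
    for frame in frame_signal(x, frame_len):
        if short_time_energy(frame) >= energy_threshold:
            kept.extend(frame)
    return kept


def euclidean_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def identify_speaker(features, templates, distance_threshold):
    """Return the ID of the nearest template, or None if none is close enough."""
    best_id, best_dist = None, float("inf")
    for user_id, template in templates.items():
        dist = euclidean_distance(features, template)
        if dist < best_dist:
            best_id, best_dist = user_id, dist
    return best_id if best_dist < distance_threshold else None


# Assumed demo: 100 ms frames at 10 kHz, one quiet frame followed by one loud frame.
fs = 10_000
frame_len = fs // 10
quiet = [0.001 * (-1) ** n for n in range(frame_len)]
loud = [0.5 * math.sin(2.0 * math.pi * 125.0 * n / fs) for n in range(frame_len)]
gated = noise_gate(quiet + loud, frame_len, energy_threshold=1e-3)

# Hypothetical three-dimensional feature templates standing in for MFCC vectors.
templates = {"user_a": [1.0, 2.0, 3.0], "user_b": [4.0, 0.0, -1.0]}
accepted = identify_speaker([1.1, 2.0, 2.9], templates, distance_threshold=0.5)
rejected = identify_speaker([10.0, 10.0, 10.0], templates, distance_threshold=0.5)
```

Returning `None` when no template lies within the threshold lets the calling code reject unknown speakers instead of forcing a match to the nearest enrolled user.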
Fig. 3. Block diagram of speaker recognition system
Fig. 4. (a) Original input signal and (b) noise suppressed version
All the recordings are captured by a stereo PC microphone with a sampling frequency of 10 kHz and 16-bit resolution. Simulation results for two different speakers are given in Table 1.

Table 1. True Positive, True Negative, False Positive and False Negative Results for Speaker Recognition

Status    Positive    Negative
True      0.78        0.12
False     0.02        0.08
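For readers who wish to derive summary rates from Table 1, the short sketch below shows the standard arithmetic. It assumes that the four table entries are fractions of all test trials (they sum to 1.00) and uses the textbook definitions of accuracy, false positive rate and false negative rate; the function name is illustrative and the derived figures are not reported in the paper itself.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard rates computed from the four confusion-matrix entries."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


# Values copied from Table 1 (speaker recognition stage).
table1 = confusion_metrics(tp=0.78, tn=0.12, fp=0.02, fn=0.08)
```

Under these assumptions the accuracy is the sum of the True row, while the false positive rate depends only on the Negative-class trials (false positives and true negatives).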
3.2. Speech Recognition
Once the system enters the speech recognition phase, both pitch period and MFCC are used as features. In addition to the features of the first phase, the pitch period of a given signal is calculated by using Eqs. 2 and 3. The speech recognition process is applied to both the signal in the database and the recorded test signal, so that both results are available for comparison by the matching filter. The filter configuration is set according to the statistics of the database signal; the test signal is then entered as the second input to the matching filter. The program is implemented in MATLAB, and the voice recognition scheme operates on the minimum pitch frequency obtained from the autocorrelation and advanced signal processing. Figure 5 shows the applied speech recognition system. In this phase, MFCC are used as in speaker recognition. Furthermore, the pitch period algorithm is used in order to compare features. The pitch frequencies of the signal indicate the locations of maximum output power in the signal. These frequencies are stored for further processing. Pitch frequencies from the reference input and the test input are matched in the last stage of the voice recognition system (VRS). In the matching process, a distance comparison between pitch frequencies is used. The decision threshold was set to 12 by trial and error. The first plot in Figure 6 shows the input signal and the second shows the pitch periods of that input signal.
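The pitch-based matching step can be illustrated with the difference function of Eq. 3. The sketch below is an assumption-laden simplification rather than the fast detector of [6]: it evaluates the difference function directly, takes the lag of its minimum within an assumed range as the pitch period, and compares reference and test pitch frequencies by absolute difference. The paper gives the threshold as 12 without units; treating it as 12 Hz is an assumption, as are the function names and the synthetic tones.

```python
import math


def difference_function(x, tau):
    """Eq. 3 with 0-based indexing: squared differences between x and its lagged copy."""
    return sum((x[j] - x[j + tau]) ** 2 for j in range(len(x) - tau))


def pitch_period_difference(x, min_lag, max_lag):
    """Lag (in samples) at which the difference function is smallest."""
    return min(range(min_lag, max_lag + 1), key=lambda tau: difference_function(x, tau))


def pitch_frequency(x, fs, min_lag, max_lag):
    """Pitch frequency in Hz as the sampling rate divided by the pitch period."""
    return fs / pitch_period_difference(x, min_lag, max_lag)


def pitch_match(ref_hz, test_hz, threshold=12.0):
    """Accept when the pitch frequencies differ by less than the threshold (units assumed)."""
    return abs(ref_hz - test_hz) < threshold


def tone(f0, fs, n_samples):
    """Synthetic stand-in for a voiced speech segment."""
    return [math.sin(2.0 * math.pi * f0 * n / fs) for n in range(n_samples)]


fs = 10_000
reference_hz = pitch_frequency(tone(125.0, fs, 400), fs, min_lag=20, max_lag=150)
same_hz = pitch_frequency(tone(125.0, fs, 400), fs, min_lag=20, max_lag=150)
other_hz = pitch_frequency(tone(100.0, fs, 400), fs, min_lag=20, max_lag=150)

same_accepted = pitch_match(reference_hz, same_hz)
other_accepted = pitch_match(reference_hz, other_hz)
```

Searching for a minimum of the difference function avoids the multiplication-heavy peak search of the autocorrelation method, which is one reason difference-based detectors are attractive for fast implementations.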
Fig. 5. Block diagram of speech recognition system
Fig. 6. Input signal for speech recognition and pitch periods of input
Here, two different feature extraction methodologies are used for speech recognition. The results for pitch period and MFCC are compared in Table 2. It is seen that MFCC is more successful in voice recognition. Fusion of both methodologies is also applied to this problem, and the fused system performs better than either feature used alone.
Table 2. True Positive, True Negative, False Positive and False Negative Results for Speech Recognition

Status                  Positive    Negative
True (pitch period)     0.63        0.09
False (pitch period)    0.09        0.09
True (MFCC)             0.66        0.19
False (MFCC)            0.08        0.07
True (MFCC+Pitch)       0.77        0.052
False (MFCC+Pitch)      0.048       0.048
4. Conclusion
This paper proposes a cascaded hybrid system for use in biometric access control. The two fused techniques are speaker recognition (by MFCC) and speech recognition (by pitch period and MFCC). The user must first pass speaker recognition as the first phase of the system. If the identification result is not sufficient, the user is not accepted and the software stops. If the user is identified, the system continues and requests the passphrase from the user. Once the passphrase is matched, the system grants the user access to the PC. This cascaded procedure reduces the False Positive Rate and increases the security of the biometric recognition system. Table 3 shows the results for the cascaded voice recognition system. The cascaded system therefore performs better in terms of accuracy, safety and difficulty of penetration.

Table 3. True Positive, True Negative, False Positive and False Negative Results for Cascaded Voice Recognition System

Status    Positive    Negative
True      0.862       0.13
False     0.03        0.05
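The cascaded decision logic summarized in the conclusion can be written as a short sketch. The two stage functions are passed in as parameters because their internals are described elsewhere in the paper; the names `cascaded_access`, `identify_speaker` and `verify_passphrase`, as well as the stub stages in the demo, are hypothetical and serve only to show the control flow.

```python
from typing import Callable, Optional


def cascaded_access(
    utterance: object,
    passphrase_utterance: object,
    identify_speaker: Callable[[object], Optional[str]],
    verify_passphrase: Callable[[str, object], bool],
) -> Optional[str]:
    """Return the user ID when both stages accept, otherwise None."""
    # Stage 1: text-independent speaker recognition.
    user_id = identify_speaker(utterance)
    if user_id is None:
        return None  # identification failed, so the process stops here
    # Stage 2: the identified user must also utter the correct passphrase.
    if not verify_passphrase(user_id, passphrase_utterance):
        return None
    return user_id


# Stub stages for demonstration only (assumed behaviour).
def demo_identify(utterance):
    return "user_a" if utterance == "voice_of_user_a" else None


def demo_verify(user_id, passphrase_utterance):
    return passphrase_utterance == "open sesame"


granted = cascaded_access("voice_of_user_a", "open sesame", demo_identify, demo_verify)
wrong_phrase = cascaded_access("voice_of_user_a", "wrong phrase", demo_identify, demo_verify)
unknown_voice = cascaded_access("voice_of_stranger", "open sesame", demo_identify, demo_verify)
```

Because access is granted only when both stages accept, a false acceptance requires both stages to err on the same attempt, which is the mechanism by which the cascade lowers the False Positive Rate.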
References
1. Elshamy EM, Hussein AI, et al. Secure VoIP system based on biometric voice authentication and nested digital cryptosystem using chaotic Baker's map and Arnold's Cat Map. Int. Conf. on Computer and Applications, 2017, p. 140-146.
2. Ambikairajah E, Carey MJ, Tattersall G. Estimating the pitch period of voiced speech. IET Electronics Letters, 1980:16(12).
3. Zunjing W, Zhigang C. Improved MFCC-based feature for robust speaker identification. TUP Tsinghua Science and Technology, 2005:10(2).
4. Attawibulkul RS, Kaewkamnerdpong B, Miyanaga Y. Noisy speech training in MFCC-based speech recognition with noise suppression toward robot assisted autism therapy. Proc. 10th IEEE Biomedical Engineering International Conference (BMEiCON 2017), 2017.
5. de Cheveigné A, Kawahara H. YIN, a fundamental frequency estimator for speech and music. Journal of the Acoustical Society of America, 2002:111(4).
6. Sood S, Krishnamurthy A. A robust on-the-fly pitch (OTFP) estimation algorithm. In Proc. ACM Multimedia 2004.