Digital Signal Processing 60 (2017) 63–74
Contents lists available at ScienceDirect
Digital Signal Processing www.elsevier.com/locate/dsp
Exposing speech tampering via spectral phase analysis
Xiaodan Lin a,b, Xiangui Kang a,∗
a Guangdong Key Lab of Information Security, School of Data and Computer Science, Sun Yat-Sen University, 510006, Guangzhou, China
b School of Information Science and Engineering, Huaqiao University, 361021, Xiamen, China
Article info
Article history: Available online 9 August 2016
Keywords: Spectral phase reconstruction; Tampering localization; Short-time Fourier Transform; Higher order statistics; Spectral phase correlation
Abstract
Audio recordings serve as important evidence in law enforcement contexts. The most crucial problem in practical scenarios is to determine whether an audio recording is authentic. For this task, blind audio tampering detection is typically performed based on electric network frequency (ENF) artifacts. However, ENF analysis becomes unreliable under a high level of noise. In this paper, we present a novel approach to detect and locate tampering in uncompressed audio tracks by analyzing the spectral phase across the Short-Time Fourier Transform (STFT) sub-bands. Spectral phase reconstruction is employed to counteract the impact of noise. In addition, a new feature based on higher order statistics of the spectral phase residual and on the spectral baseband phase correlation between two adjacent voiced segments is proposed to allow automated authentication. Experimental results show that a significant increase in detection accuracy can be achieved compared to the conventional ENF-based method when the audio recording is exposed to a high level of noise. We also verify that the proposed method remains robust under various noisy conditions. © 2016 Elsevier Inc. All rights reserved.
1. Introduction

In the past decade, multimedia forensics has emerged as a hot topic in the field of information security. Earlier efforts were mostly dedicated to image forensics, since popular image processing software such as Adobe Photoshop can easily be mastered even by an amateur. Later, forensic issues extended to audio and video as well, owing to the availability of editing tools (e.g., Adobe Audition and Adobe Premiere) for those intending to forge audio or video, with or without malicious content manipulation. Compared with images, audio is much easier to forge by cutting, insertion, substitution or splicing without being noticed, even by well-trained ears. The main reason may lie in the fact that silence (unvoiced segments) appears constantly in speech signals, thus facilitating local tampering. For local image forgeries, post-processing such as rotating, resizing, sharpening and blurring is almost always required to make the image look more natural. However, such post-processing also leaves more telltale signs for forensic investigators, leading to the technological evolution of forensics and anti-forensics [1,2]. Several investigations stem from these post-processing footprints, such as [3], where an effective median-filtering detector was presented. In [4], contrast enhancement was detected by checking the pixel value histogram.
* Corresponding author. E-mail address: [email protected] (X. Kang).
http://dx.doi.org/10.1016/j.dsp.2016.07.015 1051-2004/© 2016 Elsevier Inc. All rights reserved.
Although it seems much easier to tamper with audio, identifying and localizing manipulations in a digital audio recording is never an easy task. Audio forensics currently covers topics such as double compression [5,6], fake MP3 bitrate detection [7] and compression history identification [8,9]. These forgeries are often global and content-preserving. Hence, all the audio segments exhibit similar variations and the resulting statistical features are amenable to machine learning techniques. It should be noted that the methods provided in [5–9] are designed specifically to reveal the MP3 compression history, relying on the unique features introduced in the process of MP3 encoding, such as the traces left in the quantized modified discrete cosine transform coefficients. However, there is still an urgent need for the detection of audio content tampering, which is of greater interest to law enforcement agencies. The major challenge for audio content authentication comes from its local characteristics, as audio tampering is performed within targeted fragments of the audio. For example, a keyword or a syllable is cropped from the acoustic signal, leading to misconceptions or ambiguity of the content. For the purpose of local tampering detection, most state-of-the-art methods utilize the electric network frequency (ENF) signal [10–12]. The random fluctuation of ENF signals across time and different geographical locations endows audio signals with unique ENF patterns, which can hence be taken as a type of environmental signature. All these methods require recovering the ENF signals accurately. However, if the audio is corrupted by high levels of noise, accurate extraction of ENF signals becomes difficult and the performance deteriorates
rapidly, not to mention cases without explicit ENF signals, e.g., mobile devices do not directly carry ENF signals. In addition to the footprint left by the power grid, acoustic reverberation, which varies depending on the shape and the composition of a room, can also be regarded as an environmental signature. Related works can be found in [13–15]. Other than environmental traces, device fingerprints can also expose audio manipulation, e.g., microphone identification [16]. In the literature, the issue of audio tampering localization is mostly addressed with the aid of authentication codes or watermarks [17–19]. In contrast to these methods, blind tamper detection achieves the same task without the need for extrinsic information. Existing works include authenticating waveform audio recordings by detecting ENF phase discontinuity [20,21]. The basic idea is that local audio tampering will violate the steady phase variation of ENF signals, as the normal ENF fluctuation is expected to exhibit a pseudo-periodic pattern. For MP3 files, the integrity can be verified by checking frame offsets [22]. In [23], Pan et al. proposed a method to localize audio splicing by estimating the local noise level. Some extensions to [20] can be seen in [24,25]. In [24], better localization accuracy was obtained by comparing, block by block, the maximum cross-correlation between the extracted ENF signal and the reference signal. In [25], the authors demonstrated the viability of using higher harmonics of the ENF signal to evaluate audio authenticity. Another audio forgery localization approach using wavelet singularity analysis was given in [26]. However, noisy conditions were not taken into consideration in [26], and the false negative error increased if the number of forgery operations was small. The authors in [27] combined microphone classification with ENF analysis to detect tampered audio.
Another recent work operating on ENF abnormality was reported in [28], where the authors employed a data-driven threshold-based strategy to deal with the anomalous variations of the ENF signal. In particular, the difference between the extracted ENF and its median-filtered version highlighted the ENF abnormality, which was then captured by a Two-Pass Split-Window. The authors of [28] also explored ENF patterns to implement the task of audio edit detection [29]. Both methods were demonstrated to outperform their counterparts in terms of detection accuracy for audio recordings with favorable noise conditions. However, their performance under noisy conditions remained unsatisfactory.

In this work, we focus on the tough problem encountered by most forensic examiners, that is, the audio is not tampered with in a global sense. In particular, forgers usually concentrate on the content rather than the signal itself. To tackle this type of fraud, we present a detection method based on spectral phase analysis. We further demonstrate that even under a high level of external noise, the recovered spectral phase can still be applied to audio forensics. The rationale for locating audio tampering will also be revealed.

The rest of this paper is organized as follows. In Section 2, we analyze why the ENF-based methods fail under noisy conditions, followed by the rationale behind the STFT analysis for phase reconstruction in Section 3. In Section 4, an approach for localizing tampered speech via spectral phase analysis is presented. Evaluations of the proposed method on both clean and noisy audio recordings are given in Section 5. Finally, conclusions are drawn in Section 6.

2. ENF analysis under noisy conditions

We first review the conventional tampering detectors based on ENF signals. Without loss of generality, we assume that the ENF signal is coupled with the speech in the process of recording. Thus, the speech signal can be formulated as
$$y(n) = s(n) + f(n), \tag{1}$$
where s(n) denotes the genuine speech and f(n) denotes the ENF signal. ENF signals are known to fluctuate around a nominal frequency of 50 Hz or 60 Hz [11]. Hence, y(n) is the recorded signal in which the ENF signal is incorporated. All detection algorithms based on ENF must first separate f(n) from y(n). A common practice is to apply a narrow band-pass filter centered at the nominal frequency to the questioned audio recording y(n). Note that the spectrum of the genuine speech signal s(n) usually does not overlap with that of the ENF signal f(n). Therefore, the impact of s(n) can be eliminated by band-pass filtering and the ENF signal can be recovered. However, in real applications, the audio recording is not guaranteed to be noise free, owing to the imperfections of recording devices and environments. For this scenario, the speech signal should be formulated as
$$y(n) = s(n) + f(n) + v(n), \tag{2}$$
where v(n) denotes the noise, which is often a broad-band signal. Unfortunately, in most situations, no a priori knowledge about v(n) is available. Because the spectrum of v(n) overlaps the pass band, v(n) cannot be completely removed even if a filter with a sufficiently narrow pass band is employed. Although some works address how to recover the ENF signal more accurately [30–32], no further progress has been reported on ENF extraction under unfavorable noise conditions. In addition, a major problem with these ENF tracking methods is that they require audio recordings of sufficiently long duration. As shown in Fig. 1(a), an ENF component can be clearly observed around 50 Hz in clean speech after band-pass filtering, while this is not the case for noisy speech signals, as demonstrated in Fig. 1(b), where the audio recording is subject to background noise with a signal-to-noise ratio (SNR) of 15 dB. More severe background noise further hampers ENF tracking, as shown in Fig. 1(c), where the SNR is decreased to 5 dB. By implication, the ENF variations can be concealed by a high level of noise. Hence, the detection methods proposed in [20,21,24,25] fail or degrade. In this paper, we move beyond the conventional ENF-based methods to seek a novel approach to expose tampered noisy speech via spectral phase analysis. The challenges brought by external noise can be overcome through phase reconstruction.

3. STFT analysis and spectral phase recovery

In this work, we consider only waveform speech, under the assumption that the recording keeps its original sampling rate after manipulation. Otherwise, the difference in sampling rates between the genuine part and the tampered part can be easily detected by the distinct falloffs, since different anti-alias filters are adopted for different sampling rates [21].
We conduct phase recovery in the STFT domain because neighboring sub-band phases are highly correlated once the speech signal is transformed to the STFT domain. This correlation between spectral phases can be further used for forgery detection.

3.1. Short-time Fourier transform for speech signals

Let us first recall how a signal can be represented in the STFT domain [33]. For a unified notation, we use the same symbols as in Section 2 to denote the noisy speech. Noise-free speech can also be formulated by (2), but with v(n) equal to zero. The STFT of noisy speech is represented as
$$Y(k, l) = \sum_{n=0}^{N-1} y(n + l \cdot L) \cdot w(n) \cdot e^{-j\omega_k n} \tag{3}$$
where the noisy speech is processed in frames of length N, in accordance with the fact that audio signals can be viewed as stationary
Fig. 1. Spectrogram analysis around the nominal ENF value (a) clean speech (b) noisy speech at 15 dB SNR (c) noisy speech at 5 dB SNR.
framewise. L denotes the offset between neighboring segments, i.e., there is an overlap of N − L samples. w(n) is a window of length N imposed on each segment prior to the DFT operation. The frequency index and segment index are denoted by k and l, respectively. In (3), the symbol $\omega_k$ denotes a normalized angular frequency equal to $2\pi k/N$, $k = 0, 1, \ldots, N-1$. Fig. 2 shows the baseband phase spectrogram for a clean speech signal, a noisy speech signal and their tampered counterparts (see [35] for the definition of baseband phase). It can be observed that the clean speech shows an obvious structure in which the phase spectra appear highly correlated along time. If a smart forgery is made, the phase spectra still appear correlated, as in the example in Fig. 2(b), where a voiced segment of 0.5 second duration is cut from the speech. In this example, the subtle difference in phase structure caused by tampering cannot be easily discerned by the naked eye. But this subtle difference can be exploited by means of the residual spectral phase and spectral phase correlations, as explained later. Things get more complicated if a certain level of noise is mixed into the signal, as exemplified by Fig. 2(c) and Fig. 2(d), where the SNR of the speech signal is reduced to 10 dB. In this case, the structure of the phase spectra is almost lost in the presence of noise. Based on this observation, we aim at authenticating speech signals by exploiting phase spectrum abnormalities, in particular under noisy conditions, and, if possible, localizing the segment where a manipulation might have taken place. To this end, we have to investigate the impact of noise on both genuine and tampered speech.

3.2. Signal model

Before describing our forgery detection scheme in more detail, we set up the signal model for both genuine and forged speech. Genuine voiced speech can be modeled as a weighted superposition of sinusoids [34], yielding
Fig. 2. Baseband phase spectra: (a) clean genuine speech; (b) tampered speech where a fragment deletion is performed; (c) noisy version of the genuine speech in (a); (d) noisy version of the tampered speech in (b).
$$s(n) = \sum_{h=0}^{H-1} A_h \cdot \cos(\omega_h n + \varphi_h), \tag{4}$$
with the amplitude $A_h$, the time-domain phase $\varphi_h$ and the normalized frequency $\omega_h$
$$\omega_h = \frac{2\pi}{f_s} \cdot f_h = \frac{2\pi (h+1) f_0}{f_s}, \tag{5}$$
where $f_0$, $f_s$ and $f_h$ denote the fundamental frequency, the sampling frequency and the harmonic frequency, respectively. A delicate tampering is often placed in a segment where the SNR is small, making the tampering difficult to detect by ear. For instance, one or more words will be cut from the speech to conceal some important information or mislead the listeners. Under this assumption, the boundaries of tampering often occur at the gaps between voiced segments. In those transition frames or silent segments, the signal energy is relatively low. Thus they are more susceptible to external noise but also provide a safer place to hide the tampering. Compared with the spectral amplitude, the phases are less affected by external noise, as demonstrated in the Appendix. Furthermore, the finite length of the analysis window employed in the STFT inevitably brings about the leakage effect. The degree of leakage can be contained within a certain level by adopting a more appropriate window with smaller side-lobe amplitudes, e.g., the Hamming window, which is often used in speech processing, but it cannot be eliminated. Because leakage spreads each harmonic into the adjacent bands, there is a chance to recover the phase spectra by exploiting the relationships between neighboring audio segments and adjacent frequency bands. Nevertheless, if a fragment is spliced in from an alien speech signal, a sinusoidal model for the splicing region should be set up as (6) or (7), depending on the location and the amount of splicing. If the editing points are located in the midst of a voiced segment, the relevant voiced segments should be modeled as (6), since a single sinusoidal model does not suffice to approximate the signal. Otherwise (7) holds.
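$$\tilde{s}(n) = \sum_{h=0}^{H-1} A_h \cdot \cos(\omega_h n + \varphi_h) + \sum_{t=0}^{T-1} A_t \cdot \cos(\omega_t n + \varphi_t) \tag{6}$$
$$\tilde{s}(n) = \sum_{t=0}^{T-1} A_t \cdot \cos(\omega_t n + \varphi_t) \tag{7}$$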
with some alien harmonic components $\omega_t$ mixed in. Note that $\omega_t$ may deviate from $\omega_h$, and $\tilde{s}(n)$ is the modified signal. Even if the harmonic frequency is estimated accurately, a substantial phase shift will appear, since these modified segments are less correlated with their adjacent voiced segments.
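To make the signal models (4) to (7) more concrete, the following minimal sketch synthesizes a voiced frame as a harmonic sum and then superimposes alien harmonics as in (6). The function names, fundamental frequencies, amplitudes and phases are illustrative assumptions and are not taken from the experiments in this paper.

```python
import math

def harmonic_signal(n_samples, f0, fs, amps, phases):
    """Synthesize s(n) = sum_h A_h * cos(w_h * n + phi_h), as in (4) and (5).

    w_h = 2*pi*(h+1)*f0/fs is the normalized frequency of harmonic h.
    """
    out = []
    for n in range(n_samples):
        value = 0.0
        for h, (a, phi) in enumerate(zip(amps, phases)):
            w_h = 2.0 * math.pi * (h + 1) * f0 / fs
            value += a * math.cos(w_h * n + phi)
        out.append(value)
    return out

def spliced_signal(genuine, alien):
    """Model (6): alien harmonics are superimposed on the genuine ones."""
    return [g + a for g, a in zip(genuine, alien)]

# Hypothetical example values (assumptions for illustration only).
fs = 8000
genuine = harmonic_signal(256, f0=120.0, fs=fs,
                          amps=[1.0, 0.5, 0.25], phases=[0.0, 0.4, -0.2])
alien = harmonic_signal(256, f0=185.0, fs=fs,
                        amps=[0.8, 0.3], phases=[1.0, 0.1])
mixed = spliced_signal(genuine, alien)
```

Passing only the alien harmonics to `harmonic_signal` corresponds to model (7), where the genuine components are replaced rather than superimposed.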
Fig. 3. (a) Sub-band phase variations for the segment where tampering occurs. (b) Kurtosis statistics calculated over the entire signal.
More importantly, there is also a kind of tampering that does not introduce new harmonic frequencies. Cases such as deletion and copy-and-paste (part of the signal is copied from the same speech but pasted at a different location) belong to this type of forgery. We will concentrate on this kind of fraud in this work. One of the corresponding signal models, for the case of deletion tampering, can be denoted by (8). Models for the copy-and-paste forgery can be derived in a similar way.
$$\tilde{s}(n) = s(n) \cdot \varepsilon(n - d), \tag{8}$$
with d denoting the position where the cutting takes place and ε(·) denoting a step function. It should be noted that d quite often indicates a discontinuity point caused by tampering. Unless the signal is carefully examined point by point, the problem of synchronization cannot be well solved in the temporal domain. This also poses a challenge to the forger, since existing editing tools do not provide such a delicate feature. Motivated by the discontinuity caused by speech tampering, we consider an example of signal deletion. In this example, the first few samples of a sinusoidal signal are deleted, resulting in a phase discontinuity. If the signal is sufficiently long and the phase variation is subtle, the discontinuity cannot be directly discerned in the time domain. Fig. 3 shows how the spectral phase changes due to the deletion. Fig. 3(a) demonstrates that the original sinusoidal signal keeps a constant zero baseband phase across all the frequency bins. The tampered signal exhibits sudden phase changes in the STFT domain, while this is not so evident in the time domain. This also indicates that new phases will show up in the spectral domain due to a sudden cut-off caused by deletion or insertion. From the viewpoint of the STFT, a sudden cut-off acts like imposing an additional window on the signal. This additional window brings about a non-linearity effect, thus introducing new frequencies and sub-band phases which do not exist in the genuine signal. Note that this holds true for the sinusoidal model even if more harmonic components are incorporated. Therefore, the sub-band phase of genuine speech is highly sparse and features an extreme peak. By contrast, the phase distribution of tampered speech is flatter. In light of this, the kurtosis measure is employed to depict the sparseness of the sub-band phase. The kurtosis calculated over the entire signal is plotted in Fig. 3(b).
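The deletion example can be reproduced with a small sketch. It evaluates one STFT bin per frame following (3) with a rectangular window and removes the linear term $\omega_k \cdot l \cdot L$, which is our reading of the baseband phase in [35] and should be treated as an assumption. The frame length, hop size, bin index, initial phase and cut position are hypothetical values chosen so that the sinusoid sits exactly on a bin center.

```python
import cmath
import math

def princ(x):
    """Wrap a phase value onto [-pi, pi]."""
    return math.remainder(x, 2.0 * math.pi)

def stft_bin(y, k, l, N, L):
    """One STFT coefficient Y(k, l) as in (3), with a rectangular window."""
    w_k = 2.0 * math.pi * k / N
    return sum(y[n + l * L] * cmath.exp(-1j * w_k * n) for n in range(N))

def baseband_phase(y, k, N, L):
    """Phase of bin k in every frame, with the linear term w_k*l*L removed."""
    w_k = 2.0 * math.pi * k / N
    n_frames = (len(y) - N) // L + 1
    return [princ(cmath.phase(stft_bin(y, k, l, N, L)) - w_k * l * L)
            for l in range(n_frames)]

N, L, k0 = 64, 16, 5                      # assumed frame length, hop and bin
w0 = 2.0 * math.pi * k0 / N               # sinusoid centred on bin k0
genuine = [math.cos(w0 * n + 0.3) for n in range(640)]
d, cut = 3, 200                           # delete d samples starting at index cut
tampered = genuine[:cut] + genuine[cut + d:]

bp_genuine = baseband_phase(genuine, k0, N, L)
bp_tampered = baseband_phase(tampered, k0, N, L)
# Offset between frames before and after the cut; it equals princ(w0 * d).
jump = princ(bp_tampered[-1] - bp_tampered[0])
```

In this toy setting the genuine baseband phase stays constant at the initial phase of the sinusoid, whereas the tampered signal settles at a different constant after the cut, with the frames that straddle the cut taking intermediate values.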
Kurtosis for a stochastic signal is defined as $\kappa = \mu_4 / (\sigma^2)^2$, where $\mu_4$ is the fourth-order central moment and $\sigma^2$ is the variance. Pronounced fluctuations in the kurtosis of the tampered signal can be observed, whereas the phase spectra of the genuine signal exhibit a smoother kurtosis across the sub-bands, except for the bins which contain the sinusoid components. In addition, as shown in
Fig. 3(b), the kurtosis of the forged signal is very close to that of a phase uniformly distributed on [−π, π], suggesting that the spectral phase of the forged signal covers a wider range of values than that of the unforged signal.

3.3. Phase recovery under noisy conditions

Motivated by the findings in [35], which provide a decent phase reconstruction approach for noisy conditions, we employ the method in [35] for phase recovery of the questioned audio recordings. Further, we closely inspect the recovered phase in the spectral domain. As described above, it is very important to recover the spectral phase in the case of interference from external noise. Following the method in [35], the essence of phase reconstruction in the spectral domain can be boiled down to a combined process. First, at the onset of a voiced sound, the phase is reconstructed along frequency bands
$$\phi^{S}_{k+i} = \mathrm{Princ}\left\{ \phi^{S}_{k} - \phi^{W}_{k-\kappa_h^k} + \phi^{W}_{k-\kappa_h^k+i} \right\}, \tag{9}$$
where the reconstructed phase across frequency bands is denoted by $\phi^{S}_{k+i}$, with band k containing the harmonic component. The spectral phase $\phi^{S}_{k}$ for the band k which accommodates a harmonic component can be directly obtained from the noisy speech. $\kappa_h^k$ is the index notation mapped from the harmonic frequency and $\phi^{W}_{k-\kappa_h^k}$ is the spectral phase of the shifted analysis window. According to (9), the clean phase can be recovered for spectral bands that do not contain harmonic components. Phase reconstruction across frequency bands can be operated on all the audio segments, thus the segment index is omitted here. Princ{•} denotes the principal value operator, which wraps the phase onto the range [−π, π]. Second, for bands containing harmonic components, the phase of the consecutive segment is reconstructed along time as
$$\phi^{S}_{k,l} = \mathrm{Princ}\left\{ \phi^{S}_{k,l-1} + \omega^{k}_{h,l} L \right\}, \tag{10}$$
where L is the segment shift and $\phi^{S}_{k,l}$ denotes the reconstructed phase of the clean signal S at the l-th segment with frequency index k. The harmonic frequency dominating band k is denoted by $\omega^{k}_{h,l}$, i.e., the harmonic frequency that is closest to the center frequency of band k.
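The two reconstruction steps in (9) and (10) can be sketched as follows. This is a simplified illustration rather than the implementation of [35]: the window phase is supplied by the caller as a function of a (possibly fractional) bin offset, and the rectangular-window phase used in the example keeps only the linear-phase term, which is valid inside the main lobe. All numeric values are hypothetical.

```python
import math

def princ(x):
    """Principal value operator: wrap a phase onto [-pi, pi]."""
    return math.remainder(x, 2.0 * math.pi)

def reconstruct_along_time(phi_prev, omega_h, hop):
    """Eq. (10): advance the phase of a harmonic band by one segment shift."""
    return princ(phi_prev + omega_h * hop)

def reconstruct_along_frequency(phi_k, window_phase, offset, i):
    """Eq. (9): propagate the phase of harmonic band k to band k + i.

    window_phase(m) returns the phase of the analysis window transform at
    bin offset m; offset stands for k - kappa_h^k.
    """
    return princ(phi_k - window_phase(offset) + window_phase(offset + i))

# Hypothetical setup: rectangular window of length N, DFT length M.
N, M, hop = 64, 64, 16

def rect_phase(m):
    """Linear-phase term of the rectangular window (main-lobe assumption)."""
    return -math.pi * m * (N - 1) / M

omega_h = 2.0 * math.pi * 5.2 / M      # harmonic slightly off the centre of bin 5
phi = 0.3                              # phase taken at the voiced onset
track = [phi]
for _ in range(4):                     # four further segments via Eq. (10)
    phi = reconstruct_along_time(phi, omega_h, hop)
    track.append(phi)

# Phase of the neighbouring band k + 1 at the onset via Eq. (9).
neighbour = reconstruct_along_frequency(track[0], rect_phase, 5 - 5.2, 1)
```

The phase of the harmonic band is thus carried forward in time, while the bands around it are filled in from the known phase response of the analysis window.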
4. Spectral phase analysis and tampering localization

Although it has already been pointed out that spectral discontinuities may be a good indicator of an unnatural offset, which is usually caused by tampering [21], the method proposed in [21] cannot survive a high level of noise, since it authenticates audio based on discontinuities in spectral distance and ENF phase, both of which are sensitive to external noise. As for its application to speech authentication, minor inconsistencies caused by content manipulation will be further masked by noise, leading to an increase in detection error. In this section, we explore features based on spectral phase variations to discover possible audio tampering.

4.1. Analysis of sub-band phase

We conduct a combined phase reconstruction as described in Section 3. According to (10), for two successive audio segments that both contain harmonics, the spectral phase difference can be represented as
$$\Delta\phi^{S}_{k,l} = \mathrm{Princ}\left\{ \phi^{S}_{k,l} - \phi^{S}_{k,l-1} \right\} = \mathrm{Princ}\left\{ \omega^{k}_{h,l} L \right\}. \tag{11}$$
With (11), we can deduce that the spectral phase difference across different audio frames is either approximately linear or zero. In the linear case, $\Delta\phi^{S}_{k,l}$ varies linearly with the dominant harmonic frequency $\omega^{k}_{h,l}$, at a rate determined by the fixed frame shift L. $\Delta\phi^{S}_{k,l}$ approaches zero when the frame shift L is an integer multiple of the period length of the dominant harmonic, i.e., $\omega^{k}_{h,l} L = 2\pi \cdot m$. In line with our analysis, the blue asterisks and green crosses in Fig. 4 show the spectral phase differences between two adjacent voiced frames. Fig. 4 clearly shows that for audio segments with high SNR, the phase difference between neighboring frames is small and follows a similar trend of variation. Meanwhile, for unvoiced segments, which do not undergo any phase reconstruction, significant phase changes can be seen, as marked by the black diamonds and red triangles. The reason is straightforward: the unvoiced segments are dominated by external noise, resulting in abrupt phase changes. In Fig. 4, the whole spectrum is divided into 64 sub-bands of equal bandwidth and the averaged phase alteration in each sub-band is displayed. It is also observed in Fig. 4 that for voiced segments, the low frequency sub-bands exhibit more prominent phase variation than the high frequency sub-bands. A further insight gained from the results in Fig. 4 is that the recovered spectral phases in voiced segments might be more helpful for detecting forgery than those in unvoiced segments. Note that (10) and (11) only take effect in the temporal domain. For the spectral bands that do not contain harmonic components, i.e., frequency bands that exhibit low SNRs, spectral phase reconstruction across frequency bands is adopted. Therefore, for the bands where signal energy is low, the spectral phase difference between different frequency sub-bands is given by
$$\phi^{S}_{k+i} - \phi^{S}_{k} = \mathrm{Princ}\left\{ \phi^{W}_{k-\kappa_h^k+i} - \phi^{W}_{k-\kappa_h^k} \right\}. \tag{12}$$
Fig. 4. Spectral phase difference for four consecutive segments. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.)
The phase difference calculated by (12) aims at recovering phase across frequency bands in a single voiced segment. For the same frequency bands at neighboring frames, (10) still holds. From (12), we can find that the spectral phase difference along different frequency bands in a single frame depends on the phase difference yielded by the analysis window W . Therefore, it is of vital importance to choose an appropriate analysis window. For a simple solution, a Rectangular window with linear phase is adopted here. Then a commonly used Hann window will be considered for comparison. A simple analysis window with linear phase can be formulated as
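$$w(n) = R_N(n) \tag{13}$$
with a window length of N. Its DTFT can be calculated by
$$W(\omega) = \frac{\sin\left(\frac{N\omega}{2}\right)}{\sin\left(\frac{\omega}{2}\right)} \cdot e^{-j\omega \cdot \frac{N-1}{2}} \tag{14}$$
For a compact notation, we define its amplitude spectrum as
$$W_R(\omega) = \frac{\sin\left(\frac{N\omega}{2}\right)}{\sin\left(\frac{\omega}{2}\right)} \tag{15}$$
From (12) and (14), we can expect a phase difference for a DFT length of M as
$$\phi^{S}_{k+i} - \phi^{S}_{k} = -i\pi \frac{N-1}{M} \tag{16}$$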
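As a quick numerical check of (16), the sketch below evaluates the DTFT of a rectangular window directly at neighbouring DFT bins and compares the measured phase step with the predicted value for i = 1. The window length and DFT length are assumed values, and only bins inside the main lobe are used, where the amplitude spectrum in (15) is positive and no additional phase of π arises.

```python
import cmath
import math

def rect_window_dtft(omega, N):
    """Direct DTFT of the length-N rectangular window R_N(n)."""
    return sum(cmath.exp(-1j * omega * n) for n in range(N))

N, M = 16, 64                       # assumed window and DFT lengths (M > N)
phases = [cmath.phase(rect_window_dtft(2.0 * math.pi * m / M, N))
          for m in range(1, 4)]     # bins 1..3 lie inside the main lobe
measured = [b - a for a, b in zip(phases, phases[1:])]
predicted = -math.pi * (N - 1) / M  # Eq. (16) with i = 1
```

With these values every entry of `measured` agrees with `predicted` up to floating-point error, independent of the starting bin.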
Generally, M is equal to or larger than N. If M is larger than the window length N, zero padding is required. Besides, if W R (ω) is negative, it will impose an additional phase jump of π on the phase difference. Moreover, the phase difference between different frequency bands is independent of the band index k. Also note that phase wrapping is taken, thus resulting in a phase difference mapping onto a range between −2π and 2π . Obviously a larger DFT length will incur phase reconstruction on more frequency bands. As for the Hann Window, it can be represented as (17) in the temporal domain.
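$$w(n) = \left[ 0.5 - 0.5\cos\left(\frac{2\pi n}{N-1}\right) \right] \cdot R_N(n) \tag{17}$$
Accordingly, its DTFT is obtained by
$$W(\omega) = \left\{ \frac{1}{2} W_R(\omega) + \frac{1}{4}\left[ W_R\left(\omega - \frac{2\pi}{N-1}\right) + W_R\left(\omega + \frac{2\pi}{N-1}\right) \right] \right\} e^{-j\left(\frac{N-1}{2}\right)\cdot\omega}. \tag{18}$$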
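The closed form in (18) can be checked against a direct summation of the windowed exponentials. The following sketch is such a check under assumed values for the window length and the test frequencies; the helper names are hypothetical.

```python
import cmath
import math

def w_r(omega, N):
    """Amplitude spectrum W_R of the rectangular window, Eq. (15)."""
    den = math.sin(omega / 2.0)
    if abs(den) < 1e-12:            # limit for omega -> 0 (only case used here)
        return float(N)
    return math.sin(N * omega / 2.0) / den

def hann_dtft_closed_form(omega, N):
    """Closed-form DTFT of the Hann window, following Eq. (18)."""
    shift = 2.0 * math.pi / (N - 1)
    amplitude = (0.5 * w_r(omega, N)
                 + 0.25 * (w_r(omega - shift, N) + w_r(omega + shift, N)))
    return amplitude * cmath.exp(-1j * (N - 1) / 2.0 * omega)

def hann_dtft_direct(omega, N):
    """Direct summation of w(n) * exp(-j*omega*n) with w(n) from Eq. (17)."""
    return sum((0.5 - 0.5 * math.cos(2.0 * math.pi * n / (N - 1)))
               * cmath.exp(-1j * omega * n) for n in range(N))

N = 32                                      # assumed window length
test_frequencies = [0.05, 0.4, 1.1, 2.5]    # rad/sample, arbitrary choices
max_error = max(abs(hann_dtft_direct(w, N) - hann_dtft_closed_form(w, N))
                for w in test_frequencies)
```

Here `max_error` stays at the level of floating-point rounding, which supports reading the term in braces as a real-valued amplitude response multiplied by the same linear-phase factor as in (14).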
Using (18), we are able to derive the phase difference determined by the Hann window. In (18), a linear phase term $e^{-j\left(\frac{N-1}{2}\right)\cdot\omega}$, the same as in the Rectangular window case, also appears. Inside the braces is the amplitude response introduced by the Hann window. According to the analysis in Section 3, we adopt a window length N for STFT analysis, but a sudden cut or insertion acts like imposing an additional window within the segment under analysis such that the harmonic components cannot be well resolved. As reflected by Fig. 3, the nonlinearity introduced by the forgery leads to kurtosis values of the spectral phase approaching that of a uniformly distributed phase. Moreover, the spectral phase correlation between two voiced segments is weakened by the manipulations, since two audio segments that were originally located far apart become closer. Again, it should be noted that there is often an unvoiced gap between these two segments. Even for a smart forgery, the attenuated spectral phase correlation can still be utilized as an indicator of a possible forgery. Besides, since we calculate the spectral phase piecewise, it is viable to detect where the tampering is located.

4.2. Measurements of spectral phase difference

So far we have obtained the sub-band phase feature and analyzed the possible indicators of a forgery, but the metric to identify and localize these abnormalities is still to be devised. In this subsection, we explore a metric to measure the disparities between adjacent voiced segments so as to allow for an automated scrutiny of the questioned audio recording. To be clear, we define tampering between consecutive voiced segments as intra-voice tampering. As opposed to intra-voice tampering, inter-voice tampering refers to tampering that occurs between non-consecutive voiced segments, i.e., voiced segments separated by silent intervals. Special attention should be paid to inter-voice tampering, since this type of forgery does not introduce evident distortions from the perspective of signals, but semantically breaches the content. The noise condition should also be factored in when designing the metric: the features should be able to survive noise that might conceal the tampering. We find that the kurtosis statistic of the spectral phase across the AC sub-bands remains robust to noise, as indicated by Fig. 5. Even under an adverse noise condition, when the SNR falls to 10 dB as in the case shown in Fig. 5(b), the kurtosis of the recovered spectral phase in voiced segments remains approximately constant.

Fig. 5. (a) Spectral phase kurtosis for clean speech. (b) Spectral phase kurtosis for noisy speech at 10 dB SNR.

As illustrated by Fig. 3, the objective is to find out whether new spectral phases are present in voiced frames. To achieve this, we start by identifying the voice active regions in the speech, followed by pitch estimation, which sets the foundation for spectral phase reconstruction. Several methods can fulfill this task, e.g., [37] delivers quite a satisfactory performance for voice activity detection (VAD) and pitch tracking even under adverse noise conditions. Then the kurtosis statistic of speech signals within a limited duration is calculated. For the sub-bands that directly accommodate a harmonic component in voiced segments, i.e., with a relatively high SNR, the reconstructed spectral phase is very close to that of the noisy version, which also indicates that the phase residue approximates zero. The spectral phase residual is defined as
$$\Delta_{k,l} = \phi^{Y}_{k,l} - \phi^{S}_{k,l}, \tag{19}$$
where $\phi^{Y}_{k,l}$ is the noisy phase and $\phi^{S}_{k,l}$ is the reconstructed clean phase. From a statistical point of view, the residual spectral phase across all the AC sub-bands will exhibit a Gaussian-like or super-Gaussian distribution for genuine speech. Fig. 6 shows the histogram and kurtosis of the spectral phase residue across the AC sub-bands for both genuine and tampered speech, calculated on the voiced segments of audio clips of 1 second duration each. In this example, a deletion is applied to the original speech during this one-second period. Here we use the excess kurtosis¹ $\kappa = \mu_4 / (\sigma^2)^2 - 3$, under which a normal distribution has an excess kurtosis of 0. As depicted by Fig. 6(a), the histogram of tampered speech is heavy-tailed, reflected by the increase of non-zero spectral phase residues. It can also be discerned in Fig. 6(b) that all the spectral phase residues of the forged speech have a kurtosis less than zero, displaying a more uniform distribution. Hence, we derive a measurement with respect to kurtosis statistics as
$$R_1 = \frac{N_1}{N_2}, \tag{20}$$
where $N_1$ denotes the number of frequency bins whose kurtosis is above zero, and $N_2$ is the number of bins whose kurtosis is below zero. Using this kurtosis ratio, if $R_1 < \gamma$, with $\gamma$ as a threshold, the audio is likely to have undergone a forgery. As discussed previously, manipulations performed in voice-active regions are more easily perceived by human ears. In many real-world cases a forger takes this into account, so that the manipulation begins and ends in voice-inactive regions. However, the statistical metric $R_1$, which is based on the spectral phase residue, can only detect tampering in voiced frames. Another important consideration is that $R_1$ should be used for short audio clips only; otherwise it will incur missed detections. This is because the kurtosis is obtained by taking all the voiced frames into account, so abrupt phase changes are smoothed out to some extent over a long clip. We therefore adopt another measurement to complement the detection, i.e., the phase correlation between voiced segments, which is defined in (21). In particular, this measurement targets inter-voice tampering.
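The kurtosis ratio in (20) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names `excess_kurtosis` and `kurtosis_ratio` are hypothetical, and the input is assumed to be a (frames × bins) array of spectral phase residues that has already been computed for the voiced frames of one short clip.

```python
import numpy as np

def excess_kurtosis(x, axis=0):
    """Excess kurtosis (0 for a Gaussian), computed along the given axis."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=axis, keepdims=True)
    m2 = (centered ** 2).mean(axis=axis)   # second central moment
    m4 = (centered ** 4).mean(axis=axis)   # fourth central moment
    return m4 / (m2 ** 2) - 3.0

def kurtosis_ratio(phase_residue):
    """R1 = N1 / N2 for a (frames x bins) array of phase residues.

    Assumption: each column holds the residues of one frequency bin.
    """
    k = excess_kurtosis(phase_residue, axis=0)
    n1 = int(np.sum(k > 0))   # bins with kurtosis above zero
    n2 = int(np.sum(k < 0))   # bins with kurtosis below zero
    return n1 / n2 if n2 > 0 else float("inf")
```

A small $R_1$ means that most bins look uniform-like (negative excess kurtosis), which is the pattern the text associates with tampered voiced frames.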
$$R_2 = \rho\left(\Phi_l^S, \Phi_{l+1}^S\right), \tag{21}$$
where $\Phi_l^S$ and $\Phi_{l+1}^S$ denote vectors that collect all the reconstructed baseband spectral phases of two voiced segments respectively, and $\rho$ is the correlation coefficient between the two vectors. Notice that $l$ and $l + 1$ denote the $l$-th voiced frame and the next voiced frame identified by the voice activity detection method; they do not have to be temporally consecutive. Following the rationale in Section 3, two adjacent voiced frames are highly correlated with respect to their spectral phase, giving rise to a large $R_2$. In contrast, if a forgery occurs between these two voiced frames, the phase correlation between them is attenuated. In order to yield a fair comparison between voiced frames, $R_2$ is obtained over a longer period of the speech signal, e.g., 3 seconds. Generally, an audio clip of three-second duration covers several voiced segments as well as silent intervals. The mean value of the phase correlation for the audio clip under analysis is given by (22).
1 Hereafter, excess kurtosis is referred to as kurtosis for simplicity.
X. Lin, X. Kang / Digital Signal Processing 60 (2017) 63–74
Fig. 6. (a) Histogram of spectral phase residue: upper panel for genuine speech, lower panel for tampered speech. (b) Kurtosis of spectral phase residue over the frequency sub-bands.
Fig. 7. (a) Original signal. (b) Edited signal obtained by cropping a 1-second segment. (c) $R_3$ for original speech. (d) $R_3$ for tampered speech. (For interpretation of the colors in this figure, the reader is referred to the web version of this article.)
$$R_3 = \operatorname{mean}(R_2). \tag{22}$$
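As a rough illustration of (21) and (22), the sketch below computes the correlation between the phase vectors of successive voiced segments and averages it over one clip. It assumes that the reconstructed baseband phases are already available as one vector per voiced segment, that a plain Pearson coefficient is an acceptable stand-in for $\rho$, and that vectors of unequal length are truncated to a common length; the function names are hypothetical.

```python
import numpy as np

def phase_correlation(phi_a, phi_b):
    """R2: Pearson correlation coefficient between two phase vectors."""
    a = np.asarray(phi_a, dtype=float).ravel()
    b = np.asarray(phi_b, dtype=float).ravel()
    n = min(a.size, b.size)  # assumption: truncate to a common length
    return float(np.corrcoef(a[:n], b[:n])[0, 1])

def mean_phase_correlation(segments):
    """R3: mean of R2 over successive pairs of voiced segments in a clip."""
    pairs = zip(segments[:-1], segments[1:])
    r2 = [phase_correlation(a, b) for a, b in pairs]
    # With fewer than two voiced segments no pair exists, so R3 is undefined.
    return float(np.mean(r2)) if r2 else float("nan")
```

In the setting described in the text, `segments` would hold the voiced segments found in one three-second clip, and the clip would then be advanced by one second.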
Fig. 7 illustrates an example of the mean phase correlation before and after tampering. $R_3$ is calculated on three-second audio clips, with a 2-second overlap between consecutive clips. In this example, a segment about one second long is cropped from the original file. The voice-active regions are marked in red. Fig. 7(c) and (d) show the $R_3$ metric for the original and the tampered speech, respectively. A decline of $R_3$ can be clearly seen in Fig. 7(d) at clip indices 12 and 13, implying that a forgery may have occurred.
4.3. Tampering detection
Based on the above analysis, the spectral phase of voiced segments is quite different from that of unvoiced segments. Therefore, we first have to distinguish which segments are voice-active and which are not. As indicated by Fig. 4, there is a high similarity between the spectral phases of voiced segments. Since our method is statistical whereas the tampering is local, we split the test speech into small audio clips of 1-second duration to identify intra-voice tampering, and $R_1$ is calculated for each clip. For those audio clips with $R_1 > \gamma$, $R_3$ is obtained by (22) over a 3-second audio clip to capture the relationship between voiced frames separated by silent intervals. If $R_3$ approaches zero, the forgery is likely to be located in this interval. Hence, under a binary hypothesis framework, the decision rule can be designed as
$$\begin{cases} H_0:\ \mathrm{I}(R_1 > \gamma) \times R_3 \ge \tau, & \text{the audio is unforged} \\ H_1:\ \mathrm{I}(R_1 > \gamma) \times R_3 < \tau, & \text{the audio is forged} \end{cases} \tag{23}$$
where $\gamma$ and $\tau$ are the thresholds for $R_1$ and $R_3$ respectively, and $\mathrm{I}(\cdot)$ is an indicator function which yields 1 if the argument holds true and 0 otherwise. Fig. 8 summarizes the proposed scheme for audio tampering detection.
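The decision rule in (23) is simple enough to write down directly. The sketch below is a minimal illustration with a hypothetical function name; the threshold values in the usage lines are placeholders chosen for the example, not values reported in the paper.

```python
def is_forged(r1, r3, gamma, tau):
    """Decision rule (23): decide H1 (forged) when I(R1 > gamma) * R3 < tau."""
    indicator = 1.0 if r1 > gamma else 0.0
    return indicator * r3 < tau

# Placeholder thresholds, for illustration only.
print(is_forged(r1=0.5, r3=0.9, gamma=1.0, tau=0.3))  # R1 fails the test
print(is_forged(r1=2.0, r3=0.9, gamma=1.0, tau=0.3))  # both tests pass
```

Note that for a positive $\tau$, a clip with $R_1 \le \gamma$ is always assigned to $H_1$, because the indicator forces the test statistic to zero; $R_3$ only matters for clips that pass the kurtosis test.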
Fig. 8. A block diagram of the proposed scheme.
5. Evaluations of the proposed method
To evaluate the performance of the proposed method, we consider the same database used in [28], i.e., the Carioca 1 database [36], which comprises 100 original and 100 edited audio recordings of PSTN phone calls sampled at 44.1 kHz. Male speakers account for half of the database, and the remaining half is from female speakers. The audio manipulation was performed such that half of the audio recordings had a portion of the speech deleted, while the other half had a portion of audio inserted. It should be noted that these signals are by no means noise-free, since they were recorded in a typical office environment; here we use “clean” to denote the speech signals in the Carioca 1 database. Nevertheless, the ENF components are present in these audio recordings. For the noisy speech experiment, several types of noise are added to the Carioca 1 database at various SNR levels to mimic real-world noisy speech recordings. The noise types include white Gaussian noise, factory noise, babble noise and car noise. The SNR is set to 25 dB, 20 dB, 15 dB, 10 dB and 5 dB. Note that when we consider different SNR levels in this section, the original dataset is treated as noise-free, and the noise refers to the artificially added noise.
Fig. 9. DET curves for the method in [28] and our proposed method.
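Adding noise at a prescribed SNR, as in the noisy speech experiment described here, can be sketched as follows. This is a generic recipe rather than the authors' script: the function name is hypothetical, the SNR is defined over the whole signal (not only the voiced parts), and the noise recording is assumed to be at least as long as the speech.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then mix."""
    speech = np.asarray(speech, dtype=float)
    # Assumption: the noise is at least as long as the speech; trim the rest.
    noise = np.asarray(noise, dtype=float)[: speech.size]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that brings the noise power to p_speech / 10^(snr_db / 10).
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Calling this with `snr_db` set to 25, 20, 15, 10 and 5 in turn would reproduce the range of noise levels listed in the text, with the recorded speech treated as the noise-free reference.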
5.1. Performance for clean speech signals
The detection error tradeoff (DET) curve of our proposed method is shown in Fig. 9, together with the DET curve of the method in [28], for the Carioca 1 database. Our method achieves performance comparable to [28] when the ENF signal is not corrupted, i.e., under a more favorable noise condition. The method in [28] offers a lower equal error rate (EER) than our proposed method: 4% versus 6%. Fig. 9 also shows that the method in [28] has an advantage in regions where the false positive rate (FPR) is small.
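For readers unfamiliar with the EER, the sketch below shows one common way to approximate it from per-recording detection scores. It is not tied to either method compared here: the function name is hypothetical, and it assumes a score convention in which higher values mean "more likely genuine" and a recording is flagged as forged when its score falls below the threshold.

```python
import numpy as np

def equal_error_rate(genuine_scores, forged_scores):
    """Approximate EER by sweeping the threshold over all observed scores."""
    genuine = np.asarray(genuine_scores, dtype=float)
    forged = np.asarray(forged_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, forged]))
    # False positive: a genuine recording flagged as forged (score < threshold).
    fpr = np.array([np.mean(genuine < t) for t in thresholds])
    # False negative: a forged recording accepted (score >= threshold).
    fnr = np.array([np.mean(forged >= t) for t in thresholds])
    i = int(np.argmin(np.abs(fpr - fnr)))  # point where the two rates meet
    return float((fpr[i] + fnr[i]) / 2.0)
```

Plotting `fnr` against `fpr` over the same threshold sweep gives the raw points of a DET curve; DET plots conventionally use normal-deviate axes, which this sketch does not apply.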
Fig. 10. Detection performance under different noisy conditions. (a) White Gaussian Noise. (b) Factory Noise. (c) Car Noise. (d) Babble Noise.
5.2. Performance for noisy speech signals
We also evaluate the performance of our method under a variety of noisy conditions as previously described, i.e., white Gaussian noise, factory noise, car noise and babble noise. Fig. 10 shows the detection accuracy for the various noise types and levels. As evidenced by Fig. 10, our proposed method consistently outperforms the method in [28] when the SNR is below 20 dB. A significant performance gap between the two methods can be observed in Fig. 10(c), where car noise is present. This is because the power spectrum of car noise occupies the low frequency bands where the ENF component resides, thus hindering the ENF analysis. However, the improvement in the babble noise case is less significant, mainly because the phase reconstruction is susceptible to babble noise, whose power spectrum matches that of the speech signal. In particular, the pitch tracking is less robust against babble noise. A similar analysis can be extended to white Gaussian noise and factory noise, whose energy covers the whole frequency band, resulting in a medium performance gap. The difference between the white Gaussian noise and factory noise cases may be explained by the fact that white Gaussian noise has uniform strength over the entire frequency band, while factory noise is less intense in the low frequency bands.
5.3. The effect of windows
It should be noted that the analysis window plays an important role in effective detection. Unlike conventional speech analysis, which often requires a short window (common setups are 20 to 40 ms), a longer analysis window is favored here, since the spectral phase is shown to be stationary over a longer period of time than the spectral amplitude. A longer window yields a finer-grained view of spectral phase variations, whereas a shorter window localizes tampering better. Hence, there is a tradeoff when choosing a proper window size. In our experiment, we choose 128 ms as the window size, which results in a frequency resolution of 7.8 Hz. Such a resolution suffices to discriminate the harmonic components so as to reconstruct the spectral phase more accurately. Moreover, as illustrated in Section 2, the segment offset L also has an impact on the detection performance. The smaller L is, i.e., the larger the overlap between adjacent frames, the finer the time–frequency analysis that can be achieved. Furthermore, since the audio segments are not independent of each other, a potential tampering point falls not in a single frame but in two frames for an overlap of 50%, in four frames for an overlap of 75%, and so on. If the offset is further decreased, a more reliable detection can be achieved at the cost of computational complexity.
It should be pointed out that an analysis window with smaller side-lobes is preferred in most applications so as to suppress spectral leakage. However, in our proposed method, since the reconstruction for frequency bands with low SNR relies on the leakage effect, we find that the rectangular window is a better option, as supported by the results in Fig. 11. For simplicity, only white Gaussian noise is considered in this set of experiments. The window size and the segment shift are set to 128 ms and 16 ms respectively. Under this configuration, the rectangular window consistently outperforms the Hann window when the SNR is below 25 dB. A reasonable explanation might be that an abrupt phase change due to the forgery is smoothed by a tapering window, e.g., the Hann window in the experiments.
Fig. 11. Performance of the proposed method for different types of window under the white Gaussian noise.
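The window parameters discussed in this section are related by simple arithmetic, which the following sketch makes explicit. The function name is hypothetical; the frequency resolution is taken as the reciprocal of the window duration (the DFT bin spacing without zero-padding), and the frame count is the number of frames that cover any given sample once the analysis is in steady state.

```python
import math

def stft_parameters(window_ms, hop_ms):
    """Return (frequency resolution in Hz, overlap fraction, frames per sample)."""
    resolution_hz = 1000.0 / window_ms          # 1 / window duration in seconds
    overlap = 1.0 - hop_ms / window_ms          # fraction shared by adjacent frames
    frames_per_sample = math.ceil(window_ms / hop_ms)
    return resolution_hz, overlap, frames_per_sample

# Configuration quoted in the text: 128 ms window, 16 ms segment shift.
print(stft_parameters(128, 16))
```

For the 128 ms window this gives a bin spacing of 7.8125 Hz; a 64 ms shift corresponds to 50% overlap and two frames per sample, while a 32 ms shift corresponds to 75% overlap and four frames per sample.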
6. Conclusions
Though the use of the ENF signal has been shown to be promising for audio authentication, it has a major shortcoming: it cannot survive a high level of noise. In this paper, the challenging issue of audio content authentication under noisy conditions is investigated. Spectral phase characteristics are explored based on a spectral phase reconstruction technique. By utilizing the statistics of the spectral phase residue and the correlations between the reconstructed phases of adjacent voiced frames, we have developed an approach for audio tampering detection. The performance of the proposed method is evaluated on a set of audio recordings and compared with existing ENF-based work. Our method achieves performance comparable to the state-of-the-art methods when the audio recordings have a favorable SNR. Further, the proposed method is assessed under real-world noises. Experimental results show that our proposed method is capable of localizing audio tampering even under adverse SNR conditions, outperforming the ENF-based method in this respect, because the features presented in this work remain robust to noise. As this paper has established a method which does not rely on the traditional ENF extraction mechanism, it might offer a solution to audio authentication in noisy conditions, or serve as an alternative to the ENF-based method when no reliable ENF tracking scheme is available. Moreover, the proposed method may complement the existing ENF-based method when no explicit ENF signal is present, as with battery-powered recorders that do not directly carry an ENF signal. Our future work will focus on a more generic framework to authenticate speech signals as well as music signals, whether or not they contain the ENF component.
Acknowledgments
This work was supported in part by NSFC (Grant Nos. 61379155, U1536204, 61502547, 61332012, 61272453), in part by NSF of Guangdong province (Grant No. s2013020012788) and in part by Foundation of Fujian Educational Committee (Grant No. JAT160036).
Appendix
In this appendix, we derive the general relationship between the noisy speech and the clean speech in terms of spectral amplitudes and spectral phases. This also demonstrates why the spectral phase is less affected by external noise. First, we denote Y(n) as the noisy speech, S(n) as the clean speech and N(n) as the additive noise. Therefore, Y(n) = S(n) + N(n). By performing STFT analysis and normalizing the resulting spectrogram with respect to the DC component, we have
$$Y_{k,l}\, e^{j\phi^{Y}_{k,l}} = S_{k,l}\, e^{j\phi^{S}_{k,l}} + N_{k,l}\, e^{j\phi^{N}_{k,l}}, \tag{24}$$
$$\frac{Y}{S}\, e^{j(\phi^{Y} - \phi^{S})} = 1 + \frac{N}{S}\, e^{j(\phi^{N} - \phi^{S})}. \tag{25}$$
Applying the logarithm function and Taylor series on (25), we have
$$\ln\frac{Y}{S} + j\left(\phi^{Y} - \phi^{S}\right) = \ln\left(1 + \frac{N}{S}\, e^{j(\phi^{N} - \phi^{S})}\right) \approx \frac{N}{S}\, e^{j(\phi^{N} - \phi^{S})}. \tag{26}$$
Considering the real parts and imaginary parts separately, we have
$$\phi^{Y} - \phi^{S} \approx \frac{N}{S} \cdot \sin\left(\phi^{N} - \phi^{S}\right) \le \frac{N}{S} \tag{27}$$
and
$$\ln\frac{Y}{S} \approx \frac{N}{S} \cdot \cos\left(\phi^{N} - \phi^{S}\right) \le \frac{N}{S}. \tag{28}$$
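The first-order approximations in (27) and (28) can be checked numerically. The sketch below is an illustrative check under the assumption that N/S is small; it evaluates the right-hand side of (25) exactly with complex arithmetic and compares its phase and log-magnitude with the approximations. The function name is hypothetical.

```python
import numpy as np

def noisy_vs_clean(n_over_s, phase_diff):
    """Return exact and first-order (phase error, log-amplitude ratio) pairs.

    n_over_s   : amplitude ratio N / S (assumed small)
    phase_diff : phi^N - phi^S in radians
    """
    ratio = 1.0 + n_over_s * np.exp(1j * phase_diff)  # right-hand side of (25)
    exact_phase = float(np.angle(ratio))              # phi^Y - phi^S
    exact_log_amp = float(np.log(np.abs(ratio)))      # ln(Y / S)
    approx_phase = float(n_over_s * np.sin(phase_diff))    # approximation (27)
    approx_log_amp = float(n_over_s * np.cos(phase_diff))  # approximation (28)
    return (exact_phase, exact_log_amp), (approx_phase, approx_log_amp)

exact, approx = noisy_vs_clean(n_over_s=0.05, phase_diff=1.0)
print(exact, approx)
```

The discrepancy comes from the neglected second-order term of the Taylor series, so it shrinks quadratically as N/S decreases and grows as the noise amplitude approaches that of the speech.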
It can be easily verified that
$$\ln(Y - S) + 1 < \ln Y - \ln S, \tag{29}$$
therefore we have
$$Y - S < e^{\frac{N}{S} - 1}. \tag{30}$$
In (24), $k$ and $l$ denote the $k$-th frequency bin and the $l$-th temporal frame of the short-time Fourier transform, respectively. $S_{k,l}$, $N_{k,l}$ and $Y_{k,l}$ are the spectral amplitudes of the clean speech, noise and noisy speech, while $\phi^{S}_{k,l}$, $\phi^{N}_{k,l}$ and $\phi^{Y}_{k,l}$ are the corresponding spectral phases. For simplicity, the subscripts are omitted in (25)–(30). As $e^{\frac{N}{S} - 1} \ge \frac{N}{S}$ holds, an insight gained from (27) and (30) is that the spectral amplitude can cover a wider range of variations due to external noise; in other words, the spectral amplitude is more susceptible to external noise.
References
[1] M.C. Stamm, M. Wu, K.J.R. Liu, Information forensics: an overview of the first decade, IEEE Access 1 (1) (May 2013) 167–200.
[2] W.-H. Chuang, R. Garg, M. Wu, Anti-forensics and countermeasures of electrical network frequency analysis, IEEE Trans. Inf. Forensics Secur. 8 (12) (Dec. 2013) 2073–2088.
[3] X. Kang, M.C. Stamm, A. Peng, K.J.R. Liu, Robust median filtering forensics using an autoregressive model, IEEE Trans. Inf. Forensics Secur. 8 (9) (Sept. 2013) 1456–1468.
[4] M.C. Stamm, K.J.R. Liu, Forensic detection of image manipulation using statistical intrinsic fingerprints, IEEE Trans. Inf. Forensics Secur. 5 (3) (Sep. 2010) 492–506.
[5] R. Yang, Y.Q. Shi, J. Huang, Detecting double compression of audio signal, in: SPIE Conference on Media Forensics and Security II, SPIE, Bellingham, 2010.
[6] T. Bianchi, A. De Rosa, M. Fontani, G. Rocciolo, A. Piva, Detection and localization of double compression in MP3 audio tracks, EURASIP J. Inf. Secur. 10 (2014).
[7] R. Yang, Y.Q. Shi, J. Huang, Defeating fake-quality MP3, in: Proceedings of the 11th ACM Workshop on Multimedia and Security, MM&Sec’09, ACM, New York, 2009, pp. 117–124.
[8] D. Luo, W. Luo, R. Yang, J. Huang, Identifying compression history of wave audio and its applications, ACM Trans. Multimed. Comput. Commun. Appl. 10 (3) (Mar. 2014) 30.
[9] R. Korycki, Authenticity examination of compressed audio recordings using detection of multiple compression and encoders’ identification, Forensic Sci. Int. 238 (2014) 33–46.
[10] C. Grigoras, Applications of ENF criterion in forensics: audio, video, computer and telecommunication analysis, Forensic Sci. Int. 167 (2–3) (Apr. 2007) 136–145.
[11] R.W. Sanders, Digital authenticity using the electric network frequency, in: Proc. 33rd AES Int. Conf. Audio Forensics, Theory Practice, Jun. 2008, pp. 1–6.
[12] Y. Liu, Z. Yuan, P.N. Markham, R.W. Conners, Y. Liu, Application of power system frequency for digital audio authentication, IEEE Trans. Power Deliv. 27 (4) (Oct. 2012) 1820–1828.
[13] H. Malik, H. Farid, Audio forensics from acoustic reverberation, in: Proc. of ICASSP, Dallas, USA, Mar. 2010, pp. 1710–1713.
[14] H. Malik, H. Zhao, Recording environment identification using acoustic reverberation, in: Proc. of ICASSP, Kyoto, Japan, Mar. 2012, pp. 1833–1836.
[15] S. Milani, P.F. Piazza, P. Bestagini, S. Tubaro, Audio tampering detection using multimodal features, in: Proc. of ICASSP, Florence, Italy, May 2014, pp. 4596–4600.
[16] R. Buchholz, C. Kraetzer, J. Dittmann, Microphone classification using Fourier coefficients, in: Proc. 11th International Workshop on Information Hiding, Darmstadt, Germany, June 2009, pp. 235–246.
[17] X. Kang, R. Yang, J. Huang, Geometric invariant audio watermarking based on an LCM feature, IEEE Trans. Multimed. 13 (2) (2011) 181–190.
[18] C. Wang, T. Chen, W. Chao, A new detection method for tampered audio signals based on discrete cosine transformation, in: Soft Computing as Transdisciplinary, in: Series Advances in Soft Computing, vol. 29, 2005, pp. 1216–1225.
[19] B. Chen, G.W. Wornell, Quantization index modulation: a class of provably good methods for digital watermarking and information embedding, IEEE Trans. Inf. Theory 47 (4) (2001) 1423–1443.
[20] D.P. Nicolalde, J.A. Apolinario, L.W.P. Biscainho, Audio authenticity: detecting ENF discontinuity with high precision phase analysis, IEEE Trans. Inf. Forensics Secur. 5 (May 2010) 534–543.
[21] J. Apolinario, D. Nicolalde, Evaluating digital audio authenticity with spectral distances and ENF phase change, in: Proc. of ICASSP, Taipei, Apr. 2009, pp. 1417–1420.
[22] R. Yang, Z. Qu, J. Huang, Exposing MP3 audio forgeries using frame offsets, ACM Trans. Multimed. Comput. Commun. Appl. 8 (S2) (Sept. 2012) 35:1–35:20.
[23] X. Pan, X. Zhang, S. Lyu, Detecting splicing in digital audios using local noise level estimation, in: Proc. of ICASSP, Kyoto, Japan, Mar. 2012, pp. 1841–1844.
[24] Z. Lv, Y. Hu, C.-T. Li, B.b. Liu, Audio forensic authentication based on MOCC between ENF and reference signals, in: IEEE China Summit and International Conference on Signal and Information Processing, Beijing, China, Jul. 2013, pp. 427–431.
[25] D.P. Nicolalde, J.A. Apolinario Jr., L.W.P. Biscainho, Audio authenticity based on the discontinuity of ENF higher harmonics, in: Proceedings of the 21st European Signal Processing Conference, Marrakesh, Morocco, Sept. 2013, pp. 1–5.
[26] J.R. Chen, S.J. Xiang, Exposing digital audio forgeries in time domain by using singularity analysis with wavelets, in: Proc. of the First ACM Workshop on Information Hiding and Multimedia Security, Montpellier, France, 2013, pp. 149–158.
[27] L. Cuccovillo, S. Mann, P. Aichroth, M. Tagliasacchi, C. Dittmar, Blind microphone analysis and stable tone phase analysis for audio tampering detection, in: International AES Convention, NY, USA, Oct. 2013.
[28] P.A.A. Esquef, J.A. Apolinário Jr., L.W.P. Biscainho, Edit detection in speech recordings via instantaneous electric network frequency variations, IEEE Trans. Inf. Forensics Secur. 9 (12) (Dec. 2014) 2314–2326.
[29] P.A.A. Esquef, J.A. Apolinário Jr., L.W.P. Biscainho, Improved edit detection in speech via ENF patterns, in: IEEE International Workshop on Information Forensics and Security, 2015.
[30] O. Ojowu, J. Karlsson, J. Li, Y. Liu, ENF extraction from digital recordings using adaptive techniques and frequency tracking, IEEE Trans. Inf. Forensics Secur. 7 (4) (Aug. 2012) 1330–1338.
[31] A. Hajj-Ahmad, R. Garg, M. Wu, Spectrum combining for ENF signal estimation, IEEE Signal Process. Lett. 20 (9) (Sep. 2013) 885–888.
[32] L. Fu, P.N. Markham, R.W. Conners, Y. Liu, An improved discrete Fourier transform-based algorithm for electric network frequency extraction, IEEE Trans. Inf. Forensics Secur. 8 (7) (July 2013) 1173–1181.
[33] J.B. Allen, Short term spectral analysis, synthesis, and modification by discrete Fourier transform, IEEE Trans. Acoust. Speech Signal Process. 25 (3) (Mar. 1977) 235–238.
[34] X. Serra, Musical sound modeling with sinusoids plus noise, in: C. Roads, S. Pope, A. Piccialli, G. De Poli (Eds.), Musical Signal Processing, Swets & Zeitlinger, 1997.
[35] M. Krawczyk, T. Gerkmann, STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement, IEEE Trans. Audio Speech Lang. Process. 22 (12) (Dec. 2014) 1931–1940.
[36] http://www.lncc.br/~pesquef/TIFS2014/Carioca1_Database.zip.
[37] S. Gonzalez, M. Brookes, PEFAC - a pitch estimation algorithm robust to high levels of noise, IEEE/ACM Trans. Audio Speech Lang. Process. 22 (2) (Feb. 2014) 518–530.
Xiaodan Lin is currently working toward the Ph.D. degree with the School of Information Science and Technology, Sun Yat-Sen University, Guangzhou, China. She has been with the School of Information Science and Engineering, Huaqiao University, since 2008. Her research interests include multimedia signal processing, machine learning and audio forensics.
Xiangui Kang received the B.S. degree from Peking University, China, the M.S. degree from Nanjing University, China, and the Ph.D. degree from Sun Yat-sen University, Guangzhou, China. He is currently a professor with the School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China. He visited the Electrical and Computer Engineering Department at the University of Maryland, College Park, US, during Aug. 2011–Aug. 2012, and the Electrical and Computer Engineering Department at New Jersey Institute of Technology, US, during Aug. 2004–Sept. 2005. His research interests include multimedia information forensics, data hiding, and multimedia communications and security. He won the Best Ph.D. Dissertation Award of Guangdong Province, China, in 2005. He and his students won the Best Student Paper Award of the International Workshop on Digital-forensics and Watermarking twice, in 2008 and 2013.