Perceptual-based DWPT-DCT framework for selective blind audio watermarking


Contents lists available at ScienceDirect


Signal Processing


journal homepage: www.elsevier.com/locate/sigpro


Hwai-Tsu Hu a,*, Ling-Yuan Hsu b, Hsien-Hsin Chou a

a Department of Electronic Engineering, National I-Lan University, Yi-Lan 26041, Taiwan, ROC
b Department of Information Management, St. Mary’s Medicine, Nursing and Management College, Yi-Lan 26644, Taiwan, ROC

Article info

Article history: Received 8 September 2013; received in revised form 5 April 2014; accepted 2 May 2014

Keywords: Blind audio watermarking; Adaptive quantization index modulation; Discrete wavelet packet transform; Discrete cosine transform; Human auditory masking

Abstract

Motivated by human auditory perception, a framework jointly exploiting the discrete wavelet packet transform (DWPT) and the discrete cosine transform (DCT) is presented to perform variable-capacity blind audio watermarking without introducing perceptible distortion. To balance robustness against imperceptibility, a perceptual-based quantization index modulation technique is used for embedding the watermark bits. The effectiveness of the proposed scheme is demonstrated using the perceptual evaluation of audio quality (PEAQ) and the bit error rates of recovered watermarks under various signal processing attacks. Experimental results show that the proposed DWPT-DCT scheme is comparable to three recently developed methods in robustness, while it is the only scheme that survives the amplitude scaling attack. Moreover, to protect the watermark in a harsh environment, a back-propagation neural network has been adopted to seek suitable segments for watermark embedding. With the selective scheme in place, the embedded watermark receives significantly better protection against serious attacks. © 2014 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +886 3 9317343; fax: +886 3 9369507. E-mail address: [email protected] (H.-T. Hu).

http://dx.doi.org/10.1016/j.sigpro.2014.05.003
0165-1684/© 2014 Elsevier B.V. All rights reserved.

Please cite this article as: H.-T. Hu, et al., Perceptual-based DWPT-DCT framework for selective blind audio watermarking, Signal Processing (2014), http://dx.doi.org/10.1016/j.sigpro.2014.05.003

1. Introduction

The rapid evolution of digital technology has greatly facilitated the reproduction and manipulation of multimedia data. Easy access to the Internet makes the distribution of digital media much faster than ever before, leading to many serious copyright problems. The protection of intellectual property rights therefore remains an important issue to be resolved. One way to reinforce the protection of audio-related creation is to embed inaudible ownership information into digital audio signals. This kind of technology is called audio watermarking, and it has drawn a lot of attention in recent years.

Previous investigators explored a variety of domain representations for audio watermarking. The attempted domains include time [1–4], frequency [5–7], discrete cosine transform (DCT) [7–10], discrete wavelet transform (DWT) [8,11–13], cepstrum [14–16], and singular value decomposition (SVD) [10,17,18]. Many of the watermarking methods belong to the class of so-called quantization index modulation (QIM), which achieves provably good rate-distortion-robustness performance [19,20]. In [8], aside from employing QIM, Wang and Zhao explored the multiresolution analysis of the DWT and the energy-compaction characteristics of the DCT to achieve efficient audio watermarking. Although the combination of the DWT and DCT was very successful, the use of tentatively chosen quantization steps throughout the watermarking process was nonetheless a technical limitation.

Several other approaches are capable of achieving robust audio watermarking while keeping the embedded watermark imperceptible. By exploiting the masking properties of the human auditory system, Tsai et al. [7] selected a suitable coefficient index from a DCT-transformed block for watermarking. They also employed a neural network to perform watermark extraction. Wang et al. [21] incorporated a well-trained support vector regression into the DWT-DCT structure, where the quantization steps were designed to adapt to the audio signal. Bhat et al. [17] presented an SVD-based blind watermarking scheme operating in the DWT domain. The quantization steps used in the QIM were also adaptively determined according to the statistical properties of the involved DWT coefficients. Malik et al. [22] developed a frequency-selective spread spectrum technique to select appropriate subbands of the host audio signal for watermark embedding. This technique allowed the exploitation of frequency-masking characteristics to ensure the fidelity of the host audio signal and the robustness of the embedded information. In [23], Chen et al. presented an optimization-based group-amplitude quantization scheme to achieve a tradeoff between audio quality and robustness. In [3], Wang et al. developed a fuzzy self-adaptive digital audio watermarking method based on the time-spread echo hiding technique, in which fuzzy theory was used to control the power of the watermark. In [24], Lei et al. employed the particle swarm optimization algorithm to search for the optimal watermark strength while modifying singular values of the host audio signal with the spread spectrum method.

Although the foregoing approaches have achieved improvements in certain aspects, a clear link between transform domains and human auditory properties is still lacking. Because the audio signal varies both in time and frequency, the QIM should adapt to the signal variation so that the tradeoff among robustness, imperceptibility, and capacity can be properly resolved. One possible way to achieve this is to exploit the signal characteristics and/or the human auditory system.

In this study we propose to use the discrete wavelet packet transform (DWPT) to decompose the audio signal into critical bands according to human auditory perception, and then to apply QIM to the DCT coefficients derived from the DWPT coefficients, subject to the perceptual masking effect. Since our goal is to develop a blind watermarking technique, the most challenging part lies in retrieving the original quantization step from the watermarked signal without referring to additional information. The second concern in our watermarking scheme is to locate appropriate segments for embedding. Some schemes split the watermark into parts and embed multiple copies of these parts into different places to enhance the survival rate of the watermark. The embedded places must be identified using the frame synchronization technique before watermark recovery [8,10,17]. Despite its importance, the issue of how to select appropriate audio segments for watermarking has seldom been addressed in the literature. Here we deal with this issue from the viewpoint of the DWPT-DCT framework.

This paper is organized as follows. Section 2 briefly discusses the derivation of the auditory masking threshold, which requires the participation of the DWPT and DCT. A perceptual-based adaptive and retrievable quantization scheme is proposed in Section 3. Its evaluation is given in Section 4. In Section 5, a back-propagation neural network (BPNN) is employed to predict the number of correctly recovered bits under a severe attack. The predicted results can be used to seek suitable segments for watermarking. A configuration of bit arrangement in each segment is also suggested. Section 6 illustrates the integration of the abovementioned techniques into a watermarking system. Finally, Section 7 sums up the conclusions.

2. Derivation of auditory masking threshold

Table 1
The arrangement of DWPT decomposition.

Band number   DWPT {Depth, Index}   Approximate boundary (Hz)
-             -                     0
1             {8,0}                 86
2             {8,1}                 172
3             {8,3}                 258
4             {8,2}                 345
5             {8,7}                 431
6             {8,6}                 517
7             {7,2}                 689
8             {7,7}                 861
9             {7,6}                 1034
10            {7,4}                 1206
11            {7,5}                 1378
12            {6,7}                 1723
13            {6,6}                 2067
14            {6,4}                 2412
15            {6,5}                 2756
16            {5,7}                 3445
17            {5,6}                 4134
18            {5,4}                 4823
19            {5,5}                 5513
20            {4,7}                 6891
21            {4,6}                 8269
22            {3,7}                 11,025
23            {3,6}                 13,781
24            {3,2}                 16,538
25            {3,4}                 19,294
26            {3,5}                 22,050

Stemming from the noise masking measures [25–27], this study derives an auditory masking threshold for each critical band in the DWPT domain. The procedures for deriving the spectral masking threshold are briefly summarized as follows:

Partition the host audio signal $s(n)$ into frames, each 4096 samples in length. The number 4096 is chosen to meet the requirements of the subsequent analysis in the spectral domain.

1. Use the DWPT to decompose the audio signal into 26 critical bands according to the specifications in Table 1, in which the approximate frequency range for each band is given for reference. The Daubechies-8 basis is used as the wavelet function. Let $w_{i,(n)}$ denote the $i$th DWPT coefficient in the $n$th band, which has a length of $N_{(n)}$.
2. Analyze the spectral content for each band in the DWPT using the DCT, i.e. $c_{i,(n)} = \mathrm{DCT}\{w_{i,(n)}\}$.
3. Derive the tonality factor $\tau$ from the DCT coefficients:

$$\tau = \min\left\{ \frac{10\log_{10}\left( \mathrm{PM}_g\left(c^2_{i,(n)}\right) / \mathrm{PM}_a\left(c^2_{i,(n)}\right) \right)}{-25},\ 1 \right\}, \tag{1}$$


where $\mathrm{PM}_g\left(c^2_{i,(n)}\right)$ and $\mathrm{PM}_a\left(c^2_{i,(n)}\right)$ stand for the geometric and arithmetic means of $c^2_{i,(n)}$, respectively.

4. Adjust the masking level $D_z(n)$ according to the tonality factor:

$$D_z(n) = 10^{a(n)/10} \left( \frac{1}{N_{(n)}} \sum_{i=0}^{N_{(n)}-1} c^2_{i,(n)} \right), \tag{2}$$

where $a(n)$ signifies the permissible noise floor relative to the signal in the $n$th band, and it is formulated as

$$a(n) = \tau\,(-0.275n - 15.025) + (1-\tau)(-9.0) \quad [\mathrm{dB}]. \tag{3}$$

5. Extend the masking effect to adjacent bands by convolving the adjusted masking level with a spreading function $SF(n)$, namely $C_z(n) = D_z(n) \ast 10^{SF(n)/10}$. $SF(n)$ is defined as

$$SF(n) = p + \frac{u+v}{2}(n+y) - \frac{v-u}{2}\sqrt{h + (n+y)^2} \quad [\mathrm{dB}], \tag{4}$$

where $p = 15.242$, $y = 0.15$, $h = 0.3$, $u = -25$ and $v = 30$. This study adopts $C_z(n)$ as the masking threshold, which represents the power level not detectable by human ears in the $n$th critical band.
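The per-band computations of Eqs. (1)–(4) can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the DCT coefficients of each band are already available, the function names are hypothetical, the small floor inside `tonality` is an added guard against taking the logarithm of zero, and the convolution over bands is read here as $C_z(n) = \sum_m D_z(m)\,10^{SF(n-m)/10}$, which is one plausible interpretation of step 5.

```python
import numpy as np

def tonality(c):
    """Eq. (1): tonality factor from the DCT coefficients of one band."""
    p = np.maximum(np.asarray(c, float) ** 2, 1e-30)  # floor is an assumption (avoids log 0)
    # 10*log10(geometric mean / arithmetic mean) of the power spectrum
    sfm_db = 10.0 * (np.mean(np.log10(p)) - np.log10(np.mean(p)))
    return min(sfm_db / -25.0, 1.0)

def noise_floor_db(tau, n):
    """Eq. (3): permissible noise floor a(n) in dB for band n."""
    return tau * (-0.275 * n - 15.025) + (1.0 - tau) * (-9.0)

def masking_level(c, n):
    """Eq. (2): masking level D_z(n) before spreading."""
    c = np.asarray(c, float)
    return 10.0 ** (noise_floor_db(tonality(c), n) / 10.0) * np.mean(c ** 2)

def spreading_db(n, p=15.242, y=0.15, h=0.3, u=-25.0, v=30.0):
    """Eq. (4): spreading function SF(n) in dB."""
    x = np.asarray(n, float) + y
    return p + 0.5 * (u + v) * x - 0.5 * (v - u) * np.sqrt(h + x ** 2)

def masking_threshold(d):
    """Step 5 (assumed form): spread the levels D_z over neighbouring bands."""
    d = np.asarray(d, float)
    idx = np.arange(d.size)
    kernel = 10.0 ** (spreading_db(idx[:, None] - idx[None, :]) / 10.0)
    return kernel @ d
```

With the listed constants the spreading function is approximately 0 dB at $n = 0$ and falls off with slopes close to $v = 30$ dB and $u = -25$ dB per band on the two sides, so each band mainly masks itself and its nearest neighbours.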


3. Perceptually adaptive and retrievable QIM

Subsequent to the DWPT and DCT specified in Section 2, a watermark bit $W_m$ can be embedded into the $k$th DCT coefficient in the $K$th band, termed $c_{k,(K)}$, using the QIM

$$\hat{c}_{k,(K)} = \begin{cases} \left\lfloor \dfrac{c_{k,(K)}}{\Delta_{(K)}} + \dfrac{1}{4} \right\rfloor \Delta_{(K)} + \dfrac{\Delta_{(K)}}{4} & \text{if } W_m = 0, \\[2ex] \left\lfloor \dfrac{c_{k,(K)}}{\Delta_{(K)}} + \dfrac{3}{4} \right\rfloor \Delta_{(K)} - \dfrac{\Delta_{(K)}}{4} & \text{if } W_m = 1, \end{cases} \tag{5}$$

where $\Delta_{(K)}$ denotes the quantization step size. The index $k$ is arbitrarily selected and can be used as a key in the watermarking process. We note that multiple bits can be embedded in a single band as long as the power variation stays under the masking threshold $C_z(K)$. The embedded positions of the watermark bits can be anywhere across the DCT coefficients. This study embeds the watermark bits in adjacent coefficients so that the involved coefficients can be regarded as a single frequency bin. One immediate benefit of this arrangement is that the energy distortion of each frequency bin is more manageable. Without loss of generality, let us assume that there are $B$ bits to be embedded into the first $B$ coefficients in the $K$th band and that $N_{(K)}$ is divisible by $B$. The number of frequency bins is therefore $L = N_{(K)}/B$. To embed watermark bits without causing noticeable distortion, the quantization step $\Delta_{(K)}$ can be chosen as

$$\Delta_{(K)} = \gamma\sqrt{C_z(K)\,N_{(K)}/B} = \gamma\sqrt{C_z(K)\,L}, \tag{6}$$

where $\gamma$ serves as a scaling factor to tune the embedding strength. Because a minor change of $\gamma$ (e.g., within −15% to +15%) will lead to a dramatic alteration in the quantized values, the variable $\gamma$ can be used as a secret key to provide extra protection. Nevertheless, $\gamma$ is set to unity, i.e. $\Delta_{(K)} = \sqrt{C_z(K)\,L}$, for the sake of simplicity in this study. The rationale of such a formulation is that the distortion due to the quantization of each coefficient is restricted within $C_z(K)L$ and the overall distortion in the $K$th critical band shall not exceed $C_z(K)N_{(K)}$, which is the exact energy level of the masking threshold.

If we take one step further and define the distortion as the energy change in each frequency bin, the variations of the first $B$ coefficients can be regarded as a whole and more considerations can be taken into account. To be more specific, each DCT coefficient is allowed to have two possible options $\{\lambda_{k,1}, \lambda_{k,2}\}$:

$$\{\lambda_{k,1}, \lambda_{k,2}\} = \begin{cases} \left\{ \hat{c}_{k,(K)},\ \hat{c}_{k,(K)} + \Delta_{(K)} \right\} & \text{if } \hat{c}_{k,(K)} < c_{k,(K)}, \\[1ex] \left\{ \hat{c}_{k,(K)},\ \hat{c}_{k,(K)} - \Delta_{(K)} \right\} & \text{if } \hat{c}_{k,(K)} \ge c_{k,(K)}. \end{cases} \tag{7}$$

The combination that produces the minimum energy deviation of the frequency bin is chosen as the final QIM result, i.e.

$$\breve{c}_{k,(K)} = \lambda_{k,I_k}, \quad k = 0, \ldots, B-1, \tag{8}$$

with

$$\{I_0, \ldots, I_k, \ldots, I_{B-1}\} = \arg\min_{\{i_0, \ldots, i_k, \ldots, i_{B-1}\}} \left| \sum_{k=0}^{B-1} \lambda^2_{k,i_k} - \sum_{k=0}^{B-1} c^2_{k,(K)} \right|, \quad i_k \in \{1, 2\}. \tag{9}$$

Here a brute-force approach is employed to obtain the $I_k$'s by searching all possible combinations of the $\lambda_{k,i_k}$'s. Consequently, the energy of this particular frequency bin becomes

$$\breve{s}^2_0 = \sum_{k=0}^{B-1} \breve{c}^2_{k,(K)}, \tag{10}$$

and the energy values in the other bins are

$$s^2_i = \sum_{j=iB}^{(i+1)B-1} c^2_{j,(K)}, \quad i = 1, 2, \ldots, L-1. \tag{11}$$

In general, the watermark embedding process only requires the modification of designated coefficients, such as $\breve{c}_{k,(K)}$ in our case. However, the situation becomes more complicated because the embedding process also involves the quantization step $\Delta_{(K)}$, which is adaptively determined from all of the DCT coefficients. Modifying $\breve{c}_{k,(K)}$ will in turn affect $\Delta_{(K)}$ as well. As mentioned in Section 2, $\Delta_{(K)}$ is directly related to the masking threshold, and the derivation of the masking threshold requires the tonality factor, which includes the geometric and arithmetic means of the power spectral coefficients. It turns out that $\Delta_{(K)}$ is retrievable only if these two means remain intact throughout the watermarking process. Let $Q$ denote the spectral flatness measure, defined as the ratio between the geometric and arithmetic means of the $s^2_i$'s collected from the $K$th critical band:

$$\frac{\left( \prod_{i=0}^{L-1} s^2_i \right)^{1/L}}{\frac{1}{L} \sum_{i=0}^{L-1} s^2_i} = Q. \tag{12}$$
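The embedding rule of Eq. (5) and the candidate search of Eqs. (7)–(9) can be illustrated with a short sketch. It is only a sketch under stated assumptions: the function names are hypothetical, the step `delta` is passed in as a plain number instead of being derived from the masking threshold, and the brute-force search simply enumerates all $2^B$ candidate combinations, which is practical for the small $B$ (2 or 4) used in this study. The extraction helper anticipates the decision rule that the decoder applies to the residue of the coefficient modulo the step.

```python
import itertools
import numpy as np

def qim_embed(c, bit, delta):
    """Eq. (5): move c to the nearest lattice point that encodes `bit`."""
    if bit == 0:
        return np.floor(c / delta + 0.25) * delta + delta / 4.0
    return np.floor(c / delta + 0.75) * delta - delta / 4.0

def qim_extract(c, delta):
    """Decide the bit from the residue of c modulo delta."""
    return int(c - np.floor(c / delta) * delta >= 0.5 * delta)

def embed_bin(coeffs, bits, delta):
    """Eqs. (7)-(9): choose, per coefficient, one of two valid lattice points
    so that the energy of the frequency bin changes as little as possible."""
    coeffs = np.asarray(coeffs, float)
    options = []
    for c, b in zip(coeffs, bits):
        q = qim_embed(c, b, delta)
        alt = q + delta if q < c else q - delta   # second option on the other side of c
        options.append((q, alt))
    target = np.sum(coeffs ** 2)
    best = min(itertools.product(*options),
               key=lambda cand: abs(sum(x * x for x in cand) - target))
    return np.array(best)
```

Both options for a coefficient carry the same bit because they differ by exactly one step, so the search trades a slightly larger per-coefficient error for a smaller change in bin energy.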

To ensure an exact retrieval of $\Delta_{(K)}$ from the watermarked signal, both the signal power (the denominator of the fraction on the left-hand side of Eq. (12)) and the ratio $Q$ drawn from the $s^2_i$'s must remain unchanged as the embedding process proceeds. An iterative algorithm originally presented in [28] has been modified to adjust the $s^2_i$'s so that these requirements are met. The algorithm starts with an initial setup of the signal power:

$$\breve{s}^2_{i,0} = \begin{cases} s^2_i \left( \dfrac{\sum_{j=0}^{L-1} s^2_j - \breve{s}^2_0}{\sum_{j=1}^{L-1} s^2_j} \right) & \text{for } 1 \le i < L, \\[2ex] \breve{s}^2_0 & \text{for } i = 0, \end{cases} \tag{13}$$

where the notation $\breve{s}^2_{i,0}$ represents the modified energy of the $i$th frequency bin at iteration 0. The purpose of Eq. (13) is to maintain the original signal power. As the arithmetic mean of the $\breve{s}^2_{i,0}$'s remains constant, our focus is on the alteration of their geometric mean. The following step is the actual entry of the main iteration cycle. Inside each iteration, say $t$, the algorithm calculates the mean of all $\log \breve{s}^2_{i,t}$'s with $\log \breve{s}^2_{0,t}$ excluded. It can be deduced from Eq. (12) that the original $Q$ is still held if the mean computed from $\{\log \breve{s}^2_{i,t} \mid 1 \le i < L\}$ remains a constant $\eta$:

$$\frac{1}{L-1} \sum_{i=1}^{L-1} \log \breve{s}^2_{i,t} = \frac{1}{L-1} \left( L \log\left( \frac{Q}{L} \sum_{i=0}^{L-1} s^2_i \right) - \log \breve{s}^2_{0,0} \right) = \eta. \tag{14}$$

Hence the proposed iterative algorithm categorizes the $\log \breve{s}^2_{i,t}$'s into two groups, termed $I^+_t$ and $I^-_t$:

$$I^+_t = \left\{ i \mid \log \breve{s}^2_{i,t} \ge \eta,\ 1 \le i < L \right\}, \tag{15}$$

$$I^-_t = \left\{ i \mid \log \breve{s}^2_{i,t} < \eta,\ 1 \le i < L \right\}. \tag{16}$$

For each $\log \breve{s}^2_{i,t}$ in group $I^-_t$, its value is adjusted by

$$\log \tilde{s}^2_{i,t} = \left( \log \breve{s}^2_{i,t} - \eta \right) \frac{\sum_{j \in I^+_t} \left( \log \breve{s}^2_{j,t} - \eta \right)}{\sum_{j \in I^-_t} \left( \eta - \log \breve{s}^2_{j,t} \right)} + \eta \quad \text{for } i \in I^-_t. \tag{17}$$

On the other hand, the $\log \breve{s}^2_{i,t}$'s in group $I^+_t$ remain unchanged. The rationale behind altering the values in $I^-_t$ rather than those in $I^+_t$ is that it reduces the overall energy variation. After removing the logarithm by exponentiation, the resultant energy in each frequency bin is

$$\hat{s}^2_{i,t} = \begin{cases} \breve{s}^2_{0,0} & \text{if } i = 0, \\ \tilde{s}^2_{i,t} & \text{if } i \in I^-_t, \\ \breve{s}^2_{i,t} & \text{if } i \in I^+_t. \end{cases} \tag{18}$$

Although Eq. (18) allows the mean of the $\log \hat{s}^2_{i,t}$'s to reach the constant $\eta$ in Eq. (14), it leads to a minor change of signal power. Consequently, $\hat{s}^2_{i,t}$ must be readjusted to restore the signal power by

$$\breve{s}^2_{i,t+1} = \begin{cases} \breve{s}^2_{0,0} & \text{if } i = 0, \\[1ex] \hat{s}^2_{i,t}\, \dfrac{\sum_{j=1}^{L-1} \breve{s}^2_{j,0}}{\sum_{j=1}^{L-1} \hat{s}^2_{j,t}} & \text{if } i \ne 0. \end{cases} \tag{19}$$

The above procedures, i.e. from Eq. (15) to Eq. (19), iterate until the following criterion is met or the iteration number exceeds 15:

$$\left| \frac{\sum_{i=1}^{L-1} \breve{s}^2_{i,0}}{\sum_{i=1}^{L-1} \hat{s}^2_{i,t}} - 1 \right| < 10^{-5}. \tag{20}$$

After obtaining $\{\log \breve{s}^2_{i,T} \mid 1 \le i < L\}$ at the end of the terminating iteration $T$, the final watermarked DCT coefficients are of the form

$$\breve{c}_{i,(K)} = c_{i,(K)} \sqrt{\frac{\breve{s}^2_{\lfloor i/B \rfloor, T}}{s^2_{\lfloor i/B \rfloor}}}, \quad i = B, \ldots, N_{(K)} - 1, \tag{21}$$

where $\lfloor \cdot \rfloor$ stands for the floor function. Taking the inverse DCT of $\breve{c}_{i,(K)}$ yields the DWPT sequence

$$\breve{w}_{i,(K)} = \mathrm{IDCT}\{\breve{c}_{i,(K)}\}. \tag{22}$$

With all DWPT sequences available, the watermarked audio signal $\breve{s}(i)$ is obtained by taking the inverse DWPT of $\breve{w}_{i,(K)}$:

$$\breve{s}(i) = \mathrm{IDWPT}\{\breve{w}_{i,(K)}\}. \tag{23}$$

While implementing the foregoing algorithm, two steps require additional consideration. One is the special case in Eq. (12) when some $s^2_i$ is zero. In such a case, $Q$ becomes zero and Eq. (14) can no longer hold numerically. Though this case rarely happens in practice, a precautionary measure is adopted to avoid the trouble. During the computation of the geometric mean, we substitute $s^2_i$ with a small non-zero value, namely $\xi = \frac{10^{-4}}{L} \sum_{i=0}^{L-1} s^2_i$, whenever $s^2_i < \xi$. To be more specific, the actual $Q$ is computed as

$$Q = \frac{\left( \prod_{i=0}^{L-1} \max\left\{ s^2_i, \xi \right\} \right)^{1/L}}{\frac{1}{L} \sum_{i=0}^{L-1} s^2_i}. \tag{24}$$

The second consideration regards the existence of possible solutions for $\{\breve{s}^2_i \mid i = 1, \ldots, L-1\}$ while performing the iteration in Eqs. (15)–(19). According to Eq. (14), a solution exists only when

$$\frac{1}{L-1} \sum_{i=1}^{L-1} \log \breve{s}^2_i = \eta. \tag{25}$$

Taking the exponential on both sides gives

$$\prod_{i=1}^{L-1} \breve{s}^2_i = e^{\eta(L-1)}. \tag{26}$$

Based on the inequality of arithmetic and geometric means, it can be shown that

$$e^{\eta} = \left( \prod_{i=1}^{L-1} \breve{s}^2_i \right)^{1/(L-1)} \le \frac{1}{L-1} \sum_{i=1}^{L-1} \breve{s}^2_i = \frac{1}{L-1} \left( \sum_{i=0}^{L-1} s^2_i - \breve{s}^2_0 \right). \tag{27}$$


As a result,

$$\breve{s}^2_0 \ge \sum_{i=0}^{L-1} s^2_i - (L-1)e^{\eta}. \tag{28}$$

The above inequality indicates that the frequency bin shall possess an energy level no less than $\sum_{i=0}^{L-1} s^2_i - (L-1)e^{\eta}$. There will be no solution for $\{\breve{s}^2_i \mid i = 1, \ldots, L-1\}$ if the above inequality does not hold. Our remedy is to impose a constraint on Eq. (9) such that

$$\{I_0, \ldots, I_k, \ldots, I_{B-1}\} = \arg\min_{\{i_0, \ldots, i_k, \ldots, i_{B-1}\}} \left| \sum_{k=0}^{B-1} \lambda^2_{k,i_k} - \sum_{k=0}^{B-1} c^2_{k,(K)} \right|, \quad i_k \in \{1, 2\},$$

subject to

$$\sum_{k=0}^{B-1} \lambda^2_{k,i_k} \ge \sum_{i=0}^{N_{(K)}-1} c^2_{i,(K)} - (L-1)e^{\eta}. \tag{29}$$

Fig. 1 presents the procedural flow of the embedding process of the “DWPT-DCT + perceptually adaptive and retrievable QIM” scheme. The watermark extraction follows the same procedures. After the DWPT and DCT, the quantization step $\bar{\Delta}_{(K)}$ is acquired using the auditory masking threshold described in Section 2. The bit $\bar{W}_b$ residing in the $K$th band of the DWPT can be acquired by

$$\bar{W}_b = \begin{cases} 1 & \text{if } \bar{c}_{k,(K)} - \left\lfloor \bar{c}_{k,(K)} / \bar{\Delta}_{(K)} \right\rfloor \bar{\Delta}_{(K)} \ge 0.5\,\bar{\Delta}_{(K)}, \\ 0 & \text{otherwise}, \end{cases} \tag{30}$$

where $\bar{c}_{k,(K)}$ denotes the designated DCT coefficient extracted from the watermarked signal.
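The iterative adjustment of Eqs. (13)–(20) can be sketched as follows. This is an illustrative reading rather than the authors' code: the function names are hypothetical, the initial rescaling is taken to be the one that keeps the total power unchanged (as the text states is the purpose of Eq. (13)), all bin energies are assumed to be strictly positive, and the feasibility constraint of Eq. (29) is assumed to have been satisfied already.

```python
import numpy as np

def flatness(s2):
    """Spectral flatness Q of Eq. (12): geometric mean over arithmetic mean."""
    s2 = np.asarray(s2, float)
    return float(np.exp(np.mean(np.log(s2))) / np.mean(s2))

def restore_flatness(s2, s0_new, max_iter=15, tol=1e-5):
    """Adjust bins 1..L-1 so that total power and Q survive a change of bin 0."""
    s2 = np.asarray(s2, float)
    L = s2.size
    rest_power = s2.sum() - s0_new                 # power left for bins 1..L-1
    # Eq. (14): required mean log-energy of bins 1..L-1
    eta = (np.sum(np.log(s2)) - np.log(s0_new)) / (L - 1)
    cur = s2[1:] * rest_power / s2[1:].sum()       # Eq. (13), power-preserving form (assumed)
    for _ in range(max_iter):
        logs = np.log(cur)
        low = logs < eta                           # group I^-; the rest is I^+
        surplus = np.sum(logs[~low] - eta)
        deficit = np.sum(eta - logs[low])
        if deficit > 0:
            logs[low] = (logs[low] - eta) * surplus / deficit + eta   # Eq. (17)
        hat = np.exp(logs)                         # Eq. (18)
        ratio = rest_power / hat.sum()
        cur = hat * ratio                          # Eq. (19): restore the power
        if abs(ratio - 1.0) < tol:                 # Eq. (20)
            break
    return np.concatenate(([s0_new], cur))
```

For example, with bin energies `[4.0, 2.0, 8.0]` and a new first-bin energy of `5.0`, the returned energies keep the total at 14 and reproduce the original flatness to within the stopping tolerance, so a decoder that recomputes the tonality factor from the watermarked band would obtain the same quantization step.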

Fig. 1. Embedding procedures for the proposed “DWPT-DCT + perceptual-based QIM” scheme. (Block labels: Audio signal; Locate proper segments; DWPT; DCT; Derive auditory threshold for each critical band; Watermark; Scramble; Key 1; Key 2; Embedding using adaptive QIM; Inverse DCT; Inverse DWPT; Watermarked audio signal.)

4. Performance evaluation

The proposed scheme is compared in capacity, imperceptibility and robustness with three recently developed methods, which are named in contracted form as DWT-SVD [17], LWT-SVD [18] and DWT-norm [13]. These three methods were chosen not only because they perform the watermarking process using the QIM in transformed domains but also because they attempt to optimize robustness and imperceptibility by adjusting the quantization steps. In the DWT-SVD, the minimum and maximum quantization steps were 0.35 and 0.65, respectively. For the LWT-SVD method, the decomposition level of the lifting wavelet transform was 3 and the quantization step size was chosen as 0.25 to render an adequate signal-to-noise ratio (SNR). In the DWT-norm, the variables α1 and α2 used to adaptively control the quantization steps were assigned as 0.4 and 1 respectively, and the variable “attack_SNR” was set as 30 dB.

The test subjects comprised 15 popular music recordings picked from Jamendo [29], which is a free music archive accessible from the Internet. Table 2 lists the titles, performing artists and genres for the 15 recordings. For each recording, a 40-s interval with loud volume was clipped for the test. Our intent here was to include various music types in the evaluation. All music recordings were originally encoded in the MP3 format at around 192 kbps and were converted to 44.1 kHz audio samples with 16-bit resolution.

Table 2
Music recordings picked from the Jamendo website.

No   Title                                              Artist                      Genres and instruments
1    Valerio                                            Kolokon                     Blues, World, Rock
2    Brahms – 25 Variations and Fugue. Var. XVI–XVII    Pierre Feraux               Classical, Piano
3    L’ Eternité du Silence                             Space Galaxy                Classical, Piano
4    The Courier Arrives at Eskank                      Arnaud Condé                Classical, Instrumental, Ambient, Violin, Videogame, Background
5    Honky Tonk Truth                                   David Certano               Country, Rock, Blues
6    Jack and Jill                                      Raulin de los Bosques       Country, Acoustic, Instrumental, Adventure
7    Tripett                                            KRUSTY                      Disco, Rock, Experimental, Bass
8    Electronic Noise Controller – Pneuma               Arena of Electronic Music   Electronic, Techno
9    Rêve en Grêve                                      Bruce et Guérin             Folk, Piano, Banjo
10   In the JazzClub by Dave Imbernon                   Jazz Friends                Jazz, Blues, Fusion
11   Zodiac Virtues                                     Diablo Swing Orchestra      Metal, Jazz, Rock
12   La Barca de Sua                                    La Barca De Sua             Pop, Latin
13   Quanta                                             Delano                      Pop, Saxophone, Smooth Jazz
14   Capella                                            JT Bruce                    Rock, Progressive, Metal
15   Hope                                               Giorgio Campagnano          Soundtrack, Violin, Piano

The watermark bits were a series of alternating 1's and 0's long enough to cover the entire host signal. This is equivalent to assuming that the watermark consists of equal numbers of 0's and 1's and that the entire audio signal can be used to embed watermark bits. Such an arrangement is particularly useful for a fair comparison of watermarking methods with different payload capacities.

As discussed at the beginning of Section 3, multiple bits can be embedded into a band under the DWPT-DCT framework. Because most of the signal energy is concentrated in low-frequency components, we examine the effects of embedding watermark bits into either the first 7 or the first 11 critical bands. These two sets of bands roughly correspond to the frequency components below 0.69 kHz and 1.38 kHz, respectively. For each band, we tentatively embed either 2 or 4 bits into the targeted DCT coefficients derived from the DWPT coefficients. Consequently, there are four combinations in our investigation. The payload capacity $C$ can be calculated as

$$C = \frac{f_s}{4096} \times N_{\mathrm{freq\_band}} \times B, \tag{31}$$

where $f_s$ is the sampling rate and $N_{\mathrm{freq\_band}}$ denotes the number of critical bands used to embed watermark bits. Thus the capacity $C$ ranges from 150.73 bps for the (7 bands × 2 bits) setting to 473.73 bps for the (11 bands × 4 bits) setting.

The quality of the watermarked audio signals is evaluated using the SNR along with the perceptual evaluation of audio quality (PEAQ) [30]. The SNR defined in Eq. (32) compares the energy of the original audio signal with that of the variation due to watermarking over a total length of $N$ samples:

$$\mathrm{SNR} = 10\log_{10}\left( \frac{\sum_{n=0}^{N-1} s^2(n)}{\sum_{n=0}^{N-1} \left( \breve{s}(n) - s(n) \right)^2} \right). \tag{32}$$
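The payload figures quoted for the four settings follow directly from Eq. (31), and the SNR of Eq. (32) is a one-line computation. The sketch below uses hypothetical function names and assumes the 44.1 kHz sampling rate and 4096-sample frames stated in the text.

```python
import numpy as np

def payload_capacity(fs, n_bands, bits_per_band, frame_len=4096):
    """Eq. (31): bits per second when every frame carries n_bands * bits_per_band bits."""
    return fs / frame_len * n_bands * bits_per_band

def snr_db(original, watermarked):
    """Eq. (32): energy of the original over energy of the embedding distortion, in dB."""
    original = np.asarray(original, float)
    diff = np.asarray(watermarked, float) - original
    return 10.0 * np.log10(np.sum(original ** 2) / np.sum(diff ** 2))

for bands, bits in [(7, 2), (11, 2), (7, 4), (11, 4)]:
    print(bands, bits, round(payload_capacity(44100, bands, bits), 2))
```

Each frame lasts 4096/44100 s, so about 10.77 frames are processed per second; multiplying by the number of bits per frame gives the payload values listed in Table 3 for the DWPT-DCT settings.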

The PEAQ algorithm used in this study is an implementation released by the TSP Lab of McGill University [31]. It renders an objective difference grade (ODG) between −4 and 0, signifying a perceptual impression ranging from “very annoying” to “imperceptible”. Because the final result is derived from an artificial neural network that simulates the human auditory system, the PEAQ may render a value higher than 0. Table 3 provides the average SNRs and ODGs for the watermarked audio signals obtained by the various schemes, each associated with its payload capacity.

Table 3
Audio quality assessment of the DWT-SVD, LWT-SVD, DWT-norm and proposed DWPT-DCT schemes. The subscript (m × n) along with the DWPT-DCT indicates that n bits are embedded in each of the first m critical bands. The data in each cell of the second and third columns are interpreted as mean [± standard deviation].

Method               SNR (dB)            ODG                 Payload capacity (bps)
DWT-SVD              25.224 [±1.892]     −1.574 [±1.473]     45.56
LWT-SVD              23.390 [±2.320]     −0.904 [±1.288]     170.67
DWT-norm             24.384 [±2.110]     −1.027 [±1.292]     102.4
DWPT-DCT (7 × 2)     20.959 [±0.723]     0.070 [±0.064]      150.73
DWPT-DCT (11 × 2)    20.558 [±0.549]     0.007 [±0.107]      236.87
DWPT-DCT (7 × 4)     19.377 [±0.772]     0.026 [±0.085]      301.46
DWPT-DCT (11 × 4)    19.031 [±0.592]     −0.055 [±0.132]     473.73

As shown in Table 3, the average SNRs for the DWPT-DCT are just above 20 dB when each band contains two watermark bits. In general, the SNR declines when more critical bands and/or more bits are used. However, the average ODGs resulting from the four settings of the DWPT-DCT scheme are very close to 0, indicating that the watermarked signals are perceptually indistinguishable from the original ones. Even in the extreme case of the (11 bands × 4 bits) setting, the average ODG is still −0.055, even though the corresponding SNR has already dropped below 20 dB. Table 3 also shows that, although both the LWT-SVD and DWT-norm offer acceptable imperceptibility, their ODG scores are clearly inferior to those of the proposed DWPT-DCT scheme. On the other hand, the DWT-SVD approach achieves the highest average SNR, but the resulting average ODG is −1.574, implying that the watermark inserted using the DWT-SVD may still be audible and cause quality degradation on some occasions.

As for the evaluation of robustness, this study assesses the bit error rate (BER) between the original watermark $W$ and the recovered watermark $\tilde{W}$:

$$\mathrm{BER}(W, \tilde{W}) = \frac{1}{M} \sum_{m=0}^{M-1} W(m) \oplus \tilde{W}(m), \tag{33}$$

where $\oplus$ stands for the exclusive-or operator and $M$ is the length of the watermark bit sequence. The attacks considered in this study are as follows:

(A) Resampling: conducting down-sampling to 22,050 Hz and then up-sampling back to 44,100 Hz.
(B) Requantization: quantizing the watermarked signal to 8 bits/sample and then back to 16 bits/sample.
(C) Amplitude scaling: scaling the amplitude of the watermarked signal by 0.85.
(D) Noise corruption: adding zero-mean white Gaussian noise to the watermarked audio signal with SNR = 30 dB.
(E) Noise corruption: adding zero-mean white Gaussian noise to the watermarked audio signal with SNR = 20 dB.
(F) Lowpass filtering: applying a lowpass filter with a cutoff frequency of 4 kHz.
(G) Highpass filtering: applying a highpass filter with a cutoff frequency of 4 kHz.
(H) Extremely lowpass filtering: applying a lowpass filter with a cutoff frequency of 400 Hz.
(I) Echo addition: adding an echo signal with a delay of 50 ms and a decay to 5% to the watermarked audio signal.
(J) Jittering: randomly deleting or adding one sample for every 100 samples within each frame.
(K) 128 kbps MP3 compression: compressing and decompressing the watermarked audio signal with an MPEG Layer III coder at a bit rate of 128 kbps. The MP3 audio files were processed under Matlab, with the LAME encoder performing the actual encoding [32].
(L) 64 kbps MP3 compression: compressing and decompressing the watermarked audio signal with an MPEG Layer III coder at a bit rate of 64 kbps.

Table 4
Bit error rates of watermark bits (in percentage) for four methods under various attacks.

Attack type   DWT-SVD   LWT-SVD   DWT-norm   DWPT-DCT (7 × 2)   DWPT-DCT (11 × 2)   DWPT-DCT (7 × 4)   DWPT-DCT (11 × 4)
None          0.00      0.00      0.00       0.00               0.00                0.00               0.00
A             0.02      0.00      0.00       0.00               0.00                0.00               0.00
B             0.00      0.00      0.00       0.00               0.09                0.02               0.14
C             38.10     67.54     60.57      0.00               0.00                0.00               0.00
D             0.00      0.00      0.00       0.06               0.26                0.08               0.40
E             0.46      0.00      0.00       0.79               2.06                1.58               3.54
F             4.65      0.00      0.83       0.00               0.00                0.00               0.00
G             52.54     42.25     50.15      20.79              26.70               22.74              28.39
H             50.03     50.06     50.02      19.28              23.99               22.77              28.02
I             3.87      4.97      2.29       0.45               0.32                0.99               0.79
J             0.00      0.11      0.00       0.29               1.10                0.75               2.81
K             0.00      0.00      0.00       0.00               0.02                0.01               0.05
L             1.36      3.08      1.02       1.08               2.06                2.60               4.53
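The bit error rate of Eq. (33) and two of the simpler attacks, amplitude scaling (C) and white-noise corruption (D, E), can be sketched as below. The function names are hypothetical, and the noise attack assumes that the target SNR is measured against the power of the watermarked signal; the remaining attacks (resampling, filtering, MP3 coding) depend on tool settings that are not reproduced here.

```python
import numpy as np

def bit_error_rate(w, w_rec):
    """Eq. (33): fraction of positions where the recovered bit differs (XOR)."""
    w = np.asarray(w, dtype=int)
    w_rec = np.asarray(w_rec, dtype=int)
    return float(np.mean(w ^ w_rec))

def amplitude_scale(x, factor=0.85):
    """Attack (C): scale every sample by a constant factor."""
    return factor * np.asarray(x, float)

def add_white_noise(x, snr_db, rng):
    """Attacks (D)/(E): add zero-mean white Gaussian noise at a target SNR in dB."""
    x = np.asarray(x, float)
    noise_power = np.mean(x ** 2) / 10.0 ** (snr_db / 10.0)
    return x + rng.normal(0.0, np.sqrt(noise_power), x.shape)
```

Multiplying the returned rate by 100 gives the percentages reported in Table 4.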


For the cases of F, G and H, zero-phase Chebyshev Type II filters were employed to provide a flat passband and a sharp transition in the frequency response. For all three filters, the order was 10 and the stopband attenuation was 20 dB. Because all the methods under comparison tend to embed watermark information into low-frequency subbands, the highpass filtering in case G and the extremely lowpass filtering in case H are expected to cause more harm to the embedded watermark.

The results shown in Table 4 confirm that the proposed scheme is resistant to most of the attacks considered in the test. Owing to the successful retrieval of the quantization steps, a 100% recovery of the watermark bits is observed when no attack is present. In most attacks, the proposed scheme with the (7 bands × 2 bits) setting demonstrates robustness comparable to that of the other three methods. However, a further inspection reveals that the proposed scheme is somewhat sensitive to noise corruption and jittering. In contrast, the DWT-norm exhibits strong robustness against these two types of attacks. The reason can be ascribed to the fact that the DWT-norm computes the vector norm over a wide range of approximation DWT coefficients with a broader spectrum, thus abating the influence of noise and jittering.

For the DWT-SVD, LWT-SVD and DWT-norm methods, the amplitude scaling attack can easily mangle the embedded watermarks. This is because the formulae used to derive the quantization steps in these three methods consist of fixed control variables, which fail to adapt to the amplitude variation of the audio signal. Another disadvantage of using fixed control variables is that the determination of their values relies on the feedback of experimental results. By contrast, the DWPT-DCT scheme acquires the quantization steps entirely from the signal characteristics under the guidance of a psychoacoustic model. The quantization step of the QIM is adapted to the auditory masking level, so perceptible distortion is not a concern.

The DWPT-DCT scheme automatically decreases the embedding strength whenever the watermarking process would introduce excessive distortion. With this kind of formulation, the BERs are expected to increase as more bits are embedded into a critical band. If the transparency of the watermark is the primary concern, the payload capacity can be gained at the cost of robustness and vice versa. The results shown in Table 4 reflect such a phenomenon, exhibiting a noticeable increase of the BERs when the (11 bands × 4 bits) setting is applied.

As seen in Table 4, the proposed DWPT-DCT scheme is robust against a lowpass filter with a cutoff frequency of 4 kHz. This is not surprising, since the chosen frequency components for embedding are below 0.69 kHz if 7 bands are included and below 1.38 kHz if 11 bands are included. However, it is to our surprise that the DWPT-DCT also holds certain resistance against the filtering attacks in cases G and H. In contrast, the DWT-SVD, LWT-SVD and DWT-norm end up with BERs around 50%, which is no better than random guessing. The advantage of the DWPT-DCT can be attributed to the mechanism of a fully adaptive QIM. Notice that the filtering process merely attenuates the magnitudes of frequency components in the stopband while leaving those in the passband intact. Hence, as long as the ratio between the examined DWPT coefficient and the derived quantization step remains within an acceptable range, the watermark bit can still be retrieved by the DWPT-DCT scheme.
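The ratio argument above can be illustrated with a toy example. The sketch below uses a generic binary QIM on a half-step lattice, which is an assumption and not necessarily the exact quantizer of the proposed scheme. It also assumes that a signal-derived step shrinks by the same factor as the signal under amplitude scaling, which stands in for the psychoacoustically derived step. The names `qim_embed` and `qim_extract` are hypothetical.

```python
def qim_embed(c, bit, step):
    """Move c to the nearest point of the lattice step * (n + bit / 2)."""
    offset = bit * step / 2
    return round((c - offset) / step) * step + offset

def qim_extract(c, step):
    """Decide the bit from the nearest point of the half-step lattice."""
    return round(2 * c / step) % 2

# Toy coefficients and bits (illustrative values only).
coeffs = [3.3, -7.8, 12.1, 0.4, 25.6, -14.9, 8.2, 19.7]
bits = [1, 0, 1, 1, 0, 0, 1, 0]
step = 1.0
marked = [qim_embed(c, b, step) for c, b in zip(coeffs, bits)]

# Amplitude scaling attack (case C).
scale = 0.85
attacked = [scale * m for m in marked]

# Fixed step: the detector keeps using the original step.
fixed_errors = sum(qim_extract(a, step) != b for a, b in zip(attacked, bits))

# Adaptive step: the detector re-derives a step that scaled with the signal.
adaptive_step = scale * step
adaptive_errors = sum(qim_extract(a, adaptive_step) != b
                      for a, b in zip(attacked, bits))
```

With a fixed step some bits flip after scaling, whereas the adaptive detector sees the same coefficient-to-step ratio as the embedder and recovers every bit in this toy case.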


5. Frame synchronization and selective watermarking

One troublesome attack in audio watermarking is the so-called desynchronization, in which the watermark cannot be detected because synchronization has been lost. Desynchronization may result from sample cropping, time shifting and unintentional insertion during MP3 compression. To improve the robustness, this study adopts the synchronization technique proposed in the literature [8,17,18]. The host signal is partitioned into two portions: one for synchronization code insertion and the other for watermark embedding. Prior to watermark detection, the position where the watermark is embedded must be located using the standard synchronization technology of digital communications.

Fig. 2. The neural network structure for estimating the correct counts. (Back-propagation neural network with an input layer, a hidden layer and an output layer, followed by the correct count.)

In a scenario where audio segments for watermarking are selectable, we certainly prefer embedding the watermark into the segments that are most robust against attacks. For this kind of selective watermarking, the watermark is often partitioned into parts scattered over the host audio signal. Based on the formulation of the DWPT-DCT framework, the DCT coefficients and quantization steps are the two major factors affecting the watermark detection. Hence we adopt a back-propagation neural network (BPNN) to predict the resistance of the coefficients in selected segments. The proposed scheme with the (7 bands × 2 bits) setting is chosen as the investigation target. As shown in Fig. 2, the inputs to the BPNN include all the involved DCT coefficients $\breve{c}_{\breve{k}}^{(i)}$ and the corresponding quantization steps $\Delta^{(i)}$ gathered from the first seven critical bands. The outputs are the predicted deviations of the quantized values, which are defined as

$$\breve{\rho}_{\breve{k}}^{(i)} = \frac{\tilde{\breve{c}}_{\breve{k}}^{(i)}}{\tilde{\Delta}^{(i)}} - \frac{\breve{c}_{\breve{k}}^{(i)}}{\Delta^{(i)}} \tag{34}$$

where the tilde above a variable signifies its value after a simulated serious attack. The hidden layer, located between the input and output layers, contains 14 neurons. The hyperbolic tangent sigmoid function is used as the transfer function for both hidden and output neurons. To simulate an intense attack, we purposely corrupt the watermarked audio signal by adding Gaussian white noise with SNR = 12 dB. Basically, the recovered watermark bit retains its original binary value if the inequality $|\breve{\rho}_{\breve{k}}^{(i)}| < 0.25$ is satisfied. Collecting the neural-network-judged results from all the involved critical bands offers a figure of robustness.

This study utilizes 5152 frames of audio signals to train the BPNN and an additional 2576 frames to verify the performance of the trained BPNN. The results of applying the trained BPNN to the samples in the training and test sets are given in Tables 5 and 6, respectively.

Table 5. Number of counts for the conditions whether the actual measurement |ρ| and its BPNN prediction |ρ_BPNN| exceed 0.25 in the training set. The total number of counts is 72,128.

                        Actual measurement
BPNN prediction         |ρ| ≥ 0.25     |ρ| < 0.25
|ρ_BPNN| ≥ 0.25         1933           1765
|ρ_BPNN| < 0.25         3774           64,656

Table 6. Number of counts for the conditions whether the actual measurement |ρ| and its BPNN prediction |ρ_BPNN| exceed 0.25 in the test set. The total number of counts is 36,064.

                        Actual measurement
BPNN prediction         |ρ| ≥ 0.25     |ρ| < 0.25
|ρ_BPNN| ≥ 0.25         407            313
|ρ_BPNN| < 0.25         2043           33,301

The positive and negative predictive values, termed $P_{PV+}$ and $P_{PV-}$, in the training set are computed as

$$P_{PV+}^{(\mathrm{Training})} = \frac{\text{number of cases with a positive test (claimed as error)}}{\text{total number with a positive test}} = \frac{1933}{1933 + 1765} = 52.27\%, \tag{35}$$


$$P_{PV-}^{(\mathrm{Training})} = \frac{\text{number of cases with a negative test (claimed as non-error)}}{\text{total number with a negative test}} = \frac{64656}{3774 + 64656} = 94.48\%. \tag{36}$$

The data drawn from the test set provide similar results.

$$P_{PV+}^{(\mathrm{Test})} = \frac{407}{407 + 313} = 56.53\%, \tag{37}$$

$$P_{PV-}^{(\mathrm{Test})} = \frac{33301}{2043 + 33301} = 94.22\%. \tag{38}$$
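As a quick arithmetic check, the predictive values in Eqs. (35)-(38) can be recomputed from the counts in Tables 5 and 6. The sketch below is plain Python; the function name and the `tp`/`fp`/`fn`/`tn` labels are our own shorthand, with "positive" meaning that the BPNN predicts a deviation of at least 0.25.

```python
def predictive_values(tp, fp, fn, tn):
    """Return (positive predictive value, negative predictive value)."""
    ppv = tp / (tp + fp)   # predicted errors that are actual errors
    npv = tn / (fn + tn)   # predicted non-errors that are actual non-errors
    return ppv, npv

# Counts taken from Table 5 (training set) and Table 6 (test set).
train = dict(tp=1933, fp=1765, fn=3774, tn=64656)
test = dict(tp=407, fp=313, fn=2043, tn=33301)

train_ppv, train_npv = predictive_values(**train)
test_ppv, test_npv = predictive_values(**test)

for name, ppv, npv in [("training", train_ppv, train_npv),
                       ("test", test_ppv, test_npv)]:
    print(f"{name}: PPV = {100 * ppv:.2f}%, NPV = {100 * npv:.2f}%")
```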

It appears that the BPNN has difficulty in determining whether a detected bit is truly erroneous when a positive test occurs, whereas confirming the reliability of the bits with a negative test is rather easy. Hence this study adopts the BPNN as a gauge to estimate the reliability of the extracted parameters. The outputs of the BPNN are the predicted deviations, namely the $\breve{\rho}_{\breve{k}}^{(i)}$'s, which are fed into a group of inequality examiners followed by an accumulator. By gathering the correct counts over a segment of the desired length, we obtain a numerical contour that can be used to pick suitable segments for watermarking. The picking process starts by searching for the highest position of the contour, followed by masking out the selected segment and searching again. These steps repeat until the desired number of segments is reached.

By means of the aforementioned selective watermarking, the watermark can be split into fragments to fit the size of the intended segments. Fig. 3 depicts the bit configuration for a segment comprising 20 frames. The initial two frames are reserved to embed an index tag of 5 bits along with four watermark bits, while the remaining 18 frames contain 252 bits. As a consequence, each segment accommodates 256 watermark bits in total. The index tag is of paramount importance, since a mistake in the index tag will ruin the entire group in a segment. To safeguard the index tag, we employ an (n, k) BCH encoder [33], where n and k respectively denote the lengths of the code and message words. Here (n, k) is chosen as (15, 5) to provide the 5-bit index tag with an error-correction capability of 3 bits. Obtaining a wrong index can thus be regarded as the event of receiving 4 or 5 erroneous bits concurrently. The error rate of the index tag, termed $P_{\mathrm{error}}$, can thereby be calculated as

$$P_{\mathrm{error}} = \sum_{k=4}^{5} \binom{5}{k} (\mathrm{BER})^{k} (1 - \mathrm{BER})^{5-k}, \tag{39}$$

where $\binom{5}{k}$ denotes the binomial coefficient. Given that BER = 0.1 in an awful situation, $P_{\mathrm{error}}$ is merely 0.00046.

With the bit configuration specified in Fig. 3, the process of segment selection by the BPNN is demonstrated in Fig. 4. In the bottom subplot, four segments are located to embed the entire watermark. In this example, under the noise corruption with SNR = 12 dB, the BER of the recovered watermark is reduced from 15.85% on average to 8.93%. In other words, by picking proper frames, the BER of the recovered watermark can be decreased by 43.68%.
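The figure of 0.00046 quoted for the index-tag error rate can be reproduced directly. The sketch below simply evaluates the binomial sum stated in the text (4 or 5 erroneous bits out of 5); the function name and its default arguments are hypothetical.

```python
from math import comb

def index_tag_error_rate(ber, tag_bits=5, min_errors=4):
    """Probability of at least `min_errors` bit errors among `tag_bits` bits."""
    return sum(comb(tag_bits, k) * ber ** k * (1 - ber) ** (tag_bits - k)
               for k in range(min_errors, tag_bits + 1))

p_error = index_tag_error_rate(0.1)
print(f"P_error = {p_error:.5f}")
```

The two terms are 5 × 0.1^4 × 0.9 = 0.00045 and 0.1^5 = 0.00001, which add up to the stated value.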

Fig. 3. Bit configuration of a watermarked audio segment of 20 selected frames, which comprises a (15,5) BCH-encoded index tag (labeled by “T”) and 256 watermark bits (labeled by “w”).

6. Incorporation of segment selection with the perceptual-based watermarking scheme

The proposed watermarking system integrates all the techniques discussed in previous sections. Let the watermark be a binary image logo of size 32 × 32, thus requiring at least four segments (each of 20 frames) to embed the entire watermark. For the purpose of security, the system employs two keys to produce, embed and detect the watermark. The steps of the watermark embedding are as follows:

1. Scramble the watermark via the Arnold transform [34] and then convert the scrambled matrix into a 1-dimensional bit stream. The number of times the watermark image is shuffled by the Arnold transform is regarded as Key.1.
2. Use Key.2 to identify the DCT coefficients for watermarking in each critical band. Here Key.2 is just a seed of a random number generator. The generated random numbers are used to determine the sequence of involved critical bands along with the positions of the watermarked DCT coefficients.
3. Partition the bit stream into groups, each possessing a 5-bit index tag along with 256 watermark bits.
4. Divide the host signal into frames of 4096 samples and apply the DWPT and DCT.
5. Compute the quantization step for each critical band based on the auditory masking threshold.
6. Seek appropriate embedding segments by using the BPNN.
7. For the frames in front of the picked segments:
   (i) Embed the synchronization codes.
8. For the frames in the picked segments:
   (i) Encode a 5-bit index tag using (15, 5) BCH coding.
   (ii) Embed 15 BCH-encoded bits and 256 watermark bits.
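Steps 1 and 3 of the embedding procedure can be sketched as follows. This is a minimal illustration that assumes the common form of the Arnold cat map, (x, y) → (x + y, x + 2y) mod N; the paper does not state which variant is used. The function names, the test pattern and the value of Key.1 are hypothetical.

```python
def arnold(image, iterations):
    """Scramble a square bit matrix with the Arnold cat map, `iterations` times."""
    n = len(image)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[(x + y) % n][(x + 2 * y) % n] = image[x][y]
        image = out
    return image

def inverse_arnold(image, iterations):
    """Undo `iterations` rounds of the map by reading each pixel back."""
    n = len(image)
    for _ in range(iterations):
        out = [[0] * n for _ in range(n)]
        for x in range(n):
            for y in range(n):
                out[x][y] = image[(x + y) % n][(x + 2 * y) % n]
        image = out
    return image

def partition(bits, group_size=256, tag_bits=5):
    """Split a bit stream into (index tag, payload) groups."""
    groups = []
    for index, start in enumerate(range(0, len(bits), group_size)):
        tag = [(index >> shift) & 1 for shift in reversed(range(tag_bits))]
        groups.append((tag, bits[start:start + group_size]))
    return groups

# A deterministic 32 x 32 test pattern stands in for the binary logo.
logo = [[(3 * x + 5 * y + x * y) % 2 for y in range(32)] for x in range(32)]
key1 = 7  # hypothetical number of Arnold iterations (Key.1)

scrambled = arnold(logo, key1)
bit_stream = [b for row in scrambled for b in row]
groups = partition(bit_stream)
restored = inverse_arnold(scrambled, key1)
```

A 32 × 32 logo yields 1024 bits, hence four groups of 256 bits, which matches the four segments mentioned in the text. The BCH encoding of the index tag is not included in the sketch.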


Fig. 4. Illustration of the correct counts predicted by a BPNN: (a) the host audio signal, (b) the contour of actual correct counts (red dotted line) and the one predicted by the BPNN (blue solid line), (c) segments selected from the accumulative correct counts gathered from 20 consecutive frames. Each selected segment is separated from the others by at least six frames. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)
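The segment-picking procedure of Section 5, as illustrated in Fig. 4, can be sketched as a greedy search over the accumulated correct counts. The code below is an illustrative reading of that description, not the authors' implementation: the function names are hypothetical, and random numbers stand in for the deviations that the BPNN would predict.

```python
import random

def correct_counts(deviations, threshold=0.25):
    """Per-frame number of predicted deviations with |rho| below the threshold."""
    return [sum(1 for rho in frame if abs(rho) < threshold) for frame in deviations]

def pick_segments(counts, seg_len=20, n_segments=4, gap=6):
    """Greedily pick segments with the highest accumulated correct counts."""
    n_starts = len(counts) - seg_len + 1
    contour = [sum(counts[s:s + seg_len]) for s in range(n_starts)]
    available = [True] * n_starts
    chosen = []
    while len(chosen) < n_segments and any(available):
        best = max((s for s in range(n_starts) if available[s]),
                   key=lambda s: contour[s])
        chosen.append(best)
        # Mask every start whose segment would overlap the chosen one
        # or lie closer than `gap` frames to it.
        lo = max(0, best - seg_len - gap + 1)
        hi = min(n_starts, best + seg_len + gap)
        for s in range(lo, hi):
            available[s] = False
    return sorted(chosen)

# Synthetic stand-in for BPNN outputs: 300 frames with 14 deviations each.
rng = random.Random(1)
deviations = [[rng.uniform(-0.4, 0.4) for _ in range(14)] for _ in range(300)]
counts = correct_counts(deviations)
segments = pick_segments(counts)
```

Each returned value is the index of the first frame of a 20-frame segment. The masking window enforces the separation of at least six frames stated in the caption of Fig. 4.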

The steps required by the watermark extraction are as follows:

1. Detect the synchronization codes using a matched filter.
2. Apply the DWPT and DCT to the audio signal once the frames are aligned.
3. For each frame:
   (i) Compute the quantization step for each involved critical band.
   (ii) Extract the watermark bits from the designated DCT coefficients (as indicated by Key.2) for each involved critical band.
4. Gather the bits from 20 frames to form a group of 256 bits along with a 5-bit index tag, which is obtained from a BCH decoder.
5. Restore the whole watermark by packing all the groups together. If multiple copies exist with a specific index, use the majority vote strategy to decide each binary bit value.
6. Put Key.1 into the inverse Arnold transform to unscramble the watermark image.
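The majority vote in step 5 can be sketched as below. The function name is hypothetical, and two choices are our own assumptions rather than the paper's: ties are resolved to 0, and a group that was never received is filled with zeros.

```python
from collections import defaultdict

def assemble_watermark(groups, n_groups=4, group_size=256):
    """Combine (index, bits) groups into one bit stream by per-bit majority vote."""
    copies = defaultdict(list)
    for index, bits in groups:
        # Discard groups whose decoded index or length is invalid.
        if 0 <= index < n_groups and len(bits) == group_size:
            copies[index].append(bits)
    watermark = []
    for index in range(n_groups):
        if not copies[index]:
            watermark.extend([0] * group_size)  # missing group (assumed fill)
            continue
        for position in range(group_size):
            ones = sum(c[position] for c in copies[index])
            watermark.append(1 if 2 * ones > len(copies[index]) else 0)
    return watermark

# Small example: two groups of four bits, with three copies of group 0.
received = [(0, [1, 0, 1, 1]), (1, [0, 0, 1, 0]),
            (0, [1, 1, 1, 0]), (0, [1, 0, 0, 1])]
recovered = assemble_watermark(received, n_groups=2, group_size=4)
```

In the example, each bit of group 0 is decided by two of its three copies, so isolated errors in a single copy are voted out.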


Except for the role of the BPNN in segment selection, the main part of the embedding process has already been illustrated in Fig. 1. Fig. 5 depicts a complete process of the watermark extraction. With the employment of the selective watermarking scheme, the watermark receives better protection to survive through serious attacks.

Fig. 5. Extraction procedures for the proposed watermarking system. (Flowchart blocks: watermarked audio signal; detect Sync_code and align frames; identify watermarked frames; retrieve quantization steps; extract information bits using the DWPT + DCT + perceptual-based QIM scheme with Key.2; loop until the end; assemble watermark; inverse Arnold transform with Key.1; watermark.)

7. Concluding remarks

A DWPT-DCT framework for blind audio watermarking is presented. Via the exploitation of auditory properties, the quantization steps for QIM are not only perceptually determinable during watermark embedding but also retrievable during watermark extraction. The imperceptibility of the embedded watermarks is ensured because the disturbance caused by the QIM remains below the auditory masking threshold. The experimental results confirm that the embedded watermark is inaudible and that the proposed scheme is robust against various signal processing attacks. The payload capacity is expandable by allotting more critical bands and by embedding more bits in each critical band. This study demonstrates four different combinations drawn from two bit numbers and two band sets, with the capacity ranging from 150.73 to 473.73 bps. Because the proposed scheme automatically lowers the embedding strength to avoid perceptual distortion, the achievement of higher capacity is accompanied by higher bit error rates.

To further improve the robustness, the proposed watermarking system adopts a selective strategy to embed packed information bits into particular segments. The selection is achieved by a BPNN, which takes into consideration both the quantization steps and the DCT coefficients. The contribution of selective watermarking is prominent when the watermarked audio signal suffers a severe attack. With an appropriate selection of embedding frames, the BER of the recovered watermark can be significantly reduced. The merits of all the developed techniques make the proposed system very promising for blind audio watermarking.
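The capacity range quoted above is consistent with a simple count of bits per frame. The sketch below assumes that every 4096-sample frame at a 44,100 Hz sampling rate carries (bands × bits) watermark bits and that synchronization overhead is ignored; the function name is hypothetical.

```python
def payload_bps(bands, bits_per_band, fs=44100, frame_len=4096):
    """Payload in bits per second when each frame carries bands * bits_per_band bits."""
    return bands * bits_per_band * fs / frame_len

settings = [(7, 2), (11, 2), (7, 4), (11, 4)]
capacities = {s: payload_bps(*s) for s in settings}

for (bands, bits), bps in capacities.items():
    print(f"{bands} bands x {bits} bits: {bps:.2f} bps")
```

Under these assumptions the (7 × 2) and (11 × 4) settings give 150.73 bps and 473.73 bps, the two end points stated in the text.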


Acknowledgment

This work was supported by the National Science Council, Taiwan, ROC, under Grant NSC 101-2221-E-197-033.

References

[1] P. Bassia, I. Pitas, N. Nikolaidis, Robust audio watermarking in the time domain, IEEE Trans. Multimed. 3 (2) (2001) 232–241.
[2] W.-N. Lie, L.-C. Chang, Robust and high-quality time-domain audio watermarking based on low-frequency amplitude modification, IEEE Trans. Multimed. 8 (1) (2006) 46–59.
[3] H. Wang, R. Nishimura, Y. Suzuki, L. Mao, Fuzzy self-adaptive digital audio watermarking based on time-spread echo hiding, Appl. Acoust. 69 (10) (2008) 868–874.
[4] X. Shijun, H. Jiwu, Histogram-based audio watermarking against time-scale modification and cropping attacks, IEEE Trans. Multimed. 9 (7) (2007) 1357–1372.
[5] W. Li, X. Xue, P. Lu, Localized audio watermarking technique robust against time-scale modification, IEEE Trans. Multimed. 8 (1) (2006) 60–69.
[6] R. Tachibana, S. Shimizu, S. Kobayashi, T. Nakamura, An audio watermarking method using a two-dimensional pseudo-random array, Signal Process. 82 (10) (2002) 1455–1469.
[7] H.-H. Tsai, J.-S. Cheng, P.-T. Yu, Audio watermarking based on HAS and neural networks in DCT domain, EURASIP J. Adv. Signal Process. 2003 (3) (2003) 252–263.
[8] X.-Y. Wang, H. Zhao, A novel synchronization invariant audio watermarking scheme based on DWT and DCT, IEEE Trans. Signal Process. 54 (12) (2006) 4835–4840.
[9] I.-K. Yeo, H.J. Kim, Modified patchwork algorithm: a novel audio watermarking scheme, IEEE Trans. Speech Audio Process. 11 (4) (2003) 381–386.
[10] B.Y. Lei, I.Y. Soon, Z. Li, Blind and robust audio watermarking scheme based on SVD–DCT, Signal Process. 91 (8) (2011) 1973–1984.
[11] X.-Y. Wang, P.-P. Niu, H.-Y. Yang, A robust digital audio watermarking based on statistics characteristics, Pattern Recognit. 42 (11) (2009) 3057–3064.
[12] S. Wu, J. Huang, D. Huang, Y.Q. Shi, Efficiently self-synchronized audio watermarking for assured audio data transmission, IEEE Trans. Broadcast. 51 (1) (2005) 69–76.
[13] X. Wang, P. Wang, P. Zhang, S. Xu, H. Yang, A norm-space, adaptive, and blind audio watermarking algorithm by discrete wavelet transform, Signal Process. 93 (4) (2013) 913–922.
[14] X. Li, H.H. Yu, Transparent and robust audio data hiding in cepstrum domain, in: IEEE Int. Conf. Multim. Expo (2000) 397–400.
[15] S.C. Liu, S.D. Lin, BCH code-based robust audio watermarking in cepstrum domain, J. Inf. Sci. Eng. 22 (3) (2006) 535–543.
[16] H.-T. Hu, W.-H. Chen, A dual cepstrum-based watermarking scheme with self-synchronization, Signal Process. 92 (4) (2012) 1109–1116.
[17] V. Bhat K, I. Sengupta, A. Das, An adaptive audio watermarking based on the singular value decomposition in the wavelet domain, Digit. Signal Process. 20 (6) (2010) 1547–1558.
[18] B. Lei, I.Y. Soon, F. Zhou, Z. Li, H. Lei, A robust audio watermarking scheme based on lifting wavelet transform and singular value decomposition, Signal Process. 92 (9) (2012) 1985–2001.
[19] B. Chen, G.W. Wornell, Quantization index modulation: a class of provably good methods for digital watermarking and information embedding, IEEE Trans. Inf. Theory 47 (4) (2001) 1423–1443.
[20] B. Chen, G.W. Wornell, Quantization index modulation methods for digital watermarking and information embedding of multimedia, J. VLSI Signal Process. 27 (2001) 7–33.
[21] X. Wang, W. Qi, P. Niu, A new adaptive digital audio watermarking based on support vector regression, IEEE Trans. Audio Speech Lang. Process. 15 (8) (2007) 2270–2277.
[22] H. Malik, R. Ansari, A. Khokhar, Robust audio watermarking using frequency-selective spread spectrum, Inf. Secur. IET 2 (4) (2008) 129–150.
[23] S.T. Chen, G.D. Wu, H.N. Huang, Wavelet-domain audio watermarking scheme using optimisation-based quantisation, IET Signal Process. 4 (6) (2010) 720–727.
[24] B. Lei, I. Soon, Z. Li, P. Dai, A particle swarm optimization based audio watermarking scheme, in: Eighth International Conference on Information, Communications and Signal Processing (ICICS), 2011, 1–4.
[25] J.D. Johnston, Transform coding of audio signals using perceptual noise criteria, IEEE J. Sel. Areas Commun. 6 (2) (1988) 314–323.
[26] T. Painter, A. Spanias, Perceptual coding of digital audio, Proc. IEEE 88 (4) (2000) 451–515.
[27] X. He, Watermarking in Audio: Key Techniques and Technologies, Cambria Press, Youngstown, N.Y., 2008.
[28] H.-T. Hu, W.-C. Li, A perceptually adaptive and retrievable QIM scheme for efficient blind audio watermarking, in: International Conference on Information Science and Applications (ICISA), 2012, 51–55.
[29] Jamendo, 〈http://www.jamendo.com/en/〉.
[30] ITU-R Recommendation BS.1387, Method for objective measurements of perceived audio quality, December 1998.
[31] P. Kabal, An examination and interpretation of ITU-R BS.1387: perceptual evaluation of audio quality, TSP Lab Technical Report, Department of Electrical & Computer Engineering, McGill University, 2002.
[32] D. Ellis, 〈http://labrosa.ee.columbia.edu/matlab/〉.
[33] G. Forney Jr., On decoding BCH codes, IEEE Trans. Inf. Theory 11 (4) (1965) 549–557.
[34] V.I. Arnold, A. Avez, Ergodic Problems of Classical Mechanics, Benjamin, New York, 1968.