Automated recovery of damaged audio files using deep neural networks

Digital Investigation 30 (2019) 117–126
Hee-Soo Heo a, Byung-Min So b, Il-Ho Yang a, Sung-Hyun Yoon a, Ha-Jin Yu a, *

a School of Computer Science, College of Engineering, University of Seoul, 163 Siripdae-ro, Dongdaemun-gu, Seoul, 02504, Republic of Korea
b Supreme Prosecutors' Office, 157 Banpo-daero, Seocho-gu, Seoul, 06590, Republic of Korea

* Corresponding author. E-mail addresses: [email protected] (H.-S. Heo), [email protected] (B.-M. So), [email protected] (I.-H. Yang), [email protected] (S.-H. Yoon), [email protected] (H.-J. Yu). https://doi.org/10.1016/j.diin.2019.07.007

Article history: Received 25 April 2018; received in revised form 16 July 2019; accepted 31 July 2019; available online 1 August 2019.

Abstract

In this paper, we propose two methods to recover damaged audio files using deep neural networks. The presented audio file recovery methods differ from the conventional file-carving-based recovery method in that they restore lost data, which are difficult to recover with the conventional method. This research suggests that recovery tasks that are essential yet very difficult or very time consuming can be automated with the proposed recovery methods using deep neural networks. We apply feed-forward and long short-term memory neural networks to the tasks. The experimental results show that deep neural networks can distinguish speech signals from non-speech signals and can also identify the encoding methods of audio files at the bit level. This leads to successful recovery of damaged audio files that are otherwise difficult to recover using the conventional file-carving-based methods. © 2019 Elsevier Ltd. All rights reserved.

Keywords: Audio files; Automated recovery; Deep neural networks; File carving; Long short-term memory

Introduction

The past century has seen wide adoption of audio recording devices, including smartphones, and the submission of audio files as evidence in court has accordingly become more common. Audio files that are claimed as legal evidence usually go through a conventional validation process, whereby investigators listen to and identify the contents and examine whether the audio has been counterfeited in order to establish a legal case. However, audio files collected on digital devices such as smartphones can be deleted, whether maliciously or simply for lack of storage on the device. For deleted files to qualify as legal evidence, the audio must be restored from the storage where the deletion occurred and the recovered data must be validated. In a typical file recovery environment, file carving, a method for restoring deleted files in the file system, has been widely adopted and applied (Poisel et al., 2011). However, the file-carving method often results in incompletely recovered audio files that cannot be played back. For instance, after an audio file is deleted from a file system and overwriting takes place, the data in the overwritten region might not be restored, preventing the complete recovery of the file. Moreover, if the damaged block is essential for playing the audio file (e.g., the header), the file cannot be played at all on account of the partial yet critical damage. Therefore, to restore damaged audio files, we need a new approach to recovery: a method that infers the lost data from the data that remain in the file. When a file goes through such a complete recovery process, the process should successfully recover the lost data that the conventional methods cannot restore. The focus of the present research is the application of deep neural networks to automate tasks that are vital yet unsuitable to process manually owing to their difficulty and the time required.

The remainder of this paper is organized as follows. Section 2 explains the conventional file-carving method. Section 3 outlines the application of deep neural networks to the present objective. Sections 4 and 5 present the experiments and results verifying the accuracy of the proposed deep neural network methods, and we conclude the paper in Section 6.

File-carving methods

Most existing file-recovery methods are based on file carving, which exploits the structures and contents of the files deleted from the file system. Fig. 1 illustrates the full recovery process of a file-carving method.


Fig. 1. Flow of file recovery based on file carving. (a) Physical storage and metadata are generated in the file system (“Saving ‘Audio.mp3’”). (b) Only the record of the file system is removed; the actual data remain (“Removing ‘Audio.mp3’”). (c) Investigating the storage device for file carving. (d) Metadata are re-generated based on the file-carving result.

The file system is the complete system that manages all the files employed by users. When a user saves a file, the file system generates metadata that includes the physical location of the saved file and the creation time, and it prevents other files from overwriting the same location, as shown in Fig. 1a. When a user deletes a file, the file system deletes the previously generated metadata and changes its settings so that other files can overwrite the location (Fig. 1b). File carving is the process of reading deleted files by investigating their saved locations, even when the metadata are no longer available because the user deleted the file (Fig. 1c). Thus, file carving restores access to files deleted from the file system; it is not a method for restoring lost information. Consequently, such a recovery method can completely recover files only when they have not been overwritten. For instance, if a WAV audio file was deleted and its header, which is necessary for playback, was overwritten, it would be difficult to play the file back even after it was recovered with file carving. Fig. 2 illustrates an example of files damaged in the file-carving process. When a file is damaged or overwritten, a method that can infer and recover the damaged data, and thus differs from the conventional approach, is necessary.

Fig. 2. Example of cases in which it is difficult to recover the file completely by using the conventional file-carving method. (a) Unallocated file before file carving (Fig. 1b). (b) Another file that overwrote part of the unallocated file. (c) Damaged and unallocated file caused by the other file's writing. (d) Damaged file after file carving.
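As a small, hypothetical illustration of why an overwritten header prevents playback, the sketch below checks whether a carved file still carries the RIFF/WAVE signature and the 'fmt ' chunk that decoders rely on; the function name and the notion of an "intact-looking" header are illustrative assumptions, not part of the original method.

def wav_header_looks_intact(path):
    """Rudimentary check for the RIFF/WAVE signature and a 'fmt ' chunk."""
    with open(path, "rb") as f:
        head = f.read(64)
    return (len(head) >= 16
            and head[0:4] == b"RIFF"
            and head[8:12] == b"WAVE"
            and b"fmt " in head)

# A carved WAV file whose header region was overwritten will typically fail
# this check even though most of its encoded samples are still intact.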

Proposed recovery method based on deep neural networks

A WAV file with a corrupted header, which prevents the file from playing, can be restored with the proposed recovery method, which infers the damaged information from data other than the header (i.e., the encoded signals). This recovery process differs from that of the existing file-carving method and addresses problems that the existing method cannot solve. Nevertheless, the proposed process requires tasks that are not suitable to perform manually because they are challenging and time-intensive. We therefore automate these tasks with deep neural networks and accordingly propose two new recovery methods.

Inference of header information based on speech or non-speech decision

The following process can recover an audio file with a corrupted header. First, the header information that is expected to exist in the damaged audio file is generated. For example, when bit-rate information must exist in the header, header information for every bit rate that could have been used in the audio file should be generated and tested. Based on each generated header, the audio file is decoded, and it is determined whether the decoded signal is speech or nonspeech. Herein, speech signals are signals that can be recognized as normal sound because the generated header information matches the header as it was before the damage; nonspeech signals are improperly decoded signals whose generated header information differs from the original. Proper decoding of a speech signal is not possible unless the header information matches the one used in the encoding process; thus, header information has to be generated arbitrarily until the original speech signal is decoded. When a normal audio signal is identified in this process, the lost header information is assumed to have been properly inferred. If a person performs these tasks, however, he or she must repeatedly generate different headers, decode, and listen to the decoded audio for all possible permutations of the header parameters. Considering the actual encoding methods of audio files, the possible cases are far too numerous, and it is not possible to recover all the files in a feasible amount of time.
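To illustrate the header-candidate generation step, the following sketch enumerates plausible PCM parameter combinations, prepends a synthetic RIFF/WAVE header to the recovered raw data, and yields one decodable candidate per hypothesis. The parameter grid and the file name are illustrative assumptions, not the exact search space used in the paper.

import itertools
import struct

def build_wav_header(data_len, channels, sample_rate, bits_per_sample):
    """Build a minimal RIFF/WAVE header for uncompressed PCM data."""
    block_align = channels * bits_per_sample // 8
    byte_rate = sample_rate * block_align
    header = b"RIFF" + struct.pack("<I", 36 + data_len) + b"WAVE"
    header += b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sample_rate,
                                    byte_rate, block_align, bits_per_sample)
    header += b"data" + struct.pack("<I", data_len)
    return header

def candidate_decodings(raw_path):
    """Yield (parameters, candidate WAV bytes) for every header hypothesis."""
    with open(raw_path, "rb") as f:              # carved data without a header
        raw = f.read()
    grid = itertools.product([1, 2],                # channels
                             [8000, 16000, 44100],  # sample rates (Hz)
                             [8, 16])               # bits per sample
    for channels, rate, bits in grid:
        header = build_wav_header(len(raw), channels, rate, bits)
        yield (channels, rate, bits), header + raw

Each candidate can then be decoded and passed to the speech or nonspeech classifier described next; the hypothesis whose decoded signal is judged to be speech indicates the inferred header parameters.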

In the proposed approach, we apply deep neural networks to this process to distinguish speech from nonspeech signals. In speech signals, which include actual human voices, formant structure is expected to be found in the frequency band; formants are the spectral peaks of the sound spectrum that distinguish specific vowels (Fant, 1971). In nonspeech signals, by contrast, the characteristics of white noise are expected to appear. Using these features, speech and nonspeech signals can be classified by deep neural networks, organized simply as feed-forward networks. The detailed tasks are as follows. First, the decoded signal is divided into small frames, and the major features within the frequency band of each frame are extracted. To feed the extracted features into the deep neural network, features from multiple frames are concatenated. The network is trained to classify whether a signal is closer to speech or to nonspeech by performing binary classification on the input features. Finally, after aggregating the frame-level results to the utterance level, the decision whether a decoded signal is speech or nonspeech is made. The overall process is illustrated in Fig. 3. The structure was designed with reference to a speech recognition system using deep neural networks (Hinton et al., 2012a): a signal is divided into small frames, frame-level features such as filter-bank outputs are extracted, and multiple frame-level features are concatenated and input to the DNN to determine whether the signal is a human voice.

Fig. 3. Flow of the speech or nonspeech decision. (a) Damaged WAV file (header is corrupted and playback is not possible). (b) Generating candidate headers for the damaged file (grid search over possible parameters). (c) Decoding the signal based on the generated header information. (d) Speech or nonspeech decision.
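A minimal sketch of this frame-level pipeline is given below, using librosa for 40-band log mel filter-bank features with a 25-ms window and 10-ms shift (matching the configuration reported in the experiments) and an 11-frame context window. The mean-posterior aggregation rule and the threshold are assumptions, and model is assumed to be a trained Keras classifier such as the one defined in the experiments section.

import numpy as np
import librosa

def utterance_features(signal, sr=16000, n_mels=40, context=5):
    """40-dim log mel filter-bank features stacked with +/-5 frames of context."""
    mel = librosa.feature.melspectrogram(y=signal, sr=sr,
                                         n_fft=int(0.025 * sr),
                                         hop_length=int(0.010 * sr),
                                         n_mels=n_mels)
    logmel = np.log(mel + 1e-8).T                          # (frames, 40)
    padded = np.pad(logmel, ((context, context), (0, 0)), mode="edge")
    stacked = [padded[i:i + 2 * context + 1].reshape(-1)   # 11 * 40 = 440 dims
               for i in range(logmel.shape[0])]
    return np.stack(stacked)

def is_speech(signal, model, threshold=0.5):
    """Aggregate frame-level posteriors into an utterance-level decision."""
    feats = utterance_features(signal)
    frame_posteriors = model.predict(feats)[:, 1]          # P(speech) per frame
    return float(frame_posteriors.mean()) > threshold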

Identification of audio files based on long short-term memory

Audio signals are data in which sound waves are sampled and quantized at fixed intervals. Therefore, the samples in an audio signal do not exist independently; each value depends on the amplitudes and frequencies of the sound waves, and dependencies emerge between samples. Two examples of such dependencies are as follows. Over the short term, the dynamic range between neighboring samples is limited to some extent, so a specific sample can be predicted by examining the samples around it; hence, short-term dependencies exist. Additionally, since sample values do not increase or decrease continually over the long term, but rather rise and fall cyclically, long-term dependencies also exist. When audio signals with such characteristics are encoded and saved, the encoded data may also exhibit dependencies that follow the audio signal. Specifically, a WAV file in which the audio signal is saved in a 16-bit format will have the aforementioned dependencies in 2-byte units, whereas a WAV file stored in an 8-bit format will have dependencies in 1-byte units. Such dependencies between time steps can be modeled by long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). The LSTM defines a kind of cell, which consists of an input gate, a forget gate, and an output gate. The cell has an internal state value and calculates its output by passing the input values and internal state through these gates. Its basic structure is depicted in Fig. 4(a). The output value of each cell is calculated as follows. First, the input-gate value i_t at time t is calculated as:

i_t = \sigma(x_t W_i + h_{t-1} U_i + b_i),    (1)

where W_i and U_i are the weight matrices of the input gate, x_t is the input at time t, h_{t-1} is the output of the cell at time t-1, b_i is the bias vector, and \sigma(\cdot) denotes the non-linear activation function; we used the sigmoid function. The process of calculating the input-gate value is depicted in Fig. 4(b). Likewise, the forget-gate value f_t is calculated as:

f_t = \sigma(x_t W_f + h_{t-1} U_f + b_f),    (2)

where W_f and U_f are the weight matrices of the forget gate. The process of calculating the forget-gate value is depicted in Fig. 4(c). Next, the current cell state C_t is calculated using the input-gate and forget-gate values:

C_t = i_t \odot \tilde{C}_t + f_t \odot C_{t-1},    (3)

where \odot denotes element-wise multiplication and the candidate cell state \tilde{C}_t is defined as:

\tilde{C}_t = \tanh(x_t W_c + h_{t-1} U_c + b_c),    (4)

where \tanh(\cdot) is the hyperbolic tangent function. The process of calculating the cell state is depicted in Fig. 4(d). Lastly, the output value h_t at time t is calculated as:

o_t = \sigma(x_t W_o + h_{t-1} U_o + b_o),    (5)

h_t = o_t \odot \tanh(C_t).
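To make the gate computations concrete, the following NumPy sketch performs one LSTM cell step exactly as written in Equations (1)–(5); the dimensions and random initialization are illustrative assumptions only.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U, b are dicts keyed by gate name ('i', 'f', 'c', 'o')."""
    i_t = sigmoid(x_t @ W["i"] + h_prev @ U["i"] + b["i"])      # Eq. (1)
    f_t = sigmoid(x_t @ W["f"] + h_prev @ U["f"] + b["f"])      # Eq. (2)
    c_tilde = np.tanh(x_t @ W["c"] + h_prev @ U["c"] + b["c"])  # Eq. (4)
    c_t = i_t * c_tilde + f_t * c_prev                          # Eq. (3)
    o_t = sigmoid(x_t @ W["o"] + h_prev @ U["o"] + b["o"])      # Eq. (5)
    h_t = o_t * np.tanh(c_t)                                    # cell output
    return h_t, c_t

# Illustrative shapes: a 16-dimensional input unit and 20 cells.
rng = np.random.default_rng(0)
x_dim, h_dim = 16, 20
W = {k: 0.1 * rng.standard_normal((x_dim, h_dim)) for k in "ifco"}
U = {k: 0.1 * rng.standard_normal((h_dim, h_dim)) for k in "ifco"}
b = {k: np.zeros(h_dim) for k in "ifco"}
h_t, c_t = lstm_step(rng.standard_normal(x_dim),
                     np.zeros(h_dim), np.zeros(h_dim), W, U, b)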

As shown in Equation (5), in the process of calculating the output value of each cell, each gate takes values between zero and one, which decide whether the corresponding values are remembered, forgotten, taken, or not taken. The processes of calculating the output-gate value and the output are depicted in Fig. 4(e) and Fig. 4(f), respectively. The weights are trained to learn whether to forget or remember the previous state. Thus, one can model the structures in an audio file using an LSTM, which can learn both short-term and long-term dependencies; for example, we expected that structural characteristics of each file format, such as the average number of 1 bits, would be learned.

Fig. 4. Simplified structure of the long short-term memory (LSTM) cell (a), and the processes of calculating the values of the input gate (b), forget gate (c), cell state (d), output gate (e), and output (f).

Audio file modeling with the LSTM proceeds without decoding the audio signals, unlike the aforementioned speech and nonspeech decision process. The detailed steps of LSTM-based audio file identification are as follows. First, the target file or data block is read in fixed-size units and expressed as binary vectors. Herein, we assume that the target file or data block contains an audio file with a corrupted header. Specifically, when a file is read in 2-byte units, vectors consisting of 16 binary values are generated; this process of generating binary vectors from an unknown file is depicted in Fig. 5. The binary vectors are then sequentially input into a network composed of multiple LSTM layers. Fig. 6 shows the process of inputting the binary vectors sequentially and deriving a prediction result from the input data by the LSTM layers; in this figure, each h_t denotes one LSTM cell whose operation is defined by Equations (1)–(5). The prediction results indicate the probabilities that the input sequence was extracted from each audio file format included in the training data.
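A sketch of this binary-vector construction is shown below: reading a block in 2-byte units with numpy.unpackbits yields one 16-dimensional binary vector per unit, so a 1024-byte segment (the segment size used in the experiments) becomes a sequence of 512 vectors. The file name is a placeholder.

import numpy as np

def block_to_binary_sequence(block, unit_bytes=2):
    """Convert a byte block into a sequence of binary vectors.

    A 1024-byte block read in 2-byte units becomes a (512, 16) array of
    zeros and ones, ready to be fed to the LSTM layers.
    """
    usable = len(block) - (len(block) % unit_bytes)
    data = np.frombuffer(block[:usable], dtype=np.uint8)
    bits = np.unpackbits(data)                     # 8 bits per byte
    return bits.reshape(-1, unit_bytes * 8).astype(np.float32)

# Example: one 1024-byte segment taken from an unknown file.
with open("unknown.bin", "rb") as f:               # placeholder path
    segment = f.read(1024)
sequence = block_to_binary_sequence(segment)       # shape (512, 16)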

Fig. 5. Process of generating binary vectors of length 16 from an unknown file.

Fig. 6. Process of inputting binary vectors sequentially and deriving a prediction result from the input data by the LSTM layers.

In particular, a model trained using MP3 files, WAV files, and non-audio files outputs three probabilities, one for each of these classes, indicating how likely it is that the input data were extracted from each kind of file. Therefore, the second proposed method has the limitation that it cannot identify audio formats that are not included in the training data. Since data of indefinite length cannot be fed into an LSTM, we use a fixed number of units per sequence. The unit size and the sequence length are critical factors for the LSTM's performance. In practice, if the unit size is set to 1 byte, the information in each unit is considerably restricted because one audio sample cannot be stored in a single unit. Conversely, if the unit size is too large, the sequence length decreases, making it difficult to model the time dependencies sufficiently. The binary vectors generated from each file are entered into the LSTM, and the LSTM is trained to identify the format of the audio files. Using the trained LSTM, it is possible to identify the audio format (WAV, MP3, etc.) and encoding options (bit rate, compression ratio, etc.) of each target file without any decoding process.

The audio file identification LSTM can be applied to recovering audio files in the following way. The input is a data block generated as a result of the file-carving method or extracted from an audio recorder's storage media. Although the objective is to recover the audio file included in the data block, it is difficult to identify the location of the audio file because the header information is corrupt. However, by applying the audio file identification LSTM, it is possible to determine the locations of the audio files and their decoding methods. Specifically, this can be achieved by dividing the data block into equal sections and checking whether the features of the trained audio formats are observed in each section.

Comparison of the proposed two recovery methods

We have proposed two DNN-based recovery methods, which operate in ways that differ from the conventional method. The LSTM-based method can infer the location of a damaged audio file in a data block in a manner similar to the existing file-carving method, and it can additionally infer the information necessary for recovering or decoding the file. This gives it a cost advantage in the recovery process because both tasks are executed simultaneously, whereas the conventional method must execute the process in two steps. However, it has the disadvantage that audio files of every format to be recognized must be available in advance to train the LSTM model. Therefore, once a model is trained, restoring audio files in a format other than those used for training requires retraining the model with data of that format. Meanwhile, the method based on the speech and nonspeech decision has the disadvantage that it is only operable on files that have already been processed with the file-carving method. Nonetheless, it is advantageous in that a trained model can be reused without retraining: even when an audio file with a different format must be restored, the speech and nonspeech decision operates in the same way, so there is no need to retrain or modify the existing model. Consequently, although both suggested methods use deep neural networks, each has benefits and drawbacks. As their use cases differ, one can expect synergistic effects from using the two methods together.

Experiments

In this study, we designed and conducted experiments to verify the performance and applicability of the proposed methods using deep neural networks. We assumed contexts in which the existing file-carving method cannot restore the original waveform at all; thus, we did not include an existing restoration system in the experiments. Construction and training of the deep neural networks were implemented in Keras (Chollet et al., 2015) with a TensorFlow (Abadi et al., 2015) back-end.

Experiment design of the speech and nonspeech decision

The speech and nonspeech decision experiment was intended to test whether the model can distinguish improperly decoded nonspeech signals from properly decoded speech signals in given WAV files. The nonspeech signals mentioned herein correspond to the case in Section 3.1, where incorrect header information is inserted into the WAV file while inferring the header information. We used the utterances of adult speakers for training the system and the utterances of child speakers for the performance evaluation, and the hyperparameters were determined based on the training set. For the adult speakers' data, we used utterances from the speaker recognition speech data collected and distributed by the Electronics and Telecommunications Research Institute of Korea; for the children's data, we used utterances of children five to six years of age. Both databases are composed of Korean sentences 2–3 s long. By employing two distinct audio databases for training and test, we intended to make the results independent of the training speakers' age, channel, and phoneme information. For speech signal training, we used normally decoded signals of one-channel WAV format with 16 kHz, 16-bit sampling. For nonspeech signal training, we used improperly decoded signals obtained from the same WAV files with an 8-bit unsigned method. For the speech signal evaluation, we used test utterances of the same format as the training set. For the nonspeech signal evaluation, we used improperly decoded signals obtained from the above WAV files with an 8-bit unsigned method, a 16-bit big-endian method, an 8-bit μ-law method, and an 8-bit A-law method. The nonspeech test signals were generated in these various ways to verify the generalization performance of the deep neural networks, which were trained using only the nonspeech signals improperly decoded with the 8-bit unsigned method. A normal speech signal can be identified only when it has the correct WAV header; therefore, we hypothesize only one encoding method for the speech signals. When the improperly decoded signals generated for the experiment were played back, noises similar to white noise were heard. Considering this characteristic, we conducted an additional experiment to identify speech and nonspeech signals with white noise added; for this experiment, we added white noise at signal-to-noise ratios of 10 dB and 0 dB to the speech signals in training and test. After training the deep neural network decision system with speech and nonspeech signals, we fed 5-s signals into the system to identify whether the test signals were speech or nonspeech. A 25-ms window with a 10-ms shift was applied to the signals, and 40-dimensional mel-filter-bank features were extracted. The features fed to the deep neural network were generated by concatenating the features extracted from 11 windows, yielding a 440-dimensional feature vector. The deep neural network was organized as a feed-forward network consisting of an input layer, five hidden layers, and an output layer. Each hidden layer included 512 nodes, each node was activated by the rectified linear unit (ReLU) function (Nair and Hinton, 2010), and the batch normalization method (Ioffe and Szegedy, 2015) was applied. The two nodes in the output layer represent speech and nonspeech, respectively, and are activated by the softmax function. The deep neural network was trained repeatedly to minimize the negative log likelihood (NLL) on the training data, using the Adam algorithm (Kingma and Ba, 2015) with the learning rate set to 0.01. The flow of the speech and nonspeech decision is described in Fig. 7.

Fig. 7. Flow of the speech or nonspeech decision process.
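The following tf.keras sketch reflects the architecture described above (a 440-dimensional input, five hidden layers of 512 ReLU units with batch normalization, and a two-node softmax output trained with Adam at a learning rate of 0.01). The ordering of batch normalization and activation, and the commented training call, are illustrative assumptions rather than the authors' exact implementation.

from tensorflow.keras import layers, models, optimizers

def build_speech_nonspeech_dnn(input_dim=440, hidden=512, n_hidden_layers=5):
    """Feed-forward speech/nonspeech classifier described in the text."""
    model = models.Sequential()
    for i in range(n_hidden_layers):
        if i == 0:
            model.add(layers.Dense(hidden, input_shape=(input_dim,)))
        else:
            model.add(layers.Dense(hidden))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation("relu"))
    model.add(layers.Dense(2, activation="softmax"))   # speech vs. nonspeech
    model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
                  loss="categorical_crossentropy",     # negative log likelihood
                  metrics=["accuracy"])
    return model

# model = build_speech_nonspeech_dnn()
# model.fit(train_features, train_labels_onehot, epochs=10, batch_size=256)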

Experiment design of the audio file identification

For the audio file identification experiment, several kinds of files were used: 8-bit WAV audio files, 16-bit WAV audio files, MP3 audio files, and non-audio files. The non-audio files were included to check whether files other than audio files can be excluded in the identification process. To generate audio data compatible with the objectives of the experiment, RSR2015 (Larcher et al., 2012) was processed and used. The 8-bit and 16-bit WAV files were used to test the system's ability to identify audio files that have the same format but different encoding options, and the MP3 files were used to verify its ability to identify compressed audio files. The non-audio files included Microsoft Word files, PDF files, text files, and image files in JPEG format. The image files were taken from a part of the ImageNet (Russakovsky et al., 2015) data, which are widely used in image recognition. To generate the Word, PDF, and text files, the Wiki-Reading (Hewlett et al., 2016) data were partially used: for each article, a Word file, a PDF file, and a text file were generated. For training, 240 MB each of MP3, 8-bit WAV, and 16-bit WAV files and 60 MB each of Word, PDF, text, and image files were prepared, so that each of the three audio classes and the combined non-audio class had 240 MB of training data. For the tests, 80 MB of data were prepared for each of the four classes. Table 1 shows the data organization.

To identify the audio files, the short-term and long-term dependencies found in the audio signals were modeled using LSTMs. The amount of data entered into the LSTM-based model at one time was fixed at 1024 bytes, while the input sequence length and the unit size were varied to create different experimental conditions. For example, if the unit size is set to 32 bits (4 bytes), the sequence length is 256 (1024/4 = 256). Evaluation was performed on segments of 1024 bytes; since each segment is taken from a single file, different kinds of files do not occur in one sequence. The accuracies of the identifiers were calculated from the identification results for the individual sequences. In addition, by altering the number of LSTM layers from one to five and the number of cells per layer from 5 to 100, a suitable model for audio file encoding identification was explored. Fig. 8 shows an example of the flow of one of the experiments, with two layers, 20 cells, and a unit size of 2 bytes. The LSTM model was trained with the RMSprop algorithm (Hinton et al., 2012b) with the learning rate fixed at 0.001.
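A corresponding tf.keras sketch of the identification model is given below for one configuration from the text: two LSTM layers of 20 cells, a 2-byte unit (hence 512 time steps of 16-dimensional binary vectors), four output classes, and RMSprop with a learning rate of 0.001. Details such as layer ordering and the commented training call are illustrative assumptions.

from tensorflow.keras import layers, models, optimizers

def build_file_id_lstm(seq_len=512, unit_bits=16, n_cells=20,
                       n_layers=2, n_classes=4):
    """LSTM classifier over binary-vector sequences from 1024-byte segments.

    The four classes are 8-bit WAV, 16-bit WAV, MP3, and non-audio.
    """
    model = models.Sequential()
    for i in range(n_layers):
        kwargs = {"return_sequences": i < n_layers - 1}
        if i == 0:
            kwargs["input_shape"] = (seq_len, unit_bits)
        model.add(layers.LSTM(n_cells, **kwargs))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# sequences: (num_segments, 512, 16) arrays such as those produced by the
# block_to_binary_sequence sketch above.
# model = build_file_id_lstm()
# model.fit(sequences, labels_onehot, epochs=10, batch_size=128)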

Table 1. Data organization of the audio file identification experiment.

Category         Format      Source        Size
Audio files      8-bit WAV   RSR2015       Approximately 320 MB (about 6 h of audio)
Audio files      16-bit WAV  RSR2015       Approximately 320 MB (about 3 h of audio)
Audio files      MP3         RSR2015       Approximately 320 MB (about 30 h of audio)
Non-audio files  JPEG        ImageNet      Approximately 80 MB (about 1000 images)
Non-audio files  PDF         Wiki-Reading  Approximately 80 MB (about 24,000 files)
Non-audio files  Word        Wiki-Reading  Approximately 80 MB (about 2216 files)
Non-audio files  Text        Wiki-Reading  Approximately 80 MB (about 27,000 files)

Fig. 8. Example of the audio file identification.

Experimental results

Table 2 shows the results of the speech or nonspeech decision experiment. We investigated the system in two cases: without white noise (W/O WN) and with white noise (W/WN) added to the training data. In the without-white-noise case, the decision accuracy decreases dramatically because the improperly decoded signals are confounded with white noise. In the with-white-noise case, where the system is trained with noisy data, the decisions are accurate and the system distinguishes the improperly decoded signals from white noise. Based on these findings, we verified that the speech/nonspeech decision can be performed successfully even when the waveform is decoded with an unknown (incorrect) method, and that nonspeech signals and white noise can be distinguished with high accuracy even though they have similar characteristics.

Table 2. Results of the speech or nonspeech decision. The columns labeled "W/O WN" and "W/WN" show the results when white-noise insertion in the training data is excluded or included, respectively. Each row shows the accuracy for a given test-data condition; for example, the row labeled "+10 dB white noise" gives the result when white noise was inserted at 10 dB SNR into the test data.

Data type                                   Accuracy (%)
                                            W/O WN   W/WN
Matched     16 kHz, 16-bit, mono            100.0    100.0
Un-matched  +10 dB white noise               88.3    100.0
            +0 dB white noise                41.7     96.2
            +8-bit unsigned decoding        100.0    100.0
            +16-bit big-endian decoding     100.0    100.0
            +8-bit μ-law decoding           100.0    100.0
            +8-bit A-law decoding           100.0     99.6

Fig. 9 presents the results of the audio file identification experiments conducted with different hyperparameter values in the LSTM-based models. Graph (a) shows how the results change with the number of layers; in these experiments, the unit size was fixed at 48 and the number of cells at 20. High accuracies are evident irrespective of the number of layers, and the most stable performance is observed with two layers. Graphs (b) and (d) show the results after altering the number of cells per layer: graph (b) shows the accuracy per epoch, and graph (d) the highest accuracy over all epochs. In these experiments, the unit size was fixed at 48 and the number of layers at two. The results indicate that once a sufficient model size is reached, excellent performance can be expected, and increasing the number of cells considerably enhanced the identification performance. Based on these two experiments, we verified that even when a large model with more layers and cells is generated, overfitting to the training data does not occur, because high accuracy was achieved on the unseen test data. Graphs (c) and (e) show the results for different unit sizes: graph (c) gives the accuracy per epoch, and graph (e) the highest accuracy over all epochs. As the amount of data entered in these experiments is constant, the sequence length becomes proportionally shorter as the unit size increases; the number of layers and cells were fixed at 2 and 20, respectively. The results show that when the unit size is too small, the amount of data in one unit is too small and the sequence length increases, leading to a low recognition rate. We interpret this as a symptom of the vanishing gradient problem: when the sequence is too long, the vanishing gradient problem is aggravated while training the recurrent layers with backpropagation through time, and learning apparently fails (Goodfellow et al., 2016). However, when the unit size is 24 or more, the recognition rate is higher than 99%, demonstrating that the encoding methods can be identified reliably. To summarize, when the number of cells was more than 15 and the unit size more than 24, a high recognition rate was achieved, and overfitting to the training data did not occur despite increasing the model size. We can also conclude that the proposed method can be applied to fragmented files, because the results were obtained from files fragmented into 1-KB segments, as shown in Fig. 10; this configuration takes into account damaged audio files that appear discontinuously.

Case studies

This section introduces a case study to elucidate the audio file identification method and its application among the proposed file recovery methods. We hypothesized a case in which actual audio-file recovery is performed using the identification method and generated a file for restoration accordingly. First, a non-audio file of adequate size was prepared. Next, a certain section of the file was deleted, and a segment of an audio file with a corrupted header was inserted into the deleted section; a segment of a 16-bit WAV file and a segment of an MP3 file were added to different sections. Thus, an example file including the damaged audio data was created. The visualized structure of the file is given in Fig. 11 ("case file"). Although the audio segments are included in the example file, it is not possible to predict their locations using file-carving methods alone, and even if a segment could be located, the audio file could not be identified because the header information is damaged.

Fig. 9. Accuracies of the audio file type identification experiments using the LSTM. (a) Identification accuracy based on the number of layers and learning epochs. (b) Identification accuracy based on the number of nodes per layer and learning epochs. (c) Identification accuracy based on the lengths of input units and learning epochs. (d) Maximum identification accuracy based on the number of nodes per layer. (e) Maximum identification accuracy based on the size of the input unit.

Fig. 10. Flow of the audio file encoding type identification in the experiments.

After entering the example file into the trained LSTM-based model, the results shown in the figure ("predicted label") were produced. The identification results are obtained for every 1-KB recognition unit of the trained LSTM. Comparing the "true label" with the "predicted label," proper identification was achieved in most sections, even though the identification was performed on very small units. Some sections were misclassified; however, the contiguous misclassified sections did not exceed 1 KB. Considering the average length of typical media files, and by taking only contiguous sections of more than 5 KB with the same classification result, all sections were expected to be accurately identified. Therefore, the speech segments could be restored by aggregating the 16-bit WAV sections, generating and adding a header, and decoding the MP3 sections based on the identification results. Thus, in terms of application, when a huge amount of data blocks is provided as legal evidence, the proposed LSTM model enables partial automation of the recovery process without a manual investigation of all the data by an expert.
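A sketch of this aggregation rule is given below: per-segment predictions (one per 1-KB unit) are merged into contiguous runs, and only runs of at least five segments (5 KB) with the same predicted class are kept as recovered regions. The function name, label strings, and threshold variable are illustrative and are not taken from the released code.

from itertools import groupby

def contiguous_regions(segment_labels, min_run=5, ignore_label="non-audio"):
    """Merge per-1-KB predictions into contiguous regions of >= min_run segments."""
    regions = []
    start = 0
    for label, run in groupby(segment_labels):
        length = len(list(run))
        if label != ignore_label and length >= min_run:
            # Offsets are in 1-KB segments; multiply by 1024 for byte offsets.
            regions.append((label, start, start + length))
        start += length
    return regions

# Hypothetical predictions for a 12-KB block:
preds = ["non-audio"] * 3 + ["16-bit WAV"] * 6 + ["MP3"] * 1 + ["non-audio"] * 2
print(contiguous_regions(preds))   # [('16-bit WAV', 3, 9)]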


Fig. 11. Example file generation process and organization for the case study, and the result of the identification method.

Fig. 12. Example file generation process and organization for the case study with noisy speech, speech in languages different to those in the training data (training: English, case: Korean) and non-speech (classical music without voice, “Symphony No. 5, Ludwig van Beethoven, first movement”), and the result of the identification method.

We additionally conducted case studies to assess the generalization performance of the proposed method on out-of-domain data. To achieve this goal, we designed two additional cases. The first additional case was designed to confirm the generalization performance under changes in the internal content of the audio files. Its case file includes audio files of noisy speech, speech in a language different from that of the training data (training: English, case: Korean), and non-speech audio. The noisy speech was generated by adding white noise to the clean speech of the previous case, and for the audio without speech we took a part of a classical music recording (Ludwig van Beethoven, Symphony No. 5, first movement). Fig. 12 shows the case file, the true labels, and the predicted labels. Comparing the "true label" with the "predicted label," all audio files were detected even though the internal contents of the audio files had been changed. Based on these results, we expect the proposed method to be robust to changes in the internal contents of audio files.

The second additional case was designed to confirm the behavior of the proposed method on audio files of unseen formats. In particular, we tested the model trained with 16-bit 16 kHz WAV files on a 16-bit 8 kHz WAV file and a 32-bit 16 kHz WAV file, and we tested the model trained with 11 kbps MP3 files on a 20 kbps MP3 file and a 32 kbps MP3 file. Fig. 13 shows the case file constructed using audio files of various formats, the true labels, and the predicted labels. Comparing the "true label" with the "predicted label," proper identification was achieved only on the 16-bit 8 kHz WAV file. These results show that it is difficult for the proposed method to identify the format of an audio file if the file structure is changed by the encoding options, even when the audio files are encoded with the same MP3 method. However, this limitation could be overcome by generating and training on additional audio files of the unseen structures.

Fig. 13. Example file generation process and organization for the case study with unseen audio formats, and the result of the identification method.


Conclusion

It is difficult to restore damaged audio files using the conventional file-carving method. In this paper, we proposed recovery methods that infer the damaged information from the files themselves, applying deep neural networks to develop them. Experiments were conducted to determine whether deep neural networks could perform the given tasks, specifically the tasks that are essential for inferring the lost data but too difficult and time-intensive to process manually. We proposed two ways of applying deep neural networks to recover speech audio files. The first recovers a damaged speech file by inferring the header information, using a deep neural network to decide whether the decoded signals are speech or nonspeech. The second identifies the audio formats and decoding types of blocks from damaged files with a deep neural network, without decoding the audio. In addition, we provide the code for reproducibility of the proposed method.1 Since the two suggested methods have different application cases, and both have advantages and disadvantages, their effects can be synergistic when used together. Experiments were designed and conducted to verify the applicability and performance of the proposed approach. Moreover, we confirmed through a case study applying the proposed method to data blocks in 1-KB units that it can be applied to fragmented files. The findings validate the effectiveness of the deep neural network approach and suggest its potential for developing more advanced audio-file recovery methods.

1 https://github.com/hsss/Automated-Recovery-of-Damaged-Audio-Files-Using-Deep-Neural-Networks

Acknowledgement

This work was supported by the research service of the Supreme Prosecutors' Office (research title: study on recovery methods for damaged audio files).

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., et al., 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.
Chollet, F., et al., 2015. Keras.
Fant, G., 1971. Acoustic Theory of Speech Production: With Calculations Based on X-Ray Studies of Russian Articulations. Walter de Gruyter.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press.
Hewlett, D., Lacoste, A., Jones, L., Polosukhin, I., Fandrianto, A., Han, J., Kelcey, M., Berthelot, D., 2016. WikiReading: a novel large-scale language understanding task over Wikipedia. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, vol. 1 (Long Papers), pp. 1535–1545.
Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al., 2012a. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, 82–97.
Hinton, G., Srivastava, N., Swersky, K., 2012b. Neural Networks for Machine Learning, Lecture 6a: Overview of Mini-Batch Gradient Descent.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9, 1735–1780.
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456.
Kingma, D.P., Ba, J.L., 2015. Adam: a method for stochastic optimization. In: International Conference on Learning Representations.
Larcher, A., Lee, K.A., Ma, B., Li, H., 2012. RSR2015: database for text-dependent speaker verification using multiple pass-phrases. In: Thirteenth Annual Conference of the International Speech Communication Association.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
Poisel, R., Tjoa, S., Tavolato, P., 2011. Advanced file carving approaches for multimedia files. JoWUA 2, 42–58.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252.