Digital Investigation 30 (2019) 117–126
Automated recovery of damaged audio files using deep neural networks
Hee-Soo Heo a, Byung-Min So b, Il-Ho Yang a, Sung-Hyun Yoon a, Ha-Jin Yu a, *
a School of Computer Science, College of Engineering, University of Seoul, 163 Siripdae-ro, Dongdaemun-gu, Seoul, 02504, Republic of Korea
b Supreme Prosecutors' Office, 157 Banpo-daero, Seocho-gu, Seoul, 06590, Republic of Korea
Article history: Received 25 April 2018; Received in revised form 16 July 2019; Accepted 31 July 2019; Available online 1 August 2019

Abstract
In this paper, we propose two methods to recover damaged audio files using deep neural networks. The presented audio file recovery methods differ from the conventional file-carving-based recovery method because they restore lost data, which are difficult to recover with the conventional method. This research suggests that recovery tasks, which are essential yet very difficult or very time consuming, can be automated with the proposed recovery methods using deep neural networks. We apply feed-forward and long short-term memory neural networks to the tasks. The experimental results show that deep neural networks can distinguish speech signals from non-speech signals and can also identify the encoding methods of audio files at the level of bits. This leads to successful recovery of damaged audio files that are otherwise difficult to recover using conventional file-carving-based methods.
Keywords: Audio files; Automated recovery; Deep neural networks; File carving; Long short-term memory
Introduction

Audio recording devices, including smartphones, have become widely adopted, and providing audio files as evidence in court has accordingly become more common. Audio files submitted as legal evidence usually go through a conventional validation process, whereby investigators listen to and identify the contents and check whether the files are counterfeit in order to establish a legal case. However, audio files collected through digital devices such as smartphones can be deleted, whether maliciously or because of a lack of storage on the device. For deleted files to qualify as legal evidence, the audio files must be restored from the storage where the deletion occurred and the data must be validated. In a typical file-recovery environment, file carving, a method to restore deleted files in the file system, has been widely adopted and applied (Poisel et al., 2011). However, the file-carving method often results in incomplete recovery of audio files, which then cannot be played. For instance, after an audio file is deleted from a file system and overwriting takes place, the data in the overwritten region cannot be restored, which prevents complete recovery of the file. Moreover, if the damaged block is essential for playing the audio file (e.g., the header), the file cannot be played at all on account of the partial yet critical damage. Therefore, to restore damaged audio files, we should devise a new approach to recovery, one that infers the lost data from the data that remain in the file. When a file goes through such a complete recovery process, the process should successfully recover the lost data that the conventional methods cannot restore. The focus of the present research is the application of deep neural networks to automate tasks that are vital yet unsuitable for manual processing owing to their difficulty and the time required.

The remainder of this paper is organized as follows. Section 2 explains the conventional file-carving method. Section 3 outlines the application of deep neural networks to the present objective. Sections 4 and 5 present the experiments and results verifying the accuracy of the proposed deep neural network methods, and we conclude the paper in Section 6.

File-carving methods
* Corresponding author.
E-mail addresses: [email protected] (H.-S. Heo), [email protected] (B.-M. So), [email protected] (I.-H. Yang), [email protected] (S.-H. Yoon), [email protected] (H.-J. Yu).
https://doi.org/10.1016/j.diin.2019.07.007
Most existing file-recovery methods are based on file carving, which exploits the structures and contents of the files deleted from the file system. Fig. 1 illustrates the full recovery process of a file-carving method.
Fig. 1. Flow of file recovery based on file carving. (a) Physical storage and metadata are generated in the file system (“Saving ‘Audio.mp3’”). (b) Only the record of the file system is removed; the actual data remain (“Removing ‘Audio.mp3’”). (c) Investigating the storage device for file carving. (d) Metadata are re-generated based on the file-carving result.
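As background, the following is a minimal, in-memory sketch of the kind of signature scan a carving tool performs when file-system metadata are gone: it searches raw storage data for the RIFF/WAVE magic bytes of a WAV file. The function name and the example image path are illustrative assumptions; production carving tools stream the image, handle many formats, and validate far more of the file structure.

```python
def find_wav_candidates(image_bytes):
    """Scan raw storage data for RIFF/WAVE signatures (simplified sketch)."""
    offsets, start = [], 0
    while True:
        hit = image_bytes.find(b"RIFF", start)
        if hit < 0:
            break
        # In a WAV header, the "WAVE" form type follows 8 bytes after "RIFF".
        if image_bytes[hit + 8:hit + 12] == b"WAVE":
            offsets.append(hit)
        start = hit + 4
    return offsets

# Illustrative usage:
# with open("disk.img", "rb") as f:
#     print(find_wav_candidates(f.read()))
```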
The file system is the component that manages all the files used by users. When a user saves a file, the file system generates metadata containing information such as the physical location of the saved file and the creation time; this prevents other files from overwriting the same location, as shown in Fig. 1a. When a user deletes a file, the file system deletes the previously generated metadata and changes its settings so that other files may overwrite the location (Fig. 1b). File carving is the process of reading the deleted files by investigating the saved locations, even when the metadata are no longer available because the user deleted the file (Fig. 1c). Thus, file carving restores access to files deleted from the file system; it is not an actual method for restoring lost information. A recovery method of this kind can completely recover files only when they have not been overwritten. For instance, if a WAV audio file was deleted and its header, which is necessary for playback, was overwritten, the file would be difficult to play back even after it was recovered with file carving. Fig. 2 illustrates an example of damaged files in the file-carving process. When a file is damaged or overwritten, it is necessary to use a method that can infer and recover the damaged data and thus differs from the conventional method.

Proposed recovery method based on deep neural networks

A WAV file with a corrupted header, which prevents the file from playing, can be fixed with the proposed recovery method, which infers the damaged information from data other than the header (i.e., the encoded signals). This recovery process is different from that of the existing file-carving method and addresses problems that the existing method cannot solve. Nevertheless, the proposed process requires tasks that are not suitable to perform manually because they are challenging and time-intensive.
We therefore automate the tasks with deep neural networks and accordingly propose two new recovery methods.

Inference of header information based on speech or non-speech decision

The following process can recover an audio file with a corrupted header. First, the header information that is expected to exist in the damaged audio file is generated. For example, when bit-rate information must exist in the header of the audio file, header information for all possible bit rates that could have been used in the audio file should be generated and tested. Based on each generated header, the audio file is decoded, and it is determined whether the decoded signal is speech or nonspeech. Herein, speech signals refer to signals that can be identified as normal sound because the generated header information matches the original header before the damage. Nonspeech signals refer to improperly decoded signals whose generated header information differs from the original header. Proper decoding of a speech signal is not possible unless the header information matches the one used in the encoding process. Thus, we have to generate header information arbitrarily until the original speech signals are decoded. When normal audio signals can be identified in this process, the lost header information is assumed to have been properly inferred. However, if a person performs these tasks, he or she must repeatedly generate different headers, decode, and listen to the decoded audio for all possible permutations of the header parameters. As the possible cases are too numerous for this endeavor, considering the actual encoding methods of audio files, it is not possible to recover all the files in a feasible amount of time.
Fig. 2. Example of cases in which it is difficult to recover the file completely using the conventional file-carving method. (a) Unallocated file before file carving (Fig. 1b). (b) Another file that overwrote part of the unallocated file. (c) Damaged and unallocated file resulting from the other file being written. (d) Damaged file after file carving.
In the proposed approach, we apply deep neural networks to this process to distinguish speech from nonspeech signals. In speech signals containing actual human voices, formant features are expected to be found in the frequency band; formants are the spectral peaks of the sound spectrum that distinguish specific vowels (Fant, 1971). In nonspeech signals, by contrast, characteristics of white noise are expected to appear. Using these features, deep neural networks can classify decoded signals as speech or nonspeech. The networks are organized simply as feed-forward neural networks. The detailed tasks in this undertaking are as follows. First, decoded signals are divided into small frame units, and the major features within the frequency band of each frame are extracted. To feed the extracted features into the deep neural networks, features from multiple frames are concatenated. The deep neural networks are trained to classify whether a signal is closer to speech or to nonspeech by performing binary classification on the input features. Finally, after aggregating the frame-level results to utterance units, the decision whether a decoded signal is speech or nonspeech is made. The overall process is illustrated in Fig. 3. This structure is modeled on a speech recognition system using deep neural networks (Hinton et al., 2012a): a signal is divided into small frames, frame-level features such as filter bank outputs are extracted, and multiple frame-level features are concatenated and input to the DNN to determine whether the signal is a human voice.
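As a concrete illustration of the header-inference loop described above, the sketch below builds candidate WAV (RIFF) headers over a small parameter grid, decodes the recovered payload under each hypothesis, and keeps the header whose decoded signal a speech/nonspeech classifier rates as most speech-like. The parameter grid, the helper names, and the score_speech callback are illustrative assumptions; they are not the authors' released code.

```python
import struct
import numpy as np

def make_wav_header(data_len, sample_rate, bits_per_sample, channels):
    """Build a canonical 44-byte RIFF/WAVE header for a raw PCM payload."""
    byte_rate = sample_rate * channels * bits_per_sample // 8
    block_align = channels * bits_per_sample // 8
    return (b"RIFF" + struct.pack("<I", 36 + data_len) + b"WAVE"
            + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels, sample_rate,
                                    byte_rate, block_align, bits_per_sample)
            + b"data" + struct.pack("<I", data_len))

def decode_pcm(payload, bits_per_sample):
    """Interpret the raw payload as PCM samples scaled to [-1, 1]
    (channel de-interleaving is omitted for brevity)."""
    if bits_per_sample == 16:
        samples = np.frombuffer(payload[:len(payload) // 2 * 2], dtype="<i2")
        return samples.astype(np.float32) / 32768.0
    # 8-bit WAV data are unsigned.
    return (np.frombuffer(payload, dtype=np.uint8).astype(np.float32) - 128.0) / 128.0

def infer_header(payload, score_speech):
    """Grid search over header hypotheses; score_speech(signal, sr) stands in
    for the speech/nonspeech DNN and returns a speech probability."""
    best = None
    for sr in (8000, 16000, 44100, 48000):
        for bits in (8, 16):
            for channels in (1, 2):
                signal = decode_pcm(payload, bits)
                score = score_speech(signal, sr)
                header = make_wav_header(len(payload), sr, bits, channels)
                if best is None or score > best[0]:
                    best = (score, header)
    return best  # (speech probability, inferred header bytes)
```

In practice the grid would cover exactly the parameters that the target container can express, and the highest-scoring header would be prepended to the payload before playback.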
Identification of audio files based on long short-term memory

Audio signals are data that store sound waves sampled and quantized at fixed intervals. Therefore, each sample in an audio signal does not exist independently; its value depends on the amplitudes and frequencies of the sound waves. Moreover, dependencies emerge between samples. Two examples of such dependencies are as follows. Considering the samples in the short term, the dynamic range between neighboring samples is limited to some extent, so a specific sample can be predicted by examining the samples around it; hence, short-term dependencies do exist.
Additionally, since the sample values do not increase or decrease continually over the long term, but rather rise and fall cyclically, long-term dependencies also exist. When audio signals with such characteristics are encoded and saved, the encoded data may also show dependencies that follow the audio signals. Specifically, a WAV file with the audio signal saved in a 16-bit format will have the aforementioned dependencies in 2-byte units, whereas a WAV file with the audio signal stored in an 8-bit format will have dependencies in 1-byte units. Such dependencies between time steps can be modeled by long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). An LSTM defines a cell that consists of an input gate, a forget gate, and an output gate. The cell has an internal state value and calculates its output by passing the input values and internal state through the gates. Its basic structure is depicted in Fig. 4(a). The output value of each cell is calculated as follows. First, the input gate value $i_t$ at time t is calculated as:
$i_t = \sigma(x_t W_i + h_{t-1} U_i + b_i)$,  (1)
where $W_i$ and $U_i$ refer to the weight matrices of the input gate, $x_t$ is the input data at time t, $h_{t-1}$ is the output of the cell at time t-1, $b_i$ is a bias vector, and $\sigma(\cdot)$ denotes the non-linear activation function; we used the sigmoid function. The process of calculating the input gate value is depicted in Fig. 4(b). Likewise, the value of the forget gate, $f_t$, is calculated as:
$f_t = \sigma(x_t W_f + h_{t-1} U_f + b_f)$,  (2)
where $W_f$ and $U_f$ refer to the weight matrices of the forget gate. The process of calculating the forget gate value is depicted in Fig. 4(c). Next, the current state value $C_t$ is calculated using the input-gate and forget-gate values:
$C_t = i_t \tilde{C}_t + f_t C_{t-1}$,  (3)
Here, the current cell's state candidate, $\tilde{C}_t$, is defined as in (4):
$\tilde{C}_t = \tanh(x_t W_c + h_{t-1} U_c + b_c)$,  (4)
where $\tanh(\cdot)$ denotes the hyperbolic tangent function. The process of calculating the cell state is depicted in Fig. 4(d). Lastly, the output value $h_t$ at time t is calculated as:
$o_t = \sigma(x_t W_o + h_{t-1} U_o + b_o)$,  (5)

$h_t = o_t \tanh(C_t)$.  (6)
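For clarity, the following is a minimal NumPy sketch of one forward step of the cell defined by equations (1)–(6); the parameter dictionary, weight shapes, and sigmoid helper are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following equations (1)-(6).
    p maps names such as "W_i", "U_i", "b_i" to NumPy arrays; for an input of
    size d and a cell of size n, W_* is (d, n), U_* is (n, n), b_* is (n,)."""
    i_t = sigmoid(x_t @ p["W_i"] + h_prev @ p["U_i"] + p["b_i"])    # (1) input gate
    f_t = sigmoid(x_t @ p["W_f"] + h_prev @ p["U_f"] + p["b_f"])    # (2) forget gate
    c_hat = np.tanh(x_t @ p["W_c"] + h_prev @ p["U_c"] + p["b_c"])  # (4) state candidate
    c_t = i_t * c_hat + f_t * c_prev                                # (3) new cell state
    o_t = sigmoid(x_t @ p["W_o"] + h_prev @ p["U_o"] + p["b_o"])    # (5) output gate
    h_t = o_t * np.tanh(c_t)                                        # (6) cell output
    return h_t, c_t
```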
Fig. 3. Flow of the speech or nonspeech decision. (a) Damaged WAV file (header is corrupted and playback is not possible). (b) Generating candidate headers for the damaged file (grid search over possible parameters). (c) Decoding the signal based on the generated header information. (d) Speech or nonspeech decision.
As shown in equations (5) and (6), each gate takes values between zero and one in the process of calculating the output value of the cell, and these values decide whether to remember or forget the previous state and whether to pass on the output. The processes of calculating the output gate values and the output are depicted in Fig. 4(e) and Fig. 4(f), respectively. The weights are trained to learn whether to forget or remember the previous state. Thus, one can model the structures in an audio file using an LSTM that can learn both short-term and long-term dependencies. For example, we expected that structural characteristics of each file, such as the average number of 1 bits, would be learned. Audio file modeling using the LSTM proceeds without decoding the audio signals, unlike the aforementioned speech and nonspeech decision process. The detailed steps of LSTM-based audio file identification are the following. First, a target file or a data block is read in a fixed unit, and the file is expressed as binary vectors.
Fig. 4. Simplified structure of the long short-term memory (LSTM) cell (a). Processes of calculating the values of input gate (b), forget gate (c), cell state (d), output gate (e), and output (f).
Herein, we assume that the target file or data block includes an audio file with a corrupted header. Specifically, when reading a file in 2-byte units, vectors consisting of 16 binary values are generated. This process of generating binary vectors from an unknown file is depicted in Fig. 5. The binary vectors can then be sequentially input into a network composed of multiple LSTM layers. Fig. 6 shows the process of inputting binary vectors sequentially and the process of deriving a prediction result from the input data by the LSTM layers. In this figure, each $h_t$ denotes one LSTM cell whose operation is defined by Equations (1)–(6). The prediction results indicate the probabilities that the input sequential data were extracted from each audio file format included in the training data.
Fig. 5. Process of generating binary vectors of length 16 from an unknown file.
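A sketch of the conversion from raw bytes to a sequence of binary unit vectors, as depicted in Fig. 5, is shown below. It assumes a 2-byte unit, giving 16-element vectors; the function name and the truncation of trailing bytes that do not fill a unit are illustrative choices.

```python
import numpy as np

def to_binary_sequence(block, unit_bytes=2):
    """Convert a raw data block into a sequence of binary vectors.
    With a 2-byte unit, each vector holds 16 elements (one per bit)."""
    usable = len(block) - len(block) % unit_bytes        # drop an incomplete tail unit
    data = np.frombuffer(block[:usable], dtype=np.uint8).reshape(-1, unit_bytes)
    bits = np.unpackbits(data, axis=1)                   # shape: (num_units, unit_bytes * 8)
    return bits.astype(np.float32)

# Illustrative usage: a 1024-byte segment becomes 512 sixteen-dimensional vectors.
# with open("unknown.bin", "rb") as f:
#     sequence = to_binary_sequence(f.read(1024), unit_bytes=2)   # shape (512, 16)
```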
Fig. 6. Process of inputting binary vectors sequentially and process of deriving a prediction result from the input data by the LSTM layers.

In particular, a model trained using MP3 files, WAV files, and non-audio files outputs three probabilities, one for each file type from which the input data may have been extracted. Therefore, the second proposed method has the limitation that it cannot identify audio formats that are not included in the training data. Since it is impossible to feed data of indefinite length into the LSTM, we use a fixed number of units per sequence. The unit size and the sequence length are critical factors in the LSTM's performance. In practice, if the unit size is set to 1 byte, the information in a unit is considerably restricted because one audio sample cannot be stored in one unit. Conversely, if the unit size is too large, the sequence length decreases, making it difficult to model the time dependencies sufficiently. The binary vectors generated from each file are then entered into the LSTM, and the LSTM is trained to identify the format of the audio files. Using the trained LSTM, it is possible to identify the audio format (WAV, MP3, etc.) and encoding options (bit rate, compression ratio, etc.) of each target file without any decoding process.

The audio file identification LSTM can be applied to recovering audio files in the following way. The input is a data block generated as a result of the file-carving method or extracted from an audio recorder's storage media. Although the objective is to recover the audio file included in the data block, it is difficult to identify the location of the audio file because the header information is corrupt. However, by applying the audio file identification LSTM, it is possible to determine the locations of the audio files and their decoding methods. Specifically, this can be achieved by dividing the data block into equal sections and checking whether the features of a trained audio format are observed in each section.

Comparison of the proposed two recovery methods

We have proposed two DNN-based recovery methods, which operate in ways that differ from the conventional method. The LSTM-based method can infer the location of a damaged audio file in a data block in a manner similar to the existing file-carving method, and it can also infer the information necessary for recovering or decoding the file. This gives it a cost advantage in the recovery process because both steps can be executed simultaneously, whereas the conventional method must execute them in two stages. However, it has the disadvantage that audio files of all known formats must be available to train the LSTM model in advance. Therefore, once a model is trained, if audio files of a format other than those used for training must be restored, the model has to be retrained with training data of that format. Meanwhile, the method based on speech and nonspeech decisions has the disadvantage that it is only operable on files that have already been processed with the file-carving method. Nonetheless, it is advantageous because a trained model can be reused without retraining; even when an audio file of a different format must be restored, the speech and nonspeech decision operates in the same way, so there is no need to retrain or modify the existing model. Consequently, although the two suggested methods both use deep neural networks, each has benefits and drawbacks. As their use cases differ, one can expect synergistic effects from using the two methods together.

Experiments
In this study, we designed and conducted experiments to verify the performance and application feasibility of the proposed methods using deep neural networks. We assumed contexts in which the existing file-carving method cannot restore the original waveform at all; thus, we did not consider existing restoration systems in the experiments. Construction and training of the deep neural networks were implemented in Keras (Chollet et al., 2015) with a TensorFlow (Abadi et al., 2015) backend.

Experiment design of the speech and nonspeech decision

The speech and nonspeech decision experiment was intended to test whether the model can distinguish improperly decoded nonspeech signals from properly decoded speech signals in given WAV files. The nonspeech signal mentioned here corresponds to the case in Section 3.1, where incorrect header information is inserted into the WAV file in the process of inferring the header information. In this experiment, we used the utterances of adult speakers for training the system and the utterances of child speakers for the performance evaluation. Hyperparameters were determined based on the training set. For the adult speakers' data, we used utterances from the speech data for speaker recognition collected and distributed by the Electronics and Telecommunications Research Institute of Korea. For the children's data, we used utterances of children five to six years of age. Both databases are composed of Korean sentences 2–3 s long. By employing two distinct audio databases for training and test, we intended to make the results independent of the training speakers' age, channel, and phoneme information. For speech signal learning, we used normally decoded signals of single-channel WAV format with 16 kHz sampling and 16-bit quantization. For nonspeech signal learning, we used improperly decoded signals obtained from the above WAV files with an 8-bit unsigned method. For the speech signal evaluation, we used test utterances of the same format as the training set. For the nonspeech signal evaluation, we used improperly decoded signals obtained from the above WAV files with an 8-bit unsigned method, a 16-bit big-endian method, an 8-bit μ-law method, and an 8-bit A-law method.
To verify the generalization performance of the deep neural networks, which are trained using only the nonspeech signals that were improperly decoded with the 8-bit unsigned method, nonspeech signals were generated in various ways for the test. A normal speech signal can be identified only when it has the correct WAV header; therefore, we consider only one encoding method for the speech signals. When playing the improperly decoded signals generated for the experiment, noise similar to white noise was heard. Considering this characteristic, we conducted an additional experiment to identify speech and nonspeech signals with white noise added. For this experiment, we added white noise at signal-to-noise ratios of 10 dB and 0 dB to the speech signals in the training and test sets. After training the deep neural network decision system with speech and nonspeech signals, we fed 5-s signals into the system to identify whether the test signals were speech or nonspeech. A 25-ms window with a 10-ms shift was applied to the signals, and 40-dimensional mel filter bank features were extracted. We generated the input to the deep neural network by concatenating the features extracted from 11 windows, yielding a 440-dimensional feature. The deep neural networks were organized as a feed-forward network consisting of an input layer, five hidden layers, and an output layer. Each of the five hidden layers included 512 nodes, and each node was activated by the rectified linear unit (ReLU) function (Nair and Hinton, 2010). The batch normalization method (Ioffe and Szegedy, 2015) was applied. The two nodes in the output layer indicate speech and nonspeech, respectively, and are activated by the softmax function. We trained the deep neural networks to minimize the negative log likelihood (NLL) on the training data. Learning proceeded using the Adam algorithm (Kingma and Ba, 2015) with the learning rate set to 0.01. The flow of the speech and nonspeech decision is described in Fig. 7.

Experiment design of the audio file identification

For the audio file identification experiment, several kinds of files were used, including 8-bit WAV audio files, 16-bit WAV audio files, MP3 audio files, and non-audio files. The non-audio files were used to check whether it is possible to exclude files other than audio files in the identification process. To generate audio data compatible with the objectives of the experiment, RSR2015 (Larcher et al., 2012) was processed and used. The 8-bit and 16-bit WAV files were used to test the system's ability to identify audio files that have the same format yet different encoding options. MP3 files were used to verify its ability to identify compressed audio files.
The non-audio files included Microsoft Word files, PDF files, text files, and image files in JPEG format. The image files were drawn from a part of the ImageNet (Russakovsky et al., 2015) data, which are widely used in image recognition. To generate the Word, PDF, and text files, Wiki-Reading (Hewlett et al., 2016) data were partially used: for each article in Wiki-Reading, a Word file, a PDF file, and a text file were generated. For training, 240 MB of MP3 files, 8-bit WAV files, and 16-bit WAV files were each prepared, while 60 MB of Word files, PDF files, text files, and image files were each prepared; therefore, 240 MB of data were used for each of the three types of audio files and for the non-audio files. For the tests, 80 MB of data were prepared for each of the four classes. Table 1 shows the data organization. To identify the audio files, the short-term and long-term dependencies found in the audio signals were modeled using LSTMs. The amount of data entered into the LSTM-based model at one time was fixed at 1024 bytes, while the input sequence length and the unit size were varied to present different experimental conditions. For example, if the unit size is set to 32 bits (4 bytes), the length of the sequence is 256 (1024/4 = 256). Evaluation was performed on segments 1024 bytes long; since each segment is constructed by taking a part of one file, different kinds of files do not occur in one sequence. The accuracies of the identifiers were calculated from the identification results for individual sequences. In addition, by altering the number of LSTM layers from one to five and the number of cells per layer from 5 to 100, a suitable model for audio file encoding identification was explored. Fig. 8 shows an example of the flow of one of the experiments, with two layers, 20 cells, and a unit size of 2 bytes. The LSTM model was trained with the RMSprop (Hinton et al., 2012b) algorithm with the learning rate fixed at 0.001.

Experimental results

Table 2 shows the results of the speech/nonspeech decision experiment conducted in this study. We investigated the system in two cases: without white noise (W/O WN) and with white noise (W/WN) added to the training data. In the without-white-noise case, the decision accuracy decreases dramatically because signals containing white noise are confounded with the improperly decoded signals. However, in the with-white-noise case, where the system is trained with noisy data, the decisions are accurate and the system distinguishes the improperly decoded signals from white noise. Based on these findings, we identified that the speech/nonspeech decision could be performed successfully even when the waveform is encoded with an unknown method, and that identification of the nonspeech signals and white noise was possible with high accuracy even though they have similar characteristics.
Fig. 7. Flow of the speech or nonspeech decision process.
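For concreteness, a sketch of the speech/nonspeech classifier configuration described in the experiment design is given below (440-dimensional inputs from 11 concatenated 40-dimensional mel filter bank frames, five 512-node ReLU hidden layers with batch normalization, a two-node softmax output, and Adam with a learning rate of 0.01). The exact ordering of batch normalization relative to the activation and the function name are assumptions; the paper does not specify these details beyond the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_speech_nonspeech_dnn(input_dim=440, hidden=512, depth=5):
    """Feed-forward speech/nonspeech classifier matching the described setup."""
    model = keras.Sequential([keras.Input(shape=(input_dim,))])
    for _ in range(depth):
        model.add(layers.Dense(hidden))
        model.add(layers.BatchNormalization())
        model.add(layers.Activation("relu"))          # ReLU hidden units
    model.add(layers.Dense(2, activation="softmax"))  # speech vs. nonspeech
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.01),
                  loss="categorical_crossentropy",    # negative log-likelihood
                  metrics=["accuracy"])
    return model
```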
Table 1. Data organization of the audio file identification experiment.

Category        | Format     | Source       | Size
Audio files     | 8-bit WAV  | RSR2015      | Approximately 320 MB (about 6 h of audio)
Audio files     | 16-bit WAV | RSR2015      | Approximately 320 MB (about 3 h of audio)
Audio files     | MP3        | RSR2015      | Approximately 320 MB (about 30 h of audio)
Non-audio files | JPEG       | ImageNet     | Approximately 80 MB (about 1000 images)
Non-audio files | PDF        | Wiki-Reading | Approximately 80 MB (about 24,000 files)
Non-audio files | Word       | Wiki-Reading | Approximately 80 MB (about 2216 files)
Non-audio files | Text       | Wiki-Reading | Approximately 80 MB (about 27,000 files)
Fig. 8. Example of the audio file identification.
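For reference, a sketch of an identification model matching the configuration of Fig. 8 is shown below: two LSTM layers of 20 cells each over sequences of 16-bit unit vectors (a 1024-byte segment read in 2-byte units gives a sequence length of 512), four output classes, and RMSprop with a learning rate of 0.001. The class count, their ordering, and the function name are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_format_identifier(seq_len=512, unit_bits=16, cells=20, n_classes=4):
    """Two-layer LSTM identifier over binary unit vectors
    (e.g., classes: 8-bit WAV, 16-bit WAV, MP3, non-audio)."""
    model = keras.Sequential([
        keras.Input(shape=(seq_len, unit_bits)),
        layers.LSTM(cells, return_sequences=True),      # first LSTM layer
        layers.LSTM(cells),                             # second layer, final state only
        layers.Dense(n_classes, activation="softmax"),  # per-format probabilities
    ])
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training would pair sequences such as those from the to_binary_sequence() sketch
# above with one-hot format labels.
```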
Table 2. Results of the speech or nonspeech decision. The columns labeled "W/O WN" and "W/WN" show the experimental results when white noise insertion in the training data is excluded or included, respectively. Each row shows the accuracy for a given condition of the test data; for example, the row labeled "+10 dB white noise" gives the result when white noise was inserted at 10 dB SNR into the test data.

Data type  | Test condition              | Accuracy (%) W/O WN | Accuracy (%) W/WN
Matched    | 16 kHz, 16-bit, mono        | 100.0               | 100.0
Matched    | +10 dB white noise          | 88.3                | 100.0
Matched    | +0 dB white noise           | 41.7                | 96.2
Un-matched | +8-bit unsigned decoding    | 100.0               | 100.0
Un-matched | +16-bit big-endian decoding | 100.0               | 100.0
Un-matched | +8-bit μ-law decoding       | 100.0               | 100.0
Un-matched | +8-bit A-law decoding       | 100.0               | 99.6
Fig. 9 presents the results of the audio file identification experiments conducted with different configurations of the LSTM-based models. Graph (a) shows how the results change as the number of layers is varied; in these experiments, the unit size was fixed at 48 and the number of cells was 20. Irrespective of the number of layers, high accuracies are evident, and the most stable accuracy is observed with two layers. Graphs (b) and (d) show the results of altering the number of cells per layer: graph (b) shows the accuracy at each epoch, and graph (d) shows the highest accuracy over all epochs. In these experiments, the unit size was fixed at 48 and the number of layers was two. The results indicate that, once a sufficient number of LSTM cells was used, excellent performance could be expected.
When the number of cells increased, the identification performance was considerably enhanced. Based on these two experiments, we verified that, even when generating a large model with more layers and cells, the model did not overfit the training data, because a high level of accuracy was achieved on the unseen test data. Graphs (c) and (e) show the results for different unit sizes: graph (c) gives the accuracy at each epoch, and graph (e) gives the highest accuracy over all epochs. As the amount of data entered at one time is fixed, the sequence length becomes proportionately shorter as the unit size increases. The number of layers and the number of cells were fixed at 2 and 20, respectively. The results show that when the unit size is too small, the amount of information in one unit is too limited and the sequence length grows, leading to a low recognition rate. We interpret this as a consequence of the vanishing gradient problem: when the sequence is too long, the vanishing gradient problem is aggravated while training the recurrent layer with backpropagation through time, and learning apparently fails (Goodfellow et al., 2016). However, when the unit size is at least 24, the recognition rate exceeds 99%, showing that the encoding methods can be identified reliably. To summarize the findings of this experiment, when the number of cells was more than 15 and the unit size was more than 24, a high recognition rate was achieved, and overfitting to the training data did not occur despite increasing the model size. We can also conclude that the proposed method can be applied to fragmented files, because the results were obtained from files fragmented into 1-KB segments, as shown in Fig. 10; this configuration accounts for damaged audio files that appear discontinuously.

Case studies

This section introduces a case study to elucidate the audio file identification method and its applications among the proposed file recovery methods. We hypothesized a case in which actual audio-file recovery is performed using the identification method and generated a file for restoration accordingly. First, a non-audio file of adequate size was prepared. Next, a certain section of the file was deleted, and a segment of an audio file with a corrupted header was inserted into the deleted section. A segment of a 16-bit WAV file and a segment of an MP3 file were added to different sections in this way. Thus, an example file including the damaged audio data was created; its structure is visualized in Fig. 11 ("case file"). Although the audio segments are included in the example file, it is not possible to predict their locations using file-carving methods alone, and even if a segment could be located, the audio file could not be identified because the header information is damaged. After entering the example file into the trained LSTM-based model, the findings were produced as shown in the figure ("predicted label").
Fig. 9. Accuracies of the audio file type identification experiments using the LSTM. (a) Identification accuracy based on the number of layers and learning epochs. (b) Identification accuracy based on the number of nodes per layer and learning epochs. (c) Identification accuracy based on the lengths of input units and learning epochs. (d) Maximum identification accuracy based on the number of nodes per layer. (e) Maximum identification accuracy based on the size of the input unit.
Fig. 10. Flow of the audio file encoding type identification in the experiments.
The identification results are obtained for every 1-KB recognition unit of the trained LSTM. Comparing the "true label" with the "predicted label," proper identification was achieved in most sections, even though the identification was performed on very small units. Some sections were misclassified; however, the contiguous misclassified sections did not exceed 1 KB. Considering the average length of typical media files, by taking contiguous sections with the same classification result spanning more than 5 KB, all sections were expected to be accurately identified. Therefore, the speech segments could be restored by aggregating the 16-bit WAV sections, generating and adding a header, and decoding the MP3 section based on the identification results. Thus, in terms of the application of the proposed approach, when a huge amount of data blocks is provided as legal evidence, partial automation of the recovery process based on the LSTM model will be possible without a manual investigation of all the data by an expert.

We conducted additional case studies to address the generalization performance of the proposed method on out-of-domain data. To achieve this goal, we designed two additional cases. The first additional case was designed to confirm the generalization performance under changes in the internal content of the audio files. The case file includes audio files of noisy speech, speech in a language different from that of the training data (training: English, case: Korean), and non-speech audio. The noisy speech was generated by adding white noise to the clean speech of the previous case. For the audio file without speech, we took part of a piece of classical music (Beethoven's Symphony No. 5, first movement). Fig. 12 shows the case file, the true labels, and the predicted labels. Comparing the "true label" with the "predicted label," all audio files were detected, even though the internal contents of the audio files had been changed. Based on these results, we expect the proposed method to be robust to changes in the internal contents of audio files.
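For illustration, a sketch of the aggregation step described above is given below: per-segment predictions (one label per 1-KB unit) are grouped into contiguous runs, and only runs of at least 5 KB are kept. The label strings and the function name are hypothetical.

```python
import itertools

def merge_predictions(labels, unit_kb=1, min_run_kb=5):
    """Group per-unit predictions into contiguous runs and keep long runs only."""
    runs, offset = [], 0
    for label, group in itertools.groupby(labels):
        length = sum(1 for _ in group) * unit_kb
        if length >= min_run_kb:
            runs.append((label, offset, offset + length))  # (label, start KB, end KB)
        offset += length
    return runs

# Illustrative usage with hypothetical labels for eight 1-KB units:
# merge_predictions(["non", "wav16", "wav16", "wav16", "wav16", "wav16", "mp3", "non"])
# -> [("wav16", 1, 6)]
```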
Fig. 11. Example file generation process and organization for the case study, and the result of the identification method.
Fig. 12. Example file generation process and organization for the case study with noisy speech, speech in a language different from that of the training data (training: English, case: Korean), and non-speech audio (classical music without voice, Beethoven's Symphony No. 5, first movement), and the result of the identification method.
The second additional case was designed to confirm the behavior of the proposed method on audio files of unseen formats. In particular, we tested the model trained with 16-bit, 16 kHz WAV files on a 16-bit, 8 kHz WAV file and a 32-bit, 16 kHz WAV file, and we tested the model trained with 11 kbps MP3 files on a 20 kbps MP3 file and a 32 kbps MP3 file. Fig. 13 shows the case file constructed from audio files of various formats, the true labels, and the predicted labels. Comparing the "true label" with the "predicted label," proper identification was achieved only for the 16-bit, 8 kHz WAV file. These results show that the proposed method has difficulty identifying the format of an audio file if the file structure is changed by the encoding options, even when the audio files are encoded with the same MP3 method. However, it should be possible to overcome this limitation by generating audio files with the unseen structures and including them in training.

Conclusion

It is difficult to restore damaged audio files using the conventional file-carving method. In this paper, we proposed recovery methods that infer the damaged information from the remaining data by applying deep neural networks. Experiments were conducted to identify whether the deep neural networks could perform the given tasks, specifically the tasks that are essential for inferring the lost data yet too difficult and time-intensive to process manually.
Fig. 13. Example file generation process and organization for the case study with unseen audio formats, and the result of the identification method.
We proposed two methods of applying deep neural networks to recover speech audio files. The first method recovers a damaged speech file by inferring the header information, using deep neural networks to decide whether the decoded signals are speech or nonspeech. The second identifies the audio formats and decoding types of blocks from damaged files with deep neural networks, without decoding the audio. In addition, we provide the code for reproducibility of the proposed method.1 Since the two suggested methods have different application cases, and both have advantages and disadvantages, the effects of using the two methods can be synergistic. Experiments were designed and conducted to verify the application feasibility and performance of the proposed approach. Moreover, we confirmed that the proposed method can be applied to fragmented files through a case study that applies the method to data blocks of 1-KB units. The findings validated the effectiveness of the deep neural network approach, suggesting its potential for developing more advanced audio-file recovery methods.

Acknowledgement

This work was supported by the research service of the Supreme Prosecutors' Office (research title: study on recovery methods for damaged audio files).

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., et al., 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems.
1 https://github.com/hsss/Automated-Recovery-of-Damaged-Audio-Files-UsingDeep-Neural-Networks.
Chollet, F., et al., 2015. Keras.
Fant, G., 1971. Acoustic Theory of Speech Production: With Calculations Based on X-Ray Studies of Russian Articulations. Walter de Gruyter.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press.
Hewlett, D., Lacoste, A., Jones, L., Polosukhin, I., Fandrianto, A., Han, J., Kelcey, M., Berthelot, D., 2016. WikiReading: a novel large-scale language understanding task over Wikipedia. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1535–1545.
Hinton, G., Deng, L., Yu, D., Dahl, G.E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T.N., et al., 2012a. Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29, 82–97.
Hinton, G., Srivastava, N., Swersky, K., 2012b. Neural Networks for Machine Learning, Lecture 6a: Overview of Mini-Batch Gradient Descent.
Hochreiter, S., Schmidhuber, J., 1997. Long short-term memory. Neural Comput. 9, 1735–1780.
Ioffe, S., Szegedy, C., 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456.
Kingma, D.P., Ba, J.L., 2015. Adam: method for stochastic optimization. In: International Conference on Learning Representations.
Larcher, A., Lee, K.A., Ma, B., Li, H., 2012. RSR2015: database for text-dependent speaker verification using multiple pass-phrases. In: Thirteenth Annual Conference of the International Speech Communication Association.
Nair, V., Hinton, G.E., 2010. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
Poisel, R., Tjoa, S., Tavolato, P., 2011. Advanced file carving approaches for multimedia files. JoWUA 2, 42–58.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115, 211–252.