Procedia Computer Science 36 (2014) 515 – 522

Complex Adaptive Systems, Publication 4
Cihan H. Dagli, Editor in Chief
Conference Organized by Missouri University of Science and Technology, 2014, Philadelphia, PA

An Associative Memorization Architecture of Extracted Musical Features from Audio Signals by Deep Learning Architecture

Tadaaki Niwa a,*, Keitaro Naruse b, Ryosuke Ooe a, Masahiro Kinoshita a, Tamotsu Mitamura a, Takashi Kawakami a

a Hokkaido University of Science, 4-20 Maeda 7 Jo 15 chome, Teine-ku, Sapporo 006-8585, Japan
b University of Aizu, Ikkimachi Turuga, Aizuwakamatsu 965-8580, Japan

Abstract

In this paper, we develop an associative memorization architecture for musical features extracted from time-sequential data of music audio signals. This associative memorization architecture is constructed using a deep learning architecture. The challenging goal of our research is the development of a new composition system that automatically creates new music based on some existing music. How does a human composer make musical compositions or pieces? Generally speaking, a music piece is generated by a cyclic process of analysis and re-synthesis of musical features. This process can be simulated by learning models using Artificial Neural Network (ANN) architectures. The first and critical problem is how to describe the music data, because in those models the description format for the data has a great influence on learning performance and function. Most related works adopt symbolic representations of music data. However, we believe human composers never treat a music piece as a symbol. Therefore, raw music audio signals are input to our system. The constructed associative model memorizes musical features of music audio signals and regenerates the sequential data of that music. Based on experimental results of memorizing music audio data, we verify the performance and effectiveness of our system.

© 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/3.0/). Peer-review under responsibility of scientific committee of Missouri University of Science and Technology.

Keywords: automatic music composition; algorithmic composition; machine learning; deep learning; Restricted Boltzmann Machine

* Corresponding author. Tel.: +81-080-1885-9440. E-mail address: [email protected]

doi:10.1016/j.procs.2014.09.032


1. Introduction

It is still an attractive research field to realize automatic music composition systems using computer science methodologies. In particular, automatic composition that reflects human sensitivity and impression is an exciting research theme. In many such systems, music is generated from human impressions of sounds and words by using an interactive Genetic Algorithm. However, the interactive Genetic Algorithm approach forces a human designer to evaluate each solution and quantify its fitness value. We attempt to realize an intelligent system that creates new music based on some existing music. Systems that create music automatically are called automatic composition systems or algorithmic composition systems. In this paper, we develop an associative memorization architecture for an intelligent algorithmic composition system, which memorizes musical feature sequences from sequences of music audio signal fragments using the Restricted Boltzmann Machine (RBM) [1, 2] and the Conditional RBM (CRBM) [3, 4]. This algorithmic composition system extracts common musical characteristics, or common musical features, and creates new, different music having such features. This paper illustrates the important former part of the whole system.

How does a human composer make musical notes or musical pieces? Generally speaking, a music piece is generated by a cyclic process of analysis and re-synthesis of musical features. In order to realize an intelligent system that creates music automatically, it is necessary to provide both a music generation algorithm and a music evaluation algorithm. We first focus on the music generation algorithm, which produces the music to be evaluated.

As related work, the first attempt at automatic music creation with computer science methodology was by L. Hiller and L. Isaacson: the Illiac Suite, created on the ILLIAC, an early computer. Since then, many algorithmic composition systems have been realized with computer science methodologies. Experiments in Musical Intelligence (EMI) [5] is an intelligent algorithmic composition system that creates high-quality new music based on some existing music. EMI creates music by predicting motifs sequentially in accordance with the contexts of the musical motifs. A motif is a minimal fragment that carries musical meaning. The motifs and their contexts are extracted from existing music and accumulated in a database in EMI. In more recent work, G. Bickerman et al. [6] attempted the creation of jazz melodies from chord progressions using Deep Belief Networks [7], and N. Boulanger-Lewandowski et al. [8] attempted the creation of new symbolic music using the RNN-RBM. Those systems were realized using deep learning, a methodology for training deep (multi-layered) artificial neural networks that emerged in the field of machine learning. So far, algorithmic composition systems have been realized using symbolic music: coded music scores, MIDI, or custom formats. The first critical problem is how to describe the music data. In a model with an Artificial Neural Network (ANN) architecture, the description format for the music data has a great influence on learning performance and function. Most related works adopt symbolic representations of music data, such as music score data or MIDI data. However, we believe human composers never treat a music piece as a symbol.
Therefore, we attempt to model high-dimensional sequences of music directly from audio signals using a deep learning architecture. Architectures that learn from raw audio signals and generate raw audio signals have not been studied much. The associative memorization architecture memorizes the musical feature sequence of audio signal fragments and regenerates the sequence of the music. With our memorization architecture using the RBM and CRBM, music memorization experiments are carried out. Based on the experimental results of memorizing music audio data, we verify the performance and effectiveness of our associative memorization architecture.

2. Restricted Boltzmann Machine

2.1. Restricted Boltzmann Machine

A Restricted Boltzmann Machine (RBM) is an energy-based undirected graphical model that defines a probability distribution over a vector of visible variables (visible units) v and a vector of hidden variables (hidden units) h. An RBM has connections from v to h and from h to v, as shown in Figure 1 (a). In this subsection, we consider the case where v and h are binary vectors of 0s and 1s; this model is called the Binary-Binary RBM. The joint probability of v and h is defined as follows.


$$ p(v, h) = \frac{1}{Z} e^{-E(v,h)} \tag{1} $$

$$ Z = \sum_{v,h} e^{-E(v,h)} \tag{2} $$

Here Z is the normalization constant (partition function), and E is the energy function defined as follows.

$$ E(v, h) = -v^T c - h^T b - v^T W h \tag{3} $$

where c is the bias vector of the visible units, b is the bias vector of the hidden units, and W is the weight matrix between the visible units v and the hidden units h. The probability p(v) that the RBM assigns to a visible vector v is given by summing over all possible hidden vectors.

$$ p(v) = \frac{1}{Z} \sum_{h} e^{-E(v,h)} \tag{4} $$

The activations of the visible units v and the hidden units h are calculated from the conditional probabilities p(v | h) and p(h | v):

$$ p(v \mid h) = \mathrm{sigm}(c + h W^T) \tag{5} $$

$$ p(h \mid v) = \mathrm{sigm}(b + v W) \tag{6} $$

where sigm(x) is the standard sigmoid function 1 / (1 + exp(−x)).
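As an illustration only (a minimal NumPy sketch, not the authors' implementation), the conditional activations of Eqs. (5) and (6) can be computed as follows, assuming v and h are row vectors and W has shape (number of visible units, number of hidden units):

```python
import numpy as np

def sigm(x):
    """Standard sigmoid function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def p_v_given_h(h, W, c):
    """Eq. (5): visible activation probabilities, sigm(c + h W^T)."""
    return sigm(c + h @ W.T)

def p_h_given_v(v, W, b):
    """Eq. (6): hidden activation probabilities, sigm(b + v W)."""
    return sigm(b + v @ W)
```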

2.2. Contrastive Divergence Learning Algorithm

The model parameters θ = {W, b, c} of an RBM are estimated with the k-step Contrastive Divergence (CD-k) [9] learning algorithm for log-likelihood learning. In general, the model parameters that maximize the log-likelihood are determined by a stochastic gradient method. The gradients of the log-likelihood are given as follows from the energy function E(v, h) of the RBM:

$$ \frac{\partial \ln L(\theta \mid v)}{\partial w_{ij}} = p(h_i = 1 \mid v)\, v_j - \sum_{v} p(v)\, p(h_i = 1 \mid v)\, v_j \tag{7} $$

$$ \frac{\partial \ln L(\theta \mid v)}{\partial c_j} = v_j - \sum_{v} p(v)\, v_j \tag{8} $$

$$ \frac{\partial \ln L(\theta \mid v)}{\partial b_i} = p(h_i = 1 \mid v) - \sum_{v} p(v)\, p(h_i = 1 \mid v) \tag{9} $$


where

$$ p(h_i = 1 \mid v) = \mathrm{sigm}\Big(b_i + \sum_j v_j w_{ij}\Big) \tag{10} $$

However, when the number of dimensions of v is large, the second term cannot be calculated exactly. The CD-k algorithm is a fast algorithm that approximates the gradients of the log-likelihood using Gibbs sampling. Since in an RBM the visible variables v and the hidden variables h are conditionally independent given each other, the gradients can be approximated by k-step Gibbs sampling simply and quickly as follows:

$$ \frac{\partial \ln L(\theta \mid v)}{\partial w_{ij}} \approx p(h_i = 1 \mid v)\, v_j - p(h_i = 1 \mid v^{(k)})\, v_j^{(k)} \tag{11} $$

$$ \frac{\partial \ln L(\theta \mid v)}{\partial c_j} \approx v_j - v_j^{(k)} \tag{12} $$

$$ \frac{\partial \ln L(\theta \mid v)}{\partial b_i} \approx p(h_i = 1 \mid v) - p(h_i = 1 \mid v^{(k)}) \tag{13} $$
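A minimal sketch of one CD-k update built directly from Eqs. (11)–(13) might look as follows; this is illustrative only, and the learning rate lr and the sampling helper are our own assumptions, not parameters reported in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, b, c, k=1, lr=0.01):
    """One CD-k gradient step for a Binary-Binary RBM (Eqs. 11-13).

    v0: mini-batch of training vectors, shape (n, n_visible);
    W: (n_visible, n_hidden) weights; b: hidden biases; c: visible biases.
    """
    ph0 = sigm(b + v0 @ W)                 # p(h = 1 | v) on the data
    vk, phk = v0, ph0
    for _ in range(k):                     # k steps of Gibbs sampling
        hk = (rng.random(phk.shape) < phk).astype(np.float64)   # sample h
        pvk = sigm(c + hk @ W.T)           # p(v = 1 | h)
        vk = (rng.random(pvk.shape) < pvk).astype(np.float64)   # sample v^(k)
        phk = sigm(b + vk @ W)             # p(h = 1 | v^(k))
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - vk.T @ phk) / n   # Eq. (11)
    c += lr * (v0 - vk).mean(axis=0)          # Eq. (12)
    b += lr * (ph0 - phk).mean(axis=0)        # Eq. (13)
    return W, b, c
```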

2.3. Gaussian-Bernoulli RBM for learning real-valued visible variables

In this paper, we use the Gaussian-Bernoulli RBM (GBRBM) [10]. A GBRBM has Gaussian visible units and binary hidden units. The states of the Gaussian visible units form a vector v of real values, and the states of the binary hidden units form a vector h of 0s and 1s. The energy function E and the conditional probability of visible unit activation are given as:

$$ E(v, h) = \sum_i \frac{(v_i - c_i)^2}{2\sigma_i^2} - \sum_j h_j b_j - \sum_{i,j} \frac{v_i}{\sigma_i^2} w_{ij} h_j \tag{14} $$

$$ p(v \mid h) = \mathcal{N}(v;\, c + h W^T,\, \sigma^2) \tag{15} $$

where N(x; μ, σ²) is the Gaussian probability density function with mean μ and variance σ².
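For illustration, sampling the Gaussian visible units of Eq. (15) and the hidden activations implied by the energy of Eq. (14) could be sketched as follows; this is our own reading of the equations, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def gbrbm_sample_v(h, W, c, sigma):
    """Eq. (15): v ~ N(c + h W^T, sigma^2), element-wise."""
    return rng.normal(c + h @ W.T, sigma)

def gbrbm_p_h_given_v(v, W, b, sigma):
    """Hidden activations implied by Eq. (14): sigm(b + (v / sigma^2) W)."""
    return 1.0 / (1.0 + np.exp(-(b + (v / sigma**2) @ W)))
```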

2.4. Conditional RBM

A Conditional RBM (CRBM) is an energy-based graphical model for sequence modeling, as shown in Figure 1 (b). A CRBM has a vector of visible variables (visible units) v, a vector of hidden variables (hidden units) h, and a vector of past visible variables (history units) u, with connections between v and h, from u to v, and from u to h. Its energy function E is as follows.

$$ E(v, h, u) = -v^T W h - v^T c - u^T A v - u^T B h - h^T b \tag{16} $$

where c is the bias vector of the visible units, b is the bias vector of the hidden units, W is the weight matrix between the visible units v and the hidden units h, A is the weight matrix from the history units u to the visible units v, and B is the weight matrix from the history units u to the hidden units h. The activations of the visible units v and the hidden units h are calculated from the conditional probabilities p(v | h) and p(h | v):


$$ p(v \mid h) = \mathrm{sigm}(c' + h W^T) \tag{17} $$

$$ p(h \mid v) = \mathrm{sigm}(b' + v W) \tag{18} $$

where c' and b' are dynamic biases calculated as follows.

$$ c' = c + u^T A \tag{19} $$

$$ b' = b + u^T B \tag{20} $$

The model parameters are estimated with the CD-k algorithm, in the same way as for the RBM.
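As a minimal sketch (our own illustration under the shape conventions above), the dynamic biases of Eqs. (19) and (20) reduce the CRBM to an ordinary RBM whose biases are shifted by the history:

```python
import numpy as np

def crbm_dynamic_biases(u, A, B, b, c):
    """Eqs. (19)-(20): dynamic biases from the history vector u.

    u: concatenated past visible vectors, shape (history_dim,);
    A: (history_dim, n_visible) weights u -> v;
    B: (history_dim, n_hidden) weights u -> h.
    """
    c_dyn = c + u @ A    # Eq. (19): c' = c + u^T A
    b_dyn = b + u @ B    # Eq. (20): b' = b + u^T B
    return c_dyn, b_dyn
```

With c' and b' substituted for c and b, the conditionals (17)–(18) and the CD-k update of Section 2.2 apply unchanged.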

Fig 1. (a) Restricted Boltzmann Machine (RBM) and (b) Conditional RBM

3. Associative memorization architecture for modeling of music

3.1. Structure of the associative memorization architecture

We constructed the associative memorization architecture for modeling music with an RBM and a Conditional RBM. The model is a three-layer network in which an RBM block is connected to a CRBM block, as shown in Figure 2. The RBM block (first layer to second layer) learns to extract a feature vector from a fragment of an audio signal, and the CRBM block (second layer to third layer) learns to memorize sequences of the feature vectors. After learning, the associative model generates music by predicting the sequence of feature vectors and translating it back into an audio signal.

We set and initialized the network parameters of the associative memorization architecture as follows. The first layer has 16,000 units, the second layer has 200 units, the third layer has 200 units, and the history vector holds 7 past frames (1,400 units). Table 1 lists the size and unit type of each layer. The weight parameters are initialized with random values between -0.5 and 0.5. The bias parameters, the weights from the history vector to the visible variables, and the weights from the history vector to the hidden variables are initialized to 0. The model parameters and the structure of the associative memorization architecture were determined empirically; a sketch of this wiring follows.
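A minimal sketch of this wiring and initialization under the sizes of Table 1 (the variable names are ours; this is not the authors' implementation):

```python
import numpy as np

# Layer sizes from Table 1.
N_IN, N_FEAT, N_OUT, N_HIST = 16_000, 200, 200, 7

rng = np.random.default_rng(0)

# RBM block: 16,000 Gaussian visible units -> 200 binary feature units.
W_rbm = rng.uniform(-0.5, 0.5, size=(N_IN, N_FEAT))  # random in [-0.5, 0.5]
c_rbm = np.zeros(N_IN)      # visible biases, initialized to 0
b_rbm = np.zeros(N_FEAT)    # hidden biases, initialized to 0

# CRBM block: 200 feature units -> 200 binary units, conditioned on a
# history of 7 past feature vectors (7 x 200 = 1,400 history units).
W_crbm = rng.uniform(-0.5, 0.5, size=(N_FEAT, N_OUT))
A = np.zeros((N_HIST * N_FEAT, N_FEAT))   # history -> visible weights
B = np.zeros((N_HIST * N_FEAT, N_OUT))    # history -> hidden weights
c_crbm = np.zeros(N_FEAT)
b_crbm = np.zeros(N_OUT)
```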


Table 1. Size of each layer and unit type of each layer

Layer                 | Number of units      | Type of units
First layer           | 16,000               | Gaussian units (state is a real value)
Second layer          | 200                  | Binary units (state is 0 or 1)
Third layer           | 200                  | Binary units (state is 0 or 1)
History vector layer  | 1,400 (7 histories)  | Binary units (state is 0 or 1)

Fig 2. Structure of Associative memorization architecture using Restricted Boltzmann Machine (RBM) and Conditional RBM

3.2. Training of the associative memorization architecture

We set the training parameters and pre-processed the audio signal for training the associative memorization architecture. The training parameters are set as follows: the weight decay rate is 0.02, the weight momentum rate is 0.8, the mini-batch size is 100, and training runs for 6,000 epochs. The resampling parameters are set as follows: the sampling rate is 16,000 Hz, the quantization depth is 16 bits, and there is one audio channel. The audio signal is resampled with these parameters, and its amplitude is normalized to the range -5.0 to 5.0. After resampling, the audio signal is split into fragments at each time step. The splitting parameters are set as follows: the window function is a rectangular window, the window size is 16,000 points, and the frame shift size is 16,000 points. The training data for the associative memorization architecture is the resulting sequence of audio signal fragments. The fragment size was determined empirically in view of the following facts (a preprocessing sketch follows the list):

- The music audio signal of the training data is sampled at 16 kHz.
- In preliminary experiments, the sequence was not memorized when short fragment sizes (for example, 160 or 1,600 points) were used.
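The preprocessing pipeline above could be sketched as follows, assuming SciPy and a WAV input file; the function name and the use of scipy.signal.resample are our own choices, not the authors':

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample

def make_training_sequence(path, target_rate=16_000, frame_size=16_000):
    """Resample to 16 kHz mono, normalize to [-5, 5], and split the signal
    into non-overlapping 16,000-point fragments (rectangular window)."""
    rate, x = wavfile.read(path)
    x = x.astype(np.float64)
    if x.ndim == 2:                          # mix stereo down to one channel
        x = x.mean(axis=1)
    x = resample(x, int(len(x) * target_rate / rate))
    x = 5.0 * x / np.max(np.abs(x))          # amplitude in [-5.0, 5.0]
    n_frames = len(x) // frame_size          # frame shift = window size
    return x[:n_frames * frame_size].reshape(n_frames, frame_size)
```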

4. Experimental Results

After training the associative memorization architecture, we predicted the trained music with it. The training data is the audio signal of "Let it be" by The Beatles. The seed data for sequence prediction is the fragments of "Let it be" in the range from 0 to 10 seconds. The predicted audio signal is illustrated in Figure 3. The upper image in Figure 3 shows the power spectrogram of the original audio signal ("Let it be"), and the lower image shows the power spectrogram of the audio signal reconstructed from the predicted fragments.

In another training run, the training data is taken from each movement of each piece in "The Four Seasons" by Vivaldi. We pick up the fragments from 0 to 15 seconds of each movement and concatenate them into one sequence; the total length of the training data is 180 seconds. The seed data for sequence prediction is the fragments of the training data in the range from 0 to 30 seconds. The predicted audio signal is illustrated in Figure 4. The upper image in Figure 4 shows the power spectrogram of the original audio signal (the 180 seconds of fragments from "The Four Seasons"), and the lower image shows the power spectrogram of the audio signal reconstructed from the predicted fragments.

In Figures 3 and 4, the original signal and the reconstructed (predicted or generated) signal appear different. In the case of Vivaldi's music (Figure 4) in particular, the difference appears larger, because the reconstructed signal contains substantial white-noise power across all frequency bands. However, in listening tests, human listeners perceive the reconstructed signal as a mixture of the original signal and white noise; the reconstructed sequences are therefore recognized as the original music. By removing the white noise from the reconstructed signal, a signal very close to the original would be obtained. Noise removal is outside the theme of this paper, although it is an equally important issue. How well can a listener recognize the original music in the reconstructed signal? It depends on the intensity of the white noise in the reconstruction. The length of the generated signal does not affect the noise intensity; the learning error influences the white noise in proportion to the magnitude of the error. In summary, the associative memorization architecture can extract feature vectors from fragments of an audio signal and memorize the sequence of feature vectors. In other words, the associative memorization architecture can memorize the music.

Fig 3. Power spectrogram of the original music ("Let it be" by The Beatles) and of the music reconstructed from the audio signal predicted by the associative memorization model. The upper image is the spectrogram of the original audio signal (training data); the lower image is the spectrogram of the reconstructed music. FFT parameters: window size 512 points, overlap 0 points, Hamming window.


Fig 4. Power spectrogram of the original music (180 seconds of fragments from "The Four Seasons") and of the music reconstructed from the audio signal predicted by the associative memorization model. The upper image is the spectrogram of the original audio signal (training data); the lower image is the spectrogram of the reconstructed music. FFT parameters: window size 512 points, overlap 0 points, Hamming window.
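For reference, the power spectrograms in Figures 3 and 4 use a 512-point Hamming window with no overlap; a minimal reproduction with SciPy (our own sketch, not the authors' code) would be:

```python
import numpy as np
from scipy.signal import spectrogram

def power_spectrogram(x, fs=16_000):
    """Power spectrogram with the FFT settings of Figs. 3 and 4:
    512-point Hamming window, overlap 0."""
    f, t, Sxx = spectrogram(x, fs=fs, window='hamming',
                            nperseg=512, noverlap=0, mode='psd')
    return f, t, 10.0 * np.log10(Sxx + 1e-12)   # dB scale for display
```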

5. Conclusion and Future Work

We attempt to realize an intelligent system that creates new music based on some existing music. In this paper, we constructed an associative memorization architecture as the music generation algorithm of an intelligent algorithmic composition system. The memorization architecture, built by deep learning, learns to extract feature vectors from fragments of an audio signal and to memorize sequences of those feature vectors. After training the associative memorization architecture, we predicted the subsequent audio signal from audio signal fragments. The music reconstructed from this predicted signal includes noise; however, the reconstructed audio signal can be recognized as the original audio signal by human listeners. As a result, the associative memorization architecture can extract feature vectors from fragments of an audio signal and memorize the sequence of feature vectors. In other words, the associative memorization architecture can memorize the audio signal of the music. Future work in this study consists of three tasks. The first is the generation of new music pieces by this architecture using a data set of many music pieces. The second is the estimation of the network parameters needed to memorize many music pieces. The third is the removal of noise from the music audio signals generated by the associative memorization architecture.

References

1. Smolensky P. Information processing in dynamical systems: Foundations of harmony theory. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press; 1986. p. 194-281.
2. Hinton GE. A Practical Guide to Training Restricted Boltzmann Machines. Tech. Rep., Department of Computer Science, University of Toronto; 2010.
3. Taylor GW, Hinton GE. Factored conditional restricted Boltzmann machines for modeling motion style. In: Proceedings of the 26th Annual International Conference on Machine Learning; 2009. p. 1025-1032.
4. Mnih V, Larochelle H, Hinton GE. Conditional Restricted Boltzmann Machines for Structured Output Prediction. In: Proceedings of the International Conference on Uncertainty in Artificial Intelligence; 2011.
5. Cope D. Computers and Musical Style. Computer Music and Digital Audio Series; 1991.
6. Bickerman G, Bosley S, Swire P, Keller RM. Learning to Create Jazz Melodies Using Deep Belief Nets. In: First International Conference on Computational Creativity; 2010.
7. Bengio Y. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2009;2(1):1-127.
8. Boulanger-Lewandowski N, Bengio Y, Vincent P. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In: 29th International Conference on Machine Learning (ICML 2012); 2012.
9. Hinton GE. Training products of experts by minimizing contrastive divergence. Neural Computation 2002;14(8):1771-1800.
10. Cho KH, Ilin A, Raiko T. Improved Learning of Gaussian-Bernoulli Restricted Boltzmann Machines. In: Artificial Neural Networks and Machine Learning (ICANN 2011); 2011. p. 10-17.