Mixture variational autoencoders

Shuoran Jiang, Yarui Chen∗, Jucheng Yang, Chuanlei Zhang, Tingting Zhao
Tianjin University of Science & Technology, Tianjin 300457, China
∗ Corresponding author. E-mail address: [email protected] (Y. Chen).
Article history: Received 1 March 2019; Revised 4 September 2019; Accepted 6 September 2019; Available online 7 September 2019.
Keywords: Mixture variational autoencoders; Mixture models; Reparameterization trick; SGVB estimator.
Abstract

Variational autoencoders (VAEs) combine a generative model and a recognition model, and jointly train them to maximize a variational lower bound. VAEs play an important role in unsupervised learning and representation learning, but the isotropic generative model in VAEs cannot sufficiently utilize the latent representative space. In this paper, we present mixture variational autoencoders (MVAEs), which suppose that observed data is generated from mixture models. We use a continuous variable with a Normal prior as the latent representation, and introduce a discrete variable with a Multinomial prior as the latent indicator of the mixture model. Both latent variables are approximated by recognition models computed from neural networks. Furthermore, we combine the reparameterization trick and the stick-breaking parameterization to realize a stochastic gradient variational Bayes (SGVB) estimator of our variational objective. MVAEs improve the generative performance by enlarging the latent representative space. Finally, we demonstrate the performance of MVAEs compared with state-of-the-art models on MNIST, OMNIGLOT, and Fashion-MNIST.
1. Introduction

In recent years, deep generative models [18,26,30] have become increasingly popular in unsupervised learning and representation learning, and they show the complementary advantages of deep learning and Bayesian statistics [1,7,20,31]. Generative adversarial nets (GANs) [8] and variational autoencoders (VAEs) [15] are two popular models under this generative framework. GANs estimate the generative model via an adversarial process, in which two models are simultaneously trained: a generative model that captures the data distribution, and a discriminative model that estimates the probability that a sample came from the training data rather than the generative model [8]. VAEs pair a top-down generative model over the observed data with a bottom-up recognition model over the continuous latent variable, and jointly train them to maximize a variational lower bound on the marginal likelihood [4,12,15]. There are also other models for unsupervised representation learning, such as the split-brain autoencoders [35], the cross-channel encoder [34], and the context encoder [24]. These unsupervised representation learning methods have wide applications in visual matching [6], face recognition [5], image classification [19], and many other tasks. Compared to GANs, VAEs present an explicit description of the latent space through the posterior distribution, which can help to
analyze the representation ability of the generative model. However, regular VAEs still have two defects: (1) the prior distribution of the latent variable is isotropic, which pushes the recognition model towards learning a separate continuous factor of variation from the data [3,36]; (2) the generative model over observations is also an isotropic model in which the covariance is a diagonal matrix [3], which discards abundant correlations between features. Therefore, the recognition and generative models in traditional VAEs are both limited. In recent years, several innovative works have been proposed to enlarge the latent representative space. Kingma et al. [14] proposed the variational autoencoder with inverse autoregressive flow (VAE-IAF), which approximates the posterior of the latent variable through an invertible autoregressive flow. The proposed flow consists of a chain of invertible transformations, where each transformation is based on an autoregressive neural network over the latent variable. VAE-IAF can learn a stronger posterior of the latent variable, but the autoregressive networks lead to a disordered distribution of the latent representative space, so VAE-IAF cannot generate samples that are similar but not identical. Sohn et al. [29] proposed the conditional variational autoencoder (CVAE), adapted from semi-supervised learning in deep generative models [13]. Besides the unique latent representation corresponding to each observation, CVAE introduces a discrete variable to approximate the data label. Compared to traditional VAEs, CVAE utilizes richer information in the recognition model of the latent variable, and also allows for modeling multiple modes
indicated by conditioning on the discrete latent variable. However, the supervised learning used to train the discrete indicator inevitably predicts some wrong labels on the whole dataset, which weakens the generative performance. Nalisnick and Smyth proposed the stick-breaking variational autoencoder (SB-VAE) [22], which uses a discrete variable as the latent representation and generates samples from mixture models. SB-VAE improves the generative likelihood through mixture models, but the discrete latent representation cannot encode rich information about the data. Dilokthanakul et al. [3] proposed the Gaussian mixture variational autoencoder (GMVAE) with a multimodal generative model over observed data. The multimodal prior distribution over observations enriches the latent representation. In addition to improving the generative performance, GMVAE also preserves a complete and orderly latent representative space. However, its recognition model over the discrete latent indicator is a point estimate derived from MAP (maximum a posteriori) estimation [10,21,23], which is inconsistent with the framework of auto-encoding variational Bayes. Additionally, the point estimate for the discrete latent indicator ignores its relationship with the latent representation.
In this paper, we propose the mixture variational autoencoders (MVAEs), which improve the generative performance by replacing the isotropic generative model with a mixture model. Specifically, we use a continuous variable with a Normal prior as the latent representation, and approximate its posterior with a recognition model computed by a neural network on the observed data. Furthermore, we introduce a discrete variable as the latent indicator of the mixture generative model, and approximate its posterior with a neural network on the latent representation. Finally, we optimize the lower bound of the marginal likelihood using the stochastic gradient variational Bayes estimator.
The rest of the paper is organized as follows. We describe variational autoencoders in § 2. The details of mixture variational autoencoders are described in § 3. Experiments showing qualitative and quantitative results are presented in § 4. Finally, we conclude with a brief summary in § 5.

2. Variational autoencoders

Given a dataset X = {x^(n)}_{n=1}^{N} consisting of N i.i.d. observations, variational autoencoders (VAEs) suppose each sample x^(n) is generated by the following process: (1) a latent representation z^(n) is generated from the Normal distribution N(0, I); (2) a sample x^(n) is generated from a conditional distribution p_θ(x|z^(n)). VAEs use a disentangled recognition model q_φ(z|x) as a proxy for the intractable true posterior p_θ(z|x). The recognition and generative models are both computed by neural networks. The model is optimized by maximizing the log-evidence lower bound (ELBO) [9],
log p_θ(x) ≥ L_ELBO = E_{z∼q_φ(z|x)}[ log p_θ(x|z) ] − D_KL( q_φ(z|x) || p(z) ),     (1)

where the sample z^(i) is drawn from the recognition model q_φ(z|x) = N(z; μ_φ(x), Σ_φ(x)) by the reparameterization trick [15], that is,

z^(i) = (1/L) Σ_{l=1}^{L} ( μ_φ(x^(i)) + Σ_φ^{1/2}(x^(i)) ∗ ε^(l) ),   with ε^(l) ∼ N(0, I).     (2)
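For concreteness, the following is a minimal PyTorch sketch of Eqs. (1)-(2): the reparameterized sample and the resulting Monte Carlo ELBO for Bernoulli observations. The function names and the L argument are illustrative assumptions, not part of the original formulation.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar, L=1):
    """Eq. (2): average of L samples mu + Sigma^{1/2} * eps, with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)          # diagonal Sigma_phi(x)^{1/2}
    eps = torch.randn((L,) + mu.shape)     # eps^(l) ~ N(0, I)
    return (mu + std * eps).mean(dim=0)

def vae_elbo(x, recon_logits, mu, logvar):
    """Eq. (1): E_q[log p_theta(x|z)] - D_KL(q_phi(z|x) || N(0, I)) for binary data."""
    log_px_given_z = -F.binary_cross_entropy_with_logits(
        recon_logits, x, reduction="none").sum(-1)
    # Closed-form KL between N(mu, diag(exp(logvar))) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
    return (log_px_given_z - kl).mean()    # averaged over the minibatch
```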
The parameters φ, θ can be jointly optimized by maximizing the ELBO with a stochastic optimization method. The key insight behind VAEs is that any distribution can be generated by mapping a normally distributed variable through a sufficiently complicated function [4].

3. Mixture variational autoencoders
Fig. 1. The graphical model of the generative process in mixture variational autoencoders (MVAEs), which have K generative components in the mixture model and use a discrete variable c to indicate the generative component for each sample x^(n).
In VAEs, the recognition model and the generative model are both isotropic, which results in insufficient utilization of the latent representative space. GMVAE improves the VAE framework by introducing a multimodal generative model over the data, but its posterior approximation for the discrete variable is derived from the MAP estimate, which only computes point estimates and cannot learn the relationship with the latent representation, so GMVAE cannot achieve higher generative performance.
In this paper, we propose mixture variational autoencoders (MVAEs), which use mixture models as the probability model over observed data. MVAEs take a continuous variable with a Normal prior as the latent representation, and introduce a latent indicator with a Multinomial prior. The recognition model of the latent representation is computed from the observed data with a neural network, and the recognition model of the latent indicator is computed from the latent representation with a neural network. The variational objective is calculated by the combination of a Monte Carlo estimator [27] and the reparameterization trick, and it is able to backpropagate through both latent factors.

3.1. Generative and recognition models

We consider the generative model p(x, z, c) = p_π(c) p(z) p_Θ(x|z, c), in which the latent variable z is generated from a centered multivariate Gaussian N(0, I), and the latent indicator c is generated from a Multinomial distribution Mult(π). The latent indicator c = [c_1, c_2, ..., c_K]^T satisfies c_i ∈ {0, 1} and Σ_{i=1}^{K} c_i = 1. Each sample x^(i) corresponds to a unique value z^(i) as its latent representation, and is generated from a single component of the mixture model p_Θ(x|z, c). The generative process is given as
c^(i) ∼ Π_{k=1}^{K} π_k^{c_k},
z^(i) ∼ N(0, I),
x^(i) | z^(i), c^(i) ∼ Π_{k=1}^{K} p_{θ_k}(x | z^(i))^{c_k^(i)},     (3)
where K is the predefined number of components in the mixture model, each component p_{θ_k}(x|z) is a Gaussian distribution for real-valued data or a Bernoulli distribution for binary data, π = {π_1, π_2, ..., π_K}, and Θ = {θ_1, θ_2, ..., θ_K}. The graphical model of the mixture variational autoencoders is shown in Fig. 1.
• Generative model
Each generative component p_{θ_k}(x|z), k = 1, 2, ..., K, is based on a fully-connected neural network parameterized by θ_k. In the case of binary data, we use
log p_{θ_k}(x|z) = Σ_{i=1}^{D} [ x_i log y_i + (1 − x_i) log(1 − y_i) ],   where y = f(z; θ_k).

In the case of real-valued data, we use

log p_{θ_k}(x|z) = log N(x; μ, σ² I),     (4)

where

μ = f_μ(z; θ_k),   log σ² = f_σ(z; θ_k),     (5)

and f, f_μ and f_σ are elementwise sigmoid activation functions, and θ_k is the set of weights and biases of these neural networks.
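As a concrete illustration, here is a minimal PyTorch sketch of the mixture generative model of Eq. (3) for binary data: K fully-connected components p_{θ_k}(x|z), their per-component log-likelihoods, and ancestral sampling of (c, z, x). The class and argument names (MixtureGenerator, z_dim, h_dim, ...) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MixtureGenerator(nn.Module):
    """K generative components p_{theta_k}(x|z) for binary data, plus ancestral
    sampling of the generative process in Eq. (3)."""
    def __init__(self, K=4, z_dim=20, h_dim=500, x_dim=784, pi=None):
        super().__init__()
        self.K = K
        self.pi = pi if pi is not None else torch.full((K,), 1.0 / K)  # mixture weights pi
        self.components = nn.ModuleList([
            nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                          nn.Linear(h_dim, h_dim), nn.Tanh(),
                          nn.Linear(h_dim, x_dim))                      # logits of y = f(z; theta_k)
            for _ in range(K)])

    def log_prob(self, x, z):
        """Return a (batch, K) matrix of log p_{theta_k}(x|z) (Bernoulli case)."""
        logps = [-nn.functional.binary_cross_entropy_with_logits(
                     comp(z), x, reduction="none").sum(-1) for comp in self.components]
        return torch.stack(logps, dim=-1)

    @torch.no_grad()
    def sample(self, n):
        """Ancestral sampling of Eq. (3): c ~ Mult(pi), z ~ N(0, I), x ~ p_{theta_c}(x|z)."""
        c = torch.multinomial(self.pi, n, replacement=True)             # component indices
        z = torch.randn(n, self.components[0][0].in_features)           # z ~ N(0, I)
        probs = torch.stack([torch.sigmoid(comp(z)) for comp in self.components], dim=1)
        x = torch.bernoulli(probs[torch.arange(n), c])                  # pick the c-th component
        return x, z, c
```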
ple x(i) , we first sample noise (l ) ∼ N (0, I ), then get a latent representation z(i,l) by the noise sample (l) , that is, 1
z (i,l ) = μφ (x(i ) ) + φ2 (x(i ) ) ∗ (l ) .
(10)
The Monte Carlo estimate of the expectation of some function f(z) can be calculated by:
E q φ ( z | x ( i ) ) [ f ( z ( i ) )] ≈ Fig. 2. The recognition and generative models for MVAEs. Solid lines denote the mixture generative models p(z ) Kk=1 pθ k (x|z )ck , and dashed lines denote the recognition model qφ (z|x) and qη (c|z).
μ = f μ ( z ; θ k ), log σ 2 = fσ (z; θ k ),
(5)
where f, fμ and fσ are the elementwise sigmoid activation function, and θ k is set of weights and biases of these neural networks. • Recognition model For the latent representative variable z ∼ N (0, I ), we choose recognition model qφ (z|x ) = N (z; μφ (x ), φ (x )) to approximate its intractable true posterior p (z|x, c), where parameters μφ (x) and φ (x)) are neural networks parameterized by φ. For the latent indicator variable c ∼ Multinoulli(π ), we choose a multinomial-logit function qη (c|z) as a proxy for posterior p,π (c|x, z), that is
exp( fk (z; η ))
j=1
exp( f j (z; η ))
,
(6)
where f(z; η) are neural networks with K-dimensions output parameterized by η. Fig. 2 shows the complete directed graphical models for recognition and generative models, in which solid lines represent the generative process, and dashed lines represent the inference process.
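Below is a correspondingly minimal sketch of the two recognition models: the Gaussian encoder q_φ(z|x) and the multinomial-logit indicator network of Eq. (6). The layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecognitionNetworks(nn.Module):
    """q_phi(z|x) = N(z; mu_phi(x), Sigma_phi(x)) and the multinomial-logit q_eta(c|z)."""
    def __init__(self, x_dim=784, h_dim=500, z_dim=20, K=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh(),
                                     nn.Linear(h_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)        # mu_phi(x)
        self.logvar = nn.Linear(h_dim, z_dim)    # log of the diagonal of Sigma_phi(x)
        self.f_eta = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                   nn.Linear(h_dim, K))   # f(z; eta), K-dimensional output

    def q_z(self, x):
        h = self.encoder(x)
        return self.mu(h), self.logvar(h)

    def q_c(self, z):
        # Eq. (6): softmax over the K logits f_k(z; eta).
        return torch.softmax(self.f_eta(z), dim=-1)
```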
3.2. The variational bound

In the MVAEs, the marginal likelihood of each sample can be written as the combination of a KL divergence and the ELBO, that is,

log p(x) = D_KL( q_{φ,η}(z, c|x) || p_{Θ,π}(z, c|x) ) + L(φ, η, Θ),     (7)

where D_KL ≥ 0, and the ELBO L(φ, η, Θ) on the marginal likelihood can be written as

L(φ, η, Θ) = E_{q_φ(z|x) q_η(c|z)} [ Σ_{k=1}^{K} c_k log p_{θ_k}(x|z) ] − D_KL( q_φ(z|x) || p(z) ) − D_KL( q_η(c|z) || p(c) ).     (8)
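The two KL terms in Eq. (8) have simple closed forms when q_φ(z|x) is a diagonal Gaussian and p(c) = Mult(π); a short sketch follows, with helper names that are ours rather than the paper's.

```python
import torch

def kl_gauss_std_normal(mu, logvar):
    """D_KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp(), dim=-1)

def kl_categorical(q_c, pi, eps=1e-8):
    """D_KL( q_eta(c|z) || Mult(pi) ) for a K-way indicator, per sample."""
    return torch.sum(q_c * (torch.log(q_c + eps) - torch.log(pi + eps)), dim=-1)
```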
3.3. SGVB estimator

For the ELBO L(φ, η, Θ), we derive a Monte Carlo estimator using the reparameterization trick and the stick-breaking parameterization; this estimator can be straightforwardly optimized with standard stochastic gradient methods.
First, we use the reparameterization trick to sample a value z^(i) from the recognition model q_φ(z|x) = N(z; μ_φ(x), Σ_φ(x)). Given a sample x^(i), we first sample noise ε^(l) ∼ N(0, I), then obtain a latent representation z^(i,l) from the noise sample ε^(l), that is,

z^(i,l) = μ_φ(x^(i)) + Σ_φ^{1/2}(x^(i)) ∗ ε^(l).     (10)

The Monte Carlo estimate of the expectation of some function f(z) can then be calculated by

E_{q_φ(z|x^(i))}[ f(z^(i)) ] ≈ (1/L) Σ_{l=1}^{L} f(z^(i,l)).     (11)

Second, we sample c^(i) from the recognition model q_η(c|z) using the stick-breaking parameterization [11] based on the sample z^(i,l), that is,

q_η(c_k^(i) = 1 | z^(i,l)) = exp( f_k(z^(i,l); η) ) / Π_{j≤k} ( 1 + exp( f_j(z^(i,l); η) ) ),   1 ≤ k ≤ K.     (12)

Its Monte Carlo estimate [25] can be calculated by

q_η(c_k^(i) = 1 | z^(i)) = E_{q_φ(z|x^(i))}[ q_η(c_k^(i) = 1 | z^(i)) ] ≈ (1/L) Σ_{l=1}^{L} exp( f_k(z^(i,l); η) ) / Π_{j≤k} ( 1 + exp( f_j(z^(i,l); η) ) ).     (13)

The sample c^(i) can then be set as

c_k^(i) = 1 if q_η(c_k^(i) = 1 | z^(i)) > q_η(c_j^(i) = 1 | z^(i)) for all j ≠ k, and c_k^(i) = 0 otherwise.     (14)

We use the minibatch technique to train our model. Kingma and Welling [15] verified that, as long as the minibatch size M is large enough, the number L of samples in the reparameterization trick can be set to 1. In this paper, we set M = 500 and L = 1. For a minibatch X^M = {x^(i)}_{i=1}^{M} randomly drawn from the dataset X, the SGVB estimator of the ELBO is

L(φ, η, Θ) ≈ (1/M) Σ_{i=1}^{M} Σ_{l=1}^{L} Σ_{k=1}^{K} c_k^(i) log p_{θ_k}(x^(i) | z^(i,l)) − (1/M) Σ_{i=1}^{M} D_KL( q_φ(z|x^(i)) || p(z) ) − (1/M) Σ_{i=1}^{M} D_KL( q_η(c|z^(i)) || p(c) ).     (15)

The parameters φ, η, Θ can be jointly optimized by maximizing L(φ, η, Θ) with stochastic gradient variational Bayes (SGVB), that is,

{φ, η, Θ} ← arg max_{φ,η,Θ} L(φ, η, Θ).

We can derive the gradient ∇_{φ,η,Θ} L(φ, η, Θ) using stochastic gradient descent [2,32]. The generative performance can be quantified by the average NLL (negative log-likelihood) over the inputs X, that is,

NLL = −(1/M) Σ_{i=1}^{M} Σ_{k=1}^{K} log [ p_{θ_k}(x^(i) | z^(i)) ]^{c_k^(i)}.     (16)
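Putting the pieces together, the sketch below estimates the minibatch objective of Eq. (15) with L = 1, using the reparameterization of Eq. (10), the stick-breaking form of Eq. (12) and the hard indicator of Eq. (14). It assumes the RecognitionNetworks, MixtureGenerator and KL helpers sketched above are in scope; the renormalization of the stick-breaking probabilities is a practical choice of this sketch, and every name here is an illustrative assumption rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def stick_breaking_q(logits):
    """Eq. (12): q_eta(c_k = 1 | z) = exp(f_k) / prod_{j<=k} (1 + exp(f_j)), logits of shape (batch, K)."""
    log_num = logits                                     # f_k
    log_den = torch.cumsum(F.softplus(logits), dim=-1)   # sum_{j<=k} log(1 + exp(f_j))
    q = torch.exp(log_num - log_den)
    return q / q.sum(dim=-1, keepdim=True)               # renormalize so the K values form a distribution

def mvae_minibatch_elbo(x, rec, gen, pi):
    """SGVB estimate of L(phi, eta, Theta) in Eq. (15) for one minibatch (L = 1)."""
    mu, logvar = rec.q_z(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)                   # Eq. (10)
    q_c = stick_breaking_q(rec.f_eta(z))                                      # Eqs. (12)-(13) with L = 1
    c = F.one_hot(q_c.argmax(dim=-1), num_classes=q_c.shape[-1]).float()      # Eq. (14), hard indicator
    recon = (c * gen.log_prob(x, z)).sum(-1)                                  # sum_k c_k log p_{theta_k}(x|z)
    elbo = recon - kl_gauss_std_normal(mu, logvar) - kl_categorical(q_c, pi)
    return elbo.mean()                                                        # maximize this quantity
```

In training, one would minimize the negative of this estimate with a standard optimizer over minibatches of size M = 500, matching the setting used in the experiments below.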
4. Experiments

We designed experiments to evaluate the generative performance and latent representation of MVAEs. First, we determined the optimal number of generative components by comparing the generative NLLs of a series of MVAEs. Second, we compared and analyzed the generative performance of MVAEs with that of state-of-the-art models. Finally, we evaluated the latent representative spaces of MVAEs and VAEs on MNIST, and the results illustrate that our model learns a richer and more orderly distribution of the latent representation. In these experiments we made use of the following databases:
• MNIST [17] is a standard handwritten digits database with 60,000 samples; each sample is a 28 × 28 grayscale image.
• OMNIGLOT [16] contains 1623 different handwritten characters from 50 alphabets. The dataset consists of 24,345 samples, and each sample is a 28 × 28 image.
• Fashion-MNIST [33] is a dataset of Zalando's article images, which contains a training set of 60,000 samples; each sample is a 28 × 28 grayscale image with a label from 10 classes.
The comparative models include VAEs [15], CVAE [13], VAE-IAF [14], GMVAE [3], SB-VAE [22] and MVAEs. For these models, the recognition and generative models were realized by fully connected neural networks with two hidden layers, and all models were trained end-to-end with minibatch size M = 500, 200 epochs and learning rate 0.001.

4.1. Determining the number of components
Fig. 3. Generative images of MVAEs on MNIST with the two-dimensional latent space and different numbers of generative components K = 2, 3, 4, 5.
In this experiment, we set the dimension of the latent vector to Dz = 20, 40, 60 and the number of components to K = 2, 3, 4, 5, 6, 7, 8 on the MNIST dataset. The recognition models q_φ(z|x), q_η(c|z) and each generative component p_{θ_k}(x|z) were fully connected neural networks with two hidden layers, and each hidden layer had 500 units with Tanh activation functions [28]. The NLLs of this series of MVAEs are shown in Table 1. From these results we can see that MVAEs improved the generative likelihoods as the number of components increased, and the likelihoods reached a steady value when K ≥ 4. Fig. 3 presents the images generated from MVAEs with a two-dimensional latent representative space (Dz = 2) and different numbers of components K = 2, 3, 4, 5. The results show that the reconstructed images became clearer as the number of generative components increased, and the number of blurry reconstructions reached its minimum at K = 4. So we set the number of components in the mixture generative models to K = 4 in the following experiments.

Fig. 4. The training processes of MVAEs and VAEs with Dz = 20 on the MNIST, OMNIGLOT and Fashion-MNIST databases.

Fig. 5. The two-dimensional latent representative spaces on MNIST (upper row) and Fashion-MNIST (lower row) for VAEs (left column) and MVAEs (right column).

Table 1. The negative log-likelihoods (NLLs) of MVAEs with different Dz and K on the MNIST dataset.

Latent     K=2      K=3      K=4      K=5      K=6      K=7
Dz = 20    67.294   66.42    66.069   65.927   65.889   65.62
Dz = 40    62.792   62.432   62.085   61.815   61.573   61.452
Dz = 60    61.953   61.481   61.276   61.114   60.869   60.730
Table 2. The NLL comparisons of MVAEs with SB-VAE, VAEs, GMVAE, CVAE and VAE-IAF on the MNIST, OMNIGLOT and Fashion-MNIST databases.

Dataset         Models         Dz = 20    Dz = 40    Dz = 60
MNIST           SB-VAE         77.851     76.127     76.34
                VAEs           71.618     71.213     71.179
                GMVAE          71.08      70.38      70.65
                CVAE           69.86      69.40      70.36
                VAE-IAF        67.311     65.895     65.649
                MVAEs (ours)   65.331     62.085     60.396
OMNIGLOT        SB-VAE         111.341    105.108    105.62
                VAEs           83.272     79.154     79.832
                GMVAE          80.827     81.294     80.744
                CVAE           77.926     76.637     75.544
                VAE-IAF        97.27      85.32      84.01
                MVAEs (ours)   79.211     71.88      67.47
Fashion-MNIST   SB-VAE         221.392    222.025    221.883
                VAEs           216.014    215.926    215.655
                GMVAE          217.626    217.844    218.601
                CVAE           216.906    218.181    215.277
                VAE-IAF        229.615    226.234    228.066
                MVAEs (ours)   212.229    209.432    208.048
4.2. Performance comparison of different models

In this experiment, we compared and analyzed the generative performance of MVAEs with CVAE, VAEs, SB-VAE, VAE-IAF and GMVAE with Dz = 20, 40, 60, respectively, on the MNIST, OMNIGLOT and Fashion-MNIST datasets. For CVAE, VAEs, SB-VAE and VAE-IAF, the recognition and generative models were realized by neural networks with two hidden layers, and each hidden layer had 500 units. For MVAEs and GMVAE, the recognition models used the same architecture as the above models, and the generative models were fixed with K = 4 components, each component having the same architecture as the generative model in VAEs. The NLL comparisons of all models are shown in Table 2, and the comparisons of the running time are shown in Table 3. The training processes of MVAEs and VAEs with Dz = 20 on the MNIST, OMNIGLOT and Fashion-MNIST databases are shown in Fig. 4.
Table 3. The comparisons of computational cost of MVAEs with SB-VAE, VAEs, GMVAE, CVAE and VAE-IAF on the Fashion-MNIST dataset (computing time in seconds).

Models         Dz = 20     Dz = 40     Dz = 60
SB-VAE         6518.127    6796.751    6930.323
VAEs           243.413     240.444     241.112
GMVAE          4307.242    4407.228    4398.142
CVAE           197.574     191.115     195.185
VAE-IAF        880.010     925.625     968.893
MVAEs (ours)   923.605     939.872     933.797
Fig. 6. The generated images of MNIST (the upper row) and Fashion-MNIST (the lower row) from VAEs (the left column) and MVAEs (the right column).
Table 2 shows that MVAEs achieved the best generative performance in the different latent spaces, except for the single result obtained by CVAE with Dz = 20 on OMNIGLOT. In addition, both MVAEs and VAE-IAF improved their generative likelihoods more than the other models as the dimension of the latent space increased. Furthermore, Table 3 shows that the efficiency of these models is ordered as CVAE > VAEs > VAE-IAF > MVAEs > GMVAE > SB-VAE; the CVAE model had the highest efficiency, and MVAEs had a medium computational cost. Finally, Fig. 4 presents the training processes of MVAEs and VAEs on the MNIST, OMNIGLOT and Fashion-MNIST databases, which shows that all of them converged smoothly and that MVAEs had the superior performance.

4.3. Performance comparisons under different structures

In this experiment, we demonstrated that the superior performance of MVAEs derives from the more flexible generative model, rather than from simply expanding the architectures of the neural networks. For SB-VAE, VAEs, CVAE and VAE-IAF with Dz = 20, we expanded the hidden layers of the generative models to Dh = 1000, 1500, 2000, 2500 units, respectively. Meanwhile, we set the number of components to K = 2, 3, 4, 5 for MVAEs and GMVAE with Dz = 20, and each component was realized by a neural network with two hidden layers (500 units each). The NLL comparisons of these models are shown in Table 4. The comparison results show that the NLLs of VAEs, SB-VAE, CVAE and VAE-IAF were not significantly improved as the number of units increased.
Table 4. The NLL comparisons of MVAEs with SB-VAE, VAEs, GMVAE, CVAE and VAE-IAF, all with Dz = 20, on MNIST. MVAEs and GMVAE use different component numbers K = 2, 3, 4, 5; SB-VAE, VAEs, CVAE and VAE-IAF use different numbers of units Dh = 1000, 1500, 2000, 2500 in each hidden layer of the generative models.

Models         Dh = 1000   Dh = 1500   Dh = 2000   Dh = 2500
VAEs           70.044      69.624      70.200      70.220
SB-VAE         72.601      71.682      72.053      71.867
CVAE           68.09       67.05       67.22       67.31
VAE-IAF        68.985      68.913      68.696      68.907

Models         K = 2       K = 3       K = 4       K = 5
GMVAE          71.179      73.593      71.08       71.34
MVAEs (ours)   67.294      66.42       66.069      65.889
Besides, the NLL of GMVAE did not change obviously as the component number K increased, while the NLL of MVAEs improved as K increased and reached a stable value at K = 4. We can see that MVAEs achieved the best performance at an equal scale of generative model compared with the other models, and improved the generative likelihood as K increased. As shown in the above experiments, compared with VAEs, SB-VAE, CVAE, VAE-IAF and GMVAE, MVAEs presented superior performance in different latent spaces. Besides, compared with the other models, MVAEs also achieved the best performance at an equal scale of generative model. All these results show that MVAEs improve the ability of the generative model through a more flexible recognition model on the discrete indicator.
4.4. Latent representations

In this experiment, we analyzed and compared the latent spaces and the generated samples of VAEs and MVAEs. We set the dimension of the latent vector to Dz = 2 for both VAEs and MVAEs, and trained the models on the MNIST and Fashion-MNIST datasets. The visualizations of the latent spaces of VAEs and MVAEs are shown in Fig. 5, and the generated images of VAEs and MVAEs are shown in Fig. 6. The results in Fig. 5 show that the latent spaces of VAEs and MVAEs on MNIST are mainly distributed in the ranges {(-10,10) × (-10,10)} and {(-20,20) × (-20,20)} respectively, and the latent spaces on Fashion-MNIST are mainly distributed in the ranges {(-4,4) × (-4,4)} and {(-20,20) × (-20,20)}. Obviously, MVAEs provide a larger latent space than VAEs, and the latent representations of different categories in MVAEs are more orderly than those of VAEs. In addition, the results in Fig. 6 show that the images generated by MVAEs are clearer than those generated by VAEs, and the number of blurred images from MVAEs is smaller than that from VAEs. The generated images verify that the representations learned by MVAEs contain more information. These results verify that MVAEs obtain a richer latent space and show a superior representation ability owing to the more flexible recognition model on the discrete indicator.

5. Conclusions

In this paper, we have proposed the mixture variational autoencoders (MVAEs), in which observed data is generated from a mixture of variational autoencoders. We studied the recognition models of both the continuous latent representation and the discrete latent indicator, which can capture the complex structure among the observed data, the latent representation and the latent indicator through neural networks. Furthermore, we optimized the variational objective by sampling the latent representation and the discrete latent indicator with the reparameterization trick and the stick-breaking parameterization, respectively. When we sample from the latent representative space to generate data, the best generative component in the mixture model can be selected by the sampled indicator. The experimental results have shown that MVAEs improve the generative performance and enlarge the latent representative space compared with VAEs. We believe that research on mixture variational autoencoders is a promising direction for increasing the flexibility of deep generative models.

Declaration of Competing Interest

None.

Acknowledgment

This work has been partly supported by the National Natural Science Foundation of China (61402332, 61976156, 11803022); Tianjin Municipal Science and Technology Commission (17JCQNJC00400); the Foundation of Tianjin University of Science and Technology (2017LG10); the Key Laboratory of Food Safety Intelligent Monitoring Technology, China Light Industry; and the Research Plan Project of Tianjin Municipal Education Commission (2017KJ034, 2017KJ035).

References

[1] M.J. Barber, J.W. Clark, Designing neural networks that process mean values of random variables, Phys. Lett. A 378 (30–31) (2014) 2163–2167, doi:10.1016/j.physleta.2014.04.065.
[2] L. Bottou, Stochastic gradient descent tricks, in: Neural Networks: Tricks of the Trade, Springer, 2012, pp. 421–436, doi:10.1007/978-3-642-35289-8_25.
[3] N. Dilokthanakul, P. Mediano, M. Garnelo, M. Lee, H. Salimbeni, K. Arulkumaran, M. Shanahan, Deep unsupervised clustering with Gaussian mixture variational autoencoders.
[4] C. Doersch, Tutorial on variational autoencoders, Stat 1050 (2016) 13.
[5] Y. Duan, J. Lu, J. Feng, J. Zhou, Context-aware local binary feature learning for face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 40 (5) (2018) 1139–1153.
[6] Y. Duan, J. Lu, Z. Wang, J. Feng, J. Zhou, Learning deep binary descriptor with multi-quantization, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1183–1192.
[7] A. Gelman, H.S. Stern, J.B. Carlin, D.B. Dunson, A. Vehtari, D.B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC, 2013, doi:10.1002/wcs.72.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[9] M.D. Hoffman, D.M. Blei, C. Wang, J. Paisley, Stochastic variational inference, J. Mach. Learn. Res. 14 (1) (2013) 1303–1347.
[10] Z. Huang, S.M. Siniscalchi, I.-F. Chen, J. Wu, C.-H. Lee, Maximum a posteriori adaptation of network parameters in deep models, arXiv:1503.02108v1 (2015).
[11] M. Khan, S. Mohamed, B. Marlin, K. Murphy, A stick-breaking likelihood for categorical data analysis with latent Gaussian models, in: Artificial Intelligence and Statistics, 2012, pp. 610–618.
[12] D.P. Kingma, Variational inference & deep learning: a new synthesis (2017).
[13] D.P. Kingma, S. Mohamed, D.J. Rezende, M. Welling, Semi-supervised learning with deep generative models, in: Advances in Neural Information Processing Systems, 2014, pp. 3581–3589.
[14] D.P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, M. Welling, Improved variational inference with inverse autoregressive flow, in: Advances in Neural Information Processing Systems, 2016, pp. 4743–4751.
[15] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, Stat 1050 (2014) 1.
[16] B.M. Lake, R.R. Salakhutdinov, J. Tenenbaum, One-shot learning by inverting a compositional causal process, in: Advances in Neural Information Processing Systems, 2013, pp. 2526–2534.
[17] Y. LeCun, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/ (1998).
[18] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436, doi:10.1038/nature14539.
[19] K. Lin, J. Lu, C.-S. Chen, J. Zhou, Learning compact binary descriptors with unsupervised deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1183–1192.
[20] Y. Liu, S. Jiang, S. Liao, Efficient approximation of cross-validation for kernel methods using Bouligand influence function, in: International Conference on Machine Learning, 2014, pp. 324–332.
[21] J.S. Maritz, Empirical Bayes Methods with Applications, Chapman and Hall/CRC, 2018.
[22] E. Nalisnick, P. Smyth, Stick-breaking variational autoencoders, in: International Conference on Learning Representations (ICLR), 2017.
[23] N.M. Nasrabadi, Pattern recognition and machine learning, J. Electron. Imaging 16 (4) (2007) 049901, doi:10.1117/1.2819119.
[24] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A.A. Efros, Context encoders: feature learning by inpainting, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544.
[25] A. Pikovsky, Reconstruction of a random phase dynamics network from observations, Phys. Lett. A 382 (4) (2018) 147–152, doi:10.1016/j.physleta.2017.11.012.
[26] R. Salakhutdinov, Learning deep generative models, Annu. Rev. Stat. Appl. 2 (2015) 361–385, doi:10.1146/annurev-statistics-010814-020120.
[27] T. Salimans, D. Kingma, M. Welling, Markov chain Monte Carlo and variational inference: bridging the gap, in: International Conference on Machine Learning, 2015, pp. 1218–1226.
[28] P. Sibi, S.A. Jones, P. Siddarth, Analysis of different activation functions using back propagation neural networks, J. Theor. Appl. Inf. Technol. 47 (3) (2013) 1264–1268, doi:10.4236/wsn.2015.75005.
[29] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, in: Advances in Neural Information Processing Systems, 2015, pp. 3483–3491.
[30] I. Ullah, A. Petrosino, About pyramid structure in convolutional neural networks, in: 2016 International Joint Conference on Neural Networks (IJCNN), IEEE, 2016, pp. 1318–1324.
[31] M.J. Wainwright, M.I. Jordan, et al., Graphical models, exponential families, and variational inference, Found. Trends Mach. Learn. 1 (1–2) (2008) 1–305, doi:10.1561/2200000001.
[32] Y. Watanabe, Geometry on the parameter space of the belief propagation algorithm on Bayesian networks, Phys. Lett. A 350 (1–2) (2006) 81–86, doi:10.1016/j.physleta.2005.10.012.
[33] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv:1708.07747v1 (2017).
[34] R. Zhang, P. Isola, A.A. Efros, Colorful image colorization, in: European Conference on Computer Vision, Springer, 2016, pp. 649–666.
[35] R. Zhang, P. Isola, A.A. Efros, Split-brain autoencoders: unsupervised learning by cross-channel prediction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1058–1067.
[36] J. Zhu, J. Chen, W. Hu, B. Zhang, Big learning with Bayesian methods, Natl. Sci. Rev. 4 (4) (2017) 627–651, doi:10.1093/nsr/nwx044.