Adaptive parameter estimation of GMM and its application in clustering


Future Generation Computer Systems 106 (2020) 250–259




Linchang Zhao a,∗, Zhaowei Shang a, Jin Tan a, Xiaoliu Luo a, Taiping Zhang a, Yu Wei b, Yuan Yan Tang c

a College of Computer Science, Chongqing University, China
b Qiannan Normal College for Nationalities, Duyun, China
c Faculty of Science and Technology, University of Macau, China

∗ Corresponding author. E-mail addresses: [email protected] (L. Zhao), [email protected] (Z. Shang).
https://doi.org/10.1016/j.future.2020.01.012

Article history: Received 24 August 2019; Received in revised form 17 December 2019; Accepted 6 January 2020; Available online 13 January 2020.
Keywords: Machine learning; Parameter estimation; Data clustering; Variational Bayesian expected maximum

Abstract

Parameter estimation of the Gaussian mixture model (GMM) has begun to gain attention in science and engineering, and it has gradually become one of the most popular research topics. In particular, the rapid development of theoretical progress on globally optimal convergence promotes the widespread application of Gaussian mixtures in data clustering. This paper therefore introduces a novel parameter estimation algorithm called TDAVBEM, which combines the Tsallis entropy and a deterministic annealing (DA) algorithm on the basis of the variational Bayesian expected maximum (VBEM) to simultaneously implement the parameter estimation and select the optimal components of GMM. We experimentally verify the effectiveness and robustness of the proposed algorithm by comparing it with several parameter estimation methods and by applying it to data clustering.
© 2020 Published by Elsevier B.V.

1. Introduction

With the rapid development of globally optimal convergence theory and the widespread use of the Gaussian mixture model (GMM) in practice, the parameter estimation of GMM has received the attention of many researchers and has become one of the most active research fields [1,2]. Besides, the combination of parameter optimization algorithms with mixed models offers a new modeling option for scientific research and practical engineering [3]. Parameter estimation of GMM has also been extensively researched in deep learning, where the GMM can be viewed as a linear combination of n single Gaussian distributions [4]. In recent years, interest in this field has grown as GMM has been widely applied in data mining [5] and machine learning [6]. In the 1970s, Dempster et al. [7] originally proposed the expected maximum (EM) algorithm to perform maximum likelihood estimation from incomplete data, and this algorithm was subsequently applied with success to the parameter estimation of GMM [8]. However, the EM algorithm often converges to a locally optimal solution rather than the globally optimal solution [9,10]. To make up for this shortcoming, Ueda et al. [11] proposed a deterministic annealing-based EM

algorithm for maximum likelihood estimation problems to overcome the local maxima problem associated with the conventional EM algorithm; in their approach, the EM process is re-expressed as the problem of minimizing the thermodynamic free energy by means of a statistical mechanics analogy. Vlassis et al. [12] introduced a greedy algorithm, based on recent theoretical results on incremental mixture density estimation, for learning the GMM while trying to avoid getting trapped in one of the many local maxima of the likelihood function. Pernkopf et al. [13] proposed a genetic-algorithm-based EM approach to select the number of components of the GMM, arguing that the population-based stochastic search of genetic algorithms explores the search space more thoroughly than the EM algorithm. Although these algorithms can reach better solutions, there is still considerable room for improvement, especially in making the algorithm less sensitive to its initialization and enabling it to escape from a locally optimal solution toward the globally optimal solution [14,15]. In recent years, variational Bayesian expectation maximization (VBEM) was introduced by Nikolaos et al. [16] to address the selection of the number of components and the parameter estimation of GMM; it combines the advantages of maximum likelihood and Bayesian methodology in a unified approach. It transforms an inference problem into an approximate optimization problem through variational Bayesian learning [17]. In Bayesian learning, the distributions of parameters are modeled by hyperparameters [17]. In the case of GMM, the distributions of


parameters are viewed as Gaussian for the mean, Wishart for the covariance, and Dirichlet for the mixing probability. The learning task includes estimating the hyperparameters and characterizing these distributions [17]. Furthermore, various variants of VBEM have been widely used in science and engineering. For instance, Abrol et al. [18] introduced a deterministic annealing (DA) algorithm for stochastic variational inference (DASVI) to overcome the local optima that variational inference methods suffer from. The DA algorithm is inspired by annealing in metallurgy, an ancient practice with thousands of years of history: a metal that is slowly cooled down is more stable than one that is cooled abruptly. Miyahara et al. [19] introduced a temperature parameter that deterministically deforms the objective so as to obtain better solutions over the course of the optimization. Ren et al. [20] proposed a robust hybrid algorithm that combines the simulated annealing (SA) algorithm with the Levenberg-Marquardt technique; they argue that their method is able to avoid traps in local minima, since SA is used to reach the zone near the globally optimal point in the initial search period. The objective function of VBEM can be reformulated in a free-energy form, and a temperature parameter can be naturally incorporated into the target equation by analogy with the variational free energy [20,21]. Based on the VBEM algorithm, this paper combines the advantages of the DA algorithm and the Tsallis entropy to propose a novel algorithm called TDAVBEM. The Tsallis entropy is a generalization of the Shannon entropy of traditional extensive statistical mechanics that has become a new information measure in non-extensive statistical mechanics [22]; it is introduced in detail in Ref. [23]. This paper mainly discusses the hyperparameter initialization procedure for the GMM and its parameter estimation. Moreover, extensive experimental results show that the proposed algorithm provides faster convergence, more accurate hyperparameter estimates, and better generalization. To summarize the work of this paper, the main contributions include:
(1) A novel algorithm called TDAVBEM is proposed to address the selection of the number of components and the parameter estimation of GMM, and the convergence of TDAVBEM is analyzed, by integrating the Tsallis entropy and the DA algorithm into variational Bayesian learning.
(2) We introduce a temperature parameter that deterministically deforms the objective function so as to make our algorithm less sensitive to its initialization and able to escape from locally optimal solutions toward the globally optimal solution.
(3) We analyze the initialization procedure of TDAVBEM and evaluate it against benchmarked algorithms on both synthetic and real-world datasets; the experimental results indicate that our TDAVBEM is effective.
This paper is structured as follows: the TDAVBEM algorithm is analyzed in Section 2. Section 3 presents the application of TDAVBEM to data clustering. Section 4 reports the experiments and results. Section 5 concludes this paper.

2. The TDAVBEM algorithm

2.1. Elaboration of VBEM

The VBEM algorithm is built on variational approximation theory and the expected maximum algorithm [16,17]. It avoids the complex multi-dimensional integration of the likelihood function over high-dimensional data and uses the prior information of the variational Bayesian distribution to approximate the variational posterior distribution of the parameters. The parameters of GMM are calculated by maximizing the lower-bound value (LBV) of the logarithmic likelihood function of VBEM [24], and the LBV can be regarded as one of many criteria for ground truth clustering [25]. To be more specific, the VBEM method transforms the multivariate joint probability distribution into the product of the marginal probability distributions of each variable, and then solves the parameters of the mixed model by cyclic iteration [26,27]. In Section 3.2, we will discuss the definition of the LBV in detail. Let X = {x_1, ..., x_N} be the observational dataset; Z = {z_1, ..., z_N} the latent variables; θ the parameters of the Gaussian mixture; and P(θ, Z) the variational Bayesian distribution. Based on Jensen's inequality, the likelihood function can be bounded as in Eq. (1):

$$\ln H(X) = \ln \iint H(\theta,Z,X)\,d\theta\,dZ = \ln \iint P(\theta,Z)\,\frac{H(\theta,Z,X)}{P(\theta,Z)}\,d\theta\,dZ \ge \iint P(\theta,Z)\ln\!\left(\frac{H(\theta,Z,X)}{P(\theta,Z)}\right) d\theta\,dZ = l(P) \tag{1}$$

where H(X) is the likelihood function; H(θ,Z|X) denotes the variational posterior distribution; and l(P) denotes the LBV of the marginal log-likelihood ln H(X). Equality holds if and only if P(θ,Z) = H(θ,Z|X), in which case l(P) attains its maximum value. If the parameters θ and Z are independent of each other, that is, P(θ,Z) can be factorized as P(θ,Z) = P_θ(θ)P_z(Z), then Eq. (1) can be re-expressed as
$$l(P_\theta,P_z) = \iint P_\theta(\theta)P_z(Z)\ln\!\left(\frac{H(\theta,Z,X)}{P(\theta,Z)}\right)d\theta\,dZ$$

Therefore, the difference between the actual distribution and the approximate distribution is given by Eq. (2):
$$\ln H(X) - l(P_\theta,P_z) = -\iint P_\theta(\theta)P_z(Z)\ln\!\left(\frac{H(\theta,Z|X)}{P_\theta(\theta)P_z(Z)}\right)d\theta\,dZ = KL[P_\theta(\theta)P_z(Z)\,\|\,H(\theta,Z|X)] \ge 0 \tag{2}$$
The maximization of l(P) is therefore equivalent to the minimization of the relative entropy between P_θ(θ)P_z(Z) and H(θ,Z|X). The LBV l(P_θ,P_z) can be attained by iterative optimization over P_θ(θ) and P_z(Z), and it can be represented as
$$l(P_\theta,P_z) = \iint P_\theta(\theta)P_z(Z)\{\ln H(\theta,Z,X) - \ln P_\theta(\theta) - \ln P_z(Z)\}\,d\theta\,dZ$$
According to the variational Bayesian learning principle, the derivative of l(P_θ,P_z) with respect to each factor is
$$\frac{\partial l(\theta,Z)}{\partial P_z(Z)} = \int P_\theta(\theta)\ln H(\theta,Z,X)\,d\theta - \ln P_z(Z) + \mathrm{const}$$
$$\frac{\partial l(\theta,Z)}{\partial P_\theta(\theta)} = \int P_z(Z)\ln H(\theta,Z,X)\,dZ - \ln P_\theta(\theta) + \mathrm{const}$$
where const is a constant independent of the differentiation variable. Setting the derivatives to zero, the optimal factors of l(P_θ,P_z) can be rewritten as
$$\ln P_z^{*}(Z) = \int P_\theta(\theta)\ln H(\theta,Z,X)\,d\theta + \mathrm{const} = E_{P_\theta}[\ln H(\theta,Z,X)] + \mathrm{const}$$
$$\ln P_\theta^{*}(\theta) = \int P_z(Z)\ln H(\theta,Z,X)\,dZ + \mathrm{const} = E_{P_z}[\ln H(\theta,Z,X)] + \mathrm{const}$$
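As a quick numerical illustration of Eqs. (1)-(2), the gap between ln H(X) and the lower bound l(P) equals the KL divergence between the variational factor and the true posterior, and is therefore non-negative. The toy sketch below is our own example (a three-valued latent variable with made-up numbers), not part of the paper:

```python
import numpy as np

# Toy joint H(z, X) over a discrete latent z (X held fixed) and a trial factor P(z).
# All numbers are hypothetical; they only illustrate ln H(X) - l(P) = KL >= 0.
H_zX = np.array([0.30, 0.15, 0.05])        # H(z, X) for z = 1, 2, 3
P_z = np.array([0.50, 0.30, 0.20])         # variational distribution P(z)

log_HX = np.log(H_zX.sum())                          # ln H(X) = ln sum_z H(z, X)
lower_bound = np.sum(P_z * np.log(H_zX / P_z))       # l(P), Eq. (1)
posterior = H_zX / H_zX.sum()                        # H(z | X)
kl = np.sum(P_z * np.log(P_z / posterior))           # KL[P(z) || H(z | X)]

print(log_HX - lower_bound, kl)                      # the two gaps agree and are >= 0
```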


Thus the VBEM algorithm obtains an optimal solution by iteratively optimizing the factors P_θ(θ) and P_z(Z); in other words, the optimal value of l(P_θ,P_z) is solved alternately by two iterations, the E-step and the M-step.

The E-step: fix P_θ(θ) and update P_z^*(Z):
$$\ln P_z^{*}(Z) = E_{P_\theta}[\ln H(\theta,Z_n,X_n)] + \mathrm{const}, \qquad P_z^{*}(Z) = \prod_{n=1}^{N} P_{z_n}(Z_n)$$

The M-step: fix P_z(Z) and update P_θ^*(θ):
$$\ln P_\theta^{*}(\theta) = E_{P_z}[\ln H(\theta,Z,X)] + \mathrm{const}$$

In the process of maximizing l(P_θ,P_z), the optimal value of P_z(Z) approximates the posterior distribution of the latent variable Z, and the optimal value of P_θ(θ) approximates the posterior distribution of the parameter θ; both are evaluated at the optimal LBV of l(P_θ,P_z).

2.2. Analysis of TDAVBEM

However, the VBEM algorithm may get trapped in one of the many local maxima of the likelihood function, and its calculation is somewhat sensitive to the initial values. To overcome these limitations, our study proposes a novel algorithm called TDAVBEM, which combines the advantages of the Tsallis entropy and the DA algorithm with VBEM in a unified approach. A maximum likelihood objective function with a Tsallis entropy penalty is built to simultaneously select the number of components and estimate the parameters of GMM [18,28], and the temperature parameter of the DA algorithm is used to make the proposed algorithm less sensitive to its initialization and to find the globally optimal solution with far better convergence. Let f(θ,Z) be a new variational distribution, used to estimate the posterior distribution under the Tsallis-entropy-penalized maximum likelihood objective. The best value of f(θ,Z) is acquired by maximizing the Tsallis entropy, as shown in Eq. (3):
$$S_q[f] = -\frac{\iint f(\theta,Z)^q \ln f(\theta,Z)\,d\theta\,dZ - 1}{q-1} = -\iint f(\theta,Z)^q \ln_q f(\theta,Z)\,d\theta\,dZ \tag{3}$$
where S_q[f] represents the Tsallis entropy; q can be viewed as a measure of the non-extensiveness; and the q-logarithm is defined as ln_q(x) = (x^{q-1} - 1)/(1 - q). When q → 1, the Tsallis entropy reduces to the Shannon entropy, so Eq. (3) can be re-represented as
$$S_{q=1}[f] = -\iint f(\theta,Z)\ln f(\theta,Z)\,d\theta\,dZ$$
If the random variables X and Y are independent of each other, S_q[f] has the following property:
$$S_q[X\times Y] = S_q[X] + S_q[Y] + (1-q)\,S_q[X]\times S_q[Y]$$
Although f(θ,Z), the micro-level information, is unknown, the expectation of the logarithmic likelihood function, as its corresponding macro-level information, satisfies the condition
$$E_f[\ln H(\theta,Z,X)] = \iint f(\theta,Z)\ln H(\theta,Z,X)\,d\theta\,dZ$$
When the normalization constraint ∬ f(θ,Z) dθ dZ = 1 holds, the functional L[f] can be expressed as
$$L[f] = S_q[f(\theta,Z)] + \beta\big(E_f[\ln H(\theta,Z,X)] - \mathrm{const}\big) + \lambda\Big(\iint f(\theta,Z)\,d\theta\,dZ - 1\Big)$$
where λ is the Lagrange multiplier and β indicates the temperature parameter. The derivatives of L[f] are given by Eq. (4):
$$\frac{\partial L(f)}{\partial f_z} = -1 - \frac{q}{q-1}\ln f_z(Z)^{q-1} + \beta E_{f_\theta}[\ln H(\theta,Z,X)] + \lambda$$
$$\frac{\partial L(f)}{\partial f_\theta} = -1 - \frac{q}{q-1}\ln f_\theta(\theta)^{q-1} + \beta E_{f_z}[\ln H(\theta,Z,X)] + \lambda \tag{4}$$
Setting the derivatives in Eq. (4) to zero, the solutions are given as follows:
$$\ln f_z(Z) = \{1 + (q-1)\beta E_{f_\theta}[\ln H(\theta,Z,X)]\}^{1/(q-1)} + \mathrm{const}$$
$$\ln f_\theta(\theta) = \{1 + (q-1)\beta E_{f_z}[\ln H(\theta,Z,X)]\}^{1/(q-1)} + \mathrm{const}$$
After adding the q and β coefficients to the variational Bayesian learning procedure and combining it with the VBEM method, the TDAVBEM algorithm is summarized in Algorithm 1.

Algorithm 1: The TDAVBEM
1: Initialize f_θ(θ) from prior knowledge; δ = 10^{-10}, t = 0, q ← q_0 (0 < q_0 < 1), β ← β_0 (0 < β_0 < 1);
2: repeat
3:   Adjust the coefficients q and β, then perform the E-step and M-step:
4:   E: ln f_z(Z) = {1 + (q - 1)β E_{f_θ}[ln H(θ, Z, X)]}^{1/(q-1)};
5:   M: ln f_θ(θ) = {1 + (q - 1)β E_{f_z}[ln H(θ, Z, X)]}^{1/(q-1)};
6:   t ← t + 1, q ← 1.005 × q_0, β ← 1.005 × β_0;
7: until δ ≥ Δl_β(f).
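Algorithm 1 is stated in terms of abstract expectations. For intuition, the sketch below is our own illustrative simplification, not the authors' code: it implements the closely related deterministic-annealing EM of Ueda et al. [11] for a one-dimensional GMM, where responsibilities are tempered by a temperature parameter β that is slowly raised toward 1, the same mechanism the β loop of Algorithm 1 relies on; the Tsallis-deformed (q) updates are omitted.

```python
import numpy as np

def da_em_gmm_1d(x, k, beta0=0.2, anneal=1.05, iters=200, seed=0):
    """Deterministic-annealing EM for a 1-D GMM with tempered responsibilities."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    beta = beta0
    for _ in range(iters):
        # E-step at temperature 1/beta: r_nk proportional to (pi_k N(x_n | mu_k, var_k))^beta
        logp = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)
        logr = beta * logp
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: standard weighted updates
        nk = r.sum(axis=0) + 1e-12
        pi = nk / nk.sum()
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        beta = min(1.0, beta * anneal)   # slowly cool toward the ordinary EM objective
    return pi, mu, var
```

At β = 1 the tempered E-step reduces to the ordinary EM responsibilities; starting from a small β flattens the objective and makes the iteration far less sensitive to its initialization.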

The TDAVBEM is executed by the inner loop over q under the temperature parameter β, and it repeatedly simulates the annealing process until the maximization of l_β(f_θ,f_z) is achieved.

2.3. Convergence of TDAVBEM

In Section 2.2, we analyzed the flow of the TDAVBEM algorithm and its related formula derivation. According to Eq. (4) for the derivative of L[f], the new variational distribution f(θ,Z) is expressed as
$$f(\theta,Z) = \frac{1}{C}\exp\{1+(q-1)\beta\ln H(\theta,Z,X)\}^{1/(q-1)}, \qquad C = \iint \exp\{1+(q-1)\beta\ln H(\theta,Z,X)\}^{1/(q-1)}\,d\theta\,dZ \tag{5}$$
When q → 1 and β → 1, Eq. (5) simplifies as follows:
$$\lim_{q\to 1} f(\theta,Z) = \lim_{q\to 1}\frac{1}{C}\exp\{1+(q-1)\beta\ln H(\theta,Z,X)\}^{1/(q-1)}, \qquad \lim_{q\to 1,\,\beta\to 1} f(\theta,Z) = \frac{H(\theta,Z,X)}{\iint H(\theta,Z,X)\,d\theta\,dZ}$$
This means that the variational posterior of TDAVBEM is equivalent to the posterior distribution of VBEM; that is to say, the TDAVBEM algorithm degenerates to the VBEM algorithm. Taking the logarithm of both sides of Eq. (5) and rearranging gives Eq. (6):
$$\frac{1}{\beta}\ln\iint H(\theta,Z,X)^{\beta}\,d\theta\,dZ = \ln H(\theta,Z,X) - \frac{1}{\beta}\ln f(\theta,Z) = E_f[\ln H(\theta,Z,X)] + \frac{1}{\beta}S_q[f(\theta,Z)] \tag{6}$$
This is the variational free energy controlled by the temperature parameter, obtained according to the principles of non-extensive statistics and maximum entropy. When q → 1, Eq. (6) can be re-expressed as
$$l_\beta(f) \stackrel{\mathrm{def}}{=} \frac{1}{\beta}\ln\iint H(\theta,Z,X)^{\beta}\,d\theta\,dZ = E_f[\ln H(\theta,Z,X)] + \frac{1}{\beta}S_{q\to 1}[f] \tag{7}$$

If f_θ^{(t)} and f_z^{(t)} are known, f_θ^{(t+1)} and f_z^{(t+1)} can be determined by maximizing the LBV of l_β(f_θ^{(t)}, f_z^{(t)}), with l_β(f_θ^{(t+1)}, f_z^{(t+1)}) ≥ l_β(f_θ^{(t)}, f_z^{(t)}). The convergence of TDAVBEM is guaranteed because, taking the expectation of both sides of Eq. (7),
$$l_\beta(f|f^{(t)}) = E_{f|f^{(t)}}[\ln H(\theta,Z,X)] + \frac{1}{\beta}S_q[f|f^{(t)}] = \iint f^{(t)}(\theta,Z)\ln H(\theta,Z,X)\,d\theta\,dZ - \frac{1}{\beta}\iint f^{(t)}(\theta,Z)\ln f^{(t)}(\theta,Z)\,d\theta\,dZ$$
so that
$$\Delta l_\beta(f) = l_\beta(f^{(t+1)}|f^{(t)}) - l_\beta(f^{(t)}|f^{(t)}) = E_{f^{(t+1)}|f^{(t)}}[\ln H(\theta,Z,X)] - E_{f^{(t)}|f^{(t)}}[\ln H(\theta,Z,X)] + \frac{1}{\beta}\big(S_q[f^{(t+1)}|f^{(t)}] - S_q[f^{(t)}|f^{(t)}]\big)$$
If S_q[f^{(t+1)}] - S_q[f^{(t)}] = KL[f^{(t+1)} ∥ f^{(t)}] ≥ 0, then E_f[ln H(θ,Z,X)] is monotonically increasing during each iteration and l_β(f^{(t+1)}) ≥ l_β(f^{(t)}), which indicates that our TDAVBEM is convergent [29].

3. Clustering with the TDAVBEM

3.1. Parameter estimation of GMM

The GMM is a linear combination of K single Gaussian models, and its mathematical expression is
$$H(X|\pi,\mu,\Lambda) = \sum_{k=1}^{K}\pi_k\,N(X|\mu_k,\Lambda_k^{-1})$$
where X indicates a set of D-dimensional observed variables; µ_k represents the mean vectors; Λ_k^{-1} expresses the inverse covariance matrix; and π gives the weights of the Gaussian components. According to the joint probability distribution function, the GMM can be decomposed as follows:
$$H(X,Z,\pi,\mu,\Lambda) = H(X|Z,\mu,\Lambda)H(Z|\pi)H(\pi)H(\mu|\Lambda)H(\Lambda); \qquad f(Z,\pi,\mu,\Lambda) = f(Z)f(\pi,\mu,\Lambda)$$
Using the TDAVBEM to solve the hyperparameters of GMM, the variational distribution f(Z) of the E-step is expressed as
$$\ln f^{*}(Z) = \{1 + (q-1)\beta\,E_{\pi,\mu,\Lambda}[\ln H(X,Z,\pi,\mu,\Lambda)]\}^{1/(q-1)} = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\ln\rho_{nk} \tag{8}$$
The solution of Eq. (8) is presented as follows:
$$f^{*}(Z) = \prod_{n=1}^{N}\prod_{k=1}^{K}\gamma_{nk}^{\,z_{nk}}, \qquad \gamma_{nk} = E[z_{nk}]$$
The variational distribution f(π,µ,Λ) of the M-step is given by
$$\ln f^{*}(\pi,\mu,\Lambda) = (q-1)\beta\Big\{\ln H(\pi) + \sum_{k=1}^{K}\ln H(\mu_k,\Lambda_k) + E_z[\ln H(Z|\pi)] + \sum_{n=1}^{N}\sum_{k=1}^{K}E[z_{nk}]\ln N(x_n|\mu_k,\Lambda_k^{-1})\Big\}^{1/(q-1)}$$
According to the solutions of the E-step and M-step, the hyperparameter posteriors of GMM can be resolved as follows:
$$f^{*}(\pi) = \prod_{k=1}^{K}\pi_k^{\left(\alpha_0 + \sum_{n=1}^{N}\gamma_{nk} - 1\right)} = Dir(\pi|\alpha); \quad f^{*}(\mu_k|\Lambda_k) = N\big(\mu_k|m_k,(\gamma_0\Lambda_k)^{-1}\big); \quad f^{*}(\Lambda_k) = W(\Lambda_k|W_k,\nu_k); \quad f^{*}(\mu_k,\Lambda_k) = N\big(\mu_k|m_k,(\gamma_k\Lambda_k)^{-1}\big)\,W(\Lambda_k|W_k,\nu_k) \tag{9}$$
Based on Eq. (9), their expectations are found as follows:
$$E[\ln\pi_k] = \Psi(\alpha_k) - \Psi\Big(\sum_{k=1}^{K}\alpha_k\Big);$$
$$E[\ln|\Lambda_k|] = \sum_{i=1}^{D}\Psi\Big(\frac{\nu_k+1-i}{2}\Big) + D\ln 2 + \ln|W_k|;$$
$$E_{\mu_k,\Lambda_k}[(x_n-\mu_k)^{T}\Lambda_k(x_n-\mu_k)] = \nu_k(x_n-\mu_k)^{T}W_k(x_n-\mu_k) + D\beta_k^{-1}$$
where Ψ(θ) is the first derivative of the logarithm of the gamma function (the digamma function) and D expresses the dimension of the sample. According to the cyclic iterations of the E-step and M-step [24,30,31], the calculation flowchart is represented in Fig. 1.

Fig. 1. The flowchart of TDAVBEM for GMM.
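For reference, the plain VBEM treatment of exactly this model (a Dirichlet prior on π and Gaussian-Wishart priors on µ_k, Λ_k) is implemented in scikit-learn. The sketch below is only a usage illustration with placeholder data; it does not include the Tsallis/DA modifications proposed in this paper.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.random.default_rng(0).normal(size=(500, 2))   # placeholder data

# VBEM GMM with a Dirichlet prior on the weights; surplus components
# receive negligible weight, which is how the model selects K.
vbem = BayesianGaussianMixture(n_components=10, covariance_type="full",
                               weight_concentration_prior_type="dirichlet_distribution",
                               max_iter=500, random_state=0).fit(X)
print(vbem.weights_.round(3))    # mixing proportions pi_k
print(vbem.lower_bound_)         # the lower bound (LBV) reached at convergence
```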

3.2. Clustering with LBV

The LBV can be taken as one of the many criteria for ground truth clustering, and in our study it is used as the criterion for selecting the number of components of GMM. The simpler the underlying GMM components, the better the ground truth clustering criterion is defined. According to the criterion information of the general density model [32], the LBV criterion for GMM can be expressed as
$$\ln LBV_{nk} = (q-1)\beta\Big\{E[\ln\pi_k] + \frac{1}{2}E[\ln|\Lambda_k|] - \frac{D}{2}\ln(2\pi) - \frac{1}{2}E_{\mu_k,\Lambda_k}[(x_n-\mu_k)^{T}\Lambda_k(x_n-\mu_k)]\Big\}^{1/(q-1)}$$
where E[ln π_k] is the expectation of the mixing parameter π_k; E[ln|Λ_k|] is the expectation under the Wishart distribution; and E_{µ_k,Λ_k} is the expectation under the Gaussian-Wishart distribution. A larger value of LBV means that the selected components of GMM are better; this also implies that the GMM can uncover the clustering structure of the data and obtain excellent clustering results [33]. Note that the LBV also applies to locally optimal solutions in particular, and is not just a criterion for the globally optimal solution. In fact, in the process of finding the optimal value of LBV, the optimization iterates along a carefully picked direction so that the points in each cluster are not misclassified [34]; the smaller the number of misclassified points per cluster, the better the value of LBV. However, the optimal value of LBV is often influenced by E[ln π_k], E[ln|Λ_k|] and E_{µ_k,Λ_k}. In addition, two of the most frequently used indicators [35] are also employed to validate the number of clusters: Calinski-Harabasz (CH) [36,37] and Davies-Bouldin (DB) [38,39]. The CH index, also known as the variance ratio criterion [40], assesses cluster validity based on the ratio between the mean inter-cluster variance and the intra-cluster variance, and it is expressed as Eq. (10). The larger its value, the better the data partitioning.
$$CH = \frac{\mathrm{trace}(B)}{\mathrm{trace}(W)}\cdot\frac{N-H}{H-1} \tag{10}$$
where B indicates the error sum of squares between different clusters (between-cluster), W is the sum of squared differences of all objects in a cluster from their respective cluster center (within-cluster), N is the number of elements in the dataset, and H is the number of clusters. The DB index is based on the ratio between the intra-cluster and inter-cluster distances [41]. The smaller its value, the better the results, and it is calculated as
$$DB = \frac{1}{H}\sum_{i=1}^{H}\max_{j\ne i}\left\{\frac{\bar{d}_i + \bar{d}_j}{d_{i.j}}\right\}$$
where H is the number of clusters; d̄_i indicates the average distance between each point of the ith cluster and the centroid of that cluster (analogously for d̄_j); and d_{i.j} is the Euclidean distance between the centroids of clusters i and j.
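Both indices are available in scikit-learn. The sketch below is our own illustration with synthetic points and a hypothetical labeling; it only shows the convention used later in Figs. 6 and 7, namely that larger CH and smaller DB indicate better partitions.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

rng = np.random.default_rng(0)
# Three well-separated blobs with a matching (hypothetical) cluster assignment.
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ((0, 0), (3, 0), (0, 3))])
labels = np.repeat([0, 1, 2], 100)

print(calinski_harabasz_score(X, labels))   # CH: larger is better
print(davies_bouldin_score(X, labels))      # DB: smaller is better
```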

4. Experiments and results

4.1. Experiments on synthetic datasets

Fig. 2. Convergences of DASVI and TDAVBEM on I-dataset.

Let us consider two synthetic datasets to illustrate our motivation. These samples follow Gaussian mixture distributions, and their dimensions, sample sizes and numbers of components can be set according to actual needs [42]. The I-dataset is one-dimensional data following a Gaussian mixture distribution, made up of 500 data points generated by five Gaussian components. The II-dataset is two-dimensional data following a Gaussian mixture distribution, composed of 5000 data points generated by four Gaussian components. The DASVI and TDAVBEM algorithms are used to estimate the parameters of GMM, and the convergence of the two algorithms on the 500 data points is compared experimentally on the synthetic dataset. In the experiment, this study follows the recommendations in Ref. [18] for the initialization parameter settings of our algorithm, and the iteration is stopped when the difference between two adjacent free energy values is less than 10^{-10}. The convergence results of these two algorithms on the I-dataset are shown in Fig. 2. As can be seen from Fig. 2, the number of retained GMM components is less than 5, and this number reflects the ability of the corresponding algorithm to perform parameter estimation and select the optimal components of GMM; the smaller the number of retained components, the better the performance of the corresponding algorithm. From the figure, the DASVI algorithm reduces the original five components of GMM to four while our TDAVBEM reduces them to three, which indicates that the performance of our TDAVBEM is better than that of the DASVI algorithm. Besides, since the boundary of each Gaussian component corresponds to an elliptical contour, the elliptic profile of our TDAVBEM is better than that of DASVI; this also means that the convergence of TDAVBEM is much better than the convergence of DASVI. The LBV of TDAVBEM is also better than the LBV of DASVI, and the results of the comparison are presented in Fig. 3, where the abscissa represents the number of iterations and the ordinate represents the LBV value. In Fig. 3, the blue solid line describes the results of the proposed TDAVBEM while the red solid line presents the results of DASVI. Fig. 3 shows that the blue solid line is always higher than the red solid line, which indicates that our TDAVBEM obtains much better parameters and Gaussian components; in other words, the globally optimal solution obtained by TDAVBEM is better than that of DASVI.
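The text fixes only the sizes and component counts of the two synthetic datasets, so the sketch below generates data of the same shape with hypothetical mixture parameters (weights, means and variances are illustrative choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm_1d(n, weights, means, scales):
    """Draw n one-dimensional points from a Gaussian mixture."""
    comp = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.asarray(means)[comp], np.asarray(scales)[comp])

# I-dataset: 500 one-dimensional points from five components (illustrative parameters).
x1 = sample_gmm_1d(500, weights=[0.2] * 5, means=[-6, -3, 0, 3, 6], scales=[0.7] * 5)

# II-dataset: 5000 two-dimensional points from four components (illustrative parameters).
means2 = np.array([[0, 0], [0, 3], [3, 0], [3, 3]], dtype=float)
comp2 = rng.choice(4, size=5000, p=[0.25] * 4)
x2 = means2[comp2] + rng.normal(scale=0.5, size=(5000, 2))
```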


Table 1. Results estimated by TDAVBEM and iPMCMC on II-dataset (each cell gives TDAVBEM / iPMCMC).

Component 1: π_1 = 0.253 / 0.251; µ_1 = [0.001 0.018] / [0.003 0.021]; Λ_1 = [0.001 0.019; 0.094 0.982] / [0.002 0.021; 0.095 0.986]
Component 2: π_2 = 0.248 / 0.248; µ_2 = [0.002 0.003] / [0.002 0.004]; Λ_2 = [0.006 0.016; 0.096 0.984] / [0.006 0.017; 0.097 0.983]
Component 3: π_3 = 0.247 / 0.246; µ_3 = [-1.092 -0.095] / [-1.092 -0.095]; Λ_3 = [0.982 0.015; 0.003 2.975] / [0.983 0.014; 0.002 2.974]
Component 4: π_4 = 0.251 / 0.250; µ_4 = [0.678 0.069] / [0.689 0.072]; Λ_4 = [0.987 0.004; 0.007 2.996] / [0.984 0.003; 0.005 2.992]

Fig. 3. Comparison of LBV on I-dataset.

Another comparison experiment, between TDAVBEM and interacting particle Markov chain Monte Carlo (iPMCMC) [43], is conducted on the II-dataset. The iPMCMC combines the advantages of an interacting pool of standard and conditional sequential Monte Carlo samplers into a unified method, and it is well suited to distributed and multi-core architectures; a more detailed introduction is given in Ref. [43]. In our experiment, 5000 data points from the II-dataset are used as the benchmark. The TDAVBEM and iPMCMC methods are used to estimate the latent variables π_k, the mean vectors µ_k, and the covariance matrices Λ_k of the four Gaussian components of the II-dataset. The experimental results obtained by TDAVBEM and iPMCMC are presented in Table 1. From Table 1, we find that the π_k, µ_k and Λ_k obtained with TDAVBEM are better than those obtained by iPMCMC. For π_k, three latent variables estimated by TDAVBEM are superior to those estimated by iPMCMC, and only one has the same value; with respect to µ_k, two mean vectors estimated by TDAVBEM are better than those of iPMCMC and two are the same; with respect to Λ_k, one covariance matrix estimated by TDAVBEM has an advantage over that of iPMCMC, two are the same, and one is worse than that obtained by iPMCMC.

4.2. Experiments on real-world datasets

In this section, the performance of the proposed TDAVBEM is appraised on two widely used machine learning datasets: Iris and Glass. The Iris dataset comprises three categories, and each category consists of 50 entities with four attributes; each entity has four physical features, namely the length and width of its sepals and petals. The Glass dataset includes 214 instances that are distributed over six clusters, and each instance has nine features. In the experiments, we compare the results of TDAVBEM with two clustering algorithms: the natural semi-random algorithm for k-means clustering (NSRMC) [32] and the fuzzy clustering algorithm for the neighborhood-constrained GMM (FNCGMM) [44]. On the Iris dataset, the proposed TDAVBEM is compared with the NSRMC method; the Iris data are clustered into three classes based on the species attribute, namely Setosa (Seto), Versicolor (Vers) and Virginica (Virg). The detailed clustering results are shown in Fig. 4, where the abscissa represents the petal length and the ordinate the petal width; Fig. 4(1) shows the performance of our TDAVBEM and Fig. 4(2) that of NSRMC. From the figure, the proposed TDAVBEM obtains a better clustering result than the NSRMC algorithm; that is, TDAVBEM separates the categories of the Iris dataset very clearly with no confusion among the individual clusters, while a small number of data points are confused between Vers and Virg in the clustering results of the NSRMC method. On the Glass dataset, the proposed TDAVBEM is compared with the FNCGMM approach; the Glass data are clustered into six classes based on the type attribute: BuildWindowsFloat (Bwfl) with 70 instances, BuildWindowsNonFloat (Bwnf) with 76 instances, VehicleWindowsFloat (Vehi) with 17 entities, Containers (Cont) with 13 entities, Tableware (Tabl) with 9 entities, and Headlamps (Head) with 29 instances. The detailed clustering results are shown in Fig. 5, where the abscissa shows the aluminum content and the ordinate the calcium content; Fig. 5(1) shows the performance of our TDAVBEM and Fig. 5(2) that of FNCGMM. As can be seen from Fig. 5, TDAVBEM achieves a better clustering result than the FNCGMM approach. More specifically, TDAVBEM can accurately cluster three classes of the Glass dataset and its results on the remaining data are also relatively satisfactory, whereas FNCGMM can only accurately cluster two classes and its results for the remaining type attributes are unsatisfactory.

Fig. 4. Clustered effects on Iris dataset.

Fig. 5. Clustered effects on Glass dataset.

In order to further evaluate the clustering effect of the above methods on the benchmarked datasets, the Minkowski score (Ms) [45] is used to analyze the amount of misclassification in the clustering process, and its formula is
$$Ms(ES,RS) = \sqrt{\frac{n_{01}+n_{10}}{n_{11}+n_{01}}}$$
where ES represents the data-clustering essential solution set and RS represents the reference solution set; n_{01} indicates the number of object pairs that exist only in RS, n_{10} the number of object pairs that appear only in ES, and n_{11} the number of object pairs found in both ES and RS. The smaller the value of Ms, the better the partitioning matches the ground truth; therefore, a minimum Ms value is considered to be the optimal solution among all the non-dominated solutions. The F-measure (F1) [46] is also used to assess the clustering results; it is a trade-off measure balancing recall (RE) and precision (PR), where RE is the percentage of objects of a specific class that are accurately grouped (a metric of completeness) and PR represents the proportion of a cluster that comprises objects of a specific class. The higher the value of F1, the better the clustering quality. The RE and PR of cluster j with respect to class i are defined as
$$RE(i,j) = \frac{CN_{ij}}{CN_i}; \qquad PR(i,j) = \frac{CN_{ij}}{CN_j}$$
where CN_{ij} signifies the number of class i objects in cluster j, CN_i is the total number of objects of class i, and CN_j is the total number of objects in cluster j. Then, F1 can be calculated as
$$F1(i,j) = \frac{2\times RE(i,j)\times PR(i,j)}{RE(i,j) + PR(i,j)}$$
Furthermore, the Silhouette index (Si) [47,48] is a widely used coefficient to monitor the performance of the clustering. It is a normalized summation-type index; the number of clusters that best separates the dataset, i.e., the optimal cluster number, is determined by maximizing the value of this index. Si can be defined as
$$Si = \frac{b_i - a_i}{\max(a_i, b_i)}$$
where a_i is the average dissimilarity (how far two elements are from each other) of i with all other data within the same cluster and b_i is the minimum average dissimilarity of i to any other cluster of which i is not a member. Table 2 shows the analytical results of Ms, F1 and Si for each algorithm on the Iris and Glass datasets; No. signifies the number of iterations of each algorithm, and Time (Ti, in seconds) is the time consumed by each method.

Table 2. The results.

Dataset  Method    Ms     F1     Si     No.  Ti
Iris     TDAVBEM   0.318  0.957  0.892  65   43
Iris     NSRMC     0.379  0.906  0.866  48   31
Glass    TDAVBEM   0.426  0.885  0.871  78   62
Glass    FNCGMM    0.473  0.826  0.824  61   53

From Table 2, our TDAVBEM obtains the lower Ms and the higher F1 and Si on both the Iris and Glass datasets, which indicates that the performance of TDAVBEM is superior to both baseline methods.
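The sketch below shows one way these three external indices could be computed from a predicted labeling and the ground truth; the Ms and F1 functions follow the formulas above, the silhouette value comes from scikit-learn, and all example arrays are hypothetical, not the paper's data.

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import silhouette_score

def pair_counts(pred, truth):
    """Count co-clustered pairs in the predicted (ES) and true (RS) partitions."""
    n11 = n10 = n01 = 0
    for i, j in combinations(range(len(pred)), 2):
        in_pred = pred[i] == pred[j]
        in_true = truth[i] == truth[j]
        n11 += in_pred and in_true
        n10 += in_pred and not in_true
        n01 += in_true and not in_pred
    return n11, n10, n01

def minkowski_score(pred, truth):
    n11, n10, n01 = pair_counts(pred, truth)
    return np.sqrt((n01 + n10) / (n11 + n01))          # smaller is better

def f_measure(pred, truth, cluster, cls):
    cn_ij = np.sum((pred == cluster) & (truth == cls))
    re = cn_ij / np.sum(truth == cls)                  # recall
    pr = cn_ij / np.sum(pred == cluster)               # precision
    return 2 * re * pr / (re + pr)

pred = np.array([0, 0, 1, 1, 1, 2, 2])                 # hypothetical predicted clusters (ES)
truth = np.array([0, 0, 1, 1, 2, 2, 2])                # hypothetical ground-truth classes (RS)
X = np.array([[0., 0], [0, 1], [5, 5], [5, 6], [5, 7], [9, 0], [9, 1]])

print(minkowski_score(pred, truth))
print(f_measure(pred, truth, cluster=1, cls=1))
print(silhouette_score(X, pred))                       # Si in [-1, 1], larger is better
```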

Fig. 6. Validating index on Iris dataset.

However, our TDAVBEM needs more iterations and a longer running time than the comparative methods to satisfy the same convergence condition. Fortunately, the differences in iterations and runtime between these approaches are acceptable, judging from the comparison of results in Table 2. As shown in Fig. 6, which compares the clustering results using the CH and DB indicators on the Iris dataset, the red dotted line presents the results of NSRMC while the blue solid line describes the results of the proposed TDAVBEM. Fig. 6(1) shows that the red dotted line is much lower than the blue solid line; on the contrary, Fig. 6(2) shows that the red dotted line is much higher than the blue solid line. Hence, the clustering results of the proposed TDAVBEM have lower DB indices and higher CH indices, and consequently the proposed TDAVBEM obtains clusters that are more intra-compact and better inter-separated. Similar results can also be obtained from Fig. 7, where our TDAVBEM is compared with the FNCGMM method in terms of the CH and DB indicators on the Glass dataset; the red dotted line expresses the results of FNCGMM while the blue solid line indicates the results of our TDAVBEM. Similarly, Fig. 7(1) indicates that the dotted line is much lower than the solid line, while Fig. 7(2) shows that the dotted line is higher than the solid line. Therefore, it is verified again that our TDAVBEM can obtain more intra-compact and better inter-separated clusters.

Fig. 7. Validating index on Glass dataset.

5. Conclusions

This paper proposes a novel parameter estimation algorithm called TDAVBEM, which combines the Tsallis entropy and a deterministic annealing (DA) algorithm on the basis of the variational Bayesian expected maximum (VBEM) to simultaneously implement the parameter estimation and select the optimal components of GMM. This method can obtain a far better LBV and the globally optimal solution by maximizing the Tsallis entropy and simulating the DA algorithm. Moreover, we performed extensive experiments on benchmarked datasets to evaluate the performance of our TDAVBEM, and the results indicate that incorporating the annealing algorithm and the Tsallis entropy into the process of variational Bayesian learning can significantly boost the parameter estimation of GMM.

Declaration of competing interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. Acknowledgments The authors would like to thank the anonymous reviewers for their comments and constructive suggestions that helped improve this paper. This work is partly supported by the Project Supported by Graduate Research and Innovation Foundation of Chongqing, China (NO. CYS18006), Macao Special Project of State Ministry of Science and Technology (NO. 2015DFM10020). References [1] H. Kang, S. Choi, Probabilistic models for common spatial patterns: parameter-expanded em and variational bayes, in: Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012, pp. 970–976. [2] K.B. Petersen, O. Winther, L.K. Hansen, On the slow convergence of em and vbem in low-noise linear models, Neural Comput. 17 (9) (2005) 1921–1926. [3] D. Cournapeau, S. Watanabe, A. Nakamura, T. Kawahara, Using online model comparison in the variational bayes framework : an application to voice activity detection, IPSJ Sig Notes 20 (2009) 13–18. [4] C.M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag New York, Inc, 2006, p. 049901. [5] J. Li, L. Zheng, A. Uchiyama, L. Bin, T.M. Mauro, P.M. Elias, T. Pawelczyk, M. Sakowicz-Burkiewicz, M. Trzeciak, L. Dym, A data mining paradigm for identifying key factors in biological processes using gene expression data, Sci. Rep. 8 (05) (2018) 1–30. [6] A. Qin, Z. Shang, J. Tian, A. Li, Y. Wang, Y.Y. Tang, Maximum correntropy criterion for convex anc semi-nonnegative matrix factorization, in: IEEE International Conference on Systems, Man and Cybernetics, 2017, pp. 1856–1861. [7] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood estimation from incomplete data via the em algorithm, J. R. Stat. Soc. 39 (1) (1977) 1–38. [8] R.D. Bock, M. Aitkin, Marginal maximum likelihood estimation of item parameters: Application of an em algorithm., Psychometrika 46 (4) (1981) 443–459. [9] D.J. Warren, J.B. Thomas, Maximum likelihood estimation of mixture constants and locally optimal detection of contaminants, J. Franklin Inst. B 326 (4) (1989) 535–545. [10] E. Alpaydin, Soft vector quantization and the em algorithm, Neural Networks the Official Journal of the International Neural Network Society 11 (3) (1998) 467–477. [11] N. Ueda, R. Nakano, Deterministic annealing em algorithm, J. Soc. Instrum. Control Eng. 38 (2) (1998) 271–282.


[12] N. Vlassis, A. Likas, A greedy em algorithm for gaussian mixture learning, Neural Process. Lett. 15 (1) (2002) 77–87. [13] F. Pernkopf, D. Bouchaffra, Genetic-based em algorithm for learning gaussian mixture models., IEEE Trans. Pattern Anal. Mach. Intell. 27 (8) (2005) 1344–1348. [14] A.P. Benavent, F.E. Ruiz, J.M.S. Martinez, Ebem: An entropy-based em algorithm for gaussian mixture models., in: International Conference on Pattern Recognition, 2006, pp. 1–5. [15] M.S. Yang, C.Y. Lai, C.Y. Lin, A robust em clustering algorithm for gaussian mixture models, Pattern Recognit. 45 (11) (2012) 3950–3961. [16] N. Nikolaos, A.G. Bors, Variational learning for gaussian mixture models, IEEE Trans. Syst. Man Cybern. B 36 (4) (2006) 849–862. [17] A. Babajani-Feremi, K. Jafari-Khouzani, H. Soltanian-Zadeh, Development of a variational bayesian expectation maximization (vbem) method for model inversion of multi-area e/meg model, in: IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 2011, pp. 964–968. [18] F. Abrol, S. Mandt, R. Ranganath, D. Blei, Deterministic annealing for stochastic variational inference, Eprint Arxiv 12 (90) (2014) 1–9. [19] H. Miyahara, K. Tsumura, Y. Sughiyama, Deterministic quantum annealing expectation-maximization algorithm, J. Stat. Mech. Theory Exp. 11 (04) (2017) 1–16. [20] L. Ren, L. Lin, Simulated annealing algorithm coupled with a deterministic method for parameter extraction of energetic hysteresis model, IEEE Trans. Magn. PP (99) (2018) 1–5. [21] R. Raveendran, B. Huang, Variational bayesian approach for causality and contemporaneous correlation features inference in industrial process data, IEEE Trans. Cybern. 49 (09) (2018) 2580–2590. [22] H. Bagherlou, A. Ghaffari, A routing protocol for vehicular ad hoc networks using simulated annealing algorithm and neural networks, J. Supercomput. (2018) 1–25. [23] Y. Liu, W. Dong, M.C. Zhou, Frame-based variational bayesian learning for independent or dependent source separation, IEEE Trans. Neural Netw. Learn. Syst. PP (99) (2018) 1–14. [24] M. Lundgren, L. Svensson, L. Hammarstrand, Variational bayesian expectation maximization for radar map estimation, IEEE Trans. Signal Process. 64 (6) (2016) 1391–1404. [25] M. Yasuda, Deterministic annealing approach to fuzzy c-means clustering based on entropy maximization, Adv. Fuzzy Syst. 2011 (9) (2011) 1–9. [26] M. Fatemi, L. Svensson, L. Hammarstrand, M. Lundgren, Variational bayesian em for slam, in: IEEE International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2015, pp. 501–504. [27] F. Valente, Variational bayesian gmm for speech recognition, in: European Conference on Speech Communication and Technology, Eurospeech 2003 - INTERSPEECH 2003, Geneva, Switzerland, September, 2016, pp. 47–59. [28] V. Cottet, P. Alquier, 1-bit matrix completion: Pac-bayesian analysis of a variational approximation, Mach. Learn. 107 (3) (2018) 579–603. [29] P. Bruneau, M. Gelgon, F. Picarougne, Parsimonious variational-bayes mixture aggregation with a poisson prior, in: Signal Processing Conference, 2009 European, 2015, pp. 1499–1503. [30] T. Kimura, T. Tokuda, Y. Nakada, T. Nokajima, T. Matsumoto, A. Doucet, Expectation-maximization algorithms for inference in dirichlet processes mixture, Pattern Anal. Appl. 16 (1) (2013) 55–67. [31] V. Ro¨cková, E.I. George, Emvs: The em approach to bayesian variable selection, J. Amer. Statist. Assoc. 109 (506) (2014) 828–846. [32] P. Awasthi, A. 
Vijayaraghavan, Clustering semi-random mixtures of gaussians, arxiv.org/abs/1711.08841, 23 (11) (2017) 1–25. [33] I. Naim, D. Gildea, Convergence of the em algorithm for gaussian mixtures with unbalanced mixing coefficients, in: International Coference on International Conference on Machine Learning, 2012, pp. 1–8. [34] D. Wraith, F. Forbes, Location and scale mixtures of gaussians with flexible tail behaviour: properties, inference and application to multivariate clustering, Comput. Statist. Data Anal. 90 (2015) 61–73. [35] O. Arbelaitz, I. Gurrutxaga, J. Muguerza, J.M. Pérez, I. Perona, An extensive comparative study of cluster validity indices, Pattern Recognit. 46 (1) (2013) 243–256. [36] T. CaliKński, J. Harabasz, A dendrite method for cluster analysis, Comm. Statist. Theory Methods 3 (1) (1974) 1–27. [37] L. Silva, Thermal signal analysis for breast Cancer risk verification, in: MedInfo 2015 - 15th World Congress on Health and Biomedical Informatics, 2015, pp. 746–750. [38] D.L. Davies, D.W. Bouldin, A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. 1 (2) (1979) 224–227. [39] R. Xu, J. Xu, D. Wunsch, A comparison study of validity indices on swarmintelligence-based clustering, IEEE Trans. Syst. Man Cybern. B 42 (4) (2012) 1243–1256. [40] D.G.D.B. Franco, M.T.A. Steiner, Clustering of solar energy facilities using a hybrid fuzzy c-means algorithm initialized by metaheuristics, J. Cleaner Prod. 191 (4) (2018) 445–457. [41] X. Li, K.C. Wong, Evolutionary multiobjective clustering and its applications to patient stratification, IEEE Trans. Cybern. 49 (05) (2018) 1680–1693. [42] C. Peng, N. Zabaras, I. Bilionis, Uncertainty propagation using infinite mixture of gaussian processes and variational bayesian inference, J. Comput. Phys. 284 (1) (2015) 291–333.

[43] T. Rainforth, C.A. Naesseth, F. Lindsten, B. Paige, Jan-Willem, Interacting particle markov chain monte carlo, J. R. Stat. Soc. 72 (03) (2016) 269–342. [44] Q. Zhao, H. Zhang, X. Zhao, L.I. Yu, Fuzzy clustering image segmentation based on neighborhood constrained gaussian mixture model, Pattern Recognit. Artif. Intell. 30 (3) (2017) 214–223. [45] S. Saha, S. Bandyopadhyay, A new symmetry based genetic clustering technique for automatic evolution of clusters, IEEE Trans. Knowl. Data Eng. 43 (3) (2010) 738–751. [46] A. Kumarjain, S. Maheshwari, Phrase based clustering scheme of suffix tree document clustering model, Int. J. Comput. Appl. 63 (10) (2013) 30–37. [47] L. Chen, W. Du, Y. Wang, H. Sun, Y. Sun, Y. Liang, Silhouette index supervised affinity propagation clustering, J. Bionanosci. 7 (4) (2013) 447–452. [48] I.A. Pagnuco, J.I. Pastore, G. Abras, M. Brun, V.L. Ballarin, Analysis of genetic association in listeria and diabetes using hierarchical clustering and silhouette index, J. Phys. Conf. Ser. 705 (1) (2016) 1–9.

Linchang Zhao received the B.S. degree in the College of Computer Science, Northeast Petroleum University, Daqing, China, in 2013. In 2017, he got the master’s degree in the School of mathematics and statistics, Qiannan Normal College for Nationalities, Duyun, China. He is currently studying toward the Ph.D. degree with the College of Computer Science, Chongqing University. His research interests include pattern recognition, computer vision, machine learning, and deep learning.

Zhaowei Shang received the Ph.D. degree in Electronics information from Xi’an Jiao University, and the postdoctoral station in computer science from Chongqing University, China, in 2005, and 2010, respectively. Currently, he is working as a professor in the Department of Computer Science at Chongqing University and a visiting research fellow at the Faculty of Science and Technology, University of Macau. His research interests include pattern recognition, image processing, and machine learning. He has published extensively in the IEEE Transactions on Image Processing, Pattern Recognition, Neurocomputing, etc. He is a member of the IEEE.

Jin Tan received the B.S., M.S. degrees in Telecommunication Engineering, from University of Electronic Science and Technology of China in 2005 and 2008. She is currently working toward the Ph.D. degree in Computer Science, Chongqing University, Chongqing, China. Her research interests include signal processing, pattern recognition, image processing, and machine learning.

Xiaoliu Luo received the B.S. degree in Mathematics and Applied Mathematics, from Ningxia University in 2016. She is currently working toward the Ph.D. degree in Computer Science, Chongqing University, Chongqing, China. Her research interests include pattern recognition, image processing, graph learning and machine learning.

Taiping Zhang received the B.S. and M.S. degrees in computational mathematics, and the Ph.D. degree in computer science from Chongqing University, Chongqing, China, in 1999, 2001, and 2010, respectively. He is currently an associate professor in the Department of Computer Science, Chongqing University. His research interests include pattern recognition, image processing, machine learning, and computational mathematics. He has published extensively in the IEEE Transactions on Image Processing, the IEEE Transactions on Systems, Man, and Cybernetics, Part B (TSMCB), the IEEE Transactions on Knowledge and Data Engineering, Pattern Recognition, etc. He is a member of the IEEE.

Yu Wei received the B.S. degree in Mathematics and Applied Mathematics from Minzu University of China, Beijing, China, in 1982. Currently, he is working as a professor in the Department of Mathematics and Statistics at Qiannan Normal College For Nationalities, and a Professor/Adjunct Professor/Honorary Professor with several institutes. His current research interests include mathematics education, applied mathematics, and fundamental mathematics. He has published over 100 academic papers and has authored/coauthored more than 20 monographs/books/bookchapters.


Yuanyan Tang received the B.Sc. degree in electrical and computer engineering from Chongqing University, Chongqing, China, the M.Eng. degree in electrical engineering from the Beijing Institute of Posts and Telecommunications, Beijing, China, and the Ph.D. degree in computer science from Concordia University, Montreal, QC, Canada in 1966, 1981, and 1990, respectively. He is a Chair Professor of the Faculty of Science and Technology, University of Macau, Macau, China, and a Professor/Adjunct Professor/Honorary Professor with several institutes, including Chongqing University, Concordia University, and Hong Kong Baptist University, Hong Kong. His current research interests include wavelets, pattern recognition, and image processing. He has published over 400 academic papers and has authored/coauthored more than 25 monographs/books/bookchapters. He is the Founder and the Chair of the Macau Branch of International Associate of Pattern Recognition (IAPR). He is a fellow of the IAPR.