Probability model selection and parameter evolutionary estimation for clustering imbalanced data without sampling


Neurocomputing ∎ (∎∎∎∎) ∎∎∎–∎∎∎

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Jiancong Fan a,b,c,*, Zhonghan Niu b, Yongquan Liang a,b, Zhongying Zhao b

a Provincial Key Lab. for Information Technology of Wisdom Mining of Shandong Province, Shandong University of Science and Technology, Qingdao 266590, China
b College of Information Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
c State Key Lab. for Novel Software Technology, Nanjing University, Nanjing 210023, China

Article info

Article history: Received 1 September 2015; Received in revised form 15 October 2015; Accepted 26 October 2015

Keywords: Imbalanced data; Data mining; Clustering; Model selection

Abstract

Data imbalance problems arising from the accumulation of data, especially big data, have become a challenging issue in recent years. In imbalanced data, the minor data sets often carry important patterns. Although there are some approaches for discovering class patterns, few of them have been applied to clustering minor patterns. Commonly, the minor samples are submerged in big data, and without the supervision of a training set they are often ignored and assigned to major patterns by mistake. Since clustering minorities is an uncertain process, in this paper we employ model selection and evolutionary computation to address the uncertainty and concealment of the minor data in imbalanced data clustering. Given a data set, model selection chooses a model from a set of candidate models. We select probability models as candidate models because they handle uncertainty effectively and are therefore well suited to data imbalance. Considering the difficulty of estimating the models' parameters, we employ an evolutionary process to adjust and estimate the optimal parameters. Experimental results show that our proposed approach for clustering imbalanced data is able to search for and discover minor patterns, and that it also performs better than many other relevant clustering algorithms on several performance indices. © 2016 Elsevier B.V. All rights reserved.

* Corresponding author at: Provincial Key Laboratory for Information Technology of Wisdom Mining of Shandong Province, Qingdao, 266590, China. E-mail address: [email protected] (J. Fan).

http://dx.doi.org/10.1016/j.neucom.2015.10.140
0925-2312/© 2016 Elsevier B.V. All rights reserved.

Please cite this article as: J. Fan, et al., Probability model selection and parameter evolutionary estimation for clustering imbalanced data without sampling, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2015.10.140

1. Introduction

The class imbalance problem is a challenging issue in the data mining community because most data mining algorithms pay more attention to learning from the major samples. Minor samples are often ignored and misclassified into major sample sets, which leads to high errors or low precision of the discovered patterns. In the class imbalance problem, one or several classes include far fewer data instances than the others, while these few instances probably play a key role in some data mining tasks. For example, a great many texts are published on the Internet every day, and we want to analyze these data and discover the relations among hot topics or events. These topics or events are minor among all data on the Internet, but they are important. Similar situations in real-world applications include medical diagnosis, fraud detection, telecommunication analysis, Web mining, etc.

Most of the recent efforts focus on two-class imbalanced datasets, in which one class is the minority class (also called the positive class) and the other is the majority class (also called the negative class). However, multi-class imbalance problems also occur frequently. For example, some local events on the Web emerge during different periods. These local data instances are a minority compared with global event data, but depending on the goal they can be as important as the global majority. Therefore, in this paper, we not only consider the two-class imbalanced data clustering problem but also attempt to solve the multi-class problem.

There are two kinds of methods available for learning from imbalanced data. One is the preprocessing approach, such as sampling and feature selection; the other is the algorithmic approach, such as ensembles of classifiers and modifications of current algorithms, which deals with imbalanced data by designing algorithms to construct learners. Up to now, most of these approaches are suited to the binary-class imbalance problem. In recent years, multi-class imbalanced data has attracted increasing attention. However, most of the existing approaches apply to supervised classification, and there are only a few discussions focusing on the imbalanced data clustering problem.

In this paper, we study an imbalanced data clustering algorithm based on the selection of probability models and evolutionary parameter estimation for these models. The reason why we apply a probabilistic approach to the class imbalance problem is analyzed in the next section, which also reviews the related work and its characteristics. Section 3 contains preliminaries and background, including model selection and parameter evolutionary estimation. In Section 4, we detail the proposed approach based on probability model selection without over- and under-sampling. Section 5 introduces the experimental setup, including the data sets, the selected probability models with their corresponding parameters and the statistical testing, and then presents the experimental results and an analysis against the most significant algorithms for imbalanced data. Finally, in Section 6, we present our concluding remarks.

2. Related work

In this section, we first introduce related work on the class imbalance problem in classification and clustering. Then, we introduce the hot topics in imbalanced data clustering.

2.1. Related work about learning from class imbalanced datasets

As stated in Section 1, a dataset is said to be imbalanced when the number of data instances of one or several classes is much smaller than that of the others. Furthermore, the positive (smaller) class is as important as or more important than the negative one. We define an imbalance problem as multi-class imbalance when there is more than one positive class in a dataset. In contrast, if there is only one positive class and one negative class in a dataset, the problem is called two-class imbalance.

Many studies and applications on the classification of two-class imbalanced datasets exist in a great many fields, such as medical diagnosis [1–5], fraud detection [6–8], credit assessment [9], malcode detection [10,11], and network traffic [12]. In addition to application studies, there are many empirical and theoretical studies of the two-class imbalance problem [13–22].

Beyond two-class imbalance problems, multi-class problems are attracting growing interest. Zhou et al. [22] suggested that the multi-class imbalance problem is more difficult than the two-class task: almost all the approaches were effective on the two-class task, while most were ineffective and might even have a negative effect on the multi-class task. Nevertheless, there is an increasing number of contributions and endeavors to solve this problem [23–30]. Most of this literature studies the problem with standard classification algorithms combined with over- or under-sampling, with neural networks, or with purely empirical methods. So far, probability-based approaches have not been considered as a means of mining minority classes in skewed data distributions.

The above literature mainly focuses on the classification problem. However, in many real-world applications, unlabeled data are more common and the number of labeled examples is often limited. Especially for imbalanced data, the minority instances are probably not selected for labeling as training examples because of their small ratio in a large dataset. In this situation, clustering imbalanced data is a practically applicable scenario and worthwhile to study.

2.2. Existing methods for clustering imbalanced data

It is well known that a large number of clustering methods have been proposed and examined in many areas, both theoretically and empirically. However, there are only a few methods for clustering imbalanced data [31–34]. A differential evolution clustering hybrid resampling algorithm was proposed for the over-sampling process to enlarge the ratio of positive samples; it used mutation and crossover operators similar to those of Differential Evolution (DE) to cluster the oversampled training dataset [31]. In [32], K-means was used as the base learner of an ensemble to study the clustering problem on class-imbalanced data. The cDNA microarray time-series imbalanced data was studied with the PAM clustering algorithm in [33], but that work did not investigate how the clustering method influences the results. All three of these imbalanced data clustering approaches used a sampling process to balance the distribution of classes.

A spectral clustering approach for imbalanced data without sampling was proposed in [34]; it introduced a graph partitioning framework that parameterizes a family of graphs by adaptively modulating node degrees in a k-NN graph. Although the spectral approach did not use sampling, it had two limitations: (1) its main assumption was that prior knowledge of the smallest cluster size had to be obtained in advance; (2) the labeled samples for semi-supervised learning used in spectral clustering had to include at least one sample from each class.

For the multi-class imbalance problem, to the best of our knowledge, a systematic approach that adapts the clustering process to possibly multi-class imbalanced data has not yet appeared. Although the multi-class imbalance problem has received more and more attention, the research emphasis has been on supervised classification.

3. Preliminaries and proposed concepts

In this section, we briefly introduce several basic concepts employed in our proposed approach, such as model selection and probability-based model selection, and evolutionary computation and evolutionary computation-based parameter estimation. We also specify the advantages of applying the probability model selection approach to the class imbalance problem.

3.1. Model selection and probability model selection

Model selection is the process of identifying the best approximating model for the problem to be modeled and solved. The goal of model selection is to select approximately true predictive models that best fit the observed data. Many approaches are used for model selection, such as maximum likelihood (ML), hypothesis testing (HT), Akaike's information criterion (AIC), Bayesian information criterion (BIC), and cross validation (CV).

The main reason for employing ML is that ML is principally a method of parameter estimation, which is one of the important steps in model selection. Thus ML extends straightforwardly to model selection.

HT is a classical methodology of statistics that can be applied to many problems in model selection. HT can be applied to the following situation. Let x be a variable (also called parameter) ranging over a data set. The hypothesis θ = 0 specifies a probability density f(x; θ = 0). Let $\hat{\theta}$ be the ML estimate of θ; $\hat{\theta}$ is a function of x whose probability distribution is determined by θ = 0 and is written g($\hat{\theta}$; θ = 0). If θ = 0 is chosen as the imprecise hypothesis $\bar{H}$, we may set up a 5% critical region, or rejection set, such that $\bar{H}$ is not rejected when $\hat{\theta}$ falls outside this region, that is, when $\hat{\theta}$ is sufficiently close to 0.

The goal of AIC is to minimize the Kullback-Leibler (K-L) distance of the selected probability density from the true density. In clustering, however, the true density is unknown because there is no training set or other prior information about the data set. So AIC cannot be used in clustering approaches. BIC uses the Bayes method to select models, which needs the prior probabilities of all models and then derives an asymptotic expression for the likelihood of each model. But the prior probabilities are difficult to estimate for the same reason as with AIC, so BIC is also not adopted in our proposed approach.
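For reference, the K-L distance that AIC seeks to minimize is commonly written as follows. The symbols $p$ for the true density and $f(\cdot;\theta)$ for the selected density are notation assumed here for illustration, not taken from the original text:

$$D_{\mathrm{KL}}(p \,\|\, f) = \int p(x) \ln \frac{p(x)}{f(x;\theta)} \, dx$$

The integrand depends on the true density $p$, which is the quantity the discussion above treats as unavailable in clustering.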


ML, HT, AIC, and BIC are several common approaches for model selection. In this paper we apply probabilistic model selection to cluster imbalanced data. Definition 1 describes probabilistic model selection.

Definition 1. Let $f_{p_1}, f_{p_2}, \cdots, f_{p_m}$ be m probability density functions. These functions constitute a set $F_p = \{f_{p_1}, f_{p_2}, \cdots, f_{p_m}\}$, which is called the probabilistic model set. Probability model selection is to select from $F_p$ the models that contain and correctly describe as many data instances as possible. Probability model selection can be written as:

$$f_{p_i} \leftarrow \max_{j = 1, 2, \cdots, m} \left\{ \max f_{p_j} \right\} \tag{1}$$

and

$$\Theta = \arg\max f_{p_i}\left(x \mid \theta_1, \theta_2, \cdots, \theta_t\right) \tag{2}$$

where $f_{p_i}$ in Eq. (1) denotes the ith model in $F_p$, the one that contains and correctly describes most of the data instances. Eq. (2) stands for the argument set of the maximum.

3.2. Evolutionary computation and parameter evolution

Inspired by the biological mechanisms of natural evolution [35], evolutionary computation has been developed and widely applied in optimization. It uses an iterative process to generate populations of candidate solutions that improve on the previous ones. Each population is produced by a guided random search aimed at the desired solutions. Evolutionary computation adopts three operators: selection, recombination and mutation. Selection picks the best individuals in the current population, while recombination and mutation create the necessary diversity in the next generation and thereby facilitate novelty. The aim of evolutionary computation is to search for the optimal solutions in the solution space.

Parameter estimation, in particular for a probability model, refers to the process of using data instances to estimate the parameters of the selected probability distribution. Several parameter estimation methods are available, such as probability plotting, least squares, maximum likelihood estimation and Bayesian estimation. In this paper we employ an evolutionary algorithm to estimate the parameters of probability models. The idea is to select a model $f_{p_i}$ from $F_p$ with one or several initial parameters. The parameters are repeatedly modified by an evolutionary algorithm, such as a genetic algorithm, according to different data samples until the probability model whose parameters best fit the data set is derived.

3.3. Advantages of probability model selection and model parameter evolution for class imbalance problem

The class imbalance problem focuses on minority data (the positive class). Usually, the minority are particular data instances that differ from the normal ones (the negative class). Therefore, the probability distribution of the minority differs correspondingly from that of the majority, and we can cluster the minorities that obey the corresponding distributions by estimating and constructing probability models. Because the positive class contains only a small quantity of data instances, estimating the models for these minority data and the models' parameters is difficult. The model can be selected from $F_p$, whose size is finite, but each parameter value that fits a data cluster is difficult to guess and estimate because of the parameters' continuity and instability. Continuity means that a probability model's parameter takes values from a continuous domain. For example, the normal distribution has two parameters, expectation and variance, which change continuously with different data samples. Instability refers to the uncertainty and randomness of the data instances' occurrence, which makes the distribution parameters difficult to predict, especially for the minor data instances in the positive class. Therefore, it is difficult for classical parameter estimation approaches to estimate and predict appropriate parameter values of distribution models for the positive class. For this reason, an evolutionary approach should be a good choice, because it is an approximate method for complex problems.

On the other hand, traditional class or cluster learning methods, such as SVM-, ANN-, K-means-, Bayes-, and hierarchy-based algorithms, are not well suited to the class imbalance problem. These algorithms, which perform well on class-balanced data, demand enough data instances of each class or cluster to train stable models or cluster centroids. However, in the class imbalance problem, the positive data instances are probably not enough for constructing such models.
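The selection step of Definition 1 can be illustrated with a minimal Python sketch. It assumes one-dimensional data, reads "best describes the data" as highest log-likelihood, and uses closed-form ML estimates as a stand-in for the evolutionary estimation described above. The function names and the three-model candidate set are hypothetical choices made for this illustration only.

```python
import math
import random
import statistics

def normal_loglik(data, mu, sigma):
    """Log-likelihood of the data under N(mu, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def uniform_loglik(data, a, b):
    """Log-likelihood under U(a, b); valid when every point lies in [a, b]."""
    return -len(data) * math.log(b - a)

def exponential_loglik(data, lam):
    """Log-likelihood under Exp(lam); valid when every point is >= 0."""
    return sum(math.log(lam) - lam * x for x in data)

def select_model(data):
    """Return (name, parameters) of the candidate with the highest log-likelihood.

    Assumption: parameters come from closed-form ML estimates,
    not from the evolutionary search used in the paper.
    """
    if len(set(data)) < 2:
        raise ValueError("need at least two distinct values")
    mu, sigma = statistics.fmean(data), statistics.pstdev(data)
    a, b = min(data), max(data)
    fits = {
        "normal": ((mu, sigma), normal_loglik(data, mu, sigma)),
        "uniform": ((a, b), uniform_loglik(data, a, b)),
    }
    if a >= 0:  # the exponential model only supports non-negative data
        lam = 1.0 / mu
        fits["exponential"] = ((lam,), exponential_loglik(data, lam))
    best = max(fits, key=lambda name: fits[name][1])
    return best, fits[best][0]

# Usage on synthetic data (hypothetical example values).
rng = random.Random(0)
sample = [rng.gauss(5.0, 1.0) for _ in range(500)]
print(select_model(sample)[0])
```

Comparing raw likelihoods favors flexible models, so a practical variant might penalize the number of parameters; the sketch leaves that out to stay close to Eqs. (1) and (2).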

4. Proposed approach

In this section, we introduce our proposed approach, beginning with its formal description. Let f(x, Θ) be a probability model and Θ be its parameter set. According to Definition 1 we obtain:

$$f(x \mid \Theta) \leftarrow f\left(x \mid \arg\max f_{p_i}(x \mid \theta_1, \theta_2, \cdots, \theta_t),\ i = 1, 2, \cdots, m\right), \tag{3}$$

where m is the number of models in $F_p$. In order to get the optimal values of the parameters in Θ, we employ evolutionary computation:

$$\Theta \leftarrow EO(\Theta) = EO(\theta_1, \theta_2, \cdots, \theta_t) = g(\theta_1, \theta_2, \cdots, \theta_t \mid X) = g(\theta_1, \theta_2, \cdots, \theta_t \mid x_1, x_2, \cdots, x_n), \tag{4}$$

where EO(·) represents the process of Evolutionary Optimization, g(·) is the evolution operator, and:

$$g(\cdot) = S(\cdot) \ \text{or} \ M(\cdot) \ \text{or} \ R(\cdot), \tag{5}$$

where S(·) is the selection operator, M(·) is the mutation operator, and R(·) is the recombination operator. Then,

$$\Theta' = \arg\max f(x \mid \Theta) = g(\theta_1, \theta_2, \ldots, \theta_t \mid x_1, x_2, \ldots, x_n) \tag{6}$$
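Eqs. (4)–(6) can be sketched in Python as a small evolutionary loop. This is an illustration under stated assumptions rather than the authors' implementation: the model is a one-dimensional normal distribution, fitness is the log-likelihood, and the concrete forms of S, R and M (truncation selection, blend recombination, clipped Gaussian mutation) as well as the population size, generation count and mutation scale are hypothetical choices.

```python
import math
import random

def log_likelihood(params, data):
    """Fitness: log-likelihood of the data under N(mu, sigma^2)."""
    mu, sigma = params
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (x - mu) ** 2 / (2 * sigma ** 2) for x in data)

def select(population, data, keep):
    """S: keep the `keep` fittest parameter vectors."""
    ranked = sorted(population, key=lambda p: log_likelihood(p, data), reverse=True)
    return ranked[:keep]

def recombine(p, q, rng):
    """R: random convex blend of two parents (stays inside the bounds)."""
    w = rng.random()
    return tuple(w * a + (1 - w) * b for a, b in zip(p, q))

def mutate(p, bounds, rng, scale=0.1):
    """M: Gaussian perturbation of each parameter, clipped to its bounds."""
    return tuple(min(hi, max(lo, v + rng.gauss(0, scale * (hi - lo))))
                 for v, (lo, hi) in zip(p, bounds))

def evolve(data, bounds, pop_size=30, generations=60, seed=0):
    """Search the bounded parameter space for the likelihood maximizer."""
    rng = random.Random(seed)
    population = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
                  for _ in range(pop_size)]
    for _ in range(generations):
        parents = select(population, data, pop_size // 2)
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            children.append(mutate(recombine(p, q, rng), bounds, rng))
        population = parents + children  # parents survive, so the best never degrades
    return select(population, data, 1)[0]

# Usage: bounds of the form [estimate - eps, estimate + eps]; the lower bound
# on sigma must be positive. All numbers below are hypothetical.
rng = random.Random(1)
data = [rng.gauss(10.0, 2.0) for _ in range(300)]
mu_hat, sigma_hat = evolve(data, bounds=[(8.0, 12.0), (1.0, 3.0)])
```

Keeping the selected parents in the next generation is a design choice that makes the best fitness non-decreasing, which is one simple way to obtain the convergent behavior the text requires of g(·).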

From Eq. (6), as long as g(·) is convergent, we can get the optimal parameter set Θ′, which makes f(x | Θ) describe a data cluster to the largest extent. Eqs. (3)–(6) show why our proposed approach is well suited to clustering class-imbalanced data. Eq. (3) describes the maximum likelihood estimation method for model selection, which matters especially in clustering analysis because there are no training data samples in the clustering process. Eq. (4) gives the evolutionary optimization process for the parameters of the candidate models. This process applies evolution operators to search for better values of all parameters in each model and selects the model whose optimal parameter values suit the distribution of a cluster. The combination of model selection and parameter evolution is expressed by Eq. (6), which conveys that the fittest models with the optimal parameter values for data clusters can be obtained by maximum likelihood estimation and the evolutionary process. The algorithm pseudocode of this approach is given in Fig. 1. The algorithm consists of three key steps: model selection, hypothesis testing, and parameter optimization.

Fig. 1. Algorithm pseudocode of the proposed approach.

The proposed approach includes several stages. The first stage is probability model preparation and initialization. In this stage, the probability models most likely to suit the data instances are chosen, and the corresponding parameters of these models are initialized, which is equivalent to population initialization in evolutionary computation. This stage thus has two initial tasks: model set initialization and model parameter set initialization. In this paper, the selected models are listed as follows:

(1) Multi-dimensional normal distribution

An n-dimensional random vector obeys the multi-dimensional normal distribution X = (X1, …, Xn) ~ N(μ, Σ). The density function of the multi-dimensional normal distribution is:

$$f(x, \Theta) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) \tag{7}$$

where x is the n-dimensional vector x = (x1, x2, …, xn), Θ denotes the parameter set {μ, Σ}, μ = (μ1, …, μn) is the mean vector of X, Σ is the n × n covariance matrix of X, which is positive definite, and |Σ| is the determinant of Σ.

(2) Uniform distribution

The probability density function of the uniform distribution is:

$$f(x) = \begin{cases} \dfrac{1}{b - a} & \text{for } a \le x \le b, \\ 0 & \text{for } x < a \text{ or } x > b \end{cases} \tag{8}$$

The cumulative distribution function is:

$$F(x) = \begin{cases} 0 & \text{for } x < a \\ \dfrac{x - a}{b - a} & \text{for } a \le x < b \\ 1 & \text{for } x \ge b \end{cases}$$

(3) Exponential distribution

The probability density function of the exponential distribution is:

$$f(x, \Theta) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases} \tag{9}$$

where Θ denotes the parameter set {λ}, and λ (λ > 0) is the parameter of the distribution. The cumulative distribution function is as follows:

$$F(x; \lambda) = \begin{cases} 1 - e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}$$

(4) Logarithmic normal distribution

Consider a log-normally distributed random variable X with parameters μ and σ, which respectively represent the mean and standard deviation of the variable's natural logarithm. The probability density function of the logarithmic normal distribution is:

$$f(x, \Theta) = \begin{cases} \dfrac{1}{x \sigma \sqrt{2\pi}} e^{-\frac{(\ln x - \mu)^2}{2\sigma^2}} & x > 0 \\ 0 & x \le 0 \end{cases} \tag{10}$$

where Θ denotes the parameter set {μ, σ}. The cumulative distribution function of the log-normal distribution is as follows:

$$F(x) = \Phi\left(\frac{\ln x - \mu}{\sigma}\right)$$

where Φ is the cumulative distribution function of the standard normal distribution.

For the model parameter set initialization, we take the normal distribution as an example. Assume the parameters μ and σ represent the mean and standard deviation, respectively, and that their initial values are the estimates μe and σe. In order to decrease the search scope in the evolution stage, we need to set lower and upper bounds for the parameters, e.g., μ takes values in [μe ± ε1] and σ takes values in [σe ± ε2], where ε1 and ε2 are particular values set according to the concrete data instances.

The second stage is the parameter evolution. Evolution optimization (EO) is a search heuristic that simulates the process of natural selection and solves optimization problems with techniques inspired by it, including selection, crossover and mutation. If the set $c_k$ fails to pass the optimization condition (in this paper, the hypothesis test), new parameter values are generated according to the samples' values. The parameter with the best fitness is selected by the EO algorithm. The main steps of parameter optimization are illustrated in Fig. 2.

Fig. 2. Procedure of parameter optimization.

Another stage is the hypothesis testing. In fact, this stage is merged with the parameter evolution stage and can be viewed as playing the role of the fitness function for the evolution. Hypothesis testing is a statistical inference method that assesses the validity of a proposition about an unknown population distribution. Commonly used methods of hypothesis testing include the t-test, z-test, chi-squared test, F-test, Kolmogorov-Smirnov test (KS-test), etc. Hypothesis testing is applied to detect whether the distribution of $c_k$ is the same as the distribution obtained in the previous step once the number of elements in $c_k$ reaches a threshold. The KS-test is selected for hypothesis testing in this paper because it is non-parametric and distribution free. Unlike the t-test, which is restricted to the normal distribution, it can be used on a broader scope, which suits the different distributions selected by the proposed algorithm. The distributions used for the comparison are mainly the four distributions listed above. Each dimension of the multidimensional data needs a KS-test, and $c_k$ passes the hypothesis test only if every dimension attribute of $c_k$ passes it. The procedure of hypothesis testing is shown in Fig. 3.

In addition, in order to increase the clustering speed and accelerate convergence, we devise a technique that generates random numbers $rand_i$ in the range of the parameter values, e.g., μ in [μe ± ε1] and σ in [σe ± ε2] for the normal distribution. Let $d_i$ be a distance function for the ith model, $rand_i$ be the set of random numbers obeying a specific probability distribution, and θ be a specific threshold. For all $x_j$ in X, if $d_i(rand_i - x_j) \le \theta$, the corresponding model continues with parameter evolution and hypothesis testing; otherwise another model must be chosen to repeat this process.
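The per-dimension KS check described above can be sketched in Python as follows. This is a minimal illustration, not the procedure of Fig. 3: it computes the one-sample KS statistic against a candidate CDF and compares it with the large-sample 5% critical value, approximated here as 1.36/√n. The function names, the choice of critical value, and the representation of a cluster as a list of tuples are assumptions made for the example.

```python
import math

def normal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def uniform_cdf(x, a, b):
    if x < a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def exponential_cdf(x, lam):
    return 1 - math.exp(-lam * x) if x >= 0 else 0.0

def lognormal_cdf(x, mu, sigma):
    return normal_cdf(math.log(x), mu, sigma) if x > 0 else 0.0

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # Compare with the empirical CDF just after and just before x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def passes_ks(sample, cdf, c_alpha=1.36):
    """True if the fit is not rejected at roughly the 5% level (large n)."""
    return ks_statistic(sample, cdf) <= c_alpha / math.sqrt(len(sample))

def cluster_passes(cluster, cdfs):
    """A cluster passes only if every dimension passes its own KS test.

    cluster: list of points (tuples); cdfs: one single-argument CDF per dimension.
    """
    return all(passes_ks([point[d] for point in cluster], cdf)
               for d, cdf in enumerate(cdfs))

# Usage with a deterministic sample laid out on the quantiles of U(0, 1).
grid = [(i - 0.5) / 100 for i in range(1, 101)]
fits_uniform = passes_ks(grid, lambda x: uniform_cdf(x, 0.0, 1.0))
```

When the CDF parameters are estimated from the same sample that is being tested, this critical value tends to be conservative, so a stricter threshold may be appropriate in practice.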

5. Experimental results

In this section we report experiments on 19 datasets using our proposed algorithm and other classical clustering algorithms. Section 5.1 explains the experimental settings, including the experimental datasets, compared algorithms, and evaluation metrics. Section 5.2 presents the sensitivity analysis of the parameter used in our algorithm. Section 5.3 presents the detailed experimental results.

5.1. Experimental setting

5.1.1. Datasets

The proposed algorithm is tested on 19 data sets taken from the UCI machine learning repository. The important statistics of the 19 datasets are summarized in Table 1. The datasets listed in Table 1 are all real, multivariate data. All class labels are deleted from the datasets in order to obtain unsupervised data.

5.1.2. Compared algorithms

We compare our proposed algorithm with other well-known clustering algorithms, including Gaussian Mixture Model [36], Canopy+K-means [37], K-means [38], OPE-HCA [39], FCM [40] and Affinity Propagation (AP) [41]. We choose these clustering algorithms for the following reasons. The Gaussian mixture model (GMM) can be viewed as a combination of a number of Gaussian components, and its aim is to maximize the log-likelihood function. Canopy+K-means is the K-means algorithm with the number of clusters and the initial cluster centers predetermined by the Canopy algorithm. The K-means clustering algorithm is based on distance measurement. OPE-HCA clustering, based on EDAs, randomly selects data as individuals to construct the initial population and computes the probability distribution of the population to estimate the distribution of the dataset. FCM is based on the Euclidean distance function and associated with fuzzy mathematics. The affinity propagation (AP) method is one of the state-of-the-art methods proposed recently; it takes as input measures of similarity between pairs of data points. These well-known models apply probability-, evolution-, or ML-based ideas, which are also applied in our proposed approach.

Fig. 3. Procedure of hypothesis testing for the models with evolved parameters.


Table 1 Information of the tested benchmark datasets.

| Dataset | Size | # attribute | # cluster | Imbalance ratio |
|---|---|---|---|---|
| Haberman's Survival (HS) | 306 | 3 | 2 | 225:81 |
| Blood Transfusion Service Center | 748 | 4 | 2 | 570:178 |
| Wine | 178 | 13 | 3 | 59:71:48 |
| Collection | 208 | 60 | 2 | 111:97 |
| Ionosphere | 351 | 34 | 2 | 225:126 |
| Liver Disorders | 345 | 6 | 2 | 200:145 |
| Auto MPG | 398 | 7 | 3 | 249:70:79 |
| Breast Cancer Wisconsin (Prognostic) | 699 | 9 | 2 | 458:241 |
| Hepatitis | 81 | 19 | 2 | 47:34 |
| SPECT Heart | 187 | 22 | 2 | 172:15 |
| SPECTF Heart | 187 | 44 | 2 | 172:15 |
| Statlog Project (heart) | 270 | 13 | 2 | 150:120 |
| Connectionist Bench (Sonar, Mines vs. Rocks) | 208 | 61 | 2 | 111:97 |
| Vertebral Column | 310 | 6 | 2 | 210:100 |
| Planning Relax | 182 | 12 | 2 | 130:52 |
| Fertility | 100 | 9 | 2 | 88:12 |
| Climate Model Simulation Crashes | 540 | 18 | 2 | 494:46 |
| Unart (artificial dataset) | 179 | 4 | 3 | 100:54:25 |
| Wholesale customers | 440 | 7 | 2 | 298:142 |

Table 2 The parameter θ in 19 datasets.

| Dataset | θ | Optimal θ |
|---|---|---|
| Haberman's Survival (HS) | [9.918, 11.358] | 11.358 |
| Blood Transfusion Service Center | [1189.166, 1189.366] | 1189.367 |
| Wine | [112.661, 113.781] | 113.782 |
| Collection | [1.247, 1.507] | 1.507225 |
| Ionosphere | [3.374, 4.894] | 4.894 |
| Liver Disorders | [61.459, 62.309] | 62.309 |
| Auto MPG | [472.309, 473.749] | 473.749 |
| Breast Cancer Wisconsin (Prognostic) | [9.501, 10.261] | 9.501 |
| Hepatitis | [148.901, 149.661] | 149.661 |
| SPECT Heart | [2.213, 2.973] | 2.973 |
| SPECTF Heart | [94.749, 95.509] | 95.510 |
| Statlog Project (heart) | [76.081, 76.841] | 76.841 |
| Connectionist Bench (Sonar, Mines vs. Rocks) | [1.428, 1.528] | 1.529 |
| Vertebral Column | [27.954, 29.474] | 29.474 |
| Planning Relax | [1.701, 3.271] | 3.271 |
| Fertility | [1.064, 1.824] | 1.825 |
| Climate Model Simulation Crashes | [1.428, 1.628] | 1.628 |
| Unart (artificial dataset) | [0.698, 2.240] | 2.241 |
| Wholesale customers | [20215.673, 20216.433] | 20216.433 |

5.1.3. Evaluation metrics

(1) Clustering accuracy: the accuracy of a clustering algorithm is the degree of closeness of a quantitative result to that quantity's true value.
(2) Clustering precision: the precision of a clustering algorithm is the degree to which repeated measurements show the same results under unchanged conditions.
(3) Clustering F-score: the harmonic mean of precision and recall (a weighted average of the two), which reaches its best value at 1 and its worst at 0.

The way these metrics are measured is explained in Section 5.3.

5.2. Sensitivity analysis

In our experiments, one parameter is used to analyze the performance of the proposed algorithm: the parameter θ. The accuracy, precision and F-score are applied to evaluate the performance of our algorithm and the other clustering algorithms. The key step of the proposed algorithm is to cluster data instances by the threshold θ, by selecting the data instances whose distance to the random instances is less than θ. The F-score of the proposed algorithm rises as θ increases, because θ governs how similar data instances and random instances must be: a greater value of θ places more data in the same cluster. Table 2 shows, for each of the 19 datasets, the value of θ determined by the largest F-score of the clusters. From Table 2, as the value of θ increases, the distribution of clusters gets closer to the actual cluster distribution of the data set. However, θ is an important basic parameter that influences clustering performance. If θ is too small, each cluster covers only a few points, which costs extra time in subsequent steps; if θ is too big, points that are far away from each other can be assigned to the same cluster, which may lower the F-score.

5.3. Comparison of the proposed algorithm with other algorithms

Our proposed algorithm and the compared algorithms described in Section 5.1 are implemented, and their results under the different evaluation metrics are compared in this section.

5.3.1. Comparison of clustering accuracy results on classical algorithms

Accuracy is commonly used as the evaluation index to compare the performance of clustering algorithms. The accuracy Acc is computed as

$$Acc = \frac{\sum_{i=1}^{k} a_i}{|D|} \tag{11}$$

Eq. (11) gives the proportion of data objects that are correctly assigned to their cluster among the total number of data instances. In Eq. (11), a_i denotes the number of data objects that are correctly assigned to class C_i, and |D| denotes the number of elements in the set D. Table 3 compares the clustering accuracy of the proposed algorithm with that of Gaussian Mixture Model, Canopy + K-means, K-means, OPE-HCA and FCM. From Table 3, we can see that our proposed algorithm outperforms the baselines on 11 datasets.

5.3.2. Comparison of precision results

We also use another important measure of clustering performance, precision, to compare our proposed algorithm with other classical algorithms. Precision is calculated by:

$$\mathrm{PR} = \frac{\sum_{i=1}^{k} \frac{a_i}{a_i + b_i}}{k} \tag{12}$$

In Eq. (12), for a clustering of a data set that contains k classes, a_i denotes the number of data objects that are correctly assigned to class C_i, while b_i denotes the number of data objects that are incorrectly assigned to class C_i. Table 4 gives the results of the proposed algorithm compared with the other algorithms using the Precision measure. As can be seen, our proposed algorithm is comparable with the other approaches over the datasets.

5.3.3. Comparison of F-score results

In addition, we take F-score to compare the proposed algorithm

Please cite this article as: J. Fan, et al., Probability model selection and parameter evolutionary estimation for clustering imbalanced data without sampling, Neurocomputing (2016), http://dx.doi.org/10.1016/j.neucom.2015.10.140i


Table 3
Comparison of clustering accuracy of the proposed algorithm with other clustering algorithms. Columns give Accuracy (%) per algorithm.

| Dataset | Gaussian Mixture Model | Canopy + K-means | FCM | K-means | AP | OPE-HCA | Proposed algorithm |
|---|---|---|---|---|---|---|---|
| Haberman's Survival (HS) | 0.677 | 0.735 | 0.7222 | 0.731 | 0.663 | 0.728 | 0.670 |
| Blood Transfusion Service Center | 0.594 | 0.762 | 0.7432 | 0.682 | 0.843 | 0.733 | 0.612 |
| Wine | 0.708 | 0.685 | 0.6578 | 0.702 | 0.605 | 0.668 | 0.708 |
| Collection | 0.560 | 0.539 | 0.5402 | 0.557 | 0.575 | 0.533 | 0.560 |
| Ionosphere | N/A | 0.644 | 0.5508 | 0.585 | 0.747 | 0.641 | 0.601 |
| Liver Disorders | 0.501 | 0.580 | 0.5867 | 0.579 | 0.600 | 0.579 | 0.600 |
| Auto MPG | 0.460 | 0.525 | 0.510 | 0.718 | 0.571 | 0.718 | 0.550 |
| Breast Cancer Wisconsin (Prognostic) | 0.943 | 0.826 | 0.8577 | 0.643 | 0.837 | 0.695 | 0.835 |
| Hepatitis | N/A | 0.645 | 0.7016 | 0.645 | 0.662 | 0.657 | 0.827 |
| SPECT Heart | 0.353 | 0.745 | 0.7343 | 0.819 | 0.697 | 0.874 | 0.801 |
| SPECTF Heart | 0.551 | 0.799 | 0.8129 | 0.719 | 0.722 | 0.829 | 0.838 |
| Statlog Project (heart) | 0.652 | 0.589 | 0.601 | 0.574 | 0.671 | 0.555 | 0.652 |
| Connectionist Bench (Sonar, Mines vs. Rocks) | 0.587 | 0.539 | 0.5564 | 0.436 | 0.575 | 0.533 | 0.643 |
| Vertebral Column | 0.548 | 0.677 | 0.7091 | 0.703 | 0.716 | 0.677 | 0.756 |
| Planning Relax | 0.703 | 0.714 | 0.6876 | 0.714 | 0.528 | 0.7143 | 0.690 |
| Fertility | 0.560 | 0.754 | 0.7621 | 0.78 | 0.68 | 0.801 | 0.803 |
| Climate Model Simulation Crashes | 0.741 | 0.862 | 0.8371 | 0.914 | 0.913 | 0.904 | 0.914 |
| Unart (artificial dataset) | 0.860 | 0.944 | 0.9211 | 0.86 | 0.944 | 0.866 | 0.971 |
| Wholesale customers | 0.696 | 0.677 | 0.6622 | 0.777 | 0.893 | 0.686 | 0.640 |

with the classical algorithms. F-score is calculated by:

$$F_1 = \frac{2 \cdot \mathrm{PR} \cdot \sum_{i=1}^{k} \frac{a_i}{a_i + c_i}}{\mathrm{PR} + \sum_{i=1}^{k} \frac{a_i}{a_i + c_i}} \tag{13}$$

where PR denotes the precision described in Section 5.3.2. For a clustering of a dataset that contains k classes, a_i denotes the number of data objects that are correctly assigned to class C_i, while c_i denotes the number of data objects that are incorrectly rejected from class C_i. Table 5 gives the results of the proposed algorithm compared with the other clustering algorithms using the F-score measure. As can be seen, our proposed algorithm outperforms the baselines on 15 datasets and performs only marginally worse than the best results on the remaining 3 datasets.
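The three measures of Eqs. (11)–(13) can be sketched in a few lines of Python from the per-class counts a_i, b_i and c_i. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, classes with an empty denominator are assumed to contribute 0, and the recall term is assumed to be averaged over the k classes in the same way as PR in Eq. (12), whereas Eq. (13) as printed shows the sum without that factor. The counts in the example are made up to mimic an imbalanced two-class data set (50 majority and 10 minority objects).

```python
from typing import Sequence


def _mean_class_ratio(hits: Sequence[int], misses: Sequence[int]) -> float:
    """Average a_i / (a_i + x_i) over the k classes.

    Assumption: a class with a_i + x_i == 0 contributes 0 to the average.
    """
    if len(hits) != len(misses) or not hits:
        raise ValueError("expected two non-empty sequences of equal length")
    ratios = [a / (a + x) if (a + x) > 0 else 0.0 for a, x in zip(hits, misses)]
    return sum(ratios) / len(ratios)


def accuracy(a: Sequence[int], n_total: int) -> float:
    """Eq. (11): correctly assigned objects divided by |D|."""
    if n_total <= 0:
        raise ValueError("|D| must be positive")
    return sum(a) / n_total


def precision(a: Sequence[int], b: Sequence[int]) -> float:
    """Eq. (12): class-averaged a_i / (a_i + b_i)."""
    return _mean_class_ratio(a, b)


def recall(a: Sequence[int], c: Sequence[int]) -> float:
    """Class-averaged a_i / (a_i + c_i); the 1/k factor is an assumption."""
    return _mean_class_ratio(a, c)


def f_score(a: Sequence[int], b: Sequence[int], c: Sequence[int]) -> float:
    """Harmonic mean of precision and recall, in the spirit of Eq. (13)."""
    pr, re = precision(a, b), recall(a, c)
    return 0.0 if pr + re == 0 else 2 * pr * re / (pr + re)


if __name__ == "__main__":
    # Hypothetical counts: a = correctly assigned, b = wrongly assigned to
    # the class, c = wrongly rejected from the class.
    a, b, c = [40, 8], [2, 10], [10, 2]
    print("Acc:", accuracy(a, n_total=60))
    print("PR: ", precision(a, b))
    print("F1: ", f_score(a, b, c))
```

In this made-up example the minority class is recovered with the same recall as the majority class, yet its precision is much lower, which is why the class-averaged precision and F-score fall below the overall accuracy.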

5.3.4. Time complexity

Let n be the number of data instances, m be the number of models in the given model set MS, T be the number of iterations in parameter evolution, and k be the maximum number of parameters of any model in MS, that is,

$$k = \max_{i} \{ M_i(\Theta_i),\ M_i \in MS \} \tag{14}$$

We can obtain the following time expression:

$$k \cdot m + n \cdot m + k \cdot T \cdot m$$

In general, k < m < n, so

$$k \cdot m < n \cdot m, \qquad k \cdot T \cdot m < n \cdot m \cdot T$$

Accordingly,

Table 4
Comparison of the proposed algorithm with the best results of other clustering algorithms using the Precision measure. Columns give Precision (%) per algorithm.

| Dataset | Gaussian Mixture Model | Canopy + K-means | FCM | K-means | AP | OPE-HCA | Proposed algorithm |
|---|---|---|---|---|---|---|---|
| Haberman's Survival (HS) | 0.633 | 0.730 | 0.6914 | 0.707 | 0.748 | 0.682 | 0.730 |
| Blood Transfusion Service Center | 0.580 | 0.706 | 0.6343 | 0.601 | 0.706 | 0.674 | 0.650 |
| Wine | 0.714 | 0.723 | 0.724 | 0.723 | 0.697 | 0.751 | 0.724 |
| Collection | 0.592 | 0.545 | 0.5677 | 0.601 | 0.577 | 0.459 | 0.604 |
| Ionosphere | N/A | 0.821 | 0.7832 | 0.574 | 0.721 | 0.429 | 0.614 |
| Liver Disorders | 0.565 | 0.602 | 0.5987 | 0.640 | 0.597 | 0.773 | 0.614 |
| Auto MPG | 0.625 | 0.564 | 0.5631 | 0.785 | 0.734 | 0.785 | 0.570 |
| Breast Cancer Wisconsin (Prognostic) | 0.932 | 0.795 | 0.8236 | 0.633 | 0.866 | 0.561 | 0.720 |
| Hepatitis | N/A | 0.614 | 0.772 | 0.621 | 0.580 | 0.755 | 0.930 |
| SPECT Heart | 0.543 | 0.914 | 0.922 | 0.834 | 0.941 | 0.638 | 0.850 |
| SPECTF Heart | 0.547 | 0.950 | 0.9012 | 0.759 | 0.944 | 0.624 | 0.802 |
| Statlog Project (heart) | 0.750 | 0.581 | 0.6243 | 0.576 | 0.671 | 0.517 | 0.660 |
| Connectionist Bench (Sonar, Mines vs. Rocks) | 0.583 | 0.545 | 0.5663 | 0.445 | 0.573 | 0.506 | 0.601 |
| Vertebral Column | 0.688 | 0.748 | 0.7729 | 0.725 | 0.748 | 0.786 | 0.786 |
| Planning Relax | 0.603 | 0.714 | 0.7201 | 0.712 | 0.714 | 0.527 | 0.703 |
| Fertility | 0.515 | 0.873 | 0.8564 | 0.668 | 0.859 | 0.625 | 0.811 |
| Climate Model Simulation Crashes | 0.523 | 0.854 | 0.8445 | 0.855 | 0.749 | 0.613 | 0.856 |
| Unart (artificial dataset) | 0.842 | 0.900 | 0.9267 | 0.894 | 0.909 | 0.897 | 0.983 |
| Wholesale customers | 0.640 | 0.722 | 0.7001 | 0.837 | 0.955 | 0.894 | 0.695 |


Table 5
Comparison of the proposed algorithm with the best results of other clustering algorithms using the F-score measure. Columns give F-score (%) per algorithm.

| Dataset | Gaussian Mixture Model | Canopy + K-means | FCM | K-means | AP | OPE-HCA | Proposed algorithm |
|---|---|---|---|---|---|---|---|
| Haberman's Survival (HS) | 0.667 | 0.500 | 0.635 | 0.728 | 0.712 | 0.809 | 0.85 |
| Blood Transfusion Service Center | 0.610 | 0.495 | 0.551 | 0.630 | 0.655 | 0.710 | 0.710 |
| Wine | 0.668 | 0.628 | 0.610 | 0.712 | 0.622 | 0.707 | 0.732 |
| Collection | 0.591 | 0.544 | 0.513 | 0.621 | 0.576 | 0.494 | 0.623 |
| Ionosphere | N/A | 0.504 | 0.498 | 0.579 | 0.623 | 0.514 | 0.623 |
| Liver Disorders | 0.546 | 0.500 | 0.565 | 0.600 | 0.718 | 0.601 | 0.601 |
| Auto MPG | 0.528 | 0.594 | 0.572 | 0.750 | 0.633 | 0.751 | 0.582 |
| Breast Cancer Wisconsin (Prognostic) | 0.846 | 0.795 | 0.8011 | 0.631 | 0.810 | 0.621 | 0.890 |
| Hepatitis | N/A | 0.594 | 0.6217 | 0.631 | 0.613 | 0.703 | 0.780 |
| SPECT Heart | 0.618 | 0.499 | 0.6 | 0.827 | 0.871 | 0.753 | 0.875 |
| SPECTF Heart | 0.665 | 0.500 | 0.5334 | 0.739 | 0.853 | 0.743 | 0.860 |
| Statlog Project (heart) | 0.753 | 0.577 | 0.6807 | 0.575 | 0.627 | 0.536 | 0.753 |
| Connectionist Bench (Sonar, Mines vs. Rocks) | 0.581 | 0.544 | 0.5298 | 0.431 | 0.624 | 0.519 | 0.643 |
| Vertebral Column | 0.659 | 0.512 | 0.6083 | 0.759 | 0.713 | 0.767 | 0.795 |
| Planning Relax | 0.562 | 0.502 | 0.5728 | 0.712 | 0.609 | 0.606 | 0.720 |
| Fertility | 0.534 | 0.500 | 0.6271 | 0.744 | 0.728 | 0.731 | 0.806 |
| Climate Model Simulation Crashes | 0.553 | 0.625 | 0.6563 | 0.905 | 0.807 | 0.734 | 0.909 |
| Unart (artificial dataset) | 1.000 | 0.910 | 0.9112 | 0.877 | 0.914 | 0.881 | 0.966 |
| Wholesale customers | 0.609 | 0.499 | 0.5761 | 0.806 | 0.846 | 0.776 | 0.684 |

Table 6
Comparison of all algorithms on time complexity.

| Algorithm | Time complexity |
|---|---|
| Gaussian Mixture Model | O(n × m × T) |
| Canopy + K-means | O(n × k × T) |
| FCM | O(n × d × k²T) |
| K-means | O(n × k × T) |
| AP | O(n²) |
| OPE-HCA | O(n × log n) |
| Proposed algorithm | O(n × m × T) |

Note: In Table 6, n is the number of data instances, m is the number of models in the given model set, T is the number of iterations, k is the number of clusters, and d is the dimensionality of the data set.

$$O(k \cdot m + n \cdot m + k \cdot T \cdot m) = O(n \cdot m \cdot T)$$

The time complexity of our proposed algorithm is therefore O(n × m × T). Table 6 compares the time complexity of our proposed algorithm with that of the other algorithms. From Table 6 we can see that the time complexity of our approach is equal to or comparable with that of the other algorithms. In general, the number n of data instances in data mining is far greater than the number m of models. Whether the time complexity is at the O(n) level or the O(n²) level therefore depends on the number of iterations T. If T is close to n, the time complexity is at the O(n²) level. In practice, however, n is greater than, and generally far greater than, T, in which case the time complexity is at the O(n) level.
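The simplification to O(n · m · T) can be checked term by term. The short derivation below is a sketch that uses only k < n from the stated ordering, plus the assumption (not stated in the text) that at least one evolution iteration is run, T ≥ 1:

```latex
\begin{align*}
k m + n m + k T m &< n m + n m + n T m && \text{since } k < n \\
                  &= (T + 2)\, n m \\
                  &\le 3\, n m T      && \text{for } T \ge 1 ,
\end{align*}
```

so the whole expression is bounded by a constant multiple of n · m · T. With m treated as a small constant, this gives a cost linear in n when T is much smaller than n, and quadratic in n when T grows to the order of n.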

6. Conclusions

With the exponential growth of data, class imbalance problems have become prominent issues, and many challenges remain to be solved. Traditional theories, methods and techniques are not sufficient to handle the challenges hidden in imbalanced data, because those traditional solutions need enough data samples; otherwise it is difficult to obtain good learners or good performance. In imbalanced data the positive data instances are far fewer than the negative ones, and the quantity of positive instances may not be enough to complete the learning or mining tasks. New theories and approaches must therefore be explored.

In this paper, we proposed an imbalanced data clustering approach that employs probability model selection and a parameter evolutionary estimation process without sampling. The motivation for this study is to explore an effective solution for clustering class-imbalanced data. Since few approaches have been applied to clustering imbalanced data, we compared our proposed approach with other common probability-based approaches (Gaussian Mixture Model, FCM, AP and OPE-HCA) and with classical clustering algorithms of the k-means family. We used 19 imbalanced datasets as the benchmark, and the experimental results show that our proposed algorithm outperforms the other algorithms under different evaluation metrics on most of the benchmark datasets. The results indicate that the proposed approach is a new and efficient way to deal with the class-imbalance problem. Although the proposed approach is suited to clustering imbalanced data, some problems remain to be studied, for instance the design of the fitness function in parameter evolution and the determination of the models in the initial model set. We will study these problems and seek solutions in our future work.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable comments and suggestions. The authors also wish to thank the students of the laboratory, Zheng Feng, Wenhua Liu, Xinghui Zhao and Xuan Li, for participating in the experiments for several months. This work was supported by the National Natural Science Foundation of China (Nos. 61203305, 61433012 and 61303167) and the Special Funds of Taishan Scholars Construction Project.



Jiancong Fan received his M.Sc. and Ph.D. degrees in computer science from the College of Information Science and Engineering, Shandong University of Science and Technology, China, in 2003 and 2010, respectively. He joined the Department of Computer Science and Technology of Shandong University of Science and Technology in 2003 as a teaching assistant. His current research interests include machine learning and data mining. He has worked on learning from labeled and unlabeled data, evolutionary learning, and industrial data analysis. To date, he has published over 30 papers in national and international journals and conferences. Currently, he serves as a reviewer for Information Sciences, Evolutionary Computation, etc.

Zhonghan Niu is currently a master's candidate at the College of Information Science and Engineering, Shandong University of Science and Technology. He received his B.S. degree from the School of Information Science and Engineering, University of Jinan, China, in 2014. His current research interests are machine learning and data mining.


Yongquan Liang received his M.Sc. and Ph.D. degrees in computer science from Beihang University and the Institute of Computing Technology, Chinese Academy of Sciences, China, in 1992 and 1999, respectively. He was a visiting scholar at the Institute of AIFB, Karlsruhe University. He is now a professor at Shandong University of Science and Technology. His current research interests include data mining and intelligent information processing. To date, he has published over 100 papers in national and international journals and conferences.

Zhongying Zhao received her Ph.D. degree in computer science from the Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS), in 2012. She worked at the Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences (CAS), as an assistant professor from 2012 to 2014. She is currently an assistant professor in the College of Information Science and Engineering, Shandong University of Science and Technology. She has published 16 papers in international journals and conference proceedings. Her research interests focus on social network analysis and data mining.
