Hybridization of Chaos and Flower Pollination Algorithm over K-Means for data clustering

Arvinder Kaur (a), Saibal Kumar Pal (b), Amrit Pal Singh (a,c)

(a) University School of Information, Communication & Technology, New Delhi, India
(b) Directorate of Information Technology and Cyber Security, DRDO, New Delhi, India
(c) Bharati Vidyapeeth's College of Engineering, New Delhi, India

Highlights

• A proposed algorithm which hybridizes chaos and FPA over K-means for clustering.
• Chaotic FPA (CFPA) is compared with the algorithms FPA, CSA, BHA, BA, FFA, and PSO.
• For cluster integrity, CFPA and BHA have better performance as compared to the others.
• Cluster integrity has been improved by 3.17% as compared to a previous study for dataset D16.
• CFPA and CSA are significantly superior to other algorithms on the basis of execution time.

Article info

Article history: Received 2 June 2017; Received in revised form 25 April 2019; Accepted 21 May 2019; Available online xxxx.

Keywords: Chaotic Flower Pollination Algorithm; Data clustering; K-means; Swarm intelligence

Abstract

Classical clustering algorithms like K-means often converge to local optima and have slow convergence rates on larger datasets. To overcome such situations in clustering, swarm-based algorithms have been proposed. Swarm-based approaches attempt to achieve the optimal solution for such problems in reasonable time. Many swarm-based algorithms, such as the Flower Pollination Algorithm (FPA), Cuckoo Search Algorithm (CSA), Black Hole Algorithm (BHA), Bat Algorithm (BA), Particle Swarm Optimization (PSO), Firefly Algorithm (FFA), and Artificial Bee Colony (ABC), have been successfully applied to many non-linear optimization problems. In this paper, an algorithm is proposed which hybridizes Chaos Optimization and Flower Pollination over K-means to improve the efficiency of minimizing cluster integrity. The proposed algorithm, referred to as Chaotic FPA (CFPA), is compared with FPA, CSA, BHA, BA, FFA, and PSO over K-Means on the data clustering problem. Experiments are conducted on sixteen benchmark datasets. The algorithms are compared on four performance parameters: cluster integrity, execution time, number of iterations to converge (NIC), and stability. The results obtained are analyzed statistically using the non-parametric Friedman test; if the Friedman test rejects the null hypothesis, pairwise comparison is done using the Nemenyi test. The experimental results demonstrate the following: (a) CFPA and BHA perform better than the other algorithms on the basis of cluster integrity; (b) CFPA and CSA are superior to the others on the basis of execution time; (c) CFPA and FPA converge earlier than the other algorithms to the optimal cluster integrity; (d) CFPA and BHA produce more stable results than the other algorithms. © 2019 Elsevier B.V. All rights reserved.

Corresponding author: A.P. Singh, Bharati Vidyapeeth's College of Engineering, New Delhi, India ([email protected]). https://doi.org/10.1016/j.asoc.2019.105523

1. Introduction

Clustering is the unsupervised classification of patterns into groups (clusters) [1], where a set of instances with similar patterns (usually vectors in a multidimensional space) are grouped into clusters [1–3]. Clustering is of two types: hard clustering and soft clustering. In hard clustering, each element belongs to exactly one cluster, and it is easy to implement [1]. In soft clustering, each element can belong to more than one cluster [1,2]. Further, there are two types of clustering techniques, i.e. hierarchical and partitional clustering [1]. This work focuses on partitional clustering. The most widely used partitional clustering algorithm is K-Means, where n instances are partitioned into k clusters [4–6]. For each cluster, an optimal centroid is chosen for grouping the instances that lie in nearby proximity. Since the K-Means implementation is fast, it is common to iterate it several times with different initial parameters. In the worst case, however, it is very slow to converge: the studies [2,7–9] have observed that it can take an exponential number of iterations, that is 2^Ω(n), to converge even in 2 dimensions. K-Means is a heuristic algorithm; there is no guarantee that it will converge to the global optimum (it often converges to a local optimum) [1,2,8,9], and the result may depend on the initial clusters [2]. This iterative approach does not guarantee the global optimum, since finding the optimal partition is an NP-hard problem. To overcome this problem, many swarm-based optimization algorithms, such as the Genetic Algorithm [10,11], Ant Colony Optimization [11,12], Artificial Immune System [11,13], Artificial Bee Colony [14–16], Particle Swarm Optimization [16,17], Firefly Algorithm [3,4,16], Cuckoo Search [18], Black Hole Algorithm [19], and others, have been applied.

Swarm intelligence is the collective behavior of agents present in nature. Swarm algorithms have attracted great interest in the last two decades [16]. Swarm-based algorithms like the Black Hole Algorithm (BHA), Cuckoo Search Algorithm (CSA), Bat Algorithm (BA), Firefly Algorithm (FFA), Particle Swarm Optimization (PSO), etc. are applied to problems classified as NP-hard or NP-complete and aim to find the optimal solution [7,16,20–22]. Such nature-inspired algorithms are very useful in designing computational intelligence models effectively and efficiently. However, these algorithms can become trapped in local optima.

This work focuses on the Flower Pollination Algorithm (FPA) [22] and Chaos Optimization. FPA is a nature-inspired technique proposed in 2012 that has been used for solving non-linear optimization problems [20]. It is selected because it has the tendency to search both the global and the local search space, but it can still become trapped in local optima. Chaos optimization is used to keep FPA from converging to a local optimum. Chaos is a fundamental characteristic of nonlinear systems, with a series of specific features such as regularity, ergodicity, and randomness [23,24]. As an effective method to avoid getting limited to a local optimum, chaos has been combined with swarm intelligence, creating a new field of research and application [24]. In most of the research currently carried out, the random sequence in the mutation operator is substituted with a chaos sequence, providing evidence that chaotic mutation is an effective variant of the mutation operator [23,25]. This paper focuses on hard clustering with partitional methods.
The objective function (a square error function) of hard clustering is the sum of the distances from each pattern to its cluster center, which is known as cluster integrity [1,5,26]. In clustering, identification of the center of a cluster is a major issue, for which cluster integrity is used: it represents the integrity of each cluster, where each variable contributes to the value to be optimized [26]. Further, to optimize cluster integrity, a hybridization of Chaos and the Flower Pollination Algorithm (CFPA) over K-means is proposed for data clustering and is referred to as CFPA-KMeans. CFPA-KMeans, along with other swarm algorithms over K-Means (FPA-KMeans, CSA-KMeans, BHA-KMeans, BA-KMeans, FFA-KMeans, and PSO-KMeans), is tested on sixteen datasets. The comparison is performed on the basis of four performance parameters, viz. cluster integrity, execution time, NIC, and stability. The proposed CFPA-KMeans should effectively improve the convergence speed along with cluster integrity. The results of the experiment indicate that CFPA-KMeans and BHA-KMeans perform better than the other algorithms on the basis of cluster integrity. Also, CFPA-KMeans and CSA-KMeans give better results than the other algorithms on the basis of execution time.

The paper is organized as follows: Section 2 describes the related study of this work. Section 3 describes the proposed algorithm for clustering using CFPA. Section 4 presents the research methodology. Experimental results are shown in Section 5. Further, Section 6 presents the threats to validity, and Section 7 concludes the results and presents the scope of future work.

2. Related study

This section provides detailed background information about the work; however, it is not an exhaustive study. This work is based on two broad areas, i.e. chaotic swarm-based algorithms and partitional clustering, as explained in Sections 2.1 and 2.2 respectively. Further, the motivation of the proposed work and the research objectives are defined in Section 2.3.

2.1. Chaotic swarm based algorithms

Chaos theory is used in swarm algorithms to help them escape local optima. Various swarm algorithms have previously been hybridized with chaos and have shown better performance: Particle Swarm Optimization (PSO) (Song et al. 2007 [27]; Hongwu 2009 [28]; Hong et al. 2016 [29]), the Firefly Algorithm (FA) (Yang 2012 [30]; Fister et al. 2015 [25]), the Cuckoo Search Algorithm (CSA) (Xiang et al. 2012 [31]), the Black Hole Algorithm (BHA) (Aslani et al. 2015 [32]), and the Flower Pollination Algorithm (FPA) (Kaur et al. 2017 [33]) are examples of such algorithms. To find the dynamic characteristics of the particles of PSO, Liu et al. 2007 [24] suggested an algorithm to calculate the Lyapunov exponent and correlation. Song et al. 2007 [27] studied the Tent Chaotic map with Particle Swarm Optimization (TCPSO), as it could work for nonlinear optimization and increase convergence and accuracy. Yang 2012 [34] extended the standard FA with chaos and automatic parameter tuning, resulting in two flavors of FA. In another work by Talatahari et al. 2011 [35], a chaotic improved imperialist competitive algorithm (CICA) was suggested. Xiang et al. 2012 [31] examined orthogonal learning CSA to find the parameters of a chaotic system. Ouyang et al. 2014 [23] used a chaotic operator to improve the features of CSA. In 2015, Fister et al. [25] reviewed chaos-based FA. Their work gathered


the studies based on chaotic FA, highlighting the frequently used chaotic maps with their advantages, disadvantages, limitations, and future scope. In a study by Kohli et al. 2017 [36], the Grey Wolf Optimization (GWO) algorithm was incorporated with chaos theory to increase the global convergence speed. Their study includes thirteen constrained benchmark problems solved with ten different chaotic maps to find the best among them. Lukasik et al. 2015 [37] explained the FPA-based pollination mechanism; their work compares FPA and PSO. The study by Vedula et al. 2015 [38] proposed circular array synthesis using FPA and compares the results with the Genetic Algorithm (GA) and a uniform circular antenna (with uniform spacing). Our previous study, Kaur et al. 2017 [33], proposed Chaotic FPA and explained the different chaotic maps and their performance on non-linear unconstrained optimization problems.

2.2. Partitional clustering using swarm algorithms

Previous work in this area focuses on partitional clustering using the K-means algorithm (which has several disadvantages, as explained earlier) combined with various meta-heuristic algorithms. Table 1 briefly describes related studies on clustering using swarm-based techniques, along with the algorithms, datasets, and performance measures used in each study. In 2003, van der Merwe [17] compared two PSO approaches, i.e. standalone PSO and a hybrid of K-Means and PSO, against K-Means clustering, and showed that the PSO approaches converge to lower quantization errors and, in general, larger inter-cluster distances and smaller intra-cluster distances. In 2004, Younsi and Wang [13] developed and used a new Artificial Immune System (AIS) algorithm for data clustering; the algorithm has an important data compression capability. In 2006, Kao and Cheng [12] proposed a new clustering algorithm based on ant colony optimization, called Ant Colony Optimization for Clustering (ACOC). In the same year, Grosan [10] introduced some of the preliminary concepts of swarm intelligence, with an emphasis on particle swarm optimization and ant colony optimization algorithms for data mining. In 2010, Karaboga [14] proposed the Artificial Bee Colony (ABC) algorithm, used for clustering benchmark classification problems for classification purposes. The performance of the ABC algorithm was compared with PSO, BayesNet, MlpAnn, RBF, Kstar, Bagging, MultiBoost, NBTree, Ridor, and VFI; experimental results show that ABC performs better when applied to clustering for the purpose of classification. In 2011, Senthilnath [4] proposed the Firefly Algorithm for clustering; the FA algorithm was compared with population-based and nature-inspired optimization techniques, concluding that FA is an efficient method that successfully generates optimal cluster centers. In 2012, Tang et al. [26] integrated five nature-inspired algorithms (Ant, FFA, CSA, BA, and Wolf) with K-Means to improve the efficiency of K-Means; experimental results show that the CSA and BA integrations are more efficient than the others. In the same year, Hassanzadeh [3] proposed an algorithm that hybridizes the Firefly and K-Means algorithms. Also in the same year, Esmin et al. [39] tested two algorithms, namely standard PSO and a hybrid PSO swarm approach, and compared them with the K-Means algorithm. In 2012, Hatamlou et al. [40] proposed a hybridization of the gravitational search algorithm (GSA) and the K-Means algorithm for clustering; the performance of the proposed algorithm was compared with other approaches, and the results show that it outperforms the others. In 2012, Senthilnath [41] compared three nature-inspired algorithms (GA, PSO, and CSA) on four datasets; results show that CSA has better classification efficiency. In 2013, a hybridization of CSA and Multiple Kernel-Based Fuzzy C-Means was done by Binu [18]. In 2013, Hatamlou [42] proposed the Black Hole algorithm for the clustering problem, which is compared with K-Means, PSO, and GA; results show that BHA performed better. Further, in 2015, Jensi [43] proposed a hybridization of K-Means and FPA for data clustering and compared it with the FPA and K-Means algorithms. Our previous work, Kaur et al. 2018 [44], proposed a hybridization of K-Means and FFA; standard K-Means, swarm variants of K-Means (i.e. K-Means + CSA, K-Means + BA, K-Means + FFA), and other improved versions of K-Means (i.e. K-Means++, K-Means + Canopy, and K-Means + Farthest First) were compared. Results show that the swarm variants of K-Means (i.e. K-Means+FFA and K-Means+BA) outperform the other versions of K-Means by a huge margin. The current study compares the proposed algorithm (CFPA-KMeans) with other recent swarm algorithms (FPA-KMeans [43], BHA-KMeans [19], CSA-KMeans [18], BA-KMeans [26]) and the algorithm (FFA-KMeans) proposed in a previous study [44].

2.3. Motivation and research objectives

The studies [S1–S15] shown in Table 1 describe the related work on swarm intelligence over K-Means for data clustering. In these studies, swarm algorithms such as Particle Swarm Optimization [S1, S4, S5, S7, S8, S9, S11 & S13], Firefly Algorithm [S7 & S5], Artificial Bee Colony [S4 & S5], Cuckoo Search Algorithm [S6, S11 & S12], Bat Algorithm [S6], Ant Colony Optimization [S3], and Flower Pollination Algorithm [S15] have been proposed over classical K-Means. Classical K-Means and PSO-KMeans have been compared [S1, S7, S8, S9 & S13], and it was found that PSO-KMeans is better than classical K-Means; hence PSO-KMeans is preferred over classical K-Means for comparison with other swarm intelligence algorithms. To our knowledge, chaotic variants of these algorithms have not been explored over K-Means to make it free from local trapping. Therefore, this study presents a chaotic variant of the Flower Pollination Algorithm over K-Means. This work introduces a novel approach for data clustering, hybridizing Chaos and the Flower Pollination Algorithm over K-Means.

As shown in Fig. 1, the K-means algorithm has three issues: (a) K-Means is very slow to converge; (b) it has exponential worst-case computational time, i.e. 2^Ω(n); (c) it often converges to local optima. Many swarm algorithms have been applied over K-Means to resolve the first two issues [3,4,13,26,40,42,43]. The current study selects FPA over other swarm algorithms for the following reasons: (a) FPA can efficiently explore both the local and the global space of objective functions; (b) it can also switch between these spaces to generate optimal results, unlike other swarm algorithms. The first and second issues of K-Means have been resolved using FPA, but both FPA and K-Means may get trapped in local optima (the third issue of K-Means). Consequently, to resolve the issue experienced in FPA and to obtain the optimal solution, the current study implements Chaotic FPA over K-Means for data clustering, because it is an effective method to avoid getting limited to a local optimum.

Four Research Objectives (RO) are defined as follows:
RO1: To explain the convergence analysis of standard FPA and its four chaotic variants.
RO2: To implement Chaotic FPA over K-Means for partitional clustering.
RO3: To compare the CFPA algorithm with six algorithms on the basis of cluster integrity, execution time, NIC, and stability.
RO4: To analyze the results for cluster integrity and execution time statistically using the non-parametric Friedman test, followed by the post-hoc Nemenyi test for pairwise comparison.


Table 1: Summary of previous work on data clustering using swarm algorithms.

Sr. No | First Author & Year | Algorithms Used | Datasets Used | Performance Measures | Best Algorithm
[S1] | D. v. d. Merwe 2003 | K-Means, PSO | Artificial1, Artificial2, Iris, Wine, Glass, Breast Cancer, Automotive | Quantization error, Intra-cluster distance, Inter-cluster distance | PSO
[S2] | R. Younsi 2004 | CLONALG | Two spirals problem SPIR, CHAINLINK | N/A | AIS algorithm
[S3] | Y. Kao 2006 | K-Means, Shelokar, ACOC | Iris, Wine, Prob1, Prob2 | Intra-cluster distance, Time | ACOC algorithm
[S4] | D. Karaboga 2011 | ABC, PSO, BayesNet, MlpAnn, RBF, Kstar, Bagging, MultiBoost, NBTree, Ridor, VFI | Balance, Cancer, Cancer-Int, Credit, Dermatology, Diabetes, E.Coli, Glass, Heart, Horse, Iris, Thyroid, Wine | Classification Error Percentage (CEP), Classification efficiency | Artificial Bee Colony algorithm
[S5] | J. Senthilnath 2011 | FFA, ABC, PSO, BayesNet, MlpAnn, RBF, Kstar, Bagging, MultiBoost, NBTree, Ridor, VFI | Balance, Cancer, Cancer-Int, Credit, Dermatology, Diabetes, E.Coli, Glass, Heart, Horse, Iris, Thyroid, Wine | Classification Error Percentage (CEP), Classification efficiency | FFA
[S6] | R. Tang 2012 | K-means, Ant, FFA, CSA, BA, Wolf algorithm | Iris, Wine, Libras, Haberman, synthetic, mouse | Cluster integrity, CPU time | Cuckoo Search and Bat algorithms
[S7] | T. Hassanzadeh 2012 | K-Means, PSO, KPSO, KFA | Iris, WDBC, SONAR, GLASS, WINE | Intra-cluster distance | KFA; PSO (for time)
[S8] | A. A. A. Esmin 2012 | KM, PSO, HPSOM | Iris, Wine, Glass, Breast Cancer, Artif | Correctly clustered (%), Average intra-cluster, Average inter-cluster | HPSOM algorithm
[S9] | A. Hatamlou 2012 | K-means, GA, SA, ACO, HBMO, PSO, GSA, GSA-KM | Iris, Wine, Glass, CMC, Cancer | Intra-cluster distance | GSA-KM
[S10] | B. K. Elfarra 2013 | BH-centroids | Iris, Wine, Lung-cancer | Error rate of classification | BH-centroids
[S11] | J. Senthilnath 2013 | CSA, GA, PSO | Glass, Crop type, Vehicle, Image segmentation | Classification error percentage, Time taken, Time complexity in asymptotic notation | CSA
[S12] | D. Binu 2013 | MKF-Cuckoo search | Iris, Wine | Rand coefficient, Jaccard coefficient, Clustering accuracy | MKF-Cuckoo (95% accuracy on Iris, 67% on Wine)
[S13] | A. Hatamlou 2013 | BHA, K-means, PSO, GSA, BB-BC | Iris, Wine, Glass, Cancer, Vowel, CMC | Intra-cluster distance, Error rate, Wilcoxon statistical test | BH algorithm
[S14] | S. J. Nanda 2014 | A survey on nature-inspired algorithms for partitional clustering (survey paper) | — | — | —
[S15] | R. Jensi 2015 | K-means, FPA, FPAKM | Art1, IRIS, Wine, Glass, Cancer, Thyroid, CMC, Crude oil | Mean Square Error Quantization, F-measure | FPAKM

3. Chaotic Flower Pollination Algorithm for data clustering

3.1. Flower Pollination Algorithm

The Flower Pollination Algorithm (FPA) was developed by Xin-She Yang in 2012 [22]. The objective of flower pollination in nature is the survival of the fittest and the optimal reproduction of plants, which is used in FPA to find the optimal solution of NP-class problems [22]. Global pollination is governed by Eq. (1) [22]:

$$x_i^{t+1} = x_i^t + L \,(x_i^t - g^*) \tag{1}$$

$$L = \frac{\lambda \,\Gamma(\lambda)\sin(\pi \lambda / 2)}{\pi \, s^{1+\lambda}} \tag{2}$$

where $x_i^t$ is the pollen $i$ at iteration $t$ and $g^*$ is the current best solution found among all solutions at the current iteration. The parameter $L(\lambda)$ represents the strength of the pollination, as defined in Eq. (2), where $0 \le \lambda \le 2$ is an index of the gamma function $\Gamma(\lambda)$ and $s$ is the step size [22,37,43]. The local pollination process is represented by Eq. (3) [22]:

$$x_i^{t+1} = x_i^t + E \,(x_j^t - x_k^t) \tag{3}$$

where $x_j^t$ and $x_k^t$ are pollens from different flowers of the same plant and $E \in [0, 1]$ is a uniform random number. The Flower Pollination Algorithm has been used for function optimization [22]. The standard FPA is simple to implement because there is a single key parameter $p$ and a scaling factor $\lambda$. The pseudo code of FPA, as shown in Fig. 2, has 19 lines, of which lines 1–3 are used for the initialization of the pollens.


Fig. 1. Motivation for the research work.

Fig. 2. Pseudo Code of Flower Pollination Algorithm [22].

Lines 6–18 are repeated for the given number of iterations. Line 6 calculates the fitness function of each pollen, which is then used in the updating rules of FPA. Lines 7 to 16 present the two basic updating rules, i.e. local pollination and global pollination, defined in lines 9–10 and 11–12 respectively. Lines 14–15 are used to update the pollens.
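To make the update rules concrete, the following is a minimal Python sketch of the pseudocode in Fig. 2 (not the authors' MATLAB implementation). The Lévy step of Eq. (2) is drawn via Mantegna's algorithm, a common way to sample Lévy-stable steps; the objective function, the bounds, and the greedy acceptance of better candidates are assumptions of this sketch.

```python
import numpy as np
from math import gamma, sin, pi

def levy_step(lam, shape, rng):
    # Mantegna's algorithm: draws steps distributed as the Levy law of Eq. (2).
    sigma = (gamma(1 + lam) * sin(pi * lam / 2) /
             (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    u = rng.normal(0.0, sigma, shape)
    v = rng.normal(0.0, 1.0, shape)
    return u / np.abs(v) ** (1 / lam)

def fpa(objective, dim, n_pollens=20, p=0.8, lam=1.5, n_iter=100,
        bounds=(-10.0, 10.0), seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pollens = rng.uniform(lo, hi, (n_pollens, dim))   # lines 1-3: initialize pollens
    fitness = np.array([objective(x) for x in pollens])
    best = pollens[fitness.argmin()].copy()           # current best g*
    for _ in range(n_iter):                           # lines 6-18
        for i in range(n_pollens):
            if rng.random() < p:                      # global pollination, Eq. (1)
                cand = pollens[i] + levy_step(lam, dim, rng) * (pollens[i] - best)
            else:                                     # local pollination, Eq. (3)
                j, k = rng.choice(n_pollens, size=2, replace=False)
                cand = pollens[i] + rng.random() * (pollens[j] - pollens[k])
            cand = np.clip(cand, lo, hi)
            f = objective(cand)
            if f < fitness[i]:                        # lines 14-15: keep improvements
                pollens[i], fitness[i] = cand, f
        best = pollens[fitness.argmin()].copy()
    return best, fitness.min()

# Example: minimize the sphere function in 5 dimensions.
x_best, f_best = fpa(lambda x: float(np.sum(x ** 2)), dim=5)
```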

3.2. Introduction of chaos in Flower Pollination Algorithm over K-means

The method is explained in the following subsections: Section 3.2.1 explains the convergence analysis of FPA and its chaotic variants (i.e. RO1, defined in Section 2.3). Further, Section 3.2.2 describes the objective function of the clustering problem, and Section 3.2.3 proposes chaos disturbance with FPA over K-Means for data clustering (i.e. RO2).

3.2.1. Convergence analysis of CFPA

For an iterative algorithm to converge, the candidate solution $x_i^t$ must get closer and closer to the desired solution with each iteration. Our previous work, Kaur et al. 2017 [33], used the logistic map to initialize the pollens $x_i^t$ and other chaotic maps (e.g. sine, tent, and dyadic maps) to produce a chaos disturbance in the candidate solution $x_i^t$ at the updating step. In this work, the standard FPA has


been compared with four chaotic variants of FPA, which use chaotic maps at the updating step. It has been observed that updating in CFPA using the sine map has a competitive advantage over FPA and the other chaotic variants in terms of computational time [33]. In this work, the convergence of standard FPA and four chaotic variants (sine, dyadic, Chebyshev, and circle maps) of FPA is presented and compared in order to choose the best chaotic map for further study. The chaotic maps used for the experiments are defined below:

(a) Logistic map
$$z_{n+1} = 4 z_n (1 - z_n) \tag{4}$$

(b) Sine map
$$z_{n+1} = \frac{\mu}{4} \sin(\pi z_n); \quad z \in [0, 1],\ 0 < \mu \le 4 \tag{5}$$

(c) Dyadic map
$$z_{n+1} = 2 z_n \bmod 1 \tag{6}$$

(d) Chebyshev map
$$z_{n+1} = \cos(n \cos^{-1} z_n) \tag{7}$$

(e) Circle map
$$z_{n+1} = z_n + b - \frac{a}{2\pi} \sin(2\pi z_n), \quad a = 0.5,\ b = 0.2 \tag{8}$$

Fig. 3. Convergence analysis for Ackley function f1(·).
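For reference, the five maps are easy to implement; a short Python sketch follows. The parameter defaults (µ = 0.8 for the sine map, a = 0.5 and b = 0.2 for the circle map) follow the text; writing the Chebyshev map with a fixed order k (the printed Eq. (7) uses the iteration index n) and wrapping the circle map modulo 1 are conventions assumed here.

```python
import numpy as np

# Chaotic maps of Eqs. (4)-(8); each maps z_n to z_{n+1}.
def logistic(z):            return 4.0 * z * (1.0 - z)                    # Eq. (4)
def sine(z, mu=0.8):        return (mu / 4.0) * np.sin(np.pi * z)         # Eq. (5)
def dyadic(z):              return (2.0 * z) % 1.0                        # Eq. (6)
def chebyshev(z, k=4):      return np.cos(k * np.arccos(z))               # Eq. (7)
def circle(z, a=0.5, b=0.2):
    return (z + b - (a / (2.0 * np.pi)) * np.sin(2.0 * np.pi * z)) % 1.0  # Eq. (8)

def chaotic_sequence(step, z0=0.7, length=100):
    """Iterate a map to produce the chaos-disturbance sequence used in CFPA."""
    seq = [z0]
    for _ in range(length - 1):
        seq.append(step(seq[-1]))
    return np.array(seq)

print(chaotic_sequence(logistic, length=5))  # [0.7, 0.84, 0.5376, 0.9943, 0.0225]
```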

For the convergence analysis of FPA and its chaotic variants, two non-linear unconstrained functions, the Ackley function (f1) and the Griewank function (f2), are used, as defined in Eqs. (9) and (10) [33]:

$$f_1(x) = -e^{-0.5 \sum_{i=1}^{D} x_i^2} \tag{9}$$

$$f_2(x) = \sum_{i=1}^{D} \frac{x_i^2}{4000} - \prod_{i=1}^{D} \cos\left(\frac{x_i}{\sqrt{i}}\right) + 1 \tag{10}$$
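Both test functions are a few lines in Python; the sketch below transcribes the equations as printed (note that Eq. (9) as printed is the simple exponential test function rather than the full Ackley formula, so the code follows the printed form):

```python
import numpy as np

def f1(x):
    """Eq. (9) as printed: exponential test function, global minimum -1 at x = 0."""
    x = np.asarray(x, dtype=float)
    return -np.exp(-0.5 * np.sum(x ** 2))

def f2(x):
    """Eq. (10): Griewank function, global minimum 0 at x = 0."""
    x = np.asarray(x, dtype=float)
    i = np.arange(1, x.size + 1)
    return np.sum(x ** 2) / 4000.0 - np.prod(np.cos(x / np.sqrt(i))) + 1.0

print(f1(np.zeros(10)), f2(np.zeros(10)))  # -1.0 0.0
```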

Convergence graphs of FPA and its four chaotic variants are shown in Figs. 3 and 4 for functions f1(·) and f2(·) respectively. For the Griewank function, CFPA with the sine and dyadic maps is better than the other variants, while for the Ackley function, CFPA with the sine map has a far better convergence rate than the other variants. The results demonstrate that hybridization with chaos alone is not enough: in both figures, the circle chaotic map has a lower convergence rate than standard FPA, so it is necessary to choose a chaotic map that yields the optimal convergence rate for FPA. The sine chaotic map gives the best results for both functions as compared with the other variants. Further chaotic maps and an algorithmic explanation can be found in our earlier work, Kaur et al. 2017 [33], where the improvement of Chaotic FPA over FPA using the sine map is explained in detail. Finally, it can be concluded that the number of iterations to converge is minimum for CFPA with the sine map.

3.2.2. Problem definition

Clustering techniques attempt to group patterns so that the classes thereby obtained reflect the different pattern-generation processes represented in the pattern set [1]. It is an unguided classification process that has helped researchers obtain various results in different contexts across numerous disciplines [1,5,6]. When working on clustering, identification of the cluster center is a big problem; as per [26], an efficient cluster center can be identified by cluster integrity. The objective function (i.e. cluster integrity, as defined in Eq. (11)) evaluates the contribution of every variable to obtain optimized clustering. The centroids are relocated to find the optimal grouping, such that the data points within a cluster are closest to their centroid [1].

Fig. 4. Convergence analysis for Griewank function f2(·).

In this case, the objective function (a square error function) is the sum of the distances from each pattern to its center [1,5,26]. This paper focuses on partitional clustering with identification of data centers (centroids). Through identification of the centroids, pattern closeness is calculated, which is defined by the intra-cluster distance. The objective function is minimized to find the centers of the clusters by summing up the distances of the patterns to their centers. It is well known that data within a cluster must have great similarity, along with high dissimilarity between data of different clusters [1,5]. Each cluster corresponds to the value of its objective function. The algorithms focus on minimizing the objective function, i.e. the squared error function, known as cluster integrity (Fun) and defined in Eq. (11) [26]:

$$\mathrm{Fun} = \sum_{i=1}^{sol} \sum_{j=1}^{M} w_{i,j} \sum_{t=1}^{M*A} \left\| x_{i,t} - C_{j,t} \right\|^2 \tag{11}$$

where sol is the solution space, M is the number of clusters, and A is the number of attributes in the dataset (the search-space dimension). The matrix C contains the centroids of all clusters, with size M*A. C_{j,t} is the centroid of the jth cluster and tth attribute, defined


in Eq. (13) [26].

$$w_{i,j} = \begin{cases} 1 & x_i \in \mathrm{Cluster}_j \\ 0 & x_i \notin \mathrm{Cluster}_j \end{cases} \tag{12}$$

$$C_{j,t} = \frac{\sum_{i=1}^{S} w_{i,j} \, x_{i,t}}{\sum_{i=1}^{S} w_{i,j}}, \quad \text{where } j = 1 \ldots M \text{ and } t = 1 \ldots M*A \tag{13}$$
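As a concrete reading of Eqs. (11)–(13), the sketch below computes Fun in Python, assuming (as in K-means) that the memberships w_{i,j} of Eq. (12) come from nearest-centroid assignment; reseeding empty clusters from a random data point is an additional assumption of this sketch.

```python
import numpy as np

def cluster_integrity(X, C):
    """Fun of Eq. (11): X is the S x A data matrix, C the M x A centroid matrix.
    Each point is assigned to its nearest centroid (the w_{i,j} of Eq. (12)),
    and the squared distances to the assigned centroids are summed."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)  # S x M squared distances
    labels = d2.argmin(axis=1)                               # nearest-centroid membership
    return d2[np.arange(len(X)), labels].sum(), labels

def update_centroids(X, labels, M, rng=np.random.default_rng(0)):
    """Eq. (13): each centroid is the mean of the points assigned to it;
    empty clusters are reseeded from a random data point (an assumption here)."""
    return np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                     else X[rng.integers(len(X))]
                     for j in range(M)])

# Example: two obvious clusters around (0, 0) and (5, 5).
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
fun, labels = cluster_integrity(X, np.array([[0.0, 0.0], [5.0, 5.0]]))
print(fun, labels)  # small Fun; labels [0 0 1 1]
```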

3.2.3. Chaotic FPA over K-means

This section proposes the chaos disturbance with FPA over K-Means for data clustering. The proposed algorithm, CFPA-KMeans, is described in the flow chart shown in Fig. 5 and is used to minimize the cluster integrity (as defined in Section 3.2.2). In Fig. 5, B1 is used to initialize the chaos variable using the logistic map. Further, B9 is used to update the pollen variable $x_i^t$. The results reported in Section 3.2.1 show that updating using the sine map has a competitive advantage over the other variants; hence, at step B9, the update of pollen $x_i^t$ is performed using the sine map. The dataset input to this algorithm has A attributes and S data objects. N data objects are selected from the S data objects as the population (pollens) for this algorithm. The stopping criterion can be the number of iterations (Ni), a function tolerance, etc. In this work, Ni is used to stop the algorithm; this also allows us to calculate the NIC (defined in Section 4.3.3) for this algorithm. The proposed CFPA for data clustering is based on a heuristic methodology whose time complexity depends on the number of pollens (N) selected for simulation and the number of attributes in the dataset (A).

Fig. 5 explains the implementation of CFPA-KMeans for data clustering. There are ten processing rectangle boxes, B1 to B10, and three conditional diamond boxes, C1 to C3. Steps B5 to C2 repeat for the number of pollens (N). Step B8 takes O(M*A) time; hence B5 to C2 takes O(N*M*A) time. The steps between B4 and C3 repeat Ni times; hence the procedure takes O(Ni*N*M*A). A comparison of CFPA-KMeans with the other six algorithms on the basis of time complexity in asymptotic notation is given in Table 2. Data clustering is an NP-class problem [2,8], so it is desirable to develop fast, optimized approximation algorithms [1,2]. As seen in Table 2, the time complexity depends on four factors, i.e. Ni, N, M, and A. Of these, N, M, and A are the same for all algorithms; only Ni differs, and Ni depends on the convergence rate of the algorithm. Figs. 19 to 23 (Appendix) show graphs for five datasets that demonstrate the performance of the seven algorithms on the basis of the convergence rate of Fun. CFPA produces the lowest Ni as compared to the others; hence CFPA has the minimum time complexity.

4. Research methodology

4.1. Datasets used

Sixteen classification datasets (referred to as D1, D2, ..., D16 in this paper) have been taken from different fields of study [4,13,14,18,19,26] and the UCI data repository [45] to make a reliable comparison. It has been observed that the datasets SPAMBASE and SYNTHETIC have not previously been explored for clustering by studies in this field. In addition to these two datasets, the current work explores 14 more datasets, which have been used in previous studies. The number of instances, the number of input features, and the number of classes for the sixteen datasets are presented in Table 3. In Fig. 6, data points with blue and green colors are shown as two different classes, which can be grouped as clusters [26]; it gives the pictorial representation of the first dataset (Iris).

4.2. Experimental setup

The implementation of this work is performed on a system with the following configuration: 1.6 GHz Intel processor with 6 GB RAM. The tool used for implementation is the MATLAB R2013b simulator. This work implements CFPA over the K-means algorithm, and the sixteen datasets defined in Section 4.1 are chosen for this study. The work focuses on the seven algorithms over K-Means described in Table 4. The optimal parameters, selected from previous studies, are shown in Table 4. First, the parameters of FPA were tuned [22], and it was found that with p = 0.8 and λ = 0.1, optimal solutions can be obtained. Second, a study [28] has shown that the sine map with µ = 0.8 gives optimal results, so µ = 0.8 has been chosen to implement CFPA. Similarly, the optimal parameters for FPA [33,37], CSA [31], BHA [19,42], FFA [44], and PSO [17,46] have been selected from the best parameters obtained in previous studies. Experiments are conducted by executing each of the seven algorithms fifteen times with the optimal parameters (as presented in Table 4) to evaluate cluster integrity, NIC, and execution time. The number of iterations used for the experiment is 100.

4.3. Performance measures

The following performance measures have been used to compare the various algorithms.

4.3.1. Cluster integrity (Fun)

The problem is to find the optimal centroid configuration. The objective function (cluster integrity), as defined in Section 3.2.2 [26], defines the center of each cluster, and all algorithms focus on minimizing this function (Fun).

4.3.2. Execution time taken

Execution time is the performance parameter that gives the time taken to execute the algorithm. This parameter has been used in various studies [2,18,41] to analyze which algorithm completes its execution faster.

4.3.3. Number of iterations to converge (NIC)

The NIC measures after how many iterations an algorithm finds the optimal value of the objective function. For every algorithm, the mean and minimum number of iterations required to converge are recorded; this value is used to find the minimum number of iterations required to find an optimal solution.

4.3.4. Stability

The general idea behind the stability of a clustering algorithm is that the algorithm produces an optimal result the maximum number of times with minimum error. However, the assumption that stability corresponds to high accuracy may not always hold [47–52]. An algorithm is also stable when it consistently gives nearly accurate results the maximum number of times; stability of the results across different runs is considered an asset of an algorithm [47–52]. In this work, it is assumed that an algorithm is stable if it produces an optimal result (i.e. minimum Fun) the maximum number of times with minimum error. The squared error is calculated using Eq. (14):

$$E = \sqrt{\sum_{i=1}^{15} \frac{(t_i - \mathrm{Min})^2}{\mathrm{Min}^2}} \times 100 \tag{14}$$

where $t_i$ is the cluster integrity value of the $i$th run, and Min is the minimum cluster integrity value over the fifteen runs.
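A one-function Python sketch of Eq. (14), assuming the fifteen per-run cluster integrity values are collected in a list:

```python
import numpy as np

def percentage_error(runs):
    """Eq. (14): error of repeated runs relative to the best (minimum) run."""
    t = np.asarray(runs, dtype=float)
    m = t.min()
    return float(np.sqrt(np.sum((t - m) ** 2) / m ** 2) * 100.0)

print(percentage_error([78.9408] * 14 + [78.9500]))  # near 0 => stable
```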


Fig. 5. Flow Chart of Chaotic Flower Pollination Algorithm for Data Clustering.
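The flow chart in Fig. 5 can be condensed into the following Python sketch. It is not the authors' MATLAB code: the chaotic (logistic-map) population initialization at B1 and the sine-map update at B9 follow the text, but the exact disturbance formula applied at B9 is not printed in this excerpt, so the perturbation of the candidate around the best solution shown below is an assumption, as are the greedy acceptance and the bounds handling.

```python
import numpy as np
from math import gamma, sin, pi

def fun(X, C):
    # Cluster integrity of Eq. (11) with nearest-centroid memberships (Eq. (12)).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

def levy(lam, shape, rng):
    # Mantegna's algorithm for the Levy step of Eq. (2).
    sigma = (gamma(1 + lam) * sin(pi * lam / 2) /
             (gamma((1 + lam) / 2) * lam * 2 ** ((lam - 1) / 2))) ** (1 / lam)
    return rng.normal(0, sigma, shape) / np.abs(rng.normal(0, 1, shape)) ** (1 / lam)

def cfpa_kmeans(X, M, n_pollens=20, p=0.8, lam=1.5, mu=0.8, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    S, A = X.shape
    # B1: chaotic initialization -- iterate the logistic map (Eq. (4)) from random
    # seeds and scale into the data range; each pollen is an M x A matrix of
    # candidate centroids.
    z = rng.uniform(0.01, 0.99, (n_pollens, M, A))
    for _ in range(10):
        z = 4.0 * z * (1.0 - z)
    lo, hi = X.min(axis=0), X.max(axis=0)
    pollens = lo + z * (hi - lo)
    fit = np.array([fun(X, c) for c in pollens])
    best = pollens[fit.argmin()].copy()
    zc = 0.7  # state of the sine map driving the B9 chaos disturbance
    for _ in range(n_iter):  # stopping criterion: Ni iterations
        for i in range(n_pollens):
            if rng.random() < p:   # global pollination, Eq. (1)
                cand = pollens[i] + levy(lam, (M, A), rng) * (pollens[i] - best)
            else:                  # local pollination, Eq. (3)
                j, k = rng.choice(n_pollens, size=2, replace=False)
                cand = pollens[i] + rng.random() * (pollens[j] - pollens[k])
            zc = (mu / 4.0) * np.sin(np.pi * zc)      # sine map, Eq. (5)
            cand = cand + zc * (cand - best)          # B9 disturbance (assumed form)
            cand = np.clip(cand, lo, hi)
            f = fun(X, cand)
            if f < fit[i]:                            # keep improving candidates
                pollens[i], fit[i] = cand, f
        best = pollens[fit.argmin()].copy()
    return best, fit.min()
```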

4.4. Research objective and hypothesis formulation

Two hypotheses are formulated to analyze the results for cluster integrity and execution time (as defined in RO4):

Hypothesis 1: Under the null hypothesis, all seven algorithms have the same mean cluster integrity (intra-cluster distance); under the alternate hypothesis, at least one algorithm has a different mean cluster integrity.

H0: µ1 = µ2 = µ3 = µ4 = µ5 = µ6 = µ7
Ha: At least one algorithm has a different mean cluster integrity.


Table 2: Time complexity comparison of all algorithms.

Algorithm | Notation* | Time complexity (asymptotic notation)
CFPA-KMeans | CFPA | O(Ni * N * M * A)
FPA-KMeans | FPA | O(Ni * N * M * A)
CSA-KMeans | CSA | O(Ni * N * M * A)
BHA-KMeans | BHA | O(Ni * N * M * A)
BA-KMeans | BA | O(Ni * N * M * A)
FFA-KMeans | FFA | O(Ni * N^2 * M * A)
PSO-KMeans | PSO | O(Ni * N * M * A)

*For simplicity, these notations are used throughout the paper.

Table 3: Datasets used in the study.

Sr. no. | Dataset | No. of inputs | No. of classes | No. of instances
D1 | IRIS | 4 | 3 | 150
D2 | WINE | 13 | 3 | 178
D3 | BREAST_CANCER | 10 | 2 | 699
D4 | GLASS | 9 | 7 | 214
D5 | BALANCE | 4 | 3 | 625
D6 | DERMATOLGY | 34 | 6 | 366
D7 | HABERMAN | 3 | 2 | 306
D8 | ECOLI | 7 | 8 | 336
D9 | HEART | 13 | 2 | 370
D10 | TAE | 5 | 3 | 151
D11 | SPAMBASE | 31 | 2 | 4601
D12 | ILPD | 10 | 2 | 583
D13 | LEAF | 15 | 16 | 340
D14 | LIBRAS | 64 | 15 | 360
D15 | QUALITATIVE_BANKRUPTCY | 6 | 2 | 250
D16 | SYNTHETIC | 60 | 6 | 600

Fig. 6. Initial data points of IRIS (D1).

Hypothesis 2: Under the null hypothesis, all seven algorithms have the same mean execution time; under the alternate hypothesis, at least one algorithm has a different mean execution time.

H0: µ1 = µ2 = µ3 = µ4 = µ5 = µ6 = µ7
Ha: At least one algorithm has a different mean execution time.

There are two categories of statistical tests that can be used to test the hypotheses. (a) Parametric tests (e.g. ANOVA) can be applied under the following assumptions: (1) the accuracy differences are normally distributed; (2) there exists variance in the accuracy or residual error for all modeling techniques; (3) for every possible pair of modeling methods, the accuracy differences or the variance in the residual error over the two methods is similar [48]. (b) Non-parametric tests (e.g. the Friedman test) are known as distribution-free tests because they make no such assumptions [48]. Since our datasets do not satisfy the parametric-test assumptions, we apply the Friedman test. It can be applied to multiple algorithms on multiple datasets, and it ranks the algorithms [53]. Let $y_i^j$ be the rank of the $j$th algorithm on the $i$th dataset; the average rank is calculated using the Friedman test as defined in Eq. (15) [53]:

$$R_j = \frac{1}{N} \sum_i y_i^j \tag{15}$$

If the Friedman test rejects the null hypothesis as defined in Hypothesis 1 or Hypothesis 2, then a post-hoc Nemenyi test is used for pairwise comparisons [48,53]. The performance of two algorithms is significantly different if the corresponding mean ranks differ by at least the critical difference defined in Eq. (16) [53]:

$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}} \tag{16}$$
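These two steps can be reproduced with standard tooling; a small Python sketch (an assumption of this write-up, not the authors' code) using scipy's Friedman test is shown below, with dummy data standing in for the 16 x 7 result matrix. For reference, the paper's Table 8 reports q(0.10) = 2.693 and a critical difference of 1.454.

```python
import numpy as np
from scipy import stats

# results[i, j] = mean cluster integrity of algorithm j on dataset i
# (16 datasets x 7 algorithms; random values here purely for illustration).
rng = np.random.default_rng(0)
results = rng.random((16, 7))

# Friedman test: ranks each row and tests whether mean ranks differ (Eq. (15)).
chi2, p_value = stats.friedmanchisquare(*results.T)
avg_ranks = stats.rankdata(results, axis=1).mean(axis=0)  # R_j per algorithm

# Nemenyi critical difference of Eq. (16), using q_0.10 = 2.693 from Table 8.
k, n = results.shape[1], results.shape[0]
cd = 2.693 * np.sqrt(k * (k + 1) / (6.0 * n))

print(f"chi2={chi2:.2f} p={p_value:.3f} ranks={np.round(avg_ranks, 2)} CD={cd:.3f}")
```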

5. Experimental results

CFPA, along with the six other algorithms, has been used to perform data clustering on the sixteen datasets, and the four performance measures described in Section 4.3 have been evaluated. The mean, standard deviation, and minimum values of cluster integrity were calculated after executing each algorithm fifteen times. Table 5 shows the results for cluster integrity, execution time, and number of iterations to converge (NIC) for the seven algorithms. In order to check the stability of an algorithm, the count of runs achieving the minimum cluster integrity was also recorded for all datasets; e.g. the FPA algorithm gives a minimum cluster integrity value of 78.941 (Table 5), and it was observed that this minimum value was obtained in 13 out of 15 runs. Further, Figs. 7 to 16 show the cluster integrity (Fun) of the seven algorithms on nine datasets, and Figs. 19 to 23 (in the Appendix) show the convergence analysis of the seven algorithms for five datasets (whose results are clearly visible). From Tables 5 and 6, the interpretation of the results for some datasets is given below; the others can similarly be read from the tables.

For the IRIS (D1) dataset: Six algorithms (CFPA, FPA, CSA, BHA, BA, and FFA) achieved the minimum cluster integrity (i.e. 78.9408). Only CFPA found the best average cluster integrity (i.e. 78.94337). Further, CFPA has the best mean execution time (i.e. 0.021 µs) and the best mean NIC (i.e. 8) among these algorithms. The number of times the minimum value is observed (out of 15) is highest for CFPA, BHA, and FPA, at 14, 14, and 13 respectively.


Table 4: Adaptive parameter values for the experiments.

Sr. No. | Algorithm | Parameter settings | No. of agents
1 | CFPA | Probability (p) = 0.8; Lambda (λ) = 0.1; Sine map with µ = 0.8 | No. of pollens (N) = 20
2 | FPA | Probability (p) = 0.8; Lambda (λ) = 0.1 | No. of pollens (N) = 20
3 | CSA | Alpha (α) = 0.6 | No. of nests (N) = 20
4 | BHA | N/A | No. of stars (N) = 20
5 | BA | Qmin = 0; Qmax = 3; r = 0.8 | No. of bats (N) = 20
6 | FFA | Alpha (α) = 0.6; Gamma (γ) = 0.1 | No. of fireflies (N) = 20
7 | PSO | Alpha (α) = 0.6; Beta (β) = 0.7; Gamma (γ) = 0.3 | No. of particles (N) = 20

Fig. 7. Mean Fun of D1.
Fig. 8. Mean Fun of D3.
Fig. 9. Mean Fun of D4.
Fig. 10. Mean Fun of D5.

The minimum error (E) (i.e. 0.000415) is achieved only by CFPA; this implies that CFPA is the most stable algorithm for this dataset.

For the BREAST_CANCER (D3) dataset: Three algorithms (CFPA, BHA, and BA) achieved the minimum cluster integrity (i.e. 743E+13), but only BHA found the minimum average value (i.e. 7426E+13). CSA has the best mean execution time (i.e. 0.063) and PSO has the best mean NIC (i.e. 11). The number of times the minimum value is observed (out of 15) is highest for BHA (i.e. 14) among the above three algorithms, and BHA also has the minimum E (i.e. 3.26E−15). This implies that BHA is the most stable algorithm for this dataset.

For the GLASS (D4) dataset: CFPA evaluated the minimum average cluster integrity (i.e. 301.24). CSA has the best mean execution time (i.e. 0.067) and FPA has the best mean NIC (i.e. 3). This is a complex dataset, but the number of times the minimum value is observed (out of 15) is highest for BHA (i.e. 3). This implies that BHA has more stability for this dataset.

For the BALANCE (D5) dataset: All algorithms except CSA achieved the best minimum cluster integrity (i.e. 3472.3214), but only CFPA

Fig. 11. Mean Fun of D6.

Fig. 12. Mean Fun of D8.

Fig. 13. Mean Fun of D10.

Fig. 14. Mean Fun of D13.


Table 5: Cluster integrity (Fun), execution time, and NIC results.


and BHA evaluated the best mean cluster integrity. CSA has the best mean time (i.e. 0.116). The number of times the minimum value is observed (out of 15) is highest for CFPA and BHA, at 15 each, whereas E is minimum for CFPA (i.e. 1.29E−14) and BHA (i.e. 0). Hence, for this dataset, both CFPA and BHA are stable algorithms.

For the ECOLI (D8) dataset: CFPA achieved the best minimum and mean cluster integrity, which are 288890.24 and 290958.95


Table 6: Percentage error of the algorithms.

No. of times the optimal cluster integrity was found (out of 15):

Dataset | CFPA | FPA | CSA | BHA | BBA | FFA | PSO
D1 | 14 | 13 | 7 | 14 | 4 | 3 | 10
D2 | 15 | 15 | 15 | 15 | 15 | 15 | 15
D3 | 1 | 15 | 15 | 14 | 4 | 15 | 15
D4 | 1 | 1 | 1 | 3 | 1 | 1 | 1
D5 | 15 | 1 | 2 | 15 | 12 | 3 | 1
D6 | 1 | 1 | 2 | 7 | 4 | 1 | 1
D7 | 15 | 15 | 14 | 15 | 11 | 11 | 2
D8 | 1 | 1 | 1 | 1 | 1 | 1 | 1
D9 | 15 | 15 | 13 | 15 | 7 | 2 | 15
D10 | 14 | 12 | 1 | 15 | 12 | 4 | 4
D11 | 15 | 15 | 2 | 15 | 15 | 15 | 15
D12 | 15 | 15 | 3 | 15 | 15 | 12 | 15
D13 | 1 | 1 | 1 | 1 | 1 | 1 | 1
D14 | 1 | 1 | 1 | 2 | 1 | 1 | 1
D15 | 9 | 15 | 2 | 15 | 13 | 2 | 1
D16 | 4 | 1 | 1 | 1 | 1 | 1 | 1

Error % calculated using Eq. (14):

Dataset | CFPA | FPA | CSA | BHA | BBA | FFA | PSO
D1 | 0.000415 | 0.00804 | 0.194795 | 0.019426 | 0.22875 | 0.198317 | 0.49479
D2 | 0 | 0 | 0 | 0 | 0 | 0 | 0
D3 | 41.968555 | 0 | 0 | 3.26E−15 | 15.8316 | 0 | 0
D4 | 0.475784 | 1.8601 | 0.693983 | 1.289568 | 2.25693 | 1.288623 | 1.34491
D5 | 1.29E−14 | 0.01302 | 0.027909 | 0 | 0.00912 | 0.052004 | 0.07166
D6 | 0.479241 | 0.41277 | 0.766536 | 0.012559 | 0.02944 | 0.444378 | 1.649
D7 | 0 | 0 | 0.00042 | 0 | 4.78E−05 | 0.005239 | 0.031463
D8 | 0.094042 | 0.32922 | 0.351723 | 0.092932 | 0.42167 | 0.133775 | 0.61044
D9 | 0 | 0 | 0.003095 | 0 | 0.02368 | 0.030038 | 1E−14
D10 | 0.00135 | 0.00088 | 0.266876 | 0 | 0.00161 | 0.065358 | 0.44133
D11 | 0 | 0 | 0.04052 | 6.57E−15 | 2.9E−15 | 0 | 8.3E−15
D12 | 0 | 0 | 0.413414 | 0 | 0 | 0.210703 | 9.6E−15
D13 | 2.644713 | 1.60488 | 2.942373 | 1.80865 | 3.8558 | 6.294389 | 2.76622
D14 | 0.221433 | 0.80485 | 0.631787 | 0.296136 | 0.78535 | 0.572681 | 0.77456
D15 | 0.009018 | 0 | 0.051118 | 0 | 0.00424 | 0.098008 | 0.12943
D16 | 0.072651 | 2.58432 | 2.583047 | 2.58243 | 2.61206 | 2.582226 | 2.58652

respectively. CSA has the best mean execution time (i.e. 0.069), and CFPA achieved the best mean NIC. The optimal value was observed by each algorithm in only one out of 15 runs; hence all algorithms are unstable for this dataset.

For the HEART (D9) dataset: The minimum cluster integrity value is observed by four algorithms (CFPA, FPA, BHA, and BA), but CFPA produces the minimum NIC (i.e. 6). CSA has the minimum execution time. The number of times the minimum value is observed (out of 15) is highest for FPA and BHA, at 15 each. Also, the error is minimum for FPA and BHA.

For the TAE (D10) dataset: All algorithms achieved the best minimum cluster integrity (i.e. 16775.38), but only BHA evaluated the best mean cluster integrity. CFPA has the best mean execution time (i.e. 0.019). The number of times the minimum value is observed (out of 15) is highest for BHA, at 15. Hence, for this dataset, BHA is a stable algorithm.

For the LEAF (D13) dataset: The minimum cluster integrity value was evaluated by CFPA and FPA, and CFPA produces the minimum NIC (i.e. 11). Additionally, CSA has the minimum execution time. The number of times the minimum cluster integrity is observed (out of 15) is the same for CFPA and FPA, namely 1. However, the error is minimum for CFPA.


Fig. 15. Mean Fun of D14.

Table 7: Comparative study of the present study's results and the results of Tang et al. 2012 [26].

Dataset | Present study: CFPA | Present study: BHA | Tang et al. 2012 [26]: C-Cuckoo | Tang et al. 2012 [26]: C-Bat | I(b)
D1 | 78.9408(a) | 78.9408(a) | 78.9408(a) | 78.9408(a) | 0%
D2 | 2.37E+06(a) | 2.37E+06(a) | 2.37E+06(a) | 2.37E+06(a) | 0%
D7 | 30507.02(a) | 30507.02(a) | 30507.02(a) | 30507.89 | 0%
D16 | 922063.61(a) | 922065.61 | 9.52E+05 | 9.44E+05 | 3.17%

(a) Best value. (b) Percentage improvement in cluster integrity over the previous study.

Fig. 16. Mean Fun of D16.

Table 8: Statistical Friedman test results on the basis of cluster integrity and execution time.

Ranks on the basis of cluster integrity (Hypothesis 1): CFPA = 3.12; BHA = 3.43; BBA = 4.65; FPA = 4.75; PSO = 5.00; FFA = 5.62; CSA = 5.68. Chi-square (χ²_F) = 147.46; χ²(0.10, 6) = 10.6446. Hence the null hypothesis is rejected.

Ranks on the basis of execution time (Hypothesis 2): CFPA = 1.62; CSA = 1.62; PSO = 3.96; BHA = 4.46; FPA = 4.59; BBA = 4.71; FFA = 7.00. Chi-square (χ²_F) = 73.27; χ²(0.10, 6) = 10.6446. Hence the null hypothesis is rejected.

q(0.10) = 2.693; critical difference = 1.454 (for the Nemenyi post-hoc test after rejection).

5.1. Analysis of results

For execution time: The execution times of all algorithms were calculated, and it was found that CFPA and CSA gave the best results for the majority (i.e. nine out of sixteen) of the datasets. For the remaining five algorithms, the execution time is significantly higher.

For cluster integrity: CFPA and BHA both gave the best cluster integrity results (i.e. nine out of sixteen datasets each), followed by FPA (five out of sixteen). All algorithms have the same mean cluster integrity on D2 and D11.

For stability: An algorithm is stable if it produces the optimal result the maximum number of times with minimum error. Table 6 reports the stability of the algorithms. As shown in Table 6, CFPA is stable for D1, D2, D5, D7, D9, D10, D11, D12, D15, and D16 (ten out of sixteen). FPA is stable for D1, D2, D3, D7, D9, D10, D11, D12, and D15 (nine out of sixteen). BHA is stable for D1, D2, D3, D5, D7, D9, D10, D11, D12, and D15 (ten out of sixteen). CSA is stable for D1, D2, D3, D7, and D9 (five out of sixteen). BBA is stable for D2, D5, D7, D10, D11, D12, and D15 (seven out of sixteen). FFA is stable for D2, D3, D7, D11, and D12 (five out of sixteen). PSO is stable for D1, D2, D3, D9, D11, and D12 (six out of sixteen). This shows that CFPA, FPA, and BHA are stable algorithms.

For convergence rate: Figs. 19 to 23 in the Appendix show the convergence graphs of the seven algorithms. The Appendix and the NIC results in Table 5 show that CFPA and FPA have fast convergence.

Table 7 compares the results of Tang et al. 2012 [26] with the results from Table 5. The comparison is made on the four datasets common to both studies, using the best two algorithms of Tang et al. 2012 [26] and of this study. Table 7 shows that Chaotic FPA is the best algorithm and BHA the next best for the D16 dataset; for the other three datasets, the results were the same.

5.2. Hypothesis testing

The non-parametric Friedman test is now used to test the two hypotheses defined in Section 4.4. Table 8 shows the results of the Friedman test, whose findings are listed below:

• For Hypothesis 1, χ²_F > χ²(0.10, 6) (147.46 > 10.6446); hence the null hypothesis is rejected, and there are at least two algorithms that are not equivalent. The Nemenyi test is then applied for pairwise comparison [48,53]; the performance of two algorithms is significantly different if the corresponding mean ranks differ by at least the critical difference [53]. Fig. 17 shows that, for Hypothesis 1, CFPA has the minimum mean rank (3.12), but CFPA and BHA have overlapping areas (i.e. Rank of CFPA + CD/2 > Rank of BHA − CD/2 [3.84 > 2.71]). This means CFPA and BHA have the same performance. Moreover, of the other six algorithms, CFPA is significantly superior to five, whereas BHA is significantly superior to four.

• For Hypothesis 2, χ²_F > χ²(0.10, 6) (73.27 > 10.6446); hence the null hypothesis is rejected, and there are at least two algorithms that are not equivalent. The Nemenyi test is again applied for pairwise comparison [48,53]. CFPA and CSA both have the minimum mean rank (1.62), which shows that they have the same performance on the basis of execution time. Fig. 18 shows that both CFPA and CSA are significantly superior to the other five algorithms.


Fig. 17. Comparison of Algorithms by Nemenyi test Hypothesis 1.

Fig. 18. Comparison of Algorithms by Nemenyi test Hypothesis 2.

5.3. Summary of results

The results of Tables 5 to 8 and Figs. 17 to 23 are summarized in Table 9, which covers cluster integrity, convergence rate, execution time, and the stability of the algorithms. It shows that CFPA and BHA are the best algorithms for cluster integrity, and CFPA and CSA are the best algorithms on the basis of execution time. In Table 9, seven stars correspond to the best algorithm and one star to the worst; these stars do not represent values on a ratio scale but signify values on an ordinal scale. As shown in Table 9, for cluster integrity CFPA is the best algorithm, BHA the next best, and CSA the worst. For execution time, CSA is the best algorithm, CFPA the next best, and FFA the worst. For convergence rate, CFPA is the best algorithm, FPA the next best, and CSA the worst. According to the stability factor, BHA is the best algorithm, CFPA the next best, and CSA the worst.

There exists a trade-off between convergence rate, time taken, and cluster integrity value. Depending on the problem at hand, a researcher can decide which parameter is more important. Clustering can be applied to medical diagnostics, intrusion detection, fault detection systems, etc. For fault detection and intrusion detection systems (or any real-time system), algorithms that perform better in terms of time taken are preferred. In medical diagnostics, the efficiency of finding clusters is important, for which cluster integrity needs to be optimized; so algorithms that perform better on the basis of cluster integrity should be selected.

6. Threats to validity

Threats to validity deal with the generalization of results; there can be several potential threats to the validity of an empirical study on clustering. This study used sixteen datasets for validation; however, more datasets could be used to generalize the results further. Seven algorithms have been tested in the data clustering experiments with different parameter settings. Since we want to compare the results with previous swarm algorithms, this work uses the values of the adaptive parameters used in previous studies in the field [5,13,15,16,22,26,46]. The presented parameter values might have been chosen in favor of swarm algorithms; however, these parameter values can be modified to observe their effect on the performance of the algorithms.

7. Conclusion and future work

This work has focused on performing partitional clustering with CFPA over K-means. Earlier, partitional clustering was done with other nature-inspired algorithms (FPA, CSA, BHA, BA, FFA, PSO) over the K-means technique. Seven algorithms (CFPA, FPA, CSA, BHA, BA, FFA, PSO) over K-means have been implemented for cluster integrity, and these seven algorithms have been compared on sixteen datasets on the basis of the performance parameters cluster integrity, execution time, and convergence rate. The experimental results for three performance parameters (cluster integrity, execution time, and number of iterations required to


Table 9: Summary of results.

Algorithm | Asymptotic notation | Fun | Time | Convergence rate | Stable
CFPA | O(Ni * N * M * A) | ******* | ****** | ******* | ******
FPA | O(Ni * N * M * A) | **** | *** | ****** | *****
CSA | O(Ni * N * M * A) | * | ******* | * | *
BHA | O(Ni * N * M * A) | ****** | **** | ** | *******
BA | O(Ni * N * M * A) | ***** | ** | *** | ****
FFA | O(Ni * N^2 * M * A) | *** | * | **** | ***
PSO | O(Ni * N * M * A) | ** | ***** | ***** | **

convergence), along with the stability and asymptotic notation of each algorithm, have been presented in this study. Statistical validation of the results is done using non-parametric tests. The Friedman test is used to rank the seven algorithms on the basis of cluster integrity and execution time. The results indicate that CFPA has the minimum mean rank for cluster integrity, whereas CFPA and CSA have the minimum mean rank for execution time. After obtaining the ranks, the Nemenyi test is applied for pairwise comparison. On the selected sixteen datasets, the study observed the following:

• For cluster integrity, CFPA and BHA are the best algorithms and perform equally well. However, CFPA is significantly superior to five of the six other algorithms, while BHA is significantly superior to only three of the six, implying the superiority of CFPA over BHA.
• For execution time, CFPA and CSA have the same performance and are significantly superior to the other algorithms.
• For stability, CFPA, FPA and BHA are the most stable algorithms.
• For convergence rate, CFPA and FPA have the fastest convergence.
• This work reports an improvement in cluster integrity of 3.17% for the D16 dataset compared to an earlier study [26].
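To illustrate the statistical procedure referred to above, the following is a minimal Python sketch of the Friedman test followed by the Nemenyi critical difference, assuming a score matrix of shape (datasets × algorithms). SciPy provides friedmanchisquare; the critical value q_0.05 = 2.949 for seven algorithms is taken from Demsar's table [53]. The example data and names are illustrative placeholders, not the paper's results.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# scores[i, j]: cluster integrity of algorithm j on dataset i (lower is better).
# Shape assumed: 16 datasets x 7 algorithms, as in the experiments.
rng = np.random.default_rng(0)
scores = rng.random((16, 7))  # placeholder data for the sketch

# Friedman test: do the 7 algorithms' rank distributions differ?
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

if p < 0.05:  # null hypothesis rejected -> pairwise Nemenyi comparison
    n_datasets, k = scores.shape
    mean_ranks = rankdata(scores, axis=1).mean(axis=0)
    # Nemenyi critical difference; q_0.05 = 2.949 for k = 7 (Demsar [53])
    cd = 2.949 * np.sqrt(k * (k + 1) / (6.0 * n_datasets))
    print("mean ranks:", np.round(mean_ranks, 2), "critical difference:", round(cd, 3))
    # Two algorithms differ significantly if their mean ranks differ by more than cd.
```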


There are many new research directions in this field, including the following:

• New nature-inspired metaheuristics, such as the Krill Herd algorithm [49] and the Ageist Spider Monkey Optimization algorithm [50], can also be employed to solve partitional clustering problems.
• In real-life datasets, cluster analysis often has to be carried out under certain constraints [51]. In recent research articles, swarm intelligence is used for constraint-handling problems [51].
• In many cluster analysis applications, there is a need for stability or consistency of results [5,51]. As most nature-inspired algorithms are heuristic in nature, the stability of these clustering algorithms is still a barren area of research.

Acknowledgments

We are grateful to the Librarian and the staff of UIRC, GGSIPU, who assisted us in finding the relevant research papers and books in the desired field of work. Finally, we are thankful to the Almighty God who has given us the power, good sense and confidence to complete this research paper.


Appendix. Convergence graphs for five datasets

The convergence graphs of the seven selected algorithms are presented in this section. Five of the sixteen datasets are used for convergence analysis. To compute the time complexity of an algorithm, Ni (i.e. the X-axis) is an important factor, as Ni is proportional to time complexity. The convergence graphs are shown below with brief descriptions.

(1) For the Breast Cancer dataset (D3): As seen in Fig. 19, the convergence rate of PSO is poorer than that of the other algorithms, and it does not produce an optimal value. BA, BHA and CFPA all converge to the optimum value, but CFPA requires the fewest iterations to converge. Hence CFPA has the minimum time complexity.

Fig. 19. Convergence analysis for the Breast Cancer (D3) dataset.

(2) For the Glass dataset (D4): Fig. 20 demonstrates that the optimal value is reached only by CFPA and BHA, with CFPA taking the minimum number of iterations to converge, thereby resulting in the minimum time complexity.

Fig. 20. Convergence analysis for the Glass (D4) dataset.
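The convergence data behind graphs such as Figs. 19 to 23 can be gathered by recording the best objective value at each iteration of a run. Below is a minimal Python sketch, assuming each optimizer exposes a step() method and a best_fitness attribute; these names are illustrative placeholders, not the interface of the original implementation.

```python
import matplotlib.pyplot as plt

def convergence_curve(optimizer, max_iters):
    """Record the best cluster-integrity value found after each iteration."""
    history = []
    for _ in range(max_iters):
        optimizer.step()                    # one iteration of the metaheuristic
        history.append(optimizer.best_fitness)
    return history

# Hypothetical usage: one curve per algorithm on the same dataset.
# for name, opt in {"CFPA": cfpa_opt, "BHA": bha_opt, "PSO": pso_opt}.items():
#     plt.plot(convergence_curve(opt, 200), label=name)
# plt.xlabel("Ni (iterations)")
# plt.ylabel("Cluster integrity")
# plt.legend()
# plt.show()
```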




(3) For the Heart dataset (D9): In Fig. 21, almost all algorithms behave similarly and reach the optimal value within a minimum number of iterations, except FFA, which also finds the optimal value but requires a larger Ni.

Fig. 21. Convergence analysis for the Heart (D9) dataset.

(4) For the ILPD dataset (D12): Fig. 22 shows that FPA, BHA and BA give the optimal value; however, FPA gives the most optimal value with the minimum Ni. Therefore, for this dataset, FPA gives the best results.

Fig. 22. Convergence analysis for the ILPD (D12) dataset.

(5) For the Libra dataset (D14): As can be observed from Fig. 23, BHA gives an optimal value (313.90) but requires a large Ni, whereas CFPA reaches an optimal value of 318.29 with a low Ni. Further, PSO, FPA, CSA, BA and FFA follow in increasing order of optimality.

Fig. 23. Convergence analysis for the Libra (D14) dataset.

References

[1] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: A review, ACM Comput. Surv. 31 (3) (1999) 264–323.
[2] R.R. Mettu, Approximation Algorithms for NP-Hard Clustering Problems, vol. 114 (Ph.D. thesis), The University of Texas, 2002.
[3] T. Hassanzadeh, A new hybrid approach for data clustering using firefly algorithm and k-means, in: The 16th CSI International Symposium on Artificial Intelligence and Signal Processing, 2012.
[4] J. Senthilnath, S.N. Omkar, V. Mani, Clustering using firefly algorithm: Performance study, Swarm Evol. Comput. 1 (2011) 164–171.
[5] S.J. Nanda, G. Panda, A survey on nature inspired metaheuristic algorithms for partitional clustering, Swarm Evol. Comput. 16 (2014) 1–18.
[6] S. Fan, S. Ding, Y. Xue, Self-adaptive kernel k-means algorithm based on the shuffled frog leaping algorithm, Soft Comput. 22 (3) (2018) 861–872.
[7] A. Vattani, K-means requires exponentially many iterations even in the plane, Discrete Comput. Geom. 45 (2011) 596–616.
[8] D. Arthur, B. Manthey, H. Röglin, k-means has polynomial smoothed complexity, in: Proceedings of the 50th Symposium on Foundations of Computer Science, FOCS '09, 2009, pp. 1–26.
[9] T.H. Cormen, C.E. Leiserson, R.L. Rivest, C. Stein, Introduction to Algorithms, second ed., 2002.
[10] C. Grosan, A. Abraham, M. Chis, Swarm intelligence in data mining, Stud. Comput. Intell. 34 (2006) 1–20.
[11] D. Martens, B. Baesens, T. Fawcett, Editorial survey: Swarm intelligence for data mining, Mach. Learn. 82 (2011) 1–42.
[12] Y. Kao, K. Cheng, An ACO-based clustering algorithm, in: M. Dorigo, et al. (Eds.), ANTS, in: LNCS, vol. 4150, Springer, Berlin, 2006, pp. 340–347.
[13] R. Younsi, W. Wang, A new artificial immune system algorithm for clustering, in: Z.R. Yang (Ed.), in: LNCS, vol. 3177, Springer, Berlin, 2004, pp. 58–64.
[14] D. Karaboga, C. Ozturk, A novel clustering approach: Artificial Bee Colony (ABC) algorithm, Appl. Soft Comput. 11 (1) (2010) 652–657.
[15] S. Karthikeyan, E.J. Thomson Fredrik, An efficient clustering approach using hybrid swarm intelligence based artificial bee colony-firefly algorithm, Indian J. Sci. Technol. 9 (39) (2016) 1–13.
[16] X.S. Yang, Nature-Inspired Metaheuristic Algorithms, second ed., Luniver Press, 2008.
[17] D.W. van der Merwe, A.P. Engelbrecht, Data clustering using particle swarm optimization, in: IEEE Congress on Evolutionary Computation, vol. 1, 2003, pp. 215–220.
[18] D. Binu, M. Selvi, A. George, MKF-cuckoo: Hybridization of cuckoo search and multiple kernel-based fuzzy C-means algorithm, in: AASRI Conference on Intelligent Systems and Control, vol. 5, Elsevier, 2013, pp. 243–249.
[19] B.K. Elfarra, T.J. El Khateeb, W.M. Ashour, BH-centroid: A new efficient clustering algorithm, Int. J. Artif. Intell. Appl. Smart Dev. 1 (2013) 15–24.
[20] C. Blum, M.J. Blesa Aguilera, A. Roli, M. Sampels, Hybrid Metaheuristics: An Emerging Approach to Optimization, Springer, 2008.
[21] S.K. Pal, C.S. Rai, A.P. Singh, Comparative study of firefly algorithm and particle swarm optimization for noisy non-linear optimization problems, Int. J. Intell. Syst. Appl. 10 (2012) 50–57.
[22] X.S. Yang, Flower pollination algorithm for global optimization, Springer, Berlin, Heidelberg, 2012, pp. 240–249.
[23] A. Ouyang, G. Pan, G. Yue, J. Du, Chaotic cuckoo search algorithm for high dimensional functions, J. Comput. 9 (5) (2014).
[24] H. Liu, A. Abraham, M. Clerc, Chaotic dynamic characteristics in swarm intelligence, Appl. Soft Comput. 7 (3) (2007) 1019–1026.
[25] I. Fister, M. Perc, S.M. Kamal, A review of chaos-based firefly algorithms: Perspectives and research challenges, Appl. Math. Comput. 252 (2015) 155–165.
[26] R. Tang, S. Fong, X.S. Yang, S. Deb, Integrating nature-inspired optimization algorithms to K-means clustering, IEEE, 2012, pp. 116–123.
[27] Y. Song, Z. Chen, Z. Yuan, New chaotic PSO-based neural network predictive control for nonlinear process, IEEE Trans. Neural Netw. 18 (2) (2007) 595–601.


[28] L. Hongwu, An adaptive chaotic particle swarm optimization, in: IEEE ISECS International Colloquium on Computing, Communication, Control, and Management, 2009.
[29] Y.Y. Hong, A.A. Beltran, A.C. Paglinawan, A chaos-enhanced particle swarm optimization with adaptive parameters and its application in maximum power point tracking, Math. Probl. Eng. 2016 (2016).
[30] X.S. Yang, Chaos-enhanced firefly algorithm with automatic parameter tuning, Int. J. Swarm Intell. Res. 2 (4) (2011) 1–11.
[31] L. Xiang-Tao, Y. Ming-Hao, Parameter estimation for chaotic systems using the cuckoo search algorithm with an orthogonal learning method, Chin. Phys. B 21 (5) (2012) 050507.
[32] H. Aslani, M. Yaghoobi, M.R. Akbarzadeh-T, Chaotic inertia weight in black hole algorithm for function optimization, in: 2015 International Congress on Technology, Communication and Knowledge (ICTCK), IEEE, 2015, pp. 123–129.
[33] A. Kaur, S.K. Pal, A.P. Singh, New chaotic flower pollination algorithm for unconstrained non-linear optimization functions, Int. J. Syst. Assur. Eng. Manag. 9 (4) (2018) 853–865.
[34] X.S. Yang, Chaos-enhanced firefly algorithm with automatic parameter tuning, Int. J. Swarm Intell. Res. 2 (4) (2012) 125–136.
[35] S. Talatahari, B.F. Azar, R. Sheikholeslami, A.H. Gandomi, Imperialist competitive algorithm combined with chaos for global optimization, Commun. Nonlinear Sci. Numer. Simul. 17 (3) (2012) 1312–1319.
[36] M. Kohli, S. Arora, Chaotic grey wolf optimization algorithm for constrained optimization problems, J. Comput. Des. Eng. (2017), http://dx.doi.org/10.1016/j.jcde.2017.02.005.
[37] S. Łukasik, P.A. Kowalski, Study of flower pollination algorithm for continuous optimization, in: Intelligent Systems, Springer, Cham, 2015, pp. 451–459.
[38] V.S.S.S. Vedula, S.R. Paladuga, M.R. Prithvi, Synthesis of circular array antenna for sidelobe level and aperture size control using flower pollination algorithm, Int. J. Antennas Propag. 2015 (2015).
[39] A.A.A. Esmin, S. Matwin, Data clustering using hybrid PSO, in: Intelligent Data Engineering and Automated Learning (IDEAL), in: Springer Lecture Notes, 2012.
[40] A. Hatamlou, S. Abdullah, H. Nezamabadi-pour, A combined approach for clustering based on K-means and gravitational search algorithms, Swarm Evol. Comput. 6 (2012) 47–52.
[41] J. Senthilnath, V. Das, S.N. Omkar, V. Mani, Clustering using levy flight cuckoo search, in: Proceedings of Seventh International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA 2012), vol. 1, Springer, 2012, pp. 164–171.
[42] A. Hatamlou, Black hole: A new heuristic optimization approach for data clustering, Inform. Sci. 222 (2013) 175–184.
[43] R. Jensi, G.W. Jiji, Hybrid data clustering approach using k-means and flower pollination algorithm, Adv. Comput. Intell. Int. J. 2 (2) (2015) 15–25.
[44] A. Kaur, S.K. Pal, A.P. Singh, Hybridization of k-means and firefly algorithm for intrusion detection system, Int. J. Syst. Assur. Eng. Manag. 9 (4) (2018) 901–910.
[45] UCI Repository of Machine Learning Databases, 1998, http://www.ics.uci.edu/mlearn/MLRepository.html.
[46] J. Kennedy, R.C. Eberhart, Particle swarm optimization, in: IEEE International Conference on Neural Networks, Piscataway, NJ, 1995, pp. 1942–1948.
[47] J.J. Thiagarajan, K.N. Ramamurthy, A. Spanias, Optimality and stability of the K-hyperline clustering algorithm, Pattern Recognit. Lett. 32 (9) (2011) 1299–1304.
[48] A. Kaur, K. Kaur, Statistical comparison of modeling methods for software maintainability prediction, Int. J. Softw. Eng. Knowl. Eng. 23 (6) (2013) 743–774.
[49] A.H. Gandomi, A.H. Alavi, Krill herd: A new bio-inspired optimization algorithm, Commun. Nonlinear Sci. Numer. Simul. 17 (2012) 4831–4845.
[50] A. Sharma, A. Sharma, B.K. Panigrahi, D. Kiran, R. Kumar, Ageist spider monkey optimization algorithm, Swarm Evol. Comput. 28 (2016) 58–77.
[51] A. Bhattacharya, R. Jaiswal, A. Kumar, Faster algorithms for the constrained k-means problem, in: 33rd Symposium on Theoretical Aspects of Computer Science, 2016, pp. 16:1–16:13.
[52] L.I. Kuncheva, D.P. Vetrov, Evaluation of stability of k-means cluster ensembles with respect to random initialization, IEEE Trans. Pattern Anal. Mach. Intell. 28 (11) (2006) 1798–1808.
[53] J. Demsar, Statistical comparison of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
