An intelligent clustering algorithm for high-dimensional multiview data in big data applications


ARTICLE IN PRESS

JID: NEUCOM

[m5G;July 30, 2019;16:51]

Neurocomputing xxx (xxxx) xxx

Contents lists available at ScienceDirect

Neurocomputing journal homepage: www.elsevier.com/locate/neucom

Qian Tao a,b, Chunqin Gu a, Zhenyu Wang a,∗, Daoning Jiang a

a School of Software Engineering, South China University of Technology, Canton 510006, China
b Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston 02881, USA

Article info

Article history: Received 20 November 2017; Revised 15 November 2018; Accepted 10 December 2018; Available online xxx

Keywords: Big data; Clustering; High dimension multiview data; Optimization; Spark

Abstract

High-dimensional multiview data are common in big data applications. Classic clustering algorithms, which treat all features of the data as equally relevant, handle such data poorly. To tackle this problem, this paper proposes a novel intelligent weighting k-means clustering (IWKM) algorithm based on swarm intelligence. Firstly, the degree of coupling between clusters is introduced into the clustering model to enlarge the dissimilarity between clusters, and separate weights for views and features are used in the weighted distance function that assigns objects to clusters. Secondly, to reduce the sensitivity to initial cluster centers, swarm intelligence is used to find the initial cluster centers, the weights of views, and the weights of features by a global search. Lastly, a precise perturbation is proposed to improve the optimization performance of swarm intelligence. To verify the clustering performance on high-dimensional multiview data, experiments were carried out on five big data applications, using the Rand Index, Jaccard Coefficient and Folkes Russel evaluation metrics, on two computational platforms: Apache Spark and a single node. The experimental results show that IWKM is effective and efficient in clustering high-dimensional multiview data, and that it outperforms five other approaches on these complicated data sets with more views and higher dimensions on both platforms. © 2019 Published by Elsevier B.V.

1. Introduction

Currently, artificial intelligence, the mobile internet, social networks and the internet of things generate a huge volume of data and drive the rapid development of big data applications. Big data [2–5], which is characterized by volume, variety, velocity, veracity, and value, is so complex and voluminous that traditional data processing approaches are inadequate to deal with it. In big data environments, one of the most challenging problems is how to analyze high-dimensional multiview data for various complex and large-scale applications. High-dimensional multiview data are often described by multiple feature spaces and different structures, which are obtained from various sources. For example, a picture can be represented with four views: color, texture, shape and text. Web pages usually consist of three views: 1. a text view whose elements are the words in the web page; 2. a picture view whose elements are the pictures that appear in the web page; 3. a hyper-link view which contains all hyper-links pointing to the page. In a hospital, a patient's dataset can be divided into views of blood data, genetic data, cerebrospinal fluid data, and magnetic resonance images.

Generally, the clustering of multiview data is an NP-hard problem [6], which has led many researchers to propose clustering algorithms for all kinds of real-world applications. An unsupervised feature selection method for multiview tasks that preserves the cluster structure, solved by an alternating algorithm, is presented in [7]. A novel multi-view affinity propagation algorithm has been provided for multi-view clustering, which is especially suitable for clustering more than two views [8]. Chen et al. [9] proposed an automated two-level variable weighting clustering (TWKM) algorithm for multiview data, which can simultaneously compute weights for views and individual variables. However, the above algorithms mainly focus on view-wise relationships and ignore the importance of feature-wise relationships.

R This work is supported in part by the Guangdong province Natural Science Foundation under Grant No. 2015A030310194 and 2018A030313396, in part by the Guangdong Province Science and Technology Project under Grant No. 2015A080804019, in part by the Science and Technology Program of Guangzhou No. 201802010025 and 2014KTSCX194, in part by Education Project of the Ministry of Education and Google No. x2rjE9190130, and in part by the Fund Project of Guangzhou No. 2019PT103. This paper is an extended version of the paper [1] published in ICNC-FSKD 2017 with more than 50% new content.
∗ Corresponding author. E-mail addresses: [email protected] (Q. Tao), [email protected] (C. Gu), [email protected] (Z. Wang), [email protected] (D. Jiang).

https://doi.org/10.1016/j.neucom.2018.12.093 0925-2312/© 2019 Published by Elsevier B.V.

Please cite this article as: Q. Tao, C. Gu and Z. Wang et al., An intelligent clustering algorithm for high-dimensional multiview data in big data applications, Neurocomputing, https://doi.org/10.1016/j.neucom.2018.12.093


The results of clustering are not consistent across different views. Moreover, the performance of multiview clustering algorithms needs to be further enhanced to process the more complex and high-dimensional data found in big data applications.

The clustering of high-dimensional multiview data is entirely different from traditional clustering approaches, which take all views as a flat set of variables, because the individual views play important roles in partitioning. Moreover, owing to its complex structure, different sources and huge size, the clustering of high-dimensional multiview data is more complicated and depends heavily on the performance of the computing platform. As a fast and general engine for large-scale data processing, Apache Spark offers a fast computing platform for the clustering of high-dimensional multiview data in various big data applications, such as image segmentation, categorization of web pages, and medical diagnosis.

Particle swarm optimization (PSO) [10], developed by Kennedy and Eberhart in 1995, is widely used to optimize all kinds of scientific problems and improve the quality of solutions. The work [11] developed hybrid quantum particle swarm optimization to optimize energy-efficient resource allocation problems. In [12], a rotary chaotic particle swarm optimization algorithm is proposed to solve trustworthy scheduling of a grid workflow. To obtain good convergence and a well-spread optimal Pareto front, a novel hybrid particle swarm optimization is provided for solving multi-objective optimization problems [13]. A constrained motion PSO is proposed to optimize support vector regression free parameters for nonlinear time series regression and prediction [14]. An alternative to physical relocation based on PSO is also presented to maximize the power extraction from a PV array [15]. The above PSO algorithms perform well in various optimization applications.

Currently, big data presents both an opportunity and a challenge for PSO algorithms. There is an urgent need for new PSO variants that run efficiently on Spark to optimize the clustering of high-dimensional multiview data for big data applications.

To tackle the clustering of high-dimensional multiview data for various big data applications, a novel intelligent weighting k-means (IWKM) algorithm is proposed in this paper. In the first step, the degree of coupling between clusters is introduced into the clustering model to enlarge the dissimilarity between clusters. In the second step, the cluster centers, weights of views, and weights of features are encoded into the particle representation, and a chaotic particle swarm optimization (CPSO) approach is used to obtain better initial cluster centers, weights of views, and weights of features. In the third step, a precise perturbation is proposed to improve the optimization performance of CPSO. To evaluate its effectiveness, IWKM is compared with five other clustering algorithms on five high-dimensional multiview data sets on two computational platforms: Apache Spark and a single node.

The remainder of this paper is organized as follows: Section 2 reviews the related work and background. Section 3 discusses the clustering of big data and Apache Spark. IWKM is presented in detail in Section 4. The experimental results are provided in Section 5. Conclusions and future work are finally drawn in Section 6.

2. Related works and background

Clustering partitions a collection of data objects so that objects within the same cluster are similar to each other and dissimilar to the objects in other clusters. Given a data object set $X = [x_{i,j}]_{N \times D}$, N is the number of data objects and D is the dimension of a data object, that is, each data object has D features. A clustering problem tries to find a k-partition of X. The centers of the clusters are $Z = [z_{k,j}]_{C \times D}$. The fuzzy division matrix $U = [u_{i,k}]_{N \times C}$ describes the membership degrees of the objects in the clusters. The work reported in this paper is closely related to earlier research on k-means clustering algorithms, weighting k-means clustering algorithms, soft subspace clustering algorithms, multiview clustering algorithms, and big data clustering algorithms.

2.1. K-means clustering algorithms and weighting k-means clustering algorithms

Although it is sensitive to the initial cluster centers [16], k-means is widely used in real applications, such as image segmentation [17] and data mining. The goal of k-means is to find a partition that minimizes the within-cluster sum of squares. During clustering, the task of dividing samples is solved by

$$F(U, Z) = \sum_{k=1}^{C} \sum_{i=1}^{N} \sum_{j=1}^{D} u_{i,k} (x_{i,j} - z_{k,j})^2 \tag{1}$$

$$\text{s.t.} \quad \sum_{k=1}^{C} u_{i,k} = 1, \quad 1 \le i \le N, \quad u_{i,k} \in \{0, 1\} \tag{2}$$

In Eqs. (1) and (2), U is a partition matrix and $u_{i,k}$ is a binary variable. $Z = \{Z_1, Z_2, \ldots, Z_C\}$ is a set of vectors which represent the centroids of the C clusters, and $(x_{i,j} - z_{k,j})^2$ is a distance measure between the ith object and the center of the kth cluster on the jth variable. In the classic k-means clustering algorithm, all features have the same weight and are treated equally in clustering problems such as consumer segmentation. In many real-world applications, however, different features of the data set affect the clustering differently, so it is necessary to assign different weights to different features. The automated variable weighting in k-means type clustering [18] is a weighting k-means clustering algorithm whose objective function is

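$$F(U, Z, WF) = \sum_{k=1}^{C} \sum_{i=1}^{N} \sum_{j=1}^{D} u_{i,k}\, wf_j^{\beta}\, (x_{i,j} - z_{k,j})^2 \tag{3}$$

$$\text{s.t.} \quad u_{i,k} \in \{0, 1\}, \quad \sum_{k=1}^{C} u_{i,k} = 1 \tag{4}$$

$$\sum_{j=1}^{D} wf_j = 1, \quad 0 \le wf_j \le 1 \tag{5}$$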
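To make the difference between the unweighted objective (1) and the feature-weighted objective (3) concrete, the following minimal NumPy sketch evaluates both for a hard assignment. The function names, the `labels` array (cluster index per object, equivalent to a binary U) and the toy data are illustrative assumptions, not part of the original algorithm description.

```python
import numpy as np

def kmeans_objective(X, Z, labels):
    """Eq. (1): within-cluster sum of squares for a hard assignment."""
    diff = X - Z[labels]                  # residual of each object to its own center
    return float(np.sum(diff ** 2))

def weighted_kmeans_objective(X, Z, labels, wf, beta):
    """Eq. (3): the squared residual on feature j is scaled by wf[j] ** beta."""
    diff = X - Z[labels]
    return float(np.sum((wf ** beta) * diff ** 2))

# Toy data (hypothetical): four objects, two features, two clusters.
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
Z = np.array([[0.0, 1.0], [4.0, 1.0]])
labels = np.array([0, 0, 1, 1])
wf = np.array([0.5, 0.5])                 # feature weights, summing to 1 as in Eq. (5)

print(kmeans_objective(X, Z, labels))
print(weighted_kmeans_objective(X, Z, labels, wf, beta=1.0))
```

With equal weights the weighted objective is only a rescaling of Eq. (1); the two objectives rank partitions differently once the weights differ across features.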

In Eq. (3), U is an N × C partition matrix and WF is the vector of feature weights.

2.2. Soft subspace clustering algorithms and multiview clustering algorithms

Soft subspace clustering algorithms determine the subsets of dimensions according to the contributions of the dimensions in discovering the corresponding clusters. The contribution of a dimension is measured by a weight that is assigned to the dimension in the clustering process. A soft subspace clustering algorithm is proposed in [19], and its objective function is modeled as

$$F(U, Z, WCF) = \sum_{k=1}^{C} \sum_{i=1}^{N} \sum_{j=1}^{D} u_{i,k}\, wcf_{k,j}^{\beta}\, (x_{i,j} - z_{k,j})^2 \tag{6}$$

$$\text{s.t.} \quad \sum_{k=1}^{C} u_{i,k} = 1, \quad 1 \le i \le N, \quad u_{i,k} \in \{0, 1\} \tag{7}$$

$$\sum_{j=1}^{D} wcf_{k,j} = 1, \quad 0 \le wcf_{k,j} \le 1 \tag{8}$$
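The soft subspace objective (6) differs from feature weighting only in that every cluster carries its own weight vector. A minimal sketch, assuming a hard assignment given as a `labels` array and a C × D weight matrix `wcf` (both names are illustrative):

```python
import numpy as np

def soft_subspace_objective(X, Z, labels, wcf, beta):
    """Eq. (6): residuals on feature j of cluster k are scaled by wcf[k, j] ** beta."""
    diff = X - Z[labels]                       # (N, D) residuals to the assigned centers
    weights = wcf[labels] ** beta              # per-object row of its cluster's weights
    return float(np.sum(weights * diff ** 2))

# Toy data (hypothetical): each row of wcf sums to 1, as required by Eq. (8).
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
Z = np.array([[0.0, 1.0], [4.0, 1.0]])
labels = np.array([0, 0, 1, 1])
wcf = np.array([[0.5, 0.5], [0.25, 0.75]])

print(soft_subspace_objective(X, Z, labels, wcf, beta=1.0))
```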

In Eq. (6), WCF contains the weight of each attribute in each cluster. In [20], the locally adaptive clustering (LAC) algorithm is proposed, which assigns a weight to each feature of each cluster and minimizes its objective function with an iterative algorithm. TWKM [9] can simultaneously compute the weights of views and individual variables, but it easily leads to clusters with large weights on a single feature and a single view, so that the distribution of weights is not balanced. A multiview clustering algorithm within a co-clustering framework is proposed in [21], and the experiments show that the framework yields better clustering accuracy than a single view. Sparse spectral clustering based on multiple views is presented and tested on real-world data sets in [22]. A novel approach to inferring group-wise consistent brain sub-networks, based on the multiview spectral clustering of cortical networks, is proposed and obtains good performance [23]. In [24], a multiview algorithm based on constrained clustering that can operate with an incomplete mapping is also provided. A two-step algorithm for multiview clustering, which is combined with the multiview normalized cuts approach, is proposed and compared with existing algorithms [25].

The above algorithms perform well in terms of clustering quality and computational efficiency, and can resolve various real-world data processing problems. However, they mainly focus on the relationship between views, and the high dimensionality of the data set is not taken into account effectively in the clustering. Hence, this paper pays more attention to the clustering of high-dimensional multiview data and to the relations between dimensions and views in a big data environment.

3. Clustering of big data and apache spark

With the rapid development of cloud infrastructure, platforms, and services [26–30], big data analytics (BDA) has become a hot research topic and plays a vital role in various big data applications. Clustering is an essential data mining approach for BDA. In [31], a review introduces the trends and progress of clustering algorithms in coping with the challenges of big data. Some clustering algorithms are compared with each other from theoretical and empirical perspectives to demonstrate their applicability to big data [32]. A genetic algorithm is presented as an approach to optimize a mapping between graph clustering and data clustering for big data [33]. In [34], to eliminate the iteration dependence and obtain high performance, a novel processing model and k-means clustering algorithm are presented to process large-scale data. To cluster big data, a new clusiVAT algorithm based on the classical k-means model is presented and compared with other popular data clustering algorithms [35]. A novel approach for clustering electricity consumption behavior dynamics is provided for big data applications [36]. A systematic study of fuzzy consensus clustering is also proposed for clustering big data in two real-life applications [37]. Clustering of big data has received much attention recently, and k-means type algorithms play an important role in the BDA of real-world big data applications.

In distributed and parallel computing environments [38,39], Apache Spark is an important open-source cluster-computing framework, which provides an interface for programming entire clusters with implicit data parallelism and fault tolerance [40]. The resilient distributed dataset (RDD) of Spark is a working set for distributed programs, which offers a restricted form of distributed shared memory. Although MapReduce [41] is an effective framework for huge and parallel data processing, it is generally unsuitable for iterative algorithms because jobs are restarted repeatedly and large volumes of data must be read and shuffled each time. Therefore, in our experiments, Apache Spark is selected as the computational platform for IWKM in big data applications.

4. The IWKM algorithm

4.1. The model for clustering of high-dimensional multiview data

The clustering that partitions X into C clusters with weights of views and features is modeled as the minimization of the following objective function.

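$$\mathrm{Fitness}(U, Z, WV, WF) = \frac{\sum_{k=1}^{C} \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{j \in View_t} u_{i,k}\, wv_t\, wf_j\, (x_{i,j} - z_{k,j})^2}{\sum_{k=1}^{C} \sum_{t=1}^{T} \sum_{j \in View_t} wv_t\, wf_j\, (z_{k,j} - o_j)^2} \tag{9}$$

$$\text{s.t.} \quad \begin{cases} \sum_{k=1}^{C} u_{i,k} = 1, \quad 1 \le i \le N, \quad u_{i,k} \in [0, 1] \\ \sum_{t=1}^{T} wv_t = 1, \quad 0 \le wv_t \le 1 \\ \sum_{j \in View_t} wf_j = 1, \quad 0 \le wf_j \le 1, \quad 1 \le t \le T \\ o_j = \sum_{k=1}^{C} z_{k,j} / C \end{cases} \tag{10}$$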
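As a rough illustration of the objective in Eq. (9), the sketch below assumes that it is the weighted within-cluster scatter divided by the weighted between-cluster scatter (the coupling term), and that each object is hard-assigned to one cluster. The helper name `iwkm_fitness` and the `view_of` array, which maps each feature to its view index, are hypothetical conveniences rather than names used by the authors.

```python
import numpy as np

def iwkm_fitness(X, Z, labels, wv, wf, view_of):
    """Weighted within-cluster scatter over weighted between-cluster scatter.

    view_of[j] is the index t of the view that feature j belongs to (assumed layout).
    """
    w = wv[view_of] * wf                       # combined weight wv_t * wf_j per feature
    within = np.sum(w * (X - Z[labels]) ** 2)  # numerator of the assumed ratio
    o = Z.mean(axis=0)                         # average cluster center o_j, Eq. (10)
    between = np.sum(w * (Z - o) ** 2)         # coupling term in the denominator
    return float(within / between)

# Toy data (hypothetical): two views with one feature each.
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
Z = np.array([[0.0, 1.0], [4.0, 1.0]])
labels = np.array([0, 0, 1, 1])
view_of = np.array([0, 1])
wv = np.array([0.5, 0.5])                      # view weights sum to 1
wf = np.array([1.0, 1.0])                      # feature weights sum to 1 within each view

fitness = iwkm_fitness(X, Z, labels, wv, wf, view_of)
print(fitness)
```

Under this reading, a smaller value means compact clusters whose centers lie far from the average center, which matches the stated goal of enlarging the dissimilarity between clusters.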

In Eqs. (9) and (10), $U = [u_{i,k}]_{N \times C}$ is an N × C partition matrix whose elements $u_{i,k}$ are binary, where $u_{i,k} = 1$ indicates that object i is allocated to cluster k. $Z = [z_{k,j}]_{C \times D}$ is a C × D matrix whose element $z_{k,j}$ represents the jth feature of the center of cluster k. $WV = [wv_t]_T$ are the weights of the T views, and $WF = [wf_j]_{j \in View_t}$ are the weights of the features under $View_t$. $wv_t\, wf_j\, (x_{i,j} - z_{k,j})^2$ is a weighted distance measure on the jth feature between the ith object and the center of the kth cluster. $wv_t\, wf_j\, (z_{k,j} - o_j)^2$ is a weighted distance measure on the jth feature between the center of the kth cluster and the average cluster center, where $o_j$ is the average cluster center of the C clusters. This value describes the degree of coupling between clusters. A larger value indicates a greater dissimilarity.

4.2. CPSO and the encoding of particles

In IWKM, CPSO is used to help the algorithm obtain better initial cluster centers, weights of views, and weights of features. Each particle i represents a candidate solution in the solution space of D dimensions and has two vectors: a position vector $X_i = [x_i^1, x_i^2, \ldots, x_i^D]$ and a velocity vector $V_i = [v_i^1, v_i^2, \ldots, v_i^D]$. During the evolutionary process, the velocity and the position of particle i on dimension d at iteration t + 1 are updated by the following equations:

$$v_i^d(t+1) = \omega v_i^d(t) + c_1 r_1 \left(pBest_i^d(t) - x_i^d(t)\right) + c_2 r_2 \left(gBest^d(t) - x_i^d(t)\right) \tag{11}$$

$$x_i^d(t+1) = x_i^d(t) + v_i^d(t+1) \tag{12}$$

where d = 1, 2, . . . , D indexes the dimensions of the search space, ω is the inertia weight, $c_1$ and $c_2$ are the cognitive learning coefficient and the social learning coefficient, respectively, $r_1$ and $r_2$ are two uniform random numbers in the range [0, 1], $pBest_i^d(t)$ is the position on dimension d with the best fitness found by particle i up to the tth iteration, and $gBest^d(t)$ is the best position on dimension d found by the whole particle swarm. The inertia weight ω is usually updated as

$$\omega = \omega_{max} - (\omega_{max} - \omega_{min}) \times g / g_{max} \tag{13}$$
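The update rules (11) to (13) can be sketched in a few lines of NumPy. This is a plain PSO step, not the full CPSO; it uses the parameter values reported in this section (inertia from 0.9 to 0.4, learning coefficients of 1.8) as defaults and applies the velocity limit by clipping. Drawing a separate random number per dimension is an assumption of this sketch, and `pso_step` and `inertia_weight` are illustrative names.

```python
import numpy as np

def inertia_weight(g, g_max, w_max=0.9, w_min=0.4):
    """Eq. (13): inertia decreases linearly with the generation number g."""
    return w_max - (w_max - w_min) * g / g_max

def pso_step(x, v, p_best, g_best, g, g_max, v_max, rng, c1=1.8, c2=1.8):
    """One velocity and position update for a single particle, Eqs. (11) and (12)."""
    w = inertia_weight(g, g_max)
    r1 = rng.random(x.shape)                   # assumption: one random draw per dimension
    r2 = rng.random(x.shape)
    v_new = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    v_new = np.clip(v_new, -v_max, v_max)      # keep velocity inside [-v_max, v_max]
    return x + v_new, v_new

rng = np.random.default_rng(0)
x = np.zeros(3)
v = np.zeros(3)
x_new, v_new = pso_step(x, v, p_best=np.full(3, 10.0), g_best=np.full(3, 10.0),
                        g=1, g_max=150, v_max=2.0, rng=rng)
print(x_new, v_new)
```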


In Eq. (13), $\omega_{max}$ and $\omega_{min}$ are the initial and final inertia weights, set to 0.9 and 0.4, respectively. g is the current evolutionary generation number, and $g_{max}$ is the maximum number of generations, set to 150. $c_1$ and $c_2$ are both set to 1.8. The velocity of each particle on dimension d is restricted to the range $[-vmax_d, vmax_d]$, $vmax_d \in R^+$. Thus, if the velocity $v_i^d(t)$ exceeds $vmax_d$, it is reassigned to $vmax_d$; if it is lower than $-vmax_d$, it is reassigned to $-vmax_d$. If $vmax_d$ is too large, particles may miss good solutions. On the other hand, if $vmax_d$ is too small, particles may become trapped in local optima. The maximum velocity $vmax_d$ is usually set to 20% of the search range.

The encoding of particles is the premise of using a particle swarm to search for an optimal solution. In IWKM, the initial cluster centers, weights of views, and weights of features are encoded into the particle representation. Each particle is encoded as a real-valued vector of F × C + T + F dimensions, where F is the number of features of the objects in the clustering problem. The ith particle in the swarm is encoded as

$$X_i = \left[x_i^{1,1}, x_i^{1,2}, \ldots, x_i^{1,F}, \ldots, x_i^{C,1}, x_i^{C,2}, \ldots, x_i^{C,F}, wv_i^1, \ldots, wv_i^T, wf_i^1, \ldots, wf_i^F\right] \tag{14}$$
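The layout of the particle vector, C × F center coordinates followed by T view weights and F feature weights, can be illustrated with a pair of helper functions. The names `encode_particle` and `decode_particle` are hypothetical, and the sketch only handles packing and unpacking; it does not enforce the weight constraints of the clustering model.

```python
import numpy as np

def encode_particle(Z, wv, wf):
    """Flatten centers (C x F), view weights (T) and feature weights (F) into one vector."""
    return np.concatenate([Z.ravel(), wv, wf])

def decode_particle(p, C, F, T):
    """Split a particle of length F*C + T + F back into its three parts."""
    Z = p[:C * F].reshape(C, F)
    wv = p[C * F:C * F + T]
    wf = p[C * F + T:]
    return Z, wv, wf

# Toy sizes (hypothetical): 2 clusters, 3 features, 2 views.
Z = np.arange(6, dtype=float).reshape(2, 3)
wv = np.array([0.4, 0.6])
wf = np.array([1.0, 0.3, 0.7])
particle = encode_particle(Z, wv, wf)
Z2, wv2, wf2 = decode_particle(particle, C=2, F=3, T=2)
print(particle.size)
```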

4.3. A precise perturbation and CPSO

To avoid local optima and premature convergence, various auxiliary techniques and perturbation methods (jump or mutation) are of great benefit in enriching the search behavior of particles in swarm intelligence. In CPSO, a chaotic logistic sequence perturbation, which has the properties of certainty, ergodicity and randomness, is used to help particles escape from local optima and achieve a better search quality. It is defined as Eq. (15), where r is the control parameter, x is a variable, r = 4, and t = 0, 1, 2, . . ..

$$x(t+1) = r \cdot x(t) \cdot (1 - x(t)), \quad r \in \mathbb{N}, \quad x(0) \in [0, 1] \tag{15}$$

The precise perturbation of CPSO can be summarized as the interplay of the following procedures.

Step 1. The suitable particles of perturbation: To limit the damage to population stability during particle searching and the computing load on the CPU (Apache Spark and single node), we randomly select N_S/K_spark particles from the total N_S particles as the perturbation objects by simple random sampling. K_spark is the number of worker nodes in Apache Spark.

Step 2. The precise time of perturbation: The swarm is perturbed at the time of premature convergence. The mean distance between pBest and gBest is used to judge whether the particles are in premature convergence or not, recorded as Eq. (16), where N_S and dim are the number of particles in the swarm and the number of dimensions of a particle, and _d is the threshold value of premature convergence. If d(pBest, gBest) ≤ _d, premature convergence and local optima have appeared, and a suitable perturbation should be applied to the N_S/K_spark particles.

$$d(pBest, gBest) = \frac{1}{N\_S} \sum_{i=1}^{N\_S} \sum_{j=1}^{dim} \left(pBest_{ij} - gBest_j\right)^2 \le \_d \tag{16}$$

Step 3. The precise dimensions of perturbation: Since a particle has more than one dimension, the dimensions with high inertia are selected preferentially for perturbation. The inertia of pBest and gBest in the jth dimension is given by a mean square deviation, recorded as Eqs. (17) and (18), respectively, where N_S and m are the number of particles in the swarm and the current iteration. If sd(pBest_j) ≤ _pBest or sd(gBest_j) ≤ _gBest, then pBest or gBest in the jth dimension is inert and needs to be perturbed. Here _pBest and _gBest are the inertia threshold values of pBest and gBest, respectively.

$$sd(pBest_j) = \frac{1}{N\_S} \sum_{i=1}^{N\_S} \left(pBest_{ij} - \frac{1}{N\_S}\left(pBest_{1j} + pBest_{2j} + \cdots + pBest_{N\_S\, j}\right)\right)^2 \le \_pBest \tag{17}$$

$$sd(gBest_j) = \frac{1}{m-1} \sum_{t=1}^{m-1} \left(gBest_j(m) - gBest_j(t)\right)^2 \le \_gBest \tag{18}$$

4.4. Flowchart of IWKM

The flowchart of the IWKM algorithm is shown in Fig. 1.

4.5. Pseudocode for IWKM on apache spark

To verify the performance of the proposed algorithm for big data applications, IWKM is deployed on Apache Spark using RDD. The pseudocode for IWKM on Apache Spark is shown in Fig. 2.

5. Experimental evaluation

5.1. Testing environments and spark

In our experiments, IWKM was tested and compared in two computing environments: Apache Spark and a single node. The single node is equipped with an Intel Core i5-4210M 2.6 GHz, 3.8 G RAM and the Ubuntu 14.04 LTS operating system. The Apache Spark cluster consists of one master node (Intel Core [email protected] GHz, 64G DDRIII, and 1T efficient cloud disk), ten worker nodes (Intel Xeon [email protected] GHz, 16G DDRIII, and 500G efficient cloud disk) and Apache Spark 1.6.0. Based on the Apache Spark machine learning library and RDD, IWKM can tackle big data applications and achieve good performance, as shown in Fig. 3.

5.2. Testing data sets and evaluation metric

To evaluate the performance of the proposed algorithm, we compared IWKM with LAC [20], affinity propagation (AP) [42], normalized cuts (Ncut) [43], density clustering (DensityC) [44], and TWKM [9] using the evaluation metrics RI, JC and Folk. PSO and CPSO use the same population size of 30 and the same number of 150 fitness evaluations (FEs) for a fair comparison. IWKM and the other five algorithms were tested on 5 high-dimensional multiview data sets (real-life applications) [45]: the Multiple Features (Mfeat) data set, the Internet Advertisement data set, the Spambase data set, the Segmentation data set and the Cardiotocography data set. The basic information of these data sets and applications is shown in Table 1.

Fig. 1. Flowchart of IWKM algorithm.

Table 1
Characteristics of the high-dimensional multiview data sets (real-life applications).

| ID | Name of data set | Number of objects | Number of classes | Number of features | Number of views | Size of individual classes | Computational platform |
|----|------------------|-------------------|-------------------|--------------------|-----------------|----------------------------|------------------------|
| 1 | Mfeat | 2000 | 10 | 649 | 6 | (200,200,. . . ,200) | Single node |
| 2 | Internet Advertisement | 2359 | 2 | 1557 | 6 | (381,1978) | Single node |
| 3 | Spambase | 4601 | 2 | 57 | 3 | (2788,1813) | Single node |
| 4 | Image Segmentation | 2310 | 7 | 19 | 2 | (330,330,. . . ,330) | apache spark (10 worker nodes) |
| 5 | Cardiotocography | 2126 | 3 | 21 | 3 | (1655,295,176) | apache spark (10 worker nodes) |

5.2.1. Mfeat data set

The Mfeat data set is a handwritten numeral data set extracted from a collection of Dutch utility maps, which contains 2000 numeral objects belonging to 10 classes (0–9). Each class has 200 objects. Each object is represented by 649 features that are divided into the following six views: 1) the Mfeat-fou view contains 76 Fourier coefficients of the character shapes; 2) the Mfeat-fac view contains 216 profile correlations; 3) the Mfeat-kar view contains 64 Karhunen-Loève coefficients; 4) the Mfeat-pix view contains 240 pixel windows; 5) the Mfeat-zer view contains 47 Zernike moments; 6) the Mfeat-mor view contains 6 morphological features.

5.2.2. Internet advertisement data set

The Internet Advertisement data set contains a set of 3279 images from various web pages that are categorized either as advertisements or as non-advertisements. Images with missing values were deleted, so our experiments were carried out on the remaining 2359 instances (381 and 1978 per class, as listed in Table 1). The instances are described in six views. View 1 contains 3 geometric features of the images (width, height, aspect ratio); View 2 contains 457 phrases in the URL of the pages containing the images (base url); View 3 contains 495 phrases of the image URLs (image url); View 4 contains 472 phrases in the URL of the pages the images point to (target url); View 5 contains 111 features of the anchor text; View 6 contains 19 features of the text in the images' alt (alternative) HTML tags (alt text).


5.2.3. Spambase data set

The Spambase data set contains 4601 objects belonging to 2 classes (spam, non-spam). The spam e-mails came from the postmaster and from individuals who had filed spam, and the non-spam e-mails came from filed work and personal e-mails. Each object is represented by 57 features that are divided into three views: the word_freq view, the char_freq view and the capital_run_length view. 1) The word_freq view contains 48 continuous real attributes of type word_freq_WORD; 2) the char_freq view contains 6 continuous real attributes of type char_freq_CHAR; 3) the capital_run_length view contains 3 continuous real attributes measuring the length of sequences of consecutive capital letters.

5.2.4. Image segmentation data set

In this data set, 2310 instances were drawn randomly from a database of 7 outdoor images. The images were hand segmented to create a classification for every pixel. Each instance is a 3 × 3 region. The data set includes 19 features, which can be divided into 2 views: the shape view includes 9 features about shape information, and the RGB view includes 10 features about color information.

5.2.5. Cardiotocography data set

In this data set, 2126 fetal cardiotocograms (CTGs) were automatically processed and the respective diagnostic features were measured. The CTGs were also classified by three expert obstetricians, and a consensus classification label was assigned to each of them. Classification was with respect to both a morphologic pattern (A, B, C, . . .) and a fetal state (N, S, P). Therefore, the data set can be used for either 10-class or 3-class experiments. In this experiment, it is used as a 3-class data set. The 21 features can be divided into 3 views: the indicators-per-second view, the variability view and the histogram view.

Fig. 2. Pseudocode for IWKM on apache spark.

5.2.6. Evaluation metric

As the real partitions of the five data sets selected for our experiments are already known, the performance of the clustering algorithms can be evaluated by comparing the resulting clusters with the real structures in terms of external criteria. Commonly used criteria include RI, JC and Folk. Let $C = \{C_1, C_2, \ldots, C_M\}$ be the set of M clusters in the data set and $C' = \{C'_1, C'_2, \ldots, C'_N\}$ the set of N clusters generated by the clustering algorithm. Given a pair of points $(X_i, X_j)$ in the data set, we count it as follows:

(1) SS is the number of pairs of data points where $X_i, X_j \in C_m$ and $X_i, X_j \in C'_n$, $i \ne j$.
(2) DD is the number of pairs of data points where $X_i \in C_{m1}$, $X_j \in C_{m2}$, $X_i \in C'_{n1}$, $X_j \in C'_{n2}$, $i \ne j$, $m1 \ne m2$, $n1 \ne n2$.
(3) SD is the number of pairs of data points where $X_i, X_j \in C_m$, $X_i \in C'_{n1}$, $X_j \in C'_{n2}$, $i \ne j$, $n1 \ne n2$.
(4) DS is the number of pairs of data points where $X_i \in C_{m1}$, $X_j \in C_{m2}$, $X_i, X_j \in C'_n$, $i \ne j$, $m1 \ne m2$.

The three external criteria used in our experiments are defined as follows:

(1) Rand index (RI): $RI = (SS + DD) / (SS + SD + DS + DD)$
(2) Jaccard coefficient (JC): $JC = SS / (SS + SD + DS)$
(3) Folkes Russel (Folk): $Folk = \sqrt{\dfrac{SS}{SS + SD} \times \dfrac{SS}{SS + DS}}$

Fig. 3. IWKM on apache spark.

5.3. Parameter analysis

In order to analyze the influence of the 3 parameters (_d, _gBest and _pBest) on the clustering performance for high-dimensional multiview data, IWKM was tested on 3 data sets (the Mfeat data set, the Internet Advertisement data set, and the Spambase data set) on single node. To reduce statistical errors, each data set was independently simulated 10 times. According to the threshold value of premature convergence, we varied _d over [5, 45] with a step size of 5 on the Mfeat and Spambase data sets, and over [2, 20] with a step size of 3 on the Internet Advertisement data set. The statistical results on their mean evaluation metrics are shown in Fig. 4. It can be seen from Fig. 4 that, when

Please cite this article as: Q. Tao, C. Gu and Z. Wang et al., An intelligent clustering algorithm for high-dimensional multiview data in big data applications, Neurocomputing, https://doi.org/10.1016/j.neucom.2018.12.093

JID: NEUCOM

ARTICLE IN PRESS

[m5G;July 30, 2019;16:51]

Q. Tao, C. Gu and Z. Wang et al. / Neurocomputing xxx (xxxx) xxx

7

Fig. 4. Analysis on the parameter _d in 3 high-dimensional multiview data sets on single node.

Fig. 5. Analysis on the parameter _gBest in 3 high-dimensional multiview data sets on single node.

Fig. 6. Analysis on the parameter _ pBest in 3 high-dimensional multiview data sets on single node.

_d is selected as 25, 8 and 30 respectively, IWKM has best per-

5.4. Comparison of PSO and CPSO

formance of clustering in 3 data sets on single node. The parameters _gBest and _ pBest are the threshold value of dimension inertia which is used to measure whether the perceptible changes of position in each dimension have taken place or not. The parameters _gBest and _ pBest on the three data sets are also performed similar analysis as the _d. The statistical results on their mean evaluation metrics are shown in Fig. 5 and Fig. 6, respectively. When the parameter _gBest is set as 5, 5.0E−5 and 5.0E−4, and the parameter _ pBest is set as 3.0E−6, 3.0E−5 and 0.03, respectively, the clustering performance of IWKM is best in 3 data sets on single node. As the values of JC and RI are almost equal In Spambase, the curves of JC and RI are overlapped. Hence, according to the result of parameter analysis, the best parameter value of _d, _gBest and _ pBest will be selected and tested in the next experiments.
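For orientation, the PSO baseline used in the comparison with CPSO can be illustrated with the textbook global-best update rule. The sketch below is an assumption-laden illustration rather than the authors' code: the function name `pso_minimize`, the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the swarm size, and the sphere test function are all hypothetical choices, and the precise perturbation that distinguishes CPSO is not reproduced. In IWKM a particle would encode initial cluster centers, weights of views and weights of features; here it is just a point in a box.

```python
import random


def pso_minimize(objective, dim, bounds, n_particles=20, n_iter=200,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Textbook global-best PSO; returns (best position, best value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                     # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]    # global best so far

    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + pull toward pBest + pull toward gBest.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Move and clamp the coordinate to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple convex test function with its minimum 0 at the origin."""
    return sum(v * v for v in x)


best_x, best_val = pso_minimize(sphere, dim=3, bounds=(-5.0, 5.0))
print(best_x, best_val)
```

Plain PSO of this kind can stagnate once all particles collapse around gBest, which is the situation the thresholds _d, _gBest and _pBest are described as detecting.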

To validate the performance of CPSO for optimizing the cluster centers, weights of views and weights of features in IWKM, we tested CPSO and PSO on three high-dimensional multiview data sets on a single node. Each data set was run 10 times with CPSO and PSO, and the average results of the two algorithms are recorded and compared in Fig. 7. In CPSO, a precise perturbation, which specifies the suitable particles to perturb and the precise time and dimension of the perturbation, is proposed to improve the optimization performance. Fig. 7 shows that CPSO achieves better solution accuracy and reaches the optimal solution earlier on all three high-dimensional multiview data sets on a single node. CPSO therefore performs better than PSO for the clustering of IWKM, and we can conclude that, as an important optimization approach, CPSO helps IWKM obtain better initial cluster centers, weights of views and weights of features for high-dimensional multiview data.

Fig. 7. Comparison of PSO and CPSO in 3 high-dimensional multiview data sets on single node.

5.5. Comparison of IWKM and TWKM for the weight of views
To further evaluate the performance in obtaining the weights of views, TWKM and IWKM were tested on five different high-dimensional multiview data sets on Apache Spark and a single node, respectively. Each data set was run 10 times by the two algorithms, and the average results of IWKM and TWKM are recorded and compared in Table 4. Both IWKM and TWKM obtain effective weights for the five high-dimensional multiview data sets on Apache Spark and a single node. In particular, on Internet Advertisement (single node) and Image Segmentation (Apache Spark), TWKM and IWKM perform similarly in obtaining the weights of views. However, on the other three data sets (Mfeat, Spambase, and Cardiotocography), IWKM obtains better and more reasonable weights of views than TWKM on Apache Spark and a single node. Moreover, Fig. 8 gives pie charts of the weights of views of TWKM and IWKM on three data sets on a single node. It can be seen from Fig. 8 that the weights of views calculated by TWKM often concentrate on one view, which does not conform to real applications. The weights calculated by IWKM are more reasonable than those obtained by TWKM, and the same holds for the weights of features. Hence, we can conclude that IWKM performs better than TWKM in obtaining the weights of views.

Fig. 8. Comparison of PSO and CPSO in 3 high-dimensional multiview data sets on single node.

Table 4 The weights of views calculated by TWKM and IWKM.

| Name of data set | Weights calculated by TWKM | Weights calculated by IWKM |
|---|---|---|
| Mfeat | 1.66665E−6, 1.66665E−6, 1.66665E−6, 1.66665E−6, 1.66665E−6, 0.99999 | 0.23424, 0.23358, 0.25141, 0.01263, 0.09903, 0.16911 |
| Internet Advertisement | 1.66666E−6, 0.20205, 0.21539, 0.19255, 0.16216, 0.22784 | 0.11030, 0.16166, 0.12580, 0.29347, 0.30720, 0.00157 |
| Spambase | 0.99999, 3.33331E−6, 3.33331E−6 | 0.58757, 0.06495, 0.34748 |
| Image Segmentation @Apache Spark | 0.4684598, 0.5315402 | 0.44744640, 0.55255359 |
| Cardiotocography @Apache Spark | 9.999933e−01, 3.333311e−06, 3.333311e−02 | 0.1592640, 0.4687741, 0.3719617 |

5.6. Comprehensive comparison of IWKM and the other five algorithms
According to the previous testing and verification (for example, the parameter analysis in Section 5.3), we set the parameter values of the six clustering algorithms for the following experiments as shown in Table 2. To further validate the comprehensive performance of the proposed algorithm for clustering high-dimensional multiview data in big data applications, IWKM was compared with the other five algorithms using the evaluation metrics RI, JC and Folk on five high-dimensional multiview data sets on two computational platforms, Apache Spark and a single node.

Table 2 Parameter values of six clustering algorithms in the experiments.

| Algorithm | Mfeat | Internet Advertisement | Spambase | Image Segmentation | Cardiotocography |
|---|---|---|---|---|---|
| LAC (h) | 2 | 2 | 14 | 2 | 5 |
| AP (λ, p) | (0.9, 2.7) | (0.9, 60.0) | (0.9, 12.0) | (0.9, −4.7) | (0.9, −24) |
| Ncut (e) | 1.0E−8 | 1.0E−8 | 1.0E−8 | 1.0E−8 | 1.0E−8 |
| DensityC (p) | 1.6 | 1.4 | 1.9 | 1.7 | 1.5 |
| TWKM (λ, η) | (30, 7) | (80, 25) | (53, 18) | (70, 40) | (40, 18) |
| IWKM (_d, _gBest, _pBest) | (25.0, 0.5, 3.0E−6) | (8.0, 5.0E−5, 3.0E−5) | (30.0, 5.0E−4, 0.03) | (20.0, 5.0, 3.0E−5) | (20, 5.0, 3.0) |

In our experiments, the product of the number of views and the number of features, denoted $P_{f \times v}$, is used to describe the complexity of high-dimensional multiview data. The larger the product, the more complicated the data. According to the values of $P_{f \times v}$ for the data sets in Table 1, Mfeat (features: 649, views: 6, $P_{f \times v} = 649 \times 6 = 3894$) and Internet Advertisement (features: 1557, views: 6, $P_{f \times v} = 1557 \times 6 = 9342$) are more complicated than Spambase (features: 57, views: 3, $P_{f \times v} = 57 \times 3 = 171$), Image Segmentation (features: 19, views: 2, $P_{f \times v} = 19 \times 2 = 38$), and Cardiotocography (features: 21, views: 3, $P_{f \times v} = 21 \times 3 = 63$).

Table 3 summarizes the comprehensive comparison of IWKM and the other five algorithms on Apache Spark and a single node. The averages over 10 runs and the standard deviations are reported to reduce statistical errors.

Table 3 Comparison of IWKM and the other five algorithms on five high-dimensional multiview data sets on Apache Spark and a single node (mean ± standard deviation over 10 runs).

| Data set | Metric | LAC | AP | Ncut | DensityC | TWKM | IWKM |
|---|---|---|---|---|---|---|---|
| Mfeat | RI | 0.9344 ± 0.0000 | 0.8931 ± 0.0000 | 0.9317 ± 0.0065 | 0.9578 ± 0.0000 | 0.9456 ± 0.0000 | 0.9586 ± 0.0118 |
| Mfeat | JC | 0.5365 ± 0.0000 | 0.3510 ± 0.0000 | 0.4959 ± 0.0342 | 0.6720 ± 0.0000 | 0.5937 ± 0.0000 | 0.6820 ± 0.0719 |
| Mfeat | Folk | 0.6988 ± 0.0000 | 0.5226 ± 0.0000 | 0.6625 ± 0.0301 | 0.8060 ± 0.0000 | 0.7467 ± 0.0000 | 0.8116 ± 0.0466 |
| Internet Advertisement | RI | 0.7154 ± 0.0000 | 0.8124 ± 0.0000 | 0.6803 ± 0.0016 | 0.6996 ± 0.0000 | 0.8131 ± 0.0000 | 0.8179 ± 0.0132 |
| Internet Advertisement | JC | 0.7055 ± 0.0000 | 0.7785 ± 0.0000 | 0.6151 ± 0.0026 | 0.6974 ± 0.0000 | 0.7792 ± 0.0000 | 0.7858 ± 0.0088 |
| Internet Advertisement | Folk | 0.8322 ± 0.0000 | 0.8759 ± 0.0000 | 0.7646 ± 0.0017 | 0.8293 ± 0.0000 | 0.8764 ± 0.0000 | 0.8809 ± 0.0043 |
| Spambase | RI | 0.7112 ± 0.0000 | 0.5527 ± 0.0000 | 0.5616 ± 0.0000 | 0.5209 ± 0.0000 | 0.5208 ± 0.0000 | 0.5225 ± 0.0003 |
| Spambase | JC | 0.5893 ± 0.0000 | 0.4797 ± 0.0000 | 0.4611 ± 0.0000 | 0.5196 ± 0.0000 | 0.5194 ± 0.0000 | 0.5222 ± 0.0002 |
| Spambase | Folk | 0.7397 ± 0.0000 | 0.6590 ± 0.0000 | 0.6358 ± 0.0000 | 0.7194 ± 0.0000 | 0.7192 ± 0.0000 | 0.7225 ± 0.0003 |
| Image Segmentation @Apache Spark | JC | 0.3252 ± 0.0000 | 0.2254 ± 0.0000 | 0.3038 ± 0.0000 | 0.2573 ± 0.0000 | 0.2996 ± 0.0000 | 0.2297 ± 0.0065 |
| Image Segmentation @Apache Spark | RI | 0.5319 ± 0.0000 | 0.8110 ± 0.0000 | 0.8974 ± 0.0000 | 0.5388 ± 0.0000 | 0.8252 ± 0.0000 | 0.8047 ± 0.0103 |
| Image Segmentation @Apache Spark | Folk | 0.5055 ± 0.0000 | 0.3682 ± 0.0000 | 0.4706 ± 0.0000 | 0.4115 ± 0.0000 | 0.4645 ± 0.0000 | 0.3750 ± 0.0036 |
| Cardiotocography @Apache Spark | JC | 0.3854 ± 0.0000 | 0.3067 ± 0.0000 | 0.1886 ± 0.0000 | 0.3535 ± 0.0000 | 0.3897 ± 0.0000 | 0.3984 ± 0.0000 |
| Cardiotocography @Apache Spark | RI | 0.5408 ± 0.0000 | 0.5034 ± 0.0000 | 0.4346 ± 0.0000 | 0.4617 ± 0.0000 | 0.5086 ± 0.0000 | 0.5576 ± 0.0210 |
| Cardiotocography @Apache Spark | Folk | 0.5705 ± 0.0000 | 0.4885 ± 0.0000 | 0.3721 ± 0.0000 | 0.5262 ± 0.0000 | 0.5656 ± 0.0000 | 0.5854 ± 0.0054 |

From these results, we can see that IWKM significantly outperforms the other five algorithms on Mfeat ($P_{f \times v} = 3894$) and Internet Advertisement ($P_{f \times v} = 9342$). IWKM performs better than TWKM and DensityC on Spambase ($P_{f \times v} = 171$), whereas AP produces the worst results on Mfeat. Both DensityC and IWKM obtain better results than LAC, AP, Ncut and TWKM on Mfeat ($P_{f \times v} = 3894$). AP, TWKM and IWKM outperform LAC, Ncut and DensityC on Internet Advertisement ($P_{f \times v} = 9342$). LAC significantly outperforms the other five algorithms, including IWKM, on Spambase, but the complexity of Spambase ($P_{f \times v} = 171$) is lower than that of Mfeat ($P_{f \times v} = 3894$) and Internet Advertisement ($P_{f \times v} = 9342$). Therefore, we can conclude that IWKM outperforms the other five algorithms on the more complicated data sets with more views and higher dimensions (Mfeat and Internet Advertisement) on a single node.

On Apache Spark, IWKM outperforms the other five algorithms on Cardiotocography ($P_{f \times v} = 63$). However, Ncut and TWKM perform better than IWKM on Image Segmentation ($P_{f \times v} = 38$). As Cardiotocography ($P_{f \times v} = 63$) is more complicated than Image Segmentation ($P_{f \times v} = 38$), on both Apache Spark and a single node, the more complicated the high-dimensional multiview data set, the better IWKM performs. In short, IWKM can effectively cluster high-dimensional multiview data sets in big data applications, and it is superior to the other five algorithms on the more complicated data sets with more views and higher dimensions on both Apache Spark and a single node.

6. Conclusion
In this paper, we have proposed a novel intelligent weighting clustering algorithm for high-dimensional multiview data in big data applications. In IWKM, various weights of views and features are used in the weighting distance function to determine the clusters of objects. The initial cluster centers, weights of views, and weights of features are then calculated by CPSO, which is improved by a precise perturbation of the gBest and pBest.
The degree of coupling between clusters is also introduced into the clustering model to enlarge the dissimilarity between clusters. We tested IWKM and the other five algorithms on five high-dimensional multiview data sets on two computational platforms, Apache Spark and a single node. The experimental results demonstrate the effectiveness of the proposed algorithm for various big data applications. We expect that our approach will have an impact on the clustering of high-dimensional multiview data in various big data applications. In the future, we are interested in applying transfer learning and generative adversarial networks (GANs) to the clustering of high-dimensional multiview data, so as to tackle clustering in heterogeneous data sets.

Declaration of Competing Interest
None.

Qian Tao received the Ph.D. degree in computer science from Sun Yat-Sen University, Canton, China, in 2011. From 2012 to 2015, he was a Postdoctoral Fellow of Computer Science in the Chinese Academy of Sciences. Currently he is a visiting scholar in the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, RI, USA. He is the author of one book and over 30 technical papers. His current research interests include artificial intelligence, evolutionary algorithms, and machine learning. Dr. Tao's awards and honors include the Tencent Excellence Scholarship and the Guangdong "Nanyue" Outstanding Doctoral Candidate award of Guangdong Province.

Chunqin Gu received the Ph.D. degree in computer science from Sun Yat-Sen University, Canton, China, in 2009. Since 2015, she has been an associate professor. Her research interests include data mining, artificial intelligence and big data.


Zhenyu Wang received the B.S. degree from the Department of Computer Science, Xiamen University, Xiamen, China, in 1987 and the Ph.D. degree from the Department of Computer Science, Harbin Institute of Technology, Harbin, China, in 1993. Since 2008, he has been a professor and Dean of the School of Software at South China University of Technology. His research interests include artificial intelligence, sentiment analysis, and big data.


Daoning Jiang was born in Huaibei, Anhui, China, in 1994. He received the Bachelor's degree in electromechanical engineering from Northeast Forestry University, Harbin, China, in 2016. He is currently working toward the M.S. degree in software engineering at the School of Software Engineering, South China University of Technology, Canton, China. From 2016 to 2017, he was a Research Assistant in the IBM lab of the School of Software Engineering. His research interests include artificial intelligence and sentiment analysis.
