Generate pairwise constraints from unlabeled data for semi-supervised clustering


Data & Knowledge Engineering 123 (2019) 101715


Md Abdul Masud a,b,∗, Joshua Zhexue Huang b, Ming Zhong b, Xianghua Fu b

a Department of Computer Science and Information Technology, Patuakhali Science and Technology University, Dumki, Patuakhali 8602, Bangladesh
b College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China

ARTICLE INFO

Keywords: Constrained clustering; I-nice approach; Pairwise constraints selection; Semi-supervised clustering

ABSTRACT

Pairwise constraint selection methods often rely on the label information of data to generate pairwise constraints. This paper proposes a new method of selecting pairwise constraints from unlabeled data for semi-supervised clustering to improve clustering accuracy. Given a dataset without any label information, it is first clustered with the I-nice method into a set of initial clusters. From each initial cluster, a dense group of objects is obtained by removing the faraway objects. Then, the most informative object and the informative objects in each dense group are identified with a local density estimation method. The identified objects are used to form a set of pairwise constraints, which are incorporated into the semi-supervised clustering algorithm to guide the clustering process toward a better solution. The advantage of this method is that no label information is required for selecting pairwise constraints. Experimental results demonstrate that the new method improved the clustering accuracy and outperformed four state-of-the-art pairwise constraint selection methods, namely, random, FFQS, min–max, and NPU, on both synthetic and real-world datasets.

1. Introduction

Clustering is a process that divides data objects into a set of clusters such that objects in a cluster are similar, whereas objects in different clusters are dissimilar. Clustering results are affected by outliers and noise in the data. Semi-supervised clustering partially solves this problem when limited label information on the input data is available: the quality of the results of unsupervised clustering can be improved by incorporating the label information as background knowledge to guide the clustering process. Semi-supervised clustering has received much attention in the machine learning community [1–3].

Background knowledge consists of pairs of objects that are used as pairwise constraints in the clustering process. The two objects in a pair are either in the same cluster or in two different clusters. The number of classes in the data is used as the number of clusters. In semi-supervised clustering, selecting the correct set of pairwise constraints is important because the clustering accuracy will suffer if the pairwise constraints are improperly chosen [4,5].

Two approaches are used to select pairwise constraints. One is to select pairwise constraints randomly from the set of labeled objects in the data [6–10]. The other is to use an active learning process that selects pairwise constraints by asking queries of an oracle. Active learning can achieve high accuracy from limited labeled data [11]. Active learning was originally proposed for semi-supervised classification, and only recently has it been used in semi-supervised clustering [12–16] to select the most uncertain or potential data objects for pairwise constraints. In general, the current methods for the selection of pairwise constraints require labeled data.


However, in real applications, many datasets have no labels, such as smart grid data streams [17] and log data from communication networks [18]. The existing methods need labeled data to form pairwise constraints for semi-supervised clustering and cannot be applied to unlabeled data. Unlabeled data can be investigated with different techniques to generate label information, but manual annotation, which relies on human effort to estimate label information, is ineffective and expensive. Furthermore, answering a query in active learning, that is, whether or not a pair of objects belongs to the same cluster, is difficult without detailed knowledge of the dataset [19].

In this paper, we propose a new method of selecting pairwise constraints from unlabeled data for semi-supervised clustering. We assume that the objects in a dataset are distributed into classes and that each class constitutes a dense group of objects. We select pairwise constraints from these dense groups to guide the clustering process toward an appropriate clustering solution. Given that the input data are unlabeled, the first task is to produce groups of closely related objects. For this purpose, we apply the I-nice method [20], which we recently developed, to generate a set of initial clusters on the unlabeled data. We remove faraway objects from each initial cluster and obtain a dense group of objects, so that similar objects belong to the same dense group. We compute the local densities of all objects in a dense group and select the most informative object and the informative objects in each dense group based on these local densities. We formulate the must-link pairwise constraints from the sets of informative objects in each dense group and the cannot-link pairwise constraints from the most informative objects of different dense groups. Finally, a semi-supervised clustering algorithm incorporates the pairwise constraints to produce an appropriate clustering solution on the unlabeled data.

We conducted a series of experiments with the proposed method on both synthetic and real-world datasets. We compared the clustering results of the proposed method with the random, farthest-first query selection (FFQS), min–max, and normalized point-based uncertainty (NPU) pairwise constraint selection methods in the MPCK-means semi-supervised clustering algorithm. Experimental results show that the proposed method outperformed the existing methods in purity, adjusted Rand index (ARI), and normalized mutual information (NMI) measures on both synthetic and real-world datasets.

The remainder of this paper is organized as follows. We summarize related work in Section 2. We provide a brief overview of the methodology in Section 3. We introduce the pairwise constraint selection method for unlabeled data in Section 4. We present semi-supervised clustering for unlabeled data in Section 5. We present the experimental results of the I-nice method, the proposed method, and the other methods in Section 6. Finally, we give concluding remarks and draw some perspectives about this research in Section 7.

2. Related work

The concept of constrained clustering was introduced in [7]. A similar concept was used in the COP K-means algorithm [8], a constrained version of the standard unsupervised k-means clustering algorithm [21]. The MPCK-means semi-supervised clustering framework integrates both constraint-based and metric learning approaches [10].
Several semi-supervised clustering algorithms [6,12,22–24] were subsequently proposed to incorporate background knowledge for an improved clustering solution. All these methods select objects randomly to form pairwise constraints with respect to the label information of the data. Experiments in [5] showed that constrained clustering with random constraint selection can degrade the clustering performance of the basic k-means algorithm. Therefore, pairwise constraint selection plays an important role in semi-supervised clustering.

An active learning-based FFQS algorithm [12] was proposed for the selection of pairwise constraints. The FFQS algorithm uses the farthest-first traversal technique to select pairwise constraints in two phases, namely, explore and consolidation. The explore phase selects K disjoint neighborhoods, where K is the number of clusters, and at least one object is selected in each neighborhood. This phase starts by choosing the first object randomly for the first neighborhood. The next object is the farthest one from the previously chosen objects. If a new object is cannot-linked to all existing objects, then a separate neighborhood is introduced for the new object; if the new object is must-linked to a neighborhood, then it is added to that neighborhood. This process continues until the K neighborhoods are obtained. The consolidation phase randomly selects an object x_i that is not in the neighborhoods of the explore phase, takes another object x_j from each neighborhood in turn, and queries for the label of the pair (x_i, x_j) until a must-link constraint is obtained. One obtains either a must-link reply or cannot-link replies from at most (K − 1) queries.

The active query selection algorithm based on the min–max criterion was proposed in [13] to modify the consolidation phase of the FFQS algorithm. This algorithm chooses the data points with the maximum uncertainty to query against the data points of the explore phase. The NPU method uses a neighborhood-based framework to select pairwise constraints, where a neighborhood contains a set of objects connected by must-link constraints and different neighborhoods are connected by cannot-link constraints [14]. An active learning approach is used to extend these neighborhoods by choosing informative objects. This method formulates an uncertainty-based principle and selects the data point that has the maximum rate of information for every query. However, the queries of this active learning are answered from the ground-truth label information of the dataset. In [25], another active constraint selection method is presented, which analyzes the eigenvectors derived from the data similarity matrix to select sparse and boundary data points. These points are used as the subjects of queries to the oracle to make sure that correct pairwise constraints are used in the spectral clustering process. However, this algorithm is restricted to the two-cluster problem.

The core distinction between the abovementioned algorithms and our proposed algorithm is that the proposed algorithm generates pairwise constraints from unlabeled data, whereas the abovementioned algorithms rely on the label information of the data for the selection of pairwise constraints.
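For readers who want a concrete picture of the explore phase, the following is a minimal Python sketch of a farthest-first neighborhood construction in the spirit of FFQS; the function name, the oracle_same_cluster callback, and the choice of querying one representative per neighborhood are our illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def ffqs_explore(X, oracle_same_cluster, K, seed=0):
    """Farthest-first exploration: grow K disjoint neighborhoods by querying
    an oracle about pairs of object indices into X."""
    rng = np.random.default_rng(seed)
    neighborhoods = [[int(rng.integers(len(X)))]]      # random first object
    while len(neighborhoods) < K:
        traversed = [i for nb in neighborhoods for i in nb]
        dists = cdist(X, X[traversed]).min(axis=1)     # distance to traversed set
        dists[traversed] = -np.inf
        cand = int(np.argmax(dists))                   # farthest-first candidate
        for nb in neighborhoods:
            if oracle_same_cluster(cand, nb[0]):       # must-link reply
                nb.append(cand)
                break
        else:                                          # cannot-link to all: new neighborhood
            neighborhoods.append([cand])
    return neighborhoods
```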


3. Overview of methodology

Given an unlabeled dataset, we do not have label information from which to select pairwise constraints. Because the clusters of the unlabeled data are unknown in advance, we first explore some initial groups of objects from the data such that the objects in each group are very likely to lie in the same cluster to be found. We apply the I-nice method [20] to estimate the number of clusters and the initial clusters of objects. The I-nice method can quickly find initial clusters from unlabeled data because it first converts the high-dimensional data into one-dimensional distance data and then generates the clusters from the one-dimensional data. These I-nice clusters are then used for the selection of pairwise constraints, in a role similar to that of the labeled data in current selection methods.

After obtaining the initial clusters with the I-nice method, we formulate the procedure to select the pairwise constraints. However, the clusters from I-nice are not compact. To find a dense group of objects in each cluster, we compute, for each object in the cluster, the sum of the distances between the object and all other objects in the cluster, and the sum of the mutual distances between the other objects. The first sum yields the separateness distance, and the second sum yields the compactness distance. We then compute the ratio of the separateness distance to the compactness distance as the outlier indicator of the object. A larger outlier indicator corresponds to a greater distance of the object from the other objects in the cluster. The outliers in each cluster are removed according to their outlier indicators, and dense groups of objects are obtained. These dense groups provide the background knowledge of the unlabeled data. The instance-level background knowledge can be described in the form of pairwise constraints, as in [8].

For the selection of pairwise constraints, we have similar objects within a dense group and dissimilar objects in different dense groups. A must-link constraint requires two objects to belong to the same cluster. Let M be the set of must-link constraints, whose elements have the form (x_2, x_3), meaning that objects x_2 and x_3 must be in the same cluster. By contrast, a cannot-link constraint requires two objects to be in two different clusters. Let C be the set of cannot-link constraints, whose elements have the form (x_1, x_4), meaning that objects x_1 and x_4 must be in two different clusters. To form a good set of pairwise constraints for semi-supervised clustering, each pair of objects selected for a must-link constraint must, with high certainty, lie in the same cluster; conversely, each pair of objects selected for a cannot-link constraint should lie in two different clusters. For this purpose, we first estimate the local density of every object in each dense group. The object with the highest local density in a dense group is taken as the candidate for cannot-link constraints, and its nearest neighboring objects are taken as candidates for must-link constraints. On the one hand, the objects with the highest local densities in different dense groups are used to form cannot-link constraints. On the other hand, the nearest neighboring objects within a dense group are used to form must-link constraints. This selection process is repeated until a set of pairwise constraints is obtained. These pairwise constraints represent the intrinsic clustering structure of the whole data distribution.
The details of the selection of pairwise constraints from unlabeled data are described in the next section. The number of clusters and the selected pairwise constraints are incorporated into the semi-supervised clustering to guide the clustering process toward a better clustering solution. The objective function is derived from the k-means objective and adapted to incorporate the pairwise constraints. After the selection of pairwise constraints from unlabeled data, the semi-supervised clustering process is presented.

4. Generating pairwise constraints from unlabeled data

To select pairwise constraints from unlabeled data for semi-supervised clustering, we first use the I-nice method to generate a set of initial clusters. Then, we select the must-link pairwise constraints from objects in the same clusters and the cannot-link pairwise constraints from objects in different clusters. However, because the initial clusters are produced from a one-dimensional distance distribution, the objects in these initial clusters may not be compact in dense neighborhoods; some objects are located far from the dense regions in the original data space. We have to remove these objects from the initial clusters and produce a set of dense groups in which objects in the same group are close to each other and must be placed in the same cluster, whereas objects in two different dense groups are separated and cannot be placed in the same cluster. From these dense groups, the must-link and cannot-link constraints are selected.

4.1. Estimating initial clusters

We apply the I-niceSO algorithm [20] to estimate the initial clusters of the input data. The I-niceSO algorithm is an I-nice variant in which several observation points are allocated in the data domain to observe the dense regions of clusters. This method transforms the high-dimensional data into one-dimensional distance data by computing the distances of all objects to an observation point. The one-dimensional distance distribution carries the information about the clusters in the original high-dimensional data. To automatically find the number of clusters, a gamma mixture model (GMM) is used to model the distance data, and the expectation–maximization (EM) algorithm is used to solve multiple GMMs. The second-order variant of the Akaike information criterion (AICc) is applied to select the best-fitted model of the distance distribution for each observation point. The model with the largest number of components, which is used as the number of clusters in the data, is selected as the final model.

Let R^d be a data domain of d dimensions and X ⊂ R^d a set of N objects in R^d. Let p ∈ R^d be a randomly generated point with a uniform distribution; p is defined as an observation point for X. Given a distance function d(·) on R^d, all distances between the observation point p and the N objects of X are computed and transformed into a set of distances X_p = {x_1, x_2, …, x_N}. The GMM of X_p is defined as

p(x|\theta) = \sum_{j=1}^{M} \pi_j \, g(x|\theta_j), \qquad x \ge 0 \qquad (1)


where θ is the vector of parameters of the GMM, M is the number of gamma components, π_j is the mixing proportion of component j, and θ_j contains the parameters of component j, namely the shape parameter α_j and the scale parameter β_j. The parameters of the GMM are estimated by maximizing the log-likelihood function, defined as

\mathcal{L}(\theta|X_p) = \sum_{i=1}^{N} \log\!\left( \sum_{j=1}^{M} \pi_j \, g(x_i|\alpha_j, \beta_j) \right) \qquad (2)

The EM algorithm is used to solve Eq. (2) [26]. The minimum value of AICc [27,28] is used to select the best-fitted model for each observation point. From the set of best-fitted models selected over the multiple observation points, the final model is the one that has the largest number of components. One object from each component is then used as an initial cluster center to run the k-means algorithm that generates the initial clusters.
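The following runnable sketch illustrates the model-selection idea of this subsection under simplifying assumptions: Euclidean distances, a fixed number of observation points, and scikit-learn's GaussianMixture used as a stand-in for the gamma mixture fitted with EM. Only the AICc-based choice of the number of components is kept from the original procedure; the selected component count would then seed k-means to produce the initial clusters.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def estimate_k_by_aicc(X, n_obs_points=6, max_components=8, seed=0):
    """Project the data onto distances from random observation points, fit
    mixtures with 1..max_components components to each 1-D distance set,
    and keep the largest component count chosen by AICc."""
    rng = np.random.default_rng(seed)
    best_k = 1
    for _ in range(n_obs_points):
        p = rng.uniform(X.min(axis=0), X.max(axis=0))        # observation point
        dist = np.linalg.norm(X - p, axis=1).reshape(-1, 1)  # 1-D distance data
        n = len(dist)
        best_aicc, chosen = np.inf, 1
        for m in range(1, max_components + 1):
            gm = GaussianMixture(n_components=m, random_state=0).fit(dist)
            k = 3 * m - 1                          # free parameters of the mixture
            log_l = gm.score(dist) * n             # total log-likelihood
            aicc = 2 * k - 2 * log_l + 2 * k * (k + 1) / (n - k - 1)
            if aicc < best_aicc:
                best_aicc, chosen = aicc, m
        best_k = max(best_k, chosen)               # final model: most components
    return best_k
```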

4.2. Generating dense groups from initial clusters

Distance-based methods are often used to detect outliers in data [29,30]. In our case, we consider the objects that lie apart from the dense regions in the original data space as outliers. We delete these objects from the initial clusters with the following method, which is based on the concepts of separateness and compactness distances.

Let G_1, G_2, …, G_K be the set of initial clusters generated by I-nice from the given unlabeled data, and let n_1, n_2, …, n_K be the numbers of objects in these clusters, respectively. Given an object x_o ∈ G_c for 1 ≤ c ≤ K, the neighboring objects of x_o are defined as the set of objects in the initial cluster excluding x_o itself; we denote by G'_c this set of n'_c objects neighboring x_o. The separateness distance of x_o to G'_c is defined as the average of the distances from x_o to all objects in G'_c:

Sd_{x_o} = \frac{1}{n'_c} \sum_{x_i \in G'_c} d(x_i, x_o) \qquad (3)

Similarly, the compactness distance of x_o to G'_c is defined as the average of the mutual distances among the objects in G'_c:

Cd_{x_o} = \frac{1}{n'_c (n'_c - 1)} \sum_{x_i, x_j \in G'_c,\; i \neq j} d(x_i, x_j) \qquad (4)

With Sd_{x_o} and Cd_{x_o} defined, the separateness of x_o from G_c can be measured by the ratio of the separateness distance to the compactness distance, which we call the group outlier criterion (GOC):

GOC_{x_o} = \frac{Sd_{x_o}}{Cd_{x_o}} \qquad (5)

For an initial cluster G_c, the score GOC_{x_o} measures how far away object x_o is from the dense group of objects in G'_c. We compute the GOC scores of all objects in G_c and arrange the objects in ascending order of their GOC scores. The objects with smaller GOC scores form the dense group of objects in G_c. According to the distribution of GOC scores in G_c, a threshold is chosen to select the objects of the dense group. This process is performed for all initial clusters, which results in a set of dense groups of objects as candidates for pairwise constraints.
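As an illustration, the sketch below computes the GOC scores of Eqs. (3)–(5) for one initial cluster and keeps the low-GOC objects; the midpoint threshold matches the setting reported later in Section 6.1.2, and the function name is ours.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dense_group(cluster):
    """cluster: (n_c, d) array of objects in one initial cluster.
    Returns the objects whose GOC score does not exceed the threshold."""
    D = cdist(cluster, cluster)                    # pairwise Euclidean distances
    n = len(cluster)
    goc = np.empty(n)
    for o in range(n):
        others = np.delete(np.arange(n), o)
        sd = D[o, others].mean()                   # Eq. (3): separateness distance
        sub = D[np.ix_(others, others)]
        cd = sub.sum() / (len(others) * (len(others) - 1))  # Eq. (4): compactness
        goc[o] = sd / cd                           # Eq. (5): GOC score
    th = goc.min() + (goc.max() - goc.min()) / 2   # threshold used in Section 6.1.2
    return cluster[goc <= th]                      # faraway objects are dropped
```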

4.3. Selecting must-link and cannot-link constraints

Given the set of dense groups, we use a local density estimation method to select the must-link and cannot-link pairwise constraints from the dense groups for semi-supervised clustering. Let N_1, N_2, …, N_K be the set of dense groups of objects obtained after removing the separate objects from the initial clusters. These dense groups are used to select the must-link and cannot-link pairwise constraints. On the one hand, the objects of a must-link pairwise constraint, which are selected from the same dense group, should fall in the same cluster in the semi-supervised clustering result. On the other hand, the objects of a cannot-link pairwise constraint, which are selected from two different dense groups, should not fall in the same cluster.

Definition 1 (Most Informative Object). The object in a dense group is the most informative object if it has the highest neighborhood density in the dense group.

Let N_c = {x_1, x_2, …, x_{L_c}} be a dense group with L_c objects in a high-density neighborhood. We use the Euclidean distance to calculate the mutual distances among the L_c objects and generate a nonnegative, symmetric matrix D = (d_{ij}), called a pre-distance matrix [31], computed as

d_{ij} = \| x_i - x_j \|^2, \qquad i, j = 1, 2, \ldots, L_c \qquad (6)

where the diagonal elements of D are zero. Given D, we compute the local density of each object x_i in the neighborhood N_c as

LD(x_i) = -\frac{1}{|N_c|} \sum_{j \in N_c} d_{ij} \qquad (7)


Fig. 1. Selection of the most informative objects and informative objects from three dense groups. 𝑥1 is the most informative object, which is the candidate for the cannot-link constraint, and 𝑥2 and 𝑥3 are the first pair of informative objects, which are candidates for must-link constraints in dense group N1 . 𝑥4 is the most informative object, which is the candidate for the cannot-link constraint, and 𝑥5 and 𝑥6 are the first pair of informative objects, which are candidates for must-link constraints in dense group N2 . 𝑥7 is the most informative object, which is the candidate for the cannot-link constraint, and 𝑥8 and 𝑥9 are the first pair of informative objects, which are candidates for must-link constraints in dense group N3 .

where x_i is the ith object in N_c and LD(x_i) is the local density of object x_i relative to the other objects in N_c. In this way, we compute the local densities of all objects in N_c. A larger value of LD(x_i) corresponds to a higher local density of object x_i. We can therefore rank the objects in N_c and select objects on the basis of their local densities; the most informative object has the largest LD. In this way, we identify one most informative object in each dense group of objects.

Definition 2 (Informative Objects). The objects that are closest to the most informative object in the same dense group are called informative objects.

We now formulate the set of must-link pairwise constraints as follows. Using the I-nice method, we obtain K dense groups. Given a dense group of objects, we use Eq. (7) to calculate the local densities of all objects and rank the objects by their local densities. The object with the highest local density is the most informative object, which is used to create cannot-link pairwise constraints. Following the ranked list of informative objects, we select the objects with the second and third highest local densities as the first must-link pairwise constraint. In the same way, we select the next pair of informative objects as the second must-link pairwise constraint. This process continues until the required number of must-link pairwise constraints has been selected from the dense group. We use the same process to select must-link pairwise constraints from all dense groups and build up the set of must-link pairwise constraints. The candidate objects of a must-link pairwise constraint are always selected from the same dense group.

Not all objects in a dense group are defined as informative objects. The number of informative objects is determined by the number of must-link pairwise constraints to be selected, and a threshold is used to control the number of pairwise constraints; the threshold is defined as the total number of must-link constraints divided by the number of clusters. From dense group N_c, we extract its informative objects to form C_{L_c}^2 pairs of must-link constraints in set M_c, where L_c is the number of informative objects from the dense group. The total number of must-link pairs in set M generated from the K dense groups is C_{L_1}^2 + C_{L_2}^2 + \cdots + C_{L_K}^2.

We then formulate the set of cannot-link pairwise constraints. The most informative object is the representative object of a dense group. From each dense group, we obtain the most informative object, which is the candidate for cannot-link pairwise constraints. Given the K most informative objects from the K dense groups, we select the cannot-link pairwise constraints. The total number of cannot-link pairs in set C generated from the K objects is C_K^2 = \frac{K!}{2!(K-2)!}. Because these K objects come from different dense groups, no pair of them should occur in the same cluster; the candidate objects of a cannot-link pairwise constraint are always selected from two different dense groups. Therefore, any pair of the most informative objects can form a cannot-link pairwise constraint.

Fig. 1 shows an example with three dense groups N_1, N_2, and N_3. Objects x_1, x_4, and x_7 are the most informative objects in dense groups N_1, N_2, and N_3, respectively, and the cannot-link pairwise constraints are formed from x_1, x_4, and x_7. Objects x_2 and x_3 are the informative objects of x_1.
Likewise, objects x_5 and x_6 are the informative objects of x_4, and objects x_8 and x_9 are the informative objects of x_7. In Fig. 1, objects x_2 and x_3 are the first and second informative objects in dense group N_1, respectively, so x_2 and x_3 are selected as the first must-link pairwise constraint from N_1. Similarly, objects x_5, x_6 and x_8, x_9 are selected as the first must-link pairwise constraints from N_2 and N_3, respectively.

Transitivity is one kind of relation among objects, and the transitive closure expresses this relation. Given that one object can be used in multiple pairwise constraints, the must-link and cannot-link pairwise constraints satisfy transitive closure properties, which are given after Algorithm 1.


Algorithm 1: Generate pairwise constraints from unlabeled data

Input: A high-dimensional unlabeled dataset X
Output: Must-link and cannot-link pairwise constraints

1. Estimate initial clusters of the unlabeled data: call the I-niceSO algorithm in [20] to produce the initial clusters {G_c}_{c=1}^{K}.
2. Discover dense groups:
   for c := 1 to K do
       for o := 1 to n_c do
           compute GOC_{x_o} in initial cluster G_c by Eq. (5)
           if GOC_{x_o} > th then G_c ← G_c − {x_o}
       N_c ← G_c
3. Select must-link pairwise constraints:
   for c := 1 to K do
       compute the pre-distance matrix D of the L_c objects in dense group N_c with the Euclidean distance function
       compute the local densities LD(x_i), i = 1, 2, …, L_c
       sort the indices of LD(x_i) in descending order
       for i := 1 to L_c − 1 do
           if i ≠ 1 then M ← M ∪ {(x_i, x_{i+1})}   // keep the pair as a must-link constraint
           else x_c ← x_i   // store the most informative object of group N_c
4. Construct cannot-link constraints from the most informative objects x_c, c = 1, …, K:
   for i := 1 to K − 1 do
       for j := i + 1 to K do
           C ← C ∪ {(x_i, x_j)}   // keep the pair as a cannot-link constraint
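A compact Python rendering of this selection loop is sketched below. The number of must-link pairs taken per group is a hypothetical parameter, and consecutive objects in the density ranking are paired, following the loop structure of Algorithm 1.

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import cdist

def select_constraints(dense_groups, ml_pairs_per_group=2):
    """dense_groups: list of (L_c, d) arrays.  Returns lists of must-link and
    cannot-link pairs of objects (coordinates, not indices)."""
    must_link, cannot_link, representatives = [], [], []
    for group in dense_groups:
        D = cdist(group, group) ** 2             # pre-distance matrix, Eq. (6)
        ld = -D.mean(axis=1)                     # local density, Eq. (7)
        order = np.argsort(ld)[::-1]             # rank by descending density
        representatives.append(group[order[0]])  # most informative object
        informative = order[1:ml_pairs_per_group + 2]
        for a, b in zip(informative[:-1], informative[1:]):
            must_link.append((group[a], group[b]))   # consecutive informative pairs
    for xi, xj in combinations(representatives, 2):
        cannot_link.append((xi, xj))             # every pair of representatives
    return must_link, cannot_link
```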

The transitive closure properties are as follows:

x_1 \xrightarrow{\text{must-link}} x_2 \ \text{and}\ x_2 \xrightarrow{\text{must-link}} x_3 \ \Rightarrow\ x_1 \xrightarrow{\text{must-link}} x_3

x_1 \xrightarrow{\text{cannot-link}} x_4 \ \text{and}\ x_4 \xrightarrow{\text{cannot-link}} x_7 \ \Rightarrow\ x_1 \xrightarrow{\text{cannot-link}} x_7

where the objects x_1, x_2, x_3 of a must-link pairwise constraint, selected from the same dense group, should be in the same cluster, and the objects x_1, x_4, x_7 of any cannot-link pairwise constraint, selected from two different dense groups, should not be in the same cluster.

Usually, a large number of pairwise constraints is used in semi-supervised clustering to improve the clustering performance; for example, the authors in [12] used 1000 pairwise constraints to evaluate the performance of semi-supervised clustering on the Iris dataset, which has only 150 objects. Using the proposed method, we are able to select a set of pairwise constraints from the informative objects of the different dense groups of unlabeled data. Therefore, the proposed method can achieve the desired performance with a limited number of pairwise constraints. The selection steps of pairwise constraints from unlabeled data are presented in Algorithm 1.

5. Semi-supervised clustering for unlabeled data

In the previous section, we discussed the method to generate pairwise constraints from unlabeled data. In this section, we present the semi-supervised clustering algorithm that incorporates the pairwise constraints in a modified k-means clustering process to perform semi-supervised clustering on unlabeled data.

5.1. Generating pairwise constraints from unlabeled data

Given a dataset X = {x_i}_{i=1}^{N} of N objects without any class labels, the target of this algorithm is to produce K clusters X_1, …, X_K such that \bigcup_{c=1}^{K} X_c = X. To achieve this objective, we first use the I-niceSO algorithm in [20] to generate a set of initial clusters from X; at this stage, the number of clusters K in dataset X is also found. Then, we use Algorithm 1 to generate the set of must-link pairwise constraints M and the set of cannot-link pairwise constraints C from the set of initial clusters. The transitive closure is used to augment the must-link constraints in M. Finally, the K initial cluster centers are obtained as the means of the K must-link groups formed from the augmented must-link pairwise constraints M.
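The initialization step can be sketched as follows, assuming the must-link constraints are given as index pairs into X: a union-find pass takes the transitive closure of the must-link pairs, and the means of the resulting groups serve as initial centers; keeping only the K largest groups is our simplification.

```python
import numpy as np

def initial_centers(X, must_link_idx, n_clusters):
    """X: (N, d) data; must_link_idx: iterable of (i, j) index pairs."""
    parent = list(range(len(X)))
    def find(i):                                   # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in must_link_idx:                     # transitive closure of must-links
        parent[find(i)] = find(j)
    groups = {}
    for i in range(len(X)):
        groups.setdefault(find(i), []).append(i)
    largest = sorted(groups.values(), key=len, reverse=True)[:n_clusters]
    return np.array([X[idx].mean(axis=0) for idx in largest])  # one center per group
```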


5.2. Semi-supervised clustering process

To incorporate the sets of pairwise constraints, we need a semi-supervised clustering algorithm. The goal of the pairwise constrained clustering framework is to minimize a combined objective function: the sum of the distances between the objects and their cluster centers plus the cost of violating each constraint [7]. The constraints can also be used to adapt the underlying distance metric, which captures a closer notion of similarity for clustering by minimizing the distances between objects of the same cluster while maximizing the distances between objects of different clusters. Suppose two objects x_i and x_j are known to be similar from some given information. We can learn a distance metric d(x_i, x_j) parameterized by a symmetric positive semi-definite matrix. Let a positive semi-definite matrix A_c be learned for cluster c, such that A_c ⪰ 0 [32]. The distance function is defined as \|x_i - x_j\|_{A_c} = \sqrt{(x_i - x_j)^T A_c (x_i - x_j)}. The integration of constraints and the metric learning approach results in an objective function that minimizes the cluster dispersion under the learned metrics while reducing the constraint violations. We formulate the objective function according to the MPCK-means algorithm [10]:

\mathcal{J} = \sum_{c=1}^{K} \sum_{x_i \in X_c} \left( \|x_i - \mu_c\|_{A_c}^2 - \log(\det(A_c)) \right) + \sum_{(x_i, x_j) \in M} w_{ij}\, f_{ML}(x_i, x_j)\, \mathbb{I}[c \neq l_j] + \sum_{(x_i, x_j) \in C} \bar{w}_{ij}\, f_{CL}(x_i, x_j)\, \mathbb{I}[c = l_j] \qquad (8)

f_{ML}(x_i, x_j) = \frac{1}{2} \|x_i - x_j\|_{A_c}^2 + \frac{1}{2} \|x_i - x_j\|_{A_{l_j}}^2 \qquad (9)

f_{CL}(x_i, x_j) = \|x'_c - x''_c\|_{A_c}^2 - \|x_i - x_j\|_{A_c}^2 \qquad (10)

where w_{ij} and \bar{w}_{ij} are the uniform costs of violating the pairwise constraints in M and C, respectively, \mathbb{I} is the indicator function with \mathbb{I}[true] = 1 and \mathbb{I}[false] = 0, and l_j is the cluster assignment of x_j such that l_j ∈ {1, …, K}. If two must-linked instances are assigned to two different clusters, the incurred cost is w_{ij}; likewise, if two cannot-linked objects are assigned to the same cluster, the incurred cost is \bar{w}_{ij}. The functions f_{ML} and f_{CL} compute the penalties for violating the must-link and cannot-link constraints of cluster c, where (x'_c, x''_c) is the maximally separated pair of objects in the dataset according to matrix A_c.

Algorithm 2: Semi-supervised clustering for unlabeled data

Input: A high-dimensional unlabeled dataset X
Output: K clusters {X_c}_{c=1}^{K} of X such that the objective function J is locally minimized

1. Estimate the initial clusters: call algorithm I-niceSO in [20].
2. Select pairwise constraints: call Algorithm 1.
3. Initialize cluster centers: compute {μ_c}_{c=1}^{K} as the means of the K must-link sets from M.
4. Semi-supervised clustering process: use K, M, C, and {μ_c}_{c=1}^{K} as input for the clustering process.
5. Semi-supervised clustering solution: call algorithm MPCK-Means in [10].
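To make the penalty terms concrete, the small helpers below evaluate the squared distance under a learned metric and the violation costs of Eqs. (9) and (10); they are illustrative utilities that a driver such as Algorithm 2 would call during the assignment step, not the reference implementation of MPCK-means.

```python
import numpy as np

def sq_dist(x, y, A):
    """Squared distance under the learned metric A: (x - y)^T A (x - y)."""
    d = x - y
    return float(d @ A @ d)

def f_ml(xi, xj, A_li, A_lj):
    """Eq. (9): must-link violation penalty under both clusters' metrics."""
    return 0.5 * sq_dist(xi, xj, A_li) + 0.5 * sq_dist(xi, xj, A_lj)

def f_cl(xi, xj, x_far1, x_far2, A_c):
    """Eq. (10): cannot-link penalty, relative to the maximally separated
    pair (x_far1, x_far2) of the dataset under metric A_c."""
    return sq_dist(x_far1, x_far2, A_c) - sq_dist(xi, xj, A_c)
```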

Each weight matrix A_c is obtained by inverting a sum of covariance-like terms. The matrices {A_c}_{c=1}^{K} are re-computed to minimize the objective function J: each updated local weight matrix A_c is obtained by taking the partial derivative ∂J/∂A_c, setting it to zero, and solving, which gives

A_c = |X_c| \Bigg( \sum_{x_i \in X_c} (x_i - \mu_c)(x_i - \mu_c)^T + \sum_{(x_i, x_j) \in M_c} \frac{1}{2}\, w_{ij}\, (x_i - x_j)(x_i - x_j)^T\, \mathbb{I}[l_i \neq l_j] + \sum_{(x_i, x_j) \in C_c} \bar{w}_{ij} \left( (x'_c - x''_c)(x'_c - x''_c)^T - (x_i - x_j)(x_i - x_j)^T \right) \mathbb{I}[l_i = l_j] \Bigg)^{-1} \qquad (11)
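A direct numpy transcription of this update, under the simplifying assumptions of uniform violation costs and pre-collected lists of the currently violated constraints for cluster c, might look as follows.

```python
import numpy as np

def update_metric(Xc, mu_c, ml_violations, cl_violations, w, w_bar, far_pair):
    """One update of A_c per Eq. (11).  ml_violations / cl_violations hold the
    (x_i, x_j) pairs of currently violated constraints involving cluster c,
    and far_pair is the maximally separated pair (x'_c, x''_c) under A_c."""
    d = Xc.shape[1]
    S = np.zeros((d, d))
    for x in Xc:                                     # within-cluster scatter
        diff = x - mu_c
        S += np.outer(diff, diff)
    for xi, xj in ml_violations:                     # violated must-links
        diff = xi - xj
        S += 0.5 * w * np.outer(diff, diff)
    xp, xpp = far_pair
    far = np.outer(xp - xpp, xp - xpp)
    for xi, xj in cl_violations:                     # violated cannot-links
        diff = xi - xj
        S += w_bar * (far - np.outer(diff, diff))
    return len(Xc) * np.linalg.inv(S)                # A_c = |X_c| * S^{-1}
```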

In the cluster assignment step of the algorithm, an object x is assigned to the cluster that minimizes the sum of the distance of x to the cluster centroid under the local metric and the penalty cost of any constraint violations. The constraints are incorporated during cluster initialization and during the assignment of objects to clusters. The distance metric is adapted by re-estimating the weight matrix A_c in each iteration based on the current cluster assignments and constraint violations according to Eq. (11). Every cluster assignment, centroid re-estimation, and metric learning step guides the minimization of the objective function toward convergence. The pseudocode of the semi-supervised clustering algorithm is given in Algorithm 2. Algorithm 2 converges to a local minimum of J as long as the matrices {A_c}_{c=1}^{K} are computed according to Eq. (11).

5.3. Complexity analysis

Given a dataset X with N objects and d dimensions, we analyze the complexity of the proposed semi-supervised clustering Algorithm 2. Recall that K, M, and C are the number of clusters, the set of must-link constraints, and the set of cannot-link constraints, respectively.


Table 1
Characteristics of eight synthetic datasets.

Number  Datasets   Objects N  Features d  Classes K  Overlap v
1       Syn-Data1  120        2           3          0.03
2       Syn-Data2  350        2           4          0.05
3       Syn-Data3  450        10          5          0.05
4       Syn-Data4  500        4           6          0.07
5       Syn-Data5  300        6           5          0.04
6       Syn-Data6  320        12          4          0.05
7       Syn-Data7  3000       2           3          0.05
8       Syn-Data8  5000       2           3          0.05

• Step 1: Estimate the number of clusters. For P observation points, the computational complexity of computing the distance values in this step is O(dNP). Given a GMM with M components, we need to solve M_max − 1 GMMs for each distance distribution. Let ρ be the average number of iterations needed to calculate the parameters of a GMM and μ be the average number of iterations needed to solve a GMM with EM. The computational complexity of solving the (M_max − 1) GMMs is O((M_max − 1)(4 + ρ)μNM), and the computational complexity of the whole GMM fitting process is O((M_max − 1)(4 + ρ)μPNM).

• Step 2: Select pairwise constraints. The computational complexity of computing the GOC scores of the N objects is O(K(n'_c + n'_c(n'_c − 1)) + N), where n'_c is the number of objects neighboring x_o in G'_c. We then have K dense groups, each with L objects. The computational complexity of selecting the K most informative objects from the K dense groups is O(K), whereas the computational complexity of selecting s informative objects from each dense group is O(L log s). After the most informative objects and the informative objects have been selected, the formation of the must-link and cannot-link pairwise constraints is straightforward.

• Step 3: Semi-supervised clustering process. The computational complexity of the semi-supervised clustering process is O(K·d·(|M| + |C|) + d²·(|M| + |C|) + K·d³ + K·N·d² + N²·d²), where the cubic complexity in the number of dimensions comes from the computation of determinants and the eigenvalue decomposition of each matrix A_c, and the quadratic complexity in the number of objects N is due to the updates of the pair of farthest objects for each matrix.

6. Experiments

In this section, we present the experimental results of pairwise constraint selection methods in semi-supervised clustering on both synthetic and real-world datasets. We compare the performance of the new algorithm with that of several state-of-the-art pairwise constraint selection methods to demonstrate that the new algorithm is not only able to select pairwise constraints from unlabeled data but also produces more accurate clustering results than the compared methods.

6.1. Experiment setup

6.1.1. Datasets

Both synthetic and real-world datasets were used in these experiments. A synthetic dataset was generated as follows. Given the number of objects N, the number of clusters K, the number of features d, and the value of the overlap threshold v, K vectors of d dimensions were generated with integer values between 1 and K drawn from a uniform distribution, with no vector equal to any other vector. From each vector, a cluster center was calculated by subtracting 0.5 from the value of each element; in this way, K cluster centers were obtained. Then, the mutual distances between the cluster centers were computed. Half of the distance between the two closest cluster centers was used as the variance of all dimensions on the diagonal of the covariance matrix, and the other elements of the covariance matrix were set to zero. For each cluster, the cluster center vector and the covariance matrix were used as the parameters of the multivariate Gaussian function to generate a set of normally distributed points in the d-dimensional space. The K Gaussian clusters were generated independently and merged into one dataset with N objects. The overlap threshold v was used to control the percentage of overlap between the clusters in a synthetic dataset.
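A minimal sketch of this generator, assuming balanced cluster sizes and omitting the overlap control driven by v, is given below.

```python
import numpy as np

def make_synthetic(N, K, d, seed=0):
    """Generate K merged Gaussian clusters as described above (balanced sizes;
    the overlap adjustment controlled by v is omitted for brevity)."""
    rng = np.random.default_rng(seed)
    centers = set()
    while len(centers) < K:                        # K distinct integer vectors in {1..K}^d
        centers.add(tuple(rng.integers(1, K + 1, size=d)))
    centers = np.array(sorted(centers), dtype=float) - 0.5   # shift to get centers
    pairwise = [np.linalg.norm(a - b) for i, a in enumerate(centers)
                for b in centers[i + 1:]]
    var = min(pairwise) / 2.0                      # half the closest-center distance
    cov = var * np.eye(d)                          # diagonal covariance matrix
    sizes = np.full(K, N // K)
    sizes[: N % K] += 1
    X = np.vstack([rng.multivariate_normal(c, cov, size=s)
                   for c, s in zip(centers, sizes)])
    y = np.repeat(np.arange(K), sizes)             # true cluster labels
    return X, y
```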
The characteristics of the eight synthetic datasets are summarized in Table 1. In addition, eight real-world datasets were selected from the UCI machine learning repository [33] and the KEEL dataset repository [34]. These datasets are labeled with classes, which were taken as the true cluster labels for comparison with the cluster labels generated by the pairwise constraint selection methods in the semi-supervised clustering algorithms. The details of these real-world datasets are summarized in Table 2.

6.1.2. Experiment settings

In these experiments, the number of observation points P is set to 6 to estimate the initial clusters, and the threshold for discovering dense groups is defined as th = min(GOC) + (max(GOC) − min(GOC))/2. To evaluate the performance of the pairwise constraint selection methods in the semi-supervised clustering algorithm on a given dataset, up to 20% of the objects in the dataset were selected as pairwise constraints. For the new algorithm, the class labels were deleted from the datasets to make them unlabeled. For the other pairwise constraint selection methods, the class labels in the datasets were used to select pairwise constraints, and the right number of clusters was assigned to them. After the pairwise constraints were selected, the labels were deleted, and the unlabeled datasets were clustered by incorporating the pairwise constraints in the semi-supervised clustering process.


Table 2
Characteristics of eight real-world datasets.

Number  Datasets           Instances  Features  Classes
1       Appendicitis       106        7         2
2       Mammographic Mass  961        6         2
3       Iris               150        4         3
4       Wine               178        13        3
5       Breast Tissue      106        10        6
6       Glass              214        10        6
7       Ionosphere         351        34        2
8       Seed               210        7         3

For each dataset, 10 independent runs were conducted by selecting the pairwise constraints randomly in the semi-supervised clustering process, and the average performance measures were used in the performance comparison.

6.1.3. Evaluation criteria

Three evaluation criteria were used to assess the clustering performance of each method in the experiments. They are defined as follows.

NMI: NMI measures the normalized mutual information between the class labels and the cluster assignments, which are considered random variables [35]. Let C be the random variable representing the cluster assignment of objects and Y be the random variable representing the class label of objects. NMI is computed as

NMI = \frac{2\, I(C; Y)}{H(C) + H(Y)} \qquad (12)

where I(C; Y) is the mutual information of C and Y, and H(C) and H(Y) are the entropies of C and Y, respectively. The range of NMI is between 0 and 1.

Purity: Purity is the percentage of objects that are classified correctly [36]. Purity is computed as

Purity = \frac{1}{N} \sum_{i=1}^{K} \max_j |C_i \cap Y_j| \qquad (13)

where N is the number of objects in the dataset, K is the number of clusters, C_i is the set of objects in cluster i, and Y_j is the set of objects in class j that has the maximum intersection with cluster i among all classes. The range of purity is also between 0 and 1.
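For reference, purity can be computed in a few lines, while NMI (Eq. (12)) and ARI (Eq. (14), defined next) are available in scikit-learn; the snippet assumes integer-coded label arrays.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def purity(labels_true, labels_pred):
    """Purity per Eq. (13): each cluster is credited with its majority class."""
    correct = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        correct += np.bincount(members).max()    # size of the dominant class
    return correct / len(labels_true)

# NMI and ARI for the same pair of labelings:
# nmi = normalized_mutual_info_score(labels_true, labels_pred)
# ari = adjusted_rand_score(labels_true, labels_pred)
```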

ARI: ARI evaluates the consistency between two partitions [37]. Let C = {C_1, C_2, …, C_{K^C}} be a partition of the N objects into K^C clusters and Y = {Y_1, Y_2, …, Y_{K^Y}} be a partition of the N objects into K^Y classes. Let N_{ij} be the number of objects in both cluster C_i and class Y_j, N_i be the number of objects in cluster C_i, and N_j be the number of objects in class Y_j. ARI is calculated as

ARI = \frac{\sum_{ij} \binom{N_{ij}}{2} - \left[ \sum_i \binom{N_i}{2} \sum_j \binom{N_j}{2} \right] / \binom{N}{2}}{\frac{1}{2} \left[ \sum_i \binom{N_i}{2} + \sum_j \binom{N_j}{2} \right] - \left[ \sum_i \binom{N_i}{2} \sum_j \binom{N_j}{2} \right] / \binom{N}{2}} \qquad (14)

6.1.4. Competing methods

The number of clusters found by the I-nice method was compared with the true number of clusters and with the Elbow [38], Silhouette [39], and Gap statistic [40] methods; the results showed that the I-nice method outperformed these three popular methods. The following four pairwise constraint selection methods were selected as baseline state-of-the-art algorithms in the comparison study.

Random: The MPCK-means semi-supervised clustering algorithm selects a pair of objects randomly, and the pairwise constraint for these objects is formed based on their label information [10].

FFQS: This method uses the active learning-based FFQS algorithm to select pairwise constraints [12]. The answer to the query, whether the current object and an object from the existing neighborhoods are in the same cluster or not, is provided from the labeled data.

Min–Max: The min–max method performs active query selection by querying pairwise constraints to select the data object with the maximum uncertainty based on the min–max criterion [13]. A pair is formed with a potential object and another object from the neighborhood. Similarly, the answer to the query is obtained by using the label information.

NPU: The NPU method expands the neighborhoods by asking pairwise queries, and an object-based selection technique is used to identify the best object to include in the existing neighborhoods. The answer to the query is provided from the label information of the data [14].

All the pairwise constraint selection methods, including the proposed one, select a set of pairwise constraints that are incorporated into the MPCK-means semi-supervised clustering algorithm [10] to guide the clustering of the input data.

6.2. Experimental results and analysis

In this section, we present the experimental results on both synthetic and real-world datasets and compare the proposed method with the baseline methods.


Table 3
Performances of the Elbow, Silhouette, Gap statistic, and I-nice methods in finding the numbers of clusters for eight synthetic datasets.

Number  Datasets   Classes  Elbow  Silhouette  Gap statistic  I-nice
1       Syn-Data1  3        3      2           3              3
2       Syn-Data2  4        4      2           2              4
3       Syn-Data3  5        5      5           5              5
4       Syn-Data4  6        Not    3           5              6
5       Syn-Data5  5        Not    5           5              5
6       Syn-Data6  4        4      4           4              4
7       Syn-Data7  3        2      2           2              3
8       Syn-Data8  3        2      2           2              3

Table 4
Performance of the I-nice method in finding the numbers of clusters and clustering accuracy for eight real-world datasets.

Number  Datasets           Classes  Impact of P    Estimated clusters  Purity  ARI    NMI
1       Appendicitis       2        2,-,-,-,-,-    2                   0.830   0.378  0.229
2       Mammographic Mass  2        2,-,-,-,-,-    2                   0.685   0.136  0.106
3       Iris               3        3,-,-,-,-,-    3                   0.893   0.730  0.758
4       Wine               3        3,-,-,-,-,-    3                   0.702   0.371  0.428
5       Breast Tissue      6        5,5,6,-,-,-    6                   0.433   0.186  0.362
6       Glass              6        3,4,4,6,-,-    6                   0.570   0.261  0.399
7       Ionosphere         2        2,-,-,-,-,-    2                   0.712   0.177  0.134
8       Seed               3        3,-,-,-,-,-    3                   0.895   0.716  0.710

6.2.1. Performance of the I-nice method in estimating the initial clusters

The I-nice method is applied to estimate the number of clusters and the initial clusters of each dataset. Table 3 presents the performance of the existing methods (Elbow, Silhouette, and Gap statistic) and the I-nice method in finding the numbers of clusters for the eight synthetic datasets. The results show that the I-nice method is able to discover the right number of clusters in all datasets and outperformed all the existing methods in determining the number of clusters. Therefore, the I-nice method is the best of these unsupervised methods for estimating the initial clusters from data.

We present the performance of the I-nice method in finding the numbers of clusters, together with its clustering accuracy, for the eight real-world datasets in Table 4. The column "Impact of P" shows the sequence of cluster numbers that I-nice produced as the number of observation points P increased, until the right number of clusters was obtained. Six datasets need one observation point to estimate the right number of clusters, whereas two datasets, namely, Breast Tissue and Glass, need three and four observation points, respectively. The results show that the I-nice method can determine the right number of clusters for all the real-world datasets. The clustering accuracy of the I-nice method in the purity, ARI, and NMI measures is also presented. According to the experimental results in [20], the I-nice method has better clustering accuracy than other standard unsupervised clustering algorithms.

6.2.2. Selection of pairwise constraints

We use the two-dimensional synthetic dataset Syn-Data1 to demonstrate the selection process of must-link and cannot-link pairwise constraints for semi-supervised clustering. The Syn-Data1 dataset has three clusters, as given in Table 1. First, we used the I-nice method to generate the initial clusters from the Syn-Data1 dataset. Figs. 2(a), 2(b), and 2(c) show the distributions of objects in the three initial clusters. Some objects lie away from the dense region in each cluster; these faraway objects are outliers that need to be removed from the initial clusters. For this purpose, we compute the GOC scores of all objects in each initial cluster. The distributions of GOC scores in the initial clusters are shown in Fig. 3; some GOC scores are high and deviate considerably from the rest. According to the distribution of GOC scores in an initial cluster, a threshold is chosen to remove the outliers. After removing these faraway objects, we obtained a dense group of objects from each initial cluster, as shown in Fig. 4.

For each dense group of objects, we calculated the local density of each object and selected the most informative object and the informative objects. Fig. 5 marks the most informative object in each dense group with the symbol m and the informative objects with the symbol i. One most informative object and several informative objects are identified in each dense group. The informative objects in each dense group are the candidates for forming the must-link pairwise constraints based on their local density values, and the three most informative objects in the three dense groups are the candidates for generating the cannot-link pairwise constraints.
From the figures, we can see that the informative objects of a must-link pairwise constraint are close to each other in the two-dimensional space, whereas the two most informative objects of a cannot-link pairwise constraint are far from each other.

6.2.3. Performance in terms of improvement of clustering on synthetic datasets

Table 5 shows the clustering results on the eight synthetic datasets listed in Table 1 obtained by the five pairwise constraint selection methods. For each dataset, three different numbers of pairwise constraints were used. For each set of pairwise constraints, the dataset was clustered by each method, and the purity and ARI measures are reported in the table.


Fig. 2. Three initial clusters are generated on a synthetic dataset, Syn-Data1 (a) Objects are distributed in initial cluster 1 (b) Objects are distributed in initial cluster 2 (c) Objects are distributed in initial cluster 3.

Fig. 3. Distribution of GOC scores for estimating dense group of objects on the synthetic dataset, Syn-Data1 (a) Distribution of GOC scores in initial cluster 1 (b) Distribution of GOC scores in initial cluster 2 (c) Distribution of GOC scores in initial cluster 3.

Fig. 4. Three dense groups are discovered from three initial clusters after removing outliers (a) Similar objects are placed in dense group 1 (b) Similar objects are placed in dense group 2 (c) Similar objects are placed in dense group 3.

Fig. 5. The most informative object in each dense group is marked with 𝑚. The informative objects are marked with 𝑖. These marked objects are used to select must-link and cannot-link pairwise constraints, respectively, from rescaled dense groups. (a) One most informative object and four informative objects marked as 𝑖1 , 𝑖2 , 𝑖3 , 𝑖4 in dense group 1. (b) One most informative object and four informative objects marked as 𝑖1 , 𝑖2 , 𝑖3 , 𝑖4 in dense group 2. (c) One most informative object and four informative objects marked as 𝑖1 , 𝑖2 , 𝑖3 , 𝑖4 in dense group 3.

The results show that the proposed method performed the best on all eight datasets in the two performance measures. The performance of the proposed method was significantly better than that of the other four methods on four datasets (Syn-Data1, Syn-Data2, Syn-Data5, and Syn-Data6). The NPU method obtained clustering accuracy in purity equal to that of the proposed method with 10 and 20 pairwise constraints on the Syn-Data1 dataset, and equal purity and ARI measures with 20 pairwise constraints on the Syn-Data2 dataset.


Table 5
Clustering results of different constraint selection algorithms with different numbers of constraints on synthetic datasets.

Purity:

Datasets   No. constraints  Random  FFQS    Min–Max  NPU    Proposed
Syn-Data1  5                0.943   0.948   0.946    0.950  0.958
           10               0.949   0.949   0.952    0.958  0.958
           20               0.941   0.952   0.953    0.958  0.958
Syn-Data2  20               0.864   0.884   0.928    0.942  0.942
           50               0.888   0.890   0.934    0.940  0.942
           70               0.936   0.936   0.935    0.940  0.942
Syn-Data3  30               0.923   0.924   0.946    0.946  0.946
           60               0.921   0.933   0.946    0.937  0.944
           90               0.924   0.934   0.940    0.946  0.948
Syn-Data4  30               0.881   0.904   0.911    0.912  0.926
           50               0.899   0.924   0.918    0.932  0.930
           90               0.907   0.924   0.926    0.925  0.928
Syn-Data5  20               0.935   0.0943  0.947    0.946  0.956
           40               0.942   0.935   0.937    0.956  0.960
           60               0.914   0.936   0.940    0.956  0.960
Syn-Data6  20               0.906   0.918   0.923    0.893  0.946
           40               0.913   0.934   0.930    0.937  0.946
           60               0.912   0.927   0.933    0.940  0.946
Syn-Data7  50               0.868   0.869   0.906    0.868  0.933
           100              0.871   0.869   0.932    0.868  0.933
           200              0.868   0.869   0.938    0.866  0.933
Syn-Data8  50               0.907   0.907   0.938    0.907  0.938
           100              0.941   0.906   0.942    0.907  0.938
           200              0.905   0.940   0.942    0.907  0.938

ARI:

Datasets   No. constraints  Random  FFQS   Min–Max  NPU    Proposed
Syn-Data1  5                0.845   0.855  0.853    0.848  0.874
           10               0.858   0.857  0.862    0.873  0.874
           20               0.841   0.868  0.859    0.873  0.874
Syn-Data2  20               0.650   0.717  0.827    0.875  0.875
           50               0.738   0.740  0.850    0.869  0.874
           70               0.855   0.854  0.852    0.869  0.874
Syn-Data3  30               0.829   0.833  0.866    0.862  0.866
           60               0.833   0.845  0.865    0.842  0.860
           90               0.833   0.832  0.846    0.863  0.872
Syn-Data4  30               0.760   0.788  0.797    0.789  0.829
           50               0.789   0.820  0.817    0.840  0.834
           90               0.802   0.823  0.827    0.825  0.830
Syn-Data5  20               0.840   0.864  0.880    0.887  0.897
           40               0.870   0.848  0.852    0.897  0.907
           60               0.810   0.846  0.851    0.902  0.907
Syn-Data6  20               0.782   0.807  0.816    0.756  0.861
           40               0.799   0.844  0.833    0.845  0.863
           60               0.797   0.825  0.840    0.856  0.868
Syn-Data7  50               0.518   0.521  0.802    0.518  0.849
           100              0.526   0.521  0.846    0.519  0.849
           200              0.521   0.521  0.860    0.512  0.849
Syn-Data8  50               0.661   0.657  0.863    0.662  0.863
           100              0.872   0.660  0.874    0.652  0.863
           200              0.665   0.869  0.874    0.662  0.863

The NPU method performed better than the other methods with 50 pairwise constraints on the Syn-Data4 dataset, and the min–max method performed slightly better than the other methods with 60 pairwise constraints on the Syn-Data3 dataset. The proposed method achieved better clustering accuracy than the other methods with 50 and 100 pairwise constraints on the Syn-Data7 and Syn-Data8 datasets, whereas the min–max method outperformed the others with 200 pairwise constraints. Considering the overall clustering performance, the NPU and min–max methods performed better than the remaining baseline methods. One advantage of the proposed method is that it can achieve good performance with few pairwise constraints, whereas the other methods required more pairwise constraints to achieve improved clustering performance.

Fig. 6 shows the clustering performance of the five pairwise constraint selection methods, measured in NMI, as the number of pairwise constraints changes on the eight synthetic datasets. In the figure, the x-axis is the number of pairwise constraints and the y-axis is the clustering performance in NMI. The proposed method leads to clustering results that are more consistent and significantly better than those of the four other methods on four synthetic datasets (Syn-Data1, Syn-Data2, Syn-Data5, and Syn-Data6). Similar to the performance in purity and ARI, the NPU method performed better in NMI than the other methods with 50 pairwise constraints on the Syn-Data4 dataset, and the min–max method performed slightly better in NMI with 60 pairwise constraints on the Syn-Data3 dataset. The proposed method also outperformed all four methods with 90 pairwise constraints on the Syn-Data3 and Syn-Data4 datasets. As with purity and ARI, the proposed method achieved better NMI clustering accuracy than the other methods with 50 and 100 pairwise constraints on the Syn-Data7 and Syn-Data8 datasets, whereas the min–max method outperformed the others with 200 pairwise constraints.

6.2.4. Performance in terms of improvement of clustering on real-world datasets

Eight benchmark datasets with different characteristics in features and classes were used to compare the performance of the proposed method and the four competing methods. Table 6 shows the clustering results on the eight real-world datasets listed in Table 2. As before, several different numbers of pairwise constraints were used for each dataset; for each set of pairwise constraints, the dataset was clustered by each method, and the purity and ARI measures are reported in the table. The results show that the performance of the proposed method on the purity and ARI measures is significantly better than that of the other four methods on the Appendicitis dataset.


Fig. 6. Comparison of the clustering accuracy in NMI of the proposed algorithm with existing algorithms using different numbers of constraints on synthetic datasets.


Fig. 7. Comparison of the clustering accuracy in NMI of the proposed algorithm with existing algorithms using different numbers of pairwise constraints on real datasets.


Table 6
Clustering results of the proposed algorithm and other algorithms using different numbers of pairwise constraints on real-world datasets. For each dataset and number of constraints, the five purity values and the five ARI values correspond to the Random, FFQS, Min–Max, NPU, and Proposed methods, respectively.

Dataset (No. constraints)    Purity (Random / FFQS / Min–Max / NPU / Proposed)    ARI (Random / FFQS / Min–Max / NPU / Proposed)
Appendicitis (5)             0.801 / 0.805 / 0.810 / 0.811 / 0.811                0.246 / 0.255 / 0.277 / 0.334 / 0.334
Appendicitis (10)            0.817 / 0.809 / 0.822 / 0.811 / 0.858                0.308 / 0.277 / 0.348 / 0.334 / 0.442
Appendicitis (15)            0.806 / 0.801 / 0.821 / 0.811 / 0.858                0.250 / 0.256 / 0.326 / 0.334 / 0.442
Appendicitis (20)            0.809 / 0.803 / 0.816 / 0.820 / 0.877                0.242 / 0.266 / 0.308 / 0.364 / 0.496
Mam. Mass (20)               0.794 / 0.794 / 0.798 / 0.797 / 0.796                0.350 / 0.346 / 0.356 / 0.353 / 0.350
Mam. Mass (100)              0.796 / 0.789 / 0.794 / 0.796 / 0.804                0.350 / 0.333 / 0.349 / 0.350 / 0.370
Mam. Mass (150)              0.803 / 0.813 / 0.813 / 0.796 / 0.815                0.367 / 0.391 / 0.391 / 0.350 / 0.397
Iris (10)                    0.910 / 0.938 / 0.936 / 0.960 / 0.960                0.768 / 0.834 / 0.829 / 0.885 / 0.885
Iris (20)                    0.942 / 0.939 / 0.952 / 0.960 / 0.960                0.843 / 0.836 / 0.869 / 0.885 / 0.885
Iris (30)                    0.938 / 0.960 / 0.950 / 0.960 / 0.966                0.834 / 0.885 / 0.864 / 0.885 / 0.903
Wine (10)                    0.929 / 0.930 / 0.935 / 0.938 / 0.949                0.799 / 0.803 / 0.815 / 0.822 / 0.851
Wine (20)                    0.944 / 0.943 / 0.950 / 0.955 / 0.938                0.838 / 0.835 / 0.853 / 0.866 / 0.820
Wine (35)                    0.949 / 0.942 / 0.952 / 0.949 / 0.955                0.852 / 0.833 / 0.863 / 0.853 / 0.866
Breast Tissue (10)           0.494 / 0.504 / 0.545 / 0.528 / 0.518                0.250 / 0.254 / 0.315 / 0.294 / 0.279
Breast Tissue (20)           0.516 / 0.541 / 0.543 / 0.518 / 0.550                0.278 / 0.290 / 0.313 / 0.290 / 0.303
Breast Tissue (30)           0.517 / 0.541 / 0.599 / 0.518 / 0.602                0.280 / 0.295 / 0.353 / 0.286 / 0.350
Glass (20)                   0.520 / 0.527 / 0.526 / 0.532 / 0.532                0.245 / 0.258 / 0.268 / 0.280 / 0.296
Glass (30)                   0.532 / 0.543 / 0.534 / 0.537 / 0.528                0.237 / 0.250 / 0.276 / 0.292 / 0.295
Glass (40)                   0.543 / 0.550 / 0.543 / 0.537 / 0.542                0.233 / 0.239 / 0.268 / 0.276 / 0.298
Ionosphere (20)              0.706 / 0.712 / 0.707 / 0.712 / 0.712                0.168 / 0.177 / 0.170 / 0.177 / 0.177
Ionosphere (50)              0.711 / 0.724 / 0.721 / 0.715 / 0.715                0.177 / 0.202 / 0.194 / 0.182 / 0.182
Ionosphere (70)              0.703 / 0.720 / 0.720 / 0.715 / 0.715                0.164 / 0.195 / 0.195 / 0.182 / 0.182
Seed (20)                    0.880 / 0.881 / 0.881 / 0.880 / 0.876                0.688 / 0.690 / 0.690 / 0.689 / 0.678
Seed (30)                    0.878 / 0.875 / 0.887 / 0.880 / 0.876                0.684 / 0.677 / 0.703 / 0.689 / 0.678
Seed (40)                    0.882 / 0.876 / 0.885 / 0.886 / 0.876                0.692 / 0.677 / 0.699 / 0.700 / 0.678

On the Mammographic Mass dataset, the min–max method outperformed the other methods with 20 pairwise constraints. However, as the number of pairwise constraints increased, the proposed method performed the best. On the Iris dataset, the proposed and NPU methods obtained equal purity and ARI measures with 10 and 20 pairwise constraints, while the proposed method outperformed the others with 30 pairwise constraints. On the Wine dataset, the proposed method outperformed the others with 10 and 35 pairwise constraints, and the NPU method performed the best with 20 pairwise constraints; the best overall performance was still obtained by the proposed method with 35 pairwise constraints. On the Breast Tissue dataset, the proposed method obtained the best purity with 20 and 30 pairwise constraints, and the min–max method performed best in the remaining configurations on both purity and ARI. On the Glass dataset, the FFQS method produced the best purity with 30 and 40 pairwise constraints, whereas the proposed method produced the best results in the other configurations on both purity and ARI. The FFQS method also achieved the best results on both purity and ARI on the Ionosphere dataset, and the NPU and min–max methods produced better results than the others on the Seed dataset.

Given that the real-world datasets are more complex than the synthetic datasets, we should not expect one method to win in all cases. However, from the results in Table 6, we can rank the performance of the pairwise constraint selection methods on the eight datasets as follows: the proposed algorithm performed the best, followed by the NPU, min–max, and FFQS methods. Random selection ranked last because it selects pairwise constraints from the labeled cases at random.

Fig. 7 shows how the clustering performance of the five pairwise constraint selection methods, measured in NMI, changes with the number of pairwise constraints on the eight real-world datasets. The proposed algorithm performed the best on five datasets, as shown in Figs. 7(a), 7(b), 7(c), 7(d), and 7(f). The general trend is that the results improve when more pairwise constraints are included in the semi-supervised clustering process, although the computational cost increases accordingly. The NPU method performed better than the others with 30 pairwise constraints on the Glass dataset in Fig. 7(f). The FFQS method performed better on the Ionosphere dataset, as shown in Fig. 7(g), while the NPU and min–max methods performed better than the others on the Seed dataset, as shown in Fig. 7(h). The behavior of random selection is consistent with the results in Table 6.

Another interesting observation is that the proposed pairwise constraint selection method contributed significantly to improving the clustering accuracy of the I-nice method in the semi-supervised clustering process on most of the datasets, although none of the pairwise constraint selection methods improved the clustering accuracy on the Seed dataset.
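To make the constraint-generation idea concrete, the following is a simplified, hypothetical sketch of the pipeline described in this paper: obtain initial clusters from the unlabeled data, keep only the objects closest to each cluster centre as a dense group, treat the nearest object as the most informative object and a few of its neighbours as informative objects, and convert them into must-link and cannot-link pairs. It is not the authors' implementation: k-means with a supplied k stands in for the I-nice estimation of the number of clusters, distance to the cluster centre stands in for the local density estimation, and the function name generate_constraints is our own.

```python
# Simplified, hypothetical sketch of selecting pairwise constraints from
# unlabeled data; k-means replaces I-nice and centre distance replaces the
# paper's local density estimation.
import numpy as np
from sklearn.cluster import KMeans


def generate_constraints(X, k, keep_ratio=0.7, n_informative=3, random_state=0):
    """Return (must_link, cannot_link) index pairs selected from unlabeled data."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    groups = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(X[idx] - km.cluster_centers_[c], axis=1)
        order = idx[np.argsort(dist)]
        # Drop the faraway objects to keep a dense group for this cluster.
        dense = order[: max(n_informative + 1, int(keep_ratio * len(order)))]
        # The object nearest the centre plays the role of the "most informative"
        # object; the next few dense objects play the role of "informative" objects.
        groups.append(dense[: n_informative + 1])

    must_link, cannot_link = [], []
    for g in groups:
        anchor = g[0]
        for other in g[1:]:
            must_link.append((int(anchor), int(other)))  # same-group pairs
    for i in range(k):
        for j in range(i + 1, k):
            # Most informative objects of different groups form cannot-link pairs.
            cannot_link.append((int(groups[i][0]), int(groups[j][0])))
    return must_link, cannot_link


# Example usage on synthetic 2-D data with an assumed k = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in ((0, 0), (3, 0), (0, 3))])
ml, cl = generate_constraints(X, k=3)
print(len(ml), "must-link pairs,", len(cl), "cannot-link pairs")
```

A constrained clustering algorithm such as PCK-means or MPCK-means [10,22] could then consume the returned must-link and cannot-link pairs in place of constraints derived from labels.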


We also analyzed the performance of all pairwise constraint selection methods against different evaluation criteria. The experimental results show that the proposed method achieves better clustering accuracy than the existing pairwise constraint selection methods (random, FFQS, min–max, and NPU). In terms of effectiveness, none of the existing methods can select pairwise constraints without using label information, either directly or through an active learning technique, whereas the proposed method selects pairwise constraints from unlabeled data automatically. In terms of execution time, the random selection method is evidently the fastest. The proposed method performs two consecutive tasks, namely, the estimation of initial clusters for generating label information and the selection of pairwise constraints; consequently, its execution time for estimating initial clusters from unlabeled data is greater than that of the other methods. However, the I-nice method transforms high-dimensional data into one-dimensional distance data, which considerably decreases the execution time [20].

7. Conclusions and future work

In this paper, we propose a method of selecting pairwise constraints from unlabeled data for semi-supervised clustering. The new method is directly applicable to data without any label information. We use the I-nice method to estimate the number of clusters and the initial clusters in unlabeled data. In each initial cluster, we remove the objects located far from the dense objects to obtain a dense group of objects. From these dense groups, we identify the most informative object and the informative objects to form pairwise constraints. Then, we incorporate the pairwise constraints into the semi-supervised clustering algorithm to guide the clustering process and improve the quality of the clustering results. The advantage of this method is that no label information is required for selecting pairwise constraints.

We conducted a series of experiments on both synthetic and real-world datasets to evaluate the proposed method against several competing methods. Experimental results show that the new method performed better than existing methods, which rely on label information to select pairwise constraints. In addition, incorporating the proposed pairwise constraints into the semi-supervised clustering algorithm improved the clustering performance substantially over unsupervised clustering.

In future work, we will extend this work with ensemble clustering and investigate several consensus functions to obtain a final clustering for more complex data. We will also apply this work to discover the clustering structure of high-dimensional gene expression data and of data from real-life applications, such as load profile data streams.

Acknowledgments

This research was supported by the National Natural Science Foundation of China under Grant Nos. 61473194 and 61472258 and the Shenzhen-Hong Kong Technology Cooperation Foundation, China under Grant No. SGLH20161209101100926.

References

[1] M. Śmieja, B.C. Geiger, Semi-supervised cross-entropy clustering with information bottleneck constraint, Inform. Sci. 421 (2017) 254–271, http://dx.doi.org/10.1016/j.ins.2017.07.016.
[2] Y. Yang, Z. Li, W. Wang, D. Tao, An adaptive semi-supervised clustering approach via multiple density-based information, Neurocomputing 257 (2017) 193–205.
[3] A. Hussain, E. Cambria, Semi-supervised learning for big social data analysis, Neurocomputing 275 (2018) 1662–1673.
[4] I. Davidson, K.L. Wagstaff, S. Basu, Measuring constraint-set utility for partitional clustering algorithms, in: Knowledge Discovery in Databases: PKDD 2006, Springer Berlin Heidelberg, 2006, pp. 115–126, http://dx.doi.org/10.1007/11871637_15.
[5] K.L. Wagstaff, Value, cost, and sharing: Open issues in constrained clustering, in: Knowledge Discovery in Inductive Databases: 5th International Workshop, KDID 2006, Springer Berlin Heidelberg, 2007, pp. 1–10, http://dx.doi.org/10.1007/978-3-540-75549-4_1.
[6] T.K. Hiep, N.M. Duc, B.Q. Trung, Local search approach for the pairwise constrained clustering problem, in: SoICT '16: Proceedings of the Seventh Symposium on Information and Communication Technology, ACM, Ho Chi Minh City, Vietnam, 2016, pp. 115–122.
[7] K. Wagstaff, C. Cardie, Clustering with instance-level constraints, in: Proceedings of the Seventeenth International Conference on Machine Learning, 2000, pp. 1103–1110.
[8] K. Wagstaff, C. Cardie, S. Rogers, S. Schrodl, Constrained k-means clustering with background knowledge, in: Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 577–584.
[9] S. Basu, A. Banerjee, R.J. Mooney, Semi-supervised clustering by seeding, in: Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 19–26.
[10] M. Bilenko, S. Basu, R.J. Mooney, Integrating constraints and metric learning in semi-supervised clustering, in: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 81–88.
[11] B. Settles, Active learning literature survey, Computer Sciences Technical Report 1648, University of Wisconsin–Madison, 2009.
[12] S. Basu, A. Banerjee, R.J. Mooney, Active semi-supervision for pairwise constrained clustering, in: Proceedings of the SIAM International Conference on Data Mining, 2004, pp. 333–344.
[13] P.K. Mallapragada, R. Jin, A.K. Jain, Active query selection for semi-supervised clustering, in: ICPR, IEEE Computer Society, 2008, pp. 1–4.
[14] S. Xiong, J. Azimi, X.Z. Fern, Active learning of constraints for semi-supervised clustering, IEEE Trans. Knowl. Data Eng. 26 (1) (2014) 43–54.
[15] R. Huang, W. Lam, Semi-supervised document clustering via active learning with pairwise constraints, in: Seventh IEEE International Conference on Data Mining (ICDM 2007), 2007, pp. 517–522, http://dx.doi.org/10.1109/ICDM.2007.79.
[16] C. Xiong, D. Johnson, J. Corso, Active clustering with model-based uncertainty reduction, IEEE Trans. Pattern Anal. Mach. Intell. 39 (1) (2017) 5–17, http://dx.doi.org/10.1109/TPAMI.2016.2539965.


[17] I. Khan, J.Z. Huang, K. Ivanov, Incremental density-based ensemble clustering over evolving data streams, Neurocomputing 191 (2016) 34–43.
[18] X. Cheng, R. Wang, Communication network anomaly detection based on log file analysis, in: D. Miao, W. Pedrycz, D. Ślęzak, G. Peters, Q. Hu, R. Wang (Eds.), Rough Sets and Knowledge Technology, Springer International Publishing, Cham, 2014, pp. 240–248.
[19] J. Yi, R. Jin, S. Jain, T. Yang, A.K. Jain, Semi-crowdsourced clustering: Generalizing crowd labeling by robust distance metric learning, in: Advances in Neural Information Processing Systems, Lake Tahoe, 2012, pp. 1772–1780.
[20] M.A. Masud, J.Z. Huang, C. Wei, J. Wang, I. Khan, M. Zhong, I-nice: A new approach for identifying the number of clusters and initial cluster centres, Inform. Sci. 466 (2018) 129–151, http://dx.doi.org/10.1016/j.ins.2018.07.034.
[21] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
[22] S. Basu, M. Bilenko, R.J. Mooney, A probabilistic framework for semi-supervised clustering, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2004, pp. 59–68.
[23] D. Pelleg, D. Baras, K-means with large and noisy constraint sets, in: Machine Learning: ECML 2007: 18th European Conference on Machine Learning, Warsaw, Poland, September 17–21, 2007, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 674–682.
[24] L. Chen, C. Zhang, Semi-supervised variable weighting for clustering, in: Proceedings of the 2011 SIAM International Conference on Data Mining, 2011, pp. 862–871, http://dx.doi.org/10.1137/1.9781611972818.74.
[25] Q. Xu, M. desJardins, K.L. Wagstaff, Active constrained clustering by examining spectral eigenvectors, in: Discovery Science: 8th International Conference, DS 2005, Singapore, October 8–11, 2005, Proceedings, Springer Berlin Heidelberg, 2005, pp. 294–307, http://dx.doi.org/10.1007/11563983_25.
[26] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B Stat. Methodol. 39 (1) (1977) 1–38.
[27] H. Akaike, Information theory and an extension of the maximum likelihood principle, in: B.N. Petrov, F. Csaki (Eds.), Second International Symposium on Information Theory, Akademiai Kiado, Budapest, 1973, pp. 267–281.
[28] N. Sugiura, Further analysis of data by Akaike's information criterion and the finite correction, Comm. Statist. Theory Methods 7 (1) (1978) 13–26, http://dx.doi.org/10.1080/03610927808827599.
[29] S. Mohseni, A. Fakharzade, A new local distance-based outlier detection approach for fuzzy data by vertex metric, in: 2015 2nd International Conference on Knowledge-Based Engineering and Innovation (KBEI), 2015, pp. 551–554, http://dx.doi.org/10.1109/KBEI.2015.7436104.
[30] K. Zhang, M. Hutter, H. Jin, A new local distance-based outlier detection approach for scattered real-world data, in: Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, PAKDD 2009, Bangkok, Thailand, April 27–30, 2009, Proceedings, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 813–822, http://dx.doi.org/10.1007/978-3-642-01307-2_84.
[31] H. Kurata, P. Tarazaga, The cell matrix closest to a given Euclidean distance matrix, Linear Algebra Appl. 485 (2015) 194–207.
[32] E.P. Xing, A.Y. Ng, M.I. Jordan, S. Russell, Distance metric learning, with application to clustering with side-information, in: Advances in Neural Information Processing Systems 15, MIT Press, 2002, pp. 505–512.
[33] M. Lichman, UCI machine learning repository, School of Information and Computer Sciences, University of California, Irvine, 2013. URL http://archive.ics.uci.edu/ml.
[34] J. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework, Soft Comput. 17 (2011) 255–287.
[35] L.I. Kuncheva, S. Hadjitodorov, Using diversity in cluster ensembles, in: Int'l Conf. Systems, Man and Cybernetics, Vol. 2, 2004, pp. 1214–1219.
[36] C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, New York, 2008.
[37] L. Hubert, P. Arabie, Comparing partitions, J. Classification 2 (1) (1985) 193–218, http://dx.doi.org/10.1007/BF01908075.
[38] R.L. Thorndike, Who belongs in the family?, Psychometrika 18 (4) (1953) 267–276.
[39] P.J. Rousseeuw, Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, J. Comput. Appl. Math. 20 (1) (1987) 53–65.
[40] R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic, J. R. Stat. Soc. Ser. B Stat. Methodol. 63 (2) (2001) 411–423.

Md Abdul Masud was born in November 1982. He received his M.Sc. degree from Islamic University, Kushtia, Bangladesh, in 2006, and his Ph.D. degree from Shenzhen University, Shenzhen, China, in 2018. He is a Professor at the Department of Computer Science and Information Technology, Patuakhali Science and Technology University, Patuakhali, Bangladesh. His research interests include machine learning, data mining, clustering, data stream clustering, and semi-supervised clustering algorithms. Prof. Masud has published over 25 research papers in conferences and journals.

Joshua Zhexue Huang was born in July 1959. He received his Ph.D. degree from the Royal Institute of Technology, Sweden, in June 1993. He is currently a Distinguished Professor at the College of Computer Science and Software Engineering, Shenzhen University. He is also the director of the Big Data Institute and the deputy director of the National Engineering Laboratory for Big Data System Computing Technology. His main research interests include big data technology and applications. Prof. Huang has published over 200 research papers in conferences and journals. In 2006, he received the first PAKDD Most Influential Paper Award. Prof. Huang is known for his contributions to the development of a series of k-means-type clustering algorithms in data mining, such as k-modes, fuzzy k-modes, k-prototypes, and w-k-means, which are widely cited and used, and some of which have been included in commercial software. He has extensive industry expertise in business intelligence and data mining and has been involved in numerous consulting projects in Australia, Hong Kong, Taiwan, and mainland China.


Ming Zhong is a Pengcheng Scholar and Distinguished Professor at the College of Computer Science and Software Engineering, Shenzhen University, China. His research covers the Internet of Things, cloud computing, and data mining. He has published more than 100 high-quality papers in leading conferences and journals.

Xianghua Fu received the M.Sc. degree from Northwest A&F University, Yangling, China, in 2002 and the Ph.D. degree in computer science and technology from Xi'an Jiaotong University, Xi'an, China, in 2005. He is a professor and postgraduate director at the College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China. He has led a project funded by the National Natural Science Foundation of China and several projects funded by the Science and Technology Foundation of Shenzhen City. His research interests include machine learning, data mining, information retrieval, and natural language processing.
