A review and proposal of (fuzzy) clustering for nonlinearly separable data


International Journal of Approximate Reasoning 115 (2019) 13–31


Maria Brigida Ferraro, Paolo Giordani *

Dipartimento di Scienze Statistiche, Sapienza Università di Roma, P.le Aldo Moro, 5, 00185, Rome, Italy

* Corresponding author. E-mail address: [email protected] (P. Giordani).


Article history: Received 30 January 2019; Received in revised form 6 September 2019; Accepted 9 September 2019; Available online 13 September 2019.

In many practical situations data may be characterized by nonlinearly separable clusters. Classical (hard or fuzzy) clustering algorithms produce a partition of objects by computing the Euclidean distance. As such, they are based on the linearity assumption and, therefore, do not properly identify clusters characterized by nonlinear structures. To overcome this limitation, several approaches can be followed: density-, kernel-, graph- or manifold-based clustering. A review of these approaches is offered and some new fuzzy manifold-based clustering algorithms, involving the so-called geodesic distance, are proposed. The effectiveness of such algorithms is shown on synthetic, benchmark and real data. © 2019 Elsevier Inc. All rights reserved.

Keywords: Nonlinearly separable data; Density-based clustering; Kernel-based clustering; Graph-based clustering; Manifold-based clustering; Fuzzy approach to clustering

1. Introduction

Clustering is the process of allocating objects to groups or clusters. Objects with similar features are assigned to the same clusters, whilst dissimilar objects belong to different clusters. In order to assess cluster memberships, suitable dissimilarity measures are considered. The most common choice, used in the well-known c-means (cM) algorithm [1], is the Euclidean distance. Let $\mathbf{X}$ be the data matrix of order (n × s), where n and s are the numbers of objects and variables, respectively, with generic row $\mathbf{x}_i = [x_{i1} \ldots x_{is}]$ referring to object i (i = 1, . . . , n). The cM algorithm divides the n objects into c (< n) clusters by solving the following constrained optimization problem:

$$\min_{\mathbf{U},\mathbf{H}} J_{cM} = \sum_{i=1}^{n}\sum_{g=1}^{c} u_{ig}\, d_E^2\left(\mathbf{x}_i, \mathbf{h}_g\right), \qquad (1)$$

$$\text{s.t.}\quad u_{ig} \in \{0,1\}, \quad i = 1,\ldots,n, \; g = 1,\ldots,c, \qquad (2)$$

$$\sum_{g=1}^{c} u_{ig} = 1, \quad i = 1,\ldots,n, \qquad (3)$$

where

$$d_E^2\left(\mathbf{x}_i, \mathbf{h}_g\right) = \left(\mathbf{x}_i - \mathbf{h}_g\right)^{T}\left(\mathbf{x}_i - \mathbf{h}_g\right) \qquad (4)$$


denotes the Euclidean distance between object i and prototype g. The prototypes, called centroids in the cM framework, are fictitious objects characterizing the clusters. They are stored in the prototype matrix $\mathbf{H}$ with generic row $\mathbf{h}_g = [h_{g1} \ldots h_{gs}]$ (g = 1, . . . , c). $\mathbf{U}$ denotes the allocation matrix with generic element $u_{ig}$ giving the allocation of object i to cluster g, which is equal to 1 when object i is assigned to cluster g and 0 otherwise. The row-wise sum of $\mathbf{U}$ is equal to 1, i.e., an object can be assigned to one and only one cluster. The constraints on $\mathbf{U}$ are often too restrictive in real-life applications. For this reason, the cM algorithm has been extended to a fuzzy setting [2]. The fuzzy c-means (FcM) algorithm can be expressed as:

$$\min_{\mathbf{U},\mathbf{H}} J_{FcM} = \sum_{i=1}^{n}\sum_{g=1}^{c} u_{ig}^{m}\, d_E^2\left(\mathbf{x}_i, \mathbf{h}_g\right), \qquad (5)$$

$$\text{s.t.}\quad u_{ig} \ge 0, \quad i = 1,\ldots,n, \; g = 1,\ldots,c, \qquad (6)$$

$$\sum_{g=1}^{c} u_{ig} = 1, \quad i = 1,\ldots,n, \qquad (7)$$

where U now denotes the membership degree matrix. Its generic element, $u_{ig}$, expresses the membership degree of object i to cluster g and ranges from 0 (complete non-membership) to 1 (complete membership). This is the peculiarity of the fuzzy (or soft) approach to clustering in contrast to the hard one. The fuzziness of the partition is tuned by the parameter m (> 1), whose value is usually chosen in the interval [1.5, 2]. Higher values of m lead to fuzzier membership degrees, whilst values approaching 1 produce an almost hard partition.
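As an illustration only (not taken from the paper), FcM can be run with the fclust package cited later in Section 3.2.5; the data matrix X below is a hypothetical (n × s) numeric matrix.

```r
# Minimal FcM sketch with the fclust package; k is the number of clusters and
# m the fuzziness parameter.
library(fclust)

fit <- FKM(X, k = 2, m = 2)
round(fit$U, 2)   # membership degree matrix U
fit$H             # prototype (centroid) matrix H
```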







The FcM loss function (as well as the cM one) is based on the squared Euclidean distances between the objects and the prototypes. Such a choice implicitly assumes that the clusters are spherical. Therefore, FcM may be inappropriate whenever different cluster structures are expected. If the clusters have ellipsoidal shapes, the FcM variant proposed by Gustafson and Kessel [3] can be considered. In this case, for every cluster, a symmetric and positive definite matrix $\mathbf{A}_g$ is introduced in order to properly recover the cluster shape. For this purpose, the squared Euclidean distance is replaced by a Mahalanobis-type distance:

$$d_M^2\left(\mathbf{x}_i, \mathbf{h}_g\right) = \left(\mathbf{x}_i - \mathbf{h}_g\right)^{T}\mathbf{A}_g\left(\mathbf{x}_i - \mathbf{h}_g\right). \qquad (8)$$

The Gustafson-Kessel FcM algorithm, in brief GK-FcM, can be derived by replacing (4) with (8) in (5) and by considering the additional set of constraints

$$|\mathbf{A}_g| = \rho_g, \quad g = 1,\ldots,c, \qquad (9)$$

being $\rho_g$ a prespecified positive value (usually, $\rho_g = 1$, $g = 1,\ldots,c$). Given the g-th fuzzy cluster covariance matrix,

$$\mathbf{\Sigma}_g = \frac{\sum_{i=1}^{n} u_{ig}^{m}\,(\mathbf{x}_i - \mathbf{h}_g)(\mathbf{x}_i - \mathbf{h}_g)^{T}}{\sum_{i=1}^{n} u_{ig}^{m}}, \quad g = 1,\ldots,c, \qquad (10)$$

it can be shown that $\mathbf{A}_g = \left(\rho_g\,|\mathbf{\Sigma}_g|\right)^{1/s}\,\mathbf{\Sigma}_g^{-1}$, $g = 1,\ldots,c$. A related proposal involving fuzzy cluster covariance matrices and an exponential distance is also proposed in [4]. If in (8) we set $\mathbf{A}_g = \mathbf{I}_s$, where $\mathbf{I}_s$ denotes the identity matrix of order s, $d_M^2(\mathbf{x}_i, \mathbf{h}_g)$ reduces to $d_E^2(\mathbf{x}_i, \mathbf{h}_g)$ and GK-FcM coincides with FcM. The GK-FcM algorithm and related methods widen the range of applications of standard FcM. Nevertheless, they do not fit adequately a large class of cluster topological structures. The limitations of FcM and GK-FcM are illustrated in Fig. 1, displaying the solutions obtained by means of FcM (on the left) and GK-FcM (on the right) with c = 2 clusters for three different data sets. In the top of the figure, two spherical clusters are easily visible and both FcM and GK-FcM properly detect them. Note that the performance of the methods is evaluated by considering the closest hard partition. In the middle part, we can observe two clusters with ellipsoidal shape. They are discovered by GK-FcM, whilst FcM does not produce a meaningful solution. Therefore, in this case, the cluster covariance matrices play a crucial role in determining the correct partition. In the bottom of Fig. 1 the clusters are nonlinearly separable. Neither FcM nor GK-FcM is able to recover the clusters. This occurs because the data structure is such that standard distance measures, such as the Euclidean or the Mahalanobis-type ones, are not helpful for evaluating the dissimilarities between objects in a proper way. In order to discover clusters with heterogeneous shapes, at least four strategies can be followed. The first one is represented by density-based clustering algorithms (e.g., [5]), aiming at finding groups by identifying locally dense areas in the variable space. Two strategies are based on the understanding that clusters are nonlinearly separable in the variable space. For this purpose, data are mapped into a higher dimensional space by using a nonlinear function. Since data are linearly separable in a sufficiently high dimensional space, the standard cM algorithm can then be applied. The wide class of kernel-based methods follows this line of research (e.g., [6]) and the most famous method is the Kernel c-means (e.g., [7]).


Fig. 1. The FcM and GK-FcM solutions with c = 2 clusters for three different data sets; red and blue points denote assignments to Cluster 1 and Cluster 2, respectively, by considering the closest hard partition (from the top to the bottom we have spherical clusters, ellipsoidal clusters and nonlinearly separable clusters). (For interpretation of the colours in the figure(s), the reader is referred to the web version of this article.)

An alternative strategy consists in looking at the data as the nodes of a graph. In this case, the goal is to divide the nodes into clusters by taking into consideration the edge structure of the graph. Spectral clustering (e.g., [8]) is a well-known graph-based method. A closely related, but different, strategy assumes that the data lie on a nonlinear manifold. In this case, clusters can be determined by building a graph of data points, computing dissimilarities among objects by means of the so-called geodesic distance and applying a clustering method based on such a distance (e.g., [9]). The previous classification is not exhaustive. For instance, an additional strategy consists in clustering around principal curves. A principal curve can be defined as a smooth, curvilinear summary of the data. In [10], a probabilistic clustering model and variants for noisy data are introduced where c principal curves are detected; every principal curve identifies a cluster. See, for further details, [10] and references therein. In this paper we review the four above-mentioned lines of research and propose a new class of clustering techniques based on the geodesic distance. The paper is organized as follows. Section 2 contains a review of the first three strategies, namely density-, kernel- and graph-based clustering algorithms. In Section 3, the fourth strategy, based on the concept of manifold, is reviewed and some new clustering procedures for nonlinearly separable data are proposed. Their effectiveness, compared with that of their closest competitors, is analyzed in Section 4 by considering synthetic, benchmark and real-life data. Some final remarks in Section 5 conclude the paper.

2. Density-, kernel- and graph-based clustering algorithms: a review

2.1. Density-based clustering algorithms

The idea underlying density-based clustering tools is quite intuitive. They look for areas where objects are concentrated. Such areas detect clusters, provided that they are separated by other areas that are empty or sparse.


Their number is not specified by the user: the method finds the number of clusters automatically. The most famous procedure is the DBSCAN algorithm, originally proposed in [5]. DBSCAN is the acronym of Density Based Spatial Clustering of Applications with Noise. The clusters are found according to the concepts of maximality and connectivity and to a formal definition of noise objects. Let ε be the distance radius and p the minimum number of objects within the ε-neighbourhood of a given object; p thus represents the minimum number of points that constitute a valid cluster. These two parameters allow the objects to be distinguished into three classes. Object i is a core point if there are at least p objects i' (≠ i) such that $d_E^2(\mathbf{x}_i, \mathbf{x}_{i'}) < \varepsilon$. Note that the distance is not necessarily the Euclidean one, but this is the most common choice. All the core points belong to a certain cluster. If two core points i and i' are such that $d_E^2(\mathbf{x}_i, \mathbf{x}_{i'}) < \varepsilon$, then i and i' are directly density-reachable and belong to the same cluster. The remaining objects may or may not belong to the clusters. To assess this, one should check whether such remaining objects are border points or noisy points. Object i' is a border point when there exists at least one core point i such that $d_E^2(\mathbf{x}_{i'}, \mathbf{x}_i) < \varepsilon$. In this case i' is directly density-reachable from i and i' is assigned to the same cluster as i. Border point i' is usually not directly density-reachable from all the core points belonging to the same cluster. However, it is connected with all the core points by a path of points that are pairwise directly density-reachable. For instance, if $d_E^2(\mathbf{x}_{i'}, \mathbf{x}_{i''}) \ge \varepsilon$ and i' and i'' belong to the same cluster, then i'' is density-reachable (but not directly density-reachable) from i'. Therefore, DBSCAN finds clusters such that all the objects/points in the same cluster are (directly) density-connected. This highlights that some objects may not be assigned to the clusters. They are noisy points and are such that all their distances with respect to the core points of the clusters are not lower than ε. The DBSCAN algorithm has been investigated and extended along various directions in order to improve its performance. For instance, variants of DBSCAN have been proposed in order to prevent the DBSCAN solution from depending on the ordering of the analyzed objects. This may occur when there are border points reachable from more than one cluster and, thus, the objects are concentrated in areas quite close to each other. The problem is solved by the so-called DBSCAN* procedure [11], where border points are managed as outliers. An alternative strategy is followed in [12]. Computational improvements for speeding up the algorithm, especially for large-size problems, have been developed by several authors (see, e.g., [13,14]). The performance of DBSCAN is affected by clusters with different densities. For this purpose, an improved version of DBSCAN, called OPTICS, can be used [15]. For more details on DBSCAN and, in general, on density-based clustering algorithms, one can see [18,19]. Implementations of DBSCAN in the R programming language can be found in, e.g., the packages dbscan [16] and fpc [17]. In order to apply DBSCAN, a crucial choice concerns the parameters ε and p: wrong choices may lead to poor results and an objective selection procedure is not available. The value of p depends on the expected amount of noise in the data; in particular, a large value should be chosen for noisy data. A minimal usage sketch is given below.
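The following sketch is not taken from the paper: it merely illustrates the workflow just described with the dbscan package cited above, assuming a hypothetical data matrix X; the value of ε is illustrative and, in practice, is read off the k-distance plot discussed next.

```r
# DBSCAN sketch with the dbscan package; X is a hypothetical (n x s) numeric
# data matrix and eps = 0.15 is an illustrative value only.
library(dbscan)

p <- 2 * ncol(X)            # rule of thumb p = 2s for the minimum neighbourhood size
kNNdistplot(X, k = p)       # k-distance plot: look for a knee to choose eps
res <- dbscan(X, eps = 0.15, minPts = p)
table(res$cluster)          # cluster labels; 0 denotes noise points
```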
Some preliminary knowledge of the data is therefore needed. In practice, some rules of thumb are usually considered, such as $p = \ln(n)$ or $p = 2s$ [20], the latter probably being the most common choice. Once p has been set, the k-distance plot is usually adopted for selecting ε. In the k-distance plot, the average distances of every object to its k = p nearest neighbours are displayed in ascending order. The value of ε is chosen in correspondence with a knee in the plot. The idea is that the knee reveals a threshold for the distances: larger distances refer to the noise objects, whilst the remaining ones refer to the objects belonging to the clusters. Further details on parameter selection can be found in [21]. The DBSCAN algorithm and its generalizations produce a hard partition of the objects. In the literature, there exist some fuzzy versions of DBSCAN. Their motivation is related to the difficulties in setting ε and p. The concept of fuzzy neighbourhood is introduced in [22] by fuzzifying the parameter ε. This allows a fuzzy membership function to be associated with the distance between an object and its neighbours, and whether an object is a core point can then be evaluated by taking such membership functions into account. The fuzzification of p is proposed in [23], whilst a fuzzy version of DBSCAN where both parameters are fuzzified can be found in [24]. For more details on fuzzy extensions of DBSCAN, see also [25].

2.2. Kernel-based clustering algorithms

A rather different approach is represented by kernel-based algorithms. The basic idea is that any data set is linearly separable in a sufficiently high dimensional space. Therefore, if the Euclidean distance does not allow the cluster structure to be properly recovered in the original space, it is convenient to map the data into a larger space where the Euclidean distance is meaningful. In such a space, a standard clustering method is applied. This can be done by a computational shortcut allowing us to represent linear patterns efficiently in high dimensional spaces. The shortcut relies on a nonlinear map $\phi: \mathbb{R}^s \rightarrow \mathbb{R}^t$, with t > s, converting the nonlinear relations into linear ones, together with the so-called kernel function introduced below. For instance, a popular choice is the polynomial mapping. When s = 2, it is

$$\phi(\mathbf{x}_i) = \left(x_{i1}^2,\; \sqrt{2}\,x_{i1}x_{i2},\; x_{i2}^2\right), \qquad (11)$$

with $\mathbf{x}_i = [x_{i1}\; x_{i2}]$. Thus, the high dimensional space has t = 3 dimensions. In the new space, we can compute the squared Euclidean distance between objects i and i'. We have:

$$d_E^2\left(\phi(\mathbf{x}_i), \phi(\mathbf{x}_{i'})\right) = \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_i) - 2\,\phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_{i'}) + \phi(\mathbf{x}_{i'})^{T}\phi(\mathbf{x}_{i'}). \qquad (12)$$

In the previous formula we obtain the kernel function $\kappa(\mathbf{x}_i, \mathbf{x}_{i'})$. In fact, it is

$$\kappa(\mathbf{x}_i, \mathbf{x}_{i'}) = \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_{i'}) \qquad (13)$$

and it is easy to see that

$$\kappa(\mathbf{x}_i, \mathbf{x}_{i'}) = \left(\mathbf{x}_i^{T}\mathbf{x}_{i'}\right)^2. \qquad (14)$$

Formulas (13) and (14) show that the kernel function arises from the mapping φ. However, it can be computed more efficiently in terms of the original data, without evaluating φ. Therefore, the high dimensional representation can be by-passed. See, for a comprehensive introduction to kernel methods, [6,26].
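A toy numerical check of this shortcut for the polynomial case, not taken from the paper (the two vectors are arbitrary):

```r
# Kernel trick for the polynomial mapping in (11): the inner product of the
# mapped vectors equals the kernel (14) computed in the original space.
phi <- function(x) c(x[1]^2, sqrt(2) * x[1] * x[2], x[2]^2)

xi <- c(1.2, -0.7)
xj <- c(0.3,  2.1)

sum(phi(xi) * phi(xj))   # explicit mapping, then inner product
sum(xi * xj)^2           # kernel function (14): same value, no mapping needed
```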

In the clustering framework, the most widely used kernel method is probably the kernel c-means (KcM) algorithm [7]. Taking into account (1), the KcM loss function can be written as

$$\min_{\mathbf{U},\mathbf{H}} J_{KcM} = \sum_{i=1}^{n}\sum_{g=1}^{c} u_{ig}\, d_E^2\left(\phi(\mathbf{x}_i), \mathbf{h}_g\right), \qquad (15)$$

where the prototype $\mathbf{h}_g$ belongs to the high dimensional space. The loss function in (15) is minimized subject to the constraints in (2) and (3). It can be shown that

$$d_E^2\left(\phi(\mathbf{x}_i), \mathbf{h}_g\right) = \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_i) - \frac{2\sum_{i' \in \pi_g} \phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_{i'})}{|\pi_g|} + \frac{\sum_{i' \in \pi_g}\sum_{i'' \in \pi_g} \phi(\mathbf{x}_{i'})^{T}\phi(\mathbf{x}_{i''})}{|\pi_g|^2}, \qquad (16)$$

where $\pi_g$ denotes the set of objects assigned to cluster g. Thus, the Euclidean distances between the objects and the prototypes are expressed in terms of a square kernel matrix $\mathbf{K}$ with generic element $k_{ii'}$ given in (13) ($i, i' = 1, \ldots, n$). It is interesting to note that the prototypes are obtained as the means of the transformed objects assigned to the cluster. The KcM method is implemented in the R programming language in the package kernlab [27]. The fuzzy extension of the KcM algorithm can be easily derived. Specifically, bearing in mind (5) and (15), the fuzzy kernel c-means (FKcM) algorithm minimizes

$$\min_{\mathbf{U},\mathbf{H}} J_{FKcM} = \sum_{i=1}^{n}\sum_{g=1}^{c} u_{ig}^{m}\, d_E^2\left(\phi(\mathbf{x}_i), \mathbf{h}_g\right), \qquad (17)$$

subject to the constraints in (6) and (7). A non-exhaustive list of papers devoted to kernel methods in the fuzzy clustering framework can be found in [28–31]. The performance of kernel clustering methods is strongly connected with the choice of the kernel function. Previously, for the sake of simplicity, we introduced the polynomial one. However, there exist several kernel functions (for instance, Gaussian, hyperbolic tangent, ANOVA, spline, etc.). Such alternatives are discussed in detail in [6]. The choice of the kernel function is data-dependent. The selected one should help to capture the existing dissimilarities among objects. A wrong choice of the kernel function may hide the data taxonomy, leading to a partition even worse than that obtained by applying a standard method, e.g., cM or FcM.

2.3. Graph-based clustering algorithms

A graph is a structure formed by a set of nodes connected by a set of edges. In particular, each edge connects a pair of nodes. The nodes represent the objects and the edges indicate the measures of similarity between pairs of objects. A graph is weighted when the edge connecting two nodes is weighted by the corresponding similarity. Denoting by $s_{ii'}$ the similarity between nodes/objects i and i', it is $s_{ii'} \ge 0$ (non-negativity) and $s_{ii'} = s_{i'i}$ (symmetry, i.e. the graph is undirected). Moreover, if i = i', then $s_{ii'}$ takes the maximum value. When $s_{ii'} = 0$, we can conclude that nodes i and i' are not connected. There exist several ways to define the similarity between pairs of objects. A common choice is given by

$$s_{ii'} = \exp\left\{-\frac{\|\mathbf{x}_i - \mathbf{x}_{i'}\|^2}{2\sigma^2}\right\}, \qquad (18)$$

where ‖·‖ denotes the Euclidean norm and σ is a parameter to be set in advance. A graph can be seen as an intuitive way of representing a set of objects described in terms of their similarities. The problem of detecting groups of homogeneous objects can be expressed in terms of the graph. In particular, the clustering process consists in partitioning the graph in such a way that highly connected nodes, i.e., nodes whose connecting edge has a large weight, belong to the same cluster. This highlights that the edges play the most relevant role in the clustering phase. In fact, the goal of graph-based clustering algorithms is to find a partition where the edges between nodes assigned to different clusters have low weights and those between nodes assigned to the same cluster have large weights. In other words, the objects with low similarities belong to different clusters, whilst those with large similarities belong to the same cluster. For an introduction to graph-based clustering, see, e.g., [32].


In the literature, several clustering methods based on graph theory are available. Spectral clustering (e.g., [8,33]) is probably the most famous one. It is based on the so-called Laplacian matrix, say $\mathbf{L}$. Letting $\mathbf{S}$ be the matrix of order (n × n) with generic element $s_{ii'}$, the Laplacian matrix can be defined as

$$\mathbf{L} = \mathbf{D} - \mathbf{S}, \qquad (19)$$

where $\mathbf{D}$ is the diagonal matrix of order n with main diagonal elements equal to the degrees of the nodes, defined as $d_i = \sum_{i'=1}^{n} s_{ii'}$. The Laplacian matrix is usually normalized. To do it, several approaches can be followed. Among them, we mention

$$\mathbf{L}_N = \mathbf{D}^{-\frac{1}{2}}\mathbf{L}\mathbf{D}^{-\frac{1}{2}} = \mathbf{I}_n - \mathbf{D}^{-\frac{1}{2}}\mathbf{S}\mathbf{D}^{-\frac{1}{2}}, \qquad (20)$$

being $\mathbf{I}_n$ the identity matrix of order n. Spectral clustering amounts to performing a standard clustering algorithm on the c eigenvectors corresponding to the c smallest eigenvalues of $\mathbf{L}_N$. Note that, before running the clustering algorithm, these eigenvectors are normalized. In detail, if $\mathbf{T}$ is the matrix of order (n × c) containing the c eigenvectors of $\mathbf{L}_N$ in its columns, the rows of $\mathbf{T}$ are normalized in such a way that their norm is equal to one. The reviewed spectral clustering has been discussed in [8], where the standard cM algorithm is applied. Different versions of spectral clustering, involving different definitions of the Laplacian matrix, can be found (see, e.g., [34]). The common key point of these variants is that the new representation of the data by means of the eigenvectors of the Laplacian matrix allows the cluster structure to be detected. This resembles the kernel-based approach, where standard clustering algorithms are used on data mapped into a new (higher dimensional) space. The relationship between spectral clustering and kernel clustering is discussed in [7]. Spectral clustering is usually performed by running cM. As such, it produces a hard partition. However, fuzzy spectral clustering algorithms can be trivially developed: it is sufficient to replace the cM algorithm with the FcM one when analyzing the normalized eigenvectors of $\mathbf{L}_N$. Spectral clustering is implemented in, e.g., the R package kernlab [27]; a minimal sketch of the procedure is also given at the end of this subsection. Another graph-based class of clustering algorithms is based upon the concept of minimum spanning tree (MST) [35]. Given an undirected weighted graph, an MST is a new undirected weighted graph obtained by considering a subset of edges of the original graph such that the following conditions hold: the connectivity among the nodes is preserved, there are no nodes connected in a closed chain (i.e., no cycles) and the sum of the edge weights is as small as possible. There are several algorithms for building an MST. In the clustering framework, the MST theory can be used in order to find c subtrees, each of them identifying a cluster. Thus, the clusters are subsets of connected nodes of a graph. The partition is found by iteratively combining the clusters containing the two closest nodes, adding an edge between them, until c clusters are determined. In this respect, the procedure resembles the single-linkage method in hierarchical clustering. Software for MST clustering is available in, e.g., the R package vegan [36]. The so-called CLICK algorithm [37] represents a further alternative tool. Also in this case, the clusters are given by the most highly connected subsets of nodes of a graph. The peculiarity of the proposal relies on the fact that the edge weights are based on probabilistic assumptions.
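As a complement to the review above, the following minimal sketch (not from the paper) reproduces the spectral clustering steps in base R for c = 2 clusters, assuming a hypothetical data matrix X and an arbitrary bandwidth σ for the similarity in (18); kernlab::specc offers a ready-made implementation.

```r
# Spectral clustering sketch: Gaussian similarity (18), normalized Laplacian
# (20), embedding on the eigenvectors of the c smallest eigenvalues, then cM.
sigma <- 0.5
S  <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))   # similarity matrix
d  <- rowSums(S)                                   # node degrees
Ln <- diag(nrow(S)) - diag(1 / sqrt(d)) %*% S %*% diag(1 / sqrt(d))
ev <- eigen(Ln, symmetric = TRUE)                  # eigenvalues in decreasing order
Tm <- ev$vectors[, (nrow(S) - 1):nrow(S)]          # eigenvectors of the 2 smallest eigenvalues
Tm <- Tm / sqrt(rowSums(Tm^2))                     # row-wise normalization
cl <- kmeans(Tm, centers = 2)$cluster              # hard partition (replace with FcM for a fuzzy one)
```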
3. Manifold-based clustering algorithms: a review and proposal

In this section, we introduce the concept of manifold and we discuss how to use it for clustering purposes. Later, a new class of manifold-based clustering algorithms is proposed.

3.1. Manifold-based clustering algorithms: a review

A different strategy to handle nonlinearly separable data is to assume that the objects lie on a manifold. A manifold is a topological space that is locally linear. In general, any entity which is nearly flat on small scales is a manifold. An example is represented by the Earth, which is spherical but looks flat on the human scale. The crucial point is therefore to map the observed data, embedded in a higher dimensional space, to a lower dimensional manifold. Standard methods are inadequate to represent the manifold topology. Specifically, given two objects lying on the manifold, standard methods based on the Euclidean distance fail because they are not able to compare the objects properly. Owing to the local linearity, only neighbouring objects belonging to the manifold can be compared through the Euclidean distance. In the case of two objects far from each other, their straight-line Euclidean distance is no longer helpful. A different distance, reflecting the underlying global geometry of the manifold, is needed. This is the so-called geodesic distance: it compares two objects by computing their distance measured along the manifold. From a practical point of view, unless the geodesic distance among objects is available, it can be approximated by the shortest-path distance. Of course, when two objects are neighbours, the shortest-path distance boils down to the Euclidean one, consistently with the local linearity. In this respect, some connections with graph theory arise. However, within this framework, the topological structure of the manifold plays the more relevant role in the process of acquiring knowledge. There are several algorithms for computing the geodesic distances among objects. The most common ones are the Dijkstra [38] and the Floyd-Warshall [39,40] algorithms. The latter, adopted in our proposal, will be recalled in the next subsection. For more details on data manifolds and geodesic distances, refer to [41]. A wider and comprehensive overview of nonlinear data reduction methods can be found in [42].


In the domain of clustering, the above-described framework can be adapted. Generally speaking, once the observed data are recognized to belong to a manifold and the related geodesic distances are computed, standard clustering algorithms can be applied to the manifold data. Several papers devoted to manifold-based clustering algorithms are available. In [43] an adaptation of the hard cM algorithm involving the geodesic distance is developed. A famous extension of the cM algorithm is the c-medoids (cMed) algorithm [44,45]. In contrast to cM, the cMed algorithm chooses observed objects as prototypes (called medoids) in order to characterize the clusters. The cMed algorithm employing the geodesic distance is proposed in [46,47]. Other interesting works can be found in, e.g., [48,49]. All the previous papers follow the hard approach to clustering. Some works devoted to fuzzy clustering methods for manifold data have also been proposed. In [50], a variant of FcM for nonlinearly separable data is developed. This is done by means of the Isomap procedure [41], which represents a nonlinear extension of classical multidimensional scaling (MDS, see, e.g., [51]) preserving the intrinsic geometry of the data as captured by the geodesic manifold distances among objects. In doing so, attention should be paid to the selection of the neighbourhood. As discussed in [52], a wrong selection may alter the geodesic distance matrix, affecting the clustering result. This problem is usually known as the risk of short-circuit errors. Isomap, as well as any MDS-based technique, projects the objects onto a low dimensional space. The idea in [50] is to apply FcM to the coordinates of the objects on such a low dimensional space. The Isomap step is needed because of the difficulties in determining the centroids when minimizing the FcM loss function in (5) upon replacing the Euclidean distances with the geodesic ones. As noted by the same authors, the Isomap projection is the bottleneck of the proposal. For this reason, they also suggest using the fuzzy c-medoids (FcMed) algorithm [53]. In their proposal, deeply investigated in [54], an extension of FcMed involving the geodesic distance is considered. It represents the fuzzy counterpart of [47]. Furthermore, a method mixing the kernel- and manifold-based approaches is given in [55].

3.2. Manifold-based clustering algorithms: a proposal

From the review of the available papers on fuzzy clustering for manifold data, it should be clear that they provide a fuzzy partition of the objects by-passing the computation of the centroids. It seems that both the centroids on the Isomap subspace and the medoids are merely computational tools for deriving the membership degrees. For this reason, it is more natural to use fuzzy clustering methods for relational data, i.e., data coming from measures of dissimilarity. See [56,57] and references therein for more details on fuzzy relational clustering. In this section we introduce a class of geodesic fuzzy relational clustering algorithms. In doing so, a robust clustering method [58] is also introduced by considering the concept of noise cluster [59]. The input of any relational clustering algorithm is a distance/dissimilarity matrix D of order (n × n) with generic element d(i, i') expressing the distance/dissimilarity between objects i and i'. Bearing in mind the nonlinear structure of the data, the matrix D should be based on the geodesic distances among objects.
If not available, they can be estimated by the shortest-path distances of the graph with edges connecting neighbouring objects. In order to connect the objects, two alternative approaches can be followed: the k-neighbouring and the ε-neighbouring approaches. They differ in the way the neighbours of every object are determined. The k-neighbouring approach consists in connecting by an edge every object to its k nearest neighbours. In the ε-neighbouring approach, the number of neighbours is not fixed a priori: each object is connected to all the other objects lying within an ε-radius. In both cases, the Euclidean distance can be used to compute the length of the edges, taking into account the local linearity. Once the neighbours are determined, the geodesic distances can be estimated by the shortest-path distances. For this purpose, the Floyd-Warshall algorithm [39,40] is used. Let $N_i$ and $d_G(i, i')$ be the set of neighbours of object i and the estimated geodesic distance between i and i', respectively. The algorithm can be summarized as follows.

Floyd-Warshall algorithm
1. For each object $i' \in N_i$, set the edge length between i and i' equal to $d_E(i, i')$.
2. Set

$$d_G(i, i') = \begin{cases} d_E(i, i') & \text{if } i \text{ and } i' \text{ are linked by an edge,} \\ +\infty & \text{otherwise.} \end{cases} \qquad (21)$$

3. For each $i'' \in \{1, \ldots, n\}$, if $d_G(i, i') > d_G(i, i'') + d_G(i'', i')$, then recursively set

$$d_G(i, i') = d_G(i, i'') + d_G(i'', i'). \qquad (22)$$

The Floyd-Warshall algorithm produces a distance matrix D such that

$$d(i, i') = \begin{cases} +\infty & \text{if there is no path between } i \text{ and } i', \\ d_E(i, i') = d_G(i, i') & \text{if } i \text{ and } i' \text{ are neighbours,} \\ d_G(i, i') & \text{otherwise.} \end{cases} \qquad (23)$$

For computational reasons, in practice we suggest replacing +∞ with an arbitrarily large value, e.g., $10^8$. The computation of the geodesic distances by means of the Floyd-Warshall algorithm has time complexity $O(n^3)$.
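As an illustration, the whole construction just described (neighbour graph plus shortest paths) can be sketched in a few lines of R. This is not the paper's implementation, which is given in its Appendix; geodesic_dist is a hypothetical helper assuming the k-neighbouring rule.

```r
# Estimated geodesic distances via a k-nearest-neighbour graph followed by the
# Floyd-Warshall relaxation; 'big' plays the role of +infinity.
geodesic_dist <- function(X, k = 5, big = 1e8) {
  dE <- as.matrix(dist(X))                  # Euclidean distances
  n  <- nrow(dE)
  dG <- matrix(big, n, n)
  diag(dG) <- 0
  for (i in 1:n) {                          # edges to the k nearest neighbours
    nb <- order(dE[i, ])[2:(k + 1)]
    dG[i, nb] <- dE[i, nb]
    dG[nb, i] <- dE[i, nb]                  # keep the graph undirected
  }
  for (h in 1:n)                            # Floyd-Warshall: O(n^3)
    for (i in 1:n)
      for (j in 1:n)
        if (dG[i, j] > dG[i, h] + dG[h, j]) dG[i, j] <- dG[i, h] + dG[h, j]
  dG
}
```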


3.2.1. Fuzzy relational clustering for manifold data

Letting D be the distance matrix with generic element given in (23), we formulate the clustering problem in terms of the algorithms introduced in [60]. This leads to the fuzzy relational clustering for manifold data (FRCMD). It discovers a fuzzy partition of the n objects into c clusters by considering the following minimization problem:

$$\min_{\mathbf{U}} J_{FRCMD} = \sum_{g=1}^{c} \frac{\sum_{i=1}^{n}\sum_{i'=1}^{n} u_{ig}^{m}\, u_{i'g}^{m}\, d(i,i')}{2\sum_{i=1}^{n} u_{ig}^{m}}, \qquad (24)$$

$$\text{s.t.}\quad u_{ig} \ge 0, \quad i = 1,\ldots,n, \; g = 1,\ldots,c, \qquad (25)$$

$$\sum_{g=1}^{c} u_{ig} = 1, \quad i = 1,\ldots,n. \qquad (26)$$

The optimal solution can be found by means of Lagrange multipliers and the Kuhn-Tucker conditions. We thus get the Lagrangian function

$$\mathcal{L} = J_{FRCMD} - \sum_{i=1}^{n} \lambda_i \left(\sum_{g=1}^{c} u_{ig} - 1\right) - \sum_{i=1}^{n}\sum_{g=1}^{c} \psi_{ig}\, u_{ig}, \qquad (27)$$

where $\lambda_i$ and $\psi_{ig}$ denote the Lagrange multipliers. It can be shown [60] that the optimal membership degrees are

$$u_{ig} = \begin{cases} \dfrac{1/b_{ig}}{\sum_{g' \in I_i^{+}} 1/b_{ig'}} & \text{if } g \in I_i^{+}, \\[2mm] 0 & \text{if } g \in I_i^{-}, \end{cases} \qquad (28)$$

with

$$b_{ig} = a_{ig}\, u_{ig}^{m-2}, \qquad (29)$$

$$a_{ig} = m\left[\frac{\sum_{i'=1}^{n} u_{i'g}^{m}\, d(i,i')}{\sum_{i'=1}^{n} u_{i'g}^{m}} - \frac{\sum_{i'=1}^{n}\sum_{i''=1}^{n} u_{i'g}^{m}\, u_{i''g}^{m}\, d(i',i'')}{2\left(\sum_{i'=1}^{n} u_{i'g}^{m}\right)^{2}}\right], \qquad (30)$$

$$I_i^{-} = \left\{ g : \frac{1/b_{ig}}{\sum_{g'=1}^{c} 1/b_{ig'}} \le 0 \right\}, \qquad (31)$$

$$I_i^{+} = \left\{ g : \frac{1/b_{ig}}{\sum_{g'=1}^{c} 1/b_{ig'}} > 0 \right\}. \qquad (32)$$

The following iterative algorithm can be implemented for finding the optimal FRCMD solution (γ denotes the non-negative convergence criterion, e.g., $10^{-4}$).

FRCMD algorithm
1. Randomly initialize $\mathbf{U}^{(t)}$ with t = 0 in such a way that the constraints in (25) and (26) are fulfilled.
2. Compute $b_{ig}$ ($i = 1,\ldots,n$, $g = 1,\ldots,c$) according to (29) using (30), considering $u_{i'g}^{(t+1)}$ if $i' < i$ and $u_{i'g}^{(t)}$ if $i' \ge i$.
3. Update $u_{ig}^{(t+1)}$ ($i = 1,\ldots,n$, $g = 1,\ldots,c$) according to (28), using (31) and (32).
4. If $\sum_{i=1}^{n}\sum_{g=1}^{c} |u_{ig}^{(t+1)} - u_{ig}^{(t)}| < \gamma$, the algorithm has converged; otherwise set t := t + 1 and go to Step 2.


3.2.2. Robust fuzzy relational clustering for manifold data

The performance of FRCMD may be affected by the presence of outliers. In principle, the elements of D reported in (23) implicitly detect outliers: this occurs when the distances of an object with respect to the other ones tend to +∞. It would be desirable for the outliers to belong to the clusters with membership degrees close to zero. However, this is not possible owing to the constraints in (26). For this purpose, we introduce a noise-resistant version of FRCMD by using the concept of noise cluster [59]. It is an extra cluster such that the membership degree of an object to the noise cluster increases with its distance to the c regular clusters. In this way, the influence of the outliers on the regular clusters is reduced. The noise cluster is not a cluster in a strict sense. In fact, the objects with high membership degrees to the noise cluster are not necessarily similar according to the chosen metric. In order to formally define the noise cluster, Davé [59] proposes to consider a prototype which is equidistant from all the objects. However, in the relational clustering framework, this definition does not make sense because the obtained clusters are prototype-less. In this connection, a new interpretation of the noise cluster is suggested in [60]. It is defined as the cluster such that all the dissimilarities between pairs of objects take the same value. This value is usually given by



$$\delta = \lambda \left[\frac{\sum_{i=1}^{n}\sum_{i'=1}^{n} d(i,i')}{n^2}\right]. \qquad (33)$$

The constant dissimilarity δ can be used to modify the FRCMD loss function in such a way as to assign the outliers to the noise cluster. This is done in the robust FRCMD (RFRCMD) algorithm. It is formulated as

$$\min_{\mathbf{U}} J_{RFRCMD} = \sum_{g=1}^{c} \frac{\sum_{i=1}^{n}\sum_{i'=1}^{n} u_{ig}^{m}\, u_{i'g}^{m}\, d(i,i')}{2\sum_{i=1}^{n} u_{ig}^{m}} + \frac{\sum_{i=1}^{n}\sum_{i'=1}^{n} u_{i*}^{m}\, u_{i'*}^{m}\, \delta}{2\sum_{i=1}^{n} u_{i*}^{m}}, \qquad (34)$$

$$\text{s.t.}\quad u_{ig} \ge 0, \quad i = 1,\ldots,n, \; g = 1,\ldots,c, \qquad (35)$$

$$\sum_{g=1}^{c} u_{ig} \le 1, \quad i = 1,\ldots,n, \qquad (36)$$

being $u_{i*}$ the membership degree of object i to the noise cluster. By comparing the loss functions of FRCMD in (24) and RFRCMD in (34), we can see that the latter contains an additional term concerning the noise cluster. One more difference is visible by looking at (36): for each object, the sum of the membership degrees to the regular clusters is no longer required to equal one. The complement of this sum,

$$u_{i*} = 1 - \sum_{g=1}^{c} u_{ig}, \qquad (37)$$

gives the membership degree of object i to the noise cluster. Therefore, for every object, the sum of the membership degrees to the standard clusters and to the noise cluster is equal to one. As for FRCMD, the computation of the optimal RFRCMD solution is based on Lagrange multipliers and the Kuhn-Tucker conditions. In this case, the Lagrangian function takes the form

$$\mathcal{L} = J_{RFRCMD} - \sum_{i=1}^{n} \lambda_i \left(\sum_{g=1}^{c} u_{ig} + u_{i*} - 1\right) - \sum_{i=1}^{n}\sum_{g=1}^{c} \psi_{ig}\, u_{ig} - \sum_{i=1}^{n} \psi_{i*}\, u_{i*}, \qquad (38)$$

where $\psi_{i*}$ is a new Lagrange multiplier. With respect to (27), (38) is extended by considering the noise cluster. From a computational point of view, the RFRCMD algorithm is very similar to the FRCMD one. More specifically, the optimal membership degrees are found by using (28), and an iterative algorithm similar to the FRCMD one can be implemented provided that (31) and (32) are replaced by

$$I_i^{-} = \left\{ g : \frac{1/b_{ig}}{\sum_{g'=1}^{c} 1/b_{ig'} + 1/b_{i*}} \le 0 \right\}, \qquad (39)$$


$$I_i^{+} = \left\{ g : \frac{1/b_{ig}}{\sum_{g'=1}^{c} 1/b_{ig'} + 1/b_{i*}} > 0 \right\}, \qquad (40)$$

in order to take into account the noise cluster. Note also that $b_{i*}$ can be determined by using (29), replacing $a_{ig}$ with $a_{i*}$ which, from (30), simplifies to $\frac{m\,\delta}{2}$. See, for further details, [60].

3.2.3. Selection of c and k or ε

The fuzzy partition obtained by FRCMD or RFRCMD depends on the choice of the number of clusters, c, and of the neighbouring parameter k or ε. A general rule for deciding between the k-neighbouring and the ε-neighbouring approaches cannot be given. In the next section, we study how the FRCMD and RFRCMD solutions depend on the chosen approach. Once the neighbouring approach is chosen, the optimal solution of FRCMD or RFRCMD can be found by comparing the solutions obtained over a grid of values for c and k or ε. In the next section, we also investigate this point by checking the sensitivity of the obtained partition with respect to modifications of k or ε.

3.2.4. Computational issues

The time complexity of FRCMD is $O(n^2)$, as is usually the case for (fuzzy) relational clustering algorithms. When running FRCMD, it is recommended to consider more than one random start for the membership degree matrix in order to limit the risk of hitting local minima. The solution corresponding to the lowest value of the loss function is the one to be interpreted. The computational behaviour of FRCMD will be further analyzed in the next section.

3.2.5. Software

The FRCMD and RFRCMD algorithms can be easily run in the R programming language. The code implementing the Floyd-Warshall algorithm is given in the Appendix. Its output is the distance matrix to be used as input to the functions NEFRC for FRCMD and NEFRC.noise for RFRCMD of the package fclust [61,62].
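Putting the pieces together, a minimal end-to-end sketch might look as follows. It is not taken from the paper (whose Appendix contains its own Floyd-Warshall code), it reuses the hypothetical geodesic_dist helper sketched in Section 3.2, and the exact argument names of NEFRC and NEFRC.noise should be checked against the fclust documentation.

```r
# FRCMD and RFRCMD through the fclust functions mentioned above; X is a
# hypothetical numeric data matrix and k = 5, c = 2, m = 2 are arbitrary.
library(fclust)

D    <- geodesic_dist(X, k = 5)        # estimated geodesic distance matrix
frc  <- NEFRC(D, k = 2, m = 2)         # FRCMD
rfrc <- NEFRC.noise(D, k = 2, m = 2)   # RFRCMD with the noise cluster
round(frc$U, 2)                        # membership degrees to the regular clusters
```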
4. Results

In this section we report the results of the applications of FRCMD and RFRCMD and of their closest competitors to synthetic, benchmark and real data. Among the competitors, we considered DBSCAN by using the function dbscan of the package dbscan [16], setting ε according to the k-distance plot. Moreover, KcM and spectral clustering were analyzed by using the functions kkmeans and specc of the package kernlab [27]. Different kernels, specified by the option kernel of the function kkmeans, were used for KcM. Spectral clustering was run by using the similarity in (18). As far as we know, R functions implementing the fuzzy versions of KcM and spectral clustering are not available. However, such methods were considered in our comparison by running the function FKM of fclust [61,62], implementing FcM. With respect to FKcM, FKM was run on the mapped data according to the polynomial kernel in (11). Concerning fuzzy spectral clustering, FKM was run on the normalized eigenvectors found following the procedure described in Section 2.3. Furthermore, the performance of MST clustering was assessed by using the function spantree of the package vegan [36]. Finally, regarding FRCMD and RFRCMD, the geodesic distance matrix was built according to the k- and ε-neighbouring approaches for different values of k and ε, respectively, in order to assess the sensitivity of the solutions to these parameters. In particular, k took values in the set {2, . . . , 40}. The values of ε were chosen by considering the first 20 quantiles of the Euclidean distances computed among pairs of objects. Three random starts were used for the FRCMD and RFRCMD algorithms in order to limit the risk of hitting local optima. Unless otherwise specified, the default options of the corresponding R functions were used and, for the fuzzy methods, m = 2 was set. For the sake of completeness, we observed that the performances of FRCMD and RFRCMD were only slightly affected by the choice of m: setting m = 1.5 led to results similar to those for m = 2. As we shall see, this held for all the datasets except for the one analyzed in Section 4.3.

4.1. Synthetic data

4.1.1. Dataset n.1

We started by clustering the data displayed in the bottom of Fig. 1. Two clusters were easily visible, therefore, except for DBSCAN, we set c = 2. We already saw that standard methods for linear data, such as FcM or GK-FcM, failed because the two clusters are nonlinearly separable. Quite heterogeneous results were found by applying the other methods. DBSCAN remarkably overestimated the number of clusters and several noisy points were found. As expected, KcM was dependent on the choice of the kernel function. For some kernels, the two clusters were perfectly identified, whilst other kernels led to poor solutions. Spectral clustering and MST clustering worked very well, recognizing the clusters. The performance of the fuzzy KcM algorithm was consistent with the corresponding hard version, whilst the fuzzy spectral clustering method performed worse than the hard one. Finally, by using FRCMD we got the true partition for values of k ranging from 4 to 34 and for values of ε ranging from q(4) to q(13), with q(4) and q(13) denoting the 4th and 13th quantiles, respectively.


Fig. 2. Spectral clustering, MST clustering and FRCMD solutions setting k ∈ {4, . . . , 34} or ε ∈ {q(4), . . . , q(13)}; red and blue points denote assignments to Cluster 1 and Cluster 2, respectively (by considering the closest hard partition for FRCMD).

In such cases, the obtained membership degrees were often either 1 or 0, highlighting a clear assignment of the objects to the clusters. We also investigated the frequency (in percentage) of finding the best estimate of the global optimum, i.e., the lowest function value upon convergence out of three runs of FRCMD. For this purpose, we checked, for each value of k or ε, the percentage of times in which the function value was less than 0.1% larger than the lowest one. Results were very satisfactory. The global optimum was always attained for all the values of k, whilst, in the ε-neighbouring case, the average percentage of times was 96.67%. We thus concluded that the probability of hitting local optima was negligible.

4.1.2. Dataset n.2

Section 4.1.1 showed the ability of FRCMD and other methods to discover nonlinearly separable clusters. The clustering problem was quite simple because the clusters were separated and outliers were not present. In this respect, we decided to consider a new source of difficulty by adding some outliers, labelled as 'A', 'B' and 'C', to the data previously analyzed in order to check whether the performances of the various methods got worse. It is straightforward to observe that the three new objects were considerably far from both clusters. The presence of the outliers did not affect the selection of k and ε in FRCMD. In fact, the same ranges of values for getting the true partition were registered. Almost always (the only exception was for ε = q(4)), the membership degrees of the non-outlier objects to the clusters belonged to {0, 1}. Hence, the three outliers did not prevent FRCMD from finding the true clusters. Nevertheless, FRCMD was not able to recognize objects 'A', 'B' and 'C' as outliers and performed differently according to the chosen neighbouring approach. Specifically, in the k-neighbouring approach with k taking values from 5 to 34, 'A' and 'B' were assigned to Cluster 1 (the cluster coloured in red in Fig. 2) and 'C' to Cluster 2 (the blue one) with membership degrees equal to 1 (when k = 4 the same assignments were observed in terms of the closest hard partition, but the membership degrees were lower than one). In the ε-neighbouring approach, the registered membership degrees of the three objects to the clusters were always 0.50 (except for ε = q(4), when we got 0.56 and 0.44 to Clusters 1 and 2, respectively).


Fig. 3. RFRCMD solutions setting ε ∈ {q(5), . . . , q(13)}; red, blue and purple points denote assignments to Cluster 1, Cluster 2 and noise cluster, respectively, by considering the closest hard partition.

Therefore, in both approaches, especially in the k-neighbouring one, the assignments of the outliers were unsatisfactory. The ε-neighbouring approach signalled the anomalous nature of objects 'A', 'B' and 'C', but it was not able to highlight that they were reasonably far from the clusters. By running the robust version of FRCMD, setting δ according to (33) and varying k and ε in the subsets of values leading to the true partition, we found that, in the k-neighbouring approach, RFRCMD always found the true clusters, but never detected the outliers. Specifically, their membership degrees were always in {0, 1}. On the contrary, in the ε-neighbouring approach, RFRCMD always reached the goal except for the case with ε = q(4). In fact, the clusters were correctly found and the non-outlier objects were assigned to the proper clusters with membership degrees equal to 1, whilst the outliers had membership degrees to the noise cluster equal to 0.66 (and 0.17 to Clusters 1 and 2). The results of RFRCMD in the ε-neighbouring case are summarized in Fig. 3. Concerning the risk of hitting local optima, the average percentages of times in which the global optimum was attained were very similar to those registered for Dataset n.1 (for FRCMD, 100% and 98.33% in the k- and ε-neighbouring cases, respectively). The percentage was 100% also for RFRCMD in the k-neighbouring approach, whilst it reduced to 71.67% in the ε-neighbouring one. In the latter case, more than one random start is recommended. The three outliers had generally disruptive effects on the competitors. In particular, this was the case for spectral clustering and MST clustering. In MST clustering, all the non-outlier objects were assigned to the same cluster together with one outlier. The performance of KcM got worse, even if, under particular choices of the kernel function, it still recovered the two clusters. Regardless of the kernel used, the common limit of KcM was the lack of robustness, in the sense that the outliers were not identified and were always assigned to the clusters. The same comments hold for the fuzzy versions of KcM and spectral clustering. On the contrary, DBSCAN noticeably improved its performance. The three outliers were recovered, although a solution with four clusters was discovered. Specifically, every true cluster was split into two equally-sized clusters by DBSCAN.


Fig. 4. RFRCMD solutions setting ε ∈ {q(5), . . . , q(7)}; red, blue and purple points denote assignments to Cluster 1, Cluster 2 and noise cluster, respectively, by considering the closest hard partition.

4.1.3. Dataset n.3

The data studied in this section represented a further modification of those of Sections 4.1.1 and 4.1.2. In detail, one more outlier, labelled as 'D', was added. As for objects 'A', 'B' and 'C', object 'D' did not belong to the clusters but, differently from them, it was approximately located in between the two clusters. By applying FRCMD, we found that, to some extent, the solutions were consistent with those obtained for Dataset n.2. Nevertheless, object 'D' posed new challenges. First of all, we noted that the true partition was obtained for ε taking values from q(4) to q(10) and k from 4 to 34. However, for some values of k a limited number of misclassified objects (from 1 to 9) was registered. This depended on a sort of chaining effect due to object 'D' sometimes bringing a few neighbouring objects into the wrong cluster with membership degrees slightly higher than 0.50. With respect to the non-outlier objects, the solution was fuzzier in the ε-neighbouring approach than in the k-neighbouring one. With respect to the four outliers, they were assigned to the clusters with membership degrees in the interval [0.50, 0.70]. The only exception was observed when ε ∈ {q(8), . . . , q(10)} and k ∈ {4, 5}. In such cases, object 'D' was assigned to a cluster with membership degree equal to 1. By using RFRCMD we saw that, as for Dataset n.2, it failed in the k-neighbouring approach, whilst it worked well in the ε-neighbouring one. This occurred when ε was chosen in the set {q(5), . . . , q(7)}. In such conditions, the output was similar to that for Dataset n.2, namely, the two clusters were correctly found and all the outliers were detected. Dataset n.3 and the RFRCMD solution are plotted in Fig. 4. In particular, the membership degrees of objects 'A', 'B', 'C' and 'D' to the noise cluster were equal to 0.66. The only drawback was that the different features of 'A', 'B' and 'C' with respect to 'D' did not emerge. With respect to the risk of local optima, results were consistent with those discussed for Datasets n.1 and n.2. By investigating the solutions of the competitors, we observed results quite similar to those found for Dataset n.2. The most relevant difference was that no kernel choice in KcM helped to recover the true partition. With respect to the fuzzy versions of KcM and spectral clustering, we found that object 'D' was more complex to handle in comparison with the other outliers.


Table 1
Number of objects, variables and clusters of the 3 benchmark datasets.

Dataset             N. of objects   N. of variables   N. of clusters
Iris                150             4                 3
Wine                178             13                3
Vertebral Column    310             6                 3

Table 2
ARI values for different clustering algorithms.

Dataset      DBSCAN   KcM (median)   KcM (max)   Spectral clustering   MST clustering   FRCMD (k) (median)   FRCMD (k) (max)   FRCMD (ε) (median)   FRCMD (ε) (max)
Iris         0.57     0.55           0.61        0.56                  0.56             0.64                 0.65              0.55                 0.56
Wine         0        0.81           0.90        0.90                  −0.01            0.79                 0.83              −0.01                0.42
Vert. Col.   0        0.30           0.36        0.23                  −0.004           0.34                 0.35              −0.02                0.31

Specifically, whilst the membership degrees of objects 'A', 'B' and 'C' were close to 0.5 for all the clusters, the membership degrees of object 'D' tended to approach one for one of the two clusters.

4.2. Benchmark data

We compared the performance of our proposal with those of the closest competitors, already described in the above sections, by means of some benchmark datasets from the UCI repository [63]: Iris, Wine and Vertebral Column. The numbers of objects, variables and clusters of each dataset are reported in Table 1. To compare the partitions we used the well-known Adjusted Rand Index (ARI) [64]. In Table 2 the ARI values for the following algorithms are reported: DBSCAN, KcM, spectral clustering, MST clustering and FRCMD. Since the KcM results were affected by the chosen kernel function, the median and maximum ARI values computed over the different kernel functions are given. The fuzzy counterparts of KcM and spectral clustering are not reported since their performances were quite similar to those of the corresponding hard algorithms. For the fuzzy algorithms, m = 2 was set. FRCMD was run for values of k (nearest neighbours) from 2 to 40 and values of ε from q(1) to q(20), and the median and maximum ARI values are reported. Moreover, to offer better insight into the dependence of FRCMD on the choice of k and ε, Fig. 5 contains the boxplots of the ARI values distinguished by dataset and neighbouring approach. For the Iris dataset, all the algorithms worked quite well. The highest value was obtained by using FRCMD with the k-neighbouring approach. Instead, in the case of the Wine and Vertebral Column datasets, we observed some differences. In particular, the ARI value of DBSCAN was equal to 0 in both cases since it recognized just one cluster. MST clustering did not work well on either dataset. The results obtained with KcM, spectral clustering and FRCMD (k-neighbouring approach) were quite similar. In both cases, judging by the median values, the performance of FRCMD in the ε-neighbouring approach was worse than in the k-neighbouring one; in terms of the maximum values, a good performance was registered for the Vertebral Column dataset. By inspecting Fig. 5, we can see that, except for a few cases, the distributions of the ARI values were characterized by low variability, hence the performances of FRCMD were only slightly affected by the choice of the parameters k and ε. Furthermore, we analyzed the computational performance of our proposal by studying the percentage of times in which the global optimum was attained using three random initializations. For the Iris dataset, by using the k-neighbouring approach, the average percentage (over the values of k) was 98.29% and, in the ε-neighbouring approach, it was 100%. For the Wine dataset, the average percentages were 100% (k-neighbouring) and 95% (ε-neighbouring), while for the Vertebral Column dataset, both percentages were equal to 100%.
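For reference, the ARI between the closest hard partition of a fuzzy solution and the known class labels can be computed along the following lines. This is a sketch, not the paper's code, and it assumes the adjustedRandIndex function of the mclust package (the paper does not state which ARI implementation was used).

```r
# ARI between the closest hard partition from an FKM run and the true labels
# of R's built-in iris data; adjustedRandIndex() is from the mclust package.
library(fclust)
library(mclust)

fit  <- FKM(iris[, 1:4], k = 3, m = 2)
hard <- apply(fit$U, 1, which.max)     # closest hard partition
adjustedRandIndex(hard, iris$Species)
```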
It was clear that such a research problem was intrinsically nonlinear. In fact, if we used the latitudes and longitudes of the municipalities to compute the Euclidean distances among municipalities, then we got a partition without minimizing the travelling costs. This was explained by the particular road transport infrastructure. Aosta Valley has one highway from the South-East to the North-West of the region, connecting Turin to France through the Mont Blanc tunnel. The highway splits the region into two parts. A freeway, next to the highway, is also present. Along the highway and the freeway, there are several municipalities, including Aosta, reachable in a limited amount of time. From these municipalities, a lot of valleys depart. In each valley, municipalities are connected by mountain roads. Municipalities belonging to the same valley can be neighbours, but the connecting time can be no longer limited. Municipalities belonging to different valleys are very


Fig. 5. Boxplots of the ARI values distinguished by dataset and neighbouring approach.

far from each other, even if they are quite close in terms of latitude and longitude, because there are not direct roads. The only possibility is to use mountain roads starting from the point of departure to go to the highway, then enter the highway and finally reach the point of arrival through mountain roads. Of course, the trip is very time consuming. For all of these reasons, our study was based on the matrix of the travelling costs among the municipalities available in [65]. It represented a proxy of the geodesic distance matrix. In fact, neighbour municipalities are reachable with a limited travelling cost, regardless the type of connecting road, and thus the local linearity approximately holds. The distances between pairs of faraway municipalities cannot be computed linearly and the travelling costs are essentially based on the sum of the costs for connecting all the municipalities along the shortest route, consistently with the shortest-path distance approach. One of the municipalities, Chamois, was certainly an outlier. In fact, it is the only municipality in Italy not navigable by car. Thus, all the geodesic distances involving Chamois took arbitrarily large values. For this reason, we decided to apply RFRCMD hoping that Chamois did not affect the obtained partition, i.e., it belonged to the noise cluster with a high membership degree. In order to select the number of clusters and the parameter of fuzziness, we compared different RFRCMD solutions. We preferred the solution with c = 3 clusters. In fact, the solution with c = 2 oversimplified the geographical structure of Aosta Valley. When c > 3, the percentage of municipalities with membership degrees lower than 0.50 to all the clusters was not negligible and some clusters had small size. Moreover, we decided to consider m = 1.5 because higher values of m led to too fuzzy partitions. To limit the risk of local optima, three random starts were considered leading to the same loss function value upon convergence. The obtained partition is displayed in Fig. 6. The solution was parsimonious and meaningful. For all the clusters, the membership degrees of the municipalities close to the highway were usually larger highlighting easiness of connections among such municipalities. 20 municipalities were assigned to Cluster 1 (blue points) with membership degrees ranging from 0.54 (Cogne) to 0.92 (Arvier and Villeneuve). They were located in the western part of the region. The municipalities with the high membership degrees were located along the highway, whilst the membership degrees decreased for municipalities located in the final parts of the valleys. The cluster included all the municipalities of seven valleys, namely, Val Ferret, Val Veny, Val di La Thuile, Valgrisenche, Val di Rhémes, Valsavaranche and Val di Cogne. We found 23 municipalities, including Aosta, in Cluster 2 (red points). The cluster


Fig. 6. The RFRCMD solution with c = 3 and m = 1.5; blue, red, green and purple points denote municipality assignments to Cluster 1, Cluster 2, Cluster 3 and the noise cluster, respectively, with membership degrees higher than 0.50 (in particular, for each cluster, light, normal and dark colours denote membership degrees in the intervals [0.50, 0.70), [0.70, 0.90) and [0.90, 1.00], respectively); grey points denote municipalities with membership degrees lower than 0.50 for all the clusters.

The cluster recognized the surroundings of Aosta along the highway and the municipalities of two valleys, called Valpelline and Valle del G.S. Bernardo, located in the area North of Aosta. The membership degrees of Cluster 2 were fuzzier than those of Cluster 1. In detail, they varied from 0.55 (Verrayes) to 0.81 (Quart). Finally, Cluster 3, represented by green points, was the largest group, composed of 25 municipalities located in the eastern part of Aosta Valley, with membership degrees from 0.53 (Saint-Vincent) to 0.93 (Bard and Verrés). It covered three valleys, Val d'Ayas, Val di Gressoney and Val di Champorcher. Five neighbouring municipalities (grey points) were not assigned to any cluster (all membership degrees lower than 0.50). In particular, except for Pontey, they belonged to the valley called Valtournenche. Valtournenche is located between Valpelline (Cluster 2) and Val d'Ayas (Cluster 3), which explains why the membership degrees of such municipalities were about 0.40 in both Clusters 2 and 3. These municipalities were not assigned to a cluster because the valley is mainly served by one mountain road with many hairpin turns and hence reaching the various municipalities is quite time consuming. The membership degrees for the noise cluster were 0 for all the municipalities. As expected, the only exception was Chamois, whose membership degree to the noise cluster was equal to 0.98.

For the sake of completeness, we compared the previously described solution with those obtained by applying the fuzzy clustering methods for relational data [60] to the Euclidean distance matrix based on the latitudes and longitudes of the municipalities. Setting c = 3 and m = 1.5 as for RFRCMD, we observed that the solutions of the robust and non-robust methods were quite similar, distinguishing the clusters with respect to the geographical positions of the municipalities. In particular, the three clusters were interpreted as western, central and eastern municipalities. The most relevant distinction between these two solutions was that the robust one gave larger membership degrees to the noise cluster for the municipalities located in the extreme areas of Aosta Valley. However, only three municipalities (the most western ones) had membership degrees to the noise cluster slightly higher than 0.50 (the maximum being 0.55, for La Thuile), highlighting that problematic outliers were absent. By comparing such solutions with the RFRCMD one, we noted a few common aspects. Nevertheless, several relevant differences emerged. In fact, by using the latitudes and longitudes, the geographical configuration of Aosta Valley was missed because information on road transport infrastructures and valleys was often ignored. Several anomalous results were observed. For instance, in the solution of the non-robust method, Cogne (the most eastern blue point in Fig. 6) was not assigned to the same cluster as Aymavilles, the closest municipality reachable following the white road. Instead, it belonged to the same cluster as Bionaz (the most northern red point). Another anomalous result concerned Gressoney-La-Trinité (the most northern green point) and Valtournenche (the most northern grey point), which were assigned to the same cluster despite being located in different valleys without direct road connections between them.
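As an illustration only (not the exact code used for this analysis), the following R sketch shows how such a robust relational solution can be obtained from a travelling-cost matrix. The object cost is assumed to be the symmetric 74 × 74 matrix of travelling times derived from [65]; the call to NEFRC.noise() from the fclust package [62], which is assumed to implement relational fuzzy clustering with a noise cluster in the spirit of [60], is an assumption and its argument names may differ.

library(fclust)   # NEFRC.noise: relational fuzzy clustering with noise cluster (assumed interface)

# cost: symmetric 74 x 74 matrix of travelling times between the municipalities
#       (a proxy of the geodesic distance matrix, derived from [65]); hypothetical object.
fit <- NEFRC.noise(D = cost, k = 3, m = 1.5, RS = 3)   # c = 3 clusters, m = 1.5, three random starts

U <- fit$U                                     # membership degrees to the three clusters
noise <- 1 - rowSums(U)                        # noise-cluster membership (memberships need not sum to one)
assigned <- ifelse(apply(U, 1, max) >= 0.50,   # assignment rule adopted in Fig. 6
                   apply(U, 1, which.max), NA)
table(assigned, useNA = "ifany")               # cluster sizes and number of unassigned municipalities
max(noise)                                     # largest membership degree to the noise cluster

The 0.50 threshold used in the sketch mirrors the assignment rule adopted for displaying the partition in Fig. 6.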
5. Final remarks

In this paper, a class of fuzzy clustering algorithms for nonlinearly separable data has been introduced. In contrast with standard clustering procedures, the proposed methods make it possible to fit a wider class of cluster topological structures. At least


four approaches are available for clustering nonlinearly separable data: density-, kernel-, graph- and manifold-based. After reviewing such approaches, some manifold-based clustering algorithms have been proposed. Their peculiarity is the use of clustering methods for relational data. This class of methods has received limited attention, but it represents the most natural choice when the data are geodesic distances. In particular, two proposals have been introduced, namely the fuzzy relational clustering algorithm for manifold data (FRCMD) and a robust version (RFRCMD) exploiting the concept of noise cluster. The FRCMD and RFRCMD procedures have been applied to synthetic, benchmark and real data, showing that they represent valuable clustering tools for nonlinearly separable data. Furthermore, in the applications, we saw that the clustering results were quite stable over a range of values of k or ε. However, the optimal setting of these parameters remains an open issue. In the near future, our research interest in this domain will focus on the development of a data-dependent selection method for k and ε.

Declaration of competing interest

The authors declare that there is no conflict of interest regarding the publication of this article.

Appendix A

The following R function computes the geodesic (shortest-path) distance matrix from a data matrix by means of the Floyd-Warshall algorithm [39,40], under either the k-neighbouring or the ε-neighbouring approach.

FloydWarshall <- function(X, k, eps, MaxDist){
  # Input:
  #   X: data matrix
  #   k: k-neighbouring parameter [specify either k or eps; set the other to NULL]
  #   eps: eps-neighbouring parameter [specify either k or eps; set the other to NULL]
  #   MaxDist: distance between non-connected objects [use Inf or an arbitrarily
  #            large value, e.g. 1e+8, to avoid computational problems]
  # Output:
  #   D.geod: geodesic distance matrix
  n <- nrow(X)
  D.dist <- dist(X)                 # Euclidean distances (dist object)
  D.eucl <- as.matrix(D.dist)       # Euclidean distance matrix
  D.geod <- matrix(MaxDist, n, n)   # initialization: all pairs non-connected
  diag(D.geod) <- 0
  # k-neighbouring: connect each object to its k nearest neighbours
  if (!is.null(k)){
    require(dbscan)
    knear <- kNN(D.dist, k)         # nearest neighbours from the precomputed distances
    for (i in 1:n){
      for (j in 1:k){
        D.geod[i, knear$id[i, j]] <- D.eucl[i, knear$id[i, j]]
        D.geod[knear$id[i, j], i] <- D.geod[i, knear$id[i, j]]
      }
    }
  }
  # eps-neighbouring: connect only pairs whose Euclidean distance is at most eps
  if (!is.null(eps)){
    for (i in 1:(n - 1)){
      for (j in (i + 1):n){
        if (D.eucl[i, j] <= eps){
          D.geod[i, j] <- D.eucl[i, j]
          D.geod[j, i] <- D.geod[i, j]
        }
      }
    }
  }
  # Floyd-Warshall algorithm: i is the intermediate object, j and h are the endpoints
  for (i in 1:n){
    for (j in 1:n){
      if (D.geod[j, i] < MaxDist){
        for (h in 1:n){
          if (D.geod[i, h] < MaxDist){
            sum.dist <- D.geod[j, i] + D.geod[i, h]
            if (sum.dist < D.geod[j, h]){
              D.geod[j, h] <- sum.dist
            }
          }
        }
      }
    }
  }
  return(D.geod)
}
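As a minimal usage sketch (the parameter values are merely illustrative), the geodesic distance matrix of a numeric data matrix X can be obtained under either neighbouring approach as follows; the resulting matrix can then be used as the input dissimilarity matrix for FRCMD and RFRCMD.

X <- as.matrix(iris[, 1:4])                                      # example data matrix
D.k   <- FloydWarshall(X, k = 5,    eps = NULL, MaxDist = 1e+8)  # k-neighbouring approach
D.eps <- FloydWarshall(X, k = NULL, eps = 0.5,  MaxDist = 1e+8)  # eps-neighbouring approach
                                                                 # non-connected pairs keep the value MaxDist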


References

[1] J.B. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, 1967, pp. 281–297.
[2] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[3] D.E. Gustafson, W.C. Kessel, Fuzzy clustering with a fuzzy covariance matrix, in: Proceedings of the 1978 IEEE Conference on Decision and Control Including the 17th Symposium on Adaptive Processes, 1979, pp. 761–766.
[4] I. Gath, A.B. Geva, Unsupervised optimal fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 7 (1989) 773–781.
[5] M. Ester, H.P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining KDD'96, AAAI Press, Portland, 1996, pp. 226–231.
[6] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, Cambridge, 2004.
[7] I.S. Dhillon, Y. Guan, B. Kulis, Kernel k-means: spectral clustering and normalized cuts, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD'04, ACM, New York, 2004, pp. 551–556.
[8] A. Ng, M. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: T. Dietterich, S. Becker, Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, vol. 14, MIT Press, Massachusetts, 2002, pp. 849–856.
[9] N. Asgharbeygi, A. Maleki, Geodesic k-means clustering, in: Proceedings of the 19th International Conference on Pattern Recognition ICPR 2008, Tampa, Florida, 2008, pp. 1–4.
[10] D.C. Stanford, A.E. Raftery, Finding curvilinear features in spatial point patterns: principal curve clustering with noise, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 601–609.
[11] R.J.G.B. Campello, D. Moulavi, A. Zimek, J. Sander, Hierarchical density estimates for data clustering, visualization, and outlier detection, ACM Trans. Knowl. Discov. Data 10 (2015) 1–51.
[12] T.N. Tran, K. Drab, M. Daszykowski, Revised DBSCAN algorithm to cluster data with dense adjacent clusters, Chemom. Intell. Lab. Syst. 120 (2013) 92–96.
[13] D. Han, A. Agrawal, W.K. Liao, A. Choudhary, A fast DBSCAN algorithm with spark implementation, in: S.S. Roy, P. Samui, R. Deo, S. Ntalampiras (Eds.), Big Data in Engineering Applications, Springer, Singapore, 2018, pp. 173–192.
[14] A. Zhou, S. Zhou, J. Cao, Y. Fan, Y. Hu, Approaches for scaling DBSCAN algorithm to large spatial databases, J. Comput. Sci. Technol. 15 (2000) 509–526.
[15] M. Ankerst, M.M. Breunig, H.P. Kriegel, J. Sander, OPTICS: Ordering points to identify the clustering structure, in: Proceedings of the ACM SIGMOD International Conference on Management of Data SIGMOD'99, ACM, New York, 1999, pp. 49–60.
[16] M. Hahsler, M. Piekenbrock, S. Arya, D. Mount, dbscan: Density Based Clustering of Applications with Noise (DBSCAN) and related algorithms. R package version 1.1-3, http://cran.r-project.org/package=dbscan, 2018.
[17] C. Hennig, fpc: flexible procedures for clustering. R package version 2.1-11.1, http://cran.r-project.org/package=fpc, 2018.
[18] H.P. Kriegel, P. Kröger, J. Sander, A. Zimek, Density-based clustering, Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 1 (3) (2011) 231–240.
[19] J. Sander, Density-based clustering, in: C. Sammut, G.I. Webb (Eds.), Encyclopedia of Machine Learning and Data Mining, Springer, Boston, 2016, pp. 1–5.
[20] J. Sander, M. Ester, H.P. Kriegel, X. Xu, Density-based clustering in spatial databases: the algorithm GDBSCAN and its applications, Data Min. Knowl. Discov. 2 (1998) 169–194.
[21] E. Schubert, J. Sander, M. Ester, H.P. Kriegel, X. Xu, DBSCAN revisited, revisited: why and how you should (still) use DBSCAN, ACM Trans. Database Syst. 42 (19) (2017).
[22] E.N. Nasibov, G. Ulutagay, Robustness of density-based clustering methods with various neighbourhood relations, Fuzzy Sets Syst. 160 (2009) 3601–3615.
[23] G. Bordogna, D. Ienco, Fuzzy core dbscan clustering algorithm, in: A. Laurent, O. Strauss, B. Bouchon-Meunier, R.R. Yager (Eds.), Information Processing and Management of Uncertainty in Knowledge-Based Systems, IPMU 2014, in: Communications in Computer and Information Science, vol. 442, Springer, Cham, 2014, pp. 100–109.
[24] D. Ienco, G. Bordogna, Fuzzy extensions of the DBScan clustering algorithm, Soft Comput. 22 (2018) 1719–1730.
[25] G. Ulutagay, E. Nasibov, Fuzzy and crisp clustering methods based on the neighbourhood concept: a comprehensive review, J. Intell. Fuzzy Syst. 23 (2012) 1–11.
[26] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, Cambridge, 2000.
[27] A. Karatzoglou, A. Smola, K. Hornik, A. Zeileis, kernlab: an S4 package for kernel methods in R, J. Stat. Softw. 11 (9) (2004) 1–20.
[28] D.Q. Zhang, S.C. Chen, Clustering incomplete data using kernel-based fuzzy c-means algorithm, Neural Process. Lett. 18 (2003) 155–162.
[29] D.Q. Zhang, S.C. Chen, A novel kernelized fuzzy C-means algorithm with application in medical image segmentation, Artif. Intell. Med. 32 (2004) 37–50.
[30] Y. Ding, X. Fu, Kernel-based fuzzy c-means clustering algorithm based on genetic algorithm, Neurocomputing 188 (2016) 233–238.
[31] K.H. Memon, D.H. Lee, Generalised kernel weighted fuzzy C-means clustering algorithm with local information, Fuzzy Sets Syst. 340 (2018) 91–108.
[32] S.E. Schaeffer, Graph clustering, Comput. Sci. Rev. 1 (2007) 27–64.
[33] U. Luxburg, A tutorial on spectral clustering, Stat. Comput. 17 (2007) 395–416.
[34] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 888–905.
[35] C.T. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Trans. Comput. 20 (1971) 68–86.
[36] J. Oksanen, F.G. Blanchet, M. Friendly, R. Kindt, P. Legendre, D. McGlinn, P.R. Minchin, R.B. O'Hara, G.L. Simpson, P. Solymos, M.H.H. Stevens, E. Szoecs, H. Wagner, vegan: community ecology package. R package version 2.5-4, https://CRAN.R-project.org/package=vegan, 2019.
[37] R. Sharan, R. Shamir, CLICK: a clustering algorithm with applications to gene expression analysis, in: Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology ISMB '00, AAAI Press, Menlo Park, 2000, pp. 307–316.
[38] E.W. Dijkstra, A note on two problems in connexion with graphs, Numer. Math. 1 (1959) 269–271.
[39] R.W. Floyd, Algorithm 97: shortest path, Commun. ACM 5 (1962) 345.
[40] S. Warshall, A theorem on Boolean matrices, J. ACM 9 (1962) 11–12.


[41] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[42] J.A. Lee, M. Verleysen, Nonlinear Dimensionality Reduction, Springer-Verlag, New York, 2007.
[43] N. Asgharbeygi, A. Maleki, Geodesic k-means clustering, in: Proceedings of the 19th International Conference on Pattern Recognition ICPR 2008, 2008, pp. 1–4.
[44] L. Kaufman, P.J. Rousseeuw, Clustering by means of medoids, in: Y. Dodge (Ed.), Statistical Data Analysis Based on the L1-Norm and Related Methods, North Holland, Amsterdam, 1987, pp. 405–416.
[45] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: an Introduction to Cluster Analysis, John Wiley and Sons, Hoboken, 1990.
[46] A.Y. Wu, M. Garland, J. Han, Mining scale-free networks using geodesic clustering, in: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining KDD'04, ACM, New York, 2004, pp. 719–724.
[47] S. Ray, M.K. Pakhira, Clustering of scale free networks using a k-medoid framework, in: Proceedings of the 2nd International Conference on Computer and Communication Technology ICCCT-2011, 2011, pp. 94–99.
[48] S. Karygianni, P. Frossard, Tangent-based manifold approximation with locally linear models, Signal Process. 104 (2014) 232–247.
[49] A. Babaeian, M. Babaee, A. Bayestehtashk, M. Bandarabadi, Nonlinear subspace clustering using curvature constrained distances, Pattern Recognit. Lett. 68 (2015) 118–125.
[50] B. Feil, J. Abonyi, Geodesic distance based fuzzy clustering, in: A. Saad, K. Dahal, M. Sarfraz, R. Roy (Eds.), Soft Computing in Industrial Applications, Advances in Soft Computing, vol. 39, Springer, Heidelberg, 2007, pp. 50–59.
[51] I. Borg, P. Groenen, Modern Multidimensional Scaling: Theory and Applications, Springer-Verlag, New York, 2005.
[52] M. Balasubramanian, E.L. Schwartz, J.B. Tenenbaum, V. de Silva, J.C. Langford, The Isomap algorithm and topological stability, Science 295 (5552) (2002) 7.
[53] R. Krishnapuram, A. Joshi, L. Yi, A fuzzy relative of the k-medoids algorithm with application to web document and snippet clustering, in: Proceedings of the Fuzzy Systems Conference FUZZ-IEEE '99, 1999, pp. 1281–1286.
[54] A. Király, Á. Vathy-Fogarassy, J. Abonyi, Geodesic distance based fuzzy c-medoid clustering - searching for central points in graphs and high dimensional data, Fuzzy Sets Syst. 286 (2016) 157–172.
[55] J. Kim, K.H. Shim, S. Choi, Soft geodesic kernel k-means, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing ICASSP '07, 2007, pp. 429–432.
[56] T.A. Runkler, Relational fuzzy clustering, in: J.V. de Oliveira, W. Pedrycz (Eds.), Advances in Fuzzy Clustering and Its Applications, Wiley, Chichester, 2007, pp. 31–51.
[57] P. Giordani, A.B. Ramos-Guajardo, A fuzzy clustering procedure for random fuzzy sets, Fuzzy Sets Syst. 305 (2016) 54–69.
[58] A. Banerjee, R.N. Davé, Robust clustering, Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2 (2011) 29–59.
[59] R.N. Davé, Characterization and detection of noise in clustering, Pattern Recognit. Lett. 12 (1991) 657–664.
[60] R.N. Davé, S. Sen, Robust fuzzy clustering of relational data, IEEE Trans. Fuzzy Syst. 10 (2002) 713–727.
[61] M.B. Ferraro, P. Giordani, A toolbox for fuzzy clustering using the R programming language, Fuzzy Sets Syst. 279 (2015) 1–16.
[62] M.B. Ferraro, P. Giordani, A. Serafini, fclust: an R package for fuzzy clustering, R J. 9 (2019), https://doi.org/10.32614/RJ-2019-017.
[63] D. Dua, C. Graff, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, 2019, http://archive.ics.uci.edu/ml.
[64] L. Hubert, P. Arabie, Comparing partitions, J. Classif. 2 (1985) 193–218.
[65] Istituto Nazionale di Statistica, Matrici di contiguità, distanza e pendolarismo, https://www.istat.it/it/archivio/157423, 2017 (in Italian).