Fuzzy Clustering with Nonlinearly Transformed Data

Accepted manuscript, to appear in: Applied Soft Computing
PII: S1568-4946(17)30437-4
DOI: 10.1016/j.asoc.2017.07.026
Received: 5 October 2016; Revised: 3 July 2017; Accepted: 13 July 2017



Xiubin Zhu (a, b), Witold Pedrycz (c, a, e), and Zhiwu Li (d, a, *)

a School of Electro-Mechanical Engineering, Xidian University, Xi’an 710071, China ([email protected])
b Key Laboratory of Electronic Equipment Structure Design (Xidian University), Ministry of Education, Xi’an 710071, China
c Department of Electrical and Computer Engineering, University of Alberta, Edmonton T6R 2V4 AB, Canada ([email protected])
d Institute of Systems Engineering, Macau University of Science and Technology, Taipa, Macau ([email protected])
e Faculty of Engineering, King Abdulaziz University, Jeddah 21589, Saudi Arabia

* Corresponding author. Tel.: (+86) 29 88 20 19 86. E-mail address: [email protected]

Highlights

- A novel nonlinear data transformation method for the Fuzzy C-Means algorithm is proposed.
- Original data are projected in a nonlinear fashion into a new space with no change of space dimensionality.
- Particle Swarm Optimization is used to optimize the nonlinear transformation.

Abstract

The Fuzzy C-Means (FCM) algorithm is a widely used objective function-based clustering method exploited in numerous applications. In order to improve the quality of clustering, this study develops a novel approach in which FCM operates on transformed data. Two data transformation methods are proposed, with which the original data are projected in a nonlinear fashion onto a new space of the same dimensionality as the original one. Next, clustering is carried out on the transformed data. Two optimization criteria, namely a classification error and a reconstruction error, are introduced and utilized to guide the optimization of the performance of the new clustering algorithm and of the transformation of the original data space. Unlike other data transformation methods that require some prior knowledge, in this study Particle Swarm Optimization (PSO) is used to determine the optimal transformation on the basis of a certain performance index. Experimental studies completed for a synthetic data set and a number of data sets coming from the Machine Learning Repository demonstrate the performance of the FCM with transformed data. The experiments show that the proposed fuzzy clustering method achieves better performance (in terms of the clustering accuracy and the reconstruction error) in comparison with the outcomes produced by the generic version of the FCM algorithm.

Keywords: Fuzzy clustering, data transformation, Particle Swarm Optimization (PSO), classification error rate, reconstruction error

1. Introduction

Clustering has been successfully applied to a wide spectrum of areas including pattern recognition, image processing, taxonomy and document classification, and system modeling and identification. It is an important technique of unsupervised learning which aims to organize data into groups (clusters) of similar objects based upon an assumed measure of closeness between individual data instances (patterns). In general, we aim at building groups (clusters) whose intra-cluster association is as high as possible whereas inter-cluster similarity is as low as possible [18, 19]. “Conventional” clustering methods, such as hierarchical methods, density-based methods, and partitioning methods, require that each pattern belongs to a single cluster only; we refer to these as Boolean clustering methods. K-Means [8] and Fuzzy C-Means (FCM) [9] are the two most commonly used clustering algorithms guided by a certain objective function. As a generalization of the conventional K-Means method, FCM involves membership grades to quantify the extent to which each datum (pattern) belongs to different clusters. Since its inception, FCM has received a lot of attention with regard to its conceptual developments and application studies [1, 2, 3, 23, 30-37]. Various applications of clustering, including fuzzy clustering, are discussed in [35]. A collection of different realizations of techniques aiming to extend FCM to very large data is discussed in [31]. The major challenges and recent developments in fuzzy clustering have been addressed in [32], which provides a comprehensive presentation of the state-of-the-art methods of fuzzy clustering, including its algorithmic and computational augmentations, cluster design and analysis, and application studies. An introduction to fuzzy clustering and information granules is also provided in [33], in which a group of highly diversified methodologies of knowledge-based clustering is presented and discussed. As a vehicle to construct information granules [34], fuzzy clustering along with its numerous variants has been widely used in information granulation.

In order to enhance the efficiency and effectiveness of the FCM when dealing with the specificity of the problem, a number of variants of this method have been proposed. In previous studies, the objective function of the FCM algorithm has been a subject of various modifications [20, 21, 22]. Kernel versions of the FCM algorithm have been proposed to eliminate the drawbacks of the generic FCM, given that the Euclidean distance is suitable only for spherical-like clusters [10]. For example, in [30], an evolutionary kernel intuitionistic Fuzzy C-Means clustering algorithm was developed. In Kernel Fuzzy C-Means clustering (KFCM), the data are first mapped onto a new feature space of much higher dimensionality than the original one, and next clustering is carried out in this new space [10, 11].
A comprehensive comparative analysis of kernel-based fuzzy clustering and fuzzy clustering is presented by Graves and Pedrycz [12]. It has been observed that the kernel-based FCMs are often highly sensitive to the selection of the values of the kernel parameters. Quite commonly, the KFCM clustering algorithm yields better performance than the traditional FCM; however, its evident drawback is high computational complexity. Furthermore, the prototypes constructed in the induced space are difficult to interpret.



The main contribution of this paper is to introduce and develop a new method, referred to as FCM with transformed data, which helps improve the classification accuracy and minimize the reconstruction error (the error arising when numeric data are transformed through fuzzy sets formed by the clustering method and then reconstructed to their numeric counterparts) [13]. The method does not require any prior knowledge about the structure of the data. Unlike the kernel-based method, in which the transformation leads to a certain abstract space (different from the original one), in this approach an optimized nonlinear transformation retains the original features, which makes the interpretation of the results of clustering, especially the prototypes, transparent and intuitively appealing. To the best of our knowledge, the idea of FCM with transformed data, where the transformation is realized in the same space in which the original data are located, has not been considered in previous studies. An iterative optimization algorithm is developed to help determine the best transformation, which minimizes the classification error or the reconstruction error. Furthermore, the proposed data transformation is a one-to-one transformation, so the transformed data can be mapped back to the original space. The essence of the FCM operating on transformed data is visualized in Fig. 1.

Fig. 1. FCM clustering with transformed data: an overview.

The essence of the underlying idea is to realize a nonlinear mapping of the original space to a new feature space endowed with enhanced discriminatory capabilities while retaining the dimensionality of the original space, so that the obtained results can still be easily interpreted. Several variants of the overall mapping process are considered with respect to the form of the mapping itself (piecewise linear and logistic functions) and its level of flexibility in handling individual variables (treating them en bloc or individually). In light of the nature of the mapping, the optimization is realized with the aid of Particle Swarm Optimization (PSO).

The study is organized as follows. In Section 2, we offer a brief introduction to the standard FCM method and kernel-based clustering methods. In Section 3, data transformation methods and the ensuing transformation strategies are introduced. Two performance indexes used to evaluate the results of clustering, along with the detailed optimization mechanism, are discussed in Section 4. A series of experiments using a synthetic data set and data sets coming from the Machine Learning Repository (http://archive.ics.uci.edu/ml) is reported in Section 5 to demonstrate the effectiveness of this method. Finally, the study is completed in Section 6 by offering some conclusions.

2. Fuzzy C-Means clustering: A brief overview

FCM is an objective function-based clustering method originally developed by Dunn [4] and further improved by Bezdek [5]. Unlike the traditional K-Means algorithm, which partitions a set of data into a predetermined number of k clusters such that each element of the data set belongs to exactly one cluster, FCM admits partial belongingness (membership) of a datum to more than a single cluster, and membership grades quantify the degree to which this element belongs to the different clusters. Given a finite data set containing N elements, X = {x_1, x_2, ..., x_N} ⊂ R^n, the FCM method splits the data into c clusters by minimizing a certain objective function. The algorithm returns cluster centers (prototypes) v_1, v_2, ..., v_c and a fuzzy partition matrix U = [u_ij], u_ij ∈ [0, 1], i = 1, 2, ..., N; j = 1, 2, ..., c. The ij-th entry of U, u_ij, indicates the extent to which the element x_i belongs to the j-th cluster. The FCM algorithm minimizes the following objective function:

J_m = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ij}^{m} \| x_i - v_j \|^2    (1)

In the above expression, the fuzziness exponent (fuzziness coefficient) m assumes values greater than 1, with m = 2 being the most commonly used. Different values of m control the shape of the membership functions produced by the FCM algorithm: higher values of m lead to spike-like membership functions, while values close to 1 produce more Boolean-like shapes. The parameter exhibits some influence on the performance of the FCM algorithm and can be subject to optimization. The distance used in the FCM is the Euclidean one or one of its variants, in particular its weighted version. The iterative updates of the partition matrix and the prototypes are realized as follows [9]:

u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( \frac{\| x_i - v_j \|}{\| x_i - v_k \|} \right)^{\frac{2}{m-1}}},  i = 1, ..., N; j = 1, ..., c    (2)

v_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} x_i}{\sum_{i=1}^{N} u_{ij}^{m}},  j = 1, ..., c    (3)

The algorithm is terminated once the condition \max_{i,j} | u_{ij}^{(k+1)} - u_{ij}^{(k)} | \leq \varepsilon is satisfied, where \varepsilon is a certain nonnegative threshold value and k denotes the index of the successive iteration of the algorithm.
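For illustration, the following is a minimal NumPy sketch of this update scheme (our code, not the authors' implementation); it iterates equations (2) and (3) until the termination criterion is met:

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-6, max_iter=200, seed=None):
    """Minimal Fuzzy C-Means following Eqs. (1)-(3); X is an (N, n) data array."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # rows of the partition matrix sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]                 # prototypes, Eq. (3)
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # squared Euclidean distances
        inv = np.fmax(d2, 1e-12) ** (-1.0 / (m - 1.0))           # guard against zero distance
        U_new = inv / inv.sum(axis=1, keepdims=True)             # memberships, Eq. (2)
        if np.max(np.abs(U_new - U)) <= eps:                     # termination criterion
            return V, U_new
        U = U_new
    return V, U
```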

An alternative to the generic fuzzy clustering comes in the form of kernel-based fuzzy clustering techniques [12, 26, 27, 28]. Here the data lying in the original feature space are explicitly transformed into a new feature space via user-specified kernel functions. There are two commonly encountered categories of kernel-based fuzzy clustering. In the first category, we construct prototypes in the feature space; these methods are referred to as KFCM-F (where F stands for the feature space). In the other option, named KFCM-K (where K stands for the kernel space), one retains the location of the prototypes in the kernel space. The forms of several commonly used kernels are listed in Table I. KFCM-F aims to minimize the following objective function:

J_{kfcm-f} = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ij}^{m} \| \phi(x_i) - \phi(v_j) \|^2    (4)

where \phi denotes the mapping from the original feature space to the kernel space. The dimensionality of the kernel space is much higher than the dimensionality of the original feature space. Suppose that we use the Euclidean distance here. Then the distance \| \phi(x_i) - \phi(v_j) \|^2 combined with Mercer's representation of kernels gives rise to the expression [27]:

\| \phi(x_i) - \phi(v_j) \|^2 = \phi(x_i)^T \phi(x_i) + \phi(v_j)^T \phi(v_j) - 2 \phi(x_i)^T \phi(v_j) = K(x_i, x_i) + K(v_j, v_j) - 2 K(x_i, v_j)    (5)

where K stands for a certain kernel function. The processing realized by KFCM-F is done in an iterative manner: a partition matrix is randomly initialized at the beginning of the entire process, and the prototypes are updated iteratively until a certain stopping criterion has been satisfied. Unlike KFCM-F, which retains the prototypes in the original feature space, KFCM-K constructs prototypes positioned in the kernel space. It minimizes the following objective function [29]:

J_{kfcm-k} = \sum_{i=1}^{N} \sum_{j=1}^{c} u_{ij}^{m} \| \phi(x_i) - v_j \|^2    (6)

This means that the prototypes constructed in the kernel space have to be approximated by an inverse mapping to the original feature space. The detailed way of updating the partition matrix and the prototypes is presented in [29].

The obvious disadvantage of KFCM-K compared with FCM and KFCM-F is its high computational complexity: the overall computational complexity of KFCM-K is O(cN^2 n), while for FCM and KFCM-F it is O(cNn), where c is the number of prototypes and N stands for the total number of data instances in the n-dimensional space [12].

3. Transformation methods and transformation strategies

In this section, we elaborate on two methods to transform the original data. Our intention is to nonlinearly transform the data x_i into vectors positioned in a space of the same dimensionality, say y_i ∈ R^n, i = 1, 2, ..., N.

3.1. Transformation methods

Two main methods are considered here, the main difference between them being the form of the proposed mapping.

Transformation of data using a piecewise linear function

The piecewise linear transformation reads as follows:

y_i = \Phi(x_i) = [\phi_1(x_{i1}), \phi_2(x_{i2}), ..., \phi_n(x_{in})]^T    (7)

where x_{ik} (k = 1, 2, ..., n) is the value of the k-th variable of the vector x_i. All coordinates of the mapping (7) are increasing piecewise linear functions. An example of \phi_k is shown in Fig. 2. The transformation function is fully described by a series of pairs of knot points (a_{k1}, b_{k1}), (a_{k2}, b_{k2}), ..., (a_{kp}, b_{kp}), where p stands for the number of knot points of the mapping (a_{k1} < a_{k2} < ... < a_{kp} and b_{k1} < b_{k2} < ... < b_{kp}). \phi_k comes as the mapping \phi_k: [x_{k,min}, x_{k,max}] → [x_{k,min}, x_{k,max}], where x_{k,min} and x_{k,max} are the minimum and maximum values assumed by the k-th variable. Suppose that x_{k,min} = a_{k0} = b_{k0} and x_{k,max} = a_{k(p+1)} = b_{k(p+1)}. Then the detailed calculations proceed as follows:

y_{ik} = b_{kj} + (b_{k(j+1)} - b_{kj}) \frac{x_{ik} - a_{kj}}{a_{k(j+1)} - a_{kj}}  if a_{kj} \leq x_{ik} \leq a_{k(j+1)}    (8)

Obviously, the mapping is invertible, and for any given y_{ik} one can easily produce the corresponding x_{ik}.

Fig. 2. Example of a piecewise linear transformation.
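A compact sketch of the mapping (8) and its inverse is given below (our code, assuming the boundary knots are appended as described above; np.interp performs exactly the knot-to-knot linear interpolation of Eq. (8)):

```python
import numpy as np

def piecewise_map(x, a, b, x_min, x_max):
    """Increasing piecewise linear map of Eq. (8); a, b are the knot coordinates."""
    ax = np.concatenate(([x_min], np.asarray(a, float), [x_max]))  # a_0 = x_min, a_{p+1} = x_max
    bx = np.concatenate(([x_min], np.asarray(b, float), [x_max]))  # b_0 = x_min, b_{p+1} = x_max
    return np.interp(x, ax, bx)

def piecewise_inverse(y, a, b, x_min, x_max):
    """The map is strictly increasing, hence invertible: swap the roles of the knots."""
    ax = np.concatenate(([x_min], np.asarray(a, float), [x_max]))
    bx = np.concatenate(([x_min], np.asarray(b, float), [x_max]))
    return np.interp(y, bx, ax)
```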

Transformation of data using a combination of logistic functions

We consider a mapping expressed using the following logistic (sigmoidal) function:

z(x) = \frac{1}{1 + \exp(-\alpha (x - s))}    (9)

where \alpha is a positive steepness parameter and s denotes the location (translation) parameter, at the value of which the logistic function is equal to 0.5. The value of s lies in the interval [x_min, x_max], where x_min and x_max are the minimum and maximum values assumed by the variable to be transformed, respectively. The proposed transformation function is then constructed as follows:

\psi(x) = \frac{\sum_{j=1}^{p} w_j z_j(x)}{\sum_{j=1}^{p} w_j}    (10)

where p is the number of logistic functions and w_j ∈ [0, 1] is the weight (step) of the j-th logistic function. To visualize this mapping, plots of the transformation function \psi(x) for some selected values of \alpha_j, s_j, and w_j when p = 3 and p = 5 are shown in Fig. 3; in this case, \alpha_j is randomly generated within the interval [0.05, 10]. To form the nonlinear transformation for the k-th variable of x_i, we construct:

y_{ik} = x_{k,min} + \psi_k(x_{ik}) (x_{k,max} - x_{k,min}),  x_{k,min} \leq x_{ik} \leq x_{k,max}    (11)

where \psi_k stands for the nonlinear transformation discussed above. The logistic nonlinear transformations are applied to the individual variables, yielding the mapping:

y_i = \Psi(x_i) = [\psi_1(x_{i1}), \psi_2(x_{i2}), ..., \psi_n(x_{in})]^T    (12)

Fig. 3. Nonlinear transformation functions generated by a combination of logistic functions; the number of logistic functions varies from 3 (with weights 0.8, 0.4, and 0.5) to 5 (with weights 0.6, 0.8, 0.1, 0.3, and 0.2, respectively). Thin lines: individual logistic functions; thick line: their combination (10).
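Equations (9)-(11) can be realized for one variable along the lines of the following sketch (our code; parameter names mirror the text):

```python
import numpy as np

def logistic_map(x, alpha, s, w, x_min, x_max):
    """Combination-of-logistic-functions transform of one variable, Eqs. (9)-(11)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    alpha, s, w = (np.asarray(v, dtype=float) for v in (alpha, s, w))
    z = 1.0 / (1.0 + np.exp(-alpha[None, :] * (x[:, None] - s[None, :])))  # Eq. (9)
    psi = (z @ w) / w.sum()                 # Eq. (10): weighted combination, values in [0, 1]
    return x_min + psi * (x_max - x_min)    # Eq. (11): rescale to the variable's range
```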

The visible feature of the transformation is its nonlinear character and the regions of the space where the slopes of this transformation differ. Regions of the space where the slope is very high help "spread out" the data located there that belong to different classes, subsequently making any ensuing interpretation or classification activities more manageable. In contrast, in regions where the slope is low, data located there that belong to the same class become "condensed". Obviously, these data need not be distinguished, as they belong to the same class and in this sense are treated as indistinguishable. Let us look at an illustrative one-dimensional case which includes two classes (marked by circles and stars) with a high overlap between the data belonging to the two classes. We apply both the piecewise transformation and the logistic transformation, as shown in Fig. 4. It is evident that after the transformation, in the region over which the mapping has a high slope, the data are "stretched" (distributed); in the region where the slope is quite low, the data belonging to the same class become more "condensed". It is also worth noting that the nonlinear mapping operating in the same space offers advantages in terms of interpretability of the transformation. The transformation itself could be investigated in various categories of problems such as, for example, linearization or multi-linearization [24] or design enhancements of neural networks [25].



Fig. 4. Applying the nonlinear transformation to one-dimensional data with high overlap: piecewise transformation (a) and logistic transformation (b).

3.2. Transformation strategies

Two strategies are considered to transform the data; they relate to the level of flexibility with which the transformation mappings are endowed.

Application of the same transformation function to all variables

Here we have the mappings \phi_1, \phi_2, ..., \phi_n: [x_min, x_max] → [x_min, x_max]. In this case, the p cutoff points are generated in the range [x_min, x_max], where x_min = min(x_{1,min}, x_{2,min}, ..., x_{n,min}) and x_max = max(x_{1,max}, x_{2,max}, ..., x_{n,max}). The parameters of the mapping \Phi are set up as A_same_p = {(a_1, b_1), (a_2, b_2), ..., (a_p, b_p)}, where a_k = a_{1k} = a_{2k} = ... = a_{nk} and b_k = b_{1k} = b_{2k} = ... = b_{nk} (k = 1, 2, ..., p). The objective is to optimize these parameters so that the transformed data achieve the best clustering quality (classification error or reconstruction error; these two alternatives will be discussed in detail). For the nonlinear transformation using a combination of logistic functions, we have \psi_1, \psi_2, ..., \psi_n: [x_min, x_max] → [x_min, x_max], with x_min and x_max determined as above. In this case, the parameter vector is A_same_l = {(\alpha_1, s_1, w_1), (\alpha_2, s_2, w_2), ..., (\alpha_p, s_p, w_p)}, where \alpha_k = \alpha_{1k} = \alpha_{2k} = ... = \alpha_{nk}, s_k = s_{1k} = s_{2k} = ... = s_{nk}, and w_k = w_{1k} = w_{2k} = ... = w_{nk} (k = 1, 2, ..., p).

Application of different transformation functions to the individual variables

We apply different mappings to the individual variables. The parameters of \Phi for the piecewise nonlinear transformation can be represented in the form A_diff_p = {(a_{11}, b_{11}), (a_{12}, b_{12}), ..., (a_{1p}, b_{1p}), (a_{21}, b_{21}), (a_{22}, b_{22}), ..., (a_{2p}, b_{2p}), ..., (a_{n1}, b_{n1}), (a_{n2}, b_{n2}), ..., (a_{np}, b_{np})}. Furthermore, for the nonlinear transformation using a combination of logistic functions, we have A_diff_l = {(\alpha_{11}, s_{11}, w_{11}), (\alpha_{12}, s_{12}, w_{12}), ..., (\alpha_{1p}, s_{1p}, w_{1p}), (\alpha_{21}, s_{21}, w_{21}), (\alpha_{22}, s_{22}, w_{22}), ..., (\alpha_{2p}, s_{2p}, w_{2p}), ..., (\alpha_{n1}, s_{n1}, w_{n1}), (\alpha_{n2}, s_{n2}, w_{n2}), ..., (\alpha_{np}, s_{np}, w_{np})}. In this study, when applying different mappings to different variables, we assume that the number of cutoff points of each piecewise transformation function, or the number of logistic functions combined in one nonlinear transformation function, is the same for each coordinate. Obviously, they could be selected differently, which might result in better performance achieved at the cost of higher computing overhead.

4. Optimization of FCM with transformed data

The quality of the transformation of the original data depends on the choice of the transformation function. In what follows, two performance indexes and the ensuing optimization process are discussed.

4.1. Classification error rate

Once the clusters have been formed (through the constructed partition matrix), we determine all data belonging to the i-th cluster X_i as consisting of those patterns characterized by the highest membership value in this cluster, namely:

X_i = \{ x_k \mid u_{ki} = \max_{j=1,2,...,c} u_{kj} \}    (13)

Then we count the patterns in X_i which belong to the individual classes (say the problem under discussion comes with r classes). In this way, we obtain the counts n_{i1}, n_{i2}, ..., n_{ir}, whose overall sum N_i = n_{i1} + n_{i2} + ... + n_{ir} is the cardinality of X_i. Considering that one of the classes is dominant in the cluster (viz. it assumes the highest value of n_{ij}), we label X_i with this class. The number of misclassified data in X_i is then

M_i = \sum_{k=1, k \neq w_i}^{r} n_{ik}    (14)

where w_i denotes the index of the dominant class of the i-th cluster. The same calculations are completed for all other clusters. Overall we obtain the sum M_1 + M_2 + ... + M_c, and the total classification error is then defined as follows:

Q_class = \frac{1}{N} \sum_{j=1}^{c} M_j    (15)
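In code, the labeling and counting of (13)-(15) can be sketched as follows (our code; U is the partition matrix produced by FCM and labels is an integer array of the true classes):

```python
import numpy as np

def classification_error(U, labels):
    """Classification error rate, Eqs. (13)-(15)."""
    hard = U.argmax(axis=1)                   # Eq. (13): assign each datum to its top cluster
    miscount = 0
    for i in range(U.shape[1]):
        in_cluster = labels[hard == i]
        if in_cluster.size:
            _, counts = np.unique(in_cluster, return_counts=True)
            miscount += in_cluster.size - counts.max()   # Eq. (14): all but the dominant class
    return miscount / U.shape[0]              # Eq. (15)
```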

The classification error is commonly used in the performance evaluation of clustering results, assessing to which extent the clusters are homogeneous and involve data belonging to a single class. Furthermore, the classification error regarded as a function of the number of clusters helps gain insight into the topology of the data with respect to the geometry of the data belonging to a given class.

4.2. Reconstruction criterion

In [13], the mechanisms of encoding and decoding expressed in the language of fuzzy sets and fuzzy relations are formulated and discussed, which capture the essence of the numeric-granular-numeric transformation. In the encoding mechanism, a numeric pattern is represented (quantified) in the language of information granules, while the conversion from information granules back to numeric results is referred to as decoding. The fuzzy encoding and decoding mechanisms are guided by performance indexes which quantify the departure of the reconstructed \hat{x} from the original numeric entry x. The key objective of these mechanisms is to make the decoded result as close as possible to the original numeric entry used as input to the encoding process. Once FCM has been completed, we reconstruct the original x based on the elements of the codebook (a family of prototypes) and the associated membership degrees. Denoting the reconstructed x by \hat{x}, the reconstruction is accomplished through a weighted combination of the prototypes in the following form:

\hat{x}_i = \frac{\sum_{j=1}^{c} u_{ij}^{m} v_j}{\sum_{j=1}^{c} u_{ij}^{m}}    (16)

where the individual prototypes are weighted by the corresponding membership degrees. Our objective is to minimize the reconstruction performance index:

Q_reconstruction = \sum_{i=1}^{N} \| x_i - \hat{x}_i \|^2    (17)

where ||.|| is the Euclidean distance between the original pattern and its reconstruction.

4.3. PSO in the optimization of the transformation of data

Let us now focus on the parameters of the nonlinear transformation functions. For a piecewise transformation function with p cutoff points, once the p cutoff points have been decided upon, the nonlinear transformation function is determined; therefore the dimensionality of the search space is equal to 2p. For the logistic nonlinear transformation, the corresponding dimensionality of the search space is 3p. When we apply different nonlinear transformations to the individual variables, the dimensionality of the search space is equal to 2pn for the piecewise linear mapping and 3pn for the logistic nonlinear transformation function. The nature of the parameters of the nonlinear transformations implies that gradient-based optimization is not feasible (there is no direct differentiable relationship between the parameters to be optimized and the optimized performance index). A sound alternative is to resort to powerful population-based optimization techniques; Particle Swarm Optimization (PSO) [6, 7], which has been successfully applied in multiple areas, is worth considering here.

PSO is an optimization method able to explore high-dimensional problem spaces in search of optimal candidate solutions. A particle swarm is formed by a number of particles, each representing a possible solution in the search space. At the beginning of the optimization, the particles are initialized randomly in the search space. In successive generations, each particle tends to move towards the best position reported so far by the whole population, named gbest, which here means the best transformation discovered so far, i.e., the one leading to the minimum of the classification error or the reconstruction error. The particle is also guided by its personal best position, denoted pbest, associated with the lowest performance index this particle has attained; the performance index of each particle is evaluated after it has reached a new position. By updating each particle's velocity and position, we realize the search of the problem space in an iterative manner until the best global solution reaches the expected fitness value or the predetermined maximal number of iterations has been reached.

Suppose that the position of the i-th particle of the population in the search space is described by a vector s_i(t), and the velocity of the particle is described by v_i(t), where t denotes the t-th generation. The next position of the particle, obtained in the (t+1)-th generation, combines the current position of the particle and its updated velocity:

s_i(t+1) = s_i(t) + v_i(t+1)    (18)

The update of the velocity of the particle is calculated in the following way:

v_i(t+1) = w v_i(t) + \varphi_1 \otimes (pbest - s_i(t)) + \varphi_2 \otimes (gbest - s_i(t))    (19)

where w is the weighting factor (usually assuming the value 0.5), \varphi_1 and \varphi_2 are vectors of random numbers with entries uniformly generated over the interval [0, 2], and \otimes denotes the multiplication of the corresponding coordinates of the two vectors.

When applying the same piecewise linear transformation function to all variables, the size of the search space for a single cutoff point is \prod_{k=1}^{n} (x_max - x_min), where x_min = min(x_{1,min}, x_{2,min}, ..., x_{n,min}) and x_max = max(x_{1,max}, x_{2,max}, ..., x_{n,max}). When applying different transformation functions to the individual variables, the size of the search space for the cutoff points of the i-th variable is calculated as \prod_{k=1}^{n} (x_{k,max} - x_{k,min}), and the search space can be bounded in the same way as in the previous case. When transforming the data using the combination of logistic functions, the values of \alpha are generated within the interval [0.05, 10], while the values of w range from 0.01 to 1. The value of s lies in the interval [x_min, x_max] when applying the same combination of logistic transformation functions to all variables; otherwise, s lies in the range [x_{k,min}, x_{k,max}], where k is the index of the variable to be transformed.
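A minimal PSO sketch along the lines of (18)-(19) is given below (our code; the fitness function is assumed to decode a particle position into the transformation parameters, e.g., the knot points of A_same_p, run FCM on the transformed data, and return Q_class or Q_reconstruction; the clipping of positions to the search bounds is our addition):

```python
import numpy as np

def pso_minimize(fitness, lower, upper, n_particles=25, n_iter=100, w=0.5, seed=None):
    """Minimal PSO following Eqs. (18)-(19)."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    pos = rng.uniform(lower, upper, size=(n_particles, lower.size))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(n_iter):
        phi1 = rng.uniform(0.0, 2.0, size=pos.shape)   # random vectors with entries in [0, 2]
        phi2 = rng.uniform(0.0, 2.0, size=pos.shape)
        vel = w * vel + phi1 * (pbest - pos) + phi2 * (gbest - pos)   # Eq. (19)
        pos = np.clip(pos + vel, lower, upper)                        # Eq. (18), kept in bounds
        vals = np.array([fitness(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, float(pbest_val.min())
```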

5. Experimental studies

In this section, a series of experiments using a synthetic data set and several data sets coming from the Machine Learning Repository [14] is presented to quantify the effectiveness of the approach.

5.1. Usage of classification error rate as performance index

First, we report the result of an experiment completed for an illustrative 2-D synthetic data set, displayed in Fig. 5 (a). It can be seen from the figure that there are two classes of data, represented by asterisks and squares, respectively. The FCM returns the prototypes v_1 = [2.27 2.88]^T and v_2 = [6.10 6.14]^T, while the obtained classification error rate (15) is 6%. The prototypes produced by FCM are represented by circles. We apply the same transformation to both variables, with the number of cutoff points set to 4. After PSO optimization, we form the piecewise linear transformation shown in Fig. 6 (a). The transformed data are visualized in Fig. 5 (b). Here the FCM algorithm returns the prototypes v_1 = [0.83 1.27]^T and v_2 = [6.47 6.83]^T. As a result, the classification error rate has been reduced to zero. The solid lines in Fig. 5 represent the boundaries between the two classes (points located on such a line have the same membership values in both clusters): the data points on one side of the line belong to one class, while those on the other side belong to the other class. The PSO algorithm was run for 100 iterations with 25 particles; the piecewise transformation function formed by the PSO algorithm is shown in Fig. 6 (a), and the cutoff (knot) points of the common transformation of both variables are [4.02 2.02]^T, [4.92 4.39]^T, [5.11 8.03]^T, and [9.29 8.43]^T.

We also apply different transformations to the individual variables. The resulting piecewise linear transformation functions are shown in Fig. 6 (b); the cutoff points are [4.16 0.65]^T, [5.28 8.01]^T, [7.13 8.14]^T, [8.77 8.51]^T for the first variable and [1.34 0.13]^T, [4.76 2.84]^T, [5.68 8.00]^T, [6.72 9.67]^T for the second variable. It is apparent from Fig. 5 (c) that different transformations applied to different variables lead to more disjoint clusters than the same transformation applied to all variables, and thus can improve the classification performance. In light of the data shown in Fig. 5, the advantages of FCM operating on the transformed data become evident. It is also apparent that the transformation positions data of different classes further apart from each other, while the data belonging to the same class are made more "condensed".

Fig. 5. Original 2-D synthetic data set with two classes (a); transformed data leading to the reduced classification error rate using the same piecewise linear transformation for both variables (b) and using different piecewise linear transformations applied to the individual variables (c).

Fig. 6. Piecewise linear transformation function producing the lowest classification error rate (transformed data set of Fig. 5 (b)), with the cutoff points represented by circles (a), and different transformation functions for the individual variables leading to the best classification error rate (transformed data set of Fig. 5 (c)) (b).

The 10-fold cross-validation [15, 16, 17] is used to estimate the effectiveness of the proposed method. This validation scheme was completed on the original data set and on the transformed data set. In each round of the 10-fold cross-validation on the transformed data, the training set containing 9 splits of the original data was used to run the FCM algorithm and to find the best transformation function and the c centroids (c is made equal to the number of classes of the data set).
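One cross-validation round on the transformed data can then be sketched as follows (our reading of the protocol; fcm, classification_error, and pso_minimize refer to the sketches above, and decode is a user-supplied function turning a particle position into a transformation):

```python
import numpy as np

def cv_round(X_tr, y_tr, X_te, y_te, c, lower, upper, decode):
    """One fold: PSO optimizes the transformation on the training split,
    FCM is run on the transformed data, and the error is evaluated on the test split."""
    def fitness(params):
        V, U = fcm(decode(params)(X_tr), c, seed=0)
        return classification_error(U, y_tr)
    best, _ = pso_minimize(fitness, lower, upper)
    transform = decode(best)
    V, _ = fcm(transform(X_tr), c, seed=0)
    # memberships of the transformed test data in the trained clusters (Eq. (2), m = 2)
    d2 = ((transform(X_te)[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    inv = 1.0 / np.fmax(d2, 1e-12)
    U_te = inv / inv.sum(axis=1, keepdims=True)
    return classification_error(U_te, y_te)
```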



For comparative analysis, we report the results of the FCM with transformed data and compare them with the results obtained for the original data and with the results of kernel-based FCM. A comprehensive comparative analysis of kernel-based fuzzy clustering and fuzzy clustering was conducted in [12]; it showed that the kernel-based FCM algorithm produced only a marginal improvement over standard FCM for most of the data sets involved. In this paper, we compare with generic FCM and with one typical kernel-based fuzzy clustering, KFCM-F (with the prototypes positioned in the feature space) [26, 27]. Two types of kernels, the Gaussian kernel and the polynomial kernel, are utilized when running KFCM-F. As indicated in [12], the overall performance of kernel-based FCM is not impressive with regard to the obtained classification error rate when compared with the results produced by the FCM algorithm; the quality of the results is sensitive to the selection of the kernel function and the optimization of the kernel parameters, and in many cases there is no significant, or only a slight, improvement of the kernel-based method over FCM. As the performance of KFCM-F is sensitive to the selection of the parameters of the kernel, we ran KFCM-F with different combinations of parameters and recorded the best performance.

A number of data sets coming from the UCI Machine Learning Repository [14] were experimented with. The results of the classification error rate with 10-fold cross-validation for the kernel-based fuzzy clustering are reported in Table II. In our experiments concerning kernel-based fuzzy clustering, the fuzzification coefficient m was initialized at 1.1 and then increased with a step of 0.1 until reaching the upper limit of 5. We experimented with values of the parameter σ (used in the Gaussian kernel) ranging from 0.5 to 15, sweeping across this range with a step of 0.5. When using the polynomial kernel, the values of p swept from 1 to 20, while the values of θ varied in the range [0.5, 30]. For each combination of these parameters we obtained a different classification error rate, and the best classification error rates and the corresponding kernel parameters were recorded.

Tables III-VI include the results of clustering obtained for the transformed data after PSO optimization. We applied the two types of nonlinear transformation functions introduced in Section 3, the piecewise linear transformation and the combination-of-logistic-functions transformation. For each transformation, two strategies are considered: using the same nonlinear transformation for all variables and using different nonlinear transformations for different variables. In this study, when running the PSO algorithm, the size of the population is specified to be 5 times the dimensionality of the search space. The maximum number of generations is set to 100, which is sufficient to ensure the convergence of the PSO algorithm; in most cases, the optimization algorithm converges after 50-60 iterations.



It is noticeable that both the training error rate and the validation error rate on the transformed data are much lower than those on the original data, and the performance of FCM with transformed data is better than that of kernel-based fuzzy clustering in most cases. After the transformation of the original data set, data points of the same cluster are much closer to each other, while data points of different clusters tend to be kept away from each other; this makes it easier for FCM to determine the prototypes and reduces the resulting classification error. Some common trends are present in the obtained results. When using piecewise nonlinear transformations, the higher the number of cutoff points used, the lower the classification error rate becomes. As expected, the performance when using different transformations for different variables is better than when using the same transformation for all variables. The performance of the piecewise nonlinear transformation is slightly better than that of the logistic nonlinear transformation. There is also a cost associated with increasing the number of cutoff points: because the search space is expanded, it may take more time for the PSO algorithm to converge.

Fig. 7. Values of the performance index (classification error rate) produced in successive PSO iterations for the Iris data; the number of cutoff points of the nonlinear transformation is set to 4.

The PSO algorithm delivers tangible benefits in improving the performance index. For example, with the use of the same piecewise linear transformation for all variables, the classification error rate for the Iris data set dropped from 11% to 3%. The values of the classification error obtained in the successive generations for the Iris data set are shown in Fig. 7; in this case, the cutoff points of the optimized piecewise nonlinear function applied to all variables are shown in Fig. 8. The piecewise linear transformation functions for the banknote authentication data set, with the number of cutoff points equal to 2 and different transformations applied to different variables, are shown in Fig. 9. These transformations have helped the FCM to reduce the classification error rate from 39% to 10%. The effect of using different transformations for the individual variables is more profound than that of using the same transformation for all variables, in which case the classification error rate was reduced from 39% to 28%.







Fig. 8. Cutoff points of the piecewise nonlinear transformation function applied to all variables – Iris data.

Fig. 9. Piecewise transformations for the individual variables of the banknote authentication data; the number of cutoff points = 2.

According to the testing results shown in Tables III-VI, transformation functions with more cutoff points lead to a lower classification error rate. When using different piecewise linear transformations of the variables, two cutoff points are enough to transform the data nonlinearly and yield a good classification rate for some data sets; the classification error rates when using two cutoff points are almost as good as those obtained with six cutoff points. However, for the data sets User Knowledge Modeling, Vertebral Column, Banknote Authentication, QSAR Biodegradation, Statlog (Heart), Ionosphere, and Breast Tissue, the improvement of the classification results becomes more visible when using more cutoff points: the improvement in classification accuracy when using six cutoff points over using two cutoff points is in the range of 5%-42%. Let us consider a piecewise transformation function with several cutoff points. Evidently, the different straight-line segments of this function have different slopes. Having more cutoff points means a higher flexibility in reshaping the original data set, and thus more opportunity to bring data points of the same class closer together and to move data points of different classes further apart, which results in a better classification outcome.



The independent two-sample t-test is applied here to check whether the differences are significant [38]; each time the t-test is run, 20 samples are collected. When applying the independent two-sample t-test to the classification error rate, there is statistical evidence that the average classification error rate of FCM with the nonlinear transformation is much better than that of the KFCM algorithm (the p-value is less than 0.05). To visualize the effectiveness of the proposed method, the best classification error rates on the testing data acquired by using FCM, kernel-based FCM, and FCM with transformed data are shown as grouped bars in Fig. 10. The results presented in this figure clearly demonstrate the effectiveness of FCM with transformed data.

Fig. 10. Best (lowest) classification error rates obtained when using FCM, kernel-based FCM, and FCM with transformed data.

5.2. The use of reconstruction error

First, we report the result of an experiment completed for the 2-D synthetic data set used in the previous experiment. The reconstruction error (17) obtained for the original data when c = 2 is 639. We apply the same transformation to both variables, with the number of cutoff points equal to 4. After PSO optimization, we obtain the piecewise linear transformation shown in Fig. 12. The transformed data are visualized in Fig. 11 (b). Here the FCM algorithm returns the prototypes v_1 = [7.02 7.52]^T and v_2 = [2.29 2.43]^T. As a result, the reconstruction error has been reduced to 603. The parameters of the PSO algorithm are the same as in the first experiment, and the cutoff points of this transformation are [1.96 1.90]^T, [5.25 3.58]^T, [5.32 6.53]^T, and [9.37 8.90]^T.

To demonstrate the effectiveness of the piecewise linear transformation in reducing the reconstruction error, we also report on a series of numeric experiments concerning data sets coming from the Machine Learning Repository [14]. To gain confidence in the obtained results, each experiment was repeated 10 times. The average values of the reconstruction error and the standard deviations produced by the FCM with transformed data are reported in Table IX, while the reconstruction errors and standard deviations of the FCM on the original data sets are shown in Table VII. The reconstruction errors and standard deviations of kernel-based fuzzy clustering are shown in Table VIII. In the experiments, the number of prototypes c was set to the number of classes. When running kernel-based fuzzy clustering, the fuzzification coefficient m was set to 2.0.



Fig. 11. Original 2-D synthetic data set with two classes (a) and the transformed data leading to the minimal reconstruction error (b).

Fig. 12. Piecewise linear transformation function leading to the lowest reconstruction error on the synthetic data set, with the cutoff points represented by circles.

Tables VII, VIII, and IX show the effect of the FCM with transformed data on the reconstruction error and the effect of different numbers of cutoff points of the transformation function on the improvement of the reconstruction error. By inspecting these tables, it is evident that the use of the optimized transformed data has led to a lower reconstruction error. As the number of cutoff points of the nonlinear transformation function increases, the reconstruction error becomes further reduced. The kernel-based clustering algorithms provide an improvement on some data sets, such as User Knowledge Modeling, Banknote Authentication, Fertility, and Liver Disorders, but on the other data sets FCM works better than the kernel-based clustering algorithms. Interestingly, FCM with transformed data always yields a lower reconstruction error than FCM. The improvement in terms of the reconstruction error on User Knowledge Modeling, Pima Indians Diabetes, Fertility, Liver Disorders, and Blood Transfusion Service is over 20%.

6. Conclusions

In this study, we have developed a novel clustering method, FCM with transformed data. Two data transformation methods and two optimization criteria, i.e., a classification error rate and a reconstruction error, were introduced. After applying the nonlinear transformation, data instances belonging to different clusters (classes) become more separated, thus decreasing the classification error rate or the reconstruction error. It has been shown that the FCM algorithm with transformed data yields better classification accuracy when compared with the results obtained when considering the generic version of the FCM and the KFCM algorithms. Using different transformations for different variables performs better than using the same transformation for all variables, and more cutoff points lead to a better classification outcome. The FCM with transformed data is also helpful in significantly reducing the reconstruction error. The Particle Swarm Optimization algorithm has been utilized to optimize the nonlinear transformation. FCM with transformed data achieved much higher clustering accuracy and a lower reconstruction error compared with the results produced by the FCM algorithm operating in the original data space. The effectiveness of this new method has been demonstrated through a series of experiments; it has been shown that the proposed method outperforms generic FCM and traditional kernel-based fuzzy clustering algorithms.

Some future studies worth pursuing might focus on improving the FCM with transformed data by including mechanisms of supervision to optimize the number of cutoff points and the form of the transformation functions. Another problem is that the PSO mechanism comes with some computing overhead; hence mechanisms that could help reduce the processing time needed to construct the optimal transformations are worth pursuing.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61374068 and 61472295, the Recruitment Program of Global Experts, and the Science and Technology Development Fund, MSAR, under Grant No. 078/2015/A3.

References

[1] R. Cucchiara, C. Grana, A. Prati, S. Seidenari, and G. Pellacani, "Building the topological tree by recursive FCM color clustering," in: Proceedings of the International Conference on Pattern Recognition, ICPR, vol. 1, pp. 759-762, 2002.
[2] M. Li and R. C. Staunton, "A modified fuzzy C-means image segmentation algorithm for use with uneven illumination patterns," Pattern Recognition, vol. 40, no. 11, pp. 3005-3011, 2007.
[3] P. L. Lin, P. W. Huang, C. H. Kuo, and Y. H. Lai, "A size-insensitive integrity-based fuzzy c-means method for data clustering," Pattern Recognition, vol. 47, no. 5, pp. 2041-2056, 2014.
[4] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, vol. 3, no. 3, pp. 32-57, 1973.
[5] J. C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Kluwer Academic Publishers, Norwell, MA, USA, 1981.
[6] J. Kennedy and R. C. Eberhart, "Particle swarm optimization," in: Proceedings of the IEEE International Conference on Neural Networks, IEEE Press, Piscataway, NJ, vol. 4, pp. 1942-1948, Nov. 1995.
[7] S. Paterlini and T. Krink, "Differential evolution and particle swarm optimisation in partitional clustering," Computational Statistics & Data Analysis, vol. 50, no. 5, pp. 1220-1247, Mar. 2006.
[8] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A k-means clustering algorithm," Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, no. 1, pp. 100-108, 1979.
[9] J. C. Bezdek, R. Ehrlich, and W. Full, "FCM: The fuzzy c-means clustering algorithm," Computers & Geosciences, vol. 10, no. 2-3, pp. 191-203, 1984.
[10] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780-784, May 2002.
[11] D. Q. Zhang and S. C. Chen, "Clustering incomplete data using kernel-based fuzzy C-means algorithm," Neural Processing Letters, vol. 18, no. 3, pp. 155-162, Dec. 2003.
[12] D. Graves and W. Pedrycz, "Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study," Fuzzy Sets and Systems, vol. 161, no. 4, pp. 522-543, Feb. 2010.
[13] W. Pedrycz and J. V. de Oliveira, "A development of fuzzy encoding and decoding through fuzzy clustering," IEEE Transactions on Instrumentation and Measurement, vol. 57, no. 4, pp. 829-837, Apr. 2008.
[14] M. Lichman, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, University of California, Irvine, School of Information and Computer Sciences, 2013.
[15] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in: Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Mateo, CA, vol. 2, no. 12, pp. 1137-1143, 1995.
[16] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, London, GB, 1982.
[17] G. J. McLachlan, K.-A. Do, and C. Ambroise, Analyzing Microarray Gene Expression Data, Wiley, 2004.
[18] M. S. Chen, J. W. Han, and P. S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 866-883, Dec. 1996.
[19] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review," ACM Computing Surveys, vol. 31, no. 3, pp. 264-323, Sep. 1999.
[20] C. H. Wu, C. Ouyang, L. Chen, and L. W. Lu, "A new fuzzy clustering validity index with a median factor for centroid-based clustering," IEEE Transactions on Fuzzy Systems, vol. 23, no. 3, pp. 701-718, Jun. 2015.
[21] H. Izakian, W. Pedrycz, and I. Jamal, "Clustering spatiotemporal data: an augmented fuzzy c-means," IEEE Transactions on Fuzzy Systems, vol. 21, no. 5, pp. 855-867, Oct. 2013.
[22] H. B. Cao, H. W. Deng, and Y. P. Wang, "Segmentation of M-FISH images for improved classification of chromosomes with an adaptive fuzzy c-means clustering algorithm," IEEE Transactions on Fuzzy Systems, vol. 20, no. 1, pp. 1-9, Feb. 2012.
[23] D. M. Tsai and C. C. Lin, "Fuzzy c-means based clustering for linearly and nonlinearly separable data," Pattern Recognition, vol. 44, no. 8, pp. 1750-1760, 2011.
[24] A. Pedrycz, F. Y. Dong, and K. Hirota, "Representation of neural networks through their multi-linearization," Neurocomputing, vol. 74, no. 17, pp. 2852-2860, Oct. 2011.
[25] A. Pedrycz, F. Y. Dong, and K. Hirota, "Nonlinear mappings in problem solving and their PSO-based development," Information Sciences, vol. 181, no. 19, pp. 4112-4123, Oct. 2011.
[26] H. Shen, J. Yang, S. Wang, and X. Liu, "Attribute weighted Mercer kernel based fuzzy clustering algorithm for general non-spherical datasets," Soft Computing, vol. 10, no. 11, pp. 1061-1073, Sep. 2006.
[27] M. Girolami, "Mercer kernel-based clustering in feature space," IEEE Transactions on Neural Networks, vol. 13, no. 3, pp. 780-784, May 2002.
[28] A. Bouchachia and W. Pedrycz, "Enhancement of fuzzy clustering by mechanisms of partial supervision," Fuzzy Sets and Systems, vol. 157, no. 13, pp. 1733-1759, Jul. 2006.
[29] J. H. Chiang and P. Y. Hao, "A new kernel-based fuzzy clustering approach: support vector clustering with cell growing," IEEE Transactions on Fuzzy Systems, vol. 11, no. 4, pp. 518-527, Aug. 2003.
[30] K. P. Lin, "A novel evolutionary kernel intuitionistic fuzzy c-means clustering algorithm," IEEE Transactions on Fuzzy Systems, vol. 22, no. 5, pp. 1074-1087, Oct. 2014.
[31] T. C. Havens, J. C. Bezdek, C. Leckie, L. O. Hall, and M. Palaniswami, "Fuzzy c-means algorithms for very large data," IEEE Transactions on Fuzzy Systems, vol. 20, no. 6, pp. 1130-1146, Dec. 2012.
[32] J. V. de Oliveira and W. Pedrycz, Advances in Fuzzy Clustering and Its Applications, John Wiley & Sons, 2007.
[33] W. Pedrycz, Knowledge-Based Fuzzy Clustering: From Data to Information Granules, John Wiley, New York, 2005.
[34] W. Pedrycz, Granular Computing: Analysis and Design of Intelligent Systems, CRC Press/Francis Taylor, Boca Raton, 2013.
[35] L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley-Blackwell, New York, 2005.
[36] X. B. Zhu, W. Pedrycz, and Z. W. Li, "Granular data description: designing ellipsoidal information granules," IEEE Transactions on Cybernetics, doi: 10.1109/TCYB.2016.2612226, 2016.
[37] X. B. Zhu, W. Pedrycz, and Z. W. Li, "Granular encoders and decoders: A study in processing information granules," IEEE Transactions on Fuzzy Systems, doi: 10.1109/TFUZZ.2016.2598366, 2016.
[38] D. W. Zimmerman, "A note on interpretation of the paired-samples t test," Journal of Educational and Behavioral Statistics, vol. 22, no. 3, pp. 349-360, 1997.




Table I. Selected kernel functions

Kernel function      Description
Gaussian (G)         K(x_i, v_j) = exp(-||x_i - v_j||^2 / σ^2), σ^2 > 0
Polynomial (P)       K(x, y) = (x^T y + θ)^p, θ ≥ 0, p ∈ N
Hyper-tangent (H)    K(x, y) = tanh(x^T y + θ), θ ≥ 0



Table II. Results of classification error rate (training / testing, 10-fold cross-validation) of FCM and kernel-based fuzzy clustering. Each row lists the data set (number of data, number of variables); the best kernel parameters are given in parentheses.

Iris (150, 4): FCM 0.12±0.02 / 0.13±0.02; KFCM-F (G) 0.04±0.00 / 0.05±0.01 (m=2.5, c=3, σ²=12.1); KFCM-F (P) 0.05±0.01 / 0.05±0.01 (m=1.5, c=3, θ=28, p=16)
User Knowledge Modeling (258, 5): FCM 0.47±0.02 / 0.49±0.01; KFCM-F (G) 0.37±0.02 / 0.38±0.03 (m=2.3, c=4, σ²=132); KFCM-F (P) 0.33±0.02 / 0.34±0.02 (m=2.3, c=4, θ=0.5, p=15)
Seeds (210, 7): FCM 0.15±0.03 / 0.15±0.03; KFCM-F (G) 0.09±0.00 / 0.10±0.01 (m=2.2, c=3, σ²=72.5); KFCM-F (P) 0.08±0.00 / 0.10±0.02 (m=2.7, c=3, θ=3.5, p=14)
Vertebral Column (310, 6): FCM 0.28±0.03 / 0.29±0.01; KFCM-F (G) 0.28±0.02 / 0.29±0.02 (m=1.5, c=2, σ²=6.25); KFCM-F (P) 0.17±0.01 / 0.19±0.01 (m=2.3, c=2, θ=4.5, p=7)
Banknote Authentication (1372, 4): FCM 0.39±0.01 / 0.39±0.02; KFCM-F (G) 0.05±0.01 / 0.06±0.02 (m=2.7, c=2, σ²=144); KFCM-F (P) 0.08±0.01 / 0.10±0.01 (m=3.1, c=2, θ=15.5, p=9)
QSAR Biodegradation (1055, 41): FCM 0.34±0.00 / 0.34±0.01; KFCM-F (G) 0.21±0.02 / 0.22±0.02 (m=2.4, c=2, σ²=121); KFCM-F (P) 0.27±0.00 / 0.26±0.01 (m=1.1, c=2, θ=1.5, p=2)
Statlog (Heart) (270, 13): FCM 0.38±0.03 / 0.39±0.01; KFCM-F (G) 0.22±0.00 / 0.23±0.01 (m=1.4, c=2, σ²=50); KFCM-F (P) 0.27±0.00 / 0.28±0.02 (m=1.5, c=2, θ=3.5, p=5)
Ionosphere (351, 34): FCM 0.31±0.01 / 0.30±0.01; KFCM-F (G) 0.32±0.00 / 0.33±0.01 (m=1.1, c=6, σ²=0.25); KFCM-F (P) 0.21±0.02 / 0.23±0.02 (m=2.7, c=6, θ=12.5, p=1)
Breast Tissue (106, 9): FCM 0.54±0.03 / 0.55±0.03; KFCM-F (G) 0.43±0.00 / 0.45±0.01 (m=2.2, c=2, σ²=4); KFCM-F (P) 0.29±0.01 / 0.31±0.01 (m=2.3, c=2, θ=22, p=6)
Connectionist Bench (208, 60): FCM 0.43±0.01 / 0.46±0.02; KFCM-F (G) 0.35±0.00 / 0.37±0.01 (m=1.7, c=3, σ²=72); KFCM-F (P) 0.28±0.01 / 0.29±0.01 (m=3.6, c=3, θ=6.5, p=6)
Waveform Database Generator (5000, 21): FCM 0.45±0.00 / 0.46±0.01; KFCM-F (G) 0.31±0.01 / 0.32±0.01 (m=2.7, c=6, σ²=42); KFCM-F (P) 0.21±0.01 / 0.22±0.02 (m=4.1, c=3, θ=7.5, p=7)
Statlog (Landsat Satellite) (6435, 36): FCM 0.26±0.01 / 0.28±0.01; KFCM-F (G) 0.24±0.01 / 0.25±0.02 (m=1.1, c=6, σ²=0.25); KFCM-F (P) 0.30±0.01 / 0.31±0.02 (m=3.1, c=6, θ=9.5, p=5)
Spambase (4601, 57): FCM 0.19±0.02 / 0.20±0.02; KFCM-F (G) 0.07±0.00 / 0.08±0.01 (m=1.4, c=2, σ²=40); KFCM-F (P) 0.09±0.00 / 0.09±0.01 (m=2.1, c=2, θ=9.5, p=7)
Breast Cancer Wisconsin (569, 30): FCM 0.09±0.00 / 0.10±0.01; KFCM-F (G) 0.08±0.00 / 0.09±0.01 (m=2.5, c=3, σ²=121); KFCM-F (P) 0.05±0.01 / 0.06±0.01 (m=2.1, c=3, θ=5, p=7)

Table III. Results of 10-fold cross-validation when using the same piecewise linear transformation for all variables (training / testing set error rates)

Dataset | 2 cutoff points | 4 cutoff points | 6 cutoff points | 8 cutoff points
Iris | 0.04±0.01 / 0.04±0.01 | 0.03±0.01 / 0.04±0.01 | 0.04±0.01 / 0.04±0.01 | 0.04±0.01 / 0.04±0.01
User Knowledge Modeling | 0.32±0.01 / 0.35±0.03 | 0.31±0.02 / 0.33±0.02 | 0.32±0.02 / 0.32±0.02 | 0.30±0.03 / 0.32±0.01
Vertebral Column | 0.07±0.01 / 0.09±0.02 | 0.06±0.01 / 0.09±0.02 | 0.06±0.01 / 0.08±0.02 | 0.06±0.01 / 0.08±0.02
Seeds | 0.21±0.01 / 0.22±0.02 | 0.20±0.01 / 0.21±0.01 | 0.20±0.01 / 0.21±0.01 | 0.20±0.01 / 0.21±0.01
Banknote Authentication | 0.29±0.01 / 0.30±0.01 | 0.27±0.02 / 0.28±0.02 | 0.17±0.01 / 0.18±0.02 | 0.16±0.02 / 0.17±0.02
QSAR Biodegradation | 0.29±0.02 / 0.30±0.02 | 0.28±0.02 / 0.29±0.02 | 0.23±0.01 / 0.25±0.02 | 0.23±0.01 / 0.24±0.02
Statlog (Heart) | 0.22±0.01 / 0.24±0.01 | 0.18±0.02 / 0.20±0.03 | 0.15±0.01 / 0.17±0.02 | 0.15±0.01 / 0.17±0.02
Ionosphere | 0.21±0.01 / 0.23±0.02 | 0.20±0.01 / 0.21±0.02 | 0.20±0.01 / 0.21±0.02 | 0.20±0.01 / 0.21±0.02
Breast Tissue | 0.34±0.01 / 0.36±0.02 | 0.34±0.01 / 0.35±0.02 | 0.34±0.01 / 0.35±0.02 | 0.34±0.01 / 0.35±0.02
Connectionist Bench | 0.27±0.01 / 0.29±0.02 | 0.27±0.01 / 0.29±0.02 | 0.27±0.01 / 0.29±0.02 | 0.27±0.01 / 0.29±0.02
Waveform Database Generator | 0.27±0.01 / 0.28±0.02 | 0.26±0.01 / 0.27±0.02 | 0.26±0.01 / 0.27±0.02 | 0.26±0.01 / 0.27±0.02
Statlog (Landsat Satellite) | 0.22±0.01 / 0.23±0.02 | 0.21±0.01 / 0.22±0.02 | 0.21±0.01 / 0.22±0.02 | 0.21±0.01 / 0.22±0.02
Spambase | 0.11±0.01 / 0.12±0.01 | 0.11±0.01 / 0.12±0.01 | 0.11±0.01 / 0.12±0.01 | 0.11±0.01 / 0.12±0.01
Breast Cancer Wisconsin | 0.05±0.01 / 0.06±0.01 | 0.04±0.01 / 0.05±0.01 | 0.04±0.01 / 0.05±0.01 | 0.04±0.01 / 0.05±0.01
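As an illustration of the transformation behind Tables III and IV, a monotone piecewise linear map over the unit interval can be parameterized by its cutoff points and their images. The knot-based construction below is one plausible realization offered purely for intuition; the exact parameterization optimized in the paper is defined in the earlier sections.

    import numpy as np

    def piecewise_linear(x, knots_in, knots_out):
        # Monotone piecewise linear map on [0, 1]; knots_in are the cutoff
        # points and knots_out their images (both assumed sorted in (0, 1)).
        xs = np.concatenate(([0.0], np.asarray(knots_in), [1.0]))
        ys = np.concatenate(([0.0], np.asarray(knots_out), [1.0]))
        return np.interp(x, xs, ys)

    # e.g., two cutoff points bending the unit interval
    print(piecewise_linear(np.linspace(0, 1, 5), [0.3, 0.7], [0.1, 0.9]))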




Table IV. Results of 10-fold cross-validation when using different piecewise linear transformations for individual variables (training / testing set error rates)

Dataset | 2 cutoff points | 4 cutoff points | 6 cutoff points | 8 cutoff points
Iris | 0.04±0.01 / 0.05±0.01 | 0.04±0.01 / 0.04±0.01 | 0.04±0.01 / 0.04±0.01 | 0.04±0.01 / 0.04±0.01
User Knowledge Modeling | 0.24±0.02 / 0.25±0.03 | 0.23±0.01 / 0.24±0.02 | 0.21±0.01 / 0.23±0.02 | 0.21±0.01 / 0.23±0.02
Vertebral Column | 0.07±0.01 / 0.08±0.02 | 0.06±0.01 / 0.07±0.02 | 0.05±0.01 / 0.06±0.02 | 0.04±0.01 / 0.05±0.02
Seeds | 0.21±0.01 / 0.22±0.02 | 0.20±0.01 / 0.21±0.01 | 0.20±0.01 / 0.21±0.01 | 0.20±0.01 / 0.21±0.01
Banknote Authentication | 0.11±0.01 / 0.12±0.01 | 0.10±0.02 / 0.11±0.02 | 0.10±0.02 / 0.11±0.02 | 0.10±0.02 / 0.11±0.02
QSAR Biodegradation | 0.29±0.02 / 0.30±0.01 | 0.27±0.02 / 0.28±0.02 | 0.26±0.02 / 0.27±0.02 | 0.26±0.02 / 0.27±0.02
Statlog (Heart) | 0.20±0.01 / 0.22±0.01 | 0.17±0.02 / 0.18±0.02 | 0.16±0.02 / 0.17±0.02 | 0.16±0.02 / 0.17±0.02
Ionosphere | 0.18±0.01 / 0.19±0.02 | 0.16±0.01 / 0.17±0.02 | 0.16±0.01 / 0.17±0.02 | 0.16±0.01 / 0.17±0.02
Breast Tissue | 0.34±0.01 / 0.35±0.02 | 0.33±0.02 / 0.35±0.02 | 0.32±0.02 / 0.34±0.02 | 0.32±0.02 / 0.34±0.02
Connectionist Bench | 0.27±0.01 / 0.29±0.02 | 0.27±0.01 / 0.29±0.02 | 0.27±0.01 / 0.29±0.02 | 0.27±0.01 / 0.29±0.02
Waveform Database Generator | 0.27±0.01 / 0.28±0.02 | 0.26±0.01 / 0.28±0.02 | 0.26±0.01 / 0.27±0.03 | 0.26±0.01 / 0.27±0.03
Statlog (Landsat Satellite) | 0.22±0.01 / 0.23±0.02 | 0.22±0.01 / 0.23±0.02 | 0.22±0.01 / 0.22±0.02 | 0.21±0.01 / 0.22±0.02
Spambase | 0.11±0.01 / 0.12±0.01 | 0.11±0.01 / 0.12±0.01 | 0.11±0.01 / 0.12±0.01 | 0.11±0.01 / 0.12±0.01
Breast Cancer Wisconsin | 0.04±0.01 / 0.05±0.01 | 0.04±0.01 / 0.05±0.01 | 0.04±0.01 / 0.05±0.01 | 0.04±0.01 / 0.05±0.01
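The parameters of these transformations are selected by Particle Swarm Optimization. A generic global-best PSO loop of the following form would be sufficient for that role; the inertia weight w, the acceleration constants c1 and c2, and the box bounds are illustrative defaults, and fitness would evaluate the classification (or reconstruction) error of FCM on the transformed data.

    import numpy as np

    def pso(fitness, dim, n_particles=30, n_iter=100,
            w=0.7, c1=1.5, c2=1.5, bounds=(0.0, 1.0), seed=0):
        # Generic global-best PSO minimizing `fitness` over [lo, hi]^dim.
        rng = np.random.default_rng(seed)
        lo, hi = bounds
        x = rng.uniform(lo, hi, (n_particles, dim))       # positions
        v = np.zeros((n_particles, dim))                  # velocities
        pbest = x.copy()
        pbest_f = np.array([fitness(p) for p in x])
        g = pbest[pbest_f.argmin()].copy()                # global best
        for _ in range(n_iter):
            r1, r2 = rng.random((2, n_particles, dim))
            v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
            x = np.clip(x + v, lo, hi)
            f = np.array([fitness(p) for p in x])
            better = f < pbest_f
            pbest[better], pbest_f[better] = x[better], f[better]
            g = pbest[pbest_f.argmin()].copy()
        return g, pbest_f.min()

In this setting a particle would encode the cutoff points (and their images) of the piecewise transformation, with one such segment per variable when different transformations are applied to individual variables.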




Table V. Results of 10-fold cross-validation when applying the same combination of logistic nonlinear transformations to all variables (training / testing set error rates)

Dataset | 2 cutoff points | 4 cutoff points | 6 cutoff points | 8 cutoff points
Iris | 0.04±0.01 / 0.04±0.01 | 0.04±0.01 / 0.04±0.01 | 0.04±0.01 / 0.04±0.01 | 0.04±0.01 / 0.04±0.01
User Knowledge Modeling | 0.38±0.02 / 0.39±0.02 | 0.37±0.02 / 0.39±0.02 | 0.34±0.02 / 0.35±0.02 | 0.33±0.01 / 0.35±0.03
Vertebral Column | 0.08±0.01 / 0.09±0.01 | 0.08±0.01 / 0.09±0.01 | 0.08±0.01 / 0.09±0.01 | 0.08±0.01 / 0.09±0.01
Seeds | 0.24±0.01 / 0.26±0.01 | 0.23±0.01 / 0.24±0.01 | 0.19±0.01 / 0.20±0.01 | 0.20±0.01 / 0.21±0.02
Banknote Authentication | 0.15±0.02 / 0.16±0.02 | 0.15±0.02 / 0.16±0.02 | 0.16±0.02 / 0.17±0.02 | 0.16±0.01 / 0.18±0.01
QSAR Biodegradation | 0.29±0.02 / 0.31±0.02 | 0.28±0.02 / 0.29±0.02 | 0.25±0.02 / 0.27±0.02 | 0.25±0.02 / 0.27±0.02
Statlog (Heart) | 0.20±0.01 / 0.23±0.01 | 0.18±0.02 / 0.19±0.02 | 0.18±0.02 / 0.19±0.02 | 0.18±0.02 / 0.19±0.02
Ionosphere | 0.21±0.01 / 0.23±0.02 | 0.21±0.01 / 0.22±0.02 | 0.21±0.01 / 0.23±0.02 | 0.22±0.01 / 0.23±0.02
Breast Tissue | 0.36±0.01 / 0.36±0.02 | 0.35±0.02 / 0.36±0.02 | 0.33±0.01 / 0.35±0.02 | 0.34±0.01 / 0.35±0.02
Connectionist Bench | 0.27±0.01 / 0.29±0.02 | 0.28±0.01 / 0.29±0.02 | 0.27±0.01 / 0.29±0.02 | 0.28±0.01 / 0.28±0.01
Waveform Database Generator | 0.27±0.01 / 0.28±0.02 | 0.27±0.01 / 0.28±0.01 | 0.27±0.01 / 0.29±0.02 | 0.27±0.01 / 0.28±0.02
Statlog (Landsat Satellite) | 0.23±0.01 / 0.25±0.02 | 0.23±0.01 / 0.24±0.02 | 0.23±0.01 / 0.25±0.02 | 0.22±0.01 / 0.24±0.02
Spambase | 0.12±0.01 / 0.13±0.01 | 0.12±0.01 / 0.14±0.02 | 0.12±0.01 / 0.13±0.01 | 0.12±0.01 / 0.14±0.01
Breast Cancer Wisconsin | 0.07±0.01 / 0.08±0.01 | 0.08±0.01 / 0.08±0.01 | 0.08±0.01 / 0.08±0.01 | 0.08±0.01 / 0.08±0.01







Table VI. Results of 10-fold cross-validation when applying different combinations of logistic nonlinear transformations to individual variables (training / testing set error rates)

Dataset | 2 cutoff points | 4 cutoff points | 6 cutoff points | 8 cutoff points
Iris | 0.03±0.01 / 0.04±0.01 | 0.04±0.01 / 0.04±0.01 | 0.04±0.01 / 0.04±0.01 | 0.04±0.01 / 0.04±0.01
User Knowledge Modeling | 0.25±0.01 / 0.26±0.02 | 0.24±0.01 / 0.25±0.02 | 0.20±0.01 / 0.21±0.02 | 0.20±0.02 / 0.21±0.02
Vertebral Column | 0.08±0.01 / 0.08±0.02 | 0.08±0.01 / 0.08±0.01 | 0.08±0.01 / 0.08±0.01 | 0.08±0.01 / 0.08±0.01
Seeds | 0.20±0.01 / 0.21±0.02 | 0.20±0.01 / 0.22±0.02 | 0.18±0.01 / 0.19±0.01 | 0.18±0.01 / 0.19±0.01
Banknote Authentication | 0.15±0.01 / 0.15±0.01 | 0.13±0.02 / 0.14±0.02 | 0.12±0.02 / 0.13±0.02 | 0.13±0.02 / 0.14±0.02
QSAR Biodegradation | 0.25±0.02 / 0.26±0.01 | 0.24±0.02 / 0.25±0.02 | 0.23±0.02 / 0.25±0.02 | 0.23±0.02 / 0.25±0.02
Statlog (Heart) | 0.18±0.01 / 0.19±0.01 | 0.18±0.02 / 0.19±0.02 | 0.18±0.02 / 0.19±0.02 | 0.18±0.02 / 0.19±0.02
Ionosphere | 0.21±0.01 / 0.23±0.02 | 0.21±0.01 / 0.22±0.02 | 0.21±0.01 / 0.23±0.02 | 0.22±0.01 / 0.23±0.02
Breast Tissue | 0.36±0.01 / 0.36±0.02 | 0.35±0.02 / 0.36±0.02 | 0.33±0.01 / 0.35±0.02 | 0.34±0.01 / 0.35±0.02
Connectionist Bench | 0.26±0.01 / 0.29±0.01 | 0.26±0.01 / 0.28±0.02 | 0.27±0.01 / 0.28±0.02 | 0.27±0.01 / 0.28±0.01
Waveform Database Generator | 0.26±0.01 / 0.28±0.01 | 0.26±0.01 / 0.27±0.01 | 0.26±0.01 / 0.28±0.02 | 0.26±0.01 / 0.28±0.02
Statlog (Landsat Satellite) | 0.23±0.01 / 0.25±0.02 | 0.23±0.01 / 0.24±0.02 | 0.23±0.01 / 0.25±0.02 | 0.22±0.01 / 0.24±0.02
Spambase | 0.11±0.01 / 0.12±0.02 | 0.11±0.01 / 0.12±0.02 | 0.11±0.01 / 0.14±0.01 | 0.12±0.01 / 0.13±0.01
Breast Cancer Wisconsin | 0.07±0.01 / 0.08±0.01 | 0.08±0.01 / 0.08±0.01 | 0.08±0.01 / 0.08±0.01 | 0.08±0.01 / 0.08±0.01
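For Tables V and VI, the nonlinear map is built from logistic functions. One plausible reading, used purely for illustration, is a weighted combination of logistic components with weights w, steepness values a, and inflection points b; all three parameter vectors are assumptions introduced here, not the paper's exact parameterization.

    import numpy as np

    def logistic_combination(x, w, a, b):
        # Map x in [0, 1] through a weighted sum of logistic functions;
        # w: nonnegative weights summing to 1, a: steepness, b: inflection
        # points. A final rescaling may be needed to preserve [0, 1] exactly.
        x = np.asarray(x)[..., None]
        s = 1.0 / (1.0 + np.exp(-np.asarray(a) * (x - np.asarray(b))))
        return (np.asarray(w) * s).sum(axis=-1)

    # e.g., two logistic components bending the unit interval
    print(logistic_combination(np.linspace(0, 1, 5),
                               w=[0.5, 0.5], a=[10, 4], b=[0.3, 0.7]))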







Table VII. Reconstruction errors when using FCM for the original data

Dataset | Number of data | Number of variables | Size of codebook | Reconstruction error
User Knowledge Modeling | 258 | 5 | 4 | 6.31×10^1 ± 0.102×10^1
Vertebral Column | 310 | 6 | 3 | 4.13×10^5 ± 0.0064×10^5
Banknote Authentication | 1372 | 4 | 2 | 4.58×10^4 ± 0.00094×10^4
Breast Cancer Wisconsin | 699 | 9 | 2 | 2.48×10^4 ± 0.00015×10^4
Pima Indians Diabetes | 768 | 8 | 2 | 1.15×10^7 ± 0.0016×10^7
Fertility | 100 | 9 | 2 | 4.81×10^2 ± 0.0044×10^2
Liver Disorders | 345 | 6 | 2 | 6.61×10^5 ± 0.00060×10^5
Blood Transfusion Service | 748 | 4 | 2 | 9.08×10^8 ± 0.027×10^8
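The reconstruction errors in Tables VII-IX follow the granulation-degranulation scheme commonly used with FCM: each datum is encoded by its membership grades and decoded as a membership-weighted average of the prototypes. A minimal sketch is given below; the normalization of the error may differ from the paper's exact definition.

    import numpy as np

    def reconstruction_error(X, V, m):
        # Granulation: compute memberships of each datum to the prototypes.
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1) + 1e-12
        inv = d2 ** (-1.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)
        # Degranulation: x_hat_i = sum_j u_ij^m v_j / sum_j u_ij^m.
        Um = U ** m
        X_hat = (Um @ V) / Um.sum(axis=1, keepdims=True)
        return float(((X - X_hat) ** 2).sum())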



Table VIII. Reconstruction errors when using kernel-based fuzzy clustering

Dataset | KFCM-F (G) | KFCM-F (P)
User Knowledge Modeling | 4.72×10^1 ± 0.034×10^1 | 7.21×10^1 ± 0.037×10^1
Vertebral Column | 6.89×10^5 ± 0.0062×10^5 | 6.58×10^5 ± 0.0092×10^5
Banknote Authentication | 4.10×10^4 ± 0.019×10^4 | 8.99×10^4 ± 0.06×10^4
Breast Cancer Wisconsin | 4.11×10^4 ± 0.0054×10^4 | 4.03×10^4 ± 0.0031×10^4
Pima Indians Diabetes | 1.15×10^7 ± 0.0081×10^7 | 1.20×10^7 ± 0.0018×10^7
Fertility | 4.58×10^2 ± 0.0012×10^2 | 4.81×10^2 ± 0.0012×10^2
Liver Disorders | 5.87×10^5 ± 0.0039×10^5 | 8.24×10^5 ± 0.0040×10^5
Blood Transfusion Service | 9.26×10^8 ± 0.067×10^8 | 9.97×10^8 ± 0.0066×10^8





Table IX. Reconstruction errors when using FCM for transformed data with the same piecewise transformation applied to all variables

Dataset | 2 cutoff points | 4 cutoff points
User Knowledge Modeling | 4.56×10^1 ± 0.016×10^1 | 4.38×10^1 ± 0.010×10^1
Vertebral Column | 4.08×10^5 ± 0.0025×10^5 | 4.05×10^5 ± 0.0026×10^5
Banknote Authentication | 4.08×10^4 ± 0.0058×10^4 | 4.06×10^4 ± 0.0062×10^4
Breast Cancer Wisconsin | 2.34×10^4 ± 0.0026×10^4 | 2.29×10^4 ± 0.0027×10^4
Pima Indians Diabetes | 8.97×10^6 ± 0.0019×10^6 | 8.41×10^6 ± 0.0025×10^6
Fertility | 3.28×10^2 ± 0.015×10^2 | 3.20×10^2 ± 0.014×10^2
Liver Disorders | 5.45×10^5 ± 0.00086×10^5 | 5.35×10^5 ± 0.00091×10^5
Blood Transfusion Service | 6.31×10^8 ± 0.070×10^8 | 6.08×10^8 ± 0.0044×10^8