CHAPTER 6

Multiobjective evolutionary algorithm (MOEA)-based sparse clustering

Chapter Outline
6.1 Introduction
  6.1.1 The introduction of MOEA on constrained multiobjective optimization problems
  6.1.2 An introduction to MOEA on clustering learning and classification learning
  6.1.3 The introduction of MOEA on sparse spectral clustering
6.2 Modified function and feasible-guiding strategy-based constrained MOPs
  6.2.1 Problem description
  6.2.2 Modified objective function
  6.2.3 The feasible-guiding strategy
  6.2.4 Procedure for the proposed algorithm
6.3 Learning simultaneous adaptive clustering and classification learning via MOEA
  6.3.1 Objective functions of MOASCC
  6.3.2 The framework of MOASCC
  6.3.3 Computational complexity
6.4 A sparse spectral clustering framework via MOEA
  6.4.1 Mathematical description of SRMOSC
  6.4.2 Extension on semisupervised clustering
  6.4.3 Initialization
  6.4.4 Crossover
  6.4.5 Mutation
  6.4.6 Laplacian matrix construction
  6.4.7 Final solution selection phase
  6.4.8 Complexity analysis
6.5 Experiments
  6.5.1 The experiments of MOEA on constrained multiobjective optimization problems
    6.5.1.1 Experimental setup
    6.5.1.2 Performance metrics
    6.5.1.3 Comparison experiment results
  6.5.2 The experiments of MOEA on clustering learning and classification learning
    6.5.2.1 Experiment setup
    6.5.2.2 Experiment on a synthetic dataset
    6.5.2.3 Experiment on real-life datasets
  6.5.3 The experiments of MOEA on sparse spectral clustering
    6.5.3.1 Detailed analysis of SRMOSC
    6.5.3.2 Experimental comparison between SRMOSC and other algorithms
6.6 Summary
References
6.1 Introduction

The multiobjective evolutionary algorithm (MOEA) is a nature-inspired, population-based algorithm. It has attracted much attention from researchers and has made great progress because an MOEA can generate a set of nondominated solutions in a single run. However, its practical application still faces challenges in constraint handling, encoding schemes, the design of evolutionary operators, and Pareto solution selection. To overcome these bottlenecks, this chapter presents three algorithms that illustrate the application of MOEAs to constrained multiobjective optimization problems, clustering learning, classification learning, and sparse clustering.
6.1.1 The introduction of MOEA on constrained multiobjective optimization problems

In the real world, we often encounter problems in which at least two objectives need to be optimized simultaneously while a set of constraint conditions must be satisfied. Such problems are called constrained multiobjective optimization problems (CMOPs), and solving them is an important part of the optimization field. In contrast to unconstrained multiobjective optimization problems (MOPs), CMOPs have to deal with various limits on the decision variables, the interference resulting from constraints, and the relationship between objective functions and constraints [1]. A large number of constraint-handling methods have been proposed for constrained optimization problems. According to [2,3], the commonly used constraint-handling methods can be roughly classified into four categories: (1) use of penalty functions [1,3], (2) maintaining a feasible population by special representations and genetic operators [1,4-14], (3) separation of objectives and constraints [15-21], and (4) hybrid methods [1,22-24]. In summary, the major issue in constraint handling is how to deal with infeasible individuals throughout the whole search process. In recent years, researchers have also focused on MOPs, and a number of population-based stochastic optimization algorithms, such as evolutionary algorithms (EAs), particle swarm optimization (PSO), differential evolution (DE) [1,18,24], human immune system (HIS)-based algorithms [25,26], and other nature-inspired algorithms [27], have been proposed to handle them. Although there have been many approaches to handling constraints, most are aimed at single-objective optimization problems with constraints, and few researchers deal with constraint handling and MOPs simultaneously. In this case, a method based on a modified objective function and
a feasible-guiding strategy is proposed in this chapter to handle CMOPs [28]. The main idea of the algorithm is to replace the objective function values with modified values that combine the true objective function values and the constraint violation values, where a feasibility ratio obtained from the current population is used to balance the two parts. A feasible-guiding strategy is then adopted to exploit the preserved infeasible individuals. The nondominated solutions obtained show superiority in distribution diversity and convergence, as demonstrated by the comparison experiment results.
6.1.2 An introduction to MOEA on clustering learning and classification learning

There are two major tasks in the domain of pattern recognition: clustering learning and classification learning [29]. Clustering learning aims at detecting the underlying structure of data and grouping homogeneous samples together according to the principle of "like to like" [30,31]. Classification learning aims at finding a model or discriminant function for the training samples that can be used to predict the class labels of unknown samples [32,33]. Unlike clustering learning, classification learning uses prior knowledge obtained from the given classes, but both are expected to be in accord with human cognition. Traditional hybrid classification and clustering algorithms are carried out sequentially rather than simultaneously; multiobjective optimization provides a way to solve this problem. Multiobjective optimization has attracted wide attention from researchers because of its common applications in nature and human activities. Its goal is to optimize at least two objectives that may contradict each other at the same time, and a set of nondominated solutions, called Pareto-optimal solutions, is obtained as a result. A number of MOEAs [34], such as NSGA-II [35], SPEA2 [36], MOPSO [37], and MOEA/D [38], have been proposed and successfully applied to clustering or classification learning in recent decades. An experimental evaluation of cluster representations for multiobjective evolutionary clustering was carried out in Ref. [39], which showed that multiobjective evolutionary clustering is competitive with other clustering algorithms [40,41]. MOCK [31] is a well-known graph representation-based multiobjective clustering algorithm, which uses the overall deviation and the connectivity as objective functions to reflect cluster compactness and connectedness, respectively. It has been extended to the semisupervised clustering algorithm Semi-MOCK by transforming the labeled information into constraints. Moreover, different MOEAs have been employed to deal with hybrid clustering and classification learning in [42-44], where objective functions representing the accuracy and complexity of an RBFNN are minimized in order to design the RBFNN model; the statistical results show that MOEAs can balance the complexity and accuracy of the RBFNN. In terms of simultaneous clustering learning and classification learning, MOEAs have been employed by a few researchers to improve performance. A multiobjective simultaneous clustering and classification learning framework (MSCC) [33] was proposed to
overcome the shortcomings of single-objective optimization. MSCC uses a simplified version of MOPSO to optimize two objective functions, the intracluster compactness and the classification error rate, to accomplish the aim of simultaneous learning. It optimizes cluster centers and bridges clustering and classification through Bayesian theory. Later, an improved multiobjective simultaneous learning framework for classifier design [45] was proposed on the basis of MSCC, in which the cluster membership degrees are calculated on the basis of randomly initialized cluster centers. In this chapter, a multiobjective evolutionary algorithm for learning simultaneous clustering and classification adaptively is presented [46]. The main idea of the algorithm is to optimize two objective functions, representing fuzzy cluster connectedness and the classification error rate, to achieve the goal of simultaneous learning. First, the algorithm adopts a graph-based encoding method. Then Bayesian theory is used to build the relationship between clustering and classification during the optimization process. The effectiveness of clustering and classification is measured by the objective functions, and this feedback is used to guide the mutation. Finally, a set of nondominated solutions is generated, from which the final Pareto-optimal solution is selected by the adjusted Rand index.
6.1.3 The introduction of MOEA on sparse spectral clustering

Spectral clustering has become one of the most popular clustering algorithms in the last decade, since it is easy to implement and has shown impressive results in practical applications [47]. The first step in spectral clustering is to construct a symmetric similarity matrix from the samples of the dataset. After that, the eigenvectors of the corresponding Laplacian matrix are computed, and the rows of the eigenvector matrix are clustered using a traditional clustering method such as k-means. The key point in spectral clustering is the construction of the similarity matrix, and therefore many methods, such as those in Refs. [47-50], have been proposed in the literature to create appropriate matrices. In this chapter, we introduce a method for constructing a similarity matrix based on sparse representation theory and MOEAs. In contrast to conventional spectral clustering, the main contribution of this algorithm is to construct the similarity matrix with a sparse representation approach by modeling spectral clustering as a constrained multiobjective optimization problem. Specific operators are designed to obtain a set of high-quality solutions in the optimization process. Furthermore, the algorithm introduces a method to select a tradeoff solution from the Pareto front using a measurement called the ratio cut, based on an adjacency matrix constructed from all the nondominated solutions. It also extends the framework to semisupervised clustering by using the information brought by the labeled samples to set constraints or to guide the search process. Experiments on commonly used datasets
show that the proposed algorithm outperforms four well-known similarity matrix construction methods used in spectral clustering as well as one multiobjective clustering algorithm.
6.2 Modified function and feasible-guiding strategy-based constrained MOPs

In this section, we present the key procedures for solving constrained multiobjective optimization problems with a method that combines a modified objective function with a feasible-guiding strategy.
6.2.1 Problem description

A constrained multiobjective optimization problem (CMOP) can be mathematically formulated as follows:

$$
\begin{aligned}
\text{Minimize} \quad & f_i(\mathbf{x}) = f_i(x_1, x_2, \ldots, x_n), && i = 1, 2, \ldots, k \\
\text{Subject to} \quad & g_j(\mathbf{x}) = g_j(x_1, x_2, \ldots, x_n) \le 0, && j = 1, 2, \ldots, p \\
& h_j(\mathbf{x}) = h_j(x_1, x_2, \ldots, x_n) = 0, && j = p+1, p+2, \ldots, m \\
& x_l^{\min} \le x_l \le x_l^{\max}, && l = 1, 2, \ldots, n
\end{aligned}
\tag{6.1}
$$
where $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in U$ is an n-dimensional decision variable vector bounded in the search space $U$, and $x_l^{\min}$ and $x_l^{\max}$ define the lower and upper boundaries of each dimension of the search space $U$, respectively. $f_i(\mathbf{x})$ is the i-th objective function, and k is the number of objective functions. There are m constraint functions in total, of which p are inequality constraints and the rest are equality constraints; these divide the search space into a feasible space and an infeasible space. $g_j(\mathbf{x})$ is the j-th inequality constraint, and $h_j(\mathbf{x})$ is the j-th equality constraint. When dealing with CMOPs, individuals that satisfy all the constraints are called feasible individuals, while individuals that violate at least one of them are called infeasible individuals. Because of the added constraints, the globally optimal solutions are harder to find in the feasible space.
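To make the formulation concrete, the following minimal Python sketch (an illustration, not code from the chapter; the two-objective problem and its single constraint are invented for demonstration) evaluates a candidate solution against a CMOP of the form (6.1) and checks feasibility with a tolerance for the equality constraints.

```python
import numpy as np

def evaluate_cmop(x, objectives, ineq_constraints, eq_constraints, delta=1e-4):
    """Evaluate objectives and constraints of a CMOP candidate x.

    objectives, ineq_constraints, eq_constraints are lists of callables.
    A solution is feasible when every g_j(x) <= 0 and |h_j(x)| <= delta.
    """
    f = np.array([fi(x) for fi in objectives])
    g = np.array([gj(x) for gj in ineq_constraints])
    h = np.array([hj(x) for hj in eq_constraints])
    feasible = bool(np.all(g <= 0) and np.all(np.abs(h) <= delta))
    return f, g, h, feasible

# Hypothetical two-objective example with one inequality constraint.
objectives = [lambda x: x[0] ** 2 + x[1] ** 2,
              lambda x: (x[0] - 1) ** 2 + x[1] ** 2]
ineq = [lambda x: 0.5 - x[0] - x[1]]   # g(x) <= 0  <=>  x0 + x1 >= 0.5
eq = []                                 # no equality constraints in this toy problem

f, g, h, feasible = evaluate_cmop(np.array([0.4, 0.3]), objectives, ineq, eq)
print(f, feasible)
```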
6.2.2 Modified objective function

The algorithm in this section uses a modified objective function method to guide dominance checking in the current population, and then a feasible-guiding strategy to repair the infeasible individuals. The modified objective function consists of two components, the normalized objective function and the normalized constraint violation, which are combined through an adaptive parameter $g_f$, the same as the parameter in the literature [51]. The feasible-guiding strategy is a kind of repair operator that guides infeasible individuals in the search for feasible solutions near the boundaries between feasible and infeasible regions. The details of both components are discussed in this section. The two components that constitute the modified objective function, the normalized objective function value and the normalized constraint violation, are calculated as follows:

$$
f_i^{\max} = \max_{\mathbf{x}} f_i(\mathbf{x}), \qquad f_i^{\min} = \min_{\mathbf{x}} f_i(\mathbf{x})
\tag{6.2}
$$

$$
f_i^{norm}(\mathbf{x}) = \frac{f_i(\mathbf{x}) - f_i^{\min}}{f_i^{\max} - f_i^{\min}}, \qquad i = 1, 2, \ldots, k
\tag{6.3}
$$
The first step is to obtain the maximum and minimum values of each objective function in the current population through Formula (6.2), where $f_i^{\max}$ is the maximum value of the i-th objective function and $f_i^{\min}$ is the minimum value in this dimension. The normalized i-th objective value $f_i^{norm}(\mathbf{x})$ of individual $\mathbf{x}$ is then derived through Formula (6.3). In Formula (6.1), $g_j(\mathbf{x})$ is the j-th inequality constraint and $h_j(\mathbf{x})$ is the j-th equality constraint, so a solution $\mathbf{x}$ is feasible only when every $g_j(\mathbf{x})$ is no greater than zero and every $h_j(\mathbf{x})$ is equal to zero. Instead of checking every constraint separately, the simpler Formula (6.4) is usually adopted to determine whether a solution $\mathbf{x}$ is feasible. In Formula (6.4), the tolerance value $\delta$ is adopted because an exact equality constraint is too strict to be satisfied. Clearly, all the constraint violations of a feasible solution are zero. Then $v(\mathbf{x})$, the arithmetic mean of the normalized constraint violations [Formulas (6.5) and (6.6)], is calculated in Formula (6.7) to represent the constraint violation of individual $\mathbf{x}$.

$$
c_j(\mathbf{x}) =
\begin{cases}
\max\{0,\ g_j(\mathbf{x})\}, & 1 \le j \le p \\
\max\{0,\ |h_j(\mathbf{x})| - \delta\}, & p+1 \le j \le m
\end{cases}
\tag{6.4}
$$

$$
c_j^{\max} = \max_{\mathbf{x}} c_j(\mathbf{x})
\tag{6.5}
$$

$$
c_j^{norm}(\mathbf{x}) = \frac{c_j(\mathbf{x})}{c_j^{\max}}
\tag{6.6}
$$

$$
v(\mathbf{x}) = \frac{1}{m} \sum_{j=1}^{m} c_j^{norm}(\mathbf{x})
\tag{6.7}
$$

Finally, the modified objective function is defined in Formula (6.8). Most constraint-handling methods prefer certain types of individuals, such as those with low constraint violation or better objective function values, rather than combining the two in a flexible way. In Formula (6.8), a less rigid parameter $g_f$ is adopted to control how these two parts contribute to the modified objective function value.

$$
f_i^{mod}(\mathbf{x}) =
\begin{cases}
g_f\, f_i^{norm}(\mathbf{x}) + (1 - g_f)\, v(\mathbf{x}), & g_f \ne 0 \\
\sqrt{v(\mathbf{x})^2 + f_i^{norm}(\mathbf{x})^2}, & g_f = 0
\end{cases}
\tag{6.8}
$$
where

$$
g_f = \frac{\text{number of feasible individuals in the current population}}{\text{population size}}
$$

1) If $g_f \ne 0$, there are both feasible and infeasible solutions in the current population. From Formula (6.8), we can observe that individuals with both a low objective value and a low constraint violation value in a given dimension will perform well in the following process. The parameter $g_f$ decides how the objective function values and constraint violation values contribute to the final modified function values. For infeasible individuals in the current population, if the feasibility ratio $g_f$ is high, the objective function value contributes more than the constraint violation value in each dimension; otherwise, the constraint violation value contributes more. Hence, for two nondominated infeasible individuals in the objective function dimensions, which one is better depends on the constraint violation: the individual with the lower constraint violation value will usually dominate the other in the modified objective function dimensions, and for two infeasible individuals with almost the same constraint violation values, the objective function values decide which one is better. For a feasible individual $\mathbf{x}$, the constraint violation is zero, so the modified objective function value reduces to $g_f f_i^{norm}(\mathbf{x})$ in the given dimension. On the one hand, priority is still given to feasible individuals in the search process, and feasible individuals dominate infeasible individuals with the same or worse objective function values. On the other hand, some infeasible individuals with low constraint violation values and lower objective function values than the feasible individuals can also be given priority, so the potential information carried by infeasible individuals can be exploited to some extent.
2) If $g_f = 0$, there are no feasible solutions in the current population. Since judging the quality of an infeasible individual only by constraint violation sorting or only by nondominated sorting of objective function values is one-sided, both parts are considered in this condition. Because it is difficult to decide which part is more important for the final selection, a simple idea called the distance value, borrowed from the literature [49], is adopted: for every infeasible individual in the modified objective function dimensions, the closer it is to the origin in the $f_i^{norm}(\mathbf{x})$-$v(\mathbf{x})$ space, the better it is.
This constraint-handling method provides an easy and flexible way to deal with CMOPs and allows the search to proceed efficiently and effectively. Its main properties can be summarized as follows.
1) If there are only infeasible individuals in the current population, a condition that is rare unless the constraints are strict enough, the modified objective functions take both the objective function values and the constraint violation values into account, because disregarding the objective function values may cause the search to be trapped in a local optimum, while disregarding the constraint violations makes it hard to find feasible solutions.
2) If there are both feasible and infeasible individuals in the current population, which is the most common situation when handling CMOPs, an individual with low objective function values and low constraint violation values will dominate individuals with high objective function values, high constraint violation values, or both.
3) If there are only feasible individuals in the current population, which may happen in the later stage of the search, all the individuals are selected based only on their objective function values, as with unconstrained MOPs.
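A compact Python sketch of Formulas (6.2)-(6.8) is given below. It is an illustration under the definitions stated above, not the authors' implementation; the array shapes and the small-value guards for degenerate normalizations are assumptions made here for robustness.

```python
import numpy as np

def modified_objectives(F, G, H, delta=1e-4):
    """F: (n, k) objective values; G: (n, p) inequality values; H: (n, m-p) equality values.
    Returns the (n, k) modified objective values of Formula (6.8)."""
    # Formulas (6.2)-(6.3): normalize each objective over the current population.
    fmin, fmax = F.min(axis=0), F.max(axis=0)
    Fnorm = (F - fmin) / np.where(fmax > fmin, fmax - fmin, 1.0)

    # Formula (6.4): constraint violations (zero for satisfied constraints).
    C = np.hstack([np.maximum(0.0, G), np.maximum(0.0, np.abs(H) - delta)])
    # Formulas (6.5)-(6.7): normalize violations and average them per individual.
    cmax = C.max(axis=0)
    Cnorm = C / np.where(cmax > 0, cmax, 1.0)
    v = Cnorm.mean(axis=1)

    # Feasibility ratio g_f of the current population.
    gf = np.all(C == 0, axis=1).mean()

    # Formula (6.8).
    if gf != 0:
        return gf * Fnorm + (1 - gf) * v[:, None]
    return np.sqrt(v[:, None] ** 2 + Fnorm ** 2)
```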
6.2.3 The feasible-guiding strategy

In theory, part of the Pareto-optimal solutions of a CMOP may lie on the constraint boundaries, and solutions near the constraint boundaries usually contribute a lot to the search, whether they are feasible or infeasible. The aim of the feasible-guiding strategy is therefore to find feasible solutions close to the constraint boundaries by moving along a feasible direction with the help of infeasible solutions, which is a specific application of DE/rand/1 [Formula (6.9)]. Its detailed description is as follows.

$$
\text{DE/rand/1:} \quad v_i = x_{r0} + F \cdot (x_{r1} - x_{r2})
\tag{6.9}
$$
Fig. 6.1 shows how DE/rand/1 is used for CMOPs. The shaded region is the infeasible part of the decision space, and the solutions ($x_{inf_i}$) located in this area are infeasible; the unshaded region is the feasible space, and the solutions ($x_{fea_i}$) in this area are feasible; the dashed line denotes the constraint boundary. First, an infeasible individual and its nearest feasible individual are found, which defines a direction from the infeasible region to the feasible region, called the feasible direction: for example, the infeasible individual $x_{inf1}$, the feasible individual $x_{fea1}$, and the corresponding feasible direction $d_1$. Then the infeasible individuals in the neighborhood of the selected infeasible individual mutate along the feasible direction, and new individuals are generated. In Fig. 6.1, individual $x_{inf3}$ mutates along $F \cdot d_1$, while $x_{inf4}$ and $x_{inf5}$ mutate along $F \cdot d_2$, generating the new individuals $x_{new1}$, $x_{new2}$, and $x_{new3}$, respectively. Not all of these newly generated individuals are necessarily feasible, but at least they violate the constraints less.
Figure 6.1 Description of the feasible-guiding strategy.
However, this method is only suited to those individuals in a neighborhood that share a similar feasible direction, not to the global search. According to Formula (6.9) and Fig. 6.1, the main idea of the method can be summarized as in Formula (6.10). As mentioned above, $x_{fea1}$ and $x_{inf1}$ are the pair of nearest feasible and infeasible individuals lying in the neighborhood of $x_{inf}$, $F$ is a scaling factor, and $x_{new}$ is the newly generated individual.

$$
x_{new} = x_{inf} + F \cdot (x_{fea1} - x_{inf1})
\tag{6.10}
$$
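The repair step of Formula (6.10) can be sketched as follows. This is a simplified illustration, not the chapter's exact implementation: the "neighborhood" is reduced to nearest-neighbor search over the whole infeasible set, and the scaling factor F is an assumed default.

```python
import numpy as np

def feasible_guiding(infeasible, feasible, F=0.5):
    """Repair infeasible individuals along a feasible direction, Formula (6.10).

    infeasible: (p, n) array of infeasible decision vectors.
    feasible:   (q, n) array of feasible decision vectors.
    """
    repaired = []
    for x_inf in infeasible:
        # Nearest infeasible individual plays the role of x_inf1.
        d_inf = np.linalg.norm(infeasible - x_inf, axis=1)
        d_inf[d_inf == 0] = np.inf
        x_inf1 = infeasible[np.argmin(d_inf)]
        # Its nearest feasible individual x_fea1 defines the feasible direction.
        x_fea1 = feasible[np.argmin(np.linalg.norm(feasible - x_inf1, axis=1))]
        repaired.append(x_inf + F * (x_fea1 - x_inf1))
    return np.array(repaired)
```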
6.2.4 Procedure for the proposed algorithm

The procedure for the proposed algorithm is as follows.
Step 1 Initialization. Generate N individuals randomly in the decision space to form the initial population P, and evaluate the objective function values of each individual. Then select the nondominated feasible individuals into the archive, whose maximum size is set to N.
Step 2 Modify objective functions. Calculate the modified objective function values of each individual in the current population P according to the method proposed in Section 6.2.2.
Step 3 Reproduction.
Step 3.1 Use tournament selection to select N parents and generate N offspring to form the offspring population Q1.
Step 3.2 Use the feasible-guiding method introduced in Section 6.2.3 to generate some offspring Q2. The size of Q2 is $g_f N$, which is decided by the number of infeasible individuals in the current population.
Step 3.3 Form the offspring population Q = Q1 ∪ Q2, and calculate the objective function values and modified objective function values.
Step 3.4 Perform nondominated sorting [35] with the modified objective function values.
Step 4 Archive update. For each feasible individual in Q, remove the individuals in the archive that are dominated by it, and add it to the archive.
Step 5 Current population update. Combine P and Q, and recalculate the nondominated rank and crowding distance. Then select the top N nondominated individuals into $P_{t+1}$ according to the modified objective function values.
Step 6 Termination. If the termination criterion is satisfied, the algorithm terminates. Otherwise, set P = $P_{t+1}$ and go back to Step 2.
The operators adopted in the reproduction step are SBX crossover [Formula (6.11)] and nonuniformity mutation [Formula (6.12)].

$$
a'_{ik} =
\begin{cases}
0.5\,[(1 + \beta_k) a_{ik} + (1 - \beta_k) a_{jk}], & \text{if } r(0,1) \ge 0.5 \\
0.5\,[(1 - \beta_k) a_{ik} + (1 + \beta_k) a_{jk}], & \text{if } r(0,1) < 0.5
\end{cases}
\tag{6.11}
$$

where

$$
\beta_k =
\begin{cases}
(2u)^{\frac{1}{\eta_c + 1}}, & \text{if } u(0,1) \ge 0.5 \\
\left[\dfrac{1}{2(1-u)}\right]^{\frac{1}{\eta_c + 1}}, & \text{if } u(0,1) < 0.5
\end{cases}
$$
In Formula (6.11), $a_{ik}$ and $a_{jk}$ ($i \ne j$, $k = 1, \ldots, n$) are the k-th dimensions of individuals i and j, respectively, $u$ and $r$ are random numbers ranging from 0 to 1, and $\eta_c$ is a distribution index.

$$
v'_k =
\begin{cases}
v_k + \delta\,(u_k - v_k), & \text{if } r(0,1) \le 0.5 \\
v_k - \delta\,(v_k - l_k), & \text{if } r(0,1) > 0.5
\end{cases}
\tag{6.12}
$$

where

$$
\delta = 1 - r^{(1 - it/T)^{\lambda}}
$$

In Formula (6.12), $v_k$ ($k = 1, \ldots, n$) is the k-th dimension of an individual, and $u_k$ and $l_k$ are the upper and lower boundaries of this dimension, respectively. $it$ is the current generation number, $T$ is the maximum generation number, and $\lambda$ is a parameter that tunes the area of local search, usually ranging from 2 to 5.
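The two variation operators can be sketched in Python as follows. This is a minimal illustration of Formulas (6.11) and (6.12) as written above; per-gene crossover probabilities and boundary repair are omitted, and the helper names are this sketch's own.

```python
import numpy as np

def sbx_crossover(a_i, a_j, eta_c=15):
    """Simulated binary crossover, Formula (6.11), producing one child."""
    child = np.empty_like(a_i, dtype=float)
    for k in range(len(a_i)):
        u = np.random.rand()
        beta = (2 * u) ** (1 / (eta_c + 1)) if u >= 0.5 \
            else (1 / (2 * (1 - u))) ** (1 / (eta_c + 1))
        if np.random.rand() >= 0.5:
            child[k] = 0.5 * ((1 + beta) * a_i[k] + (1 - beta) * a_j[k])
        else:
            child[k] = 0.5 * ((1 - beta) * a_i[k] + (1 + beta) * a_j[k])
    return child

def nonuniform_mutation(v, lower, upper, it, T, lam=2):
    """Nonuniformity mutation, Formula (6.12)."""
    v_new = np.array(v, dtype=float)
    for k in range(len(v)):
        delta = 1 - np.random.rand() ** ((1 - it / T) ** lam)
        if np.random.rand() <= 0.5:
            v_new[k] = v[k] + delta * (upper[k] - v[k])
        else:
            v_new[k] = v[k] - delta * (v[k] - lower[k])
    return v_new
```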
6.3 Learning simultaneous adaptive clustering and classification learning via MOEA

In this section, we introduce a multiobjective evolutionary algorithm for simultaneous adaptive clustering and classification (denoted MOASCC). The final goal is to enhance the performance of classification through the cooperation of clustering and classification. To achieve this goal, two objective functions, a fuzzy clustering connectedness function and the classification error rate, are adopted, and a specific mutation operator is designed to make use of the feedback from both clustering and classification. This section gives a detailed description of MOASCC, including its objective functions, its framework, and its computational complexity and convergence analysis.
6.3.1 Objective functions of MOASCC

In order to optimize clustering learning and classification learning simultaneously, MOASCC uses two objective functions: a clustering objective function and a classification objective function. Given a dataset of size N whose number of classes is M, and assuming that it can be partitioned into K clusters during the optimization process, the objective functions are as follows.

In terms of clustering, MOASCC designs an objective function called fuzzy cluster connectedness to measure the quality of clustering. This objective function is based on the assumption that a sample and its neighbors tend to belong to the same cluster, so the connectedness between different clusters should be minimized. In the objective function $f_1$ [see Formula (6.13)], L is a parameter that controls the number of neighbors contributing to the overall fuzzy connectedness, and $nn_{ij}$ represents the j-th nearest neighbor of sample $x_i$. $t_{i,nn_{ij}}$ is the connectedness between sample $x_i$ and $nn_{ij}$: a decreasing value 1/j, which gives more emphasis to nearer neighbors, is assigned to it if samples $x_i$ and $nn_{ij}$ are located in different clusters [31]. $p(c_k|nn_{ij})$ represents the probability of sample $nn_{ij}$ belonging to cluster $c_k$. For each sample $x_i$, $\sum_{j=1}^{L} t_{i,nn_{ij}} \cdot p(c_k|nn_{ij})$ measures the fuzzy connectedness between sample $x_i$ and the clusters to which $x_i$ does not belong. If all the L nearest neighbors of sample $x_i$ belong to the same cluster as $x_i$, its fuzzy connectedness is 0; otherwise, $1/j \cdot p(c_k|nn_{ij})$ is assigned to the j-th nearest neighbor as a penalty term.

$$
f_1 = \sum_{i=1}^{N} \left( \sum_{j=1}^{L} t_{i,nn_{ij}} \cdot p(c_k | nn_{ij}) \right),
\qquad
t_{i,nn_{ij}} =
\begin{cases}
0, & \text{if } \exists\, c_k : x_i \in c_k \wedge nn_{ij} \in c_k \\
1/j, & \text{otherwise}
\end{cases}
\tag{6.13}
$$
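As a concrete illustration (not the authors' code), the sketch below evaluates $f_1$ from Formula (6.13) given a hard cluster assignment, the index matrix of the L nearest neighbors, and a precomputed membership matrix $p(c_k|x)$ such as the one defined below in Formula (6.14). The formula leaves the index convention of $c_k$ implicit; here the penalty for a neighbor in a different cluster is weighted by that neighbor's membership in its own cluster, which is one possible reading.

```python
def fuzzy_cluster_connectedness(labels, nn_idx, P, L):
    """One reading of f1, Formula (6.13).

    labels: list of cluster labels, one per sample.
    nn_idx: (N, L) indices of the L nearest neighbors of each sample.
    P:      (N, K) membership p(c_k | x) of each sample in each cluster.
    """
    f1 = 0.0
    for i, k_i in enumerate(labels):
        for j in range(L):
            nb = nn_idx[i][j]
            if labels[nb] != k_i:
                # penalty 1/(j+1), weighted by the neighbor's fuzzy membership
                # in the (different) cluster it was assigned to
                f1 += (1.0 / (j + 1)) * P[nb][labels[nb]]
    return f1
```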
MOASCC considers three methods [see Formulas (6.14), (6.15), and (6.16)] to calculate $p(c_k|x_i)$. In Formula (6.14), $p(c_k|x_i)$ is defined as the proportion of the L nearest neighbors of sample $x_i$ that belong to the k-th cluster. This approach does not impose strict assumptions on the underlying structure of the dataset, which is why MOASCC adopts it in the later experiments.

$$
p(c_k | x_i) = \frac{\sum_{j=1}^{L} s(nn_{ij}, c_k)}{L},
\qquad
s(a, c_k) =
\begin{cases}
1, & a \in c_k \\
0, & a \notin c_k
\end{cases}
\tag{6.14}
$$

$$
p(c_k | x_i) = \frac{1 / \min_{x \in c_k} \|x_i - x\|}{\sum_{k=1}^{K} \left( 1 / \min_{x \in c_k} \|x_i - x\| \right)}
\tag{6.15}
$$

$$
p(c_k | x_i) = \frac{1 / \|x_i - center_k\|}{\sum_{k=1}^{K} \left( 1 / \|x_i - center_k\| \right)},
\qquad
center_k = \frac{1}{|c_k|} \sum_{j=1}^{|c_k|} x_j,\ x_j \in c_k
\tag{6.16}
$$
In Formulas (6.15) and (6.16), $\|x_i - x_j\|$ denotes the Euclidean distance from sample $x_i$ to $x_j$, and $|c_k|$ represents the number of samples in cluster $c_k$; both methods use the Euclidean distance to calculate $p(c_k|x_i)$. In Formula (6.15), $p(c_k|x_i)$ is related to the minimum Euclidean distance from sample $x_i$ to the samples in cluster $c_k$, so it is unbiased with respect to the structure of the given dataset. In Formula (6.16), $p(c_k|x_i)$ is decided by the Euclidean distance between $x_i$ and the center of cluster $c_k$ (denoted $center_k$); the downside of this method is that it is biased toward spherically shaped clusters. Note that in Formulas (6.15) and (6.16), if $\min\|x_i - x\| = 0$ or $\|x_i - center_k\| = 0$, $p(c_k|x_i)$ is set to 1.

In terms of classification, MOASCC employs an objective function adopted from Ref. [33] and associates it with clustering through Bayesian theory. $f_2$ [see Formula (6.17)] is the classification objective function, representing the classification error rate on the training samples.

$$
f_2 = \sum_{i=1}^{N_{tr}} \frac{\delta(l(x_i), y_i)}{N_{tr}},
\qquad
\delta(a, b) =
\begin{cases}
0, & a = b \\
1, & a \ne b
\end{cases}
\tag{6.17}
$$

In Formula (6.17), $N_{tr}$ is the number of training samples, $y_i$ is the true class label of sample $x_i$, and $l(x_i)$ is the predicted class label of $x_i$. If $l(x_i)$ differs from the true class label, $\delta(l(x_i), y_i) = 1$ and the classification error rate increases.

$$
l(x_i) = \arg\max_{1 \le m \le M} p(w_m | x_i)
\tag{6.18}
$$
Formula (6.18) gives the calculation of $l(x_i)$: the posterior probability $p(w_m|x_i)$ determines the output label of sample $x_i$. $p(w_m|x_i)$ represents the probability of sample $x_i$ belonging to class $w_m$ and is calculated as in Formula (6.19), where Bayesian theory is used to construct the relationship between clustering and classification; $p(w_m|c_k)$ is the probability that samples in cluster $c_k$ belong to class $w_m$. The term $p(c_k|x_i)$ is obtained according to Formula (6.14), (6.15), or (6.16), and $p(w_m|c_k)$ is calculated as in Formula (6.20).

$$
p(w_m | x_i) = \sum_{k=1}^{K} p(c_k | x_i)\, p(w_m | c_k)
\tag{6.19}
$$

$$
p(w_m | c_k) = \frac{|c_k \cap w_m|}{|c_k|}
\tag{6.20}
$$

In Formula (6.20), $|c_k \cap w_m|$ denotes the number of samples belonging to both cluster $c_k$ and class $w_m$. All the $p(w_m|c_k)$ values constitute a relation matrix

$$
P =
\begin{bmatrix}
p(w_1|c_1) & \cdots & p(w_M|c_1) \\
\vdots & & \vdots \\
p(w_1|c_K) & \cdots & p(w_M|c_K)
\end{bmatrix}
$$

whose size is $K \times M$; it plays an important role in discovering the structure of the given dataset. For each row vector of the relation matrix, $\sum_{m=1}^{M} p(w_m|c_k) = 1$, and the row shows the distribution of the samples in cluster $c_k$: all the samples in cluster $c_k$ belong to the same class if and only if the row contains a single nonzero value $p(w_m|c_k) = 1$. Therefore, the number and values of the nonzero elements reveal the quality of the clustering. For each column vector of the relation matrix, the number of nonzero entries reflects the distribution of a given class: if there is more than one nonzero element, the given class is scattered over different clusters. Hence, the relation matrix clearly shows the relationship between clustering and classification.
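A small Python sketch (illustrative only; function names and array layouts are this sketch's own) of Formulas (6.17)-(6.20): it builds the K×M relation matrix from the labeled training samples and predicts class labels through the Bayesian combination of Formula (6.19).

```python
import numpy as np

def relation_matrix(train_clusters, train_classes, K, M):
    """P_rel[k, m] = p(w_m | c_k), Formula (6.20), from the labeled training samples."""
    train_clusters = np.asarray(train_clusters)
    train_classes = np.asarray(train_classes)
    P_rel = np.zeros((K, M))
    for k in range(K):
        members = (train_clusters == k)
        if members.sum() > 0:
            for m in range(M):
                P_rel[k, m] = np.sum(members & (train_classes == m)) / members.sum()
    return P_rel

def predict_labels(P_cluster, P_rel):
    """Formulas (6.18)-(6.19): p(w_m | x_i) = sum_k p(c_k | x_i) p(w_m | c_k)."""
    P_class = P_cluster @ P_rel          # (N, K) x (K, M) -> (N, M)
    return np.argmax(P_class, axis=1)

def classification_error(pred, y_true):
    """Formula (6.17): error rate over the training samples."""
    return float(np.mean(np.asarray(pred) != np.asarray(y_true)))
```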
6.3.2 The framework of MOASCC

A number of MOEAs have been proposed for multiobjective optimization problems in recent years. MOASCC chooses NSGA-II to optimize clustering and classification because of its popularity and effectiveness. The whole procedure of MOASCC is described in Algorithm 6.1. MOASCC uses the locus-based adjacency representation [52-55] to encode individuals in the optimization process. In this representation scheme, each individual consists of N genes $\{g_1, g_2, \ldots, g_N\}$, and the value of each gene $g_i$ lies in the range $\{1, 2, \ldots, N\}$. If $g_i$ is assigned a value j, sample $x_i$ is connected to sample $x_j$. In the decoding process, all the connected samples are partitioned into one component, and the number of separated components determines the number of clusters.
Algorithm 6.1 The pseudocode of MOASCC
Require: The size of the population: pop; the number of evolutionary generations: gen; the probability of crossover: pc; the probability of mutation: pm; testing dataset: dataset.
Ensure:
1: Initialization: Select the training samples randomly; generate an MST for the given dataset; implement the initialization scheme and form the initial population G1; decode each individual to find the number of clusters and evaluate the values of the two objective functions.
2: for t = 1:gen do
3:   Execute uniform crossover and the proposed mutation on the current population Gt and generate new individuals: newgeno.
4:   Decode newgeno and evaluate their objective function values.
5:   Combine Gt and newgeno, and perform nondominated sorting to assign front-level ranks to them.
6:   Select pop solutions for the next population Gt+1 according to their rank and crowding distance.
7: end for
8: Select the nondominated solutions into non_genotype, and decode these nondominated solutions.
9: Find the solution with the best ARI value among all the nondominated solutions, and select it as the final solution.
10: Assign every test sample a class label according to Formulas (6.18) and (6.19) and calculate the classification accuracy of the final solution.
End
Output: Classification accuracy: accuracy; number of clusters: clusters.
For example, for an individual encoded as {2, 3, 1, 5, 5}, decoding shows that the samples are divided into two clusters, the first containing samples {1, 2, 3} and the remaining samples belonging to the other cluster. In order to generate a group of high-quality individuals in the initialization step, a minimal spanning tree (MST) is created. The algorithm uses the Euclidean distance to measure the similarity of samples and adopts Prim's algorithm [56] to build the MST, with the cost of an edge defined as the Euclidean distance between the two samples. In the initialization step, removing edges of the MST (by modifying the value of $g_i$ to i) produces different partitions, and whether the edge between two samples is removed depends on its cost. In a population with pop individuals (see Algorithm 6.2), if the number of individuals is less than the number of samples, each individual is represented as the graph obtained by removing the j-th most expensive edge from the MST; otherwise, another edge is removed randomly from one of the first N - 1 individuals to obtain a new individual. After decoding, the initial population yields different partitions with at most three clusters; in the subsequent evolutionary process, the crossover and mutation operators generate diverse solutions with different numbers of clusters. The advantages of the locus-based adjacency representation are the following: (1) it determines the number of clusters automatically instead of requiring it to be set in advance; (2) it produces a set of individuals with different partitions in a single run of the MOEA. Uniform crossover is adopted to generate new individuals in the crossover step. Suppose $A_1 = \{a_{11}, \ldots, a_{1i}, \ldots, a_{1N}\}$ and $A_2 = \{a_{21}, \ldots, a_{2i}, \ldots, a_{2N}\}$ are two individuals selected for uniform crossover; the offspring individual $B = \{b_1, \ldots, b_i, \ldots, b_N\}$ is decided by a mask $\{m_1, \ldots, m_i, \ldots, m_N\}$ with $m_i \in \{0, 1\}$: when $m_i = 0$, $b_i = a_{1i}$; otherwise $b_i = a_{2i}$. Uniform crossover provides an unbiased chance to the chosen parents and produces a new individual containing much of the structure inherited from its parents while differing from both of them.
Algorithm 6.2 Initialization
1: for j = 1:pop do
2:   if j < N then
3:     Remove the j-th most expensive edge from the MST to generate an initial individual genotype_j;
4:   else
5:     Select one individual genotype_i (i < N) randomly, and remove one edge of genotype_i randomly to obtain a new individual genotype_j.
6:   end if
7: end for
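As an illustration of the locus-based adjacency decoding described above (not code from the chapter), the following sketch turns a genotype into cluster labels by tracing the sample-to-sample links; the example genotype {2, 3, 1, 5, 5} from the text decodes into two clusters.

```python
def decode(genotype):
    """Decode a locus-based adjacency genotype (1-based gene values) into cluster labels."""
    n = len(genotype)
    labels = [-1] * n
    cluster = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        # follow the chain of links until a visited sample is reached or a loop closes
        path, node = [], start
        while labels[node] == -1 and node not in path:
            path.append(node)
            node = genotype[node] - 1        # gene values are 1-based sample indices
        target = labels[node] if labels[node] != -1 else cluster
        if target == cluster:
            cluster += 1
        for p in path:
            labels[p] = target
    return labels

print(decode([2, 3, 1, 5, 5]))   # -> [0, 0, 0, 1, 1]: two clusters
```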
In order to make use of the feedback drawn from clustering and classification, MOASCC proposes a specific mutation scheme, whose procedure is given in Algorithm 6.3. In this scheme, the probability $p(c_k|x_i)$ is used to decide whether to mutate $g_i$. After decoding, if sample $x_i$ is assigned to cluster $c_{K1}$ but the probability $p(c_k|x_i)$ shows that sample $x_i$ has its greatest membership in cluster $c_{K2}$, then $x_i$ mutates to connect with a random sample in cluster $c_{K2}$. Note that the step marked with the symbol "*" is already computed in the function evaluation step and does not need to be recalculated here. According to the proposed mutation scheme, if sample $x_i$ belongs to the training samples and $\delta(l(x_i), y_i) = 1$, it mutates to connect with a training sample having the same label. Since an MOEA obtains a set of solutions with different numbers of clusters, how to select a reasonable final solution is a problem to be solved. In Ref. [57], a measurement called the adjusted Rand index was proposed for classification; MOASCC uses it to select the final optimal solution from the Pareto front.

$$
\text{ARI} = \frac{\sum_{i,j} \binom{n_{ij}}{2} - \left[ \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2} \right] \Big/ \binom{n}{2}}
{\frac{1}{2} \left[ \sum_i \binom{n_{i\cdot}}{2} + \sum_j \binom{n_{\cdot j}}{2} \right] - \left[ \sum_i \binom{n_{i\cdot}}{2} \sum_j \binom{n_{\cdot j}}{2} \right] \Big/ \binom{n}{2}}
\tag{6.21}
$$
Algorithm 6.3 Mutation
1: for every individual genotype_j in the current population do
2:   Generate a uniform random number rand ∈ [0, 1];
3:   if rand < pm then
4:     for i = 1:N do
5:       Find the cluster that sample x_i belongs to (suppose c_K1), calculate p(c_K1|x_i), and find the cluster with the highest probability (suppose c_K2). (*)
6:       if K1 ≠ K2 then
7:         Mutate genotype_ij to connect with a random sample in cluster c_K2.
8:       end if
9:       if sample x_i is a training sample and δ(l(x_i), y_i) = 1 then
10:        Mutate genotype_ij to connect with a randomly selected sample with the same label.
11:      end if
12:    end for
13:  end if
14: end for
This index measures the similarity between two partitions. Suppose there are two different partitions U and V: $n_{ij}$ represents the number of samples in both the i-th class of partition U and the j-th class of partition V, $n_{i\cdot}$ is the number of samples in the i-th class of partition U, and $n_{\cdot j}$ is the number of samples in the j-th class of partition V. In MOASCC, the two partitions U and V correspond to the real classification of the training samples and the clustering result obtained from MOASCC, respectively. Finally, the solution with the highest similarity between the real partition and the clustering result is selected as the output. Compared with Formula (6.20), $n_{ij}$ equals $p(w_i|c_j) \cdot |c_j|$, which is also the reason why MOASCC chooses this measurement to select the final Pareto-optimal solution. Since MOASCC is a simultaneous clustering and classification algorithm, all the samples (both training and test samples) are clustered together. It can therefore calculate the fuzzy membership of all the samples and obtain the relation matrix from the training samples. The relation matrix not only reflects the relationship between clustering and classification but can also be used in the prediction of the test samples: in the classification process, the class labels of the test samples are predicted according to Formulas (6.18) and (6.19).
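The adjusted Rand index of Formula (6.21) can be computed directly from the contingency table of the two partitions; the sketch below (illustrative, written from the formula as stated) avoids external dependencies beyond NumPy.

```python
import numpy as np

def _c2(x):
    # "n choose 2" applied elementwise
    return x * (x - 1) / 2.0

def adjusted_rand_index(u, v):
    """ARI between two partitions u and v (label sequences), Formula (6.21)."""
    u = np.asarray(u); v = np.asarray(v)
    n = len(u)
    u_classes, v_classes = np.unique(u), np.unique(v)
    # contingency table: n_ij = samples in class i of U and class j of V
    table = np.array([[np.sum((u == ui) & (v == vj)) for vj in v_classes]
                      for ui in u_classes], dtype=float)
    sum_ij = _c2(table).sum()
    sum_i = _c2(table.sum(axis=1)).sum()
    sum_j = _c2(table.sum(axis=0)).sum()
    expected = sum_i * sum_j / _c2(n)
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1, 2], [0, 0, 1, 2, 2]))
```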
6.3.3 Computational complexity

Given a dataset of size N and dimension D, the time complexity of evaluating one individual is O(N·max{LK, MK}), in which K ∈ {1, ..., K_max} is the number of clusters. The complexity of the nondominated sorting is O(pop²). In the worst case, MOASCC requires O(gen·max{N·pop·L·K_max, N·pop·M·K_max, pop²}) computations. Note that some one-off computation is also required: before the initialization step, a similarity matrix and the MST are calculated, whose time complexities are O(N²D) and O(N), respectively. Finally, the complexity of the final selection is O(N²·n_nondom), where n_nondom is the number of nondominated solutions. The authors [46] also give a convergence analysis of MOASCC; please refer to the literature [46] for details.
6.4 A sparse spectral clustering framework via MOEA

In this section, we introduce the last algorithm in detail: how to bring sparse representation into spectral clustering via MOEAs (denoted SRMOSC) and how to extend it to semisupervised clustering, including its mathematical description, the specific operators designed for it, the Laplacian matrix construction method, and the tradeoff point selection phase.
6.4.1 Mathematical description of SRMOSC

For a dataset $A = \{a_1, a_2, \ldots, a_N\}$ with N samples to be reconstructed, considering both sparsity and reconstruction error, the similarity matrix construction in spectral clustering can be formulated as

$$
\begin{aligned}
\min_x \quad & \left\{ \|x\|_0,\ \|Ax - A\|_2^2 \right\} \\
\text{s.t.} \quad & x_{ii} = 0 \\
& x_{ij} \in [0, 1]
\end{aligned}
\tag{6.22}
$$

where $x \in \mathbb{R}^{N \times N}$ is the sparse matrix to be optimized, which is used for constructing the similarity matrix in spectral clustering. Since all the samples in the dataset are reconstructed at the same time, A is not only the overcomplete dictionary but also the measurement matrix. For any sample $a_i$, the aim is to reconstruct it as $Ax_{:i} = \sum_{j=1}^{N} x_{ji} a_j$, and the constraint $x_{ii} = 0$ indicates that sample $a_i$ is not used to reconstruct itself. In this way, all the samples in the dataset are represented by other samples, and a sparse matrix x is formed to reflect the relationships among all the samples. If $x_{ij}$ is a nonzero entry, samples $a_i$ and $a_j$ are more likely to be assigned to the same cluster; otherwise, they may be in different clusters. It should be noted that the sparse matrix x is not symmetric, so it still needs a transformation before it can be used in the spectral clustering algorithm. The procedure of spectral clustering is given in Algorithm 6.4, and Algorithm 6.5 presents the framework of SRMOSC.
Algorithm 6.4 Unnormalized spectral clustering
Input: Dataset A, number of clusters K.
Begin:
1: Step 1: Construct the similarity matrix S.
2: Step 2: Compute the unnormalized Laplacian L.
3: Step 3: Compute the first K eigenvectors {u1, u2, ..., uK} of L.
4: Step 4: Construct a matrix Y ∈ ℝ^(N×K) whose column vectors are {u1, u2, ..., uK}. Let vi ∈ ℝ^K be the i-th row vector of Y.
5: Step 5: Cluster {v1, v2, ..., vN} with k-means into clusters {C1, ..., CK}.
End
Output: Clustering result A1, ..., AK with Ai = {j | vj ∈ Ci}.
Algorithm 6.5 Framework of SRMOSC
Input: dataset A = {a1, a2, ..., aN}; number of clusters: K; population size: pop; maximum number of iterations: gen; crossover probability: pc; mutation probability: pm.
Begin:
1: Step 1 Initialization: Generate the initial population P0 according to the initialization scheme.
2: Step 2 Cycle: Execute the MOEA and generate a set of Pareto solutions Pgen.
3: Step 3 Laplacian matrix construction: Construct a symmetric matrix according to each solution in Pgen and generate the corresponding graph Laplacian matrix L.
4: Step 4 Spectral clustering: Apply Steps 3-5 in Algorithm 6.4 to L.
5: Step 5 Trade-off point selection: Select a trade-off point xTO from the nondominated solutions in Pgen using the proposed selection approach.
End
Output: Clustering result A1, ..., AK with Ai = {j | vj ∈ Ci}.
Although the MOEA used in Algorithm 6.5 is not specified and any state-of-the-art MOEA could be adopted, such as the Pareto envelope-based selection algorithm II (PESA-II) [58] or the multiobjective evolutionary algorithm based on decomposition (MOEA/D) [38], SRMOSC uses the nondominated sorting genetic algorithm II (NSGA-II) [35] in its framework. Taking into account the nature of the problem to be solved, components tailored to it are designed: a new initialization scheme, specific crossover and mutation operators, and a rule for choosing the tradeoff solution from the final Pareto set in the selection phase. Before the selection phase, a preprocessing step that transforms the sparse matrices in Pgen into symmetric matrices is carried out, since the Laplacian matrix L in spectral clustering needs to be symmetric. The details of these components are described in the following sections.
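To make the model of Formula (6.22) concrete, the following sketch (illustrative, not the chapter's implementation) evaluates the two objectives, sparsity and reconstruction error, for a candidate sparse matrix x; the data layout with samples as columns of A is an assumption consistent with the reconstruction $Ax_{:i} = \sum_j x_{ji} a_j$ above.

```python
import numpy as np

def srmos_objectives(x, A):
    """Objectives of Formula (6.22) for a candidate sparse matrix x.

    A: (d, N) data matrix whose columns are the samples a_1..a_N.
    x: (N, N) candidate with x[i, i] == 0 and x[i, j] in [0, 1].
    """
    sparsity = np.count_nonzero(x)            # ||x||_0
    recon_error = np.sum((A @ x - A) ** 2)    # ||Ax - A||_2^2
    return sparsity, recon_error

# tiny example: three 2-D samples as columns of A, and a hand-made candidate x
A = np.array([[0.0, 1.0, 2.0],
              [0.0, 1.0, 2.0]])
x = np.array([[0.0, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(srmos_objectives(x, A))
```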
6.4.2 Extension on semisupervised clustering

If some labeled samples exist, the unsupervised clustering problem can be converted into a semisupervised clustering problem, and the above model can be extended to the
semisupervised problem by adding constraints derived from the pairwise can-links or cannot-links. Suppose the labeled data tell us that sample $a_i \in C_m$ and sample $a_j \in C_k$ with $m \ne k$. Then semisupervised spectral clustering can be modeled as

$$
\begin{aligned}
\min_x \quad & \left\{ \|x\|_0,\ \|A - Ax\|_2^2 \right\} \\
\text{s.t.} \quad & x_{ii} = 0 \\
& \sum_{k=1}^{K} \sum_{a_j \in C_k,\, a_i \notin C_k} x_{ij} = 0 \\
& x_{ij} \in [0, 1]
\end{aligned}
\tag{6.23}
$$

where the constraint $\sum_{k=1}^{K} \sum_{a_j \in C_k,\, a_i \notin C_k} x_{ij} = 0$ guarantees that samples with different labels will not connect with each other. The reason why SRMOSC does not turn all the can-links into constraints is that doing so might add too many nonzero entries, since x is a sparse matrix. The constraint in (6.23) may also be too hard to satisfy, in which case it can be relaxed to the following model (6.24), where the connection among different clusters is minimized as an additional objective:

$$
\begin{aligned}
\min_x \quad & \left\{ \|x\|_0,\ \|A - Ax\|_2^2,\ \sum_{k=1}^{K} \sum_{a_j \in C_k,\, a_i \notin C_k} x_{ij} \right\} \\
\text{s.t.} \quad & x_{ii} = 0 \\
& x_{ij} \in [0, 1]
\end{aligned}
\tag{6.24}
$$

SRMOSC adopts model (6.23) to optimize the semisupervised clustering problem, and the initialization and mutation operators are designed more strictly so as to solve this problem properly, as discussed in the corresponding sections.
6.4.3 Initialization

In order to get a set of high-quality solutions, an initialization scheme is designed based on the assumption that a sample prefers to be a linear combination of its neighbors. The procedure of the initialization scheme is given in Algorithm 6.6, in which pop and N are the population size and the number of samples, respectively. For each sample $a_i$, the distances from it to the remaining samples are sorted before initialization; this is called the "neighbor information" of sample $a_i$. For an initialized individual $x^l \in \mathbb{R}^{N \times N}$, the l-th nearest neighbor of sample $a_j$ is $a_m$, the corresponding entry in the sparse matrix $x^l$ is $x^l_{mj}$, and mod(l, N) is the remainder after division l/N. Two cases are considered according to the size of the population and of the dataset, taking into account the reconstruction error $|a_j - Ax_{:j}| = |a_j - \sum_{i=1}^{N} x_{ij} a_i|$ of sample $a_j$. In the first case, each sample is reconstructed by its l-th nearest neighbor for the l-th individual, and the reconstruction error is $|a_j - x_{mj} a_m|$.
Algorithm 6.6 Initialization
1. for each individual x^l (l = 1:pop) do
2.   for each column x^l_{:j} of x^l do
3.     if l ≤ N then
4.       Find the l-th nearest neighbor of sample aj: am, and set x^l_{mj} = rand (rand is a uniform random number in [0,1]).
5.     else
6.       x^l_{:j} = x^{mod(l,N)}_{:j}.
7.       Generate a uniform random integer r ∈ [1,N]\{mod(l,N), j}, and set x^l_{rj} = rand.
8.       if aj is a labeled sample then
9.         Find all the labeled samples that have different labels from aj, and set the corresponding entries to 0.
10.        Randomly select a sample with the same label as aj, and set the corresponding entry in x^l to a nonzero value.
11.      end if
12.    end if
13.  end for
14. end for
When the population size exceeds the number of samples in the dataset, the l-th (l > N) individual is initialized as a sparse matrix in which each column vector consists of two nonzero entries, one inherited from the mod(l, N)-th individual and the other a uniformly and randomly selected entry subject to the constraints mentioned in Algorithm 6.6. Lines 8-11 are designed for semisupervised clustering: for each labeled sample, (1) all the entries that reflect the relationship between labeled samples in different clusters are set to 0, and (2) a labeled sample in the same cluster is randomly selected and the corresponding entry in the sparse matrix x is set to a nonzero value. In this way, a set of diverse solutions is expected to be obtained.
6.4.4 Crossover

Considering the different effects of nondominated and dominated individuals, SRMOSC designs a crossover strategy (Algorithm 6.7) that includes two different cases. Case 1 makes use of the nondominated individuals in the current population and implements a uniform crossover on each column vector of the current individual and a uniformly selected nondominated individual; note that different column vectors of the newly generated offspring are obtained from different nondominated solutions. With this process, a set of high-quality offspring is expected to be obtained under the guidance of the nondominated individuals.
Algorithm 6.7 Crossover
1: for each individual x^l (l = 1, ..., ncr) to implement the crossover operator do
2:   Generate a uniform random number α ∈ [0, 1].
3:   if α > 0.5 then
4:     % case 1:
5:     for j = 1:N do
6:       Choose a nondominated solution y in the current population uniformly at random.
7:       Implement uniform crossover on the j-th column vectors x^l_{:j} and y_{:j}.
8:     end for
9:   else
10:    % case 2:
11:    Choose a solution z in the current population uniformly at random and generate a uniform random value β ∈ [0, 1].
12:    x^l = β·x^l + (1 - β)·z;
13:  end if
14: end for
In contrast to case 1, case 2 simply uses two individuals to produce an offspring by intermediate crossover; the idea is to preserve the structure of the parents by using the whole information of the current individual and a randomly selected individual. Note that the sparsity (number of nonzero entries) of such an offspring is greater than that of its parents.
6.4.5 Mutation

Taking into account the sparsity of x, the mutation operator applies different strategies to the entries with value 0 and to those different from 0, based on the same assumption as the initialization. Suppose $a_i$ is the k-th (k = 1, ..., N-1) nearest neighbor of $a_j$, and that $\gamma$ and rand are chosen uniformly at random in [0, 1]. If $x_{ij}$ is a nonzero entry, the probability that it mutates to zero is set to k/N; otherwise, the zero entry mutates to a nonzero value with probability 1 - k/N. The proposed mutation scheme thus assumes that the nearer sample $a_i$ is to $a_j$, the higher the probability that the corresponding entry in x is nonzero, while samples that are far away from each other are still given an opportunity to reconstruct each other. The procedure of the mutation operator is given in Algorithm 6.8. In this scheme, nmu individuals are selected to implement the operator, and for unsupervised clustering Formula (6.25) is simply executed. For semisupervised clustering, however, the prior knowledge obtained from the labeled data should be taken into consideration.
Algorithm 6.8 Mutation
1: for l = 1:nmu do
2:   for j = 1:N do
3:     if sample aj is nonlabeled data then
4:       for i = 1:N do
5:         Execute the proposed mutation scheme according to Formula (6.25).
6:       end for
7:     else
8:       Randomly select a sample with the same label, and set the corresponding entry in x^l to rand.
9:       Find all the samples with different labels, and set the corresponding entries in x^l to 0.
10:    end if
11:  end for
12: end for
Taking into account that the matrix x to be optimized is a sparse matrix, SRMOSC randomly selects a sample with the same label for each labeled sample and sets the corresponding entry in x to a nonzero value, while all the constraints imposed by labeled data with different labels are strictly satisfied in this scheme. In this way, a set of high-quality feasible solutions can be generated.

$$
x_{ij} =
\begin{cases}
0, & rand \le \dfrac{k}{N} \ \wedge\ x_{ij} \ne 0 \\
x_{ij} \cdot \gamma, & rand > \dfrac{k}{N} \ \wedge\ x_{ij} \ne 0 \\
\gamma, & rand \le 1 - \dfrac{k}{N} \ \wedge\ x_{ij} = 0 \\
0, & rand > 1 - \dfrac{k}{N} \ \wedge\ x_{ij} = 0
\end{cases}
\tag{6.25}
$$
6.4.6 Laplacian matrix construction

The sparse matrix x obtained from the MOEA is not symmetric, so it has to be transformed into a symmetric matrix before the subsequent spectral clustering steps can be applied. A simple method for this transformation is

$$
s_{ij} = \max(x_{ij}, x_{ji})
\tag{6.26}
$$

$$
d_{ij} =
\begin{cases}
0, & i \ne j \\
\sum_{m=1}^{N} s_{im}, & i = j
\end{cases}
\tag{6.27}
$$

where $s_{ij}$ is the corresponding entry of the similarity matrix $S \in \mathbb{R}^{N \times N}$, and $D \in \mathbb{R}^{N \times N}$ is a diagonal matrix whose diagonal element $d_{ii}$ is the sum of the i-th column of the similarity matrix S. In this way, it is guaranteed that the Laplacian matrix

$$
L = D - S
\tag{6.28}
$$

is symmetric and positive semidefinite.
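Formulas (6.26)-(6.28) amount to a symmetrization followed by the standard unnormalized Laplacian; a short sketch (illustrative only):

```python
import numpy as np

def laplacian_from_sparse_code(x):
    """Build the symmetric similarity matrix and unnormalized Laplacian,
    following Formulas (6.26)-(6.28)."""
    S = np.maximum(x, x.T)          # s_ij = max(x_ij, x_ji)
    D = np.diag(S.sum(axis=1))      # d_ii = sum of the i-th row/column of S
    return D - S                    # L = D - S, symmetric positive semidefinite

# usage: eigen-decompose L and feed the first K eigenvectors to k-means (Algorithm 6.4)
```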
6.4.7 Final solution selection phase

In the final step of the algorithm, a tradeoff point should be selected from the set of Pareto-optimal solutions. In Ref. [59], a knee point on the PF fitted by B-splines is chosen as the final reconstruction result. SRMOSC does not adopt this strategy, because here the PF is not a smooth curve and there are no obvious knee regions or knee points. Instead, the algorithm uses a measurement called the ratio cut (RC) [60], defined as

$$
RC = \sum_{i=1}^{K} \frac{L(V_i, \overline{V_i})}{|V_i|}
\tag{6.29}
$$

Suppose a graph G = (V, E), where V is the set of vertices and E is the set of edges. Given a partition in which all the vertices of V are divided into K nonempty sets $V_1, \ldots, V_i, \ldots, V_K$ with $\overline{V_i} = V \setminus V_i$, $\bigcup_{i=1}^{K} V_i = V$, and $V_i \cap V_j = \varnothing$ for all $i \ne j$, $|V_i|$ is the number of vertices in $V_i$, and $L(V_i, \overline{V_i})$ is defined as $\sum_{i \in V_i,\, j \in \overline{V_i}} s_{ij}$. After implementing Step 4 in Algorithm 6.5, different partitions are obtained from the Pareto-optimal solutions. In order to measure which one should be chosen as the final solution, a standard adjacency matrix is needed to calculate $L(V_i, \overline{V_i})$ in the ratio cut. In the construction of the standard adjacency matrix, all the nondominated solutions make the same contribution: once the entry $x^{PF}_{ij}$ of any nondominated solution is nonzero, the corresponding entry of the standard adjacency matrix is set to 1. The procedure to construct the standard adjacency matrix Adj is given in Algorithm 6.9.

Algorithm 6.9 Standard adjacency matrix construction
1: for each nondominated solution do
2:   for each entry x^PF_ij do
3:     if x^PF_ij > 0 then
4:       Adj_ij ← 1.
5:     end if
6:   end for
7: end for
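A sketch of this selection measurement (illustrative only, not the chapter's code): the standard adjacency matrix is built as in Algorithm 6.9 above, and the ratio cut of Formula (6.29) is then evaluated for each candidate partition.

```python
import numpy as np

def standard_adjacency(nondominated_solutions):
    """Adj[i, j] = 1 if any nondominated solution has a nonzero entry at (i, j)."""
    Adj = np.zeros_like(nondominated_solutions[0], dtype=float)
    for x_pf in nondominated_solutions:
        Adj[x_pf > 0] = 1.0
    return Adj

def ratio_cut(labels, Adj):
    """RC of Formula (6.29) for the partition given by cluster labels."""
    labels = np.asarray(labels)
    rc = 0.0
    for k in np.unique(labels):
        in_k = (labels == k)
        cut = Adj[np.ix_(in_k, ~in_k)].sum()   # L(V_k, complement of V_k)
        rc += cut / in_k.sum()
    return rc
```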
6.4.8 Complexity analysis

1) Space complexity: The memory in this algorithm is used to store the distance ranking among all the samples and the population, whose space complexities are O(N²) and O(pop·N²), respectively.
2) Time complexity: The main time cost lies in the working cycle of the MOEA. The time complexities of initialization, crossover, mutation, and evaluation are O(pop·N²), O(ncr·N²), O(nmu·N²), and O(pop·N²), respectively, where ncr and nmu are the numbers of individuals that undergo crossover and mutation. The time complexity of the population update in each generation depends on the MOEA adopted; in the experiments, the time complexity of this step is O((2pop)²). Before initialization, a distance matrix among all the samples needs to be calculated, and the distances from each sample to the remaining samples are sorted; the complexity of this step depends on the sorting algorithm. In Step 3 and Step 5 of SRMOSC (Algorithm 6.5), the time complexities are both O(N²·n_P), where n_P is the number of Pareto solutions. The time complexity of Step 4 also depends on the method adopted to compute the first K eigenvectors. Hence, the total time complexity of SRMOSC can be simplified to O(pop·gen·N²).
6.5 Experiments

This section presents experiments on and analysis of the three algorithms.
6.5.1 The experiments of MOEA on constrained multiobjective optimization problems

Since the algorithm in this section is built on the foundation of NSGA-II, with two components added to handle constraints and repair infeasible individuals, both the contribution of the two components and the overall effect are examined in this section.

6.5.1.1 Experimental setup

In the paper [46], the proposed algorithm is compared with NSGA-II and the algorithm in the literature [51] (referred to as Woldesenbet's algorithm).
Table 6.1: The characteristics of the test problems.

Test problem | Objective dimensions | Decision dimensions | Inequality constraints | Equality constraints | Linear | Nonlinear | Active
BNH          | 2 | 2  | 2 | 0 | 0 | 2 | 0
SRN          | 2 | 2  | 2 | 0 | 1 | 1 | 0
TNK          | 2 | 2  | 2 | 0 | 0 | 2 | 1
CONSTR       | 2 | 2  | 2 | 0 | 2 | 0 | 1
OSY          | 2 | 6  | 6 | 0 | 4 | 2 | 3
CTP1         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP2         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP3         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP4         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP5         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP6         | 2 | 10 | 1 | 0 | 0 | 1 | 1
CTP7         | 2 | 10 | 1 | 0 | 0 | 1 | 0
CTP8         | 2 | 10 | 2 | 0 | 0 | 1 | 1
Welded beam  | 2 | 4  | 4 | 0 | 1 | 3 | 0
Fourteen benchmark functions are adopted to test the performance of the proposed algorithm: BNH [61], SRN [62], TNK [63], CONSTR [64], OSY [65], Welded Beam [66], and CTP1-CTP8 [67]. The characteristics of these problems are summarized in Table 6.1. All the algorithms are run 30 times on the adopted test problems with a population size of 100, a crossover rate of 0.8, a mutation rate of 0.2, and distribution index $\eta_c$ = 15, $\lambda$ = 2. For a fair comparison, all the algorithms use an archive to store the Pareto-optimal solutions, and the archive size is set to 100. In order to select an appropriate number of evaluations, CTP2 was chosen as a representative to examine how the IGD values change with the number of evaluations (Fig. 6.2), with the evaluation times ranging from 10,000 to 100,000 in steps of 10,000.

Figure 6.2 The average value of the IGD metric changing with evaluation times on CTP2.

6.5.1.2 Performance metrics

6.5.1.2.1 IGD

IGD [68] is a performance metric that measures both the convergence and the diversity of the nondominated front obtained by an algorithm. Assume P is a set of uniformly distributed solutions on the true Pareto front (PF) and A is the solution set obtained from the optimization algorithm; IGD is defined as the average distance from P to A:

$$
IGD(A, P) = \frac{\sum_{v \in P} d(v, A)}{|P|}
\tag{6.30}
$$
Figure 6.2 The average value of IGD metric changing with evaluation times on CTP2.
where d(v, A) is the Euclidean distance from v to the nearest point in A. The lower IGD(A, P) is, the better A approximates the true PF.
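A minimal NumPy sketch of Eq. (6.30), with an illustrative function name:

```python
import numpy as np

def igd(A, P):
    """Inverted generational distance (Eq. 6.30): the average Euclidean distance
    from each reference point v in P (a sample of the true PF) to its nearest
    solution in the obtained set A. Lower values are better."""
    A, P = np.asarray(A, float), np.asarray(P, float)
    # pairwise distances between every v in P and every a in A
    d = np.linalg.norm(P[:, None, :] - A[None, :, :], axis=2)
    return d.min(axis=1).mean()
```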
6.5.1.2.2 Minimal spacing
Minimal spacing [69] is an enhanced uniformity metric modified from spacing. The minimal spacing Sm of a set A is calculated as follows.
Step 1: Normalize all the solutions in A.
Step 2: Separate the solutions in A into two parts, a calculated set Ac and an uncalculated set Au. Put all the solutions of A into Au, and randomly mark one solution as "true" and the rest as "false."
Step 2.1: Move the "true" solution from Au to Ac, and calculate its minimal distance to the solutions remaining in Au. The nearest solution in Au is then marked as "true."
Step 2.2: Repeat Step 2.1 until Au = ∅.

S_m(A) = \sqrt{\frac{\sum_{i=1}^{|A|-1} (d_i - \bar{d})^2}{|A| - 1}}   (6.31)

where d_i is the minimal Euclidean distance obtained in Step 2.1 and \bar{d} is the average of the d_i. Sm(A) = 0 indicates that the solutions in A are uniformly distributed.
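A small sketch of the procedure above (starting from an arbitrary solution rather than a random one); the function name is illustrative and the input is assumed to be an already normalized set of objective vectors.

```python
import numpy as np

def minimal_spacing(A):
    """Minimal spacing S_m (Eq. 6.31): grow a nearest-neighbour chain greedily
    and return the standard deviation of the chain distances d_i."""
    A = np.asarray(A, float)
    remaining = list(range(len(A)))
    current = remaining.pop(0)          # arbitrary starting solution
    dists = []
    while remaining:
        d = [np.linalg.norm(A[current] - A[j]) for j in remaining]
        k = int(np.argmin(d))
        dists.append(d[k])
        current = remaining.pop(k)      # the nearest solution becomes "true"
    dists = np.array(dists)
    return np.sqrt(((dists - dists.mean()) ** 2).sum() / (len(A) - 1))
```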
6.5.1.2.3 Coverage of two sets (ς)

\varsigma(A_1, A_2) = \frac{\left|\{a'' \in A_2 \mid \exists\, a' \in A_1 : a' \succeq a''\}\right|}{|A_2|}   (6.32)
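A minimal sketch implementing Eq. (6.32) for minimization problems, using weak Pareto dominance; the function name is illustrative.

```python
import numpy as np

def coverage(A1, A2):
    """Coverage metric (Eq. 6.32): the fraction of solutions in A2 that are
    weakly dominated by at least one solution in A1 (minimization)."""
    A1, A2 = np.asarray(A1, float), np.asarray(A2, float)
    covered = 0
    for b in A2:
        # exists a' in A1 with a' <= a'' component-wise
        if np.any(np.all(A1 <= b, axis=1)):
            covered += 1
    return covered / len(A2)
```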
ς(A1, A2) [70] ranges from 0 to 1. ς(A1, A2) = 1 means that every solution in A2 is dominated by some solution in A1, and ς(A1, A2) = 0 means that no solution in A1 dominates any solution in A2. Note that ς(A1, A2) is not determined by ς(A2, A1), so both values need to be calculated. In the experiments, these three performance metrics are used to measure the quality of the proposed algorithm against NSGA-II and Woldesenbet's algorithm. The PFs of test problems CTP3 and CTP4 are sets of discrete points, and the PF of CTP5 is a disjoint curve plus some discrete points; since these true PFs are not uniformly distributed, it is not meaningful to measure their uniformity, so minimal spacing values are not reported for these problems.
6.5.1.3 Comparison experiment results
In order to compare the performance of the three algorithms in a condensed way, we give the simulation results and performance metrics obtained by the three algorithms on all the selected test problems. Before showing the experimental results, the CTP problems are classified according to the characteristics of their PFs [25]; the other problems are not classified since they are not very complicated. As mentioned in Ref. [25], the classification is as follows. Group 1: CTP1 and CTP6, which both have continuous PFs; group 2: CTP2, CTP7, and CTP8, whose PFs consist of a finite number of disconnected regions; group 3: CTP3, CTP4, and CTP5, whose PFs consist of a finite number of discrete points. Before the detailed analysis, an illustration of the figures is necessary. Figs. 6.3, 6.5, 6.7, and 6.9 show the simulation results obtained from the three algorithms; each plot corresponds to the run with the best IGD value among 30 runs. Plots marked with "Proposed," "NSGA-II," and "Woldesenbet" are the simulation results obtained from the proposed algorithm, NSGA-II, and Woldesenbet's algorithm, respectively. True Pareto fronts are marked with red solid lines, the Pareto optimal solutions obtained from the three CMOEAs are marked with small blue circles, and the feasible objective spaces of the CTP problems are shaded. Figs. 6.4, 6.6, 6.8, and 6.10 show box plots of the performance metrics for these problems. The box plots marked with "IGD" and "Sm" give the IGD and minimal spacing metrics of the three compared CMOEAs, respectively, where "1, 2, 3" denote the proposed algorithm, NSGA-II, and Woldesenbet's algorithm, in that order.
Figure 6.3 The simulation results of three algorithms on BNH, SRN, TNK, CONSTR, OSY, and welded beam.
In the box plots marked with "ς of Pro and A," "1" denotes ς(proposed algorithm, A) and "2" denotes ς(A, proposed algorithm). Fig. 6.3 shows the simulation results obtained from the three algorithms on test problems BNH, SRN, TNK, CONSTR, OSY, and welded beam. The proposed algorithm achieves better performance than the other two algorithms on TNK, CONSTR, OSY, and welded beam. The nondominated solutions obtained from NSGA-II and Woldesenbet's algorithm do not distribute well over the smooth part of the PF on TNK, while the proposed algorithm obtains a better spread of nondominated solutions.
For test problem CONSTR, neither NSGA-II nor Woldesenbet's algorithm can obtain Pareto optimal solutions covering the whole true PF. For test problem OSY, neither NSGA-II nor Woldesenbet's algorithm can find the overall PF; in addition, NSGA-II cannot converge to the true PF. Fig. 6.4 shows the box plots of the performance metrics for test problems BNH, SRN, TNK, CONSTR, OSY, and welded beam. From these plots, we can observe that the three algorithms have comparable performance on SRN and TNK, although the proposed algorithm still shows a slight advantage in IGD values. For CONSTR, lower IGD values and almost equal ς values demonstrate the advantage of the proposed algorithm in the diversity of the optimal solutions. For OSY, both the IGD and ς box plots show the superiority of the proposed algorithm in diversity and convergence, which agrees with the visual results.
Figure 6.4 Box plots of performance metrics for BNH, SRN, TNK, CONSTR, OSY, and welded beam.
It can be seen from the box plots of Sm that the optimal solutions from the three compared algorithms have similar uniformity on these test problems.
Fig. 6.5 shows the simulation results obtained from the three algorithms on the group 1 test problems. Both problems have continuous PFs, and the shaded regions in the figure are the feasible objective spaces. From the simulation results with the best IGD values alone, we can only conclude that the three algorithms have comparable performance on these problems; to make a further comparison, box plots of the performance metrics on the group 1 test problems are shown in Fig. 6.6. According to the box plots on CTP1, the proposed algorithm clearly has better convergence and diversity than the other two algorithms. The feasible objective space of CTP6 has a banded distribution, so it is easy to be trapped in a local optimum on this problem. The IGD box plots of CTP6 show that the proposed algorithm can converge to the true Pareto front, while NSGA-II is usually trapped in a local optimum and Woldesenbet's algorithm has worse convergence than the proposed algorithm. As can be seen from the box plots of Sm, the solutions from the three compared algorithms have similar uniformity on CTP1 and CTP6.
Fig. 6.7 shows the simulation results obtained from the three algorithms on the group 2 test problems, from which we can clearly see that the group 2 problems have disconnected PFs. Comparing the three algorithms on CTP2, no clear superiority can be observed. For CTP7, the banded distribution of the feasible objective space makes it difficult to find the overall PF; from Fig. 6.7, we can see that the proposed algorithm and Woldesenbet's algorithm give comparable performance, while NSGA-II misses part of the disconnected optimal solutions. For CTP8, the feasible objective space is distributed in blocks, so it is easy both to miss part of the PF and to be trapped in a local optimum. However, the difference among the three algorithms cannot be seen from the simulation results with the best IGD values on CTP8 alone, since all the algorithms performed well.
Figure 6.5 Simulation results of the three algorithms on CTP1 and CTP6.
Figure 6.6 Box plots of performance metrics for CTP1 and CTP6.
Figure 6.7 Simulation results of the three algorithms on CTP2, CTP7, and CTP8.
Fig. 6.8 shows the box plots of the performance metrics on the group 2 test problems, from which the superiority of the proposed algorithm can be seen. Lower IGD values and better ς values on CTP2 show a slight advantage for the proposed algorithm. For CTP7 and CTP8, the superiority of the proposed algorithm is obvious: the high IGD values and low ς values of the other two algorithms reveal their disadvantage in the convergence and diversity of nondominated solutions. The proposed algorithm strictly dominates the other two algorithms on CTP7 and CTP8, since ς(Pro, NSGA-II) ≈ 1 and ς(NSGA-II, Pro) ≈ 0, which shows that the proposed algorithm can find more nondominated solutions on or near the true PF.
Figure 6.8 Box plots of performance metrics for CTP2, CTP7, and CTP8.
As mentioned in Ref. [25], it is not suitable to measure the diversity performance of an algorithm on the group 2 test problems because of the properties of their PFs, so the number of disconnected regions found is used to evaluate it instead. From Table 6.2, it can be seen that the proposed algorithm has a slight advantage over the other two algorithms on CTP2 and CTP7; it finds all the disconnected regions in each of the 30 runs, which confirms that it can obtain well-distributed and convergent solutions. All the disconnected regions of CTP8 are also found by the proposed algorithm, while NSGA-II and Woldesenbet's algorithm are easily trapped in local optima and thus cannot find the correct PF. As shown in Fig. 6.9, an infeasible tunnel has to be traversed to reach the discrete Pareto optimal points at the end of the feasible tunnel for the group 3 test problems; the narrower and longer the tunnel is, the more difficult the search becomes. The optimal solutions obtained from the proposed algorithm have better convergence and diversity than those of the other two algorithms on the group 3 test problems, especially CTP4. On CTP5, the Pareto optimal solutions found by the proposed algorithm are more accurate and complete than those of the other two algorithms, but a discrete point near f1 = 0 is still missed.
Figure 6.9 Simulation results of the three algorithms on CTP3, CTP4, and CTP5.
Fig. 6.10 shows the box plots of the performance metrics on the group 3 test problems, from which the superiority of the proposed algorithm can again be seen. Lower IGD values and better ς values indicate that the proposed algorithm has better convergence and diversity on these problems than the other two algorithms. Especially for CTP4, the box plots of the ς values demonstrate its capacity for searching discrete points. As mentioned above, it is not reasonable to calculate the uniformity for problems CTP3, CTP4, and CTP5, so the number of discrete points found by each algorithm is reported instead of box plots of Sm. In Table 6.3, we can observe that the number of discrete points found by the proposed algorithm is greater than that found by the other algorithms, which demonstrates its effectiveness. It is worth noticing that the PF of CTP5 consists of a disconnected region and a set of discrete points, but only the set of discrete points is counted in Table 6.3. Bold values represent the best results.
Figure 6.10 Box plots of performance metrics for CTP3, CTP4, and CTP5.
Table 6.2: Statistics of the number of disconnected regions found by the three algorithms on group 2 test problems.

Test problem | Algorithm | Mean | S.D.
CTP2 | Proposed algorithm | 13 | 0
CTP2 | NSGA-II | 12.5 | 0.776819
CTP2 | Woldesenbet's algorithm | 12.46667 | 0.776079
CTP7 | Proposed algorithm | 7 | 0
CTP7 | NSGA-II | 6.066667 | 0.253708
CTP7 | Woldesenbet's algorithm | 6.433333 | 0.568321
CTP8 | Proposed algorithm | 3 | 0
CTP8 | NSGA-II | 0.433333 | 1.04004
CTP8 | Woldesenbet's algorithm | 0.966667 | 1.351457
Table 6.3: Statistics of the number of discrete points found by the three algorithms on group 3 test problems.

Test problem | Algorithm | Mean | S.D.
CTP3 | Proposed algorithm | 12.76667 | 0.504007
CTP3 | NSGA-II | 10.8 | 2.265179
CTP3 | Woldesenbet's algorithm | 11 | 1.618854
CTP4 | Proposed algorithm | 11.6 | 1.220514
CTP4 | NSGA-II | 7.666667 | 2.170862
CTP4 | Woldesenbet's algorithm | 8.8 | 1.689726
CTP5 | Proposed algorithm | 13.06667 | 1.142693
CTP5 | NSGA-II | 12.26667 | 2.887946
CTP5 | Woldesenbet's algorithm | 12.56667 | 2.095699
6.5.2 The experiments of MOEA on clustering learning and classification learning
In this section, detailed comparison experiments are presented for MOASCC against MSCC [71], SVM [72], RBFNN [73], MOCK [31], and semi-MOCK [74]. MSCC is a simultaneous clustering and classification learning algorithm based on MOPSO. SVM is a state-of-the-art classifier. RBFNN is a radial basis function neural network model that handles clustering and classification sequentially. MOCK and semi-MOCK are unsupervised and semisupervised multiobjective evolutionary clustering algorithms, respectively. These algorithms are first tested on synthetic datasets to show the efficiency of MOASCC. In order to give a further analysis of MOASCC, experiments are then conducted on real-life datasets, covering the parameter analysis, the benefit of MOEA, the convergence of MOASCC, and the comparison results on these datasets.
6.5.2.1 Experiment setup
For a dataset of size N, we randomly select N/2 samples as training samples and use the rest as test samples for all the supervised and semisupervised learning algorithms. The nature-inspired algorithms MOASCC, MSCC, MOCK, and semi-MOCK share the same values of pop and gen, which are set to 100 and 50 for the synthetic datasets and to 100 and 100 for the UCI datasets, respectively. The probabilities of crossover (pc) and mutation (pm) in MOASCC, MOCK, and semi-MOCK are set to 0.7 and 0.3, respectively. For MSCC and RBFNN, the number of clusters K ranges from C to Cmax, where C is the true number of classes and Cmax is set to √N according to Ref. [71]. λ is the scale factor of the Gaussian kernel function, with λ ∈ {0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15}. All combinations of K and λ are tested over 30 independent runs, and the combination with the best classification accuracy is reported in the experiments. In SVM, K is set to the real number of clusters, the regularization parameter is selected from {2^-1, 2^0, 2^3, 2^5, 2^7, 2^9}, and the scale factor λ ∈ {0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15}; the combination of parameters with the best classification accuracy is used for the prediction task, as sketched below.
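The following is a minimal sketch of this grid-search-style parameter selection; run_algorithm is a hypothetical placeholder for a single train/test run of MSCC or RBFNN that returns the test accuracy, and the grids follow the settings described above.

```python
import itertools
import numpy as np

def select_parameters(run_algorithm, C, N, n_runs=30):
    """Hypothetical sketch: try every (K, lambda) combination on the grids
    above, average the accuracy over n_runs independent runs, and keep the
    best-performing combination."""
    K_grid = range(C, int(np.sqrt(N)) + 1)                    # K from C to sqrt(N)
    lambda_grid = [0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15]
    best_acc, best_params = -1.0, None
    for K, lam in itertools.product(K_grid, lambda_grid):
        acc = np.mean([run_algorithm(K, lam) for _ in range(n_runs)])
        if acc > best_acc:
            best_acc, best_params = acc, (K, lam)
    return best_acc, best_params
```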
6.5.2.2 Experiment on synthetic datasets
This experiment adopts four synthetic datasets with different structures, ASD_11_2, 2moons, eyes, and spiral (see Fig. 6.11), to test the effectiveness of MOASCC by comparing it with MSCC, SVM, RBFNN, MOCK, and semi-MOCK. ASD_11_2 consists of 515 two-dimensional samples distributed in 11 spherically shaped clusters. 2moons consists of 200 two-dimensional samples distributed in 2 moon-shaped clusters. Eyes is a synthetic dataset of 238 two-dimensional samples with 1 ring-shaped cluster and 2 square-shaped clusters. Spiral consists of 1000 two-dimensional samples distributed in 2 spiral-shaped clusters. The classification results on these four synthetic datasets obtained from MOASCC, MSCC, SVM, RBFNN, MOCK, and semi-MOCK are shown in Table 6.4, where the classification accuracy over 30 runs is written in the form mean accuracy (standard deviation); K is determined adaptively in MOASCC, MOCK, and semi-MOCK, while it is specified before execution in MSCC, SVM, and RBFNN. As can be seen from this table, MOASCC obtains better classification results than MSCC and RBFNN, especially on the spiral dataset, on which all the samples are classified into the correct category by MOASCC. Notice that in SVM the scale factor λ takes multiple values that achieve the best classification performance. MOASCC, SVM, and semi-MOCK achieve neck-and-neck performance, so a further comparison on real-life datasets is needed.
Figure 6.11 Synthetic dataset ASD_11_2 (A), 2moons (B), eyes (C), and spiral (D).
Table 6.4: Parameter setting and classification accuracy on synthetic datasets (a dash indicates that the parameter is not used).

Dataset | Algorithm | K | λ | Maximum accuracy (%) | Mean accuracy (standard deviation) (%)
ASD_11_2 | MOASCC | 11 | - | 100 | 100 (0)
ASD_11_2 | MSCC | 11 | 0.01 | 99.42 | 95.71 (0.69)
ASD_11_2 | SVM | 11 | 0.05, 0.1, 0.5, 1, 5, 10, 15 | 100 | 100 (0)
ASD_11_2 | RBFNN | 16 | 1 | 100 | 98.91 (1.12)
ASD_11_2 | MOCK | 9, 10, 11 | - | 96.12 | 88.71 (1.99)
ASD_11_2 | Semi-MOCK | 11 | - | 100 | 100 (0)
2moons | MOASCC | 2 | - | 100 | 100 (0)
2moons | MSCC | 10 | 0.001 | 100 | 99.15 (0.41)
2moons | SVM | 2 | 1, 5, 10, 15 | 100 | 100 (0)
2moons | RBFNN | 10 | 1 | 100 | 99.07 (0.73)
2moons | MOCK | 2, 4, 5 | - | 68.50 | 65.20 (1.44)
2moons | Semi-MOCK | 2, 3 | - | 100 | 100 (0)
Eyes | MOASCC | 3 | - | 100 | 100 (0)
Eyes | MSCC | 10 | 0.001 | 99.16 | 98.49 (1.24)
Eyes | SVM | 2 | 0.1, 0.5, 1, 5, 15 | 100 | 100 (0)
Eyes | RBFNN | 11 | 1 | 100 | 98.71 (1.08)
Eyes | MOCK | 2, 3 | - | 84.03 | 77.77 (2.44)
Eyes | Semi-MOCK | 3, 4 | - | 100 | 98.71 (1.08)
Spiral | MOASCC | 2 | - | 100 | 100 (0)
Spiral | MSCC | 22 | 0.001 | 87.10 | 85.36 (2.5)
Spiral | SVM | 2 | 5, 10, 15 | 100 | 100 (0)
Spiral | RBFNN | 22 | 1 | 97.10 | 94.75 (3.44)
Spiral | MOCK | 4, 5 | - | 100 | 98.92 (3.44)
Spiral | Semi-MOCK | 2 | - | 100 | 100 (0)
Taking dataset ASD_11_2 as an example, we analyze the relation matrices and the classification accuracies obtained from MOASCC and MSCC, shown in Table 6.5. MOASCC is not compared with SVM, RBFNN, MOCK, and semi-MOCK here, since SVM, MOCK, and semi-MOCK do not have a relation matrix and the relation matrix in RBFNN does not have an intuitive meaning. Each row vector of the relation matrix satisfies \sum_{m=1}^{M} p(w_m | c_j) = 1 and shows the distribution of the samples of cluster c_j over the classes. Taking the relation matrix obtained from MSCC as an example, p(w3|c2) and p(w9|c2) are nonzero entries, which indicates that the samples in cluster c2 are distributed over classes w3 and w9. If there exists a value p(wm|cj) = 1, then all the training samples in that cluster have the same class label. When all the nonzero values equal 1, as in the relation matrix of MOASCC, the underlying structure of the given dataset is correctly detected by the clustering learning.
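A minimal sketch of how such a relation matrix can be estimated from cluster assignments and class labels; the function name and the integer-encoded inputs are assumptions made for illustration.

```python
import numpy as np

def relation_matrix(cluster_ids, class_labels, n_clusters, n_classes):
    """Estimate P[j, m] = p(w_m | c_j): the fraction of training samples in
    cluster c_j that carry class label w_m, so every row sums to 1."""
    P = np.zeros((n_clusters, n_classes))
    for j, m in zip(cluster_ids, class_labels):
        P[j, m] += 1
    row_sums = P.sum(axis=1, keepdims=True)
    return P / np.maximum(row_sums, 1)   # guard against empty clusters
```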
Table 6.5: Relation matrices obtained from MOASCC and MSCC on ASD_11_2. The relation matrix obtained from MOASCC contains only 0/1 entries, with exactly one nonzero entry (equal to 1) in each row, and corresponds to a classification accuracy of 100%. The relation matrix obtained from MSCC contains several fractional entries (such as 0.82, 0.93, 0.07, and 0.09) and corresponds to a classification accuracy of 99.42%.
In this table, we can see that MOASCC works better than MSCC, and its relation matrix shows a clearer relationship between the clusters and the given classes.
6.5.2.3 Experiment on real-life datasets
In this section, 19 real-life datasets from the University of California at Irvine (UCI) Machine Learning Repository [75] are selected to test the efficiency and accuracy of MOASCC. Four datasets, glass, vowel, ecoli, and lung_cancer, are selected as examples for further analysis of MOASCC, including parameter analysis, the benefit of MOEA, and the convergence of MOASCC.
Fig. 6.12 gives an analysis of the parameters pc (with pm = 1 - pc) and L in MOASCC, using four UCI datasets as examples in each experiment; the number of samples, attributes, and categories of each dataset is written in the form dataset (samples × attributes × categories). According to Fig. 6.12A, the classification accuracy is not sensitive to the value of pc. Since MOASCC performs slightly better with pc = 0.7 than with the other values, pc = 0.7 is used in the experiments. It is noticed that the objective function values of MOASCC are related to the parameter L: usually, the higher L is, the higher the clustering objective function value. As recommended in Ref. [31], L ∈ {5, ..., 10}; Fig. 6.12B gives an experimental analysis of the effect of L. The classification accuracy is not sensitive to the value of L either, so a consistent value of 10 is chosen for all the datasets.
Next, we give a brief discussion of the effect of multiobjective optimization. In order to see how multiobjective optimization affects the classification result, an experiment on the effect of single-objective and multiobjective optimization was carried out; the results are shown in Fig. 6.13.
Figure 6.12 Parameter analysis [glass (214 × 9 × 6), vowel (528 × 10 × 11), ecoli (336 × 7 × 8), lung_cancer (32 × 56 × 3)].
Figure 6.13 Number of clusters obtained during the optimization process on dataset glass, vowel, ecoli, and lung_cancer.
"MOASCC + strategy1" represents the algorithm that replaces the objective functions of MOASCC with a single objective function (the classification error rate), "MOASCC + strategy2" is MOASCC with the initialization strategy adopted in MOCK, and "MOASCC + strategy3" is "MOASCC + strategy1" with the initialization strategy adopted in MOCK. MOCK adopts two schemes to generate the initial individuals: half of them are derived from the MST, and the rest are generated by k-means (these solutions are then converted to MST-based individuals). Taking the UCI datasets glass, vowel, ecoli, and lung_cancer as examples, Fig. 6.13 clearly shows that multiobjective optimization makes the number of clusters decrease or increase toward a value close to the real number of clusters, no matter which initialization scheme is used. With single-objective optimization, the number of clusters is easily affected by the initialization scheme, because the quality of the clustering cannot be guaranteed without a clustering objective function. Moreover, the number of clusters has little to do with the initialization strategy, which is also the reason why a simpler initialization scheme is used in MOASCC.
To show the experimental results intuitively, Fig. 6.14 gives the Pareto fronts obtained from MOASCC on datasets glass, vowel, ecoli, and lung_cancer. For a coordinate Si(x, y) in this figure, x is the number of clusters and y is the classification accuracy.
Figure 6.14 Pareto front obtained from MOASCC on datasets Glass, Vowel, Ecoli, and Lung_cancer.
The symbol "o" marked in red represents the Pareto optimal solution with the best ARI value. From Fig. 6.14, we can see that MOASCC is able to obtain a set of solutions with different numbers of clusters, and the solution with a relatively low classification error rate on the training samples also gives a high accuracy on the test samples. Note that on dataset vowel, MOASCC obtained the optimal solution; however, finding the optimal solution for all the tested datasets remains a difficult task. Another observation is that there are no solutions whose number of clusters is far larger than the real number of classes, because such solutions are dominated by the Pareto solutions during the evolutionary process.
In order to verify the convergence of MOASCC, an intuitive experiment is shown in Fig. 6.15, which shows how the classification accuracies obtained from MOASCC, MSCC, and semi-MOCK change during the evolutionary process on datasets glass, vowel, ecoli, and lung_cancer. In this experiment, gen is set to 100, and the classification accuracy is calculated every five generations from the first generation to the 100th generation, except that the first interval is set to 4. The results show that the classification accuracies of all the algorithms increase with the generation count in the early stage and then converge to a stable state in the later stage. This indicates that the Pareto optimal solutions are superior to the dominated solutions and rules out the possibility of overtraining. In the later stage, MOASCC achieves a relatively higher classification accuracy than MSCC and semi-MOCK except on lung_cancer, which indicates that the two objective functions in MOASCC are reasonable and efficient in solving classification problems.
Figure 6.15 The classification accuracies of MOASCC, MSCC, and semi-MOCK obtained from different generations on datasets glass, vowel, ecoli, and lung_cancer.
Moreover, MOASCC starts to converge from about the 40th generation, so it seems suitable to set gen = 100 in the algorithm. In conclusion, this experiment proves not only the convergence of MOASCC but also its efficiency.
As a combination of clustering and classification, the performance of MOASCC usually relies on the following aspects: (1) the difficulty of the given dataset; (2) the clustering ability of the clustering scheme; and (3) the effect of the cooperation between clustering and classification. In order to discuss these issues, the overall comparison of MOASCC against the other algorithms on the UCI datasets is presented in this subsection. Table 6.6 shows the detailed results obtained from MOASCC, MSCC, SVM, RBFNN, MOCK, and semi-MOCK; in the original table, the best or comparable classification results are marked in bold. First, we compare the two simultaneous clustering and classification algorithms, MOASCC and MSCC. MOASCC adopts a locus-based adjacency representation encoding scheme so that the number of clusters can be determined adaptively, which reduces the time spent searching for the best combination of the parameters K and λ. MOASCC achieves better performance than MSCC on all the datasets, especially glass, sonar, vowel, lung_cancer, pima_indians_diabetes, bupa, vehicle, and ecoli.
Table 6.6: The experimental results obtained from MOASCC, MSCC, SVM, RBFNN, MOCK, and semi-MOCK on real-life datasets [the classification accuracy over 30 runs is written in the form mean (standard deviation); K is the number of clusters and λ the Gaussian kernel scale factor, where applicable; datasets are written as name (#samples × #dimensions × #classes)].

Wine (178 × 13 × 3): MOASCC 97.81 (0.58), K 3-5; MSCC 95.79 (1.50), K 3, λ 0.001; SVM 98.05 (0.87), λ 1; RBFNN 97.55 (0.88), K 6, λ 1; MOCK 68.65 (3.62), K 3, 4; semi-MOCK 97.36 (0.55), K 3, 4
Glass (214 × 9 × 6): MOASCC 79.37 (2.99), K 5-7; MSCC 65.98 (2.67), K 20, λ 0.01; SVM 77.54 (2.40), λ 0.1; RBFNN 52.18 (5.38), K 6, λ 0.05; MOCK 44.16 (4.46), K 5-8; semi-MOCK 66.05 (2.40), K 5-7
Lenses (24 × 4 × 3): MOASCC 100 (0), K 3; MSCC 87.29 (10.60), K 3, λ 0.1; SVM 94.86 (7.56), λ 15; RBFNN 97.64 (4.05), K 6, λ 0.05; MOCK 68.96 (3.94), K 3, 4; semi-MOCK 100 (0), K 3
Iris (150 × 4 × 3): MOASCC 97.13 (0.32), K 3, 4; MSCC 96.63 (1.43), K 3, λ 0.01; SVM 96.93 (0.73), λ 0.1; RBFNN 97.07 (0.90), K 9, λ 5; MOCK 90.10 (0.73), K 3, 4; semi-MOCK 97.73 (0.34), K 3, 4
Wdbc (569 × 30 × 2): MOASCC 97.71 (0.44), K 2; MSCC 94.38 (1.52), K 2, λ 0.05; SVM 94.80 (0.28), λ 10; RBFNN 94.69 (1.11), K 4, λ 1; MOCK 94.52 (0.12), K 2, 3; semi-MOCK 96.70 (0.58), K 2, 3
Heart disease (270 × 13 × 2): MOASCC 87.83 (0.39), K 2-4; MSCC 83.44 (1.52), K 2, λ 0.01; SVM 86.22 (1.42), λ 5; RBFNN 84.89 (1.38), K 12, λ 0.1; MOCK 80.91 (0.52), K 3-6; semi-MOCK 82.93 (0.41), K 3-6
Soybean (small) (47 × 35 × 4): MOASCC 100 (0), K 4; MSCC 86.27 (9.45), K 4, λ 0.1; SVM 76.88 (8.56), λ 15; RBFNN 72.27 (19.10), K 4, λ 5; MOCK 42.91 (6.78), K 4; semi-MOCK 100 (0), K 4
Balance scale (625 × 4 × 3): MOASCC 89.41 (0.67), K 3; MSCC 89.48 (1.32), K 15, λ 1; SVM 92.97 (0.97), λ 0.1; RBFNN 91.26 (0.63), K 15, λ 0.01; MOCK 54.88 (5.23), K 3, 4; semi-MOCK 72.74 (4.02), K 3, 4
Sonar (208 × 60 × 2): MOASCC 88.13 (2.45), K 2-4; MSCC 67.90 (4.49), K 9, λ 0.001; SVM 86.70 (2.39), λ 15; RBFNN 71.74 (5.97), K 6, λ 0.01; MOCK 56.88 (1.52), K 3-5; semi-MOCK 75.75 (1.74), K 2-4
Vowel (528 × 10 × 11): MOASCC 99.05 (0), K 11-14; MSCC 40.83 (2.27), K 16, λ 0.001; SVM 93.21 (0.42), λ 0.1; RBFNN 46.35 (8.32), K 11, λ 0.5; MOCK 48.61 (4.95), K 9-11; semi-MOCK 56.41 (1.73), K 9-12
Thyroid (215 × 5 × 3): MOASCC 97.10 (1.12), K 3-5; MSCC 95.70 (1.92), K 11, λ 0.1; SVM 96.09 (1.45), λ 0.1; RBFNN 91.06 (1.78), K 5, λ 10; MOCK 73.53 (2.77), K 3, 4; semi-MOCK 96.00 (0.61), K 3, 4
Lung_cancer (32 × 56 × 3): MOASCC 67.85 (2.74), K 2-6; MSCC 48.13 (8.32), K 3, λ 0.001; SVM 66.98 (5.27), λ 0.01; RBFNN 53.64 (6.09), K 4, λ 0.01; MOCK 48.44 (1.90), K 3-5; semi-MOCK 75.16 (3.58), K 3-5
Pima Indians diabetes (768 × 8 × 2): MOASCC 79.69 (0.88), K 2; MSCC 75.50 (1.86), K 2, λ 0.05; SVM 83.51 (1.33), λ 0.1; RBFNN 77.32 (0.74), K 9, λ 0.01; MOCK 68.07 (2.50), K 2, 3; semi-MOCK 74.40 (0.83), K 2, 3
Bupa (345 × 6 × 2): MOASCC 72.90 (1.31), K 2; MSCC 64.19 (2.60), K 4, λ 0.1; SVM 81.19 (2.06), λ 0.1; RBFNN 69.20 (2.17), K 8, λ 0.01; MOCK 58.46 (0.49), K 2-4; semi-MOCK 64.52 (1.74), K 2
Vote (435 × 26 × 2): MOASCC 95.27 (0.67), K 2; MSCC 92.22 (1.56), K 3, λ 0.001; SVM 92.76 (0.79), λ 10; RBFNN 94.52 (0.61), K 14, λ 0.05; MOCK 65.08 (0.92), K 2-4; semi-MOCK 91.53 (0.55), K 3
Vehicle (846 × 18 × 4): MOASCC 83.74 (2.97), K 4-6; MSCC 45.42 (4.60), K 7, λ 1; SVM 82.27 (1.74), λ 0.5; RBFNN 44.67 (8.31), K 6, λ 0.5; MOCK 44.49 (2.98), K 4-6; semi-MOCK 56.35 (1.71), K 4, 5
Ecoli (336 × 7 × 8): MOASCC 89.97 (1.12), K 5-8; MSCC 79.20 (3.23), K 12, λ 0.05; SVM 87.68 (1.45), λ 0.5; RBFNN 76.39 (11.82), K 8, λ 15; MOCK 64.06 (0.65), K 6-11; semi-MOCK 85.74 (1.58), K 5-8
Image segmentation (2310 × 19 × 7): MOASCC 97.82 (0.36), K 7-9; MSCC 85.66 (2.26), K 7, λ 0.1; SVM 95.87 (0.40), λ 0.5; RBFNN 62.18 (9.16), K 7, λ 15; MOCK 57.84 (4.95), K 7, 8; semi-MOCK 83.30 (2.57), K 7, 8
Waveform (5000 × 21 × 3): MOASCC 88.51 (0.32), K 3, 4; MSCC 81.39 (2.91), K 50, λ 0.01; SVM 87.63 (0.92), λ 0.5; RBFNN 86.78 (0.26), K 17, λ 15; MOCK 69.21 (2.35), K 3-5; semi-MOCK 85.33 (0.56), K 3, 4
On the large-scale datasets image segmentation and waveform, MOASCC also shows its efficiency. Second, we compare MOASCC with another hybrid clustering and classification model, RBFNN. MOASCC is superior to RBFNN on most of the real-life datasets; they obtain comparable results only on wine, lenses, iris, and vote. The better performance of MOASCC over RBFNN comes from the effective cooperation between clustering and classification. Third, the state-of-the-art classifier SVM is compared to see whether simultaneous clustering and classification can enhance classification performance. The experiment shows that MOASCC is better than SVM on most datasets except wine, balance_scale, pima indians diabetes, and bupa. Finally, MOASCC is compared with the multiobjective clustering algorithms MOCK and semi-MOCK; although they use the same representation scheme, MOASCC still shows its superiority on most of the UCI datasets.
From Table 6.6, we can draw two conclusions: (1) the value of K determined adaptively by MOASCC is close to the true number of clusters; and (2) MOASCC can improve the performance of both clustering and classification. In real life, many datasets are difficult to deal with, and for a simultaneous clustering and classification algorithm, the clustering result usually has a great effect on the classification performance. However, the comparison between MOASCC and the other clustering/classification algorithms in Table 6.6 indicates that clustering and classification can benefit from their cooperation. On the one hand, MOASCC is not strict about the underlying structure of the given dataset, thanks to its clustering objective function, which is demonstrated by the comparison between MOASCC and MSCC. On the other hand, MOASCC is based on multiobjective optimization, which demands that only individuals with both better clustering quality and better classification quality can replace the original individuals. Moreover, a mutation operator driven by feedback from the classification is designed to guide the search, and this scheme also improves the clustering performance. Unfortunately, many features and attributes of real-life datasets are redundant, noisy, or irrelevant to the clustering and classification task, and it is difficult for most clustering and classification algorithms, and even for ensemble algorithms, to handle such a task. According to [76,77], feature selection can be applied to clustering and classification to improve the performance of data mining; using multiobjective optimization for subspace learning is one of our current efforts.
To further analyze the effect of different MOEAs on MOASCC, three state-of-the-art MOEAs, MOEA/D [38], SPEA2 [36], and NSGA-II [35], are selected to carry out an experiment (see Fig. 6.16). In this experiment, MOEA/D, SPEA2, and NSGA-II share the same values of the parameters pop, gen, pc, and L. For the remaining parameters, T (the number of weight vectors in the neighborhood of each weight vector) in MOEA/D and the archive size in SPEA2 are set to 20 and 100, respectively.
Figure 6.16 The classification results obtained from three state-of-the-art multiobjective evolutionary algorithms: MOEA/D, SPEA2, and NSGA-II.
Fig. 6.16 shows that MOEA/D, SPEA2, and NSGA-II have similar performance on most of the tested datasets. Since these algorithms adopt different nondominated-solution reservation strategies, each gains its own advantage on different datasets. This experiment also demonstrates the efficiency of MOEAs in solving clustering/classification problems.
6.5.3 The experiments of MOEA on sparse spectral clustering
The experiments are mainly carried out on the basis of NSGA-II, and this section is divided into two parts. The first part presents a detailed analysis of SRMOSC based on NSGA-II; five experiments are carried out, covering the parameter setting, the sparsity of the Pareto optimal solutions, the effectiveness of the final solution selection strategy, the proposed initialization, crossover, and mutation schemes, and the benefit of MOEAs in solving spectral clustering. The second part gives experimental results on real-life datasets, where the proposed algorithms based on NSGA-II and MOEA/D are compared with four other similarity matrix construction methods and two multiobjective clustering algorithms, covering both unsupervised and semisupervised clustering. The four commonly used similarity matrix construction methods discussed in [47,78-81] are used for comparison. In addition, multiobjective clustering with automatic
k-determination (MOCK) and the multiobjective genetic algorithm optimizing p and sep [MOGA(p, sep)] are also compared. MOCK [31] is a graph-representation-based multiobjective clustering algorithm, which uses overall deviation and connectivity as objective functions to reflect cluster compactness and connectedness, respectively. MOGA(p, sep) [82] is a prototype-representation-based multiobjective fuzzy clustering algorithm. Since supervised classification datasets are used in the experiments, the number of clusters in all the algorithms is fixed to the number of classes, the clustering accuracy is measured as the percentage of instances that are correctly classified, and the clustering result with the highest accuracy is considered the best result. The parameters pop, gen, pc, and pm are set to 50, 50, 0.7, and 0.3, respectively, for SRMOSC, MOCK, and MOGA(p, sep). When constructing the similarity matrix using the fully connected, kNN, and mutual kNN construction methods [79-81], the Gaussian kernel K(x, y) = exp(-||x - y||^2 / (2σ^2)) is adopted to calculate the similarity. We carry out experiments with the values {0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 15} for σ, choosing the one that gives the best clustering result as the final value. For the kNN and mutual kNN methods, k is set to log(N). For the ε-neighborhood method [47,79], ε is set to the value with the best clustering result from {0.2, 0.3, 0.4, 0.5, 0.6}.
6.5.3.1 Detailed analysis of SRMOSC
In order to evaluate the effectiveness of SRMOSC, a detailed analysis is given in this section.
1. Parameter analysis: Fig. 6.17 gives a detailed analysis of the parameter settings, carried out on dataset wine for illustration. Two observations can be made from this figure: (1) SRMOSC is not sensitive to the parameter pc, especially when pop ≥ 40, and (2) when gen ≥ 20, SRMOSC converges to a stable state. Taking the stability and time complexity of SRMOSC into consideration, the parameters pop, gen, pc, and pm are set to 50, 50, 0.7, and 0.3, respectively.
2. Sparsity of the Pareto optimal solutions: This experiment examines the sparsity of the Pareto optimal solutions and whether they can exactly describe the relationship among samples. The UCI dataset [83] wine is used for illustration because the relationship between the different clusters is very clear. Dataset wine has 178 samples, 13 attributes, and three categories, with samples 1-59 belonging to category "1," samples 60-130 belonging to category "2," and the rest to category "3." In Fig. 6.18, the sparse matrices corresponding to some Pareto optimal solutions found by one run of SRMOSC, including the solution with the best ratio cut value, are shown visually to see to what extent the sparse matrices can reveal the relationship among different clusters. All the nonzero entries in x are represented with black pixels; the weights are not shown in order to obtain a clearer picture. The sparse matrices in Fig. 6.18 have an obvious property: most of the nonzero entries are distributed within the same cluster and rarely between different clusters, no matter how sparse the solution is. In this sense, they can reveal the relationship among samples.
Figure 6.17 Parameter analysis. (A) Effect of the parameters pc and pop on the clustering accuracy (gen is fixed to 200). (B) Convergence of SRMOSC when pop is set to different values (pc is set to 0.7, gen runs from 1 to 200, and the results are recorded every five generations except that the first interval is 4). All the results are the average clustering accuracies obtained from 20 independent runs, and pm = 1 - pc.
Figure 6.18 Visualization of sparse matrices in the PF. The solutions marked with open symbols are Pareto optimal solutions obtained from one run of SRMOSC, and the one marked in red is the best ratio cut solution. Five Pareto optimal solutions are selected and the corresponding sparse matrices and clustering accuracies are shown: (A) 59.55%, (B) 95.51%, (C) 96.07%, (D) 94.94%, (E) 95.51%.
In order to further evaluate the effect of the sparse matrix in spectral clustering, a visualization of the similarity matrices, the corresponding eigenvalues, and the eigenvectors constructed with SRMOSC and several conventional methods is shown in Fig. 6.19. The first row of Fig. 6.19 shows the similarity matrix with the best ratio cut value obtained from SRMOSC, and the results of the other methods are shown in the remaining rows. Note that in the case of the fully connected construction method, the similarity matrix is not sparse. Unlike Fig. 6.18, the similarity matrices here are symmetric and the weights are taken into account; the maximum and minimum weights are represented with white and black pixels, respectively, for visualization purposes. The similarity matrix obtained from SRMOSC has the following properties: (1) the number of nonzero entries is quite low compared with the number of zero entries; (2) the nonzero entries are mostly distributed as intraclass connections, which means they provide more discriminative information for clustering; (3) the nonzero entries distributed as interclass connections, when they exist, are much smaller than the intraclass ones; and (4) the values of the nonzero entries are quite diverse compared with those obtained from the other methods: the visualization of SRMOSC shows a high variance of gray levels, while most nonzero pixels in the other graphs share similar gray levels. These four properties demonstrate that the similarity matrix obtained from SRMOSC can reveal the relationship between samples more clearly than the other methods. As shown in Algorithm 6.4, the eigenvectors obtained from the Laplacian of the similarity matrix are ultimately responsible for the clustering result. In the cases of SRMOSC (Fig. 6.19A) and kNN (Fig. 6.19C), eigenvector 1 cannot provide exact discriminating information for the clustering task, but eigenvectors 2 and 3 can mostly classify the samples into the three different clusters with k-means. In the cases of Fig. 6.19B, D, and E, it is obviously much harder to divide the samples into different clusters exactly using the eigenvectors of the fully connected, mutual kNN, and ε-neighborhood similarity matrices.
3. Efficiency of the final solution selection method: In this part, the reason SRMOSC adopts the ratio cut as the measurement to select the final solution, and its efficiency, are described. In Fig. 6.20, the relationship between the sparsity ||x||_0, the measurement error ||Ax - A||_2^2, the ratio cut, and the clustering accuracy is shown for several UCI datasets; all the results are from one execution. In order to put the ratio cut, the clustering accuracy, and the objective function measurement error ||Ax - A||_2^2 in one plot, we normalize the objective function values into [0, 1]. Additionally, given that the algorithm tries to select, among all the Pareto solutions, the solution that has the minimal ratio cut value while keeping a high clustering accuracy, we also normalize the ratio cut (RC) values into [0, 1] so that the relationship between the ratio cut and the clustering accuracy can be seen more clearly in Fig. 6.20 (a small sketch of this ratio-cut-based selection is given after Fig. 6.20).
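The following is a minimal, hedged sketch of the unnormalized spectral clustering step referred to above (Laplacian of a similarity matrix, first K eigenvectors, then k-means on their rows). The kNN-plus-Gaussian-kernel construction shown here is one of the conventional baselines used for comparison, not SRMOSC itself, and the function names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def knn_gaussian_similarity(X, k, sigma):
    """Baseline kNN similarity matrix with Gaussian kernel weights."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros_like(d2)
    for i in range(X.shape[0]):
        nbrs = np.argsort(d2[i])[1:k + 1]              # k nearest neighbours, excluding i
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                           # symmetrize

def unnormalized_spectral_clustering(W, n_clusters):
    """L = D - W, first K eigenvectors, then k-means on the rows."""
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :n_clusters]                         # K smallest eigenvalues
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(U)
```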
Figure 6.19 Visualization of similarity matrices (column 1), eigenvalues (column 2), and eigenvectors (columns 3-5) obtained from five different methods: (A) SRMOSC (accuracy 96.07%), (B) fully connected (accuracy 63.48%), (C) kNN (accuracy 96.07%), (D) mutual kNN (accuracy 57.87%), (E) ε-neighborhood (accuracy 62.92%).
Figure 6.20 Relationship between the objective functions, the ratio cut, and the clustering accuracy: (A) wine, (B) glass, (C) heart disease, (D) thyroid, (E) zoo, (F) iris.
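Below is a small sketch of the ratio-cut-based final-solution selection discussed here, assuming the standard ratio cut definition over a similarity matrix; ratio_cut, select_final, and cluster_fn are illustrative names, with cluster_fn standing in for the spectral clustering step applied to each Pareto-optimal similarity matrix.

```python
import numpy as np

def ratio_cut(W, labels):
    """Ratio cut of a clustering under similarity matrix W, assuming the
    standard definition RC = sum_k cut(A_k, V \\ A_k) / |A_k|."""
    labels = np.asarray(labels)
    rc = 0.0
    for c in np.unique(labels):
        in_c = labels == c
        rc += W[in_c][:, ~in_c].sum() / in_c.sum()
    return rc

def select_final(pareto_matrices, cluster_fn):
    """Pick the Pareto-optimal similarity matrix with the smallest ratio cut."""
    scored = [(ratio_cut(W, cluster_fn(W)), W) for W in pareto_matrices]
    return min(scored, key=lambda t: t[0])[1]
```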
In order to see the details clearly, an enlarged view of some of the plots is included in Fig. 6.20. From the nondominated solutions obtained by SRMOSC, we can clearly see that there are no obvious knee regions or knee points; even if B-splines are used to fit the PF, the clustering accuracy of the solutions in the knee region is not stable, so such a criterion cannot be used to select a solution. Given a solution with sparsity ||x||_0 on the PF, the tested cases mostly have the following property: once the sparsity exceeds a certain level, the clustering accuracy does not keep increasing, which increases the difficulty of selecting the final solution. We can see from Fig. 6.20 that the clustering accuracy changes in accordance with the RC values: the solutions with better clustering accuracies usually have better RC values, although not always the best one. Therefore, it is reasonable and effective to use the ratio cut as the selection measurement. In Fig. 6.21, three experiments, namely clustering and semisupervised clustering with 10% and 20% labeled samples, are carried out. In each experiment, we show the accuracies of two solutions selected from the Pareto optimal solutions in 20 runs: the one with the best ratio cut value and the one with the highest accuracy. Two conclusions can be drawn from these box plots: (1) it is appropriate for clustering or semisupervised clustering to use the ratio cut as the measurement to select the final solution, although the result is not always the best, and (2) the method used to extend clustering to semisupervised clustering is effective, since the results are improved with the guidance of labeled data.
4. Effect of the specific evolutionary operators: First, the effect of the designed initialization and mutation schemes is compared against random initialization and mutation schemes. The designed schemes are based on the assumption that a sample prefers to reconstruct itself with its neighbors; by taking the distances between samples into account, this neighbor information reduces the "blind search" in such a huge search space. We compare the clustering accuracy of the proposed schemes and the random schemes to illustrate this, as presented in Fig. 6.22. In this figure, we can see that the proposed schemes, which use neighbor information, significantly outperform the random schemes. In addition, the PFs obtained from the random scheme and the proposed scheme are shown in the supplementary material. To further discuss the benefit of the proposed mutation scheme, a comparison between the expansion of the classic polynomial mutation and the proposed mutation is shown in Fig. 6.23. In the polynomial mutation, each column vector of an individual is taken as a basic unit when executing the classic mutation scheme.
Multiobjective evolutionary algorithm (MOEA)-based sparse clustering
(A) Wine
(D) Thyroid
(B) Glass
(E) Zoo
181
(C) Heart disease
(F) Iris
Figure 6.21 Boxplot of clustering accuracy comparison between the best ratio cut and the best accuracy solutions on PF obtained from 20 runs. “RC” and “best” represent the result of clustering with the best ratio cut value and the best clustering accuracy, respectively, and “RC a%” and “best a%” represent the corresponding result of the semisupervised clustering with a% labeled data.
Figure 6.22 Box plots of the clustering accuracy obtained from the proposed schemes and the random schemes on (A) wine, (B) glass, (C) heart disease, (D) thyroid, (E) zoo, and (F) iris ("proposed" and "random" represent the proposed schemes and the random schemes, respectively).
This comparison shows that the performance of the proposed mutation scheme is slightly better than that of the classic polynomial mutation. Furthermore, the effect of the proposed crossover scheme is discussed. The proposed crossover scheme (Algorithm 6.7) considers two cases, whose effects are shown in Fig. 6.24.
Figure 6.23 Box plots of the clustering accuracy obtained from the proposed mutation and the expansion of the polynomial mutation on (A) wine, (B) glass, (C) heart disease, (D) thyroid, (E) zoo, and (F) iris.
Four crossover schemes are compared in Fig. 6.24: the proposed crossover, the expansion of simulated binary crossover (SBX), case 2 of the proposed crossover (denoted "case2"), and "DE/rand/1" (denoted DE). All the results are obtained from 20 independent executions. Taking the properties of the individuals into account, SBX is carried out on each individual by taking each column vector as a basic element. We can see that the proposed crossover performs better than SBX and DE/rand/1; moreover, using the nondominated solutions to guide the search shows a slight advantage over case 2.
5. Benefit of multiobjective optimization: In SRMOSC, spectral clustering is formulated as the multiobjective optimization problem (6.22). In order to discuss the benefit of MOEA in solving this problem, a single-objective optimization model, formulated as (6.33), is compared in this part.
Figure 6.24 Box plots of the clustering accuracy obtained from the proposed crossover and other crossover schemes on (A) wine, (B) glass, (C) heart disease, (D) thyroid, (E) zoo, and (F) iris.
\min_x \; \|Ax - A\|_2^2 + \gamma \|x\|_0
\text{s.t.} \quad x_{ii} = 0, \quad x_{ij} \in [0, 1]   (6.33)
In Formula (6.33), the most difficult problem is how to select the value of the parameter γ. As Fig. 6.25 suggests, ||Ax - A||_2^2 is much smaller than ||x||_0, which means that γ tends to be a small value (γ > 0, and its upper bound depends on the problem), as confirmed by the experiment shown in Fig. 6.25. Taking the datasets wine and thyroid as examples, we can see that: (1) the best γ values differ between datasets, and it is time consuming to find a suitable value for each problem; a satisfying clustering result cannot be obtained by simply sampling a few γ values and running a few times; and (2) the single-objective model performs worse than SRMOSC (refer to Table 6.7). In the authors' view, adaptive schemes for choosing γ during the optimization process may improve its performance.
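As a small illustration of the difference between the two formulations, the sketch below evaluates the two objectives that SRMOSC treats separately and the γ-weighted scalarization of Eq. (6.33). It assumes the data matrix A stores the samples as columns and x is the N × N reconstruction-coefficient matrix; the function names are illustrative.

```python
import numpy as np

def objectives(A, x):
    """The two objectives traded off by SRMOSC for a candidate coefficient
    matrix x: reconstruction error and sparsity."""
    recon_error = np.linalg.norm(A @ x - A) ** 2   # ||Ax - A||_2^2
    sparsity = np.count_nonzero(x)                 # ||x||_0
    return recon_error, sparsity

def single_objective(A, x, gamma):
    """Weighted scalarization of Eq. (6.33); gamma must be hand-tuned,
    which is the main drawback discussed above."""
    f_err, f_sparse = objectives(A, x)
    return f_err + gamma * f_sparse
```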
Figure 6.25 Clustering accuracies under different γ values on (A) wine and (B) thyroid. For each γ value, the result is the average clustering accuracy of 10 runs.
6.5.3.2 Experimental comparison between SRMOSC and other algorithms
In this section, the experimental comparison among SRMOSC, unnormalized spectral clustering algorithms based on conventional similarity matrix construction methods, and the well-known multiobjective evolutionary clustering algorithms MOCK [31] and MOGA(p, sep) [82] is presented. These experiments are extended to semisupervised clustering, with SRMOSC compared against the other traditional methods and semi-MOCK [74] using 10% and 20% labeled data. Furthermore, SRMOSC is also implemented on the basis of MOEA/D in order to show the flexibility of the proposed framework. Twelve UCI datasets are adopted to test their performance, and all the results are the average of 20 independent runs for each algorithm on all the datasets. The parameter settings of the other algorithms are the same as in the previous experiments. Note that in the MOEA/D-based SRMOSC, T is set to pop - 1, and only case 2 of the proposed crossover scheme can be used. Tables 6.7-6.9 give the experimental results with no labeled data, 10% labeled data, and 20% labeled data, respectively. In these three tables, two experimental comparisons are shown. The first is the comparison among the algorithms that construct a similarity matrix, for which the best result is written in bold in the original tables. The other is the comparison between SRMOSC, MOCK, and MOGA(p, sep), where the result of MOCK or MOGA(p, sep) is marked in bold italics in the original tables if it reaches a better result than SRMOSC. In order to assess the statistical significance of the results, a Kruskal-Wallis statistical test (α = 0.05) is carried out between each algorithm and the one that reaches the best result for each dataset; when an algorithm does not present a statistical difference from the best, its result is marked with the symbol "*" in the tables. Table 6.7 shows the experimental results obtained from all the algorithms. SRMOSC significantly outperforms the fully connected, mutual kNN, and ε-neighborhood similarity
Table 6.7: Clustering accuracy comparison obtained from SRMOSC against other algorithms on real-life datasets (mean ± standard deviation over 20 runs; "*" marks results with no statistically significant difference from the best).

Dataset | SRMOSC (NSGA-II) | SRMOSC (MOEA/D) | Fully connected | kNN | Mutual kNN | ε-neighborhood | MOCK | MOGA(p, sep)
Wine | 95.90 ± 0.93* | 95.54 ± 1.08* | 63.48 ± 0.00 | 96.07 ± 0.00* | 54.61 ± 6.38 | 59.97 ± 6.93 | 68.65 ± 3.62 | 95.03 ± 1.34*
Glass | 62.45 ± 2.23 | 57.78 ± 5.29 | 47.03 ± 0.94 | 56.50 ± 5.12 | 44.01 ± 3.33 | 49.95 ± 1.33 | 44.16 ± 4.46 | 60.65 ± 2.92
Iris | 92.50 ± 2.46 | 91.44 ± 2.82 | 66.80 ± 0.00 | 88.36 ± 2.58 | 53.60 ± 7.46 | 68.00 ± 0.00 | 90.10 ± 0.73 | 91.33 ± 1.65
Wdbc | 94.37 ± 0.63* | 92.07 ± 2.73 | 65.40 ± 2.23 | 94.90 ± 0.00* | 62.74 ± 0.00 | 69.40 ± 3.60 | 94.52 ± 0.12* | 93.87 ± 0.46
Heart disease | 68.54 ± 5.60* | 65.43 ± 6.15 | 56.41 ± 1.21 | 66.30 ± 0.00* | 56.04 ± 1.06 | 56.24 ± 2.08 | 80.91 ± 0.52 | 81.61 ± 4.38
Balance scale | 66.72 ± 1.88 | 69.24 ± 4.61 | 64.74 ± 2.39 | 65.44 ± 1.89 | 61.30 ± 3.98 | 65.41 ± 6.72 | 54.88 ± 5.23 | 75.11 ± 4.25
Vote | 88.28 ± 0.60* | 88.85 ± 0.88* | 63.37 ± 1.54 | 88.05 ± 0.00* | 62.49 ± 1.65 | 77.46 ± 10.62 | 65.08 ± 0.92 | 87.79 ± 0.75
Ecoli | 80.64 ± 2.42 | 80.24 ± 1.53* | 64.43 ± 1.20 | 78.85 ± 2.54 | 68.36 ± 5.22 | 69.80 ± 4.81 | 64.06 ± 0.65 | 80.82 ± 1.08*
Thyroid | 92.84 ± 1.09* | 84.30 ± 4.03 | 73.77 ± 1.98 | 94.18 ± 1.04 | 75.19 ± 3.65 | 71.88 ± 0.56 | 73.53 ± 2.77 | 87.44 ± 2.04*
Zoo | 90.79 ± 2.55 | 87.97 ± 3.98 | 42.67 ± 1.54 | 83.32 ± 8.17 | 59.11 ± 5.56 | 50.40 ± 4.04 | 50.50 ± 0.00 | 88.05 ± 1.83
Image segmentation | 70.22 ± 2.89* | 69.56 ± 3.12* | 37.62 ± 0.00 | 65.26 ± 3.09 | 54.19 ± 4.73 | 53.53 ± 3.53 | 57.84 ± 4.95 | 67.37 ± 3.67
Waveform | 63.90 ± 5.40 | 65.47 ± 5.09 | 63.86 ± 0.00 | 52.04 ± 0.00 | 34.03 ± 0.10 | 40.38 ± 6.09 | 69.21 ± 2.35 | 63.96 ± 5.07
Table 6.8: Semisupervised clustering with 10% labeled data obtained from SRMOSC against other algorithms on real-life datasets.

| Datasets | SRMOSC (NSGA-II) | SRMOSC (MOEA/D) | Fully connected | kNN | Mutual kNN | ε-neighborhood | MOCK |
| Wine (178, 13, 3) | 96.40 ± 1.02* | 96.34 ± 1.06* | 62.39 ± 6.16 | 96.07 ± 1.01* | 58.51 ± 11.75 | 56.57 ± 6.77 | 95.1 ± 1.19 |
| Glass (214, 9, 6) | 63.90 ± 1.76 | 62.38 ± 1.35 | 47.12 ± 2.97 | 60.44 ± 3.67 | 44.02 ± 3.33 | 48.25 ± 2.98 | 60.93 ± 4.43 |
| Iris (150, 4, 3) | 93.53 ± 1.72 | 92.53 ± 2.61 | 69.33 ± 3.67 | 91.93 ± 2.92 | 64.20 ± 3.69 | 68.00 ± 0.00 | 96.80 ± 0.34 |
| Wdbc (569, 30, 2) | 95.46 ± 0.64 | 95.27 ± 0.71 | 65.14 ± 1.81 | 96.40 ± 0.70 | 62.90 ± 0.53 | 67.99 ± 3.35 | 95.73 ± 0.67* |
| Heart disease (270, 13, 2) | 78.04 ± 3.91 | 75.33 ± 7.94 | 58.44 ± 2.17 | 70.89 ± 9.46 | 57.70 ± 4.00 | 59.80 ± 4.87 | 81.56 ± 0.53 |
| Balance scale (625, 4, 3) | 78.22 ± 3.95 | 77.98 ± 3.30 | 82.58 ± 2.09 | 82.04 ± 1.62 | 67.11 ± 6.84 | 81.42 ± 1.76 | 62.86 ± 7.03 |
| Vote (435, 26, 2) | 89.79 ± 0.56* | 89.55 ± 0.99* | 63.91 ± 1.91 | 90.55 ± 0.80* | 66.59 ± 5.72 | 69.77 ± 11.41 | 89.61 ± 1.56* |
| Ecoli (336, 7, 8) | 81.84 ± 2.10* | 81.25 ± 3.35* | 68.05 ± 7.84 | 82.35 ± 3.86* | 76.90 ± 3.17 | 68.10 ± 10.16 | 85.71 ± 1.07 |
| Thyroid (215, 5, 3) | 93.65 ± 1.23* | 92.26 ± 2.96 | 73.51 ± 2.47 | 93.30 ± 2.27* | 78.56 ± 4.95 | 75.35 ± 2.46 | 92.79 ± 2.27* |
| Zoo (101, 16, 7) | 90.50 ± 2.19 | 89.55 ± 2.60 | 42.67 ± 1.54 | 85.05 ± 6.91 | 63.81 ± 4.93 | 52.60 ± 4.02 | 92.08 ± 0.00 |
| Image segmentation (2310, 19, 7) | 81.79 ± 2.53* | 82.11 ± 2.44* | 44.45 ± 5.26 | 79.32 ± 4.25 | 73.76 ± 6.57 | 62.34 ± 7.70 | 78.46 ± 3.55 |
| Waveform (5000, 21, 3) | 75.30 ± 3.24 | 74.89 ± 2.88 | 63.82 ± 0.27 | 69.74 ± 6.13 | 34.12 ± 0.09 | 37.38 ± 1.71 | 79.22 ± 1.24 |
Table 6.9: Semisupervised clustering with 20% labeled data obtained from SRMOSC against other algorithms on real-life datasets.

| Datasets | SRMOSC (NSGA-II) | SRMOSC (MOEA/D) | Fully connected | kNN | Mutual kNN | ε-neighborhood | MOCK |
| Wine (178, 13, 3) | 96.94 ± 0.64* | 96.91 ± 0.90* | 69.35 ± 16.44 | 96.80 ± 1.00* | 58.06 ± 8.76 | 56.80 ± 5.66 | 96.38 ± 0.76* |
| Glass (214, 9, 6) | 64.60 ± 1.66* | 63.55 ± 3.68* | 50.23 ± 3.77 | 63.76 ± 4.37* | 52.06 ± 7.25 | 49.01 ± 3.57 | 61.66 ± 2.44 |
| Iris (150, 4, 3) | 95.30 ± 1.63* | 96.50 ± 1.14 | 69.63 ± 4.63 | 94.73 ± 2.66 | 66.73 ± 2.12 | 68.00 ± 0.53 | 97.47 ± 0.35 |
| Wdbc (569, 30, 2) | 96.63 ± 0.63 | 96.78 ± 0.85 | 67.54 ± 3.61 | 97.31 ± 0.54 | 62.89 ± 0.52 | 66.79 ± 4.23 | 95.75 ± 0.75 |
| Heart disease (270, 13, 2) | 82.42 ± 2.59* | 82.96 ± 1.24* | 60.44 ± 4.54 | 75.74 ± 10.75 | 57.28 ± 0.79 | 59.43 ± 4.62 | 82.24 ± 0.61* |
| Balance scale (625, 4, 3) | 82.57 ± 1.56 | 82.24 ± 1.23 | 86.73 ± 1.32 | 86.50 ± 1.43 | 77.27 ± 8.96 | 85.78 ± 1.58 | 65.02 ± 5.34 |
| Vote (435, 26, 2) | 91.54 ± 0.92 | 90.85 ± 0.70 | 64.43 ± 4.04 | 90.62 ± 6.91 | 66.90 ± 8.20 | 68.31 ± 8.27 | 89.87 ± 0.91 |
| Ecoli (336, 7, 8) | 85.95 ± 1.44* | 85.34 ± 1.53* | 63.88 ± 7.51 | 85.55 ± 2.55* | 80.07 ± 1.75 | 70.29 ± 5.16 | 85.03 ± 2.10* |
| Thyroid (215, 5, 3) | 95.14 ± 1.07* | 94.35 ± 1.37* | 74.23 ± 3.35 | 94.39 ± 3.13* | 78.12 ± 8.43 | 73.70 ± 1.61 | 93.56 ± 1.39 |
| Zoo (101, 16, 7) | 91.09 ± 1.70 | 90.40 ± 2.53 | 88.98 ± 2.62 | 87.22 ± 4.43 | 69.06 ± 8.06 | 62.25 ± 3.63 | 93.66 ± 1.13 |
| Image segmentation (2310, 19, 7) | 90.02 ± 2.03* | 89.17 ± 2.32* | 44.12 ± 5.01 | 82.69 ± 6.80 | 81.26 ± 4.22 | 64.17 ± 4.58 | 83.30 ± 2.57 |
| Waveform (5000, 21, 3) | 87.22 ± 0.53 | 89.04 ± 0.49 | 60.54 ± 0.90 | 86.23 ± 0.47 | 34.20 ± 0.15 | 42.40 ± 3.10 | 85.33 ± 0.56 |
Table 6.7 shows the experimental results obtained from all the algorithms. SRMOSC significantly outperforms fully connected, mutual kNN, and ε-neighborhood similarity matrix-based spectral clustering on all the tested data. In addition, SRMOSC achieves a better performance than kNN on most of the tested datasets. Comparing SRMOSC with MOCK, SRMOSC works much better on the tested datasets except on the dataset "heart." Note that kNN shows a much better performance than mutual kNN on all the tested data. In both cases, the parameter k is set to the same value, but the similarity matrix obtained from kNN has more nonzero entries than that of mutual kNN, which also demonstrates the importance of the parameter k. Meanwhile, the problem of deciding the value of k is overcome in SRMOSC.
The experimental results of semisupervised clustering with 10% and 20% labeled data are shown in Tables 6.8 and 6.9. Semi-MOCK handles the semisupervised information with a third objective function, the adjusted Rand index [57], which is an external measure of clustering quality. In the traditional similarity matrix construction methods, all the entries of labeled samples sharing the same label are set to the maximum value of the similarity matrix, and the corresponding entries of samples with different labels are set to 0. In Table 6.8, SRMOSC works better than the other traditional methods and semi-MOCK on most of the tested datasets. When the percentage of labeled data increases to 20%, SRMOSC again shows its efficiency against the other algorithms. Note that kNN performs well on some of the datasets, especially with 10% labeled data. As mentioned previously, selecting a value of k for finite data is a difficult problem, and an additional experiment is given in the supplementary material to show how k affects the clustering result. For the other traditional spectral clustering methods, even when the best parameter value is chosen, the results are still quite poor.
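For reference, the sketch below builds the two neighborhood graphs being contrasted here (kNN and mutual kNN) and injects label constraints in the traditional way described above; the Gaussian similarity, σ, and k are illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """Dense pairwise similarity with a Gaussian kernel (diagonal set to 0)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)
    return S

def knn_graph(S, k, mutual=False):
    """Keep each sample's k strongest similarities. With mutual=True an edge
    survives only if both endpoints select each other, which is why the mutual
    kNN graph has fewer nonzero entries than the kNN graph for the same k."""
    keep = np.zeros_like(S, dtype=bool)
    for i in range(S.shape[0]):
        keep[i, np.argsort(S[i])[-k:]] = True
    keep = (keep & keep.T) if mutual else (keep | keep.T)
    return np.where(keep, S, 0.0)

def inject_labels(W, labels):
    """Traditional semisupervised construction: entries of same-label pairs are
    set to the maximum similarity, different-label pairs to 0 (-1 = unlabeled)."""
    W = W.copy()
    w_max = W.max()
    labeled = np.flatnonzero(labels >= 0)
    for i in labeled:
        for j in labeled:
            if i != j:
                W[i, j] = w_max if labels[i] == labels[j] else 0.0
    return W

X = np.random.default_rng(0).normal(size=(50, 2))
S = gaussian_similarity(X)
print(np.count_nonzero(knn_graph(S, k=5)), ">=", np.count_nonzero(knn_graph(S, k=5, mutual=True)))
```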
Fig. 6.26 gives a time evaluation of SRMOSC based on different MOEAs and of the other algorithms under the same experimental conditions.
Figure 6.26 Time evaluation (in seconds) of SRMOSC and the other algorithms (MOEA/D, NSGA-II, MOCK, MOGA, kNN, mutual kNN, fully connected, and ε-neighborhood) on the datasets wine, glass, heart, thyroid, zoo, and iris. SRMOSC based on NSGA-II and on MOEA/D are represented as NSGA-II and MOEA/D, respectively.
It shows that: (1) in contrast to the conventional spectral clustering algorithms, the time cost of SRMOSC is higher because SRMOSC is a multiobjective clustering algorithm; although SRMOSC costs more time than the conventional spectral clustering algorithms, it overcomes the difficulty of selecting a suitable parameter value for constructing the similarity matrix; (2) in contrast to the other multiobjective clustering algorithms, its time cost is higher than that of MOGA(p, sep) (prototype-based representation) but lower than that of MOCK (graph-based representation), which indicates that the time complexity of multiobjective clustering algorithms is closely related to the cluster representation method; and (3) the time cost of SRMOSC based on MOEA/D is lower than that of SRMOSC based on NSGA-II, which shows that the underlying MOEA has a great effect on the time complexity of SRMOSC.
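To make the representation point concrete, the sketch below decodes the two kinds of chromosome mentioned here: a locus-based (graph-based) encoding of the kind MOCK uses, where each gene points to another sample and clusters are the connected components, and a prototype-based encoding, where each sample is assigned to its nearest centroid. The helper names and the toy data are illustrative assumptions.

```python
import numpy as np

def decode_locus(genes):
    """Graph-based (locus-based adjacency) decoding: gene i points to sample
    genes[i]; clusters are the connected components of that graph, so decoding
    touches every sample and edge of the chromosome."""
    n = len(genes)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i, j in enumerate(genes):
        parent[find(i)] = find(j)
    roots = [find(i) for i in range(n)]
    relabel = {r: c for c, r in enumerate(dict.fromkeys(roots))}
    return np.array([relabel[r] for r in roots])

def decode_prototypes(X, centroids):
    """Prototype-based decoding: assign every sample to its nearest centroid;
    the chromosome only stores the (much smaller) set of centroids."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

X = np.random.default_rng(0).normal(size=(6, 2))
print(decode_locus([1, 0, 1, 4, 5, 3]))     # two components: {0,1,2} and {3,4,5}
print(decode_prototypes(X, X[[0, 3]]))      # labels induced by two chosen prototypes
```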
6.6 Summary
This chapter has presented three MOEA-based methods that address constrained multiobjective optimization problems (CMOPs), simultaneous adaptive clustering and classification, and sparse spectral clustering, respectively [28,46,84].
The first method, a modified objective function method with a feasible-guiding strategy, is introduced to solve CMOPs. The modified objective function method allows the search for Pareto optimal individuals to exploit both feasible and infeasible regions. Constraint violation and objective function values are both considered when selecting infeasible individuals, so that only those with low constraint violations and better objective function values survive the selection. The feasibility ratio of the current population decides the contribution of these two parts, which guides the evolution either to search for less-violated infeasible individuals with better objective function values or to find better nondominated feasible individuals. Even when there are no feasible individuals in the current population, both parts are still considered together, in case the search becomes trapped in individuals that are feasible but not sufficiently optimal. The feasible-guiding strategy lets feasible individuals guide the evolution of infeasible individuals, and the cooperation between feasible and infeasible regions makes the search more effective. Furthermore, both strategies are implemented on the basis of NSGA-II because of the popularity of this algorithm; of course, they can easily be extended to other constrained MOEAs. The experimental results on the test problems indicate that the presented algorithm is able to find well-distributed Pareto optimal solutions that spread evenly on or near the true Pareto front, which provides evidence of its capability.
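A minimal sketch of the kind of blending described above is given below, assuming a simple combination in which the population feasibility ratio controls the weight of the constraint-violation term; the exact modified objective function is defined in Section 6.2 and may differ in detail.

```python
import numpy as np

def modified_objectives(F, CV):
    """F: (pop, m) raw objective values; CV: (pop,) total constraint violations.
    Returns blended objective values in which the weight of the (normalized)
    violation term is controlled by the current feasibility ratio, so that both
    parts are always taken into account (illustrative formula only)."""
    feasibility_ratio = float(np.mean(CV == 0))
    F_norm = (F - F.min(axis=0)) / (np.ptp(F, axis=0) + 1e-12)
    CV_norm = (CV / (CV.max() + 1e-12))[:, None]
    # A mostly infeasible population pushes the search toward low violations,
    # while a mostly feasible one lets the raw objectives dominate the ranking.
    return F_norm + (1.0 - feasibility_ratio) * CV_norm

F = np.array([[1.0, 2.0], [0.5, 3.0], [2.0, 1.0]])
CV = np.array([0.0, 0.4, 0.0])
print(modified_objectives(F, CV))
```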
The second method, MOASCC, is an algorithm that learns simultaneous clustering and classification adaptively via MOEA. Its main contributions can be summarized as follows. First, MOASCC adopts a graph-based representation scheme to generate a set of individuals with various partitions and different numbers of clusters. Second, a new clustering objective function is designed to make MOASCC more robust to the underlying structure of the given dataset; the multiobjective optimization not only guarantees the quality of both clustering and classification but also restricts the number of clusters to a certain range. Third, a specific mutation scheme is designed to make use of the feedback drawn from the classification process, which enhances the classification performance. In addition, an experimental analysis of the convergence of MOASCC is given to demonstrate its efficiency.
SRMOSC is introduced in the final part and makes several contributions. First, and principally, a framework based on sparse representation via an MOEA is proposed to construct the similarity matrix for spectral clustering. SRMOSC models the similarity matrix construction process in spectral clustering as a constrained multiobjective problem and solves it with EAs. It overcomes the difficulty of parameter setting that commonly exists in traditional methods, and the experiments demonstrate that the multiobjective evolutionary sparse representation model is efficient in solving the spectral clustering problem. Second, SRMOSC is extended to semisupervised spectral clustering by modeling the semisupervised information as a constraint to be satisfied and by guiding the search in the initialization and mutation processes. Third, a selection principle is designed that adopts the ratio cut as the measurement to select the final solution from all the Pareto optimal solutions, based on a standard adjacency matrix constructed from all the nondominated solutions; detailed experiments show that a satisfying solution can be obtained in this way. Fourth, specialized initialization, crossover, and mutation schemes are designed for solving sparse representation-based spectral clustering with constrained MOEAs. Additionally, the model that constructs the similarity matrix in SRMOSC can easily be extended to other graph-related problems, such as subspace learning. All these contributions help SRMOSC achieve a more satisfying performance than conventional methods and other multiobjective clustering algorithms. However, owing to its coding scheme, its space complexity is high, especially when solving large-scale problems.
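The ratio-cut measurement used in the final solution selection phase can be computed as RatioCut(C_1, ..., C_K) = Σ_k cut(C_k, V \ C_k) / |C_k|. The sketch below evaluates it for a candidate partition over a given adjacency matrix; it illustrates only the measurement itself, not the full selection procedure over the nondominated set.

```python
import numpy as np

def ratio_cut(W, labels):
    """RatioCut of a partition on adjacency matrix W: for each cluster, the total
    weight of edges leaving it divided by its size, summed over clusters."""
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        inside = labels == c
        leaving = W[inside][:, ~inside].sum()
        total += leaving / inside.sum()
    return total

# Two densely connected blocks joined by one weak edge: the "natural" 2-way
# split yields the smaller ratio cut and would be preferred as the final solution.
W = np.zeros((6, 6))
W[:3, :3] = W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = 0.1
print(ratio_cut(W, [0, 0, 0, 1, 1, 1]))   # ~0.07
print(ratio_cut(W, [0, 0, 1, 1, 1, 1]))   # 1.5
```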
References
[1] Hsieh MN, Chiang TC, Fu LC. A hybrid constraint handling mechanism with differential evolution for constrained multiobjective optimization. In: Evolutionary computation (CEC), 2011 IEEE congress on. IEEE; 2011. p. 1785–92.
[2] Michalewicz Z, Schoenauer M. Evolutionary algorithms for constrained parameter optimization problems. Evolutionary Computation 1996;4(1):1–32.
[3] Coello CAC, Carlos A. A survey of constraint handling techniques used with evolutionary algorithms. Lania-RI-99-04, Laboratorio Nacional de Informática Avanzada; 1999.
[4] Davis L. Handbook of genetic algorithms. 1991.
[5] Michalewicz Z. Genetic algorithms + data structures = evolution programs. New York: Springer-Verlag; 1996.
[6] Ray T, Singh HK, Isaacs A, et al. Infeasibility driven evolutionary algorithm for constrained optimization. In: Constraint-handling in evolutionary optimization. Berlin, Heidelberg: Springer; 2009. p. 145–65.
[7] Dasgupta D, Michalewicz Z. Evolutionary algorithms in engineering applications. International Journal of Evolution Optimization 1999;1:93–4.
[8] Koziel S, Michalewicz Z. A decoder-based evolutionary algorithm for constrained parameter optimization problems. In: International conference on parallel problem solving from nature. Berlin, Heidelberg: Springer; 1998. p. 231–40.
[9] Koziel S, Michalewicz Z. Evolutionary algorithms, homomorphous mappings, and constrained parameter optimization. Evolutionary Computation 1999;7(1):19–44.
[10] Michalewicz Z, Nazhiyath G. Genocop III: a co-evolutionary algorithm for numerical optimization problems with nonlinear constraints. In: Evolutionary computation, 1995. IEEE international conference on, vol. 2. IEEE; 1995. p. 647–51.
[11] Michalewicz Z. Evaluation of paths in evolutionary planner/navigator. In: Proceedings of the international workshop on biologically inspired evolutionary systems; 1995.
[12] Xiao J, Michalewicz Z, Zhang L, et al. Adaptive evolutionary planner/navigator for mobile robots. IEEE Transactions on Evolutionary Computation 1997;1(1):18–28.
[13] Xiao J, Michalewicz Z, Zhang L. Evolutionary planner/navigator: operator performance and self-tuning. In: Evolutionary computation, 1996. Proceedings of IEEE international conference on. IEEE; 1996. p. 366–71.
[14] Sathya SS, Kuppuswami S. Gene silencing: a genetic operator for constrained optimization. Applied Soft Computing 2011;11(8):5801–8.
[15] Runarsson TP, Yao X. Stochastic ranking for constrained evolutionary optimization. IEEE Transactions on Evolutionary Computation 2000;4(3):284–94.
[16] Deb K. An efficient constraint handling method for genetic algorithms. Computer Methods in Applied Mechanics and Engineering 2000;186(2–4):311–38.
[17] Takahama T, Sakai S. Constrained optimization by applying the α constrained method to the nonlinear simplex method with mutations. IEEE Transactions on Evolutionary Computation 2005;9(5):437–51.
[18] Takahama T, Sakai S. Constrained optimization by the ε constrained differential evolution with an archive and gradient-based mutation. In: Evolutionary computation (CEC), 2010 IEEE congress on. IEEE; 2010. p. 1–9.
[19] Paredis J. Co-evolutionary constraint satisfaction. In: International conference on parallel problem solving from nature. Berlin, Heidelberg: Springer; 1994. p. 46–55.
[20] Singh HK, Ray T, Smith W. Performance of infeasibility empowered memetic algorithm for CEC 2010 constrained optimization problems. In: Evolutionary computation (CEC), 2010 IEEE congress on. IEEE; 2010. p. 1–8.
[21] Venkatraman S, Yen GG. A generic framework for constrained optimization using genetic algorithms. IEEE Transactions on Evolutionary Computation 2005;9(4):424–35.
[22] Mallipeddi R, Suganthan PN. Evaluation of novel adaptive evolutionary programming on four constraint handling techniques. In: Evolutionary computation, 2008. CEC 2008. (IEEE world congress on computational intelligence). IEEE congress on. IEEE; 2008. p. 4045–52.
[23] Mallipeddi R, Suganthan PN. Ensemble of constraint handling techniques. IEEE Transactions on Evolutionary Computation 2010;14(4):561–79.
[24] Mallipeddi R, Suganthan PN. Differential evolution with ensemble of constraint handling techniques for solving CEC 2010 benchmark problems. In: Evolutionary computation (CEC), 2010 IEEE congress on. IEEE; 2010. p. 1–8.
[25] Xiao H, Zu JW. A new constrained multiobjective optimization algorithm based on artificial immune systems. In: Mechatronics and automation, 2007. ICMA 2007. International conference on. IEEE; 2007. p. 3122–7.
[26] Zhang Z. Immune optimization algorithm for constrained nonlinear multiobjective optimization problems. Applied Soft Computing 2007;7(3):840–57.
[27] Karaboga D, Akay B. A modified artificial bee colony (ABC) algorithm for constrained optimization problems. Applied Soft Computing 2011;11(3):3021–31.
[28] Jiao L, Luo J, Shang R, et al. A modified objective function method with feasible-guiding strategy to solve constrained multiobjective optimization problems. Applied Soft Computing 2014;14:363–80.
[29] Jain AK, Duin RPW, Mao J. Statistical pattern recognition: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence 2000;22(1):4–37.
[30] Maulik U, Bandyopadhyay S. Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002;24(12):1650–4.
[31] Handl J, Knowles J. An evolutionary approach to multiobjective clustering. IEEE Transactions on Evolutionary Computation 2007;11(1):56–76.
[32] Duda RO, Hart PE, Stork DG. Pattern classification. New York: Wiley; 1973.
[33] Cai W, Chen S, Zhang D. A multiobjective simultaneous learning framework for clustering and classification. IEEE Transactions on Neural Networks 2010;21(2):185–200.
[34] Coello CAC. Evolutionary multiobjective optimization: a historical view of the field. IEEE Computational Intelligence Magazine 2006;1(1):28–36.
[35] Deb K, Pratap A, Agarwal S, et al. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 2002;6(2):182–97.
[36] Zitzler E, Laumanns M, Thiele L. SPEA2: improving the strength Pareto evolutionary algorithm. TIK Report 103; 2001.
[37] Coello CAC, Pulido GT, Lechuga MS. Handling multiple objectives with particle swarm optimization. IEEE Transactions on Evolutionary Computation 2004;8(3):256–79.
[38] Zhang Q, Li H. MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation 2007;11(6):712–31.
[39] Garcia-Piquer A, Fornells A, Bacardit J, et al. Large-scale experimental evaluation of cluster representations for multiobjective evolutionary clustering. IEEE Transactions on Evolutionary Computation 2014;18(1):36–53.
[40] Mukhopadhyay A, Maulik U, Bandyopadhyay S, et al. A survey of multiobjective evolutionary algorithms for data mining: Part I. IEEE Transactions on Evolutionary Computation 2014;18(1):4–19.
[41] Mukhopadhyay A, Maulik U, Bandyopadhyay S, et al. Survey of multiobjective evolutionary algorithms for data mining: Part II. IEEE Transactions on Evolutionary Computation 2014;18(1):20–35.
[42] Qasem SN, Shamsuddin SM. Memetic elitist Pareto differential evolution algorithm based radial basis function networks for classification problems. Applied Soft Computing 2011;11(8):5565–81.
[43] Qasem SN, Shamsuddin SM, Zain AM. Multiobjective hybrid evolutionary algorithms for radial basis function neural network design. Knowledge-Based Systems 2012;27:475–97.
[44] Qasem SN, Shamsuddin SM, Hashim SZM, et al. Memetic multiobjective particle swarm optimization-based radial basis function network for classification problems. Information Sciences 2013;239:165–90.
[45] Bharill N, Tiwari A. An improved multiobjective simultaneous learning framework for designing a classifier. In: Recent trends in information technology (ICRTIT), 2011 international conference on. IEEE; 2011. p. 737–42.
[46] Luo J, Jiao L, Shang R, et al. Learning simultaneous adaptive clustering and classification via MOEA. Pattern Recognition 2016;60:37–50.
[47] Von Luxburg U. A tutorial on spectral clustering. Statistics and Computing 2007;17(4):395–416.
[48] Wright J, Ma Y, Mairal J, et al. Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE 2010;98(6):1031–44.
[49] Vidal R. Subspace clustering. IEEE Signal Processing Magazine 2011;28(2):52–68.
[50] Zelnik-Manor L, Perona P. Self-tuning spectral clustering. In: Advances in neural information processing systems; 2005. p. 1601–8.
[51] Woldesenbet YG, Yen GG, Tessema BG. Constraint handling in multiobjective evolutionary optimization. IEEE Transactions on Evolutionary Computation 2009;13(3):514–25.
[52] Park YJ, Song MS. A genetic algorithm for clustering problems. In: Proceedings of the third annual conference on genetic programming; 1998. p. 568–75.
[53] Good BH, de Montjoye YA, Clauset A. Performance of modularity maximization in practical contexts. Physical Review E 2010;81(4):046106.
[54] Matake N, Hiroyasu T, Miki M, et al. Multiobjective clustering with automatic k-determination for large-scale data. In: Proceedings of the 9th annual conference on genetic and evolutionary computation. ACM; 2007. p. 861–8.
[55] Pizzuti C. A multiobjective genetic algorithm to find communities in complex networks. IEEE Transactions on Evolutionary Computation 2012;16(3):418–30.
[56] Wilson RJ, Watkins JJ. Graphs: an introductory approach: a first course in discrete mathematics. John Wiley & Sons Inc.; 1990.
[57] Hubert L, Arabie P. Comparing partitions. Journal of Classification 1985;2(1):193–218.
[58] Corne DW, Jerram NR, Knowles JD, et al. PESA-II: region-based selection in evolutionary multiobjective optimization. In: Proceedings of the 3rd annual conference on genetic and evolutionary computation. Morgan Kaufmann Publishers Inc.; 2001. p. 283–90.
[59] Li L, Yao X, Stolkin R, et al. An evolutionary multiobjective approach to sparse reconstruction. IEEE Transactions on Evolutionary Computation 2014;18(6):827–45.
[60] Wei YC, Cheng CK. Ratio cut partitioning for hierarchical designs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 1991;10(7):911–21.
[61] Binh TT, Korn U. MOBES: a multiobjective evolution strategy for constrained optimization problems. In: The third international conference on genetic algorithms (Mendel 97). 25; 1997. p. 27.
[62] Srinivas N, Deb K. Multiobjective optimization using nondominated sorting in genetic algorithms. Evolutionary Computation 1994;2(3):221–48.
[63] Tanaka M, Watanabe H, Furukawa Y, et al. GA-based decision support system for multicriteria optimization. In: Systems, man and cybernetics, 1995. Intelligent systems for the 21st century. IEEE international conference on, vol. 2. IEEE; 1995. p. 1556–61.
[64] Deb K. Multiobjective optimization using evolutionary algorithms. John Wiley & Sons; 2001.
[65] Osyczka A, Kundu S. A new method to solve generalized multicriteria optimization problems using the simple genetic algorithm. Structural Optimization 1995;10(2):94–9.
[66] Ray T, Tai K. An evolutionary algorithm with a multilevel pairing strategy for single and multiobjective optimization. Foundations of Computing and Decision Sciences 2001;26(1):75–98.
[67] Deb K, Pratap A, Meyarivan T. Constrained test problems for multiobjective evolutionary optimization. In: International conference on evolutionary multi-criterion optimization. Berlin, Heidelberg: Springer; 2001. p. 284–98.
[68] Zhang Q, Zhou A, Zhao S, et al. Multiobjective optimization test instances for the CEC 2009 special session and competition. Special session on performance assessment of multiobjective optimization algorithms, technical report. University of Essex, Colchester, UK and Nanyang Technological University, Singapore; 2008. p. 264.
[69] Bandyopadhyay S, Pal SK, Aruna B. Multiobjective GAs, quantitative indices, and pattern classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 2004;34(5):2088–99.
[70] Zitzler E, Deb K, Thiele L. Comparison of multiobjective evolutionary algorithms: empirical results. Evolutionary Computation 2000;8(2):173–95.
[71] Cai W, Chen S, Zhang D. A simultaneous learning framework for clustering and classification. Pattern Recognition 2009;42(7):1248–59.
[72] Chang CC, Lin CJ. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2011;2(3):27.
[73] Oyang YJ, Hwang SC, Ou YY, et al. Data classification with radial basis function networks based on a novel kernel density estimation algorithm. IEEE Transactions on Neural Networks 2005;16(1):225–36.
[74] Handl J, Knowles J. On semi-supervised clustering via multiobjective optimization. In: Proceedings of the 8th annual conference on genetic and evolutionary computation. ACM; 2006. p. 1465–72.
[75] Blake C. UCI repository of machine learning databases. 1998. http://www.ics.uci.edu/~mlearn/MLRepository.html.
[76] Dash M, Liu H. Feature selection for clustering. In: Pacific-Asia conference on knowledge discovery and data mining. Berlin, Heidelberg: Springer; 2000. p. 110–21.
[77] Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Transactions on Knowledge and Data Engineering 2005;17(4):491–502.
[78] Zhang X, Li J, Yu H. Local density adaptive similarity measurement for spectral clustering. Pattern Recognition Letters 2011;32(2):352–8.
[79] Hamad D, Biela P. Introduction to spectral clustering. In: Information and communication technologies: from theory to applications, 2008. ICTTA 2008. 3rd international conference on. IEEE; 2008. p. 1–6.
[80] Maier M, Hein M, von Luxburg U. Optimal construction of k-nearest-neighbor graphs for identifying noisy clusters. Theoretical Computer Science 2009;410(19):1749–64.
[81] Chen WY, Song Y, Bai H, et al. Parallel spectral clustering in distributed systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 2011;33(3):568–86.
[82] Mukhopadhyay A, Maulik U, Bandyopadhyay S. Multiobjective genetic algorithm-based fuzzy clustering of categorical attributes. IEEE Transactions on Evolutionary Computation 2009;13(5):991–1005.
[83] Lichman M. UCI machine learning repository. Irvine, CA, USA: School Inf. Comput. Sci., Univ. California Irvine; 2013. Available: http://archive.ics.uci.edu/ml.
[84] Luo J, Jiao L, Lozano JA. A sparse spectral clustering framework via multiobjective evolutionary algorithm. IEEE Transactions on Evolutionary Computation 2016;20(3):418–33.