Neural Networks 21 (2008) 1287–1301
SAKM: Self-adaptive kernel machine. A kernel-based algorithm for online clustering

Habiboulaye Amadou Boubacar a,b,*, Stéphane Lecoeuche a,b, Salah Maouche b

a Ecole des Mines de Douai, Département Informatique et Automatique, 941, Rue Charles Bourseul, BP838, 59 508 Douai, France
b Laboratoire Automatique, Génie Informatique et Signal, UMR CNRS 8146, Université des Sciences et Technologies de Lille, Bâtiment P2, 59655 Villeneuve d'Ascq, France

* Corresponding author at: Ecole des Mines de Douai, Département Informatique et Automatique, 941, Rue Charles Bourseul, BP838, 59 508 Douai, France. E-mail addresses: [email protected] (H. Amadou Boubacar), [email protected] (S. Lecoeuche), [email protected] (S. Maouche).
Article info

Article history: Received 6 April 2005; Accepted 20 March 2008

Keywords: Online clustering; RKHS; Non-stationary data; Multi-class problems; Evolving models
Abstract

This paper presents a new online clustering algorithm called SAKM (Self-Adaptive Kernel Machine), developed to learn continuously evolving clusters from non-stationary data. Based on SVM and kernel methods, the SAKM algorithm uses a fast adaptive learning procedure to take variations over time into account. Dedicated to online clustering in a multi-class environment, the algorithm relies on an unsupervised neural architecture with self-adaptive abilities. Built on a specific kernel-induced similarity measure, the SAKM learning procedures consist of four main stages: Creation, Adaptation, Fusion and Elimination. In addition to these properties, the SAKM algorithm is attractive because it is computationally efficient for online learning of real-drifting targets. After a theoretical study of the error convergence bound of the SAKM local learning, a comparison with the NORMA and ALMA algorithms is made. Finally, experiments conducted on simulation data, UCI benchmarks and real data illustrate the capacities of the SAKM algorithm for online clustering in a non-stationary and multi-class environment. © 2008 Elsevier Ltd. All rights reserved.
1. Introduction

Clustering methods group data by using some distribution models and according to various criteria (Euclidean distance, membership function, etc.). Numerous techniques have been developed for static data clustering (Ben-Hur, Horn, Siegelmann, & Vapnik, 2001; Bishop, 1995; Kriegel, Sander, Ester, & Xu, 1997; Zhang & Xing, 2003). But, in many real-life applications, non-stationary data are generated according to distribution models which change over time. For example, in the case of voice and face recognition, models undergo variations with ageing. Therefore, online learning techniques are useful to continuously take model variations over time into account. Online clustering methods have to deal with both static and changing target models, depending on the problems addressed. In addition, in a multi-class environment, the number of clusters may change over time because of phenomena like the appearance of new modes, mode fusion or mode elimination (noise or obsolete information). The specific difficulties related to online clustering require unsupervised, recursive and adaptive learning rules that can incorporate new information and take into
account model evolution from non-stationary data. Online clustering algorithms must also be computationally efficient for real-time applications. Besides, in order to evaluate the error convergence of online clustering algorithms dealing with changing targets in a non-stationary environment, a suitable performance measure is necessary. The ambitious framework of online clustering of non-stationary data in a multi-class area is therefore an interesting research topic to investigate. The list of symbols used in this paper is given in Table 1.

Table 1
Notations of time-varying variables

  Variable        Description
  t / i / m       Time / Index of support / Index of cluster
  X_t             Data point of input space χ acquired at time t
  C_m^t           mth cluster of a set Ω at time t
  F^t             Learning function (as kernel expansion) at time t
  f_m^t           mth cluster boundary function of a set ℑ at time t
  SV_{i,m}^t      ith support vector of the function f_m^t at time t
  α_m^t           Parameter vector of the function f_m^t at time t
  α_{i,m}^t       ith coefficient (weight) of the vector α_m^t
  ρ_m^t           Offset of the decision function f_m^t at time t

For simplicity, the time index t will be used only in recursive equations.

1.1. Previous works

Many algorithms introduced in previous works for online clustering of non-stationary data have been developed with neural network techniques (Deng & Kasabov, 2003; Kasabov, 2001). Most of them are based on parametric models for data grouping. For example, the Fuzzy Min-Max Clustering algorithm (Simpson, 1993) uses hyper-box prototypes and the Cluster Detection and Labelling neural network (Eltoft & Figueiredo, 1998) is based on the Euclidean distance to define hyperspherical clusters. Both methods have adaptive architectures and incremental learning procedures, but are not well adapted to online applications that require time-varying clusters with an unknown distribution model.
More recently, Lecoeuche and Lurette (2003) have developed a new architecture called AUDyC (Auto-Adaptive and Dynamical Clustering). It uses Gaussian
prototype models with a fuzzy membership function to cluster non-stationary data. The AUDyC neural network has some specific online learning rules in order to incorporate new information in a multi-class environment. This algorithm has been successfully applied to online monitoring of systems (Lecoeuche & Lurette, 2004). However, in spite of its good performance, the AUDyC algorithm leads to overfitting and is limited to low-dimensional spaces. Indeed, the number of operations per iteration can be very large. Although the algorithms mentioned above have efficient auto-adaptive learning abilities, their convergence bounds are not theoretically proved for online clustering. During recent years, support vector machines (SVM) and related kernel methods introduced by Vapnik (1998) have proven to be successful in many applications of pattern recognition (Schölkopf & Smola, 2002), and are attractive both theoretically and experimentally. As an efficient (non-parametric) approach using the kernel trick, SVM methods are mostly used for data classification with maximum-margin hyperplanes. Initially introduced for binary classification (Burges, 1998), SVM techniques based on optimal discriminant hyperplanes have been extended to many multi-class problems (Borer, 2003; Cheong, Oh, & Lee, 2004). In 2001, a new SVM technique called One-Class-SVM was introduced (Schölkopf, Platt, Shawe-Taylor, & Smola, 2001) to estimate the support of the distribution (level set) of high-dimensional data; it has been successfully used in several single-class problems (Desobry, Davy, & Doncarli, 2005; Tax, 2001). SVM training models are generally obtained by supervised learning. However, in online clustering problems, the goal is to construct evolving models without a priori knowledge of data labels. At present, few SVM techniques exist with incremental learning rules. Kuh (2001) presented an adaptive kernel approach based on a least-squares criterion for adaptive learning in signal processing applications. This algorithm is not usable in online clustering problems; it was developed specifically for recovering CDMA (Code Division Multiple Access) downlink signals. A sequential kernel-based algorithm is proposed in Csato and Opper (2002) by combining Gaussian process models with projection techniques in RKHS (Reproducing Kernel Hilbert Space). This algorithm is suitable for large dataset problems. Some fast kernel classifiers are proposed in Bordes and Bottou (2005) and Bordes, Ertekin, Weston, and Bottou (2005) for online and active learning in high-dimensional datasets. The previous algorithms are not appropriate for clustering problems in non-stationary environments, especially because they cannot deal with the challenges of merge and split phenomena. Inspired by the incremental and decremental learning rules developed by Cauwenberghs and Poggio (2000), Gretton and Desobry (2003) and Tax and Laskov (2003) proposed online versions of One-Class-SVM. These algorithms provide good solutions for the pursuit of an evolving cluster but are computationally too expensive. This inconvenience is due to the inversion of a matrix of support vectors required at each iteration. In 2004, Kivinen, Smola, and Williamson (2004) introduced NORMA (Naïve Online Regularized Risk Minimization Algorithm), the principle of
which is relatively similar to that of Gentile's ALMA (Gentile, 2001) for online classification problems. The NORMA algorithm has been successfully applied in many real-time applications (classification, density support estimation and regression). It is built with an iterative update rule based on the technique of stochastic gradient descent in Hilbert space, and it provides a good approximate solution with computationally efficient performance. In spite of its various applications, the NORMA algorithm is not convenient for real-drifting targets and cannot deal with the challenges of unsupervised clustering in a multi-class environment.

1.2. Paper contributions

In this paper, we propose a new kernel-based algorithm for online clustering. This algorithm, called SAKM (Self-Adaptive Kernel Machine), combines SVM and kernel methods with a specific neural architecture. This neural architecture is used to describe the SAKM unsupervised clustering in a non-stationary and multi-class context. From online data acquisition, a novel kernel-induced measure is developed to evaluate data similarity in order to decide whether it is necessary to create a new cluster or to modify existing clusters by adapting their parameters, merging them or eliminating them (Amadou, Lecoeuche, & Maouche, 2005). These abilities allow using the algorithm in a truly non-stationary environment, in contrast with NORMA, which only allows cluster adaptation. In SAKM, each cluster is modelled by using a density support (level set) in a Reproducing Kernel Hilbert Space (RKHS). In input space, each density support yields a boundary function (defined as a kernel quantile function) enclosing all the cluster data in a domain with minimum volume. In order to track time-varying cluster models, a new online update rule is introduced. It is based on stochastic gradient descent and combines the advantages of NORMA and incremental One-Class-SVM. Fast and efficient, the SAKM update rule recursively incorporates new information in order to estimate cluster models online and to take their evolutions over time into account. In the context of non-stationary data clustering, other difficulties such as information fusion can occur. Using an ambiguity criterion, the SAKM fusion procedure detects clusters sharing data in order to merge them. Finally, an elimination procedure based on a cardinality criterion is used to periodically eliminate noisy and obsolete clusters according to the problem requirements. To evaluate SAKM performance, the average of the cumulated error is used. With good convergence bounds, this measure represents the SAKM objective function to minimize in the experiments.

1.3. Paper organization

Section 2 of this paper recalls the theoretical foundations of statistical learning and SVM methods. Then, as the groundwork of our update rule, the NORMA algorithm and its application to a single-class density support (novelty detection) are described. Section 3 details the novel ideas introduced in SAKM. In particular, we present the algorithm architecture, the kernel-induced similarity measure, and the proposed update rule (a variant of the NORMA update rule). After a theoretical analysis of the SAKM error convergence, a comparison with NORMA and ALMA is given in Section 3.4. Finally, Section 4 presents some experimental results. Firstly, simulation data are used to show the capability of SAKM in a non-stationary environment.
Then, the algorithm's performance is compared with the Cauwenberghs and Poggio (2000) incremental SVM algorithm using UCI benchmark data. Finally, SAKM is used in a real-life problem that concerns the monitoring of a WCTC (Water Circulating Temperature Controller).
2. SVM foundations and online learning in RKHS

2.1. Statistical learning

Support Vector Machines are based on the principles of statistical learning theory, through generalization performance and error bound formulation (Vapnik, 1999). Consider a learning problem where all data are immediately available, and suppose that the data are generated independently and identically from some distribution P over χ × Υ, where for any data X ∈ χ, a desirable value (label) y ∈ Υ is assigned. Let ξ be a loss function that measures the error between the estimated label F(X) and the true label y. In these conditions, Vapnik introduces the Expected Risk, approximated by the Empirical Risk whose minimization is tractable (Vapnik, 1998). To avoid overfitting, F must be chosen in a suitably restricted set of learning functions. Therefore, the Empirical risk is augmented with a Regularization term G(F). The Regularized empirical risk is then defined so that:

$$R_{reg}[F, \chi] = \frac{1}{N}\sum_{X \in \chi} \xi(F(X), y) + G(F). \qquad (1)$$
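For illustration, a minimal numerical sketch of the regularized empirical risk (1); the hinge-style loss and the quadratic regularizer chosen here are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np

def regularized_empirical_risk(F, X, y, loss, reg):
    """Empirical risk of a decision function F over a dataset plus a
    regularization term, following Eq. (1): (1/N) sum_i loss(F(x_i), y_i) + G(F)."""
    empirical = np.mean([loss(F(x), yi) for x, yi in zip(X, y)])
    return empirical + reg(F)

# Illustrative choices (assumptions): a hinge loss for labels in {-1, +1}
# and a quadratic penalty on the weights of a linear decision function.
hinge = lambda score, label: max(0.0, 1.0 - label * score)

w = np.array([0.5, -1.0])                 # hypothetical linear model F(x) = <w, x>
F = lambda x: float(np.dot(w, x))
G = lambda _F: 0.5 * float(np.dot(w, w))  # ridge-type regularizer G(F)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([+1, -1, +1])
print(regularized_empirical_risk(F, X, y, hinge, G))
```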
Many techniques of Support Vector Classification have been developed with the idea of learning risk minimization (Bennett & Campbell, 2000; Burges, 1998, etc.). But very few SVM algorithms can work in online applications. Before treating the multi-class problem, let us briefly present the NORMA algorithm and its application to the recursive update of a single-class density support.

2.2. NORMA algorithm and single-class density support

Through the previous SVM foundations, the principle of classifier estimation is formulated as a problem of risk minimization. This section presents the NORMA algorithm, based on the technique of stochastic gradient descent in RKHS. Among its various applications dedicated to pattern recognition, our survey is related to the online estimation of a single-class density support.

2.2.1. Stochastic gradient in RKHS

NORMA is implemented with an efficient update rule built on stochastic gradient descent in Hilbert space. This algorithm updates a learning function online by iteratively tracking the risk minimization. Consider a data point X_t collected at time t, and F^t a learning function defined as a kernel expansion. This function F^t, learnt from the database available at t − 1, is defined so that:

$$F^t(X_t) = \sum_{i=1}^{t-1} \alpha_i^t \, \kappa(X_t, X_i) \qquad (2)$$
where the α_i^t ∈ ℝ are the coefficients (weights) at time t and κ(·, ·) is the reproducing kernel induced by the mapping function φ: χ → Γ in the inner-product Hilbert space Γ. This kernel is defined by:

$$\kappa(X_1, X_2) = \langle \phi(X_1), \phi(X_2) \rangle_\Gamma, \quad \forall X_1, X_2 \in \chi. \qquad (3)$$
⟨·, ·⟩_Γ is the inner-product operator in the Hilbert space Γ. In the context of real-time applications, data are collected sequentially. Since the whole database is not known a priori, the Regularized risk minimization (1) is not tractable. So, at each time t, the Instantaneous risk is defined by using a penalization term of the learning function F^t instead of the Regularization term (Kivinen et al., 2004):

$$R_{inst}[F^t, X_t] = \xi(F^t(X_t), y_t) + \frac{a}{2}\|F^t\|_\Gamma^2 \qquad (4)$$
Fig. 1. Density support estimated in Hilbert space Γ and data interpretations. With a RBF kernel, all data images are mapped onto a quadrant of a circle. This geometrical representation comes from the properties ⟨φ(X), φ(X)⟩_Γ = 1 and ⟨φ(X_1), φ(X_2)⟩_Γ ≤ 1, ∀X_1 ≠ X_2 ∈ χ. Non-SVs are vectors (well) classified inside the class domain without any error. Margin SVs are situated on the hyperplane and Outlier SVs are data with some slack errors regarding the density support.
where a > 0 is a penalization constant and ‖F^t‖_Γ is the norm of the learning function F^t in the Hilbert space Γ. It implements the regularization that prevents overfitting if abrupt changes occur over time. The minimization of the Instantaneous risk controls the learning function over time. The technique of stochastic gradient descent is then used in Hilbert space to track the changing model efficiently and adjust the learning function F^t over time:

$$F^{t+1} = F^t - \eta \, \frac{\partial R_{inst}[F^t, X_t]}{\partial F^t} \qquad (5)$$
where η > 0 is the learning rate. A necessary condition for the algorithm to work is 0 < η < 1/a. For a given application with an adequate selection of the loss function ξ, the derivative of the Instantaneous risk (4) and the gradient descent equation (5) yield the NORMA update rule. Let us now present the application of the NORMA algorithm to novelty detection.

2.2.2. Online learning of a single-class density support

The estimation of a distribution support was introduced in supervised learning by Schölkopf et al. (2001). It can be seen as classification without labels (single-class density) and the developed estimator is called One-Class-SVM. From a given training dataset, the One-Class-SVM estimator provides an estimation of the support of the distribution by mapping all data into the Hilbert space Γ. It uses the Lagrangian technique to determine the optimal separating hyperplane ∆. This hyperplane corresponds to the boundary function f enclosing the data in a domain with minimal volume in input space χ. This boundary function is defined by:

$$f(X) = F(X) - \rho = \sum_i \alpha_i \, \kappa(X, X_i) - \rho \qquad (6)$$
where ρ is the offset of the decision function. On the hyperplane, the property ∆: F(X) − ρ = 0 is verified. Fig. 1 shows, for soft-margin SVM, the data images mapped into the Hilbert space using a mapping function φ such that the dot product can be estimated by the RBF kernel (Schölkopf & Smola, 2002):
$$\kappa(X_1, X_2) = \exp(-\lambda \|X_1 - X_2\|), \quad \forall X_1, X_2 \in \chi \qquad (7)$$
where λ is the kernel parameter that depends on the data density. In online applications, the aim is to incorporate incrementally new information to adapt the learning function over time. In order to penalize the learning function of a single-class density support (that is a case of classification without label), the Hinge loss function is used (Kivinen et al., 2004):
$$\xi(f^t, X_t) = \max\left(0, -f^t(X_t)\right) - \nu\rho. \qquad (8)$$
In soft-margin SVM, the parameter ν sets the fraction of margin support vectors and outliers (error vectors) that are outside the class (Schölkopf, Smola, Williamson, & Bartlett, 2000). Thus, only
data with non-zero weights (α_i ≠ 0) contribute to the definition of the cluster boundary (see Eq. (6)) and thus must be updated recursively. Indeed, all the weights of non-SVs (inside vectors) are null (Fig. 1). Fig. 2 shows the behaviour of the Hinge loss function (8) according to the data position relative to a single-class boundary. The function offset ρ also changes over time and is computed online in the same way, thanks to the gradient descent technique.

Fig. 2. (a) Density support in input space. Circle dots on the class boundary are margin SVs and outside vectors are outliers. (b) Behaviour of the Hinge loss as data evolve with regard to a single-class density support.

Using the Hinge loss function (8) to derive the Instantaneous risk (4), the NORMA update rule iteratively gives the parameters of the cluster boundary function according to the recursive equations:

$$\begin{aligned}
\alpha_i^{t+1} &= (1-\eta)\,\alpha_i^t, && \text{for } i < t\\
\alpha_t^{t+1} &= \eta \ [\text{resp. } 0], && \text{if } f^t(X_t) < 0 \ [\text{resp. } f^t(X_t) \ge 0]\\
\rho^{t+1} &= \rho^t + \eta(1-\nu), && \text{if } f^t(X_t) < 0\\
\rho^{t+1} &= \rho^t + \eta\nu, && \text{else.}
\end{aligned} \qquad (9)$$
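For concreteness, here is a minimal sketch of the NORMA single-class update (9) with an RBF kernel, a truncated kernel expansion and the zero initial hypothesis; the class structure, names and the truncation handling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def rbf(x1, x2, lam=1.0):
    # RBF kernel of Eq. (7): exp(-lambda * ||x1 - x2||)
    return np.exp(-lam * np.linalg.norm(np.asarray(x1) - np.asarray(x2)))

class NormaOneClass:
    """Sketch of the NORMA novelty-detection update rule of Eq. (9)."""
    def __init__(self, eta=0.1, nu=0.3, lam=1.0, tau=15):
        self.eta, self.nu, self.lam, self.tau = eta, nu, lam, tau
        self.sv, self.alpha, self.rho = [], [], 0.0   # zero kernel hypothesis f^1 = 0

    def decision(self, x):
        # f^t(x) = sum_i alpha_i * k(x, SV_i) - rho
        return sum(a * rbf(x, s, self.lam) for a, s in zip(self.alpha, self.sv)) - self.rho

    def update(self, x):
        f_x = self.decision(x)
        # decay all existing coefficients: alpha_i <- (1 - eta) * alpha_i
        self.alpha = [(1.0 - self.eta) * a for a in self.alpha]
        if f_x < 0:                      # outlier or margin error: new support vector
            self.sv.append(x)
            self.alpha.append(self.eta)
            self.rho += self.eta * (1.0 - self.nu)
        else:                            # inside the support: only the offset moves
            self.rho += self.eta * self.nu
        # truncate the kernel expansion to the tau most recent terms
        self.sv, self.alpha = self.sv[-self.tau:], self.alpha[-self.tau:]
        return f_x
```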
The NORMA algorithm is initialised by choosing the zero kernel hypothesis: f^1 = 0. Thereafter, to avoid updating every outdated coefficient (obsolete information), the kernel expansion is truncated to τ terms. The sequential single-class model is then obtained by updating the density support parameters on a sliding window with an exponential decay. The convergence of NORMA is proved in Kivinen et al. (2004). In spite of its various applications, the NORMA algorithm cannot deal with the challenges of learning in a non-stationary environment because of the following limitations:

– The algorithm is particularly appropriate for online single-class classification (classification without labels: all data belong to the same class), for supervised binary classification and also for some regression problems. However, the algorithm is not designed for unsupervised learning in a multi-class area. NORMA is not usable with unlabelled data because it is not built with a similarity measure useful to cluster data.

– This technique also has limitations in a non-stationary environment. Online learning of real-drifting targets presents difficulties to NORMA. In Section 3.4, a study using a real-drifting cluster shows that NORMA produces too large a cumulated error. Furthermore, NORMA cannot deal with merging and splitting phenomena.

Let us now introduce our algorithm and describe its main new features for online clustering in a multi-class context.

3. Self-adaptive kernel machine

SAKM is developed as a new kernel-based algorithm to cluster online non-stationary data in a multi-class context (Amadou-Boubacar, 2006). In order to deal with the difficulties of unsupervised clustering, the SAKM network uses a feed-forward neural architecture. According to a new kernel-induced similarity measure, data are grouped in cluster models described by their density supports in RKHS. Evolving clusters are iteratively updated by incorporating new information via the SAKM update rules (variants of those of NORMA). Before presenting the SAKM, let us introduce the overall formulation of our problem. We use the temporal cluster C_m^t associated with the temporal function f_m^t to represent respectively the mth cluster and its boundary at time t. Let us define a set Ω^t of clusters in the multi-class data space χ associated with a set ℑ^t of temporal boundary functions at each time t:

$$\Omega^t = \{C_1^t, \ldots, C_m^t, \ldots, C_M^t\}, \qquad \Im^t = \{f_1^t, \ldots, f_m^t, \ldots, f_M^t\} \qquad (10)$$

so that:

$$\forall (C_m^t, f_m^t) \in \Omega^t \times \Im^t, \quad C_m^t = \{X \in \chi \,/\, f_m^t(X) \ge 0\}. \qquad (11)$$
M is the number of clusters, and it changes according to the appearance of new clusters, fusion and elimination. The mth cluster boundary function f_m^t is defined by the support vectors SV_{i,m}^t weighted by the non-zero parameters α_{i,m}^t and the offset ρ_m^t:

$$f_m^t(\cdot) = \sum_i \alpha_{i,m}^t \, \kappa(\cdot, SV_{i,m}^t) - \rho_m^t. \qquad (12)$$
To measure the performance of SAKM for online clustering in a non-stationary and multi-class area, a suitable objective function is introduced. It determines the average of cumulated errors over time and is defined as:

$$E_{learn}[\Im] = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{\mathrm{card}(C_m)} \sum_{X \in C_m} \xi(f_m, X) \qquad (13)$$
where ξ is the Hinge loss function presented in Eq. (8) and card(C_m) is the cardinality of C_m. This objective function is a suitable measure of performance since it quantifies the alert frequency (amount of errors) during the clustering. In fact, the objective function increases with the number of outliers. Hence, its minimization leads to an online clustering with good performance. Let us present the SAKM network and its adaptive architecture.

3.1. Neural architecture: SAKM network

The SAKM algorithm is designed with a feed-forward neural architecture (Fig. 3) combined with kernel-based functions. The SAKM network consists of three layers:
• The Input layer composed of D neurons that correspond to the input vector components: X = [x1 , . . . , xd , . . . , xD ]T . The size of this layer is then fixed by the input space dimension.
• The Hidden layer consists of the support vectors. Each neuron of this layer corresponds to a support vector SV_{i,m}. The size J of this layer corresponds to the number of support vectors and can be modified during training. The Input layer is connected to the Hidden layer with the matrix A, which memorizes the support vector locations. The output function of the hidden nodes is specific. Each node of the Hidden layer is connected to the other hidden nodes that correspond to the support vectors of the same cluster. A competition between these nodes allows activating only the winner support vectors SV_{win,m} of each cluster. The activation function is based on the kernel-induced similarity measure μ_{φ,m}^t presented in Section 3.2.
Fig. 3. Layout of the SAKM architecture.
• The Output layer consists of M neurons characterizing the clusters. The size of this layer can also change. Each neuron of this layer represents a cluster label C_m. The connections between the Hidden layer and the Output layer memorize the weights α_{i,m} of the support vectors in a matrix B = [b_{jm}]_{J×M}:

$$b_{jm} = \begin{cases} \alpha_{i,m}, & \text{if the } j\text{th hidden node corresponds to } SV_{i,m} \in C_m\\ 0, & \text{else,} \end{cases} \qquad j = 1, \ldots, J \qquad (14)$$
In addition, the Output layer is connected to a Bias node which memorizes the offset ρ_m of each cluster. The weights α_{i,m} and ρ_m are sequentially updated by using the recursive rules developed in Section 3.3.2. The following sections present the SAKM learning process and how the neural architecture (nodes and connections) presented in Fig. 3 can evolve according to the dynamic phenomena in a non-stationary environment. Before that, let us introduce the SAKM similarity measure which allows the online classification of data in an unsupervised way.
3.2. New kernel-induced similarity measure

In data classification, the implementation of a suitable similarity measure is an essential step to obtain reliable algorithms. For SVM methods, the widely used measure is the kernel similarity function directly induced in a Hilbert space. The decision rises automatically from the sign of this kernel-based function. In the context of online learning in a multi-class environment, the decision rule cannot be based on this kernel-induced measure. Indeed, the value of this measure for the same data can vary more or less according to the cluster training. In addition, the bounds of this measure are not well known. Therefore, we propose a new kernel similarity measure to compute the membership levels of a new sample to all created clusters in data space. Through the geometrical map of the data images in the inner-product Hilbert space Γ (Schölkopf & Smola, 2002), our main idea consists in computing the proximity level of the new sample to the nearest support vector of each cluster by using the kernel-induced metric (Borer, 2003). The new data belongs to a cluster if it is located inside the cluster boundary or if it is in the support vector set (Fig. 1). Consider a new sample X_t acquired at time t and suppose that SV_{win,m} is the winner support vector of the cluster C_m^t given by the evaluation of the kernel-induced metric, so that:

$$win = \arg\min_i \|\phi(X_t) - \phi(SV_{i,m})\|_\Gamma. \qquad (15)$$

Fig. 4. Similarity measure between a new data X_new and two existing clusters C_l (support ∆_l) and C_q (support ∆_q) using the kernel-induced metric in Hilbert space. Data are mapped with the RBF kernel in the Hilbert space Γ.

In the Hilbert space Γ with the RBF kernel, all data images are mapped onto a quadrant of a circle. The cluster boundaries are represented by linear hyperplanes (like ∆_l and ∆_q in the example shown in Fig. 4). Thus, it becomes simple to define a similarity measure. We introduce the kernel-induced similarity function μ_{φ,m}^t = μ_φ(X_t, C_m^t) to evaluate the distance of a new data X_t to each cluster C_m^t so that:

$$\mu_{\phi,m}^t = \frac{\delta}{\sqrt{2}} \|\phi(X_t) - \phi(SV_{win,m})\|_\Gamma = \delta \times \sqrt{1 - \kappa(X_t, SV_{win,m})} \qquad (16)$$

where the second equality follows since, for a normalized kernel, ‖φ(X_t) − φ(SV_{win,m})‖_Γ² = κ(X_t, X_t) − 2κ(X_t, SV_{win,m}) + κ(SV_{win,m}, SV_{win,m}) = 2(1 − κ(X_t, SV_{win,m})). Here δ is a sign function defined as:

$$\delta = \begin{cases} 1 & \text{if } f_m^t(X_t) < 0\\ 0 & \text{else.} \end{cases} \qquad (17)$$

By introducing the RBF kernel into Eq. (16), the kernel similarity function becomes:

$$\mu_{\phi,m}^t = \delta \times \sqrt{1 - \exp\left(-\lambda \|X_t - SV_{win,m}\|\right)}. \qquad (18)$$

This similarity function is strictly monotonic and bounded in the interval [0, 1]:

$$\|X_t - SV_{win,m}\| = 0 \;\Leftrightarrow\; \mu_{\phi,m}^t = 0, \qquad \|X_t - SV_{win,m}\| \to \infty \;\Leftrightarrow\; \mu_{\phi,m}^t \to \delta. \qquad (19)$$
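A minimal sketch of the similarity evaluation of Eqs. (15)-(18), assuming an RBF kernel and clusters stored as plain dictionaries of support vectors, weights and offset; the data structures and names are illustrative, not the authors' code. For the RBF kernel, the nearest support vector in feature space is also the nearest one in input space, which the sketch exploits.

```python
import numpy as np

def rbf(x1, x2, lam=1.0):
    # RBF kernel of Eq. (7)
    return np.exp(-lam * np.linalg.norm(np.asarray(x1) - np.asarray(x2)))

def boundary_value(x, cluster, lam=1.0):
    # f_m(x) = sum_i alpha_i * k(x, SV_i) - rho  (Eq. (12))
    return sum(a * rbf(x, s, lam) for a, s in zip(cluster["alpha"], cluster["sv"])) - cluster["rho"]

def similarity(x, cluster, lam=1.0):
    """Kernel-induced similarity mu of Eq. (18): 0 when x lies inside the cluster
    boundary (delta = 0), and delta * sqrt(1 - k(x, SV_win)) otherwise."""
    # winner support vector = nearest SV (Eq. (15)); for RBF this is the nearest SV in input space
    sv_win = min(cluster["sv"], key=lambda s: np.linalg.norm(np.asarray(x) - np.asarray(s)))
    delta = 1.0 if boundary_value(x, cluster, lam) < 0 else 0.0
    return delta * np.sqrt(1.0 - rbf(x, sv_win, lam))

# usage: membership is accepted when mu <= eps_th (Eq. (20)), with eps_th ~ 0.8
cluster = {"sv": [[0.0, 0.0], [1.0, 0.5]], "alpha": [0.6, 0.4], "rho": 0.3}
print(similarity([0.2, 0.1], cluster, lam=1.0))
```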
The kernel-induced similarity measure is the basis of the learning procedure of the SAKM network. Based on the evaluation of this measure, a criterion is defined to select the suitable learning procedure (creation, adaptation or fusion). This is described in the next section.

3.3. Online learning procedures

The SAKM learning procedures are described in four main stages:
• Creation procedure allowing insertion of new clusters with an adequate initialisation mechanism;
• Adaptation procedure based on the technique of stochastic gradient descent in Hilbert space. In this procedure a fast and efficient update rule is set to incorporate new information to evolving clusters. • Fusion procedure developed to handle mutual information between clusters that are similar to each other.
• Finally, an Elimination procedure is useful to eliminate non-representative clusters and thus ensure robustness in online clustering.

When a new data X_t is presented, the choice of the learning procedure (Creation, Adaptation or Fusion) is made by using the following criterion:

$$\Omega^{win} = \{C_m^t \in \Omega^t \,/\, \mu_\phi(X_t, C_m^t) \le \varepsilon_{th}\} \qquad (20)$$

where ε_th is an acceptance threshold. According to the different possible cases met with this criterion, the decision rule of the SAKM learning procedures is summarized in Table 2. This table gives for each case the suitable learning procedure. The fourth procedure (elimination) is achieved periodically, when a sufficient number of new data have been presented. The mechanisms of each procedure are described in the following sections (Sections 3.3.1–3.3.4).

Table 2
SAKM decision rules and procedures

  1st case   card(Ω^win) = 0    Initialisation–Creation
  2nd case   card(Ω^win) = 1    Adaptation
  3rd case   card(Ω^win) ≥ 2    Fusion

Remark on the acceptance threshold ε_th. Using the RBF kernel geometry in input space χ with the assumption that a cluster describes a continuous region in the feature space, this threshold is fixed (ε_th = 0.8). Motivation: consider a new data point not located inside the winner cluster; the RBF kernel geometry and the continuity constraint given above impose the condition (Fig. 5):

$$X_t \in C_m^t \quad \text{if } \|X_t - SV_{win,m}\| \le 2\sigma = \frac{1}{\lambda}. \qquad (21)$$

Fig. 5. In input space, the maximum distance between the data of each cluster must be less than twice the RBF radius to preserve continuity.

Under this condition, the threshold becomes, in RKHS: ε_th = √(1 − exp(−1)) = 0.79 ≈ 0.8.

3.3.1. Creation stage: Initialisation procedure

The SAKM algorithm is initialised from zero (pre-acquisition) hypotheses. Before the first acquisition, there is only the null function f_0 = 0 associated with the empty cluster C_0 (by convention) in the data space χ. The first data X_1 is then used to create the first non-empty cluster C_1. At time t = 1, the parameters of the initial cluster boundary function f_1 are initialised with:

$$f_1 \;\rightarrow\; \begin{cases} \alpha_{1,1} = 1\\ \rho_1 = \eta(1-\nu) \end{cases} \qquad (22)$$

where η is the learning rate (see Section 2.2.1) and ν is the margin fraction. At time t = 1, X_1 is then the unique point belonging to the cluster C_1, characterized by a support vector SV_{1,1} = X_1. In the same way, the SAKM architecture is initialised by the creation of the first hidden node and of the first output node representing the initial cluster.

Now, consider that the SAKM decision rule leads to case 1 (Table 2) at an unspecified time t. This implies a rejection in distance that is used to detect the appearance of a novelty in the multi-class environment. So, a new cluster C_{M+1} is inserted with a new support vector SV_{1,M+1} = X_t. Consequently, the sets Ω and ℑ are incremented:

$$\Omega^{t+1} = \Omega^t \cup \{C_{M+1}^t\}, \qquad \Im^{t+1} = \Im^t \cup \{f_{M+1}^t\}. \qquad (23)$$

The boundary function f_{M+1}^t of the new cluster is initialised by using the same initialisation as (22). The Hidden and Output layers of the SAKM network are also incremented by the insertion of new neurons.

3.3.2. Adaptation stage: New online update rule

In case 2 (Table 2), the new data is close enough to only one cluster C_m to contribute to its definition. Thus, the new information is used to refine the support model of the cluster C_m. That means the parameters (α_m, ρ_m) of the mth boundary function must be updated. To this end, we propose a novel online update rule combining the advantages of the online update rule of the NORMA algorithm and of the One-Class-SVM supervised estimator. Based on the technique of stochastic gradient descent, NORMA is set with a fast iterative update rule, while the One-Class-SVM gives the exact but slow solution by using the technique of Lagrangian optimization (Schölkopf & Smola, 2002). Our main idea consists in using the gradient descent technique in RKHS to compute the coefficients α_{i,m} of the boundary function f_m. Then, the obtained coefficients are normalized at each step to respect the Lagrangian dual constraint (Σ_i α_{i,m} = 1). Thereafter, the offset is recovered by using the hyperplane equation at a chosen support vector SV_{c,m} in Hilbert space (Fig. 1):

$$\Delta_m: f_m(SV_{c,m}) = 0 \;\Leftrightarrow\; \sum_i \alpha_{i,m}\,\kappa(SV_{c,m}, SV_{i,m}) - \rho_m = 0. \qquad (24)$$

According to the previous properties and using the penalty constant a = 1 in Eq. (4), the SAKM recursive equations are introduced in this adaptation stage. These equations allow updating iteratively the parameters of the boundary function of the winner cluster so that:

$$\begin{aligned}
\alpha_{i,m}^{t+1} &= (1-\eta)\,\alpha_{i,m}^t, && t-\tau < i < t,\\
\alpha_{t,m}^{t+1} &= 0 \ [\text{resp. } \eta], && \text{if } f_m^t(X_t) \ge 0 \ [\text{resp. } < 0],\\
\text{then } \alpha_{i,m}^{t+1} &\leftarrow \frac{\alpha_{i,m}^{t+1}}{\sum_i \alpha_{i,m}^{t+1}},\\
\rho_m^{t+1} &= \sum_{i=\max(1,\,t-\tau)}^{t} \alpha_{i,m}^{t+1}\,\kappa(SV_{c,m}, SV_{i,m}^t)
\end{aligned} \qquad (25)$$

where the α_{i,m} are normalized. In practice, c is chosen as the median index of the support vectors SV_{i,m} for a better estimation. The kernel expansion is truncated to τ terms to allow learning clusters on a sliding exponential window and to forget obsolete information. The benefit of the truncation also includes limiting the amount of computation. When the acquired data falls inside the cluster, no computation is done; otherwise, the cluster boundary is updated and a new support vector is added so that SV_{i+1,m} = X_t.
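A minimal sketch of the adaptation update (25), reusing the rbf and boundary_value helpers from the similarity sketch above; the normalization step and the offset recovery at a chosen support vector follow the equations, while the names, the truncation handling and the choice of the median support vector index are illustrative assumptions.

```python
def sakm_adapt(cluster, x, eta=0.1, lam=1.0, tau=30):
    """One SAKM adaptation step (Eq. (25)) on a cluster dict {"sv", "alpha", "rho"}."""
    f_x = boundary_value(x, cluster, lam)          # f_m^t(x), see Eq. (12)
    if f_x >= 0:
        return cluster                             # data inside the cluster: no computation is done
    # data outside the boundary: decay the old weights and add x as a new SV with weight eta
    sv = list(cluster["sv"]) + [x]
    alpha = [(1.0 - eta) * a for a in cluster["alpha"]] + [eta]
    # truncate the kernel expansion to the tau most recent terms
    sv, alpha = sv[-tau:], alpha[-tau:]
    # normalize the weights to respect the dual constraint sum_i alpha_i = 1
    total = sum(alpha)
    alpha = [a / total for a in alpha]
    # recover the offset from the hyperplane equation at a chosen SV (Eq. (24));
    # the median-index support vector is used, as suggested in the text
    c = len(sv) // 2
    rho = sum(a * rbf(sv[c], s, lam) for a, s in zip(alpha, sv))
    cluster.update(sv=sv, alpha=alpha, rho=rho)
    return cluster
```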
3.3.3. Fusion stage

Case 3 (Table 2) presents an ambiguity situation where a data point is shared by two or several clusters. In this situation, the SAKM algorithm detects mutual information shared by those clusters. When the number of ambiguous data exceeds a threshold A (set for noise insensitivity), those clusters are merged into a unique one. Under the assumption that different clusters are disjoint in a multi-class
area, the fusion procedure is justified to preserve the continuity of the domain described by each cluster C_m. Consider the set Ω^win of winners given by the kernel similarity criterion (20); all winner clusters C_win are replaced by the merged cluster C_merg so that:

$$C_{merg} = \{X \in (\cup\, C_{win}) \,/\, f_{merg}(X) \ge 0\}. \qquad (26)$$

f_merg is computed with all the data available in the winner clusters by applying the SAKM update rule (25) iteratively. The cluster set Ω and the boundary function set ℑ are then modified:

$$\Omega = \Big(\Omega - \bigcup_{win} C_{win}\Big) \cup \{C_{merg}\}, \qquad \Im = \Big(\Im - \bigcup_{win} f_{win}\Big) \cup \{f_{merg}\}. \qquad (27)$$

Thereafter, ambiguous data are assigned to the new merged cluster C_merg. The SAKM network is then modified: all corresponding output neurons are replaced by a unique neuron and the hidden layer is also adapted according to the new support vectors of the merged cluster.

3.3.4. Elimination stage

This stage is useful to eliminate clusters that are eventually created because of the noise effect in non-stationary data. To this end, the SAKM algorithm analyzes the cardinality of all the clusters from their creation. The clusters C_weak with very few data compared to a threshold N_c are deleted after each time period T. So,

$$\Omega = \Omega - \bigcup_{weak} C_{weak}, \qquad \Im = \Im - \bigcup_{weak} f_{weak}, \qquad (28)$$

and their representing neurons are removed from the Output layer. On the Hidden layer, all nodes corresponding to the support vectors of eliminated clusters are also removed. Finally, the SAKM algorithm is summarized in the following pseudo-program:

Algorithm: Online clustering with SAKM
  Require: Online data source X := X_t
  Require: Parameters λ, η, ν, ε_th
  Require: Thresholds τ, A, N_c, T
  Initialise: t = 0; f_0 := f_0^t = 0; C_0 := C_0^t = ∅
  Loop
    Acquisition X_t
    Evaluate kernel similarity function: μ_{φ,m}^t
    Kernel similarity criterion: determine Ω^win
    Case 1: card(Ω^win) = 0  →  Creation procedure
    Case 2: card(Ω^win) = 1  →  Adaptation procedure
    Case 3: card(Ω^win) ≥ 2  →  Fusion procedure
    If t = k · T (k ∈ ℕ) then Elimination procedure End If
  End Loop
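The pseudo-program above can be read as the following minimal Python sketch, reusing the similarity and sakm_adapt helpers defined earlier. It is a sketch under stated assumptions, not the authors' exact procedures: the ambiguity-count threshold A is omitted, the fusion body is a simplified placeholder that re-learns a single boundary from the winners' support vectors, and elimination uses a simple per-cluster counter as the cardinality criterion.

```python
def sakm_online_clustering(stream, eta=0.1, nu=0.3, lam=1.0, tau=30,
                           eps_th=0.8, Nc=5, T=200):
    """Skeleton of the SAKM loop: creation / adaptation / fusion / elimination."""
    clusters = []                                   # each cluster: {"sv", "alpha", "rho", "count"}
    for t, x in enumerate(stream, start=1):
        winners = [c for c in clusters if similarity(x, c, lam) <= eps_th]   # Eq. (20)
        if not winners:                             # case 1: creation, initialised as in Eq. (22)
            clusters.append({"sv": [x], "alpha": [1.0], "rho": eta * (1.0 - nu), "count": 1})
        elif len(winners) == 1:                     # case 2: adaptation with the update rule (25)
            sakm_adapt(winners[0], x, eta, lam, tau)
            winners[0]["count"] += 1
        else:                                       # case 3: fusion (simplified placeholder)
            merged = winners[0]                     # keep the first winner and feed it the others' SVs
            for other in winners[1:]:
                clusters.remove(other)
                merged["count"] += other["count"]
                for s in other["sv"]:
                    sakm_adapt(merged, s, eta, lam, tau)
            sakm_adapt(merged, x, eta, lam, tau)
        if t % T == 0:                              # periodic elimination of weak clusters
            clusters = [c for c in clusters if c["count"] >= Nc]
    return clusters
```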
3.4. Performance of the SAKM update rule

This section presents a study of the performance of our online classifier update rule by considering a local iterative learning of a single-class model. Before comparing the SAKM update rule with NORMA and Gentile's ALMA (Gentile, 2001), let us give a theoretical analysis of the error convergence of the SAKM local learning method.

3.4.1. Convergence bounds of SAKM

To show the convergence of SAKM, we can proceed in two ways. At first, we propose deducing the convergence of the SAKM update rule by using the NORMA convergence theorem (Kivinen et al., 2004) and the principle of risk minimization (Section 3.4.1.1). In Section 3.4.1.2, we conduct a theoretical study of the SAKM objective function, which is in fact the average of cumulated errors. The aim of this study is to prove the convergence of this instantaneous error and to give its limit bound at infinity.

3.4.1.1. Bound of the average of the cumulated error. Inspired by the NORMA algorithm, the recursive learning procedure of the SAKM algorithm is based on Instantaneous risk minimization in the kernel Hilbert space. Consider the training set S_l = {X_1, ..., X_t, ..., X_l} (where l is the length of the training set S_l) and the sequence of learning functions (f^t)_{t=1,...,l} on the set S_l. Suppose f^t is the function learnt sequentially using the training subset available at the previous time t − 1 and g is the best learning function on the set S_l. From the Instantaneous risk equation (4), the average of the cumulated error can be written so that:

$$E_{learn}[f^t]_{t=l} = \frac{1}{l}\sum_{t=1}^{l} \xi(f^t, X_t) = \frac{1}{l}\sum_{t=1}^{l} \left[ R_{inst}[f^t, X_t] - \frac{a}{2}\|f^t\|_\Gamma^2 \right]. \qquad (29)$$

Theorem 1 (Kivinen et al., 2004). Under some mild assumptions:
(1) the kernel function κ(·, ·) is bounded, so that ∃K ∈ ℝ+*, ∀X, Y ∈ χ, κ(X, Y) ≤ K²;
(2) the loss function ξ is convex and satisfies a Lipschitz condition: ∃c ∈ ℝ+*, |ξ(Z_1, X) − ξ(Z_2, X)| ≤ c|Z_2 − Z_1|;
(3) the offset and the stochastic risk are bounded on the training dataset S_l, such that |ρ| < b and R_inst[f^t, X_t] ≤ L;
Kivinen et al. (2004) show that the average of the instantaneous risk of the learning functions produced by NORMA converges toward the minimal regularized risk at rate O(l^{−1/2}), so that:

$$\frac{1}{l}\sum_{t=1}^{l} R_{inst}[f^t, X_t] \;\le\; R_{reg}[g, S_l] + \frac{a + L\left(2\ln\frac{1}{\delta}\right)^{1/2}}{l^{1/2}} + \frac{b}{l} \qquad (30)$$

with probability at least 1 − δ over random draws of S_l. Under these conditions, the instantaneous risk produced by the update rule of gradient descent in Hilbert space will be close to the Vapnik expected risk with high probability. This theorem is proved for NORMA.

Corollary 1 (Bound of the Objective Function). According to the previous conditions and Eqs. (29) and (30), the average of the cumulated error of the sequences produced by gradient descent in RKHS is bounded so that (Appendix A):

$$E_{learn}[f^t]_{t=l} \;\le\; \frac{1}{l^{1/2}}\left(a + L\left(2\ln\frac{1}{\delta}\right)^{1/2}\right) + \frac{b}{l} + R_{reg}[g, S_l] - \frac{a}{2l}\sum_{t=1}^{l}\|f^t\|_\Gamma^2 \qquad (31)$$
with probability at least 1 − δ over random draws of S_l. More precisely, in the following sections, we will prove that the average of the SAKM cumulated error does not diverge but converges toward a finite limit and is bounded at any time.
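For completeness, the step from (29) and (30) to the bound (31) can be sketched in one line (the detailed proof is given in Appendix A):

$$E_{learn}[f^t]_{t=l} = \frac{1}{l}\sum_{t=1}^{l} R_{inst}[f^t, X_t] - \frac{a}{2l}\sum_{t=1}^{l}\|f^t\|_\Gamma^2 \;\le\; R_{reg}[g, S_l] + \frac{a + L\left(2\ln\frac{1}{\delta}\right)^{1/2}}{l^{1/2}} + \frac{b}{l} - \frac{a}{2l}\sum_{t=1}^{l}\|f^t\|_\Gamma^2,$$

where the equality is (29) and the inequality applies the NORMA bound (30) to the first term; the right-hand side is exactly the bound (31), holding with probability at least 1 − δ.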
3.4.1.2. Convergence of the SAKM update rule. To show the convergence of the SAKM update rule, our aim is to prove that the average of the cumulated error E_learn converges toward a unique and finite limit¹ ℓ = E_learn[f^t]_{t→∞} at infinity. To this end, we proceed in two steps as follows:
Unique limit: through Theorem 2 (Section 3.4.1.2.1), we show the uniqueness of this limit ℓ;
Finite limit: Theorem 3 (Section 3.4.1.2.2) gives the bounds of the limit ℓ.
At first, let us determine the bounds of the kernel-based boundary function f(·) and the bounds of the loss function ξ(f, ·). These intermediary results are useful to prove the convergence of the SAKM update rule.

3.4.1.2.1. Determination of the bounds of the kernel-based boundary function and of the loss function. Consider the learning of a single-class model; the boundary function is defined by f(·) = Σ_i α_i κ(·, SV_i) − ρ.

Lemma 1. Consider Up_ker and Lo_ker, respectively the upper and lower bounds of the kernel, with Up_ker ≥ 0 and Lo_ker ≥ 0 because κ(·, ·) = ⟨φ(·), φ(·)⟩_Γ ≥ 0 is a necessary condition for the dot product. By using the relation Lo_ker ≤ κ(·, ·) ≤ Up_ker and the property that the coefficients α_i are normalized (Σ_i α_i = 1), the kernel-based boundary function f(·) is bounded so that (see Appendix B for the proof):

$$Lo_{ker} - Up_{ker} \le f(\cdot) \le Up_{ker} - Lo_{ker}. \qquad (32)$$
For example, in the case of the RBF kernel: 0 ≤ κ(·, ·) ≤ 1 ⇔ −1 ≤ f(·) ≤ 1. Hence, the cluster boundary function produced by the SAKM local update rule does not diverge. Let us now determine the local bounds of the loss function ξ(f, ·) used to define the SAKM objective function (i.e. the average of the cumulated error).

Lemma 2. By using the kernel bounds and Lemma 1, the loss function ξ(f, X) = max(0, −f) − νρ, locally estimated on a data point X, is bounded:

$$-\nu \cdot Up_{ker} \le \xi(f, X) \le Up_{ker} - (1+\nu)\cdot Lo_{ker}, \qquad 0 < \nu < 1. \qquad (33)$$
Both previous lemmas, easily proved in Appendix B, are useful to study the SAKM convergence properties.

3.4.1.2.2. Local study of the error convergence of the SAKM update rule. In this subsection, a theoretical study of the error convergence of the SAKM update rule is given. To this end, we firstly show that the average of the SAKM cumulated error tends toward a unique limit ℓ at infinity. Then, by bounding the average of the cumulated error, we demonstrate that the limit ℓ is finite.

Proposition 1 (Bounds of ∂E_learn[f^t]/∂t). Consider the derivative of the SAKM objective function written so that:

$$\frac{\partial}{\partial t} E_{learn}[f^t] = \Delta E_{learn}[f^t] = E_{learn}[f^t] - E_{learn}[f^{t-1}]. \qquad (34)$$
Using the result of Lemma 2 (Appendix B), the derivative of E_learn[f^t] = (1/t) Σ_{i=1}^{t} ξ(f^i, X_i) can be bounded at any time t by:

$$\frac{1}{t}\,(1+\nu)\,(Lo_{ker} - Up_{ker}) \;\le\; \frac{\partial}{\partial t} E_{learn}[f^t] \;\le\; \frac{1}{t}\,(1+\nu)\,(Up_{ker} - Lo_{ker}). \qquad (35)$$
¹ If this limit ℓ = 0, the update rule is called an exact update rule (Amadou-Boubacar, 2006).
Theorem 2 (Unique Limit ℓ = E_learn[f^t]_{t→∞}). The limits of the bounds (35) imply the following result:

$$\lim_{t\to\infty} \frac{\partial}{\partial t} E_{learn}[f^t] = \lim_{t\to\infty} \left( E_{learn}[f^t] - E_{learn}[f^{t-1}] \right) = 0. \qquad (36)$$
According to the previous result, the average of the cumulated error E_learn[f^t] no longer varies at infinity and thus tends to a limit ℓ. Hence, the limit ℓ = E_learn[f^t]_{t→∞} is unique.

Theorem 3 (Finite Limit ℓ = E_learn[f^t]_{t→∞}). The average of the cumulated error of the SAKM local update rule is bounded at any time so that:

$$-\nu \cdot Up_{ker} \le E_{learn}[f^t] \le Up_{ker} - (1+\nu)\cdot Lo_{ker}, \qquad 0 < \nu < 1. \qquad (37)$$
This result, proved in Appendix B, is a corollary of Lemma 2. The choice of the RBF kernel gives −ν ≤ E_learn[f^t] ≤ 1, ∀t ≥ 1. Condition (37) is not verified for the NORMA and ALMA algorithms. Indeed, their offsets are not bounded and can take very high values after a long training time. This is a drawback in the context of learning real-drifting targets (see the illustration in Section 3.3.2).

Theorem 4 (Convergence of the SAKM Update Rule). The average of the cumulated error produced by the SAKM update rule is constant at infinity (Theorem 2) and is bounded at any time (Theorem 3). According to properties (36) and (37), the SAKM update rule converges toward a finite limit so that (Appendix B):

$$E_{learn}[f^t] \underset{t\to\infty}{\longrightarrow} \ell \in \left[ -\nu \cdot Up_{ker},\; Up_{ker} - (1+\nu)\cdot Lo_{ker} \right] \qquad \text{(e.g., for the RBF kernel } \ell \in [-\nu, 1]\text{)}. \qquad (38)$$
Section 3.3.3 gives an illustration that shows the good convergence properties of the SAKM update rule.

3.4.1.2.3. Convergence by error bound generalization in a multi-class environment. Consider the set ℑ^t = {f_1^t, ..., f_m^t, ..., f_M^t} of learning functions representing the boundaries of M (finite) clusters; the previous convergence results can be extended to the multi-class case, provided only the SAKM update rule is considered.

Corollary 2 (Generalization in a Multi-Class Environment). Under the assumption of learning without merging and splitting phenomena, the average of the global error made by SAKM in a multi-class area converges toward a finite limit ℓ_ℑ at infinity and is bounded so that:

$$E_{learn}[\Im^t] \underset{t\to\infty}{\longrightarrow} \ell_\Im \in \left[ -\nu \cdot M \cdot Up_{ker},\; M\cdot\left(Up_{ker} - (1+\nu)\cdot Lo_{ker}\right) \right]. \qquad (39)$$

Using the RBF kernel, the convergence bounds are E_learn[ℑ^t] → ℓ_ℑ ∈ [−ν · M, M] as t → ∞.
3.4.2. Comparison of the NORMA, ALMA and SAKM update rules: Tests on real drifting of a single class

Many tests on non-stationary data show that the NORMA and ALMA algorithms are less adapted to online learning with a changing target (fast drifting model). Using these algorithms, the offset parameter ρ diverges over time and all new data then become outliers after a short training time. This phenomenon is undesirable. In fact, it causes an abnormal increase of the cumulated error (see Eq. (8)). The SAKM update rule overcomes this drawback by evaluating the offset ρ following the same principle as the One-Class-SVM, with normalized parameters. Table 3 gives a comparison of the NORMA and ALMA algorithms with the SAKM update rule according to the simulation of Fig. 6. In this table, four indicators are highlighted:
Fig. 6. Incremental learning of an evolving Gaussian distribution using the NORMA, ALMA and SAKM algorithms. Axes represent the feature space coordinates. Dots represent the support vectors. Simulation on a drifting Gaussian density of 1000 data generated according to: t, a uniform distribution in [0, 1]; rnd, iid N(0, 1), so that [Drift_Gauss] = [3t, −5 + 3√t]^T + 0.7[rnd, rnd]^T. Parameters: λ = 1, η = 0.1, ν = 0.3, τ = 15, ε_th = 0.8, N_c = 5, T = 200.
– The offset ρ(t) given by each algorithm and the kernel-based similarity function f^t(X_t) show the behaviour of the cluster boundary function.
– The cumulated loss L_cum[f^t] and the objective function E_learn[f^t], as suitable performance measures, allow appreciating the amount of error obtained during the online clustering.

3.4.3. Discussions

Let us now discuss the results provided by the three algorithms in Table 3; in particular, we give an explanation of the behaviour of the NORMA and ALMA cumulated losses compared to that of SAKM. In this context of a real-drifting model, the NORMA and ALMA algorithms detect many outlier vectors (frequent alerts), which rapidly increases their cumulated loss. In addition, the offset ρ grows quickly and excessively over time. These two phenomena cause a very significant increase of the amount of loss in situations of evolving cluster learning (see Fig. 6 and Table 3). On the other hand, in the SAKM update rule, the offset ρ computed online remains small over time. This property implies that the algorithm provides more margin supports at the class boundary: its behaviour is similar to that of a One-Class-SVM with a good approximate solution. Therefore, a great deal of data is classified inside the class as non-margin support vectors (with zero error). Consequently, the cumulated error remains small, and so does the objective function. Fig. 7 shows that the objective function of the SAKM algorithm decreases over time, contrary to those given by the NORMA and ALMA algorithms.
Fig. 7. Offset and objective function as functions of time using the ALMA, NORMA and SAKM algorithms on a real-drifting single class (simulation of Fig. 6).

Table 3
NORMA, ALMA and SAKM online clustering results

  Measures          Time    ALMA         NORMA        SAKM
  ρ(t)              100     5.61         5.57         0.17
                    500     33.60        33.52        0.16
                    1000    68.61        68.75        0.14
  f^t(X_t)          100     −4.81        −5.14        0.09
                    500     −32.66       −33.13       −0.05
                    1000    −67.53       −68.09       0.23
  L_cum(f^t)        100     −180.89      −205.29      −9.39
                    500     −7 702.10    −7 870.40    −11.70
                    1000    −33 861.01   −33 205.41   −14.23
  E_learn(f^t)      100     1.80         2.05         0.09
                    500     15.40        15.74        0.02
                    1000    32.86        33.20        0.014

Simulation on data presented in Fig. 6.
Under normal conditions (without fusion and elimination procedures), the average of the cumulated error of the SAKM update rule, tested locally on a real-drifting single class, behaves like a decreasing function and converges toward a finite limit after a long acquisition time. In conclusion, the practical results of the SAKM update rule illustrate its good performance in terms of error convergence. However, the SAKM update rule is slightly more expensive in complexity than the NORMA or ALMA algorithms. Indeed, a supplementary computation is done at each iteration to normalize the weights α_i, and the estimation of the offset ρ (a linear combination of kernels) is more expensive with the SAKM learning procedure than with NORMA and ALMA. On the other hand, the SAKM update rule is very simple and far more computationally efficient than the incremental SVM based on adiabatic increments (Cauwenberghs & Poggio, 2000; Gretton & Desobry, 2003).

4. Experimental results

In this section, some experimental results are presented to illustrate the performance of the SAKM algorithm in the context of non-stationary data clustering in a multi-class environment. Three subsections are presented. The first one is dedicated to the classification of synthetic data, designed in
order to illustrate the ability to model the decision space online. The second gives, on referenced UCI datasets, a comparison of the performance of the SAKM network with a closely related technique, the incremental SVM. The last experiments give a short practical point of view on a real application.

4.1. Artificial data

4.1.1. Simulation 1: Cluster creation and evolution

In the first simulation (Fig. 8a), data are created from two Gaussian distributions evolving over time with changing means and increasing dispersal (Lecoeuche & Lurette, 2003). Considering a uniform distribution k in the interval [1, N], non-stationary data are generated using the equations:

$$[Ev\_Gauss\,1] = \begin{bmatrix} 4\cos(\beta_1(k)) + 6\\ 6\sin(\beta_1(k)) + 2.6 \end{bmatrix} \cdot 1_n + Mrnd(k) \qquad (40)$$

$$[Ev\_Gauss\,2] = \begin{bmatrix} 5\cos(\beta_2(k)) + 4.2\\ 5\sin(\beta_2(k)) + 6.3 \end{bmatrix} \cdot 1_n + Mrnd(k). \qquad (41)$$
Mrnd(k) is a random matrix of n vectors in 2D space generated at step k (using the Matlab function mvnrnd). The dispersal S(k) of the random matrix Mrnd(k) and the arguments β_1 and β_2 change at each step k according to:

$$S(k) = S(k-1) + \frac{S_{end} - S_{ini}}{N-1}, \qquad \beta(k) = \beta(k-1) + \frac{\beta_{end} - \beta_{ini}}{N-1}, \qquad k > 1. \qquad (42)$$
The equation parameters are set for this test as: N = 100, n = 10, S_ini = 0.01·I_D, S_end = 0.03·I_D, β_{1,ini} = π, β_{2,ini} = 0, β_{1,end} = β_{2,end} = π/2. In these conditions, the chosen equations give 2000 time-varying data X_t sequentially generated to draw two evolving Gaussian distributions (Fig. 8a). Using the SAKM algorithm, the initialisation of each cluster model is done with the first data (Fig. 8b). Then, the created cluster models grow and evolve progressively according to the data distribution drift, thanks to the update learning rule (Fig. 8c–f).
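A minimal sketch of the data generation of Eqs. (40)-(42) with numpy instead of Matlab's mvnrnd; the parameter values follow the ones listed above, and the interpretation of S(k) as the covariance passed to the multivariate normal is an assumption.

```python
import numpy as np

def simulation1(N=100, n=10, S_ini=0.01, S_end=0.03,
                beta1_ini=np.pi, beta2_ini=0.0, beta_end=np.pi / 2, seed=0):
    """Generate the two evolving Gaussian distributions of Eqs. (40)-(42)."""
    rng = np.random.default_rng(seed)
    S, b1, b2 = S_ini, beta1_ini, beta2_ini
    data = []
    for k in range(1, N + 1):
        # Eq. (42): linear drift of the dispersal and the angles over the N steps
        if k > 1:
            S += (S_end - S_ini) / (N - 1)
            b1 += (beta_end - beta1_ini) / (N - 1)
            b2 += (beta_end - beta2_ini) / (N - 1)
        cov = S * np.eye(2)
        # Eq. (40): first evolving Gaussian, n points around a rotating centre
        c1 = np.array([4 * np.cos(b1) + 6, 6 * np.sin(b1) + 2.6])
        # Eq. (41): second evolving Gaussian
        c2 = np.array([5 * np.cos(b2) + 4.2, 5 * np.sin(b2) + 6.3])
        data.append(rng.multivariate_normal(c1, cov, size=n))
        data.append(rng.multivariate_normal(c2, cov, size=n))
    return np.vstack(data)      # 2 * N * n = 2000 points, as in the text

X = simulation1()
print(X.shape)                  # (2000, 2)
```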
Fig. 8. Online clustering with SAKM on two evolving clusters. Axes represent the feature space coordinates. Non-stationary data created with two Gaussian distributions. The algorithm takes model changes over time into account thanks to its self-adaptive abilities. Algorithm parameters: λ = 1, η = 0.1, ν = 0.3, τ = 30, ε_th, N_c = 5, T = 200.

4.1.2. Simulation 2: Online learning of non-stationary data: Creation–Adaptation–Elimination–Fusion

In this second simulation experiment, the SAKM algorithm is tested on non-stationary and noisy data. To this end, the non-stationary data are generated in a multi-class area by using:

– One static cluster consisting of two 200-data distributions with dispersal Σ and center C:

$$St\_Gauss_{11} = Mrnd\left(C_{11} = [5, 8]^T,\; \Sigma_{11} = \begin{pmatrix} 0.4 & 0\\ 0 & 2 \end{pmatrix}\right), \qquad St\_Gauss_{12} = Mrnd\left(C_{12} = [8, 5]^T,\; \Sigma_{12} = \begin{pmatrix} 2 & 0\\ 0 & 0.4 \end{pmatrix}\right). \qquad (43)$$

– Two evolving clusters, drawn from drifting Gaussian distributions of 1200 data, that come to merge:

$$[Ev\_Gauss\,1] = \begin{bmatrix} -12 + 10t\\ +4 + 5\sin(1.3\pi t) \end{bmatrix} + 2.8\begin{bmatrix} rnd_1(t)\\ rnd_2(t) \end{bmatrix} \qquad (44)$$

$$[Ev\_Gauss\,2] = \begin{bmatrix} +8 - 10t\\ -4 - 5\sin(1.3\pi t) \end{bmatrix} + 2.8\begin{bmatrix} rnd_1(t)\\ rnd_2(t) \end{bmatrix}. \qquad (45)$$

– And some isolated data created using a random distribution with Σ = 0.4 · I_2 around the center [−7, −5]^T.

Data are generated sequentially in the same environment to draw both static and drifting targets (Fig. 9a). At the first data acquisitions, the SAKM algorithm creates 4 kernel models to represent the clusters (Fig. 9b). Then, the static cluster C1 is adapted progressively according to the data acquisition, while the evolving clusters C2 and C3 are updated over time to follow the distribution drifts (Fig. 9c–e). During the same time, no other data fell into the small cluster created by the initial noisy data, so this cluster is not updated during a period T. As its cardinality stays less than the threshold N_c, the SAKM elimination procedure deletes it (Fig. 9f). Thereafter, when the evolving clusters C2 and C3 are close enough, they are merged into one (Fig. 9g). The SAKM cluster fusion procedure provides a good result by circumventing the local optimum drawbacks of overlapping. Finally, the SAKM algorithm continuously adapts the remaining clusters (Fig. 9h).

Through these experiments, Fig. 10(1.a) and Fig. 10(2.a) illustrate the behaviour of the cumulated error of the algorithm. Fig. 10(1.b) and Fig. 10(2.b) show that the objective function does not diverge but rather decreases over time and tends to a finite limit. This behaviour of the objective function confirms (again) the good performance of the SAKM online classifier with regard to error convergence. Nevertheless, when fusion or elimination of clusters occurs, one or more clusters disappear but their cumulated training errors remain. So, due to this phenomenon, the objective function grows before decreasing again. The SAKM algorithm has good capacities for online clustering of non-stationary data. Through the experiments, the algorithm trains kernel cluster models in a multi-class area by using its unsupervised learning procedures. It has self-adaptive abilities to learn evolving clusters iteratively and to take changes of real drifting targets into account.

4.2. Benchmark data

In this section, the proposed SAKM is evaluated through a comparison with the incremental SVM classifier (Cauwenberghs & Poggio, 2000). Giving such a comparison is quite ambitious due to the specificity of the SAKM: to our knowledge, it is the unique SVM technique dedicated to online clustering of non-stationary data. This section highlights that, although the SAKM technique performs continuous learning, its performance is not too far from that given by an incremental technique (Cauwenberghs & Poggio, 2000). We selected three datasets from the UCI Machine Learning Repository (Blake & Merz, 1998) (see Table 4), which have various numbers of data samples and different dimensions, in order to give a good overview of the performance. These three sets are limited to a two-class classification task due to the limitation of the incremental SVM technique.

Table 4
Evaluated UCI datasets

  Benchmarks         Prima_diabetes   Breast cancers   Ionosphere
  Dimension          8                9                34
  Nb data            768              699              351
  Nb training data   512              466              234
  Nb test data       256              233              117
Fig. 9. Online clustering of non-stationary and noisy data in a multi-class environment: (a) non-stationary data; (b) SAKM cluster creation (t1); (c)–(e) adaptation and evolution (t2–t4); (f) noisy cluster elimination (t5); (g) cluster fusion process (t6); (h) cluster adaptation (t7). Axes represent the feature space coordinates. The SAKM algorithm takes model changes over time into account with its self-adaptive abilities, including fusion and elimination. Algorithm parameters: λ = 1, η = 0.1, ν = 0.3, τ = 30, ε_th = 0.8, N_c = 5, T = 2000.
Fig. 10. Evolution of the cumulated loss (a) and the objective function (b), as functions of time, for the online learning of non-stationary data: simulation 1 (Fig. 8) and simulation 2 (Fig. 9) using SAKM.
Table 5 shows the experimental results of the two tested algorithms. As seen from Table 5, the classification accuracy of the SAKM is slightly inferior to that of the incremental SVM, the latter being clearly more adapted to this supervised classification task. Obviously, given the SAKM design, its computational cost is lower and the number of support vectors is reduced. For example, in the second case, which is the least favourable for the SAKM, the number of support vectors is divided by about 2.3 while the classification accuracy loses 6%. From these results, we simply note that the SAKM network has better abilities for online learning: it is faster and needs fewer structural parameters.
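The trade-off discussed above can be recomputed directly from the figures of Table 5; the snippet below simply derives, for each benchmark, the accuracy loss, the CPU speed-up and the support-vector reduction of SAKM with respect to the incremental SVM.

```python
# (accuracy %, CPU time s, number of SV) taken from Table 5: (incremental SVM, SAKM)
results = {
    "Pima_diabetes": ((81.2, 11.55, 310), (78.6, 6.42, 61)),
    "Breast cancer": ((98.70, 17.34, 122), (92.75, 6.36, 53)),
    "Ionosphere":    ((95.70, 9.94, 127), (86.10, 2.41, 40)),
}
for name, (svm, sakm) in results.items():
    acc_loss = svm[0] - sakm[0]        # accuracy drop in points
    speedup = svm[1] / sakm[1]         # CPU time ratio
    sv_ratio = svm[2] / sakm[2]        # support-vector reduction factor
    print(f"{name:14s} accuracy -{acc_loss:.1f} pts, CPU x{speedup:.1f}, SV /{sv_ratio:.1f}")
```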
Fig. 11. Drift detection based on the SAKM network. The axes of the (a.1, b.1, c.1) subplots represent the feature space coordinates; the axes of the (a.2, b.2, c.2) subplots represent the evolution of the drift indicator over the sample time. Subset (a) is dedicated to a progressive fouling of the heater, subset (b) to that of the exchanger and subset (c) to that of the filter. In each case, the current functioning mode is adapted online according to the evolution of the measurements by means of the SAKM network. Its parameters are set to: λ = 1; η = 0.15; τ = 50; Namb = 10; Nmin = 50; T = 100.
Table 5
Experimental results: Incremental SVM vs. SAKM on UCI benchmark datasets

Benchmarks            Pima_diabetes          Breast cancer          Ionosphere
Algorithms            Incr. SVM    SAKM      Incr. SVM    SAKM      Incr. SVM    SAKM
Accuracy              81.2%        78.6%     98.70%       92.75%    95.70%       86.10%
CPU time (s)          11.55        6.42      17.34        6.36      9.94         2.41
Nb support vectors    310          61        122          53        127          40
A practical limitation also appears: the SAKM gives lower classification accuracy when the dimension of the data increases. In fact, owing to its gradient-based adaptation rules, the SAKM is more sensitive to the curse of dimensionality.

4.3. Validations on real data: System monitoring

This last validation section concerns the classification of real data. The results presented here are part of a larger work dealing with industrial system monitoring. We propose a supervision architecture based on a pattern recognition approach. The important point is that the functioning modes of any real system evolve in time according to its natural evolutions. The key idea is to use our dynamical classifier in order to keep an up-to-date decision space: the classifier continuously models the functioning modes corresponding to the current functioning states of the system. Our objective is to improve failure detection by using non-static references (unlike those commonly used in supervision tools), based on up-to-date models. The data presented here correspond to the functioning evolutions of a WCTC (Water Circulating Temperature Controller) when fouling occurs.² The aim of the results presented hereafter is to alert the user before any significant degradation of the capabilities of the WCTC occurs.
2 This work has been partially sponsored by the French agency for energy saving and environment (ADEME), under the contract 02 74 062.
The data used here have been estimated from raw measurements (flow, pressure, temperature) using identification techniques. Fig. 11 illustrates the monitoring principle. From lab experiments, it is easy to build a decision space that characterizes the fouling of the main components of the WCTC. In Fig. 11 (a.1, b.1 and c.1, where the axes represent the ratios of different component pressure values), three fault modes are known (MH, ME, MF). Each mode, one for each of the three main components (heater, exchanger and filter), corresponds to a fouling that causes a loss of flow rate of 20%. From this initial knowledge, the experiments consist in inducing a fouling (simulated by a manual valve) and tracking the evolution of the current mode MCF towards the failure modes. The upper subfigures show, for a progressive fouling of the three main components, the evolution of the measurements (black dots) and the final location (at time 2600) of the updated current mode (see the MCF arrows). Following the slight deviations, the mode moves to track the data localisation. To fix the SAKM parameters, mainly λ and τ, which depend on the application, it is obviously necessary to introduce some a priori knowledge such as the noise acting on the system and the possible drift velocity. In this case, WCTC experts could inform us about the standard fouling duration (roughly 6 months in real conditions for a temperature of 60 °C) and thus help us to choose, in relation with the sampling period, the value of τ. Finally, the detection is done by measuring the proximity between the known failure modes and the location of the updated current mode. This corresponds to estimating how "far" the current mode is from the already known failure classes and how fast it moves. It is then necessary to estimate how far two distributions
are from each other and to estimate the velocity of the evolving class. To achieve this, specific tools such as drift indicators (see Fig. 11 a.2, b.2 and c.2), which are beyond the scope of this paper, have been developed (Amadou-Boubacar, 2006; Lecoeuche & Lurette, 2004).
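Although the drift indicators themselves are beyond the scope of the paper, a very simple indicator in the spirit of this description could combine the distance of the tracked current mode to a known failure mode with an estimate of its drift velocity. The sketch below is purely illustrative: the function name, inputs and alert rule are assumptions and do not reproduce the indicators developed in Amadou-Boubacar (2006) and Lecoeuche and Lurette (2004).

```python
import numpy as np

def drift_indicator(current_centres, failure_centre):
    """
    current_centres : successive centre estimates of the tracked current mode (MCF)
    failure_centre  : centre of a known failure mode (MH, ME or MF)
    Returns the distance to the failure mode and an average drift velocity.
    """
    centres = np.atleast_2d(np.asarray(current_centres, dtype=float))
    distance = float(np.linalg.norm(centres[-1] - np.asarray(failure_centre, float)))
    steps = np.linalg.norm(np.diff(centres, axis=0), axis=1)   # displacement per sample
    velocity = float(steps.mean()) if steps.size else 0.0
    return distance, velocity

# An alert could then be raised when the distance drops below a user-defined
# threshold while the estimated drift velocity remains significant.
```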
5. Conclusion

This paper has presented the Self-Adaptive Kernel Machine (SAKM), a new kernel-based algorithm for online clustering. Built on a neural architecture, SAKM deals with the challenges of online learning in non-stationary and multi-class contexts. SAKM is based on a kernel-induced similarity measure that is used to set a reliable decision rule. A significant advantage of SAKM is its fast and efficient update rule, which allows online learning of real-drifting targets. This algorithm shows that the classical stochastic gradient descent applied to the regularized risk in RKHS gives accurate solutions in online applications. To deal with unsupervised learning, the SAKM learning procedures are built in four main stages: initialisation and creation, adaptation, fusion and elimination. Using the kernel-based density model, the algorithm takes into account new clusters' appearances and variations over time by providing their optimal models sequentially. A theoretical study is carried out to analyze the instantaneous error convergence of SAKM. Thereafter, a comparison with the NORMA and ALMA algorithms illustrates the performance of SAKM local learning according to a suitable error measure. Through the simulations, the SAKM algorithm shows its ability to learn clusters efficiently and to take into account their evolutions over time in a non-stationary environment. Experiments are carried out to validate the algorithm on benchmarks and real data. The results of these experiments show the good capabilities of SAKM and its performance in real-life problems. To make the SAKM algorithm fully adaptive in a non-stationary environment, our future work will aim at the implementation of a cluster splitting procedure.

Appendix A. Convergence bound of the SAKM objective function

Proof of Theorem 1 (NORMA Convergence Bound). The demonstration is given in Kivinen et al. (2004).

Proof of Corollary 1.
\[
E_{learn}[f^{l}] = \frac{1}{l}\sum_{t=1}^{l}\xi\big(f^{t},X_{t}\big)
= \frac{1}{l}\sum_{t=1}^{l}\Big[R_{inst}\big[f^{t},X_{t}\big]-\frac{a}{2}\big\|f^{t}\big\|_{\Gamma}^{2}\Big]
= \frac{1}{l}\sum_{t=1}^{l}R_{inst}\big[f^{t},X_{t}\big]-\frac{1}{l}\sum_{t=1}^{l}\frac{a}{2}\big\|f^{t}\big\|_{\Gamma}^{2}
\]
\[
\leq R_{reg}[g,S]+\frac{b}{l^{1/2}}\Big[a+L\Big(2\ln\frac{1}{\delta}\Big)^{1/2}\Big]-\frac{1}{l}\sum_{t=1}^{l}\frac{a}{2}\big\|f^{t}\big\|_{\Gamma}^{2}.
\]

Appendix B. Convergence of SAKM update rule

Proof of Lemma 1 (Bounds of the Kernel-Based Boundary Function). The boundary function is defined by:
\[
f(\cdot)=\sum_{i}\alpha_{i}\,\kappa(\cdot,SV_{i})-\rho
=\sum_{i}\alpha_{i}\,\kappa(\cdot,SV_{i})-\sum_{i}\alpha_{i}\,\kappa(SV_{c},SV_{i}).
\]
By using the normalisation constraint of the SAKM update rule, \(\sum_{i}\alpha_{i}=1\), we have
\[
f(\cdot)\leq\Big(\sum_{i}\alpha_{i}\Big)\max_{i}\kappa(\cdot,SV_{i})-\Big(\sum_{i}\alpha_{i}\Big)\min_{i}\kappa(SV_{c},SV_{i})
= 1\cdot Up_{ker}-1\cdot Lo_{ker}
\]
and
\[
f(\cdot)\geq\Big(\sum_{i}\alpha_{i}\Big)\min_{i}\kappa(\cdot,SV_{i})-\Big(\sum_{i}\alpha_{i}\Big)\max_{i}\kappa(SV_{c},SV_{i})
= 1\cdot Lo_{ker}-1\cdot Up_{ker}.
\]
Hence, since \(Lo_{ker}\leq\kappa(\cdot,\cdot)\leq Up_{ker}\), the kernel-based boundary function is bounded so that:
\[
Lo_{ker}-Up_{ker}\leq f(\cdot)\leq Up_{ker}-Lo_{ker}.
\]

Proof of Lemma 2 (Local Bounds of the Loss Function).
\[
\xi(f,X)=\max(0,-f)-\nu\rho
\;\Rightarrow\;
0-\max(\nu\rho)\leq\xi(f,X)\leq|\min(f)|-\min(\nu\rho)
\]
\[
\Rightarrow\;
-\nu\sum_{i}\alpha_{i}\max\big(\kappa(SV_{c},SV_{i})\big)\leq\xi(f,X)\leq|\min(f)|-\nu\sum_{i}\alpha_{i}\min\big(\kappa(SV_{c},SV_{i})\big)
\]
\[
\Rightarrow\;
-\nu\cdot Up_{ker}\leq\xi(f,X)\leq Up_{ker}-(1+\nu)\cdot Lo_{ker}.
\]

Proof of Proposition 1 (Bounds of the Derivative of the Objective Function).
\[
\frac{\partial}{\partial t}E_{learn}[f^{t}]\approx\frac{E_{learn}[f^{t}]-E_{learn}[f^{t-1}]}{t-(t-1)}
=\frac{1}{t}\sum_{n=1}^{t}\xi(f^{n},X_{n})-\frac{1}{t-1}\sum_{n=1}^{t-1}\xi(f^{n},X_{n})
=\frac{1}{t(t-1)}\Big((t-1)\,\xi(f^{t},X_{t})-\sum_{n=1}^{t-1}\xi(f^{n},X_{n})\Big).
\]
By using the results of Lemma 2, the loss function can be bounded, so that:
\[
\frac{1}{t}\Big(\min\xi(f^{t},X_{t})-\max\xi(f^{n},X_{n})\Big)\leq\frac{\partial}{\partial t}E_{learn}[f^{t}]\leq\frac{1}{t}\Big(\max\xi(f^{t},X_{t})-\min\xi(f^{n},X_{n})\Big)
\]
\[
\frac{1}{t}\Big(-\nu\cdot Up_{ker}-\big(Up_{ker}-(1+\nu)\cdot Lo_{ker}\big)\Big)\leq\frac{\partial}{\partial t}E_{learn}[f^{t}]\leq\frac{1}{t}\Big(\big(Up_{ker}-(1+\nu)\cdot Lo_{ker}\big)+\nu\cdot Up_{ker}\Big)
\]
\[
\frac{1}{t}\,(1+\nu)\big(Lo_{ker}-Up_{ker}\big)\leq\frac{\partial}{\partial t}E_{learn}[f^{t}]\leq\frac{1}{t}\,(1+\nu)\big(Up_{ker}-Lo_{ker}\big).
\]

Proof of Theorem 2 (Unique Limit \(\ell=E_{learn}[f^{t}]_{t\to\infty}\)).
\[
\lim_{t\to\infty}\frac{1}{t}\,K=0\;\;(\forall K\in\mathbb{R})
\;\Rightarrow\;
0\leq\frac{\partial}{\partial t}E_{learn}[f^{t}]_{t\to\infty}\leq 0
\;\Rightarrow\;
\frac{\partial}{\partial t}E_{learn}[f^{t}]\xrightarrow[t\to\infty]{}0.
\]

Proof of Theorem 3 (Finite Limit \(\ell=E_{learn}[f^{t}]_{t\to\infty}\)). Determination of the bounds of \(E_{learn}[f^{t}]\) at any time:
\[
E_{learn}[f^{t}]=\frac{1}{t}\sum_{n=1}^{t}\xi(f^{n},X_{n})
\;\Rightarrow\;
\frac{1}{t}\,t\,\min\xi(f^{t},X_{t})\leq E_{learn}[f^{t}]\leq\frac{1}{t}\,t\,\max\xi(f^{t},X_{t})
\;\Rightarrow\;
-\nu\cdot Up_{ker}\leq E_{learn}[f^{t}]\leq Up_{ker}-(1+\nu)\cdot Lo_{ker}.
\]

Proof of Theorem 4 (Convergence of SAKM Update Rule).
(1) \(\ell=E_{learn}[f^{t}]_{t\to\infty}\) is unique (constant);
(2) \(\ell=E_{learn}[f^{t}]_{t\to\infty}\) is finite (bounded in \(\mathbb{R}-\{-\infty;\infty\}\)).
(1) and (2) are necessary and sufficient conditions for the convergence of the SAKM update rule. The average of the cumulated error converges towards the limit \(\ell\) so that:
\[
E_{learn}[f^{t}]\xrightarrow[t\to\infty]{}\ell\in\big[-\nu\cdot Up_{ker},\;Up_{ker}-(1+\nu)\cdot Lo_{ker}\big].
\]

Proof of Corollary 2 (Error Bound Generalization in Multi-Class Environment). Considering \(\Im^{t}=\{f_{1}^{t},\ldots,f_{m}^{t},\ldots,f_{M}^{t}\}\) and the assumption of online learning without merge and split phenomena:
\[
E_{learn}[f_{m}^{t}]\xrightarrow[t\to\infty]{}\ell_{m}\in\big[-\nu\cdot Up_{ker},\;Up_{ker}-(1+\nu)\cdot Lo_{ker}\big],\qquad m=1,\ldots,M
\]
\[
\Rightarrow\;E_{learn}[\Im^{t}]\xrightarrow[t\to\infty]{}\ell_{\Im}=\sum_{m=1}^{M}\ell_{m}
\;\Rightarrow\;
E_{learn}[\Im^{t}]\xrightarrow[t\to\infty]{}\ell_{\Im}\in M\cdot\big[-\nu\cdot Up_{ker},\;Up_{ker}-(1+\nu)\cdot Lo_{ker}\big].
\]
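The local bounds of Lemma 2 and the interval of Theorem 3 can be checked numerically. The following Monte-Carlo sketch uses assumed values for ν and for the kernel bounds (Lo_ker = 0.05, Up_ker = 1): it draws random normalised weights and kernel values and verifies that both the instantaneous loss and its running average stay inside [−ν·Up_ker, Up_ker − (1 + ν)·Lo_ker].

```python
import numpy as np

rng = np.random.default_rng(0)
nu, lo_ker, up_ker = 0.3, 0.05, 1.0   # assumed values for nu, Lo_ker, Up_ker

def random_loss():
    """Draw a random boundary value and return xi(f, X) = max(0, -f) - nu * rho."""
    n = int(rng.integers(1, 20))                  # number of support vectors
    alpha = rng.dirichlet(np.ones(n))             # normalised weights, sum_i alpha_i = 1
    k_x = rng.uniform(lo_ker, up_ker, size=n)     # kappa(X, SV_i)   in [Lo_ker, Up_ker]
    k_c = rng.uniform(lo_ker, up_ker, size=n)     # kappa(SV_c, SV_i) in [Lo_ker, Up_ker]
    rho = float(alpha @ k_c)
    f_x = float(alpha @ k_x) - rho                # boundary function value
    return max(0.0, -f_x) - nu * rho

losses = np.array([random_loss() for _ in range(100_000)])
running_avg = np.cumsum(losses) / np.arange(1, losses.size + 1)

low, high = -nu * up_ker, up_ker - (1 + nu) * lo_ker
assert low - 1e-12 <= losses.min() and losses.max() <= high + 1e-12            # Lemma 2
assert low - 1e-12 <= running_avg.min() and running_avg.max() <= high + 1e-12  # Theorem 3
print(f"loss in [{losses.min():.3f}, {losses.max():.3f}], bounds [{low:.3f}, {high:.3f}]")
```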
References

Amadou, B. H., Lecoeuche, S., & Maouche, S. (2005). Self-adaptive kernel machine: Online clustering in RKHS. In IEEE proceedings.
Amadou-Boubacar, H. (2006). Classification dynamique de données non-stationnaires: Apprentissage et suivi de classes évolutives. Ph.D. thesis. Laboratoire LAGIS, Université des Sciences et Technologies de Lille 1 (USTL).
Ben-Hur, A., Horn, D., Siegelmann, H., & Vapnik, V. (2001). Support vector clustering. Journal of Machine Learning Research, 2, 125–137.
Bennett, K., & Campbell, C. (2000). Support vector machines: Hype or hallelujah? SIGKDD Explorations, 2(2), 1–13.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford, UK: Clarendon Press.
Blake, C., & Merz, C. (1998). UCI repository of machine learning databases. www.ics.uci.edu/~mlearn/.
Bordes, A., & Bottou, L. (2005). The Huller: A simple and efficient online SVM. In Lecture notes in artificial intelligence, LNAI: Vol. 3720. Machine learning: ECML 2005 (pp. 505–512). Springer Verlag.
Bordes, A., Ertekin, S., Weston, J., & Bottou, L. (2005). Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6, 1579–1619.
Borer, S. (2003). New support vector algorithms for multi-categorical data: Applied to real-time object recognition. Ph.D. thesis. EPFL, Lausanne, Switzerland.
Burges, C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.
Cauwenberghs, G., & Poggio, T. (2000). Incremental and decremental support vector machine learning. In Advances in neural information processing systems: Vol. 13 (pp. 409–415). Cambridge, MA: MIT Press.
Cheong, S., Oh, S., & Lee, S. (2004). Support vector machines with binary tree architecture for multi-class classification. Neural Information Processing, 2(3), 47–51.
Csato, L., & Opper, M. (2002). Sparse online Gaussian processes. Neural Computation, 14, 641–668.
Deng, D., & Kasabov, N. (2003). On-line pattern analysis by evolving self-organizing maps. Neurocomputing, 51, 87–103.
Desobry, F., Davy, M., & Doncarli, C. (2005). An online kernel change detection algorithm. IEEE Transactions on Signal Processing, 53(8), part 2, 2961–2974.
Eltoft, T., & Figueiredo, R. J. P. (1998). A new neural network for cluster-detection-and-labeling. IEEE Transactions on Neural Networks, 9(5), 1021–1035.
Gentile, C. (2001). A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2, 213–242.
Gretton, A., & Desobry, F. (2003). Online one-class nu-SVM, an application to signal segmentation. In IEEE ICASSP03 proceedings: Vol. 2 (pp. 709–712).
Kasabov, N. K. (2001). Evolving fuzzy neural networks for supervised/unsupervised online knowledge-based learning. IEEE Transactions on Systems, Man and Cybernetics, Part B – Cybernetics, 31(6), 902–918.
Kivinen, J., Smola, A., & Williamson, R. (2004). Online learning with kernels. IEEE Transactions on Signal Processing, 52(8), 2165–2176.
Kriegel, H. P., Sander, J., Ester, M., & Xu, X. (1997). Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2), 169–194.
Kuh, A. (2001). Adaptive kernel methods for CDMA systems. In IEEE proceedings of IJCNN: Vol. 4 (pp. 2404–2409).
Lecoeuche, S., & Lurette, C. (2004). New supervision architecture based on on-line modelization of non stationary data. Neural Computing and Applications, 13(4), 323–338.
Lecoeuche, S., & Lurette, C. (2003). Auto-adaptive and dynamical clustering neural network. In ICANN03 (pp. 350–358). Springer.
Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443–1471.
Schölkopf, B., Smola, A., Williamson, R., & Bartlett, P. (2000). New support vector algorithms. Neural Computation, 12, 1207–1245.
Schölkopf, B., & Smola, A. (2002). Learning with kernels. Cambridge, MA: MIT Press.
Simpson, P. (1993). Fuzzy min–max neural networks — Part 2: Clustering. IEEE Transactions on Fuzzy Systems, 1(1), 32–45.
Tax, D., & Laskov, P. (2003). Online SVM learning: From classification to data description and back. Proceedings of the Neural Network and Signal Processing, 499–508.
Tax, D. (2001). One-class classification. Ph.D. thesis. TU Delft, The Netherlands.
Vapnik, V. (1998). Statistical learning theory. New York: John Wiley & Sons Inc.
Vapnik, V. (1999). An overview of statistical learning theory. IEEE Transactions on Neural Networks, 10(5), 988–999.
Zhang, B., & Xing, Y. (2003). Competitive EM algorithm for finite mixture models. Pattern Recognition, 37, 131–144.