ARTICLE IN PRESS
Neurocomputing 70 (2007) 1215–1224 www.elsevier.com/locate/neucom
Margin-based active learning for LVQ networks

F.-M. Schleif(a), B. Hammer(b), T. Villmann(c)

(a) Department of Mathematics and Computer Science, University of Leipzig, Germany
(b) Department of Computer Science, Clausthal University of Technology, Clausthal, Germany
(c) Department of Medicine, Clinic for Psychotherapy, University of Leipzig, Leipzig, Germany
Available online 22 December 2006
Abstract

In this article, we extend a local prototype-based learning model by active learning, which gives the learner the capability to select training samples during the model adaptation procedure. The proposed active learning strategy aims at an improved generalization ability of the final model. This is achieved by an adaptive query strategy which is more adequate for supervised learning than a simple random approach. Besides an improved generalization ability, the method also improves the speed of the learning procedure, which is especially beneficial for large data sets with many similar items. The algorithm is based on the idea of selecting a query on the borderline of the actual classification; this can be done by considering margins in an extension of learning vector quantization based on an appropriate cost function. The proposed active learning approach is analyzed for two kinds of learning vector quantizers, the supervised relevance neural gas and the supervised nearest prototype classifier, but it is applicable to a broader set of prototype-based learning approaches as well. The performance of the query algorithm is demonstrated on synthetic and real-life data taken from clinical proteomic studies. From these studies, high-dimensional mass spectrometry measurements were obtained which are believed to contain features discriminating the different classes. Using the proposed active learning strategies, the generalization ability of the models could be kept or improved, accompanied by a significantly improved learning speed. Both of these characteristics are important for the generation of predictive clinical models and were used in an initial biomarker discovery study.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Active learning; Learning vector quantization; Generalization; Classification; Proteomic profiling
1. Introduction

In supervised learning, we are frequently interested in training a classifier such that the underlying (unknown) target distribution is well estimated. Whereas traditional approaches usually adapt the model according to all available, randomly sampled training data, the field of active learning restricts training to only a few actively selected samples. This avoids the shortcoming of traditional approaches that the average amount of new information per sample decreases during learning and that additional data from some regions are basically redundant. Further, it accounts for the phenomenon, increasingly common e.g. in bioinformatics or web search, that unlabeled data are abundant whereas reliable labeling is costly.

Variants of active and query-based learning were proposed quite early for neural models [2,8]. Basically, two kinds of active learning exist: active learning based on an oracle, where one can demand arbitrary points of the input space [2], and methods where one can choose points from a given training set [14]. We focus on the second strategy. In the query algorithms proposed so far, samples are chosen according to some heuristic, e.g. [2], or in a principled way by optimizing an objective function such as the expected information gain of a query, e.g. [8], or the model uncertainty, e.g. [3]. A common feature of these query algorithms, however, is that they have been applied to global learning algorithms. Only a few approaches incorporate active strategies into local learning, such as [12], where a heuristic query strategy for simple vector quantization is proposed. In this paper we include active

Corresponding author: F.-M. Schleif, Bruker Daltonik GmbH, Permoserstrasse 15, D-04318 Leipzig, Germany. Tel.: +49 341 24 31 408; fax: +49 341 24 31 404. E-mail addresses: [email protected], [email protected] (F.-M. Schleif).

0925-2312/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2006.10.149
learning in two recently proposed margin-based, potentially kernelized learning vector quantization approaches, which combine the good generalization ability of margin optimization with the intuitiveness of prototype-based local learners, where subunits compete for predominance in a region of influence [9,18]. In the following, we briefly review the basics of this kernel extension of LVQ and its accompanying learning-theoretical generalization bounds, from which we derive a margin-based active learning strategy. We demonstrate the benefit of this scheme by comparing the classification performance of the algorithm on randomly selected training data and with active strategies for several data sets stemming from clinical proteomics as well as synthetic data.

1.1. Generalized relevance learning vector quantization

Standard LVQ and its variants as proposed by Kohonen constitute popular, simple and intuitive prototype-based methods, but they are purely heuristically motivated local learners [13]. They suffer from instabilities for overlapping classes, depend strongly on the initialization of the prototypes, and are restricted to classification scenarios in Euclidean space. Generalized relevance learning vector quantization (GRLVQ) has been introduced by the authors to cope with these problems [9]. It is based on a cost function such that neighborhood incorporation, integration of relevance learning, and kernelization of the approach become possible, which yields supervised neural gas (SNG) and supervised relevance neural gas (SRNG) [9]. The method is accompanied by a large margin generalization bound [10], which is directly connected to the cost function of the algorithm and which opens the way towards active learning strategies, as we will discuss in this article. We first introduce the basic algorithm. Input vectors are denoted by $v$ and their corresponding class labels by $c_v$; $L$ is the set of labels (classes).
Let $V \subseteq \mathbb{R}^{D_V}$ be a set of inputs $v$. The model uses a fixed number of representative prototypes (weight vectors, codebook vectors) for each class. Let $W = \{w_r\}$ be the set of all codebook vectors and $c_r$ the class label of $w_r$. Furthermore, let $W_c = \{w_r \mid c_r = c\}$ be the subset of prototypes assigned to class $c \in L$. The task of vector quantization is realized by the map $\Psi$ as a winner-takes-all rule, i.e. a stimulus vector $v \in V$ is mapped onto that neuron $s \in A$ whose pointer $w_s$ is closest to the presented stimulus vector $v$:

$$\Psi^{\lambda}_{V \to A} : v \mapsto s(v) = \operatorname{argmin}_{r \in A} d^{\lambda}(v, w_r) \qquad (1)$$

with $d^{\lambda}(v, w)$ being an arbitrary differentiable similarity measure, which may depend on a parameter vector $\lambda$. The subset of the input space

$$\Omega^{\lambda}_r = \{v \in V : r = \Psi^{\lambda}_{V \to A}(v)\}, \qquad (2)$$

which is mapped to a particular neuron $r$ according to (1), forms the (masked) receptive field of that neuron. If the class information of the weight vectors is used, the boundaries $\partial\Omega^{\lambda}_r$ generate the decision boundaries of the classes. A training algorithm should adapt the prototypes such that, for each class $c \in L$, the corresponding codebook vectors $W_c$ represent the class as accurately as possible. To achieve this goal, GRLVQ optimizes, via a stochastic gradient descent, the following cost function, which is related to the number of misclassifications:

$$\mathrm{Cost}_{\mathrm{GRLVQ}} = \sum_v f(\mu^{\lambda}(v)) \quad \text{with} \quad \mu^{\lambda}(v) = \frac{d^{\lambda}_{r+} - d^{\lambda}_{r-}}{d^{\lambda}_{r+} + d^{\lambda}_{r-}}, \qquad (3)$$

where $f(x) = (1 + \exp(-x))^{-1}$ is the standard logistic function, $d^{\lambda}_{r+}$ is the similarity of the input vector $v$ to the nearest codebook vector labeled with $c_{r+} = c_v$, say $w_{r+}$, and $d^{\lambda}_{r-}$ is the similarity to the best matching prototype labeled with $c_{r-} \neq c_v$, say $w_{r-}$. Note that the term $\mu^{\lambda}(v)$ scales the difference of the closest two competing prototypes to $(-1, 1)$; negative values correspond to correct classifications. The learning rule is derived from the cost function by taking the derivative. As shown in [16], this cost function shows robust behavior whereas original LVQ2.1 yields divergence. SRNG combines this method with neighborhood cooperation as derived in [9] and thus avoids that the algorithm gets trapped in local optima of the cost function. In the subsequent experiments, we choose the weighted Euclidean metric $d^{\lambda}(v, w) = \sum_i \lambda_i (v_i - w_i)^2$.

An alternative approach is the soft nearest prototype classification and its kernel variants incorporating metric adaptation [18], referred to as SNPC in the following. SNPC is based on an alternative cost function which can be interpreted as a Gaussian mixture model (GMM) approach aiming at empirical risk minimization. In the following, SNPC is reviewed very briefly; we restrict ourselves to the SNPC cost function and those parts which are affected by the introduction of an active learning strategy. For details on the derivation of SNPC and the ordinary learning dynamic we refer to [18].

2. Soft nearest prototype classification

We keep the generic notation of the former section and subsequently review SNPC. SNPC has been proposed as an alternative stable NPC learning scheme. It introduces soft assignments of data vectors to the prototypes, which have a statistical interpretation as normalized Gaussians. The original SNPC as provided in [19] considers the cost function

$$E(S) = \frac{1}{N_S} \sum_{k=1}^{N_S} \sum_r u_{\tau}(r|v_k)\,\bigl(1 - \alpha_{r, c_{v_k}}\bigr) \qquad (4)$$

with $S = \{(v, c_v)\}$ the set of all input pairs and $N_S = \#S$. The value $\alpha_{r, c_{v_k}}$ equals one if $c_{v_k} = c_r$ and zero otherwise; $u_{\tau}(r|v_k)$ is the probability that the input vector $v_k$ is assigned to the prototype $r$. A crisp winner-takes-all mapping (1) would yield $u_{\tau}(r|v_k) = \delta(r = s(v_k))$.
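To make the notation concrete, the winner-takes-all rule (1) and the relative margin $\mu^{\lambda}(v)$ of (3) can be sketched as follows. This is a minimal pure-Python illustration under our own naming conventions (`prototypes` as a list of `(weight_vector, class_label)` pairs is our choice, not the paper's), not the authors' implementation:

```python
# Sketch of the winner-takes-all rule (1) and the GRLVQ margin term (3).
# d_lambda is the relevance-weighted squared Euclidean metric used in the paper.

def d_lambda(v, w, lam):
    """Weighted squared Euclidean distance: sum_i lambda_i * (v_i - w_i)^2."""
    return sum(l * (vi - wi) ** 2 for l, vi, wi in zip(lam, v, w))

def winner(v, prototypes, lam):
    """Winner-takes-all mapping (1): index of the closest prototype."""
    return min(range(len(prototypes)),
               key=lambda r: d_lambda(v, prototypes[r][0], lam))

def mu(v, c_v, prototypes, lam):
    """Relative margin (3): (d_r+ - d_r-) / (d_r+ + d_r-).

    Negative values correspond to correct classifications, since the
    distance to the closest correct prototype is then the smaller one.
    """
    d_plus = min(d_lambda(v, w, lam) for w, c in prototypes if c == c_v)
    d_minus = min(d_lambda(v, w, lam) for w, c in prototypes if c != c_v)
    return (d_plus - d_minus) / (d_plus + d_minus)
```

For example, with `prototypes = [([0.0, 0.0], 0), ([1.0, 1.0], 1)]` and uniform relevances `lam = [0.5, 0.5]`, a point `[0.1, 0.0]` of class 0 wins prototype 0 and receives a negative margin.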
In order to minimize (4), SNPC takes the variables $u_{\tau}(r|v_k)$ as soft assignment probabilities, which allows a gradient descent on the cost function (4). As proposed in [19], the probabilities are chosen as normalized Gaussians

$$u_{\tau}(r|v_k) = \frac{\exp\!\left(-\frac{d(v_k, w_r)}{2\tau^2}\right)}{\sum_{r'} \exp\!\left(-\frac{d(v_k, w_{r'})}{2\tau^2}\right)}, \qquad (5)$$

whereby $d$ is the distance measure used in (1) and $\tau$ is a bandwidth which has to be chosen adequately. The cost function (4) can then be rewritten as

$$E_{\mathrm{soft}}(S) = \frac{1}{N_S} \sum_{k=1}^{N_S} lc(v_k, c_{v_k}) \qquad (6)$$

with local costs

$$lc(v_k, c_{v_k}) = \sum_r u_{\tau}(r|v_k)\,\bigl(1 - \alpha_{r, c_{v_k}}\bigr), \qquad (7)$$

i.e. the local error is the sum of the assignment probabilities of $v_k$ to all prototypes of an incorrect class; hence $lc(v_k, c_{v_k}) \le 1$, with local costs depending on the whole set $W$. Because the local costs $lc(v_k, c_{v_k})$ are continuous and bounded, the cost function (6) can be minimized by stochastic gradient descent using the derivative of the local costs, as shown in [19]. All prototypes are adapted in this scheme according to the soft assignments. Note that for small bandwidth $\tau$, the learning rule is similar to LVQ2.1.

2.1. Relevance learning for SNPC

Like all NPC algorithms, SNPC heavily relies on the metric $d$, usually the standard Euclidean metric. For high-dimensional data such as proteomic patterns, this choice is not adequate, since noise present in the data set accumulates and is likely to disrupt the classification. A focus on the (a priori unknown) relevant parts of the inputs would thus be much better suited. Relevance learning as introduced in [11] offers the opportunity to learn metric parameters which account for the different relevance of the input dimensions during training. In analogy to the above learning approaches, this relevance learning idea is included into SNPC, leading to SNPC-R. Instead of the metric $d(v_k, w_r)$, a parameterized metric $d^{\lambda}(v_k, w_r)$ incorporating adaptive relevance factors is plugged into the soft assignments (5), whereby the component $\lambda_k$ of $\lambda$ is usually chosen as the weighting parameter for input dimension $k$. The relevance parameters $\lambda_j$ can be adjusted according to the given training data by taking the derivative of the cost function, i.e. $\partial lc(v_k, c_{v_k})/\partial \lambda_j$, using the local costs (7):

$$\frac{\partial\, lc(v_k, c_{v_k})}{\partial \lambda_j} = \frac{1}{2\tau^2} \sum_r u_{\tau}(r|v_k)\, \frac{\partial d^{\lambda}_r}{\partial \lambda_j}\, \Bigl( \bigl(1 - \alpha_{r, c_{v_k}}\bigr) - lc(v_k, c_{v_k}) \Bigr) \qquad (8)$$

with subsequent normalization of the $\lambda_k$.
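The soft assignments (5) and the local costs (7) can be sketched directly in code. The following is our own minimal illustration (function and argument names are ours; `prototypes` holds `(weight_vector, class_label)` pairs, and the unweighted squared Euclidean metric stands in for $d$):

```python
import math

def soft_assignments(v, prototypes, lam, tau):
    """Normalized Gaussians (5): u_tau(r|v) proportional to
    exp(-d_lambda(v, w_r) / (2 tau^2)), normalized over all prototypes."""
    d = [sum(l * (vi - wi) ** 2 for l, vi, wi in zip(lam, v, w))
         for w, _c in prototypes]
    e = [math.exp(-di / (2.0 * tau ** 2)) for di in d]
    z = sum(e)
    return [ei / z for ei in e]

def local_cost(v, c_v, prototypes, lam, tau):
    """Local costs (7): total soft assignment probability of v to
    prototypes carrying a wrong class label; always in [0, 1]."""
    u = soft_assignments(v, prototypes, lam, tau)
    return sum(ur for ur, (_w, c) in zip(u, prototypes) if c != c_v)
```

A point lying close to a correct prototype thus incurs a local cost near zero, while a point deep inside a wrong class region contributes a cost near one, consistent with (6) being related to the misclassification rate.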
It is worth emphasizing that SNPC-R can also be used with individual metric parameters $\lambda^r$ for each prototype $w_r$, or, as done here, with a class-wise metric shared by all prototypes with the same class label $c_r$, referred to as localized SNPC-R (LSNPC-R). If the metric is shared by all prototypes, LSNPC-R reduces to SNPC-R. The adjustment of the relevance parameters $\lambda$ can easily be determined in complete analogy to (8). It has been pointed out in [7] that NPC classification schemes based on the Euclidean metric can be interpreted as large margin algorithms, for which dimensionality-independent generalization bounds can be derived. Instead of the dimensionality of the data, the so-called hypothesis margin, i.e. the distance by which the hypothesis can be altered without changing the classification on the training set, serves as a parameter of the generalization bound. This result has been extended to NPC schemes with adaptive diagonal metric in [9]. This fact is quite remarkable, since $D_V$ new parameters, $D_V$ being the input dimension, are added this way; still, the bound is independent of $D_V$. The result can even be transferred to the setting of individual metric parameters $\lambda^r$ for each prototype or class, as we will see below, such that a generally good generalization ability of this method can be expected. Apart from the fact that (possibly local) relevance factors allow a larger flexibility of the approach without decreasing the generalization ability, they are of particular interest for proteomic pattern analysis because they indicate potentially semantically meaningful positions. Our active learning approach holds for any such LVQ-type learner.

3. Margin-based active learning

The first dimensionality-independent large margin generalization bound for LVQ classifiers was provided in [7]. For GRLVQ-type learning, a further analysis is possible which accounts for the fact that the similarity measure is adaptive during training [10].
Here, we sketch the argumentation as provided in [10] to derive a bound for a slightly more general situation where different local adaptive relevance terms are attached to the prototypes.

3.1. Theoretical generalization bound for fixed margin

Assume, for the moment, that a two-class problem with labels $\{-1, 1\}$ is given.¹ We assume that an NPC classification scheme is used whereby the locally weighted squared Euclidean metric determines the receptive fields:

$$v \mapsto \operatorname{argmin}_{r \in A} \sum_l \lambda^r_l\, \bigl(v_l - (w_r)_l\bigr)^2,$$

where $l$ denotes the components of the vectors and $\sum_l \lambda^r_l = 1$. We further assume that data are chosen i.i.d.

¹ These constraints are technical, serving to derive the generalization bounds which have already been obtained by two of the authors in [10]; the active learning strategies also work for more than two classes and for alternative metrics.
according to a data distribution $P(V)$ whose support is limited by a ball of radius $B$, and that the class labels are determined by an unknown function. Generalization bounds limit the error, i.e. the probability that the learned classifier does not classify given data correctly:

$$E_P(C) = P\bigl(c_v \neq \Psi^{\lambda}_{V \to A}(v)\bigr). \qquad (9)$$

Note that this error captures the performance of GRLVQ/SRNG networks as well as of SNPC-R learning with local adaptive diagonal metric. Given a classifier $C$ and a sample $(v, c_v)$, we define the margin as

$$M_C(v, c_v) = d^{\lambda^{r-}}_{r-} - d^{\lambda^{r+}}_{r+}, \qquad (10)$$

i.e. the difference between the distance of the data point to the closest wrong prototype and its distance to the closest correct prototype. (To be precise, we refer to the absolute value as the margin.) For a fixed parameter $\rho \in (0, 1)$, the loss function is defined as

$$L : \mathbb{R} \to \mathbb{R}, \quad t \mapsto \begin{cases} 1 & \text{if } t \le 0, \\ 1 - t/\rho & \text{if } 0 < t \le \rho, \\ 0 & \text{otherwise.} \end{cases} \qquad (11)$$

The term

$$\hat{E}^L_m(C) = \sum_{v \in V} L\bigl(M_C(v, c_v)\bigr) / |V| \qquad (12)$$

denotes the empirical error on the training data. It counts the data points which are classified incorrectly and, in addition, punishes all data points with too small a margin. We can now use techniques from [1], in analogy to the argumentation in [10], to relate the error (9) to the empirical error (12). As shown in [1] (Theorem 7), one can bound the deviation of the empirical error $\hat{E}^L_m(C)$ from the error $E_P(C)$ by the Gaussian complexity of the function class defined by the model:

$$E_P(C) \le \hat{E}^L_m(C) + \frac{2K}{\rho}\, G_m + \sqrt{\frac{\ln(4/\delta)}{2m}}$$

with probability at least $1 - \delta/2$, where $K$ is a universal constant. $G_m$ denotes the Gaussian complexity, i.e. the expectation

$$E_{v^i}\, E_{g_1, \ldots, g_m}\!\left( \sup_C \left| \frac{2}{m} \sum_{i=1}^m g_i\, C(v^i) \right| \right),$$

where the expectation is taken w.r.t. independent Gaussian variables $g_1, \ldots, g_m$ with zero mean and unit variance and i.i.d. points $v^i$ sampled according to the marginal distribution induced by $P$. The supremum is taken over all NPC classifiers $C$ with prototypes $W$ of length at most $B$. The Gaussian complexity of NPC networks with local adaptive diagonal metric can easily be estimated using techniques from [1]: the classifier can be expressed as a Boolean formula of the results of classifiers with only two prototypes, and at most $|W(W-1)|$ such terms exist. For two prototypes $i$ and $j$, the corresponding output can be described by the sum of a simple quadratic form and a linear term:

$$d^{\lambda^i}_i - d^{\lambda^j}_j \le 0 \iff (v - w^i)^t \Lambda^i (v - w^i) - (v - w^j)^t \Lambda^j (v - w^j) \le 0$$
$$\iff v^t \Lambda^i v - v^t \Lambda^j v - 2\bigl(\Lambda^i w^i - \Lambda^j w^j\bigr)^t v + (w^i)^t \Lambda^i w^i - (w^j)^t \Lambda^j w^j \le 0.$$

Since the size of the prototypes and of the inputs is restricted by $B$, and since $\lambda$ is normalized to 1, we can estimate the empirical Gaussian complexity by the sum of

$$\frac{4\, B\, (B+1)\, (B+2)\, \sqrt{m}}{m}$$

for the linear term (including the bias) and

$$\frac{2\, B^2\, \sqrt{m}}{m}$$

for the quadratic term, using [1] (Lemma 22). The Gaussian complexity differs from the empirical Gaussian complexity by at most $\epsilon$ with probability at least $1 - 2\exp(-\epsilon^2 m/8)$. Putting these bounds together, the overall bound

$$E_P(C) \le \hat{E}^L_m(C) + \frac{K'}{\rho} \cdot \frac{\sqrt{\ln |V|}\; |W|^2\, B^3 + \sqrt{\ln(1/\delta)}}{\sqrt{|V|}} \qquad (13)$$

results with probability at least $1 - \delta$, $\delta \in (0, 1)$, $K'$ being a universal constant. This bound holds for every prototype-based learning algorithm with diagonal Euclidean metric and adaptive relevance parameters, as long as the absolute sum of the relevance parameters $\lambda$ is restricted to 1; the parameters may even be adapted locally for each prototype vector. The formula provides an estimate of the difference between the real error of a trained classifier and the empirical error measured on the training set. Its form is very similar to generalization bounds of learning algorithms derived in the framework of VC-theory [21]. Since these bounds are worst-case bounds, they usually differ strongly from the exact test error and can only give an indication of the relevance of certain parameters. However, this information is sufficient to derive useful strategies to guide active learning. Note that the bound (13) does not include the data dimensionality, but the margin $\rho$. The bound holds for the scaled squared Euclidean metric with local adaptive relevance terms. It can directly be generalized to kernelized versions of this similarity measure; a sufficient condition is, e.g., that the measure is symmetric, vanishes for two identical arguments, and that its negative is conditionally positive definite. It should be mentioned that the margin (10) occurs as the numerator in the cost function of GRLVQ; hence GRLVQ and SRNG maximize the margin during training according to this cost function. However, the final generalization bound holds for all training algorithms of NPC classifiers, including SNPC and SNPC-R.
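The hypothesis margin (10), the margin loss (11), and the empirical error (12) translate directly into code. The sketch below uses our own names and passes the distance measure in as a function, so it applies to the plain or the relevance-weighted metric alike:

```python
def margin(v, c_v, prototypes, dist):
    """Hypothesis margin (10): distance to the closest wrong prototype
    minus distance to the closest correct one (positive iff v is
    classified correctly)."""
    d_plus = min(dist(v, w) for w, c in prototypes if c == c_v)
    d_minus = min(dist(v, w) for w, c in prototypes if c != c_v)
    return d_minus - d_plus

def margin_loss(t, rho):
    """Piecewise linear loss (11) with margin parameter rho."""
    if t <= 0.0:
        return 1.0
    if t <= rho:
        return 1.0 - t / rho
    return 0.0

def empirical_error(samples, prototypes, dist, rho):
    """Empirical margin error (12): mean loss over the training set;
    misclassified points count fully, small-margin points partially."""
    return sum(margin_loss(margin(v, c, prototypes, dist), rho)
               for v, c in samples) / len(samples)
```

With a well-separated toy set, correctly classified points with margin above `rho` contribute zero, so the empirical error vanishes, matching the role of (12) in the bound (13).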
3.2. Active learning strategy

The generalization bound in terms of the margin suggests an elegant scheme for transferring margin-based active learning to local learners. Margin-based sample selection has been presented, e.g., in the context of SVMs in [6,14]. Obviously, the generalization ability of the GRLVQ algorithm depends only on the points with too small a margin (10). Thus, only the extremal margin values need to be limited, and a restriction of the respective updates to extremal pairs of prototypes would suffice. This argument suggests schemes for active data selection if a fixed and static pattern set is available: we fix a monotonically decreasing non-negative function $L_c : \mathbb{R} \to \mathbb{R}$ and actively select training points from a given sample, in analogy to e.g. [14], based on the probability $L_c(M_F(v, c_v))$ for sample $v$. Two realizations are relevant:

(1) $L_c(t) = 1$ for $t < 0$ and $L_c(t) \sim |t|^{-\alpha}$ otherwise, i.e. the size of the margin determines the probability of $v$ being chosen, annealed by a parameter $\alpha$ (probabilistic strategy).

(2) $L_c(t) = 1$ if $t \le \rho$ and $0$ otherwise, i.e. all samples with margin smaller than $\rho$ are selected (threshold strategy).

Both strategies focus on the samples which are not yet sufficiently represented in the model; therefore, they directly aim at an improvement of the generalization bound (13). Strategy (2) allows an adaptation of the margin parameter $\rho$ during training in accordance with the confidence of the model, in analogy to the recent proposal [14] for SVM. For each codebook vector $w_r \in W$ we introduce a new parameter $a_r$ measuring the mean distance of the data points in its receptive field (2) to the current prototype $w_r$. This parameter can easily be computed during training as a moving average with no extra cost.² We choose $\rho_r$ locally as $\rho_r = 2 a_r$. Thus, points whose margin compares favorably to the size of the receptive fields are already represented with sufficient certainty and, hence, are abandoned.

For strategy (1), a confidence depending on the distance to the closest correct prototype and the overall classification accuracy can be introduced in a similar way; the normalized margin is then taken as a probability measure for data selection.

² The extra computational time needed to determine the active learning control variables is negligible.
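The two selection schemes above can be sketched as follows. This is our own illustrative code, not the authors' implementation; the margins are assumed precomputed via (10), and the moving-average update rate `eta` is a hypothetical parameter the paper does not specify:

```python
import random

def probabilistic_query(samples, margins, alpha, rng=random.random):
    """Strategy (1): query a sample with probability 1 if its margin is
    negative (misclassified), and with probability ~ |margin|^(-alpha),
    capped at 1, otherwise."""
    selected = []
    for s, m in zip(samples, margins):
        if m < 0:
            p = 1.0
        else:
            p = min(1.0, abs(m) ** (-alpha)) if m > 0 else 1.0
        if rng() < p:
            selected.append(s)
    return selected

def threshold_query(samples, margins, rho):
    """Strategy (2): query exactly the samples with margin at most rho."""
    return [s for s, m in zip(samples, margins) if m <= rho]

def update_receptive_size(a_r, d, eta=0.05):
    """Moving average a_r of distances within a receptive field; the paper
    then sets the local threshold rho_r = 2 * a_r."""
    return (1.0 - eta) * a_r + eta * d
```

Injecting `rng` makes the probabilistic strategy deterministic in tests; in training one would simply use the default `random.random`.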
3.3. Validity of generalization bounds for active strategies

A few aspects of the generalization bound (13) have to be reconsidered in the framework of active learning. The bound (13) is valid for an a priori fixed margin $\rho$; it is not applicable to the empirical margin observed during training, which is optimized by active learning. However, it is possible to generalize the bound (13) to this setting in the spirit of the luckiness framework of machine learning [20] by fixing prior probabilities for achieving a certain margin. Here, we derive a bound depending on a universal upper bound on the empirical margin and fixed prior probabilities. Assume the empirical margin can be upper bounded by $C > 0$. A reasonable upper bound $C$ can, e.g., be derived from the data as half the average distance of the points to their respective closest neighbor with a different labeling. Define $\rho_i = C/i$ for $i \ge 1$. Choose prior probabilities $p_i \ge 0$ with $\sum_i p_i = 1$ which indicate the confidence in achieving an empirical margin of size at least $\rho_i$. Set

$$L_i : \mathbb{R} \to \mathbb{R}, \quad t \mapsto \begin{cases} 1 & \text{if } t \le 0, \\ 1 - t/\rho_i & \text{if } 0 < t \le \rho_i, \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$

in analogy to (11), and let $\hat{E}^{L_i}_m(C)$ denote the empirical error (12) using the loss function $L_i$. We are interested in the probability

$$P\Bigl(\exists i :\; E_P(C) > \hat{E}^{L_i}_m(C) + \epsilon(i)\Bigr), \qquad (15)$$

i.e. the probability that the empirical error measured with respect to a loss $L_i$, for any $i$, and the real error deviate by more than $\epsilon(i)$, where the bound

$$\epsilon(i) = \frac{K'}{\rho_i} \cdot \frac{\sqrt{\ln |V|}\; |W|^2\, B^3 + \sqrt{\ln(1/(p_i \delta))}}{\sqrt{|V|}}$$

depends on the empirical margin $\rho_i$. Note that $\epsilon(i)$ is chosen as the bound derived in (13) for the margin $\rho_i$ and confidence $p_i \delta$. The probability (15) upper bounds the probability of a large deviation between the real error and the empirical error with respect to a loss function associated with the empirical margin observed a posteriori on the given training set. For every observed margin $\rho < C$, some $\rho_i$ can be found with $\rho \le \rho_i$, such that the generalization bound $\epsilon(i)$ results in this setting. The size of $\epsilon(i)$ depends on whether the observed empirical margin corresponds to the prior confidence in reaching this margin, i.e. a large prior $p_i$. We can limit (15) as follows:

$$P\Bigl(\exists i :\; E_P(C) > \hat{E}^{L_i}_m(C) + \epsilon(i)\Bigr) \le \sum_i P\Bigl(E_P(C) > \hat{E}^{L_i}_m(C) + \epsilon(i)\Bigr) \le \sum_i p_i \delta = \delta$$

because of the choice of $\epsilon(i)$ such that it corresponds to the bound (13). This argument allows us to derive bounds of a form similar to (13) for the empirical margin.

Another point worth discussing concerns the assumption that training data are i.i.d. with respect to an unknown underlying probability. For active learning, the training points are chosen depending on the observed margin, i.e. they are dependent. However, this argument does not apply to the scenario considered in our case: we assume an a priori fixed training set with i.i.d. data and choose the next training pattern depending on the margin, such that the convergence speed of training and the generalization ability of the trained classifier are improved. This
affects the learning algorithm on the given data set; however, the empirical error can nevertheless be evaluated on the whole training set independently of the algorithm, such that the bound derived above is valid for the a priori fixed training set. For dependent data (e.g. points created online during learning), alternative techniques such as the statistical theory of online learning must be used to derive bounds [4,15].
3.4. Adaptation to noisy data or unknown classes

We apply the strategy of active learning to data sets for which a small classification error can be achieved with a reasonable margin. We do not test the method on heavily corrupted or noisy data sets for which a large margin or a small training error cannot be achieved. It can be expected that the active strategies proposed above have to be adapted for very noisy learning scenarios: they focus on points which are not yet classified correctly or which lie close to the border, trying to enforce a separation which is not possible for the data set at hand. In such cases, it would be valuable to restrict the active strategy to points which are still promising, e.g. by choosing points only from a limited band parallel to the decision boundary. We would like to mention that, so far, we have restricted active selection to samples whose labels are all known beforehand, because the closest correct and wrong prototypes have to be determined in (10). This setting allows us to improve the training speed and performance of batch training. If data are initially unlabeled and queries can be asked for a subset of the data, these strategies extend in an obvious way: the margin (10) is then given by the closest two prototypes which possess different class labels, whereby the (unknown) class label of the sample point has no influence, and $L_c(t)$ is substituted by $L_c(|t|)$.
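For the unlabeled setting just described, the margin reduces to the distance difference between the two closest prototypes carrying different labels. A sketch under our own naming (the query probability would then be $L_c(|t|)$ applied to this value):

```python
def unlabeled_margin(v, prototypes, dist):
    """Margin for an unlabeled sample: distance difference between the two
    closest prototypes that carry different class labels; the sample's own
    (unknown) label plays no role."""
    ranked = sorted(prototypes, key=lambda p: dist(v, p[0]))
    w1, c1 = ranked[0]
    for w2, c2 in ranked[1:]:
        if c2 != c1:
            return dist(v, w2) - dist(v, w1)
    raise ValueError("all prototypes share a single label")
```

Samples deep inside one class region yield a large value and are unlikely to be queried, while samples between differently labeled prototypes yield a value near zero and are favored.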
The data set Proteom2 consists of 97 measurements with two classes and 94 dimensions. MALDI-TOF MS combined with magnetic-bead-based sample preparation was used to generate proteomic pattern from human EDTA (ethylenediaminetetraacetic acid) plasma 3
These data are measured on Bruker systems and are subject of confidential agreements with clinical partners, so only some details are listed here.
samples. The MALDI-TOF mass spectra were obtained using magnetic-bead-based weak cation exchange chromatography (WCX) [17] The sample eluates from the bead preparation have been applicated to an AnchorChipTargetTM by use of HCCA matrix. The material was randomly applicated on the target by use of ClinProtRobotTM and subsequently measured using an AutoFlex II in linear mode within 1–10 kDa (Bruker Daltonik GmbH, Bremen, Germany). Thereby, each spectrum has been accumulated by use of 450 laser shots with 15 shot positions per spot on the target. Each eluated sample was fourfold spotted and averaged subsequently. Individual attributes such as gender, age, cancer related preconditions and some others have been controlled during sample collection to avoid bias effects. All data have been collected and prepared in accordance to best clinical practices. Spectra preparation has been done by use of the Bruker ClinProt-System (Bruker Daltonik GmbH, Bremen, Germany). The well separable checkerboard data (checker) as given in [11] are used as a synthetic evaluation set. It consists of 3700 data points. Further, to illustrate the differences between the different strategies during learning of the codebook vector positions a simple spiral data set has been created and applied using the SRNG algorithm. The spiral data are generated in a similar way as the data shown in [12] and the set consists of 5238 data points. Checker as well the spiral data are given in a two-dimensional space.
5. Experiments and results

For classification, we use six prototypes for the WDBC data, 100 prototypes for the well separable checkerboard data set as given in [11], nine prototypes for the Proteom1 data set and ten for Proteom2. The parameter settings for SRNG are as follows: learning rate for the correct prototype 0.01, learning rate for the incorrect prototype 0.001, and learning rate for $\lambda$ 0.01. The neighborhood range is given by $\#W/2$. For SNPC the same settings as for SRNG are used, with the additional parameters window threshold 0.05 and width $\sigma = 2.5$ for the Gaussian kernel. Learning rates are annealed by an exponential decay. All data have been processed using a 10-fold cross-validation procedure. Results are calculated using SNPC and SNPC-R as well as SNG and SRNG; SNG and SRNG are used instead of GLVQ and GRLVQ as improved versions of the former employing neighborhood cooperation. We now compare both prototype classifiers using randomly selected samples with their counterparts using the proposed query strategies. The classification results are given in Tables 1 and 3 without metric adaptation and in Tables 2 and 3 with relevance learning, respectively. The features of all data sets have been normalized: first the data were upper bounded by 1.0, and subsequently transformed to zero mean and unit variance.
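The two-step normalization described above (bounding each feature by 1.0, then standardizing to zero mean and unit variance) might look as follows per feature. This is our sketch of one plausible reading of that preprocessing, not the authors' exact pipeline:

```python
def normalize_features(X):
    """Per feature: scale by the maximum absolute value (bounding the
    feature by 1.0), then z-score to zero mean and unit variance."""
    n, d = len(X), len(X[0])
    cols = [[row[j] for row in X] for j in range(d)]
    out_cols = []
    for col in cols:
        m = max(abs(x) for x in col) or 1.0    # guard against all-zero features
        col = [x / m for x in col]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = var ** 0.5 or 1.0                # guard against constant features
        out_cols.append([(x - mean) / std for x in col])
    return [[out_cols[j][i] for j in range(d)] for i in range(n)]
```

Note that for the z-score step the initial bounding is a no-op up to scale; it matters only if the standardization is applied per cross-validation fold while the bounding uses the full data, which the paper leaves unspecified.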
Table 1
Classification accuracies for cancer and checkerboard data sets using SNG

              SNG                  SNG_active (strategy 1)            SNG_active (strategy 2)
              Rec. (%)  Pred. (%)  Rec. (%)  Pred. (%)  Rel. #Q (%)   Rec. (%)  Pred. (%)  Rel. #Q (%)
WDBC          95        95         94        94         38            93        92         9
Proteom1      76        83         76        77         48            76        85         15
Proteom2      73        67         73        65         49            73        62         27
Checker       72        67         98        97         31            99        96         5
All data sets consist of two classes, whereby the Proteom2 data set is quite complex. The prediction accuracies are taken from a 10-fold cross-validation and show a reliably good prediction for the WDBC as well as the Proteom1 data. The checker data are not modeled as well; this is due to the fixed upper limit of 1000 cycles, as longer runtimes lead to a nearly perfect separation for these data.
Table 2
Classification accuracies for cancer data sets using SRNG

              SRNG                SRNG active, strategy 1          SRNG active, strategy 2
          Rec.(%)  Pred.(%)   Rec.(%)  Pred.(%)  Rel.#Q(%)   Rec.(%)  Pred.(%)  Rel.#Q(%)
WDBC         95       94         95       94        29          97       97         7
Proteom1     82       88         92       87        31          97       93         5
Proteom2     87       81         96       87        33          96       76        10
Data characteristics are as given before. A reliably good prediction for the WDBC data as well as for the Proteom1 data set can be seen. One clearly observes an improved modeling capability through relevance learning, and a related additional decrease in the number of queries.
Table 3
Classification accuracies for cancer data sets using standard SNPC and SNPC-R

              SNPC                SNPC active, strategy 1          SNPC active, strategy 2
          Rec.(%)  Pred.(%)   Rec.(%)  Pred.(%)  Rel.#Q(%)   Rec.(%)  Pred.(%)  Rel.#Q(%)
WDBC         90       89         66       93        60          82       92        15
Proteom1     71       81         72       82        67          65       82        21

              SNPC-R              SNPC-R active, strategy 1        SNPC-R active, strategy 2
          Rec.(%)  Pred.(%)   Rec.(%)  Pred.(%)  Rel.#Q(%)   Rec.(%)  Pred.(%)  Rel.#Q(%)
WDBC         86       91         85       94        62          94       95        12
Proteom1     72       86         89       82        66          92       87        16
Data characteristics as above. The prediction accuracies show a reliably good prediction for data belonging to the WDBC data set as well as for the Proteom1 data set. The standard SNPC showed an unstable behavior for the Proteom2 data; hence these results are not given in the table. With relevance learning, the number of queries as well as the prediction accuracy improved slightly with respect to the standard approach.
We applied the training algorithms using the different query strategies introduced above. The results for recognition and prediction rates using SRNG are shown in Table 2 and for SNPC in Table 3, respectively. Thereby, the recognition rate is a performance measure of the model indicating the relative number of training data points whose class label could be correctly recognized by the model. The prediction rate is a measure of the generalization ability of the model, accounting for the relative number of correctly predicted class labels of test data points which were not used in the preceding training of the classifier. As explained before, each prediction rate is obtained as an average over a 10-fold cross-validation procedure. For the WDBC data set and the Proteom2 data set we found small improvements in the prediction accuracy using active strategy 2. In some cases, a slight over-fitting behavior using the new query strategies can be observed. Both new query strategies were able to significantly decrease the necessary number of queries while keeping at least reliable prediction accuracies with respect to a random query approach. This is depicted in Figs. 1-4 for the SRNG and SNG algorithms and in Figs. 5 and 6 for the SNPC-R algorithm.

Footnote 4: The relative number of queries is calculated with respect to the maximal number of queries possible up to convergence of SRNG using the corresponding query strategy. The upper limit of cycles has been fixed to 1000.
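The distinction between recognition rate (training accuracy) and prediction rate (held-out accuracy), each averaged over a 10-fold cross-validation, can be sketched generically. The `fit`/`predict` callables are placeholders for any prototype classifier such as SRNG or SNPC; this is an illustrative sketch, not the authors' evaluation code.

```python
import numpy as np

def cv_recognition_prediction(fit, predict, X, y, n_folds=10, seed=0):
    """Recognition rate (accuracy on the training part) and prediction
    rate (accuracy on the held-out part), each averaged over an n-fold
    cross-validation as used for Tables 1-3.  `fit(X, y)` returns a
    model, `predict(model, X)` returns predicted labels."""
    rng = np.random.RandomState(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    rec, pred = [], []
    for k, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        model = fit(X[train], y[train])
        rec.append(np.mean(predict(model, X[train]) == y[train]))   # recognition
        pred.append(np.mean(predict(model, X[test]) == y[test]))    # prediction
    return float(np.mean(rec)), float(np.mean(pred))
```

For a quick check, a nearest-class-mean classifier can stand in for the prototype model: `fit` computes one mean per class, `predict` assigns the label of the closest mean.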
[Figs. 1-5 plot the relative number of queries per training cycle (0-1000 cycles) for the probabilistic and the threshold strategy; the plots are omitted here and only the captions are kept.]

Fig. 1. Number of queries in % using active strategy 1 (threshold) and 2 (probabilistic) executed by the SRNG algorithm on the Proteom1 data set.

Fig. 2. Number of queries in % using active strategy 1 (threshold) and 2 (probabilistic) executed by the SRNG algorithm on the Proteom2 data set.

Fig. 3. Number of queries in % using active strategy 1 (threshold) and 2 (probabilistic) executed by the SRNG algorithm on the WDBC data set.

Fig. 4. Number of queries in % using active strategy 1 (threshold) and 2 (probabilistic) executed by the SNG algorithm on the checker data set.

Fig. 5. Number of queries in % using active strategy 1 (threshold) and 2 (probabilistic) executed by the SNPC-R algorithm on the Proteom1 data set.
Although the number of queries differs for SNPC and SRNG on the same data set, the general trend is similar and clearly reflects the classification performance of the individual classifier. In particular, the threshold approach led to a significant decrease in the number of queries in each experiment.
Fig. 6. Number of queries in % using active strategy 1 (threshold) and 2 (probabilistic) executed by the SNPC-R algorithm on the WDBC data set.
In Fig. 7 the behavior of the proposed strategies is shown for a synthetic spiral data set of around 5000 data points. Both active learning strategies showed a significant decrease in the number of necessary queries; typically only 10-30% of the queries were executed with respect to a random approach.

Fig. 7. Plot of the synthetic spiral data with prototype positions as obtained using the different query strategies. [Scatter plot of the normalized spiral data in two dimensions, with codebook positions for no active learning, the probabilistic strategy, and the threshold strategy; plot omitted.]

Using the active learning strategies, the data could be learned nearly perfectly. Random querying required a much larger number of cycles to obtain acceptable results. Considering the prototype distribution, one clearly observes the good positioning of the prototypes using the active learning strategies, whereas the random approach suffers from over-representing the core of the spiral, which has a larger data density.
6. Conclusion

Margin-based active learning strategies for GLVQ-based networks have been studied. We compared two alternative query strategies, both incorporating the margin criterion of the GLVQ networks, with a random query selection. Both active learning strategies show reliable or partially better generalization ability with respect to the random approach. Thereby, we found a significantly faster convergence with a much lower number of necessary queries.

The threshold strategy shows an overall stable behavior with good prediction rates and a significant decrease in processing time. Due to the automatically adapted parameter the strategy is quite simple, but it depends on a sufficiently good estimation of the local data distribution. By scaling the threshold parameter, an application-specific trade-off between prediction accuracy and speed can be obtained. The probabilistic strategy achieves similar prediction accuracy, but its number of queries depends strongly on the annealing strategy: less restrictive constraints showed a faster convergence but over-fitting on smaller training data sets.

Especially for larger data sets, the proposed active learning strategies show great benefits in speed and prediction. For the considered mass spectrometric cancer data sets an overall good performance improvement has been observed. This is interesting from a practical point of view, since the technical equipment for measuring, e.g., large numbers of mass spectrometric data becomes more and more available. In mass spectrometry it is easy to measure a sample multiple times. The replicates which are taken from such
multiple measurements are in general very similar and differ only by random, but not by systematic, new information. In clinical proteomics based on mass spectrometry, replicates are measured very often to decrease the measurement variance (e.g. by averaging) or to compensate for the loss of samples in case of an error during a measurement. Typically 4, 8, or even 16 multiple measurements of the same sample are generated, and hence even for moderate sample sizes (e.g. 50 samples per class) the amount of training data becomes huge (see footnote 5). The presented approach is optimally suited to deal with replicate measurements, which may drastically increase the number of samples and hence typically lead to very long runtimes for ordinary training with the considered classification algorithms.
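The two margin-based query strategies discussed above can be sketched as follows. The paper does not reproduce its exact query rules in this section, so the formulas below are a plausible sketch under the assumption that both strategies are driven by the relative GLVQ-type margin; the threshold value and the temperature parameter are illustrative choices, not the authors' settings.

```python
import numpy as np

def glvq_margin(x, prototypes, proto_labels, label):
    """Relative GLVQ-type margin mu(x) = (d_plus - d_minus) / (d_plus + d_minus),
    where d_plus (d_minus) is the squared distance from x to the closest
    prototype with the correct (a wrong) label.  mu < 0 means x is
    classified correctly; |mu| close to 0 means x lies near the border."""
    d = np.sum((prototypes - x) ** 2, axis=1)
    d_plus = d[proto_labels == label].min()
    d_minus = d[proto_labels != label].min()
    return (d_plus - d_minus) / (d_plus + d_minus)

def threshold_query(mu, threshold=0.2):
    # strategy 1 (threshold, sketch): query only borderline samples
    return abs(mu) < threshold

def probabilistic_query(mu, rng, temperature=0.1):
    # strategy 2 (probabilistic, sketch): query with a probability that
    # decays with the distance of the sample from the decision border;
    # annealing the temperature makes the rule increasingly restrictive
    return rng.rand() < np.exp(-abs(mu) / temperature)
```

Under this sketch, a replicate far inside its class region yields |mu| close to 1 and is rarely queried, which is exactly why redundant replicate measurements cost little training time.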
Footnote 5: Considering 50 samples per class and 16 replicates per sample, one would be confronted with 1600 highly redundant training items.

References

[1] P. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: risk bounds and structural results, J. Mach. Learn. Res. 3 (2002) 463-482.
[2] E. Baum, Neural net algorithms that learn in polynomial time from examples and queries, IEEE Trans. Neural Networks 2 (1991) 5-19.
[3] L.M. Belue, K.W. Bauer Jr., D.W. Ruck, Selecting optimal experiments for multiple output multilayer perceptrons, Neural Comput. 9 (1997) 161-183.
[4] M. Biehl, A. Ghosh, B. Hammer, Learning vector quantization: the dynamics of winner-takes-all algorithms, Neurocomputing 69 (7-9) (2006) 660-670.
[5] C. Blake, C. Merz, UCI repository of machine learning databases, available at: <http://www.ics.uci.edu/mlearn/MLRepository.html>, 1998.
[6] C. Campbell, N. Cristianini, A. Smola, Query learning with large margin classifiers, in: International Conference on Machine Learning, 2000, pp. 111-118.
[7] K. Crammer, R. Gilad-Bachrach, A. Navot, N. Tishby, Margin analysis of the LVQ algorithm, in: Proceedings of NIPS 2002, <http://www.cs.cmu.edu/Groups/NIPS/NIPS2002/NIPS2002preproceedings/index.html>, 2002.
[8] Y. Freund, H.S. Seung, E. Shamir, N. Tishby, Information, prediction and query by committee, in: Advances in Neural Information Processing Systems, 1993, pp. 483-490.
[9] B. Hammer, M. Strickert, T. Villmann, Supervised neural gas with general similarity measure, Neural Process. Lett. 21 (1) (2005) 21-44.
[10] B. Hammer, M. Strickert, T. Villmann, On the generalization ability of GRLVQ networks, Neural Process. Lett. 21 (2) (2005) 109-120.
[11] B. Hammer, T. Villmann, Generalized relevance learning vector quantization, Neural Networks 15 (8-9) (2002) 1059-1068.
[12] M. Hasenjäger, H. Ritter, Active learning with local models, Neural Process. Lett. 7 (1998) 107-117.
[13] T. Kohonen, Self-Organizing Maps, Springer Series in Information Sciences, vol. 30, Springer, Berlin, Heidelberg, 1995 (2nd ext. ed. 1997).
[14] P. Mitra, C. Murthy, S. Pal, A probabilistic active support vector learning algorithm, IEEE Trans. Pattern Anal. Mach. Intell. 28 (3) (2004) 412-418.
[15] M. Opper, Statistical mechanics of generalization, in: M. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA, 2003, pp. 1087-1090.
[16] A. Sato, K. Yamada, A formulation of learning vector quantization using a new misclassification measure, in: A.K. Jain, S. Venkatesh, B.C. Lovell (Eds.), Proceedings of the 14th International Conference on Pattern Recognition, vol. 1, IEEE Computer Society, Los Alamitos, CA, USA, 1998, pp. 322-325.
[17] E. Schäffeler, U. Zanger, M. Schwab, M. Eichelbaum, Magnetic bead based human plasma profiling discriminate acute lymphatic leukaemia from non-diseased samples, in: 52nd ASMS Conference (ASMS) 2004, 2004, p. TPV 420.
[18] F.-M. Schleif, T. Villmann, B. Hammer, Local metric adaptation for soft nearest prototype classification to classify proteomic data, in: Fuzzy Logic and Applications: Sixth International Workshop, WILF 2005, Lecture Notes in Computer Science (LNCS), Springer, Berlin, 2006, pp. 290-296.
[19] S. Seo, M. Bode, K. Obermayer, Soft nearest prototype classification, IEEE Trans. Neural Networks 14 (2003) 390-398.
[20] J. Shawe-Taylor, P. Bartlett, R. Williamson, M. Anthony, Structural risk minimization over data-dependent hierarchies, IEEE Trans. Inform. Theory 44 (5) (1998) 1926-1940.
[21] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.

Frank-Michael Schleif studied Applied Computer Science and Psychology at the University of Leipzig and graduated in 2002. In 2003, he joined the Chair of Applied Telematics at the University of Leipzig as a research associate. In 2004, he became a member of Bruker Daltonics, where he is currently finishing his Ph.D. studies. His research interests include computational biology with a special focus on classification problems, pattern recognition, bioinformatics and statistical data analysis. http://gaos.org/schleif
Barbara Hammer received her Ph.D. in Computer Science in 1995 and her venia legendi in Computer Science in 2003, both from the University of Osnabrueck, Germany. From 2000 to 2004, she was leader of the junior research group "Learning with Neural Methods on Structured Data" at the University of Osnabrueck before accepting an offer as professor for Theoretical Computer Science at Clausthal University of Technology, Germany, in 2004. Several research stays have taken her to Italy, the UK, India, France, and the USA. Her areas of expertise include various techniques such as hybrid systems, self-organizing maps, clustering, and recurrent networks, as well as applications in bioinformatics, industrial process monitoring, and cognitive science. Most of her publications can be retrieved from http://www.in.tu-clausthal.de/hammer/.

Thomas Villmann received his Ph.D. in Computer Science in 1996 and his venia legendi in Computer Science in 2005, both from the University of Leipzig, Germany. Since 1997 he has been with the medical department of the University of Leipzig. At the hospital for psychotherapy he leads the computer science group and the research group of computational intelligence. Several research stays have taken him to France and the USA. He is a founding member of the German chapter of ENNS (GNNS). His research areas include a broad range of machine learning approaches such as neural maps, clustering, classification, pattern recognition and evolutionary algorithms, as well as applications in medicine, bioinformatics, satellite remote sensing and others. http://www.uni-leipzig.de/psychsom/diagramm/MA_villmann.html