Neurocomputing 70 (2007) 2528–2533
Incremental support vector machines and their geometrical analyses

Kazushi Ikeda, Takemasa Yamasaki
Graduate School of Informatics, Kyoto University, Sakyo, Kyoto 606-8501, Japan

Received 17 December 2005; received in revised form 26 March 2006; accepted 26 July 2006
Communicated by N. Ueda
Available online 19 October 2006
Abstract

Support vector machines (SVMs) are known to result in a quadratic programming (QP) problem, which requires a large computational complexity. To reduce it, this paper considers, from a geometrical point of view, two incremental or iterative SVMs with homogeneous hyperplanes. One method is shown to produce the same solution as an SVM in batch mode with linear complexity on average, utilizing the fact that only the effective examples are necessary and sufficient for the solution. The other, which stores the set of support vectors instead of the effective examples, is quantitatively shown to have a lower performance, although its implementation is rather easy.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Support vector machines; Incremental learning; Learning curves; Admissible region
1. Introduction

Support vector machines (SVMs) have attracted much attention over the past decade as a new classification technique with good generalization ability in applications such as pattern classification [13–16,1]. An SVM nonlinearly maps given input vectors to feature vectors in a high-dimensional space, and linearly separates the feature vectors with a hyperplane that is optimal in terms of the margin. One property of SVMs is that they have no local minima, from which traditional gradient-based methods, such as multi-layer perceptrons [11], suffer during training. This is due to the fact that finding an optimal hyperplane results in a convex quadratic programming problem (QP) with linear constraints. Although QPs have good properties, they require a high computational complexity. Since an SVM with N given examples is equivalent to a QP with N or 2N variables, even good QP solvers, such as interior-point methods, can only solve problems of a limited size. Another property of SVMs is that they have a sparse solution; that is, only a limited number of the examples contribute to the SVM solution while the others do not.
This means that the computational complexity could be reduced if such useless examples were removed in advance. Incremental learning for SVMs is an approach based on this idea: if a new example is useless given the stored ones, it is thrown away; otherwise, it is stored and the set of stored examples is updated. This approach has a complexity proportional to the number N of given examples on average, as long as the expected number of stored examples does not grow with N. There are two important points: how to distinguish the important examples from the others, and how many examples should be stored. We consider two incremental methods in this paper. One produces the same solution as an SVM in batch mode, i.e., as solving the full-size QP. The other is easier to implement and has a lower complexity, but a higher generalization error. We also analyze this degradation of performance from a geometrical viewpoint. The rest of the paper is organized as follows. Section 2 introduces the effective examples and the support vectors of an SVM, which are, in effect, its "important examples". Two incremental methods are proposed in Section 3, and statistical analyses of them are presented in Section 4. In Section 5, we show the results of computer simulations that confirm the validity of the analyses. Conclusions and discussions are given in Section 6.
2. Effective examples and support vectors
An SVM maps an input vector $x$ to a vector $f = f(x)$, called a feature vector, in a high-dimensional feature space, and separates the feature vectors by a hyperplane that is optimal in terms of the margin. In this paper, we assume that the feature vector is normalized, that is,
$$\|f\| = \|f(x)\| = 1 \qquad (1)$$
for any $x$, as is done in [3]. This assumption is rather restrictive but useful for a geometrical interpretation of SVMs, as will be shown later. Moreover, we regard $f = f(x)$ as a newly defined input vector and do not consider $x$ any more in the following. This means that we employ the so-called linear kernel and consider only margin maximization. We discuss the possibility of extension to general cases in Section 6.

Although the original SVMs employed inhomogeneous separating hyperplanes, $w^T f + b = 0$, where $T$ denotes transposition, we only consider SVMs with homogeneous separating hyperplanes, $w^T f = 0$, as is done in some theoretical studies [3,9,10,6]. Note that a problem with inhomogeneous hyperplanes is easily transformed to one with homogeneous hyperplanes using so-called lifting up, that is, the transformation $\tilde{w}^T := (w^T, b)$ and $\tilde{f}^T := (f^T, 1)$, where $:=$ means definition, though the two problems differ slightly since the latter also penalizes the bias $b$ in margin maximization [7].

An SVM is given $N$ examples, and the $i$th example is a pair of an input vector $f_i \in S^M$, where $S^M$ is the $M$-dimensional unit hypersphere, and the corresponding label $y_i \in \{\pm 1\}$ satisfying $y_i = \mathrm{sgn}(w^{*T} f_i)$, where $w^*$ is the true weight vector to be estimated and $\mathrm{sgn}(\cdot)$ outputs the sign of its argument. Since the separating hyperplane is homogeneous, an example $(f_i, y_i)$ is completely equivalent to $(y_i f_i, +1)$, and hence we can assume that every example has a positive label. In this case, input vectors are chosen from the hypersemisphere
$$S^M_+ := \{f \mid f^T w^* > 0\}, \qquad (2)$$
which we call the input space. Since the magnitude of $w^*$ does not affect its separation ability, we assume $w^* \in S^M$ without loss of generality, where $S^M$ is called the weight space. When an example $(f_i, y_i)$, or equivalently $(y_i f_i, +1)$, is given, the true weight vector $w^*$ must lie in the hypersemisphere $\{w \mid y_i w^T f_i > 0\}$ in $S^M$. This means that an example is represented as a point in the input space and as a hyperplane in the weight space (Fig. 1). Note that, conversely, a weight vector is represented as a hyperplane in the input space and as a point in the weight space. When $N$ examples are given, $w^*$ has to be in the region $A_N$ defined as
$$A_N = \{w \mid y_i w^T f_i > 0,\ i = 1, \ldots, N\}, \qquad (3)$$
which we call the admissible region [8]. The admissible region $A_N$, called the version space in physics, is a polyhedron in $S^M$. If the admissible region changes when an example is removed, the example is called effective.
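To make the definition concrete, example $j$ is effective exactly when dropping its constraint enlarges the admissible cone, i.e., when some $w$ satisfying the remaining constraints violates $y_j w^T f_j > 0$. Below is a minimal sketch of such a redundancy test, assuming SciPy is available; the helper name `is_effective`, the box bound on $w$, and the tolerance are our own illustrative choices, not part of the original method.

```python
import numpy as np
from scipy.optimize import linprog

def is_effective(F, j, tol=1e-9):
    """Check whether example j is effective, i.e., whether removing the
    constraint w^T f_j > 0 enlarges the cone {w | w^T f_i > 0 for all i}.
    Rows of F are the label-adjusted feature vectors y_i * f_i."""
    others = np.delete(F, j, axis=0)
    # Minimize w^T f_j subject to w^T f_i >= 0 for i != j and a box bound
    # on w (the constraints are homogeneous, so the box only fixes a scale).
    res = linprog(c=F[j],
                  A_ub=-others, b_ub=np.zeros(len(others)),
                  bounds=[(-1.0, 1.0)] * F.shape[1],
                  method="highs")
    # If some w feasible for the other constraints attains w^T f_j < 0,
    # constraint j actually cuts the region, so example j is effective.
    return res.status == 0 and res.fun < -tol
```

In practice the dual convex-hull test described in Section 3.1 is cheaper; this direct test only illustrates the definition.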
Fig. 1. An example in the input and weight spaces.
Fig. 2. The optimal weight $\hat{w}$ is the center of the maximum inscribed sphere in the admissible region $A_N$.
Note that the set of effective examples, referred to as the effective set, makes the same admissible region as the full set of examples. So, some algorithms for estimating $w^*$, including SVMs, utilize only the effective examples.

Under the assumption that the feature vectors are normalized, the SVM solution has a clear geometrical picture. Finding a hyperplane that maximizes the margin results in the QP
$$\min_w \ \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad w^T f_i \geq 1. \qquad (4)$$
It is known that the SVM solution $\hat{w}$ necessarily has the form
$$\hat{w} = \sum_{i=1}^{N} a_i f_i, \qquad (5)$$
where the $a_i$ are the Lagrangian multipliers. When $a_i \neq 0$, $f_i$ is called a support vector. In other words, $\hat{w}$ consists only of support vectors. From the Karush–Kuhn–Tucker optimality conditions, the support vectors $f_i$ satisfy $\hat{w}^T f_i = 1$ and the others do not. This means that the SVM solution $\hat{w}$ is equidistant from the support vectors [3]. Since $\|\hat{w}\|$ is not necessarily unity, we consider the meaning of the above in the weight space $S^M$. It is easily shown that the normalized $\hat{w}$ in $S^M$ (that is, $\hat{w}/\|\hat{w}\|$) is still equidistant from the support vectors in the angular distance of $S^M$, and the SVM solution $\hat{w}$ is the center of the maximum inscribed sphere in the admissible region $A_N$ (Fig. 2). Note that the other examples are more distant from the center, even though they are effective.
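To make this picture concrete, the hard-margin problem (4) for normalized feature vectors can be solved with any off-the-shelf constrained-optimization routine. The following is only a minimal sketch, not the authors' implementation; it assumes SciPy, the helper name `svm_batch` is ours, and a dedicated QP solver would scale better for large problems.

```python
import numpy as np
from scipy.optimize import minimize

def svm_batch(F):
    """Solve min 1/2 ||w||^2  s.t.  F w >= 1  (hard margin, homogeneous).

    F is an (N, M) array whose rows are the label-adjusted feature
    vectors y_i * f_i.  Returns the (unnormalized) SVM solution w_hat."""
    res = minimize(
        fun=lambda w: 0.5 * w @ w,
        x0=F.mean(axis=0),                     # a simple starting point
        jac=lambda w: w,
        constraints=[{"type": "ineq",          # F w - 1 >= 0
                      "fun": lambda w: F @ w - 1.0,
                      "jac": lambda w: F}],
        method="SLSQP",
    )
    return res.x

# The rows with active constraints w_hat^T f_i = 1 are the support vectors,
# and w_hat / ||w_hat|| is the center of the maximum inscribed sphere in A_N.
```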
3. Online methods for SVM learning

3.1. Method 1: incremental SVM storing effective examples

The discussion in the previous section shows that an SVM can get the same information from the effective set alone. Thus, the incremental algorithm below, referred to as Method 1, gives the same solution as an SVM in batch mode:

1. The machine keeps the effective set of the n given examples.
2. Unless the (n+1)st example is effective, neglect it.
3. Otherwise, update the effective set by adding the (n+1)st example.

On average, this algorithm has a low computational complexity, as shown in the next section; however, it is not easy to ascertain whether an example is effective or not. One possible implementation is described below. As discussed before, an example is represented as a point in the input space and as a hyperplane in the weight space. We clarify the representation of the admissible region in the input space according to [8]. Consider the intersection (a vertex) of $M$ examples of the admissible region in the weight space. From the definition of the admissible region, no hyperplane made from an example cuts the segment between the vertex and the true weight vector $w^*$. In the input space, this property means that the vertex represents a hyperplane containing the $M$ examples, with all the other examples located on one side of that hyperplane. Hence, the admissible region in the weight space and the convex hull of the examples (the smallest convex set containing the examples) in the input space are mutually dual, and the effective examples correspond to the vertices of the convex hull in the input space (Fig. 3). We can therefore update the effective set by checking whether an added example lies inside the convex hull or not. Several packages implement this test, e.g., the function "convhulln" in MATLAB, which is based on the Delaunay triangulation.
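A minimal sketch of this update in Python follows, analogous to the MATLAB test mentioned above. The gnomonic (central) projection step, which turns the membership test on the hemisphere into an ordinary Euclidean convex-hull test, and the helper names are our own illustrative choices; the sketch also assumes enough stored examples in general position for the triangulation to be full-dimensional.

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

def gnomonic(F):
    """Central projection of hemisphere points onto the tangent plane of the
    last coordinate axis (we assume the data has been rotated so that every
    example has a positive last coordinate)."""
    return F[:, :-1] / F[:, -1:]

def update_effective_set(P, p_new):
    """Method 1 update on projected points P of shape (n, d)."""
    if Delaunay(P).find_simplex(p_new[np.newaxis, :])[0] >= 0:
        return P                       # inside the hull: not effective, discard
    P = np.vstack([P, p_new])
    return P[ConvexHull(P).vertices]   # hull vertices = effective examples

# usage: P = gnomonic(E);  p_new = gnomonic(f_new[np.newaxis, :])[0]
```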
3.2. Method 2: incremental SVM storing support vectors

Since the effective set carries all the information in the given examples, any support vector is effective by definition. To overcome the difficulty of knowing whether a new example is effective or not, we propose storing the support vectors instead of the effective examples. Since the support vectors are a subset of the effective examples, some information may be lost, as described later, but it is easy to decide whether an example is a new support vector or not: the example is a support vector if and only if its distance from the current separating hyperplane is less than the current margin. Hence, the algorithm referred to as Method 2 can be written as follows:

1. The machine keeps the set of support vectors of the n given examples.
2. If the (n+1)st example is more distant from the separating hyperplane than the current margin, neglect it.
3. Otherwise, update the set of support vectors by running an SVM solver on the stored support vectors together with the (n+1)st example.

Method 2 neglects a new example that is effective but not a support vector. Since such an example may become a support vector in the future, it is expected that Method 2 has a lower performance than a conventional SVM or Method 1. For instance, when a new example is located in region (a) in Fig. 4, the example is thrown away by both incremental algorithms; in (b), both algorithms update their stored sets; in (c), an effective example is neglected by Method 2.
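A compact sketch of one Method 2 step, under the same hard-margin, homogeneous setting, is given below. It reuses the batch solver sketched in Section 2 (`svm_batch`, our own name) as the inner SVM solver applied to the stored support vectors plus the new example; in the scaling of (4), the margin is $1/\|\hat{w}\|$ and "farther than the margin" is equivalent to $\hat{w}^T f \geq 1$.

```python
import numpy as np

def method2_update(S, w_hat, f_new, tol=1e-9):
    """One step of Method 2.  S: (k, M) stored support vectors,
    w_hat: current SVM solution (unnormalized, so the margin is 1/||w_hat||),
    f_new: new label-adjusted example.  Returns the updated (S, w_hat)."""
    # Distance of f_new from the hyperplane is |w_hat^T f_new| / ||w_hat||,
    # so "farther than the current margin 1/||w_hat||" means w_hat^T f_new >= 1.
    if w_hat @ f_new >= 1.0 - tol:
        return S, w_hat                 # farther than the margin: neglect it
    # Otherwise re-solve on the stored support vectors plus the new example
    # and keep only the examples whose constraints are active.
    cand = np.vstack([S, f_new])
    w_hat = svm_batch(cand)             # sketched earlier; any SVM solver works
    active = cand @ w_hat <= 1.0 + tol
    return cand[active], w_hat
```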
Fig. 3. The admissible region (the hatched area, right) in the weight space $S^M$ is expressed as the convex hull of the examples in the input space $S^M_+$.
Fig. 4. The difference between the incremental algorithms appears in case (c).
4. Statistical analyses of incremental SVMs

4.1. Computational complexity

In the computer science community, computational complexity is usually evaluated in the worst case, which for incremental SVMs occurs when all the given examples are effective or are support vectors. Although the complexity cannot be reduced in such a case, it is rare in practice. Here, we instead evaluate the complexity on average, which is more important for practitioners. The number of stored examples affects the cost of deciding whether an additional example is effective (Method 1) or a support vector (Method 2), as well as the cost of the update procedure in each method.

Therefore, we analyze the expected number of effective examples in the asymptotics of $N \to \infty$. In the analysis, the examples are assumed to be distributed uniformly and independently in $S^M_+$, as is done in [8], since general cases are too difficult to analyze. Although this assumption is not satisfied in general, the results hold, exactly or asymptotically, when the examples obey a Gaussian distribution. Moreover, if the feature vectors are localized and form a submanifold of the feature space, the space can be regarded as a lower-dimensional space or be decomposed into lower-dimensional spaces [4].

As discussed in the previous section, the number of effective examples is equal to the number of vertices of the convex hull. The study of the expected number of vertices and other geometrical properties of convex hulls of randomly chosen points is called integral or stochastic geometry [12]. Instead of the expected number of vertices of the convex hull, $V_N$, Ikeda and Amari [8] derived the expected number of its faces, $F_N$, by applying the method in [2] to $S^M$.

Theorem 1. In the asymptotics of $N \to \infty$,
$$F_N = \frac{\pi^M}{M I_M}, \qquad (6)$$
where
$$I_M = \int_0^{\pi/2} \sin^{M-1}\phi \, d\phi. \qquad (7)$$
Note that $F_N$ increases as $M \to \infty$, since $I_M$ is a monotonically decreasing function of $M$. Since each face consists of $M$ vertices with probability one and each vertex is on more than $M$ faces, the following corollary holds.

Corollary 2. In the asymptotics of $N \to \infty$,
$$V_N \leq \frac{\pi^M}{M I_M}. \qquad (8)$$

This means that the expected number of stored examples is bounded, irrespective of $N$, even as $N$ goes to infinity. Therefore, Method 1 has a computational complexity of order $N$ on average. Since any support vector is effective, the expected number of support vectors is less than $V_N$; a tighter upper bound on the expected number of support vectors is the dimension $M$ of the feature space. Hence, Method 2 also has a computational complexity of order $N$ on average.
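As a quick numerical check of the bound (8), the integral (7) can be evaluated directly and the bound tabulated for a few dimensions. This is only an illustrative computation of the formula above, assuming SciPy is available; the function name is ours.

```python
import numpy as np
from scipy.integrate import quad

def effective_bound(M):
    """Upper bound pi^M / (M * I_M) on the expected number of stored
    examples, with I_M = integral_0^{pi/2} sin^{M-1}(phi) dphi, Eqs. (7)-(8)."""
    I_M, _ = quad(lambda phi: np.sin(phi) ** (M - 1), 0.0, np.pi / 2)
    return np.pi ** M / (M * I_M)

for M in (2, 3, 5, 10):
    print(M, effective_bound(M))
# For M = 3 the bound is pi^3 / (3 * pi/4) = 4*pi^2/3, roughly 13, while it
# grows rapidly with M, consistent with the simulation remarks in Section 5.
```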
4.2. Learning curve of Method 1

The expected number of effective examples $V_N$ is also useful for deriving an upper bound on the learning curve of Method 1. Here, by the learning curve we mean the prediction error, i.e., the probability that a learning machine given $N$ examples mispredicts the output for a new input, averaged over the examples.

Consider a bad algorithm, here named the worst algorithm, that gives an upper bound for SVMs (and for any algorithm whose weight vector lies in the admissible region). Unless a new input intersects the admissible region, any weight in the region correctly predicts the corresponding output. Otherwise, the new input divides the admissible region into two sets; the weight vectors in one set correctly predict its output while those in the other do not. The worst algorithm, according to the given new input, always chooses a weight vector from the latter set. Hence, the worst algorithm mispredicts the output if and only if the new input intersects the admissible region. In this case, the average probability of misprediction is asymptotically written as $V_N / N$, since misprediction occurs exactly when the new input is effective. It follows that the learning curve of Method 1 is bounded by that of the worst algorithm and hence converges at the order of $1/N$.

4.3. Learning curve of Method 2

As mentioned before, Method 2 would have a lower performance than Method 1 since some effective examples are thrown away. This subsection shows, by a rather rough analysis, how much the performance is degraded. The SVM solution is the center of the maximum sphere inscribed in the admissible region, and the radius of this sphere is the margin of the separating hyperplane. Therefore, we consider how the margin is updated. When an additional example has a smaller margin, it becomes a new support vector and the margin becomes smaller than before; that is, the margin is monotonically non-increasing. Let $M_n$ denote the average margin when $n$ examples are given. If $n$ is large and $M_n$ is small, we can make the following two assumptions:

1. The probability that the set of support vectors is updated is proportional to $M_n$, i.e., $a M_n$, where $a$ is a constant.
2. The decrease of the margin is also proportional to $M_n$, i.e., $\lambda M_n$, where $\lambda$ is a constant, since the shape of the admissible region does not depend on its size from the stochastic-geometrical viewpoint.

These assumptions lead to the update equation
$$M_{n+1} = [1 - a M_n] M_n + a M_n [1 - \lambda] M_n \qquad (9)$$
$$\phantom{M_{n+1}} = M_n - a \lambda M_n^2, \qquad (10)$$
which has the solution $M_n = c/n$ for large $n$, where $c$ is a constant, by a simple calculation. As seen in the previous subsection, the probability that the effective set changes equals the learning curve of the worst algorithm and exceeds the probability that the set of support vectors is updated. This implies that the learning curve of Method 2 is also inversely proportional to $N$, which is of the same order as that of Method 1 and hence of a conventional SVM.
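The $1/n$ behaviour of the recursion (10) is easy to verify numerically; the sketch below simply iterates the update with some illustrative constants $a$ and $\lambda$ (our own choices) and compares the result with $c/n$, $c = 1/(a\lambda)$.

```python
import numpy as np

a, lam = 0.5, 0.3      # illustrative constants; the analysis only needs a, lam > 0
M = 0.1                # a small initial margin, as assumed in the analysis
margins = []
for n in range(1, 10001):
    margins.append(M)
    M = M - a * lam * M ** 2          # Eq. (10)

c = 1.0 / (a * lam)
for n in (10, 100, 1000, 10000):
    print(n, margins[n - 1], c / n)   # the two columns agree for large n
```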
5. Computer simulations
To confirm the theoretical results, computer simulations were carried out. Note that when $M$ is large the number of effective examples becomes so large in practice that Method 1 does not work even for $M = 10$. In contrast, the number of support vectors is much smaller, and hence Method 2 has a lower complexity. In the simulations, $N = 5000$ examples were chosen uniformly and independently from $S^M_+$. Methods 1 and 2 learn the examples incrementally, and Fig. 5 shows the generalization errors of the SVMs trained with both methods as the number $n$ $(\leq N)$ of given examples increases, where the dimension of the input space is $M = 3$. Both learning curves clearly have the order $1/N$ when $N$ is large, which confirms the validity of the analysis. Fig. 6 shows that Method 2 has a learning curve of the same order even when $M = 10$.
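A sketch of this experimental setup, under our reading of the text: examples are drawn uniformly from the hemisphere $S^M_+$ defined by a fixed true weight vector, and for homogeneous hyperplanes with such uniform inputs the generalization error equals the angle between the estimated and true weight vectors divided by $\pi$. The sampling trick and helper names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hemisphere(N, dim, w_star):
    """Uniform samples on the unit sphere, sign-flipped into the hemisphere
    {f : f^T w_star > 0}; equivalent to replacing (f_i, y_i) by (y_i f_i, +1)."""
    F = rng.standard_normal((N, dim))
    F /= np.linalg.norm(F, axis=1, keepdims=True)
    return F * np.sign(F @ w_star)[:, None]

def generalization_error(w_hat, w_star):
    """Misclassification probability for homogeneous hyperplanes under a
    uniform input distribution: the angle between w_hat and w_star over pi."""
    cos = w_hat @ w_star / (np.linalg.norm(w_hat) * np.linalg.norm(w_star))
    return np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi
```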
Fig. 5. Learning curves of Methods 1 (solid line) and 2 (dashed line) versus the number of examples ($M = 3$), averaged over 300 trials.

Fig. 6. Learning curve of Method 2 versus the number of examples ($M = 10$), averaged over 300 trials.

6. Conclusions and discussions

In this paper, we proposed two incremental SVM methods with linear complexity on average. Method 1, which utilizes the fact that only the effective examples are necessary and sufficient to obtain the SVM solution, achieves the same performance as an SVM in batch mode by storing the effective examples. Method 2 neglects effective examples that are not support vectors at the time they arrive, which makes it easy to implement but lowers its performance. Nevertheless, the learning curve of Method 2 has the same order, $1/N$, as Method 1 and an SVM in batch mode.
Hence, the degradation appears only in the coefficient. This was confirmed by computer simulations.

Regarding computational complexity, both methods update the stored set only when the new example changes it. Since the probability of such an update at the $n$th step is only of order $1/n$, the expected number of updates grows only as $\sum_n 1/n \approx \log N$. Hence, the complexity of rebuilding the stored set is less important than the complexity of checking whether the new example changes the stored set or not.

The analyses so far are based on the linear kernel, so the proposed methods do not work well in some cases if we employ another kernel function. For example, the Gaussian kernel has an infinite-dimensional feature space in which any feature vectors are linearly independent. This means that all the given examples are effective and hence Method 1 cannot reduce the computational complexity.1 Nevertheless, the actual dimension of a feature space is not so high in some cases. One example is the polynomial kernel, which has a feature space of effectively smaller dimension than the nominal one [4]. Another is the Gaussian kernel, whose feature space has a finite effective dimension that depends on the boundedness of the input space [5]. In such cases, Method 2 may perform well as long as the sparseness of the SVM solution is kept, though the theoretical guarantee of Section 4 no longer holds. We plan to analyze the performance of Method 2 in general cases more rigorously in the future.
1 Some analysis in such a case is given in [6].
Acknowledgment

This study was supported in part by a Grant-in-Aid for Scientific Research (14084210, 15700130, 18300078) from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References

[1] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, Cambridge, UK, 2000.
[2] B. Efron, The convex hull of a random set of points, Biometrika 52 (1965) 331–343.
[3] R. Herbrich, Learning Kernel Classifiers: Theory and Algorithms, MIT Press, Cambridge, MA, 2002.
[4] K. Ikeda, An asymptotic statistical theory of polynomial kernel methods, Neural Comput. 16 (8) (2004) 1705–1719.
[5] K. Ikeda, Boundedness of input space and effective dimension of feature space in kernel methods, IEICE Trans. Inf. Syst. E87-D (2004) 297–299.
[6] K. Ikeda, Effects of kernel function on Nu support vector machines in asymptotics, IEEE Trans. Neural Networks 17 (1) (2006) 1–9.
[7] K. Ikeda, Geometrical properties of lifting-up in the Nu support vector machines, IEICE Trans. Inf. Syst. E89-D (2006) 847–852.
[8] K. Ikeda, S.-I. Amari, Geometry of admissible parameter region in neural learning, IEICE Trans. Fundam. E79-A (1996) 938–943.
[9] K. Ikeda, T. Aoishi, An asymptotic statistical analysis of support vector machines with soft margins, Neural Networks 18 (3) (2005) 251–259.
[10] K. Ikeda, N. Murata, Geometrical properties of Nu support vector machines with different norms, Neural Comput. 17 (11) (2005) 2508–2529.
[11] D. Rumelhart, J.L. McClelland, The PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, MA, 1986.
[12] L.A. Santaló, Integral Geometry and Geometric Probability, Encyclopedia of Mathematics, vol. 1, Addison-Wesley, Reading, MA, 1976.
[13] B. Schölkopf, C. Burges, A.J. Smola, Advances in Kernel Methods: Support Vector Learning, Cambridge University Press, Cambridge, UK, 1998.
[14] A.J. Smola, P.L. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 2000.
[15] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, 1995.
[16] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, NY, 1998.

Kazushi Ikeda was born in Shizuoka, Japan, in 1966. He received the B.E., M.E., and Ph.D. degrees in mathematical engineering and information physics from the University of Tokyo, Tokyo, Japan, in 1989, 1991, and 1994, respectively. From 1994 to 1998, he was with the Department of Electrical and Computer Engineering, Kanazawa University, Kanazawa, Japan. Since 1998, he has been with the Department of Systems Science, Graduate School of Informatics, Kyoto University, Kyoto, Japan. His research interests are focused on adaptive and learning systems, including neural networks, adaptive filters, and machine learning.

Takemasa Yamasaki was born in Hyogo, Japan, in 1982. He received the B.E. degree from Kobe University, Kobe, Japan, in 2004. He is currently working toward the M.E. degree at Kyoto University, Kyoto, Japan. His research interests include machine learning and statistical science.