Expert Systems with Applications 38 (2011) 14238–14248
Reduced one-against-all method for multiclass SVM classification

M. Arun Kumar, M. Gopal

Control Group, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India

Keywords: Support vector machines; Multi-class classification; One-against-all; Text categorization
Abstract

We present an improved version of the one-against-all method for multiclass SVM classification based on subset sample selection, named reduced one-against-all, to achieve high performance in large multiclass problems. Reduced one-against-all drastically decreases the computing effort involved in training one-against-all classifiers, without any compromise in classification accuracy. Computational comparisons on publicly available datasets indicate that the proposed method has accuracy comparable to that of the conventional one-against-all method, but is an order of magnitude faster. On the largest dataset considered, the reduced one-against-all method achieved a 50% reduction in computing time over the one-against-all method for almost the same classification accuracy. We further investigated reduced one-against-all with a linear kernel for multi-label text categorization applications. Computational results demonstrate the effectiveness of the proposed method on both the text corpuses considered. © 2011 Elsevier Ltd. All rights reserved.
1. Introduction

Support vector machines (SVMs), being computationally powerful tools for supervised learning, are widely used in classification and regression problems. SVM classifiers have been successfully applied to a variety of real-world problems like particle identification, face recognition, text categorization and bioinformatics (Burges, 1998). The approach is systematic and motivated by statistical learning theory (SLT) and Bayesian arguments. The central idea of the SVM classifier is to find the optimal separating hyperplane between positive and negative examples. The optimal hyperplane is defined as the one giving maximum margin between the training examples that are closest to the hyperplane. Although SVMs enjoy good generalization in a wide variety of applications, they face hurdles in being extended to large-scale problems. This is because the training phase of SVMs involves solving a quadratic programming problem (QPP), whose size depends on the total number of training datapoints l. Conventional QPP solvers have time complexity O(l³) and need explicit storage of a kernel matrix of size l × l. Even for moderately sized problems with 10,000 training datapoints, conventional QPP solvers demand large amounts of memory and very long training times. Hence, speeding up SVM training has been an active research area among machine learning researchers for almost a decade. One way to circumvent this problem is to develop efficient algorithms that solve the same QPP with less memory and reduced training time. Many decomposition methods have
been reported in the recent past which break down the single large QPP into a series of smaller QPPs that can be solved efficiently. Popular methods include SMO (Platt, 1999), SOR (Mangasarian & Musicant, 1999), SVMlight (Joachims, 1999), and LIBSVM (Chang & Lin, 2001). The approximate time complexity of these algorithms scales as T·O(lq + q³), where T is the number of iterations and q is the size of the working set. These methods have been pivotal in realizing SVM's potential in large-scale classification tasks. Another approach is to pre-select a set of important datapoints, significantly fewer than l, and solve a smaller QPP only with this subset. This exploits the fact that the SVM decision function depends only on a small subset of training datapoints called support vectors. These support vectors lie in the proximity of the decision boundary, and the SVM decision function depends completely on this small subset of datapoints. Also, it has been reported that SVMs trained with different kernels on the same dataset have highly overlapping sets of support vectors (Scholkopf, Smola, & Vapnik, 1995). Hence, identifying insignificant datapoints that are not likely to be support vectors and removing them before solving the QPP will save a considerable amount of memory and training time. In this paper our focus is on this second approach. Many such "sample selection" methods have been reported in the literature. Popular techniques and criteria used for pre-selecting samples include: clustering (Almeida, Braga, & Braga, 2000; Koggalage & Halgamuge, 2004), fuzzy memberships (Sohn et al., 2001), Mahalanobis distance (Abe & Inoue, 2001), k-nearest neighbors (Shin & Choo, 2005), β-skeleton (Zhang & King, 2002), and Hausdorff distance (Wang, Neskovic, & Cooper, 2007). Yang and Ahuja (2000) proposed a geometric approach to select a superset of support vectors called guard vectors. Lee and Mangasarian (2001) proposed reduced SVM (RSVM) using a rectangular kernel instead
of a conventional square kernel, by randomly selecting a subset of training datapoints. Zheng, Xiaofeng, Nanning, and Weipu (2003) improved RSVM by substituting random samples with cluster centroids. Although these methods do not theoretically ensure that an SVM solved with the complete training dataset and an SVM solved with pre-selected samples have the same generalization, they often achieve tractable, low-cost solutions without any significant loss in accuracy, which is the key idea of soft computing. In this paper we propose to use posterior probability estimates of SVM outputs for subset sample selection in the multiclass scenario.

SVMs were originally developed for binary classification and their extension to multi-class classification is an ongoing research issue. The current approaches to multi-class SVM methods either construct several binary SVMs or solve a single large optimization problem. It has been shown that solving a single large optimization problem takes more computing time and hence is not suitable for practical applications (Fei & Liu, 2006; Hsu & Lin, 2002; Rifkin & Klautau, 2004). One of the earliest and simplest implementations of multi-class SVM is the one-against-all (OAA) method (Vapnik, 1998). For a K class problem it constructs K binary SVMs, one for each class. Each class SVM is trained to separate its own datapoints from the datapoints of the other classes. A new datapoint is classified to the class whose binary SVM gives the largest value. Another popular approach is to train K(K − 1)/2 binary SVMs, where each binary SVM is trained on datapoints from two classes. This approach is called the one-against-one (OAO) method (Knerr, Personnaz, & Dreyfus, 1990; Hastie & Tibshirani, 1998). After training the K(K − 1)/2 binary SVMs, testing is done using the "Max Wins" voting strategy. An interesting modification to OAO is the directed acyclic graph SVM (DAGSVM) proposed by Platt and Christiani (1999). Its training phase is the same as the OAO method; however, in the testing phase it uses a rooted binary directed acyclic graph with K(K − 1)/2 internal nodes and K leaves. The advantage of DAGSVM is that it requires only (K − 1) binary classifiers to be evaluated during testing, instead of the K(K − 1)/2 evaluations required in OAO. Recently, Fei and Liu (2006) have proposed the Binary Tree of SVM (BTS) for multiclass classification. BTS can be seen as an improvement to OAO in which a binary SVM, trained between two classes, also separates some other classes, and hence obviates the need for several binary SVMs. They showed that BTS requires (K − 1) binary SVMs to be trained in the best case and log_{4/3}((K + 3)/4) binary SVMs to be evaluated on average while testing. Pedrajas and Boyer (2006) have recently proposed an approach, named all and one (AO), based on the combination of two methods: OAA and OAO. AO combines the strengths of both methods and partially avoids their sources of failure. A critical review of multiclass methods can be found in the literature (Hsu & Lin, 2002; Rifkin & Klautau, 2004). Rifkin and Klautau (2004) conducted many carefully controlled experiments and came to the conclusion that if all binary SVMs are properly tuned then simple OAA is as accurate as any other multiclass method.

The approach we propose in this paper is an improvement to OAA by sample selection.
This is based on the observation that, for a K class problem in OAA, we train K binary SVMs as K independent problems, despite the fact that all K binary problems are related and are solved with the same set of training datapoints (with different targets). A binary SVM, once solved, does not share any useful experience with the other binary SVMs still to be solved. We propose a simple approach in which the knowledge gained from solving a binary SVM is used to reduce the training set of future binary SVMs so that they can be solved more efficiently. We call this method reduced OAA (R-OAA). We have compared R-OAA and OAA in terms of classification accuracy and training time on several datasets. We have also extended R-OAA to multilabel text categorization
and compared it with OAA's extension. We do not make comparisons with other popular approaches like OAO in this paper, for the following reasons:

a. Benchmark comparisons of multiclass SVM approaches already exist in the literature (Hsu & Lin, 2002; Rifkin & Klautau, 2004; Fei & Liu, 2006).
b. It has been concluded by Rifkin and Klautau (2004) that OAA is as accurate as any other approach, assuming that all underlying binary SVMs are well tuned.
c. We concluded in our recent study that OAA performs better than OAO, DAGSVM, BTS, and AO in terms of generalization on multiclass text categorization.
d. Moreover, OAA is an independent method which can easily be extended to handle multi-label datasets. To our knowledge, there are no extensions of the OAO family of methods that handle multi-label classification.

The paper is organized as follows. In Section 2 we briefly discuss the SVM problem formulation, its extension to multiclass using OAA, and the scope for improvements in OAA. Section 3 describes our R-OAA approach. Computational comparisons on standard datasets are given in Section 4. In Section 5 we investigate the performance of R-OAA with a linear kernel for multilabel text categorization; Section 6 gives concluding remarks.

In this paper, all vectors are column vectors unless transformed to row vectors by a prime ′. A column vector of ones in a real space of arbitrary dimension is denoted by e. A vector of zeros in a real space of arbitrary dimension is denoted by 0e. For a matrix A ∈ R^{l×n}, A_i is the ith row of A, which is a row vector in R^n. For a vector x ∈ R^n, x∗ denotes the vector in R^n with components (x∗)_i = 1 if x_i > 0 and 0 otherwise, i = 1, . . . , n. In other words, x∗ is the result of applying the step function component-wise to x.

2. Problem formulation

2.1. Binary classification using SVM

SVMs represent novel learning techniques that have been introduced in the framework of structural risk minimization (SRM) and the theory of VC bounds. Compared to state-of-the-art methods, SVMs have shown excellent performance in pattern recognition tasks. In the simplest binary pattern recognition tasks, SVMs use a linear separating hyperplane to create a classifier with maximal margin. Consider the problem of binary classification wherein a linearly inseparable dataset X of l points in real n-dimensional feature space is represented by the matrix X ∈ R^{l×n}. The corresponding target or class of each datapoint X_i, i = 1, 2, . . . , l, is represented by a diagonal matrix D ∈ R^{l×l} with entries D_ii equal to +1 or −1. Given the above problem, the SVM linear soft-margin problem is to solve the following primal QPP (Burges, 1998):

Min_{w,b,y}  (1/2) w′w + C e′y
subject to   D(Xw + eb) + y ≥ e,  y ≥ 0e,     (1)
where C is a penalty parameter and y ∈ R^l are the nonnegative slack variables. The optimal decision function (linear separating hyperplane) at the solution of (1) can be expressed as:
f(x) = w′x + b,     (2)

where x ∈ R^n is any (new) datapoint, and w ∈ R^n and b ∈ R are parameters. Since the number of constraints in (1) is large, the dual of (1) is usually solved. The Wolfe dual of (1) is (Mangasarian, 1998):
Max_α  e′α − (1/2) α′DXX′Dα
subject to  e′Dα = 0,  0e ≤ α ≤ Ce,     (3)
where α ∈ R^l are the Lagrangian multipliers. The optimal decision function is the same as in (2), with parameters given by:

w = X′Dα,
b = (1/N_sv) Σ_{i=1}^{N_sv} (1/D_ii − X_i w),     (4)
The datapoints X_i for which 0 < α_i < C are called support vectors SV of the decision function f(x), and N_sv denotes the total number of support vectors (SV = {X_i | 0 < α_i < C} and N_sv = |SV|). The decision boundary f(x) = 0 lies midway between the bounding hyperplanes given by:
w′x + b = 1   and   w′x + b = −1,     (5)

and separates the two classes from each other with a margin of 2/‖w‖.
A new datapoint x ∈ R^n is classified as +1 or −1 according to whether the step function (f(x))∗ yields 1 or 0, respectively. An important characteristic of SVM is that it can be extended in a relatively straightforward manner to create nonlinear decision boundaries (Vapnik, 1998). For the nonlinear case, the input data is transformed into a high-dimensional feature space using the kernel map Φ(·), and a linear separating hyperplane is obtained in that high-dimensional feature space. Hence, the nonlinear version of the dual problem (3) is solved:
Max_α  e′α − (1/2) α′DK(X, X′)Dα
subject to  e′Dα = 0,  0e ≤ α ≤ Ce,     (6)
where K_ij = K(X_i, X_j) = Φ(X_i)′Φ(X_j). The optimal decision function can be expressed as:

f(x) = Σ_{i=1}^{N_sv} α_i D_ii K(x, X_i) + b.     (7)
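To make the binary formulation concrete, the following minimal sketch trains a single soft-margin SVM with a Gaussian kernel and evaluates its decision function f(x). It is only an illustration: scikit-learn's SVC is used as a stand-in QPP solver (the experiments in this paper used SVMlight), and the toy data and the values of C and the kernel parameter (passed as gamma) are assumptions.

import numpy as np
from sklearn.svm import SVC

# Toy binary problem: X is l x n, d holds the +1/-1 targets (the diagonal of D).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
d = np.hstack([np.ones(50), -np.ones(50)])

# Gaussian kernel K(Xi, Xj) = exp(-mu * ||Xi - Xj||^2); 'gamma' plays the role of mu.
C, mu = 8.0, 0.5                      # illustrative values, normally chosen by validation
svm = SVC(C=C, kernel="rbf", gamma=mu)
svm.fit(X, d)

# f(x) = sum_i alpha_i * D_ii * K(x, X_i) + b, evaluated here by decision_function.
x_new = np.array([[1.5, 1.5]])
f_val = svm.decision_function(x_new)[0]
label = 1 if f_val > 0 else -1        # step-function classification as in (2)/(7)
print(f"f(x) = {f_val:.3f}, predicted class = {label:+d}")
print("number of support vectors:", svm.n_support_.sum())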
2.2. Multiclass SVM classification using the OAA method

A popular and simple way of combining several binary SVM classifiers for multiclass classification is the OAA method (Vapnik, 1998). Given a K class problem, the OAA method constructs K binary SVM classifiers with decision functions f_1(x), f_2(x), . . . , f_K(x), one for each class. The binary SVM corresponding to a particular class i is obtained by training all datapoints of class i against the remaining datapoints of the other K − 1 classes. A new datapoint x ∈ R^n is classified based on the decision function values of all K binary SVMs. All K decision functions are evaluated at x and the datapoint is assigned to the class whose decision function gives the highest output. The OAA SVM classifier's decision function f(x) is defined as:
f(x) = arg max_{i ∈ {1, 2, ..., K}} f_i(x).     (8)
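As an illustration of (8), a minimal OAA sketch follows: one binary SVM is trained per class against the rest, and a new point is assigned to the class with the largest decision value. The linear kernel and the synthetic three-class data are assumptions made only to keep the example self-contained.

import numpy as np
from sklearn.svm import SVC

def train_oaa(X, y, C=1.0):
    """Train one binary SVM per class (class k vs. rest); return {class: classifier}."""
    models = {}
    for k in np.unique(y):
        d = np.where(y == k, 1.0, -1.0)          # +1 for class k, -1 for all others
        models[k] = SVC(C=C, kernel="linear").fit(X, d)
    return models

def predict_oaa(models, X):
    """Assign each point to the class whose decision function f_k(x) is largest (Eq. (8))."""
    classes = sorted(models)
    F = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array(classes)[np.argmax(F, axis=1)]

# Tiny 3-class example.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 30)
models = train_oaa(X, y)
print("training accuracy:", np.mean(predict_oaa(models, X) == y))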
Note that in the training phase OAA needs K binary SVMs to be trained, and in the testing phase it needs K binary SVMs to be evaluated for each test datapoint. Before proceeding to R-OAA, we make clear the notion of "unique" support vectors in OAA. Each decision function f_i(x) has its own support vector set SV_i, and the union of all these support vector sets gives the unique support vector set SV_c = ∪_{i=1,2,...,K} SV_i of the multiclass problem. This set SV_c completely defines the decision function f(x) in (8), as each f_i(x) depends on a small subset of SV_c. Now, we can define the unique support vector set for class i as SV_ci = {x ∈ SV_c | class(x) = i}. It is worth mentioning that a unique support vector of class i is not necessarily a support vector of the decision function f_i(x), though the converse is true.
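The unique support vector sets could be collected from the K trained binary SVMs roughly as follows; support_ is scikit-learn's array of support vector indices, and the variable names are illustrative.

import numpy as np

def unique_support_vectors(models, y):
    """SV_c = union of the support vector index sets of all K binary SVMs;
    SV_ci = the members of SV_c whose own class label is i."""
    sv_c = set()
    for clf in models.values():
        sv_c.update(clf.support_.tolist())       # indices of this SVM's support vectors
    sv_ci = {i: {j for j in sv_c if y[j] == i} for i in np.unique(y)}
    return sv_c, sv_ci

# continuing from the OAA example above:
# sv_c, sv_ci = unique_support_vectors(models, y)
# print({i: len(s) for i, s in sv_ci.items()})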
2.3. Scope for improvements in OAA

The OAA method is highly successful because of its simplicity and good generalization ability; however, there is good scope for improvements in OAA. This is because, for a K class problem, OAA solves K binary SVM problems as K independent problems, despite the fact that all these problems are related. That is, OAA does not pass any useful experience learnt from one problem to another. If we are able to learn some useful information about the structure and distribution of datapoints from the first binary SVM we solve, it can be used for the remaining K − 1 binary SVMs. The advantage is apparent particularly when K is sufficiently large. Moreover, the information multiplies as we proceed to further problems, which means that when solving the tail binary SVMs we will have a good amount of information that helps in solving them more efficiently. Thus, accumulation of information from each binary SVM solved gives abundant information for solving the tail binary SVMs. Also, in OAA there is no guide to select which binary SVM should be learned first or be preferred over others, as all binary SVMs are treated as independent problems. On these lines we describe our approach R-OAA, which incorporates "learning from past experience", in the form of sample selection, into the conventional OAA method.

3. Reduced OAA for multiclass SVM classification

Based on the ideas presented in the previous section, our objective is to learn from each binary SVM problem we solve, and incorporate this knowledge into future problems (in the form of subset selection) to be solved. In what follows, we illustrate how this can be done through a simple 2-D toy example, shown in Fig. 1(a). This example has 3 classes and hence the OAA method requires 3 binary SVMs. Assuming that we first solve the binary SVM for class 1, Fig. 1(a) also shows the decision boundary f_1(x) = 0 obtained between datapoints of class 1 and datapoints of the other classes. This decision boundary divides the two-dimensional space into two mutually exclusive regions, a positive region (which contains most class 1 datapoints) and a negative region (which contains most class 2 and class 3 datapoints). Based on this decision boundary f_1(x) = 0, we define a closeness measure S_1(x) such that it gives small values for datapoints closer to f_1(x) = 0 and large values for datapoints away from f_1(x) = 0. By selecting a threshold parameter δ, we define a δ-region around the decision boundary f_1(x) = 0, such that this δ-region includes all datapoints with closeness measure S_1(x) ≤ δ. This δ-region is shown in Fig. 1(b). From this figure we make the following observations:

a. Datapoints of class i which lie inside the δ-region of the decision boundary f_i(x) = 0 are good representatives of class i against all other classes in the multiclass SVM scenario. This is because the δ-region includes almost all unique support vectors of class i, given that δ is sufficiently large.
b. Datapoints of class i which are on the positive side of the decision boundary, lying outside the δ-region, are less likely to become unique support vectors of class i and will be classified correctly.
c. Datapoints of class i which are on the negative side of the decision boundary, lying outside the δ-region, turn out to be outliers and will be misclassified.
d. Datapoints that do not belong to class i and are on the positive side of the decision boundary, lying outside the δ-region, turn out to be outliers and will be misclassified.
Based on these observations, in our toy example (Fig. 1(b)), we can omit datapoints of class 1 that are outside the δ-region while solving subsequent binary SVMs.
Fig. 1. R-OAA Multiclass classification for a 2-D toy dataset.
Also, datapoints that do not belong to class 1, are on the positive side of the decision boundary, and lie outside the δ-region can be omitted. Now, this reduced dataset can be used to train the next binary SVM, for class 2. Fig. 1(c) shows this reduced dataset and the decision boundary f_2(x) = 0 obtained from it. Again, by defining a δ-region around this decision boundary f_2(x) = 0, the dataset can be further reduced and used to train the last binary SVM, for class 3. Fig. 1(d) shows the decision boundary f_3(x) = 0 and the reduced dataset used to obtain it. Thus, if we incorporate such learning into all the binary SVMs we train, the tail binary SVMs will have fewer training datapoints and hence can be solved more efficiently.

To validate these observations, we conducted trial experiments on six multiclass datasets. Fig. 2 presents these results. Each graph shows the count of unique support vectors of class i, i.e. |SV_ci|, and the count of unique support vectors of class i that lie inside the δ-region of the corresponding decision boundary f_i(x) = 0, i.e. |δSV_ci|, where δSV_ci = {x ∈ SV_ci | S_i(x) ≤ δ}. For these experiments we converted each binary SVM's output to a closeness measure using probability estimates and defined the δ-region with δ = 0.35. We discuss probability estimates in the following material. It can be observed from these graphs that most of the unique support vectors of a class lie close to the corresponding decision boundary and hence are captured by the δ-region.

For identifying the δ-region, a closeness measure between training datapoints and the decision boundary f_i(x) = 0 is to be defined. We propose to use the probabilistic output of the SVM as the closeness measure in this paper. In general, the SVM decision function f_i(x) outputs uncalibrated values, which can be converted to posterior probability estimates by fitting a sigmoidal function at its output. A discussion
on different types of probabilistic outputs for SVM can be found in Fei and Liu (2006). Following (Fei & Liu, 2006), we adopted the posterior probability estimate given by:
P(class = i | f_i(x)) = 1 / (1 + exp(−f_i(x))),
P(class = i | f_i(x)) = 0.5 when f_i(x) = 0.     (9)

The above expression can be modified as

ΔP_i(x) = P(class = i | f_i(x)) − 0.5 = 1 / (1 + exp(−f_i(x))) − 0.5,     (10)
where |ΔP_i(x)| is the closeness measure S_i(x) between training datapoints and the decision boundary f_i(x) = 0. Note that for f_i(x) = 0, S_i(x) = 0; when f_i(x) → ∞, S_i(x) → 0.5; and when f_i(x) → −∞, S_i(x) → 0.5. Thus, f_i(x) in the range (−∞, ∞) gets mapped to S_i(x) in the range [0, 0.5), which also becomes a valid range for selection of the threshold parameter δ. Hence, after training a binary SVM for class i, using S_i(x) we can remove datapoints of class i whose closeness estimate satisfies S_i(x) > δ (based on observations b and c). Also, datapoints of other classes lying on the positive side of the decision boundary f_i(x) = 0 with S_i(x) > δ can be removed (based on observation d). Although [0, 0.5) is a valid range for the threshold parameter δ, the usable range is (0.231, 0.5) based on our requirements. This is because, for any decision function f_i(x), support vectors of class i have f_i(x) = 1 and hence S_i(x) = 0.231. Selecting a threshold parameter δ < 0.231 will fail to include these support vectors of class i in the δ-region. This is against our requirement that the δ-region should possibly include all unique support vectors of class i.
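A small sketch of the closeness measure of (9) and (10) and of the δ-region membership test; it also confirms numerically that a margin support vector with f_i(x) = 1 has S_i(x) ≈ 0.231, which is why the usable range of δ starts just above that value. The function names are illustrative.

import numpy as np

def closeness(f_vals):
    """S_i(x) = |P(class=i | f_i(x)) - 0.5| with P = 1 / (1 + exp(-f_i(x)))  (Eqs. (9)-(10))."""
    return np.abs(1.0 / (1.0 + np.exp(-np.asarray(f_vals))) - 0.5)

def in_delta_region(f_vals, delta):
    """True for datapoints whose closeness measure does not exceed the threshold delta."""
    return closeness(f_vals) <= delta

print(closeness(0.0))    # 0.0      -> on the decision boundary
print(closeness(1.0))    # ~0.2311  -> a margin support vector, hence delta > 0.231 in practice
print(in_delta_region([-0.2, 1.0, 4.0], delta=0.35))   # [ True  True False]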
Fig. 2. Validation of observations about the δ-region.
Selecting δ = 0.231 makes the δ-region coincide with the margin area (−1 ≤ f_i(x) ≤ 1) of the decision function f_i(x). While this ensures inclusion of all support vectors of class i in the δ-region, it is less likely to include all unique support vectors of class i. During our trial experiments we observed a significant decrease in accuracy with δ = 0.231. Practically, δ should be slightly greater than 0.231. Also, it is easy to see that as δ → 0.5, R-OAA becomes OAA, as the δ-region includes all datapoints and hence no reduction in subsequent datasets can be achieved. In general, a small δ gives more reduction and hence faster training. However, if δ is too small, it may miss many support vectors and give poor accuracy. We conducted all our experiments for three values of δ (0.25, 0.3 and 0.35), and we observed a significant reduction in training time without significant loss in accuracy across many datasets, for all these values.

Obviously, the above process of removing less significant datapoints after each binary SVM is learnt will improve the efficiency of future binary SVMs, irrespective of the order in which the K binary SVMs are solved. But is it possible to benefit from heuristically defining an order in which the K binary SVMs should be solved? We explored three heuristic possibilities for defining such an order. A simple approach is to give preference to a class having more datapoints. That is, we solve the K binary SVMs in descending order with respect to the total number of datapoints in each class. This approach is logical because, when we solve the class with the most datapoints first, we can expect maximum reduction in the datasets to be used for the future SVMs. Another approach is to select a class whose datapoints lie far away from the datapoints of the other classes. The idea behind this is that if we remove insignificant datapoints of a class which is far away from the other
classes, the reduction is likely to have less impact on the future decision functions. Hence we can expect the decision boundaries of future binary SVMs not to deviate much from the true ones obtained by solving the binary SVMs with the complete dataset. The following method helps in achieving such an order. First, the centers of all classes in the multiclass problem are calculated. Then, the average of all these centers is calculated as the problem's center. Now, the Euclidean distance d_i between the center of class i and the problem's center can be interpreted as a measure of how far the datapoints of a class lie from the datapoints of all other classes. All K classes are sorted in descending order of d_i to get the order in which the K binary SVMs should be solved. This method is a modified interpretation of the one described in Fei and Liu (2006), where the authors were interested in finding the two classes that are closest to the problem's center. Another method we consider here is an improved version of the one described above. This method is due to Wang, Shi, Wu, and Wang (2006), where the distribution of classes is also considered in measuring the separability of a class. Following Wang et al. (2006), for a K class problem, the separability measure of class i can be defined as

sm_i = d_i / σ_i,     (11)

where d_i is the Euclidean distance between the center of class i and the problem's center, and σ_i is the variance of class i. Hence the order can be obtained by sorting the separability measure of all K classes in descending order. The above separability measure is defined in input space; however, it is also possible to extend it to high-dimensional feature space (Wang et al., 2006).
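The ordering heuristics discussed above (class size, distance of the class center from the problem center, and the separability measure of (11)) might be computed along the following lines; treating σ_i as the total variance of the class-i datapoints is an assumption, since the exact variance definition is not spelled out here.

import numpy as np

def class_order(X, y, criterion="size"):
    """Return the class labels in the order in which the K binary SVMs should be trained."""
    classes = np.unique(y)
    if criterion == "size":                      # most populous class first
        scores = np.array([np.sum(y == k) for k in classes])
    else:
        centers = np.array([X[y == k].mean(axis=0) for k in classes])
        problem_center = centers.mean(axis=0)    # average of the class centers
        d = np.linalg.norm(centers - problem_center, axis=1)
        if criterion == "center":                # farthest class center first
            scores = d
        elif criterion == "separability":        # sm_i = d_i / sigma_i   (Eq. (11))
            sigma = np.array([X[y == k].var() for k in classes])  # assumed: total variance
            scores = d / sigma
        else:
            raise ValueError(criterion)
    return classes[np.argsort(-scores)]          # descending order

# example: order = class_order(X, y, criterion="center")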
The separability measure is an interesting approach, but it needs more computing time, particularly in its kernelized version. Hence, in our experiments we considered the other two approaches for defining the order (number of datapoints in a class, and center of classes). We now give an explicit statement of the algorithm for our R-OAA method. The data structures used in the algorithm are as follows: Train_data: the set of training datapoints. Rem_index: a vector containing indices of training datapoints that are identified as insignificant. O: a K × 1 vector containing the order in which the K binary SVMs are to be trained.

3.1. ALGORITHM: reduced OAA for multiclass classification

Given a K class dataset, reduced OAA classification can be performed with the following steps:

(i) Choose a kernel function (linear/Gaussian/etc.).
(ii) Select the penalty parameter C. Usually this parameter is selected based on validation.
(iii) Obtain the order O in which the K binary classifiers should be trained (using any of the three methods described above).
(iv) Select a threshold value δ for removing datapoints (between 0.231 and 0.5).
(v) Initialize Rem_index to null.
(vi) Repeat the following steps for i = 1, 2, . . . , K:
  (a) Remove from Train_data the datapoints whose indices are stored in Rem_index.
  (b) Set Rem_index to null.
  (c) Construct a binary classifier f_O(i)(x) between datapoints of class O(i) and the remaining datapoints from Train_data.
  (d) Evaluate f_O(i)(x) and S_O(i)(x) = |ΔP_O(i)(x)| for the datapoints in Train_data.
  (e) Find datapoints of class O(i) for which S_O(i)(x) > δ and store their index values in Rem_index.
  (f) Find datapoints that do not belong to class O(i) such that f_O(i)(x) > 0 and S_O(i)(x) > δ, and store their index values in Rem_index.
  (g) Store the decision function f_O(i)(x).
(vii) Test new datapoints as in conventional OAA, by evaluating the decision functions f_1(x), f_2(x), . . . , f_K(x).
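A compact sketch of the R-OAA training loop of Section 3.1, with ordering by class size and the closeness threshold δ; scikit-learn's SVC again stands in for SVMlight, and the interface is illustrative rather than a faithful reproduction of the authors' MATLAB implementation.

import numpy as np
from sklearn.svm import SVC

def train_roaa(X, y, C=1.0, gamma=1.0, delta=0.3):
    """Reduced one-against-all: train K binary SVMs, shrinking the training set after each."""
    keep = np.ones(len(y), dtype=bool)                            # datapoints still in Train_data
    order = sorted(np.unique(y), key=lambda k: -np.sum(y == k))   # step (iii): largest class first
    models = {}
    for k in order:
        d = np.where(y[keep] == k, 1.0, -1.0)
        clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[keep], d)  # step (vi c)
        models[k] = clf
        idx = np.flatnonzero(keep)
        f = clf.decision_function(X[idx])                  # step (vi d): f_k on Train_data
        s = np.abs(1.0 / (1.0 + np.exp(-f)) - 0.5)         # closeness measure S_k(x)
        drop = ((y[idx] == k) & (s > delta)) | \
               ((y[idx] != k) & (f > 0) & (s > delta))     # steps (vi e) and (vi f)
        keep[idx[drop]] = False                            # removed before the next binary SVM
    return models

def predict_roaa(models, X):
    """Step (vii): test exactly as in conventional OAA, by the arg max of the decision values."""
    classes = sorted(models)
    F = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array(classes)[np.argmax(F, axis=1)]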
4. Experimental work on standard datasets

4.1. Experimental protocol

To demonstrate the performance of R-OAA multiclass classification, we conducted experiments on several multiclass problems from the Statlog collection (ftp://ncc.up.pt/pub/statlog/) and the UCI Repository of machine learning databases (Blake et al., 1998). From the UCI Repository we selected two ten-class datasets: optdigit and pendigit. From the Statlog collection we chose the dna, satimage, and shuttle datasets. In addition to the above-mentioned datasets, we also performed experiments on the USPS (Hull, 1994) dataset of handwritten digit classification. Table 1 gives a description of all these datasets. We also conducted experiments on the same datasets using OAA, for comparison against R-OAA. Binary SVM classification problems arising from both OAA and R-OAA were solved with the SVMlight (Joachims, 1999) software. Datasets were pre-normalized to the range [−1, 1]. All experiments were implemented in the MATLAB 7.3.0 (R2006b) (http://www.mathworks.com) environment on a PC with an Intel Core2Duo processor (2.13 GHz) and 1 GB RAM.

Table 1
Description of datasets.

Dataset    #Class   #Training data   #Testing data   #Feature
dna        3        2000             1186            180
optdigit   10       3823             1797            64
satimage   6        4435             2000            36
USPS       10       7291             2007            256
pendigit   10       7494             3498            12
shuttle    7        43500            14500           9
Table 2
Dna.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 8        8                8               8
μ                 0.002    0.002            0.002           0.002
Accuracy (%)      96.03    96.03            95.95           95.95
#l1               2000     2000             2000            2000
#SVs              603      603              603             603
#l2               2000     1502             1366            1251
#SVs              439      434              423             408
#l3               2000     1462             1240            1021
#SVs              450      446              437             418
Avg. #l           2000     1654.7           1535.3          1424
Total Tr. time    2.37     1.82             1.67            1.51
Table 3
Optdigit.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 2048     2048             2048            2048
μ                 0.125    0.125            0.125           0.125
Accuracy (%)      98.72    98.77            98.72           98.72
#l1               3823     3823             3823            3823
#SVs              552      552              552             552
#l2               3823     3810             3725            3609
#SVs              571      571              571             564
#l3               3823     3716             3539            3359
#SVs              607      607              607             604
#l4               3823     3705             3432            3143
#SVs              593      593              592             584
#l5               3823     3673             3280            2887
#SVs              651      650              642             623
#l6               3823     3655             3200            2721
#SVs              626      626              621             609
#l7               3823     3608             3061            2470
#SVs              695      694              694             669
#l8               3823     3564             2933            2273
#SVs              491      489              488             477
#l9               3823     3531             2768            2012
#SVs              465      461              461             441
#l10              3823     3423             2540            1710
#SVs              635      635              642             604
Avg. #l           3823     3650.8           3230.1          2800.7
Total Tr. time    13.31    11.8             8.8             7.1
4.2. Comparison strategy

We compared the OAA and R-OAA methods in the following five aspects:

(i) Classification accuracy: the ratio of the total number of correctly classified test datapoints to the total number of test datapoints.
(ii) The total number of datapoints removed by R-OAA after learning each binary SVM (reduction).
(iii) The total number of support vectors for each binary SVM learned.
(iv) The average number of datapoints used for training the K binary SVMs.
(v) The total training time spent on learning the K binary SVMs.

Table 4
Satimage.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 32       32               32              32
μ                 2        2                2               2
Accuracy (%)      91.85    91.9             91.9            91.9
#l1               4435     4435             4435            4435
#SVs              1118     1118             1118            1118
#l2               4435     4336             4027            3715
#SVs              1398     1398             1398            1391
#l3               4435     4042             3545            3102
#SVs              1308     1308             1307            1297
#l4               4435     3759             3123            2534
#SVs              1060     1054             1042            1023
#l5               4435     3758             3065            2369
#SVs              1153     1148             1134            1114
#l6               4435     3725             2969            2190
#SVs              886      885              883             867
Avg. #l           4435     4009.2           3527.3          3057.5
Total Tr. time    14.05    12.28            10.59           8.1

Table 5
USPS.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 16       16               16              16
μ                 0.0078   0.0078           0.0078          0.0078
Accuracy (%)      95.86    95.83            95.75           95.31
#l1               7291     7291             7291            7291
#SVs              243      243              243             243
#l2               7291     6469             6292            6200
#SVs              87       87               86              84
#l3               7291     6194             5478            5246
#SVs              339      339              339             334
#l4               7291     5796             4969            4663
#SVs              209      202              201             193
#l5               7291     5425             4468            4090
#SVs              320      320              313             310
#l6               7291     5077             4028            3579
#SVs              300      299              293             275
#l7               7291     4818             3624            3078
#SVs              210      205              207             201
#l8               7291     4434             3129            2517
#SVs              270      260              251             228
#l9               7291     4146             2693            1999
#SVs              353      352              345             315
#l10              7291     3950             2377            1601
#SVs              325      317              318             291
Avg. #l           7291     5360             4434.9          4026.4
Total Tr. time    26.95    17.31            14.43           12.36

Table 6
Pendigit.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 8        8                8               8
μ                 0.25     0.25             0.25            0.25
Accuracy (%)      98.82    98.82            98.82           98.68
#l1               7494     7494             7494            7494
#SVs              120      120              120             120
#l2               7494     7424             7009            6771
#SVs              100      100              99              99
#l3               7494     6947             6377            6058
#SVs              85       85               82              82
#l4               7494     6335             5679            5319
#SVs              189      188              187             178
#l5               7494     5871             5083            4636
#SVs              110      108              107             101
#l6               7494     5300             4418            3912
#SVs              119      118              118             118
#l7               7494     5073             3950            3268
#SVs              57       56               55              55
#l8               7494     4500             3293            2579
#SVs              99       93               92              87
#l9               7494     4028             2659            1903
#SVs              168      167              164             151
#l10              7494     3766             2167            1277
#SVs              137      137              133             127
Avg. #l           7494     5673.8           4812.9          4321.7
Total Tr. time    5.37     3.89             3.29            2.94

Table 7
Shuttle.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 1024     1024             1024            1024
μ                 16       16               16              16
Accuracy (%)      99.91    99.90            99.89           99.89
#l1               43500    43500            43500           43500
#SVs              136      136              136             136
#l2               43500    15452            12173           9890
#SVs              62       63               58              59
#l3               43500    9749             6105            3361
#SVs              102      93               81              68
#l4               43500    8263             4018            1000
#SVs              85       84               82              80
#l5               43500    8205             3947            917
#SVs              21       19               15              16
#l6               43500    8182             3921            889
#SVs              94       88               78              70
#l7               43500    8181             3919            887
#SVs              88       87               69              64
Avg. #l           43500    14505            11083           8634.9
Total Tr. time    203.96   137.09           128.42          101.27

4.3. Results and discussion

Tables 2–7 show comparison results of OAA and R-OAA on the six datasets considered. We used the Gaussian kernel for all experiments, given by K(X_i, X_j) = exp(−μ‖X_i − X_j‖²). The kernel parameter μ and the penalty parameter C were selected by cross-validation on OAA. We used the same values of μ and C for the experiments with R-OAA. In R-OAA, we solved the K binary SVMs in descending order with respect to the number of datapoints in each class. In all these tables, we show the results for the K binary SVMs of OAA in the same order for ease of comparison. For each dataset we solved R-OAA with three values of δ (0.25, 0.3, 0.35). These tables make it evident that R-OAA achieves accuracies close to those of OAA, without any significant loss, on all six
datasets considered. In fact, on the optdigit and satimage datasets, the accuracy of R-OAA is slightly higher than that of OAA. At this juncture, it should be remembered that solving an SVM with a reduced dataset does not prevent it from achieving better accuracy than an SVM solved with the complete dataset (though this is uncommon); it depends entirely on the nature of the dataset and the influence of outliers on the decision boundary. These results also show that the number of datapoints used for training each binary SVM classifier gradually decreases in R-OAA, against a fixed number of datapoints for OAA (a graphical representation is shown in Fig. 3). In particular, with the shuttle dataset (Table 7) we solve a tail binary SVM problem with 887 datapoints in R-OAA against 43500 datapoints in OAA, without a major sacrifice in accuracy. Reduction in the dataset for a binary SVM problem directly implies reduced memory usage and reduced training time. The comparison of total training time shows that R-OAA is capable of achieving a drastic reduction. R-OAA achieved around a 50% reduction in total training time on the largest dataset (Table 7) considered, without significant loss in classification accuracy.
Fig. 3. Reduction achieved by R-OAA (panels: dna, δ = 0.35; optdigit, δ = 0.25; satimage, δ = 0.25; USPS, δ = 0.35; pendigit, δ = 0.30; shuttle, δ = 0.35).
Thus R-OAA is more suitable for large/dense datasets with a small number of support vectors, as it will achieve more reduction. Another important observation in the results of R-OAA with δ = 0.35 is that, for each binary SVM, the total number of support vectors remains very close to the total number of support vectors of OAA. This reflects the ability of the δ-region to capture most of the unique support vectors. As the threshold δ decreases, the total number of support vectors for each binary SVM in R-OAA also slightly decreases. This is because, as δ decreases, the δ-region starts missing some unique support vectors and hence the solution becomes approximate. Also, a decrease in δ means more reduction in the subsequent datasets and hence reduced memory usage and faster training. Thus δ can also be viewed as an approximation parameter which trades off fast training against an accurate solution. An immediate use of this threshold δ is in tuning SVM parameters by cross-validation for large multiclass datasets; setting small values of δ while tuning will lead to quicker results. Further, it should be noted that the penalty parameter C and the kernel parameter μ in all these experiments were tuned for the best classification accuracy of OAA and not of R-OAA. Hence one can expect improved classification accuracies by directly tuning these parameters for R-OAA.

4.4. How important is order selection?

Table 8
Reduced OAA with order selection based on center of classes.

Dataset (C, μ)            Details         R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
dna (8, 0.002)            Accuracy (%)    96.12            96.12           96.20
                          Avg. #l         1964             1886.7          1787
optdigit (2048, 0.125)    Accuracy (%)    98.72            98.72           98.66
                          Avg. #l         3614.3           3130.8          2699.6
satimage (32, 2)          Accuracy (%)    91.85            91.85           91.9
                          Avg. #l         4114.7           3828.8          3505
USPS (16, 0.0078)         Accuracy (%)    95.84            95.76           95.46
                          Avg. #l         5415.2           4458.9          4049.1
pendigit (8, 0.25)        Accuracy (%)    98.82            98.77           98.71
                          Avg. #l         5676.4           4823.2          4363.6
shuttle (1024, 16)        Accuracy (%)    99.91            99.91           99.91
                          Avg. #l         30421            28565           27386
In this section, we explore the importance of selecting the order in which the K binary SVMs are solved in the R-OAA method. We have discussed two simple approaches for selecting the order (Section 3). In the previous section, we showed results of R-OAA with order selection based on the number of datapoints in each class. The other method we discussed is based on the centers of classes. Table 8 shows the results of R-OAA with order selection based on center of classes for all six datasets considered. We kept the same experimental settings as explained in the previous section for comparison purposes. Comparing Table 8 against Tables 2–7, it can be observed that both methods achieve almost the same classification accuracy, except on the dna dataset. However, as far as the reduction in the number of datapoints for subsequent binary SVMs is concerned, the simple ordering approach based on the number of datapoints in each class performed better than the center-of-classes approach. Particularly on the shuttle dataset, the center-of-classes approach could achieve only an average of 27,386 datapoints, whereas ordering based on the number of datapoints in each class achieved 8,634.9. This clearly shows the importance of selecting the order in R-OAA and its impact on the reductions achieved.
Table 9
Reduced OAA with random ordering.

Dataset (C, μ)            Details         OAA      R-OAA δ = 0.3 (descending order w.r.t. no. of datapoints per class)   R-OAA δ = 0.3 (random ordering)
dna (8, 0.002)            Accuracy (%)    96.03    95.95       96.07
                          Avg. #l         2000     1535.3      1728.53
optdigit (2048, 0.125)    Accuracy (%)    98.72    98.72       98.73
                          Avg. #l         3823     3230.1      3199.82
satimage (32, 2)          Accuracy (%)    91.85    91.9        91.82
                          Avg. #l         4435     3527.3      3800.09
USPS (16, 0.0078)         Accuracy (%)    95.86    95.75       95.7
                          Avg. #l         7291     4434.9      4921.5
pendigit (8, 0.25)        Accuracy (%)    98.82    98.82       98.80
                          Avg. #l         7494     4812.9      4848.08
shuttle (1024, 16)        Accuracy (%)    99.91    99.89       99.90
                          Avg. #l         43500    11083       27684.69
We also conducted experiments on R-OAA with random ordering to observe its effect. Table 9 shows these results. For each dataset we conducted 50 R-OAA trials (δ = 0.3) with random orders, and we report the average results over these 50 trials. The values of the penalty parameter C and kernel parameter μ for all datasets were kept the same as in the previous results, for comparison. We have reproduced the OAA and R-OAA results from Tables 2–7 for convenience. Table 9 shows that, even with random ordering, R-OAA achieves a good amount of reduction compared to the conventional OAA method; however, a proper selection of the order can achieve even better reductions. Following these observations, for all experiments with R-OAA in subsequent sections of this paper, we have used ordering based on the number of class datapoints.

5. Application to text categorization

Given the impressive performance of the R-OAA multiclass method on the Statlog and UCI datasets, we went on to investigate its performance on text categorization (TC) datasets. Automatic TC is a supervised learning problem which involves training a classifier with some labeled documents and then using the classifier to predict the labels of unlabeled documents. Each document may belong to multiple labels, a single label, or no label at all. For such multilabel learning problems, usually a simple extension of OAA is used. The training phase is the same as that of OAA, where a binary SVM is learned for each label to form a complete TC system. However, in the testing phase, instead of assigning a single label to a test document based on the decision function giving the maximum output, each binary SVM predicts whether the test document belongs to its label or not. We followed the same approach for our R-OAA algorithm. We performed experiments on two well-known multilabel datasets in TC research, reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/) and ohsumed (ftp://medir.ohsu.edu/pub/ohsumed).

5.1. Document representation

Documents, which typically are strings of characters, have to be transformed into a representation suitable for the learning algorithm of the classifier. This transformation involves several steps like preprocessing, dimensionality reduction, feature subset selection and term weighting. We used the simple 'bag of words' representation in all our experiments. We removed stop words using a stop word dictionary (http://www.dcs.gla.ac.uk/idom/ir_resources/linuistic_utils/) consisting of 319 words to reduce the dimension. In addition to stop words, we removed words that occurred in only one training document, uniformly for both text corpuses considered. We also performed word stemming using the Porter stemmer algorithm (Porter, 1980). After dimensionality reduction, we did local feature selection using the mutual information (MI) measure described by Dumais, Platt, Heckerman, and Sahami (1998), given by:
MI(f, c) = Σ_{f∈{0,1}} Σ_{c∈{0,1}} p(f, c) log [ p(f, c) / (p(f) p(c)) ]     (12)
The selected features were associated with a weight using log (TFIDF) (Liao, Alpha, & Dixon, 2003) term weighting scheme described as:
w_fd = log(tf_fd + 0.5) · log(D / df_f),     (13)
where w_fd is the weight of feature f in document d, tf_fd is the occurrence frequency of feature f in document d, D is the total number of documents in the training set, and df_f is the total number of documents containing feature f. In all our experiments, we scaled
weights obtained from log (TFIDF) weighting using cosine normalization, given as:
wn_fd = w_fd / sqrt( Σ_{f=1}^{k} w_fd² ),     (14)

where k is the number of features selected to represent a document.
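A minimal sketch of the document representation pipeline of (12)–(14): mutual information scores for per-category feature selection, log(TFIDF) weights, and cosine normalization. Treating absent terms as having weight 0 and using unsmoothed probability estimates are simplifying assumptions, as is the dense term-frequency matrix used for the toy example.

import numpy as np

def mutual_information(tf, labels):
    """MI(f, c) of Eq. (12) for every feature, from binary presence/absence counts."""
    present = tf > 0
    mi = np.zeros(tf.shape[1])
    for f_val in (0, 1):
        for c_val in (0, 1):
            joint = np.mean((present == f_val) & (labels[:, None] == c_val), axis=0)
            pf = np.mean(present == f_val, axis=0)
            pc = np.mean(labels == c_val)
            with np.errstate(divide="ignore", invalid="ignore"):
                term = joint * np.log(joint / (pf * pc))
            mi += np.nan_to_num(term)            # zero-count cells contribute nothing
    return mi

def log_tfidf_cosine(tf):
    """Eqs. (13)-(14): w_fd = log(tf_fd + 0.5) * log(D / df_f), cosine-normalized per document;
    terms absent from a document get weight 0 (an assumption)."""
    D = tf.shape[0]
    df = np.maximum((tf > 0).sum(axis=0), 1)     # document frequency of each feature
    w = np.where(tf > 0, np.log(tf + 0.5) * np.log(D / df), 0.0)
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, 1e-12)

# tiny example: 4 documents x 5 terms, the first 2 documents belong to the category
tf = np.array([[3, 0, 1, 0, 0], [2, 1, 0, 0, 0], [0, 0, 0, 2, 1], [0, 1, 0, 3, 0]], float)
labels = np.array([1, 1, 0, 0])
print(np.argsort(-mutual_information(tf, labels))[:3])   # indices of the 3 most informative terms
print(log_tfidf_cosine(tf).round(3))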
5.2. Data collections

5.2.1. Reuters-21578

The reuters-21578 dataset (http://www.daviddlewis.com/resources/testcollections/reuters21578/) was compiled by David Lewis and originally collected by the Carnegie group from the Reuters newswire in 1987. It contains 21578 news articles, each belonging to one or more categories (labels). The frequency of occurrence of documents varies greatly from category to category. We used the ModApte split, which led us to a corpus of 9603 training and 3299 testing documents. Out of 135 potential categories, only 90 categories have at least one training and one testing document. In our experiments, following other TC projects, we ran a test on the top 10 categories having the highest number of documents, and we omitted unlabeled documents from classification. After stemming and stop word removal, the training corpus contained 10,789 distinct terms in the global dictionary. We evaluated the MI measure for all these 10,789 distinct terms with respect to each category and selected the top 300 words as features for the document representation of the corresponding category.

5.2.2. Ohsumed

The ohsumed corpus compiled by William Hersh (ftp://medir.ohsu.edu/pub/ohsumed) consists of medline documents from the years 1981 to 1991. Following Joachims (1998), from the 50,216 documents of 1991 we used the first 10,000 for training and the second 10,000 for testing. The resulting training and testing sets have a more homogeneous distribution across 23 different MeSH "diseases" categories. Unlike reuters-21578, it is more difficult to learn a classifier on the ohsumed corpus because of the presence of noisy data. We used the top 10 out of 23 categories in our experiments. The training corpus of ohsumed contained 12,180 distinct terms after stemming and stop word removal. We evaluated the MI measure for all 12,180 terms with respect to each category and selected the top 500 words as features for the document representation of the corresponding category.

5.3. Evaluation methodology

A number of metrics are used in TC to measure effectiveness. In this paper, we used the standard information retrieval metrics, precision and recall, to evaluate each binary classifier. They can be calculated from the confusion matrix shown in Table 10.

Table 10
Confusion matrix.

                                    Expert judgments
Classifier output (category Cj)     YES    NO
YES                                 TP     FP
NO                                  FN     TN

A confusion matrix provides counts of the different outcomes of an evaluation system. True positives (TP) are the documents the system correctly labeled as positive and true negatives (TN) are the documents the system correctly labeled as negative. False positives (FP) and false negatives (FN) are the documents the system incorrectly labeled positive or negative, respectively. Precision is defined simply as the ratio of correctly assigned category Cj documents to the total number of documents classified as category Cj. Recall is the ratio of correctly assigned category Cj documents to the total number of documents actually in category Cj. They can be obtained from the confusion matrix as
Precision = TP / (TP + FP)   and   Recall = TP / (TP + FN).     (15)
Neither precision nor recall is meaningful in isolation from the other. In practice, combined effectiveness measures, namely the precision-recall breakeven point (BEP) and the F1 measure, are used. The F1 measure is calculated as the harmonic mean of precision (P) and recall (R), given as:
F1 = 2PR / (P + R).     (16)
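A small sketch of the per-category metrics of (15) and (16), computed from the confusion counts of Table 10; the function name and the example counts are illustrative.

def precision_recall_f1(tp, fp, fn):
    """Eqs. (15)-(16): precision, recall and their harmonic mean F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. a category with 120 true positives, 20 false positives and 30 false negatives:
print(precision_recall_f1(120, 20, 30))   # (0.857..., 0.8, 0.827...)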
The precision-recall BEP is the point where precision equals recall, and it is often determined by calculating the arithmetic mean of precision and recall. The BEP performance metric is computed for each category separately, and the overall performance of an approach can be found with the help of the microaverage or macroaverage of BEP over all categories. Macroaveraging gives equal weight to each category, whereas microaveraging gives equal weight to each document.

5.4. Experimental results

All text categorization steps discussed in Section 5.1 were performed using MATLAB 7.3.0 (R2006b) software (http://www.mathworks.com). We conducted experiments on both text corpuses described in Section 5.2 using the OAA and R-OAA methods. Binary classification problems arising from both OAA and R-OAA were solved with the SVMlight (Joachims, 1999) software with a linear kernel. The penalty parameter C was selected by cross-validation on OAA. We used the same values of C for the experiments with R-OAA.

Table 11
BEP performance of the 10 largest categories from reuters-21578.

Category    OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
earn        0.9844   0.9844           0.9844          0.9844
acq         0.9536   0.9574           0.9588          0.9568
money-fx    0.7729   0.7709           0.7988          0.7856
grain       0.949    0.9421           0.949           0.9456
crude       0.8913   0.8812           0.8774          0.8756
trade       0.7953   0.786            0.8025          0.7905
interest    0.7671   0.7715           0.7721          0.7602
ship        0.7783   0.804            0.8172          0.7789
wheat       0.8319   0.8251           0.8251          0.8251
corn        0.8805   0.8805           0.8909          0.9092
Micro BEP   0.9241   0.9241           0.9276          0.9237
Macro BEP   0.8604   0.8603           0.8676          0.8612
We slightly modified R-OAA to suit multilabel learning. Instead of removing datapoints that do not satisfy the threshold criteria then and there, we stored their index values and removed them only if they do not belong to the class for which the binary SVM classifier is to be learned. This is important because datapoints of class i which are identified as insignificant after learning a binary SVM for class i cannot be removed immediately, as they may also belong to some other classes in a multilabel scenario. We observed a significant decrease in classification accuracy when such datapoints were removed. Tables 11 and 12 summarize the BEP performance of the top 10 categories obtained by R-OAA and OAA on the reuters-21578 and ohsumed datasets, respectively. It can be observed from these tables that R-OAA achieves better BEP performance than OAA on many of the categories in both text corpuses. On the ohsumed dataset (for δ = 0.25), R-OAA performed better than OAA on 9 out of the 10 categories considered. Thus R-OAA's performance is significantly better than OAA, especially on noisy domains.

Table 12
BEP performance of the 10 largest categories from ohsumed.

Category         OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
Pathology        0.4761   0.4761           0.4761          0.4761
Neoplasma        0.8077   0.8081           0.8109          0.8097
Cardiovascular   0.7921   0.7952           0.7995          0.7969
Nervous          0.6002   0.6062           0.6068          0.6154
Environment      0.6906   0.6897           0.6827          0.6781
Digestion        0.7131   0.729            0.7263          0.7285
Immunology       0.7167   0.7244           0.7387          0.7372
Respiratory      0.6745   0.6774           0.6783          0.6946
Urology          0.8      0.8059           0.8095          0.807
Bacteria         0.6755   0.6859           0.6913          0.6955
Micro BEP        0.6805   0.6844           0.6862          0.6871
Macro BEP        0.6946   0.6997           0.7020          0.7039

Table 13
Reduction in reuters-21578.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 1        1                1               1
Macro F1          0.8561   0.8570           0.8649          0.8584
#l1               7775     7775             7775            7775
#SVs              530      530              530             530
#l2               7775     5689             5354            5187
#SVs              735      718              704             692
#l3               7775     5248             4765            4448
#SVs              546      537              530             522
#l4               7775     5172             4696            4332
#SVs              366      352              351             337
#l5               7775     5128             4618            4225
#SVs              433      410              408             399
#l6               7775     5076             4538            4122
#SVs              483      477              470             456
#l7               7775     5071             4504            4069
#SVs              516      512              502             487
#l8               7775     5027             4445            4015
#SVs              325      315              311             304
#l9               7775     5021             4442            4000
#SVs              225      225              226             220
#l10              7775     4996             4419            3939
#SVs              262      262              255             252
Avg. #l           7775     5420.3           4955.6          4611.2
Total Tr. time    1.32     0.81             0.77            0.71

Table 14
Reduction in ohsumed.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 1.4142   1.4142           1.4142          1.4142
Macro F1          0.6775   0.6864           0.6910          0.6948
#l1               6286     6286             6286            6286
#SVs              3208     3208             3208            3208
#l2               6286     6250             6183            6097
#SVs              1295     1297             1290            1279
#l3               6286     6053             5870            5687
#SVs              1311     1297             1276            1261
#l4               6286     5775             5438            5124
#SVs              1169     1149             1132            1110
#l5               6286     5757             5402            5042
#SVs              920      905              900             884
#l6               6286     5727             5342            5003
#SVs              954      926              913             886
#l7               6286     5647             5238            4826
#SVs              806      803              782             765
#l8               6186     5604             5151            4705
#SVs              866      841              801             816
#l9               6286     5577             5109            4654
#SVs              719      687              670             654
#l10              6286     5515             5002            4501
#SVs              770      745              742             723
Avg. #l           6286     5819.1           5502.1          5192.5
Total Tr. time    2.82     2.61             2.45            2.14
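A minimal sketch of the deferred-removal rule described at the start of this subsection: a datapoint flagged as insignificant by an earlier binary SVM is excluded from a later SVM's training set only if it does not carry the label that SVM is being trained for. The indicator-matrix representation Y and the function name are assumptions.

import numpy as np

def training_indices(Y, k, stored_removals):
    """Training set for the binary SVM of label k under the multilabel modification of
    Section 5.4: datapoints flagged as insignificant by earlier binary SVMs are dropped,
    unless they themselves carry label k (they may still be needed as positive examples)."""
    drop = np.zeros(Y.shape[0], dtype=bool)
    drop[list(stored_removals)] = True
    drop &= Y[:, k] == 0                       # never drop a stored point that has label k
    return np.flatnonzero(~drop)

# toy check: 4 documents, 3 labels; documents 0 and 3 were flagged by an earlier SVM
Y = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]])
print(training_indices(Y, k=0, stored_removals={0, 3}))   # [0 1 2 3]: 0 and 3 kept, they carry label 0
print(training_indices(Y, k=1, stored_removals={0, 3}))   # [1 2]:     0 and 3 dropped for label 1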
Fig. 4. Reduction achieved by R-OAA on the text corpuses (reuters-21578, δ = 0.25; ohsumed, δ = 0.25).
Tables 13 and 14 and Fig. 4 show the reduction achieved by R-OAA over OAA on both text corpuses. While a significant reduction in training time is achieved on both corpuses, on the reuters-21578 corpus in particular the R-OAA method (with δ = 0.25) achieved approximately a 45% reduction in training time.

6. Conclusion

In this paper, we have proposed the reduced OAA (R-OAA) method for multiclass SVM classification. R-OAA is an improvement to the well-known OAA multiclass SVM method in the form of sample selection. R-OAA learns from each binary SVM classification problem we solve and uses this knowledge to solve future binary SVM problems efficiently. We have also discussed heuristic selection of the order in which the K binary SVM problems of R-OAA can be solved. Experimental results on standard datasets demonstrate that R-OAA performs statistically comparably to the standard OAA multiclass method, but at reduced computational effort. R-OAA achieved approximately a 50% reduction in total computing time without significant loss in accuracy for the largest dataset considered in this paper. This suggests using R-OAA for large datasets to achieve tractable, low-cost solutions. We further investigated the application of R-OAA with a linear kernel to text categorization, using two benchmark text categorization datasets, reuters-21578 and ohsumed. This can also be seen as a simple extension of the R-OAA algorithm to handle multilabel classification, as both text corpuses considered are multilabel in nature. Experimental results show a significant improvement in performance at reduced computing effort, when compared with OAA, on both text corpuses considered. This suggests using R-OAA for text categorization applications, particularly in noisy domains.

References

Abe, S., & Inoue, T. (2001). Fast training of support vector machines by extracting boundary data. In Proceedings of the international conference on artificial neural networks (ICANN) (pp. 308–313).
Almeida, M. B., Braga, A. P., & Braga, J. P. (2000). SVM-KM: Speeding SVMs learning with a priori cluster selection and k-means. In Proceedings of the 6th Brazilian symposium on neural networks (pp. 162–167).
Blake, C. L., & Merz, C. J. (1998). UCI repository for machine learning databases. http://www.ics.uci.edu/mlearn/MLRepository.html, Department of Information and Computer Sciences, University of California, Irvine.
Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 1–43.
Chang, C. C., & Lin, C. J. (2001). LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.
Dumais, S., Platt, J. C., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on information and knowledge management (pp. 148–155).
Fei, B., & Liu, J. (2006). Binary tree of SVM: A new fast multiclass training and classification algorithm. IEEE Transactions on Neural Networks, 17, 696–704.
Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. The Annals of Statistics, 26(2), 451–471.
Hsu, C., & Lin, C. J. (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13, 415–425.
Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 550–554.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML-98, 10th European conference on machine learning (pp. 137–142).
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. Burges, & A. J. Smola (Eds.), Advances in kernel methods – Support vector learning (pp. 169–184). MIT Press.
Knerr, S., Personnaz, L., & Dreyfus, G. (1990). Single-layer learning revisited: A stepwise procedure for building and training a neural network. In J. Fogelman (Ed.), Neurocomputing: Algorithms, architectures and applications. New York: Springer-Verlag.
Koggalage, R., & Halgamuge, S. (2004). Reducing the number of training samples for fast support vector machine classification. Neural Information Processing Letters & Reviews, 2(3), 57–65.
Lee, Y. J., & Mangasarian, O. L. (2001). RSVM: Reduced support vector machines. In Proceedings of the international conference on data mining, Chicago.
Liao, C., Alpha, S., & Dixon, P. (2003). Feature preparation in text categorization. Artificial Intelligence White Papers, Oracle Corporation.
Mangasarian, O. L. (1998). Nonlinear programming. SIAM.
Mangasarian, O. L., & Musicant, D. R. (1999). Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10, 1032–1037.
Pedrajas, N. G., & Boyer, D. O. (2006). Improving multiclass pattern recognition by the combination of two strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6), 1001–1006.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. Burges, & A. J. Smola (Eds.), Advances in kernel methods – Support vector learning (pp. 185–208). MIT Press.
Platt, J. C., Christiani, N., & Shawe-Taylor, J. (1999). Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, & K.-R. Muller (Eds.), Proceedings of neural information processing systems (NIPS'99) (pp. 547–553).
Porter, M. (1980). An algorithm for suffix stripping. Program (Automated Library and Information Systems), 14(3), 130–137.
Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101–141.
Scholkopf, B., Smola, A. J., & Vapnik, V. (1995). Extracting support data for a given task. In Proceedings of the 1st international conference on knowledge discovery and data mining (pp. 252–257). AAAI Press.
Shin, H., & Choo, S. (2005). Invariance of neighborhood relation under input space to feature space mapping. Pattern Recognition Letters, 26, 707–718.
Sohn, S., & Dagli, C. H. (2001). Advantages of using fuzzy class memberships in self-organizing map and support vector machines. In Proceedings of the international conference on neural networks (IJCNN), Washington, DC (pp. 1886–1890).
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wang, X., Shi, Z., Wu, C., & Wang, W. (2006). An improved algorithm for decision-tree-based SVM. In Proceedings of the 6th World Congress on Intelligent Control and Automation (pp. 4234–4283).
Wang, J., Neskovic, P., & Cooper, L. N. (2007). Selecting data for fast support vector machine training. Studies in Computational Intelligence, 35, 61–84.
Yang, M. H., & Ahuja, N. (2000). A geometric approach to train support vector machines. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 430–437).
Zhang, W., & King, I. (2002). Locating support vectors via β-skeleton technique. In Proceedings of the international conference on neural information processing (ICONIP) (pp. 1423–1427).
Zheng, S., Xiaofeng, L., Nanning, Z., & Weipu, X. (2003). Unsupervised clustering based reduced support vector machines. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP) (pp. 821–824).