Expert Systems with Applications 38 (2011) 14238–14248
Reduced one-against-all method for multiclass SVM classification

M. Arun Kumar, M. Gopal

Control Group, Indian Institute of Technology Delhi, Hauz Khas, New Delhi 110016, India

Keywords: Support vector machines; Multi-class classification; One-against-all; Text categorization
Abstract

We present an improved version of the one-against-all method for multiclass SVM classification based on subset sample selection, named reduced one-against-all, to achieve high performance in large multiclass problems. Reduced one-against-all drastically decreases the computing effort involved in training one-against-all classifiers, without any compromise in classification accuracy. Computational comparisons on publicly available datasets indicate that the proposed method has accuracy comparable to that of the conventional one-against-all method, but is an order of magnitude faster. On the largest dataset considered, the reduced one-against-all method achieved a 50% reduction in computing time over the one-against-all method for almost the same classification accuracy. We further investigated reduced one-against-all with a linear kernel for multi-label text categorization applications. Computational results demonstrate the effectiveness of the proposed method on both the text corpuses considered. © 2011 Elsevier Ltd. All rights reserved.
1. Introduction

Support vector machines (SVMs), being computationally powerful tools for supervised learning, are widely used in classification and regression problems. SVM classifiers have been successfully applied to a variety of real-world problems like particle identification, face recognition, text categorization and bioinformatics (Burges, 1998). The approach is systematic and motivated by statistical learning theory (SLT) and Bayesian arguments. The central idea of the SVM classifier is to find the optimal separating hyperplane between positive and negative examples. The optimal hyperplane is defined as the one giving maximum margin between the training examples that are closest to the hyperplane. Although SVMs enjoy good generalization in a wide variety of applications, they face hurdles in being extended to large-scale problems. This is because the training phase of SVMs involves solving a quadratic programming problem (QPP), whose size depends on the total number of training datapoints l. Conventional QPP solvers have time complexity O(l³) and need explicit storage of a kernel matrix of size l × l. Even for moderately sized problems with 10,000 training datapoints, conventional QPP solvers demand large amounts of memory and very long training times. Hence, speeding up SVM training has been an active research area among machine learning researchers for almost a decade. One way to circumvent this problem is to develop efficient algorithms that solve the same QPP with less memory and reduced training time. Many decomposition methods have
been reported in the recent past which break down the single large QPP into a series of smaller QPPs that can be solved efficiently. Popular methods include SMO (Platt, 1999), SOR (Mangasarian & Musicant, 1999), SVMlight (Joachims, 1999), and LIBSVM (Chang & Lin, 2001). The approximate time complexity of these algorithms scales as T·O(lq + q³), where T is the number of iterations and q is the size of the working set. These methods have been pivotal in realizing SVM's potential in large-scale classification tasks. Another approach is to pre-select a set of important datapoints, significantly fewer than l, and solve a smaller QPP only with this subset. This exploits the fact that the SVM decision function depends only on a small subset of training datapoints called support vectors. These support vectors lie in the proximity of the decision boundary, and the SVM decision function depends completely on this small subset of datapoints. Also, it has been reported that SVMs trained with different kernels on the same dataset have highly overlapping sets of support vectors (Scholkopf, Smola, & Vapnik, 1995). Hence, identifying insignificant datapoints that are not likely to be support vectors and removing them before solving the QPP will save a considerable amount of memory and training time. In this paper our focus is on this second approach. Many such "sample selection" methods have been reported in the literature. Popular techniques and criteria used for pre-selecting samples include: clustering (Almeida, Braga, & Braga, 2000; Koggalage & Halgamuge, 2004), fuzzy memberships (Sohn et al., 2001), Mahalanobis distance (Abe & Inoue, 2001), k-nearest neighbors (Shin & Choo, 2005), β-skeleton (Zhang & King, 2002), and Hausdorff distance (Wang, Neskovic, & Cooper, 2007). Yang and Ahuja (2000) proposed a geometric approach to select a superset of support vectors called guard vectors. Lee and Mangasarian (2001) proposed reduced SVM (RSVM) using a rectangular kernel instead
of a conventional square kernel, by randomly selecting a subset of training datapoints. Zheng, Xiaofeng, Nanning, and Weipu (2003) improved RSVM by substituting random samples with cluster centroids. Although these methods do not theoretically ensure that an SVM solved with the complete training dataset and an SVM solved with pre-selected samples have the same generalization, they often achieve tractable, low-cost solutions without any significant loss in accuracy, which is the key idea of soft computing. In this paper we propose to use posterior probability estimates of SVM outputs for subset sample selection in the multiclass scenario.

SVMs were originally developed for binary classification and their extension to multi-class classification is an ongoing research issue. The current approaches to multi-class SVM methods either construct several binary SVMs or solve a single large optimization problem. It has been shown that solving a single large optimization problem takes more computing time and hence is not suitable for practical applications (Fei & Liu, 2006; Hsu & Lin, 2002; Rifkin & Klautau, 2004). One of the earliest and simplest implementations of multi-class SVM is the one-against-all (OAA) method (Vapnik, 1998). For a K class problem it constructs K binary SVMs, one for each class. Each class SVM is trained to separate its own datapoints from the datapoints of the other classes. A new datapoint is classified to the class whose binary SVM gives the largest value. Another popular approach is to train K(K − 1)/2 binary SVMs, where each binary SVM is trained on datapoints from two classes. This approach is called the one-against-one (OAO) method (Knerr, Personnaz, & Dreyfus, 1990; Hastie & Tibshirani, 1998). After training the K(K − 1)/2 binary SVMs, testing is done using the "Max Wins" voting strategy. An interesting modification to OAO is the directed acyclic graph SVM (DAGSVM) proposed by Platt and Christiani (1999). Its training phase is the same as the OAO method; however, in the testing phase it uses a rooted binary directed acyclic graph with K(K − 1)/2 internal nodes and K leaves. The advantage of DAGSVM is that it requires only (K − 1) binary classifiers to be evaluated during testing, instead of the K(K − 1)/2 evaluations required in OAO. Recently, Fei and Liu (2006) have proposed the Binary Tree of SVM (BTS) for multiclass classification. BTS can be seen as an improvement to OAO in which a binary SVM, trained between two classes, also separates some other classes, and hence obviates the need for several binary SVMs. They showed that BTS requires (K − 1) binary SVMs to be trained in the best case and log_{4/3}((K + 3)/4) binary SVMs to be evaluated on average while testing. Pedrajas and Boyer (2006) have recently proposed an approach, named all and one (AO), based on the combination of two methods: OAA and OAO. AO combines the strengths of both methods and partially avoids their sources of failure. A critical review of multiclass methods can be found in the literature (Hsu & Lin, 2002; Rifkin & Klautau, 2004). Rifkin and Klautau (2004) conducted many carefully controlled experiments and came to the conclusion that if all binary SVMs are properly tuned then simple OAA is as accurate as any other multiclass method.

The approach we propose in this paper is an improvement to OAA by sample selection.
This is based on the observation that, for a K class problem in OAA, we train K binary SVMs as K independent problems, despite the fact that all K binary problems are related and are solved with the same set of training datapoints (with different targets). A binary SVM, once solved, does not share any useful experience with the other binary SVMs still to be solved. We propose a simple approach in which the knowledge gained from solving a binary SVM is used to reduce the training set of future binary SVMs so that they can be solved more efficiently. We call this method reduced OAA (R-OAA). We have compared R-OAA and OAA in terms of classification accuracy and training time on several datasets. We have also extended R-OAA to multilabel text categorization
and compared it with OAA's extension. We do not make comparisons with other popular approaches like OAO in this paper, for the following reasons:

a. Benchmark comparisons of multiclass SVM approaches already exist in the literature (Hsu & Lin, 2002; Rifkin & Klautau, 2004; Fei & Liu, 2006).
b. It has been concluded by Rifkin and Klautau (2004) that OAA is as accurate as any other approach, assuming that all underlying binary SVMs are well tuned.
c. We concluded in our recent study that OAA performs better than OAO, DAGSVM, BTS, and AO in terms of generalization on multiclass text categorization.
d. Moreover, OAA is an independent method which can easily be extended to handle multi-label datasets. To our knowledge, there are no extensions of the OAO family of methods that handle multi-label classification.

The paper is organized as follows. In Section 2 we briefly discuss the SVM problem formulation, its extension to multiclass using OAA, and the scope for improvements in OAA. Section 3 describes our R-OAA approach. Computational comparisons on standard datasets are given in Section 4. In Section 5 we investigate the performance of R-OAA with a linear kernel for multilabel text categorization; Section 6 gives concluding remarks.

In this paper, all vectors are column vectors unless transformed to row vectors by a prime ′. A column vector of ones in a real space of arbitrary dimension is denoted by e. A vector of zeros in a real space of arbitrary dimension is denoted by 0e. For a matrix A ∈ R^{l×n}, A_i is the ith row of A, which is a row vector in R^n. For a vector x ∈ R^n, x∗ denotes the vector in R^n with components (x∗)_i = 1 if x_i > 0 and 0 otherwise, i = 1, . . . , n. In other words, x∗ is the result of applying the step function component-wise to x.

2. Problem formulation

2.1. Binary classification using SVM

SVMs represent novel learning techniques that have been introduced in the framework of structural risk minimization (SRM) and the theory of VC bounds. Compared to state-of-the-art methods, SVMs have shown excellent performance in pattern recognition tasks. In the simplest binary pattern recognition tasks, SVMs use a linear separating hyperplane to create a classifier with maximal margin. Consider the problem of binary classification wherein a linearly inseparable dataset X of l points in real n-dimensional feature space is represented by the matrix X ∈ R^{l×n}. The corresponding target or class of each datapoint X_i, i = 1, 2, . . . , l, is represented by a diagonal matrix D ∈ R^{l×l} with entries D_ii equal to +1 or −1. Given the above problem, the SVM linear soft-margin problem is to solve the following primal QPP (Burges, 1998):

Min_{w,b,y}  (1/2) w′w + C e′y
subject to   D(Xw + eb) + y ≥ e,  y ≥ 0e,     (1)
where C is a penalty parameter and y ∈ R^l are the nonnegative slack variables. The optimal decision function (linear separating hyperplane) at the solution of (1) can be expressed as:
f(x) = w′x + b,     (2)

where x ∈ R^n is any (new) datapoint, and w ∈ R^n and b ∈ R are parameters. Since the number of constraints in (1) is large, the dual of (1) is usually solved. The Wolfe dual of (1) is (Mangasarian, 1998):
Max_α  e′α − (1/2) α′DXX′Dα
subject to  e′Dα = 0,  0e ≤ α ≤ Ce,     (3)
where α ∈ R^l are the Lagrangian multipliers. The optimal decision function is the same as in (2), with parameters given by:

w = X′Dα,
b = (1/N_sv) Σ_{i=1}^{N_sv} (1/D_ii − X_i w),     (4)
The datapoints X_i for which 0 < α_i < C are called support vectors SV of the decision function f(x), and N_sv denotes the total number of support vectors (SV = {X_i | 0 < α_i < C} and N_sv = |SV|). The decision boundary f(x) = 0 lies midway between the bounding hyperplanes given by:
w′x + b = 1   and   w′x + b = −1,     (5)

and separates the two classes from each other with a margin of 2/‖w‖.
A new datapoint x ∈ R^n is classified as +1 or −1 according to whether the step function (f(x))∗ yields 1 or 0, respectively. An important characteristic of SVM is that it can be extended in a relatively straightforward manner to create nonlinear decision boundaries (Vapnik, 1998). For the nonlinear case, the input data is transformed into a high-dimensional feature space using the kernel map Φ(·), and a linear separating hyperplane is obtained in that high-dimensional feature space. Hence, the nonlinear version of the dual problem (3) is solved:
Max_α  e′α − (1/2) α′DK(X, X′)Dα
subject to  e′Dα = 0,  0e ≤ α ≤ Ce,     (6)
where K_ij = K(X_i, X_j) = Φ(X_i)′Φ(X_j). The optimal decision function can be expressed as:

f(x) = Σ_{i=1}^{N_sv} α_i D_ii K(x, X_i) + b.     (7)
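To make the binary formulation concrete, the following minimal sketch trains a single soft-margin SVM with a Gaussian kernel and evaluates its decision function f(x). It is only an illustration: scikit-learn's SVC is used as a stand-in QPP solver (the experiments in this paper used SVMlight), and the toy data and the values of C and the kernel parameter (passed as gamma) are assumptions.

import numpy as np
from sklearn.svm import SVC

# Toy binary problem: X is l x n, d holds the +1/-1 targets (the diagonal of D).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
d = np.hstack([np.ones(50), -np.ones(50)])

# Gaussian kernel K(Xi, Xj) = exp(-mu * ||Xi - Xj||^2); 'gamma' plays the role of mu.
C, mu = 8.0, 0.5                      # illustrative values, normally chosen by validation
svm = SVC(C=C, kernel="rbf", gamma=mu)
svm.fit(X, d)

# f(x) = sum_i alpha_i * D_ii * K(x, X_i) + b, evaluated here by decision_function.
x_new = np.array([[1.5, 1.5]])
f_val = svm.decision_function(x_new)[0]
label = 1 if f_val > 0 else -1        # step-function classification as in (2)/(7)
print(f"f(x) = {f_val:.3f}, predicted class = {label:+d}")
print("number of support vectors:", svm.n_support_.sum())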
2.2. Multiclass SVM classification using the OAA method

A popular and simple way of combining several binary SVM classifiers for multiclass classification is the OAA method (Vapnik, 1998). Given a K class problem, the OAA method constructs K binary SVM classifiers with decision functions f_1(x), f_2(x), . . . , f_K(x), one for each class. The binary SVM corresponding to a particular class i is obtained by training all datapoints of class i against the remaining datapoints of the other K − 1 classes. A new datapoint x ∈ R^n is classified based on the decision function values of all K binary SVMs. All K decision functions are evaluated at x and the datapoint is assigned to the class whose decision function gives the highest output. The OAA SVM classifier's decision function f(x) is defined as:
f(x) = arg max_{i ∈ {1, 2, ..., K}} f_i(x).     (8)
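As an illustration of (8), a minimal OAA sketch follows: one binary SVM is trained per class against the rest, and a new point is assigned to the class with the largest decision value. The linear kernel and the synthetic three-class data are assumptions made only to keep the example self-contained.

import numpy as np
from sklearn.svm import SVC

def train_oaa(X, y, C=1.0):
    """Train one binary SVM per class (class k vs. rest); return {class: classifier}."""
    models = {}
    for k in np.unique(y):
        d = np.where(y == k, 1.0, -1.0)          # +1 for class k, -1 for all others
        models[k] = SVC(C=C, kernel="linear").fit(X, d)
    return models

def predict_oaa(models, X):
    """Assign each point to the class whose decision function f_k(x) is largest (Eq. (8))."""
    classes = sorted(models)
    F = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array(classes)[np.argmax(F, axis=1)]

# Tiny 3-class example.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 30)
models = train_oaa(X, y)
print("training accuracy:", np.mean(predict_oaa(models, X) == y))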
Note that in the training phase OAA needs K binary SVMs to be trained, and in the testing phase it needs K binary SVMs to be evaluated for each test datapoint. Before proceeding to R-OAA, we make clear the notion of "unique" support vectors in OAA. Each decision function f_i(x) has its own support vector set SV_i, and the union of all these support vector sets gives the unique support vector set SV_c = ∪_{i=1,2,...,K} SV_i of the multiclass problem. This set SV_c completely defines the decision function f(x) in (8), as each f_i(x) depends on a small subset of SV_c. Now, we can define the unique support vector set for class i as SV_ci = {x ∈ SV_c | class(x) = i}. It is worth mentioning that a unique support vector of class i is not necessarily a support vector of the decision function f_i(x), though the converse is true.
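The unique support vector sets could be collected from the K trained binary SVMs roughly as follows; support_ is scikit-learn's array of support vector indices, and the variable names are illustrative.

import numpy as np

def unique_support_vectors(models, y):
    """SV_c = union of the support vector index sets of all K binary SVMs;
    SV_ci = the members of SV_c whose own class label is i."""
    sv_c = set()
    for clf in models.values():
        sv_c.update(clf.support_.tolist())       # indices of this SVM's support vectors
    sv_ci = {i: {j for j in sv_c if y[j] == i} for i in np.unique(y)}
    return sv_c, sv_ci

# continuing from the OAA example above:
# sv_c, sv_ci = unique_support_vectors(models, y)
# print({i: len(s) for i, s in sv_ci.items()})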
2.3. Scope for improvements in OAA

The OAA method is highly successful because of its simplicity and good generalization ability; however, there is good scope for improvements in OAA. This is because, for a K class problem, OAA solves K binary SVM problems as K independent problems, despite the fact that all these problems are related. That is, OAA does not pass any useful experience learnt from one problem to another. If we are able to learn some useful information about the structure and distribution of datapoints from the first binary SVM we solve, it can be used for the remaining K − 1 binary SVMs. The advantage is apparent particularly when K is sufficiently large. Moreover, the information multiplies as we proceed to further problems, which means that when solving the tail binary SVMs we will have a good amount of information that helps in solving them more efficiently. Thus, accumulation of information from each binary SVM solved gives abundant information for solving the tail binary SVMs. Also, in OAA there is no guide to select which binary SVM should be learned first or be preferred over others, as all binary SVMs are treated as independent problems. On these lines we describe our approach R-OAA, which incorporates "learning from past experience", in the form of sample selection, into the conventional OAA method.

3. Reduced OAA for multiclass SVM classification

Based on the ideas presented in the previous section, our objective is to learn from each binary SVM problem we solve, and incorporate this knowledge into future problems (in the form of subset selection) to be solved. In what follows, we illustrate how this can be done through a simple 2-D toy example, shown in Fig. 1(a). This example has 3 classes and hence the OAA method requires 3 binary SVMs. Assuming that we first solve the binary SVM for class 1, Fig. 1(a) also shows the decision boundary f_1(x) = 0 obtained between datapoints of class 1 and datapoints of the other classes. This decision boundary divides the two-dimensional space into two mutually exclusive regions, a positive region (which contains most class 1 datapoints) and a negative region (which contains most class 2 and class 3 datapoints). Based on this decision boundary f_1(x) = 0, we define a closeness measure S_1(x) such that it gives small values for datapoints closer to f_1(x) = 0 and large values for datapoints away from f_1(x) = 0. By selecting a threshold parameter δ, we define a δ-region around the decision boundary f_1(x) = 0, such that this δ-region includes all datapoints with closeness measure S_1(x) ≤ δ. This δ-region is shown in Fig. 1(b). From this figure we make the following observations:

a. Datapoints of class i which lie inside the δ-region of the decision boundary f_i(x) = 0 are good representatives of class i against all other classes in the multiclass SVM scenario. This is because the δ-region includes almost all unique support vectors of class i, given that δ is sufficiently large.
b. Datapoints of class i which are on the positive side of the decision boundary, lying outside the δ-region, are less likely to become unique support vectors of class i and will be classified correctly.
c. Datapoints of class i which are on the negative side of the decision boundary, lying outside the δ-region, turn out to be outliers and will be misclassified.
d. Datapoints that do not belong to class i and are on the positive side of the decision boundary, lying outside the δ-region, turn out to be outliers and will be misclassified.
Based on these observations, in our toy example (Fig. 1(b)), we can omit datapoints of class 1 that are outside the δ-region while solving subsequent binary SVMs.
Fig. 1. R-OAA Multiclass classification for a 2-D toy dataset.
Also, datapoints that do not belong to class 1, are on the positive side of the decision boundary, and lie outside the δ-region can be omitted. Now, this reduced dataset can be used to train the next binary SVM, for class 2. Fig. 1(c) shows this reduced dataset and the decision boundary f_2(x) = 0 obtained from it. Again, by defining a δ-region around this decision boundary f_2(x) = 0, the dataset can be further reduced and used to train the last binary SVM, for class 3. Fig. 1(d) shows the decision boundary f_3(x) = 0 and the reduced dataset used to obtain it. Thus, if we incorporate such learning into all the binary SVMs we train, the tail binary SVMs will have fewer training datapoints and hence can be solved more efficiently.

To validate these observations, we conducted trial experiments on six multiclass datasets. Fig. 2 presents these results. Each graph shows the count of unique support vectors of class i, i.e. |SV_ci|, and the count of unique support vectors of class i that lie inside the δ-region of the corresponding decision boundary f_i(x) = 0, i.e. |δSV_ci|, where δSV_ci = {x ∈ SV_ci | S_i(x) ≤ δ}. For these experiments we converted each binary SVM's output to a closeness measure using probability estimates and defined the δ-region with δ = 0.35. We discuss probability estimates in the following material. It can be observed from these graphs that most of the unique support vectors of a class lie close to the corresponding decision boundary and hence are captured by the δ-region.

For identifying the δ-region, a closeness measure between training datapoints and the decision boundary f_i(x) = 0 is to be defined. We propose to use the probabilistic output of the SVM as the closeness measure in this paper. In general, the SVM decision function f_i(x) outputs uncalibrated values, which can be converted to posterior probability estimates by fitting a sigmoidal function at its output. A discussion
on different types of probabilistic outputs for SVM can be found in Fei and Liu (2006). Following (Fei & Liu, 2006), we adopted the posterior probability estimate given by:
P(class = i | f_i(x)) = 1 / (1 + exp(−f_i(x))),
P(class = i | f_i(x)) = 0.5 when f_i(x) = 0.     (9)

The above expression can be modified as

ΔP_i(x) = P(class = i | f_i(x)) − 0.5 = 1 / (1 + exp(−f_i(x))) − 0.5,     (10)
where |ΔP_i(x)| is the closeness measure S_i(x) between training datapoints and the decision boundary f_i(x) = 0. Note that for f_i(x) = 0, S_i(x) = 0; when f_i(x) → ∞, S_i(x) → 0.5; and when f_i(x) → −∞, S_i(x) → 0.5. Thus, f_i(x) in the range (−∞, ∞) gets mapped to S_i(x) in the range [0, 0.5), which also becomes a valid range for selection of the threshold parameter δ. Hence, after training a binary SVM for class i, using S_i(x) we can remove datapoints of class i whose closeness estimate satisfies S_i(x) > δ (based on observations b and c). Also, datapoints of other classes lying on the positive side of the decision boundary f_i(x) = 0 with S_i(x) > δ can be removed (based on observation d). Although [0, 0.5) is a valid range for the threshold parameter δ, the usable range is (0.231, 0.5) based on our requirements. This is because, for any decision function f_i(x), support vectors of class i have f_i(x) = 1 and hence S_i(x) = 0.231. Selecting a threshold parameter δ < 0.231 will fail to include these support vectors of class i in the δ-region. This is against our requirement that the δ-region should possibly include all unique support vectors of class i.
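A small sketch of the closeness measure of (9) and (10) and of the δ-region membership test; it also confirms numerically that a margin support vector with f_i(x) = 1 has S_i(x) ≈ 0.231, which is why the usable range of δ starts just above that value. The function names are illustrative.

import numpy as np

def closeness(f_vals):
    """S_i(x) = |P(class=i | f_i(x)) - 0.5| with P = 1 / (1 + exp(-f_i(x)))  (Eqs. (9)-(10))."""
    return np.abs(1.0 / (1.0 + np.exp(-np.asarray(f_vals))) - 0.5)

def in_delta_region(f_vals, delta):
    """True for datapoints whose closeness measure does not exceed the threshold delta."""
    return closeness(f_vals) <= delta

print(closeness(0.0))    # 0.0      -> on the decision boundary
print(closeness(1.0))    # ~0.2311  -> a margin support vector, hence delta > 0.231 in practice
print(in_delta_region([-0.2, 1.0, 4.0], delta=0.35))   # [ True  True False]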
Fig. 2. Validation of observations about the δ-region.
Selecting δ = 0.231 makes the δ-region coincide with the margin area (−1 ≤ f_i(x) ≤ 1) of the decision function f_i(x). While this ensures inclusion of all support vectors of class i in the δ-region, it is less likely to include all unique support vectors of class i. During our trial experiments we observed a significant decrease in accuracy with δ = 0.231. Practically, δ should be slightly greater than 0.231. Also, it is easy to see that as δ → 0.5, R-OAA becomes OAA, as the δ-region includes all datapoints and hence no reduction in subsequent datasets can be achieved. In general, a small δ gives more reduction and hence faster training. However, if δ is too small, it may miss many support vectors and give poor accuracy. We conducted all our experiments for three values of δ (0.25, 0.3 and 0.35), and we observed a significant reduction in training time without significant loss in accuracy across many datasets, for all these values.

Obviously, the above process of removing less significant datapoints after each binary SVM is learnt will improve the efficiency of future binary SVMs, irrespective of the order in which the K binary SVMs are solved. But is it possible to benefit from heuristically defining an order in which the K binary SVMs should be solved? We explored three heuristic possibilities for defining such an order. A simple approach is to give preference to a class having more datapoints. That is, we solve the K binary SVMs in descending order with respect to the total number of datapoints in each class. This approach is logical because, when we solve the class with the most datapoints first, we can expect maximum reduction in the datasets to be used for the future SVMs. Another approach is to select a class whose datapoints lie far away from the datapoints of the other classes. The idea behind this is that if we remove insignificant datapoints of a class which is far away from the other
classes, the reduction is likely to have less impact on the future decision functions. Hence we can expect the decision boundaries of future binary SVMs not to deviate much from the true ones obtained by solving the binary SVMs with the complete dataset. The following method helps in achieving such an order. First, the centers of all classes in the multiclass problem are calculated. Then, the average of all these centers is calculated as the problem's center. Now, the Euclidean distance d_i between the center of class i and the problem's center can be interpreted as a measure of how far the datapoints of a class lie from the datapoints of all other classes. All K classes are sorted in descending order of d_i to get the order in which the K binary SVMs should be solved. This method is a modified interpretation of the one described in Fei and Liu (2006), where the authors were interested in finding the two classes that are closest to the problem's center. Another method we consider here is an improved version of the one described above. This method is due to Wang, Shi, Wu, and Wang (2006), where the distribution of classes is also considered in measuring the separability of a class. Following Wang et al. (2006), for a K class problem, the separability measure of class i can be defined as

sm_i = d_i / σ_i,     (11)

where d_i is the Euclidean distance between the center of class i and the problem's center, and σ_i is the variance of class i. Hence the order can be obtained by sorting the separability measure of all K classes in descending order. The above separability measure is defined in input space; however, it is also possible to extend it to high-dimensional feature space (Wang et al., 2006).
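The ordering heuristics discussed above (class size, distance of the class center from the problem center, and the separability measure of (11)) might be computed along the following lines; treating σ_i as the total variance of the class-i datapoints is an assumption, since the exact variance definition is not spelled out here.

import numpy as np

def class_order(X, y, criterion="size"):
    """Return the class labels in the order in which the K binary SVMs should be trained."""
    classes = np.unique(y)
    if criterion == "size":                      # most populous class first
        scores = np.array([np.sum(y == k) for k in classes])
    else:
        centers = np.array([X[y == k].mean(axis=0) for k in classes])
        problem_center = centers.mean(axis=0)    # average of the class centers
        d = np.linalg.norm(centers - problem_center, axis=1)
        if criterion == "center":                # farthest class center first
            scores = d
        elif criterion == "separability":        # sm_i = d_i / sigma_i   (Eq. (11))
            sigma = np.array([X[y == k].var() for k in classes])  # assumed: total variance
            scores = d / sigma
        else:
            raise ValueError(criterion)
    return classes[np.argsort(-scores)]          # descending order

# example: order = class_order(X, y, criterion="center")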
The separability measure is an interesting approach, but it needs more computing time, particularly in its kernelized version. Hence, in our experiments we considered the other two approaches for defining the order (number of datapoints in a class, and center of classes). We now give an explicit statement of the algorithm for our R-OAA method. The data structures used in the algorithm are as follows: Train_data: the set of training datapoints. Rem_index: a vector containing indices of training datapoints that are identified as insignificant. O: a K × 1 vector containing the order in which the K binary SVMs are to be trained.

3.1. ALGORITHM: reduced OAA for multiclass classification

Given a K class dataset, reduced OAA classification can be performed with the following steps:

(i) Choose a kernel function (linear/Gaussian/etc.).
(ii) Select the penalty parameter C. Usually this parameter is selected based on validation.
(iii) Obtain the order O in which the K binary classifiers should be trained (using any of the three methods described above).
(iv) Select a threshold value δ for removing datapoints (between 0.231 and 0.5).
(v) Initialize Rem_index to null.
(vi) Repeat the following steps for i = 1, 2, . . . , K:
  (a) Remove from Train_data the datapoints whose indices are stored in Rem_index.
  (b) Set Rem_index to null.
  (c) Construct a binary classifier f_O(i)(x) between datapoints of class O(i) and the remaining datapoints from Train_data.
  (d) Evaluate f_O(i)(x) and S_O(i)(x) = |ΔP_O(i)(x)| for the datapoints in Train_data.
  (e) Find datapoints of class O(i) for which S_O(i)(x) > δ and store their index values in Rem_index.
  (f) Find datapoints that do not belong to class O(i) such that f_O(i)(x) > 0 and S_O(i)(x) > δ, and store their index values in Rem_index.
  (g) Store the decision function f_O(i)(x).
(vii) Test new datapoints as in conventional OAA, by evaluating the decision functions f_1(x), f_2(x), . . . , f_K(x).
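A compact sketch of the R-OAA training loop of Section 3.1, with ordering by class size and the closeness threshold δ; scikit-learn's SVC again stands in for SVMlight, and the interface is illustrative rather than a faithful reproduction of the authors' MATLAB implementation.

import numpy as np
from sklearn.svm import SVC

def train_roaa(X, y, C=1.0, gamma=1.0, delta=0.3):
    """Reduced one-against-all: train K binary SVMs, shrinking the training set after each."""
    keep = np.ones(len(y), dtype=bool)                            # datapoints still in Train_data
    order = sorted(np.unique(y), key=lambda k: -np.sum(y == k))   # step (iii): largest class first
    models = {}
    for k in order:
        d = np.where(y[keep] == k, 1.0, -1.0)
        clf = SVC(C=C, kernel="rbf", gamma=gamma).fit(X[keep], d)  # step (vi c)
        models[k] = clf
        idx = np.flatnonzero(keep)
        f = clf.decision_function(X[idx])                  # step (vi d): f_k on Train_data
        s = np.abs(1.0 / (1.0 + np.exp(-f)) - 0.5)         # closeness measure S_k(x)
        drop = ((y[idx] == k) & (s > delta)) | \
               ((y[idx] != k) & (f > 0) & (s > delta))     # steps (vi e) and (vi f)
        keep[idx[drop]] = False                            # removed before the next binary SVM
    return models

def predict_roaa(models, X):
    """Step (vii): test exactly as in conventional OAA, by the arg max of the decision values."""
    classes = sorted(models)
    F = np.column_stack([models[k].decision_function(X) for k in classes])
    return np.array(classes)[np.argmax(F, axis=1)]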
4. Experimental work on standard datasets

4.1. Experimental protocol

To demonstrate the performance of R-OAA multiclass classification, we conducted experiments on several multiclass problems from the Statlog collection (ftp://ncc.up.pt/pub/statlog/) and the UCI Repository of machine learning databases (Blake et al., 1998). From the UCI Repository we selected two ten-class datasets: optdigit and pendigit. From the Statlog collection we chose the dna, satimage, and shuttle datasets. In addition to the above-mentioned datasets, we also performed experiments on the USPS (Hull, 1994) dataset of handwritten digit classification. Table 1 gives a description of all these datasets. We also conducted experiments on the same datasets using OAA, for comparison against R-OAA. Binary SVM classification problems arising from both OAA and R-OAA were solved with the SVMlight (Joachims, 1999) software. Datasets were pre-normalized to the range [−1, 1]. All experiments were implemented in the MATLAB 7.3.0 (R2006b) (http://www.mathworks.com) environment on a PC with an Intel Core2Duo processor (2.13 GHz) and 1 GB RAM.

Table 1
Description of datasets.

Dataset    #Class   #Training data   #Testing data   #Feature
dna        3        2000             1186            180
optdigit   10       3823             1797            64
satimage   6        4435             2000            36
USPS       10       7291             2007            256
pendigit   10       7494             3498            12
shuttle    7        43500            14500           9
Table 2
Dna.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 8        8                8               8
μ                 0.002    0.002            0.002           0.002
Accuracy (%)      96.03    96.03            95.95           95.95
#l1               2000     2000             2000            2000
#SVs              603      603              603             603
#l2               2000     1502             1366            1251
#SVs              439      434              423             408
#l3               2000     1462             1240            1021
#SVs              450      446              437             418
Avg. #l           2000     1654.7           1535.3          1424
Total Tr. time    2.37     1.82             1.67            1.51
Table 3
Optdigit.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 2048     2048             2048            2048
μ                 0.125    0.125            0.125           0.125
Accuracy (%)      98.72    98.77            98.72           98.72
#l1               3823     3823             3823            3823
#SVs              552      552              552             552
#l2               3823     3810             3725            3609
#SVs              571      571              571             564
#l3               3823     3716             3539            3359
#SVs              607      607              607             604
#l4               3823     3705             3432            3143
#SVs              593      593              592             584
#l5               3823     3673             3280            2887
#SVs              651      650              642             623
#l6               3823     3655             3200            2721
#SVs              626      626              621             609
#l7               3823     3608             3061            2470
#SVs              695      694              694             669
#l8               3823     3564             2933            2273
#SVs              491      489              488             477
#l9               3823     3531             2768            2012
#SVs              465      461              461             441
#l10              3823     3423             2540            1710
#SVs              635      635              642             604
Avg. #l           3823     3650.8           3230.1          2800.7
Total Tr. time    13.31    11.8             8.8             7.1
4.2. Comparison strategy

We compared the OAA and R-OAA methods in the following five aspects:

(i) Classification accuracy: the ratio of the total number of correctly classified test datapoints to the total number of test datapoints.
(ii) The total number of datapoints removed by R-OAA after learning each binary SVM (reduction).
(iii) The total number of support vectors for each binary SVM learned.
(iv) The average number of datapoints used for training the K binary SVMs.
(v) The total training time spent on learning the K binary SVMs.

Table 4
Satimage.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 32       32               32              32
μ                 2        2                2               2
Accuracy (%)      91.85    91.9             91.9            91.9
#l1               4435     4435             4435            4435
#SVs              1118     1118             1118            1118
#l2               4435     4336             4027            3715
#SVs              1398     1398             1398            1391
#l3               4435     4042             3545            3102
#SVs              1308     1308             1307            1297
#l4               4435     3759             3123            2534
#SVs              1060     1054             1042            1023
#l5               4435     3758             3065            2369
#SVs              1153     1148             1134            1114
#l6               4435     3725             2969            2190
#SVs              886      885              883             867
Avg. #l           4435     4009.2           3527.3          3057.5
Total Tr. time    14.05    12.28            10.59           8.1

Table 5
USPS.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 16       16               16              16
μ                 0.0078   0.0078           0.0078          0.0078
Accuracy (%)      95.86    95.83            95.75           95.31
#l1               7291     7291             7291            7291
#SVs              243      243              243             243
#l2               7291     6469             6292            6200
#SVs              87       87               86              84
#l3               7291     6194             5478            5246
#SVs              339      339              339             334
#l4               7291     5796             4969            4663
#SVs              209      202              201             193
#l5               7291     5425             4468            4090
#SVs              320      320              313             310
#l6               7291     5077             4028            3579
#SVs              300      299              293             275
#l7               7291     4818             3624            3078
#SVs              210      205              207             201
#l8               7291     4434             3129            2517
#SVs              270      260              251             228
#l9               7291     4146             2693            1999
#SVs              353      352              345             315
#l10              7291     3950             2377            1601
#SVs              325      317              318             291
Avg. #l           7291     5360             4434.9          4026.4
Total Tr. time    26.95    17.31            14.43           12.36

Table 6
Pendigit.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 8        8                8               8
μ                 0.25     0.25             0.25            0.25
Accuracy (%)      98.82    98.82            98.82           98.68
#l1               7494     7494             7494            7494
#SVs              120      120              120             120
#l2               7494     7424             7009            6771
#SVs              100      100              99              99
#l3               7494     6947             6377            6058
#SVs              85       85               82              82
#l4               7494     6335             5679            5319
#SVs              189      188              187             178
#l5               7494     5871             5083            4636
#SVs              110      108              107             101
#l6               7494     5300             4418            3912
#SVs              119      118              118             118
#l7               7494     5073             3950            3268
#SVs              57       56               55              55
#l8               7494     4500             3293            2579
#SVs              99       93               92              87
#l9               7494     4028             2659            1903
#SVs              168      167              164             151
#l10              7494     3766             2167            1277
#SVs              137      137              133             127
Avg. #l           7494     5673.8           4812.9          4321.7
Total Tr. time    5.37     3.89             3.29            2.94

Table 7
Shuttle.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 1024     1024             1024            1024
μ                 16       16               16              16
Accuracy (%)      99.91    99.90            99.89           99.89
#l1               43500    43500            43500           43500
#SVs              136      136              136             136
#l2               43500    15452            12173           9890
#SVs              62       63               58              59
#l3               43500    9749             6105            3361
#SVs              102      93               81              68
#l4               43500    8263             4018            1000
#SVs              85       84               82              80
#l5               43500    8205             3947            917
#SVs              21       19               15              16
#l6               43500    8182             3921            889
#SVs              94       88               78              70
#l7               43500    8181             3919            887
#SVs              88       87               69              64
Avg. #l           43500    14505            11083           8634.9
Total Tr. time    203.96   137.09           128.42          101.27

4.3. Results and discussion

Tables 2–7 show comparison results of OAA and R-OAA on the six datasets considered. We used the Gaussian kernel for all experiments, given by K(X_i, X_j) = exp(−μ‖X_i − X_j‖²). The kernel parameter μ and the penalty parameter C were selected by cross-validation on OAA. We used the same values of μ and C for the experiments with R-OAA. In R-OAA, we solved the K binary SVMs in descending order with respect to the number of datapoints in each class. In all these tables, we show the results for the K binary SVMs of OAA in the same order for ease of comparison. For each dataset we solved R-OAA with three values of δ (0.25, 0.3, 0.35). These tables make it evident that R-OAA achieves accuracies close to those of OAA, without any significant loss, on all six
datasets considered. In fact, on the optdigit and satimage datasets, the accuracy of R-OAA is slightly higher than that of OAA. At this juncture, it should be remembered that solving an SVM with a reduced dataset does not prevent it from achieving better accuracy than an SVM solved with the complete dataset (though this is uncommon); it depends entirely on the nature of the dataset and the influence of outliers on the decision boundary. These results also show that the number of datapoints used for training each binary SVM classifier gradually decreases in R-OAA, against a fixed number of datapoints for OAA (a graphical representation is shown in Fig. 3). In particular, with the shuttle dataset (Table 7) we solve a tail binary SVM problem with 887 datapoints in R-OAA against 43500 datapoints in OAA, without a major sacrifice in accuracy. Reduction in the dataset for a binary SVM problem directly implies reduced memory usage and reduced training time. The comparison of total training time shows that R-OAA is capable of achieving a drastic reduction. R-OAA achieved around a 50% reduction in total training time on the largest dataset (Table 7) considered, without significant loss in classification accuracy.
Fig. 3. Reduction achieved by R-OAA (panels: dna, δ = 0.35; optdigit, δ = 0.25; satimage, δ = 0.25; USPS, δ = 0.35; pendigit, δ = 0.30; shuttle, δ = 0.35).
Thus R-OAA is more suitable for large/dense datasets with a small number of support vectors, as it will achieve more reduction. Another important observation in the results of R-OAA with δ = 0.35 is that, for each binary SVM, the total number of support vectors remains very close to the total number of support vectors of OAA. This reflects the ability of the δ-region to capture most of the unique support vectors. As the threshold δ decreases, the total number of support vectors for each binary SVM in R-OAA also slightly decreases. This is because, as δ decreases, the δ-region starts missing some unique support vectors and hence the solution becomes approximate. Also, a decrease in δ means more reduction in the subsequent datasets and hence reduced memory usage and faster training. Thus δ can also be viewed as an approximation parameter which trades off fast training against an accurate solution. An immediate use of this threshold δ is in tuning SVM parameters by cross-validation for large multiclass datasets; setting small values of δ while tuning will lead to quicker results. Further, it should be noted that the penalty parameter C and the kernel parameter μ in all these experiments were tuned for the best classification accuracy of OAA and not of R-OAA. Hence one can expect improved classification accuracies by directly tuning these parameters for R-OAA.

4.4. How important is order selection?

Table 8
Reduced OAA with order selection based on center of classes.

Dataset (C, μ)            Details         R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
dna (8, 0.002)            Accuracy (%)    96.12            96.12           96.20
                          Avg. #l         1964             1886.7          1787
optdigit (2048, 0.125)    Accuracy (%)    98.72            98.72           98.66
                          Avg. #l         3614.3           3130.8          2699.6
satimage (32, 2)          Accuracy (%)    91.85            91.85           91.9
                          Avg. #l         4114.7           3828.8          3505
USPS (16, 0.0078)         Accuracy (%)    95.84            95.76           95.46
                          Avg. #l         5415.2           4458.9          4049.1
pendigit (8, 0.25)        Accuracy (%)    98.82            98.77           98.71
                          Avg. #l         5676.4           4823.2          4363.6
shuttle (1024, 16)        Accuracy (%)    99.91            99.91           99.91
                          Avg. #l         30421            28565           27386
In this section, we explore the importance of selecting the order in which the K binary SVMs are solved in the R-OAA method. We have discussed two simple approaches for selecting the order (Section 3). In the previous section, we showed results of R-OAA with order selection based on the number of datapoints in each class. The other method we discussed is based on the centers of classes. Table 8 shows the results of R-OAA with order selection based on center of classes for all six datasets considered. We kept the same experimental settings as explained in the previous section for comparison purposes. Comparing Table 8 against Tables 2–7, it can be observed that both methods achieve almost the same classification accuracy, except on the dna dataset. However, as far as the reduction in the number of datapoints for subsequent binary SVMs is concerned, the simple ordering approach based on the number of datapoints in each class performed better than the center-of-classes approach. Particularly on the shuttle dataset, the center-of-classes approach could achieve only an average of 27,386 datapoints, whereas ordering based on the number of datapoints in each class achieved 8,634.9. This clearly shows the importance of selecting the order in R-OAA and its impact on the reductions achieved.
Table 9
Reduced OAA with random ordering.

Dataset (C, μ)            Details         OAA      R-OAA δ = 0.3 (descending order w.r.t. no. of datapoints per class)   R-OAA δ = 0.3 (random ordering)
dna (8, 0.002)            Accuracy (%)    96.03    95.95       96.07
                          Avg. #l         2000     1535.3      1728.53
optdigit (2048, 0.125)    Accuracy (%)    98.72    98.72       98.73
                          Avg. #l         3823     3230.1      3199.82
satimage (32, 2)          Accuracy (%)    91.85    91.9        91.82
                          Avg. #l         4435     3527.3      3800.09
USPS (16, 0.0078)         Accuracy (%)    95.86    95.75       95.7
                          Avg. #l         7291     4434.9      4921.5
pendigit (8, 0.25)        Accuracy (%)    98.82    98.82       98.80
                          Avg. #l         7494     4812.9      4848.08
shuttle (1024, 16)        Accuracy (%)    99.91    99.89       99.90
                          Avg. #l         43500    11083       27684.69
We also conducted experiments on R-OAA with random ordering to observe its effect. Table 9 shows these results. For each dataset we conducted 50 R-OAA trials (δ = 0.3) with random orders, and we report the average results over these 50 trials. The values of the penalty parameter C and kernel parameter μ for all datasets were kept the same as in the previous results, for comparison. We have reproduced the OAA and R-OAA results from Tables 2–7 for convenience. Table 9 shows that, even with random ordering, R-OAA achieves a good amount of reduction compared to the conventional OAA method; however, a proper selection of the order can achieve even better reductions. Following these observations, for all experiments with R-OAA in subsequent sections of this paper, we have used ordering based on the number of class datapoints.

5. Application to text categorization

Given the impressive performance of the R-OAA multiclass method on the Statlog and UCI datasets, we went on to investigate its performance on text categorization (TC) datasets. Automatic TC is a supervised learning problem which involves training a classifier with some labeled documents and then using the classifier to predict the labels of unlabeled documents. Each document may belong to multiple labels, a single label, or no label at all. For such multilabel learning problems, usually a simple extension of OAA is used. The training phase is the same as that of OAA, where a binary SVM is learned for each label to form a complete TC system. However, in the testing phase, instead of assigning a single label to a test document based on the decision function giving the maximum output, each binary SVM predicts whether the test document belongs to its label or not. We followed the same approach for our R-OAA algorithm. We performed experiments on two well-known multilabel datasets in TC research, reuters-21578 (http://www.daviddlewis.com/resources/testcollections/reuters21578/) and ohsumed (ftp://medir.ohsu.edu/pub/ohsumed).

5.1. Document representation

Documents, which typically are strings of characters, have to be transformed into a representation suitable for the learning algorithm of the classifier. This transformation involves several steps like preprocessing, dimensionality reduction, feature subset selection and term weighting. We used the simple 'bag of words' representation in all our experiments. We removed stop words using a stop word dictionary (http://www.dcs.gla.ac.uk/idom/ir_resources/linuistic_utils/) consisting of 319 words to reduce the dimension. In addition to stop words, we removed words that occurred in only one training document, uniformly for both text corpuses considered. We also performed word stemming using the Porter stemmer algorithm (Porter, 1980). After dimensionality reduction, we did local feature selection using the mutual information (MI) measure described by Dumais, Platt, Heckerman, and Sahami (1998), given by:
MI(f, c) = Σ_{f∈{0,1}} Σ_{c∈{0,1}} p(f, c) log [ p(f, c) / (p(f) p(c)) ]     (12)
The selected features were associated with a weight using log (TFIDF) (Liao, Alpha, & Dixon, 2003) term weighting scheme described as:
w_fd = log(tf_fd + 0.5) · log(D / df_f),     (13)
where w_fd is the weight of feature f in document d, tf_fd is the occurrence frequency of feature f in document d, D is the total number of documents in the training set, and df_f is the total number of documents containing feature f. In all our experiments, we scaled
weights obtained from log (TFIDF) weighting using cosine normalization, given as:
wn_fd = w_fd / sqrt( Σ_{f=1}^{k} w_fd² ),     (14)

where k is the number of features selected to represent a document.
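A minimal sketch of the document representation pipeline of (12)–(14): mutual information scores for per-category feature selection, log(TFIDF) weights, and cosine normalization. Treating absent terms as having weight 0 and using unsmoothed probability estimates are simplifying assumptions, as is the dense term-frequency matrix used for the toy example.

import numpy as np

def mutual_information(tf, labels):
    """MI(f, c) of Eq. (12) for every feature, from binary presence/absence counts."""
    present = tf > 0
    mi = np.zeros(tf.shape[1])
    for f_val in (0, 1):
        for c_val in (0, 1):
            joint = np.mean((present == f_val) & (labels[:, None] == c_val), axis=0)
            pf = np.mean(present == f_val, axis=0)
            pc = np.mean(labels == c_val)
            with np.errstate(divide="ignore", invalid="ignore"):
                term = joint * np.log(joint / (pf * pc))
            mi += np.nan_to_num(term)            # zero-count cells contribute nothing
    return mi

def log_tfidf_cosine(tf):
    """Eqs. (13)-(14): w_fd = log(tf_fd + 0.5) * log(D / df_f), cosine-normalized per document;
    terms absent from a document get weight 0 (an assumption)."""
    D = tf.shape[0]
    df = np.maximum((tf > 0).sum(axis=0), 1)     # document frequency of each feature
    w = np.where(tf > 0, np.log(tf + 0.5) * np.log(D / df), 0.0)
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    return w / np.maximum(norms, 1e-12)

# tiny example: 4 documents x 5 terms, the first 2 documents belong to the category
tf = np.array([[3, 0, 1, 0, 0], [2, 1, 0, 0, 0], [0, 0, 0, 2, 1], [0, 1, 0, 3, 0]], float)
labels = np.array([1, 1, 0, 0])
print(np.argsort(-mutual_information(tf, labels))[:3])   # indices of the 3 most informative terms
print(log_tfidf_cosine(tf).round(3))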
5.2. Data collections

5.2.1. Reuters-21578

The reuters-21578 dataset (http://www.daviddlewis.com/resources/testcollections/reuters21578/) was compiled by David Lewis and originally collected by the Carnegie group from the Reuters newswire in 1987. It contains 21578 news articles, each belonging to one or more categories (labels). The frequency of occurrence of documents varies greatly from category to category. We used the ModApte split, which led us to a corpus of 9603 training and 3299 testing documents. Out of 135 potential categories, only 90 categories have at least one training and one testing document. In our experiments, following other TC projects, we ran a test on the top 10 categories having the highest number of documents, and we omitted unlabeled documents from classification. After stemming and stop word removal, the training corpus contained 10,789 distinct terms in the global dictionary. We evaluated the MI measure for all these 10,789 distinct terms with respect to each category and selected the top 300 words as features for the document representation of the corresponding category.

5.2.2. Ohsumed

The ohsumed corpus compiled by William Hersh (ftp://medir.ohsu.edu/pub/ohsumed) consists of medline documents from the years 1981 to 1991. Following Joachims (1998), from the 50,216 documents of 1991 we used the first 10,000 for training and the second 10,000 for testing. The resulting training and testing sets have a more homogeneous distribution across 23 different MeSH "diseases" categories. Unlike reuters-21578, it is more difficult to learn a classifier on the ohsumed corpus because of the presence of noisy data. We used the top 10 out of 23 categories in our experiments. The training corpus of ohsumed contained 12,180 distinct terms after stemming and stop word removal. We evaluated the MI measure for all 12,180 terms with respect to each category and selected the top 500 words as features for the document representation of the corresponding category.

5.3. Evaluation methodology

A number of metrics are used in TC to measure effectiveness. In this paper, we used the standard information retrieval metrics, precision and recall, to evaluate each binary classifier. They can be calculated from the confusion matrix shown in Table 10.

Table 10
Confusion matrix.

                                    Expert judgments
Classifier output (category Cj)     YES    NO
YES                                 TP     FP
NO                                  FN     TN

A confusion matrix provides counts of the different outcomes of an evaluation system. True positives (TP) are the documents the system correctly labeled as positive and true negatives (TN) are the documents the system correctly labeled as negative. False positives (FP) and false negatives (FN) are the documents the system incorrectly labeled positive or negative, respectively. Precision is defined simply as the ratio of correctly assigned category Cj documents to the total number of documents classified as category Cj. Recall is the ratio of correctly assigned category Cj documents to the total number of documents actually in category Cj. They can be obtained from the confusion matrix as
Precision = TP / (TP + FP)   and   Recall = TP / (TP + FN).     (15)
Neither precision nor recall is meaningful in isolation from the other. In practice, combined effectiveness measures, namely the precision-recall breakeven point (BEP) and the F1 measure, are used. The F1 measure is calculated as the harmonic mean of precision (P) and recall (R), given as:
F1 = 2PR / (P + R).     (16)
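A small sketch of the per-category metrics of (15) and (16), computed from the confusion counts of Table 10; the function name and the example counts are illustrative.

def precision_recall_f1(tp, fp, fn):
    """Eqs. (15)-(16): precision, recall and their harmonic mean F1 from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. a category with 120 true positives, 20 false positives and 30 false negatives:
print(precision_recall_f1(120, 20, 30))   # (0.857..., 0.8, 0.827...)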
The precision-recall BEP is the point where precision equals recall, and it is often determined by calculating the arithmetic mean of precision and recall. The BEP performance metric is computed for each category separately, and the overall performance of an approach can be found with the help of the microaverage or macroaverage of BEP over all categories. Macroaveraging gives equal weight to each category, whereas microaveraging gives equal weight to each document.

5.4. Experimental results

All text categorization steps discussed in Section 5.1 were performed using MATLAB 7.3.0 (R2006b) software (http://www.mathworks.com). We conducted experiments on both text corpuses described in Section 5.2 using the OAA and R-OAA methods. Binary classification problems arising from both OAA and R-OAA were solved with the SVMlight (Joachims, 1999) software with a linear kernel. The penalty parameter C was selected by cross-validation on OAA. We used the same values of C for the experiments with R-OAA.

Table 11
BEP performance of the 10 largest categories from reuters-21578.

Category    OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
earn        0.9844   0.9844           0.9844          0.9844
acq         0.9536   0.9574           0.9588          0.9568
money-fx    0.7729   0.7709           0.7988          0.7856
grain       0.949    0.9421           0.949           0.9456
crude       0.8913   0.8812           0.8774          0.8756
trade       0.7953   0.786            0.8025          0.7905
interest    0.7671   0.7715           0.7721          0.7602
ship        0.7783   0.804            0.8172          0.7789
wheat       0.8319   0.8251           0.8251          0.8251
corn        0.8805   0.8805           0.8909          0.9092
Micro BEP   0.9241   0.9241           0.9276          0.9237
Macro BEP   0.8604   0.8603           0.8676          0.8612
We slightly modified R-OAA to suit multilabel learning. Instead of removing datapoints that do not satisfy the threshold criteria then and there, we stored their index values and removed them only if they do not belong to the class for which the binary SVM classifier is to be learned. This is important because datapoints of class i which are identified as insignificant after learning a binary SVM for class i cannot be removed immediately, as they may also belong to some other classes in a multilabel scenario. We observed a significant decrease in classification accuracy when such datapoints were removed. Tables 11 and 12 summarize the BEP performance of the top 10 categories obtained by R-OAA and OAA on the reuters-21578 and ohsumed datasets, respectively. It can be observed from these tables that R-OAA achieves better BEP performance than OAA on many of the categories in both text corpuses. On the ohsumed dataset (for δ = 0.25), R-OAA performed better than OAA on 9 out of the 10 categories considered. Thus R-OAA's performance is significantly better than OAA, especially on noisy domains.

Table 12
BEP performance of the 10 largest categories from ohsumed.

Category         OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
Pathology        0.4761   0.4761           0.4761          0.4761
Neoplasma        0.8077   0.8081           0.8109          0.8097
Cardiovascular   0.7921   0.7952           0.7995          0.7969
Nervous          0.6002   0.6062           0.6068          0.6154
Environment      0.6906   0.6897           0.6827          0.6781
Digestion        0.7131   0.729            0.7263          0.7285
Immunology       0.7167   0.7244           0.7387          0.7372
Respiratory      0.6745   0.6774           0.6783          0.6946
Urology          0.8      0.8059           0.8095          0.807
Bacteria         0.6755   0.6859           0.6913          0.6955
Micro BEP        0.6805   0.6844           0.6862          0.6871
Macro BEP        0.6946   0.6997           0.7020          0.7039

Table 13
Reduction in reuters-21578.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 1        1                1               1
Macro F1          0.8561   0.8570           0.8649          0.8584
#l1               7775     7775             7775            7775
#SVs              530      530              530             530
#l2               7775     5689             5354            5187
#SVs              735      718              704             692
#l3               7775     5248             4765            4448
#SVs              546      537              530             522
#l4               7775     5172             4696            4332
#SVs              366      352              351             337
#l5               7775     5128             4618            4225
#SVs              433      410              408             399
#l6               7775     5076             4538            4122
#SVs              483      477              470             456
#l7               7775     5071             4504            4069
#SVs              516      512              502             487
#l8               7775     5027             4445            4015
#SVs              325      315              311             304
#l9               7775     5021             4442            4000
#SVs              225      225              226             220
#l10              7775     4996             4419            3939
#SVs              262      262              255             252
Avg. #l           7775     5420.3           4955.6          4611.2
Total Tr. time    1.32     0.81             0.77            0.71

Table 14
Reduction in ohsumed.

Details           OAA      R-OAA δ = 0.35   R-OAA δ = 0.3   R-OAA δ = 0.25
C                 1.4142   1.4142           1.4142          1.4142
Macro F1          0.6775   0.6864           0.6910          0.6948
#l1               6286     6286             6286            6286
#SVs              3208     3208             3208            3208
#l2               6286     6250             6183            6097
#SVs              1295     1297             1290            1279
#l3               6286     6053             5870            5687
#SVs              1311     1297             1276            1261
#l4               6286     5775             5438            5124
#SVs              1169     1149             1132            1110
#l5               6286     5757             5402            5042
#SVs              920      905              900             884
#l6               6286     5727             5342            5003
#SVs              954      926              913             886
#l7               6286     5647             5238            4826
#SVs              806      803              782             765
#l8               6186     5604             5151            4705
#SVs              866      841              801             816
#l9               6286     5577             5109            4654
#SVs              719      687              670             654
#l10              6286     5515             5002            4501
#SVs              770      745              742             723
Avg. #l           6286     5819.1           5502.1          5192.5
Total Tr. time    2.82     2.61             2.45            2.14
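A minimal sketch of the deferred-removal rule described at the start of this subsection: a datapoint flagged as insignificant by an earlier binary SVM is excluded from a later SVM's training set only if it does not carry the label that SVM is being trained for. The indicator-matrix representation Y and the function name are assumptions.

import numpy as np

def training_indices(Y, k, stored_removals):
    """Training set for the binary SVM of label k under the multilabel modification of
    Section 5.4: datapoints flagged as insignificant by earlier binary SVMs are dropped,
    unless they themselves carry label k (they may still be needed as positive examples)."""
    drop = np.zeros(Y.shape[0], dtype=bool)
    drop[list(stored_removals)] = True
    drop &= Y[:, k] == 0                       # never drop a stored point that has label k
    return np.flatnonzero(~drop)

# toy check: 4 documents, 3 labels; documents 0 and 3 were flagged by an earlier SVM
Y = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]])
print(training_indices(Y, k=0, stored_removals={0, 3}))   # [0 1 2 3]: 0 and 3 kept, they carry label 0
print(training_indices(Y, k=1, stored_removals={0, 3}))   # [1 2]:     0 and 3 dropped for label 1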
Fig. 4. Reduction achieved by R-OAA on the text corpuses (reuters-21578, δ = 0.25; ohsumed, δ = 0.25).
Tables 13 and 14 and Fig. 4 show the reduction achieved by R-OAA over OAA on both text corpuses. While a significant reduction in training time is achieved on both corpuses, on the reuters-21578 corpus in particular the R-OAA method (with δ = 0.25) achieved approximately a 45% reduction in training time.

6. Conclusion

In this paper, we have proposed the reduced OAA (R-OAA) method for multiclass SVM classification. R-OAA is an improvement to the well-known OAA multiclass SVM method in the form of sample selection. R-OAA learns from each binary SVM classification problem we solve and uses this knowledge to solve future binary SVM problems efficiently. We have also discussed heuristic selection of the order in which the K binary SVM problems of R-OAA can be solved. Experimental results on standard datasets demonstrate that R-OAA performs statistically comparably to the standard OAA multiclass method, but at reduced computational effort. R-OAA achieved approximately a 50% reduction in total computing time without significant loss in accuracy for the largest dataset considered in this paper. This suggests using R-OAA for large datasets to achieve tractable, low-cost solutions. We further investigated the application of R-OAA with a linear kernel to text categorization, using two benchmark text categorization datasets, reuters-21578 and ohsumed. This can also be seen as a simple extension of the R-OAA algorithm to handle multilabel classification, as both text corpuses considered are multilabel in nature. Experimental results show a significant improvement in performance at reduced computing effort, when compared with OAA, on both text corpuses considered. This suggests using R-OAA for text categorization applications, particularly in noisy domains.

References

Abe, S., & Inoue, T. (2001). Fast training of support vector machines by extracting boundary data. In Proceedings of the international conference on artificial neural networks (ICANN) (pp. 308–313).
Almeida, M. B., Braga, A. P., & Braga, J. P. (2000). SVM-KM: Speeding SVMs learning with a priori cluster selection and k-means. In Proceedings of the 6th Brazilian symposium on neural networks (pp. 162–167).
Blake, C. L., & Merz, C. J. (1998). UCI repository for machine learning databases. http://www.ics.uci.edu/mlearn/MLRepository.html, Department of Information and Computer Sciences, University of California, Irvine.
Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 1–43.
Chang, C. C., & Lin, C. J. (2001). LIBSVM: A library for support vector machines. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm.
Dumais, S., Platt, J. C., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the seventh international conference on information and knowledge management (pp. 148–155).
Fei, B., & Liu, J. (2006). Binary tree of SVM: A new fast multiclass training and classification algorithm. IEEE Transactions on Neural Networks, 17, 696–704.
Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. The Annals of Statistics, 26(2), 451–471.
Hsu, C., & Lin, C. J. (2002). A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks, 13, 415–425.
Hull, J. J. (1994). A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5), 550–554.
Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. In Proceedings of ECML-98, 10th European conference on machine learning (pp. 137–142).
Joachims, T. (1999). Making large-scale SVM learning practical. In B. Scholkopf, C. J. Burges, & A. J. Smola (Eds.), Advances in kernel methods – Support vector learning (pp. 169–184). MIT Press.
Knerr, S., Personnaz, L., & Dreyfus, G. (1990). Single-layer learning revisited: A stepwise procedure for building and training a neural network. In J. Fogelman (Ed.), Neurocomputing: Algorithms, architectures and applications. New York: Springer-Verlag.
Koggalage, R., & Halgamuge, S. (2004). Reducing the number of training samples for fast support vector machine classification. Neural Information Processing Letters & Reviews, 2(3), 57–65.
Lee, Y. J., & Mangasarian, O. L. (2001). RSVM: Reduced support vector machines. In Proceedings of the international conference on data mining, Chicago.
Liao, C., Alpha, S., & Dixon, P. (2003). Feature preparation in text categorization. Artificial Intelligence White Papers, Oracle Corporation.
Mangasarian, O. L. (1998). Nonlinear programming. SIAM.
Mangasarian, O. L., & Musicant, D. R. (1999). Successive overrelaxation for support vector machines. IEEE Transactions on Neural Networks, 10, 1032–1037.
Pedrajas, N. G., & Boyer, D. O. (2006). Improving multiclass pattern recognition by the combination of two strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6), 1001–1006.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In B. Scholkopf, C. J. Burges, & A. J. Smola (Eds.), Advances in kernel methods – Support vector learning (pp. 185–208). MIT Press.
Platt, J. C., Christiani, N., & Shawe-Taylor, J. (1999). Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, & K.-R. Muller (Eds.), Proceedings of neural information processing systems (NIPS'99) (pp. 547–553).
Porter, M. (1980). An algorithm for suffix stripping. Program (Automated Library and Information Systems), 14(3), 130–137.
Rifkin, R., & Klautau, A. (2004). In defense of one-vs-all classification. Journal of Machine Learning Research, 5, 101–141.
Scholkopf, B., Smola, A. J., & Vapnik, V. (1995). Extracting support data for a given task. In Proceedings of the 1st international conference on knowledge discovery and data mining (pp. 252–257). AAAI Press.
Shin, H., & Choo, S. (2005). Invariance of neighborhood relation under input space to feature space mapping. Pattern Recognition Letters, 26, 707–718.
Sohn, S., & Dagli, C. H. (2001). Advantages of using fuzzy class memberships in self-organizing map and support vector machines. In Proceedings of the international conference on neural networks (IJCNN), Washington, DC (pp. 1886–1890).
Vapnik, V. (1998). Statistical learning theory. New York: Wiley.
Wang, X., Shi, Z., Wu, C., & Wang, W. (2006). An improved algorithm for decision-tree-based SVM. In Proceedings of the 6th World Congress on Intelligent Control and Automation (pp. 4234–4283).
Wang, J., Neskovic, P., & Cooper, L. N. (2007). Selecting data for fast support vector machine training. Studies in Computational Intelligence, 35, 61–84.
Yang, M. H., & Ahuja, N. (2000). A geometric approach to train support vector machines. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) (pp. 430–437).
Zhang, W., & King, I. (2002). Locating support vectors via β-skeleton technique. In Proceedings of the international conference on neural information processing (ICONIP) (pp. 1423–1427).
Zheng, S., Xiaofeng, L., Nanning, Z., & Weipu, X. (2003). Unsupervised clustering based reduced support vector machines. In Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP) (pp. 821–824).