Designing bag-level multiple-instance feature-weighting algorithms based on the large margin principle

Information Sciences 367–368 (2016) 783–808

Jing Chai a,∗, Zehua Chen a, Hongtao Chen a, Xinghao Ding b

a College of Information Engineering, Taiyuan University of Technology, Taiyuan 030024, PR China
b School of Information Science and Engineering, Xiamen University, Xiamen 361005, PR China

Article history: Received 29 December 2014; Revised 7 July 2016; Accepted 10 July 2016; Available online 12 July 2016

Keywords: Large margin; Feature weighting; Multiple-instance learning; Bag level

Abstract

In multiple-instance learning (MIL), class labels are attached to bags instead of instances, and the goal is to predict the class labels of unseen bags. Existing MIL algorithms generally fall into two types: those designed at the bag level and those designed at the instance level. In this paper, we employ bags directly as learning objects and design multiple-instance feature-weighting algorithms at the bag level. In particular, we first introduce the bag-level large margin feature-weighting framework and then adopt three bag-level distances, namely the minimal Hausdorff (minH), class-to-bag (C2B), and bag-to-bag (B2B) distances, as examples to design the corresponding bag-level feature-weighting algorithms. Experiments conducted on synthetic and real-world datasets empirically demonstrate the effectiveness of our work in improving MIL performance.

1. Introduction

The main difference between traditional supervised learning (SL) and multiple-instance learning (MIL) [56] is that SL takes instances as learning objects and aims to predict the class labels of unseen instances, whereas MIL takes bags (a set of instances is termed a bag) as learning objects with the goal of predicting the class labels of unseen bags. In MIL, only the class labels of bags are known in advance. Usually, a bag is labeled positive if it contains at least one positive instance; otherwise, it is labeled negative, which means that all instances in it are negative. Hence, one obvious property of MIL is that there are label ambiguities for instances in positive bags, because a positive bag can contain positive and negative instances simultaneously.

The terminology "multiple-instance learning" was originally proposed by Dietterich et al. [18] when they were investigating the drug-activity prediction problem. In their seminal paper, Dietterich et al. considered the problem of predicting whether a candidate drug molecule binds to a target protein. In particular, a molecule can take on many different shapes, and if any of these shapes conforms closely to the structure of the binding site, the candidate molecule binds to the target protein. By treating each molecule as a bag and each shape of a molecule as an instance, drug-activity prediction can be considered a typical MIL problem. In addition to drug-activity prediction, MIL can be applied in many other domains such as image categorization [13,29,30,31,35,44,57], image retrieval [12,33,54], text classification [4,40], stock selection [36], protein sequence classification [33,46], computer-aided diagnosis [8,22], and security applications [42]. Moreover, with the rapid development of MIL, many representative algorithms such as ID-APR [18], Diverse Density (DD) [35] and its improvement




EM-DD [55], Bayesian-KNN and Citation-KNN [49], MI-SVM and mi-SVM [4], MI-Kernel [25], MI-Graph and mi-Graph [58], Simple MI [19], and MILES [13] have been proposed to cope with various MIL tasks.

Based on the level at which an algorithm exploits the required discriminative information from MIL data, existing MIL algorithms can be roughly divided into two types [3]: instance-level algorithms and bag-level algorithms. Instance-level algorithms, such as ID-APR, DD, and EM-DD, exploit the discriminative information at the instance level; they initially attempt to obtain instance labels (or information related to instance labels) and then derive the bag labels from the instance labels (or related information). In contrast, bag-level algorithms, such as Bayesian-KNN, Citation-KNN, MI-Kernel, MI-Graph, mi-Graph, Simple MI, and MILES, exploit the discriminative information at the bag level; they treat each bag as an inseparable entity and aim to obtain the bag labels directly. Based on how the discriminative information is exploited, i.e., explicitly or implicitly, bag-level algorithms can be further divided into two subtypes [3]: bag-space algorithms and embedded-space algorithms. Because each bag is a set of vectors, with each vector describing an instance, and because the numbers of instances in different bags usually differ, the bag space is a non-vector space. Therefore, bag-space algorithms such as Bayesian-KNN, Citation-KNN, MI-Kernel, MI-Graph, and mi-Graph use a similarity function (e.g., a kernel function) or a distance function to evaluate explicitly the similarity or dissimilarity of any two bags. Different from bag-space algorithms, embedded-space algorithms such as Simple MI and MILES initially map each bag to a feature vector that implicitly summarizes the relevant information of the bag and then use supervised classifiers to conduct classifications on the embedded vectors.

Note that although some MIL algorithms obtain very promising classification results, most do not consider the feature-weighting problem; i.e., they treat all features equivalently and adopt all original features in classifications. Usually, the contributions of different features to multiple-instance classification differ. Some features contain valuable discriminative information and are helpful for improving MIL performance; these features should be highlighted. In contrast, some features contain only redundant and noisy information, which makes them useless and even harmful to discrimination; these features should be suppressed. Feature weighting, which highlights some features and suppresses others, is an important research direction in machine learning. By endowing each feature with a weighting coefficient, we can quantitatively describe the relevance of different features to the learning task (e.g., classification) by comparing the relative magnitudes of the weighting coefficients. Some representative feature-weighting algorithms, e.g., RELIEF [27], I-RELIEF [45], LESS [48], and LMFW [14], have obtained very promising learning performances in different SL domains. Considering that the task of MIL is to separate heterogeneous bags, in this paper we term features that can improve the discrimination of heterogeneous bags relevant features, and features that are useless or harmful to discrimination irrelevant features.
We hope to improve MIL performance by feature weighting, i.e., by highlighting relevant features and suppressing irrelevant ones. We focus our study on the bag level (more accurately, on the bag space); i.e., we take bags directly as learning objects and attempt to obtain large margins among heterogeneous bags in the weighted feature space via different bag-level distances. Because we propose three different bag-level multiple-instance feature-weighting algorithms based on different bag-level distances, for convenience of description we term our work the Bag-level Large Margin Multiple-instance Feature-Weighting (B-LM2FW) framework and term the individual algorithms different realizations of this framework. The three adopted bag-level distances are the minimal Hausdorff (minH) distance [49], the class-to-bag (C2B) distance [50], and the bag-to-bag (B2B) distance, respectively. Thus, the resulting feature-weighting algorithms based on the above three distances are termed LM2FW-minH, LM2FW-C2B, and LM2FW-B2B, respectively. Note that of the above three distances, minH and C2B are off-the-shelf, whereas B2B is newly proposed in this paper. Moreover, because the original C2B distance is class specific [50], i.e., different classes can have different metrics, we also treat the minH and B2B distances as class specific; thus, all three proposed bag-level feature-weighting algorithms are locally adaptive.

After introducing the above three feature-weighting algorithms, we also discuss how to optimize them. In particular, LM2FW-minH can be optimized by solving a linear programming (LP) problem, whereas both LM2FW-C2B and LM2FW-B2B are non-convex and are optimized by the block coordinate descent algorithm; i.e., the different types of unknown variables in LM2FW-C2B and LM2FW-B2B are updated alternately and iteratively. Moreover, according to the properties of the three proposed feature-weighting algorithms, we also provide the corresponding multiple-instance classifiers to be used after their feature-weighting preprocesses.

In summary, the main contributions of this paper are listed below.

(1) We propose a bag-level multiple-instance feature-weighting framework based on the large margin principle, namely B-LM2FW, which is applicable to different bag-level distances and can provide guidance for future algorithm design.

(2) Enlightened by the off-the-shelf C2B distance, which measures the similarity between any bag and the positive/negative super-bag, we propose a new bag-level distance, B2B, which enables us to measure the similarity between any two bags and is more flexible than C2B.

(3) We adopt three locally adaptive bag-level distances (minH, C2B, and B2B) as examples to demonstrate how to construct the corresponding bag-level feature-weighting algorithms under the B-LM2FW framework.

The remainder of this paper is organized as follows. We introduce several related works and discuss their relationship to our work in Section 2. In Section 3, we introduce useful notation and provide the formulation of the B-LM2FW framework. From Section 4 to Section 6, we substitute three different bag-level distances into the B-LM2FW framework to derive the corresponding bag-level feature-weighting algorithms and discuss how to optimize them. In Section 7, we analyze


the optimality of the three proposed feature-weighting algorithms and discuss classifications after their feature-weighting preprocesses. In Section 8, we experiment on synthetic and real-world datasets to conduct performance evaluations of the proposed feature-weighting algorithms by comparing them with several competing algorithms. We provide concluding remarks and discuss future work in Section 9.

2. Related work

In this section, we provide a brief introduction of some related works and discuss their relationship to our work. The related works include LMFW [14], M3I [50], MI-SVM [4], mi-SVM [4], MIRVM [41], EM-DD [55], ReliefF-MI [53], and M3IFW [11], in which LMFW is for supervised feature weighting, M3I is for multiple-instance Mahalanobis metric learning, MI-SVM, mi-SVM, and MIRVM are for multiple-instance feature projection, and EM-DD, ReliefF-MI, and M3IFW are for multiple-instance feature weighting.

2.1. Large margin feature weighting using single instances

LMFW is a supervised feature-weighting algorithm designed according to the large margin principle. The algorithm seeks large classification margins by forcing the distances among heterogeneous examples to be greater than those among homogeneous ones via feature weighting. Similar to LMFW, our work also tries to seek large classification margins among heterogeneous examples. One difference between our work and LMFW is that in our work each example refers to a bag, whereas in LMFW each example refers to an instance. The other difference is that the distance metric learned by LMFW is a global one, whereas the distances learned by our work are local ones; i.e., there is a unique weighting vector for all classes in LMFW, whereas there are different weighting vectors for different classes in our work.

2.2. Maximum margin metric learning using multiple instances

M3I is a C2B distance-based multiple-instance feature-extraction algorithm. The C2B distance was proposed by Wang et al. [50] to measure the similarity between a given bag and the positive or negative super-bag; C2B can thus be used to measure the closeness of this bag to the positive or negative class. The C2B distances in M3I are class specific, i.e., the C2B distance between any bag and the positive super-bag and that between any bag and the negative super-bag use different metrics. Our work is closely related to M3I because both are designed according to the large margin principle and aim to seek large margins among heterogeneous bags. There are two major differences between M3I and our work. The first is that M3I operates by Mahalanobis metric learning, whereas our work operates by feature weighting. The second is that the LM2FW-C2B/LM2FW-B2B algorithms in our work and M3I use different strategies to construct and optimize their objective functions. The objective functions of M3I in different optimization steps are inconsistent, i.e., they cannot be integrated into a unified objective function and thus cannot be optimized iteratively. In contrast, the objective functions of LM2FW-C2B/LM2FW-B2B in different optimization steps are consistent and can be optimized iteratively.

2.3. Feature projection using multiple instances

MI-SVM and mi-SVM are two multiple-instance extensions of the supervised learning algorithm support vector machines (SVM) [15]. The similarity between MI-SVM/mi-SVM and our work is that they are all designed based on the large margin principle.
Different from mi-SVM, which works at the instance level, both our work and MI-SVM work at the bag level. Moreover, MI-SVM/mi-SVM conducts nonlinear transformations by operating feature projection in the kernel space, whereas our work conducts linear transformations by operating feature weighting in the original space. Feature projection and feature weighting appear to be the same because both require a vector $w \in \mathbb{R}^D$ to conduct the transformation. Actually, they differ slightly. Feature projection projects a $D$-dimensional example $x$ into the one-dimensional space and then adopts the one-dimensional projection $w^T x$ to cope with subsequent learning tasks (e.g., classifications). Feature weighting endows each feature of $x$ with a weighting coefficient and then utilizes the $D$-dimensional weighted example $w \circ x$ (here, $\circ$ denotes the element-wise product of two vectors) to cope with subsequent learning tasks.

MIRVM is an algorithm that integrates multiple-instance classification and feature weighting. MIRVM initially extends the supervised logistic regression (LR) model to MIL and uses the extended model as the basic multiple-instance classifier. MIRVM then selects the feature subset according to the feature-weighting results obtained under the maximum a posteriori (MAP) rule. Hence, the design principles of MIRVM and our work are different; i.e., MIRVM is designed under the Bayesian framework, which requires operating parameter estimations via MAP, whereas our work is designed under the large margin framework, which directly maximizes the classification margins among heterogeneous bags.

2.4. Feature weighting using multiple instances

EM-DD is a combination of the expectation maximization (EM) algorithm and the multiple-instance classifier diverse density (DD). That is, EM-DD re-explains DD from a probabilistic perspective and thus optimizes DD via the EM algorithm.


Both EM-DD and our work address multiple-instance feature weighting but are designed according to different principles. EM-DD is designed based on the Bayesian principle and aims at identifying the most appropriate hypothesis that maximizes the DD function in a globally weighted feature space, whereas our work is designed based on the large margin principle and aims at seeking large classification margins among heterogeneous bags in locally (i.e., class-specific) weighted feature spaces.

ReliefF-MI works for multiple-instance feature weighting and is an extension of the supervised feature-weighting algorithm RELIEF [27]; hence, it is a filter-type learning algorithm [34]. Similar to RELIEF, ReliefF-MI initially searches the nearest hits and misses of randomly selected examples (bags), calculates the average difference between the nearest-miss and nearest-hit distances, and finally endows different features with different weighting coefficients according to the above difference. Different from our work, which learns class-specific weighting vectors, ReliefF-MI learns a unique weighting vector for all classes.

Both M3IFW and our work are multiple-instance feature-weighting algorithms designed according to the large margin principle. They have two main differences. The first is that M3IFW is an instance-level feature-weighting algorithm, whereas all three algorithms proposed in our work are bag-level ones. More specifically, M3IFW initially selects representative instances for both positive and negative bags and then uses the instance-level distances, i.e., the distances between those selected representative instances, to learn the required weighting vector; in contrast, our work directly uses bag-level distances, i.e., distances between different bags or distances between bags and super-bags, to design the corresponding feature-weighting algorithms. The second difference is that M3IFW is a global feature-weighting algorithm because it learns a unique weighting vector for both the positive and negative classes, whereas we learn different weighting vectors for different classes, so all three proposed feature-weighting algorithms are local ones. In brief, compared with M3IFW, designing class-specific feature-weighting algorithms at the bag level is the major novelty of our work.

3. B-LM2FW framework

Many machine learning researchers have shown interest in designing large margin-related algorithms, and these algorithms can be applied to different domains such as supervised learning [10,14,15,48], semi-supervised learning [1,6,37], and unsupervised learning [32,47,52]. Among these algorithms, perhaps SVM [15] is the most popular and is usually treated as state-of-the-art. Considering the general success of the large margin principle in the machine learning community, in this paper we focus on constructing large margin-related multiple-instance feature-weighting algorithms based on different bag-level distances. For descriptive convenience, we consider all three algorithms part of the Bag-level Large Margin Multiple-instance Feature-Weighting (B-LM2FW) framework and treat each algorithm as a specific realization of this framework.

We begin our discussion by introducing useful notation. Suppose there are $l = l^+ + l^-$ bags in total, of which $l^+$ bags are positive and the other $l^-$ bags are negative. $B_i^+$ ($i \in \{1, \dots, l^+\}$) denotes the $i$th positive bag, with $i$ the positive bag index.
Let $x_{ij}^+ \in \mathbb{R}^D$ be the $j$th instance of $B_i^+$; then $B_i^+$ can be expressed as $[x_{i1}^+, \dots, x_{in_i^+}^+]$, where $n_i^+$ denotes the number of instances in $B_i^+$. Similarly, $B_i^-$ ($i \in \{1, \dots, l^-\}$), $x_{ij}^- \in \mathbb{R}^D$, and $n_i^-$ denote the $i$th negative bag, the $j$th instance in $B_i^-$, and the number of instances in $B_i^-$, respectively.

Without loss of generality, we let $d^{ho+}_{B_i,p}$ and $d^{he+}_{B_i,q}$ denote the distance between bag $B_i^+$ and its $p$th homogeneous example and that between $B_i^+$ and its $q$th heterogeneous example, respectively. Similarly, we let $d^{ho-}_{B_i,p}$ and $d^{he-}_{B_i,q}$ denote the distance between $B_i^-$ and its $p$th homogeneous example and that between $B_i^-$ and its $q$th heterogeneous example, respectively. Because we focus on designing bag-level algorithms, an "example" here can be either a bag or a super-bag, and "homogeneous" or "heterogeneous" means that the example has a class label the same as or different from that of the objective bag. A super-bag is the collection of all instances in all bags belonging to a given class. Because standard MIL considers a binary classification problem, there are two super-bags in total, i.e., the positive one $SB^+$ and the negative one $SB^-$, which can, respectively, be expressed as

$$SB^+ = \{x_{11}^+, \dots, x_{1n_1^+}^+, x_{21}^+, \dots, x_{2n_2^+}^+, \dots, x_{l^+1}^+, \dots, x_{l^+n_{l^+}^+}^+\} = \{s_1^+, \dots, s_{c^+}^+\} \tag{1}$$

$$SB^- = \{x_{11}^-, \dots, x_{1n_1^-}^-, x_{21}^-, \dots, x_{2n_2^-}^-, \dots, x_{l^-1}^-, \dots, x_{l^-n_{l^-}^-}^-\} = \{s_1^-, \dots, s_{c^-}^-\} \tag{2}$$

where $s_k^+$ ($k \in \{1, \dots, c^+\}$) and $s_k^-$ ($k \in \{1, \dots, c^-\}$) denote the $k$th instance in $SB^+$ and the $k$th instance in $SB^-$, respectively; $\{s_1^+, \dots, s_{c^+}^+\}$ and $\{s_1^-, \dots, s_{c^-}^-\}$ are the index-reordered expressions of all instances in all positive bags and those in all negative bags, respectively; and $c^+$ and $c^-$ denote the numbers of instances in $SB^+$ and in $SB^-$, respectively.

In particular, $B_i^+$'s homogeneous example can be either a positive bag or the positive super-bag $SB^+$, and $B_i^+$'s heterogeneous example can be either a negative bag or the negative super-bag $SB^-$; $B_i^-$'s homogeneous and heterogeneous examples can be deduced by analogy. Note that, given a bag, the number of its homogeneous or heterogeneous examples can be either one (w.r.t. the case where each "example" is a super-bag) or no less than one (w.r.t. the case where each "example" is a bag).

Based on the large margin principle, the distance between each bag and its heterogeneous example is expected to be greater than that between the bag and its homogeneous example. Without loss of generality, the difference between the above two distances should be no less than one; otherwise, a hinge loss is triggered.
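As a small illustration of this setup, the sketch below builds a super-bag from one class's bags and evaluates the hinge slack behind the margin constraints; it assumes each bag is stored as a list of D-dimensional feature vectors, and the function names are illustrative rather than from the paper.

```python
# A minimal sketch of the notation above, assuming each bag is a list of
# D-dimensional feature vectors; function names are illustrative.
def build_super_bag(bags):
    """Collect all instances of all bags of one class into a super-bag (Eqs. (1)-(2))."""
    return [x for bag in bags for x in bag]

def hinge_slack(d_he, d_ho):
    """Slack triggered when a heterogeneous distance does not exceed a
    homogeneous distance by the unit margin (constraints of Eq. (3))."""
    return max(0.0, 1.0 - (d_he - d_ho))
```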

Mathematically, the B-LM2FW framework can be expressed as

$$\begin{aligned}
\min_{\xi, u} \quad & \sum_{i=1}^{l^+} \sum_{p} \sum_{q} \xi^+_{i,p,q} + \sum_{i=1}^{l^-} \sum_{p} \sum_{q} \xi^-_{i,p,q} + \alpha\, \mathrm{Reg}(u) \\
\mathrm{s.t.} \quad & d^{he+}_{B_i,q} - d^{ho+}_{B_i,p} \ge 1 - \xi^+_{i,p,q} \quad \forall i, p, q \\
& d^{he-}_{B_i,q} - d^{ho-}_{B_i,p} \ge 1 - \xi^-_{i,p,q} \quad \forall i, p, q \\
& \xi^+_{i,p,q} \ge 0, \; \xi^-_{i,p,q} \ge 0 \quad \forall i, p, q \\
& \text{restriction on } u
\end{aligned} \tag{3}$$

where the first two objective terms denote the sums of nonnegative slack variables introduced to allow soft margin constraints; the last objective term $\mathrm{Reg}(u)$ is a regularizer operated on $u$, where $u$ denotes all unknown variables other than $\xi^+$ and $\xi^-$ (i.e., the weighting vectors); and $\alpha$ is a trade-off parameter introduced to denote the relative importance of the different objective terms. The first two constraints seek large margins among heterogeneous bags; the third and fourth constraints make $\xi^+$ and $\xi^-$ nonnegative; and the last constraint is a restriction placed on the unknown variables $u$. Note that the constraint on $u$ (the last constraint of Eq. (3)) and the regularization on $u$ (the last objective term of Eq. (3)) work differently: the constraint makes $u$ nonnegative, whereas the regularization helps to avoid overfitting or unnecessary numerical problems.

We can adopt different bag-level distances to evaluate the similarity among bags and thus construct different multiple-instance feature-weighting algorithms. In the following three sections, we respectively adopt the minH, C2B, and B2B distances as examples to discuss how to design and optimize the corresponding feature-weighting algorithms.

4. LM2FW-minH

4.1. Introduction of LM2FW-minH

The minimal Hausdorff (minH) distance was first introduced by Wang and Zucker [49] when they were investigating nearest-neighbor-related multiple-instance classifiers. The minH distance between two bags is defined as the minimal squared Euclidean distance between any two instances derived from different bags (one from the first bag and the other from the second bag). For example, the minH distance between bag $F$ and bag $E$ can be expressed as

$$d_{\mathrm{minH}}(F, E) = \min_{r \in \{1,\dots,n_F\}} \; \min_{s \in \{1,\dots,n_E\}} \|f_r - e_s\|_2^2 \tag{4}$$

where $\|\cdot\|_2^2$ denotes the squared $\ell_2$-norm; $n_F$ and $n_E$ denote the numbers of instances in bag $F$ and in bag $E$, respectively; and $f_r \in \mathbb{R}^D$ and $e_s \in \mathbb{R}^D$ denote the $r$th instance in bag $F$ and the $s$th instance in bag $E$, respectively. By introducing the weighting vector $w \in \mathbb{R}^D$, the weighted minH distance can be expressed as

$$d^{w}_{\mathrm{minH}}(F, E) = \min_{r} \; \min_{s} \|w \circ f_r - w \circ e_s\|_2^2 = \min_{r} \; \min_{s} \sum_{d=1}^{D} w_d^2 \, (f_{r,d} - e_{s,d})^2 \tag{5}$$

where $\circ$ denotes the element-wise product of two vectors, and $w_d$, $f_{r,d}$, and $e_{s,d}$ denote the $d$th elements of $w$, $f_r$, and $e_s$, respectively.

The weighted minH distance in Eq. (5) is a global one; i.e., a unique weighting vector $w$ is shared by all classes. However, recent work on metric learning [50,51] demonstrated that local metrics are usually more flexible and powerful than global ones. The core idea of local metric learning is to design class-specific metrics; i.e., we have multiple metrics, each applied to one specific class. In particular, we adopt the following local minH distances in our design:

$$d^{w}_{\mathrm{minH}}(F, E) = \begin{cases} d^{w^+}_{\mathrm{minH}}(F, E) = \min_{r} \; \min_{s} \|w^+ \circ f_r - w^+ \circ e_s\|_2^2, & \text{if } F \text{ is a positive bag} \\ d^{w^-}_{\mathrm{minH}}(F, E) = \min_{r} \; \min_{s} \|w^- \circ f_r - w^- \circ e_s\|_2^2, & \text{if } F \text{ is a negative bag} \end{cases} \tag{6}$$

where $w^+$ and $w^-$ denote the positive and negative weighting vectors, respectively. Note that choosing $w^+$ or $w^-$ depends upon the class label of the first bag of $d^{w}_{\mathrm{minH}}(\cdot, \cdot)$, i.e., bag $F$. Actually, there is an alternative choice, i.e., terming $F$ the second bag (exchanging the places of $F$ and $E$) and making the choice according to the class label of the second bag. In short, what really matters to the choice between $w^+$ and $w^-$ is not the place $F$ lies in but the assumption that the class label of $F$ is known. (In the training phase, we must calculate the minH distance between two training bags, and the class labels of both bags are known. In the testing phase, we must calculate the minH distance between a training bag and a testing bag, and the class label of the training bag is known. Hence, in both phases, the above assumption is reasonable.) In general, the rule is that the positive weighting vector $w^+$ is used when calculating the minH distance between a positive bag (i.e., when $F$ is positive) and any other bag, and the negative vector $w^-$ is used when calculating the minH distance between a negative bag (i.e., when $F$ is negative) and any other bag.
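A minimal sketch of the locally weighted minH distance of Eqs. (5) and (6) follows; it assumes bags are lists of NumPy vectors, and the function names are illustrative.

```python
import numpy as np

def weighted_minh(F, E, w):
    """Eq. (5): squared weighted distance of the closest instance pair of F and E."""
    return min(float(np.sum((w * (f - e)) ** 2)) for f in F for e in E)

def local_minh(F, F_is_positive, E, w_pos, w_neg):
    """Eq. (6): the class-specific weighting vector follows the label of bag F."""
    return weighted_minh(F, E, w_pos if F_is_positive else w_neg)
```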

Let the homogeneous and heterogeneous examples of a bag, respectively, be its homogeneous and heterogeneous neighbors. Based on the large margin principle, in the weighted feature space, we expect the local minH distance between any bag and each of its $k_1$ heterogeneous neighboring bags to be greater than that between this bag and each of its $k_2$ homogeneous neighboring bags. Hence, by substituting local minH distances into the B-LM2FW framework, we can express the LM2FW-minH algorithm as

$$\begin{aligned}
\min_{\xi, m} \quad & \sum_{i=1}^{l^+} \sum_{p=1}^{k_2} \sum_{q=1}^{k_1} \xi^+_{i,p,q} + \sum_{i=1}^{l^-} \sum_{p=1}^{k_2} \sum_{q=1}^{k_1} \xi^-_{i,p,q} + \alpha \left( \|m^+\|_1 + \|m^-\|_1 \right) \\
\mathrm{s.t.} \quad & d^{w^-}_{\mathrm{minH}}(he_q, B_i^+) - d^{w^+}_{\mathrm{minH}}(ho_p, B_i^+) \ge 1 - \xi^+_{i,p,q}, \;\; \xi^+_{i,p,q} \ge 0 \\
& \qquad \forall i \in \{1,\dots,l^+\},\; p \in \{1,\dots,k_2\},\; q \in \{1,\dots,k_1\} \\
& d^{w^+}_{\mathrm{minH}}(he_q, B_i^-) - d^{w^-}_{\mathrm{minH}}(ho_p, B_i^-) \ge 1 - \xi^-_{i,p,q}, \;\; \xi^-_{i,p,q} \ge 0 \\
& \qquad \forall i \in \{1,\dots,l^-\},\; p \in \{1,\dots,k_2\},\; q \in \{1,\dots,k_1\} \\
& m_d^+ \ge 0, \; m_d^- \ge 0 \quad \forall d \in \{1,\dots,D\}
\end{aligned} \tag{7}$$

where $\|\cdot\|_1$ denotes the $\ell_1$-norm; $he_q$ and $ho_p$ denote a given bag's $q$th heterogeneous and $p$th homogeneous neighboring bags, respectively (for example, in $d^{w^-}_{\mathrm{minH}}(he_q, B_i^+)$ and $d^{w^+}_{\mathrm{minH}}(ho_p, B_i^+)$, $B_i^+$ is the given bag, and $he_q$ and $ho_p$ are $B_i^+$'s $q$th heterogeneous and $p$th homogeneous neighboring bags, respectively); and $m^+ \in \mathbb{R}^D$ and $m^- \in \mathbb{R}^D$ respectively denote the element-wise squares of $w^+$ and $w^-$, i.e., $m_d^+ = (w_d^+)^2$ and $m_d^- = (w_d^-)^2$ for all $d \in \{1,\dots,D\}$. The reason we use $m^+/m^-$ to replace $w^+/w^-$ is that through this replacement we can express LM2FW-minH as a linear programming (LP) problem, which is convex; otherwise (i.e., still using $w^+/w^-$), the constraints of LM2FW-minH are quadratic in $w^+/w^-$ and can be non-convex.

Note that there is no need to worry about the comparability of $d^{w^+}_{\mathrm{minH}}$ with $d^{w^-}_{\mathrm{minH}}$, because they are learned simultaneously by solving the same optimization problem, and references [50,51] have justified the comparability of different metrics learned by solving a unified optimization problem. If $d^{w^+}_{\mathrm{minH}}$ and $d^{w^-}_{\mathrm{minH}}$ were learned independently (e.g., by optimizing two different problems), then in the testing phase the two metrics might be meaninglessly comparable. However, because LM2FW-minH integrates $d^{w^+}_{\mathrm{minH}}$ and $d^{w^-}_{\mathrm{minH}}$ into a unified problem and optimizes them simultaneously (which implies that the interaction of the above two metrics is considered), we can ensure the comparability of these two metrics.

4.2. Optimization of LM2FW-minH

The LM2FW-minH algorithm of Eq. (7) is an LP problem and thus can be optimized with a global solution. The pseudocode of its optimization is provided in Algorithm 1.

Algorithm 1 Pseudocode of optimizing LM2FW-minH.
Input: positive bags $B_i^+$ ($i \in \{1,\dots,l^+\}$), negative bags $B_i^-$ ($i \in \{1,\dots,l^-\}$), trade-off parameter $\alpha$, number of heterogeneous neighboring bags $k_1$, number of homogeneous neighboring bags $k_2$.
Output: positive weighting vector $w^+$, negative weighting vector $w^-$.
Initialization: for each bag, search its $k_1$ heterogeneous and $k_2$ homogeneous neighboring bags based on the minH distance in the original space.
Optimization: solve the LP problem of Eq. (7) to obtain the solutions for $m^+$ and $m^-$, then let $w^+$ and $w^-$ be the element-wise square roots of $m^+$ and $m^-$, respectively.
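To make the LP structure of Eq. (7) concrete, the sketch below assembles and solves it with scipy.optimize.linprog. Beyond fixing the neighboring bags in the original space (as Algorithm 1 does), it additionally assumes that the minimizing instance pair inside each minH distance is also fixed in the original space, so that every weighted distance becomes linear in m; the helper and variable names are hypothetical, not from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def closest_pair_sqdiff(F, E):
    """phi(F, E): element-wise squared differences of the closest instance pair
    of bags F and E, found in the original space, so that the weighted minH
    distance becomes the linear function m . phi."""
    best, phi = np.inf, None
    for f in F:
        for e in E:
            d = float(np.sum((f - e) ** 2))
            if d < best:
                best, phi = d, (f - e) ** 2
    return phi

def solve_lm2fw_minh(triples_pos, triples_neg, D, alpha):
    """triples_pos: (B_i^+, ho_p, he_q) bag triples for positive bags, and
    similarly triples_neg; variables x = [m_plus (D); m_minus (D); xi (T)]."""
    triples = triples_pos + triples_neg
    T = len(triples)
    A_ub, b_ub = [], []
    for t, (Bi, ho, he) in enumerate(triples):
        phi_he = closest_pair_sqdiff(he, Bi)
        phi_ho = closest_pair_sqdiff(ho, Bi)
        row = np.zeros(2 * D + T)
        if t < len(triples_pos):   # positive bag: he uses m_minus, ho uses m_plus
            row[:D], row[D:2 * D] = phi_ho, -phi_he
        else:                      # negative bag: he uses m_plus, ho uses m_minus
            row[:D], row[D:2 * D] = -phi_he, phi_ho
        row[2 * D + t] = -1.0      # slack variable xi_t
        A_ub.append(row)
        b_ub.append(-1.0)          # margin: ... >= 1 - xi  <=>  ... <= -1
    c = np.concatenate([alpha * np.ones(2 * D), np.ones(T)])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=(0, None), method="highs")
    m_plus, m_minus = res.x[:D], res.x[D:2 * D]
    return np.sqrt(m_plus), np.sqrt(m_minus)   # w is the element-wise sqrt of m
```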

5. LM2FW-C2B

5.1. Introduction of LM2FW-C2B

The class-to-bag (C2B) distance was proposed by Wang et al. [50] to measure the similarity between a super-bag and a bag, in which a super-bag is the collection of all instances in all bags belonging to a given class. For the definitions of the positive and negative super-bags, please refer to Eqs. (1) and (2), respectively. The C2B distance between a super-bag and a given bag is defined as the weighted average of the Euclidean distances between each instance in the super-bag and its nearest-neighbor instance in the given bag. For example, given bag $E$, the C2B distances between $SB^+$ and $E$ and between $SB^-$ and $E$ can be expressed, respectively, as

$$d_{\mathrm{C2B}}(SB^+, E) = \sum_{k=1}^{c^+} v_k^+ \|s_k^+ - \tilde{s}^+_{k,E}\|_2^2 \tag{8}$$

$$d_{\mathrm{C2B}}(SB^-, E) = \sum_{k=1}^{c^-} v_k^- \|s_k^- - \tilde{s}^-_{k,E}\|_2^2 \tag{9}$$

where $\tilde{s}^+_{k,E}$ denotes the nearest-neighbor instance of $s_k^+$ in bag $E$ and $\tilde{s}^-_{k,E}$ denotes that of $s_k^-$ in bag $E$; $v_k^+$ and $v_k^-$ are the significance coefficients of $\|s_k^+ - \tilde{s}^+_{k,E}\|_2^2$ and of $\|s_k^- - \tilde{s}^-_{k,E}\|_2^2$, respectively. Because both $\|s_k^+ - \tilde{s}^+_{k,E}\|_2^2$ and $\|s_k^- - \tilde{s}^-_{k,E}\|_2^2$ are instance-to-instance (I2I) distances, the C2B distance is actually a combination of multiple I2I distances, and the significance coefficients denote the relative importance of these I2I distances. The C2B distance can reflect the relative importance of different instances in a super-bag when measuring the similarity between this super-bag and a given bag. Compared with minH, C2B considers more instances' information, because the minH distance between two bags is only the minimal distance between any two instances derived from different bags, which implies that only two instances' information is considered. In the locally (i.e., class-specific) weighted feature space, the above two C2B distances can, respectively, be expressed as

$$d^{w^+}_{\mathrm{C2B}}(SB^+, E) = \sum_{k=1}^{c^+} v_k^+ \|s_k^+ \circ w^+ - \tilde{s}^+_{k,E} \circ w^+\|_2^2 \tag{10}$$

$$d^{w^-}_{\mathrm{C2B}}(SB^-, E) = \sum_{k=1}^{c^-} v_k^- \|s_k^- \circ w^- - \tilde{s}^-_{k,E} \circ w^-\|_2^2 \tag{11}$$

where $w^+ \in \mathbb{R}^D$ and $w^- \in \mathbb{R}^D$ are the positive and negative weighting vectors, respectively. Eqs. (10) and (11) are the feature-weighted versions of Eqs. (8) and (9), respectively; i.e., Eq. (8) denotes the C2B distance between $SB^+$ and $E$ in the original space, whereas Eq. (10) denotes that in the weighted feature space. Note that in Eqs. (10) and (11), choosing $w^+$ or $w^-$ depends upon the class label of the super-bag, i.e., $w^+$ is chosen for $SB^+$ and $w^-$ is chosen for $SB^-$.
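A minimal sketch of the weighted C2B distance of Eqs. (10) and (11) follows, assuming the super-bag is a list of instances with significance coefficients and the bag E is a list of instances; all names are illustrative.

```python
import numpy as np

def weighted_c2b(super_bag, v, E, w):
    """Eqs. (10)-(11): weighted combination, over instances s_k of the
    super-bag, of the I2I distance to s_k's nearest neighbor in bag E.
    The nearest neighbor is searched in the weighted space here; fixing it
    in the original space would be an equally simple variant."""
    total = 0.0
    for s_k, v_k in zip(super_bag, v):
        d_nn = min(float(np.sum((w * (s_k - e)) ** 2)) for e in E)
        total += v_k * d_nn
    return total
```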

Based on the large margin principle, in the locally weighted feature space, we expect the C2B distance between each bag and its heterogeneous super-bag to be greater than that between the bag and its homogeneous super-bag. Hence, we can express the LM2FW-C2B algorithm as

$$\begin{aligned}
\min_{\xi, m, v} \quad & \sum_{i=1}^{l^+} \xi_i^+ + \sum_{i=1}^{l^-} \xi_i^- + \alpha \left( \|m^+\|_1 + \|m^-\|_1 \right) + \beta \left( \|v^+\|_1 + \|v^-\|_1 \right) \\
\mathrm{s.t.} \quad & d^{w^-}_{\mathrm{C2B}}(SB^-, B_i^+) - d^{w^+}_{\mathrm{C2B}}(SB^+, B_i^+) \ge 1 - \xi_i^+, \;\; \xi_i^+ \ge 0 \quad \forall i \in \{1,\dots,l^+\} \\
& d^{w^+}_{\mathrm{C2B}}(SB^+, B_i^-) - d^{w^-}_{\mathrm{C2B}}(SB^-, B_i^-) \ge 1 - \xi_i^-, \;\; \xi_i^- \ge 0 \quad \forall i \in \{1,\dots,l^-\} \\
& m_d^+ \ge 0, \; m_d^- \ge 0 \quad \forall d \in \{1,\dots,D\} \\
& v_k^+ \ge 0 \quad \forall k \in \{1,\dots,c^+\} \\
& v_k^- \ge 0 \quad \forall k \in \{1,\dots,c^-\}
\end{aligned} \tag{12}$$

where both $\alpha$ and $\beta$ are trade-off parameters; $m^+$ and $m^-$ respectively denote the element-wise squares of $w^+$ and $w^-$; $m_d^+$ and $m_d^-$ respectively denote the $d$th elements of $m^+$ and $m^-$; and $v_k^+$ and $v_k^-$ respectively denote the $k$th elements of $v^+$ and $v^-$.

5.2. Optimization of LM2FW-C2B

Before introducing the optimization of LM2FW-C2B, for convenience we let $m = [m^+; m^-]$, $v = [v^+; v^-]$, and $\xi = [\xi_1^+; \dots; \xi_{l^+}^+; \xi_1^-; \dots; \xi_{l^-}^-]$. Eq. (12) has three types of unknown variables, $m$, $v$, and $\xi$, and it is non-convex w.r.t. all of them jointly. However, when $m$ is fixed, Eq. (12) is convex w.r.t. $v$ and $\xi$; when $v$ is fixed, Eq. (12) is convex w.r.t. $m$ and $\xi$. Hence, although it is difficult to optimize all unknown variables simultaneously, we can adopt the block coordinate descent algorithm [39] to update $\{v, \xi\}$ and $\{m, \xi\}$ alternately in an iterative way.

Each iterative round in the optimization of LM2FW-C2B consists of two alternating steps. In the first step, we fix $m$ and update $v$ and $\xi$. If $m$ is fixed, both the objective terms and constraints related to $m$ can be omitted; thus, Eq. (12) can be simplified as

$$\begin{aligned}
\min_{\xi, v} \quad & \sum_{i=1}^{l^+} \xi_i^+ + \sum_{i=1}^{l^-} \xi_i^- + \beta \left( \|v^+\|_1 + \|v^-\|_1 \right) \\
\mathrm{s.t.} \quad & d^{w^-}_{\mathrm{C2B}}(SB^-, B_i^+) - d^{w^+}_{\mathrm{C2B}}(SB^+, B_i^+) \ge 1 - \xi_i^+, \;\; \xi_i^+ \ge 0 \quad \forall i \in \{1,\dots,l^+\} \\
& d^{w^+}_{\mathrm{C2B}}(SB^+, B_i^-) - d^{w^-}_{\mathrm{C2B}}(SB^-, B_i^-) \ge 1 - \xi_i^-, \;\; \xi_i^- \ge 0 \quad \forall i \in \{1,\dots,l^-\} \\
& v_k^+ \ge 0 \quad \forall k \in \{1,\dots,c^+\} \\
& v_k^- \ge 0 \quad \forall k \in \{1,\dots,c^-\}
\end{aligned} \tag{13}$$

Both the objective and constraints of Eq. (13) are linear w.r.t. $v$ and $\xi$, so we can optimize it as an LP problem with a global solution. In the second step, we fix $v$ and update $m$ and $\xi$. Similarly, we omit the objective terms and constraints related to $v$ and simplify Eq. (12) as

$$\begin{aligned}
\min_{\xi, m} \quad & \sum_{i=1}^{l^+} \xi_i^+ + \sum_{i=1}^{l^-} \xi_i^- + \alpha \left( \|m^+\|_1 + \|m^-\|_1 \right) \\
\mathrm{s.t.} \quad & d^{w^-}_{\mathrm{C2B}}(SB^-, B_i^+) - d^{w^+}_{\mathrm{C2B}}(SB^+, B_i^+) \ge 1 - \xi_i^+, \;\; \xi_i^+ \ge 0 \quad \forall i \in \{1,\dots,l^+\} \\
& d^{w^+}_{\mathrm{C2B}}(SB^+, B_i^-) - d^{w^-}_{\mathrm{C2B}}(SB^-, B_i^-) \ge 1 - \xi_i^-, \;\; \xi_i^- \ge 0 \quad \forall i \in \{1,\dots,l^-\} \\
& m_d^+ \ge 0, \; m_d^- \ge 0 \quad \forall d \in \{1,\dots,D\}
\end{aligned} \tag{14}$$

Eq. (14) is also an LP problem and can be optimized with a global solution. The above two alternating steps are repeated iteratively until the change of the objective of Eq. (12) between two neighboring iterative rounds is small or the iteration number reaches the maximum allowed value. The pseudocode of optimizing LM2FW-C2B is provided in Algorithm 2.

Algorithm 2 Pseudocode of optimizing LM2FW-C2B.
Input: positive bags $B_i^+$ ($i \in \{1,\dots,l^+\}$), negative bags $B_i^-$ ($i \in \{1,\dots,l^-\}$), positive super-bag $SB^+$, negative super-bag $SB^-$, trade-off parameters $\alpha$ and $\beta$, maximum allowed iteration number $T$, and convergence tolerance $\delta$.
Output: positive weighting vector $w^+$, negative weighting vector $w^-$, positive significance vector $v^+$, and negative significance vector $v^-$.
Initialization: let $m^+ = m^- = [1,\dots,1]^T \in \mathbb{R}^D$.
Repeat iterations:
Step 1: fix $m^+$ and $m^-$, solve the LP problem of Eq. (13) w.r.t. $v^+$, $v^-$, $\xi^+$, and $\xi^-$;
Step 2: fix $v^+$ and $v^-$, solve the LP problem of Eq. (14) w.r.t. $m^+$, $m^-$, $\xi^+$, and $\xi^-$;
Step 3: calculate the objective of Eq. (12);
Until the relative change of the objective of Eq. (12) is less than $\delta$ or the iteration number reaches $T$.
Obtain the solutions of $m^+$, $m^-$, $v^+$, and $v^-$, and let $w^+$ and $w^-$ be the element-wise square roots of $m^+$ and $m^-$, respectively.
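The alternating structure of Algorithm 2 (and, later, Algorithm 3) can be written as the following skeleton, in which the two LP sub-problems are left as placeholder callables; solve_v_step, solve_m_step, and objective are hypothetical names standing for solvers of Eqs. (13) and (14) and the evaluation of Eq. (12).

```python
def block_coordinate_descent(solve_v_step, solve_m_step, objective,
                             m_init, T=50, delta=1e-4):
    """Alternating scheme of Algorithm 2: two LP sub-problems solved in turn
    until the relative objective change falls below delta or T rounds pass."""
    m, v, prev = m_init, None, None
    for _ in range(T):
        v = solve_v_step(m)    # Step 1: fix m, solve Eq. (13) for v (and xi)
        m = solve_m_step(v)    # Step 2: fix v, solve Eq. (14) for m (and xi)
        obj = objective(m, v)  # Step 3: evaluate the objective of Eq. (12)
        if prev is not None and abs(prev - obj) <= delta * max(abs(prev), 1e-12):
            break
        prev = obj
    return m, v
```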

6. LM2FW-B2B

6.1. Introduction of LM2FW-B2B

Enlightened by C2B, in this section we propose a new bag-level distance, the bag-to-bag (B2B) distance, and design a corresponding feature-weighting algorithm based on it. The B2B distance between positive bag $B_i^+$ and any given bag $E$, and that between negative bag $B_i^-$ and $E$, can, respectively, be expressed as

$$d_{\mathrm{B2B}}(B_i^+, E) = \sum_{k=1}^{n_i^+} v_{ik}^+ \|x_{ik}^+ - \tilde{x}^+_{ik,E}\|_2^2 \tag{15}$$

$$d_{\mathrm{B2B}}(B_i^-, E) = \sum_{k=1}^{n_i^-} v_{ik}^- \|x_{ik}^- - \tilde{x}^-_{ik,E}\|_2^2 \tag{16}$$

where $\tilde{x}^+_{ik,E}$ and $\tilde{x}^-_{ik,E}$ respectively denote the nearest-neighbor instance of $x_{ik}^+$ in bag $E$ and that of $x_{ik}^-$ in bag $E$; $v_{ik}^+$ and $v_{ik}^-$ are the significance coefficients of $\|x_{ik}^+ - \tilde{x}^+_{ik,E}\|_2^2$ and of $\|x_{ik}^- - \tilde{x}^-_{ik,E}\|_2^2$, respectively. Similar to C2B, the B2B distance is also a combination of multiple I2I distances, and different I2I distances are assigned different significance coefficients. B2B resembles C2B because both initially calculate the instance-level distance between each instance in a bag/super-bag and its nearest-neighbor instance, endow these instance-level distances with different significance coefficients, and finally use the weighted average of these instance-level distances as the corresponding bag-level distance. One obvious difference between B2B and C2B is that the former measures the similarity between two bags, whereas the latter measures the similarity between a super-bag and a bag. In the locally weighted feature space, the above two B2B distances can, respectively, be expressed as

$$d^{w^+}_{\mathrm{B2B}}(B_i^+, E) = \sum_{k=1}^{n_i^+} v_{ik}^+ \|x_{ik}^+ \circ w^+ - \tilde{x}^+_{ik,E} \circ w^+\|_2^2 \tag{17}$$

$$d^{w^-}_{\mathrm{B2B}}(B_i^-, E) = \sum_{k=1}^{n_i^-} v_{ik}^- \|x_{ik}^- \circ w^- - \tilde{x}^-_{ik,E} \circ w^-\|_2^2 \tag{18}$$

where $w^+ \in \mathbb{R}^D$ and $w^- \in \mathbb{R}^D$ denote the positive and negative weighting vectors, respectively. Hence, Eqs. (17) and (18) are the feature-weighted versions of Eqs. (15) and (16), respectively. Note that in Eqs. (17) and (18), choosing $w^+$ or $w^-$ depends upon the class label of the first bag in $d^{w}_{\mathrm{B2B}}(\cdot, \cdot)$.
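A minimal sketch of the weighted B2B distance of Eqs. (17) and (18); it is structurally identical to the C2B sketch above, except that the reference is a single bag $B_i$ with its own significance vector $v_i$ rather than a whole super-bag, and the names are again illustrative.

```python
import numpy as np

def weighted_b2b(B_i, v_i, E, w):
    """Eqs. (17)-(18): like weighted_c2b, but the reference is one bag B_i
    with its own significance vector v_i instead of a whole super-bag."""
    return sum(v_ik * min(float(np.sum((w * (x - e)) ** 2)) for e in E)
               for x, v_ik in zip(B_i, v_i))
```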

In the locally weighted feature space, we expect the B2B distance between any bag and each of its $k_1$ heterogeneous neighboring bags to be greater than that between the bag and each of its $k_2$ homogeneous neighboring bags. We thus express the LM2FW-B2B algorithm as

$$\begin{aligned}
\min_{\xi, m, v} \quad & \sum_{i=1}^{l^+} \sum_{p=1}^{k_2} \sum_{q=1}^{k_1} \xi^+_{i,p,q} + \sum_{i=1}^{l^-} \sum_{p=1}^{k_2} \sum_{q=1}^{k_1} \xi^-_{i,p,q} + \alpha \left( \|m^+\|_1 + \|m^-\|_1 \right) + \beta \left( \sum_{i=1}^{l^+} \|v_i^+\|_1 + \sum_{i=1}^{l^-} \|v_i^-\|_1 \right) \\
\mathrm{s.t.} \quad & d^{w^-}_{\mathrm{B2B}}(he_q, B_i^+) - d^{w^+}_{\mathrm{B2B}}(ho_p, B_i^+) \ge 1 - \xi^+_{i,p,q}, \;\; \xi^+_{i,p,q} \ge 0 \\
& \qquad \forall i \in \{1,\dots,l^+\},\; p \in \{1,\dots,k_2\},\; q \in \{1,\dots,k_1\} \\
& d^{w^+}_{\mathrm{B2B}}(he_q, B_i^-) - d^{w^-}_{\mathrm{B2B}}(ho_p, B_i^-) \ge 1 - \xi^-_{i,p,q}, \;\; \xi^-_{i,p,q} \ge 0 \\
& \qquad \forall i \in \{1,\dots,l^-\},\; p \in \{1,\dots,k_2\},\; q \in \{1,\dots,k_1\} \\
& m_d^+ \ge 0, \; m_d^- \ge 0 \quad \forall d \in \{1,\dots,D\} \\
& v_{i,k}^+ \ge 0 \quad \forall i \in \{1,\dots,l^+\},\; k \in \{1,\dots,n_i^+\} \\
& v_{i,k}^- \ge 0 \quad \forall i \in \{1,\dots,l^-\},\; k \in \{1,\dots,n_i^-\}
\end{aligned} \tag{19}$$

where $he_q$ and $ho_p$ respectively denote a given bag's $q$th heterogeneous and $p$th homogeneous neighboring bags, i.e., in $d^{w^-}_{\mathrm{B2B}}(he_q, B_i^+)$ and $d^{w^+}_{\mathrm{B2B}}(ho_p, B_i^+)$, $B_i^+$ is the given bag, and $he_q$ and $ho_p$ are $B_i^+$'s $q$th heterogeneous and $p$th homogeneous neighboring bags, respectively; $\alpha$ and $\beta$ are trade-off parameters; and $m^+$ and $m^-$ are the element-wise squares of $w^+$ and $w^-$, respectively.

6.2. Optimization of LM2FW-B2B

For descriptive convenience, we let $m = [m^+; m^-]$, $v = [v_1^+; \dots; v_{l^+}^+; v_1^-; \dots; v_{l^-}^-]$, and $\xi = [\xi^+_{1,1,1}; \dots; \xi^+_{l^+,k_2,k_1}; \xi^-_{1,1,1}; \dots; \xi^-_{l^-,k_2,k_1}]$. Similar to LM2FW-C2B, the LM2FW-B2B algorithm in Eq. (19) is non-convex w.r.t. all unknown variables but is convex w.r.t. part of them when the others are fixed. That is, it is convex w.r.t. $\xi$ and $v$ when $m$ is fixed, and convex w.r.t. $\xi$ and $m$ when $v$ is fixed. Hence, we also use the block coordinate descent algorithm to optimize LM2FW-B2B; i.e., in each step we update part of the unknown variables while fixing the others and repeat these steps alternately in an iterative way.

Each iterative round consists of two steps. In the first step, we update $v$ and $\xi$ by fixing $m$. Because $m$ is fixed, terms related to $m$ can be omitted, and Eq. (19) can be simplified as

$$\begin{aligned}
\min_{\xi, v} \quad & \sum_{i=1}^{l^+} \sum_{p=1}^{k_2} \sum_{q=1}^{k_1} \xi^+_{i,p,q} + \sum_{i=1}^{l^-} \sum_{p=1}^{k_2} \sum_{q=1}^{k_1} \xi^-_{i,p,q} + \beta \left( \sum_{i=1}^{l^+} \|v_i^+\|_1 + \sum_{i=1}^{l^-} \|v_i^-\|_1 \right) \\
\mathrm{s.t.} \quad & d^{w^-}_{\mathrm{B2B}}(he_q, B_i^+) - d^{w^+}_{\mathrm{B2B}}(ho_p, B_i^+) \ge 1 - \xi^+_{i,p,q}, \;\; \xi^+_{i,p,q} \ge 0 \\
& \qquad \forall i \in \{1,\dots,l^+\},\; p \in \{1,\dots,k_2\},\; q \in \{1,\dots,k_1\} \\
& d^{w^+}_{\mathrm{B2B}}(he_q, B_i^-) - d^{w^-}_{\mathrm{B2B}}(ho_p, B_i^-) \ge 1 - \xi^-_{i,p,q}, \;\; \xi^-_{i,p,q} \ge 0 \\
& \qquad \forall i \in \{1,\dots,l^-\},\; p \in \{1,\dots,k_2\},\; q \in \{1,\dots,k_1\} \\
& v_{i,k}^+ \ge 0 \quad \forall i \in \{1,\dots,l^+\},\; k \in \{1,\dots,n_i^+\} \\
& v_{i,k}^- \ge 0 \quad \forall i \in \{1,\dots,l^-\},\; k \in \{1,\dots,n_i^-\}
\end{aligned} \tag{20}$$

In the second step of each iterative round, we update $m$ and $\xi$ by fixing $v$. In this step, terms related to $v$ can be omitted, and Eq. (19) can be simplified as

$$\begin{aligned}
\min_{\xi, m} \quad & \sum_{i=1}^{l^+} \sum_{p=1}^{k_2} \sum_{q=1}^{k_1} \xi^+_{i,p,q} + \sum_{i=1}^{l^-} \sum_{p=1}^{k_2} \sum_{q=1}^{k_1} \xi^-_{i,p,q} + \alpha \left( \|m^+\|_1 + \|m^-\|_1 \right) \\
\mathrm{s.t.} \quad & d^{w^-}_{\mathrm{B2B}}(he_q, B_i^+) - d^{w^+}_{\mathrm{B2B}}(ho_p, B_i^+) \ge 1 - \xi^+_{i,p,q}, \;\; \xi^+_{i,p,q} \ge 0 \\
& \qquad \forall i \in \{1,\dots,l^+\},\; p \in \{1,\dots,k_2\},\; q \in \{1,\dots,k_1\} \\
& d^{w^+}_{\mathrm{B2B}}(he_q, B_i^-) - d^{w^-}_{\mathrm{B2B}}(ho_p, B_i^-) \ge 1 - \xi^-_{i,p,q}, \;\; \xi^-_{i,p,q} \ge 0 \\
& \qquad \forall i \in \{1,\dots,l^-\},\; p \in \{1,\dots,k_2\},\; q \in \{1,\dots,k_1\} \\
& m_d^+ \ge 0, \; m_d^- \ge 0 \quad \forall d \in \{1,\dots,D\}
\end{aligned} \tag{21}$$

Both Eqs. (20) and (21) are LP problems and can be solved with global solutions. The pseudocode of optimizing LM2FW-B2B is provided in Algorithm 3.

Algorithm 3 Pseudocode of optimizing LM2FW-B2B.
Input: positive bags $B_i^+$ ($i \in \{1,\dots,l^+\}$), negative bags $B_i^-$ ($i \in \{1,\dots,l^-\}$), trade-off parameters $\alpha$ and $\beta$, maximum allowed iteration number $T$, and convergence tolerance $\delta$.
Output: positive weighting vector $w^+$, negative weighting vector $w^-$, positive significance vectors $v_i^+$ ($i \in \{1,\dots,l^+\}$), and negative significance vectors $v_i^-$ ($i \in \{1,\dots,l^-\}$).
Initialization: for each bag, search its $k_1$ heterogeneous and $k_2$ homogeneous neighboring bags based on some distance metric; let $m^+ = m^- = [1,\dots,1]^T \in \mathbb{R}^D$.
Repeat iterations:
Step 1: fix $m^+$ and $m^-$, solve the LP problem of Eq. (20) w.r.t. $v^+$, $v^-$, $\xi^+$, and $\xi^-$;
Step 2: fix $v^+$ and $v^-$, solve the LP problem of Eq. (21) w.r.t. $m^+$, $m^-$, $\xi^+$, and $\xi^-$;
Step 3: calculate the objective of Eq. (19);
Until the relative change of the objective of Eq. (19) is less than $\delta$ or the iteration number reaches $T$.
Obtain the solutions of $m^+$, $m^-$, $v^+$, and $v^-$, and let $w^+$ and $w^-$ be the element-wise square roots of $m^+$ and $m^-$, respectively.

In the initialization, we must adopt some distance metric to search each bag's heterogeneous and homogeneous neighboring bags. Unfortunately, there is no theoretical guidance for the selection of this distance metric. Therefore, for convenience, we empirically adopt the distance metric in Eqs. (15) and (16) to operate the initialization, with the assumption that $v_i^+ = [1/n_i^+, \dots, 1/n_i^+]^T \in \mathbb{R}^{n_i^+}$ and $v_i^- = [1/n_i^-, \dots, 1/n_i^-]^T \in \mathbb{R}^{n_i^-}$.

7. Optimality and classification of proposed algorithms

In this section, we discuss the optimality and classification of the above three proposed feature-weighting algorithms. First, we discuss the optimality problem. The LM2FW-minH algorithm shown in Eq. (7) is an LP problem and convex; thus, it can be optimized with a global solution. Because the formulations of LM2FW-C2B and LM2FW-B2B are non-convex w.r.t. all unknown variables, it is difficult to optimize all unknown variables simultaneously. As a result, we must adopt the block coordinate descent algorithm to update them alternately in an iterative way, and we can obtain only local solutions for LM2FW-C2B and LM2FW-B2B.

Next, we discuss the classification problem. Note that all three feature-weighting algorithms work only for data preprocessing; i.e., they cannot be directly used to conduct classifications but must cooperate with external classifiers. Considering that both LM2FW-minH and LM2FW-B2B use neighborhood information to construct large margins among heterogeneous bags, a natural choice is to adopt the K-Nearest-Neighbors (KNN) classifier to conduct classifications after the preprocessing of these two algorithms. The pseudocodes of the KNN classifiers in the LM2FW-minH and LM2FW-B2B weighted feature spaces are provided in Algorithms 4 and 5, respectively.

Algorithm 4 Pseudocode of the KNN classifier in the LM2FW-minH weighted feature space.
Input: positive training bags $B_i^+$ ($i \in \{1,\dots,l^+\}$), negative training bags $B_i^-$ ($i \in \{1,\dots,l^-\}$), positive weighting vector $w^+$, negative weighting vector $w^-$, testing bag $E$, and number of neighbors $k$ in KNN.
Output: class label of testing bag $E$.
Classification:
Step 1: calculate the minH distance between $E$ and each training bag in the locally weighted feature space according to the following rule: given any training bag denoted by $F$ (i.e., $F$ can be any bag of $\{B_1^+, \dots, B_{l^+}^+, B_1^-, \dots, B_{l^-}^-\}$), calculate the minH distance between $E$ and $F$ according to Eq. (6).
Step 2: find the $k$ nearest neighboring bags of $E$ by comparing the locally weighted minH distances calculated in Step 1, and then decide that $E$ belongs to the class to which most of its neighboring bags belong.
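A minimal sketch of the KNN rule of Algorithms 4 and 5 follows; bag_dist is assumed to be a locally weighted bag-level distance (e.g., a wrapper around the local_minh or weighted_b2b functions sketched earlier, already closed over the learned weighting and significance vectors), and all names are illustrative.

```python
from collections import Counter

def knn_bag_label(E, train_bags, train_labels, bag_dist, k=3):
    """Label E by the majority class of its k nearest training bags, where the
    distance to a training bag B uses the metric of B's own class (label y)."""
    dists = sorted((bag_dist(B, y, E), y) for B, y in zip(train_bags, train_labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]
```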

Because LM2FW-C2B adopts the local C2B distance to measure the similarity between each bag and the positive/negative super-bag, we can use the Nearest Super-Bag (NSB) classifier to conduct classifications: if the testing bag is nearer to the positive super-bag than to the negative one, the testing bag is classified as positive; otherwise, it is classified as negative. The pseudocode of the NSB classifier in the LM2FW-C2B weighted feature space is provided in Algorithm 6.


Algorithm 5 Pseudocode of the KNN classifier in the LM2FW-B2B weighted feature space.
Input: positive training bags $B_i^+$ ($i \in \{1,\dots,l^+\}$), negative training bags $B_i^-$ ($i \in \{1,\dots,l^-\}$), positive weighting vector $w^+$, negative weighting vector $w^-$, positive significance vectors $v_i^+$ ($i \in \{1,\dots,l^+\}$), negative significance vectors $v_i^-$ ($i \in \{1,\dots,l^-\}$), testing bag $E$, and number of neighbors $k$ in KNN.
Output: class label of testing bag $E$.
Classification:
Step 1: calculate the B2B distance between $E$ and each training bag in the locally weighted feature space according to the following rule: given any positive training bag $B_i^+$ with $i \in \{1,\dots,l^+\}$, the required B2B distance between $E$ and $B_i^+$ is calculated according to Eq. (17); given any negative training bag $B_i^-$ with $i \in \{1,\dots,l^-\}$, the required B2B distance between $E$ and $B_i^-$ is calculated according to Eq. (18).
Step 2: find the $k$ nearest neighboring bags of $E$ by comparing the locally weighted B2B distances calculated in Step 1, and then decide that $E$ belongs to the class to which most of its neighboring bags belong.

Algorithm 6 Pseudocode of the NSB classifier in the LM2FW-C2B weighted feature space.
Input: positive super-bag $SB^+$, negative super-bag $SB^-$, positive weighting vector $w^+$, negative weighting vector $w^-$, positive significance vector $v^+$, negative significance vector $v^-$, and testing bag $E$.
Output: class label of testing bag $E$.
Classification:
Step 1: calculate the locally weighted C2B distance between $SB^+$ and $E$ according to Eq. (10), and that between $SB^-$ and $E$ according to Eq. (11).
Step 2: if the locally weighted C2B distance between $SB^+$ and $E$ is less than that between $SB^-$ and $E$, decide that $E$ is positive; otherwise, decide that $E$ is negative.
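The NSB rule of Algorithm 6 reduces to a single comparison; the sketch below reuses the weighted_c2b function sketched in Section 5, with illustrative argument names.

```python
def nsb_bag_label(E, SB_pos, v_pos, w_pos, SB_neg, v_neg, w_neg):
    """Algorithm 6: E is positive iff its weighted C2B distance to SB+ is
    smaller than its weighted C2B distance to SB-."""
    d_pos = weighted_c2b(SB_pos, v_pos, E, w_pos)  # Eq. (10)
    d_neg = weighted_c2b(SB_neg, v_neg, E, w_neg)  # Eq. (11)
    return +1 if d_pos < d_neg else -1
```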

8. Experiments

In this section, we experiment on both synthetic and real-world datasets to conduct performance evaluations of our work. In particular, we initially use six synthetic datasets to evaluate the performance of the three proposed feature-weighting algorithms; this experiment gives us insight into the situation in which LM2FW-minH might perform less satisfactorily than LM2FW-C2B and LM2FW-B2B. Then, we adopt four LJ-r.f.s datasets that were constructed by Amar et al. [2] to mimic the generation of chemically realistic datasets, in order to compare our work with several multiple-instance data preprocessing algorithms. Next, we adopt five benchmark datasets; this experiment allows us to compare our work with a large body of other work consisting of both multiple-instance classification and data preprocessing algorithms. Finally, we adopt seven text categorization datasets to conduct performance evaluations of our work in large-scale, high-dimensional applications. (Both the data dimensionality and the number of instances in each of the seven text-categorization datasets are much greater than those in the other datasets used in our experiments, e.g., the four LJ-r.f.s and five benchmark datasets.)

Recall that $k_1$ and $k_2$ denote the numbers of heterogeneous and homogeneous neighboring bags, respectively, and that $\alpha$ and $\beta$ are trade-off parameters. Thus, there are three ($k_1$, $k_2$, $\alpha$), two ($\alpha$, $\beta$), and four ($k_1$, $k_2$, $\alpha$, $\beta$) free parameters in LM2FW-minH, LM2FW-C2B, and LM2FW-B2B, respectively. We adopt a parametrical study similar to that used in LMFW [14]: fixing $k_1 = k_2 = 3$, setting the candidate sets of both $\alpha$ and $\beta$ to {0.001, 0.01, 0.1, 1, 10, 100, 1000, 10,000}, and tuning $\alpha$ and $\beta$ (for LM2FW-minH, only $\alpha$ is tuned) jointly by operating five-fold cross validations [28] on the training set. Note that both $k_1$ and $k_2$ are set empirically and are not tuned carefully via techniques such as cross validation. Although tuning them carefully might result in better classification performance, this process can be very time-consuming in large-scale and/or high-dimensional applications and is occasionally even computationally impossible if the time complexity of tuning them is too high.

8.1. On synthetic datasets

In this subsection, we generate six synthetic datasets to conduct performance evaluations of our work. Each synthetic dataset consists of 1000 positive bags and 1000 negative bags. The number of instances in each bag is randomly selected from the integer set {1, ..., 5}. For each positive bag, we initially generate one positive instance and then generate the other instances (if it contains more than one instance) with their class labels decided by flipping a fair coin. For each negative bag, we directly generate the required number of negative instances. Each instance (whether positive or negative) consists of 20 dimensions, of which the first two dimensions are relevant and the other 18 dimensions are noise. All noisy dimensions are independent of one another, with each dimension drawn from a Gaussian with mean 0 and variance 16. The two positive relevant dimensions are Gaussian distributed with mean [−10, 10] and covariance [2, 0; 0, 2]. The distributing area of negative instances can be much greater than that of the positive ones (e.g., in image annotation, positive instances correspond to the image concept and usually occupy a very small part of the image area, whereas negative instances correspond to the background and can occupy a much larger part). Therefore, we do not fix the distribution of the negative relevant dimensions but instead attempt six different Gaussians. The six Gaussians have the same mean [10, −10] and different covariances, which are [2, 0; 0, 2], [10, 0; 0, 10], [20, 0; 0, 20], [30, 0; 0, 30], [40, 0; 0, 40], and [50, 0; 0, 50].
Note that the scales of the above six covariances increase gradually, i.e., from a scale comparable to that of the positive relevant covariance to much larger ones. With respect to the different negative relevant covariances, we term the six datasets Dataset 1 (w.r.t. covariance [2, 0; 0, 2]), Dataset 2 (w.r.t. covariance [10, 0; 0, 10]), ..., and Dataset 6 (w.r.t. covariance [50, 0; 0, 50]), respectively. Each of the above six datasets is independently generated ten times. For each generated dataset, half of the bags are used for training and the other half for testing. Given a multiple-instance learning algorithm, for each dataset (which consists of a training set and a testing set), we initially conduct five-fold cross validations on the training set to select the optimal parameter (or set of parameters) for this algorithm. We then fix the selected parameter (or set of parameters) to evaluate the performance of this algorithm on the testing set. Because each of the above six datasets is generated ten independent times, we use the classification accuracy averaged over the ten independent generations to evaluate the given algorithm's learning performance on this dataset.
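The generation protocol just described can be sketched as follows; the random seed and function names are illustrative, and neg_var stands for the diagonal entry of the negative relevant covariance (2 to 50).

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed

def make_instance(positive, neg_var):
    """20-dim instance: 2 relevant dims plus 18 noise dims of variance 16."""
    mean = [-10.0, 10.0] if positive else [10.0, -10.0]
    var = 2.0 if positive else neg_var
    relevant = rng.normal(mean, np.sqrt(var))
    noise = rng.normal(0.0, 4.0, size=18)  # std 4 -> variance 16
    return np.concatenate([relevant, noise])

def make_bag(positive, neg_var):
    """Bag of 1-5 instances; a positive bag gets one guaranteed positive
    instance, and the labels of the rest are decided by fair coin flips."""
    n = int(rng.integers(1, 6))
    if not positive:
        return [make_instance(False, neg_var) for _ in range(n)]
    return [make_instance(True, neg_var)] + \
           [make_instance(bool(rng.random() < 0.5), neg_var) for _ in range(n - 1)]

# e.g., Dataset 2 uses the negative relevant covariance [10, 0; 0, 10]
pos_bags = [make_bag(True, 10.0) for _ in range(1000)]
neg_bags = [make_bag(False, 10.0) for _ in range(1000)]
```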

Table 1. Average classification accuracies (%) and training times (seconds, in parentheses) of LM2FW-minH for different {k1, k2} pairs on the six synthetic datasets.

| {k1, k2} | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Dataset 5 | Dataset 6 |
|----------|-----------|-----------|-----------|-----------|-----------|-----------|
| {1, 1} | 70.2 (0.8) | 73.5 (0.7) | 81.9 (0.8) | 85.4 (1.0) | 87.7 (1.0) | 95.5 (0.9) |
| {3, 3} | 74.7 (1.4) | 77.5 (1.2) | 89.2 (1.3) | 93.9 (1.2) | 95.2 (1.1) | 98.2 (1.1) |
| {5, 5} | 75.3 (3.0) | 78.2 (2.8) | 88.7 (2.5) | 95.1 (2.3) | 96.4 (2.1) | 98.6 (2.1) |
| {7, 7} | 77.0 (4.7) | 78.5 (6.4) | 89.6 (5.3) | 95.6 (5.4) | 95.9 (5.3) | 99.0 (4.2) |
| {9, 9} | 76.4 (16.2) | 79.1 (13.7) | 90.2 (11.6) | 94.6 (11.8) | 96.7 (11.5) | 99.0 (9.7) |

Table 2. Average classification accuracies (%) and training times (seconds, in parentheses) of LM2FW-B2B for different {k1, k2} pairs on the six synthetic datasets.

| {k1, k2} | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Dataset 5 | Dataset 6 |
|----------|-----------|-----------|-----------|-----------|-----------|-----------|
| {1, 1} | 98.8 (24.6) | 99.0 (19.8) | 98.7 (21.1) | 98.4 (18.4) | 99.2 (23.7) | 99.2 (19.4) |
| {3, 3} | 99.3 (193.5) | 99.2 (257.4) | 99.1 (205.8) | 99.8 (187.6) | 99.7 (247.9) | 99.5 (225.6) |
| {5, 5} | 99.6 (1356.1) | 100 (1442.3) | 99.1 (1094.8) | 100 (1415.4) | 99.8 (1506.7) | 99.7 (1371.8) |
| {7, 7} | 100 (4656.8) | 100 (4267.1) | 99.8 (3875.5) | 99.7 (4246.1) | 100 (4741.5) | 99.8 (3954.9) |
| {9, 9} | 99.7 (12,453.4) | 100 (14,227.3) | 100 (12,525.1) | 100 (11,985.6) | 100 (10,443.6) | 100 (12,754.4) |

Table 3. Average classification accuracies (%) with standard deviations of the three proposed feature-weighting algorithms on the six synthetic datasets, together with the Friedman test results (average rank of each algorithm; p-value = 0.0057).

| Algorithm | Dataset 1 | Dataset 2 | Dataset 3 | Dataset 4 | Dataset 5 | Dataset 6 | Average rank |
|-----------|-----------|-----------|-----------|-----------|-----------|-----------|--------------|
| LM2FW-minH | 74.7±2.0 | 77.5±1.7 | 89.2±1.2 | 93.9±1.6 | 95.2±1.1 | 98.2±1.0 | 1.0000 |
| LM2FW-C2B | 99.1±0.3 | 98.9±0.6 | 99.4±0.5 | 99.5±0.5 | 99.6±0.5 | 99.4±0.3 | 2.1667 |
| LM2FW-B2B | 99.3±0.4 | 99.2±0.5 | 99.1±0.4 | 99.8±0.2 | 99.7±0.3 | 99.5±0.3 | 2.8333 |

At the beginning of this section (the second paragraph of Section 8), we clarified that for simplicity we fix $k_1$ and $k_2$ in both LM2FW-minH and LM2FW-B2B as $k_1 = k_2 = 3$ and tune only the other parameters. This choice is made by striking a balance between classification accuracy and time complexity. In Table 1, we provide the average classification accuracy of LM2FW-minH and the required CPU time to train LM2FW-minH (CPU: Intel Core i5-2400, 3.10 GHz; RAM: 4.00 GB) for the following fixed $\{k_1, k_2\}$ pairs: {1,1}, {3,3}, {5,5}, {7,7}, and {9,9}. The corresponding results of LM2FW-B2B are provided in Table 2. From these two tables, we find that the training time of both algorithms increases with the $\{k_1, k_2\}$ values and that the rate of increase for LM2FW-B2B is much greater than that for LM2FW-minH. Moreover, with increasing $\{k_1, k_2\}$ values, the accuracies of both algorithms also generally increase. However, when $k_1 = k_2 > 3$, the increase is not obvious and is occasionally trivial. As a result, by setting $k_1 = k_2 = 3$, we obtain a good tradeoff between accuracy and training time. In fact, not only in this experiment but also in the experiments on the benchmark and text categorization datasets, $k_1 = k_2 = 3$ proves to be a good empirical value (we also make a rough estimate of the effect of the $\{k_1, k_2\}$ values on accuracies in those experiments, and the results provide empirical insights into favorable $\{k_1, k_2\}$ values). In the following experiments, we fix $k_1 = k_2 = 3$ and tune only the other parameters for LM2FW-minH and LM2FW-B2B.

By fixing $k_1 = k_2 = 3$, we provide the average classification accuracies and corresponding standard deviations of the three proposed feature-weighting algorithms on the six synthetic datasets in Table 3. We conduct nonparametric statistical tests on the results shown in Table 3. The reason we choose nonparametric tests instead of parametric ones is that parametric tests usually make distributional assumptions that are often unsatisfied in real applications [16].
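A sketch of the statistical procedure used here, applying SciPy's Friedman test to the per-dataset accuracies of Table 3 and computing the critical difference of Eq. (22); the accuracy arrays are copied from Table 3.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Accuracies copied from Table 3 (one entry per synthetic dataset).
acc_minh = [74.7, 77.5, 89.2, 93.9, 95.2, 98.2]
acc_c2b  = [99.1, 98.9, 99.4, 99.5, 99.6, 99.4]
acc_b2b  = [99.3, 99.2, 99.1, 99.8, 99.7, 99.5]

stat, p_value = friedmanchisquare(acc_minh, acc_c2b, acc_b2b)  # p ~ 0.0057

k, N, q_005 = 3, 6, 2.343                      # algorithms, datasets, Nemenyi q
cd = q_005 * np.sqrt(k * (k + 1) / (6 * N))    # Eq. (22): CD ~ 1.3527
print(p_value, cd)
```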

J. Chai et al. / Information Sciences 367–368 (2016) 783–808

795

Table 4 The post-hoc Nemenyi test results of three proposed feature-weighting algorithms on six synthetic datasets. Average-Rank Difference

LM2FW-minH

LM2FW-C2B

LM2FW-B2B

LM2FW-minH LM2FW-C2B LM2FW-B2B

N/A 1.1667 1.8333

1.1667 N/A 0.6667

1.8333 0.6667 N/A

normally distributed, and ANOVA (analysis of variance) [20] assumes both normal distributions and the homogeneity of variances. In contrast, nonparametric tests usually make no restrictions on data distribution and thus can be used more widely in real applications. Compared with the issue of comparing two algorithms on a single dataset, the issue of comparing multiple (more than two) algorithms on multiple datasets is believed to be more essential to machine learning [16] and has attracted many researchers’ attention in recent years [17,23,24]. Hence, for each experiment conducted in this subsection and any that follow, we initially apply the nonparametric Friedman test [21] to compare multiple algorithms evaluated in the corresponding subsection on multiple datasets adopted in the corresponding subsection to show whether they perform differently on these datasets. If a difference exists, we proceed by applying the Nemenyi test [38] as the post-hoc procedure to find which algorithms actually differ. In this subsection, we apply the Friedman test (together with the post-hoc Nemenyi test if necessary) to make performance comparisons among the three proposed feature-weighting algorithms on the six synthetic datasets. During the Friedman test, candidate algorithms must be ranked on each dataset separately, and then the average rank of each algorithm should be calculated to conduct the test. Therefore, average ranks are very important to the Friedman test, and we provide them and the p-value in Table 3. The test results shown in Table 3 demonstrate that the three proposed feature-weighting algorithms perform differently at the 0.05 significance level because the calculated p-value is 0.0057 and less than 0.05. Next, we adopt the Nemenyi test as the post-hoc procedure to find which algorithms actually differ. The performances of two algorithms are significantly different if their average ranks differ by at least the critical difference (CD)

CD = q_α √( k(k+1) / (6N) )    (22)

where q_α denotes the critical value at significance level α, k denotes the number of algorithms, and N denotes the number of datasets. In our test, we choose α = 0.05 and k = 3, for which q_0.05 = 2.343; the calculated CD is 1.3527. In Table 4, we provide the average-rank difference between each pair of algorithms and mark the differences greater than CD with an asterisk; a marked difference implies that the corresponding two algorithms perform differently at the 0.05 significance level. The test results show that, statistically speaking, LM2FW-B2B performs similarly to LM2FW-C2B and better than LM2FW-minH on the six synthetic datasets.

Table 4 The post-hoc Nemenyi test results of the three proposed feature-weighting algorithms on six synthetic datasets. Average-rank differences greater than CD = 1.3527 are marked with an asterisk.

Average-Rank Difference   LM2FW-minH   LM2FW-C2B   LM2FW-B2B
LM2FW-minH                N/A          1.1667      1.8333*
LM2FW-C2B                 1.1667       N/A         0.6667
LM2FW-B2B                 1.8333*      0.6667      N/A

In general, both LM2FW-C2B and LM2FW-B2B obtain very promising classification results on all six datasets, whereas LM2FW-minH works unsatisfactorily: its classification accuracies are lower than those of LM2FW-C2B and LM2FW-B2B on most datasets. Table 3 reveals an interesting phenomenon: when the scale of the negative relevant covariance is comparable to that of the positive one (e.g., Dataset 1), LM2FW-minH performs unsatisfactorily, but as the former gradually grows beyond the latter (from Dataset 2 to Dataset 6), the classification performance of LM2FW-minH gradually improves. A possible explanation follows. First, assume that the distance between homogeneous instances is smaller than that between heterogeneous ones, which is reasonable for many MIL tasks. Because the minH distance is the minimal distance over all pairs of instances drawn from different bags, the minH distance between two positive bags can be either a distance between two positive instances (Case 1) or a distance between two negative instances (Case 2). If Case 2 occurs, both the minH distance between two positive bags and that between a positive bag and a negative bag are distances between negative instances; the two minH distances then become hard to distinguish, and so do positive and negative bags. We therefore hope to avoid Case 2. If the scale of the negative relevant covariance is comparable to that of the positive one, Cases 1 and 2 occur with comparable probabilities, because comparable scales imply comparable distributional areas of the positive and negative relevant dimensions. In contrast, if the scale of the negative relevant covariance is much greater than that of the positive one, Case 1 is much more likely than Case 2, and it becomes much easier to distinguish the minH distance between two positive bags from that between a positive bag and a negative bag. Another possible reason for LM2FW-minH's unsatisfactory performance is that the minH distance between two bags considers only one instance per bag and omits the potentially discriminative information carried by the other instances.

To show the effectiveness of the three proposed feature-weighting algorithms intuitively, we plot the weighting vectors obtained by the three algorithms on the six synthetic datasets in Figs. 1–3. Unlike the results in Table 3, which are averaged over ten independent generations, each figure shows the result of one randomly selected generation.
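For concreteness, the whole testing procedure can be reproduced from the Table 3 accuracies with a few lines of SciPy; this is our sketch, not the authors' code:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Accuracies from Table 3; rows are Datasets 1-6, columns are
# LM2FW-minH, LM2FW-C2B and LM2FW-B2B.
acc = np.array([[74.7, 99.1, 99.3],
                [77.5, 98.9, 99.2],
                [89.2, 99.4, 99.1],
                [93.9, 99.5, 99.8],
                [95.2, 99.6, 99.7],
                [98.2, 99.4, 99.5]])

# Friedman test: do the three algorithms perform differently?
stat, p = friedmanchisquare(acc[:, 0], acc[:, 1], acc[:, 2])
print(f"p-value: {p:.4f}")                      # ~0.0057, as in Table 3

# Average ranks (rank 1 = lowest accuracy, as in Table 3)
avg_ranks = rankdata(acc, axis=1).mean(axis=0)  # [1.0000, 2.1667, 2.8333]

# Nemenyi critical difference, Eq. (22), with q_0.05 = 2.343 for k = 3
k, N = acc.shape[1], acc.shape[0]
cd = 2.343 * np.sqrt(k * (k + 1) / (6 * N))     # = 1.3527
diff = np.abs(avg_ranks[:, None] - avg_ranks[None, :])
print(diff > cd)   # True marks pairs that differ significantly
```

On the Table 3 accuracies this recovers the average ranks 1.0000, 2.1667, and 2.8333 and flags only the minH/B2B pair as significant, consistent with Table 4.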


Fig. 1. Weighting vectors of LM2FW-minH on six synthetic datasets (amplitude versus feature index for w+ and w−). The results in subfigures (a)–(f) are for Datasets 1, 2, 3, 4, 5, and 6, respectively.


Fig. 2. Weighting vectors of LM2FW-C2B on six synthetic datasets (amplitude versus feature index for w+ and w−). The results in subfigures (a)–(f) are for Datasets 1, 2, 3, 4, 5, and 6, respectively.


Fig. 3. Weighting vectors of LM2FW-B2B on six synthetic datasets (amplitude versus feature index for w+ and w−). The results in subfigures (a)–(f) are for Datasets 1, 2, 3, 4, 5, and 6, respectively.


Table 5 Average classification accuracies (%) and corresponding standard deviations of different algorithms on four LJ-r.f.s datasets. This table also provides the Friedman test results, including the average rank of each algorithm and the p-value.

Algorithm                  LJ-150.283.2  LJ-150.283.4  LJ-150.283.10  LJ-150.283.15  Average Rank
ReliefF-MI+Citation-KNN    92.5±4.2      90.5±6.4      90.5±4.4       90.0±6.7       2.0000
M3IFW+Citation-KNN         98.0±3.5      98.5±2.4      98.5±2.4       98.0±3.5       4.0000
M3I                        97.5±3.5      98.0±3.5      96.5±3.4       97.5±2.6       3.0000
LM2FW-minH                 85.5±7.2      83.5±5.3      84.5±5.0       85.5±4.4       1.0000
LM2FW-C2B                  99.0±2.1      99.0±2.1      99.0±2.1       99.5±1.6       5.1250
LM2FW-B2B                  99.5±1.6      100±0.0       99.5±1.6       99.5±1.6       5.8750
p-value: 0.0013

Through the three figures, we find that LM2FW-C2B and LM2FW-B2B successfully find the relevant features on all six datasets, whereas LM2FW-minH succeeds only on Dataset 6. As the scale of the negative relevant covariance increases (from Dataset 1 to Dataset 6), LM2FW-minH selects fewer and fewer features until it correctly selects the two relevant ones. This trend is consistent with that shown in Table 3.

8.2. On LJ-r.f.s datasets

The LJ-r.f.s datasets were provided by Amar et al. [2] when investigating how to construct artificial datasets that mimic chemically realistic ones, in which some factors affecting the datasets can be controlled by the generators. LJ-r.f.s is the name of a family of datasets, and a specific dataset is obtained by fixing the values of r, f, and s. In the name "LJ-r.f.s", LJ is short for the Lennard-Jones potential, which is usually adopted as a basis to mimic intermolecular interactions when generating chemically realistic datasets; r, f, and s respectively denote the number of relevant features, the number of all features, and the scale factors used for constructing relevant features. We utilize four LJ-r.f.s datasets, LJ-150.283.2, LJ-150.283.4, LJ-150.283.10 and LJ-150.283.15, to compare the performances of several multiple-instance data preprocessing algorithms. We select these four datasets because their dimensionality of 283 is relatively high compared with that of the other LJ-r.f.s datasets, which makes them suitable for evaluating and comparing different data preprocessing algorithms. Each dataset contains 60 positive bags and 140 negative bags, the numbers of instances in positive and negative bags are 250 and 546, respectively, and each bag contains approximately 3.98 instances on average. For other details of these datasets, please refer to [2].

We compare our work with M3I [50], ReliefF-MI [53] and M3IFW [11]; M3I performs multiple-instance metric learning, whereas ReliefF-MI and M3IFW perform multiple-instance feature weighting. In M3I, parameter C, which controls the trade-off between the loss and regularization terms, is selected from the candidate set {0.001, 0.01, 0.1, 1, 10, 100, 1000, 10,000}. ReliefF-MI assigns each feature a weighting coefficient and eliminates features by comparing these coefficients with a preset threshold. Actually, we can also treat ReliefF-MI as a feature-weighting algorithm if we use all weighted features, rather than only the features whose weighting coefficients exceed the threshold, to conduct classifications. Considering that the three proposed algorithms (LM2FW-minH, LM2FW-C2B and LM2FW-B2B) all perform feature weighting, to make a fair comparison we also treat ReliefF-MI as a feature-weighting algorithm in the following experiments. In ReliefF-MI, parameter m is set to the number of all training bags, and parameter k is selected from the candidate set {1, 3, 5, 8, 10, 12, 15}. In M3IFW, the free parameters are selected from exactly the same candidate sets as in [11]. Both our algorithms and the competing algorithms are evaluated via one run of ten-fold cross validation, and their free parameters are selected by conducting five-fold cross validations on the training set. Note that in our experiments, the two different cross validations [28], ten-fold and five-fold, serve different purposes.
We use ten-fold cross validation as the rule to evaluate the classification performance of a learning algorithm; given a dataset, the ten-fold cross validation is conducted on the whole dataset. The five-fold cross validation aims to select the free parameter (or set of free parameters) that should be fixed in the testing phase; therefore, given a dataset consisting of a training set and a testing set, the five-fold cross validation is conducted only on the training set (a sketch of this two-level protocol is given at the end of this subsection).

The average classification accuracies and corresponding standard deviations of our algorithms and the competing algorithms, along with the Friedman test results including the average rank of each algorithm and the calculated p-value, are provided in Table 5. Because the p-value is less than 0.05, we conclude that the algorithms listed in Table 5 perform differently at the 0.05 significance level. The post-hoc Nemenyi test results are provided in Table 6, in which average-rank differences greater than CD are marked with an asterisk; the critical value at the 0.05 significance level is 2.850, and the calculated CD is 3.7702. The results in Table 6 demonstrate that the proposed feature-weighting algorithms using the C2B or B2B distance (LM2FW-C2B or LM2FW-B2B) perform better than the one using the minH distance (LM2FW-minH), which empirically shows that the minH distance is unsuitable for designing feature-weighting algorithms on the four LJ-r.f.s datasets.

In addition to the classification accuracies, we also show the variation of the classification accuracies of LM2FW-minH, LM2FW-C2B, and LM2FW-B2B w.r.t. different trade-off parameters on the four LJ-r.f.s datasets in Figs. 4–6. The classification accuracies of LM2FW-C2B and LM2FW-B2B seem insensitive to the variation in α; however, when α in LM2FW-minH or β in LM2FW-C2B becomes relatively large, there are obvious performance degradations.
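For concreteness, the two-level protocol described above can be sketched as follows. This is our illustration, not the authors' code: `train_and_score` stands for training any one of the algorithms with a given parameter and returning its accuracy on held-out bags, and `candidates` is its parameter grid.

```python
import numpy as np
from sklearn.model_selection import KFold

def nested_cv(bags, labels, candidates, train_and_score, seed=0):
    """Ten-fold CV for evaluation; five-fold CV on each training
    split for parameter selection."""
    labels = np.asarray(labels)
    outer_scores = []
    for tr, te in KFold(10, shuffle=True, random_state=seed).split(bags):
        best_param, best_mean = None, -np.inf
        for param in candidates:
            # inner five-fold CV restricted to the training split
            inner = [train_and_score(param,
                                     [bags[i] for i in tr[itr]], labels[tr[itr]],
                                     [bags[i] for i in tr[ite]], labels[tr[ite]])
                     for itr, ite in KFold(5, shuffle=True,
                                           random_state=seed).split(tr)]
            if np.mean(inner) > best_mean:
                best_param, best_mean = param, np.mean(inner)
        # fix the selected parameter and score the held-out fold
        outer_scores.append(train_and_score(best_param,
                                            [bags[i] for i in tr], labels[tr],
                                            [bags[i] for i in te], labels[te]))
    return np.mean(outer_scores)   # accuracy averaged over the ten folds
```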


Fig. 4. Classification accuracy of LM2FW-minH w.r.t. the trade-off parameter α on four LJ-r.f.s datasets.

Fig. 5. Classification accuracy of LM2FW-C2B w.r.t. trade-off parameters α and β on four LJ-r.f.s datasets. Subfigure (a) is for α; (b) is for β.

Fig. 6. Classification accuracy of LM2FW-B2B w.r.t. trade-off parameters α and β on four LJ-r.f.s datasets. Subfigure (a) is for α; (b) is for β.


Table 6 The post-hoc Nemenyi test results of different algorithms on four LJ-r.f.s datasets. Average-rank differences greater than CD = 3.7702 are marked with an asterisk.

Average-Rank Difference    ReliefF-MI+Citation-KNN  M3IFW+Citation-KNN  M3I      LM2FW-minH  LM2FW-C2B  LM2FW-B2B
ReliefF-MI+Citation-KNN    N/A                      2.0000              1.0000   1.0000      3.1250     3.8750*
M3IFW+Citation-KNN         2.0000                   N/A                 1.0000   3.0000      1.1250     1.8750
M3I                        1.0000                   1.0000              N/A      2.0000      2.1250     2.8750
LM2FW-minH                 1.0000                   3.0000              2.0000   N/A         4.1250*    4.8750*
LM2FW-C2B                  3.1250                   1.1250              2.1250   4.1250*     N/A        0.7500
LM2FW-B2B                  3.8750*                  1.8750              2.8750   4.8750*     0.7500     N/A

Table 7 Brief description of five benchmark datasets.

Dataset                            Musk1  Musk2  Elephant  Fox   Tiger
# Positive Bags                    47     39     100       100   100
# Instances in Positive Bags       207    1017   762       647   544
# Negative Bags                    45     63     100       100   100
# Instances in Negative Bags       269    5581   629       673   676
# Average Instances in Each Bag    5.17   64.69  6.96      6.60  6.10
# Features of Each Instance        166    230    230       230   230

8.3. On benchmark datasets

In this subsection, we adopt five benchmark datasets to empirically evaluate our work and compare it with several competing algorithms. The five benchmark datasets are Musk1, Musk2, Elephant, Fox, and Tiger, probably the five most widely used MIL datasets. Musk1 and Musk2 describe molecules: a molecule is considered a bag, and each steric configuration of it (i.e., each molecular shape) is considered an instance. Elephant, Fox, and Tiger are three image annotation datasets, in which each image is considered a bag and segmented into several regions, with each region being considered an instance. A brief description of these datasets is provided in Table 7. For other details of Musk1 and Musk2, please refer to [18]; for other details of Elephant, Fox, and Tiger, please refer to [4].

The competing algorithms include ID-APR [18], DD [35], EM-DD [55], MI-SVM [4], mi-SVM [4], MILES [13], Citation-KNN [49], MI-Kernel [25], MI-Graph [58], mi-Graph [58], M3I [50], ReliefF-MI [53], and M3IFW [11]. Because most competing algorithms were evaluated via ten-fold cross validations in their original papers, to make a fair comparison, the three proposed feature-weighting algorithms are evaluated via ten times ten-fold cross validations. For the competing algorithms evaluated via ten-fold cross validations (all of the above-listed competing algorithms except Citation-KNN, M3I, and ReliefF-MI), we simply replicate their published results for reference. We reevaluate Citation-KNN, M3I, and ReliefF-MI via ten times ten-fold cross validations and report the reevaluated results (considering that ReliefF-MI only performs feature weighting, we report the results of ReliefF-MI plus the Citation-KNN classifier). In Citation-KNN, the number of references is selected from the candidate set {1, …, 4}, and the number of citers is set as in [49], i.e., equal to the number of references plus 2. Parameter C of M3I is selected from the candidate set {0.001, 0.01, 0.1, 1, 10, 100, 1000, 10,000}. Parameter m of ReliefF-MI is set to the number of all training bags, and its parameter k is selected from the candidate set {1, 3, 5, 8, 10, 15, 20, 25, 30} on Musk1 and Musk2 and from the candidate set {1, 5, 10, 20, 30, 40, 50, 60, 70, 80} on Elephant, Fox, and Tiger. The free parameters of M3IFW are selected from exactly the same candidate sets as in [11]. Similar to our work, the free parameters (except the fixed ones) of Citation-KNN, M3I, and ReliefF-MI are selected by five-fold cross validations on the training set.

In Table 8, we provide the average classification accuracies and standard deviations of our algorithms and the competing algorithms, along with the Friedman test results, including the average rank of each algorithm (except for ID-APR and DD, whose classification accuracies are only available on the two musk datasets, not on the three image annotation datasets) and the p-value. The p-value is much less than 0.05, which means that the 14 algorithms (all algorithms listed in Table 8 except ID-APR and DD) perform differently at the 0.05 significance level on the five benchmark datasets. The post-hoc Nemenyi test results are provided in Table 9. Note that during the Nemenyi test, pairwise comparisons must be made among 14 algorithms, and it is difficult to include all 14 × 14 pairwise comparison results in a single table (14 columns do not fit on one page).
Therefore, we provide the Nemenyi test results in two sub-tables, Tables 9(a) and 9(b): Table 9(a) provides the pairwise comparison results between all 14 algorithms and the first 7 of them, and Table 9(b) provides the results between all 14 algorithms and the last 7. For this Nemenyi test, the critical value at the 0.05 significance level is 3.354, and the calculated CD is 8.8738. In Tables 9(a) and 9(b), all average-rank differences greater than CD are marked with an asterisk, which means that the corresponding pairs of algorithms perform differently at the 0.05 significance level. Through Tables 9(a) and 9(b), we find that, in general, LM2FW-B2B and M3IFW+Citation-KNN perform best, because each significantly outperforms four competing algorithms, and that LM2FW-C2B performs second best, because it outperforms two competing algorithms.
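For reference, the reference-plus-citer voting rule of Citation-KNN, on which several baselines above build, might be sketched as follows. This is our illustration using the minimal Hausdorff bag distance, not the implementation of [49]; the helper names are ours.

```python
import numpy as np

def min_hausdorff(A, B):
    """Minimal distance over all instance pairs of two bags
    (rows of A and B are instances)."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2).min()

def citation_knn(train_bags, train_labels, test_bag, n_ref=2):
    n_cit = n_ref + 2   # citers = references + 2, as in this subsection
    d_test = np.array([min_hausdorff(B, test_bag) for B in train_bags])
    refs = np.argsort(d_test)[:n_ref]       # the R nearest training bags
    citers = []
    for i, B in enumerate(train_bags):
        # bag i cites the test bag if fewer than n_cit training bags
        # lie closer to bag i than the test bag does
        d_i = [min_hausdorff(B, C) for j, C in enumerate(train_bags) if j != i]
        if np.sum(np.array(d_i) < d_test[i]) < n_cit:
            citers.append(i)
    votes = [train_labels[i] for i in list(refs) + citers]
    return int(np.mean(votes) > 0.5)         # majority vote, labels in {0, 1}
```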


Table 8 Average classification accuracies (%) and corresponding standard deviations of different algorithms on five benchmark datasets. This table also provides the Friedman test results, including the average rank of each algorithm and the p-value.

Algorithm                  Musk1      Musk2      Elephant   Fox        Tiger      Average Rank
ID-APR                     92.4       89.2       N/A        N/A        N/A        N/A
DD                         88.0       84.0       N/A        N/A        N/A        N/A
EM-DD                      84.8       84.9       78.3       56.1       72.1       1.6000
MI-SVM                     77.9       84.3       81.4       59.4       84.0       3.0000
mi-SVM                     87.4       83.6       82.0       58.2       78.9       2.6000
MILES                      86.3       87.7       81.0       62.0       80.0       3.9000
MI-Kernel                  88.0±3.1   89.3±1.5   84.3±1.6   60.3±1.9   84.2±1.0   5.4000
MI-Graph                   90.0±3.8   90.0±2.7   85.1±2.8   61.2±1.7   81.9±1.5   6.1000
mi-Graph                   88.9±3.3   90.3±2.6   86.8±0.7   61.6±2.8   86.0±1.6   7.4000
Citation-KNN               90.0±2.7   89.1±3.0   87.8±3.2   62.0±2.1   82.5±1.9   6.8000
ReliefF-MI+Citation-KNN    94.4±1.9   90.2±3.1   91.8±2.7   76.2±2.6   87.1±1.8   10.4000
M3IFW+Citation-KNN         96.7±3.1   93.2±2.8   92.5±2.1   77.1±3.2   89.9±2.4   13.2000
M3I                        88.6±3.4   90.5±3.7   89.2±1.8   70.4±2.6   87.2±2.9   9.2000
LM2FW-minH                 93.1±3.7   91.0±2.1   91.9±2.5   67.3±2.5   86.2±1.8   10.2000
LM2FW-C2B                  93.3±2.6   92.1±2.7   90.5±3.4   78.1±2.9   92.1±2.3   11.8000
LM2FW-B2B                  95.4±2.4   92.3±3.8   92.4±1.9   78.4±3.2   92.7±1.7   13.4000
p-value: 1.1312×10^-7

Table 9a The post-hoc Nemenyi test results of different algorithms on five benchmark datasets: pairwise comparisons between the fourteen algorithms listed in Table 8 (except ID-APR and DD) and the first seven of them. Average-rank differences greater than CD = 8.8738 are marked with an asterisk.

Average-Rank Difference    EM-DD     MI-SVM    mi-SVM    MILES    MI-Kernel  MI-Graph  mi-Graph
EM-DD                      N/A       1.4000    1.0000    2.3000   3.8000     4.5000    5.8000
MI-SVM                     1.4000    N/A       0.4000    0.9000   2.4000     3.1000    4.4000
mi-SVM                     1.0000    0.4000    N/A       1.3000   2.8000     3.5000    4.8000
MILES                      2.3000    0.9000    1.3000    N/A      1.5000     2.2000    3.5000
MI-Kernel                  3.8000    2.4000    2.8000    1.5000   N/A        0.7000    2.0000
MI-Graph                   4.5000    3.1000    3.5000    2.2000   0.7000     N/A       1.3000
mi-Graph                   5.8000    4.4000    4.8000    3.5000   2.0000     1.3000    N/A
Citation-KNN               5.2000    3.8000    4.2000    2.9000   1.4000     0.7000    0.6000
ReliefF-MI+Citation-KNN    8.8000    7.4000    7.8000    6.5000   5.0000     4.3000    3.0000
M3IFW+Citation-KNN         11.6000*  10.2000*  10.6000*  9.3000*  7.8000     7.1000    5.8000
M3I                        7.6000    6.2000    6.6000    5.3000   3.8000     3.1000    1.8000
LM2FW-minH                 8.6000    7.2000    7.6000    6.3000   4.8000     4.1000    2.8000
LM2FW-C2B                  10.2000*  8.8000    9.2000*   7.9000   6.4000     5.7000    4.4000
LM2FW-B2B                  11.8000*  10.4000*  10.8000*  9.5000*  8.0000     7.3000    6.0000

Table 9b The post-hoc Nemenyi test results of different algorithms on five benchmark datasets: pairwise comparisons between the fourteen algorithms listed in Table 8 (except ID-APR and DD) and the last seven of them. Average-rank differences greater than CD = 8.8738 are marked with an asterisk.

Average-Rank Difference    Citation-KNN  ReliefF-MI+Citation-KNN  M3IFW+Citation-KNN  M3I      LM2FW-minH  LM2FW-C2B  LM2FW-B2B
EM-DD                      5.2000        8.8000                   11.6000*            7.6000   8.6000      10.2000*   11.8000*
MI-SVM                     3.8000        7.4000                   10.2000*            6.2000   7.2000      8.8000     10.4000*
mi-SVM                     4.2000        7.8000                   10.6000*            6.6000   7.6000      9.2000*    10.8000*
MILES                      2.9000        6.5000                   9.3000*             5.3000   6.3000      7.9000     9.5000*
MI-Kernel                  1.4000        5.0000                   7.8000              3.8000   4.8000      6.4000     8.0000
MI-Graph                   0.7000        4.3000                   7.1000              3.1000   4.1000      5.7000     7.3000
mi-Graph                   0.6000        3.0000                   5.8000              1.8000   2.8000      4.4000     6.0000
Citation-KNN               N/A           3.6000                   6.4000              2.4000   3.4000      5.0000     6.6000
ReliefF-MI+Citation-KNN    3.6000        N/A                      2.8000              1.2000   0.2000      1.4000     3.0000
M3IFW+Citation-KNN         6.4000        2.8000                   N/A                 4.0000   3.0000      1.4000     0.2000
M3I                        2.4000        1.2000                   4.0000              N/A      1.0000      2.6000     4.2000
LM2FW-minH                 3.4000        0.2000                   3.0000              1.0000   N/A         1.6000     3.2000
LM2FW-C2B                  5.0000        1.4000                   1.4000              2.6000   1.6000      N/A        1.6000
LM2FW-B2B                  6.6000        3.0000                   0.2000              4.2000   3.2000      1.6000     N/A


Fig. 7. Classification accuracy of LM2FW-minH w.r.t. the trade-off parameter α on five benchmark datasets.

Compared with LM2FW-B2B and LM2FW-C2B, the remaining proposal in this paper, LM2FW-minH, does not work well, because it has no significant advantage over the competing algorithms.

Through Table 8, we find an interesting phenomenon: M3I performs worse than LM2FW-C2B on most datasets. From the viewpoint of design principles, M3I is very close to LM2FW-C2B; both seek large classification margins based on the C2B distance. The difference between them is that M3I performs metric learning and learns a full Mahalanobis matrix, whereas LM2FW-C2B performs feature weighting and learns a diagonal Mahalanobis matrix whose diagonal elements are the squared weighting coefficients; hence, M3I has more degrees of freedom than LM2FW-C2B (a toy numeric illustration of this difference is given at the end of this subsection). Nevertheless, M3I performs worse than LM2FW-C2B on most benchmark datasets, perhaps for the following reason. M3I optimizes its different types of unknown variables alternately, but the objective functions of the different optimization steps cannot be integrated into a unified one. Lacking a unified objective function, M3I does not optimize the unknown variables iteratively and neglects the interaction between the different types of unknown variables, which are correlated and can strongly influence one another. In contrast, the objective functions of LM2FW-C2B in its different alternating optimization steps can be integrated into a unified one; moreover, we can optimize the unknown variables of LM2FW-C2B iteratively, which accounts for this interaction and can help us obtain gradually improved solutions.

To show the effect of the trade-off parameters on the classification accuracies of the three proposed feature-weighting algorithms, in Figs. 7–9 we plot the variation of the classification accuracies of LM2FW-minH, LM2FW-C2B, and LM2FW-B2B w.r.t. different trade-off parameters on the five benchmark datasets. Note that the best results shown in these three figures are not exactly the same as those in Table 8: we conduct the experiments via ten-fold cross validations and report averaged results, and for each given value of a trade-off parameter, we might obtain optimal results in some folds but sub-optimal results in others. The classification performance seems insensitive to the trade-off parameters when their values are relatively small; however, when the values become relatively large, the performance can become unstable (e.g., in Fig. 8(b), an obvious degradation appears when β becomes large).
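The degrees-of-freedom difference noted above can be made concrete with a small numeric sketch (ours, for illustration only): a full positive semidefinite Mahalanobis matrix, as learned by M3I, versus the diagonal matrix diag(w)^2 implied by feature weighting, as learned by LM2FW-C2B.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
diff = x - y

# LM2FW-C2B style: d weights w -> diagonal Mahalanobis matrix diag(w^2)
w = rng.uniform(size=5)
d_weighted = diff @ np.diag(w ** 2) @ diff
# the diagonal case is exactly a weighted squared Euclidean distance
assert np.isclose(d_weighted, np.sum((w * diff) ** 2))

# M3I style: a full PSD matrix with O(d^2) free parameters
A = rng.normal(size=(5, 5))
M = A @ A.T
d_full = diff @ M @ diff
print(d_weighted, d_full)
```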

8.4. On text categorization datasets

Text categorization is an important application of MIL. In text categorization, a document can be divided into several segments that may contain overlapping passages, and if one segment is assigned a specific concept, the whole document is assigned this concept. Hence, we can treat text categorization as a typical MIL problem by treating each concept as a class, each document as a bag, and each segment of a document as an instance in a bag. In this subsection, we test the learning performance of the three proposed feature-weighting algorithms on the text categorization task. The employed dataset is OHSUMED, also called TREC9. The original data comprise 54,000 MEDLINE documents annotated with 4903 subject terms, each term denoting a binary concept. Each document (bag) is split into segments (instances) by overlapping windows of at most 50 words each (a toy sketch of this bag construction is given below). Similar to the processing in [4], a smaller subset consisting of seven concepts is used, yielding seven binary-class multiple-instance classification problems. A brief description of the seven datasets is provided in Table 10.

Compared with the five benchmark datasets, each of the seven text categorization datasets consists of many more instances, and each instance is a vector of much higher dimension. As a result, the time complexity on the seven text categorization datasets is much higher than that on the five benchmark datasets (e.g., the optimization of M3I requires solving a semidefinite programming problem whose Mahalanobis matrix has thousands of rows and columns, which is too time-consuming to return solutions in a reasonable time).
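As a toy sketch of this bag construction (our illustration; the 50% window overlap is an assumption, since the description above states only that segments may overlap):

```python
def document_to_bag(text, window=50, stride=25):
    """Split a document into overlapping windows of at most `window`
    words; each window becomes one instance of the bag."""
    words = text.split()
    if len(words) <= window:
        return [words]
    starts = list(range(0, len(words) - window + 1, stride))
    if starts[-1] + window < len(words):   # cover the trailing words
        starts.append(len(words) - window)
    return [words[s:s + window] for s in starts]

bag = document_to_bag("w " * 120)          # toy 120-word document
print(len(bag))                            # 4 instances of 50 words each
```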


Fig. 8. Classification accuracy of LM2FW-C2B w.r.t. trade-off parameters α and β on five benchmark datasets. Subfigure (a) is for α; (b) is for β.

Fig. 9. Classification accuracy of LM2FW-B2B w.r.t. trade-off parameters α and β on five benchmark datasets. Subfigure (a) is for α; (b) is for β.

Table 10 Brief description of seven text categorization datasets.

Dataset                            TST1   TST2   TST3   TST4   TST7   TST9   TST10
# Positive Bags                    200    200    200    200    200    200    200
# Instances in Positive Bags       1580   1715   1626   1754   1746   1684   1818
# Negative Bags                    200    200    200    200    200    200    200
# Instances in Negative Bags       1644   1629   1620   1637   1621   1616   1635
# Average Instances in Each Bag    8.06   8.36   8.12   8.48   8.42   8.25   8.63
# Features of Each Instance        6668   6842   6568   6626   7037   6982   7073

We therefore compare our work only with a subset of the competing algorithms used in Section 8.3. The results of these competing algorithms (EM-DD, MI-SVM, and mi-SVM) are replicated from the literature [5], where they were evaluated via ten-fold cross validations. Similarly, our algorithms are evaluated via one run of ten-fold cross validation, with their free parameters tuned by conducting five-fold cross validations on the training set. In ReliefF-MI [53], parameter m is set to the number of all training bags, and parameter k is selected from the candidate set {1, 10, 20, 50, 70}. In M3IFW [11], parameter r is set to 10% of the number of negative training bags, and parameter α is selected from the candidate set {0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10,000}. The performances of ReliefF-MI and M3IFW are likewise evaluated by us via one run of ten-fold cross validation, with their free parameters tuned via five-fold cross validations on the training set. The average classification accuracies, corresponding standard deviations, and the Friedman test results, including the average rank of each algorithm and the p-value, are shown in Table 11.


Table 11 Classification accuracies (%) of different algorithms on seven text categorization datasets. This table also provides the Friedman test results, including the average rank of each algorithm and the p-value.

Algorithm                  TST1       TST2       TST3       TST4       TST7       TST9       TST10      Average Rank
EM-DD                      85.8       84.0       69.0       80.5       75.4       65.5       78.5       3.5714
MI-SVM                     93.9       84.5       82.2       82.4       78.0       60.2       79.5       5.0000
mi-SVM                     93.6       78.2       87.0       82.8       81.3       67.5       79.6       6.0000
ReliefF-MI+Citation-KNN    65.5±5.0   62.8±5.2   62.5±3.6   67.5±4.2   64.8±4.5   60.8±3.9   66.5±4.1   1.1429
M3IFW+Citation-KNN         88.0±4.2   75.0±4.3   83.8±4.5   78.3±3.9   76.5±4.1   69.2±5.0   81.5±4.3   4.4286
LM2FW-minH                 71.5±4.6   63.3±4.1   73.5±4.0   70.8±3.2   70.5±3.8   64.8±3.4   71.8±3.3   2.2857
LM2FW-C2B                  96.5±2.1   87.3±3.6   89.0±3.6   87.3±3.4   81.8±4.3   72.8±4.5   84.5±3.5   8.0000
LM2FW-B2B                  92.3±3.4   77.0±5.4   86.0±4.7   81.8±4.8   77.0±3.4   71.5±3.8   84.3±4.6   5.5714
p-value: 2.5115×10^-6

Table 12 The post-hoc Nemenyi test results of different algorithms on seven text categorization datasets. Average-rank differences greater than CD = 3.9685 are marked with an asterisk.

Average-Rank Difference    EM-DD    MI-SVM   mi-SVM   ReliefF-MI+Citation-KNN  M3IFW+Citation-KNN  LM2FW-minH  LM2FW-C2B  LM2FW-B2B
EM-DD                      N/A      1.4286   2.4286   2.4286                   0.8571              1.2857      4.4286*    2.0000
MI-SVM                     1.4286   N/A      1.0000   3.8571                   0.5714              2.7143      3.0000     0.5714
mi-SVM                     2.4286   1.0000   N/A      4.8571*                  1.5714              3.7143      2.0000     0.4286
ReliefF-MI+Citation-KNN    2.4286   3.8571   4.8571*  N/A                      3.2857              1.1429      6.8571*    4.4286*
M3IFW+Citation-KNN         0.8571   0.5714   1.5714   3.2857                   N/A                 2.1429      3.5714     1.1429
LM2FW-minH                 1.2857   2.7143   3.7143   1.1429                   2.1429              N/A         5.7143*    3.2857
LM2FW-C2B                  4.4286*  3.0000   2.0000   6.8571*                  3.5714              5.7143*     N/A        2.4286
LM2FW-B2B                  2.0000   0.5714   0.4286   4.4286*                  1.1429              3.2857      2.4286     N/A

The standard deviations shown in Tables 5 and 11 and those shown in Table 8 are calculated in different ways. In Tables 5 and 11, because we perform ten-fold cross validation only once, the standard deviations are computed over ten values, each corresponding to one fold. In Table 8, because we perform ten-fold cross validations ten independent times, the standard deviations are computed over ten values, each corresponding to one run of ten-fold cross validation, i.e., each value is itself averaged over ten folds.

Table 11 shows that the p-value calculated by the Friedman test is much less than 0.05; thus, the algorithms listed in Table 11 perform differently at the 0.05 significance level. The post-hoc Nemenyi test results are provided in Table 12; the critical value equals 3.031, the calculated CD is 3.9685, and average-rank differences greater than CD are marked with an asterisk. The results in Table 11 demonstrate that LM2FW-C2B and LM2FW-B2B perform satisfactorily, obtaining very competitive classification results on most datasets. However, the performance of LM2FW-minH is not promising: it obtains much lower classification accuracies than most competing algorithms on most datasets.

Table 12 shows that LM2FW-C2B performs better than LM2FW-B2B here (LM2FW-C2B significantly outperforms three competing algorithms, whereas LM2FW-B2B outperforms just one), but Tables 9(a) and 9(b) show the opposite on the benchmark datasets (LM2FW-C2B outperforms two competing algorithms, whereas LM2FW-B2B outperforms four). These results demonstrate that LM2FW-C2B can perform better than LM2FW-B2B in some cases but worse in others. One possible reason for this phenomenon is as follows. In the testing phase, LM2FW-B2B classifies via the KNN classifier, whereas LM2FW-C2B classifies via the NSB classifier, which is similar to Nearest Mean (NM). (NSB is similar to NM because calculating the C2B distance in NSB uses the information of all instances in the positive/negative super-bag, just as calculating the mean vector in NM uses the information of all instances belonging to a given class; however, NSB is not exactly the same as NM because NSB does not need to calculate a mean vector for each class.) In short, KNN is a local classification model, whereas NSB is closer to a global one. If the training set is densely sampled (the ratio of the number of training samples to the data dimensionality is relatively high), there are adequate training samples to learn the local details of the data structure; in this situation, local models usually outperform global ones [9] because they are more flexible and powerful, and the results shown in Table 8 might justify this analysis. However, compared with the five benchmark datasets, the seven text categorization datasets are sparsely sampled, and there are insufficient training samples to learn the local details. As a result, the local details learned by LM2FW-B2B might be inaccurate, which makes LM2FW-B2B perform worse than LM2FW-C2B on the text categorization datasets.
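To make the KNN-versus-NSB contrast concrete, here is a minimal sketch (ours). The simple C2B form below, which averages each instance's nearest-neighbor distance to the super-bag, is an illustrative stand-in for the weighted C2B distance defined earlier in the paper.

```python
import numpy as np

def c2b(super_bag, bag):
    """Illustrative C2B distance: average over the bag's instances of
    each instance's nearest-neighbor distance to the super-bag."""
    d = np.linalg.norm(bag[:, None, :] - super_bag[None, :, :], axis=2)
    return d.min(axis=1).mean()

def nsb_predict(pos_super, neg_super, bag):
    # NSB: assign the bag to the nearer class super-bag (a global rule)
    return int(c2b(pos_super, bag) < c2b(neg_super, bag))

def knn_predict(train_bags, train_labels, bag, bag_dist, k=3):
    # bag-level KNN: majority vote over the k nearest bags (a local rule);
    # bag_dist can be any bag-level distance, e.g., a B2B-style one
    d = np.array([bag_dist(B, bag) for B in train_bags])
    nearest = np.argsort(d)[:k]
    return int(np.mean([train_labels[i] for i in nearest]) > 0.5)
```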
Another interesting phenomenon is that the instance-level feature-weighting algorithm M3IFW achieves very competitive learning performance on the five benchmark datasets (see Table 8) but less competitive performance on the seven text categorization datasets (see Table 11). One possible reason for this difference is provided below.


Fig. 10. Classification accuracy of LM2FW-minH w.r.t. the trade-off parameter α on seven text categorization datasets.

Fig. 11. Classification accuracy of LM2FW-C2B w.r.t. trade-off parameters α and β on seven text categorization datasets. Subfigure (a) is for α; (b) is for β.

As an instance-level learning algorithm, M3IFW uses only the information of some instances (the prototype instances) to conduct feature weighting, without considering the contribution of the other instances to discrimination. For (relatively) small-scale datasets such as the benchmark ones, considering only the prototype instances might be acceptable; however, for (relatively) large-scale datasets such as the text categorization ones, omitting the non-prototype instances can become problematic, because there are many such instances and they can strongly influence discrimination. In contrast, the C2B and B2B metrics consider the contribution of all instances in each bag or super-bag, which perhaps explains the superiority of LM2FW-C2B and LM2FW-B2B over M3IFW on the text categorization datasets.

Similar to the experiments on the LJ-r.f.s and benchmark datasets, we show the variation of the classification accuracies of LM2FW-minH, LM2FW-C2B, and LM2FW-B2B w.r.t. different trade-off parameters on the seven text categorization datasets in Figs. 10–12. As before, the classification performance occasionally degrades obviously when the trade-off parameters become relatively large but is insensitive to their variation when the values are relatively small.

9. Conclusions and future work

In this paper, we propose a large margin framework to guide the design of multiple-instance feature-weighting algorithms at the bag level and take two existing bag-level distances and one newly proposed one as examples to illustrate the specific design work. Experiments conducted on synthetic and real-world datasets demonstrate the effectiveness of the proposed framework in improving multiple-instance classification performance.


Fig. 12. Classification accuracy of LM2FW-B2B w.r.t. trade-off parameters α and β on seven text categorization datasets. Subfigure (a) is for α ; (b) is for β .

Future work includes the following three aspects. First, the current work focuses on standard MIL, which involves only two classes, and should be extended to multi-class versions. Second, we should extend the current work to related domains such as multiple-instance ranking [7,26], which combines MIL and ranking and has been applied in domains such as image retrieval and computational chemistry. Third, the current work performs feature learning in the original space; by using the "kernel trick", we can study feature learning in a Reproducing Kernel Hilbert (RKH) space with the expectation of obtaining further improved performance.

Acknowledgements

The project is supported by the National Natural Science Foundation of China (No. 61403273, No. 61402319, No. 61172179), the Natural Science Foundation of Shanxi Province (No. 2014021022-4, No. 2014021022-3), and the Special/Youth Foundation of Taiyuan University of Technology (No. 2012L088).

References

[1] Y. Altun, D. McAllester, M. Belkin, Maximum margin semi-supervised learning for structured variables, Adv. Neural Inf. Process. Syst. (2006) 18–33.
[2] R.A. Amar, D.R. Dooly, S.A. Goldman, Q. Zhang, Multiple-instance learning of real-valued data, in: Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 3–10.
[3] J. Amores, Multiple instance classification: review, taxonomy and comparative study, Artif. Intell. 201 (2013) 81–105.
[4] S. Andrews, I. Tsochantaridis, T. Hofmann, Support vector machines for multiple-instance learning, Adv. Neural Inf. Process. Syst. (2003) 561–568.
[5] S. Andrews, Learning from Ambiguous Examples, PhD thesis, Brown University, Providence, Rhode Island, USA, 2000.
[6] K. Bennett, A. Demiriz, Semi-supervised support vector machines, Adv. Neural Inf. Process. Syst. (1999) 368–374.
[7] C. Bergeron, J. Zaretzki, C. Breneman, K.P. Bennett, Multiple instance ranking, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 48–55.
[8] J. Bi, J. Liang, Multiple instance learning of pulmonary embolism detection with geodesic distance along vascular structure, in: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[9] J. Chai, H.W. Liu, Z. Bao, Combinatorial discriminant analysis: supervised feature extraction that integrates global and local criteria, Electron. Lett. 45 (18) (2009) 934–935.
[10] J. Chai, H.W. Liu, B. Chen, Z. Bao, Large margin nearest local mean classifier, Signal Process. 90 (1) (2010) 236–248.
[11] J. Chai, H.T. Chen, L.X. Huang, F.H. Shang, Maximum margin multiple-instance feature weighting, Pattern Recognit. 47 (6) (2014) 2091–2103.
[12] Y. Chen, J.Z. Wang, Image categorization by learning and reasoning with regions, J. Mach. Learn. Res. 5 (2004) 913–939.
[13] Y. Chen, J. Bi, J.Z. Wang, MILES: multiple-instance learning via embedded instance selection, IEEE Trans. Patt. Anal. Mach. Int. 28 (12) (2006) 1931–1947.
[14] B. Chen, H.W. Liu, J. Chai, Z. Bao, Large margin feature weighting method via linear programming, IEEE Trans. Knowl. Data Eng. 21 (10) (2009) 1475–1488.
[15] N. Cristianini, J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
[16] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[17] J. Derrac, S. García, D. Molina, F. Herrera, A practical tutorial on the use of nonparametric statistical tests as a methodology for comparing evolutionary and swarm intelligence algorithms, Swarm Evol. Comput. 1 (1) (2011) 3–18.
[18] T.G. Dietterich, R.H. Lathrop, T. Lozano-Perez, Solving the multiple instance problem with axis-parallel rectangles, Artif. Intell. 89 (1) (1997) 31–71.
[19] L. Dong, A Comparison of Multi-Instance Learning Algorithms, Master's thesis, University of Waikato, New Zealand, 2006.
[20] R.A. Fisher, Statistical Methods and Scientific Inference, 2nd edition, Hafner Publishing Co., New York, 1959.
[21] M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, J. Am. Statist. Assoc. 32 (1937) 675–701.
[22] G. Fung, M. Dundar, B. Krishnapuram, R.B. Rao, Multiple instance learning for computer aided diagnosis, Adv. Neural Inf. Process. Syst. (2007) 425–432.
[23] S. García, F. Herrera, An extension on "statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons, J. Mach. Learn. Res. 9 (2008) 2677–2694.
[24] S. García, A. Fernández, J. Luengo, F. Herrera, Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power, Inf. Sci. 180 (10) (2010) 2044–2064.


[25] T. Gartner, P.A. Flach, A. Kowalczyk, A.J. Smola, Multi-instance kernels, in: Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 179–186.
[26] Y. Hu, M. Li, N. Yu, Multiple-instance ranking: learning to rank images for image retrieval, in: Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[27] K. Kira, L.A. Rendell, A practical approach to feature selection, in: Proceedings of the 9th International Conference on Machine Learning, 1992, pp. 249–256.
[28] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, 1995, pp. 1137–1143.
[29] O.E. Kundakcioglu, P.M. Pardalos, A branch and bound algorithm for multiple instance classification, in: Proceedings of the 2008 International Conference on Machine Learning; Models, Technologies and Applications (MLMTA'08), 2, 2008, pp. 865–869.
[30] O.E. Kundakcioglu, O. Seref, P.M. Pardalos, Multiple instance learning via margin maximization, Appl. Numer. Math. 60 (4) (2010) 358–369.
[31] C. Leistner, A. Saffari, H. Bischof, MIForest: multiple-instance learning with randomized trees, in: Proceedings of the 11th European Conference on Computer Vision, 2010, pp. 29–42.
[32] Y.F. Li, I.W. Tsang, J.T. Kwok, Z.H. Zhou, Tighter and convex maximum margin clustering, in: Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 2009, pp. 344–351.
[33] W.J. Li, D.T. Yeung, MILD: multiple-instance learning via disambiguation, IEEE Trans. Knowl. Data Eng. 22 (1) (2010) 76–89.
[34] H. Liu, M. Hiroshi, Feature Selection for Knowledge Discovery and Data Mining, Springer, 1998.
[35] O. Maron, A.L. Ratan, Multiple-instance learning for natural scene classification, in: Proceedings of the 15th International Conference on Machine Learning, 1998, pp. 341–349.
[36] O. Maron, T. Lozano-Perez, A framework for multiple-instance learning, Adv. Neural Inf. Process. Syst. (1998) 570–576.
[37] P.K. Mallapragada, R. Jin, A.K. Jain, Y. Liu, SemiBoost: boosting for semi-supervised learning, IEEE Trans. Patt. Anal. Mach. Int. 31 (11) (2009) 2000–2014.
[38] P.B. Nemenyi, Distribution-Free Multiple Comparisons, PhD thesis, Princeton University, USA, 1963.
[39] J. Nocedal, S.J. Wright, Numerical Optimization, Springer Science+Business Media, 2006.
[40] S. Ray, M. Craven, Supervised versus multiple instance learning: an empirical comparison, in: Proceedings of the 22nd International Conference on Machine Learning, 2005, pp. 697–704.
[41] V.C. Raykar, B. Krishnapuram, J. Bi, M. Dundar, R.B. Rao, Bayesian multiple instance learning: automatic feature selection and inductive transfer, in: Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 808–815.
[42] G. Ruffo, Learning Single and Multiple Decision Trees for Security Applications, PhD thesis, University of Turin, Italy, 2000.
[43] G.D. Ruxton, The unequal variance t-test is an underused alternative to Student's t-test and the Mann-Whitney U test, Behav. Ecol. 17 (4) (2006) 688–690.
[44] O. Seref, O.E. Kundakcioglu, P.M. Pardalos, Selective linear and nonlinear classification, CRM Proc. Lect. Notes 45 (2008) 211–234.
[45] Y.J. Sun, Iterative RELIEF for feature weighting: algorithms, theories, and applications, IEEE Trans. Patt. Anal. Mach. Int. 29 (6) (2007) 1035–1051.
[46] Q. Tao, S. Scott, N.V. Vinodchandran, T.T. Osugi, SVM-based generalized multiple-instance learning via approximate box counting, in: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 799–806.
[47] H. Valizadegan, R. Jin, Generalized maximum margin clustering and unsupervised kernel learning, Adv. Neural Inf. Process. Syst. (2007) 1417–1424.
[48] C.J. Veenman, D.M.J. Tax, LESS: a model-based classifier for sparse subspaces, IEEE Trans. Patt. Anal. Mach. Int. 27 (9) (2005) 1496–1500.
[49] J. Wang, J.D. Zucker, Solving the multiple-instance problem: a lazy learning approach, in: Proceedings of the 17th International Conference on Machine Learning, 2000, pp. 1119–1125.
[50] H. Wang, H. Huang, F. Kamangar, F.P. Nie, C. Ding, Maximum margin multi-instance learning, Adv. Neural Inf. Process. Syst. (2011) 29–37.
[51] K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. 10 (2009) 207–244.
[52] L. Xu, J. Neufeld, B. Larson, D. Schuurmans, Maximum margin clustering, Adv. Neural Inf. Process. Syst. (2004) 1537–1544.
[53] A. Zafra, M. Pechenizkiy, S. Ventura, ReliefF-MI: an extension of ReliefF to multiple instance learning, Neurocomputing 75 (1) (2012) 210–218.
[54] Q. Zhang, S.A. Goldman, W. Yu, J. Fritts, Content-based image retrieval using multiple-instance learning, in: Proceedings of the 19th International Conference on Machine Learning, 2002, pp. 682–689.
[55] Q. Zhang, S.A. Goldman, EM-DD: an improved multiple-instance learning technique, Adv. Neural Inf. Process. Syst. (2002) 1073–1080.
[56] Z.H. Zhou, Multi-Instance Learning: A Survey, Technical Report, AI Lab, Department of Computer Science and Technology, Nanjing University, 2004.
[57] Z.H. Zhou, M.L. Zhang, Multi-instance multi-label learning with application to scene classification, Adv. Neural Inf. Process. Syst. (2007) 1609–1616.
[58] Z.H. Zhou, Y.Y. Sun, Y.F. Li, Multi-instance learning by treating instances as non-i.i.d. samples, in: Proceedings of the 26th International Conference on Machine Learning, 2009, pp. 1249–1256.