Multiple kernel-based multi-instance learning algorithm for image classification


J. Vis. Commun. Image R. 25 (2014) 1112–1117


Daxiang Li a,c,*, Jing Wang b, Xiaoqiang Zhao a, Ying Liu a,c, Dianwei Wang a,c

a School of Telecommunication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, China
b Computer Graphics, Imaging and Vision (CGIV) Research Group, School of Computing and Engineering, University of Huddersfield, Huddersfield HD1 3DH, UK
c Lab of Image Processing, Crime Scene Investigation Unit of Shaanxi Province, Xi'an 710121, China

* Corresponding author at: School of Telecommunication and Information Engineering, Xi'an University of Posts and Telecommunications, Xi'an 710121, China. Fax: +86 29 88166381. E-mail addresses: [email protected] (D. Li), [email protected] (J. Wang).

Article info

Article history: Received 19 July 2013; Accepted 24 March 2014; Available online 3 April 2014.

Keywords: Multi-instance learning (MIL); Image classification; Affinity propagation (AP); Multiple kernel learning (MKL); Image retrieval; Visual words; Cluster analysis; Support vector machines

Abstract

In this paper, a novel multi-instance learning (MIL) algorithm based on a multiple-kernel (MK) framework is proposed for image classification. The algorithm defines each image as a bag and the low-level visual features extracted from its segmented regions as instances. It starts by constructing a "word-space" from the instances, based on a collection of "visual-words" generated by the affinity propagation (AP) clustering method. After calculating the distances between each "visual-word" and a bag (image), a nonlinear mapping mechanism is introduced to register each bag as a coordinate point in the "word-space". In this way, the MIL problem is transformed into a standard supervised learning problem, which allows a multiple-kernel support vector machine (MKSVM) classifier to be trained for image categorization. Compared with many popular MIL algorithms, the proposed method, named MKSVM-MIL, shows satisfactory experimental results on the COREL dataset, which highlights its robustness and effectiveness for image classification applications.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Image semantic categorization is the use of image analysis and computer technology to assign an image to a pre-defined semantic category based on its dominant objects or scene type. With the increasing use of digital images for online and offline purposes, automatic image categorization methods become increasingly important [1]. In order to bridge the gap between images and semantics, one needs to extract global visual features (e.g. color, texture and shape), intermediate semantic features [2] or key-point features [3] from the images, and then apply supervised learning methods (e.g. SVM) to carry out image classification. However, two major issues should be considered in these operations: (1) because of the "semantic gap", finding an effective semantic representation model to describe an image is very important; (2) before a learning machine can perform classification, the training samples need to be accurately labeled. Unfortunately, manually collecting such training examples (and possibly further annotating, aligning, cropping, etc.) is a tedious job that is both time-consuming and error-prone [4].

One possible solution is to regard every image as a bag and the low-level visual feature vectors of its segmented regions as instances; an image is then labeled as a positive or negative bag according to whether it contains a semantic concept of interest to the user. In this way, the semantic-based image categorization problem is transformed into a multi-instance learning (MIL) problem [5]. Because MIL only requires coarse labeling at the image level rather than fine labeling at the region level, the efficiency of the labeling process can be improved significantly.

Many multi-instance learning algorithms have been studied intensively during the last decade, such as the Diverse Density (DD) algorithm [6], multi-label multi-instance learning (MLMIL) [7], and a neural-network-based algorithm [8]. It is difficult to list all existing MIL algorithms; here we mainly focus on methods based on support vector machines (SVM), which have been used successfully in many machine-learning problems. Andrews et al. [9] first modified the SVM formulation and presented the mi-SVM and MI-SVM algorithms. However, unlike the standard SVM, these lead to non-convex optimization problems that suffer from local minima. Gehler et al. [10] therefore applied deterministic annealing to the non-convex optimization problem of [9]; this method is able to find better local minima of the objective function. Gartner et al. [11] designed kernels directly on the bags so that a standard SVM can be used to solve the MIL problem. Since instance labels are unavailable, the MI kernel makes the crude assumption that all instances in a bag are equally important. Based on Gartner's work, Kwok et al. [12] designed marginalized multi-instance kernels, highlighting that the contributions of different instances should be different. Chen et al. [13] proposed the DD-SVM method, which employs Diverse Density (DD) to learn a set of instance prototypes and then maps the bags into a new feature space based on these prototypes. More recently, Chen et al. [14] devised an algorithm called Multiple-Instance Learning via Embedded Instance Selection (MILES) to solve multiple-instance problems. In addition, several multiple-instance semi-supervised learning algorithms have been presented during this decade, such as MissSVM [15], MISSL [16] and LSA-MIL [17].

Converting every bag in the MIL problem into a single representation vector and then applying a standard supervised learning method is a very effective class of MIL algorithms. However, most existing feature representation methods do not describe the bags effectively, which makes it difficult to adapt well-known supervised learning methods to MIL problems. For example, DD-SVM [13] must learn a collection of instance prototypes with the Diverse Density (DD) function to construct a new feature space, so its representation features are very sensitive to noise and incur a high computational cost. MILES [14] therefore uses all instances from the training bags, instead of the prototypes used by DD-SVM, to construct the new feature space. Although MILES is less sensitive to noise and more efficient than DD-SVM, the feature space representing the bags has very high dimensionality and contains many irrelevant features, so a 1-norm SVM is applied, since it can select important features and construct the classifier simultaneously.

Inspired by DD-SVM [13] and MILES [14], in this paper we present an affinity propagation (AP) clustering [18,19] based method to convert the MIL problem into a standard supervised learning problem, and then use the multiple kernel learning (MKL) method [20] to solve it. The main contributions are summarized as follows:

- An AP clustering-based feature representation method is proposed to convert the MIL problem into a standard supervised learning problem. To the best of our knowledge, this is the first inductive AP method for the MIL problem. Compared with other feature representation methods, AP does not require the user to specify the number of clusters, so the proposed method has stronger robustness and adaptability.
- After the MIL problem is converted into a standard supervised learning problem, we present a new MKSVM-based MIL algorithm, named MKSVM-MIL, built on the multiple-kernel support vector machine (MKSVM) for image classification. The experimental analysis shows that the proposed algorithm is robust when facing cluttered classification problems.

In [21], Sun et al. propose a visual object detection method based on multiple-kernel, multiple-instance similarity features, where the multiple kernels are used to compute a similarity feature vector between two instances, yielding the multiple-kernel similarity features (MKSF) of all images, which are then combined with a linear or Gaussian SVM to detect objects. In our MKSVM-MIL algorithm, by contrast, the multiple kernels are used to measure the similarity between two bags' representation vectors while training the bag-level MIL classifier.
Therefore, the purpose of using multiple kernels differs between the MKSVM-MIL algorithm and MKSF [21].

The rest of the paper is organized as follows: Section 2 introduces affinity propagation (AP) clustering and multiple-kernel SVM learning; Section 3 gives a detailed description of the proposed MKSVM-MIL algorithm, based on the concepts of "word-space", nonlinear mapping and the multiple-kernel support vector machine (MKSVM); the system evaluation and experimental results on the COREL data set are presented in Section 4; Section 5 concludes the paper.

2. Preliminaries

2.1. Affinity propagation (AP) clustering

AP clustering is an innovative clustering algorithm introduced by Frey and Dueck [18]. Compared with traditional clustering algorithms, AP has several advantages [19]: (a) it determines the number of clusters automatically, according to the similarities between the data; (b) it does not require the initial cluster centers to be specified; (c) it has a low error rate and requires relatively little computation time. In this paper, AP clustering is used to group the instance set InstSet (defined in Section 3.1) into clusters and to take each cluster centroid as a "visual-word".

The mathematical model of the AP approach can be briefly described as follows [18,19]. Given a dataset of N data points, let x_i and x_j be two objects in it. The similarity s(i, j) indicates how well x_j is suited to be the exemplar for x_i, and can be initialized as s(i, j) = 1/||x_i - x_j||^2, i ≠ j. If there is no heuristic knowledge, the self-similarities (called the preference in [18]) are set as:

s(n, n) = \frac{1}{N-1} \sum_{i=1,\, i \ne n}^{N} s(i, n), \qquad n = 1, 2, \ldots, N \qquad (1)

The AP approach computes two kinds of messages exchanged between data points. The first, the "responsibility" r(i, j), is sent from data point i to candidate exemplar point j and reflects the accumulated evidence for how well-suited point j is to serve as the exemplar for point i. The second, the "availability" a(i, j), is sent from candidate exemplar point j to point i and reflects the accumulated evidence for how appropriate it would be for point i to choose point j as its exemplar. At the beginning, the availabilities are initialized to zero, a(i, j) = 0. The update equations for r(i, j) and a(i, j) can be written as:

r(i, j) = s(i, j) - \max_{j' \ne j} \{ a(i, j') + s(i, j') \} \qquad (2)

a(i, j) = \begin{cases} \min\{ 0,\; r(j, j) + \sum_{i' \notin \{i, j\}} \max\{0, r(i', j)\} \}, & i \ne j \\ \sum_{i' \ne j} \max\{0, r(i', j)\}, & i = j \end{cases} \qquad (3)

In addition, during each exchange of messages between data points, a damping factor λ ∈ [0, 1] is added to avoid numerical oscillations that may arise in some circumstances:

R_{t+1} = (1 - \lambda) R_t + \lambda R_{t-1}, \qquad A_{t+1} = (1 - \lambda) A_t + \lambda A_{t-1} \qquad (4)

where R = [r(i, j)] and A = [a(i, j)] denote the responsibility matrix and the availability matrix, respectively, and t indicates the iteration index. The two messages are updated iteratively until they reach some specified values or until the local decisions stay constant over a number of iterations. At this stage, availabilities and responsibilities are combined to identify the exemplars:

c_i \leftarrow \arg\max_{1 \le j \le N} \left[ r(i, j) + a(i, j) \right] \qquad (5)
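To make the update rules concrete, the following is a minimal NumPy sketch of the message passing in Eqs. (2)-(5) (in practice one could equally call scikit-learn's AffinityPropagation). It uses the common negative squared Euclidean distance as the similarity, as in [18], and sets each self-similarity to the mean of the remaining similarities in the spirit of Eq. (1); the function name and the fixed iteration count are illustrative choices, not part of the original description.

```python
import numpy as np

def ap_visual_words(X, damping=0.5, max_iter=200):
    """Cluster the rows of X (instances) with affinity propagation and return the
    exemplar indices (the cluster centroids used as "visual-words" in Section 3.1)."""
    N = X.shape[0]
    # Similarity: negative squared Euclidean distance (Frey & Dueck [18]).
    S = -((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Preference: mean of the off-diagonal similarities, in the spirit of Eq. (1).
    off = ~np.eye(N, dtype=bool)
    np.fill_diagonal(S, S[off].reshape(N, N - 1).mean(axis=1))

    R = np.zeros((N, N))   # responsibilities r(i, j)
    A = np.zeros((N, N))   # availabilities a(i, j), initialized to zero
    rows = np.arange(N)
    for _ in range(max_iter):
        # Responsibility update, Eq. (2): r(i,j) = s(i,j) - max_{j'!=j} {a(i,j') + s(i,j')}
        AS = A + S
        jmax = AS.argmax(axis=1)
        first = AS[rows, jmax]
        AS[rows, jmax] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, jmax] = S[rows, jmax] - second
        R = damping * R + (1 - damping) * R_new      # damping as in Eq. (4)

        # Availability update, Eq. (3)
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))             # keep r(j, j) on the diagonal
        A_new = Rp.sum(axis=0)[None, :] - Rp         # r(j,j) + sum over i' not in {i,j} of max{0, r(i',j)}
        diag = np.diag(A_new).copy()                 # a(j,j) = sum over i' != j of max{0, r(i',j)}
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new      # damping as in Eq. (4)

    exemplar_of = (R + A).argmax(axis=1)             # Eq. (5)
    return np.unique(exemplar_of), exemplar_of

# Usage: words_idx, assignment = ap_visual_words(instances); instances[words_idx]
# gives the "visual-word" vectors.
```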



2.2. Multiple kernel learning

During the last few years, kernel methods such as support vector machines (SVM) have proved to be efficient tools for solving learning problems like classification or regression [20]. Let {(x_i, y_i)}, i = 1, ..., N, be the training set, where x_i belongs to some input space X and y_i is the label of pattern x_i. For kernel algorithms, the solution of the learning problem is of the form:

f(x) = \sum_{i=1}^{N} \alpha_i^{*} K(x, x_i) + b^{*} \qquad (6)

where α_i* and b* are coefficients to be learned from the training samples {(x_i, y_i)}, i = 1, ..., N, and K(x, x_i) is a given positive definite kernel associated with a reproducing kernel Hilbert space.
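As a quick sanity check of the form in Eq. (6), the decision function of any trained kernel SVM can be evaluated manually from its support vectors. The snippet below does this with scikit-learn; the toy data and parameter values are arbitrary, and note that scikit-learn stores the products y_i·α_i* of the support vectors rather than α_i* itself.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.where(rng.normal(size=40) > 0, 1, -1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)

# Eq. (6): f(x) = sum_i alpha_i* K(x, x_i) + b*, summed over the support vectors.
x_new = rng.normal(size=(1, 5))
f_manual = clf.dual_coef_ @ rbf_kernel(clf.support_vectors_, x_new, gamma=gamma) + clf.intercept_
print(np.allclose(f_manual.ravel(), clf.decision_function(x_new)))  # True
```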

In some situations, a machine-learning practitioner may be interested in more flexible models. Recent applications have shown that using multiple kernels instead of a single one can enhance the interpretability of the decision function and improve performance [20]. In such cases, a convenient approach is to consider the kernel K(x, x') as a convex combination of basis kernels [20]:

K(x, x') = \sum_{m=1}^{M} d_m K_m(x, x'), \qquad \text{with } d_m \ge 0, \;\; \sum_{m=1}^{M} d_m = 1 \qquad (7)

where M is the total number of kernels. Each basis kernel K_m(x, x_i) may either use the full set of variables describing x or subsets of variables stemming from different data sources. Alternatively, the kernels K_m(x, x_i) can simply be classical kernels (such as Gaussian kernels) with different parameters. Within this framework, the problem of data representation through the kernel is transferred to the choice of the weights d_m. Learning both the coefficients α_i and the weights d_m in a single optimization problem is known as the multiple kernel learning (MKL) problem. More details can be found in [20].

3. MKSVM-based MIL method

In the proposed approach, all images are first segmented into several regions, and low-level visual features are extracted from those regions by incorporating color, texture and shape properties. In the MIL framework for image classification, a bag corresponds to an image and an instance corresponds to the low-level visual features of a segmented region.
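As a concrete (purely illustrative) data layout for this bag/instance correspondence, each image can be stored as a label together with a matrix whose rows are the region feature vectors; the class and field names below are assumptions, not part of the original method.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Bag:
    """One image: its bag label and the d-dimensional features of its segmented regions."""
    label: int               # +1 (positive bag) or -1 (negative bag)
    instances: np.ndarray    # shape (n_i, d), one row per segmented region

# A toy training set D = {(B_1, y_1), ..., (B_N, y_N)} with placeholder features
# (d = 9 region features are used for the COREL data in Section 4.1).
rng = np.random.default_rng(0)
D: List[Bag] = [Bag(label=+1, instances=rng.random((5, 9))),
                Bag(label=-1, instances=rng.random((4, 9)))]
```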

3.1. Constructing word-space

Let D = {(B_1, y_1), (B_2, y_2), ..., (B_N, y_N)} denote the training set consisting of N bags, where y_i ∈ {-1, +1} is the label of bag B_i and B_i = {x_i1, x_i2, ..., x_{i n_i}} is a collection of n_i instances; each instance x_ij ∈ R^d is a d-dimensional feature vector. The aim of MIL is to learn a classification function that can accurately predict the label of any unseen bag. For convenience [4], we line up all the instances in the training bags and re-index them as:

\mathrm{InstSet} = \{ x_i \mid i = 1, \ldots, T \} \qquad (8)

where T = \sum_{i=1}^{N} n_i is the total number of instances in all the training bags. When image regions have similar visual characteristics, their low-level visual feature vectors should group together into the same cluster in the instance feature space. In this research, we deploy the affinity propagation (AP) clustering method [18] to group all instances into a large number of clusters of similar visual characteristics. We regard the centroid of each cluster as a "visual-word", denoted w_t, which represents a specific class of image regions sharing the same high-level concept. The set of all "visual-words" is called the "word-space", denoted X = {w_1, w_2, ..., w_M}, where M is the total number of "visual-words".

3.2. Computing mapping feature

Let X = {w_1, w_2, ..., w_M} be the "word-space" given by the AP clustering method, where w_t is the t-th "visual-word" and M is the total number of "visual-words". First, we define the minimum and maximum Euclidean distances between a "visual-word" w_t and a bag B_i = {x_ij | j = 1, 2, ..., n_i} as:

D_{\min}(w_t, B_i) = \min_{j=1,\ldots,n_i} \| x_{ij} - w_t \|^2, \qquad D_{\max}(w_t, B_i) = \max_{j=1,\ldots,n_i} \| x_{ij} - w_t \|^2 \qquad (9)

Then the mapping feature φ(B_i) of the bag B_i is defined as:

\phi(B_i) = [\, s(w_1, B_i), s(w_2, B_i), \ldots, s(w_M, B_i) \,], \qquad s(w_t, B_i) = [\, \exp(-D_{\min}(w_t, B_i)/\delta^2), \; \exp(-D_{\max}(w_t, B_i)/\delta^2) \,], \quad t = 1, 2, \ldots, M \qquad (10)

Here δ² is a predefined scaling factor, obtained by two-fold cross-validation in the image classification experiments. s(w_t, B_i) consists of two values, exp(-D_min(w_t, B_i)/δ²) and exp(-D_max(w_t, B_i)/δ²), which can be interpreted as the maximum and minimum likelihoods, respectively, that the bag B_i contains the "visual-word" w_t. By Eq. (10), each bag is transformed into a 2M-dimensional mapping feature in the extended "word-space"; this is equivalent to embedding the bag into the extended "word-space" as a single point (sample). If a bag is positive, the corresponding sample is labeled +1, otherwise -1, which transforms the MIL problem into a standard supervised learning problem that can be solved directly with an MKSVM.

It should be noted that in Eq. (10) we use both the maximum and the minimum likelihood as mapping features for bag B_i. The reason is that, in image classification applications, both play equally important roles, so using them simultaneously can improve classification accuracy. A similar effect was studied empirically for the DD-SVM method [13]: including instance prototypes from negative bags improved classification accuracy by an average of 2.2% in the 10-class image categorization experiment [13].
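A minimal sketch of Eqs. (9) and (10), assuming the instances of one bag are stacked in an n_i × d NumPy array and the M "visual-words" in an M × d array; the function and variable names are illustrative.

```python
import numpy as np

def bag_mapping(instances, words, delta2):
    """Map one bag to the 2M-dimensional feature phi(B) of Eq. (10),
    given the M visual words and the scaling factor delta^2."""
    # Squared Euclidean distances between every instance and every visual word: (n_i, M)
    d2 = ((instances[:, None, :] - words[None, :, :]) ** 2).sum(axis=-1)
    d_min, d_max = d2.min(axis=0), d2.max(axis=0)                   # Eq. (9)
    # s(w_t, B) = [exp(-Dmin/delta^2), exp(-Dmax/delta^2)], Eq. (10), kept as consecutive pairs
    return np.stack([np.exp(-d_min / delta2), np.exp(-d_max / delta2)], axis=1).ravel()

# Phi = np.vstack([bag_mapping(B.instances, words, 11.0) for B in D]) converts the whole
# training set into an (N, 2M) matrix of single-sample representations
# (delta^2 = 11 is the value selected in Section 4.2).
```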

3.3. MKSVM-MIL algorithm

By Eq. (10), every bag (image) is mapped to a point in the extended "word-space", and the bag-level multiple-kernel SVM (MKSVM) classifier for image classification is trained in this space. The maximum-margin formulation of the MIL problem in the extended "word-space" is given by the following quadratic optimization problem [13]:

\max_{\alpha, d} \; L(\alpha, d) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} y_i y_j \alpha_i \alpha_j \sum_{m=1}^{M} d_m K_m\bigl(\phi(B_i), \phi(B_j)\bigr)

\text{s.t.} \quad \sum_{i=1}^{N} y_i \alpha_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1, 2, \ldots, N, \qquad d_m \ge 0, \quad \sum_{m=1}^{M} d_m = 1 \qquad (11)

where M is the total number of kernels, K_m(·, ·) and d_m denote the m-th basis kernel function and its weight, respectively, and C > 0 is the penalty parameter on the error samples, which controls the trade-off between accuracy and regularization. The bag-level MKSVM classifier is then defined by (α*, d_m*) as

\mathrm{label}(B) = \mathrm{sign}\Bigl( \sum_{i=1}^{N} y_i \alpha_i^{*} \sum_{m=1}^{M} d_m^{*} K_m\bigl(\phi(B), \phi(B_i)\bigr) + b^{*} \Bigr) \qquad (12)

where b* is chosen so that

y_j \Bigl( \sum_{i=1}^{N} y_i \alpha_i^{*} \sum_{m=1}^{M} d_m^{*} K_m\bigl(\phi(B_i), \phi(B_j)\bigr) + b^{*} \Bigr) - 1 = 0 \qquad (13)

for any α_j with C > α_j > 0. Finally, the detailed steps of the MKSVM-MIL algorithm can be summarized as follows.

Algorithm 1. MKSVM-MIL

1. MKSVM-MIL training
Input: a set of labeled training bags D.
Output: the word-space X and the MKSVM classifier (d_m*, α*, b*).
Initialize: set S = ∅.
Step 1: Line up all instances in the training bags D, denoted InstSet. Cluster InstSet into several groups using the AP clustering method, and regard every cluster center as a "visual-word" to construct the word-space X = {w_1, w_2, ..., w_M}, where M is the number of clusters automatically determined by the AP method.
Step 2: For each B_i ∈ D, calculate the mapping feature φ(B_i) by Eq. (10) and add (φ(B_i), y_i) to S, where y_i is the label of B_i.
Step 3: Using the SimpleMKL software toolkit, train a bag-level MKSVM classifier (d_m*, α*, b*) on S for image classification.

2. MKSVM-MIL prediction
Let B be an unlabeled bag (image). Calculate its mapping feature φ(B) in the extended word-space X by Eq. (10), and then use the MKSVM classifier (d_m*, α*, b*) to predict its label.
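The algorithm above relies on the SimpleMKL toolkit, which learns the kernel weights d_m jointly with the SVM coefficients (Eq. (11)). As a simplified, hedged stand-in, the sketch below builds the Gaussian/polynomial kernel bank described in Section 4.2, combines it with fixed uniform weights as in Eq. (7), and trains an ordinary SVM on the precomputed combination; the function names, the mapping of the Gaussian "widths" to scikit-learn's gamma parameter, and the uniform weights are assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def kernel_bank(X, Z):
    """Gaussian kernels with 10 assumed widths plus polynomial kernels of degree 1-3 (Section 4.2)."""
    gammas = [2.0 ** p for p in range(-3, 7)]                     # 2^-3 ... 2^6
    Ks = [rbf_kernel(X, Z, gamma=g) for g in gammas]
    Ks += [polynomial_kernel(X, Z, degree=d) for d in (1, 2, 3)]
    return Ks

def combine(Ks, d):
    """K = sum_m d_m K_m with d_m >= 0 and sum_m d_m = 1, cf. Eq. (7)."""
    return sum(w * K for w, K in zip(d, Ks))

def train_bag_classifier(Phi, y, C=10.0):
    """Phi: (N, 2M) mapping features from Eq. (10); y: bag labels in {-1, +1}."""
    Ks = kernel_bank(Phi, Phi)
    d = np.full(len(Ks), 1.0 / len(Ks))   # fixed uniform weights; SimpleMKL would learn these
    clf = SVC(C=C, kernel="precomputed").fit(combine(Ks, d), y)
    return clf, d

def predict_bags(clf, d, Phi_test, Phi_train):
    """Eq. (12): sign of the combined-kernel decision function for unseen bags."""
    return clf.predict(combine(kernel_bank(Phi_test, Phi_train), d))
```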

4. Experiments and analysis

Table 1. The confusion matrix of MKSVM-MIL on Corel 1 k (%).

        Cat.0  Cat.1  Cat.2  Cat.3  Cat.4  Cat.5  Cat.6  Cat.7  Cat.8  Cat.9
Cat.0   76.1   3.0    3.0    1.3    0.0    6.2    1.2    0.9    2.1    6.2
Cat.1   2.2    60.8   1.8    3.2    0.0    2.0    0.8    0.0    29.2   0.0
Cat.2   4.3    9.0    67.5   5.0    0.0    3.6    0.6    0.0    8.4    1.6
Cat.3   2.1    1.1    3.3    89.0   0.0    0.2    0.0    0.1    1.2    3.0
Cat.4   0.2    0.0    0.0    0.2    99.0   0.2    0.0    0.0    0.0    0.4
Cat.5   3.2    2.1    3.2    0.0    0.1    87.1   0.0    0.2    4.1    0.0
Cat.6   0.2    0.0    1.2    0.4    0.6    0.0    95.2   0.4    0.6    1.4
Cat.7   1.6    0.3    0.5    0.5    0.3    2.0    0.8    92.6   1.1    0.3
Cat.8   0.0    13.8   1.6    4.4    0.0    0.0    0.8    0.0    78.2   1.2
Cat.9   5.0    2.6    0.0    0.4    0.0    1.6    0.7    0.3    0.0    89.4

The numbers listed are the average percentages over 10 repeated experiments; the diagonal entries give the classification accuracy of each category.

4.1. Image data set

To evaluate our method on the image categorization problem, we applied it to the COREL data set, a widely used standard benchmark for image retrieval. The data set consists of 2000 images in JPEG format of size 256 × 384 or 384 × 256. There are twenty categories altogether, each containing 100 images. The twenty categories (labeled from 0 to 19) are: African people and villages, beaches, buildings, buses, dinosaurs, elephants, flowers, horses, mountains and glaciers, food, dogs, lizards, fashion models, sunset scenes, cars, waterfalls, antique furniture, battle ships, skiing, and desserts. Since we compare MKSVM-MIL with the MILES [14] and DD-SVM [13] approaches, we adopt the same image segmentation and feature extraction algorithms as described in [13] and [14]. A brief summary of the imagery features is given as follows. To segment an image, the method first partitions the image into non-overlapping blocks of 4 × 4 pixels, and a 6-dimensional feature vector is extracted for each block. Three of the components are the average LUV color values in the block. The other three represent the square root of the energy in the high-frequency bands of the wavelet transform, i.e., the square root of the second-order moment of the wavelet coefficients in the high-frequency bands. Because the wavelet coefficients in different frequency bands show variations in different directions, they can capture the texture properties of a block. A modified k-means algorithm is then applied to group the feature vectors into clusters, each of which corresponds to a region in the segmented image (in this data set, each image is pre-segmented into about 5 patches). After segmentation, three extra features are computed for each region to describe its shape properties: the normalized inertia of order 1, 2 and 3. As a result, each region is characterized by a 9-dimensional feature vector describing the color, texture and shape properties of the region. More detailed descriptions can be found in [13,14]. The original data set is separated into two subsets: the first (Corel 1 k) contains the first ten categories, while the second (Corel 2 k) uses all twenty categories. Each experiment is repeated ten times with ten random sample selections, and the average accuracy as well as the 95% confidence interval is reported.

4.2. Experimental setup

In the MKSVM-MIL algorithm, we used the SimpleMKL [20] software toolkit, downloaded from http://asi.insa-rouen.fr/enseignants/~arakoto/code/mklindex.html, for training the MKSVM classifier. The multiple kernels are composed of Gaussian kernels with 10 different widths ({2^{-3}, 2^{-2}, ..., 2^{6}}) and polynomial kernels of degree 1-3. A one-against-rest strategy was employed for these multi-class tasks. During each trial, 50 positive images are randomly selected from one category and 50 negative images are randomly selected from the other categories to form the training set, and all the remaining images form the test set. The final predicted class label is decided by the winner among all MKSVM classifiers. In Eq. (10), one important parameter, the scaling factor δ², must be predefined. Similar to [14], we chose δ² from 5 to 15 with step size 1 and found that δ² = 11 gave the minimum two-fold cross-validation error. Therefore, we fixed δ² = 11 in all subsequent experiments.

4.3. Categorization accuracy

The confusion matrix of the MKSVM-MIL method on Corel 1 k (Category 0 to Category 9) is reported in Table 1, where each row lists the average percentage of images in a specific category classified into each of the 10 categories; the numbers on the diagonal therefore show the classification accuracy of each category, and off-diagonal entries indicate classification errors. Table 1 reveals that MKSVM-MIL works well on most categories. The largest errors come from Category 1 (Beaches) and Category 8 (Mountains and glaciers): 29.2% of Beach images were misclassified as Mountains and glaciers, while 13.8% of Mountains and glaciers images were misclassified as Beach. This phenomenon also appeared in DD-SVM [13] and MILES [14]. These errors are due to the fact that many images of these two categories contain semantically related and visually similar regions, such as sky, mountain, river, lake and ocean. Fig. 1 shows some mislabeled images from these two categories.

Fig. 1. Selected images of misclassification between Beaches and Mountains (Beach.122, Beach.136, Beach.184; Mountains.833, Mountains.858, Mountains.878).

4.4. Overall categorization results

To verify the effectiveness of MKL for the MIL problem, we also combined the mapping feature φ(B_i) with single Gaussian and polynomial kernel SVMs. Here, SKSVM-MIL-G and SKSVM-MIL-P denote the MIL algorithms obtained when the mapping feature is combined with a Gaussian and a polynomial kernel SVM, respectively. To predefine the SVM training parameters, in all subsequent experiments a 2-fold cross-validation is conducted on the training set to find the best parameters. The overall classification accuracy of MKSVM-MIL on Corel 1 k and Corel 2 k, compared with other existing MIL algorithms, is reported in Table 2, including MILES [14], DD-SVM [13], MI-SVM [9] and MissSVM [15]. The numbers listed are the average classification accuracies over 10 repeated experiments with the corresponding 95% confidence intervals. As seen in Table 2, MKSVM-MIL is competitive with the compared algorithms.

Table 2. Comparison of experimental results (%).

Method         Corel 1 k     Corel 2 k
MKSVM-MIL      85.2 ± 1.1    71.3 ± 1.2
SKSVM-MIL-G    83.7 ± 1.8    70.5 ± 1.3
SKSVM-MIL-P    81.8 ± 2.1    68.1 ± 1.3
DD-SVM [13]    81.5 ± 3.0    67.5 ± 0.8
MI-SVM [9]     74.7 ± 1.6    54.6 ± 1.5
MILES [14]     82.6 ± 1.2    68.7 ± 1.4
MissSVM [15]   78.0 ± 2.2    65.2 ± 3.1
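The comparison above follows the one-against-rest protocol of Section 4.2: one MKSVM is trained per category and the category with the largest decision value wins. A rough sketch of that decision rule is given below; it takes the hypothetical kernel_bank and combine helpers from the earlier sketch as arguments, so it remains an illustration rather than the paper's exact implementation.

```python
import numpy as np

def one_vs_rest_predict(classifiers, weights, Phi_trains, Phi_test, kernel_bank, combine):
    """classifiers[c] is the binary MKSVM trained with category c as positive;
    weights[c] are its kernel weights and Phi_trains[c] its training representation."""
    scores = np.column_stack([
        clf.decision_function(combine(kernel_bank(Phi_test, Phi_tr), d))
        for clf, d, Phi_tr in zip(classifiers, weights, Phi_trains)
    ])
    return scores.argmax(axis=1)   # predicted category index for each test image
```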

4.5. MKSVM-MIL analysis

Based on the test results in Table 2, MKSVM-MIL achieves the best performance among the compared MIL classification algorithms. The reason is that each "visual-word" extracted by the AP method represents a group of image regions with similar visual characteristics, which share an explicit high-level semantic concept; the mapping feature therefore represents the likelihood that each bag contains these high-level semantic concepts. For example, beach images usually contain many sky, water and sand regions. In the instance feature space, the feature vectors of those regions gather into several clusters, and "visual-words" representing sky, water and sand are extracted owing to their strong capability of discriminating between positive and negative bags. Thus, a beach image is located along the projection axes of the sky, water and sand concepts. Different types of scene images therefore fall into different parts of the word-space, which underlies the high performance of the MKSVM classifier.

The results in Table 2 also make it clear that the multiple-kernel SVM is superior to the single-kernel SVMs: when measuring the similarity between samples, multiple kernels are more flexible and effective than a single kernel. To further illustrate the performance of the MKSVM-MIL method, we used the images from the Elephants and Mountains categories and plotted the ROC curves of three different MIL algorithms in Fig. 2(A) and (B), respectively. As shown in Fig. 2, MKSVM-MIL consistently outperforms the other two algorithms. This experiment indicates that using the multiple-kernel SVM to train the classifiers can significantly improve image categorization accuracy.

Fig. 2. Comparison of the ROC curves of three different MIL algorithms: (A) Elephants category images; (B) Mountains category images.

5. Conclusions

In this paper, we have proposed a new multi-instance learning algorithm for image classification, named MKSVM-MIL, which is based on the AP method and the MKSVM. In order to transform each bag into a single sample and convert the MIL problem into a supervised learning problem, we presented a method that extracts "visual-words" to construct a "word-space" and then defines a nonlinear mapping function that maps each bag to a point in this "word-space". The experimental results on the COREL data set show that the proposed MIL algorithm is comparable to other state-of-the-art MIL algorithms.


Acknowledgments

The authors acknowledge the support of the National Natural Science Youth Foundation (Grant Nos. 61202183, 61102095), the Natural Science Foundation Research Project of Shaanxi Province (Grant Nos. 2013JM8031, 2012JM8022), a Postdoctoral Science Foundation funded project (Grant No. 2013M542386) and the Natural Science Foundation Research Project of the Shaanxi Province Department of Education (Grant Nos. 12JK0734, 12JK0504, 12JK0731, 12JK0543), China.

References

[1] Chunjie Zhang, Shuhui Wang, Qingming Huang, et al., Laplacian affine sparse coding with tilt and orientation consistency for image classification, J. Vis. Commun. Image Represent. 24 (7) (2013) 786–793.
[2] Li-Jia Li, R. Socher, Fei-Fei Li, Towards total scene understanding: classification, annotation and segmentation in an automatic framework, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), IEEE Press, Miami, FL, USA, 2009, pp. 2036–2043.
[3] Wu-Jun Li, Dit-Yan Yeung, Localized content-based image retrieval through evidence region identification, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE Press, Miami, FL, USA, 2009, pp. 1666–1673.
[4] Da-xiang Li, Jin-ye Peng, Zhan Li, Qirong Bu, LSA based multi-instance learning algorithm for image retrieval, Signal Process. 91 (8) (2011) 1993–2000.
[5] T.G. Dietterich, R.H. Lathrop, T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artif. Intell. 89 (1997) 31–71.
[6] O. Maron, A.L. Ratan, Multiple-instance learning for natural scene classification, in: Proceedings of the 15th International Conference on Machine Learning, Madison, WI, 1998, pp. 341–349.
[7] Z.-H. Zhou, M.-L. Zhang, Multi-instance multi-label learning with application to scene classification, Adv. Neural Inf. Process. Syst. 19 (2007) 1609–1616.
[8] Min-Ling Zhang, Zhi-Hua Zhou, A multi-instance regression algorithm based on neural network, J. Software 14 (7) (2003) 1238–1242.


[9] S. Andrews, T. Hofmann, I. Tsochantaridis, Multiple instance learning with generalized support vector machines, in: Proceedings of the 18th National Conference on Artificial Intelligence, Edmonton, Canada, 2002, pp. 943–944.
[10] P.V. Gehler, O. Chapelle, Deterministic annealing for multiple-instance learning, in: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics (AISTATS 2007), San Juan, Puerto Rico, 2007, pp. 123–130.
[11] T. Gartner, P.A. Flach, A. Kowalczyk, A.J. Smola, Multi-instance kernels, in: Proceedings of the 19th International Conference on Machine Learning, Sydney, Australia, 2002, pp. 179–186.
[12] J.T. Kwok, P.-M. Cheung, Marginalized multi-instance kernels, in: Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, 2007, pp. 901–906.
[13] Y.X. Chen, J.Z. Wang, Image categorization by learning and reasoning with regions, J. Mach. Learn. Res. 5 (2004) 913–939.
[14] Yixin Chen, Jinbo Bi, James Z. Wang, MILES: multiple-instance learning via embedded instance selection, IEEE Trans. Pattern Anal. Mach. Intell. 28 (2006) 1931–1947.
[15] Z.-H. Zhou, J.-M. Xu, On the relation between multi-instance learning and semi-supervised learning, in: Proceedings of the 24th International Conference on Machine Learning, Corvallis, Oregon, 2007, pp. 1167–1174.
[16] Rouhollah Rahmani, Sally A. Goldman, MISSL: multiple-instance semi-supervised learning, in: Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, Pennsylvania, 2006, pp. 705–712.
[17] Daxiang Li, Xiaoqiang Zhao, Ying Liu, et al., Infrared face recognition method by integration of SIFT and MIL, J. Xi'an Univ. Posts Telecommun. 17 (4) (2012) 15–20.
[18] B.J. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (2007) 972–976.
[19] Renchu Guan, Xiaohu Shi, Maurizio Marchese, et al., Text clustering with seeds affinity propagation, IEEE Trans. Knowl. Data Eng. 23 (2011) 627–637.
[20] Alain Rakotomamonjy, Francis R. Bach, Stéphane Canu, Yves Grandvalet, SimpleMKL, J. Mach. Learn. Res. 9 (11) (2008) 2491–2521.
[21] Chensheng Sun, Kin-Man Lam, Multiple-kernel, multiple-instance similarity features for efficient visual object detection, IEEE Trans. Image Process. 22 (8) (2013) 3050–3061.