Dissimilarity-based multi-instance learning using dictionary learning and sparse coding ensembles

Nazanin Moarref, Yusuf Yaslan

Istanbul Technical University, Faculty of Computer and Informatics Engineering, Sariyer 34469, Istanbul, Turkey
Article history: Received 18 January 2019; Revised 26 September 2019; Accepted 26 September 2019
Keywords: Dictionary learning; Sparse coding; Multi-instance learning; Random subspace; Bagging
Abstract

In multi-instance learning problems, samples are represented by multisets called bags, and each bag includes a set of feature vectors called instances. This distinguishes multi-instance learning from classical supervised learning. In this paper, to convert a multi-instance learning problem into a supervised learning problem, fixed-size feature vectors of bags are computed using a dissimilarity-based method. Then, dictionary learning based bagging and random subspace ensemble classification models are proposed to exploit the underlying discriminative structure of the dissimilarity-based features. Experimental results are obtained on 11 datasets from different multi-instance learning problem domains. It is shown that the proposed random subspace based dictionary ensemble algorithm gives the best results on 8 datasets in terms of classification accuracy and area under the curve.
1. Introduction

The multi-instance learning (MIL) problem was proposed in [1] and has recently been applied to many real-world problems such as drug activity prediction, content-based image retrieval and classification, and text/document classification [2]. In a MIL problem, classifiers deal with sets of instances that are grouped into bags. In a traditional supervised classification model, each example is a fixed-size feature vector, whereas in a MIL problem each example (bag) consists of multiple feature vectors [3]. The standard assumption is that the label of a bag is positive if and only if it contains at least one positive instance; otherwise, the bag is labeled as negative. Classifiers use bags (for bag-based or embedded-based methods) or instances (for instance-based methods) and their labels to train models [4]. In instance-based methods, classifiers use instances to predict the labels of the bags; classifiers such as Bayesian approaches, neural networks, decision trees, random forests, and SVMs have been used in this setting [5].

In addition to the classification model, the feature representation is also important for the performance of any classification problem. Sparse coding and dictionary learning methods have recently attracted researchers' interest by representing each sample as sparsely as possible, i.e., as a linear combination of basic elements called atoms. Many research fields have benefited from sparse coding and dictionary learning techniques, such as signal denoising, image processing [6], feature extraction [7], supervised learning [8], unsupervised clustering [9], and semi-supervised dictionary learning [10]. Dictionary learning has previously been used for MIL problems in [11], by generating class-specific, bag-level diverse
classifiers. The impressive performance of dictionary learning showed that sparse representations are naturally discriminative [11]. Among the many methods that convert a MIL problem into a supervised learning problem [4], the dissimilarity-based method applied in [12] has been shown to be a successful approach. However, SVM does not appear to be sufficient to discriminate bags in dissimilarity feature spaces. As a contribution, in this paper we build on these two related works and apply a discriminative dictionary learning framework that uses the dissimilarity values to learn a structured dictionary, and ensembles of these dictionaries are proposed as MIL classifiers.

During the last decade, ensemble methods have been applied to many classification problems. Making decisions according to more than one classifier yields more reliable decisions and increases classification performance. Random subspace and bagging are among the most successful ensemble models, and extensive research has been done in these fields [13]. We therefore propose two methods to train ensemble classifiers for MIL problems. The first utilizes the random subspace algorithm to select instances as prototypes, and the dissimilarities of each bag to the selected prototypes are used as feature values. In the second, all instances are used as prototypes and the dissimilarities of each bag to all prototypes are taken into consideration; bagging is then applied to the bags represented by dissimilarity-based feature vectors.

The remainder of the paper is organized as follows. In Section 2, the dissimilarity-based MIL problem is explained, followed by its combination with the ensemble-based random subspace technique [12], one of the recently successful approaches in MIL; dictionary learning and sparse coding are described afterward. In Section 3, the proposed method is explained in detail. In Section 4, the experimental results are presented, and in Section 5, we conclude the article by emphasizing the main findings.

2. Methodology

2.1. Dissimilarity-based multi-instance learning

In a MIL problem, a bag is a set $B_j = \{x_{ij}, y_j \mid i = 1, 2, \ldots, N_j\}$, where the $x_{ij} \in \mathbb{R}^d$ are instances (feature vectors), $y_j$ is the bag label, and $N_j$ is the number of instances in the $j$th bag, which can vary from bag to bag. All instances belong to the $d$-dimensional feature space called the instance space. The goal is to learn a model that can predict the labels of unseen bags.

Unlike most existing work on MIL, in this paper a bag is not represented by its instances; instead, bags are represented by their dissimilarities to reference instances called prototypes. In [14] the prototypes are selected by clustering. However, clustering the instances risks excluding informative instances, and the obtained feature space may not carry sufficient information about the problem. The dissimilarity-based MIL method was proposed in [12]. The aim is to convert the MIL problem into a supervised learning problem by computing the dissimilarities between bags and preselected prototypes. Let $d(\cdot)$ be the dissimilarity function, $T = \{B_i, y_i \mid i = 1, 2, \ldots, M\}$ the set of training bags, $r_T = \{x_{ij} \mid i = 1, 2, \ldots, N_j,\ j = 1, 2, \ldots, M\}$ the set of all instances in the training set, and $r = \{r_1, r_2, \ldots, r_z\}$ a set of $z$ instances randomly selected from $r_T$ as prototypes. Using the dissimilarity values, each bag $B_i$ in the training set is represented by a vector $v_{\text{instance}} = [d(B_i, r_1), d(B_i, r_2), \ldots, d(B_i, r_z)] \in \mathbb{R}^z$, and the MIL problem is thus converted into a supervised learning problem. Several dissimilarity metrics were evaluated in [5], and the experimental results show that the Euclidean distance is a reasonable choice for many datasets; it is therefore used here as the dissimilarity function $d(\cdot)$. The instance-based dissimilarity between bags and prototypes can be written as follows:
$$v_{\text{instance}} = [\,d(B_i, r_1),\, d(B_i, r_2),\, \ldots,\, d(B_i, r_z)\,] \qquad (1)$$

where, for a prototype instance $r_p$, the dissimilarity of $B_i$ to $r_p$ is

$$d(B_i, r_p) = \min_{l \in \{1, 2, \ldots, N_i\}} d(x_{li}, r_p) \qquad (2)$$
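For concreteness, the following is a minimal numpy sketch of the bag representation in Eqs. (1) and (2). The array shapes and the function name are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def bag_dissimilarity(bag, prototypes):
    """Eqs. (1)-(2): represent a bag by its minimum Euclidean distance
    to each prototype instance (illustrative sketch).

    bag        : (N_i, d) array of the instances in one bag
    prototypes : (z, d) array of prototype instances r_1, ..., r_z
    returns    : (z,) dissimilarity feature vector v_instance
    """
    # pairwise Euclidean distances between bag instances and prototypes
    diff = bag[:, None, :] - prototypes[None, :, :]   # shape (N_i, z, d)
    dists = np.sqrt((diff ** 2).sum(axis=-1))         # shape (N_i, z)
    return dists.min(axis=0)                          # min over instances, Eq. (2)
```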
After representing the bags by dissimilarity feature vectors, SVM is used for classification, and it has been shown that this method, Dissimilarity-based Random Subspace with SVM (DRSSVM), outperforms counterparts such as MILES [4], MinimaxSVM [15], MILBoost [16], MI-SVM [17], and EM-DD [18]. Following this approach, in this paper the Dissimilarity-based Random Subspace (DRS) algorithm is combined with the dictionary learning method (DL). We describe these algorithms in more detail in the next section.

2.2. Dictionary learning and sparse coding

In this section, we give details about general dictionary learning and sparse coding methods. Consider an input sample $y \in \mathbb{R}^n$. The dictionary is a matrix of normalized basis vectors (atoms) $d_i$ with $d_i^T d_i = 1$, written $D = [d_1, d_2, \ldots, d_k]$, where $D \in \mathbb{R}^{n \times k}$. The number of atoms is usually greater than the signal dimension ($k > n$); in this case, the dictionary is called over-complete. The coefficient vector $\alpha \in \mathbb{R}^k$ is called the sparse code. Using the dictionary, the input signal can be represented as a linear combination of atoms, formulated as:
$$\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad y = D\alpha \qquad (3)$$
$\|\alpha\|_0$ is the $L_0$ norm of the coefficient vector $\alpha$, i.e., the number of its non-zero elements. To represent the signal as sparsely as possible, we seek the minimum number of non-zero elements.
When the dictionary is over-complete, finding the sparsest representation is difficult: it requires a combinatorial search, which makes the problem NP-hard. Instead of the $L_0$ norm, the $L_1$ norm can therefore be used, and the sparse representation is formulated as:
$$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.} \quad y = D\alpha \qquad (4)$$
Then the general form can be formulated as follows:
$$\alpha^{*} = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.} \quad \|D\alpha - y\|_2 \le \epsilon \qquad (5)$$
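As an illustration of the $L_1$ formulation above, the sketch below computes a sparse code in the penalized (Lagrangian) form used in Eq. (6), with the dictionary held fixed, using scikit-learn's Lasso solver. The solver choice and the penalty value are assumptions, and Lasso's internal $1/(2n)$ rescaling of the data-fitting term means its alpha corresponds to $\lambda$ only up to a constant factor.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(y, D, lam=0.1):
    """Approximate alpha* = argmin ||y - D alpha||_2^2 + lam * ||alpha||_1.

    y : (n,) input signal; D : (n, k) dictionary with unit-norm atoms.
    """
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    lasso.fit(D, y)          # columns of D play the role of regression features
    return lasso.coef_       # sparse coefficient vector alpha in R^k
```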
Dictionary learning constructs the dictionary directly from the input signals. To this end, the dictionary can be obtained by solving the following minimization problem:
$$\min_{D, \alpha} \|y - D\alpha\|_2^2 + \lambda \|\alpha\|_1 \qquad (6)$$
where $y$ is the input signal, $\|y - D\alpha\|_2^2$ is the data-fitting term that measures the reconstruction error, $\|\alpha\|_1$ is the regularization term that controls the sparsity of the decomposition, and $\lambda$ is the penalty parameter that balances the trade-off between the data-fitting and regularization terms. With both $D$ and $\alpha$ as variables, this optimization problem is non-convex. The standard solution fixes one of the variables so that the minimization becomes a convex problem in the other, which guarantees an optimal solution with respect to that variable. Consequently, the objective is solved by two optimization steps, applied iteratively until the desired convergence is reached:

• Sparse approximation step: using the dictionary $D$, the coefficients $\alpha$ of the signal $y$ are computed by solving the resulting convex problem. (At the start, the dictionary is initialized arbitrarily.)
• Dictionary update step: using the computed sparse coding matrix $\alpha$, the dictionary is updated. Updating the dictionary reduces the approximation error at each iteration.

3. Proposed dictionary learning based ensemble multiple-instance learning

In this study, dissimilarity-based MIL is addressed using dictionary learning as a base classifier. MIL datasets consist of bags and their associated feature vectors (instances). To convert the MIL problem into a standard supervised learning problem, the dissimilarities of the bags to the selected instance prototypes are calculated using Eq. (1); each bag is thereby represented by dissimilarity values as features. Sparse coding can help decrease the dimensionality, so the classifiers deal with less complex problems and their performance can increase [19].

Dictionary learning is not only used for reconstruction but can also serve discriminative purposes. Discriminative dictionaries are an appropriate strategy for classifying input data. To this end, the class labels of the input data are involved in learning the dictionaries, which results in a different representation for each class; for each class of input data, a separate dictionary is obtained. For input data with $m$ class labels, the base dictionary $D$ is the concatenation of $m$ sub-dictionaries, $D = [D_1, D_2, \ldots, D_m]$, where $D_i \in \mathbb{R}^{n \times k}$. To classify a test sample, the signal is encoded with each of the sub-dictionaries, and it is assigned to the class of the sub-dictionary that yields the smallest reconstruction error together with the sparsest representation. In more detail, the steps are:

• Obtain the sparse code $\alpha_i$ of the signal $y$ for each sub-dictionary $D_i$, which is trained on one class of the input data.
• Compare the representation cost $\delta_i(y)$ of each sub-dictionary with its corresponding sparse code, and assign the class label of the sub-dictionary with the minimum representation cost:
$$i^{*} = \arg\min_{i \in \{1, \ldots, m\}} \delta_i(y) \qquad (7)$$

where

$$\delta_i(y) = \min_{\alpha_i} \|y - D_i \alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \qquad (8)$$
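A compact sketch of the class-specific scheme in Eqs. (7) and (8) is given below. It assumes the training samples are the dissimilarity feature vectors of Section 2.1 (rows of X) with labels y, and it uses scikit-learn's DictionaryLearning as a stand-in for the SPAMS routines used in the experiments; the atom count and λ are illustrative values.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

def train_class_dictionaries(X, y, n_atoms=25, lam=0.1):
    """Learn one sub-dictionary D_i per class. The alternating sparse-coding /
    dictionary-update steps described above are handled inside DictionaryLearning."""
    dicts = {}
    for c in np.unique(y):
        dl = DictionaryLearning(n_components=n_atoms, alpha=lam,
                                transform_algorithm='lasso_lars',
                                transform_alpha=lam)
        dl.fit(X[y == c])          # fit on the samples of class c only
        dicts[c] = dl
    return dicts

def classify(x, dicts, lam=0.1):
    """Eqs. (7)-(8): assign x to the class whose sub-dictionary yields the
    smallest reconstruction error plus sparsity cost."""
    costs = {}
    for c, dl in dicts.items():
        alpha = dl.transform(x[None, :])[0]      # sparse code of x under D_c
        recon = alpha @ dl.components_           # D_c alpha
        costs[c] = np.sum((x - recon) ** 2) + lam * np.abs(alpha).sum()
    return min(costs, key=costs.get)
```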
Using Eqs. (7) and (8), supervised dictionary learning and sparse coding can be used as a classifier that generates the sparse representations of signals and classifies them simultaneously.

A classifier can suffer from the curse of dimensionality when the number of features is too large. Similarly, when the number of prototypes is too large, the dissimilarity-based MIL algorithm can also suffer from high dimensionality and from many redundant, uninformative dissimilarities. Therefore, ensemble learning methods are applied to deal with this problem; in this paper, random subspace and bagging are considered.

Random subspace is one of the most commonly used ensemble learning methods: it generates subspaces from randomly selected features, and in each subspace the model is trained on a subset of features randomly selected from the entire feature set. The advantage of this method is that, for high-dimensional data, it may reduce the problems arising from the curse of dimensionality. In our random subspace-based method, Dissimilarity-based Random Subspace with Dictionary Learning (DRSDL), some instances are randomly selected from all training instances as prototypes, the instance-based dissimilarity of each bag to the prototype instances is calculated, and the bags with the obtained feature values are used to train the dictionary learning and sparse coding model and to classify unseen bags.
Fig. 1. Flow chart of the proposed DRSDL model.

Table 1
Pseudo-code for dissimilarity-based MIL using random subspace and dictionary learning.

Algorithm: Dissimilarity-Based Multi-Instance Learning Using Random Subspace and Dictionary Learning
Input: training set T, the number of ensemble classifiers s, subspace size z, base classifier D, number of samples in the test set t, prototypes r = {r1, r2, ..., rz}
Output: labels of the test set y = [y1, y2, ..., yt]
Training:
For i = 1 : s
    Create a subspace using z randomly selected instances: r = {r1, r2, ..., rz}.
    Represent each bag in T in v_instance form (Eq. (1)).
    Train two separate dictionaries: D_i1, D_i2 (Eqs. (5) and (6)).
End for
Test:
Represent each bag in the test set in v_instance form.
For i = 1 : s
    Predict the test bag labels using D_i1, D_i2 (Eq. (7)).
End for
Output: Classify each bag using the posterior probability: y
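To connect Table 1 with the earlier sketches, the following is a minimal, illustrative version of the DRSDL training and prediction loop. It reuses the hypothetical bag_dissimilarity, train_class_dictionaries, and classify helpers sketched above, uses a simple majority vote in place of the posterior-probability combination, and treats the ensemble size s and subspace size z as placeholders.

```python
import numpy as np

def train_drsdl(bags, labels, all_instances, s=20, z=50, seed=0):
    """Sketch of DRSDL training (Table 1): s random prototype subspaces,
    each paired with per-class dictionaries learned on dissimilarity features."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    ensemble = []
    for _ in range(s):
        idx = rng.choice(len(all_instances), size=z, replace=False)
        protos = all_instances[idx]                                   # random prototypes
        X = np.array([bag_dissimilarity(b, protos) for b in bags])    # Eq. (1)
        ensemble.append((protos, train_class_dictionaries(X, labels)))
    return ensemble

def predict_drsdl(bag, ensemble):
    """Combine the ensemble members by majority vote (a simple stand-in for
    the posterior-probability combination in Table 1)."""
    votes = [classify(bag_dissimilarity(bag, protos), dicts)
             for protos, dicts in ensemble]
    return max(set(votes), key=votes.count)
```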
The second approach used in this paper is bagging, the most straightforward approach for manipulating the inputs of the training set. In this technique, samples are selected randomly (with replacement) from the entire training set, which helps to create diverse classifiers [20]. In our bagging-based method, Dissimilarity-based Bagging Subspace with Dictionary Learning (DBSDL), all training instances are selected as prototypes, so the feature space consists of the instance-based dissimilarities of each bag to all existing instances. Each subspace is then generated by randomly selecting bags, and on each subspace the model is trained by dictionary learning and sparse coding-based classifiers; a minimal sketch of this procedure is given at the end of this section.

The pseudo-code of the proposed DRSDL and DBSDL techniques is given in Tables 1 and 2 respectively, and their flowcharts are shown in Figs. 1 and 2 respectively. The computational complexity of the proposed methods is dominated by the dictionary learning and sparse coding part, which has two steps:
• Approximating the sparse representation of the training data.
• Updating the dictionary using the sparse representation.
These steps can be interpreted as a PCA approach with multiple subspaces, or as the K-means method clustering a signal over multiple locations. A detailed complexity analysis of dictionary learning is given in [21].
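Analogously to the DRSDL sketch above, a minimal sketch of the DBSDL training loop described in this section (and formalized in Table 2 below) is given here. All instances serve as prototypes, each ensemble member is trained on a bootstrap sample of bags, and the helpers are the hypothetical ones from the earlier sketches.

```python
import numpy as np

def train_dbsdl(bags, labels, all_instances, s=20, seed=0):
    """Sketch of DBSDL training (Table 2): fixed dissimilarity feature space
    using all training instances as prototypes, with bagging over bags."""
    rng = np.random.default_rng(seed)
    X = np.array([bag_dissimilarity(b, all_instances) for b in bags])  # Eq. (1)
    y = np.asarray(labels)
    ensemble = []
    for _ in range(s):
        idx = rng.choice(len(bags), size=len(bags), replace=True)      # bootstrap sample
        ensemble.append(train_class_dictionaries(X[idx], y[idx]))
    return ensemble
```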
4. Experimental results

DRSDL and DBSDL are compared with DRSSVM, MILES, MILBoost, and minimaxSVM on 11 different MIL datasets [22]. To obtain an accurate picture of the role of DL in the ensembles, Dissimilarity-based Bagging Subspace with SVM (DBSSVM) is also evaluated.
Table 2
Pseudo-code for dissimilarity-based MIL using bagging and dictionary learning.

Algorithm: Dissimilarity-Based Multi-Instance Learning Using Bagging and Dictionary Learning
Input: training set T, the number of ensemble classifiers s, base classifier D, the number of samples in the test set t, prototypes r = {r1, r2, ..., rz}
Output: labels of the test set y = [y1, y2, ..., yt]
Training:
For i = 1 : s
    Use all the instances in the training set as prototypes.
    Represent each bag in the training set in v_instance form (Eq. (1)).
    Create a subspace by randomly selecting bags from T with replacement.
    Train two separate dictionaries: D_i1, D_i2 (Eqs. (5) and (6)).
End for
Test:
Represent each bag in the test set in v_instance form (Eq. (1)).
For i = 1 : s
    Predict the test bag labels using D_i1, D_i2 (Eq. (7)).
End for
Output: Classify each bag using the posterior probability: y

Table 3
MIL datasets used for experimental results.

Dataset              Total bags   Positive bags   Negative bags   Total instances
Tiger                200          100             100             1220
Fox                  200          100             100             1320
Elephant             200          100             100             1391
Musk2                102          63              39              6598
Musk1                92           47              45              476
alt.atheism          100          51              49              5443
comp.graphics        100          52              48              3094
rec.autos            100          51              49              3458
sci.crypt            100          51              49              4284
sci-med              100          51              49              3045
talk.politics.guns   100          51              49              3558
Similarly, to assess the role of the ensembles, Dissimilarity-based Dictionary Learning (DDL) and Dissimilarity-based SVM (DSVM), in which no ensemble method is used, are also examined. Table 3 lists the number of positive and negative bags and the number of instances in each dataset. The Tiger, Fox, and Elephant datasets are among the most frequently used benchmarks in MIL and come from the image categorization field: the bags are images and the instances are image segments. Images containing the relevant animal are considered positive bags, and images that do not contain it are negative bags. Musk1 and Musk2 concern molecule activity prediction problems.
Fig. 2. Flowchart of the proposed DBSDL model.
Table 4
Number of atoms and percentage of selected prototypes.

Dataset              DDL   DBSDL   DRSDL   Percentage
Tiger                100   50      200     20%
Fox                  150   50      200     50%
Elephant             50    25      200     5%
Musk2                25    25      25      10%
Musk1                25    25      25      30%
alt.atheism          25    25      50      20%
comp.graphics        50    50      200     50%
rec.autos            50    50      100     50%
sci.crypt             25    25      100     30%
sci-med              50    25      200     50%
talk.politics.guns   50    25      100     50%
Table 5
Accuracy (%) and standard error (SE) results.

Dataset              MILES          MILBoost       minimaxSVM     DRSSVM
                     ACC (SE)       ACC (SE)       ACC (SE)       ACC (SE)
Tiger                76.00 (2.59)   76.00 (5.57)   61.50 (2.69)   81.00 (2.56)
Fox                  65.00 (3.07)   55.50 (2.52)   58.50 (2.11)   64.50 (2.17)
Elephant             78.50 (2.11)   87.00 (2.26)   69.00 (3.48)   83.50 (3.08)
Musk2                93.00 (2.60)   63.00 (4.48)   37.00 (4.73)   89.00 (3.48)
Musk1                82.22 (5.02)   66.67 (6.83)   60.00 (4.12)   88.94 (3.32)
alt.atheism          51.00 (2.33)   42.00 (3.27)   66.00 (3.71)   69.05 (1.69)
comp.graphics        46.00 (6.00)   43.00 (4.48)   60.00 (2.58)   54.02 (1.95)
rec.autos            49.00 (4.58)   53.00 (7.16)   61.00 (2.33)   61.23 (2.43)
sci.crypt            53.00 (5.97)   52.00 (3.27)   62.00 (2.91)   70.14 (2.79)
sci-med              45.00 (3.73)   59.00 (5.04)   63.00 (1.53)   71.94 (3.62)
talk.politics.guns   41.00 (5.04)   54.00 (6.36)   63.00 (2.13)   76.05 (2.98)
In this problem, the classifiers try to decide whether a molecule has a musky smell or not. A molecule can take different shapes, which are folded into conformers; each bag is therefore a molecule and each instance is one of its conformers. If at least one of the conformers can make the molecule smell musky, the bag has a positive label. The remaining datasets are from the 20Newsgroups dataset, a text categorization problem generated from 20 different categories. A bag consists of different posts (instances) from different categories; positive bags contain 3% of their posts from the relevant category and 97% from other categories.

The SPAMS tool [23], LIBSVM [24], and PRtools [25] are used for the dictionary learning-based methods, the SVM-based methods, and the MIL-based approaches (MILES, MILBoost, and MinimaxSVM), respectively. Experimental results are obtained using ten-fold cross-validation, with all methods evaluated on the same training and test sets. All dissimilarity values for the training and test sets are normalized to zero mean and unit variance.

The number of atoms differs for each dictionary-based method (DRSDL, DBSDL, and DDL). For each method and dataset, the accuracies are computed for different numbers of atoms (25, 50, 100, 150, 200, 250, or 300), and the number of atoms is selected on the validation set, so the dictionaries are constructed with the number of atoms that gives the best prediction accuracy. Beyond 300 atoms, the accuracies of the models drop to around 50% and are not reported. Since the number of features in each dataset is data-dependent, in the random subspace ensemble methods (DRSSVM and DRSDL) the number of selected features differs per dataset. Because there is no information about the number of redundant features, several subspace sizes are examined: 5%, 10%, 20%, 30%, 40%, and 50% of the dissimilarity features are evaluated for each dataset, and the size that gives the best accuracy is selected as the random subspace size. For example, the Elephant dataset achieves its best accuracy using 5% of the training instances, whereas Fox performs best with 50%. In this way, we aimed to obtain better results using at most 50% of the features in the training sets. In the bagging subspace ensemble methods (DBSDL, DBSSVM), all of the training bags are sampled with replacement. In both ensemble methods, the number of subspaces is set to 20; using the same number of subspaces for bagging and random subspace keeps the parameters fixed so that the performance of each ensemble method can be compared fairly. Table 4 lists the number of atoms used for each dictionary-based method and the percentage of prototypes (z) selected from all training bag instances (the random subspace dimension).

In our initial experiments we compared the previously proposed DRSSVM method with three strong MIL approaches: MILES, MILBoost, and MinimaxSVM. The classification accuracy and AUC results are given in Tables 5 and 6, respectively. As shown in [12], DRSSVM generally outperforms its counterparts in our experiments; therefore, the proposed methods are compared with DRSSVM. These classification accuracy and AUC results are given in Tables 7 and 8, respectively.
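A minimal sketch of this evaluation protocol (ten-fold cross-validation with per-fold standardization of the dissimilarity features) is shown below. It assumes the bags have already been converted to a dissimilarity matrix X with bag labels y, and fit_and_score is a placeholder for any of the compared models (DRSDL, DBSDL, DRSSVM, ...).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

def cross_validate(X, y, fit_and_score, n_splits=10, seed=0):
    """Ten-fold CV with zero-mean/unit-variance normalization of the
    dissimilarity features, as described in the text (illustrative sketch)."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    accs = []
    for train_idx, test_idx in skf.split(X, y):
        scaler = StandardScaler().fit(X[train_idx])   # statistics from the training fold only
        acc = fit_and_score(scaler.transform(X[train_idx]), y[train_idx],
                            scaler.transform(X[test_idx]), y[test_idx])
        accs.append(acc)
    # mean accuracy and its standard error over the folds
    return np.mean(accs), np.std(accs) / np.sqrt(n_splits)
```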
Fig. 3. Classification accuracy of DRSSVM and DBSDL on test datasets.

Table 6
AUC (%) and standard error (SE) results.

Dataset              MILES          MILBoost       minimaxSVM     DRSSVM
                     AUC (SE)       AUC (SE)       AUC (SE)       AUC (SE)
Tiger                87.24 (2.40)   79.53 (8.13)   73.85 (2.93)   84.50 (2.54)
Fox                  72.78 (2.79)   64.12 (4.23)   64.10 (3.48)   68.85 (2.36)
Elephant             85.84 (1.84)   92.43 (2.05)   89.86 (2.11)   89.95 (2.19)
Musk2                97.14 (1.90)   68.56 (5.78)   76.55 (5.05)   90.33 (3.21)
Musk1                87.89 (5.04)   70.76 (8.07)   78.04 (6.09)   95.75 (1.75)
alt.atheism          46.49 (6.38)   52.52 (5.49)   51.42 (2.55)   76.70 (3.76)
comp.graphics        48.72 (6.36)   41.63 (4.26)   54.33 (4.12)   52.00 (2.00)
rec.autos            55.49 (4.89)   54.94 (7.28)   46.67 (3.71)   61.90 (2.81)
sci.crypt            50.65 (5.61)   51.69 (4.33)   56.11 (6.19)   78.28 (3.36)
sci-med              52.93 (7.68)   66.41 (4.31)   54.53 (2.86)   78.05 (4.46)
talk.politics.guns   35.11 (6.15)   55.53 (6.63)   61.12 (4.93)   89.30 (2.83)
Table 7
Accuracy (%) and standard error (SE) results.

Dataset              DSVM           DDL            DBSSVM         DBSDL          DRSSVM         DRSDL
                     ACC (SE)       ACC (SE)       ACC (SE)       ACC (SE)       ACC (SE)       ACC (SE)
Tiger                83.00 (2.71)   80.00 (3.16)   82.50 (2.50)   83.89 (1.93)   81.00 (2.56)   87.00 (2.00)
Fox                  63.00 (2.81)   57.50 (1.86)   74.00 (2.96)   73.50 (3.17)   64.50 (2.17)   65.50 (1.57)
Elephant             71.50 (2.59)   84.00 (2.97)   75.00 (2.98)   83.00 (3.67)   83.50 (3.08)   87.50 (3.27)
Musk2                85.09 (3.10)   76.45 (4.16)   89.00 (3.48)   80.27 (2.63)   89.00 (3.48)   82.27 (3.28)
Musk1                87.80 (2.61)   85.02 (3.85)   90.02 (2.51)   90.28 (3.11)   88.94 (3.32)   86.16 (3.98)
alt.atheism          69.80 (3.07)   73.07 (3.34)   68.92 (2.86)   81.07 (2.29)   69.05 (1.69)   90.18 (2.02)
comp.graphics        54.02 (1.95)   62.88 (2.64)   55.13 (2.32)   76.32 (3.54)   54.02 (1.95)   90.25 (2.79)
rec.autos            59.23 (2.87)   65.25 (3.90)   60.12 (3.11)   73.12 (4.02)   61.23 (2.43)   91.00 (2.77)
sci.crypt            68.94 (3.81)   73.16 (4.17)   69.93 (2.16)   78.07 (2.88)   70.14 (2.79)   90.07 (2.02)
sci-med              64.92 (3.10)   67.83 (2.28)   68.03 (2.39)   79.85 (2.75)   71.94 (3.62)   83.98 (3.71)
talk.politics.guns   66.14 (2.89)   63.92 (3.12)   72.94 (3.04)   83.05 (2.50)   76.05 (2.98)   87.87 (2.59)
Table 8
AUC (%) and standard error (SE) results.

Dataset              DSVM           DDL            DBSSVM         DBSDL          DRSSVM         DRSDL
                     AUC (SE)       AUC (SE)       AUC (SE)       AUC (SE)       AUC (SE)       AUC (SE)
Tiger                83.00 (2.71)   80.00 (3.16)   88.75 (2.66)   89.20 (2.14)   84.50 (2.54)   92.05 (2.05)
Fox                  63.00 (2.81)   58.00 (2.00)   85.70 (2.30)   84.70 (2.67)   68.85 (2.36)   71.50 (2.16)
Elephant             71.50 (2.59)   74.11 (4.83)   89.35 (2.62)   91.55 (2.75)   89.95 (2.19)   93.80 (2.21)
Musk2                85.65 (3.29)   74.11 (4.83)   96.46 (2.06)   92.77 (2.17)   90.33 (3.21)   79.73 (4.12)
Musk1                88.00 (2.32)   85.25 (3.81)   98.40 (1.06)   96.60 (2.00)   95.75 (1.75)   93.24 (2.43)
alt.atheism          70.00 (2.98)   73.17 (3.39)   80.42 (2.51)   91.73 (2.67)   76.70 (3.76)   97.60 (1.36)
comp.graphics        52.00 (2.00)   62.25 (2.42)   53.25 (2.24)   90.23 (3.29)   52.00 (2.00)   96.57 (1.25)
rec.autos            58.50 (2.79)   64.92 (3.82)   61.05 (3.43)   88.83 (3.74)   61.90 (2.81)   98.80 (1.20)
sci.crypt            68.67 (3.81)   72.92 (4.11)   82.60 (3.30)   86.87 (2.84)   78.28 (3.36)   97.87 (0.95)
sci-med              64.92 (3.10)   67.83 (2.33)   75.27 (4.16)   92.38 (1.27)   78.05 (4.46)   96.27 (1.75)
talk.politics.guns   65.67 (2.80)   63.83 (3.14)   88.30 (3.23)   93.65 (2.30)   89.30 (2.83)   95.83 (1.59)
Fig. 4. Classification accuracy of DRSSVM and DBSSVM on test datasets.
Fig. 5. Classification accuracy of DRSDL and DBSDL on test datasets.
Fig. 6. Classification accuracy of DDL and DSVM on test datasets.
As shown in these two tables, the DRSDL approach outperforms not only the DRSSVM technique but also the bagging-based DBSDL and DBSSVM algorithms, as well as the simple dissimilarity-based methods without ensembles, DDL and DSVM. The best performance of the proposed method is observed on the newsgroup datasets: on rec.autos, the proposed DRSDL algorithm achieves 91% classification accuracy and 98.8% AUC, whereas DRSSVM achieves 61.23% and 61.90%; DBSDL 73.12% and 88.83%; DBSSVM 60.12% and 61.05%; DDL 65.25% and 64.92%; and DSVM 59.23% and 58.50%, in terms of classification accuracy and AUC respectively. Tables 7 and 8 thus show the prominence of DRSDL. The second most successful method, DBSDL, is also compared with DRSSVM in Fig. 3, where it can be seen that DBSDL performs better in most cases.
Fig. 7. Area under curve of MIL datasets for DRSDL, DRSSVM, DBSDL, DBSSVM, DDL, DSVM methods.
It can be inferred that these two proposed methods outperform the other MIL methods as well.

The next comparisons concern the ensemble strategies. To compare the performance of the random subspace and bagging methods, we also compare dissimilarity-based random subspace and bagging methods using different base classifiers (DRSSVM with DBSSVM in Fig. 4, and DRSDL with DBSDL in Fig. 5). DRSSVM and DBSSVM show approximately similar results in most cases, with DRSSVM performing slightly better. Comparing DRSDL with DBSDL in Fig. 5, DRSDL has higher performance in most cases; with dictionary learning, the random subspace method clearly outperforms bagging. To illustrate the benefit of the ensembles, one can compare the results of the DSVM and DDL methods in Fig. 6, where only single classifiers are trained on all the dissimilarity values: performance decreases significantly. Comparing DDL and DSVM, DL appears to perform slightly better than SVM.

The area under the curve performance of all methods on each dataset is shown in Fig. 7. DRSDL has the best AUC performance in most cases, and DBSDL has the second-best AUC performance compared to the remaining approaches (DRSSVM, DBSSVM, DSVM, and DDL). Combining DL with ensembles outperforms the SVM ensembles; DL-based ensembles appear to generate more diverse classifiers than SVM-based ensembles. Notably, in ensemble methods what matters is not using a high-performance base classifier, but applying methods that generate more diverse classifiers in each subspace. The main limitation of dissimilarity-based models is the number of prototypes, and this limitation also holds for the proposed algorithms. In addition, the number of atoms in the dictionaries is another parameter to be determined. However, parameter optimization is a general issue for any classifier, and these parameters can be selected on validation sets.

5. Conclusion

In this paper, dissimilarity-based dictionary learning and sparse coding ensemble methods are proposed for multi-instance learning problems. The dissimilarity-based part converts the problem into a supervised problem, and dictionary learning and sparse coding are combined with random subspace and bagging for classification. Detailed experimental results on benchmark datasets show significantly higher performance compared to current state-of-the-art models. As future work, we will investigate the diversity of the classifiers in the ensembles and its effect on overall performance.

Declaration of Competing Interest

The authors declare that there is no conflict of interest.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.compeleceng.2019.106482.

References

[1] Dietterich TG, Lathrop RH, Lozano-Pérez T. Solving the multiple instance problem with axis-parallel rectangles. Artif Intell 1997;89(1):31–71.
[2] Kotzias D, Denil M, De Freitas N, Smyth P. From group to individual labels using deep features. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; 2015. p. 597–606.
[3] Foulds J, Frank E. A review of multi-instance learning assumptions. Knowl Eng Rev 2010;25(01):1–25.
[4] Chen Y, Bi J, Wang JZ. MILES: multiple-instance learning via embedded instance selection. IEEE Trans Pattern Anal Mach Intell 2006;28(12):1931–47.
[5] Cheplygina V. Dissimilarity-based multiple instance learning. Ph.D. thesis. TU Delft; 2015.
[6] Hand EM, Castillo C, Chellappa R. Doing the best we can with what we have: multi-label balancing with selective learning for attribute prediction. In: Thirty-Second AAAI Conference on Artificial Intelligence; 2018.
[7] Kocyigit G, Yaslan Y. DEMIAL: an active learning framework for multiple instance image classification using dictionary ensembles. Turk J Electr Eng Comput Sci 2018;26(1):593–604.
[8] Cheng E-J, Prasad M, Puthal D, Sharma N, Prasad OK, Chin P-H, et al. Deep learning based face recognition with sparse representation classification. In: International Conference on Neural Information Processing. Springer; 2017. p. 665–74.
[9] Ramirez I, Sprechmann P, Sapiro G. Classification and clustering via dictionary learning with structured incoherence and shared features. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE; 2010. p. 3501–8.
[10] Wang Z, Dong Y, Mao S, Wang X. Internet multimedia traffic classification from QoS perspective using semi-supervised dictionary learning models. China Commun 2017;14(10):202–18.
[11] Qiao M, Liu L, Yu J, Xu C, Tao D. Diversified dictionaries for multi-instance learning. Pattern Recognit 2017;64:407–16.
[12] Cheplygina V, Tax DM, Loog M. Dissimilarity-based ensembles for multiple instance learning. IEEE Trans Neural Netw Learn Syst 2016;27(6):1379–91.
[13] Polikar R. Ensemble based systems in decision making. IEEE Circuits Syst Mag 2006;6(3):21–45.
[14] Akbas E, Ghanem B, Ahuja N. MIS-Boost: multiple instance selection boosting. arXiv:1109.2388, 2011.
[15] Gärtner T, Flach PA, Kowalczyk A, Smola AJ. Multi-instance kernels. In: ICML, 2; 2002. p. 179–86.
[16] Zhang C, Platt JC, Viola PA. Multiple instance boosting for object detection. In: Advances in Neural Information Processing Systems; 2005. p. 1417–24.
[17] Andrews S, Tsochantaridis I, Hofmann T. Support vector machines for multiple-instance learning. In: Advances in Neural Information Processing Systems; 2002. p. 561–8.
[18] Zhang Q, Goldman SA. EM-DD: an improved multiple-instance learning technique. In: Advances in Neural Information Processing Systems; 2001. p. 1073–80.
[19] Tosic I, Frossard P. Dictionary learning. IEEE Signal Process Mag 2011;28(2):27–38.
[20] Efron B, Tibshirani RJ. An introduction to the bootstrap. CRC Press; 1994.
[21] Vainsencher D, Mannor S, Bruckstein AM. The sample complexity of dictionary learning. J Mach Learn Res 2011;12(Nov):3259–81.
[22] Cheplygina V. Multi instance learning data sets. http://www.miproblems.org/datasets/, Accessed: 2016-08-12.
[23] Mairal J. SPAMS tool. http://spams-devel.gforge.inria.fr/index.htmll, Accessed: 2015-02-01.
[24] LIBSVM. https://www.csie.ntu.edu.tw/~cjlin/libsvm/, Accessed: 2015-02-01.
[25] PRtools. http://prtools.org/software, Accessed: 2017-03-01.
Nazanin Moarref received her B.Sc. degree in Electrical and Electronic Engineering from the University of Tabriz, Iran, in 2013. She received her M.Sc. degree in Computer Engineering from ITU in 2017 and is currently pursuing her Ph.D. in Computer Engineering at ITU. Her research activities are concentrated on machine learning theory and applications and deep learning.

Yusuf Yaslan received his B.Sc. degree in Computer Science Engineering from Istanbul University, Turkey, in 2001. He received his M.Sc. degree in Telecommunication Engineering and his Ph.D. in Computer Engineering from Istanbul Technical University, in 2004 and 2011 respectively. His research interests are machine learning, data mining, and recommendation systems.