Accepted Manuscript

Title: Efficient multi-kernel multi-instance learning using weakly supervised and imbalanced data for diabetic retinopathy diagnosis
Authors: Peng Cao, Fulong Ren, Chao Wan, Jinzhu Yang, Osmar Zaiane
PII: S0895-6111(18)30479-8
DOI: https://doi.org/10.1016/j.compmedimag.2018.08.008
Reference: CMIG 1583
To appear in: Computerized Medical Imaging and Graphics
Received date: 1-1-2018
Revised date: 9-7-2018
Accepted date: 22-8-2018

Please cite this article as: Peng Cao, Fulong Ren, Chao Wan, Jinzhu Yang, Osmar Zaiane, Efficient multi-kernel multi-instance learning using weakly supervised and imbalanced data for diabetic retinopathy diagnosis, Computerized Medical Imaging and Graphics (2018), https://doi.org/10.1016/j.compmedimag.2018.08.008
Efficient multi-kernel multi-instance learning using weakly supervised and imbalanced data for diabetic retinopathy diagnosis

Peng Cao (a), Fulong Ren (a,b), Chao Wan (c,*), Jinzhu Yang (a,b), Osmar Zaiane (d)

(a) Computer Science and Engineering, Northeastern University, Shenyang, China
(b) Key Laboratory of Medical Image Computing of Ministry of Education, Northeastern University, Shenyang, China
(c) Department of Ophthalmology, the First Hospital of China Medical University, Shenyang, China
(d) Computing Science, University of Alberta, Edmonton, Alberta, Canada

* Corresponding author: Chao Wan ([email protected])
Abstract
Objective: Diabetic retinopathy (DR) is one of the most serious complications of diabetes. Early detection and treatment of DR are key public health interventions that can significantly reduce the risk of vision loss. How to effectively screen and diagnose retinal fundus images in order to identify retinopathy in time is a major challenge. In a traditional DR screening system, the accuracy of micro-aneurysm (MA) and hemorrhagic (H) lesion detection determines the final screening performance. The detection method produces a large number of false positive samples in order to guarantee high sensitivity, and the classification model is not effective in removing these false positives since the suspicious lesions lack label information. Methods: In order to solve this problem of supervised learning in the diagnosis of DR, we formulate weakly supervised multi-class DR grading as a multi-class multi-instance problem where each image (bag) is labeled as healthy or abnormal and consists of unlabeled candidate lesion regions (instances). Specifically, we propose a multi-kernel multi-instance learning method based on graph kernels. Moreover, we develop an under-sampling at the instance level and an over-sampling at the bag level to improve the performance of multi-instance learning in the diagnosis of DR. Results: Through empirical evaluation and comparison with different baseline methods and the state-of-the-art methods on data from Messidor, we show that the proposed method achieves favorable results, with an overall classification accuracy of 0.916 and an AUC of 0.957.
Conclusions: The experimental results demonstrate that the proposed multi-kernel multi-instance learning framework with bi-level re-sampling can address the imbalanced and weakly supervised data problem in grading diabetic retinopathy, and it improves the diagnosis performance over several state-of-the-art competing methods.
Keywords: Diabetic retinopathy, computer aided diagnosis, multi-kernel learning, imbalanced data learning, multi-instance learning

1. Introduction
Diabetic retinopathy (DR) is a chronic progressive disease of the retinal microvasculature and has been the most common cause of blindness over the past 50 years. It remains a major complication of diabetes and a leading cause of blindness among adults worldwide. The prevalence of DR is expected to grow exponentially and affect over 366 million people worldwide by 2030 [1]. Despite the high risk, 75% of the blindness due to diabetes is known to be preventable. It has been established that early detection and timely treatment can reduce the development of severe vision loss in 60% of cases. In order to prevent the damage of this severe complication to patients' vision, it is very important to diagnose diabetic retinopathy as early as possible and provide appropriate treatment to minimize further deterioration. Automatic computer-aided screening of DR is a highly investigated field, and a number of different methods have been developed in the past to help reduce the burden on specialists [2, 3]. Microaneurysms (MAs) are regarded as early signs of DR, and as the degree of DR advances, hemorrhages (H) become evident. HMA (MA and H) counts are an important measure of the progression of retinopathy in the early stage and may serve as a surrogate end point for severe change in some clinical trials; the diagnosis and grading performance of DR therefore depends highly on HMA detection [4]. Detection is a challenging task due to the similarity between true HMAs and other normal patterns (e.g., blood vessels) in the images. Consequently, most algorithms produce a large number of false positives in order to guarantee high sensitivity [5, 6]. A classification step is usually used to reduce the false positives: a classifier is trained to predict the labels of the HMA candidates, eliminating these false positives (FPs) as much as possible while retaining the true positives (TPs).
Supervised machine learning is a powerful tool frequently used in computer-aided diagnosis (CAD) applications. The aim is to learn a system capable of predicting the unknown class of previously unseen suspicious HMA candidates with good generalization ability, which is a critical part of the DR diagnosis system. A number of techniques using machine learning approaches have been presented to identify DR. 1) Some methods consider the retinal image as a whole and identify DR directly at the global image level. Acharya et al. developed an integrated index for the identification of diabetic retinopathy stages using texture parameters [7]. Ganesan et al. utilized trace transforms to model the human visual system and extracted features with the trace transform functionals. Besides handcrafted feature extraction, deep learning methods can learn a hierarchy of features, which can be used for image classification instead of the handcrafted features [8]. Xu et al. developed a deep convolutional neural network for the classification of retinal images [9][10]. Pratt et al. proposed a convolutional neural network approach to diagnosing DR from digital fundus images and accurately classifying its severity [11]. 2) The methods above consider the whole image from a global view. However, they do not consider the lesions in the image from a local view and do not provide interpretability. The output of these methods is not sufficiently useful for human graders because the grade is not based on specific lesions; an algorithm that marks out the locations of these retinopathy lesions would provide more information to graders and save time and resources. There are several signs by which DR is identified; the lesions include microaneurysms and hemorrhages. Tan et al. utilized a single convolutional neural network to segment the lesions, such as exudates, haemorrhages and microaneurysms [12][13]. However, the segmentation and identification of the lesions require annotations, and medical experts do not always have enough time for labeling and segmenting such a large amount of HMA data for training. Existing datasets with DR labels do not contain lesion labels; for example, the MESSIDOR database (http://messidor.crihan.fr/index-en.php) contains 1200 retinal images with only the DR grading information, acquired by three ophthalmologic departments. The lack of HMA positions prevents traditional supervised algorithms from reducing the false positives generated by the initial detection. This is a weakly supervised learning problem [14], where only the label for the whole image is available but not the labels of the individual instances [15]. The problem complicates the assessment of the severity level of DR. Some methods utilize auxiliary datasets [16, 6], but the major issue is that there exist differences in the marginal distribution and conditional distribution between the borrowed auxiliary dataset and the target dataset. Some methods create manually annotated ground truths with the help of medical experts [17], which is a laborious process and is prone to human error.
Another key challenge in designing good prediction models on medical data lies in class imbalance [18, 19]. The imbalanced data issue usually occurs in computer-aided DR diagnosis systems since the "healthy" class is far better represented than the three "diseased" classes. Besides the class imbalanced distribution between the "healthy" class and the "diseased" classes at the image level, the distribution is also imbalanced between TPs and FPs in each image at the instance level, although the label of each candidate is unknown [5, 20, 21]. Because the detection algorithms are tuned for high sensitivity, some non-microaneurysm structures are labeled as microaneurysms in the initial microaneurysm identification step, so the number of false positives (negative instances) exceeds the number of true positives (positive instances). Class imbalanced data has detrimental effects on the performance of conventional classifiers, which typically attempt to reduce the global error rate without taking the data distribution into consideration. In order to solve the problem of supervised learning in the diagnosis of DR, we propose in this paper a multi-instance learning (MIL) framework to accurately diagnose DR. The framework involves three key techniques: under-sampling at the instance level, over-sampling at the bag level and multi-instance learning, to overcome the issues introduced previously. More specifically, our main contributions to DR research are summarized as follows: 1. Multi-instance learning for weakly supervised data. Based solely on class labels assigned globally to fundus images, we regard the diagnosis of DR as a multi-instance learning problem and propose a partially supervised multi-class classification algorithm to assess the severity of the disease. MIL is an extension of supervised learning that can train classifiers using such weakly labeled data, which is organized in bags whose individual instances do not have labels. Compared with traditional SIL (single instance learning) approaches, weaker annotations are needed for training MIL algorithms, which simplifies data collection tremendously. Through MIL, we can shift from classifying the suspicious HMA lesions with supervised learning to classifying the fundus images directly, relying on labels at the global level, which avoids building a supervised classifier for false positive reduction [22] and is more in accordance with clinicians' reasoning. The difference between traditional learning and MIL-based methods is illustrated in Figure 1. Under the MIL strategy, HMA ROIs (regions of interest) are regarded as instances, and the fundus image containing a set of HMA ROIs is regarded as a bag. Following the idea of the mi-Graph algorithm proposed in [23], we develop a graph-based MIL method to predict the diagnosis of DR.
Figure 1: The flowchart of the conventional supervised based method and proposed MIL based method.
The mi-Graph algorithm treats the instances in a non-i.i.d. way and relies on a graph representation of each bag. By relaxing the assumption of independence between instances and exploiting the relations among instances, it can improve the MIL performance, since the instances are often inter-correlated [23]. 2. Imbalanced data learning based on a hybrid sampling scheme at the bag level and the instance level. We employ data sampling techniques to solve the imbalanced data issues at both the bag level and the instance level. On the one hand, based on the ideas of transfer learning [24] and under-sampling [19], we filter the irrelevant instances and avoid the bias induced by different datasets, in order to reduce the imbalanced distribution of instances within bags while retaining the true instances, without requiring instance labels. On the other hand, we extend SMOTE [25] to generate appropriate synthetic bags for imbalanced data learning. 3. A unified multi-kernel framework to build a nonlinear model. Modeling the diagnosis of DR as a nonlinear function of the fundus image may provide enhanced flexibility and the potential to better capture the complex relationship between the image features and the diagnostic outcome. Many kernel-based classification or regression methods with faster optimization or stronger generalization performance have been proposed and investigated through theoretical analysis and experimental evaluation [26, 27]. Kernel methods have also been widely used for predictive classification [28, 29] in current research on DR diagnosis. The choice of the types and parameters of the kernels is critical for a particular task [30], as it determines the mapping between the instances (the input space) and the feature space used by the graph kernel.
Selecting the optimal kernel by a cross-validation search over a predefined pool of kernels is usually time-consuming and sometimes causes overfitting. Since the kernel plays an essential role in both graph construction and classification, multiple kernel learning (MKL) methods [30] not only learn an optimal linear combination of given base kernels, but can also be used to combine heterogeneous feature subsets of instances. The primary objective of MKL is indeed to learn a kernel best suited for a given task by optimally combining the base kernels. Therefore, we propose a multi-kernel multi-instance learning framework, involving bag under-sampling, bag over-sampling, graph construction and classification, all of which are conducted in the multiple feature spaces induced by multiple kernel functions. The novelty of this paper is three-fold. Firstly, we propose a sparse multi-kernel multi-instance learning method for diabetic retinopathy diagnosis on weakly supervised data, which demonstrates superior performance over deep learning and other multi-instance learning methods. Moreover, an under-sampling at the instance level and an over-sampling at the bag level are proposed to improve the performance of multi-instance learning on imbalanced data. Furthermore, we have conducted extensive experiments to investigate our proposed methods on the public Messidor dataset [31], including extensive ablation experiments that demonstrate the contribution of each key component of our proposed framework. The rest of this paper is organized as follows. Section 2 presents the problem formulation and introduces the pipeline of our automated screening system. Section 3 presents experimental results comparing different prediction methods on the Messidor dataset. Section 4 discusses the experimental results. Section 5 presents the limitations and future directions of our work. Finally, the paper is concluded in Section 6.

2. MKMIL-based framework for DR diagnosis

2.1. Motivation and Problem Formulation

The labeled suspicious lesion ROI data is scarce, as manual annotation is time-consuming and tedious. Therefore, the aim of our work is to predict whether a subject is healthy or at a specific disease stage (e.g., mild, moderate or severe) based on the retinal image, without being given instance labels, thus avoiding building a supervised classifier for false positive reduction. Each diagnosis task is a binary classification problem. We formulate the problem as a multiple instance learning (MIL) problem, which is a weakly supervised setting where images are regarded as bags $X = \{B_i, y_i\}_{i=1}^{N}$ with labels $y_i \in \{-1, +1\}$, where each bag is represented by a set of instances $\{x_1, x_2, \ldots, x_{n_i}\}$.
Different bags may contain different numbers of instances. Each instance corresponds to a suspicious HMA ROI from which a feature vector is extracted. The class labels $y$ are diagnoses assigned globally to the images. MIL is then used to train on the training set $X$ and predict the bag labels of unseen images, alleviating the labeling and segmentation burden on the ophthalmologist. Figure 2 illustrates multi-instance learning in the diagnosis of DR and the difference between MIL and traditional supervised learning.
Figure 2: The multi-instance learning in the diagnosis of DR.
For the problem of DR diagnosis, an image can be labeled as a positive or negative bag based on whether it contains any HMA lesion. As shown in Fig. 2, if the bag contains at least one positive HMA ROI related to the disease, the bag is labeled as positive (DR); otherwise, the bag is labeled as negative (healthy). According to the clinical DR severity scale, the severity level is based on the number of lesions, with a distinction between MA and H, and their spatial distribution in the retina (Table 1). Figure 3 shows fundus images for the different DR stages.
Table 1: Criteria used for grading DR

L0 (healthy, no DR)   MA = 0 and H = 0
L1 (mild DR)          1 ≤ MA ≤ 5 and H = 0
L2 (moderate DR)      5 ≤ MA ≤ 15 and 0 ≤ H ≤ 5
L3 (severe DR)        MA > 15 and H > 5
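As a concrete illustration, the grading criteria of Table 1 can be written as a small lookup function. This is only a sketch: the MA and H counts are assumed to come from a (hypothetical) lesion detector, and cases that fall between the listed ranges are resolved by falling through to the next more severe grade, which is an assumption rather than part of the original criteria.

```python
def grade_dr(ma_count: int, h_count: int) -> str:
    """Map microaneurysm (MA) and hemorrhage (H) counts to a DR grade (Table 1)."""
    if ma_count == 0 and h_count == 0:
        return "L0"  # healthy, no DR
    if 1 <= ma_count <= 5 and h_count == 0:
        return "L1"  # mild DR
    if ma_count <= 15 and h_count <= 5:
        return "L2"  # moderate DR (lower bounds relaxed: an assumption for uncovered cases)
    return "L3"      # severe DR
```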
2.2. Overview of methods

An overview of the proposed multi-kernel multi-instance learning (MKMIL) framework, which automatically detects candidate lesions and produces the diagnosis of diabetic retinopathy (DR or healthy), is shown in Fig. 4. The diagnosis system comprises six major steps: instance generation (candidate HMA ROI detection and feature extraction), under-sampling, graph construction, computation of graph kernels, over-sampling and multi-kernel classifier training. The difference between Fig. 4(a) and Fig. 4(b) is that an over-sampling procedure is incorporated when there exists a class imbalanced distribution.
Figure 3: Initial detection of HMA candidates by our detection method [5] and illustration of instances for the different stages: (a) L0, no DR; (b) L1, mild DR; (c) L2, moderate DR; (d) L3, severe DR. Red: true hemorrhage, Green: true MA, White: false positive.
2.3. ROI (instances) generation

In our previous work, we developed an initial HMA candidate detection algorithm, which consists of vessel detection and elimination, extraction of HMA candidate seeds, and constrained region growing [5]. The initial detection results are shown in Figure 5(a-b). Because MAs vary enormously in volume, shape and appearance [5], we extract features from many aspects, such as intensity, gradient, shape and texture distribution. Our feature extraction process generates 37 image features (Table 2) for each potential HMA.

2.4. Bag under-sampling, BUS

Current CAD schemes for microaneurysm detection achieve high sensitivity levels but report many false positives: because the detection algorithms are tuned for high sensitivity, some non-microaneurysm structures are labeled as microaneurysms in the initial identification step. The large number of negative instances (HMA candidates) in each bag (image) could degrade the performance of MIL.
Page 8 of 32
Feature extraction
Suspicious HMA detection
Irrelevant instance filtering
Graph Construction
Color
Graph Kernel
ip t
Training data
Shape Texture
cr
Gradient
Multi kernel classifier
us
(a) The proposed computer-assisted system for diagnostic of diabetic retinopathy (balanced data) Confidential
Feature extraction
Suspicious HMA detection
Color
Graph Construction
M
Shape
Irrelevant instance filtering
an
Training data
Over-sampling
Texture
Graph Kernel
Multi-task multi kernel classifier
te
d
Gradient
(b) The proposed computer-assisted system for diagnostic of diabetic retinopathy (imbalanced data)
Ac ce p
Figure 4: The overview of the proposed MKMIL method.
Hence it is important to design efficient instance pruning and selection techniques that speed up the learning process without compromising performance. In our earlier work, we proposed an ensemble classifier for false positive reduction of HMA candidates using ELM as the base classifier [5]. The ensemble classifier is applied on the e-ophtha dataset, which contains the ground truth of the true lesions, to discriminate true HMAs from false positive candidates. However, directly applying a classifier trained on an independent dataset to predict the HMA candidates in the target dataset would violate the assumption that the training and test data share the same distribution. To reduce the imbalanced data distribution between TPs and FPs and to avoid the bias induced by different datasets, we aim to find suitable weights $\gamma$ for the labeled source dataset that minimize the discrepancy between the labeled source dataset $D_s$ and the unlabeled target dataset $D_t$.
Figure 5: Initial detection and irrelevant instance filtering of HMA candidates: (a) original fundus image; (b) vessel segmentation; (c) suspicious HMA detection; (d) filtering of irrelevant instances.
Based on the estimation of the resampling weights [?], the primal objective function of the sampling weight estimation model is written as:

$$\min_{\gamma}\ \left\| \frac{1}{n_s}\sum_{i=1}^{n_s}\gamma_i\,\phi_q(x_i^s) - \frac{1}{n}\sum_{i=1}^{n}\phi_q(x_i) \right\|^2 = \min_{\gamma}\ \frac{1}{2}\gamma^{T}K_q\gamma - \kappa^{T}\gamma + \mathrm{const} = \min_{\gamma}\ \frac{1}{2}\gamma^{T}K_q\gamma - \kappa^{T}\gamma$$

$$\mathrm{s.t.}\quad \gamma_i \in [0, T] \quad \text{and} \quad \left|\sum_{i=1}^{n_s}\gamma_i - n_s\right| \le n_s \qquad (1)$$
Table 2: Description of heterogeneous features

Notation   Name                                           Description
f1-4       RGBmean, RGBstd, RGBmin, RGBmax                The mean, standard deviation, minimum and maximum of the intensity values in the RGB image
f5-8       Greenmean, Greenstd, Greenmin, Greenmax        The mean, standard deviation, minimum and maximum of the intensity values in the green channel
f9-12      Redmean, Redstd, Redmin, Redmax                The mean, standard deviation, minimum and maximum of the intensity values in the red channel
f13-16     CIElabmean, CIElabstd, CIElabmin, CIElabmax    The mean, standard deviation, minimum and maximum of the intensity values in the CIElab color space
f17-20     Gradmean, Gradstd, Gradmin, Gradmax            The mean, standard deviation, minimum and maximum of the gradient image
f21        Area                                           The sum of pixels in the possible candidate region
f22        Perimeter                                      The count of boundary pixels
f23        AspectRatio                                    The ratio of major axis length to minor axis length of the candidate region
f24        Solidity                                       The ratio of the region area to the minimum bounding rectangle area of the region
f25-37     Haralick                                       Mean, variance, homogeneity, contrast, dissimilarity, entropy, ASM, correlation, etc.
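As an illustrative, non-authoritative sketch, a few of the Table 2 descriptors for one candidate region could be computed as follows. The function name, the use of the green channel only and the reliance on scikit-image's regionprops are assumptions made for the example, not the paper's actual implementation.

```python
import numpy as np
from skimage.measure import label, regionprops

def candidate_features(green_channel: np.ndarray, candidate_mask: np.ndarray) -> dict:
    """Compute a subset of the Table 2 descriptors for a single HMA candidate.
    green_channel : 2-D float image (green channel of the fundus image).
    candidate_mask: binary mask of one candidate from the initial detection (assumed)."""
    props = regionprops(label(candidate_mask), intensity_image=green_channel)[0]
    pixels = green_channel[candidate_mask > 0]
    return {
        "Green_mean": float(pixels.mean()),
        "Green_std": float(pixels.std()),
        "Green_min": float(pixels.min()),
        "Green_max": float(pixels.max()),
        "Area": float(props.area),                 # sum of pixels in the region
        "Perimeter": float(props.perimeter),       # count of boundary pixels (approx.)
        "AspectRatio": props.major_axis_length / max(props.minor_axis_length, 1e-6),
        # Table 2 defines Solidity as region area / bounding-box area ("extent" in skimage).
        "Solidity": float(props.extent),
    }
```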
Ac ce p
te
In Eq. (1), $n_s$ is the size of the labeled source dataset (e.g., the e-ophtha dataset), $T$ is the maximum of the weight, $K_q$ is the kernel matrix of the labeled source dataset with $K_q(i,j) = k_q(x_i^s, x_j^s)$, $x^s$ denotes the instances from the labeled source dataset, $\kappa_i = \frac{n_s}{n}\sum_{j=1}^{n} k_q(x_i^s, x_j)$, $n$ is the size of the unlabeled target dataset (Messidor) and $k_q$ is the specific kernel function. Large values of $\kappa_i$ correspond to important observations $x_i^s$ of the source dataset and lead to a large weight $\gamma_i$. The obtained weights $\gamma$ match the distributions of the labeled source dataset and the unlabeled target dataset in the feature space. After estimating the instance weights of the labeled source dataset, we retrain a kernel classifier with the specific kernel function on the reweighted instances. With the trained classifier, the unlabeled target instances are predicted with a probability score, and the instances whose probability scores are lower than a threshold value (S) are removed. An example result of instance detection and filtering is shown in Figure 5(d).
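A minimal sketch of this weight estimation is given below, assuming an RBF kernel and a simple projected-gradient solver. The box constraint $\gamma_i \in [0, T]$ is enforced by clipping, while the sum constraint of Eq. (1) is omitted for brevity, so this is an approximation of the model rather than the exact solver used in the paper.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def estimate_instance_weights(Xs, Xt, T=10.0, sigma=1.0, lr=1e-3, n_iter=500):
    """Estimate source-instance weights gamma by minimising
    0.5 * gamma^T K gamma - kappa^T gamma (the quadratic form of Eq. 1).
    Xs: labelled source instances (e-ophtha), Xt: unlabelled target instances (Messidor)."""
    ns, n = len(Xs), len(Xt)
    K = rbf(Xs, Xs, sigma)                                 # n_s x n_s Gram matrix of the source set
    kappa = (ns / n) * rbf(Xs, Xt, sigma).sum(axis=1)      # alignment with the target mean embedding
    gamma = np.ones(ns)
    for _ in range(n_iter):
        grad = K @ gamma - kappa                           # gradient of the quadratic objective
        gamma = np.clip(gamma - lr * grad, 0.0, T)         # projected-gradient step onto [0, T]
    return gamma
```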
2.5. The construction of graph kernels with multiple kernels

The mi-Graph method explicitly maps every bag to an undirected graph and uses a new graph kernel to distinguish the positive and negative bags. However, learning from bags raises important challenges in the mi-Graph model. In each bag, a graph is constructed from the instances: the nodes represent the features of the ROIs and the edges reflect the relationships between the instances. The graphs represent the appearances of the suspicious HMA ROIs and capture the relationships among the suspicious instances. The distance between instances in $B_i$ under the $q$-th kernel function is denoted $r_q^i(x_a^i, x_b^i)$. If the RBF kernel function is chosen, the distance between instances is calculated as $r(x_a, x_b) = \exp\!\big(-\frac{\|x_a - x_b\|}{2\sigma}\big)$, where $\sigma$ is a kernel parameter. To consider only strong similarities, if the distance between the instances $x_a$ and $x_b$ is smaller than $\theta$, $r(x_a, x_b)$ is set to 1, and 0 otherwise. Note that $\theta$ is set to the average distance between every two instances in $B$. Therefore, a weight $w_{q,a}^i = 1 \big/ \sum_{u=1}^{n_i} r_q(x_a^i, x_u^i)$ (where $n_i$ is the number of instances in the $i$-th bag) is associated with each instance $x_a^i$ in $B_i$ under the $q$-th kernel function; it is inversely proportional to the number of neighbors of $x_a^i$.
To capture the similarity among graphs, a graph kernel is proposed, which is expected to distinguish the positive and negative bags. Consequently, the similarity between bags $B_i$ and $B_j$ is calculated as:

$$K_q(B_i, B_j) = \frac{\sum_{a=1}^{n_i}\sum_{b=1}^{n_j} w_{q,a}^i\, w_{q,b}^j\, r_q^{ij}(x_a^i, x_b^j)}{\sum_{a=1}^{n_i} w_{q,a}^i \sum_{b=1}^{n_j} w_{q,b}^j} \qquad (2)$$
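The following sketch shows one way to evaluate the bag-level similarity of Eq. (2) for a single base kernel. The RBF-style affinity, the use of the mean affinity as a proxy for the average-distance threshold, and the guard against isolated instances are assumptions made for illustration.

```python
import numpy as np

def bag_kernel(Bi, Bj, sigma=1.0, theta=None):
    """mi-Graph style similarity between two bags (Eq. 2).
    Bi, Bj: arrays of shape (n_i, d) and (n_j, d) holding instance feature vectors."""
    def affinity(A, B):
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        return np.exp(-d / (2.0 * sigma))          # RBF-style affinity used in the text

    def node_weights(B):
        r = affinity(B, B)
        thr = theta if theta is not None else r.mean()   # proxy for the average-distance rule
        edges = (r >= thr).astype(float)                  # keep only strong similarities
        return 1.0 / np.maximum(edges.sum(axis=1), 1.0)   # w_a: inverse neighbour count

    wi, wj = node_weights(Bi), node_weights(Bj)
    r_ij = affinity(Bi, Bj)                               # cross-bag instance affinities
    return (wi[:, None] * wj[None, :] * r_ij).sum() / (wi.sum() * wj.sum())
```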
2.6. Bag over-sampling, BOS

In Section 2.4, we proposed negative instance filtering as an under-sampling algorithm to address the imbalance at the instance level. Besides the imbalanced data issue within each positive bag, another key challenge in designing good prediction models lies in the class imbalance problem at the bag level. In the Messidor dataset, the number of healthy cases is nearly twice that of the other classes. This is a typical class imbalance problem, which has detrimental effects on the performance of conventional classifiers. Class imbalanced data is ubiquitous in medical applications and has been recognized as a crucial problem in machine learning [19]. A popular and effective over-sampling method is the synthetic minority over-sampling technique (SMOTE) [25]. The SMOTE algorithm creates artificial data based on the feature space similarities between existing minority instances. Specifically, for the minority class dataset $C_{min}$, consider the $K$ nearest neighbors of each instance $x_i \in C_{min}$. To create a synthetic sample, randomly select one of the $K$ nearest neighbors, multiply the corresponding feature vector difference by a random number in $[0,1]$, and finally add this vector to $x_i$:

$$x_{ij} = x_i + \delta_{ij} \times (x_j - x_i) \qquad (3)$$
where $x_j$ is one of the $K$ nearest neighbors of $x_i$, and $\delta_{ij} \in [0, 1]$ is a random number. The resulting synthetic instance is therefore a point along the line segment joining $x_i$ and the randomly selected nearest neighbor $x_j$. Figure 6(a) shows an example of the SMOTE procedure. When there exists an imbalanced class distribution at the bag level, we need to extend the traditional SMOTE to the MIL setting. For $B_i$, consider its $k$ nearest neighbors of the same class according to the Gram matrix obtained before, and randomly select one of them, denoted $B_j$. Following the SMOTE procedure, BOS generates an artificial example $x_{pq}^{ij}$ located on the path connecting a selected minority example $x_p^i$ in the $i$-th bag and one of the instances $x_q^j$ in the $j$-th bag:

$$x_{pq}^{ij} = x_p^i + \delta_{pq} \times (x_q^j - x_p^i) \qquad (4)$$

The idea of the proposed bag-level over-sampling for the MIL problem is shown in Figure 6(b). $B_{ij}$ is the synthetic bag generated from $B_i$ and $B_j$ by BOS, and its instance size is the same as that of the seed bag $B_i$. Because negative instances may remain in $B_i$, wrong instances would be generated if a negative instance were used as a seed. To avoid this, we use an adaptive scheme that assigns weights to the minority class instead of a uniform sampling weight; it automatically decides the number of synthetic samples that need to be generated for each minority instance. Specifically, in each minority bag we choose the seed instances according to the weight $w_q^i$: instances with higher weights have a higher probability of being chosen as seeds. The aim is to prevent negative instances remaining in a positive bag from being chosen as seeds. The sampling weight $w_q^i$ of each instance in a minority bag is obtained from the probability output produced by the ELM during the bag under-sampling stage. The probability of each instance is normalized as:

$$\hat{w}_q^i = w_q^i \Big/ \sum_{u=1}^{n_i} w_u^i \qquad (5)$$

Then, the number of new instances generated for each instance $x_q$ is obtained according to:
$$n_q = n_i \times \hat{w}_q^i \qquad (6)$$
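A compact sketch of the BOS generation step for one minority bag is shown below, assuming the per-instance confidences from the BUS stage are available as an array; the rounding rule used to turn Eq. (6) into integer counts and the random-number generator are illustrative choices.

```python
import numpy as np

def bag_oversample(bag_i, bag_j, seed_weights, rng=None):
    """Generate one synthetic minority bag from a seed bag B_i and a same-class
    nearest-neighbour bag B_j (Eqs. 4-6).
    bag_i, bag_j : (n_i, d) and (n_j, d) instance matrices.
    seed_weights : per-instance confidences w_q^i from the BUS stage."""
    rng = rng or np.random.default_rng(0)
    w_hat = seed_weights / seed_weights.sum()              # Eq. (5): normalised sampling weights
    n_new = np.round(len(bag_i) * w_hat).astype(int)       # Eq. (6): instances generated per seed
    synthetic = []
    for p, count in enumerate(n_new):                      # high-confidence seeds contribute more
        for _ in range(count):
            q = rng.integers(len(bag_j))                   # random instance in the neighbour bag
            delta = rng.random()
            synthetic.append(bag_i[p] + delta * (bag_j[q] - bag_i[p]))   # Eq. (4)
    return np.asarray(synthetic)
```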
2.7. Multi-kernel learning

After obtaining the graph kernels with multiple kernels, the MIL problem can be converted to a standard supervised learning problem based on multiple kernel learning. The objective of MKL is to seek the optimal kernel combination $\hat{K}(x, x') = \sum_{q=1}^{Q} d_q K_q(x, x')$, with $d_q \ge 0$ and $\sum_{q=1}^{Q} d_q = 1$. The commonly used MKL is based on the $\ell_1$-norm imposed on $d$. Through MKL, we can obtain the optimal weights of the kernel combination $d$ and the classifier model by solving a single joint optimization problem.
Figure 6: (a) In SMOTE, $x_{ij}$ is the synthetic instance generated from $x_i$ and $x_j$. (b) An illustrative toy example of the proposed BOS algorithm: $B_{ij}$ is the synthetic bag generated from $B_i$ and $B_j$ by BOS. The red rectangle is a bag of the minority class and the blue one is a bag of the majority class. The circles in the bags indicate the instances, and their color indicates the weight of each instance: the darker the color, the larger the weight. The green rectangle is the new bag consisting of the new instances generated across the nearest neighbor bags.
The primal objective function of multiple kernel learning is:

$$\min_{f,d,b,\xi}\ \frac{1}{2}\sum_{q=1}^{Q}\frac{\|f_q\|_2^2}{d_q} + C\sum_{i=1}^{N}\xi_i$$

$$\mathrm{s.t.}\quad y_i\Big(\sum_{q=1}^{Q} f_q(x_i) + b\Big) \ge 1 - \xi_i,\ \ i = 1, \ldots, N; \qquad \sum_{q=1}^{Q} d_q = 1,\ \ d_q \ge 0,\ \ q = 1, \ldots, Q \qquad (7)$$

The objective value of the dual problem of (7) is:

$$\min_{d}\ \max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{N}\alpha_i\alpha_j y_i y_j \sum_{q=1}^{Q} d_q K_q(B_i, B_j)$$

$$\mathrm{s.t.}\quad \sum_{j=1}^{Q} d_j = 1,\ \ d_j \ge 0 \qquad (8)$$
By multiple kernel functions, the bags are mapped from the original space to multiple feature spaces (RKHS) induced by the different kernel functions. The kernel function attains the inner product of two mapped bags in $\mathcal{H}$, $K(B_i, B_j) = \phi(B_i)\cdot\phi(B_j)$, without explicitly computing the mapped data. The optimization of Eq. (8) consists of two main steps: (a) solving a canonical SVM optimization problem with $d$ fixed, and (b) updating $d$ using the reduced gradient, taking into account the nonnegativity and normalization constraints on the kernel weights, with $\alpha$ optimized in the first step. MKL was originally designed for single instance learning; when multiple graph kernels are constructed by Eq. (2), MKL can be extended to solve the multi-instance learning problem by incorporating the resulting graph kernels. The multi-kernel multi-instance learning framework is shown in Fig. 7.
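The alternation described above can be sketched as follows with precomputed Gram matrices. The learning rate, the number of outer iterations and the crude clip-and-renormalise projection of $d$ onto the simplex are simplifications of the reduced-gradient update, so this is an illustrative approximation rather than the exact optimizer used in the paper.

```python
import numpy as np
from sklearn.svm import SVC

def simple_mkl(K_list, y, C=10.0, lr=0.1, n_outer=20):
    """Alternate between (a) an SVM on the combined precomputed kernel and
    (b) a gradient update of the kernel weights d restricted to the simplex."""
    Q, N = len(K_list), len(y)
    d = np.full(Q, 1.0 / Q)
    for _ in range(n_outer):
        K = sum(dq * Kq for dq, Kq in zip(d, K_list))      # combined Gram matrix
        svm = SVC(C=C, kernel="precomputed").fit(K, y)
        ay = np.zeros(N)
        ay[svm.support_] = svm.dual_coef_.ravel()          # alpha_i * y_i for support vectors
        # Gradient of the dual objective in Eq. (8) with respect to each d_q.
        grad = np.array([-0.5 * ay @ Kq @ ay for Kq in K_list])
        d = np.clip(d - lr * grad, 0.0, None)
        d = d / d.sum() if d.sum() > 0 else np.full(Q, 1.0 / Q)   # renormalise onto the simplex
    return d, svm
```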
Figure 7: Illustration of the proposed unified multi-kernel framework. In each feature space induced by kernel function $q$, the original training dataset $D_q$ is under-sampled to the reduced dataset $D_q'$ by the proposed BUS algorithm. Then, a graph is constructed for each bag. After the graphs are constructed, a graph kernel $K_q$ is calculated based on the graphs. BOS is conducted to generate new synthetic bags, and the graph kernel is transformed to the augmented $K_q'$. Finally, an optimal kernel is obtained by multi-kernel learning.
3. Experimental results and discussions
The experiments comprise vertical and horizontal comparisons. The vertical evaluation, in Experiments I, II and III, involves the comparison with baseline methods, the different kernel setting strategies of the multi-kernel learning framework, and the influence of the re-sampling ratio parameters of BUS and BOS. The horizontal evaluation, in Experiments IV and V, includes the comparison with the state-of-the-art methods for multi-instance learning and for DR diagnosis.

3.1. Experimental setup

The Messidor dataset is publicly available for studies on computer-aided diagnosis (CAD) of DR [32]. The dataset consists of 1,200 digital fundus images acquired by three French ophthalmology departments with a non-mydriatic digital color video 3CCD camera with a 45 degree field of view. Details of the grade information of the samples used in this paper are presented in Table 3. All 1,200 subjects available in the Messidor database were used for evaluation. We evaluate the generalization performance of all methods using 10 times two-fold cross-validation [33, 15] for all the comparable MIL methods, so as to ensure a fair comparison.
Table 3: The characteristics of the studied data from Messidor.

Category   L0    L1    L2    L3
Number     540   153   247   260
Following the settings of previous MKL studies, the candidate kernels are: Gaussian kernels with six different bandwidths $(2^{-2}, 2^{-1}, \ldots, 2^{3})$, polynomial kernels of degree 1 to 3, and a linear kernel, which yields 10 kernels in total. All these kernels are applied on all the features. Each base kernel matrix is normalized to unit trace.
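A sketch of building this kernel pool at the instance level is given below; the exact mapping from "bandwidth" to scikit-learn's gamma parameter is an assumption of the example.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel, linear_kernel

def build_base_kernels(X):
    """Candidate kernel pool: Gaussian kernels with bandwidths 2^-2 ... 2^3,
    polynomial kernels of degree 1-3 and a linear kernel, each trace-normalised."""
    kernels = []
    for power in range(-2, 4):                          # six Gaussian bandwidths
        sigma = 2.0 ** power
        kernels.append(rbf_kernel(X, gamma=1.0 / (2 * sigma ** 2)))
    for degree in (1, 2, 3):                            # three polynomial kernels
        kernels.append(polynomial_kernel(X, degree=degree))
    kernels.append(linear_kernel(X))                    # one linear kernel
    return [K / np.trace(K) for K in kernels]           # unit-trace normalisation
```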
Table 4: The tuning parameters in our proposed framework.

Notation   Explanation
C          The regularization parameter in MKL
M          The over-sampling ratio parameter in BOS
S          The under-sampling ratio parameter in BUS
The tuning parameters in our framework are shown in Table 4. For all the kernel methods, the regularization parameter C is chosen by nested cross-validation on the training data (trying the values 0.01, 0.1, 1, 10, 100). In BUS, the threshold value is S = 40%. In BOS, the over-sampling ratio parameter M is set to 100%, and the number of nearest neighbors is set to 10. All these values were chosen after some preliminary runs; they show good overall results but are not necessarily optimal for each classification task. All the experiments are carried out by means of 10-fold cross-validation: the dataset is split into ten folds, each containing 10% of the patterns, and for each fold the algorithm is trained with the examples contained in the remaining folds and then tested on the current fold. We use metrics such as accuracy, sensitivity and specificity to evaluate the performance of the learning algorithms. Since overall accuracy is not an appropriate evaluation measure for imbalanced data, the performance of each classifier is also reported in terms of G-mean and AUC. The G-mean is the geometric mean of the accuracies measured separately on each class; it is commonly utilized for imbalanced data, as both class accuracies are expected to be high simultaneously.
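For reference, the reported metrics can be computed from bag-level decision scores as in the following sketch; the {-1, +1} label encoding and the zero decision threshold are assumptions of the example.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_score, threshold=0.0):
    """Accuracy, sensitivity, specificity, G-mean and AUC from decision scores."""
    y_pred = np.where(np.asarray(y_score) >= threshold, 1, -1)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": np.sqrt(sensitivity * specificity),   # geometric mean of the class accuracies
        "auc": roc_auc_score(y_true, y_score),
    }
```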
label no-DR images as normal (no DR) and group images of other stages (L1, L2 and L3) as abnormal (DR), as in the works of [34, 35]. However, given that distinguishing normal images from images of stage 1 (L1) is the most difficult task for both CAD systems and experts, we also focus on this classification task to evaluate our method. Therefore, we follow a protocol similar to [16] and conduct two binary classification tasks (DR vs. no DR, and L0 vs. L1) to enable evaluation across datasets and prior studies. Due to the balanced distribution between DR and no DR, we only apply the proposed BOS algorithm to the second classification task (L0 vs. L1). To evaluate the performance of our proposed method in disease diagnosis, we compare our proposed methods with other baseline methods: supervised learning [3], mi-Graph [23], mkmi-Graph, and mkmi-Graph combined with re-sampling techniques (BUS, BOS or BOUS) for DR vs. non-DR and L0 vs. L1, respectively. Note that BOUS denotes the combination of BUS and BOS. For supervised learning, an ELM is trained using features extracted from the training set of e-ophtha and tested on all HMA candidates obtained from the entire Messidor dataset.
Table 5: The comparison between our method and the baseline methods for DR vs. non-DR (Note: values in parentheses indicate p-values and ∗ stands for p ≤ 0.05).

Method                       Accuracy          Sensitivity       Specificity       AUC
Supervised learning (ELM)    0.623 ± 0.009∗    0.993 ± 0.024∗    0.253 ± 0.019∗    0.591 ± 0.009∗
mi-Graph                     0.871 ± 0.015∗    0.868 ± 0.008∗    0.908 ± 0.024∗    0.898 ± 0.016∗
mkmi-Graph                   0.901 ± 0.023∗    0.887 ± 0.028∗    0.906 ± 0.014∗    0.912 ± 0.004∗
mkmi-Graph-BUS               0.921 ± 0.010     0.924 ± 0.018     0.915 ± 0.014     0.939 ± 0.009
Table 6: The comparison between our method and the baseline methods for L0 vs. L1 (Note: values in parentheses indicate p-values and ∗ stands for p ≤ 0.05).

Method                       Accuracy          Sensitivity       Specificity       G-mean            AUC
Supervised learning (ELM)    0.659 ± 0.026∗    0.769 ± 0.024∗    0.719 ± 0.019∗    0.743 ± 0.009∗    0.624 ± 0.025∗
mi-Graph                     0.864 ± 0.019∗    0.850 ± 0.008∗    0.894 ± 0.024∗    0.872 ± 0.016∗    0.909 ± 0.008∗
mkmi-Graph                   0.877 ± 0.012∗    0.863 ± 0.028∗    0.901 ± 0.014∗    0.882 ± 0.016∗    0.920 ± 0.017∗
mkmi-Graph-BUS               0.905 ± 0.014∗    0.859 ± 0.015∗    0.945 ± 0.007     0.900 ± 0.012∗    0.945 ± 0.009∗
mkmi-Graph-BOS               0.896 ± 0.012∗    0.904 ± 0.017∗    0.897 ± 0.011∗    0.900 ± 0.020∗    0.924 ± 0.010∗
mkmi-Graph-BOUS              0.916 ± 0.015     0.909 ± 0.018     0.933 ± 0.014∗    0.921 ± 0.013     0.957 ± 0.013
Experimental results are reported in Table 5 and Table 6. A first glance at the results shows that mkmi-Graph combined with the re-sampling algorithms generally outperforms all other compared methods on both metrics and across all classification tasks. Additionally, a statistical analysis is performed on the results.
Moreover, ROC curves from the cross-validated classifications are shown in Fig. 8, separately for each method. It is apparent that our proposed methods achieve higher AUC values than the other contender methods. The ELM trained on the e-ophtha dataset performs worse than the MIL methods; in particular, its specificity is low, which indicates that the distributions of the multiple data sources are different, and that training a classifier by simply borrowing labeled data from existing datasets results in false positive bags. In addition, we can observe that the multi-kernel framework is more powerful than the single kernel, which demonstrates the advantage of kernel learning for the mi-Graph method. As expected, the results show that BUS and BOS have a positive impact on the performance of MIL built on the imbalanced data.
3.3. Experiment II: The investigation of the weights of feature subsets

In our MIL framework, the suspicious HMA ROIs are characterized by $G$ heterogeneous feature subsets $x^{(1)}, \ldots, x^{(G)}$ covering color, gradient, shape and texture. The heterogeneous subsets make different contributions to the final classification performance. To investigate the importance weights of the multiple heterogeneous subsets, a base kernel is computed for each feature subset (representation). The graph kernel between bags is calculated as:
$$K_{q(g)}(B_i, B_j) = \frac{\sum_{a=1}^{n_i}\sum_{b=1}^{n_j} w_{q,a}^i\, w_{q,b}^j\, r_q^{ij}(x_{a(g)}^i, x_{b(g)}^j)}{\sum_{a=1}^{n_i} w_{q,a}^i \sum_{b=1}^{n_j} w_{q,b}^j} \qquad (9)$$
where $x_{a(g)}^i$ indexes the $g$-th feature subset of $x_a$ in $B_i$. The mixed $\ell_1$ norm in the formulation of our MKL enforces group sparsity among the different heterogeneous feature subsets, which in effect performs feature subset selection. In order to visualize the contribution of each feature type in our MKMIL, we plot the kernel weights of the base kernels of our multiple kernel learning for DR vs. no DR in Fig. 9, together with the associated variables. The weight values correspond to the contribution of each representation: kernels corresponding to discriminative feature subsets are assigned the highest weights. We found that the intensity features from the red channel and the Haralick texture features are informative and discriminative.
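A brief sketch of how such group-wise bag kernels might be assembled is given below; the column ranges of the feature groups are hypothetical, and base_kernel stands for any instance-level bag similarity such as the bag_kernel sketch given earlier.

```python
import numpy as np

# Hypothetical column ranges of the feature subsets within the 37-dimensional vectors.
FEATURE_GROUPS = {"color": slice(0, 16), "gradient": slice(16, 20),
                  "shape": slice(20, 24), "texture": slice(24, 37)}

def groupwise_bag_kernels(bags, base_kernel):
    """One bag-level Gram matrix per feature subset (Eq. 9), so that the learned
    MKL weights reveal the contribution of each representation.
    bags : list of (n_i, 37) arrays; base_kernel(Bi, Bj) computes a bag similarity."""
    grams = {}
    for name, cols in FEATURE_GROUPS.items():
        sub_bags = [B[:, cols] for B in bags]          # restrict instances to one subset
        N = len(sub_bags)
        K = np.array([[base_kernel(sub_bags[i], sub_bags[j]) for j in range(N)]
                      for i in range(N)])
        grams[name] = K / np.trace(K)                  # unit-trace normalisation
    return grams
```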
Figure 8: The ROC curves of the multiple competing methods: (a) DR vs. no DR; (b) L0 vs. L1.
Figure 9: The kernel weights of the heterogeneous feature subsets. Different colors indicate the different base kernel functions.
3.4. Experiment III: The influence of the parameters of BOUS

The optimal re-sampling ratios of BUS and BOS are unknown, and the re-sampling ratio parameters play a vital role in the performance of imbalanced data learning. In our research, we evaluated classifier performance using a variety of re-sampling ratios. Fig. 10 illustrates the performance with respect to AUC when tuning the two re-sampling ratio parameters. The choice of the re-sampling ratio parameters S and M is important for an imbalanced weakly supervised data system, because it determines the number of instances in each bag as well as the number of minority class bags, and it directly affects the final performance. We varied the threshold S ∈ {0, 0.2, 0.4, 0.6, 0.8} and M ∈ {0, 50%, 100%, 150%, 200%} and investigated the variation of performance over these values in Figure 10. From Figure 10, it can be found that the classification performance is best when S = 0.4 and M = 150%. Another important conclusion is that the effect of under-sampling is more significant than that of over-sampling for imbalanced weakly supervised data.

3.5. Experiment IV: The comparison with state-of-the-art MIL approaches with other assumptions

This experiment presents a comprehensive empirical investigation comparing the performance of several MIL techniques. Specifically, we compare the following methods with our proposed method: Iterated APR (IAPR) [36], EM-DD [37], Citation-KNN [38] and MI-SVM [39]. The experimental results in Table 7 demonstrate the benefit of mkmi-Graph with the $\ell_{2,1}$-norm compared to the other commonly used methods. At confidence level α = 0.05, the results in Table 7 show that the proposed method significantly outperforms the competing MIL methods.
Figure 10: The influence of the re-sampling ratio parameters of BUS and BOS on the MIL classification performance for L0 vs. L1.
The graph-structure-based MIL can reflect the relationships among the instances and, with this structural information, improves MIL classification compared with the other MIL methods that treat the instances in the bags as i.i.d.

Table 7: The comparison between our method and other MIL approaches for DR vs. non-DR (Note: values in parentheses indicate p-values and ∗ stands for p ≤ 0.05).

Method          Accuracy          Sensitivity       Specificity       AUC
iAPR            0.642 ± 0.009∗    0.819 ± 0.024∗    0.501 ± 0.019∗    0.627 ± 0.009∗
EM-DD           0.798 ± 0.009∗    0.813 ± 0.021∗    0.816 ± 0.017∗    0.874 ± 0.012∗
Citation-KNN    0.781 ± 0.011∗    0.750 ± 0.031∗    0.842 ± 0.022∗    0.826 ± 0.010∗
MI-SVM          0.735 ± 0.015∗    0.507 ± 0.014∗    0.980 ± 0.017     0.866 ± 0.008∗
mkmi-Graph      0.901 ± 0.023     0.887 ± 0.028     0.906 ± 0.014     0.912 ± 0.004
Moreover, the goal of the experiment carried out in this section is to show the validity of the re-sampling strategy for MIL methods. The results further demonstrate that jointly considering the two aspects of the proposed data re-sampling improves the classification performance of the MIL methods. Table 8 and Table 9 show the performance of each MIL algorithm incorporated with BUS for the two binary classification tasks. Table 10 shows the performance of each MIL algorithm incorporated with BOUS for the classification of L0 vs. L1.

Table 8: The comparison between our method and other MIL approaches combined with the BUS technique for DR vs. non-DR (Note: values in parentheses indicate p-values and ∗ stands for p ≤ 0.05).

Method              AUC
iAPR-BUS            0.675 ± 0.043∗
EM-DD-BUS           0.899 ± 0.034∗
Citation-KNN-BUS    0.882 ± 0.019∗
MI-SVM-BUS          0.909 ± 0.015∗
mkmi-Graph-BUS      0.939 ± 0.009
Table 9: The comparison between our method and other MIL approaches combined with the BUS technique for L0 vs. L1 (Note: values in parentheses indicate p-values and ∗ stands for p ≤ 0.05).

Method              AUC
iAPR-BUS            0.712 ± 0.057∗
EM-DD-BUS           0.892 ± 0.033∗
Citation-KNN-BUS    0.844 ± 0.015∗
MI-SVM-BUS          0.885 ± 0.017∗
mkmi-Graph-BUS      0.945 ± 0.009
Table 10: The comparison between our method and other MIL approaches combined with the BOUS technique for L0 vs. L1 (Note: values in parentheses indicate p-values and ∗ stands for p ≤ 0.05).

Method               G-mean            AUC
iAPR                 0.757 ± 0.0539∗   0.775 ± 0.038∗
EM-DD                0.897∗            0.905 ± 0.013∗
Citation-KNN         0.852∗            0.874 ± 0.019∗
MI-SVM               0.872∗            0.903 ± 0.023∗
mkmi-Graph           0.900 ± 0.017     0.920 ± 0.017
iAPR-BOUS            0.804∗            0.793 ± 0.043∗
EM-DD-BOUS           0.925∗            0.914 ± 0.031∗
Citation-KNN-BOUS    0.877∗            0.897 ± 0.022∗
MI-SVM-BOUS          0.911∗            0.909 ± 0.024∗
mkmi-Graph-BOUS      0.939 ± 0.009     0.957 ± 0.013
3.6. Experiment V: The comparison with state-of-the-art approaches for DR diagnosis

In recent years, new approaches have been developed for DR diagnosis directly at the image level. These algorithms are based on dense MIL [33, 15] or deep learning [40]. In the field of medical image analysis, MIL methods involve two different strategies: dense image representation and sparse image representation [14]. In dense MIL, the image is divided into a regular grid of small patches, each patch is regarded as an instance, and a feature vector is extracted from each patch. In contrast, our MIL method belongs to sparse MIL, treating candidate lesion ROIs as instances. Compared with the dense strategy, sparse MIL can identify useful pattern candidates and provide important image biomarkers. Table 11 shows the results of our method compared with previous studies involving a traditional supervised method [41], MIL [33, 15] and deep learning [34, 35]. From the results, we can see that we achieve the highest performance on the Messidor dataset. The proposed ROI-based sparse MIL method outperforms the patch-based dense MIL methods, which demonstrates that the suspicious HMAs regarded as instances can represent the bags more effectively and make the bag-level classification problem more discriminative. Furthermore, our method performs better than the two experts reported in [42].
Table 11: The comparison between our method and other approaches for DR diagnosis.

Method                         Accuracy        Sensitivity     Specificity     AUC
mkmi-Graph-BUS                 0.932           0.935           0.924           0.945
Quellec et al. (2012) [33]     Not reported    Not reported    Not reported    0.81
Kandemir et al. (2014) [15]    0.725           Not reported    Not reported    0.932
Expert A [42]                  Not reported    Not reported    Not reported    0.922
Expert B [42]                  Not reported    Not reported    Not reported    0.865
Tang et al. (2013) [41]        Not reported    Not reported    Not reported    0.870
Wang et al. (2017) [34]        0.905           Not reported    Not reported    0.921
Vo et al. (2016) [35]          0.858           0.916           0.803           0.862
Vo et al. (2016) [35]          0.871           0.882           0.857           0.870
4. Discussion

An automated detection system has the ability to screen a large number of patients in a short time and more objectively than an observer-driven technique. Through the experimental results, we find that several challenges exist in the data for automated grading of DR: weakly supervised data, imbalanced data distribution, high dimensionality, and the discrepancy between multiple datasets. These challenges also exist in the computer-aided diagnosis of other diseases. Our empirical results have demonstrated that our framework can address these issues, and it could be applied to the detection of many other diseases where the data are weakly supervised and imbalanced. To the best of our knowledge, our model is the first method to jointly learn from weakly supervised data and class imbalanced data. On the one hand, the proposed weakly supervised classifier can drastically reduce the annotation effort while keeping the prediction performance at an acceptable level. On the other hand, our bi-level hybrid re-sampling algorithm can improve the sensitivity to DR and eliminate the false positive ROI candidates. Next, we focus our discussion on the important components of the unified framework.

• 1. Multi-kernel learning framework
Although a linearized method can be optimized efficiently, it cannot directly capture the high-order statistics and underlying structures of complex data spaces. Kernel-based models enable us to capture nonlinear associations between the image features and the diagnostic outcome. As the kernel plays an essential role in the formulation, inappropriate kernels may not accurately capture the correlation structure of the data. The multi-kernel learning framework aims to solve the kernel selection problem in a principled way; therefore it provides a solution to obtain an optimally combined kernel representation from the candidate kernel functions.
In Experiment I, we compared with the single-kernel-based MIL method (mi-Graph); mkmi-Graph achieved better performance, which indicates the importance of the kernel learning procedure.

• 2. Sparse image representation scheme
In the sparse image representation of our mkmi-Graph, each instance represents one candidate lesion provided by the initial detection algorithm. In general, the detection step is a difficult task, and the dense image representation eliminates the need for a candidate detection step. However, as seen in Table 11, our sparse image representation performs better than the dense scheme. The limitations of dense MIL are that: 1) an appropriate patch size is difficult to determine; 2) many instances are background or other structures (e.g., vessels), since true MAs are small, resulting in many irrelevant instances; and 3) shape features cannot be extracted.
• 3. The bi-level hybrid re-sampling
We proposed a bi-level hybrid re-sampling scheme to re-balance the class distribution by over-sampling bags of the minority class (the disease class) and under-sampling the instances of the majority class (the false positive class). Common re-sampling operates at the instance level, which is not appropriate for multi-instance learning; we therefore extended the SMOTE algorithm to a bag-level version (BOS). Moreover, the class imbalanced distributions within bags and between bags occur at the same time, which presents an additional challenge, since the amount of the false positive class within a bag influences the ability of the BOS algorithm to balance the skewed class distribution at the bag level. The under-sampling at the instance level plays a critical role in MIL: the difficulty of discriminating between true and false positives hinders the MIL problem, and selecting and removing some irrelevant instances with high confidence prior to MIL classification helps improve the classification performance. However, under-sampling within bags is often ignored in multi-instance learning. Furthermore, the BUS algorithm can improve the performance of BOS. The results also suggest that data re-sampling at the bag level and at the instance level are complementary and need to be considered simultaneously during multi-instance learning on imbalanced data. From the experimental results in Tables 8, 9 and 10, we observe that both BUS and BOS can improve the performance of the traditional MIL methods, which demonstrates that taking imbalanced data learning into account can improve the prediction performance of MIL.
• 4. The interpretability and convexity of our model
Deep learning has become one of the most prominent machine learning techniques, being the state of the art in a broad range of applications where automatic feature extraction is needed, and many works have shown its strong representation and classification performance. Compared with deep learning models, the advantages of our model are: 1) Deep learning models are mostly black boxes, lacking the interpretation desired for medical image diagnosis. Our model can estimate the probability of each suspicious ROI, so as to provide the top response locations to doctors, and it can also provide the importance weights of the features. 2) Deep learning models converge slowly on ill-conditioned problems and may converge to poor local optima. Our mkmi-Graph can obtain an optimal solution due to its convexity. 3) Deep learning models require a large amount of labeled training data. The size of the Messidor dataset is not sufficient to train a sophisticated model; the limited available training data degrades the performance of deep learning models and leads to overfitting, since the provided training data are not sufficient for optimizing the parameters. From the results in Table 11, we can see that we achieve better performance than the deep learning algorithms proposed in [34][35], which demonstrates that the proposed multi-instance learning is more appropriate for the weakly supervised data.

5. Limitations and future directions

There are some shortcomings of the proposed method. 1) The drawback of the sparse representation is that instances missed by the candidate detection will be missed by the MIL algorithm as well. 2) The optimal re-sampling parameters are unknown in both the BOS and BUS algorithms, and the over-sampling ratio parameter plays a vital role in the performance of imbalanced data learning. Many over-sampling methods in the literature over-sample the minority class into a completely balanced training set; however, this is not an optimal way to generate synthetic instances. In future work, we will extend our method to multi-level grading [2] (a multi-class classification problem) and evaluate our method on a larger dataset (EyePACS). Moreover, we will design a better graph kernel, or other kernels, to capture more useful structural information of multi-instance bags.
6. Conclusion
Focusing on improving the performance of DR diagnosis when lesion information is unavailable, we formulated the problem of DR diagnosis as an MIL problem and presented a multi-kernel based multiple instance learning method for DR diagnosis. For the weakly supervised and imbalanced data in the Messidor dataset, we proposed to conduct multi-instance learning, under-sampling at the instance level and over-sampling at the bag level simultaneously in a joint MKL framework. Our proposed method achieved an accuracy of 0.916, a sensitivity of 0.909, a specificity of 0.933 and an AUC of 0.957. Through theoretical justifications and empirical studies, we demonstrated the effectiveness of the proposed MIL method on the performance of DR diagnosis both vertically and horizontally.

Acknowledgment
This research was supported by the National Natural Science Foundation of China (No. 61502091) and the Fundamental Research Funds for the Central Universities (No. N161604001).

References
[1] Y. Zheng, M. He, N. Congdon, The worldwide epidemic of diabetic retinopathy, Indian journal of ophthalmology 60 (5) (2012) 428.
Ac ce p
[2] M. U. Akram, S. Khalid, S. A. Khan, Identification and classification of microaneurysms for early detection of diabetic retinopathy, Pattern Recognition 46 (1) (2013) 107–116. [3] B. Antal, A. Hajdu, An ensemble-based system for microaneurysm detection and diabetic retinopathy grading, IEEE transactions on biomedical engineering 59 (6) (2012) 1720–1726. [4] A. K. Sjølie, J. Stephenson, S. Aldington, E. Kohner, H. Janka, L. Stevens, J. Fuller, B. Karamanos, C. Tountas, A. Kofinis, et al., Retinopathy and vision loss in insulin-dependent diabetes in europe: the eurodiab iddm complications study, Ophthalmology 104 (2) (1997) 252–260. [5] F. Ren, P. Cao, W. Li, D. Zhao, O. Zaiane, Ensemble based adaptive oversampling method for imbalanced data learning in computer aided detection of microaneurysm, Computerized Medical Imaging and Graphics 55 (2017) 54. 27
[6] S. Roychowdhury, D. D. Koozekanani, K. K. Parhi, DREAM: diabetic retinopathy analysis using machine learning, IEEE Journal of Biomedical and Health Informatics 18 (5) (2014) 1717–1728.
[7] U. R. Acharya, E. Y. Ng, J. H. Tan, S. V. Sree, K. H. Ng, An integrated index for the identification of diabetic retinopathy stages using texture parameters, Journal of Medical Systems 36 (3) (2012) 2011–2020.
[8] K. Ganesan, R. J. Martis, U. R. Acharya, C. K. Chua, L. C. Min, E. Y. Ng, A. Laude, Computer-aided diabetic retinopathy detection using trace transforms on digital fundus images, Medical & Biological Engineering & Computing 52 (8) (2014) 663–672.
[9] K. Xu, D. Feng, H. Mi, Deep convolutional neural network-based early automated detection of diabetic retinopathy using fundus image, Molecules 22 (12) (2017) 2054.
[10] K. Xu, Z. Li, R. Wang, L. Chang, Z. Yi, SU-F-J-04: Automated detection of diabetic retinopathy using deep convolutional neural networks, Medical Physics 43 (6Part8) (2016) 3406.
[11] H. Pratt, F. Coenen, D. M. Broadbent, S. P. Harding, Y. Zheng, Convolutional neural networks for diabetic retinopathy, Procedia Computer Science 90 (2016) 200–205.
[12] J. H. Tan, H. Fujita, S. Sivaprasad, S. V. Bhandary, A. K. Rao, C. C. Kuang, U. R. Acharya, Automated segmentation of exudates, haemorrhages, microaneurysms using single convolutional neural network, Information Sciences 420.
[13] J. H. Tan, U. R. Acharya, S. V. Bhandary, C. C. Kuang, S. Sivaprasad, Segmentation of optic disc, fovea and retinal vasculature using a single convolutional neural network, Journal of Computational Science 20 (2017) 70–79.
[14] G. Quellec, G. Cazuguel, B. Cochener, M. Lamard, Multiple-instance learning for medical image and video analysis, IEEE Reviews in Biomedical Engineering PP (99) (2017) 1–1.
[15] M. Kandemir, F. A. Hamprecht, Computer-aided diagnosis from weak supervision: a benchmarking study, Computerized Medical Imaging and Graphics 42 (2015) 44–50.
[16] B. Antal, A. Hajdu, An ensemble-based system for automatic screening of diabetic retinopathy, Knowledge-Based Systems 60 (2014) 20–27.
[17] M. U. Akram, S. Khalid, A. Tariq, S. A. Khan, F. Azam, Detection and classification of retinal lesions for grading of diabetic retinopathy, Computers in biology and medicine 45 (2014) 161–171.
[18] P. Cao, X. Liu, J. Zhang, D. Zhao, M. Huang, O. Zaiane, ℓ2,1 norm regularized multi-kernel based joint nonlinear feature selection and over-sampling for imbalanced data classification, Neurocomputing 234 (C) (2017) 38–57.
[19] H. He, E. A. Garcia, Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering 21 (9) (2009) 1263–1284.
[20] N. Eftekhari, H. Pourreza, E. Saeedi, Microaneurysm detection in fundus images using a two-step convolutional neural networks, arXiv preprint arXiv:1710.05191.
[21] B. Dai, X. Wu, W. Bu, Retinal microaneurysms detection using gradient vector analysis and class imbalance classification, PLoS One 11 (8) (2016) e0161556.
[22] P. Cao, X. Liu, J. Yang, D. Zhao, W. Li, M. Huang, O. Zaiane, A multi-kernel based framework for heterogeneous feature selection and over-sampling for computer-aided detection of pulmonary nodules, Pattern Recognition 64 (C) (2016) 327–346.
[23] Z.-H. Zhou, Y.-Y. Sun, Y.-F. Li, Multi-instance learning by treating instances as non-i.i.d. samples, in: Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009, pp. 1249–1256.
[24] S. J. Pan, Q. Yang, A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering 22 (10) (2010) 1345–1359.
[25] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (1) (2002) 321–357.
[26] B. Gu, V. S. Sheng, A robust regularization path algorithm for ν-support vector classification, IEEE Transactions on Neural Networks and Learning Systems 28 (5) (2017) 1241.
[27] B. Gu, V. S. Sheng, K. Y. Tay, W. Romano, S. Li, Incremental support vector learning for ordinal regression, IEEE Transactions on Neural Networks and Learning Systems 26 (7) (2015) 1403.
[28] R. Acharya, C. K. Chua, E. Ng, W. Yu, C. Chee, Application of higher order spectra for the identification of diabetes retinopathy stages, Journal of Medical Systems 32 (6) (2008) 481–488.
[29] M. R. K. Mookiah, U. R. Acharya, R. J. Martis, C. K. Chua, C. M. Lim, E. Ng, A. Laude, Evolutionary algorithm based classifier parameter tuning for automatic diabetic retinopathy grading: A hybrid feature extraction approach, Knowledge-Based Systems 39 (2013) 9–22.
[30] A. Rakotomamonjy, F. R. Bach, S. Canu, Y. Grandvalet, SimpleMKL, Journal of Machine Learning Research 9 (Nov) (2008) 2491–2521.
[31] E. Decencière, X. Zhang, G. Cazuguel, B. Lay, B. Cochener, C. Trone, P. Gain, R. Ordonez, P. Massin, A. Erginay, Feedback on a publicly distributed image database: the Messidor database, Image Analysis & Stereology 33 (3) (2014) 231–234.
[32] T.-V. MESSIDOR, Messidor: methods to evaluate segmentation and indexing techniques in the field of retinal ophthalmology. 2014, Available on: http://messidor.crihan.fr/index-en.php, Accessed: October 9.
[33] G. Quellec, M. Lamard, M. D. Abràmoff, E. Decencière, B. Lay, A. Erginay, B. Cochener, G. Cazuguel, A multiple-instance learning framework for diabetic retinopathy screening, Medical Image Analysis 16 (6) (2012) 1228–1240.
[34] Z. Wang, Y. Yin, J. Shi, W. Fang, H. Li, X. Wang, Zoom-in-net: Deep mining lesions for diabetic retinopathy detection, arXiv preprint arXiv:1706.04372.
[35] H. H. Vo, A. Verma, New deep neural nets for fine-grained diabetic retinopathy recognition on hybrid color space, in: Multimedia (ISM), 2016 IEEE International Symposium on, IEEE, 2016, pp. 209–215.
[36] T. G. Dietterich, R. H. Lathrop, T. Lozano-Pérez, Solving the multiple instance problem with axis-parallel rectangles, Artificial Intelligence 89 (1) (1997) 31–71.
[37] Q. Zhang, S. A. Goldman, et al., EM-DD: An improved multiple-instance learning technique, in: NIPS, Vol. 1, 2001, pp. 1073–1080.
[38] J. Wang, J.-D. Zucker, Solving multiple-instance problem: A lazy learning approach.
[39] S. Andrews, I. Tsochantaridis, T. Hofmann, Support vector machines for multiple-instance learning, Advances in neural information processing systems (2003) 577–584.
[40] G. Quellec, K. Charrière, Y. Boudi, B. Cochener, M. Lamard, Deep image mining for diabetic retinopathy screening, Medical Image Analysis 39 (2017) 178–193.
[41] L. Tang, M. Niemeijer, J. M. Reinhardt, M. K. Garvin, M. D. Abràmoff, Splat feature classification with application to retinal hemorrhage detection in fundus images, IEEE Transactions on Medical Imaging 32 (2) (2013) 364–375.
[42] C. I. Sánchez, M. Niemeijer, A. V. Dumitrescu, M. S. Suttorp-Schulten, M. D. Abràmoff, B. van Ginneken, Evaluation of a computer-aided diagnosis system for diabetic retinopathy screening on public data, Investigative Ophthalmology & Visual Science 52 (7) (2011) 4866.
Research Highlights

1. A sparse multi-kernel multi-instance learning method is proposed for diabetic retinopathy diagnosis.

2. An under-sampling at the instance level and an over-sampling at the bag level are proposed to improve the performance of the multi-instance learning in the diagnosis of DR.

3. We have conducted extensive experiments to investigate our proposed methods. The proposed method achieves an overall classification accuracy of 0.916 and an AUC of 0.957, and it could be applied to the diagnosis of many other diseases where the data are weakly supervised.