A batch-mode active learning framework by querying discriminative and representative samples for hyperspectral image classification

Neurocomputing 179 (2016) 88–100

Zengmao Wang a, Bo Du a,*, Lefei Zhang a,*, Liangpei Zhang b

a State Key Laboratory of Software Engineering, School of Computer, Wuhan University, Wuhan 430079, China
b Collaborative Innovation Center of Geospatial Technology, State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China

* Corresponding authors. E-mail addresses: [email protected] (Z. Wang), [email protected] (B. Du), [email protected] (L. Zhang), [email protected] (L. Zhang).

This work was supported in part by the National Basic Research Program of China (973 Program) under Grant 2012CB719905, the National Natural Science Foundation of China under Grants 61471274, 61401317 and 41431175, and the Natural Science Foundation of Hubei Province under Grant 2014CFB193.

http://dx.doi.org/10.1016/j.neucom.2015.11.062

Article history: Received 5 September 2015; Received in revised form 3 November 2015; Accepted 20 November 2015; Available online 21 December 2015. Communicated by Lu Xiaoqiang.

Abstract

Batch-mode active learning (AL) approaches are dedicated to training sample set selection for classification, where a batch of unlabeled samples is queried at each iteration. The current state-of-the-art AL techniques exploit different query functions, which are mainly based on the evaluation of two criteria: uncertainty and diversity. Generally, the two criteria are independent of each other, and they cannot guarantee that the newly queried samples are independent and identically distributed (i.i.d.) with respect to the unknown source distribution. To solve this problem, a novel upper bound for the true risk in the active learning setting is derived, and minimizing this upper bound measures the discriminative information, which is connected with the uncertainty. For the distribution match, the proposed method adopts the maximum mean discrepancy (MMD) to constrain the distribution of the labeled samples and make it as similar to the overall sample distribution as possible, which helps capture the representative information of the data structure. In the proposed framework, the formulation defined for binary classes is generalized to the multiclass problem, and the discriminative and representative (DR) information are combined together. In this way, our method queries the most informative samples while preserving the source distribution as much as possible, thus identifying the most uncertain and representative queries. Meanwhile, the number of newly queried samples is adaptive and depends on the distribution of the labeled samples. In the experiments, we employed two benchmark remote sensing datasets, the Indian Pines and Washington DC datasets, and the results confirmed the superior performance of the proposed framework compared with the other state-of-the-art AL methods. © 2015 Elsevier B.V. All rights reserved.

Keywords: Active learning; Hyperspectral image; Image classification; Remote sensing; Discriminative and representative; Maximum mean discrepancy

1. Introduction

Machine learning algorithms have become powerful tools for the extraction of information from data in the different fields of data mining, pattern recognition, computer vision, and remote sensing [1–7,54], and advances in remote sensing technology have made hyperspectral data with hundreds of narrow contiguous bands available. Hyperspectral image (HSI) processing with machine learning methods has been widely studied in the past decade [8–14]. HSI classification is one of the important tasks used to extract environmental information from remote sensing images and has been an active field in current HSI processing [15–20]. To fully utilize the information in remote sensing images, many different machine learning algorithms have been developed to classify the data [21–23]. Supervised classification is the main technique, which requires the availability of labeled samples for training the classifiers. Given a specific supervised classifier, the remote sensing images can be automatically classified. However, supervised classifiers are highly dependent on the amount and quality of the training samples [24]. Therefore, collecting samples of good quality (e.g., informative and non-redundant) is vital. Manually selecting regions of interest in the HSI as the training samples is a common approach, but this procedure is very expensive in most real-world applications. As HSIs have very high dimensionality, it is more difficult to design classifiers using only a few labeled data points than with a multispectral image [11]. This paper is focused on HSI classification with a few labeled data points. Two popular machine learning approaches have been developed to solve this problem: semi-supervised learning and active learning (AL). Semi-supervised algorithms incorporate the unlabeled samples and the labeled samples to find a classifier with better boundaries [25–27].


An overview of the semi-supervised classification techniques can be found in [12]. In contrast, AL assumes that a primary classifier trained with a small amount of labeled samples exists. AL is based on iteration and can provide better classification results with a small number of unlabeled samples. The AL methods follow an iterative process. In each iteration, the most informative unlabeled samples are chosen for manual labeling. In this way, the unnecessary and redundant labeling of non-informative samples is avoided, greatly reducing the labeling cost and time. Moreover, AL allows one to reduce the computational complexity of the training phase. Batch-mode active learning is expected to be more suitable for hyperspectral image classification, as a batch of unlabeled samples is queried at each iteration, which increases the speed of the sample selection and reduces the number of iterations [28]. The best outcome for batch-mode AL is to select the most informative batch of samples with as little redundancy as possible, so that they provide the most uncertainty information to the classifier. At the same time, batch-mode AL can also increase the speed of the sample selection and reduce the number of iterations [29]. Querying the unlabeled samples generally involves two main phases: uncertainty and diversity [30–32]. The first phase queries the most informative samples with the uncertainty criterion; however, very similar samples may exist among them, and querying just one of them is enough, so it is necessary to remove this redundancy. The second phase therefore uses the diversity criterion to reduce the redundancy among the samples queried in the first phase.

There has been a large amount of research into the uncertainty criterion. The conventional uncertainty criteria of batch-mode AL can be grouped into three categories: 1) query by committee, in which the uncertainty of an unlabeled sample is measured by the disagreement of several classifiers [33–35]; 2) posterior probability based methods, where the posterior probability is used to measure the uncertainty of the candidates [36,37]; and 3) large margin heuristic based methods, where the uncertainty of the candidates is measured by the distance to the margin of a classifier such as the support vector machine (SVM) [38,39]. However, less attention has been paid to the diversity criterion in the current research. The diversity criteria are mainly clustering algorithms, such as k-means [40] and its kernel version [41], which depend on the correctness of the convergence and are usually influenced by the adequacy of the initialization [42]. Moreover, these algorithms have to be given the number of cluster centers beforehand. Thus, the data queried by such methods are not guaranteed to be i.i.d. samples from the original data distribution, as they are selectively sampled based on the AL criterion [43]. At the same time, they do not fully use the label information, and they divide the uncertainty and the diversity criteria into two steps. In fact, using either kind of criterion alone may not be sufficient to obtain optimal results. This paper proposes a new diversity criterion, extends the empirical risk minimization principle to the AL case, and presents a novel AL framework. This framework adopts the maximum mean discrepancy (MMD) to measure the distribution difference and derives an empirical upper bound for the AL risk. By minimizing this upper bound, it approximately minimizes the true risk under the original data distribution. The proposed framework queries the unlabeled samples using both discriminative and representative information within one optimization formulation. Our goal is to query a subset of unlabeled samples that minimizes the combined discriminative and representative objective.

The contributions of this manuscript can be summarized as follows: (1) In the proposed framework, the MMD is adopted, so that the queried samples are not only diverse but also preserve the distribution of the original data. This strategy can rapidly reduce the empirical risk on the training data. (2) With the discriminative and representative information in one optimization formulation, a trade-off is undertaken by a weight parameter, and the queried samples can contain both discriminative and representative information. (3) The proposed method is suitable for multiclass problems, and the number of queried samples is adaptive. Furthermore, only the most uncertain samples are selected in the preparation procedure, so the proposed method can be applied to large-scale data. The remainder of this paper is organized as follows. Section 2 presents the recent research into batch-mode AL in remote sensing image classification. Section 3 formulates the proposed batch-mode AL framework. Section 4 describes the experiments with two benchmark hyperspectral datasets, the Indian Pines and Washington DC datasets, and presents the experimental results in comparison with the other state-of-the-art batch-mode AL methods. Finally, Section 5 summarizes the paper.

2. Related work

2.1. The framework of conventional batch-mode active learning

The conventional AL methods can be modeled as a quintuple (F, Q, D, T, U) [44], where F is a supervised classifier which is trained on the training dataset T, Q is the query function used to select the most informative unlabeled samples from a pool U of unlabeled samples, and D is a supervisor that can correctly label the batch of the most informative samples queried by Q. AL is an iterative process, in which the supervisor D labels the most informative samples queried by the query function Q. For the first iteration, an initial training set T with a few labeled samples is used for training the classifier F. After the first iteration, a batch of samples X is selected from the unlabeled sample pool U by the query function Q and is assigned class labels by the supervisor D. The newly labeled batch of samples X is then added into the training set T, and the classifier F is retrained using the updated training set T. The pool U of unlabeled samples is updated, i.e., the newly labeled batch of samples X is removed from the unlabeled samples. The closed loop of querying and retraining continues for a predefined number of iterations or until a stopping criterion is satisfied. The general batch-mode AL process is described in Algorithm 1.

Algorithm 1. The process of AL
1. Initialize the training set T and the pool U of unlabeled samples;
2. Train the classifier F using the training set T;
3. Classify the pool U of unlabeled samples with the classifier F;
Repeat
4. Query the most informative batch of unlabeled samples X from the pool U of unlabeled samples using the query function Q;
5. Assign class labels to the newly queried samples X by the supervisor D;
6. Update the training set T and the pool U of unlabeled samples, i.e., add the newly labeled batch of the most informative samples X into the training set T and remove them from the unlabeled sample pool U;
7. Retrain the classifier F with the updated training set T;
8. Classify the updated pool U of unlabeled samples with the new classifier F;
Until a stopping criterion is satisfied.


2.2. The query function in batch-mode active learning

The query function is very important in the batch-mode AL methods, and a number of batch-mode AL methods have been proposed in the machine learning literature [28,45]. In the conventional AL methods, the query function can be grouped into two steps: 1) the uncertainty criterion; and 2) the diversity criterion.

2.2.1. Uncertainty criterion
The uncertainty criterion aims to select the unlabeled samples which have the maximum uncertainty for the current classifier from the unlabeled sample pool U. Since the most uncertain samples have the lowest probability of being correctly classified by the classification model, they are the most useful to include in the training set. In this paper, we investigate two techniques based on the popular SVM classifier: 1) margin sampling (MS) [45]; and 2) multiclass-level uncertainty (MCLU) [28]. The two uncertainty criteria are briefly described below.

MS: The MS heuristic takes advantage of the distance to the hyperplane in the SVM, and builds a linear decision function in the high-dimensional feature space H, where the samples are more likely to be linearly separable. Consider the decision function of the two-class SVM:

f(q_i) = \mathrm{sign}\left( \sum_{j=1}^{n} y_j \alpha_j K(x_j, q_i) \right)    (1)

The candidate included in the training set is therefore the one that respects the condition:

\hat{x} = \arg\min_{q_i \in U} |f(q_i)|    (2)

MCLU: The adopted MCLU technique selects the most uncertain samples according to a confidence value c(x), x ∈ U, which is defined on the basis of the functional distances f_i(x), i = 1, 2, …, n, to the n decision boundaries of the binary SVM classifiers included in the one-against-all (OAA) architecture. In this technique, the distance of each sample x ∈ U to each hyperplane is calculated, and a set of n distance values {f_1(x), f_2(x), …, f_n(x)} is obtained. The confidence value c(x) can then be calculated using different strategies [38]. In this paper, the difference function c_diff(x) strategy is used, which considers the difference between the first and second largest distance values to the hyperplanes:

r_{1\max} = \max_{i = 1, 2, \ldots, n} f_i(x)
r_{2\max} = \max_{j = 1, 2, \ldots, n,\ j \neq i_{1\max}} f_j(x)
c_{\mathrm{diff}}(x) = r_{1\max} - r_{2\max}    (3)

where i_{1\max} denotes the index of the class giving r_{1\max}.

2.2.2. Diversity criterion
The diversity criterion is used to reduce the redundancy in the unlabeled samples which are queried with the uncertainty criterion. In the current research, clustering algorithms such as kernel k-means are used as advanced diversity criteria [46]. The kernel k-means method is applied in the kernel space corresponding to the SVM, so that it can select the more informative samples for the classifier in the kernel space, rather than in the original feature space. The main step of kernel k-means is to calculate the distance between a sample and a cluster center in the kernel space [28]:

D^2(\phi(x_i), \phi(\mu_v)) = \|\phi(x_i) - \phi(\mu_v)\|^2 = \left\| \phi(x_i) - \frac{1}{|C_v|} \sum_{j=1}^{m} \delta(\phi(x_j), C_v) \phi(x_j) \right\|^2
= K(x_i, x_i) - \frac{2}{|C_v|} \sum_{j=1}^{m} \delta(\phi(x_j), C_v) K(x_i, x_j) + \frac{1}{|C_v|^2} \sum_{j=1}^{m} \sum_{l=1}^{m} \delta(\phi(x_j), C_v) \delta(\phi(x_l), C_v) K(x_j, x_l)    (4)

where δ(φ(x_j), C_v) is the indicator function: δ(φ(x_j), C_v) = 1 only if x_j is assigned to cluster C_v, and δ(φ(x_j), C_v) = 0 otherwise, and |C_v| denotes the total number of samples in cluster C_v.

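To make these two query components concrete, the following minimal NumPy sketch computes the MCLU confidence value c_diff(x) from the decision values of a one-against-all SVM, and the kernel-space distance of Eq. (4) used by kernel k-means. The function and variable names are our own illustrative choices, not code from the referenced methods.

import numpy as np

def mclu_confidence(decision_values):
    # decision_values: array of shape (n_samples, n_classes) holding the
    # functional distances f_i(x) of each sample to the OAA hyperplanes.
    # c_diff(x) = r_1max - r_2max (difference of the two largest distances).
    sorted_vals = np.sort(decision_values, axis=1)
    return sorted_vals[:, -1] - sorted_vals[:, -2]

def kernel_kmeans_distance(K, i, cluster_idx):
    # Squared kernel-space distance between sample i and the centre of the
    # cluster whose member indices are listed in cluster_idx (Eq. (4)).
    # K is the full kernel matrix with K[p, q] = k(x_p, x_q).
    m = len(cluster_idx)
    term1 = K[i, i]
    term2 = 2.0 / m * K[i, cluster_idx].sum()
    term3 = K[np.ix_(cluster_idx, cluster_idx)].sum() / (m ** 2)
    return term1 - term2 + term3

With an OAA-style SVM, decision_values can be taken from the classifier's decision function; the samples with the smallest c_diff values are the most uncertain ones.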
3. The proposed batch-mode active learning framework

In this paper, we combine the discriminative and representative information into one optimization formulation as the diversity criterion, and select MS and MCLU as the uncertainty criteria. In the proposed method, the number of queried samples is adaptive. Meanwhile, the queried samples have the same distribution as the original data, and the queried samples are independent and identically distributed (i.i.d.). In the conventional batch-mode AL methods, the query function is divided into two steps, the uncertainty criterion and the diversity criterion, and the two steps are independent. The foremost difference between the proposed method and the other state-of-the-art AL methods lies in the second step, which here also considers the uncertainty information, measured by the discriminative information. The proposed diversity criterion combines the discriminative and representative information together, and it guarantees the uncertainty and diversity of the queried samples by making sure that the queried unlabeled samples are i.i.d. with respect to the original data. In this way, all the labeled samples are used.

Suppose we have an HSI with n samples D = {x_1, x_2, …, x_n} of d dimensions. Firstly, l initial training samples are randomly selected from the ground truth: L = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l)}, y_i ∈ {1, 2, …, N_c}, where N_c is the number of classes. The remaining u = n − l samples form the unlabeled set U = {x_{l+1}, x_{l+2}, …, x_{l+u}}, which is the candidate dataset for the AL. In the following, we discuss the discriminative and representative criteria in detail.

3.1. Discriminative information determined by the uncertainty of minimum margin

In this paper, the least squares loss function is used for choosing the discriminative information. It represents the discriminative part of the proposed method, and it also measures the uncertainty information. For the binary class problem:

\max_{Y_j:\,\forall x_j \in Q}\ \min_{Q, f}\ \sum_{x_i \in L} (Y_i - f(x_i))^2 + \sum_{x_j \in Q} (Y_j - f(x_j))^2 + \lambda \|f\|_F^2    (5)

In (5), Y_i ∈ {1, −1}, f is the classifier, and ‖f‖_F^2 is used to constrain the complexity of the classifier class. It is known that the classification of remote sensing images is a multiclass problem, but this can be split into several binary problems, and for each binary problem the informative samples can be queried by (5). The problem is therefore generalized to a multiclass problem, which is organized into several binary problems. The optimal multiclass formula is obtained as follows:

\min \sum_{c=1}^{N_c} \left\{ \max_{Y_j:\,\forall x_j \in Q_c}\ \min_{Q_c, f_c}\ \sum_{x_i \in L} (Y_i - f_c(x_i))^2 + \sum_{x_j \in Q_c} (Y_j - f_c(x_j))^2 + \lambda \|f_c\|_F^2 \right\}    (6)

where, for x_i ∈ L, Y_i = 1 if y_i = c and Y_i = −1 otherwise; Y_j = sign(f(x_j)) is a pseudo label; and Q_c is the query sample set obtained under the binary problem between class c and the other classes. Formula (6) can be solved independently over the binary problems. For any classifier f_c, (6) identifies the samples with the minimum margin summation [47], given by

\min_{Q}\ \sum_{x_i \in Q} |f(x_i)|    (7)

We now use a linear regression model in the kernel space as the classifier, which has the form f(x) = w^T φ(x), with the feature mapping φ(x). The objective function (6) becomes:

\min \sum_{c=1}^{N_c} \left\{ \min_{Q_c, w_c}\ \sum_{x_i \in L} (Y_i - w_c^T \phi(x_i))^2 + \sum_{x_j \in Q_c} \left[ (w_c^T \phi(x_j))^2 - 2 w_c^T \phi(x_j) \right] + \lambda \|w_c\|^2 \right\}    (8)
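As an illustration of the base learner assumed above, the sketch below fits the kernel-space linear regression classifier f(x) = w^T φ(x) on the labeled set through the representer theorem (w = Σ_i c_i φ(x_i)) and derives the pseudo labels Y_j = sign(f(x_j)) for the unlabeled candidates. It is a minimal sketch under an assumed RBF kernel, not the authors' implementation.

import numpy as np

def rbf_kernel(A, B, gamma):
    # k(a, b) = exp(-gamma * ||a - b||^2)
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

def fit_kernel_regression(X_lab, Y_lab, gamma, lam):
    # Regularized least squares in the kernel space: with w = sum_i c_i phi(x_i),
    # min_c ||Y - K c||^2 + lam * c^T K c  gives  c = (K + lam I)^{-1} Y.
    K = rbf_kernel(X_lab, X_lab, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X_lab)), Y_lab)

def decision_values(X_lab, c, X_new, gamma):
    # f(x) = w^T phi(x) = sum_i c_i k(x_i, x)
    return rbf_kernel(X_new, X_lab, gamma) @ c

# Pseudo labels for the unlabeled candidates, as used in Eq. (6):
# Y_pseudo = np.sign(decision_values(X_lab, c, X_unlab, gamma))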

3.2. Representative information determined by the maximum mean discrepancy (MMD)

The representative part of the proposed method is the MMD term. The MMD is used to constrain the distribution of the labeled and queried samples and make it as similar to the overall sample distribution as possible. It captures the representative information of the data structure in the proposed method. According to [48], the MMD criterion can be expressed as

\min_{\alpha}\ \frac{1}{2} \alpha^T K_{UU} \alpha + \frac{u-b}{n} 1_l^T K_{LU} \alpha - \frac{l+b}{n} 1_u^T K_{UU} \alpha    (9)

In (9), 1_l and 1_u are vectors of length l and u, respectively, with all entries equal to 1; α is the indicator vector with u elements, each element α_i ∈ {0, 1}; α^T 1_u = b, where b is the query number for the binary problem. K is the kernel matrix with elements K_ij = k(x_i, x_j) = φ(x_i)^T φ(x_j), and K_AB denotes its sub-matrix between the samples from sets A and B. The objective function (9) can be further simplified as:

\min_{\alpha^T 1_u = b}\ \alpha^T K_1 \alpha + k \alpha    (10)

where

K_1 = \frac{1}{2} K_{UU}, \quad k = k_3 - k_2, \quad \forall x_i \in U:\ \ k_3(i) = \frac{u-b}{n} \sum_{x_j \in L} K(i, j), \quad k_2(i) = \frac{l+b}{n} \sum_{x_j \in U} K(i, j)    (11)
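To make the construction in (10) and (11) concrete, the short sketch below assembles K_1 and k from precomputed kernel matrices over the labeled and unlabeled samples; it is a direct restatement of the formulas with our own variable names, not the authors' code.

import numpy as np

def mmd_terms(K_UU, K_LU, b):
    # K_UU: (u, u) kernel matrix among the unlabeled samples.
    # K_LU: (l, u) kernel matrix between labeled and unlabeled samples.
    # b: number of samples to query for the binary problem.
    l, u = K_LU.shape
    n = l + u
    K1 = 0.5 * K_UU                          # quadratic term of Eq. (11)
    k3 = (u - b) / n * K_LU.sum(axis=0)      # similarity to the labeled set
    k2 = (l + b) / n * K_UU.sum(axis=0)      # similarity to the unlabeled pool
    return K1, k3 - k2                       # k = k3 - k2 is the linear term

# For an indicator vector alpha over U, the MMD surrogate of Eq. (10) is
# alpha @ K1 @ alpha + k @ alpha, subject to alpha.sum() == b.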

3.3. The proposed discriminative and representative criterion

Combining the discriminative and representative parts into one optimization formulation, a weight β is used to balance the discriminative term and the representative term. Firstly, we build the optimization formula for the binary problem and then develop it into a multiclass problem. Therefore, we can put (5) and (10) together and obtain the binary class optimization formulation:

\min_{w,\ \alpha^T 1_u = b}\ \sum_{x_i \in L} (Y_i - w^T \phi(x_i))^2 + \sum_{j=1}^{u} \alpha_j \left[ (w^T \phi(x_j))^2 - 2 w^T \phi(x_j) \right] + \lambda \|w\|^2 + \beta (\alpha^T K_1 \alpha + k \alpha)

As with (6), we develop it into a multiclass problem:

\min \sum_{c=1}^{N_c} \left\{ \min_{w_c,\ \alpha_c^T 1_u = b}\ \sum_{x_i \in L} (Y_i - w_c^T \phi(x_i))^2 + \sum_{j=1}^{u} \alpha_{cj} \left[ (w_c^T \phi(x_j))^2 - 2 w_c^T \phi(x_j) \right] + \lambda \|w_c\|^2 + \beta_c (\alpha_c^T K_1 \alpha_c + k_c \alpha_c) \right\}    (12)

Formula (12) is the main part of the proposed method, and it is clear that this problem is not convex. An alternating optimization strategy is employed to solve it [49]. Since the multiclass problem is formed by many binary problems, (12) can be solved by a binary model. For a binary problem, if the query index α is fixed, the objective is to find the best classifier based on the current labeled and queried samples:

\min_{w}\ \sum_{i=1}^{l} (y_i - w^T \phi(x_i))^2 + \lambda \|w\|^2 + \sum_{j=1}^{b} \left[ (w^T \phi(x_j))^2 - 2 w^T \phi(x_j) \right]    (13)

Formula (13) can be solved by the alternating direction method of multipliers (ADMM) [50]. If w is fixed, the objective becomes:

\min_{\alpha}\ \sum_{i=1}^{u} \alpha_i \left[ (w^T \phi(x_i))^2 - 2 w^T \phi(x_i) \right] + \beta (\alpha^T K_1 \alpha + k \alpha)    (14)

which can be rewritten as:

\min_{\alpha^T 1_u = b}\ \beta \alpha^T K_1 \alpha + (\beta k + a) \alpha    (15)

where a_i = (w^T φ(x_i))^2 − 2 w^T φ(x_i), and (15) is a standard quadratic programming problem. If we relax α to continuous values in the range [0, 1]^u, the b samples in the pool U corresponding to the largest b elements of α are selected. Algorithm 2 shows the process.

Algorithm 2. The process of the discriminative and representative criterion
Input:
1. Initialize the training set L with l labeled samples, and the unlabeled set U with u samples;
2. Initialize the parameters λ, β and the batch size b;
Repeat
3. Step 1: optimize the objective function (13) with respect to w using ADMM;
4. Step 2: optimize the objective function (14) with respect to α using quadratic programming (QP);
Until a stopping criterion is satisfied;
5. The b samples in the pool U corresponding to the largest b elements of α are selected.
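A compact sketch of this alternation is given below. For readability, the w-step of (13) is solved in closed form over the kernel coefficients, using the identity (w^T φ(x_j))^2 − 2 w^T φ(x_j) = (w^T φ(x_j) − 1)^2 − 1, instead of the ADMM solver used in the paper, and the relaxed α-step of (15) is handled with SciPy's SLSQP optimizer. Both solver choices and all names are our own simplifications for illustration only.

import numpy as np
from scipy.optimize import minimize

def solve_w_step(K_all, lab_idx, y_lab, query_idx, lam):
    # Eq. (13): since (f)^2 - 2f = (f - 1)^2 - 1, the w-step is a regularized
    # least-squares fit with targets y on L and target 1 on the queried set.
    # K_all is the kernel matrix over all samples; w is kept implicitly through
    # coefficients c with f(x) = sum_i c_i k(x_i, x).
    idx = np.concatenate([lab_idx, query_idx]).astype(int)
    t = np.concatenate([y_lab, np.ones(len(query_idx))])
    K = K_all[np.ix_(idx, idx)]
    c = np.linalg.solve(K + lam * np.eye(len(idx)), t)
    return K_all[:, idx] @ c            # decision values f(x) for every sample

def solve_alpha_step(f_unlab, K1, k, beta, b):
    # Relaxed Eq. (15): min beta * alpha^T K1 alpha + (beta*k + a) @ alpha,
    # s.t. sum(alpha) = b, 0 <= alpha <= 1; keep the b largest entries.
    a_vec = f_unlab ** 2 - 2 * f_unlab
    u = len(f_unlab)

    def obj(alpha):
        return beta * alpha @ K1 @ alpha + (beta * k + a_vec) @ alpha

    cons = [{"type": "eq", "fun": lambda alpha: alpha.sum() - b}]
    res = minimize(obj, np.full(u, b / u), method="SLSQP",
                   bounds=[(0.0, 1.0)] * u, constraints=cons)
    return np.argsort(res.x)[-b:]       # indices of the b largest alpha values

# Alternation (Algorithm 2): start from the labeled-only classifier, then repeat
# the two steps until the selected index set stops changing.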

3.4. The proposed batch-mode active learning framework

This paper proposes a novel discriminative and representative criterion and applies it to AL for hyperspectral image classification. A linear regression model in the kernel space is used as the classifier, so if the unlabeled dataset is large, the kernel matrix will also be large, slowing down the proposed algorithm.

In order to overcome this problem, at each iteration the proposed method first uses the uncertainty criterion to select the N most uncertain samples from the unlabeled sample pool U as a new unlabeled sample pool U_N. The optional uncertainty criteria include MS and MCLU. For each class, the training set L and the unlabeled samples U_N are used to establish a binary classification problem, so we can build N_c binary problems. For each binary problem, the proposed diversity criterion is employed to select a batch of b samples, so we obtain a query sample set G of N_c × b samples. Since the same unlabeled pool U_N is applied to the discriminative and representative criterion for every binary problem, one sample may be queried in different binary problems, so the queried sample set G may contain repeated samples. Firstly, the repeated samples in G are merged to obtain a new dataset S, in which all the samples are different. In this step, the size of S is adaptive. In order to control the number of queried samples added into the training set at each iteration and to reduce the spatial redundancy, and because the number of repeated samples that are queried is unknown, the maximum number of newly queried samples is set as h in the experiments. If the size of S is greater than the initial query batch size h, the h samples corresponding to the h smallest values of (10) are chosen from the set S. In this way, the number of queried samples is always less than or equal to the initial h. A flowchart of the proposed method is shown in Fig. 1, and the process of the proposed method is given in Algorithm 3.

Algorithm 3. Workflow of the proposed method
Input:
1. Initial training dataset X_iter = {(x_i, y_i)}_{i=1}^{l}, iter = 1, y_i ∈ {1, 2, …, N_c}.
2. N: the number of the most uncertain unlabeled samples to choose.
3. b: the number of discriminative and representative unlabeled samples selected from each binary problem.
4. h: the maximum number of samples added to the training dataset in each iteration.
5. it: the number of iterations.
for iter = 1:it
6. Train the supervised SVM classifier on the current labeled dataset X_iter;
7. Classify the unlabeled dataset U_iter = {x_i}_{i=l+1}^{l+u} and select the N most uncertain unlabeled samples;
8. Initialize the discriminative and representative set G = Ø;
for c = 1:N_c
9. Set the binary problem: if y_i = c, Y_i = 1, else Y_i = −1;
10. Select the b discriminative and representative samples from the N most uncertain unlabeled samples using the discriminative and representative criterion (Algorithm 2) with the best weight β, and add them to the set G;
end

11. Merge the repeated samples in the discriminative and representative set G;
12. If the size of G > h, construct the dataset S with the h samples from G corresponding to the smallest values of (10); else set S = G; end
13. Update: X_{iter+1} = X_iter ∪ S and U_{iter+1} = U_iter \ S.
end

Based on the proposed discriminative and representative (DR) framework, and by combining the two different uncertainty criteria, two practical methods are proposed: 1) MS-DR, which is based on margin sampling; and 2) MCLU-DR, which is based on multiclass-level uncertainty.
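The outer loop of Algorithm 3 can be outlined as follows, stringing together the components described above: uncertainty pre-selection, per-class binary DR selection, merging of duplicate queries, and capping the batch at h. The helper callables select_uncertain, dr_select_binary, and oracle are placeholders for the pieces sketched earlier and for the human supervisor, and the SVM is assumed to follow the scikit-learn interface; this is an illustrative outline, not the authors' code.

import numpy as np
from sklearn.svm import SVC

def dr_active_learning(X, seed_idx, seed_labels, oracle,
                       select_uncertain, dr_select_binary,
                       iterations, N, b, h, C, gamma):
    # oracle(i) returns the true label of sample i; select_uncertain implements
    # MS or MCLU; dr_select_binary returns b (index, score) pairs chosen by the
    # DR criterion of Algorithm 2 for one binary problem.
    lab_idx, labels = list(seed_idx), list(seed_labels)
    pool = [i for i in range(len(X)) if i not in set(lab_idx)]
    for _ in range(iterations):
        clf = SVC(C=C, gamma=gamma, decision_function_shape="ovr")
        clf.fit(X[lab_idx], labels)
        cand = select_uncertain(clf, X, pool, N)           # N most uncertain samples
        scores = {}                                        # merged query set G
        for c in clf.classes_:
            Y_bin = np.where(np.array(labels) == c, 1, -1)
            for i, s in dr_select_binary(X, lab_idx, Y_bin, cand, b):
                scores[i] = min(s, scores.get(i, np.inf))  # duplicates are merged
        S = sorted(scores, key=scores.get)[:h]             # cap the batch at h
        lab_idx += S
        labels += [oracle(i) for i in S]
        pool = [i for i in pool if i not in set(S)]
    return lab_idx, labels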

4. Experiments and analysis

We used two benchmark HSI datasets in the experiments [51] and compared the results of the proposed method and the other state-of-the-art methods. According to the experimental results, we then analyzed the proposed method.

4.1. Datasets

Indian Pines: The Indian Pines hyperspectral dataset was acquired by NASA's Airborne Visible/Infrared Imaging Spectrometer (AVIRIS). It contains 145 × 145 pixels, with each pixel having 220 spectral bands covering the range of 375–2200 nm. The corresponding spatial resolution is approximately 20 m. The initial training dataset T was composed of about 15 samples per class (240 samples in all), and the rest of the samples were considered as unlabeled samples stored in the unlabeled pool U. The false-color composite and the ground-truth map are shown in Fig. 2.

Washington DC: The Washington DC Mall dataset is a Hyperspectral Digital Imagery Collection Experiment (HYDICE) airborne hyperspectral image. This dataset contains 1280 scan lines, and each line has 307 pixels. It includes 210 bands covering the 0.4–2.4 μm wavelength range of the visible and infrared spectrum. A number of water absorption channels were discarded in the data-processing procedure, and the remaining 191 channels were used in this study. A total of 76,750 pixels from seven classes were used in our experiments. The initial training set T was composed of about 15 samples per class (105 samples in all), and the rest of the samples were taken as unlabeled samples stored in the unlabeled pool U. The false-color composite and the ground-truth map are shown in Fig. 3.

Fig. 1. Flowchart of the proposed method.

Fig. 2. Indian Pines dataset. (a) False-color composite of the AVIRIS Indian Pines scene. (b) Ground-truth map containing 16 mutually exclusive land-cover classes.

Fig. 3. The details of the Washington DC Mall dataset. (a) False-color composite of the Washington DC Mall scene. (b) The classes of Washington DC.

Table 1
Different combinations of uncertainty criterion and diversity criterion (classifier: SVM).

Uncertainty criterion          Diversity criterion           Method
Margin sampling                Closest support vector        MS-cSV
Margin sampling                Kernel k-means                MS-Kkmeans
Margin sampling                The proposed DR criterion     MS-DR
Multiclass-level uncertainty   –                             MCLU
Multiclass-level uncertainty   Kernel k-means                MCLU-ECBD
Multiclass-level uncertainty   The proposed DR criterion     MCLU-DR
Query by committee             –                             EQB

4.2. Experiment settings

To test the effectiveness of the proposed AL methods, a number of state-of-the-art methods were used for comparison. The comparison methods are shown in Table 1. According to the description in Section 2, the compared methods involve two criteria: the uncertainty criterion and the diversity criterion. The uncertainty criteria we chose were margin sampling (MS) and multiclass-level uncertainty (MCLU). When MS was used as the uncertainty criterion, MS was combined with the closest support vector (MS-cSV), kernel k-means (MS-Kkmeans), and the proposed criterion (MS-DR), and the different methods were compared. When MCLU was used as the uncertainty criterion, MCLU was combined with no diversity criterion (MCLU), kernel k-means (MCLU-ECBD), and the proposed DR criterion (MCLU-DR), and the different methods were again compared.

In the conventional AL methods, the batch size of new queries is fixed. In this study, for the Indian Pines dataset, which contains 16 classes, the batch size was fixed as h = 20 in each iteration, and for the Washington DC Mall dataset, in which there are seven classes, we set h = 15 for each iteration. For the classifier, a one-against-all SVM, implemented with LIBSVM [52], was applied with a Gaussian radial basis function (RBF) kernel in all the experiments. The SVM classifier with a Gaussian RBF kernel involves two parameters: the penalty parameter C and the kernel parameter γ. In our experiments, without loss of generality, the values of the regularization parameter C and the spread γ of the RBF kernel were chosen by performing a grid-search model selection at the first iteration of the AL process. In the proposed method, there is a regularization weight λ; here, we set λ = 0.1. To solve the optimization formulation of the discriminative and representative criterion, the Gaussian RBF kernel with penalty parameter C and kernel parameter γ, the same as for the SVM classifier, was used. In addition, there are four other parameters:

1) The number N of samples in the uncertain unlabeled dataset U_N. For the Indian Pines dataset, when the uncertainty criterion was MS, we set N = 400, and when the uncertainty criterion was MCLU, we set N = 300. For the Washington DC dataset, when the uncertainty criterion was MS, we set N = 400, and when it was MCLU, we set N = 700.
2) The number b of queried samples selected from each binary problem. In the binary problems, the labels of one class of training samples are 1, and the labels of the other samples are −1. From our own experience, we set b = 2 for the Indian Pines dataset and b = 3 for the Washington DC dataset.
3) The batch size h, which can usually be fixed (i.e., h = 15 or 20).
4) The balance weight β. In each independent run, a cross-validation was carried out in order to get the best weight value in the range β = [10^-5, 10^-4, …, 10^5].

Therefore, there are three parameters, N, b, and β, that need to be evaluated. All the results refer to the average accuracy of 10 individual experiments, according to the 10 randomly selected training datasets. In each initial training set, we randomly selected 15 samples for each class. It is worth noting that it is not possible to directly evaluate the stability of the proposed method in the usual way, because the number of newly queried samples in different iterations is not the same. Thus, we averaged the queried numbers at the corresponding iterations over the 10 runs. After this procedure, the overall accuracy (OA) and standard deviation were calculated to quantify the stability of the proposed method.
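As an illustration of this model-selection step, the snippet below performs the (C, γ) grid search at the first AL iteration with scikit-learn's SVC, which wraps LIBSVM; the search ranges shown are our own assumption, since the exact grid is not specified in the text.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def tune_svm(X_train, y_train):
    # Grid-search C and gamma of the RBF-kernel SVM on the initial training set;
    # the selected values are then kept fixed for the remaining AL iterations.
    param_grid = {"C": [1, 10, 100, 1000],
                  "gamma": [1e-3, 1e-2, 1e-1, 1]}   # illustrative ranges only
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
    search.fit(X_train, y_train)
    return search.best_params_["C"], search.best_params_["gamma"]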

Fig. 4. The learning curves for the Indian Pines dataset with SVM. Each curve shows the average OA and standard deviation with the growing number of queried samples over several runs, which start with different initial sets.

4.3. Results and analysis

In this part, the results for the two image datasets are discussed. Table 1 lists the seven AL models that were implemented.

4.3.1. The AVIRIS Indian Pines dataset

4.3.1.1. The performances of the different AL algorithms. Fig. 4 shows the average OA of the proposed AL method and the other methods over several runs. Each point on the x-axis represents the number of training samples added to the training set and used to train the classifier, while the y-axis represents the OA classification accuracy. In Fig. 4, intuitively, MCLU-DR and MS-DR present higher classification accuracies than the other methods. In addition, the numbers of queried samples added into the training sets were smaller than for the other methods. This indicates that the number of queried samples is adaptive in the proposed method, whereas for the compared methods the number of queries is fixed. The proposed method also obtained a smaller standard deviation than the other algorithms, which indicates that the stability of the proposed method is better than that of the other batch-mode AL methods. This can be explained by the fact that the proposed AL method actively searches for the most informative labeled samples. In Fig. 4, it is worth noting that the results with MCLU uncertainty are always much better than the results with MS uncertainty. However, MS uncertainty combined with the proposed discriminative and representative criterion is always better than all the other compared methods. It can be seen that the highest OA level of the compared methods is 85% after 60 iterations, and to reach an OA of 85%, the comparison methods need to query 1200 samples or more. However, the proposed MS-DR method only needs to query 1046 samples, and MCLU-DR just needs to query 905 samples. The difference in the number of queried samples between the proposed method and the state-of-the-art methods is nearly 300 pixels. Among the comparison methods, only the EQB queries do not involve a diversity criterion, but the diversity criterion in the other comparison methods has no relationship with the uncertainty criterion. Although the diversity criterion can remove redundant samples, it can also remove some informative samples. For example, in an HSI, different objects may have the same spectrum. When the kernel k-means diversity criterion is used, only one sample is queried from each cluster. In the same cluster, the samples merely have a similar spectrum and do not necessarily belong to one class. When a sample has a similar spectrum to the queried sample but does not in fact belong to the same class, this sample will be wrongly classified; therefore, this sample is also an informative sample. In order to alleviate this problem, the discriminative information is considered in the proposed method, and the uncertainty and diversity are balanced by a weight. In this way, the proposed method can preserve the informative samples as much as possible.

Fig. 5. The AL curves of the proposed method. (a) The MS-DR method under different numbers of uncertain samples N, with fixed b = 2, h = 20. (b) The MCLU-DR method under different numbers of uncertain samples N, with fixed b = 2, h = 20.

4.3.1.2. Sensitivity analysis. The aim of this part is to analyze the MCLU-DR and MS-DR methods under different parameter settings and strategies.

4.3.1.2.1. The effect of the number of uncertain samples. Fig. 5 shows the results for the Indian Pines dataset with different initial training sets (fixed b = 2, h = 20). Fig. 5(a) shows the result of the MS-DR method with N varied within the range [100, 1000], and Fig. 5(b) shows the result of the MCLU-DR method with N varied within the same range. From the figures, it can be observed that no single curve is always higher than the others, so the number N does not have a strong influence on the result. Briefly, the proposed method is not sensitive to N, the number of uncertain samples selected by the uncertainty criterion. For the MS-DR method, when the OA reaches a relatively high level, the N = 400 curve is often higher than the others, and the query number is relatively small. For the MCLU-DR method, when N is 300, the query number is lower than for the other settings, and the OA is also relatively high. It is nevertheless worth noting that the value of N matters in practice. When N is large, the method becomes slow; if N is too small, the OA cannot reach a high level. This is because the informative samples in the unlabeled set are limited, so if the uncertain set is too small, some informative samples will be abandoned, and if the uncertain set is too large, the redundant samples will also increase. For example, when more attention is paid to the representative information, a large uncertain unlabeled dataset may make the newly queried samples i.i.d. from each other, but there will be little discriminative information. According to the experimental results and the analysis above, if there is no requirement to elaborately select the best value of N, the user can select a value of N around 400, which also obtains a good result.

4.3.1.2.2. The effect of the number of DR samples selected by a binary-class DR criterion. Fig. 6 shows the results of the proposed methods for the Indian Pines dataset with a fixed h = 20 over several runs. Fig. 6(a) shows the result of MS-DR with N = 400 and b varied within the candidate set {2, 3, 4, 5}, and Fig. 6(b) shows the result of MCLU-DR with N = 300 and b varied within the same candidate set. Here, b is the number of DR samples selected by a binary-class DR criterion. For the Indian Pines dataset, there are 16 classes, so 16 binary problems can be constructed, and at each iteration 16 × b DR samples are queried. It is clear from Fig. 6 that when b is 2, the curve is higher than all the others, and the number of queried samples is also smaller than for the other settings. In the same iteration, the number of newly queried samples is different for the different values of b. This confirms that when b is small, the queried samples are more informative and less redundant than when b is large. When b is large, the number of queried samples also increases. In the experiment, the iteration number was 60, so if the batch size h is fixed at 20, the maximum number of queried samples is 1200, but in Fig. 6, the numbers of queried samples for all the curves are less than 1200. As mentioned above, the number of queried samples is adaptive, because all the binary problems use the same unlabeled dataset and one sample may be queried in several binary problems. Meanwhile, the number of identical samples queried from different binary problems at the same iteration is not known, so the number of newly queried samples is adaptive in the proposed method. According to the experimental results, when the batch size h is fixed, the OA at a given number of queried samples decreases as b increases. Therefore, if b is smaller, the queried samples are more informative.

4.3.1.2.3. Sensitivity analysis of the weight between the discriminative and representative information. The weight β is used to balance the trade-off between the discriminative information and the representative information in the proposed DR criterion in the diversity step. For the Indian Pines dataset, we set the number of iterations to 60. Fig. 7 shows the results for β over the 60 iterations. In the experiment, for a binary problem, β was set within the range [10^-5, 10^-4, …, 10^5] in each iteration, and the best value of β was chosen corresponding to the highest OA. In Fig. 7, each bar gives the frequency with which the trade-off parameter β takes the corresponding value over the whole active learning process. Fig. 7(a) shows the result for the MS-DR method, while Fig. 7(b) shows the result for the MCLU-DR method. For both figures, we can see that the distributions of the optimal values of β over the iterations are similar for the different methods, and when more attention is paid to the representative information, β can reach a value as large as 10^4. However, the value of β is most often less than or equal to 10^-1. This demonstrates that the discriminative information is important in the querying process. When there is not enough information to query the discriminative samples, the value of β becomes large and representative samples are queried, so β can adjust the trade-off between the discriminative information and the representative information to make the classification accuracy converge quickly.

Fig. 6. The AL curves of the proposed method. (a) The MS-DR method under different numbers of DR samples b, with fixed N = 400, h = 20. (b) The MCLU-DR method under different numbers of DR samples b, with fixed N = 300, h = 20.

Fig. 7. The weight parameter β balancing the discriminative information and the representative information of the proposed method for the Indian Pines dataset. (a) The result of the MS-DR method. (b) The result of the MCLU-DR method.

It is also worth noting that the result by MCLU uncertainty is better than that by MS uncertainty. However, when the MS uncertainty is combined with the proposed discriminative and representative criterion, under the same number of new queried samples, the OA is much higher than for all the other compared methods. Therefore, when the manually labeled samples are limited for the remote sensing image classification, the proposed method can obtain a better result than the other methods. This is because the proposed method can effectively remove the redundant information and obtain more informative samples than the compared methods.

Fig. 8. The learning curves for the Washington DC dataset with SVM. Each curve shows the average OA and standard deviation with the growing number of queried samples over several runs, which start with different initial sets.

4.3.2. The Washington DC dataset

In order to verify the effectiveness of the proposed method, the Washington DC dataset was also used.

4.3.2.1. The performances of the different AL algorithms. Fig. 8 shows the average OA obtained by the proposed AL models and the other models over several runs with the Washington DC dataset. In Fig. 8, the OA curves of MCLU-DR and MS-DR are higher than those of the other methods. The analysis of the standard deviation also indicates that the proposed method has a smaller standard deviation than the other methods. This confirms the good stability of the proposed method versus the other state-of-the-art batch-mode AL methods. In the experiment with the Washington DC dataset, according to the analysis that follows, we set N = 400 and b = 3 for MS-DR, and N = 700 and b = 3 for MCLU-DR. Because the number of newly queried samples is adaptive in the proposed method, the number of newly queried samples is not greater than the batch size h in each iteration. In the Washington DC experiment, we set h = 15, and when the OA reached 98%, the best case for the state-of-the-art AL methods required at least 860 samples, but the proposed MS-DR method only needed to query 730 samples, and MCLU-DR needed to query 710 samples.

4.3.2.2. Sensitivity analysis. Here, we analyze the different parameters and strategies for the Washington DC dataset.

4.3.2.2.1. The effect of the number of uncertain samples. Fig. 9 shows the results for the Washington DC dataset. At a glance, the curves are not very smooth, so we chose the relatively smooth and high curve as the standard. Fig. 9(a) shows the result of the MS-DR method with N varied within the range [100, 1000] and the other two parameters fixed as b = 3 and h = 15. Among all the curves in Fig. 9(a), the N = 400 curve gives a relatively good result. Fig. 9(b) shows the result of the MCLU-DR method with the same parameter settings as the MS-DR method. From Fig. 9(b), we can observe that all the curves can reach a high OA, but when N = 700, the curve is relatively smooth and high. Therefore, in the experiment with the proposed method and the Washington DC dataset, we set N = 400 for MS-DR and N = 700 for MCLU-DR. From Figs. 5 and 9, we can see that the results are not very sensitive to the size of N within the range [300, 500]. Although N is an important parameter in the proposed method, it has no absolute best value in these experiments, and N can be chosen from the range [300, 500].

4.3.2.2.2. The effect of the number of DR samples selected by a binary-class DR criterion. Fig. 10 shows the results of the proposed method for the Washington DC dataset. Fig. 10(a) shows the result of MS-DR with fixed N = 400 and h = 15 over several runs, and b varied within the candidate set {2, 3, 4, 5}; Fig. 10(b) shows the result of MCLU-DR with fixed N = 700 and h = 15, and b varied within the same candidate set. From Fig. 10, we can observe that, to achieve a given accuracy, when b is small the number of queried samples is smaller than when b is large. Meanwhile, when b is small, the number of queried samples is also limited. For example, in Fig. 10, when b is 2, the number of queried samples is less than 600 in 60 iterations, whereas when b is greater than 2, the number of queried samples is more than 700. This demonstrates that when b is small, the queried samples are more informative, but within the same number of iterations the number of queried samples is limited, so it is not possible to obtain a result as good as when b is large. Therefore, when b is small, obtaining a good result requires more iterations, which also requires much more time. However, although the number of iterations increases, when a good result is obtained the number of queried samples is smaller than when b is large. In order to balance the OA and the time cost, according to the experimental results for the Washington DC dataset, b = 3 was chosen for the Washington DC dataset with the proposed method. In the two dataset experiments, under consideration of both the accuracy and the cost, we chose b = 2 for the Indian Pines dataset and b = 3 for the Washington DC dataset, according to Figs. 6 and 10. From Figs. 6 and 10, it can be clearly seen that when b is smaller, the classification accuracy is higher at the same number of queried samples. However, for the Washington DC dataset, we chose b = 3 instead of b = 2. This was because, in Fig. 10, although the classification accuracy when b was set at 2 is higher than the classification accuracy when b was set at 3, 4, or 5 for the same number of queried samples, the final classification result is not as good as the result when b was 3, 4, or 5. This is related to the batch size h. The Indian Pines dataset contains 16 classes, and the batch size h was set at 20; when b was 2, at most 16 × 2 = 32 samples were queried at one iteration, which is greater than the fixed batch size h = 20. Meanwhile, the Washington DC dataset contains seven classes, and the batch size h was set at 15; when b was 2, at most 7 × 2 samples were queried in an iteration, which is less than the fixed batch size h = 15, but when b was 3, at most 7 × 3 samples were queried in one iteration, which exceeds h = 15, and the classification accuracy was higher than when b was 2. Again, as mentioned above, when b is smaller, the queried samples are more informative. Therefore, if the user wants to achieve a high OA with few samples, a small value of b can be chosen, but the time cost will be high. If the user wants to reach a high OA with a low time cost, a larger value of b can be chosen, but more samples will be required.

Fig. 9. The AL curves of the proposed method. (a) The MS-DR method under different numbers of uncertain samples N, with fixed b = 3, h = 15. (b) The MCLU-DR method under different numbers of uncertain samples N, with fixed b = 3, h = 15.

Fig. 10. The AL curves of the proposed method. (a) The MS-DR method under different numbers of DR samples b, with fixed N = 400, h = 15. (b) The MCLU-DR method under different DR sizes b, with fixed N = 700, h = 15.

Fig. 11. The weight parameter β balancing the discriminative information and representative information of the proposed methods for the Washington DC dataset. (a) The result of the MS-DR method. (b) The result of the MCLU-DR method.

However, according to the experimental analysis, when b multiplied by the number of classes is just greater than the batch size h, the classification accuracy, the number of queried samples, and the time cost are well balanced.

4.3.2.2.3. Sensitivity analysis of the weight balancing the discriminative and representative information. Fig. 11 shows the distribution of β over the 60 iterations for the Washington DC dataset. In the experiment, we set β within the range [10^-5, 10^-4, …, 10^5], and the best β was chosen corresponding to the highest OA. Fig. 11(a) shows the result of the MS-DR method, while Fig. 11(b) shows the result of the MCLU-DR method. We can observe that the distributions in the two graphs are similar. In the experiment, when more attention is paid to the representative information, β takes a large value, reaching 10^4. Although the frequency of 10^4 accounts for less than 30% of all the values, it is effective in adjusting the training set when the uncertainty information is not enough to select the informative samples. However, values of β less than or equal to 10^-1 account for more than 70%. When β is greater than 1, the representative information is predominant; otherwise, the discriminative information is more significant than the representative information. Accordingly, from Fig. 11, the discriminative information is more important in the whole experimental process. From the two experimental results, we can observe that although the datasets are different, the distributions of β are similar in the proposed method. In all the experiments, the discriminative information is predominant in adjusting the training set to be more effective in selecting the most informative samples. However, selecting the optimal β in every iteration is very costly. According to the experimental results, if the user does not require the best value, a compromise value, which not only pays attention to the discriminative information but also considers the representative information, can be chosen, and it can be set as β = 10^-3 in all the iterations.

5. Conclusion

In this paper, we generalize the empirical risk minimization principle to the active learning setting and propose a novel active learning framework. By effectively combining the representative term and the discriminative term, we query the samples which are expected to rapidly reduce the empirical risk, while preserving the original source distribution at the same time. This enables our method to achieve a consistently good performance during the whole active learning process. The superior performance of the proposed method was verified by our evaluations on two benchmark HSI datasets, in comparison with the state-of-the-art batch-mode active learning methods. Firstly, the proposed diversity criterion can effectively reduce the number of queried samples with an adaptive number of new queries. Secondly, by combining the discriminative information and the representative information together, the proposed diversity criterion can achieve a given accuracy with many fewer queried samples. Thirdly, the maximum mean discrepancy is used to measure the representative information and to make sure that the newly queried samples are independent and identically distributed. This also ensures that the newly queried samples are consistent with the distribution of the original space and makes the queried samples as non-redundant as possible. We observe from our experiments that it is beneficial to update the trade-off parameter which balances the discriminative and representative information during the query process. However, it is time-consuming to find the best value of this trade-off parameter. Our future work is to explore an adaptive methodology to tune this parameter automatically, similar to [53], which could make our active learning framework more practical. In addition, we plan to develop further methodologies and aim to extend the framework to the semi-supervised learning and multi-label learning settings.

References [1] X. Lu, X. Li, Multiresolution imaging, IEEE Trans. Cybern. 44 (1) (2014) 149–160. [2] X. Lu, Y. Yuan, P. Yan, Alternatively constrained dictionary learning for image superresolution, IEEE Trans. Cybern. 44 (3) (2014) 366–377. [3] X. Li, S. Lin, S. Yan, D. Xu, Discriminant locally linear embedding with highorder tensor data, IEEE Trans. Syst. Man Cybern. B Cybern. 38 (2) (2008) 342–352. [4] D. Tao, S. Maybank, W. Hu, X. Li, Stable third-order tensor representation for colour image classification, in: Proceedings of IEEE/WIC/ACM International Conference On Web Intelligence, 2005, pp. 641–644. [5] Y. Luo, D. Tao, B. Geng, C. Xu, S.J. Maybank, Manifold regularized multitask learning for semi-supervised multilabel image classification, IEEE Trans. Image Process. 22 (2) (2013) 523–536. [6] G. Shaw, D. Manolakis, Signal processing for hyperspectral image exploitation, IEEE Signal Process. Mag. 19 (1) (2002) 12–16.


[7] Y. Zhong, L. Zhang, Remote sensing image subpixel mapping based on adaptive differential evolution, IEEE Trans. Syst. Man Cybern. B: Cybern. 42 (5) (2012) 1306–1329.
[8] B. Du, L. Zhang, L. Zhang, T. Chen, K. Wu, A discriminative manifold learning based dimension reduction method for hyperspectral classification, Int. J. Fuzzy Syst. 14 (2) (2012) 272–277.
[9] D.A. Landgrebe, Signal Theory Methods in Multispectral Remote Sensing, John Wiley & Sons, 2005.
[10] B. Du, L. Zhang, Target detection based on a dynamic subspace, Pattern Recognit. 47 (1) (2014) 344–358.
[11] S. Rajan, J. Ghosh, M.M. Crawford, An active learning approach to knowledge transfer for hyperspectral data analysis, in: Proceedings of IEEE International Conference on Geoscience and Remote Sensing Symposium, IGARSS, 2006, pp. 541–544.
[12] M. Seeger, Learning with Labeled and Unlabeled Data, Inst. Adaptive Neural Comput., Univ. Edinburgh, Edinburgh, 2000.
[13] J. Li, J.M. Bioucas-Dias, A. Plaza, Spectral-spatial classification of hyperspectral data using loopy belief propagation and active learning, IEEE Trans. Geosci. Remote Sens. 51 (2) (2013) 844–856.
[14] A. Plaza, J.A. Benediktsson, J.W. Boardman, J. Brazile, L. Bruzzone, G. Camps-Valls, J. Chanussot, M. Fauvel, P. Gamba, A. Gualtieri, M. Marconcini, J.C. Tilton, G. Trianni, Recent advances in techniques for hyperspectral image processing, Remote Sens. Environ. 113 (Suppl. 1) (2009) S110–S122.
[15] Y. Gao, R. Ji, P. Cui, Q. Dai, G. Hua, Hyperspectral image classification through bilayer graph-based learning, IEEE Trans. Image Process. 23 (7) (2014) 2769–2778.
[16] E. Pasolli, F. Melgani, D. Tuia, F. Pacifici, W.J. Emery, SVM active learning approach for image classification using spatial information, IEEE Geosci. Remote Sens. 52 (4) (2014) 2217–2233.
[17] J. Li, H. Zhang, L. Zhang, Column-generation kernel nonlocal joint collaborative representation for hyperspectral image classification, ISPRS J. Photogramm. Remote Sens. 94 (2014) 25–36.
[18] K. Tan, E. Li, Q. Du, P. Du, An efficient semi-supervised classification approach for hyperspectral imagery, ISPRS J. Photogramm. Remote Sens. 97 (2014) 36–45.
[19] M. Volpi, G. Matasci, M. Kanevski, D. Tuia, Semi-supervised multiview embedding for hyperspectral data classification, Neurocomputing 145 (2014) 427–437.
[20] Q. Shi, L. Zhang, B. Du, Semi-supervised discriminative locally enhanced alignment for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 51 (9) (2013) 4800–4815.
[21] K. Bernard, Y. Tarabalka, J. Angulo, J. Chanussot, J.A. Benediktsson, Spectral-spatial classification of hyperspectral data based on a stochastic minimum spanning forest approach, IEEE Trans. Image Process. 21 (4) (2012) 2008–2021.
[22] J.E. Fowler, Q. Du, Anomaly detection and reconstruction from random projections, IEEE Trans. Image Process. 21 (1) (2012) 184–195.
[23] J. Li, P. Reddy Marpu, A. Plaza, J.M. Bioucas-Dias, J. Atli Benediktsson, Generalized composite kernel framework for hyperspectral image classification, IEEE Trans. Geosci. Remote Sens. 51 (9) (2013) 4816–4829.
[24] Huo, T. Ping, A batch-mode active learning algorithm using region-partitioning diversity for SVM classifier, IEEE J. Sel. Top. Appl. Earth Obs. 7 (4) (2014) 1036–1046.
[25] L. Bruzzone, C. Mingmin, M. Marconcini, A novel transductive SVM for semisupervised classification of remote-sensing images, IEEE Trans. Geosci. Remote Sens. 44 (11) (2006) 3363–3373.
[26] M. Marconcini, G. Camps-Valls, L. Bruzzone, A composite semisupervised SVM for classification of hyperspectral images, IEEE Geosci. Remote Sens. Lett. 6 (2) (2009) 234–238.
[27] C. Mingmin, L. Bruzzone, Semisupervised classification of hyperspectral images by SVMs optimized in the primal, IEEE Trans. Geosci. Remote Sens. 45 (6) (2007) 1870–1880.
[28] D. Tuia, F. Ratle, F. Pacifici, M.F. Kanevski, W.J. Emery, Active learning methods for remote sensing image classification, IEEE Trans. Geosci. Remote Sens. 47 (7) (2009) 2218–2232.
[29] S.C.H. Hoi, R. Jin, J. Zhu, M.R. Lyu, Batch mode active learning and its application to medical image classification, in: Proceedings of the 23rd International Conference on Machine Learning, 2006.
[30] K. Brinker, Incorporating diversity in active learning with support vector machines, in: Proceedings of International Conference on Machine Learning Workshop, 2003, pp. 59.
[31] Z. Xu, K. Yu, V. Tresp, X. Xu, J. Wang, Representative sampling for text classification using support vector machines, Advances in Information Retrieval, 2003.
[32] H.T. Nguyen, A. Smeulders, Active learning using pre-clustering, in: Proceedings of the 21st International Conference on Machine Learning, 2004, pp. 79.
[33] H.S. Seung, M. Opper, H. Sompolinsky, Query by committee, in: Proceedings of the 5th Annual Workshop on Computational Learning Theory, 1992, pp. 287–294.
[34] Y. Freund, H. Sebastian, S. Shamir, N. Tishby, Selective sampling using the query by committee algorithm, Mach. Learn. 28 (1997) 133–168.
[35] S. Argamon-Engelson, I. Dagan, Committee-based sample selection for probabilistic classifiers, arXiv preprint arXiv:1106.0220, 2011.
[36] T. Luo, K. Kramer, D.B. Goldgof, L.O. Hall, Active learning to recognize multiple types of plankton, in: Proceedings of the 17th International Conference on Pattern Recognition, ICPR, 2004, pp. 478–481.
[37] N. Roy, A. McCallum, Toward optimal active learning through Monte Carlo estimation of error reduction, in: Proceedings of International Conference on Machine Learning, Williamstown, 2001.


[38] C. Campbell, N. Cristianini, A. Smola, Query learning with large margin classifiers, in: Proceedings of International Conference on Machine Learning Workshop, 2000, pp. 111–118.
[39] G. Schohn, D. Cohn, Less is more: active learning with support vector machines, in: Proceedings of International Conference on Machine Learning Workshop, 2000, pp. 839–846.
[40] T. Kanungo, D.M. Mount, N.S. Netanyahu, C.D. Piatko, R. Silverman, A.Y. Wu, An efficient k-means clustering algorithm: analysis and implementation, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 881–892.
[41] J. Shawe-Taylor, N. Cristianini, Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
[42] L. Bottou, Y. Bengio, Convergence properties of the k-means algorithms, Adv. Neural Inf. Process. Syst. 7 (1995) 585–592.
[43] A. Beygelzimer, S. Dasgupta, J. Langford, Importance weighted active learning, in: Proceedings of the 26th International Conference on Machine Learning Workshop, 2009, pp. 49–56.
[44] L. Mingkun, I.K. Sethi, Confidence-based active learning, IEEE Trans. Pattern Anal. Mach. Intell. 28 (8) (2006) 1251–1261.
[45] B. Demir, C. Persello, L. Bruzzone, Batch-mode active-learning methods for the interactive classification of remote sensing images, IEEE Trans. Geosci. Remote Sens. 49 (3) (2011) 1014–1031.
[46] R. Zhang, A.I. Rudnicky, A large scale clustering scheme for kernel k-means, in: Proceedings of IEEE International Conference on Pattern Recognition, Aug. 11–15, 2002, pp. 289–292.
[47] Z. Wang, J. Ye, Querying discriminative and representative samples for batch mode active learning, in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 158–166.
[48] R. Chattopadhyay, Z. Wang, W. Fan, I. Davidson, S. Panchanathan, J. Ye, Batch mode active sampling based on marginal probability distribution matching, in: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 741–749.
[49] J.C. Bezdek, R.J. Hathaway, Convergence of alternating optimization, Neural Parallel Sci. Comput. 11 (4) (2003) 351–368.
[50] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, Distributed optimization and statistical learning via the alternating direction method of multipliers, Found. Trends Mach. Learn. 3 (1) (2011) 1–122.
[51] D. Comaniciu, P. Meer, Mean shift analysis and applications, in: Proceedings of the 7th IEEE International Conference on Computer Vision, 1999, pp. 1197–1203.
[52] C.-C. Chang, L. Chih-Jen, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (3) (2011) 27.
[53] Z. Wang, S. Yan, C. Zhang, Active learning with adaptive regularization, Pattern Recognit. 44 (10) (2011) 2375–2383.
[54] L. Zhang, Q. Zhang, L. Zhang, D. Tao, X. Huang, B. Du, Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding, Pattern Recognit. 48 (10) (2015) 3102–3112.

Zengmao Wang received the B.S. degree in surveying and mapping from Central South University, Changsha, China, in 2013, and is currently pursuing the M.S. degree at the State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing (LIESMARS). His research interests include hyperspectral image processing and machine learning.

Bo Du (M'10–SM'15) received the B.S. and Ph.D. degrees in photogrammetry and remote sensing from the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing, Wuhan University, Wuhan, China, in 2005 and 2010, respectively. He is currently an associate professor with the School of Computer, Wuhan University, Wuhan, China. He has published more than 40 research papers in the IEEE Transactions on Geoscience and Remote Sensing (TGRS), IEEE Transactions on Image Processing (TIP), IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS), and IEEE Geoscience and Remote Sensing Letters (GRSL), among other venues. His major research interests include pattern recognition, hyperspectral image processing, and signal processing. He is a senior member of the IEEE. He received a best reviewer award from the IEEE GRSS for his service to JSTARS in 2011 and an ACM rising star award for his academic progress in 2015. He was a Session Chair for the 4th IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS). He also serves as a reviewer for 20 Science Citation Index (SCI) journals, including IEEE TGRS, TIP, JSTARS, and GRSL.



Lefei Zhang (S'11-M'14) received the B.S. and Ph.D. degrees from Wuhan University, Wuhan, China, in 2008 and 2013, respectively. From August 2013 to July 2015, he was with the School of Computer, Wuhan University, as a Postdoctoral Researcher, and he was a Visiting Scholar with the CAD & CG Lab, Zhejiang University in 2015. He is currently a lecturer with the School of Computer, Wuhan University, and also a Hong Kong Scholar with the Department of Computing, Hong Kong Polytechnic University, Hong Kong. His research interests include pattern recognition, image processing, and remote sensing. Dr. Zhang is a reviewer of more than twenty international journals, including the IEEE TIP, TNNLS, and TGRS.

Liangpei Zhang (M'06–SM'08) received the B.S. degree in physics from Hunan Normal University, Changsha, China, in 1982, the M.S. degree in optics from the Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China, in 1988, and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 1998. He is currently the head of the remote sensing division, State Key Laboratory of Information Engineering in Surveying, Mapping, and Remote Sensing (LIESMARS), Wuhan University. He is also a "Chang-Jiang Scholar" chair professor appointed by the Ministry of Education of China. He is currently a principal scientist for the China state key basic research project (2011–2016) appointed by the Ministry of National Science and Technology of China to lead the remote sensing program in China. He has published more than 450 research papers and five books, and he is the holder of 15 patents. His research interests include hyperspectral remote sensing, high-resolution remote sensing, image processing, and artificial intelligence. Dr. Zhang is the founding chair of the IEEE Geoscience and Remote Sensing Society (GRSS) Wuhan Chapter. He received best reviewer awards from the IEEE GRSS for his service to the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (JSTARS) in 2012 and the IEEE Geoscience and Remote Sensing Letters (GRSL) in 2014. He was the General Chair for the 4th IEEE GRSS Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS) and a guest editor of JSTARS. His research teams won the top three prizes of the IEEE GRSS 2014 Data Fusion Contest, and his students have been selected as winners or finalists of the IEEE International Geoscience and Remote Sensing Symposium (IGARSS) student paper contest in recent years. Dr. Zhang is a Fellow of the Institution of Engineering and Technology (IET), an executive member (board of governors) of the China National Committee of the International Geosphere–Biosphere Programme, and an executive member of the China Society of Image and Graphics, among others. He was a recipient of the 2010 best paper Boeing award and the 2013 best paper ERDAS award from the American Society for Photogrammetry and Remote Sensing (ASPRS). He regularly serves as a co-chair of the series of SPIE conferences on multispectral image processing and pattern recognition, the conference on Asia remote sensing, and many other conferences. He edits several conference proceedings, issues, and geoinformatics symposiums. He also serves as an associate editor of the International Journal of Ambient Computing and Intelligence, the International Journal of Image and Graphics, the International Journal of Digital Multimedia Broadcasting, the Journal of Geospatial Information Science, and the Journal of Remote Sensing, and as a guest editor of the Journal of Applied Remote Sensing and the Journal of Sensors. Dr. Zhang is currently serving as an associate editor of the IEEE Transactions on Geoscience and Remote Sensing.