Active learning via local structure reconstruction


PII: S0167-8655(17)30140-X
DOI: 10.1016/j.patrec.2017.04.022
Reference: PATREC 6806
To appear in: Pattern Recognition Letters
Received date: 22 November 2016; Revised date: 16 February 2017; Accepted date: 25 April 2017

Please cite this article as: Qin Li, Xiaoshuang Shi, Linfei Zhou, Zhifeng Bao, Zhenhua Guo, Active learning via local structure reconstruction, Pattern Recognition Letters (2017), doi: 10.1016/j.patrec.2017.04.022

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Pattern Recognition Letters journal homepage: www.elsevier.com

Active learning via local structure reconstruction

Qin Li a, Xiaoshuang Shi b, Linfei Zhou c, Zhifeng Bao d, Zhenhua Guo b,*

a School of Software Engineering, Shenzhen Institute of Information Technology, China
b Graduate School at Shenzhen, Tsinghua University, China
c Department of Computer Science, Ludwig-Maximilians University in Munich, Germany
d School of Computer Science and Information Technology, RMIT University, Australia

ABSTRACT

To select the most representative data points for labeling, two typical active learning methods, Transductive Experimental Design (TED) and Robust Representation and Structured Sparsity (RRSS), have recently been proposed and have yielded impressive results. However, both of them neglect the local structure of data points, which is helpful for selecting representative data points. Therefore, in this paper, we propose a novel active learning method via local structure reconstruction to select representative data points. Specifically, we construct a simple but effective graph to capture the local relationships of data points. Then an optimization model is formulated to fulfill the data point reconstruction and select the most representative data points. Furthermore, we define a simple but useful classifier based on a linear regression model for better exploring the potential classification performance of the selected data points. Experimental results on two synthetic datasets and two face databases demonstrate the effectiveness of our method and the efficiency of the defined classifier.


© 2012 Elsevier Ltd. All rights reserved.

* Corresponding author. e-mail: [email protected]

1. Introduction

Since labels in many real-world applications are usually difficult and expensive to obtain, active learning, which reduces the labeling cost by choosing the most informative data to be labeled, has attracted many researchers during the past few decades. Traditional active learning methods can be categorized into two groups. The first group aims to query the most informative data points and selects the uncertain or hard-to-predict data. However, these approaches require a large number of labeled data points to avoid sample bias (Nie et al., 2013b). Within the first group, one of the most widely used methods is uncertainty sampling (Balcan et al., 2007; Lewis and Catlett, 1994), which queries the points where the predictive uncertainty of the trained model is the highest. Other popular methods, including query by committee (Freund et al., 1997; Seung et al., 1992) and variance reduction (Cohn et al., 1996), are derived from optimal experimental design (Atkinson et al., 2007; Flaherty et al., 2005). The second group aims to select the data points with representative information, which can achieve high accuracy using as few labeled data points as possible (Zhang et al., 2011); examples include transductive experimental design (Yu et al., 2006), clustering-based methods (Nie et al., 2012) and the QUIRE method (Huang et al., 2014). In addition, these two categories address the experimental design problem at different stages. Since the first group of methods needs enough labeled data points, it should be applied during the mid-stage of an experiment, while only a small number (or even none) of labeled data points are required for the second group of methods, so they can be used during the early stage. In this paper, we focus on improving the second group of methods.

Recently, Transductive Experimental Design (TED) (Yu et al., 2006), which selects representative data points via a least-square loss function and ridge regularization, has yielded impressive results. However, the TED objective leads to an NP-hard problem and is approximated by a sequential optimization problem with a slow greedy algorithm. Robust Representation and Structured Sparsity (RRSS) (Nie et al., 2013b) was then proposed to overcome these two deficiencies. From the perspective of data reconstruction, the original data space in both TED and RRSS is reconstructed in a global way, where each data point is linearly reconstructed by using all of the selected data points (Hu et al., 2013), even if the selected data points are far away from the point to be reconstructed (see Fig. 1). Nevertheless, according to recent studies, high-dimensional data might lie on a nonlinear sub-manifold (for example, a $k$-dimensional sub-manifold of a Euclidean space $\mathbb{R}^{h}$ is a subset $M^{k} \subset \mathbb{R}^{h}$ that locally looks like a flat $k$-dimensional Euclidean space (Cai et al., 2011)) hidden in a high-dimensional ambient space, rather than being uniformly distributed in the whole ambient space. Both TED and RRSS fail to consider this underlying manifold structure. In order to discover the underlying manifold structure, many manifold learning algorithms such as Locally Linear Embedding (LLE) (Roweis and Saul, 2000) and ISOMAP (Tenenbaum et al., 2000) have been proposed. These methods rely on the so-called local invariance idea (Hadsell et al., 2006) that nearby points are likely to have similar embeddings.
These studies have also illustrated that if the geometrical structure is exploited and the local invariance is considered, the learning performance can be significantly improved (Cai et al., 2011). Based on the above motivations, we propose a novel active learning method that selects the most representative points by considering the local invariance. Specifically, we assume that each data point and its neighbors lie on a local manifold, and the manifold structure is first modeled by a nearest-neighbor graph that can preserve the local geometrical structure of the data space (He et al., 2005). Then, based on the graph model, we select representative data points that can reconstruct the whole set of original data points well. Meanwhile, we prove that the optimization stage of our algorithm has a closed-form solution in each iteration and converges globally. In order to further test the quality of the selected data points, we also define a classifier based on a linear regression model with a least-square loss function and ridge regularization. Experimental results demonstrate the effectiveness of our method, the fast convergence of its optimization stage and the efficiency of the defined classifier.

The main contributions of this paper are listed as follows: (1) We propose a novel model to select the most representative points by considering the local invariance. (2) We present the optimization procedure of the proposed model, and analyze its convergence and time complexity. (3) Extensive experiments on two synthetic datasets and two face databases illustrate the superior performance of the proposed method over state-of-the-art methods.

The rest of the paper is organized as follows. Section 2 introduces the notations and definitions used in this paper, and briefly reviews the related work: TED and RRSS. Section 3 presents the proposed algorithm, Active Learning via Local Structure Reconstruction (ALLSR), and a novel classifier, the Linear Regression Classifier (LRC). Section 4 reports and analyzes the experimental results on two synthetic data sets and two face databases. Finally, Section 5 concludes our work and discusses potential avenues for future research.

Fig. 1. Comparison of data reconstruction in a global way and by nearest neighbors. (a) Original data points from two classes; the points in triangles are the selected points and the points in circles are the target points. (b) TED and RRSS reconstruct target points in a global way, where each target point is reconstructed by all selected points. (c) Reconstructing target points by the nearest selected points. Compared with reconstructing a target point with all selected points, using only the nearest selected points might be more accurate.

2. Preliminaries

2.1. Notations and definitions

For a matrix $M = (m_{ij}) \in \mathbb{R}^{n \times d}$, its $i$-th row and $j$-th column are denoted by $m^{i}$ and $m_{j}$, respectively. The $L_p$-norm of a vector $v \in \mathbb{R}^{n}$ is defined as $\|v\|_p = (\sum_{i=1}^{n} |v_i|^{p})^{1/p}$. The Frobenius norm of the matrix $M$ is defined as:

$$\|M\|_F = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{d} m_{ij}^{2}} = \sqrt{\sum_{i=1}^{n} \|m^{i}\|_2^{2}} \qquad (1)$$

Thus the squared $L_2$-norm of the matrix $M$ is $\|M\|_F^{2}$. The $L_{2,1}$-norm of the matrix $M$ is defined as:

$$\|M\|_{2,1} = \sum_{i=1}^{n}\sqrt{\sum_{j=1}^{d} m_{ij}^{2}} = \sum_{i=1}^{n} \|m^{i}\|_2 \qquad (2)$$
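For readers who prefer to see these definitions numerically, the following small numpy sketch (ours, not part of the original text) evaluates Eqs. (1)-(2) for an arbitrary matrix:

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [3.0, 4.0]])                      # n = 3 rows, d = 2 columns

fro = np.sqrt((M ** 2).sum())                   # ||M||_F as in Eq. (1)
l21 = np.linalg.norm(M, axis=1).sum()           # ||M||_{2,1} as in Eq. (2): sum of row L2-norms

print(fro ** 2, l21)                            # squared Frobenius norm and L2,1-norm
```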


For the sake of consistency, the quasi-norm $L_{2,0}$ of the matrix $M$ is defined as the number of nonzero rows of $M$.

2.2. Transductive Experimental Design (TED)

Given unlabeled data $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, active learning aims to select $m$ ($m \ll n$) points to be labeled by the user in order to maximize the potential recognition performance of the classifier trained on these $m$ labeled points. Active learning methods usually select representative data points to maximize the efficiency of the training process; a typical example is TED, whose basic idea is to minimize the average predictive variance of the estimated regularized linear regression function. From a geometrical point of view, TED finds $m$ representative points $X_s = [x_{s_1}, x_{s_2}, \ldots, x_{s_m}] \subset X$ that span a linear space retaining most of the information of $X$ (Yu et al., 2006); in other words, the original data points $X$ can be reconstructed from the $m$ selected data points $X_s$ with least loss, which can be formulated as follows:

$$\min_{X_s, A} \ \sum_{i=1}^{n} \left( \|x_i - X_s a_i\|_2^{2} + \mu \|a_i\|_2^{2} \right) \quad \text{s.t.} \ X_s \subset X, \ |X_s| = m, \ A = [a_1, \ldots, a_n] \in \mathbb{R}^{m \times n} \qquad (3)$$

where $\mu$ is the regularization parameter controlling the amount of shrinkage. The first term in the objective function shows that each data point can be linearly reconstructed by all of the selected data points; however, some selected points far away from a target point might have little or even a negative effect on its reconstruction, so it is more reasonable to reconstruct the target point from its nearest neighbors. The second term penalizes the norm of the reconstruction coefficients, which indicates that TED tends to select data points with large norms (Zhang et al., 2011).
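For a fixed candidate set $X_s$, the inner minimization over $a_i$ in Eq. (3) is an ordinary ridge regression with the closed form $a_i = (X_s^{T} X_s + \mu I)^{-1} X_s^{T} x_i$. The sketch below (our illustration; the data, the value of $\mu$ and the function name are ours) evaluates the TED objective for a given selection, which is the quantity the sequential/greedy solvers try to minimize over subsets:

```python
import numpy as np

def ted_objective(X, Xs, mu=0.1):
    """Eq. (3) for a fixed selection Xs: reconstruct every column of X from Xs.

    X  : d x n matrix of all data points (columns are points)
    Xs : d x m matrix of the selected points
    """
    m = Xs.shape[1]
    # one linear solve gives the ridge coefficients of all n points at once
    A = np.linalg.solve(Xs.T @ Xs + mu * np.eye(m), Xs.T @ X)
    return np.linalg.norm(X - Xs @ A, 'fro') ** 2 + mu * np.linalg.norm(A, 'fro') ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 40))
print(ted_objective(X, X[:, :5]))   # score of an arbitrary 5-point selection
```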


2.3. Robust Representation and Structured Sparsity

Since Eq. (3) leads to a difficult combinatorial optimization problem (NP-hard), the TED problem was approximated by a sequential optimization problem and solved by an inefficient greedy optimization method (Nie et al., 2013b). In order to overcome these two deficiencies, RRSS has been proposed. Its formulation is as follows:

$$\min_{A} \ \sum_{i=1}^{n} \|x_i - X a_i\|_2^{2} + \gamma \|A\|_{2,0} \qquad (4)$$

where $A = [a_1, a_2, \ldots, a_n] \in \mathbb{R}^{n \times n}$ with $\|A\|_{2,0} = m$. Similar to Eq. (3), Eq. (4) aims to select $m$ data points that reconstruct the whole set of original data points with least loss. However, Eq. (4) is a highly non-convex optimization problem and no efficient solution is known, so Eq. (4) is relaxed by replacing $\|A\|_{2,0}$ with $\|A\|_{2,1}$ to obtain a tractable optimization problem. Besides, $\|A\|_{2,1}$ is the minimum convex hull of $\|A\|_{2,0}$, and when $A$ is row-sparse enough, minimizing $\|A\|_{2,1}$ is equivalent to minimizing $\|A\|_{2,0}$ (Nie et al., 2012). Therefore, the convex surrogate of Eq. (4) can be written as:

$$\min_{A} \ \sum_{i=1}^{n} \|x_i - X a_i\|_2^{2} + \gamma \|A\|_{2,1} \qquad (5)$$

As the $L_2$-norm loss function is sensitive to outliers, the $L_{2,1}$-norm is applied to the loss function by replacing the $L_2$-norm. Hence, Eq. (5) becomes

$$\min_{A} \ \sum_{i=1}^{n} \|x_i - X a_i\|_2 + \gamma \|A\|_{2,1} \qquad (6)$$

In Eq. (6), the $L_{2,1}$-norm on the loss function imposes the $L_2$-norm on the feature dimension and the $L_1$-norm on the data point dimension, and the effect of outliers is reduced by the $L_1$-norm. After obtaining $A$, the row-sums of the absolute values of $A$ are computed, these values are sorted in decreasing order, and finally the $m$ data points corresponding to the top $m$ rows of $A$ are selected.
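A minimal sketch of this selection step (ours; the solver that produces $A$ is not shown here) is:

```python
import numpy as np

def rrss_select(A, m):
    """Pick the m most active rows of the RRSS coefficient matrix A (n x n).

    Each row of A corresponds to one candidate point; its score is the row-sum
    of absolute values, and the m highest-scoring rows give the selected points.
    """
    scores = np.abs(A).sum(axis=1)
    return np.argsort(-scores)[:m]     # indices of the selected data points
```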

3. Active Learning via Local Structure Reconstruction

In this section, we introduce a novel active learning method, ALLSR, which takes the local invariance into account.

3.1. The objective function

From Eq. (3) and Eq. (6), we can see that both TED and RRSS reconstruct each point in a global way via a linear combination of all the selected points, which fails to take the underlying manifold structure into account. In this paper, we propose to select representative points by considering the local geometrical structure, which can describe the data structure more accurately. To capture the local geometrical structure, we rely on recent studies in spectral graph theory (Chung, 1997) and manifold learning (He et al., 2005; Roweis and Saul, 2000; Tenenbaum et al., 2000), which have demonstrated that the local geometric structure can be effectively modeled through a nearest-neighbor graph on a scatter of data points, and that doing so yields better recognition accuracy than methods that ignore the local geometric structure. Given the data matrix $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$, we can find the $k$ nearest neighbors of each data point and put edges between $x_i$ and its neighbors. There are many choices for constructing the weight matrix $S$ on the graph; in this paper, we adopt a simple weighting method as follows:

$$S_{ij} = \begin{cases} 1 & \text{if } x_i \in N_k(x_j) \ \text{or} \ x_j \in N_k(x_i) \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where $N_k(x_i)$ denotes the set of $k$ nearest neighbors of $x_i$. After obtaining the weight matrix $S$, we can compute the Laplacian matrix $L = D - S$, where $D$ is a diagonal matrix with $D_{ii} = \sum_{j} S_{ij}$. We empirically found that more complex weighting schemes do not improve performance significantly, so Eq. (7) is chosen.
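A minimal sketch of this graph construction (ours; it uses a dense pairwise-distance computation and symmetrizes the k-NN relation exactly as Eq. (7) does):

```python
import numpy as np

def knn_graph_laplacian(X, k=10):
    """Binary symmetric k-NN weight matrix S (Eq. (7)) and Laplacian L = D - S.

    X : d x n data matrix whose columns are the data points.
    """
    n = X.shape[1]
    sq = (X ** 2).sum(axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)   # pairwise squared distances
    np.fill_diagonal(dist, np.inf)                       # a point is not its own neighbour
    S = np.zeros((n, n))
    nn = np.argsort(dist, axis=1)[:, :k]                 # k nearest neighbours of each point
    S[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    S = np.maximum(S, S.T)                               # "xi in Nk(xj) or xj in Nk(xi)"
    D = np.diag(S.sum(axis=1))
    return S, D - S
```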


Suppose $Q = [q_1, q_2, \ldots, q_n] \in \mathbb{R}^{d \times n}$ are the data reconstructed by a manifold learning method; we can then formulate the following model based on the Laplacian matrix:

$$\min_{Q} \ \mathrm{Tr}(Q L Q^{T}) \qquad (8)$$

which aims to make the reconstructed data points share the same local geometrical structure as the original data points $X$. By replacing the representative data points $X_s = [x_{s_1}, x_{s_2}, \ldots, x_{s_m}] \in \mathbb{R}^{d \times m}$ in Eq. (3) with $m$ representative reconstructed data points $Q_s = [q_{s_1}, q_{s_2}, \ldots, q_{s_m}] \in \mathbb{R}^{d \times m} \subset Q$ to reconstruct the whole set of original data points $X$, the following optimization model is obtained:

$$\min_{Q, Q_s, A} \ \sum_{i=1}^{n} \left( \|x_i - Q_s a_i\|_2^{2} + \mu \|a_i\|_2^{2} \right) + \alpha \, \mathrm{Tr}(Q L Q^{T}) \quad \text{s.t.} \ Q_s \subset Q, \ |Q_s| = m, \ A = [a_1, \ldots, a_n] \in \mathbb{R}^{m \times n} \qquad (9)$$

where the regularization parameter $\alpha$ controls the balance between the local geometric structure and the selected data points, and the regularization term ensures that $Q$ contains the local structure information.


Although the $L_{2,1}$-norm loss function is insensitive to outliers, it is sensitive to small losses (it penalizes small losses more than the $L_2$-norm does) (Nie et al., 2013a), and its computational cost is higher than that of the $L_2$-norm; thus we still use the $L_2$-norm loss function in our objective function. Similar to RRSS, the $L_{2,1}$-norm regularization can be used to select the data points. Thus we formulate our objective function as follows:

$$J(Q, A) = \min_{Q, A} \ \|X - Q A\|_F^{2} + \alpha \, \mathrm{Tr}(Q L Q^{T}) + \gamma \|A\|_{2,1} \qquad (10)$$

where $\alpha$ and $\gamma$ are two regularization parameters.

3.2. Optimization Algorithm


In this section, we give the optimization process of the proposed algorithm. With $A$ fixed, taking the derivative of Eq. (10) with respect to $Q$ and setting it to zero, the solution of Eq. (10) is:

$$Q = X A^{T} (A A^{T} + \alpha L)^{-1} \qquad (11)$$

With $Q$ fixed, Eq. (10) becomes:

$$\min_{A} \ \|X - Q A\|_F^{2} + \gamma \|A\|_{2,1} \qquad (12)$$

Since the derivative of $\|A\|_{2,1}$ equals the derivative of $\mathrm{Tr}(A^{T} G A)$ (Nie et al., 2010), where $G$ is a diagonal matrix with the $i$-th diagonal element

$$g_{ii} = \frac{1}{2 \|a^{i}\|_2} \qquad (13)$$

the solution of Eq. (12) is equivalent to the solution of the following model:

$$\min_{A} \ \|X - Q A\|_F^{2} + \gamma \, \mathrm{Tr}(A^{T} G A) \qquad (14)$$

Solving it, we get:

$$A = (Q^{T} Q + \gamma G)^{-1} Q^{T} X \qquad (15)$$

According to the Woodbury matrix identity (Golub and Van Loan, 2012), Eq. (15) is equivalent to:

$$A = G^{-1} Q^{T} (Q G^{-1} Q^{T} + \gamma I_d)^{-1} X \qquad (16)$$

In summary, we present the procedure of ALLSR in Table 1. The algorithm to solve the optimization problem in Eq. (10) is provided in Stage II of ALLSR.

Table 1: Procedure of ALLSR
Input: Data set $X = [x_1, x_2, \ldots, x_n]$; regularization parameters $\alpha$, $\gamma$; neighborhood size $k$; number of selected points $m$
Output: Selected data points $[x_{s_1}, x_{s_2}, \ldots, x_{s_m}]$
Stage I: Graph construction
  1. Construct the nearest-neighbor graph $G(X, S)$;
  2. Calculate the Laplacian matrix $L$.
Stage II: Alternating optimization
  1. Initialize $A_{(0)} \in \mathbb{R}^{n \times n}$, $t = 0$;
  2. Repeat
       Update $G_{(t)}$, whose $i$-th diagonal element is $1 / (2\|a^{i}_{(t)}\|_2)$;
       Calculate $Q_{(t+1)} = X A_{(t)}^{T} (A_{(t)} A_{(t)}^{T} + \alpha L)^{-1}$;
       Calculate $A_{(t+1)} = G_{(t)}^{-1} Q_{(t+1)}^{T} (Q_{(t+1)} G_{(t)}^{-1} Q_{(t+1)}^{T} + \gamma I_d)^{-1} X$;
       $t = t + 1$;
     until convergence
Stage III: Point selection
  1. Calculate the scores of all the points $\{\|a^{i}\|_2\}_{i=1}^{n}$;
  2. Sort the scores in descending order and select the largest $m$ values; their indices form the selected index set $[s_1, s_2, \ldots, s_m]$;
  3. Obtain the selected points $[x_{s_1}, x_{s_2}, \ldots, x_{s_m}]$.
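The three stages of Table 1 translate almost line by line into the following sketch (ours; it reuses the `knn_graph_laplacian` helper sketched in Section 3.1, uses dense inverses for clarity, and the default parameter values are illustrative only):

```python
import numpy as np

def allsr_select(X, m, alpha=1e-3, gamma=1e-2, k=10, n_iter=20, eps=1e-8):
    """Select m representative columns of X (d x n) following Table 1."""
    d, n = X.shape
    _, L = knn_graph_laplacian(X, k)                                   # Stage I
    A = np.eye(n)                                                      # Stage II: A_(0)
    for _ in range(n_iter):
        # G^{-1} is diagonal with entries 2*||a^i||_2 (Eq. (13)); eps guards zero rows
        Ginv = np.diag(2.0 * np.maximum(np.linalg.norm(A, axis=1), eps))
        Q = X @ A.T @ np.linalg.inv(A @ A.T + alpha * L)               # Eq. (11)
        A = Ginv @ Q.T @ np.linalg.inv(Q @ Ginv @ Q.T + gamma * np.eye(d)) @ X   # Eq. (16)
    scores = np.linalg.norm(A, axis=1)                                 # Stage III
    return np.argsort(-scores)[:m]
```

In practice one would stop the loop once the objective of Eq. (10) changes by less than a tolerance; Section 3.3 shows that this objective decreases monotonically, and Section 4.3 reports convergence within roughly 10-20 iterations.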

3.3. Convergence Analysis

As seen from Table 1, we solve the optimization problem in Eq. (10) in an alternating way, so we now present its convergence behavior. First, a lemma is introduced.

Lemma 1 (Nie et al., 2010). For any non-zero vectors $a, b \in \mathbb{R}^{n}$, the following inequality holds:

$$\|a\|_2 - \frac{\|a\|_2^{2}}{2\|b\|_2} \ \le \ \|b\|_2 - \frac{\|b\|_2^{2}}{2\|b\|_2} \qquad (17)$$

Next, we prove the convergence of the optimization procedure of Eq. (12) in the following theorem.

Theorem 1. The optimization procedure in the second stage of Table 1 monotonically decreases the objective value in each iteration.


Proof. Since the derivative of $\|A\|_{2,1}$ equals the derivative of $\mathrm{Tr}(A^{T} G A)$, the solution of Eq. (10) equals the solution of the following optimization model:

$$\min_{Q, A} \ \mathrm{Tr}\big((X - QA)^{T}(X - QA)\big) + \alpha \, \mathrm{Tr}(Q L Q^{T}) + \gamma \, \mathrm{Tr}(A^{T} G A) \qquad (18)$$

As seen from Stage II of Table 1, suppose $Q_{(t+1)}$ and $A_{(t+1)}$ are the updated values of $Q_{(t)}$ and $A_{(t)}$ in the $t$-th iteration, respectively. According to convex optimization theory (Boyd and Vandenberghe, 2004), the following inequality holds:

$$\mathrm{Tr}\big((X - Q_{(t+1)}A_{(t+1)})^{T}(X - Q_{(t+1)}A_{(t+1)})\big) + \alpha\,\mathrm{Tr}(Q_{(t+1)}LQ_{(t+1)}^{T}) + \gamma\,\mathrm{Tr}(A_{(t+1)}^{T}G_{(t)}A_{(t+1)}) \ \le \ \mathrm{Tr}\big((X - Q_{(t)}A_{(t)})^{T}(X - Q_{(t)}A_{(t)})\big) + \alpha\,\mathrm{Tr}(Q_{(t)}LQ_{(t)}^{T}) + \gamma\,\mathrm{Tr}(A_{(t)}^{T}G_{(t)}A_{(t)}) \qquad (19)$$

Due to $g_{ii} = \frac{1}{2\|a^{i}\|_2}$, the above inequality suggests:

$$\mathrm{Tr}\big((X - Q_{(t+1)}A_{(t+1)})^{T}(X - Q_{(t+1)}A_{(t+1)})\big) + \alpha\,\mathrm{Tr}(Q_{(t+1)}LQ_{(t+1)}^{T}) + \gamma\,\mathrm{Tr}(A_{(t+1)}^{T}G_{(t+1)}A_{(t+1)}) + \gamma\sum_{i=1}^{n}\Big(\frac{\|a_{(t+1)}^{i}\|_2^{2}}{2\|a_{(t)}^{i}\|_2} - \|a_{(t+1)}^{i}\|_2\Big) \ \le \ \mathrm{Tr}\big((X - Q_{(t)}A_{(t)})^{T}(X - Q_{(t)}A_{(t)})\big) + \alpha\,\mathrm{Tr}(Q_{(t)}LQ_{(t)}^{T}) + \gamma\,\mathrm{Tr}(A_{(t)}^{T}G_{(t)}A_{(t)}) + \gamma\sum_{i=1}^{n}\Big(\frac{\|a_{(t)}^{i}\|_2^{2}}{2\|a_{(t)}^{i}\|_2} - \|a_{(t)}^{i}\|_2\Big) \qquad (20)$$

Recalling the result in Lemma 1, we get:

$$\frac{\|a_{(t+1)}^{i}\|_2^{2}}{2\|a_{(t)}^{i}\|_2} - \|a_{(t+1)}^{i}\|_2 \ \ge \ \frac{\|a_{(t)}^{i}\|_2^{2}}{2\|a_{(t)}^{i}\|_2} - \|a_{(t)}^{i}\|_2 \qquad (21)$$

Combining Eq. (20) and Eq. (21), it becomes:

$$\mathrm{Tr}\big((X - Q_{(t+1)}A_{(t+1)})^{T}(X - Q_{(t+1)}A_{(t+1)})\big) + \alpha\,\mathrm{Tr}(Q_{(t+1)}LQ_{(t+1)}^{T}) + \gamma\,\mathrm{Tr}(A_{(t+1)}^{T}G_{(t+1)}A_{(t+1)}) \ \le \ \mathrm{Tr}\big((X - Q_{(t)}A_{(t)})^{T}(X - Q_{(t)}A_{(t)})\big) + \alpha\,\mathrm{Tr}(Q_{(t)}LQ_{(t)}^{T}) + \gamma\,\mathrm{Tr}(A_{(t)}^{T}G_{(t)}A_{(t)}) \qquad (22)$$

This suggests that the objective function in Eq. (10) will monotonically decrease in each iteration. Beyond that, Eq. (18)-Eq. (22) also suggest that the objective function in Eq. (10) decreases monotonically in each iteration for any parameters $\alpha, \gamma \ge 0$. Therefore, the iterative algorithm converges because the objective function is bounded below (by zero). Besides, our experimental results show that it converges very fast, within 20 iterations in most cases.

3.4. Time Complexity Analysis

The proposed method ALLSR contains the three main stages shown in Table 1. The cost of Stage I is dominated by the nearest-neighbor search used to build the graph and its Laplacian. For each iteration of Stage II, updating $Q$, $G$ and $A$ requires a set of matrix multiplications together with the inversion of an $n \times n$ matrix (Eq. (11)) and of a $d \times d$ matrix (Eq. (16)), and the overall cost of Stage II grows linearly with the number of iterations. Stage III only scores and sorts the $n$ candidate points. Usually ALLSR spends less time than RRSS, while it needs more operations than the other three popular methods, K-means, QUIRE (Huang et al., 2014) and TED.

3.5. Linear Regression Classifier (LRC)


Since TED is derived from a linear regression model, we define a classifier based on the linear regression model to better explore the potential classification performance of the data points selected by TED and its variants. The defined classifier is:

$$\min_{W} \ \sum_{i=1}^{m} \|W^{T} x_{s_i} - y_{s_i}\|_2^{2} + \lambda \|W\|_F^{2} \qquad (23)$$

where $W = [w_1, w_2, \ldots, w_c] \in \mathbb{R}^{d \times c}$ is the projection matrix, $X_s = [x_{s_1}, x_{s_2}, \ldots, x_{s_m}] \in \mathbb{R}^{d \times m}$, $x_{s_i}$ is the $i$-th selected data point, and $Y_s = [y_{s_1}, y_{s_2}, \ldots, y_{s_m}]^{T} \in \mathbb{R}^{m \times c}$ is the corresponding low-dimensional indicator matrix, where $c$ is the number of classes and $Y_s$ is defined as:

$$y_{s_i}(j) = \begin{cases} 1 & \text{if } x_{s_i} \in j\text{-th class}, \ j = 1, 2, \ldots, c \\ 0 & \text{otherwise} \end{cases} \qquad (24)$$

The solution of Eq. (23) is:

$$W = (X_s X_s^{T} + \lambda I_d)^{-1} X_s Y_s \qquad (25)$$

where $I_d \in \mathbb{R}^{d \times d}$ is the identity matrix. Given a test sample $x \in \mathbb{R}^{d}$, the decision function is:

$$\arg\max_{1 \le j \le c} \ w_j^{T} x \qquad (26)$$

Based on Eq. (26), we can determine the predicted label of $x$: for example, if $w_k^{T} x$ attains the maximum value among $w_j^{T} x$ $(1 \le j \le c)$, then $x$ is regarded as belonging to the $k$-th class.


Our classifier is similar to the classifier defined by linear regression of an indicator matrix (Hastie et al., 2009), but the proposed formulation has a regularization term, whose goal is to enlarge the bias, reduce the variance, and ensure a unique solution for $W$. For simplicity, we name this classifier the Linear Regression Classifier (LRC) in this paper.
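A compact numpy sketch of LRC (ours; $\lambda$, the function names and the label encoding are illustrative):

```python
import numpy as np

def lrc_fit(Xs, labels, n_classes, lam=1e-2):
    """Train the LRC of Eqs. (23)-(25) on the selected points.

    Xs     : d x m matrix of selected points (columns are points)
    labels : length-m integer class labels in {0, ..., n_classes - 1}
    """
    d, m = Xs.shape
    Ys = np.zeros((m, n_classes))
    Ys[np.arange(m), labels] = 1.0                                # indicator matrix, Eq. (24)
    return np.linalg.solve(Xs @ Xs.T + lam * np.eye(d), Xs @ Ys)  # W, Eq. (25)

def lrc_predict(W, X_test):
    """Assign each test column to the class with the largest response, Eq. (26)."""
    return np.argmax(W.T @ X_test, axis=0)
```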


4. Experiments and Analysis

To demonstrate the effectiveness of our proposed algorithm ALLSR, we test it through toy examples and classification experiments.

Fig. 2. Data selection by active learning algorithms on the two-circle data set. (a) TED; (b) RRSS; (c) ALLSR; (d) the reconstructed data points $Q$. Small circle dots represent one class, star dots the other class, and the dots in the small triangles are the selected data points.

4.1. Toy examples

In this section, we apply three active learning algorithms, TED, RRSS and ALLSR, to two synthetic data sets to directly illustrate how each algorithm works. The synthetic data sets are: 1) the two-circle data set (Fig. 2), where the big circle contains 40 points and the small circle contains 20 points; and 2) the two-moon data set (Fig. 3), where there are 100 points in each moon. We apply TED, RRSS and ALLSR to select the most representative points on the two data sets, setting the selected number to $m = 10$ in Fig. 2 and $m = 20$ in Fig. 3. Small circle dots represent one class, star dots the other class, and the dots in the small triangles are the selected data points. To clearly show the influence of the local structure on the data points, we also present the reconstructed data points. Note that only two reconstructed data points are visible in Fig. 2(d) and Fig. 3(d), because the reconstructed data points in each class cluster to one point; the main reason is that the reconstructed data $Q$ can be seen as shrunk data with a clearer cluster structure (Hou et al., 2013).
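The two synthetic data sets are not distributed with the paper; data of the same shape can be generated, for example, with scikit-learn's toy-data generators (a sketch under our own assumptions — the noise levels and seeds are illustrative):

```python
from sklearn.datasets import make_circles, make_moons

# Two-circle data set: 40 points on the big circle and 20 on the small one (Fig. 2)
X_circ, y_circ = make_circles(n_samples=(40, 20), factor=0.5, noise=0.05, random_state=0)

# Two-moon data set: 100 points per moon (Fig. 3)
X_moon, y_moon = make_moons(n_samples=200, noise=0.05, random_state=0)

# The selection routines sketched above expect points as columns (d x n)
X_circ, X_moon = X_circ.T, X_moon.T
```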


Fig. 2(a) and Fig. 3(a) show the data points selected by TED with a linear kernel. Although RRSS adopts the $L_{2,1}$-norm loss function, it still tends to select the points with large absolute values (Fig. 2(b) and Fig. 3(b)), mainly because imposing the $L_1$-norm on the data points cannot fully eliminate the influence of large absolute values. As can be seen from these two examples, Fig. 2(c) and Fig. 3(c) show that the data points selected by ALLSR reflect the manifold structure more accurately, while Fig. 2(d) and Fig. 3(d) suggest that the reconstructed points are more easily clustered once the local structure information is added, which is why using the reconstructed data $Q$ is more suitable for selecting data points than using the original data $X$.

0.8

0.8

0.6

0.6 0.4

0.4 0.2

0.2 0

0

-0.2

-0.2

-0.6 -0.6

-0.8

-1.5

-1

-0.5

0

0.5

1

1.5

2

-1 -2

-1.5

-1

-0.5

(a) TED 139.9

0.8

139.8

0.6

139.7

0.4

139.6

0.2

139.5

0

139.4

-0.2

139.3

-0.4

139.2

-0.6

139.1

0.5

139

-0.8 -1 -2

0

-1.5

-1

-0.5

0

0.5

1

1.5

(b)RRSS

1

1

1.5

2

138.9 40

41

42

43

44

45

2

M

-0.8 -2

AN US

-0.4 -0.4

46

47

48

ED

(c) ALLSR (d) Reconstructed points Fig. 3. Data selection by active learning algorithms on two-moon data set. (a) TED; (b) RRSS; (c) ALLSR; (d) The reconstructed data points Q and the selected points in ALLSR. Small circle dots represent one class and star dots are the other class, and the dots in the small triangles are the selected data points.

4.2. Classification experiments


In this section, we present the classification performance of the data points selected by ALLSR. For comparison, we also show the performance of the data points selected by Random Sampling (RS), one classical clustering method, K-means (KM), and three active learning methods: QUIRE (Huang et al., 2014), TED (Yu et al., 2006) and RRSS (Nie et al., 2013b). In our experiments, we adopt two typical databases as follows. The Yale face database (Georghiades et al., 2001) contains 165 gray-scale images of 15 individuals. Each subject has 11 images, covering variations in lighting conditions (left-light, center-light and right-light), facial expression (normal, happy, sad, sleepy, surprised and wink) and with/without glasses. In this experiment, all the images were cropped and resized to 32 × 32 pixels, with 256 gray levels per pixel, so each image is represented by a 1024-dimensional vector. Ten face images of one subject are shown in Fig. 4(a). The AR face database (Martinez and Benavente, 2003) contains over 4000 color images of 126 human subjects (70 men and 56 women). The images were captured as frontal views of faces with different facial expressions, illumination conditions and occlusions. Here, we chose 2600 face images of 100 individuals (50 men and 50 women) under different facial expressions, illumination conditions and occlusions, and they were manually cropped and resized to 33 × 30 pixels. Ten face images of one subject are shown in Fig. 4(b).

Fig. 4. Ten image samples in each database. (a) Yale, (b) AR.

4.2.1. Classifier Selection

In order to measure the classification performance of the selected points, apart from the proposed LRC, we also employ a Support Vector Machine (SVM) with a linear kernel to train a model for classifying the points in the test dataset. The adopted type of SVM is one-versus-all (OVA), which trains $c$ binary classifiers, each separating one class (positive) from all the other classes (negative).
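For reference, such a one-versus-all linear SVM baseline can be obtained directly from scikit-learn (a sketch under our assumptions; hyperparameters are left at their defaults rather than the values used in the paper):

```python
from sklearn.svm import LinearSVC

def ova_svm_accuracy(X_train, y_train, X_test, y_test):
    """One-versus-all linear SVM; X_* are d x n matrices whose columns are samples."""
    clf = LinearSVC()            # trains c binary one-vs-rest classifiers internally
    clf.fit(X_train.T, y_train)  # scikit-learn expects samples as rows
    return clf.score(X_test.T, y_test)
```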

4.2.2. Experimental results

In our experiments, each dataset was randomly split into two subsets, one used for training and the other for testing. For the Yale face database, a subset containing 6 images of each individual was used as the training set and the remaining images were used as the test set, so there were 90 images in the training set and 75 images in the test set; each active learning algorithm was applied to select a given number m = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50] of points from the training set. For the AR database, a subset containing 10 images of each individual was used as the training set and the remaining images were used as the test set, so there were 1000 images in the training set and 1600 images in the test set; each active learning algorithm was applied to select a given number m = [40, 80, 120, 160, 200, 240, 280, 320, 360, 400] of points from the training set. We then used the selected points to train a model and classified the test data. We repeated this process 20 times and calculated the average recognition accuracy. For the graph construction, we empirically selected 10 nearest neighbors. For all the regularization parameters in RRSS and in ALLSR, we adopted 5-fold cross validation by searching the grid [10^-5, 10^-4, 10^-3, 10^-2, 10^-1, 1, 10, 10^2, 10^3, 10^4, 10^5]. Table 2, Table 3, Fig. 5 and Fig. 6 show the classification accuracy of the six algorithms with the two classifiers; ALLSR outperforms the other algorithms in most cases, so the local structure can indeed improve the performance of the active learning process.

Table 2: Classification results (%) on the Yale face database by different classifiers. (a) SVM; (b) LRC.
(a) SVM
m    RS     KM     QUIRE  TED    RRSS   ALLSR
5    15.00  15.53  16.87  17.07  16.60  18.73
10   22.27  25.87  28.27  23.47  24.53  26.40
15   27.07  30.40  32.53  30.47  31.13  32.20
20   32.67  36.13  35.07  36.27  35.47  37.47
25   38.07  40.20  35.33  40.13  41.07  44.33
30   40.73  44.87  38.00  44.53  45.60  47.67
35   47.27  47.47  41.87  48.53  50.13  51.60
40   50.33  48.47  43.33  53.40  53.33  55.07
45   53.47  51.93  45.80  56.47  57.13  59.73
50   58.80  54.33  50.27  59.93  61.13  62.73
(b) LRC
m    RS     KM     QUIRE  TED    RRSS   ALLSR
5    16.24  16.87  17.50  17.53  18.13  18.80
10   24.20  25.20  23.40  27.40  26.67  28.33
15   30.27  35.40  31.07  34.00  32.40  36.20
20   36.80  41.07  36.67  40.53  39.00  42.47
25   43.00  46.80  41.27  47.33  44.93  47.80
30   48.93  52.67  46.40  51.87  49.40  53.60
35   53.13  57.20  51.47  56.93  56.93  57.93
40   58.13  60.13  54.20  59.53  61.53  62.80
45   62.20  64.40  58.87  63.80  65.93  66.67
50   65.07  67.67  62.40  66.07  69.00  69.67

Table 3: Classification results (%) on the AR face database by different classifiers. (a) SVM; (b) LRC.
(a) SVM
m    RS     KM     QUIRE  TED    RRSS   ALLSR
40   9.33   9.60   11.38  11.47  11.06  12.80
80   17.29  18.24  18.56  19.16  18.99  20.88
120  25.19  27.01  25.69  28.38  26.63  28.11
160  31.79  34.46  32.38  34.33  33.85  36.79
200  38.42  40.44  37.25  41.35  39.88  41.51
240  45.19  46.81  43.06  47.07  46.01  47.37
280  50.18  52.27  47.94  51.81  51.71  53.66
320  55.21  57.03  53.13  56.27  56.74  58.53
360  60.04  61.59  55.94  60.43  61.10  62.59
400  64.84  64.95  62.38  64.16  65.23  65.56
(b) LRC
m    RS     KM     QUIRE  TED    RRSS   ALLSR
40   14.05  13.94  16.31  15.07  15.38  16.36
80   25.96  26.40  25.69  27.29  27.14  29.34
120  35.76  38.21  36.13  38.18  37.46  39.07
160  45.74  47.42  45.63  47.07  46.18  48.19
200  52.28  54.04  46.56  54.82  53.56  56.69
240  60.11  62.38  51.81  60.51  61.76  63.56
280  66.04  67.45  58.56  67.30  67.50  69.23
320  71.03  71.48  65.25  71.68  72.58  73.81
360  74.97  76.92  70.06  75.64  76.07  78.30
400  78.06  79.58  78.13  78.51  79.27  80.63
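The evaluation protocol described above can be summarized by the sketch below (ours; it uses random sampling as the selector so that it stays self-contained, and assumes the `lrc_fit`/`lrc_predict` helpers sketched in Section 3.5 — any of the compared selection methods can be substituted for the `rng.choice` line):

```python
import numpy as np

def average_accuracy(X, y, n_train_per_class, m, n_runs=20, seed=0):
    """Average LRC accuracy over repeated random splits, following the protocol above.

    X : d x n data matrix (columns are images), y : length-n integer labels.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    accs = []
    for _ in range(n_runs):
        train_idx, test_idx = [], []
        for c in classes:                                   # per-class train/test split
            idx = rng.permutation(np.where(y == c)[0])
            train_idx.extend(idx[:n_train_per_class])
            test_idx.extend(idx[n_train_per_class:])
        train_idx, test_idx = np.array(train_idx), np.array(test_idx)
        sel = rng.choice(train_idx, size=m, replace=False)  # RS baseline; plug in ALLSR here
        W = lrc_fit(X[:, sel], y[sel], n_classes=len(classes))
        accs.append(np.mean(lrc_predict(W, X[:, test_idx]) == y[test_idx]))
    return float(np.mean(accs))
```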

Fig. 5. Classification results on the Yale database. (a) The average classification accuracy by SVM; (b) the average classification accuracy by LRC.

Fig. 6. Classification results on the AR database. (a) The average classification accuracy by SVM; (b) the average classification accuracy by LRC.

Besides, LRC achieves better classification accuracy than SVM for all six algorithms, probably because SVM is a binary classifier and cannot solve multi-class problems as effectively as binary-class problems. Additionally, we also report the parameters of the proposed method in Table 4 to ease reproduction. Applying ALLSR to other biometric applications (Lai et al., 2014c; Liu et al., 2014; Liu et al., 2015; Shen et al., 2011; Yuan et al., 2016) will be studied in the future.

Table 4: Parameters (para) of the proposed algorithm ALLSR on the two face databases.
Yale-B
m     5           10          15          20          25          30          35          40          45          50
para  (1e-5,1e-1) (1e-1,1e-3) (1e-4,1e-2) (1e-3,1e-2) (1e-5,1e-2) (1e-3,1e-2) (1e-5,1e-3) (1e-5,1e-2) (1e-4,1e-2) (1e-5,1e-3)
AR
m     40          80          120         160         200         240         280         320         360         400
para  (1e1,1e-3)  (1e1,1e-3)  (1e-1,1e-3) (1e-2,1e-3) (1e-2,1e-3) (1e-2,1e-3) (1e-5,1e-3) (1e-5,1e-3) (1e-5,1e-3) (1e-5,1e-3)

4.3. Convergence Analysis

When we ran the experiments, we also recorded the objective value at each iteration. Fig. 7 reports the convergence results when solving Eq. (10) on the Yale database with m = 25, from which we can see that our algorithm converges quickly, within 10 iterations, because it has a closed-form solution in each iteration. For different parameter settings and for the AR database, a behavior similar to the one in Fig. 7 was observed; for brevity, we do not present it. We empirically set the number of iterations to 20 in our algorithm, although a larger number can give slightly better results.

Fig. 7. The convergence results of solving Eq. (12).

4.4. Number selection for constructing the graph

In this section, we present the impact of the number of nearest points (k) used for graph construction on the classification performance. The impact on the two databases is very similar, so for brevity we only show the results on the Yale face database. As before, the training set contained 6 images of each individual and the remaining images were used as the test set; we then selected m = 30 points from the training set to train a model. The process was repeated 20 times and the mean accuracy was calculated. Fig. 8 shows the impact of different numbers k on the classification accuracy with the LRC classifier. As shown in Fig. 8, values of k in the range [5, 15] give almost the same classification results; as a result, we selected 10 nearest neighbors to construct the graph in our experiments. For the SVM classifier, similar findings were observed; due to limited space, we do not show them.

Fig. 8. The impact of different number of nearest points (k) for graph construction on classification performance with the LRC classifier.

5. Conclusion

In this paper, we proposed a novel active learning method to solve the early-stage experimental design problem. Instead of reconstructing the data points in a global way, we utilized the local structure to reconstruct the data points; as a result, our method can capture the nonlinear structure of the hidden manifold among the data points. Furthermore, we defined a novel classifier to explore the potential performance of the selected data points. Experimental results on two synthetic datasets and two face databases demonstrate the effectiveness of our method and of the novel classifier. However, the proposed method requires relatively high time costs, which may limit its applications; in the future, we will reduce its time complexity while maintaining the performance. Additionally, the two toy examples also show the differences between ALLSR and the other related methods, in particular suggesting that the reconstructed data points are helpful for clustering. Thus we will explore how to apply ALLSR to clustering algorithms (Zheng et al., 2015) or other learning methods (Gu et al., 2015a; Gu et al., 2015b; Gu and Sheng, 2016; Lai et al., 2014a; Lai et al., 2014b; Lu et al., 2016) in our future work.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant 61527808, the Open Projects Program of National Laboratory of Pattern Recognition and the Shenzhen Fundamental Research fund under Grant JCYJ20160531194840025.

References

(Atkinson et al., 2007) Atkinson, A., Donev, A., Tobias, R., 2007. Optimum experimental designs, with SAS. Oxford Univ. Press.
(Balcan et al., 2007) Balcan, M. F., Broder, A. Z., Zhang, T., 2007. Margin based active learning. Ann. Conf. Learn. Theory, pp. 35-50.
(Boyd and Vandenberghe, 2004) Boyd, S., Vandenberghe, L., 2004. Convex optimization. Cambridge University Press.
(Cai et al., 2011) Cai, D., He, X., Han, J., 2011. Graph regularized nonnegative matrix factorization for data representation. IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 8, pp. 1548-1560.
(Chung, 1997) Chung, F. R. K., 1997. Spectral graph theory. Am. Math. Soc.
(Cohn et al., 1996) Cohn, D. A., Ghahramani, Z., Jordan, M. I., 1996. Active learning with statistical models. J. Artificial Intell. Research, vol. 4, pp. 129-145.
(Flaherty et al., 2005) Flaherty, P., Jordan, M. I., Arkin, A. P., 2005. Robust design of biological experiments. Adv. Neural Inf. Process. Syst., pp. 363-370.
(Freund et al., 1997) Freund, Y., Seung, H. S., Shamir, E., Tishby, N., 1997. Selective sampling using the query by committee algorithm. Machine Learn., vol. 28, no. 2-3, pp. 133-168.
(Georghiades et al., 2001) Georghiades, A., Belhumeur, P., Kriegman, D., 2001. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 643-660.
(Golub and Van Loan, 2012) Golub, G. H., Van Loan, C. F., 2012. Matrix computations. JHU Press.
(Gu et al., 2015a) Gu, B., Sheng, V. S., Tay, K. Y., Romano, W., Li, S., 2015. Incremental support vector learning for ordinal regression. IEEE Trans. Neural Networks Learning Systems, vol. 26, no. 7, pp. 1403-1416.
(Gu et al., 2015b) Gu, B., Sheng, V. S., Wang, Z., Ho, D., Osman, S., Li, S., 2015. Incremental learning for ν-support vector regression. Neural Networks, vol. 67, pp. 140-150.
(Gu and Sheng, 2016) Gu, B., Sheng, V. S., 2016. A robust regularization path algorithm for ν-support vector classification. IEEE Trans. Neural Networks Learning Systems, doi: 10.1109/TNNLS.2016.2527796.
(Hadsell et al., 2006) Hadsell, R., Chopra, S., LeCun, Y., 2006. Dimensionality reduction by learning an invariant mapping. Comput. Vision Pattern Recog., pp. 1735-1742.
(Hastie et al., 2009) Hastie, T., Tibshirani, R., Friedman, J., 2009. The elements of statistical learning: data mining, inference, and prediction. Springer.
(He et al., 2005) He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H. J., 2005. Face recognition using Laplacianfaces. IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 3, pp. 328-340.
(Hou et al., 2013) Hou, C., Nie, F., Jiao, Y., Zhang, C., Wu, Y., 2013. Learning a subspace for clustering via pattern shrinking. Inf. Process. Manage., vol. 49, no. 4, pp. 871-883.
(Hu et al., 2013) Hu, Y., Zhang, D., Jin, Z., Cai, D., He, X., 2013. Active learning via neighborhood reconstruction. Int. Joint Conf. Artificial Intell., pp. 1572-1578.
(Huang et al., 2014) Huang, S., Jin, R., Zhou, Z., 2014. Active learning by querying informative and representative examples. IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 10, pp. 1936-1949.
(Lai et al., 2014a) Lai, Z., Wong, W. K., Xu, Y., Zhao, C., Sun, M., 2014. Sparse alignment for robust tensor learning. IEEE Trans. Neural Networks Learning Systems, vol. 25, no. 10, pp. 1779-1792.
(Lai et al., 2014b) Lai, Z., Xu, Y., Chen, Q., Yang, J., Zhang, D., 2014. Multilinear sparse principal component analysis. IEEE Trans. Neural Networks Learning Systems, vol. 25, no. 10, pp. 1942-1950.
(Lai et al., 2014c) Lai, Z., Xu, Y., Jin, Z., Zhang, D., 2014. Human gait recognition via sparse discriminant projection learning. IEEE Trans. Circuits Systems for Video Technology, vol. 24, no. 10, pp. 1651-1662.
(Lewis and Catlett, 1994) Lewis, D. D., Catlett, J., 1994. Heterogeneous uncertainty sampling for supervised learning. Int. Conf. Mach. Learn., pp. 148-156.
(Liu et al., 2014) Liu, J., Zhang, B., Zeng, H., Shen, L., Liu, J., Zhao, J., 2014. The beihang keystroke dynamics systems, databases and baselines. Neurocomputing, vol. 144, no. 1, pp. 271-281.
(Liu et al., 2015) Liu, F., Zhang, D., Shen, L., 2015. Study on novel curvature features for 3D fingerprint recognition. Neurocomputing, vol. 168, no. 1, pp. 599-608.
(Lu et al., 2016) Lu, Y., Lai, Z., Xu, Y., Li, X., Zhang, D., Yuan, C., 2016. Low-rank preserving projections. IEEE Trans. Cybernetics, vol. 46, no. 8, pp. 1900-1913.
(Martinez and Benavente, 2003) Martinez, A. M., Benavente, R., 2003. The AR face database. http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html.
(Nie et al., 2010) Nie, F., Huang, H., Cai, X., Ding, C., 2010. Efficient and robust feature selection via joint L2,1-norms minimization. Adv. Neural Inf. Process. Syst., pp. 1813-1821.
(Nie et al., 2012) Nie, F., Xu, D., Li, X., 2012. Initialization independent clustering with actively self-training method. IEEE Trans. Syst., Man, Cybern., Part B, vol. 42, no. 1, pp. 17-27.
(Nie et al., 2013a) Nie, F., Wang, H., Huang, H., Ding, C., 2013. Adaptive loss minimization for semi-supervised elastic embedding. Int. Joint Conf. Artificial Intell., pp. 1565-1571.
(Nie et al., 2013b) Nie, F., Wang, H., Huang, H., Ding, C., 2013. Early active learning via robust representation and structured sparsity. Int. Joint Conf. Artificial Intell., pp. 1572-1578.
(Roweis and Saul, 2000) Roweis, S., Saul, L., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science, vol. 290, no. 22, pp. 2323-2326.
(Seung et al., 1992) Seung, H. S., Opper, M., Sompolinsky, H., 1992. Query by committee. ACM Workshop Comput. Learn. Theory, pp. 287-294.
(Shen et al., 2011) Shen, L., Bai, L., Ji, Z., 2011. FPCODE: An efficient approach for multi-modal biometrics. International Journal of Pattern Recognition and Artificial Intelligence, vol. 25, no. 2, pp. 273-286.
(Tenenbaum et al., 2000) Tenenbaum, J., Silva, V., Langford, J., 2000. A global geometric framework for nonlinear dimensionality reduction. Science, vol. 290, no. 22, pp. 2319-2323.
(Wong et al., 2015) Wong, W. K., Lai, Z., Xu, Y., Wen, J., Ho, C. P., 2015. Joint tensor feature analysis for visual object recognition. IEEE Trans. Cybernetics, vol. 45, no. 11, pp. 2425-2436.
(Yu et al., 2006) Yu, K., Bi, J., Tresp, V., 2006. Active learning via transductive experimental design. Int. Conf. Mach. Learn., pp. 1081-1088.
(Yuan et al., 2016) Yuan, C., Sun, X., Lv, R., 2016. Fingerprint liveness detection based on multi-scale LPQ and PCA. China Communications, vol. 13, no. 7, pp. 60-65.
(Zhang et al., 2011) Zhang, L., Chen, C., Bu, J., Cai, D., He, X., Thomas, S., 2011. Active learning based on local linear reconstruction. IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 10, pp. 2026-2038.
(Zheng et al., 2015) Zheng, Y., Jeon, B., Xu, D., Wu, Q. M., Zhang, H., 2015. Image segmentation by generalized hierarchical fuzzy C-means algorithm. Journal of Intelligent Fuzzy Systems, vol. 28, no. 2, pp. 961-973.