A multilevel approach for learning from labeled and unlabeled data on graphs


Changshui Zhang, Fei Wang
State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing 100084, China

Article history: Received 30 June 2009; received in revised form 15 December 2009; accepted 27 December 2009

Keywords: Multilevel; Graph; Transduction; Multigrid

Abstract

Recent years have witnessed a surge of interest in graph based semi-supervised learning methods. The common denominator of these methods is that the data are represented by the nodes of a graph, the edges of which encode the pairwise similarities of the data. Despite their theoretical and empirical success, these methods share one major bottleneck: high computational complexity, since they usually need to solve a large-scale linear system of equations. In this paper, we propose a multilevel scheme for speeding up traditional graph based semi-supervised learning methods. Unlike other accelerating approaches, which rest on purely algebraic devices (such as conjugate gradient descent or Lanczos iteration) or on intuition alone, our method (1) has an explicit physical meaning with clear graph intuitions, and (2) has guaranteed performance, since it is closely related to algebraic multigrid methods. Finally, experimental results are presented to show the effectiveness of our method.

1. Introduction

Semi-supervised learning (SSL), which aims at learning from labeled and unlabeled data, has aroused considerable interest in the data mining and machine learning fields, since it is usually hard to collect enough labeled data points in practical applications. Various semi-supervised learning methods have been proposed in recent years, and they have been applied to a wide range of areas including text categorization, computer vision, and bioinformatics (see [8,36] for recent reviews). Moreover, it has been shown recently that the significance of semi-supervised learning is not limited to utilitarian considerations: humans perform semi-supervised learning too [15,23,35]. Therefore, understanding and improving semi-supervised learning will not only give us better solvers for real world problems, but also help us to better understand how natural learning comes about.

One key point for understanding semi-supervised learning approaches is the cluster assumption [8], which states that [32] (1) nearby points are likely to have the same label (local consistency); and (2) points on the same structure (such as a cluster or a submanifold) are likely to have the same label (global consistency). It is straightforward to associate the cluster assumption with the manifold analysis methods developed in recent years [3,18,24] (note that these methods are also in accordance with the way humans perceive the world [19]). The manifold based methods first assume that the data points (nearly) reside on a

low-dimensional manifold (which is called the manifold assumption in [8]), and then try to discover such a manifold by preserving some local structure of the data set. It is well known that graphs can be viewed as discretizations of manifolds [2]; consequently, numerous graph based SSL methods have been proposed in recent years, and graph based SSL has become one of the most active research areas in the semi-supervised learning community [8]. However, in spite of the intensive study of graph based SSL methods, there are still some open issues which have not been addressed properly, such as:

1. How to select an appropriate similarity measure between pairwise data automatically.
2. How to speed up these algorithms for handling large-scale data sets (since they usually require the computation of a matrix inverse).

In our previous papers, we proposed a way to address the first issue [27,28]. In this paper we will mainly concentrate on the acceleration issue.1

As another research field, psychologists have long studied the way people perceive the world, among which Gestalt psychology [30] is a movement that began just prior to World War I. It made important contributions to the study of visual perception and problem solving. The Gestalt approach emphasizes that we perceive objects as well-organized patterns


1 Note that a preliminary conference version of this paper has been published as [29].


rather than as separate component parts. According to this approach, when we open our eyes we do not see fractional particles in disorder. Instead, we notice larger areas with defined shapes and patterns. The ''whole'' that we see is something more structured and cohesive than a group of separate particles. For example, when we see an image, we do not perceive the intensity or color of each individual pixel; rather, we recognize the parts that compose the image, e.g. humans, sky, trees, etc. In other words, Gestalt theory tells us that humans perceive the world at some ''coarser'' level.

Inspired by the Gestalt laws, we propose a fast multilevel graph learning algorithm. In our method, the data graph is first coarsened level by level based on the similarity between pairwise data points (which is similar in spirit to grouping, in that for each group we only select one representative node); then the learning procedure can be performed on a graph of much smaller size. Finally, the solution on the coarsened graph is refined back level by level to get the solution of the initial problem. Moreover, as unsupervised learning can be viewed as a special case of semi-supervised learning, we will show that our multilevel method can easily be incorporated into graph based clustering methods. Our experimental results show that this strategy can improve the speed of graph based semi-supervised learning algorithms significantly, and we also give a theoretical guarantee on the performance of our algorithm.

The rest of this paper is organized as follows. Some notations and related works are introduced in Section 2. We formally present the new algorithm in Section 3 and analyze its connection with multigrid methods in Section 4. Section 5 discusses the relationship with other graph based methods. The experimental results are shown in Section 6, followed by the conclusions in Section 7.

2. Notations and related works

We suppose that there is a set of data points X = {x_1, ..., x_l, ..., x_{l+u}}, of which X_L = {x_1, x_2, ..., x_l} are labeled as y_i ∈ L (1 ≤ i ≤ l, where L = {1, 2, ..., C} is the label set), and the remaining points X_U = {x_{l+1}, ..., x_{l+u}} are unlabeled. The goal is to predict the labels of X_U.2

Generally, the graph based SSL methods [2,32,34] first model the whole data set as a weighted undirected graph G = (V, E), where the node set V corresponds to the data set X = X_L ∪ X_U and E is the edge set. Associated with each edge e_{ij} ∈ E is a weight w_{ij}, which is usually computed by

w_{ij} = exp(-β ||x_i - x_j||^2),    (1)

where β is a free parameter which is usually set empirically. Many methods have been proposed to determine β automatically [8]; however, there is no reliable approach [32].

In Zhou's terminology [32], the classification result on X can be represented by an n × C matrix F with nonnegative entries (here n = l + u), and x_i's label t_i is determined by

t_i = argmax_{j ≤ C} F_{ij}.    (2)

F_{ij} is the (i, j)-th entry of F, and can therefore be treated as a measure of the possibility that x_i belongs to class j (if we further normalize the rows of F so that Σ_j F_{ij} = 1 for all 1 ≤ i ≤ n, then F_{ij} can be regarded as the probability of x_i belonging to class j). Based on this point of view, we will call f^c, the c-th column of F, the classification vector of the c-th class. Similarly, we will call y^c the original label vector of class c, with its i-th entry

y^c_i = { 1 if x_i is labeled as c; 0 if x_i is unlabeled or its label is not c. }    (3)

For notational convenience, we will omit the superscripts and use f and y to denote an arbitrary classification vector and its corresponding original label vector. A common principle for graph based SSL is to minimize the following cost function, possibly under some constraints [21]:

J(f) = f^T S f + γ_1 Σ_{i=1}^{l} (f_i - y_i)^2 + γ_2 Σ_{u=l+1}^{n} f_u^2,    (4)

where the first term is the smoothness term, which measures the total variation of the data labels with respect to the intrinsic structure, and S is a smoothness matrix [2]; the second term is the fit term, which measures how well the predicted labels fit the original labels; and γ_2 > 0 is a regularization parameter. Table 1 lists some recently proposed methods and their corresponding parameters.3 In the remaining presentation we will follow the formulations in [27,31,32] and set γ_1 = γ_2 = γ.4 The optimal f can be obtained by setting ∂J(f)/∂f = 0, which results in (S + γI)f = γy. Obviously, this is a linear system of equations, and it is time consuming to solve when the size of the data set is large (e.g. the computation of (S + γI)^{-1} costs O(n^3)).

Table 1. Popular graph based semi-supervised learning objectives and their corresponding parameters.

Method | S | γ_1, γ_2 | Constraints
Graph mincuts [6] | L = D - W | γ_1 = ∞, γ_2 = 0 | f_i ∈ {0, 1}
Label propagation [34] | L = D - W | γ_1 = ∞, γ_2 = 0 | None
Gaussian random fields [34] | L = D - W | γ_1 = ∞, γ_2 = 0 | None
Local and global consistency [32] | L̂ = I - D^{-1/2} W D^{-1/2} | γ_1 = γ_2 > 0 | None
Tikhonov regularization [2] | L^p, p ∈ N, or exp(tL), t ∈ R | γ_1 = 1/l, γ_2 = 0 | None
Interpolated regularization [2] | L^p, p ∈ N, or exp(tL), t ∈ R | γ_1 = ∞, γ_2 = 0 | Σ_i f_i = 0
Linear neighborhood propagation [27] | L = I - W_a | γ_1 = γ_2 > 0 | None
Local learning regularization [31] | L = (I - A)^T (I - A) | γ_1 = γ_2 > 0 | None

2 In this paper we focus on the transduction problem; for induction please refer to the methods in [11,27].
3 In the table, W represents the pairwise similarity matrix, such that W(i, j) is the similarity between x_i and x_j, and D is the corresponding degree matrix, which is diagonal with D(i, i) = Σ_j W(i, j). Note that for linear neighborhood propagation, the matrix W_a is the aggregated similarity matrix with its (i, j)-th entry equal to α_ij, which is computed from Eq. (7).
4 In fact, we can also set γ_1 ≠ γ_2, and the whole framework in this paper will not change.
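To make this bottleneck concrete, the following sketch (our own illustration, not code from the paper) solves the regularized system (S + γI)f = γy directly, using the combinatorial Laplacian S = D - W that the paper adopts in Eq. (11); the weight matrix W, the label vector y and the value of γ are placeholders supplied by the caller. The direct factorization of this n × n system is what costs on the order of O(n^3) for dense matrices, which is exactly what the multilevel scheme of Section 3 is designed to avoid.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def direct_graph_ssl(W, y, gamma=0.01):
    """Baseline graph based SSL: solve (S + gamma*I) f = gamma*y with the
    combinatorial Laplacian S = D - W (Eq. (11)).  W is an (n, n) symmetric
    weight matrix and y is the original label vector of one class (Eq. (3))."""
    n = W.shape[0]
    d = np.asarray(W.sum(axis=1)).ravel()
    S = sp.diags(d) - W                          # S = D - W
    A = (S + gamma * sp.eye(n)).tocsc()          # positive definite system matrix
    return spla.spsolve(A, gamma * y)            # direct solve; dense inversion would be O(n^3)

# toy usage: a 4-node chain graph with the two end points labeled +1 / -1
W = sp.csr_matrix(np.array([[0., 1., 0., 0.],
                            [1., 0., 1., 0.],
                            [0., 1., 0., 1.],
                            [0., 0., 1., 0.]]))
y = np.array([1., 0., 0., -1.])
print(direct_graph_ssl(W, y))
```

For C classes one would solve this system once per label column y^c; since the coefficient matrix is shared, a single factorization can be reused for all classes.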

3. Multilevel semi-supervised learning on graphs

In this section we introduce our new multilevel graph based SSL algorithm in detail. First let us see how to construct the original graph.

3.1. Graph construction

Unlike some traditional methods which construct a fully connected graph [32,34], our strategy here is to construct a connected neighborhood graph as in [4]. This saves storage and makes the final problem sparse. More concretely, the neighborhood system for X is defined as follows.

Definition 1 (Neighborhood system). Let N = {N_i | x_i ∈ X} be a neighborhood system for X, where N_i is the neighborhood of x_i. Then N_i satisfies: (1) x_i ∉ N_i (self-exclusion); (2) x_i ∈ N_j ⇔ x_j ∈ N_i (symmetry).

In this paper, N_i is defined in the following way: x_j ∈ N_i iff x_j ∈ K_i or x_i ∈ K_j, where K_i is the set that contains the k nearest neighbors of x_i. Based on the above definitions, we can construct the graph G in which there is an edge linking nodes x_i and x_j iff x_j ∈ N_i. Thus we can define an n × n (n = l + u) weight matrix W for graph G, with its (i, j)-th entry

W_{ij} = w_{ij}, where w_{ij} > 0 if x_j ∈ N_i and w_{ij} = 0 otherwise.    (5)

Instead of considering pairwise relationships as in Eq. (1), as traditional graph-based methods do, we propose to incorporate the neighborhood information of each point into the estimation of w_{ij}.
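As a concrete reading of Definition 1 and Eq. (5), the sketch below (ours, not the authors' code) builds the symmetric k-nearest-neighbor neighborhood system with brute-force distances; in practice a spatial index (e.g. a KD-tree) would replace the O(n^2) distance computation. Only the sparsity pattern is produced here; the actual edge weights w_ij are filled in from the reconstruction weights of Eq. (7) below.

```python
import numpy as np
import scipy.sparse as sp

def neighborhood_system(X, k=5):
    """Symmetric k-NN neighborhood system of Definition 1:
    x_j is in N_i iff x_j is among the k nearest neighbors of x_i or vice
    versa, and never x_i itself (self-exclusion)."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                                # exclude x_i from K_i
    knn = np.argsort(d2, axis=1)[:, :k]                         # K_i for every point
    rows = np.repeat(np.arange(n), k)
    A = sp.csr_matrix((np.ones(n * k), (rows, knn.ravel())), shape=(n, n))
    return ((A + A.T) > 0).astype(float)                        # symmetrize: x_i in N_j <=> x_j in N_i

# usage: the non-zero pattern of the returned matrix is the edge set E of G
X = np.random.default_rng(0).normal(size=(200, 2))
N = neighborhood_system(X, k=5)
print(N.nnz, "directed edges")
```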



Similar to the method we proposed in [27], we assume that each data point can be linearly reconstructed from its k nearest neighbors; thus we should minimize the reconstruction error

ε_i = || x_i - Σ_{j: x_j ∈ K_i} α_{ij} x_j ||^2 = Σ_{j,k: x_j, x_k ∈ K_i} α_{ij} G^i_{jk} α_{ik}    (6)

for all 1 ≤ i ≤ n. Here α_{ij} is the contribution of x_j to x_i, subject to Σ_{j ∈ K_i} α_{ij} = 1 and α_{ij} ≥ 0, and G^i_{jk} = (x_i - x_j)^T (x_i - x_k) is the (j, k)-th entry of the local Gram matrix at point x_i. Obviously, the more similar x_j is to x_i, the larger α_{ij} will be; thus α_{ij} can be used to measure how similar x_j is to x_i. One issue that should be noted is that usually α_{ij} ≠ α_{ji}. The reconstruction weights of each data point can be obtained by solving the following n standard quadratic programming problems:

min_{α_{ij}}  Σ_{j,k: x_j, x_k ∈ K_i} α_{ij} G^i_{jk} α_{ik}
s.t.  Σ_j α_{ij} = 1,  α_{ij} ≥ 0.    (7)
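The following per-point sketch (our own illustration) solves the quadratic program of Eq. (7) with a general-purpose solver; the small diagonal regularizer added to the local Gram matrix is our addition for numerical stability and is not part of Eq. (6)-(7).

```python
import numpy as np
from scipy.optimize import minimize

def reconstruction_weights(xi, neighbors, reg=1e-3):
    """Solve Eq. (7) for one point:  min_a a^T G a  s.t. sum(a) = 1, a >= 0,
    where G_jk = (x_i - x_j)^T (x_i - x_k) is the local Gram matrix of Eq. (6).
    `neighbors` holds the k nearest neighbors of x_i as rows."""
    diff = xi[None, :] - neighbors                     # rows are (x_i - x_j)
    G = diff @ diff.T
    G = G + reg * np.trace(G) * np.eye(len(G))         # stabilizer (our addition, not in Eq. (7))
    k = len(G)
    res = minimize(lambda a: a @ G @ a,
                   x0=np.full(k, 1.0 / k),
                   jac=lambda a: 2.0 * G @ a,
                   bounds=[(0.0, None)] * k,
                   constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
                   method="SLSQP")
    return res.x

# usage: the weights are nonnegative and sum to one
rng = np.random.default_rng(0)
a = reconstruction_weights(rng.normal(size=3), rng.normal(size=(5, 3)))
print(a.round(3), a.sum())
```

The symmetrization w_ij = (α_ij + α_ji)/2 described next then only requires running this routine once per data point.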

However, we do not use α_{ij} as the similarity between x_i and x_j directly as in [27]. Recalling the definition of the neighborhood system in Definition 1, we compute w_{ij} by w_{ij} = (α_{ij} + α_{ji})/2, with α_{ij} = 0 if x_j ∉ K_i. This similarity is more reasonable since it is symmetric: on an undirected weighted graph, the weight on each edge should be symmetric with respect to its two nodes. Moreover, this similarity also holds some advantages over traditional Gaussian weights:

1. The hyperparameter σ controlling the Gaussian weights lies in a continuous space, while the hyperparameter k (i.e. the neighborhood size) controlling our proposed weights lies in a discrete space, which is much easier to tune.
2. As we showed in our previous paper [27], the final classification results are not sensitive to k when the weights are computed in our proposed way; in contrast, it is usually found that when using Gaussian weights, the parameter σ can bias the final classification results significantly.
3. The weights computed in our proposed way have a ''scale free'' property, i.e. they form a relative similarity measure insensitive to the distribution of the data set. On the contrary, methods using Gaussian weights may perform poorly when the data from different classes have different densities [33].

3.2. Multilevel semi-supervised learning on graphs

Below we introduce a novel multilevel scheme for semi-supervised learning on graphs. The scheme is composed of three phases: (1) graph coarsening; (2) initial classification; (3) solution refining. Fig. 1 provides a graphical overview of our algorithm, and the details of the approach are described in the following subsections.

Fig. 1. Overview of the multilevel scheme. We use hexagons to represent the data graphs, and the sizes of the hexagons correspond to the graph sizes. The circle and square represent the labeled points. Our scheme will first coarsen the data graph level by level (note that during this procedure the labeled points will always appear on the data graph), and then solve the classification problem on a coarsened graph of some appropriate level. Finally this classification result will be refined back level by level until we can get an approximate solution of the initial problem.

3.2.1. Graph coarsening

We now present the recursive, level by level graph coarsening phase of our method. In each coarsening step a new, approximately equivalent semi-supervised learning problem is defined, with the number of vertices of the graph reduced to a fraction (typically around 1/2) of the former graph. In the following we describe the first coarsening step.

Starting from graph G^0 = G (the superscript represents the level of the graph scale), we first split V^0 = V into two sets, C^0 and F^0, subject to C^0 ∪ F^0 = V^0 and C^0 ∩ F^0 = ∅. The set C^0 will be used as the node set of the coarser graph of the next level, i.e. V^1 = C^0, and the nodes in C^0 are called C-nodes, which are defined as follows.

Definition 2 (C-nodes and F-nodes). Given a graph G^l = (V^l, E^l), we split V^l into two sets, C^l and F^l, satisfying C^l ∪ F^l = V^l, C^l ∩ F^l = ∅, and C^l = V^{l+1}. A node is selected into C^l if and only if it satisfies one of the following conditions: (1) it is labeled; (2) it strongly influences at least one node in F^l on level l. We will call the nodes in C^l C-nodes and the nodes in F^l F-nodes. Here strongly influences means:

Definition 3 (Strongly influences). A node x_i strongly influences x_j on level l if

w^l_{ij} ≥ δ Σ_k w^l_{kj},    (8)

where 0 < δ < 1 is a control parameter and w^l_{ij} is the weight of the edge linking x_i and x_j on G^l.

In fact, z^l_{ij} = w^l_{ij} / Σ_k w^l_{kj} measures how much x_j depends on x_i. Since x_j only connects to its neighborhood, a larger z_{ij} implies a larger dependency of x_j on x_i. Intuitively, if x_j depends heavily on x_i, then we only need to retain x_i. The normalization makes z_{ij} a relative measure which is independent of the data distribution. Fig. 2 provides an intuitive illustration of the node selection procedure when coarsening the graph.5

Fig. 2. A graphical illustration of the point selection procedure when coarsening the data graph. A node will be selected to form the graph of a coarser level if and only if it is labeled or it is strongly coupled with at least one node in the finer graph.

Let f = f^0 be a classification vector we want to solve for, and let f^1 be its corresponding classification vector on G^1 (hence the dimensionality of f^1 equals n^1, the cardinality of V^1). As in other multilevel methods [25], we assume that f^0 can be approximately interpolated from f^1,6 that is,

f^0 ≈ P^{[0,1]} f^1,    (9)

where P^{[0,1]} is an interpolation matrix of size n^0 × n^1 (n^0 = n), subject to Σ_j P^{[0,1]}_{ij} = 1. Therefore, bringing Eq. (9) into Eq. (4), we obtain the coarsened problem, which aims to minimize

J(f^1) = (f^1)^T P^{[1,0]} S P^{[0,1]} f^1 + γ ||P^{[0,1]} f^1 - y||^2,    (10)

where P^{[1,0]} is the transpose of P^{[0,1]}. In this paper we will use the combinatorial Laplacian as the smoothness matrix because of its wide applicability [2,8,34], i.e.

S = D - W,    (11)

where W is the weight matrix constructed in the manner introduced in Section 3.1 and D is a diagonal matrix with its (i, i)-th entry D_{ii} = Σ_j W_{ij} = Σ_j w^0_{ij}. Written in element form,

J(f^0) = Σ_{i,j ∈ V^0} w^0_{ij} (f^0_i - f^0_j)^2 + γ Σ_{i ∈ V^0} (f^0_i - y_i)^2.

Substituting Eq. (9) into this expression gives

J(f^1) = Σ_{i,j ∈ V^0} w^0_{ij} ( Σ_{k ∈ V^1} P^{[0,1]}_{ik} f^1_k - Σ_{l ∈ V^1} P^{[0,1]}_{jl} f^1_l )^2 + γ Σ_{i ∈ V^0} ( Σ_{k ∈ V^1} P^{[0,1]}_{ik} f^1_k - y_i )^2
       = Σ_{k,l ∈ V^1} w^1_{kl} (f^1_k - f^1_l)^2 + γ Σ_{i ∈ V^0} ( Σ_{k ∈ V^1} P^{[0,1]}_{ik} f^1_k - y_i )^2,

where

w^1_{kl} = (1/2) Σ_{i,j} w^0_{ij} (P^{[0,1]}_{jl} - P^{[0,1]}_{il}) (P^{[0,1]}_{ik} - P^{[0,1]}_{jk}).    (12)

5 Note that when we select C-nodes from the unlabeled points, the strongly influenced points may differ if the order in which the points are checked changes. In our experiments, we first order the unlabeled data points according to their node degrees, select the first data point as a C-node, and then scan the data points according to this order.
6 Actually, as analyzed after Definition 3, the nodes in V^0 \ V^1 depend largely on the nodes in V^1; what we define in Eq. (9) just models such a dependence rule. This interpolation rule is simple and efficient, and it has also been widely used in multilevel or multigrid methods for solving partial differential equations [7,25], which is the reason why we apply it here.

The first term of the above equation can be considered as the smoothness of f^1 on graph G^1, and the weight of the edge linking x_k and x_l on G^1 can be computed by Eq. (12). Moreover, we can generalize the above conclusion and get the following theorem.

Theorem 1. To preserve the objective function, the edge weights on graph G^{l+1} can be computed from the edge weights on G^l by

w^{l+1}_{uv} = (1/2) Σ_{i,j} w^l_{ij} (P^{[l,l+1]}_{jv} - P^{[l,l+1]}_{iv}) (P^{[l,l+1]}_{iu} - P^{[l,l+1]}_{ju}).    (13)

Proof. See Appendix A.

One issue that should be mentioned here is that the above coarsening rule can be somewhat simplified to the following iterated weighted aggregation strategy [25], which computes w^{l+1}_{uv} by

w^{l+1}_{uv} = (1/2) Σ_{i,j} P^{[l,l+1]}_{iu} w^l_{ij} P^{[l,l+1]}_{jv}.    (14)

It can be shown that Eq. (14) in many situations provides a good approximation to Eq. (13) [20]. In certain cases the two processes are identical, e.g. when each data point is associated with only two data subsets in the fine graph. Moreover, Eq. (14) can also be motivated by itself: it states that the coupling between two subsets is the sum of the couplings between the graph nodes associated with these blocks, weighted appropriately. Now the only remaining problem is how to determine the interpolation matrix, and we leave this to Section 3.2.3.

To summarize, given a graph G^l = (V^l, E^l), we have to do three things to coarsen it to G^{l+1} = (V^{l+1}, E^{l+1}): (1) select the C-nodes from V^l based on Definition 2 and let V^{l+1} = C^l; (2) compute the interpolation matrix P^{[l,l+1]} (see Section 3.2.3 for details); (3) compute the edge weights by Eq. (13). After these three steps, the graph is coarsened to the next level. A sketch of one such coarsening step is given below.
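The following sketch is our own reading of Definitions 2-3 and footnote 5, not the authors' implementation: it selects the C-nodes with the strong-influence test of Eq. (8), scanning unlabeled points in decreasing degree order, and then applies the aggregation rule of Eq. (14) in the matrix form W^{l+1} = P^T W^l P used in Section 3.3 (the constant 1/2 of Eq. (14) is dropped there, as it only rescales the objective). The interpolation matrix P is assumed to be built as in Eq. (17) of Section 3.2.3.

```python
import numpy as np
import scipy.sparse as sp

def select_cnodes(W, labeled, delta=0.2):
    """C-node selection (Definition 2): keep every labeled node, plus any node
    that strongly influences (Eq. (8)) some node not yet covered by a C-node.
    Unlabeled nodes are scanned in decreasing degree order (footnote 5)."""
    W = sp.csr_matrix(W)
    n = W.shape[0]
    col_sum = np.asarray(W.sum(axis=0)).ravel()        # sum_k w_kj for every column j
    degree = np.asarray(W.sum(axis=1)).ravel()
    is_c = np.zeros(n, dtype=bool)
    is_c[list(labeled)] = True
    covered = is_c.copy()                              # nodes already strongly influenced by a C-node
    for i in sorted(set(range(n)) - set(labeled), key=lambda i: -degree[i]):
        row = W.getrow(i)
        strong = row.indices[row.data >= delta * col_sum[row.indices]]
        if np.any(~covered[strong]):                   # i strongly influences an uncovered F-node
            is_c[i] = True
            covered[strong] = True
    return np.flatnonzero(is_c)

def coarsen_weights(W, P):
    """Eq. (14) in matrix form (Section 3.3): W^{l+1} = P^T W^l P."""
    return sp.csr_matrix(P.T @ W @ P)
```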

3.2.2. Initial classification

Assume the data graph G has been coarsened recursively to some level s. The semi-supervised classification problem defined on G^s is then to minimize

J(f^s) = (f^s)^T P^{[s,s-1]} ··· P^{[1,0]} S P^{[0,1]} ··· P^{[s-1,s]} f^s + γ ||P^{[0,1]} ··· P^{[s-1,s]} f^s - y||^2,    (15)

where P^{[i,i-1]} = (P^{[i-1,i]})^T and S is defined in Eq. (11). Therefore, setting ∂J(f^s)/∂f^s = 0,

∂J(f^s)/∂f^s = 2 ( P^{[s,s-1]} ··· P^{[1,0]} S P^{[0,1]} ··· P^{[s-1,s]} f^s + γ P^{[s,s-1]} ··· P^{[1,0]} (P^{[0,1]} ··· P^{[s-1,s]} f^s - y) )
             = 2 ( L^s f^s - γ P^{[s,s-1]} ··· P^{[1,0]} y ) = 0
  ⟹  f^s = γ (L^s)^{-1} P^{[s,s-1]} ··· P^{[1,0]} y.    (16)

In the following, I is the n × n identity matrix. Moreover, we have the following theorem.

Theorem 2. The matrix L^s = P^{[s,s-1]} ··· P^{[1,0]} (S + γI) P^{[0,1]} ··· P^{[s-1,s]} is invertible.

Proof. For any a ≠ 0, since w_{ij} ≥ 0 and γ > 0, we have a^T (S + γI) a = Σ_{ij} w_{ij} (a_i - a_j)^2 + γ Σ_i a_i^2 > 0. Thus the matrix S + γI is positive definite. If rank(P^{[0,1]}) = n^1 (n^1 is the cardinality of V^1), then for an arbitrary vector a ≠ 0 we have P^{[0,1]} a ≠ 0. Letting r = P^{[0,1]} a, we obtain

a^T P^{[1,0]} (S + γI) P^{[0,1]} a = r^T (S + γI) r > 0.

Therefore P^{[1,0]} (S + γI) P^{[0,1]} is also positive definite. Similarly, if we can guarantee that rank(P^{[l,l-1]}) = n^l for all 1 ≤ l ≤ s (n^l is the cardinality of V^l), then L^s is invertible. Fortunately, the way we define the interpolation matrix (which will be introduced in the next subsection) meets this condition naturally (see Lemma 1). So L^s is invertible. □

Based on the above theorem, we can compute the initial classification vector using Eq. (16), in which we only need to compute the inverse of an n^s × n^s matrix, and usually n^s is much smaller than n. A minimal sketch of this step is given below.
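A minimal sketch of the initial classification step (ours, not the paper's code): the interpolation matrices of the individual levels are accumulated into P^{[0,s]}, the small coarse-level matrix L^s of Theorem 2 is formed, and Eq. (16) is solved densely, which is cheap because n^s ≪ n.

```python
import numpy as np
import scipy.sparse as sp

def initial_classification(S, y, P_list, gamma=0.01):
    """Solve Eq. (16):  f^s = gamma * (L^s)^{-1} P^{[s,s-1]} ... P^{[1,0]} y,
    with L^s = P^{[s,0]} (S + gamma*I) P^{[0,s]}.
    P_list holds the interpolation matrices P^{[0,1]}, ..., P^{[s-1,s]}."""
    P0s = P_list[0]
    for P in P_list[1:]:
        P0s = P0s @ P                                   # accumulated P^{[0,s]}
    A = S + gamma * sp.eye(S.shape[0])
    Ls = (P0s.T @ A @ P0s).toarray()                    # small n_s x n_s matrix (invertible, Theorem 2)
    fs = np.linalg.solve(Ls, gamma * (P0s.T @ y))       # coarse-level classification vector
    return fs, P0s
```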

3.2.3. Solution refining

Having obtained the initial classification vector from Eq. (16), we have to refine it level by level to get a classification vector on the initial graph G^0 = G. As stated in Section 3.2.1, we assume that the classification vector on graph G^l can be linearly interpolated from G^{l+1}, i.e. f^l = P^{[l,l+1]} f^{l+1}, where P^{[l,l+1]} is an n^l × n^{l+1} interpolation matrix subject to Σ_j P^{[l,l+1]}_{ij} = 1. Based on the simple geometric intuition that the label of a point should be similar to the labels of its neighbors (which is also consistent with the cluster assumption introduced in Section 1), we propose to compute P^{[l,l+1]}_{iI(j)} by

P^{[l,l+1]}_{iI(j)} = { w^l_{ij} / Σ_{k ∈ C^l} w^l_{ik}   if i ∉ C^l;
                        1                                 if i = j;
                        0                                 if i ∈ C^l, i ≠ j. }    (17)

In the above equation, the subscripts i, j, k denote indices of nodes in V^l. We assume that node j has been selected as a C-node, and I(j) is the index of j in V^{l+1}. It can easily be inferred that P^{[l,l+1]} has the following property.

Lemma 1. The interpolation matrix P^{[l,l+1]} has full rank.

Proof. Through elementary transforms, P^{[l+1,l]} = (P^{[l,l+1]})^T can always be written in the form [R  I_{n^{l+1}}], where I_{n^{l+1}} is an n^{l+1} × n^{l+1} identity matrix and R is a matrix composed of the remaining elements of P after the elementary transforms. Thus P^{[l,l+1]} has full rank. □

After all the interpolation matrices have been calculated using Eq. (17), we can use them to interpolate the initial classification vector level by level until we get a solution of the initial problem. A sketch of the interpolation matrix construction and the refining loop is given below.
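The refining phase can be sketched as follows (again our own illustration): the first routine builds the interpolation matrix of Eq. (17) from the fine-level weights and the chosen C-nodes, and the second interpolates the coarse solution back level by level, f^l = P^{[l,l+1]} f^{l+1}.

```python
import numpy as np
import scipy.sparse as sp

def interpolation_matrix(W, cnodes):
    """Eq. (17): a C-node keeps its own value; an F-node takes the weighted
    average of its C-node neighbors (rows sum to one)."""
    W = sp.csr_matrix(W)
    n = W.shape[0]
    col_of = {c: j for j, c in enumerate(cnodes)}        # I(j): column index of C-node j
    cset = set(cnodes)
    rows, cols, vals = [], [], []
    for i in range(n):
        if i in cset:
            rows.append(i); cols.append(col_of[i]); vals.append(1.0)
        else:
            row = W.getrow(i)
            mask = np.isin(row.indices, cnodes)          # C-node neighbors of the F-node i
            nbrs, w = row.indices[mask], row.data[mask]
            w = w / w.sum()                              # every F-node has a C-node neighbor by Definition 2
            rows.extend([i] * len(nbrs))
            cols.extend(col_of[j] for j in nbrs)
            vals.extend(w)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n, len(cnodes)))

def refine(f_coarse, P_list):
    """Solution refining: f^l = P^{[l,l+1]} f^{l+1}, from the coarsest level down to G^0."""
    f = f_coarse
    for P in reversed(P_list):
        f = P @ f
    return f
```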


3.3. Implementation costs

The implementation costs of our multilevel algorithm, including the storage costs and the computational costs, are analyzed in this section.

Storage costs: Assume that initially we are given n data points; then on the finest graph there are n graph nodes in total. At every coarsening step we select a subset of the nodes such that each remaining node is strongly coupled to at least one of the selected nodes. Following this selection procedure, almost no two neighboring nodes can survive to the next level. Thus, at each level we obtain about half the graph nodes of the previous level, i.e. the coarser graph requires about 2^{-1} times the storage of the next finer graph. Therefore the total storage cost can be expressed as the geometric series

SC = n (1 + 2^{-1} + 2^{-2} + ··· + 2^{-L}) < 2n,

where L is the final graph level of the coarsening procedure. Therefore, the storage cost is less than twice the storage requirement of the finest graph.

Computational costs: Given a data set X, the construction of the finest graph typically has a computational cost of O(nK^4), where n is the size of the data set and K is the size of the neighborhoods (since for each data point we need to solve a quadratic programming problem of scale K). After the finest graph has been constructed, the remaining multilevel computation consists of three phases: coarsening, initial classification, and interpolation. In the following we analyze their computational costs respectively.

(1) Coarsening: There are mainly two types of operations in the coarsening phase. The first is the selection of representative nodes, which can be done efficiently using bucket sort; its computational cost is O(n^l) for graph G^l (n^l is the number of nodes in G^l). So the total computational cost of selecting representative nodes is

O(n + (1/2) n + (1/2^2) n + ··· + (1/2^L) n) < 2 O(n) = O(n).

The other operation is that, when coarsening the graph from level l to level l+1, we need to compute the weight matrix W^{l+1} from W^l using Eq. (14), which can be written in matrix form as

W^{l+1} = (P^{[l,l+1]})^T W^l P^{[l,l+1]}.

Note that W^l and P^{[l,l+1]} are both sparse. The number of non-zero elements in W^l is approximately (1/2^l) nK, and the number of non-zero elements in each row of P^{[l,l+1]} is approximately K. Therefore the computational cost of coarsening the graph from level l to level l+1 is (1/2^l) O(nK^2). Since generally K ≪ n, the total computational cost of coarsening the graph from level 0 to level L is

CC_1 = O(nK^2) + (1/2) O(nK^2) + ··· + (1/2^L) O(nK^2) = (1 + 1/2 + ··· + 1/2^L) O(nK^2) < 2 O(nK^2).

(2) Initial classification: The initial classification is performed on the graph of the coarsest scale L. Typically, the number of nodes on G^L is O(log n); therefore solving a linear system of scale O(log n) usually requires a computational cost of scale O(log^3 n).

(3) Solution refining: The solution refining procedure simply interpolates the label vector obtained from the initial classification back to the label vector on the finest graph, level by level. Typically, assuming there are n^l nodes at level l, the interpolation of the label vector from level l to level l-1 involves a computational cost of scale O(K n^l) (since there are fewer than K non-zero elements in each column of P^{[l-1,l]}). Therefore, the total computational cost for solution refining is less than O(K L log n) (note that K ≪ n and usually L ≤ 10).

3.4. A toy example

To provide an intuitive illustration of the proposed method, we first give a toy example in Fig. 3. The leftmost figure of Fig. 3 shows the original problem, which is a two-circle pattern containing 2252 data points, with only two labeled. From left to right, Fig. 3 shows the coarsened data graphs from level 1 to level 3. On all these graph levels, our method can correctly classify all the data points. On our Pentium IV 2.2 GHz computer, directly predicting the data labels on the original data set (level 0) takes 25.7670 s, but it only takes 2.1330 s using our multilevel approach with the data graph coarsened to level 4, and both methods produce the correct result.

4. Relationship with multigrid methods

Multigrid methods [25] are a state-of-the-art technique for solving large systems of linear equations Ax = b, where A ∈ R^{n×n} and b ∈ R^{n×1}. They accelerate the convergence of a basic iterative method by global correction from time to time, accomplished by solving a set of coarse problems. Basically, we can view the linear equation system as a graph of n nodes where an edge (i, j) represents a non-zero coefficient A_{ij}. To simplify the following illustration, we assume that graph to be a regular two-dimensional grid. The basic idea of multigrid is to define a hierarchy of grids, where each node at the coarser grid level represents a subset of nodes at the finer level. The multigrid method is then mainly composed of three steps: (1) smoothing: reducing high frequency errors, for example using a few iterations of the Gauss–Seidel method; (2) restriction: downsampling the residual error to a coarser grid; (3) interpolation or prolongation: interpolating a correction computed on a coarser grid back to a finer grid.

If we treat the data graph G as an algebraic grid [25], then minimizing Eq. (4) also results in a system of linear equations on these grid points (i.e. data points), since ∂J(f)/∂f = 0 gives

(S + γI) f = γ y.    (18)

One interesting point is that if we use S = S_c = D^{-1/2}(D - W)D^{-1/2} as in [32], then we have the following conclusion [5].

Lemma 2. The iterative framework proposed in [32] is just a Jacobi iteration [25] for minimizing

J_c = f^T S_c f + γ ||f - y||^2,    (19)

which implies that we can apply multigrid methods to accelerate Zhou's consistency method [32].

Proof. See [5]. □

Furthermore, on a coarse level, the minimization of Eq. (15) also leads to a system of equations:

(P^{[s,s-1]} ··· P^{[1,0]} (S + γI) P^{[0,1]} ··· P^{[s-1,s]}) f^s = γ P^{[s,s-1]} ··· P^{[1,0]} y,    (20)

which is just the coarse-level algebraic multigrid (AMG) system for Eq. (18) based on the Galerkin principle [25]. The main differences between the two methods are that: (1) we omit the initial smoothing step of AMG; (2) the AMG method approximates the residual error on a coarse grid (see [25] for details), while our method directly approximates the solution on a coarse grid; (3) we directly interpolate the solution from the coarser grid back to the finer grid instead of a correction. Due to these three simplifications, our algorithm can perform more efficiently than the traditional AMG method. Moreover, if we define S as in Eq. (11), then we can derive the following theorem.

Theorem 3. Let P^{[0,s]} = P^{[0,1]} ··· P^{[s-1,s]}, P^{[s,0]} = (P^{[0,s]})^T, Q^0 = S + γI, Q^s = P^{[s,0]} Q^0 P^{[0,s]}, and let R(P^{[0,s]}) denote the column space of P^{[0,s]}; clearly, f^0 ∈ R(P^{[0,s]}). Let f* be the exact solution to Eq. (18) and e = f* - γ P^{[0,s]} (Q^s)^{-1} P^{[s,0]} y the prediction error vector. Then

||e||^2 = min_{v ∈ R(P^{[0,s]})} ||f* - v||^2.    (21)

Proof. See Appendix B.

Theorem 3 implies that our method gives a good approximation to the solution of the original problem. Another observation is that, analyzing our method from the multigrid point of view, the interpolation strategy we use when refining the predicted labels back is just the direct interpolation [22] used in multigrid methods, which is the simplest one. There are also other schemes, such as standard interpolation, Jacobi interpolation and extended interpolation, which are more complicated. However, in our experiments we found that such simple interpolation is good enough for real world applications, so we did not try the other approaches.
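To make Lemma 2 concrete, the sketch below (our own illustration) runs the Jacobi iteration for the system (S_c + γI)f = γy with S_c = I - D^{-1/2} W D^{-1/2}, assuming W has an empty diagonal; with α = 1/(1+γ) each sweep reduces to f ← α D^{-1/2} W D^{-1/2} f + (1-α) y, which is exactly the iterative scheme of [32].

```python
import numpy as np
import scipy.sparse as sp

def jacobi_consistency(W, y, gamma=0.01, n_iter=200):
    """Jacobi iteration for (S_c + gamma*I) f = gamma*y, where
    S_c = I - D^{-1/2} W D^{-1/2} (cf. Lemma 2).  Assumes W_ii = 0."""
    d = np.asarray(W.sum(axis=1)).ravel()
    D_inv_sqrt = sp.diags(1.0 / np.sqrt(d))
    N = D_inv_sqrt @ W @ D_inv_sqrt                 # normalized similarity D^{-1/2} W D^{-1/2}
    alpha = 1.0 / (1.0 + gamma)                     # the Jacobi diagonal is (1 + gamma)
    f = np.zeros_like(y, dtype=float)
    for _ in range(n_iter):
        f = alpha * (N @ f) + (1.0 - alpha) * y     # the iteration of [32]
    return f
```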

5. Relationship with other graph based methods

There are also some other research works that aim to accelerate graph based algorithms. Karypis et al. [16] present a multilevel paradigm for hypergraph partitioning, where a sequence of successively coarser hypergraphs is constructed by recursive bisection. At each step, a bisection of the smallest hypergraph is computed, and it is used to obtain a bisection of the original hypergraph by successively projecting and refining the bisection to the next finer hypergraph. In [17] the authors further improved their algorithm by changing the graph coarsening strategy: at each step a set of nodes in the finer graph is aggregated to form one node in the coarser graph, and their corresponding edges are also accumulated. They make use of a graph matching algorithm to achieve the coarsening, and they showed that this leads to an O(log C) (C is the number of clusters) improvement over the previous recursive bisection algorithm. The authors of [1] further extend the graph matching methods to allow arbitrarily sized sets of vertices to be collapsed together, and they apply their algorithm to partition power-law graphs in a multilevel way. Dhillon et al. [13] also make use of such a coarsening strategy to partition the graph; the difference is that their algorithm allows the final cluster sizes to be arbitrary, while the works in [16,1,17] only allow equi-size partitioning.

The main theme behind these methods is ''aggregation'', i.e. they treat a union of nodes in the finer graph as a supernode in the coarser graph (that is, there is a many-to-one correspondence between the nodes on the finer and coarser graphs), and these supernodes are found by graph matching algorithms. In our algorithm, the nodes in the coarser graph are ''representative'' nodes of the finer graph, i.e. there is a one-to-one correspondence between the nodes in the coarser graph and nodes in the finer graph.7 Moreover, we select a representative node according to its influence on other nodes, and the edge weights on the coarser graph are computed in a way that preserves the objective function, not by mere aggregation.

7 Here ''one-to-one'' correspondence means that for each node in the coarser graph, we can find a unique image in the finer graph from which it was selected. On the contrary, for traditional aggregation based algorithms, each node in the coarser graph is a virtual node aggregated from several nodes in the finer graph.


Fig. 3. Graph coarsening procedure on a two-circle pattern. The number of nearest neighbors k is set to 5, and the control parameter δ in Eq. (8) is set to 0.2. (a) Original, 3002 points. (b) Level 1, 1139 points. (c) Level 2, 478 points. (d) Level 3, 213 points.

Finally, all the above-mentioned methods are designed for unsupervised graph partitioning, while the method proposed in this paper is used for semi-supervised learning.

There are also some efficient methods designed specifically for graph based semi-supervised learning. Delalleau et al. [12] proposed an interpolation method which first splits the graph nodes V into two disjoint subsets R and S, such that the labels of the nodes in R can be interpolated from the labels of the nodes in S as

f_i = ( Σ_{j ∈ S} W_{ij} f_j ) / ( Σ_{j ∈ S} W_{ij} + ε ),   ∀ x_i ∈ R,    (22)

which is quite similar to the interpolation rule in Eq. (9), except for the smoothing constant ε in the denominator. Bringing Eq. (22) back into Eq. (4), the method only needs to solve a linear system with |S| variables. To further improve efficiency, they dropped a term coupled purely with the nodes in R, which may make the result inaccurate in unpredictable ways. They propose random and smart sampling strategies for selecting the nodes in S. The method proposed in [12] is more like a one-level coarsening of our multilevel approach. However, the two are still different in (1) the node selection process, and (2) the fact that they drop a term relevant to the points in R when constructing the approximation problems, while we do not. Moreover, with only one level of coarsening, we cannot expect the resultant graph to be very small, because if we got a small graph after only one step of coarsening, the information loss would be huge. In the next section we will compare the performance of our algorithm with theirs.

Another interesting work is multiscale graph analysis using diffusion wavelets [9]. Unlike other methods which perform the analysis directly on the original graphs, the method in [9] performs multiscale analysis on the eigenfunctions of the Laplacian defined on the data manifold. Their algorithm proceeds in a bottom-up way, from the finest scale to the coarsest scale, using powers of a diffusion operator as dilations and a numerical rank constraint to critically sample the multiresolution subspaces, which is totally different from our method.

6. Experiments

In this section, we present a set of experiments to show the effectiveness of our multilevel approach. In all of the experiments, the regularization parameter γ in Eq. (4) is set to 0.01, and the control parameter δ is set to 0.2. All the experiments are performed on a Windows machine with a 2.2 GHz Pentium IV processor and 512 MB main memory.



Fig. 4. Experimental results of our multilevel graph based SSL on 3 band data set. (a) shows the original data set; the upper figure of (b) is the running time vs. graph level plot of our algorithm on the data set in (a); the lower figure of (b) shows sizes of the graphs (i.e. number of nodes) on different levels; (c) shows plot of the running times of our algorithm (ordinate) vs. data set scales (abscissa).

6.1. A synthetic example

Fig. 4(a) shows a synthetic data set with three classes, each being a horizontal band containing 1050 data points, with only 1 point labeled. In our method, we construct the neighborhood graph with the number of nearest neighbors k = 5 under the Euclidean distance. We apply our method on this data set with the graph level l varying from 0 to 30. On each graph level our method can correctly classify all the data points. Fig. 4(b) shows how the running time and the graph size vary as we coarsen the graph level by level. Clearly, the computational time is reduced significantly at each step when l < 5, since the size of the graph shrinks greatly. After l > 14, there are only three points remaining in the graph (which are the three labeled points), so further coarsening does not help to speed up the algorithm.

We also test the running time of our algorithm as the scale of the data set varies. The experimental result is shown in Fig. 4(c), where we again adopt the 3 band data set with its size being 630m (m = 1, 2, ..., 9). This figure tells us that, as the scale of the data set grows, the computational time of our algorithm increases much more slowly on a coarser graph.

6.2. Digits recognition

In this case study, we focus on the problem of classifying hand-written digits. The data set we adopt is a subset of the USPS [37] handwritten 16 × 16 digits data set. The images of digits 1, 2, 3 and 4 are used in this experiment as four classes, with 1269, 929, 824 and 852 examples per class respectively, for a total of 3874. The same data set was used in [32] as a benchmark. Tables 2 and 3 summarize the classification results and computational time of various algorithms, including the 1-nearest-neighbor classifier (1-NN), the support vector machine (SVM), Gaussian random fields (GRF) [34], learning with local and global consistency (LLGC) [32], linear neighborhood propagation (LNP) [27], and the large scale method with random/smart subset selection (LS Rand/Smart) [12]. The implementations of these algorithms are standard with respect to their original papers, and the result of our algorithm is obtained by first coarsening the graph to level 3. The reported recognition accuracies are averaged over 50 independent trials with 10 and 100 randomly labeled points. The performance of our algorithm with respect to the number of randomly labeled points on different graph levels is summarized in Fig. 5.

Table 2. Averaged classification accuracy (%) comparison on USPS.

# labeled | 1-NN | SVM | GRF | LLGC | LNP | LS Rand | LS Smart | Multilevel
10 | 80.77 | 75.99 | 85.48 | 89.51 | 95.23 | 81.94 | 82.47 | 83.26
100 | 95.21 | 92.70 | 95.83 | 96.90 | 98.10 | 86.70 | 89.50 | 93.77

Table 3. Averaged computational time (s) comparison on USPS.

# labeled | 1-NN | SVM | GRF | LLGC | LNP | LS Rand | LS Smart | Multilevel
10 | 5.33 | 3.24 | 22.35 | 24.52 | 27.07 | 5.15 | 6.34 | 3.86
100 | 5.74 | 10.28 | 24.31 | 25.28 | 27.86 | 5.47 | 6.98 | 4.50


Fig. 5. Multilevel graph based SSL on USPS data set. In the figure, the ordinate is the recognition accuracy averaged on 50 random trials, and the abscissa represents the number of points we randomly labeled, where we guarantee that there is at least one labeled point in each class.

From Fig. 5 we can see that there are mainly two factors that affect the final classification results of our multilevel method:

 The level of the graph. Clearly if we coarsen the graph more, the classification results will become poorer. This is because at each coarsening step, our method will make some approximations (computing the edge weights on the coarser graph, label

interpolation), which cause some information loss. The more we coarsen the data graph, the more information is lost. As can be seen from the toy example in Section 6.1, in the extreme case the coarsest graph will only contain the initially labeled points, i.e. the geometric information of the data graph is gone; thus the prediction results will be poor.

• The number of labeled points. As the graph coarsening may cause information loss, there might be some wrongly predicted data labels on the coarse graph if the set of labeled points is too small. This can be clearly seen from Fig. 5. When there are more labeled points, our method can produce classification results that are very close to the results produced on the finer graph.

Fig. 6. Computational time (s) vs. graph level plot on the USPS data set.

The computational time vs. graph level plot is shown in Fig. 6, from which we can see that learning on a coarser graph can save much computational time. Moreover, we also provide the computational time of the large scale method in [12] with the random and smart node selection strategies (where the number of selected nodes is set to be the same as the size of the level 1 graph resulting from our method). Therefore, as long as the size of the labeled data is not too small, our multilevel method can produce almost the same classification results in a much shorter time; the computational time of the large scale method typically lies in the range of our multilevel method with the graph coarsened to level 1 and level 2, but Fig. 8 shows that its classification accuracy is worse.

Another set of experiments tests how the performance of our method changes with respect to the control parameter δ. The results are provided in Fig. 7 (in which we randomly label 5 points and the values on the curves are averaged over 50 independent runs), from which we can see that there is a clear tradeoff between the value of δ, the computational time and the classification accuracy. The larger δ is, the larger the number of nodes selected to form the graph of the coarser level, and thus the more accurate and the more computationally costly the classification. However, as can be seen from Fig. 7(a), the classification accuracy increases very slowly after δ > 0.3, therefore δ = 0.3 is a very reasonable choice for coarsening the graph.

Besides, we also conducted a set of experiments to compare the performance of our algorithm with that of a random seed selection strategy, i.e. at each graph coarsening step, the data points selected to form the graph of the coarser level are selected randomly, except for the labeled points. The results are shown in Fig. 8, in which the rows correspond to the results of different methods and the columns represent different numbers of labeled points. We use ''Multilevel i'' to denote the results of our multilevel method on the graph of the i-th level, and ''Rand i'' to denote the results of the random selection method on the graph of the i-th level. We also present the results of the large scale method in [12], and we use ''LS Rand/Smart'' to denote that method with the random/smart sampling strategies. The results in the figure are averaged over 50 independent runs. From the figure we can clearly see the superiority of our method.

6.3. Text classification

In this experiment, we address the task of text classification using the 20-newsgroups data set [38]. The topic rec, containing autos, motorcycles, baseball and hockey, was selected from the version 20news-18828. The articles were processed by the Rainbow software package with the following options: (1) passing all words through the Porter stemmer before counting them; (2) tossing out any token which is on the stoplist of the SMART system; (3) skipping any headers; (4) ignoring words that occur in five or fewer documents. No further preprocessing was done. Removing the empty documents, we obtained 3970 document vectors in an 8014-dimensional space. Finally the documents were normalized into TFIDF representation. We use the inner-product distance to find the k nearest neighbors when constructing the neighborhood graph, i.e.

d(x_i, x_j) = 1 - x_i^T x_j / (||x_i|| ||x_j||),    (23)

where x_i and x_j are document vectors, and the value of k is set to 10. The classification results and computational time of various algorithms, averaged over 50 independent runs, are reported in Tables 4 and 5. The classification results of our method on different graph levels are shown in Fig. 9, and the computational time is shown in Fig. 10. Besides, we also conducted a set of experiments to test the sensitivity of our method to the choice of δ; the results can be seen in Fig. 11 (where 20 points are randomly labeled and the experiments are run independently 50 times), from which we can also see that δ = 0.2 is a good tradeoff value. The comparison of selecting nodes using our method versus random selection is shown in Fig. 12, together with the large scale method in [12] (whose computational time is also shown in Fig. 10). From the figure we can see that our method performs far better than the random selection approach and the large scale approach.

6.4. Application in bioinformatics

To further investigate how our algorithm performs on large scale data, we also apply it to classify the SecStr data in [8]. The task is to predict the secondary structure of a given amino acid in a protein based on a sequence window centered around that amino acid. The data set is based on the CB513 set,8 which was created by Cuff and Barton and consists of 513 proteins [10]. The 513 proteins consist of a total of 84,119 amino acids, of which 440 were X, Z, or B, and were therefore not considered. For the remaining 83,679 amino acids, a symmetric sequence window of amino acids [-7, +7] was used to generate the input x. Positions before the beginning or after the end of the protein are represented by a special (21st) letter. Each letter is represented by a sparse binary vector of length 21 such that the position of the single 1 indicates the letter. The 28,968 α-helical and 18,888 β-sheet protein positions were collectively called class -1, while the 35,823 remaining points (''coil'') formed class +1.

8 E.g. at http://www.compbio.dundee.ac.uk/~www-jpred/data/pred_res/.


Fig. 7. The sensitivity of the performance of our algorithm to the choice of δ when coarsening the graph from level 0 to level 1 on the USPS data set. For each δ value, we perform 50 independent runs and the values on the curves in (a)–(c) are the average values of those 50 independent runs.


Fig. 8. Comparisons of different node selection strategies and the large scale method in [12]. The x-axis represents the number of randomly labeled points, y-axis denotes the classification accuracy averaged over 50 independent runs. Multilevel i represents the result of our method performed on graph coarsened to level i, and Rand i denotes our multilevel method using random node selection for graph coarsening to level i. LS Rand and Smart denotes the large scale method in [12] using random and smart sampling strategies.

Table 4. Averaged classification accuracy (%) comparison on the 20-newsgroup data set.

# labeled | 1-NN | SVM | GRF | LLGC | LNP | LS Rand | LS Smart | Multilevel
10 | 48.67 | 35.81 | 47.31 | 55.40 | 67.23 | 44.67 | 45.23 | 48.29
100 | 68.98 | 74.46 | 78.88 | 81.60 | 92.30 | 71.23 | 72.11 | 84.50

Table 5. Averaged computational time (s) comparison on the 20-newsgroup data set.

# labeled | 1-NN | SVM | GRF | LLGC | LNP | LS Rand | LS Smart | Multilevel
10 | 10.27 | 6.34 | 44.37 | 45.02 | 47.76 | 6.48 | 5.84 | 4.73
100 | 10.36 | 20.28 | 45.02 | 45.79 | 48.20 | 6.55 | 6.24 | 5.08

Thus the whole data set contains a total of 83,679 data samples of dimension 315, and the task is a binary classification problem. In our experiments, we select a subset of the SecStr data set containing 20,000 data points, and we randomly select 100 and 1000 points as labeled. The classification results and computational time, averaged over 50 independent runs, are reported in Tables 6 and 7. For SVM, we apply the Gaussian kernel with the width set by 5-fold cross validation. For LS Rand and LS Smart, we set the number of selected points to be the same as the size of the level 1 graph produced by our method. For our multilevel method, the graph is first coarsened to level 5.


Fig. 9. Multilevel graph based SSL on 20Newsgroup data set. The ordinate is the recognition accuracy averaged on 50 random trials, and the abscissa represents the number of points we randomly labeled, where we guarantee that there is at least one labeled point in each class.

Fig. 13 also depicts the time consumed by our multilevel method with respect to the graph level, together with the time cost of the large scale method in [12]. From those results we can see that the large scale method is faster than our method only when our graph is coarsened just to the first level, and its results are worse than those of our method even when the graph is coarsened to level 5, which suggests that our method provides a better approximation to the original solution.

6.5. Interactive image segmentation

To show the power of our multilevel algorithm, we also apply our approach to the task of interactive image segmentation. The goal of interactive image segmentation is to segment the objects (foreground) that are of interest to the user under the guidance of some prior information. Such information is usually specified by some strokes predefined by the user. Therefore, what interactive image segmentation does is just to assign labels (''foreground'' or ''background'') to the unlabeled image pixels given some labeled pixels (the user-predefined strokes), which is a typical transductive learning problem of large scale (e.g. even for a very small image of size 100 × 100, we already have on the order of 10,000 pixels to label). Then, if we view each pixel as a data point, we should construct a data graph of size O(10^4) if we want to apply graph based approaches to solve such a problem, which is very hard for traditional graph based methods.

Table 6. Averaged classification accuracy (%) comparison on the SecStr data set.

# labeled | SVM | LS Rand | LS Smart | Multilevel
100 | 55.41 | 57.68 | 57.74 | 58.96
1000 | 65.29 | 59.16 | 60.20 | 66.37

Table 7. Averaged computational time (s) comparison on the SecStr data set.

# labeled | SVM | LS Rand | LS Smart | Multilevel
100 | 15.39 | 59.33 | 60.24 | 46.50
1000 | 100.37 | 60.28 | 62.15 | 50.46


Fig. 10. Computational time (s) vs. graph level plot on the 20Newsgroup data set.


Fig. 11. The sensitivity of the performance of our algorithm to the choice of δ when coarsening the graph from level 0 to level 1 on the 20-newsgroup data set. For each δ value, we perform 50 independent runs and the values on the curves in (a)–(c) are the average values of those 50 independent runs.


Fig. 12. Comparisons of different node selection strategies and the large scale method in [12]. The x-axis represents the number of randomly labeled points; the y-axis denotes the classification accuracy averaged over 50 independent runs. Multilevel i represents the result of our method performed on the graph coarsened to level i, and Rand i denotes our multilevel method using random node selection for graph coarsening to level i. LS Rand and LS Smart denote the large scale method in [12] using random and smart sampling strategies.


to apply the graph based approaches to solve such a problem, which is very hard for traditional graph based methods. In our experiments, we just use the 3  3 grid around each pixel as the neighborhood of it and construct the pixel graph. Then the graph will be coarsened to level 5 for initial classification, and finally the solution will be refined back via interpolation. Some results are provided in Fig. 14, in which all the images are of

Computational time (seconds)

7. Conclusions

In this paper we have proposed a novel multilevel approach for graph-based semi-supervised learning, which solves the SSL problem in a multilevel way. The feasibility of our method is analyzed theoretically, and the experiments show that it can produce results comparable to those of the traditional methods in much less time.

Fig. 13. Computational time (s) vs. graph level plot on the SecStr data set.

Fig. 14. Interactive segmentation results on some natural images. The images in the left column are the original images; the images in the middle column are the labeled images, in which the black strokes represent the seeds for foreground pixels and the white strokes denote the seeds for background pixels; the images in the right column show the segmented foreground objects.

Appendix A. Proof of Theorem 1

The smoothness of the classification function on the graph of level $s+1$ is

$$
S(f^{s+1}) = \sum_{k,l} w^{s+1}_{kl}\,\big(f^{s+1}_k - f^{s+1}_l\big)^2
= \frac{1}{2}\sum_{k,l}\sum_{i,j} w^s_{ij}\,\big(p^{[s,s+1]}_{jl} - p^{[s,s+1]}_{il}\big)\big(p^{[s,s+1]}_{ik} - p^{[s,s+1]}_{jk}\big)\big(f^{s+1}_k - f^{s+1}_l\big)^2
$$
$$
= \frac{1}{2}\sum_{i,j} w^s_{ij}\sum_{k,l}\big(p^{[s,s+1]}_{jl}p^{[s,s+1]}_{ik} - p^{[s,s+1]}_{il}p^{[s,s+1]}_{ik} - p^{[s,s+1]}_{jl}p^{[s,s+1]}_{jk} + p^{[s,s+1]}_{il}p^{[s,s+1]}_{jk}\big)\big[(f^{s+1}_k)^2 - 2 f^{s+1}_k f^{s+1}_l + (f^{s+1}_l)^2\big].
$$

Since $\sum_l p^{[s,s+1]}_{jl} = 1$, $\sum_l p^{[s,s+1]}_{il} = 1$, $\sum_k p^{[s,s+1]}_{jk} = 1$ and $\sum_k p^{[s,s+1]}_{ik} = 1$, we have

$$
\sum_{k,l}\big(p^{[s,s+1]}_{jl}p^{[s,s+1]}_{ik} - p^{[s,s+1]}_{il}p^{[s,s+1]}_{ik} - p^{[s,s+1]}_{jl}p^{[s,s+1]}_{jk} + p^{[s,s+1]}_{il}p^{[s,s+1]}_{jk}\big)\big[(f^{s+1}_k)^2 + (f^{s+1}_l)^2\big] = 0. \tag{24}
$$
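To see Eq. (24) explicitly (a step the proof leaves implicit), note that each of the four products reduces to a pair of single sums once the row-sum conditions are used, for instance

$$
\sum_{k,l} p^{[s,s+1]}_{jl}p^{[s,s+1]}_{ik}\big[(f^{s+1}_k)^2 + (f^{s+1}_l)^2\big]
= \sum_k p^{[s,s+1]}_{ik}(f^{s+1}_k)^2 + \sum_l p^{[s,s+1]}_{jl}(f^{s+1}_l)^2,
$$

and analogously for the other three products. Adding the four contributions with their signs, the single sums cancel in pairs, which gives Eq. (24).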

And it can be easily verified that

$$
\sum_{i,j} w^s_{ij} \sum_{k,l}\big(-p^{[s,s+1]}_{jl}p^{[s,s+1]}_{ik} + p^{[s,s+1]}_{il}p^{[s,s+1]}_{ik} + p^{[s,s+1]}_{jl}p^{[s,s+1]}_{jk} - p^{[s,s+1]}_{il}p^{[s,s+1]}_{jk}\big)\, f^{s+1}_k f^{s+1}_l
= \sum_{i,j} w^s_{ij}\Big(\sum_k p^{[s,s+1]}_{ik} f^{s+1}_k - \sum_l p^{[s,s+1]}_{jl} f^{s+1}_l\Big)^2
= \sum_{i,j} w^s_{ij}\,\big(f^s_i - f^s_j\big)^2.
$$

Therefore, combining the two identities, $S(f^{s+1}) = \sum_{i,j} w^s_{ij}\,(f^s_i - f^s_j)^2 = S(f^s)$.
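The two identities above are easy to sanity-check numerically. The sketch below (our own illustration, not part of the paper) draws a random symmetric weight matrix $w^s$, a random row-stochastic interpolation matrix $P^{[s,s+1]}$ (so its rows sum to one, as the proof assumes) and a random coarse label vector $f^{s+1}$, and verifies Eq. (24) and the cross-term identity to machine precision.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 12, 4                                   # fine / coarse graph sizes
Ws = rng.random((n, n)); Ws = (Ws + Ws.T) / 2  # symmetric fine weights w^s
P = rng.random((n, m))
P /= P.sum(axis=1, keepdims=True)              # rows of P sum to one
fc = rng.standard_normal(m)                    # coarse labels f^{s+1}
ff = P @ fc                                    # interpolated labels f^s = P f^{s+1}

cross_lhs = 0.0
for i in range(n):
    for j in range(n):
        # B[k, l] = p_jl p_ik - p_il p_ik - p_jl p_jk + p_il p_jk
        B = (np.outer(P[i], P[j]) - np.outer(P[i], P[i])
             - np.outer(P[j], P[j]) + np.outer(P[j], P[i]))
        # Eq. (24): the squared terms cancel for every pair (i, j)
        assert abs(np.sum(B * (fc[:, None] ** 2 + fc[None, :] ** 2))) < 1e-9
        cross_lhs += Ws[i, j] * np.sum(-B * np.outer(fc, fc))

# cross term equals sum_ij w^s_ij (f^s_i - f^s_j)^2
cross_rhs = np.sum(Ws * (ff[:, None] - ff[None, :]) ** 2)
assert abs(cross_lhs - cross_rhs) < 1e-8
```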

Appendix B. Proof of Theorem 3

According to Eq. (10), the cost function that we want to minimize at step $s$ is

$$
J^s(f^s) = (f^s)^{\mathrm{T}} P^{[s,0]} S P^{[0,s]} f^s + \gamma\,\|P^{[0,s]} f^s - y\|^2. \tag{25}
$$

Setting $\partial J^s(f^s)/\partial f^s = 0$, we obtain the following linear system:

$$
P^{[s,0]}(S + \gamma I)P^{[0,s]} f^s = \gamma P^{[s,0]} y. \tag{26}
$$

Let $Q^s = P^{[s,0]}(S + \gamma I)P^{[0,s]}$; then

$$
f^s = \gamma (Q^s)^{-1} P^{[s,0]} y. \tag{27}
$$

Thus the predicted label vector on the initial graph is

$$
f^0 = P^{[0,s]} f^s = \gamma P^{[0,s]} (Q^s)^{-1} P^{[s,0]} y. \tag{28}
$$
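Eqs. (25)–(28) translate directly into a few lines of dense linear algebra. The sketch below uses synthetic stand-ins for $S$, $P^{[0,s]}$ and $y$ (purely illustrative choices of ours) and, anticipating the error analysis that follows, also forms the operator $M = I - P^{[0,s]}(Q^s)^{-1}P^{[s,0]}Q^0$ and checks that it is idempotent; it assumes, as the error expression below presupposes, that the exact fine-level solution satisfies $Q^0 f^\ast = \gamma y$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, gamma = 200, 20, 0.1

# Synthetic stand-ins (illustrative only): S is a graph Laplacian built from
# a random symmetric affinity matrix, P interpolates coarse values to the
# fine graph, and y carries a few +1/-1 labels.
A = rng.random((n, n)); W = (A + A.T) / 2
S = np.diag(W.sum(axis=1)) - W                        # smoothness matrix
P = rng.random((n, m)); P /= P.sum(axis=1, keepdims=True)   # P^{[0,s]}
R = P.T                                               # P^{[s,0]}
y = np.zeros(n); y[:5] = 1.0; y[5:10] = -1.0

Q0 = S + gamma * np.eye(n)                            # Q^0 = S + gamma*I
Qs = R @ Q0 @ P                                       # Q^s             (Eq. 26)
fs = gamma * np.linalg.solve(Qs, R @ y)               # coarse solution (Eq. 27)
f0 = P @ fs                                           # prediction      (Eq. 28)

M = np.eye(n) - P @ np.linalg.solve(Qs, R @ Q0)       # error projector
assert np.allclose(M @ M, M)                          # M is idempotent
f_star = gamma * np.linalg.solve(Q0, y)               # exact solution of Q^0 f = gamma*y
assert np.allclose(f_star - f0, M @ f_star)           # e = M f^*       (Eq. 29)
```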

Let $Q^0 = S + \gamma I$ and let $f^\ast$ be the exact solution of Eq. (18); then the prediction error vector is

$$
e = f^\ast - P^{[0,s]}(Q^s)^{-1}P^{[s,0]} Q^0 f^\ast = \big(I - P^{[0,s]}(Q^s)^{-1}P^{[s,0]} Q^0\big) f^\ast. \tag{29}
$$

Let $M = I - P^{[0,s]}(Q^s)^{-1}P^{[s,0]} Q^0$. Since (using $P^{[s,0]} Q^0 P^{[0,s]} = Q^s$)

$$
M^2 = I - 2P^{[0,s]}(Q^s)^{-1}P^{[s,0]} Q^0 + P^{[0,s]}(Q^s)^{-1}P^{[s,0]} Q^0 P^{[0,s]}(Q^s)^{-1}P^{[s,0]} Q^0 = I - P^{[0,s]}(Q^s)^{-1}P^{[s,0]} Q^0 = M,
$$

the matrix $M$ is idempotent. Moreover, since $I - M = P^{[0,s]}(Q^s)^{-1}P^{[s,0]} Q^0$, we have $R(I-M) = R(P^{[0,s]})$, and

$$
\|Ma + (I-M)b\|^2 = \|Ma\|^2 + \|(I-M)b\|^2 + 2a^{\mathrm{T}} M(I-M)b = \|Ma\|^2 + \|(I-M)b\|^2.
$$

Therefore

$$
\min_{v\in R(P^{[0,s]})} \|f^\ast - v\|^2
= \min_{v\in R(P^{[0,s]})} \|Mf^\ast + (I-M)f^\ast - v\|^2
= \min_{v\in R(P^{[0,s]})} \|Mf^\ast - v\|^2
= \min_{v\in R(P^{[0,s]})} \big(\|Mf^\ast\|^2 + \|v\|^2\big)
= \|Mf^\ast\|^2 = \|e\|^2,
$$

which proves the theorem.

References

[1] A. Abou-Rjeili, G. Karypis, Multilevel algorithms for partitioning power-law graphs, in: IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2006.
[2] M. Belkin, I. Matveeva, P. Niyogi, Regularization and semi-supervised learning on large graphs, in: Proceedings of the 17th Conference on Learning Theory, 2004.
[3] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (6) (2003) 1373–1396.
[4] M. Belkin, P. Niyogi, Semi-supervised learning on Riemannian manifolds, Machine Learning 56 (2004) 209–239.
[5] Y. Bengio, O. Delalleau, N. Le Roux, Label propagation and quadratic criterion, in: O. Chapelle, B. Schölkopf, A. Zien (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, pp. 193–216.
[6] A. Blum, S. Chawla, Learning from labeled and unlabeled data using graph mincuts, in: Proceedings of the 18th International Conference on Machine Learning, 2001.
[7] A. Brandt, D. Ron, Multigrid solvers and multilevel optimization strategies, in: J. Cong, J.R. Shinnerl (Eds.), Multilevel Optimization and VLSICAD, Kluwer Academic Publishers, Dordrecht, 2002.
[8] O. Chapelle, et al. (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006.
[9] R.R. Coifman, M. Maggioni, Diffusion wavelets, Applied and Computational Harmonic Analysis 21 (1) (2006) 53–94.
[10] J.A. Cuff, G.J. Barton, Evaluation and improvement of multiple sequence methods for protein secondary structure prediction, Proteins 34 (4) (1999) 508–519.
[11] O. Delalleau, Y. Bengio, N. Le Roux, Non-parametric function induction in semi-supervised learning, in: Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005.
[12] O. Delalleau, Y. Bengio, N. Le Roux, Large-scale algorithms, in: O. Chapelle, B. Schölkopf, A. Zien (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, 2006, pp. 333–341.
[13] I.S. Dhillon, Y. Guan, B. Kulis, Weighted graph cuts without eigenvectors: a multilevel approach, IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (11) (2007) 1944–1957.
[15] K. Graf Estes, J.L. Evans, M.W. Alibali, J.R. Saffran, Can infants map meaning to newly segmented words? Statistical segmentation and word learning, Psychological Science 18 (3) 254–260.
[16] G. Karypis, R. Aggarwal, V. Kumar, S. Shekhar, Multilevel hypergraph partitioning: applications in VLSI domain, in: Proceedings of the 34th Design and Automation Conference, 1997, pp. 526–529.
[17] G. Karypis, V. Kumar, Multilevel k-way partitioning scheme for irregular graphs, Journal of Parallel and Distributed Computing 48 (1) (1998) 96–129.
[18] S.T. Roweis, L.K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
[19] H.S. Seung, D.D. Lee, The manifold ways of perception, Science 290 (2000) 2268–2269.
[20] E. Sharon, A. Brandt, R. Basri, Fast multiscale image segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, South Carolina, 2000, pp. I:70–I:77.
[21] H. Shin, N.J. Hill, G. Rätsch, Graph based semi-supervised learning with sharper edges, in: Proceedings of the European Conference on Machine Learning, 2006, pp. 401–412.
[22] H.D. Sterck, R.D. Falgout, J. Nolting, U.M. Yang, Distance-two interpolation for parallel algebraic multigrid, Numerical Linear Algebra with Applications 15 (2008) 115–139.
[23] S.B. Stromsten, Classification learning from both classified and unclassified examples, Ph.D. Dissertation, Stanford University, 2002.
[24] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[25] U. Trottenberg, C.W. Oosterlee, A. Schüller, Multigrid, with guest contributions by A. Brandt, P. Oswald, K. Stüben, Academic Press, San Diego, CA, London, 2001.
[27] F. Wang, C. Zhang, Label propagation through linear neighborhoods, in: Proceedings of the 23rd International Conference on Machine Learning, 2006.
[28] F. Wang, J. Wang, C. Zhang, H.C. Shen, Semi-supervised classification using linear neighborhood propagation, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, USA, vol. 1, 2006, pp. 160–167.
[29] F. Wang, C. Zhang, Fast multilevel transduction on graphs, in: Proceedings of the 7th SIAM Conference on Data Mining, Minneapolis, MN, USA, 2007.
[30] M. Wertheimer, Gestalt theory, in: W.D. Ellis (Ed.), A Source of Gestalt Psychology, The Humanities Press, New York, 1924/1950, pp. 1–11.
[31] M. Wu, B. Schölkopf, Transductive classification via local learning regularization, in: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics, 2007, pp. 624–631.
[32] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: S. Thrun, L. Saul, B. Schölkopf (Eds.), Advances in Neural Information Processing Systems, vol. 16, 2004, pp. 321–328.
[33] D. Zhou, B. Schölkopf, Learning from labeled and unlabeled data using random walks, in: Pattern Recognition, Proceedings of the 26th DAGM Symposium, 2004.
[34] X. Zhu, Semi-supervised learning with graphs, Ph.D. Thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, May 2005.
[35] X. Zhu, T. Rogers, R. Qian, C. Kalish, Humans perform semi-supervised classification too, in: Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence (AAAI), 2007.
[36] X. Zhu, Semi-supervised learning literature survey, Technical Report 1530, University of Wisconsin, Madison, 2005.
[37] http://www.kernel-machines.org/data.html
[38] http://people.csail.mit.edu/jrennie/20Newsgroups/


About the Author—CHANGSHUI ZHANG is currently a professor in the Department of Automation, Tsinghua University. His main research interests include machine learning, data mining and computer vision. He is currently an associate editor of Pattern Recognition.

About the Author—FEI WANG received his Ph.D. degree from Tsinghua University, Beijing, China, in 2008. After that, he was a postdoctoral research associate at Florida International University until August 2009. He will join the Department of Statistics, Cornell University as a postdoctoral researcher in September 2009.