Neurocomputing 177 (2016) 385–393
Semi-supervised learning via mean field methods

Jianqiang Li a,b,d,*, Fei Wang c

a School of Software Engineering, Beijing University of Technology, China
b Beijing Engineering Research Center for IoT Software and Systems, China
c Department of Computer Science and Engineering, University of Connecticut, United States
d Guangdong Key Laboratory of Popular High Performance Computers, Shenzhen Key Laboratory of Service Computing and Applications, China

* Corresponding author at: School of Software Engineering, Beijing University of Technology, China. E-mail address: [email protected] (J. Li).
Article info

Article history: Received 8 February 2015; Received in revised form 16 May 2015; Accepted 15 November 2015; Communicated by Feiping Nie; Available online 26 November 2015.

Keywords: Mean field; Semi-supervised learning

Abstract

The recent years have witnessed a surge of interest in semi-supervised learning methods. Numerous methods have been proposed for learning from partially labeled data. In this paper, a novel semi-supervised learning approach based on statistical physics is proposed. We treat each data point as an Ising spin, and the interaction between pairwise spins is captured by the similarity between the pairwise points. The labels of the data points are treated as the directions of the corresponding spins. In the semi-supervised setting, some of the spins have fixed directions (which correspond to the labeled data), and our task is to determine the directions of the other spins. An approach based on Mean Field theory is proposed to achieve this goal. Finally, experimental results on both toy and real world data sets are provided to show the effectiveness of our method. © 2015 Elsevier B.V. All rights reserved.
1. Introduction

Traditional data mining approaches fall into two categories: (1) supervised learning, which aims to predict the labels of new data points from the observed data-label pairs. A supervised learning task is called regression when the predicted labels take real values, and classification when the predicted labels take a set of discrete values. Typical supervised learning methods include the Support Vector Machine [32] and the Decision Tree [25]. (2) Unsupervised learning, the goal of which is to organize the observed data points without labels. Typical unsupervised learning tasks include clustering [15] and dimensionality reduction [26].

In this paper, we focus on classification, which is traditionally a supervised learning task. To train a classifier one needs a collection of labeled data points. However, in many practical applications of pattern classification and data mining, one often faces a lack of sufficient labeled data, since labeling often requires expensive human labor and much time. Meanwhile, in many cases, large amounts of unlabeled data can be far easier to obtain. For example, in text classification, one may have easy access to a large database of documents (e.g. by crawling the web), but only a small part of them are classified by hand.
[email protected] (J. Li).
http://dx.doi.org/10.1016/j.neucom.2015.11.042 0925-2312/& 2015 Elsevier B.V. All rights reserved.
Consequently, semi-supervised learning methods, which aim to learn from partially labeled data, have been proposed. Many conventional semi-supervised learning algorithms adopt a generative model for the classifier and employ Expectation Maximization (EM) [10] to model the label prediction or parameter estimation process. For example, mixtures of Gaussians [27], mixtures of experts [20], and naive Bayes [21] have been used as the generative model, while EM is used to combine labeled and unlabeled data for classification. There are also many other algorithms such as co-training [6], the transductive SVM (TSVM) [14], and the Gaussian process approach [18]. For a detailed literature survey one can refer to [36].

The basic assumption behind semi-supervised learning is the cluster assumption [8], which states that two points are likely to have the same class label if there is a path connecting them that passes only through regions of high density. Zhou et al. [34] further explored the geometric intuition behind this assumption: (1) nearby points are likely to have the same label; (2) points on the same structure (such as a cluster or a submanifold) are likely to have the same label. Note that the first assumption is local, while the second one is global. The cluster assumption suggests that we should consider both the local and the global information contained in the dataset during learning.

In recent years there has been significant interest in adapting numerical [22] and analytic [3] techniques from statistical physics to provide elegant algorithms and estimates for machine learning and neural computation problems. In this paper we formulate the problem of semi-supervised learning as that of measuring
the equilibrium properties of an inhomogeneous Ising model. In our model, each data point is viewed as a spin, and the direction of the spin stands for the label of the data point. We also introduce interactions between pairwise points based on the intrinsic geometry of the dataset. The directions of the spins corresponding to the labeled data points are fixed, and our goal is to predict the labels of the unlabeled points, which are estimated from the directions of the corresponding spins in thermal equilibrium. The experiments show that our method gives good classification results.

The rest of this paper is organized as follows. The detailed description of the Ising model is presented in Section 2. In Section 3 we introduce a Mean Field approach for solving the Ising model. Our approach for semi-supervised learning is described in Section 4, and we compare it with traditional Bayesian methods in Section 5. The experimental results on both toy and real world datasets are reported in Section 6, followed by the conclusions and discussions in Section 7.
2. Ising model

The Ising model [12], first proposed by E. Ising, is a lattice model used for describing intermolecular forces. For example, in magnets, each molecule has a spin that can be oriented either up or down relative to the direction of an externally applied field. Fig. 1 shows such an example, a 2D periodic lattice with an array of 25 fixed points. Note that in real world applications the lattice can be of any type. With each lattice site is associated a spin variable S_i taking the value +1 or -1. In magnets, we usually call +1 spin up and -1 spin down. A configuration of the lattice is a particular set of values of all spins; one configuration of the 5x5 regular lattice is illustrated in Fig. 1. Clearly, for such a spin system there are in total 2^25 possible configurations. We usually assign an energy to a specific configuration of an Ising model, and assume that the molecules exert only short-range forces on each other, i.e. the interaction energy depends only on the configurations of neighboring spins of the lattice. Taking spontaneous magnetization as an example, since neighboring spins tend to have the same direction, we can define that if two neighboring spins are in the same direction the energy between them is -U, and if they are in different directions the energy is +U (U > 0 is a constant). Then the energy of the system shown in Fig. 1 is -24U.

Fig. 1. An example of the 2D Ising model.

In addition, we assume that the total energy of a configuration also includes a term θ_i for each spin, which can be viewed as the effect of an external field. If it is a magnetic field that results in a +θ energy on the spins with +1 states and a -θ energy on the -1 spins, then the total energy of the configuration shown in Fig. 1 is -24U + 3θ. Based on the discussions above, we can define the energy of a general Ising model in a given configuration S = (S_1, S_2, ..., S_N) to be

E(S) = -\sum_{\langle i,j \rangle} \tilde{J}_{ij} S_i S_j - \sum_i \tilde{\theta}_i S_i,   (1)

where S_i \in \{+1, -1\} is the current value of the ith spin, \langle i,j \rangle represents a neighboring spin pair, \tilde{J}_{ij} is the symmetric interaction energy of the pairwise spins i and j, and \tilde{\theta}_i is the energy on spin i brought by the external fields. The canonical partition function of the system is defined as

Z = \int dS_1 \int dS_2 \cdots \int dS_N \, e^{-\beta \tilde{E}(S)},   (2)

where

\beta = (kT)^{-1},   (3)

and k is the Boltzmann constant and T is the temperature. (Here we give a more general form of the partition function; in our Ising model case, since each random variable can only take two integer values, we can use the sum operator to replace the integral operator.) We further define the energy function as

E(S) = -\sum_{\langle i,j \rangle} J_{ij} S_i S_j - \sum_i \theta_i S_i,   (4)

where

J_{ij} = \beta \tilde{J}_{ij}, \qquad \theta_i = \beta \tilde{\theta}_i.   (5)

Then the probability distribution of the spin system is

P(S) = \frac{1}{Z} \exp\Big[ \sum_{\langle i,j \rangle} J_{ij} S_i S_j + \sum_i \theta_i S_i \Big].   (6)

Furthermore, by absorbing the constraint that S_i can only take binary values, we can rewrite P(S) as

P(S) = \frac{\rho(S)}{Z} \exp\Big[ \sum_{\langle i,j \rangle} J_{ij} S_i S_j + \sum_i \theta_i S_i \Big],   (7)

where \rho(S) = \prod_i \rho(S_i) with

\rho(S_i) = \frac{1}{2}\delta(S_i - 1) + \frac{1}{2}\delta(S_i + 1),   (8)

which states the prior knowledge that each spin has equal probabilities of being +1 or -1, and \delta(\cdot) is the Dirac delta function. Therefore the marginal probability of S_i is

P(S_i) = \int \prod_{j \neq i} dS_j \, P(S).   (9)

Our goal is to approximate the behavior of such an Ising-type interacting spin system in equilibrium. We will introduce the adaptive TAP approach [22] in the next section to solve this problem.
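To make the quantities above concrete, the following short Python sketch (our own illustration, not part of the original paper) evaluates the energy of Eq. (4) and the unnormalized Boltzmann weight appearing in Eq. (6) for a small spin configuration; the coupling matrix J and the fields theta are made-up example values.

```python
import numpy as np

def ising_energy(S, J, theta):
    """Energy E(S) = -sum_{<i,j>} J_ij S_i S_j - sum_i theta_i S_i (Eq. (4))."""
    # np.triu(..., 1) keeps each pair <i,j> only once
    pair_term = np.sum(np.triu(J, 1) * np.outer(S, S))
    field_term = np.dot(theta, S)
    return -pair_term - field_term

def boltzmann_weight(S, J, theta):
    """Unnormalized probability exp(-E(S)), i.e. the numerator of Eq. (6)."""
    return np.exp(-ising_energy(S, J, theta))

# A toy 3-spin example with hypothetical couplings and fields.
J = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 1.0],
              [0.5, 1.0, 0.0]])
theta = np.array([0.2, 0.0, -0.2])
S = np.array([+1, +1, -1])
print(ising_energy(S, J, theta), boltzmann_weight(S, J, theta))
```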
3. The mean field approach for the Ising model

The main idea of mean field theory is to focus on one spin and assume that the most important contribution to the interactions of this spin with its neighboring spins is determined by the mean field due to its neighboring spins [24]. It originally aims to
approximate the behavior of interacting spin systems in thermal equilibrium; we use it here to approximate the behavior of Ising models in equilibrium. More concretely, the mean state of spin S_i is

\langle S_i \rangle = \int dS_i \, S_i P(S_i),   (10)

where P(S_i) is the marginal distribution in Eq. (9). So we can define the mean field \langle S \rangle = (\langle S_1 \rangle, \langle S_2 \rangle, \ldots, \langle S_N \rangle) by these mean values. In the following we briefly review the adaptive TAP approach introduced in [22,23]. Note that

P(S) = \prod_i \rho(S_i) \cdot \frac{1}{Z} \exp\Big[ \sum_{\langle i,j \rangle} J_{ij} S_i S_j + \sum_i \theta_i S_i \Big].

Thus the marginal distribution of S_i can be further transformed to

P(S_i) = \frac{\int \prod_{j \neq i} dS_j \, \rho(S_i)\, e^{S_i (\sum_j J_{ij} S_j + \theta_i)} P(S \backslash S_i)}{\int \prod_j dS_j \, \rho(S_i)\, e^{S_i (\sum_j J_{ij} S_j + \theta_i)} P(S \backslash S_i)},   (11)

where P(S \backslash S_i) is the distribution of all variables excluding S_i in the spin system. Let us introduce an auxiliary variable h_i = \sum_j J_{ij} S_j, which represents the aggregated effect of all other spins on S_i. Since

\int \prod_{j \neq i} dS_j \, \rho(S_i) \exp\Big[ S_i \Big( \sum_j J_{ij} S_j + \theta_i \Big) \Big] P(S \backslash S_i) = \rho(S_i) \int dh_i \exp[S_i (h_i + \theta_i)] \int \prod_{j \neq i} dS_j \, \delta\Big(h_i - \sum_j J_{ij} S_j\Big) P(S \backslash S_i),   (12)

we define the cavity distribution as

P(h_i \backslash S_i) = \int \prod_{j \neq i} dS_j \, \delta\Big(h_i - \sum_j J_{ij} S_j\Big) P(S \backslash S_i).

Then P(S_i) can be rewritten as

P(S_i) = \frac{\rho(S_i) \int dh_i \, e^{S_i (h_i + \theta_i)} P(h_i \backslash S_i)}{\int dh_i \, dS_i \, \rho(S_i)\, e^{S_i (h_i + \theta_i)} P(h_i \backslash S_i)} = \frac{\rho(S_i) \langle e^{S_i (h_i + \theta_i)} \rangle_{\backslash i}}{\int dS_i \, \rho(S_i) \langle e^{S_i (h_i + \theta_i)} \rangle_{\backslash i}},

where \langle \cdot \rangle_{\backslash i} denotes the average over the cavity distribution. Since in real world applications the cavity distribution is unavailable, we can use a Gaussian distribution to approximate it [19,22]. Thus

P(h_i \backslash S_i) \approx \frac{1}{\sqrt{2\pi V_i}} \exp\Big( -\frac{(h_i - \langle h_i \rangle_{\backslash i})^2}{2 V_i} \Big),   (13)

where V_i = \langle h_i^2 \rangle_{\backslash i} - \langle h_i \rangle_{\backslash i}^2. Therefore

P(S_i) = \frac{\rho(S_i) \langle e^{S_i (h_i + \theta_i)} \rangle_{\backslash i}}{\int dS_i \, \rho(S_i) \langle e^{S_i (h_i + \theta_i)} \rangle_{\backslash i}} = \frac{1}{Z_i} \frac{\rho(S_i)}{\sqrt{2\pi V_i}} \int dh_i \exp\Big[ S_i (h_i + \theta_i) - \frac{(h_i - \langle h_i \rangle_{\backslash i})^2}{2 V_i} \Big] = \frac{1}{Z_i} \rho(S_i) \exp\Big[ S_i (\langle h_i \rangle_{\backslash i} + \theta_i) + \frac{V_i}{2} S_i^2 \Big],

where the single variable partition function is

Z_i = \int dS_i \, \rho(S_i) \exp\Big[ S_i (\langle h_i \rangle_{\backslash i} + \theta_i) + \frac{V_i}{2} S_i^2 \Big].

Thus

\frac{\partial \ln Z_i}{\partial \theta_i} = \frac{1}{Z_i} \int dS_i \, S_i \rho(S_i) \exp\Big[ S_i (\langle h_i \rangle_{\backslash i} + \theta_i) + \frac{V_i}{2} S_i^2 \Big] = \int dS_i \, S_i P(S_i) = \langle S_i \rangle.

Let

P(S_i, h_i) = \frac{1}{Z_i} \frac{\rho(S_i)}{\sqrt{2\pi V_i}} \exp\Big[ S_i (h_i + \theta_i) - \frac{(h_i - \langle h_i \rangle_{\backslash i})^2}{2 V_i} \Big];

then

\langle h_i \rangle = \int dh_i \, dS_i \, h_i P(S_i, h_i) = \int dS_i \frac{1}{Z_i} \frac{\rho(S_i)}{\sqrt{2\pi V_i}} \exp\Big[ S_i (\theta_i + \langle h_i \rangle_{\backslash i}) + \frac{V_i}{2} S_i^2 \Big] \int dh_i \, h_i \exp\Big( -\frac{(h_i - (V_i S_i + \langle h_i \rangle_{\backslash i}))^2}{2 V_i} \Big) = \int dS_i \, P(S_i) (V_i S_i + \langle h_i \rangle_{\backslash i}) = V_i \langle S_i \rangle + \langle h_i \rangle_{\backslash i}.

Therefore

\langle h_i \rangle_{\backslash i} = \langle h_i \rangle - V_i \langle S_i \rangle = \sum_j J_{ij} \langle S_j \rangle - V_i \langle S_i \rangle.

The equation group

\langle S_i \rangle = \frac{\partial \ln Z_i}{\partial \theta_i},   (14)

\langle h_i \rangle_{\backslash i} = \sum_j J_{ij} \langle S_j \rangle - V_i \langle S_i \rangle   (15)

is the so-called TAP equations [30], and the last term of Eq. (15) is usually called the Onsager reaction term.

For computing V_i, we first define the covariance matrix \chi with its (i,j)th entry

\chi_{ij} = \langle S_i S_j \rangle - \langle S_i \rangle \langle S_j \rangle.   (16)

Since

\frac{\partial \langle S_i \rangle}{\partial \theta_j} = \frac{\partial}{\partial \theta_j} \int dS_i \, S_i P(S_i) = \frac{\partial}{\partial \theta_j} \int dS \, S_i \frac{\rho(S)}{Z} \exp\Big( \sum_{u,v} J_{uv} S_u S_v + \sum_v \theta_v S_v \Big) = \int dS \, S_i S_j \frac{\rho(S)}{Z} \exp\Big( \sum_{u,v} J_{uv} S_u S_v + \sum_v \theta_v S_v \Big) - \langle S_i \rangle \int dS \, S_j \frac{\rho(S)}{Z} \exp\Big( \sum_{u,v} J_{uv} S_u S_v + \sum_v \theta_v S_v \Big) = \langle S_i S_j \rangle - \langle S_i \rangle \langle S_j \rangle = \chi_{ij},

we then have

\chi_{ij} = \frac{\partial \langle S_i \rangle}{\partial \theta_j} = \frac{\partial \langle S_i \rangle}{\partial \theta_i} \frac{\partial \theta_i}{\partial \theta_j} + \frac{\partial \langle S_i \rangle}{\partial \langle h_i \rangle_{\backslash i}} \frac{\partial \langle h_i \rangle_{\backslash i}}{\partial \theta_j} = \frac{\partial \langle S_i \rangle}{\partial \theta_i} \Big( \delta_{ij} + \frac{\partial \langle h_i \rangle_{\backslash i}}{\partial \theta_j} \Big),

where \delta_{ij} = 1 if i = j and \delta_{ij} = 0 otherwise, and the last equality follows from

\frac{\partial \langle S_i \rangle}{\partial \theta_i} = \frac{\partial \langle S_i \rangle}{\partial \langle h_i \rangle_{\backslash i}},

which can be verified immediately by expanding \langle S_i \rangle via Eq. (14). Furthermore, using Eq. (15), we can derive that

\chi_{ij} = \frac{\partial \langle S_i \rangle}{\partial \theta_i} \Big( \delta_{ij} + \sum_k (J_{ik} - V_k \delta_{ik}) \chi_{kj} \Big),   (17)

which can be further transformed to

\Big[ \Big( \frac{\partial \langle S_i \rangle}{\partial \theta_i} \Big)^{-1} + V_i \Big] \chi_{ij} = \sum_k J_{ik} \chi_{kj} + \delta_{ij}.   (18)
Now we define the matrix J with its (i,j)th entry [J]_{ij} = J_{ij}, and the diagonal matrix

\Lambda = \mathrm{diag}\Big( V_1 + \Big( \frac{\partial \langle S_1 \rangle}{\partial \theta_1} \Big)^{-1}, \ldots, V_N + \Big( \frac{\partial \langle S_N \rangle}{\partial \theta_N} \Big)^{-1} \Big).

Then we can rewrite Eq. (18) in its matrix form

\chi = (\Lambda - J)^{-1}.   (19)

The diagonal elements of the covariance matrix \chi can be computed by

\chi_{ii} = \frac{\partial^2}{\partial \theta_i^2} \ln Z_i = \frac{\partial \langle S_i \rangle}{\partial \theta_i} = \big[ (\Lambda - J)^{-1} \big]_{ii}.   (20)

So far we have derived the equations for computing the first and second moments of S_i, which are summarized as follows:

f(\langle h_i \rangle_{\backslash i}, V_i) = \langle S_i \rangle = \frac{\partial \ln Z_i}{\partial \theta_i},   (21)

f'(\langle h_i \rangle_{\backslash i}, V_i) = \chi_{ii} = \frac{\partial \langle S_i \rangle}{\partial \theta_i} = \frac{\partial^2 \ln Z_i}{\partial \theta_i^2}.   (22)

For Ising models, each random variable S_i can only take the two values \{+1, -1\}, and the prior \rho(S_i) for each S_i is given by Eq. (8). Therefore

\langle S_i \rangle = \frac{\sum_{S_i = \pm 1} S_i \rho(S_i) \exp\big[ S_i (\langle h_i \rangle_{\backslash i} + \theta_i) + \frac{V_i}{2} S_i^2 \big]}{Z_i}.

The numerator is

\sum_{S_i = \pm 1} S_i \rho(S_i) \exp\Big[ S_i (\langle h_i \rangle_{\backslash i} + \theta_i) + \frac{V_i}{2} S_i^2 \Big] = \frac{1}{2} \big[ \exp(\langle h_i \rangle_{\backslash i} + \theta_i) - \exp(-(\langle h_i \rangle_{\backslash i} + \theta_i)) \big] e^{V_i/2} = \sinh(\langle h_i \rangle_{\backslash i} + \theta_i)\, e^{V_i/2},

and similarly the denominator is

Z_i = \sum_{S_i = \pm 1} \rho(S_i) \exp\Big[ S_i (\langle h_i \rangle_{\backslash i} + \theta_i) + \frac{V_i}{2} S_i^2 \Big] = \cosh(\langle h_i \rangle_{\backslash i} + \theta_i)\, e^{V_i/2}.

Note that in the above derivations we have used the equality e^{K\sigma} = \cosh K + \sigma \sinh K \ (\sigma = \pm 1). So for Ising models

f(\langle h_i \rangle_{\backslash i}, V_i) = \langle S_i \rangle = \tanh(\langle h_i \rangle_{\backslash i} + \theta_i),   (23)

and

f'(\langle h_i \rangle_{\backslash i}, V_i) = \chi_{ii} = \frac{1}{\cosh^2(\langle h_i \rangle_{\backslash i} + \theta_i)}.   (24)

To summarize, the adaptive TAP method for Ising models is shown in Table 1.

Table 1. The adaptive TAP mean field algorithm for the Ising model.

Initialization:
  Start from a tabula rasa \langle S \rangle = (\langle S_1 \rangle, \langle S_2 \rangle, \ldots, \langle S_N \rangle) = 0 (or small values if \langle S \rangle = 0 is a fixed point).
  V = (V_{11}, V_{22}, \ldots, V_{NN}) = 0. Learning rate \eta = 0.05. Fault tolerance ft = 10^{-3}. Variance-update iteration t = 20.
Iterate:
  do:
    for all i: (Eqs. (14) and (15))
      \langle h_i \rangle_{\backslash i} = \sum_j J_{ij} \langle S_j \rangle - V_i \langle S_i \rangle
      \delta \langle S_i \rangle = f(\langle h_i \rangle_{\backslash i}, V_i) - \langle S_i \rangle, where f(\langle h_i \rangle_{\backslash i}, V_i) is computed from Eq. (23)
    end for
    for all i: \langle S_i \rangle = \langle S_i \rangle + \eta \, \delta \langle S_i \rangle end for
    for each variance-update iteration:
      for all i: \Lambda_i = V_i + 1 / f'(\langle h_i \rangle_{\backslash i}, V_i), where f'(\langle h_i \rangle_{\backslash i}, V_i) is computed from Eq. (24)
      for all i: V_i = \Lambda_i - 1 / [(\Lambda - J)^{-1}]_{ii}
  while \max_i |\delta \langle S_i \rangle|^2 > ft
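As a concrete illustration of Table 1, the following Python sketch (our own rendering, not code from the original paper) implements the update loop using Eqs. (15), (19), (23) and (24); the damping factor, tolerance, and variance-update interval mirror the values listed in the table, and the way the variance consistency step is written is our reading of the table.

```python
import numpy as np

def tap_mean_field(J, theta, eta=0.05, ft=1e-3, var_every=20, max_iter=5000):
    """Adaptive TAP mean field for an Ising model (a sketch of Table 1).

    J     : (N, N) symmetric coupling matrix (beta already absorbed, Eq. (5))
    theta : (N,) external fields
    Returns the approximate means <S_i>.
    """
    N = J.shape[0]
    m = np.zeros(N)          # <S_i>, started from a tabula rasa
    V = np.zeros(N)          # cavity variances V_i
    for it in range(max_iter):
        h_cav = J @ m - V * m                 # Eq. (15), Onsager-corrected cavity field
        delta = np.tanh(h_cav + theta) - m    # f(.) from Eq. (23)
        m = m + eta * delta                   # damped update of the means
        if it % var_every == 0:
            fprime = 1.0 / np.cosh(h_cav + theta) ** 2      # Eq. (24)
            Lam = V + 1.0 / np.maximum(fprime, 1e-12)       # Lambda_i = V_i + 1/f'
            chi = np.linalg.inv(np.diag(Lam) - J)           # Eq. (19)
            V = Lam - 1.0 / np.maximum(np.diag(chi), 1e-12) # match Eq. (20) on the diagonal
        if np.max(np.abs(delta)) ** 2 <= ft:
            break
    return m
```

The small numerical floors (1e-12) only guard against division by zero in this sketch and are not part of the algorithm in Table 1.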
4. Semi-supervised learning based on the mean field approach

So far we have defined the Ising model and introduced a mean field approach for solving for the states of the spins of such a model in thermal equilibrium. We can now turn to the problem for which these concepts will be utilized: semi-supervised learning.

In semi-supervised learning, the dataset X = {x_1, x_2, ..., x_l, x_{l+1}, ..., x_n} is composed of two parts. The first part X_L = {x_i}_{i=1}^l is the labeled set, in which the data points are labeled with +1 or -1 (we consider the two-class case for the moment). The remaining unlabeled data points X_U = {x_u}_{u=l+1}^n constitute the second part of X. The goal of semi-supervised learning is to predict the labels of the unlabeled data. (The task we handle here is also called transduction, in which we do not consider predicting the labels of data outside the training dataset (induction), since, according to [32], this may introduce unnecessary complexity into the learning algorithm.)

Considering the two-class classification problem, we can treat each data point as a spin and its label as the direction of the spin. The labels of the unlabeled data can then be regarded as the states of those spins in equilibrium, i.e. the spins {S_i}_{i=1}^N can be viewed as the hidden label variables of the data. Two things remain: (1) how to determine the interactions between pairwise data points (i.e. \tilde{J}_{ij}); (2) how to determine the external field (i.e. \tilde{\theta}_i).
4.1. Compute the spin–spin interactions

Undoubtedly, the computation of the spin–spin interactions is at the heart of our semi-supervised algorithm. The intuition here is that the more similar x_i is to x_j, the stronger the interaction between x_i and x_j should be. Therefore, we should define a proper similarity between pairwise data points. Some possible ways of defining such a similarity include:

1. Unweighted k-nearest neighborhood similarity [4]: The similarity between x_i and x_j is 1 if x_i is in the k-nearest neighborhood of x_j or x_j is in the k-nearest neighborhood of x_i, and 0 otherwise. k is the only hyperparameter that controls this similarity. As noted by [38], this similarity has the nice property of "adaptive scales", since the similarities between pairwise points are the same in low and high density regions.

2. Unweighted ε-ball neighborhood similarity: The similarity between x_i and x_j is 1 if, for some distance function d(·,·), d(x_i, x_j) ≤ ε, and 0 otherwise. ε is the only hyperparameter controlling this similarity, and it is continuous.

3. Weighted linear neighborhood similarity [33]: Assuming that x_i is in the neighborhood (k-nearest neighborhood or ε-ball neighborhood) of x_j, the similarity between x_i and x_j, w_ij, can be computed by solving the following optimization problem

\min_{w_{ij}} \Big\| x_i - \sum_j w_{ij} x_j \Big\|^2 \quad \mathrm{s.t.} \quad w_{ij} \geq 0, \ \sum_j w_{ij} = 1,

and w_ij = 0 if x_j is not in the neighborhood of x_i. Note that this similarity is generally asymmetric, and we can use \tilde{w}_{ij} = \frac{1}{2}(w_{ij} + w_{ji}) to symmetrize it. The neighborhood size (k or ε) is the only hyperparameter controlling this similarity.

4. Weighted tanh similarity [38]: Let d_ij be the distance between x_i and x_j; then the tanh similarity between x_i and x_j can be computed by

w_{ij} = \frac{1}{2}\big(1 - \tanh(\alpha_1 (d_{ij} - \alpha_2))\big).

The intuition is to create a soft cutoff around the length α_2, so that similar examples (presumably from the same class) have higher similarities and dissimilar examples (presumably from different classes) have lower similarities. The hyperparameters α_1 and α_2 control the slope and cutoff value of the tanh similarity, respectively.

5. Weighted exponential similarity [4,9,34,38]: Let d_ij be the distance between x_i and x_j; then the exponential similarity between x_i and x_j can be computed by

w_{ij} = \exp\Big( -\frac{d_{ij}^2}{\sigma} \Big),   (25)

which is also a continuous weighting scheme, with σ controlling the decay rate.

Because of its popularity and solid theoretical foundations [5], we adopt the weighted exponential similarity as our similarity measure for computing the pairwise similarities (interactions). Another problem then arises, namely defining a proper distance function. We list some possible distance functions below (a short code sketch follows the list).

1. Euclidean distance [4,34,38]: The distance between x_i and x_j can be calculated by

d^E_{ij} = \| x_i - x_j \| = \sqrt{(x_i - x_j)^T (x_i - x_j)}.   (26)
2. Geodesic distance [29]: First construct a connected neighborhood graph on the dataset X. The distance between x_i and x_j is then the length of the shortest path linking them, where the length of each line segment on this path is computed by the Euclidean distance.

3. Connectivity distance [11]: We also need to first construct a connected neighborhood graph for the dataset X. Let p be a path from one point to another with length |p|, and let the indices of the data points on this path be denoted by \{p_k\}_{k=1}^{|p|}. Let P_{ij} be the set of paths connecting x_i and x_j; then the connectivity distance between x_i and x_j is defined by

d_{ij} = \min_{p \in P_{ij}} \ \max_{1 \leq k \leq |p|-1} d^E_{p_k p_{k+1}},

where d^E represents the Euclidean distance. Chapelle et al. [9] further propose to "soften" this distance to make it more robust to bridge points. Such a "softened" connectivity distance between x_i and x_j is computed by

d_{ij} = \frac{1}{\rho} \ln\Big( 1 + \min_{p \in P_{ij}} \sum_{k=1}^{|p|-1} \big( e^{\rho d^E_{p_k p_{k+1}}} - 1 \big) \Big),

where ρ > 0 is a free parameter controlling the distance.

4. Inner product distance [34,36]: The inner product distance between x_i and x_j is computed by

d_{ij} = 1 - \frac{x_i^T x_j}{\| x_i \| \, \| x_j \|}.   (27)

Note that this distance is one of the most commonly used distances in text classification.
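The following short sketch (an illustration we added; the function names are ours) computes the Euclidean distance of Eq. (26), the inner product distance of Eq. (27), and the weighted exponential similarity of Eq. (25).

```python
import numpy as np

def euclidean_distance(X):
    """Pairwise Euclidean distances, Eq. (26); X is an (n, d) data matrix."""
    sq = np.sum(X ** 2, axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.sqrt(D2)

def inner_product_distance(X):
    """Pairwise inner product (cosine) distances, Eq. (27)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

def exponential_similarity(D, sigma):
    """Weighted exponential similarity w_ij = exp(-d_ij^2 / sigma), Eq. (25)."""
    return np.exp(-D ** 2 / sigma)
```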
In real world applications, which distance to use depends on our prior knowledge of the dataset. For example, for data (nearly) residing on an intrinsic manifold, the geodesic distance can be a better choice; for a dataset corrupted by some noise, the connectivity distance might be better; for text analysis, the inner product distance should be the first choice. However, if we do not have any prior information about the dataset, then the Euclidean distance may be a safe choice.

4.2. Determining the external fields

Another important issue that we should address in our method is how to determine the influence of the external fields, i.e. \tilde{\theta}_i (i = 1, 2, \ldots, N). Recall that what we want to solve is a semi-supervised learning problem, which means that some data have already been labeled. Drawing an analogy to the spin system, we can regard the labeled points as spins having "fixed" directions, forced by some external fields. For a particular configuration of the spin system, if the states of the spins corresponding to the labeled data points are in accordance with their imposed states (labels), then its probability should be higher. Recalling the definition of the probability of a specific configuration in Eq. (7), and since for each spin there are only two possible directions, spin up (+1) or spin down (-1), we can define \tilde{\theta}_i in the following way:

\tilde{\theta}_i = \begin{cases} l_i, & \text{if } x_i \text{ is labeled as } l_i \\ 0, & \text{if } x_i \text{ is unlabeled} \end{cases}   (28)

With such a definition, if the state of a "labeled" spin is the same as its label, then \tilde{\theta}_i S_i > 0, which results in a higher probability; otherwise \tilde{\theta}_i S_i < 0, which makes the probability lower. For simplicity, we do not consider the influence of the external fields on the unlabeled spins.

Table 2. Semi-supervised learning using a mean field approach.

Inputs: Dataset X; scale σ; temperature T; a proper distance function d(·,·); the initialization parameters of the ATAP algorithm.
Outputs: The labels of the unlabeled data.
1. Calculate the distance matrix D with its (i,j)th entry D_{ij} = d(x_i, x_j);
2. Calculate the interaction matrix \tilde{J} with its (i,j)th entry \tilde{J}_{ij} = \exp(-D_{ij}/\sigma);
3. Normalize \tilde{J} by \tilde{J} = R^{-1/2} \tilde{J} R^{-1/2}, where R is a diagonal matrix with the ith element on its diagonal being R_{ii} = \sum_j \tilde{J}_{ij};
4. Compute the matrix J with its (i,j)th entry J_{ij} = \beta \tilde{J}_{ij}, where \beta is defined as in Eq. (3);
5. Calculate \langle S_u \rangle (x_u \in X_U) by the method shown in Table 1, with S_l (x_l \in X_L) fixed to its label t_l;
6. If \langle S_i \rangle > 0, classify x_i as +1; if \langle S_i \rangle < 0, classify x_i as -1.
4.3. Algorithm framework

Having computed the spin–spin interactions and the influence of the external fields, we can assemble the framework of our algorithm, which is shown in Table 2. The computation of the pairwise similarities takes O(n^2) time, and the symmetric normalization takes O(n^2) time. For the TAP algorithm in Table 1, every outer iteration takes 2O(n) for the updates of \langle S_i \rangle (1 \leq i \leq n) and 2O(n) to update the variances. Therefore the total complexity is O(n^2 + 4Tn), where T is the number of outer iterations.

Some issues should be addressed in our algorithm:

1. The normalization procedure in step 3 makes the computed similarities insensitive to the data distribution, since a single Gaussian function will always assign high similarities to high density regions, which would bias the final classification results if the data from different classes have different densities. A similar approach can be seen in [35].

2. Usually the computed \langle S_i \rangle is not exactly +1 or -1, so we classify the data points by the sign of \langle S_i \rangle. In this case, \langle S_i \rangle can be viewed as the soft label of x_i.

3. For multi-class problems, we can use (1) a one-vs-rest scheme [32] and classify x_i to the class with the largest \langle S_i \rangle value, or (2) a Potts multi-valued spin model.

A sketch of this pipeline is given below.
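Below is a minimal Python sketch of the pipeline in Table 2 (our own illustration, not the authors' code). It assumes the tap_mean_field and euclidean_distance helpers sketched earlier in this paper; the way the labeled spins are clamped to their labels is our reading of step 5.

```python
import numpy as np

def mean_field_ssl(X, labels, sigma=1.0, T=1.0, k_boltzmann=1.0,
                   eta=0.05, ft=1e-3, var_every=20, max_iter=5000):
    """Semi-supervised mean field classification (a sketch of Table 2).

    X      : (n, d) data matrix
    labels : (n,) array with +1 / -1 for labeled points and 0 for unlabeled ones
    """
    # Steps 1-2: distance matrix and interaction matrix J~_ij = exp(-D_ij / sigma)
    D = euclidean_distance(X)
    J_tilde = np.exp(-D / sigma)
    # Step 3: symmetric normalization J~ = R^{-1/2} J~ R^{-1/2}
    r = np.sqrt(J_tilde.sum(axis=1))
    J_tilde = J_tilde / np.outer(r, r)
    # Step 4: J_ij = beta * J~_ij with beta = (kT)^{-1}  (Eq. (3))
    beta = 1.0 / (k_boltzmann * T)
    J = beta * J_tilde
    # External fields from the labels, Eq. (28)
    theta = labels.astype(float)
    # Step 5: run the adaptive TAP iteration, then re-clamp the labeled spins
    # to their labels (our reading of "fixed" in Table 2).
    m = tap_mean_field(J, theta, eta=eta, ft=ft, var_every=var_every,
                       max_iter=max_iter)
    m[labels != 0] = labels[labels != 0]
    # Step 6: classify by the sign of <S_i>
    return np.sign(m)
```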
5. Relationship with Bayesian discriminative semi-supervised classification methods

The Bayesian approach is an important class of methods for data mining and machine learning. Traditionally, a Bayesian classifier can be categorized as either a generative or a discriminative classifier. The goal of generative classifiers is to learn a joint probability model P(x, t) of the input x and its class label t, and then to make predictions by using the Bayes rule to compute P(t|x). Discriminative classifiers, on the other hand, model the posterior class probabilities P(t|x) for all classes directly and learn a mapping from x to t. It has often been argued that, for many application domains, discriminative classifiers can achieve higher test accuracies than generative classifiers [13,32].

In the following we use the same notation introduced at the beginning of Section 4. We use t = (t_1, t_2, \ldots, t_n) to denote the hard labels of the data points, i.e. t_i \in \{-1, +1\}, \forall 1 \leq i \leq n, and y = (y_1, y_2, \ldots, y_n) denotes the hidden soft labels of the data points; e.g. in the two-class classification problem, t = sign(y). In the discriminative framework, we can draw a graphical model for such a problem, which is shown in Fig. 2.

Fig. 2. The graphical model of a discriminative classifier.

For semi-supervised classification, what we observe is the set D = \{X = X_L \cup X_U, t_L\}, where t_L corresponds to the labels of the labeled points, and we aim at learning the posterior P(t_U | D), where t_U = (t_{l+1}, t_{l+2}, \ldots, t_n). The posterior can be written as

P(t_U | D) = \int_y P(t_U | y) P(y | D).   (29)

As in [16], we can first approximate P(y|D) and then use Eq. (29) to classify the unlabeled data. Using the Bayes rule, we can derive that

P(y | D) = P(y | X, t_L) = \frac{1}{Z} P(y | X) P(t_L | y).

The term P(y|X) is the probability of the hidden labels given the training set, which can be regarded as the prior information of y on the training set. P(t_L|y) is the likelihood term, which is usually written as P(t_L | y) = \prod_{i=1}^l P(t_i | y_i) [16]; this term models the probabilistic relation between the observable hard labels and the hidden soft labels.

The prior term plays an important role in semi-supervised learning, especially when the size of the labeled set is small. Recent research shows that the prior should impose a smoothness constraint with respect to the intrinsic data structure [16,28], which means that it should give higher probability to labelings that respect the similarity structure of the data graph. More concretely, we model the whole dataset as a weighted undirected graph G = (V, E), where V = X is the node set and E is the edge set. Associated with each edge e_{ij} \in E is a symmetric nonnegative weight representing the similarity between x_i and x_j. In this way, the intrinsic structure of the dataset can be described by such a data graph, and a labeling that is smooth with respect to the data structure corresponds to a labeling that is smooth over the data graph. As noted in [5], if we regard y as a function defined on the data graph G, such that y_i is the value of this function at x_i, then the smoothness of y can be computed by

S_y = y^T M y,   (30)

where M is the so-called smoothness matrix. Some typical choices of M are the combinatorial graph Laplacian [5,38]

M = R - \tilde{J},   (31)

and the normalized graph Laplacian [34,35]

M = I - R^{-1/2} \tilde{J} R^{-1/2},   (32)

where \tilde{J} is the similarity matrix and R is the diagonal matrix with its (i,i)th element equal to the sum of the ith row of \tilde{J}. It has been found that using the normalized graph Laplacian usually produces better results [34]. Therefore, a natural choice for the prior of y is

P(y | X) = \frac{1}{Z_p} \exp\big( -y^T (I - R^{-1/2} \tilde{J} R^{-1/2}) y \big).   (33)

The likelihood term P(t_L|y) should be defined so that it is higher when t_i and y_i have the same sign, and lower otherwise. Therefore a natural choice for the likelihood term is

P(t_L | y) = \prod_{i=1}^l P(t_i | y_i) = \frac{1}{Z_l} \prod_{i=1}^l \exp(t_i y_i) = \frac{1}{Z_l} \exp\Big( \sum_{i=1}^l t_i y_i \Big).

Note that for the labeled points, their hard labels are just the influence of the external fields, i.e. \{\tilde{\theta}_i\}_{i=1}^l, as defined in Eq. (28). As a result, the posterior probability is

P(y | D) = \frac{1}{Z} \exp\Big( -y^T (I - R^{-1/2} \tilde{J} R^{-1/2}) y + \sum_{i=1}^l t_i y_i \Big) = \frac{1}{Z} \exp\Big( -y^T (I - R^{-1/2} \tilde{J} R^{-1/2}) y + \sum_{i=1}^l \tilde{\theta}_i y_i \Big) = \frac{1}{Z} \exp\Big( -y^T y + y^T R^{-1/2} \tilde{J} R^{-1/2} y + \sum_{i=1}^l \tilde{\theta}_i y_i \Big).

Since we can always normalize the vector y to have unit length, the Maximum A Posteriori estimation of y is equivalent to maximizing the criterion

\mathcal{J} = \frac{1}{Z} \exp\Big( y^T R^{-1/2} \tilde{J} R^{-1/2} y + \sum_{i=1}^l \tilde{\theta}_i y_i \Big),

which has exactly the same form as Eq. (7) except without \rho(S). And, as noted in [22], the role of \rho(S) in Eq. (7) is simply to impose constraints on S. So we can see that our method has a very similar expression to Bayesian discriminative semi-supervised learning methods. Although our method is based on statistical physics while the Bayesian methods are based on Bayesian theory, they are very close in spirit, and our method makes a working approximation that omits some of the complications of the Bayesian approach.
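As a small added illustration of Eqs. (30)–(32) (not code from the paper), the sketch below builds the two Laplacian choices from a similarity matrix and evaluates the smoothness penalty y^T M y.

```python
import numpy as np

def graph_laplacians(J_tilde):
    """Return the combinatorial (Eq. (31)) and normalized (Eq. (32)) Laplacians."""
    r = J_tilde.sum(axis=1)                   # row sums, i.e. the diagonal of R
    L_comb = np.diag(r) - J_tilde             # M = R - J~
    R_inv_sqrt = np.diag(1.0 / np.sqrt(r))
    L_norm = np.eye(len(r)) - R_inv_sqrt @ J_tilde @ R_inv_sqrt  # M = I - R^{-1/2} J~ R^{-1/2}
    return L_comb, L_norm

def smoothness(y, M):
    """Smoothness penalty S_y = y^T M y of a labeling y (Eq. (30))."""
    return float(y @ M @ y)
```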
6. Experiments

We apply our algorithm to several standard semi-supervised learning benchmark datasets (available at http://www.kyb.tuebingen.mpg.de/ssl-book/benchmarks.html). Table 3 provides some basic information about these datasets: the first column gives the name of the dataset, the second column the number of classes, the third and fourth columns the dimensionality and size of the dataset, respectively, and the last column some properties of the dataset ("arti." means artificial, "imba." means imbalanced, and "spar." means sparse). Here is a brief description of these datasets:

The g241d dataset is constructed to have a potentially misleading structure: the first 750 points (labeled as +1) are sampled from a Gaussian mixture model of two Gaussians with equal prior and covariance matrix, their means having a distance of 6 in a random direction; the remaining points are sampled from the same distribution with the means moved a distance of 2.5 in a random direction.

The Digit1 dataset consists of points close to a low-dimensional manifold embedded in a high-dimensional space, and is constructed not to show a pronounced cluster structure.

The USPS dataset is derived from the USPS handwritten digits dataset (available at http://www.kernel-machines.org/data.html), where 150 images for each digit are randomly drawn. The digits "2" and "5" are assigned to class +1, and all the others form class -1. The image structure is hidden using the Obscure method of Algorithm 21.1 in [7].

The COIL dataset is derived from the COIL-100 database (available at http://www1.cs.columbia.edu/CAVE/research/softlib/coil-100.html), where the images of 24 objects are randomly selected. These images are partitioned into six classes, and they were downsampled and obscured first.

The BCI dataset was provided by [17]. The Text dataset was provided by [31].

Table 3. Basic information of the benchmark datasets.

Dataset   Classes   Dimension   Size   Type
g241d     2         241         1500   arti.
Digit1    2         241         1500   arti.
USPS      2         241         1500   imba.
COIL      6         241         1500
BCI       2         117         400
Text      2         11,960      1500   spar.

For each dataset, there are 100 labeled points, and 12 random splits are performed to partition the dataset into labeled points and remaining unlabeled points. It is ensured that each split contains at least one point of each class. For comparison, we also provide the classification results of 7 semi-supervised learning methods. The Nearest Neighbor classifier (1-NN) and the linear SVM are used as the baseline algorithms. The detailed configurations of the other semi-supervised learning algorithms are described below:

Locally Linear Embedding (LLE) + 1-NN: The dimensionality of the datasets was first reduced by LLE [26] using a 12-nearest neighborhood, and then the 1-NN classifier was applied for classification.

Gaussian Random Fields (GRF) + Class Mass Normalization (CMN) [37]: The hyperparameter σ of the Gaussian kernel was selected by cross-validation on the first split, and this σ was used on the other splits.

Learning with Local and Global Consistency (LLGC) [34]: The regularization parameter μ is set to 0.05, and the neighborhood size was set by 10-fold cross validation.

Transductive SVM (TSVM) [14]: The Gaussian kernel was used in the same way as in [9].

Cluster Kernels [8]: k_orig is a Gaussian kernel with the width σ and ridge C optimized using the code at [1], and k_bag is set by 10-fold cross validation.

Low Density Separation (LDS) [9]: The code available at [2] is adopted.

For our mean field method, we adopt the exponential similarity (Eq. (25)) in all the experiments; for the Text dataset we use the inner product distance (Eq. (27)), and for the other five datasets we use the Euclidean distance (Eq. (26)). There are two hyperparameters in our method, namely the temperature T and the scale σ. In our experiments, they are tuned by an exhaustive search over the grid 2^{[-5:0.5:5]} x 2^{[-5:0.5:5]}, where 0.5 is the granularity of the grid. Fig. 3 shows the plots of the average test error vs. T and σ, from which we can see that for some of the datasets (Digit1) the final classification results are not sensitive to the choice of T and σ, for some of the datasets (Text) the optimal (T, σ) only lies in a small range, and for most of the datasets the optimal (T, σ) lies in a relatively large area, which makes the parameter tuning procedure easier. We think these different behaviors of (T, σ) are due to the different structures of the datasets, and we are currently working on an automatic way to self-tune those parameters.

Fig. 3. Average error rate vs. temperature T and scale σ. In all the figures, the z-axis represents the average test error, the x-axis is the log2 value of the temperature T, and the y-axis is the log2 value of the scale σ. (a) g241d. (b) Digit1. (c) USPS. (d) BCI. (e) COIL. (f) Text.
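For completeness, the following sketch (our own illustration) shows the exhaustive (T, σ) search over the 2^{[-5:0.5:5]} x 2^{[-5:0.5:5]} grid described above; it assumes the mean_field_ssl function sketched in Section 4, and the use of a held-out labeled split to score each setting is our assumption, since the paper only reports the resulting error surfaces.

```python
import numpy as np
from itertools import product

def grid_search_T_sigma(X, labels, eval_labels):
    """Exhaustive search over the 2^[-5:0.5:5] x 2^[-5:0.5:5] grid for (T, sigma).

    eval_labels : (n,) array of +1/-1 labels used only to score a setting
                  (e.g. a held-out labeled split); 0 entries are ignored.
    """
    grid = 2.0 ** np.arange(-5.0, 5.5, 0.5)
    best = (None, None, np.inf)
    for T, sigma in product(grid, grid):
        pred = mean_field_ssl(X, labels, sigma=sigma, T=T)
        mask = eval_labels != 0
        err = np.mean(pred[mask] != eval_labels[mask])
        if err < best[2]:
            best = (T, sigma, err)
    return best  # (best T, best sigma, its error)
```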
Table 4 reports the average test errors for the various methods, and Table 5 provides the average ROC scores (i.e. the fraction of the area under the Receiver Operating Characteristic curve). Both tables show that our mean field approach performs best on the Digit1, USPS and COIL datasets, and produces comparable results on the BCI and Text datasets. However, as there is no free lunch, it performs fairly poorly on the g241d dataset. This is probably because of the complex structure of this dataset, in which the data from the two classes are heavily overlapped and confused; our method tends to make neighboring points have the same labels, and thus it may be confused in this case.

Table 4. Average test errors (%) with 100 labeled training points.

Methods              g241d   Digit1   USPS   COIL    BCI     Text
1-NN                 42.45   3.89     5.81   17.35   48.67   30.11
SVM [32]             24.64   5.53     9.75   22.93   34.31   26.45
LLE + 1-NN [26]      38.20   2.83     6.50   28.71   47.89   32.83
GRF + CMN [37]       37.49   3.15     6.36   10.03   46.22   25.71
LLGC [34]            28.20   2.77     4.68   9.61    47.67   24.00
TSVM [14]            22.42   6.15     9.77   25.80   33.25   24.52
Cluster Kernel [8]   4.95    3.79     9.68   21.99   35.17   24.38
LDS [9]              23.74   3.46     4.96   13.72   43.97   23.15
Our method           20.41   1.83     4.47   8.32    33.40   24.25

Table 5. Average ROC scores (area under curve; %) with 100 labeled training points.

Methods              g241d   Digit1   USPS    BCI     Text
1-NN                 -       -        -       -       -
SVM [32]             83.54   99.09    95.76   71.17   84.26
LLE + 1-NN [26]      -       -        -       -       -
GRF + CMN [37]       82.23   99.59    91.11   56.48   84.62
LLGC [34]            52.97   98.84    92.24   52.36   71.53
TSVM [14]            84.18   98.02    92.74   73.09   80.96
Cluster Kernel [8]   98.95   99.36    94.50   70.50   85.90
LDS [9]              83.13   99.23    95.62   57.22   84.77
Our method           88.57   99.93    97.56   72.35   84.89

7. Conclusions and future works

In this paper, we propose a novel scheme for semi-supervised learning based on statistical physics. Unlike the traditional Bayesian methods, which aim at a maximum a posteriori estimation of the labels, our method treats the data labels as a disordered system, and the true labels can be regarded as the states of such a system in equilibrium. Experimental results are presented to show the effectiveness of our method.

In the future, we will focus on three issues of our algorithm: (1) Induction. Since induction can be viewed as a natural extension of the cavity method introduced in Section 3, deriving an induction approach seems straightforward; however, the inclusion of a new spin will affect the state of the original spin system, and how to tackle such effects is still a hard problem. (2) The automatic learning of the hyperparameters in our algorithm. (3) Acceleration. The ATAP approach introduced in Section 3 has a computational complexity of O(n^3), which prohibits the application of our method to large scale datasets, so deriving effective acceleration methods is also an important direction of our future work.

Acknowledgements

This work is supported by the Guangdong Key Laboratory of Popular High Performance Computers, the Shenzhen Key Laboratory of Service Computing and Applications, and the China National Key Technology Research and Development Program project no. 2015BAH13F01.
References

[1] 〈http://www.kyb.tuebingen.mpg.de/bs/people/chapelle/ams/〉.
[2] 〈http://www.kyb.tuebingen.mpg.de/bs/people/chapelle/lds〉.
[3] M. Blatt, S. Wiseman, E. Domany, Data clustering using a model granular magnet, Neural Comput. 9 (1997) 1805–1842.
[4] M. Belkin, et al., Regularization and semi-supervised learning on large graphs, in: COLT, 2004.
[5] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Comput. 15 (2003) 1373–1396.
[6] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of the 11th Annual Conference on Computational Learning Theory, Madison, WI, 1998, pp. 92–100.
[7] O. Chapelle, et al. (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006.
[8] O. Chapelle, J. Weston, B. Schölkopf, Cluster kernels for semi-supervised learning, in: Advances in Neural Information Processing Systems, vol. 15, 2003.
[9] O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in: Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005.
[10] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. R. Stat. Soc. Ser. B 39 (1) (1977) 1–38.
[11] B. Fischer, V. Roth, J.M. Buhmann, Clustering with the connectivity kernel, in: Advances in Neural Information Processing Systems, vol. 16, 2004.
[12] E. Ising, Beitrag zur Theorie des Ferromagnetismus, Z. Phys. 31 (1925) 253–258.
[13] T. Jaakkola, D. Haussler, Exploiting generative models in discriminative classifiers, in: Advances in Neural Information Processing Systems, vol. 11, 1998.
[14] T. Joachims, Transductive inference for text classification using Support Vector Machines, in: Proceedings of the 16th International Conference on Machine Learning, 1999.
[15] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, USA, 1988.
[16] A. Kapoor, Y. Qi, H. Ahn, R.W. Picard, Hyperparameter and kernel learning for graph based semi-supervised classification, in: Advances in Neural Information Processing Systems, vol. 18, 2006.
[17] T.N. Lal, M. Schröder, T. Hinterberger, J. Weston, M. Bogdan, N. Birbaumer, B. Schölkopf, Support Vector Channel selection in BCI, IEEE Trans. Biomed. Eng. 51 (6) (2004) 1003–1010.
[18] N.D. Lawrence, M.I. Jordan, Semi-supervised learning via Gaussian processes, in: L.K. Saul, Y. Weiss, L. Bottou (Eds.), Advances in Neural Information Processing Systems, vol. 17, MIT Press, Cambridge, MA, 2005.
[19] M. Mézard, G. Parisi, M.A. Virasoro, Spin Glass Theory and Beyond, Lecture Notes in Physics, vol. 9, World Scientific, London, 1987.
[20] D.J. Miller, U.S. Uyar, A mixture of experts classifier with learning based on both labelled and unlabelled data, in: M. Mozer, M.I. Jordan, T. Petsche (Eds.), Advances in Neural Information Processing Systems, vol. 9, MIT Press, Cambridge, MA, 1997, pp. 571–577.
[21] K. Nigam, A.K. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Mach. Learn. 39 (2–3) (2000) 103–134.
[22] M. Opper, O. Winther, Tractable approximations for probabilistic models: the adaptive TAP mean field approach, Phys. Rev. Lett. 86 (2001) 3695–3699.
[23] M. Opper, O. Winther, Adaptive TAP equations, in: M. Opper, D. Saad (Eds.), Advanced Mean Field Methods—Theory and Practice, MIT Press, Cambridge, MA, 2001, pp. 85–98.
[24] G. Parisi, Statistical Field Theory, Addison-Wesley, Redwood City, CA, 1988.
[25] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106.
[26] L.K. Saul, K.Q. Weinberger, J.H. Ham, F. Sha, D.D. Lee, Spectral methods for dimensionality reduction, in: O. Chapelle, B. Schölkopf, A. Zien (Eds.), Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006.
[27] B. Shahshahani, D. Landgrebe, The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon, IEEE Trans. Geosci. Remote Sens. 32 (5) (1994) 1087–1095.
[28] Y. Song, C. Zhang, J. Lee, F. Wang, A discriminative method for semi-automated tumorous tissues segmentation of MR brain images, in: Proceedings of the 2006 IEEE Computer Society Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA) at the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2006, p. 79.
[29] J.B. Tenenbaum, V. de Silva, J.C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (2000) 2319–2323.
[30] D.J. Thouless, P.W. Anderson, R.G. Palmer, Solution of a solvable model of a spin glass, Philos. Mag. 35 (1977) 593.
[31] S. Tong, D. Koller, Restricted Bayes optimal classifiers, in: Proceedings of the 17th National Conference on Artificial Intelligence, 2000, pp. 658–664.
[32] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, Berlin, 1995.
[33] F. Wang, C. Zhang, Label propagation through linear neighborhoods, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 985–992.
[34] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: S. Thrun, L. Saul, B. Schölkopf (Eds.), Advances in Neural Information Processing Systems, vol. 16, 2004, pp. 321–328.
[35] D. Zhou, B. Schölkopf, Learning from labeled and unlabeled data using random walks, in: Pattern Recognition, Proceedings of the 26th DAGM Symposium, 2004.
[36] X. Zhu, Semi-supervised Learning Literature Survey, Computer Sciences Technical Report 1530, University of Wisconsin-Madison, 2006.
[37] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the 20th International Conference on Machine Learning, 2003.
[38] X. Zhu, J. Lafferty, Z. Ghahramani, Semi-supervised Learning: From Gaussian Fields to Gaussian Processes, Computer Science Technical Report CMU-CS-03-175, Carnegie Mellon University, 2003.
Jianqiang Li received his B.S. degree in Mechatronics from Beijing Institute of Technology, Beijing, China in 1996, and his M.S. and Ph.D. degrees in Control Science and Engineering from Tsinghua University, Beijing, China in 2001 and 2004, respectively. He worked as a researcher at the Digital Enterprise Research Institute, National University of Ireland, Galway in 2004–2005. From 2005 to 2013 he worked at NEC Labs China as a researcher, and at the Department of Computer Science, Stanford University, as a visiting scholar in 2009–2010. He joined Beijing University of Technology, Beijing, China, in 2013 as a Beijing Distinguished Professor. His research interests are in Petri nets, enterprise information systems, business processes, data mining, information retrieval, the semantic web, privacy protection, and big data. He has over 50 publications and 37 international patent applications (19 of them granted in China, the US, or Japan). He has served as a PC member for multiple international conferences and organized the IEEE Workshop on Medical Computing 2014. He has served as a Guest Editor to organize special issues on Emerging Information Technology for Enhanced Healthcare in Computers in Industry and on Telecommunication for Remote Medicine in China Communications.
Fei Wang is currently an Associate Professor in the Department of Computer Science and Engineering, University of Connecticut. He also holds an affiliate position at the University of Connecticut Health Center. Before his current position, he worked at the IBM T.J. Watson Research Center for 4.5 years. His major research interest is data analytics and its applications in biomedical informatics. He regularly publishes papers at top data mining conferences such as KDD, ICDM and SDM, as well as medical informatics conferences such as AMIA. His papers have received nearly 2500 citations so far. He received a best research paper nomination at ICDM 2010 and a Marco Romani Best Paper nomination at AMIA TBI 2014.