Discriminative deep belief networks for visual data classification


Pattern Recognition 44 (2011) 2287–2296


Yan Liu (a), Shusen Zhou (b,a), Qingcai Chen (b)

(a) Department of Computing, The Hong Kong Polytechnic University, Hong Kong
(b) Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen, PR China


Abstract

Available online 25 December 2010

Visual data classification using insufficient labeled data is a well-known hard problem. Semi-supervised learning, which attempts to exploit unlabeled data in addition to labeled data, has attracted much attention in recent years. This paper proposes a novel semi-supervised classifier called discriminative deep belief networks (DDBN). DDBN utilizes a new deep architecture to integrate the abstraction ability of deep belief nets (DBN) and the discriminative ability of the backpropagation strategy. For unsupervised learning, DDBN inherits the advantage of DBN, which preserves information well from the high-dimensional feature space to the low-dimensional embedding. For supervised learning, through a well-designed objective function, the backpropagation strategy directly optimizes the classification results on the training dataset by refining the parameter space. Moreover, we apply DDBN to the visual data classification task and observe an important fact: the learning ability of deep architectures is seriously underrated in real-world applications, especially in visual data analysis. Comparative experiments on standard datasets of different types and scales demonstrate that the proposed algorithm outperforms both representative semi-supervised classifiers and existing deep learning techniques. For the visual dataset, we can further improve DDBN performance with a much larger and deeper architecture.

Keywords: Semi-supervised learning; Discriminative deep belief networks; Deep learning; Visual data classification

1. Introduction

The rapid growth of multimedia data and the noteworthy development of multimedia technologies have led to a significant need for visual content analysis and understanding. To reach the goals of automatic and semantic analysis, pattern recognition techniques such as classification have been attempted. However, in real-world applications it is often the case that labeled data are difficult, expensive, or time consuming to obtain [1], while abundant unlabeled data are available. For example, in content-based image retrieval, a user usually poses an example image as a query and asks the system to return similar images; there are many unlabeled images in the database, but only one labeled example, i.e. the query image [2]. To address this problem, semi-supervised learning uses a large amount of unlabeled data together with labeled data to build better classifiers. Since semi-supervised learning requires less human effort and gives higher accuracy, it is of great interest both in theory and in practice [3]. Typical semi-supervised learning methods include self-training [1,4], co-training [5] and tri-training [6], generative models [7,8], graph-based algorithms [9], and transductive support vector machines (TSVMs) [10-12].


Probably the earliest idea about using unlabeled data in classification is self-training [1], which propagates label information from the labeled data to the unlabeled data [4]. The co-training algorithm [5] trains two learners separately on two different views, i.e. two independent sets of attributes, and uses the prediction of each learner on unlabeled examples to augment the training set of the other. To relax co-training's requirement of sufficient and redundant views, tri-training generates three classifiers from the original labeled dataset and refines them using unlabeled data via majority voting [6]. Both co-training and tri-training mine useful information from unlabeled data and improve the performance of web classification and visual data analysis. A nonparametric Bayesian method [7] uses Gaussian processes for the generative model and takes advantage of recent advances in Markov chain Monte Carlo algorithms. Graph mincuts [9] uses a similarity measure between data points to construct a graph and then partitions the graph by minimizing the number of similar pairs of examples that are given different labels. Chapelle and Zien [10] derived graph-based distances that emphasize low-density regions between clusters, followed by training a transductive SVM (TSVM). Sindhwani et al. [11] showed how to turn transductive and standard supervised learning algorithms into semi-supervised learning algorithms. Collobert et al. [12] optimized transductive SVMs so that they can run on large databases. Moreover, there are many semi-supervised applications in evolutionary data, text categorization, part-of-speech tagging, driving safety monitoring, space carving, image segmentation, annotation, classification, and retrieval.


Currently, most semi-supervised methods use shallow architectures to model the problem [13], such as kernelized methods [10]. However, as argued by several researchers [14], deep architectures, composed of multiple levels of non-linear operations [15], are expected to perform well in semi-supervised learning. Weston et al. simply lifted shallow algorithms to deep architectures and already obtained competitive performance on semi-supervised learning tasks [13]. Many empirical validations support the argument that deep architectures show promise in solving hard learning problems [13,16]. Theoretical analysis also indicates that, compared with shallow circuits, deep architectures are more efficient because they can represent most common functions, especially highly variable functions, compactly and effectively [17]. Moreover, neuroscience provides evidence that the brain extracts multiple levels of representation from feature-detecting neurons, and current designs of deep architectures are similar to the human visual and perceptual system [18]. Based on these findings, this paper focuses on designing a semi-supervised classifier with a deep architecture for visual data classification.

In this paper, we propose a novel semi-supervised classifier called discriminative deep belief networks (DDBN), based on a representative deep architecture, deep belief networks (DBN). Our paper makes two important contributions:

1. This paper proposes an effective semi-supervised classifier, DDBN, which has several attractive characteristics. First, DDBN utilizes a new deep architecture to integrate the abstraction ability of deep belief nets and the discriminative ability of the backpropagation strategy; the deep architecture is constructed by greedy layer-wise unsupervised learning, and the parameter space is trained using gradient descent supervised learning. Second, for unsupervised learning, DDBN inherits the advantage of DBN, which preserves information well from the high-dimensional feature space to the low-dimensional embedding, especially for modeling complicated structure; therefore, it improves generalization capability by using abundant unlabeled data. Third, for supervised learning, through a new objective function, DDBN directly optimizes the classification results on the training dataset using the backpropagation strategy; therefore, it can achieve attractive classification performance with few labeled data.

2. This paper applies semi-supervised learning to visual data classification successfully and observes several important facts. First, DDBN shows impressive classification results on a synthetic dataset as well as on visual datasets with the aid of unlabeled data. Second, we provide empirical validation that deep architectures are more efficient at representing most common functions and, at the same time, more effective at solving hard learning problems. Last but not least, the scales and depths of most existing deep architectures keep deep learning performance far from its real learning ability. This finding, together with the evidence from neuroscience, indicates that the learning ability of deep architectures is seriously underrated, at least for visual data analysis. This paper intends to offer new thoughts from the perspective of real-world applications, which potentially lead to breakthroughs in deep techniques in both theoretical and practical aspects.

2. Related work on deep learning

Recent advances in machine learning have caused some to consider neural networks obsolete, even dead.

Recent research on deep architectures, which are perhaps best exemplified by neural networks with multiple hidden layers, suggests that such announcements are premature [19]. One convincing support for the rationality of deep learning comes from neuroscience: humans observe and perceive the world in a similar way. For example, visual information from the retina is sent to the lateral geniculate nucleus of the thalamus, which in turn projects to the primary visual cortex in the occipital lobe. The primary visual cortex, also referred to as V1, is divided into six functionally distinct layers, labeled 1 through 6. Layer 4, the "granular" layer, which receives most visual input from the lateral geniculate nucleus, is further divided into four sublayers, labeled 4A, 4B, 4Ca, and 4Cb. V1 extracts a set of spatiotemporal features such as orientation and sends them to prestriate cortex V2; V2 then sends strong connections to V3, V4, and V5, as well as feedback connections to V1 [20]. So even for the simplest visual percept, dozens of cortical layers and around 140,000,000 neurons are involved [21], and this does not yet count the attention and consciousness of the object, which are processed in the thalamus. Theoretical analysis also indicates that, compared with shallow circuits, deep architectures are more efficient because they can represent most common functions, especially highly variable functions, compactly and effectively [17]. For example, to model the d-dimensional parity function, a Gaussian SVM uses O(d 2^d) parameters, while deep learning only needs O(d^2) parameters with O(log_2 d) hidden layers [16]. Bengio and LeCun argue that deep architectures have the potential to generalize in non-local ways, i.e., beyond immediate neighbors, and that this is crucial in order to make progress on the kind of complex tasks required for artificial intelligence [22]. Moreover, deep architectures have shown good performance in many real-world applications. Larochelle et al. present a series of experiments indicating that deep architectures show promise in solving harder learning problems that exhibit many factors of variation [17]. Salakhutdinov and Hinton show how to learn a deep graphical model of the word-count vectors obtained from a large set of documents [23]. Horster and Lienhart explore deep networks for deriving a low-dimensional image representation appropriate for image retrieval [24]. Yu et al. derive an efficient algorithm using stochastic gradient descent and demonstrate encouraging results for a visual recognition task, in terms of both accuracy and speed [25]. Mobahi et al. propose a learning method for deep architectures that takes advantage of sequential data, in particular the temporal coherence that naturally exists in unlabeled video recordings [26].

In this paper, we focus on the study of a representative deep architecture called deep belief networks (DBN), because DBN is a suitable model for semi-supervised learning on visual data classification. DBN partitions the learning procedure into two stages: abstracting the input information layer by layer, and fine-tuning the whole deep network toward the ultimate learning target [15]. This two-stage construction reduces the difficulty of learning the parameters of deep architectures with multiple hidden layers and, more importantly, makes DBN well suited to semi-supervised learning. Fig. 1 shows a DBN with three hidden layers h1, h2, and h3; x is the input datum and y is the learning target.
In the first stage, DBN pairs each feed-forward layer with a feed-back layer that attempts to reconstruct the input of the layer from its output. These layer-wise generative models are implemented by a family of restricted Boltzmann machines (RBMs), two-layer recurrent neural networks in which stochastic binary inputs are connected to stochastic binary outputs using symmetrically weighted connections, as shown in Fig. 2 [27]. After greedy unsupervised learning of each pair of layers, the lower-level features are progressively combined into more compact high-level representations.


Fig. 1. Deep belief network with three hidden layers h1, h2, h3, one input layer x, and one labeled layer y.


Fig. 2. A restricted Boltzmann machine. The top layer represents a vector of stochastic binary hidden feature h and the bottom layer represents a vector of binary visible data v. w is the symmetric interaction term between data v and feature h.

In the second stage, the whole deep network is refined using a contrastive version of the "wake-sleep" algorithm via a global gradient-based optimization strategy. Currently, DBN has been applied successfully to different real-world applications, such as text representation [23], audio event classification, object recognition [18], and various visual data analysis tasks [24,28]. Moreover, owing to this two-stage fast greedy learning, DBN exhibits notable performance under insufficient training data [14]. DBN-rNCA is a semi-supervised learning algorithm that combines the DBN architecture and neighborhood component analysis (NCA) for dimensionality reduction [14]. Experimental validation has demonstrated that DBN-rNCA noticeably improves the performance of handwritten digit recognition by using abundant unlabeled data. A number of efforts have been made to further improve the performance of DBN. Lee et al. developed a sparse variant of the DBN that faithfully mimics certain properties of visual area V2 [29]. Salakhutdinov and Murray utilized annealed importance sampling (AIS) to estimate the partition function of an RBM efficiently [30]. Vincent et al. motivated a new training principle for unsupervised learning of representations, based on the idea of making the learned representations robust to partial corruption of the input pattern [31]. Raina et al. sped up DBN learning using graphics processors [32]. The convolutional deep belief network (CDBN) trains all the layers of the deep architecture simultaneously to minimize the overall loss function and shows clear advantages in representing large images using only a small number of feature detectors [33].

3. Discriminative deep belief networks

In this section, we propose a novel semi-supervised learning algorithm, discriminative deep belief networks (DDBN), based on the representative deep architecture DBN. We formulate the problem in Section 3.1 and describe the deep architecture in Section 3.2. Section 3.3 discusses greedy layer-wise unsupervised learning and Section 3.4 presents gradient descent supervised learning with a new loss function. Section 3.5 introduces the training and test procedures of DDBN.

3.1. Problem formulation

Let X be a set of samples, which can be written as

X = [x^1, x^2, \ldots, x^{L+U}] = \begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^{L+U} \\ x_2^1 & x_2^2 & \cdots & x_2^{L+U} \\ \vdots & \vdots & & \vdots \\ x_D^1 & x_D^2 & \cdots & x_D^{L+U} \end{bmatrix}   (1)

where L is the number of labeled samples, U is the number of unlabeled samples, and D is the number of features. Every column of X is a sample x. A sample that has all features is viewed as a vector in \mathbb{R}^D, where the ith coordinate corresponds to the ith feature.

Let Y be the set of labels corresponding to the L labeled samples, which can be written as

Y = [y^1, y^2, \ldots, y^L] = \begin{bmatrix} y_1^1 & y_1^2 & \cdots & y_1^L \\ y_2^1 & y_2^2 & \cdots & y_2^L \\ \vdots & \vdots & & \vdots \\ y_C^1 & y_C^2 & \cdots & y_C^L \end{bmatrix}   (2)

where C is the number of classes. Every column of Y is a vector in \mathbb{R}^C, where the jth coordinate corresponds to the jth class:

y_j^i = \begin{cases} 1 & \text{if } x^i \in j\text{th class} \\ -1 & \text{if } x^i \notin j\text{th class} \end{cases}   (3)

We intend to seek the mapping function X \rightarrow Y using the L labeled data and the U unlabeled data. After training, we can determine the label y of a new sample x using this mapping function.

3.2. Architecture

To address the problem formulated in Section 3.1, we propose a novel semi-supervised learning method, DDBN. Fig. 3 shows the deep architecture of DDBN, a fully interconnected directed belief net with one input layer h^0, N hidden layers h^1, h^2, \ldots, h^N, and one label layer at the top. The input layer h^0 has D units, equal to the number of features of a sample x. The label layer has C units, equal to the number of classes of the label vector y. The numbers of units in the hidden layers are currently pre-defined according to experience or intuition. The search for the mapping function X \rightarrow Y is thus transformed into the problem of finding the parameter space W of the deep architecture. The training of DDBN can be divided into two stages:

1. DDBN is constructed by greedy layer-wise unsupervised learning using RBMs as building blocks. The U unlabeled data together with the L labeled data are utilized to find the parameter space W with N layers.
2. DDBN is trained according to the exponential loss function using the gradient descent method. The parameter space W is refined with a new loss function using the L labeled data.

Fig. 3. Discriminative deep belief network with hidden layers from h1 to hN, one input layer x, and one labeled layer y. The hidden layers hi are constructed using both labeled and unlabeled data, and the function f(hN(x), y) is learned using labeled data.

3.3. Unsupervised learning

As shown in Fig. 3, we construct DDBN layer by layer using restricted Boltzmann machines (RBMs), two-layer recurrent neural networks in which stochastic binary inputs are connected to stochastic binary outputs using symmetrically weighted connections [27]. We utilize RBMs as the building blocks because they are suitable for modeling the human visual system. As indicated by Hinton, there are several reasons for believing that our visual systems contain multilayer generative models in which top-down connections can be used to generate low-level features of images from high-level representations, and bottom-up connections can be used to infer the high-level representations that would have generated an observed set of low-level features [34]. Single-cell recordings [35] and the reciprocal connectivity between cortical areas both suggest a hierarchy of progressively more complex features in which each layer can influence the layer below. The details of RBMs can be found in [15,30].

In discriminative deep belief networks, we define the energy of the state (h^{k-1}, h^k) as

E(h^{k-1}, h^k; \theta) = -\sum_{s=1}^{D_{k-1}} \sum_{t=1}^{D_k} w_{st}^k h_s^{k-1} h_t^k - \sum_{s=1}^{D_{k-1}} b_s h_s^{k-1} - \sum_{t=1}^{D_k} c_t h_t^k   (4)

where \theta = (w, b, c) are the model parameters: w_{st}^k is the symmetric interaction term between unit s in layer h^{k-1} and unit t in layer h^k, k = 1, \ldots, N-1; b_s is the sth bias of layer h^{k-1} and c_t is the tth bias of layer h^k; D_k is the number of units in the kth layer. The probability that the model assigns to h^{k-1} is

P(h^{k-1}; \theta) = \frac{1}{Z(\theta)} \sum_{h^k} \exp(-E(h^{k-1}, h^k; \theta))   (5)

Z(\theta) = \sum_{h^{k-1}} \sum_{h^k} \exp(-E(h^{k-1}, h^k; \theta))   (6)

where Z(\theta) denotes the normalizing constant. The conditional distributions over h^k and h^{k-1} are given by

p(h^k | h^{k-1}) = \prod_t p(h_t^k | h^{k-1}), \quad p(h^{k-1} | h^k) = \prod_s p(h_s^{k-1} | h^k)   (7)

The probability of turning on unit t is a logistic function of the states of h^{k-1} and w_{st}:

p(h_t^k = 1 | h^{k-1}) = \mathrm{sigm}\left(c_t + \sum_s w_{st}^k h_s^{k-1}\right)   (8)

and the probability of turning on unit s is a logistic function of the states of h^k and w_{st}:

p(h_s^{k-1} = 1 | h^k) = \mathrm{sigm}\left(b_s + \sum_t w_{st}^k h_t^k\right)   (9)

where the logistic function is

\mathrm{sigm}(\eta) = 1/(1 + e^{-\eta})   (10)

The derivative of the log-likelihood with respect to the model parameter w^k can be obtained by the CD method [28]:

\frac{\partial \log p(h^{k-1})}{\partial w_{st}^k} = \langle h_s^{k-1} h_t^k \rangle_{P_0} - \langle h_s^{k-1} h_t^k \rangle_{P_M}   (11)

where \langle \cdot \rangle_{P_0} denotes an expectation with respect to the data distribution and \langle \cdot \rangle_{P_M} denotes a distribution of samples obtained by running the Gibbs sampler, initialized at the data, for M full steps. The parameter w^k can then be adjusted through

\Delta w_{st}^k = \upsilon \Delta w_{st}^k + \eta \frac{\partial \log p(h^{k-1})}{\partial w_{st}^k}   (12)

where \upsilon is the momentum and \eta is the learning rate.

The above discussion is based on a single sample x. In DDBN, we construct the deep architecture using all labeled and unlabeled data, inputting them one by one from layer h^0. The deep architecture is constructed layer by layer from bottom to top, and each time the parameter space w^k is trained on the data computed in the (k-1)th layer. With the w^k calculated above, layer h^k is obtained as follows when a sample x is input at layer h^0:

h_t^k(x) = \mathrm{sigm}\left(c_t^k + \sum_{s=1}^{D_{k-1}} w_{st}^k h_s^{k-1}(x)\right), \quad t = 1, \ldots, D_k, \; k = 1, \ldots, N-1   (13)

The parameter space w^N is initialized randomly, just as in the backpropagation algorithm:

h_t^N(x) = c_t^N + \sum_{s=1}^{D_{N-1}} w_{st}^N h_s^{N-1}(x), \quad t = 1, \ldots, D_N   (14)
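For concreteness, the following NumPy sketch (our illustration under stated assumptions, not the authors' code) performs one CD-1 update for a single RBM layer following Eqs. (8), (9), (11), and (12); the mini-batch formulation, the variable names, and the omission of the bias updates are our own simplifications.

```python
import numpy as np

def sigm(z):
    # Eq. (10): logistic function
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0, W, b, c, dW, momentum=0.5, lr=0.1, rng=np.random):
    """One contrastive-divergence (CD-1) step for an RBM layer.

    v0 : (batch, D_km1) states of the lower layer h^{k-1}
    W  : (D_km1, D_k) weights w^k_{st};  b, c : lower- and upper-layer biases
    dW : previous weight increment, carried for the momentum term of Eq. (12)
    """
    # positive phase, Eq. (8): p(h^k_t = 1 | h^{k-1})
    p_h0 = sigm(c + v0 @ W)
    h0 = (rng.rand(*p_h0.shape) < p_h0).astype(v0.dtype)  # sample binary states

    # negative phase, Eq. (9): reconstruct the lower layer, then go up once more
    p_v1 = sigm(b + h0 @ W.T)
    p_h1 = sigm(c + p_v1 @ W)

    # Eq. (11) with M = 1: <h^{k-1} h^k>_{P0} - <h^{k-1} h^k>_{P1}
    grad = (v0.T @ p_h0 - p_v1.T @ p_h1) / v0.shape[0]

    # Eq. (12): momentum update (bias updates follow the same rule, omitted here)
    dW = momentum * dW + lr * grad
    W = W + dW
    return W, dW, p_h0  # p_h0 feeds the next layer of the greedy stack
```

Stacking such updates layer by layer, with the activations p_h0 of one trained RBM used as the input of the next, mirrors the greedy construction of Eqs. (13) and (14).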

3.4. Supervised learning

After greedy layer-wise unsupervised learning, h^N(x) is the representation of x. In this section, we use the L labeled data to refine the parameter space W for better discriminative ability. This task can be formulated as an optimization problem:

\arg\min_W f(h^N(X), Y)   (15)

where

f(h^N(X), Y) = \sum_{i=1}^{L} \sum_{j=1}^{C} T(h_j^N(x^i) y_j^i)   (16)



Fig. 4. Comparison of different loss functions.

Fig. 6. Illustration of hinge error loss function in deep architecture with two categories.


Fig. 5. Illustration of squared error loss function in deep architecture with two categories.

T represents the loss function. Now the key is how to define a suitable loss function to improve the discriminative ability of the classifier. The squared error loss function, shown below, is the natural choice because it has been widely used in backpropagation:

T_{Square}(r) = (r - 1)^2   (17)

where r = h_j^N(x^i) y_j^i. But as shown in Fig. 4, when r > 1, T_{Square} increases with r. This makes squared-error loss an especially poor approximation to the misclassification error rate: classifications that are "too correct" are penalized as much as misclassification errors [36]. Hinge loss, commonly used in SVMs [12], is another possible choice:

T_{Hinge}(r) = \max(1 - r, 0)   (18)

However, SVMs only concern themselves with support vectors, i.e., data near the decision boundary. As shown in Fig. 4, T_{Hinge} is zero when r > 1, i.e., hinge loss stops the optimization once r = h_j^N(x^i) y_j^i exceeds one. This is not a problem if enough support vectors are available; unfortunately, in this paper the labeled data are insufficient. DDBN therefore utilizes the exponential loss function, which has been used in boosting techniques and performs well on real-world datasets [36]:

T_{Exponent}(r) = \exp(-r)   (19)

Fig. 7. Illustration of exponential error loss function in deep architecture with two categories.

We illustrate the behavior of the different loss functions in a deep architecture in Figs. 5-7. To simplify the discussion, we set the number of categories C = 2. Fig. 5 shows that squared error places increasing emphasis on data points that are correctly classified but lie a long way from the decision boundary on the correct side. Such points are strongly weighted at the expense of misclassified points, so if the objective is to minimize the misclassification rate, a monotonically decreasing loss function is a better choice [37]. Fig. 6 illustrates that the hinge loss function stops optimizing the deep network when the data are not near the decision boundary h_1^N = h_2^N. As illustrated in Fig. 7, the exponential loss function is smooth and monotonic.
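To make the comparison concrete, here is a small NumPy sketch (ours, not part of the paper) of the three losses in Eqs. (17)-(19), together with the gradient of the exponential loss with respect to the top-layer outputs, which is the quantity the backpropagation stage propagates downward; the function names are illustrative.

```python
import numpy as np

def squared_loss(r):
    """Eq. (17): also penalizes outputs that are 'too correct' (r > 1)."""
    return (r - 1.0) ** 2

def hinge_loss(r):
    """Eq. (18): flat (zero gradient) once r >= 1."""
    return np.maximum(1.0 - r, 0.0)

def exponential_loss(r):
    """Eq. (19): smooth and monotonically decreasing in r."""
    return np.exp(-r)

def exp_loss_and_grad(h_top, y):
    """Exponential loss summed over labeled samples and classes, plus its
    gradient w.r.t. the top-layer outputs h_top.

    h_top : (L, C) top-layer outputs h^N_j(x^i)
    y     : (L, C) targets in {-1, +1}
    """
    r = h_top * y                    # margins h^N_j(x^i) * y^i_j
    loss = np.sum(np.exp(-r))        # objective of Eqs. (15)-(16) with T = exp(-r)
    grad = -y * np.exp(-r)           # d loss / d h_top, fed back through the network
    return loss, grad

# toy check: margins -1, 0, 2 under the three losses
r = np.array([-1.0, 0.0, 2.0])
print(squared_loss(r), hinge_loss(r), exponential_loss(r))
```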


After determining the loss function, we use gradient descent through the whole DDBN to refine the weight space. In the supervised learning stage, the stochastic activities are replaced by deterministic, real-valued probabilities.

3.5. Classification using DDBN

The training procedure of DDBN is given in Algorithm 1. After training, we can use Eq. (20) to determine the label of a new sample:

\arg\max_j h_j^N(x)   (20)

Algorithm 1. Discriminative deep belief networks.

Input: data X, Y; number of units in every hidden layer D_1, \ldots, D_N; number of layers N; number of epochs Q; number of labeled data L; number of unlabeled data U; hidden layers h; parameter space W = \{w^1, \ldots, w^N\}; biases b, c; momentum \upsilon and learning rate \eta
Output: deep architecture with parameter space W

1. Greedy layer-wise unsupervised learning
   for k = 1; k <= N-1 do
     for q = 1; q <= Q do
       for u = 1; u <= L+U do
         Calculate the non-linear positive and negative phases:
           p(h_{t,u}^k = 1 | h_u^{k-1}) = \mathrm{sigm}(c_t + \sum_s w_{st}^k h_{s,u}^{k-1})
           p(h_{s,u}^{k-1} = 1 | h_u^k) = \mathrm{sigm}(b_s + \sum_t w_{st}^k h_{t,u}^k)
         Update the weights and biases:
           \Delta w_{st}^k = \upsilon \Delta w_{st}^k + \eta (\langle h_{s,u}^{k-1} h_{t,u}^k \rangle_{P_0} - \langle h_{s,u}^{k-1} h_{t,u}^k \rangle_{P_1})
       end for
     end for
   end for
2. Supervised learning based on gradient descent
   \arg\min_W \sum_{i=1}^{L} \sum_{j=1}^{C} \exp(-h_j^N(x^i) y_j^i)

4. Experiments

4.1. Experimental setting

We evaluate the performance of the proposed DDBN algorithm on three datasets of different kinds and scales, summarized in Table 1. The first dataset, g50c, is a small artificial dataset often used to illustrate semi-supervised learning algorithms [10,12]. The second dataset, Caltech 101, is a standard dataset for image classification that includes images of 101 different objects plus a background category [38]; in this paper we use the images from the first five categories, which contain more samples for semi-supervised learning. The third dataset is MNIST, a standard dataset for the empirical validation of deep learning algorithms [13,14]; MNIST is a handwritten digit database containing 60,000 training images and 10,000 test images.

Table 1. Datasets used in the experiments.

Dataset              | Categories | Dimensions | Data points | Labeled data
g50c                 | 2          | 50         | 550         | 2-50
Caltech 101 (subset) | 5          | 400        | 2935        | 5-75
MNIST                | 10         | 784        | 70,000      | 10-100

For the different datasets, DDBN uses different numbers of hidden layers and different numbers of units per hidden layer. For greedy layer-wise unsupervised learning, we train the weights of each layer separately for a fixed number of epochs (30), with the learning rate \eta equal to 0.1. The initial momentum \upsilon is 0.5, and after 5 epochs the momentum is set to 0.9. For supervised learning, we use the method of conjugate gradients; three line searches are performed in each epoch, and we run 20 epochs of supervised learning. We compare the classification performance of DDBN with six representative classifiers: k-nearest neighbor (KNN), support vector machines (SVM) [12], transductive SVM (TSVM) [12], neural network (NN) [39], EmbedNN [13], and DBN-rNCA [14]. KNN, a typical nonlinear classifier, is used as the baseline for performance comparison; in this paper we set k equal to 3. SVM and NN are two powerful classification methods, TSVM is the semi-supervised version of SVM, EmbedNN is the semi-supervised version of NN with a deep architecture, and DBN-rNCA is the semi-supervised version of DBN.

Fig. 8. Classification error rate on test data with different numbers of labeled data on the small-scale artificial dataset.

4.2. Small-scale artificial dataset

The dataset g50c is a small-scale dataset with 550 data points from two categories; each data point is represented using 50 features. We perform experiments using the same setting as previous work [10,12]. In 10-fold cross-validation, we partition the data into 10 splits, each using 50 points for training and the remaining points for test. The number of labeled data varies from 2 to 50, and each class has at least one labeled sample. The DDBN structure used in this experiment is 50-50-50-200-2, i.e., 50 units in the input layer, 2 in the output layer, and 50, 50, and 200 units in the three hidden layers. Fig. 8 shows the average classification error rate on test data under cross-validation. For all numbers of labeled data, DDBN outperforms the other classification methods. The greedy layer-wise training of deep models abstracts the feature space of the g50c dataset very effectively, so DDBN and DBN-rNCA are powerful enough to partition the data into two categories; only two labeled data points are needed to indicate which category the data belong to. DBN-rNCA merely compresses the data with the neighborhood component analysis objective function and then classifies the data using KNN, whereas DDBN optimizes its parameter space directly for classification; hence DDBN performs better than DBN-rNCA.
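As a hedged illustration of how the pieces fit together, the sketch below strings the layer-wise CD-1 pre-training and the exponential-loss fine-tuning into the two-stage procedure of Algorithm 1, using the g50c configuration 50-50-50-200-2 as the example. It reuses cd1_update and exp_loss_and_grad from the earlier sketches, updates only the top-layer weights with plain gradient descent rather than backpropagating conjugate gradients through every layer as the paper does, and all function names are ours.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def pretrain(X_all, sizes, epochs=30, lr=0.1, momentum=0.5):
    """Stage 1 of Algorithm 1: greedy layer-wise CD-1 over labeled + unlabeled data.
    For g50c, sizes = [50, 50, 50, 200]; the 2-unit label layer is handled below."""
    rng = np.random.RandomState(0)
    layers, h = [], X_all
    for d_in, d_out in zip(sizes[:-1], sizes[1:]):
        W, dW = 0.01 * rng.randn(d_in, d_out), np.zeros((d_in, d_out))
        b, c = np.zeros(d_in), np.zeros(d_out)
        for _ in range(epochs):
            W, dW, _ = cd1_update(h, W, b, c, dW, momentum, lr, rng)
        layers.append((W, c))
        h = sigm(c + h @ W)            # Eq. (13): propagate the data up one layer
    return layers

def top_output(X, layers, W_top, c_top):
    h = X
    for W, c in layers:
        h = sigm(c + h @ W)
    return h, c_top + h @ W_top        # Eq. (14): linear, randomly initialized top layer

def finetune_top(X_lab, Y_lab, layers, n_classes, epochs=20, lr=0.01):
    """Stage 2, simplified: minimize the exponential loss of Eqs. (15)-(16).
    Only the top weights are refined here; the paper refines every layer."""
    rng = np.random.RandomState(1)
    d_top = layers[-1][0].shape[1]
    W_top, c_top = 0.01 * rng.randn(d_top, n_classes), np.zeros(n_classes)
    for _ in range(epochs):
        h, out = top_output(X_lab, layers, W_top, c_top)
        _, grad = exp_loss_and_grad(out, Y_lab)   # gradient w.r.t. the top-layer output
        W_top -= lr * (h.T @ grad)
        c_top -= lr * grad.sum(axis=0)
    return W_top, c_top

def predict(X, layers, W_top, c_top):
    _, out = top_output(X, layers, W_top, c_top)
    return out.argmax(axis=1)          # Eq. (20): arg max_j h^N_j(x)
```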

4.3. Middle-scale image dataset

The Caltech 101 dataset includes images from 101 categories. In this experiment, we work on a subset of Caltech 101 containing 2935 images from the first five categories: 435 images of "Faces", 435 images of "Faces_easy", 798 images of "Motorbikes", 467 images of "Back_google", and 800 images of "Airplanes". Classification on the Caltech 101 dataset is a well-known hard task. As shown in Fig. 9, the sizes of the images differ, and even within the same category the images have large variance. We preprocess the images into vectors of the same size to serve as the input of DDBN, using only luminance information. Assuming the resolution of an image is m x n, we fetch the middle square of d x d pixels, where d = min(m, n), and resample it to 20 x 20. The DDBN in this experiment therefore has the structure 400-200-200-800-5, i.e., 400 units in the input layer, 5 in the output layer, and 200, 200, and 800 units in the three hidden layers.

Similar to the previous setting, we compare the classification error rate of the different methods with various numbers of labeled data. We set the number of labeled data to 5, 25, 50, and 75, respectively, and each class has at least one labeled sample. As shown in Table 2, DDBN delivers stable and impressive performance on this dataset. This experiment not only demonstrates the effectiveness of deep learning on visual data analysis, but also provides empirical validation for the argument that deep architectures show promise in solving hard learning problems [13,16].

Fig. 9. Selected images from Caltech 101 of five categories: "Faces", "Faces_easy", "Motorbikes", "Back_google", "Airplanes".

Table 2. Classification error rate (%) on test data with different numbers of labeled data on the middle-scale image dataset Caltech 101.

Labeled data | 5     | 25    | 50    | 75
KNN          | 55.56 | 41.92 | 36.83 | 35.38
SVM          | 50.24 | 33.86 | 32.76 | 31.19
TSVM         | 50.13 | 29.83 | 29.50 | 27.14
NN           | 46.83 | 34.03 | 33.28 | 29.38
EmbedNN      | 48.90 | 44.50 | 41.30 | 36.10
DBN-rNCA     | 56.89 | 45.91 | 44.50 | 34.23
DDBN         | 41.70 | 28.59 | 27.97 | 25.97
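As a small illustration of the preprocessing just described for Section 4.3 (center d x d crop with d = min(m, n), luminance only, resampled to 20 x 20 and flattened to a 400-dimensional vector), here is a sketch using Pillow and NumPy; the library choice and the scaling to [0, 1] are our assumptions, not details from the paper.

```python
import numpy as np
from PIL import Image

def preprocess_caltech(path, out_side=20):
    """Center-crop the largest square, keep luminance only, resize, and flatten."""
    img = Image.open(path).convert("L")      # "L" = 8-bit luminance
    width, height = img.size
    d = min(width, height)
    left = (width - d) // 2
    top = (height - d) // 2
    img = img.crop((left, top, left + d, top + d))
    img = img.resize((out_side, out_side), Image.BILINEAR)
    # flatten to a 400-dimensional vector; scaling to [0, 1] is our choice
    return np.asarray(img, dtype=np.float64).reshape(-1) / 255.0
```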

4.4. Large-scale handwritten dataset

MNIST is a large handwritten digit database containing 70,000 images from 10 classes corresponding to the digits 0-9. Conventionally, 60,000 images are used for training and 10,000 images for test. The images contain only luminance information and the resolution is 28 x 28. The DDBN structure used in this experiment is 784-500-500-2000-10, which is the classical experimental setting for deep learning [14,15]. In this part, we conduct three experiments.

The first experiment demonstrates the effectiveness of DDBN on the large-scale handwritten digit dataset. For performance comparison, we set the number of labeled data to 10, 20, 50, and 100, respectively, and each class includes at least one labeled sample. The rest of the images are used as unlabeled data.

Table 3. Classification error rate (%) on test data with different numbers of labeled data on the large-scale handwritten digit dataset MNIST.

Labeled data | 10    | 20    | 50    | 100
KNN          | 48.39 | 45.90 | 32.35 | 22.68
SVM          | 47.45 | 44.35 | 29.93 | 23.00
NN           | 48.73 | 45.99 | 31.32 | 25.89
EmbedNN      | 46.60 | 44.60 | 36.00 | 16.86
DBN-rNCA     | 39.55 | 37.28 | 25.84 | 18.71
DDBN         | 35.79 | 34.25 | 17.82 | 12.23


Table 3 shows the classification error on the test set for the different classifiers. DDBN clearly performs much better than the other classifiers. For the TSVM algorithm, the reported error rate is 16.81% when 100 labeled data and 2000 unlabeled data are used [12]; however, due to its high computation cost, the TSVM experiment had still not finished after several weeks of running when nearly 60,000 images were used as unlabeled data.

The second experiment demonstrates the influence of unlabeled data in semi-supervised learning. We fix the number of labeled data at 100 and vary the number of unlabeled data. The classification error rate on the 10,000 test images with different numbers of unlabeled data is shown in Fig. 10. From the figure, we can see clearly that when the number of unlabeled data is smaller than 20,000, the test error rate decreases dramatically as unlabeled data are added. Under the current deep structure, when the number of unlabeled data exceeds 20,000, the classification result fluctuates and the effect of additional unlabeled samples is not obvious.

Fig. 10. Classification error rate on test data with different numbers of unlabeled data and 100 labeled data on MNIST.

The third experiment demonstrates the influence of the loss function in deep learning. Fig. 11 compares the classification error rate on the test set when different loss functions are used in the deep belief networks; the number of unlabeled data is fixed at 60,000. For all numbers of labeled data, the exponential loss function outperforms the hinge loss and squared error loss functions.

Fig. 11. Classification error rate on test data with different numbers of labeled data using different loss functions on MNIST.

4.5. Scales and depth of deep architecture

In this section, we conduct two experiments to study the relation between deep learning performance and the scale of the deep architecture on the MNIST dataset. In the first experiment, we keep deepening the architecture by increasing the number of hidden layers. As mentioned previously, the classical setting of the deep architecture in existing work is 784-500-500-2000-10, i.e., 784 units in the input layer, 10 in the output layer, and 500, 500, and 2000 units in the three hidden layers. We therefore begin with a similar structure with two hidden layers, 784-500-2000-10, and keep inserting hidden layers of 500 units into the deep network until the number of hidden layers reaches 70. As shown in Fig. 12, the depth varies from 2 to 70 and the total number of hidden units varies from 2500 to 36,500. We fix the number of labeled data at 100 and the number of unlabeled data at 60,000 for DDBN training. The classification error rate on the test set at the various scales is given in Fig. 12. We find that the classification error decreases markedly once the architecture is deeper than 10 hidden layers. The best performance appears when the depth is equal to 50 and the number of hidden units is equal to 26,500. This is a surprising observation because, so far, most deep architectures use only 2-5 hidden layers, which keeps deep learning performance far from its real learning ability. Moreover, this observation is strongly supported by research from neuroscience: as mentioned in the related work, even the simplest visual percept involves dozens of cortical layers and millions of neurons [21]. Indeed, Le Roux and Bengio have proven that adding hidden units yields strictly improved modeling power; this theoretical analysis indicates that RBMs are very powerful and can approximate any distribution when the number of hidden units is allowed to be very large [40].

Fig. 12. Classification error rate on test data with different numbers of hidden layers on MNIST.

Table 4. Deep architectures with different numbers of hidden layers and a fixed total number of hidden units.

Input | Hidden layers    | Output
784   | 8192             | 10
784   | 4096 4096        | 10
784   | 4096 2048 2048   | 10
784   | 4096 2048 1024 … | 10

Fig. 13. Classification error rate and real running time with different numbers of hidden layers and the same number of hidden units on MNIST.
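The layer schedule of the first scaling experiment can be written down explicitly; the helper below (ours, for illustration) reproduces the reported totals of 2500 hidden units at depth 2, 26,500 at depth 50, and 36,500 at depth 70.

```python
def ddbn_layer_sizes(n_hidden_layers):
    """Layer widths for the deepening experiment: input 784, output 10,
    (n_hidden_layers - 1) hidden layers of 500 units plus one 2000-unit layer."""
    assert n_hidden_layers >= 2
    return [784] + [500] * (n_hidden_layers - 1) + [2000] + [10]

def total_hidden_units(n_hidden_layers):
    return sum(ddbn_layer_sizes(n_hidden_layers)[1:-1])

# depth 2 -> 2500 hidden units, depth 50 -> 26,500, depth 70 -> 36,500
for depth in (2, 10, 50, 70):
    print(depth, total_hidden_units(depth))
```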


The first experiment provides empirical evidence on a real visual dataset and leads to new thoughts from the perspective of real-world applications. Weston et al. have done similar work using EmbedNN, and their conclusion differs from ours [13]: EmbedNN overfits when the number of hidden layers is larger than 10. Three reasons may lead to the different observations of deep architectures for semi-supervised learning. The first reason is that DDBN uses RBMs as building blocks, and these generative models show good and stable performance. The second reason is that the use of abundant unlabeled data makes the learning model more robust; therefore, deep learning shows distinguished advantages in semi-supervised learning. The third reason, which concerns the relation between depth and width at the same scale of the deep architecture, is discussed in the second experiment.

In the second experiment, we fix the total number of hidden units at 8192 and vary the depth of the deep network as shown in Table 4. Fig. 13 illustrates the classification error rate on test data using 100 labeled data and 60,000 unlabeled data. The performance keeps improving until the number of hidden layers reaches 20. Compared with the first experiment, in which the number of hidden units grows as large as 36,500 and the performance peaks at depth 50, we conclude that "the deeper, the better" holds only when the scale of the deep architecture is large enough; otherwise, increasing the depth can cause serious overfitting. This is consistent with the observation of EmbedNN by Weston et al. in [13].

Fig. 13 also shows the relationship between the real running time and the depth of the deep architecture at the same scale. The operating system used in the experiments is Red Hat Enterprise Linux Server release 5.1, and the CPU speed is 2.83 GHz. Clearly, the deeper the network, the faster it runs. One important observation is that deep learning is both more effective and more efficient for the classification task: for example, with the same scale and the same labeled and unlabeled data, DDBN with 20 hidden layers achieves better accuracy as well as lower running time than DDBN with 10 hidden layers. Another important observation is that, in all experiments reported in this paper, DDBN converges quickly: on the three public datasets, DDBN achieves its best performance at around 20 iterations and, more importantly, is not sensitive to the size of the feature space or the type of data.

5. Conclusion

Inspired by the distinguished learning ability of deep networks, this paper proposes a novel semi-supervised learning algorithm, DDBN, and applies it successfully to visual data classification. By combining the generalization capability of unsupervised learning with the discriminative ability of supervised learning, DDBN demonstrates impressive learning performance on a synthetic dataset as well as on real-world datasets. Future work will explore two aspects. First, we will study how to determine the scale of the deep architecture for various applications, and how to speed up training and testing when large deep networks are used. Second, we will consider how to make better use of limited labeled data, for example, by selecting representative training samples for more effective semi-supervised learning.

Acknowledgements

The work is partly supported by Hong Kong RGC General Research Fund PolyU 5204/09E and the National Natural Science Foundation of China (No. 60703015 and No. 60973076).


References

[1] O. Chapelle, B. Scholkopf, A. Zien, Semi-supervised Learning, MIT Press, USA, 2006.
[2] Z.H. Zhou, D.C. Zhan, Q. Yang, Semi-supervised learning with very few labeled training examples, in: The 22nd AAAI Conference on Artificial Intelligence, 2007, pp. 675-680.
[3] X. Zhu, Semi-supervised learning literature survey, Computer Sciences TR 1530, University of Wisconsin-Madison, 2007.
[4] C. Rosenberg, M. Hebert, H. Schneiderman, Semi-supervised self-training of object detection models, in: Seventh IEEE Workshops on Application of Computer Vision, 2005, pp. 29-36.
[5] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: The Eleventh Annual Conference on Computational Learning Theory, ACM, Madison, Wisconsin, United States, 1998, pp. 92-100.
[6] Z.H. Zhou, M. Li, Tri-training: exploiting unlabeled data using three classifiers, IEEE Transactions on Knowledge and Data Engineering 17 (2005) 1529-1541.
[7] R.P. Adams, Z. Ghahramani, Archipelago: nonparametric Bayesian semi-supervised learning, in: International Conference on Machine Learning, Canada, 2009, pp. 1-8.
[8] A. Patel, S. Sundararajan, S. Shevade, Semi-supervised classification using sparse Gaussian process regression, in: International Joint Conferences on Artificial Intelligence, 2009, pp. 1193-1198.
[9] A. Blum, S. Chawla, Learning from labeled and unlabeled data using graph mincuts, in: Proceedings of the Eighteenth International Conference on Machine Learning, Morgan Kaufmann Publishers Inc., 2001, pp. 19-26.
[10] O. Chapelle, A. Zien, Semi-supervised classification by low density separation, in: International Workshop on Artificial Intelligence and Statistics, 2005, pp. 57-64.
[11] V. Sindhwani, P. Niyogi, M. Belkin, Beyond the point cloud: from transductive to semi-supervised learning, in: International Conference on Machine Learning, ACM, Bonn, Germany, 2005, pp. 824-831.
[12] R. Collobert, F. Sinz, J. Weston, L. Bottou, Large scale transductive SVMs, Journal of Machine Learning Research 7 (2006) 1687-1712.
[13] J. Weston, F. Ratle, R. Collobert, Deep learning via semi-supervised embedding, in: International Conference on Machine Learning, ACM, Helsinki, Finland, 2008, pp. 1168-1175.
[14] R.R. Salakhutdinov, G.E. Hinton, Learning a nonlinear embedding by preserving class neighbourhood structure, in: Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, 2007.
[15] G.E. Hinton, S. Osindero, Y.W. Teh, A fast learning algorithm for deep belief nets, Neural Computation 18 (2006) 1527-1554.
[16] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in: Advances in Neural Information Processing Systems, 2006, pp. 153-160.
[17] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, Y. Bengio, An empirical evaluation of deep architectures on problems with many factors of variation, in: Proceedings of the International Conference on Machine Learning, 2007, pp. 473-480.
[18] G.E. Hinton, To recognize shapes, first learn to generate images, in: Computational Neuroscience: Theoretical Insights into Brain Function, 2007, pp. 535-547.
[19] G.W. Cottrell, New life for neural networks, Science 313 (2006) 454-455.
[20] P.H. Schiller, On the specificity of neurons and visual areas, Behavioural Brain Research 76 (1996) 21-35.
[21] G. Leuba, R. Kraftsik, Changes in volume, surface estimate, 3-dimensional shape and total number of neurons of the human primary visual cortex from midgestation until old age, Anatomy and Embryology 190 (1994) 351-366.
[22] Y. Bengio, Y. LeCun, Scaling learning algorithms towards AI, in: Large-Scale Kernel Machines, 2007.
[23] R. Salakhutdinov, G. Hinton, Semantic hashing, International Journal of Approximate Reasoning 50 (2009) 969-978.
[24] E. Horster, R. Lienhart, Deep networks for image retrieval on large-scale databases, in: Proceedings of the 16th ACM International Conference on Multimedia, ACM, Vancouver, British Columbia, Canada, 2008, pp. 643-646.
[25] K. Yu, W. Xu, Y. Gong, Deep learning with kernel regularization for visual recognition, in: Advances in Neural Information Processing Systems, 2008, pp. 1889-1896.
[26] H. Mobahi, R. Collobert, J. Weston, Deep learning from temporal coherence in video, in: International Conference on Machine Learning, ACM, Canada, 2009, pp. 737-744.
[27] P. Smolensky, Information processing in dynamical systems: foundations of harmony theory, in: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations, MIT Press, 1986, pp. 194-281.
[28] E.K. Chen, X.K. Yang, H.Y. Zha, R. Zhang, W.J. Zhang, Learning object classes from image thumbnails through deep neural networks, in: International Conference on Acoustics, Speech and Signal Processing, 2008, pp. 829-832.
[29] H. Lee, C. Ekanadham, A.Y. Ng, Sparse deep belief net model for visual area V2, in: Advances in Neural Information Processing Systems, 2007, pp. 1416-1423.
[30] R.R. Salakhutdinov, I. Murray, On the quantitative analysis of deep belief networks, in: Proceedings of the International Conference on Machine Learning, ACM, Helsinki, Finland, 2008, pp. 872-879.
[31] P. Vincent, H. Larochelle, Y. Bengio, P.A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: International Conference on Machine Learning, ACM, Helsinki, Finland, 2008, pp. 1096-1103.


[32] R. Raina, A. Madhavan, A.Y. Ng, Large-scale deep unsupervised learning using graphics processors, in: International Conference on Machine Learning, ACM, Canada, 2009, pp. 873-880.
[33] H. Lee, R. Grosse, R. Ranganath, A.Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in: International Conference on Machine Learning, ACM, Canada, 2009, pp. 609-616.
[34] G.E. Hinton, Learning multiple layers of representation, Trends in Cognitive Sciences 11 (2007) 428-434.
[35] T.S. Lee, D. Mumford, R. Romero, V.A.F. Lamme, The role of the primary visual cortex in higher level vision, Vision Research 38 (1998) 2429-2454.
[36] J.H. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Annals of Statistics 28 (2000) 337-407.
[37] C.M. Bishop, Pattern Recognition and Machine Learning, Springer Verlag, 2006.
[38] F. Li, R. Fergus, P. Perona, Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories, in: CVPR Workshop, 2004.
[39] T.M. Mitchell, Machine Learning, 1997.
[40] N.L. Roux, Y. Bengio, Representational power of restricted Boltzmann machines and deep belief networks, Neural Computation 20 (2008) 1631-1649.

Yan Liu joined the Department of Computing at The Hong Kong Polytechnic University as an assistant professor in 2005. She received her Ph.D. from the Department of Computer Science at Columbia University, her M.Sc. from the School of Business at Nanjing University, and her B.Eng. from the Department of Electrical Engineering at Southeast University. Her research interests include multimedia content analysis and machine learning.

Shusen Zhou is a Ph.D. candidate in Harbin Institute of Technology Shen-zhen Graduate School. His main research interests include artificial intelligence, machine learning and multimedia content analysis.

Qingcai Chen received the Ph.D. degree in computer science from the Computer Science and Engineering Department, Harbin Institute of Technology. From September 2003 to August 2004, he worked for Intel (China) Ltd. as a senior software engineer. Since September 2004, he has been with the Computer Science and Technology Department of Harbin Institute of Technology Shen-zhen Graduate School as an associate professor. His research interests include machine learning, pattern recognition, speech signal processing, and natural language processing.