SRNN: Self-Regularized Neural Network


DOI: 10.1016/j.neucom.2017.07.051
To appear in: Neurocomputing
Received 10 April 2017; revised 8 July 2017; accepted 17 July 2017


Chunyan Xu a, Jian Yang a,∗, Junbin Gao b, Hanjiang Lai c, Shuicheng Yan d

a School of Computer Science and Engineering, Nanjing University of Science and Technology, 210094, China.
b Discipline of Business Analytics, University of Sydney Business School, University of Sydney, NSW 2006, Australia.
c School of Data and Computer Science, Sun Yat-Sen University, 510275, China.
d Department of Electrical and Computer Engineering, National University of Singapore, 117583, Singapore.

∗ Corresponding author.

Abstract

In this work, we aim to boost the discriminative capability of a deep neural network by alleviating the over-fitting problem. Previous works usually learn a neural network by optimizing one or more objective functions together with existing regularization methods (such as dropout, weight decay, stochastic pooling, and data augmentation). We argue that such approaches can hardly further improve the classification performance of a neural network, because they do not exploit the network's own learned knowledge. In this paper, we introduce a self-regularized strategy for learning a neural network, named the Self-Regularized Neural Network (SRNN). The intuition behind the SRNN is that the sample-wise soft targets of a neural network may have the potential to drag the network out of its local optimum. More specifically, an initial neural network is first pre-trained by optimizing one or more objective functions with ground-truth labels. We then gradually mine sample-wise soft targets, which reveal the correlation/similarity among classes predicted by the network itself. The parameters of the neural network are further updated to fit these sample-wise soft targets. This self-regularized learning procedure minimizes an objective function that integrates the sample-wise soft targets of the neural network and the ground-truth labels of the training samples. Three characteristics of the SRNN can be summarized as follows: (1) it gradually mines the learned knowledge from a single neural network, and then corrects and enhances this knowledge, resulting in the sample-wise soft targets; (2) it regularly optimizes the parameters of the neural network with these sample-wise soft targets; (3) it boosts the discriminative capability of the neural network through the self-regularization strategy. Extensive experiments on four public datasets, i.e., CIFAR-10, CIFAR-100, Caltech101 and MIT-Indoor, demonstrate the effectiveness of the proposed SRNN for image classification.

Keywords: Self-regularized learning, sample-wise soft targets, neural network, image classification.

1. Introduction

Deep neural networks, convolutional neural networks (CNNs) [1] in particular, have recently drawn a resurgence of attention due to their success in large-scale visual recognition [2, 3, 4], such as image classification [5], object detection [6, 7], scene understanding [8], and event detection [9]. Relying on massive computational resources and the availability of large-scale databases, deep neural networks have proven to be a powerful tool for computer vision tasks. Their success has been largely attributed to the considerable number of model parameters spread over many network layers. For example, GoogLeNet [10], the winner of the Large Scale Visual Recognition Challenge 2014 classification task, employed a 22-layer neural network to classify images into 1000 categories on the ImageNet dataset [11]. However, such deep neural networks with a large number of parameters are prone to over-fitting during learning.

While previous work has devoted considerable effort to regularization strategies [12, 13, 14, 15, 16] for alleviating this over-fitting problem and boosting the discriminative capability of a neural network, none of it considers how to address over-fitting with the self-learned knowledge of the network itself. The self-learned knowledge of a neural network, which is "mysteriously" encapsulated in its connection weights, refers to any knowledge acquired by the network itself.


[Figure 1 shows an example image together with its type-2 knowledge, i.e., the predicted class probabilities: ship 0.006, cat 0.6, truck 0.004, dog 0.39.]

Figure 1: Illustration of different types of knowledge. For knowledge of type 1, the feature maps from convolutional layers can describe the shape of the object in the original image to some extent. Knowledge of type 2, which can be extracted from the softmax layer, refers to the distribution of class probabilities and defines a similarity metric over the classes that can be employed to learn a better model. For example, the CNN model may give an image of a "cat" a probability of 0.39 of being a "dog", which is still far greater than its probability of 0.006 of being a "ship".


Different types of knowledge are illustrated in Fig. 1. The learned knowledge of a neural network (e.g., convolutional feature maps and the parameters of fully-connected layers) has enabled great advances in computer vision tasks, such as image classification [17, 18, 19] and image segmentation [20]. Most of these approaches analyze the learned knowledge of a neural network with existing traditional tools (e.g., support vector machine classifiers or statistical analysis methods). Recently, Hinton et al. [19] proposed a knowledge distillation technique that mines the class probabilities from an ensemble (e.g., a set of well-trained neural network models) or from a single neural network model. When the learned knowledge of a neural network is not a true reflection of the training samples, due to possible prediction errors, we can correct it so that it corresponds with the ground-truth information of each training sample; the corrected knowledge then has a high degree of confidence over all training samples. Although such learned knowledge has been utilized for model compression and training specialist networks [19], incremental training [21], training thin deep nets [22] and training recurrent neural networks [23], no previous work explicitly considers using a single network simultaneously as both the teacher and the student to alleviate its own over-fitting problem.

From the perspective of neuroscience, a CNN shares many properties with the visual system of the brain. The self-feedback process, which is abundant in the human brain [24], plays an important role in the neocortex, and the learning process of a person is dynamic under the action of self-feedback synapses. For example, after observing some objects or reading a book, a person inherently possesses the capability of understanding and thinking about what he or she has observed. This capability can be further improved by employing the knowledge the person has already learned, which can be seen as a self-feedback learning process. However, most existing CNN models [2, 4, 25] are optimized by minimizing one or more objective functions (e.g., softmax loss, hinge loss, L2 loss) only under the guidance of label information. Motivated by the importance of the self-feedback process in the human visual system, we propose a novel Self-Regularized Neural Network (SRNN) framework for boosting the discriminative capability of a neural network. The SRNN is illustrated in Fig. 2; it combines the feed-forward/bottom-up and feed-back/top-down learning processes. The feed-back learning process of the SRNN can be seen as a self-regularization guided by both the learned knowledge and the ground-truth information, where the learned knowledge from a pre-trained neural network comes with a high degree of confidence. To fit both the learned knowledge (e.g., the type-2 knowledge of Fig. 1) and the ground-truth labels, we do not completely drop the ground-truth information, but combine it with the learned knowledge throughout the SRNN learning process.

Based on a general neural network architecture (e.g., Network In Network [4], spatially-sparse convolutional neural networks [26], or a standard convolutional neural network [2]), our SRNN gradually obtains valuable knowledge from a single neural network and iteratively optimizes the model by updating the objective function with this learned knowledge and the ground-truth labels of the training samples.

[Figure 2 shows the SRNN pipeline: labeled data are fed into a deep neural network, whose learned knowledge is fed back for self-regularization, yielding the final classification results (e.g., automobile, cat, frog).]

Figure 2: Overview of the proposed Self-Regularized Neural Network (SRNN). Under a single neural network architecture (e.g., a convolutional neural network, CNN [2]), we gradually mine the learned knowledge from the network itself. We then iteratively optimize the parameters of the neural network with this learned knowledge, which we call a self-regularized learning process. By boosting the discriminative capability of the neural network under this self-regularization strategy, the final classification results are obtained by the proposed SRNN.

The discriminative capability of the deep neural network is thus boosted under this self-regularization technique. Specifically, an initial neural network is first pre-trained by optimizing one or more loss functions (e.g., cross-entropy loss, hinge loss, L2 loss) with ground-truth labels. We then mine the learned knowledge from the output (softmax) layer of this neural network, which reflects the class correlation/similarity over the training data. Guided by the ground-truth label information, we further correct part of the learned knowledge, resulting in the sample-wise corrected knowledge, from which the sample-wise soft targets are finally obtained for iteratively learning the parameters of the neural network. This self-regularization can be seen as a learning procedure that minimizes an objective function integrating the learned knowledge of the neural network and the ground-truth label information during training. The intuition behind the proposed SRNN framework is that the sample-wise soft targets of a neural network may have the potential to drag the network out of its local optimum while it iteratively optimizes itself, thereby achieving self-regularization of the network.

The major contributions of the proposed SRNN can be summarized as follows. (1) Along with optimizing the parameters of the neural network, we gradually mine the learned knowledge depending only on a single neural network; in our SRNN, this learned knowledge refers to the class correlation/similarity over the training data predicted by the network. (2) We propose a novel self-regularized strategy that optimizes the parameters of a neural network with the sample-wise soft targets produced by the network itself. The sample-wise soft targets are gradually updated during the self-regularized learning procedure and in turn improve the discriminative capability of the network. Benefiting from this strategy, we optimize the parameters of a single neural network by simultaneously employing the sample-wise soft targets and the ground-truth label information. (3) Under any neural network architecture (such as a convolutional neural network [2], Network In Network [4], or deeply-supervised nets [25]), our SRNN can not only alleviate the over-fitting problem of the network to some extent, but also boost its classification performance through the self-regularization strategy.

2. Related work

Network architectures: Deep neural networks (DNNs), especially CNNs, have long been studied and applied in the field of computer vision [2, 19, 27]. A classic CNN is comprised of one or more convolutional layers and pooling/sub-sampling layers, followed by one or more fully connected layers as in a standard multilayer neural network. One of the most popular CNNs, AlexNet, proposed by Krizhevsky et al. [2] for image classification, employed an effective regularization method called "dropout" to reduce over-fitting in the fully-connected layers. By building micro neural networks with more complex structures to abstract the data within the receptive field, Lin et al. [4] proposed a deep network structure called Network In Network (NIN) to enhance the discriminative capability for local patches within the receptive field. To process spatially-sparse inputs, Graham [26] proposed a spatially-sparse convolutional neural network (SparseCNN) to train a deeper network; when the input array is spatially sparse, it makes sense to exploit the sparsity to speed up computation. Such neural network frameworks (e.g., CNN [2], NIN [4], SparseCNN [26]) can learn discriminative visual features, leading to great success in image classification [10] and other visual recognition tasks [28, 20]. Liu et al. [29] survey several deep learning architectures (such as the auto-encoder, convolutional neural network, deep belief network, and restricted Boltzmann machine) and their practical applications. In [30], a multi-modal deep auto-encoder, which employs non-linear mappings with a multi-layered deep neural network, was proposed to deal with the problem of human pose recovery. Ghazi et al. [31] use deep convolutional neural networks for the plant identification task and optimize the parameters of these networks (i.e., GoogLeNet, AlexNet, and VGGNet) via transfer learning. A symmetric quaternionic Hopfield neural network model has also been shown, via computer simulations, to improve noise tolerance [32].

Regularization strategies: A successful neural network usually involves a very deep and wide architecture and consists of a very large number of parameters, so training it requires regularization strategies to alleviate over-fitting. The simplest method, data augmentation [33, 12, 26], manually enlarges the number of training images with label-preserving transformations, such as horizontal/vertical reflections, translation, cropping, and padding. The weight decay technique [14] aims to prevent over-fitting by adding a penalty term to the maximum likelihood objective. Dropout, proposed by Hinton et al. [13, 34], stochastically sets a fraction of the activations (e.g., 30%, 50%, or 70%) within a layer to zero for each training sample during training, while the drop-connect method [15] regularizes large fully-connected layers. The stochastic pooling method [16] replaces the conventional deterministic pooling operation with a stochastic procedure that randomly picks activations within each pooling region according to a multinomial distribution over the activations. The deeply-supervised nets (DSN) framework, proposed by Lee et al. [25], optimizes the parameters of neural networks by adding supervision at intermediate layers of the deep network. In [33], an ensemble-style regularization is proposed to improve the discriminative capability of DNNs by combining several DNN columns into a multi-column DNN. The greedy layer-wise pre-training method [35] helps the optimization of deep belief networks by initializing the weights in a region near a good local minimum. However, none of these regularization techniques considers how to self-regularize a neural network by utilizing the knowledge it has generated itself.

For fast and accurate detection of large numbers of privacy-sensitive object classes, a deep multi-task learning algorithm [36] has been developed to jointly learn more representative deep convolutional neural networks and a more discriminative tree classifier. To retrieve images effectively, Yu et al. [37] propose a deep multi-modal distance metric learning method that combines multi-modal visual features. Moreover, a recurrent self-organizing neural network [38] with an adaptive growing and pruning algorithm has been proposed to efficiently enhance the generalization capability of RNNs. Graph-based regularization bridges topic modeling and social network analysis, and can be implemented by designing a matrix factorization objective with social regularization [39, 40]. A latent feature learning algorithm based on a popular deep architecture has been proposed and applied to the social media tasks of user recommendation and social image annotation [41]. To effectively exploit high-order information among data samples, several factorization-based regularizers [42, 43, 44] (including matrix factorization, low-rank matrix factorization, and tensor factorization) have been employed by constructing a unified objective function. The manifold-learning-based regularization framework [45, 46] preserves the underlying low-dimensional manifold of the input samples as part of the optimization procedure for estimating network/model parameters.

Most recently, several regularization methods based on knowledge distillation have been proposed for various network training settings [19, 21, 22, 23]. By exploiting important knowledge, Sang et al. [47] proposed to organize photos both geographically and semantically, and investigated the problem of location visualization from multiple semantic themes. To address the problem of training a neural network on a huge dataset, Heryanto et al. [21] transfer knowledge from previous networks to a new network using different partitions of the dataset. In [22], a student-teacher framework was proposed to train thin deep nets by employing knowledge distillation. Similarly, a knowledge transfer learning approach was introduced to train recurrent neural networks using a DNN model as the teacher [23]. The main difference between our proposed SRNN and the above regularization strategies is that the SRNN aims to alleviate the over-fitting problem using the learned knowledge of a single neural network, and meanwhile boosts that network's own discriminative capability under this self-regularization technique.

3. Self-Regularized Neural Network (SRNN)


In this section, we give the formulation of the proposed self-regularized neural network (SRNN). We focus on pre-training a neural network and mining its own learned knowledge, with which the self-regularized neural network is gradually optimized to boost its classification performance on vision tasks. The SRNN learning process consists of three main blocks (i.e., pre-training the network, mining knowledge, and learning the SRNN) and finally outputs an optimized neural network. Its pipeline is illustrated in Fig. 3.

3.1. Network pre-training

In order to mine valuable knowledge from a neural network for the self-regularized learning, we first pre-train/fine-tune an initial neural network by optimizing one or more objective functions (e.g., cross-entropy loss, hinge loss, L2 loss, or ranking loss) under the guidance of the ground-truth information.

[Figure 3 depicts the three blocks of the pipeline: pre-training the network; mining knowledge (learned knowledge from the NN, sample-wise corrected knowledge, sample-wise soft targets); and learning the SRNN, which outputs the final neural network model.]

Figure 3: The pipeline of the SRNN learning process. It consists of three main blocks (i.e., pre-training the network, mining knowledge, and learning the SRNN), and finally outputs an optimized neural network. A single network at different stages is shown in three different colors. (Best viewed in color.)

In our implementation, the basic SRNN can be built on general neural network architectures, such as AlexNet [2], NIN [4], DSN [25], and SparseCNN [26]. For different network structures, we employ different pre-training strategies. For example, for the NIN structure, we learn a neural network by minimizing a cross-entropy loss function $L_G = \sum_{i=1}^{N} L(x_i, l_i; \Theta)$ with the ground-truth label information $L$ of the whole training dataset $X$. Here $\Theta$ denotes the parameters of the neural network, $X = \{x_1, x_2, \ldots, x_N\}$ and $L = \{l_1, l_2, \ldots, l_N\}$ are the sets of training samples and their corresponding ground-truth labels, and N and C are the total numbers of training samples and classes, respectively. Similarly, an initial DSN [25] is optimized under the guidance of multiple hinge loss functions. For AlexNet [2] and VGG [48], we fine-tune the networks on our dataset as the pre-training process.


3.2. Knowledge mining To train a self-regularized neural network, we gradually generate the sample-

wise soft targets from a neural network, as illustrated in Fig. 4. Each row of

10

ACCEPTED MANUSCRIPT

Fig. 4 denotes different kinds of learned knowledges for each sample. Each column of Fig. 4 refers to different kinds of information over training samples, e.g., 220

labeled training samples, learned knowledge from NN, sample-wise corrected

CR IP T

knowledge and sample-wise soft targets. We can firstly mine the learned knowledge from the softmax layer (i.e., the last fully-connected layer) of this neural network which refers to the class correlation over the training data. In order to

be consistent with the ground-truth label information, we can then correct the 225

learned knowledge of a neural network, resulting in the sample-wise corrected knowledge. The sample-wise soft targets is finally obtained for iteratively learn-

AN US

ing the parameters of neural network. The sample-wise soft targets can be

gradually updated by the self-regularized learning procedure and then improve the discriminative capability of a neural network in turn.

Learned knowledge from NN: For all labeled training samples (the first

230

column of Fig. 4), we can mine the learned knowledge from a neural network (the second column of Fig. 4), which is pre-trained and has a certain discrimina-

M

tive capability. The learned knowledge can be the feature maps of convolutional layers, feature vectors of fully-connected layer, predicted results and so on. And the computing time of learning knowledge from NN is related to pre-train a

ED

235

neural network. In this work, this part of learned knowledge refers to a similarity metric over the classes and can be extracted from the softmax layer of

PT

a neural network. The learned knowledge predicted by a neural network can be represented as: {P1 , P2 , ..., PN }, where N is the number of training samples. Pi = {pi,1 , pi,2 , ..., pi,C } denotes the probabilities of the training sample i over

CE

240

all classes, i ∈ 1, · · · , N , where C is the number of classes. Each class probaPC bility of an sample i is pi,c = exp(yi,c )/ j=1 exp(yi,j ), where yi,c and yi,j are

AC

the softmax pre-activations of a neural network, and pi,c denotes the softmax activation (i.e. the output of softmax layer).

245

Sample-wise corrected knowledge: The learned knowledge of a neural

network may be not a true reflection of training samples’ label information, due to some possible prediction errors of a neural network. For example, a cat

11

ACCEPTED MANUSCRIPT

sample should have the highest prediction probability over “cat” class, but a neural network may be predict the highest probability over “dog” class, as can 250

be seen in the second column of Fig. 4. In this situation, we should correct

CR IP T

the learned knowledge of a neural network to be correspond with the groundtruth label information of a training sample. We correct the class probability

from the learned knowledge equal to the ground-truth. More specifically, if the maximum of the learned knowledge accords with its ground-truth label 255

information, its sample-wise knowledge is not be corrected. That is to say, its sample-wise corrected knowledge is the same with its learned knowledge of NN.

AN US

For each training sample, we then obtain the sample-wise corrected knowledge (the third column of Fig. 4). For a batch size in the process of learning SRNN, the computational complexity is O(N) for correcting sample-wise knowledge, 260

where N is the number of training samples.

Sample-wise soft targets: The above sample-wise corrected knowledge has been rectified for bringing the learned knowledge into correspondence with

M

the ground-truth. However, the highest probability of a training sample is very near to 1.0, while most of the other probabilities are so close to zero, which is thus difficult to present the information of a correlations/similarities among

ED

265

classes. For example, an image of a cat may have a big probability of being incorrectly predicted as a dog (p6 = 10−4 ), and this mistake (i.e., predict a cat as

PT

a dog) is many times more probable than discriminating a cat for a truck (p10 = 10−9 ). This knowledge implies a kind of valuable information which contains an abundant similarity structure over the training data, but it have a very little

CE

270

constraint in the process of learning a neural network because most of class probabilities are so near to zero. In order to better represent this similarities

AC

among classes, we further enhanced the sample-wise corrected knowledge with

275

the distillation technique proposed by Hinton et al. [19]. Specifically, each class PC probability of an sample i is enhanced with qi,c = exp(yi,c /T )/ j=1 exp(yi,j /T ), where yi,c and yi,j are also the logits with the outputs of the softmax layer in a

neural network, and T is a constant parameter. If T equals to 1, the sample-wise

12

ACCEPTED MANUSCRIPT

corrected knowledge is not enhanced/transformed, while T is set to a higher value, the sample-wise corrected knowledge is enhanced for better indicating 280

the probability distribution over classes. As described in [19], the temperature

CR IP T

parameter T in the range of 2.5-4.0 worked significantly better than higher or lower temperatures. For simplicity, the parameter T for acquiring the samplewise soft targets is set to 3.0 in all our experiments. The sample-wise soft targets K can be denoted as: K = {K1 , K2 , ..., KN } 285

(1)

where Ki = {qi,1 , qi,2 , ..., qi,C } denotes the sample-wise soft targets of the train-

AN US

ing sample i, and N is the number of training samples. From the sample-wise corrected knowledge to the corresponding soft targets with a batch size, the

learning process demands for O(N*C), where N is the number of training samples and classes, respectively. 290

3.3. SRNN learning

M

In this subsection, based on the sample-wise soft targets, we introduce how to learn a self-regularized neural network (SRNN) for boosting its classifica-

ED

tion performance. Along with iteratively optimizing the parameters of a neural network in the network pre-training process, we can constantly mine the sample295

wise soft targets only depending on a pre-trained neural network.

PT

Given the sample-wise soft targets of training samples produced by itself, we learn the parameters of neural network with the objective function by incorporating the sample-wise soft targets and ground-truth label of training data.

CE

Based on the sample-wise learned knowledge Kt and the parameters of the neu-

300

ral network in the previous iteration, the loss function LK,t+1 with the iteration

AC

t + 1, can be formulated as follows, LK,t+1 = L(X, L; Θ, Kt ) =

N X i=1

=−

L(xi , li ; Θ, Kt,i )

N X C X i=1 c=1

13

qi,c log p(lc |xi ; Θ)

(2)

ACCEPTED MANUSCRIPT

CR IP T

Cat

Cat

Truck Labeled Training samples

Learned knowledge from NN

AN US

Dog

Sample-wise corrected knowledge

Sample-wise enhanced knowledge

M

Figure 4: The knowledge mining process of a neural network (NN). Each column refers to different kinds of information for training samples, e.g., labeled training samples, learned knowledge from NN, sample-wise corrected knowledge and sample-wise soft targets. For each

ED

histogram, the horizontal axis indicates the possible labels, and the red dot denotes the ground truth label information. And the vertical axis represents the probabilities distribution of ten possible classes.

PT

where X = {x1 , x2 , · · · , xN } and L = {l1 , l2 , · · · , lN } refer to the sets of training samples and their ground-truth labels, Kt = {Kt,1 , Kt,2 , ..., Kt,N } and

CE

Kt,i = {qi,1 , qi,2 , ..., qi,C } denote the sample-wise soft targets of the sample i

305

from the neural network Θ, N and C are the number of training samples and classes, respectively. Besides the loss function with the sample-wise soft tar-

AC

gets, we also employ a cross-entropy loss function LG with the ground-truth label information.

310

The sample-wise soft targets, which have been optimized in the previous

iteration of a neural network, are with a higher degree of confidence, while the ground truth label information has less capability for further boosting the

14

ACCEPTED MANUSCRIPT

discriminative capability of a neural network. For better fitting both the samplewise soft targets and its ground-truth label information, we try to not only completely drop the ground truth information, but also effectively employ the learned sample-wise soft targets in the SRNN learning process. Specially, we can

CR IP T

315

combine the above two functions LK,t+1 and LG,t+1 and minimize the weighted loss function for optimizing the parameters of a neural network, Θ∗ = arg min(αLK,t+1 + (1 − α)LG,t+1 ). Θ

(3)

where the parameter α denotes the weight of two objective functions.

320

AN US

Let X = {X1 , X2 , ..., XN } be a set of training samples, where N is the num-

ber of training sample. An initial neural network is firstly pre-trained by optimizing a cross-entropy loss function LG . We gradually mine the learned knowledge {P1 , P2 , ..., PN } from a neural network, correct it with the ground-truth label information and then get the sample-wise soft targets K = {K1 , K2 , ..., KN }. Finally, a neural network model is learned by minimizing the objective function (Eq. 3) by integrating the sample-wise soft targets of neural network and the

M

325

ground truth label of training samples. In the framework of learning a SRNN,

ED

the sample-wise soft targets can be seen as a good regularizer for boosting the classification performance of a neural network. In some existing neural network architectures, its final objective is the ground-truth label for a training sample, which may be too sure for learning a neural network. The sample-wise soft tar-

PT

330

gets would to some extent prevent the objective of a neural network from being

CE

too sure, and impose more constraints on the parameters of a neural network. When the discriminative capability of a neural network is gradually improved, we can mine the learned knowledge from NN, leading to better sample-wise soft targets and then update the parameters of a neural network ((i.e., Eq. 3)).

AC

335

4. Experiments In this section, we evaluate the effectiveness of our proposed SRNN by comparing it with several exiting state-of-the-art algorithms. We first introduce the

15

ACCEPTED MANUSCRIPT

experimental settings, and then report and analyze the experimental results, 340

after that a further discussion about the proposed SRNN is given.

CR IP T

4.1. Experimental Settings To compare our results with some previous methods, all experiments are conducted based on the following experimental setups. For the CIFAR-10 and

CIFAR-100 datasets, we process the data with the same global contrast normal345

ization and Zero Components Analysis whitening, as done in [5]. To compare

our results with the previous state-of-the-arts [25, 4, 5], we have done two differ-

AN US

ent experiments under two situations (e.g., with data augmentation and without data augmentation) and three different architectures (e.g., NIN [4], DSN [25] and SparseCNN [26]). Under the NIN and DSN architectures, we have done experi350

ments without data augmentation, and also augmented the dataset by padding zero 4 pixels on each side, then done corner cropping and random flipping on the fly during training. To fairly compare with the SparseCNN architecture [26], we

M

also adapt the big data augmentation strategy as in [26], which employs affine spatial and color-space training images and then pads the image with zero from 32 × 32 to 96 × 96 pixels (named SparseCNN96 ) and from 32 × 32 to 126 × 126

ED

355

pixels (named SparseCNN126 ) 1 . For the Caltech101 and MIT-Indoor datasets, we resize all images into 256×256 pixels, randomly crop image from 256×256

PT

to 227×227 and 224×224 for fine-tuning the AlexNet [2] and Visual Geometry Group (VGG) model [48] to initialize the neural network, respectively. For all 360

our experiments, we use the stochastic gradient descent solver for learning the

CE

parameters of a neural network, and conduct experiments on a single NVIDIA Tesla K40c. The neural network is trained and tested on the Caffe [3], except

AC

the method with the SparseCNN architecture [26]. We train the neural network in the SRNN with the weight parameter α of 0.9 in the Eq. 3, the momentum

365

of 0.9, the weight decay of 0.001. In the SRNN learning process, the learning rate is initialized at 0.001 and divided by 10 after 30 epochs and we train the 1 https://github.com/btgraham/SparseConvNet

16

ACCEPTED MANUSCRIPT

networks for roughly 120 epochs. After a neural network has been pre-trained, we can update the sample-wise soft targets with optimizing the objective of the

370

CR IP T

SRNN. 4.2. Performance Comparison

Table 1 shows the classification errors of our SRNN on CIFAR-10 dataset and comparison with several exiting deep neural network methods, including stochastic pooling [16], Maxout Networks [5], Network In Network (NIN) [4],

probabilistic maxout units [49], DSN [25], and Spatially-sparse convolutional neural networks (SparseCNN) [26].

To allow direct comparison with other

AN US

375

popular works, we implement our proposed SRNN under three DNN architectures (i.e., NIN, SparseCNN96 and SparseCNN126 ), namely “NIN+SRNN”, “SparseCNN96 +SRNN” and “SparseCNN126 +SRNN”. Without data augmentation, the “NIN+SRNN” achieves a better classification error rate of 9.35%, 380

which is 0.43% and 1.06% less than DSN [25] and NIN [4] on CIFAR-10 dataset,

M

while our “DSN+SRNN” can substantially outperform the baselines by 1.06%, over “DSN” and 1.69% over “NIN respectively. With data augmentation, the

ED

“NIN+SRNN’ can be increased by 0.73%, compared to the baseline NIN [4], and our “DSN+SRNN” significantly exceed four baselines: 1.47% over Maxout 385

Networks [5], 1.41% over Drop-Connect [15], 0.90% over NIN [4] and 0.31%

PT

over DSN [25] in the classification performance. To compare with the SparseCNN [26], we achieve a classification error rate of 4.93% with padding the image from 32×32 to 96×96, 0.79% and 1.35% (reported in [26]) lower than the

CE

baseline “SparseCNN96 ”. Moreover, we pad the image 32×32 to 126×126 and

390

get a classification error rate of 4.12%, which is 0.55% lower than the baseline

AC

“SparseCNN126 ”. To further evaluate the effectiveness of our SRNN on CIFAR-100 dateset, we

compare it with some exiting deep neural network methods, such as stochastic pooling [16], maxout networks [5], maxout units [49], NIN [4] and DSN [25]. Ta-

395

ble 2 shows the performances of our proposed SRNN and other state-of-the-art methods. Without data augmentation, the baseline methods achieve the classifi17

ACCEPTED MANUSCRIPT

cation error of 42.51% for stochastic pooling [16], 38.57% for maxout networks [5] and 38.14% for maxout units [49], all of which are much higher than the classification error of our “NIN+SRNN” and “DSN+SRNN”. Our “DSN+SRNN” achieves the best classification test error 32.83%, which significantly improve the

CR IP T

400

classification performance over the NIN [4] by 2.85% errors, and over DSN [25]

(the best results reported) by 1.74% errors. With data augmentation, our “NIN+SRNN” can outperform two baselines: 2.43% over “NIN” and 1.26% over

“DSN”, while our “DSN+SRNN” with 30.43% test error can further improve 405

the classification performance. Our SparseCNN96 +SRNN method gets the clas-

AN US

sification error of 19.55% with padding images from 32 × 32 to 96 × 96, which

is 0.75% and 4.75% (reported in [26]) lower than the SparseCNN96 +SRNN. With the SparseCNN126 architecture, our SRNN approach proves effective and outperforms [26] by 0.65% in classification error on CIFAR-100 dataset. This 410

well verifies the superiority of our proposed SRNN, which is effective for achieving better discriminative capability of the DNN by alleviating the over-fitting

M

problem.

Table 3 and Table 4 show the classification accuracies on the Caltech101

415

ED

and MIT-Indoor datasets, respectively. It can be seen that the DNN methods significantly outperform the other traditional approaches, such as Low-rank sparse coding [50], Sparse embedding [51], Kernel sparse representation [52],

PT

Lie group manifold analysis methods [53], Learning Discriminative Part Detectors [54], etc. Based two popular DNN architecture, i.e., AlexNet [2] and

CE

VGGdeep16 [48], we test whether our proposed SRNN can boost the classifica420

tion performance of a neural network by alleviating the over-fitting problem to some extent. The classification accuracies of our “AlexNet+SRNN” and

AC

“VGGdeep16 +SRNN” are both better than some existing DNN methods [55, 56,

57, 58, 59] and two CNN baselines, i.e., “AlexNet” and “VGGdeep16 ”. Specifically, the “CNNAlex +SRNN” method on Caltech101 dataset improves the per-

425

formance by 1.29% higher than the results of the exiting “AlexNet” with 15 training images per category, and 0.84% higher with 30 training images per category. With the VGG architecture, our methods (“VGGdeep16 +SRNN”) on 18

ACCEPTED MANUSCRIPT

Table 1: CIFAR-10 classification errors of various methods. Method

Test Error (%)

Stochastic Pooling [16]

15.13

Maxout Networks [5]

11.68

Maxout Units [49]

11.35

NIN [4]

10.41

DSN [25]

9.78

NIN+SRNN

9.35

DSN+SRNN

8.72

With data augmentation 9.38

Drop-Connect [15]

9.32

NIN[4]

8.81

DSN [25]

AN US

Maxout Networks [5]

CR IP T

Without data augmentation

8.22

NIN+SRNN

8.08

7.91

DSN+SRNN

With big data augmentation SparseCNN96 (reported in [26])

6.28

SparseCNN96

5.72

SparseCNN126

4.67 4.93

M

SparseCNN96 +SRNN

SparseCNN126 +SRNN

4.12

ED

Caltech101 dataset can also get the 1.57% and 1.36% classification improvement by comparing with the “VGGdeep16 ” method for 15 and 30 training images, 430

respectively. On MIT-Indoor dataset, the SRNN method also achieves an im-

PT

provement, e.g. 63.58% vs 65.07% with AlexNet architecture, and 70.14% vs 71.89% with VGGdeep16 architecture. Moreover, Our “VGGdeep16 +SRNN” on

CE

Caltech101 dataset can substantially outperform other CNN based methods by 3.86% and 3.45% over Visualize CNN [55] and DeCAF-fc6 [56] with 30 training

435

samples per class. On MIT-Indoor dataset, we also achieve a better perfor-

AC

mance than some existing methods, e.g. 71.89% vs 68.24% [57] and 71.89%

vs 69.00% [59] with the “VGGdeep16 architecture. With the same architecture and experimental settings, our proposed SRNN can improve the classification performance on the Caltech101 and MIT-Indoor datasets. The proposed SRNN

440

method is comparable to some exiting methods, and thus we can say that the

19

ACCEPTED MANUSCRIPT

Table 2: CIFAR-100 classification errors of various methods. Method

Test Error (%)

Stochastic Pooling [16]

42.51

Maxout Networks [5]

38.57

Maxout Units [49]

38.14

NIN [4]

35.68

DSN [25]

34.57

NIN+SRNN

33.94

DSN+SRNN

32.83

With data augmentation 33.53

DSN [25]

32.36

NIN+SRNN

31.10

AN US

NIN [4]

CR IP T

Without data augmentation

30.43

DSN+SRNN

With big data augmentation SparseCNN96 (reported in [26])

24.30

Sparse-CNN96

20.30

SparseCNN126 SparseCNN96 +SRNN

17.32 19.55

16.67

M

SparseCNN126 +SRNN

SRNN effectively work for alleviating the over-fitting problem.

ED

To evaluate the performance of our SRNN on the compelling and significantly harder vision task, we show the classification errors of various methods [60, 61, 2, 48] on ImageNet dataset [11], where our SRNN method is tested under two popular CNN structures (i.e, AlexNet [2] and VGG [48]), called

PT

445

“AlexNet+SRNN” and “VGGdeep16 +SRNN”. For the experimental comparison, we also evaluate the performance of a single network from two aspects, the

CE

top-1 and top-5 test errors. As can be presented in Table 5, the results indicate that our “AlexNet+SRNN” can outperform three baselines (such as sparse coding [60], SIFT+Fisher Vectors [61], AlexNet [2]), while the “VGGdeep16 +SRNN”

AC

450

is also better than its baseline VGGdeep16 network: 28.23% vs 28.97% for the top-1 test error, 9.6% vs 10.15% for the top-5 test error. The main reason for these improvements on this challenging task may be that the SRNN can well boost its discriminative capability by considering the self-regularization, even

455

with the deeper network architecture. 20

ACCEPTED MANUSCRIPT

Table 3: Caltech101 classification performance of various methods. Accuracy (%)

Method

15 tr.

30 tr.

McCann et al. [62]

66.1

71.9

Zhang et al. [50]

-

75.02

Sun et al. [54]

-

78.8

Nguyen et al. [51]

69.5

77.3

Goh et al. [52]

71.1

78.9

Duchenne et al. [63]

75.3

80.3

Feng et al. [64]

70.3

82.6

LG [53]

75.83

81.41

LG+DALG [53]

77.42

83.69

AN US

CR IP T

Traditional methods

CNN based methods Visualize CNN [55] DeCAF-fc6 [56] AlexNet VGGdeep16 AlexNet+SRNN

86.50

-

86.91

81.42

87.68

82.47

89.00

82.71

88.52

84.04

90.36

M

VGGdeep16 +SRNN

-

Table 4: MIT-Indoor classification performance of various methods.

ED

Method

Accuracy (%)

Traditional methods 26.00

LiJia et al. [66]

37.6

Kwitt et al. [67]

44.0

Zheng et al. [68]

47.2

Sun et al. [54]

51.4

Li et al. [69]

52.3

LG [53]

53.46

DALG [53]

55.58

AC

CE

PT

Quattoni et al. [65]

CNN based methods FCR2 (placeNet) [57]

68.24

Order-less pooling [58]

68.90

CNNaug-SVM [59]

69.00

FCR2 HybridNet [57]

70.80

AlexNet

63.58

VGGdeep16

70.14

AlexNet+SRNN

65.07

VGGdeep16 +SRNN

71.89

21

ACCEPTED MANUSCRIPT

Table 5: Classification errors of various methods on ImageNet dataset.

Top-1

Top-5

Sparse coding [60]

47.1

28.2

SIFT+FVs [61]

45.7

25.7

AlexNet [2]

37.5

17.0

VGGdeep16 [48]

28.97

10.15

AlexNet+SRNN

36.28

16.07

VGGdeep16 +SRNN

28.23

9.60

4.3. Algorithm Analysis

CR IP T

Test Error (%)

Method

AN US

Fig. 5 presents some test images from the CIFAR-10 dataset [70], which are correctly classified with “SparseCNN + SRNN”, but misclassified by “SparseCNN”. There are two label names marked below each image, one of which 460

is its wrong class by “SparseCNN”, the other is its right label (i.e. same with the ground-truth label) by our “SparseCNN + SRNN”. As can be seen in Fig. 6, we further show the confusion matrix of our “NIN+SRNN” method on the

M

CIFAR-10 dataset with data augmentation. The cat and dog classes, which are with the similar color, shape and background, often occur the confusion and have a lower classification performance than other classes. When the above

ED

465

experimental results only show the classification performance on the test set, we now present how our SRNN method can alleviate the overfitting problem

PT

by analyzing some results on the training set. For example, the classification errors of the “NIN” and our “NIN+SRNN” methods is 35.68% and 33.94% on CIFAR-100 without data augmentation, while the classification errors on the

CE

470

training set is 0.42% and 2.61%, respectively. Our proposed SRNN employs the self-regularized strategy for learning a neural network, while a deep neu-

AC

ral network learns the parameters with only one or more loss functions, without considering its own learned knowledge. The effectiveness of our proposed SRNN

475

again speaks well that our method can successfully recognize the confusing and difficult objects.

22

ACCEPTED MANUSCRIPT

For further analyzing the SRNN learning process, we evaluate the effectiveness of activated function in the step of obtaining sample-wise soft targets, and test whether the step of sample-wise corrected knowledge is necessary or not. As can be seen in Fig. 8, we present the performance comparison on the CIFAR-100

CR IP T

480

dataset under four different situations, named “NIN”, “NIN+SRNN without correcting (learned) knowledge”, “NIN+SRNN without activated knowledge”

and “NIN+SRNN”. Our SRNN method without correcting the learned knowledge obtain a slightly better results than the baseline “NIN”, while our SRNN 485

method can further improve the classification performance on the CIFAR-100

AN US

dataset. Moreover, the performance of SRNN method is higher than the SRNN without activated knowledge under two different situations (i.e., with data augmentation and without data augmentation on CIFAR-100), which well verifies

the activation function can play a role in the process of optimizing SRNN pa490

rameters.

As can be illustrated in Fig. 8, we show the relationship between the classifi-

M

cation error and the number of iterative epochs in the process of SRNN learning on the CIFAR-100 dataset without data augmentation, namely how the classifi-

495

ED

cation performance is improved with our proposed SRNN method. The SRNN can gradually mine the sample-wise soft targets from a neural network, and iteratively optimize the parameters of this neural network with constantly updating

PT

the objective function. Therefore, the discriminative capability of a neural network can be further improved under our proposed SRNN. Finally, we conduct an

CE

experiment over the CIFAR-100 dataset to show how the discriminative power 500

related can be affected under various parameter α of the objective function, as shown in Fig. 9. The classification errors can be slightly reduced as long

AC

as the value of the parameter α falls in the range of approximatedly from 0.3

to 0.7 while the objective function with a higher value of the parameter α 0.9 can significantly improve the classification performance on CIFAR-100 dataset

505

without data augmentation. For simplicity, the parameter α of the objective

function (Eq. 3) is set to 0.9 in all our experiments.

23

ACCEPTED MANUSCRIPT

Bi r d

Aut omobi l e

Fr og

Shi p

Cat

Deer

Bi r d

Cat

Deer

Dog

Deer

Dog

Spar seCNN:

Cat

Cat

Deer

Cat

Dog

Cat

Spar seCNN+SRNN:

Dog

Fr og

Ai r pl ane

Dog

Fr og

Dog

Aut omobi l e Tr uck

Ai r pl ane Tr uck

CR IP T

Spar seCNN: Spar seCNN+SRNN:

Ai r pl ane House

Aut omobi l e Shi p

Figure 5: Exemplar test images from CIFAR-10 [70]. These images are wrongly classified by

SparseCNN, but correctly classified with the same architecture SparseCNN in our proposed SRNN framework (“SparseCNN + SRNN”). There are two label names marked below each image, one of which is its wrong class by SparseCNN, the other is its right class by “SparseCNN

AN US

+ SRNN”’

5. Conclusion

In this work, we proposed a novel Self-Regularized Neural Network (SRNN) for the image classification task, which improves the discriminative capability of a deep neural network by alleviating the over-fitting problem. We gradually mine

M

510

the sample-wise soft targets from a single neural network, which can be any one neural network architecture. The sample-wise soft targets mainly reveal

ED

the correlation/similarity among classes, predicted by its own neural network. With this sample-wise soft targets produced by its own neural network and the ground-truth label information, we can update the objective function and itera-

PT

515

tively optimize the parameters of this neural network. Therefore, our proposed SRNN, which imposes much more constraints on the weights of a neural network

CE

and prevents the objective being too sure, can be seen as a good regularization strategy for learning a neural network. Extensive experimental results clearly demonstrated the effectiveness of our proposed SRNN.

AC

520

6. Acknowledgments This work is supported by the National Natural Science Foundation of China

(Grant No. 61602244 and 61502235) and partially sponsored by CCF-Tencent Open Research Fund. 24

ACCEPTED MANUSCRIPT

70 2

0.8

3

69

0.7

4

0.6

5

68

0.5

6 0.4 7 0.3 8

67

CR IP T

0.9

Classification accuracy (%)

1

66 65 64 63

0.2

62

0.1

61

0

60

NIN NIN+SRNN w/o activated knowledge

9 10

Figure

2

3

4

6:

5

6

7

Confusion

“NIN+SRNN”

method

8

9

10

matrix on

the

of

without data augmentation

AN US

1

NIN+SRNN w/o correcting knowledge NIN+SRNN

our

Figure 7:

CIFAR-

the

with data augmentation

Performance comparison on

CIFAR-100

dataset

different

horizontal and vertical axises indicate the

“NIN+SRNN

image labels, e.g. “airplane”, “automobile”,

edge”,

“bird”, “cat”, “deer”, “dog”, “frog”, “horse”,

knowledge” and “NIN+SRNN”.

without

“NIN+SRNN

named

activated

without

three “NIN”, knowl-

correcting

M

“ship”, “truck” from 1 to 10 orderly.

situations,

under

10 dataset with data augmentation. And the

35.2 35

34.6 34.4

CE

34.2

35.6 35.4

PT

Classification error (%)

35.4

classification error (%)

35.6

34.8

35.8

ED

35.8

20

30

34.6 34.4

34 40

50

60 70 epoch

80

90

100

110

33.8

120

AC

10

35 34.8

34.2

34

33.8

35.2

0

0.1

0.2

0.3

0.4 0.5 0.6 the parameter α

0.7

0.8

0.9

Figure 8: The relationship between the clas-

Figure 9: The relationship between the pa-

sification error and the number of iterative

rameter α and classification rates on the

epochs in the process of SRNN learning on

CIFAR-100 dataset without data augmenta-

the CIFAR-100 dataset without data aug-

tion.

mentation.

25

1

ACCEPTED MANUSCRIPT

525

References [1] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning ap-

CR IP T

plied to document recognition, Vol. 86, 1998, pp. 2278–2324. 2 [2] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with

deep convolutional neural networks, in: Advances in neural information processing systems, 2012, pp. 1097–1105. 2, 4, 5, 6, 7, 10, 16, 18, 20, 22

530

[3] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick,

S. Guadarrama, T. Darrell, Caffe: Convolutional architecture for fast fea-

AN US

ture embedding, arXiv preprint arXiv:1408.5093. 2, 16

[4] M. Lin, Q. Cheng, S. Yan, Network in network, International Conference on Learning Representations. 2, 4, 6, 7, 10, 16, 17, 18, 19, 20

535

[5] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio,

M

Maxout networks, arXiv preprint arXiv:1302.4389.
[6] X. Zeng, W. Ouyang, M. Wang, X. Wang, Deep learning of scene-specific classifier for pedestrian detection, in: European Conference on Computer Vision, 2014, pp. 472–487.
[7] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[8] C. Farabet, C. Couprie, L. Najman, Y. LeCun, Learning hierarchical features for scene labeling, IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8) (2013) 1915–1929.
[9] Z. Xu, Y. Yang, A. G. Hauptmann, A discriminative CNN video representation for event detection.
[10] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, arXiv preprint arXiv:1409.4842.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[12] P. Y. Simard, D. Steinkraus, J. C. Platt, Best practices for convolutional neural networks applied to visual document analysis, in: International Conference on Document Analysis and Recognition, Vol. 2, 2003, pp. 958–958.
[13] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. R. Salakhutdinov, Improving neural networks by preventing co-adaptation of feature detectors, arXiv preprint arXiv:1207.0580.
[14] C. M. Bishop, Neural networks for pattern recognition, Clarendon Press, Oxford, New York, NY, USA, 1995.
[15] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, R. Fergus, Regularization of neural networks using dropconnect, in: International Conference on Machine Learning, 2013, pp. 1058–1066.
[16] M. D. Zeiler, R. Fergus, Stochastic pooling for regularization of deep convolutional neural networks, arXiv preprint arXiv:1301.3557.
[17] M. Cimpoi, S. Maji, A. Vedaldi, Deep convolutional filter banks for texture recognition and segmentation, CoRR abs/1411.6836.
[18] B.-B. Gao, X.-S. Wei, J. Wu, W. Lin, Deep spatial pyramid: The devil is once again in the details, CoRR.
[19] G. E. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, CoRR abs/1503.02531.
[20] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition.
[21] D. Heryanto, T.-S. Chua, Incremental training of neural network with knowledge distillation, B.Comp. Dissertation, Department of Computer Science, National University of Singapore.
[22] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, Y. Bengio, FitNets: Hints for thin deep nets, in: Proc. of the International Conference on Learning Representations, 2015.
[23] D. Wang, C. Liu, Z. Tang, Z. Zhang, M. Zhao, Recurrent neural network training with dark knowledge transfer, CoRR abs/1505.04630.
[24] A. K. Jain, J. Mao, K. Mohiuddin, Artificial neural networks: A tutorial, IEEE Computer 29 (1996) 31–44.
[25] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, Z. Tu, Deeply-supervised nets, in: Advances in Neural Information Processing Systems Workshop on Deep Learning and Representation Learning, 2014.
[26] B. Graham, Spatially-sparse convolutional neural networks, CoRR abs/1409.6070.
[27] S. Yu, S. Jia, C. Xu, Convolutional neural networks for hyperspectral image classification, Neurocomputing 219 (2017) 88–98.
[28] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, Y. Pan, A multimedia retrieval framework based on semi-supervised ranking and relevance feedback, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (4) (2012) 723–742.
[29] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, F. E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
[30] C. Hong, J. Yu, J. Wan, D. Tao, M. Wang, Multimodal deep autoencoder for human pose recovery, IEEE Transactions on Image Processing 24 (12) (2015) 5659–5670.

[31] M. M. Ghazi, B. Yanikoglu, E. Aptoula, Plant identification using deep neural networks via optimization of transfer learning parameters, Neurocomputing 235 (2017) 228–235.
[32] Symmetric quaternionic Hopfield neural networks, Neurocomputing 240 (2017) 110–114.
[33] J. Schmidhuber, Multi-column deep neural networks for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3642–3649.
[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (1) (2014) 1929–1958.
[35] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, Greedy layer-wise training of deep networks, in: B. Schölkopf, J. C. Platt, T. Hoffman (Eds.), Advances in Neural Information Processing Systems 19, MIT Press, 2007, pp. 153–160.
[36] J. Yu, B. Zhang, Z. Kuang, D. Lin, J. Fan, iPrivacy: Image privacy protection by identifying sensitive objects via deep multi-task learning, IEEE Transactions on Information Forensics and Security 12 (5) (2017) 1005–1016.
[37] J. Yu, X. Yang, F. Gao, D. Tao, Deep multimodal distance metric learning using click constraints for image ranking, IEEE Transactions on Cybernetics PP (99) (2017) 1–11.
[38] An adaptive growing and pruning algorithm for designing recurrent neural network, Neurocomputing 242 (2017) 51–62.
[39] Q. Mei, D. Cai, D. Zhang, C. Zhai, Topic modeling with network regularization, in: Proceedings of the 17th International Conference on World Wide Web, 2008, pp. 101–110.

[40] H. Ma, D. Zhou, C. Liu, M. R. Lyu, I. King, Recommender systems with social regularization, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 287–296.
[41] Z. Yuan, J. Sang, Y. Liu, C. Xu, Latent feature learning in social media network, in: Proceedings of the 21st ACM International Conference on Multimedia, MM '13, ACM, New York, NY, USA, 2013, pp. 253–262.
[42] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (8).
[43] T. Jin, J. Yu, J. You, K. Zeng, C. Li, Z. Yu, Low-rank matrix factorization with multiple hypergraph regularizer, Pattern Recognition 48 (3) (2015) 1011–1022.
[44] J. Sang, C. Xu, J. Liu, User-aware image tag refinement via ternary semantic analysis, IEEE Transactions on Multimedia 14 (3) (2012) 883–895.
[45] C. Hong, J. Yu, J. You, X. Chen, D. Tao, Multi-view ensemble manifold regularization for 3D object recognition, Information Sciences 320 (2015) 395–405.
[46] V. S. Tomar, R. C. Rose, Manifold regularized deep neural networks, in: Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[47] J. Sang, Q. Fang, C. Xu, Exploiting social-mobile information for location visualization, ACM Transactions on Intelligent Systems and Technology (TIST) 8 (3) (2017) 39.
[48] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations.

[49] J. T. Springenberg, M. Riedmiller, Improving deep neural networks with probabilistic maxout units, arXiv preprint arXiv:1312.6116.
[50] T. Zhang, B. Ghanem, S. Liu, C. Xu, N. Ahuja, Low-rank sparse coding for image classification, in: International Conference on Computer Vision, 2013, pp. 281–288.
[51] H. Nguyen, V. Patel, N. Nasrabadi, R. Chellappa, Sparse embedding: A framework for sparsity promoting dimensionality reduction, in: European Conference on Computer Vision, 2012, pp. 414–427.
[52] S. Gao, I. W.-H. Tsang, L.-T. Chia, Kernel sparse representation for image classification and face recognition, in: European Conference on Computer Vision, 2010, pp. 1–14.
[53] C. Xu, C. Lu, J. Gao, W. Zheng, T. Wang, S. Yan, Discriminative analysis for symmetric positive definite matrices on Lie groups, IEEE Transactions on Circuits and Systems for Video Technology.
[54] J. Sun, J. Ponce, Learning discriminative part detectors for image classification and cosegmentation, in: International Conference on Computer Vision, 2013, pp. 3400–3407.
[55] M. D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: European Conference on Computer Vision, 2014, pp. 818–833.
[56] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, DeCAF: A deep convolutional activation feature for generic visual recognition (2014).
[57] A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: An astounding baseline for recognition, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops on DeepVision: Deep Learning in Computer Vision, 2014.

[58] F. Perronnin, C. R. Dance, Fisher kernels on visual vocabularies for image categorization, in: CVPR, IEEE Computer Society, 2007.
[59] D. G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2004) 91–110.
[60] A. Berg, J. Deng, L. Fei-Fei, Large-scale visual recognition challenge, in: www.image-net.org/challenges, 2010.
[61] J. Sanchez, F. Perronnin, High-dimensional signature compression for large-scale image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1665–1672.
[62] S. McCann, D. G. Lowe, Local naive Bayes nearest neighbor for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 3650–3656.
[63] O. Duchenne, A. Joulin, J. Ponce, A graph-matching kernel for object categorization, in: International Conference on Computer Vision, 2011, pp. 1792–1799.
[64] J. Feng, B. Ni, Q. Tian, S. Yan, Geometric ℓp-norm feature pooling for image classification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 2697–2704.
[65] A. Quattoni, A. Torralba, Recognizing indoor scenes, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 413–420.
[66] Li-Jia Li, Hao Su, E. P. Xing, L. Fei-Fei, Object bank: A high-level image representation for scene classification & semantic feature sparsification, in: Advances in Neural Information Processing Systems, 2010, pp. 1378–1386.
[67] R. Kwitt, N. Vasconcelos, N. Rasiwasia, Scene recognition on the semantic manifold, in: European Conference on Computer Vision, 2012, pp. 359–372.

[68] Y. Zheng, Y.-G. Jiang, X. Xue, Learning hybrid part filters for scene recognition, in: European Conference on Computer Vision, 2012, pp. 172–185.
[69] Q. Li, J. Wu, Z. Tu, Harvesting mid-level visual concepts from large-scale internet images, in: IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 851–858.
[70] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images, Computer Science Department, University of Toronto, Tech. Rep.

Chunyan Xu received the Ph.D. degree from the School of Computer Science and Technology, Huazhong University of Science and Technology, in 2015. From 2013 to 2015, she was a visiting scholar in the Department of Electrical and Computer Engineering at the National University of Singapore. She is now a lecturer in the School of Computer Science and Engineering at Nanjing University of Science and Technology, Nanjing, 210094, China. Her research interests include computer vision, manifold learning and deep learning.


Jian Yang received the BS degree in mathematics from Xuzhou Normal University in 1995, the MS degree in applied mathematics from Changsha Railway University in 1998, and the PhD degree in pattern recognition and intelligence systems from the Nanjing University of Science and Technology (NUST) in 2002. In 2003, he was a postdoctoral researcher at the University of Zaragoza, and in the same year he was awarded the RyC Program Research Fellowship sponsored by the Spanish Ministry of Science and Technology. From 2004 to 2006, he was a postdoctoral fellow at the Biometrics Centre of Hong Kong Polytechnic University, and from 2006 to 2007 a postdoctoral fellow in the Department of Computer Science at the New Jersey Institute of Technology. He is now a professor in the School of Computer Science and Technology of NUST and the author of more than 80 scientific papers in pattern recognition and computer vision. His research interests include pattern recognition, computer vision and machine learning. He is currently an associate editor of Pattern Recognition Letters and IEEE Transactions on Neural Networks and Learning Systems.


Junbin Gao graduated from Huazhong University of Science and Technology (HUST), China, in 1982 with a BSc degree in Computational Mathematics and obtained his PhD from Dalian University of Technology, China, in 1991. He is a Professor in the Discipline of Business Analytics, University of Sydney Business School, University of Sydney, Australia. He was a senior lecturer and a lecturer in Computer Science at the University of New England, Australia, from 2001 to 2005. From 1982 to 2001, he was an associate lecturer, lecturer, associate professor and professor in the Department of Mathematics at HUST. From 2002 to 2015, he was a Professor in Computing Science in the School of Computing and Mathematics at Charles Sturt University, Australia. His main research interests include machine learning, data mining, Bayesian learning and inference, and image analysis.

Hanjiang Lai received his B.S. and Ph.D. degrees from Sun Yat-sen University in 2009 and 2014, respectively. He worked as a research fellow at the National University of Singapore from 2014 to 2015 and is now with Sun Yat-sen University. His research interests include machine learning algorithms, deep learning, and computer vision.


Shuicheng Yan is the Dean's Chair Associate Professor at the National University of Singapore and also the chief scientist of Qihoo/360. Dr. Yan's research areas include machine learning, computer vision and multimedia. He has authored or coauthored hundreds of technical papers, with more than 15,000 Google Scholar citations and an H-index of 52. He has been serving as an associate editor of IEEE TCSVT and ACM TIST. He received Best Paper Awards from ACM MM'13 (Best Paper and Best Student Paper), MMM'16 (Best Student Paper), ACM MM'12 (Best Demo), PCM'11, ACM MM'10, ICME'10 and ICIMCS'09, the winner prizes of the classification task in PASCAL VOC 2010-2012, the 2011 Singapore Young Scientist Award, and the 2012 NUS Young Researcher Award. He is a Fellow of the IEEE and the IAPR.
