
Active one-shot learning by a deep Q-network strategy

Li Chen, Honglan Huang, Yanghe Feng*, Guangquan Cheng, Jincai Huang, Zhong Liu

National University of Defense Technology, College of System Engineering, Changsha, China

*Corresponding author: Yanghe Feng

Abstract

One-shot learning has recently attracted growing attention as a way to produce models that can classify significant events from a few, or even no, labeled examples. In this paper, we introduce a deep Q-network strategy into one-shot learning (OL-DQN) to design a more intelligent learner that infers whether to label a sample automatically or to request the true label in an active-learning set-up. We then conduct experiments on the ALOI dataset, for the classification of objects recorded under various imaging circumstances, and on a handwriting-recognition dataset composed of both characters and digits, to evaluate the performance of the proposed model and analyze its applications. The results demonstrate that our model achieves a better trade-off between prediction accuracy and the number of label requests than a purely supervised task, the prior work AOL, and the conventional active learning algorithm QBC.

Keywords: One-shot learning; deep Q-network strategy; active-learning set-up.

1. Introduction

Machine learning focuses on how computers use empirical data to improve their performance. Sufficient, high-quality data is the foundation of effective learning. In traditional supervised learning, all the data used to train a model must be labeled. In practice, however, a great deal of data is unlabeled, such as medical image data, web data, and audio and video data, and it is well accepted that labeling training examples is a costly procedure [1], requiring comprehensive and intensive investigation of the instances; moreover, incorrectly labeled examples significantly deteriorate the model built from the data [2]. Active learning, one of the effective ways to address this problem, has therefore aroused widespread interest in various fields [3, 4, 5, 6, 7].

The key idea of active learning is to reduce the cost of annotation without affecting the performance of the learning algorithm by training an effective model from a few labeled samples. The instance selection scheme thus becomes crucial, and many algorithms have been proposed with different criteria for selecting the most valuable instances to query for labels. They can be roughly divided into three categories [8]: methods querying the most informative instances (e.g., uncertainty sampling [7] and expected-error-reduction-based sampling [9]); methods querying the most representative instances (e.g., clustering-based sampling [10] and density-based sampling [11]); and methods that consider informativeness and representativeness simultaneously [12]. However, these methods often rely on heuristics to handle the computational complexity of approximating the risk associated with selection.

More recently, deep meta-learning has received increasing attention for few-shot learning [13, 14, 15]. [16] proposed to learn the optimization algorithm for the learner neural network in the few-shot regime with an LSTM-based meta-learner model. [17] proposed a meta-learning method built on undersampling that incorporates evaluation-metric optimization into the data sampling process to learn which instances should be discarded for a given classifier and evaluation metric. A model-agnostic meta-learning framework was proposed by Finn et al. [18], which trains a deep model on an auxiliary dataset with the objective that the learned model can be effectively updated/fine-tuned on new classes with one or a few gradient steps. Note that, similar to the generalised zero-shot learning setting, the problem of adding new classes to a deep neural network while keeping the ability to recognise the old classes has recently been attempted [19]. However, lifelong learning that progressively adds new classes with few shots remains an unsolved problem [20].

Therefore, in this paper, aiming to design an artificially intelligent agent that replaces the heuristics mentioned above with learning and inference, and to move toward continuous one-shot learning, we combine the memory and inference capability of the Long Short-Term Memory network (LSTM) [21], which has been successfully used in many fields such as network traffic analysis [22, 23] and image classification [24], with the decision-making of reinforcement learning, to learn an active learner that decides whether to label an example automatically or request the true label for the one-shot learning task; we refer to this model as OL-DQN. We consider the online setting of active learning, where the agent is presented with examples in a sequence.

To provide a comprehensive evaluation of the proposed algorithm for active one-shot learning, we conducted extensive experiments, from various angles, on a commonly used object classification dataset: ALOI, the Amsterdam Library of Object Images. In addition, we extended its application to an active research topic, handwriting recognition of both digits and characters. The experimental results show that the proposed OL-DQN achieves higher prediction accuracy with fewer label requests than a similar model trained as a purely supervised task [25] and the prior work [26].

The rest of the paper is organized as follows. Section 2 gives a brief overview of active learning, one-shot learning and the deep Q-network. Section 3 details the proposed method. Section 4 presents the performance evaluation of our model through four groups of experiments on object classification, whereas Section 5 reports the results obtained with the proposed network and previous works on handwriting recognition of both digits and characters, confirming the adequacy of the application and highlighting the benefits of our model. Finally, the paper closes with a summary and an outlook on future work in Section 6.

2. Preliminaries

In this paper, combining one-shot learning with deep reinforcement learning, we present a novel strategy that learns an active learner based on a deep Q-network to infer how to label examples and when to instead request a label for the active one-shot learning task. In the following we give an outline of active learning, one-shot learning and the deep Q-network as a means of introducing notation.

2.1. Active learning

In many practical tasks we can easily obtain a large amount of data, but most of it is unlabeled. Compared with manually annotating all the data, labeling only a portion of it obviously yields significant annotation cost savings. In fact, different data samples contribute differently to the learning model, so it is possible to obtain an equally effective model from only a small amount of data if the most valuable samples are selected for labeling. The key question is how to choose the most effective samples for training the model and obtain their label information. Active learning is the machine learning framework that studies this problem; it lowers the cost of annotation without sacrificing model performance by developing criteria for selecting samples.

2.2. One-shot learning

One-shot learning [27, 28] is an object categorization problem in computer vision. Whereas most machine-learning-based object categorization algorithms require training on hundreds or thousands of images and very large datasets, one-shot learning aims to learn information about object categories from one, or only a few, training examples per category. This 'one-shot learning' ability has emerged as one of the most promising yet challenging areas of research [29]. Deep-learning approaches to one-shot learning have developed rapidly in recent years and can be divided into two broad categories [30]: metric-learning approaches and meta-learning approaches. The former tackle the problem by designing a specific metric loss or developing a particular training strategy, while the latter, addressed in our paper, solve the problem through a two-level learning regime: first quickly acquire knowledge of individual base tasks, then extract meta-information from them.

2.3. Deep Q-network

The deep Q-network (DQN) was proposed by Mnih et al. [31] in 2015, building on Q-learning. In the following we give a brief introduction to Q-learning and the DQN algorithm.

2.3.1. Q-learning

Q-learning is a model-free reinforcement learning method [32] used to find an optimal action-selection policy for any given finite Markov decision process (MDP). Specifically, an agent interacts with an environment through a sequence of observations, actions and rewards, and its goal is to select actions in a fashion that maximizes cumulative future reward. More formally, the so-called Q-function is defined as

Q^{\pi}(s,a) = \mathbb{E}_{\pi}\Big[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\Big|\, s_t = s,\, a_t = a,\, \pi \Big],   (1)

which is the maximum sum of rewards r_t discounted by γ at each time step t, achievable by a behaviour policy π after making an observation s and taking an action a. Q-learning [33] thus estimates the value of executing an action from a given state. Such value estimates are referred to as state-action values, or simply Q-values. Q-values are learned iteratively by updating the current Q-value estimate towards the observed reward r and the estimated utility of the resulting state s':

Q(s,a) = Q(s,a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s,a) \big),   (2)

where α is the learning rate.
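As a concrete illustration of the update in Eq. (2), the following minimal sketch (our own, not the paper's code; the state and action counts, learning rate and discount value are arbitrary assumptions) applies one tabular Q-learning step in Python:

```python
import numpy as np

# Minimal tabular Q-learning update, Eq. (2). Sizes and hyperparameters are
# illustrative assumptions, not values from the paper.
n_states, n_actions = 10, 4
alpha, gamma = 0.1, 0.9
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next):
    """Move Q(s, a) towards the observed reward plus the discounted bootstrap value."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] = Q[s, a] + alpha * (td_target - Q[s, a])

# Example transition: from state 0, action 2 yields reward 1.0 and lands in state 3.
q_update(0, 2, 1.0, 3)
```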

2.3.2. Deep Q-network

Although Q-learning performs well for small problems, its performance deteriorates increasingly on large-scale models. To overcome this issue and bridge the divide between high-dimensional sensory inputs and actions, resulting in the first artificial agent capable of learning to excel at a diverse array of challenging tasks, Mnih et al. [31] proposed the deep Q-network (DQN), in which deep reinforcement learning combines Q-learning with a class of artificial neural network known as deep neural networks. In contrast to plain Q-learning, the deep Q-network introduces three improvements: (1) a deep neural network is employed as a nonlinear function approximator of the action-value function; (2) experiences e_t = (s_t, a_t, r_t, s_{t+1}) are recorded in a replay memory D and then sampled uniformly at training time; and (3) a separate target network, updated only periodically, provides stale update targets to the main network and serves to partially decouple the feedback resulting from the network generating its own targets.

In our paper, DQN is employed to capture the optimal policy that decides whether to label a sample automatically or to request the true label in the active-learning set-up; we therefore call it a deep Q-network strategy.

3. Proposed method

In this section we first describe the meta-learning task addressed in this paper, and then present the proposed OL-DQN method in detail in terms of its mathematical principles and algorithmic framework.

3.1. Meta-learning task methodology

Aiming to learn an active agent that can make labeling decisions online for one-shot classification events, a proper task setup is crucial. Inspired by the works [15, 25], we train with short episodes and a few examples per class, varying the classes and randomizing the labels between episodes; the task setup is shown in Figure 1.

For each episode, a dataset Ω = {(x_t, y_t)}_{t=1}^{T} is given, where x_t ∈ R^m is the sample feature vector (or picture) at the t-th step of the episode and m is the dimension of every sample; the one-hot encoding y_t ∈ R^n is the corresponding class label, and n is the number of classes per episode; T is the number of steps in an episode, which we assume to be fixed in this paper. The final output (l_t, d_t) is a one-hot vector of length n + 1, where l_t ∈ R^n and d_t ∈ R are the learning result and the decision value at time t, respectively. Specifically, if the agent decides to request the label, l_t is the zero vector 0 and d_t = 1; otherwise d_t = 0 and l_t is the predicted result ŷ_t, meaning the agent has decided to label the sample itself. According to this task setup, the learning goal g_{t-1} of the active agent, assigned either the true label y_{t-1} when the label was requested (d_{t-1} = 1) or the zero vector 0 when a prediction was made (d_{t-1} = 0) at time t-1, is presented as input along with x_t at time t, in a temporally offset manner. In other words, the input sequence of the network, i.e., the observation data o, is [0, x_1], [g_1, x_2], ..., [g_{t-1}, x_t], ..., [g_{T-1}, x_T].
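To make the episode construction concrete, the sketch below is our own illustration (not the authors' code): the feature and label arrays, class counts and episode length are assumed placeholders, and it builds one episode of temporally offset observations [g_{t-1}, x_t].

```python
import numpy as np

def build_episode(features, labels, n_classes=3, T=30, rng=np.random.default_rng(0)):
    """Sample one episode: T samples drawn from n_classes randomly chosen classes,
    with a shuffled class-to-slot assignment, as described in Section 3.1.
    `features` is an (N, m) array and `labels` an (N,) array of class ids (assumed)."""
    chosen = rng.choice(np.unique(labels), size=n_classes, replace=False)
    pool = np.concatenate([np.where(labels == c)[0] for c in chosen])
    idx = rng.choice(pool, size=T, replace=False)                    # T samples, no replacement
    slots = {c: s for s, c in enumerate(rng.permutation(chosen))}    # random label slots
    x = features[idx]                                                # (T, m)
    y = np.zeros((T, n_classes))
    for t, i in enumerate(idx):
        y[t, slots[labels[i]]] = 1.0                                 # episode-specific one-hot label
    return x, y

def offset_observations(x, y, decisions):
    """Form o_t = [g_{t-1}, x_t]: g_{t-1} is y_{t-1} if the label was requested
    (d_{t-1} = 1) and the zero vector otherwise."""
    T, n = y.shape
    g_prev, obs = np.zeros(n), []
    for t in range(T):
        obs.append(np.concatenate([g_prev, x[t]]))
        g_prev = y[t] if decisions[t] == 1 else np.zeros(n)
    return np.stack(obs)
```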

Figure 1: Task setup. For classification task, the classes and their labels and the specific samples are shuffled and randomly presented at each episode. At each time step, the sample feature vector xt along with the learning goal gt−1 depending on the decision value of the previous instance dt−1 is presented as the input of the model. The output of the model is a one-hot vector of length n + 1 composed of dt and learning result lt , where n is the number of classes per episode. If the model requests the label of xt , dt = 1, gt is equal to the true label yt and the reward of this label request action is Rreq . Alternatively, if the model makes a prediction of xt , dt = 0, gt = ~0, and reward rt is assigned to Rcor if the prediction is correct or Rinc if not.

On each time step, one of three rewards is given: R_req, R_cor or R_inc, for requesting the label, correctly predicting the label, or incorrectly predicting the label, respectively. The network, a long short-term memory (LSTM) connected to a fully-connected layer (FCL), is trained to learn when and how to output the proper label, or to request the true label, for x_t at each time step so as to maximize the cumulative reward received during the episode.

The ideal strategy involves maintaining a set of class representations, with their corresponding labels, in memory. For a new sample x_t, the model compares the representation of x_t with the existing class representations, weighing the uncertainty of a match along with the cost of being incorrect, correct, or requesting a label, and then decides either to retrieve and output the stored label or to request a new one. If the model infers that x_t is the first presentation of a new class, it must store this class representation and request the label; at the same time, the response associated with the class representation must be stored so that the next sample of the same class can be predicted.

3.2. OL-DQN model

The key to completing the task described above with the proposed OL-DQN model is the connection between one-shot learning and deep reinforcement learning. We model a criterion for the meta-learning task as a neural network and the discovery of the ideal criterion as an RL problem, and DQN is used to solve this RL problem to achieve intelligent decisions. Here, the RL problem is concerned with learning optimal policies for an agent interacting with the meta-learning task environment, which can be formalized as the MDP defined by (i) a state space S = {s_i}_{i=1}^{N} ⊂ R^m, where s_i is the i-th sample x_i and N is the number of samples; (ii) an action space A = {a_i}_{i=1}^{n+1} ⊂ R^{n+1}, where a_i means the sample is predicted to be class i for i = 1, 2, ..., n, while a_{n+1} means asking for the label, and each action is a one-hot vector of length n + 1 (for a_i, the i-th element is 1 and the others are zero); (iii) a transition function P(s_t, a_t): S × A → S, which gives the probability that the environment changes from state s_t to s_{t+1} (x_t to x_{t+1}) after taking action a_t; in this paper the transition probability is 1/N because random sampling is used in our model; and (iv) the reward set R = {r_i}_{i=1}^{3} = {R_req, R_cor, R_inc} ⊂ R. In particular, at each time step t the agent interacting with the MDP receives an observation o_t = (g_t, s_{t+1}) and chooses an action a_t ∈ A, which determines the received reward r_t ∈ R and the next observation o' = o_{t+1}. The criterion is therefore the policy π(a_t | s_t), and the final goal can be described as finding an optimal policy π*(a_t | s_t) that maximises the cumulative reward.

The OL-DQN model consists of two networks with the same structure, an eval net and a target net, as depicted in Figure 2. The former is trained and its parameters θ are updated every episode, while the latter requires no training and its parameters θ⁻ simply copy θ every C steps. In an episode, given a set of observations o = {o_t}_{t=1}^{T} = {[g_{t-1}, s_t]}_{t=1}^{T} = {[0, x_1], [g_1, x_2], ..., [g_{t-1}, x_t], ..., [g_{T-1}, x_T]} as the model input, the eval net outputs the Q-values of each action at every time step, Q(o, a; θ), and the actions a = {a_t}_{t=1}^{T} are chosen according to the ε-greedy strategy:

a_t = \begin{cases} \text{an arbitrary } a_i \in A & \text{with probability } \varepsilon \\ \arg\max_{a_i} q(o_t, a_i; \theta) & \text{otherwise} \end{cases}   (3)
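The following minimal sketch (ours; the number of actions is assumed, ε = 0.23 is the initial value reported in Section 4.1) shows the ε-greedy choice of Eq. (3) given the Q-values of one time step:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=np.random.default_rng()):
    """Pick a random action index with probability epsilon, otherwise the greedy one.
    `q_values` is a 1-D array of length n + 1 (n class actions plus 'request label')."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Example: 3 classes + 1 "request label" action.
action = epsilon_greedy(np.array([0.1, -0.2, 0.4, 0.05]), epsilon=0.23)
```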

Figure 2: Schematic diagram of the forward propagation of the proposed OL-DQN. The network consists of an eval net and a target net. In an episode, (1) the state sequence o and its successor sequence o' are the inputs of the eval net and the target net, respectively; (2) based on the output of the eval net, the action a is chosen by the ε-greedy strategy and the corresponding reward r is received; (3) the TD error = Q_target(o', a') − Q_eval(o, a) is computed; (4) the parameters of the eval net are iteratively updated to adjust the action-values Q towards the target values.

Let the parameter set be θ = {W_i, W_h, b; W_ho, b_o}, where W_i and W_h, the parameters of a basic LSTM, are the weights mapping from the input and the hidden state, respectively, to the gates and the candidate cell state, and b is the bias vector; W_ho are the connection weights between the linear fully-connected layer and the LSTM output, and b_o is the FCL bias. For the observation value o_t, the LSTM mapping is

\hat{g}_f, \hat{g}_i, \hat{g}_o, \hat{c}_t = W_i o_t + W_h h_{t-1} + b   (4)

g_f = \sigma(\hat{g}_f)   (5)

g_i = \sigma(\hat{g}_i)   (6)

g_o = \sigma(\hat{g}_o)   (7)

c_t = g_f \odot c_{t-1} + g_i \odot \tanh(\hat{c}_t)   (8)

h_t = g_o \odot \tanh(c_t)   (9)

where \hat{g}_f, \hat{g}_i, \hat{g}_o are the forget, input and output gates respectively, \hat{c}_t and c_t are the candidate cell state and the new LSTM cell state, \odot denotes element-wise multiplication, and \sigma(\cdot) and \tanh(\cdot) are the sigmoid and hyperbolic tangent functions. The LSTM output h_t is then mapped by the FCL:

q(o_t) = W_{ho} h_t + b_o   (10)

q(o_t, a_t) = q(o_t) \cdot a_t   (11)

The Q-values over the chosen actions at all time steps of an episode for the eval net, and the rewards received, are given by

Q_{eval}(o, a) = Q(o, a; \theta) = \{q(o_t, a_t)\}_{t=1}^{T}   (12)

r_t = \begin{cases} R_{req} & \text{if } d_t = 1 \\ R_{cor} & \text{if } d_t = 0 \text{ and } \hat{y}_t = y_t \\ R_{inc} & \text{if } d_t = 0 \text{ and } \hat{y}_t \neq y_t \end{cases}   (13)
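As a rough PyTorch rendering of Eqs. (4)-(13) (our own sketch: the paper reports a 200-unit LSTM and the reward values of Section 4.1, while the class names and remaining dimensions below are illustrative assumptions):

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """LSTM followed by a fully-connected layer: maps observations [g_{t-1}, x_t]
    to Q-values over n class actions plus the 'request label' action."""
    def __init__(self, obs_dim, n_actions, hidden=200):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)  # Eqs. (4)-(9)
        self.fc = nn.Linear(hidden, n_actions)                   # Eq. (10)

    def forward(self, obs):            # obs: (batch, T, obs_dim)
        h, _ = self.lstm(obs)
        return self.fc(h)              # (batch, T, n_actions), i.e. q(o_t) per step

def reward(d, correct, r_req=-0.05, r_cor=1.0, r_inc=-1.0):
    """Eq. (13) with the reward values reported in Section 4.1."""
    if d == 1:
        return r_req
    return r_cor if correct else r_inc
```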

Similarly, given the input sequence o' = {o'_1, o'_2, ..., o'_T}, the output of the target net is Q(o', a'; θ⁻). The target value, i.e., the optimal action-value according to the Bellman equation, is then

Q_{target}(o', a') = r + \gamma \max_{a'} Q(o', a'; \theta^{-})   (14)

and the TD error is

\text{td\_error} = Q_{target}(o', a') - Q_{eval}(o, a)   (15)

The eval net update at the i-th episode uses the loss function

L_i(\theta_i) = \mathbb{E}_{(o,a,r,o')}\big[(\text{td\_error})^2\big] = \mathbb{E}_{(o,a,r,o')}\Big[\big(r + \gamma \max_{a'} Q(o', a'; \theta_i^{-}) - Q(o, a; \theta_i)\big)^2\Big]   (16)

in which the discount factor γ determines the agent's horizon, θ_i are the parameters of the eval net at iteration i, and θ_i⁻ are the target net parameters used to compute the target at iteration i; θ_i⁻ are updated from the Q-network parameters θ_i only every C steps and are kept fixed between individual updates.
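A compact PyTorch sketch of Eqs. (14)-(16) for one minibatch follows (ours; the tensor shapes are assumptions, while γ = 0.3 is the value reported in Section 4.1):

```python
import torch

# Assumed shapes: q_eval and q_next hold Q-values for a minibatch of episodes,
# (batch, T, n_actions); actions is (batch, T) of chosen action indices,
# rewards is (batch, T). q_next comes from the target net on the shifted inputs o'.
def dqn_loss(q_eval, q_next, actions, rewards, gamma=0.3):
    q_taken = q_eval.gather(2, actions.unsqueeze(-1)).squeeze(-1)   # Q(o, a; theta)
    with torch.no_grad():                                           # target-net side
        q_target = rewards + gamma * q_next.max(dim=2).values       # Eq. (14)
    td_error = q_target - q_taken                                   # Eq. (15)
    return (td_error ** 2).mean()                                   # Eq. (16)
```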

The pseudocode of OL-DQN is presented in Algorithm 1.

3.3. Computational complexity

Let L, G and M denote the maximum number of iterations, the capacity of the replay memory and the number of episodes, and let T and B denote the number of steps per episode and the minibatch size, respectively. According to Algorithm 1, the calculation of our model for each minibatch needs O(MT) operations. Each minibatch consists of B transitions randomly selected from the replay memory of capacity G, so the calculation of each minibatch requires O(GMT). Therefore, the total complexity of OL-DQN is O(LGMT), while that of Active One-shot Learning (AOL) [26] is O(LBMT), since AOL directly generates a minibatch without a replay memory. The complexity of OL-DQN is thus slightly higher than that of AOL because G ≥ B.

Algorithm 1 OL-DQN
1:  Initialize replay memory D to capacity G
2:  Initialize eval net with random weights θ
3:  Initialize target net with weights θ⁻ = θ
4:  for iteration = 1, L do
5:      Initialize the input sequence o_1 = [0, x_1]
6:      for episode = 1, M do
7:          for t = 1, T do
8:              Compute φ_t = q(o_t, a; θ) and obtain the one-hot net output (l_t, d_t)
9:              With probability ε select a random action a_t ∈ A; otherwise select a_t = argmax_a q(o_t, a; θ)
10:             Execute action a_t in the emulator, obtain the reward r_t according to Eq. (13) and the next input sequence o_{t+1} = [g_t, x_{t+1}], where g_t = 0 if d_t = 0 and g_t = y_t if d_t = 1
11:             Compute φ_{t+1} = q(o_{t+1}, a; θ⁻)
12:             Store the transition (φ_t, a_t, r_t, φ_{t+1}) in D
13:             Set o_t = o_{t+1}
14:             Sample a random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
15:             Set y_j = r_j if the episode terminates at step j + 1, and y_j = r_j + γ max_{a_{j+1}} q(o_{j+1}, a_{j+1}; θ⁻) otherwise
16:         end for
17:         Get Q(o', a'; θ_i⁻) = {q(o_{j+1}, a_{j+1}; θ_i⁻)}_{j=1}^{T-1} and Q(o, a; θ_i) = {q(o_j, a_j; θ_i)}_{j=1}^{T}, where i = iteration = 1, 2, ..., L
18:     end for
19:     Perform a gradient descent step on the loss function (16) with respect to the network parameters θ
20:     Every C steps reset θ⁻ = θ
21: end for
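To show how the pieces of Algorithm 1 (replay memory, ε-greedy rollout, target-network synchronisation and the loss of Eq. (16)) fit together, the condensed PyTorch skeleton below is our own sketch, not the authors' implementation: the synthetic random "dataset", the target-sync period C and the fixed ε are placeholders, while the episode shape, γ, reward values and capacities follow Section 4.1.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=200):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_actions)
    def forward(self, obs):                      # (batch, T, obs_dim)
        h, _ = self.lstm(obs)
        return self.fc(h)                        # (batch, T, n_actions)

n_cls, feat_dim, T = 3, 128, 30                  # episode shape as in Section 4.1
obs_dim, n_act = n_cls + feat_dim, n_cls + 1
eval_net, target_net = QNet(obs_dim, n_act), QNet(obs_dim, n_act)
target_net.load_state_dict(eval_net.state_dict())
opt = torch.optim.Adam(eval_net.parameters())
memory = deque(maxlen=60)                        # replay capacity G = 60
gamma, eps, batch_size = 0.3, 0.23, 50           # eps kept at its initial value for brevity
C = 100                                          # target-net sync period (value assumed)

for step in range(1000):                         # number of iterations is a placeholder
    # Roll out one episode on synthetic data (placeholder for real ALOI episodes).
    x = torch.rand(T, feat_dim)
    y = torch.eye(n_cls)[torch.randint(n_cls, (T,))]
    obs, g_prev, acts, rews = [], torch.zeros(n_cls), [], []
    for t in range(T):
        obs.append(torch.cat([g_prev, x[t]]))
        with torch.no_grad():
            q_t = eval_net(torch.stack(obs).unsqueeze(0))[0, -1]
        a_t = random.randrange(n_act) if random.random() < eps else int(q_t.argmax())
        if a_t == n_cls:                         # "request label" action
            rews.append(-0.05); g_prev = y[t]
        else:
            rews.append(1.0 if y[t, a_t] == 1 else -1.0); g_prev = torch.zeros(n_cls)
        acts.append(a_t)
    memory.append((torch.stack(obs), torch.tensor(acts), torch.tensor(rews)))

    # Experience replay: sample stored episodes and minimize the loss of Eq. (16).
    if len(memory) >= batch_size:
        o, a, r = (torch.stack(z) for z in zip(*random.sample(list(memory), batch_size)))
        q_eval = eval_net(o).gather(2, a.unsqueeze(-1)).squeeze(-1)
        with torch.no_grad():                    # target net on the shifted sequence o'
            q_next = target_net(o)[:, 1:].max(dim=2).values
            tgt = r.clone()
            tgt[:, :-1] += gamma * q_next
        loss = ((tgt - q_eval) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    if step % C == 0:
        target_net.load_state_dict(eval_net.state_dict())
```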

4. Performance evaluation: object classification

In this section we evaluate the proposed one-shot learning model in the active-learning set-up described above on the public dataset ALOI [34], assessing it against a purely supervised task [25], the prior work AOL [26], and a conventional active learning method, Query By Committee (QBC) [35]. AOL also combines reinforcement learning with one-shot learning to decide which examples are worth labeling; the main difference from OL-DQN is that a deep Q-network strategy, instead of the Q-learning used by AOL, is employed to discover the criterion for the meta-learning task. Specifically, there are two distinct points. (1) Our model is trained in a batch manner while AOL is trained incrementally: OL-DQN collects a pool of samples and then uses part of them to update the Q-network, which is called experience replay, whereas AOL updates on one datum at a time. (2) Compared with AOL, a separate target network is used in our model to handle the bias of the temporal-difference targets. Both mechanisms break the correlation between episode steps collected through reinforcement learning, so as to comply with the assumption that samples are independent and identically distributed during neural network training; the experimental results show that this benefits the performance of the algorithm, both in object classification accuracy and in the reduction of the label request rate.

4.1. Dataset and training setup

The ALOI dataset consists of 1,000 separate classes of objects recorded under various imaging circumstances, including varied viewing angle, illumination angle and illumination color for each object, plus additionally captured wide-baseline stereo images, with 108 images per object, giving 108,000 examples in total. We randomly split the classes into 700 training classes and 300 testing classes. For ALOI we used an extended color histogram with 128 dimensions [36]. Images were normalized to pixel values between 0.0 and 1.0. Each episode consisted of 30 images (time step = 30) sampled randomly, without replacement, from 3 randomly sampled classes; note that the number of samples from each class is not necessarily equal. The images were finally flattened to 128-dimensional vectors, giving x_t. Each of the three sampled classes in the episode was randomly assigned a slot in a one-hot vector of length three, giving y_t. Each training step consisted of a batch of 50 episodes. ε-greedy exploration was used for action selection during training; for the proposed OL-DQN, ε decreased from 0.23 to 0.05 over time steps to allow full exploration, i.e., at the beginning the system moved relatively randomly, exploring the state space to a greater extent, and later it exploited more steadily. The discount factor γ was set to 0.3, while for AOL ε = 0.05 and γ = 0.5. When exploring, either the correct label, a random incorrect label, or the "request label" action was chosen, each with probability 1/3. We used an LSTM with 200 hidden units to represent Q. The network was randomly initialized without any pre-training and was trained using Adam with the default parameters [37]. On each time step, the rewards R_req, R_cor and R_inc were set to -0.05, 1.0 and -1.0 for requesting the label, correctly predicting the label and incorrectly predicting the label, respectively. The capacities G for OL-DQN and B for AOL were 60 and 50, respectively.
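For reference, the training hyperparameters above can be collected into a small configuration sketch (ours; the linear form of the ε decay is an assumption, since the paper only states the start and end values):

```python
# Hyperparameters reported in Section 4.1 (OL-DQN, ALOI experiments).
CONFIG = {
    "episode_len": 30, "classes_per_episode": 3, "episodes_per_batch": 50,
    "lstm_hidden": 200, "gamma": 0.3, "eps_start": 0.23, "eps_end": 0.05,
    "rewards": {"request": -0.05, "correct": 1.0, "incorrect": -1.0},
    "replay_capacity_G": 60, "aol_batch_B": 50,
}

def epsilon(step, total_steps, cfg=CONFIG):
    """Assumed linear anneal from eps_start to eps_end over training."""
    frac = min(step / max(total_steps, 1), 1.0)
    return cfg["eps_start"] + frac * (cfg["eps_end"] - cfg["eps_start"])
```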

4.2. Performance evaluation

To provide a comprehensive evaluation of the performance of the proposed model, four groups of experiments on object classification were considered. Our goal with these experiments is to determine 1) whether the proposed model can learn, through reinforcement learning, how to label examples and when to instead request a label, and 2) whether the model is effectively reasoning about uncertainty when making its predictions. All the experiments in this paper were implemented in the deep learning platform PyTorch, based on Python.

4.2.1. The first group of experiments

In this part, we performed a number of iterations of the meta-learning task described in Section 3.1. During training, the time steps containing the 1st, 2nd, 5th and 10th instances of all classes, for each episode within a training batch, are identified. Note that in this analysis the accuracy computation treats label requests as incorrect label predictions. After training on 100,000 episode batches, the network was given a series of test episodes; in these episodes training was stopped, no further learning occurred, and the model had to predict or request the class labels of never-before-seen classes drawn from a disjoint ALOI test set. Figure 3(a) and Figure 3(b) show the percentage of actions corresponding to correct label predictions and the percentage of label requests for each of these steps as training progressed, respectively, whereas Figure 4(a) and Figure 4(b) compare the overall accuracy and the request rate of OL-DQN and AOL, respectively.
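The evaluation metrics of this section are straightforward to state in code; the sketch below (ours, with assumed per-step arrays) computes the accuracy over predictions made, the overall accuracy that counts label requests as errors, and the request rate, following the definitions in the notes of Table 1:

```python
import numpy as np

def episode_metrics(decisions, predictions, labels):
    """decisions: 1 if the label was requested at a step, else 0;
    predictions/labels: predicted and true class indices per step (assumed arrays)."""
    decisions, predictions, labels = map(np.asarray, (decisions, predictions, labels))
    predicted = decisions == 0
    correct = predicted & (predictions == labels)
    pre = correct.sum() / max(predicted.sum(), 1)   # accuracy when predictions were made
    acc = correct.sum() / len(decisions)            # label requests counted as incorrect
    req = decisions.mean()                          # label request rate
    return pre, acc, req
```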

(a) OL-DQN accuracy

(b) OL-DQN label requests

Figure 3: Accuracy (a) and label requests (b) per episode batch for the 1st, 2nd, 5th, and 10th instances of all classes. The model requests fewer labels and has a higher accuracy on later instances of a class. At 100,000 episode batches, the training stops and the data switches to the test set.

Some clear observations can be drawn from Figures 3 and 4. Firstly, the proposed OL-DQN learns to have a higher label demand rate for early instances of a class and a lower one for later instances; at the same time, the classification accuracy of the model increases significantly on later instances of the same class. Secondly, both during training and testing, comparing the overall accuracy and request rate of OL-DQN with those of AOL shows that our model achieves higher accuracy with fewer label requests, which indicates the greater efficiency of the proposed model for the one-shot learning task compared to the prior work AOL.

(a) Accuracy comparison

(b) Label requests comparison

Figure 4: Comparison of overall accuracy (a) and request rate (b) results between OL-DQN and AOL. Compared to AOL, OL-DQN is able to achieve higher accuracy and lower request rate. After 100,000 episodes, the data switches to test set without further learning.


4.2.2. The second group of experiments

In this part, considering the possible class imbalance of the test set, and in order to further verify the performance of the proposed OL-DQN as a multi-class classifier in the one-shot learning task, receiver operating characteristic (ROC) curve analyses of OL-DQN against AOL on the ALOI test set are given in Figure 5. The ROC curve, a plot of the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings, illustrates the diagnostic ability of a classifier. To compare the two classifiers more clearly, the area under the ROC curve (AUC) was also computed, using the trapezoidal areas between consecutive ROC points, to quantify classification performance. As can be seen from Figure 5, the AUC of OL-DQN is 0.921, clearly superior to that of AOL at 0.886. The ROC-AUC analysis thus shows that the OL-DQN algorithm effectively improves classification performance, which again demonstrates the advantages of the proposed model.
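A typical way to produce such a curve for a multi-class, one-hot set-up is the micro-averaged ROC of scikit-learn, sketched below (our illustration; the score and label arrays are random placeholders, and micro-averaging is our choice since the paper does not state how the multi-class curves were aggregated):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# y_true: (num_samples, n_classes) one-hot labels; y_score: predicted class scores.
y_true = np.eye(3)[np.random.randint(0, 3, 200)]
y_score = np.random.rand(200, 3)

fpr, tpr, _ = roc_curve(y_true.ravel(), y_score.ravel())   # micro-averaged ROC
print("AUC:", auc(fpr, tpr))                                # trapezoidal area under ROC
```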


Figure 5: The comparison of ROC and AUC results between OL-DQN and AOL.

4.2.3. The third group of experiments

In reinforcement learning, the reward function has a great effect on the performance of the algorithm. To obtain a better compromise between prediction accuracy and label requests by properly selecting rewards, we conducted an experiment in which the R_inc value of the proposed model took different values while the other parameters were kept fixed, so as to explore the impact of the reward function on this classification task. Besides the AOL model, we also report the best results of QBC on the same problem, in which the dataset used to update the network at each step consists of an initial annotated sample set plus samples with true labels (simulating expert annotation) selected from the unlabeled sample set by the query-by-committee algorithm. In addition, the supervised learning model introduced by Santoro et al. [25] was also evaluated, where the cross entropy between the predicted and true labels is used as the loss function and the true label is always presented at the following time step. For a fair comparison, both the supervised task and QBC used the same LSTM model; the difference from our model is that the former applies a softmax to the output and does not output the extra bit for the "request label" action, whereas the latter is given the true label of only one sample, selected by the QBC strategy, after each query of a training round. To ensure high classification accuracy for QBC, the number of queries was set to 20. Note that requesting more instances brings better precision but worse accuracy, which means the recall may suffer; therefore, the comprehensive F1-score metric of AOL and OL-DQN is also reported to better measure classification performance. The obtained results are presented in Table 1.

Table 1: Classification prediction, accuracy, percentage of label requests per episode, and F1-score on object classification. Higher accuracy requires more labels; introducing the DQN strategy provides a principled approach to making that trade-off, here by specifying the reward for an incorrect answer, R_inc. The model for "AOL" and "OL-DQN, R_inc = -1" is the same model used in Figures 3 and 4.

Algorithm               Training                     Testing
                        Pre(%)  Acc(%)  Req(%)       Pre(%)  Acc(%)  Req(%)   F1
Supervised accuracy     92.3    -       100.0        90.5    -       100.0    -
QBC accuracy            89.7    -       30.7         89.4    -       30.7     -
AOL                     90.0    81.7    9.1          86.2    78.2    9.4      0.77
OL-DQN, Rinc = -1       94.4    87.9    6.8          90.2    83.9    8.0      0.88
OL-DQN, Rinc = -5       97.7    71.7    26.0         96.6    70.5    26.6     0.56
OL-DQN, Rinc = -10      99.1    42.9    53.3         98.1    40.0    59.1     0.35

Notes:
1 Pre: Prediction, the accuracy when predictions were made.
2 Acc: Accuracy, the accuracy with label requests treated as incorrect label predictions.
3 Req: Requests, the label request rate.
4 F1: F1-score.
5 Supervised accuracy: the accuracy with all samples labeled.
6 QBC accuracy: the accuracy of the labeled training samples.

As can be seen, the proposed model achieves higher accuracy while using fewer labels, on both the training and test sets, than the AOL method, QBC, and supervised learning, which requests labels 100% of the time, under the same reward settings. In other words, our model makes a better trade-off between high prediction accuracy with many label requests and few label requests with lower prediction accuracy, which is also strongly confirmed by the F1-score comparison.

4.2.4. The fourth group of experiments

To examine whether the model is effectively reasoning about uncertainty when making its predictions, or simply learning a fixed time step at which to switch from requesting labels to predicting them, an experiment was performed in which the order of the samples was fixed by hand, instead of the random sampling of the previous experiments, in order to probe the action strategy of the model. In this task we randomly chose only two classes of test samples for each episode and then ran 1000 episodes using the trained model without learning. Figures 6(a) and 6(b) show the results of two cases: the model was presented with either 4 examples of the first class followed by 6 examples of the second class, or 11 examples of the first class followed by 2 examples of the second class.

(a) Switch classes after 4

(b) Switch classes after 11

Figure 6: The trained model was further evaluated on a second task. For each episode, two random test classes were selected. (a)4 examples from the first class are presented, followed by 6 examples from the second class. (b)11 examples from the first class are presented, followed by 2 examples from the second class. The percentage of episodes in which a label request is made for each time step is shown. The difference in the percentage of label requests at time step 5, along with the similarity between the percentage of label requests when an instance of the second class is presented, suggests that the model is computing uncertainty and acting on it.

From the results shown in Figure 6 it can be seen that, whether the switch to the second class happens after 4 or after 11 examples, the OL-DQN model has a high percentage of label requests at the time step at which the first instance of either class appears. More concretely, the label request frequency is high at step 1 and, observably, equally high at the 5th or 12th step across the two scenarios, while for later instances of the same class it is greatly reduced, since the model has already seen an instance of that class. This confirms that our model effectively reasons about its own uncertainty and intelligently decides which examples are worth labeling, which can contribute substantially to reducing the cost of manual labeling.

5. Application analysis: handwriting recognition

For the one-shot learning task, handwriting recognition, e.g., the MNIST dataset of digits from 0 to 9 and the Omniglot dataset of various character classes, is commonly used for application analysis of proposed algorithms. In our paper we choose a dataset [38] composed of both characters and digits to validate our model. The experimental results show that our model learns intelligent and efficient decision-making and inference policies without using heuristics to handle the computational complexity of approximating the risk associated with selection, as is required in active learning algorithms.

5.1. Dataset and training setup

The dataset mentioned above corresponds to handwritten alphanumeric characters, given as 20-by-16-pixel binary images that serve as the model input. There are 36 classes, corresponding to the letters A to Z and the digits 0 to 9, with 39 examples per class. The training set consists of 28 randomly chosen classes; the remaining 8 classes form the test set. All images were normalized to pixel values between 0.0 and 1.0. As before, each episode was composed of 30 images sampled randomly, without replacement, from 3 randomly sampled classes (n_class = 3). Moreover, to reduce the risk of overfitting, we performed data augmentation for each class in the episode by randomly translating the character images and rotating them through 0°, 90°, 180° and 270°, after zero-padding the images to 20 by 20 pixels.
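A simple version of this augmentation step can be sketched as follows (our illustration with NumPy; the translation range is an assumption, since the paper does not specify it):

```python
import numpy as np

def augment(img, rng=np.random.default_rng(), max_shift=2):
    """Zero-pad a 20x16 binary character image to 20x20, then apply a random
    translation (assumed range) and a random rotation of 0/90/180/270 degrees."""
    padded = np.zeros((20, 20), dtype=img.dtype)
    padded[:, 2:18] = img                                    # center the 20x16 image
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(np.roll(padded, dy, axis=0), dx, axis=1)
    return np.rot90(shifted, k=int(rng.integers(4)))         # k quarter-turns
```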

In this experiment, ε-greedy exploration was employed to select actions; ε was reduced from 0.3 to 0.05 over time steps for OL-DQN, while it was fixed at 0.05 for AOL. The discount factor γ was fixed to 0.6 for the proposed model and 0.5 for the prior work. Most importantly, both used the same reward function, R_req = -0.3, R_cor = 1.0, R_inc = -1.0, to make the comparison fair. The capacities G for OL-DQN and B for AOL were 100 and 50, respectively.

As performed in the experiments for objects classification, in this part, we also designed four sets of experiments to verify the utility of our model in the application of handwriting recognition. 5.2.1. The first group of experiment Similarly, first, we observed the percentage of actions corresponding to correct la-

365

bel predictions and label requests of 1st, 2nd, 5th, and 10th instances of all classes for each episode within a training batch, and the obtained results are presented in Figure 7(a) and Figure 7(b) respectively. One relevant baseline about this experiment is the human performance: facing with an image of never-before-seen class for predicting the label, that is the 1st instance in Figure 7, participants would largely require the true

370

label for storing correlative characteristics of this class samples to further reinforce correct decisions in the completion of subsequent classification tasks, which means high demand rate for labels will occur in the 1st instances of all classes in theory. Next, when a new image of the same class was presented, they were to make an untimed prediction as to its class label. Naturally, classification accuracy would improve and sample la-

375

bel demand rate would reduce as the number of images observed of the same category increased. As we can see from Figure 7, the model learned human-like abilities and exhibited high classification accuracy and low label requests on just the second presentation of a sample from the same class within an episode (86.4% and 2.7%), reaching up to 90.4% accuracy and 0.1% request by the fifth instance as well as 91.7% accuracy

380

and 0.0% request by the tenth. Then, Figure 8(a) and Figure 8(b) were given to report the results on the comparison experiments between the proposed OL-DQN and the prior AOL in terms of overall accuracy and request rate respectively. It is obvious that, with respect to AOL, OL-DQN is able to achieve higher accuracy and lower request rate, which confirms our model

21

(a) OL-DQN accuracy

(b) OL-DQN label requests

Figure 7: Accuracy (a) and label requests (b) per episode batch for the 1st, 2nd, 5th, and 10th instances of all classes. The model requests fewer labels and has a higher accuracy on later instances of a class. At 100,000 episode batches, the training stops and the data switches to the test set.

385

is a more efficient active learner on handwriting recognition for the meta-learning task compared to the prior work AOL.

(a) Accuracy comparision

(b) Label requests comparision

Figure 8: Comparison of overall accuracy (a) and request rate (b) results between OL-DQN and AOL. Compared to AOL, OL-DQN is able to achieve higher accuracy and lower request rate. After 100,000 episodes, the data switches to test set without further learning.

5.2.2. The second group of experiment These encouraging observations above are clearly corroborated by the comparison of ROC and AUC results between OL-DQN and AOL for multi-classification tasks 22

390

on handwriting recognition in this part, as shown in Figure 9, the proposed model reached superior classification performance (AUC=0.905) with respect to the AOL (AUC=0.857), which is another evidence of the advantages obtained by employing the deep Q-Network strategy.

Figure 9: Comparison of ROC and AUC results between OL-DQN and AOL.

5.2.3. The third group of experiment 395

In this part, we further analyzed the performance of our model against the supervised model, the conventional active learning method(QBC), and the prior work AOL in terms of prediction, accuracy, request rate and F1-score on the handwriting recognition dataset. Besides, to confirm the applicability and extensibility of the algorithm for ‘few-shot’ learning task, another experiment was performed where the number of

400

classes presented in each episode increased from 3 to 5, and the other parameter settings remained the same. The obtained results, as shown in Table 2, clearly reveal that our model outperforms the existing method AOL and QBC in terms of the compromise between classification accuracy and label requirement rate and achieves a comparable accuracy with

405

extremely fewer requests compared to the supervised model. Both of them verify the effectiveness in reducing annotation costs of the proposed OL-DQN. In addition, it can be observed that the transfer ability and generalization performance of the algorithm become worse as the number of categories increases, which is reflected in the fact that 23

Table 2: Classification prediction, accuracies, percentage of label requests per episode and F1-score on the handwriting recognition. Compared to QBC, the supervised model and the prior work AOL, the proposed OL-DQN can achieve higher accuracy with fewer label requests. By changing the value settings of the number of classes presented in each episode, it can be verified that learning the weights of a classifier using large one-hot vectors becomes increasingly difficult with scale. The model for “AOL and “OL-DQN is the same model used in Figure 7(a) and 7(b).

Algorithm

Training

Testing

Pre(%) Acc(%) Req(%) Pre(%) Acc(%) Req(%) F1

Supervised accuracy

90.1

-

100.0

89.1

-

100.0

-

QBC accuracy

92.9

-

36.5

80.4

-

36.5

-

AOL

88.6

81.6

7.2

87.1

80.2

7.9

0.80

nc=3

93.4

87.2

6.7

89.5

83.4

6.8

0.87

nc=5

92.9

85.4

8.1

79.8

72.8

8.8

0.58

OL-DQN 1

Pre: Prediction, the accuracy when predictions were made;

2

Acc: Accuracy, the accuracy with label requests treated as incorrect label prediction;

3

Req: Requests, label request rate.

4

F1: F1-score.

5

Supervised accuracy: the accuracy with all samples labeled.

6

QBC accuracy: the accuracy of labeled training samples.

7

n c: n class, the number of classes per episode.

the classification accuracy in the training set is not much different, while that of the 410

testing set reduces obviously. 5.2.4. The fourth group of experiment To verify whether the model uses an naive strategy or an optimal strategy for the meta-learning task, in which the former means to learn a fixed time step for switching from requesting labels to predicting labels and the latter is to consider the model’s un-

415

certainty of the label when requesting a label, the experiment of 4.2.4 was repeated on the handwriting recognition dataset. With respect to the previous experimental design (See Figure 6), we only switched to the second class after step 3 ( Figure 10(a)) and 24

after step 7 ( Figure 10(b)). And the differences in the percentage of label requests at step 4 between those two pictures reveal, obviously, it is an optimal strategy instead 420

of the episode independent label request schedule that our model employed in dealing with mentioned tasks.

(a) Switch classes after 3

(b) Switch classes after 7

Figure 10: The trained model was further evaluated on a second task. For each episode, two random test classes were selected. (a) 3 examples from the first class are presented, followed by 7 examples from the second class. (b) 7 examples from the first class are presented, followed by 3 examples from the second class. The label request rate is greatly reduced when the second sample of each class appeared (time step 2 and 5 (a), time step 2 and 9 (b)), while that of the first sample of each class is significantly increased (time step 1 and 4 (a), time step 1 and 8 (b)).

6. Discussion and future work In this part, the core contributions of our work are outlined: (1) we proposed a novel model for the active-learning set-up, referred to it being called OL-DQN, which 425

learned an active learner to infer whether to label a sample automatically or request the true label via introducing the DQN strategy to one-shot learning. (2) we carried out the performance evaluation and application analysis for the proposed model OL-DQN on one-shot learning tasks using objects classification dataset ALOI and handwriting recognition dataset which combined both characters and digits. The experimental re-

430

sults show that our model can transform from engineering heuristic selection of samples to learning strategies from data. Compared to a purely supervised task and the

25

previous work, our model can achieve higher classification accuracy with fewer label requests, which contributes to the cost savings of manual annotation. There are three important directions for future work of our model: (1) Aimed to the 435

performance degradation problem of OL-DQN when the number of classes presented in episodes increases, further model optimization work can be expanded, for example, memory-augmented neural networks (MANNs) can be employed to replace LSTM to do the one-shot learning task. (2) Further improve the classification performance of our model, for example, using more powerful RL strategies such as better exploration [39],

440

decomposing the action-value function [40]. (3) Expand its application to practical problems such as cold-start problem, or increase the complexity of the task. Acknowledgements This study was supported by the National Natural Science Foundation of China (No.71701205).

445

References References [1] D. Cohn, L. Atlas, R. Ladner, Improving generalization with active learning, Machine learning 15 (2) (1994) 201–221. [2] X. Zhu, X. Wu, Class noise vs. attribute noise: A quantitative study, Artificial

450

intelligence review 22 (3) (2004) 177–210. [3] S. Vijayanarasimhan, K. Grauman, Large-scale live active learning: Training object detectors with crawled data and crowds, International Journal of Computer Vision 108 (1-2) (2014) 97–114. [4] D. Reker, G. Schneider, Active-learning strategies in computer-assisted drug dis-

455

covery, Drug Discovery Today 20 (4) (2015) 458–465.

26

[5] C. Liu, H. Lin, Z. Li, J. Li, Feature-driven active learning for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing PP (99) (2017) 1–14. [6] Y. Yang, M. Loog, A benchmark and comparison of active learning for logistic 460

regression, Pattern Recognition 83 (C) (2018) 401–415. [7] E. Lughofer, M. Pratama, Online active learning in data stream regression using uncertainty sampling based on evolving generalized fuzzy models, IEEE Transactions on Fuzzy Systems 26 (1) (2018) 292–309. [8] S.-J. Huang, J.-L. Chen, X. Mu, Z.-H. Zhou, Cost-effective active learning from

465

diverse labelers., in: International Joint Conference on Artificial Intelligence (IJCAI), 2017, pp. 1879–1885. [9] N. Roy, A. Mccallum, Toward optimal active learning through sampling estimation of error reduction, in: The Eighteenth International Conference on Machine Learning, 2001, pp. 441–448.

470

[10] P. T. M. Saito, C. T. N. Suzuki, J. F. Gomes, P. J. D. Rezende, A. X. Falc˜ao, Robust active learning for the diagnosis of parasites, Pattern Recognition 48 (11) (2015) 3572–3583. [11] J. Zhu, H. Wang, B. K. Tsou, M. Ma, Active learning with sampling by uncertainty and density for data annotations, IEEE Transactions on Audio Speech &

475

Language Processing 18 (6) (2010) 1323–1331. [12] S.-J. Huang, R. Jin, Z.-H. Zhou, Active learning by querying informative and representative examples, in: Advances in neural information processing systems, 2010, pp. 892–900. [13] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, A. Vedaldi, Learning feed-

480

forward one-shot learners, in: Advances in Neural Information Processing Systems, 2016, pp. 523–531.

27

[14] H. Zhang, K. Dana, K. Nishino, Friction from reflectance: Deep reflectance codes for predicting physical surface properties from one-shot in-field reflectance, in: European Conference on Computer Vision, 2016, pp. 808–824. 485

[15] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., Matching networks for one shot learning, in: Advances in neural information processing systems, 2016, pp. 3630–3638. [16] S. Ravi, H. Larochelle, Optimization as a model for few-shot learning, in: International Conference on Learning Representations (ICLR), 2017.

490

[17] M. Peng, Q. Zhang, X. Xing, T. Gui, X. Huang, Y.-G. Jiang, K. Ding, Z. Chen, Trainable undersampling for class-imbalance learning, AAAI, 2019. [18] C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in: Proceedings of the 34th International Conference on Machine Learning, Vol. 70, 2017, pp. 1126–1135.

495

[19] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, R. Hadsell, Progressive neural networks, arXiv preprint arXiv:1606.04671, 2016. [20] Y. Fu, T. Xiang, Y.-G. Jiang, X. Xue, L. Sigal, S. Gong, Recent advances in zeroshot recognition: Toward data-efficient understanding of visual content, IEEE

500

Signal Processing Magazine 35 (1) (2018) 112–125. [21] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780. [22] G. Aceto, D. Ciuonzo, A. Montieri, A. Pescap´e, Mobile encrypted traffic classification using deep learning, in: 2018 Network Traffic Measurement and Analysis

505

Conference (TMA), 2018, pp. 1–8. [23] G. Aceto, D. Ciuonzo, A. Montieri, A. Pescap´e, Mobile encrypted traffic classification using deep learning: Experimental evaluation, lessons learned, and challenges, IEEE Transactions on Network and Service Management, 2019. 28

[24] Q. Wen, J. Yan, B. Liu, D. Meng, S. Li, A meta-learning method for histopathol510

ogy image classification based on lstm-model, in: Tenth International Conference on Graphics and Image Processing (ICGIP), Vol. 11069, 2019, pp. 110691H. [25] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, T. Lillicrap, One-shot learning with memory-augmented neural networks, arXiv preprint arXiv:1605.06065, 2016.

515

[26] M.

Woodward,

C.

Finn,

Active

one-shot

learning,

arXiv

preprint

arXiv:1702.06559, 2017. [27] L. Fei-Fei, R. Fergus, P. Perona, One-shot learning of object categories, IEEE transactions on pattern analysis and machine intelligence 28 (4) (2006) 594–611. [28] B. Lake, R. Salakhutdinov, J. Gross, J. Tenenbaum, One shot learning of simple 520

visual concepts, in: Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 33, 2011. [29] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, S. J. Gershman, Building machines that learn and think like people, Behavioral and Brain Sciences 40 (2017) e253. [30] Y.-H. H. Tsai, R. Salakhutdinov, Improving one-shot learning through fusing side

525

information, arXiv preprint arXiv:1710.08347, 2017. [31] V. Mnih, K. Kavukcuoglu, D. Silver, et al., Human-level control through deep reinforcement learning, Nature 518 (7540) (2015) 529–533. [32] R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.

530

[33] C. J. C. H. Watkins, P. Dayan, Q-learning, Machine Learning 8 (3-4) (1992) 279– 292. [34] A. Rocha, S. K. Goldenstein, Multiclass from binary: Expanding one-versus-all, one-versus-one and ecoc-based approaches, IEEE Transactions on Neural Networks and Learning Systems 25 (2) (2014) 289–302.

29

535

[35] Y. Freund, H-S. Seung, E. Shamir, N. Tishby, Selective sampling using the query by committee algorithm, Machine learning, 28 (2-3) (1997), 133–168. [36] R. O. Stehling, M. A. Nascimento, A. X. Falc˜ao, A compact and efficient image retrieval approach based on border/interior pixel classification, in: Proceedings of the eleventh international conference on Information and knowledge manage-

540

ment, ACM, 2002, pp. 102–109. [37] D. Kingma, J. Ba, Adam: A method for stochastic optimization, Computer Science, 2014. [38] H. Larochelle, D. Erhan, Y. Bengio, Zero-data learning of new tasks, in: Proc. 23rd Nat’l Conf. Artificial Intelligence, Vol. 1, 2008, pp. 646–651.

545

[39] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International conference on machine learning, 2016, pp. 1928–1937. [40] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, N. De Freitas, Dueling network architectures for deep reinforcement learning, arXiv preprint

550

arXiv:1511.06581, 2016.

30

LI CHEN received her B.S. degree from the Department of Measurement and Control Technology at Lanzhou University of Technology, and Master’s degree in the School of Information Science and Engineering at Lanzhou University, Lanzhou, China. Now she is working towards her Doctorate in the School of systems 555

engineering at National University of Defense Technology, Changsha, China.

HONGLAN HUANG was born in Hefei, China in 1995. She received the B.S. degree in information engineering from Xi’an Jiaotong University, Xi’an, China, in 2017. She is currently pursuing the M.S. degree in management science and engineering at National University of Defense Technology, Changsha, China. 560

Her research interests include active learning, reinforcement learning and one-shot learning.

YANGHE FENG is now an associate research fellow in National University of Defense Technology. His research interests include human factors engineering, cognitive computing, deep learning, deep reinforcement learning, and ac565

tive learning.

GUANGQUAN CHENG is an associate professor at National

31

University of Defense Technology. He received the M.Sc and Ph.D. degrees from National University of Defense Technology, Changsha, China, in 2005 and 2010, respectively. His current research interests include complex network analysis and decision570

making support technology.

JINCAI HUANG is an associate research fellow of the National University of Defense Technology, Changsha, Hunan, China, and a researcher of Science and Technology on Information Systems Engineering Laboratory. His main research interests include artificial general intelligence, deep reinforcement learning, 575

and multi-agent systems.

ZHONG LIU is an associate research fellow of the National University of Defense Technology, Changsha, Hunan, China, and a researcher of Science and Technology on Information SystemsEngineering Laboratory. His main research interests include artificial general intelligence, deep reinforcement learning, and multi580

agent systems.

32

Conflict of Interest Form We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

33

Declaration of interests The authors declare that they have no known competing financial interests or per-

585

sonal relationships that could have appeared to influence the work reported in this paper.

34