A BasisEvolution Framework for Network Traffic Anomaly Detection

Accepted Manuscript

Hui Xia, Bin Fang, Matthew Roughan, Kenjiro Cho, Paul Tune

PII: S1389-1286(18)30033-1
DOI: 10.1016/j.comnet.2018.01.025
Reference: COMPNW 6373

To appear in: Computer Networks

Received date: 13 January 2017
Revised date: 16 October 2017
Accepted date: 17 January 2018

Please cite this article as: Hui Xia, Bin Fang, Matthew Roughan, Kenjiro Cho, Paul Tune, A BasisEvolution Framework for Network Traffic Anomaly Detection, Computer Networks (2018), doi: 10.1016/j.comnet.2018.01.025

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


A BasisEvolution Framework for Network Traffic Anomaly Detection


Hui Xia¹, Bin Fang², Matthew Roughan³, Kenjiro Cho⁴, Paul Tune⁵

Abstract

Traffic anomalies arise from network problems, and so detection and diagnosis are useful tools for network managers. A great deal of progress has been made on this problem so far, but most approaches can be thought of as forcing the data to fit a single mould. Existing anomaly detection methods largely work by separating traffic signals into “normal” and “anomalous” types using historical data, but do so inflexibly, either requiring a long description for “normal” traffic, or a short, but inaccurate description. In essence, preconceived “basis” functions limit the ability to fit data, and the static nature of many algorithms prevents true adaptivity despite the fact that real Internet traffic evolves over time. In our approach we allow a very general class of functions to represent traffic data, limiting them only by invariant properties of network traffic such as diurnal and weekly cycles. This representation is designed to evolve so as to adapt to changing traffic over time. Our anomaly detection uses thresholding of the approximation residual error, combined with a generic clustering technique to report a group of anomalous points as a single anomaly event. We evaluate our method against orthogonal matching pursuit, principal component analysis, robust principal component analysis and a back-propagation neural network, using both synthetic and real-world data, and obtain very low false-alarm probabilities in comparison.

¹ H. Xia is with the School of Accounting, Chongqing University of Technology, Chongqing, 400044, China. E-mail: [email protected]. The majority of this work was conducted while H. Xia was visiting the University of Adelaide.
² Corresponding Author. B. Fang is with the School of Computer Science, Chongqing University, Chongqing, 400044, China. E-mail: [email protected]
³ M. Roughan is with the ARC Centre of Excellence for Mathematical & Statistical Frontiers (ACEMS) at the School of Mathematical Sciences, the University of Adelaide, Adelaide 5005, SA, Australia. E-mail: [email protected]
⁴ K. Cho is at IIJ. E-mail: [email protected]
⁵ P. Tune is at Image Intelligence. E-mail: [email protected]

Preprint submitted to Computer Networks, January 29, 2018


Keywords: anomaly detection, basis, evolution, low false-alarm probability, SVD

1. Introduction

Conceptually, a network traffic anomaly can be defined as an intentional or unintentional attack, fault, or defect, which perturbs “normal” network traffic behaviour [1]. The causes could be traffic outages, configuration changes, network attacks (e.g., DDoS attacks), flash crowds, network worms, and so on [2, 3, 4]. As the Internet continues to grow in size and complexity, traffic anomalies are becoming more common and diverse. They need to be detected and analysed effectively.

Most Intrusion Detection Systems (IDS) are knowledge-based detectors, which search for instances of known network problems by attempting to match pre-determined representations [5]. Such systems can be trained against known records, but cannot detect unknown problem types, i.e., anomalies [6]. Thus, there is a need for anomaly detection, which estimates the typical behaviour of network traffic and detects deviations from it [1, 6].

Anomaly detection schemes need to have low false-alarm probabilities, at least of the order of 10⁻⁴, because of the large number of tests running in even a moderately sized network, and the amount of human work required to process each anomaly. Most existing techniques do not achieve such low false-alarm probabilities [7, 8], at least in part because they are restricted in their representation of typical traffic. Existing techniques represent typical traffic in terms of a set of functions of historical data – we refer to this as a “basis”, though it may not be a strict mathematical basis for the space. The requirement that we accurately model typical traffic results either in a large error in representation, or in a large set of functions, which is undesirable for reasons best described under the heading of the “Principle of Parsimony” [9]: principally, problems of estimation and generalisation.

In our work, we also search for basis functions, but we do so from a potentially infinite set, allowing a very accurate representation with a small number of functions. However, unlike Principal Components Analysis (PCA), which also performs such a search, we do so under constraints that match known behaviour of real traffic: for instance, cyclicity. Current anomaly detection approaches usually ignore the nature of network traffic, making the derived basis hard to interpret.
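The scale of the false-alarm requirement above can be seen with a rough calculation; the network size and measurement-bin length below are invented purely for illustration:

```python
# rough scale of the false-alarm requirement (illustrative numbers only)
bins_per_day = 24 * 12            # one anomaly test per 5-minute bin
links = 100                       # a moderately sized network
tests_per_day = bins_per_day * links

for p_fa in (1e-2, 1e-4):
    expected = p_fa * tests_per_day
    print(p_fa, expected)         # expected false alarms per day
```

Even at 10⁻⁴, such a network produces a few false alarms per day, each needing human attention; at 10⁻² it would produce hundreds.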

Also, real network traffic patterns do change over time, but slowly. For example, the peak time of traffic might shift from 8pm to 7pm due to the change to summer time; and traffic will change as an ISP recruits or loses customers [10]. We also present an update algorithm that improves the basis with new data, allowing the algorithm to adapt to network changes without a computational explosion. The evolved basis usually remains similar to the old basis, but if the reconstruction error of the evolved basis is outside a preset threshold, we reinitialise.

Once we have a basis for typical traffic, anomalous points are easy to detect

by examining the reconstruction error.

An additional contribution of this paper is to go beyond simple metrics for anomaly detection performance (such as false-alarm and missed-detection probabilities). Real anomalies persist, and better metrics for anomaly detection also determine the extent or duration of the anomaly. In this paper, we use a clustering algorithm and associated quality metrics to compare detection algorithms. Hence, anomaly detection in our framework includes two components: anomalous-point detection and anomaly clustering. The first is used for detecting anomalous points, and the second is for analysing anomalies.

We use both synthetic and real-world data to assess the performance of our framework. The synthetic data is less realistic, but is necessary to provide accurate metrics [11]: there are intrinsic difficulties in obtaining large enough sets of accurately marked anomaly data, given that an “anomaly” is intrinsically defined only by its absence from typical traffic. We use real data to illustrate the relationship between anomalies detected by our approach and those of PCA, Robust Principal Component Analysis (RPCA), Orthogonal Matching Pursuit (OMP), and Neural Network using Back Propagation (NNBP) detection models. We show that our approach has substantially better detection characteristics than these alternatives, including false-alarm probabilities as low as 7.8 × 10⁻⁵.

2. Background and Related work

In many cases, network traffic records contain a number of properties, such as source IP addresses and destination IP addresses, which expand the context of anomaly detection to a high-dimensional space. In view of this, anomaly detection approaches first process the attributes of the data, because they are of different types. Examples include calculating the entropy of some features of the data, e.g., the number of observed host addresses [4]; removing mean or median values of the data [12, 13]; and measuring the mutual information of each feature for selection [14].
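As a small illustration of the entropy-based preprocessing mentioned above, one can compute the entropy of, say, the source-address distribution in a time bin; the addresses and counts below are invented:

```python
import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of the empirical distribution of `values`."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

balanced = ["10.0.0.%d" % (i % 50) for i in range(1000)]   # many diverse sources
skewed = ["10.0.0.1"] * 990 + ["10.0.0.2"] * 10            # traffic dominated by one source

print(entropy(balanced) > entropy(skewed))   # True: concentration lowers entropy
```

A sudden drop (or rise) in such a feature entropy is the kind of signal these preprocessing steps expose to the detector.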

In real detection scenarios, higher-dimensional data require more resources for calculating parameters, and may cause overfitting in the training stage, even though the results are more explainable. In view of this, researchers adopt attribute selection (or extraction) to reduce dimensionality. For example, Huang et al. [15] chose the Self-Organizing Map (SOM) to extract principal components as a compressed representation of a “normal picture”; Chen et al. [16] filtered irrelevant attributes by utilizing the maximal information coefficient (MIC) to increase the classification accuracy.

One explicit characteristic of high-dimensional network traffic data is that almost all anomalies are embedded in some lower-dimensional subspaces [17]. Existing detection methods for those “subspace anomalies” include subspace projection-based models [17] and PCA [13, 18]. Other techniques, such as Artificial Neural Network (ANN) classifiers [15, 5], Fourier analysis [19], wavelet analysis [20] and “BasisDetect” [21], perform better in lower- or single-dimensional environments.

Moreover, Teodoro et al. [6] provided a comprehensive review of commonly

used techniques, platforms, systems and projects, while Ahmed et al. [22] presented an in-depth analysis of four major anomaly detection techniques: classification, statistical, information-theoretic and clustering. In addition, Chandola et al. [23] discussed anomaly types and analyzed anomaly detection algorithms in six classes: classification, nearest-neighbour, clustering, statistical, information-theoretic and spectral-based techniques. As conventional distance and similarity measurements lose efficacy as the dimensionality increases, Weller-Fahy et al. [24] discussed various types of distance and similarity measurements so that the most appropriate evaluation metrics can be used in detection.

Instead of reviewing the above detection methods, we seek to describe a framework within which we can place various approaches; we use a few exemplar techniques to illustrate this framework, and against these we compare our approach.

In general, anomaly detection algorithms use historical records to establish a baseline for typical behaviour, and then separate unusual, or anomalous, traffic from this baseline. Conceptually this can be illustrated by Figure 1, which shows how we form an approximation for classification. Some techniques treat this as a simple classification problem, while others project into the typical space and look for outliers in the residual, but the underlying problem is the same.

There is a tension in all techniques in the number of functions or features used to approximate the region τ. If more features are used, we might obtain a more accurate approximation; however, this works against the “Principle of Parsimony” [9], which argues for smaller numbers of features. The principle expresses the fact that larger numbers of features are harder to estimate accurately (from the historical data), and, more importantly, larger numbers of features may not be generalizable. That is, they may have great explanatory power for the historical data, but do not work well for data on which they have not been trained. A common expression is that we are “fitting the noise”.

[Figure 1 diagram: the space Ω of all possible traffic signals, containing the region τ of typical traffic and its approximation τ̂, with the false alarms and missed detections that arise from the mismatch.]

Figure 1: A simplified picture of the problem of anomaly detection. Conceptually, there is a space of traffic signals which are “typical”, τ, within the larger space of all traffic signals, Ω. We seek to estimate this region from historical data, and approximate it by τ̂, so that we can classify new signals. Inaccuracies in the approximation result in errors of the two types indicated.

The result is that although some approaches can approximate a region very

accurately, the region they match is polluted by noise (for instance, undiagnosed anomalies in the training data, which are quite common). An illustrative example is the poor performance of PCA when the traffic measurements are polluted with anomalies [25], thereby distorting the approximation region τ̂. So there is a tradeoff for any one technique, but better approaches can make unilateral improvements by choosing a better set of features or functions with which to represent τ.

The tradeoff can be seen in most of the approaches described above. For instance, Fourier analysis often either directly filters low-frequency components of the signal in the Fourier domain, or approximates this through a low-pass filter, which requires trading off the number of frequencies used in the approximation. A larger number leads to a better approximation, but the approach then becomes more subject to noise. Similarly, PCA constructs a “normal subspace” to approximate the region τ, which requires trading off the number of principal components [26].
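The frequency-count tradeoff can be seen directly in a small experiment; the synthetic signal and cutoffs below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(24 * 7)                                   # one week of hourly samples
clean = np.sin(2*np.pi*t/24) + 0.3*np.sin(2*np.pi*t/168)
x = clean + rng.normal(0, 0.2, t.size)                  # observed = cycles + noise

def lowpass(x, K):
    """Reconstruct x from only its K lowest-frequency Fourier components."""
    F = np.fft.rfft(x)
    G = np.zeros_like(F)
    G[:K] = F[:K]
    return np.fft.irfft(G, n=x.size)

err_small = np.linalg.norm(x - lowpass(x, 3))
err_large = np.linalg.norm(x - lowpass(x, 60))
print(err_large < err_small)   # True: more frequencies fit the observed x more closely
```

The closer fit with K = 60 is exactly the danger the text describes: the extra frequencies are largely fitting the noise, not the underlying cycles.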

Many approaches choose a hybrid way to address the trade-off problem.


They combine PCA with multi-scale methods so that a moderate number of features can approximate the typical traffic space τ more accurately without losing generalization and efficiency. For example, some researchers applied Empirical Mode Decomposition (EMD) to the extracted principal components and collected Intrinsic Mode Functions (IMFs) as multiscale features for network traffic [27]. Similarly, Jiang et al. [8] applied the wavelet transform as a multiscale analysis, through which anomalous traffic can be identified using mapping functions on the principal components extracted at each scale. Novakov et al. [26] introduced a cooperative detection approach with two methods, PCA- and wavelet-based detection algorithms, in which the former examines the entire set of data to roughly detect the anomalous time bins, while the latter applies a detailed localized frequency analysis to identify potential anomalous time bins. Additionally, Chen et al. [16] proposed a multiscale PCA (MSPCA) model to decrease the typical-traffic reconstruction error by combining PCA and the Discrete Wavelet Transform (DWT); an improved version involves Bayesian PCA [28].

Within these approaches BasisDetect [21] chooses an interesting path. It seeks to use a broader set of potential features, selecting a small number of those that match the data to create representations. By allowing a broader class of feature vectors, it allows a better representation with fewer features, and this is a general property. However, if we allow that class of features to become too wide, we hit computational barriers. Techniques such as PCA select from a much larger (infinite) set of possible feature vectors for representation, but do so in a manner that assumes an underlying model. We argue here that the patterns extracted by PCA do not represent what we see in real traffic data: it assumes independent and identically distributed Gaussian samples, and it has long been known that Internet traffic measurements (at least) show strong cyclical dependencies.

We describe these processes more formally by suggesting that the region τ̂ is described by a set of basis functions. They are not truly a basis, as the region may not be a vector space (it may not even be convex), but this terminology seems to be helpful for a uniform description of approaches. Given this terminology, Figure 2 shows how approaches can be roughly separated into four steps: data transformation, basis derivation, data reconstruction and anomaly detection. We use this structure to explain what is novel about our approach.

[Figure 2 diagram: Past data → Data transformation → Basis derivation → Data reconstruction (also fed by New data) → Anomaly detection → Anomalous points.]

Figure 2: The general anomaly detection process. (Data transformation) is a preprocessing procedure. (Basis derivation) generates a basis from the historical data. (Data reconstruction) approximates the new data with the basis. (Anomaly detection) detects anomalies from the residual series, which records the difference between the new data and its approximation.

We will not dwell on the initial transformation process, as it is about the type of anomalous behaviour observable, not the technique for observing it. The basis derivation process forces techniques to seek ways to pick better functions, so that the reconstructed traffic τ̂ is closer to τ. Our approach tackles these problems by

1. allowing a wider class of basis functions, constrained by known traffic features in order to avoid an explosion of possibilities; and

2. an adaptive mechanism to allow the representation to evolve easily.

The latter has two advantages. Firstly, it reduces the computational overhead. Secondly, and perhaps more importantly, it also constrains the basis functions. The constraint forces them to be more meaningful than if we trained the features arbitrarily. We have deliberately chosen four exemplar approaches that cover a very wide range of ideas for anomaly detection, as shown in Table 1, and in doing so we aim to illustrate the general improvement that our approach obtains in the following section.

3. BasisEvolution Framework

Techniques in Table 1 derive a basis that preserves characteristics in the time, frequency, or spatial domain. The underlying assumption is that

Technique        | Candidate solution space                                               | No. | Basis derivation                      | Basis size
Fourier          | Sinusoidal basis                                                       | n   | Predetermined low-frequency functions | Large
Wavelet [20]     | Over-complete wavelet functions                                        | kn  | Soft thresholding (denoising)         | Large
BasisDetect [21] | Over-complete dictionary of sinusoids, deltas and/or wavelet functions | kn  | Improved OMP                          | Moderate
PCA [14]         | Arbitrary                                                              | ∞   | Eigenanalysis                         | Small

Table 1: Description of anomaly detection in terms of candidate basis functions and techniques of finding a representation for a given data set. Here, the candidate solution space describes the set of functions from which we can construct a typical traffic signal, and No. is the number of functions in this space for a signal of length n. The basis derivation refers to the technique used to refine this to a basis for the typical traffic. The basis size is the number of functions then required to accurately represent the historical traffic.

we know little about the input data, and must discover its characteristics in these domains. However, we know that in the frequency domain, typical traffic has dominant cycles with 24-hour and 1-week periods [29]. Moreover, it has less variation over short periods than these dominant cycles. We do not need Fourier analysis, or the like, to tell us this. While techniques such as PCA may appear to be sensitive to these cycles, PCA is actually quite agnostic to the period, since it only responds to the strength of variation across the period, a fact which can be seen if we permute the traffic in time (PCA will produce identical, albeit permuted, results).
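The permutation-agnosticism of PCA is easy to verify numerically; the small synthetic traffic matrix below is our own illustration, not the paper's experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(24 * 14)                          # two weeks of hourly samples
X = np.stack([
    1.0 * np.sin(2 * np.pi * t / 24)  + rng.normal(0, 0.1, t.size),
    2.0 * np.cos(2 * np.pi * t / 24)  + rng.normal(0, 0.1, t.size),
    3.0 * np.sin(2 * np.pi * t / 168) + rng.normal(0, 0.1, t.size),
], axis=1)                                      # three links with distinct cycles
Xc = X - X.mean(axis=0)

Xp = Xc[rng.permutation(t.size)]                # shuffle the time axis

_, s1, V1 = np.linalg.svd(Xc, full_matrices=False)
_, s2, V2 = np.linalg.svd(Xp, full_matrices=False)

print(np.allclose(s1, s2))                              # True: identical variance spectrum
print(np.allclose(np.abs(V1), np.abs(V2), atol=1e-6))   # True: same components (up to sign)
```

Shuffling the time axis destroys all periodicity, yet PCA recovers exactly the same components and variances, confirming that it responds only to the strength of variation, not to the period.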

Roughly speaking, the idea of BasisEvolution is to allow great flexibility in discovering basis functions, so as to be able to find a very small set of them, but to use cyclicity and other traffic properties to restrict our search and make it practical. We do this through two approaches:

1. Initializing the basis from the typical traffic.

Given a small set of typical (cleaned) data, we apply Singular Value Decomposition (SVD) [30] within each cycle to find a set of functions (referred to as a function family) preserving the largest energy. After finishing all cycles, the function families are gathered together to form a final basis. Through initialization, the functions in the basis are all “normal patterns”, and are easier to understand, since each element in the basis matches a specific cycle. Besides, the size of the basis is usually very small.

Small anomalies can be hidden by a larger anomaly; that is, anomalies occur at different scales. The above initialization finds basis functions of different scales (periods). This multi-scale representation of the traffic leads to better detection performance, with more anomalies of different scales being found.

2. Evolving the old basis for the new traffic.

Approaches from Table 1 adapt to new data either by using a fixed basis, which may enlarge the approximation error, or, in the case of PCA-like approaches, by deriving the basis again for the new data at extra computational cost.

BasisEvolution addresses this issue by adapting the old basis into a new one. More specifically, once the basis has been initialized, for any new (dirty) traffic we can get the corresponding new basis through evolution. The new basis minimizes the approximation error on the new traffic while remaining similar to the old basis. As time goes on, the continuous evolution of the basis can better represent the statistical changes of the “normal patterns” in the new traffic. Compared with the basis derivation process, the evolution approach can effectively reduce the computational cost, especially for large-scale data.
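A minimal sketch of the per-cycle SVD initialization: the folding of the series into fixed-length cycles and the energy readout below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def cycle_basis(x, period, n_funcs=1):
    """Fold a traffic series into rows of one `period` each and take the
    top singular vectors as that cycle's "function family"."""
    n = (len(x) // period) * period
    M = x[:n].reshape(-1, period)            # one cycle per row
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    energy = s**2 / np.sum(s**2)             # fraction of energy per function
    return Vt[:n_funcs], energy[:n_funcs]

rng = np.random.default_rng(1)
t = np.arange(24 * 28)                       # four weeks, hourly
x = 10 + 5*np.sin(2*np.pi*t/24) + rng.normal(0, 0.5, t.size)

daily, energy = cycle_basis(x, period=24)
print(daily.shape)        # (1, 24): one daily-scale basis function
print(energy[0] > 0.9)    # True: the shared diurnal profile dominates the energy
```

Running the same routine with `period=168` would yield the weekly-scale family; gathering the families across all periods gives the initial basis.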

To define the algorithm formally, we start by noting that we sample traffic at times t_j, separated by intervals ∆t. The variable x_{t_j} refers to the total traffic during the interval [t_j, t_{j+1}). We group these into blocks of b measurements, such that we obtain a new process X = {x0, . . . , xn}, where x_m = [x_{t_j}, ..., x_{t_{j+b−1}}]^T with j = bm, and the time intervals in this series are ∆T = b∆t. In BasisEvolution, we use this data segment series as the input for the whole framework, where ∆T is usually several cycles long, e.g., it might be several weeks.

In the framework, a small set of clean data is required for the initialization

230

of basis. However, the collected traffic measurements are contaminated with

231

anomalies, and would cost large amounts of resources to manually separate the

232

typical part. We then design a data cleaning process on a data block as an

233

alternative. Usually, we choose the first collected data block x0 for cleaning.

CR IP T

229

234

With the cleaned data segment x00 , the framework derives a basis for ini-

235

tialization. We name this procedure as the basis generation process and the

236

initialized basis as H0 .

For any new traffic (e.g., x1 ), we updating the old basis (e.g., H0 ) into a

238

new one. This basis evolution process keeps repeating as long as we have new

239

traffic. However, if the approximation error of the evolved basis for new traffic

240

is outside a given bound, the BasisEvolution will clean the data and initialize

241

the basis instead of evolving the old one.
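A toy version of such an evolution step, nudging the old basis toward the new data while penalising departure from it. The penalty weight, step size and gradient scheme here are our own illustrative choices, not the paper's update rule:

```python
import numpy as np

def evolve(H_old, x, lam=0.5, lr=0.01, steps=200):
    """Gradient steps on  0.5*||x - H w||^2 + 0.5*lam*||H - H_old||^2,
    re-fitting the weights w at each step (illustrative sketch)."""
    H = H_old.copy()
    for _ in range(steps):
        w, *_ = np.linalg.lstsq(H, x, rcond=None)        # best weights for current H
        grad = -np.outer(x - H @ w, w) + lam * (H - H_old)
        H -= lr * grad
    return H

t = np.arange(168)
H_old = np.sin(2*np.pi*t/24)[:, None]      # old basis: one daily function
x = np.sin(2*np.pi*(t - 1)/24)             # new traffic: daily peak shifted by an hour

H_new = evolve(H_old, x)
old_err = np.linalg.norm(x - H_old @ np.linalg.lstsq(H_old, x, rcond=None)[0])
new_err = np.linalg.norm(x - H_new @ np.linalg.lstsq(H_new, x, rcond=None)[0])
print(new_err < old_err)    # True: the evolved basis fits the shifted traffic better
```

The penalty term is what keeps the evolved basis similar to the old one, so its functions remain interpretable “normal patterns” rather than arbitrary refits.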


The framework then performs anomaly detection on the approximation error (e.g., r1). We identify the anomalous points by thresholding, and cluster them together, with each cluster being referred to as an anomaly.
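The two stages, thresholding the residual and then grouping nearby anomalous points into events, can be sketched as follows; the 3σ threshold and the two-bin gap rule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
r = rng.normal(0, 1.0, 500)           # residual series r = x - WH for typical traffic
r[100:103] += 15.0                    # one three-point anomaly event
r[350] += 15.0                        # one isolated anomalous point

points = np.flatnonzero(np.abs(r) > 3 * r.std())   # anomalous-point detection

events, cur = [], [points[0]]         # cluster nearby points into events
for p in points[1:]:
    if p - cur[-1] <= 2:              # points within 2 bins join the same event
        cur.append(p)
    else:
        events.append(cur)
        cur = [p]
events.append(cur)
print(len(events))                    # 2 anomaly events
```

Reporting two events rather than four raw points is what makes the clustered output comparable to the duration-aware metrics discussed in the introduction.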


The complete algorithm is illustrated in Figure 3, in which BasisEvolution is composed of four components: data cleaning, basis generation, basis update and anomaly detection (anomalous-point detection and anomaly clustering). We describe each in detail below.

249

3.1. Data cleaning


As for the whole BasisEvolution framework, we first need a small set of “normal” historical data, so that the basis generation process can access the “normal” patterns and generate the primal basis functions. However, the data we collect in practice are contaminated by anomalies.

we collected are contaminated by anomalies in practice.

254

One common method for this issue is to label the data based on the experi-

255

ences and professional advices. This job is time and effort-consuming, especially

256

when the size of data grows large, which results in a difficult implementation in

257

reality. 11

ACCEPTED MANUSCRIPT

Data cleaning x0'

T1 Time index

H1

Anomalous point detection

. . .

r1=x1-W1H1

Basis update Hm . rm=xm-WmHm

xm

. . .

Tn

r0=x0-W0H0

H0 Basis update

x1

. . .

Tm

Anomalous point detection

Basis generation

. .

Basis update

xn

. . .

Anomalous point detection . . .

Anomaly clustering A0

P0

Anomaly clustering

CR IP T

x0

Anomalous point detection

P1

rn=xn-WnHn

. . .

A1

Anomaly clustering Am Pm . . .

AN US

T0

Pn

Anomaly clustering

An

M

Figure 3: A bird’s eye view of BasisEvolution framework for one single link. At time index T0 , we clean the data segment x0 , and get the “anomaly free” data segment x00 . Then we derive a basis set H0 from x00 . Define r0 to be the approximation error of x00 reconstructed by H0 . Since x00 is totally normal, no anomaly is expected to be found in r0 , we then utilize x0 instead of x00 for anomaly detection. The projection of x0 on H0 is W0 , and r0 is the difference between x0 and approximation series W0 H0 . Through anomaly detection process, we get anomalous points P0 , and the clustered anomalies A0 . At subsequent time indices (e.g., Tm ), we evolve the old basis H(m−1) into Hm , calculate the reconstruction error rm , and repeat anomaly detection process to find anomalous points and anomalies.

Since the dataset we used has not been labeled yet, we apply standard, existing anomaly detectors as alternative approaches to auto-label the target dataset. However, several issues remain. We address them as listed below:

1. Cleaning the data with multiple iterations.

Some small anomalies may be hidden by a larger anomaly. Examples can be found in Figure 4, in which the simulated time series includes three types of anomalies: a spike (at time t1), a contextual anomaly (at time t2) and a collective anomaly (at time interval ∆t3) [23]. If we measure the normal region by thresholding (such as [−2, 2] in Figure 4), only the spike anomaly can be found, while the other two anomalies remain hidden.

Thus, successive layers of cleaning may be required, as shown in Figure 5, in which the anomaly detection techniques are repeated many times. In each iteration, the detected points are removed and filled with “normal values” after detection. Then, in the next iteration, the modified traffic data

[Figure 4 plot: a simulated time series with a spike at t1, a contextual anomaly at t2, a collective anomaly over the interval ∆t3, and a normal reference point t4.]

Figure 4: Different anomaly types in the simulated time series. The traffic point at time t1 is a spike. The context of the point at t2 differs from that at t4, even though the value at t2 is similar to that at t4, so it is a contextual anomaly. Any single point in the time interval ∆t3 is normal; however, all the points together form a collective anomaly [23]. For most anomaly detection methods, e.g., threshold-based techniques, a spike is easier to find than the rest, which indicates that the other anomalies can hide behind a larger anomaly.

are more “normal”, which results in more hidden anomalies being found. We repeat the whole process until no more anomalies are found or the iteration count exceeds the preset threshold (we set maxC = 5 according to Sec 4.3.3).

In each iteration, we remove all detected anomalous points to eliminate their impact on the following iterations, but this leaves gaps that must be filled for most algorithms. We fill the gaps using a weighted average of periodically-adjusted linear interpolates and the predictions provided by NNBP. Appropriate weights were chosen by comparing the interpolated data with non-missing points at equivalent times of day and week in other segments of the data.

2. Applying a combination detector to find anomalies.

For each iteration in Figure 5, we test the performance of three detection approaches: PCA, ANN, and their combination. The results indicate that no single anomaly detector is perfect, and the combination detector works better. More details can be seen in Sec 4.3.2. So we combine the detection results of several anomaly detection algorithms in each iteration.

[Figure 5 flowchart: anomaly detection algorithms are applied to x0; while anomalies are found and the iteration count does not exceed maxC, they are removed and the gaps interpolated; otherwise the cleaned segment x0' is output.]

Figure 5: The data cleaning algorithm. In practice, we apply the combination detector (PCA+NNBP, see Sec 4.3.2) on x0 to find anomalies. If anomalies are detected, we remove them, replace the gaps using interpolation and prediction, and then repeat detection until no more anomalies are found or the iteration count exceeds the preset maxC (maxC = 5, according to Sec 4.3.3).

Imagine that the detected anomalies of the dth detection method are A_d, for d ∈ {1, 2, ..., D}; then we choose our anomalies to be the union ∪_{d=1}^{D} A_d, the goal being to ensure that we have the largest possible intersection with the true set of anomalies. In this sense, we collect as many types of anomalies as possible, so that the cleaning process is less sensitive to the polluted data and can revise the whole traffic more effectively.
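The union rule itself is a one-liner; a minimal sketch, with the individual detector outputs invented for illustration:

```python
def combine(detections):
    """detections: list of sets of anomalous time indices, one per detector.
    Returns their union, as in the rule A = union of A_d over d = 1..D."""
    combined = set()
    for d in detections:
        combined |= d
    return combined

a1 = {3, 17}        # e.g., points flagged by a PCA-style detector
a2 = {17, 42}       # e.g., points flagged by a neural-network detector
print(sorted(combine([a1, a2])))   # [3, 17, 42]
```

Taking the union deliberately over-flags: a point missed by one detector is still removed if any other detector catches it, which is the safe direction for a cleaning step.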

Besides, the mechanism of the combination detector is similar to the

297

pooling strategy for information retrieval [31]. With enough systems and

298

a sufficiently deep pool, each system contributes a number of documents

299

into the pool for assessment, and the whole pooling can provide more

PT

ED

296

reliable and relevant documents for each query. As for our combination

300

detector, points in the pool are more likely to be real anomalies in manual

CE

301

302

judgement, since the pool is the collection of detection results of each single

303

method. Meanwhile, the rest part of the traffic is regarded as “typical”.

We do not have ground-truth data for real network conditions, so it is hard to measure the performance of the data cleaning process directly. However, we can assess it through the performance of the overall algorithm (see Sec. 4.3.1).

3.2. Basis generation

Our goal here is to find a small set of representative functions to use as a basis for our approximation of the space of typical traffic. The basis generation problem can be expressed as the computation of a basis set that minimises the reconstruction error:
$$\operatorname*{argmin}_{L,\,w,\,h^{(l)}} \left\| x - \sum_{l=1}^{L} w^{(l)} h^{(l)} \right\|, \qquad (1)$$
where $x$ is the input signal, $L$ is the number of basis vectors, $w$ is a vector of weights, and the $h^{(l)}$ are the basis vectors.

The problem is too open, however, in that we have complete choice of basis, and therefore could achieve a trivial perfect reconstruction with $h^{(1)} = x$, $w^{(1)} = 1$, and $L = 1$.

We know that traffic exhibits approximate periodicity (with known periods), and so we insist that the basis vectors are periodic. We list the periods $c_i$, $i = 1, 2, \ldots, I$, in increasing order in terms of the number of measurements. In principle, each $c_i$ should represent an integral number of cycles of the network traffic. Here we use only daily and weekly patterns, for simplicity. For instance, if there were 24 measurements per day, the typical 24-hour and one-week cycles would be represented by $c_1 = 24$ and $c_2 = 168$. We select the third and last element to span the whole segment, $c_I = b$, covering the largest number of whole weeks that the data segment includes (cf. the cycle set in Algorithm 1).

We denote by $h^{(c_i,j)}$, $j = 1, 2, \ldots, J_i$, the $j$th basis vector with period $c_i$, i.e., for each element
$$h_k^{(c_i,j)} = h_{k + a c_i}^{(c_i,j)},$$
for all valid $k$, $a \in \mathbb{Z}^+$. We write
$$H^{(c_i)} = \left[ h^{(c_i,1)} \,\middle|\, h^{(c_i,2)} \,\middle|\, \cdots \,\middle|\, h^{(c_i,J_i)} \right],$$
and the basis $H = [H^{(c_1)} | \cdots | H^{(c_I)}]$.

Then the reconstruction problem is defined by a new objective function:
$$\operatorname*{argmin}_{L,\,w,\,h^{(c_i,j)}} \left\| x - \sum_{i=1}^{I} \sum_{j=1}^{J_i} w^{(c_i,j)} h^{(c_i,j)} \right\|, \qquad (2)$$
where the size of the basis is $L = \sum_{i=1}^{I} J_i$.

As in OMP, we progressively refine the traffic measurement approximation with an iterative procedure across all cycles that pursues a good solution. Our algorithm does so starting with the shortest period, and progressing through the periods in order. At each phase, we fix all of the variables except those of the period we are working on, with the higher-period functions implicitly set to zero. The residual in phase $n$ is thus defined by $r^{(1)} = x$ and, for $n > 1$,
$$r^{(n)} = x - \sum_{i=1}^{n-1} \sum_{j=1}^{J_i} w^{(c_i,j)} h^{(c_i,j)},$$
and we optimise the $h^{(c_n,j)}$ and their respective weights by solving
$$\operatorname*{argmin}_{J_n,\,w^{(c_n,j)},\,h^{(c_n,j)}} \left\| r^{(n)} - \sum_{j=1}^{J_n} w^{(c_n,j)} h^{(c_n,j)} \right\|. \qquad (3)$$

One tool for solving Equation (3) is the SVD. The first step in applying the SVD is to transform the residual signal $r^{(n)}$ into a matrix, by taking the new matrix's rows to be sliding windows (with width of one period $c_n$) of the signal, i.e.,
$$R^{(n)} = \begin{pmatrix}
r^{(n)}_{t_0} & r^{(n)}_{t_1} & \cdots & r^{(n)}_{t_{c_n-1}} \\
r^{(n)}_{t_1} & r^{(n)}_{t_2} & \cdots & r^{(n)}_{t_{c_n}} \\
\vdots & \vdots & \ddots & \vdots \\
r^{(n)}_{t_{b-c_n}} & r^{(n)}_{t_{b-c_n+1}} & \cdots & r^{(n)}_{t_{b-1}}
\end{pmatrix}. \qquad (4)$$
The new matrix is $(b - c_n + 1) \times c_n$. In all scenarios considered here, $b - c_n + 1 > c_n$.

We then apply the SVD to $R^{(n)}$ to obtain the decomposition
$$R^{(n)} = U \Sigma V^{T}, \qquad (5)$$
where $U$ and $V$ are unitary matrices, and $\Sigma$ is a diagonal matrix containing the singular values $\sigma_j$ of $R^{(n)}$ for the cycle $c_n$, arranged in decreasing order.

To understand the SVD's use in matrix approximations, consider the following interpretation. The matrix $\Sigma$ is diagonal, so the SVD of $R^{(n)}$ can be rewritten as
$$R^{(n)} = \sum_{j=1}^{\ell} u_j \sigma_j v_j^{T}, \qquad (6)$$
where $u_j$ and $v_j$ are the $j$th columns of $U$ and $V$ respectively, and $\ell = \min\{c_n,\, b - c_n + 1\}$.

We then create an approximation $\widetilde{R}$ by retaining only the terms corresponding to the largest $J_n$ singular values:
$$\widetilde{R} = \sum_{j=1}^{J_n} u_j \sigma_j v_j^{T}. \qquad (7)$$
The energy proportion of $u_j$ is
$$E_j = \frac{\sigma_j^2}{\sum_{j=1}^{\ell} \sigma_j^2}, \qquad (8)$$
and we choose $J_n$ so that the retained components are those whose energy proportion $E_j$ remains above the preset threshold $\gamma = 0.3$.

Since the first $J_n$ columns of $U$ carry the largest energy of the signal, we directly associate $h^{(c_n,j)} = u_j$, and obtain the corresponding weight
$$w^{(c_n,j)} = \frac{h^{(c_n,j)\top}\, r^{(n)}}{\left\| r^{(n)} \right\|}. \qquad (9)$$
Then we construct a new residual for the next period, and solve the new problem as above.

We call this algorithm SVD on Specific Cycles (SVD_SC); it is described in detail in Algorithm 1. Note that in Step 6, and also in subsequent algorithms, we use Matlab's notation for vector concatenation. The entire algorithm is applied to the first cleaned signal, i.e., $x = x_0'$. Moreover, when the traffic patterns of a data segment vary too much, we generate a new basis through the basis generation process instead of updating the old one.
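The sliding-window SVD step at the heart of SVD_SC can be sketched as follows. This is a simplified illustration, not the authors' Matlab code; the synthetic traffic series and the daily period are assumptions of the example.

```python
import numpy as np

def svd_on_cycle(r, c, gamma=0.3):
    """One phase of the SVD_SC idea (a sketch): slide windows of one
    period c over the residual r, apply the SVD, and keep the leading
    left-singular vectors whose energy proportion exceeds gamma."""
    b = len(r)
    # (b - c + 1) x c matrix of sliding windows, as in Equation (4)
    R = np.array([r[k:k + c] for k in range(b - c + 1)])
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    energy = s**2 / np.sum(s**2)               # E_j of Equation (8)
    J = max(1, int(np.sum(energy >= gamma)))   # number of retained vectors
    return U[:, :J], s[:J], Vt[:J]

# A noisy daily cycle: 14 "days" of 24 hourly samples.
t = np.arange(14 * 24)
x = np.sin(2 * np.pi * t / 24) + 0.1 * np.random.default_rng(0).normal(size=t.size)
U, s, Vt = svd_on_cycle(x, c=24)
print(U.shape)  # rows = b - c + 1 sliding windows; columns = retained vectors
```

For strongly periodic traffic, only one or two singular components survive the energy threshold, which is exactly why a short basis suffices for "typical" traffic.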

The entire Algorithm 1 is based on the SVD, and the computation time is $O(I \times c_n^2 \times (b - c_n + 1))$, where $I$ is the number of levels of cyclicity ($I = 3$ for a two-week data segment). According to Sec. 4.3.1, the data cleaning process determines the performance of the derived basis, so we should clean the data segment before generating a new basis. For a two-week data segment, generating a new basis needs about 6.008 s on average, of which the data cleaning process takes about 6 s (the NNBP algorithm in the data cleaning process takes a large amount of time) and the basis generation process itself needs about 0.008 s. The experiment was repeated 100 times in Matlab R2012b on Mac OS X 10.9.5 with a 2.7-GHz Intel Core i5.

3.3. Basis update

For a new data segment $x_m$ we evolve the old basis $H_{(m-1)}$ into the new one $H_m$. By evolving it we avoid additional computational cost; moreover, we want the basis to change only slowly, reflecting slow changes in traffic patterns, thus avoiding the impact of new anomalies in the data, and also avoiding the need to clean this new data.

We formulate the basis update problem by retaining the same number of basis vectors $L$, and then considering each in turn, keeping the others fixed. That is, for $k = 1, \ldots, L$ we solve the problems
$$\operatorname*{argmin}_{\lambda_d,\,w_m^{(k)},\,h_m^{(k)}} \left\| x_m - \sum_{l \neq k} w_m^{(l)} h_m^{(l)} - w_m^{(k)} h_m^{(k)} \right\| + \lambda_d\, d\!\left( h_m^{(k)} \,\middle|\, H_{(m-1)}^{(k)} \right), \qquad (10)$$

Algorithm 1 SVD on specific cycles (SVD_SC).
Input: The traffic segment, x; the traffic aggregation time, Δt days; the size of x, b; the threshold of basis energy proportion, γ.
Output: The basis H and weights W; the final residual r.
1: Initialize the cycle set c ← {1/Δt, 7/Δt, b};
2: Set r^(1) ← x;
3: for n = 1, …, I do
4:   // Create R^(n) from r^(n) according to Equation (4)
5:   for row = 1, 2, …, b − c_n + 1 do
6:     Slide windows of length c_n over r^(n) to form R^(n) row by row:
       R^(n) ← [R^(n); [r^(n)_{t_{row−1}}, …, r^(n)_{t_{row+c_n−2}}]];
7:   end for
8:   Apply the SVD to R^(n), i.e., find U, Σ and V such that R^(n) = U Σ V^T;
9:   Set j ← 1;
10:  while the energy rate E_j ≥ γ do
11:    j ← j + 1;
12:  end while
13:  Set J_n ← j;
14:  The first J_n columns of U constitute the function family H^(c_n) ← {u_1, …, u_{J_n}};
15:  Calculate the corresponding weights according to Equation (9): w^(c_n) ← (w^(c_n,1), …, w^(c_n,J_n))^T;
16:  r^(n+1) ← r^(n) − H^(c_n) w^(c_n);
17: end for
18: r ← r^(I+1);

where $h_m^{(k)}$ is the new basis vector we are seeking, $H_{(m-1)}^{(k)}$ represents the history of basis vectors $\{h_1^{(k)}, \ldots, h_{(m-1)}^{(k)}\}$, $d$ is a distance metric from the historical value to the present value, and the parameter $\lambda_d$ allows a trade-off between a precise fit to the measured data and the goal of remaining similar to the old basis.

The first term in Equation (10) is the representation error $e_k = x_m - \sum_{l \neq k} w_m^{(l)} h_m^{(l)}$, which indicates the error when the $k$th basis vector is removed (and the other terms are kept fixed).

The second term penalizes changes in the new basis. One approach to measuring this change is the deviation from the historical mean. In addition, the new basis should also be similar to the immediately previous basis, since it evolves directly from it. In practice, we therefore define the second-term distance to include two parts: the similarity with the immediately previous basis and the deviation from the historical mean. The mathematical form of $d$ is given in Equation (11), where the first term measures the similarity with the previous basis, the second term evaluates the deviation from the historical mean, and the parameters $\lambda_1$ and $\lambda_2$ are weighting coefficients that trade off similarity against deviation:
$$d\!\left( h_m^{(k)} \,\middle|\, H_{(m-1)}^{(k)} \right) = \lambda_1 \left\| h_m^{(k)} - h_{(m-1)}^{(k)} \right\| + \lambda_2 \left\| h_m^{(k)} - \bar{h}_{(m-1)}^{(k)} \right\|, \qquad (11)$$
where
$$\bar{h}_{(m-1)}^{(k)} = \frac{1}{m} \sum_{i=0}^{m-1} h_i^{(k)}.$$

The $j$th element of $h_m^{(k)}$ is also constrained, to maintain smoothness, by enforcing
$$\min_i \left( h_i^{(k)}(j) \right) \le h_m^{(k)}(j) \le \max_i \left( h_i^{(k)}(j) \right), \qquad (12)$$
where $i \in [0, m-1]$.

Define $H_{(m-1)}^{\min}$ and $H_{(m-1)}^{\max}$ as the collections of historical minima and maxima of the basis vectors, respectively. Hence,
$$H_{(m-1)}^{\min} = \{\operatorname{minh}_{(m-1)}^{(k)}\}_{k \in [1,L]}; \qquad H_{(m-1)}^{\max} = \{\operatorname{maxh}_{(m-1)}^{(k)}\}_{k \in [1,L]},$$
where
$$\operatorname{minh}_{(m-1)}^{(k)} = \left\{ \min_{i \in [0,m-1]} \left( h_i^{(k)}(j) \right) \right\}_{j \in [1,b]} \quad \text{and} \quad \operatorname{maxh}_{(m-1)}^{(k)} = \left\{ \max_{i \in [0,m-1]} \left( h_i^{(k)}(j) \right) \right\}_{j \in [1,b]}.$$

patterns, but is unresponsive to anomalies, which are inherently rare and erratic

374

(or else they are not anomalous). (k)

(k)

We derive hm and wm through alternating the fixed variables iteratively [32, 33]. We transform the final residual ek to a matrix form E whose rows (k)

M

have the same length as hm , i.e., c. The data segments cover traffic for several cycles, so the new matrix E has K = b/c rows, and we can write it

ED



ek (1)

PT

  ek (c + 1)  E = ..   .  ek ((K − 1)c + 1)

··· ··· .. . ···

ek (c)



  ek (2c)   ..  .   ek (Kc)

(13)

(k)

(k) w = wm y,

AC

CE

For each basis vector hm , we initialize a vector

375

where y is an IID standard normal variate of length K. We then iteratively

21

ACCEPTED MANUSCRIPT

376

apply the following rules (k)

(k)

← h(k) m ◦

wT E + λd (λ1 h(m−1) + λ2 h(m−1) ) (k)

(k)

(k)

w

← w◦

(k)

wT whm + λd (λ1 hm + λ2 hm ) Ehm

(k)

(k)

whm (hm )T

,

,

(14)

CR IP T

h(k) m

(15)

where ◦ and division are element-wise (Hadamard) multiplication and division,

respectively. After the algorithm has converged, we take the average of w as the weight

κ

377

1X wi , κ i=1

AN US

(k) wm ←

(16)

where wi ∈ w. Algorithm 2 shows more detail.

Finally, if the residual error r cannot be reduced to a preset bound (e.g.,

379

the average approximation error using the previous basis on new data) after a

380

certain number of iterations, it means the update algorithm can not converge.

381

This non-convergence indicates the behaviour patterns of the new data segment

382

varies greatly from the past behaviour, and we should apply the data cleaning

383

and basis generation processes to produce a totally new basis set instead of

384

updating the old basis set.

ED

M

378

The time complexity for basis update algorithm is O(L × c × β × ζ). For any

386

two-week data segment, L is small (generally 3, corresponding to three different

387

levels of cyclicity), c records the length of the basis (no larger than two weeks).

388

Meanwhile, we set β and ζ at small or moderate levels (in later experiments, β

PT

385

and ζ are equal to 10 for simplicity). Hence, the capacity of time cost is not

390

large. In practice, the basis update process takes about 0.6 s to evolve the old

391

basis. The experimental environment is the same: the simulation code ran 100

AC

CE 389

392

times on Matlab R2012b on a Mac OS X 10.9.5 with a 2.7-GHz Intel Core i5.

393

This indicates that updating the old basis takes less time than generating a new

394

one, which is more suitable for online anomaly detection.

22

ACCEPTED MANUSCRIPT

ED

M

AN US

CR IP T

Algorithm 2 Basis update algorithm. Input: A new traffic segment, xm ; The size of xm , b; The historical basis set, H(m−1) ; min The historical minimum, H(m−1) ; max The historical maximum, H(m−1) ; The historical mean, H(m−1) ; The old weight set, W(m−1) ; The threshold of residuals, α; The maximum number of iterations, β and ζ. Output: The new basis Hm , and weight set Wm ; The final residual r. 1: Hm ← H(m−1) , Wm ← W(m−1) ; 2: Set the number of iterations n1 ← 0 and n2 ← 0; 3: Set the residual e1 ← xm ; 4: while n1 ≤ β and r > α do 5: for k = 1,2,. . . ,L do P (l) (l) 6: Compute the residuals ek ← xm − l6=k wm hm ; 7: while n2 ≤ ζ do 8: Calculate the residual matrix E according to Equation (13) (k) 9: Update hm according to rules (14); (k) 10: c ← the length of hm ; 11: for j = 1,2,...,c do (k) (k) 12: if hm (j) < minh(m−1) (j) then (k)

(k)

hm (j) ← minh(m−1) (j);

13:

(k)

(k)

14:

else if hm (j) > maxh(m−1) (j) then

15:

hm (j) ← maxh(m−1) (j); end if end for (k) Update wm according to rules (15) and (16); n2 ← n2 + 1; end while (k) (k) Hm ← Hm ∪ hm , Wm ← Wm ∪ wm ; end for Compute the approximation error

PT

(k)

CE

16: 17: 18: 19: 20:

(k)

AC

21: 22: 23:

r ← xm − and

n1 ← n1 + 1; end while

k (k) wm hm ;

k=1

b

r← 24: 25:

L X

1X r(i); b i=1

23

ACCEPTED MANUSCRIPT

395

3.4. Anomaly detection

Many researchers stop their work after detecting anomalous points by thresholding the residual. However, many detected points may be associated with the same underlying anomaly. It is confusing to network operators to present a single event as many detections, and there are also distinct advantages in combining individual detections before deciding which anomalies are most important: for instance, an anomaly that develops slowly but persists for a long time may be more important than a sharp but singular event.

In this paper, we associate detections so that we can present a single anomalous event, along with information about the duration of the event. We do so through the clustering process described below.

3.4.1. Anomalous-point detection

We detect anomalous points by thresholding. For any approximation residual error $r$, we set the upper and lower limits $[p - q\sigma,\, p + q\sigma]$, where $p$ and $\sigma$ are the mean and standard deviation of $r$, and $q$ is a parameter chosen to fix the false-alarm probability (assuming a normal distribution for the residual). Any points of $r$ that fall outside these thresholds are declared to be anomalous.
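This thresholding rule is straightforward to sketch; the residual series below is simulated, not taken from the paper's experiments.

```python
import numpy as np

def threshold_anomalies(r, q=3.0):
    """Flag residual points outside [p - q*sigma, p + q*sigma], where p and
    sigma are the mean and standard deviation of the residual r."""
    p, sigma = np.mean(r), np.std(r)
    return np.flatnonzero(np.abs(r - p) > q * sigma)

rng = np.random.default_rng(0)
r = rng.normal(0.0, 1.0, size=500)
r[100] += 10.0   # inject one obvious anomalous residual
print(100 in threshold_anomalies(r))  # True
```

Larger `q` trades fewer false alarms for a lower detection probability, which is exactly the knob swept to produce the ROC curves in Sec. 4.1.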

3.4.2. Anomaly clustering

Once we have anomalous data points, we use a bottom-up hierarchical clustering algorithm to group them, as shown in Algorithm 3. As in Algorithm 1, we use Matlab notation to represent the concatenation of the time series $t^{(m)}$, as shown in Steps 12, 18 and 25.

For most anomaly types, the times of anomalous points caused by the same event are close to each other; that is, they occur as short-duration anomalies. In clustering, we measure the distance between clusters by the absolute time gap between their centres. Anomalous points with smaller time gaps therefore tend to be grouped into one large cluster, so this clustering method is well suited to finding short-duration anomalies.
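A minimal sketch of the time-gap grouping idea follows; it is a single-pass simplification of Algorithm 3 (which merges iteratively), with an invented gap threshold.

```python
def cluster_by_time_gap(times, iota):
    """Group sorted anomalous time points whose gap to the previous point
    is at most iota (a simplified sketch of the bottom-up clustering)."""
    clusters = []
    for t in sorted(times):
        if clusters and t - clusters[-1][-1] <= iota:
            clusters[-1].append(t)   # close in time: same anomaly event
        else:
            clusters.append([t])     # large gap: start a new event
    centers = [sum(c) / len(c) for c in clusters]
    return clusters, centers

clusters, centers = cluster_by_time_gap([3, 4, 5, 40, 41, 90], iota=2)
print(clusters)   # [[3, 4, 5], [40, 41], [90]]
```

Each cluster corresponds to one reported anomalous event, and its extent gives the duration information passed to the operator.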

Since we have not included any traffic-specific properties in the clustering, this clustering algorithm is equally applicable to other detectors, given the time locations of their anomalous points, and we shall test it as part of the alternative algorithms below.

However, anomalies of the FISSD type often last a long time and do not have the short-duration character described above; in this situation, Algorithm 3 is no longer suitable. This long-duration clustering problem is left for future research. One promising idea is to incorporate multiple time scales into the clustering, inspired by work on extracting multiscale features for anomaly detection [15, 16, 26, 28].

4. Experiments with Synthetic Data

In this section, we apply several anomaly detection approaches to synthetic data. Simulation with synthetic data is extremely important for validating anomaly detection algorithms where ground truth is difficult to find [25]. In our case we aim to assess very small false-alarm probabilities, which necessitates a large number of tests. In addition, we want to test the ability of the algorithms to determine the duration of anomalies; as far as we know, this information has not been labelled in any existing dataset.

Our approach to simulation is intended to highlight the features of the different techniques. We make no claim that the simulation is completely realistic, only that it illustrates the properties of the different anomaly detection techniques.

Our analysis uses three pairs of metrics, described in subsection 4.1, intended to explore the different characteristics of the anomaly detection algorithms, and compares our algorithm to four other methods:

1. OMP-on-dictionary: This method, motivated by BasisDetect [21], uses Orthogonal Matching Pursuit (OMP) on a dictionary of discrete cosine time and frequency atoms. Similar to BasisEvolution, this model determines the $L$ atoms, $\Phi_L$, with the largest coefficients $\alpha_L$ on the first data segment $x_0$. Then, for each new data segment $x_m$, we represent it with $\Phi_L$ by solving
$$\operatorname*{argmin}_{\alpha_L'} \left\| x_m - \alpha_L' \Phi_L \right\|, \qquad (17)$$
where $\alpha_L'$ represents the projection magnitudes of $x_m$ on $\Phi_L$. We then obtain the residual series $r_{OMP} = x_m - \alpha_L' \Phi_L$, and detect anomalous points as subsection 3.4 describes.

Algorithm 3 Bottom-up hierarchical clustering based anomaly detection.
Input: Time series of the anomalous traffic measurements, t = [t_1, …, t_n]; a distance threshold, ι; the maximum number of iterations, η.
Output: The set of anomalies (clusters) and the corresponding centre set, V and U.
1: t^(0) ← t;
2: k^(0) ← n;
3: m ← 1;
4: while m ≤ η do
5:   k ← 1;
6:   for i = 1, …, k^(m−1) − 1 do
7:     j ← i + 1;
8:     The distance d_(i,j)^(m−1) ← |t_i^(m−1) − t_j^(m−1)|;
9:     if d_(i,j)^(m−1) ≤ ι then
10:      t_k^(m) ← (t_i^(m−1) + t_j^(m−1))/2;
11:      V^(m) ← V^(m) ∪ {t_i^(m−1), t_j^(m−1)};
12:      t^(m) ← [t^(m), [t_k^(m)]];
13:      k ← k + 1;
14:      i ← i + 1;
15:    else
16:      V^(m) ← V^(m) ∪ {t_i^(m−1)};
17:      t_k^(m) ← t_i^(m−1);
18:      t^(m) ← [t^(m), [t_k^(m)]];
19:      k ← k + 1;
20:    end if
21:  end for
22:  if d_(i,j)^(m−1) > ι then
23:    V^(m) ← V^(m) ∪ {t_j^(m−1)};
24:    t_k^(m) ← t_j^(m−1);
25:    t^(m) ← [t^(m), [t_k^(m)]];
26:    k ← k + 1;
27:  end if
28:  k^(m) ← k;
29:  m ← m + 1;
30: end while
31: V ← V^(m); U ← t^(m);

2. PCA (Principal Component Analysis): This approach finds the $k$ eigenvectors with the largest eigenvalues for the primary data, and reconstructs the data segment $x_m$ with those eigenvectors. The residual series is $r_{PCA} = x_m - \hat{x}_m$, where $\hat{x}_m$ is the reconstructed signal, and $r_{PCA}$ is used for detection as described above.

455

3. RPCA (Robust PCA): This method is the robust version of PCA [12]

456

and uses the median to centre the data in the transformation process.

457

The method we used for RPCA is the Crux-Ruiz (CR) algorithm [12].

458

Otherwise it is similar to PCA.

4. NNBP (Neural Network using Back Propagation): The NNBP approach uses an artificial neural network to predict traffic [5], and we calculate residuals as the difference between the predicted and actual traffic. Our principle in choosing comparison algorithms is to include as wide a range of ideas as possible; hence, we include this approach even though it is the least successful (as demonstrated below).
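All four baselines share the same reconstruct-then-threshold structure; the PCA variant can be sketched as below. This is a simplified illustration with synthetic segments; the choice k = 1 and the injected spike are assumptions of the example.

```python
import numpy as np

def pca_residual(X, x_new, k=1):
    """Sketch of the PCA baseline: fit the top-k principal directions on
    historical segments X (rows = segments), reconstruct a new segment,
    and return the residual used for threshold-based detection."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:k]                           # top-k principal directions
    x_hat = mean + (x_new - mean) @ P.T @ P
    return x_new - x_hat

rng = np.random.default_rng(0)
t = np.arange(48)
amps = 1.0 + 0.2 * rng.normal(size=20)   # segments with varying amplitude
X = np.array([a * np.sin(2 * np.pi * t / 24) + 0.05 * rng.normal(size=t.size)
              for a in amps])
x_new = np.sin(2 * np.pi * t / 24)
x_new[10] += 5.0                          # an injected spike
r = pca_residual(X, x_new)
print(int(np.argmax(np.abs(r))))          # 10: the spike dominates the residual
```

The periodic component is absorbed by the principal directions, so the residual isolates the spike; the thresholding of Sec. 3.4.1 is then applied to `r`.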

In each case the initial set of data is cleaned as described in our approach, and we use the same clustering process for each method, to provide a fair comparison.

We chose these four methods because they cover a wide range of the concepts used in the anomaly detection field, and have been discussed in many references. The particular examples are relatively simple exemplars of the concepts they espouse, so that the results are not complicated by myriad implementation details. For example, we do not attempt to compare against wavelet detection methods, as they require much fine tuning, for instance of the particular wavelet basis being used.

4.1. Detection error metrics

In many works on anomaly detection, the algorithms are assessed by their false-positive and false-negative probabilities, considered point-wise on the original sequence: each point is classified as an anomaly or not, and the classification is compared to the ground truth. In this work, however, we seek to understand performance in more practical terms: we want to detect the underlying cause of a sequence of anomalous points in the data, and we prefer to return a single detection to an operator, reducing the amount of low-level work required before a diagnosis can start. That is, we want to tell the operator about one anomaly rather than a set of anomalous points. To assess the quality of such results we need additional metrics. As with false-negative/positive probabilities, we present these in pairs that show the trade-off between sensitivity and specificity; we introduce two additional pairs of such metrics.

We label the set of true anomalous time points G; they are clustered into groups cG, where each group corresponds to one injected anomaly. The set of detected anomalous points for method r is denoted D^(r), and these are clustered into groups cD^(r) using Algorithm 3.

The metrics we consider are listed below:

1. We still include the standard point-wise metrics, the Detection Probability on Anomalous Points (DP_AP) and the False-Alarm Probability on Non-anomalous Points (FAP_NP):
$$\mathrm{DP\_AP} = \frac{\left| G \cap D^{(r)} \right|}{|G|}, \qquad (18)$$
$$\mathrm{FAP\_NP} = \frac{\left| D^{(r)} \right| - \left| G \cap D^{(r)} \right|}{N - |G|}, \qquad (19)$$
where $N$ is the total number of data points, and $|\cdot|$ denotes the cardinality of a set.

2. Our second pair of metrics aims to assess whether a detected group of anomalies corresponds to a real anomaly group. As long as a detected anomaly group contains real anomalous points, we define it as a truly detected anomaly. We call these metrics the Consistency Degree of Anomaly groups in Amount (CD_AA) and the Aggregation Degree of Anomaly groups in Amount (AD_AA). Given that the number of truly detected anomaly groups is $N_{cD^{(r)}}$, and the number of hits against real anomaly groups is $N_{cG^{(r)}}$, we define
$$\mathrm{CD\_AA} = \frac{N_{cG^{(r)}}}{|cG|}, \qquad (20)$$
$$\mathrm{AD\_AA} = \frac{N_{cD^{(r)}}}{\left| cD^{(r)} \right|}. \qquad (21)$$
CD_AA measures the probability of a real anomaly group being detected, while AD_AA records the fraction of truly detected anomaly groups among all detected groups, under the above definition.

3. The above metrics use a rather simplistic notion of a truly detected anomaly group, in which a group is correct if it covers at least one true anomaly point. Here we aim to quantify the overlap between the true anomalies and their detections; note that one true anomaly group may generate multiple detected groups, and vice versa. To do so, we extend the idea of a purity index [34, 35] to define the Consistency Degree of Anomaly groups in Duration (CD_AD) and the Aggregation Degree of Anomaly groups in Duration (AD_AD), which are defined for each data point $x \in G \cap D^{(r)}$. We define $cG_x$ to be the cluster that $x$ truly belongs to, and $cD_x^{(r)}$ to be the cluster that $x$ belongs to as detected by method r. Then, for each $x$,
$$\mathrm{CD\_AD}(x) = \frac{\left| cG_x \cap cD_x^{(r)} \right|}{|cG_x|}, \qquad (22)$$
$$\mathrm{AD\_AD}(x) = \frac{\left| cG_x \cap cD_x^{(r)} \right|}{\left| cD_x^{(r)} \right|}. \qquad (23)$$
We then average these metrics over $G \cap D^{(r)}$ (the intersection of detected and true anomalous points) to obtain global versions of the metrics, defined by
$$\mathrm{CD\_AD} = \frac{1}{\left| G \cap D^{(r)} \right|} \sum_{x \in G \cap D^{(r)}} \mathrm{CD\_AD}(x); \qquad (24)$$
$$\mathrm{AD\_AD} = \frac{1}{\left| G \cap D^{(r)} \right|} \sum_{x \in G \cap D^{(r)}} \mathrm{AD\_AD}(x). \qquad (25)$$

Thus, over all truly detected anomalies, CD_AD is a weighted-average detection probability, while AD_AD is a weighted-average proportion of real anomalous points within the detected anomalies, taking into account the group structure of the anomalous points.
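The point-wise and amount-based metrics can be computed directly from the anomaly sets; the sets G, D and the groups below are invented toy values, not experimental data.

```python
def point_metrics(G, D, N):
    """DP_AP and FAP_NP of Equations (18)-(19) for true anomaly set G and
    detected set D over N total data points."""
    hit = len(G & D)
    dp_ap = hit / len(G)
    fap_np = (len(D) - hit) / (N - len(G))
    return dp_ap, fap_np

def group_metrics(cG, cD):
    """CD_AA and AD_AA of Equations (20)-(21): a detected group counts as
    truly detected if it contains at least one true anomalous point."""
    truth = set().union(*cG)
    true_groups_hit = sum(1 for g in cG if any(set(g) & set(d) for d in cD))
    detected_true = sum(1 for d in cD if set(d) & truth)
    return true_groups_hit / len(cG), detected_true / len(cD)

G = {10, 11, 12, 50}
D = {11, 12, 60}
print(point_metrics(G, D, N=100))               # (0.5, 1/96)
print(group_metrics([[10, 11, 12], [50]],
                    [[11, 12], [60]]))          # (0.5, 0.5)
```

Sweeping the threshold parameter q and recomputing these pairs produces the ROC-style curves used in the comparisons.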

In order to analyze the performance of all five anomaly detection methods and see the trade-offs over these pairs of metrics, we vary the threshold parameter q from 1.0 to 4.5, and generate Receiver Operating Characteristic (ROC) curves [36], or the equivalent for the new metrics.

4.2. Synthetic traffic data

Our goal here is to generate traffic with some of the important observed features of real traffic. This will, of course, be a simplification of real traffic, but it gives us a great deal more control over the details, letting us see the effects, for instance, of changing the duration of anomalies.

The simulation has two steps: first, we generate the typical traffic; second, we inject anomalies. The typical traffic is based around a sine function with a 24-h period, to represent the strong diurnal cycles of Internet traffic [37]. Noise is added to represent statistical fluctuations around the mean. In this study we use white Gaussian noise: Gaussianity is a reasonable first approximation for Internet traffic on uncongested links [38], and we do not include correlations (though they are often present in traffic) because uncorrelated traffic represents a worst case for detecting anomalies [39]. We also control the relative size of the noise; in what follows, the standard deviation of the noise is 10% of the amplitude of the cycles.

We generate a series containing eight weeks of data aggregated every hour, i.e., 1,344 data points per simulation, broken into four data segments to be analyzed in blocks as described above. We experimented with other data segment lengths, and although the results and general trends were similar, a two-week segment produced the best results of the lengths tested.

The anomaly model in this work can be represented as A(τ, θ, ϑ, ψ, φ), where τ, θ, ϑ, ψ and φ are parameters controlling the generation and injection of anomalies.

The parameter τ indexes a set of anomaly types. Each anomaly type is defined by a shape function with width W, which represents the minimum width for that anomaly type (otherwise, all anomalies could collapse into spikes), and amplitude M, which is defined relative to the size of the typical traffic.

1. FIFD (Fast Increase, Fast Decrease): these are the classical anomalies most often examined in prior work. They might be caused by system or device outages, or DoS attacks. Past literature has often focused on "spikes", which are very short FIFD anomalies.

2. SIFD (Slow Increase, Fast Decrease): these might arise through epidemic processes (e.g., worms or viruses [40]), which grow over time and are eventually fixed, dropping the system back into its standard state.

3. FISSD (Fast Increase, Stable State, Slow Decrease): these arise typically as the result of phenomena such as flash crowds, which appear very suddenly, are maintained for some time, and then gradually disappear.

4. FISD (Fast Increase, Slow Decrease): these represent, for completeness, a special case of the previous anomaly with no stable state.

We also include negative anomalies as well as positive ones, i.e., traffic decreases as well as traffic increases. Each simulation includes only one type of anomaly. Most of the results reported here focus on FIFD anomalies, because the results for the other anomaly types largely mirror those reported.

The parameter θ determines the number of anomalies injected. We draw θ from a Poisson distribution, θ ∼ Poisson(λ), with λ = 3 per data segment, i.e., on average there will be 12 anomalies in each simulation. It is important that there are multiple anomalies per segment, because one of the effects we seek to include in our simulations is the interaction, within the detector, between anomalies.

578

definition, could happen at any time with no preference, so ϑ ∼ U nif orm(0, b),

579

where b is the length of the data segment.

CR IP T

577

580

The parameter ψ changes the width of anomalies. We dilate the anomaly

581

type function by a factor W + ψ. In our experiments we consider a wide range

582

of values ψ = {0, 4, 10, 20, 40}.

The parameter φ updates the magnitude of anomalies: we scale the shape

584

function of the anomaly type by this factor. In our experiments we have con-

585

sidered φ = {0.1, 0.3, 0.5, 0.7, 1} resulting in magnitudes φM.

586

4.3. Investigation on BasisEvolution

AN US

583

Before experiments, we need to have a thorough investigation on the whole

588

BasisEvolution framework. In data cleaning phase, we need to clarify the im-

589

portance of this step, and should make a proper selection on the parameter and

590

detection method (seen in Figure 5). In addition, we have to estimate a set of

591

parameters in basis update process.

ED

M

587

As discussed before, in real network traffic it is difficult to find the ground

593

truth, while the synthetic traffic volumes highlighting features of network traffic

594

are suitable for analysis. Hence, all experiments at this section are on synthetic

595

data.

PT

592

We first demonstrate the importance of the data cleaning process for the entire framework by comparing the performance of BasisEvolution on cleaned and uncleaned data. Then we confirm the maximum number of iterations and the detection algorithm by examining the degree of cleanliness of the revised traffic after the data cleaning process. As for the parameters (shown in Equations (10) and (11)) in the basis update process, we have found rough parameter settings that are not too far from optimal.


4.3.1. The impact of the data cleaning process

Figure 6 compares the detection performance of BasisEvolution under two conditions: with and without the data cleaning process. We see that the framework performs better when the data is cleaned, which means basis vectors derived from cleaned data can describe the "normal region" better than those from uncleaned data. If the data is uncleaned, and contains traces of intrusions, BasisEvolution may not detect future instances of those intrusions, since it will presume that they are normal, which leads to a higher false-alarm probability and a lower detection probability. Hence, the data cleaning process is necessary for the entire framework.

Figure 6: The performance of BasisEvolution on clean and unclean data. The effectiveness is measured with FAP NP (False-Alarm Probability on Non-anomalous Points) and DP AP (Detection Probability on Anomalous Points). The whole simulation experiment is repeated 100 times to get statistical results. The performance of BasisEvolution on clean data is better than that on uncleaned data.

4.3.2. Detector selection for the data cleaning process

In our test, we compare the performance of single approaches and a combination method, choosing PCA- and ANN-based detectors. The single approaches are the PCA- and ANN-based detectors, while the combination detector merges the detection results of each detector at each iteration. More specifically, we select the NNBP method as the ANN algorithm.

In PCA, we define the "normal subspace" with thresholding (e.g., [mean − 3·std, mean + 3·std], where mean and std are the average and standard deviation of the principal components), and find anomalous points with the SPE measurement [18]. Since PCA is sensitive to polluted data [25], the "normal subspace" is distorted at the first iteration, and the false alarms would be very high. As Figure 5 shows, the continued detection and gap padding in the following iterations bring the traffic closer to the normal range.

NNBP finds anomalous points by comparing the differences between real traffic and predicted traffic volumes. If the differences exceed the threshold (e.g., c·std, where c is a constant and std is the standard deviation of the test samples), the corresponding points are identified as anomalies. In the first iteration, the data is polluted and the corresponding predictions are not accurate. As before, the successive rounds of detection and interpolation (see Figure 5) can bring the traffic back to the regular level.
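Both detectors thus reduce to the same thresholding primitive on a residual series. A minimal Python sketch of that primitive (our illustration, not the paper's implementation; `c` is the tunable constant mentioned above):

```python
import numpy as np

def threshold_detect(residuals, c=2.0):
    """Flag points whose residual deviates more than c standard deviations
    from the mean. This is the primitive shared by the PCA- and NNBP-style
    detectors described above; c is a tunable constant."""
    residuals = np.asarray(residuals, dtype=float)
    mean, std = residuals.mean(), residuals.std()
    return np.abs(residuals - mean) > c * std

# One polluted point among small residuals
r = [0.1, -0.2, 0.0, 8.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0]
flags = threshold_detect(r, c=2.0)       # only index 3 is flagged
```

Note that a strong outlier inflates the sample standard deviation and can mask itself at larger c; this masking is exactly why the repeated detect-and-interpolate rounds of Figure 5 are needed.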


As for the combination detector, the set of detected anomalous points is larger, since it merges the detection results of both PCA and NNBP. With those anomalous points revised to "normal values" in each iteration (see Figure 5), the typical traffic can be recovered more effectively.

We measure the performance of the three detectors by detecting on the correspondingly modified traffic. More specifically, we detect the residuals between the original and modified traffic through thresholding (see Sec 3.4.1). If the detection accuracy (DP AP) is high and the false-alarm probability (FAP NP) is low, a larger part of the modified traffic lies in the normal region. Figure 7 illustrates the performance of the three detectors in the data cleaning process. The combination detector performs best, with the lowest false-alarm probability, in accordance with the discussion above. The lowest false-alarm probability means the traffic cleaned by the combination detector is closest to the "normal" traffic, from which we can extract the most accurate "normal patterns".


4.3.3. The estimation of parameter maxC in the data cleaning process

In the data cleaning process, the parameter maxC is defined as the maximum number of iterations: both the anomaly detection and gap filling methods are repeated maxC times to clean the data. We try to estimate a small value of maxC, such that the cleaned traffic is close to "normal" and BasisEvolution achieves the best detection performance.

Figure 7: The performance of three detectors in the data cleaning process. The three detectors are PCA, NNBP, and their combination. The effectiveness is measured with FAP NP (False-Alarm Probability on Non-anomalous Points) and DP AP (Detection Probability on Anomalous Points). The whole simulation experiment is repeated 100 times to get statistical results. The combined approach performs better than either single detector.
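The detect-and-fill loop bounded by maxC can be sketched as follows. This is our own reading of the process in Figure 5: the detector list and the linear-interpolation gap filling are illustrative stand-ins, not the paper's exact algorithm.

```python
import numpy as np

def clean_traffic(traffic, detectors, max_c=5):
    """Iterative data cleaning: in each of up to max_c rounds, merge the
    detectors' flags (the combination detector) and replace flagged points
    with linearly interpolated "normal" values (gap filling)."""
    x = np.asarray(traffic, dtype=float).copy()
    for _ in range(max_c):
        flags = np.zeros(len(x), dtype=bool)
        for detect in detectors:             # e.g. PCA- and NNBP-style detectors
            flags |= detect(x)               # union of detected anomalous points
        if not flags.any() or flags.all():
            break
        bad, good = np.flatnonzero(flags), np.flatnonzero(~flags)
        x[bad] = np.interp(bad, good, x[good])   # fill gaps from clean neighbours
    return x

rng = np.random.default_rng(1)
raw = rng.normal(100.0, 1.0, 200)
raw[50] += 30.0                              # one strong polluting spike
three_sigma = lambda v: np.abs(v - v.mean()) > 3.0 * v.std()
cleaned = clean_traffic(raw, [three_sigma], max_c=5)
```

Each round, points revised in earlier rounds tighten the standard deviation, so anomalies masked at first become detectable later, which is the rationale for iterating up to maxC times rather than detecting once.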

In the simulation, the candidate set for maxC is {1, 3, 5, 7, 9}. We choose the optimal value of maxC through two comparisons: comparing the performance of the data cleaning process (similar to Sec 4.3.2), and comparing the performance of BasisEvolution (similar to Sec 4.3.1). The results can be seen in Figure 8 (a) and (b).

In Figure 8 (a), we measure the performance of the data cleaning process by detecting the residuals between the original traffic and the cleaned traffic. The cleaner the traffic, the higher the detection performance (lower false-alarm probabilities and higher detection probabilities) on the residuals. From Figure 8 (a), the false alarms are high when maxC is smaller than 5; when maxC is 5 or larger, the corresponding curves not only perform better, but also are close to each other. As for Figure 8 (b), when maxC is larger than 3, the corresponding curves are also similar to each other, with higher detection performance. Based on these, we finally choose maxC = 5 in the later experiments.


Figure 8: The performance of the data cleaning process (a) and the whole framework (b) under different values of maxC, with maxC ∈ {1, 3, 5, 7, 9}. The effectiveness is measured with FAP NP (False-Alarm Probability on Non-anomalous Points) and DP AP (Detection Probability on Anomalous Points). The whole simulation experiment is repeated 100 times to get statistical results. A better value for maxC is 5.

4.3.4. Parameter estimation in basis update

The parameters we need to assess are λd, λ1 and λ2. Recall our motivation for the distance measurement: the new basis should both maintain similarity to the immediately previous one and be close to the historical mean. Hence, the two parts of Equation (11) should carry the same importance, and we set the weight parameters λ1 = λ2 = 0.5.

The parameter λd determines the trade-off between the measurement constraint and the importance of changes. A larger λd leads to looser approximations, whereas a smaller λd means the approximations are a better fit to the data. We set the candidate set for λd to {0.01, 0.1, 1, 2, 10, 100}. Figure 9 shows the detection performance with respect to the different λd values. We find that an optimal detection result (a higher detection probability and a lower false-alarm probability) occurs around λd = 2, with the corresponding curve tending towards the top left of the plot. Hence, we finally choose λd = 2, λ1 = 0.5, and λ2 = 0.5 in the following experiments.


Figure 9: Comparison of detection performance in terms of FAP NP (False-Alarm Probability on Non-anomalous Points) and DP AP (Detection Probability on Anomalous Points) when λd ∈ {0.01, 0.1, 1, 2, 10, 100}. The entire simulation was repeated 100 times to get statistical results. A better performance can be found at λd = 2.


4.4. Experiment results on synthetic data

We conducted several classes of synthetic experiments. In the first class of experiments, we compare anomaly detectors on fixed-width anomalies, but with anomalies that are not simple spikes. In the second class, we consider the effects of anomaly width on detection performance. In the third and fourth classes, we consider the same issues but for magnitude.

In each case below we generate traffic and inject anomalies as described above. We replicate the experiments 100 times, and report aggregated statistics.

4.4.1. Constant-width results

We start by considering anomalies with random magnitude, but fixed width ψ = 10. The resulting ROC curves are shown in Figure 10 (a), which shows that as the threshold q increases, both the Detection Probability on Anomalous Points (DP AP) and the False-Alarm Probability on Non-anomalous Points (FAP NP) decrease; i.e., there is a trade-off between the two types of errors. Noteworthy is the fact that BasisEvolution (BE) outperforms the other methods (we prefer curves towards the top left of the diagram). Table 2 shows, more precisely, a number of FAP NP values for given DP AP values. The important feature shown in this table, apart from BasisEvolution's superiority, is that it can achieve very low false-alarm probabilities (as low as 0.00063), which are necessary in practical anomaly detection algorithms.
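We read DP AP and FAP NP as a per-point detection rate and false-alarm rate; under that reading (our interpretation, not a definition quoted from the paper), the two metrics can be computed as:

```python
import numpy as np

def dp_ap_fap_np(truth, detected):
    """DP_AP: fraction of truly anomalous points that are detected.
    FAP_NP: fraction of non-anomalous points that are falsely flagged.
    Sweeping the detector's threshold q trades one against the other,
    tracing out ROC-style curves like those in Figure 10 (a)."""
    truth = np.asarray(truth, dtype=bool)
    detected = np.asarray(detected, dtype=bool)
    dp_ap = (truth & detected).sum() / max(truth.sum(), 1)
    fap_np = (~truth & detected).sum() / max((~truth).sum(), 1)
    return dp_ap, fap_np

truth    = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0], dtype=bool)
detected = np.array([0, 0, 1, 0, 0, 1, 0, 1, 0, 0], dtype=bool)
dp, fap = dp_ap_fap_np(truth, detected)    # dp = 2/3, fap = 1/7
```

The simulations make such measurements possible at very small probabilities, since the ground-truth mask is known exactly.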


The trade-off is a relatively low DP AP, but this is less important. In real anomaly detection settings, it is far more important to avoid reporting a large number of false alarms, while still providing useful detections, than to detect all events at the cost of false alarms that degrade the operator's confidence in any of the detections. This also emphasizes the need for simulation studies, as it would have been impractical for us to measure such a low probability without a large set of data on which to perform tests.

In Figure 10 (b), we show the equivalent of an ROC curve for the pair of metrics AD AA (Aggregation Degree on Anomaly Amounts) and CD AA (Consistency Degree on Anomaly Amounts), although in this plot we prefer curves towards the top right of the diagram. As the threshold q increases, AD AA increases while CD AA decreases, showing the trade-off between the two types of errors. Once again BasisEvolution is universally superior to the other methods across all threshold values.

In Figure 10 (c), apart from NNBP, as the threshold q increases, AD AD (Aggregation Degree on Anomaly Duration) increases, while CD AD (Consistency Degree on Anomaly Duration) decreases, showing another trade-off. As in (b), we prefer curves towards the top right. BasisEvolution outperforms the other four methods across all threshold values.

Figure 10: Performance of the 5 methods for detecting FIFD anomalies. Here the width parameter is ψ = 10, and the magnitude is random. BE is short for BasisEvolution. FAP NP is the False-Alarm Probability on Non-anomalous Points. DP AP is the Detection Probability on Anomalous Points. AD AA is the Aggregation Degree on Anomaly Amounts. CD AA is the Consistency Degree on Anomaly Amounts. AD AD is the Aggregation Degree on Anomaly Duration. CD AD is the Consistency Degree on Anomaly Duration. (a) FAP NP versus DP AP; (b) AD AA versus CD AA; (c) AD AD versus CD AD.

Method   DP AP = 0.3   DP AP = 0.5   DP AP = 0.6
BE       0.00063       0.0073        0.020
OMP      0.0027        0.18          DNE
PCA      0.019         0.047         0.099
RPCA     0.025         0.047         0.064
NNBP     0.27          DNE           DNE

Table 2: We fix the value of DP AP, and report the corresponding FAP NP of the five methods. FAP NP is the False-Alarm Probability on Non-anomalous Points. DP AP is the Detection Probability on Anomalous Points. DNE means "Does Not Exist". BE is short for BasisEvolution. Note that very small FAP NP values are reported, as required.


4.4.2. The effect of width

We now test the five anomaly detection methods on anomalies whose magnitudes are randomly distributed, but with fixed width parameter ψ ∈ {0, 4, 10, 20, 40}, to understand the effect of anomaly width on detection. For the FIFD type of anomalies, Figure 11 (a) shows that the performance of BasisEvolution decreases as the width parameter ψ grows. That is, it is more difficult to correctly detect wider anomalies. This is to be expected: as the anomaly width increases, there are more points in the anomaly, and thus these points are more likely to be considered "normal" by any detector, and by any metric. A single spike will always be the easiest type of event to detect; while it might be possible to detect the change into a broad flat anomaly, it will always be difficult to correctly categorize the flat region.

Figure 11: Performance (FAP NP versus DP AP) of BasisEvolution in detecting FIFD anomalies with varying anomaly widths (a) or magnitudes (b). FAP NP is the False-Alarm Probability on Non-anomalous Points. DP AP is the Detection Probability on Anomalous Points. For the BE-detected results, we note that performance decreases with increasing width, or decreasing magnitude. Additional plots are omitted as they reinforce the same results.

4.4.3. Constant-magnitude results

We now consider the effect of the magnitude of the anomalies on detection, starting with a set of tests with fixed magnitude, φ = 0.5, but random width. The ROC curves for the metrics DP AP (Detection Probability on Anomalous Points) and FAP NP (False-Alarm Probability on Non-anomalous Points) are shown in Figure 12.

As in Figure 10, both DP AP and FAP NP decrease as the threshold q increases. We prefer curves towards the top left of the diagram. The results are similar to those for constant width, though perhaps more extreme; e.g., Table 3 provides FAP NP values for the five techniques corresponding to the fixed DP AP values, and we see that BasisEvolution achieves an even lower false-alarm probability of 7.8 × 10⁻⁵.

Figure 12: Performance for detecting FIFD anomalies with fixed magnitude φ = 0.5 and random width. FAP NP is the False-Alarm Probability on Non-anomalous Points. DP AP is the Detection Probability on Anomalous Points. AD AA is the Aggregation Degree on Anomaly Amounts. CD AA is the Consistency Degree on Anomaly Amounts. AD AD is the Aggregation Degree on Anomaly Duration. CD AD is the Consistency Degree on Anomaly Duration. (a) FAP NP versus DP AP; (b) AD AA versus CD AA; (c) AD AD versus CD AD.

Method   DP AP = 0.3   DP AP = 0.5   DP AP = 0.7
BE       7.8 × 10⁻⁵    0.0002        0.0025
OMP      0.0026        0.009         0.066
PCA      0.013         0.039         0.16
RPCA     0.033         0.079         DNE
NNBP     0.19          DNE           DNE

Table 3: We fix the value of DP AP, and report the corresponding FAP NP of the five methods. FAP NP is the False-Alarm Probability on Non-anomalous Points. DP AP is the Detection Probability on Anomalous Points. DNE means "Does Not Exist". BE is short for BasisEvolution. Note that very small FAP NP values are possible, as required.

4.4.4. Effect of varying magnitudes

In our final simulation results, we test anomalies whose widths are randomly distributed, with magnitude parameter φ ∈ {0.1, 0.3, 0.5, 0.7, 1}, to analyze the impact of anomaly magnitude on detection.

Figure 11 (b) illustrates the results of BasisEvolution on anomalies with varying magnitudes, showing that it is easier to distinguish anomalies with larger magnitudes. This result is reasonable because anomalies with larger magnitudes lead to the corresponding anomalous points having larger deviations, which are easier for any detector to see. However, there is a diminishing return for anomalies above some threshold around φ = 0.5, beyond which little improvement is seen relative to the change in anomaly magnitude.

The figures presented in the above four classes show the results for FIFD anomalies, but we found the same types of results no matter which type of anomaly we considered. We also saw similar results for the alternative performance metrics (in the places where those plots are omitted).

Why is the proposed approach, BasisEvolution, better than the others? Two reasons emerge from analyzing its mechanism. BasisEvolution starts by extracting basis vectors under the smallest period (24 h), and ends with the basis under the largest cycle (e.g., weeks or months). In each cycle, we obtain the directions of the principal components as basis vectors. Since these directions preserve the largest energy of the traffic at each cycle, they are easy to understand, and actually describe the network traffic at multiple scales. These multiscale basis vectors can formulate the "typical region" more accurately, which means that BasisEvolution can find more types of anomalies (e.g., subtle anomalies), and is robust in anomaly detection. Moreover, BasisEvolution uses the extracted basis vectors to reconstruct traffic volumes through linear combination, which means that the extracted basis vectors explain the components of the network traffic.
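The idea of taking the principal direction at each cycle can be illustrated with a small sketch. This is our own simplification, not the paper's basis-generation algorithm (Sec 3), which differs in detail: fold the series into one row per cycle and take the top right-singular vector of the folded matrix.

```python
import numpy as np

def cycle_basis(traffic, period):
    """First principal direction of the traffic folded at a given period,
    e.g. period = 24 for a daily cycle with hourly samples. The top
    right-singular vector is the direction preserving the most energy."""
    n = (len(traffic) // period) * period
    X = traffic[:n].reshape(-1, period)      # one cycle per row
    X = X - traffic[:n].mean()               # remove the overall mean level
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]                             # unit-norm cycle shape (up to sign)

hours = np.arange(24 * 14)                   # two weeks of hourly samples
traffic = 100.0 + 20.0 * np.sin(2 * np.pi * hours / 24)
daily = cycle_basis(traffic, 24)             # recovers the daily shape
```

Repeating this at periods of one day, one week, and so on yields the kind of multiscale basis shown later in Figure 13; projecting traffic onto such vectors and thresholding the residual is the reconstruction view described above.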


Compared with the commonly used approaches (OMP and PCA), BasisEvolution performs better in the above experiments. Specifically, PCA is sensitive to anomalous points; that is, the "typical region" (or the "normal subspace") that PCA constructs is easily contaminated when the number of anomalous points is large. BasisEvolution alleviates this issue by extracting multiscale features, so that the "typical region" can be better approximated, with a comprehensive description of the network traffic from different views. Hence, BasisEvolution is less sensitive and can find more anomalies than PCA (and PCA-type methods). As for OMP, it reconstructs network traffic through the superposition of a finite number of discrete cosine functions. Each such basis vector has no physical meaning, and a finite number of functions may lose the details of the "typical region", which reduces the detection probability. Hence, BasisEvolution is better than OMP with respect to both interpretability and accuracy.


5. Experiments with Real Data

We test BasisEvolution on traffic data from the Internet Initiative Japan (IIJ), a Japanese ISP. The dataset [41] is a collection of time series obtained from Japanese network backbone routers. The collected data extends from April 4 to December 4, 2014. It contains traffic measurements (in and out) on 10 links, at time intervals of 1 h, for a total of 5,857 observed data samples for each link in each direction.

An intrinsic problem we have already discussed is that it is difficult to obtain the ground truth for anomaly detection. Even if we adopted expert advice to label the data, some anomalies would remain unseen by the experts, because they are not traditional ones. Hence, our false-alarm probabilities would be inflated. Thus, we used simulations to provide quantitative comparisons to understand the merits of the various anomaly detection methods.

However, we acknowledge that there is an artificiality in any simulation. Here we aim to at least comment on the ability of BasisEvolution to find anomalies in real data by using common methods (such as PCA) to find anomalies, which we then treat as if they were real. Using those "real" anomalies, we compare BasisEvolution with other widely used detection methods. In this section, however, we omit NNBP results, as this method performed very poorly in the simulations.

We divide each series into segments two weeks in length, as in the simulations, and repeat the process for each algorithm as described above, for each link. For example, Figure 13 shows the discovered basis functions for the first two time segments: the initial segment, and its evolved form in the second time segment. In each, we can see that the first two basis vectors show the daily and weekly cycles in the data, with the third containing more noise.

Next, we compare the anomaly detectors against each other by treating the anomalies found by one method as if they were real, and then seeking to detect these using an alternative method.

Take PCA-detected anomalies as an example.

Figure 13: Basis vectors for the first two time segments of the second IIJ link, with cycles c1 = 1 day, c2 = 1 week, and c3 = 2 weeks. (a) The initial basis vectors on the first time segment: April 4th-18th, 2014. (b) The evolved basis on the following time segment. All functions are normalized to have zero mean.

Figure 14: Comparisons of methods on false-alarm probability and detection probability with the IIJ data. Anomalies detected by (a) PCA, (b) RPCA, and (c) OMP are treated as the "real" anomalies, respectively.


Figure 14 (a) illustrates the performance of the other three methods (OMP, RPCA, and BasisEvolution) in comparison. Note that here, the False-Alarm Probability on PCA-detected Non-anomalous Points (FAP NP) and the Detection Probability on PCA-detected Anomalous Points (DP AP) are not with respect to ground truth, and thus the meaning of the graph is different: it shows how consistent the approaches are. Unsurprisingly, RPCA is the most consistent with PCA, in that it finds the largest share of the same anomalies. OMP is the most different from PCA, and our approach is in between the two. Similar conclusions can be drawn from Figure 14 (b), where RPCA-detected anomalies are regarded as the real ones. However, when OMP-detected anomalies are used (Figure 14 (c)), PCA, RPCA and BasisEvolution share almost the same number of anomalies, which is in accordance with their underlying mechanisms. The alternative measures we considered above reinforce the same story.


One interesting observation in Figure 14 is that the overall detection probabilities are not high, which means the overlap among the methods is low. One explanation is that different anomalies are usually found by different methods because of their inherent mechanisms; hence RPCA and PCA perform best in Figure 14 (a) and (b), respectively, while OMP, PCA and BasisEvolution discover different types of anomalies. This is what we might expect, given the different assumptions and algorithms upon which they are based.


To further analyze the performance of the four methods, we present the detection results on ten weeks of data, chosen at random, as shown in Figure 15. Since we do not have the ground truth for the real data, the possible methods of identifying anomalous points are expert advice and voting. In view of real network scenarios and holiday influences in Japan, we can label several anomalous points, such as A(624, 2.12e+08). In addition, we adopt a voting strategy to estimate the reliability of each detected anomalous point: a detected anomalous point wins one vote for each other technique that also identifies it as abnormal. In our situation, the number of votes one point can obtain ranges from 0 to 3. A larger number of votes indicates that the corresponding anomalous point is shared by more approaches, and hence that the corresponding methods are more reliable.
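The voting scheme can be sketched directly. This is our illustration: `detections` is an assumed mapping from each method's name to the set of time indices it flags, and the index values below are made up.

```python
def vote_counts(detections, method):
    """For each point flagged by `method`, count how many OTHER methods also
    flag it. With four methods, a point gets 0-3 votes; more votes mean the
    detection is shared by more approaches."""
    others = [pts for name, pts in detections.items() if name != method]
    return {p: sum(p in pts for pts in others) for p in detections[method]}

# Illustrative sets only, not the paper's actual detections
detections = {
    "BE":   {10, 50, 90, 120},
    "OMP":  {10, 50, 300},
    "PCA":  {10, 90, 200},
    "RPCA": {10, 200, 300},
}
be_votes = vote_counts(detections, "BE")    # {10: 3, 50: 1, 90: 1, 120: 0}
```

Tallying these counts per method gives exactly the vote columns reported in Table 4.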

Figure 15: Detection results of the four algorithms (OMP, PCA, RPCA, and BasisEvolution) on ten weeks of data on link 4, IIJ: Apr 5th, 2014 to Jun 14th, 2014. BE is short for BasisEvolution. Labeled points: A(624, 2.12e+08), B(983, 2.32e+08), and C(1179, 1.411e+08).

Take BasisEvolution as an example. A total of 15 anomalous points are detected in this context. Among them, five points have three votes, i.e., they are also detected by the other three approaches; four points have two votes; three points have one vote; two points (A and B in Figure 15) are zero-vote points that we nevertheless labeled as anomalous, even though no other technique detected them; and only the remaining point (C in Figure 15) is neither detected by another method nor labeled. Similar descriptions of the other methods can be seen in Table 4. In total, 12 of the 15 BE-detected points are also detected by others, and only one point has zero votes and no label, which indicates the strong reliability of BasisEvolution. For PCA, RPCA and OMP, larger proportions of the detected points receive zero votes, which suggests that the reliability of BasisEvolution is better than that of the other three. In addition, BasisEvolution has two zero-vote points that are confirmed as real anomalies, while OMP has one and both PCA and RPCA have none. These results, as well as the simulations, support the conclusion that BasisEvolution can find anomalous points that other approaches cannot.

864

866

6. Conclusion and future work

867

Accurately detecting anomalies without pre-knowledge plays an important

868

role for network operators. However, to be practical, methods need to have a

869

very low false-alarm probability, and there are other challenges. Strong anoma45

ACCEPTED MANUSCRIPT

Total anomalies 15 13 24 24

Amount (of real anomalies) 2 1 0 0

No. (with votes) 3 2 1 0 5 4 3 1 5 0 2 5 5 4 9 6 5 4 7 8

CR IP T

Method BE OMP PCA RPCA

Table 4: Description of anomaly detection results for four algorithms in terms of reliability of those detected anomalies based on Figure 15. Here, total anomalies means a total number of anomalies detected by any approach, amount of real anomalies counts the detected points that are real anomalies based on the expert advice but fail to get any vote and No. with votes records the number of detected anomalous points with a certain amount of votes. BasisEvolution is more reliable and can find anomalies that the other three cannot.

lies can pollute the normal space used as a baseline to detect anomalies, and

871

anomalies can themselves have different shapes that make them harder or easier

872

to detect.

AN US

870

In this paper, we address these problems with BasisEvolution, which looks

874

for a concise but accurate, and easy-to-understand basis that represent the

875

typical traffic patterns more precisely. We evolve the basis to maintain similar-

876

ity with the old one in describing the new data. We show it can then detect

877

anomalous points on the residual series with higher accuracy than common al-

878

ternatives. It still works even the anomalies are complex.

ED

M

873

However, there are many issues left for future work. For example, the basis

880

generation algorithm we used is based on SVD, which is not feasible for very

881

large data scenarios. The first issue is to find robust, fast algorithm to replace

882

SVD, such as the pursuit principal used in OMP. Another point is that the size

883

of data segment we used in our framework is fixed and our algorithm proceeds

PT

879

in batches. A more realistic algorithm should be able to tackle arbitrary new

885

segments of data as they arrive. We plan to extend BasisEvolution in the future

886

to tackle these issues.

AC

CE 884

887

888

889

Acknowledgment This authors thank IIJ for providing data. This work was supported by the

National Key Basic Research Program of China (2013CB329103 of 2013CB329100),

46

ACCEPTED MANUSCRIPT

the China Scholarship Council (201506050072), and the short term Interna-

891

tional Academic Fund of ChongQing University 2015 Overseas Visiting Student

892

Projects (GJXSFX150203), as well as Australian Research Council (ARC) grant

893

DP110103505 and the ARC Centre of Excellence for Mathematical and Statis-

894

tical Frontiers (ACEMS).

895


Hui Xia is a PhD student in the Department of Computer Science at Chongqing University. She received a Bachelor's degree in Computer Networks from Nanchang University. Her research interests include network traffic analysis, data mining, and personalized recommendation.

Bin Fang received the B.S. degree in electrical engineering from Xi'an Jiaotong University, Xi'an, China, the M.S. degree in electrical engineering from Sichuan University, Chengdu, China, and the Ph.D. degree in electrical engineering from the University of Hong Kong, Hong Kong. He is currently a Professor with the College of Computer Science, Chongqing University, Chongqing, China. His research interests include computer vision, pattern recognition, information processing, biometrics applications, and document analysis. He has published more than 120 technical papers and is an Associate Editor of the International Journal of Pattern Recognition and Artificial Intelligence. Prof. Fang has been the Program Chair and a Committee Member for many international conferences.

1028

Matthew. Roughan received a Bachelors degree in math

52

ACCEPTED MANUSCRIPT

science at Adelaide University and the Ph. D degree in applied mathematics at

1030

Adelaide University. Currently, he is a professor with the School of Mathemat-

1031

ical Sciences, the University of Adelaide. His research interests include Internet

1032

measurement and estimation, network management, and stochastic modeling,

1033

in particular with respect to network traffic and performance modeling. He

1034

has published more than 150 technical papers, and is a member of the MASA,

1035

AustMS, IEEE and ACM. Prof. Roughan has been a chair of ACM SIGCOMM

1036

Doctoral Dissertation Award Committee, and won 2013 ACM SIGMETRICS

1037

Test of Time Award.

AN US

CR IP T

1029

Paul Tune received a Ph.D. degree at the Centre for Ultra-Broadband Information Networks (CUBIN), Dept. of E&E Engineering, the University of Melbourne. He worked as a postdoctoral research fellow at the School of Mathematical Sciences, the University of Adelaide. He now works at Image Intelligence, Sydney. His research interests are network measurement, information theory, and compressed sensing.

Kenjiro Cho received the B.S. degree in electronic engineering from Kobe University, the M.Eng. degree in computer science from Cornell University, and the Ph.D. degree in media and governance from Keio University. He is Deputy Research Director with the Internet Initiative Japan, Inc., Tokyo, Japan. He is also an Adjunct Professor with Keio University and Japan Advanced Institute of Science and Technology, Tokyo, Japan, and a board member of the WIDE project. His current research interests include traffic measurement and management and operating system support for networking.