A BasisEvolution Framework for Network Traffic Anomaly Detection
Hui Xia¹, Bin Fang², Matthew Roughan³, Kenjiro Cho⁴, Paul Tune⁵

Abstract

Traffic anomalies arise from network problems, so their detection and diagnosis are useful tools for network managers. A great deal of progress has been made on this problem, but most approaches can be thought of as forcing the data to fit a single mould. Existing anomaly detection methods largely work by separating traffic signals into "normal" and "anomalous" types using historical data, but do so inflexibly, either requiring a long description of "normal" traffic, or a short but inaccurate one. In essence, preconceived "basis" functions limit the ability to fit the data, and the static nature of many algorithms prevents true adaptivity, despite the fact that real Internet traffic evolves over time. In our approach we allow a very general class of functions to represent traffic data, limiting them only by invariant properties of network traffic such as diurnal and weekly cycles. This representation is designed to evolve so as to adapt to changing traffic over time. Our anomaly detection thresholds the approximation residual error, combined with a generic clustering technique that reports a group of anomalous points as a single anomaly event. We evaluate our method against orthogonal matching pursuit, principal component analysis, robust principal component analysis and a back-propagation neural network, using both synthetic and real-world data, and obtain very low false-alarm probabilities in comparison.

Keywords: anomaly detection, basis, evolution, low false-alarm probability, SVD

¹ H. Xia is with the School of Accounting, Chongqing University of Technology, Chongqing, 400044, China. E-mail: [email protected]. The majority of this work was conducted while H. Xia was visiting the University of Adelaide.
² Corresponding author. B. Fang is with the School of Computer Science, Chongqing University, Chongqing, 400044, China. E-mail: [email protected].
³ M. Roughan is with the ARC Centre of Excellence for Mathematical & Statistical Frontiers (ACEMS) at the School of Mathematical Sciences, University of Adelaide, Adelaide 5005, SA, Australia. E-mail: [email protected].
⁴ K. Cho is with IIJ. E-mail: [email protected].
⁵ P. Tune is with Image Intelligence. E-mail: [email protected].
1. Introduction

Conceptually, a network traffic anomaly can be defined as an intentional or unintentional attack, fault, or defect that perturbs "normal" network traffic behaviour [1]. The causes include traffic outages, configuration changes, network attacks (e.g., DDoS attacks), flash crowds, network worms, and so on [2, 3, 4]. As the Internet continues to grow in size and complexity, traffic anomalies are becoming more common and diverse, and they need to be detected and analysed effectively.

Most Intrusion Detection Systems (IDS) perform knowledge-based detection: they search for instances of known network problems by attempting to match pre-determined representations [5]. Such systems can be trained against known records, but cannot detect unknown problem types, i.e., anomalies [6]. Thus there is a need for anomaly detection, which estimates the typical behaviour of network traffic and detects deviations from it [1, 6].

Anomaly detection schemes need to have low false-alarm probabilities, at least of the order of 10⁻⁴, because of the large number of tests running in even a moderately sized network, and the amount of human work required to process each anomaly. Most existing techniques do not achieve such low false-alarm probabilities [7, 8], at least in part because they are restricted in their representation of typical traffic. Existing techniques represent typical traffic in terms of a set of functions of historical data – we refer to this as a "basis", though it may not be a strict mathematical basis for the space. The requirement that we accurately model typical traffic results either in a large representation error, or in a large set of functions, which is undesirable for reasons best described under the heading of the "Principle of Parsimony" [9]; principally, problems of estimation and generalisation.

In our work, we also search for basis functions, but we do so from a potentially infinite set, allowing a very accurate representation with a small number of functions. However, unlike Principal Components Analysis (PCA), which also performs such a search, we do so under constraints that match known behaviour of real traffic: for instance, cyclicity. Current anomaly detection approaches usually ignore the nature of network traffic, which makes the derived basis hard to interpret.

Also, real network traffic patterns do change over time, but slowly. For example, the peak time of traffic might shift from 8pm to 7pm with the change to or from daylight-saving time, and traffic will change as an ISP recruits or loses customers [10]. We therefore also present an update algorithm that improves the basis with new data, allowing the algorithm to adapt to network changes without a computational explosion. The evolved basis usually remains similar to the old basis, but if the reconstruction error of the evolved basis exceeds a preset threshold, we reinitialise it.

Once we have a basis for typical traffic, anomalous points are easy to detect by examining the reconstruction error.

An additional contribution of this paper is to go beyond simple metrics for anomaly detection performance (such as false-alarm and missed-detection probabilities). Real anomalies persist, and better metrics for anomaly detection also determine the extent or duration of the anomaly. In this paper, we use a clustering algorithm and associated quality metrics to compare detection algorithms. Hence, anomaly detection in our framework includes two components: anomalous-point detection and anomaly clustering. The first detects anomalous points, and the second analyses anomalies.

We use both synthetic and real-world data to assess the performance of our framework. The synthetic data is less realistic, but is necessary to provide accurate metrics [11]: it is intrinsically difficult to obtain large enough sets of accurately labelled anomaly data, given that the nature of an "anomaly" is intrinsically defined only by its absence from typical traffic. We use real data to illustrate the relationship between anomalies detected by our approach and those of PCA, Robust Principal Component Analysis (RPCA), Orthogonal Matching Pursuit (OMP), and Neural Network using Back Propagation (NNBP) detection models. We show that our approach has substantially better detection characteristics than these alternatives, including false-alarm probabilities as low as 7.8 × 10⁻⁵.
2. Background and Related Work

In many cases, network traffic records contain a number of attributes, such as source and destination IP addresses, which push the context of anomaly detection into a high-dimensional space. In view of this, anomaly detection approaches first process the attributes of the data, because they are of different types. Examples include calculating the entropy of some features of the data, e.g., the number of observed host addresses [4]; removing mean or median values of the data [12, 13]; or measuring the mutual information of each feature for feature selection [14].

In real detection scenarios, higher-dimensional data require more resources when estimating parameters, and may cause overfitting in the training stage, even though the results are more explainable. In view of this, researchers adopt attribute selection (or extraction) to reduce dimensionality. For example, Huang et al. [15] chose a Self-Organizing Map (SOM) to extract principal components as a compressed representation of a "normal picture"; Chen et al. [16] filtered out irrelevant attributes using the maximal information coefficient (MIC) to increase classification accuracy.

One explicit characteristic of high-dimensional network traffic data is that almost all anomalies are embedded in some lower-dimensional subspaces [17]. Existing detection methods for those "subspace anomalies" include subspace-projection-based models [17] and PCA [13, 18]. Other techniques, such as Artificial Neural Network (ANN) classifiers [15, 5], Fourier analysis [19], wavelet analysis [20] and "BasisDetect" [21], perform better in lower- or single-dimensional environments.

Moreover, Teodoro et al. [6] provided a comprehensive review of commonly used techniques, platforms, systems and projects, while Ahmed et al. [22] presented an in-depth analysis of four major classes of anomaly detection techniques: classification, statistical, information-theoretic and clustering. In addition, Chandola et al. [23] discussed anomaly types and analysed anomaly detection algorithms in six classes: classification, nearest neighbour, clustering, statistical, information-theoretic and spectral techniques. As conventional distance and similarity measurements lose efficacy when dimensionality increases, Weller-Fahy et al. [24] discussed various types of distance and similarity measurements so that the most appropriate evaluation metrics can be used in detection.

Instead of reviewing the above detection methods further, we seek to describe a framework within which we can place the various approaches; we use a few exemplar techniques to illustrate this framework, and compare our approach against them.

In general, anomaly detection algorithms use historical records to establish a baseline for typical behaviour, and then separate unusual, or anomalous, traffic from this baseline. Conceptually this is illustrated by Figure 1, which shows how we form an approximation for classification. Some techniques treat this as a simple classification problem, while others project into the typical space and look for outliers in the residual, but the underlying problem is the same.

[Figure 1 diagram: the space Ω of all possible traffic signals, the region of typical traffic τ, its approximation τ̂, and the resulting false alarms and missed detections.]

Figure 1: A simplified picture of the problem of anomaly detection. Conceptually, there is a space of traffic signals which are "typical", τ, within the larger space of all traffic signals, Ω. We seek to estimate this region from historical data, and approximate it by τ̂, so that we can classify new signals. Inaccuracies in the approximation result in errors of the two types indicated.

There is a tension in all techniques concerning the number of functions or features used to approximate the region τ. If more features are used, we might obtain a more accurate approximation; however, this works against the "Principle of Parsimony" [9], which argues for using a smaller number of features. The principle expresses the fact that larger numbers of features are harder to estimate accurately (from the historical data), and, more importantly, may not generalise. That is, they may have great explanatory power for the historical data, but not work well for data on which they have not been trained. A common expression is that we are "fitting the noise".

The result is that although some approaches can approximate a region very accurately, the region they match is polluted by noise (for instance, by undiagnosed anomalies in the training data, which are quite common). An illustrative example is the poor performance of PCA when the traffic measurements are polluted with anomalies [25], which distorts the approximation region τ̂. So there is a trade-off for any one technique, but better approaches can make unilateral improvements by choosing a better set of features or functions with which to represent τ.
The trade-off can be seen in most of the approaches described above. For instance, Fourier analysis often either directly filters low-frequency components of the signal in the Fourier domain, or approximates this through a low-pass filter, which requires trading off the number of frequencies used in the approximation: a larger number leads to a better approximation, but the approach then becomes more subject to noise. Similarly, PCA constructs a "normal subspace" to approximate the region τ, and has to trade off the number of principal components [26].

Many approaches take a hybrid route to address this trade-off, combining PCA with multi-scale methods so that a moderate number of features can approximate the typical traffic space τ more accurately without losing generalisation and efficiency. For example, some researchers applied Empirical Mode Decomposition (EMD) to the extracted principal components and collected Intrinsic Mode Functions (IMFs) as multiscale features for network traffic [27]. Similarly, Jiang et al. [8] applied the wavelet transform as a multiscale analysis, through which anomalous traffic can be identified using mapping functions on the principal components extracted at each scale. Novakov et al. [26] introduced a cooperative detection approach with two methods, PCA- and wavelet-based detection, in which the former examines the entire data set to roughly detect anomalous time bins, while the latter applies a detailed, localised frequency analysis to identify potentially anomalous time bins. Additionally, Chen et al. [16] proposed a multiscale PCA (MSPCA) model to decrease the typical-traffic reconstruction error by combining PCA and the Discrete Wavelet Transform (DWT); an improved version uses Bayesian PCA [28].

Among these approaches, BasisDetect [21] chooses an interesting path. It seeks to use a broader set of potential features, selecting a small number of those that match the data to create representations. Allowing a broader class of feature vectors permits a better representation with fewer features, and this is a general property. However, if we allow that class of features to become too wide, we hit computational barriers. Techniques such as PCA select from a much larger (infinite) set of possible feature vectors for representation, but do so in a manner that assumes an underlying model. We argue here that the patterns extracted by PCA do not represent what we see in real traffic data: PCA assumes independent and identically distributed Gaussian samples, whereas it has long been known that Internet traffic measurements (at least) show strong cyclical dependencies.

We describe these processes more formally by saying that the region τ̂ is described by a set of basis functions. They are not truly a basis, as the region may not be a vector space (it may not even be convex), but this terminology is helpful for a uniform description of the approaches. Given this terminology, Figure 2 shows how approaches can be roughly separated into four steps: data transformation, basis derivation, data reconstruction and anomaly detection. We use this structure to explain what is novel about our approach.

[Figure 2 flowchart: Past data → Data transformation → Basis derivation → Data reconstruction (together with New data) → Anomaly detection → Anomalous points.]

Figure 2: The general anomaly detection process. Data transformation is a preprocessing procedure. Basis derivation generates a basis from the historical data. Data reconstruction approximates the new data with the basis. Anomaly detection detects anomalies from the residual series, which records the difference between the new data and its approximation.

We will not dwell on the initial transformation process, as it concerns the type of anomalous behaviour that is observable, not the technique for observing it. The basis derivation process forces techniques to seek ways of picking better functions, so that the reconstructed traffic τ̂ is closer to τ. Our approach tackles these problems by

1. allowing a wider class of basis functions, constrained by known traffic features in order to avoid an explosion of possibilities; and
2. an adaptive mechanism that allows the representation to evolve easily.

The latter has two advantages. Firstly, it reduces the computational overhead. Secondly, and perhaps more importantly, it also constrains the basis functions, forcing them to be more meaningful than if we trained the features arbitrarily. We have deliberately chosen four exemplar approaches that cover a very wide range of ideas for anomaly detection, as shown in Table 1, and in doing so we aim to illustrate the general improvement that our approach obtains in the following section.
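To make the four-step process of Figure 2 concrete, the following is a minimal NumPy sketch of the generic pipeline: derive a "basis" from past data, reconstruct a new segment with it, and threshold the residual. It is illustrative only; the PCA-like basis choice, the segment sizes and the threshold q are assumptions, not the settings used in this paper.

```python
import numpy as np

def detect_anomalous_points(past, new, n_basis=2, q=3.0):
    # Basis derivation: an eigen/PCA-like basis from historical segments.
    U, _, _ = np.linalg.svd(past, full_matrices=False)   # past: (length, n_segments)
    H = U[:, :n_basis]
    # Data reconstruction: project the new segment onto the basis.
    residual = new - H @ (H.T @ new)
    # Anomaly detection: threshold the residual series.
    return np.flatnonzero(np.abs(residual - residual.mean()) > q * residual.std())

rng = np.random.default_rng(0)
hours = np.arange(24 * 7)
base = 10 + np.sin(2 * np.pi * hours / 24)               # one week of hourly traffic
past = np.column_stack([base + 0.1 * rng.standard_normal(hours.size) for _ in range(3)])
new = base + 0.1 * rng.standard_normal(hours.size)
new[50] += 3.0                                           # an injected spike
print(detect_anomalous_points(past, new))                # should report index 50
```

The differences between the methods compared in this paper lie almost entirely in how the basis-derivation step (here, a plain SVD of past segments) is performed.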
3. BasisEvolution Framework

The techniques in Table 1 derive a basis that preserves characteristics in the time, frequency or spatial domain. The underlying assumption is that we know little about the input data, and must discover its characteristics in these domains. However, we know that, in the frequency domain, typical traffic has dominant cycles with 24-hour and 1-week periods [29]. Moreover, it has less variation over short periods than over these dominant cycles. We do not need Fourier analysis, or the like, to tell us this. While techniques such as PCA may appear to be sensitive to these cycles, PCA is actually quite agnostic to the period, since it only responds to the strength of variation across the period – a fact which can be seen if we permute the traffic in time (PCA will produce identical, albeit permuted, results).

Technique | Candidate solution space | No. | Basis derivation | Basis size
Fourier | Sinusoidal basis | n | Predetermined low-frequency functions | Large
Wavelet [20] | Over-complete wavelet functions | kn | Soft thresholding (denoising) | Large
BasisDetect [21] | Over-complete dictionary of sinusoids, deltas and/or wavelet functions | kn | Improved OMP | Moderate
PCA [14] | Arbitrary | ∞ | Eigenanalysis | Small

Table 1: Description of anomaly detection in terms of candidate basis functions and techniques for finding a representation of a given data set. The candidate solution space describes the set of functions from which we can construct a typical traffic signal, and No. is the number of functions in this space for a signal of length n. The basis derivation is the technique used to refine this to a basis for the typical traffic. The basis size is the number of functions then required to accurately represent the historical traffic.

Roughly speaking, the idea of BasisEvolution is to allow great flexibility in discovering basis functions, so as to be able to find a very small set of them, but to use cyclicity and other traffic properties to restrict the search and make it practical. We do this through two approaches:

1. Initialising the basis from the typical traffic.
Given a small set of typical (cleaned) data, we apply the Singular Value Decomposition (SVD) [30] within each cycle to find a set of functions (referred to as a function family) preserving the largest energy. After all cycles have been processed, the function families are gathered together to form the final basis. Through this initialisation, the functions in the basis are all "normal patterns", and are easier to understand, since each element of the basis matches a specific cycle. Moreover, the size of the basis is usually very small.
Small anomalies can be hidden by a larger anomaly, and anomalies occur at different scales. The above initialisation finds basis functions of different scales (periods). This multi-scale representation of the traffic leads to better detection performance, with more anomalies of different scales being found.

2. Evolving the old basis for the new traffic.
The approaches in Table 1 adapt to new data either by using a fixed basis, which may enlarge the approximation error, or, in the case of PCA-like approaches, by deriving the basis again for the new data at extra computational cost.
BasisEvolution addresses this issue by adapting the old basis into a new one. More specifically, once the basis has been initialised, for any new (dirty) traffic we obtain the corresponding new basis through evolution. The new basis minimises the approximation error on the new traffic while remaining similar to the old basis. As time goes on, the continuing evolution of the basis tracks the statistical changes of the "normal patterns" in the new traffic. Compared with re-running the basis derivation process, the evolution approach effectively reduces the computational cost, especially for large-scale data.
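The earlier claim that PCA is agnostic to the period – it responds only to the strength of variation, not to when it occurs – can be checked numerically. The following small sketch (with arbitrary, illustrative sizes and values) permutes a synthetic traffic matrix in time and confirms that the covariance eigenvalues PCA relies on are unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
hours = np.arange(24 * 14)                        # two weeks, hourly
links = 5
diurnal = 1 + 0.5 * np.sin(2 * np.pi * hours / 24)
X = np.outer(diurnal, rng.uniform(5, 10, links))  # (time x links) traffic matrix
X += 0.2 * rng.standard_normal(X.shape)

def pca_eigvals(M):
    C = np.cov(M, rowvar=False)                   # covariance across links
    return np.sort(np.linalg.eigvalsh(C))

perm = rng.permutation(hours.size)                # shuffle the time axis
print(np.allclose(pca_eigvals(X), pca_eigvals(X[perm])))   # prints True
```

The principal directions are likewise identical; only the time series of their scores is permuted, which is exactly the behaviour described above.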
To define the algorithm formally, we start by noting that we sample traffic at times t_j, separated by intervals ∆t. The variable x_{t_j} is the total traffic during the interval [t_j, t_{j+1}). We group these into blocks of b measurements, obtaining a new process X = {x_0, ..., x_n} where

\[ x_m = [x_{t_j}, \ldots, x_{t_{j+b-1}}]^T, \]

with j = bm, and the time interval of this series is ∆T = b ∆t. In BasisEvolution, we use this series of data segments as the input to the whole framework; ∆T is usually several cycles long, e.g., it might be several weeks.

In the framework, a small set of clean data is required for the initialisation of the basis. However, the collected traffic measurements are contaminated with anomalies, and manually separating out the typical part would cost a large amount of resources. We therefore design a data cleaning process for a data block as an alternative. Usually, we choose the first collected data block x_0 for cleaning.

With the cleaned data segment x_0', the framework derives a basis for initialisation. We call this procedure the basis generation process, and the initialised basis H_0.

For any new traffic (e.g., x_1), we update the old basis (e.g., H_0) into a new one. This basis evolution process repeats for as long as new traffic arrives. However, if the approximation error of the evolved basis on the new traffic is outside a given bound, BasisEvolution cleans the data and re-initialises the basis instead of evolving the old one.

The framework then performs anomaly detection on the approximation error (e.g., r_1). We identify the anomalous points by thresholding, and cluster them together, with each cluster referred to as an anomaly.

The complete algorithm is illustrated in Figure 3: BasisEvolution is composed of four components: data cleaning, basis generation, basis update and anomaly detection (anomalous-point detection and anomaly clustering). We describe each in detail below.
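The following is a bird's-eye sketch, in Python, of the control flow just described: clean the first segment, generate a basis, reuse or re-initialise it for later segments, and detect on the residual. The helper functions are deliberately simplistic stand-ins rather than the algorithms of Secs. 3.1–3.4, and all thresholds and sizes are assumptions.

```python
import numpy as np

def clean(segment):
    # stand-in for the data-cleaning process of Sec. 3.1 (clip extreme values)
    lo, hi = np.percentile(segment, [1, 99])
    return np.clip(segment, lo, hi)

def generate_basis(segment, cycle=24, n_funcs=2):
    # stand-in for basis generation (Sec. 3.2): SVD of daily windows
    days = segment[: len(segment) // cycle * cycle].reshape(-1, cycle)
    _, _, Vt = np.linalg.svd(days, full_matrices=False)
    return Vt[:n_funcs].T                          # shape (cycle, n_funcs)

def reconstruct(segment, H):
    cycle = H.shape[0]
    days = segment[: len(segment) // cycle * cycle].reshape(-1, cycle)
    return ((days @ H) @ H.T).ravel()              # per-day weights, then rebuild

def detect(residual, q=3.0):
    return np.flatnonzero(np.abs(residual - residual.mean()) > q * residual.std())

rng = np.random.default_rng(0)
segments = [10 + np.sin(2 * np.pi * np.arange(336) / 24)
            + 0.1 * rng.standard_normal(336) for _ in range(4)]
segments[2][100] += 4.0                            # inject an obvious anomaly

H = generate_basis(clean(segments[0]))             # initialisation on cleaned data
for m, x in enumerate(segments):
    r = x - reconstruct(x, H)
    print(m, detect(r))                            # anomalous points per segment
    if r.std() > 0.5:                              # error bound exceeded:
        H = generate_basis(clean(x))               #   re-initialise the basis
    else:
        H = 0.9 * H + 0.1 * generate_basis(x)      #   crude stand-in for evolution
```

The real framework replaces each stand-in with the corresponding component described in the following subsections.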
[Figure 3 flowchart: at each time index T_0, ..., T_n the segment x_m is approximated by the current basis (r_m = x_m − W_m H_m); data cleaning and basis generation occur at T_0, basis update at later indices, followed by anomalous-point detection (P_m) and anomaly clustering (A_m).]

Figure 3: A bird's-eye view of the BasisEvolution framework for a single link. At time index T_0, we clean the data segment x_0 and obtain the "anomaly free" data segment x_0'. Then we derive a basis set H_0 from x_0'. Define r_0 to be the approximation error of x_0' reconstructed by H_0. Since x_0' is entirely normal, no anomaly is expected to be found in that residual, so we use x_0 instead of x_0' for anomaly detection. The projection of x_0 on H_0 is W_0, and r_0 is the difference between x_0 and the approximation series W_0 H_0. Through the anomaly detection process we obtain anomalous points P_0 and the clustered anomalies A_0. At subsequent time indices (e.g., T_m), we evolve the old basis H_(m−1) into H_m, calculate the reconstruction error r_m, and repeat the anomaly detection process to find anomalous points and anomalies.

3.1. Data cleaning

For the whole BasisEvolution framework, we first need a small set of "normal" historical data, so that the basis generation process can access the "normal" patterns and generate the initial basis functions. However, in practice the data we collect are contaminated by anomalies.

One common way to deal with this is to label the data based on experience and professional advice. This is time- and effort-consuming, especially when the data set grows large, which makes it difficult to implement in practice.
Since the data set we use has not been labelled, we instead apply standard, existing anomaly detectors to automatically label the target data set. Several issues remain, which we address as follows.

1. Cleaning the data with multiple iterations.
Some small anomalies may be hidden by a larger anomaly. Examples can be found in Figure 4, in which the simulated time series includes three types of anomalies: a spike (at time t1), a contextual anomaly (at time t2) and a collective anomaly (over the time interval ∆t3) [23]. If we bound the normal region by thresholding (such as [−2, 2] in Figure 4), only the spike anomaly is found, while the other two anomalies remain hidden.

Thus, successive layers of cleaning may be required, as shown in Figure 5, in which the anomaly detection step is repeated many times. In each iteration, the detected points are removed and filled with "normal values"; the modified traffic data in the next iteration are then more "normal", so more hidden anomalies are found. The whole process is repeated until no more anomalies are found or the iteration count exceeds a preset threshold (we set maxC = 5 according to Sec. 4.3.3).

In each iteration, we remove all detected anomalous points to eliminate their impact on the following iterations, but this leaves gaps that must be filled for most algorithms. We fill the gaps using a weighted average of periodically-adjusted linear interpolates and the predictions provided by NNBP. Appropriate weights were chosen by comparing the interpolated data with non-missing points at equivalent times of day and week in other segments of the data.

[Figure 4 plot: a simulated time series ("Simulation values" versus "Time", 0–3500) with a spike at t1, a contextual anomaly at t2, a collective anomaly over ∆t3, and a typical point t4 for comparison.]

Figure 4: Different anomaly types in the simulated time series. The traffic point at time t1 is a spike. The context of the point at t2 differs from that at t4, even though the two values are similar, so it is a contextual anomaly. Any single point in the time interval ∆t3 is normal; however, all the points together form a collective anomaly [23]. For most anomaly detection methods, e.g., threshold-based techniques, a spike is easier to find than the others, which indicates that other anomalies can hide behind a larger anomaly.

2. Applying a combination detector to find anomalies.
For each iteration in Figure 5, we tested the performance of three detection approaches: PCA, ANN, and their combination. The results indicate that no single anomaly detector is perfect, and that the combination detector works better (more details are given in Sec. 4.3.2), so we combine the detection results of several anomaly detection algorithms in each iteration.

Suppose the set of anomalies detected by the d-th detection method is A_d, for d ∈ {1, 2, ..., D}; then we choose our anomalies to be the set

\[ \bigcup_{d=1}^{D} A_d, \]

the goal being to ensure that we have the largest possible intersection with the true set of anomalies. In this way, we collect as many types of anomalies as possible, so that the cleaning process is less sensitive to polluted data and can revise the whole traffic more effectively.

The mechanism of the combination detector is similar to the pooling strategy used in information retrieval [31]. With enough systems and a sufficiently deep pool, each system contributes a number of documents to the pool for assessment, and the pool as a whole provides more reliable and relevant documents for each query. For our combination detector, points in the pool are more likely to be judged real anomalies, since the pool is the union of the detection results of each single method. Meanwhile, the rest of the traffic is regarded as "typical".

[Figure 5 flowchart: x_0 → anomaly detection algorithms → anomalies found? → remove anomalies and interpolate → repeat until no anomalies remain or count > maxC → x_0'.]

Figure 5: The data cleaning algorithm. In practice, we apply the combination detector (PCA+NNBP, see Sec. 4.3.2) to x_0 to find anomalies. If anomalies are detected, we remove them, replace the gaps using interpolation and prediction, and then repeat the detection until no more anomalies are found or the iteration count exceeds the preset maxC (maxC = 5, according to Sec. 4.3.3).

We do not have ground-truth data for real network conditions, so it is hard to measure the performance of the data cleaning process directly. However, we can assess it through the performance of the overall algorithm (see Sec. 4.3.1).
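As an illustration of the iterative cleaning loop above, the following sketch repeatedly applies two placeholder detectors (simple stand-ins for the PCA and NNBP detectors used in the paper), takes the union of their detections, and fills the removed points by linear interpolation, for at most maxC iterations. The detectors and the plain interpolation-based gap filling are assumptions for illustration; the paper's gap filling also uses NNBP predictions.

```python
import numpy as np

def threshold_detector(x, q=3.0):
    return set(np.flatnonzero(np.abs(x - x.mean()) > q * x.std()))

def seasonal_detector(x, cycle=24, q=3.0):
    # compare each point against the mean for its hour-of-day
    prof = np.array([x[i::cycle].mean() for i in range(cycle)])
    d = x - prof[np.arange(x.size) % cycle]
    return set(np.flatnonzero(np.abs(d) > q * d.std()))

def clean_segment(x, max_c=5):
    x = x.astype(float).copy()
    for _ in range(max_c):
        anomalous = threshold_detector(x) | seasonal_detector(x)  # union of detectors
        if not anomalous:
            break
        idx = np.array(sorted(anomalous))
        keep = np.setdiff1d(np.arange(x.size), idx)
        x[idx] = np.interp(idx, keep, x[keep])    # fill gaps from remaining points
    return x

rng = np.random.default_rng(2)
t = np.arange(24 * 14)
x = 10 + np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(t.size)
x[100] += 5.0; x[101] += 2.0                      # a large and a small anomaly
print(np.flatnonzero(np.abs(clean_segment(x) - x) > 1e-9))   # revised points
```

Successive iterations matter here: the small anomaly at index 101 is easier to find once the large one at index 100 has been removed and interpolated over.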
3.2. Basis generation

Our goal here is to find a small set of representative functions to use as a basis for our approximation of the space of typical traffic. The basis generation problem can be expressed as the computation of a basis set that minimises the reconstruction error:

\[ \operatorname*{argmin}_{L,\, w,\, h^{(l)}} \Big\| x - \sum_{l=1}^{L} w^{(l)} h^{(l)} \Big\|, \tag{1} \]

where x is the input signal, L is the number of basis vectors, w is a vector of weights, and the h^{(l)} are the basis vectors.

The problem is too open, however, in that we have complete choice of basis, and could therefore achieve a trivial, perfect reconstruction with h^{(1)} = x, w^{(1)} = 1, and L = 1.

We know that traffic exhibits approximate periodicity (with known periods), and so we insist that the basis vectors are periodic. We list the periods c_i, i = 1, 2, ..., I, in increasing order of the number of measurements they span. In principle, the c_i should be integer multiples of the cycles of network traffic; here we use only daily and weekly patterns, for simplicity. For instance, if there were 24 measurements per day, the typical 24-hour and 1-week cycles would be represented by c_1 = 24 and c_2 = 168. We select the third and last element, c_I, to span the largest whole number of weeks that the data segment includes. We denote by h^{(c_i,j)}, j = 1, 2, ..., J_i, the j-th basis vector with period c_i, i.e., for each element

\[ h^{(c_i,j)}_{k} = h^{(c_i,j)}_{k + a c_i}, \]

for all valid k and a ∈ Z⁺. We write

\[ H^{(c_i)} = \big[ h^{(c_i,1)} \,\big|\, h^{(c_i,2)} \,\big|\, \cdots \,\big|\, h^{(c_i,J_i)} \big], \]

and the basis H = [H^{(c_1)} | ⋯ | H^{(c_I)}].

The reconstruction problem is then defined by a new objective function:

\[ \operatorname*{argmin}_{L,\, w,\, h^{(c_i,j)}} \Big\| x - \sum_{i=1}^{I} \sum_{j=1}^{J_i} w^{(c_i,j)} h^{(c_i,j)} \Big\|, \tag{2} \]

where the size of the basis is L = Σ_{i=1}^{I} J_i.

As in OMP, we progressively refine the approximation of the traffic measurements with an iterative procedure across all cycles that pursues a good solution. Our algorithm starts with the shortest period and progresses through the periods in order. In each phase, we fix all of the variables except those of the period we are currently working on, with the higher-period functions implicitly set to zero, so that the residual in phase n is defined by r^{(1)} = x and, for n > 1,

\[ r^{(n)} = x - \sum_{i=1}^{n-1} \sum_{j=1}^{J_i} w^{(c_i,j)} h^{(c_i,j)}, \]

and we optimise the h^{(c_n,j)} and their respective weights by solving

\[ \operatorname*{argmin}_{J_n,\, w^{(c_n,j)},\, h^{(c_n,j)}} \Big\| r^{(n)} - \sum_{j=1}^{J_n} w^{(c_n,j)} h^{(c_n,j)} \Big\|. \tag{3} \]

One tool for solving Equation (3) is the SVD. The first step in applying the SVD is to transform the residual signal r^{(n)} into a matrix whose rows are sliding windows (of width one period, c_n) of the signal, i.e.,

\[ R^{(n)} = \begin{bmatrix} r^{(n)}_{t_0} & r^{(n)}_{t_1} & \cdots & r^{(n)}_{t_{c_n-1}} \\ r^{(n)}_{t_1} & r^{(n)}_{t_2} & \cdots & r^{(n)}_{t_{c_n}} \\ \vdots & \vdots & \ddots & \vdots \\ r^{(n)}_{t_{b-c_n}} & r^{(n)}_{t_{b-c_n+1}} & \cdots & r^{(n)}_{t_{b-1}} \end{bmatrix}. \tag{4} \]

The new matrix is (b − c_n + 1) × c_n. In all scenarios considered here, b − c_n + 1 > c_n.

We then apply the SVD to R^{(n)} to obtain the decomposition

\[ R^{(n)} = U \Sigma V^T, \tag{5} \]

where U and V are unitary matrices, and Σ is a diagonal matrix containing the singular values σ_j of R^{(n)} for the cycle c_n, arranged in decreasing order.

To understand the SVD's use in matrix approximation, consider the following interpretation. The matrix Σ is diagonal, so the SVD of R^{(n)} can be rewritten

\[ R^{(n)} = \sum_{j=1}^{\ell} u_j \sigma_j v_j^T, \tag{6} \]

where u_j and v_j are the j-th columns of U and V respectively, and ℓ = min{c_n, b − c_n + 1}.

We then create an approximation R̃ by retaining only the terms corresponding to the largest J_n singular values:

\[ \tilde{R} = \sum_{j=1}^{J_n} u_j \sigma_j v_j^T. \tag{7} \]

The energy proportion of u_j is

\[ E_j = \frac{\sigma_j^2}{\sum_{j=1}^{\ell} \sigma_j^2}, \tag{8} \]

and we choose J_n such that E_{J_n} is the closest value to, but still larger than, the preset threshold γ = 0.3.

Since the vectors in the first J_n columns of U carry the largest energy of the signal, we directly associate h^{(c_n,j)} = u_j, and obtain the corresponding weight

\[ w^{(c_n,j)} = \frac{h^{(c_n,j)}\, r^{(n)}}{\big\| r^{(n)} \big\|}. \tag{9} \]

We then construct a new residual for the next period, and solve the new problem as above.

We call this algorithm SVD on Specific Cycles (SVD_SC); it is described in detail in Algorithm 1. Note that in Step 6, and also in subsequent algorithms, we use Matlab's notation for vector concatenation. The entire algorithm is applied to the first cleaned signal, i.e., x = x_0'. Moreover, when the traffic pattern of a data segment changes too much, we also generate a new basis through the basis generation process instead of updating the old one.

Algorithm 1 is based on the SVD, and its computation time is O(I × c_n² × (b − c_n + 1)). In practice, I is the number of levels of cyclicity (I = 3 for a two-week data segment). According to Sec. 4.3.1, the data cleaning process determines the performance of the derived basis, so the data segment should be cleaned before a new basis is generated. For a two-week data segment, producing a new basis takes about 6.008 s on average, of which the data cleaning process takes about 6 s (the NNBP algorithm in the data cleaning process is time-consuming) and the basis generation process takes about 0.008 s. The experiment was repeated 100 times in Matlab R2012b on Mac OS X 10.9.5 with a 2.7-GHz Intel Core i5.
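The following NumPy sketch illustrates one reading of the SVD-on-specific-cycles construction (Eqs. (4)–(8)): for each cycle length it forms the sliding-window matrix, keeps singular directions while their energy proportion exceeds a threshold γ, tiles the corresponding periodic pattern over the segment, and subtracts it before moving to the next cycle. It is an interpretation, not a transcription of Algorithm 1: in particular, the periodic functions are taken from the right singular vectors and the weights are simple least-squares projections rather than Eq. (9); the cycle lengths and γ are also illustrative.

```python
import numpy as np

def svd_sc(x, cycles=(24, 168), gamma=0.3):
    # requires NumPy >= 1.20 for sliding_window_view
    b = x.size
    residual = x.astype(float).copy()
    basis, weights = [], []
    for c in cycles:
        R = np.lib.stride_tricks.sliding_window_view(residual, c)   # (b - c + 1, c)
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        energy = s**2 / np.sum(s**2)                                 # Eq. (8)
        J = max(1, int(np.sum(energy > gamma)))                      # functions kept
        for j in range(J):
            h = np.tile(Vt[j], b // c + 1)[:b]                       # periodic function
            w = (h @ residual) / (h @ h)                             # least-squares weight
            basis.append(h); weights.append(w)
            residual = residual - w * h
    return np.array(basis), np.array(weights), residual

rng = np.random.default_rng(3)
t = np.arange(24 * 14)                                               # two weeks, hourly
x = 10 + np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 168)
x += 0.1 * rng.standard_normal(t.size)
H, W, r = svd_sc(x)
print(H.shape, np.round(np.abs(r).mean(), 3))   # residual drops to roughly noise level
```

The point of the construction is visible in the output: one or two functions per cycle are enough to reduce the residual to the noise level, which is why the resulting basis can be kept very small.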
Algorithm 1 SVD on specific cycles (SVD_SC).
Input: the traffic segment x; the traffic aggregation time ∆t (days); the size of x, b; the threshold of basis energy proportion, γ.
Output: the basis H and weights W; the final residual r.
1: Initialise the cycle set c ← {1/∆t, 7/∆t, b};
2: Set r^(1) ← x;
3: for n = 1, ..., I do
4:   // Create R^(n) from r^(n) according to Equation (4):
5:   for row = 1, 2, ..., b − c_n + 1 do
6:     Slide a window of length c_n over r^(n) to form R^(n) row by row, R^(n) ← [R^(n); [r^(n)_{t_{row−1}}, ..., r^(n)_{t_{row+c_n−2}}]];
7:   end for
8:   Apply the SVD to R^(n), i.e., find U, Σ and V such that R^(n) = U Σ V^T;
9:   Set j ← 1;
10:  while the energy rate E_j ≥ γ do
11:    j ← j + 1;
12:  end while
13:  Set J_n ← j;
14:  The first J_n columns of U constitute the function family H^(c_n) ← {u_1, ..., u_{J_n}};
15:  Calculate the corresponding weights according to Equation (9), w^(c_n) ← [w^(c_n,1), ..., w^(c_n,J_n)]^T;
16:  r^(n+1) ← r^(n) − H^(c_n) w^(c_n);
17: end for
18: r ← r^(I+1);

3.3. Basis update

For a new data segment x_m, we evolve the old basis H_(m−1) into the new one, H_m. By evolving it we avoid additional computational cost; we also want the basis to change only slowly, reflecting slow changes in traffic patterns, thus avoiding the impact of new anomalies in the data, and avoiding the need to clean this new data.

We formulate the basis update problem by retaining the same number of basis vectors L, and then considering each in turn, keeping the others fixed. That is, for k = 1, ..., L we solve the problems

\[ \operatorname*{argmin}_{\lambda_d,\, w_m^{(k)},\, h_m^{(k)}} \Big\| x_m - \sum_{l \neq k} w_m^{(l)} h_m^{(l)} - w_m^{(k)} h_m^{(k)} \Big\| + \lambda_d\, d\big( h_m^{(k)} \,\big|\, H^{(k)}_{(m-1)} \big), \tag{10} \]

where h_m^{(k)} is the new basis vector we are seeking, H^{(k)}_{(m−1)} represents the history of basis vectors {h_1^{(k)}, ..., h_{(m−1)}^{(k)}}, d is a distance metric from the historical values to the present value, and the parameter λ_d trades off a precise fit to the measured data against the goal of remaining similar to the old basis.

The first term in Equation (10) involves the representation error e_k = x_m − Σ_{l≠k} w_m^{(l)} h_m^{(l)}, i.e., the error when the k-th basis vector is removed (and the other terms are kept fixed).

The second term penalises changes to the basis. One good way to measure this change is the deviation from the historical mean. In addition, the new basis vector should be similar to the immediately preceding one, since it evolves directly from it. In practice, we therefore define the distance d to include two parts: the similarity to the immediately preceding basis vector and the deviation from the historical mean. The form of d is given in Equation (11), where the first term measures the similarity to the previous basis vector, the second term evaluates the deviation from the historical mean, and the parameters λ_1 and λ_2 are weighting coefficients trading off similarity against deviation:

\[ d\big( h_m^{(k)} \,\big|\, H^{(k)}_{(m-1)} \big) = \lambda_1 \big\| h_m^{(k)} - h_{(m-1)}^{(k)} \big\| + \lambda_2 \big\| h_m^{(k)} - \overline{h}_{(m-1)}^{(k)} \big\|, \tag{11} \]

where

\[ \overline{h}_{(m-1)}^{(k)} = \frac{1}{m} \sum_{i=0}^{m-1} h_i^{(k)}. \]

The j-th element of h_m^{(k)} is also constrained, to maintain smoothness, by enforcing

\[ \min_i \big( h_i^{(k)}(j) \big) \le h_m^{(k)}(j) \le \max_i \big( h_i^{(k)}(j) \big), \quad i \in [0, m-1]. \tag{12} \]

Define H^min_(m−1) and H^max_(m−1) as the collections of the historical element-wise minima and maxima of the basis vectors, respectively. Hence,

\[ H^{\min}_{(m-1)} = \big\{ \min h^{(k)}_{(m-1)} \big\}_{k \in [1,L]}, \qquad H^{\max}_{(m-1)} = \big\{ \max h^{(k)}_{(m-1)} \big\}_{k \in [1,L]}, \]

where

\[ \min h^{(k)}_{(m-1)} = \Big\{ \min_{i \in [0,m-1]} \big( h_i^{(k)}(j) \big) \Big\}_{j \in [1,b]}, \qquad \max h^{(k)}_{(m-1)} = \Big\{ \max_{i \in [0,m-1]} \big( h_i^{(k)}(j) \big) \Big\}_{j \in [1,b]}. \]

The result is a new basis that evolves slowly to match changes in traffic patterns, but is unresponsive to anomalies, which are inherently rare and erratic (or else they would not be anomalous).

We derive h_m^{(k)} and w_m^{(k)} by alternately fixing one set of variables and updating the other, iteratively [32, 33]. We transform the residual e_k into a matrix E whose rows have the same length as h_m^{(k)}, i.e., c. The data segments cover traffic for several cycles, so the new matrix E has K = b/c rows, and we can write it as

\[ E = \begin{bmatrix} e_k(1) & \cdots & e_k(c) \\ e_k(c+1) & \cdots & e_k(2c) \\ \vdots & \ddots & \vdots \\ e_k((K-1)c+1) & \cdots & e_k(Kc) \end{bmatrix}. \tag{13} \]

For each basis vector h_m^{(k)}, we initialise a vector

\[ w = w_m^{(k)} y, \]

where y is an IID standard normal vector of length K. We then iteratively apply the following update rules:

\[ h_m^{(k)} \leftarrow h_m^{(k)} \circ \frac{ w^T E + \lambda_d \big( \lambda_1 h_{(m-1)}^{(k)} + \lambda_2 \overline{h}_{(m-1)}^{(k)} \big) }{ w^T w\, h_m^{(k)} + \lambda_d \big( \lambda_1 h_m^{(k)} + \lambda_2 h_m^{(k)} \big) }, \tag{14} \]

\[ w \leftarrow w \circ \frac{ E\, h_m^{(k)} }{ w\, \big(h_m^{(k)}\big)^T h_m^{(k)} }, \tag{15} \]

where ∘ and the division are element-wise (Hadamard) multiplication and division, respectively. After the algorithm has converged, we take the average of w as the weight:

\[ w_m^{(k)} \leftarrow \frac{1}{\kappa} \sum_{i=1}^{\kappa} w_i, \tag{16} \]

where w_i ∈ w. Algorithm 2 gives more detail.

Finally, if the residual error r cannot be reduced below a preset bound (e.g., the average approximation error of the previous basis on new data) within a certain number of iterations, the update algorithm has not converged. This non-convergence indicates that the behaviour of the new data segment differs greatly from past behaviour, and we should apply the data cleaning and basis generation processes to produce a completely new basis set instead of updating the old one.

The time complexity of the basis update algorithm is O(L × c × β × ζ). For a two-week data segment, L is small (generally 3, corresponding to the three levels of cyclicity), and c is the length of the basis vectors (no longer than two weeks). Meanwhile, we set β and ζ to small or moderate values (in the later experiments both equal 10, for simplicity). Hence the time cost is not large: in practice, the basis update process takes about 0.6 s to evolve the old basis. The experimental environment is the same as before: the code was run 100 times in Matlab R2012b on Mac OS X 10.9.5 with a 2.7-GHz Intel Core i5. This indicates that updating the old basis takes less time than generating a new one, which makes it more suitable for online anomaly detection.

Algorithm 2 Basis update algorithm.
Input: a new traffic segment x_m; the size of x_m, b; the historical basis set H_(m−1); the historical minima H^min_(m−1); the historical maxima H^max_(m−1); the historical mean H̄_(m−1); the old weight set W_(m−1); the threshold on the residual, α; the maximum numbers of iterations, β and ζ.
Output: the new basis H_m and weight set W_m; the final residual r.
1: H_m ← H_(m−1), W_m ← W_(m−1);
2: Set the iteration counters n_1 ← 0 and n_2 ← 0;
3: Set the residual e_1 ← x_m;
4: while n_1 ≤ β and r̄ > α do
5:   for k = 1, 2, ..., L do
6:     Compute the residual e_k ← x_m − Σ_{l≠k} w_m^(l) h_m^(l);
7:     while n_2 ≤ ζ do
8:       Calculate the residual matrix E according to Equation (13);
9:       Update h_m^(k) according to rule (14);
10:      c ← the length of h_m^(k);
11:      for j = 1, 2, ..., c do
12:        if h_m^(k)(j) < min h^(k)_(m−1)(j) then
13:          h_m^(k)(j) ← min h^(k)_(m−1)(j);
14:        else if h_m^(k)(j) > max h^(k)_(m−1)(j) then
15:          h_m^(k)(j) ← max h^(k)_(m−1)(j);
16:        end if
17:      end for
18:      Update w_m^(k) according to rules (15) and (16);
19:      n_2 ← n_2 + 1;
20:    end while
21:    H_m ← H_m ∪ h_m^(k), W_m ← W_m ∪ w_m^(k);
22:  end for
23:  Compute the approximation error r ← x_m − Σ_{k=1}^{L} w_m^(k) h_m^(k) and r̄ ← (1/b) Σ_{i=1}^{b} r(i);
24:  n_1 ← n_1 + 1;
25: end while
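For concreteness, the following is a condensed sketch of the alternating multiplicative updates (Eqs. (13)–(16)) for a single basis vector of length c, assuming the residual length is a multiple of c. The regularisation weights, iteration counts and toy data are assumptions; the paper's Algorithm 2 additionally loops over all L basis vectors and monitors the overall residual against the bound α.

```python
import numpy as np

def update_basis_vector(e_k, h_prev, h_mean, h_min, h_max, w0,
                        lam_d=0.1, lam1=0.5, lam2=0.5, iters=10):
    c = h_prev.size
    E = e_k.reshape(-1, c)                      # Eq. (13): K x c residual matrix
    K = E.shape[0]
    h = h_prev.copy()
    # initial weights: w0 times random values, kept positive here
    w = w0 * np.abs(np.random.default_rng(0).standard_normal(K))
    for _ in range(iters):
        # multiplicative update of h (Eq. (14)): pulled towards the data,
        # the previous basis vector and the historical mean
        num_h = w @ E + lam_d * (lam1 * h_prev + lam2 * h_mean)
        den_h = (w @ w) * h + lam_d * (lam1 + lam2) * h + 1e-12
        h *= num_h / den_h
        h = np.clip(h, h_min, h_max)            # smoothness constraint, Eq. (12)
        # multiplicative update of the per-block weights (Eq. (15))
        w *= (E @ h) / (w * (h @ h) + 1e-12)
    return h, w.mean()                          # Eq. (16): averaged weight

c = 24
rng = np.random.default_rng(4)
h_prev = 1 + 0.5 * np.sin(2 * np.pi * np.arange(c) / c)
h_mean, h_min, h_max = h_prev, 0.5 * h_prev, 1.5 * h_prev
e_k = np.tile(1.1 * h_prev, 14) + 0.05 * rng.standard_normal(14 * c)
h_new, w_new = update_basis_vector(e_k, h_prev, h_mean, h_min, h_max, w0=1.0)
print(np.round(w_new, 2), np.round(np.abs(h_new - h_prev).max(), 3))
```

On this toy data the vector drifts only part of the way towards the new pattern, which is the intended behaviour: the penalty terms keep the evolved basis close to its history while still tracking slow change.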
3.4. Anomaly detection Many researchers stop their work after detecting the anomalous points by
397
thresholding the residual. However, many detected points may be associated
398
with the same underlying anomaly. It is confusing to network operators to
399
present a single event as many detections, and there are distinct advantages
400
as well, in combining individual detections before deciding which anomalies are
401
most important: for instance, an anomaly that develops more slowly, but over
402
a long time may be more important that a sharp, but singular event.
CR IP T
396
In this paper, we try to associate detections so that we can present a single
404
anomalous event, along with information about the duration of the event. We
405
do so through a clustering process that we describe below.
406
3.4.1. Anomalous-point detection
AN US
403
We detect anomalous points by thresholding. For any approximation resid-
408
ual error r, we set the upper and lower limits [p − q ∗ σ, p + q ∗ σ], where p and
409
σ are the mean and standard deviation of r, and q is a parameter chosen to
410
fix the false-alarm probability (assuming a normal distribution of the residual).
411
Any points of r that fall outside these thresholds are declared to be anomalous.
412
3.4.2. Anomaly clustering
ED
M
407
Once we have anomalous data points, we use a bottom-up hierarchical clus-
414
tering algorithm to group points, as shown in Algorithm 3. Similar to Algorithm
415
1, we also use the Matlab notation to represent the concatenate process of time
416
series t(m) as shown in step 12,18 and 25.
CE
PT
413
For most anomaly types, the times of anomalous points caused by the same
418
events are close to each other, which means they occur at short-time anomalies.
AC
417
419
In clustering, we measure the distance between clusters by the absolute time
420
gaps between their centers. Anomalous points with smaller time gaps will tend
421
to be grouped into one large cluster. This clustering method has the ability to
422
find short-time anomalies.
24
ACCEPTED MANUSCRIPT
Since we have not included any specific properties of network traffic for
424
clustering, this clustering algorithm is generally and equally applicable to other
425
detectors given the time locations of anomalous points, and we shall test it as
426
part of these alternative algorithms.
CR IP T
423
However, for anomalies such as FISSD type of anomalies, they often last
428
long and do not have the short-time attribute we have mentioned above. In this
429
condition, Algorithm 3 is no longer suitable. This long-time anomaly clustering
430
problem has been left for further research in the future. One primary idea
431
to address this problem is to combine multi scales into clustering, inspired by
432
extracting multiscale features for anomaly detection [15, 16, 26, 28].
433
4. Experiments with Synthetic Data
AN US
427
In this section, we apply several anomaly detection approaches to synthetic
435
data. The simulation of synthetic data is extremely important for the validation
436
of anomaly detection algorithms where ground truth is difficult to find [25]. In
437
our case we aim to assess very small false-alarm probabilities, necessitating large
438
amounts of tests. In addition, we want to test the ability of the algorithm to
439
determine the duration of the anomalies. As far as we know, this information
440
has been labelled in no dataset.
ED
M
434
Our approach to simulation intends to highlight the features of different
442
techniques. We make no claim that the simulation is completely realistic, only
443
that it illustrates the properties of different anomaly detection techniques.
PT
441
Our analysis uses three pairs of metrics, as described in subsection 4.1, in-
445
tended to explore the different characteristics of the anomaly detection algo-
446
rithms, and compares our algorithm to four other methods:
AC
CE
444
1. OMP-on-dictionary: This method, motivated by BasisDetect [21], uses Orthogonal Matching Pursuit (OMP) on a dictionary of discrete cosine time and frequency atoms. Similar to BasisEvolution, this model determines L atoms, ΦL , with the largest coefficients αL at the first data segment x0 , Then, for the new data segment xm , we represent it with ΦL 25
ACCEPTED MANUSCRIPT
12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22:
24:
25:
AC
26: 27: 28: 29: 30: 31:
(m−1)
+t
(m)
;
(m−1)
};
t(m) ← t(m) , tk k ← k + 1; end if end for (m−1) if d(i,j) > ι then V (m) ← V (m) ∪ {tj
CE
23:
(m−1)
t
j ; tk ← i 2 (m−1) (m−1) (m) (m) V ←V ∪ {t , tj }; h h i ii (m) (m) (m) ; t ← t , tk k ← k + 1; i ← i + 1; else (m−1) V (m) ← V (m) ∪ {ti }; (m) (m−1) tk ← thi ;h ii
M
11:
(m)
≤ ι then
ED
10:
(m−1)
if d(i,j)
PT
9:
AN US
CR IP T
Algorithm 3 Bottom-up hierarchical clustering based anomaly detection. Input: Time series of the anomalous traffic measurements, t = [t1 , . . . , tn ]; A distance threshold, ι; The maximum number of iterations, η Output: The set of anomalies (clusters) and the corresponding center set, V and U ; 1: t(0) ← t; 2: k (0) ← n; 3: m ← 1; 4: while m ≤ η do 5: k ← 1; 6: for i= 1, . . . , k (m−1) − 1 do 7: j ← i + 1; (m−1) (m−1) (m−1) 8: The distance d(i,j) ← ti − tj ;
(m) tk (m)
(m−1) ← tj ; h h ii (m) (m) ← t , tk ;
t k ← k + 1; end if k (m) ← k; m ← m + 1; end while V ← V (m) ; U ← t(m) ;
26
ACCEPTED MANUSCRIPT
by solving argmin kxm − αL0 ΦL k ,
(17)
αL0
where αL0 represents the projection magnitudes of xm on ΦL . We then
448
get the residual series rOM P = xm − αL0 ΦL , and detect anomalous points
449
as subsection 3.4 describes.
CR IP T
447
450
2. PCA (Principle Components Analysis): This approach finds the k eigen-
451
vectors with the largest eigenvalues for the primary data, and reconstructs
452
the data segment xm with those eigenvectors. P CA
The residual series is
P CA c = xm − xc is m , where x m is the reconstructed signal and r
r
454
used for detection as described above.
AN US
453
455
3. RPCA (Robust PCA): This method is the robust version of PCA [12]
456
and uses the median to centre the data in the transformation process.
457
The method we used for RPCA is the Crux-Ruiz (CR) algorithm [12].
458
Otherwise it is similar to PCA.
4. NNBP (Neural Network using Back Propagation): The NNBP approach
460
uses an artificial neural network algorithm to predict traffic [5], and we
461
calculate residuals as the difference between the predicted and actual traf-
462
fic. The principle of choosing comparison algorithms is to include as wide
463
range of ideas as possible. Hence, we select this approach even though it
464
is the least successful (as demonstrated below).
PT
ED
M
459
In each case the initial set of data is cleaned as described in our approach, and
466
we use the same clustering process for each to provide a fair comparison.
CE
465
We chose these four methods because they cover a wide range of concepts
468
used in anomaly detection field, and have been discussed in many references.
469
The particular examples are relatively simple exemplars of the concept they
AC
467
470
espouse, so that the results are not complicated by myriad implementation de-
471
tails. For example, we do not attempt to compare to wavelet detection methods
472
as they require much fine tuning, for instance, of the particular wavelet basis
473
being used.
27
ACCEPTED MANUSCRIPT
474
4.1. Detection error metrics In many works on anomaly detection, the algorithms are assessed by their
476
false-positive and negative probabilities, considered point-wise on the original
477
sequence, i.e., each point is either an anomaly or not, and we compare our
478
classification to the ground truth. However, in this work, we seek to understand
479
performance in more practical terms: we want to detect the underlying cause
480
of a sequence of anomalous points in the data, and we prefer to return a single
481
detection to an operator to reduce the amount of low-level work that an operator
482
needs to accomplish before he/she can start a diagnosis, thus facilitating the
483
diagnosis. We want to tell the operator about one anomaly rather than a set
484
of anomalous points. To assess the quality of such results we need additional
485
metrics. As with false-negative/positive probabilities, we wish to present these
486
in pairs that show the trade-off between sensitivity and specificity. We present
487
two additional pairs of such metrics.
AN US
CR IP T
475
We label the set of true anomalous time points G. They are clustered into
488
The set of detected anomalous points for method r is denoted D(r) , and
490
these are clustered into groups cD(r) using Algorithm 3.
492
493
494
The metrics we consider are listed below: 1. We still include the standard point-wise metrics: Detection Probability on Anomalous Points (DP AP) and False-Alarm Probability on Non-
anomalous Points (FAP NP):
AC
CE
495
ED
491
M
groups cG, where each group corresponds to one injected anomaly.
PT
489
DP AP
=
F AP N P
=
G ∩ D(r) , |G| (r) D − G ∩ D(r) N − |G|
(18) ,
(19)
496
where N is the total number of data points, and |·| means the cardinality
497
of a set.
498
2. Our second pair of metrics aims to assess whether a detected group of
499
anomalies corresponds to a real anomaly group. As long as the detected 28
ACCEPTED MANUSCRIPT
anomaly groups have real anomalous points, we define them as truly de-
501
tected anomalies. We call these the Consistency Degree of Anomaly groups
502
in Amount (CD AA) and the Aggregation Degree of Anomaly groups in
503
Amount (AD AA). Given that the number of truly detected anomaly
504
groups is NcD(r) , and the number of hits against real anomaly groups
505
is NcG(r) , then we define these by =
AD AA
=
NcG(r) , |cG| N (r) cD . cD(r)
(20) (21)
AN US
CD AA
CR IP T
500
506
CD AA measures the probability of truly detected anomaly groups, while
507
AD AA records the fraction of truly detected anomaly groups across all
508
detected anomalies, under the above definition.
509
3. The above metric uses a rather simplistic notion of a truly detected anomaly group in which it is correct if it covers at least one true anomaly point.
511
Here we aim to quantify the overlap between the true anomalies and their
512
detections, but one true anomaly group may generate multiple detected
513
groups, and vice versa. To do so, we extend the idea of a purity in-
514
dex [34, 35], to define Consistency Degree of Anomaly groups in Dura-
515
tion (CD AD) and Aggregation Degree of Anomaly groups in Duration
516
(AD AD) which are defined for each data point x ∈ G ∩ D(r) . We define
AC 519
ED
(r)
cGx to be the cluster that x truly belongs to, and cDx to be the cluster that x belongs to, detected by method r. Then, we define for each x
CE
518
PT
517
M
510
CD AD(x)
AD AD(x)
=
=
(r) cGx ∩ cDx |cGx |
,
(r) cGx ∩ cDx . (r) cDx
(22)
(23)
We then average this metric over G∩D(r) (the intersection of detected and
29
ACCEPTED MANUSCRIPT
true anomalous points) to obtain global versions of the metrics defined by 1 CD AD = G ∩ D(r)
X
CD AD(x);
(24)
AD AD(x).
(25)
x∈G∩D (r)
1 AD AD = G ∩ D(r)
X
x∈G∩D (r)
CR IP T
520
Now, for all truly detected anomalies, CD AD is a weighted-average de-
522
tection probability, while AD AD is a weighted-average proportion of real
523
anomalous points on the detected anomalies, taking into account the group
524
structure of the anomalous points.
AN US
521
In order to analyze the performance of all five anomaly detection methods
526
and see the trade-offs over these pairs of metrics, we vary the threshold param-
527
eter q from 1.0 to 4.5, and generate Receiver Operator Characteristic (ROC)
528
curves [36], or the equivalent for the new metrics.
529
4.2. Synthetic traffic data
M
Our goal here is to generate traffic with some of the important observed features of real traffic. This will, of course, be a simplification of real traffic, but we have a great deal more control over its details and can see the effects, for instance, of changing the duration of anomalies.

The simulation has two steps: first, we generate our typical traffic, and second, we inject anomalies. The typical traffic is based around a sine function with a 24-h period to represent the strong diurnal cycles of Internet traffic [37]. Noise was added to represent statistical fluctuations around the mean. In this study, we use white Gaussian noise. Gaussianity is a reasonable first approximation for Internet traffic on uncongested links [38], and we do not include correlations (though they are often present in traffic), because uncorrelated traffic represents a worst case for detecting anomalies [39]. We also control the relative size of the noise. In what follows the standard deviation of the noise is 10% of the amplitude of the cycles.
We generate a series containing eight weeks of data which is aggregated every 1 hour, i.e., 1,344 data points for each simulation. Each series is broken into 4 data segments, to be analyzed in blocks as described above. We experimented with other data segment lengths, and although the results and general trends were similar, a two-week segment produced the best results of the lengths tested.
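As an illustration (a minimal sketch under assumed amplitude and baseline values, not the authors' simulation code), the typical traffic described above can be generated as follows.

```python
import numpy as np

def typical_traffic(n_weeks=8, amplitude=1.0, baseline=2.0,
                    noise_frac=0.10, seed=0):
    """Diurnal sine with a 24-hour period plus white Gaussian noise whose
    standard deviation is 10% of the cycle amplitude, sampled hourly."""
    rng = np.random.default_rng(seed)
    hours = np.arange(n_weeks * 7 * 24)
    cycle = baseline + amplitude * np.sin(2 * np.pi * hours / 24.0)
    noise = rng.normal(0.0, noise_frac * amplitude, size=hours.size)
    return cycle + noise

traffic = typical_traffic()        # 1,344 hourly samples (eight weeks)
segments = traffic.reshape(4, -1)  # four two-week data segments
```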
The anomaly model in this work can be represented as A(τ, θ, ϑ, ψ, φ), where τ, θ, ϑ, ψ and φ are parameters for generating and injecting anomalies.

The parameter τ indexes a set of anomaly types. Each anomaly type is defined by a shape function with width W, which represents the minimum width for that anomaly type (otherwise, all anomalies could collapse into spikes), and amplitude M, which is defined relative to the size of the typical traffic.
1. FIFD (Fast Increase, Fast Decrease): these are the classical anomalies most often examined in prior work. They might be caused by system or device outages, or DoS attacks. Often, past literature has focused on “spikes”, which are very short FIFD anomalies.
2. SIFD (Slow Increase, Fast Decrease): these might arise through epidemic processes (e.g., worms or viruses [40]), which grow over time, and are eventually fixed, dropping the system back into its standard state.
3. FISSD (Fast Increase, Stable State, Slow Decrease): these arise typically as the result of phenomena such as flash crowds, which appear very suddenly, are maintained for some time, and then gradually disappear.
4. FISD (Fast Increase, Slow Decrease): these represent, for completeness, a special case of the previous anomaly with no stable state.
We also include negative anomalies as well as positive, i.e., traffic decreases as well as traffic increases. Each simulation includes only one type of anomaly. Most of our reported results here focus on FIFD anomalies, because most of the results for other anomaly types mirror the ones reported.
Parameter θ determines the number of anomalies that are injected. We set θ following the Poisson distribution, θ ∼ Poisson(λ), where λ = 3 per data segment, i.e., on average there will be 12 anomalies in each simulation. It is important that there are multiple anomalies per segment, because one of the effects we seek to include in our simulations is the interaction, within the detector, between anomalies.

The parameter ϑ controls the starting locations of anomalies. Anomalies, by definition, could happen at any time with no preference, so ϑ ∼ Uniform(0, b), where b is the length of the data segment.
The parameter ψ changes the width of anomalies. We dilate the anomaly type function by a factor W + ψ. In our experiments we consider a wide range of values ψ = {0, 4, 10, 20, 40}.

The parameter φ updates the magnitude of anomalies: we scale the shape function of the anomaly type by this factor. In our experiments we have considered φ = {0.1, 0.3, 0.5, 0.7, 1}, resulting in magnitudes φM.
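The injection procedure can be sketched as follows. This is an assumption-laden illustration: FIFD anomalies are modelled as rectangular pulses, and the minimum width W and anomaly amplitude M are placeholder values, since the paper's exact shape functions are not reproduced here.

```python
import numpy as np

def inject_fifd(traffic, n_segments=4, lam=3, psi=10, phi=0.5,
                min_width=2, M=1.0, seed=1):
    """Inject FIFD anomalies following A(τ, θ, ϑ, ψ, φ)."""
    rng = np.random.default_rng(seed)
    x = traffic.astype(float).copy()
    labels = np.zeros(len(x), dtype=bool)        # ground-truth anomaly mask
    b = len(x) // n_segments                     # segment length
    for s in range(n_segments):
        theta = rng.poisson(lam)                 # θ: number of anomalies
        for _ in range(theta):
            start = s * b + rng.integers(0, b)   # ϑ ~ Uniform(0, b)
            width = min_width + psi              # dilated width W + ψ
            sign = rng.choice([-1, 1])           # positive or negative anomaly
            end = min(start + width, len(x))
            x[start:end] += sign * phi * M       # magnitude φM
            labels[start:end] = True
    return x, labels

# anomalous, truth = inject_fifd(traffic, psi=10, phi=0.5)
```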
4.3. Investigation on BasisEvolution
Before the main experiments, we investigate the BasisEvolution framework itself. For the data cleaning phase, we need to establish the importance of this step, and to make an appropriate selection of the parameter and the detection method (see Figure 5). In addition, we have to estimate a set of parameters for the basis update process.

As discussed before, it is difficult to find the ground truth in real network traffic, while synthetic traffic volumes that highlight the key features of network traffic are suitable for this analysis. Hence, all experiments in this section are on synthetic data.
We first demonstrate the importance of the data cleaning process to the entire framework through a comparison of the performance of BasisEvolution on cleaned and uncleaned data. Then we confirm the maximum number of iterations and the detection algorithm by examining the degree of cleanliness of the revised traffic after the data cleaning process. As for the parameters (shown in Equations (10) and (11)) in the basis update process, we have found rough parameter settings that are not too far from optimal.
4.3.1. The impact of data cleaning process

Figure 6 compares the detection performance of BasisEvolution under two conditions: with and without the data cleaning process. We see that the framework performs better when the data is cleaned, which means that basis vectors derived from cleaned data describe the “normal region” better than those derived from uncleaned data. If the data is uncleaned, and contains traces of intrusions, BasisEvolution may not detect future instances of those intrusions, since it will presume that they are normal, which leads to a higher false-alarm probability and a lower detection probability. Hence, the data cleaning process is necessary for the entire framework.
Figure 6: The performance of BasisEvolution on clean and unclean data. The effectiveness is measured with FAP NP (False-Alarm Probability on Non-anomalous Points) and DP AP (Detection Probability on Anomalous Points). The whole simulation experiment was repeated 100 times to obtain statistical results. The performance of BasisEvolution on cleaned data is better than that on uncleaned data.
4.3.2. Detector selection for data cleaning process

In our test, we compare the performance of single approaches and a combination method by choosing PCA- and ANN-based detectors. That is, the single approaches are the PCA-based and ANN-based detectors, while the combination detector merges the detection results of both at each iteration. More specifically, we select the NNBP method as the ANN algorithm.
In PCA, we define the “normal subspace” with thresholding (e.g., [mean − 3·std, mean + 3·std], where mean and std are the average and standard deviation of the principal components), and find anomalous points with the SPE measurement [18]. Since PCA is sensitive to polluted data [25], the “normal subspace” is distorted at the first iteration, and the false alarms would be very high. As Figure 5 shows, the continued detection and gap padding in the following iterations bring the traffic closer to the normal range.

NNBP finds anomalous points by comparing the differences between the real traffic and the predicted traffic volumes. If the differences exceed the threshold (e.g., c·std, where c is a constant value and std is the standard deviation of the test samples), the corresponding points are identified as anomalies. In the first iteration, the data is polluted and the corresponding predictions are not accurate. Similarly, the successive layers of detection and interpolation (see Figure 5) can bring the traffic back to the regular level.
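As a minimal sketch of the PCA-subspace idea described above (not the paper's exact implementation), anomalous time points can be flagged by the squared prediction error (SPE) of the residual outside the normal subspace; the simple mean + q·std threshold is our own simplification.

```python
import numpy as np

def pca_spe_detect(X, k=3, q=3.0):
    """Generic PCA subspace detector with an SPE threshold.

    X is a (time x links) traffic matrix; the top-k principal directions
    span the "normal subspace"; time points whose SPE exceeds
    mean + q*std are flagged as anomalous.
    """
    Xc = X - X.mean(axis=0)                        # centre each link
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                                   # normal-subspace basis
    residual = Xc - Xc @ P @ P.T                   # part not explained by P
    spe = np.sum(residual ** 2, axis=1)            # squared prediction error
    return spe > spe.mean() + q * spe.std()        # boolean anomaly mask
```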
As for the combination detector, the set of detected anomalous points is larger because it combines the detection results of both PCA and NNBP. With those anomalous points being revised to “normal values” in each iteration (see Figure 5), the typical traffic can be recovered more effectively.

We measure the performance of the three detectors by detecting on the corresponding modified traffic. More specifically, we detect the residuals between the original traffic and the modified traffic through thresholding (see Sec. 3.4.1). If the detection accuracy (DP AP) is high and the false-alarm probability (FAP NP) is low, a larger part of the modified traffic lies in the normal region. Figure 7 illustrates the performance of the three detectors in the data cleaning process. The combination detector performs best, with the lowest false-alarm probability. This is in accordance with the conclusion discussed above. The lowest false-alarm probability means the traffic cleaned by the combination detector is closest to the “normal” traffic, from which we can extract the most accurate “normal patterns”.
Figure 7: The performance of the three detectors in the data cleaning process. The three detectors are PCA, NNBP and their combination. The effectiveness is measured with FAP NP (False-Alarm Probability on Non-anomalous Points) and DP AP (Detection Probability on Anomalous Points). The whole simulation experiment was repeated 100 times to obtain statistical results. The performance of the combined approach is better than that of each single one.

4.3.3. The estimation of parameter maxC in data cleaning process

In the data cleaning process, the parameter maxC is defined as the maximum number of iterations; both the anomaly detection and the gap filling methods repeat maxC times to clean the data. We try to estimate a small value of maxC such that the cleaned traffic is close to “normal” and BasisEvolution can achieve the best detection performance.
In the simulation, the candidate set for maxC is {1, 3, 5, 7, 9}. We choose the optimal value of maxC through two comparisons: comparing the performance of the data cleaning process (similar to Sec. 4.3.2), and comparing the performance of BasisEvolution (similar to Sec. 4.3.1). The results can be seen in Figure 8 (a) and (b).
In Figure 8 (a), we measure the performance of the data cleaning process by detecting the residuals between the original traffic and the cleaned traffic. The cleaner the traffic, the higher the detection performance (lower false-alarm probabilities and higher detection probabilities) on the residuals. From Figure 8 (a), the false alarms are high when maxC is smaller than 5; however, when maxC is at least 5, the corresponding curves not only perform better, but are also close to each other. As for Figure 8 (b), when maxC is larger than 3, the corresponding curves are also similar to each other, with higher detection performance. Based on these observations, we finally choose maxC = 5 in the later experiments.

Figure 8: The performance of the data cleaning process (a) and of the whole framework (b) under different values of maxC ∈ {1, 3, 5, 7, 9}. The effectiveness is measured with FAP NP (False-Alarm Probability on Non-anomalous Points) and DP AP (Detection Probability on Anomalous Points). The whole simulation experiment was repeated 100 times to obtain statistical results. A better value for maxC is 5.
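The overall detect-and-fill loop can be sketched as follows; the linear interpolation used to pad gaps and the generic detector interface are our simplifications, not the authors' code.

```python
import numpy as np

def clean_traffic(x, detectors, max_c=5):
    """Iterative data cleaning: on each of the maxC iterations, every
    detector flags anomalous points on the current series, the masks are
    merged (the combination detector), and flagged points are replaced by
    values interpolated from the remaining "normal" neighbours."""
    cleaned = np.asarray(x, dtype=float).copy()
    for _ in range(max_c):
        mask = np.zeros(len(cleaned), dtype=bool)
        for detect in detectors:
            mask |= detect(cleaned)              # union of detection results
        good = np.flatnonzero(~mask)
        bad = np.flatnonzero(mask)
        if bad.size == 0 or good.size == 0:
            break
        cleaned[bad] = np.interp(bad, good, cleaned[good])
    return cleaned

# Example detector for the list: a simple 3-sigma residual rule.
# three_sigma = lambda s: np.abs(s - s.mean()) > 3 * s.std()
# cleaned = clean_traffic(traffic, [three_sigma], max_c=5)
```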
4.3.4. Parameter estimation in basis update

The parameters we need to assess are λd, λ1 and λ2. Recall our motivation for the distance measurement: the new basis should both maintain similarity to the immediately previous one and be close to the historical mean. Hence, the two parts of Equation (11) should carry the same importance, and we set the weighting parameters λ1 = λ2 = 0.5.
The parameter λd determines the trade-off between the measurement constraint and the importance of changes. A larger λd leads to looser approximations, whereas a smaller λd means the approximations are a better fit to the data. We set the candidate set for λd to {0.01, 0.1, 1, 2, 10, 100}. Figure 9 shows the detection performance with respect to the different λd values. We find that an optimal detection result (a higher detection probability and a lower false-alarm probability) occurs around λd = 2, with the corresponding curve tending to the top left of the plot. Hence, we finally choose λd = 2, λ1 = 0.5, and λ2 = 0.5 in the following experiments.
Figure 9: Comparison of detection performance in terms of FAP NP (False-Alarm Probability on Non-anomalous Points) and DP AP (Detection Probability on Anomalous Points) when λd ∈ {0.01, 0.1, 1, 2, 10, 100}. The entire simulation was repeated 100 times to obtain statistical results. A better performance can be found at λd = 2.
4.4. Experiment results on synthetic data

We conducted several classes of synthetic experiments. In the first class of experiments, we compare anomaly detectors on fixed-width anomalies, but with anomalies that are not simple spikes. In the second class, we consider the effects of anomaly width on detection performance. In the third and fourth classes, we consider the same issues but for magnitude.

In each case below we generate traffic and inject anomalies as described above. We replicate the experiments 100 times, and report aggregated statistics.

4.4.1. Constant-width results
We start by considering anomalies with random magnitude, but fixed width ψ = 10. The resulting ROC curves are shown in Figure 10 (a), which shows that as the threshold q increases, both the Detection Probability on Anomalous Points (DP AP) and the False-Alarm Probability on Non-anomalous Points (FAP NP) decrease, i.e., there is a trade-off between the two types of errors. Noteworthy is the fact that BasisEvolution (BE) outperforms the other methods (we prefer curves towards the top left of the diagram). Table 2 shows, more precisely, a number of FAP NP values for given DP AP values. The important feature shown in this table, apart from BasisEvolution’s superiority, is that it can achieve very low false-alarm probabilities (as low as 0.00063), which are necessary in practical anomaly detection algorithms.
The trade-off is a relatively low DP AP, but this is less important. In real anomaly detection settings, it is far more important to avoid reporting a large number of false alarms, while still providing useful detections, than to detect all events at the cost of false alarms that degrade the operator’s confidence in any of the detections. This also emphasizes the need for simulation studies, as it would have been impractical for us to measure such a low probability without a large set of data on which to perform tests.
In Figure 10 (b), we show the equivalent of an ROC curve for the pair of metrics AD AA (Aggregation Degree on Anomaly Amounts) and CD AA (Consistency Degree on Anomaly Amounts), although in this plot we prefer curves towards the top right of the diagram. As the threshold q increases, AD AA increases while CD AA decreases, showing the trade-off between the two types of errors. Once again BasisEvolution is universally superior to the other methods across all threshold values.

In Figure 10 (c), apart from NNBP, as the threshold q increases, AD AD (Aggregation Degree on Anomaly Duration) increases, while CD AD (Consistency Degree on Anomaly Duration) decreases, showing another trade-off. As in (b), we prefer curves towards the top right. BasisEvolution clearly outperforms the other four methods across all threshold values.
Figure 10: Performance of 5 methods for detecting FIFD anomalies. Here the width parameter is ψ = 10, and the magnitude is random. (a) FAP NP versus DP AP. (b) AD AA versus CD AA. (c) AD AD versus CD AD. BE is short for BasisEvolution. FAP NP is the False-Alarm Probability on Non-anomalous Points. DP AP is the Detection Probability on Anomalous Points. AD AA is the Aggregation Degree on Anomaly Amounts. CD AA is the Consistency Degree on Anomaly Amounts. AD AD is the Aggregation Degree on Anomaly Duration. CD AD is the Consistency Degree on Anomaly Duration.
Table 2: We fix the value of DP AP, and report the corresponding FAP NP of the five methods. FAP NP is the False-Alarm Probability on Non-anomalous Points. DP AP is the Detection Probability on Anomalous Points. DNE means “Does Not Exist”. BE is short for BasisEvolution. Note that very small FAP NP values are reported, as required.

Method | DP AP = 0.3 | DP AP = 0.5 | DP AP = 0.6
BE     | 0.00063     | 0.0073      | 0.020
OMP    | 0.0027      | 0.18        | DNE
PCA    | 0.019       | 0.047       | 0.099
RPCA   | 0.025       | 0.047       | 0.064
NNBP   | 0.27        | DNE         | DNE

4.4.2. The effect of width
We now test the five anomaly detection methods on anomalies whose magnitudes are randomly distributed, but with fixed width parameter ψ = {0, 4, 10, 20, 40}, to understand the effect of anomaly width on detection. For the FIFD type of anomalies, Figure 11 (a) shows that the performance of BasisEvolution decreases as the width parameter ψ grows. That is, it is more difficult to correctly detect wider anomalies. This is to be expected: as the anomaly width increases, there are more points in the anomaly, and thus these points are more likely to be considered “normal” by any detector, and by any metric. A single spike will always be the easiest type of event to detect; while it might be possible to detect the change into a broad flat anomaly, it will always be difficult to correctly categorize the flat region.
Figure 11: Performance (FAP NP versus DP AP) of BasisEvolution in detecting FIFD anomalies with varying anomaly widths or magnitudes. (a) Variable width ψ. (b) Variable magnitude φ. FAP NP is the False-Alarm Probability on Non-anomalous Points. DP AP is the Detection Probability on Anomalous Points. For the BE-detected results, we note that performance decreases with increasing width, or with decreasing magnitude. The additional plots are omitted as they reinforce the same results.
4.4.3. Constant-magnitude results

We now consider the effect of the magnitude of the anomalies on detection, starting with a set of tests with fixed magnitude, φ = 0.5, but random width. The ROC curves for the metrics DP AP (Detection Probability on Anomalous Points) and FAP NP (False-Alarm Probability on Non-anomalous Points) are shown in Figure 12.
As in Figure 10, both DP AP and FAP NP decrease as the threshold q increases. We prefer the curves towards the top left of the diagram. The results are similar to those for constant width, though perhaps more extreme, e.g., Table 3 provides FAP NP values for the five techniques corresponding to the fixed DP AP values, and we see that BasisEvolution achieves an even lower false-alarm probability of 7.8 × 10⁻⁵.
Figure 12: Performance for detecting FIFD anomalies with fixed magnitude φ = 0.5 and random width. (a) FAP NP versus DP AP. (b) AD AA versus CD AA. (c) AD AD versus CD AD. FAP NP is the False-Alarm Probability on Non-anomalous Points. DP AP is the Detection Probability on Anomalous Points. AD AA is the Aggregation Degree on Anomaly Amounts. CD AA is the Consistency Degree on Anomaly Amounts. AD AD is the Aggregation Degree on Anomaly Duration. CD AD is the Consistency Degree on Anomaly Duration.
Table 3: We fix the value of DP AP, and report the corresponding FAP NP of the five methods. FAP NP is the False-Alarm Probability on Non-anomalous Points. DP AP is the Detection Probability on Anomalous Points. DNE means “Does Not Exist”. BE is short for BasisEvolution. Note that very small FAP NP values are possible, as required.

Method | DP AP = 0.3 | DP AP = 0.5 | DP AP = 0.7
BE     | 7.8 × 10⁻⁵  | 0.0002      | 0.0025
OMP    | 0.0026      | 0.009       | 0.066
PCA    | 0.013       | 0.039       | 0.16
RPCA   | 0.033       | 0.079       | DNE
NNBP   | 0.19        | DNE         | DNE
4.4.4. Effect of varying magnitudes

In our final simulation results, we test anomalies whose widths are randomly distributed, with magnitude parameter φ = {0.1, 0.3, 0.5, 0.7, 1}, to analyze the impact of anomaly magnitude on detection.

Figure 11 (b) illustrates the results of BasisEvolution on anomalies with varying magnitudes, showing that it is easier to distinguish anomalies with larger magnitudes. This result is reasonable because anomalies with larger magnitudes lead to the corresponding anomalous points having larger deviations, which are easier for any detector to see. However, there is a diminishing return for anomalies above a threshold of around φ = 0.5, beyond which little improvement is seen relative to the change in anomaly magnitude.
The figures presented in the above four classes show the results for FIFD anomalies, but we found the same types of results no matter which type of anomaly we considered. We also saw similar results for the alternative performance metrics (in the places where those plots are omitted).

Why is the proposed BasisEvolution approach better than the others? Two reasons can be identified by analyzing its mechanism. BasisEvolution starts by extracting the basis vectors for the smallest period (24 h), and ends with a basis for the largest cycle (e.g., weeks or months). In each cycle, we obtain the directions of the principal components as basis vectors. Since these directions preserve the largest energy of the traffic at each cycle, they are easy to understand, and actually describe the network traffic at multiple scales. These multiscale basis vectors describe the “typical region” more accurately, which means that BasisEvolution can find more types of anomalies (e.g., subtle anomalies), and is robust in anomaly detection. Moreover, BasisEvolution reconstructs the traffic volumes as a linear combination of the extracted basis vectors. This mechanism indicates that the extracted basis vectors explain the components of the network traffic.
Compared with commonly used approaches (OMP and PCA), BasisEvolution performs better in the above experiments. Specifically, PCA is sensitive to anomalous points; that is, the “typical region” (or the “normal subspace”) that PCA constructs is easily contaminated when the number of anomalous points is large. BasisEvolution alleviates this issue by extracting multiscale features so that the “typical region” can be better approximated, with a comprehensive description of the network traffic from different views. Hence, BasisEvolution is less sensitive and can find more anomalies than PCA (and PCA-type methods). As for OMP, it reconstructs network traffic through the superposition of a finite number of discrete cosine functions, and these basis vectors have no real physical interpretation. In addition, a finite number of such functions may lose the details of the “typical region”, which reduces the detection probability. Hence, BasisEvolution is better than OMP with respect to interpretability and accuracy.
5. Experiments with Real Data

We test BasisEvolution on traffic data from the Internet Initiative Japan (IIJ), a Japanese ISP. The dataset [41] is a collection of time series data obtained from Japanese network backbone routers. The collected data extends from April 4 to December 4, 2014. It contains traffic measurements (in and out) on 10 links, at time intervals of 1 h, for a total of 5,857 observed data samples for each link in each direction.
An intrinsic problem we have already discussed is that it is difficult to obtain ground truth for anomaly detection. Even if we adopted expert advice to label the data, some anomalies would remain that experts may not recognize, because they are not traditional ones, and hence our false-alarm probabilities would be inflated. Thus, we used simulations to provide quantitative comparisons and to understand the merits of the various anomaly detection methods.
However, we acknowledge that there is an artificiality in any simulation. Here we aim to at least comment on the ability of BasisEvolution to find anomalies in real data by using common methods (such as PCA) to find anomalies, which we then treat as if they were real. Using those “real” anomalies, we compare BasisEvolution with other widely used detection methods. In this section, however, we omit NNBP results, as this method performed very poorly in simulations.
We divide each series into segments of two weeks in length, as in the simulations. We repeat the process for each algorithm as described above, for each link. For example, Figure 13 shows our discovered basis functions for the first two time segments: the initial segment, and its evolved form in the second time segment. In each, we can see that the first two basis vectors show the daily and weekly cycles in the data, with the third containing more noise.

Next, we compare the anomaly detectors against each other by treating the anomalies found by one method as if they were real, and then seeking to detect these using an alternative method.
Figure 13: Basis vectors for the first two time segments of the second IIJ link. All functions are normalized to have zero mean. (a) The initial basis vectors on the first time segment: April 4th–18th, 2014. (b) The evolved basis on the following time segment. The three basis vectors in each panel correspond to cycles of c1 = 1 day, c2 = 1 week, and c3 = 2 weeks.
Figure 14: Comparison of methods on false-alarm probability and detection probability with the IIJ data. Anomalies detected by (a) PCA, (b) RPCA, and (c) OMP are treated as “real” anomalies, respectively.
Take PCA-detected anomalies as an example. Figure 14 (a) illustrates the performance of the remaining three methods (OMP, RPCA, and BasisEvolution) in comparison. Note that here, the False-Alarm Probability on PCA-detected Non-anomalous Points (FAP NP) and the Detection Probability on PCA-detected Anomalous Points (DP AP) are not with respect to ground truth, and thus the meaning of the graph is different: it shows how consistent the approaches are. Unsurprisingly, RPCA is the most consistent with PCA, in that it finds the largest share of the same anomalies. OMP is the most different from PCA, and our approach is in between the two. Similar conclusions can be drawn from Figure 14 (b), where RPCA-detected anomalies are regarded as the real ones. However, when OMP-detected anomalies are used instead (Figure 14 (c)), PCA, RPCA and BasisEvolution all share almost the same number of detected anomalies, which is in accordance with their underlying mechanisms. The alternative measures we considered above reinforce the same story.
One interesting observation in Figure 14 is that the overall detection probabilities are not high, which means the overlap among these methods is low. One explanation is that different anomalies are usually found by different methods because of their inherent mechanisms; hence RPCA and PCA, respectively, perform best in Figure 14 (a) and (b). OMP, PCA and BasisEvolution discover different types of anomalies. This is what we might expect, given the different assumptions and algorithms upon which they are based.
To further analyze the performance of the four methods, we present the detection results for a randomly chosen ten weeks of data, as shown in Figure 15. Since we do not have ground truth for the real data, the possible methods of identifying anomalous points are either expert advice or voting. In view of real network scenarios and holiday influences in Japan, we can label several anomalous points, such as A(624, 2.12e+08). In addition, we adopt a voting strategy to estimate the reliability of each detected anomalous point. A detected anomalous point wins one vote for every other technique that also identifies it as abnormal. In our situation, the number of votes that one point can obtain ranges from 0 to 3. A larger number of votes indicates that the corresponding anomalous point is shared by more approaches, and hence the corresponding methods are more reliable.
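The vote tally itself is simple to compute; the following sketch assumes each method's detections are stored as a set of time indices, which is our own representation for the example.

```python
def vote_counts(detections):
    """Tally the voting strategy described above.

    `detections` maps a method name to the set of time indices it flags.
    For each method, the result records how many of its detected points
    are also flagged by 0, 1, 2 or 3 of the other methods.
    """
    counts = {}
    for method, pts in detections.items():
        others = [p for m, p in detections.items() if m != method]
        votes = {x: sum(x in o for o in others) for x in pts}
        counts[method] = {v: sum(1 for x in pts if votes[x] == v)
                          for v in range(len(others) + 1)}
    return counts
```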
Figure 15: Detection results of four algorithms on ten weeks of data on link 4, IIJ: Apr 5th, 2014 to Jun 14th, 2014. The four algorithms are OMP, PCA, RPCA, and BasisEvolution (BE). Three labeled points are marked on the traffic series: A (624, 2.12e+08), B (983, 2.32e+08) and C (1179, 1.411e+08).
Take BasisEvolution as an example: a total of 15 anomalous points are detected in this context. Among them, five points have three votes, i.e., they are also detected by the other three approaches; four points have two votes; three points have one vote; two points (A and B in Figure 15) have been labeled as anomalous even though they have not been detected by any other technique; and only the remaining point (C in Figure 15) is neither detected by another method nor labeled. Similar descriptions for the other methods can be seen in Table 4. Compared with the other three methods, 12 out of 15 BE-detected points can also be detected by others, and only one point receives zero votes without being labeled, which indicates the strong reliability of BasisEvolution. Regarding PCA, RPCA and OMP, larger proportions of their detected points receive zero votes, which demonstrates that the reliability of BasisEvolution is better than that of the other three. In addition, there are two zero-vote BE-detected points that are confirmed as real anomalies by the labeling, while OMP has one and both PCA and RPCA have zero. These results, as well as the simulation, support the conclusion that BasisEvolution can find anomalous points that other approaches cannot.

Table 4: Description of the anomaly detection results of the four algorithms, in terms of the reliability of the detected anomalies, based on Figure 15. Here, “total anomalies” is the total number of anomalous points detected by the approach; “amount of real anomalies” counts the detected points that are real anomalies according to the expert advice but fail to get any vote; and “No. with votes” records the number of detected anomalous points with a given number of votes. BasisEvolution is more reliable and can find anomalies that the other three cannot.

Method | Total anomalies | Amount (of real anomalies) | 3 votes | 2 votes | 1 vote | 0 votes
BE     | 15              | 2                          | 5       | 4       | 3      | 1
OMP    | 13              | 1                          | 5       | 0       | 2      | 5
PCA    | 24              | 0                          | 5       | 4       | 9      | 6
RPCA   | 24              | 0                          | 5       | 4       | 7      | 8
6. Conclusion and future work

Accurately detecting anomalies without prior knowledge plays an important role for network operators. However, to be practical, methods need to have a very low false-alarm probability, and there are other challenges. Strong anomalies can pollute the normal space used as a baseline to detect anomalies, and anomalies can themselves have different shapes that make them harder or easier to detect.
In this paper, we address these problems with BasisEvolution, which looks for a concise, accurate and easy-to-understand basis that represents the typical traffic patterns more precisely. We evolve the basis to maintain similarity with the old one while describing the new data. We show that it can then detect anomalous points on the residual series with higher accuracy than common alternatives. It still works even when the anomalies are complex.
However, there are many issues left for future work. For example, the basis generation algorithm we used is based on the SVD, which is not feasible for very large data scenarios. The first issue is therefore to find a robust, fast algorithm to replace the SVD, such as the pursuit principle used in OMP. Another point is that the size of the data segments used in our framework is fixed and our algorithm proceeds in batches. A more realistic algorithm should be able to tackle arbitrary new segments of data as they arrive. We plan to extend BasisEvolution in the future to tackle these issues.
Acknowledgment

The authors thank IIJ for providing data. This work was supported by the National Key Basic Research Program of China (2013CB329103 of 2013CB329100), the China Scholarship Council (201506050072), and the short term International Academic Fund of ChongQing University 2015 Overseas Visiting Student Projects (GJXSFX150203), as well as Australian Research Council (ARC) grant DP110103505 and the ARC Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS).
References

[1] A. Lazarevic, L. Ertoz, V. Kumar, A. Ozgur, J. Srivastava, A comparative study of anomaly detection schemes in network intrusion detection, in: The SIAM International Conference on Data Mining, 2003, pp. 25–36.
[2] P. Barford, D. Plonka, Characteristics of network traffic flow anomalies, in: The 1st ACM SIGCOMM Workshop on Internet Measurement (IMW), San Francisco, CA, USA, 2001, pp. 69–73.
[3] A. Delimargas, Iterative Principal Component Analysis (IPCA) for network anomaly detection, Master's thesis, Carleton University (September 2015).
[4] K. Xu, F. Wang, L. Gu, Behavior analysis of internet traffic via bipartite graphs and one-mode projections, IEEE/ACM Transactions on Networking 22 (3) (2014) 931–942.
[5] M. H. Bhuyan, D. K. Bhattacharyya, J. K. Kalita, Network anomaly detection: methods, systems and tools, IEEE Communications Surveys & Tutorials 16 (1) (2014) 303–336.
[6] P. García-Teodoro, J. Díaz-Verdejo, G. Maciá-Fernández, E. Vázquez, Anomaly-based network intrusion detection: Techniques, systems and challenges, Computers & Security 28 (1-2) (2009) 18–28.
[7] L. Huang, L. Nguyen, M. N. Garofalakis, M. I. Jordan, A. D. Joseph, N. Taft, In-network PCA and anomaly detection, in: Advances in Neural Information Processing Systems, 2007, pp. 617–624.
[8] D. Jiang, C. Yao, Z. Xu, W. Qin, Multi-scale anomaly detection for high-speed network traffic, Transactions on Emerging Telecommunications Technologies 26 (3) (2015) 308–317.
[9] E. Sober, The principle of parsimony, The British Journal for the Philosophy of Science 32 (2) (1981) 145–156.
[10] J. S. Chiou, The antecedents of consumers' loyalty toward internet service providers, Information & Management 41 (6) (2004) 685–695.
[11] M. Tavallaee, N. Stakhanova, A. A. Ghorbani, Toward credible evaluation of anomaly-based intrusion-detection methods, IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 40 (5) (2010) 516–524.
[12] C. Croux, P. Filzmoser, M. R. Oliveira, Algorithms for projection-pursuit robust principal component analysis, Chemometrics and Intelligent Laboratory Systems 87 (2) (2007) 218–225.
[13] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. D. Kolaczyk, N. Taft, Structural analysis of network traffic flows, in: ACM SIGMETRICS/Performance, 2004, pp. 61–72.
[14] C. Pascoal, M. R. de Oliveira, R. Valadas, P. Filzmoser, P. Salvador, A. Pacheco, Robust feature selection and robust PCA for internet traffic anomaly detection, in: IEEE INFOCOM, 2012, pp. 1755–1763.
[15] S. Huang, F. Yu, R. Tsaih, Y. Huang, Network traffic anomaly detection with incremental majority learning, in: International Joint Conference on Neural Networks (IJCNN), 2015, pp. 1–8.
[16] Z. Chen, C. K. Yeo, B. S. L. Francis, C. T. Lau, Combining MIC feature selection and feature-based MSPCA for network traffic anomaly detection, in: The 3rd International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC), 2016, pp. 176–181.
[17] J. Zhang, H. Li, Q. Gao, H. Wang, Y. Luo, Detecting anomalies from big network traffic data using an adaptive detection approach, Information Sciences 318 (2015) 91–110.
[18] A. Lakhina, M. Crovella, C. Diot, Diagnosing network-wide traffic anomalies, in: ACM SIGCOMM, 2004, pp. 219–230.
[19] Y. Tsuge, H. Tanaka, Quantification for intrusion detection system using discrete Fourier transform, in: International Conference on Information Science and Security (ICISS), 2016, pp. 1–6.
[20] P. Barford, J. Kline, D. Plonka, A. Ron, A signal analysis of network traffic anomalies, in: The 2nd ACM SIGCOMM Workshop on Internet Measurement (IMW), Marseille, France, 2002, pp. 71–82.
[21] B. Eriksson, P. Barford, R. Bowden, N. Duffield, J. Sommers, M. Roughan, BasisDetect: a model-based network event detection framework, in: The 10th ACM SIGCOMM Conference on Internet Measurement (IMC), Melbourne, Australia, 2010, pp. 451–464.
[22] M. Ahmed, A. N. Mahmood, J. Hu, A survey of network anomaly detection techniques, Journal of Network and Computer Applications 60 (2016) 19–31.
[23] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: A survey, ACM Computing Surveys 41 (3) (2009) 15.
[24] D. J. Weller-Fahy, B. J. Borghetti, A. A. Sodemann, A survey of distance and similarity measures used within network intrusion anomaly detection, IEEE Communications Surveys & Tutorials 17 (1) (2015) 70–91.
[25] H. Ringberg, A. Soule, J. Rexford, C. Diot, Sensitivity of PCA for traffic anomaly detection, ACM SIGMETRICS 35 (1) (2007) 109–120.
[26] S. Novakov, C. Lung, I. Lambadaris, N. Seddigh, Studies in applying PCA and wavelet algorithms for network traffic anomaly detection, in: IEEE 14th International Conference on High Performance Switching and Routing (HPSR), 2013, pp. 185–190.
[27] F. Meng, N. Jiang, B. Liu, R. Li, F. Xia, A real-time detection approach to network traffic anomalies in communication networks, in: The Joint International Conference on Service Science, Management and Engineering (SSME) and International Conference on Information Science and Technology (IST), 2016.
[28] Z. Chen, C. K. Yeo, B. S. Lee, C. T. Lau, Detection of network anomalies using improved-MSPCA with sketches, Computers & Security 65 (2017) 314–328.
[29] Z. Wang, K. Hu, K. Xu, B. Yin, X. Dong, Structural analysis of network traffic matrix via relaxed principal component pursuit, Computer Networks 56 (7) (2012) 2049–2067.
[30] G. H. Golub, C. F. Van Loan, Matrix Computations, Vol. 3, JHU Press, 2012.
[31] J. Zobel, How reliable are the results of large-scale information retrieval experiments?, in: Proceedings of SIGIR, 1998, pp. 307–314.
[32] D. D. Lee, H. S. Seung, Algorithms for non-negative matrix factorization, in: Advances in Neural Information Processing Systems, 2001, pp. 556–562.
[33] R. Peharz, F. Pernkopf, Sparse nonnegative matrix factorization with l0-constraints, Neurocomputing 80 (2012) 38–46.
[34] Y. Zhao, G. Karypis, D.-Z. Du, Criterion functions for document clustering, Tech. rep., Department of Computer Science, University of Minnesota (2005).
[35] J. Han, J. Pei, M. Kamber, Data Mining: Concepts and Techniques, Elsevier, 2011.
[36] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Letters 27 (8) (2006) 861–874.
[37] M. Roughan, A. Greenberg, C. Kalmanek, M. Rumsewicz, J. Yates, Y. Zhang, Experience in measuring backbone traffic variability: Models, metrics, measurements and meaning, in: The 2nd ACM SIGCOMM Workshop on Internet Measurement (IMW), 2002, pp. 91–92.
[38] M. Roughan, J. Gottlieb, Large-scale measurement and modeling of backbone internet traffic, in: SPIE ITCom, Boston, MA, USA, 2002.
[39] M. Roughan, On the beneficial impact of strong correlations for anomaly detection, Stochastic Models 25 (1) (2009) 1–27.
[40] C. C. Zou, W. Gong, D. Towsley, Code Red worm propagation modeling and analysis, in: The 9th ACM Conference on Computer and Communications Security (CCS), New York, NY, USA, 2002, pp. 138–147. doi:10.1145/586110.586130.
[41] P. Tune, K. Cho, M. Roughan, A comparison of information criteria for traffic model selection, in: The 10th International Conference on Signal Processing and Communications Systems (ICSPCS), Gold Coast, Australia, 2016.
Hui Xia is a PhD student in the Department of Computer Science at Chongqing University. She received a Bachelor's degree in Computer Networks from Nanchang University. Her research interests include network traffic analysis, data mining, and personal recommendation.
Bin Fang received the B.S. degree in electrical engineering from Xi'an Jiaotong University, Xi'an, China, the M.S. degree in electrical engineering from Sichuan University, Chengdu, China, and the Ph.D. degree in electrical engineering from the University of Hong Kong, Hong Kong. He is currently a Professor with the College of Computer Science, Chongqing University, Chongqing, China. His research interests include computer vision, pattern recognition, information processing, biometrics applications, and document analysis. He has published more than 120 technical papers and is an Associate Editor of the International Journal of Pattern Recognition and Artificial Intelligence. Prof. Fang has been the Program Chair and a Committee Member for many international conferences.
Matthew Roughan received a Bachelor's degree in mathematical science and a Ph.D. degree in applied mathematics from Adelaide University. Currently, he is a professor with the School of Mathematical Sciences, the University of Adelaide. His research interests include Internet measurement and estimation, network management, and stochastic modeling, in particular with respect to network traffic and performance modeling. He has published more than 150 technical papers, and is a member of MASA, AustMS, IEEE and ACM. Prof. Roughan has been a chair of the ACM SIGCOMM Doctoral Dissertation Award Committee, and won the 2013 ACM SIGMETRICS Test of Time Award.
Paul Tune received a Ph.D. degree at the Centre for Ultra-Broadband Information Networks (CUBIN), Dept. of E&E Engineering, the University of Melbourne. He worked as a postdoctoral research fellow at the School of Mathematical Sciences, the University of Adelaide. He now works at Image Intelligence, Sydney. His research interests are network measurement, information theory, and compressed sensing.
Kenjiro Cho received the B.S. degree in electronic engineering from Kobe University, the M.Eng. degree in computer science from Cornell University, and the Ph.D. degree in media and governance from Keio University. He is Deputy Research Director with the Internet Initiative Japan, Inc., Tokyo, Japan. He is also an Adjunct Professor with Keio University and the Japan Advanced Institute of Science and Technology, Tokyo, Japan, and a board member of the WIDE project. His current research interests include traffic measurement and management, and operating system support for networking.