
Query-efficient Label-only Attacks Against Black-box Machine Learning Models

Yizhi Ren^a, Qi Zhou^a, Zhen Wang^{a,*}, Ting Wu^a, Guohua Wu^a, Kim-Kwang Raymond Choo^b

a School of Cyberspace, Hangzhou Dianzi University, Hangzhou 310018, China
b Department of Information Systems and Cyber Security, The University of Texas at San Antonio, San Antonio, TX 78249, USA

Abstract

Recent studies have shown that machine learning algorithms are susceptible to imperceptible perturbations. These studies focus on laboratory settings, where the attacker has knowledge of the victim model's internals or receives feedback such as class probabilities. There is still a gap between theory and the physical world, and the risk of adversarial attacks under more extreme and realistic conditions needs to be understood. Here we propose a knowledge-restricted black-box attack model in which the attacker can only obtain the final predicted label. At the same time, we model the attacker as resource-restricted, e.g., query-limited. These limitations on knowledge and resources mean that previous work cannot be applied directly. For this problem, the current state-of-the-art method is the boundary attack; however, it requires a large number of queries. In this paper, we make several contributions toward investigating the vulnerability of machine learning models in more realistic scenarios. First, we reconstruct the optimization problem and measure the quality of sample points by L2 distance. Second, we provide a more effective algorithm that uses a cutting plane method with local optimization. Third, we propose two effective dynamic defense strategies that are easy to implement. Finally, we conduct an experimental evaluation on MNIST, Fashion-MNIST and a malware detection dataset. The results show that (1) compared with the state-of-the-art method, our cutting plane method reduces the number of queries while ensuring attack efficiency; (2) the dynamic defense strategy is effective against label-only adversarial attacks, dropping the attack success rate from nearly 100% to 23% while maintaining considerable classification accuracy; (3) the improved defense strategy guarantees the effectiveness of the defense and improves the stability of the whole model.

* Corresponding author

Keywords: Adversarial sample, Machine learning, Image recognition
2010 MSC: 00-01, 99-00

1. Introduction

Machine learning algorithms have been widely deployed to solve different tasks, e.g., spam filtering[1], malware detection[2], image recognition[3] and auto-driving[4]. They have achieved extremely high accuracy and become increasingly effective. However, their success rests on the assumption that there are no potential attackers. In fact, recent studies have shown that many high-performance machine learning algorithms are vulnerable to adversarial samples[5]: adding a slight perturbation to the original input can force a well-learned classifier to misclassify the crafted input. The existence of adversarial samples poses a serious threat to the security of machine learning models. For example, crafted malware samples can easily fool machine learning-based malware detection tools[6], and recent work has used generative adversarial deep neural networks to synthesize cyber attacks[7]. It is therefore critical to evaluate the robustness of machine learning systems.

In the past few years, researchers have proposed many different threat models to simulate adversarial attacks against machine learning models. According to the knowledge of the attacker, threat models can be roughly divided into the white-box setting and the black-box setting. In the white-box setting, the attacker is assumed to have access to the victim model. However, in the physical world, the machine learning model may not be exposed to users. This means that the attacker has no knowledge of the details of the model but does have query access to the model's predictions on input samples, which motivates

the study of the black-box threat model. Existing black-box attacks focus on two major approaches. One is transfer-based attacks[8]; these attacks are effective and efficient, and the key idea is to craft adversarial samples on a local surrogate model. However, the quality of the adversarial samples depends on the attacker's knowledge of the training data. The other is score-based attacks; these attacks do not need training data, and the attacker constructs a loss function relying only on the predicted scores (e.g., class probabilities or logits) of the model.

In this paper, we aim to investigate the vulnerability of machine learning models when facing an information-lacking but tenacious attacker. Under this circumstance, the attacker knows nothing about the training data and only has access to the top-1 class feedback of the victim model. It is hard to construct a loss function or evaluate the status of sample points, so existing transfer-based and score-based methods cannot be directly applied to craft adversarial samples. We are aware of one exceptional work considering this strict black-box threat model[9], which uses a random walk along the boundary; their method can find adversarial samples comparable with white-box attacks, but at the cost of many queries.

To overcome these difficulties, this paper makes the following key contributions.

• We choose the L2 norm distance to reconstruct the loss function; by doing this, we transform the status of sample points into a measurement of distance.

• We propose a novel attack method called the cutting plane method. The cutting plane method exploits the local geometry of the decision boundary and is combined with local random optimization to improve solution quality.

• We evaluate our method on different machine learning models using the MNIST, Fashion-MNIST and malware detection datasets, and the results show that (i) machine learning models are susceptible even when they leak no internal information, and (ii) our proposed cutting plane method outperforms the state-of-the-art method.

• We propose an efficient defense strategy that dynamically changes the victim model. The experimental results show that this defense strategy can effectively counter boundary attacks while maintaining considerable classification accuracy.

2. Related Work

In this section, we review related work on crafting adversarial samples to attack machine learning models in both the white-box and black-box settings.

Attacks in the white-box setting. In the white-box setting, the attacker can obtain specific information about the victim model and can therefore rely on the gradient of the loss function to implement attacks. The FGSM[10] attack crafts adversarial samples by perturbing the input along the gradient direction of the loss function under the L∞ constraint. Unlike FGSM, the DeepFool[11] attack is iterative: it projects an image onto an approximated linear hyperplane to find the least L2 distance. Besides gradient-based attacks, other methods are also effective for crafting adversarial samples. For example, the JSMA[12] attack uses a Jacobian-based saliency map and modifies the most significant features to fool the classifier. Carlini and Wagner proposed the C&W[13] attack, which formulates the attack as an optimization problem and adds a penalty on the perturbation under an Lp constraint.

Attacks in the black-box setting. In the physical world, most machine learning models are not exposed. This has inspired researchers to investigate adversarial attacks in the black-box setting. Black-box attacks can be roughly divided into two categories: transfer-based and score-based. The first transfer-based black-box attack was proposed in[8]; such attacks do not need detailed information about the victim model but rely on partial information about the training set. The attacker trains a surrogate model using synthesized data and then applies white-box attacks to generate adversarial samples. Research shows that different machine learning models are vulnerable to these transfer-based adversarial samples[14].

Although transfer-based attacks perform well on several open-source datasets, they still require information about the training data and may lead to low solution quality and success rates, which motivates the study of a stricter and more robust attack family: score-based attacks. This work is closely related to the literature on score-based attacks, e.g., variants of JSMA[15] and ZOO[16]. Narodytska[15] proposed an adversarial attack that perturbs a single pixel or a small set of pixels based on greedy local search. ZOO[16] is a strong iterative gradient attack method that uses zeroth-order stochastic coordinate descent to seek an optimal solution. However, the above white-box methods need full knowledge of the classifier, and the black-box methods need either class probabilities or logit outputs. In the physical world, the attacker might not have access to such detailed information; instead, he is more likely to obtain only final predictions, specifically the top-1 class. Under this circumstance, attacking becomes much more challenging, since it is hard to evaluate the current state, and typical gradient-based and score-based methods can no longer be applied. To our knowledge, the only existing approach to this problem is the Boundary Attack[9]. This attack starts from a large adversarial perturbation and searches for better solutions along the boundary. It can find adversarial samples with distortion comparable to gradient-based white-box attacks, but at the cost of many queries.

3. Threat Model

This section presents our threat model. To get closer to real-world scenarios, we define the model as follows.

Attack Specificity. The attacker aims to cause the target sample to be misclassified into any other class, i.e., an untargeted attack. For example, to attack an auto-driving car, the attacker might force the car to misidentify a stop sign as any other sign.

Attacker's Knowledge. We follow the black-box setting proposed in the Boundary attack[9]. Under these circumstances, the attacker has no knowledge of the victim model, including its architecture, learning algorithm and training data. The attacker can only query the model and obtain the corresponding feedback, which is the top-1 label only. This is a common real-world scenario. For example, Google Photos1 labels user-uploaded images, so the attacker only knows the final labels of the images he uploads. Note that we implicitly assume the attacker knows what task the model is designed for (e.g., image recognition, spam detection) and has an idea of which potential transformations to apply to cause feature changes; otherwise no change can be inflicted on the output of the classification function, nor can any useful information be extracted from it[17].

Query-limited Setting. The number of queries the attacker can make at each stage is limited, which can be construed as a resource constraint such as a monetary or authority limit. For example, both the Clarifai NSFW detection API2 (after the first 2,500 predictions) and the Tencent image tag API3 (after the first 10,000 predictions) are chargeable. This forces the attacker to rely on more efficient attack algorithms.

Attacker's Goal. For a given instance x0 and a black-box classifier f, the category of x0 can be represented as f(x0). The attacker's goal is to add a minimal perturbation δ to x0 such that the new instance x* = x0 + δ is classified into a different class from f(x0). This can be formulated as the following optimization problem:

\[
\operatorname*{arg\,min}_{x} \; L(x) \tag{1}
\]

where L(x) is the loss function

\[
L(x) =
\begin{cases}
\mathrm{Dis}(x, x_0), & f(x) \neq f(x_0) \\
+\infty, & f(x) = f(x_0)
\end{cases} \tag{2}
\]

and \(\mathrm{Dis}(x, x_0) = \lVert x - x_0 \rVert_2\).

1 https://photos.google.com/
2 https://clarifai.com/models/nsfw-image-recognition-model-e9576d86d2004ed1a38ba0cf39ecb4b1
3 https://cloud.tencent.com/product/image-tag

We use the L2 norm distance as the criterion of distortion. Since the attacker has no access to class probabilities or scores, the current state can be measured by the distance to x0. Notice that any x lying in the same class as x0 incurs an infinite loss. Thus the attacker needs to seek a good x* that is classified into a different class while having as small a distortion as possible. The Boundary attack uses small orthogonal steps along the boundary to find such adversarial samples, but it needs a large number of queries. We show that our method performs better for every given number of query steps throughout the whole query process, until the budget is used up.
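To make the formulation above concrete, the following is a minimal sketch (not the authors' code) of how the loss in Eq. (2) can be evaluated with nothing but top-1 label queries. The callable `query_label` is a hypothetical stand-in for the victim model's prediction API.

```python
import numpy as np

def make_label_only_loss(query_label, x0):
    """Build the loss of Eq. (2): L2 distance to x0 when the top-1 label
    changes, +infinity otherwise. query_label(x) is assumed to return only
    the predicted class of x (one model query per call)."""
    y0 = query_label(x0)                    # original class f(x0)

    def loss(x):
        if query_label(x) == y0:            # still classified as f(x0)
            return np.inf                   # not adversarial: infinite cost
        return np.linalg.norm(x - x0)       # adversarial: cost = distortion
    return loss
```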

4. Approach

4.1. Simple cutting plane

In this section, we introduce our adversarial attack method. As stated in Section 3, the attacker tries to craft an instance that crosses the decision boundary by adding a tiny perturbation to a given instance. To achieve this, we use a novel cutting plane method combined with local optimization based on a greedy algorithm.

We begin with a simple 2-dimensional example to illustrate our method. From a given point x0 with corresponding class f(x0), we randomly choose a direction and perform a binary search to find a point x1 that is close to the decision boundary and classified into a different class from f(x0) (as in Algorithm 1). Since we now have a point close to the decision boundary, we start at this point and construct a tangent to the decision boundary. In 2-dimensional feature space, we can do this by sampling another point close to x1 and performing a binary search to push it close to the decision boundary. With these two points we can approximately fit the local decision boundary around x1 by drawing a straight line through them; this line can be regarded as a tangent of the boundary at x1. We then search along the direction of the projection of x0 onto this tangent to find a new point x2. If the local geometry of the decision boundary is convex, this yields a better point. Repeating this process, a good point x* will be found after several iterations, as shown in Figure 1.


Figure 1: 2-dimensional cutting plane method
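For illustration only, the 2-D step just described (fit a line through two near-boundary points and move toward the projection of x0 onto it) can be sketched as follows; the two boundary points are assumed to have been produced already, e.g. with Algorithm 1, and the helper name is ours, not the paper's.

```python
import numpy as np

def project_onto_tangent(x0, b1, b2):
    """Fit the line through two near-boundary points b1 and b2 (the local
    tangent in 2-D) and return the orthogonal projection of x0 onto it.
    Searching from x0 toward this projection gives the candidate x2."""
    t = (b2 - b1) / np.linalg.norm(b2 - b1)     # unit tangent direction
    return b1 + np.dot(x0 - b1, t) * t

# Example: project the origin onto the line through (1, 0) and (0, 2).
x0 = np.zeros(2)
target = project_onto_tangent(x0, np.array([1.0, 0.0]),
                              np.array([0.0, 2.0]))   # -> [0.8, 0.4]
```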

4.2. Theoretical analysis

We again use Figure 1 to analyze the cutting plane method and show under which hypothesis the algorithm converges. The defining property of the optimal solution x* is that it has the shortest distance to x0: if we take x0 as the center and draw a circle whose radius is the distance from x0 to x*, the circle is tangent to the decision boundary and the point of tangency is x*. This is also the theoretical basis for the termination condition of our cutting plane algorithm.

In each iteration, we construct a tangent to the decision boundary at the point xi to check for convergence. We assume the decision boundary around x0 is convex, that is, it lies between the tangent line and the straight line l(x0, x*). In this case, when we project x0 onto the tangent line, we find a new point near the decision boundary that is better than the current solution, because x0, xi and projection(x0) form a right triangle, and the leg (x0, projection(x0)) is shorter than the hypotenuse (x0, xi).
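For completeness, the inequality behind this argument can be written out explicitly; it is the standard orthogonal-projection bound, stated here as a clarification rather than additional material from the paper.

```latex
% Let T be the tangent line at the current boundary point x_i and let
% \operatorname{proj}_T(x_0) be the orthogonal projection of x_0 onto T.
% Since the projection is the closest point of T to x_0 and x_i \in T,
\[
  \lVert x_0 - \operatorname{proj}_T(x_0) \rVert_2
  \;\le\; \lVert x_0 - x_i \rVert_2 ,
\]
% with equality only if x_i is already the projection. Under the convexity
% assumption (the boundary lies between T and the line through x_0 and x^*),
% the ray from x_0 toward \operatorname{proj}_T(x_0) crosses the boundary at
% a point no farther from x_0 than x_i, so the distortion never increases.
```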

Algorithm 1 Cross the decision boundary
Input: Black-box classifier f, initial instance x0 (x0 ∈ R^p), default value d, direction µ, query step t, threshold value ε.
Output: New instance x'.
1:  dl ← d, dr ← d
2:  while f(x0 + dr·µ) = f(x0) do
3:      dr ← (1 + t)·dr
4:  while f(x0 + dl·µ) ≠ f(x0) do
5:      dl ← (1 − t)·dl
6:  while dr − dl > ε do
7:      dm ← (dr + dl)/2
8:      if f(x0 + dm·µ) = f(x0) then
9:          dl ← dm
10:     else
11:         dr ← dm
12: x' ← x0 + dr·µ
13: return x'
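A minimal Python sketch of Algorithm 1, assuming a label-only oracle `query_label`; the bracketing loops and the binary search mirror the pseudocode above, while the default parameter values are illustrative.

```python
import numpy as np

def cross_boundary(query_label, x0, mu, d=1.0, t=0.1, eps=1e-3):
    """Algorithm 1 (sketch): starting from x0, scale the step along the unit
    direction mu until the predicted label changes, then binary-search the
    crossing distance and return a point just on the adversarial side."""
    y0 = query_label(x0)
    d_l = d_r = d
    while query_label(x0 + d_r * mu) == y0:     # grow until the label flips
        d_r *= (1.0 + t)
    while query_label(x0 + d_l * mu) != y0:     # shrink until the label holds
        d_l *= (1.0 - t)
    while d_r - d_l > eps:                      # binary search on the distance
        d_m = 0.5 * (d_r + d_l)
        if query_label(x0 + d_m * mu) == y0:
            d_l = d_m
        else:
            d_r = d_m
    return x0 + d_r * mu
```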

4.3. Improved cutting plane

Our cutting plane method performs well when the local geometry of the decision boundary is convex. However, this assumption usually does not hold in high-dimensional space. Moreover, computing the tangent plane takes many queries in a high-dimensional setting (e.g., at least n points are needed to build a hyperplane if the feature dimension is n). To overcome these defects, we combine the cutting plane method with local random optimization. The main idea of the local random optimization is greedy search, which reuses the queries spent on computing tangent planes. Instead of sampling points around a fixed point xi, we dynamically adjust the sampling center (xi at each iteration by default); that is, the sampling center is updated whenever a sampled point is better than the current center. The benefit is that the solution is optimized while the tangent plane is being computed.

As shown in Figure 2, the simple cutting plane method samples a point x'1 close to x1 to fit a tangent line. Because of the geometric complexity of the decision boundary, a new adversarial sample may miss the global optimum, leading to a poor local optimum. The greedy version of the cutting plane method instead updates the current optimal solution, to x'1 in the case shown in the figure (lines 8-9 in Algorithm 2), and then continues from the adversarial point x'1. After several iterations, it converges to a better solution.


Figure 2: 2-dimensional illustration

5. Experimental Results

We evaluate the performance of our attack on three different datasets: MNIST, Fashion-MNIST[18] and a malware traffic dataset[19].

Algorithm 2 Cutting plane
Input: Initial instance x0, current adversarial instance xi, feature dimension p, parameter α, query budget B, number of iterations M.
Output: Adversarial instance x*.
1:  Initialize matrix K to be empty
2:  for i = 1 to M do
3:      for j = 1 to p do
4:          Randomly sample a Gaussian vector gj
5:          x_i^j ← xi + α·gj
6:          Update x_i^j using Algorithm 1
7:          K ← K ∪ {x_i^j}
8:          if Dis(x_i^j, x0) < Dis(x*, x0) then
9:              x* ← x_i^j
10:     Compute the tangent plane from K and the projection direction from x0
11:     Update xi using Algorithm 1
12: return x*
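A compact sketch of the greedy loop in Algorithm 2. It reuses the `cross_boundary` sketch above, approximates the tangent plane by an SVD fit over the sampled boundary points, and exposes the sample count per iteration as a parameter; all of these are simplifications of ours, not the authors' implementation.

```python
import numpy as np

def cutting_plane_attack(query_label, x0, x_adv, n_samples, alpha, n_iters,
                         cross_boundary):
    """Greedy cutting-plane loop (sketch of Algorithm 2): sample near-boundary
    points around the current best adversarial point (the sampling center is
    updated greedily, as in Sec. 4.3), fit a local hyperplane to the samples,
    and step from x0 toward its projection onto that plane."""
    x_best = x_adv
    for _ in range(n_iters):
        pts = []
        for _ in range(n_samples):
            g = np.random.randn(*x0.shape)              # Gaussian perturbation
            mu = (x_best + alpha * g) - x0              # ray toward the sample
            mu /= np.linalg.norm(mu)
            x_j = cross_boundary(query_label, x0, mu)   # push onto the boundary
            pts.append(x_j.ravel())
            if np.linalg.norm(x_j - x0) < np.linalg.norm(x_best - x0):
                x_best = x_j                            # greedy update (lines 8-9)
        K = np.stack(pts)
        centered = K - K.mean(axis=0)
        normal = np.linalg.svd(centered)[2][-1]         # approximate plane normal
        offset = normal @ K.mean(axis=0)
        proj = x0.ravel() - (normal @ x0.ravel() - offset) * normal
        direction = (proj - x0.ravel()).reshape(x0.shape)
        direction /= np.linalg.norm(direction)
        cand = cross_boundary(query_label, x0, direction)
        if np.linalg.norm(cand - x0) < np.linalg.norm(x_best - x0):
            x_best = cand                               # keep the best point seen
    return x_best
```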

5.1. Attack on MNIST

In order to compare with previous studies, we use the same CNN model as the Boundary attack[9] and the C&W attack[13]. The CNN model features nine layers, with four convolutional layers, two max-pooling layers and two fully-connected layers, and achieves 99.5% accuracy on MNIST. We also evaluate our method on different machine learning classifiers. We compare our method against the Boundary Attack, the state-of-the-art attack in the decision-based setting. For each attack method, we randomly sampled N = 500 images from the validation set to evaluate attack performance. We use the following metric to quantify performance:

\[
P = \frac{1}{N}\sum_{i=1}^{N} \lVert x_i - x_0 \rVert_2 \tag{3}
\]
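As a small illustration of Eq. (3) (ours, not the paper's code), the metric is simply the mean L2 distortion over the N attacked images:

```python
import numpy as np

def average_distortion(originals, adversarials):
    """Eq. (3): mean L2 distance between each original image and the
    adversarial example crafted for it."""
    return float(np.mean([np.linalg.norm(adv - orig)
                          for orig, adv in zip(originals, adversarials)]))
```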

Table 1: Results of two attacks on MNIST

Attack Methods                 Average Distortion    Average Number of Queries
Boundary Attack                1.1660                40,010
                               1.1293                48,009
                               1.0759                76,476
Cutting Plane Attack (ours)    1.0571                45,244
                               1.0061                73,963
                               0.9950                81,157

In the untargeted setting, the attacker only seeks an image whose predicted label differs from that of the given original image. We compare the above methods, and the results are shown in Table 1. Here we use the same original images and initial adversarial images, which better shows the difference between the algorithms.

Figure 3: The MNIST average distortion in each iteration.

Next, we compare the attack methods at each iteration step. Figure 3 shows the average distortion (L2 distance) at each iteration.


Figure 4: Adversarial samples crafted by the Boundary Attack and Cutting plane attack for MNIST. For each image we show the total query number until that point (above the image) and the distortion (L2 distance) between the adversarial and the original (below the image).

The distortion achieved by our algorithm decreases rapidly in the early stage, and it is also slightly better than the Boundary attack overall. Intuitively, compared with the heuristic random walk of the Boundary attack, optimizing with the cutting plane method appears to be more efficient. A single case study is shown in Figure 4; note that we use the same original point and starting point for both algorithms, so uncertainty is reduced to a minimum. To better show the difference, we extract the adversarial perturbations from the adversarial samples generated by the two algorithms in Figure 5.


Figure 5: Single case study: Adversarial perturbations generated by Boundary attack and Cutting plane attack in each iteration for MNIST.

Table 2: Results of two attacks on Fashion-MNIST

Attack Methods                 Average Distortion    Average Number of Queries
Boundary Attack                2.0603e-02            15,509
                               1.6206e-02            20,513
                               1.4349e-02            25,511
Cutting Plane Attack (ours)    1.664e-02             15,225
                               1.3809e-02            20,344
                               1.3429e-02            25,443

5.2. Attack on Fashion-MNIST

We also evaluate our method on the Fashion-MNIST dataset. Fashion-MNIST is a novel dataset comprising 28 × 28 grayscale images of 70,000 fashion products from 10 categories, with 7,000 images per category. We use the model from the official Fashion-MNIST documentation4. The main components of the model are two convolutional layers, a batch normalization layer and a pooling layer5, and it achieves 93.4% accuracy on the validation set. As in the previous section, we randomly choose 500 samples and run the different algorithms. The overall results are shown in Table 2, and the per-iteration comparison is shown in Figure 6.

4 https://github.com/zalandoresearch/fashion-mnist
5 https://github.com/khanguyen1207/My-Machine-Learning-Corner/blob/master/Zalando MNIST/fashion.ipynb

Figure 6: The Fashion-MNIST average distortion in each iteration.

Although the Fashion-MNIST results are on a much smaller scale, owing to the differences between datasets and models, it can still be seen that our results are better on both measures. Notice that the Boundary attack needs nearly 5,000 queries to reduce the distortion from 1.6206e-02 to 1.4349e-02. We therefore regard the 0.92e-03 distortion gap between the Boundary attack and our cutting plane attack as a clear improvement.


Figure 7: Adversarial samples generated by the Boundary Attack and Cutting plane attack for the Fashion-MNIST dataset. We show the current number of queries and the perturbation distortion below each image. In particular, the initial direction used to generate the initial adversarial point is shown in the last image.

We report a single case study in Figures 7 and 8 to show the detailed difference. Here we choose an image labeled 'Pullover' as the original point. In the initial stage, we randomly sample an image to guide the attack direction, in this case a 'Sandal' image, shown as the last image in Figure 7. It can be seen that traces of the 'Sandal' image remain in the adversarial images generated by the Boundary attack after two iterations. By contrast, the adversarial images generated by our cutting plane method are much cleaner; a detailed distortion comparison is shown in Figure 8.

)[ZZOTMVRGTKGZZGIQ

2.819e-02

1.508e-02

1.372e-02

1.235e-02

1.235e-02

2.553e-02

2.148e-02

(U[TJGX_GZZGIQ

1.497e-01

8.614e-02

2.867e-02

Figure 8: Single case study: Adversarial perturbations generated by Boundary attack and Cutting plane attack in each iteration for Fashion-MNIST.

5.3. Attack on Malware Traffic Dataset

To evaluate the general applicability of our attack method, we also conduct experiments on a malware traffic classification dataset[19]. In that work, the traffic data are converted into the form of MNIST-like images. Here we directly use the processed data6, which includes both normal and malicious traffic.

6 https://github.com/echowei/DeepTraffic

Table 3: Results of two attacks on the Malware Traffic Dataset

Attack Methods                 Average Distortion    Average Number of Queries
Boundary Attack                2.7250                8,178
                               0.8647                49,836
                               0.8576                69,783
Cutting Plane Attack (ours)    1.0617                8,164
                               0.7745                49,816
                               0.7689                63,496

The dataset is divided into a training set with 245,437 datapoints (121,107 benign, 124,330 malware) and a test set with 27,271 datapoints (13,456 benign, 13,815 malware). Benign traffic includes BitTorrent, Gmail, MySQL, etc.; malware traffic includes Cridex, Geodo, Miuref, etc. We train a 2-layer CNN classifier according to the original paper, which achieves 100% accuracy on the test set. We randomly choose 1,000 malicious samples as original points, then apply our attack method and the baseline method to fool the malicious traffic detection classifier. The overall results are shown in Table 3 and Figure 9.

As shown in Figure 9, our method converges faster and the final solution quality is better than the baseline. In the early stage of the attack, it takes fewer queries to find a higher-quality solution. Thereafter, both the baseline and our method further optimize the attack points and eventually converge to a sub-optimal solution.


Figure 9: The average distortion in each iteration on Malware Traffic Dataset.

5.4. Dynamic defense

We notice that boundary attack methods depend heavily on the geometric properties of the decision boundary. We therefore propose a simple defense strategy: since the attacker has no knowledge of the victim classifier, we can dynamically switch between classifiers. The differences between the decision boundaries of different classifiers may improve the robustness of the classification system. We model the attacker and the defender as players in a game. The attacker either attacks or does not; here we assume the attacker has sufficient query resources, so he will always choose to attack. His utility can be defined as the smallest distortion achieved during the whole attack process. The defender can select a classifier at any time to disrupt the attack. We define the defender's utility as a combination of the classifier's accuracy and the distortion of the attack point.

First, we consider a random strategy applied at every query. For example, we dynamically choose among three different classifiers A, B and C. For a given original point, after running the attack on each classifier separately, the average distortions of the three classifiers may differ; suppose the average distortion of classifier A is the smallest. If we dynamically switch classifiers during the attack, the current attack point tends to move toward the boundary of classifier A, but it may not be adversarial for the remaining classifiers. However, switching classifiers too frequently may lower the classification accuracy for normal users, so we improve the random strategy into a more efficient defense strategy: rather than randomly choosing a classifier at every query, we select a classifier purposefully after classifying a batch of instances. Specifically, for the current attack point, we list the classifiers that classify it correctly and then select, among them, the classifier with the highest classification accuracy under normal circumstances. The attacker will randomly find a new attack point if his currently optimized attack point is no longer adversarial. A sketch of this switching defense is given below.
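The following is a minimal sketch of the improved switching strategy as we understand it; the base classifiers, their clean accuracies, the batch size of 8,000 and the `reference_label` hook are all placeholders rather than the authors' implementation.

```python
import numpy as np

class SwitchingDefense:
    """Serve predictions from one of several base classifiers and, every
    `batch` queries, re-select the most accurate classifier that still labels
    the monitored point correctly (the improved strategy of Sec. 5.4)."""

    def __init__(self, classifiers, accuracies, batch=8000):
        self.classifiers = classifiers              # objects exposing .predict(x)
        self.accuracies = accuracies                # clean accuracy of each classifier
        self.batch = batch
        self.current = int(np.argmax(accuracies))   # start with the best classifier
        self.count = 0

    def predict(self, x, reference_label=None):
        self.count += 1
        if reference_label is not None and self.count % self.batch == 0:
            # Keep the classifiers that still classify the suspected attack
            # point as its reference class, then pick the most accurate one.
            ok = [i for i, c in enumerate(self.classifiers)
                  if c.predict(x) == reference_label]
            if ok:
                self.current = max(ok, key=lambda i: self.accuracies[i])
        return self.classifiers[self.current].predict(x)
```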

In the experiments, we use five well-trained classifiers: CNN, SVM, LR, RF and GBDT. The accuracy of each classifier is shown in Table 4. For the random strategy, we randomly choose a classifier at every query during the attack; the attack success rate against each classifier is shown in Table 5. We randomly choose 500 samples, run the attack methods, send the final adversarial point to these five different classifiers, and check whether it is adversarial; if so, we regard it as a successful attack. This yields the attack success rate on each of the five classifiers. We also report the result for a dynamic classifier (randomly choosing one of the five base classifiers). Notice that the success rate of attacks against any single fixed base classifier is close to 100%. We therefore regard this simple defense as helpful for improving the robustness of the classification system, although this assumes the attacker has no counter-strategy.

Table 4: Accuracy of five base classifiers

Base Models    Classification Accuracy
CNN            99.5%
SVM            97.3%
LR             92.0%
RF             95.0%
GBDT           95.0%

Table 5: Success rate of attacks on each classifier

Victim Models        Success rate (using defense)    Success rate (not using defense)
CNN                  0%                              ≈ 100%
SVM                  4%                              ≈ 100%
LR                   1%                              ≈ 100%
RF                   76%                             ≈ 100%
GBDT                 22%                             ≈ 100%
Dynamic Classifier   23%                             -

For the improved strategy, we use the same classifiers as in the random strategy, but a policy selection is performed every 8,000 queries. We count the number of classifier switches; note that each switch means that the attack point the attacker has carefully maintained is no longer adversarial, and the attacker must choose a new attack point. Therefore, a higher switching frequency means the attack is disrupted more often. The final results are shown in Table 6; the success rate of the defense is 37.5%. In addition, the model is more stable because the classifier is switched only after a batch of queries.

Table 6: Success rate of defense when using the improved strategy

Average frequency of switch    Average total number of iterations    Success rate of defense
2.96                           7.89                                  37.5%

6. Discussion

This paper evaluates the vulnerability of machine learning models in the face of a resource-restricted attacker. We show that even when the attacker lacks internal information and detailed feedback from the model, he can still mount successful attacks with a reasonable number of queries. First, we model a strict attack scenario in which the attacker only has access to the final labels of given input images. Then we propose a query-efficient attack algorithm that combines local optimization with a cutting plane method to solve the attack problem. Moreover, we show that our method outperforms the state-of-the-art boundary attack method on MNIST, Fashion-MNIST and a malware traffic detection dataset.

However, some problems remain. We currently evaluate our method on three datasets, and it still needs to be extended to more. In addition, the current method relies on sampling points along the boundary, and the number of samples equals the data dimension; it would be more efficient if dimensionality reduction could be incorporated, which we leave as future work. Finally, since most existing attack methods are based on continuous feature data such as images, how to extend them to discrete feature data is also worth exploring.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. 61872120), the Natural Science Foundation of Zhejiang Province (Grant Nos. LY18F020017 and LY18F030007) and the Key R&D Program of Zhejiang Province (Grant No. 2017C01062).

References

[1] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz, A Bayesian approach to filtering junk e-mail, in: Learning for Text Categorization: Papers from the 1998 Workshop, Vol. 62, Madison, Wisconsin, 1998, pp. 98–105.

[2] W. Huang, J. W. Stokes, MtNet: a multi-task neural network for dynamic malware classification, in: International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, Springer, 2016, pp. 399–418.

[3] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.

[4] J. Stallkamp, M. Schlipsing, J. Salmen, C. Igel, Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition, Vol. 32, Elsevier, 2012, pp. 323–332.

[5] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv:1312.6199, 2013.

[6] S. Chen, M. Xue, L. Fan, S. Hao, L. Xu, H. Zhu, B. Li, Automated poisoning attacks and defenses in malware detection systems: An adversarial machine learning approach, Vol. 73, Elsevier, 2018, pp. 326–344.

[7] A. AlEroud, G. Karabatis, SDN-GAN: Generative adversarial deep NNs for synthesizing cyber attacks on software defined networks, in: SIAnA, 2019.

[8] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, A. Swami, Practical black-box attacks against machine learning, in: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ACM, 2017, pp. 506–519.

[9] W. Brendel, J. Rauber, M. Bethge, Decision-based adversarial attacks: Reliable attacks against black-box machine learning models, arXiv:1712.04248, 2017.

[10] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv:1412.6572, 2014.

[11] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard, DeepFool: a simple and accurate method to fool deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2574–2582.

[12] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, A. Swami, The limitations of deep learning in adversarial settings, in: 2016 IEEE European Symposium on Security and Privacy (EuroS&P), IEEE, 2016, pp. 372–387.

[13] N. Carlini, D. Wagner, Towards evaluating the robustness of neural networks, in: 2017 IEEE Symposium on Security and Privacy (SP), IEEE, 2017, pp. 39–57.

[14] N. Papernot, P. McDaniel, I. Goodfellow, Transferability in machine learning: from phenomena to black-box attacks using adversarial samples, arXiv:1605.07277, 2016.

[15] N. Narodytska, S. P. Kasiviswanathan, Simple black-box adversarial perturbations for deep networks, arXiv:1612.06299, 2016.

[16] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, C.-J. Hsieh, ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models, in: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, ACM, 2017, pp. 15–26.

[17] B. Biggio, F. Roli, Wild patterns: Ten years after the rise of adversarial machine learning, Vol. 84, Elsevier, 2018, pp. 317–331.

[18] H. Xiao, K. Rasul, R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv:1708.07747, 2017.

[19] W. Wang, M. Zhu, X. Zeng, X. Ye, Y. Sheng, Malware traffic classification using convolutional neural network for representation learning, in: 2017 International Conference on Information Networking (ICOIN), 2017, pp. 712–717. doi:10.1109/ICOIN.2017.7899588.

Yizhi Ren received his PhD in Computer Software and Theory from Dalian University of Technology, China, in 2011. He is currently an associate professor with the School of Cyberspace, Hangzhou Dianzi University, China. From 2008 to 2010, he was a research fellow at Kyushu University, Japan. His current research interests include network security, complex networks, and trust management. Dr. Ren has published over 60 research papers in refereed journals and conferences. He won the IEEE TrustCom 2018 Best Paper Award, the CSS 2009 Student Paper Award and the AINA 2011 Best Student Paper Award.

Qi Zhou is currently a master's candidate at Hangzhou Dianzi University, China. His research interests include adversarial machine learning and information security.

Zhen Wang received the Ph.D. degree from Dalian University of Technology, China, in 2016. He is an associate professor at Hangzhou Dianzi University. His research interests include game theory and information security.

Ting Wu received the Ph.D. degree from Shandong University in 2002. He is a professor at Hangzhou Dianzi University. His research interests include cryptography and information security.

Guohua Wu received the Ph.D. degree from Zhejiang University in 1998. He is a professor at Hangzhou Dianzi University. His research interests include data-driven security and information security.

Kim-Kwang Raymond Choo holds the Cloud Technology Endowed Professorship at the University of Texas at San Antonio (UTSA) and has a courtesy appointment at the University of South Australia. In 2016, he was named Cybersecurity Educator of the Year - APAC, and in 2015 Quang, Ben and Raymond won the Digital Forensics Research Challenge organized by Germany's University of Erlangen-Nuremberg. He is the recipient of the 2018 UTSA College of Business Col. Jean Piccione and Lt. Col. Philip Piccione Endowed Research Award, the IEEE TrustCom 2018 and ESORICS 2015 Best Paper Awards, the 2014 Highly Commended Award by the Australia New Zealand Policing Advisory Agency, a Fulbright Scholarship in 2009, the 2008 Australia Day Achievement Medallion, and the British Computer Society's Wilkes Award in 2008. He is also a Fellow of the Australian Computer Society, an IEEE Senior Member, and Co-Chair of the IEEE Multimedia Communications Technical Committee's Digital Rights Management for Multimedia Interest Group.
