Journal Pre-proof

CNN-MHSA: A convolutional neural network and multi-head self-attention combined approach for detecting phishing websites
Xi Xiao, Dianyan Zhang, Guangwu Hu, Yong Jiang, Shutao Xia

PII: S0893-6080(20)30058-7
DOI: https://doi.org/10.1016/j.neunet.2020.02.013
Reference: NN 4411
To appear in: Neural Networks
Received date: 13 July 2019
Revised date: 19 January 2020
Accepted date: 20 February 2020

Please cite this article as: X. Xiao, D. Zhang, G. Hu et al., CNN-MHSA: A convolutional neural network and multi-head self-attention combined approach for detecting phishing websites. Neural Networks (2020), doi: https://doi.org/10.1016/j.neunet.2020.02.013.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2020 Elsevier Ltd. All rights reserved.
CNN-MHSA: A Convolutional Neural Network and Multi-Head Self-Attention Combined Approach for Detecting Phishing Websites

Xi Xiao (1,2), Dianyan Zhang (1), Guangwu Hu (3,*), Yong Jiang (1,2), and Shutao Xia (1,2)

1 Graduate School at Shenzhen, Tsinghua University, Shenzhen, China
  {xiaox, jiangy, xiast}@sz.tsinghua.edu.cn, [email protected]
2 Peng Cheng Laboratory, Shenzhen, China
3 School of Computer Science, Shenzhen Institute of Information Technology, Shenzhen 518172, China
  [email protected]
Abstract. The increasing number of phishing sites today poses a great threat because of their terribly imperceptible hazard. They expect users to mistake them for legitimate ones so as to steal user information and property without notice. The conventional way to mitigate such threats is to set up blacklists. However, blacklists cannot detect one-time Uniform Resource Locators (URLs) that have not appeared in the list. As an improvement, deep learning methods have been applied to increase detection accuracy and reduce the misjudgment ratio. However, some of them only focus on the characters in URLs but ignore the relationships between characters, so their detection accuracy still needs to be improved. Considering that multi-head self-attention (MHSA) can learn the inner structures of URLs, in this paper we propose CNN-MHSA, a Convolutional Neural Network (CNN) and MHSA combined approach for highly precise phishing website detection. To achieve this goal, CNN-MHSA first takes a URL string as the input data and feeds it into a mature CNN model so as to extract its features. In the meanwhile, MHSA is applied to exploit the characters' relationships in the URL so as to calculate the corresponding weights for the CNN-learned features. Finally, CNN-MHSA produces a highly precise detection result for a URL object by integrating its features and their weights. Thorough experiments on a dataset collected in a real environment demonstrate that our method achieves 99.84% accuracy, which outperforms the classical method CNN-LSTM and is at least 6.25% higher than other similar methods on average.

Keywords: phishing · URL · deep learning · convolutional layer · multi-head self-attention
* Corresponding Author: Guangwu Hu ([email protected])
1 Introduction
Phishing websites are malicious websites whose webpages and Uniform Resource Locator (URL) addresses are similar to those of legitimate ones, so that they can be used to lure people to visit them and leak their personal information imperceptibly, finally meeting the perpetrators' vicious purposes [1], which can cause great loss to victims. Taking the URL of one of the largest banks in China, the Industrial and Commercial Bank of China (ICBC), as an example, the attacker may coin the official website URL "www.icbc.com" as "www.1cbc.com" and, with the aid of other tricks, e.g., DNS hijacking and URL hijacking, make people mistakenly believe it is authentic, considering the similar appearance and URL addresses that are hardly distinguishable from the legitimate ones. Once a victim has logged in, the attacker can acquire their username, password and other personal information. Further, by taking advantage of this information, attackers can impersonate the victims' identities to log in to the legitimate website and perform illegal operations (e.g., money transfers). This threat has become a major cyber attack modus operandi, and statistics show that it presents a growing trend. According to the report of the Anti-Phishing Working Group (APWG), the number of phishing sites reached 263,538 by the end of Q1 2018, an increase of 46% compared to Q4 2017 [2].

To mitigate such threats, many companies have developed anti-phishing tools [3]. For example, eBay invented a toolbar to identify eBay-owned sites [4], using blacklists to block attackers' URLs that have been recorded before. However, blacklists need time to update their lists, during which many people may visit the sites and possibly have already become victims [5]. To evade the blacklist's interception, attackers adopt one-time phishing URLs and direct users to visit them before redirecting them to the legitimate sites. To cope with this trick, machine learning methods, e.g., Support Vector Machines (SVM) and Random Forest (RF), have been employed, relying on a built classifier to study sample URLs' features so as to give judgments for newly emerging ones [6-11]. As for the features used for classifier training, experts are required to manually define them from URLs and HTML contents, which leads to different training effects and judgment results since experts' knowledge and concerns differ. Instead, deep learning methods such as Long Short Term Memory (LSTM) [13] can automatically discover the representations (features) from raw data that meet the classifier's requirements. Specifically, they can transform a low-level representation into a more abstract higher-level one, which amplifies aspects of the input that are important for discrimination and suppresses irrelevant variations. In general, deep learning methods can achieve better classification results than conventional machine learning ones.

Among these outstanding deep-learning-based methods, the CNN-LSTM [14] proposal achieved an excellent result by combining a Convolutional Neural Network (CNN) and LSTM, where the CNN is responsible for learning URLs' features, which are fed to the LSTM to reach a final judgment. The classification results show that CNN-LSTM outperforms either CNN or a single LSTM. Besides, multi-head self-attention (MHSA) was proposed by Google for Natural Language Processing (NLP).
Since it pays more attention to the structure of a sentence, MHSA can discover the inner dependency relationships between a sentence's different characters and outperforms LSTM in terms of classification accuracy and training speed [15]. Thus it is better qualified to cope with word/string analysis. As for the characters' dependency relationship analysis, MHSA presents it as a weight matrix that can give features different importance. Inspired by MHSA's advances in NLP, we assume that if we treat a URL (e.g., https://www.icbc.com) as a sentence, then MHSA can be introduced to distinguish whether this URL is legitimate or not after it gets trained.

Motivated by this idea, we design a deep learning model named CNN-MHSA to detect phishing websites, where CNN is adopted to learn features in URLs and MHSA is leveraged to learn the different feature weights. By combining the two technologies' advantages, the judgment accuracy can reach a new level. In detail, CNN-MHSA first receives a URL string as input and utilizes an embedding layer to transform it into a characters' relationship matrix. Then it duplicates the matrix and feeds the copies into two modules concurrently: one for feature learning and the other for weight calculation. In the former, CNN-MHSA uses convolutional layers and fully connected layers to learn the representations of features, while in the latter, MHSA layers and fully connected layers are built to calculate the feature weights. After that, the learned features and feature weights are both fed into an output block constructed from a multi-head attention layer and a fully connected layer, which integrates the two parts of data and outputs a final judgment about whether the input URL is a phishing URL or not. Eventually, thorough experiments show that our method reaches up to 99.84% accuracy, which outperforms the classical proposal CNN-LSTM and is 6.25% higher than the previous methods [4, 7, 13, 14, 16] on average.

Our work mainly makes the following contributions:
- We create a novel deep learning network model named CNN-MHSA, which delicately integrates a Convolutional Neural Network and multi-head self-attention so as to detect phishing sites. To the best of our knowledge, this is the first attempt to combine the two technologies for phishing website detection.
- Our key observation is that URLs can be treated as sentences, and thus we can leverage MHSA's advantage in NLP to detect phishing websites. In our CNN-MHSA model, several layers, including the convolutional layer and the MHSA layer, are organically connected, where the former is used for learning URLs' features while the latter is aimed at calculating the corresponding features' weights.
- We evaluate our model on a real dataset and achieve 99.84% accuracy. Also, we define several metrics to evaluate our model and show that it outperforms the other methods.
This paper is organized as follows: Section 2 first summarizes the previous efforts in detecting phishing websites. Then Section 3 reviews some necessary background knowledge related to our methodology. Next, Section 4 explains our method in detail. Furthermore, Section 5 introduces our experiments, including the experimental dataset and the comparison with the other methods. Finally, Section 6 concludes the whole paper and lists future work.

2 Related Work
To mitigate the threat of phishing websites, many excellent studies have made efforts. For instance, conventional ways like blacklists [17] are still used by many Internet companies [3]. A blacklist is actually a dataset containing URLs that have been proved to be malicious, so that it can be used to prevent users from accessing those websites. But a blacklist cannot predict results for a new URL that has not appeared in the list, and attackers now tend to use one-time URLs for their actions. Thus, graph-based, machine-learning-based and deep-learning-based approaches have been invented to cope with this situation.

2.1 Graph-based Approaches

Graph-based methods aim to find the victim website through phishing URLs by drawing a relation graph, based on the fact that 42% of phishing URLs contain their target links [18]. By doing this, users can be warned and made more vigilant against the phishing URLs. To find the clue of the victim websites from the phishing URLs, some studies first try to search for initially associated pages which are directly linked by the suspicious pages. For example, Zou, F. et al. [19] built an AD-URL graph with a Markov Random Field and employed a belief propagation algorithm to do the detection. The AD-URL graph represents mutual behaviors between users and websites by analyzing network traffic. Ramesh, G. et al. [20] constructed a parasitic matrix to show the relation between two sites and used the row sum and the column sum to find the target link of the site. The row sum of the matrix represents the out-degree of links and the column sum refers to the in-degree. Furthermore, Liu, W. et al. [21] reinforced the associated pages to build a large graph called a parasitic community and then simplified the graph by some metrics to show the target clearly. However, building a graph takes much time and memory. Besides, the trickster can easily learn to hide the target links in his phishing sites [5].

2.2 Machine Learning Approaches

As an improvement, machine learning algorithms are widely used in phishing detection. Zhang, Y. et al. [4] designed a model called CANTINA using TF-IDF to extract keywords from URLs and HTMLs. TF-IDF counts the number of characters in documents and the number of documents to calculate a score for each character and extract keywords from the documents. The top 5 keywords of each site are selected to perform Google searches. If the top results have the same domain as the website, the website is treated as legitimate.
Some classic algorithms, including Naive Bayes, Support Vector Machines (SVM), and Random Forest (RF), have achieved good results. In [7], four algorithms were compared: Logistic Regression, SVM, RF and k-Nearest Neighbor, where RF had the highest F-1 score. Besides, Marchal, S. [16] extracted 212 features from URLs and HTMLs and then selected Gradient Boosting to build the classification model. These methods manually extract features from phishing and legitimate websites and use machine learning algorithms to train a model to decide whether a URL is phishing or legitimate [22]. Site features contain URL features, HTML features, image features and third-party-service-based features. Among them, URL features include the length of the URL, the number of dots and the number of digits in links. URL features also cover whether the URL is an IP address, whether the URL contains "@", whether the URL uses HTTPS, etc. [6-11]. HTML features are composed of the number of hyperlinks, the number of input forms, the number of images, etc. [6, 8]. Image features consist of histogram-of-oriented-gradient features and color features [23]. In addition, third-party-service-based features include the age of the domain, the page rank and the site's presence in search engine results [6]. These methods can block zero-day attacks, but the features must be found manually by researchers. This limits the improvement of the results, because no one can guarantee that the selected features are sufficient for detecting phishing sites. A small sketch of such hand-crafted URL features follows.
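To make this concrete, the following is a minimal sketch of the kind of hand-crafted lexical URL features listed above (URL length, dot count, digit count, "@" presence, HTTPS usage, raw-IP host). The feature set and names here are illustrative assumptions, far smaller than, e.g., the 212 features of [16].

```python
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Compute a handful of classic lexical URL features."""
    host = urlparse(url).netloc
    return {
        "length": len(url),                              # length of the URL
        "num_dots": url.count("."),                      # number of dots
        "num_digits": sum(c.isdigit() for c in url),     # number of digits
        "has_at": "@" in url,                            # embedded "@" trick
        "uses_https": url.startswith("https://"),        # whether HTTPS is used
        "host_is_ip": host.replace(".", "").isdigit(),   # host is a raw IP address
    }

# Example: the coined ICBC phishing URL from the introduction.
print(url_features("http://www.1cbc.com/login?user=a"))
```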
2.3 Deep Learning Approaches

In order to extract features automatically from samples, deep learning networks are employed to detect phishing sites. In general, these methods first transform the whole URL or HTML into a matrix [13] (also known as embedding). Then they feed the matrix into a deep learning network. Finally, the trained network outputs a float result in (0, 1) so as to judge whether the input URL is phishing or not. As for the detailed neural networks, the Multi-Layer Perceptron (MLP) [24] is the simplest network and is widely used in text classification. Since a URL can be regarded as text, Mohammad, R. M. et al. [25] and Nguyen, L. A. T. et al. [26] introduced MLP into phishing detection and obtained high accuracy rates. Further, Bahnsen, A. C. et al. [13] translated each input character of the URL into a 150-step sequence by 128-dimensional embedding and then fed the sequence into a Long Short Term Memory (LSTM) layer. Finally, they got a higher F-1 score and found that LSTM took less memory than conventional machine learning methods. In addition, Zhang, J. et al. [27] leveraged Deep Belief Networks (DBNs) to detect phishing websites; a DBN calculates the probability distribution through the edge distribution of the energy function and gets the maximum likelihood estimation. The method improved accuracy by 1% and F-1 score by 2% over SVM. Verma, R. et al. [28] introduced online learning with n-grams for phishing website detection. They split URLs into grams and then used different online learning algorithms to detect phishing websites. Yang, P. et al. [14] transformed a URL into a matrix with one-hot encoding, and then used embedding to decrease the dimension of the matrix. Subsequently, the matrix with lower dimension was put into a CNN and then an LSTM. Finally, they used the softmax function to calculate the result. As they explained, CNN could learn features from the URL because each character of the URL may be related to nearby characters. LSTM can learn the sequential dependency from character sequences, so it is used to capture the context semantics and dependency features of the URL.
In summary, deep learning methods can train the classifier more sufficiently, since they automatically extract features from samples instead of having them manually designated by humans as in machine-learning-based ways. However, they still need more training time than conventional machine learning methods.

3 Background

To help the readers quickly understand our proposal, we briefly introduce the background knowledge regarding CNN and MHSA in this section.
3.1 Convolutional Neural Network
A Convolutional Neural Network (CNN) [29] is a kind of feedforward neural network whose artificial kernels can respond not only to a single pixel but also to its neighbors. Normally, there are two kinds of layers in CNNs, i.e., convolutional layers and pooling layers [30]. The former is used to learn features with some small, movable windows called filters or kernels. After that, the latter, which includes max-pooling and average-pooling, is responsible for reducing the dimension of the features. In 1998, LeCun, Y. et al. [31] first designed a CNN for document recognition because they realized that CNN had the advantages of shift, scale and distortion invariance with its local receptive fields, shared weights and sub-sampling. Later, CNN was applied to the NLP field. For instance, in 2014, Kim [32] used CNN to perform sentence classification. Figure 1 shows the structure of Kim's method. More specifically, he first transformed each word of a sentence into a vector. Then, the vectors were listed lengthways according to the order of the words in the sentence. In this way, he got a 2-dimensional matrix, processed in the way others deal with images [29], and put it into a convolutional layer and a max-pooling layer. Finally, he reached his goal by feeding the result from the max-pooling layer into a fully connected layer.

Fig. 1. The illustration of CNN for sentence classification
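For illustration, the following is a minimal PyTorch sketch of such a Kim-style text CNN. The filter count and kernel widths are assumptions chosen for illustration, not the configuration used in [32] or in our model.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """Kim-style sentence classifier: embed, convolve, max-pool, classify."""
    def __init__(self, vocab_size=84, embed_dim=64, num_filters=100,
                 kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per window size, sliding over the token sequence.
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), 1)

    def forward(self, x):                   # x: (batch, seq_len) of token ids
        e = self.embed(x).transpose(1, 2)   # (batch, embed_dim, seq_len)
        # Max-pool over time: each filter keeps only its strongest response.
        pooled = [torch.relu(c(e)).max(dim=2).values for c in self.convs]
        return torch.sigmoid(self.fc(torch.cat(pooled, dim=1)))

model = TextCNN()
scores = model(torch.randint(0, 84, (2, 50)))   # two toy 50-token "sentences"
```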
3.2 Multi-Head Self-Attention
Multi-Head Self-Attention (MHSA) is a kind of attention mechanism that is now widely used in machine translation [34]. When a person reads a paper, he/she cannot pay the same attention to everything [33]. Therefore we can give different importance to different features. An attention function maps a query and a set of key-value pairs to the attention [35]. Scaled dot-product attention is one of the most commonly used attention functions so far [15]:

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (1)
where Q, K and V are matrices of queries, keys and values, respectively, and d_k refers to the dimension of the keys. If Q, K, and V are equal, the function is called self-attention, which is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence [15]. It pays more attention to the inner structure of a query. Recently, the self-attention mechanism has been successfully applied in many scenarios like reading comprehension and textual entailment [36, 37]. Similar to the filters in CNN, a single attention is not enough for learning multiple features' weights. To allow the model to learn information from different representation subspaces, the MHSA mechanism is presented [15] as:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (2)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (3)

where W^O, W_i^Q, W_i^K and W_i^V are parameter matrices. Recently, Google introduced the self-attention mechanism into machine translation and achieved better results than the previous method with LSTM [15]. Compared with LSTM, the MHSA mechanism requires less calculation, so it can be trained faster than LSTM. Furthermore, LSTM considers the later features in a query much more important than the former features, whereas MHSA calculates a weight to express the different importance of each feature, which is believed to be more suitable in phishing website detection scenarios.
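The sketch below implements Equations (1)-(3) directly, to make the data flow explicit. The head count (8) and dimensions (a length-100 sequence embedded into 64 dimensions, heads of size 8) match the setting used later in our model; the random parameter matrices are placeholders for learned weights.

```python
import math
import torch

def attention(Q, K, V):
    """Scaled dot-product attention, Equation (1)."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # QK^T / sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

def multi_head_self_attention(x, W_q, W_k, W_v, W_o):
    """Equations (2)-(3) with Q = K = V = x (self-attention)."""
    heads = [attention(x @ W_q[i], x @ W_k[i], x @ W_v[i])
             for i in range(len(W_q))]                   # Equation (3)
    return torch.cat(heads, dim=-1) @ W_o                # Equation (2)

x = torch.randn(100, 64)                                 # one embedded URL
W_q = [torch.randn(64, 8) for _ in range(8)]             # placeholder weights
W_k = [torch.randn(64, 8) for _ in range(8)]
W_v = [torch.randn(64, 8) for _ in range(8)]
W_o = torch.randn(64, 64)
out = multi_head_self_attention(x, W_q, W_k, W_v, W_o)   # shape (100, 64)
```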
4 Methodology
Inspired by MHSA’s excellent performance in NLP in terms of automatically computing features’ weights and awared of MHSA’s ability in digging the inner dependency relationships between different characters in URL, our first intuition is that MHSA should be applied to URL analysis, which may outperform LSTM. In the meanwhile, we also know that CNN can automatically learn URL’s features without humans’ intervention. Thus, our insight is to assemble two technologies together so as to marry their merits to serve phishing website detection and get a better result. And finally, our experimental results prove our idea. In the next subsection, we will introduce our model and depict its components in detail. 4.1
4.1 Model Overview
Figure 2 shows the overview of our method, which includes an embedding layer, N feature learner modules, M weight calculator modules and an output block. At the beginning, the model transforms an input URL into a matrix via the embedding layer. Then the model duplicates the matrix into two copies, one for feature learning and one for feature weight calculation. During the weight calculation process, one of the copies is fed into the MHSA layers to calculate the features' weights. At the same time, during the feature extraction process, the other copy of the URL matrix is put into the convolutional layers to learn features, and the previous layer's output is treated as the input of the next layer. After the two concurrent processes finish, the two parts of output data are fed into the output block together so as to compute the final classification result. As for the module numbers N and M, we learned from the experimental results that setting both of them to 3 is the best choice; we show the comparison in Subsection 5.3. Next, we describe each part's inner structure in detail.
Fig. 2. Model overview of our method

4.2 Embedding Layer
Since a URL is formed from 84 kinds of characters (a-z, A-Z, 0-9, - . !*'();:&=+$,/?#[]) [38], we use One-Hot Encoding [39] to represent each character. Specifically, we transform a URL string into a matrix with N rows (N is the length, i.e., the character count, of the URL) and 84 columns through the embedding layer. For example, if we use One-Hot Encoding to process the string "http", the embedding layer will first generate an 84-dimensional vector for the character "h" whose eighth entry is 1 and whose remaining 83 entries are 0, since "h" occupies the eighth of the 84 positions. For the double characters "tt", it will generate two identical vectors whose positions corresponding to "t" have value 1 while the rest are 0. Then for the last character "p", the embedding layer will generate the final vector and set the 16th position to 1 while the rest are 0. As a result, the string "http" can be expressed by a 4*84 matrix. Further, the layer decreases the dimension from 84 to 64 via a neural network for processing efficiency, so the transformed matrix becomes an N*64 matrix; this dimension decrement does not cause data loss. However, since neural networks can only deal with fixed-length vectors, URLs of various lengths have to be processed into fixed-length strings so as to make the final output a matrix with a fixed number of rows (noted as L) and 64 columns. The way to do this is to trim the characters exceeding L or to pad the insufficient part with 0 [32]. In other words, URLs of various lengths are first trimmed/padded to the fixed length L, and then each character is represented by a 64-dimensional vector. Thus the final output of this embedding layer is an L*64 matrix, where L is a fixed number that we experiment with in Subsection 5.3.
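The following is a minimal sketch of this embedding step: one-hot encode each character, trim/pad to the fixed length L, and project the vectors down to 64 dimensions with a learned linear map. The exact character alphabet and its ordering are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Assumed character set and ordering; the paper counts 84 kinds of URL characters.
ALPHABET = ("abcdefghijklmnopqrstuvwxyz"
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "0123456789-.!*'();:&=+$,/?#[]_~%@")
CHAR2IDX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot_url(url: str, L: int = 100) -> torch.Tensor:
    """One-hot matrix of shape (L, |ALPHABET|): trim beyond L, pad with zeros."""
    m = torch.zeros(L, len(ALPHABET))
    for row, ch in enumerate(url[:L]):          # characters beyond L are trimmed
        col = CHAR2IDX.get(ch)
        if col is not None:
            m[row, col] = 1.0                   # rows past len(url) stay all-zero
    return m

project = nn.Linear(len(ALPHABET), 64, bias=False)    # learned 84 -> 64 reduction
matrix = project(one_hot_url("http://www.1cbc.com"))  # shape (100, 64)
```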
4.3 Feature Learner
The feature learner is responsible for extracting features automatically from the matrix output by the embedding layer. As Figure 3 shows, one feature learner has four layers, i.e., a convolutional layer, two residual layers and a fully connected layer. In the data processing flow, the fixed-dimensional matrix output by the embedding layer is first fed into the first feature learner's convolutional layer, which contains 5 convolutional kernels that learn features from different spaces and a max-pooling operation that chooses the best outcome from the 5 kernels. Next, the output of this layer is processed by one of the two residual layers, which adds the input and output of the convolutional layer, so as to mitigate the degradation problem: accuracy gets saturated (which might be unsurprising) and then degrades rapidly as the network depth increases [40]. Further, the fully connected layer with the ReLU activation function is used to enhance the expression ability of the neural network and the efficiency of extracting features from the input data [40]:

y = max(0, w^T x + b)    (4)

where w and b are parameter matrices and x is the input of the function.
Fig. 3. The structure of the feature learner
Further, the second residual layer enhances training accuracy by adding the input and the output of the fully connected layer together. Finally, the outcome matrix from the residual layer still maintains the dimension of 64, since we set the dimensions of the fully connected layer and the residual layer all to 64, which also equals the feature matrix's dimension. Note that this is one single round of feature learning; the output will be processed by the remaining N-1 feature learners so as to obtain sufficient feature data.
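A minimal PyTorch sketch of one such feature-learner module follows. The five kernel widths are assumptions (the paper only states that there are 5 kernels), and the pooling across kernels is realized as an element-wise maximum over their outputs.

```python
import torch
import torch.nn as nn

class FeatureLearner(nn.Module):
    """Convolution + max-pooling over 5 kernels, two residual connections, FC."""
    def __init__(self, dim=64, kernel_sizes=(1, 3, 5, 7, 9)):
        super().__init__()
        # 5 convolutional kernels; "same" padding keeps the (L, 64) shape.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, k, padding="same") for k in kernel_sizes)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, L, 64)
        c = x.transpose(1, 2)                    # (batch, 64, L) for Conv1d
        # Choose the best outcome across the 5 kernels (element-wise max).
        conv_out = torch.stack([k(c) for k in self.convs]).max(dim=0).values
        h = x + conv_out.transpose(1, 2)         # residual over the conv layer
        return h + torch.relu(self.fc(h))        # ReLU FC with second residual

out = FeatureLearner()(torch.randn(2, 100, 64))  # shape preserved: (2, 100, 64)
```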
4.4 Weight Calculator
To calculate the feature weights, we design the weight calculator, which contains an MHSA layer, two residual layers and a fully connected layer, as illustrated in Figure 4. At first, the output of the embedding layer, i.e., one of the two matrices carrying the URL string information, needs to be injected with its characters' positional information via positional encoding, which captures the relative positions of the characters in the URL string sequence [15]. For example, the two consecutive letters "tt" in the string "http" have different positions even though they are the same character, so we need positional codes to represent them. There are plenty of positional encoding methods; we employ sine and cosine functions [15] in our work:

PE(k) = sin(pos / 10000^(k/64))  (k is even)    (5)

PE(k) = cos(pos / 10000^(k/64))  (k is odd)    (6)
where pos is the position of the character in the URL and k is the dimensional index. If k is an even number, we use Equation (5) to calculate the positional encoding; otherwise, we use Equation (6). Taking the string "http" as an example, the embedding layer's output (if we do not extend to dimension 64, for illustration) would be ((1000), (0100), (0100), (0010)), and the position encoding matrix would then be:

PE = | sin(1/10000^(1/64))  cos(1/10000^(2/64))  sin(1/10000^(3/64))  cos(1/10000^(4/64)) |
     | sin(2/10000^(1/64))  cos(2/10000^(2/64))  sin(2/10000^(3/64))  cos(2/10000^(4/64)) |
     | sin(3/10000^(1/64))  cos(3/10000^(2/64))  sin(3/10000^(3/64))  cos(3/10000^(4/64)) |    (7)
     | sin(4/10000^(1/64))  cos(4/10000^(2/64))  sin(4/10000^(3/64))  cos(4/10000^(4/64)) |

Fig. 4. Weight calculator
Further, we sum the PE matrix and the one-hot encoded matrix together so as to get the positional encoding result for the whole URL string. The resulting matrix is then fed into the MHSA layer, which has eight heads, each responsible for processing an 8-dimensional slice of the matrix; the parameters (Q, K, V) of each head are sourced from three copies of the input data. Thus, after processing, the dimension of the MHSA layer's output is still 64, the same as the input. After that, both the MHSA layer's input and output are sent to a residual layer to ease the degradation issue. Then the fully connected layer receives the output of the residual layer so as to enhance the data's expression ability. Finally, the last residual layer plays the same role as the previous one. Note that this is one of the M modules; the remaining M-1 modules process data by handling each predecessor's output.
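As an illustration, here is a minimal sketch of the positional encoding of Equations (5)-(6) together with one weight-calculator module. It uses PyTorch's nn.MultiheadAttention as a stand-in for the 8-head MHSA layer; the omission of layer normalization and other details are simplifying assumptions.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(L: int, dim: int = 64) -> torch.Tensor:
    """Sinusoidal encoding: sin for even k, cos for odd k, per Eqs. (5)-(6)."""
    pe = torch.zeros(L, dim)
    for pos in range(1, L + 1):                 # positions counted from 1
        for k in range(1, dim + 1):             # dimension index k = 1..64
            angle = pos / (10000 ** (k / dim))
            pe[pos - 1, k - 1] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe

class WeightCalculator(nn.Module):
    """Positional encoding + 8-head self-attention + residuals + FC."""
    def __init__(self, dim=64, heads=8):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (batch, L, 64)
        x = x + positional_encoding(x.size(1), x.size(2)).to(x.device)
        attn_out, _ = self.mhsa(x, x, x)         # Q = K = V: self-attention
        h = x + attn_out                         # first residual connection
        return h + self.fc(h)                    # FC with second residual

out = WeightCalculator()(torch.randn(2, 100, 64))   # shape (2, 100, 64)
```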
4.5 Output Block
After we get features from feature learners and feature weights from weight calculators, we use an output block to combine them to get the final result.
Fig. 5. The structure of the output block
Figure 5 shows the structure of the output block. At first, the features and two copies of the feature weights serve as the three parameters (Q, K, V) of the multi-head attention layer. Then both the original features and the output of the multi-head attention layer are fed into a residual layer. After that, the fully connected layer with the ReLU activation function receives the output of the residual layer. Moreover, we put the output and input of the fully connected layer into another residual layer. Finally, the output of that residual layer is fed into a fully connected layer with the Sigmoid activation function:

y = 1 / (1 + e^(-(w^T x + b)))    (8)

where w and b are parameter matrices and x is the input of the function. Besides, all the sub-layers in the output block are 64-dimensional. Finally, the output block outputs a result between 0 and 1. If the number is greater than 0.5, the input URL is classified as legitimate; otherwise (≤ 0.5), the URL is identified as phishing.
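A minimal sketch of the output block follows. We read the text as using the features as Q and the duplicated feature weights as K and V; the final mean-pooling over positions before the sigmoid unit is our assumption, since the text does not spell out how the L*64 matrix is reduced to a single score.

```python
import torch
import torch.nn as nn

class OutputBlock(nn.Module):
    """Multi-head attention over (features, weights), residuals, sigmoid score."""
    def __init__(self, dim=64, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, 1)

    def forward(self, features, weights):             # both: (batch, L, 64)
        a, _ = self.attn(features, weights, weights)  # features = Q; weights = K, V
        h = features + a                          # residual over the attention layer
        h = h + torch.relu(self.fc1(h))           # ReLU FC with residual
        pooled = h.mean(dim=1)                    # assumed pooling over positions
        return torch.sigmoid(self.fc2(pooled))    # Equation (8): score in (0, 1)

score = OutputBlock()(torch.randn(2, 100, 64), torch.randn(2, 100, 64))
verdict = ["legitimate" if s > 0.5 else "phishing" for s in score.squeeze(1)]
```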
5 Experiment

In this section, we first introduce the experimental dataset and metrics. Then the experiment results of our model with different hyper-parameters and structures are shown. Finally, we compare the previous methods [4, 7, 13, 14, 16] with ours using different metrics.
5.1 Dataset
To fully evaluate our model and get accurate classification results, we prepared a large dataset containing 88,984 URL samples, among which 45,000 are legitimate and 43,984 are phishing. The legitimate part was collected in July 2018 from 5000 Best Websites [41], while the phishing part is excerpted from PhishTank [42]. All of the phishing URLs were validated from April 2018 to July 2018. Meanwhile, the term "batch size" means the number of URLs that our model processes at one time, while "epoch" refers to a training round over the training data. In our experiment, we set the batch size to 10 and the epoch number to 50. Furthermore, we apply the 5-fold cross-validation approach to get results: we divide the dataset into five roughly equal parts and run five rounds of training/testing. In each round, we use a different fifth for testing and the remaining four fifths for training. After the five rounds, all five parts have been tested once, and the final result is the average of the five results; a sketch of this protocol follows.
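The sketch below spells out this 5-fold protocol with scikit-learn's KFold. The train_model and evaluate functions and the toy data are hypothetical stand-ins for our actual training and evaluation routines.

```python
import numpy as np
from sklearn.model_selection import KFold

def train_model(urls, labels, batch_size=10, epochs=50):
    return None                     # hypothetical: would train CNN-MHSA here

def evaluate(model, urls, labels):
    return float(np.mean(labels))   # hypothetical: would return test accuracy

urls = np.array(["http://example.com/page%d" % i for i in range(100)])  # toy data
labels = np.random.randint(0, 2, size=100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(urls):
    model = train_model(urls[train_idx], labels[train_idx])          # 4/5 train
    scores.append(evaluate(model, urls[test_idx], labels[test_idx])) # 1/5 test

print("final result = average over the 5 folds:", np.mean(scores))
```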
5.2 Metrics

To fully depict our model's performance, we exploit 5 metrics, namely Accuracy (Acc), FPR, Recall (Rec), Precision (Pre) and F-1 score (F1), to evaluate our model. First, let True Positive (TP) denote the number of URLs which are phishing and correctly classified, True Negative (TN) denote the number of legitimate URLs recognized as legitimate ones, False Positive (FP) denote the number of legitimate URLs which are wrongly classified as phishing, and False Negative (FN) denote the number of phishing URLs identified as legitimate ones. Then the five metrics can be expressed as:

Acc = (TP + TN) / (TP + TN + FP + FN)    (9)

FPR = FP / (FP + TN)    (10)

Rec = TP / (TP + FN)    (11)

Pre = TP / (TP + FP)    (12)

F1 = 2 × Pre × Rec / (Pre + Rec)    (13)
Among the above metrics, Acc reflects the proportion of correct classifications. FPR states the percentage of legitimate URLs that are wrongly predicted as phishing among all the legitimate URLs. Rec denotes the percentage of phishing URLs we detect successfully among all the phishing URLs. Pre represents the percentage of truly phishing URLs among all the predicted phishing URLs. And F1 is the harmonic mean of the recall and the precision [24]. As phishing detection models always aim to find all the phishing websites, Rec is the most suitable metric among the five to express classification accuracy, and the higher the better. FPR is also an important metric indicating a model's performance, and the lower the better; if FPR is close to 0, the model can correctly detect almost all the legitimate URLs. Precision is similar to FPR in purpose, and the higher it is the better. At last, F1 reflects the overall performance of the model; when it is high, the model performs similarly well on both phishing and legitimate URLs.
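For completeness, the following computes Equations (9)-(13) from the four raw counts; the example counts are illustrative only.

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Accuracy, FPR, Recall, Precision and F1 from a confusion matrix."""
    acc = (tp + tn) / (tp + tn + fp + fn)   # Equation (9)
    fpr = fp / (fp + tn)                    # Equation (10)
    rec = tp / (tp + fn)                    # Equation (11)
    pre = tp / (tp + fp)                    # Equation (12)
    f1 = 2 * pre * rec / (pre + rec)        # Equation (13)
    return {"Acc": acc, "FPR": fpr, "Rec": rec, "Pre": pre, "F1": f1}

print(metrics(tp=4370, tn=4498, fp=12, fn=2))   # illustrative counts only
```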
5.3 Hyper-parameters

In our dataset, there are 80,246 URLs whose lengths are less than 100; 5,856 URLs' lengths are between 100 and 200; 3,444 URLs' lengths are between 200 and 300; and 269 URLs' lengths are between 300 and 400. The other 125 URLs' lengths are longer than 400 and shorter than 900. However, our model can only deal with fixed-length URLs as input, so we must set the fixed length L for all the URLs, as stated for the embedding layer. If a URL is longer than L, the extra part is discarded; otherwise, we pad 0 at the end of the URL string. Through experiments, we find L=100 to be the best choice. Meanwhile, N is the number of feature learner modules and M is the number of weight calculator modules, which decide our model's depth and efficiency. Table 1 states their notations and values.
Table 1. Parameter setting in our model

Parameters                    Notation  Value
URL length setting            L         100
Number of feature extractors  N         3
Number of weight calculators  M         3
Table 2 shows the influence of M and N when L is fixed as 80. When M is 5, we get the best result, i.e., 97.11% Accuracy, 1.51% FPR, 95.66% Recall, 98.37% Precision and 97.00% F-1 score. However, when M is 1, the result is the worst: Accuracy is 95.02%, FPR is 5.90%, Recall is 95.94%, Precision is 94.21% and F-1 score is 95.07%. As M increases, Accuracy, Recall, Precision and F-1 score become higher, and FPR gets lower. When M reaches 5, Accuracy is 2.09% higher than when M is 1, and the F-1 score is 1.93% higher. Similarly, we get the best result when N is 5, i.e., 97.19% Accuracy, 1.42% FPR, 95.73% Recall, 98.47% Precision and 97.08% F-1 score. When N is 1, the performance is the worst: Accuracy is 94.90%, FPR is 4.08%, Recall is 94.29%, Precision is 95.46% and F-1 score is 94.87%. As N increases, Accuracy, Recall, Precision and F-1 score increase while FPR decreases, the same as when M increases. The results may indicate that the deeper the network is, the better the model performs. For classification tasks, higher layers amplify aspects of the input that are important for discrimination [13]. When the depth of the network becomes larger, there are more weights in the model to fit a function for classification. Normally, the larger number of weights helps the model fit the function more accurately. Furthermore, when M and N are larger than 3, the variation trend becomes gentle and almost invariable, which points out that the performance of the model tends to be limited.

Table 2. The comparison results of our model with different N and M

N  M  Acc     FPR     Rec     Pre     F1
1  3  0.9490  0.0408  0.9429  0.9546  0.9487
2  3  0.9602  0.0449  0.9612  0.9593  0.9602
3  3  0.9709  0.0120  0.9529  0.9870  0.9696
4  3  0.9715  0.0141  0.9563  0.9848  0.9703
5  3  0.9719  0.0142  0.9573  0.9847  0.9708
3  1  0.9502  0.0590  0.9594  0.9421  0.9507
3  2  0.9622  0.0408  0.9653  0.9594  0.9624
3  3  0.9709  0.0120  0.9529  0.9870  0.9696
3  4  0.9709  0.0143  0.9553  0.9846  0.9697
3  5  0.9711  0.0151  0.9566  0.9837  0.9700
However, when M and N become larger, the resources needed for training grow larger, too. In other words, the larger resource consumption does not bring a much better result in return. To balance the results and the resources, we set M and N both as 3 for the experiments below. Figure 6 shows how Accuracy and F1 score change with different M and N when L is 80; the two metrics both increase when M and N become larger.

Since our method needs length-fixed input, we study the influence of the fixed length L on our model's accuracy. Figure 7 shows the distribution of the lengths of URLs whose lengths are less than 200. Among 89,940 URLs, there are 86,102 URLs whose lengths are less than 200, which means L has a great impact when it is less than 200. Therefore, we set L as [20, 40, 60, ..., 160, 180, 200] for the experiments. Table 3 gives some examples of phishing URLs with different lengths. Figure 8 shows the Accuracy, Recall, Precision and F1 score comparison with different L when M and N are both 3. The highest Accuracy, Recall and F1 score appear when L is 100: whether L increases or decreases from 100, Accuracy, Recall and F-1 score are lower. Moreover, Accuracy, Recall and F-1 score when L is 20 are lower than those when L is 200. When L decreases from 100, the variation trend of Accuracy, Recall and F-1 score varies. Besides, when L is greater than 140, the variation trend of Accuracy, Recall and F-1 score tends to be gentle.
Fig. 6. The Accuracy result comparison of our model with different M or N
Fig. 7. The length statistics of our URL samples

Table 3. Phishing URLs with different lengths

Phishing URL                                                                          Length
http://chefluisfrancisco.pt/cms lf/plugins/editors/rokpad/acesso/itaumobile/escolha.html    88
http://futureinsurance.ca/k2hzK//8143fec25a8654b92987fb6e3532d960/fcb/ar/?i=3128554&i=3128554    97
http://kitc.ph/wp/wp-content/plugins/con/sur/customer center/customer-IDPP00C772/myaccount/signin/?country.x=US&locale.x=en US    126
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdrw.sh%2Ffouicb&data=02%7C01%7C%7C9653530cfa864b274bf408d564ec380b%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C636525888064243429&sdata=eA%2F4ycLaifbVo4IDh3hHBZICWd8ozcQuSM%2Fb7gRZZTY%3D&reserved=0    264
Fig. 8. The influence on classification results of different URL lengths L (N=M=3)
By comparison, the variation of Accuracy and F-1 score is smaller than that of Recall, which means that Recall is more sensitive to the change of L than Accuracy and F-1 score. On the other hand, Precision changes with L without a clear rhythm, but it changes gently whenever L varies; Precision is the highest when L is 40 and the lowest when L is 160. Table 4 shows the whole results with various URL lengths L when N and M are both set as 3. It points out that our model performs best when L is 100 and worst when L is 20. Accuracy is 99.84% when L is 100, which is 5.22% higher than when L is 20. Furthermore, Recall, Precision and F-1 score are 99.95%, 99.73% and 99.84%, respectively, when L is 100; they are 10.29%, 0.49% and 5.64% higher, respectively, than when L is 20. FPR stays low whenever L changes and reaches its lowest value, 0.0018, when L is 40. The results show that L can be neither too short nor too long. Shorter URLs might not retain enough features; for example, URLs that only use an IP address are short, but there is limited information in them, so our model is unable to learn features from them. Conversely, long URLs carry much useless information intended to puzzle users; if the model learns a lot of useless features, it cannot identify phishing URLs.

5.4 Results

In this part, we give the results of our method and compare different structures (CNN, LSTM, CNN-CNN, CNN-LSTM) with our method, where we set L as 100, and N and M as 1. As for CNN and LSTM, there is only one layer to get the result. CNN-CNN and CNN-LSTM are like our model but replace the multi-head self-attention layer with CNN and LSTM, respectively. Table 5 shows the results of the different networks.
Table 4. The comparison of our model with different L

L    Acc     FPR     Rec     Pre     F1
20   0.9462  0.0066  0.8966  0.9924  0.9420
40   0.9603  0.0018  0.9205  0.9979  0.9576
60   0.9639  0.0044  0.9306  0.9951  0.9617
80   0.9709  0.0120  0.9529  0.9870  0.9696
100  0.9984  0.0026  0.9995  0.9973  0.9984
120  0.9876  0.0039  0.9786  0.9959  0.9872
140  0.9784  0.0135  0.9699  0.9856  0.9777
160  0.9749  0.0216  0.9712  0.9772  0.9742
180  0.9738  0.0212  0.9686  0.9775  0.9730
200  0.9713  0.0036  0.9449  0.9960  0.9698
The Accuracy of CNN and LSTM is 97.13% and 97.56%, respectively, which is lower than the Accuracy of the others. It points out that the combination of two networks helps increase the performance of the model. Moreover, the performance of LSTM is better than that of CNN, and CNN-LSTM performs better than CNN-CNN. It can be concluded that LSTM performs better than CNN for sentence classification; the reason may be that LSTM can effectively learn the sequential dependency from character sequences [14]. The Accuracy and F1 of our method are 98.34% and 98.30%, respectively, which are the highest. This shows that multi-head self-attention performs better than CNN and LSTM in phishing webpage detection. However, the FPR of our method, 1.76%, is higher than that of CNN-LSTM, 0.82%, which indicates that our method classifies more legitimate webpages as phishing than CNN-LSTM does. For the training time, our method uses 32 min for one epoch on average. CNN-LSTM uses 119 min, which is nearly quadruple that of our method. CNN-CNN uses 35 min, LSTM uses 73 min and CNN uses 28 min.

Table 5. Results of different networks

Method      Acc     FPR     Rec     Pre     F1
CNN         0.9713  0.0391  0.9822  0.9599  0.9709
LSTM        0.9756  0.0309  0.9824  0.9680  0.9752
CNN-CNN     0.9761  0.0313  0.9839  0.9676  0.9757
CNN-LSTM    0.9818  0.0082  0.9713  0.9912  0.9811
Our method  0.9834  0.0176  0.9844  0.9816  0.9830

5.5 Comparison with Other Methods
We compare five previous methods [4, 7, 13, 14, 16] with our method. Zhang, Y. et al. [4] designed a model called CANTINA to detect phishing websites.
CANTINA uses TF-IDF to extract keywords from the URLs and HTMLs. The top 5 keywords for each website are selected to perform Google searches. If the top results have the same domain as the website, the website is treated as legitimate. Besides, Marchal, S. et al. [16] extracted 212 features from URLs and HTMLs, such as the count of dots, the count of level domains, the length of the URL, and the count of terms in the URL; they selected Gradient Boosting to build the classification model. Al-Janabi, M. et al. [7] also used URLs and HTMLs to get features but employed the Random Forest algorithm for classification. Furthermore, Bahnsen, A. C. et al. [13] treated URLs as sequences and transformed a URL into a matrix with a dimension of 150; the matrix was then fed into a Long Short Term Memory (LSTM) layer to do the detection. Yang, P. et al. [14] used one-hot encoding and embedding to transform the URL into a matrix, which was put into a network composed of CNN and LSTM; finally, the softmax function produced the classification result.
In Figure 9, we give the final results of the five methods using our large, new dataset and 5-fold cross-validation. Our method (M and N set as 3, L set as 100) achieves the lowest FPR of 0.26%, the highest Accuracy of 99.84% and the highest F-1 score of 99.84%. By contrast, Zhang, Y. [4] gets Accuracy of 89.98%, FPR of 13.75%, Recall of 93.89%, Precision of 86.67% and an F-1 score of 90.14%. Al-Janabi, M. [7] gets Accuracy, FPR, Recall, Precision and F-1 score of 90.10%, 13.63%, 94.02%, 86.78% and 90.25%, respectively. Marchal, S. [16] gets Accuracy of 92.17%, FPR of 9.95%, Recall of 94.40%, Precision of 90.03% and an F-1 score of 92.16%. The FPR of our method is about 13% lower than that of the former two methods and about 9% lower than that of the third, i.e., our method classifies fewer legitimate websites as phishing. The Recall of our method is 99.95%, higher than those of all the others. It can be concluded that, compared with the conventional methods [4][7][16], our method can distinguish phishing websites from legitimate websites more accurately. Besides, it does not need expert knowledge to extract features.

Fig. 9. Results of different methods
Likewise, Bahnsen, A. C. [13] and Yang, P. [14] do not extract features manually either, achieving Accuracy of 97.53% and 98.18%, FPR of 1.97% and 0.82%, Recall of 97.01% and 97.13%, Precision of 97.91% and 99.12%, and F-1 scores of 97.46% and 98.11%, respectively. They also perform better than Zhang, Y. [4], Al-Janabi, M. [7] and Marchal, S. [16], but worse than ours. This may indicate that deep learning methods are more suitable than conventional machine learning algorithms for phishing detection. Conventional machine learning algorithms require humans to process URLs and extract features from them, and these human-extracted features may be incomplete since the knowledge of one person is limited. Instead, deep learning methods can be fed with raw data and automatically discover the representations (features) needed for classification. They have multiple layers, each transforming a low-level representation into a slightly more abstract higher-level one, which amplifies aspects of the input that are important for discrimination and suppresses irrelevant variations [12]. Therefore, in general, deep learning methods can achieve better results than conventional machine learning methods. Furthermore, compared with LSTM, the multi-head self-attention in our model focuses on the inner structure of the URL to find the inner dependency relationships between different characters, and it provides a specific weight for each feature. Different features do not have the same importance for classification, and the provided weights describe this difference. So our method performs better than those of Bahnsen, A. C. [13] and Yang, P. [14].
6 Conclusion and Future Work

Phishing website attacks are a huge threat to Internet users, and they have continued to show an upward tendency in recent years. So far, many outstanding studies, such as CNN-LSTM, have made efforts to mitigate this threat, but the detection accuracy still has room for improvement. Inspired by CNN-LSTM, and considering that multi-head self-attention (MHSA) calculates a weight for each feature to express the features' different importance, while Long Short Term Memory (LSTM) always treats the later features in a query as much more important than the former ones, which may introduce bias, in this paper we propose a novel and highly precise approach named CNN-MHSA for phishing website detection, which combines a Convolutional Neural Network (CNN) and MHSA to improve detection accuracy. CNN-MHSA first applies CNN to automatically study URLs' features without manual intervention. Also, considering that the characters' inner dependency relationships in URLs are an important feature that no studies so far have exploited, CNN-MHSA leverages MHSA to explore these relationships so as to generate weights for the CNN-studied features. Finally, the conducted experiments confirm that CNN-MHSA achieves 99.84% in terms of Accuracy, which is higher than all of the previous methods [4][7][13][14][16]. In the future, we plan to take both the URL and its HTML content into account so as to improve the detection accuracy further. Meanwhile, since the URL length parameter L may influence our model's robustness, we will try to decrease the impact of L.

Acknowledgement

This work is supported in part by the National Natural Science Foundation of Guangdong Province (2018A030313422), the National Key Research and Development Program of China (2018YFB1800204), the National Natural Science Foundation of China (61972219, 61771273), the R&D Program of Shenzhen (JCYJ20180508152204044, JCYJ20170817115335418), and the Research Fund of PCL Future Regional Network Facilities for Large-scale Experiments and Applications (PCL2018KP001).
References
1. Merwe, A. V. D., Loock, M., Dabrowski, M.: Characteristics and Responsibilities involved in a Phishing Attack. In: Winter International Symposium on Information and Communication Technologies, pp. 249-254. Trinity College Dublin (2005)
2. Report from APWG, http://docs.apwg.org/reports/apwg_trends_report_q4_2018.pdf
3. Liang, B., Su, M., You, W., Shi, W., Yang, G.: Cracking Classifiers for Evasion: A Case Study on the Google's Phishing Pages Filter. In: International Conference on World Wide Web (WWW '16), pp. 345-356. Montréal, Québec, Canada (2016)
4. Zhang, Y., Hong, J. I., Cranor, L. F.: Cantina: a content-based approach to detecting phishing web sites. In: Proceedings of the 16th International Conference on World Wide Web (WWW '07), pp. 639-648. Banff, Alberta, Canada (2007)
5. Cui, Q., Jourdan, G. V., Bochmann, G. V., Couturier, R., Onut, I. V.: Tracking Phishing Attacks Over Time. In: The 26th International Conference on World Wide Web (WWW '17), pp. 667-676. Perth, Australia (2017)
6. Rao, R. S., Pais, A. R.: Detection of phishing websites using an efficient feature-based machine learning framework. Neural Computing and Applications, 1-23 (2018)
7. Al-Janabi, M., Quincey, E. D., Andras, P.: Using supervised machine learning algorithms to detect suspicious URLs in online social networks. In: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM '17), pp. 1104-1111. Sydney, Australia (2017)
8. Abutair, H., Belghith, A., AlAhmadi, S. J.: Using Case-Based Reasoning for Phishing Detection. Journal of Ambient Intelligence and Humanized Computing, 1-14 (2018)
9. Jain, A. K., Gupta, B. B.: Towards detection of phishing websites on client-side using machine learning based approach. Telecommunication Systems 68, 687-700 (2017)
10. Rajab, M.: An anti-phishing method based on feature analysis. In: The 2nd International Conference on Machine Learning and Soft Computing (ICMLSC '18), pp. 133-139. Phu Quoc Island, Viet Nam (2018)
11. Blum, A., Wardman, B., Solorio, T., Warner, G.: Lexical feature based phishing URL detection using online learning. In: Proceedings of the 3rd ACM Workshop on Artificial Intelligence and Security (AISec '10), pp. 54-60. Chicago, Illinois, USA (2010)
12. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436-444 (2015)
13. Bahnsen, A. C., Bohorquez, C. E., Villegas, S., Vargas, J., González, F. A.: Classifying phishing URLs using recurrent neural networks. In: 2017 APWG Symposium on Electronic Crime Research (eCrime), pp. 1-8. Scottsdale, AZ, USA (2017)
14. Yang, P., Zhao, G., Zeng, P.: Phishing Website Detection Based on Multidimensional Features Driven by Deep Learning. IEEE Access 7, 15196-15209 (2019)
15. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., Polosukhin, I.: Attention Is All You Need. In: 31st Conference on Neural Information Processing Systems. Long Beach, CA, USA (2017)
16. Marchal, S., François, J., State, R., Engel, T.: PhishStorm: detecting phishing with streaming analytics. IEEE Transactions on Network and Service Management 11(4), 458-471 (2014)
17. Aravindhan, R., Shanmugalakshmi, R., Ramya, K., C, S.: Certain Investigation on Web Application Security: Phishing Detection and Phishing Target Discovery. In: 2016 3rd International Conference on Advanced Computing and Communication Systems, pp. 1-10. Coimbatore, India (2016)
18. Cova, M., Kruegel, C., Vigna, G.: There is no free phish: an analysis of "free" and live phishing kits. In: The 2nd USENIX Workshop on Offensive Technologies (WOOT '08). San Jose, CA, USA (2008)
19. Zou, F., Gang, Y., Pei, B., Pan, L., Li, L.: Web Phishing detection based on graph mining. In: 2016 2nd IEEE International Conference on Computer and Communications (ICCC), pp. 1061-1066. Chengdu, China (2016)
20. Ramesh, G., Gupta, J., Gamya, P. G.: Identification of phishing webpages and its target domains by analyzing the feign relationship. Journal of Information Security and Applications 35, 75-84 (2017)
21. Liu, W., Liu, G., Qiu, B., Quan, X.: Antiphishing through Phishing Target Discovery. IEEE Internet Computing 16(2), 52-61 (2012)
22. Zuhair, H., Selamat, A.: Phishing classification models: Issues and perspectives. In: 2017 IEEE Conference on Open Systems (ICOS), pp. 26-31. Miri, Sarawak, Malaysia (2017)
23. Corona, I., Biggio, B., Contini, M., Piras, L., Corda, R., Mereu, M., Mureddu, G., Ariu, D., Roli, F.: DeltaPhish: Detecting Phishing Webpages in Compromised Websites. In: Foley, S., Gollmann, D., Snekkenes, E. (eds) Computer Security, ESORICS 2017. Lecture Notes in Computer Science, vol 10492, pp. 370-388. Springer, Cham (2017)
24. Xiao, X., Wang, Z., Li, Q., Xia, S., Jiang, Y.: Back-propagation neural network on Markov chains from system call sequences: a new approach for detecting Android malware with system call sequences. IET Information Security 11(1), 8-15 (2017)
25. Mohammad, R. M., Thabtah, F., Mccluskey, L.: Predicting phishing websites based on self-structuring neural network. Neural Computing & Applications 25(2), 443-458 (2017)
26. Nguyen, L. A. T., Ba, L. T., Nguyen, H. K., Nguyen, M. H.: An efficient approach for phishing detection using single-layer neural network. In: 2014 International Conference on Advanced Technologies for Communications (ATC 2014), pp. 435-440. Hanoi, Vietnam (2014)
27. Zhang, J., Li, X.: Phishing Detection Method Based on Borderline-Smote Deep Belief Network. In: Wang, G., Atiquzzaman, M., Yan, Z., Choo, K. K. (eds) Security, Privacy, and Anonymity in Computation, Communication, and Storage. SpaCCS 2017. Lecture Notes in Computer Science, vol 10658, pp. 45-53. Springer, Cham (2017)
28. Verma, R., Das, A.: What's in a URL: Fast Feature Extraction and Malicious URL Detection. In: Proceedings of the 3rd ACM International Workshop on Security and Privacy Analytics, pp. 55-63. ACM, Scottsdale, Arizona, USA (2017)
29. Ketkar, N.: Convolutional Neural Networks. Deep Learning with Python, 63-78 (2017)
30. Schmidhuber, J.: Multi-column deep neural networks for image classification. In: Computer Vision and Pattern Recognition, pp. 3642-3649. Providence, Rhode Island (2012)
31. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. In: Proceedings of the IEEE, pp. 2278-2324 (1998)
32. Kim, Y.: Convolutional neural networks for sentence classification. In: The 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746-1751. Doha, Qatar (2014)
33. Mnih, V., Heess, N., Graves, A., Kavukcuoglu, K.: Recurrent models of visual attention. In: The 27th International Conference on Neural Information Processing Systems (NIPS '14), pp. 2204-2212. Montreal, Canada (2014)
34. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: International Conference on Learning Representations (ICLR). San Diego, California (2015)
35. Luong, M. T., Pham, H., Manning, C. D.: Effective Approaches to Attention-based Neural Machine Translation. Computer Science, 1412-1421 (2015)
36. Lin, Z., Feng, M., Santos, C. N. D., Yu, M., Xiang, B., Zhou, B., Bengio, Y.: A structured self-attentive sentence embedding. In: 5th International Conference on Learning Representations (ICLR). Palais des Congrès Neptune, Toulon (2017)
37. Parikh, A. P., Täckström, O., Das, D., Uszkoreit, J.: A decomposable attention model for natural language inference. In: The 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2249-2255. Austin, Texas (2016)
38. Berners-Lee, T., Masinter, L., McCahill, M.: Uniform resource locators (URL). RFC Editor, 106-107 (1994)
39. Harris, D. M., Harris, S. L.: Digital Design and Computer Architecture. 2nd edn. Elsevier (2013)
40. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778. Las Vegas, NV, USA (2016)
41. 5000 Best Websites Homepage, http://5000best.com/websites/
42. PhishTank Homepage, https://www.phishtank.com
Highlights

- We design a novel deep learning network to detect phishing sites. The network is formed by several layers, including the convolutional layer and the multi-head self-attention layer. The model achieves 99.84% accuracy.
- We use the knowledge in NLP to detect phishing since URLs can be treated as sentences.
- We use the convolutional layer to learn features and the multi-head self-attention mechanism to calculate feature weights for the first time in phishing detection.
- Our experimental dataset is new and from real networks. It contains 88,984 URLs, among which 45,000 are legitimate and 43,984 are phishing. We collected the dataset in July 2018. In the dataset, the phishing ones have been validated from April 2018 to July 2018.
AUTHOR DECLARATION

We wish to draw the attention of the Editor to the following facts which may be considered as potential conflicts of interest and to significant financial contributions to this work. [OR] We wish to confirm that there are no known conflicts of interest associated with this publication and there has been no significant financial support for this work that could have influenced its outcome.

We confirm that the manuscript has been read and approved by all named authors and that there are no other persons who satisfied the criteria for authorship but are not listed. We further confirm that the order of authors listed in the manuscript has been approved by all of us.

We confirm that we have given due consideration to the protection of intellectual property associated with this work and that there are no impediments to publication, including the timing of publication, with respect to intellectual property. In so doing we confirm that we have followed the regulations of our institutions concerning intellectual property.

We understand that the Corresponding Author is the sole contact for the Editorial process (including Editorial Manager and direct communications with the office). He/she is responsible for communicating with the other authors about progress, submissions of revisions and final approval of proofs. We confirm that we have provided a current, correct email address which is accessible by the Corresponding Author and which has been configured to accept email from [email protected].

Signed by all authors as follows:
Xiao Xi:
Dianyan Zhang:
Guangwu Hu:
Yong Jiang:
Shutao Xia: