G Model
ARTICLE IN PRESS
ASOC 3058 1–11
Applied Soft Computing xxx (2015) xxx–xxx
Contents lists available at ScienceDirect
Applied Soft Computing journal homepage: www.elsevier.com/locate/asoc
Detection of phishing attacks in Iranian e-banking using a fuzzy–rough hybrid system
Gholam Ali Montazer a,∗ , Sara ArabYarmohammadi b
a School of Engineering, Tarbiat Modares University, P.O. Box 14115-179, Tehran, Iran
b School of E-learning, Shiraz University, P.O. Box 71345-3139, Shiraz, Iran
Article info
Article history: Received 12 April 2014; Received in revised form 10 February 2015; Accepted 6 May 2015; Available online xxx
Keywords: E-banking; Phishing; Fraud detection; Fuzzy expert system; Rough sets theory

Abstract
Phishing is a method of stealing electronic identity in which social engineering and website forgery are used to mislead users into revealing confidential information of economic value. By destroying the trust between users in business networks, phishing harms the budding area of e-commerce. Developing countries such as Iran have recently been facing Internet threats like phishing, whose methods, given the social differences, may differ from those experienced elsewhere. It is therefore necessary to design a suitable detection method for these deceits. The aim of this paper is to provide a phishing detection system for use in the Iranian e-banking system. Identifying the salient features of phishing is an important prerequisite for designing a precise system; therefore, as a first step, a list of 28 phishing indicators was prepared in order to identify the influential features of phishing that best fit Iranian bank sites. Using a feature selection algorithm based on rough sets theory, six main indicators were identified as the most effective factors. A fuzzy expert system was then designed using these indicators. The results show that the proposed system is able to detect Iranian phishing sites with reasonable speed and precision, with an efficiency of 88%. © 2015 Elsevier B.V. All rights reserved.

∗ Corresponding author. Tel.: +98 21 82883990; fax: +98 21 82883990.
E-mail addresses: [email protected] (G.A. Montazer), [email protected] (S. ArabYarmohammadi).

1. Introduction

The appearance of e-banking has revolutionized the relations between clients and banks and the way they deal with each other. This technology has provided opportunities for improving the reliability of banking services, economic thrift, and efficiency. On the other hand, the launch and realization of e-commerce relies mostly on the realization of e-banking. E-banking provides banking services by means of public and accessible computer networks (Internet/Intranet) with high security. E-banking comprises systems that enable the customers of financial institutions to access their accounts without physical attendance, using network tools alone, and to obtain information about financial products and services [1].

E-banking relies on a network and Internet-based environment. As a public network, the Internet carries highly confidential and security-sensitive information. Its nature makes threats and various deceits prominent and fosters the dark and ambiguous side of the network. Access to the Internet, anonymity, high speed of propagation, lack of face-to-face contact, free access to services and valuable content, and the lack of suitable laws and international agreements are among the factors that allow threats to spread and make their prosecution hard [2]. That is why online and electronic banking can pose many dangers to economic institutions, which may be controlled and managed by screening and selecting a comprehensive risk management program. In such a situation, e-banking security is one of the most important subjects in e-commerce. As banks offer more facilities and services on the Internet and the online interactions of clients grow, financial crime in the banking industry is also growing fast.

One of the most dangerous Internet attacks that often targets e-banking is "phishing" [3]. Phishing is a method of social engineering that deceives Internet users by guiding them to websites that closely resemble the target one. This is usually most evident for bank sites, credit institutions, Internet auctions, popular social networks, Internet service providers, etc. The main idea of the attack is that bait is cast to people so that they take it and are caught. In many cases, the bait is an email or spam message that lures the user into entering the site. This type of deceit makes the user reveal vital information such as name, password, credit card details, bank account details, etc. The stolen information is then used for fraud and similar purposes [4–6].
http://dx.doi.org/10.1016/j.asoc.2015.05.059 1568-4946/© 2015 Elsevier B.V. All rights reserved.
Please cite this article in press as: G.A. Montazer, S. ArabYarmohammadi, Detection of phishing attacks in Iranian e-banking using a fuzzy–rough hybrid system, Appl. Soft Comput. J. (2015), http://dx.doi.org/10.1016/j.asoc.2015.05.059
Phishing attacks and the related financial losses are growing increasingly fast. According to a report by the international Anti-Phishing Working Group (APWG), the number of phishing websites is increasing ([7,8]; Yu et al., 2009). Results indicate that out of 5 million phishing emails sent, 2500 recipients are deceived. Although this fraction is only 0.05% of the recipients, the profit obtained from this group keeps phishing a good source of income for Internet swindlers [8]. As shown in Table 1, in June 2012 most phishing attacks targeted information service providers (e.g. libraries and social networks), banks, and firms active in e-commerce, in that order.

Table 1
Organizations spoofed in phishing attacks, by industry in 2012 [3].

Row  Industry              Percent
1    Information Services  36.29
2    Banking               32.99
3    E-commerce            27.99
4    Telecommunications    1.86
5    Retail                0.44
6    Government            0.37
7    Insurance             0.02
8    ISPs                  0.027

Phishing harms users, organizations, and brands in many ways. The consequences of this attack include the following [9]:

a. Directly, phishing reveals users' confidential information (such as usernames, passwords, or other sensitive details of their credit cards) and subjects the users to financial losses.
b. It destroys users' trust in Internet interactions and builds a negative image in their minds.
c. By destroying users' trust, phishing causes a gradual avoidance of Internet purchases and of the Internet in commercial activity, and prevents e-commerce from further improvement and success.
d. Phishing has a negative effect on stakeholders, which leads to an inability to maintain brands and can finally end in bankruptcy.

Trust is one of the most important characteristics of e-banking [10]. As indicated, phishing can harm Internet business. For fear of becoming victims of fraud, people gradually lose their trust in Internet interactions [11]. As a case in point, many people believe that the use of e-banking increases the likelihood of identity theft and phishing, whereas e-banking protects people's identity better than customary paper banking [10].

In Iran, phishing is very important, since statistics show that in 2011 computer crimes were 8.3 times as many as in the previous year, most of them related to banking. According to this report, phishing and one of its methods, namely the "pharming attack", rank third among Internet crimes in the country. Moreover, 1035 Internet crimes were recorded in Iran in 2010, which grew to 4000 cases in 2011 and was anticipated to reach 8–10 thousand cases in 2012 (Rahpardakht, 2012).

The detection method provided in this paper is based on fuzzy logic combined with a rough sets-based data mining algorithm. The paper is structured as follows: Section 2 introduces related works on phishing detection and the shortcomings of existing methods. Section 3 describes the fundamentals of the applied methods. Section 4 elaborates the determination of the input variables and their reduction by rough set theory, followed by the output variables, the membership functions, the fuzzifier, the defuzzifier, and the fuzzy inference engine. Section 5 presents and discusses the operation of the designed system. Finally, the limitations and innovations of this study are discussed and conclusions are provided in Section 6.

2. Related works

As mentioned in the previous section, combating phishing is one of the most serious issues in the field of e-banking network security. Various methods have been used to detect and defend against phishing, which can be summarized in three general approaches: "filtering via web browser toolbar", "email phishing detection" and "website similarity detection".

2.1. Filtering via web browser toolbar
In the first approach, most of the methods are incomplete and work inefficiently because they rely on a "black list". The black list is a list of websites that have already been proven to be forged and have been recorded in the browser. If the URL of a target website matches the URL of one of the known phishing websites in the blacklist, it is labeled as a phishing website. This method therefore fails to detect new phishing websites [12,13].

2.2. Email phishing detection

This approach addresses situations in which Internet users receive deceptive emails containing the URLs of phishing webpages. The users are tempted to open the webpages and follow the instructions. As an example of this approach, distributed architectures like CBART and CART have been proposed for use in mobile environments [7]. Other examples are C5.0 classifiers built by calculating the information gain on 40 features (Toolan and Carthy, 2010) and a ruleset formed by a genetic algorithm to flag deceptive hyperlinks [14]. The methods belonging to this approach concentrate on detecting email-based phishing and are unable to detect other types of phishing.

2.3. Website similarity detection

The methods based on visual similarity (e.g., [15–17,34]; Liu et al., 2006) treat phishing website detection as an image-matching problem: a website is divided into a number of images, and the visual traits of those image blocks are analyzed and compared with those of genuine websites registered with an anti-phishing system. This method is driven by the observation that a website is made up of several blocks. The third approach thus consists of methods that consider only the visual similarity of a website and ignore other obvious features of phishing attacks, which undermines the decision on the website's genuineness.

Furthermore, another constraint of a visual similarity based approach lies in its reliance on comparing a target website against a real website, which may not always exist or be recognized in advance. More notably, many phishing websites are designed to look like real websites and, accordingly, visual similarity based approaches have difficulty differentiating phishing websites from their authentic counterparts [12].

As shown above, current approaches to the automatic detection of phishing websites have various weaknesses. Most of them focus on general phishing websites rather than on phishing websites in a specific domain such as e-banking. Although Iran has witnessed rapid growth of phishing e-banking websites and has suffered huge financial losses because of them in recent years, there has been no research on the automated detection of Iranian phishing e-banking websites. In addition, Iran differs from other countries in its regulation of e-banking websites. Previous
studies, however, have not taken such domain features into consideration. By grounding the solution in a well-defined, limited scope such as e-banking, this paper offers a customized solution through the most applicable method. This keeps the method from enlarging the problem unreasonably and strengthens it by concentrating on a specific scope, even when extracting the initial indicators.

3. Applied methods

3.1. Fuzzy sets theory

Fuzzy sets theory was introduced by Zadeh in 1965 [18]. Each fuzzy set can be interpreted as a specific membership function: each member of a fuzzy set is assigned a membership degree between zero and one. Let X be the universe of discourse whose members are denoted by x; then a fuzzy set A in X is defined by:

\[ A = \{ (x, \mu_A(x)) \mid x \in X,\ \mu_A(x) \in [0, 1] \} \tag{1} \]

Fuzzy expert systems are "knowledge-based" systems that work by means of rules extracted from human expertise. The heart of each fuzzy system is its rule base, in which the rules are stored in the form of "if-then" predicates [19]. As shown in Fig. 1, a fuzzy system consists of four modules.

Fig. 1. The main structure of fuzzy expert system.

Fuzzifier: In the fuzzification process, the relationships between inputs and linguistic variables are introduced via membership functions. The fuzzifier's task is to read the crisp value of an input variable and map it to one of the fuzzy linguistic variables that appear in the knowledge base rules of the fuzzy expert system [19].

Defuzzifier: The defuzzifier performs the opposite process of the fuzzifier. Since the outputs of the inference engine are fuzzy sets in the form of linguistic variables, they must be transformed into a crisp number to become usable. In other words, the defuzzifier's duty is to determine the point that best represents the fuzzy set [19].

Fuzzy rule base: As a knowledge base, the fuzzy rule base is one of the most important parts of a fuzzy expert system. It combines the experts' knowledge in the corresponding area and is formed of rules, which in turn are made of linguistic variables. These rules are conditional statements and can generally be represented as "IF X is x_i and Y is y_i, THEN O is o_i", where X and Y are linguistic input variables and x_i and y_i are possible linguistic values for X and Y, respectively. They are modeled as fuzzy sets based on reference sets containing X and Y. In the same way, the output O is a linguistic variable with a possible value o_i, which is a fuzzy set [20].

Fuzzy inference engine: The fuzzy inference engine is the decision center of the system; it finds the logical result by analyzing the rules and knowledge aggregated in the database. There are various choices for the fuzzy inference engine, depending on the fuzzy aggregation, the implication, and the operators used for the s-norm and t-norm. In other words, the fuzzy inference engine maps fuzzy sets of the input space into fuzzy sets of the output space [36].
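As a minimal sketch of how a single "IF X is x_i and Y is y_i" antecedent can be evaluated, the following Python fragment computes membership degrees and combines them. It assumes triangular membership functions and the minimum as t-norm; the linguistic values `X_IS_HIGH` and `Y_IS_LOW` and their numbers are hypothetical and are not taken from the paper.

```python
def triangular(x, a, b, c):
    """Membership degree of x in a triangular fuzzy set with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical linguistic values, chosen only for illustration.
X_IS_HIGH = (5.0, 10.0, 15.0)
Y_IS_LOW = (0.0, 2.0, 4.0)


def firing_strength(x, y):
    """Degree to which 'X is high AND Y is low' holds, with min as the t-norm (an assumption)."""
    return min(triangular(x, *X_IS_HIGH), triangular(y, *Y_IS_LOW))


if __name__ == "__main__":
    print(firing_strength(7.5, 2.0))
```

The firing strength obtained this way is the degree to which the rule's consequent o_i is activated; other t-norms, such as the product, could be substituted for `min`.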
3.2. Rough sets theory
In 1982, rough sets theory was proposed by Pawlak as a generalization of set theory for assessing expert systems with insufficient and imprecise information. Like fuzzy sets theory, it is a mathematical approach to problems involving uncertainty and ambiguity. Fuzzy sets and rough sets theories are not rivals but complement each other (Dubois and Prade, 1992; [21]). A rough set is an approximation of a vague concept by a pair of precise concepts, namely the "upper approximation" and the "lower approximation". Each subset of the universe of discourse lies between its lower and upper approximations, i.e. each element of the lower approximation is necessarily a member of the set, whereas elements of the upper approximation may not be members of the set. Rough sets theory is used for removing redundant features from data sets with discrete values [22,23].

Rough set attribute reduction provides a tool for extracting knowledge from information. Using rough sets theory, a subset of the original features can be obtained that carries more information and is free of redundancy; this subset is called a "reduct". The other features can be omitted from the system, losing only a small amount of information [23]. In mathematical terms, let C and D be the sets of conditional attributes and decision attributes, respectively; the reduct is then defined as a subset R of C that meets the following two conditions (Jensen and Shen, 2000):

Condition (1): $\gamma_C(D) = \gamma_R(D)$
Condition (2): Omitting any feature from R violates Condition (1).

Here $\gamma_P(Q)$ denotes the degree of dependency of the attribute set Q on the attribute set P.

In the reduction concept, the reduct with the fewest members is important. This set is called the "minimum reduct" or "core" and is formally defined as:

\[ R_{\min} = \{ X \mid X \in R,\ \forall Y \in R,\ |X| \le |Y| \} \tag{2} \]

where, in Eq. (2), R denotes the set of all reducts.
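The two reduct conditions can be checked mechanically once the dependency degree is computable. The sketch below assumes the usual rough-set definition of the dependency degree (the share of objects whose equivalence class under the chosen attributes has a single decision value); the decision table `ROWS` and its attribute names are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def dependency_degree(rows, condition_attrs, decision_attr):
    """Fraction of objects whose equivalence class under condition_attrs is decision-pure."""
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in condition_attrs)
        classes[key].append(row[decision_attr])
    positive = sum(len(ds) for ds in classes.values() if len(set(ds)) == 1)
    return positive / len(rows)


def is_reduct(rows, subset, all_attrs, decision_attr):
    """Check Condition (1) and Condition (2) for a candidate subset of attributes."""
    full = dependency_degree(rows, all_attrs, decision_attr)
    if dependency_degree(rows, subset, decision_attr) != full:
        return False  # Condition (1) fails
    # Condition (2): dropping any single attribute must break Condition (1).
    return all(
        dependency_degree(rows, [a for a in subset if a != drop], decision_attr) != full
        for drop in subset
    )


# Hypothetical toy decision table (not data from the paper).
ROWS = [
    {"long_url": 1, "bad_ca": 1, "phish": 1},
    {"long_url": 1, "bad_ca": 0, "phish": 0},
    {"long_url": 0, "bad_ca": 1, "phish": 1},
    {"long_url": 0, "bad_ca": 0, "phish": 0},
]

if __name__ == "__main__":
    print(is_reduct(ROWS, ["bad_ca"], ["long_url", "bad_ca"], "phish"))
```

In this toy table the attribute `bad_ca` alone determines the decision, so it forms a reduct, while the full attribute set does not because `long_url` is redundant.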
The quick reduct algorithm (QRA) calculates a minimum reduct without generating all possible subsets (Fig. 2). The algorithm begins with an empty set and then, step by step, adds to the feature set the feature whose addition results in the maximum increase in $\gamma_P(Q)$; the process continues until the maximum value (usually 1) is obtained for the data set (Swiniarski, 2003).

Fig. 2. The quick reduct algorithm (QRA).

Table 2
The common phishing indicators.

i   Phishing indicator (V_i)
1   Certificate authority
2   Using switching ports
3   Abnormal server form handler
4   Buying time to access accounts
5   Replacing similar char for URL
6   OnMouseOver to hide the Link
7   Distinguished names certificate
8   Disabling right-click
9   Using hexadecimal char codes
10  Abnormal URL of anchor
11  XSS (code injection) attack
12  Adding a prefix or suffix (sub-domain)
13  Pharming attack
14  Abnormal cookie
15  Copying website
16  Abnormal request URL
17  Using pop-ups windows
18  Public generic emails
19  Redirect pages
20  Abnormal URL
21  Using IP address in URL
22  Using forms with Submit button
23  Using of @ symbol to confuse
24  Excessive emphasis on security
25  Long URL address
26  Grammatical or spelling errors
27  Abnormal DNS record
28  Using SSL certificate

Table 3
Step-by-step results of quick-reduct algorithm.

Step  Reduct                          γ_P(Q)
1     {V1}                            0.772
2     {V25, V1}                       0.835
3     {V1, V1, V25}                   0.880
4     {V1, V3, V10, V25}              0.945
5     {V1, V3, V10, V12, V25}         0.985
6     {V1, V3, V7, V10, V12, V25}     1.0
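The greedy procedure of the quick reduct algorithm can be sketched as follows. This is an illustrative Python version written from the textual description above, not a transcription of Fig. 2; the helper `dependency_degree`, the toy table `ROWS`, and the attribute names `v1` to `v3` are assumptions made for the example.

```python
from collections import defaultdict


def dependency_degree(rows, condition_attrs, decision_attr):
    """Fraction of objects whose equivalence class under condition_attrs is decision-pure."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in condition_attrs)].append(row[decision_attr])
    return sum(len(ds) for ds in classes.values() if len(set(ds)) == 1) / len(rows)


def quick_reduct(rows, condition_attrs, decision_attr):
    """Greedily add the attribute that raises the dependency degree the most."""
    target = dependency_degree(rows, condition_attrs, decision_attr)
    reduct = []
    best = dependency_degree(rows, reduct, decision_attr)
    while best < target:
        candidates = [a for a in condition_attrs if a not in reduct]
        gains = {a: dependency_degree(rows, reduct + [a], decision_attr) for a in candidates}
        chosen = max(candidates, key=gains.get)  # ties go to the first candidate
        reduct.append(chosen)
        best = gains[chosen]
    return reduct


# Hypothetical toy decision table: the decision d equals (v1 OR v2); v3 is noise.
ROWS = [
    {"v1": 0, "v2": 0, "v3": 0, "d": 0},
    {"v1": 0, "v2": 1, "v3": 0, "d": 1},
    {"v1": 1, "v2": 0, "v3": 1, "d": 1},
    {"v1": 1, "v2": 1, "v3": 1, "d": 1},
    {"v1": 0, "v2": 0, "v3": 1, "d": 0},
    {"v1": 1, "v2": 0, "v3": 0, "d": 1},
]

if __name__ == "__main__":
    print(quick_reduct(ROWS, ["v1", "v2", "v3"], "d"))
```

Because the search is greedy, it evaluates far fewer subsets than an exhaustive search, but in general it is not guaranteed to return the smallest possible reduct.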
4. The architecture of fuzzy–rough expert system for phishing detection

The proposed system is designed as shown in Fig. 3. The components of the system are as follows.

Fig. 3. The structure of fuzzy–rough expert system.

4.1. Determination of input variables

Attackers in a phishing attack try to design the forged website so that users cannot notice any difference between the original and the forged websites and readily reveal their personal information. Despite this effort, forged websites carry signs and features that help to establish that they are not genuine. To design a system capable of detecting any phishing type and informing the users, the features of phished websites must be determined first. Thus, in the first step, after reviewing the phishing detection literature (Chen and Guo, 2006; [24–31]) and evaluating real samples of phished websites, a list of phishing attack features (i.e. phishing signs) was extracted [35]. The initial list of 28 important and non-redundant indicators is given in Table 2.

So many input variables would produce a large volume of rules for a fuzzy system, which increases the processing time and decreases the speed and agility of the final system. In practice, however, the system should announce the detection result for a site immediately, so that financial losses to bank clients are avoided in real time. Thus, in the next section, the redundant input variables are detected and omitted using a rough set based feature selection method.

4.2. Reducing the input variables using rough sets theory

In this stage, the ineffective and redundant indicators in Table 2 are recognized and omitted for real phishing attack cases using the QRA. To run the algorithm, 60 cases of real e-banking websites were extracted and used (Table 3). Once the algorithm was performed, six effective indicators were obtained: "Length of URL", "Certificate Authority", "Distinguished names certificate (Certificate Details)", "Abnormal URL of anchor", "Abnormal SFH", and "Adding a prefix or suffix (sub-domain)". By assessing the real samples and the prevalent methods of phishing in the e-banking field, and by consulting the experts, the effective terms were identified (Fig. 4). After the linguistic variables were found, a questionnaire was prepared and distributed among experts to define the ranges of the linguistic variables; the results are shown in Table 4.

Fig. 4. Input variables.

Table 4
Input variables fuzzy membership functions.

Tier  Input variable                    Linguistic variable  Fuzzy number
1     Length of URL                     Low                  [12 12 20 30]
                                        Medium               [20 30 50 70]
                                        High                 [50 70 330 330]
2     Certificate authority             Low                  [0 0 3 4]
                                        Medium               [3 4 5 7]
                                        High                 [4 8 10 10]
3     Distinguished names certificate   Low                  [0 0 2 3]
                                        Medium               [2 3 4 6]
                                        High                 [4 7 10 10]
4     Abnormal URL of anchor            Low                  [0 0 2 3]
                                        High                 [2 4 10 10]

4.3. Determination of output variables

The output variable of the fuzzy inference engine is "the phishing rate of website", to which the terms "legitimate", "a little suspicious", "suspicious", "very suspicious" and "phish" are attributed. In other words, the system classifies the website with one of the following linguistic variables:

a. Legitimate: The website is secure enough and its validity can be trusted.
b. A little suspicious: The website cannot be trusted completely, and its validity and legitimacy should be investigated before any information is entered.
c. Suspicious: The website has some features that call its legitimacy into question.
d. Very suspicious: The website is very likely forged, and entering confidential information should be strictly avoided.
e. Phish: The website is totally forged.

According to the above descriptions, the classification of websites and the corresponding ranges are shown in Table 5, which is extracted from experts' viewpoints. Fig. 5 demonstrates the membership functions of the output variable.
Table 5
Classification of output linguistic variables based on the output crisp number (phishing rate).

Output variable    Output linguistic variable  Corresponding fuzzy number
Phishing rate (%)  Legitimate                  [0 0 2 15]
                   A little Suspicious         [2 15 20 45]
                   Suspicious                  [25 40 60 65]
                   Very Suspicious             [60 65 80 85]
                   Phish                       [80 85 100 100]

4.4. Fuzzification

The fuzzifier must not involve too many calculations and must be capable of mapping a crisp point to a fuzzy set in which that point has a large membership value. Moreover, the data entering the phishing detection system are all crisp numbers without any noise. For these reasons, the singleton fuzzifier is applied to the input parameters in this research, so that the crisp number $x^* \in U$ is mapped to the fuzzy set $A'$ in U as:

\[ \mu_{A'}(x) = \begin{cases} 1 & x = x^* \\ 0 & \text{otherwise} \end{cases} \tag{3} \]
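A short sketch may help to show why the singleton fuzzifier keeps the computation light. It assumes that the four-number fuzzy numbers [a b c d] of Table 4 denote trapezoids, and it uses two entries of that table; the function names `trapezoid`, `singleton` and `match_degree` are illustrative and do not come from the paper.

```python
# Terms taken from Table 4, read as trapezoids [a b c d] (an assumption).
LOW_URL_LENGTH = (12, 12, 20, 30)   # "Low" term of "Length of URL"
HIGH_ANCHOR = (2, 4, 10, 10)        # "High" term of "Abnormal URL of anchor"


def trapezoid(x, a, b, c, d):
    """Membership degree in a trapezoid; a == b or c == d gives a shoulder."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def singleton(x_star):
    """Singleton fuzzifier of Eq. (3): membership 1 at x_star and 0 elsewhere."""
    return lambda x: 1.0 if x == x_star else 0.0


def match_degree(x_star, term, grid):
    """Sup-min match between the fuzzified input and a linguistic term over a finite grid."""
    fuzzified = singleton(x_star)
    return max(min(fuzzified(x), trapezoid(x, *term)) for x in grid)


if __name__ == "__main__":
    print(match_degree(25, LOW_URL_LENGTH, range(0, 101)))
    print(trapezoid(25, *LOW_URL_LENGTH))
```

Both printed values coincide: with a singleton input, matching the input against a term reduces to evaluating the term's membership function at the crisp point, so no search over the universe of discourse is needed.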
4.5. Fuzzy knowledge base

The rules are created from the experts' viewpoints. The rule base of the fuzzy–rough expert system, comprising 40 if-then rules with the 'and' conjunction, is built using the six main indicators mentioned in Section 4.2. A part of the fuzzy rule base is shown in Table 6. While extracting the latent rules in the experts' minds, all possible combinations of the variables are considered, and those overlapped by similar rules are omitted from the rule base. Investigating the prevalent phishing methods helps to optimize the rule base by omitting situations that never take place, given that the system is designed for practical use in the current e-banking environment.
Fig. 5. Output variable membership function.
Table 6
A part of fuzzy knowledge base rules.

No.  Rule description
1    If the CA's reliability is low and DN is low, then the website is phished.
2    If the URL of anchor is very abnormal, then the website is phished.
3    If the CA's reliability is high and DN is high and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is short and it does not have a sub-domain, then the website is legitimate.
4    If the CA's reliability is high and DN is high and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is medium and it does not have a sub-domain, then the website is legitimate.
5    If the CA's reliability is medium and DN is high and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is medium and it has a sub-domain, then the website is legitimate.
6    If the CA's reliability is medium and DN is high and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is short and it has a sub-domain, then the website is legitimate.
7    If the CA's reliability is high and DN is low and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it has a sub-domain, then the website is very suspicious.
8    If the SFH is abnormal, then the website is a little suspicious.
9    If the CA's reliability is high and DN is low and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it does not have a sub-domain, then the website is suspicious.
10   If the CA's reliability is high and DN is high and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it does not have a sub-domain, then the website is suspicious.
11   If the CA's reliability is medium and DN is high and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it has a sub-domain, then the website is very suspicious.
12   If the CA's reliability is medium and DN is medium and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it has a sub-domain, then the website is very suspicious.
13   If the CA's reliability is medium and DN is low and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it has a sub-domain, then the website is very suspicious.
14   If the CA's reliability is medium and DN is low and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it does not have a sub-domain, then the website is suspicious.
15   If the CA's reliability is medium and DN is high and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it does not have a sub-domain, then the website is suspicious.
16   If the CA's reliability is medium and DN is medium and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it does not have a sub-domain, then the website is suspicious.
17   If the CA's reliability is low and DN is low and the URL of anchor is a little abnormal and SFH is abnormal and the Length of URL is long and it has a sub-domain, then the website is phished.
18   If the CA's reliability is high and DN is high and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it has a sub-domain, then the website is very suspicious.
19   If the CA's reliability is high and DN is medium and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is long and it does not have a sub-domain, then the website is very suspicious.
20   If the CA's reliability is high and DN is medium and the URL of anchor is a little abnormal and SFH is not abnormal and the Length of URL is short and it has a sub-domain, then the website is legitimate.
In some of the rules, the state of a single indicator determines the output and the other indicators can be ignored. Moreover, some indicator states are fundamentally incompatible with one another, so the corresponding rules are removed as well.
4.6. Defuzzification

The output of the inference engine is a fuzzy set, which has to be transformed into a crisp value to be useful in practice. The role of the defuzzifier is to specify the point that best represents the fuzzy set $B'$. Several standard methods are applicable for defuzzification. In this paper the centroid defuzzifier is employed, which is the most prevalent method and is based on calculating the center of gravity of the output fuzzy set [32]:

\[ y^* = \frac{\int_V y\,\mu_{B'}(y)\,dy}{\int_V \mu_{B'}(y)\,dy} \tag{4} \]

where $y^*$ is the crisp value obtained from the output of the inference engine.

4.7. Fuzzy inference engine

The minimum inference engine is employed:

\[ \mu_{B'}(y) = \max_{l=1}^{M} \left[ \sup_{x \in U} \min\left( \mu_{A'}(x),\ \mu_{A_1^l}(x_1),\ \ldots,\ \mu_{A_n^l}(x_n),\ \mu_{B^l}(y) \right) \right] \tag{5} \]

where U is the universe of discourse, M is the number of rules, $x_i$ is the antecedent clause and y is the consequent clause of each rule.
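The interplay of Eqs. (4) and (5) can be illustrated with a small numerical sketch. It assumes singleton inputs, so that each rule contributes the minimum of its firing strength and its consequent membership, and it approximates the integrals of Eq. (4) by a midpoint sum over the range 0 to 100. The consequent trapezoids are the "Phish" and "A little Suspicious" entries of Table 5; the firing strengths of 1.0 and all function names are illustrative assumptions.

```python
# Consequent terms from Table 5, read as trapezoids [a b c d] (an assumption).
PHISH = (80, 85, 100, 100)
A_LITTLE_SUSPICIOUS = (2, 15, 20, 45)


def trapezoid(x, a, b, c, d):
    """Membership degree in a trapezoid; a == b or c == d gives a shoulder."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)


def aggregated_output(y, fired_rules):
    """Eq. (5) for singleton inputs: max over rules of min(firing strength, consequent)."""
    return max(min(strength, trapezoid(y, *consequent)) for strength, consequent in fired_rules)


def centroid(mu, lo=0.0, hi=100.0, steps=1000):
    """Eq. (4) approximated by a midpoint sum; returns None if no rule fires."""
    width = (hi - lo) / steps
    points = [lo + (i + 0.5) * width for i in range(steps)]
    weights = [mu(y) for y in points]
    total = sum(weights)
    if total == 0:
        return None
    return sum(y * w for y, w in zip(points, weights)) / total


# Illustrative firing strengths, not results reported in the paper.
FIRED = [(1.0, PHISH), (1.0, A_LITTLE_SUSPICIOUS)]

if __name__ == "__main__":
    print(centroid(lambda y: aggregated_output(y, FIRED)))
```

The number of integration steps trades accuracy for speed; for piecewise-linear membership functions such as these, the integrals could also be computed in closed form.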
5. Evaluating the performance of the fuzzy–rough expert system
In this section, the phishing indicators are examined supposing 370 that the phishers had endeavored to forge the site exactly the same 371 as the original site. The values of the indicators extracted from the 372 site are listed in the 10th row of Table 7. The forged website has a 20 373 character long URL. This length is in the range of “Low” membership 374 function which is the appropriate length for an e-banking website. 375 The website also has SSL certificate while its CA is “Very” reliable. 376 Although the phishers can’t take certificate from a highly reliable 377 CA for their phishing website, the assumption is that they have used 378 an expired certificate. Thus the certificate details (DN) is “Medium”. 379 The indicator “abnormal URL of anchor” has the least possible value, 380 which indicates “High” linguistic variable. In the website, source 381 code and “Form” label, SFH is abnormal and corresponds to fuzzy 382 number “1”. The output of fuzzy system is the website phishing rate Q14 which lies between 0 and 100 as discussed in Section 3.3. As this 383 number approaches to 100, the website trustworthiness decreases 384 and the probability of phishing increases. The first step in result 385 calculation is to extract the value of membership function for each 386 input variable. For example for this website, length of URL is in 387 “Low” range so the membership value is calculated through the 388 following equation: 389 369
390
391
Low (x) =
0.5x − 1
2
1
4 < x ≤ 10
394
(9)
395
The SFH is abnormal so: SFH (x) =
⎧ ⎨ −2x + 1 ⎩
393
(8)
396
0 ≤ x < 0.5
2|x − 1| + 1
0.5 ≤ x < 1.5
0
e.w
SFH (1) = 1
(10)
397
(11)
398
In conclusion, the below rules (rules No. 2 and 8 in Table 6) are fired:
In the third step, the minimum inference engine is employed:
B (y) = max sup
Many (x) =
Rule 2: If the URL of anchor is very abnormal, then the website is phished. Rule 8: If the SFH is abnormal, then the website is a little suspicious.
This unit is designed to extract the aggregated rule out of the rules either of which can generate the fuzzy output. Thus, in this paper, individual-rule based inference method is selected. Moreover, the Mamdani minimum inference engine has been chosen as the core of our fuzzy expert system for phishing detection: M
364
392
399 400
(4)
357
359
For “abnormal URL of anchor” which is in “High” range, the membership value is calculated through the following equation:
Abnormal URL of anchor=many (3.2) = 0.6
4.6. Defuzzification
7
1
0 < x ≤ 20
−0.1x + 3
20 < x ≤ 30
Length of URL=Low (20) = 1
(6) (7)
B (y) = max{Rule 1 , Rule 159 } = max{min{0.5, Phish }, min{1, Lsus }}
401 402 403
404 405
(12)
406 407
Here, area under the maximum curve (the dark area in Fig. 6) is first calculated through integration and then is inputted into centroid defuzzifier formula (Eq. (6)) to be used in calculating phishing rate (y*).
y∗ =
y · B (y) dy
v
v
B (y) dy
= 43.7
(13)
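The calculation steps of this worked example (membership evaluation, clipping of the rule consequents with the minimum operator, aggregation with the maximum operator, and centroid defuzzification) can be sketched in Python as follows. This is only a sketch under stated assumptions: `mu_low` and `mu_many` follow the piecewise-linear "Low" and "Many" membership functions of the example, with the lower bound of the "Many" ramp assumed to be 2; the output sets `mu_phish` and `mu_lsus` are hypothetical triangles, because the paper's output membership functions are not given in this section; and all function names are illustrative. The sketch is therefore not expected to reproduce the reported value of 43.7.

```python
def mu_low(x: float) -> float:
    """'Low' URL length: 1 up to 20 characters, linear descent to 0 at 30."""
    if 0 < x <= 20:
        return 1.0
    if 20 < x <= 30:
        return -0.1 * x + 3.0
    return 0.0


def mu_many(x: float) -> float:
    """'Many' abnormal anchor URLs: ramp on (2, 4] (assumed), 1 on (4, 10]."""
    if 2 < x <= 4:
        return 0.5 * x - 1.0
    if 4 < x <= 10:
        return 1.0
    return 0.0


def triangle(y: float, left: float, peak: float, right: float) -> float:
    """Triangular membership function used as a stand-in output set."""
    if left < y <= peak:
        return (y - left) / (peak - left)
    if peak < y < right:
        return (right - y) / (right - peak)
    return 0.0


# HYPOTHETICAL output sets on the 0..100 phishing-rate axis.
def mu_phish(y: float) -> float:
    return triangle(y, 60.0, 100.0, 140.0)


def mu_lsus(y: float) -> float:
    return triangle(y, 0.0, 25.0, 50.0)


def aggregate(y: float, w_phish: float, w_lsus: float) -> float:
    """Mamdani minimum inference: clip each consequent, then take the max."""
    return max(min(w_phish, mu_phish(y)), min(w_lsus, mu_lsus(y)))


def centroid(w_phish: float, w_lsus: float, steps: int = 10000) -> float:
    """Centroid of the aggregated set via a midpoint Riemann sum on [0, 100]."""
    dy = 100.0 / steps
    numerator = 0.0
    denominator = 0.0
    for i in range(steps):
        y = (i + 0.5) * dy
        mu = aggregate(y, w_phish, w_lsus)
        numerator += y * mu * dy
        denominator += mu * dy
    return numerator / denominator if denominator > 0 else 0.0


# Membership degrees of the forged website's inputs.
low_degree = mu_low(20)      # URL length of 20 characters
many_degree = mu_many(3.2)   # abnormal URL of anchor value of 3.2

# Firing strengths as written in the aggregation step of the example.
phishing_rate = centroid(0.5, 1.0)
```

A numerical sum is used in place of the analytic integrals so that the same function works for any shape of aggregated set; with only one rule fired on a symmetric triangle, the centroid falls on the triangle's peak, which gives a quick sanity check.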
In the described situation, even an expert may decide wrongly on the validity of the website; however, given the output number (43.7), the fuzzy–rough expert system detects the website as “suspicious”. Moreover, based on the definition in Section 3.3, the system confirms the unoriginality and illegitimacy of the website. It should be noted that in this example all the indicators were in their nearest state to a legitimate website. Fig. 7 shows the output for this phishing website in the fuzzy–rough phishing detection system. In this figure each column indicates one of the variables. The phishing rate is presented at the top of the last column (output variable).

Fig. 6. The output of Mamdani inference engine.

Fig. 7. Rules of fuzzy–rough expert system after execution.

5.1. Real sample: Bank Melli Iran (https://epayment4.bmi.ir)

Bank Melli Iran (BMI) is the first national Iranian bank. The bank was established in 1928 by the order of the Majlis (the Iranian Parliament) and since then has consistently been one of the most influential Iranian banks. BMI is now the largest commercial retail bank in Iran and in the Middle East, with over 3271 active branches in Iran, 13 active branches and 4 subsidiary banks in foreign countries [33]. Fig. 8 shows the Internet payment homepage of Bank Melli Iran. The properties of this website are extracted and listed in the 6th row of Table 7. As mentioned in Table 7, the “Length of URL” of this webpage is 31, which indicates the “Medium” linguistic variable. Moreover, there is no abnormal character or trait in its URL address, but it has a “sub-domain (prefix or suffix)”. Besides, this website has a highly trustable certificate (Fig. 9), which can be readily seen in detail by clicking on “More Information” and then on “View Certificate” (Fig. 10). As shown in Fig. 10, the certificate has a valid “Issue Date” and “Expiration Date” and contains enough detailed information, which leads to “High” certificate details (DN). Other notable properties of the website are the lack of exploiting code and a normal “SFH”. Here again the fuzzy system calculates the membership values for the six input variables. Then, the phishing rate of the website is obtained using the Mamdani minimum inference engine and the centroid defuzzifier. Finally, the designed fuzzy–rough system calculates the phishing rate for this website as 4.78, indicating that the website is “Legitimate”. The result is consistent with the experts’ viewpoint.

Fig. 8. Internet payment webpage of Bank Melli Iran.

Table 7 Execution results of fuzzy–rough expert system on 10 e-banking websites.
V1: length of URL address; V2: abnormal URL of anchor; V3: certification authority (reliability of CA); V4: distinguished names certificate (DN); V5: server form handler (SFH); V6: adding prefix or suffix.

| Instance number | V1 | V2 | V3 | V4 | V5 | V6 | Phishing rate (%) | Detected as |
|---|---|---|---|---|---|---|---|---|
| 1 | 64 | 0 | 0 | 0 | 0 | 0 | 91.4 | Phishing |
| 2 | 28 | 0 | 0 | 0 | 0 | 1 | 91.4 | Phishing |
| 3 | 60 | 0 | 8.3 | 7 | 0 | 1 | 19.3 | A little suspicious |
| 4 | 45 | 0 | 9 | 9.6 | 0 | 1 | 4.78 | Legitimate |
| 5 | 25 | 0 | 8.9 | 9.5 | 0 | 1 | 5.76 | Legitimate |
| 6 | 37 | 1 | 8.9 | 9.5 | 0 | 1 | 4.78 | Legitimate |
| 7 | 55 | 0 | 0 | 0 | 0 | 1 | 91.4 | Phishing |
| 8 | 38 | 0 | 8.9 | 9.5 | 0 | 1 | 4.78 | Legitimate |
| 9 | 30 | 1.2 | 1.9 | 7.6 | 0 | 0 | 91.4 | Phishing |
| 10 | 20 | 3.2 | 8 | 3.5 | 1 | 1 | 43.7 | Suspicious |
Execution of the fuzzy–rough expert system on 50 e-banking website instances shows that the detection accuracy of the system is 88%, with a 12% detection error. The evaluation results obtained from executing the system on nine e-banking websites are presented in Table 7. To estimate the accuracy of an expert system, the experts have to confirm the accuracy of its results, because an expert system tries to simulate an expert’s behaviour at a high level.
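As a worked reading of these figures, and assuming that accuracy here means the share of the 50 test instances on which the system's verdict agreed with the experts' judgment, the stated percentages correspond to the following counts. The counts 44 and 6 are derived from the percentages and are not reported in the paper.

$$ \text{accuracy} = \frac{44}{50} = 0.88, \qquad \text{error} = \frac{6}{50} = 0.12 $$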
Fig. 9. Bank Melli Iran certificate.
Fig. 10. Bank Melli Iran certificate details.
The time needed to calculate the output depends on the number of variables and fuzzy rules, because the fuzzy system evaluates all variables in each execution. An important consequence of using rough set theory is therefore a reduction in the output calculation time. The fuzzy–rough expert system designed for phishing detection, with six input variables and 40 rules, has been implemented in MATLAB 7.8.0 on a computer with 4 GB of RAM and a 2.3 GHz Intel Core i5 processor. It was observed to deliver the result in less than 1 s. The fuzzy–rough expert system is thus a good choice in online situations, where confidential information may be divulged within seconds and it is extremely important to detect a forged website in real time.
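The dependence of the rule base size on the number of input variables can be illustrated with a short Python sketch. It assumes, purely for illustration, that every indicator has three linguistic terms and that a complete rule base contains one rule per combination of terms; neither assumption is stated in the paper, and the function name `max_rule_count` is hypothetical.

```python
from math import prod
from typing import Sequence


def max_rule_count(terms_per_indicator: Sequence[int]) -> int:
    """Upper bound on the rule count: one rule per combination of terms."""
    return prod(terms_per_indicator)


# Assumed three linguistic terms per indicator (illustrative only).
all_indicators = max_rule_count([3] * 28)     # the initial 28 indicators
selected_indicators = max_rule_count([3] * 6)  # the six selected indicators
```

Under these assumptions the six selected indicators allow at most 729 rule combinations, against roughly 2.3 × 10^13 for all 28 indicators, which is why reducing the indicator set is what makes a compact rule base such as the paper's 40 rules feasible.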
6. Limitations and Innovations

Improving on the previous fuzzy-based system [10] in performance and ease of deployment, the proposed hybrid system is quite flexible because its rule base is simple to modify. Thus, when new phishing methods appear in e-banking, the system’s capability to detect threats can be enhanced. Reducing the problem to the most effective phishing indicators leads to an agile and easy-to-deploy solution. Other contributions are the identification of the critical phishing indicators in the Iranian e-banking system and the phishing rules elicited from them to defend Iranian e-banking websites against phishing attacks. The fuzzy–rough hybrid system of this paper also covers all phishing types in e-banking and hence works like a panacea.

While the findings of the study shed light on the critical phishing indicators in the Iranian e-banking system and introduced a corresponding efficacious phishing detector, it could be argued that only a small number of experts knowledgeable in phishing attacks could be accessed. Since the system is fundamentally based on experts’ viewpoints, especially in assigning values to phishing indicators as inputs, some error may have been introduced. To limit this, we relied on experts’ points of view in each phase of building the system, and every facet has been confirmed by experts in the phishing field. Secondly, the study tested a limited number of e-banking phishing samples. Moreover, newly invented phishing techniques will lower the accuracy, and the system needs regular monitoring if it is to remain accurate. In contrast, feeding the system with new real phishing attacks will make it impenetrable.

Considering the inadequacy of phishing detection and prevention techniques, we proposed a hybrid model based on fuzzy sets theory and rough sets theory. The fuzzy method provides a more effective and natural solution, which deals with qualitative factors rather than precise values. Using fuzzy sets theory together with rough sets theory helped to formulate the uncertainty in the problem. Ultimately, the test results of our phishing detection system are compared to those of similar systems that detect phishing targets based on fuzzy sets theory. The performance results are shown in Table 7. Our method offers more advantages than any other method in phishing webpage detection, as it has a fairly low false positive rate and high accuracy. In practice we found that previously proposed algorithms are too heavy to be implemented and, even if they are implemented, the final system will not have enough agility to meet online network demands.
7. Conclusion

In this paper a fuzzy–rough expert system for phishing detection in Iranian e-banking has been developed and implemented. After extracting the preliminary effective variables for detecting phishing attacks in e-banking, six indicators were identified by applying the rough-set feature selection algorithm to 50 instances of e-banking websites. Finally, the fuzzy–rough hybrid expert system was built with six inputs and 40 rules. The accuracy of the fuzzy–rough expert system was then assessed, and the results showed that the system has 88% accuracy.
References
[1] M. Saeedi, F. Saghafi, M. Askarzadeh, M. Jalali, Effective factors in e-banking strategy structuring, in: The First International Conference on E-Banking, Tehran, 2007.
[2] W. Kim, O. Jeong, C. Kim, J. So, The dark side of the Internet: attacks, costs and responses, Inf. Syst. 36 (2011) 675–705.
[3] J. Pimanova, Great News for Email Users: Spam Rates Dropped by Nearly 10 Percent in October 2012, 2012, http://www.emailtray.com/blog/category/emailtrends-and-stats/
[4] G. Ramesh, I. Krishnamurthi, K. Sree Kumar, An efficacious method for detecting phishing webpages through target domain identification, Decis. Support Syst. 61 (2014) 12–22.
[5] J. Peppard, A. Rylander, Products and services in cyberspace, Int. J. Inf. Manag. 25 (2005) 335–345.
[6] L. James, Phishing Exposed, 1st ed., Syngress Publishing Inc., New York, 2005.
[7] S. Abu-Nimeh, D. Nappa, W. Xinlei, S. Nair, A distributed architecture for phishing detection using Bayesian additive regression trees, eCrime Res. Summit (2008) 1–10.
[8] F. Toolan, J. Carthy, Feature selection for spam and phishing detection, eCrime Res. Summit (2011) 1–12.
[9] M.E. Kabay, Phishing, 2004, pp. 1–30, www.mekabay.com/courses/industry/phishing.ppt (October 2012).
[10] M. Aburrous, M.A. Hossain, K. Dahal, F. Thabtah, Experimental case studies for investigating e-banking phishing techniques and attack strategies, Cogn. Comput. 2 (3) (2010) 242–253.
[11] J.W. Ragucci, S.A. Robila, Societal aspects of phishing, in: Technology and Society, ISTAS 2006, IEEE International Symposium, 2006, pp. 1–5.
[12] D. Zhang, Z. Yan, H. Jiang, T. Kim, A domain-feature enhanced classification model for the detection of Chinese phishing e-business websites, Inf. Manag. 51 (2014) 845–853.
[13] Y. Chuan, H. Wang, T. Kim, Anti-phishing in offense and defense, in: Computer Security Applications Conference (ACSAC 2008), Feta, “RahPardakht” Website.4 times growth in country’s computer crimes, 2008, Retrieved June 2012 from: http://way2pay.ir
[14] V. Shreeram, M. Suban, P. Shanthi, K. Manjula, Anti-phishing detection of phishing attacks using genetic algorithm, in: IEEE International Conference on Communication Control and Computing Technologies, 2010.
[15] H. Zhang, G. Liu, T.W.S. Chow, W. Liu, Textual and visual content-based anti-phishing: a Bayesian approach, IEEE Trans. Neural Netw. (2011) 1532–1546.
[16] A.Y. Fu, L. Wenyin, X. Deng, Detecting phishing web pages with visual similarity assessment based on earth mover’s distance (EMD), IEEE Trans. Depend. Secur. Comput. 3 (4) (2006).
[17] L. Wenyin, G. Huang, L. Xiaoyue, Z. Min, X. Deng, Detection of Phishing Webpages based on Visual Similarity. ACM 1-59593-051-5/05/0005, 2005.
[18] T.J. Ross, Fuzzy Logic With Engineering Applications, 2nd ed., University of New Mexico, John Wiley & Sons, Ltd., USA, 2004.
[19] L.X. Wang, A Course in Fuzzy Systems and Control, Prentice-Hall International, 1996.
[20] Gh. Montazer, H. QahriSaremi, M. Ramezani, Design a new mixed expert decision aiding system using fuzzy ELECTRE III method for vendor selection, Expert Syst. Appl. (2009) 10837–10847.
[21] Z. Pawlak, Rough Sets: Present state and Future Prospects, ICS Research Report 32/95, Institute of Computer Science, Warsaw University of Technology, 1995, pp. 1–17.
[22] Y. Cheng, Forward Approximation and backward approximation in fuzzy rough sets, Neurocomputing 148 (2015) 340–353.
[23] R. Jensen, Q. Shen, Fuzzy rough attribute reduction with application to web categorization, Fuzzy Sets Syst. 141 (3) (2004) 469–475.
[24] Y. Pan, X. Ding, Anomaly based web phishing page detection, in: Computer Security Applications Conference, ACSAC’06, 22nd Annual, 2006, pp. 381–392.
[25] M. Qi, C. Yang, Research and design of phishing alarm system at client terminal, in: IEEE Asia-Pacific Conference on Services Computing (APSCC’06), 2006, pp. 597–600.
[26] C. Liu, S. Stamm, Fighting unicode-obfuscated spam, in: Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, New York, USA, 2007, pp. 45–59.
[27] M. Jakobsson, The human factor in phishing, in: Privacy & Security of Consumer Information 07, 2007, http://www.informatics.indiana.edu/markus/papers/aci.pdf (accessed 12.06.12).
[28] M. Aburrous, M.A. Hossain, F. Thabatah, K. Dahal, Intelligent phishing website detection system using fuzzy techniques, in: 3rd International Conference on Information and Communication Technologies: From Theory to Applications (ICTTA 2008), 2008, pp. 1–6.
[29] D.K. Mcgrath, M. Gupta, Behind phishing: an examination of phisher modi operandi, in: Proceedings of the USENIX Workshop on Large-Scale Exploits and Emergent Threats, 2008, pp. 1–8.
[30] H. Shahriar, M. Zulkernine, PhishTester: automatic testing of phishing attacks, in: Fourth International Conference on Secure Software Integration and Reliability Improvement (SSIRI), 2010, pp. 198–207.
[31] M. GhotaishAlkhozae, O. Abdullah Batarfi, Phishing websites detection based on phishing characteristics in the webpage source code, Int. J. Inf. Commun. Technol. Res. 1 (6) (2011) 283–291.
[32] Zh. Xu, K. Gao, M. Khoshgoftar, T.N. Seliya, System regression test planning with a fuzzy expert system, Inf. Sci. 259 (2014) 532–543.
[33] Bank Melli Iran About BMI, 2014, Retrieved January 2014 from: http://www.bmi.ir/Fa/bmihistory.aspx?smnuid=10
[34] H. Masanori, Y. Akira, M. Yutaka, Visual similarity-based phishing detection without victim site information, in: IEEE Symposium on Computational Intelligence in Cyber Security (CICS’09), 2009.
[35] Gh. Montazer, S. ArabYarmohammadi, Identifying the critical indicators for phishing detection in Iranian e-banking system, in: 5th Conference on Information and Knowledge Technology (IKT), 2013, pp. 107–112.
[36] L.X. Wang, J.M. Mendel, Fuzzy basis functions, universal approximation, and orthogonal least-squares learning, IEEE Trans. Neural Netw. (1992) 807–814.