Virus detection using clonal selection algorithm with Genetic Algorithm (VDC algorithm)



Applied Soft Computing 13 (2013) 239–246

Contents lists available at SciVerse ScienceDirect

Applied Soft Computing journal homepage: www.elsevier.com/locate/asoc

Suha Afaneh a, Raed Abu Zitar b,∗, Alaa Al-Hamami c

a Department of Computer Sciences, Isra University, Amman, Jordan
b College of Information Technology, American University of Madaba, Jordan
c Department of Computer Sciences, Amman Arab University, Amman, Jordan

Article info

Article history:
Received 12 January 2012
Received in revised form 18 July 2012
Accepted 7 August 2012
Available online 21 August 2012

Keywords: Artificial immune system; Virus detection algorithm; Clonal selection

Abstract

This paper presents a novel approach to computer virus detection based on modeling the structure and dynamics of a real-life paradigm that exists in the bodies of all living creatures. It develops an algorithm based on the concept of the artificial immune system (AIS) for the purpose of detecting viruses. The algorithm is called the Virus Detection Clonal algorithm (VDC), and it is derived from the clonal selection algorithm. The VDC algorithm consists of three basic steps: cloning, hypermutation and stochastic reselection. The developed VDC algorithm is then validated in two phases: learning and testing. Two main parameters are set: the number of signatures per clone (Fat) and the hypermutation probability (Pm). The Genetic Algorithm (GA) is then used as a tool to improve the developed algorithm by searching for values of the main parameters (Fat and Pm) that produce better results. The results show that the virus detection rate of the developed algorithm is 94.4%, whereas the false positive rate is 0%. These percentages indicate that the VDC algorithm is sufficient and usable in this field. Moreover, employing the GA to optimize the VDC algorithm improves the detection speed of the algorithm.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

Different artificial intelligence based techniques are used nowadays in all areas of computer security [1]. Techniques such as swarm intelligence, Genetic Algorithms, and ant colony optimization have different applications in pattern classification and in image and signal processing [2–4]. The artificial immune system (AIS) is very similar to those paradigms in structure and mechanism; however, it is quite recent and has not matured yet. The AIS has been applied in different fields, most notably in computer virus detection. Protection against viruses becomes more difficult day after day, and viruses constitute a threat to everyone who uses computers. Viruses are becoming more sophisticated over time, and their signatures change continuously [5,6], which makes the task of anti-virus software more difficult [7]. The AIS comprises several concepts: clonal selection, negative selection and immune network theory. This paper proposes the VDC algorithm for detecting viruses, which is inspired by the clonal selection algorithm and, more precisely, by CLONALG [8].

∗ Corresponding author. E-mail address: [email protected] (R.A. Zitar). 1568-4946/$ – see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.asoc.2012.08.034

Studies have shown that 25% of people using computers are infected by some sort of malware, while the commercial PC sector suffers from around half of this percentage [9]. The simplest and most common method of protecting networks from viral attacks is signature technology. This paper contributes by proposing a Virus Detection Clonal (VDC) algorithm and then optimizing its parameters using the GA. The VDC algorithm is a modern approach, even though the virus problem itself is an old one; the problem is still growing because it affects every individual who uses computers. The Negative Selection Algorithm (the self-non-self algorithm) has been used for virus detection [10–12,1], but after a wide web search and a review of a wide range of specialized journals, we found no prior use of the clonal selection algorithm for this type of application, so applying it here is a new contribution.

The clonal selection principle describes the immune response to an antigenic stimulus: only the cells that recognize the antigen proliferate, and they are selected over those that do not. When an antibody strongly matches the antigen, the corresponding B-cells are stimulated to produce clones of themselves, and these generated B-cells, which are copies of their parents, are mutated [13]. In this paper the antigens represent the computer viruses in the infected files and the antibodies represent the signatures. The signatures with high matching values (fitness) are selected for the cloning, hypermutation and reselection processes: cloning produces copies of the signatures with the Best fitness, and these copies are then mutated so that they can detect viruses that differ in some characters (genes), even if these viruses have not attacked previously (just like the adaptive defense of the immune system). In this research, a stochastic reselection step is added to the clonal selection algorithm in order to guarantee choosing the best mutated signatures.

2. The proposed VDC algorithm

The research consists of two stages (MATLAB 7.1 is used): first, the design and implementation of the Virus Detection Clonal (VDC) algorithm; second, the validation of the VDC algorithm.

2.1. The design and implementation of the VDC algorithm

Fig. 1 illustrates the flowchart of the VDC algorithm, and Fig. 2 gives its pseudo code. After loading the signatures' pool and the files' pool, the loop runs for the number of generations given by the Learning Gen parameter, whose values are listed in Table 2. The loop goes through the 3 main steps: Cloning (making copies of the signatures with the highest fitness), Hypermutation (making random changes to the virus signatures with higher fitness) and Reselection (stochastically choosing the next generation of signatures according to their fitness). The fitness is calculated according to Eq. (1):

\[ f(x) = f_0(x) + \delta \sum_{i=1}^{z} \text{match\_function}(x, y_i) + \sum_{j=1}^{t} D_j \tag{1} \]

where:

f_0(x): the initial fitness of signature x; it is a random number determined in the initialization of the algorithm when loading the signatures' pool.
y_i: the ith file.
δ: a multiplying factor with a value of 10.
z: the number of all files in the files' pool.
D_j: if mutation is performed on signature j, then D_j is drawn uniformly at random from {−1, 0, 1}.
t: the number of signatures in the original signatures' pool (the number of signatures in each clone).

\[ \text{match\_function}(x, y_i) = \begin{cases} 0, & \text{no match} \\ 1, & \text{match found} \end{cases} \tag{2} \]

Fig. 1. The VDC algorithm flowchart.

Fig. 2. The VDC algorithm pseudo code.

Fig. 3. VDC algorithm: cloning.

Fig. 4. VDC algorithm: hypermutation.
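As a reading aid for Eqs. (1) and (2), the fitness can be sketched in Python as follows. This is a minimal sketch, not the authors' MATLAB code: the function names, the representation of files and signatures as strings, and the use of substring containment for the exact match are assumptions made for illustration.

```python
DELTA = 10  # multiplying factor from the definition list of Eq. (1)


def match_function(signature: str, file_content: str) -> int:
    """Eq. (2): 1 if a match is found, 0 otherwise.

    Assumption: an exact match is modeled as substring containment.
    """
    return 1 if signature in file_content else 0


def fitness(signature, files, initial_fitness, mutation_terms):
    """Eq. (1): f(x) = f0(x) + delta * sum_i match(x, y_i) + sum_j D_j.

    `mutation_terms` holds the D_j values (each -1, 0 or 1).
    """
    matches = sum(match_function(signature, y) for y in files)
    return initial_fitness + DELTA * matches + sum(mutation_terms)


# Two of the three files contain the signature, and the D_j terms cancel out.
print(fitness("abc", ["xxabcxx", "zzz", "abc"], 2.0, [1, 0, -1]))
```

Because each match contributes 10 while each mutation term contributes at most 1, detections dominate the fitness value.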

Figs. 3–6 display the VDC algorithm in a simple way. In Fig. 3, the cloning step takes half the size of the signatures' pool and makes copies of the signatures with the highest fitness, so that T holds the copies of the signatures, F holds copies of their fitness values, and V holds copies of the viruses' names. Fig. 4 demonstrates the hypermutation step, where some of the copies of the virus signatures are mutated according to the Pm value. The virus name in V of a mutated signature in T is prefixed with "mut" to distinguish it from non-mutated signatures. The values inside D can be −1, 0 or 1, and they are added to the fitness of the mutated signatures, so the fitness of a mutated signature can be worse than, the same as, or better than its fitness before mutation; this reflects the randomness of the process. That is, if D = −1 the fitness decreases by 1 (worse), if D = 0 the fitness remains the same, and if D = 1 the fitness increases by 1 (better). The mutation is performed as follows: one character in the signature is replaced with a random character whose ASCII code is between 48 and 122, and the replacement position is also chosen randomly. For example, the signature ‘8e5ef1aec91259d70c5e62cdfe42c36e ddc8cc9cbe45313d0’ can become ‘8e5ef1aec91259d70c5e62kdfe42c36e ddc8cc9cbe45313d0’ after mutation; the c is replaced by k. The reselection step is illustrated in Figs. 5 and 6. In Fig. 5 a loop runs over each file in the files' pool, and the fitness function is calculated as a counter of the matches between the signatures in T and the files inside the files' pool, in addition to the initial fitness of the signatures.
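The cloning and hypermutation steps described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names and list-based T, F, V are hypothetical, the clone is simply the fitter half of the pool (the Fat parameter is not modeled), and each copy is mutated independently with probability Pm.

```python
import random


def clone_best(signatures, fitnesses, names):
    """Copy the fitter half of the pool into T (signatures), F (fitness), V (names)."""
    order = sorted(range(len(signatures)), key=lambda i: fitnesses[i], reverse=True)
    keep = order[: len(signatures) // 2]
    T = [signatures[i] for i in keep]
    F = [fitnesses[i] for i in keep]
    V = [names[i] for i in keep]
    return T, F, V


def hypermutate(T, F, V, pm, rng):
    """Mutate each copy with probability pm (assumed per-signature).

    One randomly chosen character is replaced by a random character with
    ASCII code in 48..122, the name is prefixed with "mut", and D in
    {-1, 0, 1} is added to the fitness.
    """
    for k in range(len(T)):
        if T[k] and rng.random() < pm:
            pos = rng.randrange(len(T[k]))
            new_char = chr(rng.randint(48, 122))  # randint is inclusive on both ends
            T[k] = T[k][:pos] + new_char + T[k][pos + 1:]
            V[k] = "mut" + V[k]
            F[k] += rng.choice((-1, 0, 1))
    return T, F, V


rng = random.Random(0)
T, F, V = clone_best(["aaaa", "bbbb", "cccc", "dddd"], [1.0, 9.0, 5.0, 3.0], ["w", "x", "y", "z"])
print(hypermutate(list(T), list(F), list(V), 0.5, rng))
```

Note that the replacement character may equal the original one, in which case the signature text is unchanged even though it is marked as mutated.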

Fig. 5. VDC algorithm: reselection1.


Fig. 6. VDC algorithm: reselection2.
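The stochastic selection shown in Fig. 6 (a threshold R between 0.6 and 1, fitness normalized by the clone maximum, and at most 11 new signatures added to the pool) can be sketched as below. This is a hedged sketch: the function name, the list-based pool, and the assumption that the maximum fitness is positive are illustrative choices, not taken from the paper's code.

```python
import random


def reselect(T, F, original_pool, rng, max_new=11):
    """Append up to `max_new` of the best new signatures to the pool.

    Assumes max(F) > 0 so that the normalization is well defined.
    """
    r = rng.uniform(0.6, 1.0)  # selection threshold R for this generation
    f_max = max(F)
    seen = set(original_pool)
    candidates = []
    for sig, f in zip(T, F):
        # Keep signatures whose normalized fitness reaches R and that are new.
        if f / f_max >= r and sig not in seen:
            candidates.append((f, sig))
            seen.add(sig)
    candidates.sort(key=lambda c: c[0], reverse=True)  # descending fitness
    return list(original_pool) + [sig for _, sig in candidates[:max_new]]


rng = random.Random(1)
print(reselect(["a", "b", "c"], [10, 100, 50], ["a"], rng))
```

In this example "b" always passes because its normalized fitness is 1, while "c" (0.5) is always below the smallest possible threshold of 0.6.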

The content of each file inside the files' pool is matched against all the signatures in T. If a match is found, the fitness value in F of that signature is increased by δ, where δ equals 10 in this algorithm. This gives the detection process a higher weight than the mutation process (mutation adds at most 1 to the fitness). The matched file is then removed from the files' pool, since it is known to be infected and removing it avoids redundant matching. Fig. 6 shows how the signatures are selected stochastically according to their fitness. The stochastic selection process is performed in the following steps:

1. A random number (R), called the selection threshold, is created for each generation (iteration in Gen); its value is between 0.6 and 1. This is to make sure that the Best fitness is selected.
2. Each fitness value in the clone is divided by the maximum fitness in that clone.
3. If the value from step 2 is ≥ R, and the signature does not exist in the original signatures' pool (initially in step 2), then the fitness of this signature is appended to a temporary matrix.
4. The temporary matrix is sorted in descending order.
5. The best 11 new signatures are selected and added to the original signatures' pool. The number of appended signatures is limited to 11 in order to prevent the signatures' pool from growing too large. After that, execution returns to step 3.

Table 1 lists the main differences between CLONALG and the proposed VDC algorithm.

2.2. The validation of the VDC algorithm

The validation strategy includes two phases: learning and testing. The learning phase fills the signatures' pools with the new signatures obtained by applying the VDC algorithm on top of the already known signatures (the original signatures that were gathered before mutation). To apply the VDC algorithm, files' pools are needed to complete the matching process between files and signatures. At the beginning, all files contained in the files' pool are benign (clean); then 5% of the files are infected, and after that 25%, 50%, 75% and 100% of the files are infected. Only the files' pools with 5%, 25% and 75% infected files are used in the learning process, whereas six files' pools (0%, 5%, 25%, 50%, 75% and 100%) are used in the testing process. Several parameters of the VDC algorithm are tuned in search of better performance. These parameters are Learning Gen, Pm and Fat; they are chosen as examples, and tuning is not exclusive to them. The ranges of values for those parameters are very wide. The parameter values are given in Table 2. The resulting signatures' pools are called Sig1, Sig2, ..., Sig12 and are used in the testing phase.

The testing phase starts after the learning phase is finished. Testing is concerned with the calculation of the fitness function, which means searching for matches between the files' pool and the virus signatures' pool. Therefore, the testing algorithm contains no cloning, hypermutation or reselection steps. The signatures' pools obtained in Table 2 are used with the files' pools (0%, 5%, 25%, 50%, 75%, and 100%) in the testing process, as listed in Table 3. Note that at testing, Gen = 100 in all cases. Testing starts on the 0% files' pool, which contains 500 benign files, using the signatures' pools Sig1, Sig5 and Sig8, in order to test for false positives (benign files detected as infected). For the five remaining files' pools, testing is performed by running each files' pool with the corresponding signatures' pools in Table 3, and then:

• When testing files' pools with 5% and 75% of infected files, 100 new files are added to the files' pool at testing iteration (testing Gen) 5.
• When testing files' pools with 25%, 50% and 100% of infected files, 100 new files are added to the files' pool at testing iteration (testing Gen) 50.

Regardless of when the 100 new files are added to the files' pool, the algorithm must be able to detect the infected files. The 100 files include benign files and infected files. The infected files fall into three classes:

• Files with signatures that already exist in the original files' pool. These signatures are used in the learning phase.
• Files with signatures that do not exist in the original files' pool. These signatures are not used in the learning phase.
• Files that have signatures with mutations. These mutated signatures are obtained from the signatures' pools produced in the learning phase.

3. The optimized VDC based on Genetic Algorithm

The Genetic Algorithm (GA) employed in this paper is the Genetic Algorithm toolbox under MATLAB 7.1. The VDC algorithm is called to provide the fitness function for the GA. The output of the VDC algorithm is prefixed with a minus sign so that the problem becomes a maximization of the Mean fitness. The inputs are Pm and Fat, and the output is the Mean fitness. The purpose of applying the GA is to find the best values of Pm and Fat, tuning these values in order to obtain a better optimized algorithm. There are three processes:

• GA optimization: the process of finding the values of Pm and Fat by using the GA.
• GA learning: employing the values resulting from the GA optimization process to create the signatures' pools.
• GA testing: testing the resulting signatures' pools.
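The way the VDC algorithm serves as the GA fitness function, with the sign flipped so that a minimizing optimizer maximizes the Mean fitness, can be sketched as follows. This is only an illustration in Python rather than the MATLAB toolbox used in the paper: `make_ga_objective` and `toy_vdc` are hypothetical names, `toy_vdc` is a stand-in for a full VDC learning run, and the bounds are those given in Section 3.1.

```python
def make_ga_objective(run_vdc, lower=(0.01, 0.02), upper=(1.0, 1.0)):
    """Wrap a VDC run as an objective for a minimizing optimizer.

    `run_vdc(pm, fat)` must return the Mean fitness; the wrapper clips the
    inputs to the bounds and returns the negated Mean fitness.
    """
    def objective(params):
        pm, fat = params
        pm = min(max(pm, lower[0]), upper[0])
        fat = min(max(fat, lower[1]), upper[1])
        return -run_vdc(pm, fat)
    return objective


def toy_vdc(pm, fat):
    """Hypothetical stand-in for a VDC learning run (not the real algorithm)."""
    return 100.0 * pm + 50.0 * fat


objective = make_ga_objective(toy_vdc)
candidates = [(0.05, 0.05), (0.5, 0.5), (0.9, 1.0)]
# Minimizing the negated value selects the candidate with the highest Mean fitness.
print(min(candidates, key=objective))
```

A real GA would generate and evolve the candidate (Pm, Fat) pairs itself; the fixed candidate list here only demonstrates the sign convention.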


Table 1. The differences between CLONALG and VDC algorithm.

| Category | CLONALG | VDC algorithm |
|---|---|---|
| P | Patterns to be recognized | Files to be searched |
| M | Randomly initialized | Viruses' signatures from VXHeaven |
| Affinity | The match between elements in M and patterns in P | Fitness function values in Eq. (1) |
| Number of elements to be cloned | N | The half size of signatures' pool |
| The number of elements in each clone | Proportional to the elements affinity | Fixed and it is (Fat × N) for each clone |
| Mutation | Proportional to the elements affinity | Not proportional |
| Lowest elements in M | Replaced the lowest d | Instead of replacement, adding the best 11 elements (a) |

(a) Replacement is not an option due to the sensitivity of the application: when dealing with viruses, even if a virus is not widespread, it is important for the algorithm to be able to detect it.

Table 2. The parameter values of the learning phase.

| Learning files' pool | Learning Gen | Pm | Fat | Signatures' pool |
|---|---|---|---|---|
| 5% infected files | 100 | 0.05 | 0.05 | Sig1 |
| 5% infected files | 100 | 0.1 | 0.05 | Sig2 |
| 5% infected files | 300 | 0.05 | 0.1 | Sig3 |
| 5% infected files | 100 | 0.05 | 0.1 | Sig12 |
| 25% infected files | 100 | 0.05 | 0.05 | Sig4 |
| 25% infected files | 300 | 0.1 | 0.1 | Sig5 |
| 25% infected files | 150 | 0.2 | 0.05 | Sig6 |
| 25% infected files | 100 | 0.2 | 0.05 | Sig10 |
| 75% infected files | 100 | 0.2 | 0.05 | Sig7 |
| 75% infected files | 150 | 0.1 | 0.1 | Sig8 |
| 75% infected files | 300 | 0.05 | 0.1 | Sig9 |
| 75% infected files | 100 | 0.2 | 0.1 | Sig11 |

Table 3. The parameter values of the testing phase.

| Files' pool | Signatures' pools |
|---|---|
| 0% | Sig1, Sig5, Sig8 |
| 5% | Sig6, Sig7, Sig10, Sig11 |
| 25% | Sig1, Sig2, Sig12 |
| 50% | Sig4, Sig5, Sig8 |
| 75% | Sig1, Sig3, Sig6, Sig9, Sig12 |
| 100% | Sig2, Sig4, Sig7, Sig10, Sig11, Sig12 |

3.1. The GA optimization

This process aims to find the values of Pm and Fat by executing the GA with the VDC algorithm. The lower bounds of Pm and Fat are [0.01, 0.02] respectively, and the upper bounds are [1, 1]. The files' pools used when performing the GA have 5%, 25%, 50% and 75% infected files, as listed in the column GA pool of Table 4. The number of generations for which the VDC algorithm is executed is given in the column GA VDC Gen.

Table 4. The GA optimization runs specifications.

| Run | GA pool | GA Gen | GA VDC Gen |
|---|---|---|---|
| 1 | 5% | 10 | 10 |
| 2 | 50% | 10 | 20 |
| 3 | 25% | 10 | 30 |
| 4 | 75% | 10 | 10 |

3.2. The GA learning

The GA learning process creates the signatures' pools using the Pm and Fat values resulting from the GA optimization. For each GA optimization run there is a GA learning run, which gives 4 learning runs, as listed in Table 5.

Table 5. The GA learning runs specifications.

| Run | GA signatures' pool | GA learning pool | Pm | Fat | GA VDC Gen |
|---|---|---|---|---|---|
| 1 | SigGA1 | 5% | 0.636 | 0.935 | 10 |
| 2 | SigGA2 | 50% | 0.96 | 1.0 | 20 |
| 3 | SigGA3 | 25% | 0.65 | 0.96 | 30 |
| 4 | SigGA4 | 75% | 0.914 | 0.935 | 10 |

3.3. The GA testing

The GA testing checks the signatures' pools produced by the GA learning process; the number of runs is 10, according to Table 6.

Table 6. The GA testing runs specifications.

| GA testing pool | GA signatures' pools |
|---|---|
| 0% | SigGA2 |
| 5% | SigGA4 |
| 25% | SigGA1, SigGA2 |
| 50% | SigGA4 |
| 75% | SigGA1 |
| 100% | SigGA1, SigGA2, SigGA3, SigGA4 |

4. Discussion

The discussion covers the results of the learning phase and the testing phase of the VDC algorithm, in addition to the results of the optimized VDC algorithm based on the GA.

4.1. The VDC learning phase

The results of the 12 runs of the learning phase are summarized in Table 7. As the table shows, the Mean fitness values and the number of signatures change with the variable values (Learning Gen, Pm, Fat and learning pool).

Table 7. The summary of the learning results.

| Learning files' pool | Learning Gen | Pm | Fat | Signatures' pool | Mean fitness | No. of signatures |
|---|---|---|---|---|---|---|
| 5% infected files | 100 | 0.05 | 0.05 | Sig1 | 247.1486 | 1198 |
| 5% infected files | 100 | 0.1 | 0.05 | Sig2 | 242.2893 | 1200 |
| 5% infected files | 300 | 0.05 | 0.1 | Sig3 | 275.2660 | 3400 |
| 5% infected files | 100 | 0.05 | 0.1 | Sig12 | 243.5857 | 1200 |
| 25% infected files | 100 | 0.05 | 0.05 | Sig4 | 449.6050 | 1200 |
| 25% infected files | 300 | 0.1 | 0.1 | Sig5 | 525.1859 | 3400 |
| 25% infected files | 150 | 0.2 | 0.05 | Sig6 | 642.7941 | 1750 |
| 25% infected files | 100 | 0.2 | 0.05 | Sig10 | 628.7423 | 1200 |
| 75% infected files | 100 | 0.2 | 0.05 | Sig7 | 838.7423 | 1200 |
| 75% infected files | 150 | 0.1 | 0.1 | Sig8 | 1265.9 | 1750 |
| 75% infected files | 300 | 0.05 | 0.1 | Sig9 | 1243.0000 | 3400 |
| 75% infected files | 100 | 0.2 | 0.1 | Sig11 | 871.7612 | 1200 |

4.2. The VDC testing phase

The results of the 24 runs of the testing phase are summarized in Tables 8 and 9. Table 8 shows the detection rate over the 24 testing runs. In the case of 0% infected files (all files are benign), the detection rate is reported as 100% because zero files were detected as infected; this is the false positive test, and the outcome is considered a good result.

Table 8. The detection rate of the testing results.

| Testing pool | Signatures' pools | Detection rate |
|---|---|---|
| 0% | Sig1, Sig5, Sig8 | 100% |
| 5% | Sig6, Sig7, Sig10, Sig11 | 93.3% |
| 25% | Sig1, Sig2, Sig12 | 94.7% |
| 50% | Sig4, Sig5, Sig8 | 96% |
| 75% | Sig1, Sig3, Sig6, Sig9, Sig12 | 90.2% |
| 100% | Sig2, Sig4, Sig7, Sig10, Sig11, Sig12 | 92.3% |
| Average detection rate | | 94.4% |

Table 9 shows the change in the Best fitness, the Mean fitness, and the change in the Mean fitness.

• Best fitness: in some testing cases the Best fitness changes, while in others it does not. The change in the Best fitness depends on the repetition of the signatures that have the higher fitness in the initial signatures' pool; this repetition increases the value of the Best fitness.
• The Mean fitness and the change in the Mean fitness: Table 9 shows the change in the Mean fitness in all the testing cases. Five variables affect the Mean fitness or its change: Testing Pool, Pm, Fat, Learning Gen and learning pool.

Table 9. The testing results.

| Signatures' pool | Learning Gen | Pm | Fat | Learning files' pool | Testing pool | Best fitness (change) | Mean fitness | Mean fitness (change) |
|---|---|---|---|---|---|---|---|---|
| Sig6 | 150 | 0.2 | 0.05 | 25% | 5% | 0 | 585.4639 | 0.01590 |
| Sig7 | 100 | 0.2 | 0.05 | 75% | 5% | 0 | 924.6069 | 0.02312 |
| Sig10 | 100 | 0.2 | 0.05 | 25% | 5% | 0 | 556.9083 | 0.02312 |
| Sig11 | 100 | 0.2 | 0.10 | 75% | 5% | 0 | 786.5623 | 0.02313 |
| Sig1 | 100 | 0.05 | 0.05 | 5% | 25% | 11 | 226.7783 | 0.11745 |
| Sig2 | 100 | 0.1 | 0.05 | 5% | 25% | 11 | 223.7168 | 0.11726 |
| Sig12 | 100 | 0.05 | 0.1 | 5% | 25% | 12 | 225.3997 | 0.11726 |
| Sig4 | 100 | 0.05 | 0.05 | 25% | 50% | 0 | 383.4732 | 0.23782 |
| Sig5 | 300 | 0.1 | 0.1 | 25% | 50% | 0 | 485.4011 | 0.08444 |
| Sig8 | 150 | 0.1 | 0.1 | 75% | 50% | 0 | 1167.4 | 0.16354 |
| Sig1 | 100 | 0.05 | 0.05 | 5% | 75% | 37 | 226.9967 | 0.33581 |
| Sig3 | 300 | 0.05 | 0.1 | 5% | 75% | 0 | 253.7394 | 0.11902 |
| Sig6 | 150 | 0.2 | 0.05 | 25% | 75% | 0 | 585.6786 | 0.23055 |
| Sig9 | 300 | 0.05 | 0.1 | 75% | 75% | 0 | 1167.5 | 0.11903 |
| Sig12 | 100 | 0.05 | 0.1 | 5% | 75% | 38 | 225.6177 | 0.33526 |
| Sig2 | 100 | 0.1 | 0.05 | 5% | 100% | 61 | 224.057 | 0.45748 |
| Sig4 | 100 | 0.05 | 0.05 | 25% | 100% | 0 | 383.6928 | 0.45748 |
| Sig7 | 100 | 0.2 | 0.05 | 75% | 100% | 0 | 925.0413 | 0.45747 |
| Sig10 | 100 | 0.2 | 0.05 | 25% | 100% | 0 | 557.3427 | 0.45747 |
| Sig11 | 100 | 0.2 | 0.1 | 75% | 100% | 0 | 786.9967 | 0.45748 |
| Sig12 | 100 | 0.05 | 0.1 | 5% | 100% | 60 | 225.7399 | 0.45747 |

4.3. The GA optimization phase

After executing the 4 runs, the Pm and Fat values listed in Table 10 were obtained. These values are used in the next process, the GA learning.

Table 10. The GA optimization results.

| Run | Pm | Fat | GA objective function value |
|---|---|---|---|
| 1 | 0.636 | 0.935 | 380.2162 |
| 2 | 0.96 | 1.0 | 2540.2172 |
| 3 | 0.65 | 0.96 | 1260.2085 |
| 4 | 0.914 | 0.935 | 3650.2226 |

4.4. The GA learning phase

The results of the GA learning process, which included 4 runs, are summarized in Table 11. The table gives the Mean fitness that results from the chosen parameter values: GA Learning Gen, Pm, Fat, GA VDC Gen, and the GA learning pool. The Learning Gen has been set to 100 for all the learning cases. As mentioned previously, Pm and Fat control the hypermutation process. Their values were obtained from the GA optimization and are high, which increases the hypermutation rate. Recall that the GA VDC Gen affected the choice of these values for Pm and Fat. The GA learning pools are 5%, 25%, 50% and 75%, the same pools used in the previous step (GA optimization). Whenever the number of infected files in the learning files' pool increases, the Mean fitness increases. In the results of these 4 runs the number of signatures is 1200 for all of them, because the GA Learning Gen used is 100.
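The link between Learning Gen and the pool size can be checked with a short calculation. The sketch below assumes that at most 11 signatures are added per generation (Section 2.1) and that the initial signatures' pool holds 100 signatures; the initial size of 100 is not stated in the text and is only inferred here from the pool sizes reported in Table 7.

```python
NEW_PER_GENERATION = 11      # limit on added signatures per generation (Section 2.1)
ASSUMED_INITIAL_POOL = 100   # assumption inferred from Table 7, not stated in the paper


def pool_size_upper_bound(generations, initial=ASSUMED_INITIAL_POOL):
    """Largest possible pool size after the given number of learning generations."""
    return initial + NEW_PER_GENERATION * generations


for gen in (100, 150, 300):
    print(gen, pool_size_upper_bound(gen))
```

Under these assumptions the bound reproduces the 1200, 1750 and 3400 signatures reported for Learning Gen = 100, 150 and 300. Sig1 (1198 signatures) sits slightly below the bound, which would be consistent with fewer than 11 signatures qualifying in some generation.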

Table 11. The summary of the GA learning results.

| GA signatures' pool | GA learning pool | Pm | Fat | GA VDC Gen | Mean fitness |
|---|---|---|---|---|---|
| SigGA1 | 5% | 0.636 | 0.935 | 10 | 445.2282 |
| SigGA2 | 50% | 0.96 | 1.0 | 20 | 2595.2 |
| SigGA3 | 25% | 0.65 | 0.96 | 30 | 1205.2 |
| SigGA4 | 75% | 0.914 | 0.935 | 10 | 3065.2 |

The GA learning process was performed in order to produce the signatures' pools (SigGA1 ... SigGA4) for the GA testing process.

4.5. The GA testing phase

The results of the GA testing process, which includes 10 runs, are summarized in Table 12 and discussed below.

Table 12. The GA testing results.

| Signature pool | GA Learning Gen | Pm | Fat | GA learning pool | GA VDC Gen | GA testing pool | Best fitness (change) | Mean fitness | Mean fitness (change) |
|---|---|---|---|---|---|---|---|---|---|
| SigGA4 | 100 | 0.914 | 0.935 | 75% | 10 | 5% | 0 | 2752.2 | 0.02312 |
| SigGA2 | 100 | 0.96 | 1.0 | 50% | 20 | 25% | 0 | 2329.3 | 0.11725 |
| SigGA1 | 100 | 0.636 | 0.935 | 5% | 10 | 25% | 0 | 394.0925 | 0.11726 |
| SigGA4 | 100 | 0.914 | 0.935 | 75% | 10 | 50% | 0 | 2752.4 | 0.23782 |
| SigGA1 | 100 | 0.636 | 0.935 | 5% | 10 | 75% | 0 | 394.3105 | 0.33526 |
| SigGA1 | 100 | 0.636 | 0.935 | 5% | 10 | 100% | 0 | 394.4327 | 0.45747 |
| SigGA2 | 100 | 0.96 | 1.0 | 50% | 20 | 100% | 0 | 2329.6 | 0.45747 |
| SigGA3 | 100 | 0.65 | 0.96 | 25% | 30 | 100% | 0 | 1078.5 | 0.45747 |
| SigGA4 | 100 | 0.914 | 0.935 | 75% | 10 | 100% | 0 | 2752.6 | 0.45747 |

• Best fitness: in all cases, the Best fitness does not change. This is because the Pm and Fat values resulting from the GA optimization are high, which has led to high values of the Mean fitness and the Best fitness. Consequently, the Best fitness remains the same.
• The Mean fitness and the change in the Mean fitness: both change because the following 5 variables have changed: Pm, Fat, GA learning pool, GA VDC Gen, and GA Testing Pool. Since GA Learning Gen = 100 for all runs, it is not considered an influencing variable. For the GA Testing Pool variable, when the number of infected files inside the pool increases, the Mean fitness increases by a small value. Notably, the GA Testing Pool is the main variable that affects the change in the Mean fitness: when the infected files in the GA Testing Pool increase, the Mean fitness increases, because each single detection increases the fitness by δ. As Table 12 shows, even though the values of the GA learning pool, Pm and Fat vary, the change in the Mean fitness stays the same while the GA Testing Pool equals 100%. The Mean fitness value itself, however, keeps changing. The detection rate is the same, with an average of 94.4%.
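The average detection rate of 94.4% can be reproduced from the per-pool rates of Table 8. The sketch below assumes the reported figure is the unweighted mean over the six testing pools (rather than over the 24 individual runs); the variable names are illustrative.

```python
# Detection rate (%) per testing pool, as listed in Table 8.
detection_rates = {
    "0%": 100.0,
    "5%": 93.3,
    "25%": 94.7,
    "50%": 96.0,
    "75%": 90.2,
    "100%": 92.3,
}

# Unweighted mean over the six pools (assumed averaging scheme).
average_rate = sum(detection_rates.values()) / len(detection_rates)
print(round(average_rate, 1))
```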

Table 13. The detection speed of standard VDC versus the GA testing results (times in seconds).

| Testing pool | Signature pool: time | Avg. time | GA signature pool: time | GA avg. time | Deviation (Avg. time − GA avg. time) |
|---|---|---|---|---|---|
| 0% | Sig1: 14208.647688; Sig5: 35171.766852; Sig8: 18197.056039 | 22525.82353 | SigGA2: 13280.983914 | 13280.983914 | 9244.839616 |
| 5% | Sig6: 9823.55273; Sig7: 8345.796465; Sig10: 8407.283349; Sig11: 11459.200907 | 9508.958363 | SigGA4: 8934.375503 | 8934.375503 | 574.582860 |
| 25% | Sig1: 10311.581502; Sig2: 10743.821194; Sig12: 10459.229993 | 10504.877563 | SigGA1: 6593.605429; SigGA2: 6534.975443 | 6564.290436 | 3940.587127 |
| 50% | Sig4: 7103.766218; Sig5: 22787.765619; Sig8: 10269.833760 | 13387.121866 | SigGA4: 7368.274957 | 7368.274957 | 6018.846909 |
| 75% | Sig1: 3165.084227; Sig3: 8762.238266; Sig6: 4431.984430; Sig9: 8729.633644; Sig12: 3186.235592 | 5655.035232 | SigGA1: 3145.817455 | 3145.817455 | 2509.217777 |
| 100% | Sig2: 6.368205; Sig4: 6.378531; Sig7: 6.359734; Sig10: 6.388507; Sig11: 6.477537; Sig12: 6.214122 | 6.364439 | SigGA1: 6.167821; SigGA2: 6.093931; SigGA3: 6.143650; SigGA4: 6.137986 | 6.135847 | 0.228592 |
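The deviation column of Table 13 is the difference between the two averages, and it can be recomputed as a check. The sketch below copies the averages and listed deviations from the table; the relative reduction it prints is a derived figure that does not appear in the paper.

```python
# (Avg. time, GA avg. time, listed deviation) in seconds, per testing pool (Table 13).
table13 = {
    "0%": (22525.82353, 13280.983914, 9244.839616),
    "5%": (9508.958363, 8934.375503, 574.582860),
    "25%": (10504.877563, 6564.290436, 3940.587127),
    "50%": (13387.121866, 7368.274957, 6018.846909),
    "75%": (5655.035232, 3145.817455, 2509.217777),
    "100%": (6.364439, 6.135847, 0.228592),
}


def deviation(avg_time, ga_avg_time):
    """Deviation as defined in Table 13: Avg. time minus GA avg. time."""
    return avg_time - ga_avg_time


for pool, (avg, ga_avg, listed) in table13.items():
    d = deviation(avg, ga_avg)
    # Relative reduction is derived here for illustration only.
    print(f"{pool}: deviation {d:.6f} s, relative reduction {100 * d / avg:.1f}%")
```

A positive deviation in every row means the GA-optimized pools were faster on average for each testing pool.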

It is worth mentioning that, with regard to detection speed, it takes longer to detect viruses with the standard VDC algorithm than with the optimized VDC, as shown in Table 13. The use of the GA enhances the learning process by improving the properties of the resulting signatures' pools, in terms of producing a higher Mean fitness for these signatures. When the testing process is executed, detection is faster, as shown in Figs. 7 and 8. This is because, before the detection process is applied in testing, the signatures are sorted in descending order of fitness, which leads to faster detection. During the comparison, neither the detection rate nor the false positive rate changed.

Fig. 7. Comparison VDC with GA-VDC (Mean fitness).

Fig. 8. Comparison VDC with GA-VDC (detection speed).

5. Conclusions

As a result of the previous simulations, the following points can be concluded:

1. In the VDC algorithm, if any of the following increases (the number of generations, the number of infected files inside the files' pool, or the hypermutation rate during the learning phase), the fitness of the signatures increases as well.
2. Employing the GA to optimize the VDC algorithm improves the detection speed of the VDC algorithm by increasing the Mean fitness, which makes the algorithm faster in detecting viruses.
3. The average detection rate of 94.4% and the false positive rate of 0% are considered good, and they do not change with the use of the GA; on the contrary, they are confirmed.
4. The results of the paper demonstrate that the VDC algorithm can be used to detect viruses.

This paper agrees with the studies of [10–12,1,14] in concentrating on the AIS for virus detection, but differs from them in that they applied the Negative Selection Algorithm, whereas this work employed the clonal selection algorithm. Note that Ref. [6] reported a detection rate of 97% and a false positive rate of 3.6%, and also included a list of detection rates for antivirus companies: Eset NOD32 = 94%, Kaspersky = 88%, Panda 2008 = 67%, KV 2008 = 55% and Kingsoft = 44%. Although this research is applied to a different set of data, its results (a detection rate of 94.4% and a false positive rate of 0%) are considered good and acceptable. After concluding this work, and based on the results obtained, the following points are recommended:

1. In the VDC algorithm, the initial fitness values of the signatures are random numbers. It is suggested to use Data Mining to categorize the viruses according to how widespread they are.
2. The VDC algorithm employed exact matching between signatures and files. It is recommended that different matching methods be applied, such as Euclidean Distance, Manhattan Distance or Hamming Distance.
3. The Negative Selection Algorithm should be added to the VDC algorithm, to make it possible to distinguish between Self and Nonself with regard to the existing files and, later, the detected infected files.
4. Different methods of mutation, such as Gauss Mutation, Cauchy Mutation or Mean Mutation, should be used.

References

[1] K. Loukhaoukha, J.Y. Chouinard, M.H. Taieb, Optimal image watermarking algorithm based on LWT-SVD via multi-objective ant colony optimization, Journal of Information Hiding and Multimedia Signal Processing 2 (October (4)) (2011) 303–319.
[2] H.C. Huang, Y.H. Chen, Genetic fingerprinting for copyright protection of multicast media, Soft Computing 13 (February (4)) (2009) 383–391.
[3] P. Puranik, P. Bajaj, A. Abraham, P. Palsodkar, A. Deshmukh, Human perception-based color image segmentation using comprehensive learning particle swarm optimization, Journal of Information Hiding and Multimedia Signal Processing 2 (July (3)) (2011) 227–235.
[4] F.C. Chang, H.C. Huang, A refactoring method for cache-efficient swarm intelligence algorithms, Information Sciences, in press, http://dx.doi.org/10.1016/j.ins.2010.02.025
[5] Secure Computing Corporation, Virus Signature Solutions from Secure Computing, http://www.securecomputing.com/, 2008.
[6] M. Unterleitner, Computer Immune System for Intrusion and Virus Detection: Adaptive Detection Mechanisms and their Implementation, VMD Verlag Dr. Muller Aktiengesellschaft & Co., Germany, 2008, http://www.amazon.com/Computer-Immune-System-Intrusion-Detection/dp/3836461080
[7] Z. Yu, L. Tao, Q. Renchao, Unknown computer virus detection inspired by immunity, Journal of Frontiers of Computer Science and Technology (2009), http://dx.doi.org/10.3778/j.issn.1673-9418.2009.02.004, ISSN 1673-9418/2003/03 (02)-0154-08, http://www.ceaj.org/wes/qikan/manage/wenzhang/T0811060.pdf
[8] L. Castro, F. Zuben, Learning and optimization using the clonal selection principle, IEEE Transactions on Evolutionary Computation, Special Issue on Artificial Immune Systems 6 (3) (2002) 239–251, ftp://ftp.dca.fee.unicamp.br/pub/docs/vonzuben/lnunes/ieee tec01.pdf.
[9] M. Creeger, The battle is bigger than most of us realize: CTO Roundtable: Malware Defense, Article development led by queue.acm.org, Communications of the ACM 53 (April (4)), http://portal.acm.org/citation.cfm?id=1721654&coll=DL&dl=GUIDE&CFID=10533433&CFTOKEN=20057273, 2010.
[10] K. Edge, G. Lamont, R. Raines, A Retrovirus Inspired Algorithm for Virus Detection & Optimization, Wright-Patterson AFB, Dayton, OH USA 45433, GECCO'06, Washington, USA, ACM 1-59593-186-4/06/007, http://portal.acm.org/citation.cfm?id=1144016, 2006.
[11] S. Forrest, A. Perelson, L. Allen, R. Cherukuri, Self-nonself discrimination in a computer, in: Proceedings of IEEE Symposium on Research in Security and Privacy, IEEE Computer Society Press, Los Alamitos, CA, 1994, pp. 360–365, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.3258
[12] J. Kephart, G. Sorkin, M. Swimmer, S. White, Blueprint for a computer immune system, in: This Paper was Originally Presented at the Virus Bulletin International Conference in San Francisco, California, USA, IBM Thomas J. Watson Research Center, 1997, http://www.research.ibm.com/antivirus/SciPapers/Kephart/VB97/
[13] U. Aickelin, Artificial Immune Systems (AIS) – A New Paradigm for Heuristic Decision Making, The University of Nottingham, Nottingham, NG8 1BB, United Kingdom, 2004, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.60.5923
[14] V.X. Heaven, http://vx.netlux.org/, 2010.
