Virus detection using clonal selection algorithm with Genetic Algorithm (VDC algorithm)


Applied Soft Computing 13 (2013) 239–246


Applied Soft Computing journal homepage: www.elsevier.com/locate/asoc

Suha Afaneh a, Raed Abu Zitar b,∗, Alaa Al-Hamami c

a Department of Computer Sciences, Isra University, Amman, Jordan
b College of Information Technology, American University of Madaba, Jordan
c Department of Computer Sciences, Amman Arab University, Amman, Jordan

Article info

Article history: Received 12 January 2012; Received in revised form 18 July 2012; Accepted 7 August 2012; Available online 21 August 2012

Keywords: Artificial immune system; Virus detection algorithm; Clonal selection

Abstract

This paper presents a novel approach to computer virus detection based on modeling the structures and dynamics of a real-life paradigm that exists in the bodies of all living creatures. It aims to develop an algorithm based on the concept of the artificial immune system (AIS) for the purpose of detecting viruses. The algorithm is called the Virus Detection Clonal algorithm (VDC), and it is derived from the clonal selection algorithm. The VDC algorithm consists of three basic steps: cloning, hypermutation and stochastic reselection. The developed VDC algorithm is then validated in two phases: learning and testing. Two main parameters are determined: one sets the number of signatures per clone (Fat), while the other defines the hypermutation probability (Pm). The Genetic Algorithm (GA) is then used as a tool to improve the developed algorithm by searching for values of the main parameters (Fat and Pm) that produce better results. The results show that the virus detection rate of the developed algorithm is 94.4%, whereas the false positive rate is 0%. These percentages indicate that the VDC algorithm is sufficient and usable in this field. Moreover, employing the GA to optimize the VDC algorithm improves the detection speed of the algorithm. © 2012 Elsevier B.V. All rights reserved.

1. Introduction

Artificial intelligence techniques are now used in all areas of computer security [1]. Techniques such as swarm intelligence, Genetic Algorithms, and ant colony optimization have applications in pattern classification and in image and signal processing [2–4]. The artificial immune system (AIS) is similar to those paradigms in structure and mechanism; however, it is more recent and has not yet matured. The AIS has been applied in different fields, most notably computer virus detection. Protection against viruses is becoming more difficult by the day, and viruses are a threat to everyone who uses computers. Viruses are growing more sophisticated over time, and their signatures change continuously [5,6], which makes the task of anti-virus software more difficult [7]. The AIS comprises several concepts: clonal selection, negative selection and immune network theory. This paper proposes the VDC algorithm for detecting viruses, which is inspired by the clonal selection algorithm and, more precisely, by CLONALG [8].

∗ Corresponding author. E-mail address: [email protected] (R.A. Zitar). 1568-4946/$ – see front matter © 2012 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.asoc.2012.08.034

Studies have shown that 25% of computer users are affected by some sort of malware, while the commercial PC sector suffers about half of this percentage [9]. The simplest and most common method of protecting networks from viral attacks is signature technology. This paper contributes a Virus Detection Clonal (VDC) algorithm and then optimizes its parameters using the GA. Although the virus problem is an old one, the VDC approach is new, and the problem keeps growing because it affects every individual who uses computers. The Negative Selection Algorithm (the self-non-self algorithm) has been used for virus detection [10–12,1]. After a wide web search and a review of a wide range of specialized journals, however, no earlier use of the clonal selection algorithm for this type of application was found, so applying it here is a new contribution.

The clonal selection principle describes the immune response to an antigenic stimulus: only the cells that recognize the antigen proliferate, and they are selected over those that do not. When an antibody strongly matches the antigen, the corresponding B-cells are stimulated to produce clones of themselves, and these generated B-cells, which are copies of their parents, are mutated [13]. In this paper the antigens represent the computer viruses in the infected files and the antibodies represent the signatures. The signatures with high matching values (fitness) are selected for the cloning, hypermutation and reselection processes: cloning produces copies of the signatures with the best fitness, which are then mutated so that they can detect viruses that differ in some characters (genes), even if these viruses have not attacked before (just like the adaptive defense of the immune system). In this research, a stochastic reselection is added to the clonal selection algorithm to guarantee that the best mutated signatures are chosen.

2. The proposed VDC algorithm

The research consists of two stages, both carried out in MATLAB 7.1: first, the design and implementation of the Virus Detection Clonal (VDC) algorithm; second, the validation of the VDC algorithm.

2.1. The design and implementation of the VDC algorithm

Fig. 1 illustrates the flowchart of the VDC algorithm, and Fig. 2 gives its pseudo code. After the signatures' pool and the files' pool are loaded, the main loop runs for the number of generations given by the Learning Gen parameter, whose values are listed in Table 2. Each iteration goes through the 3 main steps: Cloning (making copies of the signatures with the highest fitness), Hypermutation (making random changes to the virus signatures with higher fitness) and Reselection (stochastically choosing the next generation of signatures according to their fitness). The fitness is calculated according to Eq. (1) below.

$$
f(x) = f_0(x) + \delta \sum_{i=1}^{z} \mathrm{match\_function}(x, y_i) + \sum_{j=1}^{t} D_j \tag{1}
$$

where:
f_0(x): the initial fitness for signature x; it is a random number determined in the initialization of the algorithm when loading the signatures' pool.
y_i: the ith file.
δ: a multiplying factor with a value of 10.
z: the number of all files in the files' pool.
D_j: if the mutation is performed on signature j, then D_j is 0, 1 or −1, chosen uniformly at random.
t: the number of signatures in the original signatures' pool (the number of signatures in each clone).

$$
\mathrm{match\_function}(x, y_i) =
\begin{cases}
0, & \text{no match} \\
1, & \text{match found}
\end{cases} \tag{2}
$$

Fig. 1. The VDC algorithm flowchart.

Fig. 2. The VDC algorithm pseudo code.

Fig. 3. VDC algorithm: cloning.

Fig. 4. VDC algorithm: hypermutation.
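To make Eqs. (1) and (2) concrete, the following is a minimal Python sketch rather than the authors' MATLAB code. The function and parameter names are hypothetical, the exact match is assumed to be a plain substring test, and the mutation term is read literally from Eq. (1) as a sum of the D_j values supplied by the caller.

```python
DELTA = 10  # multiplying factor from Eq. (1)


def match_function(signature: str, file_content: str) -> int:
    """Eq. (2): 1 if a match is found, 0 otherwise.

    Assumption: an exact match is a substring test on the file content.
    """
    return 1 if signature in file_content else 0


def fitness(signature, files, initial_fitness, mutation_terms):
    """Eq. (1): f0(x) + DELTA * (number of matching files) + sum of D_j."""
    matches = sum(match_function(signature, content) for content in files)
    return initial_fitness + DELTA * matches + sum(mutation_terms)


# Hypothetical toy pool: two files contain the signature, one does not.
files = ["xxABCDxx", "clean file", "yyABCDyy"]
print(fitness("ABCD", files, initial_fitness=0.5, mutation_terms=[1, 0, -1]))
```

With these toy inputs the two matches contribute 20, the mutation terms cancel, and the initial fitness of 0.5 is carried through unchanged.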

Figs. 3–6 display the VDC algorithm in a simple way. In Fig. 3, the cloning step takes half the size of the signatures' pool and makes copies of the signatures with the highest fitness, so that T holds the copies of the signatures, F the copies of the fitness values, and V the copies of the viruses' names. Fig. 4 demonstrates the hypermutation step, in which some of the copied virus signatures are mutated according to the Pm value. The virus name in V of each mutated signature in T is prefixed with "mut" to distinguish it from the non-mutated signatures. The values in D can be −1, 0 or 1; they are added to the fitness of the mutated signatures, so the fitness of a mutated signature can be worse than, the same as, or better than its fitness before mutation, which reflects the randomness of the process. That is, if D = −1 the fitness decreases by 1 (worse), if D = 0 it remains the same, and if D = 1 it increases by 1 (better).

The mutation is performed as follows: one character in the signature is replaced with a random character whose ASCII code is between 48 and 122, and the replacement position is also chosen randomly. For example, the signature ‘8e5ef1aec91259d70c5e62cdfe42c36e ddc8cc9cbe45313d0’ can become ‘8e5ef1aec91259d70c5e62kdfe42c36e ddc8cc9cbe45313d0’ after mutation; the c is replaced by k.

The reselection step is illustrated in Figs. 5 and 6. In Fig. 5, a loop runs over each file in the files' pool, and the fitness function is calculated as the initial fitness plus a count of the matches between the signatures in T and the files in the files' pool.
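The cloning and hypermutation steps described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the published implementation: the function names are hypothetical, the Fat parameter (signatures per clone) is not modeled, and a mutation may happen to pick the same character again and leave the signature unchanged.

```python
import random


def clone_best(signatures, fitness, names):
    """Copy the fitter half of the pool into T (signatures), F (fitness), V (names)."""
    order = sorted(range(len(signatures)), key=lambda i: fitness[i], reverse=True)
    top = order[: max(1, len(signatures) // 2)]
    T = [signatures[i] for i in top]
    F = [fitness[i] for i in top]
    V = [names[i] for i in top]
    return T, F, V


def mutate(signature, rng):
    """Replace one randomly chosen character with a random one (ASCII 48..122)."""
    pos = rng.randrange(len(signature))
    return signature[:pos] + chr(rng.randint(48, 122)) + signature[pos + 1:]


def hypermutate(T, F, V, pm, rng):
    """Mutate each clone with probability pm and add D in {-1, 0, 1} to its fitness."""
    T, F, V = list(T), list(F), list(V)
    for k in range(len(T)):
        if rng.random() < pm:
            T[k] = mutate(T[k], rng)
            F[k] += rng.choice((-1, 0, 1))
            V[k] = "mut" + V[k]  # mark the mutated signature's virus name
    return T, F, V


# Hypothetical toy pool of four short signatures.
rng = random.Random(0)
T, F, V = clone_best(["8e5ef1ae", "c91259d7", "0c5e62cd", "fe42c36e"],
                     [3.0, 9.0, 1.0, 7.0], ["v1", "v2", "v3", "v4"])
print(hypermutate(T, F, V, pm=1.0, rng=rng))
```

Passing an explicit `random.Random` instance keeps a run reproducible, which is convenient when comparing different Pm values on the same pool.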

Fig. 5. VDC algorithm: reselection1.


Fig. 6. VDC algorithm: reselection2.
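As a rough illustration of the reselection stage shown in Figs. 5 and 6, the sketch below first adds δ = 10 to a clone's fitness for every file it matches and drops matched files from the pool, then keeps at most 11 new signatures whose fitness, normalized by the clone maximum, reaches a random threshold R in [0.6, 1]. The function names are hypothetical, the match is assumed to be a substring test, and the return to an earlier step of the main loop is left to the caller.

```python
import random

DELTA = 10    # weight of one detection
MAX_NEW = 11  # cap on signatures appended per generation


def score_matches(clones, clone_fitness, files):
    """Add DELTA per matched file; matched (infected) files leave the pool."""
    fitness = list(clone_fitness)
    remaining = []
    for content in files:
        hit = False
        for i, sig in enumerate(clones):
            if sig in content:  # assumed exact (substring) match
                fitness[i] += DELTA
                hit = True
        if not hit:
            remaining.append(content)
    return fitness, remaining


def stochastic_reselect(clones, fitness, original_pool, rng):
    """Return up to MAX_NEW (fitness, signature) pairs, best first."""
    if not fitness or max(fitness) <= 0:
        return []
    threshold = rng.uniform(0.6, 1.0)  # selection threshold R
    best = max(fitness)
    candidates = [(f, s) for s, f in zip(clones, fitness)
                  if f / best >= threshold and s not in original_pool]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates[:MAX_NEW]


# Hypothetical toy data.
fit, left = score_matches(["AB", "ZZ"], [1, 2], ["xxABxx", "clean"])
print(fit, left)
print(stochastic_reselect(["AB", "ZZ"], fit, {"ZZ"}, random.Random(0)))
```

Capping the additions at 11 per generation mirrors the stated goal of keeping the signatures' pool from growing too quickly.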

Each file in the files' pool is matched against all the signatures in T. If a match is found, the fitness value in F for that signature is increased by δ, where δ equals 10 in this algorithm. This gives the detection process a higher weight than the mutation process (mutation adds at most 1 to the fitness). The matched file is then eliminated from the files' pool, since it is infected, which also avoids redundancy. Fig. 6 shows how the signatures are selected stochastically according to their fitness. The stochastic selection process is performed in the following steps:

1. A random number (R), called the selection threshold, is created for each generation (iteration in Gen); its value is between 0.6 and 1. This is to make sure that the Best fitness is selected.
2. Each fitness value in the clone is divided by the maximum fitness in that clone.
3. If the value from step 2 is ≥ R, and the signature does not exist in the original signatures' pool (initially in step 2), then the fitness of this signature is appended to a temporary matrix.
4. The temporary matrix is sorted in descending order.
5. The best 11 new signatures are selected and added to the original signatures' pool. The number of appended signatures is limited to 11 in order to prevent the signatures' pool from growing too large. After that, the execution goes back to step 3.

Table 1 summarizes the main differences between CLONALG and the proposed VDC algorithm.

2.2. The validation of the VDC algorithm

The validation strategy includes two phases: learning and testing. The learning phase fills the signatures' pools with the new signatures produced by applying the VDC algorithm to the already known signatures (the original signatures gathered before mutation). To apply the VDC algorithm, files' pools are needed to complete the matching process between files and signatures. Six files' pools are prepared: in the first, all files are benign (clean); in the others, 5%, 25%, 50%, 75% and 100% of the files are infected. Only the files' pools with 5%, 25% and 75% infected files are used in the learning process, whereas all six files' pools are used in the testing process. Several parameters of the VDC algorithm are tuned in search of better performance. The tuned parameters are Learning Gen, Pm and Fat; they are chosen as examples, and tuning is not limited to them. The ranges of values for these parameters are very wide. The parameter values are listed in Table 2. The resulting signatures' pools are called Sig1, Sig2, ..., Sig12 and are used in the testing phase.

The testing phase starts after the learning phase finishes. Testing is concerned with calculating the fitness function, that is, searching for matches between the files' pool and the virus signatures' pool. Therefore, the testing algorithm contains no cloning, hypermutation or reselection steps. The signatures' pools obtained in Table 2 are used with the files' pools (0%, 5%, 25%, 50%, 75%, and 100%) in the testing process, as listed in Table 3. Note that in testing, Gen = 100 in all cases. Testing starts with the 0% files' pool, which contains 500 benign files, and the signatures' pools Sig1, Sig5 and Sig8 are used to check for false positives (benign files detected as infected). For the five remaining files' pools, testing is performed by running each files' pool with the corresponding signatures' pools in Table 3, and then:

• When testing files' pools with 5% and 75% infected files, 100 new files are added to the files' pool at testing iteration (testing Gen) 5.
• When testing files' pools with 25%, 50% and 100% infected files, 100 new files are added to the files' pool at testing iteration (testing Gen) 50.
Regardless of when the 100 new files are added to the files' pool, the algorithm must be able to detect the infected files. The 100 files include both benign and infected files. The infected files fall into three classes:

• Files with signatures that already exist in the original files' pool. These signatures are used in the learning phase.
• Files with signatures that do not exist in the original files' pool. These signatures are not used in the learning phase.
• Files that have signatures with mutations. These mutated signatures are obtained from the signatures' pools produced in the learning phase.

3. The optimized VDC based on Genetic Algorithm

The Genetic Algorithm (GA) employed in this paper is the Genetic Algorithm toolbox under MATLAB 7.1. The VDC algorithm is called to provide the fitness function for the GA. The VDC output is prefixed with a minus sign so that the GA's minimization maximizes it. The inputs are Pm and Fat, and the output is the Mean fitness. The purpose of applying the GA is to find the best values of Pm and Fat, and thereby to tune the algorithm for better results. There are three processes:

• GA optimization: finding the values of Pm and Fat by using the GA.
• GA learning: using the values that result from the GA optimization process to create the signatures' pools.
• GA testing: testing the resulting signatures' pools.
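The role of the minus sign can be illustrated with a small, self-contained Python sketch. It is not the MATLAB toolbox and not the authors' setup: `toy_mean_fitness` is a hypothetical stand-in for a full VDC learning run, and the GA below (truncation selection, blend crossover, Gaussian mutation) is a generic minimal variant chosen for brevity. The bounds are the ones stated in Section 3.1, and 10 generations matches the GA Gen value in Table 4.

```python
import random

LOWER = (0.01, 0.02)  # lower bounds for (Pm, Fat), as stated in Section 3.1
UPPER = (1.0, 1.0)    # upper bounds for (Pm, Fat)


def toy_mean_fitness(pm, fat):
    """Hypothetical stand-in for a VDC run; NOT the paper's fitness model."""
    return 1000.0 * pm * fat


def objective(params):
    """The GA minimizes, so the Mean fitness is negated to maximize it."""
    pm, fat = params
    return -toy_mean_fitness(pm, fat)


def clip(params):
    """Keep a candidate inside the bounds."""
    return tuple(min(max(v, lo), hi) for v, lo, hi in zip(params, LOWER, UPPER))


def run_ga(objective, pop_size=20, generations=10, seed=0):
    rng = random.Random(seed)
    pop = [tuple(rng.uniform(lo, hi) for lo, hi in zip(LOWER, UPPER))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=objective)        # best (lowest objective) first
        parents = pop[: pop_size // 2]  # truncation selection keeps the elite
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            w = rng.random()
            child = tuple(w * x + (1 - w) * y + rng.gauss(0, 0.05)
                          for x, y in zip(a, b))
            children.append(clip(child))
        pop = parents + children
    best = min(pop, key=objective)
    return best, -objective(best)


best_params, best_mean_fitness = run_ga(objective)
print(best_params, best_mean_fitness)
```

In a real run, each objective evaluation would execute the VDC learning loop for GA VDC Gen generations, so the population size and number of GA generations dominate the cost.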


Table 1
The differences between CLONALG and VDC algorithm.

| Category | CLONALG | VDC algorithm |
|---|---|---|
| P | Patterns to be recognized | Files to be searched |
| M | Randomly initialized | Viruses' signatures from VXHeaven |
| Affinity | The match between elements in M and patterns in P | Fitness function values in Eq. (1) |
| Number of elements to be cloned | N | The half size of signatures' pool |
| The number of elements in each clone | Proportional to the elements affinity | Fixed and it is (Fat × N) for each clone |
| Mutation | Proportional to the elements affinity | Not proportional |
| Lowest elements in M | Replaced the lowest d | Instead of replacement, adding the best 11 elements (a) |

(a) The replacement is not an option due to the sensitivity of the application, as when dealing with viruses, even if the virus is not widespread, it is important for the algorithm to be able to detect it.

Table 2
The parameter values of the learning phase.

| Learning files' pool | Learning Gen | Pm | Fat | Signatures' pool |
|---|---|---|---|---|
| 5% infected files | 100 | 0.05 | 0.05 | Sig1 |
| 5% infected files | 100 | 0.1 | 0.05 | Sig2 |
| 5% infected files | 300 | 0.05 | 0.1 | Sig3 |
| 5% infected files | 100 | 0.05 | 0.1 | Sig12 |
| 25% infected files | 100 | 0.05 | 0.05 | Sig4 |
| 25% infected files | 300 | 0.1 | 0.1 | Sig5 |
| 25% infected files | 150 | 0.2 | 0.05 | Sig6 |
| 25% infected files | 100 | 0.2 | 0.05 | Sig10 |
| 75% infected files | 100 | 0.2 | 0.05 | Sig7 |
| 75% infected files | 150 | 0.1 | 0.1 | Sig8 |
| 75% infected files | 300 | 0.05 | 0.1 | Sig9 |
| 75% infected files | 100 | 0.2 | 0.1 | Sig11 |

Table 3
The parameter values of the testing phase.

| Files' pool | Signatures' pools |
|---|---|
| 0% | Sig1, Sig5, Sig8 |
| 5% | Sig6, Sig7, Sig10, Sig11 |
| 25% | Sig1, Sig2, Sig12 |
| 50% | Sig4, Sig5, Sig8 |
| 75% | Sig1, Sig3, Sig6, Sig9, Sig12 |
| 100% | Sig2, Sig4, Sig7, Sig10, Sig11, Sig12 |

3.1. The GA optimization

This process aims to find the values of Pm and Fat by executing the GA with the VDC algorithm. The lower bounds of Pm and Fat are [0.01, 0.02], respectively, and the upper bounds are [1, 1]. The files' pools used when running the GA contain 5%, 25%, 50% and 75% infected files, as listed in the GA pool column of Table 4. The number of generations for which the VDC algorithm is executed is given in the GA VDC Gen column.

Table 4
The GA optimization runs specifications.

| Run | GA pool | GA Gen | GA VDC Gen |
|---|---|---|---|
| 1 | 5% | 10 | 10 |
| 2 | 50% | 10 | 20 |
| 3 | 25% | 10 | 30 |
| 4 | 75% | 10 | 10 |

3.2. The GA learning

The GA learning process creates the signatures' pools using the Pm and Fat values that result from the GA optimization. For each GA optimization run there is a GA learning run, which gives 4 learning runs, as listed in Table 5.

Table 5
The GA learning runs specifications.

| Run | GA signatures' pool | GA learning pool | Pm | Fat | GA VDC Gen |
|---|---|---|---|---|---|
| 1 | SigGA1 | 5% | 0.636 | 0.935 | 10 |
| 2 | SigGA2 | 50% | 0.96 | 1.0 | 20 |
| 3 | SigGA3 | 25% | 0.65 | 0.96 | 30 |
| 4 | SigGA4 | 75% | 0.914 | 0.935 | 10 |

3.3. The GA testing

The GA testing checks the signatures' pools that result from the GA learning process; the number of runs is 10, according to Table 6.

Table 6
The GA testing runs specifications.

| GA testing pool | GA signatures' pool |
|---|---|
| 0% | SigGA2 |
| 5% | SigGA4 |
| 25% | SigGA1, SigGA2 |
| 50% | SigGA4 |
| 75% | SigGA1 |
| 100% | SigGA1, SigGA2, SigGA3, SigGA4 |

4. Discussion

The discussion covers the results of the learning phase and the testing phase of the VDC algorithm, in addition to the results of the optimized VDC algorithm based on the GA.

4.1. The VDC learning phase

The results of the 12 learning runs are summarized in Table 7. As the table shows, the Mean fitness values and the number of signatures change with the values of the variables (Learning Gen, Pm, Fat and learning pool).

4.2. The VDC testing phase

The results of the 24 testing runs are summarized in Tables 8 and 9.


Table 7
The summary of the learning results.

| Learning files' pool | Learning Gen | Pm | Fat | Signatures' pool | Mean fitness | No. of signatures |
|---|---|---|---|---|---|---|
| 5% infected files | 100 | 0.05 | 0.05 | Sig1 | 247.1486 | 1198 |
| 5% infected files | 100 | 0.1 | 0.05 | Sig2 | 242.2893 | 1200 |
| 5% infected files | 300 | 0.05 | 0.1 | Sig3 | 275.2660 | 3400 |
| 5% infected files | 100 | 0.05 | 0.1 | Sig12 | 243.5857 | 1200 |
| 25% infected files | 100 | 0.05 | 0.05 | Sig4 | 449.6050 | 1200 |
| 25% infected files | 300 | 0.1 | 0.1 | Sig5 | 525.1859 | 3400 |
| 25% infected files | 150 | 0.2 | 0.05 | Sig6 | 642.7941 | 1750 |
| 25% infected files | 100 | 0.2 | 0.05 | Sig10 | 628.7423 | 1200 |
| 75% infected files | 100 | 0.2 | 0.05 | Sig7 | 838.7423 | 1200 |
| 75% infected files | 150 | 0.1 | 0.1 | Sig8 | 1265.9 | 1750 |
| 75% infected files | 300 | 0.05 | 0.1 | Sig9 | 1243.0000 | 3400 |
| 75% infected files | 100 | 0.2 | 0.1 | Sig11 | 871.7612 | 1200 |

Table 8 shows the detection rate for the 24 testing runs. For the pool with 0% infected files (all files are benign), the detection rate is 100% because no file was detected as infected; this is the false positive test, and the outcome is considered a good result. Table 9 shows the ΔBest fitness (the change in the Best fitness), the Mean fitness and the ΔMean fitness (the change in the Mean fitness).

• ΔBest fitness: in some testing cases the Best fitness changes, while in others it does not. The change in the Best fitness depends on the repetition of the signatures that have the higher fitness in the initial signatures' pool; this repetition increases the value of the Best fitness.
• The Mean fitness and the ΔMean fitness: Table 9 shows the change in the Mean fitness in all testing cases. Five variables affect the Mean fitness and the ΔMean fitness: Testing Pool, Pm, Fat, Learning Gen and learning pool.

Table 8
The detection rate of the testing results.

| Testing pool | Signatures' pools | Detection rate |
|---|---|---|
| 0% | Sig1, Sig5, Sig8 | 100% |
| 5% | Sig6, Sig7, Sig10, Sig11 | 93.3% |
| 25% | Sig1, Sig2, Sig12 | 94.7% |
| 50% | Sig4, Sig5, Sig8 | 96% |
| 75% | Sig1, Sig3, Sig6, Sig9, Sig12 | 90.2% |
| 100% | Sig2, Sig4, Sig7, Sig10, Sig11, Sig12 | 92.3% |
| The average of detection rate | | 94.4% |

Table 9
The testing results.

| Signatures' pool | Learning Gen | Pm | Fat | Learning files' pool | Testing pool | ΔBest fitness | Mean fitness | ΔMean fitness |
|---|---|---|---|---|---|---|---|---|
| Sig6 | 150 | 0.2 | 0.05 | 25% | 5% | 0 | 585.4639 | 0.01590 |
| Sig7 | 100 | 0.2 | 0.05 | 75% | 5% | 0 | 924.6069 | 0.02312 |
| Sig10 | 100 | 0.2 | 0.05 | 25% | 5% | 0 | 556.9083 | 0.02312 |
| Sig11 | 100 | 0.2 | 0.10 | 75% | 5% | 0 | 786.5623 | 0.02313 |
| Sig1 | 100 | 0.05 | 0.05 | 5% | 25% | 11 | 226.7783 | 0.11745 |
| Sig2 | 100 | 0.1 | 0.05 | 5% | 25% | 11 | 223.7168 | 0.11726 |
| Sig12 | 100 | 0.05 | 0.1 | 5% | 25% | 12 | 225.3997 | 0.11726 |
| Sig4 | 100 | 0.05 | 0.05 | 25% | 50% | 0 | 383.4732 | 0.23782 |
| Sig5 | 300 | 0.1 | 0.1 | 25% | 50% | 0 | 485.4011 | 0.08444 |
| Sig8 | 150 | 0.1 | 0.1 | 75% | 50% | 0 | 1167.4 | 0.16354 |
| Sig1 | 100 | 0.05 | 0.05 | 5% | 75% | 37 | 226.9967 | 0.33581 |
| Sig3 | 300 | 0.05 | 0.1 | 5% | 75% | 0 | 253.7394 | 0.11902 |
| Sig6 | 150 | 0.2 | 0.05 | 25% | 75% | 0 | 585.6786 | 0.23055 |
| Sig9 | 300 | 0.05 | 0.1 | 75% | 75% | 0 | 1167.5 | 0.11903 |
| Sig12 | 100 | 0.05 | 0.1 | 5% | 75% | 38 | 225.6177 | 0.33526 |
| Sig2 | 100 | 0.1 | 0.05 | 5% | 100% | 61 | 224.057 | 0.45748 |
| Sig4 | 100 | 0.05 | 0.05 | 25% | 100% | 0 | 383.6928 | 0.45748 |
| Sig7 | 100 | 0.2 | 0.05 | 75% | 100% | 0 | 925.0413 | 0.45747 |
| Sig10 | 100 | 0.2 | 0.05 | 25% | 100% | 0 | 557.3427 | 0.45747 |
| Sig11 | 100 | 0.2 | 0.1 | 75% | 100% | 0 | 786.9967 | 0.45748 |
| Sig12 | 100 | 0.05 | 0.1 | 5% | 100% | 60 | 225.7399 | 0.45747 |

4.3. The GA optimization phase

After executing the 4 runs, the Pm and Fat values listed in Table 10 were obtained; these values are used in the next process, the GA learning.

Table 10
The GA optimization results.

| Run | Pm | Fat | GA objective function value |
|---|---|---|---|
| 1 | 0.636 | 0.935 | 380.2162 |
| 2 | 0.96 | 1.0 | 2540.2172 |
| 3 | 0.65 | 0.96 | 1260.2085 |
| 4 | 0.914 | 0.935 | 3650.2226 |

4.4. The GA learning phase

The results of the GA learning process, which included 4 runs, are recapitulated in Table 11. The table gives the Mean fitness that results from the chosen parameter values: GA Learning Gen, Pm, Fat, GA VDC Gen, and the GA learning pool. The Learning Gen is set to 100 for all the learning cases. As mentioned previously, Pm and Fat control the hypermutation process. Their values come from the GA optimization, and they are high, which increases the hypermutation rate. Recall that the GA VDC Gen affected the choice of these values for Pm and Fat. The values of the GA learning pool are 5%, 25%, 50% and 75%, the same as those used for the GA pool in the previous step (GA optimization). As the number of infected files in the learning files' pool increases, the Mean fitness increases. In all 4 runs the number of signatures is 1200; the reason is that the GA Learning Gen is 100.


The GA learning process has been performed in order to produce the signatures' pools (SigGA1, ..., SigGA4) for the GA testing process.

Table 11
The summary of the GA learning results.

| GA signatures' pool | GA learning pool | Pm | Fat | GA VDC Gen | Mean fitness |
|---|---|---|---|---|---|
| SigGA1 | 5% | 0.636 | 0.935 | 10 | 445.2282 |
| SigGA2 | 50% | 0.96 | 1.0 | 20 | 2595.2 |
| SigGA3 | 25% | 0.65 | 0.96 | 30 | 1205.2 |
| SigGA4 | 75% | 0.914 | 0.935 | 10 | 3065.2 |

4.5. The GA testing phase

The results of the GA testing process, which includes 10 runs, are summed up in Table 12 and are discussed with regard to:

• ΔBest fitness: in all cases the Best fitness does not change. This is because the values of Pm and Fat that result from the GA optimization are high, which leads to high values of the Mean fitness and the Best fitness. Consequently, the Best fitness remains the same.
• The Mean fitness and the ΔMean fitness: the Mean fitness and the ΔMean fitness change because the following 5 variables change: Pm, Fat, GA learning pool, GA VDC Gen, and GA Testing Pool. Since GA Learning Gen = 100 for all runs, it is not considered an affecting variable. For the GA Testing Pool variable, when the number of infected files in the pool increases, the Mean fitness increases by a small value. Notably, the GA Testing Pool is the main variable that affects the ΔMean fitness: when the number of infected files in the GA Testing Pool increases, the ΔMean fitness increases, because each single detection increases the fitness by δ. As Table 12 shows, even though the values of the GA learning pool, Pm and Fat change while the GA Testing Pool equals 100%, the ΔMean fitness value does not change, whereas the Mean fitness value keeps changing. The detection rate is the same, with an average of 94.4%.

Table 12
The GA testing results.

| Signature pool | GA Learning Gen | Pm | Fat | GA learning pool | GA VDC Gen | GA testing pool | ΔBest fitness | Mean fitness | ΔMean fitness |
|---|---|---|---|---|---|---|---|---|---|
| SigGA4 | 100 | 0.914 | 0.935 | 75% | 10 | 5% | 0 | 2752.2 | 0.02312 |
| SigGA2 | 100 | 0.96 | 1.0 | 50% | 20 | 25% | 0 | 2329.3 | 0.11725 |
| SigGA1 | 100 | 0.636 | 0.935 | 5% | 10 | 25% | 0 | 394.0925 | 0.11726 |
| SigGA4 | 100 | 0.914 | 0.935 | 75% | 10 | 50% | 0 | 2752.4 | 0.23782 |
| SigGA1 | 100 | 0.636 | 0.935 | 5% | 10 | 75% | 0 | 394.3105 | 0.33526 |
| SigGA1 | 100 | 0.636 | 0.935 | 5% | 10 | 100% | 0 | 394.4327 | 0.45747 |
| SigGA2 | 100 | 0.96 | 1.0 | 50% | 20 | 100% | 0 | 2329.6 | 0.45747 |
| SigGA3 | 100 | 0.65 | 0.96 | 25% | 30 | 100% | 0 | 1078.5 | 0.45747 |
| SigGA4 | 100 | 0.914 | 0.935 | 75% | 10 | 100% | 0 | 2752.6 | 0.45747 |

Table 13
The detection speed of standard VDC versus the GA testing results.

| Testing pool | Signature pool | Time in seconds | Avg. time | GA signature pool | Time in seconds | GA Avg. time | The deviation (Avg. time − GA Avg. time) |
|---|---|---|---|---|---|---|---|
| 0% | Sig1 | 14,208.647688 | 22,525.82353 | SigGA2 | 13,280.983914 | 13,280.983914 | 9244.839616 |
| | Sig5 | 35,171.766852 | | | | | |
| | Sig8 | 18,197.056039 | | | | | |
| 5% | Sig6 | 9823.55273 | 9508.958363 | SigGA4 | 8934.375503 | 8934.375503 | 574.582860 |
| | Sig7 | 8345.796465 | | | | | |
| | Sig10 | 8407.283349 | | | | | |
| | Sig11 | 11,459.200907 | | | | | |
| 25% | Sig1 | 10,311.581502 | 10,504.877563 | SigGA1 | 6593.605429 | 6564.290436 | 3940.587127 |
| | Sig2 | 10,743.821194 | | SigGA2 | 6534.975443 | | |
| | Sig12 | 10,459.229993 | | | | | |
| 50% | Sig4 | 7103.766218 | 13,387.121866 | SigGA4 | 7368.274957 | 7368.274957 | 6018.846909 |
| | Sig5 | 22,787.765619 | | | | | |
| | Sig8 | 10,269.833760 | | | | | |
| 75% | Sig1 | 3165.084227 | 5655.035232 | SigGA1 | 3145.817455 | 3145.817455 | 2509.217777 |
| | Sig3 | 8762.238266 | | | | | |
| | Sig6 | 4431.984430 | | | | | |
| | Sig9 | 8729.633644 | | | | | |
| | Sig12 | 3186.235592 | | | | | |
| 100% | Sig2 | 6.368205 | 6.364439 | SigGA1 | 6.167821 | 6.135847 | 0.228592 |
| | Sig4 | 6.378531 | | SigGA2 | 6.093931 | | |
| | Sig7 | 6.359734 | | SigGA3 | 6.143650 | | |
| | Sig10 | 6.388507 | | SigGA4 | 6.137986 | | |
| | Sig11 | 6.477537 | | | | | |
| | Sig12 | 6.214122 | | | | | |
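The deviation column of Table 13 is the average standard VDC time minus the average GA time for the same testing pool. As a small worked check, the Python snippet below recomputes it for the 0% pool from the per-run times in the table; the function name is illustrative. The result agrees with the tabulated 22,525.82353 and 9244.839616 to about five decimal places, the small remainder being consistent with the table subtracting an already rounded average.

```python
from statistics import mean

# Times in seconds for the 0% testing pool, copied from Table 13.
standard_times = [14208.647688, 35171.766852, 18197.056039]  # Sig1, Sig5, Sig8
ga_times = [13280.983914]                                     # SigGA2


def deviation(standard, ga):
    """Average standard VDC time minus average GA-optimized time."""
    return mean(standard) - mean(ga)


print(round(mean(standard_times), 5))
print(round(deviation(standard_times, ga_times), 5))
```

The same function applied to the other rows reproduces the remaining deviations, for example the 100% pool, where both averages are about 6 seconds.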


It is worth mentioning that, with regard to detection speed, the standard VDC algorithm takes longer to detect viruses than the optimized VDC, as shown in Table 13. The use of the GA enhances the learning process by improving the properties of the resulting signatures' pools, in that it produces a higher Mean fitness for these signatures. When the testing process is executed, detection is faster, as shown in Figs. 7 and 8. This is because, before the detection process is applied in testing, the signatures are sorted in descending order of fitness, which leads to faster detection. In this comparison, neither the detection rate nor the false positive rate changes.

Fig. 7. Comparison VDC with GA-VDC (Mean fitness).

Fig. 8. Comparison VDC with GA-VDC (detection speed).

5. Conclusions

From the previous simulations, the following points can be concluded:

1. In the VDC algorithm, if any of the following increases (the number of generations, the number of infected files in the files' pool, or the hypermutation rate during the learning phase), the fitness of the signatures increases as well.
2. Employing the GA to optimize the VDC algorithm improves the detection speed of the VDC algorithm by increasing the Mean fitness, which makes the algorithm faster at detecting viruses.
3. The average detection rate of 94.4% and the false positive rate of 0% are considered good; they do not change with the use of the GA but are confirmed by it.
4. The results of the paper demonstrate that the VDC algorithm can be used to detect viruses.

This paper agrees with the studies [10–12,1,14] in applying the AIS to virus detection, but departs from them in that they applied the Negative Selection Algorithm, whereas this work employs the clonal selection algorithm. Note that Ref. [6] reported a detection rate of 97% and a false positive rate of 3.6%, and also included a list of detection rates for antivirus companies: Eset NOD32 = 94%, Kaspersky = 88%, Panda 2008 = 67%, KV 2008 = 55% and Kingsoft = 44%. Although this research uses a different data set, its results (a detection rate of 94.4% and a false positive rate of 0%) are considered good and acceptable. Based on the results of this work, the following points are recommended:

1. In the VDC algorithm, the initial fitness of the signatures is set to random numbers. It is suggested to use data mining to categorize the viruses according to how widespread they are.
2. The VDC algorithm employs the exact match between signatures and files. It is recommended that different matching methods be applied, such as Euclidean Distance, Manhattan Distance or Hamming Distance.
3. The Negative Selection Algorithm should be added to the VDC algorithm, to make it possible to distinguish between Self and Nonself with regard to the existing files and, later, the detected infected files.
4. Different methods of mutation, such as Gauss Mutation, Cauchy Mutation or Mean Mutation, should be used.

References

[1] K. Loukhaoukha, J.Y. Chouinard, M.H. Taieb, Optimal image watermarking algorithm based on LWT-SVD via multi-objective ant colony optimization, Journal of Information Hiding and Multimedia Signal Processing 2 (October (4)) (2011) 303–319.
[2] H.C. Huang, Y.H. Chen, Genetic fingerprinting for copyright protection of multicast media, Soft Computing 13 (February (4)) (2009) 383–391.
[3] P. Puranik, P. Bajaj, A. Abraham, P. Palsodkar, A. Deshmukh, Human perception-based color image segmentation using comprehensive learning particle swarm optimization, Journal of Information Hiding and Multimedia Signal Processing 2 (July (3)) (2011) 227–235.
[4] F.C. Chang, H.C. Huang, A refactoring method for cache-efficient swarm intelligence algorithms, Information Sciences, in press, http://dx.doi.org/10.1016/j.ins.2010.02.025
[5] Secure Computing Corporation, Virus Signature Solutions from Secure Computing, http://www.securecomputing.com/, 2008.
[6] M. Unterleitner, Computer Immune System for Intrusion and Virus Detection: Adaptive Detection Mechanisms and their Implementation, VMD Verlag Dr. Muller Aktiengesellschaft & Co., Germany, 2008, http://www.amazon.com/Computer-Immune-System-Intrusion-Detection/dp/3836461080
[7] Z. Yu, L. Tao, Q. Renchao, Unknown computer virus detection inspired by immunity, Journal of Frontiers of Computer Science and Technology (2009), http://dx.doi.org/10.3778/j.issn.1673-9418.2009.02.004, ISSN 1673-9418/2003/03 (02)-0154-08, http://www.ceaj.org/wes/qikan/manage/wenzhang/T0811060.pdf
[8] L. Castro, F. Zuben, Learning and optimization using the clonal selection principle, IEEE Transactions on Evolutionary Computation, Special Issue on Artificial Immune Systems 6 (3) (2002) 239–251, ftp://ftp.dca.fee.unicamp.br/pub/docs/vonzuben/lnunes/ieee tec01.pdf.
[9] M. Creeger, The battle is bigger than most of us realize: CTO Roundtable: Malware Defense, Article development led by queue.acm.org, Communications of the ACM 53 (April (4)), http://portal.acm.org/citation.cfm?id=1721654&coll=DL&dl=GUIDE&CFID=10533433&CFTOKEN=20057273, 2010.
[10] K. Edge, G. Lamont, R. Raines, A Retrovirus Inspired Algorithm for Virus Detection & Optimization, Wright-Patterson AFB, Dayton, OH USA 45433, GECCO'06, Washington, USA, ACM 1-59593-186-4/06/007, http://portal.acm.org/citation.cfm?id=1144016, 2006.
[11] S. Forrest, A. Perelson, L. Allen, R. Cherukuri, Self-nonself discrimination in a computer, in: Proceedings of IEEE Symposium on Research in Security and Privacy, IEEE Computer Society Press, Los Alamitos, CA, 1994, pp. 360–365, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.3258
[12] J. Kephart, G. Sorkin, M. Swimmer, S. White, Blueprint for a computer immune system, in: This Paper was Originally Presented at the Virus Bulletin International Conference in San Francisco, California, USA, IBM Thomas J. Watson Research Center, 1997, http://www.research.ibm.com/antivirus/SciPapers/Kephart/VB97/
[13] U. Aickelin, Artificial Immune Systems (AIS) – A New Paradigm for Heuristic Decision Making, The University of Nottingham, Nottingham, NG8 1BB, United Kingdom, 2004, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.60.5923
[14] V.X. Heaven, http://vx.netlux.org/, 2010.