Credible, resilient, and scalable detection of software plagiarism using authority histograms

Dong-Kyu Chae a, Jiwoon Ha a, Sang-Wook Kim a,∗, BooJoong Kang b, Eul Gyu Im a, SunJu Park c

a Department of Computer and Software, Hanyang University, 17 Haengdang-dong, Seongdong-gu, Seoul 133-791, Republic of Korea
b School of Electronics, Electrical Engineering and Computer Science, Queen's University Belfast, Belfast BT3 9DT, United Kingdom
c School of Business, Yonsei University, Sinchon-dong, Seodaemun-gu, Seoul 120-749, Republic of Korea

Article history: Received 12 March 2015; Revised 25 November 2015; Accepted 20 December 2015; Available online 31 December 2015

Keywords: Software plagiarism detection; Birthmark; Similarity analysis; Static analysis

Abstract

Software plagiarism has become a serious threat to the health of the software industry. A software birthmark captures unique characteristics of a program that can be used to analyze the similarity between two programs and provide proof of plagiarism. In this paper, we propose a novel birthmark, Authority Histograms (AH), which satisfies three essential requirements for good birthmarks: resiliency, credibility, and scalability. Existing birthmarks fail to satisfy all of them simultaneously. AH reflects not only the frequency of APIs but also their call orders, whereas previous birthmarks rarely consider the two together. This property enables more accurate plagiarism detection, making our birthmark more resilient and credible than previously proposed birthmarks. By employing random walk with restart when generating AH, we make our proposal fully applicable even to large programs. Extensive experiments with a set of Windows applications verify that both the credibility and resiliency of AH exceed those of existing birthmarks; therefore, AH provides improved accuracy in detecting plagiarism. Moreover, the construction and comparison of AH are completed within a reasonable time.

1. Introduction

Software plagiarism is developing software using someone else's source code or open-source code without a license and disguising it as original software [1]. As software plagiarism has increased significantly, so has the resulting economic loss in the software industry. According to the Business Software Alliance report,1 the financial damage caused by software plagiarism in the USA is about 95 million dollars, and the damage in China is about 77 million dollars. To mitigate such economic losses, software developers need methods to detect software plagiarism.

Recently, software birthmarks (birthmarks) have come under study. A program's birthmark represents unique characteristics that can be used to identify the program [2]. The similarity between two birthmarks represents how likely one program is to be a copy of the other. Birthmarks permit a compact analysis of the similarity between a pair of programs without requiring extra data, such as the source code of the programs in question.

∗ Corresponding author.
1 BSA Global Software Piracy Study, http://globalstudy.bsa.org/2010.

Existing birthmarks can be categorized according to the following criteria. First, depending on the extraction scheme, birthmarks are categorized as static or dynamic: a static birthmark is extracted by disassembling a program without executing it, whereas a dynamic birthmark is extracted from a program's runtime behavior and is obtained by executing it [1]. Second, depending on their form, birthmarks are classified as set-based, frequency-based, or sequence-based. Regardless of these categories, birthmarks need to be designed to meet the following requirements [3]:

• Resiliency: Birthmarks should be robust even if plagiarizers slightly modify the structure of a program or reorder source code statements while preserving the semantics of the program.
• Credibility: Birthmarks extracted from independently developed programs should be dissimilar even if the programs accomplish similar tasks.
• Scalability: Birthmarks should be applicable even to large programs.

However, existing birthmarks fail to satisfy all of the above requirements at once. We briefly explain, for each category, which properties are not satisfied and why.

• Static set-based birthmarks: Their credibility is unsatisfactory because they cannot distinguish two independently developed programs that inadvertently use several APIs in common. Choi et al. [4] improved the credibility by separating all APIs into subsets and labeling each subset with the name of the function that calls the APIs in the subset. However, that improvement makes such birthmarks vulnerable to function structure transformations, such as function inlining and dividing one function into multiple functions, as demonstrated in our experiments. They also suffer from a scalability issue because they use the maximum weighted bipartite matching algorithm, which has a time complexity of O(n³), to take all user-defined functions into account when computing similarity.
• Static frequency-based birthmarks: They are more credible and scalable than the original set-based birthmarks because they reflect API call frequencies. However, they still lack credibility if two different programs use many APIs in common with a similar frequency distribution, or use two or three common APIs whose frequencies are much higher than those of any other APIs in each program [5]. These drawbacks are also shown in our experiments.
• Static sequence-based birthmarks: The main problem with these birthmarks is poor resiliency, because assembly instruction or API sequences can easily be changed by switching some statements in the source code (e.g., function calls or mathematical operations) while preserving the semantics of the program [6]. Scalability is also a problem because an exponential number of possible traces exist according to the branches defined in the program, all of which need to be considered in a similarity computation between two birthmarks.
• Dynamic birthmarks: Regardless of their form, it is questionable whether dynamic birthmarks can capture the unique characteristics of a program, which is the fundamental objective of designing birthmarks [7,8]. Because dynamic birthmarks are extracted during the execution of some pre-defined scenarios, only a small part of the program is reflected in them. It does not seem appropriate to consider dynamic birthmarks as characteristics of a program because they inherit only a small part of it while completely ignoring the rest [4,9,10].

In this paper, we propose a novel static birthmark, Authority Histograms (AH), which satisfies all three of the above requirements. Our birthmark is a histogram whose dimensions correspond to the APIs used in a program; the value of each dimension is an authority score that represents how prominently the corresponding API is used in the program. We measure the importance of APIs by analyzing the program's structural characteristics to figure out which APIs occupy a core position in the program structure and which do not.2 Specifically, we first construct an API-labeled control flow graph (A-CFG), a graphical representation of a program that has APIs as vertices and call orders among APIs as edges. The A-CFG represents the full structure of the program, containing all possible control flows from the start of the program to its termination. Next, we measure the authority score of each API based on the structural characteristics of the A-CFG. To compute the authority scores, we use random walk with restart (RWR), a probabilistic model for a random surfer to reach a web page on a web graph after a given number of iterations. RWR captures which web pages are authoritative in the graph and gives those nodes high authority scores by analyzing the structural characteristics of the graph [11]. In our setting, RWR figures out which APIs are popularly called in a program by analyzing the A-CFG and gives those APIs high authority scores. At this step, the authority scores are affected not only by the number of incoming edges of API nodes (i.e., the call frequency of APIs) but also by the order of API nodes [12]. The resulting histogram of the authority scores over all the APIs becomes AH. Two programs with similar AHs are highly suspected of plagiarism because they not only use similar important and minor APIs but also have similar structural characteristics.

2 We assume that the programs in question are large and use many APIs (e.g., commercial programs released by software companies). If programs are small and use few APIs (e.g., toy programs), our proposed method tends not to work well.

Our design ensures that AH satisfies the three essential requirements as follows. First, by reflecting both the frequency and the call order of APIs, we remedy the deficiencies of the previous set-based, sequence-based, and frequency-based birthmarks, making our birthmark more resilient and credible. API calls are a common way for a program to request resources or services provided by the OS, and they are tightly related to the main functionality of a program, so the importance distribution of APIs can be unique to each program. Moreover, it is difficult for a plagiarist to manipulate the overall call frequency of APIs or the call order of APIs, or to replace APIs with something else, while maintaining the program's original semantics [3]. Second, by generating an n-dimensional histogram that inherits the structural characteristics of the A-CFG, our birthmark reduces the comparison of two A-CFGs to a simple and scalable task: we compute the similarity between AHs efficiently using cosine similarity, a simple and widely used similarity measure.

Based on our proposal, we implemented an AH-based software plagiarism detection method. Given original and suspicious programs, we first construct A-CFGs for both programs, then generate AHs from each A-CFG and compute their similarity. Based on the similarity value, we determine whether the suspicious program was copied from the original one.3 We used a set of Windows programs and performed extensive experiments. We observed that AH outperforms previous state-of-the-art static birthmarks in terms of resiliency and credibility, thereby providing improved accuracy in detecting plagiarism. We also observed that the extraction and comparison of AH birthmarks can be completed within a reasonable time.

3 We note that our scope is to provide the similarity between the two programs in question; deciding which program is the original is out of the scope of this paper.

The initial idea of this paper was presented with some preliminary experimental results at ACM CIKM 2013 as a short paper [17]; this paper is an extended version. The main differences are as follows. In the extended version, we explain both our proposed method and previous work in more detail. We have also conducted more extensive experiments than described in our previous paper: (1) we additionally implement the FlowPath birthmark, a state-of-the-art sequence-based birthmark, and perform comparative experiments using it and our method; (2) we generate two kinds of real plagiarized samples (one with changed compiler optimization options and the other manually transformed by human experts) and perform experiments with these samples; (3) we perform scalability testing by measuring execution times for every method according to the size of the target programs.

The rest of this paper is organized as follows. Section 2 briefly reviews some methods for detecting software plagiarism. Section 3 explains in detail AH's definition, generation procedure, and similarity computation. Section 4 evaluates our AH-based software plagiarism detection method by comparing it with previous methods


using a set of real-world Windows applications. Section 5 summarizes our proposal and concludes the paper.

2. Related work

The problem of software plagiarism detection has been addressed with various methods. Early studies mainly focused on text-based plagiarism detection to find suspected software copies. Those approaches compare features of program source code, such as token sequences or program syntax. Source code comparison methods, such as MOSS [13] and YAAP [14], extract token sequences or syntactic trees from the source code of the two targeted programs and measure their similarity. Unfortunately, those methods compute similarity at the source code level, and in many instances source code is unavailable [7].

Another earlier method is software watermarking, first proposed by Collberg and Thomborson [15]. Such methods embed a copyright notice into executable code prior to its release; extracting the copyright notice from a watermarked program therefore constitutes a proof of its origin [15]. However, watermarking methods are limited because they apply only to programs in which copyright notices have already been embedded.

To overcome the fundamental limitation of depending on extra data such as source code or watermarks, birthmarks have come under study. The first birthmark research was conducted by Tamada et al. [2]. The following is their definition.

Definition 1 (Software birthmark). Let p be a program and f(p) be a set of characteristics extracted from p. Then f(p) is a birthmark of p only if both of the following conditions are satisfied:

• Condition 1: f(p) is obtained only from p itself.
• Condition 2: If program q is a copy of p, then f(p) = f(q).

Condition 1 means that the birthmark is not extra information: it is obtained only from the program itself. Hence, extracting a birthmark does not require extra code, as watermarking does. Condition 2 states that the same birthmark must be obtained from copied programs. Here, copying can be done in various ways: q is an exact duplicate of p; q is obtained from p by renaming all identifiers in the source code of p; q is obtained from p by eliminating all the comment lines in the source code of p; and so on [16]. Whatever copying scheme is used to hide the plagiarism, the birthmarks extracted from the copied program and the original program should be identical.

There are two major categories of birthmarks: static and dynamic. A static birthmark is extracted by disassembling a program without executing it, and a dynamic birthmark is extracted from the runtime behavior of a program, obtained by executing it [1]. Dynamic birthmarks can reflect only a small part of the program because a single execution path covers only a small part of the program. Moreover, different birthmarks can be extracted under different execution environments even if the program input and the execution scenarios are identical. In contrast, static birthmarks can inherit the overall characteristics of a program and are always consistent because they do not depend on the runtime environment, execution scenario, or program input, all of which must be carefully considered when extracting dynamic birthmarks. One problem with static birthmarks is that they cannot be extracted from encrypted or compressed programs; however, unlike malicious programs, commercial programs do not usually employ encryption or compression techniques [4]. Therefore, in this paper, we focus on static birthmarks.

Existing static birthmarks can be categorized into three types based on their form: static set-based birthmarks [4,9], static frequency-based birthmarks [10], and static sequence-based birthmarks [7,17,18]. For each category, we introduce here the state of the art directly related to our research; some of these methods are used for comparison with our proposed birthmark in the experiment section.

Static API birthmarks: Choi et al. [9] proposed a set-based birthmark that consists of multiple sets of API calls, one for each function in a program. They extended their work by regarding not only the functions but also their descendant functions as a birthmark [4], which they call the Static API birthmark (SA). Their intention with SA was to build resiliency against function manipulation attacks, such as function inlining. The depth of descendant function calls is adjusted by a function call depth parameter. To compare the similarity between two programs, they modeled the problem as maximum weighted bipartite matching [19]: programs correspond to partite sets, functions correspond to vertices, and the weights assigned to edges represent the similarities between two functions. Function similarities are calculated by the dice coefficient [20], which is 2|C|/(|A| + |B|), where |A| and |B| are the numbers of elements in sets A and B, respectively, and |C| is the number of elements shared by A and B. If the similarity between two birthmarks is higher than 65%, the two programs are regarded as copies.

Static API call frequency birthmark: Chae et al. [10] proposed a frequency-based birthmark called the Static API Call Frequency birthmark (SACF), which reflects the frequency of API calls in a program. In addition, SACF gives each API a weight that indicates how important it is in representing a program's unique characteristics; to measure the weight, they used term frequency-inverse document frequency [21]. The final form of the birthmark is a vector whose elements correspond to APIs and whose values are the product of the frequency and the weight of each API. They used cosine similarity to compare two birthmarks.

Static trace birthmarks: Park et al. [17] proposed a sequence-based birthmark, called the API trace birthmark, that reflects all possible traces of API calls from each function in a program; all possible control flows of a function are considered. Similarly, Lim et al. [7] defined the Flow-Path birthmark (FP), which consists of function-level sequences of instructions based on the control flow graph of each function. In this method, a parameter controls how many basic blocks (nodes of a control flow graph consisting of instructions) are included in each instruction sequence. These two methods use the maximum weighted bipartite matching algorithm to measure the similarity of two birthmarks. According to the experimental results in [7], FP outperformed the API trace birthmark in terms of credibility and resiliency.

3. Proposed method

3.1. Overview

Our Authority Histogram (AH) based software plagiarism detection method consists of three steps. First, we construct the API-labeled control flow graphs (A-CFGs) of the original and suspicious programs. Second, we generate AHs by analyzing the structural characteristics of each A-CFG. Third, we compute the similarity between the two AHs and determine whether the suspicious program is copied from the original one based on the computed similarity score. The following sections describe each step in detail.

3.2. A-CFG construction

We first introduce a formal definition of the A-CFG:

Definition 2 (A-CFG: API-labeled control flow graph). The API-labeled control flow graph of a program p is a 2-tuple graph A-CFG = (N, E), where N and E satisfy the following conditions:

• N is a set of nodes, where node n ∈ N corresponds to a single API call in p.
• E (⊆ N × N) is a set of edges, where edge n1 → n2 ∈ E denotes a sequence between statements n1 and n2: n2 is called immediately after n1 is called.

To construct the A-CFG, we first generate the control flow graphs (CFGs) of all functions and connect them based on their inter-procedural call relationships, eventually generating a single compound graph. Then, we label each node with the name of the API called in the corresponding node until the A-CFG construction is complete. Specifically, we perform static analysis via the IDAPro disassembler [22], a popular static analyzer that disassembles machine code and constructs the CFG of each function defined in the program. A CFG is a well-known abstract representation of all possible flows from the start of a function to its end. Each node in a CFG represents a basic block, a linear sequence of program instructions without any jumps or jump targets; jump targets start a new block, and jumps end a block. Directed edges represent jumps in the control flow [23]. Instructions in a basic block include not only general opcodes, e.g., add and mov, but also user-defined function or API calling instructions, e.g., "call GetDC". Generally, a CFG shows the behavior of only a single function. We merge the CFGs into one single, massive graph that reflects the full structure of a program using the following steps:

Step 1 (Connecting all CFGs together based on their inter-function call relationships)


Fig. 1. CFGs of functions x, y, and z.

(1-1) The basic block that contains a user-defined function call is split into two basic blocks: one from the start of the block to the function call instruction, and the other from immediately after the function call instruction to the end of the block.
(1-2) Several directed edges are created: one between the split block that contains the function call instruction and the root node of the corresponding function's CFG, and the others between the end-point nodes of that CFG and the other piece of the split block.
(1-3) The above two steps are repeated for every CFG, eventually generating a single compound graph.

Fig. 2. An intermediate step of constructing A-CFG.

Step 2 (Replacing basic blocks with APIs)


(2-1) If a basic block calls only one API, the basic block is labeled with the name of the API.
(2-2) If a block calls two or more APIs, the block is recursively split until each block has only one API call instruction. Edges are then assigned between the split blocks.
(2-3) If a block has no API call instruction, the block is labeled "empty".

As an example, Figs. 1-3 show the overall process of generating an A-CFG. Fig. 1 shows the CFGs of functions x, y, and z; Fig. 2 shows an intermediate step of merging all CFGs into one single A-CFG; Fig. 3 shows the final form of the A-CFG. In Fig. 1, the basic blocks e and f in CFG y include the user-defined function call instructions (x) and (z).4 As shown in Fig. 2, they are divided into two blocks, and CFGs x and z are inserted between the divided blocks with new directed edges; the dashed arrows show the created edges. After creating a single compound graph, each basic block is re-labeled with the name of the API that it contains. Therefore, we remove all the assembly instructions except for the API call instructions. If a block contains no API call instruction, such as block d of CFG y in Fig. 2, the block is labeled "empty".

4 As space is limited, the function call instructions are presented in parentheses, e.g., (GetDC) instead of "call GetDC".

Fig. 3. A-CFG.
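To make Steps 1 and 2 concrete, the following is a minimal sketch of the construction logic. The Block container, the helper names, and the list-of-call-names encoding are our own simplifications for illustration; the paper's actual pipeline is an IDAPro plug-in driven by a Python script over real disassembly.

    class Block:
        """A basic block, reduced to the ordered list of calls it contains."""
        def __init__(self, calls):
            self.calls = list(calls)  # API (or user-defined function) names, in order
            self.succs = []           # control-flow successors (Block objects)

    def splice_callee(caller, pos, callee_entry, callee_exits):
        """Step 1: split `caller` around the user-defined call at index `pos`
        and wire the callee's CFG in between with new directed edges."""
        tail = Block(caller.calls[pos + 1:])   # the piece after the call site
        tail.succs = caller.succs
        caller.calls = caller.calls[:pos]      # the piece before the call site
        caller.succs = [callee_entry]          # edge into the callee's root node
        for end_node in callee_exits:          # edges back from the callee's ends
            end_node.succs.append(tail)
        return tail

    def to_api_chain(block, is_api):
        """Step 2: expand a block into a chain of single-API nodes, or a single
        'empty' node if it calls no API. (Rewiring the block's predecessors
        to the chain head is omitted for brevity.)"""
        labels = [c for c in block.calls if is_api(c)] or ["empty"]
        chain = [Block([name]) for name in labels]
        for a, b in zip(chain, chain[1:]):     # connect the split blocks in order
            a.succs = [b]
        chain[-1].succs = block.succs          # preserve the original out-edges
        return chain[0]

Applying splice_callee to every user-defined call and to_api_chain to every block yields the single compound, API-labeled graph described above.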


As noted in Step (2-2), if a block calls two or more APIs, it is split until each block has only one API call instruction. For example, because block f of CFG y in Fig. 1 calls two APIs, (AddForm) and (GetDC), it is divided into the two blocks shown in Fig. 2, and a directed edge is created between them. The procedures above eventually construct the A-CFG. Note that the creation of the A-CFG is fully automated: IDAPro itself cannot automatically extract multiple CFGs from a program, but we added a plug-in with a Python script to do that. Once the CFGs of every function are extracted, the rest of the process, which integrates all of the CFGs, is easily automated.

The A-CFG is a graphical representation of a program, expressing its full structure in terms of APIs. The AH inherits the structural characteristics of its A-CFG. The following section defines AH and explains how we generate it from the A-CFG.

3.3. AH generation

We define our AH birthmark as follows.

Definition 3 (Authority Histogram). The Authority Histogram birthmark is an n-dimensional histogram, where

• n is the number of APIs defined in MSDN.5
• The value of each dimension is the sum of the authority scores that the corresponding API receives in the A-CFG. If an API appears in multiple places in the A-CFG, we aggregate all the authority scores it receives throughout the graph. If the API is not called in the program, the value is set to 0.

5 https://msdn.microsoft.com/en-us/library/.

To compute the authority scores, we employ random walk with restart (RWR), a probabilistic model for a random walker to reach a node after a given number of iterations. It is widely used in the information retrieval field to determine which web pages are authoritative in a web graph [11,12]. We borrow this notion of authority from the information retrieval field: the authority of a node (an API) indicates its popularity, i.e., how popularly the API is called in a program. RWR is expressed by the following formula [11,12]:

R_{t+1} = (1 − α) A R_t + α w    (1)
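Concretely, Formula (1) can be computed by simple power iteration. The following is a minimal sketch, assuming a column-normalized adjacency matrix A and a restart vector w as explained just below; the function name, convergence test, and iteration cap are our own choices, not the paper's:

    import numpy as np

    def rwr_authority(A, w, alpha=0.15, tol=1e-8, max_iter=1000):
        """Iterate R <- (1 - alpha) * A @ R + alpha * w until convergence.
        A: column-normalized adjacency matrix of the A-CFG (n x n);
        w: restart probability vector whose entries sum to 1.
        Returns the stationary authority score of each node."""
        n = A.shape[0]
        R = np.full(n, 1.0 / n)                  # uniform initial scores
        for _ in range(max_iter):
            R_next = (1.0 - alpha) * (A @ R) + alpha * w
            if np.abs(R_next - R).sum() < tol:   # L1 convergence test
                return R_next
            R = R_next
        return R

Summing the resulting per-node scores by API label then yields the histogram of Definition 3.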

In Formula (1), A represents an n-dimensional adjacency matrix column-normalizing the connectedness among nodes, where n indicates the number of nodes in the A-CFG. R_t is a vector holding the authority score of each node. The restart vector w represents the probability that the random walker jumps to each node instead of traversing the edges of the graph; each entry has the value 1/n. α is a weight that determines how often the random walker jumps to other nodes rather than following the edges; we set α to 0.15, a generally and widely used value. Formula (1) is iterated until the vector R_t converges.

Using RWR gives our method several benefits. First, the authority score computed by RWR for each API reflects not only the frequency but also the call orders of APIs. In its computations, RWR gives a high authority score to (1) nodes that receive many edges and (2) nodes closely connected to nodes that have already received high scores [11]. Thus, frequently called APIs receive a high authority score through RWR because they either appear throughout the A-CFG or are connected to many other nodes. Some APIs can also get a high authority score, even though they are rarely called, if they are closely connected to nodes with a high authority score. Conversely, if APIs are rarely called and the nodes pointing to them have low authority scores, they are also likely to get a low authority score.

Fig. 4. Part of an A-CFG.

As an example, Fig. 4 shows part of an A-CFG. If function x is called from many other functions, many edges are created above the head of CFG x. This causes API GetDC to receive a high authority score through RWR. APIs closely connected to GetDC, such as BitBit and OpenFile, are also likely to get high authority scores. In contrast, assuming that function z is rarely called, API SetDC and APIs closely connected to it, such as GetMenu, are likely to receive low authority scores.

In conclusion, each authority score is strongly affected by both the frequency and the call orders of APIs. This makes our birthmark unique to each program, because the frequency and order of API calls represent unique program behaviors. Moreover, manipulating API call frequencies or call orders while maintaining a program's original semantics is difficult, which indicates that our proposed birthmark is resilient.

Second, by a simple modification of Formula (1), we can take into account the commonness of each API, i.e., how widely the API is used across many programs. Generally, essential program tasks, such as exception handling, memory management, and thread creation and termination, use APIs that are frequently called in most existing programs. Those APIs could receive high authority scores regardless of the kind of program, which could make different programs look similar [10]. To address that problem, we measure each API's commonness and reflect it in Formula (1). Specifically, we modify the restart vector w to induce low authority scores for widely used APIs: we assign a probability to each API node computed as follows and normalize the vector w:

w(API) = 1 / (CF(API) × PF(API))    (2)
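A small sketch of this weighting, assuming the CF and PF statistics defined just below are available as dictionaries keyed by API name; giving "empty" nodes zero restart probability is our own simplification, not stated by the paper:

    import numpy as np

    def restart_vector(node_labels, cf, pf):
        """node_labels: the API label of each A-CFG node ("empty" for API-free nodes);
        cf[a]: call frequency of API a in this program;
        pf[a]: number of benchmark programs that call API a."""
        raw = np.array([1.0 / (cf[a] * pf[a]) if a != "empty" else 0.0
                        for a in node_labels])
        return raw / raw.sum()                   # normalize w to sum to 1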

In Formula (2), w(API) denotes the restart probability value assigned to each node labeled with the given API in vector w. CF(API) is the call frequency of the API in the program, and PF(API) is the number of programs that call the API.6 Because commonly called APIs are used by most programs and are also used frequently within each program, Formula (2) assigns a low probability in inverse proportion to the call frequency and the program frequency. This modification not only gives low authority scores to commonly called APIs, but also gives high authority scores to the APIs used uniquely in each program. Thus, we can both make different programs look different and successfully characterize each program.

6 Here, the entire program set is the set of programs collected for our experiments, listed in Table 1 in Section 4.

The third advantage of RWR is computational efficiency. The A-CFG itself is a good birthmark candidate because it represents the full structure of a program. However, using the A-CFG itself as a birthmark creates a scalability issue: because an A-CFG has from tens of thousands to more than millions of nodes, it is practically impossible to compute the similarity of two A-CFGs directly. Conventional graph comparison algorithms, such as graph isomorphism or maximum common subgraph, are generally NP-complete [24]. Instead of comparing two A-CFGs directly, we generate the AH that reflects the structural characteristics of its A-CFG and then compute the similarity between two AHs. Generating the AH can be performed in a short time: the stationary authority scores are obtained by iterating Formula (1) until R_t converges, and the complexity of this computation is O(n + e), where n denotes the number of nodes in the graph and e the number of edges. Thus, our method requires only a reasonable time even for a large program whose A-CFG has millions of nodes [12].

3.4. Similarity computation

Two programs with similar AHs not only assign similar importance to the same APIs, both major and minor, but are also likely to have a similar program structure, leading to a high likelihood of plagiarism. As the similarity measure between two AHs, we use cosine similarity, which is simple and widely used [20]. Let p and q be programs, and H_p and H_q be the AHs of p and q, respectively. The cosine similarity is defined as follows:

SIM(H_p, H_q) = (Σ_{i=1}^{n} H_{p,i} × H_{q,i}) / (√(Σ_{i=1}^{n} H_{p,i}²) × √(Σ_{i=1}^{n} H_{q,i}²))    (3)

H_{p,i} and H_{q,i} in Formula (3) denote the ith elements of H_p and H_q, respectively. The similarity between two birthmarks ranges between 0 and 1; as it approaches 1, one program is regarded as more likely to be a copy of the other. The primary purpose of a birthmark is to detect copies of a program. To achieve this, we set a plagiarism threshold and decide whether two given programs are copies as follows:

SIM(H_p, H_q) ≥ ε: p and q are classified as copied
SIM(H_p, H_q) < ε: p and q are classified as independent    (4)
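Formulas (3) and (4) amount to only a few lines of code. A sketch, with the names being our own (the threshold ε is a parameter; Section 4 studies values from 0.6 to 0.8):

    import numpy as np

    def ah_similarity(Hp, Hq):
        """Cosine similarity between two AH vectors (Formula (3))."""
        denom = np.linalg.norm(Hp) * np.linalg.norm(Hq)
        return float(Hp @ Hq) / denom if denom > 0 else 0.0

    def classify(Hp, Hq, eps):
        """Threshold decision of Formula (4); eps is the plagiarism threshold."""
        return "copied" if ah_similarity(Hp, Hq) >= eps else "independent"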

ε in Formula (4) denotes the plagiarism threshold. When the similarity between two programs falls in the range [ε, 1.0], our method classifies them as "copied"; otherwise, it classifies them as "independent". In our experiments, we use multiple values of ε, from 0.6 to 0.8, to evaluate the accuracy of our proposed method over various thresholds. More detailed descriptions are given in Section 4.3.

4. Experiments

4.1. Experimental setup

It seems reasonable to use plagiarized programs as benchmark programs for evaluating our birthmark. Unfortunately, to the best of our knowledge, no reputable, public samples that plagiarize commercial Windows programs exist. Moreover, existing source code obfuscation tools for the C and C++ languages perform only simple modifications, such as removing comment lines or renaming variables, and cannot significantly change the structure of the binary code. Therefore, following previous work [4,9,10,25], we regard a recent version of a program as a "copy" of its previous version. A version update changes the source code to improve functionality, convenience, performance, or memory efficiency while preserving most of its semantics; because this is similar to the behavior of plagiarism, it can be viewed as such [4,10]. As shown in Table 1, we collected 28 widely used programs from multiple categories, such as text editors, FTP clients, compression programs, and media players. Each program has two different versions. Compared with most previous research, which used fewer than 20 benchmark programs, we collected considerably more benchmark programs for our experiments.

We evaluated AH by comparing it with three previously proposed state-of-the-art birthmarks as representatives of their respective categories. The parameters used for each method are the default values set in their respective papers; detailed descriptions are given in Section 2.

• SA: the Static API birthmark proposed by Choi et al. [4], a set-based birthmark. We used a function call depth parameter of 3.
• SACF: the Static API Call Frequency birthmark proposed by Chae et al. [10], a frequency-based birthmark.
• FP: the Flow-Path birthmark proposed by Lim et al. [7], a sequence-based birthmark. The penalty values for the sequence alignment algorithm were σ = 2 and gap = 1; the number of included basic blocks was set to 2.

In addition, we implemented three variations of AH with different w in Formula (2):

• AH_CF_log: takes the log of CF(API) in Formula (2).
• AH_PF_log: takes the log of PF(API) in Formula (2).
• AH_CF_PF_log: takes the log of both CF(API) and PF(API) in Formula (2).

A birthmark's accuracy in detecting software plagiarism is strongly influenced by how the plagiarism threshold explained in Section 3.4 is set. Generally, the optimal threshold value varies among experiments with different data sets, so it is almost impossible to find an optimal threshold that applies to multiple data sets. Previous studies did not provide a method for finding the optimal threshold; instead, they set threshold values suitable only for their particular benchmark programs. We believe it is important for a plagiarism detection method to show consistently high accuracy regardless of the threshold. To show high accuracy, a birthmark should indicate low similarity between independently developed programs (credibility) and high similarity between copied programs (resiliency). Therefore, we performed the following experiments. Through a resiliency test, we verified that our proposed method indicates high similarity between an original program and a copied program. We also performed a credibility test to demonstrate that our proposed method indicates low similarity between different programs. An accuracy test evaluated the sensitivity of our birthmark to the plagiarism threshold, and a scalability test evaluated the time complexity of our proposed birthmark.

4.2. Resiliency and credibility tests

First, to evaluate the resiliency of our birthmark, we measured the similarity between the two versions of each program in Table 1 and computed the average similarity over the 28 distinct comparisons given by the 28 programs. High average similarity indicates that a birthmark is resilient. The results are shown in Fig. 5 and Table 2. In the figure, the x-axis shows the birthmarks (AH_original, SA, SACF, FP) and the variations of our birthmark (AH_CF_log, AH_PF_log, AH_CF_PF_log), and the y-axis gives the average similarity over all pairs of different versions (copied programs) for each birthmark. The table provides statistical results for the resiliency test. In this experiment, all birthmarks show satisfactory resiliency except FP, which tends to indicate low similarity between copied programs. For example, it shows a low similarity of 32.9% for the two versions of LongPlayer, whereas all the other birthmarks rate their similarity higher than 94%. It also shows a low similarity of 20.8% for the two versions of SecureCRT, whereas all the other birthmarks indicate their similarity

Table 1
Benchmark programs.

Program        Version    Size (KB)  # of functions  # of APIs
AkelPad        4.7.6      357        918             358
               4.7.7      357        921             358
Notepad++      6.1.4      1584       3272            318
               6.1.5      1584       3275            319
Pidgin         2.10.5     49         22              36
               2.10.6     49         22              36
Psi            0.15       6869       11,011          46
               0.14       8259       12,011          57
BadakEncoder   3.0.00     923        1465            177
               3.0.11     919        1470            177
ACDSeePro      4.0.237    20,458     44,849          910
               5.3.168    19,971     45,789          909
NcFTP          3.2.4      300        413             51
               3.2.5      248        449             56
UltraEdit      17.1       11,398     23,082          653
               18.1       10,959     23,762          666
NateOn         4.3.0.2    3069       5496            271
               4.3.1.4    3036       5667            276
BuddyBuddy     7.1.1.1    8718       14,912          312
               7.1.1.5    8838       14,889          311
RestoShare     0.5.3      14,360     4225            37
               0.5.4      13,061     5011            38
UmileEncoder   3.1.2      1278       5070            295
               3.1.3      1271       5112            297
XNview         0.50.0     6861       11,913          231
               0.51.0     6845       11,926          232
CuteFTP        8.3.2.0    2668       9579            387
               8.3.3.5    3720       13,006          438
FileZilla      3.5.2      7993       20,785          502
               3.5.3      7994       20,817          504
LongPlayer     0.9.9      920        4764            48
               1.0.1      804        4749            51
Mixxx          1.10.0     3028       9908            97
               1.10.1     3058       10,096          97
CoolPlayer     2.1.8      496        700             192
               2.1.9      492        699             192
7zip           9.19       412        4531            260
               9.2        412        4531            260
BackZip        5.0.2      1920       12,376          336
               5.0.3      1920       12,387          337
Putty          0.60.0     444        992             228
               0.62.0     472        1074            243
WinSCP         4.3.7      6325       9646            451
               4.3.9      6329       9649            452
Winamp         5.6.0      1559       3281            521
               5.6.3      2156       3266            525
PotPlayer      1.5        180        736             93
               1.51       177        741             94
Foobar2000     1.1.11     1719       13,412          339
               1.1.14     1728       13,517          344
Bandizip       2.7.0.0    962        4178            341
               3.0.0.0    1005       4394            348
ALzip          11.6.3     3073       12,457          288
               12.6.5     2855       12,041          287
SecureCRT      6.7.5.4    3086       17,573          369
               7.0.1.3    3421       19,634          399

Fig. 5. Comparisons in terms of resiliency.

Fig. 6. Comparisons in terms of credibility.

Table 2
Statistical results from the resiliency test.

               Min     Max     Avg     Stdev
AH_original    0.5638  1       0.9729  0.1534
AH_CF_log      0.5657  1       0.9732  0.1522
AH_PF_log      0.5652  1       0.9758  0.1519
AH_CF_PF_log   0.5707  1       0.9790  0.151
SA             0.3707  1       0.9661  0.1793
SACF           0.2449  1       0.9636  0.2074
FP             0.2087  0.9916  0.8321  0.2006

to be higher than 88%. FP shows poor resiliency because structural changes in a program greatly affect the overall sequence of assembly instructions.

Next, to evaluate credibility, we measured the similarity between all possible pairs of programs in Table 1. In this experiment, we used the latest version of each program and performed the 378 distinct comparisons given by all possible pairs among the 28 different programs. Then we computed their average similarity; low average similarity indicates that a birthmark is credible. The results are shown in Fig. 6, where the x-axis shows the birthmarks (AH_original, SA, SACF, FP) and the variations of our birthmark (AH_CF_log, AH_PF_log, AH_CF_PF_log), and the y-axis indicates the average similarity over all pairs of programs for each birthmark. Table 3 provides statistical results for the credibility test.

Our proposed method and its variations indicate much lower average similarity than the other birthmarks. For example, AH_original reports a low similarity of 13.1% between NateOn and BuddyBuddy, whereas SA rates their similarity at 40.2%. In addition, AH_original shows a low similarity of 10.5% between BadakEncoder and BandiZip, whereas SACF rates their similarity at 45%. As for the SACF birthmark, its average similarity belies its poor credibility: its apparently satisfactory credibility in this experiment is the result of some extremely low similarity values

Table 3
Statistical results from the credibility test.

               Min     Max     Avg     Stdev
AH_original    0       0.2287  0.0457  0.0499
AH_CF_log      0       0.2317  0.0558  0.0517
AH_PF_log      0       0.2312  0.0531  0.0501
AH_CF_PF_log   0       0.2334  0.0621  0.0522
SA             0       0.4420  0.0767  0.0623
SACF           0       0.8166  0.0893  0.1334
FP             0.0042  0.3488  0.0969  0.0709

Table 4
Different compiler optimization test.

            AH      SA      SACF    FP
7zip        0.9815  0.8199  0.9727  0.6086
NcFTP       0.9845  0.9228  0.8002  0.6544
Notepad++   0.9922  0.9546  0.9981  0.8068

that reduce its average similarity score: under SACF, similarity values for 84 pairs of programs are less than 0.01%, whereas only 62, 45, and 1 pairs fall below 0.01% in the case of AH, SA, and FP, respectively. In contrast, SACF reports similarity values higher than 50% for 28 pairs of programs, whereas the other birthmarks report no pairs higher than 50%. Even if the poor credibility of SACF is hidden in this experiment, the accuracy test in the next section exposes it. In addition, among AH and its three variations, the original shows the best results; therefore, we select AH_original as our final choice for the following experiments.

As another resiliency test, we selected some open-source programs and generated plagiarized programs by compiling their source code with different compiler optimization options; we then computed the similarity between the original program and the plagiarized program. Changing compiler optimization options is a type of semantics-preserving transformation used by software plagiarists to prevent detection, because it changes the internal structure of a program while maintaining its functionality [4,10]. Here, we evaluated the effects of compiler optimization levels on our birthmark. We used three open-source programs: 7zip, NcFTP, and Notepad++. We compiled with Visual Studio 2010 and used two optimization switches (/O2-/Ob2 and /O1-/Ob1). Generally, /O2-/Ob2 is the default optimization option in the release mode of Visual Studio 2010 because it makes programs faster in most cases and lets the compiler choose small functions for inline expansion, eventually minimizing code size and improving performance [30]. Plagiarists can change the option to /O1-/Ob1: /O1 optimizes for code size, and /Ob1 performs inline expansion only of pre-defined functions. This switch affects the program in many respects: the code size becomes smaller, the execution speed slows, the order of assembly instructions in each function changes, and so on.

We generated benchmark programs by compiling the source code of each program with both optimization options (e.g., 7zip-/O2/Ob2 and 7zip-/O1/Ob1), regarding the program compiled with the default option as the original (e.g., 7zip-/O2/Ob2) and the other as a copy (e.g., 7zip-/O1/Ob1). Then we extracted AH and the previous static birthmarks (SA, SACF, FP) from the generated programs and computed the similarity between the birthmarks extracted from the original and copied programs.

Table 4 shows the results of the similarity analysis. In the table, the numbers in each cell represent the similarity between the original and the copy of the program shown in its row. Overall, AH shows high similarity (more than 98%) in

all cases. For 7zip and NcFTP, our birthmark gives the highest similarity. SACF gives the highest similarity for Notepad++, but the difference between the value derived by SACF and that of our proposed method is only 0.59%. SA shows reasonable results for NcFTP and Notepad++ but low similarity for 7zip, because it is vulnerable to compiler optimization switching, especially function-level optimization options (e.g., /Ob1, /Ob2). Because SA uses as its birthmark the set of APIs used in each function, which is strongly affected by the inline expansion of functions, it tends to indicate low similarity when a significant number of functions have been separated or merged by compiler optimization. This tendency is remarkable for 7zip: we analyzed the difference in the number of functions between 7zip-/O2/Ob2 and 7zip-/O1/Ob1 and observed a difference of more than 10%, whereas those of NcFTP and Notepad++ are 4% and 1%, respectively. As a result, the SA birthmark for 7zip is significantly affected by changing the compiler option and thus indicates low similarity.

SACF indicates high similarity for 7zip and Notepad++ but low similarity for NcFTP. We observed that SACF tends to indicate low similarity between the original and the copied program if some of the most frequently called APIs are affected. We analyzed the frequency histograms of NcFTP-/O2/Ob2 and NcFTP-/O1/Ob1 and observed that the most frequently called API changed from FlushConsoleInputBuffer to GetStdHandle, and the second-most frequently called API changed from GetStdHandle to SetConsoleTextAttribute, under the optimization switch, whereas those of 7zip and Notepad++ did not change. Thus, SACF gives remarkably low similarity for NcFTP.

In this test, FP again shows poor resiliency. Because switching compiler optimization options changes the internal structure of the program, FP, which consists of assembly instruction sequences, is significantly affected; it indicates low similarity between the original and the copied program in all cases.

In conclusion, the results of our resiliency and credibility tests confirm that our AH birthmark outperforms existing static birthmarks, giving it good potential to detect plagiarism correctly. The main reason is its consideration of both the call orders and the frequency of APIs based on the structure of a program. Because frequency-based birthmarks (i.e., SACF) lack API call order information, they have difficulty differentiating programs that use some APIs in common. Conversely, because sequence-based birthmarks lack frequency information, they are vulnerable to structural changes in programs. By considering the call orders and frequency of APIs together, AH achieves both robustness against evasions and the ability to differentiate genuinely unrelated programs.

4.3. Accuracy test

Based on the results of our previous experiments, we evaluated our birthmark in terms of accuracy. Satisfying resiliency and credibility simultaneously is the key requirement for birthmarks; failing either leads to a serious loss of accuracy. Hence, false positives and false negatives must be evaluated [7]. For this purpose, precision and recall [20] are widely used: higher precision and higher recall denote the credibility and resiliency of a birthmark, respectively [7]. In this experiment, following [7], we use the F-measure, the harmonic mean of precision and recall [20], which takes both credibility and resiliency into account. Precision, recall, and F-measure are calculated as follows:

precision = |{CC} ∩ {PC}| / |{PC}|    (5)

recall = |{CC} ∩ {PC}| / |{CC}|    (6)

F-measure = (2 × precision × recall) / (precision + recall)    (7)

Fig. 7. Comparisons in terms of accuracy.
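These three metrics are computed over sets of program pairs. A short sketch, with the data layout (Python sets of program-pair tuples) being our own assumption:

    def evaluation_scores(cc, pc):
        """cc: pairs labeled "copied" by human experts ({CC});
        pc: pairs classified as "copied" by a birthmark ({PC})."""
        tp = len(cc & pc)                        # correctly detected copies
        precision = tp / len(pc) if pc else 0.0  # Formula (5)
        recall = tp / len(cc) if cc else 0.0     # Formula (6)
        f_measure = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)  # Formula (7)
        return precision, recall, f_measure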

In Formulas (5) and (6), {CC} is the set of program pairs classified by human experts as "copied," and {PC} is the set of program pairs classified by a birthmark as "copied." The F-measure lies between 0 and 1; it reaches its best value at 1 and its worst at 0.

As explained earlier, the program pairs in question are judged as copies or independent programs based on the plagiarism detection threshold ε, which trades off the credibility and resiliency of a birthmark. If ε is set low, e.g., 0.2, a birthmark is highly likely to judge an arbitrary program pair as plagiarism; that is, its credibility is low and its resiliency is high. In contrast, if ε is set high, e.g., 0.8, credibility will be high and resiliency low. Therefore, existing methods select their own ε to achieve the best tradeoff between credibility and resiliency. For example, Schuler et al. [16] used a classification scheme with a threshold of 0.8, such that the similarity range [0.8, 1] is classified as copies; Myles et al. [6] and Zhou et al. [26] also set a plagiarism threshold of 0.8; Chae et al. [10] used a threshold of 0.6; and Choi et al. [4] set a threshold of 0.65. However, we have claimed that an important evaluation point is whether a birthmark shows consistently high credibility and resiliency across a wide range of ε values. Therefore, we set our threshold range to [0.6, 0.8], which covers all threshold values used in previous work. As ε moves from 0.6 to 0.8, the F-measure values draw a curve that represents the effectiveness of the birthmark. To draw an F-measure curve for each method, we let each method compute similarities between all pairs of the benchmark programs in Table 1 and detect "copied" samples while varying the threshold between 0.6 and 0.8.

Fig. 7 shows the F-measure curves; the x-axis represents the threshold ε, and the y-axis represents the F-measure at that threshold. SACF shows a poor F-measure throughout the threshold range: in [0.6, 0.7], its accuracy is lower than 0.62, and as the threshold increases from 0.7, its F-measure rises, reaching its best value of 0.92 at a threshold of 0.8. The major cause of its poor accuracy at low thresholds is that, as mentioned in the previous section, SACF indicates similarity higher than 50% for 28 pairs of different programs. We analyzed these failures and found that SACF tends to indicate high similarity between different programs that use several common APIs whose frequencies are higher than those of any other APIs. For example, SACF rates the similarity between AkelPad and CuteFTP at 79.6%, whereas all the other birthmarks rate it lower than 11%, because the two programs have nearly half of their APIs in common. As a result, when the plagiarism threshold is too low, SACF is prone to misjudge different programs as copies.

FP shows a satisfactory F-measure of 0.93 in the range [0.6, 0.66], but as the threshold increases, its performance drops below 0.87. The main reason is FP's low resiliency; it tends to indicate low similarity (lower than 72%) even between copied programs. In addition to the cases described in the resiliency test, FP shows a low similarity of 70.7% for the two versions of ALzip, 69.1% for Putty, 67.5% for EditPlus, and 52.3% for CuteFTP.

SA shows a reasonable F-measure across the overall threshold range: 0.92 in [0.6, 0.76] and 0.90 in [0.78, 0.8]. Nonetheless, our proposed birthmark shows higher accuracy across the entire range. The AH birthmark returns a consistent F-measure of 0.96, the highest value in our comparison, regardless of the threshold. Unlike the other birthmarks, AH shows low similarity between different programs and high similarity between copied programs, which indicates that its credibility and resiliency are simultaneously high. AH outperforms the other static birthmarks in terms of F-measure, showing its good potential to correctly detect software plagiarism.

4.4. Scalability test

To evaluate the scalability of our birthmark, we measured the execution time of the proposed and existing methods according to program size. The execution time consists of the time required to create two birthmarks from two programs and then compute the similarity between them. If a birthmark is scalable, the execution time should increase gradually as the size of the program grows. We conducted these experiments on a Windows 7 64-bit operating system with a 3.4 GHz Intel Core 2 Quad CPU and 4 GB of RAM.

Fig. 8. Comparisons in terms of scalability.

The results are shown in Fig. 8. Every point in the figure corresponds to a pair of versions of a single program: the x-axis shows the median of the two programs' sizes, and the y-axis shows the execution time to extract the birthmarks and compute the similarity for that pair (note that the y-axis is in log scale). The execution times of our birthmark and SACF are not significantly affected by program size. SACF shows the shortest execution time because it simply counts the appearances of each API and uses


cosine similarity to compare frequency vectors. The time required to generate AH is slightly longer than that of SACF because constructing the A-CFG and running RWR take more time than just counting the appearances of each API. Nonetheless, our birthmark is satisfactory in terms of scalability. In contrast, SA and FP are quite sensitive to the size of the target program: on average, SA takes almost 12 times longer than AH, and FP almost 83 times longer. SA and FP are computationally expensive because they use the maximum weighted bipartite matching algorithm to compute similarity, with a time complexity of O(n³) [7], where n is the number of components that make up the birthmark (API sets per function in SA, and assembly instruction sequences in FP). SA's similarity computation does not take too long because only tens to tens of thousands of functions exist in a single program. However, FP, which uses all possible instruction sequences in a program as components of the birthmark, takes an extremely long time in the similarity computation phase, because a single program can have from twenty thousand to millions of possible sequences. For example, 1,331,222 sequences are extracted from WinSCP 4.3.7 and 1,335,319 from WinSCP 4.3.9; FP takes about 6.4 hours for birthmark extraction and similarity computation on the WinSCP pair, whereas our proposed method takes only 97 s. We conclude that AH extraction and comparison can be completed within a reasonable time.

4.5. Case study

In this section, we evaluate our birthmark with real plagiarism samples. As mentioned in Section 4.1, no reputable, public samples that really plagiarize commercial Windows programs exist. Therefore, in this experiment, we manually created real plagiarism samples by performing a number of semantics-preserving transformations on some open-source programs. Among our benchmark programs, we used two open-source programs to generate the samples: NcFTP 8.3.2 and Notepad++ 6.1.4 (hereafter, NcFTP and NotePad++, respectively). We performed the following four types of semantics-preserving transformations; note that the higher the type number, the more the transformations change the program structure.

• Type-1. Format Alteration and Identifier Renaming (FAIR): FAIR is the simplest form of plagiarism, which includes inserting or deleting spaces and comments in source code and changing the names or types of variables in a program.


Consequently, we generated eight plagiarism samples: NcFTP_type-1, NcFTP_type-1+2, NcFTP_type-1+2+3, NcFTP_type-1+2+3+4, NotePad++_type-1, NotePad++_type-1+2, NotePad++_type-1+2+3, and NotePad++_type-1+2+3+4. As an example, NcFTP_type-1+2+3 indicates the plagiarized program generated by applying transformation types 1, 2, and 3 together to NcFTP's original source code. We then calculated the similarity between the original programs, i.e., NcFTP and NotePad++, and their four plagiarized samples each, using our proposed method and the compared methods. In addition, to evaluate credibility, we calculated the similarity between the original programs and the other, independently developed programs in our benchmarks. Based on the similarity values, we judged whether each program pair is plagiarized or not. As in Section 4.3, we used as the evaluation metric the F-measure based on the plagiarism threshold ε, which varies from 0.6 to 0.8; a sketch of this judgment-and-scoring procedure appears after this discussion.

The experimental results are shown in Fig. 9. In the figure, the x-axis represents the threshold value ε, and the y-axis represents the F-measure score at each threshold. Our method, AH, achieves the highest F-measure scores across the entire threshold range. In the range [0.6, 0.7], SA and FP outperform SACF, but SACF outperforms those two methods when the threshold is higher than 0.7. More specifically, SA and FP show a precision of 1 and a recall of 0.75 over the entire range. The reason for their low recall is that they cannot detect the plagiarism samples NotePad++_type-1+2+3+4 and NcFTP_type-1+2+3+4, which are modified with function-level transformations. Because SA defines its birthmark based on the user-defined functions in a program, it is vulnerable to function-level transformations; the sequences of assembly instructions in FP are also significantly affected by them.
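As referenced above, the following C sketch shows the threshold-based judgment and F-measure computation used in this evaluation (our illustration; the array names and the convention of judging "plagiarized" at or above ε are assumptions):

/* Threshold-based judgment and F-measure (illustrative sketch):
   sim[i] holds the birthmark similarity of the i-th program pair,
   is_copy[i] its ground-truth label, and eps the plagiarism threshold. */
double f_measure(const double *sim, const int *is_copy, int n, double eps)
{
    int tp = 0, fp = 0, fn = 0;
    for (int i = 0; i < n; i++) {
        int judged_copy = (sim[i] >= eps);   /* judge "plagiarized" at or above eps */
        if (judged_copy && is_copy[i])        tp++;
        else if (judged_copy && !is_copy[i])  fp++;
        else if (!judged_copy && is_copy[i])  fn++;
    }
    if (tp == 0)
        return 0.0;
    double precision = (double)tp / (tp + fp);
    double recall    = (double)tp / (tp + fn);
    return 2.0 * precision * recall / (precision + recall);
}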

[Fig. 9. F-measure result on real plagiarism samples. Axes: threshold ε (x, 0.6–0.8) vs. F-measure (y, 0.8–1); curves: AH, SA, SACF, FP.]

On the other hand, the proposed method detects all the plagiarism samples correctly, showing a recall of 1 for all threshold values: the A-CFG construction of our proposed method integrates all the user-defined functions into one massive function, which makes our birthmark robust against function-level transformations. SACF also detects all the plagiarism samples correctly because the call frequency of each API is not greatly affected by function-level transformations. However, SACF misjudges several independent program pairs as copies; thus the precision of SACF is 0.72 in the range [0.6, 0.7], 0.8 in [0.72, 0.74], and 0.88 in [0.76, 0.8]. The same tendency appeared in the experiments described in Section 4.3, which strongly indicates the poor credibility of SACF.

In conclusion, the proposed method correctly detects all the plagiarized samples and successfully differentiates independently developed programs, whereas the compared methods do not satisfy the credibility and resiliency requirements simultaneously. To see why function-level transformations defeat function-based birthmarks, consider the inlining sketch below.
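The following C fragment (our own example) sketches a Type-4 function inlining: the callee disappears as a separate user-defined function while the program's semantics are unchanged, which perturbs birthmarks defined per user-defined function but not a birthmark built on the integrated A-CFG:

/* Type-4 function obfuscation via inlining (illustrative example). */

static int square(int x) { return x * x; }

int sum_of_squares(const int *a, int n)          /* before inlining */
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += square(a[i]);
    return s;
}

int sum_of_squares_inlined(const int *a, int n)  /* after inlining */
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * a[i];
    return s;
}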

5. Conclusions

We have proposed a credible, resilient, and scalable birthmark for detecting software plagiarism. Existing birthmarks fail to satisfy all three requirements together: existing frequency-based birthmarks suffer from poor credibility, and existing sequence-based birthmarks have poor resiliency and scalability. To develop a birthmark that achieves the three requirements together, we constructed an A-CFG that reflects the full structure of a program and used it to generate our novel birthmark, Authority Histograms (AH). AH reflects the structural characteristics of the program, thereby capturing not only the frequencies of APIs but also their call orders. In addition, by employing RWR when generating AH, our birthmark not only successfully inherits information about the frequencies and call orders of APIs but is also fully applicable even to large programs.

Through extensive experiments, we found that AH shows consistently high resiliency and credibility compared with state-of-the-art static birthmarks: it is unaffected by semantic-preserving transformations such as changing compiler optimization options or program version updates, and it successfully distinguishes independently developed programs, whereas the other birthmarks show weaknesses in credibility or resiliency. Thus, AH showed the highest accuracy in plagiarism detection regardless of the threshold value. We have also shown that AH is scalable to large, commercial Windows programs. Lastly, our case study shows that AH detects plagiarism better than the existing methods.

We have implemented and tested our proposed idea on the Windows C/C++ platform. However, the core concepts we use, such as APIs and control flow graphs, are generic to any platform. Therefore, we expect that our idea can easily be applied to other platforms; as a first step, we are considering implementing it on the Java platform in future work.

Acknowledgments

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. NRF-2014R1A2A1A10054151).

References

[1] C. Liu, C. Chen, J. Han, P.S. Yu, GPLAG: detection of software plagiarism by program dependence graph analysis, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2006, pp. 872–881.
[2] H. Tamada, M. Nakamura, A. Monden, K.-I. Matsumoto, Design and evaluation of birthmarks for detecting theft of Java programs, in: Proceedings of the IASTED Conference on Software Engineering, 2004, pp. 569–574.
[3] X. Wang, Y.-C. Jhi, S. Zhu, P. Liu, Behavior based software theft detection, in: Proceedings of the 16th ACM Conference on Computer and Communications Security, ACM, 2009, pp. 280–290.
[4] S. Choi, H. Park, H.-I. Lim, T. Han, A static API birthmark for Windows binary executables, J. Syst. Softw. 82 (5) (2009) 862–873.
[5] H. Tamada, K. Okamoto, M. Nakamura, A. Monden, K.-I. Matsumoto, Dynamic software birthmarks to detect the theft of Windows applications, in: Proceedings of the International Symposium on Future Software Technology, vol. 20, 2004.
[6] G. Myles, C. Collberg, k-gram based software birthmarks, in: Proceedings of the 2005 ACM Symposium on Applied Computing, ACM, 2005, pp. 314–318.
[7] H.-I. Lim, H. Park, S. Choi, T. Han, A method for detecting the theft of Java programs through analysis of the control flow information, Inf. Softw. Technol. 51 (9) (2009) 1338–1350.
[8] Y.-C. Jhi, X. Wang, X. Jia, S. Zhu, P. Liu, D. Wu, Value-based program characterization and its application to software plagiarism detection, in: Proceedings of the 33rd International Conference on Software Engineering, ACM, 2011, pp. 756–765.
[9] S. Choi, H. Park, H.-I. Lim, T. Han, A static birthmark of binary executables based on API call structure, in: Advances in Computer Science – ASIAN 2007. Computer and Network Security, Springer, 2007, pp. 2–16.
[10] D.-K. Chae, S.-W. Kim, J. Ha, S.-C. Lee, G. Woo, Software plagiarism detection via the static API call frequency birthmark, in: Proceedings of the 28th Annual ACM Symposium on Applied Computing, ACM, 2013, pp. 1639–1643.
[11] T.H. Haveliwala, Topic-sensitive PageRank, in: Proceedings of the 11th International Conference on World Wide Web, ACM, 2002, pp. 517–526.
[12] H. Tong, C. Faloutsos, J.-Y. Pan, Fast random walk with restart and its applications, in: Proceedings of the 6th International Conference on Data Mining, IEEE, 2006, pp. 613–622.
[13] A. Aiken, et al., Moss: A System for Detecting Software Plagiarism, University of California, Berkeley, 2005. See www.cs.berkeley.edu/aiken/moss.html.
[14] M.J. Wise, YAP3: improved detection of similarities in computer program and other texts, in: Proceedings of the ACM SIGCSE Technical Symposium on Computer Science Education, ACM, 1996, pp. 130–134.
[15] C. Collberg, C. Thomborson, Software watermarking: models and dynamic embeddings, in: Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM, 1999, pp. 311–324.
[16] D. Schuler, V. Dallmeier, C. Lindig, A dynamic birthmark for Java, in: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, ACM, 2007, pp. 274–283.
[17] H. Park, S. Choi, H.-I. Lim, T. Han, Detecting Java theft based on static API trace birthmark, in: Advances in Information and Computer Security, Springer, 2008, pp. 121–135.
[18] H.-I. Lim, H. Park, S. Choi, T. Han, A static Java birthmark based on control flow edges, in: Proceedings of the 33rd Annual IEEE International Conference on Computer Software and Applications, COMPSAC'09, vol. 1, IEEE, 2009, pp. 413–420.
[19] R. Jonker, A. Volgenant, A shortest augmenting path algorithm for dense and sparse linear assignment problems, Computing 38 (4) (1987) 325–340.
[20] J. Han, M. Kamber, J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.
[21] A. Aizawa, An information-theoretic perspective of TF-IDF measures, Inf. Process. Manag. 39 (1) (2003) 45–65.
[22] C. Eagle, The IDA Pro Book: The Unofficial Guide to the World's Most Popular Disassembler, No Starch Press, 2011.
[23] F.E. Allen, Control flow analysis, ACM SIGPLAN Notices 5 (1970) 1–19.
[24] L.P. Cordella, P. Foggia, C. Sansone, M. Vento, Performance evaluation of the VF graph matching algorithm, in: Proceedings of the International Conference on Image Analysis and Processing, IEEE, 1999, pp. 1172–1177.
[25] M. Jang, J. Kook, S. Ryu, K. Lee, S. Shin, A. Kim, Y. Park, E.H. Cho, An efficient similarity comparison based on core API calls, in: Proceedings of the 28th Annual ACM Symposium on Applied Computing, ACM, 2013, pp. 1634–1638.
[26] X. Zhou, X. Sun, G. Sun, Y. Yang, A combined static and dynamic software birthmark based on component dependence graph, in: Proceedings of the International Conference on Intelligent Information Hiding and Multimedia Signal Processing, IIHMSP'08, IEEE, 2008, pp. 1416–1421.