Efficient monitoring and forensic analysis via accurate network-attached provenance collection with minimal storage overhead


Accepted Manuscript

Authors: Yulai Xie, Dan Feng, Xuelong Liao, Leihua Qin

PII: S1742-2876(18)30025-2

DOI: 10.1016/j.diin.2018.05.001

Reference: DIIN 778

To appear in: Digital Investigation

Received Date: 17 January 2018

Revised Date: 20 March 2018

Accepted Date: 6 May 2018

Please cite this article as: Xie Y, Feng D, Liao X, Qin L, Efficient monitoring and forensic analysis via accurate network-attached provenance collection with minimal storage overhead, Digital Investigation (2018), doi: 10.1016/j.diin.2018.05.001. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT

Digital Investigation 00 (2018) 1–11

Efficient monitoring and forensic analysis via accurate network-attached provenance collection with minimal storage overhead

Yulai Xie a,∗, Dan Feng a,b, Xuelong Liao a, Leihua Qin a

a School of Computer, Huazhong University of Science and Technology, Wuhan, People’s Republic of China

b Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan, People’s Republic of China

Abstract

Provenance, the history or lineage of an object, has been used to enable efficient forensic analysis in intrusion prevention systems to detect intrusions, correlate anomalies, and reduce false alerts. In network-attached environments in particular, it is critical to capture the network context accurately in order to trace back the intrusion source and identify the system vulnerability. However, most existing methods fail to collect accurate and complete network-attached provenance. In addition, enabling efficient forensic analysis with minimal provenance storage overhead remains a big challenge. This paper proposes a provenance-based monitoring and forensic analysis framework called PDMS that builds upon an existing provenance tracking framework. On the one hand, it monitors and records every network session and collects the dependency relationships between files, processes and network sockets. By carefully describing and collecting the network socket information, PDMS can accurately track the data flow in and out of the system. On the other hand, the framework unifies efficient provenance filtering and query-friendly compression. Evaluation results show that the framework supports accurate and highly efficient forensic analysis with minimal provenance storage overhead.

Keywords: Provenance, forensic analysis, provenance filtering, provenance compression

1. Introduction


A variety of security mechanisms (e.g., encryption and access control) have been adopted to protect against intrusion and data leaks. However, as computer systems grow more complicated, there are always vulnerabilities (e.g., poorly configured firewall rules and weak passwords) that are likely to be exploited. Because of hacker attacks, insider leaks, the abuse of administrator privileges and other causes, a computer system is easily compromised, leading to the loss or leakage of data. For instance, in April 2010, the account information of over six million Internet users was leaked due to the weak cryptography used by the China Software Developer Network company (Daily, 2010); in April 2014, the Heartbleed security bug found in OpenSSL affected about half a million web servers in the wild (Wakefield, 2014).

After an intrusion or data leak occurs, a big challenge is to investigate how it happened. Existing methods typically develop tools (King and Chen, 2005; King et al., 2005) or systems (Goel et al., 2005; Xie et al., 2016) to explore the causality-based context in the system or disk log. This causality-based context, which we term provenance, describes how data came to its present state and enables monitoring and forensic analysis by capturing the data flow and dependency relationships between data objects. Provenance has been widely used for recording experimental details, debugging, optimizing search (Shah et al., 2007), and rebuilding data (Xie et al., 2013a). Provenance-based methods have also been used in both local (Pohly et al., 2012) and distributed environments (Zhou et al., 2011; Tariq et al., 2011; Gehani et al., 2010) to trace back the intrusion source.

However, two challenges remain. First, how can network socket information be collected accurately and completely? In a networked environment, missing any network intrusion information can cause severe problems. For instance, viruses can propagate between hosts very quickly, and failing to capture such propagation can be disastrous for the computer system. Yet existing methods either do not record the IP and port information (Pohly et al., 2012) or cannot capture the provenance of short-lived socket connections (Gehani and Tariq, 2012). Second, since the size of provenance keeps increasing, how can provenance be stored efficiently while keeping enough evidence for monitoring and forensic analysis?

∗ Corresponding author. Email addresses: [email protected] (Yulai Xie), [email protected] (Dan Feng), [email protected] (Xuelong Liao), [email protected] (Leihua Qin)

To this end, we present the design and performance evaluation of a Provenance-aware Data Monitor System called PDMS that uses provenance to accurately monitor the data flow in and out of the system. The design of PDMS is based on the assumption that a central server cannot assure the security of the sensitive data stored on it using traditional technologies (e.g., cryptography and authentication). Specifically, PDMS has the following two unique advantages.

First, PDMS is built upon PASS (Muniswamy-Reddy et al., 2006), but extends it to capture every network socket connected to the local host by monitoring and recording every network session from the moment it is established. PDMS treats a network socket as a file object and collects the dependency relationships between files, pipes, processes and network sockets. By carefully describing and collecting the network socket information, PDMS can accurately track the data flow in and out of the system. Although some works have identified the usefulness of provenance for capturing network sessions, to the best of our knowledge no work systematically implements provenance capture and analysis for network sockets.

Second, previous methods achieve efficient provenance storage by either filtering provenance data or applying compression. In contrast, PDMS does both: it filters unnecessary data and provides query-friendly compression. Unnecessary data include the pipe data and some argument data that provide little value for monitoring and forensic analysis but take up substantial storage space.

In addition, PDMS can set different storage policies to satisfy different storage and query requirements. For instance, PDMS can employ bzip2 to obtain the smallest storage size for offline access. We also evaluate the storage overhead and forensic analysis efficiency of PDMS on a variety of security-critical applications. PDMS can perform accurate and prompt forensic analysis of system vulnerabilities and intrusion sources with minimal provenance storage overhead.

The contributions of this paper are as follows:

1. The systematic implementation and accurate capture of network socket information for forensic analysis.
2. An efficient method that unifies both provenance filtering and compression.
3. The design of a provenance-based monitoring and forensic analysis framework that builds upon an existing provenance collection and tracking framework.
4. A comprehensive evaluation of PDMS on provenance storage overhead and forensic analysis efficiency.

The rest of the paper is organized as follows. We present the background, threat model and design goals in Section 2. We elaborate on the design, a case study and the evaluation of PDMS in Sections 3, 4 and 5, respectively. In Section 6, we summarize the related work. In Section 7, we conclude the paper.

2. Setting

In this section, we first give an overview of provenance and provenance-aware systems, and then elaborate on the threat model used for monitoring and forensic analysis. Finally, we state our design goals.

2.1. Provenance and Provenance-aware Systems

Provenance refers to the source or the historical information of an object. In the system domain, provenance comprises all the processes, parameters and execution environments that affect the final state of data. Early research on provenance concentrated mainly on databases and workflows, and later gradually extended to storage systems (i.e., provenance-aware storage systems such as PASS (Muniswamy-Reddy et al., 2006) and SPADE (Gehani and Tariq, 2012)). These provenance systems collect provenance automatically and transparently, and maintain the consistency between data and its provenance. The Harvard PASS system captures the data flow between files, pipes and processes, and uses causality-based provenance graphs to represent the event information accurately. In PASS, provenance can be described as a directed acyclic graph, where the nodes refer to objects such as files and processes, and the edges indicate the dependency relationships between them. For instance, a process issuing a write system call to a file indicates that this file depends on the data from the process, and this implies an edge from the file pointing to the process. The provenance is batch-loaded into BerkeleyDB for persistent storage and efficient query. Currently, PASS can be deployed in local systems, network-attached environments (Muniswamy-Reddy et al., 2009) and the cloud (Muniswamy-Reddy and Holland, 2010).
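The edge convention used for PASS-style provenance graphs (Section 2.1: a written file depends on the writing process, so the edge points from the file to the process) can be illustrated with a minimal sketch. The class name `ProvenanceGraph`, its methods and the node labels are hypothetical and chosen only for illustration; the sketch also ignores object versions and is not the PASS implementation.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy dependency graph: depends_on[x] holds the objects x depends on."""

    def __init__(self):
        self.depends_on = defaultdict(set)

    def add_dependency(self, dependent, source):
        # Edge "dependent -> source": dependent was derived from source.
        self.depends_on[dependent].add(source)

    def record_write(self, process, file):
        # A write makes the file depend on the process (edge file -> process).
        self.add_dependency(file, process)

    def record_read(self, process, file):
        # A read makes the process depend on the file (edge process -> file).
        self.add_dependency(process, file)

    def ancestors(self, node):
        # Collect every object that node transitively depends on.
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for parent in self.depends_on.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = ProvenanceGraph()
graph.record_read("process:cp", "file:a.txt")    # cp reads a.txt
graph.record_write("process:cp", "file:b.txt")   # cp writes b.txt
print(sorted(graph.ancestors("file:b.txt")))
```

Following the edges from `file:b.txt` reaches both the copying process and the file it read, which is the kind of backward traversal that forensic queries rely on.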

2.2. Threat Model and Assumption

PDMS mainly performs forensic analysis of intrusions that exploit application or process vulnerabilities. For instance, a remote intruder launches an exploit on the vsftpd backdoor vulnerability in the local system and gets a root shell. She then performs a series of malicious operations, such as tampering with data in the file system or placing a backdoor file in a directory. PDMS intercepts all the system calls and network sockets during this process, generates provenance graphs that show the intrusion behavior in detail, and then traverses the provenance graphs to identify the vulnerability or intrusion source. For instance, Figure 1 shows a provenance graph generated by PDMS. It describes how an intruder exploits the vsftpd daemon, downloads a bzipped file, uncompresses it, and then creates a file in the system. However, some malicious behaviors (e.g., memory leaks) that do not perform explicit information flow control or do not invoke system calls cannot be tracked by PDMS, because these activities do not generate provenance.

We assume provenance is not lost or corrupted in PDMS. First, provenance is written to disk ahead of the original data, which ensures that no data loses its provenance when the system crashes or is invaded. Second, we assume that PDMS can prevent undetected provenance rewrites by employing a trusted platform (Lyle and Martin, 2010) or a sophisticated security policy (Hasan et al., 2009). Securing provenance is outside the scope of this paper.

[Figure 1 node labels: socket FROMIP:192.168.229.129:41190, vsftpd, vsftpd, sh, id, ls, wget, tar, vi, bzip2, PIPE, klibc2.0.1.tar.bz2, 1.txt, socket TOIP:192.168.229.2:53, socket FROMIP:192.168.229.129:37938]

Figure 1. An example of provenance-based intrusion graph. We use boxes, ovals, and diamonds to show the processes, files and sockets respectively. We omit most of their attributes in the graph.

2.3. Design Goals

We design PDMS with two goals. First, we aim to capture network socket information systematically and accurately for monitoring and forensic analysis. Second, we aim to enable highly efficient forensic analysis with minimal provenance storage overhead.

3. Design and Implementation

Figure 2 shows the overall architecture of the provenance-aware data monitor system. It builds on PASS, but enhances it with a component called NAP (Network-aware Processing). NAP is responsible for accurately monitoring the network socket information and constructing the provenance graphs that describe the relationships between the local server process, the network socket and the data it accesses. The collected provenance is then passed to the Provenance Filter, which eliminates unrelated and duplicated provenance information. Finally, the Provenance Recorder stores the provenance information according to the chosen requirements and policy.

3.1. NAP

3.1.1. Description of Network socket provenance

Although PASS supports collecting provenance over a network file system, it cannot collect information about the sockets that access the local machine. For example, PASS cannot record the IP address and port number from which a remote attack is launched to destroy or steal data on the local PASS system. Therefore, we modify PASS so that it tracks and records network sockets. The socket information does not represent a specific piece of data, but the data flowing into or out of the storage system during a certain period of time, together with the destination of that flow. Table 1 shows an example of a network socket object.

Table 1. An example of a network socket object

Attribute          Value
Source Port        21
Source IP          198.163.137.2
Destination Port   20
Destination IP     202.0.0.1
User ID            root
Time               20:34
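As an illustration of the network socket object in Table 1, the following sketch models the six attributes as an immutable record. The class name `SocketObject`, the field names and the `flow` helper are hypothetical; the paper stores these attributes inside the kernel inode structure rather than in a user-level object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SocketObject:
    """Attributes of a network socket object, mirroring the rows of Table 1."""
    source_port: int
    source_ip: str
    destination_port: int
    destination_ip: str
    user_id: str
    time: str

    def flow(self) -> str:
        # Human-readable "source -> destination" summary of the data flow.
        return (f"{self.source_ip}:{self.source_port} -> "
                f"{self.destination_ip}:{self.destination_port}")


# The example values are the ones listed in Table 1.
example = SocketObject(21, "198.163.137.2", 20, "202.0.0.1", "root", "20:34")
print(example.flow())
```

Freezing the record reflects that a socket object describes a flow that has already happened and should not be modified afterwards.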

We mainly capture three kinds of dependency relationships to describe the provenance information: (1) dependencies between processes, e.g., a parent process forks a child process, or two processes share a common memory area; (2) dependencies between processes and files, e.g., a process reading or writing a file indicates that the process depends on, or is depended on by, the file; (3) dependencies between processes and network socket objects, e.g., a local server process sends or receives data via network sockets. For convenience of management, the Linux system assigns each socket a file descriptor, so receiving and sending data via a network socket resembles reading and writing a file. Assuming that B is a network socket object and P is the receiving or sending process associated with B, the socket system call “send” creates “B → P”, and “receive” creates “P → B”.

[Figure 2 components: Application, Socket, PASS, NAP and BerkeleyDB; Provenance Filter and Provenance Recorder with the storage options Bzip2 and Hybrid + BerkeleyDB; the diagram separates user space from kernel space.]

Figure 2. Architecture of Provenance-aware data monitor system

3.1.2. Implementation of Network socket provenance

We treat each network socket object as a file, so reading or writing a socket is similar to reading or writing a file; typically, both go through the VFS layer. The difference is that the kernel can determine whether the current object is an ordinary file or a socket by looking up the i_mode field in the inode structure. If it is a socket, the IP address of the socket can be obtained via the container_of method. We do not use a dedicated provenance structure for sockets, but incorporate the provenance information of a socket directly into the inode structure. Note that PASS does not initialize the provenance information of the socket in the inode structure until the network socket session is established.

NAP extends PASS by adding a series of functions that support the initialization and generation of socket provenance, as shown in Figure 3, where the added functions are shown in the gray box. The interceptor_accept and interceptor_connect functions intercept the accept and connect system calls. The accept system call indicates that the local host receives an external network connection, while the connect system call launches a network connection to an external IP. The observer_initsocket function initializes the variables of the network socket object in the inode structure. For example, it asks the Lasagna file system for the pnode number that uniquely identifies a network socket object. The observer_socket function collects the IP address and port number associated with the socket object and correlates the socket object with the current process. We use the attribute name FROMIP to describe the IP address in the case of the accept system call and TOIP in the case of the connect system call. Then the functions analyzer_pawrite and distributor_pawrite are invoked in turn to write this provenance information into the Lasagna file system, which saves provenance in a log file.

After the network connection is established, when the host sends data out (e.g., data leakage), the sys_send system call is invoked, and the intercept_send function intercepts it and extracts information such as the destination IP, the port number and the type of the sent object. The observer_send function associates the sending process with the socket object, then invokes the analyzer_pawrite function to write this association into the trace log. Similarly, the intercept_recv and observer_recv functions generate provenance when the local host receives data from outside.

The following shows a subset of the provenance log that describes the socket information in Figure 1.

......
62.1 FROMIP 192.168.242.129:45553
63.0 ARGV [/usr/local/sbin/vsftpd]
62.1 GENERATEDBY [ANC] 63.1
81.1 FROMIP 192.168.242.129:44965
81.1 GENERATEDBY [ANC] 76.1
104.0 NAME /usr/bin/wget
120.1 TOIP 192.168.242.2:53
120.1 GENERATEDBY [ANC] 104.3
......
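To make the provenance log excerpt of Section 3.1.2 concrete, the sketch below parses such records and pairs each socket with the object named in its GENERATEDBY record. It assumes, as a reading of the excerpt rather than a documented format, that every record has the shape `<pnode>.<version> <ATTRIBUTE> <value>` and that the last token of a GENERATEDBY value identifies the ancestor. The names `Record`, `parse_log` and `socket_edges` are hypothetical.

```python
import re
from typing import NamedTuple


class Record(NamedTuple):
    pnode: int
    version: int
    attribute: str
    value: str


# Assumed record shape: "<pnode>.<version> <ATTRIBUTE> <value...>"
LINE = re.compile(r"^(\d+)\.(\d+)\s+(\S+)\s+(.*)$")


def parse_log(text):
    """Parse log lines into records, skipping lines that do not match."""
    records = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue  # e.g. the "......" elision markers
        pnode, version, attribute, value = match.groups()
        records.append(Record(int(pnode), int(version), attribute, value))
    return records


def socket_edges(records):
    """Pair each socket endpoint with the ancestor in its GENERATEDBY record."""
    endpoints = {r.pnode: (r.attribute, r.value)
                 for r in records if r.attribute in ("FROMIP", "TOIP")}
    edges = []
    for r in records:
        if r.attribute == "GENERATEDBY" and r.pnode in endpoints:
            ancestor = r.value.split()[-1]  # "[ANC] 63.1" -> "63.1"
            edges.append((endpoints[r.pnode], ancestor))
    return edges


LOG = """......
62.1 FROMIP 192.168.242.129:45553
63.0 ARGV [/usr/local/sbin/vsftpd]
62.1 GENERATEDBY [ANC] 63.1
81.1 FROMIP 192.168.242.129:44965
81.1 GENERATEDBY [ANC] 76.1
104.0 NAME /usr/bin/wget
120.1 TOIP 192.168.242.2:53
120.1 GENERATEDBY [ANC] 104.3
......"""

records = parse_log(LOG)
for endpoint, ancestor in socket_edges(records):
    print(endpoint, "generated by", ancestor)
```

Under these assumptions, the excerpt yields two incoming sockets (FROMIP) and one outgoing socket (TOIP), each linked to the object version that its GENERATEDBY record names.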

3.2. Provenance Filter

The PASS system intercepts application activity, generates provenance, and then imports the provenance into a structured database (e.g., BerkeleyDB (Olson et al., 1999)) for persistent storage and query. For PDMS, however, unrelated provenance items must be filtered out before the provenance is stored and used for system monitoring and forensic analysis. For instance, applying the provenance collected during an intrusion directly to forensic analysis without any filtering guarantees maximum accuracy and completeness, but digging through the provenance data takes a long time. Typically, the raw provenance log files contain many unrelated provenance items, e.g., the temporary files generated during program compilation, the pipe files, and the daemon programs that interact with system processes. PDMS filters out this information because it does not contain intrusion information and may cause false alarms. In addition, from the collected provenance information PDMS retains only the pnode number, the object (e.g., file, process, and socket) name, and the dependency relationships between objects. It discards other information, such as process ID, execution time, environment variables, and input parameters, to keep the provenance small enough for further querying and processing.

[Figure 3 components: Application and Socket; Interceptor (Interceptor_accept, Interceptor_connect, Interceptor_send, Interceptor_recv); Observer (Observer_initsocket, Observer_socket, Observer_send, Observer_recv); Analyzer (Analyzer_pawrite); Distributor (Distributor_pawrite); Lasagna; Provenance Log]

Figure 3. Network-attached Provenance Architecture

3.3. Provenance Recorder

The provenance recorder provides three choices for storing provenance: Bzip2, BerkeleyDB (Olson et al., 1999), and Hybrid+BerkeleyDB. Bzip2 compresses provenance to the maximum extent, but the compressed provenance store cannot be queried. It is therefore suitable when storage space is a precious resource and access is offline. BerkeleyDB is the database used by PASS (Muniswamy-Reddy et al., 2006) to store key-value structured provenance records. Hybrid+BerkeleyDB employs a combination of web-graph-based compression and dictionary-based compression algorithms (Xie et al., 2013b) to find the duplicate data in provenance. This algorithm achieves a good compression ratio by exploiting the similarity and locality in provenance data as far as possible and by eliminating duplicate strings in the attribute information of provenance data. The reduced provenance can still be stored in BerkeleyDB and, unlike bzip2, the algorithm retains a high query speed.

For BerkeleyDB, we employ a series of key-value databases to store the provenance, as shown in Table 2. The pnode uniquely identifies each object. IdentityDB stores the attribute information (e.g., file inode number and process ID) of each object. This database can be omitted if filtering has already removed the related attribute information. ParentDB and ChildDB store the dependency relationships between an object and its parent node or child node. To speed up queries, we use NameDB to store the mapping from the pnode number to the name of a file or process.

Table 2. Provenance Database

Database     Provenance records
IdentityDB   (pnode, attribute)
NameDB       (pnode, name)
ParentDB     (pnode, pnode)
ChildDB      (pnode, pnode)
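The filtering rule of Section 3.2 and the key-value layout of Table 2 can be sketched together. In this sketch, plain Python dictionaries stand in for the BerkeleyDB databases, records are hypothetical `(pnode, attribute, value)` tuples, and the assumption that FROMIP/TOIP values serve as the "name" of a socket is ours; the `PID` attribute is likewise invented for illustration. IdentityDB is omitted, which the text allows once filtering has removed the attribute information.

```python
from collections import defaultdict

# Attributes assumed to carry an object's name (sockets are named by endpoint).
NAME_ATTRS = {"NAME", "FROMIP", "TOIP"}
# Attributes assumed to carry a dependency on an ancestor pnode.
DEPENDENCY_ATTRS = {"GENERATEDBY"}


def filter_records(records):
    """Keep only names and dependency relationships; drop everything else."""
    return [r for r in records
            if r[1] in NAME_ATTRS or r[1] in DEPENDENCY_ATTRS]


def load(records):
    """Build NameDB, ParentDB and ChildDB as in-memory dictionaries."""
    name_db = {}
    parent_db = defaultdict(list)   # pnode -> pnodes it depends on
    child_db = defaultdict(list)    # pnode -> pnodes that depend on it
    for pnode, attribute, value in records:
        if attribute in NAME_ATTRS:
            name_db[pnode] = value
        elif attribute in DEPENDENCY_ATTRS:
            parent_db[pnode].append(value)
            child_db[value].append(pnode)
    return name_db, dict(parent_db), dict(child_db)


raw = [
    (62, "FROMIP", "192.168.242.129:45553"),
    (63, "ARGV", "/usr/local/sbin/vsftpd"),   # argument data: filtered out
    (63, "PID", "1234"),                      # hypothetical attribute: filtered out
    (63, "NAME", "/usr/local/sbin/vsftpd"),
    (62, "GENERATEDBY", 63),
]

kept = filter_records(raw)
name_db, parent_db, child_db = load(kept)
print(len(raw), "->", len(kept), "records after filtering")
```

Keeping ParentDB and ChildDB as mirror images of each other trades storage for lookup speed: a forward query and a reverse query each need only one of the two tables.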

4. PDMS Application: Provenance-Based Data Leak Analysis

To further demonstrate the power of PDMS, we introduce a set of algorithms for provenance-based data leak analysis that offer simple but efficient forensic queries. We begin with two questions that an administrator needs to answer when a data leak happens: where did the data flow, and which data files have the potential intruders accessed? The advantage of PDMS over existing systems is that it accurately captures all the socket information that may reveal the intrusion source, and this socket information provides further clues for identifying all the data that potential intruders may have accessed.

For instance, PDMS can answer the first question by collecting the network socket information and then querying for the network socket in the provenance graph, using the leaked document as the detection point. The query searches the descendant nodes of the leaked file object until a network socket is found. Suppose we use BerkeleyDB to store the provenance records. When a file is found to be leaked, PDMS first gets the pnode number of the file from NameDB based on the file name, then searches ChildDB to find whether there is a path from the file object to a network socket. This search is recursive; Algorithm 1 shows the pseudo code. The recursive function is defined as Findchild(pnode), where pnode is the leaked file. The whole process traverses the descendant nodes of the leaked file object until the leaf node is a network socket. All the objects traversed during this process are added to a provenance graph that shows how the leaked file data flows to the destination IP indicated by the network socket. As BerkeleyDB manages data using a B+ tree, searching for each child node takes O(log N) time, where N is the number of records in ChildDB. Let M be the total number of child nodes traversed during this process; the algorithm then runs in O(M ∗ log N) = O(M log N) time. The space used for storing the newly generated provenance graph is p ∗ s, where p is the number of nodes in the provenance graph and s is the average size of a node.

Algorithm 1 Construction of provenance graph that shows where the leaked file has flowed.
Function: Findchild(pnode)
Input: Node of the leaked file, pnode
Output: Provenance graph that shows where the leaked file flows
1: Query for the child node (denote it as childpnode) of pnode in ChildDB
2: if childpnode is a network socket then
3:     add childpnode into the provenance graph
4:     return;
5: else
6:     add childpnode into the provenance graph
7:     Findchild(childpnode);
8: end if

After analyzing the attack sources, PDMS can do a reverse search to identify all affected files. For example, it can answer the following question: a network connection has been found to be illegal; which documents has this connection stolen? That is, if a file has already been proven to have leaked through a network socket, it is necessary to also check all the other files that the network socket has accessed. This is because after a data leak, criminals may disclose only part of the files they acquired, so it is critical to take remedial measures for the files that have not yet been made public. For example, criminals steal the user IDs and passwords on the server and disclose part of them to the public. In this case, PDMS first finds the network socket associated with the file that contains this part of the user IDs and passwords, and then searches for all the other files containing password data that this network socket read. PDMS can thus notify the users to change their passwords before the criminals disclose the rest of the information, reducing the users’ loss.

Algorithm 2 shows the pseudo code of this process. The algorithm looks up parent nodes recursively, using the network socket as the starting point, until it finds all the ancestor file nodes that the network socket has accessed. All the objects traversed during this process are added to a provenance graph that shows how many files are leaked through the network socket. Let N be the number of records in ParentDB; searching for each parent node takes O(log N) time. Assume M is the total number of parent nodes traversed during this process; the algorithm then has time complexity O(M ∗ log N) = O(M log N). The space used for storing the newly generated provenance graph is p ∗ s, where p is the number of nodes in the provenance graph and s is the average size of a node. Note that we omit the decompression of each provenance record in these two algorithms when provenance is compressed, because the combination of web compression with dictionary encoding has little impact on the query performance (Xie et al., 2013b). In addition, the forensic analysis is offline (i.e., after the intrusion happens), so the query time is not a key point. However, we still measure the basic query time when provenance is not compressed in Section .

Algorithm 2 Construction of provenance graph for reverse search for the file leak case.
Function: Reverseparent(pnode)
Input: Node of network socket, pnode
Output: Provenance graph that shows all the potential leaked files
1: Query for parent node (denote it as parentpnode) of pnode in ParentDB
2: if parentpnode = NULL then
3:     return;
4: else
5:     add parentpnode into the provenance graph
6:     Reverseparent(parentpnode)
7: end if

5. Evaluation

In this section, we first analyze how PDMS helps assess the intrusion source and system vulnerability and measure its time overhead; we then evaluate the space overhead and the impact of provenance collection on system performance.

5.1. Experimental datasets

We use three security-critical applications:

Vsftp: A popular and secure FTP server. A remote attacker can exploit the backdoor vulnerability in version 2.3.4 to obtain a root shell.

Distcc: A C/C++ compiler tool that can process distributed computational workloads. Version 2.X has a buffer overflow loophole that a remote attacker can exploit to obtain a user shell.

Samba: Typically provides file sharing between Linux and Windows. Versions 3.0.20 through 3.0.25 have a command execution vulnerability that can be exploited.
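The two recursive queries of Section 4 (Algorithms 1 and 2) can be sketched as follows. This is an illustrative variant rather than the authors' code: dictionaries stand in for ChildDB and ParentDB, the `is_socket` predicate and the node labels are hypothetical, and the sketch generalizes the pseudo code by following every child or parent of a node and by skipping edges it has already recorded.

```python
def find_children(pnode, child_db, is_socket, graph=None):
    """Forward search (Algorithm 1): follow descendants until sockets."""
    if graph is None:
        graph = set()
    for child in child_db.get(pnode, ()):
        edge = (pnode, child)
        if edge in graph:
            continue  # already recorded; avoids repeated work
        graph.add(edge)
        if not is_socket(child):
            find_children(child, child_db, is_socket, graph)
    return graph


def reverse_parents(pnode, parent_db, graph=None):
    """Reverse search (Algorithm 2): follow ancestors until none remain."""
    if graph is None:
        graph = set()
    for parent in parent_db.get(pnode, ()):
        edge = (parent, pnode)
        if edge in graph:
            continue
        graph.add(edge)
        reverse_parents(parent, parent_db, graph)
    return graph


# Simplified data-leak example: file1 and file2 are read by vi, whose
# output leaves the host through one socket.
SOCKET = "socket FROMIP:192.168.137.3:36568"
child_db = {"file1": ["vi"], "file2": ["vi"], "vi": [SOCKET]}
parent_db = {"vi": ["file1", "file2"], SOCKET: ["vi"]}


def is_socket(node):
    return node.startswith("socket")


leak_path = find_children("file1", child_db, is_socket)
exposed = reverse_parents(SOCKET, parent_db)
print(sorted(leak_path))
print(sorted(exposed))
```

Starting from the leaked `file1`, the forward search reaches the socket; starting from that socket, the reverse search also reaches `file2`, which is how further exposed files would be identified. Each dictionary lookup here is constant time on average, whereas the O(log N) bound in the text refers to B+ tree lookups in BerkeleyDB.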

5.2. Forensic Analysis Efficiency

We first evaluate the forensic analysis efficiency of PDMS using the data leak application discussed in Section 4, and then consider other typical scenarios (e.g., data tampering and malicious file download) to analyze various intrusion cases. For each scenario, we describe the intrusion operations in detail, the initial detection point, and the result of analyzing the system vulnerability and attack source using PDMS.

1) Data leak

Scenario: An attacker exploits the vsftpd vulnerability in a remote target system and gets a root shell. He then browses the files that contain user accounts (for example, account/file1 and account/file2) in the home directory of the administrator.

Detection points: The leaked file1 file.

Analysis Result: If the content of file1 is disclosed by the intruder, the administrator can query the provenance graph of the intrusion path, as shown in Figure 4. The thick line file1 → vi → socket (FROMIP:192.168.137.3:36568) indicates the intrusion path found using file1 as the detection point. The socket (192.168.137.3) in the path shows the IP address of the intrusion source. To judge whether this intrusion source has stolen the content of other documents, we use it as the detection point for further analysis and query. The path socket (FROMIP:192.168.137.3:36568) → vi → file2, also drawn with a thick line, shows that this intrusion source accessed file2 as well, so the content of file2 was probably leaked too. The administrator can then take measures such as immediately notifying the users whose accounts are contained in file1 and file2 to change their passwords, or disabling these accounts and notifying the corresponding users to re-register. Querying with the socket (FROMIP:192.168.137.3:36568) as the detection point also yields the paths Bash → vsftpd → vsftpd → socket and Bash → vsftpd → sh → socket. In particular, the edge vsftpd → sh is abnormal and indicates that the vsftpd vulnerability has been exploited to spawn a shell. The query time overhead is very small: only 0.0114 seconds using file1 as the detection point, and 0.228 seconds using the socket as the detection point.

[Figure 4 node labels: Bash, vsftpd, vsftpd, sh, vi, file1, file2, socket FROMIP:192.168.137.3:48963, socket FROMIP:192.168.137.3:36568]

Figure 4. Provenance graph that describes data leak. We have simplified this graph by omitting the header files or library files that the execution of a process needs. We use boxes, ovals, and diamonds to show the processes, files (including pipes) and sockets respectively.

2) Data tamper

Scenario: An attacker successfully gets a root shell by launching a remote exploit on the samba daemon. He then tampers with the /etc/passwd and /etc/shadow files to add a new root account to the system. After that, he logs into the system like a normal root user.

Detection points: The tampered /etc/passwd and /etc/shadow files.

Analysis Result: Using the detection points above, we obtain the query graph shown in Figure 5. The administrator can easily see that the samba server process (i.e., smbd) has wrongly spawned a /bin/sh process, which indicates that the samba daemon has been exploited. Further analysis of the whole provenance graph accurately identifies the attack source (i.e., IP 192.168.137.3 and port 50186).

[Figure 5 node labels: smbd, smbd, sh, telnet, vi, /etc/passwd, /etc/shadow, socket FROMIP:192.168.137.3:50186, socket TOIP:192.168.137.3:4444]

Figure 5. Provenance graph that describes data tamper. We have simplified this graph by omitting the header files or library files that the execution of a process needs. We use boxes, ovals, and diamonds to show the processes, files (including pipes) and sockets respectively.

3) Malicious file download

Scenario: An attacker exploits the vulnerability of the distccd daemon in a remote machine. He then uses the root shell to download a trojan program into the home directory. After that, he also creates a file in the directory.

Detection points: The trojan file.

Analysis Result: Figure 6 shows the provenance graph that describes the malicious file download. Backtracking the provenance graph from the trojan file as the detection point leads naturally to the distccd daemon. The edge “wget → trojan” reveals that a trojan file has been downloaded, and the abnormal edge “distccd → sh” indicates that the distccd daemon has been exploited. Querying the provenance graph with the distccd daemon as the starting point then yields the socket “FROMIP:192.168.137.3:55200”. The external IP “192.168.137.3” reveals the intrusion source.

4) Summary

Overall, PDMS can accurately monitor and easily analyze the system vulnerabilities or attack sources using the collected provenance graphs. Note that except for the above cases, PDMS can monitor and track all the vulnerable applications as long as

ACCEPTED MANUSCRIPT / Digital Investigation 00 (2018) 1–11

only be 135.6 MB per hour even the read access happens at every moment (i.e., non-stop).

FROMIP:192.168.137.3:55200 sh

distccd

8

socket TOIP:192.168.137.3:4444

telnet

5.4. Impact on system performance

socket

Provenance collection and analysis can have a certain impact on the system performance. Specifically, we measure the data send rate for the simulated workload above when performing the data access operations. The provenance is generated by intercepting the read and send processes. However, the performance impact is extremely small (below 0.2%), as shown in Table 3. An important reason is that the collection and storage of the provenance graphs, though consume system resources, are in different I/O path from the data read and send processes.

sh

PIPE

sh

wget

vi

trojan

RI PT

sh

1.txt

Table 3. The impact of provenance collection on data send rate

Access frequency

Figure 6. Provenance graph that describes malicious file download. We have simplified this graph by omitting the header files or library files that the execution of a process needs. We use boxes, ovals, and diamonds to show the processes, files(including pipes) and sockets respectively.

Data send rate in KB/s (without provenance) 11430 312.39 115.4 32.42 16.42

SC

socket socket TOIP:198.145. TOIP:202.114. 20.140:443 0.242:53

M AN U

Non-stop 3-5 seconds 8-10 seconds 30 seconds 1 minute

they generate provenance. 5.3. Space overhead

Data send rate in KB/s (with provenance) 11420 312.45 115.2 32.43 16.41

% Decreased

0.08% -0.019% 0.17% -0.031% 0.06%

6. Related Work

EP

TE D

PDMS reduces space overhead by employing data filtering and data compression techniques. Figure 7 shows the size of provenance for different applications with different storage policies. The terms “Without socket” and “With socket” indicate that collecting provenance without socket and with socket respectively. The term “With filtering” refers to applying filtering technique on the provenance with socket. The rest terms refer to applying compression techniques or the combination of compression techniques with filtering on the provenance with socket. On one hand, the socket information has only incurred a slight storage overhead, but enables forensic analysis of the attack from the outside. On the other hand, both the filtering technique and compression techniques have significantly reduced the storage overhead. For instance, the filtering technique can compress provenance by 70.8%-74.6%, the web+dictionary compression technique reduces provenance store by 59.4%-61.5%, and the combination of them can compress provenance by 78.0%-82.3%. Bzip, though does not support efficient query, achieves the best performance (compress provenance by 95.5%-97.3%) by integrating itself with the filtering technique. Figure 8 shows the space overhead of provenance with respect to the access frequency for different storage policies. Note that we do not apply filtering technology to the simulated workload which is not targeted for intrusion analysis as the three applications above. However, the provenance store compressed using bzip2 and the web+dictionary algorithm can both achieve good compression rate. For web+dictionary compression algorithm that enables good query performance, the provenance can

Since we propose to implement monitor and forensic analysis via network-attached provenance collection with minimal storage overhead, we first give an overview of traditional methods on forensic analysis, then we elaborate the existing methods on network-attached provenance collection, at last we compare our method with the previous methods on efficient provenance storage. 6.1. Forensic analysis

AC C

A large number of works (Kim and Spafford, 1994; Kiriansky et al., 2002; King and Chen, 2005; King et al., 2005) have concentrated on inspecting the abnormal data or event and making forensic analysis. For example, we can use TripWire (Kim and Spafford, 1994) to check whether a system file has been modified, and employ the sandboxing tool to identify the unusual model of invoking system calls during program execution (Goldberg et al., 1996) or improperly executing the code outside (Kiriansky et al., 2002). Samuel T.King et al. developed Backtracker (King and Chen, 2005) and BDB (King et al., 2005) to analyze the cause of the intrusion via causality-based context. For instance, given a detection point (such as a suspicious process or a leaked file), Backtracker can find all the events that affect this detection point, and the administrator can further analyze the cause of the invasion. This paper varies in that it gives a detailed design, implementation and evaluation of the network-attached provenance. In addition, it saves storage space to the maximum 8

ACCEPTED MANUSCRIPT Size of provenance (MB)

Size of provenance (MB)

9

(b) Distcc

RI PT

(a) Vsftp

0.6 0.5 0.4 0.3 0.2 0.1 0

W it h ou ts oc ke W t ith so ck W et ith f il W ter eb in +d g ic tio W n eb ar y +d ic tio Bz na ip ry 2 +f i Bz lter in ip g 2+ f il ter in g

35 30 25 20 15 10 5 0

W it h ou ts oc ke W t ith so ck W et ith f il W ter eb in +d g ic tio W na eb ry +d ic tio Bz na ip ry 2 +f i Bz lter i ng ip 2+ f il ter in g

1.6 1.4 1.2 1 0.8 0.6 0.4 0.2 0

W it h ou ts oc ke W t ith so ck W et ith f il W ter eb in +d g ic tio W n eb ar y +d ic tio Bz na ip ry 2 +f i Bz lter in ip g 2+ f il ter in g

Size of provenance (MB)

/ Digital Investigation 00 (2018) 1–11

(c) Samba

Web+Dictionary

Bzip2

1200 800 400 0

40

original

Web+Dictionary

30 20 10

2

M AN U

0

1

Bzip2

Space Overhead (MB)

original

3

1

Access Time (hour)

(a) Non-stop

2.5

original

Web+Dictionary

Bzip2

2

SC

1600

Space Overhead (MB)

Space Overhead (MB)

Figure 7. Space overhead of provenance for different applications with different storage policies.

2 Access Time (hour)

(b) Every 3-5 seconds

3

1.5

1

0.5

0

1

2 Access Time (hour)

3

(c) Every one minute

Figure 8. Space overhead of provenance with respect to the access time for different storage policies. The subfigures a, b and c show the results for different access frequency.

and sockets. In addition, PDMS employs compression technologies to reduce the provenance storage overhead.

AC C

EP

TE D

extent. For instance, the PIPE information collected by Backtracker and BDB has been further optimized to reduce storage overhead while retaining the forensic analysis efficiency. Jiang et al. (Jiang et al., 2006) proposed to use color to explicitly identify the intrusion break-in point and reduce the log entries that need to be queried for forensic analysis. This is similar to the filtering technique used in PDMS. However, PDMS has further employed compression technique to reduce the size of logs. Jones et al. (Jones et al., 2011) proposed to leverage provenance to monitor the data leaked to the network or the external mobile device. However, they did not implement the provenance collection mechanism. The provenance-based forensic analysis employs the similar concept in the information flow control system that tracks the behavior of the malware (Yin et al., 2007). In addition, it facilitates intrusion analysis (Tariq et al., 2011; Gehani et al., 2010) and helps identify the compromised node in the distributed environment (Zhou et al., 2011). Wang (Wang, 2010) proposed a novel graph based network forensic analysis system. The system collects digital evidence from heterogeneous sensors deployed on the networks and hosts, and constructs the evidence graph which is used by the reasoning procedure to identify entities in multi-stage attacks and reconstruct the scenario. The evidence graph model is hostcentric, with nodes representing host-level identities. PDMS captures network socket information to construct provenance graphs that are used to make forensic analysis of the intrusion behavior. The nodes in the provenance graphs represent the more specified objects such as intrusion processes, infected files
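The backtracking idea used by Backtracker and by PDMS (start from a detection point and follow dependencies through the graph) can be sketched in a few lines. The following is a minimal illustration only, not the PDMS or Backtracker implementation: the edge list is a hand-written toy loosely modeled on the data leak example of Section 5.2, and the names EDGES, reachable and abnormal_edges, the edge sh → vi, and the rule that a network daemon leading directly to sh is suspicious are all assumptions made for this example.

```python
from collections import defaultdict, deque

# Hand-written toy edges (source, destination): information or control
# flowed from source to destination. The exact edge set is an assumption
# made for illustration; it is not taken from collected provenance.
ATTACKER_SOCKET = "socket FROMIP:192.168.137.3:36568"
EDGES = [
    ("Bash", "vsftpd"),
    ("vsftpd", "sh"),
    ("sh", "vi"),
    ("file1", "vi"),
    ("file2", "vi"),
    ("vi", ATTACKER_SOCKET),
]


def reachable(start, edges, backward=False):
    """Return every node reachable from start, following edges forward
    (what did start influence?) or backward (what influenced start?)."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        if backward:
            adjacency[dst].append(src)
        else:
            adjacency[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def abnormal_edges(edges, daemons=("vsftpd", "smbd", "distccd"), shells=("sh",)):
    """Flag edges where a network daemon leads directly to a shell."""
    return [(src, dst) for src, dst in edges if src in daemons and dst in shells]


if __name__ == "__main__":
    # Query 1: use the leaked file as the detection point.
    print(sorted(reachable("file1", EDGES)))
    # Query 2: use the attacker's socket as the detection point.
    print(sorted(reachable(ATTACKER_SOCKET, EDGES, backward=True)))
    # Highlight the suspicious daemon-to-shell edge.
    print(abnormal_edges(EDGES))
```

Starting from file1 and following edges forward reaches the attacker's socket; starting from that socket and following edges backward also reaches file2 and passes through the vsftpd → sh edge, which mirrors the two queries described for the data leak scenario.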


6.2. Network-attached provenance collection

The Eidetic (Devecsery et al., 2014) system collects provenance information in the local system, but does not monitor network connections. Thus it cannot track the data flow through the network. SPADE (Gehani and Tariq, 2012) targets network provenance in distributed environments; however, it cannot capture short-lived network socket connections due to its synchronous provenance reporting. This may result in important information being missed (e.g., rapid virus propagation among multiple hosts). Hi-Fi (Pohly et al., 2012) aims to collect complete provenance to accurately record and analyze malicious behaviors by employing Linux Security Modules. However, it omits the remote host IP address and port number, and thus cannot monitor external network connections. Provmon (Bates et al., 2015) was further developed on top of Hi-Fi to provide versioning and network context. However, it does not target forensic analysis and does not enable provenance filtering. The provenance-aware NFS (Muniswamy-Reddy et al., 2009) and provenance-aware cloud (Muniswamy-Reddy and Holland, 2010) both capture provenance in the client and store provenance in the server or cloud. They do not record and filter the specific network-attached provenance.

Casey et al. (Casey et al., 2017) proposed a community-developed specification language named CASE that supports automated normalization, combination, correlation and validation of information. They concentrate on provenance information for the evidence authentication process, and use provenance records to capture contextual and descriptive information about objects that are specified by cyber investigation/forensic personnel or tools. PDMS does not use a structure like CASE to represent provenance. Instead, provenance is presented in the form of a graph, which can be more convenient for monitoring and analyzing vulnerabilities or attack sources. PDMS also filters and compresses redundant provenance information and utilizes key-value databases to store provenance information such as objects and dependencies to speed up queries.
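To make the idea of storing filtered objects and dependencies in a key-value database more concrete, the sketch below shows one possible layout. It is an illustration under stated assumptions, not the PDMS storage format: a plain Python dict stands in for the key-value database, and the key prefixes obj: and dep:, the is_noise rule for header and shared-library files, and the sample records are all hypothetical.

```python
import json


def is_noise(name):
    """Hypothetical filter rule: treat header and shared-library files as
    noise (the figure captions in Section 5.2 omit the same kinds of files)."""
    return name.endswith(".h") or name.endswith(".so") or ".so." in name


class ProvenanceStore:
    """Toy key-value layout; a dict stands in for a real key-value database."""

    def __init__(self):
        self.kv = {}

    def put_object(self, name, kind):
        """Store one provenance object (process, file or socket)."""
        self.kv["obj:" + name] = json.dumps({"kind": kind})

    def add_dependency(self, child, parent):
        """Record that child depends on parent, unless either is filtered."""
        if is_noise(child) or is_noise(parent):
            return False
        key = "dep:" + child
        parents = json.loads(self.kv.get(key, "[]"))
        if parent not in parents:
            parents.append(parent)
        self.kv[key] = json.dumps(parents)
        return True

    def parents(self, child):
        """A single key lookup returns the direct ancestors of an object."""
        return json.loads(self.kv.get("dep:" + child, "[]"))

    def ancestors(self, start):
        """Follow dep: keys transitively from a detection point."""
        seen, stack = [], [start]
        while stack:
            for parent in self.parents(stack.pop()):
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


# Sample records loosely modeled on the malicious file download scenario.
store = ProvenanceStore()
store.put_object("trojan", "file")
store.put_object("wget", "process")
store.add_dependency("trojan", "wget")
store.add_dependency("wget", "sh")
store.add_dependency("wget", "libc.so.6")  # dropped by the filter
store.add_dependency("sh", "distccd")
print(store.ancestors("trojan"))
```

Keying dependencies by the child object means a backward query from a detection point needs only one lookup per visited node, which is one plausible reason a key-value layout helps query speed; a production system would swap the dict for a persistent store.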

6.3. Efficient provenance storage

Chapman et al. (Chapman et al., 2008) proposed a series of factorization and inheritance methods to find the common provenance components (e.g., provenance records, provenance nodes and arguments) between different provenance items. These methods exploit the structure of provenance, but not the characteristics of provenance graphs, and are not fine-grained enough. In prior work (Xie et al., 2012, 2013b), we presented an in-depth analysis of provenance characteristics (e.g., the similarity and locality existing in provenance graphs), and accordingly proposed combining web compression and dictionary encoding to exploit the duplicates inherent in provenance graphs. This work further combines this method with provenance filtering to achieve the most space-efficient provenance storage for forensic analysis. Some traditional compressors (e.g., bzip2 (Bzip2) and gzip (Ziv and Lempel, 1977)) can achieve a very good compression ratio, but result in poor query performance since they do not preserve the structure of the provenance. XML compressors (Liefke and Suciu, 2000) are commonly applied to XML-style provenance files, but the reduced provenance store is likewise not amenable to querying. This paper provides different choices for different provenance storage and query requirements.

7. Conclusions

Acquiring network socket information accurately and completely is important for the forensic analysis of intrusions by remote attackers. This paper has systematically implemented a provenance-based monitoring and forensic analysis framework that enables efficient capture of network socket information. In addition, this work has minimized provenance storage overhead by unifying provenance filtering and query-friendly provenance compression. In future work, we would like to use graph databases (e.g., Neo4j) to store provenance graphs more efficiently.

Acknowledgments

This work was supported in part by the National Science Foundation of China under Grant No.61402189 and U1705261, Wuhan Application Basic Research Program under Grant No.2017010201010104, Hubei Natural Science and Technology Foundation and CCF-Venustech Hongyan Research Initiative (2016-015).

References

Bates, A., Tian, D., Butler, K., Moyer, T. Trustworthy whole-system provenance for the Linux kernel. In: Proc. of USENIX Security. 2015. p. 319–334.
Bzip2. Bzip2 compressor. 2014. http://www.bzip.org.
Casey, E., Barnum, S., Griffith, R., Snyder, J., Beek, H., Nelsone, A. Advancing coordinated cyber-investigations and tool interoperability using a community developed specification language. Digital Investigation 2017;22:14–45.
Chapman, A.P., Jagadish, H.V., Ramanan, P. Efficient provenance storage. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2008. p. 993–1006.
Daily, P. CSDN password leak. 2010. http://en.people.cn/90778/7688084.html.
Devecsery, D., Chow, M., Dou, X., Flinn, J., Chen, P.M. Eidetic systems. In: Proc. of USENIX OSDI. 2014. p. 525–540.
Gehani, A., Baig, B., Mahmood, S., Tariq, D., Zaffar, F. Fine-grained tracking of grid infections. In: Proc. of IEEE GRID. 2010. p. 73–80.
Gehani, A., Tariq, D. SPADE: Support for provenance auditing in distributed environments. In: Proc. of ACM/IFIP/USENIX Middleware. 2012. p. 101–120.
Goel, A., Po, K., Farhadi, K., Li, Z., de Lara, E. The taser intrusion recovery system. In: Proc. of ACM SOSP. 2005. p. 163–176.
Goldberg, I., Wagner, D., Thomas, R., Brewer, E. A secure environment for untrusted helper applications. In: Proc. of USENIX Security. 1996. p. 1–14.
Hasan, R., Sion, R., Winslett, M. The case of the fake picasso: Preventing history forgery with secure provenance. In: Proc. of USENIX FAST. 2009. p. 1–14.
Jiang, X., Walters, A., Buchholz, F., Xu, D., Wang, Y., Spafford, E.H. Provenance-aware tracing of worm break-in and contaminations: A process coloring approach. In: Proc. of IEEE ICDCS. 2006. p. 38–46.
Jones, S.N., Strong, C.R., Long, D.D.E., Miller, E.L. Tracking emigrant data via transient provenance. In: Proc. of USENIX TaPP. 2011. p. 1–6.
Kim, G.H., Spafford, E.H. The design and implementation of tripwire: A file system integrity checker. In: Proc. of ACM CCS. 1994. p. 18–29.
King, S.T., Chen, P.M. Backtracking intrusions. ACM Transactions on Computer Systems 2005;23(1):51–76.
King, S.T., Mao, Z.M., Lucchetti, D.G., Chen, P.M. Enriching intrusion alerts through multi-host causality. In: Proc. of NDSS. 2005. p. 1–13.
Kiriansky, V., Bruening, D., Marasinghe, S.A. Secure execution via program shepherding. In: Proc. of USENIX Security. 2002. p. 191–206.
Liefke, H., Suciu, D. XMill: An efficient compressor for XML data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2000. p. 153–164.
Lyle, J., Martin, A. Trusted computing and provenance: Better together. In: Proc. of USENIX TaPP. 2010. p. 1–19.
Muniswamy-Reddy, K.K., Braun, U., Holland, D.A., Macko, P., Maclean, D., Margo, D., Seltzer, M., Smogor, R. Layering in provenance systems. In: Proc. of USENIX ATC. 2009. p. 10–24.
Muniswamy-Reddy, K.K., Holland, D.A. Provenance for the cloud. In: Proc. of USENIX FAST. 2010. p. 197–210.
Muniswamy-Reddy, K.K., Holland, D.A., Braun, U., Seltzer, M.I. Provenance-aware storage systems. In: Proc. of USENIX ATC. 2006. p. 43–56.
Olson, M.A., Bostic, K., Seltzer, M.I. Berkeley DB. In: Proc. of USENIX ATC. 1999. p. 43–52.
Pohly, D.J., McLaughlin, S., McDaniel, P., Butler, K. Hi-Fi: Collecting high-fidelity whole-system provenance. In: Proc. of ACM ACSAC. 2012. p. 259–268.
Shah, S., Soules, C.A.N., Ganger, G.R., Noble, B.D. Using provenance to aid in personal file search. In: Proc. of USENIX ATC. 2007. p. 171–184.
Tariq, D., Baig, B., Gehani, A., Mahmood, S., Tahir, R., Aqil, A., Zaffar, F. Identifying the provenance of correlated anomalies. In: Proc. of ACM SAC. 2011. p. 224–229.
Wakefield, J. Heartbleed. 2014. http://www.bbc.com/news/technology-26969629.
Wang, W. A graph oriented approach for network forensic analysis. Ph.D. thesis; Iowa State University; 2010.


Xie, Y., Feng, D., Tan, Z., Zhou, J. Design and evaluation of a provenance-based rebuild framework. IEEE Trans on Magnetics 2013a;49(6):2805–2811.
Xie, Y., Feng, D., Tan, Z., Zhou, J. Unifying intrusion detection and forensic analysis via provenance awareness. Future Generation Computer Systems 2016;61:26–36.
Xie, Y., Muniswamy-Reddy, K.K., Feng, D., Li, Y., Long, D.D.E. Evaluation of a hybrid approach for efficient provenance storage. ACM Trans on Storage 2013b;9(4):1–29.
Xie, Y., Muniswamy-Reddy, K.K., Feng, D., Li, Y., Long, D.D.E., Tan, Z., Chen, L. A hybrid approach for efficient provenance storage. In: Proc. of ACM CIKM. 2012. p. 1752–1756.
Yin, H., Song, D., Egele, M., Kruegel, C., Kirda, E. Panorama: Capturing system-wide information flow for malware detection and analysis. In: Proc. of ACM CCS. 2007. p. 116–127.
Zhou, W., Fei, Q., Narayan, A., Haeberlen, A., Loo, B.T., Sherr, M. Secure network provenance. In: Proc. of ACM SOSP. 2011. p. 295–310.
Ziv, J., Lempel, A. A universal algorithm for sequential data compression. IEEE Trans on Information Theory 1977;23(3):337–343.
