Digital Investigation 9 (2012) 96–108
Digital Investigation journal homepage: www.elsevier.com/locate/diin
Engineering an online computer forensic service
R.A.F. Bhoedjang b, A.R. van Ballegooij b, H.M.A. van Beek a,*, J.C. van Schie a,†, F.W. Dillema b, R.B. van Baar a, F.A. Ouwendijk c, M. Streppel d
a Digital Technology and Biometrics, Netherlands Forensic Institute (NFI), Laan van Ypenburg 6, 2497 GB The Hague, The Netherlands
b TraceMiners B.V., PO Box 248, 2270 AE Voorburg, The Netherlands
c IMC financial markets, The Netherlands
d NCIM Groep, Leidschendam, The Netherlands
Article info
Article history: Received 16 March 2011; received in revised form 1 October 2012; accepted 4 October 2012
Keywords: XIRAF; Online service; Computer forensics; Digital forensics; SAAS
Abstract
XIRAF is a second-generation forensic analysis system developed at the Netherlands Forensic Institute. XIRAF automates the collection of millions of forensic artefacts and organizes these artefacts such that they can be searched in effective ways through a web interface. This paper describes the design of version 1.2 of XIRAF and describes the lessons we learned from implementing and deploying it. Today, a number of Dutch law enforcement organizations are using the XIRAF service offered by the Netherlands Forensic Institute. Our experience with this service indicates that XIRAF allows investigative teams of dozens of investigators with varying technical background to collaborate effectively and allows them to obtain results from amounts of digital evidence that were infeasible to handle in a cost-effective way before. © 2012 Elsevier Ltd. All rights reserved.
* Corresponding author. Tel.: +31 70 888 6 400.
E-mail addresses: [email protected] (R.A.F. Bhoedjang), alex@traceminers.nl (A.R. van Ballegooij), [email protected] (H.M.A. van Beek), [email protected] (F.W. Dillema), [email protected] (R.B. van Baar).
† Co-author John van Schie died on 27 February 2011.
1742-2876/$ – see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.diin.2012.10.001
1. Introduction
Digital evidence has become relevant in almost any kind of investigative case. We are faced with the choice to either make technical specialists of all investigators or to make it possible for non-specialists to find and access digital artifacts that can be interpreted with little technical knowledge. While a system logfile entry may require a technical specialist to interpret, we believe that few technical skills should be required of an investigator in order for him or her to find and read email and chat messages present in digital evidence (for example). In addition, investigators are faced with terabytes of data to process and search in individual cases. Automated support for mundane tasks not only keeps such workloads manageable for digital experts, it
also gives these experts the time to focus their attention on those analyses where their expertise is most effective and valuable. In this paper we describe our experiences with XIRAF, a forensic analysis system that makes huge amounts of digital evidence accessible to teams of collaborating investigators with varying degrees of technical expertise. XIRAF accepts a variety of digital evidence inputs and automatically extracts forensic artefacts such as files, chat logs, browser histories, processes, email, etc. Collected forensic artifacts are represented as typed objects with typed properties and optional binary content. Examples of object types include file, email, chatLogEntry, browserHistoryLogEntry. These types are independent of specific operating systems and applications that can produce such artifacts. The browser history of Firefox, Chrome, Internet Explorer and Safari, for example, are all represented by means of browserHistoryLogEntry objects. The automated extraction of forensic artifacts not only removes a fair amount of manual labor from digital investigators, it also removes the requirement on the investigator to know all possible locations where such
artifacts might be found in the digital evidence. In addition, the generalized and uniform representation of forensic artifacts removes the requirement on the investigator to know about the technicalities of the different systems and applications that can produce a certain type of digital artifact.
Even though few technical skills are now required just for reading emails and chats collected from the digital evidence, the number of emails and chats can still be overwhelming while those that are relevant to the case may be few. Hundreds of thousands of messages in a single case are no longer uncommon, while perhaps only a handful are relevant to the case at hand. Being able to read all communication is no longer good enough; investigators must be able to search the artifacts effectively in order to discover those that are most relevant for answering investigative questions. The XIRAF search functions therefore allow investigators to zoom in on objects of interest. With these functions, an investigator can make a selection such as the following: all objects that are of type file and of type picture, were created before 1 January 2008, contain the word ‘abracadabra’, and have a name that ends with .aha. This search model resembles the well-known and intuitive model of many web shops where a user repeatedly refines an article selection by adding new search criteria. Key search functions include content search, location search, time range search, and property search. This again allows non-experts to make effective use of digital evidence without much training.
Data volumes and processing times have increased to the point that desktop processing is no longer cost-effective: many law enforcement forces are centralizing their storage and are considering centralizing their processing power too.
Because XIRAF’s graphical user interface is a web interface, it can be accessed from remote locations and client-side system requirements are minimized. Web-based access also facilitates close collaboration between digital experts and non-experts even when they are not co-located. It is also increasingly important for analysts, lawyers, detectives, etc. to have direct access to the digital evidence. This requires such users to receive some additional training, but such training is already necessary for a good understanding of computer forensic reports.
With XIRAF, we are now offering an online computer-forensic service to detectives, forensic specialists, and analysts. Today, XIRAF is fully operational and routinely processes multi-terabyte cases.
We know from experience that building a system like XIRAF is hard. The XIRAF prototype and some of the key ideas mentioned above were already described in a 2006 paper (Alink et al., 2006). In this paper, we describe our experiences with the implementation of XIRAF’s key ideas and the processing of realistic workloads. These experiences cover a wide range of issues, including data representation, performance and scalability, reliability, usability, etc.
2. System functions The following paragraphs summarize XIRAF’s main functions.
2.1. Administrative functions
XIRAF includes an administrative web interface that is used to manage users and projects. A project is essentially a searchable set of digital inputs such as hard disk images. A project is the unit of search, access control, and archiving. Usually, a project coincides with an investigative case. User management includes creating user accounts and assigning access rights for projects to users. Users sign on to XIRAF by means of a user name and a password. Optionally, two-factor authentication with one-time password tokens can be used. At present, two user roles are distinguished: regular users and administrators. Administrators perform data, project, and user management, whereas regular users can only access (search) projects they are authorized for.
To make a digital input searchable, an administrator instructs XIRAF to index the input. Before doing so, the administrator first selects a suite of analysis tools to run. Examples of such tools include tools that extract information from browser cookies, from browser history files, from chat logs, etc. Tools can be selected individually or from a list of profiles that name predefined tool suites, e.g. a tool set tailored to the analysis of Microsoft Windows systems. Once an input has been indexed, the administrator can make that input’s index (and others) part of one or more user-accessible projects. This is called publishing.
2.2. Search and browse functions
With XIRAF, users search a large collection of forensic artefacts along multiple dimensions. A search always begins with the complete set of objects discovered in the indexing phase. Objects can then be selected in the following ways:
Device selection. XIRAF will refine the current selection by retaining only those objects that belong to a user-specified selection of the input devices (usually disk images). The input devices can be grouped, for example by person or by physical location, and users can select entire groups of devices for inclusion or exclusion.
Location search. The objects discovered in an image form a tree. XIRAF will refine the current selection by retaining or eliminating those objects that belong to a user-specified subtree. Example: eliminate objects in subtree Windows/system32.
Hand picking. XIRAF refines the current selection by retaining or eliminating items explicitly indicated by the user. Example: retain only objects 12 and 29.
Keyword search. XIRAF refines the current selection by retaining only those objects that have binary content that contains the specified keyword. The keyword may contain wildcard characters. Example: select objects that contain words that match the pattern passw*.
Property filter. XIRAF refines the current selection by retaining or eliminating those objects that have the user-specified property and the user-specified value (range) for that property. Example: select objects with property name such that name ends with .exe.
Time range search. XIRAF refines the current selection by retaining only those objects that have a property with
a value of type dateTime in the time range specified by the user. Example: select objects with a timestamp in the range [24 December 2009 00:00, 28 December 2009 00:00].
Bookmark. XIRAF allows users to save queries as named bookmarks; XIRAF also includes many predefined bookmarks. XIRAF combines elements of the current selection with elements defined by a user-specified bookmark. The following types of combination operators are available: replace, union, intersection, and difference. Example: replace the current selection with objects discovered by bookmark Hardware/Attached USB Hardware.
Property matching. XIRAF extends the current selection with objects (taken from the complete set of objects) that match one of the objects in the current selection. Two objects match if both have the user-specified property and if they have the same value for that property. Example: add objects that have an SHA-1 hash value that matches the SHA-1 hash value of an object in the current selection.
Area filter. XIRAF refines the current selection by retaining only those objects that have an associated GPS coordinate that lies within a user-specified map region. The map region is specified by selecting a rectangle on a map.
Most of the filters can be negated. One can, for example, select all objects that are not of type document.
2.3. Result views
At any time, the objects selected by a user’s search actions can be viewed in multiple ways.
Statistics view. The statistics view gives direct access to specific subselections of the current selection. These subselections include pictures, movies, documents, encrypted objects, email messages, etc.
Summary view. The summary view displays one paragraph per object. This paragraph lists the object’s name and some of the object’s key properties.
Tabular view. The tabular view displays the current object selection in a table that contains one row per object and one column per object property.
Timeline view. The timeline lists objects in the current selection in chronological order according to their timestamp properties. Objects with multiple timestamps may be listed multiple times and objects without timestamps will not be listed.
Picture view. The picture view shows thumbnails of objects in the current selection that contain image content.
Map view. The map view displays, on a map, objects that have properties that are GPS coordinates. Such properties can appear, for example, in digital images shot with GPS-enabled cameras or smartphones. Fig. 1 shows the map view on an iPad. The figure also shows the tab handles that can be used to switch to other views on the right side of the screen.
Notes view. Users can annotate individual objects with notes. The notes view displays user notes associated with the items present in the search result. Notes can be shared between users of the same case, they can be searched, and they can be included in reports.
Once an investigator has zoomed in on an interesting set of objects, he or she can export both a selection of the metadata and the binary content of those objects.
Fig. 1. Mobile access to an online forensic service.
When
requested, XIRAF creates a downloadable zip archive that contains the selected information. Users can create HTML or PDF reports. XIRAF includes a few predefined reports, but users can also create their own. Essentially, a report consists of headers, text, search results and output formatting instructions. Reports can be generated for either the complete set of objects or for the current selection.
3. System design and implementation
3.1. System overview
Fig. 2 gives a high-level view of XIRAF. At this level, XIRAF consists of a backend and a frontend. The backend operates in batch mode and is responsible for preprocessing digital inputs.
Fig. 2. High-level system overview (harddisk / cell phone → disk image / phone image → preprocess in backend → disk index / phone index → publish → project database → search with frontend).
Disk images are the most common type
of input, but XIRAF also accepts memory dumps and regular files or directory hierarchies. For each input, the backend runs a suite of tools that is appropriate for the investigation at hand. This results in a device index for each input. A device index comprises an XML document that describes forensic artefacts discovered in the input and a keyword index that identifies the occurrences of keywords. An administrator can combine device indices in projects which are made accessible to investigators. Once a project has been composed, it is materialized by copying and transforming information from the device indices to a relational database system. The frontend handles user requests, which it receives through the user interface, and requests from programs, which it receives through the external API. In both cases, it receives set-oriented XIRAF queries, which are described in Section 3.7. For each XIRAF query, the frontend generates one or more database queries and sends them to the database. The query results are then sent back to the client. If appropriate, the frontend caches query results for future use. The system is implemented almost entirely in Java 6. We test and deploy on 64-bit Intel systems that run RedHat Enterprise Linux, but the system is developed both on Windows XP and Linux. 3.2. Data model In XIRAF, all forensic artefacts are described as XIRAF objects. An object has one or more object types and, for each object type, a set of named properties. In addition, an object has zero or more associated binary data streams. For each object type, the data model defines the set of properties that may be used in combination with that type. The data model is a central part of the system. Developers must be aware of the model, because the analysis tools that they write must produce output that is valid according to the data model. 
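The core of this data model can be illustrated with a small sketch. It is not XIRAF's implementation (XIRAF describes its model with an XML schema and is written in Java); the Python class, the ALLOWED_PROPERTIES table, and the property names width, height, and host are hypothetical and chosen only to show an object that carries several types, each with its own set of permitted properties.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical excerpt of a data model: for each object type, the property
# names that may be used in combination with that type.
ALLOWED_PROPERTIES = {
    "file": {"name", "size"},
    "picture": {"width", "height"},
    "cookie": {"host", "name"},
}


@dataclass
class ForensicObject:
    # Properties are grouped per object type, so two types may both
    # define a property called "name" without clashing.
    properties: Dict[str, Dict[str, object]]

    @property
    def types(self) -> set:
        return set(self.properties)

    def validate(self) -> List[str]:
        """Return a list of violations of the (hypothetical) data model."""
        errors = []
        if not self.properties:
            errors.append("object has no type")
        for type_name, props in self.properties.items():
            allowed = ALLOWED_PROPERTIES.get(type_name)
            if allowed is None:
                errors.append(f"unknown type: {type_name}")
                continue
            for prop_name in props:
                if prop_name not in allowed:
                    errors.append(
                        f"property {prop_name!r} not allowed for type {type_name!r}"
                    )
        return errors


# An object that is a file and a picture at the same time.
photo = ForensicObject({
    "file": {"name": "IMG_0001.jpg", "size": 20480},
    "picture": {"width": 640, "height": 480},
})
```

A tool that emitted a property not defined for a type would be caught by validate(), which mirrors the requirement that tool output be valid according to the data model.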
Users must also be aware of the model, because some queries are formulated in terms of object properties and because the user interface displays object types and properties. At present, the data model defines 39 different object types, which are listed in Table 1. A full description of individual types is beyond the scope of this paper, but most type names are self-explanatory.
Table 1 XIRAF types.
attachment              chatLog       deleted       emailFolder    fileArchive
browserHistoryLog       chatLogEntry  document      eventLog       filesystem
browserHistoryLogEntry  chatMessage   email         eventLogEntry  fileTransferLog
carved                  cookie        emailArchive  file           fileTransferLogEntry
firewallLog             header        memoryImage   process        unallocated
firewallLogEntry        image         personalData  registry       volume
folder                  link          phoneCall     registryEntry
gpsEntry                memory        picture       textMessage
An object can have multiple types, e.g. an item can be a file and a picture at
the same time. The data model does not define a subtyping relationship between types.
For each object type, the data model defines its properties. Properties have names and they are typed. The combination of an object type and a property name must be unique, but different object types can use identical property names. Property value types include integer, string, url, email-address, and phone-number, amongst others.
Each object has zero or more children, which are objects themselves. This mechanism is used to represent records in log files, files in folders, and so on. An object’s children form an ordered list.
Finally, the data model allows an object to have binary data streams. An object’s binary data streams are defined recursively, in terms of selections and transformations of other objects’ streams. Ultimately, each binary stream is derived through a series of transformations from the binary content of a XIRAF input (usually a disk image).
The data model is described by an XML schema. Consequently, each XIRAF object has an XML representation. Fig. 3 illustrates this representation for two forensic artefacts discovered in an NFI demonstration image. For conciseness, we show only part of the full representation. The XML fragment describes a Windows cookie file that contains a cookie. The cookie file is represented by an object of type file. The file object has 857 bytes of associated binary content. The cookie stored in this file is represented by a cookie object, which is a child of the file object. This cookie object has no associated binary content, just properties.
3.3. The tool collection
XIRAF tools discover new forensic artefacts or augment existing forensic artefacts. The tool collection now contains more than fifty different tools, all written in Java.
The collection includes tools that parse widely used file systems and archives, decompression tools, carving tools, a content classification tool, a hashing tool, parsers for application and system log files (chat logs, firewall logs, etc.), mailbox parsers, browser history parsers, image metadata parsers, recycle bin parsers, and more.
A tool accepts exactly one object as its input. The tool can augment this object by supplying new properties or it can create new descendant objects. The file system tool, for example, accepts a volume object as its input. To this volume object, it adds a child of type filesystem. Next, the tool adds a complete tree of folder and file objects to the file system object. The statistics tool accepts a file object as its input. To this object, it adds new properties that describe various values computed from the file object’s content. Examples of these properties include md5, sha1, and entropy.
Fig. 3. XML representation of a XIRAF object with one child.
3.4. Tool execution
Tools are run by a scheduler in the backend. This scheduler selects existing artefacts and feeds them to the appropriate tools. For each activated tool, the scheduler finds objects that must be processed by the tool. This is
done by matching candidate objects against the tool’s input descriptor. This input descriptor describes which XIRAF objects must be processed by the tool. The selection is described using a subset of XIRAF’s query language (described in Section 3.7). Most input selectors select objects based on properties such as the object’s name, path, or content classification. Fig. 4 shows an example of an input descriptor, which selects the input for the Firefox3 cookie analysis tool. The descriptor selects objects of type file with a property called name with value cookies.sqlite. Once a tool has been started, the scheduler monitors it. A tool is run in its own, separate, process. When a tool terminates, the scheduler collects its output and merges this output into the device index. Tool execution is similar to the execution of a rule-based production system that fires rules until no more rules can be triggered. Since a tool may produce new objects or modify existing objects, the scheduler will, in a new scheduling iteration, continue its search for tool input until no such input remains. In between scheduling iterations, the scheduler creates a checkpoint that contains all output generated so far. If the system crashes, processing can resume from the last checkpoint.
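The scheduling loop described above can be sketched as a fixed-point iteration. The sketch below is a simplification under stated assumptions: tools run in-process rather than in separate processes, input descriptors are plain Python predicates rather than XIRAF query expressions, and all class and function names (Obj, FileSystemTool, HashTool, CookieTool, run_scheduler) are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Obj:
    types: set
    props: dict = field(default_factory=dict)
    content: bytes = b""
    children: list = field(default_factory=list)
    done: set = field(default_factory=set)  # names of tools already run here


def walk(obj):
    """Yield obj and all of its descendants."""
    yield obj
    for child in obj.children:
        yield from walk(child)


class FileSystemTool:
    name = "filesystem"

    def accepts(self, o):  # stands in for the tool's input descriptor
        return "volume" in o.types

    def run(self, o):  # creates new descendant objects
        fs = Obj(types={"filesystem"})
        fs.children.append(Obj(types={"file"}, props={"name": "cookies.sqlite"}))
        fs.children.append(Obj(types={"file"}, props={"name": "notes.txt"}))
        o.children.append(fs)


class HashTool:
    name = "hash"

    def accepts(self, o):
        return "file" in o.types

    def run(self, o):  # augments an existing object with a new property
        o.props["sha1"] = hashlib.sha1(o.content).hexdigest()


class CookieTool:
    name = "cookie"

    def accepts(self, o):
        return "file" in o.types and o.props.get("name") == "cookies.sqlite"

    def run(self, o):
        o.children.append(Obj(types={"cookie"}))


def run_scheduler(root, tools, checkpoint):
    """Run tools until no object matches an input descriptor; return iterations."""
    iterations = 0
    while True:
        work = [(t, o) for o in list(walk(root)) for t in tools
                if t.name not in o.done and t.accepts(o)]
        if not work:
            return iterations
        for tool, obj in work:
            tool.run(obj)
            obj.done.add(tool.name)
        iterations += 1
        checkpoint(root, iterations)  # save all output produced so far
```

In this sketch the file system tool fires in the first iteration and creates file objects, which in turn trigger the hashing and cookie tools in the second iteration; the loop stops when the third search for input finds nothing.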
Fig. 4. An input descriptor that selects all cookie.sqlite files.
3.5. Keyword indexing and querying
The keyword indexing tool creates a keyword index on the binary content of XIRAF objects. The indexing engine recognizes ‘words’. It records, for each word, the object stream in which it occurs and its byte position in that stream. At present, a word is defined to be a sequence of alphanumeric characters or one of the following characters: dot (.), underscore (_), dash (-), colon (:), at sign (@). The tokenizer recognizes both ASCII and UTF-16 representations of such words.
The indexing engine uses a standard inverted-list algorithm (Witten et al., 1999). Initially, it tracks words and their positions in memory. When it runs out of memory, it flushes its in-memory state to disk in the form of a partial keyword index (described below). After flushing a partial index, the indexing engine clears its memory and continues. Periodically, partial indices are merged. Eventually, a single set of index files remains.
Once created, the keyword index can be queried by specifying a query pattern. Such a pattern consists of fixed characters and zero or more wildcard characters (? and *). Examples of such patterns are: [email protected], password*, DCIM*.jpg. The keyword index supports the following query methods:
Count words that match a given pattern.
Count objects that contain a word that matches a given pattern.
Retrieve the words that match a given pattern.
Retrieve the object ids of the objects that contain a word that matches a given pattern.
For a given pattern and a given object, retrieve the positions of the words in that object that match the pattern. A position is a byte offset in an object’s binary data stream.
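The five query methods can be illustrated with a small in-memory inverted index. This is a sketch under several assumptions, not XIRAF's indexing engine: it handles ASCII content only (no UTF-16), keeps everything in memory (no partial indices or merging), matches case-sensitively, and uses Python's fnmatchcase for wildcards, which also gives square brackets a special meaning that the pattern syntax described above does not mention. The class and method names are hypothetical.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

# A word: a run of alphanumerics, dot, underscore, dash, colon, or at sign.
WORD = re.compile(rb"[A-Za-z0-9._\-:@]+")


class KeywordIndex:
    def __init__(self):
        # word -> object id -> list of byte offsets in that object's stream
        self.postings = defaultdict(lambda: defaultdict(list))

    def add(self, object_id, content: bytes):
        """Tokenize content and record the byte position of every word."""
        for match in WORD.finditer(content):
            word = match.group().decode("ascii")
            self.postings[word][object_id].append(match.start())

    def words(self, pattern):
        """Retrieve the words that match the pattern."""
        return sorted(w for w in self.postings if fnmatchcase(w, pattern))

    def count_words(self, pattern):
        return len(self.words(pattern))

    def object_ids(self, pattern):
        """Retrieve ids of objects containing a word that matches the pattern."""
        return sorted({oid for w in self.words(pattern) for oid in self.postings[w]})

    def count_objects(self, pattern):
        return len(self.object_ids(pattern))

    def positions(self, pattern, object_id):
        """Byte offsets, within one object, of the words matching the pattern."""
        return sorted(p for w in self.words(pattern)
                      for p in self.postings[w].get(object_id, []))


index = KeywordIndex()
index.add(1, b"login: password123 stored")
index.add(2, b"passw0rd and DCIM_0001.jpg")
```

Because positions are stored per object, a front end could use positions() to highlight hits in a content preview without scanning the content again.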
3.6. The binary content server XIRAF objects can have associated binary content. Both the backend and the frontend require access to objects’ binary content. Backend tools read object content to extract information. The frontend reads object content to display that content in various forms. The binary content server serves binary object content to clients. It is accessed by the backend and the frontend through a simple stateless protocol that runs over a TCP/IP connection. The principal request type is the read request, which returns a range of bytes from one of the object’s binary data streams. Eventually, the content of a binary stream is derived from bits stored in a XIRAF input, usually a disk image. When asked to deliver (part of) an object stream, the content server computes the bytes requested on demand. In other words, XIRAF does not store a copy of the binary content of the forensic artefacts that it discovers. The computation of a stream’s content is driven by a sequence of descriptors. These descriptors describe content in terms of transformations or selections of other content. Such descriptors can describe the following types of operations: select an input (e.g., a disk image); select a contiguous subrange of an existing stream by specifying its start and its size (in bytes); select a member of a container object by specifying the container type and the member’s name; transform an existing stream by means of a named decoding algorithm such as base64 or gzip; specify the stream’s content by specifying the individual byte values (immediate data). With these operations, it is possible to specify the decompressed (decoded) content of a gzipped file, stored in a file named /tmp/example.gz, which is a member of a file system container, which is stored in a 20 Gbyte partition, which starts at offset 32,256 of an image named test.img. The server identifies object streams through numeric identifiers. 
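The descriptor mechanism can be sketched as a small interpreter that recomputes a stream from its input every time it is read. The sketch makes several assumptions: descriptors are plain dictionaries with hypothetical field names, inputs are held in memory, and selecting a member of a container (which would need a file system or archive parser) is left out.

```python
import base64
import gzip

# Named decoding algorithms available to "decode" descriptors.
DECODERS = {"gzip": gzip.decompress, "base64": base64.b64decode}


def resolve(descriptors, inputs):
    """Compute a stream's full content by applying descriptors in order."""
    data = b""
    for d in descriptors:
        op = d["op"]
        if op == "input":          # select an input, e.g. a disk image
            data = inputs[d["name"]]
        elif op == "range":        # contiguous subrange: start and size in bytes
            data = data[d["start"]:d["start"] + d["size"]]
        elif op == "decode":       # transform with a named decoding algorithm
            data = DECODERS[d["algorithm"]](data)
        elif op == "immediate":    # content given as individual byte values
            data = bytes(d["bytes"])
        else:
            raise ValueError(f"unsupported descriptor: {op}")
    return data


def read(descriptors, inputs, offset, count):
    """Serve a read request: a range of bytes from the computed stream."""
    return resolve(descriptors, inputs)[offset:offset + count]


# A tiny stand-in for a disk image: padding, a gzipped payload, padding.
payload = gzip.compress(b"hello forensic world")
image = b"\x00" * 32 + payload + b"\x00" * 8
stream = [
    {"op": "input", "name": "test.img"},
    {"op": "range", "start": 32, "size": len(payload)},
    {"op": "decode", "algorithm": "gzip"},
]
chunk = read(stream, {"test.img": image}, offset=6, count=8)
```

Nothing derived is stored: the decompressed bytes exist only while a request is being served, which matches the on-demand behaviour described above at the cost of repeating the decoding work on every read.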
When a read request arrives, the content server uses the numeric stream id to find the sequence of descriptors that describe the stream. Next, the server processes the corresponding sequence of streams and delivers the requested content. As illustrated above, many stream types require some type of decoding or interpretation. The content server contains modules that do this;
the information in the descriptors drives the selection of the appropriate modules.
3.7. Query language
In larger cases, XIRAF’s backend discovers millions of forensic artefacts. To select objects of interest, XIRAF provides a set-oriented query language: the XIRAF query language. This language is used in two ways:
User selections, constructed in XIRAF’s user interface, are trivially mapped to queries in the query language.
The backend uses a subset of the query language for its input descriptors that match objects with tools.
Most operators in the query language accept one or two sets of XIRAF objects as input and produce one set as output. Section 2 gives a high-level overview of the language operators. With the language, it is possible to express queries like: select all objects of type file that contain keyword secret and that have a size (integer property) less than 10,240 and that were created (timestamp property) after 25 December 2009.
Queries in XIRAF’s query language are denoted in XML and the query language syntax is defined by an XML schema. Fig. 5 shows the XML representation of a query that selects objects of type file with an integer property named size and a value (for that property) less than or equal to 4096. In Fig. 5, the subexpression labeled unrestricted-document denotes all forensic artefacts in the current project; the subexpression labeled has-type selects objects of type file; and the subexpression labeled has-long-property selects objects with a size less than or equal to 4096.
Fig. 5. A XIRAF query (all objects; having type file; with size <= 4096).
XIRAF contains multiple implementations of the query language. The frontend contains a query translator that accepts a XIRAF query in XML format and transforms it to one or more equivalent SQL queries. The backend contains an interpreter for a subset of the query language. This subset and its interpreter are used to describe and compute the set of objects that must be processed by a backend tool.
3.8. Databases
XIRAF uses a database management system to store the properties of XIRAF objects and to implement the XIRAF query language. The DBMS that XIRAF’s frontend uses is Oracle (2012).
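The translation of a set-oriented query into SQL can be sketched for the query of Fig. 5. Everything about the sketch beyond the three operator names is an assumption: the query is written as nested Python tuples instead of XML, SQLite stands in for Oracle, and the tables objects, object_types, and long_properties are a hypothetical schema, not XIRAF's.

```python
import sqlite3

COMPARATORS = {"<", "<=", "=", ">=", ">"}


def to_sql(query):
    """Translate a nested query tuple into (sql, params) selecting object ids."""
    op = query[0]
    if op == "unrestricted-document":      # all objects in the project
        return "SELECT id FROM objects", []
    if op == "has-type":
        _, type_name, sub = query
        sub_sql, params = to_sql(sub)
        sql = (f"SELECT id FROM ({sub_sql}) WHERE id IN "
               "(SELECT object_id FROM object_types WHERE type = ?)")
        return sql, params + [type_name]
    if op == "has-long-property":
        _, name, comparator, value, sub = query
        if comparator not in COMPARATORS:  # never interpolate untrusted text
            raise ValueError(f"unsupported comparator: {comparator}")
        sub_sql, params = to_sql(sub)
        sql = (f"SELECT id FROM ({sub_sql}) WHERE id IN "
               "(SELECT object_id FROM long_properties "
               f"WHERE name = ? AND value {comparator} ?)")
        return sql, params + [name, value]
    raise ValueError(f"unsupported operator: {op}")


def example_database():
    """Build a three-object project in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE objects(id INTEGER PRIMARY KEY);
        CREATE TABLE object_types(object_id INTEGER, type TEXT);
        CREATE TABLE long_properties(object_id INTEGER, name TEXT, value INTEGER);
        INSERT INTO objects VALUES (1), (2), (3);
        INSERT INTO object_types VALUES (1, 'file'), (2, 'file'), (3, 'email');
        INSERT INTO long_properties VALUES
            (1, 'size', 100), (2, 'size', 5000), (3, 'size', 10);
    """)
    return conn


# The query of Fig. 5: all objects, having type file, with size <= 4096.
FIG5 = ("has-long-property", "size", "<=", 4096,
        ("has-type", "file", ("unrestricted-document",)))

sql, params = to_sql(FIG5)
matching_ids = [row[0] for row in
                example_database().execute(sql + " ORDER BY id", params)]
```

Each operator wraps the SQL of its input set in a subquery, so the nesting of the generated SQL follows the nesting of the query; a production translator would presumably flatten this into joins for performance.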
For each input, the backend produces an XML document that describes the artefacts discovered by the tools. After a project has been defined in terms of its constituent device indices, it is published. During this publication phase, data from the XML documents for all inputs used in the project are loaded into the isolated tablespace, within the DBMS, allocated for that project.
XIRAF does not store its keyword index in the database. To enable the integration of keyword queries with other types of queries, XIRAF relies on a server-side database mechanism that forwards keyword queries to an external server. This way, the results of keyword searches can be merged with the results of other (sub)queries.
XIRAF stores user annotations (notes) in the DBMS, in a separate table alongside the forensic artifacts discovered by the backend tools. These notes are stored in the tablespace of the XIRAF project they are associated with.
XIRAF maintains a separate database to administer its users, data sources and projects. This database also stores the bookmarks saved by users, as bookmarks can be used across different projects. XIRAF uses a lightweight embedded database for its system-wide administration, namely Apache Derby (Apache, 2011).
3.9. Internal API
XIRAF’s frontend services are built on an API. This API is too large to describe fully in this paper. Key concepts in this API are connections, users and user groups, projects, and XIRAF objects. A connection gives access to a XIRAF service instance. To obtain a connection, a valid user name and password are required. Through a regular connection, clients can open projects. As described before, a project is a composition of previously indexed inputs. Through a project, clients can issue queries, which result in XIRAF objects. The API includes a specialized, administrative connection type. Administrative connections are used to manipulate users and projects.
3.10. External API
XIRAF’s key search functions are available to external programs via a web API. With this API, other parties can develop their own user interfaces or connect XIRAF to other, possibly non-interactive, analysis systems. The API is a language-neutral REST interface (Fielding and Taylor, 2002). It gives access to projects, to properties of individual objects, to objects’ binary content, and to all search functions. XIRAF includes a client-side Python library that allows programmers to build requests and to parse the responses to those requests. The examples in Table 2 illustrate the types of URLs processed by the web API. In some cases additional information can or must be passed in the form of additional GET or POST parameters. A XIRAF query, for example, must be passed as a parameter in the POST body of a query request. Also, the content request can be modified by specifying a byte offset and a byte count.
Table 2 Examples of XIRAF’s external web API.
/webapi/projects                                Returns XML descriptions of all projects accessible to the current user.
/webapi/projects/CSI                            Returns an XML description of project CSI.
/webapi/projects/CSI/objects/14763/properties   Returns the XML representation of the properties of object 14,763 in project CSI.
/webapi/projects/CSI/objects/14763/content      Returns the default content stream associated with object 14,763 in project CSI.
/webapi/projects/CSI/objects/14763/parent       Returns the XML representation of the properties of the parent of object 14,763 in project CSI.
/webapi/projects/CSI/objects/14763/children     Returns the XML representations of the properties of the children of object 14,763 in project CSI.
/webapi/projects/CSI/query                      Returns the XML representations of the properties of the objects in project CSI that match the query specified in the POST body.
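The URL scheme of Table 2 can be exercised with a few helper functions. This sketch is not the client-side Python library that ships with XIRAF; the host name, the helper names, and the parameter names offset, count, and query are hypothetical, since the paper states only that a byte offset, a byte count, and a query can be passed as parameters. The code builds requests but does not send them.

```python
import urllib.request
from urllib.parse import quote, urlencode

BASE = "https://xiraf.example.org"  # hypothetical service address
OBJECT_PARTS = {"properties", "content", "parent", "children"}


def project_url(project):
    """URL that describes one project, e.g. /webapi/projects/CSI."""
    return f"{BASE}/webapi/projects/{quote(project, safe='')}"


def object_url(project, object_id, part):
    """URL for one aspect (properties, content, ...) of one object."""
    if part not in OBJECT_PARTS:
        raise ValueError(f"unknown object resource: {part}")
    return f"{project_url(project)}/objects/{int(object_id)}/{part}"


def content_url(project, object_id, offset=None, count=None):
    """Content URL, optionally restricted to a byte range (assumed GET names)."""
    params = {k: v for k, v in (("offset", offset), ("count", count))
              if v is not None}
    url = object_url(project, object_id, "content")
    return url + ("?" + urlencode(params) if params else "")


def query_request(project, query_xml):
    """POST request carrying a XIRAF query in its body (assumed field name)."""
    body = urlencode({"query": query_xml}).encode("ascii")
    return urllib.request.Request(project_url(project) + "/query",
                                  data=body, method="POST")


request = query_request("CSI", "<query/>")
```

Sending the request would be a matter of passing it to urllib.request.urlopen together with whatever authentication the service requires, which is not described in this section.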
3.11. User interface
The web interface’s main task is to translate user input to XIRAF queries, to send these queries to the web server, and to display the query results. XIRAF’s web interface was built using the Google Web Toolkit (GWT) (Google, 2012). GWT code is written in Java and translated to JavaScript that is run in the client’s web browser. A query result is usually a set of forensic artefacts. Presenting such a set can be challenging, because the elements of such a set can have different types and properties. Fig. 6 shows a screenshot of the XIRAF user interface. These are the main user interface elements:
Query panel. In this panel, on the left side of the screen, the user manages the current query. Each query is a stack of query elements such as a date/time range, an object type, or a keyword pattern. The current query is always displayed. Users can extend the current query, clear the entire query, remove individual elements, or save the query.
Results panel. This panel, on the top right side of the screen, displays the results of the current query, according to the current view. The user switches to another view by pressing its icon.
Details panel. This panel, on the bottom right of the screen, displays details about one XIRAF object, usually the current selection in the query result. Details include the object’s metadata and a preview of its content (if any).
Fig. 6. The XIRAF user interface.
4. Experiences and lessons learned
In this section, we describe key experiences and evaluate some of our design decisions based on those experiences. To structure this section, we use the key ideas that XIRAF is based on: automated preprocessing, uniform data representation, a set-based query model, and web-based access.
4.1. Automated preprocessing
XIRAF’s backend processes terabytes of input data. With commodity hardware, a single pass over the input may already take a day. It is therefore crucial that the backend
will not be stopped by simple errors;
can be restarted when a serious error occurs;
will avoid duplicate accesses to large amounts of data.
4.1.1. Robustness

To deal with unreliable components, XIRAF takes several measures. First, tools are run in separate processes, which are controlled by the backend’s tool scheduler. This way, a tool crash will not take down the entire backend. To reduce the startup overhead of tools that have to process many objects, the tool driver can operate in batch mode. In this mode, it processes a batch of objects in the context of a single process. This increases efficiency, but has the disadvantage that a fatal error on any of the batch’s objects will affect the entire batch.

For each tool process that the scheduler spawns, it sets a timeout value. For each tool, a timeout expression is configured. This expression can depend on the size of the input. When the timeout expires, the scheduler kills the tool process. All objects that were members of the tool’s input batch are then marked with an error flag.

Second, the backend uses checkpointing. After each scheduler iteration, the backend creates a new checkpoint. This checkpoint contains all XML output and keyword indices produced thus far. If the backend crashes, we can restart the backend after the problem is resolved and resume processing from the last checkpoint.

4.1.2. Reducing I/O

The time required by a backend run is determined by the tools that process the largest amounts of data. With the current tool execution model, the presence of different tools that operate on the same object will result in multiple accesses to that object. In earlier versions, for example, XIRAF used separate tools to compute various statistics, such as hash values and entropy. To reduce the number of data passes, these tools have now been merged into a single tool that computes all statistics in a single pass. In this particular case, merging tools was quite easy, because the tools had the same sequential access pattern and because they operated on the same set of objects. In general, however, it is difficult to guarantee that all backend processing can be realized in a single pass per object stream.

4.1.3. Parallel execution

Executing multiple tools in parallel. To take advantage of the multicore host(s) that the backend runs on, XIRAF’s tool scheduler executes multiple tools in parallel by starting multiple tool processes. This is a simple, but effective way to attain higher performance.

Multithreaded individual tools. One of the most time-consuming backend tasks is keyword indexing. XIRAF text-indexes most data streams, which is CPU intensive. A single keyword indexing thread does not saturate the available I/O bandwidth. At present, we run up to four indexing threads in parallel, on a single multicore node, all reading from the same content server.

Latency hiding. Some tools process large amounts of data. The hashing tool, for example, computes cryptographic hashes and other stream-based statistics for objects with binary content. Since some streams are very large (gigabytes), this tool overlaps its I/O (synchronous read requests to the content server) with its computations. A dedicated thread prefetches stream content and separate compute threads compute hashes and other statistics over the prefetched content.

High-throughput content server. The content server must handle each request quickly and it must be able to
handle many requests per second. To access file content, the server relies on a file system library, Snorkel, that is not multithread-safe. To prevent concurrency problems, the server used coarse-grain locks around calls to this library, which resulted in contention and reduced throughput. To address this problem, we replicate state. For each stream, the content server maintains a small pool of objects, each of which can produce that stream’s content. Request-handling threads fetch an object from this pool, use it to read stream content, and then return the object to the pool. Blocking occurs only if the pool is empty.

Distributed processing. Most parallelization efforts have focused on exploiting thread-level parallelism on a single server. For very large cases, we have, on occasion, used multiple indexing servers to process different input images in parallel. This type of coarse-grain, image-level parallelism is simple and effective. Distributed processing of a single image is feasible, but would significantly complicate the software. Given that single-image processing times have been acceptable, it is not clear if the benefits of per-image distributed processing would justify the implementation costs.

4.1.4. Algorithmic issues

The nature of the input is such that ‘simple’ algorithms that are perfectly acceptable for smaller workloads no longer work. A complicating factor is that this problem crosses module boundaries. Consider, for example, the process to access a file. XIRAF’s content server can be asked to deliver the content of a named file. To do so, it relies on an external file system library, Snorkel, that knows how to interpret various file systems. Snorkel accepts a file system path and returns a file handle. To find the file, it searches directories for the components of the file system path. Initially, this was implemented as a simple linear search.
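The difference between a linear scan and an indexed lookup within one directory can be sketched as follows. This is an illustration only, under stated assumptions: `Directory`, `find_linear`, `find_indexed`, and `count_linear_comparisons` are hypothetical names, the directory is modelled as a plain in-memory list of names, and nothing here reflects Snorkel’s actual API or how the issue was resolved in XIRAF.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Hypothetical in-memory directory: an ordered list of unique entry names."""
    entries: list[str]
    _index: dict[str, int] = field(default_factory=dict, repr=False)

    def find_linear(self, name: str) -> int:
        """Scan the entries one by one; cost grows with the directory size."""
        for position, entry in enumerate(self.entries):
            if entry == name:
                return position
        raise FileNotFoundError(name)

    def find_indexed(self, name: str) -> int:
        """Build a name -> position map on first use, then answer by hash lookup."""
        if not self._index:
            self._index = {entry: pos for pos, entry in enumerate(self.entries)}
        try:
            return self._index[name]
        except KeyError:
            raise FileNotFoundError(name) from None


def count_linear_comparisons(n: int) -> int:
    """Comparisons needed to look up each of n entries once by linear scan."""
    return sum(position + 1 for position in range(n))  # 1 + 2 + ... + n
```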
Unfortunately, when N files in the same directory are accessed, a linear search per access results in N linear searches over N items, which has quadratic time complexity. In some of our cases, we encounter inputs with thousands of items in a single directory.

A second class of problems arises with the use of main-memory algorithms for large inputs. During the development of XIRAF we have experimented with several keyword-indexing tools and libraries. To our surprise, several tools blindly assume that all words in an input file will fit in memory. For large inputs and modest memory sizes, this assumption is false.

4.1.5. Configuration issues

Running large-scale systems requires proper configuration at all levels: hardware, operating system, application. A simple, illustrative example is the operating system setting for the maximum number of open file descriptors per process. In some cases, XIRAF reported errors that were caused by reaching this limit. A large disk image is often stored in many (hundreds of) physical files to simplify archiving and to deal with file system limitations. Using such images results in many open files in XIRAF. (Actually, only up to 16 files per image are kept open simultaneously.) Since the content server can open images multiple times for concurrent access, the number of open files is
multiplied. Finally, the Java Virtual Machine uses many file handles for the libraries (Java archives) used by XIRAF.

A second, more complex example is a TCP/IP setting that affects the performance of the content server. The content server’s request-reply protocol runs over TCP/IP. The initial implementation did not use the TCP_NODELAY socket option to disable Nagle’s algorithm. (Nagle’s algorithm batches successive outgoing packets to reduce overhead.) As a result, data messages and acknowledgments were delayed and request-reply roundtrip times were terrible. The one-line solution to this problem reduced some tools’ run time from several hours to minutes or less.

Both problems described above were easy to fix. The difficulty is that they are not discovered in small-scale tests and that tracking them down takes time and expertise.

4.1.6. Removing the backend database

Earlier versions of the backend used MonetDB/XQuery (Boncz et al., 2006) to store the forensic artefacts discovered by tools. During backend processing, the scheduler queried the (partial) database to discover new input objects for tools. Loading the database after each tool iteration, either through updates or by replacing the entire data set, turned out to be a performance bottleneck. Using updates to add or modify objects is slow, because it involves many writes to disk. Replacing the entire data set is slow because the data set grows in each iteration, which results in quadratic time complexity.

The backend now collects all tool output and merges that output into a single XML file (per input). New input objects are discovered by matching tool input descriptors against the forensic artefacts described in that file. The matching is no longer performed by a database query, but by a custom interpreter for the input descriptor language. Removing the database from the backend approximately halved the backend processing time for larger cases.

4.1.7. One index per input

Originally, the backend produced a single index for all of a project’s inputs. This strategy was inefficient in two ways. First, if an input was to be used in two different projects, the backend had to process the input twice. Second, the scheduler considers all forensic artefacts derived so far as possible candidates for further tools. With a single index, this set grows larger than with multiple indices, which makes the tool input selection operation more expensive. We completely removed the notion of projects from the backend. The backend now simply accepts inputs and produces an index for each input.

4.1.8. Tool execution model

A XIRAF tool operates on a single object at a time. While it executes, it can read from that object only. In most cases, this model works fine, but there are forensic artefacts that require access to multiple objects to be properly reconstructed. Examples of such artefacts include multi-part RAR archives.

4.2. Uniform data representation

A key idea in XIRAF, already presented in the 2006 paper, is to use canonical representations for similar
objects. Both Firefox and Internet Explorer, for example, store HTTP cookies, but they store them in different formats. The XIRAF tools that process these cookies use different parsers, but both tools produce cookie objects. An investigator looking for objects of type cookie will thus discover both the Firefox and the Internet Explorer cookies. This basic idea has been retained throughout the project. In other ways, the current data model differs significantly from the model used in the 2006 prototype.

First, the prototype did not have a built-in, documented ontology like the current system. In the current system, tools must generate output that is consistent with the built-in ontology. As a result, tools that produce new types of information, such as GPS coordinates, force an extension to the ontology. This rigor conflicts with flexibility: adding a new type of information now involves modifying the data model and its documentation. This takes more time, but results in usable and searchable artefacts.

Second, in the prototype all objects had a single type. Objects with two types were split into two objects, one of which was a child of the other. Splitting objects in this manner is hard to explain to end users. Even for users who understood, it resulted in clumsy searching and navigation. Allowing multiple types per object was probably one of the biggest usability improvements.

Third, the binary content streams of objects are no longer assigned linear addresses from a single address space. This feature was originally intended to support queries that could test if one object’s content was nested in another object’s content. Such standoff queries were never made available in the user interface, and so this addressing mechanism was dropped. In the current system, binary content streams are similar to regular file streams.

4.3. Set-based query model

XIRAF provides a relatively high-level, set-based query model.
The key issues involving this model are its clarity, its usability, and its efficient implementation.

4.3.1. Clarity and usability

Most users who have worked with XIRAF have no problem understanding its set-based operation. Most operators are filters that reduce the size of the set of objects under consideration. This is similar to the way people select an item of interest in a web shop.

The main exception is formed by transformations that are not simple filters. XIRAF includes two types of such transformations: parent/child operators and property matching. The parent (or child) operator replaces the current set of objects with the set that contains all parents (children) of the objects in the current set. The property matching operator extends the current set with all objects that have the same value, for a specified property, as one of the objects in the current set. This operator can be used, for example, to find all objects that have the same hash value as some object in the current set. Both operators are useful in practice, but turn out to be difficult for some users.

A similar problem arose with some of XIRAF’s display panels. Technically skilled users like to be able to get a hex view of an object’s content, but other users find such a view useless and confusing. To address these
problems, we have added view options that can be used to show or hide almost any component of XIRAF’s user interface. An administrator can configure a user’s default settings, but users are free to modify these settings (which are stored as a user preference).

4.3.2. Text search

First, XIRAF has no ‘raw’ string search facility. Binary content can be searched through the keyword index, but this index uses a built-in ‘word’ definition that may not fit some of the strings that an investigator wants to search for. Adding a slow, linear-time search function is easy, but can lead to very high search times. Very fast string searches are possible by using more advanced indexing structures, such as (compressed) suffix arrays (Manber and Myers, 1990), but these structures consume large amounts of space.

Second, the current keyword index search is quite simple. Wildcard searches can be rather slow, because they result in a partial or full vocabulary scan. Phrase searches are not supported. The search function will not find words that differ only slightly from a search term. Most of these problems have been solved in commercial document indexing tools, but many of those tools have trouble handling large or partially corrupted input streams.

In response to user feedback, we added an autosuggest facility to XIRAF’s keyword search. While this is no replacement for some of the missing functions mentioned above, it does tell users which words are present in the dictionary. This sometimes gives users a useful search hint.

4.3.3. Reducing the number of objects

XIRAF’s backend can easily extract very large numbers of artefacts from an input. This is tempting, because it can be hard to predict in advance which artefacts will be relevant in an investigation, so the more the better. It is always possible (and easy) to filter out uninteresting objects through appropriate frontend queries. Unfortunately, a lack of selectivity in the backend tends to reduce the frontend’s query performance.
The reason is that the number of forensic artefacts stored in the frontend database has a significant impact on XIRAF’s query performance. Although the frontend database is heavily indexed, table scans cannot always be avoided. For reasonable frontend performance, it is important to prevent large-scale database pollution by artefacts that are not relevant to the investigation at hand. XIRAF tries to do this in two different ways. First, XIRAF operators control which backend tools are enabled during the preprocessing phase. Tools that are known to produce artefacts that are not relevant to the current investigation can simply be disabled altogether. This is particularly effective if it is easy to predict which artefacts are relevant in an investigation. Unfortunately, this is not always the case. Second, some tools have been modified to eliminate information that we consider likely to be uninteresting. This is a conservative approach that may still leave many artefacts, but it will at least reduce the amount of pollution. Our Windows registry tool uses this strategy. The registry is a large configuration database that contains some interesting information and a lot of not-so-interesting information. Our initial implementation of the registry tool
listed all key-value pairs present in a registry file. In real cases, these key-value pairs accounted for around 50% of all forensic artefacts discovered by XIRAF, but they are rarely used in queries. Today, the registry tool eliminates those branches of the registry that we expect to be uninteresting. This reduces the proportion of registry information in the frontend database to roughly 10%.

4.3.4. Query performance

For a medium-sized case, the XIRAF database can easily contain millions of forensic artefacts and their metadata. It is quite a challenge to provide interactive performance when dealing with such a number of forensic artefacts in ad-hoc queries.

Table 3 shows performance statistics for a number of recent, large-scale cases investigated with XIRAF. Two different types of requests to the frontend are listed: object list requests, which result in a (partial) list of objects that satisfy a given query, and object detail requests, which result in a list of all properties (details) for a single object. Overall, about 75% of all requests are answered in less than 4 s. Requests for object details are even faster: more than 98% of those requests are answered within 1 s.

The request response times illustrated by these numbers have a direct impact on user experience. The short object detail request response times ensure that users can quickly browse through result sets and investigate individual objects. The short response times for a large portion of the object list requests give users an interactive search experience in the vast majority of the cases.

While optimizing the query performance of XIRAF, we learned the following lessons:

The XIRAF database is read-only. During the publishing phase, all data is imported into the database and after publishing, only read queries are performed. This allows us to utilize the standard solutions for data warehousing in relational database systems, such as bitmap indexes and large block sizes.
With a read-only database, it is possible to cache data at all layers of XIRAF without read-inconsistency problems. XIRAF caches query results in multiple ways. First, individual objects are cached in an object cache. This cache contains the Java representation of XIRAF objects and holds only the objects’ properties, not their binary content. This cache is used to speed up the display of a single object’s properties in various ‘detail’ panels of XIRAF’s user interface. The object cache resides in the web server’s memory. A hit in the object cache eliminates a database query.

Table 3. Query performance numbers.

| Case | Images | Total size (GB) | Objects | Detail requests (% < 1 s) | List requests (% < 4 s) |
|---|---|---|---|---|---|
| A | 43 | 4,713 | 6,313,186 | 100.00 | 74.47 |
| B | 8 | 3,058 | 2,709,809 | 99.92 | 80.75 |
| C | 19 | 1,652 | 3,146,564 | 99.98 | 90.53 |
| D | 9 | 998 | 1,940,426 | 100.00 | 95.86 |
| E | 9 | 827 | 2,626,928 | 99.98 | 90.53 |
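The object cache described above can be pictured as a bounded in-memory map that sits in front of the database. The sketch below is a Python illustration under stated assumptions (the real cache holds Java objects): the class name `ObjectCache`, the least-recently-used eviction policy, the default capacity, and the `load_properties` callback are all hypothetical, since the text does not specify them.

```python
from collections import OrderedDict
from typing import Callable


class ObjectCache:
    """Bounded cache of object properties (no binary content), keyed by object id."""

    def __init__(self, load_properties: Callable[[int], dict], capacity: int = 1024):
        self._load = load_properties   # hypothetical stand-in for a database query
        self._capacity = capacity      # assumed bound; not given in the text
        self._entries = OrderedDict()  # object id -> properties, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, object_id: int) -> dict:
        if object_id in self._entries:
            self.hits += 1
            self._entries.move_to_end(object_id)  # mark as most recently used
            return self._entries[object_id]
        self.misses += 1
        properties = self._load(object_id)        # one database query per miss
        self._entries[object_id] = properties
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)     # evict the least recently used
        return properties
```

Because the database is read-only after publishing, cached entries cannot become stale, so a sketch like this needs no invalidation logic.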
Second, XIRAF caches subquery results in temporary database tables. Recall that XIRAF users compose their queries incrementally, usually by refining some previous selection. Under some circumstances, a query result is copied into XIRAF’s subquery cache. When the query is refined, the refinement step can be applied to the cached result. The current strategy is to cache small query results that took a ‘long’ time to compute. Third, the web browser’s GUI code caches views that are displayed to the user.

4.4. Web-based access

We believe that XIRAF, and systems like it, are of interest to a wide range of users, including computer forensic experts, detectives, and analysts. We have extensive experience with expert users, because our internal digital forensic experts have been using XIRAF, alongside traditional tools like EnCase (Guidance Software, Inc, 2012) and FTK (AccessData, 2012), in their investigations for almost four years.

Currently the Netherlands Forensic Institute offers an online computer forensic service with XIRAF. This is possible because XIRAF’s interface is web-based. This online service is actively being used by over a hundred investigators of all levels of technical expertise. This has provided us with considerable feedback on the usability aspects of the XIRAF user interface.

4.4.1. User interface performance

XIRAF’s web interface is relatively complex. It consists of a range of query panels, multiple result views (tabs), multiple detail views, a menu bar, etc. These panels contain trees, tables, and images. The web interface was written with Google’s Web Toolkit (GWT) (Google, 2012). GWT programs are written in Java and translated to JavaScript, which runs in a browser. XIRAF has benefitted greatly from the recent advances in JavaScript performance in browsers such as Mozilla Firefox and Google Chrome.

4.4.2. User interface lessons

A few important user interface issues arose during the development of XIRAF’s web interface.
First, in most regular operation environments, screen real estate is limited. Most Dutch police officers work with 1024 × 768 displays. Most XIRAF developers work with 1920 × 1200 displays and have a tendency to use the space available. We have recently started addressing this issue by modifying the user interface so that key elements can easily be viewed and manipulated on a small display. Interestingly, this also allows us to run the user interface on mobile tablets, such as Apple’s iPad. In retrospect, we would have saved time had we designed for a small display from the start.

Second, we modified the user interface so that it limits users to issuing at most one query at a time. This prevents scenarios in which users knowingly or unknowingly (re)issue multiple queries, thus congesting the database server.

Third, speed is an essential part of most users’ experience. We have therefore optimized the user interface in several ways:

- We avoid duplicate queries (through caching);
- We have removed views that are expensive to compute and that are not used much in practice;
- We moved some client-side processing to the server: Java (server side) tends to run faster than JavaScript (client side).

4.4.3. Browsing

XIRAF makes it easy to find objects of interest. Once an interesting object has been discovered, investigators frequently want to explore ‘related’ objects. When a user has selected an object X, for example, XIRAF will display a crumb trail of links to X’s ancestor objects. Users can click these links and see the details of the ancestor objects.

We are currently working on improving XIRAF’s browsing function by adding clickable links that give access to an object’s children. In addition, we are working on adding a suggestion facility to XIRAF. Given an object, this suggestion facility will suggest which other objects are most ‘related’ to that object. This involves defining suitable distance metrics and precomputing relevant interobject distances.

Browsing also comprises viewing object content. XIRAF provides a small panel in which it displays a type-specific representation of an object’s content. For images, XIRAF tries to display the image. For documents, XIRAF tries to show the text. For chat logs, XIRAF displays the conversation. And so on. At present, XIRAF does not use commercially available viewers, but plugging in such a viewer would be relatively easy and would increase the number of ‘viewable’ objects.

Initially, XIRAF provided only the small panel described above for viewing object content. Several users indicated that they frequently had to study the content of all objects in a result set. In response, we added a large panel and navigation buttons, so that users can now view one result object after another, using a large percentage of the available screen real estate. This panel is available in most result views (table, timeline, summary, picture).

4.4.4. Annotations

A common user task is to classify objects in a search result. This task is quite similar to a user adding labels to digital pictures taken with a photo camera.
Labels in our environment could be maybe-encrypted, uninteresting, legal-privilege-communication, needs-follow-up, and so on. At present, XIRAF provides no simple mechanisms for adding such tags to objects. XIRAF allows users to add notes to objects and to share those notes with other users. Notes could be used as a poor man’s tags, but this is clumsy at best. Implementing tags is not difficult; in retrospect it might have been better to implement tags before notes.

4.4.5. Search dimensions

With XIRAF, searches along multiple dimensions can be combined in an incremental manner and in a way that is intuitive to anyone who is used to navigating web shops. While XIRAF supports more search dimensions than most systems, one can easily envision other useful dimensions.
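The incremental combination of search dimensions can be sketched as a stack of filters applied to a set of objects, in the spirit of the query panel described in Section 3.11. This is a simplified illustration, not XIRAF’s query engine: the `ForensicObject` fields, the `Query` class, the filter helpers, and the sample data are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass(frozen=True)
class ForensicObject:
    object_id: int
    types: frozenset[str]   # an object may carry more than one type
    timestamp: datetime
    text: str


Filter = Callable[[ForensicObject], bool]


def of_type(type_name: str) -> Filter:
    return lambda obj: type_name in obj.types


def in_range(start: datetime, end: datetime) -> Filter:
    return lambda obj: start <= obj.timestamp <= end


def contains_keyword(keyword: str) -> Filter:
    needle = keyword.lower()
    return lambda obj: needle in obj.text.lower()


class Query:
    """A stack of query elements; each pushed element narrows the result set."""

    def __init__(self) -> None:
        self._elements: list[Filter] = []

    def push(self, element: Filter) -> "Query":
        self._elements.append(element)
        return self

    def pop(self) -> "Query":
        self._elements.pop()
        return self

    def run(self, objects: list[ForensicObject]) -> list[ForensicObject]:
        return [obj for obj in objects if all(f(obj) for f in self._elements)]


# Made-up sample data for illustration only.
objects = [
    ForensicObject(1, frozenset({"email"}), datetime(2011, 3, 1), "Meeting at the harbour"),
    ForensicObject(2, frozenset({"chat"}), datetime(2011, 3, 2), "see you at the harbour"),
    ForensicObject(3, frozenset({"email", "file"}), datetime(2012, 1, 5), "invoice attached"),
]

query = Query().push(contains_keyword("harbour"))
keyword_hits = [obj.object_id for obj in query.run(objects)]  # objects 1 and 2
query.push(of_type("email"))                                  # refine the selection
email_hits = [obj.object_id for obj in query.run(objects)]    # object 1 only
```

Adding a further dimension in this sketch amounts to writing one more filter helper; the stack itself does not change.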
5. Related work

Ayers (2009) describes seven metrics that can be used to gauge a computer forensic analysis system:

- Processing speed: XIRAF runs tools in parallel and overlaps processing with communication to hide I/O wait states. With this configuration XIRAF processes roughly one terabyte per day on a single (multicore) indexing node. Multiple indexing nodes can be used in parallel, but only to process multiple devices in parallel; at present XIRAF cannot distribute the processing of a single device over multiple indexing nodes.
- Data design and I/O bottlenecks: XIRAF routinely processes terabytes of data. To be able to do this, we pay constant attention to efficiency. As described above, we have merged some data-intensive tools to reduce duplicate data accesses. Other tools overlap communication and processing.
- Software errors: XIRAF is written in a relatively secure programming language; tools are run in separate processes; intermediate indexing results are stored in checkpoints for efficient recovery.
- Auditability: XIRAF does not keep an explicit audit trail of user actions. XIRAF logs internal API calls, but such calls occur at a slightly lower level of abstraction and are not immediately suited for auditing purposes. Adding higher-level auditing would be straightforward, but has not been a high enough priority so far.
- Planning and control of analysis tasks: The XIRAF release described here supports analysis tasks only by providing an effective search interface. The development version of XIRAF includes an evidence browser that guides investigators from one item to other, related items. The related-items links are based on relations that experience has shown to be useful in practice.
- Automation: XIRAF eliminates much manual processing by automating the trace collection process.
- Data abstraction: XIRAF uses relatively high-level, agnostic representations of forensic artefacts: e-mail, document, picture, file, etc.

Levine and Liberatore (2009) propose and discuss a format, DEX, that documents the provenance of digital evidence. Both DEX descriptions and XIRAF object descriptions are used to describe forensic artefacts. DEX, however, uses pointers to refer to related objects such as an object’s parent. XIRAF uses XML nesting to document parent–child relationships and, at present, does not support any other type of relationship.

DIALOG (Kahvedžić and Kechadi, 2009) is a framework for describing and managing (computer) forensic knowledge. XIRAF includes such an ontology in its data model. While DIALOG focuses on the Windows registry, the XIRAF ontology encompasses many more forensic artefacts, including e-mail, documents, files, chat logs, etc.

The advanced forensic format (AFF) (Cohen and Schatz, 2009) supports a construct, the map stream, that, similar to the stream constructors used by XIRAF’s binary content server, allows new evidence streams to be constructed by transforming existing evidence streams.
6. Conclusions We strongly believe that digital evidence should be accessible to nonexperts as well as experts. While there are real education issues, we expect that the advantages far
outweigh the risks. Technical specialists have become a bottleneck, in part because today they have to do manually what systems like XIRAF do automatically. XIRAF allows nonspecialist investigators to leverage their tactical knowledge and allows technical experts to leverage their technical knowledge.

XIRAF is a second-generation forensic analysis system that automates the collection of forensic artefacts and that provides strong query capabilities. With XIRAF, we find millions of forensic artefacts and we can search through these artefacts in effective ways. XIRAF is run as an online service that can be accessed through a web interface.

XIRAF is based on a few key ideas: automated data collection, uniform representation, a set-based query model, and web-based access. Implementing these ideas for terabyte-sized inputs has turned out to be challenging. Although performance and scalability remain key concerns for this type of system, we feel that we have come a long way. Data reduction, caching, parallelism, careful selection of algorithms, and close interaction with end users are key ingredients of our approach.

We have just completed version 1.2 of XIRAF; this is the version that is described in this paper. Today, the department of Digital Technology and Biometrics of the Netherlands Forensic Institute uses XIRAF for its analysis work, in addition to traditional tools such as EnCase and FTK. A number of Dutch law enforcement organizations are now using an online forensics service, based on XIRAF, offered by the Netherlands Forensic Institute for large-scale investigations. Several companies have also expressed interest in the system.
References

AccessData. Forensic toolkit (FTK), http://accessdata.com/products/forensic-investigation/ftk; 2012.
Alink W, Bhoedjang R, Boncz P, de Vries A. XIRAF – XML-based indexing and querying for digital forensics. Digit Investig 2006;3(Suppl. 1):50–8.
Apache. Derby, http://db.apache.org/derby; 2011.
Ayers D. A second generation computer forensic analysis system. In: Proceedings of the 8th Digital Forensics Research Workshop (DFRWS); August 2009. Montreal, Canada.
Boncz P, Grust T, van Keulen M, Manegold S, Rittinger J, Teubner J. MonetDB/XQuery: a fast XQuery processor powered by a relational engine. In: Proceedings of the 2006 ACM SIGMOD international conference on management of data. SIGMOD ’06. New York, NY, USA: ACM. p. 479–90. URL, http://doi.acm.org/10.1145/1142473.1142527; 2006.
Cohen M, Schatz SGB. Extending the advanced forensic format to accommodate multiple data sources, logical evidence, arbitrary information and forensic workflow. In: Proceedings of the 9th Digital Forensics Research Workshop (DFRWS); August 2009. Baltimore, USA.
Fielding RT, Taylor RN. Principled design of the modern web architecture. ACM Trans Internet Technol May 2002;2:115–50. URL, http://doi.acm.org/10.1145/514183.514185.
Google. Google web toolkit (GWT), https://developers.google.com/webtoolkit/; 2012.
Guidance Software, Inc. EnCase forensic, http://www.guidancesoftware.com/forensic.htm; 2012.
Kahvedžić D, Kechadi T. DIALOG: a framework for modeling, analysis and reuse of digital forensic knowledge. In: Proceedings of the 8th Digital Forensics Research Workshop (DFRWS); August 2009. Montreal, Canada.
Levine B, Liberatore M. DEX: digital evidence provenance supporting reproducibility and comparison. In: Proceedings of the 8th Digital Forensics Research Workshop (DFRWS); August 2009. Montreal, Canada.
Manber U, Myers G. Suffix arrays: a new method for on-line string searches. In: Proceedings of the first annual ACM-SIAM symposium on Discrete algorithms. SODA ’90. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. p. 319–27. URL, http://portal.acm.org/citation.cfm?id=320176.320218; 1990.
Oracle. Database, http://www.oracle.com/us/products/database/index.html; 2012.
Witten I, Moffat A, Bell T. Managing gigabytes: compressing and indexing documents and images. 2nd ed. San Francisco, CA: Morgan Kaufmann; 1999.