A General Purpose Lossless Data Compression Method for GPU

Marek Chłopkowski*, Rafał Walkowiak

Institute of Computing Science, Poznań University of Technology, ul. Piotrowo 2, 60-965 Poznań
Abstract

The paper describes a parallel method for lossless data compression that uses graphical processing units (GPUs). Two commonly used approaches to data compression, statistical and dictionary-based, have been applied in our method. The reduction of compression time was possible due to the implementation of multi-level parallel methods that use a single GPU or a set of GPUs efficiently. The base of our method is a search for repetitions in data that is executed in parallel with the use of sorted suffix tables. On the second level of concurrency, operations on different data blocks (data file reading, match searching, coding, compression and data file writing) are performed in parallel. The methods proposed, supplying a comparable compression ratio, achieve a better compression speed than standard CPU-based compression tools used in personal computers. Experiments performed in technologically comparable systems showed that our approach is similar or even better in terms of power and cost efficiency.

Keywords: lossless data compression, GPU, parallel processing

* Corresponding author, email address: [email protected]

1. Introduction

Since data compression is as common for contemporary computer users as the presence of an efficient GPU in PC systems, it is easy to predict a need for widely accessible data compression tools using the GPU. Such a tool, based on carefully selected ideas known from computer science and the theory of data compression
in particular, implemented efficiently for the GPU, is proposed in this paper. The organization of the paper is as follows. The basics of data compression theory and the state of the art in the area of lossless compression methods for the GPU are briefly discussed in Section 2. Section 3 is devoted to a presentation of the proposed compression method. The description of the computational experiment and the obtained results are presented in Section 4 and Section 5. Section 6 finishes the presentation with conclusions concerning the results and perspectives of further development.
2. The lossless data compression fundamentals
Lossless data compression consists in the conversion of input data to output data of a smaller size by the reduction of data redundancy [17]. It can be achieved by assigning shorter codewords to more frequent symbols in the input data (statistical compression) or by replacing the longest possible substrings with their dictionary codes (dictionary compression). There are many, sometimes very sophisticated, compression methods which use one or both of these approaches. In this paper we focus on the details of the approaches that we used as sources of ideas for our method.
2.1. A statistical approach
Results of statistical observations can be incorporated into data compression methods. The statistics may be obtained by analyzing the input data (frequencies of symbols or words) or be given a priori from more general observations (e.g. letter, digram and trigram frequencies in a language). The idea behind this approach is to use a variable-sized prefix code and assign the shortest codewords to the most frequent substrings in the source, or to use arithmetic coding based on ranges of numbers with range widths proportional to substring probabilities. In our approach two prefix codes were used: the Huffman prefix code [9] and our variant of a general unary code. A prefix code, in general, is a variable-sized code where none of the codewords can be a prefix of another [15]. The codewords of a Huffman code are generated on a binary tree. The symbols are leaves and a
route to a leaf is coded by bits (1 for the left child and 0 for the right on every level of the tree). Leaves with the most frequent symbols are closer to the root of the tree and therefore have shorter codes. This code is usually generated by both the coder and the decoder. The coder counts the occurrences of each symbol and saves the obtained values into the output, so the decoder can build the same code tree. The decoder simply reads the input bit-by-bit and goes down the tree, choosing the right child every time it reads 0 and the left child when it reads 1. When the decoder reaches a leaf it reads an encoded symbol and starts again from the root. There is also an adaptive variant of Huffman coding, where the tree is rebuilt dynamically while processing subsequent input symbols (usually a better solution for on-line compression, when the input data cannot be processed twice). Unary coding is the simplest coding method, where the codeword for a value n is created by n "set bits" followed by one "reset bit" (e.g. 111110 represents the value 5). A general unary code is made of two parts: a unary step number and an n-bit binary value. The codewords are generated by the start-step-stop algorithm. The parameters have the following meaning: start is equal to the initial length of the binary part, step determines the increment of the binary part length, and stop determines the size of the longest codewords (in the case of the longest codewords, the step number is not followed by the reset bit) [15]. In the case when step is equal to 1, this code is quite similar to the Elias gamma code, which is constructed of l zeros followed by a one and a binary part of length l [15]. This code has the shortest codewords for the smallest coded values and is therefore suitable for coding short dictionary distances. A sample of a general unary code is presented in Table 1.
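To make the start-step-stop construction concrete, the following minimal sketch encodes a value under the (1, 1, 4) parameters of Table 1. It is an illustration with hypothetical names, not code from our implementation.

#include <cassert>
#include <cstdint>

// Encode 'value' with the start-step-stop code (start=1, step=1, stop=4)
// of Table 1. Bits are returned MSB-first in 'code', their count in 'length'.
void encodeStartStepStop(uint32_t value, uint32_t& code, uint32_t& length) {
    const uint32_t start = 1, step = 1, stop = 4;
    uint32_t k = start;       // binary-part width of the current group
    uint32_t base = 0;        // first value covered by the current group
    uint32_t ones = 0;        // leading '1' bits (step number minus one)
    while (base + (1u << k) <= value) {   // find the group holding 'value'
        base += 1u << k;
        k += step;
        ++ones;
        assert(k <= stop);                // (1,1,4) covers values 0..29 only
    }
    const uint32_t binaryPart = value - base;
    if (k < stop) {           // prefix: 'ones' set bits plus a terminating 0
        code = (((1u << ones) - 1) << (k + 1)) | binaryPart;
        length = ones + 1 + k;
    } else {                  // last group: no reset bit after the '1' run
        code = (((1u << ones) - 1) << k) | binaryPart;
        length = ones + k;
    }
}

For example, the value 6 falls into the third group and yields the 6-bit codeword 110 000, matching Table 1.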
Value   Step number   Codeword   Codeword length (bits)
0       1             0 0        2
1       1             0 1        2
2       2             10 00      4
3       2             10 01      4
4       2             10 10      4
5       2             10 11      4
6       3             110 000    6
7       3             110 001    6
...
13      3             110 111    6
14      4             111 0000   7
15      4             111 0001   7
...
29      4             111 1111   7

Table 1: An example of a general unary code for parameters (1, 1, 4).
2.2. A Dictionary compression
The general idea of dictionary compression consists in the replacement of input substrings with the shorter codewords of dictionary entries. In this way the size of the resulting data string is reduced. There are two main approaches to the construction of dictionaries, based on the works of Lempel and Ziv published in the late 70's [19, 20]. The first approach is based on the LZ77 sliding window algorithm [19]. In this method the dictionary is a linear part of the processed data. In general, two buffers are used: a search buffer (the dictionary) and a look-ahead buffer (the currently coded area of strings searched for in the dictionary). For every position in the look-ahead buffer the algorithm:
• searches (left-to-right) the search buffer for a symbol equal to the symbol at the current position in the look-ahead buffer,
• if the symbol is found, saves the distance from the end of the search buffer as an offset; then the subsequent symbols from the two buffers are compared and
the number of equal symbols is saved as the length of the match, • repeats the above operations in order to find the longest match in the dictionary,
• emits a codeword: a concatenation of the offset, the match length and the next symbol in the look-ahead buffer,
• moves the search window by the length (plus 1) of the match found. If no match has been found, the codeword (0, 0, symbol) is emitted (see the sketch below).
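The sketch below illustrates the LZ77 loop described above with a naive longest-match scan; the token layout follows the text, while the names are ours. Real implementations replace the inner scan with faster dictionary structures (comp. Sec. 2.4).

#include <cstddef>
#include <string>
#include <vector>

// (offset, length, next symbol) token, as described in the text above.
struct Lz77Token { std::size_t offset; std::size_t length; unsigned char next; };

std::vector<Lz77Token> lz77Compress(const std::string& data, std::size_t window) {
    std::vector<Lz77Token> out;
    std::size_t pos = 0;
    while (pos < data.size()) {
        const std::size_t winStart = pos > window ? pos - window : 0;
        std::size_t bestLen = 0, bestOff = 0;
        for (std::size_t cand = winStart; cand < pos; ++cand) {
            std::size_t len = 0;
            while (pos + len < data.size() - 1 &&          // keep one symbol for 'next'
                   data[cand + len] == data[pos + len])
                ++len;
            if (len > bestLen) {                           // keep the longest match
                bestLen = len;
                bestOff = pos - cand;                      // distance from the window end
            }
        }
        // (0, 0, symbol) is emitted when nothing matched, as in the text
        out.push_back({bestOff, bestLen,
                       static_cast<unsigned char>(data[pos + bestLen])});
        pos += bestLen + 1;                                // match length plus one symbol
    }
    return out;
}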
The second approach to dictionary-type compression is based on the LZ78 algorithm [20]. The dictionary is an algorithm-specific data structure that is successively expanded or rebuilt based on the input symbols processed so far. The dictionary can also be filled with some entries before the processing starts. The output contains the index of the matched dictionary entry and the subsequent symbol (or only an index, when the dictionary is filled with all possible symbols at the start). Precise information concerning the algorithm can be found in [15, 18, 20]. The basic differences between the two methods described above lie in the coding and in the limitations. In the first method distance-length pairs are encoded and the search area is limited to a given window size. Within the second method dictionary indexes are encoded and the search is limited by the dictionary size. We used the LZ77-based approach with parallelization of the match search process.
2.3. A combined statistical and dictionary approach
One of the best known compression algorithms, widely used in today's applications, is Deflate [7]. It is used in the Windows Zip and Unix Gzip coders, the HTTP protocol, and the PNG and PDF formats. This algorithm combines a statistical compression approach (Huffman coding) and a dictionary method (based on the LZ77 concept). The main enhancement of the LZ77 method incorporated into Deflate is that when a string is not found in the search buffer, the coder emits a symbol codeword; in the opposite case the coder emits a length-distance pair pointing to the match found. The coder uses two sets of Huffman codewords: the first for symbols and lengths, and the second for distances. It is unambiguous what type of codeword may occur at a given position: after a length codeword there can be only a distance codeword, and after a symbol codeword another symbol codeword or a length codeword may occur. A compressed file contains data blocks of variable length, which is written at the beginning of each block. Each block may contain uncompressed data, data compressed using predefined prefix codes (described in the RFC [7]), or data compressed using custom prefix codes. In the last case, the codes have to be written at the beginning of the output block using a specific
algorithm described in [7]. A combination of the statistical and dictionary approaches is used in our compression method. The program output consists of two types of fixed-size data blocks: uncompressed blocks and compressed blocks encoded with custom codes different from the ones used in Deflate.
2.4. A search for matches within a dictionary compression
One of the most difficult problems in data compression is an efficient search for dictionary matches. Every standard uses its own dictionary structure and specific search methods. These procedures can be as simple as a naive search (as in the original LZ77) or as sophisticated as in modern applications (e.g. 7-Zip), which use hash tables and binary search trees described in [15]. In our method the search for matches is based on the constrained suffix tables described in Sec. 3.1.
2.5. Related works
There are many compression algorithms which make use of GPUs. Most of them are highly specialized (e.g. for floating-point data [12]) or perform lossy compression for graphical applications [3, 5]. As far as we know, there are very few works aimed at general purpose lossless compression (independent of the data type) based on GPU use. There is an effective approach using statistical compression and variable-length encoding proposed by Balevic in [4], and very effective implementations of the LZSS compression algorithm by Ozsoy et al. in [13, 14]. These algorithms perform very well in terms of compression speed but are far less effective in terms of compression ratio. The first approach is fast since it performs only a statistical compression without using a dictionary. The second one uses a very small dictionary window: its size is up to 512 bytes due to GPU limitations. The compressed files generated by programs based on the above approaches are at least two times larger than the GZIP output. Section 5 shows the advantage of our approach over the GZIP program and, indirectly, also over the results described in [4, 13, 14].
3. The method
We designed a compression method based on the LZ77 and Deflate concepts [19, 7]. The adopted encoding is similar to the one used in the Deflate method (comp. Sec. 2.3). The symbol and length codes are based on the frequencies of their appearance in the output, but distances are encoded using our variant of a general unary code combined with a Huffman code (comp. Sec. 3.3).
3.1. A search for the dictionary match
In our approach we use a suffix table as the source of information about possible matches of coded items. The most important property of a suffix table is that it contains items (prefixes of input string suffixes) sorted in lexicographical order. Therefore it is easy to find a match of a string in the table using a divide and conquer method. It can be done in O(n log m) time, where n is the length of the search pattern and m is the number of suffix table entries [8]. A single suffix table covers all the data in a separately compressed data block, but the sorting of table items is limited to 4-byte prefixes only. The number of entries in the suffix table is equal to the size of an input data block (16 777 216; this block size was chosen based on the GTX 260 GPU memory size and the requirements of the method's data structures). A single entry of the table contains a prefix and its index in the input data. The table is generated using a parallel MSB radix sort algorithm [6, 16, 11], where the 4-byte prefix is the sorting key. It is a stable algorithm and therefore entries concerning equal prefixes are arranged in the table in the same order as they occur in the input data [6]. The goal of constructing a suffix table limited to 4-byte prefixes was to efficiently prepare a data structure suitable for finding the best possible match in the dictionary for every position of the compressed data (or for finding that there is no match for a given position). The search for a match is done in parallel by a set of threads on the GPU. Each thread is responsible for a single entry in the suffix table and performs the match search within a fixed range of the suffix table. The following simple algorithm is performed by each thread.
Algorithm 1
1. Set i = 1.
2. Take the suffix table item at the threadIndex position (different for each thread in a CUDA grid) as the base item.
3. Take the suffix table item at the threadIndex - i position as the offset item.
4. Compare the prefixes of the base item and the offset item.
5. If the prefixes are equal, compare subsequent bytes taken from the raw data (pointed to by the index field of a suffix table entry) until the first mismatch occurs. Compare the length of the match obtained to the length of the best match found so far for the base item and record the longest match (in a match table, at the index that corresponds to the position of the base item in the input data).
6. Repeat steps 3-5 for i = i + 1 until n items prior to the base item are checked.
7. Break the procedure when no match is found, writing a 0-length match value into the match table.
With an increase of the search position range (n) the compression improves. We have shown experimentally that n = 4 gives the best relation between compression speed and compression ratio. As a result of the above procedure we get the match table, which contains two pieces of information for every suffix in the input data: a match length and a match distance. The indexes in the match table correspond to data positions. A match at index 1024, having length = 8 and distance = 300, means that the 8 bytes following the 1024-th position are equal to the bytes at positions 724-731. It means that the string of 8 bytes can be replaced with the shorter codes of a length-distance pair in the output stream. Many of the matches found are not used because, when a match of length n is used, the next n positions in the match table are skipped. A condensed CUDA sketch of this kernel is given below.
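The sketch below condenses Algorithm 1 into a CUDA kernel. The structure layouts, names and the direct global-memory reads are our assumptions; the original kernel differs in such details (for instance, input data is stored in texture memory, comp. Sec. 3.6).

#define SCAN_POSITIONS 4

struct SuffixEntry { unsigned int prefix; unsigned int index; };
struct Match       { unsigned int length; unsigned int distance; };

__global__ void findMatches(const SuffixEntry* table,
                            const unsigned char* data,
                            unsigned int blockSize, Match* matches)
{
    unsigned int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= blockSize) return;
    const SuffixEntry base = table[t];            // step 2: the base item
    Match best = {0, 0};
    for (unsigned int i = 1; i <= SCAN_POSITIONS && i <= t; ++i) {
        const SuffixEntry off = table[t - i];     // step 3: the offset item
        if (off.prefix != base.prefix) break;     // sorted table: no equal prefix further back
        // the stable radix sort guarantees off.index < base.index here
        unsigned int len = 4;                     // equal 4-byte prefixes already match
        while (base.index + len < blockSize &&    // step 5: extend over raw data
               data[off.index + len] == data[base.index + len])
            ++len;
        if (len > best.length) {                  // record the longest match
            best.length   = len;
            best.distance = base.index - off.index;
        }
    }
    matches[base.index] = best;                   // step 7: 0-length if nothing matched
}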
3.2. Analysis of dictionary matches
In the single GPU processing model, the analysis of dictionary matches is a simple, serial process. The match table is analyzed, entry after entry, and if a match exists and is promising (i.e. the estimated encoded string length is shorter than the match length), it is used (the repetition counters for the length and for the unary coding step number are incremented) and the current position in the source data is incremented by the match length. Otherwise (i.e. when there is no match or the match is not promising), a literal counter stored in a count table is increased. The counter values stored in the count tables are prepared as the input to the Huffman tree generation algorithm. The unary coding step of every match distance is computed on the GPU at the time a match is analyzed. In the multi-GPU processing strategy the analysis described above takes place on the GPUs and is performed in parallel for different parts of the match table. Each thread analyses 1024 positions of the match table. The procedure can give slightly worse compression results because of the fixed ranges of data analyzed separately. However, in most cases, the deterioration of the compression ratio is smaller than 1%, which, as the experiments show, is a good tradeoff between compression speed and quality.
3.3. Data encoding
The encoding used in the proposed approach is based on the Deflate concept (compare Sec. 2.3) with some modifications. To encode data the program generates Huffman codewords based on count tables. Two tables of counters are used: one common for match lengths and literals, and a second one for the step numbers of the distance code. Distance encoding is performed using a modified general unary code (described in Sec. 2.1). We used a Huffman code to code the step number. Instead of using the stop parameter, we assumed that the binary value part of the modified general unary code will not be longer than 24 bits (the size of an encoded data block is 16 MB). We counted all repetitions of the step number values (24 possible values) and generated Huffman codewords for them. In this way we got the shortest codewords for the most common step numbers.
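The sketch below shows one plausible reading of this distance code: the step number is taken as the position of the distance's most significant bit (24 possible values for a 16 MB block) and selects a Huffman codeword, while the remaining low-order bits form the binary part. The names huffCode, huffLen and the putBits sink are hypothetical.

#include <cstdint>

// Emit one distance codeword: Huffman-coded step number plus raw binary part.
template <typename PutBits>
void emitDistance(uint32_t distance,               // 1 <= distance < 2^24
                  const uint32_t huffCode[24], const uint8_t huffLen[24],
                  PutBits&& putBits)
{
    uint32_t step = 0;                             // index of the highest set bit
    while ((distance >> (step + 1)) != 0) ++step;
    putBits(huffCode[step], huffLen[step]);        // Huffman codeword for the step number
    if (step > 0)                                  // binary part: 'step' low-order bits
        putBits(distance & ((1u << step) - 1), step);
}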
Data encoding is performed with the use of the following simple scheme. According to the meaning of the coded item, the suitable action is taken:
• for a literal, its Huffman codeword is written to the output stream,
• for a pair of a match length and a distance:
– a match length codeword (Huffman code) and
– a distance codeword (generated according to the unary coding algorithm) that consists of two parts:
∗ a Huffman codeword for the step number and
∗ a binary part
are written to the output stream.
In the multi-GPU processing strategy, distance codewords are stored in the match table for later use in the output stream generation (to reduce CPU computation).
3.4. Output stream generation
The output stream consists of data blocks containing:
• a 4-byte block format identifier:
– a block type identifier (possible values are 0 - uncompressed, 1 - compressed, 2 - end of file) — 2 bits,
– the block size in bytes — 30 bits,
• compressed/uncompressed data.
Blocks of type 1 contain:
• Huffman code tables (i.e. 256+252 16-bit normalized counts for literals and match lengths, 24 16-bit normalized counts for unary coding steps),
• a sequence of literal codewords and pairs of codes for a length and a distance of a match,
• an end-of-block symbol (the codeword for match length = 1).
A minimal sketch of the block header packing is given below.
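The 4-byte block format identifier described above can be packed as in the following sketch; the field order within the word is our assumption.

#include <cstdint>

// 2-bit block type (0 = uncompressed, 1 = compressed, 2 = end of file)
// packed together with a 30-bit block size in bytes.
enum BlockType : uint32_t { RAW = 0, COMPRESSED = 1, END_OF_FILE = 2 };

inline uint32_t packHeader(BlockType type, uint32_t sizeBytes) {
    return (static_cast<uint32_t>(type) << 30) | (sizeBytes & 0x3FFFFFFFu);
}

inline BlockType headerType(uint32_t h) { return static_cast<BlockType>(h >> 30); }
inline uint32_t  headerSize(uint32_t h) { return h & 0x3FFFFFFFu; }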
In the single GPU implementation the CPU analyses the match table and encodes the data into an output stream. In the multi-GPU implementation, encoding and output stream generation are performed on the GPU; in this case the CPU only writes the compressed data blocks to the output.
3.5. Decompression
In our implementation the decompression (i.e. decoding) process is performed on the CPU. Bit-by-bit input processing and decoding Huffman codewords by going down a tree structure (see Sec. 2.1) can be very time consuming. To speed up the decoding process we used code tables (referred to further as decompression tables) inspired by [15]. The decompression table contains 65536 entries. For codewords shorter than 16 bits, each table entry contains information about a codeword value (symbol) and its length. For codewords longer than 16 bits, the table contains a pointer to the last Huffman node reached by going down the tree according to the 16 bits read (i.e. the node at the 16th level of the tree). Decoding of a single Huffman codeword is done as follows:
1. Without moving the input stream position, read ahead a 16-bit number and store it as addr.
2. Read the decompression table entry at the addr position.
3. If the entry's codeword length is less than or equal to 16, return the codeword value and move the input stream position by the codeword length.
4. If the entry's codeword length is greater than 16:
• move the input stream position by 16,
• read (from the decompression table) the address of the Huffman tree node labeled by the 16-bit part of the codeword,
• starting from this tree node perform bit-by-bit decompression (by reading subsequent input bits and going down the tree),
• when a leaf of the tree is reached, return the codeword's value.
A sketch of this decoder is given below.
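The following sketch shows the decoding procedure above; the structure names and the MSB-first bit order are our assumptions, and the input buffer is assumed to be padded so the 16-bit look-ahead never overruns.

#include <cstddef>
#include <cstdint>

struct Node { const Node* child[2]; uint16_t symbol; bool leaf; };

struct TableEntry {
    uint16_t symbol;      // decoded value, valid when length <= 16
    uint8_t  length;      // codeword length; a value > 16 marks the slow path
    const Node* node;     // tree node at depth 16 for long codewords
};

struct BitReader {
    const uint8_t* data;  // padded input buffer
    std::size_t bitPos = 0;
    int bit(std::size_t i) const { return (data[i >> 3] >> (7 - (i & 7))) & 1; }
    uint16_t peek16() const {               // next 16 bits without advancing
        uint16_t v = 0;
        for (int i = 0; i < 16; ++i) v = (v << 1) | bit(bitPos + i);
        return v;
    }
    void skip(unsigned n) { bitPos += n; }
    int next() { return bit(bitPos++); }
};

uint16_t decodeSymbol(BitReader& in, const TableEntry table[65536]) {
    const TableEntry& e = table[in.peek16()];   // steps 1-2: look-ahead index
    if (e.length <= 16) {                       // step 3: short codeword resolved
        in.skip(e.length);
        return e.symbol;
    }
    in.skip(16);                                // step 4: consume the 16-bit part
    const Node* n = e.node;                     // resume at the depth-16 node
    while (!n->leaf)                            // bit-by-bit walk to a leaf
        n = n->child[in.next()];
    return n->symbol;
}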
Step 4 occurs very rarely because the longest codewords are assigned to the least frequent symbols. Many entries of the decompression table are equal because a codeword shorter than 16 bits has to be followed by all possible combinations of subsequent bits, which are irrelevant for this codeword but are read as part of the 16-bit value. For example, the information corresponding to the codeword 0011 is stored in all table entries in the range from 0011000000000000 to 0011111111111111. In this way a codeword can be decoded correctly and fast by simply reading the table entry (regardless of the bits following it). The use of code tables results in decompression 3 times faster than an algorithm performing a full Huffman tree walk.
3.6. Parallel processing issues
The main concept of the method is based on the parallel processing of multiple data blocks. Three levels of parallelism are incorporated in the method:
• a parallel compression within a data block - performed by a grid of thread blocks on each GPU (on different data blocks in a multi-GPU system),
• a CPU computation - a part of the compression job performed on the CPU, and
• transfers of data blocks performed in parallel with CPU computation (transfers between: disk and computer memory, computer memory and GPU memory, GPU memory and computer memory, computer memory and disk).
The processing strategy is different for single and multi-GPU computer systems, and different task assignments to the CPU and GPU take place in the two cases.
The work sharing in the single GPU processing strategy
The following stages of computation are performed in the sequence given below for a single data block (the processing units involved in a task are given in brackets):
1. Loading a data block from a source file [CPU].
2. Transfer of the data block to the GPU (data is stored in texture memory).
3. Creating and sorting a suffix table [GPU].
4. Finding matches in the suffix table [GPU].
5. Transfer of matches to CPU memory.
6. Analysis of matches [CPU].
7. Counting repetitions of items for coding purposes [CPU].
8. Generation of Huffman trees [CPU].
9. Writing counters of item repetitions to an output buffer [CPU].
10. Writing encoded data to the output buffer [CPU].
11. Writing data from the buffer to an output file [CPU].
With the work distribution proposed above we achieved, in general, a good balance of the work on the CPU and GPU. Figure 1 presents the time periods of subsequent data block processing in the single GPU approach. The GPU is busy during the whole period of computation. In general, the work performed by the CPU for a single data block takes less time than the concurrently performed GPU processing and is completed before the next task (subsequent CPU work) is prepared by the GPU. Sometimes (especially in the case of low redundancy data) the CPU tasks performed for a single input data block take more time than the processing of the next block by the GPU (comp. Fig. 2-3). This is the reason why in the final multi-GPU processing strategy we moved more tasks to the GPU. In this way the CPU computation bottleneck disappeared for all types of compressed data, when the only tasks performed on the CPU were the generation of Huffman trees and the writing of coded information to a disk file. An example of a schedule chart with processing times for a program implemented according to this strategy is presented in Fig. 4 (in the figure the GPUs used are numbered with subsequent integer numbers). CPU processing in the program is based on four types of CPU threads:
• a main thread,
• a GPU compression manager (one thread for every GPU),
• a CPU compression manager,
• a thread for the task of writing output data to a disk.
The main program thread is responsible for running the other threads and starting/stopping the whole compression process. A GPU compression manager thread loads input data, manages GPU-CPU data transfers and controls the execution of GPU subprograms. In the "full" GPU program version it also performs the task of Huffman tree generation. A CPU compression manager is responsible for the CPU tasks performed in the single GPU program version. In the multi-GPU program, the compression managing thread is used only to load data from a file, serve it to the GPU thread, retrieve a compressed data block from GPU memory and pass it to the output thread. The proposed sharing of the work between threads allows us to parallelize data processing efficiently; a minimal sketch of such a pipeline is given below.
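The following sketch, using std::thread in place of the Pthreads/WinAPI threads of the implementation, only illustrates how reading, compression and writing overlap on different data blocks; the block count and the 16 MB size stand in for the real file reader.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Block { std::vector<unsigned char> bytes; bool last = false; };

class BlockQueue {                      // small producer/consumer queue
    std::queue<Block> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Block b) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(b)); }
        cv_.notify_one();
    }
    Block pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        Block b = std::move(q_.front());
        q_.pop();
        return b;
    }
};

int main() {
    BlockQueue toGpu, toDisk;
    std::thread reader([&] {            // loads input blocks from a file
        for (int i = 0; i < 8; ++i)     // stand-in for the real file reader
            toGpu.push(Block{std::vector<unsigned char>(16u << 20), i == 7});
    });
    std::thread manager([&] {           // GPU compression manager
        for (;;) {
            Block b = toGpu.pop();
            bool last = b.last;
            // transfer to GPU, sort suffix table, find matches, encode ...
            toDisk.push(std::move(b));
            if (last) break;
        }
    });
    std::thread writer([&] {            // output thread: saves compressed blocks
        for (;;) {
            Block b = toDisk.pop();
            // fwrite(b.bytes.data(), 1, b.bytes.size(), out);
            if (b.last) break;
        }
    });
    reader.join(); manager.join(); writer.join();
}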
Figure 1: Periods of data block processing in a single GPU program
4. Computational experiment
An evaluation of the efficiency of the proposed approach was performed in three different computer systems, with popular compression programs and different types of input data. The quality results (speed and compression ratio) obtained for our program were compared with the results of popular compression programs run in different configurations, noted as follows:
1. WinRAR v3.90 & v5.01 (x64) shareware (www.winrar.pl):
Figure 2: Periods of data block processing for high redundancy data in a multi-GPU program (4 GPUs, with data encoding on CPU)
• WinRAR-Best: run configuration: Best — best compression ratio, • WinRAR-Fast: run configuration: Fast — medium processing time, • WinRAR-Fastest: run configuration: Fastest — shortest processing time.
2. 7-zip for Windows v4.65 & v9.20 (x64, www.7-zip.org) — using LZMA & LZMA2 algorithm respectively: • 7z-Best — run configuration: Ultra — best compression ratio, • 7z-Fast — run configuration: Fastest — shortest processing time. 3. 7-zip for Linux v9.13 (x86, www.7-zip.org) — LZMA2 algorithm in the fastest compression mode: • 7z-Linux-4 — 4 processor cores used — command line arguments: -mx0 -mmt=4 -m0=LZMA2,
• 7z-Linux-8 — 8 processor cores used — command line arguments: -mx0 -mmt=8 -m0=LZMA2.
4. GZIP v1.5 running on Linux:
• GZIP9 — run configuration: best (-9) — best compression ratio,
• GZIP1 — run configuration: fast (-1) — shortest processing time.
Figure 3: Periods of data block processing for low redundancy data in a multi-GPU program (4 GPUs, with data encoding on CPU)
The Linux version of 7-zip was also tested with the LZMA algorithm but the obtained processing speed was very close to the Windows version (LZMA uses at most 2 cores). In our experiment we used three hardware platforms:
System 1: a desktop computer running Windows Vista 64-bit with an Intel Core2 Duo E8400 CPU @ 3 GHz, 6 GB of RAM @ 800 MHz and an NVIDIA GeForce GTX 260 GPU with 896 MB of global memory (driver v195.62),
System 2: a server running Linux 2.6.27 with 2 Quad-Core Intel Xeon E5405 CPUs @ 2 GHz, 16 GB of RAM @ 667 MHz and a 4-GPU NVIDIA Tesla S1070 with 16 GB of global memory (driver v3.20), and
System 3: a desktop computer running Windows 7 x64 with an Intel Core i7-3820 @ 3.7 GHz, 16 GB of RAM @ 1866 MHz and an NVIDIA GeForce GTX 670 with 2 GB of global memory.
Under the Windows environment our program was compiled using MS Visual Studio C++ 2008 Express Edition and CUDA versions 2.3 (GTX 260) and 4.1 (GTX 670). Under Linux, GCC 4.3.2 and CUDA 3.1 were used. Parallel multithreaded processing was implemented using the Pthread programming model under Linux and WinAPI threading under Windows. To analyze execution times and compare program results we used four test files (described in Tab. 2) evaluated in terms of data redundancy.
Figure 4: Periods of data block processing for high redundancy data in a multi-GPU program (4 GPUs, final multi GPU processing strategy)
We used two measures (defined in [17]) characterizing data eligibility for compression: data entropy and data redundancy. Entropy is a measure of uncertainty or the "amount of surprise" in data and is calculated as:

H = -\sum_{i=1}^{n} P_i \log_2 P_i    (1)

where P_i is the probability of occurrence of the i-th symbol of the alphabet. Redundancy is the difference between the maximum theoretical entropy of the data and its actual entropy. It is calculated as:

R = -\sum_{i=1}^{n} P \log_2 P - \left( -\sum_{i=1}^{n} P_i \log_2 P_i \right) = \log_2 n + \sum_{i=1}^{n} P_i \log_2 P_i    (2)
where P represents the symbol probability yielding the highest entropy (P = 1/n). In order to analyze the correlation between the source redundancy and program execution time (global and partial) we used three files with significantly different entropy:
• a low redundancy file (test.LR — a file generated using a pseudo-random number generator),
• a medium redundancy file (NFS.tar — a TAR archive),
Figure 5: Periods of data block processing for low redundancy data in a multi-GPU program (4 GPUs, "full GPU" approach)
• a high redundancy file (RFC.tar — a TAR archive).
For each file we calculated the entropy and redundancy (in the binary system — logarithm of base 2) for 1-, 2- and 3-byte symbols by counting occurrences of each symbol and calculating the values of H and R according to the formulas given above. The calculated values are labeled H1, H2, H3 and R1, R2, R3. The maximum possible entropy equals the number of bits needed to represent the symbol (8, 16 and 24 bits). The low redundancy file was created by generating random values (0-255) and writing them to an output stream. The medium redundancy file was created as a TAR archive made of computer game demo files. It contains various file types: text (configuration, messages, etc.), binary (executables, libraries) and game specific data files (car models, sounds, etc.). The high redundancy file was created as a TAR archive containing text files (*.txt) — Request for Comments (RFC) documents. The files used were downloaded from the FTP server ftp://ftp.cyf-kr.edu.pl/pub/mirror/rfc/. In order to test the programs from a user point of view we also prepared a big TAR archive (1409 MB), mixed.tar, with various content types. The file contains different parts (files) that were downloaded mainly from the Compression Ratings project website [1]:
Figure 6: Compression ratios for different programs, compression of the data file of mixed content; CPU results from Systems 2 & 3
1. Text (crtext1 100m) — based on archival files from Project Gutenberg.
2. Audio (cr audio1.tar, extended version) — audio files in CD quality (FLAC format).
3. Knowledge representation (enwik8) — the first 10^8 bytes of a Wikipedia XML archive file [10].
4. Database (freedb100m) — the first 10^8 bytes of a text image of the free CD database.
5. Images (img1.tar) — high quality uncompressed images in the PPM format.
6. Medical files (lukas.tar) — a set of two-dimensional 8-bit RTG pictures [2].
7. Source code (src1.tar) — contents of the subdirectory gcc-4.2.0 of the GNU Compiler Collection (GCC) package.
8. Application files (app3.tar) — PortableApps.com Suite 1.1.
9. Game files (NFS.tar).
10. RFC documents (rfc.tar).
11. High entropy data (test.LR).
Detailed information about the size, redundancy, the best acquired compression ratio and the physical position of the files in the archive is presented in Tab. 2.
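As an illustration of Eqs. (1) and (2), the sketch below computes H1 and R1 for 1-byte symbols (n = 256); H2 and H3 follow the same scheme with 2- and 3-byte symbols. It reflects our reading of the measurement procedure, not the authors' code.

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

void entropyRedundancy(const std::vector<uint8_t>& data) {
    const int n = 256;
    std::vector<uint64_t> count(n, 0);
    for (uint8_t b : data) ++count[b];          // symbol histogram
    double h = 0.0;
    for (int s = 0; s < n; ++s) {
        if (count[s] == 0) continue;            // 0 * log 0 treated as 0
        double p = static_cast<double>(count[s]) / data.size();
        h -= p * std::log2(p);                  // Eq. (1)
    }
    double r = std::log2(static_cast<double>(n)) - h;   // Eq. (2): H_max - H
    std::printf("H1 = %.2f bits, R1 = %.2f bits\n", h, r);
}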
Figure 7: Compression speed for different programs, compression of the data file of mixed content; CPU results from Systems 2 & 3
Figure 8: Compression speed for data files of different redundancy; CPU results from Systems 2 & 3
This archive (mixed.tar) was used to tune the parameters of the MCCC program and to compare our program with WinRAR, 7-Zip and GZIP. We also compressed every file with all the programs in their best compression configurations and measured the best compression ratio, labeled CRbest in Tab. 2. During the computational tests we measured the compression time and the compression ratio for each test data file and compression program. The compression ratio [15] CR and the compression speed CS were computed according to the
Figure 9: Compression ratio for data files of different redundancy; CPU results from Systems 2 & 3

File            Size     Position in TAR   H1     R1     H2     R2     H3     R3     CRbest
mixed.tar       1409 MB  n.d.              7.07   0.93   13.02  2.98   18.09  5.91   47.80%
crtext1 100m    95 MB    0%-7%             4.55   3.45   8.10   7.90   11.00  13.00  24.55%
cr audio1.tar   118 MB   7%-15%            7.99   0.01   15.96  0.04   23.59  0.41   99.99%
enwik8          95 MB    15%-22%           5.08   2.92   8.97   7.03   12.05  11.95  24.75%
freedb100m      95 MB    22%-29%           5.57   2.43   9.07   6.93   11.45  12.55  20.70%
img1.tar        87 MB    29%-35%           6.80   1.20   12.65  3.35   16.45  7.55   47.29%
lukas.tar       50 MB    35%-38%           7.13   0.87   9.98   6.02   12.71  11.29  36.62%
src1.tar        96 MB    38%-45%           5.61   2.39   9.73   6.27   12.74  11.26  13.27%
app3.tar        95 MB    45%-52%           7.04   0.96   12.83  3.17   17.25  6.75   32.79%
NFS.tar         138 MB   52%-62%           7.26   0.74   13.92  2.08   19.88  4.12   56.04%
rfc.tar         282 MB   62%-82%           4.72   3.28   8.13   7.87   10.67  13.33  14.30%
test.LR         256 MB   82%-100%          8.00   0.00   16.00  0.00   23.86  0.14   100.00%

Table 2: Details of the mixed content test file and the data file parameters; NFS.tar, rfc.tar, test.LR and mixed.tar were also used separately
following formulas:

CR = \frac{S_{out}}{S_{in}} \cdot 100\%

where S_{out} is the compressed stream size and S_{in} is the input stream size (before compression); the lower the CR value, the more efficient the compression, and

CS = \frac{S_{in}}{T}

where T is the overall program execution time. The value of CS is expressed in megabytes per second (MB/s). In the case of the MCCC program the computation time of subsequent program tasks was measured within the program. The compression speed for the compared programs was calculated based on the information displayed by the program's in
Figure 10: Contribution of the processing time of separate MCCC program stages to the overall file compression time, normalized; results for different data redundancy: low - LR, medium - MR, high - HR; systems/program versions: GTX - System 1/single GPU, Tesla - System 2/multi GPU
terface. The total execution time was the time displayed right before the end of a compression. In general, the last step of the data compression process is writing the output data to a disk file. Saving of subsequent data blocks is done in parallel with computation, but sometimes this process may lengthen the program run time due to an observed disk writing bottleneck. The higher the compression speed, the more important the HDD writing speed becomes for its precise measurement. To remove this interference we used a "null" operating system device as the output target. For all decompression speed tests we also used a "null" destination.
5. Results
The presentation of the results is divided into 5 parts devoted to particular issues concerning computing efficiency. In the first subsection we compare the efficiency of different data compression programs. The following section gives a deeper view of the load balance of parallel processing for different scheduling schemes in heterogeneous CPU-GPU systems. The dependence between
Figure 11: Processing time of separate MCCC program stages needed for compression of 100MB data; results for different data files and systems/program versions; GTX - System 1
the redundancy of processed data and the compression speed is discussed in the third subsection. The fourth section is devoted to GPU processing optimization issues. At the end, the computing speed results for the decompression process are discussed.
5.1. Mixed input data processing results
In Sec. 3.6 we proposed different approaches to balance the load of the computing systems used. The programs differ in the amount of work performed by the CPU and GPU. In the multi-GPU approach the CPU becomes better utilized when the program uses more GPUs. Because of a better workload balance, the single GPU version of the program is significantly faster than the multi-GPU version using one GPU. In the compression ratio comparison (see Fig. 6) we show results for both variants of our program: single-GPU and multi-GPU using 1 GPU (System 1). The compression ratio is independent of the number of GPUs used but may differ for different task assignments between the CPU and GPU. The compression ratios of our programs and of the other programs run in the "fast" mode are comparable. Depending on the data redundancy, the compression quality of our programs can be slightly better or worse (the latter in the case of very highly redundant data).
Program          System   Compression speed [MB/s]   Decompression speed [MB/s]   Compression ratio
7z-Best          1        2.0                        26.6                         47.80%
WinRAR-Best      1        4.0                        9.2                          48.16%
7z-Best          3        5.7                        79.6                         47.55%
GZIP9            2        8.5                        92.7                         55.88%
7z-Fast          1        9.4                        23.5                         54.52%
7z-Linux-4       2        15.5                       25.2                         54.29%
WinRAR-Best      3        17.6                       117.4                        49.87%
GZIP1            2        24.7                       85.9                         59.18%
WinRAR-Fast      1        26.1                       56.4                         57.20%
7z-Linux-8       2        29.4                       25.2                         54.29%
MCCCn 1          2        36.9                       54.2                         55.13%
WinRAR-Fast      3        37.1                       134.2                        51.62%
MCCC1 GTX260     1        47.9                       57.5                         54.51%
7z-Fast          3        57.0                       62.6                         54.22%
MCCCn 2          2        72.7                       54.2                         55.13%
WinRAR-Fastest   3        92.7                       120.4                        57.29%
MCCCn 3          2        106.2                      54.2                         55.13%
MCCC1 GTX670     3        111.0                      95.9                         54.51%
MCCCn 4          2        134.8                      54.2                         55.13%

Table 3: Compression test results for the data file of mixed content
An overall compression speed ranking is presented in Tab. 3 and Fig. 7. The following configurations of our approach were tested:
• the multi-GPU program version using 1, 2, 3 or 4 GPUs run in System 2, noted as MCCCn 1 to MCCCn 4,
• the single GPU program version run in System 1, noted as MCCC1 GTX260, and in System 3, noted as MCCC1 GTX670.
Our program using 4 GPUs runs more than 2 times faster than the fastest CPU-based program tested, 7z using 4 cores in System 3. In this case the CPU system is more efficient in terms of power and cost (0.44 versus 0.18 MB/s per Watt and 19.4 versus 5.8 MB/s per 100 mm2; silicon die area is considered as a cost measure), since the systems compared are produced in different technology processes, i.e. 65 nm and 32 nm. We considered the programs with comparable compression ratio results. The efficiency results are given in Tab. 4. For a comparison of parallel systems produced in comparable technology, the GTX 670 (28 nm) and the Intel i7 (32 nm) were considered. The efficiency measures are com
puted based on the MCCC1 version of our program, which uses at most 1 CPU core. In order to obtain an approximation of the maximum compression speed possible for a single core in System 3 we used the best result obtained for the CPU-based system, 27.8 MB/s (for the WinRAR program run in this system, which used 1/3 of the computing power of 4 cores). The best result for CPU compression speed in this system (7z) was also used for this comparison. The approximate power efficiency of the GPU solution is at least comparable with the result obtained for the CPU. The cost efficiency of the GPU solution is nearly 50% higher than that of its CPU counterpart (28.3 versus 19.4 MB/s per 100 mm2).

Program      System        Technology   Compression speed per cost [MB/s/100mm2]   Compression speed per power [MB/s/Watt]
MCCC1        GTX 670       28 nm        28.3                                        0.49
7zip v9.20   i7            32 nm        19.4                                        0.44
MCCCn        Tesla S1070   65 nm        5.8                                         0.18
7zip v9.13   2 Quad-Core   45 nm        6.8                                         0.18
Table 4: Processing efficiency results for different systems and programs; power efficiency computed using the TDP parameter
As far as the compression ratio is considered (comp. Tab. 3), the program 7z winning the ranking (the Best configuration in System 3) runs about 14 times slower than our compressor in the i7-GTX670 processing system (in this case the result is an approximate and fair comparison of GPU-only and CPU processing speed).
5.2. Workload balance
Gantt charts for different load balancing schemes and different input data are presented in Fig. 1-5. Fig. 1 presents the way subsequent data blocks are processed in the single GPU approach. FIO stands for the I/O operation thread, which saves compressed blocks to disk. In the single GPU model (Fig. 1), the GPU processing time is fully utilized and the CPU becomes idle before the next block of data is prepared on the GPU. In the case of lower data redundancy the tasks assigned to the CPU take more
time than the GPU processing. Based on this observation we decided to move more work to the GPU in order to balance the work of many GPUs and a single CPU (in the multi-GPU processing model). This modification resulted in almost full utilization of the GPU and CPU processing time for high and medium redundancy data (comp. Fig. 2), but still generated GPU processing gaps for low redundancy data (comp. Fig. 3). In order to fully utilize the computing power of the GPUs, in the last load balancing scheme all tasks (except Huffman tree generation) were scheduled to the GPUs. An example of a schedule obtained for this processing scheme is shown in Fig. 4 for high redundancy data and in Fig. 5 for low redundancy data.
5.3. GPU processing optimization
According to measurements collected with the NVIDIA Visual Profiler, the main cost of program computation (e.g. 57% of the GPU computation time in the GTX 670 program run for the mixed.tar file) arises in the match search procedure. This information focused our optimization effort on this kernel. In the first step of the tuning procedure, the size of the search window (within the suffix table), equal to the number of neighboring positions compared (the SCAN-POSITIONS parameter), was considered. It is responsible for the compression quality, which varied between 56.03% and 53.42% (for the mixed.tar file) depending on the value of the parameter, changed between 1 and 16. In order to obtain compression quality comparable with the other compressors' results we fixed the SCAN-POSITIONS parameter to 4. This way an upper constraint on the amount of work to be done within the kernel for every single entry in the suffix table was established. In the next tuning test we considered the division of the work between threads. According to the time results collected for different assignments of work to a single GPU thread, the optimal size of a computation grain was found to be equal to 1/4 of the SCAN-POSITIONS parameter. This work assignment minimized the computation time of the match search. The subsequent tuned parameter was the size of a block of GPU threads
for the match search kernel. The sizes of a one-dimensional block selected for testing varied between 64 and 512. The best performance was obtained for 128 threads in a block. This result, optimizing the processing speed, coincides with the maximum of the multiprocessor occupancy parameter. The requirements for data storage are equal to 21 registers and 8 bytes of shared memory per thread. In the case of Compute Capability 1.3 the theoretical multiprocessor occupancy for the kernel is equal to 63% and allows up to 20 warps to be assigned concurrently to a single multiprocessor. The number of registers requested and the limit of thread blocks per multiprocessor reduced the number of warps for the other block sizes. In the case of the GTX 670 the multiprocessor occupancy is equal to 100%. No additional parameter evaluations and tuning were performed for the GTX 670 and Tesla S1070. The bigger memory in the case of the S1070 and GTX 670 allows the size of the data block sent to the GPU for compression to be increased. This change could theoretically improve the compression ratio, but at the expense of the compression speed; in that case our program's results would possibly become incomparable with the other compressors' results due to different levels of the compression ratio. Another possible improvement would be to overlap computation on the GPU with data transfer between the CPU and the GPU. This was impossible in the GTX 260 system due to a memory size constraint. For a GPU memory of bigger size two additional data buffers could be created: one for input data and one for output data. This solution, thanks to asynchronous data transfer, could improve the compression speed, especially in a single GPU system, where a lot of data is transferred for further processing on the CPU. This strongly GPU-type-dependent modification has not so far been implemented. CPU computations are performed in parallel with data transfers between the computing nodes owing to the implementation of multiple CPU threads and data buffers in CPU memory.
5.4. Dependency between data redundancy and compression speed
The compression ratio depends on data redundancy because the more redundancy exists in the data, the more original items can be replaced by shorter symbols or dictionary entries, creating a smaller output stream. The speed of a compression program may also vary with a change of compressed data redundancy. This dependency can be caused by the following issues:
• a high number of long dictionary matches reduces the quantity of needed codewords (e.g. in Huffman coding),
• the size of a custom dictionary (e.g. BSTs, hash tables) is smaller for data of high redundancy because many repetitions in the input data stream are represented in the dictionary only once; the number of dictionary entries increases for low redundancy data,
• the search in a dictionary becomes faster in the case of highly redundant data thanks to the high probability of finding a match without searching through the whole dictionary.
However, this dependence seems to be more complex and is not visible in all programs. This is possible due to the variety of methods used in the programs and dependencies resulting from method features and peculiarities or hardware constraints. The compression speed of the programs for different input data types is presented in Fig. 8. 7-zip (in the fast configuration), WinRAR and GZIP1 (fast) run significantly faster for more redundant data, while GZIP9 (best) performs better for files with less redundant data. The compression speed of our program (especially its "full" GPU version) is more independent of the input data entropy. The compression ratios of the MCCC program and of nearly all other compressors run in the "fast" mode are comparable for all data types (see Fig. 9). In addition to the above results, we analyzed the dependency between the processing cost (the time of separate stages), the task assignment scheme (single GPU or "full" multi-GPU) and the model of the GPU used (GTX 260 or Tesla T10). Fig. 10 shows the time contribution of the program processing steps to the
overall MCCC processing time. This is an important piece of information for the evaluation and profiling of the program. Slightly different results are presented in Fig. 11. The figure contains a comparison of the time used by the program steps to compress 100 MB of data (different program run configurations are considered). The time used in the following parts of the program was measured:
• alloc — GPU memory allocation,
• mem C-G — CPU-GPU data transfer,
• clm — cleaning of the match table (GPU),
• buf16 — generation of a helper array for faster reading from global GPU memory (GPU),
• suffix — filling of the suffix tables (GPU),
• radix sort — radix sort (GPU),
• match — search for matches (GPU),
• analyze — analysis of matches (GPU, multi-GPU version),
• encode — data encoding (GPU): codeword generation for matches and literals (multi-GPU only),
• distance — generation of codeword suffixes for match distances (CPU, single GPU only),
• mem G-C — GPU-CPU data transfer,
• CPU-comp — sum of the time periods used for compression tasks performed on the CPU,
• CPU+ — time elapsed after the last block processing on the GPU (used for the last compression call, coding and saving of data).
The contribution of the different program steps to the whole program processing time depends on the data redundancy and the processing model (single/multi-GPU). Since the presented values are sums of the computation time for all the GPUs
used, the time cost of GPU computation increases for the multi-GPU program version. In this version the GPUs perform most of the compression work.
5.5. Decompression speed
In order to obtain complete information about the efficiency of our approach, we also tested the decompression speed of all the programs. Our program run in the same computing system (MCCC uses only the CPU for decompression) was faster than 7zip and GZIP but slower than WinRAR (see Tab. 3). This good result was possible due to the usage of the helper tables described in Sec. 3.5.
6. Summary
In this paper we presented a novel approach to lossless compression of arbitrary data on the GPU. According to the reported results we successfully parallelized computations, designing a method for a problem that is of a sequential nature. We created a parallel compression algorithm and the efficient program MCCC using three levels of parallelism: CPU thread parallelism, parallel GPU processing of a single block of compressed data, and parallel processing of multiple data blocks. The proposed method incorporates a novel idea for the searching of matches. The main contribution of the proposed method is a parallel (with the use of thousands of GPU threads) search for matches in an input data block whose size (in our case 16 MB) is constrained only by the program structures and the size of the GPU memory. To make the match search quality- and time-efficient, two stages, parallel suffix table sorting and then a parallel search for matches, are performed on the whole data block. A functional parallel approach for different data blocks was proposed and executed by a GPU-CPU pipeline. In this case the different tasks of the compression method, performed sequentially for a single block of data, were assigned to different processing nodes. Additionally, in a data parallel approach, concurrent compression of different blocks of data is performed in the case of the multi-GPU processing system. The work sharing between the CPU and GPUs was defined according to a load balancing tuning procedure performed off-line for data files of different
redundancy (compare Sec. 3.6). The use of a big dictionary allows a high compression ratio to be obtained, comparable with other widely used compression programs. In the case of another GPU-based general purpose data compression approach, proposed in [14], where only small dictionaries (constrained by the size of the GPU shared memory) are used, the level of compression is much lower. Our MCCC program achieves a good compression ratio, comparable to other widely known data compression programs, but works faster. Our program run in a system with 4 GPUs (Tesla) achieves a compression speed of 134 MB/s (megabytes per second) and is, according to our experiments, more than 2 times faster than the best of the competitive programs, 7zip run in a system with 4 processor cores. We also obtained a similar compression speed, 111 MB/s, for the second version of our program (for a single GPU and CPU) in the system with an Intel i7 processor and an Nvidia GTX 670. This time our solution was nearly 50% faster than the best CPU result. These results were obtained for program runs in technologically comparable computing nodes. Our program achieved better efficiency in terms of system cost and electric power consumption than the competing CPU solution (see Tab. 4). The results were obtained for the compression of a data file of medium redundancy (size 1.4 GB, entropy H1 = 7.07, redundancy R1 = 0.93). A wide comparison of the compression ratios obtained for different programs was presented in Tab. 3. It is hard to precisely compare the efficiency of various programs having no access to common test data files and using a restricted range of computing systems. In spite of this, we tried to compare our results with the recently published results for the general purpose LZSS method implementation on CUDA presented in [14]. The LZSS program works fast for all data types but achieves a significantly worse compression ratio than other compressors. We performed an additional experiment for our data files with the GZIP program mentioned as a reference in [14]. Contrary to the LZSS program (whose compression ratio is between 2 and 4 times worse than the reference), our approach obtained a compression ratio comparable to GZIP for all the data test files used (comp. Fig. 9). The results show that our program outperforms sequential GZIP in terms of
compression speed: the MCCC program run for the mixed content data file in System 3 is about 10 times faster. The final conclusion from this last comparison is unfortunately imprecise and general: the LZSS implementation is fast, but its compression ratio is far from the range achieved by the wide majority of compressors. Our program's advantage over other widely used compressors is that for the same level of compression ratio it is faster and, as the experiments showed, is able to deliver efficient solutions. Additionally, it is worth mentioning that our method is flexible and can be modified to generate archives consistent with the Deflate standard. This could be done without a big loss of compression speed but at the cost of the compression ratio, because Deflate uses a dictionary (a sliding window) limited to 32 KB and cannot encode match distances longer than the window size.
7. Acknowledgements
The research has been supported by grants No. 2011/01/B/ST6/07021 and N N519 643340 from the National Science Centre, Poland.
[1] Compression ratings. [on-line] http://compressionratings.com/, December 2009.
[2] J. Abel. www.data-compression.info: The data compression resource on the internet. [on-line] www.data-compression.info, December 2009.
[3] A. Aqrawi and A. Elster. Accelerating disk access using compression for large seismic datasets on modern GPU and CPU. Extended abstract no. 131, Para 2010 State of the Art in Scientific and Parallel Computing, Reykjavik, June 2010.
[4] A. Balevic. Parallel variable-length encoding on GPGPUs. In Proceedings of the 2009 International Conference on Parallel Processing, Euro-Par'09, pages 26-35, Berlin, Heidelberg, 2010. Springer-Verlag.
[5] I. Castano. High quality DXT compression using OpenCL for CUDA. Whitepaper. [on-line] developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/OpenCL/src/oclDXTCompression/doc/opencl_dxtc.pdf, March 2009.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. McGraw-Hill Science/Engineering/Math, July 2001.
[7] P. Deutsch. RFC 1951: DEFLATE compressed data format specification version 1.3. [on-line] http://www.ietf.org/rfc/rfc1951.txt, May 1996.
[8] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, January 1997.
[9] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098-1101, September 1952.
[10] M. Mahoney. Large text compression benchmark. [on-line] http://cs.fit.edu/~mmahoney/compression/text.html, July 2009.
[11] D. Merrill and A. Grimshaw. Revisiting sorting for GPGPU stream architectures. Technical Report CS2010-03, University of Virginia, Department of Computer Science, Charlottesville, VA, USA, 2010.
[12] M. A. O'Neil and M. Burtscher. Floating-point data compression at 75 Gb/s on a GPU. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pages 7:1-7:7, New York, NY, USA, 2011. ACM.
[13] A. Ozsoy and M. Swany. CULZSS: LZSS lossless data compression on CUDA. In Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER '11, pages 403-411, Washington, DC, USA, 2011. IEEE Computer Society.
[14] A. Ozsoy, M. Swany, and A. Chauhan. Pipelined parallel LZSS for streaming data compression on GPGPUs. In Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS '12, pages 37-44, Washington, DC, USA, 2012. IEEE Computer Society.
[15] D. Salomon. Data Compression: The Complete Reference. Springer-Verlag, Berlin/Heidelberg, Germany, 2007. With contributions by Giovanni Motta and David Bryant.
[16] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In 23rd IEEE International Parallel and Distributed Processing Symposium, pages 1-10, May 2009.
[17] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379-423, 623-656, July, October 1948.
[18] T. A. Welch. A technique for high-performance data compression. Computer, 17(6):8-19, 1984.
[19] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, May 1977.
[20] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530-536, September 1978.
Authors' biographies
Rafał Walkowiak is a lecturer at Poznań University of Technology with over 20 years of experience. He received an M.Sc. (1987) and a Ph.D. (1997) in Computer Science, both from Poznań University of Technology. His scientific interests include cutting and packing problems and the application of parallel processing in different areas.
Marek Chłopkowski is a Ph.D. student at Poznań University of Technology. In 2010 he received an M.Sc. in Computer Science. He is mainly interested in GPGPU applications in the fields of data compression and computational biology, including the redesign and adaptation of serial algorithms for parallel processing.
- We proposed a novel approach to parallel lossless data compression on the GPU.
- The efficiency of the approach is based on the parallelization of data processing on different hardware and logical levels.
- We obtained a better compression speed than popular data compression programs used on PCs (the compression ratio is sustained).
- Our approach outperforms other known GPU-based methods for lossless data compression.