A General Purpose Lossless Data Compression Method for GPU

Marek Chłopkowski*, Rafał Walkowiak

Institute of Computing Science, Poznań University of Technology, ul. Piotrowo 2, 60-965 Poznań
Abstract

The paper describes a parallel method for lossless data compression that uses graphical processing units (GPUs). Two commonly used approaches to data compression, statistical and dictionary-based, have been applied in our method. The reduction of compression time was possible due to the implementation of multi-level parallel methods that use a single GPU or a set of GPUs efficiently. The base of our method is a search for repetitions in data that is executed in parallel with the use of sorted suffix tables. On the second level of concurrency, operations on different data blocks (data file reading, match searching, coding, compression and data file writing) are performed in parallel. The methods proposed, supplying a comparable compression ratio, achieve a better compression speed than standard CPU-based compression tools used in personal computers. Experiments performed in technologically comparable systems showed that our approach is similar or even better in terms of power and cost efficiency.

Keywords: lossless data compression, GPU, parallel processing

* Corresponding author, email address: [email protected]

1. Introduction

Since data compression is as common for contemporary computer users as the presence of an efficient GPU in PC systems, it is easy to predict a need for widely accessible data compression tools using the GPU. Such a tool, based on carefully selected ideas known from computer science and the theory of data compression
in particular, implemented efficiently for the GPU, is proposed in this paper. The organization of the paper is as follows. The basics of data compression theory and the state of the art in the area of lossless compression methods for the GPU are briefly discussed in Section 2. Section 3 is devoted to a presentation of the proposed compression method. The description of the computational experiment and the obtained results are presented in Section 4 and Section 5. Section 6 finishes the presentation with conclusions concerning the results and perspectives of further development.
2. The lossless data compression fundamentals
Lossless data compression consists in the conversion of input data to output data of a smaller size by the reduction of data redundancy [17]. It can be achieved by assigning shorter codewords to more frequent symbols in the input data (statistical compression) or by replacing the longest possible substrings with their dictionary codes (dictionary compression). There are many, sometimes very sophisticated, compression methods which use one or both of these approaches. In this paper we focus on the details of the approaches that we used as sources of ideas for our method.
2.1. A statistical approach
Results of statistical observations can be incorporated into data compression methods. The statistics may be obtained by analyzing the input data (frequencies of symbols or words) or be given a priori from more general observations (e.g. letter, digram and trigram frequencies in a language). The idea behind this approach is to use a variable-sized prefix code and assign the shortest codewords to the most frequent substrings in the source, or to use arithmetic coding based on ranges of numbers with range widths proportional to substring probabilities. In our approach two prefix codes were used: the Huffman prefix code [9] and our variant of a general unary code. A prefix code, in general, is a variable-sized code where none of the codewords can be a prefix of another [15]. The codewords of a Huffman code are generated on a binary tree. The symbols are leaves and a
route to a leaf is coded by bits (1 for the left child and 0 for the right on every level of the tree). Leaves with the most frequent symbols are closer to the root of the tree and therefore have shorter codes. This code is usually generated by both the coder and the decoder. The coder counts the occurrences of each symbol and saves the obtained values into the output, so the decoder can build the same code tree. The decoder simply reads the input bit-by-bit and goes down the tree, choosing the right child every time it reads 0 and the left child when it reads 1. When the decoder reaches a leaf it reads an encoded symbol and starts again from the root. There is also an adaptive variant of Huffman coding, where the tree is rebuilt dynamically while processing subsequent input symbols (usually a better solution for on-line compression, when the input data cannot be processed twice). Unary coding is the simplest coding method, where the codeword for a value n is created by n "set bits" followed by one "reset bit" (e.g. 111110 represents the value 5). A general unary code is made of two parts: a unary step number and an n-bit binary value. The codewords are generated by the start-step-stop algorithm. The parameters have the following meaning: start is equal to the initial length of the binary part, step determines the increment of the binary part length, and stop determines the size of the longest codewords (in the case of the longest codewords, the step number is not followed by the reset bit) [15]. In the case when step is equal to 1, this code is quite similar to the Elias gamma code, which is constructed of l zeros followed by a one and a binary part of length l [15]. This code has the shortest codewords for the smallest coded values and is therefore suitable for coding short dictionary distances. A sample of a general unary code is presented in Table 1.
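To make the start-step-stop construction concrete, the following minimal sketch encodes a value under the (1, 1, 4) parameters of Table 1. It is an illustration with hypothetical names, not code from our implementation.

#include <cassert>
#include <cstdint>

// Encode 'value' with the start-step-stop code (start=1, step=1, stop=4)
// of Table 1. Bits are returned MSB-first in 'code', their count in 'length'.
void encodeStartStepStop(uint32_t value, uint32_t& code, uint32_t& length) {
    const uint32_t start = 1, step = 1, stop = 4;
    uint32_t k = start;       // binary-part width of the current group
    uint32_t base = 0;        // first value covered by the current group
    uint32_t ones = 0;        // leading '1' bits (step number minus one)
    while (base + (1u << k) <= value) {   // find the group holding 'value'
        base += 1u << k;
        k += step;
        ++ones;
        assert(k <= stop);                // (1,1,4) covers values 0..29 only
    }
    const uint32_t binaryPart = value - base;
    if (k < stop) {           // prefix: 'ones' set bits plus a terminating 0
        code = (((1u << ones) - 1) << (k + 1)) | binaryPart;
        length = ones + 1 + k;
    } else {                  // last group: no reset bit after the '1' run
        code = (((1u << ones) - 1) << k) | binaryPart;
        length = ones + k;
    }
}

For example, the value 6 falls into the third group and yields the 6-bit codeword 110 000, matching Table 1.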
Value   Step number   Codeword   Codeword length (bits)
0       1             0 0        2
1       1             0 1        2
2       2             10 00      4
3       2             10 01      4
4       2             10 10      4
5       2             10 11      4
6       3             110 000    6
7       3             110 001    6
...
13      3             110 111    6
14      4             111 0000   7
15      4             111 0001   7
...
29      4             111 1111   7

Table 1: An example of a general unary code for parameters (1, 1, 4).
2.2. A Dictionary compression
The general idea of dictionary compression consists in the replacement of input substrings with the shorter codewords of dictionary entries. In this way the size of the resulting data string is reduced. There are two main approaches to the construction of dictionaries, based on the works of Lempel and Ziv published in the late 70's [19, 20]. The first approach is based on the LZ77 sliding window algorithm [19]. In this method the dictionary is a linear part of the processed data. In general, two buffers are used: a search buffer (the dictionary) and a look-ahead buffer (the currently coded area of strings searched for in the dictionary). For every position in the look-ahead buffer the algorithm:
• searches (left-to-right) the search buffer for a symbol equal to the symbol at the current position in the look-ahead buffer,
• if the symbol is found, saves the distance from the end of the search buffer as an offset; then the subsequent symbols from the two buffers are compared and
the number of equal symbols is saved as the length of the match, • repeats the above operations in order to find the longest match in the dictionary,
• emits a codeword: a concatenation of the offset, the match length and the next symbol in the look-ahead buffer,
• moves the search window by the length (plus 1) of the match found. If no match has been found, the codeword (0, 0, symbol) is emitted (see the sketch below).
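The sketch below illustrates the LZ77 loop described above with a naive longest-match scan; the token layout follows the text, while the names are ours. Real implementations replace the inner scan with faster dictionary structures (comp. Sec. 2.4).

#include <cstddef>
#include <string>
#include <vector>

// (offset, length, next symbol) token, as described in the text above.
struct Lz77Token { std::size_t offset; std::size_t length; unsigned char next; };

std::vector<Lz77Token> lz77Compress(const std::string& data, std::size_t window) {
    std::vector<Lz77Token> out;
    std::size_t pos = 0;
    while (pos < data.size()) {
        const std::size_t winStart = pos > window ? pos - window : 0;
        std::size_t bestLen = 0, bestOff = 0;
        for (std::size_t cand = winStart; cand < pos; ++cand) {
            std::size_t len = 0;
            while (pos + len < data.size() - 1 &&          // keep one symbol for 'next'
                   data[cand + len] == data[pos + len])
                ++len;
            if (len > bestLen) {                           // keep the longest match
                bestLen = len;
                bestOff = pos - cand;                      // distance from the window end
            }
        }
        // (0, 0, symbol) is emitted when nothing matched, as in the text
        out.push_back({bestOff, bestLen,
                       static_cast<unsigned char>(data[pos + bestLen])});
        pos += bestLen + 1;                                // match length plus one symbol
    }
    return out;
}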
The second approach to dictionary-type compression is based on the LZ78 algorithm [20]. The dictionary is an algorithm-specific data structure that is successively expanded or rebuilt based on the input symbols processed so far. The dictionary can also be filled with some entries before the processing starts. The output contains the index of the matched dictionary entry and the subsequent symbol (or only an index, when the dictionary is filled with all possible symbols at the start). Precise information concerning the algorithm can be found in [15, 18, 20]. The basic differences between the two methods described above lie in the coding and in the limitations. In the first method distance-length pairs are encoded and the search area is limited to a given window size. Within the second method dictionary indexes are encoded and the search is limited by the dictionary size. We used the LZ77-based approach with parallelization of the match search process.
2.3. A combined statistical and dictionary approach
One of the best known compression algorithms, widely used in today's applications, is Deflate [7]. It is used in the Windows Zip and Unix Gzip coders, the HTTP protocol, and the PNG and PDF formats. This algorithm combines a statistical compression approach (Huffman coding) and a dictionary method (based on the LZ77 concept). The main enhancement of the LZ77 method incorporated into Deflate is that when a string is not found in the search buffer, the coder emits a symbol codeword; in the opposite case the coder emits a length-distance pair pointing to the match found. The coder uses two sets of Huffman codewords: the first for symbols and lengths, and the second for distances. It is unambiguous what type of codeword may occur at a given position: after a length codeword there can be only a distance codeword, and after a symbol codeword another symbol codeword or a length codeword may occur. A compressed file contains data blocks of variable length, which is written at the beginning of each block. Each block may contain uncompressed data, data compressed using predefined prefix codes (described in the RFC [7]), or data compressed using custom prefix codes. In the last case, the codes have to be written at the beginning of the output block using a specific
algorithm described in [7]. A combination of the statistical and dictionary approaches is used in our compression method. The program output consists of two types of fixed-size data blocks: uncompressed blocks and compressed blocks encoded with custom codes different from the ones used in Deflate.
2.4. A search for matches within a dictionary compression
One of the most difficult problems in data compression is an efficient search for dictionary matches. Every standard uses its own dictionary structure and specific search methods. These procedures can be as simple as a naive search (as in the original LZ77) or as sophisticated as in modern applications (e.g. 7-Zip), which use hash tables and binary search trees described in [15]. In our method the search for matches is based on the constrained suffix tables described in Sec. 3.1.
2.5. Related works
There are many compression algorithms which make use of GPUs. Most of them are highly specialized (e.g. for floating-point data [12]) or perform lossy compression for graphical applications [3, 5]. As far as we know, there are very few works aimed at general purpose lossless compression (independent of the data type) based on GPU use. There is an effective approach using statistical compression and variable-length encoding proposed by Balevic in [4], and very effective implementations of the LZSS compression algorithm by Ozsoy et al. in [13, 14]. These algorithms perform very well in terms of compression speed but are far less effective in terms of compression ratio. The first approach is fast since it performs only a statistical compression without using a dictionary. The second one uses a very small dictionary window: its size is up to 512 bytes due to GPU limitations. The compressed files generated by programs based on the above approaches are at least two times larger than the GZIP output. Section 5 shows the advantage of our approach over the GZIP program and, indirectly, also over the results described in [4, 13, 14].
3. The method
We designed a compression method based on the LZ77 and Deflate concepts [19, 7]. The adopted encoding is similar to the one used in the Deflate method (comp. Sec. 2.3). The symbol and length codes are based on the frequencies of their appearance in the output, but distances are encoded using our variant of a general unary code combined with a Huffman code (comp. Sec. 3.3).
3.1. A search for the dictionary match
In our approach we use a suffix table as the source of information about possible matches of coded items. The most important property of a suffix table is that it contains items (prefixes of input string suffixes) sorted in lexicographical order. Therefore it is easy to find a match of a string in the table using a divide and conquer method. It can be done in O(n log m) time, where n is the length of the search pattern and m is the number of suffix table entries [8]. A single suffix table covers all the data in a separately compressed data block, but the sorting of table items is limited to 4-byte prefixes only. The number of entries in the suffix table is equal to the size of an input data block (16 777 216; this block size was chosen based on the GTX 260 GPU memory size and the requirements of the method's data structures). A single entry of the table contains a prefix and its index in the input data. The table is generated using a parallel MSB radix sort algorithm [6, 16, 11], where the 4-byte prefix is the sorting key. It is a stable algorithm and therefore entries concerning equal prefixes are arranged in the table in the same order as they occur in the input data [6]. The goal of constructing a suffix table limited to 4-byte prefixes was to efficiently prepare a data structure suitable for finding the best possible match in the dictionary for every position of the compressed data (or for finding that there is no match for a given position). The search for a match is done in parallel by a set of threads on the GPU. Each thread is responsible for a single entry in the suffix table and performs the match search within a fixed range of the suffix table. The following simple algorithm is performed by each thread.
Algorithm 1
1. Set i = 1.
2. Take the suffix table item at the threadIndex position (different for each thread in a CUDA grid) as the base item.
3. Take the suffix table item at the threadIndex - i position as the offset item.
4. Compare the prefixes of the base item and the offset item.
5. If the prefixes are equal, compare subsequent bytes taken from the raw data (pointed to by the index field of a suffix table entry) until the first mismatch occurs. Compare the length of the match obtained to the length of the best match found so far for the base item and record the longest match (in a match table, at the index that corresponds to the position of the base item in the input data).
6. Repeat steps 3-5 for i = i + 1 until n items prior to the base item are checked.
7. Break the procedure when no match is found, writing a 0-length match value into the match table.
With an increase of the search position range (n) the compression improves. We have shown experimentally that n = 4 gives the best relation between compression speed and compression ratio. As a result of the above procedure we get the match table, which contains two pieces of information for every suffix in the input data: a match length and a match distance. The indexes in the match table correspond to data positions. A match at index 1024, having length = 8 and distance = 300, means that the 8 bytes following the 1024-th position are equal to the bytes at positions 724-731. It means that the string of 8 bytes can be replaced with the shorter codes of a length-distance pair in the output stream. Many of the matches found are not used because, when a match of length n is used, the next n positions in the match table are skipped. A condensed CUDA sketch of this kernel is given below.
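The sketch below condenses Algorithm 1 into a CUDA kernel. The structure layouts, names and the direct global-memory reads are our assumptions; the original kernel differs in such details (for instance, input data is stored in texture memory, comp. Sec. 3.6).

#define SCAN_POSITIONS 4

struct SuffixEntry { unsigned int prefix; unsigned int index; };
struct Match       { unsigned int length; unsigned int distance; };

__global__ void findMatches(const SuffixEntry* table,
                            const unsigned char* data,
                            unsigned int blockSize, Match* matches)
{
    unsigned int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= blockSize) return;
    const SuffixEntry base = table[t];            // step 2: the base item
    Match best = {0, 0};
    for (unsigned int i = 1; i <= SCAN_POSITIONS && i <= t; ++i) {
        const SuffixEntry off = table[t - i];     // step 3: the offset item
        if (off.prefix != base.prefix) break;     // sorted table: no equal prefix further back
        // the stable radix sort guarantees off.index < base.index here
        unsigned int len = 4;                     // equal 4-byte prefixes already match
        while (base.index + len < blockSize &&    // step 5: extend over raw data
               data[off.index + len] == data[base.index + len])
            ++len;
        if (len > best.length) {                  // record the longest match
            best.length   = len;
            best.distance = base.index - off.index;
        }
    }
    matches[base.index] = best;                   // step 7: 0-length if nothing matched
}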
3.2. Analysis of dictionary matches
In the single GPU processing model, the analysis of dictionary matches is a simple, serial process. The match table is analyzed, entry after entry, and if a match exists and is promising (i.e. the estimated encoded string length is shorter than the match length), it is used (the repetition counters for the length and for the unary coding step number are incremented) and the current position in the source data is incremented by the match length. Otherwise (i.e. when there is no match or the match is not promising), a literal counter stored in a count table is increased. The counter values stored in the count tables are prepared as the input to the Huffman tree generation algorithm. The unary coding step of every match distance is computed on the GPU at the time a match is analyzed. In the multi-GPU processing strategy the analysis described above takes place on the GPUs and is performed in parallel for different parts of the match table. Each thread analyses 1024 positions of the match table. The procedure can give slightly worse compression results because of the fixed ranges of data analyzed separately. However, in most cases, the deterioration of the compression ratio is smaller than 1%, which, as the experiments show, is a good tradeoff between compression speed and quality.
3.3. Data encoding
The encoding used in the proposed approach is based on the Deflate concept (compare Sec. 2.3) with some modifications. To encode data the program generates Huffman codewords based on count tables. Two tables of counters are used: one common for match lengths and literals, and a second one for the step numbers of the distance code. Distance encoding is performed using a modified general unary code (described in Sec. 2.1). We used a Huffman code to code the step number. Instead of using the stop parameter, we assumed that the binary value part of the modified general unary code will not be longer than 24 bits (the size of an encoded data block is 16 MB). We counted all repetitions of the step number values (24 possible values) and generated Huffman codewords for them. In this way we got the shortest codewords for the most common step numbers.
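The sketch below shows one plausible reading of this distance code: the step number is taken as the position of the distance's most significant bit (24 possible values for a 16 MB block) and selects a Huffman codeword, while the remaining low-order bits form the binary part. The names huffCode, huffLen and the putBits sink are hypothetical.

#include <cstdint>

// Emit one distance codeword: Huffman-coded step number plus raw binary part.
template <typename PutBits>
void emitDistance(uint32_t distance,               // 1 <= distance < 2^24
                  const uint32_t huffCode[24], const uint8_t huffLen[24],
                  PutBits&& putBits)
{
    uint32_t step = 0;                             // index of the highest set bit
    while ((distance >> (step + 1)) != 0) ++step;
    putBits(huffCode[step], huffLen[step]);        // Huffman codeword for the step number
    if (step > 0)                                  // binary part: 'step' low-order bits
        putBits(distance & ((1u << step) - 1), step);
}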
Data encoding is performed with the use of the following simple scheme. According to the meaning of the coded item, the suitable action is taken:
• for a literal, its Huffman codeword is written to the output stream,
• for a pair of a match length and a distance:
– a match length codeword (Huffman code) and
– a distance codeword (generated according to the unary coding algorithm) that consists of two parts:
∗ a Huffman codeword for the step number and
∗ a binary part
are written to the output stream.
In the multi-GPU processing strategy, distance codewords are stored in the match table for later use in the output stream generation (to reduce CPU computation).
3.4. Output stream generation
The output stream consists of data blocks containing:
• a 4-byte block format identifier:
– a block type identifier (possible values are 0 - uncompressed, 1 - compressed, 2 - end of file) — 2 bits,
– the block size in bytes — 30 bits,
• compressed/uncompressed data.
Blocks of type 1 contain:
• Huffman code tables (i.e. 256+252 16-bit normalized counts for literals and match lengths, 24 16-bit normalized counts for unary coding steps),
• a sequence of literal codewords and pairs of codes for a length and a distance of a match,
• an end-of-block symbol (the codeword for match length = 1).
A minimal sketch of the block header packing is given below.
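The 4-byte block format identifier described above can be packed as in the following sketch; the field order within the word is our assumption.

#include <cstdint>

// 2-bit block type (0 = uncompressed, 1 = compressed, 2 = end of file)
// packed together with a 30-bit block size in bytes.
enum BlockType : uint32_t { RAW = 0, COMPRESSED = 1, END_OF_FILE = 2 };

inline uint32_t packHeader(BlockType type, uint32_t sizeBytes) {
    return (static_cast<uint32_t>(type) << 30) | (sizeBytes & 0x3FFFFFFFu);
}

inline BlockType headerType(uint32_t h) { return static_cast<BlockType>(h >> 30); }
inline uint32_t  headerSize(uint32_t h) { return h & 0x3FFFFFFFu; }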
In the single GPU implementation the CPU analyses the match table and encodes the data into an output stream. In the multi-GPU implementation, encoding and output stream generation are performed on the GPU; in this case the CPU only writes the compressed data blocks to the output.
3.5. Decompression
In our implementation the decompression (i.e. decoding) process is performed on the CPU. Bit-by-bit input processing and decoding Huffman codewords by going down a tree structure (see Sec. 2.1) can be very time consuming. To speed up the decoding process we used code tables (referred to further as decompression tables) inspired by [15]. The decompression table contains 65536 entries. For codewords shorter than 16 bits, each table entry contains information about a codeword value (symbol) and its length. For codewords longer than 16 bits, the table contains a pointer to the last Huffman node reached by going down the tree according to the 16 bits read (i.e. the node at the 16th level of the tree). Decoding of a single Huffman codeword is done as follows:
1. Without moving the input stream position, read ahead a 16-bit number and store it as addr.
2. Read the decompression table entry at the addr position.
3. If the entry's codeword length is less than or equal to 16, return the codeword value and move the input stream position by the codeword length.
4. If the entry's codeword length is greater than 16:
• move the input stream position by 16,
• read (from the decompression table) the address of the Huffman tree node labeled by the 16-bit part of the codeword,
• starting from this tree node perform bit-by-bit decompression (by reading subsequent input bits and going down the tree),
• when a leaf of the tree is reached, return the codeword's value.
A sketch of this decoder is given below.
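The following sketch shows the decoding procedure above; the structure names and the MSB-first bit order are our assumptions, and the input buffer is assumed to be padded so the 16-bit look-ahead never overruns.

#include <cstddef>
#include <cstdint>

struct Node { const Node* child[2]; uint16_t symbol; bool leaf; };

struct TableEntry {
    uint16_t symbol;      // decoded value, valid when length <= 16
    uint8_t  length;      // codeword length; a value > 16 marks the slow path
    const Node* node;     // tree node at depth 16 for long codewords
};

struct BitReader {
    const uint8_t* data;  // padded input buffer
    std::size_t bitPos = 0;
    int bit(std::size_t i) const { return (data[i >> 3] >> (7 - (i & 7))) & 1; }
    uint16_t peek16() const {               // next 16 bits without advancing
        uint16_t v = 0;
        for (int i = 0; i < 16; ++i) v = (v << 1) | bit(bitPos + i);
        return v;
    }
    void skip(unsigned n) { bitPos += n; }
    int next() { return bit(bitPos++); }
};

uint16_t decodeSymbol(BitReader& in, const TableEntry table[65536]) {
    const TableEntry& e = table[in.peek16()];   // steps 1-2: look-ahead index
    if (e.length <= 16) {                       // step 3: short codeword resolved
        in.skip(e.length);
        return e.symbol;
    }
    in.skip(16);                                // step 4: consume the 16-bit part
    const Node* n = e.node;                     // resume at the depth-16 node
    while (!n->leaf)                            // bit-by-bit walk to a leaf
        n = n->child[in.next()];
    return n->symbol;
}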
Step 4 occurs very rarely because the longest codewords are assigned to the least frequent symbols. Many entries of the decompression table are equal because a codeword shorter than 16 bits has to be followed by all possible combinations of subsequent bits, which are irrelevant for this codeword but are read as part of the 16-bit value. For example, the information corresponding to the codeword 0011 is stored in all table entries in the range from 0011000000000000 to 0011111111111111. In this way a codeword can be decoded correctly and fast by simply reading the table entry (regardless of the bits following it). The use of code tables results in decompression 3 times faster than an algorithm performing a full Huffman tree walk.
3.6. Parallel processing issues
The main concept of the method is based on the parallel processing of multiple data blocks. Three levels of parallelism are incorporated in the method:
• a parallel compression within a data block - performed by a grid of thread blocks on each GPU (on different data blocks in a multi-GPU system),
• a CPU computation - a part of the compression job performed on the CPU, and
• transfers of data blocks performed in parallel with CPU computation (transfers between: disk and computer memory, computer memory and GPU memory, GPU memory and computer memory, computer memory and disk).
The processing strategy is different for single and multi-GPU computer systems, and different task assignments to the CPU and GPU take place in the two cases.
The work sharing in the single GPU processing strategy
The following stages of computation are performed in the sequence given below for a single data block (the processing units involved in a task are given in brackets):
1. Loading a data block from a source file [CPU].
2. Transfer of the data block to the GPU (data is stored in texture memory).
3. Creating and sorting a suffix table [GPU].
4. Finding matches in the suffix table [GPU].
5. Transfer of matches to CPU memory.
6. Analysis of matches [CPU].
7. Counting repetitions of items for coding purposes [CPU].
8. Generation of Huffman trees [CPU].
9. Writing counters of item repetitions to an output buffer [CPU].
10. Writing encoded data to the output buffer [CPU].
11. Writing data from the buffer to an output file [CPU].
With the work distribution proposed above we achieved, in general, a good balance of the work on the CPU and GPU. Figure 1 presents the time periods of subsequent data block processing in the single GPU approach. The GPU is busy during the whole period of computation. In general, the work performed by the CPU for a single data block takes less time than the concurrently performed GPU processing and is completed before the next task (subsequent CPU work) is prepared by the GPU. Sometimes (especially in the case of low redundancy data) the CPU tasks performed for a single input data block take more time than the processing of the next block by the GPU (comp. Fig. 2-3). This is the reason why in the final multi-GPU processing strategy we moved more tasks to the GPU. In this way the CPU computation bottleneck disappeared for all types of compressed data, when the only tasks performed on the CPU were the generation of Huffman trees and the writing of coded information to a disk file. An example of a schedule chart with processing times for a program implemented according to this strategy is presented in Fig. 4 (in the figure the GPUs used are numbered with subsequent integer numbers). CPU processing in the program is based on four types of CPU threads:
• a main thread,
• a GPU compression manager (one thread for every GPU),
• a CPU compression manager,
• a thread for the task of writing output data to a disk.
The main program thread is responsible for running the other threads and starting/stopping the whole compression process. A GPU compression manager thread loads input data, manages GPU-CPU data transfers and controls the execution of GPU subprograms. In the "full" GPU program version it also performs the task of Huffman tree generation. A CPU compression manager is responsible for the CPU tasks performed in the single GPU program version. In the multi-GPU program, the compression managing thread is used only to load data from a file, serve it to the GPU thread, retrieve a compressed data block from GPU memory and pass it to the output thread. The proposed sharing of the work between threads allows us to parallelize data processing efficiently; a minimal sketch of such a pipeline is given below.
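The following sketch, using std::thread in place of the Pthreads/WinAPI threads of the implementation, only illustrates how reading, compression and writing overlap on different data blocks; the block count and the 16 MB size stand in for the real file reader.

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Block { std::vector<unsigned char> bytes; bool last = false; };

class BlockQueue {                      // small producer/consumer queue
    std::queue<Block> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Block b) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(b)); }
        cv_.notify_one();
    }
    Block pop() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty(); });
        Block b = std::move(q_.front());
        q_.pop();
        return b;
    }
};

int main() {
    BlockQueue toGpu, toDisk;
    std::thread reader([&] {            // loads input blocks from a file
        for (int i = 0; i < 8; ++i)     // stand-in for the real file reader
            toGpu.push(Block{std::vector<unsigned char>(16u << 20), i == 7});
    });
    std::thread manager([&] {           // GPU compression manager
        for (;;) {
            Block b = toGpu.pop();
            bool last = b.last;
            // transfer to GPU, sort suffix table, find matches, encode ...
            toDisk.push(std::move(b));
            if (last) break;
        }
    });
    std::thread writer([&] {            // output thread: saves compressed blocks
        for (;;) {
            Block b = toDisk.pop();
            // fwrite(b.bytes.data(), 1, b.bytes.size(), out);
            if (b.last) break;
        }
    });
    reader.join(); manager.join(); writer.join();
}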
Figure 1: Periods of data block processing in a single GPU program
4. Computational experiment
An evaluation of the efficiency of the proposed approach was performed in three different computer systems, with popular compression programs and different types of input data. The quality results (speed and compression ratio) obtained for our program were compared with the results of popular compression programs run in different configurations, noted as follows:
1. WinRAR v3.90 & v5.01 (x64) shareware (www.winrar.pl):
Figure 2: Periods of data block processing for high redundancy data in a multi-GPU program (4 GPUs, with data encoding on CPU)
• WinRAR-Best: run configuration: Best — best compression ratio, • WinRAR-Fast: run configuration: Fast — medium processing time, • WinRAR-Fastest: run configuration: Fastest — shortest processing time.
2. 7-zip for Windows v4.65 & v9.20 (x64, www.7-zip.org) — using LZMA & LZMA2 algorithm respectively: • 7z-Best — run configuration: Ultra — best compression ratio, • 7z-Fast — run configuration: Fastest — shortest processing time. 3. 7-zip for Linux v9.13 (x86, www.7-zip.org) — LZMA2 algorithm in the fastest compression mode: • 7z-Linux-4 — 4 processor cores used — command line arguments: -mx0 -mmt=4 -m0=LZMA2,
• 7z-Linux-8 — 8 processor cores used — command line arguments: -mx0 -mmt=8 -m0=LZMA2.
4. GZIP v1.5 running on Linux:
• GZIP9 — run configuration: best (-9) — best compression ratio,
• GZIP1 — run configuration: fast (-1) — shortest processing time.
Figure 3: Periods of data block processing for low redundancy data in a multi-GPU program (4 GPUs, with data encoding on CPU)
The Linux version of 7-zip was also tested with the LZMA algorithm but the obtained processing speed was very close to the Windows version (LZMA uses at most 2 cores). In our experiment we used three hardware platforms:
System 1: a desktop computer running Windows Vista 64-bit with an Intel Core2 Duo E8400 CPU @ 3 GHz, 6 GB of RAM @ 800 MHz and an NVIDIA GeForce GTX 260 GPU with 896 MB of global memory (driver v195.62),
System 2: a server running Linux 2.6.27 with 2 Quad-Core Intel Xeon E5405 CPUs @ 2 GHz, 16 GB of RAM @ 667 MHz and a 4-GPU NVIDIA Tesla S1070 with 16 GB of global memory (driver v3.20), and
System 3: a desktop computer running Windows 7 x64 with an Intel Core i7-3820 @ 3.7 GHz, 16 GB of RAM @ 1866 MHz and an NVIDIA GeForce GTX 670 with 2 GB of global memory.
Under the Windows environment our program was compiled using MS Visual Studio C++ 2008 Express Edition and CUDA versions 2.3 (GTX 260) and 4.1 (GTX 670). Under Linux, GCC 4.3.2 and CUDA 3.1 were used. Parallel multithreaded processing was implemented using the Pthread programming model under Linux and WinAPI threading under Windows. To analyze execution times and compare program results we used four test files (described in Tab. 2) evaluated in terms of data redundancy.
Figure 4: Periods of data block processing for high redundancy data in a multi-GPU program (4 GPUs, final multi GPU processing strategy)
We used two measures (defined in [17]) characterizing data eligibility for compression: data entropy and data redundancy. Entropy is a measure of uncertainty or the "amount of surprise" in data and is calculated as:

H = -\sum_{i=1}^{n} P_i \log_2 P_i    (1)

where P_i is the probability of occurrence of the i-th symbol of the alphabet. Redundancy is the difference between the maximum theoretical entropy of the data and its actual entropy. It is calculated as:

R = -\sum_{i=1}^{n} P \log_2 P - \left( -\sum_{i=1}^{n} P_i \log_2 P_i \right) = \log_2 n + \sum_{i=1}^{n} P_i \log_2 P_i    (2)
where P represents the symbol probability yielding the highest entropy (P = 1/n). In order to analyze the correlation between the source redundancy and program execution time (global and partial) we used three files with significantly different entropy:
• a low redundancy file (test.LR — a file generated using a pseudo-random number generator),
• a medium redundancy file (NFS.tar — a TAR archive),
Figure 5: Periods of data block processing for low redundancy data in a multi-GPU program (4 GPUs, "full GPU" approach)
• a high redundancy file (RFC.tar — a TAR archive).
For each file we calculated the entropy and redundancy (in the binary system — logarithm of base 2) for 1-, 2- and 3-byte symbols by counting occurrences of each symbol and calculating the values of H and R according to the formulas given above. The calculated values are labeled H1, H2, H3 and R1, R2, R3. The maximum possible entropy equals the number of bits needed to represent the symbol (8, 16 and 24 bits). The low redundancy file was created by generating random values (0-255) and writing them to an output stream. The medium redundancy file was created as a TAR archive made of computer game demo files. It contains various file types: text (configuration, messages, etc.), binary (executables, libraries) and game specific data files (car models, sounds, etc.). The high redundancy file was created as a TAR archive containing text files (*.txt) — Request for Comments (RFC) documents. The files used were downloaded from the FTP server ftp://ftp.cyf-kr.edu.pl/pub/mirror/rfc/. In order to test the programs from a user point of view we also prepared a big TAR archive (1409 MB), mixed.tar, with various content types. The file contains different parts (files) that were downloaded mainly from the Compression Ratings project website [1]:
Figure 6: Compression ratios for different programs, compression of the data file of mixed content; CPU results from Systems 2 & 3
1. Text (crtext1 100m) — based on archival files from Project Gutenberg.
2. Audio (cr audio1.tar, extended version) — audio files in CD quality (FLAC format).
3. Knowledge representation (enwik8) — the first 10^8 bytes of a Wikipedia XML archive file [10].
4. Database (freedb100m) — the first 10^8 bytes of a text image of the free CD database.
5. Images (img1.tar) — high quality uncompressed images in the PPM format.
6. Medical files (lukas.tar) — a set of two-dimensional 8-bit RTG pictures [2].
7. Source code (src1.tar) — contents of the subdirectory gcc-4.2.0 of the GNU Compiler Collection (GCC) package.
8. Application files (app3.tar) — PortableApps.com Suite 1.1.
9. Game files (NFS.tar).
10. RFC documents (rfc.tar).
11. High entropy data (test.LR).
Detailed information about the size, redundancy, the best acquired compression ratio and the physical position of the files in the archive is presented in Tab. 2.
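As an illustration of Eqs. (1) and (2), the sketch below computes H1 and R1 for 1-byte symbols (n = 256); H2 and H3 follow the same scheme with 2- and 3-byte symbols. It reflects our reading of the measurement procedure, not the authors' code.

#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

void entropyRedundancy(const std::vector<uint8_t>& data) {
    const int n = 256;
    std::vector<uint64_t> count(n, 0);
    for (uint8_t b : data) ++count[b];          // symbol histogram
    double h = 0.0;
    for (int s = 0; s < n; ++s) {
        if (count[s] == 0) continue;            // 0 * log 0 treated as 0
        double p = static_cast<double>(count[s]) / data.size();
        h -= p * std::log2(p);                  // Eq. (1)
    }
    double r = std::log2(static_cast<double>(n)) - h;   // Eq. (2): H_max - H
    std::printf("H1 = %.2f bits, R1 = %.2f bits\n", h, r);
}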
Figure 7: Compression speed for different programs, compression of the data file of mixed content; CPU results from Systems 2 & 3
Figure 8: Compression speed for data files of different redundancy; CPU results from Systems 2 & 3
This archive (mixed.tar) was used to tune the parameters of the MCCC program and to compare our program with WinRAR, 7-Zip and GZIP. We also compressed every file with all the programs in their best compression configurations and measured the best compression ratio, labeled CRbest in Tab. 2. During the computational tests we measured the compression time and the compression ratio for each test data file and compression program. The compression ratio [15] CR and the compression speed CS were computed according to the
Figure 9: Compression ratio for data files of different redundancy; CPU results from Systems 2 & 3

File            Size     Position in TAR   H1     R1     H2     R2     H3     R3     CRbest
mixed.tar       1409 MB  n.d.              7.07   0.93   13.02  2.98   18.09  5.91   47.80%
crtext1 100m    95 MB    0%-7%             4.55   3.45   8.10   7.90   11.00  13.00  24.55%
cr audio1.tar   118 MB   7%-15%            7.99   0.01   15.96  0.04   23.59  0.41   99.99%
enwik8          95 MB    15%-22%           5.08   2.92   8.97   7.03   12.05  11.95  24.75%
freedb100m      95 MB    22%-29%           5.57   2.43   9.07   6.93   11.45  12.55  20.70%
img1.tar        87 MB    29%-35%           6.80   1.20   12.65  3.35   16.45  7.55   47.29%
lukas.tar       50 MB    35%-38%           7.13   0.87   9.98   6.02   12.71  11.29  36.62%
src1.tar        96 MB    38%-45%           5.61   2.39   9.73   6.27   12.74  11.26  13.27%
app3.tar        95 MB    45%-52%           7.04   0.96   12.83  3.17   17.25  6.75   32.79%
NFS.tar         138 MB   52%-62%           7.26   0.74   13.92  2.08   19.88  4.12   56.04%
rfc.tar         282 MB   62%-82%           4.72   3.28   8.13   7.87   10.67  13.33  14.30%
test.LR         256 MB   82%-100%          8.00   0.00   16.00  0.00   23.86  0.14   100.00%

Table 2: Details of the mixed content test file and the data file parameters; NFS.tar, rfc.tar, test.LR and mixed.tar were also used separately
following formulas:

CR = \frac{S_{out}}{S_{in}} \cdot 100\%

where S_{out} is the compressed stream size and S_{in} is the input stream size (before compression); the lower the CR value, the more efficient the compression, and

CS = \frac{S_{in}}{T}

where T is the overall program execution time. The value of CS is expressed in megabytes per second (MB/s). In the case of the MCCC program the computation time of subsequent program tasks was measured within the program. The compression speed for the compared programs was calculated based on the information displayed by the program's in
Figure 10: Contribution of the processing time of separate MCCC program stages to the overall file compression time, normalized; results for different data redundancy: low - LR, medium - MR, high - HR; systems/program versions: GTX - System 1/single GPU, Tesla - System 2/multi GPU
terface. The total execution time was the time displayed right before the end of a compression. In general, the last step of the data compression process is writing the output data to a disk file. Saving of subsequent data blocks is done in parallel with computation, but sometimes this process may lengthen the program run time due to an observed disk writing bottleneck. The higher the compression speed, the more important the HDD writing speed becomes for its precise measurement. To remove this interference we used a "null" operating system device as the output target. For all decompression speed tests we also used a "null" destination.
5. Results
The presentation of the results is divided into 5 parts devoted to particular issues concerning computing efficiency. In the first subsection we compare the efficiency of different data compression programs. The following section gives a deeper view of the load balance of parallel processing for different scheduling schemes in heterogeneous CPU-GPU systems. The dependence between
Figure 11: Processing time of separate MCCC program stages needed for compression of 100MB data; results for different data files and systems/program versions; GTX - System 1
the redundancy of processed data and the compression speed is discussed in the third subsection. The fourth section is devoted to GPU processing optimization issues. At the end, the computing speed results for the decompression process are discussed.
5.1. Mixed input data processing results
In Sec. 3.6 we proposed different approaches to balance the load of the computing systems used. The programs differ in the amount of work performed by the CPU and GPU. In the multi-GPU approach the CPU becomes better utilized when the program uses more GPUs. Because of a better workload balance, the single GPU version of the program is significantly faster than the multi-GPU version using one GPU. In the compression ratio comparison (see Fig. 6) we show results for both variants of our program: single-GPU and multi-GPU using 1 GPU (System 1). The compression ratio is independent of the number of GPUs used but may differ for different task assignments between the CPU and GPU. The compression ratios of our programs and of the other programs run in the "fast" mode are comparable. Depending on the data redundancy, the compression quality of our programs can be slightly better or worse (the latter in the case of very highly redundant data).
Program          System   Compression speed [MB/s]   Decompression speed [MB/s]   Compression ratio
7z-Best          1        2.0                        26.6                         47.80%
WinRAR-Best      1        4.0                        9.2                          48.16%
7z-Best          3        5.7                        79.6                         47.55%
GZIP9            2        8.5                        92.7                         55.88%
7z-Fast          1        9.4                        23.5                         54.52%
7z-Linux-4       2        15.5                       25.2                         54.29%
WinRAR-Best      3        17.6                       117.4                        49.87%
GZIP1            2        24.7                       85.9                         59.18%
WinRAR-Fast      1        26.1                       56.4                         57.20%
7z-Linux-8       2        29.4                       25.2                         54.29%
MCCCn 1          2        36.9                       54.2                         55.13%
WinRAR-Fast      3        37.1                       134.2                        51.62%
MCCC1 GTX260     1        47.9                       57.5                         54.51%
7z-Fast          3        57.0                       62.6                         54.22%
MCCCn 2          2        72.7                       54.2                         55.13%
WinRAR-Fastest   3        92.7                       120.4                        57.29%
MCCCn 3          2        106.2                      54.2                         55.13%
MCCC1 GTX670     3        111.0                      95.9                         54.51%
MCCCn 4          2        134.8                      54.2                         55.13%

Table 3: Compression test results for the data file of mixed content
An overall compression speed ranking is presented in Tab. 3 and Fig. 7. The following configurations of our approach were tested:
• the multi-GPU program version using 1, 2, 3 or 4 GPUs run in System 2, noted as MCCCn 1 to MCCCn 4,
• the single GPU program version run in System 1, noted as MCCC1 GTX260, and in System 3, noted as MCCC1 GTX670.
Our program using 4 GPUs runs more than 2 times faster than the fastest CPU-based program tested, 7z using 4 cores in System 3. In this case the CPU system is more efficient in terms of power and cost (0.44 versus 0.18 MB/s per Watt and 19.4 versus 5.8 MB/s per 100 mm2; silicon die area is considered as a cost measure), since the systems compared are produced in different technology processes, i.e. 65 nm and 32 nm. We considered the programs with comparable compression ratio results. The efficiency results are given in Tab. 4. For a comparison of parallel systems produced in comparable technology, the GTX 670 (28 nm) and the Intel i7 (32 nm) were considered. The efficiency measures are com
puted based on the MCCC1 version of our program, which uses at most 1 CPU core. In order to obtain an approximation of the maximum compression speed possible for a single core in System 3 we used the best result obtained for the CPU-based system, 27.8 MB/s (for the WinRAR program run in this system, which used 1/3 of the computing power of 4 cores). The best result for CPU compression speed in this system (7z) was also used for this comparison. The approximate power efficiency of the GPU solution is at least comparable with the result obtained for the CPU. The cost efficiency of the GPU solution is nearly 50% higher than that of its CPU counterpart (28.3 versus 19.4 MB/s per 100 mm2).

Program      System        Technology   Compression speed per cost [MB/s/100mm2]   Compression speed per power [MB/s/Watt]
MCCC1        GTX 670       28 nm        28.3                                        0.49
7zip v9.20   i7            32 nm        19.4                                        0.44
MCCCn        Tesla S1070   65 nm        5.8                                         0.18
7zip v9.13   2 Quad-Core   45 nm        6.8                                         0.18
Table 4: Processing efficiency results for different systems and programs; power efficiency computed using the TDP parameter
As far as the compression ratio is considered (comp. Tab. 3), the program 7z winning the ranking (the Best configuration in System 3) runs about 14 times slower than our compressor in the i7-GTX670 processing system (in this case the result is an approximate and fair comparison of GPU-only and CPU processing speed).
5.2. Workload balance
Gantt charts for different load balancing schemes and different input data are presented in Fig. 1-5. Fig. 1 presents the way subsequent data blocks are processed in the single GPU approach. FIO stands for the I/O operation thread, which saves compressed blocks to disk. In the single GPU model (Fig. 1), the GPU processing time is fully utilized and the CPU becomes idle before the next block of data is prepared on the GPU. In the case of lower data redundancy the tasks assigned to the CPU take more
time than the GPU processing. Based on this observation we decided to move more work to the GPU in order to balance the work of many GPUs and a single CPU (in the multi-GPU processing model). This modification resulted in almost full utilization of the GPU and CPU processing time for high and medium redundancy data (comp. Fig. 2), but still generated GPU processing gaps for low redundancy data (comp. Fig. 3). In order to fully utilize the computing power of the GPUs, in the last load balancing scheme all tasks (except Huffman tree generation) were scheduled to the GPUs. An example of a schedule obtained for this processing scheme is shown in Fig. 4 for high redundancy data and in Fig. 5 for low redundancy data.
5.3. GPU processing optimization
According to measurements collected with the NVIDIA Visual Profiler, the main cost of program computation (e.g. 57% of the GPU computation time in the GTX 670 program run for the mixed.tar file) arises in the match search procedure. This information focused our optimization effort on this kernel. In the first step of the tuning procedure, the size of the search window (within the suffix table), equal to the number of neighboring positions compared (the SCAN-POSITIONS parameter), was considered. It is responsible for the compression quality, which varied between 56.03% and 53.42% (for the mixed.tar file) depending on the value of the parameter, changed between 1 and 16. In order to obtain compression quality comparable with the other compressors' results we fixed the SCAN-POSITIONS parameter to 4. This way an upper constraint on the amount of work to be done within the kernel for every single entry in the suffix table was established. In the next tuning test we considered the division of the work between threads. According to the time results collected for different assignments of work to a single GPU thread, the optimal size of a computation grain was found to be equal to 1/4 of the SCAN-POSITIONS parameter. This work assignment minimized the computation time of the match search. The subsequent tuned parameter was the size of a block of GPU threads
for the match search kernel. The sizes of a one-dimensional block selected for testing varied between 64 and 512. The best performance was obtained for 128 threads in a block. This result, optimizing the processing speed, coincides with the maximum of the multiprocessor occupancy parameter. The requirements for data storage are equal to 21 registers and 8 bytes of shared memory per thread. In the case of Compute Capability 1.3 the theoretical multiprocessor occupancy for the kernel is equal to 63% and allows up to 20 warps to be assigned concurrently to a single multiprocessor. The number of registers requested and the limit of thread blocks per multiprocessor reduced the number of warps for the other block sizes. In the case of the GTX 670 the multiprocessor occupancy is equal to 100%. No additional parameter evaluations and tuning were performed for the GTX 670 and Tesla S1070. The bigger memory in the case of the S1070 and GTX 670 allows the size of the data block sent to the GPU for compression to be increased. This change could theoretically improve the compression ratio, but at the expense of the compression speed; in that case our program's results would possibly become incomparable with the other compressors' results due to different levels of the compression ratio. Another possible improvement would be to overlap computation on the GPU with data transfer between the CPU and the GPU. This was impossible in the GTX 260 system due to a memory size constraint. For a GPU memory of bigger size two additional data buffers could be created: one for input data and one for output data. This solution, thanks to asynchronous data transfer, could improve the compression speed, especially in a single GPU system, where a lot of data is transferred for further processing on the CPU. This strongly GPU-type-dependent modification has not so far been implemented. CPU computations are performed in parallel with data transfers between the computing nodes owing to the implementation of multiple CPU threads and data buffers in CPU memory.
5.4. Dependency between data redundancy and compression speed
The compression ratio depends on data redundancy because the more redundancy exists in the data, the more original items can be replaced by shorter symbols or dictionary entries, creating a smaller output stream. The speed of a compression program may also vary with a change of compressed data redundancy. This dependency can be caused by the following issues:
• a high number of long dictionary matches reduces the quantity of needed codewords (e.g. in Huffman coding),
• the size of a custom dictionary (e.g. BSTs, hash tables) is smaller for data of high redundancy because many repetitions in the input data stream are represented in the dictionary only once; the number of dictionary entries increases for low redundancy data,
• the search in a dictionary becomes faster in the case of highly redundant data thanks to the high probability of finding a match without searching through the whole dictionary.
However, this dependence seems to be more complex and is not visible in all programs. This is possible due to the variety of methods used in the programs and dependencies resulting from method features and peculiarities or hardware constraints. The compression speed of the programs for different input data types is presented in Fig. 8. 7-zip (in the fast configuration), WinRAR and GZIP1 (fast) run significantly faster for more redundant data, while GZIP9 (best) performs better for files with less redundant data. The compression speed of our program (especially its "full" GPU version) is more independent of the input data entropy. The compression ratios of the MCCC program and of nearly all other compressors run in the "fast" mode are comparable for all data types (see Fig. 9). In addition to the above results, we analyzed the dependency between the processing cost (the time of separate stages), the task assignment scheme (single GPU or "full" multi-GPU) and the model of the GPU used (GTX 260 or Tesla T10). Fig. 10 shows the time contribution of the program processing steps to the
overall MCCC processing time. This is an important piece of information for the evaluation and profiling of the program. Slightly different results are presented in Fig. 11. The figure contains a comparison of the time used by the program steps to compress 100 MB of data (different program run configurations are considered). The time used in the following parts of the program was measured:
• alloc — GPU memory allocation,
• mem C-G — CPU-GPU data transfer,
• clm — cleaning of the match table (GPU),
• buf16 — generation of a helper array for faster reading from global GPU memory (GPU),
• suffix — filling of the suffix tables (GPU),
• radix sort — radix sort (GPU),
• match — search for matches (GPU),
• analyze — analysis of matches (GPU, multi-GPU version),
• encode — data encoding (GPU): codeword generation for matches and literals (multi-GPU only),
• distance — generation of codeword suffixes for match distances (CPU, single GPU only),
• mem G-C — GPU-CPU data transfer,
• CPU-comp — sum of the time periods used for compression tasks performed on the CPU,
• CPU+ — time elapsed after the last block processing on the GPU (used for the last compression call, coding and saving of data).
The contribution of the different program steps to the whole program processing time depends on the data redundancy and the processing model (single/multi-GPU). Since the presented values are sums of the computation time for all the GPUs
used, the time cost of GPU computation increases for the multi-GPU program version. In this version the GPUs perform most of the compression work.
5.5. Decompression speed
In order to obtain complete information about the efficiency of our approach, we also tested the decompression speed of all the programs. Our program run in the same computing system (MCCC uses only the CPU for decompression) was faster than 7zip and GZIP but slower than WinRAR (see Tab. 3). This good result was possible due to the usage of the helper tables described in Sec. 3.5.
6. Summary
In this paper we presented a novel approach to lossless compression of arbitrary data on the GPU. According to the reported results we successfully parallelized computations, designing a method for a problem that is of a sequential nature. We created a parallel compression algorithm and the efficient program MCCC using three levels of parallelism: CPU thread parallelism, parallel GPU processing of a single block of compressed data, and parallel processing of multiple data blocks. The proposed method incorporates a novel idea for the searching of matches. The main contribution of the proposed method is a parallel (with the use of thousands of GPU threads) search for matches in an input data block whose size (in our case 16 MB) is constrained only by the program structures and the size of the GPU memory. To make the match search quality- and time-efficient, two stages, parallel suffix table sorting and then a parallel search for matches, are performed on the whole data block. A functional parallel approach for different data blocks was proposed and executed by a GPU-CPU pipeline. In this case the different tasks of the compression method, performed sequentially for a single block of data, were assigned to different processing nodes. Additionally, in a data parallel approach, concurrent compression of different blocks of data is performed in the case of the multi-GPU processing system. The work sharing between the CPU and GPUs was defined according to a load balancing tuning procedure performed off-line for data files of different
redundancy (compare Sec. 3.6). The use of a big dictionary allows a high compression ratio to be obtained, comparable with other widely used compression programs. In the case of another GPU-based general purpose data compression approach, proposed in [14], where only small dictionaries (constrained by the size of the GPU shared memory) are used, the level of compression is much lower. Our MCCC program achieves a good compression ratio, comparable to other widely known data compression programs, but works faster. Our program run in a system with 4 GPUs (Tesla) achieves a compression speed of 134 MB/s (megabytes per second) and is, according to our experiments, more than 2 times faster than the best of the competitive programs, 7zip run in a system with 4 processor cores. We also obtained a similar compression speed, 111 MB/s, for the second version of our program (for a single GPU and CPU) in the system with an Intel i7 processor and an Nvidia GTX 670. This time our solution was nearly 50% faster than the best CPU result. These results were obtained for program runs in technologically comparable computing nodes. Our program achieved better efficiency in terms of system cost and electric power consumption than the competing CPU solution (see Tab. 4). The results were obtained for the compression of a data file of medium redundancy (size 1.4 GB, entropy H1 = 7.07, redundancy R1 = 0.93). A wide comparison of the compression ratios obtained for different programs was presented in Tab. 3. It is hard to precisely compare the efficiency of various programs having no access to common test data files and using a restricted range of computing systems. In spite of this, we tried to compare our results with the recently published results for the general purpose LZSS method implementation on CUDA presented in [14]. The LZSS program works fast for all data types but achieves a significantly worse compression ratio than other compressors. We performed an additional experiment for our data files with the GZIP program mentioned as a reference in [14]. Contrary to the LZSS program (whose compression ratio is between 2 and 4 times worse than the reference), our approach obtained a compression ratio comparable to GZIP for all the data test files used (comp. Fig. 9). The results show that our program outperforms sequential GZIP in terms of
compression speed: the MCCC program run for the mixed content data file in System 3 is about 10 times faster. The final conclusion from this last comparison is unfortunately imprecise and general: the LZSS implementation is fast, but its compression ratio is far from the range achieved by the wide majority of compressors. Our program's advantage over other widely used compressors is that for the same level of compression ratio it is faster and, as the experiments showed, is able to deliver efficient solutions. Additionally, it is worth mentioning that our method is flexible and can be modified to generate archives consistent with the Deflate standard. This could be done without a big loss of compression speed but at the cost of the compression ratio, because Deflate uses a dictionary (a sliding window) limited to 32 KB and cannot encode match distances longer than the window size.
7. Acknowledgements
The research has been supported by grants No. 2011/01/B/ST6/07021 and N N519 643340 from the National Science Centre, Poland.
[1] Compression ratings. [on-line] http://compressionratings.com/, December 2009.
[2] J. Abel. www.data-compression.info: The data compression resource on the internet. [on-line] www.data-compression.info, December 2009.
[3] A. Aqrawi and A. Elster. Accelerating disk access using compression for large seismic datasets on modern GPU and CPU. Extended abstract no. 131, Para 2010 State of the Art in Scientific and Parallel Computing, Reykjavik, June 2010.
[4] A. Balevic. Parallel variable-length encoding on GPGPUs. In Proceedings of the 2009 International Conference on Parallel Processing, Euro-Par'09, pages 26-35, Berlin, Heidelberg, 2010. Springer-Verlag.
[5] I. Castano. High quality DXT compression using OpenCL for CUDA. Whitepaper. [on-line] developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/OpenCL/src/oclDXTCompression/doc/opencl_dxtc.pdf, March 2009.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. McGraw-Hill Science/Engineering/Math, July 2001.
[7] P. Deutsch. RFC 1951: DEFLATE compressed data format specification version 1.3. [on-line] http://www.ietf.org/rfc/rfc1951.txt, May 1996.
[8] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, January 1997.
[9] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098-1101, September 1952.
[10] M. Mahoney. Large text compression benchmark. [on-line] http://cs.fit.edu/~mmahoney/compression/text.html, July 2009.
[11] D. Merrill and A. Grimshaw. Revisiting sorting for GPGPU stream architectures. Technical Report CS2010-03, University of Virginia, Department of Computer Science, Charlottesville, VA, USA, 2010.
[12] M. A. O'Neil and M. Burtscher. Floating-point data compression at 75 Gb/s on a GPU. In Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, GPGPU-4, pages 7:1-7:7, New York, NY, USA, 2011. ACM.
[13] A. Ozsoy and M. Swany. CULZSS: LZSS lossless data compression on CUDA. In Proceedings of the 2011 IEEE International Conference on Cluster Computing, CLUSTER '11, pages 403-411, Washington, DC, USA, 2011. IEEE Computer Society.
[14] A. Ozsoy, M. Swany, and A. Chauhan. Pipelined parallel LZSS for streaming data compression on GPGPUs. In Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS '12, pages 37-44, Washington, DC, USA, 2012. IEEE Computer Society.
[15] D. Salomon. Data Compression: The Complete Reference. Springer-Verlag, Berlin/Heidelberg, Germany, 2007. With contributions by Giovanni Motta and David Bryant.
[16] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In 23rd IEEE International Parallel and Distributed Processing Symposium, pages 1-10, May 2009.
[17] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379-423, 623-656, July, October 1948.
[18] T. A. Welch. A technique for high-performance data compression. Computer, 17(6):8-19, 1984.
[19] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, May 1977.
[20] J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530-536, September 1978.
Authors' biographies
Rafał Walkowiak is a lecturer at Poznań University of Technology with over 20 years of experience. He received an M.Sc. (1987) and a Ph.D. (1997) in Computer Science, both from Poznań University of Technology. His scientific interests include cutting and packing problems and the application of parallel processing in different areas.
Marek Chłopkowski is a Ph.D. student at Poznań University of Technology. In 2010 he received an M.Sc. in Computer Science. He is mainly interested in GPGPU applications in the fields of data compression and computational biology, including the redesign and adaptation of serial algorithms for parallel processing.
- We proposed a novel approach to parallel lossless data compression on the GPU.
- The efficiency of the approach is based on the parallelization of data processing on different hardware and logical levels.
- We obtained a better compression speed than popular data compression programs used on PCs (the compression ratio is sustained).
- Our approach outperforms other known GPU-based methods for lossless data compression.