
Data dependency reduction for high-performance FPGA implementation of DEFLATE compression algorithm

Youngil Kim, Seungdo Choi, Joonyong Jeong, Yong Ho Song∗
Department of Electronics and Computer Engineering, Hanyang University, Seoul, Korea


Keywords: Data compression; Huffman coding; Accelerator architecture; Field programmable gate arrays; Pipeline processing

Abstract

The rapid development of modern information technology has resulted in a sharp increase in the rate of data growth, which leads to a shortage of storage space and network bandwidth. Compression technology is typically employed to mitigate the increasing demand for storage and the cost of data transmission. However, data compression may impose a significant computational burden on the CPU, which degrades system performance. To solve this problem, a hardware offloading technique can be used. Hardware offloading not only reduces the computational load imposed on the CPU but also improves the performance of the compression algorithm by exploiting hardware parallelism. However, the data hazards associated with the compression algorithm hinder the achievement of a high degree of parallelism. DEFLATE is a widely used lossless compression scheme, and many studies have attempted to eliminate the data dependencies associated with its compression algorithms. Unfortunately, existing studies do not address data dependency elimination in Huffman encoding. Our work aims to parallelize Huffman encoding by solving the data-hazard problem in the algorithm. To address the data dependency that exists in the Huffman encoding algorithm, a new representation for the intermediate data generated during data compression is proposed. The effectiveness of the proposed scheme was evaluated by implementing an architecture that applies the approach on a field-programmable gate array (FPGA) platform. Experimental results show that the proposed scheme can increase the throughput of the compressor by up to 14.4%.

1. Introduction

The amount of digital data has increased rapidly because of dramatic advances in IT technology. In particular, new IT technologies such as social network services and the Internet of Things contribute greatly to the increase in data production. As a result, a large amount of resources is consumed for data storage and transmission. To reduce the cost of data management, many studies have focused on reducing the absolute size of data using data compression [1–4]. Compressing data prior to storage or transmission substantially reduces the required storage space or network bandwidth [5].

Data compression is divided into lossy compression and lossless compression. Lossy compression improves compression efficiency by removing unnecessary or less important data. This form of compression is mainly applied to multimedia data because the original data might not be completely restored when the data are decompressed. In contrast, lossless compression reduces data size without any loss of information. Lossless compression yields a lower compression ratio than lossy compression, but it has the advantage of complete restoration of the original data. Therefore, lossless compression is used when minimizing information loss is critical, such as for text files. DEFLATE [6] is one of the most widely used lossless compression algorithms.

Data compression migrates I/O load to computational load on the CPU. Therefore, for data-intensive applications that require a large volume of I/O, data compression can place a heavy computational burden on the CPU [1]. In particular, to support concurrent data access in real time, multiple compression tasks must be executed, which further increases the computational load imposed on the CPU. Since CPU resources are limited, the increased computation overhead due to compression can hinder the execution of other tasks, degrading overall system performance. With the development of semiconductor processes and multi-core technology, computational capacity continues to increase. However, since this capacity is not scaling fast enough to meet increasing computing demands, software-centric solutions have limitations. Many studies on hardware offload engines have therefore been conducted to reduce the computational load of compression algorithms [7–14].



∗ Corresponding author.
E-mail addresses: [email protected] (Y. Kim), [email protected] (S. Choi), [email protected] (J. Jeong), [email protected] (Y.H. Song).
https://doi.org/10.1016/j.sysarc.2019.06.005
Received 12 December 2018; Received in revised form 11 June 2019; Accepted 11 June 2019; Available online 12 June 2019.
1383-7621/© 2019 Elsevier B.V. All rights reserved.


A hardware offload engine implements a software compression algorithm in hardware logic so that user data can be compressed and decompressed independently of the processor. Furthermore, the performance of the algorithm can be improved by applying parallel processing in the hardware offload engine. Commonly used parallel processing techniques include multi-engine processing and pipelining. The multi-engine technique improves data processing bandwidth by duplicating the hardware logic; the duplicated logic performs the same operation simultaneously on data distributed through a switch. Pipelining is a parallelization technique that improves throughput by overlapping parts of the operations included in the algorithm. These two techniques can be combined for parallel data processing. However, it is difficult to fully exploit parallelism in an algorithm when data dependencies exist between subsequent operations. In particular, a loop-carried dependency prevents the next iteration from starting until the previous iteration has produced the data that the next iteration requires [15]. Therefore, multiple iterations containing loop-carried dependencies cannot start at the same time; when consecutive loop iterations are pipelined, the next iteration may be stalled for a period of time after the current iteration is issued.

In order to improve the overall performance of the compressor, the performance of each part must be properly balanced to minimize bottlenecks. For this reason, each element must be evenly parallelized. DEFLATE consists of LZ77 encoding, Huffman code generation, and Huffman encoding. Huffman encoding, the last step of DEFLATE compression, replaces the LZ77-encoded data segments with Huffman codes to produce the compressed output data stream. The DEFLATE compression algorithm inherently contains many data hazards that make parallelization difficult [9,10]. Numerous studies have been conducted on parallelizing the DEFLATE algorithm by removing the data dependencies between tasks [9–12]. In particular, LZ77 encoding has been extensively investigated because it occupies a large portion of the total DEFLATE computation [16]. However, parallelization of Huffman encoding through data-hazard elimination has not been effectively achieved. To the best of our knowledge, almost all studies on Huffman encoding parallelization have focused only on the parallelization of bit packing, which converts variable-length data segments into a fixed-width output stream [10]. The Huffman encoding process includes a loop-carried dependency that limits parallelism as well.

In this paper, we propose a parallelization method that removes the loop-carried dependency in Huffman encoding, and we apply this technique to a DEFLATE offload engine. For seamless hardware pipelining of the proposed offload engine, we introduce a new representation for the intermediate data generated during data compression. The proposed method generates intermediate data using the offset information of each symbol group. With the new intermediate data format, we designed a Huffman encoding algorithm that contains no loop-carried dependency. To evaluate the effectiveness of the proposed technique, we implemented it in a pipelined compression offload engine for the DEFLATE algorithm. The offload engine was synthesized for an FPGA platform, and its performance was measured by running benchmark applications. Experimental results show that eliminating the data dependency via the proposed technique improves the performance of the Huffman encoder, which increases the performance of the DEFLATE compressor by an average of 14.4%. The proposed Huffman encoder allows a deep pipeline structure, which lets the pipeline run without stalls at higher clock frequencies.

This paper is organized as follows. Section 2 briefly describes the DEFLATE compression algorithm and related work. Section 3 provides an overview of the DEFLATE offload engine. The proposed data path optimization technique is detailed in Section 4. Section 5 describes the experimental results, and Section 6 summarizes the main conclusions of this work.

Fig. 1. Example of LZ77 encoding.

2. Background

2.1. DEFLATE compression algorithm

DEFLATE [6] is a lossless compression algorithm that has been widely used for a long time because of its high speed and good compression efficiency. Many formats and tools such as GZIP, ZLIB, ZIP, and PKZIP are based on the DEFLATE compression algorithm. The algorithm consists of the LZ77 algorithm and Huffman coding: the original data are first compressed using the LZ77 algorithm and then further compressed by the Huffman algorithm.

2.1.1. LZ77 encoding

The LZ77 algorithm [17] is a dictionary-based compression algorithm. It builds a dictionary from recently encountered strings. For an input string, it searches the dictionary; if a match is found, the currently processed string is replaced with the distance and length of the matching string stored in the dictionary. Fig. 1 describes an example of the LZ77 encoding process. Since a new string has no match, the first occurrence of easy is left untouched and recorded in the dictionary. The subsequent easy is then found in the dictionary and replaced by a set of information consisting of a relative position (matching distance), the length of the matched string (matching length), and a flag indicating that a chunk of data is encoded (marker). Let the matching length and distance be denoted as length and distance. In the example in Fig. 1, the second easy is replaced with @(5,11) because the string found in the dictionary matches the current string over 5 characters and is 11 characters away.

2.1.2. Huffman coding

Huffman coding [18] is a kind of entropy coding that compresses data by assigning shorter bit sequences to more frequently occurring symbols. It consists of two processes: Huffman code generation and Huffman encoding. The generation process builds a Huffman tree based on the frequency of each symbol included in the target compression data. Fig. 2 presents an example of a Huffman tree. First, the two symbols with the lowest frequencies are selected; two leaf nodes are generated from the selected symbols and combined to form a new node. This process is repeated until all symbols are covered. After the construction of the Huffman tree is completed, the tree assigns shorter codes to more frequently occurring symbols and longer codes to less frequently occurring ones. This process creates a Huffman code for every symbol in the LZ77-encoded stream.

Fig. 2. Huffman tree built by the Huffman algorithm.

The encoding process compresses the LZ77-encoded stream using the Huffman code tables built in the generation process: each symbol in the LZ77-encoded stream is replaced by its matching Huffman code. The LZ77-encoded stream consists of variable-length data elements, and Huffman encoding replaces these data elements with variable-length codes. Therefore, the encoding process includes analyzing the elements of a stream, replacing them with Huffman codes, and then appending the codes to create an output data stream.
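To make the generation process concrete, the following Python sketch (our illustration, not the paper's RTL) derives code lengths by repeatedly merging the two least frequent nodes; every symbol under a merged node gains one bit of code length:

```python
import heapq
from collections import Counter

def huffman_code_lengths(freq):
    """Build a Huffman tree bottom-up: repeatedly merge the two nodes with
    the lowest frequency; every symbol below a merged node moves one level
    deeper, so its code length grows by one bit."""
    heap = [(f, [sym]) for sym, f in freq.items()]
    heapq.heapify(heap)
    length = {sym: 0 for sym in freq}
    while len(heap) > 1:
        f1, syms1 = heapq.heappop(heap)
        f2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:      # all symbols under the new node
            length[s] += 1
        heapq.heappush(heap, (f1 + f2, syms1 + syms2))
    return length

freq = Counter("this is an example of a huffman tree")
print(sorted(huffman_code_lengths(freq).items(), key=lambda kv: kv[1]))
```

Frequent symbols (such as the space character above) end up with the shortest code lengths, which is exactly the property the encoding process exploits.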


2.2. Loop pipelining

Pipelining is a technique that improves system efficiency by processing several different instructions in parallel. Operations that are executed sequentially in a loop are divided into separate pipeline stages so that they can be processed concurrently. If scheduling is successful, successive iterations can start at every clock cycle. However, iterations can be delayed by loop-carried dependencies between pipeline stages, which degrades pipeline performance [15]. The fixed interval between successive iterations of the loop is called the initiation interval (II). Recurrence in the loop increases the initiation interval; a recurrence is a loop-carried dependency between operations in the loop, and it is represented by a cycle in the dependency graph.

Fig. 3. Dependency graph for the data analysis step using TIMS.

Fig. 3 shows a data dependency graph of a loop composed of several operations. Delay and distance indicate, respectively, the minimum time interval between the start of two operations and the number of iterations separating them. In this example, the current iteration depends on the variable acc generated in the previous iteration; that is, the loop contains a loop-carried dependency, which creates a recurrence. As shown in Fig. 3, the recurrence prevents consecutive iterations from starting every clock cycle. In this case, the minimum recurrence-constrained initiation interval (recMII) over every loop recurrence i can be calculated as follows, where delay_i and distance_i denote the clock-cycle latency along the path and the dependency distance of recurrence i:

\( \mathrm{recMII} = \max_i \left\lceil \frac{\mathit{delay}_i}{\mathit{distance}_i} \right\rceil = \left\lceil \frac{2}{1} \right\rceil = 2 \)    (1)
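To make Eq. (1) concrete, here is a small Python sketch (our illustration; the recurrence list format is ours) that evaluates the recurrence-constrained initiation interval:

```python
import math

def rec_mii(recurrences):
    """Minimum recurrence-constrained initiation interval, Eq. (1).

    `recurrences` is a list of (delay, distance) pairs, one per loop
    recurrence: `delay` is the clock-cycle latency along the cycle in the
    dependency graph, `distance` the number of iterations it spans.
    """
    return max(math.ceil(delay / distance) for delay, distance in recurrences)

# The loop of Fig. 3: one recurrence through `acc` with delay 2, distance 1.
print(rec_mii([(2, 1)]))  # -> 2: a new iteration can start only every 2 cycles
```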

The initiation interval can be reduced by isolating each recurrence of the loop into separate pipeline stages. In practice, however, the speedup achieved through isolating recurrences can be offset by the slowest stages. In a pipelined structure, all pipeline stages operate simultaneously in parallel, and every stage runs on the same clock, which cannot be faster than the time required by the slowest stage. Thus, increasing the execution time of pipeline stages on the critical path reduces the maximum operable clock frequency.

2.3. Related work

Given the high demand for fast data compression, many studies have considered hardware acceleration of the DEFLATE algorithm. Among these studies, we consider those that investigated dependency elimination and the intermediate data representation method.

2.3.1. Dependency elimination

There have been several attempts to improve performance by removing dependencies in the DEFLATE algorithm [9,10]. In several studies, operations that interfere with the next iteration are pre-processed through operation reordering. This approach minimizes or eliminates the recurrence included in the parallelized string-matching process, thereby reducing the blocking effect of recurrence. However, it degrades compression quality: before reordering, the result generated by the preceding operation can be used by the next operation to select the best string match, but this is no longer possible after reordering. Another study attempted to resolve the data-hazard problem that occurs during Huffman code generation via data forwarding techniques [11]. The authors use an accumulator to hold the data that cause the hazard so that it can serve as an operand when the hazard occurs. This approach reduces the initiation interval by optimizing the loop-carried operations. These studies investigated data-hazard elimination, but they do not address the data hazard that occurs during the Huffman encoding process. Our approach differs from these works in that we identify and remove a loop-carried dependency of the Huffman encoding process.

2.3.2. Intermediate data representation method

There are few references to the intermediate data representation method in previous studies. Rigler et al. [13] presented a format for the intermediate data stream transmitted from the LZ77 encoder to the Huffman encoder. The authors used flag bits to indicate whether the characters generated by LZ77 were encoded or not. Their method grouped eight flag bits into one byte placed before the corresponding literal and length-distance pair sequence. They constructed an intermediate data stream in this way, but the reasons for this design were not presented. Our proposed method, in contrast, aims to increase the parallelism of the compressor, and its effectiveness is demonstrated.


3. Hardware architecture

This section describes the data flow of the DEFLATE compression system and explains the intermediate data generated during DEFLATE compression.

3.1. System overview

The DEFLATE compression system is composed of an LZ77 encoder, a Huffman code generator, and a Huffman encoder (Fig. 4). The LZ77 encoder processes the original data in units of a certain size, denoted as data blocks. The LZ77 encoder generates the LZ77-encoded stream and a frequency count of its symbols, which are called the intermediate data and the symbol frequency, respectively. The intermediate data and the symbol frequency are each accumulated in a data buffer, denoted as the intermediate data buffer and the symbol frequency table.

Fig. 4. The architecture of the DEFLATE compression system.

The symbol frequency table accumulates the frequency of each symbol. The intermediate data buffer stores the symbols processed by the LZ77 encoder sequentially in a FIFO manner. After processing of one data block is complete, the intermediate data and symbol frequency are transferred to the Huffman encoder and the Huffman code generator, respectively. The Huffman code generator creates Huffman codes based on the symbol frequency. These codes are classified into three types: literal-length, distance, and compressed-length. Each type of Huffman code is stored in a separate Huffman code table. When Huffman code generation for one data block is completed, the Huffman code tables are transmitted to the Huffman encoder. The Huffman encoder compresses the intermediate data using the codes previously generated by the Huffman code generator. The three modules that make up the compressor are pipelined. For seamless pipeline operation, the data transferred between the modules are multiple-buffered. A bottleneck in any of the three modules may therefore cause the other modules to stall, which degrades overall system performance.

3.2. Intermediate data

Intermediate data are composed of uncompressed original data (denoted as literals) and LZ77-encoded data (denoted as length-distance pairs), along with the metadata needed to indicate the type of data. Literals and length-distance pairs are called tuples. A tuple is composed of one symbol for a literal or two symbols for a length-distance pair. Symbols denoting literals, lengths, and distances are subject to Huffman code generation. Fig. 5 describes the composition of the intermediate data stream and the associated terminology. The DEFLATE algorithm [6] limits the length and distance to 258 and 32768, respectively. In this paper, we use a maximum data block size of 16 KB, which limits the length and distance to 258 and 16383, respectively. This implies that each length symbol is represented by 8 bits and each distance symbol by 14 bits. Only literal-length Huffman codes are needed for Huffman compression of literal data. Compressing length-distance pair data, on the other hand, requires literal-length Huffman codes, distance Huffman codes, and extra bits to extrapolate some Huffman codes to their original values. Therefore, tuples are divided into literals and length-distance pairs by the Huffman encoder and handled differently depending on the symbol type.

Fig. 5. Example of selecting the first data packet.

3.3. Huffman encoder

The Huffman encoder handles multiple symbols of the intermediate data in parallel. We denote the unit of data handled at one time as a data packet. A data packet is composed of literals and length-distance pairs, and its size depends on the type and arrangement of the symbols it contains. Fig. 5 illustrates an example of the Huffman encoder extracting a data packet from an intermediate data stream. Assume that the Huffman encoder can handle up to 32 bits of intermediate data at a time in parallel. The size of the three literals is 24 bits and the size of one length-distance pair is 22 bits; in other words, the size of the intermediate data from the beginning through the first length-distance pair is 46 bits, which is larger than the maximum data size that the Huffman encoder can handle at one time. Therefore, the Huffman encoder recognizes the first three literals as the first data packet. The operation of the Huffman encoder is divided into five steps, each designed using one or more pipeline stages. In this paper, we introduce only the two steps of the Huffman encoder that are affected by the proposed data representation scheme. Fig. 6 illustrates the data flow in these two steps.

3.3.1. Intermediate data fetch

In this step, some of the intermediate data read from the intermediate data buffer are stored in a data buffer inside the Huffman encoder. The two operations are called data fetch (DF) and data buffering (DB), respectively.

Fig. 6. Data flow diagram of the intermediate data fetch step.

3.3.2. Data analysis

To replace symbols with Huffman codes, the current data packet must be extracted from the data stream and separated into symbols. This step analyzes the currently processed data packet and separates it into several symbols; a sketch of the packet-selection rule follows.
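The following Python sketch illustrates the packet-selection rule (our illustration, using the 32-bit capacity of the Fig. 5 example; the synthesized encoder in Section 5 uses a 46-bit budget with additional tuple-count limits):

```python
LITERAL_BITS = 8           # one 8-bit symbol per literal
PAIR_BITS = 8 + 14         # length symbol + distance symbol
MAX_PACKET_BITS = 32       # per-cycle capacity assumed in the Fig. 5 example

def first_packet(tuples):
    """Greedily take tuples from the head of the intermediate stream until
    adding the next one would exceed the per-cycle bit budget."""
    packet, bits = [], 0
    for kind in tuples:                      # 'L' (literal) or 'D' (pair)
        size = LITERAL_BITS if kind == 'L' else PAIR_BITS
        if bits + size > MAX_PACKET_BITS:
            break
        packet.append(kind)
        bits += size
    return packet, bits

# Fig. 5: three literals (24 bits) followed by a pair (22 bits). The pair
# would push the total to 46 bits, so the first packet is the 3 literals.
print(first_packet(['L', 'L', 'L', 'D']))  # (['L', 'L', 'L'], 24)
```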


The data fetched from the intermediate data buffer are analyzed to determine the location and size of the data packet currently being processed. The encoder also analyzes the types of tuples whose arrangement constitutes the data packet. The current data packet is then divided into several symbols and sent to the next step. The data analysis process is divided into four operations. The first operation is metadata selection (MDS): it reads, from the current pointer of the prefetched intermediate data buffer, the maximum amount of data that the Huffman encoder can handle at one time, and it fetches the metadata of the selected intermediate data to obtain configuration information about the data packet to be processed. The second operation is metadata analysis (MDA): it analyzes the metadata of the selected intermediate data to determine the types of data packets it contains, which determines the location of the next data to be selected. The third operation is data parsing (DP): referring to the configuration information of the currently processed data packet, it divides the data packet into symbol units; the extracted symbols are passed to the next step, address generation. The last operation is buffer pointer calculation (BPC): it calculates the next pointer, obtained by adding the size of the currently processed data packet to the current pointer.

4. Intermediate data representation scheme

As described in the preceding section, the intermediate data are stored in an intermediate data buffer and processed by the Huffman encoder. This section describes an intermediate data representation scheme for efficient parallelization of the compressor. We use a slightly modified version of the data representation format presented by Rigler et al. [13] as a baseline.

4.1. Symbol integration with flag insertion

The symbol integration with flag insertion scheme (denoted as SIFI) stores literals, length-distance pairs, and metadata in a single data buffer (Fig. 7). If literals and length-distance pairs are stored in the same buffer, separate metadata are required to indicate whether each tuple is encoded. We define the metadata generated by SIFI as a compression flag. If a tuple is a literal, the data buffer stores a compression flag of '0' after the literal. If a tuple is a length-distance pair, the length is stored first, followed by a compression flag of '1' and then the distance. The Huffman encoder reads the compression flag stored in the data buffer to determine whether the tuple currently being processed is a literal or a length-distance pair. When the compression flag is 0, the Huffman encoder determines that the corresponding tuple is original data and recognizes the 8 bits from the current pointer as a literal. When the compression flag is 1, the Huffman encoder handles the corresponding tuple as encoded data: the 8 bits from the current pointer are recognized as a length, and the 14 bits from the 10th bit are recognized as a distance. As a commonly used intermediate data storage method, zlib [6] allocates a fixed-size memory entry for each literal, length, or distance. SIFI, in contrast, requires little memory space to store the intermediate data because it appends all data without extra space in the data buffer. Let the numbers of literals and length-distance pairs in the intermediate data be denoted as N and M, respectively. SIFI utilizes a memory space of (8 × (N + M) + 14 × M + (N + M)) bits to store the intermediate data.

However, the data analysis step contains recurrences, which increase the initiation interval. Fig. 8 shows the architecture of the data analysis step. The MDS operation reads the selected data chunk, fetches the compression flags, and transfers them to the MDA operation. The MDA operation recognizes the configuration and size of the data packet in the selected data chunk by analyzing the fetched compression flags. The DP operation divides the current data packet into symbols and transfers them to the address generation step. The BPC operation adds the size of the data packet calculated by the MDA operation to the current pointer to compute the next pointer, which is the starting point of the data packet for the next iteration. Fig. 9 represents the data dependency between the operations of the intermediate data fetch step and the data analysis step. For the (N+1)th DF and the (N+1)th MDS, the start pointer of the (N+1)th data packet in the intermediate data must be calculated, which means that the Nth BPC must already have been performed. Therefore, there are two loop-carried dependencies between the processing of the Nth data packet and that of the (N+1)th data packet, and these loop-carried dependencies limit loop parallelism. The increase in the initiation interval due to the loop-carried dependency between BPC and DF can be prevented by prefetching, for the next DF, the maximum amount of data that can be processed from the current pointer.

Fig. 7. Example of storing tuples in the intermediate data buffer using the symbol integration with flag insertion scheme.

Fig. 8. Architectural details of the data analysis step when using SIFI.
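The SIFI layout described above can be sketched at the bit level as follows (our illustration; the helper names are ours, and Python bit strings stand in for the hardware buffer):

```python
# Per tuple: an 8-bit literal or length symbol, then a 1-bit compression
# flag, then (for pairs only) a 14-bit distance. That is 9 bits per literal
# and 23 bits per pair, matching 8*(N+M) + 14*M + (N+M) in total.

def sifi_append(bits, tup):
    if tup[0] == 'lit':                      # ('lit', byte)
        return bits + f"{tup[1]:08b}" + "0"
    return bits + f"{tup[1]:08b}" + "1" + f"{tup[2]:014b}"  # ('pair', len, dist)

def sifi_next(bits, ptr):
    """Mimic the Huffman encoder's read: the tuple type, and therefore the
    position of the *next* tuple, is only known after the flag at bit 8 is
    examined. This is the loop-carried dependency discussed above."""
    flag = bits[ptr + 8]
    return ptr + (9 if flag == "0" else 23)

stream = sifi_append(sifi_append("", ('lit', 0x65)), ('pair', 5, 11))
ptr = sifi_next(stream, 0)           # -> 9, start of the length-distance pair
print(ptr, sifi_next(stream, ptr))   # 9 32
```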


Fig. 9. Dependency graph for the data analysis step when using SIFI.

However, the loop-carried dependency between BPC and MDS increases the initiation interval. In this case, the minimum recurrence-constrained initiation interval over every loop recurrence i can be calculated as follows:

\( \mathrm{recMII} = \max_i \left\lceil \frac{\mathit{delay}_i}{\mathit{distance}_i} \right\rceil = \left\lceil \frac{3}{1} \right\rceil = 3 \)    (2)

Fig. 10. The space-time diagram of the data analysis step when using SIFI.

Fig. 10 shows the operation of pipelined SIFI. As shown in Fig. 10, a new iteration can start only every three cycles because of the recurrence. In the pipeline structure, all pipeline stages operate simultaneously in parallel, and each stage runs on the same clock, which cannot be faster than the time required by the slowest stage. To start a new iteration without a pipeline stall, all operations from MDS to BPC would have to be performed in one clock cycle, as presented in Fig. 8. If this part is a critical path, the clock cycle time becomes the sum of the execution times of MDS, MDA, and BPC. We confirmed that this part is indeed a critical path by synthesizing the Huffman encoder on an FPGA board. To solve this problem, we introduce the tuple integration with metadata separation scheme, which eliminates the recurrence in the data analysis step. Before introducing it, we first describe a zero-padding scheme, another way to remove the recurrence of the data analysis step.

4.2. Zero-padding

To increase the clock frequency of the Huffman encoder, the recurrence in the data analysis stage must be removed. The data dependency between MDA(N) and BPC(N) occurs because BPC(N) requires the data packet size. Zero-padding eliminates this dependency by fixing the length of the data packets. The length of a data packet depends on its combination of literals and length-distance pairs. The zero-padding scheme unifies the length of all types of data packets to the maximum length: if the total length of the tuples belonging to a data packet is less than the maximum, the remainder is filled with consecutive '0's. If each data packet has a length of P bits regardless of the combination, the current data packet is located P bits away from the start of the previous data packet. This means that the starting point of the next data packet is known in advance, without performing the MDA of the preceding data packet. Therefore, the zero-padding scheme removes the data dependency between MDA and BPC. However, fixing the data packet length in this way incurs a large memory overhead: regardless of the combination of literals and length-distance pairs, every data packet occupies the largest possible data packet size. For example, suppose that the Huffman encoder can handle up to 46 bits; even if a data packet contains only one literal, it occupies 46 bits. To solve this problem, we propose the tuple integration with metadata separation scheme (denoted as TIMS).

4.3. Tuple integration with metadata separation

TIMS generates fixed-length metadata groups in units of data packets rather than tuples, which allows the data analysis step to be performed without a loop-carried dependency. This scheme stores tuples and metadata in separate data buffers. The metadata group corresponding to a data packet generated by TIMS is defined as the data packet status tag (DPST). The data buffer that stores only tuples and the data buffer that stores DPSTs are defined as the tuple buffer and the data packet status tag buffer, respectively. Fig. 11 illustrates an example of TIMS operation, with literals denoted by L and length-distance pairs denoted by D. The LZ77 encoder stores the tuples and DPSTs resulting from processing one data block in the separate buffers, and the Huffman encoder refers to these buffers when encoding. The two types of intermediate data are each processed through a separate data path. Let the operations that target the tuples and the data packet status tags be prefixed with 'T' and 'DPST', respectively.

Fig. 11. Example of storing tuples in the intermediate data buffer using the tuple integration with metadata separation scheme.

Fig. 12(a) shows the data dependency between the operations of the data analysis step when using TIMS with unfixed-length DPSTs. A certain amount of DPST data is fetched from the DPST data buffer and analyzed. Based on the analysis result, the currently processed tuple is fetched from the tuple buffer, parsed, and transferred to the next step. A sketch of this packetization follows.
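A minimal sketch of the packetization on the LZ77 side (our illustration, assuming greedy packet formation under the 46-bit and four-tuple limits used in Section 5):

```python
SIZE = {'L': 8, 'D': 22}   # tuple sizes in bits: literal / length-distance pair

def tims_pack(tuple_stream, max_bits=46, max_tuples=4):
    """Split a tuple stream into data packets; each closed packet would get
    one fixed-length DPST written to a separate metadata buffer, while the
    tuples themselves are stored unchanged and unpadded in the tuple buffer."""
    dpsts, packet = [], []
    for kind in tuple_stream:                       # 'L' or 'D'
        bits = sum(SIZE[k] for k in packet) + SIZE[kind]
        if len(packet) == max_tuples or bits > max_bits:
            dpsts.append(''.join(packet))           # close the current packet
            packet = []
        packet.append(kind)
    if packet:
        dpsts.append(''.join(packet))
    return dpsts

# Each returned entry names one of the 15 packet types of Table 1 and would
# be encoded as a fixed-length (4-bit) tag.
print(tims_pack(['L', 'L', 'D', 'L', 'D', 'D', 'L']))  # ['LLDL', 'DD', 'L']
```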


In this case, however, the pipeline performance of TIMS is still limited by loop-carried dependencies, as in the case of SIFI. As previously indicated, the loop-carried dependency between DPSTBPC and DPSTF (the DPST fetch) can be solved with prefetching. However, due to the dependency between DPSTBPC and DPSTS, the initiation interval increases to 3. To reduce this interval to 1, the data dependency between DPSTA and DPSTBPC must be eliminated. For this reason, the length of the DPST was fixed. The DPST describes the configuration of a data packet, which can include various combinations of literals and length-distance pairs. If the DPST were a group of metadata generated the same way as the SIFI metadata, its length would be variable. We therefore reconstruct the DPST as information that indicates the type of the data packet, fixing its length. Fig. 12(b) shows the dependency graph after removing the loop-carried dependency: DPSTBPC generates the next pointer by adding the DPST size to the current pointer, so it can operate independently of the other operations, and there is no recurrence. As a result, the data analysis operations can be performed in different pipeline stages. TIMS uses the idea of zero-padding, removing the data dependency by removing the variability of the referenced data chunk size. To implement TIMS, the hardware logic of the SIFI structure must be modified. First, each LZ77 encoder requires preprocessing logic to generate DPSTs. In addition, each LZ77 encoder requires two data buffers and a buffer controller to store the tuples and DPSTs separately. Finally, because the intermediate data are stored in two buffers, separate data paths are needed for each of the intermediate data sets. Fig. 13 describes the hardware architecture of the data analysis step when using TIMS, designed on the basis of the dependency graph in Fig. 12(b). In the first stage, the DPST is fetched from the DPST data buffer and analyzed. The DPSTS operation reads the selected metadata chunk, fetches the current DPST, and transfers it to the DPSTA operation. The DPSTA operation analyzes the fetched DPST to recognize the configuration and size of the currently processed data packet in the selected data chunk. Because the length of the DPST is fixed, the DPSTBPC operation simply adds a constant value, the size of the DPST, to the current buffer pointer. Therefore, the DPSTBPC operation can be performed in parallel with the DPSTS and DPSTA operations; a sketch follows the figure captions below.

Fig. 12. Dependency graph for the data analysis step when using TIMS.

Fig. 13. Architectural details of the data analysis step when using TIMS.
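As an illustration of the first stage, here is a Python sketch (ours; encoding the 4-bit DPST as an index into the 15 packet types of Table 1 is an assumption, consistent with the 4 × P term of Table 5):

```python
PACKET_TYPES = ['L', 'LL', 'LLL', 'LLLL', 'D', 'DL', 'DLL', 'DLLL',
                'LD', 'LDL', 'LDLL', 'LLD', 'LLDL', 'LLLD', 'DD']
SIZE = {'L': 8, 'D': 22}   # tuple sizes in bits

def analysis_stage(dpst_tags, idx, tuple_ptr):
    """One iteration of the TIMS data analysis step."""
    tag = dpst_tags[idx]                       # DPSTS: fetch the current tag
    kind = PACKET_TYPES[tag]                   # DPSTA: decode the packet type
    packet_bits = sum(SIZE[c] for c in kind)   # ...and its total size
    # DPSTBPC: a constant step (+1 tag here, +4 bits in hardware); it does
    # not depend on the decode above, so there is no recurrence.
    next_idx = idx + 1
    # TBPC: advance the tuple-buffer pointer by the decoded packet size.
    next_tuple_ptr = tuple_ptr + packet_bits
    return kind, next_idx, next_tuple_ptr

print(analysis_stage([PACKET_TYPES.index('LLDL')], 0, 0))  # ('LLDL', 1, 46)
```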


Based on the DPST analysis, the second stage processes the current data packet (its space-time behavior is shown in Fig. 14). The TDP operation receives the data packet configuration information from the first stage and selects the current data packet. The selected data packet is divided into symbols that are transferred to the address generation step. The TBPC operation adds the size of the data packet transferred from the first stage to the current tuple buffer pointer. Unlike SIFI, because no loop-carried dependency exists among the operations of the data analysis step of TIMS, each operation can be placed in a different stage with the initiation interval set to 1. However, as the number of pipeline stages increases, the numbers of required pipeline registers and of consumed hardware resources both increase. The DPSTS and DPSTA operations are placed in the same pipeline stage because their execution times are short. The zero-padding scheme fixes the data packet length, whereas TIMS fixes the DPST length; in both cases, the objective is to eliminate the recurrence in the data analysis step. However, the maximum length of the DPST is shorter than that of a data packet. Therefore, unlike with the zero-padding scheme, the memory overhead caused by TIMS is relatively small. Suppose that the intermediate data consist of N literals, M length-distance pairs, and P data packets, and that the length of the DPST per data packet is O bits. TIMS then requires a memory space of (8 × (N + M) + 14 × M + O × P) bits to store the intermediate data. The memory requirement of TIMS is expressed in terms of the numbers of literals, length-distance pairs, and data packets: when TIMS is used, the size of the generated DPST information varies depending on the types of data packets contained in the intermediate data.

Table 1
Types of data packets that can be generated.

Data packet type    # Tuples    Data packet size (bits)
L                   1           8
LL                  2           16
LLL                 3           24
LLLL                4           32
D                   1           22
DL                  2           30
DLL                 3           38
DLLL                4           46
LD                  2           30
LDL                 3           38
LDLL                4           46
LLD                 3           38
LLDL                4           46
LLLD                4           46
DD                  2           44
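Table 1 can be reproduced by enumeration; the following Python sketch (our illustration) lists every L/D sequence of at most four tuples whose symbols fit in the 46-bit budget:

```python
from itertools import product

SIZE = {'L': 8, 'D': 22}      # bits per literal / length-distance pair
MAX_BITS, MAX_TUPLES = 46, 4

packet_types = [
    ''.join(combo)
    for n in range(1, MAX_TUPLES + 1)
    for combo in product('LD', repeat=n)
    if sum(SIZE[c] for c in combo) <= MAX_BITS
]
# Yields the 15 types of Table 1 (in a different order); note that at most
# two D's can ever fit, since three pairs would need 66 bits.
for t in packet_types:
    print(t, len(t), sum(SIZE[c] for c in t))
```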

5. Experimental results

5.1. Experimental environment

In the experiments, the performance effects and hardware resource usage of the proposed schemes were evaluated. Three versions of the DEFLATE compressor were implemented in Verilog RTL. Xilinx Vivado v2015.4 (lin64) was used for hardware synthesis, targeting the xcku025-ffva1156-1-I (active) FPGA. Two versions of the Huffman encoder use SIFI. The first (denoted as SIFI_0) is pipelined regardless of the recurrence: the operations that cause the recurrence are divided into two pipeline stages, with MDS and MDA placed in one stage and BPC in the other. The reason for not dividing the three operations into three different pipeline stages is that doing so does not further increase the maximum operable frequency. Experimental results show that the execution time of other operations, such as DP, is greater than the sum of the MDS and MDA execution times; therefore, dividing the operations that cause the recurrence into more pipeline stages only increases the initiation interval. The second version of the Huffman encoder (denoted as SIFI_1) handles all the operations that make up the recurrence in one clock cycle, as shown in Fig. 10; that is, MDS, MDA, and BPC are performed sequentially in one pipeline stage, which sets the initiation interval to 1. The Huffman encoder using TIMS performs DPSTS and DPSTA sequentially in the same pipeline stage, while DPSTBPC, TBPC, and TDP are performed in different pipeline stages. The other parts of the Huffman encoders are identical except for the data analysis step. Each Huffman encoder can handle up to four literals or up to two length-distance pairs at the same time, up to 46 bits in total. This means that 15 types of data packets can be generated, as illustrated in Table 1.

Table 2
Maximum clock frequency.

SIFI_0      SIFI_1      TIMS
243 MHz     208 MHz     238 MHz

Table 3
Logic delay time in the data analysis step.

        MDS, DPSTS    MDA, DPSTA    BPC, TBPC    DPSTBPC    DP, TDP
SIFI    1.68 ns       1.37 ns       1.64 ns      –          3.97 ns
TIMS    0.87 ns       1.55 ns       1.81 ns      0.61 ns    3.47 ns

5.2. Performance

5.2.1. Clock frequency

We compared the synthesis results of the DEFLATE compressor with SIFI_0, SIFI_1, and TIMS applied to observe how the maximum clock frequency changes with the architecture of the Huffman encoder. The maximum clock frequency was determined by sweeping the clock frequency and synthesizing the Huffman encoder for the FPGA. We also measured the data path delay of the logic that performs the operations of the data analysis step. The synthesis results are described in Tables 2 and 3: Table 2 lists the maximum operable clock frequency of the compressors with the two versions of SIFI and with TIMS, and Table 3 lists the time required for the operations of the data analysis step of the Huffman encoder. The synthesis results show that the maximum clock frequencies of the compressors using SIFI_0 and TIMS are higher than that of the compressor using SIFI_1. The compressor using SIFI_1 can operate at up to 208 MHz without a timing violation, whereas the compressors using SIFI_0 and TIMS can operate at up to 243 MHz and 234 MHz, respectively, 16.8% and 12.5% higher than the compressor using SIFI_1. The maximum operating frequencies of the Huffman encoders using SIFI_0 and TIMS are higher than that of the encoder using SIFI_1 because of the parallelization of the data analysis step. The maximum clock frequency of the Huffman encoder using SIFI_1 is determined by MDS, MDA, and BPC. However, based on the synthesis results for the DEFLATE compressor, for SIFI_0 and TIMS the critical path occurs in other steps of the Huffman encoder rather than in the data analysis stage. As explained earlier, when SIFI_1 is used, MDS, MDA, and BPC must be executed sequentially within one clock cycle in a single pipeline stage because of the data dependencies. Therefore, one clock period must be at least 4.69 ns (1.68 ns + 1.37 ns + 1.64 ns). Furthermore, according to the synthesis result, the data path that sequentially performs MDS, MDA, and BPC is a critical path; the minimum clock period is 4.8 ns, rather than 4.69 ns, because of clock path skew. When SIFI_0 or TIMS is used, all operations of the data analysis step can be performed in parallel in different pipeline stages. Therefore, the minimum clock period required to guarantee correct operation of the data analysis step within one clock period is lower than with SIFI_1. The difference in maximum operating frequency between SIFI_0 and TIMS is due to differences in the synthesis process, such as placement and routing, rather than structural differences.


Fig. 14. The space-time diagram of the data analysis step when using TIMS.

5.2.2. Performance

Fig. 15 shows the performance of the DEFLATE compressor with each scheme applied. The performance of each compressor was measured via simulation using a behavioral model of the DEFLATE compressor, assuming that the operating frequency of each compressor equals the corresponding value in Table 2. Each module that makes up the compressor is pipelined. To appropriately measure the performance of pipelined processing elements, the overall processing time should be long enough to amortize the non-overlapped execution. For this reason, only the benchmarks from the Canterbury Corpus and Calgary Corpus that generate more than 10 data blocks were used [19].

Fig. 15. Performance results.

The experimental results show that with TIMS the performance of the compressor increased by 14.2% and 14.4% compared to SIFI_0 and SIFI_1, respectively. This is due to an increase in the performance of the Huffman encoder or an increase in the maximum operable frequency. As shown in Fig. 15(a), although the maximum operable frequency of TIMS is 2.1% lower than that of SIFI_0, the average number of bytes processed per clock cycle with TIMS is 16.6% higher than with SIFI_0. With TIMS and SIFI_1, the number of bytes processed by the compressor per clock cycle is the same; however, owing to the difference in operating frequency, the performance of the compressor with TIMS is 14.4% higher than with SIFI_1. With TIMS there is no delay in starting a new iteration due to recurrence. Moreover, the additional preprocessing needed to construct the DPST simply converts the stored metadata to a regular format, and it is almost entirely hidden by the internal pipeline behavior of the LZ77 encoder. As a result, the performance of the compressor is improved by reducing the delay caused by the Huffman encoder bottleneck. In most experiments, except for kennedy, pic, and ptt5, the application of TIMS eliminated the Huffman encoder bottleneck, which increased the overall compressor performance. In general, the higher the compression ratio of the data, the smaller the performance improvement from the decrease in the initiation interval. For highly compressible benchmarks such as kennedy, pic, and ptt5, the performance improvement of the compressor is approximately 0.2% on average. In contrast, for relatively incompressible benchmarks such as alice29, book1, news, and plrabn12, the average bandwidth of the compressor increased by 33.1%. This is because the higher the compressibility of the data, the smaller the intermediate data generated by LZ77, which decreases the amount of data that the Huffman encoder must process; the performance improvement of the Huffman encoder then has less impact on the overall compressor performance.


Table 4
Total LUT utilization.

                                   SIFI     TIMS
Total                              9925     9419
DPST packing (per LZ77 encoder)    –        106
Huffman encoder                    9925     9313

Table 5
Memory utilization of each scheme.

Scheme    Memory utilization (bits)
SIFI      8 × (N + M) + 14 × M + (N + M)
TIMS      8 × (N + M) + 14 × M + 4 × P

(#Literals = N, #Length-distance pairs = M, #Data packets = P)

Fig. 16. Normalized memory utilization.
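The two formulas of Table 5 are easy to compare directly; a small Python sketch follows (the block statistics below are made-up numbers for illustration, not measurements):

```python
def sifi_bits(n_lit, n_pair):
    # 8-bit symbol per tuple, 14-bit distance per pair, 1 flag bit per tuple
    return 8 * (n_lit + n_pair) + 14 * n_pair + (n_lit + n_pair)

def tims_bits(n_lit, n_pair, n_packets):
    # same tuple payload, but a fixed 4-bit DPST per data packet instead
    return 8 * (n_lit + n_pair) + 14 * n_pair + 4 * n_packets

# Hypothetical data block: 3000 literals, 1000 pairs grouped into 1400 packets.
print(sifi_bits(3000, 1000))        # 50000 bits
print(tims_bits(3000, 1000, 1400))  # 51600 bits, about 3% more than SIFI
```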

5.3. Resource utilization

We analyzed the LUT utilization by synthesizing the compressors that apply SIFI and TIMS. Table 4 summarizes the FPGA hardware resources required to implement each DEFLATE compressor. Since there is almost no hardware overhead due to the structural difference between SIFI_0 and SIFI_1, we assume that their hardware usage is the same. As shown in Table 4, the compressor applying TIMS utilizes approximately 5.09% fewer hardware resources than the compressor applying SIFI. To reconstruct the DPST as information that represents the type of data packet, TIMS requires hardware logic called DPST packing. Therefore, the LZ77 encoder of TIMS utilizes more hardware resources than that of SIFI. The Huffman encoder of TIMS, on the other hand, requires slightly fewer resources than that of SIFI, because the data multiplexing range of SIFI is larger than that of TIMS. We can therefore conclude that TIMS has little hardware overhead, because both changes in the hardware resource requirements are negligible.

5.4. Memory utilization

5.4.1. Comparison of total memory utilization of each scheme

TIMS allows stall-free pipelining but utilizes more memory than SIFI. However, the memory overhead is negligible; we compare the memory utilization of the intermediate data to verify this assertion. Table 5 describes the amount of intermediate data generated by each representation method in terms of the numbers of literals, length-distance pairs, and data packets generated by LZ77. The intermediate data buffer sizes needed to implement each compressor were compared. The size of the intermediate data buffer is proportional to the amount of intermediate data generated during data compression. Therefore, we estimated the size of the intermediate data buffer based on the amount of intermediate data per data block that each compressor generates on each benchmark. Fig. 16 represents the average size of the intermediate data generated per data block when compressing each benchmark; the Canterbury Corpus [19] was used as the test suite. As shown in Fig. 16, TIMS utilizes an average of 3.6% more memory space than SIFI. As indicated in the preceding section, unlike SIFI, which allocates one metadata bit per tuple, TIMS allocates a fixed metadata group per data packet. Therefore, the intermediate data generated by TIMS is larger than that of SIFI, except for data packets consisting of the maximum number of tuples. The memory utilization of the two schemes varies with the benchmark because of the proportions of tuple types in the intermediate data, which is discussed in detail below. TIMS requires more memory than SIFI because of the difference in the metadata generation approaches. The number of tuples produced by TIMS and SIFI is the same, but the amount of metadata differs: SIFI generates one bit of metadata per tuple, whereas TIMS generates 4 bits of metadata per data packet. Therefore, if the 15 types of data packets appear equally often, 1.43 bits of metadata per tuple are required. When TIMS is used, the amount of metadata required per tuple depends on the type of data packet.

5.4.2. Comparison of memory utilization based on pair ratio

The pair ratio is the proportion of length-distance pairs among the tuples included in the intermediate data. The Canterbury Corpus and Calgary Corpus [19] were used for this experiment. The amount of intermediate data generated by LZ77 depends on the amount of original data and the LZ77 compression ratio. In this experiment, we examined the relative size difference of the intermediate data according to the pair ratio. For ease of comparison, we normalized the amount of generated intermediate data by two values, the LZ77 compression ratio and the original data size before LZ77 compression. The normalized memory consumption (NMC) is therefore expressed as follows:

\( \mathrm{NMC} = \frac{\mathrm{Memory\ Consumption} \times \mathrm{LZ77\ Compression\ Ratio}}{\mathrm{Original\ Data\ Size}} \)

Fig. 17. Normalized memory utilization difference between schemes based on pair ratio.

Fig. 17 presents the difference in normalized memory utilization between TIMS and SIFI according to the pair ratio, which is denoted as the normalized difference. The higher the pair ratio, the more memory TIMS tends to utilize compared with SIFI.


This is because, if it is assumed that the 15 types of data packets appear with the same probability, the size of the metadata generated by TIMS is about 1.37 bits per length-distance pair and 1.66 bits per literal. However, the difference in memory utilization between SIFI and TIMS is not necessarily proportional to the pair ratio. The intermediate data contain several types of data packets, and the proportion of each type can vary depending on the data being compressed. With TIMS, the amount of metadata generated per tuple differs according to the type of data packet. Therefore, the difference in memory utilization between SIFI and TIMS depends on the proportion of each type of data packet in the intermediate data. For example, when an LLDL packet is encoded by TIMS and SIFI, both schemes generate about 0.087 bits of metadata per tuple. When an LD packet is encoded, in contrast, TIMS generates 0.133 bits of metadata per tuple, and SIFI generates about 0.067 bits, a difference of about 0.066 bits. In other words, even though the pair ratio in the LD packet is lower than in the LLDL packet, the difference between the two schemes in metadata generated per tuple is larger when the LD packet is encoded. Therefore, when many data packets such as LD packets are present, the difference in memory utilization between the two schemes increases. The red sample in Fig. 17 is one such example. According to the experimental results, in the general case where the pair ratio is high, the proportion of DD packets is also high, meaning that string matching often occurs consecutively. In the case of the red sample, however, DL packets represent 84% of the total data packet frequency. This implies that string matching during LZ77 compression was aborted by one character that did not match, after which string matching resumed from the next character. For the DD packet, the difference between TIMS and SIFI in metadata size per tuple is approximately 0.09 bits, whereas for the DL packet it is approximately 0.133 bits. Therefore, even if the pair ratio is relatively low, the normalized difference can be high. These experimental results suggest that designers should choose the intermediate data representation method that best matches the characteristics of their specific workload. When implementing a compression system primarily for incompressible data with a very low pair ratio (e.g., multimedia files, images, video, sound) [2], system designers should choose TIMS over SIFI, since it can operate at a high clock frequency. In contrast, for a system that processes mainly compressible data (e.g., documents, web pages, and logs) [2], system designers can choose between SIFI and TIMS, considering the tradeoff between memory consumption and clock frequency.

6. Conclusion

In this paper, a cost-effective pipelined hardware architecture for high-bandwidth lossless compression is proposed. We describe a new data representation method to eliminate recurrence in the algorithm's operation. Removing the recurrence reduces the initiation interval between successive loop iterations, which allows stall-free fine-grained pipelining. The presented experimental results indicate that the proposed method increases the performance of a compression offload engine by approximately 14.4%. Moreover, the application of the new data representation method increases the amount of intermediate data. According to our analyses, the increase in the size of the intermediate data varies depending on the characteristics of the target data. However, the increase is negligible and has no significant effect on the design.

Declaration of competing interest

The authors whose names are listed below certify that they have no affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers' bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements) or non-financial interest (such as personal or professional relationships, affiliations, knowledge, or beliefs) in the subject matter or materials discussed in this manuscript.

Acknowledgment

This work was supported by the R&D program of MOTIE/KEIT [10077609, Developing Processor-Memory-Storage Integrated Architecture for Low Power, High Performance Big Data Servers].

References

[1] S. Kanev, J.P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G.-Y. Wei, D. Brooks, Profiling a warehouse-scale computer, in: Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15), ACM, New York, NY, USA, 2015, pp. 158–169, doi:10.1145/2749469.2750392.
[2] B. Nicolae, High throughput data-compression for cloud storage, in: Proceedings of the Third International Conference on Data Management in Grid and Peer-to-Peer Systems (Globe '10), Springer-Verlag, Berlin, Heidelberg, 2010, pp. 1–12.
[3] R. Kothiyal, V. Tarasov, P. Sehgal, E. Zadok, Energy and performance evaluation of lossless file data compression on server systems, in: Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference (SYSTOR '09), ACM, New York, NY, USA, 2009, pp. 4:1–4:12, doi:10.1145/1534530.1534536.
[4] Y. Chen, A. Ganapathi, R.H. Katz, To compress or not to compress - compute vs. IO tradeoffs for MapReduce energy efficiency, in: Proceedings of the First ACM SIGCOMM Workshop on Green Networking (Green Networking '10), ACM, New York, NY, USA, 2010, pp. 23–28, doi:10.1145/1851290.1851296.
[5] T. Summers, S.A. Engineer, Hardware based gzip compression, benefits and applications, CORPUS 3(2.75) 2–68.
[6] P. Deutsch, J.-L. Gailly, Zlib compressed data format specification version 3.3, 1996.
[7] A. Martin, D. Jamsek, K. Agarawal, FPGA-based application acceleration: case study with GZIP compression/decompression streaming engine, ICCAD Special Session 7C (2013).
[8] J. Ouyang, H. Luo, Z. Wang, J. Tian, C. Liu, K. Sheng, FPGA implementation of GZIP compression and decompression for IDC services, in: 2010 International Conference on Field-Programmable Technology, 2010, pp. 265–268, doi:10.1109/FPT.2010.5681489.
[9] M.S. Abdelfattah, A. Hagiescu, D. Singh, Gzip on a chip: high performance lossless data compression on FPGAs using OpenCL, in: Proceedings of the International Workshop on OpenCL 2013 & 2014, ACM, 2014, p. 4.
[10] J. Fowers, J.-Y. Kim, D. Burger, S. Hauck, A scalable high-bandwidth architecture for lossless compression on FPGAs, in: 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), IEEE, 2015, pp. 52–59.
[11] J. Matai, J.-Y. Kim, R. Kastner, Energy efficient canonical Huffman encoding, in: 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), IEEE, 2014, pp. 202–209.
[12] W. Qiao, J. Du, Z. Fang, L. Wang, M. Lo, M.-C.F. Chang, J. Cong, High-throughput lossless compression on tightly coupled CPU-FPGA platforms (abstract only), in: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '18), ACM, New York, NY, USA, 2018, pp. 291–291, doi:10.1145/3174243.3174987.
[13] S. Rigler, W. Bishop, A. Kennings, FPGA-based lossless data compression using Huffman and LZ77 algorithms, in: 2007 Canadian Conference on Electrical and Computer Engineering (CCECE 2007), IEEE, 2007, pp. 1235–1238.
[14] S. Choi, Y. Kim, Y.H. Song, False history filtering for reducing hardware overhead of FPGA-based LZ77 compressor, J. Syst. Archit. 88 (2018) 110–119.
[15] A. Canis, S.D. Brown, J.H. Anderson, Modulo SDC scheduling with recurrence minimization in high-level synthesis, in: 2014 24th International Conference on Field Programmable Logic and Applications (FPL), IEEE, 2014, pp. 1–8.
[16] E.A. Sitaridi, R. Mueller, T. Kaldewey, Parallel lossless compression using GPUs, in: GPU Technology Conference (GTC), 2013.
[17] J. Ziv, A. Lempel, A universal algorithm for sequential data compression, IEEE Trans. Inf. Theory 23 (3) (1977) 337–343, doi:10.1109/TIT.1977.1055714.
[18] D.A. Huffman, A method for the construction of minimum-redundancy codes, Proc. IRE 40 (9) (1952) 1098–1101, doi:10.1109/JRPROC.1952.273898.
[19] M. Powell, The Canterbury Corpus, 2001.




Youngil Kim received a bachelor's degree in media communication engineering from Hanyang University, Seoul, Korea, in 2012. He is currently pursuing his Ph.D. in electronics and computer engineering at Hanyang University. His research interests include high-performance computing, lossless data compression, and 3D integrated circuits.

Joonyong Jeong received his bachelor's degree in information systems from Hanyang University. He is currently an integrated MS/Ph.D. student in electronics and computer engineering at Hanyang University. His research interests include computer architecture, big data, and key-value store systems.

Seungdo Choi received his bachelor's and master's degrees in electronics and computer engineering from Hanyang University, Seoul, Korea, in 2012 and 2014, respectively. He is currently pursuing a Ph.D. in electronics and computer engineering at Hanyang University. His research interests include high-performance computing, computer architecture, and low-power systems.

Yong Ho Song received bachelor’s and master’s degrees from Seoul National University, Seoul, Korea, and a Ph.D. from the University of Southern California, Los Angeles, CA, USA, all in computer engineering, in 1989, 1991, and 2002, respectively. He is currently a professor in the Department of Electronic Engineering, Hanyang University, Seoul. His current research interests include system architecture and software systems of mobile embedded systems, including SoC, NoC, multimedia on multicore parallel architecture, and NAND flash-based storage systems. Prof. Song has served as a Program Committee Member of many prestigious conferences, including the IEEE International Parallel and Distributed Processing Symposium, the IEEE International Conference on Parallel and Distributed Systems, and the IEEE International Conference on Computing, Communication, and Networks.
