Integrated write buffer management for solid state drives

Sungmin Park, Jaehyuk Cha, Sooyong Kang
Division of Computer Science and Engineering, Hanyang University, Seoul 133-791, Republic of Korea


Article history: Received 20 November 2012; received in revised form 14 January 2014; accepted 20 January 2014; available online 29 January 2014.

Keywords: Flash memory; Write buffer; Log block; Flash translation layer; Solid-state disk; Storage device

Abstract

NAND flash memory-based Solid State Drives (SSDs) have many merits in comparison to traditional hard disk drives (HDDs). However, random writes within an SSD are still far slower than sequential reads/writes and random reads. There are two independent approaches for resolving this problem: (1) using overprovisioning, so that a reserved portion of the physical memory space can be used, for example, as log blocks for performance enhancement, and (2) using an internal write buffer (DRAM or non-volatile RAM) within the SSD. While log blocks are managed by the Flash Translation Layer (FTL), write buffer management has been treated separately from the FTL. Write buffer management schemes did not use the exact status of log blocks, and log block management schemes in the FTL did not consider the behavior of the write buffer management scheme. This paper first demonstrates that log blocks and write buffers maintain a tight relationship, which necessitates integrated management of both. Since log blocks can also be viewed as another type of write buffer, we can manage both of them as an integrated write buffer. We then propose an Integrated Write buffer Management scheme (IWM), which collectively manages both the write buffer and log blocks. The proposed scheme greatly outperforms previous schemes in terms of write amplification, block erase count, and execution time.

© 2014 Elsevier B.V. All rights reserved.

1. Introduction

Recent advancements in NAND flash technology have supported the growing usage of Solid State Drives (SSDs) as commodity storage media. (Throughout this paper, flash memory means NAND flash memory and SSD means NAND flash memory-based SSD.) Although an SSD still has a higher cost per byte than an HDD, it is expected that SSDs will replace most HDDs in the near future [10]. An SSD, unlike an HDD, has no mechanical movement when accessing data, and this distinguishing feature guarantees excellent random read performance (i.e. 20–50 times faster than HDDs [12]). In contrast, flash memory's lack of in-place update forces it to erase data blocks prior to a data update operation. Within flash memory, read/write operations are done in page units, while an erase operation is done in block units. In order to update particular data within a specific block, flash memory first erases the entire target block storing the data and then writes the newly updated data together with the pre-existing, non-updated data stored in the same block. Because an SSD inherits this drawback of flash memory, heavy random write requests generate numerous copy and erase operations, even for valid (i.e. non-updated) pages internally.

This is the reason for both degraded performance and a shortened lifespan. In recent years, many studies at different system layers have been performed to overcome this write anomaly of flash memory. In order to reduce the number of write operations to flash memory, flash-aware buffer management schemes [24,14,13,26,29] and write buffer management schemes in SSDs [17,16] have been proposed. As approaches that directly manage flash memory, flash file systems such as YAFFS [5], JFFS2 [8], and UBIFS [21] have been developed. Meanwhile, there have been attempts to build file systems for HDDs that also nicely accommodate the characteristics of SSDs, such as NILFS [22], BtrFS [23], and ZFS [28]. At the host interface layer, new commands (for example, the TRIM command) that are specific to SSDs have been added in order to reduce the number of copies and erases of valid pages by passing information about deleted data blocks to the SSD. Nevertheless, if these approaches are used in a system that has both SSDs and HDDs, there are certain limitations that prevent users from adopting them: (1) buffer replacement policies might cause performance degradation when an HDD is used together with an SSD, (2) they require users to install an additional flash file system besides the file system for the HDD, and (3) a flash-aware general-purpose file system might not simultaneously optimize the performance of both the SSD and the HDD. Hence, for the widespread use of SSDs, it is necessary to resolve the SSD's internal problems inside the SSD.


There are two research directions that can be used for solving the random write problem inside an SSD. The first approach is to use certain data blocks as log blocks or replacement blocks so that updated data is written to log blocks instead of the original data blocks. The Flash Translation Layer (FTL) manages these log blocks, and the random write performance and the number of operations differ largely according to the number of log blocks and the management policy of the FTL. In this FTL category, we have seen NFTL [20], AFTL [7], the Superblock Scheme [15], BAST [18], and FAST [19] in the literature. The other approach is to exploit a write buffer inside an SSD in order to reduce the number of write and erase operations, and we have seen BPLRU [17], CLC [16] and PUD-LRU [30]. (Though FAB and REF were initially devised as page replacement schemes within operating systems, they can also be used as write buffer management schemes in an SSD. Therefore, in this paper, we categorize them as write buffer management policies.)

As the two aforementioned approaches have been researched independently, there has been no study that considers both the write buffer management policy and the log block management policy in union. In our previous work [16], we proposed Optimistic FTL, a log block management policy that was designed by observing destaged data patterns from a write buffer to flash memory. Optimistic FTL manages log blocks while considering the characteristics of normal write buffer management policies. In contrast, REF [26] assumed a normal log block management policy and proposed an OS-level write buffer replacement policy that works according to the assumed log block management policy. Although these two studies suggest the necessity of mutual recognition between a log block management policy and a write buffer management policy, they remain reactive policies, which consider only the result of the other policy, rather than proactive policies that operate closely with the other. However, since both the log block management and write buffer management policies are under the control of a single SSD controller, information sharing and close cooperation between the two policies are feasible. In addition, from a broad perspective, a log block can also be regarded as a kind of write buffer. Hence, an integrated study of these two policies is necessary.

The main contribution of this paper is to introduce the necessity of integrated management of both log blocks and write buffers to the SSD community. In detail, we first delve into the correlation between the two management policies by means of extensive interrelated experiments using a variety of workloads, and we then devise an example scheme, the Integrated Write Buffer Management policy (IWM), in order to validate the proposed necessity. Extensive experiments, under a variety of experimental configurations, show that IWM, in most cases, performs better than any other policy. It successfully reveals the potential of integrated buffer management in terms of both the performance and the lifetime of an SSD, and we think that it can be a stepping stone for developing more effective integrated buffer management schemes in the future.

The rest of the paper is organized as follows: Section 2 explains background materials and motivation, Section 3 discusses IWM in detail, and Section 4 presents performance studies. In Section 5, we discuss the similarities and differences between our work and prior works, and in Section 6 we conclude the paper.

2. Background and motivation

2.1. Background

Log block management: Hybrid mapping schemes use three kinds of merge operations – switch merge, partial merge, and full merge. Among them, only full merge is implemented differently according to the log block management policy in each Hybrid map-

ping scheme. Switch Merge is used when all pages in a log block are sequentially written. Partial Merge is performed when some pages (including the first page) in a log block are sequentially written and the remaining pages are free. Full Merge is executed when pages in a log block are randomly written, and is the most expensive merge operation. The two most representative hybrid mapping schemes are BAST and FAST. While FAST copes with random writes more efficiently than BAST, it shows very large fluctuations in the full merge overhead. The page mapping policy usually manages the entire address space in a page unit, eliminating the necessity of log blocks. It has the advantage of efficiently handling random writes, but has the drawback of maintaining a large page table that maps the entire pages within SSD. For example, 1 TB SSD needs at least 1 GB DRAM for mapping table. In order to remedy this, DFTL [11] proposed a novel method that implements the page mapping strategy with a small amount of memory, which stores only a certain part of the entire page table. This is achieved by viewing the locality information of given workloads. However, this approach cannot help showing poor performance for low locality workloads. Especially, as the SSD capacity becomes larger and larger while its performance becomes better and better, an SSD can be used in multi-purpose systems which provide multiple services such as web, mail, file and ftp services, simultaneously. In those systems, the aggregated IO accesses can have low overall locality, which can make the caching-based FTL poor. In this regard, more recent works, such as Janus-FTL [31] and WAFTL [32], try to use both block and page mappings, simultaneously, in a single device. These works exploit their own operations which correspond to merge operations in Hybrid mapping schemes (i.e., Buffer Zone Migration in WAFTL and Fusion and Defusion operations in Janus-FTL). By effectively exploiting both block and page mappings, they could not only decrease the amount of mapping information but also improve the performance. For example, in WAFTL, they showed that by workload-adaptively using multiple mapping granularities and exploiting appropriate caching strategy, they could achieve up to 34% performance improvement over DFTL. It means that if the caching scheme is combined with the hybrid granularity mapping scheme, it is possible to achieve an improved performance than the pure caching-based FTL. In Janus-FTL, they showed that their scheme, in some cases, could outperform pure page-mapped FTL which has no outstanding optimization technique for garbage collection. Hence, assuming the SSD capacity will ever increase, we believe that the basic mechanism of the Hybrid mapping schemes will still be valuable in terms of both the theoretic and practical aspects. Since our contribution is providing a framework for an integrated management between write buffer and log block, provided that a given FTL exploits a log block-like scheme, our framework can be effectively used to further increase the performance of the FTL. Write buffer management: For many decades, we have seen numerous studies on buffer management policies for an HDD. Considering both hit ratio in write buffers and mechanical movements inside an HDD, write buffer management schemes, such as Stack Model [6] and WOW [9], have regarded temporal and spatial locality as important factors. 
In opposition to this, write buffer management policies for flash memory do not need to consider mechanical movements, rather, there should be an effort to reduce the number of extra operations occurring inside an FTL. In traditional write buffer management policies, for the efficient merge operation of an FTL, the method of clustering pages in the same block of flash memory and writing these clustered pages together is widely used within research work [13,17,26,16,30]. FAB [13] proposed a DRAM buffer management policy for a Portable Media Player using flash memory, and it was the first attempt to cluster pages aligned by the erase boundary in flash memory. When replacing buffers, FAB selects the biggest cluster as a victim cluster


and then destages it. Although this scheme showed improved performance in comparison to the page-based LRU scheme, it has two drawbacks, as follows: it does not consider temporal locality, and it forfeits the chance to further populate a victim cluster. BPLRU [17], a cluster-based LRU policy established upon temporal locality, adopted a data padding scheme in order to perform the lowest cost merge (i.e., Switch merge) when merging blocks within an FTL. Because this scheme incurs only Switch merge, it provides the advantage of having less erase operations in comparison to other schemes, but it also has data padding overhead. REF [26] is a write buffer management policy that selects pages belonging to blocks that contain recently destaged pages. This is for the purpose of increasing log block utilization, as well as reducing the number of merge operations by aggregating pages belonging to the same block into the same log block. CLC considered both the cluster size and temporal locality and showed that taking both factors into account guarantees better performance than those considering only a single aspect. PUD-LRU [30] considers access frequency of each cluster in addition to the temporal locality and size of the cluster, for victim selection. It also employs the data padding scheme which accompanies the padding overhead like BPLRU. PUD-LRU, which has been implemented on top of the hybrid FTL (exactly, BAST), outperformed DFTL in Financial 1, Build, TPC-e workloads which have low temporal locality, while DFTL outperformed PUD-LRU in Financial 2 workload which has high temporal locality. Hence, by properly integrating write buffer and log block management schemes in hybrid mapping FTLs, we can achieve the performance improvement over existing caching-based page-mapped FTL. 2.2. Motivation Although a single SSD controller takes control of both write buffer management and log block management, each management policy is independently designed. A write buffer management policy chooses a victim without the information about log blocks, and a log block management policy also picks victim log blocks for merge without knowing the cluster information stored in a write buffer. Due to this information blockage, the performance of existing write buffer management policies varies greatly according to workload characteristics, write buffer size, and the number of log blocks. In this section, we address this phenomenon and analyze the main cause via extensive performance experiments by varying workload characteristics, write buffer size, and the number of log blocks. We then are able to provide insight from these performance studies. We used various workloads whose characteristics are shown in Table 2 and Fig. 11 in Section 4. Write buffer size: The size of the write buffers range from 2 MB to 32 MB. It is clear that the performance improves when more buffers are used, but this consumes more power and is vulnerable to power failure and, therefore, may lose more data. The reason we also consider a small range (2–8 MB) is because in this range we could use NVRAM as a write buffer. Using NVRAM has no observable problems when using DRAM write buffers, but NVRAM is used in small sizes due to the low degree of NVRAM integration. Number of log blocks: The number of log blocks ranges from 8 to 512. In previous literature, researchers used a different number of log blocks, 8 in [13,17,26], 16–128 in [16], and 3% of active pages in [15,11]. 
Considering the following two facts: (1) different researchers assumed different numbers of log blocks, and (2) as the number of log blocks increases, not only does the mapping space requirement increase but the available capacity of an SSD decreases, this study performed experiments within the log block range of 8–512.

2.2.1. Effect of the number of log blocks

In order to measure the effect of the number of log blocks on the performance of a write buffer management policy, we experi-


mented with two write buffer management policies (i.e. CLC and REF) by varying the number of log blocks and the block size. We used BAST as the default log block management policy. Because BAST adopts 1:1 mapping between data and log blocks, if there already exists a log block for the victim cluster destaged from the write buffer (log block hit or log block re-reference case), we write destaged pages on the log block. Otherwise, we have to allocate a new log block, first, and then write destaged pages on the log block. It is obvious that as the log block hit ratio increases, so does the log block utilization and overall system performance. Figs. 1(a) and 2(a) show performance graphs of CLC and REF, respectively. On the Y-axes, the run time of REF is normalized by that of the CLC. In FAT workload [1] (Fig. 1(a)), when the number of log blocks is small, CLC outperforms REF up to 24%. As the number of log blocks increases, the performance of REF improves. With 512 log blocks, REF outperforms CLC up to 7%. We can see similar behavior from the results with Explorer workload [26] (i.e. CLC outperforms REF up to 19% with 8 log blocks and REF outperforms CLC up to 13% with 512 log blocks). As we can see from these results, the performance of a write buffer management policy is affected by the number of log blocks. Figs. 1(b) and (c) and 2(b) and (c) explain the main reason. These figures show the number of destaged clusters, the log block miss count, and the log block hit ratio in two different write buffer management policies. The number of destaged clusters represents how efficiently a write buffer management policy clusters pages in a write buffer, and it is affected by the write buffer size, not by the number of log blocks. As we can see in the figures, CLC destaged a smaller amount of clusters in comparison to REF. This indicates that CLC performs clustering in a write buffer more efficiently than REF. Nonetheless, with 256 or more log blocks, CLC incurred more log block misses than REF. When a log block miss happens, the log block management policy picks a victim log block and then performs a merge operation. Therefore, we can see that the number of log block misses directly affects the number of merge operations. The analysis results show that REF has less of a clustering effect in comparison to CLC, but it has a larger number of log block re-references (i.e. higher hit ratio). This allows for improved performance when compared to CLC, where the number of log blocks is large. As described previously, this is because REF destages clusters that have high chances of having their corresponding log blocks. It leads to clustering of the pages, which were not clustered in a write buffer, within the log blocks. Owing to this clustering effect within the log blocks, REF shows good performance when a log block rereference occurs frequently due to a sufficient number of log blocks. From the experimental results above, we can reach the following conclusion. With a small number of log blocks, where the log block hit ratio is low, a write buffer management policy with a large clustering effect inside a write buffer shows good performance. As the number of log blocks and the corresponding log block hit ratio increases, a write buffer management policy that underlines the log block hit ratio guarantees good performance. 2.2.2. 
Effect of block padding The block padding scheme proposed in [17] reads missing pages of a victim cluster from a corresponding data block in flash memory, writes them to the victim cluster in write buffer to make a complete data block, and then destages the data block to flash memory. By doing this, FTL always activates the lowest overhead merge (i.e. the Switch merge) operation. In this case, log blocks are not needed. If a host sequentially generates write requests, we will have a large clustering effect in a write buffer as well as a low re-reference ratio of log blocks. As a result, the block padding scheme is very efficient. Otherwise, the block padding scheme can degrade the overall performance of an SSD due to the data padding overhead. In FTL, we can formalize both the cost (C BP ) when Switch


Fig. 1. Performance analysis of buffer management policies: FAT workload.

Fig. 2. Performance analysis of buffer management policies: Explorer workload.

merge is performed via the use of the block padding scheme and the cost (C_FM) when Full merge is called without the block padding scheme, using Eq. (1) and Eq. (2), respectively:

C_BP = (N_p − S_c) · R + N_p · W + E    (1)

C_FM = S_c · W + N_p · (R + W) + 2E    (2)

In the above equations, N_p, S_c, R, W, and E represent the number of pages in a block, the cluster size (≤ N_p), and the normalized read, write, and erase costs, respectively. Using the characteristics in [25] (N_p = 128, R = 1, W = 6, E = 10), we obtain the two cost values as follows: C_BP = 906 − S_c and C_FM = 6 · S_c + 916.

Without the block padding scheme, we can think of the following scenario: (1) a page cluster belonging to a data block D_A is selected as a victim and then written to a newly allocated log block L_A^1, (2) a new page cluster belonging to the data block D_A is created again within the write buffer, and (3) the new page cluster is selected as a victim before the log block L_A^1 is merged, and the pages in the cluster are written to the log block L_A^1 (i.e. log block L_A^1 is re-referenced). In this circumstance, let us denote the sizes of the first and second victim clusters as S_c^1 and S_c^2. Then, overflow happens in the log block L_A^1 when S_c^1 + S_c^2 > 128 (= N_p). Once overflow occurs, a Full merge is performed over the log block L_A^1, which has exhausted all 128 pages, and the remaining (S_c^1 + S_c^2) mod 128 pages are written to a new log block, namely L_A^2. If a re-reference to the new log block L_A^2 does not occur, a Full merge will eventually be performed over the new log block L_A^2, too. The total cost of the whole procedure (two Full merges) is (6 · 128 + 916) + 6 · ((S_c^1 + S_c^2) mod 128) + 916. Without loss of generality, assuming that a log block is referenced n − 1 more times after it was initially referenced (i.e. n references on the log block in total), the total cost can be formalized as follows:

C_FM = ⌊(Σ_{i=1}^{n} S_c^i) / 128⌋ · (6 · 128 + 916) + 6 · ((Σ_{i=1}^{n} S_c^i) mod 128) + 916    (3)

In Eq. (3), S_c^i means the size of the i-th victim cluster that is destaged to the log block. The total cost accounts for ⌊(Σ_{i=1}^{n} S_c^i) / 128⌋ + 1 Full merges.

If we use the block padding scheme, a Switch merge is performed whenever a victim cluster from the write buffer is destaged to flash memory. Assuming that victim clusters belonging to the same data block are destaged n times, the total cost can be obtained as follows:

C_BP = Σ_{i=1}^{n} (906 − S_c^i)    (4)
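To make the cost comparison concrete, the short Python sketch below evaluates Eqs. (1)–(4) with the parameters of [25] (N_p = 128, R = 1, W = 6, E = 10). It is our illustration of the formulas, and the example cluster sizes are hypothetical.

    # Cost model of Eqs. (1)-(4); parameters from [25]: N_p = 128, R = 1, W = 6, E = 10.
    NP, R, W, E = 128, 1, 6, 10

    def cost_block_padding(cluster_sizes):
        # Eq. (4): every destage is padded and Switch merged, i.e. 906 - S_c^i per destage.
        return sum((NP - s) * R + NP * W + E for s in cluster_sizes)

    def cost_full_merge(cluster_sizes):
        # Eq. (3): destages share one log block; every overflow of the 128-page log block
        # costs one Full merge (6*128 + 916), and the remainder is Full merged once at the end.
        total = sum(cluster_sizes)
        return (total // NP) * (W * NP + 916) + W * (total % NP) + 916

    # Hypothetical example: two destages of 60 pages each for the same data block.
    sizes = [60, 60]
    print(cost_block_padding(sizes))  # 2 * (906 - 60) = 1692
    print(cost_full_merge(sizes))     # 6 * 120 + 916  = 1636 -> Full merge is cheaper here

With a single destage of the same size the comparison flips (C_BP = 846 versus C_FM = 1276), which is exactly the dependence on log block re-references discussed next.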

Fig. 3(a) shows the merge overhead of both the block padding scheme and the Full merge according to the number of possible log block re-references. If there are no re-references, then the block padding scheme has less overhead than the Full merge. Once a re-reference occurs at least once, the overhead differs largely according to the sum of the victim cluster sizes; in particular, when Σ_{i=1}^{n} S_c^i ≤ 128, Full merge has less overhead than the block padding scheme. If we assume that there is no re-referencing, then the block padding scheme should always be used. Otherwise, it is necessary to selectively use the block padding scheme, depending on the number of re-references and the size distribution of the victim clusters. Fig. 3(b) shows the ratio of re-referenced log blocks when the number of log blocks is 512 and the write buffer size is 2 MB. When a log block is re-referenced, there are two cases: (1) the sum of the victim cluster sizes is less than or equal to 128, and (2) the sum is greater than 128. As we can see in the figure, a consistent phenomenon observed under all workloads is that re-references to log blocks fall into roughly two cases: (1) a re-reference never occurs (block padding should be used), and (2) re-references occur only with Σ_{i=1}^{n} S_c^i ≤ 128, irrespective of their frequency (block padding should be avoided). The ratio of non-referenced log blocks differs greatly across workloads. Therefore, we should use the block padding scheme adaptively, depending on the existence of re-references.


Fig. 3. Necessity of considering log block re-reference.

Since the erase cycle of an MLC NAND flash chip is approximately 10 K, elongating the lifespan of flash memory by reducing the erase count is as important as improving the performance. Fig. 4 shows the effect of the block padding scheme on the erase count under the TPCC [4] and Financial [3] workloads. In this experiment, we use a block-based LRU write buffer management policy, with and without block padding. We normalize the run time and the erase count with the block padding scheme by those without the block padding scheme. As shown in the figure, the difference is larger in the erase count ratio than in the performance ratio. In other words, when using the block padding scheme, the benefit of reducing the erase count is more distinguishable than the performance gain. As we described in Section 2, Full merge and Switch merge incur two and one erasures, respectively. Hence, when there is no log block re-reference, block padding decreases the number of erasures to one half of that without block padding. As shown in Figs. 1 and 2, when the number of log blocks is small (i.e. 8), re-referencing rarely occurs. Therefore, as shown in Fig. 4, when the number of log blocks is 8, the erase count with the block padding scheme is about half of that without the block padding scheme. If there are log block re-references, then one additional erase occurs whenever a log block is re-referenced. When we do not use the block padding scheme, an additional erase occurs only

Fig. 4. Execution time and erase count comparison between w/ and w/o Block Padding (B: buffer size).


when the log block overflow takes place due to the re-reference. However, as we can see in Fig. 3(b), even when re-references happen, most of them belong to the case in which the overflow event does not occur (i.e. the case when Σ_{i=1}^{n} S_c^i ≤ 128). This implies that, without the block padding scheme, an additional erase seldom arises. Consequently, if there is one re-reference, the erase counts for both cases are almost the same, and if more than one re-reference occurs, using the block padding scheme incurs more erase operations. In Fig. 4, when the number of log blocks is large and the size of the write buffer is small, re-references to log blocks take place frequently; therefore, the erase count increases when using the block padding scheme. From the viewpoint of reducing the erase count, the block padding scheme should be used adaptively, depending on the existence of re-references to log blocks. From the experimental results shown so far, we can come to the following conclusion. Since the block padding scheme should be used adaptively, depending on the existence of a re-reference to each log block, a scheme that is able to determine whether or not each log block will be re-referenced is necessary. Using such a scheme, block padding should only be used for those log blocks that will have no re-referencing.

2.2.3. Summary of motivation

Through extensive experimentation, we have analyzed the effects of workload characteristics, write buffer size, the number of log blocks, and the log block hit ratio on the performance of a write buffer management policy and on the number of erase operations. From this analysis, we learn two important lessons:

Lesson 1. Necessity of subsuming both clustering and the re-reference effect: The relative performance of write buffer management policies, which either emphasize efficiently clustering data in the write buffer or weight the hit ratio of log blocks, varies largely according to the number of log blocks. Hence, it is necessary to develop a new write buffer management policy that combines the advantages of both kinds of policies for each log block condition.

Lesson 2. Necessity of adaptive padding: It is better for the performance and the endurance of an SSD to apply a block padding scheme to log blocks that will not be re-referenced and to avoid it for log blocks that will be re-referenced. Therefore, it is necessary to develop a method to predict whether a log block will be re-referenced or not.

3. IWM: an integrated write buffer management scheme

A write buffer management policy manages DRAM or NVRAM write buffers in an SSD, and a log block management policy manages log blocks in flash memory. Both write buffer management policies and log block management policies have been designed to operate independently of each other's status and behavior. This section introduces the Integrated Write Buffer Management scheme (IWM), which manages both the write buffer and log blocks collectively. Fig. 5 shows the structure of IWM. IWM, which manages both the write buffer and log blocks in a single software layer, does not require any changes to the SSD hardware architecture. There are three core components in IWM: the Write buffer manager, the Log block manager, and the Shared Data Structures. Since IWM is implemented within a single software layer, all data structures used by both the write buffer manager and the log block manager can be shared. Fig. 6 shows the shared data structures of IWM. IWM manages write buffers at the block level, as do many write buffer management schemes. Using block-level management, pages in the write buffer are clustered according to the data blocks they belong to.
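As a minimal illustration of this block-level clustering (our sketch, not the authors' code), the snippet below groups buffered logical page numbers by the data block they belong to, assuming N_p = 128 pages per block as in Table 1.

    from collections import defaultdict

    NP = 128                              # pages per block (Table 1)
    clusters = defaultdict(set)           # LBN -> page offsets currently buffered for that block

    def buffer_write(lpn):
        lbn, offset = divmod(lpn, NP)     # data block number and page offset within the block
        clusters[lbn].add(offset)

    for lpn in (0, 1, 130, 131, 257):     # hypothetical write requests
        buffer_write(lpn)
    print(dict(clusters))                 # {0: {0, 1}, 1: {2, 3}, 2: {1}}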

Fig. 5. Structure of IWM (the figure shows the host system and, within the SSD, the Write Buffer Manager, Shared Data Structures, and Log Block Manager managing the NVRAM write buffer and the flash memory).

The Page Cluster List is the basic data structure for block-level write buffer management. A single list maintains all pages belonging to the same data block, and the LBN of each list specifies the data block to which the pages in the list belong. The Data Block Map is a block-level mapping table in which each entry contains the physical location of a data block. IWM uses an additional field, the P-Flag, in the Data Block Map, which is used for the Adaptive Partial Padding of IWM. The Log Block Map is the data structure for log block management. IWM uses one-to-one mapping between data blocks and log blocks. Each entry in the Log Block Map contains the physical block number (PBN) of the data block that corresponds to the log block, the physical page numbers (PPN) of the pages that are written in the log block, the offset of the last page in the log block (LPI), the S-Flag, which indicates the sequentiality of the pages in the log block, and the R-Flag, which specifies whether the log block has been re-referenced or not.

IWM categorizes page clusters within the write buffer into four types: H-clusters, S-clusters, L-clusters, and Size-dependent clusters. An H-cluster (Hot cluster) is a recently accessed cluster. An S-cluster (Sequential cluster) is a set of N_p pages which have been written sequentially to the write buffer. An L-cluster (Log cluster) is a cluster whose corresponding data block already has a log block; if a cluster's LBN exists in the Log Block Map, then the cluster is an L-cluster. H-clusters are maintained using an LRU list, while S-clusters and L-clusters are maintained using respective FIFO lists. The remaining clusters are Size-dependent clusters and are maintained using N_p LRU lists, each of which stores clusters that have the same number of pages.

Since the number of log blocks in an SSD is fixed, when no free log block remains, a victim log block is selected and merged with its corresponding data block in order to obtain a new free log block. A victim is selected in FIFO order. (When selecting a victim, another policy such as LRU could be used instead of FIFO; however, we use FIFO because there is little performance difference between FIFO and LRU [16] and FIFO is also simple to implement.) Pages in a log block are written irrespective of the page order in the corresponding data block. When the order of pages in a log block is the same as that in the corresponding data block, the S-Flag in the Log Block Map is set to 1, regardless of the number of pages in the log block. When disorder occurs, the S-Flag is set to 0.

IWM uses two novel schemes, Aggressive Log block Usage (ALU) and Adaptive Partial Padding (APP), in order to accommodate the lessons presented in Section 2.

3.1. Aggressive Log Block Usage (ALU)

Lesson 1 in Section 2 requires the write buffer management policy to behave differently according to the number of log blocks in the SSD. Specifically, the write buffer management policy should emphasize the clustering of pages in the buffer when the number of log blocks is small, while it should stress re-referencing of the log blocks when there are a large number of log blocks.

Fig. 6. Shared data structures of IWM (LBN: Logical Block Number, PPN: Physical Page Number, PBN: Physical Block Number, LPI: Last Page Index, S: Sequential Block Flag, R: Re-reference Flag, P: Padding Flag).
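To make the layout of these shared structures concrete, here is a minimal Python sketch of the entries in Fig. 6; the field names follow the paper, but the dataclass layout and container choices are our own assumptions, not the authors' implementation.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class DataBlockEntry:                 # one entry of the Data Block Map
        pbn: int                          # physical location of the data block
        p_flag: int = 1                   # P-Flag: 1 -> pad the next newly created log block

    @dataclass
    class LogBlockEntry:                  # one entry of the Log Block Map (1:1 with data blocks)
        pbn: int                          # PBN field of the entry (cf. Fig. 6)
        ppn: List[int] = field(default_factory=list)   # physical page numbers written so far
        lpi: int = -1                     # LPI: offset of the last page written in the log block
        s_flag: int = 1                   # S-Flag: 1 while the pages remain in sequential order
        r_flag: int = 0                   # R-Flag: 1 once the log block has been re-referenced

    # Page Cluster Lists: buffered pages grouped by the data block (LBN) they belong to.
    page_clusters: Dict[int, List[int]] = {}
    data_block_map: Dict[int, DataBlockEntry] = {}
    log_block_map: Dict[int, LogBlockEntry] = {}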

In order to accommodate this lesson, the write buffer manager in IWM uses the Aggressive Log block Usage (ALU) scheme along with CLC. To clarify, IWM behaves like CLC when the number of log blocks is small, while it aggressively uses the log block status for victim selection, in order to solicit log block re-references, when there are many log blocks. Since the CLC scheme effectively clusters pages within the write buffer, IWM is able to maintain good performance regardless of the number of log blocks.

Fig. 7 shows the movement of clusters among the cluster lists in the write buffer. The H-cluster list corresponds to the Size-independent LRU cluster list in CLC. When a new cluster is created, it is stored in the H-cluster list. If the cluster becomes a sequential and complete one (i.e. if it contains all N_p pages sequentially), it is moved to the S-cluster list regardless of its current position in the H-cluster list. Since an S-cluster not only has a very low probability of re-referencing [27], but also invokes the Switch merge operation, which is the cheapest merge operation, when it is stored in a log block, the S-cluster is the best candidate for a victim. When a cluster is evicted from the H-cluster list, it goes to either the L-cluster list or the Size-dependent cluster lists according to the existence of a corresponding log block, which can be identified by the existence of its LBN within the Log Block Map. The cluster moves to the H-cluster list again if a page write request occurs to a cluster in either the L-cluster list or the Size-dependent cluster lists. A cluster in the L-cluster list whose corresponding log block is selected as a victim for merge is moved to the Size-dependent cluster lists.

When the S-cluster list is empty, a victim is selected from the L-cluster list. Since the victim cluster already has a corresponding log block, pages in the victim cluster are written to that log block, which is a log block re-reference. If both the S-cluster and L-cluster lists are empty, a victim is selected from the Size-dependent cluster lists; in this case, IWM behaves exactly the same as CLC. The size of the H-cluster list is limited to 20% of the total number of clusters in the write buffer, while the other lists have no constraints on their sizes. (In [16], CLC showed the best performance when the size of the Size-independent LRU cluster list was limited to 10%. However, while [16] used SLC NAND, we assume MLC NAND, whose block size is four times larger; with 10% of the entire write buffer, the H-cluster list might not be able to hold even one page cluster when the write buffer is small, so we use 20% in this work.) Hence, when the number of log blocks is small, the size of the L-cluster list becomes small and that of the Size-dependent cluster lists becomes large. (The number of S-clusters is independent of the number of log blocks; it depends only on the write request pattern from the host, and, since the size of the H-cluster list is limited, the size of the Size-dependent cluster lists necessarily increases when that of the L-cluster list decreases.) In this case, most of the victim clusters are selected from the Size-dependent cluster lists, which makes IWM behave like CLC, which shows a good clustering effect. In contrast, when there are many log blocks, the size of the L-cluster list becomes large. In this case, a large portion of victims are selected from the L-cluster list, which results in a high log block re-reference ratio. In this way, IWM accommodates Lesson 1 via a dynamic behavior change in accordance with the number of log blocks.
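The following short Python sketch (our illustration, with assumed container choices; not the authors' code) summarizes the ALU victim-selection order just described: sequential clusters first, then clusters that already have a log block, then the largest Size-dependent cluster as in CLC.

    from collections import OrderedDict, deque

    NP = 128  # pages per block, as in Table 1

    class ALUBuffer:
        def __init__(self):
            self.h_list = OrderedDict()   # hot clusters in LRU order (the 20% cap is not enforced here)
            self.s_list = deque()         # FIFO of (lbn, pages) for sequential, complete clusters
            self.l_list = deque()         # FIFO of (lbn, pages) for clusters whose block has a log block
            self.size_lists = [OrderedDict() for _ in range(NP + 1)]  # cluster size -> {lbn: pages}

        def select_victim(self):
            if self.s_list:               # 1) S-cluster: destage with a cheap Switch merge
                return self.s_list.popleft()
            if self.l_list:               # 2) L-cluster: destaging re-references its log block
                return self.l_list.popleft()
            for size in range(NP, 0, -1): # 3) otherwise behave like CLC: evict the biggest
                if self.size_lists[size]: #    cluster (oldest entry in its list)
                    return self.size_lists[size].popitem(last=False)
            return None                   # only H-clusters remain in the write buffer

With few log blocks the L-cluster list stays short, so step 3 dominates and IWM reproduces CLC's clustering behaviour; with many log blocks step 2 dominates and re-references become frequent, which is exactly the adaptation ALU aims for.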

Fig. 7. Cluster movement among various cluster lists in IWM.

and a negative effect when used with other log blocks. Lesson 2 requires the adaptive use of block padding, considering the probability of future log block re-referencing. In order to accommodate Lesson 2, the log block manager in IWM uses the Adaptive Partial Padding (APP) scheme. APP consists of two sub-schemes: Partial Padding (PP) and Adaptive Padding (AP). Partial Padding is designed to decrease the padding overhead of the block padding scheme, while Adaptive Padding is designed to determine whether or not to use Partial Padding.

3.2.1. Partial Padding (PP)

Block Padding, used in BPLRU, incurs a number of valid page reads/writes in order to make a new data block without using a merge operation. From an implementation point of view, however, it consists of a number of valid page copies and a Switch merge. While the Switch merge costs much less than other merge operations, the large number of valid page copies is the burden of the block padding scheme. In order to decrease this overhead, IWM uses the Partial Padding (PP) scheme. While Block Padding copies all pages absent from the victim cluster, Partial Padding copies only part of them; specifically, it copies only those pages that have smaller page numbers than the largest page number in the victim cluster. In detail, it reads the pages that satisfy this condition and then sequentially writes them, together with the pages in the victim cluster, to the log block. The log block then becomes a sequential one, which can be merged via the Partial merge operation, which is much cheaper than the Full merge operation. Fig. 8 shows an example of Partial Padding. The victim cluster contains three pages (0, 2, and 3), and a new log block is assigned to the corresponding data block. While five pages (1, 4, 5, 6, and 7) are absent from the victim cluster, Partial Padding reads only page 1, which is the only page satisfying the condition, from the data block and sequentially writes four pages to the log block.

The effectiveness of Partial Padding depends on whether the log block is re-referenced before it is merged. If the log block is re-referenced, it might become a non-sequential block, as shown in Fig. 8. Then, when the log block is selected as a victim log block, it is merged through a Full merge operation, which is costly. Hence, in this case, the cost of the page copies during Partial Padding (i.e. one page copy in the case of Fig. 8) becomes an additional overhead, which gives Partial Padding a negative effect. Nevertheless, it is notable that the additional overhead of Partial Padding is much less than that of Block Padding in this case (i.e. five page copies in the case of Fig. 8). If the log block is selected as a victim without re-referencing, the Partial merge operation can be used. In this case, Partial Padding shows a positive effect, and the overall cost, which consists of the page padding cost (i.e. one page copy) and the Partial merge cost (i.e. four page copies and one block erasure), becomes the same as that of the Block Padding scheme, which consists of the page padding cost (i.e. five page copies) and the Switch merge cost (i.e. one block erasure).


Fig. 8. Overhead of partial padding.

As a result, the overall cost of Partial Padding is less than or equal to that of Block Padding. The overall comparison among the Partial Padding cost (C_PP), the Block Padding cost (C_BP), and the Full Merge cost (C_FM, using no padding scheme) can be summarized as follows: C_PP = C_BP < C_FM when there is no log block re-reference, and C_FM < C_PP < C_BP when there is a log block re-reference.

3.2.2. Adaptive Padding (AP)

Most SSDs use a Hot/Cold identification scheme (mainly for wear-leveling) that divides data blocks into hot blocks and cold blocks. IWM uses Hot/Cold identification to determine whether or not to use page padding. Log blocks of hot data blocks have a higher probability of re-referencing than those of cold data blocks. Hence, if we use page padding with cold data blocks only, the overall benefit of the padding scheme can increase. However, it is very difficult to determine whether a data block is hot or cold, because this is affected by many factors such as workload characteristics, write buffer size, and the number of log blocks, among others. IWM uses Adaptive Padding (AP), which anticipates future log block re-referencing by examining the access history of the log block and then determines whether or not to use padding. The Adaptive Padding scheme is designed in accordance with the following three design criteria:

1. Do not use padding for victim clusters with corresponding log blocks.
2. If there have been one or more re-references to the last log block of a data block, then do not use padding when making a new log block for the data block.
3. If there has been no re-reference to the last log block of a data block, use padding when making a new log block for the data block.

To implement Adaptive Padding, IWM uses two flags (the P-Flag in the Data Block Map and the R-Flag in the Log Block Map), and padding is used only when the P-Flag is 1. Fig. 9 shows the state transition diagrams of the two flags. The initial value of the P-Flag is 1, which means that padding is used unconditionally for the first log block of every data block. The R-Flag is initially set to 0 when a new log block is made. When a re-reference occurs to a log block, the R-Flag of the log block is set to 1 and the P-Flag of the corresponding data block is set to 0. When a log block is merged, IWM examines the R-Flag of the log block. If the R-Flag is 0, which means that the log block has not been re-referenced, the P-Flag of the data block is set to 1 so as to use padding at the next log block creation. If the R-Flag is 1, which means that the log block has been re-referenced, the P-Flag, which has been set to 0 when the log block re-reference occurred, retains its value so that padding is not used at the next log block creation.

Fig. 9. State transition diagrams of the R-Flag and P-Flag (L/B: Log Block). The R-Flag starts at 0 when a log block is first created and moves to 1 on a log block re-reference; the P-Flag moves from 1 to 0 on a re-reference and returns to 1 only when a log block whose R-Flag is 0 is merged.
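For concreteness, here is a minimal Python sketch of the R-Flag/P-Flag bookkeeping shown in Fig. 9, together with the Partial Padding page-selection rule from Section 3.2.1; the flag semantics follow the paper, while the class layout and helper function are our own illustrative assumptions.

    class AdaptivePaddingFlags:
        """Per-data-block P-Flag and per-log-block R-Flag, as in Fig. 9."""
        def __init__(self):
            self.p_flag = {}   # data block LBN -> 1: pad the next new log block (initial value)
            self.r_flag = {}   # log block (keyed by its data block LBN) -> 1: re-referenced

        def on_log_block_created(self, lbn):
            self.r_flag[lbn] = 0                  # a fresh log block has not been re-referenced
            return self.p_flag.get(lbn, 1) == 1   # pad only while the P-Flag is (still) 1

        def on_log_block_rereferenced(self, lbn):
            self.r_flag[lbn] = 1                  # remember the re-reference
            self.p_flag[lbn] = 0                  # and stop padding this block's log blocks

        def on_log_block_merged(self, lbn):
            if self.r_flag.pop(lbn, 0) == 0:      # never re-referenced: padding paid off,
                self.p_flag[lbn] = 1              # so pad again at the next log block creation
            # if it was re-referenced, the P-Flag stays 0 and the next log block is not padded

    def pages_to_pad(victim_pages):
        """Partial Padding: copy only the missing pages whose page number is smaller than
        the largest page number present in the victim cluster (cf. Fig. 8)."""
        top = max(victim_pages)
        return sorted(set(range(top)) - set(victim_pages))

    print(pages_to_pad({0, 2, 3}))   # [1], matching the example of Fig. 8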

When a victim cluster is selected for destaging to the flash memory, IWM first examines whether a corresponding entry exists within the Log Block Map. If it exists, IWM writes the pages in the victim cluster to the log block without padding. If it does not exist, IWM assigns a new log block to the corresponding data block and checks the P-Flag in the Data Block Map. If the P-Flag is 0, IWM writes the pages in the victim cluster to the new log block without padding. If the P-Flag is 1, IWM writes the pages to the log block with padding. In this way, Adaptive Padding estimates the future log block re-reference based on recent history and uses padding only when a future log block re-reference is not expected. Adaptive Padding enables IWM to decide whether or not to use padding independently of the various other conditions, such as workload characteristics, write buffer size, and the number of log blocks, that affect the performance of the padding scheme.

3.3. IWM-FAST

Besides BAST, IWM can be implemented over other FTLs that use log blocks. The objective of IWM is to make the write buffer and the log block management schemes collaborate effectively with each other. Since the log block management scheme differs from FTL to FTL, the ALU and APP schemes in IWM must be implemented differently for each FTL; however, their rationales, which have been derived from the motivations above, remain the same. In this section, we present IWM-FAST, which is an implementation of IWM over FAST. Since FAST does not dedicate a log block to a specific data block, we modified IWM so that it can effectively exploit the log block information in FAST.

Log block utilization is not an issue in FAST, since it shares random log blocks among all data blocks. However, in terms of the merge overhead, log block re-references can decrease the number of merges in FAST. For example, assume that some of the most recent pages for a data block D_A are stored in the write buffer and some are in a log block L_A. When L_A is merged, valid pages from both D_A and L_A are collected into a new data block, D_A^new.


However, D_A^new still contains old versions of the pages whose more recent versions are stored in the write buffer. After those recent pages are evicted to the log block(s), D_A^new will eventually be merged, too. On the other hand, if we evict the recent pages in the write buffer to the log block(s) before the merge, the new data block stores all the recent pages, which does not necessitate an additional merge. Therefore, the log block re-reference is also an issue in FAST in terms of the merge overhead. We modified the L-cluster in ALU such that it accommodates every cluster for which one or more pages from the same data block remain in a log block. Victim clusters from the L-cluster list can benefit from the aforementioned effect. To implement the modified L-cluster, we need to identify those blocks of which one or more pages are stored in a log block. For that purpose, we replaced one of the shared data structures, the Log Block Map, with a new data structure, the Evicted Block List, which maintains the list of such data blocks. The Evicted Block List is a list of ⟨LBN, IDX, R-Flag⟩ triples. LBN and IDX denote the logical block number of such a data block and the index into the log block mapping table of FAST, respectively. The R-Flag is used for Adaptive Partial Padding.

The Adaptive Partial Padding scheme performs partial padding for those blocks that have a low re-reference probability. Partial padding gives cold blocks more chance of a Partial merge, and this rationale is effective for FAST, too. If we store pages from a cold (i.e., having a low re-reference probability) block, after partial padding, in a log block separate from the random log blocks, that log block has a better chance of a Partial merge. To make this possible, we modified both the Adaptive Padding scheme in IWM and the log block management scheme in FAST. FAST uses one sequential log block (S-block) and multiple random log blocks (R-blocks). In IWM-FAST, instead of the S-block, we use a new type of log block, called the Padding log block (P-block). Fig. 10 shows the APP scheme in IWM-FAST. The P-block is a block-associative log block and is changed to an R-block when re-referenced. For a victim cluster whose corresponding LBN does not exist in the Evicted Block List, we use padding and store it in a P-block, provided that both its P-Flag is 1 and the number of pages that should be padded is smaller than N_p/2. The padding amount condition is applied to prevent excessive padding overhead. After that, if the P-block is re-referenced, the log block is changed to an R-block and the P-Flag of the corresponding data block is unset. Then we do not use padding for the subsequent victim clusters of that data block, and we store those victim clusters in an R-block. The P-Flag of the data block is set again when the data block is merged and its R-Flag is 0. If there was no re-reference to the P-block until it is merged, a Partial (or Switch) merge occurs between the log block and the corresponding data block. A victim cluster whose corresponding LBN exists in the Evicted Block List is stored in an R-block without padding. At that time, the corresponding P-block, if it remains unchanged, is changed to an R-block (a P-block re-reference), and the R-Flag of the victim cluster is set to 1.
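The destage decision just described can be summarized by the following Python sketch; it is our illustration of the rule, with append_to_r_block and write_p_block as assumed helper routines (they are not part of the paper), and N_p = 128 as in Table 1.

    NP = 128  # pages per block (Table 1)

    def destage_victim(lbn, victim_pages, evicted_blocks, p_flag, ftl):
        """Route a victim cluster to a P-block (with partial padding) or an R-block."""
        if lbn in evicted_blocks:
            # Some pages of this data block already sit in a log block: no padding.
            ftl.append_to_r_block(lbn, victim_pages)   # assumed helper
            evicted_blocks[lbn].r_flag = 1             # remember the re-reference
            # (if the earlier pages were in a P-block, it is converted to an R-block here)
            return
        # Pages partial padding would copy: the missing page numbers below the largest one.
        pad_amount = max(victim_pages) + 1 - len(victim_pages)
        if p_flag.get(lbn, 1) == 1 and pad_amount < NP // 2:
            ftl.write_p_block(lbn, victim_pages, pad_amount)  # assumed helper: new P-block
        else:
            ftl.append_to_r_block(lbn, victim_pages)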

Fig. 10. Adaptive Partial Padding in IWM-FAST (victim clusters are written either to a P-block after Partial Padding, which is freed via a Partial merge if it is never re-referenced, or to R-blocks, which are reclaimed via Full merges; a re-referenced P-block becomes an R-block).

The number of P-blocks can vary over time. Whenever a victim cluster is destaged with partial padding, a new P-block is made, and when a P-block is either merged or re-referenced, the number of P-blocks decreases. We limited the maximum number of P-blocks to half of the total number of log blocks, in order to prevent log block thrashing when the total number of log blocks is small.

4. Experiments

This section shows the performance of IWM via extensive simulation. We used MLC NAND flash memory whose characteristics are shown in Table 1. The page size of the write buffer is set to 4 KB in order to align with that of the flash memory. The write buffer size varies between 2 MB and 32 MB, and the number of log blocks varies between 8 and 512. The nine workloads in Table 2 were used for simulation. In the table, active pages and page writes mean the total number of written pages and write requests, respectively. Fig. 11 shows the access characteristics of each workload. In the figure, the X-axis (Locality) represents the hit ratio of the write buffer when an 8 MB buffer and a cluster-based LRU policy are used, and the Y-axis (Sequentiality) is the ratio of sequentially aggregated clusters to all clusters under the same condition. We modified CLC and BAST to implement the write buffer manager and the log block manager, respectively, in IWM; the original data structures of CLC and BAST were replaced by the Shared Data Structures of IWM. We compared IWM with previous write buffer management schemes, such as CLC, FAB, REF and BPLRU, and used BAST as the log block management scheme for all of them.

We used write amplification, erase count, and execution time as performance metrics. Write amplification is the ratio of the number of pages physically written to the flash memory to the number of pages requested to be written to the flash memory via the file system. Denoting the number of write-requested pages, the number of page hits in the write buffer, and the number of extra write operations as N_R, N_H, and N_E, respectively, we can formulate the write amplification (WA) as WA = (N_R − N_H + N_E) / N_R (a small numerical example is given after Table 1 below). The erase count is one of the most important performance factors for an SSD, since it determines the SSD's life span. Execution time is the sum of the execution times of every operation during the workload execution.

4.1. Write Amplification and Erase Count

We compared the write amplification and erase count of each scheme for each workload. In this paper, we only show the results for the two extreme cases, when the number of log blocks is 8 and 512, respectively. By comparing these two extreme cases, we can see the dramatic performance fluctuations of the competing schemes according to the number of log blocks. We can also verify their performance variations according to the write buffer size. Figs. 12–14 show the results. In order to show only the relative performance among the schemes, we plot only an affordable range of the Y-axis; hence, some results are positioned outside the presented Y-range. Note that the performance of BPLRU and PUD-LRU is not affected by the number of log blocks, since only one log block is used (for the Switch merge operation) in them. The other competing schemes show very different performances

Table 1. Characteristics of MLC NAND flash memory [25].

Page size   Block size   Read     Write    Erase    Erase cycle
4 KB        512 KB       150 μs   900 μs   1.5 ms   10,000
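As a quick numerical illustration of the write amplification metric defined in Section 4 (the counts below are hypothetical, not measured results):

    # WA = (N_R - N_H + N_E) / N_R, as defined above.
    def write_amplification(n_requested, n_buffer_hits, n_extra_writes):
        return (n_requested - n_buffer_hits + n_extra_writes) / n_requested

    # e.g. 1,000,000 requested page writes, 300,000 absorbed by the write buffer,
    # and 150,000 extra page writes caused by padding and merge operations:
    print(write_amplification(1_000_000, 300_000, 150_000))   # -> 0.85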


Table 2. Characteristics of each workload. Explorer, Office install and JPEG copy are from [26]; Kernel compile and Kernel untar were extracted from Linux using blktrace [2]; the Financial and TPC-C traces were downloaded from [3] and [4], respectively; and the FAT and NTFS traces were downloaded from [1].

Category            | Workload       | Description                                                                | Active pages | Page writes
Desktop application | Explorer       | Internet Explorer temporary files                                          | 98,738       | 541,247
Desktop application | Office install | Installing Office in NTFS                                                  | 364,435      | 419,220
Desktop application | JPEG copy      | Total 2 GB of (500–600 KB) JPEG files                                      | 552,096      | 581,750
Desktop application | Kernel compile | Compiling Linux kernel 2.6.27 in ext3                                      | 215,864      | 866,696
Desktop application | Kernel untar   | Extracting Linux kernel 2.6.27 in ext3                                     | 141,056      | 592,369
Enterprise          | Financial      | OLTP application running at a financial institution                        | 136,530      | 8,024,478
Enterprise          | TPC-C          | 20 warehouses, one client running 20 iterations                            | 522,092      | 8,344,121
Desktop (one month) | FAT            | Web surfing, e-mail sending/receiving, movie playing/downloading, typesetting | 1,927,778  | 7,460,024
Desktop (one month) | NTFS           | Web surfing, e-mail sending/receiving, movie playing/downloading, typesetting | 6,005,915  | 29,607,803

Fig. 11. Write pattern of each workload.

for each case. Overall, IWM outperforms the others in almost all cases, and the other schemes show different relative performances to each other. We can see that the block padding scheme used in BPLRU and PUD-LRU provides a very attractive benefit in terms of erase count: in many cases it yields a relatively good erase count even when its write amplification is poor. However, since BPLRU and PUD-LRU do not exploit the benefit of a large number of log blocks, they show relatively high write amplification when there are 512 log blocks. In the TPCC workload, their relative performance remains better than that of the others, except IWM, even in the 512 log blocks case; specifically, the performance of the other schemes did not increase considerably with a larger number of log blocks. Since not only is the locality of the TPCC workload extremely high, but the write requests are also issued collectively, most of the data in the write buffer is reused and, as a result, the probability of a log block re-reference is very low. For this case, BPLRU and PUD-LRU, which use the block padding scheme, show good performance. Since PUD-LRU not only exploits the block padding scheme of BPLRU but also considers the access frequency and the cluster size for victim selection, it outperforms BPLRU in most cases. It shows performance similar to IWM when the number of log blocks is small and the write buffer size is large. However, when the number of log blocks is large, it shows poor performance, like BPLRU. In many cases, CLC outperforms REF when the number of log blocks is small, while REF shows better performance with a large number of log blocks. However, in the JPEG, Kernel compile, and Kernel untar cases, CLC outperforms REF irrespective of the number of log blocks. Since those workloads show strong sequentiality, it is likely that the Switch merge operation is applied to victim clusters as soon as they are evicted from the write buffer to flash memory, which makes the 'Recently-Evicted-First' policy of REF, which is designed to induce log block re-references, inefficient. Also, though it is a rare case, an anomaly can occur in which REF shows an increase

After victim clusters are fully matured in the large-sized write buffer, they are switch-merged as soon as they are evicted from the write buffer. Since a log block is not made for the cluster, a subsequent victim cluster, selected by the 'Recently-Evicted-First' policy, has an even lower probability of benefiting from a log block hit. As a result, log block misses increase, necessitating a larger number of merge operations. We can see this phenomenon in Fig. 14; this issue can be prevented by using the exact log block status information (the Log Block Map structure) within the shared data structures of IWM. FAB shows the worst performance in almost all cases, and the reason is well explained in [16]. IWM consistently outperforms the others irrespective of the number of log blocks, the write buffer size, and the workload characteristics. IWM exploits both the clustering effect of CLC and the data padding effect of BPLRU when the number of log blocks is small, and, when there are a large number of log blocks, it uses log blocks aggressively based on the log block status information and effectively prevents the deficiency of block padding through APP.

4.2. Execution time

Fig. 15 shows the total workload execution time for each competing scheme, normalized to that of IWM. Some outlying values, such as those of BPLRU in the Financial workload, are not depicted in the graphs in order to focus on the behavior of the remaining values. While an erase operation is far more expensive than a write operation, the total execution time varies in accordance with the write amplification because the number of write operations is much larger than the number of erase operations. In general, BPLRU and PUD-LRU show increasing relative performance as the write buffer size increases, since the block padding overhead decreases as victim clusters can grow larger in a larger write buffer. Conversely, the other schemes show a decrease in relative performance as the write buffer size increases, which means that they did not exploit the write buffer as efficiently as IWM. For most cases, the execution time of IWM is less than those of the other schemes. While the other schemes rank very differently relative to each other, depending on the number of log blocks, write buffer size, and workload characteristics, IWM achieves first rank in most cases. Based on these results, we can conclude that IWM adapts well to these factors while providing outstanding performance in comparison to the others.

4.3. Detailed analysis of IWM

We verified the detailed performance characteristics of IWM. We measured the execution time of IWM for each case in which one of the three schemes used in IWM (i.e. ALU, PP, and AP) is not used.


Fig. 12. Write amplification and Erase count 1: In each subfigure, the upper two figures represent the 8 log blocks case and the lower two figures represent the 512 log blocks case.

The execution time for each case is normalized to that of the complete IWM. Fig. 16 shows the results. In the figure, PP + AP and ALU + PP represent the cases in which we do not use ALU and AP, respectively, and ALU means the case in which neither PP nor AP is used. When ALU is not used (the PP + AP case), the relative performance decreases as either the number of log blocks increases or the write buffer size decreases. This means that the effect of aggressive log block usage becomes larger in those cases. In the TPC-C, Kernel untar, and Kernel compile workloads, the contribution of ALU is minor since log block re-references do not occur frequently in those workloads. When AP is not used (the ALU + PP case), partial padding is applied whenever victim clusters are evicted from the write buffer. The performance behavior in this case is similar to that of the PP + AP case: as either the number of log blocks increases or the write buffer size decreases, log block re-references occur more frequently, which increases the partial padding overhead.

When neither PP nor AP is used (i.e. the ALU case), the performance behaves in the opposite direction to that of the ALU + PP case. In other words, the performance increases as either the number of log blocks increases or the write buffer size decreases. Interestingly, in the Financial, Explorer, and JPEG copy workloads, it sometimes shows even better performance than the complete IWM. For those workloads, log block re-references occur very frequently, and in that case it is better not to use block padding. However, when using AP, as described in Section 3.2.2, padding is used unconditionally for the first log block of every data block. This initial padding overhead occasionally makes the complete IWM worse than the ALU case for those workloads.

4.4. Performance of the IWM-FAST

In this section, we present the performance of the IWM implementation on FAST, IWM-FAST. Since FAST uses fully associative sector mapping for log blocks, it is one extreme case of the log block-based FTLs in terms of associativity.


Fig. 13. Write amplification and Erase count 2.

BAST is the other extreme case, because it uses block associativity for log blocks. Therefore, if IWM-FAST also outperforms the others, not only can our motivations be generalized to all log block-based FTLs in terms of associativity, but the design rationale of IWM can also be shown to be applicable to other log block-based FTLs. Fig. 17 shows the overall performance of IWM-FAST. The workload execution time for each competing scheme is normalized to that of IWM-FAST. IWM-FAST outperforms the others in most cases, as IWM did. The most important implication of this result is that our motivations, which were obtained from extensive experiments on the BAST FTL, are also reasonable for other log block-based FTLs. Even though the ALU and APP schemes in IWM and IWM-FAST differ slightly in their implementation, they share common design rationales that were derived from the motivations. Hence, Fig. 17 supports the generality of the motivations and strengthens the necessity of integrated buffer management based on them.
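To make the associativity contrast concrete, here is a minimal, illustrative sketch (not the actual BAST or FAST data structures): in BAST every log block is dedicated to a single data block, whereas in FAST a log block may hold pages from any data block. The names and the 128 pages-per-block figure (512 KB block / 4 KB page, Table 1) are assumptions.

```python
# Illustrative sketch only: the placement constraint that distinguishes
# block-associative (BAST-style) from fully associative (FAST-style) log blocks.

PAGES_PER_BLOCK = 128  # 512 KB block / 4 KB page (Table 1)

class BastLogBlock:
    """Block-associative: accepts pages of exactly one data block."""
    def __init__(self, data_block_id):
        self.data_block_id = data_block_id   # fixed owner data block
        self.pages = []                      # logical page numbers written so far

    def accepts(self, lpn):
        same_block = (lpn // PAGES_PER_BLOCK) == self.data_block_id
        return same_block and len(self.pages) < PAGES_PER_BLOCK

class FastLogBlock:
    """Fully associative: accepts pages from any data block while space remains."""
    def __init__(self):
        self.pages = []                      # logical page numbers in arrival order

    def accepts(self, lpn):
        return len(self.pages) < PAGES_PER_BLOCK
```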

5. Discussion

IWM uses data structures and concepts similar to those of prior related works, which is natural because IWM tries to subsume the strengths of prior schemes so that it performs well under any circumstances. For example, CLC and REF outperform each other when the number of log blocks is small and large, respectively. To subsume their strengths, IWM is designed to behave like CLC when the number of log blocks is small and like REF otherwise. This design objective led us to exploit the data structures of both schemes and effectively combine them in IWM: the Size-independent cluster (renamed the H-cluster in IWM to emphasize the hotness of the data in the cluster) and the Size-dependent cluster in CLC, and the Victim Block (VB) set (materialized as the L-cluster in IWM) in REF. The novelty of IWM lies not in its basic data structures but in the adaptiveness of its behavior to the changing environment. By effectively combining the data structures of prior schemes and defining detailed data movements among them, IWM achieves better performance than the others regardless of the number of log blocks.


Fig. 14. Write amplification and Erase count 3: Kernel untar.

In addition, there are important differences in the implementation of the data structures between IWM and prior works. For example, CLC does not treat sequential writes (64 consecutive page writes) separately; for a sequential cluster, CLC only moves it into the Size-dependent cluster. In IWM, the S-cluster is newly introduced to accommodate sequential clusters separately, and the victim cluster is selected from the S-cluster in FIFO order. The separation of the S-cluster from the Size-dependent cluster enables IWM to give the S-cluster higher priority for victim selection than the L-cluster, which in turn has higher priority than the Size-dependent cluster. There is also an important difference between the L-cluster in IWM and the VB set in REF, which makes IWM further outperform REF.
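As an illustration of this victim-selection priority, the following is a minimal sketch; the container names are hypothetical, and the internal policy of each list (e.g. how the Size-dependent victim is chosen) is abstracted away.

```python
# Minimal, hypothetical sketch of the victim-selection priority described above:
# S-cluster (FIFO) first, then L-cluster (clusters still resident in a log block,
# known via the shared Log Block Map), then the Size-dependent cluster list.
from collections import deque

class VictimSelector:
    def __init__(self):
        self.s_clusters = deque()       # sequential clusters, evicted in FIFO order
        self.l_clusters = deque()       # clusters whose data block still owns a log block
        self.size_dependent = deque()   # remaining cold clusters (CLC-style list)

    def select_victim(self):
        # 1) Sequential clusters: cheap to flush (switch/partial merge likely).
        if self.s_clusters:
            return self.s_clusters.popleft()
        # 2) Clusters already present in a log block: flushing them cannot
        #    raise the block associativity of that log block.
        if self.l_clusters:
            return self.l_clusters.popleft()
        # 3) Otherwise fall back to the Size-dependent list (policy abstracted).
        if self.size_dependent:
            return self.size_dependent.popleft()
        return None
```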


Since REF is not an integrated buffer management scheme, it cannot know the exact status of the log blocks. Its victim blocks are selected based on the number of pages in each block in the write buffer, without any information on the log blocks. Since REF uses page-unit victim eviction, the first victim page in a given victim block can increase the block associativity of the log block, but the subsequent victim pages from the same victim block are expected not to increase it further. However, this expectation frequently fails (log block miss) when the number of log blocks is small, because of frequent log block merges. On the other hand, since IWM uses shared data structures between the write buffer and the log blocks, it can exploit the exact status of the log blocks for victim selection. The L-cluster is introduced for that purpose: it contains only those clusters that remain in the log blocks. Hence, evicting a cluster from the L-cluster never increases the block associativity of the corresponding log block. Therefore, while the L-cluster and the VB set share the same design rationale, their implementations are entirely different from each other.

The Optimistic FTL, proposed in [16], is a write buffer-aware hybrid mapping scheme. It expects that the victim clusters evicted from the write buffer will become more and more sequential as the write buffer size increases. Then, by keeping log blocks always sequential, it can avoid costly full merge operations. To keep log blocks always sequential, it uses four log block management operations: append, data block switch, and two kinds of log block switch operations. These operations copy valid pages either from the data block or from the old log block to make a new sequential log block. Since only a subset of the valid pages, not all of them, is copied from the data block or the old log block, the valid page copy operations look similar to the Adaptive Partial Padding (APP) scheme in IWM. However, they are entirely different from each other. First, the Optimistic FTL does not use the traditional merge operations of hybrid FTLs (full, partial, and switch merges); it replaces them with its own four operations.

Fig. 15. Overall Performance: execution time is normalized to that of the IWM.


Fig. 16. Detailed Analysis of IWM: execution time is normalized to that of the complete IWM.

Fig. 17. Overall performance of the IWM-FAST.


On the other hand, the APP scheme in IWM assumes the traditional merge operations; it only tries to maximize the opportunity for partial merges while minimizing that for full merges. Therefore, while the Optimistic FTL is an independent FTL, APP is an add-on scheme that can be used in any hybrid mapping scheme. Second, the Optimistic FTL always keeps the log block sequential. To enforce the sequentiality of the log block, whenever a log block re-reference occurs it generates a new sequential log block (except in the append operation) by copying valid pages not only from the data block but also from the old log block. When log block re-references occur frequently, it shows poor performance due to the excessive valid page copy overhead for making a new sequential log block; the authors only note that the probability of log block re-reference decreases as the write buffer size increases. In contrast, the APP scheme permits non-sequential log blocks. As shown in Fig. 8, when a log block re-reference occurs, it writes the victim cluster to the original log block without any additional operation. It performs partial padding adaptively, so that padding is not used when a victim cluster is likely to be re-referenced in the log block. Also, when it performs partial padding, it copies valid pages only from the data block; valid pages in the log block are never copied for partial padding, which differs from the Optimistic FTL. Hence, IWM shows good performance regardless of the write buffer size.

6. Conclusion

Since the write buffer and log block management in SSDs have been designed and implemented in separate layers, they have not been able to share critical information that could have been valuably used for performance improvement. However, there is no reason why they should be functionally separated, because not only do both of them run within a single controller, but the log block is also conceptually another form of write buffer. Integrating them in a single layer provides much room for performance improvement. Through preliminary experiments, we found two important guidelines for write buffer and log block management, and we can abide by those guidelines by integrating the write buffer and log block management schemes. In this paper, we developed an example integrated write buffer management scheme, IWM, which manages both the traditional write buffer and the log blocks in a single layer. Experimental results showed that the proposed scheme outperforms the others, irrespective of the number of log blocks, write buffer size, and workload characteristics. IWM showed an improvement in performance of up to 46% and 64% in comparison to CLC and REF, respectively. Also, IWM decreased the block erase count by up to 50%, as compared to the other schemes. The decreased block erase count can prolong the lifetime of an SSD, which is an important achievement, especially for an MLC NAND-based SSD.


Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean Government (No. 2011-0029181).

References

[1] Flash Memory Research Group, http://newslab.csie.ntu.edu.tw/flash/index.php?SelectedItem=Traces.
[2] Jens Axboe, Block IO Tracing, http://www.kernel.org/git/?p=linux/kernel/git/axboe/blktrace.git;a=blob;f=README.
[3] OLTP Trace from UMass Trace Repository, http://traces.cs.umass.edu/index.php/Storage/Storage.
[4] Trace Distribution Center, http://tds.cs.byu.edu.
[5] Aleph One Company, Yet Another Flash Filing System.
[6] M. Baker, S. Asami, E. Deprit, J. Ousterhout, M. Seltzer, A stack model based replacement policy for a non-volatile cache, in: Proc. IEEE Symp. Mass Storage Systems, March 2000, pp. 217–224.
[7] L.-P. Chang, T.-W. Kuo, An adaptive stripping architecture for flash memory storage systems of embedded systems, in: Proc. IEEE Eighth Real-Time and Embedded Technology and Applications Symp. (RTAS), September 2002.
[8] D. Woodhouse, JFFS: the journaling flash file system.
[9] B. Gill, D.S. Modha, WOW: wise ordering for writes combining spatial and temporal locality in non-volatile caches, in: Proc. USENIX Conf. File and Storage Technologies (FAST), December 2005.
[10] J. Gray, Tape is dead, disk is tape, flash is disk, RAM locality is king, in: Presentation at the CIDR Gong Show, Asilomar, CA, USA, December 2007.
[11] A. Gupta, Y. Kim, B. Urgaonkar, DFTL: a flash translation layer employing demand-based selective caching of page-level address mappings, in: Proc. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2009.
[12] J. Gray, B. Fitzgerald, Flash disk opportunity for server-applications, http://research.microsoft.com/Gray/papers/FlashDiskPublic.doc, 2007.
[13] H. Jo, J. Kang, S. Park, J. Kim, J. Lee, FAB: flash-aware buffer management policy for portable media players, IEEE Trans. Cons. Electr. (2006) 485–493.
[14] H. Jung, H. Shim, S. Park, S. Kang, J. Cha, LRU-WSR: integration of LRU and writes sequence reordering for flash memory, IEEE Trans. Cons. Electr. (2008) 1215–1223.
[15] J. Kang, H. Jo, J. Kim, J. Lee, A superblock-based flash translation layer for NAND flash memory, in: Proc. ACM Conf. Embedded Systems Software, October 2006.
[16] S. Kang, S. Park, H. Jung, H. Shim, J. Cha, Performance trade-offs in using NVRAM write buffer for flash memory-based storage devices, IEEE Trans. Comput. (2009) 744–758.
[17] H. Kim, S. Ahn, BPLRU: a buffer management scheme for improving random writes in flash storage, in: Proc. USENIX Conf. File and Storage Technologies (FAST), 2008.
[18] J. Kim, J.M. Kim, S.H. Noh, S.L. Min, Y. Cho, A space-efficient flash translation layer for compact flash systems, IEEE Trans. Cons. Electr. (2002) 366–375.
[19] S. Lee, D. Park, T. Chung, D. Lee, S. Park, H. Song, A log buffer based flash translation layer using fully associative sector translation, ACM Trans. Embedded Comput. Syst. (2007).
[20] M-Systems, Memory Translation Layer for NAND Flash (NFTL).
[21] Nokia & University of Szeged, UBIFS – UBI File-System, http://www.linux-mtd.infradead.org/doc/ubifs.html.
[22] NTT, New Implementation of a Log-structured File System, http://www.nilfs.org/en/about_nilfs.html.
[23] ORACLE, B-tree FS, http://btrfs.wiki.kernel.org.
[24] S. Park, D. Jung, J. Kang, J. Kim, J. Lee, CFLRU: a replacement algorithm for flash memory, in: Proc. International Conference on Compilers, Architecture and Synthesis for Embedded Systems, Seoul, Korea, October 2006.
[25] Samsung Elec., 2Gx8 Bit NAND Flash Memory (K9GAG08U0M-P), 2006.
[26] D. Seo, D. Shin, Recently-evicted-first buffer replacement policy for flash storage devices, IEEE Trans. Cons. Electr. (2008) 1228–1235.
[27] S. Lee, D. Shin, Y. Kim, J. Kim, LAST: locality-aware sector translation for NAND flash memory-based storage systems, ACM SIGOPS Operat. Syst. Rev. (2008).
[28] S. Watanabe, Solaris 10 ZFS Essentials, Prentice Hall, 2009.
[29] Y. Ou, T. Härder, P. Jin, CFDC: a flash-aware replacement policy for database buffer management, in: Proc. International Workshop on Data Management on New Hardware (DaMoN), Providence, Rhode Island, June 2009.
[30] J. Hu, H. Jiang, L. Tian, L. Xu, PUD-LRU: an erase-efficient write buffer management algorithm for flash memory SSD, in: Proc. International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), Miami, FL, August 2010.
[31] H. Kwon, E. Kim, J. Choi, D. Lee, S.H. Noh, Janus-FTL: finding the optimal point on the spectrum between page and block mapping schemes, in: Proc. ACM International Conference on Embedded Software (EMSOFT), Scottsdale, AZ, October 2010.
[32] Q. Wei, B. Gong, S. Pathak, B. Veeravalli, L. Zeng, K. Okada, WAFTL: a workload adaptive flash translation layer with data partition, in: Proc. IEEE Symposium on Mass Storage Systems and Technologies (MSST), May 2011.

Sooyong Kang received his BS degree in Mathematics and the MS and Ph.D degrees in Computer Science from Seoul National University, Seoul, Korea, in 1996, 1998, and 2002, respectively. He is currently with the Division of Computer Science and Engineering, Hanyang University, Seoul. His research interests include Operating Systems, Multimedia Systems, Storage Systems, Flash Memories and Next Generation Nonvolatile Memories, and Distributed Computing Systems.


Sungmin Park received his BS degree in Computer Science Education and MS degree in Electronics and Computer Engineering from Hanyang University in 2005 and 2007, respectively. He is currently a Ph.D candidate at the School of Electronics and Computer Engineering, Hanyang University. His research interests include Operating Systems and Flash memory-based Storage Systems.

Jaehyuk Cha received his BS, MS and Ph.D degrees in computer science, all from Seoul National University (SNU), Seoul, Korea, in 1987, 1991 and 1997, respectively. He worked for Korea Research Information Center (KRIC) from 1997 to 1998. He is now an associate professor of the Division of Computer Science and Engineering, Hanyang University. His research interests include XML, DBMS, Flash memory-based Storage System, Multimedia Contents Adaptation, and e-Learning.