
ARTICLE IN PRESS

JID: MICPRO

[m5G;February 13, 2017;13:32]

Microprocessors and Microsystems 0 0 0 (2017) 1–12

Contents lists available at ScienceDirect

Microprocessors and Microsystems journal homepage: www.elsevier.com/locate/micpro

A workload-aware flash translation layer enhancing performance and lifespan of TLC/SLC dual-mode flash memory in embedded systems

Duo Liu a,b,c,∗, Lei Yao c, Linbo Long e, Zili Shao d, Yong Guan a,b

a Beijing Advanced Innovation Center for Imaging Technology, Beijing, China
b College of Information Engineering, Capital Normal University, Beijing, China
c College of Computer Science, Chongqing University, Chongqing, China
d Department of Computing, The Hong Kong Polytechnic University, Hong Kong
e College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, China

Article info

Article history: Received 14 February 2016; Revised 24 November 2016; Accepted 29 December 2016; Available online xxx

Keywords: Flash translation layer; TLC flash memory; Genetic algorithm

Abstract

Similar to traditional NAND flash memory, triple-level cell (TLC) flash memory is used as secondary storage to meet the fast-growing demand for storage capacity. TLC flash memory offers attractive features such as shock resistance, high density, low cost, non-volatility and low access latency. However, compared to single-level cell (SLC) flash memory, it also has extra limitations, such as write disturbance, lower performance and very limited program/erase cycles.

In this paper, we propose a workload-aware flash translation layer, named Balloon-FTL, for TLC/SLC dual-mode flash memory, to improve the performance and lifespan of the system. We first build a workload identifier module with a genetic algorithm to dynamically allocate TLC/SLC capacity based on different workloads, and to produce a suitable data allocation that achieves a balanced write distribution in flash memory with low memory access cost. The basic idea is to classify metadata and user data according to their access patterns, and to allocate low-latency SLC-mode blocks to write-intensive metadata and high-density TLC-mode blocks to the large quantity of user data. We then propose a special hybrid mapping strategy for TLC/SLC dual-mode flash memory to improve performance. Experimental results show that Balloon-FTL can effectively improve the performance and lifespan of TLC/SLC dual-mode flash memory in embedded systems.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

NAND flash memory has been widely adopted in embedded systems due to its attractive features, such as non-volatility, low standby power, high density and shock resistance [1–3]. In particular, triple-level cell (TLC) flash memory provides much higher density and lower cost than single-level cell (SLC) and multi-level cell (MLC) flash memory. However, TLC flash memory also suffers from longer read/write latency and much lower endurance than its counterparts. Table 1 compares the read/write performance and endurance of SLC, MLC and TLC flash memory. As shown, TLC flash memory is about 10× slower than SLC flash memory and endures about 40× fewer program/erase (P/E) cycles [4]. In order to achieve a balance between performance and capacity, recent advances in TLC flash memory offer the ability to switch flash blocks between TLC mode and SLC mode [5]. SLC mode provides shorter access latency with smaller capacity, while TLC mode provides larger capacity with longer access latency.

Table 1
Comparisons of flash chips.

Attributes              SLC          MLC          TLC
Read latency            25 μs        50 μs        250 μs
Write/Program latency   200 μs       1400 μs      2700 μs
P/E cycles              100,000      10,000       2500
Density                 1 bit/cell   2 bits/cell  3 bits/cell
Cost                    Expensive    Middle       Cheap
Main usage              Industrial   Enterprise   Consumer

Various studies have proposed different flash translation layer (FTL) designs for managing flash memories [6–13]. Nevertheless, most previous work focuses on SLC or MLC flash memory and does not fully consider the characteristics of TLC/SLC dual-mode flash memory. Therefore, this paper focuses on improving the performance and endurance of TLC/SLC dual-mode flash memory by re-designing the flash translation layer.

In TLC/SLC dual-mode flash memory, SLC-mode blocks are employed to store small and frequently accessed data, e.g., the flash memory mapping table, while TLC-mode blocks, with their high density, are employed to store large and infrequently updated user data. Existing flash translation layer designs are not suitable for TLC/SLC dual-mode flash memory for the following reasons: 1) different applications exhibit different access patterns, and achieving a balance between performance, capacity and lifespan poses challenges for current FTL schemes; 2) hot/cold data identification in existing FTL designs remains a challenging issue; 3) the performance of TLC blocks is poor, so how to store the FTL metadata is a critical problem. These observations motivate us to propose a new FTL technique for TLC/SLC dual-mode flash memory and improve its performance.

In this paper, we propose a workload-aware flash translation layer, named Balloon-FTL, for TLC/SLC dual-mode flash memory in embedded systems, to improve the performance and lifespan of the system. Balloon-FTL consists of two modules: a workload identifier and a hybrid mapping mechanism. The workload identifier dynamically determines the TLC/SLC capacity based on the workloads of different applications. We design two workload-aware strategies for different situations: a mathematical formulation for low-latency environments, and a genetic algorithm for environments with a long use time. The hybrid mapping mechanism maps pages at distinct granularities in TLC/SLC dual-mode flash memory according to different I/O workloads. To achieve this, page-level mapping is adopted to handle random and hot sequential requests, while block-level mapping is adopted to handle cold sequential requests. Evaluation results show that Balloon-FTL can effectively improve system performance and lifespan.

∗ Corresponding author at: Beijing Advanced Innovation Center for Imaging Technology, Beijing, China. E-mail address: [email protected] (D. Liu).

http://dx.doi.org/10.1016/j.micpro.2016.12.009 0141-9331/© 2017 Elsevier B.V. All rights reserved.

Please cite this article as: D. Liu et al., A workload-aware flash translation layer enhancing performance and lifespan of TLC/SLC dual-mode flash memory in embedded systems, Microprocessors and Microsystems (2017), http://dx.doi.org/10.1016/j.micpro.2016.12.009

In summary, this paper makes the following contributions:
• We propose a workload identifier module that dynamically allocates TLC/SLC capacity based on different application workloads to achieve a balanced write distribution in flash memory, improving the lifespan of the whole system with low memory access cost.
• We propose a hybrid FTL mapping mechanism with page-level mapping for random and hot sequential requests and block-level mapping for cold sequential requests, chosen according to I/O workloads, to improve system performance.
• We develop a simulator to evaluate Balloon-FTL for TLC/SLC dual-mode flash memory in embedded systems and conduct a series of experiments to evaluate its effectiveness with a set of realistic I/O traces.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 gives the background of TLC flash memory and representative FTL implementations. Section 4 introduces the detailed design of Balloon-FTL. Section 5 presents the experimental results. Finally, Section 6 concludes this work.

2. Related work

In this section, we present related work on flash translation layers for dual-mode flash memory. Much work has been done to exploit both the high density of MLC and the strong endurance of SLC in FTL design. Dongkun Shin et al. proposed an FTL for MLC flash memory called ComboFTL [14]. By exploiting the SLC mode of MLC flash memory, ComboFTL manages a small SLC region for hot data and a large MLC region for cold data, and separates cold and hot data based on the size of write requests. Meanwhile, Muthukumar Murugan

Fig. 1. Mode switching between TLC-mode and SLC-mode in dual-mode flash memory. [Figure: distributions labelled "number of bits" over VTotal, for SLC (states 0 and 1) and TLC (states 000 to 111).]

et al. proposed a new architecture called Hybrot [15], which aims to improve performance in hybrid SLC/MLC devices while ensuring maximum lifetime for the flash blocks. Unlike ComboFTL, Hybrot does not treat the SLC region as a buffer for the MLC region. On the PCM platform, a simple and effective SLC/MLC page management strategy called Balloonfish was proposed to exploit both the larger memory capacity of MLC and the shorter access latency of SLC [16]. According to the activity of each memory page, Balloonfish dynamically allocates SLC pages to active pages and uses MLC pages to store inactive pages, in order to balance capacity and speed. However, these works target MLC devices. In TLC flash memory systems, both lifespan and access latency are more constrained, so a small but more sophisticated FTL might be more practical for TLC flash memory. Therefore, at the software level, we propose Balloon-FTL for TLC/SLC dual-mode flash memory, to improve the performance and lifespan of the system. The technique proposed in this paper needs no hardware or system support, and introduces an acceptable memory access overhead.

3. Background and motivation

In this section, we first introduce the background of TLC flash memory and the genetic algorithm. Then we introduce representative flash translation layer schemes. Finally, we discuss the motivation of our work.

3.1. TLC flash memory

NAND flash memory has been widely adopted in embedded systems [17–19], and it is usually classified into two types: single-level cell (SLC) and multi-level cell (MLC) [16,20,21]. One SLC memory cell stores one bit, while one MLC memory cell stores two or more bits. Recently, triple-level cell (TLC) flash memory has been widely used for large-scale data storage, as it stores three bits per memory cell.
This means each TLC flash cell has a total of eight different states (from 000 to 111). Therefore, TLC flash is a better fit for large-scale storage systems with large capacity demands. However, TLC flash memory exhibits long access latency and limited write endurance compared to SLC or MLC flash memory. As a compromise that overcomes these limitations, new technology provides the ability to switch cells between SLC mode and TLC mode. Thus, capacity, performance and endurance can be balanced by utilizing the large capacity of TLC mode and the high performance of SLC mode. Fig. 1 shows the mode switching of TLC/SLC dual-mode flash memory. After initialization, the device is ready to operate in TLC mode. To operate in SLC mode, the DFh command must be added in front of the normal command. For example, in the case of a page


Fig. 2. Page-level FTL.

Fig. 3. Block-level FTL.

program operation, if the page is to be programmed in SLC mode, the DFh command must be added in front of the program operation. Once the device enters SLC mode, it can be used reliably without additional DFh commands until it meets the TLC mode termination command DAh [5,15].

A NAND flash memory storage device usually consists of multiple flash chips, and each chip is composed of multiple planes. Each plane contains many blocks, each of which is made up of a fixed number of pages [22]. A block is the basic unit for erase operations and can only endure a limited number of erases, while a page is the smallest unit for read/write operations. In addition, a flash memory page usually contains 4096 or 8192 bytes of user area for data storage, and 64 or 128 bytes of spare area used to store housekeeping information such as the error detection code (EDC), the error correction code (ECC), and the corresponding logical page/block addresses. Due to the write-once property, a page cannot be overwritten unless its residing block is erased. As a result, data are usually updated out of place. To manage flash memory, a flash translation layer is adopted to emulate a block device interface for file systems. The FTL employs address mapping tables to control access to flash memory; these tables directly determine whether data can be accessed correctly.

3.2. Flash translation layer

The FTL emulates a flash memory system as a block device so that file systems can access the flash memory transparently. An FTL usually includes three components: an address translator, a garbage collector, and a wear-leveler [23,24]. Address translation is the basic function of the FTL, i.e., translating logical addresses from upper layers to physical addresses on the flash by means of the FTL mapping table. According to the mapping granularity, an FTL can be implemented in three different ways: page-level FTL, block-level FTL and hybrid FTL [20]. Page-level FTL performs fine-grained address translation.
As shown in Fig. 2, each entry in the mapping table represents a mapping between an LPN (logical page number) and the corresponding PPN (physical page number). Each time a logical page is updated, a free physical page is allocated to accommodate the new data and the corresponding entry of the page-level mapping table is updated. When a physical block is selected as a victim during garbage collection, all valid pages that it contains are copied to a free block, and the page-level mapping table is updated accordingly.
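The out-of-place update behaviour of page-level mapping described above can be illustrated with a minimal sketch. The class name `PageLevelFTL`, its fields and the absence of garbage collection are assumptions made for illustration only; this is not the implementation used in the paper.

```python
class PageLevelFTL:
    """Toy page-level mapping (hypothetical): LPN -> PPN with out-of-place updates."""

    def __init__(self, num_physical_pages):
        self.mapping = {}                           # LPN -> PPN
        self.valid = [False] * num_physical_pages   # validity flag per physical page
        self.next_free = 0                          # next never-written physical page

    def write(self, lpn):
        """Serve a write to logical page `lpn` and return the PPN that now holds it."""
        if self.next_free >= len(self.valid):
            raise RuntimeError("no free page left; garbage collection would be needed")
        old_ppn = self.mapping.get(lpn)
        if old_ppn is not None:
            self.valid[old_ppn] = False             # the old copy becomes invalid
        ppn = self.next_free
        self.next_free += 1
        self.valid[ppn] = True
        self.mapping[lpn] = ppn                     # update the mapping table entry
        return ppn

    def read(self, lpn):
        """Translate a logical page number into its current physical page number."""
        return self.mapping[lpn]


ftl = PageLevelFTL(num_physical_pages=8)
first = ftl.write(5)    # first version of logical page 5
second = ftl.write(5)   # the update goes to a new physical page
```

Updating the same logical page twice consumes two physical pages and leaves one invalid page behind, which is the space that garbage collection later reclaims.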

In contrast, block-level FTL maps an LBN (logical block number) to a PBN (physical block number). As shown in Fig. 3, when a write request arrives, the physical page is allocated based on its offset in the data block, and the corresponding LBN to PBN mapping is recorded at the same time [25]. Hybrid FTL combines the flexibility and efficiency of page-level FTL with the low space overhead of block-level FTL. The challenge for hybrid FTL is to identify the incoming workload pattern (e.g., random or sequential, hot or cold) and to allocate pages or blocks dynamically while adjusting the mapping granularity accordingly [26,27]. However, a general FTL is not enough for TLC/SLC dual-mode flash memory: SLC-mode blocks and TLC-mode blocks must be partitioned dynamically based on different applications. The genetic algorithm is a classical optimization algorithm in the machine learning field, and it is convenient and effective.

3.3. Motivation

As mentioned above, TLC flash memory has been widely adopted in many storage-system designs due to its attractive features. However, TLC flash memory has low read/write performance and short endurance. To address this, SLC/TLC dual-mode flash memory provides SLC-mode and TLC-mode operation on demand. With the capability of switching modes, we can store metadata (the FTL mapping table) in SLC-mode blocks and user data in TLC-mode blocks, because SLC-mode blocks provide better read/write performance and endurance, while TLC-mode blocks provide larger capacity. To better utilize TLC/SLC dual-mode flash memory, we first need to re-design the FTL for this particular flash memory. Then we need to determine how to partition blocks between SLC mode and TLC mode dynamically according to I/O workloads. A motivational example is illustrated in Fig. 4. As shown, because SLC-mode and TLC-mode blocks have different erase lifespans, it is hard to make SLC and TLC blocks wear out at the same time without a well-structured allocation.
This means that some blocks are worn out while others remain functioning. Therefore, in this paper, we propose a workload-aware flash translation layer, named Balloon-FTL, for TLC/SLC dual-mode flash memory in embedded systems, to improve the performance and lifespan of the system.

Fig. 4. Motivational example. (a) SLC-mode blocks are fewer than the workload demands, so SLC-mode blocks reach their wear limit earlier; (b) more SLC-mode blocks are allocated for write requests, so TLC-mode blocks reach their wear limit earlier.

4. Balloon-FTL

In this section, we present our flash translation layer, Balloon-FTL, for TLC/SLC dual-mode flash memory in embedded systems. Balloon-FTL dynamically allocates TLC/SLC capacity in dual-mode flash memory according to different workload patterns, with the help of the proposed hybrid mapping strategy. In the rest of this section, we first give an overview of Balloon-FTL, and then present its details, including the workload identifier and hybrid mapping.

4.1. Overview

Fig. 5 gives an overview of Balloon-FTL. As shown, the FTL mapping tables (the page-level and block-level mapping tables) are stored in SLC-mode blocks because they are frequently accessed, while the user data is stored in TLC-mode blocks. To achieve this, we design a hybrid mapping strategy and a workload identifier for TLC/SLC dual-mode flash memory.

1) In the proposed hybrid mapping, we employ page-level mapping or block-level mapping according to the workload pattern. For random and hot sequential requests, we use page-level mapping to improve system lifetime. For cold sequential requests, we use block-level mapping to reduce the mapping table size. The mapping records (i.e., LPN to PPN and LBN to PBN) are stored in a page-level mapping table and a block-level mapping table in SLC-mode blocks, respectively.

2) The workload identifier has two functions. The first is to collect I/O characteristics from the workload patterns of different applications. The second is to identify hot/cold and random/sequential requests, and to determine whether a request should be handled by page-level mapping or block-level mapping. Thus Balloon-FTL can balance the lifespan of SLC-mode blocks and TLC-mode blocks, and improve the performance of the system.

Fig. 5. Software module of TLC flash memory system.

4.2. Hybrid mapping strategy

Fig. 6 presents the flow chart of Balloon-FTL. In practice, an I/O workload is a mixture of small random and large sequential requests. To determine the type of an incoming request, we use a threshold T. Requests that write at least T pages are considered sequential, while the others are considered random. For example, with T = 64, a request shorter than 64 pages (1/3 of a block) is treated as a random request, and a request of 64 pages or more is treated as a sequential request. Furthermore, some logical addresses are accessed frequently during I/O. We treat frequently accessed data as hot data and the rest as cold data. In the proposed hybrid mapping strategy, we use page-level mapping to serve random requests, since they are usually small and occupy only a small number of page-level mapping table entries. To improve performance, we also use page-level mapping to serve hot sequential requests. Algorithm 4.1 shows how to identify a random request and record the corresponding mapping in the page-level mapping table.

Fig. 6. Hybrid mapping strategy in Balloon-FTL.

Algorithm 4.1 The algorithm of random data identifier.
Require: The write request and the threshold.
Ensure: random data and sequential data.
    Threshold ← the random size threshold.
    if the write request length < Threshold then
        LPN ← the address of the write request.
        Add the mapping of (LPN, PPN) into the page-level mapping table.
    else
        Send this request to the frequent data identifier.
    end if
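The threshold test of Algorithm 4.1 can be sketched as follows. The function name `classify_request` and the returned labels are hypothetical; only the comparison against the threshold T (64 pages in the example of Section 4.2) comes from the text.

```python
def classify_request(length_pages, threshold_pages=64):
    """Label a write request by its length in pages.

    Requests shorter than the threshold are 'random' (served by page-level
    mapping); all others are 'sequential' and would be passed on to the
    hot/cold identifier.
    """
    return "random" if length_pages < threshold_pages else "sequential"


labels = [classify_request(n) for n in (8, 63, 64, 192)]
```

With the default threshold, a 63-page request is still random, while a 64-page request is already treated as sequential.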

In this paper, we employ Bloom filters to determine the type of an I/O request based on the principle of temporal locality [27]. A Bloom filter is a bit array whose bits are all initially set to 0. As shown in Algorithm 4.2, we adopt a set of three independent Bloom filters to store frequently accessed data addresses in rotation, and three independent hash functions to capture hot data. Each Bloom filter


consists of 1024 × 1024 bits to record enough hash values. Whenever a write request arrives, the hash functions map the LBN of the write request to the corresponding bit positions of the array and set all of them to 1. To check whether an element is in the set, we feed its key to all three hash functions to obtain three bit positions. If any of the three bits equals 0, the element is not in the set, because all of its bits would have been set to 1 when it was inserted. If all three bits equal 1, we consider the element to be in the set.

For the three types of I/O requests, i.e., random, hot sequential and cold sequential requests, Balloon-FTL adopts a two-level FTL mechanism that handles them as follows:

• Random and hot sequential requests: our FTL allocates physical pages sequentially, starting from the first page of a physical block in NAND flash memory, so that all pages in a block are fully utilized. Accordingly, our FTL adds the LPN to PPN mapping to the PMT (page mapping table).
• Cold sequential requests: our FTL allocates physical pages based on the block offset, as most sequential requests occupy a whole block, so that all pages in a block are fully utilized as well. Similarly, the corresponding LBN to PBN mapping is added to the BMT (block mapping table).

Algorithm 4.2 The algorithm of hot sequential data identifier.
Require: The write request.
Ensure: hot sequential data and cold sequential data.
    if LBA ∉ bloom1 then
        Add the mapping of (LBN, PBN) into the block-level mapping table.
        Add the request address into bloom1.
        break.
    else
        Add the mapping of (LPN, PPN) into the page-level mapping table.
        if LBA ∉ bloom2 then
            Add the request address into bloom2.
            break.
        else
            if LBA ∉ bloom3 then
                Add the request address into bloom3.
                break.
            else
                i ← i + 1.
            end if
        end if
    end if
    if i = 100 then
        remove bloom1.
        bloom1 ← bloom2.
        bloom2 ← bloom3.
        new bloom3.
        i ← 0.
    end if

The mapping table is usually kept in DRAM, which is volatile. To guard against sudden power loss, part of the FTL mapping table is usually cached in DRAM on demand while the whole table is backed up to flash memory. However, the write performance and endurance of TLC flash are poor, which causes system performance to decline quickly [24,28]. To overcome these limitations of TLC flash memory, the entire mapping table is stored in SLC-mode blocks.

Algorithm 4.3 The genetic algorithm.
Require: trace fragments.
Ensure: optimal solution configuration.
    initialize the population.
    for all members of population do
        sum ← sum + fitness of this individual.
    end for
    for all members of population do
        probability ← sum of probabilities + (fitness / sum).
        sum of probabilities ← probability.
    end for
    while stop condition is not met do
        number ← random value between 0 and 1.
        for all members of population do
            if number is more than probability and less than next probability then
                remove this member from population.
            end if
        end for
        create offspring.
    end while
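The Bloom-filter check and the filter rotation of Algorithm 4.2 can be sketched as follows. This is only an illustration under stated assumptions: the class names, the use of `hashlib.blake2b` with three different salts as the three hash functions, and the reading that an address absent from the first filter is cold are all assumptions, not details confirmed by the paper.

```python
import hashlib


class BloomFilter:
    """Bit-array Bloom filter with three salted hash functions (assumed choice)."""

    def __init__(self, num_bits=1024 * 1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)   # all bits start at 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(str(key).encode(), salt=bytes([i])).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        # A single 0 bit proves that the key was never inserted.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


class HotSequentialIdentifier:
    """Three rotating filters; a block seen again counts as hot (hypothetical sketch)."""

    def __init__(self, rotate_after=100, num_bits=1024 * 1024):
        self.num_bits = num_bits
        self.rotate_after = rotate_after
        self.filters = [BloomFilter(num_bits) for _ in range(3)]
        self.counter = 0

    def observe(self, lbn):
        """Record one sequential write to `lbn` and return 'cold' or 'hot'."""
        b1, b2, b3 = self.filters
        if lbn not in b1:
            b1.add(lbn)
            result = "cold"           # would be served by block-level mapping
        else:
            result = "hot"            # would be served by page-level mapping
            if lbn not in b2:
                b2.add(lbn)
            elif lbn not in b3:
                b3.add(lbn)
            else:
                self.counter += 1
        if self.counter >= self.rotate_after:
            # Drop the oldest filter and append a fresh, empty one.
            self.filters = [b2, b3, BloomFilter(self.num_bits)]
            self.counter = 0
        return result


identifier = HotSequentialIdentifier(num_bits=1024)
history = [identifier.observe(42), identifier.observe(42)]
```

Because a Bloom filter can report false positives but never false negatives, a cold block may occasionally be classified as hot, while a block that was recorded is never reported as unseen before the filters rotate.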

Algorithm 4.4 Balance and Capacity.
Input: the erase number of each region ($Er_{slc}$ and $Er_{tlc}$), the blocks of the SLC-mode region ($Size_{slc}$) and the TLC-mode region ($Size_{tlc}$), the blocks of flash memory when all blocks are in TLC mode ($Mem_{tlc}$), the endurance of TLC-mode blocks ($En_{tlc}$), the endurance of SLC-mode blocks ($En_{slc}$).
Output: the Balance and Capacity of the design.
    Start the simulator and compute the size of the SLC-mode region ($Size_{slc}$) and the TLC-mode region ($Size_{tlc}$) with the workload identifier.
    $C_{slc} \leftarrow Size_{slc} \times 64$ (each SLC block has 64 pages).
    $C_{tlc} \leftarrow Size_{tlc} \times 192$ (each TLC block has 192 pages).
    $Capacity \leftarrow \dfrac{C_{slc} + C_{tlc}}{Mem_{tlc} \times 192}$.
    After the workload finishes:
    $B_{slc} \leftarrow \dfrac{Er_{slc}}{C_{slc}}$.
    $B_{tlc} \leftarrow \dfrac{Er_{tlc}}{C_{tlc}}$.
    $Balance \leftarrow \dfrac{B_{slc}}{B_{tlc}}$.
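The arithmetic of Algorithm 4.4 can be sketched as a small function. The function name and the sample erase counts are hypothetical; the page counts per block (64 for SLC mode, 192 for TLC mode) are taken from the algorithm itself.

```python
SLC_PAGES_PER_BLOCK = 64    # pages per SLC-mode block, as stated in Algorithm 4.4
TLC_PAGES_PER_BLOCK = 192   # pages per TLC-mode block, as stated in Algorithm 4.4


def capacity_and_balance(er_slc, er_tlc, size_slc, size_tlc, mem_tlc):
    """Return (Capacity, Balance) from region sizes in blocks and erase counts."""
    c_slc = size_slc * SLC_PAGES_PER_BLOCK
    c_tlc = size_tlc * TLC_PAGES_PER_BLOCK
    capacity = (c_slc + c_tlc) / (mem_tlc * TLC_PAGES_PER_BLOCK)
    b_slc = er_slc / c_slc      # erase count per page of SLC capacity
    b_tlc = er_tlc / c_tlc      # erase count per page of TLC capacity
    return capacity, b_slc / b_tlc


# Hypothetical device: 1000 blocks in total, 40 of them switched to SLC mode.
capacity, balance = capacity_and_balance(
    er_slc=2560, er_tlc=184320, size_slc=40, size_tlc=960, mem_tlc=1000)
```

In this made-up example both regions are erased in proportion to their page capacity, so Balance is exactly 1, while Capacity drops slightly below 1 because 40 blocks hold only a third of their TLC-mode pages.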

4.3. Workload identifier

In Balloon-FTL, the workload identifier dynamically allocates TLC/SLC capacity based on the I/O workload of different applications. There are two dynamic allocation strategies: one is a mathematical formulation for low-latency environments, and the other is based on a genetic algorithm for environments with a long use time. Because SLC and TLC have different P/E cycle limits, SLC-mode and TLC-mode blocks usually do not wear out at the same time, which means some blocks are worn out while others remain functioning. To overcome this constraint, this paper tries to make all blocks wear out at the same time. To achieve this, we propose two effective workload-aware strategies that enhance the lifespan of the TLC device.

4.3.1. Mathematical model algorithm

In this section, we build a mathematical model for the workload identifier. It gives the most appropriate TLC/SLC capacity proportion according to the workload characteristics. The goal is to balance the lifespan of SLC-mode blocks and TLC-mode blocks so that the lifespan of the device is improved. We use the number of erase operations to represent the wear intensity. So we can


calculate the erase cycles of SLC-mode and TLC-mode blocks to compare their lifespans. The algorithm is divided into three steps. Step 1: dynamically collect workload characteristics and compute the SLC/TLC sizes. Step 2: dynamically adjust the size of the SLC/TLC mode regions before each region starts executing. Step 3: dynamically allocate the optimal address for each variable in each region in order to extend the lifetime of TLC flash memory.

Assume that the program has been running for a period of time and that wear leveling makes each TLC block be erased evenly. Let $E_{tlc}$ denote the erase count of TLC-mode blocks, $E_{slc}$ the erase count of SLC-mode blocks, $PE_{tlc}$ the maximum P/E cycles of TLC-mode blocks, $PE_{slc}$ the maximum P/E cycles of SLC-mode blocks, $B_{tlc}$ the number of TLC-mode blocks, and $B_{slc}$ the number of SLC-mode blocks. The erase count ratio between SLC-mode and TLC-mode blocks can then be defined as:

$$Balance = \frac{\dfrac{E_{slc}}{PE_{slc} \times B_{slc}}}{\dfrac{E_{tlc}}{PE_{tlc} \times B_{tlc}}} \qquad (1)$$

Let $N_{slc}$ denote the number of SLC-mode blocks and $N_{tlc}$ the total number of blocks of the flash memory. We determine the number of SLC-mode blocks according to the value of Balance:



$$N_{slc} = \begin{cases} N_{tlc} \times 4\% & \text{if } Balance \gg 1 \\ N_{tlc} \times 2\% & \text{if } Balance \approx 1 \\ N_{tlc} \times 0.2\% & \text{if } Balance \ll 1 \end{cases} \qquad (2)$$
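Eqs. (1) and (2) can be sketched in code as follows. The sketch assumes that the first case of Eq. (2) applies when Balance is well above 1 and the last case when it is well below 1; the `tolerance` parameter that decides when Balance counts as approximately 1 is a hypothetical choice, since the paper does not state one.

```python
def balance_ratio(e_slc, pe_slc, b_slc, e_tlc, pe_tlc, b_tlc):
    """Eq. (1): normalised wear of SLC-mode blocks over that of TLC-mode blocks."""
    return (e_slc / (pe_slc * b_slc)) / (e_tlc / (pe_tlc * b_tlc))


def slc_block_count(balance, n_tlc, tolerance=0.1):
    """Eq. (2): number of SLC-mode blocks for a given Balance (tolerance is assumed)."""
    if balance > 1 + tolerance:
        ratio = 0.04     # SLC region wears faster: give it the largest share
    elif balance < 1 - tolerance:
        ratio = 0.002    # TLC region wears faster: shrink the SLC region
    else:
        ratio = 0.02     # both regions wear at about the same relative rate
    return round(n_tlc * ratio)


# Hypothetical counts, using the P/E limits of Table 1 (100,000 for SLC, 2500 for TLC).
b = balance_ratio(e_slc=100, pe_slc=100_000, b_slc=10,
                  e_tlc=100, pe_tlc=2_500, b_tlc=400)
n_slc = slc_block_count(b, n_tlc=1000)
```

Here both regions have consumed the same fraction of their endurance, so Balance equals 1 and the middle case of Eq. (2) is selected.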

In this paper, only the mapping table is stored in the SLC region, and its size is very small compared to the user data. Each mapping entry takes about 2 × 8 bytes, so the maximal size of the mapping table is about $N_{tlc} \times 0.3\%$ when the TLC region is managed with page-level mapping. Considering that the mapping table is updated frequently, we set the SLC limit (capacity threshold) to $N_{tlc} \times 4\%$.

4.3.2. Genetic algorithm

The mathematical model of the workload identifier is fast and has acceptable overhead (we discuss it in the experiment section). However, we cannot ensure that the mathematical model is effective in every case. In reality, there is a variety of situations, and it is difficult to cover them with one unified mathematical model [29–31]. Therefore, we further employ a genetic algorithm to provide a heuristic solution for workload identification. We define Capacity as the ratio of the total size of the flash memory after the allocation of SLC blocks to the original size of the TLC flash memory. Thus, we have the following formulation for Capacity:

$$Capacity = \frac{N_{slc} + (N_{tlc} - N_{slc}) \times 3}{N_{tlc} \times 3} \qquad (3)$$
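As a worked example of the Capacity formulation, suppose the SLC region is set to the 4% limit mentioned above, i.e. $N_{slc} = 0.04\,N_{tlc}$ (an illustrative choice, not a result from the paper):

$$Capacity = \frac{0.04\,N_{tlc} + (N_{tlc} - 0.04\,N_{tlc}) \times 3}{N_{tlc} \times 3} = \frac{0.04 + 0.96 \times 3}{3} = \frac{2.92}{3} \approx 0.973$$

Under this assumption, switching 4% of the blocks to SLC mode costs about 2.7% of the original TLC capacity.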

We give the following definitions: 1) we use (random data threshold, hot data threshold, SLC-mode block ratio), abbreviated (R, H, S), as the genetic representation of the solution domain; 2) we use the formulations of Balance (1) and Capacity (3) as the fitness function to evaluate the solution domain. The steps of our algorithm are as follows.

Step 1: Initialization. The initial population is either generated randomly, allowing the entire range of possible solutions (the search space), or specially selected; the population size depends on the nature of the optimization problem. In our paper, the population depends on three configuration parameters (random data threshold, hot data threshold, SLC-mode block ratio); thus, there are 1000 configurations if each parameter is given 10 possible values. The ranges of the random data threshold, hot data threshold and SLC-mode block ratio are 32–160 pages (1/6–5/6 of a block), 50–300 and 0–8%, respectively. The initial population is shown in Table 2. The initial population is randomly selected, but choosing a good initial group can reduce the number

Table 2
Initial population. GA initial population (R, H, S).

(64%, 50%, 2%)      (64%, 100%, 2%)     (64%, 150%, 2%)
(64%, 50%, 4%)      (64%, 100%, 4%)     (64%, 150%, 4%)
(128%, 50%, 2%)     (128%, 100%, 2%)    (128%, 150%, 2%)
(128%, 50%, 4%)     (128%, 100%, 4%)    (128%, 150%, 4%)
(196%, 50%, 2%)     (196%, 100%, 2%)    (196%, 150%, 2%)
(196%, 100%, 4%)    (196%, 150%, 4%)    (196%, 150%, 4%)

of iterations. For example: (1) Random data threshold: according to the definition in Section 4.2 and other work [26], random data is usually smaller than one block, so we set the maximum random data threshold to the size of one block. The initial population can be randomly selected between 0 and the maximum random data threshold; in this experiment we select 1/6–5/6 of a block. (2) Hot data threshold: according to the memory overhead, we set the maximum hot data threshold to 300. (3) SLC-mode block ratio: according to the analysis in Section 4.3.1, the maximum size reserved for the mapping table is $N_{tlc} \times 4\%$. Then, according to the memory overhead and the endurance of SLC and TLC, we set the maximum SLC-mode block ratio to 8%.

Step 2: Selection. During each generation, individual solutions are selected through a fitness-based function, where fitter individual solutions are more likely to be selected. The fitness function is defined over the genetic representation and measures the quality of the represented solution. In this paper, we use Balance (1) and Capacity (3) as the fitness function to select individual solutions.

Step 3: Genetic operators. This step generates a second-generation population of solutions from those selected, through a combination of genetic operators: crossover and mutation. After selection, the successful individual solutions produce new solutions by crossover and mutation. A new solution shares many of the characteristics of its "parent" solutions. The new solutions are selected for the next generation, and the process continues until the algorithm stops. Although crossover and mutation are known as the main genetic operators, it is possible to use other operators such as regrouping, colonization-extinction, or migration in genetic algorithms [32].
In this work, since only the fitter solutions from the “parent” generation are selected to produce new solutions, the average fitness of the “children” generation increases. Through repeated iteration, the “children” generation gets closer to the optimal solution.

Step 4: Termination. This generational process is repeated until a termination condition is satisfied. Common terminating conditions are:

1. A solution is found that satisfies minimum criteria.
2. Fixed number of generations reached.
3. Allocated budget (computation time/money) reached.
4. The fitness of the highest ranking solution is reaching or has reached a plateau such that successive iterations no longer produce better results.
5. Manual inspection.

In this paper, we combine the first and second termination conditions: the algorithm stops when a satisfactory Capacity and Balance are achieved or when a preset number (End) of generations is reached.
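Steps 1 to 4 can be sketched as follows. This is a minimal Python illustration rather than the implementation used in the paper: the gene values in GENES, the truncation selection, the elitism, the mutation rate and the caller-supplied fitness function (which in Balloon-FTL would combine Balance (1) and Capacity (3)) are all assumptions made for the example.

```python
import random

# Candidate values per gene; illustrative assumptions, not the paper's search space.
GENES = (
    (1/6, 1/3, 1/2, 2/3, 5/6),   # random data threshold (fraction of a block)
    (50, 100, 150, 200, 300),    # hot data threshold
    (0.001, 0.002, 0.02, 0.04),  # SLC-mode block ratio
)

def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))

def mutate(ind, rng, rate=0.1):
    """With probability `rate`, replace a gene by a random allowed value."""
    return tuple(rng.choice(GENES[i]) if rng.random() < rate else g
                 for i, g in enumerate(ind))

def run_ga(fitness, pop_size=8, end=20, good_enough=0.95, seed=0):
    """Return (best individual, number of generations executed).

    `fitness` is a hypothetical callable mapping an individual to a score
    where higher is better. `pop_size` must be at least 4.
    """
    rng = random.Random(seed)
    # Step 1: random initial population.
    population = [tuple(rng.choice(values) for values in GENES)
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for generation in range(1, end + 1):
        # Step 4, condition 1: a satisfactory solution has been found.
        if fitness(best) >= good_enough:
            return best, generation - 1
        # Step 2: keep the fitter half as parents (truncation selection).
        parents = sorted(population, key=fitness, reverse=True)[:pop_size // 2]
        # Step 3: crossover, then mutation; elitism carries `best` forward.
        children = [mutate(crossover(*rng.sample(parents, 2), rng), rng)
                    for _ in range(pop_size - 1)]
        population = [best] + children
        best = max(population, key=fitness)
    # Step 4, condition 2: preset number (End) of generations reached.
    return best, end
```

Because the best individual is always carried into the next generation, the best fitness never decreases from one generation to the next, which mirrors the observation above that the "children" generation gets closer to the optimal solution.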

Please cite this article as: D. Liu et al., A workload-aware flash translation layer enhancing performance and lifespan of TLC/SLC dualmode flash memory in embedded systems, Microprocessors and Microsystems (2017), http://dx.doi.org/10.1016/j.micpro.2016.12.009


Fig. 7. Framework of the evaluation platform.

Table 3 Experimental setup.

Flash memory
Attributes    SLC mode    TLC mode
Page size     8 KB        8 KB
Block size    64 pages    192 pages
Read speed    40 μs       120 μs
Write speed   300 μs      2400 μs
P/E cycles    100,000     2500

Workloads
Name          Description
copyFromU     Copy data from Flash-Disk
copyToU       Copy data to Flash-Disk
Download      Download data
Idle          Unused
Internet      Running IE related applications
LocalVideo    Watch local video films
Upload        Upload data to server
WebVedio      Watch internet video films
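The device parameters in Table 3 determine how much data a block holds in each mode and how much data it can absorb before wearing out. The following Python sketch works through that arithmetic. The function names are illustrative, and the lifetime figure is an idealized upper bound that ignores garbage collection and write amplification.

```python
KB = 1024

# Values taken from Table 3.
PAGE_SIZE = 8 * KB
MODES = {
    "SLC": {"pages_per_block": 64, "pe_cycles": 100_000},
    "TLC": {"pages_per_block": 192, "pe_cycles": 2_500},
}

def block_bytes(mode):
    """Capacity of one block in the given mode."""
    return MODES[mode]["pages_per_block"] * PAGE_SIZE

def lifetime_write_bytes(mode):
    """Idealized upper bound on the bytes one block can absorb before wear-out."""
    return block_bytes(mode) * MODES[mode]["pe_cycles"]

if __name__ == "__main__":
    for mode in MODES:
        print(mode, block_bytes(mode) // KB, "KB per block,",
              lifetime_write_bytes(mode) // (KB * KB), "MB lifetime writes")
```

A TLC-mode block holds 1536 KB, so 2,048 such blocks give the 3 GB chip configured in Section 5.1. Under this idealized bound, one SLC-mode block absorbs about 13.3 times as many bytes over its life as a TLC-mode block (40 times the P/E cycles at one third of the capacity).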

5. Experiments

To evaluate the effectiveness of Balloon-FTL, we conduct a series of experiments and present the experimental results with analysis in this section. We first describe our experimental setup. Then we present the benchmarks and methodology used in the experiments. Finally, we discuss the experimental results.

5.1. Experimental setup

The evaluation is conducted through a trace-driven simulator. The framework of our evaluation platform is shown in Fig. 7. We adopt a TLC/SLC dual-mode flash memory chip, “SAMSUNG K9ABG08U0A”, as the device, whose parameters are shown in Table 3 [5]. In our experiments, a 3 GB NAND TLC flash memory chip (2,048 physical blocks) is configured in our simulator. The traces with data requests are collected by running DiskMon on a personal computer with an Intel Dual Core 2 GHz processor, a 1 TB hard disk, and 4 GB of DRAM. The traces reflect the real workload of the system in accessing the hard disk with daily-used applications. For example, “download” is a trace collected by downloading files from a network server, and “internet” is a trace collected by running IE-related applications. The description of each trace is listed in Table 3. Table 4 lists the specific characteristics of the workloads.

5.2. Experimental configurations and metrics

The lifespan of TLC-based flash memory embedded systems relies on many parameters. These factors can be divided into two types: workload characteristics and FTL configurations. In Fig. 8, the Balance (1) of each trace is obtained by varying the threshold value from 1/3 block size to 2/3 block size with an interval of 1/6 block size. The three curves have the same tendency; therefore, the workload characteristics are the main factor in the lifetime of the TLC device, and different workloads need different FTL configurations. So the workload-aware strategy is effective in improving the performance of Balloon-FTL.

Fig. 8. The tendency of each workload under different FTL configurations.

Then we identify essential design parameters of the workload (e.g., request length, update ratios). Via intensive simulations and analysis, we find that optimizing the design parameters of Balance (1) depends heavily on the small-size request ratio. In this work, we use a mathematical formula (formula (4)) to express the suitable capacity of TLC/SLC. We denote the ratio of small write requests and the ratio of small read requests as θw and θr, respectively. With these parameters, the mathematical model (formula (4)) of the Balloon-FTL workload identifier can be described as:

$$
N_{slc} =
\begin{cases}
N_{tlc} \times 4\% & \text{if } \theta_w > 0.9 \wedge \theta_r < 0.5 \\
N_{tlc} \times 2\% & \text{if } \theta_w > 0.9 \wedge \theta_r > 0.9 \\
N_{tlc} \times 0.2\% & \text{others}
\end{cases}
\tag{4}
$$
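Formula (4) is a three-way case analysis on the small-request ratios, which can be written as a short Python function. In this sketch the function name is illustrative, the two conditions of each case are read as a conjunction, and the result is truncated to a whole number of blocks; both of these choices are assumptions.

```python
def slc_block_count(n_tlc, theta_w, theta_r):
    """Number of SLC-mode blocks for a workload, following formula (4).

    n_tlc   -- number of TLC blocks (Ntlc)
    theta_w -- ratio of small write requests
    theta_r -- ratio of small read requests
    """
    if theta_w > 0.9 and theta_r < 0.5:
        ratio = 0.04
    elif theta_w > 0.9 and theta_r > 0.9:
        ratio = 0.02
    else:
        ratio = 0.002
    # Truncation to whole blocks is an assumption of this sketch.
    return int(n_tlc * ratio)
```

For the 2,048-block chip of Section 5.1, the three cases give 81, 40 and 4 SLC-mode blocks respectively.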

5.3. Hybrid mapping strategy

Below we compare Balloon-FTL with two representative hybrid mapping strategies [26,27] (denoted as A and B) based on realistic trace files. Table 5 shows that the hybrid mapping strategy of Balloon-FTL reduces the erase counts, especially for TLC-mode blocks.

5.4. Mathematical model workload aware strategy

In order to demonstrate the effectiveness of the mathematical model of the workload identifier, we compare it with the fixed SLC-mode capacity strategy. By varying the parameters of the fixed SLC-mode capacity strategy, we obtain two different fixed SLC-mode capacity strategies for comparison. Fig. 9 shows the experimental results of our mathematical model workload aware strategy. The x-axis denotes the workload type, and the y-axis denotes the Balance (Fig. 9(a)) and the percentage of capacity (Fig. 9(b)). In Fig. 10, the number of SLC-mode blocks of the fixed SLC capacity strategy is Ntlc × 0.2%. It is observed that the mathematical model of the workload identifier can balance SLC-mode/TLC-mode wear levelling by giving up a small amount of capacity in most cases. In Fig. 11, the number of fixed SLC-mode blocks changes to Ntlc × 2%. Although the performance is improved, the capacity loss


Table 4 Specific characteristics of workloads.

Workload      Read ratio   Write ratio   Update ratio   Request length   Write request length
copyFromU     42%          58%           68%            680 KB           567 KB
copyToU       8%           91%           80%            103 KB           58 KB
Download      3%           97%           81%            435 KB           452 KB
Idle          11%          89%           96%            7 KB             5 KB
Internet      33%          67%           78%            32 KB            26 KB
LocalVideo    40%          60%           80%            28 KB            23 KB
Upload        86%          14%           64%            38 KB            28 KB
WebVedio      4%           96%           4%             31 KB            20 KB

Table 5 Comparison of hybrid mapping strategies.

              Erase counts of Balloon-FTL   Erase counts of A       Erase counts of B
Trace file    SLC-mode     TLC-mode         SLC-mode    TLC-mode    SLC-mode    TLC-mode
copyFromU     76           3312             75          4020        76          3651
copyToU       16           5                16          902         16          5
Download      98           3951             101         4702        98          3971
Idle          21           67               21          7811        21          67
Internet      32           63               35          1442        32          89
LocalVideo    29           209              30          2944        29          246
Upload        9            8                9           506         9           9
WebVedio      16           143              14          8214        16          146

Fig. 9. Workload identifier. (a) The balance status. (b) The total capacity after allocating SLC-mode blocks.

Fig. 10. Fixed SLC blocks strategy 1. (a) The balance status. (b) The total capacity after allocating SLC-mode blocks.

becomes greater. Fig. 12 shows the lifespan under the three different strategies. In summary, since $\mathit{Balance} = E_{slc}/E_{tlc}$, Balance = 1 means that SLC-mode blocks have the same wear-out level as TLC-mode blocks. So we regard 0.5 ≤ Balance ≤ 2 as the best case; 0.2 ≤ Balance ≤ 0.5 or 2 ≤ Balance ≤ 5 as the middle case; and Balance ≤ 0.2 or Balance ≥ 5 as the bad case, as shown in Table 6.
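The three balance classes can be expressed as a small classifier. The Python sketch below is illustrative only: the function names are hypothetical, the sample Balance values in the usage note are made up, and boundary values that the text assigns to two classes (0.5, 2, 0.2 and 5) are resolved in favour of the better class, which is an assumption.

```python
from collections import Counter

def balance_type(balance):
    """Classify a Balance value into 'best', 'middle' or 'bad'."""
    if 0.5 <= balance <= 2:
        return "best"
    if 0.2 <= balance <= 5:
        return "middle"
    return "bad"

def type_shares(balances):
    """Percentage of workloads that fall into each balance type."""
    counts = Counter(balance_type(b) for b in balances)
    return {t: 100 * counts[t] / len(balances)
            for t in ("best", "middle", "bad")}
```

With eight traces, each trace contributes 12.5% to a class, which is consistent with the percentages in Table 6 all being multiples of 12.5%. For instance, the made-up values `[1.0, 0.6, 1.8, 0.9, 1.2, 0.5, 2.0, 3.0]` yield 87.5% best, 12.5% middle and 0% bad.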

Table 6 Different balance types of different strategies.

                  Balance type
Strategy          Best     Middle   Bad      Average capacity
Fixed 1 (0.2%)    12.5%    12.5%    75%      99.6%
Fixed 2 (2%)      25%      62.5%    12.5%    96.3%
Workload aware    87.5%    12.5%    0        95.2%


Fig. 11. Fixed SLC blocks strategy 2. (a) The balance status. (b) The total capacity after allocating SLC-mode blocks.

Table 7 Results of genetic algorithm.

Trace file    Iterations   Configurations         Reduce erase   Capacity improve
copyFromU     2            {192%, 150%, 0.1%}     50.6%          0.1%
copyToU       3            {64%, 50%, 0.2%}       99.7%          0
Download      2            {192%, 150%, 0.1%}     91.7%          0.1%
Idle          3            {64%, 50%, 0.1%}       0              0.1%
Internet      1            {128%, 50%, 0.1%}      ~0             0.1%
LocalVideo    5            {128%, 50%, 0.2%}      95.2%          0
Upload        2            {192%, 150%, 0.2%}     97.2%          0.1%
WebVedio      3            {128%, 50%, 0.2%}      0.03%          0

Table 8 Configuration of conventional dual-mode FTL [12].

            SLC-mode    TLC-mode
Blocks      74          1974
Capacity    37 MB       2961 MB

Fig. 12. The lifespan of TLC flash memory.

5.5. Genetic algorithm workload aware strategy

We compare the genetic algorithm workload aware strategy with the mathematical model workload aware strategy. We explore the configuration search problem over a set of real I/O traces (trace information is shown in Table 3). With the genetic algorithm workload aware strategy, we can find a more satisfactory configuration. Because this configuration is more accurate, both the Balance and the Capacity of GA are better than those of the mathematical model strategy; furthermore, it has a longer lifespan, since a better Balance leads to fewer erase operations. The “reduce erase” and “capacity improve” values in Fig. 13(a) and Fig. 13(b) are relative to the mathematical model workload aware strategy (Tables 7 and 8). The results show that by using the genetic algorithm, the lifespan of the flash memory can be significantly extended. However, we also find that the genetic algorithm requires multiple iterations, which is time-consuming. The overheads of this design are listed below:

1. Capacity overhead: bit flags and mapping tables are needed for each SLC/TLC-mode block, and they contribute the major main-memory space overhead. 1) Code size: the design uses about 3000 lines of code, which requires about 50 KB of memory; 2) the algorithm uses mapping tables and flags which require about 4 MB of main memory (Balloon-FTL 3 MB + GA 1 MB).
2. Latency overhead: this is influenced by the operating frequency and the size of the sample. In this paper, we select 50 sample entries per 10,000 workload entries for the GA workload identifier. In the experiments, it iterates 2 to 4 times in most situations, so the time overhead is 1% to 2%; if the operating environment allows the genetic algorithm to run in parallel, the time overhead is reduced to about 0.5%. The overall latency of Balloon-FTL based on GA is shown in Fig. 14.

5.6. Performance improvement of Balloon-FTL over conventional dual-mode FTL

Chang proposed a hybrid SSD with 256 MB SLC flash and 20 GB MLC flash (the capacity proportion of SLC : MLC = 1 : 80) [12]. Therefore, we set the experiment configuration as follows (keeping the capacity proportion of SLC : TLC = 1 : 80; see Table 8). Let Etlc denote the erase counts of TLC-mode blocks, Eslc the erase counts of SLC-mode blocks, PEtlc the maximum P/E cycles of TLC-mode blocks, PEslc the maximum P/E cycles of SLC-mode blocks, Btlc the number of TLC-mode blocks, and Bslc the number of SLC-mode blocks. Therefore, the


Fig. 13. The GA workload aware strategy vs. the mathematical model workload identifier. (a) Reduction in block erase counts. (b) Capacity improvement after allocating SLC-mode blocks.

Table 9 Comparison between conventional dual-mode FTL [12] and Balloon-FTL.

Trace files   Threshold   Conventional dual-mode FTL        Balloon-FTL
                          wear-out level proportion         wear-out level proportion
copyFromU     256 KB      1:14                              1:20
copyToU       32 KB       1:50                              1:0.4
Download      128 KB      1:3                               1:4
Idle          4 KB        1:137                             1:0.52
Internet      16 KB       1:80                              1:0.26
LocalVideo    16 KB       1:10                              1:0.78
Upload        8 KB        1:15                              1:0.4
WebVedio      4 KB        1:63                              1:0.25
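The wear-out level proportion reported in Table 9 follows formula (5): the erase count of each mode is normalized by its maximum P/E cycles and its number of blocks, and the two normalized values are compared. A minimal Python sketch of that computation is given below. The function names are illustrative, the defaults assume the P/E cycles of Table 3 and the block counts of Table 8, and the erase counts in the usage note are made-up values rather than measurements.

```python
def wear_out_level(erase_count, max_pe_cycles, num_blocks):
    """Fraction of a region's total erase budget that has been consumed."""
    return erase_count / (max_pe_cycles * num_blocks)

def wear_out_proportion(e_slc, e_tlc,
                        pe_slc=100_000, pe_tlc=2_500,
                        b_slc=74, b_tlc=1974):
    """Return x such that the SLC : TLC wear-out level proportion is 1 : x.

    e_slc must be non-zero; otherwise the proportion is undefined.
    """
    slc = wear_out_level(e_slc, pe_slc, b_slc)
    tlc = wear_out_level(e_tlc, pe_tlc, b_tlc)
    return tlc / slc
```

For example, `wear_out_proportion(100, 1000)` is about 15, i.e. a proportion of roughly 1:15, meaning that the TLC-mode region consumes its erase budget about 15 times as fast as the SLC-mode region. A value close to 1 indicates that both regions wear out at the same pace.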

Fig. 14. Time overhead.

wear-out level proportion of SLC and TLC can be defined as:

$$
\text{wear-out level proportion} = \frac{E_{slc}}{PE_{slc} \times B_{slc}} : \frac{E_{tlc}}{PE_{tlc} \times B_{tlc}} \tag{5}
$$

A conventional dual-mode FTL [12] proposes a way to find an optimized threshold: it finds the two peak frequencies in the distribution of write-request counts with respect to request size, using an algorithm based on K-Means clustering. We have obtained the thresholds of the workloads with this threshold algorithm. Table 9 shows the thresholds and the wear-out level proportion of SLC and TLC (formula (5)). The results show that, with the conventional dual-mode FTL, TLC-mode flash memory wears out faster than SLC-mode flash memory, which is the worst balance between TLC and SLC. The conventional dual-mode FTL is not appropriate for TLC/SLC dual-mode flash memory since it focuses on MLC flash memory, and the characteristics of TLC/SLC dual-mode flash memory have not been fully considered. Therefore, the workload-aware FTL is more appropriate across different workloads.

6. Conclusion

In this paper, we have proposed a workload-aware flash translation layer, named Balloon-FTL, for TLC/SLC dual-mode flash memory in embedded systems. Balloon-FTL achieves a balanced tradeoff in performance and lifespan between the TLC-mode blocks and SLC-mode blocks by mapping pages at different granularities according to different I/O workloads. It consists of a hybrid mapping module and a workload identifier module. The hybrid mapping module adopts page-level mapping to handle random and hot sequential requests, and block-level mapping for most cold sequential requests. The workload identifier module uses a genetic algorithm to dynamically allocate TLC/SLC capacity based on different workloads, and produces a suitable data allocation that achieves a balanced write distribution in flash memory with low memory access cost. The basic idea is to classify metadata/userdata according to their access pattern, and to allocate low-latency SLC-mode blocks for write-intensive metadata and high-density TLC-mode blocks for large quantities of userdata. We have conducted a series of experiments with realistic I/O traces. Evaluation results show that Balloon-FTL can effectively improve the system performance and lifespan. Finally, we hope to see more mature achievements on the basis of this attempt.

Acknowledgment

This work is partially supported by Beijing Advanced Innovation Center for Imaging Technology, and grants from the National Natural Science Foundation of China (61672116, 61272103 and 61373049), Research Fund for the Doctoral Program of Higher Education of China (20130191120030), Chongqing High-Tech Research Program (cstc2016jcyjA0332), the Research Grants Council of the Hong Kong Special Administrative Region, China (GRF 152138/14E and GRF 15222315/15E), and the Hong Kong Polytechnic University (4-ZZD7, G-YK24, G-YM10 and G-YN36).


References

[1] D. Liu, Y. Wang, Z. Qin, Z. Shao, Y. Guan, A space reuse strategy for flash translation layers in slc nand flash memory storage systems, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 20 (6) (2012) 1094–1107.
[2] J. Hsieh, C. Chen, H. Lin, Adaptive ecc scheme for hybrid ssd’s, IEEE Trans. Comput. 64 (12) (2015) 1.
[3] K. Zhong, T. Wang, X. Zhu, L. Long, D. Liu, W. Liu, Z. Shao, E. Sha, Building high-performance smartphones via non-volatile memory: the swap approach, in: Proceedings of the 14th International Conference on Embedded Software (EMSOFT’14), 2014, pp. 30:1–30:10.
[4] X. Jimenez, D. Novo, P. Ienne, Software controlled cell bit-density to improve nand flash lifetime, in: Proceedings of the 49th Annual Design Automation Conference, ACM, 2012, pp. 229–234.
[5] SAMSUNG. K9ABG08U0A NAND TLC Flash Memory Data, 2009.
[6] T. Wang, D. Liu, Y. Wang, Z. Shao, Ftl2: A hybrid flash translation layer with logging for write reduction in flash memory, in: Proceedings of the 14th ACM SIGPLAN/SIGBED Conference on Languages, Compilers and Tools for Embedded Systems, in: LCTES ’13, ACM, 2013, pp. 91–100.
[7] J. Wang, X. Dong, Y. Xie, Point and discard: a hard-error-tolerant architecture for non-volatile last level caches, in: Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, 2012, pp. 253–258.
[8] A. Jog, A. Mishra, C. Xu, Y. Xie, V. Narayanan, R. Iyer, C. Das, Cache revive: architecting volatile Stt-ram caches for enhanced performance in Cmps, in: Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, 2012, pp. 243–252.
[9] V. Narayanan, V. Saripalli, K. Swaminathan, R. Mukundrajan, G. Sun, Y. Xie, S. Datta, Enabling architectural innovations using non-volatile memory, in: Proceedings of the 21st Edition of the Great Lakes Symposium on Great Lakes Symposium on VLSI, in: GLSVLSI ’11, ACM, 2011, pp. 439–444.
[10] D. Liu, T. Wang, Y. Wang, Z. Qin, Z. Shao, A block-level flash memory management scheme for reducing write activities in Pcm-based embedded systems, in: Design, Automation Test in Europe Conference Exhibition (DATE), 2012, 2012, pp. 1447–1450.
[11] C.-H. Wu, T.-W. Kuo, L.-P. Chang, The design of efficient initialization and crash recovery for log-based file systems over flash memory, Trans. Storage 2 (4) (2006) 449–467.
[12] L.P. Chang, A hybrid approach to nand-flash-based solid-state disks, IEEE Trans. Comput. 59 (10) (2010) 1337–1349.
[13] T. Kgil, D. Roberts, T. Mudge, Improving NAND flash based disk caches, in: Proceedings of the 35th Annual International Symposium on Computer Architecture (ISCA’08), 2008, pp. 327–338.
[14] S. Im, D. Shin, Comboftl: improving performance and lifespan of mlc flash memory using slc flash buffer, J. Syst. Archit. 56 (12) (2010) 641–653.
[15] M. Murugan, D.H. Du, Hybrot: Towards improved performance in hybrid slc-mlc devices, in: IEEE 20th International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2012, IEEE, 2012, pp. 481–484.
[16] L. Long, D. Liu, X. Zhu, K. Zhong, Z. Shao, E.H.-M. Sha, Balloonfish: utilizing morphable resistive memory in mobile virtualization, in: Proceedings of the 20th Asia South Pacific Design Automation Conference (ASP-DAC’15), 2015, pp. 322–327.
[17] D. Liu, T. Wang, Y. Wang, Z. Shao, Q. Zhuge, S. Edwin, Curling-PCM: application-specific wear leveling for phase change memory based embedded systems, in: Proceeding of 18th Asia and South Pacific Design Automation Conference (ASP-DAC’13), 2013, pp. 279–284.
[18] F. Chen, T. Luo, X. Zhang, Caftl: A content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives, FAST, volume 11, 2011.
[19] Y. Wang, Y. Liu, Y. Liu, D. Zhang, S. Li, B. Sai, M.-F. Chiang, H. Yang, A compression-based area-efficient recovery architecture for nonvolatile processors, in: Design, Automation Test in Europe Conference Exhibition (DATE), 2012, 2012, pp. 1519–1524.
[20] T.-S. Chung, D.-J. Park, S. Park, D.-H. Lee, S.-W. Lee, H.-J. Song, A survey of flash translation layer, J. Syst. Archit. 55 (5–6) (2009) 332–343.
[21] Y. Hua, H. Jiang, Y. Zhu, D. Feng, L. Tian, Semantic-aware metadata organization paradigm in next-generation file systems, IEEE Trans. Parallel Distrib. Syst. 23 (2) (2012) 337–344.
[22] Y.-H. Chang, J.-W. Hsieh, T.-W. Kuo, Endurance enhancement of flash-memory storage systems: an efficient static wear leveling design, in: Proceedings of the 44th annual Design Automation Conference, ACM, 2007, pp. 212–217.
[23] Y.-H. Chang, P.-C. Huang, P.-H. Hsu, L.-J. Lee, T.-W. Kuo, D.-C. Du, Reliability enhancement of flash-memory storage systems: an efficient version-based design, IEEE Trans. Comput. 62 (12) (2013) 2503–2515.
[24] C.H. Wu, H.H. Lin, T.W. Kuo, An adaptive flash translation layer for high-performance storage systems, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 29 (6) (2010) 953–965.
[25] H. Li, Y. Chen, An overview of non-volatile memory technology and the implication for tools and architectures, in: Design, Automation Test in Europe Conference Exhibition, 2009. DATE ’09., 2009, pp. 731–736.
[26] D. Liu, T. Wang, Y. Wang, Z. Qin, Z. Shao, Pcm-ftl: a write-activity-aware nand flash memory management scheme for pcm-based embedded systems, in: 2011 IEEE 32nd Real-Time Systems Symposium (RTSS), IEEE, 2011, pp. 357–366.
[27] D. Park, D.H. Du, Hot data identification for flash-based storage systems using multiple bloom filters, in: IEEE 27th Symposium on Mass Storage Systems and Technologies (MSST), 2011, IEEE, 2011, pp. 1–11.
[28] P.-L. Wu, Y.-H. Chang, T.-W. Kuo, A File-system-aware Ftl Design for Flash-memory Storage Systems, in: DATE, 2009, pp. 393–398.
[29] M. Mitchell, An Introduction to Genetic Algorithms, MIT Press, 1996, p. 2.
[30] C.-K. Ting, On the mean convergence time of multi-parent genetic algorithms without selection, in: Proceedings of the 8th European conference on Advances in Artificial Life (ECAL’05), Springer-Verlag, 2005, pp. 403–412.
[31] K. De Jong, Learning with genetic algorithms: an overview, Mach. Learn. 3 (2–3) (1988) 121–138.
[32] K. Ziarati, R. Akbari, A multilevel evolutionary algorithm for optimizing numerical functions, Int. J. Industrial Eng. Comput. 2 (2) (2011) 419–430.


Duo Liu received the Ph.D. degree in computer science from the Department of Computing, The Hong Kong Polytechnic University in 2012. He received the B.E. degree in computer science from the Southwest University of Science and Technology, Sichuan, China, in 2003, and the M.E. degree from the Department of Computer Science, University of Science and Technology of China, Hefei, China, in 2006, respectively. He is currently an assistant professor with the College of Computer Science, Chongqing University, China. His current research interests include emerging memory techniques and embedded systems.

Lei Yao received the B.Sc. degree in computer science from Chongqing University, Chongqing, China, in 2014, where he is currently pursuing the Master degree. His current research interests include mobile computing, embedded system and emerging memory techniques.

Linbo Long received the B.S. degree from the College of Computer Science, Chongqing University, China, in 2011. He is currently pursuing the Ph.D. degree with the College of Computer Science, Chongqing University, China. His current research interests include compiler optimization, emerging memory techniques and embedded systems.

Zili Shao received the B.E. degree in electronic mechanics from the University of Electronic Science and Technology of China, Sichuan, China, in 1995, and the M.S. and Ph.D. degrees from the Department of Computer Science, University of Texas at Dallas, Dallas, TX, USA, in 2003 and 2005, respectively. He has been an Associate Professor with the Department of Computing, Hong Kong Polytechnic University, Hong Kong, since 2010. His current research interests include embedded software and systems, real-time systems, and related industrial applications.

Yong Guan received the Ph.D. degree in computer science from China University of Mining and Technology, China, in 2004. Currently, he is a professor of Capital Normal University. His research interests include formal verification, PHM for power and embedded system design. Dr. Guan is a member of Chinese Institute of Electronics Embedded Expert Committee. He is also a member of Beijing Institute of Electronics Professional Education Committee, and Standing Council Member of Beijing Society for Information Technology in Agriculture.
