HMFS: A hybrid in-memory file system with version consistency



Highlights

• A new in-memory file system based on a DRAM/NVM hybrid memory architecture.
• A novel and efficient full-versioning mechanism is proposed.
• Extensive experiments are carried out to demonstrate the efficiency of the design.


Hao Liu, Linpeng Huang∗, Yanmin Zhu, Shengan Zheng, Yanyan Shen

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Abstract

Emerging non-volatile memory (NVM) such as PCM and STT-RAM has memory-like byte-addressability as well as disk-like persistent storage capability. It offers an opportunity to bring NVM into the existing computer architecture to construct an efficient, high-performance in-memory file system. Several NVM-optimized file systems have been designed. However, most of them fail to exploit all the important features of NVM and can only guarantee file system consistency up to the data consistency level. In this paper, we present HMFS, a hybrid in-memory full-versioning file system. HMFS manages DRAM and NVM in a unified address space and applies different update mechanisms to them. Besides, HMFS achieves version consistency with a simple and efficient multi-version approach. Experimental results show that HMFS achieves significant throughput improvements compared with state-of-the-art NVM-optimized file systems, such as PMFS and NOVA, and 3.1× to 13.5× higher versioning efficiency than other multi-versioned file systems such as BTRFS and NILFS2.

Keywords: Non-Volatile Memory, Hybrid Memory Architecture, In-Memory File System, Multi-Versioning

∗Corresponding author. Email address: [email protected] (Linpeng Huang)


1. Introduction

Recent years have witnessed the emergence of modern non-volatile memory (NVM) technologies, such as phase change memory (PCM) [1], spin transfer torque RAM (STT-RAM) [2], resistive RAM (ReRAM) [3] and the newly reported Intel and Micron 3D XPoint [4]. A recent study [5] shows that the advent of byte-addressable NVM makes it easy to attach NVM to the memory bus and enables applications to use NVM as a replacement for conventional disk-based storage. This inspires the evolution of NVM-based in-memory file systems that are able to break the bottlenecks of the traditional block I/O-based architecture.

A number of file systems have been developed to leverage the advantages of NVM, such as byte-addressability, memory-like speed and disk-like persistency. Examples include PMFS [6], NOVA [7], BPFS [8], SCMFS [9] and HiNFS [10]. All these file systems use NVM as the persistent storage and assume that both DRAM and NVM can be accessed directly via the memory controller. Table 1 summarizes the features of existing state-of-the-art NVM-aware file systems. The distinguishing features across these file systems are:

• Inode structure: linear structure vs. B-tree structure
• Update mechanism: log-structured vs. copy-on-write (COW) vs. eXecute-In-Place (XIP)
• Mmap support: yes vs. no
• Wear leveling: weak vs. strong
• Consistency level: meta data level vs. data level vs. version level

PMFS [6][11] is a lightweight POSIX file system that has been explicitly designed for NVM. It uses a B-tree structure for its inode index table and updates both meta data and data with the XIP mechanism. It supports the mmap interface, which maps file data directly into the application's virtual address space. It uses fine-grained journaling to log meta data changes and provides no consistency guarantee for file data.

NOVA [7] is a log-structured NVM-aware file system. It maintains a linear-structured inode table for meta data updates and performs copy-on-write (COW) on the


file data. It develops an atomic-mmap mechanism that performs load/store access to NVM with a high level of consistency. It applies a log-structured update mechanism to meta data and the COW mechanism to file data, gives moderate consideration to the NVM wear-leveling problem, and reaches the data consistency level.

BPFS [8][11] is an NVM-aware file system working in user space. It adopts a B-tree structure to organize the inode entries and performs COW updates for both meta data and data. It achieves the data consistency level and gives no special consideration to the wear-leveling problem of NVM.

SCMFS [9][11] is a lightweight file system designed for NVM. It utilizes the operating system's virtual memory management module and maps files to large contiguous virtual address regions, making file accesses simple and fast. It adopts the XIP update mechanism for file data and always uses the mmap interface to access a file, but it lacks a mechanism to guarantee the consistency of meta data or data and gives no consideration to NVM wear-leveling.

HiNFS [10] is a write-optimized file system for NVM. HiNFS divides write operations into two types: lazy-persistent writes and eager-persistent writes. HiNFS uses an NVM-aware write buffer policy to buffer the lazy-persistent file writes in DRAM and persists them to NVM lazily to hide the long write latency of NVM. HiNFS performs direct access to NVM for eager-persistent file writes, and directly reads file data from both DRAM and NVM, as they have similar read performance, in order to eliminate the double-copy overheads from the critical path.

Although these existing file systems achieve higher performance than conventional disk-based file systems, they fail to exploit several important features of NVM.

• Most of these file systems use NVM in the same way as DRAM, and hence their performance is compromised by the asymmetric read/write speeds, i.e., random write is typically slower than random read [11, 12].

Table 1: Comparison of Some State-of-the-art NVM-optimized File Systems

                    PMFS [6]    NOVA [7]              BPFS [8]   SCMFS [9]        HiNFS [10]   HMFS
Inode structure     B-tree      Linear                B-tree     B-tree           B-tree       Linear
Update mechanism    XIP         Log-structured & COW  COW        COW              XIP          Log-structured & XIP
Mmap                Yes         Yes                   No         Yes              Yes          Yes
Wear leveling       No          Weak                  Weak       No               No           Strong
Consistency level   Meta data   Data                  Data       No consistency   Meta data    Version (highest level)

• The management of DRAM relies on the virtual memory management (VMM) of the operating system, without providing a unified management mechanism for the two memory devices. This causes extra page-table overhead and costly interactions between the file system and memory management in Linux kernel space [9].

• Data in NVM is updated with in-place writes without considering the limited lifetime of NVM. This makes these file systems suffer from a severe wear-out problem [11].

• All these file systems provide only meta data or data consistency, which is a low consistency level. However, many real-life applications require file systems to provide high-level version consistency. For example, in distributed in-memory computing, many tasks produce a large number of intermediate results corresponding to different versions [13], and using an underlying file system with multi-version support can significantly simplify task processing. Version consistency supported by the underlying file system allows users to recover from arbitrary mistakes or system crashes by simply retrieving a previous version of the data; this feature enhances the reliability and robustness of the file system. Moreover, the storage needs of users in the modern big data era have shifted from basic data storage requirements to rich interfaces that enable efficient querying of versions and snapshots and the creation of file clones [14]. Although there are several file systems designed for HDD or SSD which support versioning and snapshotting, such as NILFS [15], BTRFS [16] and ZFS [17], there is still no NVM-optimized file system that supports multi-versioning. As an attractive feature of a file system, exploring a multi-versioning mechanism and guaranteeing version consistency in an NVM-optimized file system is therefore meaningful and worthwhile.

Compared with the state-of-the-art NVM-aware file systems listed in Table 1, such as PMFS and NOVA, HMFS makes several different design choices. Figure 1 shows the system architecture of HMFS, from which we can identify the following main design choices.

• First, HMFS adopts a DRAM-NVM hybrid memory architecture where NVM is attached to the memory bus and allows byte-addressable access like DRAM. In HMFS, DRAM and NVM share a unified address space.

• Second, it adopts a log-structured update mechanism for NVM, which is considered beneficial for the wear-leveling problem of NVM.

• Third, it proposes a novel multi-version mechanism which helps HMFS achieve the highest consistency level of a file system, version consistency.

• Lastly, HMFS provides an atomic mmap mechanism which can perform load/store operations to NVM directly via the mmap system call, and achieves stronger data consistency than the traditional XIP-based memory map mechanism.

Figure 1: HMFS System Architecture (user-space applications issue read/write calls or load/store instructions via atomic mmap; the HMFS kernel module provides an efficient full-versioning mechanism, an NVM-friendly allocation mechanism, and a unified memory address space over DRAM and NVM behind a hybrid memory controller)

The contributions of this paper are summarized as follows.

• We develop HMFS, a file system for the DRAM-NVM hybrid memory architecture, which manages DRAM and NVM within a unified address space.

Table 2: Device-level Parameters of Different Memory Technologies

              PCM [1]        STT-RAM [2]   ReRAM [3]   HDD [11]   DRAM [11]   SLC Flash [12]
Granularity   64B            64B           64B         512B       64B         4KB
Read Lat.     50ns           10ns          10ns        5ms        50ns        25µs
Write Lat.    500ns          50ns          50ns        5ms        50ns        500µs
Endurance     10^8 - 10^9    > 10^15       10^11       > 10^15    > 10^15     10^4 - 10^5

The proposed file system performs data access with different mechanisms, i.e., random access to data in DRAM and a log-structured manner for data in NVM. This allows us to take advantage of both DRAM and NVM, including the high access speed of DRAM and the persistent storage of NVM, while avoiding the main drawback of NVM, i.e., the limited number of writes to its storage cells.

• We implement HMFS as a multi-version in-memory file system and propose a data structure to realize the multi-version mechanism and guarantee version consistency, which is the highest consistency level of a file system.

• We carry out extensive experiments to evaluate the performance, versioning efficiency and endurance of HMFS using multiple benchmarks. The results show that HMFS outperforms the state-of-the-art NVM-optimized file systems significantly and achieves much better multi-versioning efficiency and endurance.

The rest of this paper is organized as follows. Section 2 introduces the motivation and background of this work. Section 3 presents the design of HMFS, and Section 4 presents its implementation. Section 5 discusses the experimental results. We review related work in Section 6. Finally, we conclude the paper in Section 7.

2. Preliminary

2.1. Basics of NVM

Modern non-volatile memory (NVM), such as phase change memory (PCM) [1], spin transfer torque RAM (STT-RAM) [2] and resistive RAM (ReRAM) [3],


have the characteristics of both byte-addressability and non-volatility. Typically, modern NVMs have higher density and lower energy consumption, while their read/write performance is comparable to that of DRAM. Table 2 summarizes the performance of state-of-the-art NVMs in several aspects, compared with DRAM, SSD and HDD.

From these statistics, we make several important observations. First, modern NVMs have read/write performance comparable to conventional memory; the read speed of STT-RAM and ReRAM is expected to be five times higher than that of DRAM. Second, the read and write speeds of NVM are asymmetric, unlike those of DRAM: a random read of NVM is typically 5 times more efficient than a random write. Third, most NVMs suffer from limited endurance, i.e., about 4-5 orders of magnitude lower than DRAM.

It is important to note that the second and third observations actually differentiate NVM from DRAM in terms of performance. However, most existing NVM-based file systems treat NVM in the same way as DRAM, and thus achieve limited performance gains. We therefore seek to develop a DRAM-NVM hybrid file system that separates the management of DRAM and NVM in order to maximize the advantages of NVM and avoid its drawbacks.

2.2. Multi-Versioned File System and Consistency

2.2.1. Usage of multi-version

Multi-versioning is a desirable property for modern file systems to provide high availability and reliability [18]. Specifically, a version records an independent status of the file system at a particular time, and it is established when the version management mechanism is triggered in the file system or explicitly initiated by the end user. Multi-versioning allows us to recover file data from users' mistakes or arbitrary system corruption by simply retrieving a previous consistent version of the file system. Moreover, recording multiple versions of the file system supplies information on the historical changes of specific data. This allows users to figure out who made changes to the data and when these changes were performed, which is useful for analyzing data changes or predicting

the trends of particular data.

2.2.2. File system consistency

File system consistency is typically divided into three categories: low-level meta data consistency, middle-level data consistency and high-level version consistency [19]. Meta data consistency requires all the meta data of the file system to be consistent with each other: there are no dangling files or duplicate pointers, and the bitmaps keep track of resource usage exactly. Meta data consistency is often realized by fine-grained journaling or atomic in-place updates in recently emerged NVM-based file systems [6]. However, meta data consistency does not guarantee the consistency between meta data and data. For example, it does not guarantee that the data being read belongs to that file or that it contains no garbage. Data consistency is more stringent than meta data consistency: it requires the meta data of a file to be consistent with the data itself during file read and write operations. Version consistency is the strongest level of consistency in file systems. In addition to the consistency between meta data and data, version consistency guarantees that the referred meta data version matches the corresponding data version. That is, a multi-versioned file system must guarantee the consistency of each version. In other words, multi-versioning is a functionality of the file system, and version consistency guarantees the reliability of that functionality. Complicated data structures for multi-versioning [20] and costly time and space complexity are the main challenges to achieving version consistency.

3. Design

This section presents the design of HMFS.

3.1. HMFS Overview

HMFS is an in-memory file system with version consistency. It supports the standard POSIX interface and works as a kernel module in Linux. Figure 2 shows the architecture overview of HMFS. It works on a DRAM-NVM hybrid

memory architecture where DRAM and NVM share a unified address space. HMFS provides two different ways to expose NVM to user applications. First, user applications can perform read/write operations on in-NVM data via the virtual file system (VFS) in Linux. Second, HMFS includes an MMU mapping component that allows user applications to perform load/store operations to NVM directly via the mmap system call. Among the file system modules, the file and directory index module is mainly used for file and directory data management. The buffered meta data module maintains the data structures in DRAM. The version management, checkpointing and recovery modules are used for maintaining consistency. The segment management module is used for space management on NVM, and the garbage collection (GC) module reclaims NVM data pages during a cleaning process.

Figure 2: HMFS Overview (applications access HMFS through VFS read/write or mmap load/store; the main modules are the file/directory index, buffered meta data, version management, checkpointing, recovery, segment management and GC, built on top of DRAM and NVM)

3.2. Hybrid Memory Layout of HMFS

Figure 3 illustrates the in-memory layout of HMFS. At a high level, HMFS stores volatile, cache-like data structures in DRAM and maintains non-volatile, persistent data structures in NVM. We perform random read/write operations on the in-DRAM data. NVM is divided into two parts.

Figure 3: DRAM-NVM Hybrid Memory Layout in HMFS (DRAM holds the SIT Journal, NAT Log and VAT Cache; NVM holds the Superblock, SIT, BIT, NAT and VAT, which receive random updates, and the Main Area of version, node and data segments, which receives sequential updates)

The first part mainly contains the file system meta data, such as the Superblock (SB), Block Information Table (BIT), Segment Information Table (SIT), Node Address Table (NAT) and Version Address Table (VAT). Updates to this area involve random reads/writes because of their fixed-location requirement. The second part is the Main Area, which can be considered a log-structured store for file data. Updates to the Main Area are grouped into sequences and executed as sequential writes. This avoids overly frequent updates to NVM and protects it from wearing out. In what follows, we provide details on the in-NVM layout and the in-DRAM layout of HMFS.

3.2.1. Layout In NVM

The layout of NVM in HMFS consists of the following components:

• Superblock (SB) stores the basic information of HMFS, including the magic number, current version number, segment number, etc. We maintain two copies of the superblock: a currently active one and a backup.

• Segment Information Table (SIT) contains information for every segment in the Main Area, such as the number of valid blocks in the segment. This information is used or updated by the garbage collection process to select a victim segment or allocate a new segment.

• Block Information Table (BIT) maintains a set of BIT entries, which record information for all 4KB blocks in the Main Area, e.g., the parent inode number and the offset in the segment. The BIT entries are used to identify the parent node block during garbage collection.

• Node Address Table (NAT) is an address table for all the node blocks in HMFS. Each entry in NAT contains a unique node identification number (nid)


and the physical address of the corresponding 4KB node block.

• Version Address Table (VAT) is an address table for all the checkpoint blocks in HMFS. Every entry in VAT includes a version number (vid) and the physical address of a checkpoint block. Note that vid identifies the version to which the checkpoint block belongs.

• Main Area consists of three kinds of segments: data segments, node segments and version segments. A segment of any type has a fixed size of 2MB, comprising 512 4KB blocks. The usage information for every segment is maintained in the SIT. Updates to the Main Area are performed in a log-structured manner. (i) Data Segment. A data segment contains data blocks for all the files. (ii) Node Segment. A node segment consists of node blocks. In HMFS, we have three types of node blocks: inode, direct node and indirect node. An inode block contains the meta data of a file, such as the inode number, file name and file size; a direct node block includes block addresses of file data; an indirect node block contains the nids of its child node blocks. (iii) Version Segment. A version segment contains meta data and data of different versions. Each version includes a set of operation entries, one connection edge list and one checkpoint block.

• Operation entry is a tuple in the format (dest_vid, dest_nid, src_vid, src_nid, operation_type), which records the destination version number, destination node number, source version number, source node number and the operation type.

• Connection Edge List is a list of connection edges that belong to the same version. It is used to record the owner-member relationship between the version and its files.

• Checkpoint block (CP) has a fixed size of 4KB and contains the version number, the free segment count and valid block count in the Main Area, and the version numbers of the previous and next checkpoints. The checkpoint block is crucial for consistency and version management in HMFS.
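To make the on-NVM metadata concrete, here is a minimal C sketch of the NAT, BIT and VAT entries and the checkpoint block described above. The field names follow the text; the exact types, widths and struct names are our assumptions rather than HMFS's actual definitions.

#include <stdint.h>

/* Hypothetical on-NVM metadata layouts for the tables described above.
 * Field widths are assumptions; the paper only names the fields. */
typedef uint32_t nid_t;   /* node identification number */
typedef uint32_t vid_t;   /* version number (the paper mentions 32-bit vids) */

struct hmfs_nat_entry {           /* Node Address Table entry */
    nid_t    nid;                 /* unique node id */
    uint64_t node_block_addr;     /* physical address of the 4KB node block */
};

struct hmfs_bit_entry {           /* Block Information Table entry */
    nid_t    parent_inode;        /* parent inode number of the 4KB block */
    uint32_t segment_offset;      /* offset of the block inside its segment */
};

struct hmfs_vat_entry {           /* Version Address Table entry */
    vid_t    vid;                 /* version number */
    uint64_t checkpoint_addr;     /* physical address of the checkpoint block */
};

struct hmfs_checkpoint {          /* 4KB checkpoint block (header fields only) */
    vid_t    vid;                 /* version this checkpoint belongs to */
    uint32_t free_segment_count;  /* free segments in the Main Area */
    uint32_t valid_block_count;   /* valid blocks in the Main Area */
    vid_t    prev_vid, next_vid;  /* previous/next checkpoint versions */
};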


3.2.2. Layout In DRAM

The layout of DRAM in HMFS contains three components: the SIT Journal, the NAT Log and the VAT Cache.

• SIT Journal is used to absorb changes to the SIT area temporarily. These changes are stored in DRAM as SIT entries and are synchronized to the SIT area in NVM during a checkpointing process; after checkpointing, they are reclaimed to absorb new changes.

• NAT Log stores the nids of files which are being read, written or newly created. This on-demand growing log guarantees a moderate cache cost and accelerates file search operations. Specifically, when searching for a file, we first check whether the nid of the file is in the NAT Log. If so, we directly retrieve the corresponding block. Otherwise, we check the NAT area in NVM.

• VAT Cache is a doubly-linked list in which every node contains a version number vid and the checkpoint block address of that version. It is an LRU list that records recently visited or newly created versions. When a version is visited or created, its vid and CP block address are added to the list tail. We do not add deleted versions to the list because the deletion of a version is executed directly in the version segment and the version will not be visited again. We use the VAT Cache to accelerate version search operations in a similar way to the NAT Log.
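The NAT Log lookup path described above (check the in-DRAM log first, then fall back to the NAT area in NVM) could look roughly like the following sketch; nat_log_lookup, nat_area_lookup and nat_log_insert are hypothetical helpers standing in for HMFS internals, and caching the result on a miss is our assumption.

#include <stdint.h>

/* Hypothetical helpers: look up a nid in the in-DRAM NAT Log and in the
 * on-NVM NAT area, returning the node block address or 0 if not found. */
uint64_t nat_log_lookup(uint32_t nid);   /* fast path, DRAM */
uint64_t nat_area_lookup(uint32_t nid);  /* slow path, NVM */
void     nat_log_insert(uint32_t nid, uint64_t addr);

/* Resolve a nid to the physical address of its 4KB node block. */
uint64_t hmfs_resolve_nid(uint32_t nid)
{
    uint64_t addr = nat_log_lookup(nid);      /* check the NAT Log first */
    if (addr == 0) {
        addr = nat_area_lookup(nid);          /* fall back to the NAT in NVM */
        if (addr != 0)
            nat_log_insert(nid, addr);        /* cache it for later accesses (assumption) */
    }
    return addr;
}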

Figure 4: File Structure (the inode block holds meta data, 928 direct pointers or inline data, inline xattrs, and 2 each of single-, double- and triple-indirect pointers, which lead through direct and indirect blocks to data blocks)

3.3. File and Directory Structure

File structure. HMFS utilizes a node structure that extends the inode map [21] used in traditional log-structured file systems to locate more indexing blocks. Figure 4 illustrates the file structure of HMFS. The inode block of a file contains 928 data block pointers, 2 single-indirect block pointers, 2 double-indirect block pointers and 2 triple-indirect block pointers. On 64-bit machines, the size of each pointer is 8 bytes, and a 4KB block is able to contain 1018 pointers. Thus the maximum size of a file in HMFS is (928 + 2 × 1018 + 2 × 1018^2 + 2 × 1018^3) × 4KB = 3.94TB.
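As an illustration of this indexing scheme, the sketch below classifies a logical block index by the level of indirection needed to reach it. The constants come from the description above; the function itself is ours and is not part of HMFS.

#include <stdint.h>

#define DIRECT_PTRS      928ULL   /* direct pointers in the inode block */
#define PTRS_PER_BLOCK   1018ULL  /* pointers held by one 4KB indexing block (as stated above) */

/* Return the number of indirection levels (0..3) needed to reach the data
 * block with logical index `blk` inside a file, following the inode layout
 * described above. Returns -1 if the index is out of range. */
int hmfs_index_level(uint64_t blk)
{
    uint64_t single = 2 * PTRS_PER_BLOCK;                                    /* 2 single-indirect pointers */
    uint64_t dbl    = 2 * PTRS_PER_BLOCK * PTRS_PER_BLOCK;                   /* 2 double-indirect pointers */
    uint64_t triple = 2 * PTRS_PER_BLOCK * PTRS_PER_BLOCK * PTRS_PER_BLOCK;  /* 2 triple-indirect pointers */

    if (blk < DIRECT_PTRS)  return 0;
    blk -= DIRECT_PTRS;
    if (blk < single)       return 1;
    blk -= single;
    if (blk < dbl)          return 2;
    blk -= dbl;
    if (blk < triple)       return 3;
    return -1;              /* beyond the maximum file size */
}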

Figure 5: Directory Structure (a multi-level hash table from level #0 to level #N; buckets in the upper levels hold two dentry blocks, B(2P), and buckets in the lower levels hold four, B(4P))

Directory structure. HMFS adopts a multi-level hash table to implement the directory structure, as shown in Figure 5. The maximum level is controlled by the parameter MAX_DIR_LEVEL (M for short); by default, M is set to 64. Each level contains a hash table with a fixed number of buckets, and every bucket maintains several dentry blocks. Let f(n) and g(n) denote the number of buckets and the number of blocks per bucket in level n, respectively. When looking up a file named s in a directory, HMFS first calculates the hash value of s, then scans the multi-level hash table from level 0 upward until it finds the dentry consisting of the file name and its inode number. In each level, HMFS needs to scan only one bucket, the one with identifier HashValue(s) % f(n), which gives an O(log(#files)) lookup complexity.
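A sketch of the lookup loop described above. Since the paper's formula for f(n) and g(n) is not reproduced here, the bucket-count function and the per-bucket dentry scan are treated as given helpers (hypothetical names).

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers: f(n) = number of buckets in level n, plus a scan of
 * one bucket's dentry blocks and the directory hash function. */
uint32_t f_buckets(int level);
bool scan_bucket(int level, uint32_t bucket, const char *name, uint32_t *inode_no);
uint32_t hmfs_hash(const char *name);

/* Look up `name` in a directory laid out as the multi-level hash table
 * described above; returns true and fills *inode_no when found. */
bool hmfs_dir_lookup(const char *name, int max_level, uint32_t *inode_no)
{
    uint32_t h = hmfs_hash(name);
    for (int n = 0; n <= max_level; n++) {
        uint32_t bucket = h % f_buckets(n);   /* only one bucket per level is scanned */
        if (scan_bucket(n, bucket, name, inode_no))
            return true;
    }
    return false;
}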


4. Implementation

This section provides the implementation details of HMFS.

4.1. Multi-versioning Mechanism Model

In a multi-versioned file system, there are two independent sets: the set of versions and the set of files. We denote the versions set by V and the files set by F; a connection edge connects a version and a file. A version can contain a number of files, and the same file can also be included in different versions. A new version may have several parent versions, and a parent version can also have several descendant versions. If a version v1 contains a file f1, there is an inclusive edge between v1 and f1; if a version v1 does not contain a file f1, there is an exclusive edge between v1 and f1. Several invariants of a multi-versioned file system are:

1. A file that exists in different versions is considered as different files in the file system.
2. Newly creating a file or modifying a file of an older version are both considered as the creation of a new file in the current version.
3. If a file of another version is modified or deleted in the current version, then the current version should not contain or retain the original file which has been modified or deleted in the other version.
4. When retrieving a version of the file system with a given version number, the file system should return the files related to this version and retain the other files that have no changes with respect to this version.

4.2. Multi-version Data Structure

Figure 6: Version Node (fields: vid | parent_vids | checkpoint_address | first_link)

• Version Node. Figure 6 illustrates the version node structure. In a version node, vid represents the current version number, parent_vids represent the related (parent) version numbers, checkpoint_address points to the checkpoint block of this version, and first_link points to the first connection edge.

Figure 7: Connection Edge (fields: vid | nid | reverse_link | head_link)

• Connection Edge. Figure 7 presents the connection edge structure. In a connection edge, vid represents the current version number and nid represents the file node number (each file has a unique node number in the NAT table); reverse_link stores a pointer to the connection edge that has the same nid but a different vid in another version, and head_link stores the pointer to the next connection edge.

Figure 8: Operation Sequence (fields: dest_vid | dest_nid | src_vid | src_nid | operation_type)

• Operation Sequence. Figure 8 presents the operation sequence structure. An operation sequence in the version segment contains the information of an operation on a version and its files. The first element, dest_vid, stores the destination version id; the second element, dest_nid, stores the destination node id; the third element, src_vid, stores the source version id and can be null; the fourth element, src_nid, stores the source node id and can also be null; the fifth element stores the operation type, where 'U' represents an update operation, 'I' represents an insert operation and 'D' represents a delete operation. For example, the operation sequence (vid3, nid6, vid2, nid3, U) records that in version vid3 we use file nid6 to update file nid3, which originally belonged to version vid2.

4.3. Some Invariants about the Versioning File System

1. A file that exists in different versions is considered as the same file in a multi-version file system.
2. Modifying a file of an older version will use a new file to cover the old file in the new version.
3. If a file of the parent version is updated or deleted in the current version, then the updated or deleted file should not appear in the new version again.

Figure 9: Version Segment Status (the VAT maps vid1-vid5 to checkpoint addresses cp1_addr-cp5_addr; the version segment stores, for each version, its operation entries, inclusive list and checkpoint block CP1-CP5)

4. The unmodified files in the parent version should be inherited by the new version.

Typically, the operations on the files in a version are the following.

1. Create. Creating a file in a version adds an inclusive connection edge between the version and the added file.
2. Delete. Deleting a file in a version adds an exclusive connection edge between the version and the deleted file.
3. Update. Updating a file of an older version adds an inclusive connection edge between the new version and the file, and an exclusive connection edge between the old version and the original file. In other words, an update operation can be divided into a create operation plus a delete operation. This principle plays an important role in the later sections.
4. Read. Reading a file involves no update or modification of the file or the corresponding version; the file data is simply obtained through an ordinary read operation.

Based on the above principles and invariants, we next give an example to illustrate the proposed multi-version mechanism.
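For concreteness, the versioning structures of Section 4.2 (version node, connection edge, operation entry) might be declared as follows. The field names follow Figures 6, 7 and 8; the types, widths and the fixed-size parent array are our assumptions.

#include <stdint.h>

typedef uint32_t vid_t;  /* version number */
typedef uint32_t nid_t;  /* file node number in the NAT */

/* Operation entry recorded in the version segment (Figure 8). */
struct hmfs_op_entry {
    vid_t dest_vid;        /* destination version */
    nid_t dest_nid;        /* destination node, 0 (null) for deletes */
    vid_t src_vid;         /* source version, 0 (null) for creates */
    nid_t src_nid;         /* source node, 0 (null) for creates */
    char  op;              /* 'I' insert, 'U' update, 'D' delete */
};

/* Connection edge between a version and a file (Figure 7). */
struct hmfs_edge {
    vid_t vid;                      /* owning version */
    nid_t nid;                      /* file node number */
    struct hmfs_edge *reverse_link; /* edge with the same nid in another version */
    struct hmfs_edge *head_link;    /* next edge of the same version */
};

/* Version node (Figure 6). */
struct hmfs_version {
    vid_t vid;                      /* this version's number */
    vid_t parent_vids[2];           /* related (parent) versions; the count is an assumption */
    uint64_t checkpoint_addr;       /* address of this version's checkpoint block */
    struct hmfs_edge *first_link;   /* head of the connection edge list */
};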

Figure 10: Versioning Status (the connection edge lists of versions v1-v5 after all operations, showing parent vids, head links and reverse links)

4.4. Example Analysis

We consider a file system with five versions and eleven files and analyze the operation sequences. Each version is created by the following operations. Figure 9 illustrates the physical status of the version segment and Figure 10 shows the connection edge list status of the file system after all the operations.

1. Version 1 creates files 1, 2 and 3;
2. Version 2 updates file 2 of version 1 to file 4, deletes file 1 of version 1 and creates file 5;
3. Version 3 updates file 3 of version 2 to file 6, deletes file 4 of version 2 and creates file 7;
4. Version 4 updates file 4 of version 2 to file 8, deletes file 6 of version 3 and creates file 9;
5. Version 5 updates file 8 of version 4 to file 10, deletes file 7 of version 4 and creates file 11.

Now we analyze which files are included in each version. In version 1, file 1, file 2 and file 3 are newly created, so it goes without saying that version 1 contains


file 1, file 2 and file 3. In version 2, with version 1 as its parent version, file 2 of version 1 is updated to file 4, so file 4 should be contained in version 2 and file 2 should not be contained in version 2 anymore; file 1 of version 1 is deleted and naturally should not appear in version 2; file 5 is newly created and should appear in version 2; and file 3 has no changes from version 1 to version 2, so it is inherited by version 2. Therefore version 2 contains files 3, 4 and 5. Similarly, we can derive the results for versions 3, 4 and 5: version 3 contains files 5, 6 and 7; version 4 contains files 3, 5, 7, 8 and 9; and version 5 contains files 3, 5, 9, 10 and 11 (cf. Figure 10).

4.5. File Related Algorithms

In this section, we present the algorithms related to file operations. Algorithms 1 to 4 show how to create, update, delete and read a file in HMFS.

Create a file: Creating a file in HMFS involves the following steps. First, a dest_nid is assigned to the newly created file; then the file system writes the file meta data and data to the node segment and the data segment respectively; lastly, the version number dest_vid and the operation sequence are written to the version segment.

Update a file: Updating a file in HMFS involves the following steps. First, HMFS gets the source version number src_vid and source node number src_nid of the file being updated; then it assigns a destination node number dest_nid to the updated file; next it writes the meta data and data of the file to the node segment and the data segment respectively; lastly, it writes the operation sequence to the version segment.

Delete a file: Deleting a file in HMFS does not actually delete the file data unless the number of versions containing the file drops to zero. The steps are as follows: first get the src_vid and src_nid of the file being deleted, then get the current version number as the dest_vid, and lastly write the operation sequence to the version segment.

Read a file: Reading a file in HMFS does not involve any changes to the version segment. It only needs to get the vid and nid of the file and then read the file meta data and data from the node segment and data segment respectively.

Algorithm 1 Create a file
1: Assign a dest_nid to the new file;
2: Write the file meta data to the node segment;
3: Write the file data to the data segment;
4: Get the version number dest_vid of the newly created file;
5: Write (dest_vid, dest_nid, null, null, I) to the version segment;

Algorithm 2 Update a file
1: Get the src_nid and src_vid of the file being updated;
2: Assign a dest_nid to the updated file;
3: Write the new file meta data to the node segment;
4: Write the differential file data to the data segment;
5: Write (dest_vid, dest_nid, src_vid, src_nid, U) to the version segment;

Algorithm 3 Delete a file
1: Get the src_nid and src_vid of the file being deleted;
2: Get the version number dest_vid of the current operating version;
3: Write (dest_vid, null, src_vid, src_nid, D) to the version segment;

Algorithm 4 Read a file
1: Get the vid and nid of the file being read;
2: Read the file meta data from the node segment;
3: Read the file data from the data segment;
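As a rough illustration of how Algorithms 1 and 2 map onto code, the sketch below appends to the node, data and version segments through hypothetical helper functions; it is not HMFS's actual implementation.

#include <stddef.h>
#include <stdint.h>

typedef uint32_t vid_t;
typedef uint32_t nid_t;

/* Hypothetical helpers: append to the log-structured segments and record an
 * operation entry in the version segment. */
nid_t alloc_nid(void);
void  node_segment_append(nid_t nid, const void *meta, size_t len);
void  data_segment_append(nid_t nid, const void *data, size_t len);
void  version_segment_append(vid_t dest_vid, nid_t dest_nid,
                             vid_t src_vid, nid_t src_nid, char op);
vid_t current_vid(void);

/* Algorithm 1: create a file. */
nid_t hmfs_create(const void *meta, size_t mlen, const void *data, size_t dlen)
{
    nid_t dest_nid = alloc_nid();                    /* step 1 */
    node_segment_append(dest_nid, meta, mlen);       /* step 2 */
    data_segment_append(dest_nid, data, dlen);       /* step 3 */
    version_segment_append(current_vid(), dest_nid,  /* steps 4-5: (dest_vid, dest_nid, null, null, I) */
                           0, 0, 'I');
    return dest_nid;
}

/* Algorithm 2: update a file that belongs to (src_vid, src_nid). */
nid_t hmfs_update(vid_t src_vid, nid_t src_nid,
                  const void *meta, size_t mlen, const void *delta, size_t dlen)
{
    nid_t dest_nid = alloc_nid();                    /* step 2 */
    node_segment_append(dest_nid, meta, mlen);       /* step 3 */
    data_segment_append(dest_nid, delta, dlen);      /* step 4: differential data */
    version_segment_append(current_vid(), dest_nid,  /* step 5: (dest_vid, dest_nid, src_vid, src_nid, U) */
                           src_vid, src_nid, 'U');
    return dest_nid;
}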

4.6. Version Related Algorithms

In this section, we describe the version-related algorithms in HMFS, including creating, deleting and reading a version, and listing all versions that contain a given file.

Create a version: Creating a version in HMFS mainly involves the following steps. First, it traverses the operation sequences stored on the version segment, adds the newly created file nids to the inclusive list, and adds the deleted or updated file nids to the exclusive list. Then it finds the related version numbers and traverses the connection edge list of each related version; in this step, it finds the file nids that are inherited by the new version, adds the corresponding connection edges to the new version's edge list, and adds the reverse links of the inherited files between the new version and the old versions.

Delete a version: Deleting a version in HMFS mainly involves the following steps. First, it gets the version node by looking up the VAT table; then it deletes the operation sequences, the connection edge list and the checkpoint block in turn; lastly, it traverses the VAT table to find the related version vids of the deleted vid and removes the deleted vid from the related-vids list of each related version.

Read a version: Reading a version involves no file update operations and only needs to get the files that belong to the requested version. When a version number is given as the parameter, it first looks up the VAT table to get the checkpoint block address of this version, then it gets the connection edge list that is stored adjacent to the checkpoint block, and lastly it obtains all the files by traversing the connection edge list recursively.

List versions: List versions in HMFS provides a function for the case where you have a file at hand and want to know how many versions in the file system contain it. This algorithm accepts a file nid as a parameter and returns the version numbers that contain that nid. First, it finds the first version that contains the file, then it finds the other related versions by tracing the reverse links of the file's connection edge, and lastly it lists all the version numbers as the result. This function is implemented as a shell command, lv, in HMFS.

4.7. Multi-versioning Mechanism Summary

HMFS proposes a set of data structures and algorithms to achieve a full-versioning mechanism in an NVM-aware file system. To our knowledge, this is the first work that achieves full-versioning and version consistency among the state-of-the-art NVM-aware file systems. In HMFS, the versioning information is stored independently on the version segment.

For the time complexity, when a version is requested, HMFS first queries the VAT table to find the checkpoint block address that points to the version on the version segment. VAT is a mapping table between the vid and the checkpoint block address, so this step takes O(1) time. Second, HMFS gets the inclusive list that is stored adjacent to the checkpoint block and reads out all the files belonging to

Algorithm 5 Create a version
1: for all each oper_entry in the version segment since the last stable checkpoint do
2:   if (!dest_vid->related_vids.isContained(src_vid)) and src_vid != NULL then
3:     dest_vid->related_vids.add(src_vid);
4:   end if
5:   switch (oper_entry.operation_type)
6:   case 'I': add edge_entry(dest_vid, dest_nid) to dest_inclusive_list;
7:   case 'U': add edge_entry(dest_vid, dest_nid) to dest_inclusive_list; add src_nid to exclusive_list;
8:   case 'D': add src_nid to exclusive_list;
9: end for
10: for all each related_vid in related_vids do
11:   get the src_inclusive_list of that version;
12:   for all each edge_entry in src_inclusive_list do
13:     if (!edge_entry->nid.isContainedIn(exclusive_list) and edge_entry->reverse_link == NULL) then
14:       newEdgeEntry = dest_inclusive_list.addEntry(dest_vid, nid);
15:       exclusive_list.delete(nid);
16:       edge_entry->reverse_link = newEdgeEntry;
17:     end if
18:   end for
19: end for

Algorithm 6 Delete a version
1: Get the checkpoint address of the deleting vid;
2: Get the previous checkpoint address of the deleting vid;
3: Delete the operation sequences and inclusive list between the two checkpoints;
4: for all each vid in VAT except the deleting vid do
5:   if (deleting_vid.isContainedIn(vid.parent_vids)) then
6:     vid.parent_vids.delete(deleting_vid);
7:   end if
8: end for
9: Delete the checkpoint block of the deleting vid.

Algorithm 7 Read a version
1: Get the vid of the required version;
2: Look up the VAT table with the given vid;
3: Get the checkpoint block of vid;
4: Get the inclusive list of version vid;
5: for all each edge_entry in the inclusive list do
6:   Get each file's data with its nid;
7: end for


Algorithm 8 List versions
1: Set the version list vlist for the required nid;
2: for all each vid in the VAT table do
3:   if (vid.Contains(required_nid)) then
4:     first_vid = vid;
5:     break;
6:   end if
7: end for
8: Get the inclusive list of the version first_vid;
9: for all each edge_entry of first_vid do
10:   if (edge_entry.nid == required_nid) then
11:     break;
12:   end if
13: end for
14: nextEdge = edge_entry->reverse_link;
15: while (nextEdge != NULL) do
16:   vlist.add(nextEdge.vid);
17:   nextEdge = nextEdge->reverse_link;
18: end while
19: Return vlist;

this version one by one. The inclusive list is a linked list whose elements are file nids; once a file nid is obtained, HMFS performs the query in the same way as a non-versioning file system. This step takes O(n) time.

For the space complexity, the space cost of the versioning mechanism consists of the VAT table located in the meta data area and the checkpoint-related cost on the version segment. The VAT table keeps a mapping between a 32-bit vid and a 64-bit checkpoint address. On the version segment, the space cost mainly consists of the operation sequences, the inclusive list and the checkpoint block. Among them, the number of operation sequences is proportional to the number of files operated on between two checkpointing processes, the number of nids contained in an inclusive list equals the number of files belonging to the version, and the checkpoint block is a fixed-size 4KB block. We keep the least information needed for versioning-related operations and use simple data structures to implement the related algorithms, so HMFS achieves nearly the smallest space cost for keeping every version independently.


4.8. Optimized Memory-Map I/O (mmap)

The mmap access method allows applications to access NVM via load/store instructions by mapping file data pages directly into the application's address space. It exposes the raw performance of NVM to applications and leverages the byte-addressability and non-volatility of NVM very naturally. It is considered an attractive feature for future NVM-based file systems and has been explored in some state-of-the-art NVM-optimized file systems [6, 7]. The merit of memory-mapped access is that it can bypass the file system page cache and the VFS system call, reducing paging overhead and shortening the access path. But it also exposes several challenges. One of the key challenges is that mmap operates on raw NVM with the help of the hardware MMU, so the only consistency mechanisms available are primitives such as 64-bit atomic writes, mfence and clflush provided by the CPU, and it is difficult to guarantee the consistency of the file system using only these primitives. To address this problem, we implement an atomic mmap mechanism in HMFS. Because of the log-structured basic layout of our file system, updating a file in the XIP manner would break its rules, so we still adopt the log-structured method: we allocate replica pages from NVM, copy the file data to the newly allocated pages, and then map the replica pages into the application's address space. When the mapped pages need to be persisted into the file system, this goes through the msync system call, which is converted into a traditional write operation. This out-of-place memory map mechanism in HMFS provides a stronger consistency guarantee but brings higher overhead than the traditional XIP memory map mechanism.

4.9. Garbage Collection

Garbage collection is a necessary function for every log-structured file system. HMFS supports two kinds of garbage collection strategies, foreground and background, and two victim selection policies, greedy and cost-benefit. A foreground garbage collection process adopts the greedy victim segment selection algorithm, while a background garbage collection process adopts the cost-benefit victim segment selection algorithm.

Algorithm 9 Garbage Collection
1: if Garbage_Collection.mode == background then
2:   Set the garbage collection algorithm to the cost-benefit algorithm;
3:   Scan the SIT meta data area;
4:   Calculate the cost-benefit ratio of each segment by the formula in Section 4.9;
5:   Select the segment holding the largest cost-benefit ratio as the victim segment;
6: else
7:   Set the garbage collection algorithm to the greedy algorithm;
8:   Scan the SIT area in the meta data area;
9:   Select the segment holding the smallest number of valid blocks as the victim segment;
10: end if
11: Identify the valid blocks of the victim segment via the BIT meta data area;
12: Move the valid blocks to the head segment of each data area;
13: Do a checkpoint to record the new status of the file system after this garbage collection;
14: Set the victim segment free and add it to the SIT area as a new empty segment.

The greedy policy selects the segment which has the smallest number of valid blocks. This policy is straightforward but not necessarily the best, and it is only adopted in an urgent foreground cleaning process. The cost-benefit policy, on the other hand, makes a trade-off between the garbage collection cost and its benefit. It considers not only the number of valid blocks in a segment but also the modified time and update likelihood of the segment. It is adopted as the main policy in the background cleaning process. It calculates the cost-benefit ratio by the following formula [22] and chooses the segment with the highest ratio value:

    benefit / cost = (1 − u) × segmentAge / (1 + u)

In the above formula, u denotes the utilization of the segment, and segmentAge is given by the latest modified time of any block in the segment. The cost of reclaiming a segment indicates the amount of data that must be read or rewritten for the cleaning. The benefit is the amount of free space that will be reclaimed, weighted by an additional factor which represents the data stability of the segment. The details of the garbage collection process are illustrated in Algorithm 9.
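A small sketch of background victim selection using the cost-benefit formula above; the per-segment summary structure and the SIT iteration interface are assumptions.

#include <stdint.h>

/* Hypothetical per-segment summary taken from the SIT. */
struct seg_info {
    uint32_t segno;
    uint32_t valid_blocks;    /* valid blocks in the segment */
    uint32_t total_blocks;    /* 512 for a 2MB segment of 4KB blocks */
    uint64_t mtime;           /* latest modified time of any block in the segment */
};

/* benefit/cost = (1 - u) * segmentAge / (1 + u), where u is utilization. */
static double cost_benefit(const struct seg_info *s, uint64_t now)
{
    double u   = (double)s->valid_blocks / s->total_blocks;
    double age = (double)(now - s->mtime);
    return (1.0 - u) * age / (1.0 + u);
}

/* Pick the background-GC victim: the segment with the highest ratio. */
uint32_t select_victim(const struct seg_info *sit, int nsegs, uint64_t now)
{
    int best = 0;
    double best_ratio = cost_benefit(&sit[0], now);
    for (int i = 1; i < nsegs; i++) {
        double r = cost_benefit(&sit[i], now);
        if (r > best_ratio) {
            best_ratio = r;
            best = i;
        }
    }
    return sit[best].segno;
}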


In the algorithm, the first step of a garbage collection process is to determine whether it is a foreground or a background process. A foreground garbage collection process is triggered explicitly by the user, while background garbage collection is the default strategy when the file system executes a self-cleaning process. Once a victim segment is selected, HMFS scans the BIT table of the victim segment to identify which blocks are valid, and then migrates the valid blocks. This step accounts for most of the time and space cost of the whole process: every valid block needs one read and one write operation to read it out of the victim segment and write it to its new segment. After all of the valid blocks have been migrated to the target segment, the victim segment becomes a candidate free segment. After that, a checkpointing process is triggered to record the new segment usage after the cleaning. Once the checkpointing process has completed successfully, the segment is finally set free.

4.10. Recovery and Consistency

HMFS uses the checkpointing mechanism to guarantee the consistency of the whole file system and to provide a consistent recovery point after a system crash or power failure. Each checkpoint block has a header and a footer, which are written at the beginning and the end of the block respectively. If a checkpoint block has identical contents in the header and footer, it is considered valid; otherwise, it is dropped. Each valid checkpoint refers to a valid, independent version status of the file system. HMFS manages many valid checkpoints and keeps them consistent with each other to guarantee the version consistency of the whole file system. A consistent version inherently guarantees the data consistency and meta data consistency of the file system. When HMFS mounts a file system instance or recovers from a system crash, it first needs to reconstruct the data structures in DRAM and then be ready to receive requests based on a consistent status. We call this process recovery, and it is divided into two cases: normal recovery and crash recovery.

Algorithm 10 Normal Recovery
1: Get the valid super block to find the basic information of the file system;
2: Construct the SIT, BIT, NAT and VAT meta data areas after the super block;
3: Construct the version segments, node segments and data segments in NVM;
4: Lazily build the SIT Journal, NAT Log and VAT Cache in DRAM;
5: Get ready to receive file I/O requests from the client.

Normal recovery refers to the case where the service shuts down normally and the file system performs a clean unmount; once the file system is mounted again, a normal recovery process rebuilds the layout on the storage device. For normal recovery, in order to reduce the recovery time, HMFS adopts a lazy-build policy: it postpones rebuilding the radix tree of the NAT Log and the doubly-linked list of the VAT Cache until the file system accesses the corresponding nid or vid for the first time. We adopt this rebuild policy because during a normal remount there is no invalid checkpoint on the version segment, and HMFS only needs to reconstruct the meta data and the data area of each kind of segment. This policy mainly aims to accelerate the recovery process and reduce DRAM consumption when the service has just started. As for a sudden file access issued by an unsuspecting user, the low read/write latency of NVM keeps this delay at an acceptable level.

Crash recovery refers to the case where the file system suffered an abnormal dismount due to a system crash or a sudden power failure; it then needs to recover the file system to a consistent status and drop the garbage data written after the latest stable checkpoint. HMFS executes a crash recovery process with the following steps. First, HMFS scans the VAT table to find the max_vid, the version number of the newest version of the current file system; this step takes O(n) time. Second, HMFS gets the latest checkpoint block via the max_vid with a single 4KB block read. Third, HMFS scans the version segment area located after the latest stable checkpoint and gets the operation entries after the latest checkpoint address one by one. For each operation entry, HMFS gets its dest_nid, which is the nid of a file newly created or updated after the latest checkpoint; this step takes O(n) time.


Algorithm 11 Crash Recovery
1: Look up the VAT table and find the max_vid;
2: for all each version node block on the version segment do
3:   if (version_node_block->vid.Equals(max_vid)) then
4:     latest_checkpoint_address = version_node_block->checkpoint_address;
5:     break;
6:   end if
7: end for
8: if (latest_checkpoint_address != NULL) then
9:   Get the latest stable checkpoint block at latest_checkpoint_address on the version segment;
10: end if
11: for all each oper_entry after the latest stable checkpoint do
12:   Get the dest_nid of the oper_entry;
13:   Look up the NAT table with dest_nid;
14:   for all nid in the NAT table do
15:     if (nid.Equals(dest_nid)) then
16:       Get the node block on the node segment by nid;
17:     end if
18:   end for
19:   Delete the file meta data and data on the node segment and data segment by nid;
20:   Delete the oper_entry on the version segment;
21: end for

Fourth, HMFS looks up the NAT table with the dest_nid to get the node blocks of the newly created or updated files. Lastly, HMFS deletes the file meta data and data on the node segment and data segment respectively, and deletes the operation entries on the version segment; this process is repeated until all of the operation entries after the latest checkpoint are deleted. Algorithm 11 illustrates the crash recovery process in detail.
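As described in Section 4.10, a checkpoint block is considered valid only when its header and footer carry identical contents. A minimal sketch of that check follows; the stamp layout is assumed, since the paper does not specify the header/footer format.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed 4KB checkpoint block layout: an identical stamp is written at the
 * beginning (header) and the end (footer) of the block. */
#define CP_BLOCK_SIZE 4096

struct cp_stamp {
    uint32_t vid;       /* version number of this checkpoint */
    uint64_t write_seq; /* hypothetical monotonic sequence number */
};

/* A checkpoint is valid only if header and footer carry identical contents,
 * which implies the whole block was written completely. */
bool checkpoint_is_valid(const uint8_t cp_block[CP_BLOCK_SIZE])
{
    struct cp_stamp header, footer;
    memcpy(&header, cp_block, sizeof(header));
    memcpy(&footer, cp_block + CP_BLOCK_SIZE - sizeof(footer), sizeof(footer));
    return memcmp(&header, &footer, sizeof(header)) == 0;
}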

5. Evaluation

In this section, we present the evaluation of HMFS and address the following questions.

1. How does HMFS perform on different NVM media, such as PCM and STT-RAM?


2. How does HMFS perform against state-of-the-art file systems optimized for NVM?
3. How does HMFS perform under different workload scenarios, such as OLTP databases and key/value stores?
4. What is the functional and performance efficiency of the versioning approach proposed in HMFS?
5. How efficient is the log-structured mechanism in improving the endurance of NVM media with a limited lifespan?

We answer these questions with a carefully designed and comprehensive experimental methodology, described in this section.

5.1. Experimental Setup

In our experiments, we use the Intel Persistent Memory Emulation Platform (PMEP) [6][7][12] to emulate two kinds of NVM, PCM and STT-RAM, setting the medium parameters according to Table 2. PMEP emulates NVM on a dual-socket Intel Xeon processor-based platform using special CPU microcode and custom platform firmware. The platform has four processors; each processor has a 2 MB L2 cache and a 16 MB L3 cache, runs at 2.6 GHz with 8 cores, and supports up to 4 DDR3 channels. Channels 0 and 1 are reserved for DRAM and channels 2 and 3 for NVM. PMEP has a default configuration of 16 GB DRAM and 256 GB NVM, a 1:8 capacity ratio, and allows the latency and bandwidth of the different NVMs to be configured. The platform also has a hard disk array with a total storage capacity of 2 TB. The operating system distribution is CentOS 6.5 with Linux kernel 3.11.

Our experiments cover three metrics: performance, versioning efficiency and endurance. We choose PMFS, NOVA, F2FS and EXT4 for the performance comparison; these file systems are listed in Table 3. Among the four, PMFS and NOVA are open-source state-of-the-art NVM-aware file systems, and both use the PMEP platform in the experiments. F2FS and EXT4


are traditional block-based file systems. For a fair comparison, we use the Intel persistent memory driver [23] to construct an NVM-based block device as the experimental environment for them.

Table 3: File Systems for Performance Comparison

pmfs   A state-of-the-art NVM-aware file system.
nova   A newly open-sourced NVM-aware file system.
f2fs   A log-structured file system optimized for flash storage.
ext4   A widely-used journaling file system with in-place updates.

5.2. Experimental Result Summary

We evaluate HMFS on three metrics: performance, versioning efficiency and endurance. The performance metric mainly evaluates the basic file operation performance through the read/write and load/store interfaces, as well as the ability to support specific workloads such as OLTP and key/value store workloads. The results show that HMFS achieves very competitive performance compared with existing state-of-the-art NVM-aware file systems. The versioning efficiency metric mainly aims to validate the functionality and efficiency of the versioning mechanism proposed in HMFS. The results show that HMFS outperforms existing multi-versioning file systems significantly and achieves an almost complete full-versioning function with very small space and time cost. The endurance metric illustrates how the in-memory layout of HMFS and its decoupled data update manner alleviate the wear-leveling problem of NVM media with a limited lifespan.

29

Figure 11: IOZONE Read/Write Performance on DRAM/PCM Hybrid Architecture ((a) Random Read, (b) Random Write, (c) Sequential Read, (d) Sequential Write; throughput in KB/s vs. record size in KB for hmfs, pmfs, nova, f2fs and ext4)

Table 4: IOZONE Workload Characteristics

IO Type      Parameter   File Size    Min Record Size   Max Record Size
Read/Write   -i 0 1 2    512M 1G 2G   -y 16K            -q 64M
Mmap         -B -G       512M 1G 2G   -y 16K            -q 64M

A. IOZONE

IOZONE is used to evaluate the random read (RR), random write (RW), sequential read (SR) and sequential write (SW) performance of each file system. We choose three file sizes for testing: 512M, 1G and 2G. The record size increases from 64KB to 16M. All results are averaged over 5 runs. The detailed configuration of the IOZONE workloads is presented in Table 4. We present the read/write interface results in Figures 11 and 12, and the mmap interface results in Figures 13 and 14.

Figure 12: IOZONE Read/Write Performance on DRAM/STT-RAM Hybrid Architecture. (Panels: (a) Random Read, (b) Random Write, (c) Sequential Read, (d) Sequential Write; throughput in KB/s versus record size in KB for hmfs, pmfs, nova, f2fs and ext4.)

Each figure has four sub-figures showing the RR, RW, SR and SW throughput, respectively. Overall, under both the read/write and mmap interfaces, the absolute throughput on the DRAM/STT-RAM hybrid architecture is about 5% to 10% higher than on the DRAM/PCM hybrid architecture. We attribute this to the lower read and write latency of STT-RAM, which boosts overall performance. Under the read/write interfaces, in the RR case, HMFS is on a par with PMFS, and NOVA also delivers competitive results; the other two file systems perform relatively weakly. This indicates that in-memory file systems outperform block-based file systems even when the latter are deployed on a memory-backed device. In the RW case, HMFS and NOVA show higher performance because both are log-structured file systems.


Figure 13: IOZONE Mmap Performance on DRAM/PCM Hybrid Architecture. (Panels: (a) Random Read, (b) Random Write, (c) Sequential Read, (d) Sequential Write; throughput in KB/s versus record size in KB for hmfs, pmfs, nova, f2fs and ext4.)

A key property of a log-structured file system is that it converts random writes into sequential writes, which boosts random write performance. In the SR case, HMFS performs robustly and shows almost no performance fluctuation, while F2FS performs worst, indicating that a traditional log-structured file system designed for block devices may not fit the memory environment well. In the SW case, HMFS performs best and outperforms the second-place NOVA by 28%. Although HMFS and NOVA are both log-structured file systems, they update data and metadata differently: HMFS updates metadata in an XIP manner while NOVA updates metadata in a log-structured manner, and HMFS updates file data in a log-structured manner while NOVA updates it with COW. As a result, HMFS performs finer-grained data updates than NOVA and achieves higher sequential write performance.


Figure 14: IOZONE Mmap Performance on DRAM/STT-RAM Hybrid Architecture. (Panels: (a) Random Read, (b) Random Write, (c) Sequential Read, (d) Sequential Write; throughput in KB/s versus record size in KB for hmfs, pmfs, nova, f2fs and ext4.)

Additionally, all of the results reveal a significant performance drop when the block size exceeds 10 MB. Under the mmap interface, the results share a similar trend with the read/write interface. The mmap mechanism in HMFS is called atomic-mmap. When an application maps a file into its address space, atomic-mmap copies the file data pages to replica pages in a log-structured way and then maps the replica pages into the address space. This mechanism is similar to the mmap mechanism in NOVA and quite different from that of PMFS, which maps the original file pages into the address space directly, so it incurs higher overhead than direct mapping. Nevertheless, owing to the efficient file and directory structure design in HMFS, HMFS achieves mmap performance comparable to PMFS.


Figure 15: Filebench Throughput on PCM and STT-RAM. (Panels: (a) Fileserver, (b) Webserver, (c) Webproxy, (d) Varmail; ops per second (×1000) for hmfs, pmfs, nova, f2fs and ext4.)

NOVA ranks third in mmap performance, but it provides the same data consistency as HMFS and a higher mmap consistency level than PMFS.
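To make the load/store access path concrete, the sketch below shows how an application might exercise a file-backed mapping of the kind measured in the mmap experiments. It is a generic POSIX example, not HMFS-specific code; the mount path and region size are illustrative assumptions only.

```c
/* Minimal sketch of load/store file access through mmap (plain POSIX).
 * Under HMFS's atomic-mmap, the kernel would back this mapping with
 * log-structured replica pages; from the application's point of view
 * the interface is the ordinary mmap/msync/munmap sequence below. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t len = 16 * 1024;              /* 16 KB region, illustrative */
    int fd = open("/mnt/hmfs/testfile", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, len) != 0) { perror("ftruncate"); return 1; }

    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memset(buf, 0xAB, len);                    /* stores go straight to the mapping */
    if (msync(buf, len, MS_SYNC) != 0)         /* force the update to be durable    */
        perror("msync");

    munmap(buf, len);
    close(fd);
    return 0;
}
```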

B. Filebench

Table 5: Filebench Workload Characteristics.

Workload     File Size   File Number   Dataset Size   Threads   IO Size   Mean Dir Width
fileserver   128K        10000         1.22GB         50        16K       20
webserver    16K         1000          15.625MB       100       1M        20
webproxy     16K         10000         15.625MB       100       16K       1000000
varmail      16K         1000          15.625MB       16        16K       1000000

We use four pre-defined Filebench [24] workloads (fileserver, webserver, webproxy and varmail) to simulate application-level scenarios. The detailed workload configurations are listed in Table 5. We run each workload 5 times and report the average.

Figure 15 shows the throughput of the file systems on the simulated PCM and STT-RAM media. Generally speaking, overall performance on the DRAM/STT-RAM hybrid architecture is about 20% higher than on the DRAM/PCM hybrid architecture; the lower latency and higher bandwidth of STT-RAM relative to PCM are the key factors. As shown in the figure, HMFS outperforms the two other NVM-aware file systems, PMFS and NOVA, by 1.1× to 1.3× and achieves the best overall performance among all target file systems. The fileserver workload simulates a file server with create, delete, append, read and write operations. Almost all writes in fileserver are lazy-persistent; HMFS can buffer them in DRAM between two checkpoints, so it attains the highest instantaneous performance. The webserver workload is read-intensive, and file systems with XIP and mmap support, such as HMFS, PMFS and NOVA, show strong performance, while the non-XIP file systems suffer from the double copy of reads between the DRAM buffer and NVM storage. The webproxy workload is also read-intensive but uses a very wide directory. PMFS does not perform well here, indicating poor scalability, because it looks up a directory entry by linearly scanning the directory. In contrast, HMFS benefits from its efficient file metadata design and handles workloads that place a large number of files in one directory well. The varmail workload emulates an email server with many small files and involves both reads and writes. HMFS performs significantly better than PMFS and NOVA, which we attribute to its efficient file and directory structure design and the inline data extended attribute for small files.

C. OLTP Database: TPC-C on MySQL

This experiment evaluates the transaction processing ability of HMFS. We run the TPC-C [25] benchmark against a MySQL database server deployed on each of the underlying file systems. Since the TPC-C benchmark and MySQL server configurations are identical for all underlying file systems, we use the transactions-per-minute (tpmC) metric to measure the transaction processing ability of each file system.


The test data set contains 1000 warehouses, with a total size of 75 GB. The warm-up time is set to 300 seconds and the execution time to 3600 seconds. Figure 16 shows the tpmC of each file system. HMFS achieves the best performance under the TPC-C workload, outperforming the worst-performing file system by 24.5% and 22.1% on PCM and STT-RAM, respectively. TPC-C simulates an OLTP database workload that syncs data frequently to ensure strict durability and consistency. HMFS has a high-performance checkpointing mechanism and can execute checkpoint and sync operations quickly; these features fit the OLTP workload well and yield a high tpmC. EXT4 performs worst in this scenario because it relies on journaling to guarantee data consistency, which introduces significant overhead under a structured data processing workload. F2FS performs second to HMFS; it also has an efficient data update method and uses a checkpointing mechanism to guarantee consistency.

Figure 16: TPC-C Throughput on PCM and STT-RAM. (Throughput in tpmC for hmfs, pmfs, nova, f2fs and ext4.)

D. Key-Value Store: YCSB on MongoDB

MongoDB [26] is a NoSQL database that stores its data in memory-mapped files and accesses it with memory loads and stores, while it accesses its journal files through the traditional read/write interface. Thus, in MongoDB, read and scan operations result in memory loads, while update and insert operations result in memory stores to the database files and writes to the journal.


Table 6: YCSB Workload Properties.

Workload   Read   Update   Scan   Insert   Read&Update
A          50     50       -      -        -
B          95     5        -      -        -
C          100    -        -      -        -
D          95     -        -      5        -
E          -      -        95     5        -
F          50     -        -      -        50

MongoDB calls fsync on its journal files every 120 ms and msync on its database files every 60 seconds, so it is a suitable key-value store application for exercising the underlying file system through both the read/write and load/store interfaces. YCSB [27] is an open-source benchmark for evaluating key-value store applications. It includes six workloads that imitate different data access patterns; Table 6 summarizes their properties. Figure 17 reports the average access latency of each workload. Note that, unlike the previous performance figures, lower is better in this figure. For workloads A, B and C, the XIP-based file systems achieve the best performance. In some cases, HMFS performs slightly worse than PMFS due to its log-structured data update method. MongoDB accesses its data files through the memory-mapped interface, so a file system with a well-performing mmap interface holds an advantage; this is consistent with the IOZONE results. Workload D exercises a read:insert ratio of 95:5 and follows the latest distribution. EXT4 performs worst under this workload because journaling the file data introduces extra overhead for insert operations. Workload E exercises range scans and inserts at a 95:5 ratio; as in the other workloads, the NVM-aware file systems hold a significant performance advantage over the block-based file systems. Workload F executes a read-modify-write (RMW) access pattern, where a key is read, modified and written back to the database; the ratio of READ to RMW is 50:50, and RMW is very similar to UPDATE.

Workload F shows a result similar to workload A but with higher absolute latency, because the read-modify-write access pattern requires more processing cycles than a plain read-update pattern. As for the difference between PCM and STT-RAM, the results under the two architectures share the same trend, and the average latency on the DRAM/STT-RAM hybrid architecture is lower than on the DRAM/PCM hybrid architecture. This is a straightforward consequence of the access parameters of the two NVM media.

5.4. Versioning Efficiency

We describe the versioning efficiency of HMFS from two aspects: functionality and performance.

5.4.1. Functional Validity

For functional testing, we design a micro-benchmark to verify the correctness of the multi-version mechanism proposed in this paper. The benchmark thread creates 1000 text files in the first version, each with an initial size of 16 KB. From the second version on, each version randomly creates 20% new files, updates 20% of the files and deletes 10% of the files of its parent version; when updating a file, we append four 4 KB blocks to it. After these operations, the thread triggers a sync system call to produce a checkpoint and thus create a new version. We create 50 versions in total, with a total data volume of about 6.6 GB. In each version, we record the nids of the operated files, compute the connection edge list according to the algorithms proposed earlier, and print the list for comparison. After all operations complete, we call the read-version algorithm recursively to obtain the connection edge list of each version. Finally, we compare the output of the micro-benchmark with the output of the file system; the two outputs are exactly the same.
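The following sketch outlines the shape of such a version-creating micro-benchmark using plain POSIX calls. It is an illustration of the methodology rather than the exact benchmark used in our experiments: the mount path, file count and update ratio are assumptions, and the per-version create and delete steps are omitted for brevity.

```c
/* Sketch of a version-creating micro-benchmark (illustrative, not the exact
 * benchmark used in the paper). Each call to sync() is assumed to trigger a
 * checkpoint in the file system under test and hence create a new version. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NFILES   1000        /* files created in the first version  */
#define NVERSIONS  50        /* total number of versions to produce */

static void append_blocks(const char *path, int nblocks)
{
    char block[4096] = {0};                    /* one 4 KB append unit */
    int fd = open(path, O_WRONLY | O_APPEND);
    if (fd < 0) return;
    for (int i = 0; i < nblocks; i++)
        write(fd, block, sizeof(block));
    close(fd);
}

int main(void)
{
    char path[256];
    char data[16 * 1024] = {0};

    /* Version 1: create the initial file set (16 KB each). */
    for (int f = 0; f < NFILES; f++) {
        snprintf(path, sizeof(path), "/mnt/hmfs/bench/f%d", f);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd >= 0) { write(fd, data, sizeof(data)); close(fd); }
    }
    sync();                                    /* checkpoint -> version 1 */

    /* Later versions: update roughly 20% of the files, then checkpoint. */
    for (int v = 2; v <= NVERSIONS; v++) {
        for (int f = 0; f < NFILES / 5; f++) {
            snprintf(path, sizeof(path), "/mnt/hmfs/bench/f%d", rand() % NFILES);
            append_blocks(path, 4);            /* four 4 KB appends per update */
        }
        sync();                                /* checkpoint -> version v      */
    }
    return 0;
}
```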

Figure 17: YCSB Workload Latency on PCM and STT-RAM. (Panels (a)-(f): Workloads A-F; average latency (μs for workloads A-D, ms for E and F) for hmfs, pmfs, nova, f2fs and ext4; lower is better.)

file systems haven’t this function. In this experiment, we also design a microbenchmark to create 500 versions with fix-sized small files automatically. This micro-benchmark pre-allocates a number of fixed-size files in each version and do the updating, deleting operations in each version at a fixed ratio. After each


Figure 18: Versioning Efficiency on PCM and STT-RAM. (Panels: (a) Create a Version, (b) Get a Version, (c) Delete a Version; latency in ms for hmfs, nilfs2 and btrfs.)

version’s operation is finished, we call the sync to create a new version, finally we choose some versions to get all of the files in the selected version then delete the version completely. We print out the time stamp of the system time until the µ s accuracy and finally get the average latency of version related operations as shown in figure 18. As shown in the figure, we can find that HMFS prevails the target file systems in version related operation significantly. Creating, deleting and reading performance are improved by 10.8×, 10.9×, 5.8× respectively. Besides, the total meta data size is also reduced by about 15% compare to the other two file systems. This performance improvement owes to the well-designed data structure in HMFS and NMV-optimized implementation in HMFS. We condense the files belong to a version in a connection edge list and reading the list directly while getting this version, thus there is no addi-


HMFS condenses the files belonging to a version into a connection edge list and reads this list directly when retrieving the version, so there is no additional traversal as in a COW B-tree [16] based multi-version approach. We also use an independent thread to analyze the sequence of operations, which improves processing efficiency and avoids a significant performance impact on checkpoint production.
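As an illustration of the idea of recording version membership as edges of a bipartite graph, the sketch below stores (version, file) edges in a flat array and retrieves a version by scanning its edges. The structure and field names are hypothetical and greatly simplified; they are not HMFS's actual on-NVM layout.

```c
/* Hypothetical, simplified illustration of a version-to-file edge list
 * (a bipartite graph of versions and files). Not HMFS's real structures. */
#include <stdint.h>
#include <stdio.h>

struct ver_edge {
    uint32_t version;   /* version id                        */
    uint32_t nid;       /* node id of a file in that version */
};

/* Print the nids belonging to `version` by scanning the edge list once;
 * no tree traversal is required to enumerate a version's files. */
static size_t files_in_version(const struct ver_edge *edges, size_t nedges,
                               uint32_t version)
{
    size_t count = 0;
    for (size_t i = 0; i < nedges; i++) {
        if (edges[i].version == version) {
            printf("version %u contains nid %u\n", version, edges[i].nid);
            count++;
        }
    }
    return count;
}

int main(void)
{
    /* Two versions sharing file 10; version 2 additionally holds file 11. */
    struct ver_edge edges[] = { {1, 10}, {2, 10}, {2, 11} };
    files_in_version(edges, sizeof(edges) / sizeof(edges[0]), 2);
    return 0;
}
```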

Figure 19: Normalized Memory Traffic and Endurance on PCM and STT-RAM. (Panels: (a) Normalized Memory Traffic, (b) Normalized Endurance; for hmfs, pmfs, nova, f2fs and ext4.)

5.5. Endurance

We adopt the log-structured design principle with NVM wear-leveling in mind. In this section, we design an experiment to examine whether the log-structured approach achieves more uniform access than the XIP approach. The experiment proceeds as follows. First, we choose three log-structured file systems, HMFS, NOVA and F2FS, and two XIP-based file systems, PMFS and EXT4, as the comparison targets. Second, we run the same benchmark, producing the same file I/O workload, on all comparison file systems; for fairness, we additionally run one round of garbage collection on the three log-structured file systems. Third, we use the HMTT [28] memory trace toolkit to capture the total memory traffic and the memory access traces over a fixed-size mapped memory area and write them into trace files. Each trace record includes the access id, access time and access type (read or write) of a physical address in the mapped area.

The trace thus provides both the total access count to the fixed-size memory area, i.e., the memory traffic, and the access count of each individual memory address in the area. Fourth, we collect the memory traffic and memory access traces of each file system. Lastly, we analyze the trace files, normalize the memory traffic, and compute the following metric, which combines the memory traffic with the access deviation:

$$ M \sqrt{\frac{1}{S}\sum_{i=1}^{S}\left(X_i - \bar{X}\right)^2} $$

In the formula, $M$ is the normalized memory traffic performed on the mapped NVM area, $S$ is the size of the mapped memory area, and $X_i$ is the access count of the $i$-th memory address, where $i$ ranges from 1 to $S$. $\bar{X} = M/S$ is the average access count over all memory addresses.
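A small sketch of how this metric could be computed from a decoded trace is shown below. It assumes the trace has already been reduced to a per-address access-count array; the counts and the normalization baseline are illustrative values, not the numbers from our experiment.

```c
/* Sketch: compute the normalized memory traffic M and the combined metric
 * M * sqrt( (1/S) * sum_i (X_i - Xbar)^2 ), following the formula above.
 * Per-address counts are normalized by the same baseline as M, so that
 * Xbar = M/S as in the text. The counts and baseline are made-up values. */
#include <math.h>
#include <stddef.h>
#include <stdio.h>

static double endurance_metric(const unsigned long *counts, size_t S,
                               double baseline_traffic, double *M_out)
{
    double total = 0.0;
    for (size_t i = 0; i < S; i++)
        total += (double)counts[i];          /* total accesses to the area */

    double M    = total / baseline_traffic;  /* normalized memory traffic  */
    double mean = M / (double)S;             /* Xbar = M / S               */

    double sq = 0.0;
    for (size_t i = 0; i < S; i++) {
        double xi = (double)counts[i] / baseline_traffic;
        sq += (xi - mean) * (xi - mean);
    }
    *M_out = M;
    return M * sqrt(sq / (double)S);
}

int main(void)
{
    /* Toy per-address access counts for an 8-address "area". */
    unsigned long counts[8] = { 120, 95, 130, 110, 90, 105, 115, 100 };
    double M;
    double metric = endurance_metric(counts, 8, 1000.0, &M);
    printf("normalized traffic M = %.3f, access deviation metric = %.4f\n",
           M, metric);
    return 0;
}
```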

We report the normalized memory traffic and the normalized endurance in Figure 19 and analyze them in turn. First, we normalize the memory traffic of each file system. As shown in the figure, all of the file systems exhibit similar memory traffic, with some differences among them. Compared with HMFS, PMFS and F2FS generate lighter memory traffic. PMFS is a metadata-level-consistency file system: it journals only the metadata and not the file data, so its write amplification is comparatively small. F2FS is a non-versioning log-structured file system with only two checkpoints to guarantee data consistency; compared with HMFS, it has a simpler metadata structure and therefore produces lighter memory traffic under the same workload. For EXT4, we set the journal mode to full journaling, which journals both metadata and file data to achieve data consistency; this introduces extra writes and yields the highest memory traffic. NOVA logs file metadata in a singly linked list and adopts COW updates for file data in order to reduce garbage collection cost. Furthermore, NOVA divides the NVM space into per-CPU pools.


This per-CPU design enables fast allocation and deallocation of NVM and good scalability, but it leads to an aggressive allocation pattern in the log space; as a result, NOVA shows higher memory traffic than HMFS in our experiment. Second, we analyze the endurance of each file system. Because the memory traffic differences are small, endurance is mainly determined by the access variance. From the figure we can see that the log-structured file systems achieve dramatically lower access variance than the XIP-based file systems. The memory access variance of HMFS is just 25% of that of PMFS in our experiment, and HMFS's access count deviation is nearly half that of NOVA. The reason is that although NOVA is a log-structured file system, it applies the log-structured approach only to metadata updates; for data it uses COW and does not consider the wear-leveling problem of NVM. We infer that this is why HMFS achieves better access variance than NOVA. F2FS is a log-structured file system with two checkpoints to guarantee data consistency; it classifies data and node blocks into three temperature levels (hot, warm and cold) and uses multi-head logging to log into the different temperature zones. This leads to a relatively fixed access pattern over the data segments, so compared with the single-log-area approach in HMFS, F2FS shows higher access variance. EXT4 is an update-in-place file system; in our experiment it runs in full-data journaling mode, which double-writes both metadata and data, so it incurs larger write amplification than the other file systems and takes no account of NVM wear-leveling. As a result, it shows the worst endurance. This experiment mainly measures the access frequency over the monitored memory area, so there is no obvious difference between the results on PCM and STT-RAM. It also suggests that the log-structured update mechanism may be a competitive choice for wear-leveling and for the efficient utilization [22] of future NVM media.


6. Related Work

This section discusses work related to this paper in four categories: NVM, log-structured file systems, hybrid architectures and multi-version data structures.

6.1. NVM

Non-volatile memory has attracted much research attention in both academia and industry. These works operate at different levels and assign NVM different roles in the system. At the hardware primitive level, NVMalloc [29] designs a wear-aware NVM allocator intended to prevent NVM wear-out, and it argues that a software-level wear-leveling approach is a necessity. In HMFS, we likewise adopt a straightforward method to improve the wear-leveling of the NVM medium. At the architecture level, NVM Duet [30] proposes a novel unified working memory and persistent store architecture, which provides the required consistency and durability guarantees for the persistent store; this offers a good reference for hybrid memory management in a DRAM/NVM hybrid memory architecture. At the programming model level, several studies [31, 32, 33] have proposed interesting interfaces and abstractions optimized for NVM programming. At the system software level, several file systems [6, 7, 8, 11, 9, 13, 34, 35, 10, 36, 37] have been proposed to address persistent data storage on NVM. Among them, PMFS, NOVA and EXT4-DAX are the open-source ones commonly used as comparison systems in emerging work. Sehgal et al. [38] give a survey and a comprehensive experimental comparison of file systems with different designs, along with insights and recommendations for the design of NVM-aware file systems.


6.2. Log-structured File System

The log-structured file system was first proposed by Rosenblum in 1992 [39], and many log-structured file systems have emerged since then. Their ability to convert many fine-grained random writes into larger sequential writes suits the characteristics of HDD and SSD devices. Despite this merit, log-structured file systems also carry extra burdens, such as the 'wandering tree' problem and heavy garbage collection cost, and many approaches have been proposed to mitigate them. F2FS [40] is a log-structured file system optimized for flash storage. It solves the 'wandering tree' problem of the traditional LFS with a NAT table, at the cost of in-place metadata updates, which are not ideal for flash media. HMFS borrows the basic idea of the NAT in F2FS and designs a VAT to support multi-versioning. For flexible allocation and space recycling in an LFS, Oh et al. [41] propose a hole-plugging approach that moves the valid blocks of victim segments into the unused holes of other segments. Matthews et al. [42] propose an adaptive cleaning policy and a hole-plugging policy based on the widely used cost-benefit algorithm. SFS [43] proposes a cost-hotness algorithm for victim segment selection. For situations of high segment utilization, Oh et al. [44] propose a slack space recycling method that replaces sequential updating with random updating. These approaches could also be introduced into HMFS, and we may explore them in future work or a later version. RAMCloud [22] proposes a log-structured memory storage system; its log-structured memory method is inspired by the traditional log-structured file system and adapted to DRAM-based storage. That work reports that the log-structured approach is highly beneficial for memory utilization efficiency. RAMCloud, however, only explores data storage on DRAM and HDD with a two-level cleaning approach. Our work is also inspired by this idea: we explore the log-structured approach on NVM, aiming not only at high memory utilization but also at improving wear-leveling and extending the lifespan of the NVM medium.

6.3. Hybrid Architecture

A hybrid file system is, fundamentally, a file system that uses different storage media but manages them in a unified manner. NVMFS [36] is an experimental hybrid file system that uses both NVM and SSD: the fast, byte-addressable NVM stores the metadata and hot data, while the block-addressable and relatively slower SSD stores the file data and serves as a backup of the NVM. NVMFS accesses NVM and SSD with different semantics, using memory semantics for NVM and block I/O semantics for the SSD. Unlike NVMFS, HMFS uses DRAM and NVM as its storage media and constructs a flat, co-existing architecture with a unified memory address space.

6.4. Multi-version Data Structure

Many data structures have been used to implement multi-versioning in file systems [45]. HMVFS [46] is a partial-versioning file system designed for NVM. It proposes an SFST data structure to implement a multi-snapshot mechanism, uses checkpointing to guarantee file system consistency, and treats each stable checkpoint as a snapshot, with each snapshot representing a read-only version. Users can create or modify files only in the latest snapshot, while the backup snapshots are all read-only; HMVFS can roll back to a historical version by reading the corresponding snapshot, and each version can have only one parent snapshot. Compared with HMVFS, our work achieves a full-versioning mechanism: each version in HMFS is editable, and a new version can be derived from multiple parent versions. BTRFS [16] implements the whole file system as a single COW B-tree, but the COW B-tree has limited random update performance. GCTrees [47] proposes a space management method based on block lineage across snapshots, which realizes a better snapshot mechanism than the explicit reference counting used in BTRFS.


CVFS [18] realizes multi-versioning of metadata with a combination of journaling and a multi-version B-tree: journaling is used for inodes and indirect blocks, and the multi-version B-tree is used for directories; however, it is only a partial-versioning file system. The stratified B-tree [45] implements a fully versioned dictionary with sequential updates and queries and offers better time and space efficiency. NILFS2 [21] uses a continuous snapshot mechanism for multi-version control: it supports continuous snapshotting and periodically reclaims previously produced snapshots, but its versioning metadata structure is very complex and its performance is limited in many applications. Compared with these systems, our work focuses on implementing a full-versioning file system with a bipartite graph mechanism and boosts versioning efficiency significantly.

7. Conclusion

We propose HMFS, a hybrid in-memory file system with version consistency. We realize a flat hybrid memory architecture comprising DRAM and NVM with a unified address space, and we propose a bipartite-graph-based solution that guarantees version consistency while significantly improving time and space efficiency. To the best of our knowledge, HMFS is the first NVM-aware file system to adopt a bipartite graph model to implement version consistency. We believe it offers an interesting and novel view of file system modeling and will inspire further research.

Acknowledgement We thank the JPDC reviewers for their hard work, attentiveness, and genuinely helpful suggestions. The work described in this paper is supported by the National High-tech R&D Program of China (863 Program) under Grant No. 2015AA015303 and the National Natural Science Foundation of China under Grant No. 61472241.


References

[1] H. S. P. Wong, S. Raoux, S. B. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi, K. E. Goodson, Phase change memory, Proceedings of the IEEE 98 (12) (2010) 2201–2227.
[2] K. L. Wang, J. G. Alzate, P. K. Amiri, Low-power non-volatile spintronic memory: STT-RAM and beyond, Journal of Physics D: Applied Physics 46 (7) (2013) 1071–1075.
[3] M. Zangeneh, A. Joshi, Design and optimization of nonvolatile multibit 1T1R resistive RAM, IEEE Transactions on Very Large Scale Integration Systems 22 (8) (2014) 1815–1828.
[4] Intel and Micron 3D XPoint technology, https://en.wikipedia.org/wiki/3D_XPoint.
[5] Memory is the new storage: How next generation NVM DIMMs will enable new solutions that use memory as the high-performance storage tier, http://www.snia.org/sites/default/files/NVM/2016/presentations/KenGibson_Memory_New_Storage-Next_Gen_REVISION.pdf.
[6] S. R. Dulloor, S. Kumar, A. Keshavamurthy, P. Lantz, D. Reddy, R. Sankaran, J. Jackson, System software for persistent memory, in: Proceedings of the Ninth European Conference on Computer Systems, ACM, 2014.
[7] J. Xu, S. Swanson, NOVA: a log-structured file system for hybrid volatile/non-volatile main memories, in: Proceedings of the 14th USENIX Conference on File and Storage Technologies, USENIX Association, 2016, pp. 323–338.
[8] J. Condit, E. B. Nightingale, C. Frost, E. Ipek, B. Lee, D. Burger, D. Coetzee, Better I/O through byte-addressable, persistent memory, in: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, ACM, 2009, pp. 133–146.
[9] X. Wu, A. Reddy, SCMFS: a file system for storage class memory, in: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, 2011.
[10] J. Ou, J. Shu, Y. Lu, A high performance file system for non-volatile main memory, in: Proceedings of the Eleventh European Conference on Computer Systems, 2016.
[11] S. Mittal, J. S. Vetter, A survey of software techniques for using non-volatile memories for storage and main memory systems, IEEE Transactions on Parallel & Distributed Systems 27 (5) (2016) 1537–1550.
[12] Y. Zhang, S. Swanson, A study of application performance with non-volatile main memory, in: Symposium on Mass Storage Systems & Technologies, 2015, pp. 1–10.
[13] H. Li, A. Ghodsi, M. Zaharia, S. Shenker, I. Stoica, Tachyon: reliable, memory speed storage for cluster computing frameworks, in: Proceedings of the 5th ACM Symposium on Cloud Computing (SOCC), 2014, pp. 1–15.
[14] M. Wei, C. Rossbach, I. Abraham, U. Wieder, S. Swanson, D. Malkhi, A. Tai, Silver: A scalable, distributed, multi-versioning, always growing (AG) file system, in: 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 16), USENIX Association, 2016.
[15] R. Konishi, Y. Amagai, K. Sato, H. Hifumi, S. Kihara, S. Moriai, The Linux implementation of a log-structured file system, ACM SIGOPS Operating Systems Review (2006) 102–107.
[16] O. Rodeh, J. Bacik, C. Mason, BTRFS: The Linux B-tree filesystem, ACM Transactions on Storage (TOS).
[17] Y. Zhang, A. Rajimwale, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, End-to-end data integrity for file systems: A ZFS case study, in: FAST, 2010.


[18] C. A. Soules, G. R. Goodson, J. D. Strunk, G. R. Ganger, Metadata efficiency in versioning file systems, in: FAST, 2003, pp. 43–58.
[19] A. C. Arpaci-Dusseau, V. Chidambaram, T. Sharma, R. H. Arpaci-Dusseau, Consistency without ordering, in: Proceedings of the 10th USENIX Symposium on File and Storage Technologies (FAST 12).
[20] O. Rodeh, B-trees, shadowing, and clones, ACM Transactions on Storage (TOS).
[21] Y. Amagai, H. Hifumi, R. Konishi, K. Sato, S. Kihara, S. Moriai, The design and implementation of 'NILFS', a log-structured file system for Linux, IPSJ SIG Notes, 2005.
[22] S. M. Rumble, A. Kejriwal, J. K. Ousterhout, Log-structured memory for DRAM-based storage, in: FAST, 2014.
[23] PMEM: Persistent memory block device support, https://github.com/01org/prd.
[24] Filebench file system benchmark suite, http://sourceforge.net/projects/filebench/.
[25] Transaction Processing Performance Council, TPC-C, http://www.tpc.org/tpcc/.
[26] MongoDB, https://www.mongodb.org/.
[27] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, R. Sears, Benchmarking cloud serving systems with YCSB, in: Proceedings of the 1st ACM Symposium on Cloud Computing (SOCC), ACM, 2010, pp. 143–154.
[28] Y. Bao, M. Chen, Y. Ruan, L. Liu, J. Fan, Q. Yuan, B. Song, J. Xu, HMTT: a platform independent full-system memory trace monitoring system, ACM SIGMETRICS Performance Evaluation Review 36 (1) (2008) 229–240.


[29] I. Moraru, D. G. Andersen, M. Kaminsky, N. Tolia, P. Ranganathan, N. Binkert, Consistent, durable, and safe memory management for byte-addressable non-volatile main memory, in: Proceedings of the First ACM SIGOPS Conference on Timely Results in Operating Systems, ACM, 2013.
[30] R.-S. Liu, D.-Y. Shen, C.-L. Yang, S.-C. Yu, C.-Y. M. Wang, NVM Duet: Unified working memory and persistent store architecture, in: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, 2014, pp. 455–470.
[31] J. Coburn, A. M. Caulfield, A. Akel, L. M. Grupp, R. K. Gupta, R. Jhala, S. Swanson, NV-Heaps: making persistent objects fast and safe with next-generation, non-volatile memories, in: ACM SIGARCH Computer Architecture News, Vol. 39, ACM, 2011, pp. 105–118.
[32] T. Hwang, J. Jung, Y. Won, HEAPO: Heap-based persistent object store, ACM Transactions on Storage, 2015.
[33] J.-Y. Jung, S. Cho, Memorage: Emerging persistent RAM based malleable main memory and storage architecture, in: Proceedings of the 27th International ACM Conference on Supercomputing, ACM, 2013, pp. 115–126.
[34] N. K. Edel, D. Tuteja, E. L. Miller, S. A. Brandt, MRAMFS: a compressing file system for non-volatile RAM, in: Proceedings of the IEEE Computer Society's 12th Annual International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunications Systems (MASCOTS 2004), IEEE, 2004, pp. 596–603.
[35] Support ext4 on NV-DIMMs, http://lwn.net/Articles/588218/.
[36] S. Qiu, A. N. Reddy, NVMFS: A hybrid file system for improving random write in NAND-flash SSD, in: 2013 IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST), IEEE, 2013, pp. 1–5.


[37] H. Volos, S. Nalli, S. Panneerselvam, V. Varadarajan, P. Saxena, M. M. Swift, Aerie: flexible file-system interfaces to storage-class memory, in: Proceedings of the Ninth European Conference on Computer Systems, ACM, 2014.
[38] P. Sehgal, S. Basu, K. Srinivasan, K. Voruganti, An empirical study of file systems on NVM, in: 2015 31st Symposium on Mass Storage Systems and Technologies (MSST), IEEE, 2015, pp. 1–14.
[39] M. Rosenblum, J. K. Ousterhout, The design and implementation of a log-structured file system, ACM Transactions on Computer Systems (TOCS) (1992) 26–52.
[40] C. Lee, D. Sim, J. Hwang, S. Cho, F2FS: A new file system for flash storage, in: Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2015.
[41] Y. Oh, E. Kim, J. Choi, D. Lee, S. H. Noh, Optimizations of LFS with slack space recycling and lazy indirect block update, in: Proceedings of the 3rd Annual Haifa Experimental Systems Conference, ACM, 2010.
[42] J. N. Matthews, D. Roselli, A. M. Costello, R. Y. Wang, T. E. Anderson, Improving the performance of log-structured file systems with adaptive methods, ACM SIGOPS Operating Systems Review (1997) 238–251.
[43] C. Min, K. Kim, H. Cho, S.-W. Lee, Y. I. Eom, SFS: random write considered harmful in solid state drives, in: FAST, 2012.
[44] Y. Oh, J. Choi, D. Lee, S. H. Noh, Slack space recycling: Delaying on-demand cleaning in LFS for performance and endurance, IEEE Transactions on Information and Systems (2013) 2075–2086.
[45] A. Twigg, A. Byde, G. Milos, T. Moreton, J. Wilkes, T. Wilkie, Stratified B-trees and versioned dictionaries, in: Proceedings of the 3rd USENIX Conference on Hot Topics in Storage and File Systems (HotStorage), Vol. 11, 2011, pp. 10–10.

[46] S. Zheng, L. Huang, H. Liu, L. Wu, J. Zha, HMVFS: A hybrid memory versioning file system, in: Mass Storage Systems and Technologies, 2016.
[47] C. Dragga, D. J. Santry, GCTrees: Garbage collecting snapshots, in: 2015 31st Symposium on Mass Storage Systems and Technologies (MSST), IEEE, 2015, pp. 1–12.


*Author Biography & Photograph

Author Biography and Photo: Hao Liu

Hao Liu is currently a Ph.D. student at Shanghai Jiao Tong University, China. He received his of Science and Technology of China in 2009. His research interests include in-memory comp focus on file system and key value store system. Linpeng Huang

Linpeng Huang received his MS and PhD degrees in computer science from Shanghai Jiao Tong respectively. He is a professor of computer science in the department of computer science a Tong University. His research interests lie in the area of distributed systems and servi Yanmin Zhu

Yanmin Zhu is a professor in the Department of Computer Science and Engineering at Shanghai J China. He obtained his PhD in 2007 from Hong Kong University of Science and Technology, Hon include crowd sensing, and big data analytics and systems. He is a member of the IEEE an He has published more than 100 technical papers in major journals and conferences. Shengan Zheng

Shengan Zheng is currently a Ph.D. student at Shanghai Jiao Tong University, China. He rece Shanghai Jiao Tong University, China, in 2014. His research interests include in-memory Yanyan Shen

Yanyan Shen is currently an assistant professor at Shanghai Jiao Tong University, China. Sh Peking University, China, in 2010 and her Ph.D. degree in computer science from National Un Her research interests include distributed systems, efficient data processing techniques